bakery-ia/HEALTH_CHECKS.md
2025-09-29 13:13:12 +02:00

Unified Health Check System

This document describes the unified health check system implemented across all microservices in the bakery-ia platform.

Overview

The unified health check system provides standardized health monitoring endpoints across all services, with comprehensive database verification, Kubernetes integration, and detailed health reporting.

Key Features

  • Standardized Endpoints: All services now provide the same health check endpoints
  • Database Verification: Comprehensive database health checks including table existence verification
  • Kubernetes Integration: Proper separation of liveness and readiness probes
  • Detailed Reporting: Rich health status information for debugging and monitoring
  • App State Integration: Health checks automatically detect service ready state

Health Check Endpoints

/health - Basic Health Check

  • Purpose: Basic service health status
  • Use Case: General health monitoring, API gateways
  • Response: Service name, version, status, and timestamp
  • Status Codes: 200 (returned both when healthy and while still starting)

/health/ready - Kubernetes Readiness Probe

  • Purpose: Indicates if service is ready to receive traffic
  • Use Case: Kubernetes readiness probe, load balancer health checks
  • Checks: Application state, database connectivity, table verification, custom checks
  • Status Codes: 200 (ready), 503 (not ready)

/health/live - Kubernetes Liveness Probe

  • Purpose: Indicates if service is alive and should not be restarted
  • Use Case: Kubernetes liveness probe
  • Response: Simple alive status
  • Status Codes: 200 (alive)

/health/database - Detailed Database Health

  • Purpose: Comprehensive database health information for debugging
  • Use Case: Database monitoring, troubleshooting
  • Checks: Connectivity, table existence, connection pool status, response times
  • Status Codes: 200 (healthy), 503 (unhealthy)
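
The status codes listed for the four endpoints above can be collected into a small lookup for clients such as dashboards or smoke tests. This is a minimal sketch that encodes only the codes documented in this section; the name `interpret_status` and the returned labels are illustrative and not part of the shared library.

```python
# Illustrative helper; NOT part of shared.monitoring.health_checks.
# Maps (endpoint, HTTP status) to the meaning documented in this section.
DOCUMENTED_CODES = {
    "/health": {200: "healthy or starting"},
    "/health/ready": {200: "ready", 503: "not ready"},
    "/health/live": {200: "alive"},
    "/health/database": {200: "healthy", 503: "unhealthy"},
}


def interpret_status(endpoint: str, status_code: int) -> str:
    """Return the documented meaning, or 'undocumented' for anything else."""
    return DOCUMENTED_CODES.get(endpoint, {}).get(status_code, "undocumented")
```

Any combination that is not documented here (for example a 500 from /health/live) falls through to "undocumented" rather than being guessed at.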

Implementation

Services Updated

The following services have been updated to use the unified health check system:

  1. Training Service (training-service)

    • Full implementation with database manager integration
    • Table verification for ML training tables
    • Expected tables: model_training_logs, trained_models, model_performance_metrics, training_job_queue, model_artifacts
  2. Orders Service (orders-service)

    • Legacy database integration with custom health checks
    • Expected tables: customers, customer_contacts, customer_orders, order_items, order_status_history, procurement_plans, procurement_requirements
  3. Inventory Service (inventory-service)

    • Full database manager integration
    • Food safety and inventory table verification
    • Expected tables: ingredients, stock, stock_movements, product_transformations, stock_alerts, food_safety_compliance, temperature_logs, food_safety_alerts
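
The table verification described for these services amounts to comparing an expected list against the tables the database reports. Below is a minimal sketch of that comparison, assuming the result fields mirror `tables_exist`, `tables_verified`, and `missing_tables` from the response examples in this document. `verify_tables` is a hypothetical helper, not the shared library's API.

```python
def verify_tables(expected, existing):
    """Split the expected table names into verified and missing ones."""
    existing_set = set(existing)
    verified = [name for name in expected if name in existing_set]
    missing = [name for name in expected if name not in existing_set]
    return {
        "tables_exist": not missing,  # True only when nothing is missing
        "tables_verified": verified,
        "missing_tables": missing,
    }


# Example with two of the training-service tables; the "existing" list is made up.
result = verify_tables(
    expected=["model_training_logs", "trained_models"],
    existing=["model_training_logs", "alembic_version"],
)
```

Tables that exist but are not expected (here `alembic_version`) are ignored, so extra tables do not affect the result.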

Code Integration

Basic Setup

from shared.monitoring.health_checks import setup_fastapi_health_checks

# Set up unified health checks
health_manager = setup_fastapi_health_checks(
    app=app,
    service_name="my-service",
    version="1.0.0",
    database_manager=database_manager,
    expected_tables=['table1', 'table2'],
    custom_checks={"custom_check": custom_check_function}  # optional: check name -> async function
)

With Custom Checks

async def custom_health_check():
    """Custom health check function"""
    return await some_service_check()

health_manager = setup_fastapi_health_checks(
    app=app,
    service_name="my-service",
    version="1.0.0",
    database_manager=database_manager,
    expected_tables=['table1', 'table2'],
    custom_checks={"external_service": custom_health_check}
)
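
A custom check that waits on a slow dependency can stall the readiness endpoint, so it may be worth bounding it with a timeout. The sketch below assumes custom checks return a boolean (the response examples show boolean entries under `checks`, but the exact contract is not stated here). `some_service_check` is a hypothetical stand-in for a real dependency call.

```python
import asyncio


async def some_service_check() -> bool:
    """Hypothetical stand-in for a real dependency call (HTTP ping, queue check, ...)."""
    await asyncio.sleep(0.01)
    return True


async def custom_health_check(timeout_seconds: float = 2.0) -> bool:
    """Return True if the dependency answers in time, False otherwise."""
    try:
        result = await asyncio.wait_for(some_service_check(), timeout=timeout_seconds)
        return bool(result)
    except Exception:
        # Timeouts and dependency errors both count as "unhealthy" in this sketch.
        return False
```

The 2-second default is an assumption chosen to stay below the 3-second readiness probe timeout used in the Kubernetes configuration in this document.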

Service Ready State

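from contextlib import asynccontextmanager

from fastapi import FastAPI

# In your lifespan function
@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup logic
    await initialize_service()

    # Mark service as ready so the health checks can detect it
    app.state.ready = True

    yield

    # Shutdown logic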
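
How the readiness endpoint consumes this flag is not shown in this document. One plausible reading is sketched below, with a plain namespace standing in for `app.state`. The function `readiness_from_state` and the exact response bodies are assumptions.

```python
from types import SimpleNamespace


def readiness_from_state(state) -> tuple:
    """Map a ready flag to the documented 200 (ready) / 503 (not ready) codes."""
    # A missing attribute is treated as "not ready", e.g. before startup finishes.
    if getattr(state, "ready", False):
        return 200, {"status": "ready"}
    return 503, {"status": "not ready"}


state = SimpleNamespace()            # stand-in for app.state before startup
before = readiness_from_state(state)
state.ready = True                   # what the lifespan function does after startup
after = readiness_from_state(state)
```

Defaulting to "not ready" when the attribute is absent means a service that never sets the flag is kept out of rotation rather than receiving traffic early.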

Kubernetes Configuration

Updated Probe Configuration

The microservice template and specific service configurations have been updated to use the new endpoints:

livenessProbe:
  httpGet:
    path: /health/live
    port: 8000
  initialDelaySeconds: 30
  timeoutSeconds: 5
  periodSeconds: 10
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8000
  initialDelaySeconds: 15
  timeoutSeconds: 3
  periodSeconds: 5
  failureThreshold: 5

Key Changes from Previous Configuration

  1. Liveness Probe: Now uses /health/live instead of /health
  2. Readiness Probe: Now uses /health/ready instead of /health
  3. Improved Timing: Timeouts and failure thresholds are now set per probe (liveness: 5 s timeout, 3 failures; readiness: 3 s timeout, 5 failures)
  4. Separate Concerns: Liveness and readiness are now properly separated
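
To get a feel for what the probe timings above mean in practice, the worst-case time between a fault and Kubernetes acting on it can be estimated as failureThreshold × periodSeconds + timeoutSeconds. This is a rough approximation that ignores kubelet scheduling jitter, and the `Probe` class is purely illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    initial_delay_seconds: int
    timeout_seconds: int
    period_seconds: int
    failure_threshold: int

    def detection_window_seconds(self) -> int:
        """Rough upper bound from a fault to the final consecutive failed probe."""
        return self.failure_threshold * self.period_seconds + self.timeout_seconds


# Values copied from the probe configuration above.
liveness = Probe(initial_delay_seconds=30, timeout_seconds=5, period_seconds=10, failure_threshold=3)
readiness = Probe(initial_delay_seconds=15, timeout_seconds=3, period_seconds=5, failure_threshold=5)

print(liveness.detection_window_seconds())   # 35
print(readiness.detection_window_seconds())  # 28
```

Under this estimate a hung container is restarted after roughly 35 seconds, while an unready pod is removed from service endpoints after roughly 28 seconds.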

Health Check Response Examples

Basic Health Check Response

{
  "status": "healthy",
  "service": "training-service",
  "version": "1.0.0",
  "timestamp": "2025-01-27T10:30:00Z"
}

Readiness Check Response (Ready)

{
  "status": "ready",
  "checks": {
    "application": true,
    "database_connectivity": true,
    "database_tables": true
  },
  "database": {
    "status": "healthy",
    "tables_verified": ["model_training_logs", "trained_models"],
    "missing_tables": [],
    "errors": []
  }
}
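
When a service is not ready, the `checks` and `database` sections are the first place to look. The following sketch extracts the failing parts of a readiness payload. The not-ready payload is made up for illustration and only follows the shape of the ready example above; the real not-ready body may differ.

```python
def summarize_readiness(payload: dict) -> dict:
    """Pull out the failing checks and missing tables from a readiness payload."""
    checks = payload.get("checks", {})
    database = payload.get("database", {})
    return {
        "ready": payload.get("status") == "ready",
        "failed_checks": sorted(name for name, ok in checks.items() if not ok),
        "missing_tables": list(database.get("missing_tables", [])),
    }


# Hypothetical not-ready payload, shaped like the ready example above.
not_ready = {
    "status": "not ready",
    "checks": {
        "application": True,
        "database_connectivity": True,
        "database_tables": False,
    },
    "database": {
        "status": "unhealthy",
        "tables_verified": ["model_training_logs"],
        "missing_tables": ["trained_models"],
        "errors": [],
    },
}

summary = summarize_readiness(not_ready)
```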

Database Health Response

{
  "status": "healthy",
  "connectivity": true,
  "tables_exist": true,
  "tables_verified": ["model_training_logs", "trained_models"],
  "missing_tables": [],
  "errors": [],
  "connection_info": {
    "service_name": "training-service",
    "database_type": "postgresql",
    "pool_size": 20,
    "current_checked_out": 2
  },
  "response_time_ms": 15.23
}
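
The `connection_info` fields can be combined into a pool utilization figure for alerting. This is a minimal sketch using the field names from the response above; `pool_utilization` is a hypothetical helper, and treating a missing or zero `pool_size` as 0.0 is a choice made for this example.

```python
def pool_utilization(db_health: dict) -> float:
    """Fraction of pooled connections currently checked out (0.0 to 1.0)."""
    info = db_health.get("connection_info", {})
    pool_size = info.get("pool_size", 0)
    if not pool_size:
        return 0.0  # avoid dividing by zero when the pool size is unknown
    return info.get("current_checked_out", 0) / pool_size


# Values taken from the example response above: 2 of 20 connections in use.
example = {"connection_info": {"pool_size": 20, "current_checked_out": 2}}
utilization = pool_utilization(example)
```

For the example response this gives 2 / 20, or 10% of the pool in use.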

Testing

Manual Testing

# Test all endpoints for a running service
curl http://localhost:8000/health
curl http://localhost:8000/health/ready
curl http://localhost:8000/health/live
curl http://localhost:8000/health/database
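
The same manual check can be scripted with the Python standard library. This sketch assumes the service listens on http://localhost:8000 as in the curl commands above. The helper names are illustrative, and this is not the provided test script.

```python
import urllib.error
import urllib.request

ENDPOINTS = ["/health", "/health/ready", "/health/live", "/health/database"]


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code; urllib raises on 4xx/5xx, so unwrap it."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code


def check_all(base_url: str, fetch=fetch_status) -> dict:
    """Map each health endpoint path to the status code it returned."""
    return {path: fetch(base_url.rstrip("/") + path) for path in ENDPOINTS}


# Usage against a running service:
#   for path, code in check_all("http://localhost:8000").items():
#       print(code, path)
```

The `fetch` parameter exists so the function can be exercised without a running service, for example by passing a stub that returns fixed status codes.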

Automated Testing

Use the provided test script:

python test_unified_health_checks.py

Migration Guide

For Existing Services

  1. Add Health Check Import:

    from shared.monitoring.health_checks import setup_fastapi_health_checks
    
  2. Add Database Manager Import (if using shared database):

    from app.core.database import database_manager
    
  3. Set Up Health Checks (after app creation, before router inclusion):

    health_manager = setup_fastapi_health_checks(
        app=app,
        service_name="your-service-name",
        version=settings.VERSION,
        database_manager=database_manager,
        expected_tables=["table1", "table2"]
    )
    
  4. Remove Old Health Endpoints: Remove any existing @app.get("/health") endpoints so they do not conflict with the unified /health route

  5. Add Ready State Management:

    # In lifespan function after successful startup
    app.state.ready = True
    
  6. Update Kubernetes Configuration: Update deployment YAML to use new probe endpoints

For Services Using Legacy Database

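If your service doesn't use the shared database manager, pass database_manager=None and expected_tables=None, and register the database check as a custom check: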

async def legacy_database_check():
    """Custom health check for legacy database"""
    return await your_db_health_check()

health_manager = setup_fastapi_health_checks(
    app=app,
    service_name="your-service",
    version=settings.VERSION,
    database_manager=None,
    expected_tables=None,
    custom_checks={"legacy_database": legacy_database_check}
)

Benefits

  1. Consistency: All services now provide the same health check interface
  2. Better Kubernetes Integration: Proper separation of liveness and readiness concerns
  3. Enhanced Debugging: Detailed health information for troubleshooting
  4. Database Verification: Comprehensive database health checks including table verification
  5. Monitoring Ready: Rich health status information for monitoring systems
  6. Maintainability: Centralized health check logic reduces code duplication

Future Enhancements

  1. Metrics Integration: Add Prometheus metrics for health check performance
  2. Circuit Breaker: Implement circuit breaker pattern for external service checks
  3. Health Check Dependencies: Add dependency health checks between services
  4. Performance Thresholds: Add configurable performance thresholds for health checks
  5. Health Check Scheduling: Add scheduled background health checks