# Unified Health Check System

This document describes the unified health check system implemented across all microservices in the bakery-ia platform.

## Overview

The unified health check system provides standardized health monitoring endpoints across all services, with comprehensive database verification, Kubernetes integration, and detailed health reporting.

## Key Features

- **Standardized Endpoints**: All services now provide the same health check endpoints
- **Database Verification**: Comprehensive database health checks including table existence verification
- **Kubernetes Integration**: Proper separation of liveness and readiness probes
- **Detailed Reporting**: Rich health status information for debugging and monitoring
- **App State Integration**: Health checks automatically detect service ready state

## Health Check Endpoints

### `/health` - Basic Health Check

- **Purpose**: Basic service health status
- **Use Case**: General health monitoring, API gateways
- **Response**: Service name, version, status, and timestamp
- **Status Codes**: 200 (healthy/starting)

### `/health/ready` - Kubernetes Readiness Probe

- **Purpose**: Indicates if service is ready to receive traffic
- **Use Case**: Kubernetes readiness probe, load balancer health checks
- **Checks**: Application state, database connectivity, table verification, custom checks
- **Status Codes**: 200 (ready), 503 (not ready)

### `/health/live` - Kubernetes Liveness Probe

- **Purpose**: Indicates if service is alive and should not be restarted
- **Use Case**: Kubernetes liveness probe
- **Response**: Simple alive status
- **Status Codes**: 200 (alive)

### `/health/database` - Detailed Database Health

- **Purpose**: Comprehensive database health information for debugging
- **Use Case**: Database monitoring, troubleshooting
- **Checks**: Connectivity, table existence, connection pool status, response times
- **Status Codes**: 200 (healthy), 503 (unhealthy)

## Implementation

### Services Updated

The following services have been updated to use the unified health check system:

1. **Training Service** (`training-service`)
   - Full implementation with database manager integration
   - Table verification for ML training tables
   - Expected tables: `model_training_logs`, `trained_models`, `model_performance_metrics`, `training_job_queue`, `model_artifacts`

2. **Orders Service** (`orders-service`)
   - Legacy database integration with custom health checks
   - Expected tables: `customers`, `customer_contacts`, `customer_orders`, `order_items`, `order_status_history`, `procurement_plans`, `procurement_requirements`

3. **Inventory Service** (`inventory-service`)
   - Full database manager integration
   - Food safety and inventory table verification
   - Expected tables: `ingredients`, `stock`, `stock_movements`, `product_transformations`, `stock_alerts`, `food_safety_compliance`, `temperature_logs`, `food_safety_alerts`

### Code Integration

#### Basic Setup

```python
from shared.monitoring.health_checks import setup_fastapi_health_checks

# Set up unified health checks
health_manager = setup_fastapi_health_checks(
    app=app,
    service_name="my-service",
    version="1.0.0",
    database_manager=database_manager,
    expected_tables=['table1', 'table2'],
    custom_checks={"custom_check": custom_check_function}
)
```

#### With Custom Checks

```python
async def custom_health_check():
    """Custom health check function"""
    return await some_service_check()

health_manager = setup_fastapi_health_checks(
    app=app,
    service_name="my-service",
    version="1.0.0",
    database_manager=database_manager,
    expected_tables=['table1', 'table2'],
    custom_checks={"external_service": custom_health_check}
)
```

#### Service Ready State

```python
from contextlib import asynccontextmanager

from fastapi import FastAPI


# In your lifespan function
@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup logic
    await initialize_service()

    # Mark service as ready
    app.state.ready = True

    yield

    # Shutdown logic
```

## Kubernetes Configuration

### Updated Probe Configuration

The microservice template and specific service configurations have been updated to use the new endpoints:

```yaml
livenessProbe:
  httpGet:
    path: /health/live
    port: 8000
  initialDelaySeconds: 30
  timeoutSeconds: 5
  periodSeconds: 10
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8000
  initialDelaySeconds: 15
  timeoutSeconds: 3
  periodSeconds: 5
  failureThreshold: 5
```

### Key Changes from Previous Configuration

1. **Liveness Probe**: Now uses `/health/live` instead of `/health`
2. **Readiness Probe**: Now uses `/health/ready` instead of `/health`
3. **Improved Timing**: Adjusted timeouts and failure thresholds for better reliability
4. **Separate Concerns**: Liveness and readiness are now properly separated

## Health Check Response Examples

### Basic Health Check Response

```json
{
  "status": "healthy",
  "service": "training-service",
  "version": "1.0.0",
  "timestamp": "2025-01-27T10:30:00Z"
}
```

### Readiness Check Response (Ready)

```json
{
  "status": "ready",
  "checks": {
    "application": true,
    "database_connectivity": true,
    "database_tables": true
  },
  "database": {
    "status": "healthy",
    "tables_verified": ["model_training_logs", "trained_models"],
    "missing_tables": [],
    "errors": []
  }
}
```

### Database Health Response

```json
{
  "status": "healthy",
  "connectivity": true,
  "tables_exist": true,
  "tables_verified": ["model_training_logs", "trained_models"],
  "missing_tables": [],
  "errors": [],
  "connection_info": {
    "service_name": "training-service",
    "database_type": "postgresql",
    "pool_size": 20,
    "current_checked_out": 2
  },
  "response_time_ms": 15.23
}
```

## Testing

### Manual Testing

```bash
# Test all endpoints for a running service
curl http://localhost:8000/health
curl http://localhost:8000/health/ready
curl http://localhost:8000/health/live
curl http://localhost:8000/health/database
```

### Automated Testing

Use the provided test script:

```bash
python test_unified_health_checks.py
```

## Migration Guide

### For Existing Services

1. **Add Health Check Import**:

   ```python
   from shared.monitoring.health_checks import setup_fastapi_health_checks
   ```

2. **Add Database Manager Import** (if using shared database):

   ```python
   from app.core.database import database_manager
   ```

3. **Set Up Health Checks** (after app creation, before router inclusion):

   ```python
   health_manager = setup_fastapi_health_checks(
       app=app,
       service_name="your-service-name",
       version=settings.VERSION,
       database_manager=database_manager,
       expected_tables=["table1", "table2"]
   )
   ```

4. **Remove Old Health Endpoints**: Remove any existing `@app.get("/health")` endpoints

5. **Add Ready State Management**:

   ```python
   # In lifespan function after successful startup
   app.state.ready = True
   ```

6. **Update Kubernetes Configuration**: Update deployment YAML to use new probe endpoints

### For Services Using Legacy Database

If your service doesn't use the shared database manager:

```python
async def legacy_database_check():
    """Custom health check for legacy database"""
    return await your_db_health_check()

health_manager = setup_fastapi_health_checks(
    app=app,
    service_name="your-service",
    version=settings.VERSION,
    database_manager=None,
    expected_tables=None,
    custom_checks={"legacy_database": legacy_database_check}
)
```

## Benefits

1. **Consistency**: All services now provide the same health check interface
2. **Better Kubernetes Integration**: Proper separation of liveness and readiness concerns
3. **Enhanced Debugging**: Detailed health information for troubleshooting
4. **Database Verification**: Comprehensive database health checks including table verification
5. **Monitoring Ready**: Rich health status information for monitoring systems
6. **Maintainability**: Centralized health check logic reduces code duplication

## Future Enhancements

1. **Metrics Integration**: Add Prometheus metrics for health check performance
2. **Circuit Breaker**: Implement circuit breaker pattern for external service checks
3. **Health Check Dependencies**: Add dependency health checks between services
4. **Performance Thresholds**: Add configurable performance thresholds for health checks
5. **Health Check Scheduling**: Add scheduled background health checks
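
As a rough illustration of what the performance-threshold enhancement might look like, the sketch below classifies a database health payload by its `response_time_ms` field, the field shown in the Database Health Response example. The `LatencyThresholds` class, its default values, the `classify_latency` helper, and the `degraded` status are all hypothetical; none of them exist in the current shared library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyThresholds:
    """Hypothetical thresholds; the defaults are placeholders, not recommendations."""
    degraded_ms: float = 100.0
    unhealthy_ms: float = 500.0


def classify_latency(
    payload: dict,
    thresholds: LatencyThresholds = LatencyThresholds(),
) -> str:
    """Return 'healthy', 'degraded', or 'unhealthy' for a database health payload."""
    # A status that is already non-healthy always wins over latency.
    if payload.get("status") != "healthy":
        return "unhealthy"

    response_time = payload.get("response_time_ms")
    if response_time is None:
        # No timing information: fall back to the reported status.
        return "healthy"

    if response_time >= thresholds.unhealthy_ms:
        return "unhealthy"
    if response_time >= thresholds.degraded_ms:
        return "degraded"
    return "healthy"


# Payload shaped like the Database Health Response example.
sample = {"status": "healthy", "response_time_ms": 15.23}
print(classify_latency(sample))  # prints "healthy"
```

Keeping the thresholds in a small immutable object would let each service pass its own limits without changing the classification logic, though how the real enhancement is configured remains an open design question.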