Unified Health Check System
This document describes the unified health check system implemented across all microservices in the bakery-ia platform.
Overview
The unified health check system provides standardized health monitoring endpoints across all services, with comprehensive database verification, Kubernetes integration, and detailed health reporting.
Key Features
- Standardized Endpoints: All services now provide the same health check endpoints
- Database Verification: Comprehensive database health checks including table existence verification
- Kubernetes Integration: Proper separation of liveness and readiness probes
- Detailed Reporting: Rich health status information for debugging and monitoring
- App State Integration: Health checks automatically detect service ready state
Health Check Endpoints
/health - Basic Health Check
- Purpose: Basic service health status
- Use Case: General health monitoring, API gateways
- Response: Service name, version, status, and timestamp
- Status Codes: 200 (healthy/starting)
/health/ready - Kubernetes Readiness Probe
- Purpose: Indicates if service is ready to receive traffic
- Use Case: Kubernetes readiness probe, load balancer health checks
- Checks: Application state, database connectivity, table verification, custom checks
- Status Codes: 200 (ready), 503 (not ready)
/health/live - Kubernetes Liveness Probe
- Purpose: Indicates if service is alive and should not be restarted
- Use Case: Kubernetes liveness probe
- Response: Simple alive status
- Status Codes: 200 (alive)
/health/database - Detailed Database Health
- Purpose: Comprehensive database health information for debugging
- Use Case: Database monitoring, troubleshooting
- Checks: Connectivity, table existence, connection pool status, response times
- Status Codes: 200 (healthy), 503 (unhealthy)
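The readiness semantics described above (200 when every check passes, 503 otherwise) can be sketched as a small aggregation function. This is an illustrative sketch only, not the code in shared.monitoring.health_checks: the name evaluate_readiness, the HealthCheck alias, and the "not ready" status string are assumptions made for the example.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Tuple

# Hypothetical alias: a check is an async callable that returns True when healthy.
HealthCheck = Callable[[], Awaitable[bool]]


async def evaluate_readiness(
    app_ready: bool, checks: Dict[str, HealthCheck]
) -> Tuple[int, dict]:
    """Run every check and map the combined result to an HTTP status code."""
    results = {"application": app_ready}
    for name, check in checks.items():
        try:
            results[name] = bool(await check())
        except Exception:
            # A check that raises counts as failed instead of crashing the probe.
            results[name] = False

    ready = all(results.values())
    # The "not ready" label is an assumption; the real payload may differ.
    body = {"status": "ready" if ready else "not ready", "checks": results}
    return (200 if ready else 503), body
```

Treating an exception as a failed check keeps the probe endpoint itself from returning a 500, so the caller always sees either 200 or 503.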
Implementation
Services Updated
The following services have been updated to use the unified health check system:
- Training Service (training-service)
  - Full implementation with database manager integration
  - Table verification for ML training tables
  - Expected tables: model_training_logs, trained_models, model_performance_metrics, training_job_queue, model_artifacts
- Orders Service (orders-service)
  - Legacy database integration with custom health checks
  - Expected tables: customers, customer_contacts, customer_orders, order_items, order_status_history, procurement_plans, procurement_requirements
- Inventory Service (inventory-service)
  - Full database manager integration
  - Food safety and inventory table verification
  - Expected tables: ingredients, stock, stock_movements, product_transformations, stock_alerts, food_safety_compliance, temperature_logs, food_safety_alerts
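Table verification boils down to comparing a service's expected tables against the tables that actually exist. A minimal sketch of that comparison follows, assuming the list of existing table names has already been fetched from the database. The function name verify_tables is hypothetical; the result keys mirror the fields shown in the response examples in this document.

```python
from typing import Dict, Iterable


def verify_tables(expected: Iterable[str], existing: Iterable[str]) -> Dict[str, object]:
    """Split the expected tables into those found and those missing."""
    existing_set = set(existing)
    expected_list = list(expected)
    verified = [name for name in expected_list if name in existing_set]
    missing = [name for name in expected_list if name not in existing_set]
    return {
        "tables_exist": not missing,  # True only when nothing is missing
        "tables_verified": verified,
        "missing_tables": missing,
    }
```

Extra tables in the database are ignored; only an expected table that is absent marks the check as failed.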
Code Integration
Basic Setup
from shared.monitoring.health_checks import setup_fastapi_health_checks

# Setup unified health checks
health_manager = setup_fastapi_health_checks(
    app=app,
    service_name="my-service",
    version="1.0.0",
    database_manager=database_manager,
    expected_tables=['table1', 'table2'],
    custom_checks={"custom_check": custom_check_function}
)
With Custom Checks
async def custom_health_check():
    """Custom health check function"""
    return await some_service_check()

health_manager = setup_fastapi_health_checks(
    app=app,
    service_name="my-service",
    version="1.0.0",
    database_manager=database_manager,
    expected_tables=['table1', 'table2'],
    custom_checks={"external_service": custom_health_check}
)
Service Ready State
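from contextlib import asynccontextmanager
from fastapi import FastAPI

# In your lifespan function
@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup logic
    await initialize_service()
    # Mark service as ready so that /health/ready can report success
    app.state.ready = True
    yield
    # Shutdown logic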
Kubernetes Configuration
Updated Probe Configuration
The microservice template and specific service configurations have been updated to use the new endpoints:
livenessProbe:
  httpGet:
    path: /health/live
    port: 8000
  initialDelaySeconds: 30
  timeoutSeconds: 5
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8000
  initialDelaySeconds: 15
  timeoutSeconds: 3
  periodSeconds: 5
  failureThreshold: 5
Key Changes from Previous Configuration
- Liveness Probe: Now uses /health/live instead of /health
- Readiness Probe: Now uses /health/ready instead of /health
- Improved Timing: Adjusted timeouts and failure thresholds for better reliability
- Separate Concerns: Liveness and readiness are now properly separated
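To get a feel for what the timing values mean in practice, the arithmetic below estimates how long a running container can be unhealthy before Kubernetes acts on it. It uses a simplified model that is an assumption of this example: the failure starts right after a successful probe, each failing probe runs until its timeout, and scheduling jitter is ignored. Kubernetes requires failureThreshold consecutive failures, so the rough upper bound is failureThreshold × periodSeconds + timeoutSeconds.

```python
def worst_case_detection_seconds(period: int, timeout: int, failure_threshold: int) -> int:
    """Rough upper bound on the time from a failure to the probe being marked failed."""
    return failure_threshold * period + timeout


# Values taken from the probe configuration in this document.
liveness = worst_case_detection_seconds(period=10, timeout=5, failure_threshold=3)
readiness = worst_case_detection_seconds(period=5, timeout=3, failure_threshold=5)
print(liveness, readiness)  # 35 28
```

Under this model a hung container is restarted after roughly 35 seconds, and a service that loses its database is removed from load balancing after roughly 28 seconds.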
Health Check Response Examples
Basic Health Check Response
{
  "status": "healthy",
  "service": "training-service",
  "version": "1.0.0",
  "timestamp": "2025-01-27T10:30:00Z"
}
Readiness Check Response (Ready)
{
  "status": "ready",
  "checks": {
    "application": true,
    "database_connectivity": true,
    "database_tables": true
  },
  "database": {
    "status": "healthy",
    "tables_verified": ["model_training_logs", "trained_models"],
    "missing_tables": [],
    "errors": []
  }
}
Database Health Response
{
  "status": "healthy",
  "connectivity": true,
  "tables_exist": true,
  "tables_verified": ["model_training_logs", "trained_models"],
  "missing_tables": [],
  "errors": [],
  "connection_info": {
    "service_name": "training-service",
    "database_type": "postgresql",
    "pool_size": 20,
    "current_checked_out": 2
  },
  "response_time_ms": 15.23
}
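A monitoring script might condense the database health payload into a single log line. The sketch below assumes the field names shown in the example response above. The function name summarize_database_health is hypothetical, and every field is read defensively in case a service omits it.

```python
def summarize_database_health(payload: dict) -> str:
    """Condense a /health/database response into a one-line summary."""
    info = payload.get("connection_info", {})
    missing = payload.get("missing_tables", [])
    errors = payload.get("errors", [])

    parts = [f"status={payload.get('status', 'unknown')}"]
    if "pool_size" in info and "current_checked_out" in info:
        # Connections currently in use versus the configured pool size.
        parts.append(f"pool={info['current_checked_out']}/{info['pool_size']}")
    if "response_time_ms" in payload:
        parts.append(f"latency={payload['response_time_ms']}ms")
    if missing:
        parts.append("missing=" + ",".join(missing))
    if errors:
        parts.append(f"errors={len(errors)}")
    return " ".join(parts)
```

For the example response above, this would produce status=healthy pool=2/20 latency=15.23ms.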
Testing
Manual Testing
# Test all endpoints for a running service
curl http://localhost:8000/health
curl http://localhost:8000/health/ready
curl http://localhost:8000/health/live
curl http://localhost:8000/health/database
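The same manual check can be scripted when several services need to be probed at once. The following is a minimal standard-library sketch, not the provided test script. The names HEALTH_PATHS, http_status, and probe_endpoints are made up for this example, and the fetch parameter exists only so the logic can be exercised without a running service.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Optional

# The four endpoints described in this document.
HEALTH_PATHS = ("/health", "/health/ready", "/health/live", "/health/database")


def http_status(url: str, timeout: float = 3.0) -> Optional[int]:
    """Return the HTTP status code for url, or None if the service is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # urllib raises on 4xx/5xx; a 503 is an expected "not ready" answer.
        return exc.code
    except (urllib.error.URLError, OSError):
        return None


def probe_endpoints(
    base_url: str, fetch: Callable[[str], Optional[int]] = http_status
) -> Dict[str, Optional[int]]:
    """Query every health endpoint and return a mapping of path to status code."""
    root = base_url.rstrip("/")
    return {path: fetch(root + path) for path in HEALTH_PATHS}
```

Calling probe_endpoints("http://localhost:8000") against a running service returns one status code per endpoint, with None for any endpoint that could not be reached.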
Automated Testing
Use the provided test script:
python test_unified_health_checks.py
Migration Guide
For Existing Services
- Add Health Check Import:
    from shared.monitoring.health_checks import setup_fastapi_health_checks
- Add Database Manager Import (if using shared database):
    from app.core.database import database_manager
- Setup Health Checks (after app creation, before router inclusion):
    health_manager = setup_fastapi_health_checks(
        app=app,
        service_name="your-service-name",
        version=settings.VERSION,
        database_manager=database_manager,
        expected_tables=["table1", "table2"]
    )
- Remove Old Health Endpoints: Remove any existing @app.get("/health") endpoints
- Add Ready State Management:
    # In lifespan function after successful startup
    app.state.ready = True
- Update Kubernetes Configuration: Update deployment YAML to use new probe endpoints
For Services Using Legacy Database
If your service doesn't use the shared database manager:
async def legacy_database_check():
    """Custom health check for legacy database"""
    return await your_db_health_check()

health_manager = setup_fastapi_health_checks(
    app=app,
    service_name="your-service",
    version=settings.VERSION,
    database_manager=None,
    expected_tables=None,
    custom_checks={"legacy_database": legacy_database_check}
)
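A legacy database check that hangs could stall the readiness probe past its timeout, so it may be worth bounding custom checks with a deadline. The wrapper below is a sketch under that assumption. The helper name with_timeout is hypothetical and is not part of the shared health check module.

```python
import asyncio
from typing import Awaitable, Callable


def with_timeout(
    check: Callable[[], Awaitable[bool]], seconds: float
) -> Callable[[], Awaitable[bool]]:
    """Wrap an async check so it reports failure instead of hanging or raising."""

    async def wrapped() -> bool:
        try:
            return bool(await asyncio.wait_for(check(), timeout=seconds))
        except Exception:
            # Covers asyncio.TimeoutError as well as errors raised by the check itself.
            return False

    return wrapped
```

The wrapped function could then be registered in place of the raw check, for example custom_checks={"legacy_database": with_timeout(legacy_database_check, 2.0)}, with the deadline kept below the probe's timeoutSeconds.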
Benefits
- Consistency: All services now provide the same health check interface
- Better Kubernetes Integration: Proper separation of liveness and readiness concerns
- Enhanced Debugging: Detailed health information for troubleshooting
- Database Verification: Comprehensive database health checks including table verification
- Monitoring Ready: Rich health status information for monitoring systems
- Maintainability: Centralized health check logic reduces code duplication
Future Enhancements
- Metrics Integration: Add Prometheus metrics for health check performance
- Circuit Breaker: Implement circuit breaker pattern for external service checks
- Health Check Dependencies: Add dependency health checks between services
- Performance Thresholds: Add configurable performance thresholds for health checks
- Health Check Scheduling: Add scheduled background health checks