bakery-ia/HEALTH_CHECKS.md
2025-09-29 13:13:12 +02:00


# Unified Health Check System
This document describes the unified health check system implemented across all microservices in the bakery-ia platform.
## Overview
The unified health check system provides standardized health monitoring endpoints across all services, with comprehensive database verification, Kubernetes integration, and detailed health reporting.
## Key Features
- **Standardized Endpoints**: All services now provide the same health check endpoints
- **Database Verification**: Comprehensive database health checks including table existence verification
- **Kubernetes Integration**: Proper separation of liveness and readiness probes
- **Detailed Reporting**: Rich health status information for debugging and monitoring
- **App State Integration**: Health checks automatically detect service ready state
## Health Check Endpoints
### `/health` - Basic Health Check
- **Purpose**: Basic service health status
- **Use Case**: General health monitoring, API gateways
- **Response**: Service name, version, status, and timestamp
- **Status Codes**: 200 (returned for both the `healthy` and `starting` states)
### `/health/ready` - Kubernetes Readiness Probe
- **Purpose**: Indicates if service is ready to receive traffic
- **Use Case**: Kubernetes readiness probe, load balancer health checks
- **Checks**: Application state, database connectivity, table verification, custom checks
- **Status Codes**: 200 (ready), 503 (not ready)
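
The readiness decision can be pictured as an "all checks must pass" rule. The following is a minimal sketch of that idea, not the code in `shared.monitoring.health_checks`; the helper name `evaluate_readiness` and the `"not_ready"` status string are assumptions made for illustration.

```python
from typing import Dict, Tuple


def evaluate_readiness(checks: Dict[str, bool]) -> Tuple[int, dict]:
    """Map individual check results to an HTTP status code and a response body.

    Hypothetical helper: the service counts as ready only if every check passed.
    """
    ready = all(checks.values())
    # "not_ready" is an assumed label; only "ready" appears in this document.
    body = {"status": "ready" if ready else "not_ready", "checks": checks}
    return (200 if ready else 503), body


code, body = evaluate_readiness(
    {"application": True, "database_connectivity": True, "database_tables": False}
)
print(code, body["status"])  # 503 not_ready
```

A single failing check is enough to return 503, which is what lets Kubernetes hold traffic back until the database tables are in place.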
### `/health/live` - Kubernetes Liveness Probe
- **Purpose**: Indicates if service is alive and should not be restarted
- **Use Case**: Kubernetes liveness probe
- **Response**: Simple alive status
- **Status Codes**: 200 (alive)
### `/health/database` - Detailed Database Health
- **Purpose**: Comprehensive database health information for debugging
- **Use Case**: Database monitoring, troubleshooting
- **Checks**: Connectivity, table existence, connection pool status, response times
- **Status Codes**: 200 (healthy), 503 (unhealthy)
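
To make the connectivity, table-existence, and response-time checks concrete, here is a rough sketch that runs against an in-memory SQLite database from the standard library. SQLite is only a stand-in so the example is self-contained; the function name `check_database` and the exact result keys are assumptions modelled on the response examples in this document, not the shared library's implementation.

```python
import sqlite3
import time
from typing import Iterable


def check_database(conn: sqlite3.Connection, expected_tables: Iterable[str]) -> dict:
    """Verify connectivity and table existence, and time the whole check."""
    started = time.perf_counter()
    result = {"connectivity": False, "tables_verified": [], "missing_tables": [], "errors": []}
    try:
        conn.execute("SELECT 1")  # connectivity probe
        result["connectivity"] = True
        # SQLite-specific catalog query; PostgreSQL would use information_schema instead.
        rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
        existing = {name for (name,) in rows}
        for table in expected_tables:
            key = "tables_verified" if table in existing else "missing_tables"
            result[key].append(table)
    except sqlite3.Error as exc:
        result["errors"].append(str(exc))
    result["tables_exist"] = result["connectivity"] and not result["missing_tables"]
    healthy = result["tables_exist"] and not result["errors"]
    result["status"] = "healthy" if healthy else "unhealthy"
    result["response_time_ms"] = round((time.perf_counter() - started) * 1000, 2)
    return result


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trained_models (id INTEGER PRIMARY KEY)")
report = check_database(conn, ["trained_models", "model_artifacts"])
print(report["status"], report["missing_tables"])  # unhealthy ['model_artifacts']
```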
## Implementation
### Services Updated
The following services have been updated to use the unified health check system:
1. **Training Service** (`training-service`)
   - Full implementation with database manager integration
   - Table verification for ML training tables
   - Expected tables: `model_training_logs`, `trained_models`, `model_performance_metrics`, `training_job_queue`, `model_artifacts`
2. **Orders Service** (`orders-service`)
   - Legacy database integration with custom health checks
   - Expected tables: `customers`, `customer_contacts`, `customer_orders`, `order_items`, `order_status_history`, `procurement_plans`, `procurement_requirements`
3. **Inventory Service** (`inventory-service`)
   - Full database manager integration
   - Food safety and inventory table verification
   - Expected tables: `ingredients`, `stock`, `stock_movements`, `product_transformations`, `stock_alerts`, `food_safety_compliance`, `temperature_logs`, `food_safety_alerts`
### Code Integration
#### Basic Setup
```python
from shared.monitoring.health_checks import setup_fastapi_health_checks

# Setup unified health checks
health_manager = setup_fastapi_health_checks(
    app=app,
    service_name="my-service",
    version="1.0.0",
    database_manager=database_manager,
    expected_tables=['table1', 'table2'],
    custom_checks={"custom_check": custom_check_function}
)
```
#### With Custom Checks
```python
async def custom_health_check():
    """Custom health check function"""
    return await some_service_check()


health_manager = setup_fastapi_health_checks(
    app=app,
    service_name="my-service",
    version="1.0.0",
    database_manager=database_manager,
    expected_tables=['table1', 'table2'],
    custom_checks={"external_service": custom_health_check}
)
```
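
A custom check that calls an external service can hang, which would stall the readiness endpoint. One possible guard, sketched here under the assumption that custom checks are zero-argument async callables returning a truthy value (as in the example above), is to wrap each check with a timeout. The wrapper name `with_timeout` is hypothetical and not part of the shared library.

```python
import asyncio
from typing import Awaitable, Callable


def with_timeout(
    check: Callable[[], Awaitable[bool]], seconds: float
) -> Callable[[], Awaitable[bool]]:
    """Wrap an async check so that a timeout or an exception counts as a failed check."""

    async def guarded() -> bool:
        try:
            return bool(await asyncio.wait_for(check(), timeout=seconds))
        except Exception:  # includes asyncio.TimeoutError
            return False

    return guarded


async def slow_check() -> bool:
    await asyncio.sleep(1)  # simulates an unresponsive dependency
    return True


print(asyncio.run(with_timeout(slow_check, 0.05)()))  # False
```

The wrapped callable could then be passed in place of the raw one, for example `custom_checks={"external_service": with_timeout(custom_health_check, 2.0)}`.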
#### Service Ready State
```python
from contextlib import asynccontextmanager

from fastapi import FastAPI


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup logic
    await initialize_service()
    # Mark service as ready
    app.state.ready = True
    yield
    # Shutdown logic
```
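
How the health checks might read this flag is sketched below. It assumes the flag is looked up with a default of `False`, so a service that has not finished startup reports `starting` rather than raising an attribute error; the helper name `basic_health_status` is hypothetical, and `SimpleNamespace` stands in for `app.state` to keep the example free of dependencies.

```python
from types import SimpleNamespace


def basic_health_status(app_state) -> str:
    """Return "healthy" once startup has set `ready`, otherwise "starting"."""
    # getattr with a default covers the window before lifespan sets the flag.
    return "healthy" if getattr(app_state, "ready", False) else "starting"


state = SimpleNamespace()          # stand-in for app.state
print(basic_health_status(state))  # starting
state.ready = True
print(basic_health_status(state))  # healthy
```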
## Kubernetes Configuration
### Updated Probe Configuration
The microservice template and specific service configurations have been updated to use the new endpoints:
```yaml
livenessProbe:
  httpGet:
    path: /health/live
    port: 8000
  initialDelaySeconds: 30
  timeoutSeconds: 5
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8000
  initialDelaySeconds: 15
  timeoutSeconds: 3
  periodSeconds: 5
  failureThreshold: 5
```
```
### Key Changes from Previous Configuration
1. **Liveness Probe**: Now uses `/health/live` instead of `/health`
2. **Readiness Probe**: Now uses `/health/ready` instead of `/health`
3. **Improved Timing**: Adjusted timeouts and failure thresholds for better reliability
4. **Separate Concerns**: Liveness and readiness are now properly separated
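
As a rough worked example of what the probe timing means in practice, the snippet below estimates how long a running pod can keep failing before Kubernetes acts on it. The formula (`periodSeconds * failureThreshold + timeoutSeconds`, counted from the last successful probe) is a simplification of kubelet behaviour, and the function name is made up for this illustration.

```python
def worst_case_detection_seconds(period: int, failure_threshold: int, timeout: int) -> int:
    """Rough upper bound on time from the last good probe to the threshold being hit."""
    return period * failure_threshold + timeout


# Values taken from the probe configuration in this document.
liveness = worst_case_detection_seconds(period=10, failure_threshold=3, timeout=5)
readiness = worst_case_detection_seconds(period=5, failure_threshold=5, timeout=3)
print(liveness, readiness)  # 35 28
```

Under this estimate, an unresponsive container is restarted after roughly 35 seconds, and a pod that stops being ready is removed from service endpoints after roughly 28 seconds.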
## Health Check Response Examples
### Basic Health Check Response
```json
{
  "status": "healthy",
  "service": "training-service",
  "version": "1.0.0",
  "timestamp": "2025-01-27T10:30:00Z"
}
```
### Readiness Check Response (Ready)
```json
{
  "status": "ready",
  "checks": {
    "application": true,
    "database_connectivity": true,
    "database_tables": true
  },
  "database": {
    "status": "healthy",
    "tables_verified": ["model_training_logs", "trained_models"],
    "missing_tables": [],
    "errors": []
  }
}
```
### Database Health Response
```json
{
  "status": "healthy",
  "connectivity": true,
  "tables_exist": true,
  "tables_verified": ["model_training_logs", "trained_models"],
  "missing_tables": [],
  "errors": [],
  "connection_info": {
    "service_name": "training-service",
    "database_type": "postgresql",
    "pool_size": 20,
    "current_checked_out": 2
  },
  "response_time_ms": 15.23
}
```
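
A monitoring script could derive simple signals from this response, such as connection pool utilization. The sketch below assumes the field names shown in the example above; the helper `pool_utilization` is hypothetical and not part of the platform.

```python
import json


def pool_utilization(payload: dict) -> float:
    """Fraction of pooled connections currently checked out (0.0 if pool size is unknown)."""
    info = payload.get("connection_info", {})
    size = info.get("pool_size") or 0
    return info.get("current_checked_out", 0) / size if size else 0.0


sample = json.loads("""
{"status": "healthy",
 "connection_info": {"pool_size": 20, "current_checked_out": 2},
 "response_time_ms": 15.23}
""")
print(pool_utilization(sample))  # 0.1
```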
## Testing
### Manual Testing
```bash
# Test all endpoints for a running service
curl http://localhost:8000/health
curl http://localhost:8000/health/ready
curl http://localhost:8000/health/live
curl http://localhost:8000/health/database
```
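
The same manual check can be scripted with the Python standard library when the status codes matter more than the bodies. This is a small sketch, assuming the service listens on `localhost:8000` as in the `curl` commands above; the function name `probe` is made up for the example.

```python
import urllib.error
import urllib.request
from typing import Dict, Optional

PATHS = ("/health", "/health/ready", "/health/live", "/health/database")


def probe(base_url: str, timeout: float = 3.0) -> Dict[str, Optional[int]]:
    """Return the HTTP status code per endpoint, or None if the service is unreachable."""
    results: Dict[str, Optional[int]] = {}
    for path in PATHS:
        try:
            with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
                results[path] = resp.status
        except urllib.error.HTTPError as err:
            results[path] = err.code  # e.g. 503 from /health/ready
        except (urllib.error.URLError, OSError):
            results[path] = None  # connection refused, timeout, DNS failure
    return results
```

Calling `probe("http://localhost:8000")` returns a dict keyed by path. A 503 is recorded as a status code rather than raised, so a not-ready service can be told apart from one that is not running at all.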
### Automated Testing
Use the provided test script:
```bash
python test_unified_health_checks.py
```
## Migration Guide
### For Existing Services
1. **Add Health Check Import**:
   ```python
   from shared.monitoring.health_checks import setup_fastapi_health_checks
   ```
2. **Add Database Manager Import** (if using shared database):
   ```python
   from app.core.database import database_manager
   ```
3. **Setup Health Checks** (after app creation, before router inclusion):
   ```python
   health_manager = setup_fastapi_health_checks(
       app=app,
       service_name="your-service-name",
       version=settings.VERSION,
       database_manager=database_manager,
       expected_tables=["table1", "table2"]
   )
   ```
4. **Remove Old Health Endpoints**:
   Remove any existing `@app.get("/health")` endpoints
5. **Add Ready State Management**:
   ```python
   # In lifespan function after successful startup
   app.state.ready = True
   ```
6. **Update Kubernetes Configuration**:
   Update deployment YAML to use new probe endpoints
### For Services Using Legacy Database
If your service doesn't use the shared database manager:
```python
async def legacy_database_check():
    """Custom health check for legacy database"""
    return await your_db_health_check()


health_manager = setup_fastapi_health_checks(
    app=app,
    service_name="your-service",
    version=settings.VERSION,
    database_manager=None,
    expected_tables=None,
    custom_checks={"legacy_database": legacy_database_check}
)
```
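
If the legacy database driver is synchronous, `your_db_health_check` could run the blocking probe in a worker thread so the event loop is not held up. The sketch below uses SQLite from the standard library purely as a stand-in for the legacy database, and it requires Python 3.9+ for `asyncio.to_thread`; the helper `_ping_legacy_db` and the `path` parameter are assumptions for illustration.

```python
import asyncio
import sqlite3


def _ping_legacy_db(path: str) -> bool:
    """Blocking connectivity probe; opens and closes its own connection."""
    try:
        conn = sqlite3.connect(path)
        try:
            conn.execute("SELECT 1")
            return True
        finally:
            conn.close()
    except sqlite3.Error:
        return False


async def your_db_health_check(path: str = ":memory:") -> bool:
    """Run the blocking probe in a worker thread so the event loop stays responsive."""
    return await asyncio.to_thread(_ping_legacy_db, path)


print(asyncio.run(your_db_health_check()))  # True
```

Opening the connection inside the worker function avoids sharing a connection object across threads, which many synchronous drivers do not support.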
## Benefits
1. **Consistency**: All services now provide the same health check interface
2. **Better Kubernetes Integration**: Proper separation of liveness and readiness concerns
3. **Enhanced Debugging**: Detailed health information for troubleshooting
4. **Database Verification**: Comprehensive database health checks including table verification
5. **Monitoring Ready**: Rich health status information for monitoring systems
6. **Maintainability**: Centralized health check logic reduces code duplication
## Future Enhancements
1. **Metrics Integration**: Add Prometheus metrics for health check performance
2. **Circuit Breaker**: Implement circuit breaker pattern for external service checks
3. **Health Check Dependencies**: Add dependency health checks between services
4. **Performance Thresholds**: Add configurable performance thresholds for health checks
5. **Health Check Scheduling**: Add scheduled background health checks