# Unified Health Check System

This document describes the unified health check system implemented across all microservices in the bakery-ia platform.

## Overview

The unified health check system provides standardized health monitoring endpoints across all services, with comprehensive database verification, Kubernetes integration, and detailed health reporting.

## Key Features

- **Standardized Endpoints**: All services now provide the same health check endpoints
- **Database Verification**: Comprehensive database health checks including table existence verification
- **Kubernetes Integration**: Proper separation of liveness and readiness probes
- **Detailed Reporting**: Rich health status information for debugging and monitoring
- **App State Integration**: Health checks automatically detect service ready state

## Health Check Endpoints

### `/health` - Basic Health Check

- **Purpose**: Basic service health status
- **Use Case**: General health monitoring, API gateways
- **Response**: Service name, version, status, and timestamp
- **Status Codes**: 200 (healthy/starting)

### `/health/ready` - Kubernetes Readiness Probe

- **Purpose**: Indicates whether the service is ready to receive traffic
- **Use Case**: Kubernetes readiness probe, load balancer health checks
- **Checks**: Application state, database connectivity, table verification, custom checks
- **Status Codes**: 200 (ready), 503 (not ready)

### `/health/live` - Kubernetes Liveness Probe

- **Purpose**: Indicates whether the service is alive and should not be restarted
- **Use Case**: Kubernetes liveness probe
- **Response**: Simple alive status
- **Status Codes**: 200 (alive)

### `/health/database` - Detailed Database Health

- **Purpose**: Comprehensive database health information for debugging
- **Use Case**: Database monitoring, troubleshooting
- **Checks**: Connectivity, table existence, connection pool status, response times
- **Status Codes**: 200 (healthy), 503 (unhealthy)

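
The difference between the liveness and readiness contracts can be sketched in plain Python. This is a minimal illustration, not the shared module's actual code: the function names `liveness_status` and `readiness_status`, the `"not ready"` status string, and the shape of the `checks` dictionary are assumptions made for the example. Only the status codes come from the endpoint descriptions above.

```python
from typing import Dict, Tuple


def liveness_status() -> Tuple[int, Dict[str, str]]:
    """Liveness only confirms the process can answer; it runs no dependency checks."""
    return 200, {"status": "alive"}


def readiness_status(checks: Dict[str, bool]) -> Tuple[int, Dict[str, object]]:
    """Readiness passes only when every individual check passed."""
    ready = all(checks.values())
    body = {"status": "ready" if ready else "not ready", "checks": checks}
    # 200 lets traffic through; 503 tells Kubernetes to hold traffic back.
    return (200 if ready else 503), body


print(readiness_status({"application": True, "database_connectivity": False})[0])  # 503
```

Keeping dependency checks out of the liveness path means a temporary database outage takes a pod out of rotation without triggering a restart.
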
## Implementation

### Services Updated

The following services have been updated to use the unified health check system:

1. **Training Service** (`training-service`)
   - Full implementation with database manager integration
   - Table verification for ML training tables
   - Expected tables: `model_training_logs`, `trained_models`, `model_performance_metrics`, `training_job_queue`, `model_artifacts`

2. **Orders Service** (`orders-service`)
   - Legacy database integration with custom health checks
   - Expected tables: `customers`, `customer_contacts`, `customer_orders`, `order_items`, `order_status_history`, `procurement_plans`, `procurement_requirements`

3. **Inventory Service** (`inventory-service`)
   - Full database manager integration
   - Food safety and inventory table verification
   - Expected tables: `ingredients`, `stock`, `stock_movements`, `product_transformations`, `stock_alerts`, `food_safety_compliance`, `temperature_logs`, `food_safety_alerts`

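
Table verification amounts to comparing each service's expected tables against the tables that actually exist. The sketch below shows that comparison in isolation. The helper name `verify_tables` and the `alembic_version` table are hypothetical; the result keys mirror the field names used in the response examples in this document, and how the shared module actually queries the database is not shown here.

```python
from typing import Dict, Iterable, List


def verify_tables(expected: Iterable[str], existing: Iterable[str]) -> Dict[str, object]:
    """Split the expected tables into those found and those missing."""
    existing_set = set(existing)
    expected_list = list(expected)
    verified: List[str] = [t for t in expected_list if t in existing_set]
    missing: List[str] = [t for t in expected_list if t not in existing_set]
    return {
        "tables_exist": not missing,
        "tables_verified": verified,
        "missing_tables": missing,
    }


result = verify_tables(
    expected=["ingredients", "stock", "stock_movements"],
    # Extra tables in the database (here a hypothetical migration table) are ignored.
    existing=["ingredients", "stock", "alembic_version"],
)
print(result["missing_tables"])  # ['stock_movements']
```
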
### Code Integration

#### Basic Setup

```python
from shared.monitoring.health_checks import setup_fastapi_health_checks

# Setup unified health checks
health_manager = setup_fastapi_health_checks(
    app=app,
    service_name="my-service",
    version="1.0.0",
    database_manager=database_manager,
    expected_tables=["table1", "table2"],
    custom_checks={"custom_check": custom_check_function}
)
```

#### With Custom Checks

```python
async def custom_health_check():
    """Custom health check function"""
    return await some_service_check()

health_manager = setup_fastapi_health_checks(
    app=app,
    service_name="my-service",
    version="1.0.0",
    database_manager=database_manager,
    expected_tables=["table1", "table2"],
    custom_checks={"external_service": custom_health_check}
)
```

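
A custom check that calls an external dependency can hang, which would stall the readiness probe. One possible safeguard is a timeout wrapper, sketched below. The wrapper `with_timeout` and the `slow_check` function are hypothetical, and the sketch assumes the health manager calls each check with no arguments and treats a falsy result as a failure, which this document does not confirm.

```python
import asyncio
from typing import Awaitable, Callable


def with_timeout(
    check: Callable[[], Awaitable[bool]], seconds: float
) -> Callable[[], Awaitable[bool]]:
    """Wrap an async check so a hang or an exception is reported as a failure."""

    async def guarded() -> bool:
        try:
            return bool(await asyncio.wait_for(check(), timeout=seconds))
        except Exception:
            # Treat timeouts and errors alike: the dependency is not healthy.
            return False

    return guarded


async def slow_check() -> bool:
    # Hypothetical dependency check that takes too long to answer.
    await asyncio.sleep(1.0)
    return True


print(asyncio.run(with_timeout(slow_check, seconds=0.05)()))  # False
```

Under those assumptions, the wrapped function could be registered as `custom_checks={"external_service": with_timeout(custom_health_check, 2.0)}`.
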
#### Service Ready State

```python
from contextlib import asynccontextmanager

from fastapi import FastAPI


# In your lifespan function
@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup logic
    await initialize_service()

    # Mark service as ready
    app.state.ready = True

    yield

    # Shutdown logic
```

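
On the reading side, a health check has to cope with the flag not being set yet, because `app.state.ready` does not exist until startup completes. The sketch below shows one defensive way to read it. The helper `is_app_ready` is hypothetical, and `SimpleNamespace` stands in for a FastAPI application so the example runs without FastAPI installed.

```python
from types import SimpleNamespace


def is_app_ready(app) -> bool:
    """Return True only if startup explicitly set app.state.ready to True."""
    return getattr(app.state, "ready", False) is True


# SimpleNamespace stands in for a FastAPI app in this sketch.
app = SimpleNamespace(state=SimpleNamespace())
print(is_app_ready(app))  # False: startup has not finished yet

app.state.ready = True
print(is_app_ready(app))  # True
```
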
## Kubernetes Configuration

### Updated Probe Configuration

The microservice template and specific service configurations have been updated to use the new endpoints:

```yaml
livenessProbe:
  httpGet:
    path: /health/live
    port: 8000
  initialDelaySeconds: 30
  timeoutSeconds: 5
  periodSeconds: 10
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8000
  initialDelaySeconds: 15
  timeoutSeconds: 3
  periodSeconds: 5
  failureThreshold: 5
```

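
To get a feel for what these values imply, the following sketch estimates how long a container that fails every probe runs before Kubernetes acts on it. It is a rough approximation only: it adds the initial delay to `periodSeconds * failureThreshold` and ignores probe timeouts and scheduling jitter. The `Probe` class is a hypothetical helper for the arithmetic, not part of the platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    initial_delay_seconds: int
    period_seconds: int
    failure_threshold: int

    def seconds_until_action(self) -> int:
        """Rough time from container start until the failure threshold is hit,
        assuming every probe fails."""
        return self.initial_delay_seconds + self.period_seconds * self.failure_threshold


liveness = Probe(initial_delay_seconds=30, period_seconds=10, failure_threshold=3)
readiness = Probe(initial_delay_seconds=15, period_seconds=5, failure_threshold=5)

print(liveness.seconds_until_action())   # 60
print(readiness.seconds_until_action())  # 40
```

With these settings, a service that needs more than about a minute to start answering `/health/live` risks being restarted before it finishes starting.
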
### Key Changes from Previous Configuration

1. **Liveness Probe**: Now uses `/health/live` instead of `/health`
2. **Readiness Probe**: Now uses `/health/ready` instead of `/health`
3. **Improved Timing**: Adjusted timeouts and failure thresholds for better reliability
4. **Separate Concerns**: Liveness and readiness are now properly separated

## Health Check Response Examples

### Basic Health Check Response

```json
{
  "status": "healthy",
  "service": "training-service",
  "version": "1.0.0",
  "timestamp": "2025-01-27T10:30:00Z"
}
```

### Readiness Check Response (Ready)

```json
{
  "status": "ready",
  "checks": {
    "application": true,
    "database_connectivity": true,
    "database_tables": true
  },
  "database": {
    "status": "healthy",
    "tables_verified": ["model_training_logs", "trained_models"],
    "missing_tables": [],
    "errors": []
  }
}
```

### Database Health Response

```json
{
  "status": "healthy",
  "connectivity": true,
  "tables_exist": true,
  "tables_verified": ["model_training_logs", "trained_models"],
  "missing_tables": [],
  "errors": [],
  "connection_info": {
    "service_name": "training-service",
    "database_type": "postgresql",
    "pool_size": 20,
    "current_checked_out": 2
  },
  "response_time_ms": 15.23
}
```

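
A monitoring script can condense the database health payload into a single line. The sketch below parses a trimmed copy of the example response and reports pool utilization as `current_checked_out / pool_size`. The `summarize` helper is hypothetical, and the field names are taken from the example above rather than from a published schema.

```python
import json

payload = json.loads("""
{
  "status": "healthy",
  "missing_tables": [],
  "connection_info": {"pool_size": 20, "current_checked_out": 2},
  "response_time_ms": 15.23
}
""")


def summarize(db_health: dict) -> str:
    """Build a one-line summary from a /health/database style payload."""
    info = db_health["connection_info"]
    utilization = info["current_checked_out"] / info["pool_size"]
    missing = ", ".join(db_health["missing_tables"]) or "none"
    return (
        f"{db_health['status']}: pool {utilization:.0%} in use, "
        f"missing tables: {missing}, {db_health['response_time_ms']} ms"
    )


print(summarize(payload))
# healthy: pool 10% in use, missing tables: none, 15.23 ms
```
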
## Testing

### Manual Testing

```bash
# Test all endpoints for a running service
curl http://localhost:8000/health
curl http://localhost:8000/health/ready
curl http://localhost:8000/health/live
curl http://localhost:8000/health/database
```

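
The same manual check can be scripted with the Python standard library. The sketch below is an illustration and is not the repository's own test script. The helpers `fetch_status` and `check_all` are hypothetical names. An HTTP error such as 503 is reported as a status code rather than raised, and an unreachable service yields `None`.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Optional

ENDPOINTS = ["/health", "/health/ready", "/health/live", "/health/database"]


def fetch_status(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status code for url, or None if the service is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # e.g. 503 from a failing readiness check
    except (urllib.error.URLError, OSError):
        return None


def check_all(
    base_url: str, fetch: Callable[[str], Optional[int]] = fetch_status
) -> Dict[str, Optional[int]]:
    """Probe every health endpoint and map each path to its status code."""
    return {path: fetch(base_url + path) for path in ENDPOINTS}
```

Calling `check_all("http://localhost:8000")` against a running service returns a dictionary such as `{"/health": 200, ...}`. The `fetch` parameter exists so the function can be exercised without a live server.
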
### Automated Testing

Use the provided test script:

```bash
python test_unified_health_checks.py
```

## Migration Guide

### For Existing Services

1. **Add Health Check Import**:

   ```python
   from shared.monitoring.health_checks import setup_fastapi_health_checks
   ```

2. **Add Database Manager Import** (if using shared database):

   ```python
   from app.core.database import database_manager
   ```

3. **Setup Health Checks** (after app creation, before router inclusion):

   ```python
   health_manager = setup_fastapi_health_checks(
       app=app,
       service_name="your-service-name",
       version=settings.VERSION,
       database_manager=database_manager,
       expected_tables=["table1", "table2"]
   )
   ```

4. **Remove Old Health Endpoints**:
   Remove any existing `@app.get("/health")` endpoints

5. **Add Ready State Management**:

   ```python
   # In lifespan function after successful startup
   app.state.ready = True
   ```

6. **Update Kubernetes Configuration**:
   Update deployment YAML to use new probe endpoints

### For Services Using Legacy Database

If your service doesn't use the shared database manager:

```python
async def legacy_database_check():
    """Custom health check for legacy database"""
    return await your_db_health_check()

health_manager = setup_fastapi_health_checks(
    app=app,
    service_name="your-service",
    version=settings.VERSION,
    database_manager=None,
    expected_tables=None,
    custom_checks={"legacy_database": legacy_database_check}
)
```

## Benefits

1. **Consistency**: All services now provide the same health check interface
2. **Better Kubernetes Integration**: Proper separation of liveness and readiness concerns
3. **Enhanced Debugging**: Detailed health information for troubleshooting
4. **Database Verification**: Comprehensive database health checks including table verification
5. **Monitoring Ready**: Rich health status information for monitoring systems
6. **Maintainability**: Centralized health check logic reduces code duplication

## Future Enhancements

1. **Metrics Integration**: Add Prometheus metrics for health check performance
2. **Circuit Breaker**: Implement circuit breaker pattern for external service checks
3. **Health Check Dependencies**: Add dependency health checks between services
4. **Performance Thresholds**: Add configurable performance thresholds for health checks
5. **Health Check Scheduling**: Add scheduled background health checks