Refactor all main.py
281
HEALTH_CHECKS.md
Normal file
@@ -0,0 +1,281 @@
# Unified Health Check System

This document describes the unified health check system implemented across all microservices in the bakery-ia platform.

## Overview

The unified health check system provides standardized health monitoring endpoints across all services, with comprehensive database verification, Kubernetes integration, and detailed health reporting.

## Key Features

- **Standardized Endpoints**: All services now provide the same health check endpoints
- **Database Verification**: Comprehensive database health checks including table existence verification
- **Kubernetes Integration**: Proper separation of liveness and readiness probes
- **Detailed Reporting**: Rich health status information for debugging and monitoring
- **App State Integration**: Health checks automatically detect service ready state

## Health Check Endpoints

### `/health` - Basic Health Check
- **Purpose**: Basic service health status
- **Use Case**: General health monitoring, API gateways
- **Response**: Service name, version, status, and timestamp
- **Status Codes**: 200 (healthy/starting)

### `/health/ready` - Kubernetes Readiness Probe
- **Purpose**: Indicates if service is ready to receive traffic
- **Use Case**: Kubernetes readiness probe, load balancer health checks
- **Checks**: Application state, database connectivity, table verification, custom checks
- **Status Codes**: 200 (ready), 503 (not ready)

### `/health/live` - Kubernetes Liveness Probe
- **Purpose**: Indicates if service is alive and should not be restarted
- **Use Case**: Kubernetes liveness probe
- **Response**: Simple alive status
- **Status Codes**: 200 (alive)

### `/health/database` - Detailed Database Health
- **Purpose**: Comprehensive database health information for debugging
- **Use Case**: Database monitoring, troubleshooting
- **Checks**: Connectivity, table existence, connection pool status, response times
- **Status Codes**: 200 (healthy), 503 (unhealthy)

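The readiness rules listed for `/health/ready` can be sketched as a small pure function. This is only an illustration of the 200/503 decision, not the code in `shared.monitoring.health_checks`: the function name `evaluate_readiness`, its parameters, and the `"not_ready"` status string are assumptions made for this example.

```python
from typing import Dict, Tuple


def evaluate_readiness(
    app_ready: bool,
    database_ok: bool,
    tables_ok: bool,
    custom_results: Dict[str, bool],
) -> Tuple[int, dict]:
    """Combine individual check results into an HTTP status code and body."""
    checks = {
        "application": app_ready,
        "database_connectivity": database_ok,
        "database_tables": tables_ok,
        **custom_results,  # e.g. {"external_service": True}
    }
    # The service is ready only if every single check passed.
    ready = all(checks.values())
    # "not_ready" is a hypothetical label; the document only shows "ready".
    body = {"status": "ready" if ready else "not_ready", "checks": checks}
    return (200 if ready else 503), body
```

With every check passing, the function returns status 200; a single failing check (for example a missing table) turns the result into 503, which is what tells Kubernetes to stop routing traffic to the pod.
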
## Implementation

### Services Updated

The following services have been updated to use the unified health check system:

1. **Training Service** (`training-service`)
   - Full implementation with database manager integration
   - Table verification for ML training tables
   - Expected tables: `model_training_logs`, `trained_models`, `model_performance_metrics`, `training_job_queue`, `model_artifacts`

2. **Orders Service** (`orders-service`)
   - Legacy database integration with custom health checks
   - Expected tables: `customers`, `customer_contacts`, `customer_orders`, `order_items`, `order_status_history`, `procurement_plans`, `procurement_requirements`

3. **Inventory Service** (`inventory-service`)
   - Full database manager integration
   - Food safety and inventory table verification
   - Expected tables: `ingredients`, `stock`, `stock_movements`, `product_transformations`, `stock_alerts`, `food_safety_compliance`, `temperature_logs`, `food_safety_alerts`

### Code Integration

#### Basic Setup
```python
from shared.monitoring.health_checks import setup_fastapi_health_checks

# Setup unified health checks
health_manager = setup_fastapi_health_checks(
    app=app,
    service_name="my-service",
    version="1.0.0",
    database_manager=database_manager,
    expected_tables=['table1', 'table2'],
    custom_checks={"custom_check": custom_check_function}
)
```

#### With Custom Checks
```python
async def custom_health_check():
    """Custom health check function"""
    return await some_service_check()

health_manager = setup_fastapi_health_checks(
    app=app,
    service_name="my-service",
    version="1.0.0",
    database_manager=database_manager,
    expected_tables=['table1', 'table2'],
    custom_checks={"external_service": custom_health_check}
)
```

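How the entries of `custom_checks` are evaluated is not shown in this document. One plausible approach, sketched below under that caveat, is to call each check, await it if it returns an awaitable, and treat any exception as a failed check. The name `run_custom_checks` and the `CheckFn` alias are hypothetical.

```python
import asyncio
import inspect
from typing import Awaitable, Callable, Dict, Union

# A check may be a plain function or a coroutine function (assumption).
CheckFn = Callable[[], Union[bool, Awaitable[bool]]]


async def run_custom_checks(checks: Dict[str, CheckFn]) -> Dict[str, bool]:
    """Run every named check and map its name to a pass/fail boolean."""
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            outcome = check()
            # Coroutine functions return an awaitable that must be awaited.
            if inspect.isawaitable(outcome):
                outcome = await outcome
            results[name] = bool(outcome)
        except Exception:
            # A check that raises is reported as failed rather than
            # crashing the health endpoint itself.
            results[name] = False
    return results
```

Running the checks sequentially keeps the sketch short; a real implementation might prefer `asyncio.gather` so that one slow dependency does not delay the others.
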
#### Service Ready State
```python
from contextlib import asynccontextmanager

from fastapi import FastAPI


# In your lifespan function
@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup logic
    await initialize_service()

    # Mark service as ready
    app.state.ready = True

    yield

    # Shutdown logic
```

## Kubernetes Configuration

### Updated Probe Configuration

The microservice template and specific service configurations have been updated to use the new endpoints:

```yaml
livenessProbe:
  httpGet:
    path: /health/live
    port: 8000
  initialDelaySeconds: 30
  timeoutSeconds: 5
  periodSeconds: 10
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8000
  initialDelaySeconds: 15
  timeoutSeconds: 3
  periodSeconds: 5
  failureThreshold: 5
```

### Key Changes from Previous Configuration

1. **Liveness Probe**: Now uses `/health/live` instead of `/health`
2. **Readiness Probe**: Now uses `/health/ready` instead of `/health`
3. **Improved Timing**: Adjusted timeouts and failure thresholds for better reliability
4. **Separate Concerns**: Liveness and readiness are now properly separated

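To get a feel for the probe timings in the YAML, the sketch below estimates how long a persistent failure takes to be acted on: Kubernetes reacts after `failureThreshold` consecutive failed probes, so the window is roughly `periodSeconds * failureThreshold`. This is a back-of-the-envelope estimate that ignores `timeoutSeconds` and scheduling jitter; the `Probe` class is a hypothetical helper, not part of the platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    initial_delay_seconds: int
    period_seconds: int
    failure_threshold: int

    def detection_window_seconds(self) -> int:
        """Approximate time for consecutive failures to reach the threshold."""
        return self.period_seconds * self.failure_threshold


# Values copied from the probe configuration in this document.
liveness = Probe(initial_delay_seconds=30, period_seconds=10, failure_threshold=3)
readiness = Probe(initial_delay_seconds=15, period_seconds=5, failure_threshold=5)

print(liveness.detection_window_seconds())   # 30
print(readiness.detection_window_seconds())  # 25
```

Under these assumptions a hung container is restarted after about 30 seconds of failed liveness probes, while a pod that cannot reach its database is taken out of rotation after about 25 seconds.
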
## Health Check Response Examples

### Basic Health Check Response
```json
{
  "status": "healthy",
  "service": "training-service",
  "version": "1.0.0",
  "timestamp": "2025-01-27T10:30:00Z"
}
```

### Readiness Check Response (Ready)
```json
{
  "status": "ready",
  "checks": {
    "application": true,
    "database_connectivity": true,
    "database_tables": true
  },
  "database": {
    "status": "healthy",
    "tables_verified": ["model_training_logs", "trained_models"],
    "missing_tables": [],
    "errors": []
  }
}
```

### Database Health Response
```json
{
  "status": "healthy",
  "connectivity": true,
  "tables_exist": true,
  "tables_verified": ["model_training_logs", "trained_models"],
  "missing_tables": [],
  "errors": [],
  "connection_info": {
    "service_name": "training-service",
    "database_type": "postgresql",
    "pool_size": 20,
    "current_checked_out": 2
  },
  "response_time_ms": 15.23
}
```

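The `tables_verified` and `missing_tables` fields suggest a comparison between the expected table list and what the database actually contains. The sketch below shows that comparison against an in-memory SQLite database so that it runs without external services. The services described here use PostgreSQL, so the catalog query would differ there, and the function name `verify_tables` is an assumption.

```python
import sqlite3
from typing import Iterable


def verify_tables(conn: sqlite3.Connection, expected_tables: Iterable[str]) -> dict:
    """Report which expected tables exist and which are missing."""
    # sqlite_master is SQLite's catalog; PostgreSQL would need a different query.
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    existing = {row[0] for row in rows}

    expected = list(expected_tables)
    verified = [t for t in expected if t in existing]
    missing = [t for t in expected if t not in existing]
    return {
        "status": "healthy" if not missing else "unhealthy",
        "tables_exist": not missing,
        "tables_verified": verified,
        "missing_tables": missing,
    }
```

If `trained_models` exists but `model_training_logs` does not, the report lists the former under `tables_verified`, the latter under `missing_tables`, and sets the status to `unhealthy`.
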
## Testing

### Manual Testing
```bash
# Test all endpoints for a running service
curl http://localhost:8000/health
curl http://localhost:8000/health/ready
curl http://localhost:8000/health/live
curl http://localhost:8000/health/database
```

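The same manual check can be scripted. The sketch below is not the `test_unified_health_checks.py` script, whose contents are not shown here; it is a minimal standard-library alternative that requests the four endpoints and records their status codes. The names `fetch_status` and `probe_endpoints` are hypothetical.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict

HEALTH_PATHS = ("/health", "/health/ready", "/health/live", "/health/database")


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # urllib raises on 4xx/5xx; a 503 from /health/ready is still a
        # valid answer that should be reported rather than raised.
        return exc.code


def probe_endpoints(
    base_url: str, fetch: Callable[[str], int] = fetch_status
) -> Dict[str, int]:
    """Map each health path to the status code returned by the service."""
    root = base_url.rstrip("/")
    return {path: fetch(root + path) for path in HEALTH_PATHS}
```

Calling `probe_endpoints("http://localhost:8000")` against a running service returns a dictionary keyed by path. The `fetch` parameter exists so the function can be exercised with a stub in unit tests, without a live server.
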
### Automated Testing
Use the provided test script:
```bash
python test_unified_health_checks.py
```

## Migration Guide

### For Existing Services

1. **Add Health Check Import**:
   ```python
   from shared.monitoring.health_checks import setup_fastapi_health_checks
   ```

2. **Add Database Manager Import** (if using shared database):
   ```python
   from app.core.database import database_manager
   ```

3. **Setup Health Checks** (after app creation, before router inclusion):
   ```python
   health_manager = setup_fastapi_health_checks(
       app=app,
       service_name="your-service-name",
       version=settings.VERSION,
       database_manager=database_manager,
       expected_tables=["table1", "table2"]
   )
   ```

4. **Remove Old Health Endpoints**:
   Remove any existing `@app.get("/health")` endpoints.

5. **Add Ready State Management**:
   ```python
   # In lifespan function after successful startup
   app.state.ready = True
   ```

6. **Update Kubernetes Configuration**:
   Update the deployment YAML to use the new probe endpoints.

### For Services Using Legacy Database

If your service doesn't use the shared database manager:

```python
async def legacy_database_check():
    """Custom health check for legacy database"""
    return await your_db_health_check()

health_manager = setup_fastapi_health_checks(
    app=app,
    service_name="your-service",
    version=settings.VERSION,
    database_manager=None,
    expected_tables=None,
    custom_checks={"legacy_database": legacy_database_check}
)
```

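Because the readiness probe in this document allows only 3 seconds per request, a legacy check that hangs on a slow connection could make the whole probe time out. One way to guard against that, sketched here as an assumption rather than as behavior of the shared library, is to wrap the check in `asyncio.wait_for` and report a timeout as a failure. The helper name `with_timeout` is hypothetical.

```python
import asyncio
from typing import Awaitable, Callable


def with_timeout(
    check: Callable[[], Awaitable[bool]], seconds: float
) -> Callable[[], Awaitable[bool]]:
    """Wrap an async check so it fails fast instead of hanging."""

    async def wrapped() -> bool:
        try:
            # wait_for cancels the check and raises if it exceeds the limit.
            return bool(await asyncio.wait_for(check(), timeout=seconds))
        except Exception:
            # Timeouts and errors inside the check both count as unhealthy.
            return False

    return wrapped
```

A wrapped check could then be registered in place of the raw one, for example `custom_checks={"legacy_database": with_timeout(legacy_database_check, 2.0)}`, where the 2-second budget is an illustrative value chosen to stay under the probe timeout.
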
## Benefits

1. **Consistency**: All services now provide the same health check interface
2. **Better Kubernetes Integration**: Proper separation of liveness and readiness concerns
3. **Enhanced Debugging**: Detailed health information for troubleshooting
4. **Database Verification**: Comprehensive database health checks including table verification
5. **Monitoring Ready**: Rich health status information for monitoring systems
6. **Maintainability**: Centralized health check logic reduces code duplication

## Future Enhancements

1. **Metrics Integration**: Add Prometheus metrics for health check performance
2. **Circuit Breaker**: Implement circuit breaker pattern for external service checks
3. **Health Check Dependencies**: Add dependency health checks between services
4. **Performance Thresholds**: Add configurable performance thresholds for health checks
5. **Health Check Scheduling**: Add scheduled background health checks