# Production Planning Scheduler Runbook
**Quick Reference Guide for DevOps & Support Teams**

---
## Quick Links
- [Full Documentation](./PRODUCTION_PLANNING_SYSTEM.md)
- [Metrics Dashboard](http://grafana:3000/d/production-planning)
- [Logs](http://kibana:5601)
- [Alerts](http://alertmanager:9093)
---
## Emergency Contacts
| Role | Contact | Availability |
|------|---------|--------------|
| **Backend Lead** | #backend-team | 24/7 |
| **DevOps On-Call** | #devops-oncall | 24/7 |
| **Product Owner** | TBD | Business hours |
---
## Scheduler Overview
| Scheduler | Time | What It Does |
|-----------|------|--------------|
| **Production** | 5:30 AM (tenant timezone) | Creates daily production schedules |
| **Procurement** | 6:00 AM (tenant timezone) | Creates daily procurement plans |

**Critical:** Both schedulers MUST complete successfully every morning, or bakeries won't have production/procurement plans for the day!

---
## Common Incidents & Solutions
### 🔴 CRITICAL: Scheduler Completely Failed
**Alert:** `SchedulerUnhealthy` or `NoProductionSchedulesGenerated`

**Impact:** HIGH - No plans generated for any tenant

**Immediate Actions (< 5 minutes):**
```bash
# 1. Check if service is running
kubectl get pods -n production | grep production-service
kubectl get pods -n orders | grep orders-service
# 2. Check recent logs for errors
kubectl logs -n production deployment/production-service --tail=100 | grep ERROR
kubectl logs -n orders deployment/orders-service --tail=100 | grep ERROR
# 3. Restart service if frozen/crashed
kubectl rollout restart deployment/production-service -n production
kubectl rollout restart deployment/orders-service -n orders
# 4. Wait 2 minutes for scheduler to initialize, then manually trigger
curl -X POST http://production-service:8000/test/production-scheduler \
-H "Authorization: Bearer $ADMIN_TOKEN"
curl -X POST http://orders-service:8000/test/procurement-scheduler \
-H "Authorization: Bearer $ADMIN_TOKEN"
```
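Step 4 above relies on a fixed two-minute wait. As a hedged alternative, the manual trigger can be wrapped in a small retry loop. `retry` is a hypothetical helper, not something shipped with the services, and the attempt count and delay in the example are illustrative.

```bash
# Hypothetical helper: run a command up to N times, sleeping between attempts.
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() {
  local max_attempts=$1 delay=$2 attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "retry: giving up after ${attempt} attempts: $*" >&2
      return 1
    fi
    echo "retry: attempt ${attempt} failed, sleeping ${delay}s" >&2
    sleep "$delay"
    attempt=$(( attempt + 1 ))
  done
}

# Example (assumes the endpoint answers with a non-2xx status until the
# scheduler is ready, so that curl --fail exits non-zero):
# retry 6 20 curl --fail --silent --show-error -X POST \
#   http://production-service:8000/test/production-scheduler \
#   -H "Authorization: Bearer $ADMIN_TOKEN"
```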
**Follow-up Actions:**
- Check RabbitMQ health (leader election depends on it)
- Review database connectivity
- Check resource limits (CPU/memory)
- Monitor metrics for successful generation
---
### 🟠 HIGH: Single Tenant Failed
**Alert:** `DailyProductionPlanningFailed{tenant_id="abc-123"}`

**Impact:** MEDIUM - One bakery affected

**Immediate Actions (< 10 minutes):**
```bash
# 1. Check logs for specific tenant
kubectl logs -n production deployment/production-service --tail=500 | \
grep "tenant_id=abc-123" | grep ERROR
# 2. Common causes:
# - Tenant database connection issue
# - External service timeout (Forecasting, Inventory)
# - Invalid data (e.g., missing products)
# 3. Manually retry. The trigger runs for all tenants, but the scheduler
#    skips tenants that already have schedules, so only tenants without
#    a schedule are reprocessed.
curl -X POST http://production-service:8000/test/production-scheduler \
-H "Authorization: Bearer $ADMIN_TOKEN"
# 4. If still failing, check tenant-specific issues:
# - Verify tenant exists and is active
# - Check tenant's inventory has products
# - Check forecasting service can access tenant data
```
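The two chained `grep` calls in step 1 can be wrapped in a reusable filter. This is a minimal sketch: `tenant_errors` is a hypothetical name, and it assumes log lines carry the tenant as `tenant_id=<id>`, as the command above implies.

```bash
# Hypothetical helper: read log lines on stdin, print ERROR lines for one tenant.
# -F matches the tenant ID literally (no regex surprises); -w stops
# "abc-123" from also matching "abc-1234".
tenant_errors() {
  local tenant_id=$1
  grep -Fw "tenant_id=${tenant_id}" | grep -F "ERROR"
}

# Example:
# kubectl logs -n production deployment/production-service --tail=500 | \
#   tenant_errors abc-123
```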
**Follow-up Actions:**
- Contact tenant to understand their setup
- Review tenant data quality
- Check if tenant is new (may need initial setup)
---
### 🟡 MEDIUM: Scheduler Running Slow
**Alert:** `production_schedule_generation_duration_seconds > 120s`

**Impact:** LOW - Scheduler completes but takes longer than expected

**Immediate Actions (< 15 minutes):**
```bash
# 1. Check current execution time
kubectl logs -n production deployment/production-service --tail=100 | \
grep "production planning completed"
# 2. Check database query performance
# Look for slow query logs in PostgreSQL
# 3. Check external service response times
# - Forecasting Service health: curl http://forecasting-service:8000/health
# - Inventory Service health: curl http://inventory-service:8000/health
# - Orders Service health: curl http://orders-service:8000/health
# 4. Check CPU/memory usage
kubectl top pods -n production | grep production-service
kubectl top pods -n orders | grep orders-service
```
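To put numbers on step 3, the health endpoints can be timed with curl's `--write-out` variables. The sketch below also classifies a generation time against the thresholds this runbook uses elsewhere (60s target, 120s alert). Both function names are hypothetical, and the 10-second `--max-time` is an arbitrary choice.

```bash
# Print HTTP status and total response time for one health endpoint.
probe() {
  curl --silent --output /dev/null --max-time 10 \
    --write-out '%{http_code} %{time_total}s\n' "$1"
}

# Classify a generation time in seconds: under 60 is OK,
# 60 to 120 is SLOW, above 120 matches the alert condition.
classify_duration() {
  awk -v s="$1" 'BEGIN {
    if (s > 120)      print "ALERT"
    else if (s >= 60) print "SLOW"
    else              print "OK"
  }'
}

# Example:
# for svc in forecasting-service inventory-service orders-service; do
#   printf '%s: ' "$svc"; probe "http://$svc:8000/health"
# done
# classify_duration 95
```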
**Follow-up Actions:**
- Consider increasing timeout if consistently near limit
- Optimize slow database queries
- Scale external services if overloaded
- Review tenant count (may need to process fewer in parallel)
---
### 🟡 MEDIUM: Low Forecast Cache Hit Rate
**Alert:** `ForecastCacheHitRateLow < 50%`

**Impact:** LOW - Increased load on Forecasting Service, slower responses

**Immediate Actions (< 10 minutes):**
```bash
# 1. Check Redis is running
kubectl get pods -n redis | grep redis
redis-cli ping # Should return PONG
# 2. Check cache statistics
curl http://forecasting-service:8000/api/v1/{tenant_id}/forecasting/cache/stats \
-H "Authorization: Bearer $ADMIN_TOKEN"
# 3. Check cache keys
redis-cli --scan --pattern "forecast:*" | wc -l # Should have many entries (SCAN does not block Redis the way KEYS does)
# 4. Check Redis memory
redis-cli INFO memory | grep used_memory_human
# 5. If cache is empty or Redis is down, restart Redis
kubectl rollout restart statefulset/redis -n redis
```
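As a cross-check on the service's own statistics, Redis reports server-wide `keyspace_hits` and `keyspace_misses` in `INFO stats`. The sketch below turns those into a percentage. `redis_hit_rate` is a hypothetical helper, and the figure covers every key in the instance, not only `forecast:*` entries, so treat it as a rough indicator.

```bash
# Read `redis-cli INFO stats` output on stdin and print the server-wide hit rate.
# INFO lines end in CRLF, so carriage returns are stripped first.
redis_hit_rate() {
  tr -d '\r' | awk -F: '
    $1 == "keyspace_hits"   { hits = $2 }
    $1 == "keyspace_misses" { misses = $2 }
    END {
      total = hits + misses
      if (total == 0) print "n/a"
      else printf "%.1f%%\n", 100 * hits / total
    }'
}

# Example:
# redis-cli INFO stats | redis_hit_rate
```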
**Follow-up Actions:**
- Monitor cache rebuild (should hit ~80-90% within 1 day)
- Check Redis configuration (memory limits, eviction policy)
- Review forecast TTL settings
- Check for cache invalidation bugs
---
### 🟢 LOW: Plan Rejected by User
**Alert:** `procurement_plan_rejections_total` increasing

**Impact:** LOW - Normal user workflow

**Actions (< 5 minutes):**
```bash
# 1. Check rejection logs for patterns
kubectl logs -n orders deployment/orders-service --tail=200 | \
grep "plan rejection"
# 2. Check if auto-regeneration triggered
kubectl logs -n orders deployment/orders-service --tail=200 | \
grep "Auto-regenerating plan"
# 3. Verify rejection notification sent
# Check RabbitMQ queue: procurement.plan.rejected
# 4. If rejection notes mention "stale" or "outdated", plan will auto-regenerate
# Otherwise, user needs to manually regenerate or modify plan
```
**Follow-up Actions:**
- Review rejection reasons for trends
- Consider user training if many rejections
- Improve plan accuracy if consistent issues
---
## Health Check Commands
### Quick Service Health Check
```bash
# Production Service
curl http://production-service:8000/health | jq .
# Orders Service
curl http://orders-service:8000/health | jq .
# Forecasting Service
curl http://forecasting-service:8000/health | jq .
# Redis
redis-cli ping
# RabbitMQ
curl http://rabbitmq:15672/api/health/checks/alarms \
-u guest:guest | jq .
```
### Detailed Scheduler Status
```bash
# Check last scheduler run time
curl http://production-service:8000/health | \
jq '.custom_checks.scheduler_service'
# Check APScheduler job status (requires internal access)
# Look for: scheduler.get_jobs() output in logs
kubectl logs -n production deployment/production-service | \
grep "scheduled jobs configured"
```
### Database Connectivity
```bash
# Check production database
kubectl exec -it deployment/production-service -n production -- \
python -c "from app.core.database import database_manager; \
import asyncio; \
print(asyncio.run(database_manager.health_check()))"
# Check orders database
kubectl exec -it deployment/orders-service -n orders -- \
python -c "from app.core.database import database_manager; \
import asyncio; \
print(asyncio.run(database_manager.health_check()))"
```
---
## Maintenance Procedures
### Disable Schedulers (Maintenance Mode)
```bash
# 1. Set environment variable to disable schedulers
kubectl set env deployment/production-service SCHEDULER_DISABLED=true -n production
kubectl set env deployment/orders-service SCHEDULER_DISABLED=true -n orders
# 2. Wait for pods to restart
kubectl rollout status deployment/production-service -n production
kubectl rollout status deployment/orders-service -n orders
# 3. Verify schedulers are disabled (check logs)
kubectl logs -n production deployment/production-service | grep "Scheduler disabled"
```
### Re-enable Schedulers (After Maintenance)
```bash
# 1. Remove environment variable
kubectl set env deployment/production-service SCHEDULER_DISABLED- -n production
kubectl set env deployment/orders-service SCHEDULER_DISABLED- -n orders
# 2. Wait for pods to restart
kubectl rollout status deployment/production-service -n production
kubectl rollout status deployment/orders-service -n orders
# 3. Manually trigger to catch up (if during scheduled time)
curl -X POST http://production-service:8000/test/production-scheduler \
-H "Authorization: Bearer $ADMIN_TOKEN"
curl -X POST http://orders-service:8000/test/procurement-scheduler \
-H "Authorization: Bearer $ADMIN_TOKEN"
```
### Clear Forecast Cache
```bash
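# Clear all forecast cache (will rebuild automatically)
# --scan avoids blocking Redis the way KEYS does; -r (GNU xargs) skips DEL when nothing matches
redis-cli --scan --pattern "forecast:*" | xargs -r redis-cli DEL
# Clear specific tenant's cache
redis-cli --scan --pattern "forecast:{tenant_id}:*" | xargs -r redis-cli DEL
# Verify cache cleared (should print 0; DBSIZE would count non-forecast keys too)
redis-cli --scan --pattern "forecast:*" | wc -l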
```
---
## Metrics to Monitor
### Production Scheduler
```promql
# Success rate (should be > 95%)
rate(production_schedules_generated_total{status="success"}[5m]) /
rate(production_schedules_generated_total[5m])
# 95th percentile generation time (should be < 60s)
histogram_quantile(0.95,
rate(production_schedule_generation_duration_seconds_bucket[5m]))
# Failed tenants (should be 0)
increase(production_tenants_processed_total{status="failure"}[5m])
```
### Procurement Scheduler
```promql
# Success rate (should be > 95%)
rate(procurement_plans_generated_total{status="success"}[5m]) /
rate(procurement_plans_generated_total[5m])
# 95th percentile generation time (should be < 60s)
histogram_quantile(0.95,
rate(procurement_plan_generation_duration_seconds_bucket[5m]))
# Failed tenants (should be 0)
increase(procurement_tenants_processed_total{status="failure"}[5m])
```
### Forecast Cache
```promql
# Cache hit rate (should be > 70%)
forecast_cache_hit_rate
# Cache hits per minute
rate(forecast_cache_hits_total[5m])
# Cache misses per minute
rate(forecast_cache_misses_total[5m])
```
---
## Log Patterns to Watch
### Success Patterns
```
✅ "Daily production planning completed" - All tenants processed
✅ "Production schedule created successfully" - Individual tenant success
✅ "Forecast cache HIT" - Cache working correctly
✅ "Production scheduler service started" - Service initialized
```
### Warning Patterns
```
⚠️ "Tenant processing timed out" - Individual tenant taking too long
⚠️ "Forecast cache MISS" - Cache miss (expected some, but not all)
⚠️ "Approving plan older than 24 hours" - Stale plan being approved
⚠️ "Could not fetch tenant timezone" - Timezone configuration issue
```
### Error Patterns
```
❌ "Daily production planning failed completely" - Complete failure
❌ "Error processing tenant production" - Tenant-specific failure
❌ "Forecast cache Redis connection failed" - Cache unavailable
❌ "Migration version mismatch" - Database migration issue
❌ "Failed to publish event" - RabbitMQ connectivity issue
```
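The error and warning patterns above can be counted in one pass over a log stream. This is a sketch: `summarize_log_patterns` is a hypothetical helper, and it leaves out "Forecast cache MISS" because some misses are expected.

```bash
# Read log lines on stdin and count the runbook's error and warning patterns.
# Exit status is 1 when at least one error pattern was seen.
summarize_log_patterns() {
  awk '
    /Daily production planning failed completely/ ||
    /Error processing tenant production/ ||
    /Forecast cache Redis connection failed/ ||
    /Migration version mismatch/ ||
    /Failed to publish event/ { errors++; next }
    /Tenant processing timed out/ ||
    /Approving plan older than 24 hours/ ||
    /Could not fetch tenant timezone/ { warnings++ }
    END {
      printf "errors=%d warnings=%d\n", errors, warnings
      exit (errors > 0)
    }'
}

# Example:
# kubectl logs -n production deployment/production-service --tail=1000 | \
#   summarize_log_patterns
```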
---
## Escalation Procedure
### Level 1: DevOps On-Call (0-30 minutes)
- Check service health
- Review logs for obvious errors
- Restart services if crashed
- Manually trigger schedulers if needed
- Monitor for resolution
### Level 2: Backend Team (30-60 minutes)
- Investigate complex errors
- Check database issues
- Review scheduler logic
- Coordinate with other teams (if external service issue)
### Level 3: Engineering Lead (> 60 minutes)
- Major architectural issues
- Database corruption or loss
- Multi-service cascading failures
- Decisions on emergency fixes vs. scheduled maintenance
---
## Testing After Deployment
### Post-Deployment Checklist
```bash
# 1. Verify services are running
kubectl get pods -n production
kubectl get pods -n orders
# 2. Check health endpoints
curl http://production-service:8000/health
curl http://orders-service:8000/health
# 3. Verify schedulers are configured
kubectl logs -n production deployment/production-service | \
grep "scheduled jobs configured"
# 4. Manually trigger test run
curl -X POST http://production-service:8000/test/production-scheduler \
-H "Authorization: Bearer $ADMIN_TOKEN"
curl -X POST http://orders-service:8000/test/procurement-scheduler \
-H "Authorization: Bearer $ADMIN_TOKEN"
# 5. Verify test run completed successfully
kubectl logs -n production deployment/production-service | \
grep "Production schedule created successfully"
kubectl logs -n orders deployment/orders-service | \
grep "Procurement plan generated successfully"
# 6. Check metrics dashboard
# Visit: http://grafana:3000/d/production-planning
```
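The checklist can also be run as one script that reports each step. The sketch below shows one possible shape: `check` and `failures` are hypothetical names, and the example wiring covers only the health endpoints.

```bash
# Hypothetical helper: run one check, print PASS/FAIL, and count failures.
failures=0
check() {
  local label=$1
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS  $label"
  else
    echo "FAIL  $label"
    failures=$(( failures + 1 ))
  fi
}

# Example wiring:
# check "production health" curl --fail --silent http://production-service:8000/health
# check "orders health"     curl --fail --silent http://orders-service:8000/health
# [ "$failures" -eq 0 ] || echo "$failures check(s) failed"
```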
---
## Known Issues & Workarounds
### Issue: Scheduler runs twice in distributed setup
**Symptom:** Duplicate schedules/plans for same tenant and date

**Cause:** Leader election not working (RabbitMQ connection issue)

**Workaround:**
```bash
# Temporarily scale to single instance
kubectl scale deployment/production-service --replicas=1 -n production
kubectl scale deployment/orders-service --replicas=1 -n orders
# Fix RabbitMQ connectivity
# Then scale back up
kubectl scale deployment/production-service --replicas=3 -n production
kubectl scale deployment/orders-service --replicas=3 -n orders
```
### Issue: Timezone shows wrong time
**Symptom:** Schedules generated at wrong hour

**Cause:** Tenant timezone not configured or incorrect

**Workaround:**
```sql
-- Check tenant timezone
SELECT id, name, timezone FROM tenants WHERE id = '{tenant_id}';
-- Update if incorrect
UPDATE tenants SET timezone = 'Europe/Madrid' WHERE id = '{tenant_id}';
-- Verify server uses UTC
-- In container: date (should show UTC)
```
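When the container clock is UTC, it helps to know which UTC time a tenant's local run time maps to. A minimal sketch, assuming GNU `date` and the tz database are available in the container; `local_to_utc` is a hypothetical helper. For a tenant in Europe/Madrid during summer time (UTC+2), the 5:30 AM run corresponds to 03:30 UTC.

```bash
# Print the UTC time for a local wall-clock time in a given timezone.
# Relies on GNU date's TZ="..." prefix inside the --date string.
local_to_utc() {
  local tz=$1 day=$2 time=$3
  date -u -d "TZ=\"${tz}\" ${day} ${time}" '+%Y-%m-%d %H:%M UTC'
}

# Example:
# local_to_utc Europe/Madrid 2025-10-09 05:30
```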
### Issue: Forecast cache always misses
**Symptom:** `forecast_cache_hit_rate = 0%`

**Cause:** Redis not accessible or REDIS_URL misconfigured

**Workaround:**
```bash
# Check REDIS_URL environment variable
kubectl get deployment forecasting-service -n forecasting -o yaml | \
grep -A1 REDIS_URL # -A1 also prints the value line that follows the name
# Should be: redis://redis:6379/0
# If incorrect, update:
kubectl set env deployment/forecasting-service \
REDIS_URL=redis://redis:6379/0 -n forecasting
```
---
## Additional Resources
- **Full Documentation:** [PRODUCTION_PLANNING_SYSTEM.md](./PRODUCTION_PLANNING_SYSTEM.md)
- **Metrics File:** [`shared/monitoring/scheduler_metrics.py`](../shared/monitoring/scheduler_metrics.py)
- **Scheduler Code:**
  - Production: [`services/production/app/services/production_scheduler_service.py`](../services/production/app/services/production_scheduler_service.py)
  - Procurement: [`services/orders/app/services/procurement_scheduler_service.py`](../services/orders/app/services/procurement_scheduler_service.py)
- **Forecast Cache:** [`services/forecasting/app/services/forecast_cache.py`](../services/forecasting/app/services/forecast_cache.py)
---
**Runbook Version:** 1.0

**Last Updated:** 2025-10-09

**Maintained By:** Backend Team