# Production Planning Scheduler Runbook

**Quick Reference Guide for DevOps & Support Teams**

---

## Quick Links

- [Full Documentation](./PRODUCTION_PLANNING_SYSTEM.md)
- [Metrics Dashboard](http://grafana:3000/d/production-planning)
- [Logs](http://kibana:5601)
- [Alerts](http://alertmanager:9093)

---

## Emergency Contacts

| Role | Contact | Availability |
|------|---------|--------------|
| **Backend Lead** | #backend-team | 24/7 |
| **DevOps On-Call** | #devops-oncall | 24/7 |
| **Product Owner** | TBD | Business hours |

---

## Scheduler Overview

| Scheduler | Time | What It Does |
|-----------|------|--------------|
| **Production** | 5:30 AM (tenant timezone) | Creates daily production schedules |
| **Procurement** | 6:00 AM (tenant timezone) | Creates daily procurement plans |

**Critical:** Both schedulers MUST complete successfully every morning, or bakeries won't have production/procurement plans for the day!

---

## Common Incidents & Solutions

### 🔴 CRITICAL: Scheduler Completely Failed

**Alert:** `SchedulerUnhealthy` or `NoProductionSchedulesGenerated`

**Impact:** HIGH - No plans generated for any tenant

**Immediate Actions (< 5 minutes):**

```bash
# 1. Check if service is running
kubectl get pods -n production | grep production-service
kubectl get pods -n orders | grep orders-service

# 2. Check recent logs for errors
kubectl logs -n production deployment/production-service --tail=100 | grep ERROR
kubectl logs -n orders deployment/orders-service --tail=100 | grep ERROR

# 3. Restart service if frozen/crashed
kubectl rollout restart deployment/production-service -n production
kubectl rollout restart deployment/orders-service -n orders

# 4. Wait 2 minutes for scheduler to initialize, then manually trigger
curl -X POST http://production-service:8000/test/production-scheduler \
  -H "Authorization: Bearer $ADMIN_TOKEN"
curl -X POST http://orders-service:8000/test/procurement-scheduler \
  -H "Authorization: Bearer $ADMIN_TOKEN"
```

**Follow-up Actions:**

- Check RabbitMQ health (leader election depends on it)
- Review database connectivity
- Check resource limits (CPU/memory)
- Monitor metrics for successful generation

---

### 🟠 HIGH: Single Tenant Failed

**Alert:** `DailyProductionPlanningFailed{tenant_id="abc-123"}`

**Impact:** MEDIUM - One bakery affected

**Immediate Actions (< 10 minutes):**

```bash
# 1. Check logs for specific tenant
kubectl logs -n production deployment/production-service --tail=500 | \
  grep "tenant_id=abc-123" | grep ERROR

# 2. Common causes:
#    - Tenant database connection issue
#    - External service timeout (Forecasting, Inventory)
#    - Invalid data (e.g., missing products)

# 3. Manually retry for this tenant
curl -X POST http://production-service:8000/test/production-scheduler \
  -H "Authorization: Bearer $ADMIN_TOKEN"
# (Scheduler will skip tenants that already have schedules)

# 4. If still failing, check tenant-specific issues:
#    - Verify tenant exists and is active
#    - Check tenant's inventory has products
#    - Check forecasting service can access tenant data
```

**Follow-up Actions:**

- Contact tenant to understand their setup
- Review tenant data quality
- Check if tenant is new (may need initial setup)
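Step 3 above relies on the scheduler skipping tenants that already have a schedule. If it is unclear whether the affected tenant is in that state, one rough check is to grep for the known success log line. This is only a sketch and assumes the success message carries the same `tenant_id=` field seen in the error logs:

```bash
# Sketch: check whether the affected tenant already got today's schedule before retrying.
# Assumption: the success log line includes a tenant_id=<id> field like the error logs do.
TENANT_ID="abc-123"   # tenant from the alert
kubectl logs -n production deployment/production-service --since=12h | \
  grep "Production schedule created successfully" | \
  grep "tenant_id=${TENANT_ID}" \
  && echo "Schedule already exists - the manual retry will skip this tenant" \
  || echo "No schedule found - the manual retry should regenerate it"
```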
---

### 🟡 MEDIUM: Scheduler Running Slow

**Alert:** `production_schedule_generation_duration_seconds > 120s`

**Impact:** LOW - Scheduler completes but takes longer than expected

**Immediate Actions (< 15 minutes):**

```bash
# 1. Check current execution time
kubectl logs -n production deployment/production-service --tail=100 | \
  grep "production planning completed"

# 2. Check database query performance
#    Look for slow query logs in PostgreSQL

# 3. Check external service response times
#    - Forecasting Service health: curl http://forecasting-service:8000/health
#    - Inventory Service health:   curl http://inventory-service:8000/health
#    - Orders Service health:      curl http://orders-service:8000/health

# 4. Check CPU/memory usage
kubectl top pods -n production | grep production-service
kubectl top pods -n orders | grep orders-service
```

**Follow-up Actions:**

- Consider increasing timeout if consistently near limit
- Optimize slow database queries
- Scale external services if overloaded
- Review tenant count (may need to process fewer in parallel)

---

### 🟡 MEDIUM: Low Forecast Cache Hit Rate

**Alert:** `ForecastCacheHitRateLow < 50%`

**Impact:** LOW - Increased load on Forecasting Service, slower responses

**Immediate Actions (< 10 minutes):**

```bash
# 1. Check Redis is running
kubectl get pods -n redis | grep redis
redis-cli ping  # Should return PONG

# 2. Check cache statistics
curl http://forecasting-service:8000/api/v1/{tenant_id}/forecasting/cache/stats \
  -H "Authorization: Bearer $ADMIN_TOKEN"

# 3. Check cache keys
redis-cli KEYS "forecast:*" | wc -l  # Should have many entries

# 4. Check Redis memory
redis-cli INFO memory | grep used_memory_human

# 5. If cache is empty or Redis is down, restart Redis
kubectl rollout restart statefulset/redis -n redis
```

**Follow-up Actions:**

- Monitor cache rebuild (should hit ~80-90% within 1 day)
- Check Redis configuration (memory limits, eviction policy)
- Review forecast TTL settings
- Check for cache invalidation bugs

---

### 🟢 LOW: Plan Rejected by User

**Alert:** `procurement_plan_rejections_total` increasing

**Impact:** LOW - Normal user workflow

**Actions (< 5 minutes):**

```bash
# 1. Check rejection logs for patterns
kubectl logs -n orders deployment/orders-service --tail=200 | \
  grep "plan rejection"

# 2. Check if auto-regeneration triggered
kubectl logs -n orders deployment/orders-service --tail=200 | \
  grep "Auto-regenerating plan"

# 3. Verify rejection notification sent
#    Check RabbitMQ queue: procurement.plan.rejected

# 4. If rejection notes mention "stale" or "outdated", plan will auto-regenerate
#    Otherwise, user needs to manually regenerate or modify plan
```

**Follow-up Actions:**

- Review rejection reasons for trends
- Consider user training if many rejections
- Improve plan accuracy if consistent issues

---

## Health Check Commands

### Quick Service Health Check

```bash
# Production Service
curl http://production-service:8000/health | jq .

# Orders Service
curl http://orders-service:8000/health | jq .

# Forecasting Service
curl http://forecasting-service:8000/health | jq .

# Redis
redis-cli ping

# RabbitMQ
curl http://rabbitmq:15672/api/health/checks/alarms \
  -u guest:guest | jq .
```
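For a faster first pass, the HTTP checks above can be swept in one loop. A minimal sketch, assuming each service reports healthy with an HTTP 200 on `/health`:

```bash
# Sketch: sweep the HTTP health endpoints above and flag anything that is not a 200.
for svc in production-service orders-service forecasting-service; do
  code=$(curl -s -o /dev/null -w "%{http_code}" "http://${svc}:8000/health")
  [ "$code" = "200" ] && echo "OK   ${svc}" || echo "FAIL ${svc} (HTTP ${code})"
done
```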
### Detailed Scheduler Status

```bash
# Check last scheduler run time
curl http://production-service:8000/health | \
  jq '.custom_checks.scheduler_service'

# Check APScheduler job status (requires internal access)
# Look for: scheduler.get_jobs() output in logs
kubectl logs -n production deployment/production-service | \
  grep "scheduled jobs configured"
```

### Database Connectivity

```bash
# Check production database
kubectl exec -it deployment/production-service -n production -- \
  python -c "from app.core.database import database_manager; \
import asyncio; \
asyncio.run(database_manager.health_check())"

# Check orders database
kubectl exec -it deployment/orders-service -n orders -- \
  python -c "from app.core.database import database_manager; \
import asyncio; \
asyncio.run(database_manager.health_check())"
```

---

## Maintenance Procedures

### Disable Schedulers (Maintenance Mode)

```bash
# 1. Set environment variable to disable schedulers
kubectl set env deployment/production-service SCHEDULER_DISABLED=true -n production
kubectl set env deployment/orders-service SCHEDULER_DISABLED=true -n orders

# 2. Wait for pods to restart
kubectl rollout status deployment/production-service -n production
kubectl rollout status deployment/orders-service -n orders

# 3. Verify schedulers are disabled (check logs)
kubectl logs -n production deployment/production-service | grep "Scheduler disabled"
```
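Before starting maintenance, it is worth confirming the flag actually landed on both deployments (the log check above only covers the production service). A quick sketch, assuming the scheduler runs in the first container of each pod spec:

```bash
# Sketch: confirm SCHEDULER_DISABLED is present on both deployments.
# Assumption: the relevant container is the first one in each pod spec.
kubectl get deployment production-service -n production \
  -o jsonpath='{.spec.template.spec.containers[0].env}' | grep SCHEDULER_DISABLED
kubectl get deployment orders-service -n orders \
  -o jsonpath='{.spec.template.spec.containers[0].env}' | grep SCHEDULER_DISABLED
```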
### Re-enable Schedulers (After Maintenance)

```bash
# 1. Remove environment variable
kubectl set env deployment/production-service SCHEDULER_DISABLED- -n production
kubectl set env deployment/orders-service SCHEDULER_DISABLED- -n orders

# 2. Wait for pods to restart
kubectl rollout status deployment/production-service -n production
kubectl rollout status deployment/orders-service -n orders

# 3. Manually trigger to catch up (if during scheduled time)
curl -X POST http://production-service:8000/test/production-scheduler \
  -H "Authorization: Bearer $ADMIN_TOKEN"
curl -X POST http://orders-service:8000/test/procurement-scheduler \
  -H "Authorization: Bearer $ADMIN_TOKEN"
```

### Clear Forecast Cache

```bash
# Clear all forecast cache (will rebuild automatically)
redis-cli KEYS "forecast:*" | xargs redis-cli DEL

# Clear specific tenant's cache
redis-cli KEYS "forecast:{tenant_id}:*" | xargs redis-cli DEL

# Verify cache cleared
redis-cli DBSIZE
```

---

## Metrics to Monitor

### Production Scheduler

```promql
# Success rate (should be > 95%)
rate(production_schedules_generated_total{status="success"}[5m]) /
rate(production_schedules_generated_total[5m])

# P95 generation time (should be < 60s)
histogram_quantile(0.95, rate(production_schedule_generation_duration_seconds_bucket[5m]))

# Failed tenants (should be 0)
increase(production_tenants_processed_total{status="failure"}[5m])
```

### Procurement Scheduler

```promql
# Success rate (should be > 95%)
rate(procurement_plans_generated_total{status="success"}[5m]) /
rate(procurement_plans_generated_total[5m])

# P95 generation time (should be < 60s)
histogram_quantile(0.95, rate(procurement_plan_generation_duration_seconds_bucket[5m]))

# Failed tenants (should be 0)
increase(procurement_tenants_processed_total{status="failure"}[5m])
```

### Forecast Cache

```promql
# Cache hit rate (should be > 70%)
forecast_cache_hit_rate

# Cache hits per second (averaged over 5m)
rate(forecast_cache_hits_total[5m])

# Cache misses per second (averaged over 5m)
rate(forecast_cache_misses_total[5m])
```

---

## Log Patterns to Watch

### Success Patterns

```
✅ "Daily production planning completed" - All tenants processed
✅ "Production schedule created successfully" - Individual tenant success
✅ "Forecast cache HIT" - Cache working correctly
✅ "Production scheduler service started" - Service initialized
```

### Warning Patterns

```
⚠️ "Tenant processing timed out" - Individual tenant taking too long
⚠️ "Forecast cache MISS" - Cache miss (some expected, but not all)
⚠️ "Approving plan older than 24 hours" - Stale plan being approved
⚠️ "Could not fetch tenant timezone" - Timezone configuration issue
```

### Error Patterns

```
❌ "Daily production planning failed completely" - Complete failure
❌ "Error processing tenant production" - Tenant-specific failure
❌ "Forecast cache Redis connection failed" - Cache unavailable
❌ "Migration version mismatch" - Database migration issue
❌ "Failed to publish event" - RabbitMQ connectivity issue
```
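To sweep recent logs for the error patterns above in one pass, a small sketch (adjust the time window as needed):

```bash
# Sketch: scan the last hour of logs from both services for the error patterns listed above.
for ns_dep in production/production-service orders/orders-service; do
  ns="${ns_dep%/*}"; dep="${ns_dep#*/}"
  echo "== ${dep} =="
  kubectl logs -n "${ns}" "deployment/${dep}" --since=1h | \
    grep -E "planning failed completely|Error processing tenant|Redis connection failed|Migration version mismatch|Failed to publish event" \
    || echo "no error patterns found"
done
```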
---

## Escalation Procedure

### Level 1: DevOps On-Call (0-30 minutes)

- Check service health
- Review logs for obvious errors
- Restart services if crashed
- Manually trigger schedulers if needed
- Monitor for resolution

### Level 2: Backend Team (30-60 minutes)

- Investigate complex errors
- Check database issues
- Review scheduler logic
- Coordinate with other teams (if external service issue)

### Level 3: Engineering Lead (> 60 minutes)

- Major architectural issues
- Database corruption or loss
- Multi-service cascading failures
- Decisions on emergency fixes vs. scheduled maintenance

---

## Testing After Deployment

### Post-Deployment Checklist

```bash
# 1. Verify services are running
kubectl get pods -n production
kubectl get pods -n orders

# 2. Check health endpoints
curl http://production-service:8000/health
curl http://orders-service:8000/health

# 3. Verify schedulers are configured
kubectl logs -n production deployment/production-service | \
  grep "scheduled jobs configured"

# 4. Manually trigger test run
curl -X POST http://production-service:8000/test/production-scheduler \
  -H "Authorization: Bearer $ADMIN_TOKEN"
curl -X POST http://orders-service:8000/test/procurement-scheduler \
  -H "Authorization: Bearer $ADMIN_TOKEN"

# 5. Verify test run completed successfully
kubectl logs -n production deployment/production-service | \
  grep "Production schedule created successfully"
kubectl logs -n orders deployment/orders-service | \
  grep "Procurement plan generated successfully"

# 6. Check metrics dashboard
#    Visit: http://grafana:3000/d/production-planning
```

---

## Known Issues & Workarounds

### Issue: Scheduler runs twice in distributed setup

**Symptom:** Duplicate schedules/plans for same tenant and date

**Cause:** Leader election not working (RabbitMQ connection issue)

**Workaround:**

```bash
# Temporarily scale to single instance
kubectl scale deployment/production-service --replicas=1 -n production
kubectl scale deployment/orders-service --replicas=1 -n orders

# Fix RabbitMQ connectivity
# Then scale back up
kubectl scale deployment/production-service --replicas=3 -n production
kubectl scale deployment/orders-service --replicas=3 -n orders
```

### Issue: Timezone shows wrong time

**Symptom:** Schedules generated at wrong hour

**Cause:** Tenant timezone not configured or incorrect

**Workaround:**

```sql
-- Check tenant timezone
SELECT id, name, timezone FROM tenants WHERE id = '{tenant_id}';

-- Update if incorrect
UPDATE tenants SET timezone = 'Europe/Madrid' WHERE id = '{tenant_id}';

-- Verify server uses UTC
-- In container: date (should show UTC)
```

### Issue: Forecast cache always misses

**Symptom:** `forecast_cache_hit_rate = 0%`

**Cause:** Redis not accessible or REDIS_URL misconfigured

**Workaround:**

```bash
# Check REDIS_URL environment variable
kubectl get deployment forecasting-service -n forecasting -o yaml | \
  grep REDIS_URL
# Should be: redis://redis:6379/0

# If incorrect, update:
kubectl set env deployment/forecasting-service \
  REDIS_URL=redis://redis:6379/0 -n forecasting
```

---

## Additional Resources

- **Full Documentation:** [PRODUCTION_PLANNING_SYSTEM.md](./PRODUCTION_PLANNING_SYSTEM.md)
- **Metrics File:** [`shared/monitoring/scheduler_metrics.py`](../shared/monitoring/scheduler_metrics.py)
- **Scheduler Code:**
  - Production: [`services/production/app/services/production_scheduler_service.py`](../services/production/app/services/production_scheduler_service.py)
  - Procurement: [`services/orders/app/services/procurement_scheduler_service.py`](../services/orders/app/services/procurement_scheduler_service.py)
- **Forecast Cache:** [`services/forecasting/app/services/forecast_cache.py`](../services/forecasting/app/services/forecast_cache.py)

---

**Runbook Version:** 1.0
**Last Updated:** 2025-10-09
**Maintained By:** Backend Team