# Production Planning Scheduler Runbook

**Quick Reference Guide for DevOps & Support Teams**

---

## Quick Links

- [Full Documentation](./PRODUCTION_PLANNING_SYSTEM.md)
- [Metrics Dashboard](http://grafana:3000/d/production-planning)
- [Logs](http://kibana:5601)
- [Alerts](http://alertmanager:9093)

---
## Emergency Contacts

| Role | Contact | Availability |
|------|---------|--------------|
| **Backend Lead** | #backend-team | 24/7 |
| **DevOps On-Call** | #devops-oncall | 24/7 |
| **Product Owner** | TBD | Business hours |

---
## Scheduler Overview

| Scheduler | Time | What It Does |
|-----------|------|--------------|
| **Production** | 5:30 AM (tenant timezone) | Creates daily production schedules |
| **Procurement** | 6:00 AM (tenant timezone) | Creates daily procurement plans |

**Critical:** Both schedulers MUST complete successfully every morning, or bakeries won't have production/procurement plans for the day!
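Because triggers fire at a fixed local wall-clock time per tenant, the UTC firing instant shifts with DST. A minimal sketch of that conversion using Python's stdlib `zoneinfo` (the function name is illustrative, not the service's actual code):

```python
from datetime import datetime, time, timezone
from zoneinfo import ZoneInfo

def next_trigger_utc(tenant_tz: str, local_time: time, now: datetime) -> datetime:
    """Return today's trigger instant (tenant-local wall clock) as UTC."""
    tz = ZoneInfo(tenant_tz)
    local_dt = datetime.combine(now.astimezone(tz).date(), local_time, tzinfo=tz)
    return local_dt.astimezone(timezone.utc)

# Example: 5:30 AM in Madrid during CEST (UTC+2) is 03:30 UTC
now = datetime(2025, 7, 1, 0, 0, tzinfo=timezone.utc)
print(next_trigger_utc("Europe/Madrid", time(5, 30), now))  # 2025-07-01 03:30:00+00:00
```

This is why "scheduler ran an hour late/early" reports often cluster around DST transitions.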
---

## Common Incidents & Solutions
### 🔴 CRITICAL: Scheduler Completely Failed

**Alert:** `SchedulerUnhealthy` or `NoProductionSchedulesGenerated`

**Impact:** HIGH - No plans generated for any tenant

**Immediate Actions (< 5 minutes):**

```bash
# 1. Check if the services are running
kubectl get pods -n production | grep production-service
kubectl get pods -n orders | grep orders-service

# 2. Check recent logs for errors
kubectl logs -n production deployment/production-service --tail=100 | grep ERROR
kubectl logs -n orders deployment/orders-service --tail=100 | grep ERROR

# 3. Restart the services if frozen/crashed
kubectl rollout restart deployment/production-service -n production
kubectl rollout restart deployment/orders-service -n orders

# 4. Wait 2 minutes for the scheduler to initialize, then manually trigger
curl -X POST http://production-service:8000/test/production-scheduler \
  -H "Authorization: Bearer $ADMIN_TOKEN"

curl -X POST http://orders-service:8000/test/procurement-scheduler \
  -H "Authorization: Bearer $ADMIN_TOKEN"
```
**Follow-up Actions:**

- Check RabbitMQ health (leader election depends on it)
- Review database connectivity
- Check resource limits (CPU/memory)
- Monitor metrics for successful generation
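When re-triggering manually, it can help to retry with backoff while the service finishes initializing. A hedged sketch mirroring the curl commands above; `trigger_with_retry` is illustrative, not part of the service:

```python
import time
import urllib.request

def backoff_delays(attempts: int, base: float = 2.0, cap: float = 60.0) -> list[float]:
    """Exponential backoff schedule: 2s, 4s, 8s, ... capped at `cap` seconds."""
    return [min(base * (2 ** i), cap) for i in range(attempts)]

def trigger_with_retry(url: str, token: str, attempts: int = 4) -> bool:
    """POST to a manual-trigger endpoint, retrying on connection errors."""
    req = urllib.request.Request(
        url, method="POST", headers={"Authorization": f"Bearer {token}"}
    )
    for delay in backoff_delays(attempts):
        try:
            with urllib.request.urlopen(req, timeout=30) as resp:
                if resp.status < 300:
                    return True
        except OSError:
            pass  # service still restarting; wait and retry
        time.sleep(delay)
    return False

print(backoff_delays(4))  # [2.0, 4.0, 8.0, 16.0]
```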
---
### 🟠 HIGH: Single Tenant Failed

**Alert:** `DailyProductionPlanningFailed{tenant_id="abc-123"}`

**Impact:** MEDIUM - One bakery affected

**Immediate Actions (< 10 minutes):**

```bash
# 1. Check logs for the specific tenant
kubectl logs -n production deployment/production-service --tail=500 | \
  grep "tenant_id=abc-123" | grep ERROR

# 2. Common causes:
# - Tenant database connection issue
# - External service timeout (Forecasting, Inventory)
# - Invalid data (e.g., missing products)

# 3. Manually retry for this tenant
curl -X POST http://production-service:8000/test/production-scheduler \
  -H "Authorization: Bearer $ADMIN_TOKEN"
# (The scheduler skips tenants that already have schedules)

# 4. If still failing, check tenant-specific issues:
# - Verify the tenant exists and is active
# - Check the tenant's inventory has products
# - Check the forecasting service can access the tenant's data
```
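The full re-trigger is safe because generation is idempotent per tenant and date. A sketch of the skip logic described in step 3, with illustrative names:

```python
from datetime import date

def tenants_to_process(
    all_tenants: list[str],
    existing: set[tuple[str, date]],
    plan_date: date,
) -> list[str]:
    """Skip tenants that already have a schedule for plan_date, so a full
    re-trigger only regenerates the tenants that failed earlier."""
    return [t for t in all_tenants if (t, plan_date) not in existing]

existing = {("tenant-a", date(2025, 10, 9)), ("tenant-b", date(2025, 10, 9))}
print(tenants_to_process(["tenant-a", "tenant-b", "abc-123"], existing, date(2025, 10, 9)))
# ['abc-123']
```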
**Follow-up Actions:**

- Contact the tenant to understand their setup
- Review tenant data quality
- Check if the tenant is new (may need initial setup)
---
### 🟡 MEDIUM: Scheduler Running Slow

**Alert:** `production_schedule_generation_duration_seconds > 120s`

**Impact:** LOW - Scheduler completes but takes longer than expected

**Immediate Actions (< 15 minutes):**

```bash
# 1. Check the current execution time
kubectl logs -n production deployment/production-service --tail=100 | \
  grep "production planning completed"

# 2. Check database query performance
# Look for slow query logs in PostgreSQL

# 3. Check external service response times
# - Forecasting Service health: curl http://forecasting-service:8000/health
# - Inventory Service health: curl http://inventory-service:8000/health
# - Orders Service health: curl http://orders-service:8000/health

# 4. Check CPU/memory usage
kubectl top pods -n production | grep production-service
kubectl top pods -n orders | grep orders-service
```
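To spot which tenants dominate the runtime, the completion log lines can be parsed for per-tenant durations. This sketch assumes an illustrative log format (`tenant_id=… duration=…s`); adapt the regex to the real log shape:

```python
import re

# Illustrative log line shape -- adjust the pattern to the actual logs.
LINE = re.compile(r"tenant_id=(?P<tenant>\S+).*duration=(?P<secs>[\d.]+)s")

def slow_tenants(log_lines: list[str], threshold_s: float = 60.0) -> list[tuple[str, float]]:
    """Extract per-tenant durations and flag those above the threshold,
    slowest first."""
    flagged = []
    for line in log_lines:
        m = LINE.search(line)
        if m and float(m.group("secs")) > threshold_s:
            flagged.append((m.group("tenant"), float(m.group("secs"))))
    return sorted(flagged, key=lambda x: -x[1])

logs = [
    "production planning completed tenant_id=abc-123 duration=95.2s",
    "production planning completed tenant_id=def-456 duration=12.0s",
]
print(slow_tenants(logs))  # [('abc-123', 95.2)]
```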
**Follow-up Actions:**

- Consider increasing the timeout if consistently near the limit
- Optimize slow database queries
- Scale external services if overloaded
- Review the tenant count (may need to process fewer in parallel)
---
### 🟡 MEDIUM: Low Forecast Cache Hit Rate

**Alert:** `ForecastCacheHitRateLow < 50%`

**Impact:** LOW - Increased load on the Forecasting Service, slower responses

**Immediate Actions (< 10 minutes):**

```bash
# 1. Check Redis is running
kubectl get pods -n redis | grep redis
redis-cli ping  # Should return PONG

# 2. Check cache statistics
curl http://forecasting-service:8000/api/v1/{tenant_id}/forecasting/cache/stats \
  -H "Authorization: Bearer $ADMIN_TOKEN"

# 3. Count cache keys (--scan is non-blocking, unlike KEYS)
redis-cli --scan --pattern "forecast:*" | wc -l  # Should have many entries

# 4. Check Redis memory
redis-cli INFO memory | grep used_memory_human

# 5. If the cache is empty or Redis is down, restart Redis
kubectl rollout restart statefulset/redis -n redis
```
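The alert is derived from the hit and miss counters. For reference, the computation (a fresh or just-cleared cache legitimately reports ~0% until it repopulates):

```python
def hit_rate(hits: int, misses: int) -> float:
    """Cache hit rate as a fraction; 0.0 when there is no traffic yet."""
    total = hits + misses
    return hits / total if total else 0.0

# A freshly cleared cache starts near 0% and climbs as entries repopulate.
print(hit_rate(0, 40))     # 0.0
print(hit_rate(850, 150))  # 0.85
```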
**Follow-up Actions:**

- Monitor the cache rebuild (should reach ~80-90% hit rate within 1 day)
- Check the Redis configuration (memory limits, eviction policy)
- Review forecast TTL settings
- Check for cache invalidation bugs
---
### 🟢 LOW: Plan Rejected by User

**Alert:** `procurement_plan_rejections_total` increasing

**Impact:** LOW - Normal user workflow

**Actions (< 5 minutes):**

```bash
# 1. Check rejection logs for patterns
kubectl logs -n orders deployment/orders-service --tail=200 | \
  grep "plan rejection"

# 2. Check if auto-regeneration triggered
kubectl logs -n orders deployment/orders-service --tail=200 | \
  grep "Auto-regenerating plan"

# 3. Verify the rejection notification was sent
# Check RabbitMQ queue: procurement.plan.rejected

# 4. If the rejection notes mention "stale" or "outdated", the plan will auto-regenerate
# Otherwise, the user needs to manually regenerate or modify the plan
```
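The auto-regeneration decision in step 4 amounts to a keyword check on the rejection notes. An illustrative reimplementation (the real service may match different keywords):

```python
# Keywords that signal the plan was rejected for being out of date.
STALE_KEYWORDS = ("stale", "outdated")

def should_auto_regenerate(rejection_notes: str) -> bool:
    """Auto-regenerate only when the rejection notes indicate staleness;
    other rejections require manual action by the user."""
    notes = rejection_notes.lower()
    return any(kw in notes for kw in STALE_KEYWORDS)

print(should_auto_regenerate("Plan is outdated, quantities wrong"))  # True
print(should_auto_regenerate("Too much flour ordered"))              # False
```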
**Follow-up Actions:**

- Review rejection reasons for trends
- Consider user training if rejections are frequent
- Improve plan accuracy if issues are consistent
---
## Health Check Commands

### Quick Service Health Check

```bash
# Production Service
curl http://production-service:8000/health | jq .

# Orders Service
curl http://orders-service:8000/health | jq .

# Forecasting Service
curl http://forecasting-service:8000/health | jq .

# Redis
redis-cli ping

# RabbitMQ
curl http://rabbitmq:15672/api/health/checks/alarms \
  -u guest:guest | jq .
```
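The checks above can be wrapped into one summary. A hypothetical aggregator over the same `/health` endpoints; the response shape (`{"status": "healthy"}`) is an assumption about these services:

```python
import json
import urllib.request

# Endpoints from the curl commands above.
SERVICES = {
    "production": "http://production-service:8000/health",
    "orders": "http://orders-service:8000/health",
    "forecasting": "http://forecasting-service:8000/health",
}

def classify(results: dict[str, str]) -> list[str]:
    """Return the names of services that did not report 'healthy'."""
    return sorted(name for name, status in results.items() if status != "healthy")

def check_all() -> dict[str, str]:
    """Poll every service and collect its reported status."""
    results = {}
    for name, url in SERVICES.items():
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                results[name] = json.load(resp).get("status", "unknown")
        except OSError:
            results[name] = "unreachable"
    return results

print(classify({"production": "healthy", "orders": "degraded", "forecasting": "healthy"}))
# ['orders']
```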
### Detailed Scheduler Status

```bash
# Check the last scheduler run time
curl http://production-service:8000/health | \
  jq '.custom_checks.scheduler_service'

# Check APScheduler job status (requires internal access)
# Look for: scheduler.get_jobs() output in logs
kubectl logs -n production deployment/production-service | \
  grep "scheduled jobs configured"
```
### Database Connectivity

```bash
# Check the production database
kubectl exec -it deployment/production-service -n production -- \
  python -c "from app.core.database import database_manager; \
  import asyncio; \
  asyncio.run(database_manager.health_check())"

# Check the orders database
kubectl exec -it deployment/orders-service -n orders -- \
  python -c "from app.core.database import database_manager; \
  import asyncio; \
  asyncio.run(database_manager.health_check())"
```
---
## Maintenance Procedures

### Disable Schedulers (Maintenance Mode)

```bash
# 1. Set the environment variable that disables the schedulers
kubectl set env deployment/production-service SCHEDULER_DISABLED=true -n production
kubectl set env deployment/orders-service SCHEDULER_DISABLED=true -n orders

# 2. Wait for the pods to restart
kubectl rollout status deployment/production-service -n production
kubectl rollout status deployment/orders-service -n orders

# 3. Verify the schedulers are disabled (check logs)
kubectl logs -n production deployment/production-service | grep "Scheduler disabled"
```
### Re-enable Schedulers (After Maintenance)

```bash
# 1. Remove the environment variable (the trailing "-" unsets it)
kubectl set env deployment/production-service SCHEDULER_DISABLED- -n production
kubectl set env deployment/orders-service SCHEDULER_DISABLED- -n orders

# 2. Wait for the pods to restart
kubectl rollout status deployment/production-service -n production
kubectl rollout status deployment/orders-service -n orders

# 3. Manually trigger to catch up (if past the scheduled time)
curl -X POST http://production-service:8000/test/production-scheduler \
  -H "Authorization: Bearer $ADMIN_TOKEN"

curl -X POST http://orders-service:8000/test/procurement-scheduler \
  -H "Authorization: Bearer $ADMIN_TOKEN"
```
### Clear Forecast Cache

```bash
# Clear all forecast cache entries (the cache rebuilds automatically).
# Use --scan instead of KEYS: KEYS blocks Redis while it runs.
redis-cli --scan --pattern "forecast:*" | xargs redis-cli DEL

# Clear a specific tenant's cache
redis-cli --scan --pattern "forecast:{tenant_id}:*" | xargs redis-cli DEL

# Verify the cache was cleared
redis-cli DBSIZE
```
---
## Metrics to Monitor

### Production Scheduler

```promql
# Success rate (should be > 95%)
rate(production_schedules_generated_total{status="success"}[5m]) /
rate(production_schedules_generated_total[5m])

# 95th percentile generation time (should be < 60s)
histogram_quantile(0.95,
  rate(production_schedule_generation_duration_seconds_bucket[5m]))

# Failed tenants (should be 0)
increase(production_tenants_processed_total{status="failure"}[5m])
```
### Procurement Scheduler

```promql
# Success rate (should be > 95%)
rate(procurement_plans_generated_total{status="success"}[5m]) /
rate(procurement_plans_generated_total[5m])

# 95th percentile generation time (should be < 60s)
histogram_quantile(0.95,
  rate(procurement_plan_generation_duration_seconds_bucket[5m]))

# Failed tenants (should be 0)
increase(procurement_tenants_processed_total{status="failure"}[5m])
```
### Forecast Cache

```promql
# Cache hit rate (should be > 70%)
forecast_cache_hit_rate

# Cache hits per minute
rate(forecast_cache_hits_total[5m])

# Cache misses per minute
rate(forecast_cache_misses_total[5m])
```
---
## Log Patterns to Watch

### Success Patterns

```
✅ "Daily production planning completed" - All tenants processed
✅ "Production schedule created successfully" - Individual tenant success
✅ "Forecast cache HIT" - Cache working correctly
✅ "Production scheduler service started" - Service initialized
```
### Warning Patterns

```
⚠️ "Tenant processing timed out" - Individual tenant taking too long
⚠️ "Forecast cache MISS" - Cache miss (some are expected, but not all)
⚠️ "Approving plan older than 24 hours" - Stale plan being approved
⚠️ "Could not fetch tenant timezone" - Timezone configuration issue
```
### Error Patterns

```
❌ "Daily production planning failed completely" - Complete failure
❌ "Error processing tenant production" - Tenant-specific failure
❌ "Forecast cache Redis connection failed" - Cache unavailable
❌ "Migration version mismatch" - Database migration issue
❌ "Failed to publish event" - RabbitMQ connectivity issue
```
---
## Escalation Procedure

### Level 1: DevOps On-Call (0-30 minutes)

- Check service health
- Review logs for obvious errors
- Restart services if crashed
- Manually trigger schedulers if needed
- Monitor for resolution
### Level 2: Backend Team (30-60 minutes)

- Investigate complex errors
- Check database issues
- Review scheduler logic
- Coordinate with other teams (if an external service is at fault)
### Level 3: Engineering Lead (> 60 minutes)

- Major architectural issues
- Database corruption or loss
- Multi-service cascading failures
- Decisions on emergency fixes vs. scheduled maintenance
---
## Testing After Deployment

### Post-Deployment Checklist

```bash
# 1. Verify the services are running
kubectl get pods -n production
kubectl get pods -n orders

# 2. Check health endpoints
curl http://production-service:8000/health
curl http://orders-service:8000/health

# 3. Verify the schedulers are configured
kubectl logs -n production deployment/production-service | \
  grep "scheduled jobs configured"

# 4. Manually trigger a test run
curl -X POST http://production-service:8000/test/production-scheduler \
  -H "Authorization: Bearer $ADMIN_TOKEN"

curl -X POST http://orders-service:8000/test/procurement-scheduler \
  -H "Authorization: Bearer $ADMIN_TOKEN"

# 5. Verify the test run completed successfully
kubectl logs -n production deployment/production-service | \
  grep "Production schedule created successfully"

kubectl logs -n orders deployment/orders-service | \
  grep "Procurement plan generated successfully"

# 6. Check the metrics dashboard
# Visit: http://grafana:3000/d/production-planning
```
---
## Known Issues & Workarounds

### Issue: Scheduler runs twice in a distributed setup

**Symptom:** Duplicate schedules/plans for the same tenant and date

**Cause:** Leader election not working (RabbitMQ connection issue)

**Workaround:**

```bash
# Temporarily scale down to a single instance
kubectl scale deployment/production-service --replicas=1 -n production
kubectl scale deployment/orders-service --replicas=1 -n orders

# Fix RabbitMQ connectivity
# Then scale back up
kubectl scale deployment/production-service --replicas=3 -n production
kubectl scale deployment/orders-service --replicas=3 -n orders
```
### Issue: Timezone shows wrong time

**Symptom:** Schedules generated at the wrong hour

**Cause:** Tenant timezone not configured or incorrect

**Workaround:**

```sql
-- Check the tenant's timezone
SELECT id, name, timezone FROM tenants WHERE id = '{tenant_id}';

-- Update if incorrect
UPDATE tenants SET timezone = 'Europe/Madrid' WHERE id = '{tenant_id}';

-- Verify the server uses UTC
-- In the container: date (should show UTC)
```
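Before running the UPDATE, it is worth validating that the value is a real IANA zone name, since a typo silently breaks scheduling. A small stdlib check (assumes tzdata is available in the environment):

```python
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError

def is_valid_timezone(name: str) -> bool:
    """True if `name` is a resolvable IANA timezone like 'Europe/Madrid'."""
    try:
        ZoneInfo(name)
        return True
    except (ZoneInfoNotFoundError, ValueError):
        return False

print(is_valid_timezone("Europe/Madrid"))  # True
print(is_valid_timezone("Europe/Madird"))  # False (typo)
```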
### Issue: Forecast cache always misses

**Symptom:** `forecast_cache_hit_rate = 0%`

**Cause:** Redis not accessible or REDIS_URL misconfigured

**Workaround:**

```bash
# Check the REDIS_URL environment variable
kubectl get deployment forecasting-service -n forecasting -o yaml | \
  grep REDIS_URL

# Should be: redis://redis:6379/0

# If incorrect, update it:
kubectl set env deployment/forecasting-service \
  REDIS_URL=redis://redis:6379/0 -n forecasting
```
---
## Additional Resources

- **Full Documentation:** [PRODUCTION_PLANNING_SYSTEM.md](./PRODUCTION_PLANNING_SYSTEM.md)
- **Metrics File:** [`shared/monitoring/scheduler_metrics.py`](../shared/monitoring/scheduler_metrics.py)
- **Scheduler Code:**
  - Production: [`services/production/app/services/production_scheduler_service.py`](../services/production/app/services/production_scheduler_service.py)
  - Procurement: [`services/orders/app/services/procurement_scheduler_service.py`](../services/orders/app/services/procurement_scheduler_service.py)
- **Forecast Cache:** [`services/forecasting/app/services/forecast_cache.py`](../services/forecasting/app/services/forecast_cache.py)
---
**Runbook Version:** 1.0
**Last Updated:** 2025-10-09
**Maintained By:** Backend Team