
Production Planning Scheduler Runbook

Quick Reference Guide for DevOps & Support Teams



Emergency Contacts

| Role | Contact | Availability |
|------|---------|--------------|
| Backend Lead | #backend-team | 24/7 |
| DevOps On-Call | #devops-oncall | 24/7 |
| Product Owner | TBD | Business hours |

Scheduler Overview

| Scheduler | Time | What It Does |
|-----------|------|--------------|
| Production | 5:30 AM (tenant timezone) | Creates daily production schedules |
| Procurement | 6:00 AM (tenant timezone) | Creates daily procurement plans |

Critical: Both schedulers MUST complete successfully every morning, or bakeries won't have production/procurement plans for the day!
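
A quick way to confirm both runs completed on a given morning is to grep the services' logs for the completion messages listed under "Log Patterns to Watch" below (a sketch; adjust --since to cover the scheduled window):

# Production scheduler completed?
kubectl logs -n production deployment/production-service --since=6h | \
  grep "Daily production planning completed"

# Procurement scheduler completed?
kubectl logs -n orders deployment/orders-service --since=6h | \
  grep "Procurement plan generated successfully"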


Common Incidents & Solutions

🔴 CRITICAL: Scheduler Completely Failed

Alert: SchedulerUnhealthy or NoProductionSchedulesGenerated

Impact: HIGH - No plans generated for any tenant

Immediate Actions (< 5 minutes):

# 1. Check if service is running
kubectl get pods -n production | grep production-service
kubectl get pods -n orders | grep orders-service

# 2. Check recent logs for errors
kubectl logs -n production deployment/production-service --tail=100 | grep ERROR
kubectl logs -n orders deployment/orders-service --tail=100 | grep ERROR

# 3. Restart service if frozen/crashed
kubectl rollout restart deployment/production-service -n production
kubectl rollout restart deployment/orders-service -n orders

# 4. Wait 2 minutes for scheduler to initialize, then manually trigger
curl -X POST http://production-service:8000/test/production-scheduler \
  -H "Authorization: Bearer $ADMIN_TOKEN"

curl -X POST http://orders-service:8000/test/procurement-scheduler \
  -H "Authorization: Bearer $ADMIN_TOKEN"

Follow-up Actions:

  • Check RabbitMQ health (leader election depends on it); a quick check is sketched after this list
  • Review database connectivity
  • Check resource limits (CPU/memory)
  • Monitor metrics for successful generation
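
For the RabbitMQ check above, a minimal sketch (the alarms endpoint is the same one used under "Health Check Commands"; the rabbitmq namespace is an assumption, adjust to your cluster layout):

# Pod status (assumes RabbitMQ runs in a "rabbitmq" namespace)
kubectl get pods -n rabbitmq | grep rabbitmq

# Resource alarms via the management API
curl http://rabbitmq:15672/api/health/checks/alarms \
  -u guest:guest | jq .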

🟠 HIGH: Single Tenant Failed

Alert: DailyProductionPlanningFailed{tenant_id="abc-123"}

Impact: MEDIUM - One bakery affected

Immediate Actions (< 10 minutes):

# 1. Check logs for specific tenant
kubectl logs -n production deployment/production-service --tail=500 | \
  grep "tenant_id=abc-123" | grep ERROR

# 2. Common causes:
#    - Tenant database connection issue
#    - External service timeout (Forecasting, Inventory)
#    - Invalid data (e.g., missing products)

# 3. Manually retry for this tenant
curl -X POST http://production-service:8000/test/production-scheduler \
  -H "Authorization: Bearer $ADMIN_TOKEN"
# (Scheduler will skip tenants that already have schedules)

# 4. If still failing, check tenant-specific issues:
# - Verify tenant exists and is active
# - Check tenant's inventory has products
# - Check forecasting service can access tenant data
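
A quick dependency spot-check for the affected tenant, reusing the health and cache-stats endpoints referenced elsewhere in this runbook (abc-123 is the example tenant id from the alert):

# External services the planner depends on
curl http://forecasting-service:8000/health | jq .
curl http://inventory-service:8000/health | jq .

# Forecast cache stats for the affected tenant
curl http://forecasting-service:8000/api/v1/abc-123/forecasting/cache/stats \
  -H "Authorization: Bearer $ADMIN_TOKEN" | jq .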

Follow-up Actions:

  • Contact tenant to understand their setup
  • Review tenant data quality
  • Check if tenant is new (may need initial setup)

🟡 MEDIUM: Scheduler Running Slow

Alert: production_schedule_generation_duration_seconds > 120s

Impact: LOW - Scheduler completes but takes longer than expected

Immediate Actions (< 15 minutes):

# 1. Check current execution time
kubectl logs -n production deployment/production-service --tail=100 | \
  grep "production planning completed"

# 2. Check database query performance
#    (look for slow query logs in PostgreSQL; see the pg_stat_statements sketch after this block)

# 3. Check external service response times
# - Forecasting Service health: curl http://forecasting-service:8000/health
# - Inventory Service health: curl http://inventory-service:8000/health
# - Orders Service health: curl http://orders-service:8000/health

# 4. Check CPU/memory usage
kubectl top pods -n production | grep production-service
kubectl top pods -n orders | grep orders-service
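
For step 2, one way to surface slow queries is pg_stat_statements (a sketch only: it assumes the extension is enabled and that PostgreSQL runs as a "postgres" statefulset in the same namespace; mean_exec_time is the PostgreSQL 13+ column name):

kubectl exec -it statefulset/postgres -n production -- psql -U postgres -c \
  "SELECT calls, round(mean_exec_time::numeric, 1) AS mean_ms, left(query, 80) AS query
   FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;"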

Follow-up Actions:

  • Consider increasing timeout if consistently near limit
  • Optimize slow database queries
  • Scale external services if overloaded
  • Review tenant count (may need to process fewer in parallel)

🟡 MEDIUM: Low Forecast Cache Hit Rate

Alert: ForecastCacheHitRateLow < 50%

Impact: LOW - Increased load on Forecasting Service, slower responses

Immediate Actions (< 10 minutes):

# 1. Check Redis is running
kubectl get pods -n redis | grep redis
redis-cli ping  # Should return PONG

# 2. Check cache statistics
curl http://forecasting-service:8000/api/v1/{tenant_id}/forecasting/cache/stats \
  -H "Authorization: Bearer $ADMIN_TOKEN"

# 3. Check cache keys
redis-cli KEYS "forecast:*" | wc -l  # Should have many entries

# 4. Check Redis memory
redis-cli INFO memory | grep used_memory_human

# 5. If cache is empty or Redis is down, restart Redis
kubectl rollout restart statefulset/redis -n redis
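
To see whether cached entries are expiring sooner than expected (relevant to the TTL review in the follow-up list), inspect one key's remaining TTL:

# Sample one cached forecast key and check its remaining TTL
KEY=$(redis-cli --scan --pattern "forecast:*" | head -1)
redis-cli TTL "$KEY"   # seconds remaining; -1 = no expiry set, -2 = key missing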

Follow-up Actions:

  • Monitor cache rebuild (hit rate should recover to ~80-90% within a day)
  • Check Redis configuration (memory limits, eviction policy)
  • Review forecast TTL settings
  • Check for cache invalidation bugs

🟢 LOW: Plan Rejected by User

Alert: procurement_plan_rejections_total increasing

Impact: LOW - Normal user workflow

Actions (< 5 minutes):

# 1. Check rejection logs for patterns
kubectl logs -n orders deployment/orders-service --tail=200 | \
  grep "plan rejection"

# 2. Check if auto-regeneration triggered
kubectl logs -n orders deployment/orders-service --tail=200 | \
  grep "Auto-regenerating plan"

# 3. Verify rejection notification sent
# Check RabbitMQ queue: procurement.plan.rejected (see the sketch after this block)

# 4. If rejection notes mention "stale" or "outdated", plan will auto-regenerate
# Otherwise, user needs to manually regenerate or modify plan
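
For step 3, the queue can be inspected through the RabbitMQ management API (a sketch; %2F is the default vhost, adjust if a dedicated vhost is used):

curl -u guest:guest \
  "http://rabbitmq:15672/api/queues/%2F/procurement.plan.rejected" | \
  jq '{messages, consumers}'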

Follow-up Actions:

  • Review rejection reasons for trends
  • Consider user training if many rejections
  • Improve plan accuracy if consistent issues

Health Check Commands

Quick Service Health Check

# Production Service
curl http://production-service:8000/health | jq .

# Orders Service
curl http://orders-service:8000/health | jq .

# Forecasting Service
curl http://forecasting-service:8000/health | jq .

# Redis
redis-cli ping

# RabbitMQ
curl http://rabbitmq:15672/api/health/checks/alarms \
  -u guest:guest | jq .

Detailed Scheduler Status

# Check last scheduler run time
curl http://production-service:8000/health | \
  jq '.custom_checks.scheduler_service'

# Check APScheduler job status (requires internal access)
# Look for: scheduler.get_jobs() output in logs
kubectl logs -n production deployment/production-service | \
  grep "scheduled jobs configured"

Database Connectivity

# Check production database
kubectl exec -it deployment/production-service -n production -- \
  python -c "from app.core.database import database_manager; \
             import asyncio; \
             asyncio.run(database_manager.health_check())"

# Check orders database
kubectl exec -it deployment/orders-service -n orders -- \
  python -c "from app.core.database import database_manager; \
             import asyncio; \
             asyncio.run(database_manager.health_check())"

Maintenance Procedures

Disable Schedulers (Maintenance Mode)

# 1. Set environment variable to disable schedulers
kubectl set env deployment/production-service SCHEDULER_DISABLED=true -n production
kubectl set env deployment/orders-service SCHEDULER_DISABLED=true -n orders

# 2. Wait for pods to restart
kubectl rollout status deployment/production-service -n production
kubectl rollout status deployment/orders-service -n orders

# 3. Verify schedulers are disabled (check logs)
kubectl logs -n production deployment/production-service | grep "Scheduler disabled"

Re-enable Schedulers (After Maintenance)

# 1. Remove environment variable
kubectl set env deployment/production-service SCHEDULER_DISABLED- -n production
kubectl set env deployment/orders-service SCHEDULER_DISABLED- -n orders

# 2. Wait for pods to restart
kubectl rollout status deployment/production-service -n production
kubectl rollout status deployment/orders-service -n orders

# 3. Manually trigger to catch up (if during scheduled time)
curl -X POST http://production-service:8000/test/production-scheduler \
  -H "Authorization: Bearer $ADMIN_TOKEN"

curl -X POST http://orders-service:8000/test/procurement-scheduler \
  -H "Authorization: Bearer $ADMIN_TOKEN"

Clear Forecast Cache

# Clear all forecast cache (will rebuild automatically)
redis-cli KEYS "forecast:*" | xargs redis-cli DEL

# Clear specific tenant's cache
redis-cli KEYS "forecast:{tenant_id}:*" | xargs redis-cli DEL

# Verify cache cleared
redis-cli DBSIZE
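
KEYS blocks Redis while it scans the whole keyspace; on large instances a non-blocking alternative is SCAN (same effect, same key pattern):

# Delete forecast keys in batches without blocking Redis
redis-cli --scan --pattern "forecast:*" | xargs -r -n 500 redis-cli DEL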

Metrics to Monitor

Production Scheduler

# Success rate (should be > 95%)
sum(rate(production_schedules_generated_total{status="success"}[5m])) /
sum(rate(production_schedules_generated_total[5m]))

# 95th percentile generation time (should be < 60s)
histogram_quantile(0.95,
  sum by (le) (rate(production_schedule_generation_duration_seconds_bucket[5m])))

# Failed tenants (should be 0)
increase(production_tenants_processed_total{status="failure"}[5m])

Procurement Scheduler

# Success rate (should be > 95%)
sum(rate(procurement_plans_generated_total{status="success"}[5m])) /
sum(rate(procurement_plans_generated_total[5m]))

# 95th percentile generation time (should be < 60s)
histogram_quantile(0.95,
  sum by (le) (rate(procurement_plan_generation_duration_seconds_bucket[5m])))

# Failed tenants (should be 0)
increase(procurement_tenants_processed_total{status="failure"}[5m])

Forecast Cache

# Cache hit rate (should be > 70%)
forecast_cache_hit_rate

# Cache hits per minute
rate(forecast_cache_hits_total[5m])

# Cache misses per minute
rate(forecast_cache_misses_total[5m])
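
Any of the expressions above can be evaluated ad hoc against the Prometheus HTTP API (a sketch; assumes Prometheus is reachable in-cluster at prometheus:9090):

# Example: current forecast cache hit rate
curl -G http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=forecast_cache_hit_rate' | jq '.data.result'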

Log Patterns to Watch

Success Patterns

✅ "Daily production planning completed" - All tenants processed
✅ "Production schedule created successfully" - Individual tenant success
✅ "Forecast cache HIT" - Cache working correctly
✅ "Production scheduler service started" - Service initialized

Warning Patterns

⚠️ "Tenant processing timed out" - Individual tenant taking too long
⚠️ "Forecast cache MISS" - Cache miss (expected some, but not all)
⚠️ "Approving plan older than 24 hours" - Stale plan being approved
⚠️ "Could not fetch tenant timezone" - Timezone configuration issue

Error Patterns

❌ "Daily production planning failed completely" - Complete failure
❌ "Error processing tenant production" - Tenant-specific failure
❌ "Forecast cache Redis connection failed" - Cache unavailable
❌ "Migration version mismatch" - Database migration issue
❌ "Failed to publish event" - RabbitMQ connectivity issue

Escalation Procedure

Level 1: DevOps On-Call (0-30 minutes)

  • Check service health
  • Review logs for obvious errors
  • Restart services if crashed
  • Manually trigger schedulers if needed
  • Monitor for resolution

Level 2: Backend Team (30-60 minutes)

  • Investigate complex errors
  • Check database issues
  • Review scheduler logic
  • Coordinate with other teams (if external service issue)

Level 3: Engineering Lead (> 60 minutes)

  • Major architectural issues
  • Database corruption or loss
  • Multi-service cascading failures
  • Decisions on emergency fixes vs. scheduled maintenance

Testing After Deployment

Post-Deployment Checklist

# 1. Verify services are running
kubectl get pods -n production
kubectl get pods -n orders

# 2. Check health endpoints
curl http://production-service:8000/health
curl http://orders-service:8000/health

# 3. Verify schedulers are configured
kubectl logs -n production deployment/production-service | \
  grep "scheduled jobs configured"

# 4. Manually trigger test run
curl -X POST http://production-service:8000/test/production-scheduler \
  -H "Authorization: Bearer $ADMIN_TOKEN"

curl -X POST http://orders-service:8000/test/procurement-scheduler \
  -H "Authorization: Bearer $ADMIN_TOKEN"

# 5. Verify test run completed successfully
kubectl logs -n production deployment/production-service | \
  grep "Production schedule created successfully"

kubectl logs -n orders deployment/orders-service | \
  grep "Procurement plan generated successfully"

# 6. Check metrics dashboard
# Visit: http://grafana:3000/d/production-planning

Known Issues & Workarounds

Issue: Scheduler runs twice in distributed setup

Symptom: Duplicate schedules/plans for same tenant and date

Cause: Leader election not working (RabbitMQ connection issue)

Workaround:

# Temporarily scale to single instance
kubectl scale deployment/production-service --replicas=1 -n production
kubectl scale deployment/orders-service --replicas=1 -n orders

# Fix RabbitMQ connectivity
# Then scale back up
kubectl scale deployment/production-service --replicas=3 -n production
kubectl scale deployment/orders-service --replicas=3 -n orders

Issue: Timezone shows wrong time

Symptom: Schedules generated at wrong hour

Cause: Tenant timezone not configured or incorrect

Workaround:

-- Check tenant timezone
SELECT id, name, timezone FROM tenants WHERE id = '{tenant_id}';

-- Update if incorrect
UPDATE tenants SET timezone = 'Europe/Madrid' WHERE id = '{tenant_id}';

-- Verify server uses UTC
-- In container: date (should show UTC)
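
To confirm the container clock without opening a shell session, a quick check:

kubectl exec deployment/production-service -n production -- date +%Z   # should print UTC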

Issue: Forecast cache always misses

Symptom: forecast_cache_hit_rate = 0%

Cause: Redis not accessible or REDIS_URL misconfigured

Workaround:

# Check REDIS_URL environment variable
kubectl get deployment forecasting-service -n forecasting -o yaml | \
  grep REDIS_URL

# Should be: redis://redis:6379/0

# If incorrect, update:
kubectl set env deployment/forecasting-service \
  REDIS_URL=redis://redis:6379/0 -n forecasting
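
To confirm connectivity after the change, a quick check from inside the forecasting pod (assumes the Python redis client is installed in the service image):

kubectl exec -it deployment/forecasting-service -n forecasting -- \
  python -c "import redis; print(redis.Redis.from_url('redis://redis:6379/0').ping())"
# Expected output: True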

Additional Resources


Runbook Version: 1.0 | Last Updated: 2025-10-09 | Maintained By: Backend Team