# Production Planning Scheduler Runbook

**Quick Reference Guide for DevOps & Support Teams**

---

## Quick Links

- [Full Documentation](./PRODUCTION_PLANNING_SYSTEM.md)
- [Metrics Dashboard](http://grafana:3000/d/production-planning)
- [Logs](http://kibana:5601)
- [Alerts](http://alertmanager:9093)

---
## Emergency Contacts

| Role | Contact | Availability |
|------|---------|--------------|
| **Backend Lead** | #backend-team | 24/7 |
| **DevOps On-Call** | #devops-oncall | 24/7 |
| **Product Owner** | TBD | Business hours |

---
## Scheduler Overview

| Scheduler | Time | What It Does |
|-----------|------|--------------|
| **Production** | 5:30 AM (tenant timezone) | Creates daily production schedules |
| **Procurement** | 6:00 AM (tenant timezone) | Creates daily procurement plans |

**Critical:** Both schedulers MUST complete successfully every morning, or bakeries won't have production/procurement plans for the day!
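Because triggers fire at a fixed local wall-clock time per tenant, the UTC firing instant shifts with DST. A minimal sketch of that conversion using Python's stdlib `zoneinfo` (the function name is illustrative, not the service's actual code):

```python
from datetime import datetime, time, timezone
from zoneinfo import ZoneInfo

def next_trigger_utc(tenant_tz: str, local_time: time, now: datetime) -> datetime:
    """Return today's trigger instant (tenant-local wall clock) as UTC."""
    tz = ZoneInfo(tenant_tz)
    local_dt = datetime.combine(now.astimezone(tz).date(), local_time, tzinfo=tz)
    return local_dt.astimezone(timezone.utc)

# Example: 5:30 AM in Madrid during CEST (UTC+2) is 03:30 UTC
now = datetime(2025, 7, 1, 0, 0, tzinfo=timezone.utc)
print(next_trigger_utc("Europe/Madrid", time(5, 30), now))  # 2025-07-01 03:30:00+00:00
```

This is why "scheduler ran an hour late/early" reports often cluster around DST transitions.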
---

## Common Incidents & Solutions
### 🔴 CRITICAL: Scheduler Completely Failed

**Alert:** `SchedulerUnhealthy` or `NoProductionSchedulesGenerated`

**Impact:** HIGH - No plans generated for any tenant

**Immediate Actions (< 5 minutes):**

```bash
# 1. Check if the services are running
kubectl get pods -n production | grep production-service
kubectl get pods -n orders | grep orders-service

# 2. Check recent logs for errors
kubectl logs -n production deployment/production-service --tail=100 | grep ERROR
kubectl logs -n orders deployment/orders-service --tail=100 | grep ERROR

# 3. Restart the services if frozen/crashed
kubectl rollout restart deployment/production-service -n production
kubectl rollout restart deployment/orders-service -n orders

# 4. Wait 2 minutes for the scheduler to initialize, then manually trigger
curl -X POST http://production-service:8000/test/production-scheduler \
  -H "Authorization: Bearer $ADMIN_TOKEN"

curl -X POST http://orders-service:8000/test/procurement-scheduler \
  -H "Authorization: Bearer $ADMIN_TOKEN"
```
**Follow-up Actions:**

- Check RabbitMQ health (leader election depends on it)
- Review database connectivity
- Check resource limits (CPU/memory)
- Monitor metrics for successful generation
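When re-triggering manually, it can help to retry with backoff while the service finishes initializing. A hedged sketch mirroring the curl commands above; `trigger_with_retry` is illustrative, not part of the service:

```python
import time
import urllib.request

def backoff_delays(attempts: int, base: float = 2.0, cap: float = 60.0) -> list[float]:
    """Exponential backoff schedule: 2s, 4s, 8s, ... capped at `cap` seconds."""
    return [min(base * (2 ** i), cap) for i in range(attempts)]

def trigger_with_retry(url: str, token: str, attempts: int = 4) -> bool:
    """POST to a manual-trigger endpoint, retrying on connection errors."""
    req = urllib.request.Request(
        url, method="POST", headers={"Authorization": f"Bearer {token}"}
    )
    for delay in backoff_delays(attempts):
        try:
            with urllib.request.urlopen(req, timeout=30) as resp:
                if resp.status < 300:
                    return True
        except OSError:
            pass  # service still restarting; wait and retry
        time.sleep(delay)
    return False

print(backoff_delays(4))  # [2.0, 4.0, 8.0, 16.0]
```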
---
### 🟠 HIGH: Single Tenant Failed

**Alert:** `DailyProductionPlanningFailed{tenant_id="abc-123"}`

**Impact:** MEDIUM - One bakery affected

**Immediate Actions (< 10 minutes):**

```bash
# 1. Check logs for the specific tenant
kubectl logs -n production deployment/production-service --tail=500 | \
  grep "tenant_id=abc-123" | grep ERROR

# 2. Common causes:
# - Tenant database connection issue
# - External service timeout (Forecasting, Inventory)
# - Invalid data (e.g., missing products)

# 3. Manually retry for this tenant
curl -X POST http://production-service:8000/test/production-scheduler \
  -H "Authorization: Bearer $ADMIN_TOKEN"
# (The scheduler skips tenants that already have schedules)

# 4. If still failing, check tenant-specific issues:
# - Verify the tenant exists and is active
# - Check the tenant's inventory has products
# - Check the forecasting service can access the tenant's data
```
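The full re-trigger is safe because generation is idempotent per tenant and date. A sketch of the skip logic described in step 3, with illustrative names:

```python
from datetime import date

def tenants_to_process(
    all_tenants: list[str],
    existing: set[tuple[str, date]],
    plan_date: date,
) -> list[str]:
    """Skip tenants that already have a schedule for plan_date, so a full
    re-trigger only regenerates the tenants that failed earlier."""
    return [t for t in all_tenants if (t, plan_date) not in existing]

existing = {("tenant-a", date(2025, 10, 9)), ("tenant-b", date(2025, 10, 9))}
print(tenants_to_process(["tenant-a", "tenant-b", "abc-123"], existing, date(2025, 10, 9)))
# ['abc-123']
```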
**Follow-up Actions:**

- Contact the tenant to understand their setup
- Review tenant data quality
- Check if the tenant is new (may need initial setup)
---
### 🟡 MEDIUM: Scheduler Running Slow

**Alert:** `production_schedule_generation_duration_seconds > 120s`

**Impact:** LOW - Scheduler completes but takes longer than expected

**Immediate Actions (< 15 minutes):**

```bash
# 1. Check the current execution time
kubectl logs -n production deployment/production-service --tail=100 | \
  grep "production planning completed"

# 2. Check database query performance
# Look for slow query logs in PostgreSQL

# 3. Check external service response times
# - Forecasting Service health: curl http://forecasting-service:8000/health
# - Inventory Service health: curl http://inventory-service:8000/health
# - Orders Service health: curl http://orders-service:8000/health

# 4. Check CPU/memory usage
kubectl top pods -n production | grep production-service
kubectl top pods -n orders | grep orders-service
```
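To spot which tenants dominate the runtime, the completion log lines can be parsed for per-tenant durations. This sketch assumes an illustrative log format (`tenant_id=… duration=…s`); adapt the regex to the real log shape:

```python
import re

# Illustrative log line shape -- adjust the pattern to the actual logs.
LINE = re.compile(r"tenant_id=(?P<tenant>\S+).*duration=(?P<secs>[\d.]+)s")

def slow_tenants(log_lines: list[str], threshold_s: float = 60.0) -> list[tuple[str, float]]:
    """Extract per-tenant durations and flag those above the threshold,
    slowest first."""
    flagged = []
    for line in log_lines:
        m = LINE.search(line)
        if m and float(m.group("secs")) > threshold_s:
            flagged.append((m.group("tenant"), float(m.group("secs"))))
    return sorted(flagged, key=lambda x: -x[1])

logs = [
    "production planning completed tenant_id=abc-123 duration=95.2s",
    "production planning completed tenant_id=def-456 duration=12.0s",
]
print(slow_tenants(logs))  # [('abc-123', 95.2)]
```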
**Follow-up Actions:**

- Consider increasing the timeout if consistently near the limit
- Optimize slow database queries
- Scale external services if overloaded
- Review the tenant count (may need to process fewer in parallel)
---
### 🟡 MEDIUM: Low Forecast Cache Hit Rate

**Alert:** `ForecastCacheHitRateLow < 50%`

**Impact:** LOW - Increased load on the Forecasting Service, slower responses

**Immediate Actions (< 10 minutes):**

```bash
# 1. Check Redis is running
kubectl get pods -n redis | grep redis
redis-cli ping  # Should return PONG

# 2. Check cache statistics
curl http://forecasting-service:8000/api/v1/{tenant_id}/forecasting/cache/stats \
  -H "Authorization: Bearer $ADMIN_TOKEN"

# 3. Count cache keys (--scan is non-blocking, unlike KEYS)
redis-cli --scan --pattern "forecast:*" | wc -l  # Should have many entries

# 4. Check Redis memory
redis-cli INFO memory | grep used_memory_human

# 5. If the cache is empty or Redis is down, restart Redis
kubectl rollout restart statefulset/redis -n redis
```
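The alert is derived from the hit and miss counters. For reference, the computation (a fresh or just-cleared cache legitimately reports ~0% until it repopulates):

```python
def hit_rate(hits: int, misses: int) -> float:
    """Cache hit rate as a fraction; 0.0 when there is no traffic yet."""
    total = hits + misses
    return hits / total if total else 0.0

# A freshly cleared cache starts near 0% and climbs as entries repopulate.
print(hit_rate(0, 40))     # 0.0
print(hit_rate(850, 150))  # 0.85
```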
**Follow-up Actions:**

- Monitor the cache rebuild (should reach ~80-90% hit rate within 1 day)
- Check the Redis configuration (memory limits, eviction policy)
- Review forecast TTL settings
- Check for cache invalidation bugs
---
### 🟢 LOW: Plan Rejected by User

**Alert:** `procurement_plan_rejections_total` increasing

**Impact:** LOW - Normal user workflow

**Actions (< 5 minutes):**

```bash
# 1. Check rejection logs for patterns
kubectl logs -n orders deployment/orders-service --tail=200 | \
  grep "plan rejection"

# 2. Check if auto-regeneration triggered
kubectl logs -n orders deployment/orders-service --tail=200 | \
  grep "Auto-regenerating plan"

# 3. Verify the rejection notification was sent
# Check RabbitMQ queue: procurement.plan.rejected

# 4. If the rejection notes mention "stale" or "outdated", the plan will auto-regenerate
# Otherwise, the user needs to manually regenerate or modify the plan
```
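The auto-regeneration decision in step 4 amounts to a keyword check on the rejection notes. An illustrative reimplementation (the real service may match different keywords):

```python
# Keywords that signal the plan was rejected for being out of date.
STALE_KEYWORDS = ("stale", "outdated")

def should_auto_regenerate(rejection_notes: str) -> bool:
    """Auto-regenerate only when the rejection notes indicate staleness;
    other rejections require manual action by the user."""
    notes = rejection_notes.lower()
    return any(kw in notes for kw in STALE_KEYWORDS)

print(should_auto_regenerate("Plan is outdated, quantities wrong"))  # True
print(should_auto_regenerate("Too much flour ordered"))              # False
```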
**Follow-up Actions:**

- Review rejection reasons for trends
- Consider user training if rejections are frequent
- Improve plan accuracy if issues are consistent
---
## Health Check Commands

### Quick Service Health Check

```bash
# Production Service
curl http://production-service:8000/health | jq .

# Orders Service
curl http://orders-service:8000/health | jq .

# Forecasting Service
curl http://forecasting-service:8000/health | jq .

# Redis
redis-cli ping

# RabbitMQ
curl http://rabbitmq:15672/api/health/checks/alarms \
  -u guest:guest | jq .
```
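The checks above can be wrapped into one summary. A hypothetical aggregator over the same `/health` endpoints; the response shape (`{"status": "healthy"}`) is an assumption about these services:

```python
import json
import urllib.request

# Endpoints from the curl commands above.
SERVICES = {
    "production": "http://production-service:8000/health",
    "orders": "http://orders-service:8000/health",
    "forecasting": "http://forecasting-service:8000/health",
}

def classify(results: dict[str, str]) -> list[str]:
    """Return the names of services that did not report 'healthy'."""
    return sorted(name for name, status in results.items() if status != "healthy")

def check_all() -> dict[str, str]:
    """Poll every service and collect its reported status."""
    results = {}
    for name, url in SERVICES.items():
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                results[name] = json.load(resp).get("status", "unknown")
        except OSError:
            results[name] = "unreachable"
    return results

print(classify({"production": "healthy", "orders": "degraded", "forecasting": "healthy"}))
# ['orders']
```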
### Detailed Scheduler Status

```bash
# Check the last scheduler run time
curl http://production-service:8000/health | \
  jq '.custom_checks.scheduler_service'

# Check APScheduler job status (requires internal access)
# Look for: scheduler.get_jobs() output in logs
kubectl logs -n production deployment/production-service | \
  grep "scheduled jobs configured"
```
### Database Connectivity

```bash
# Check the production database
kubectl exec -it deployment/production-service -n production -- \
  python -c "from app.core.database import database_manager; \
  import asyncio; \
  asyncio.run(database_manager.health_check())"

# Check the orders database
kubectl exec -it deployment/orders-service -n orders -- \
  python -c "from app.core.database import database_manager; \
  import asyncio; \
  asyncio.run(database_manager.health_check())"
```
---
## Maintenance Procedures

### Disable Schedulers (Maintenance Mode)

```bash
# 1. Set the environment variable that disables the schedulers
kubectl set env deployment/production-service SCHEDULER_DISABLED=true -n production
kubectl set env deployment/orders-service SCHEDULER_DISABLED=true -n orders

# 2. Wait for the pods to restart
kubectl rollout status deployment/production-service -n production
kubectl rollout status deployment/orders-service -n orders

# 3. Verify the schedulers are disabled (check logs)
kubectl logs -n production deployment/production-service | grep "Scheduler disabled"
```
### Re-enable Schedulers (After Maintenance)

```bash
# 1. Remove the environment variable (the trailing "-" unsets it)
kubectl set env deployment/production-service SCHEDULER_DISABLED- -n production
kubectl set env deployment/orders-service SCHEDULER_DISABLED- -n orders

# 2. Wait for the pods to restart
kubectl rollout status deployment/production-service -n production
kubectl rollout status deployment/orders-service -n orders

# 3. Manually trigger to catch up (if past the scheduled time)
curl -X POST http://production-service:8000/test/production-scheduler \
  -H "Authorization: Bearer $ADMIN_TOKEN"

curl -X POST http://orders-service:8000/test/procurement-scheduler \
  -H "Authorization: Bearer $ADMIN_TOKEN"
```
### Clear Forecast Cache

```bash
# Clear all forecast cache entries (the cache rebuilds automatically).
# Use --scan instead of KEYS: KEYS blocks Redis while it runs.
redis-cli --scan --pattern "forecast:*" | xargs redis-cli DEL

# Clear a specific tenant's cache
redis-cli --scan --pattern "forecast:{tenant_id}:*" | xargs redis-cli DEL

# Verify the cache was cleared
redis-cli DBSIZE
```
---
## Metrics to Monitor

### Production Scheduler

```promql
# Success rate (should be > 95%)
rate(production_schedules_generated_total{status="success"}[5m]) /
rate(production_schedules_generated_total[5m])

# 95th percentile generation time (should be < 60s)
histogram_quantile(0.95,
  rate(production_schedule_generation_duration_seconds_bucket[5m]))

# Failed tenants (should be 0)
increase(production_tenants_processed_total{status="failure"}[5m])
```
### Procurement Scheduler

```promql
# Success rate (should be > 95%)
rate(procurement_plans_generated_total{status="success"}[5m]) /
rate(procurement_plans_generated_total[5m])

# 95th percentile generation time (should be < 60s)
histogram_quantile(0.95,
  rate(procurement_plan_generation_duration_seconds_bucket[5m]))

# Failed tenants (should be 0)
increase(procurement_tenants_processed_total{status="failure"}[5m])
```
### Forecast Cache

```promql
# Cache hit rate (should be > 70%)
forecast_cache_hit_rate

# Cache hits per minute
rate(forecast_cache_hits_total[5m])

# Cache misses per minute
rate(forecast_cache_misses_total[5m])
```
---
## Log Patterns to Watch

### Success Patterns

```
✅ "Daily production planning completed" - All tenants processed
✅ "Production schedule created successfully" - Individual tenant success
✅ "Forecast cache HIT" - Cache working correctly
✅ "Production scheduler service started" - Service initialized
```
### Warning Patterns

```
⚠️ "Tenant processing timed out" - Individual tenant taking too long
⚠️ "Forecast cache MISS" - Cache miss (some are expected, but not all)
⚠️ "Approving plan older than 24 hours" - Stale plan being approved
⚠️ "Could not fetch tenant timezone" - Timezone configuration issue
```
### Error Patterns

```
❌ "Daily production planning failed completely" - Complete failure
❌ "Error processing tenant production" - Tenant-specific failure
❌ "Forecast cache Redis connection failed" - Cache unavailable
❌ "Migration version mismatch" - Database migration issue
❌ "Failed to publish event" - RabbitMQ connectivity issue
```
---
## Escalation Procedure

### Level 1: DevOps On-Call (0-30 minutes)

- Check service health
- Review logs for obvious errors
- Restart services if crashed
- Manually trigger schedulers if needed
- Monitor for resolution
### Level 2: Backend Team (30-60 minutes)

- Investigate complex errors
- Check database issues
- Review scheduler logic
- Coordinate with other teams (if an external service is at fault)
### Level 3: Engineering Lead (> 60 minutes)

- Major architectural issues
- Database corruption or loss
- Multi-service cascading failures
- Decisions on emergency fixes vs. scheduled maintenance
---
## Testing After Deployment

### Post-Deployment Checklist

```bash
# 1. Verify the services are running
kubectl get pods -n production
kubectl get pods -n orders

# 2. Check health endpoints
curl http://production-service:8000/health
curl http://orders-service:8000/health

# 3. Verify the schedulers are configured
kubectl logs -n production deployment/production-service | \
  grep "scheduled jobs configured"

# 4. Manually trigger a test run
curl -X POST http://production-service:8000/test/production-scheduler \
  -H "Authorization: Bearer $ADMIN_TOKEN"

curl -X POST http://orders-service:8000/test/procurement-scheduler \
  -H "Authorization: Bearer $ADMIN_TOKEN"

# 5. Verify the test run completed successfully
kubectl logs -n production deployment/production-service | \
  grep "Production schedule created successfully"

kubectl logs -n orders deployment/orders-service | \
  grep "Procurement plan generated successfully"

# 6. Check the metrics dashboard
# Visit: http://grafana:3000/d/production-planning
```
---
## Known Issues & Workarounds

### Issue: Scheduler runs twice in a distributed setup

**Symptom:** Duplicate schedules/plans for the same tenant and date

**Cause:** Leader election not working (RabbitMQ connection issue)

**Workaround:**

```bash
# Temporarily scale down to a single instance
kubectl scale deployment/production-service --replicas=1 -n production
kubectl scale deployment/orders-service --replicas=1 -n orders

# Fix RabbitMQ connectivity
# Then scale back up
kubectl scale deployment/production-service --replicas=3 -n production
kubectl scale deployment/orders-service --replicas=3 -n orders
```
### Issue: Timezone shows wrong time

**Symptom:** Schedules generated at the wrong hour

**Cause:** Tenant timezone not configured or incorrect

**Workaround:**

```sql
-- Check the tenant's timezone
SELECT id, name, timezone FROM tenants WHERE id = '{tenant_id}';

-- Update if incorrect
UPDATE tenants SET timezone = 'Europe/Madrid' WHERE id = '{tenant_id}';

-- Verify the server uses UTC
-- In the container: date (should show UTC)
```
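Before running the UPDATE, it is worth validating that the value is a real IANA zone name, since a typo silently breaks scheduling. A small stdlib check (assumes tzdata is available in the environment):

```python
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError

def is_valid_timezone(name: str) -> bool:
    """True if `name` is a resolvable IANA timezone like 'Europe/Madrid'."""
    try:
        ZoneInfo(name)
        return True
    except (ZoneInfoNotFoundError, ValueError):
        return False

print(is_valid_timezone("Europe/Madrid"))  # True
print(is_valid_timezone("Europe/Madird"))  # False (typo)
```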
### Issue: Forecast cache always misses

**Symptom:** `forecast_cache_hit_rate = 0%`

**Cause:** Redis not accessible or REDIS_URL misconfigured

**Workaround:**

```bash
# Check the REDIS_URL environment variable
kubectl get deployment forecasting-service -n forecasting -o yaml | \
  grep REDIS_URL

# Should be: redis://redis:6379/0

# If incorrect, update it:
kubectl set env deployment/forecasting-service \
  REDIS_URL=redis://redis:6379/0 -n forecasting
```
---
## Additional Resources

- **Full Documentation:** [PRODUCTION_PLANNING_SYSTEM.md](./PRODUCTION_PLANNING_SYSTEM.md)
- **Metrics File:** [`shared/monitoring/scheduler_metrics.py`](../shared/monitoring/scheduler_metrics.py)
- **Scheduler Code:**
  - Production: [`services/production/app/services/production_scheduler_service.py`](../services/production/app/services/production_scheduler_service.py)
  - Procurement: [`services/orders/app/services/procurement_scheduler_service.py`](../services/orders/app/services/procurement_scheduler_service.py)
- **Forecast Cache:** [`services/forecasting/app/services/forecast_cache.py`](../services/forecasting/app/services/forecast_cache.py)
---
**Runbook Version:** 1.0
**Last Updated:** 2025-10-09
**Maintained By:** Backend Team