# Production Planning Scheduler Runbook

**Quick Reference Guide for DevOps & Support Teams**

---

## Quick Links

- [Full Documentation](./PRODUCTION_PLANNING_SYSTEM.md)
- [Metrics Dashboard](http://grafana:3000/d/production-planning)
- [Logs](http://kibana:5601)
- [Alerts](http://alertmanager:9093)

---

## Emergency Contacts

| Role | Contact | Availability |
|------|---------|--------------|
| **Backend Lead** | #backend-team | 24/7 |
| **DevOps On-Call** | #devops-oncall | 24/7 |
| **Product Owner** | TBD | Business hours |

---

## Scheduler Overview

| Scheduler | Time | What It Does |
|-----------|------|--------------|
| **Production** | 5:30 AM (tenant timezone) | Creates daily production schedules |
| **Procurement** | 6:00 AM (tenant timezone) | Creates daily procurement plans |

**Critical:** Both schedulers MUST complete successfully every morning, or bakeries won't have production/procurement plans for the day!

---

## Common Incidents & Solutions

### 🔴 CRITICAL: Scheduler Completely Failed

**Alert:** `SchedulerUnhealthy` or `NoProductionSchedulesGenerated`

**Impact:** HIGH - No plans generated for any tenant

**Immediate Actions (< 5 minutes):**

```bash
# 1. Check if service is running
kubectl get pods -n production | grep production-service
kubectl get pods -n orders | grep orders-service

# 2. Check recent logs for errors
kubectl logs -n production deployment/production-service --tail=100 | grep ERROR
kubectl logs -n orders deployment/orders-service --tail=100 | grep ERROR

# 3. Restart service if frozen/crashed
kubectl rollout restart deployment/production-service -n production
kubectl rollout restart deployment/orders-service -n orders

# 4. Wait 2 minutes for scheduler to initialize, then manually trigger
curl -X POST http://production-service:8000/test/production-scheduler \
  -H "Authorization: Bearer $ADMIN_TOKEN"

curl -X POST http://orders-service:8000/test/procurement-scheduler \
  -H "Authorization: Bearer $ADMIN_TOKEN"
```

**Follow-up Actions:**
- Check RabbitMQ health (leader election depends on it)
- Review database connectivity
- Check resource limits (CPU/memory)
- Monitor metrics for successful generation
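
The fixed two-minute wait in step 4 can be replaced by polling the health endpoint until it responds. The following is a minimal sketch, assuming `/health` returns an HTTP 2xx status once the service is ready (the runbook does not confirm this). `wait_for_healthy` and its parameters are hypothetical names, not part of the services.

```bash
# Hypothetical helper: poll a health URL until it answers with HTTP 2xx.
# Assumption: /health returns a 2xx status once the service is ready.
# Usage: wait_for_healthy URL [MAX_ATTEMPTS] [SLEEP_SECONDS]
wait_for_healthy() {
  url=$1
  max_attempts=${2:-24}   # 24 attempts x 5 s = up to about 2 minutes
  sleep_seconds=${3:-5}
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # -f: treat HTTP >= 400 as failure; -sS: quiet but still print errors
    if curl -fsS --max-time 5 -o /dev/null "$url"; then
      echo "healthy after $attempt attempt(s): $url"
      return 0
    fi
    # Only sleep if another attempt will follow
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$sleep_seconds"
    fi
    attempt=$((attempt + 1))
  done
  echo "still unhealthy after $max_attempts attempts: $url" >&2
  return 1
}
```

For example, `wait_for_healthy http://production-service:8000/health` could run before the manual trigger, so the `curl -X POST` only fires once the service answers.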

---

### 🟠 HIGH: Single Tenant Failed

**Alert:** `DailyProductionPlanningFailed{tenant_id="abc-123"}`

**Impact:** MEDIUM - One bakery affected

**Immediate Actions (< 10 minutes):**

```bash
# 1. Check logs for specific tenant
kubectl logs -n production deployment/production-service --tail=500 | \
  grep "tenant_id=abc-123" | grep ERROR

# 2. Common causes:
#    - Tenant database connection issue
#    - External service timeout (Forecasting, Inventory)
#    - Invalid data (e.g., missing products)

# 3. Manually retry by re-running the scheduler
#    (the scheduler skips tenants that already have schedules,
#    so only tenants without one are reprocessed)
curl -X POST http://production-service:8000/test/production-scheduler \
  -H "Authorization: Bearer $ADMIN_TOKEN"

# 4. If still failing, check tenant-specific issues:
# - Verify tenant exists and is active
# - Check tenant's inventory has products
# - Check forecasting service can access tenant data
```

**Follow-up Actions:**
- Contact tenant to understand their setup
- Review tenant data quality
- Check if tenant is new (may need initial setup)
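
The `grep "tenant_id=abc-123"` in step 1 is a substring match, so it would also match a tenant such as `abc-1234`. The sketch below matches the field exactly. It assumes log lines carry a whitespace-separated `tenant_id=<id>` field, which the runbook implies but does not specify. `tenant_errors` is a hypothetical helper name.

```bash
# Hypothetical helper: read service logs on stdin and print only the ERROR
# lines whose tenant_id field equals the given id exactly.
# Assumption: fields are whitespace-separated, e.g. "... tenant_id=abc-123 ...".
# Usage: kubectl logs ... | tenant_errors abc-123
tenant_errors() {
  awk -v want="tenant_id=$1" '
    /ERROR/ {
      # Compare every field so abc-123 does not match abc-1234
      for (i = 1; i <= NF; i++) if ($i == want) { print; next }
    }'
}
```

Piping the result through `wc -l` gives a quick error count for the tenant.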

---

### 🟡 MEDIUM: Scheduler Running Slow

**Alert:** `production_schedule_generation_duration_seconds > 120s`

**Impact:** LOW - Scheduler completes but takes longer than expected

**Immediate Actions (< 15 minutes):**

```bash
# 1. Check current execution time
kubectl logs -n production deployment/production-service --tail=100 | \
  grep "production planning completed"

# 2. Check database query performance
# Look for slow query logs in PostgreSQL

# 3. Check external service response times
# - Forecasting Service health: curl http://forecasting-service:8000/health
# - Inventory Service health: curl http://inventory-service:8000/health
# - Orders Service health: curl http://orders-service:8000/health

# 4. Check CPU/memory usage
kubectl top pods -n production | grep production-service
kubectl top pods -n orders | grep orders-service
```

**Follow-up Actions:**
- Consider increasing timeout if consistently near limit
- Optimize slow database queries
- Scale external services if overloaded
- Review tenant count (may need to process fewer in parallel)
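
Step 3 asks for external service response times, but a plain `curl` against `/health` only shows whether the service answers. The sketch below prints how long each request took, using curl's `-w '%{time_total}'` output variable. `health_latency` is a hypothetical helper; the service hostnames are the ones used in the comments above.

```bash
# Hypothetical helper: print the total response time of each dependency's
# /health endpoint, or "unreachable" on connection failure or timeout.
health_latency() {
  for svc in forecasting-service inventory-service orders-service; do
    # -w '%{time_total}' prints the full request time in seconds;
    # -o /dev/null discards the response body.
    if t=$(curl -s -o /dev/null --max-time 10 -w '%{time_total}' \
             "http://${svc}:8000/health"); then
      echo "${svc} ${t}s"
    else
      echo "${svc} unreachable"
    fi
  done
}
```

A service that responds but takes several seconds is a likely contributor to a slow scheduler run.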

---

### 🟡 MEDIUM: Low Forecast Cache Hit Rate

**Alert:** `ForecastCacheHitRateLow < 50%`

**Impact:** LOW - Increased load on Forecasting Service, slower responses

**Immediate Actions (< 10 minutes):**

```bash
# 1. Check Redis is running
kubectl get pods -n redis | grep redis
redis-cli ping  # Should return PONG

# 2. Check cache statistics
curl http://forecasting-service:8000/api/v1/{tenant_id}/forecasting/cache/stats \
  -H "Authorization: Bearer $ADMIN_TOKEN"

# 3. Count cache keys (--scan iterates with SCAN and does not block Redis like KEYS)
redis-cli --scan --pattern "forecast:*" | wc -l  # Should have many entries

# 4. Check Redis memory
redis-cli INFO memory | grep used_memory_human

# 5. If cache is empty or Redis is down, restart Redis
kubectl rollout restart statefulset/redis -n redis
```

**Follow-up Actions:**
- Monitor cache rebuild (should hit ~80-90% within 1 day)
- Check Redis configuration (memory limits, eviction policy)
- Review forecast TTL settings
- Check for cache invalidation bugs
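
As a cross-check on the application-level metric, Redis itself reports `keyspace_hits` and `keyspace_misses` in `INFO stats`. The sketch below turns those counters into a percentage. It covers every key in the Redis instance, not only `forecast:*`, so it is only a rough proxy. `redis_hit_rate` is a hypothetical helper name.

```bash
# Hypothetical helper: compute the Redis-wide keyspace hit rate (percent)
# from `redis-cli INFO stats` output read on stdin.
# Note: counters cover all keys in the instance, not just "forecast:*".
# Usage: redis-cli INFO stats | redis_hit_rate
redis_hit_rate() {
  # INFO output uses CRLF line endings, so strip the carriage returns first
  tr -d '\r' | awk -F: '
    $1 == "keyspace_hits"   { hits = $2 }
    $1 == "keyspace_misses" { misses = $2 }
    END {
      total = hits + misses
      if (total == 0) { print "n/a"; exit }
      printf "%.1f\n", 100 * hits / total
    }'
}
```

The counters accumulate since the last Redis restart, so after step 5 the value starts again from zero.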

---

### 🟢 LOW: Plan Rejected by User

**Alert:** `procurement_plan_rejections_total` increasing

**Impact:** LOW - Normal user workflow

**Actions (< 5 minutes):**

```bash
# 1. Check rejection logs for patterns
kubectl logs -n orders deployment/orders-service --tail=200 | \
  grep "plan rejection"

# 2. Check if auto-regeneration triggered
kubectl logs -n orders deployment/orders-service --tail=200 | \
  grep "Auto-regenerating plan"

# 3. Verify rejection notification sent
# Check RabbitMQ queue: procurement.plan.rejected

# 4. If rejection notes mention "stale" or "outdated", plan will auto-regenerate
# Otherwise, user needs to manually regenerate or modify plan
```

**Follow-up Actions:**
- Review rejection reasons for trends
- Consider user training if many rejections
- Improve plan accuracy if consistent issues
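
The rule in step 4 can be written down as a small check when triaging rejection notes by hand. This is only an illustration of the rule as stated here. It assumes a case-insensitive substring match, which the runbook does not specify, and `will_auto_regenerate` is a hypothetical name, not a function of the orders service.

```bash
# Hypothetical helper: apply the rule from step 4 to a rejection note.
# Assumption: case-insensitive substring match; the real service may differ.
# Returns 0 if the note suggests auto-regeneration, 1 otherwise.
will_auto_regenerate() {
  note=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$note" in
    *stale*|*outdated*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `will_auto_regenerate "Prices are outdated" && echo "expect auto-regeneration"` prints the message, while a note such as "Too expensive" does not.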

---

## Health Check Commands

### Quick Service Health Check

```bash
# Production Service
curl http://production-service:8000/health | jq .

# Orders Service
curl http://orders-service:8000/health | jq .

# Forecasting Service
curl http://forecasting-service:8000/health | jq .

# Redis
redis-cli ping

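# RabbitMQ (by default the guest user may only connect from localhost;
# substitute real credentials when calling the API remotely)
curl http://rabbitmq:15672/api/health/checks/alarms \
  -u guest:guest | jq .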
```

### Detailed Scheduler Status

```bash
# Check last scheduler run time
curl http://production-service:8000/health | \
  jq '.custom_checks.scheduler_service'

# Check APScheduler job status (requires internal access)
# Look for: scheduler.get_jobs() output in logs
kubectl logs -n production deployment/production-service | \
  grep "scheduled jobs configured"
```

### Database Connectivity

```bash
# Check production database (print() makes the health_check() result visible)
kubectl exec -it deployment/production-service -n production -- \
  python -c "from app.core.database import database_manager; \
             import asyncio; \
             print(asyncio.run(database_manager.health_check()))"

# Check orders database
kubectl exec -it deployment/orders-service -n orders -- \
  python -c "from app.core.database import database_manager; \
             import asyncio; \
             print(asyncio.run(database_manager.health_check()))"
```

---

## Maintenance Procedures

### Disable Schedulers (Maintenance Mode)

```bash
# 1. Set environment variable to disable schedulers
kubectl set env deployment/production-service SCHEDULER_DISABLED=true -n production
kubectl set env deployment/orders-service SCHEDULER_DISABLED=true -n orders

# 2. Wait for pods to restart
kubectl rollout status deployment/production-service -n production
kubectl rollout status deployment/orders-service -n orders

# 3. Verify schedulers are disabled (check logs)
kubectl logs -n production deployment/production-service | grep "Scheduler disabled"
```

### Re-enable Schedulers (After Maintenance)

```bash
# 1. Remove environment variable
kubectl set env deployment/production-service SCHEDULER_DISABLED- -n production
kubectl set env deployment/orders-service SCHEDULER_DISABLED- -n orders

# 2. Wait for pods to restart
kubectl rollout status deployment/production-service -n production
kubectl rollout status deployment/orders-service -n orders

# 3. Manually trigger to catch up (if during scheduled time)
curl -X POST http://production-service:8000/test/production-scheduler \
  -H "Authorization: Bearer $ADMIN_TOKEN"

curl -X POST http://orders-service:8000/test/procurement-scheduler \
  -H "Authorization: Bearer $ADMIN_TOKEN"
```

### Clear Forecast Cache

```bash
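# Clear all forecast cache (will rebuild automatically)
# --scan avoids the blocking KEYS command; xargs -r (GNU) skips DEL when nothing matches
redis-cli --scan --pattern "forecast:*" | xargs -r redis-cli DEL

# Clear specific tenant's cache
redis-cli --scan --pattern "forecast:{tenant_id}:*" | xargs -r redis-cli DEL

# Verify cache cleared (should print 0; DBSIZE would also count non-forecast keys)
redis-cli --scan --pattern "forecast:*" | wc -l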
```

---

## Metrics to Monitor

### Production Scheduler

```promql
# Success rate (should be > 95%)
# sum() drops the status label so both sides of the division match
sum(rate(production_schedules_generated_total{status="success"}[5m])) /
sum(rate(production_schedules_generated_total[5m]))

# 95th-percentile generation time (should be < 60s)
histogram_quantile(0.95,
  rate(production_schedule_generation_duration_seconds_bucket[5m]))

# Failed tenants (should be 0)
increase(production_tenants_processed_total{status="failure"}[5m])
```

### Procurement Scheduler

```promql
# Success rate (should be > 95%)
# sum() drops the status label so both sides of the division match
sum(rate(procurement_plans_generated_total{status="success"}[5m])) /
sum(rate(procurement_plans_generated_total[5m]))

# 95th-percentile generation time (should be < 60s)
histogram_quantile(0.95,
  rate(procurement_plan_generation_duration_seconds_bucket[5m]))

# Failed tenants (should be 0)
increase(procurement_tenants_processed_total{status="failure"}[5m])
```

### Forecast Cache

```promql
# Cache hit rate (should be > 70%)
forecast_cache_hit_rate

# Cache hits per minute (rate() is per second, hence * 60)
rate(forecast_cache_hits_total[5m]) * 60

# Cache misses per minute
rate(forecast_cache_misses_total[5m]) * 60
```

---

## Log Patterns to Watch

### Success Patterns

```
✅ "Daily production planning completed" - All tenants processed
✅ "Production schedule created successfully" - Individual tenant success
✅ "Forecast cache HIT" - Cache working correctly
✅ "Production scheduler service started" - Service initialized
```

### Warning Patterns

```
⚠️ "Tenant processing timed out" - Individual tenant taking too long
⚠️ "Forecast cache MISS" - Cache miss (some are expected, but not on every request)
⚠️ "Approving plan older than 24 hours" - Stale plan being approved
⚠️ "Could not fetch tenant timezone" - Timezone configuration issue
```

### Error Patterns

```
❌ "Daily production planning failed completely" - Complete failure
❌ "Error processing tenant production" - Tenant-specific failure
❌ "Forecast cache Redis connection failed" - Cache unavailable
❌ "Migration version mismatch" - Database migration issue
❌ "Failed to publish event" - RabbitMQ connectivity issue
```
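
To get a quick overview of a log excerpt, the patterns above can be counted per category. The sketch below does this with `awk` on logs read from stdin. `summarize_patterns` is a hypothetical helper, and it assumes the messages appear verbatim in the log lines.

```bash
# Hypothetical helper: count success, warning and error patterns in logs
# read from stdin. The message strings are the ones listed above.
# Usage: kubectl logs -n production deployment/production-service --tail=1000 | summarize_patterns
summarize_patterns() {
  awk '
    /Daily production planning completed|Production schedule created successfully|Forecast cache HIT|Production scheduler service started/ { ok++ }
    /Tenant processing timed out|Forecast cache MISS|Approving plan older than 24 hours|Could not fetch tenant timezone/ { warn++ }
    /Daily production planning failed completely|Error processing tenant production|Forecast cache Redis connection failed|Migration version mismatch|Failed to publish event/ { err++ }
    END { printf "success=%d warning=%d error=%d\n", ok, warn, err }'
}
```

A non-zero `error` count is the cue to go back to the matching incident section above.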

---

## Escalation Procedure

### Level 1: DevOps On-Call (0-30 minutes)

- Check service health
- Review logs for obvious errors
- Restart services if crashed
- Manually trigger schedulers if needed
- Monitor for resolution

### Level 2: Backend Team (30-60 minutes)

- Investigate complex errors
- Check database issues
- Review scheduler logic
- Coordinate with other teams (if external service issue)

### Level 3: Engineering Lead (> 60 minutes)

- Major architectural issues
- Database corruption or loss
- Multi-service cascading failures
- Decisions on emergency fixes vs. scheduled maintenance

---

## Testing After Deployment

### Post-Deployment Checklist

```bash
# 1. Verify services are running
kubectl get pods -n production
kubectl get pods -n orders

# 2. Check health endpoints
curl http://production-service:8000/health
curl http://orders-service:8000/health

# 3. Verify schedulers are configured
kubectl logs -n production deployment/production-service | \
  grep "scheduled jobs configured"

# 4. Manually trigger test run
curl -X POST http://production-service:8000/test/production-scheduler \
  -H "Authorization: Bearer $ADMIN_TOKEN"

curl -X POST http://orders-service:8000/test/procurement-scheduler \
  -H "Authorization: Bearer $ADMIN_TOKEN"

# 5. Verify test run completed successfully
kubectl logs -n production deployment/production-service | \
  grep "Production schedule created successfully"

kubectl logs -n orders deployment/orders-service | \
  grep "Procurement plan generated successfully"

# 6. Check metrics dashboard
# Visit: http://grafana:3000/d/production-planning
```

---

## Known Issues & Workarounds

### Issue: Scheduler runs twice in distributed setup

**Symptom:** Duplicate schedules/plans for same tenant and date

**Cause:** Leader election not working (RabbitMQ connection issue)

**Workaround:**
```bash
# Temporarily scale to single instance
kubectl scale deployment/production-service --replicas=1 -n production
kubectl scale deployment/orders-service --replicas=1 -n orders

# Fix RabbitMQ connectivity
# Then scale back up
kubectl scale deployment/production-service --replicas=3 -n production
kubectl scale deployment/orders-service --replicas=3 -n orders
```

### Issue: Timezone shows wrong time

**Symptom:** Schedules generated at wrong hour

**Cause:** Tenant timezone not configured or incorrect

**Workaround:**
```sql
-- Check tenant timezone
SELECT id, name, timezone FROM tenants WHERE id = '{tenant_id}';

-- Update if incorrect
UPDATE tenants SET timezone = 'Europe/Madrid' WHERE id = '{tenant_id}';

-- Verify server uses UTC
-- In container: date (should show UTC)
```
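
When checking whether a schedule ran at the wrong hour, it helps to know which UTC time the tenant's local 5:30 AM corresponds to, since the servers are expected to run in UTC. The sketch below converts a local wall-clock time to UTC. It requires GNU `date` (it relies on the `TZ="..."` prefix inside the date string), and `local_to_utc` is a hypothetical helper name.

```bash
# Hypothetical helper: print the UTC time at which a local wall-clock time
# occurs in a given timezone on a given day. Requires GNU date.
# Usage: local_to_utc TIMEZONE YYYY-MM-DD HH:MM
local_to_utc() {
  # The TZ="..." prefix tells GNU date which zone the input time is in;
  # -u prints the result in UTC.
  date -u -d "TZ=\"$1\" $2 $3" '+%Y-%m-%d %H:%M UTC'
}
```

For example, `local_to_utc Europe/Madrid 2025-10-09 05:30` prints `2025-10-09 03:30 UTC`, because Madrid is on UTC+2 on that date. Comparing this value with the timestamps in the scheduler logs shows whether the tenant's configured timezone was applied.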

### Issue: Forecast cache always misses

**Symptom:** `forecast_cache_hit_rate = 0%`

**Cause:** Redis not accessible or REDIS_URL misconfigured

**Workaround:**
```bash
# Check REDIS_URL environment variable
# (-A1 also prints the "value:" line that follows the variable name)
kubectl get deployment forecasting-service -n forecasting -o yaml | \
  grep -A1 REDIS_URL

# Should be: redis://redis:6379/0

# If incorrect, update:
kubectl set env deployment/forecasting-service \
  REDIS_URL=redis://redis:6379/0 -n forecasting
```

---

## Additional Resources

- **Full Documentation:** [PRODUCTION_PLANNING_SYSTEM.md](./PRODUCTION_PLANNING_SYSTEM.md)
- **Metrics File:** [`shared/monitoring/scheduler_metrics.py`](../shared/monitoring/scheduler_metrics.py)
- **Scheduler Code:**
  - Production: [`services/production/app/services/production_scheduler_service.py`](../services/production/app/services/production_scheduler_service.py)
  - Procurement: [`services/orders/app/services/procurement_scheduler_service.py`](../services/orders/app/services/procurement_scheduler_service.py)
- **Forecast Cache:** [`services/forecasting/app/services/forecast_cache.py`](../services/forecasting/app/services/forecast_cache.py)

---

- **Runbook Version:** 1.0
- **Last Updated:** 2025-10-09
- **Maintained By:** Backend Team