Production Planning System Documentation
Overview
The Production Planning System automates daily production and procurement scheduling for bakery operations. The system consists of two primary schedulers that run every morning to generate plans based on demand forecasts, inventory levels, and capacity constraints.
Last Updated: 2025-10-09
Version: 2.0 (Automated Scheduling)
Status: Production Ready
Architecture
System Components
┌─────────────────────────────────────────────────────────────────┐
│ DAILY PLANNING WORKFLOW │
└─────────────────────────────────────────────────────────────────┘
05:30 AM → Production Scheduler
├─ Generates production schedules for all tenants
├─ Calls Forecasting Service (cached) for demand
├─ Calls Orders Service for demand requirements
├─ Creates production batches
└─ Sends notifications to production managers
06:00 AM → Procurement Scheduler
├─ Generates procurement plans for all tenants
├─ Calls Forecasting Service (cached - reuses cached data!)
├─ Calls Inventory Service for stock levels
├─ Matches suppliers for requirements
└─ Sends notifications to procurement managers
08:00 AM → Operators review plans
├─ Accept → Plans move to "approved" status
├─ Reject → Automatic regeneration if stale data detected
└─ Modify → Recalculate and resubmit
Throughout Day → Alert services monitor execution
├─ Production delays
├─ Capacity issues
├─ Quality problems
└─ Equipment failures
Services Involved
| Service | Role | Endpoints |
|---|---|---|
| Production Service | Generates daily production schedules | POST /api/v1/{tenant_id}/production/operations/schedule |
| Orders Service | Generates daily procurement plans | POST /api/v1/{tenant_id}/orders/operations/procurement/generate |
| Forecasting Service | Provides demand predictions (cached) | POST /api/v1/{tenant_id}/forecasting/operations/single |
| Inventory Service | Provides current stock levels | GET /api/v1/{tenant_id}/inventory/products |
| Tenant Service | Provides timezone configuration | GET /api/v1/tenants/{tenant_id} |
Schedulers
1. Production Scheduler
Service: Production Service
Class: ProductionSchedulerService
File: services/production/app/services/production_scheduler_service.py
Schedule
| Job | Time | Purpose | Grace Period |
|---|---|---|---|
| Daily Production Planning | 5:30 AM (tenant timezone) | Generate next-day production schedules | 5 minutes |
| Stale Schedule Cleanup | 5:50 AM | Archive/cancel old schedules, send escalations | 5 minutes |
| Test Mode | Every 30 min (DEBUG only) | Development/testing | 5 minutes |
Features
- ✅ Timezone-aware: Respects tenant timezone configuration
- ✅ Leader election: Only one instance runs in distributed deployment
- ✅ Idempotent: Checks if schedule exists before creating
- ✅ Parallel processing: Processes tenants concurrently with timeouts
- ✅ Error isolation: Tenant failures don't affect others
- ✅ Demo tenant filtering: Excludes demo tenants from automation
Workflow
- Tenant Discovery: Fetch all active non-demo tenants
- Parallel Processing: Process each tenant concurrently (180s timeout)
- Date Calculation: Use tenant timezone to determine target date
- Duplicate Check: Skip if schedule already exists
- Requirements Calculation: Call calculate_daily_requirements()
- Schedule Creation: Create schedule with status "draft"
- Batch Generation: Create production batches from requirements
- Notification: Send alert to production managers
- Monitoring: Record metrics for observability
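The parallel-processing and error-isolation steps above can be sketched with asyncio. This is a minimal illustration, not the actual ProductionSchedulerService code: `process_tenant`, `run_for_all`, and the tenant names are hypothetical, and the timeout is shortened from 180s so the example finishes quickly.

```python
import asyncio


async def process_tenant(tenant_id: str, delay: float) -> str:
    """Hypothetical stand-in for generating one tenant's schedule."""
    await asyncio.sleep(delay)
    if tenant_id == "broken":
        raise RuntimeError("requirements calculation failed")
    return "draft"


async def run_for_all(tenants: dict[str, float], timeout_s: float) -> dict[str, str]:
    """Process tenants concurrently; a slow or failing tenant does not block the rest."""

    async def guarded(tenant_id: str, delay: float) -> tuple[str, str]:
        try:
            # Per-tenant timeout (the document uses 180s for production).
            status = await asyncio.wait_for(process_tenant(tenant_id, delay), timeout_s)
            return tenant_id, status
        except asyncio.TimeoutError:
            return tenant_id, "timeout"
        except Exception as exc:  # error isolation: record the failure and continue
            return tenant_id, f"error: {exc}"

    pairs = await asyncio.gather(*(guarded(t, d) for t, d in tenants.items()))
    return dict(pairs)


results = asyncio.run(
    run_for_all({"bakery-a": 0.01, "slow": 0.5, "broken": 0.0}, timeout_s=0.1)
)
print(results)
```

Catching exceptions inside each task, rather than around `gather`, is what keeps one tenant's failure from cancelling the others.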
Configuration
# Environment Variables
PRODUCTION_TEST_MODE=false # Enable 30-minute test job
DEBUG=false # Enable verbose logging
# Tenant Configuration
tenant.timezone=Europe/Madrid # IANA timezone string
2. Procurement Scheduler
Service: Orders Service
Class: ProcurementSchedulerService
File: services/orders/app/services/procurement_scheduler_service.py
Schedule
| Job | Time | Purpose | Grace Period |
|---|---|---|---|
| Daily Procurement Planning | 6:00 AM (tenant timezone) | Generate next-day procurement plans | 5 minutes |
| Stale Plan Cleanup | 6:30 AM | Archive/cancel old plans, send reminders | 5 minutes |
| Weekly Optimization | Monday 7:00 AM | Weekly procurement optimization review | 10 minutes |
| Test Mode | Every 30 min (DEBUG only) | Development/testing | 5 minutes |
Features
- ✅ Timezone-aware: Respects tenant timezone configuration
- ✅ Leader election: Prevents duplicate runs
- ✅ Idempotent: Checks if plan exists before generating
- ✅ Parallel processing: 120s timeout per tenant
- ✅ Forecast fallback: Uses historical data if forecast unavailable
- ✅ Critical stock alerts: Automatic alerts for zero-stock items
- ✅ Rejection workflow: Auto-regeneration for rejected plans
Workflow
- Tenant Discovery: Fetch active non-demo tenants
- Parallel Processing: Process each tenant (120s timeout)
- Date Calculation: Use tenant timezone
- Duplicate Check: Skip if plan exists (unless force_regenerate)
- Forecasting: Call Forecasting Service (uses cache!)
- Inventory Check: Get current stock levels
- Requirements Calculation: Calculate net requirements
- Supplier Matching: Find suitable suppliers
- Plan Creation: Create plan with status "draft"
- Critical Alerts: Send alerts for critical items
- Notification: Notify procurement managers
- Caching: Cache plan in Redis (6h TTL)
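The requirements-calculation and critical-alert steps above might look roughly like the sketch below. The formula (forecast demand plus safety stock minus current stock, floored at zero) is an assumption made for illustration; the document does not state how the Orders Service computes net requirements. `Requirement` and its fields are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Requirement:
    """Hypothetical per-product input to the procurement plan."""

    product_id: str
    forecast_demand: float
    current_stock: float
    safety_stock: float = 0.0

    @property
    def net_quantity(self) -> float:
        # Assumed formula: order only what stock on hand does not cover.
        return max(0.0, self.forecast_demand + self.safety_stock - self.current_stock)

    @property
    def is_critical(self) -> bool:
        # The document describes critical alerts for zero-stock items.
        return self.current_stock <= 0


flour = Requirement("flour", forecast_demand=120, current_stock=50, safety_stock=10)
sugar = Requirement("sugar", forecast_demand=20, current_stock=100)
yeast = Requirement("yeast", forecast_demand=5, current_stock=0)
print(flour.net_quantity, sugar.net_quantity, yeast.is_critical)
```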
Forecast Caching
Overview
To eliminate redundant forecast computations, the Forecasting Service now includes a service-level Redis cache. Both Production and Procurement schedulers benefit from this without any code changes.
File: services/forecasting/app/services/forecast_cache.py
Cache Strategy
Key Format: forecast:{tenant_id}:{product_id}:{forecast_date}
TTL: Until midnight of day after forecast_date
Example: forecast:abc-123:prod-456:2025-10-10 → expires 2025-10-11 00:00:00
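As a small illustration of the key format and TTL rule above, the sketch below builds the key and computes the remaining lifetime. The function names are hypothetical, and it uses naive datetimes because the document does not say which timezone "midnight" refers to.

```python
from datetime import date, datetime, time, timedelta


def forecast_cache_key(tenant_id: str, product_id: str, forecast_date: date) -> str:
    """Build a key in the documented format."""
    return f"forecast:{tenant_id}:{product_id}:{forecast_date.isoformat()}"


def forecast_expiry(forecast_date: date) -> datetime:
    """Midnight at the start of the day after forecast_date."""
    return datetime.combine(forecast_date + timedelta(days=1), time.min)


def ttl_seconds(forecast_date: date, now: datetime) -> int:
    """Seconds until expiry, never negative."""
    remaining = forecast_expiry(forecast_date) - now
    return max(0, int(remaining.total_seconds()))


target = date(2025, 10, 10)
key = forecast_cache_key("abc-123", "prod-456", target)
ttl = ttl_seconds(target, now=datetime(2025, 10, 10, 5, 30))
print(key, forecast_expiry(target), ttl)
```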
Cache Flow
Client Request → Forecasting API
↓
Check Redis Cache
├─ HIT → Return cached result (add 'cached: true')
└─ MISS → Generate forecast
↓
Cache result (TTL)
↓
Return result
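The flow above is a cache-aside pattern. The sketch below shows the idea with an in-memory dict standing in for Redis, so it omits TTL handling; `ForecastCacheSketch` and `get_or_generate` are hypothetical names, not the API of forecast_cache.py.

```python
from typing import Callable


class ForecastCacheSketch:
    """In-memory stand-in for the Redis-backed forecast cache (no TTL)."""

    def __init__(self) -> None:
        self._store: dict[str, dict] = {}
        self.hits = 0
        self.misses = 0

    def get_or_generate(self, key: str, generate: Callable[[], dict]) -> dict:
        cached = self._store.get(key)
        if cached is not None:
            # HIT: return the stored result, flagged as cached.
            self.hits += 1
            return {**cached, "cached": True}
        # MISS: generate the forecast, store it, then return it.
        self.misses += 1
        result = generate()
        self._store[key] = result
        return {**result, "cached": False}

    @property
    def hit_rate_percent(self) -> float:
        total = self.hits + self.misses
        return 100.0 * self.hits / total if total else 0.0


cache = ForecastCacheSketch()
key = "forecast:abc-123:prod-456:2025-10-10"
first = cache.get_or_generate(key, lambda: {"predicted_demand": 42.0})
second = cache.get_or_generate(key, lambda: {"predicted_demand": 42.0})
print(first, second, cache.hit_rate_percent)
```

Here the first call plays the role of the Production Scheduler's request and the second the Procurement Scheduler's request for the same forecast.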
Benefits
| Metric | Before Caching | After Caching | Improvement |
|---|---|---|---|
| Duplicate Forecasts | 2x per day (Production + Procurement) | 1x per day | 50% reduction |
| Forecast Response Time | ~2-5 seconds | ~50-100ms (cache hit) | 95%+ faster |
| Forecasting Service Load | 100% | 50% | 50% reduction |
| Cache Hit Rate | N/A | ~80-90% (expected) | - |
Cache Invalidation
Forecasts are invalidated when:
- TTL Expiry: Automatic at midnight after forecast_date
- Model Retraining: When ML model is retrained for product
- Manual Invalidation: Via API endpoint (admin only)
# Invalidate specific product forecasts
DELETE /api/v1/{tenant_id}/forecasting/cache/product/{product_id}
# Invalidate all tenant forecasts
DELETE /api/v1/{tenant_id}/forecasting/cache
# Invalidate all forecasts (use with caution!)
DELETE /admin/forecasting/cache/all
Plan Rejection Workflow
Overview
When a procurement plan is rejected by an operator, the system automatically handles the rejection with notifications and optional regeneration.
File: services/orders/app/services/procurement_service.py
Rejection Flow
User Rejects Plan (status → "cancelled")
↓
Record rejection in approval_workflow (JSONB)
↓
Send notification to stakeholders
↓
Publish rejection event (RabbitMQ)
↓
Analyze rejection reason
├─ Contains "stale", "outdated", etc. → Auto-regenerate
└─ Other reason → Manual regeneration required
↓
Schedule regeneration (if applicable)
↓
Send regeneration request event
Auto-Regeneration Keywords
Plans are automatically regenerated if rejection notes contain:
- stale
- outdated
- old data
- datos antiguos (Spanish)
- desactualizado (Spanish)
- obsoleto (Spanish)
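A keyword check of this kind could be sketched as below. The function name is hypothetical, and case-insensitive substring matching is an assumption; the document does not specify how procurement_service.py compares the notes.

```python
from typing import Optional

# Keywords taken from the list above.
AUTO_REGENERATE_KEYWORDS = (
    "stale",
    "outdated",
    "old data",
    "datos antiguos",
    "desactualizado",
    "obsoleto",
)


def should_auto_regenerate(rejection_notes: Optional[str]) -> bool:
    """Return True if the notes mention any auto-regeneration keyword."""
    if not rejection_notes:
        return False
    notes = rejection_notes.casefold()  # assumed: matching ignores case
    return any(keyword in notes for keyword in AUTO_REGENERATE_KEYWORDS)


print(should_auto_regenerate("Forecast looks STALE"))
print(should_auto_regenerate("Wrong supplier selected"))
```

Plain substring matching is simple but can match inside longer words (for example, "threshold database" contains "old data"), so a word-boundary check may be worth considering if false positives matter.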
Events Published
| Event | Routing Key | Consumers |
|---|---|---|
| Plan Rejected | procurement.plan.rejected | Alert Service, UI Notifications |
| Regeneration Requested | procurement.plan.regeneration_requested | Procurement Scheduler |
| Plan Status Changed | procurement.plan.status_changed | Inventory Service, Dashboard |
Timezone Configuration
Overview
All schedulers are timezone-aware to ensure accurate "daily" execution relative to the bakery's local time.
Tenant Configuration
Model: Tenant
File: services/tenant/app/models/tenants.py
Field: timezone (String, default: "Europe/Madrid")
Migration: services/tenant/migrations/versions/20251009_add_timezone_to_tenants.py
Supported Timezones
All IANA timezone strings are supported. Common examples:
- Europe/Madrid (Spain - CEST/CET)
- Europe/London (UK - BST/GMT)
- America/New_York (US Eastern)
- America/Los_Angeles (US Pacific)
- Asia/Tokyo (Japan)
- UTC (Universal Time)
Usage in Schedulers
from shared.utils.timezone_helper import TimezoneHelper
# Get current date in tenant's timezone
target_date = TimezoneHelper.get_current_date_in_timezone(tenant_tz)
# Get current datetime in tenant's timezone
now = TimezoneHelper.get_current_datetime_in_timezone(tenant_tz)
# Check if within business hours
is_business_hours = TimezoneHelper.is_business_hours(
    timezone_str=tenant_tz,
    start_hour=8,
    end_hour=20
)
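To show why the tenant timezone matters for the target date, here is a standard-library sketch of the same idea. It is not the TimezoneHelper implementation: `resolve_zone` and `local_date` are hypothetical helpers, and falling back to UTC for an unknown zone name is an assumption.

```python
from datetime import date, datetime, timedelta, timezone, tzinfo
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError


def resolve_zone(name: str) -> tzinfo:
    """Look up an IANA zone; fall back to UTC if the name is unknown (assumed policy)."""
    try:
        return ZoneInfo(name)
    except (ZoneInfoNotFoundError, ValueError):
        return timezone.utc


def local_date(now_utc: datetime, tz: tzinfo) -> date:
    """Calendar date at the given instant, as seen in the tenant's timezone."""
    return now_utc.astimezone(tz).date()


# 22:30 UTC on 9 October is already 10 October at UTC+2.
instant = datetime(2025, 10, 9, 22, 30, tzinfo=timezone.utc)
utc_plus_2 = timezone(timedelta(hours=2))  # fixed offset standing in for Madrid summer time
print(local_date(instant, timezone.utc), local_date(instant, utc_plus_2))
```

In real use the zone would come from the tenant record, for example `local_date(datetime.now(timezone.utc), resolve_zone("Europe/Madrid"))`.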
Monitoring & Alerts
Prometheus Metrics
File: shared/monitoring/scheduler_metrics.py
Key Metrics
| Metric | Type | Description |
|---|---|---|
| production_schedules_generated_total | Counter | Total production schedules generated (by tenant, status) |
| production_schedule_generation_duration_seconds | Histogram | Time to generate schedule per tenant |
| procurement_plans_generated_total | Counter | Total procurement plans generated (by tenant, status) |
| procurement_plan_generation_duration_seconds | Histogram | Time to generate plan per tenant |
| forecast_cache_hits_total | Counter | Forecast cache hits (by tenant) |
| forecast_cache_misses_total | Counter | Forecast cache misses (by tenant) |
| forecast_cache_hit_rate | Gauge | Cache hit rate percentage (0-100) |
| procurement_plan_rejections_total | Counter | Plan rejections (by tenant, auto_regenerated) |
| scheduler_health_status | Gauge | Scheduler health (1=healthy, 0=unhealthy) |
| tenant_processing_timeout_total | Counter | Tenant processing timeouts (by service) |
Recommended Alerts
# Alert: Daily production planning failed
# increase() is used because a raw counter stays above 0 after the first failure
- alert: DailyProductionPlanningFailed
  expr: increase(production_schedules_generated_total{status="failure"}[1h]) > 0
  for: 10m
  labels:
    severity: high
  annotations:
    summary: "Daily production planning failed for at least one tenant"
    description: "Check production scheduler logs for tenant {{ $labels.tenant_id }}"
# Alert: Daily procurement planning failed
- alert: DailyProcurementPlanningFailed
  expr: increase(procurement_plans_generated_total{status="failure"}[1h]) > 0
  for: 10m
  labels:
    severity: high
  annotations:
    summary: "Daily procurement planning failed for at least one tenant"
    description: "Check procurement scheduler logs for tenant {{ $labels.tenant_id }}"
# Alert: No production schedules in 24 hours
- alert: NoProductionSchedulesGenerated
  expr: rate(production_schedules_generated_total{status="success"}[24h]) == 0
  for: 1h
  labels:
    severity: critical
  annotations:
    summary: "No production schedules generated in last 24 hours"
    description: "Production scheduler may be down or misconfigured"
# Alert: Forecast cache hit rate low
- alert: ForecastCacheHitRateLow
  expr: forecast_cache_hit_rate < 50
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Forecast cache hit rate below 50%"
    description: "Cache may not be functioning correctly for tenant {{ $labels.tenant_id }}"
# Alert: High tenant processing timeouts
- alert: HighTenantProcessingTimeouts
  expr: rate(tenant_processing_timeout_total[5m]) > 0.1
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "High rate of tenant processing timeouts"
    description: "{{ $labels.service }} scheduler experiencing timeouts for tenant {{ $labels.tenant_id }}"
# Alert: Scheduler unhealthy
- alert: SchedulerUnhealthy
  expr: scheduler_health_status == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Scheduler is unhealthy"
    description: "{{ $labels.service }} {{ $labels.scheduler_type }} scheduler is reporting unhealthy status"
Grafana Dashboard
Create dashboard with panels for:
- Scheduler Success Rate (line chart)
  - production_schedules_generated_total{status="success"}
  - procurement_plans_generated_total{status="success"}
- Schedule Generation Duration (heatmap)
  - production_schedule_generation_duration_seconds
  - procurement_plan_generation_duration_seconds
- Forecast Cache Hit Rate (gauge)
  - forecast_cache_hit_rate
- Tenant Processing Status (pie chart)
  - production_tenants_processed_total
  - procurement_tenants_processed_total
- Plan Rejections (table)
  - procurement_plan_rejections_total
- Scheduler Health (status panel)
  - scheduler_health_status
Testing
Manual Testing
Test Production Scheduler
# Trigger test production schedule generation
curl -X POST http://production-service:8000/test/production-scheduler \
  -H "Authorization: Bearer $TOKEN"
# Expected response:
{
  "message": "Production scheduler test triggered successfully"
}
Test Procurement Scheduler
# Trigger test procurement plan generation
curl -X POST http://orders-service:8000/test/procurement-scheduler \
  -H "Authorization: Bearer $TOKEN"
# Expected response:
{
  "message": "Procurement scheduler test triggered successfully"
}
Automated Testing
# Test production scheduler
async def test_production_scheduler():
    scheduler = ProductionSchedulerService(config)
    await scheduler.start()
    await scheduler.test_production_schedule_generation()
    assert scheduler._checks_performed > 0

# Test procurement scheduler
async def test_procurement_scheduler():
    scheduler = ProcurementSchedulerService(config)
    await scheduler.start()
    await scheduler.test_procurement_generation()
    assert scheduler._checks_performed > 0

# Test forecast caching
async def test_forecast_cache():
    cache = get_forecast_cache_service(redis_url)
    # Cache forecast
    await cache.cache_forecast(tenant_id, product_id, forecast_date, data)
    # Retrieve cached forecast
    cached = await cache.get_cached_forecast(tenant_id, product_id, forecast_date)
    assert cached is not None
    assert cached['cached'] == True
Troubleshooting
Scheduler Not Running
Symptoms: No schedules/plans generated in morning
Checks:
- Verify scheduler service is running: kubectl get pods -n production
- Check scheduler health endpoint: curl http://service:8000/health
- Check APScheduler status in logs: grep "scheduler" logs/production.log
- Verify leader election (distributed setup): Check is_leader in logs
Solutions:
- Restart service: kubectl rollout restart deployment/production-service
- Check environment variables: PRODUCTION_TEST_MODE, DEBUG
- Verify database connectivity
- Check RabbitMQ connectivity for leader election
Timezone Issues
Symptoms: Schedules generated at wrong time
Checks:
- Check tenant timezone configuration: SELECT id, name, timezone FROM tenants WHERE id = '{tenant_id}';
- Verify server timezone: date (should be UTC in containers)
- Check logs for timezone warnings
Solutions:
- Update tenant timezone: UPDATE tenants SET timezone = 'Europe/Madrid' WHERE id = '{tenant_id}';
- Verify TimezoneHelper is being used in schedulers
- Check cron trigger configuration uses correct timezone
Low Cache Hit Rate
Symptoms: forecast_cache_hit_rate < 50%
Checks:
- Verify Redis is running: redis-cli ping
- Check cache keys: redis-cli --scan --pattern "forecast:*" (SCAN iterates incrementally; KEYS blocks the server while it walks the whole keyspace)
- Check TTL on cache entries: redis-cli TTL "forecast:{tenant}:{product}:{date}"
- Review logs for cache errors
Solutions:
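- Restart Redis if unhealthy
- Clear cache and let it rebuild: redis-cli FLUSHDB (this deletes every key in the selected database, not only forecast entries; if other data such as cached procurement plans shares that database, delete only the forecast:* keys instead)
- Verify REDIS_URL environment variable
- Check Redis memory limits: redis-cli INFO memory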
Plan Rejection Not Auto-Regenerating
Symptoms: Rejected plans not triggering regeneration
Checks:
- Check rejection notes contain auto-regenerate keywords
- Verify RabbitMQ events are being published: Check procurement.plan.rejected queue
- Check scheduler is listening to regeneration events
Solutions:
- Use keywords like "stale" or "outdated" in rejection notes
- Manually trigger regeneration via API
- Check RabbitMQ connectivity
- Verify event routing keys are correct
Tenant Processing Timeouts
Symptoms: tenant_processing_timeout_total increasing
Checks:
- Check timeout duration (180s for production, 120s for procurement)
- Review slow queries in database logs
- Check external service response times (Forecasting, Inventory)
- Monitor CPU/memory usage during scheduler runs
Solutions:
- Increase timeout if consistently hitting limit
- Optimize database queries (add indexes)
- Scale external services if response time high
- Process fewer tenants in parallel (reduce concurrency)
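Reducing concurrency, the last solution above, can be done with a semaphore. The sketch below is illustrative only: `run_bounded` and the sleep placeholder are hypothetical, and how the real schedulers configure their concurrency limit is not described in this document.

```python
import asyncio


async def run_bounded(tenant_ids: list[str], max_concurrent: int) -> int:
    """Process all tenants, never more than max_concurrent at once.

    Returns the peak number of tenants that were in flight simultaneously.
    """
    semaphore = asyncio.Semaphore(max_concurrent)
    running = 0
    peak = 0

    async def process(tenant_id: str) -> None:
        nonlocal running, peak
        async with semaphore:  # waits here while the limit is reached
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0.01)  # placeholder for the real per-tenant work
            running -= 1

    await asyncio.gather(*(process(t) for t in tenant_ids))
    return peak


observed_peak = asyncio.run(
    run_bounded([f"tenant-{i}" for i in range(10)], max_concurrent=3)
)
print(observed_peak)
```

A lower limit trades total run time for less pressure on the database and on the Forecasting and Inventory services.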
Maintenance
Scheduled Maintenance Windows
When performing maintenance on schedulers:
- Announce downtime to users (UI banner)
- Disable schedulers temporarily: set the environment variable SCHEDULER_DISABLED=true
- Perform maintenance (database migrations, service updates)
- Re-enable schedulers: SCHEDULER_DISABLED=false
- Manually trigger missed runs if needed:
  curl -X POST http://service:8000/test/production-scheduler
  curl -X POST http://service:8000/test/procurement-scheduler
Database Migrations
When adding fields to scheduler-related tables:
- Create migration with proper rollback
- Test migration on staging environment
- Run migration during low-traffic period (3-4 AM)
- Verify scheduler still works after migration
- Monitor metrics for anomalies
Cache Maintenance
Clear Stale Cache Entries:
# Clear all forecast cache (will rebuild automatically)
redis-cli --scan --pattern "forecast:*" | xargs redis-cli DEL
# Clear specific tenant's cache
redis-cli --scan --pattern "forecast:{tenant_id}:*" | xargs redis-cli DEL
Monitor Cache Size:
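# Check number of forecast keys
redis-cli --scan --pattern "forecast:*" | wc -l
# Check total number of keys in the current database (all keys, not only forecasts)
redis-cli DBSIZE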
# Check memory usage
redis-cli INFO memory
API Reference
Production Scheduler Endpoints
POST /test/production-scheduler
Description: Manually trigger production scheduler (test mode)
Auth: Bearer token required
Response: {"message": "Production scheduler test triggered successfully"}
Procurement Scheduler Endpoints
POST /test/procurement-scheduler
Description: Manually trigger procurement scheduler (test mode)
Auth: Bearer token required
Response: {"message": "Procurement scheduler test triggered successfully"}
Forecast Cache Endpoints
GET /api/v1/{tenant_id}/forecasting/cache/stats
Description: Get forecast cache statistics
Auth: Bearer token required
Response: {
  "available": true,
  "total_forecast_keys": 1234,
  "batch_forecast_keys": 45,
  "single_forecast_keys": 1189,
  "hit_rate_percent": 87.5,
  ...
}
DELETE /api/v1/{tenant_id}/forecasting/cache/product/{product_id}
Description: Invalidate forecast cache for specific product
Auth: Bearer token required (admin only)
Response: {"invalidated_keys": 7}
DELETE /api/v1/{tenant_id}/forecasting/cache
Description: Invalidate all forecast cache for tenant
Auth: Bearer token required (admin only)
Response: {"invalidated_keys": 123}
Change Log
Version 2.0 (2025-10-09) - Automated Scheduling
Added:
- ✨ ProductionSchedulerService for automated daily production planning
- ✨ Timezone configuration in Tenant model
- ✨ Forecast caching in Forecasting Service (service-level)
- ✨ Plan rejection workflow with auto-regeneration
- ✨ Comprehensive Prometheus metrics for monitoring
- ✨ TimezoneHelper utility for consistent timezone handling
Changed:
- 🔄 All schedulers now timezone-aware
- 🔄 Forecast service returns cached: true flag in metadata
- 🔄 Plan rejection triggers notifications and events
Fixed:
- 🐛 Duplicate forecast computations eliminated (50% reduction)
- 🐛 Timezone-related scheduling issues resolved
- 🐛 Rejected plans now have proper workflow handling
Documentation:
- 📚 Comprehensive production planning system documentation
- 📚 Runbooks for troubleshooting common issues
- 📚 Monitoring and alerting guidelines
Version 1.0 (2025-10-07) - Initial Release
Added:
- ✨ ProcurementSchedulerService for automated procurement planning
- ✨ Daily, weekly, and cleanup jobs
- ✨ Leader election for distributed deployments
- ✨ Parallel tenant processing with timeouts
Support & Contact
For issues or questions about the Production Planning System:
- Documentation: This file
- Source Code: services/production/, services/orders/
- Issues: GitHub Issues
- Slack: #production-planning channel
Document Version: 2.0
Last Review Date: 2025-10-09
Next Review Date: 2025-11-09