Production Planning System Documentation

Overview

The Production Planning System automates daily production and procurement scheduling for bakery operations. The system consists of two primary schedulers that run every morning to generate plans based on demand forecasts, inventory levels, and capacity constraints.

Last Updated: 2025-10-09
Version: 2.0 (Automated Scheduling)
Status: Production Ready


Architecture

System Components

┌─────────────────────────────────────────────────────────────────┐
│                   DAILY PLANNING WORKFLOW                        │
└─────────────────────────────────────────────────────────────────┘

05:30 AM → Production Scheduler
           ├─ Generates production schedules for all tenants
           ├─ Calls Forecasting Service (cached) for demand
           ├─ Calls Orders Service for demand requirements
           ├─ Creates production batches
           └─ Sends notifications to production managers

06:00 AM → Procurement Scheduler
           ├─ Generates procurement plans for all tenants
           ├─ Calls Forecasting Service (cached - reuses cached data!)
           ├─ Calls Inventory Service for stock levels
           ├─ Matches suppliers for requirements
           └─ Sends notifications to procurement managers

08:00 AM → Operators review plans
           ├─ Accept → Plans move to "approved" status
           ├─ Reject → Automatic regeneration if stale data detected
           └─ Modify → Recalculate and resubmit

Throughout Day → Alert services monitor execution
                 ├─ Production delays
                 ├─ Capacity issues
                 ├─ Quality problems
                 └─ Equipment failures

Services Involved

| Service | Role | Endpoints |
|---|---|---|
| Production Service | Generates daily production schedules | POST /api/v1/{tenant_id}/production/operations/schedule |
| Orders Service | Generates daily procurement plans | POST /api/v1/{tenant_id}/orders/operations/procurement/generate |
| Forecasting Service | Provides demand predictions (cached) | POST /api/v1/{tenant_id}/forecasting/operations/single |
| Inventory Service | Provides current stock levels | GET /api/v1/{tenant_id}/inventory/products |
| Tenant Service | Provides timezone configuration | GET /api/v1/tenants/{tenant_id} |

Schedulers

1. Production Scheduler

Service: Production Service
Class: ProductionSchedulerService
File: services/production/app/services/production_scheduler_service.py

Schedule

| Job | Time | Purpose | Grace Period |
|---|---|---|---|
| Daily Production Planning | 5:30 AM (tenant timezone) | Generate next-day production schedules | 5 minutes |
| Stale Schedule Cleanup | 5:50 AM | Archive/cancel old schedules, send escalations | 5 minutes |
| Test Mode | Every 30 min (DEBUG only) | Development/testing | 5 minutes |

Features

  • Timezone-aware: Respects tenant timezone configuration
  • Leader election: Only one instance runs in distributed deployment
  • Idempotent: Checks if schedule exists before creating
  • Parallel processing: Processes tenants concurrently with timeouts
  • Error isolation: Tenant failures don't affect others
  • Demo tenant filtering: Excludes demo tenants from automation

Workflow

  1. Tenant Discovery: Fetch all active non-demo tenants
  2. Parallel Processing: Process each tenant concurrently (180s timeout)
  3. Date Calculation: Use tenant timezone to determine target date
  4. Duplicate Check: Skip if schedule already exists
  5. Requirements Calculation: Call calculate_daily_requirements()
  6. Schedule Creation: Create schedule with status "draft"
  7. Batch Generation: Create production batches from requirements
  8. Notification: Send alert to production managers
  9. Monitoring: Record metrics for observability
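
Steps 2 and the error-isolation feature can be sketched with plain asyncio. This is a minimal illustration, not the actual ProductionSchedulerService code: the names process_tenants and process_one are hypothetical, and only the 180-second timeout comes from the workflow above.

```python
import asyncio
from typing import Awaitable, Callable

# 180s is the per-tenant timeout documented for the production scheduler.
TENANT_TIMEOUT_SECONDS = 180


async def process_tenants(
    tenant_ids: list[str],
    process_one: Callable[[str], Awaitable[None]],  # hypothetical per-tenant job
    timeout: float = TENANT_TIMEOUT_SECONDS,
) -> dict[str, str]:
    """Run process_one for every tenant concurrently and return a status per tenant."""

    async def guarded(tenant_id: str) -> tuple[str, str]:
        try:
            # Each tenant gets its own timeout budget.
            await asyncio.wait_for(process_one(tenant_id), timeout=timeout)
            return tenant_id, "success"
        except asyncio.TimeoutError:
            return tenant_id, "timeout"
        except Exception:
            # Error isolation: one tenant's failure must not abort the others.
            return tenant_id, "failure"

    results = await asyncio.gather(*(guarded(t) for t in tenant_ids))
    return dict(results)
```

Because every tenant is wrapped in its own guard, a slow or failing tenant only changes its own status; the returned mapping could feed the success/failure labels of the metrics described later.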

Configuration

# Environment Variables
PRODUCTION_TEST_MODE=false  # Enable 30-minute test job
DEBUG=false                 # Enable verbose logging

# Tenant Configuration
tenant.timezone=Europe/Madrid  # IANA timezone string

2. Procurement Scheduler

Service: Orders Service
Class: ProcurementSchedulerService
File: services/orders/app/services/procurement_scheduler_service.py

Schedule

| Job | Time | Purpose | Grace Period |
|---|---|---|---|
| Daily Procurement Planning | 6:00 AM (tenant timezone) | Generate next-day procurement plans | 5 minutes |
| Stale Plan Cleanup | 6:30 AM | Archive/cancel old plans, send reminders | 5 minutes |
| Weekly Optimization | Monday 7:00 AM | Weekly procurement optimization review | 10 minutes |
| Test Mode | Every 30 min (DEBUG only) | Development/testing | 5 minutes |

Features

  • Timezone-aware: Respects tenant timezone configuration
  • Leader election: Prevents duplicate runs
  • Idempotent: Checks if plan exists before generating
  • Parallel processing: 120s timeout per tenant
  • Forecast fallback: Uses historical data if forecast unavailable
  • Critical stock alerts: Automatic alerts for zero-stock items
  • Rejection workflow: Auto-regeneration for rejected plans

Workflow

  1. Tenant Discovery: Fetch active non-demo tenants
  2. Parallel Processing: Process each tenant (120s timeout)
  3. Date Calculation: Use tenant timezone
  4. Duplicate Check: Skip if plan exists (unless force_regenerate)
  5. Forecasting: Call Forecasting Service (uses cache!)
  6. Inventory Check: Get current stock levels
  7. Requirements Calculation: Calculate net requirements
  8. Supplier Matching: Find suitable suppliers
  9. Plan Creation: Create plan with status "draft"
  10. Critical Alerts: Send alerts for critical items
  11. Notification: Notify procurement managers
  12. Caching: Cache plan in Redis (6h TTL)
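
Steps 5 to 7 and the critical-stock alert can be illustrated as follows. The formula is an assumption made for this sketch (net requirement = forecast demand minus current stock, floored at zero); the real calculation in the Orders Service may also account for safety stock, lead times, or open orders. NetRequirement and calculate_net_requirements are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NetRequirement:
    product_id: str
    forecast_demand: float
    current_stock: float
    net_quantity: float
    is_critical: bool  # zero stock with pending demand triggers a critical alert


def calculate_net_requirements(
    forecast: dict[str, float], stock: dict[str, float]
) -> list[NetRequirement]:
    """Assumed formula: net = max(0, forecast demand - current stock)."""
    requirements = []
    for product_id, demand in forecast.items():
        on_hand = stock.get(product_id, 0.0)  # unknown products count as zero stock
        net = max(0.0, demand - on_hand)
        critical = on_hand <= 0 and demand > 0
        requirements.append(
            NetRequirement(product_id, demand, on_hand, net, critical)
        )
    return requirements
```

Items flagged is_critical correspond to the zero-stock case listed under Features; everything with a positive net_quantity would go on to supplier matching.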

Forecast Caching

Overview

To eliminate redundant forecast computations, the Forecasting Service now includes a service-level Redis cache. Both Production and Procurement schedulers benefit from this without any code changes.

File: services/forecasting/app/services/forecast_cache.py

Cache Strategy

Key Format: forecast:{tenant_id}:{product_id}:{forecast_date}
TTL: Until 00:00 on the day after forecast_date
Example: forecast:abc-123:prod-456:2025-10-10 → expires 2025-10-11 00:00:00
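
The key format and expiry rule can be expressed in a few lines. This is a sketch, not the contents of forecast_cache.py: the function names are hypothetical, and midnight is assumed to be UTC because the strategy above does not name a timezone.

```python
from datetime import date, datetime, time, timedelta, timezone


def forecast_cache_key(tenant_id: str, product_id: str, forecast_date: date) -> str:
    """Build a key in the documented format forecast:{tenant_id}:{product_id}:{forecast_date}."""
    return f"forecast:{tenant_id}:{product_id}:{forecast_date.isoformat()}"


def forecast_cache_expiry(forecast_date: date) -> datetime:
    """00:00 on the day after forecast_date (UTC is an assumption of this sketch)."""
    return datetime.combine(
        forecast_date + timedelta(days=1), time.min, tzinfo=timezone.utc
    )


def forecast_cache_ttl_seconds(forecast_date: date, now: datetime) -> int:
    """Seconds left until expiry, never negative; now must be timezone-aware."""
    remaining = forecast_cache_expiry(forecast_date) - now
    return max(0, int(remaining.total_seconds()))
```

A forecast for 2025-10-10 requested at 18:00 UTC that day would be stored with a TTL of six hours under these assumptions.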

Cache Flow

Client Request → Forecasting API
                     ↓
                Check Redis Cache
                     ├─ HIT → Return cached result (add 'cached: true')
                     └─ MISS → Generate forecast
                                ↓
                           Cache result (TTL)
                                ↓
                           Return result
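
The flow above is the usual cache-aside pattern. The sketch below uses an in-memory stand-in instead of Redis so it runs without a server; InMemoryCache and get_forecast are hypothetical names, and the only behavior taken from the diagram is that a hit returns the stored result with cached set to true.

```python
import json
from typing import Callable, Optional


class InMemoryCache:
    """Stand-in for Redis (assumption): stores values but does not enforce the TTL."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str, ttl_seconds: int) -> None:
        self._data[key] = value


def get_forecast(
    cache: InMemoryCache,
    key: str,
    generate: Callable[[], dict],  # hypothetical forecast generator
    ttl_seconds: int,
) -> dict:
    raw = cache.get(key)
    if raw is not None:
        # HIT: return the cached result and flag it as cached.
        result = json.loads(raw)
        result["cached"] = True
        return result
    # MISS: generate the forecast, store it with a TTL, return it unflagged.
    result = generate()
    cache.set(key, json.dumps(result), ttl_seconds)
    return result
```

With this shape the second scheduler to ask for the same tenant, product, and date never reaches the generator, which is where the 2x to 1x reduction in the Benefits table comes from.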

Benefits

| Metric | Before Caching | After Caching | Improvement |
|---|---|---|---|
| Duplicate Forecasts | 2x per day (Production + Procurement) | 1x per day | 50% reduction |
| Forecast Response Time | ~2-5 seconds | ~50-100ms (cache hit) | 95%+ faster |
| Forecasting Service Load | 100% | 50% | 50% reduction |
| Cache Hit Rate | N/A | ~80-90% (expected) | - |

Cache Invalidation

Forecasts are invalidated when:

  1. TTL Expiry: Automatic at midnight after forecast_date
  2. Model Retraining: When ML model is retrained for product
  3. Manual Invalidation: Via API endpoint (admin only)

# Invalidate specific product forecasts
DELETE /api/v1/{tenant_id}/forecasting/cache/product/{product_id}

# Invalidate all tenant forecasts
DELETE /api/v1/{tenant_id}/forecasting/cache

# Invalidate all forecasts (use with caution!)
DELETE /admin/forecasting/cache/all

Plan Rejection Workflow

Overview

When a procurement plan is rejected by an operator, the system automatically handles the rejection with notifications and optional regeneration.

File: services/orders/app/services/procurement_service.py

Rejection Flow

User Rejects Plan (status → "cancelled")
           ↓
    Record rejection in approval_workflow (JSONB)
           ↓
    Send notification to stakeholders
           ↓
    Publish rejection event (RabbitMQ)
           ↓
    Analyze rejection reason
           ├─ Contains "stale", "outdated", etc. → Auto-regenerate
           └─ Other reason → Manual regeneration required
           ↓
    Schedule regeneration (if applicable)
           ↓
    Send regeneration request event

Auto-Regeneration Keywords

Plans are automatically regenerated if rejection notes contain:

  • stale
  • outdated
  • old data
  • datos antiguos (Spanish)
  • desactualizado (Spanish)
  • obsoleto (Spanish)
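
The keyword check can be sketched as a simple substring match. The keyword list is taken from the bullets above; case-insensitive matching and the function name should_auto_regenerate are assumptions of this sketch rather than confirmed behavior of procurement_service.py.

```python
from typing import Optional

# Keywords documented above as triggering automatic regeneration.
AUTO_REGENERATE_KEYWORDS = (
    "stale",
    "outdated",
    "old data",
    "datos antiguos",
    "desactualizado",
    "obsoleto",
)


def should_auto_regenerate(rejection_notes: Optional[str]) -> bool:
    """Return True when the rejection notes mention any auto-regenerate keyword."""
    if not rejection_notes:
        return False
    # Assumption: matching ignores case.
    notes = rejection_notes.casefold()
    return any(keyword in notes for keyword in AUTO_REGENERATE_KEYWORDS)
```

A plain substring match also accepts inflected forms such as "obsoletos", which may or may not be desirable; a stricter implementation could match whole words only.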

Events Published

| Event | Routing Key | Consumers |
|---|---|---|
| Plan Rejected | procurement.plan.rejected | Alert Service, UI Notifications |
| Regeneration Requested | procurement.plan.regeneration_requested | Procurement Scheduler |
| Plan Status Changed | procurement.plan.status_changed | Inventory Service, Dashboard |

Timezone Configuration

Overview

All schedulers are timezone-aware to ensure accurate "daily" execution relative to the bakery's local time.

Tenant Configuration

Model: Tenant
File: services/tenant/app/models/tenants.py
Field: timezone (String, default: "Europe/Madrid")

Migration: services/tenant/migrations/versions/20251009_add_timezone_to_tenants.py

Supported Timezones

All IANA timezone strings are supported. Common examples:

  • Europe/Madrid (Spain - CEST/CET)
  • Europe/London (UK - BST/GMT)
  • America/New_York (US Eastern)
  • America/Los_Angeles (US Pacific)
  • Asia/Tokyo (Japan)
  • UTC (Universal Time)

Usage in Schedulers

from shared.utils.timezone_helper import TimezoneHelper

# Get current date in tenant's timezone
target_date = TimezoneHelper.get_current_date_in_timezone(tenant_tz)

# Get current datetime in tenant's timezone
now = TimezoneHelper.get_current_datetime_in_timezone(tenant_tz)

# Check if within business hours
is_business_hours = TimezoneHelper.is_business_hours(
    timezone_str=tenant_tz,
    start_hour=8,
    end_hour=20
)
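
For readers without access to shared.utils.timezone_helper, the date lookup can be approximated with the standard library. This is a sketch of what such a helper might do, not the TimezoneHelper implementation: the function name, the now_utc parameter, and the fallback to the default timezone on an invalid name are all assumptions.

```python
from datetime import date, datetime, timezone
from typing import Optional
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError

DEFAULT_TIMEZONE = "Europe/Madrid"  # default of the Tenant.timezone field


def current_date_in_timezone(
    timezone_str: str, now_utc: Optional[datetime] = None
) -> date:
    """Return the calendar date in the given IANA timezone."""
    try:
        tz = ZoneInfo(timezone_str)
    except (ZoneInfoNotFoundError, ValueError):
        # Assumption: fall back to the default instead of failing the run.
        tz = ZoneInfo(DEFAULT_TIMEZONE)
    if now_utc is None:
        now_utc = datetime.now(timezone.utc)
    return now_utc.astimezone(tz).date()
```

The same instant can fall on different calendar days for different tenants: at 23:30 UTC it is already the next day in Asia/Tokyo but still the same day in America/Los_Angeles, which is why the target date is computed per tenant.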

Monitoring & Alerts

Prometheus Metrics

File: shared/monitoring/scheduler_metrics.py

Key Metrics

| Metric | Type | Description |
|---|---|---|
| production_schedules_generated_total | Counter | Total production schedules generated (by tenant, status) |
| production_schedule_generation_duration_seconds | Histogram | Time to generate schedule per tenant |
| procurement_plans_generated_total | Counter | Total procurement plans generated (by tenant, status) |
| procurement_plan_generation_duration_seconds | Histogram | Time to generate plan per tenant |
| forecast_cache_hits_total | Counter | Forecast cache hits (by tenant) |
| forecast_cache_misses_total | Counter | Forecast cache misses (by tenant) |
| forecast_cache_hit_rate | Gauge | Cache hit rate percentage (0-100) |
| procurement_plan_rejections_total | Counter | Plan rejections (by tenant, auto_regenerated) |
| scheduler_health_status | Gauge | Scheduler health (1=healthy, 0=unhealthy) |
| tenant_processing_timeout_total | Counter | Tenant processing timeouts (by service) |

# Alert: Daily production planning failed
- alert: DailyProductionPlanningFailed
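  # increase() over a window lets the alert resolve; a raw counter stays > 0 after the first failure
  expr: increase(production_schedules_generated_total{status="failure"}[1h]) > 0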
  for: 10m
  labels:
    severity: high
  annotations:
    summary: "Daily production planning failed for at least one tenant"
    description: "Check production scheduler logs for tenant {{ $labels.tenant_id }}"

# Alert: Daily procurement planning failed
- alert: DailyProcurementPlanningFailed
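  # increase() over a window lets the alert resolve; a raw counter stays > 0 after the first failure
  expr: increase(procurement_plans_generated_total{status="failure"}[1h]) > 0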
  for: 10m
  labels:
    severity: high
  annotations:
    summary: "Daily procurement planning failed for at least one tenant"
    description: "Check procurement scheduler logs for tenant {{ $labels.tenant_id }}"

# Alert: No production schedules in 24 hours
- alert: NoProductionSchedulesGenerated
  expr: rate(production_schedules_generated_total{status="success"}[24h]) == 0
  for: 1h
  labels:
    severity: critical
  annotations:
    summary: "No production schedules generated in last 24 hours"
    description: "Production scheduler may be down or misconfigured"

# Alert: Forecast cache hit rate low
- alert: ForecastCacheHitRateLow
  expr: forecast_cache_hit_rate < 50
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Forecast cache hit rate below 50%"
    description: "Cache may not be functioning correctly for tenant {{ $labels.tenant_id }}"

# Alert: High tenant processing timeouts
- alert: HighTenantProcessingTimeouts
  expr: rate(tenant_processing_timeout_total[5m]) > 0.1
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "High rate of tenant processing timeouts"
    description: "{{ $labels.service }} scheduler experiencing timeouts for tenant {{ $labels.tenant_id }}"

# Alert: Scheduler unhealthy
- alert: SchedulerUnhealthy
  expr: scheduler_health_status == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Scheduler is unhealthy"
    description: "{{ $labels.service }} {{ $labels.scheduler_type }} scheduler is reporting unhealthy status"

Grafana Dashboard

Create a dashboard with panels for:

  1. Scheduler Success Rate (line chart)
    • production_schedules_generated_total{status="success"}
    • procurement_plans_generated_total{status="success"}
  2. Schedule Generation Duration (heatmap)
    • production_schedule_generation_duration_seconds
    • procurement_plan_generation_duration_seconds
  3. Forecast Cache Hit Rate (gauge)
    • forecast_cache_hit_rate
  4. Tenant Processing Status (pie chart)
    • production_tenants_processed_total
    • procurement_tenants_processed_total
  5. Plan Rejections (table)
    • procurement_plan_rejections_total
  6. Scheduler Health (status panel)
    • scheduler_health_status

Testing

Manual Testing

Test Production Scheduler

# Trigger test production schedule generation
curl -X POST http://production-service:8000/test/production-scheduler \
  -H "Authorization: Bearer $TOKEN"

# Expected response:
{
  "message": "Production scheduler test triggered successfully"
}

Test Procurement Scheduler

# Trigger test procurement plan generation
curl -X POST http://orders-service:8000/test/procurement-scheduler \
  -H "Authorization: Bearer $TOKEN"

# Expected response:
{
  "message": "Procurement scheduler test triggered successfully"
}

Automated Testing

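import pytest  # the async tests below assume the pytest-asyncio plugin

# config, redis_url, tenant_id, product_id, forecast_date and data are expected
# to be provided by the test module (fixtures or constants).

# Test production scheduler
@pytest.mark.asyncio
async def test_production_scheduler():
    scheduler = ProductionSchedulerService(config)
    await scheduler.start()
    await scheduler.test_production_schedule_generation()
    # At least one scheduling check must have run
    assert scheduler._checks_performed > 0

# Test procurement scheduler
@pytest.mark.asyncio
async def test_procurement_scheduler():
    scheduler = ProcurementSchedulerService(config)
    await scheduler.start()
    await scheduler.test_procurement_generation()
    # At least one scheduling check must have run
    assert scheduler._checks_performed > 0

# Test forecast caching
@pytest.mark.asyncio
async def test_forecast_cache():
    cache = get_forecast_cache_service(redis_url)

    # Cache forecast
    await cache.cache_forecast(tenant_id, product_id, forecast_date, data)

    # Retrieve cached forecast; a cache hit carries the 'cached' flag
    cached = await cache.get_cached_forecast(tenant_id, product_id, forecast_date)
    assert cached is not None
    assert cached['cached'] is True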

Troubleshooting

Scheduler Not Running

Symptoms: No schedules/plans generated in morning

Checks:

  1. Verify scheduler service is running: kubectl get pods -n production
  2. Check scheduler health endpoint: curl http://service:8000/health
  3. Check APScheduler status in logs: grep "scheduler" logs/production.log
  4. Verify leader election (distributed setup): Check is_leader in logs

Solutions:

  • Restart service: kubectl rollout restart deployment/production-service
  • Check environment variables: PRODUCTION_TEST_MODE, DEBUG
  • Verify database connectivity
  • Check RabbitMQ connectivity for leader election

Timezone Issues

Symptoms: Schedules generated at wrong time

Checks:

  1. Check tenant timezone configuration:
    SELECT id, name, timezone FROM tenants WHERE id = '{tenant_id}';
    
  2. Verify server timezone: date (should be UTC in containers)
  3. Check logs for timezone warnings

Solutions:

  • Update tenant timezone: UPDATE tenants SET timezone = 'Europe/Madrid' WHERE id = '{tenant_id}';
  • Verify TimezoneHelper is being used in schedulers
  • Check cron trigger configuration uses correct timezone

Low Cache Hit Rate

Symptoms: forecast_cache_hit_rate < 50%

Checks:

  1. Verify Redis is running: redis-cli ping
  2. Check cache keys: redis-cli --scan --pattern "forecast:*" (SCAN iterates incrementally; KEYS blocks Redis while it walks the whole keyspace)
  3. Check TTL on cache entries: redis-cli TTL "forecast:{tenant}:{product}:{date}"
  4. Review logs for cache errors

Solutions:

  • Restart Redis if unhealthy
  • Clear cache and let it rebuild: redis-cli FLUSHDB (caution: this deletes every key in the selected Redis database, not only forecast:* entries)
  • Verify REDIS_URL environment variable
  • Check Redis memory limits: redis-cli INFO memory

Plan Rejection Not Auto-Regenerating

Symptoms: Rejected plans not triggering regeneration

Checks:

  1. Check rejection notes contain auto-regenerate keywords
  2. Verify RabbitMQ events are being published: Check procurement.plan.rejected queue
  3. Check scheduler is listening to regeneration events

Solutions:

  • Use keywords like "stale" or "outdated" in rejection notes
  • Manually trigger regeneration via API
  • Check RabbitMQ connectivity
  • Verify event routing keys are correct

Tenant Processing Timeouts

Symptoms: tenant_processing_timeout_total increasing

Checks:

  1. Check timeout duration (180s for production, 120s for procurement)
  2. Review slow queries in database logs
  3. Check external service response times (Forecasting, Inventory)
  4. Monitor CPU/memory usage during scheduler runs

Solutions:

  • Increase timeout if consistently hitting limit
  • Optimize database queries (add indexes)
  • Scale external services if response time high
  • Process fewer tenants in parallel (reduce concurrency)
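
Reducing concurrency, the last solution above, is commonly done with a semaphore. The sketch below is a generic asyncio pattern and not the schedulers' actual code; run_with_limit, worker, and max_concurrency are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def run_with_limit(
    items: Iterable[T],
    worker: Callable[[T], Awaitable[R]],  # hypothetical per-tenant coroutine
    max_concurrency: int,
) -> list[R]:
    """Run worker for every item, with at most max_concurrency running at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def limited(item: T) -> R:
        # Tasks beyond the limit wait here until a slot frees up.
        async with semaphore:
            return await worker(item)

    # gather preserves input order in its result list.
    return list(await asyncio.gather(*(limited(item) for item in items)))
```

Lowering max_concurrency trades total run time for less pressure on the database and on the Forecasting and Inventory services, which can be enough to keep individual tenants under their timeout.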

Maintenance

Scheduled Maintenance Windows

When performing maintenance on schedulers:

  1. Announce downtime to users (UI banner)
  2. Disable schedulers temporarily:
    # Set environment variable
    SCHEDULER_DISABLED=true
    
  3. Perform maintenance (database migrations, service updates)
  4. Re-enable schedulers:
    SCHEDULER_DISABLED=false
    
  5. Manually trigger missed runs if needed:
    curl -X POST http://production-service:8000/test/production-scheduler \
      -H "Authorization: Bearer $TOKEN"
    curl -X POST http://orders-service:8000/test/procurement-scheduler \
      -H "Authorization: Bearer $TOKEN"
    

Database Migrations

When adding fields to scheduler-related tables:

  1. Create migration with proper rollback
  2. Test migration on staging environment
  3. Run migration during low-traffic period (3-4 AM)
  4. Verify scheduler still works after migration
  5. Monitor metrics for anomalies

Cache Maintenance

Clear Stale Cache Entries:

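# Clear all forecast cache (will rebuild automatically)
# --scan iterates incrementally instead of blocking Redis the way KEYS does
redis-cli --scan --pattern "forecast:*" | xargs redis-cli DEL

# Clear specific tenant's cache
redis-cli --scan --pattern "forecast:{tenant_id}:*" | xargs redis-cli DEL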

Monitor Cache Size:

# Check number of forecast keys (DBSIZE would count every key in the database)
redis-cli --scan --pattern "forecast:*" | wc -l

# Check memory usage
redis-cli INFO memory

API Reference

Production Scheduler Endpoints

POST /test/production-scheduler
Description: Manually trigger production scheduler (test mode)
Auth: Bearer token required
Response: {"message": "Production scheduler test triggered successfully"}

Procurement Scheduler Endpoints

POST /test/procurement-scheduler
Description: Manually trigger procurement scheduler (test mode)
Auth: Bearer token required
Response: {"message": "Procurement scheduler test triggered successfully"}

Forecast Cache Endpoints

GET /api/v1/{tenant_id}/forecasting/cache/stats
Description: Get forecast cache statistics
Auth: Bearer token required
Response: {
  "available": true,
  "total_forecast_keys": 1234,
  "batch_forecast_keys": 45,
  "single_forecast_keys": 1189,
  "hit_rate_percent": 87.5,
  ...
}

DELETE /api/v1/{tenant_id}/forecasting/cache/product/{product_id}
Description: Invalidate forecast cache for specific product
Auth: Bearer token required (admin only)
Response: {"invalidated_keys": 7}

DELETE /api/v1/{tenant_id}/forecasting/cache
Description: Invalidate all forecast cache for tenant
Auth: Bearer token required (admin only)
Response: {"invalidated_keys": 123}

Change Log

Version 2.0 (2025-10-09) - Automated Scheduling

Added:

  • ProductionSchedulerService for automated daily production planning
  • Timezone configuration in Tenant model
  • Forecast caching in Forecasting Service (service-level)
  • Plan rejection workflow with auto-regeneration
  • Comprehensive Prometheus metrics for monitoring
  • TimezoneHelper utility for consistent timezone handling

Changed:

  • 🔄 All schedulers now timezone-aware
  • 🔄 Forecast service returns cached: true flag in metadata
  • 🔄 Plan rejection triggers notifications and events

Fixed:

  • 🐛 Duplicate forecast computations eliminated (50% reduction)
  • 🐛 Timezone-related scheduling issues resolved
  • 🐛 Rejected plans now have proper workflow handling

Documentation:

  • 📚 Comprehensive production planning system documentation
  • 📚 Runbooks for troubleshooting common issues
  • 📚 Monitoring and alerting guidelines

Version 1.0 (2025-10-07) - Initial Release

Added:

  • ProcurementSchedulerService for automated procurement planning
  • Daily, weekly, and cleanup jobs
  • Leader election for distributed deployments
  • Parallel tenant processing with timeouts

Support & Contact

For issues or questions about the Production Planning System:

  • Documentation: This file
  • Source Code: services/production/, services/orders/
  • Issues: GitHub Issues
  • Slack: #production-planning channel

Document Version: 2.0
Last Review Date: 2025-10-09
Next Review Date: 2025-11-09