Urtzi Alfaro b420af32c5 REFACTOR production scheduler

2025-10-09 18:01:24 +02:00

Production Planning System Documentation

Overview

The Production Planning System automates daily production and procurement scheduling for bakery operations. It consists of two schedulers, one for production (05:30) and one for procurement (06:00), that run each morning in the tenant's timezone and generate draft plans from demand forecasts, inventory levels, and capacity constraints.

Last Updated: 2025-10-09
Version: 2.0 (Automated Scheduling)
Status: Production Ready

Architecture

System Components

┌─────────────────────────────────────────────────────────────────┐
│                   DAILY PLANNING WORKFLOW                        │
└─────────────────────────────────────────────────────────────────┘

05:30 AM → Production Scheduler
           ├─ Generates production schedules for all tenants
           ├─ Calls Forecasting Service (cached) for demand
           ├─ Calls Orders Service for demand requirements
           ├─ Creates production batches
           └─ Sends notifications to production managers

06:00 AM → Procurement Scheduler
           ├─ Generates procurement plans for all tenants
           ├─ Calls Forecasting Service (reuses forecasts cached by the 05:30 run)
           ├─ Calls Inventory Service for stock levels
           ├─ Matches suppliers for requirements
           └─ Sends notifications to procurement managers

08:00 AM → Operators review plans
           ├─ Accept → Plans move to "approved" status
           ├─ Reject → Automatic regeneration if stale data detected
           └─ Modify → Recalculate and resubmit

Throughout Day → Alert services monitor execution
                 ├─ Production delays
                 ├─ Capacity issues
                 ├─ Quality problems
                 └─ Equipment failures

Services Involved

Service	Role	Endpoints
Production Service	Generates daily production schedules	`POST /api/v1/{tenant_id}/production/operations/schedule`
Orders Service	Generates daily procurement plans	`POST /api/v1/{tenant_id}/orders/operations/procurement/generate`
Forecasting Service	Provides demand predictions (cached)	`POST /api/v1/{tenant_id}/forecasting/operations/single`
Inventory Service	Provides current stock levels	`GET /api/v1/{tenant_id}/inventory/products`
Tenant Service	Provides timezone configuration	`GET /api/v1/tenants/{tenant_id}`

Schedulers

1. Production Scheduler

Service: Production Service
Class: ProductionSchedulerService
File: services/production/app/services/production_scheduler_service.py

Schedule

Job	Time	Purpose	Grace Period
Daily Production Planning	5:30 AM (tenant timezone)	Generate next-day production schedules	5 minutes
Stale Schedule Cleanup	5:50 AM	Archive/cancel old schedules, send escalations	5 minutes
Test Mode	Every 30 min (DEBUG only)	Development/testing	5 minutes

Features

✅ Timezone-aware: Respects tenant timezone configuration
✅ Leader election: Only one instance runs in distributed deployment
✅ Idempotent: Checks if schedule exists before creating
✅ Parallel processing: Processes tenants concurrently with timeouts
✅ Error isolation: Tenant failures don't affect others
✅ Demo tenant filtering: Excludes demo tenants from automation

Workflow

Tenant Discovery: Fetch all active non-demo tenants
Parallel Processing: Process each tenant concurrently (180s timeout)
Date Calculation: Use tenant timezone to determine target date
Duplicate Check: Skip if schedule already exists
Requirements Calculation: Call calculate_daily_requirements()
Schedule Creation: Create schedule with status "draft"
Batch Generation: Create production batches from requirements
Notification: Send alert to production managers
Monitoring: Record metrics for observability
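
The workflow above combines three ideas: concurrent per-tenant processing, a per-tenant timeout, and a duplicate check that keeps reruns idempotent. The following is a minimal sketch of that pattern using only `asyncio`. The names `process_tenant` and `run_daily_planning`, the in-memory `existing` set, and the simulated failure are illustrative assumptions, not the actual ProductionSchedulerService code.

```python
import asyncio

TENANT_TIMEOUT_SECONDS = 180  # per-tenant limit quoted in the workflow above


async def process_tenant(tenant_id: str, existing: set) -> str:
    """Hypothetical per-tenant job: skip if a schedule exists, else create a draft."""
    if tenant_id in existing:      # duplicate check keeps the run idempotent
        return "skipped"
    if tenant_id == "broken":      # simulate one failing tenant
        raise RuntimeError("requirements calculation failed")
    await asyncio.sleep(0)         # placeholder for forecasting/orders calls
    existing.add(tenant_id)
    return "draft"


async def run_daily_planning(tenant_ids, existing, timeout=TENANT_TIMEOUT_SECONDS):
    """Process all tenants concurrently; a failure or timeout affects only that tenant."""

    async def guarded(tenant_id):
        try:
            return await asyncio.wait_for(process_tenant(tenant_id, existing), timeout)
        except asyncio.TimeoutError:
            return "timeout"
        except Exception as exc:   # error isolation: report and carry on
            return f"error: {exc}"

    results = await asyncio.gather(*(guarded(t) for t in tenant_ids))
    return dict(zip(tenant_ids, results))


existing = {"b"}  # tenant "b" already has a schedule for the target date
results = asyncio.run(run_daily_planning(["a", "b", "broken"], existing))
print(results)
```

Running the same call a second time would report tenant "a" as skipped, which is the behaviour the duplicate check is meant to guarantee.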

Configuration

# Environment Variables
PRODUCTION_TEST_MODE=false  # Enable 30-minute test job
DEBUG=false                 # Enable verbose logging

# Tenant Configuration
tenant.timezone=Europe/Madrid  # IANA timezone string

2. Procurement Scheduler

Service: Orders Service
Class: ProcurementSchedulerService
File: services/orders/app/services/procurement_scheduler_service.py

Schedule

Job	Time	Purpose	Grace Period
Daily Procurement Planning	6:00 AM (tenant timezone)	Generate next-day procurement plans	5 minutes
Stale Plan Cleanup	6:30 AM	Archive/cancel old plans, send reminders	5 minutes
Weekly Optimization	Monday 7:00 AM	Weekly procurement optimization review	10 minutes
Test Mode	Every 30 min (DEBUG only)	Development/testing	5 minutes

Features

✅ Timezone-aware: Respects tenant timezone configuration
✅ Leader election: Prevents duplicate runs
✅ Idempotent: Checks if plan exists before generating
✅ Parallel processing: 120s timeout per tenant
✅ Forecast fallback: Uses historical data if forecast unavailable
✅ Critical stock alerts: Automatic alerts for zero-stock items
✅ Rejection workflow: Auto-regeneration for rejected plans

Workflow

Tenant Discovery: Fetch active non-demo tenants
Parallel Processing: Process each tenant (120s timeout)
Date Calculation: Use tenant timezone
Duplicate Check: Skip if plan exists (unless force_regenerate)
Forecasting: Call Forecasting Service (uses cache!)
Inventory Check: Get current stock levels
Requirements Calculation: Calculate net requirements
Supplier Matching: Find suitable suppliers
Plan Creation: Create plan with status "draft"
Critical Alerts: Send alerts for critical items
Notification: Notify procurement managers
Caching: Cache plan in Redis (6h TTL)
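
The requirements and critical-alert steps can be illustrated with a small calculation. This sketch assumes the simplest possible rule (net requirement = forecast demand minus stock on hand, floored at zero) and flags zero-stock items as critical; the real service may also account for safety stock, lead times, or open orders. `Requirement` and `net_requirements` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Requirement:
    product_id: str
    net_quantity: float
    critical: bool  # True when current stock is zero


def net_requirements(forecast: dict, stock: dict) -> list:
    """Return one Requirement per product whose demand exceeds stock on hand."""
    plan = []
    for product_id, demand in forecast.items():
        on_hand = stock.get(product_id, 0.0)
        net = max(0.0, demand - on_hand)  # never order a negative quantity
        if net > 0:
            plan.append(Requirement(product_id, net, critical=(on_hand == 0)))
    return plan


plan = net_requirements(
    forecast={"flour": 120.0, "yeast": 5.0, "salt": 2.0},
    stock={"flour": 40.0, "yeast": 0.0, "salt": 10.0},
)
for item in plan:
    print(item)
```

Here salt drops out because stock covers demand, and yeast is the only item that would trigger a critical stock alert.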

Forecast Caching

Overview

To eliminate redundant forecast computations, the Forecasting Service now includes a service-level Redis cache. Both Production and Procurement schedulers benefit from this without any code changes.

File: services/forecasting/app/services/forecast_cache.py

Cache Strategy

Key Format: forecast:{tenant_id}:{product_id}:{forecast_date}
TTL: Until 00:00 on the day after forecast_date
Example: forecast:abc-123:prod-456:2025-10-10 → expires 2025-10-11 00:00:00
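
As a worked example of the key format and expiry rule, the sketch below builds the key and computes the remaining TTL in seconds. The function names are hypothetical, and the use of naive datetimes is an assumption: the source does not say in which timezone "midnight" is evaluated.

```python
from datetime import date, datetime, time, timedelta


def forecast_cache_key(tenant_id: str, product_id: str, forecast_date: date) -> str:
    """Build a key in the documented forecast:{tenant}:{product}:{date} format."""
    return f"forecast:{tenant_id}:{product_id}:{forecast_date.isoformat()}"


def forecast_expiry(forecast_date: date) -> datetime:
    """00:00 on the day after forecast_date (naive datetime, timezone is an assumption)."""
    return datetime.combine(forecast_date + timedelta(days=1), time.min)


def ttl_seconds(forecast_date: date, now: datetime) -> int:
    """Seconds left until expiry; 0 means the entry has already expired."""
    return max(0, int((forecast_expiry(forecast_date) - now).total_seconds()))


key = forecast_cache_key("abc-123", "prod-456", date(2025, 10, 10))
expires = forecast_expiry(date(2025, 10, 10))
ttl = ttl_seconds(date(2025, 10, 10), now=datetime(2025, 10, 9, 5, 30))
print(key, expires, ttl)
```

A forecast for 2025-10-10 cached during the 05:30 run on 2025-10-09 would therefore live for 42.5 hours (153000 seconds) under these assumptions.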

Cache Flow

Client Request → Forecasting API
                     ↓
                Check Redis Cache
                     ├─ HIT → Return cached result (add 'cached: true')
                     └─ MISS → Generate forecast
                                ↓
                           Cache result (TTL)
                                ↓
                           Return result
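
The flow above is a cache-aside lookup. A minimal sketch follows, with an in-memory class standing in for Redis so that it runs without a server; `InMemoryCache`, `get_forecast`, and `fake_model` are hypothetical names, and returning `cached: False` on a miss is an assumption (the source only specifies the flag for hits).

```python
import json
from typing import Callable


class InMemoryCache:
    """Stand-in for Redis: same get/setex shape, TTL is stored but not enforced."""

    def __init__(self) -> None:
        self.store = {}

    def get(self, key: str):
        entry = self.store.get(key)
        return entry[0] if entry else None

    def setex(self, key: str, ttl: int, value: str) -> None:
        self.store[key] = (value, ttl)


def get_forecast(cache, key: str, ttl: int, generate: Callable[[], dict]) -> dict:
    raw = cache.get(key)
    if raw is not None:                  # HIT: return the cached result, flagged
        return {**json.loads(raw), "cached": True}
    result = generate()                  # MISS: compute the forecast
    cache.setex(key, ttl, json.dumps(result))
    return {**result, "cached": False}


calls = []


def fake_model() -> dict:
    calls.append(1)                      # count how often the model actually runs
    return {"predicted_demand": 42.0}


cache = InMemoryCache()
key = "forecast:abc-123:prod-456:2025-10-10"
first = get_forecast(cache, key, 3600, fake_model)   # production scheduler
second = get_forecast(cache, key, 3600, fake_model)  # procurement scheduler
print(first, second, len(calls))
```

The second request is served from the cache, so the model runs once for the two schedulers.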

Benefits

Metric	Before Caching	After Caching	Improvement
Duplicate Forecasts	2x per day (Production + Procurement)	1x per day	50% reduction
Forecast Response Time	~2-5 seconds	~50-100ms (cache hit)	95%+ faster
Forecasting Service Load	100%	50%	50% reduction
Cache Hit Rate	N/A	~80-90% (expected)	-

Cache Invalidation

Forecasts are invalidated when:

TTL Expiry: Automatic at midnight after forecast_date
Model Retraining: When ML model is retrained for product
Manual Invalidation: Via API endpoint (admin only)

# Invalidate specific product forecasts
DELETE /api/v1/{tenant_id}/forecasting/cache/product/{product_id}

# Invalidate all tenant forecasts
DELETE /api/v1/{tenant_id}/forecasting/cache

# Invalidate all forecasts (use with caution!)
DELETE /admin/forecasting/cache/all

Plan Rejection Workflow

Overview

When a procurement plan is rejected by an operator, the system automatically handles the rejection with notifications and optional regeneration.

File: services/orders/app/services/procurement_service.py

Rejection Flow

User Rejects Plan (status → "cancelled")
           ↓
    Record rejection in approval_workflow (JSONB)
           ↓
    Send notification to stakeholders
           ↓
    Publish rejection event (RabbitMQ)
           ↓
    Analyze rejection reason
           ├─ Contains "stale", "outdated", etc. → Auto-regenerate
           └─ Other reason → Manual regeneration required
           ↓
    Schedule regeneration (if applicable)
           ↓
    Send regeneration request event

Auto-Regeneration Keywords

Plans are automatically regenerated if rejection notes contain:

stale
outdated
old data
datos antiguos (Spanish)
desactualizado (Spanish)
obsoleto (Spanish)
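
The keyword check could be sketched as a simple substring match over the rejection notes. Case-insensitive matching is an assumption here, and `should_auto_regenerate` is a hypothetical name rather than a function confirmed to exist in procurement_service.py.

```python
from typing import Optional

AUTO_REGENERATE_KEYWORDS = (
    "stale", "outdated", "old data",
    "datos antiguos", "desactualizado", "obsoleto",
)


def should_auto_regenerate(rejection_notes: Optional[str]) -> bool:
    """True when the notes mention any keyword, ignoring letter case."""
    notes = (rejection_notes or "").casefold()
    return any(keyword in notes for keyword in AUTO_REGENERATE_KEYWORDS)


print(should_auto_regenerate("Forecast looks STALE"))
print(should_auto_regenerate("Wrong supplier selected"))
```

Substring matching is deliberately permissive: a note such as "not outdated" would still trigger regeneration, which may or may not be the desired behaviour.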

Events Published

Event	Routing Key	Consumers
Plan Rejected	`procurement.plan.rejected`	Alert Service, UI Notifications
Regeneration Requested	`procurement.plan.regeneration_requested`	Procurement Scheduler
Plan Status Changed	`procurement.plan.status_changed`	Inventory Service, Dashboard

Timezone Configuration

Overview

All schedulers are timezone-aware to ensure accurate "daily" execution relative to the bakery's local time.

Tenant Configuration

Model: Tenant
File: services/tenant/app/models/tenants.py
Field: timezone (String, default: "Europe/Madrid")

Migration: services/tenant/migrations/versions/20251009_add_timezone_to_tenants.py

Supported Timezones

All IANA timezone strings are supported. Common examples:

Europe/Madrid (Spain - CEST/CET)
Europe/London (UK - BST/GMT)
America/New_York (US Eastern)
America/Los_Angeles (US Pacific)
Asia/Tokyo (Japan)
UTC (Universal Time)

Usage in Schedulers

from shared.utils.timezone_helper import TimezoneHelper

# Get current date in tenant's timezone
target_date = TimezoneHelper.get_current_date_in_timezone(tenant_tz)

# Get current datetime in tenant's timezone
now = TimezoneHelper.get_current_datetime_in_timezone(tenant_tz)

# Check if within business hours
is_business_hours = TimezoneHelper.is_business_hours(
    timezone_str=tenant_tz,
    start_hour=8,
    end_hour=20
)
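
To show why the tenant timezone matters for the "next-day" target date, here is a standalone sketch using only the standard library. It is not the TimezoneHelper implementation; `target_date` is a hypothetical function, and a fixed +02:00 offset stands in for Europe/Madrid in summer so the example does not depend on the system timezone database.

```python
from datetime import date, datetime, timedelta, timezone, tzinfo


def target_date(now_utc: datetime, tenant_tz: tzinfo, days_ahead: int = 1) -> date:
    """Date `days_ahead` days after today, where "today" is read in the tenant's timezone."""
    local_now = now_utc.astimezone(tenant_tz)
    return local_now.date() + timedelta(days=days_ahead)


# With real tenants one would pass zoneinfo.ZoneInfo(tenant.timezone) instead
# of this fixed offset.
madrid_summer = timezone(timedelta(hours=2))
now_utc = datetime(2025, 10, 9, 22, 30, tzinfo=timezone.utc)

local_target = target_date(now_utc, madrid_summer)  # local clock reads 2025-10-10 00:30
utc_target = target_date(now_utc, timezone.utc)
print(local_target, utc_target)
```

At 22:30 UTC the bakery's local date has already rolled over, so a scheduler that reasoned in UTC would plan for the wrong day.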

Monitoring & Alerts

Prometheus Metrics

File: shared/monitoring/scheduler_metrics.py

Key Metrics

Metric	Type	Description
`production_schedules_generated_total`	Counter	Total production schedules generated (by tenant, status)
`production_schedule_generation_duration_seconds`	Histogram	Time to generate schedule per tenant
`procurement_plans_generated_total`	Counter	Total procurement plans generated (by tenant, status)
`procurement_plan_generation_duration_seconds`	Histogram	Time to generate plan per tenant
`forecast_cache_hits_total`	Counter	Forecast cache hits (by tenant)
`forecast_cache_misses_total`	Counter	Forecast cache misses (by tenant)
`forecast_cache_hit_rate`	Gauge	Cache hit rate percentage (0-100)
`procurement_plan_rejections_total`	Counter	Plan rejections (by tenant, auto_regenerated)
`scheduler_health_status`	Gauge	Scheduler health (1=healthy, 0=unhealthy)
`tenant_processing_timeout_total`	Counter	Tenant processing timeouts (by service)
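
The hit-rate gauge can be derived from the two cache counters. A minimal sketch of that arithmetic, assuming the gauge is simply hits divided by total lookups (the function name is hypothetical and the zero-lookup convention is an assumption):

```python
def cache_hit_rate_percent(hits: int, misses: int) -> float:
    """Hit rate on a 0-100 scale; 0.0 when there have been no lookups yet."""
    total = hits + misses
    if total == 0:
        return 0.0
    return 100.0 * hits / total


print(cache_hit_rate_percent(7, 1))
print(cache_hit_rate_percent(1, 1))
```

Seven hits and one miss give 87.5, while one miss followed by one hit gives exactly 50.0.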

Recommended Alerts

# Alert: Daily production planning failed
- alert: DailyProductionPlanningFailed
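  # increase() fires only on new failures; a raw counter stays > 0 after the first failure
  expr: increase(production_schedules_generated_total{status="failure"}[1h]) > 0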
  for: 10m
  labels:
    severity: high
  annotations:
    summary: "Daily production planning failed for at least one tenant"
    description: "Check production scheduler logs for tenant {{ $labels.tenant_id }}"

# Alert: Daily procurement planning failed
- alert: DailyProcurementPlanningFailed
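  # increase() fires only on new failures; a raw counter stays > 0 after the first failure
  expr: increase(procurement_plans_generated_total{status="failure"}[1h]) > 0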
  for: 10m
  labels:
    severity: high
  annotations:
    summary: "Daily procurement planning failed for at least one tenant"
    description: "Check procurement scheduler logs for tenant {{ $labels.tenant_id }}"

# Alert: No production schedules in 24 hours
- alert: NoProductionSchedulesGenerated
  expr: rate(production_schedules_generated_total{status="success"}[24h]) == 0
  for: 1h
  labels:
    severity: critical
  annotations:
    summary: "No production schedules generated in last 24 hours"
    description: "Production scheduler may be down or misconfigured"

# Alert: Forecast cache hit rate low
- alert: ForecastCacheHitRateLow
  expr: forecast_cache_hit_rate < 50
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Forecast cache hit rate below 50%"
    description: "Cache may not be functioning correctly for tenant {{ $labels.tenant_id }}"

# Alert: High tenant processing timeouts
- alert: HighTenantProcessingTimeouts
  expr: rate(tenant_processing_timeout_total[5m]) > 0.1
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "High rate of tenant processing timeouts"
    description: "{{ $labels.service }} scheduler experiencing timeouts for tenant {{ $labels.tenant_id }}"

# Alert: Scheduler unhealthy
- alert: SchedulerUnhealthy
  expr: scheduler_health_status == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Scheduler is unhealthy"
    description: "{{ $labels.service }} {{ $labels.scheduler_type }} scheduler is reporting unhealthy status"

Grafana Dashboard

Create dashboard with panels for:

Scheduler Success Rate (line chart)
- production_schedules_generated_total{status="success"}
- procurement_plans_generated_total{status="success"}
Schedule Generation Duration (heatmap)
- production_schedule_generation_duration_seconds
- procurement_plan_generation_duration_seconds
Forecast Cache Hit Rate (gauge)
- forecast_cache_hit_rate
Tenant Processing Status (pie chart)
- production_tenants_processed_total
- procurement_tenants_processed_total
Plan Rejections (table)
- procurement_plan_rejections_total
Scheduler Health (status panel)
- scheduler_health_status

Testing

Manual Testing

Test Production Scheduler

# Trigger test production schedule generation
curl -X POST http://production-service:8000/test/production-scheduler \
  -H "Authorization: Bearer $TOKEN"

# Expected response:
{
  "message": "Production scheduler test triggered successfully"
}

Test Procurement Scheduler

# Trigger test procurement plan generation
curl -X POST http://orders-service:8000/test/procurement-scheduler \
  -H "Authorization: Bearer $TOKEN"

# Expected response:
{
  "message": "Procurement scheduler test triggered successfully"
}

Automated Testing

# Test production scheduler
async def test_production_scheduler():
    scheduler = ProductionSchedulerService(config)
    await scheduler.start()
    await scheduler.test_production_schedule_generation()
    assert scheduler._checks_performed > 0

# Test procurement scheduler
async def test_procurement_scheduler():
    scheduler = ProcurementSchedulerService(config)
    await scheduler.start()
    await scheduler.test_procurement_generation()
    assert scheduler._checks_performed > 0

# Test forecast caching
async def test_forecast_cache():
    cache = get_forecast_cache_service(redis_url)

    # Cache forecast
    await cache.cache_forecast(tenant_id, product_id, forecast_date, data)

    # Retrieve cached forecast
    cached = await cache.get_cached_forecast(tenant_id, product_id, forecast_date)
    assert cached is not None
    assert cached['cached'] is True

Troubleshooting

Scheduler Not Running

Symptoms: No schedules/plans generated in morning

Checks:

Verify scheduler service is running: kubectl get pods -n production
Check scheduler health endpoint: curl http://service:8000/health
Check APScheduler status in logs: grep "scheduler" logs/production.log
Verify leader election (distributed setup): Check is_leader in logs

Solutions:

Restart service: kubectl rollout restart deployment/production-service
Check environment variables: PRODUCTION_TEST_MODE, DEBUG
Verify database connectivity
Check RabbitMQ connectivity for leader election

Timezone Issues

Symptoms: Schedules generated at wrong time

Checks:

Check tenant timezone configuration:

SELECT id, name, timezone FROM tenants WHERE id = '{tenant_id}';

Verify server timezone: date (should be UTC in containers)
Check logs for timezone warnings

Solutions:

Update tenant timezone: UPDATE tenants SET timezone = 'Europe/Madrid' WHERE id = '{tenant_id}';
Verify TimezoneHelper is being used in schedulers
Check cron trigger configuration uses correct timezone

Low Cache Hit Rate

Symptoms: forecast_cache_hit_rate < 50%

Checks:

Verify Redis is running: redis-cli ping
Check cache keys: redis-cli --scan --pattern "forecast:*" (preferred over KEYS, which blocks Redis while it walks the whole keyspace)
Check TTL on cache entries: redis-cli TTL "forecast:{tenant}:{product}:{date}"
Review logs for cache errors

Solutions:

Restart Redis if unhealthy
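Clear the forecast cache and let it rebuild: redis-cli --scan --pattern "forecast:*" | xargs redis-cli DEL (avoid redis-cli FLUSHDB, which deletes every key in the selected database, not only forecast entries)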
Verify REDIS_URL environment variable
Check Redis memory limits: redis-cli INFO memory

Plan Rejection Not Auto-Regenerating

Symptoms: Rejected plans not triggering regeneration

Checks:

Check rejection notes contain auto-regenerate keywords
Verify RabbitMQ events are being published: Check procurement.plan.rejected queue
Check scheduler is listening to regeneration events

Solutions:

Use keywords like "stale" or "outdated" in rejection notes
Manually trigger regeneration via API
Check RabbitMQ connectivity
Verify event routing keys are correct

Tenant Processing Timeouts

Symptoms: tenant_processing_timeout_total increasing

Checks:

Check timeout duration (180s for production, 120s for procurement)
Review slow queries in database logs
Check external service response times (Forecasting, Inventory)
Monitor CPU/memory usage during scheduler runs

Solutions:

Increase timeout if consistently hitting limit
Optimize database queries (add indexes)
Scale external services if response time high
Process fewer tenants in parallel (reduce concurrency)
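
Reducing concurrency, the last solution above, is commonly done with a semaphore around each tenant's work. The sketch below shows the idea with `asyncio.Semaphore`; `process_with_limit`, `fake_worker`, and the limit of 3 are illustrative assumptions, since the source does not say how the schedulers bound their parallelism.

```python
import asyncio


async def process_with_limit(tenant_ids, worker, max_concurrent: int):
    """Run worker(tenant_id) for every tenant, at most max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(tenant_id):
        async with semaphore:          # wait here until a slot is free
            return await worker(tenant_id)

    return await asyncio.gather(*(guarded(t) for t in tenant_ids))


stats = {"running": 0, "peak": 0}


async def fake_worker(tenant_id: str) -> str:
    stats["running"] += 1
    stats["peak"] = max(stats["peak"], stats["running"])
    await asyncio.sleep(0.01)          # placeholder for slow downstream calls
    stats["running"] -= 1
    return tenant_id


tenants = [f"t{i}" for i in range(10)]
done = asyncio.run(process_with_limit(tenants, fake_worker, max_concurrent=3))
print(done, stats["peak"])
```

Lowering the limit trades total run time for less pressure on the database and the Forecasting and Inventory services.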

Maintenance

Scheduled Maintenance Windows

When performing maintenance on schedulers:

Announce downtime to users (UI banner)

Disable schedulers temporarily:

# Set environment variable
SCHEDULER_DISABLED=true

Perform maintenance (database migrations, service updates)

Re-enable schedulers:

# Set environment variable
SCHEDULER_DISABLED=false

Manually trigger missed runs if needed:

curl -X POST http://production-service:8000/test/production-scheduler \
  -H "Authorization: Bearer $TOKEN"
curl -X POST http://orders-service:8000/test/procurement-scheduler \
  -H "Authorization: Bearer $TOKEN"

Database Migrations

When adding fields to scheduler-related tables:

Create migration with proper rollback
Test migration on staging environment
Run migration during low-traffic period (3-4 AM)
Verify scheduler still works after migration
Monitor metrics for anomalies

Cache Maintenance

Clear Stale Cache Entries:

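# Clear all forecast cache (will rebuild automatically)
# --scan iterates incrementally instead of blocking Redis the way KEYS does
redis-cli --scan --pattern "forecast:*" | xargs redis-cli DEL

# Clear specific tenant's cache
redis-cli --scan --pattern "forecast:{tenant_id}:*" | xargs redis-cli DEL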

Monitor Cache Size:

# Count forecast keys (DBSIZE alone counts every key in the database)
redis-cli --scan --pattern "forecast:*" | wc -l

# Check memory usage
redis-cli INFO memory

API Reference

Production Scheduler Endpoints

POST /test/production-scheduler
Description: Manually trigger production scheduler (test mode)
Auth: Bearer token required
Response: {"message": "Production scheduler test triggered successfully"}

Procurement Scheduler Endpoints

POST /test/procurement-scheduler
Description: Manually trigger procurement scheduler (test mode)
Auth: Bearer token required
Response: {"message": "Procurement scheduler test triggered successfully"}

Forecast Cache Endpoints

GET /api/v1/{tenant_id}/forecasting/cache/stats
Description: Get forecast cache statistics
Auth: Bearer token required
Response: {
  "available": true,
  "total_forecast_keys": 1234,
  "batch_forecast_keys": 45,
  "single_forecast_keys": 1189,
  "hit_rate_percent": 87.5,
  ...
}

DELETE /api/v1/{tenant_id}/forecasting/cache/product/{product_id}
Description: Invalidate forecast cache for specific product
Auth: Bearer token required (admin only)
Response: {"invalidated_keys": 7}

DELETE /api/v1/{tenant_id}/forecasting/cache
Description: Invalidate all forecast cache for tenant
Auth: Bearer token required (admin only)
Response: {"invalidated_keys": 123}
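
For scripted use, the admin invalidation endpoint could be called from Python. The sketch below only builds the request with the standard library and does not send it; the host name in `BASE_URL` and the function name are assumptions, while the path, method, and bearer authentication come from the reference above.

```python
from urllib.parse import quote
from urllib.request import Request

BASE_URL = "http://forecasting-service:8000"  # hypothetical host, adjust per environment


def invalidate_product_cache_request(tenant_id: str, product_id: str, token: str) -> Request:
    """Build (but do not send) the DELETE request for one product's forecast cache."""
    url = (
        f"{BASE_URL}/api/v1/{quote(tenant_id, safe='')}"
        f"/forecasting/cache/product/{quote(product_id, safe='')}"
    )
    return Request(url, method="DELETE", headers={"Authorization": f"Bearer {token}"})


req = invalidate_product_cache_request("abc-123", "prod-456", "example-token")
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req) would send it; omitted so the sketch runs offline.
```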

Change Log

Version 2.0 (2025-10-09) - Automated Scheduling

Added:

✨ ProductionSchedulerService for automated daily production planning
✨ Timezone configuration in Tenant model
✨ Forecast caching in Forecasting Service (service-level)
✨ Plan rejection workflow with auto-regeneration
✨ Comprehensive Prometheus metrics for monitoring
✨ TimezoneHelper utility for consistent timezone handling

Changed:

🔄 All schedulers now timezone-aware
🔄 Forecast service returns cached: true flag in metadata
🔄 Plan rejection triggers notifications and events

Fixed:

🐛 Duplicate forecast computations eliminated (50% reduction)
🐛 Timezone-related scheduling issues resolved
🐛 Rejected plans now have proper workflow handling

Documentation:

📚 Comprehensive production planning system documentation
📚 Runbooks for troubleshooting common issues
📚 Monitoring and alerting guidelines

Version 1.0 (2025-10-07) - Initial Release

Added:

✨ ProcurementSchedulerService for automated procurement planning
✨ Daily, weekly, and cleanup jobs
✨ Leader election for distributed deployments
✨ Parallel tenant processing with timeouts

Support & Contact

For issues or questions about the Production Planning System:

Documentation: This file
Source Code: services/production/, services/orders/
Issues: GitHub Issues
Slack: #production-planning channel

Document Version: 2.0
Last Review Date: 2025-10-09
Next Review Date: 2025-11-09
