# Production Planning System Documentation

## Overview

The Production Planning System automates daily production and procurement scheduling for bakery operations. The system consists of two primary schedulers that run every morning to generate plans based on demand forecasts, inventory levels, and capacity constraints.

**Last Updated:** 2025-10-09
**Version:** 2.0 (Automated Scheduling)
**Status:** Production Ready

---

## Architecture

### System Components

```
┌─────────────────────────────────────────────────────────────────┐
│                   DAILY PLANNING WORKFLOW                        │
└─────────────────────────────────────────────────────────────────┘

05:30 AM → Production Scheduler
           ├─ Generates production schedules for all tenants
           ├─ Calls Forecasting Service (cached) for demand
           ├─ Calls Orders Service for demand requirements
           ├─ Creates production batches
           └─ Sends notifications to production managers

06:00 AM → Procurement Scheduler
           ├─ Generates procurement plans for all tenants
           ├─ Calls Forecasting Service (cached - reuses cached data!)
           ├─ Calls Inventory Service for stock levels
           ├─ Matches suppliers for requirements
           └─ Sends notifications to procurement managers

08:00 AM → Operators review plans
           ├─ Accept → Plans move to "approved" status
           ├─ Reject → Automatic regeneration if stale data detected
           └─ Modify → Recalculate and resubmit

Throughout Day → Alert services monitor execution
                 ├─ Production delays
                 ├─ Capacity issues
                 ├─ Quality problems
                 └─ Equipment failures
```

### Services Involved

| Service | Role | Endpoints |
|---------|------|-----------|
| **Production Service** | Generates daily production schedules | `POST /api/v1/{tenant_id}/production/operations/schedule` |
| **Orders Service** | Generates daily procurement plans | `POST /api/v1/{tenant_id}/orders/operations/procurement/generate` |
| **Forecasting Service** | Provides demand predictions (cached) | `POST /api/v1/{tenant_id}/forecasting/operations/single` |
| **Inventory Service** | Provides current stock levels | `GET /api/v1/{tenant_id}/inventory/products` |
| **Tenant Service** | Provides timezone configuration | `GET /api/v1/tenants/{tenant_id}` |

---

## Schedulers

### 1. Production Scheduler

**Service:** Production Service
**Class:** `ProductionSchedulerService`
**File:** [`services/production/app/services/production_scheduler_service.py`](../services/production/app/services/production_scheduler_service.py)

#### Schedule

| Job | Time | Purpose | Grace Period |
|-----|------|---------|--------------|
| **Daily Production Planning** | 5:30 AM (tenant timezone) | Generate next-day production schedules | 5 minutes |
| **Stale Schedule Cleanup** | 5:50 AM | Archive/cancel old schedules, send escalations | 5 minutes |
| **Test Mode** | Every 30 min (DEBUG only) | Development/testing | 5 minutes |

#### Features

- ✅ **Timezone-aware**: Respects tenant timezone configuration
- ✅ **Leader election**: Only one instance runs in distributed deployment
- ✅ **Idempotent**: Checks if schedule exists before creating
- ✅ **Parallel processing**: Processes tenants concurrently with timeouts
- ✅ **Error isolation**: Tenant failures don't affect others
- ✅ **Demo tenant filtering**: Excludes demo tenants from automation

#### Workflow

1. **Tenant Discovery**: Fetch all active non-demo tenants
2. **Parallel Processing**: Process each tenant concurrently (180s timeout)
3. **Date Calculation**: Use tenant timezone to determine target date
4. **Duplicate Check**: Skip if schedule already exists
5. **Requirements Calculation**: Call `calculate_daily_requirements()`
6. **Schedule Creation**: Create schedule with status "draft"
7. **Batch Generation**: Create production batches from requirements
8. **Notification**: Send alert to production managers
9. **Monitoring**: Record metrics for observability
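
Steps 2 and the error-isolation feature can be sketched with `asyncio`. This is a minimal illustration, not the actual `ProductionSchedulerService` code: `process_tenant`, `process_with_timeout`, and `process_all` are hypothetical names, and the per-tenant work is replaced by a placeholder.

```python
import asyncio

TENANT_TIMEOUT_SECONDS = 180  # per-tenant timeout from the workflow above


async def process_tenant(tenant_id: str) -> str:
    """Hypothetical stand-in for steps 3-9 for a single tenant."""
    if tenant_id == "broken":
        raise RuntimeError("simulated tenant failure")
    if tenant_id == "slow":
        await asyncio.sleep(1)  # simulates a tenant that exceeds the timeout
    await asyncio.sleep(0)  # placeholder for real I/O
    return "success"


async def process_with_timeout(tenant_id: str, timeout: float) -> str:
    """Run one tenant and turn timeouts and errors into a status string."""
    try:
        return await asyncio.wait_for(process_tenant(tenant_id), timeout=timeout)
    except asyncio.TimeoutError:
        return "timeout"
    except Exception:
        # Error isolation: one tenant's failure never propagates to the others.
        return "failure"


async def process_all(tenant_ids, timeout: float = TENANT_TIMEOUT_SECONDS) -> dict:
    """Process all tenants concurrently and collect one status per tenant."""
    statuses = await asyncio.gather(
        *(process_with_timeout(tenant_id, timeout) for tenant_id in tenant_ids)
    )
    return dict(zip(tenant_ids, statuses))


results = asyncio.run(process_all(["bakery-a", "broken", "bakery-b"]))
```

Because every exception is converted to a status inside `process_with_timeout`, `asyncio.gather` never sees a failure and the remaining tenants always run to completion.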

#### Configuration

```ini
# Environment Variables
PRODUCTION_TEST_MODE=false  # Enable 30-minute test job
DEBUG=false                 # Enable verbose logging

# Tenant Configuration
tenant.timezone=Europe/Madrid  # IANA timezone string
```

---

### 2. Procurement Scheduler

**Service:** Orders Service
**Class:** `ProcurementSchedulerService`
**File:** [`services/orders/app/services/procurement_scheduler_service.py`](../services/orders/app/services/procurement_scheduler_service.py)

#### Schedule

| Job | Time | Purpose | Grace Period |
|-----|------|---------|--------------|
| **Daily Procurement Planning** | 6:00 AM (tenant timezone) | Generate next-day procurement plans | 5 minutes |
| **Stale Plan Cleanup** | 6:30 AM | Archive/cancel old plans, send reminders | 5 minutes |
| **Weekly Optimization** | Monday 7:00 AM | Weekly procurement optimization review | 10 minutes |
| **Test Mode** | Every 30 min (DEBUG only) | Development/testing | 5 minutes |

#### Features

- ✅ **Timezone-aware**: Respects tenant timezone configuration
- ✅ **Leader election**: Prevents duplicate runs
- ✅ **Idempotent**: Checks if plan exists before generating
- ✅ **Parallel processing**: 120s timeout per tenant
- ✅ **Forecast fallback**: Uses historical data if forecast unavailable
- ✅ **Critical stock alerts**: Automatic alerts for zero-stock items
- ✅ **Rejection workflow**: Auto-regeneration for rejected plans

#### Workflow

1. **Tenant Discovery**: Fetch active non-demo tenants
2. **Parallel Processing**: Process each tenant (120s timeout)
3. **Date Calculation**: Use tenant timezone
4. **Duplicate Check**: Skip if plan exists (unless force_regenerate)
5. **Forecasting**: Call Forecasting Service (uses cache!)
6. **Inventory Check**: Get current stock levels
7. **Requirements Calculation**: Calculate net requirements
8. **Supplier Matching**: Find suitable suppliers
9. **Plan Creation**: Create plan with status "draft"
10. **Critical Alerts**: Send alerts for critical items
11. **Notification**: Notify procurement managers
12. **Caching**: Cache plan in Redis (6h TTL)
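
Steps 5-7 and step 10 can be illustrated with a small calculation. The sketch below assumes that a net requirement is forecast demand minus current stock (never negative) and that a "critical" item is one with demand but zero stock; the function names and the plain-dictionary inputs are hypothetical, and the real service may also account for safety stock or pending orders.

```python
def net_requirements(forecast_demand: dict, stock_levels: dict) -> dict:
    """Return the quantity to procure per product (assumed: demand - stock)."""
    net = {}
    for product_id, demand in forecast_demand.items():
        # Products missing from the stock report are treated as zero stock.
        shortfall = demand - stock_levels.get(product_id, 0.0)
        if shortfall > 0:
            net[product_id] = shortfall
    return net


def critical_items(forecast_demand: dict, stock_levels: dict) -> list:
    """Return products that have demand but no stock at all."""
    return [
        product_id
        for product_id, demand in forecast_demand.items()
        if demand > 0 and stock_levels.get(product_id, 0.0) <= 0
    ]


demand = {"flour": 50.0, "yeast": 2.0, "butter": 10.0}
stock = {"flour": 20.0, "yeast": 5.0, "butter": 0.0}

plan = net_requirements(demand, stock)
alerts = critical_items(demand, stock)
```

Here `yeast` is left out of the plan because stock already covers demand, while `butter` appears both in the plan and in the critical list.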

---

## Forecast Caching

### Overview

To eliminate redundant forecast computations, the Forecasting Service now includes a service-level Redis cache. Both Production and Procurement schedulers benefit from this without any code changes.

**File:** [`services/forecasting/app/services/forecast_cache.py`](../services/forecasting/app/services/forecast_cache.py)

### Cache Strategy

```
Key Format: forecast:{tenant_id}:{product_id}:{forecast_date}
TTL: Until midnight of day after forecast_date
Example: forecast:abc-123:prod-456:2025-10-10 → expires 2025-10-11 00:00:00
```
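
The key format and expiry rule can be expressed in a few lines. This is a sketch under stated assumptions: the helper names are hypothetical, and it uses naive datetimes because the strategy above does not say which timezone "midnight" refers to.

```python
from datetime import date, datetime, time, timedelta


def forecast_cache_key(tenant_id: str, product_id: str, forecast_date: date) -> str:
    """Build a key in the documented format."""
    return f"forecast:{tenant_id}:{product_id}:{forecast_date.isoformat()}"


def forecast_expiry(forecast_date: date) -> datetime:
    """Midnight at the start of the day after forecast_date."""
    return datetime.combine(forecast_date + timedelta(days=1), time.min)


def ttl_seconds(forecast_date: date, now: datetime) -> int:
    """Seconds remaining until expiry, never negative."""
    remaining = forecast_expiry(forecast_date) - now
    return max(0, int(remaining.total_seconds()))


key = forecast_cache_key("abc-123", "prod-456", date(2025, 10, 10))
expires_at = forecast_expiry(date(2025, 10, 10))
# A forecast cached by the 05:30 production run stays valid for 18.5 hours.
ttl = ttl_seconds(date(2025, 10, 10), now=datetime(2025, 10, 10, 5, 30))
```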

### Cache Flow

```
Client Request → Forecasting API
                     ↓
                Check Redis Cache
                     ├─ HIT → Return cached result (add 'cached: true')
                     └─ MISS → Generate forecast
                                ↓
                           Cache result (TTL)
                                ↓
                           Return result
```
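
The flow above is a standard cache-aside lookup. The sketch below uses an in-memory dictionary in place of Redis so that it runs on its own; `InMemoryForecastCache` and `get_forecast` are hypothetical names, and TTL handling is omitted for brevity.

```python
from typing import Callable, Optional


class InMemoryForecastCache:
    """Dictionary-backed stand-in for the Redis cache (no TTL support)."""

    def __init__(self) -> None:
        self._store: dict = {}

    def get(self, key: str) -> Optional[dict]:
        return self._store.get(key)

    def set(self, key: str, value: dict) -> None:
        self._store[key] = value


def get_forecast(cache: InMemoryForecastCache, key: str,
                 generate: Callable[[], dict]) -> dict:
    """Return a cached forecast if present, otherwise generate and cache it."""
    cached = cache.get(key)
    if cached is not None:
        # HIT: return the stored result and flag it as cached.
        return {**cached, "cached": True}
    # MISS: compute the forecast, store it, and return it unflagged.
    result = generate()
    cache.set(key, result)
    return dict(result)


calls = []


def generate_forecast() -> dict:
    """Placeholder for the expensive model call; records each invocation."""
    calls.append(1)
    return {"predicted_demand": 120.0}


cache = InMemoryForecastCache()
key = "forecast:abc-123:prod-456:2025-10-10"
first = get_forecast(cache, key, generate_forecast)   # miss: generates
second = get_forecast(cache, key, generate_forecast)  # hit: served from cache
```

The second call never reaches `generate_forecast`, which mirrors how the 06:00 procurement run can reuse the forecast computed for the 05:30 production run.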

### Benefits

| Metric | Before Caching | After Caching | Improvement |
|--------|---------------|---------------|-------------|
| **Duplicate Forecasts** | 2x per day (Production + Procurement) | 1x per day | 50% reduction |
| **Forecast Response Time** | ~2-5 seconds | ~50-100ms (cache hit) | 95%+ faster |
| **Forecasting Service Load** | 100% | 50% | 50% reduction |
| **Cache Hit Rate** | N/A | ~80-90% (expected) | - |

### Cache Invalidation

Forecasts are invalidated when:

1. **TTL Expiry**: Automatic at midnight after forecast_date
2. **Model Retraining**: When ML model is retrained for product
3. **Manual Invalidation**: Via API endpoint (admin only)

```
# Invalidate specific product forecasts
DELETE /api/v1/{tenant_id}/forecasting/cache/product/{product_id}

# Invalidate all tenant forecasts
DELETE /api/v1/{tenant_id}/forecasting/cache

# Invalidate all forecasts (use with caution!)
DELETE /admin/forecasting/cache/all
```

---

## Plan Rejection Workflow

### Overview

When an operator rejects a procurement plan, the system records the rejection, notifies stakeholders, and, when the rejection notes indicate stale data, schedules an automatic regeneration.

**File:** [`services/orders/app/services/procurement_service.py`](../services/orders/app/services/procurement_service.py:1244-1404)

### Rejection Flow

```
User Rejects Plan (status → "cancelled")
           ↓
    Record rejection in approval_workflow (JSONB)
           ↓
    Send notification to stakeholders
           ↓
    Publish rejection event (RabbitMQ)
           ↓
    Analyze rejection reason
           ├─ Contains "stale", "outdated", etc. → Auto-regenerate
           └─ Other reason → Manual regeneration required
           ↓
    Schedule regeneration (if applicable)
           ↓
    Send regeneration request event
```

### Auto-Regeneration Keywords

Plans are automatically regenerated if rejection notes contain:

- `stale`
- `outdated`
- `old data`
- `datos antiguos` (Spanish)
- `desactualizado` (Spanish)
- `obsoleto` (Spanish)
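
The keyword check can be sketched as a simple substring match. This assumes the comparison is case-insensitive, which the list above does not state; `should_auto_regenerate` is a hypothetical name, not the function in `procurement_service.py`.

```python
from typing import Optional

AUTO_REGENERATE_KEYWORDS = (
    "stale",
    "outdated",
    "old data",
    "datos antiguos",
    "desactualizado",
    "obsoleto",
)


def should_auto_regenerate(rejection_notes: Optional[str]) -> bool:
    """Return True if the notes contain any auto-regeneration keyword."""
    if not rejection_notes:
        return False
    normalized = rejection_notes.lower()  # assumed: matching ignores case
    return any(keyword in normalized for keyword in AUTO_REGENERATE_KEYWORDS)
```

For example, `should_auto_regenerate("Forecast looks STALE")` is true, while a note such as `"Supplier price too high"` would leave the plan for manual regeneration.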

### Events Published

| Event | Routing Key | Consumers |
|-------|-------------|-----------|
| **Plan Rejected** | `procurement.plan.rejected` | Alert Service, UI Notifications |
| **Regeneration Requested** | `procurement.plan.regeneration_requested` | Procurement Scheduler |
| **Plan Status Changed** | `procurement.plan.status_changed` | Inventory Service, Dashboard |

---

## Timezone Configuration

### Overview

All schedulers are timezone-aware to ensure accurate "daily" execution relative to the bakery's local time.

### Tenant Configuration

**Model:** `Tenant`
**File:** [`services/tenant/app/models/tenants.py`](../services/tenant/app/models/tenants.py:32-33)
**Field:** `timezone` (String, default: `"Europe/Madrid"`)

**Migration:** [`services/tenant/migrations/versions/20251009_add_timezone_to_tenants.py`](../services/tenant/migrations/versions/20251009_add_timezone_to_tenants.py)

### Supported Timezones

All IANA timezone strings are supported. Common examples:

- `Europe/Madrid` (Spain - CEST/CET)
- `Europe/London` (UK - BST/GMT)
- `America/New_York` (US Eastern)
- `America/Los_Angeles` (US Pacific)
- `Asia/Tokyo` (Japan)
- `UTC` (Universal Time)

### Usage in Schedulers

```python
from shared.utils.timezone_helper import TimezoneHelper

# Get current date in tenant's timezone
target_date = TimezoneHelper.get_current_date_in_timezone(tenant_tz)

# Get current datetime in tenant's timezone
now = TimezoneHelper.get_current_datetime_in_timezone(tenant_tz)

# Check if within business hours
is_business_hours = TimezoneHelper.is_business_hours(
    timezone_str=tenant_tz,
    start_hour=8,
    end_hour=20
)
```
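
For readers without access to the shared utility, the behaviour of these helpers can be approximated with the standard-library `zoneinfo` module. This is a sketch, not the `TimezoneHelper` source: the function names, the optional `now_utc` parameter (added for testability), and the exclusive `end_hour` bound are all assumptions.

```python
from datetime import date, datetime, timezone
from typing import Optional
from zoneinfo import ZoneInfo


def current_datetime_in_timezone(timezone_str: str,
                                 now_utc: Optional[datetime] = None) -> datetime:
    """Convert the current (or supplied) UTC instant to the tenant's local time."""
    now_utc = now_utc or datetime.now(timezone.utc)
    return now_utc.astimezone(ZoneInfo(timezone_str))


def current_date_in_timezone(timezone_str: str,
                             now_utc: Optional[datetime] = None) -> date:
    """Local calendar date for the tenant; may differ from the UTC date."""
    return current_datetime_in_timezone(timezone_str, now_utc).date()


def is_business_hours(timezone_str: str, start_hour: int = 8, end_hour: int = 20,
                      now_utc: Optional[datetime] = None) -> bool:
    """True if the tenant's local hour is in [start_hour, end_hour)."""
    local = current_datetime_in_timezone(timezone_str, now_utc)
    return start_hour <= local.hour < end_hour


# 22:30 UTC on 9 October is already 10 October in Madrid (UTC+2 in summer time).
instant = datetime(2025, 10, 9, 22, 30, tzinfo=timezone.utc)
madrid_date = current_date_in_timezone("Europe/Madrid", instant)
new_york_open = is_business_hours("America/New_York", now_utc=instant)
tokyo_open = is_business_hours("Asia/Tokyo", now_utc=instant)
```

The Madrid case shows why the target date must be computed per tenant: the same instant falls on different calendar days depending on the timezone.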

---

## Monitoring & Alerts

### Prometheus Metrics

**File:** [`shared/monitoring/scheduler_metrics.py`](../shared/monitoring/scheduler_metrics.py)

#### Key Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `production_schedules_generated_total` | Counter | Total production schedules generated (by tenant, status) |
| `production_schedule_generation_duration_seconds` | Histogram | Time to generate schedule per tenant |
| `procurement_plans_generated_total` | Counter | Total procurement plans generated (by tenant, status) |
| `procurement_plan_generation_duration_seconds` | Histogram | Time to generate plan per tenant |
| `forecast_cache_hits_total` | Counter | Forecast cache hits (by tenant) |
| `forecast_cache_misses_total` | Counter | Forecast cache misses (by tenant) |
| `forecast_cache_hit_rate` | Gauge | Cache hit rate percentage (0-100) |
| `procurement_plan_rejections_total` | Counter | Plan rejections (by tenant, auto_regenerated) |
| `scheduler_health_status` | Gauge | Scheduler health (1=healthy, 0=unhealthy) |
| `tenant_processing_timeout_total` | Counter | Tenant processing timeouts (by service) |

### Recommended Alerts

```yaml
# Alert: Daily production planning failed
- alert: DailyProductionPlanningFailed
  expr: increase(production_schedules_generated_total{status="failure"}[1h]) > 0
  for: 10m
  labels:
    severity: high
  annotations:
    summary: "Daily production planning failed for at least one tenant"
    description: "Check production scheduler logs for tenant {{ $labels.tenant_id }}"

# Alert: Daily procurement planning failed
- alert: DailyProcurementPlanningFailed
  expr: increase(procurement_plans_generated_total{status="failure"}[1h]) > 0
  for: 10m
  labels:
    severity: high
  annotations:
    summary: "Daily procurement planning failed for at least one tenant"
    description: "Check procurement scheduler logs for tenant {{ $labels.tenant_id }}"

# Alert: No production schedules in 24 hours
- alert: NoProductionSchedulesGenerated
  expr: rate(production_schedules_generated_total{status="success"}[24h]) == 0
  for: 1h
  labels:
    severity: critical
  annotations:
    summary: "No production schedules generated in last 24 hours"
    description: "Production scheduler may be down or misconfigured"

# Alert: Forecast cache hit rate low
- alert: ForecastCacheHitRateLow
  expr: forecast_cache_hit_rate < 50
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Forecast cache hit rate below 50%"
    description: "Cache may not be functioning correctly for tenant {{ $labels.tenant_id }}"

# Alert: High tenant processing timeouts
- alert: HighTenantProcessingTimeouts
  expr: rate(tenant_processing_timeout_total[5m]) > 0.1
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "High rate of tenant processing timeouts"
    description: "{{ $labels.service }} scheduler experiencing timeouts for tenant {{ $labels.tenant_id }}"

# Alert: Scheduler unhealthy
- alert: SchedulerUnhealthy
  expr: scheduler_health_status == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Scheduler is unhealthy"
    description: "{{ $labels.service }} {{ $labels.scheduler_type }} scheduler is reporting unhealthy status"
```

### Grafana Dashboard

Create a dashboard with panels for:

1. **Scheduler Success Rate** (line chart)
   - `production_schedules_generated_total{status="success"}`
   - `procurement_plans_generated_total{status="success"}`

2. **Schedule Generation Duration** (heatmap)
   - `production_schedule_generation_duration_seconds`
   - `procurement_plan_generation_duration_seconds`

3. **Forecast Cache Hit Rate** (gauge)
   - `forecast_cache_hit_rate`

4. **Tenant Processing Status** (pie chart)
   - `production_tenants_processed_total`
   - `procurement_tenants_processed_total`

5. **Plan Rejections** (table)
   - `procurement_plan_rejections_total`

6. **Scheduler Health** (status panel)
   - `scheduler_health_status`

---

## Testing

### Manual Testing

#### Test Production Scheduler

```bash
# Trigger test production schedule generation
curl -X POST http://production-service:8000/test/production-scheduler \
  -H "Authorization: Bearer $TOKEN"

# Expected response:
# {
#   "message": "Production scheduler test triggered successfully"
# }
```

#### Test Procurement Scheduler

```bash
# Trigger test procurement plan generation
curl -X POST http://orders-service:8000/test/procurement-scheduler \
  -H "Authorization: Bearer $TOKEN"

# Expected response:
# {
#   "message": "Procurement scheduler test triggered successfully"
# }
```

### Automated Testing

```python
import pytest

# Assumes pytest-asyncio is installed, and that `config`, `redis_url`,
# `tenant_id`, `product_id`, `forecast_date`, and `data` are defined by the
# test suite. Import the scheduler classes and `get_forecast_cache_service`
# from their service packages.

# Test production scheduler
@pytest.mark.asyncio
async def test_production_scheduler():
    scheduler = ProductionSchedulerService(config)
    await scheduler.start()
    await scheduler.test_production_schedule_generation()
    assert scheduler._checks_performed > 0

# Test procurement scheduler
@pytest.mark.asyncio
async def test_procurement_scheduler():
    scheduler = ProcurementSchedulerService(config)
    await scheduler.start()
    await scheduler.test_procurement_generation()
    assert scheduler._checks_performed > 0

# Test forecast caching
@pytest.mark.asyncio
async def test_forecast_cache():
    cache = get_forecast_cache_service(redis_url)

    # Cache forecast
    await cache.cache_forecast(tenant_id, product_id, forecast_date, data)

    # Retrieve cached forecast
    cached = await cache.get_cached_forecast(tenant_id, product_id, forecast_date)
    assert cached is not None
    assert cached['cached'] is True
```

---

## Troubleshooting

### Scheduler Not Running

**Symptoms:** No schedules or plans generated in the morning

**Checks:**
1. Verify scheduler service is running: `kubectl get pods -n production`
2. Check scheduler health endpoint: `curl http://service:8000/health`
3. Check APScheduler status in logs: `grep "scheduler" logs/production.log`
4. Verify leader election (distributed setup): Check `is_leader` in logs

**Solutions:**
- Restart service: `kubectl rollout restart deployment/production-service`
- Check environment variables: `PRODUCTION_TEST_MODE`, `DEBUG`
- Verify database connectivity
- Check RabbitMQ connectivity for leader election

### Timezone Issues

**Symptoms:** Schedules generated at wrong time

**Checks:**
1. Check tenant timezone configuration:
   ```sql
   SELECT id, name, timezone FROM tenants WHERE id = '{tenant_id}';
   ```
2. Verify server timezone: `date` (should be UTC in containers)
3. Check logs for timezone warnings

**Solutions:**
- Update tenant timezone: `UPDATE tenants SET timezone = 'Europe/Madrid' WHERE id = '{tenant_id}';`
- Verify TimezoneHelper is being used in schedulers
- Check cron trigger configuration uses correct timezone

### Low Cache Hit Rate

**Symptoms:** `forecast_cache_hit_rate < 50%`

**Checks:**
1. Verify Redis is running: `redis-cli ping`
2. Check cache keys: `redis-cli KEYS "forecast:*"`
3. Check TTL on cache entries: `redis-cli TTL "forecast:{tenant}:{product}:{date}"`
4. Review logs for cache errors

**Solutions:**
- Restart Redis if unhealthy
- Clear cache and let it rebuild: `redis-cli FLUSHDB` (this deletes every key in the selected Redis database, not only forecast entries)
- Verify REDIS_URL environment variable
- Check Redis memory limits: `redis-cli INFO memory`

### Plan Rejection Not Auto-Regenerating

**Symptoms:** Rejected plans not triggering regeneration

**Checks:**
1. Check rejection notes contain auto-regenerate keywords
2. Verify RabbitMQ events are being published: Check `procurement.plan.rejected` queue
3. Check scheduler is listening to regeneration events

**Solutions:**
- Use keywords like "stale" or "outdated" in rejection notes
- Manually trigger regeneration via API
- Check RabbitMQ connectivity
- Verify event routing keys are correct

### Tenant Processing Timeouts

**Symptoms:** `tenant_processing_timeout_total` increasing

**Checks:**
1. Check timeout duration (180s for production, 120s for procurement)
2. Review slow queries in database logs
3. Check external service response times (Forecasting, Inventory)
4. Monitor CPU/memory usage during scheduler runs

**Solutions:**
- Increase timeout if consistently hitting limit
- Optimize database queries (add indexes)
- Scale external services if response time high
- Process fewer tenants in parallel (reduce concurrency)
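
Reducing concurrency, the last solution above, is commonly done with a semaphore. The sketch below is one way to do it under stated assumptions: `run_bounded` and `demo_worker` are hypothetical names, and the worker only sleeps to simulate per-tenant work.

```python
import asyncio


async def run_bounded(tenant_ids, worker, max_concurrency: int):
    """Run worker(tenant_id) for every tenant, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(tenant_id):
        async with semaphore:  # waits here while the limit is reached
            return await worker(tenant_id)

    return await asyncio.gather(*(guarded(tenant_id) for tenant_id in tenant_ids))


active = 0  # workers currently running
peak = 0    # highest number of workers seen running at once


async def demo_worker(tenant_id: str) -> str:
    """Simulated tenant processing that tracks how many workers overlap."""
    global active, peak
    active += 1
    peak = max(peak, active)
    await asyncio.sleep(0.01)
    active -= 1
    return tenant_id


tenants = [f"tenant-{i}" for i in range(6)]
processed = asyncio.run(run_bounded(tenants, demo_worker, max_concurrency=2))
```

With six tenants and a limit of two, `peak` never exceeds two, which lowers the load on the database and on the Forecasting and Inventory services at the cost of a longer total run.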

---

## Maintenance

### Scheduled Maintenance Windows

When performing maintenance on schedulers:

1. **Announce downtime** to users (UI banner)
2. **Disable schedulers** temporarily:
   ```bash
   # Set environment variable
   SCHEDULER_DISABLED=true
   ```
3. **Perform maintenance** (database migrations, service updates)
4. **Re-enable schedulers**:
   ```bash
   SCHEDULER_DISABLED=false
   ```
5. **Manually trigger** missed runs if needed:
   ```bash
   curl -X POST http://production-service:8000/test/production-scheduler \
     -H "Authorization: Bearer $TOKEN"
   curl -X POST http://orders-service:8000/test/procurement-scheduler \
     -H "Authorization: Bearer $TOKEN"
   ```

### Database Migrations

When adding fields to scheduler-related tables:

1. **Create migration** with proper rollback
2. **Test migration** on staging environment
3. **Run migration** during low-traffic period (3-4 AM)
4. **Verify scheduler** still works after migration
5. **Monitor metrics** for anomalies

### Cache Maintenance

**Clear Stale Cache Entries:**
```bash
# Clear all forecast cache (will rebuild automatically)
# --scan iterates incrementally instead of blocking Redis like KEYS does
redis-cli --scan --pattern "forecast:*" | xargs redis-cli DEL

# Clear specific tenant's cache
redis-cli --scan --pattern "forecast:{tenant_id}:*" | xargs redis-cli DEL
```

**Monitor Cache Size:**
```bash
# Check total number of keys in the current database (all keys, not only forecasts)
redis-cli DBSIZE

# Check memory usage
redis-cli INFO memory
```

---

## API Reference

### Production Scheduler Endpoints

```
POST /test/production-scheduler
Description: Manually trigger production scheduler (test mode)
Auth: Bearer token required
Response: {"message": "Production scheduler test triggered successfully"}
```

### Procurement Scheduler Endpoints

```
POST /test/procurement-scheduler
Description: Manually trigger procurement scheduler (test mode)
Auth: Bearer token required
Response: {"message": "Procurement scheduler test triggered successfully"}
```

### Forecast Cache Endpoints

```
GET /api/v1/{tenant_id}/forecasting/cache/stats
Description: Get forecast cache statistics
Auth: Bearer token required
Response: {
  "available": true,
  "total_forecast_keys": 1234,
  "batch_forecast_keys": 45,
  "single_forecast_keys": 1189,
  "hit_rate_percent": 87.5,
  ...
}

DELETE /api/v1/{tenant_id}/forecasting/cache/product/{product_id}
Description: Invalidate forecast cache for specific product
Auth: Bearer token required (admin only)
Response: {"invalidated_keys": 7}

DELETE /api/v1/{tenant_id}/forecasting/cache
Description: Invalidate all forecast cache for tenant
Auth: Bearer token required (admin only)
Response: {"invalidated_keys": 123}
```

---

## Change Log

### Version 2.0 (2025-10-09) - Automated Scheduling

**Added:**
- ✨ ProductionSchedulerService for automated daily production planning
- ✨ Timezone configuration in Tenant model
- ✨ Forecast caching in Forecasting Service (service-level)
- ✨ Plan rejection workflow with auto-regeneration
- ✨ Comprehensive Prometheus metrics for monitoring
- ✨ TimezoneHelper utility for consistent timezone handling

**Changed:**
- 🔄 All schedulers now timezone-aware
- 🔄 Forecast service returns `cached: true` flag in metadata
- 🔄 Plan rejection triggers notifications and events

**Fixed:**
- 🐛 Duplicate forecast computations eliminated (50% reduction)
- 🐛 Timezone-related scheduling issues resolved
- 🐛 Rejected plans now have proper workflow handling

**Documentation:**
- 📚 Comprehensive production planning system documentation
- 📚 Runbooks for troubleshooting common issues
- 📚 Monitoring and alerting guidelines

### Version 1.0 (2025-10-07) - Initial Release

**Added:**
- ✨ ProcurementSchedulerService for automated procurement planning
- ✨ Daily, weekly, and cleanup jobs
- ✨ Leader election for distributed deployments
- ✨ Parallel tenant processing with timeouts

---

## Support & Contact

For issues or questions about the Production Planning System:

- **Documentation:** This file
- **Source Code:** `services/production/`, `services/orders/`
- **Issues:** GitHub Issues
- **Slack:** `#production-planning` channel

---

**Document Version:** 2.0
**Last Review Date:** 2025-10-09
**Next Review Date:** 2025-11-09