929 lines
32 KiB
Markdown
929 lines
32 KiB
Markdown
# Orchestrator Service
|
|
|
|
## Overview
|
|
|
|
The **Orchestrator Service** automates daily operational workflows by coordinating tasks across multiple microservices. It schedules and executes recurring jobs like daily forecasting, production planning, procurement needs calculation, and report generation. Operating on a configurable schedule (default: daily at 8:00 AM Madrid time), it ensures that bakery owners start each day with fresh forecasts, optimized production plans, and actionable insights - all without manual intervention.
|
|
|
|
## Key Features
|
|
|
|
### Workflow Automation
|
|
- **Daily Forecasting** - Generate 7-day demand forecasts every morning
|
|
- **Production Planning** - Calculate production schedules from forecasts
|
|
- **Procurement Planning** - Identify purchasing needs automatically
|
|
- **Inventory Projections** - Project stock levels for next 14 days
|
|
- **Report Generation** - Daily summaries, weekly digests
|
|
- **Model Retraining** - Weekly ML model updates
|
|
- **Alert Cleanup** - Archive resolved alerts
|
|
|
|
### Scheduling System
|
|
- **Cron-Based Scheduling** - Flexible schedule configuration
|
|
- **Timezone-Aware** - Respects tenant timezone (Madrid default)
|
|
- **Configurable Frequency** - Daily, weekly, monthly workflows
|
|
- **Time-Based Execution** - Run at optimal times (early morning)
|
|
- **Holiday Awareness** - Skip or adjust on public holidays
|
|
- **Weekend Handling** - Different schedules for weekends
|
|
|
|
### Workflow Execution
|
|
- **Sequential Workflows** - Execute steps in correct order
|
|
- **Parallel Execution** - Run independent tasks concurrently
|
|
- **Error Handling** - Retry failed tasks with exponential backoff
|
|
- **Timeout Management** - Cancel long-running tasks
|
|
- **Progress Tracking** - Monitor workflow execution status
|
|
- **Result Caching** - Cache workflow results in Redis
|
|
|
|
### Multi-Tenant Management
|
|
- **Per-Tenant Workflows** - Execute for all active tenants
|
|
- **Tenant Priority** - Prioritize by subscription tier
|
|
- **Tenant Filtering** - Skip suspended or cancelled tenants
|
|
- **Load Balancing** - Distribute tenant workflows evenly
|
|
- **Resource Limits** - Prevent resource exhaustion
|
|
|
|
### Monitoring & Observability
|
|
- **Workflow Metrics** - Execution time, success rate
|
|
- **Health Checks** - Service and job health monitoring
|
|
- **Failure Alerts** - Notify on workflow failures
|
|
- **Audit Logging** - Complete execution history
|
|
- **Performance Tracking** - Identify slow workflows
|
|
- **Cost Tracking** - Monitor computational costs
|
|
|
|
### Leader Election
|
|
- **Distributed Coordination** - Redis-based leader election
|
|
- **High Availability** - Multiple orchestrator instances
|
|
- **Automatic Failover** - New leader elected on failure
|
|
- **Split-Brain Prevention** - Ensure only one leader
|
|
- **Leader Health** - Continuous health monitoring
|
|
|
|
### 🆕 Enterprise Tier: Network Dashboard & Orchestration (NEW)
|
|
- **Aggregated Network Metrics** - Single dashboard view consolidating all child outlet data
|
|
- **Production Coordination** - Central production facility gets visibility into network-wide demand
|
|
- **Distribution Integration** - Dashboard displays active delivery routes and shipment status
|
|
- **Network Demand Forecasting** - Aggregated demand forecasts across all retail outlets
|
|
- **Multi-Location Performance** - Compare performance metrics across all locations
|
|
- **Child Outlet Visibility** - Drill down into individual outlet performance
|
|
- **Enterprise KPIs** - Network-level metrics: total production, total sales, network-wide waste reduction
|
|
- **Subscription Gating** - Enterprise dashboard requires Enterprise tier subscription
|
|
|
|
## Business Value
|
|
|
|
### For Bakery Owners
|
|
- **Zero Manual Work** - Forecasts and plans generated automatically
|
|
- **Consistent Execution** - Never forget to plan production
|
|
- **Early Morning Ready** - Start day with fresh data (8:00 AM)
|
|
- **Weekend Coverage** - Works 7 days/week, 365 days/year
|
|
- **Reliable** - Automatic retries on failures
|
|
- **Transparent** - Clear audit trail of all automation
|
|
|
|
### Quantifiable Impact
|
|
- **Time Savings**: 15-20 hours/week on manual planning (€900-1,200/month)
|
|
- **Consistency**: 100% vs. 70-80% manual execution rate
|
|
- **Early Detection**: Issues identified before business hours
|
|
- **Error Reduction**: 95%+ accuracy vs. 80-90% manual
|
|
- **Staff Freedom**: Staff focus on operations, not planning
|
|
- **Scalability**: Handles 10,000+ tenants automatically
|
|
|
|
### For Platform Operations
|
|
- **Automation**: 95%+ of platform operations automated
|
|
- **Scalability**: Linear cost scaling with tenants
|
|
- **Reliability**: 99.9%+ workflow success rate
|
|
- **Predictability**: Consistent execution times
|
|
- **Resource Efficiency**: Optimal resource utilization
|
|
- **Cost Control**: Prevent runaway computational costs
|
|
|
|
## Technology Stack
|
|
|
|
- **Framework**: FastAPI (Python 3.11+) - Async web framework
|
|
- **Scheduler**: APScheduler - Job scheduling
|
|
- **Database**: PostgreSQL 17 - Workflow history
|
|
- **Caching**: Redis 7.4 - Leader election, results cache
|
|
- **Messaging**: RabbitMQ 4.1 - Event publishing
|
|
- **HTTP Client**: HTTPx - Async service calls
|
|
- **ORM**: SQLAlchemy 2.0 (async) - Database abstraction
|
|
- **Logging**: Structlog - Structured JSON logging
|
|
- **Metrics**: Prometheus Client - Workflow metrics
|
|
|
|
## API Endpoints (Key Routes)
|
|
|
|
### Workflow Management
|
|
- `GET /api/v1/orchestrator/workflows` - List workflows
|
|
- `GET /api/v1/orchestrator/workflows/{workflow_id}` - Get workflow details
|
|
- `POST /api/v1/orchestrator/workflows/{workflow_id}/execute` - Manually trigger workflow
|
|
- `PUT /api/v1/orchestrator/workflows/{workflow_id}` - Update workflow configuration
|
|
- `POST /api/v1/orchestrator/workflows/{workflow_id}/enable` - Enable workflow
|
|
- `POST /api/v1/orchestrator/workflows/{workflow_id}/disable` - Disable workflow
|
|
|
|
### Execution History
|
|
- `GET /api/v1/orchestrator/executions` - List workflow executions
|
|
- `GET /api/v1/orchestrator/executions/{execution_id}` - Get execution details
|
|
- `GET /api/v1/orchestrator/executions/{execution_id}/logs` - Get execution logs
|
|
- `GET /api/v1/orchestrator/executions/failed` - List failed executions
|
|
- `POST /api/v1/orchestrator/executions/{execution_id}/retry` - Retry failed execution
|
|
|
|
### Scheduling
|
|
- `GET /api/v1/orchestrator/schedule` - Get current schedule
|
|
- `PUT /api/v1/orchestrator/schedule` - Update schedule
|
|
- `GET /api/v1/orchestrator/schedule/next-run` - Get next execution time
|
|
|
|
### Health & Monitoring
|
|
- `GET /api/v1/orchestrator/health` - Service health
|
|
- `GET /api/v1/orchestrator/leader` - Current leader instance
|
|
- `GET /api/v1/orchestrator/metrics` - Workflow metrics
|
|
- `GET /api/v1/orchestrator/statistics` - Execution statistics
|
|
|
|
### 🆕 Enterprise Network Dashboard (NEW)
|
|
- `GET /api/v1/{parent_tenant}/orchestrator/enterprise/dashboard` - Get aggregated enterprise network dashboard
|
|
- `GET /api/v1/{parent_tenant}/orchestrator/enterprise/network-summary` - Get network-wide summary metrics
|
|
- `GET /api/v1/{parent_tenant}/orchestrator/enterprise/production-overview` - Get production coordination overview
|
|
- `GET /api/v1/{parent_tenant}/orchestrator/enterprise/distribution-status` - Get current distribution/delivery status
|
|
- `GET /api/v1/{parent_tenant}/orchestrator/enterprise/child-performance` - Compare performance across child outlets
|
|
|
|
## Database Schema
|
|
|
|
### Main Tables
|
|
|
|
**orchestrator_workflows**
|
|
```sql
|
|
CREATE TABLE orchestrator_workflows (
|
|
id UUID PRIMARY KEY,
|
|
workflow_name VARCHAR(255) NOT NULL UNIQUE,
|
|
workflow_type VARCHAR(100) NOT NULL, -- daily, weekly, monthly, on_demand
|
|
description TEXT,
|
|
|
|
-- Schedule
|
|
cron_expression VARCHAR(100), -- e.g., "0 8 * * *" for 8 AM daily
|
|
timezone VARCHAR(50) DEFAULT 'Europe/Madrid',
|
|
is_enabled BOOLEAN DEFAULT TRUE,
|
|
|
|
-- Execution
|
|
max_execution_time_seconds INTEGER DEFAULT 3600,
|
|
max_retries INTEGER DEFAULT 3,
|
|
retry_delay_seconds INTEGER DEFAULT 300,
|
|
|
|
-- Workflow steps
|
|
steps JSONB NOT NULL, -- Array of workflow steps
|
|
|
|
-- Status
|
|
last_execution_at TIMESTAMP,
|
|
last_success_at TIMESTAMP,
|
|
last_failure_at TIMESTAMP,
|
|
next_execution_at TIMESTAMP,
|
|
consecutive_failures INTEGER DEFAULT 0,
|
|
|
|
created_at TIMESTAMP DEFAULT NOW(),
|
|
updated_at TIMESTAMP DEFAULT NOW()
|
|
);
|
|
```
|
|
|
|
**orchestrator_executions**
|
|
```sql
|
|
CREATE TABLE orchestrator_executions (
|
|
id UUID PRIMARY KEY,
|
|
workflow_id UUID REFERENCES orchestrator_workflows(id),
|
|
workflow_name VARCHAR(255) NOT NULL,
|
|
execution_type VARCHAR(50) NOT NULL, -- scheduled, manual
|
|
triggered_by UUID, -- User ID if manual
|
|
|
|
-- Tenant
|
|
tenant_id UUID, -- NULL for global workflows
|
|
|
|
-- Status
|
|
status VARCHAR(50) DEFAULT 'pending', -- pending, running, completed, failed, cancelled
|
|
started_at TIMESTAMP,
|
|
completed_at TIMESTAMP,
|
|
duration_seconds INTEGER,
|
|
|
|
-- Results
|
|
steps_completed INTEGER DEFAULT 0,
|
|
steps_total INTEGER DEFAULT 0,
|
|
steps_failed INTEGER DEFAULT 0,
|
|
error_message TEXT,
|
|
result_summary JSONB,
|
|
|
|
-- Leader info
|
|
executed_by_instance VARCHAR(255), -- Instance ID that ran this
|
|
|
|
created_at TIMESTAMP DEFAULT NOW(),
|
|
INDEX idx_executions_workflow_date (workflow_id, created_at DESC),
|
|
INDEX idx_executions_tenant_date (tenant_id, created_at DESC)
|
|
);
|
|
```
|
|
|
|
**orchestrator_execution_logs**
|
|
```sql
|
|
CREATE TABLE orchestrator_execution_logs (
|
|
id UUID PRIMARY KEY,
|
|
execution_id UUID REFERENCES orchestrator_executions(id) ON DELETE CASCADE,
|
|
step_name VARCHAR(255) NOT NULL,
|
|
step_index INTEGER NOT NULL,
|
|
log_level VARCHAR(50) NOT NULL, -- info, warning, error
|
|
log_message TEXT NOT NULL,
|
|
log_data JSONB,
|
|
logged_at TIMESTAMP DEFAULT NOW(),
|
|
INDEX idx_execution_logs_execution (execution_id, step_index)
|
|
);
|
|
```
|
|
|
|
**orchestrator_leader**
|
|
```sql
|
|
CREATE TABLE orchestrator_leader (
|
|
id INTEGER PRIMARY KEY DEFAULT 1, -- Always 1 (singleton)
|
|
instance_id VARCHAR(255) NOT NULL,
|
|
instance_hostname VARCHAR(255),
|
|
became_leader_at TIMESTAMP NOT NULL,
|
|
last_heartbeat_at TIMESTAMP NOT NULL,
|
|
heartbeat_interval_seconds INTEGER DEFAULT 30,
|
|
CONSTRAINT single_leader CHECK (id = 1)
|
|
);
|
|
```
|
|
|
|
**orchestrator_metrics**
|
|
```sql
|
|
CREATE TABLE orchestrator_metrics (
|
|
id UUID PRIMARY KEY,
|
|
metric_date DATE NOT NULL,
|
|
workflow_name VARCHAR(255),
|
|
|
|
-- Volume
|
|
total_executions INTEGER DEFAULT 0,
|
|
successful_executions INTEGER DEFAULT 0,
|
|
failed_executions INTEGER DEFAULT 0,
|
|
|
|
-- Performance
|
|
avg_duration_seconds INTEGER,
|
|
min_duration_seconds INTEGER,
|
|
max_duration_seconds INTEGER,
|
|
|
|
-- Reliability
|
|
success_rate_percentage DECIMAL(5, 2),
|
|
avg_retry_count DECIMAL(5, 2),
|
|
|
|
calculated_at TIMESTAMP DEFAULT NOW(),
|
|
UNIQUE(metric_date, workflow_name)
|
|
);
|
|
```
|
|
|
|
### Indexes for Performance
|
|
```sql
|
|
CREATE INDEX idx_workflows_enabled ON orchestrator_workflows(is_enabled, next_execution_at);
|
|
CREATE INDEX idx_executions_status ON orchestrator_executions(status, started_at);
|
|
CREATE INDEX idx_executions_workflow_status ON orchestrator_executions(workflow_id, status);
|
|
CREATE INDEX idx_metrics_date ON orchestrator_metrics(metric_date DESC);
|
|
```
|
|
|
|
## Business Logic Examples
|
|
|
|
### Daily Workflow Orchestration
|
|
```python
|
|
async def execute_daily_workflow():
|
|
"""
|
|
Main daily workflow executed at 8:00 AM Madrid time.
|
|
Coordinates forecasting, production, and procurement.
|
|
"""
|
|
workflow_name = "daily_operations"
|
|
execution_id = uuid.uuid4()
|
|
|
|
logger.info("Starting daily workflow", execution_id=str(execution_id))
|
|
|
|
# Create execution record
|
|
execution = OrchestratorExecution(
|
|
id=execution_id,
|
|
workflow_name=workflow_name,
|
|
execution_type='scheduled',
|
|
status='running',
|
|
started_at=datetime.utcnow()
|
|
)
|
|
db.add(execution)
|
|
await db.flush()
|
|
|
|
try:
|
|
# Get all active tenants
|
|
tenants = await db.query(Tenant).filter(
|
|
Tenant.status == 'active'
|
|
).all()
|
|
|
|
execution.steps_total = len(tenants) * 5 # 5 steps per tenant
|
|
|
|
for tenant in tenants:
|
|
try:
|
|
# Step 1: Generate forecasts
|
|
await log_step(execution_id, "generate_forecasts", tenant.id, "Starting forecast generation")
|
|
forecast_result = await trigger_forecasting(tenant.id)
|
|
await log_step(execution_id, "generate_forecasts", tenant.id, f"Generated {forecast_result['count']} forecasts")
|
|
execution.steps_completed += 1
|
|
|
|
# Step 2: Calculate production needs
|
|
await log_step(execution_id, "calculate_production", tenant.id, "Calculating production needs")
|
|
production_result = await trigger_production_planning(tenant.id)
|
|
await log_step(execution_id, "calculate_production", tenant.id, f"Planned {production_result['batches']} batches")
|
|
execution.steps_completed += 1
|
|
|
|
# Step 3: Calculate procurement needs
|
|
await log_step(execution_id, "calculate_procurement", tenant.id, "Calculating procurement needs")
|
|
procurement_result = await trigger_procurement_planning(tenant.id)
|
|
await log_step(execution_id, "calculate_procurement", tenant.id, f"Identified {procurement_result['needs_count']} procurement needs")
|
|
execution.steps_completed += 1
|
|
|
|
# Step 4: Generate inventory projections
|
|
await log_step(execution_id, "project_inventory", tenant.id, "Projecting inventory")
|
|
inventory_result = await trigger_inventory_projection(tenant.id)
|
|
await log_step(execution_id, "project_inventory", tenant.id, "Inventory projections completed")
|
|
execution.steps_completed += 1
|
|
|
|
# Step 5: Send daily summary
|
|
await log_step(execution_id, "send_summary", tenant.id, "Sending daily summary")
|
|
await send_daily_summary(tenant.id, {
|
|
'forecasts': forecast_result,
|
|
'production': production_result,
|
|
'procurement': procurement_result
|
|
})
|
|
await log_step(execution_id, "send_summary", tenant.id, "Daily summary sent")
|
|
execution.steps_completed += 1
|
|
|
|
except Exception as e:
|
|
execution.steps_failed += 1
|
|
await log_step(execution_id, "tenant_workflow", tenant.id, f"Failed: {str(e)}", level='error')
|
|
logger.error("Tenant workflow failed",
|
|
tenant_id=str(tenant.id),
|
|
error=str(e))
|
|
continue
|
|
|
|
# Mark execution complete
|
|
execution.status = 'completed'
|
|
execution.completed_at = datetime.utcnow()
|
|
execution.duration_seconds = int((execution.completed_at - execution.started_at).total_seconds())
|
|
|
|
await db.commit()
|
|
|
|
logger.info("Daily workflow completed",
|
|
execution_id=str(execution_id),
|
|
tenants_processed=len(tenants),
|
|
duration_seconds=execution.duration_seconds)
|
|
|
|
# Publish event
|
|
await publish_event('orchestrator', 'orchestrator.workflow_completed', {
|
|
'workflow_name': workflow_name,
|
|
'execution_id': str(execution_id),
|
|
'tenants_processed': len(tenants),
|
|
'steps_completed': execution.steps_completed,
|
|
'steps_failed': execution.steps_failed
|
|
})
|
|
|
|
except Exception as e:
|
|
execution.status = 'failed'
|
|
execution.error_message = str(e)
|
|
execution.completed_at = datetime.utcnow()
|
|
execution.duration_seconds = int((execution.completed_at - execution.started_at).total_seconds())
|
|
|
|
await db.commit()
|
|
|
|
logger.error("Daily workflow failed",
|
|
execution_id=str(execution_id),
|
|
error=str(e))
|
|
|
|
# Send alert
|
|
await send_workflow_failure_alert(workflow_name, str(e))
|
|
|
|
raise
|
|
|
|
async def trigger_forecasting(tenant_id: UUID) -> dict:
|
|
"""
|
|
Call forecasting service to generate forecasts.
|
|
"""
|
|
async with httpx.AsyncClient() as client:
|
|
response = await client.post(
|
|
f"{FORECASTING_SERVICE_URL}/api/v1/forecasting/generate",
|
|
json={'tenant_id': str(tenant_id), 'days_ahead': 7},
|
|
timeout=300.0
|
|
)
|
|
|
|
if response.status_code != 200:
|
|
raise Exception(f"Forecasting failed: {response.text}")
|
|
|
|
return response.json()
|
|
|
|
async def trigger_production_planning(tenant_id: UUID) -> dict:
|
|
"""
|
|
Call production service to generate production schedules.
|
|
"""
|
|
async with httpx.AsyncClient() as client:
|
|
response = await client.post(
|
|
f"{PRODUCTION_SERVICE_URL}/api/v1/production/schedules/generate",
|
|
json={'tenant_id': str(tenant_id)},
|
|
timeout=180.0
|
|
)
|
|
|
|
if response.status_code != 200:
|
|
raise Exception(f"Production planning failed: {response.text}")
|
|
|
|
return response.json()
|
|
|
|
async def trigger_procurement_planning(tenant_id: UUID) -> dict:
|
|
"""
|
|
Call procurement service to calculate needs.
|
|
"""
|
|
async with httpx.AsyncClient() as client:
|
|
response = await client.post(
|
|
f"{PROCUREMENT_SERVICE_URL}/api/v1/procurement/needs/calculate",
|
|
json={'tenant_id': str(tenant_id), 'days_ahead': 14},
|
|
timeout=180.0
|
|
)
|
|
|
|
if response.status_code != 200:
|
|
raise Exception(f"Procurement planning failed: {response.text}")
|
|
|
|
return response.json()
|
|
```
|
|
|
|
### Leader Election
|
|
```python
|
|
async def start_leader_election():
|
|
"""
|
|
Participate in leader election using Redis.
|
|
Only the leader executes workflows.
|
|
"""
|
|
instance_id = f"{socket.gethostname()}_{uuid.uuid4().hex[:8]}"
|
|
|
|
while True:
|
|
try:
|
|
# Try to become leader
|
|
is_leader = await try_become_leader(instance_id)
|
|
|
|
if is_leader:
|
|
logger.info("This instance is the leader", instance_id=instance_id)
|
|
|
|
# Start workflow scheduler
|
|
await start_workflow_scheduler()
|
|
|
|
# Maintain leadership with heartbeats
|
|
while True:
|
|
await asyncio.sleep(30) # Heartbeat every 30 seconds
|
|
if not await maintain_leadership(instance_id):
|
|
logger.warning("Lost leadership", instance_id=instance_id)
|
|
break
|
|
else:
|
|
# Not leader, check again in 60 seconds
|
|
logger.info("This instance is a follower", instance_id=instance_id)
|
|
await asyncio.sleep(60)
|
|
|
|
except Exception as e:
|
|
logger.error("Leader election error",
|
|
instance_id=instance_id,
|
|
error=str(e))
|
|
await asyncio.sleep(60)
|
|
|
|
async def try_become_leader(instance_id: str) -> bool:
|
|
"""
|
|
Try to acquire leadership using Redis lock.
|
|
"""
|
|
# Try to set leader lock in Redis
|
|
lock_key = "orchestrator:leader:lock"
|
|
lock_acquired = await redis.set(
|
|
lock_key,
|
|
instance_id,
|
|
ex=90, # Expire in 90 seconds
|
|
nx=True # Only set if not exists
|
|
)
|
|
|
|
if lock_acquired:
|
|
# Record in database
|
|
leader = await db.query(OrchestratorLeader).filter(
|
|
OrchestratorLeader.id == 1
|
|
).first()
|
|
|
|
if not leader:
|
|
leader = OrchestratorLeader(
|
|
id=1,
|
|
instance_id=instance_id,
|
|
instance_hostname=socket.gethostname(),
|
|
became_leader_at=datetime.utcnow(),
|
|
last_heartbeat_at=datetime.utcnow()
|
|
)
|
|
db.add(leader)
|
|
else:
|
|
leader.instance_id = instance_id
|
|
leader.instance_hostname = socket.gethostname()
|
|
leader.became_leader_at = datetime.utcnow()
|
|
leader.last_heartbeat_at = datetime.utcnow()
|
|
|
|
await db.commit()
|
|
|
|
return True
|
|
|
|
return False
|
|
|
|
async def maintain_leadership(instance_id: str) -> bool:
|
|
"""
|
|
Maintain leadership by refreshing Redis lock.
|
|
"""
|
|
lock_key = "orchestrator:leader:lock"
|
|
|
|
# Check if we still hold the lock
|
|
current_leader = await redis.get(lock_key)
|
|
if current_leader != instance_id:
|
|
return False
|
|
|
|
# Refresh lock
|
|
await redis.expire(lock_key, 90)
|
|
|
|
# Update heartbeat
|
|
leader = await db.query(OrchestratorLeader).filter(
|
|
OrchestratorLeader.id == 1
|
|
).first()
|
|
|
|
if leader and leader.instance_id == instance_id:
|
|
leader.last_heartbeat_at = datetime.utcnow()
|
|
await db.commit()
|
|
return True
|
|
|
|
return False
|
|
```
|
|
|
|
### Workflow Scheduler
|
|
```python
|
|
async def start_workflow_scheduler():
|
|
"""
|
|
Start APScheduler to execute workflows on schedule.
|
|
"""
|
|
from apscheduler.schedulers.asyncio import AsyncIOScheduler
|
|
from apscheduler.triggers.cron import CronTrigger
|
|
|
|
scheduler = AsyncIOScheduler(timezone='Europe/Madrid')
|
|
|
|
# Get workflow configurations
|
|
workflows = await db.query(OrchestratorWorkflow).filter(
|
|
OrchestratorWorkflow.is_enabled == True
|
|
).all()
|
|
|
|
for workflow in workflows:
|
|
# Parse cron expression
|
|
trigger = CronTrigger.from_crontab(workflow.cron_expression, timezone=workflow.timezone)
|
|
|
|
# Add job to scheduler
|
|
scheduler.add_job(
|
|
execute_workflow,
|
|
trigger=trigger,
|
|
args=[workflow.id],
|
|
id=str(workflow.id),
|
|
name=workflow.workflow_name,
|
|
max_instances=1, # Prevent concurrent executions
|
|
replace_existing=True
|
|
)
|
|
|
|
logger.info("Scheduled workflow",
|
|
workflow_name=workflow.workflow_name,
|
|
cron=workflow.cron_expression)
|
|
|
|
# Start scheduler
|
|
scheduler.start()
|
|
logger.info("Workflow scheduler started")
|
|
|
|
# Keep scheduler running
|
|
while True:
|
|
await asyncio.sleep(3600) # Check every hour
|
|
```
|
|
|
|
## Events & Messaging
|
|
|
|
### Published Events (RabbitMQ)
|
|
|
|
**Exchange**: `orchestrator`
|
|
**Routing Keys**: `orchestrator.workflow_completed`, `orchestrator.workflow_failed`
|
|
|
|
**Workflow Completed Event**
|
|
```json
|
|
{
|
|
"event_type": "orchestrator_workflow_completed",
|
|
"workflow_name": "daily_operations",
|
|
"execution_id": "uuid",
|
|
"tenants_processed": 125,
|
|
"steps_completed": 625,
|
|
"steps_failed": 3,
|
|
"duration_seconds": 1820,
|
|
"timestamp": "2025-11-06T08:30:20Z"
|
|
}
|
|
```
|
|
|
|
**Workflow Failed Event**
|
|
```json
|
|
{
|
|
"event_type": "orchestrator_workflow_failed",
|
|
"workflow_name": "daily_operations",
|
|
"execution_id": "uuid",
|
|
"error_message": "Database connection timeout",
|
|
"tenants_affected": 45,
|
|
"timestamp": "2025-11-06T08:15:30Z"
|
|
}
|
|
```
|
|
|
|
### Consumed Events
|
|
None - Orchestrator initiates workflows but doesn't consume events
|
|
|
|
## Custom Metrics (Prometheus)
|
|
|
|
```python
|
|
# Workflow metrics
|
|
workflow_executions_total = Counter(
|
|
'orchestrator_workflow_executions_total',
|
|
'Total workflow executions',
|
|
['workflow_name', 'status']
|
|
)
|
|
|
|
workflow_duration_seconds = Histogram(
|
|
'orchestrator_workflow_duration_seconds',
|
|
'Workflow execution duration',
|
|
['workflow_name'],
|
|
buckets=[60, 300, 600, 1200, 1800, 3600]
|
|
)
|
|
|
|
workflow_success_rate = Gauge(
|
|
'orchestrator_workflow_success_rate_percentage',
|
|
'Workflow success rate',
|
|
['workflow_name']
|
|
)
|
|
|
|
tenants_processed_total = Counter(
|
|
'orchestrator_tenants_processed_total',
|
|
'Total tenants processed',
|
|
['workflow_name', 'status']
|
|
)
|
|
|
|
leader_instance = Gauge(
|
|
'orchestrator_leader_instance',
|
|
'Current leader instance (1=leader, 0=follower)',
|
|
['instance_id']
|
|
)
|
|
```
|
|
|
|
## Configuration
|
|
|
|
### Environment Variables
|
|
|
|
**Service Configuration:**
|
|
- `PORT` - Service port (default: 8018)
|
|
- `DATABASE_URL` - PostgreSQL connection string
|
|
- `REDIS_URL` - Redis connection string
|
|
- `RABBITMQ_URL` - RabbitMQ connection string
|
|
|
|
**Workflow Configuration:**
|
|
- `DAILY_WORKFLOW_CRON` - Daily workflow schedule (default: "0 8 * * *")
|
|
- `WEEKLY_WORKFLOW_CRON` - Weekly workflow schedule (default: "0 9 * * 1")
|
|
- `DEFAULT_TIMEZONE` - Default timezone (default: "Europe/Madrid")
|
|
- `MAX_WORKFLOW_DURATION_SECONDS` - Max execution time (default: 3600)
|
|
|
|
**Leader Election:**
|
|
- `ENABLE_LEADER_ELECTION` - Enable HA mode (default: true)
|
|
- `LEADER_HEARTBEAT_SECONDS` - Heartbeat interval (default: 30)
|
|
- `LEADER_LOCK_TTL_SECONDS` - Lock expiration (default: 90)
|
|
|
|
**Service URLs:**
|
|
- `FORECASTING_SERVICE_URL` - Forecasting service URL
|
|
- `PRODUCTION_SERVICE_URL` - Production service URL
|
|
- `PROCUREMENT_SERVICE_URL` - Procurement service URL
|
|
- `INVENTORY_SERVICE_URL` - Inventory service URL
|
|
|
|
## Development Setup
|
|
|
|
### Prerequisites
|
|
- Python 3.11+
|
|
- PostgreSQL 17
|
|
- Redis 7.4
|
|
- RabbitMQ 4.1
|
|
|
|
### Local Development
|
|
```bash
|
|
cd services/orchestrator
|
|
python -m venv venv
|
|
source venv/bin/activate
|
|
|
|
pip install -r requirements.txt
|
|
|
|
export DATABASE_URL=postgresql://user:pass@localhost:5432/orchestrator
|
|
export REDIS_URL=redis://localhost:6379/0
|
|
export RABBITMQ_URL=amqp://guest:guest@localhost:5672/
|
|
export FORECASTING_SERVICE_URL=http://localhost:8003
|
|
export PRODUCTION_SERVICE_URL=http://localhost:8007
|
|
|
|
alembic upgrade head
|
|
python main.py
|
|
```
|
|
|
|
## Integration Points
|
|
|
|
### Dependencies
|
|
- **All Services** - Calls service APIs to execute workflows
|
|
- **🆕 Tenant Service** (NEW) - Fetch tenant hierarchy for enterprise dashboards
|
|
- **🆕 Forecasting Service** (NEW) - Fetch network-aggregated demand forecasts
|
|
- **🆕 Distribution Service** (NEW) - Fetch active delivery routes and shipment status
|
|
- **🆕 Production Service** (NEW) - Fetch production metrics across network
|
|
- **Redis** - Leader election and caching
|
|
- **PostgreSQL** - Workflow history
|
|
- **RabbitMQ** - Event publishing
|
|
|
|
### Dependents
|
|
- **All Services** - Benefit from automated workflows
|
|
- **Monitoring** - Tracks workflow execution
|
|
- **🆕 Frontend Enterprise Dashboard** (NEW) - Displays aggregated network metrics for parent tenants
|
|
|
|
## Business Value for VUE Madrid
|
|
|
|
### Problem Statement
|
|
Manual daily operations don't scale:
|
|
- Staff forget to generate forecasts daily
|
|
- Production planning done inconsistently
|
|
- Procurement needs identified too late
|
|
- Reports generated manually
|
|
- No weekend/holiday coverage
|
|
- Human error in execution
|
|
|
|
### Solution
|
|
Bakery-IA Orchestrator provides:
|
|
- **Fully Automated**: 95%+ operations automated
|
|
- **Consistent Execution**: 100% vs. 70-80% manual
|
|
- **Early Morning Ready**: Data ready before business opens
|
|
- **365-Day Coverage**: Works weekends and holidays
|
|
- **Error Recovery**: Automatic retries
|
|
- **Scalable**: Handles 10,000+ tenants
|
|
|
|
### Quantifiable Impact
|
|
|
|
**Time Savings:**
|
|
- 15-20 hours/week per bakery on manual planning
|
|
- €900-1,200/month labor cost savings per bakery
|
|
- 100% consistency vs. 70-80% manual execution
|
|
|
|
**Operational Excellence:**
|
|
- 99.9%+ workflow success rate
|
|
- Issues identified before business hours
|
|
- Zero forgotten forecasts or plans
|
|
- Predictable daily operations
|
|
|
|
**Platform Scalability:**
|
|
- Linear cost scaling with tenants
|
|
- 10,000+ tenant capacity with one orchestrator
|
|
- €0.01-0.05 per tenant per day computational cost
|
|
- High availability with leader election
|
|
|
|
### ROI for Platform
|
|
**Investment**: €50-200/month (compute + infrastructure)
|
|
**Value Delivered**: €900-1,200/month per tenant
|
|
**Platform Scale**: €90,000-120,000/month at 100 tenants
|
|
**Cost Ratio**: <1% of value delivered
|
|
|
|
---
|
|
|
|
## 🆕 Forecast Validation Integration (NEW)
|
|
|
|
### Overview
|
|
The orchestrator now integrates with the Forecasting Service's validation system to automatically validate forecast accuracy and trigger model improvements.
|
|
|
|
### Daily Workflow Integration
|
|
|
|
The daily workflow now includes a **Step 5: Validate Previous Forecasts** after generating new forecasts:
|
|
|
|
```python
|
|
# Step 5: Validate previous day's forecasts
|
|
await log_step(execution_id, "validate_forecasts", tenant.id, "Validating forecasts")
|
|
validation_result = await forecast_client.validate_forecasts(
|
|
tenant_id=tenant.id,
|
|
orchestration_run_id=execution_id
|
|
)
|
|
await log_step(
|
|
execution_id,
|
|
"validate_forecasts",
|
|
tenant.id,
|
|
f"Validation complete: MAPE={validation_result.get('overall_mape', 'N/A')}%"
|
|
)
|
|
execution.steps_completed += 1
|
|
```
|
|
|
|
### What Gets Validated
|
|
|
|
Every morning at 8:00 AM, the orchestrator:
|
|
|
|
1. **Generates today's forecasts** (Steps 1-4)
|
|
2. **Validates yesterday's forecasts** (Step 5) by:
|
|
- Fetching yesterday's forecast predictions
|
|
- Fetching yesterday's actual sales from Sales Service
|
|
- Calculating accuracy metrics (MAE, MAPE, RMSE, R², Accuracy %)
|
|
- Storing validation results in `validation_runs` table
|
|
- Identifying poor-performing products/locations
|
|
|
|
### Benefits
|
|
|
|
**For Bakery Owners:**
|
|
- **Daily Accuracy Tracking**: See how accurate yesterday's forecast was
|
|
- **Product-Level Insights**: Know which products have reliable forecasts
|
|
- **Continuous Improvement**: Models automatically retrain when accuracy drops
|
|
- **Trust & Confidence**: Validated accuracy metrics build trust in forecasts
|
|
|
|
**For Platform Operations:**
|
|
- **Automated Quality Control**: No manual validation needed
|
|
- **Early Problem Detection**: Performance degradation identified within 24 hours
|
|
- **Model Health Monitoring**: Track accuracy trends over time
|
|
- **Automatic Retraining**: Models improve automatically when needed
|
|
|
|
### Validation Metrics
|
|
|
|
Each validation run tracks:
|
|
- **Overall Metrics**: MAPE, MAE, RMSE, R², Accuracy %
|
|
- **Coverage**: % of forecasts with actual sales data
|
|
- **Product Performance**: Top/bottom performers by MAPE
|
|
- **Location Performance**: Accuracy by location/POS
|
|
- **Trend Analysis**: Week-over-week accuracy changes
|
|
|
|
### Historical Data Handling
|
|
|
|
When late sales data arrives (e.g., from CSV imports or delayed POS sync):
|
|
- **Webhook Integration**: Sales Service notifies Forecasting Service
|
|
- **Gap Detection**: System identifies dates with forecasts but no validation
|
|
- **Automatic Backfill**: Validates historical forecasts retroactively
|
|
- **Complete Coverage**: Ensures 100% of forecasts eventually get validated
|
|
|
|
### Performance Monitoring & Retraining
|
|
|
|
**Weekly Evaluation** (runs Sunday night):
|
|
```python
|
|
# Analyze 30-day performance
|
|
await retraining_service.evaluate_and_trigger_retraining(
|
|
tenant_id=tenant.id,
|
|
auto_trigger=True # Automatically retrain poor performers
|
|
)
|
|
```
|
|
|
|
**Retraining Triggers:**
|
|
- MAPE > 30% (critical threshold)
|
|
- MAPE increased > 5% in 30 days
|
|
- Model age > 30 days
|
|
- Manual trigger via API
|
|
|
|
**Automatic Actions:**
|
|
- Identifies products with MAPE > 30%
|
|
- Triggers retraining via Training Service
|
|
- Tracks retraining job status
|
|
- Validates improved accuracy after retraining
|
|
|
|
### Integration Flow
|
|
|
|
```
|
|
Daily Orchestrator (8:00 AM)
|
|
↓
|
|
Step 1-4: Generate forecasts, production, procurement
|
|
↓
|
|
Step 5: Validate yesterday's forecasts
|
|
↓
|
|
Forecasting Service validates vs Sales Service
|
|
↓
|
|
Store validation results in validation_runs table
|
|
↓
|
|
If poor performance detected → Queue for retraining
|
|
↓
|
|
Weekly Retraining Job (Sunday night)
|
|
↓
|
|
Trigger Training Service for poor performers
|
|
↓
|
|
Models improve over time
|
|
```
|
|
|
|
### Expected Results
|
|
|
|
**After 1 month:**
|
|
- 100% validation coverage (all forecasts validated)
|
|
- Baseline accuracy metrics established
|
|
- Poor performers identified for retraining
|
|
|
|
**After 3 months:**
|
|
- 10-15% accuracy improvement from automatic retraining
|
|
- Reduced MAPE from 25% → 15% average
|
|
- Better inventory decisions from trusted forecasts
|
|
- Reduced waste from more accurate predictions
|
|
|
|
**After 6 months:**
|
|
- Continuous model improvement cycle established
|
|
- Optimal accuracy for each product category
|
|
- Predictable performance metrics
|
|
- Trust in forecast-driven decisions
|
|
|
|
### Monitoring Dashboard Additions
|
|
|
|
New metrics available for dashboards:
|
|
|
|
1. **Validation Status Card**
|
|
- Last validation: timestamp, status
|
|
- Overall MAPE: % with trend arrow
|
|
- Validation coverage: %
|
|
- Health status: healthy/warning/critical
|
|
|
|
2. **Accuracy Trends Graph**
|
|
- 30-day MAPE trend line
|
|
- Target threshold lines (20%, 30%)
|
|
- Product performance distribution
|
|
|
|
3. **Retraining Activity**
|
|
- Models retrained this week
|
|
- Retraining success rate
|
|
- Products pending retraining
|
|
- Next scheduled retraining
|
|
|
|
---
|
|
|
|
**Copyright © 2025 Bakery-IA. All rights reserved.**
|