Alert System Architecture
Last Updated: 2025-11-25
Status: Production-Ready
Version: 2.0
Table of Contents
- Overview
- Event Flow & Lifecycle
- Three-Tier Enrichment Strategy
- Enrichment Process
- Priority Scoring Algorithm
- Alert Types & Classification
- Smart Actions & User Agency
- Alert Lifecycle & State Transitions
- Escalation System
- Alert Chaining & Deduplication
- Cronjob Integration
- Service Integration Patterns
- Frontend Integration
- Redis Pub/Sub Architecture
- Database Schema
- Performance & Monitoring
1. Overview
1.1 Philosophy
The Bakery-IA alert system transforms passive notifications into context-aware, actionable guidance. Every alert includes enrichment context, priority scoring, and suggested actions, enabling users to make informed decisions quickly.
Core Principles:
- Alerts are not just notifications - They're AI-enhanced action items
- Context over noise - Every alert includes business impact and suggested actions
- Smart prioritization - Multi-factor scoring ensures critical issues surface first
- Progressive enhancement - Different event types get appropriate enrichment levels
- User agency - System respects what users can actually control
1.2 Architecture Goals
- ✅ Performance: 80% faster notification processing, 70% less SSE traffic
- ✅ Type Safety: Complete TypeScript definitions matching backend
- ✅ Developer Experience: 18 specialized React hooks for different use cases
- ✅ Production Ready: Backward compatible, fully documented, deployment-ready
2. Event Flow & Lifecycle
2.1 Event Generation
Services detect issues via three patterns:
Scheduled Background Jobs
- Inventory service: Stock checks every 5-15 minutes
- Production service: Capacity checks every 10-45 minutes
- Forecasting service: Demand analysis (Friday 3 PM weekly)
Event-Driven
- RabbitMQ subscriptions to business events
- Example: Order created → Check stock availability → Emit low stock alert
Database Triggers
- Direct PostgreSQL notifications for critical state changes
- Example: Stock quantity falls below threshold → Immediate alert
2.2 Alert Publishing Flow
Service detects issue
↓
Validates against RawAlert schema (title, message, type, severity, metadata)
↓
Generates deduplication key (type + entity IDs)
↓
Checks Redis (prevent duplicates within 15-minute window)
↓
Publishes to RabbitMQ (alerts.exchange with routing key)
↓
Alert Processor consumes message
↓
Conditional enrichment based on event type
↓
Stores in PostgreSQL
↓
Publishes to Redis (domain-based channels)
↓
Gateway streams via SSE
↓
Frontend hooks receive and display
2.3 Complete Event Flow Diagram
Domain Service → RabbitMQ → Alert Processor → PostgreSQL → Redis → Gateway → Frontend

At the Alert Processor (conditional enrichment):
- Alert: Full (500-800ms)
- Notification: Fast (20-30ms)
- Recommendation: Medium (50-80ms)

At the Gateway (SSE stream):
- Domain filtered
- Wildcard support
- Real-time updates
3. Three-Tier Enrichment Strategy
3.1 Tier 1: ALERTS (Full Enrichment)
When: Critical business events requiring user decisions
Enrichment Pipeline (7 steps):
- Orchestrator Context Query
- Business Impact Analysis
- Urgency Assessment
- User Agency Evaluation
- Multi-Factor Priority Scoring
- Timing Intelligence
- Smart Action Generation
Processing Time: 500-800ms
Database: Full alert record with all enrichment fields
TTL: Indefinite (until resolved)
Examples:
- Low stock warning requiring PO approval
- Production delay affecting customer orders
- Equipment failure needing immediate attention
3.2 Tier 2: NOTIFICATIONS (Lightweight)
When: Informational state changes
Enrichment:
- Format title/message
- Set placement hint
- Assign domain
- No priority scoring
- No orchestrator queries
Processing Time: 20-30ms (80% faster than alerts)
Database: Minimal notification record
TTL: 7 days (automatic cleanup)
Examples:
- Stock received confirmation
- Batch completed notification
- PO sent to supplier
3.3 Tier 3: RECOMMENDATIONS (Moderate)
When: AI suggestions for optimization
Enrichment:
- Light priority scoring (info level by default)
- Confidence assessment
- Estimated impact calculation
- No orchestrator context
- Dismissible by users
Processing Time: 50-80ms
Database: Recommendation record with impact fields
TTL: 30 days or until dismissed
Examples:
- Demand surge prediction
- Inventory optimization suggestion
- Cost reduction opportunity
3.4 Performance Comparison
| Event Class | Old | New | Improvement |
|---|---|---|---|
| Alert | 200-300ms | 500-800ms | Baseline (more enrichment) |
| Notification | 200-300ms | 20-30ms | 80% faster |
| Recommendation | 200-300ms | 50-80ms | 60% faster |
Overall: 54% average improvement due to selective enrichment
4. Enrichment Process
4.1 Orchestrator Context Enrichment
Purpose: Determine if AI has already addressed the alert
Service: orchestrator_client.py
Query: Daily Orchestrator microservice for related actions
Questions Answered:
- Has AI already created a purchase order for this low stock?
- What's the PO ID and current status?
- When will the delivery arrive?
- What's the estimated cost savings?
Response Fields:
{
    "already_addressed": bool,
    "action_type": "purchase_order" | "production_batch" | "schedule_adjustment",
    "action_id": str,  # e.g., "PO-12345"
    "action_status": "pending_approval" | "approved" | "in_progress",
    "delivery_date": datetime,
    "estimated_savings_eur": Decimal
}
Caching: Results cached to avoid redundant queries
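The cached lookup can be sketched as follows. This is a minimal illustration: the cache TTL, key shape, and `fetch` callback are assumptions, not the real `orchestrator_client.py` API.

```python
import time
from typing import Any, Callable, Dict, Tuple

# In-process cache of orchestrator answers, keyed by tenant + entity.
# TTL and key format are assumptions for illustration.
_CACHE: Dict[str, Tuple[float, Dict[str, Any]]] = {}
CACHE_TTL_SECONDS = 300  # assumed 5-minute freshness window


def get_orchestrator_context(
    tenant_id: str,
    entity_id: str,
    fetch: Callable[[str, str], Dict[str, Any]],
) -> Dict[str, Any]:
    """Return a cached context when fresh; otherwise call `fetch` and cache it."""
    key = f"{tenant_id}:{entity_id}"
    hit = _CACHE.get(key)
    now = time.monotonic()
    if hit is not None and now - hit[0] < CACHE_TTL_SECONDS:
        return hit[1]  # cache hit: skip the redundant query
    context = fetch(tenant_id, entity_id)  # e.g. an HTTP call to the service
    _CACHE[key] = (now, context)
    return context
```

A second call within the TTL returns the cached answer without hitting the orchestrator again.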
4.2 Business Impact Analysis
Service: context_enrichment.py
Dimensions Analyzed:
Financial Impact
financial_impact_eur: Decimal
# Calculation examples:
# - Low stock: lost_sales = out_of_stock_days × avg_daily_revenue_per_product
# - Production delay: penalty_fees + rush_order_costs
# - Equipment failure: repair_cost + lost_production_value
Customer Impact
affected_customers: List[str] # Customer names
affected_orders: int # Count of at-risk orders
customer_satisfaction_impact: "low" | "medium" | "high"
# Based on order priority, customer tier, delay duration
Operational Impact
production_batches_at_risk: List[str] # Batch IDs
waste_risk_kg: Decimal # Spoilage or overproduction
equipment_downtime_hours: Decimal
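The calculation examples in the comments above can be written as small helpers. Function names and inputs are illustrative, not the real `context_enrichment.py` API:

```python
from decimal import Decimal


def low_stock_impact(out_of_stock_days, avg_daily_revenue) -> Decimal:
    # Lost sales = days out of stock × average daily revenue for the product
    return Decimal(out_of_stock_days) * Decimal(avg_daily_revenue)


def production_delay_impact(penalty_fees, rush_order_costs) -> Decimal:
    # Delay cost = contractual penalties + cost of rush replacement orders
    return Decimal(penalty_fees) + Decimal(rush_order_costs)


def equipment_failure_impact(repair_cost, lost_production_value) -> Decimal:
    # Failure cost = repair bill + value of production lost during downtime
    return Decimal(repair_cost) + Decimal(lost_production_value)
```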
4.3 Urgency Context
Fields:
deadline: datetime # When consequences occur
time_until_consequence_hours: Decimal # Countdown
can_wait_until_tomorrow: bool # For overnight batch processing
auto_action_countdown_seconds: int # For escalation alerts
Urgency Scoring:
- >48h until consequence: Low urgency (20 points)
- 24-48h: Medium urgency (50 points)
- 6-24h: High urgency (80 points)
- <6h: Critical urgency (100 points)
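The bands above can be sketched as a single mapping function; how the exact 24h and 48h boundaries are assigned is an assumption:

```python
def urgency_score(hours_until_consequence: float) -> int:
    """Map time-to-consequence onto the urgency bands from the text."""
    if hours_until_consequence < 6:
        return 100  # critical urgency
    if hours_until_consequence < 24:
        return 80   # high urgency
    if hours_until_consequence <= 48:
        return 50   # medium urgency
    return 20       # low urgency
```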
4.4 User Agency Assessment
Purpose: Determine what user can actually do
Fields:
can_user_fix: bool # Can user resolve this directly?
requires_external_party: bool # Need supplier/customer action?
external_party_name: str # "Supplier Inc."
external_party_contact: str # "+34-123-456-789"
blockers: List[str] # What prevents immediate action
User Agency Scoring:
- Can fix directly: 80 points
- Requires external party: 50 points
- Has blockers: -30 penalty
- No control: 20 points
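Combining the base scores here with the modifiers listed in Section 5.4 gives a sketch like the following; the clamping to 0-100 and the way modifiers stack are assumptions:

```python
from typing import List, Optional


def user_agency_score(
    can_user_fix: bool,
    requires_external_party: bool,
    has_external_contact: bool = False,
    blockers: Optional[List[str]] = None,
) -> int:
    """Base score from agency level, then apply the documented modifiers."""
    if can_user_fix:
        score = 80       # user can resolve directly
    elif requires_external_party:
        score = 50       # needs supplier/customer coordination
    else:
        score = 20       # no control
    if has_external_contact:
        score += 20      # contact on file makes action easier
    if requires_external_party:
        score -= 20      # waiting on a supplier slows resolution
    if blockers:
        score -= 30      # known blockers penalty
    return max(0, min(100, score))
```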
4.5 Trend Context (for trend_warning alerts)
Fields:
metric_name: str # "weekend_demand"
current_value: Decimal # 450
baseline_value: Decimal # 300
change_percentage: Decimal # 50
direction: "increasing" | "decreasing" | "volatile"
significance: "low" | "medium" | "high"
period_days: int # 7
possible_causes: List[str] # ["Holiday weekend", "Promotion"]
4.6 Timing Intelligence
Service: timing_intelligence.py
Delivery Method Decisions:
def decide_timing(alert):
    priority = alert.priority_score
    if priority >= 90:  # Critical
        return "SEND_NOW"  # Immediate push notification
    if is_business_hours() and priority >= 70:
        return "SEND_NOW"  # Important during work hours
    if is_night_hours() and priority < 90:
        return "SCHEDULE_LATER"  # Queue for 8 AM
    if priority < 50:
        return "BATCH_FOR_DIGEST"  # Daily summary email
    return "SEND_NOW"  # Default for remaining daytime cases
Considerations:
- Priority level
- Business hours (8 AM - 8 PM)
- User preferences (digest settings)
- Alert type (action_needed vs informational)
4.7 Smart Actions Generation
Service: context_enrichment.py
Action Structure:
{
  label: string,                  // "Approve Purchase Order"
  type: SmartActionType,          // approve_po
  variant: "primary" | "secondary" | "tertiary",
  metadata: object,               // Context for action handler
  disabled: boolean,              // Based on user permissions/state
  estimated_time_minutes: number, // How long action takes
  consequence: string             // "Order will be placed immediately"
}
Action Examples by Alert Type:
Low Stock Alert:
[
  {
    label: "Approve Purchase Order",
    type: "approve_po",
    variant: "primary",
    metadata: { po_id: "PO-12345", amount: 1500.00 }
  },
  {
    label: "Contact Supplier",
    type: "call_supplier",
    variant: "secondary",
    metadata: { supplier_contact: "+34-123-456-789" }
  }
]
Production Delay Alert:
[
  {
    label: "Adjust Schedule",
    type: "reschedule_production",
    variant: "primary",
    metadata: { batch_id: "BATCH-001", delay_minutes: 30 }
  },
  {
    label: "Notify Customer",
    type: "send_notification",
    variant: "secondary",
    metadata: { customer_id: "CUST-456" }
  }
]
5. Priority Scoring Algorithm
5.1 Multi-Factor Weighted Scoring
Formula:
Priority Score (0-100) =
    (Business_Impact × 0.40) +
    (Urgency × 0.30) +
    (User_Agency × 0.20) +
    (Confidence × 0.10)
5.2 Business Impact Score (40% weight)
Financial Impact:
- ≤€50: 20 points
- €50-200: 40 points
- €200-500: 60 points
- >€500: 100 points
Customer Impact:
- 1 affected customer: 30 points
- 2-5 customers: 50 points
- 5+ customers: 100 points
Operational Impact:
- 1 order at risk: 30 points
- 2-10 orders: 60 points
- 10+ orders: 100 points
Weighted Average:
business_impact_score = (
    financial_score * 0.5 +
    customer_score * 0.3 +
    operational_score * 0.2
)
5.3 Urgency Score (30% weight)
Time Until Consequence:
- >48 hours: 20 points
- 24-48 hours: 50 points
- 6-24 hours: 80 points
- <6 hours: 100 points
Deadline Approaching Bonus:
- Within 24h of deadline: +30 points
- Within 6h of deadline: +50 points (capped at 100)
5.4 User Agency Score (20% weight)
Base Score:
- Can user fix directly: 80 points
- Requires coordination: 50 points
- No control: 20 points
Modifiers:
- Has external party contact: +20 bonus
- Requires supplier action: -20 penalty
- Has known blockers: -30 penalty
5.5 Confidence Score (10% weight)
Data Quality Assessment:
- High confidence (complete data): 100 points
- Medium confidence (some assumptions): 70 points
- Low confidence (many unknowns): 40 points
5.6 Priority Levels
Mapping:
- CRITICAL (90-100): Immediate action required, high business impact
- IMPORTANT (70-89): Action needed today, moderate impact
- STANDARD (50-69): Action recommended this week
- INFO (0-49): Informational, no urgency
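Putting the weights from 5.1 and the level thresholds above together, a minimal sketch (function names are illustrative):

```python
def priority_score(
    business_impact: float,
    urgency: float,
    user_agency: float,
    confidence: float,
) -> float:
    """Weighted sum per the multi-factor formula; each input is 0-100."""
    return (
        business_impact * 0.40
        + urgency * 0.30
        + user_agency * 0.20
        + confidence * 0.10
    )


def priority_level(score: float) -> str:
    """Map a 0-100 score onto the documented priority levels."""
    if score >= 90:
        return "critical"
    if score >= 70:
        return "important"
    if score >= 50:
        return "standard"
    return "info"
```

For example, a high-impact alert scoring (100, 100, 80, 70) on the four factors lands at 93 and is classified critical.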
6. Alert Types & Classification
6.1 Alert Type Classes
ACTION_NEEDED (~70% of alerts):
- User decision required
- Appears in action queue
- Has deadline
- Examples: Low stock, pending PO approval, equipment failure
PREVENTED_ISSUE (~10% of alerts):
- AI already handled the problem
- Positive framing: "I prevented X by doing Y"
- User awareness only, no action needed
- Examples: "Stock shortage prevented by auto-PO"
TREND_WARNING (~15% of alerts):
- Proactive insight about emerging patterns
- Gives user time to prepare
- May become action_needed if ignored
- Examples: "Demand trending up 35% this week"
ESCALATION (~3% of alerts):
- Time-sensitive with auto-action countdown
- System will act automatically if user doesn't
- Countdown timer shown prominently
- Examples: "Critical stock, auto-ordering in 2 hours"
INFORMATION (~2% of alerts):
- FYI only, no action expected
- Low priority
- Often batched for digest emails
- Examples: "Production batch completed"
6.2 Event Domains
- inventory: Stock levels, expiration, movements
- production: Batches, capacity, equipment
- procurement: Purchase orders, deliveries, suppliers
- forecasting: Demand predictions, trends
- orders: Customer orders, fulfillment
- orchestrator: AI-driven automation actions
- delivery: Delivery tracking, receipt
- sales: Sales analytics, patterns
6.3 Alert Type Catalog (40+ types)
Inventory Domain
critical_stock_shortage (action_needed, critical)
low_stock_warning (action_needed, important)
expired_products (action_needed, critical)
stock_depleted_by_order (information, standard)
stock_received (notification, info)
stock_movement (notification, info)
Production Domain
production_delay (action_needed, important)
equipment_failure (action_needed, critical)
capacity_overload (action_needed, important)
quality_control_failure (action_needed, critical)
batch_state_changed (notification, info)
batch_completed (notification, info)
Procurement Domain
po_approval_needed (action_needed, important)
po_approval_escalation (escalation, critical)
delivery_overdue (action_needed, critical)
po_approved (notification, info)
po_sent (notification, info)
delivery_scheduled (notification, info)
delivery_received (notification, info)
Delivery Tracking
delivery_scheduled (information, info)
delivery_arriving_soon (action_needed, important)
delivery_overdue (action_needed, critical)
stock_receipt_incomplete (action_needed, important)
Forecasting Domain
demand_surge_predicted (trend_warning, important)
weekend_demand_surge (trend_warning, standard)
weather_impact_forecast (trend_warning, standard)
holiday_preparation (trend_warning, important)
Operations Domain
orchestration_run_started (notification, info)
orchestration_run_completed (notification, info)
action_created (notification, info)
6.4 Placement Hints
Where alerts appear:
- ACTION_QUEUE: Dashboard action section (action_needed)
- NOTIFICATION_PANEL: Bell icon dropdown (notifications)
- DASHBOARD_INLINE: Embedded in relevant page section
- TOAST: Immediate popup (critical alerts)
- EMAIL_DIGEST: End-of-day summary email
7. Smart Actions & User Agency
7.1 Action Types
Complete Enumeration:
class SmartActionType(str, Enum):
    # Procurement
    APPROVE_PO = "approve_po"
    REJECT_PO = "reject_po"
    MODIFY_PO = "modify_po"
    CALL_SUPPLIER = "call_supplier"
    # Production
    START_PRODUCTION_BATCH = "start_production_batch"
    RESCHEDULE_PRODUCTION = "reschedule_production"
    HALT_PRODUCTION = "halt_production"
    # Inventory
    MARK_DELIVERY_RECEIVED = "mark_delivery_received"
    COMPLETE_STOCK_RECEIPT = "complete_stock_receipt"
    ADJUST_STOCK_MANUALLY = "adjust_stock_manually"
    # Customer Service
    NOTIFY_CUSTOMER = "notify_customer"
    CANCEL_ORDER = "cancel_order"
    ADJUST_DELIVERY_DATE = "adjust_delivery_date"
    # System
    SNOOZE_ALERT = "snooze_alert"
    DISMISS_ALERT = "dismiss_alert"
    ESCALATE_TO_MANAGER = "escalate_to_manager"
7.2 Action Lifecycle
1. Generation (enrichment stage):
- Service context: What's possible in this situation?
- User agency: Can user execute this action?
- Permissions: Does user have required role?
- Conditional rendering: Disable if prerequisites not met
2. Display (frontend):
- Primary action highlighted (most recommended)
- Secondary actions offered (alternatives)
- Disabled actions shown with reason tooltip
- Consequence preview on hover
3. Execution (API call):
- Handler routes by action type
- Executes business logic (PO approval, schedule change, etc.)
- Creates audit trail
- Emits follow-up events/notifications
- May create new alerts
4. Escalation (if unacted):
- 24h: Alert priority boosted
- 48h: Type changed to escalation
- 72h: Priority boosted further, countdown timer shown
- System may auto-execute if configured
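Step 3's "handler routes by action type" can be sketched as a registry; the decorator and handler bodies below are hypothetical, not the real API:

```python
from typing import Callable, Dict

# Registry mapping action type strings to handler callables.
ACTION_HANDLERS: Dict[str, Callable[[dict], str]] = {}


def register(action_type: str):
    """Decorator that registers a handler for one smart action type."""
    def wrap(fn: Callable[[dict], str]) -> Callable[[dict], str]:
        ACTION_HANDLERS[action_type] = fn
        return fn
    return wrap


@register("approve_po")
def approve_po(metadata: dict) -> str:
    # A real handler would call the procurement API, write an audit
    # record, and emit follow-up events; this just echoes the PO id.
    return f"approved {metadata['po_id']}"


def execute_action(action_type: str, metadata: dict) -> str:
    """Route an incoming action request to its registered handler."""
    handler = ACTION_HANDLERS.get(action_type)
    if handler is None:
        raise ValueError(f"unknown action type: {action_type}")
    return handler(metadata)
```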
7.3 Consequence Preview
Purpose: Build trust by showing impact before action
Example:
{
  action: "approve_po",
  consequence: {
    immediate: "Order will be sent to supplier within 5 minutes",
    timing: "Delivery expected in 2-3 business days",
    cost: "€1,250.00 will be added to monthly expenses",
    impact: "Resolves low stock for 3 ingredients affecting 8 orders"
  }
}
Display:
- Shown on hover or in confirmation modal
- Highlights positive outcomes (orders fulfilled)
- Notes financial impact (€ amount)
- Clarifies timing (when effect occurs)
8. Alert Lifecycle & State Transitions
8.1 Alert States
Created → Active
↓
├─→ Acknowledged (user saw it)
├─→ In Progress (user taking action)
├─→ Resolved (action completed)
├─→ Dismissed (user chose to ignore)
└─→ Snoozed (remind me later)
8.2 State Transitions
Created → Active:
- Automatic on creation
- Appears in relevant UI sections based on placement hints
Active → Acknowledged:
- User clicks alert or views action queue
- Tracked for analytics (response time)
Acknowledged → In Progress:
- User starts working on resolution
- May set estimated completion time
In Progress → Resolved:
- Smart action executed successfully
- Or user manually marks as resolved
- resolved_at timestamp set
Active → Dismissed:
- User chooses not to act
- May require dismissal reason (for audit)
Active → Snoozed:
- User requests reminder later (e.g., in 1 hour, tomorrow morning)
- Returns to Active at scheduled time
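The transitions above imply a transition table the backend can validate against. A minimal sketch; the exact set of allowed edges is an assumption read off the diagram:

```python
from typing import Dict, Set

# Allowed state transitions, derived from the lifecycle diagram.
ALLOWED_TRANSITIONS: Dict[str, Set[str]] = {
    "active": {"acknowledged", "in_progress", "resolved", "dismissed", "snoozed"},
    "acknowledged": {"in_progress", "resolved", "dismissed", "snoozed"},
    "in_progress": {"resolved"},
    "snoozed": {"active"},  # snooze timer returns the alert to Active
    # "resolved" and "dismissed" are terminal states
}


def can_transition(current: str, target: str) -> bool:
    """True when the lifecycle permits moving from `current` to `target`."""
    return target in ALLOWED_TRANSITIONS.get(current, set())
```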
8.3 Key Fields
Lifecycle Tracking:
status: AlertStatus # Current state
created_at: datetime # When alert was created
acknowledged_at: datetime # When user first viewed
resolved_at: datetime # When action completed
action_created_at: datetime # For escalation age calculation
Interaction Tracking:
interactions: List[AlertInteraction] # All user interactions
last_interaction_at: datetime # Most recent interaction
response_time_seconds: int # Time to first action
resolution_time_seconds: int # Time to resolution
8.4 Alert Interactions
Tracked Events:
- view: User viewed alert
- acknowledge: User acknowledged alert
- action_taken: User executed smart action
- snooze: User snoozed alert
- dismiss: User dismissed alert
- resolve: User resolved alert
Interaction Record:
class AlertInteraction(Base):
    id: UUID
    tenant_id: UUID
    alert_id: UUID
    user_id: UUID
    interaction_type: InteractionType
    action_type: Optional[SmartActionType]
    metadata: dict  # Context of interaction
    created_at: datetime
Analytics Usage:
- Measure alert effectiveness (% resolved)
- Track response times (how quickly users act)
- Identify ignored alerts (high dismiss rate)
- Optimize smart action suggestions
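The first two ratios above can be computed directly from alert records; a minimal sketch with illustrative field names:

```python
from typing import Dict, List


def alert_effectiveness(alerts: List[dict]) -> Dict[str, float]:
    """Resolved-rate and dismiss-rate over a batch of alert records."""
    total = len(alerts)
    if total == 0:
        return {"resolved_rate": 0.0, "dismiss_rate": 0.0}
    resolved = sum(1 for a in alerts if a["status"] == "resolved")
    dismissed = sum(1 for a in alerts if a["status"] == "dismissed")
    return {
        "resolved_rate": resolved / total,   # % of alerts users resolved
        "dismiss_rate": dismissed / total,   # high values flag ignored alerts
    }
```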
9. Escalation System
9.1 Time-Based Escalation
Purpose: Prevent action fatigue and ensure critical alerts don't age
Escalation Rules:
# Applied hourly to action_needed alerts
if alert.status == "active" and alert.type_class == "action_needed":
    # timedelta has no .hours attribute; derive hours from total_seconds()
    age_hours = (now - alert.action_created_at).total_seconds() / 3600
    escalation_boost = 0
    # Age-based escalation
    if age_hours > 72:
        escalation_boost = 20
    elif age_hours > 48:
        escalation_boost = 10
    # Deadline-based escalation
    if alert.deadline:
        hours_to_deadline = (alert.deadline - now).total_seconds() / 3600
        if hours_to_deadline < 6:
            escalation_boost = max(escalation_boost, 30)
        elif hours_to_deadline < 24:
            escalation_boost = max(escalation_boost, 15)
    # Skip if already critical
    if alert.priority_score >= 90:
        escalation_boost = 0
    # Apply boost (capped at +30)
    alert.priority_score += min(escalation_boost, 30)
    alert.priority_level = calculate_level(alert.priority_score)
9.2 Escalation Cronjob
Schedule: Every hour at :15 (15 * * * *)
Configuration:
alert-priority-recalculation-cronjob:
  schedule: "15 * * * *"
  resources:
    memory: 256Mi
    cpu: 100m
  timeout: 30 minutes
  concurrency: Forbid
  batch_size: 50
Processing Logic:
- Query all action_needed alerts with status=active
- Batch process (50 alerts at a time)
- Calculate escalation boost for each
- Update priority_score and priority_level
- Add escalation_metadata (boost amount, reason)
- Invalidate Redis cache (tenant:{id}:alerts:*)
- Log escalation events for analytics
9.3 Escalation Metadata
Stored in enrichment_context:
{
  "escalation": {
    "applied_at": "2025-11-25T15:00:00Z",
    "boost_amount": 20,
    "reason": "pending_72h",
    "previous_score": 65,
    "new_score": 85,
    "previous_level": "standard",
    "new_level": "important"
  }
}
9.4 Escalation to Auto-Action
When:
- Alert >72h old
- Priority ≥90 (critical)
- Has auto-action configured
Process:
if age_hours > 72 and priority_score >= 90:
    alert.type_class = "escalation"
    alert.auto_action_countdown_seconds = 7200  # 2 hours
    alert.auto_action_type = determine_auto_action(alert)
    alert.auto_action_metadata = {...}
Frontend Display:
- Shows countdown timer: "Auto-approving PO in 1h 23m"
- Primary action becomes "Cancel Auto-Action"
- User can cancel or let system proceed
10. Alert Chaining & Deduplication
10.1 Deduplication Strategy
Purpose: Prevent alert spam when same issue detected multiple times
Deduplication Key:
def generate_dedup_key(tenant_id, alert_type, product_id=None,
                       supplier_id=None, batch_id=None):
    key_parts = [alert_type]
    # Add entity identifiers
    if product_id:
        key_parts.append(f"product:{product_id}")
    if supplier_id:
        key_parts.append(f"supplier:{supplier_id}")
    if batch_id:
        key_parts.append(f"batch:{batch_id}")
    key = ":".join(key_parts)
    return f"{tenant_id}:alert:{key}"
Redis Check:
dedup_key = generate_dedup_key(...)
if redis.exists(dedup_key):
    return  # Skip, alert already exists
redis.setex(dedup_key, 900, "1")  # 15-minute window
create_alert(...)
# Note: redis.set(dedup_key, "1", nx=True, ex=900) would make the
# check-and-set atomic and avoid a race between concurrent publishers.
10.2 Alert Chaining
Purpose: Link related alerts to tell coherent story
Database Fields (added in migration 20251123):
action_created_at: datetime # Original creation time (for age)
superseded_by_action_id: UUID # Links to solving action
hidden_from_ui: bool # Hide superseded alerts
10.3 Chaining Methods
1. Mark as Superseded:
def mark_alert_as_superseded(alert_id, solving_action_id):
    alert = db.query(Alert).filter(Alert.id == alert_id).first()
    alert.superseded_by_action_id = solving_action_id
    alert.hidden_from_ui = True
    alert.updated_at = now()
    db.commit()
    # Invalidate cache (DEL does not glob; production code must SCAN
    # for matching keys and delete them individually)
    redis.delete(f"tenant:{alert.tenant_id}:alerts:*")
2. Create Combined Alert:
def create_combined_alert(original_alert, solving_action):
    # Create new prevented_issue alert
    combined_alert = Alert(
        tenant_id=original_alert.tenant_id,
        alert_type="prevented_issue",
        type_class="prevented_issue",
        title="Stock shortage prevented",
        message=f"I detected low stock for {product_name} and created "
                f"PO-{po_number} automatically. Order will arrive in 2 days.",
        priority_level="info",
        metadata={
            "original_alert_id": str(original_alert.id),
            "solving_action_id": str(solving_action.id),
            "problem": original_alert.message,
            "solution": solving_action.description
        }
    )
    db.add(combined_alert)
    db.commit()
    # Mark original as superseded by the solving action
    mark_alert_as_superseded(original_alert.id, solving_action.id)
3. Find Related Alerts:
def find_related_alert(tenant_id, alert_type, product_id):
    return db.query(Alert).filter(
        Alert.tenant_id == tenant_id,
        Alert.alert_type == alert_type,
        Alert.metadata['product_id'].astext == product_id,
        Alert.created_at > now() - timedelta(hours=24),
        Alert.hidden_from_ui == False
    ).first()
4. Filter Hidden Alerts:
def get_active_alerts(tenant_id):
    return db.query(Alert).filter(
        Alert.tenant_id == tenant_id,
        Alert.status.in_(["active", "acknowledged"]),
        Alert.hidden_from_ui == False  # Exclude superseded alerts
    ).all()
10.4 Chaining Example Flow
Step 1: Low stock detected
→ Create LOW_STOCK alert (action_needed, priority: 75)
→ User sees "Low stock for flour, action needed"
Step 2: Daily Orchestrator runs
→ Finds LOW_STOCK alert
→ Creates purchase order automatically
→ PO-12345 created with delivery date
Step 3: Orchestrator chains alerts
→ Calls mark_alert_as_superseded(low_stock_alert.id, po.id)
→ Creates PREVENTED_ISSUE alert
→ Message: "I prevented flour shortage by creating PO-12345.
Delivery arrives Nov 28. Approve or modify if needed."
Step 4: User sees only prevented_issue alert
→ Original low stock alert hidden from UI
→ User understands: problem detected → AI acted → needs approval
→ Single coherent narrative, not 3 separate alerts
11. Cronjob Integration
11.1 Why CronJobs Are Needed
Event System Cannot:
- Emit events "2 hours before delivery"
- Detect "alert is now 48 hours old"
- Poll external state (procurement PO status)
CronJobs Excel At:
- Time-based conditions
- Periodic checks
- Predictive alerts
- Batch recalculations
11.2 Delivery Tracking CronJob
Schedule: Every hour at :30 (30 * * * *)
Configuration:
delivery-tracking-cronjob:
  schedule: "30 * * * *"
  resources:
    memory: 256Mi
    cpu: 100m
  timeout: 30 minutes
  concurrency: Forbid
Service: DeliveryTrackingService in Orchestrator
Processing Flow:
def check_expected_deliveries():
    # Query procurement service for expected deliveries
    deliveries = procurement_api.get_expected_deliveries(
        from_date=now(),
        to_date=now() + timedelta(days=3)
    )
    for delivery in deliveries:
        current_time = now()
        expected_time = delivery.expected_delivery_datetime
        window_start = delivery.delivery_window_start
        window_end = delivery.delivery_window_end
        # T-2h: Arriving soon alert
        if current_time >= (window_start - timedelta(hours=2)) and \
                current_time < window_start:
            send_arriving_soon_alert(delivery)
        # T+30min: Overdue alert
        elif current_time > (window_end + timedelta(minutes=30)) and \
                not delivery.marked_received:
            send_overdue_alert(delivery)
        # Window passed, not received: Incomplete alert
        elif current_time > (window_end + timedelta(hours=2)) and \
                not delivery.marked_received and \
                not delivery.stock_receipt_id:
            send_receipt_incomplete_alert(delivery)
Alert Types Generated:
- DELIVERY_ARRIVING_SOON (T-2h):
{
  "alert_type": "delivery_arriving_soon",
  "type_class": "action_needed",
  "priority_level": "important",
  "placement": "action_queue",
  "smart_actions": [
    {
      "type": "mark_delivery_received",
      "label": "Mark as Received",
      "variant": "primary"
    }
  ]
}
- DELIVERY_OVERDUE (T+30min):
{
  "alert_type": "delivery_overdue",
  "type_class": "action_needed",
  "priority_level": "critical",
  "priority_score": 95,
  "smart_actions": [
    {
      "type": "call_supplier",
      "label": "Call Supplier",
      "metadata": {
        "supplier_contact": "+34-123-456-789"
      }
    }
  ]
}
- STOCK_RECEIPT_INCOMPLETE (Post-window):
{
  "alert_type": "stock_receipt_incomplete",
  "type_class": "action_needed",
  "priority_level": "important",
  "priority_score": 80,
  "smart_actions": [
    {
      "type": "complete_stock_receipt",
      "label": "Complete Stock Receipt",
      "metadata": {
        "po_id": "...",
        "draft_receipt_id": "..."
      }
    }
  ]
}
11.3 Delivery Alert Lifecycle
PO Approved
↓
DELIVERY_SCHEDULED (informational, notification_panel)
↓ T-2 hours
DELIVERY_ARRIVING_SOON (action_needed, action_queue)
↓ Expected time + 30 min
DELIVERY_OVERDUE (critical, action_queue + toast)
↓ Window passed + 2 hours
STOCK_RECEIPT_INCOMPLETE (important, action_queue)
11.4 Priority Recalculation CronJob
See Section 9.2 for details.
11.5 Decision Matrix: Events vs CronJobs
| Feature | Event System | CronJob | Best Choice |
|---|---|---|---|
| State change notification | ✅ Excellent | ❌ Poor | Event System |
| Time-based alerts | ❌ Complex | ✅ Simple | CronJob ✅ |
| Real-time updates | ✅ Instant | ❌ Delayed | Event System |
| Predictive alerts | ❌ Hard | ✅ Easy | CronJob ✅ |
| Priority escalation | ❌ Complex | ✅ Natural | CronJob ✅ |
| Deadline tracking | ❌ Complex | ✅ Simple | CronJob ✅ |
| Batch processing | ❌ Not designed | ✅ Ideal | CronJob ✅ |
12. Service Integration Patterns
12.1 Base Alert Service
All services extend: BaseAlertService from shared/alerts/base_service.py
Core Method:
async def publish_item(
    self,
    tenant_id: UUID,
    item_data: dict,
    item_type: ItemType = ItemType.ALERT
):
    # Validate schema
    validated_item = validate_item(item_data, item_type)
    # Generate deduplication key
    dedup_key = self.generate_dedup_key(tenant_id, validated_item)
    # Check Redis for duplicates (15-minute window)
    if await self.redis.exists(dedup_key):
        logger.info(f"Skipping duplicate {item_type}: {dedup_key}")
        return
    # Publish to RabbitMQ
    await self.rabbitmq.publish(
        exchange="alerts.exchange",
        routing_key=f"{item_type}.{validated_item['severity']}",
        message={
            "tenant_id": str(tenant_id),
            "item_type": item_type,
            "data": validated_item
        }
    )
    # Set deduplication key
    await self.redis.setex(dedup_key, 900, "1")  # 15 minutes
12.2 Inventory Service
Service Class: InventoryAlertService
Background Jobs:
# Check stock levels every 5 minutes
@scheduler.scheduled_job('interval', minutes=5)
async def check_stock_levels():
    service = InventoryAlertService()
    critical_items = await service.find_critical_stock()
    for item in critical_items:
        await service.publish_item(
            tenant_id=item.tenant_id,
            item_data={
                "type": "critical_stock_shortage",
                "severity": "high",
                "title": f"Critical: {item.name} stock depleted",
                "message": f"Only {item.current_stock}{item.unit} remaining. "
                           f"Required: {item.minimum_stock}{item.unit}",
                "actions": ["approve_po", "call_supplier"],
                "metadata": {
                    "ingredient_id": str(item.id),
                    "current_stock": item.current_stock,
                    "minimum_stock": item.minimum_stock,
                    "unit": item.unit
                }
            },
            item_type=ItemType.ALERT
        )

# Check expiring products every 2 hours
@scheduler.scheduled_job('interval', hours=2)
async def check_expiring_products():
    # Similar pattern...
    ...
Event-Driven Alerts:
# Listen to order events
@event_handler("order.created")
async def on_order_created(event):
service = InventoryAlertService()
order = event.data
# Check if order depletes stock below threshold
for item in order.items:
stock_after_order = calculate_remaining_stock(item)
if stock_after_order < item.minimum_stock:
await service.publish_item(
tenant_id=order.tenant_id,
item_data={
"type": "stock_depleted_by_order",
"severity": "medium",
# ... details
},
item_type=ItemType.ALERT
)
Recommendations:
async def analyze_inventory_optimization():
    # Analyze stock patterns
    # Generate optimization recommendations
    await service.publish_item(
        tenant_id=tenant_id,
        item_data={
            "type": "inventory_optimization",
            "title": "Reduce waste by adjusting par levels",
            "suggested_actions": ["adjust_par_levels"],
            "estimated_impact": "Save €250/month",
            "confidence_score": 0.85
        },
        item_type=ItemType.RECOMMENDATION
    )
12.3 Production Service
Service Class: ProductionAlertService
Background Jobs:
@scheduler.scheduled_job('interval', minutes=15)
async def check_production_capacity():
    # Check if scheduled batches exceed capacity
    # Emit capacity_overload alerts
    ...

@scheduler.scheduled_job('interval', minutes=10)
async def check_production_delays():
    # Check batches behind schedule
    # Emit production_delay alerts
    ...
Event-Driven:
@event_handler("equipment.status_changed")
async def on_equipment_failure(event):
    if event.data.status == "failed":
        await service.publish_item(
            item_data={
                "type": "equipment_failure",
                "severity": "high",
                "priority_score": 95,  # Manual override
                # ...
            }
        )
12.4 Forecasting Service
Service Class: ForecastingRecommendationService
Scheduled Analysis:
@scheduler.scheduled_job('cron', day_of_week='fri', hour=15)
async def check_weekend_demand_surge():
    forecast = await get_weekend_forecast()
    if forecast.predicted_demand > (forecast.baseline * 1.3):
        await service.publish_item(
            item_data={
                "type": "demand_surge_weekend",
                "title": "Weekend demand surge predicted",
                "message": f"Demand trending up {forecast.increase_pct}%. "
                           f"Consider increasing production.",
                "suggested_actions": ["increase_production"],
                "confidence_score": forecast.confidence
            },
            item_type=ItemType.RECOMMENDATION
        )
12.5 Procurement Service
Service Class: ProcurementEventService (mixed alerts + notifications)
Event-Driven:
@event_handler("po.created")
async def on_po_created(event):
    po = event.data
    if po.amount > APPROVAL_THRESHOLD:
        # Emit alert requiring approval
        await service.publish_item(
            item_data={
                "type": "po_approval_needed",
                "severity": "medium",
                # ...
            },
            item_type=ItemType.ALERT
        )
    else:
        # Emit notification (auto-approved)
        await service.publish_item(
            item_data={
                "type": "po_approved",
                "message": f"PO-{po.number} auto-approved (€{po.amount})",
                "old_state": "draft",
                "new_state": "approved"
            },
            item_type=ItemType.NOTIFICATION
        )
13. Frontend Integration
13.1 React Hooks Catalog (18 hooks)
Alert Hooks (4)
// Subscribe to all critical alerts
const { alerts, criticalAlerts, isLoading } = useAlerts({
  domains: ['inventory', 'production'],
  minPriority: 'important'
});

// Critical alerts only
const { criticalAlerts } = useCriticalAlerts();

// Action-needed alerts only
const { alerts } = useActionNeededAlerts();

// Domain-specific alerts
const { alerts } = useAlertsByDomain('inventory');
Notification Hooks (9)
// All notifications
const { notifications } = useEventNotifications();
// Domain-specific notifications
const { notifications } = useProductionNotifications();
const { notifications } = useInventoryNotifications();
const { notifications } = useSupplyChainNotifications();
const { notifications } = useOperationsNotifications();
// Type-specific notifications
const { notifications } = useBatchNotifications();
const { notifications } = useDeliveryNotifications();
const { notifications } = useOrchestrationNotifications();
// Generic domain filter
const { notifications } = useNotificationsByDomain('production');
Recommendation Hooks (5)
// All recommendations
const { recommendations } = useRecommendations();
// Type-specific recommendations
const { recommendations } = useDemandRecommendations();
const { recommendations } = useInventoryOptimizationRecommendations();
const { recommendations } = useCostReductionRecommendations();
// High confidence only
const { recommendations } = useHighConfidenceRecommendations(0.8);
// Generic filters
const { recommendations } = useRecommendationsByDomain('forecasting');
const { recommendations } = useRecommendationsByType('demand_surge');
13.2 Base SSE Hook
useSSE Hook:
function useSSE(channels: string[]) {
  const [events, setEvents] = useState<Event[]>([]);
  const [isConnected, setIsConnected] = useState(false);
  // Key the effect on channel content, not array identity: an inline
  // array argument would otherwise tear down and reconnect on every render.
  const channelKey = channels.join(',');
  useEffect(() => {
    const eventSource = new EventSource(
      `/api/events/sse?channels=${encodeURIComponent(channelKey)}`
    );
    eventSource.onopen = () => setIsConnected(true);
    eventSource.onmessage = (event) => {
      const data = JSON.parse(event.data);
      setEvents(prev => [data, ...prev]);
    };
    eventSource.onerror = () => setIsConnected(false);
    return () => eventSource.close();
  }, [channelKey]);
  return { events, isConnected };
}
13.3 TypeScript Definitions
Alert Type:
interface Alert {
  id: string;
  tenant_id: string;
  alert_type: string;
  type_class: AlertTypeClass;
  service: string;
  title: string;
  message: string;
  status: AlertStatus;
  priority_score: number;
  priority_level: PriorityLevel;
  // Enrichment
  orchestrator_context?: OrchestratorContext;
  business_impact?: BusinessImpact;
  urgency_context?: UrgencyContext;
  user_agency?: UserAgency;
  trend_context?: TrendContext;
  // Actions
  smart_actions?: SmartAction[];
  // Metadata
  alert_metadata?: Record<string, any>;
  created_at: string;
  updated_at: string;
  resolved_at?: string;
}
enum AlertTypeClass {
  ACTION_NEEDED = "action_needed",
  PREVENTED_ISSUE = "prevented_issue",
  TREND_WARNING = "trend_warning",
  ESCALATION = "escalation",
  INFORMATION = "information"
}
enum PriorityLevel {
  CRITICAL = "critical",
  IMPORTANT = "important",
  STANDARD = "standard",
  INFO = "info"
}
enum AlertStatus {
  ACTIVE = "active",
  ACKNOWLEDGED = "acknowledged",
  IN_PROGRESS = "in_progress",
  RESOLVED = "resolved",
  DISMISSED = "dismissed",
  SNOOZED = "snoozed"
}
13.4 Component Integration Examples
Action Queue Card:
function UnifiedActionQueueCard() {
  const { alerts } = useAlerts({
    typeClass: ['action_needed', 'escalation'],
    includeResolved: false
  });
  const groupedAlerts = useMemo(() => {
    return groupByTimeCategory(alerts);
    // Returns: { urgent: [...], today: [...], thisWeek: [...] }
  }, [alerts]);
  return (
    <Card>
      <h2>Actions Needed</h2>
      {groupedAlerts.urgent.length > 0 && (
        <UrgentSection alerts={groupedAlerts.urgent} />
      )}
      {groupedAlerts.today.length > 0 && (
        <TodaySection alerts={groupedAlerts.today} />
      )}
    </Card>
  );
}
Health Hero Component:
function GlanceableHealthHero() {
  const { criticalAlerts } = useCriticalAlerts();
  const { notifications } = useEventNotifications();
  const healthStatus = useMemo(() => {
    if (criticalAlerts.length > 0) return 'red';
    if (hasUrgentNotifications(notifications)) return 'yellow';
    return 'green';
  }, [criticalAlerts, notifications]);
  return (
    <Card>
      <StatusIndicator color={healthStatus} />
      {healthStatus === 'red' && (
        <UrgentBadge count={criticalAlerts.length} />
      )}
    </Card>
  );
}
Event-Driven Refetch:
function InventoryStats() {
  const { data, refetch } = useInventoryStats();
  const { notifications } = useInventoryNotifications();
  useEffect(() => {
    const relevantEvent = notifications.find(
      n => n.event_type === 'stock_received'
    );
    if (relevantEvent) {
      refetch(); // Update stats on stock change
    }
  }, [notifications, refetch]);
  return <StatsCard data={data} />;
}
14. Redis Pub/Sub Architecture
14.1 Channel Naming Convention
Pattern: `tenant:{tenant_id}:{domain}.{event_type}`
Examples:
tenant:123e4567-e89b-12d3-a456-426614174000:inventory.alerts
tenant:123e4567-e89b-12d3-a456-426614174000:inventory.notifications
tenant:123e4567-e89b-12d3-a456-426614174000:production.alerts
tenant:123e4567-e89b-12d3-a456-426614174000:production.notifications
tenant:123e4567-e89b-12d3-a456-426614174000:supply_chain.alerts
tenant:123e4567-e89b-12d3-a456-426614174000:supply_chain.notifications
tenant:123e4567-e89b-12d3-a456-426614174000:operations.notifications
tenant:123e4567-e89b-12d3-a456-426614174000:recommendations
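A tiny helper makes the convention concrete; `build_channel` is a hypothetical name for illustration, not part of the documented API:

```python
from typing import Optional

def build_channel(tenant_id: str, domain: str, event_type: Optional[str] = None) -> str:
    """Build a channel name following tenant:{tenant_id}:{domain}.{event_type}."""
    if event_type is None:
        # Tenant-wide channels such as "recommendations" have no event-type suffix.
        return f"tenant:{tenant_id}:{domain}"
    return f"tenant:{tenant_id}:{domain}.{event_type}"
```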
14.2 Domain-Based Routing
Alert Processor publishes to Redis:
def publish_to_redis(alert):
    domain = alert.domain  # inventory, production, etc.
    channel = f"tenant:{alert.tenant_id}:{domain}.alerts"
    redis.publish(channel, json.dumps({
        "id": str(alert.id),
        "alert_type": alert.alert_type,
        "type_class": alert.type_class,
        "priority_level": alert.priority_level,
        "title": alert.title,
        "message": alert.message,
        # ... full alert data
    }))
14.3 Gateway SSE Endpoint
Multi-Channel Subscription:
@app.get("/api/events/sse")
async def sse_endpoint(
channels: str, # Comma-separated: "inventory.alerts,production.alerts"
tenant_id: UUID = Depends(get_current_tenant)
):
async def event_stream():
pubsub = redis.pubsub()
# Subscribe to requested channels
for channel in channels.split(','):
full_channel = f"tenant:{tenant_id}:{channel}"
await pubsub.subscribe(full_channel)
# Stream events
async for message in pubsub.listen():
if message['type'] == 'message':
yield f"data: {message['data']}\n\n"
return StreamingResponse(
event_stream(),
media_type="text/event-stream"
)
Wildcard Support:
// Frontend can subscribe to:
"*.alerts" // All alert channels
"inventory.*" // All inventory events
"*.notifications" // All notification channels
14.4 Traffic Reduction
Before (legacy):
- All pages subscribe to a single `tenant:{id}:events` channel
- 100% of events sent to all pages
- High bandwidth, slow filtering
After (domain-based):
- Dashboard subscribes to `*.alerts`, `*.notifications`, `recommendations`
- Inventory page subscribes to `inventory.alerts`, `inventory.notifications`
- Production page subscribes to `production.alerts`, `production.notifications`
Traffic Reduction by Page:
| Page | Old Traffic | New Traffic | Reduction |
|---|---|---|---|
| Dashboard | 100% | 100% | 0% (needs all) |
| Inventory | 100% | 15% | 85% |
| Production | 100% | 20% | 80% |
| Supply Chain | 100% | 18% | 82% |
Average: 70% reduction on specialized pages
15. Database Schema
15.1 Alerts Table
CREATE TABLE alerts (
    -- Identity
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id UUID NOT NULL,
    -- Classification
    alert_type VARCHAR(100) NOT NULL,
    type_class VARCHAR(50) NOT NULL, -- action_needed, prevented_issue, etc.
    service VARCHAR(50) NOT NULL,
    event_domain VARCHAR(50), -- Added in migration 20251125
    -- Content
    title VARCHAR(500) NOT NULL,
    message TEXT NOT NULL,
    -- Status
    status VARCHAR(50) NOT NULL DEFAULT 'active',
    -- Priority
    priority_score INTEGER NOT NULL DEFAULT 50,
    priority_level VARCHAR(50) NOT NULL DEFAULT 'standard',
    -- Enrichment Context (JSONB)
    orchestrator_context JSONB,
    business_impact JSONB,
    urgency_context JSONB,
    user_agency JSONB,
    trend_context JSONB,
    -- Smart Actions
    smart_actions JSONB, -- Array of action objects
    -- Timing
    timing_decision VARCHAR(50),
    scheduled_send_time TIMESTAMP,
    -- Escalation (Added in migration 20251123)
    action_created_at TIMESTAMP, -- For age calculation
    superseded_by_action_id UUID, -- Links to solving action
    hidden_from_ui BOOLEAN DEFAULT FALSE,
    -- Metadata
    alert_metadata JSONB,
    -- Timestamps
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMP NOT NULL DEFAULT NOW(),
    resolved_at TIMESTAMP,
    -- Foreign Keys
    FOREIGN KEY (tenant_id) REFERENCES tenants(id) ON DELETE CASCADE
);
15.2 Indexes
-- Tenant filtering
CREATE INDEX idx_alerts_tenant_status
ON alerts(tenant_id, status);
-- Priority sorting
CREATE INDEX idx_alerts_tenant_priority_created
ON alerts(tenant_id, priority_score DESC, created_at DESC);
-- Type class filtering
CREATE INDEX idx_alerts_tenant_typeclass_status
ON alerts(tenant_id, type_class, status);
-- Timing queries
CREATE INDEX idx_alerts_timing_scheduled
ON alerts(timing_decision, scheduled_send_time);
-- Escalation queries (Added in migration 20251123)
CREATE INDEX idx_alerts_tenant_action_created
ON alerts(tenant_id, action_created_at);
CREATE INDEX idx_alerts_superseded_by
ON alerts(superseded_by_action_id);
CREATE INDEX idx_alerts_tenant_hidden_status
ON alerts(tenant_id, hidden_from_ui, status);
-- Domain filtering (Added in migration 20251125)
CREATE INDEX idx_alerts_tenant_domain
ON alerts(tenant_id, event_domain);
15.3 Alert Interactions Table
CREATE TABLE alert_interactions (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id UUID NOT NULL,
    alert_id UUID NOT NULL,
    user_id UUID NOT NULL,
    -- Interaction type
    interaction_type VARCHAR(50) NOT NULL, -- view, acknowledge, action_taken, etc.
    action_type VARCHAR(50), -- Smart action type if applicable
    -- Context
    metadata JSONB,
    response_time_seconds INTEGER, -- Time from alert creation to this interaction
    -- Timestamps
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    -- Foreign Keys
    FOREIGN KEY (tenant_id) REFERENCES tenants(id) ON DELETE CASCADE,
    FOREIGN KEY (alert_id) REFERENCES alerts(id) ON DELETE CASCADE,
    FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE CASCADE
);
CREATE INDEX idx_interactions_alert ON alert_interactions(alert_id);
CREATE INDEX idx_interactions_tenant_user ON alert_interactions(tenant_id, user_id);
15.4 Notifications Table
CREATE TABLE notifications (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id UUID NOT NULL,
    -- Classification
    event_type VARCHAR(100) NOT NULL,
    event_domain VARCHAR(50) NOT NULL,
    -- Content
    title VARCHAR(500) NOT NULL,
    message TEXT NOT NULL,
    -- State change tracking
    entity_type VARCHAR(50), -- "purchase_order", "batch", etc.
    entity_id UUID,
    old_state VARCHAR(50),
    new_state VARCHAR(50),
    -- Display
    placement_hint VARCHAR(50) DEFAULT 'notification_panel',
    -- Metadata
    notification_metadata JSONB,
    -- Timestamps
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    expires_at TIMESTAMP DEFAULT (NOW() + INTERVAL '7 days'),
    -- Foreign Keys
    FOREIGN KEY (tenant_id) REFERENCES tenants(id) ON DELETE CASCADE
);
CREATE INDEX idx_notifications_tenant_created
ON notifications(tenant_id, created_at DESC);
CREATE INDEX idx_notifications_tenant_domain
ON notifications(tenant_id, event_domain);
15.5 Recommendations Table
CREATE TABLE recommendations (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id UUID NOT NULL,
    -- Classification
    recommendation_type VARCHAR(100) NOT NULL,
    event_domain VARCHAR(50) NOT NULL,
    -- Content
    title VARCHAR(500) NOT NULL,
    message TEXT NOT NULL,
    -- Actions & Impact
    suggested_actions JSONB, -- Array of suggested action types
    estimated_impact TEXT, -- "Save €250/month"
    confidence_score DECIMAL(3, 2), -- 0.00 - 1.00
    -- Status
    status VARCHAR(50) DEFAULT 'active', -- active, dismissed, implemented
    -- Metadata
    recommendation_metadata JSONB,
    -- Timestamps
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    expires_at TIMESTAMP DEFAULT (NOW() + INTERVAL '30 days'),
    dismissed_at TIMESTAMP,
    -- Foreign Keys
    FOREIGN KEY (tenant_id) REFERENCES tenants(id) ON DELETE CASCADE
);
CREATE INDEX idx_recommendations_tenant_status
ON recommendations(tenant_id, status);
CREATE INDEX idx_recommendations_tenant_domain
ON recommendations(tenant_id, event_domain);
15.6 Migrations
Key Migrations:
- `20251015_1230_initial_schema.py`
  - Created alerts, notifications, recommendations tables
  - Initial indexes
  - Full enrichment fields
- `20251123_add_alert_enhancements.py`
  - Added `action_created_at` for escalation tracking
  - Added `superseded_by_action_id` for chaining
  - Added `hidden_from_ui` flag
  - Created indexes for escalation queries
  - Backfilled `action_created_at` for existing alerts
- `20251125_add_event_domain_column.py`
  - Added `event_domain` to alerts table
  - Added index on (tenant_id, event_domain)
  - Populated domain from existing alert_type patterns
16. Performance & Monitoring
16.1 Performance Metrics
Processing Speed:
- Alert enrichment: 500-800ms (full pipeline)
- Notification processing: 20-30ms (80% faster)
- Recommendation processing: 50-80ms (60% faster)
- Average improvement: 54%
Database Query Performance:
- Get active alerts by tenant: <50ms
- Get critical alerts with priority sort: <100ms
- Escalation age calculation: <150ms
- Alert chaining lookup: <75ms
API Response Times:
- GET /alerts (paginated): <200ms
- POST /alerts/{id}/acknowledge: <50ms
- POST /alerts/{id}/resolve: <100ms
SSE Traffic:
- Legacy (single channel): 100% of events to all pages
- New (domain-based): 70% reduction on specialized pages
- Dashboard: No change (needs all events)
- Domain pages: 80-85% reduction
16.2 Caching Strategy
Redis Cache Keys:
tenant:{tenant_id}:alerts:active
tenant:{tenant_id}:alerts:critical
tenant:{tenant_id}:orchestrator_context:{action_id}
Cache Invalidation:
- On alert creation: invalidate `alerts:active`
- On priority update: invalidate `alerts:critical`
- On escalation: invalidate all alert caches
- On resolution: invalidate both `alerts:active` and `alerts:critical`
TTL:
- Alert lists: 5 minutes
- Orchestrator context: 15 minutes
- Deduplication keys: 15 minutes
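The invalidation rules above can be sketched as one helper. The key names match the documented cache keys, while the helper itself (`invalidate_alert_caches`) is an assumption, not the production code:

```python
def invalidate_alert_caches(redis_client, tenant_id: str, *, critical_changed: bool = True) -> None:
    """Delete cached alert lists for a tenant.

    Pass critical_changed=False when only the active list is stale
    (e.g. a standard-priority alert was created).
    """
    keys = [f"tenant:{tenant_id}:alerts:active"]
    if critical_changed:
        keys.append(f"tenant:{tenant_id}:alerts:critical")
    # Redis DELETE accepts multiple keys in one round trip.
    redis_client.delete(*keys)
```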
16.3 Monitoring Metrics
Prometheus Metrics:
# Alert creation rate
alert_created_total = Counter('alert_created_total', 'Total alerts created', ['tenant_id', 'alert_type'])
# Enrichment timing
enrichment_duration_seconds = Histogram('enrichment_duration_seconds', 'Enrichment processing time', ['event_type'])
# Priority distribution
alert_priority_distribution = Histogram('alert_priority_distribution', 'Alert priority scores', ['priority_level'])
# Resolution metrics
alert_resolution_time_seconds = Histogram('alert_resolution_time_seconds', 'Time to resolve alerts', ['alert_type'])
# Escalation tracking
alert_escalated_total = Counter('alert_escalated_total', 'Alerts escalated', ['escalation_reason'])
# Deduplication hits
alert_deduplicated_total = Counter('alert_deduplicated_total', 'Alerts deduplicated', ['alert_type'])
Key Metrics to Monitor:
- Alert creation rate (per tenant, per type)
- Average resolution time (should decrease over time)
- Escalation rate (high rate indicates alerts being ignored)
- Deduplication hit rate (should be 10-20%)
- Enrichment performance (p50, p95, p99)
- SSE connection count and duration
16.4 Health Checks
Alert Processor Health:
@app.get("/health")
async def health_check():
checks = {
"database": await check_db_connection(),
"redis": await check_redis_connection(),
"rabbitmq": await check_rabbitmq_connection(),
"orchestrator_api": await check_orchestrator_api()
}
overall_healthy = all(checks.values())
status_code = 200 if overall_healthy else 503
return JSONResponse(
status_code=status_code,
content={
"status": "healthy" if overall_healthy else "unhealthy",
"checks": checks,
"timestamp": datetime.utcnow().isoformat()
}
)
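The handler above calls `check_redis_connection` without showing its body; one plausible shape is sketched below, under the assumption of an async Redis client (e.g. `redis.asyncio`) and a short timeout so a hung dependency cannot stall the health endpoint:

```python
import asyncio

async def check_redis_connection(redis_client, timeout: float = 2.0) -> bool:
    """Return True if Redis answers PING within the timeout.

    Any exception (connection refused, timeout, auth failure) counts
    as unhealthy rather than propagating into the health handler.
    """
    try:
        return bool(await asyncio.wait_for(redis_client.ping(), timeout=timeout))
    except Exception:
        return False
```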
CronJob Monitoring:
# Kubernetes CronJob metrics
- Last successful run timestamp
- Last failed run timestamp
- Average execution duration
- Alert count processed per run
- Error count per run
16.5 Troubleshooting Guide
Problem: Alerts not appearing in frontend
Diagnosis:
- Check the alert was created in the database: `SELECT * FROM alerts WHERE tenant_id=... ORDER BY created_at DESC LIMIT 10;`
- Check Redis pub/sub: `SUBSCRIBE tenant:{id}:inventory.alerts`
- Check the SSE connection: browser dev tools → Network → EventStream
- Check the frontend hook subscription: console logs
Problem: Slow enrichment
Diagnosis:
- Check Prometheus metrics for `enrichment_duration_seconds`
- Identify the slow enrichment stage (orchestrator, priority scoring, etc.)
- Check orchestrator API response time
- Review database query performance (EXPLAIN ANALYZE)
Problem: High escalation rate
Diagnosis:
- Query alerts by age: `SELECT alert_type, COUNT(*) FROM alerts WHERE action_created_at < NOW() - INTERVAL '48 hours' GROUP BY alert_type;`
- Check whether certain alert types are consistently ignored
- Review smart actions (are they actionable?)
- Check user permissions (can users actually execute actions?)
Problem: Duplicate alerts
Diagnosis:
- Check deduplication key generation logic
- Verify Redis connection (dedup keys being set?)
- Review deduplication window (15 minutes may be too short)
- Check for race conditions in concurrent alert creation
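When auditing the key-generation logic, a deduplication key might look like the sketch below. The exact production recipe is not documented here; scoping by tenant, alert type, and affected entity is an assumption:

```python
import hashlib

DEDUP_TTL_SECONDS = 15 * 60  # matches the 15-minute window noted above

def dedup_key(tenant_id: str, alert_type: str, entity_id: str) -> str:
    """Derive a stable Redis key: identical inputs always collide, distinct ones should not."""
    raw = f"{tenant_id}:{alert_type}:{entity_id}"
    return f"dedup:alert:{hashlib.sha256(raw.encode()).hexdigest()[:16]}"
```

Setting the key with `SET key 1 NX EX 900` makes the check atomic: a failed `NX` means the same alert already fired within the window.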
17. Deployment Guide
17.1 5-Week Deployment Timeline
Week 1: Backend & Gateway
- Day 1: Database migration in dev environment
- Day 2-3: Deploy alert processor with dual publishing
- Day 4: Deploy updated gateway
- Day 5: Monitoring & validation
Week 2-3: Frontend Integration
- Dashboard components with event hooks
- Priority components (ActionQueue, HealthHero, ExecutionTracker)
- Domain pages (Inventory, Production, Supply Chain)
Week 4: Cutover
- Verify complete migration
- Remove dual publishing
- Database cleanup (remove legacy columns)
Week 5: Optimization
- Performance tuning
- Monitoring dashboards
- Alert rules refinement
17.2 Pre-Deployment Checklist
- ✅ Database migration scripts tested
- ✅ Backward compatibility verified
- ✅ Rollback procedure documented
- ✅ Monitoring metrics defined
- ✅ Performance benchmarks set
- ✅ Example integrations tested
- ✅ Documentation complete
17.3 Rollback Procedure
If issues occur:
- Stop new alert processor deployment
- Revert gateway to previous version
- Roll back database migration (if safe)
- Resume dual publishing if partially migrated
- Investigate root cause
- Fix and redeploy
Appendix
Related Documentation
- Frontend README - Frontend architecture and components
- Alert Processor Service README - Service implementation details
- Inventory Service README - Stock receipt system
- Orchestrator Service README - Delivery tracking
- Technical Documentation Summary - System overview
Version History
- v2.0 (2025-11-25): Complete architecture with escalation, chaining, cronjobs
- v1.5 (2025-11-23): Added stock receipt system and delivery tracking
- v1.0 (2025-11-15): Initial three-tier enrichment system
Contributors
This alert system was designed and implemented collaboratively to support the Bakery-IA platform's mission of providing intelligent, context-aware alerts that respect user time and decision-making agency.
Last Updated: 2025-11-25 Status: Production-Ready ✅ Next Review: As needed based on system evolution