# Alert System Architecture

**Last Updated**: 2025-11-25
**Status**: Production-Ready
**Version**: 2.0

---

## Table of Contents

1. [Overview](#1-overview)
2. [Event Flow & Lifecycle](#2-event-flow--lifecycle)
3. [Three-Tier Enrichment Strategy](#3-three-tier-enrichment-strategy)
4. [Enrichment Process](#4-enrichment-process)
5. [Priority Scoring Algorithm](#5-priority-scoring-algorithm)
6. [Alert Types & Classification](#6-alert-types--classification)
7. [Smart Actions & User Agency](#7-smart-actions--user-agency)
8. [Alert Lifecycle & State Transitions](#8-alert-lifecycle--state-transitions)
9. [Escalation System](#9-escalation-system)
10. [Alert Chaining & Deduplication](#10-alert-chaining--deduplication)
11. [Cronjob Integration](#11-cronjob-integration)
12. [Service Integration Patterns](#12-service-integration-patterns)
13. [Frontend Integration](#13-frontend-integration)
14. [Redis Pub/Sub Architecture](#14-redis-pubsub-architecture)
15. [Database Schema](#15-database-schema)
16. [Performance & Monitoring](#16-performance--monitoring)

---

## 1. Overview

### 1.1 Philosophy

The Bakery-IA alert system transforms passive notifications into **context-aware, actionable guidance**. Every alert includes enrichment context, priority scoring, and suggested actions, enabling users to make informed decisions quickly.
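The idea that every alert carries context, a score, and actions can be sketched as a data shape. This is a minimal illustration only: the dataclass and field names below are assumptions for this sketch, not the actual Bakery-IA schema (the real fields are described in later sections).

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SmartAction:
    """One suggested action attached to an alert (names are illustrative)."""
    label: str        # e.g. "Approve Purchase Order"
    action_type: str  # e.g. "approve_po"
    variant: str = "primary"

@dataclass
class EnrichedAlert:
    """Sketch of an alert enriched with context, score, and actions."""
    title: str
    message: str
    priority_score: int  # 0-100, from multi-factor scoring
    enrichment_context: dict = field(default_factory=dict)
    smart_actions: List[SmartAction] = field(default_factory=list)

    @property
    def priority_level(self) -> str:
        # Mirrors the CRITICAL/IMPORTANT/STANDARD/INFO bands defined
        # later in this document (Section 5.6).
        if self.priority_score >= 90:
            return "critical"
        if self.priority_score >= 70:
            return "important"
        if self.priority_score >= 50:
            return "standard"
        return "info"

alert = EnrichedAlert(
    title="Low stock: flour",
    message="Only 4kg remaining; 12kg required for tomorrow's batches.",
    priority_score=82,
    smart_actions=[SmartAction("Approve Purchase Order", "approve_po")],
)
print(alert.priority_level)  # important
```

The point of the shape is that the frontend never receives a bare message: it always gets enough context to render a decision, not just a notification.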
**Core Principles**:

- **Alerts are not just notifications** - They're AI-enhanced action items
- **Context over noise** - Every alert includes business impact and suggested actions
- **Smart prioritization** - Multi-factor scoring ensures critical issues surface first
- **Progressive enhancement** - Different event types get appropriate enrichment levels
- **User agency** - System respects what users can actually control

### 1.2 Architecture Goals

✅ **Performance**: 80% faster notification processing, 70% less SSE traffic
✅ **Type Safety**: Complete TypeScript definitions matching backend
✅ **Developer Experience**: 18 specialized React hooks for different use cases
✅ **Production Ready**: Backward compatible, fully documented, deployment-ready

---

## 2. Event Flow & Lifecycle

### 2.1 Event Generation

Services detect issues via three patterns:

#### **Scheduled Background Jobs**
- Inventory service: Stock checks every 5-15 minutes
- Production service: Capacity checks every 10-45 minutes
- Forecasting service: Demand analysis (Friday 3 PM weekly)

#### **Event-Driven**
- RabbitMQ subscriptions to business events
- Example: Order created → Check stock availability → Emit low stock alert

#### **Database Triggers**
- Direct PostgreSQL notifications for critical state changes
- Example: Stock quantity falls below threshold → Immediate alert

### 2.2 Alert Publishing Flow

```
Service detects issue
  ↓
Validates against RawAlert schema (title, message, type, severity, metadata)
  ↓
Generates deduplication key (type + entity IDs)
  ↓
Checks Redis (prevent duplicates within 15-minute window)
  ↓
Publishes to RabbitMQ (alerts.exchange with routing key)
  ↓
Alert Processor consumes message
  ↓
Conditional enrichment based on event type
  ↓
Stores in PostgreSQL
  ↓
Publishes to Redis (domain-based channels)
  ↓
Gateway streams via SSE
  ↓
Frontend hooks receive and display
```

### 2.3 Complete Event Flow Diagram

```
Domain Service → RabbitMQ → Alert Processor → PostgreSQL → Redis → Gateway → Frontend
                                 ↓                                      ↓
                     Conditional Enrichment                        SSE Stream
                     - Alert: Full (500-800ms)                     - Domain filtered
                     - Notification: Fast (20-30ms)                - Wildcard support
                     - Recommendation: Medium (50-80ms)            - Real-time updates
```

---

## 3. Three-Tier Enrichment Strategy

### 3.1 Tier 1: ALERTS (Full Enrichment)

**When**: Critical business events requiring user decisions

**Enrichment Pipeline** (7 steps):
1. Orchestrator Context Query
2. Business Impact Analysis
3. Urgency Assessment
4. User Agency Evaluation
5. Multi-Factor Priority Scoring
6. Timing Intelligence
7. Smart Action Generation

**Processing Time**: 500-800ms
**Database**: Full alert record with all enrichment fields
**TTL**: Indefinite (until resolved)

**Examples**:
- Low stock warning requiring PO approval
- Production delay affecting customer orders
- Equipment failure needing immediate attention

### 3.2 Tier 2: NOTIFICATIONS (Lightweight)

**When**: Informational state changes

**Enrichment**:
- Format title/message
- Set placement hint
- Assign domain
- **No priority scoring**
- **No orchestrator queries**

**Processing Time**: 20-30ms (80% faster than alerts)
**Database**: Minimal notification record
**TTL**: 7 days (automatic cleanup)

**Examples**:
- Stock received confirmation
- Batch completed notification
- PO sent to supplier

### 3.3 Tier 3: RECOMMENDATIONS (Moderate)

**When**: AI suggestions for optimization

**Enrichment**:
- Light priority scoring (info level by default)
- Confidence assessment
- Estimated impact calculation
- **No orchestrator context**
- Dismissible by users

**Processing Time**: 50-80ms
**Database**: Recommendation record with impact fields
**TTL**: 30 days or until dismissed

**Examples**:
- Demand surge prediction
- Inventory optimization suggestion
- Cost reduction opportunity

### 3.4 Performance Comparison

| Event Class | Old | New | Improvement |
|-------------|-----|-----|-------------|
| Alert | 200-300ms | 500-800ms | Baseline (more enrichment) |
| Notification | 200-300ms | 20-30ms | **80% faster** |
| Recommendation | 200-300ms | 50-80ms | **60% faster** |

**Overall**: 54% average improvement due to selective enrichment

---

## 4. Enrichment Process

### 4.1 Orchestrator Context Enrichment

**Purpose**: Determine if AI has already addressed the alert

**Service**: `orchestrator_client.py`

**Query**: Daily Orchestrator microservice for related actions

**Questions Answered**:
- Has AI already created a purchase order for this low stock?
- What's the PO ID and current status?
- When will the delivery arrive?
- What's the estimated cost savings?

**Response Fields**:
```python
{
    "already_addressed": bool,
    "action_type": "purchase_order" | "production_batch" | "schedule_adjustment",
    "action_id": str,  # e.g., "PO-12345"
    "action_status": "pending_approval" | "approved" | "in_progress",
    "delivery_date": datetime,
    "estimated_savings_eur": Decimal
}
```

**Caching**: Results cached to avoid redundant queries

### 4.2 Business Impact Analysis

**Service**: `context_enrichment.py`

**Dimensions Analyzed**:

#### Financial Impact
```python
financial_impact_eur: Decimal
# Calculation examples:
# - Low stock: lost_sales = out_of_stock_days × avg_daily_revenue_per_product
# - Production delay: penalty_fees + rush_order_costs
# - Equipment failure: repair_cost + lost_production_value
```

#### Customer Impact
```python
affected_customers: List[str]  # Customer names
affected_orders: int           # Count of at-risk orders
customer_satisfaction_impact: "low" | "medium" | "high"
# Based on order priority, customer tier, delay duration
```

#### Operational Impact
```python
production_batches_at_risk: List[str]  # Batch IDs
waste_risk_kg: Decimal                 # Spoilage or overproduction
equipment_downtime_hours: Decimal
```

### 4.3 Urgency Context

**Fields**:
```python
deadline: datetime                      # When consequences occur
time_until_consequence_hours: Decimal   # Countdown
can_wait_until_tomorrow: bool           # For overnight batch processing
auto_action_countdown_seconds: int      # For escalation alerts
```

**Urgency Scoring**:
- \>48h until consequence: Low urgency (20 points)
- 24-48h: Medium urgency (50 points)
- 6-24h: High urgency (80 points)
- <6h: Critical urgency (100 points)

### 4.4 User Agency Assessment

**Purpose**: Determine what user can actually do

**Fields**:
```python
can_user_fix: bool              # Can user resolve this directly?
requires_external_party: bool   # Need supplier/customer action?
external_party_name: str        # "Supplier Inc."
external_party_contact: str     # "+34-123-456-789"
blockers: List[str]             # What prevents immediate action
```

**User Agency Scoring**:
- Can fix directly: 80 points
- Requires external party: 50 points
- Has blockers: -30 penalty
- No control: 20 points

### 4.5 Trend Context (for trend_warning alerts)

**Fields**:
```python
metric_name: str            # "weekend_demand"
current_value: Decimal      # 450
baseline_value: Decimal     # 300
change_percentage: Decimal  # 50
direction: "increasing" | "decreasing" | "volatile"
significance: "low" | "medium" | "high"
period_days: int            # 7
possible_causes: List[str]  # ["Holiday weekend", "Promotion"]
```

### 4.6 Timing Intelligence

**Service**: `timing_intelligence.py`

**Delivery Method Decisions**:
```python
def decide_timing(alert):
    if alert.priority >= 90:  # Critical
        return "SEND_NOW"  # Immediate push notification

    if is_business_hours() and alert.priority >= 70:
        return "SEND_NOW"  # Important during work hours

    if is_night_hours() and alert.priority < 90:
        return "SCHEDULE_LATER"  # Queue for 8 AM

    if alert.priority < 50:
        return "BATCH_FOR_DIGEST"  # Daily summary email
```

**Considerations**:
- Priority level
- Business hours (8 AM - 8 PM)
- User preferences (digest settings)
- Alert type (action_needed vs informational)

### 4.7 Smart Actions Generation

**Service**: `context_enrichment.py`

**Action Structure**:
```typescript
{
  label: string,                  // "Approve Purchase Order"
  type: SmartActionType,          // approve_po
  variant: "primary" | "secondary" | "tertiary",
  metadata: object,               // Context for action handler
  disabled: boolean,              // Based on user permissions/state
  estimated_time_minutes: number, // How long action takes
  consequence: string             // "Order will be placed immediately"
}
```

**Action Examples by Alert Type**:

**Low Stock Alert**:
```javascript
[
  {
    label: "Approve Purchase Order",
    type: "approve_po",
    variant: "primary",
    metadata: { po_id: "PO-12345", amount: 1500.00 }
  },
  {
    label: "Contact Supplier",
    type: "call_supplier",
    variant: "secondary",
    metadata: { supplier_contact: "+34-123-456-789" }
  }
]
```

**Production Delay Alert**:
```javascript
[
  {
    label: "Adjust Schedule",
    type: "reschedule_production",
    variant: "primary",
    metadata: { batch_id: "BATCH-001", delay_minutes: 30 }
  },
  {
    label: "Notify Customer",
    type: "send_notification",
    variant: "secondary",
    metadata: { customer_id: "CUST-456" }
  }
]
```

---

## 5. Priority Scoring Algorithm

### 5.1 Multi-Factor Weighted Scoring

**Formula**:
```
Priority Score (0-100) =
    (Business_Impact × 0.40) +
    (Urgency × 0.30) +
    (User_Agency × 0.20) +
    (Confidence × 0.10)
```

### 5.2 Business Impact Score (40% weight)

**Financial Impact**:
- ≤€50: 20 points
- €50-200: 40 points
- €200-500: 60 points
- \>€500: 100 points

**Customer Impact**:
- 1 affected customer: 30 points
- 2-5 customers: 50 points
- 5+ customers: 100 points

**Operational Impact**:
- 1 order at risk: 30 points
- 2-10 orders: 60 points
- 10+ orders: 100 points

**Weighted Average**:
```python
business_impact_score = (
    financial_score * 0.5 +
    customer_score * 0.3 +
    operational_score * 0.2
)
```

### 5.3 Urgency Score (30% weight)

**Time Until Consequence**:
- \>48 hours: 20 points
- 24-48 hours: 50 points
- 6-24 hours: 80 points
- <6 hours: 100 points

**Deadline Approaching Bonus**:
- Within 24h of deadline: +30 points
- Within 6h of deadline: +50 points (capped at 100)

### 5.4 User Agency Score (20% weight)

**Base Score**:
- Can user fix directly: 80 points
- Requires coordination: 50 points
- No control: 20 points

**Modifiers**:
- Has external party contact: +20 bonus
- Requires supplier action: -20 penalty
- Has known blockers: -30 penalty

### 5.5 Confidence Score (10% weight)

**Data Quality Assessment**:
- High confidence (complete data): 100 points
- Medium confidence (some assumptions): 70 points
- Low confidence (many unknowns): 40 points

### 5.6 Priority Levels

**Mapping**:
- **CRITICAL** (90-100): Immediate action required, high business impact
- **IMPORTANT** (70-89): Action needed today, moderate impact
- **STANDARD** (50-69): Action recommended this week
- **INFO** (0-49): Informational, no urgency

---

## 6. Alert Types & Classification

### 6.1 Alert Type Classes

**ACTION_NEEDED** (~70% of alerts):
- User decision required
- Appears in action queue
- Has deadline
- Examples: Low stock, pending PO approval, equipment failure

**PREVENTED_ISSUE** (~10% of alerts):
- AI already handled the problem
- Positive framing: "I prevented X by doing Y"
- User awareness only, no action needed
- Examples: "Stock shortage prevented by auto-PO"

**TREND_WARNING** (~15% of alerts):
- Proactive insight about emerging patterns
- Gives user time to prepare
- May become action_needed if ignored
- Examples: "Demand trending up 35% this week"

**ESCALATION** (~3% of alerts):
- Time-sensitive with auto-action countdown
- System will act automatically if user doesn't
- Countdown timer shown prominently
- Examples: "Critical stock, auto-ordering in 2 hours"

**INFORMATION** (~2% of alerts):
- FYI only, no action expected
- Low priority
- Often batched for digest emails
- Examples: "Production batch completed"

### 6.2 Event Domains

- **inventory**: Stock levels, expiration, movements
- **production**: Batches, capacity, equipment
- **procurement**: Purchase orders, deliveries, suppliers
- **forecasting**: Demand predictions, trends
- **orders**: Customer orders, fulfillment
- **orchestrator**: AI-driven automation actions
- **delivery**: Delivery tracking, receipt
- **sales**: Sales analytics, patterns

### 6.3 Alert Type Catalog (40+ types)

#### Inventory Domain
```
critical_stock_shortage (action_needed, critical)
low_stock_warning (action_needed, important)
expired_products (action_needed, critical)
stock_depleted_by_order (information, standard)
stock_received (notification, info)
stock_movement (notification, info)
```

#### Production Domain
```
production_delay (action_needed, important)
equipment_failure (action_needed, critical)
capacity_overload (action_needed, important)
quality_control_failure (action_needed, critical)
batch_state_changed (notification, info)
batch_completed (notification, info)
```

#### Procurement Domain
```
po_approval_needed (action_needed, important)
po_approval_escalation (escalation, critical)
delivery_overdue (action_needed, critical)
po_approved (notification, info)
po_sent (notification, info)
delivery_scheduled (notification, info)
delivery_received (notification, info)
```

#### Delivery Tracking
```
delivery_scheduled (information, info)
delivery_arriving_soon (action_needed, important)
delivery_overdue (action_needed, critical)
stock_receipt_incomplete (action_needed, important)
```

#### Forecasting Domain
```
demand_surge_predicted (trend_warning, important)
weekend_demand_surge (trend_warning, standard)
weather_impact_forecast (trend_warning, standard)
holiday_preparation (trend_warning, important)
```

#### Operations Domain
```
orchestration_run_started (notification, info)
orchestration_run_completed (notification, info)
action_created (notification, info)
```

### 6.4 Placement Hints

**Where alerts appear**:
- `ACTION_QUEUE`: Dashboard action section (action_needed)
- `NOTIFICATION_PANEL`: Bell icon dropdown (notifications)
- `DASHBOARD_INLINE`: Embedded in relevant page section
- `TOAST`: Immediate popup (critical alerts)
- `EMAIL_DIGEST`: End-of-day summary email

---

## 7. Smart Actions & User Agency

### 7.1 Action Types

**Complete Enumeration**:
```python
class SmartActionType(str, Enum):
    # Procurement
    APPROVE_PO = "approve_po"
    REJECT_PO = "reject_po"
    MODIFY_PO = "modify_po"
    CALL_SUPPLIER = "call_supplier"

    # Production
    START_PRODUCTION_BATCH = "start_production_batch"
    RESCHEDULE_PRODUCTION = "reschedule_production"
    HALT_PRODUCTION = "halt_production"

    # Inventory
    MARK_DELIVERY_RECEIVED = "mark_delivery_received"
    COMPLETE_STOCK_RECEIPT = "complete_stock_receipt"
    ADJUST_STOCK_MANUALLY = "adjust_stock_manually"

    # Customer Service
    NOTIFY_CUSTOMER = "notify_customer"
    CANCEL_ORDER = "cancel_order"
    ADJUST_DELIVERY_DATE = "adjust_delivery_date"

    # System
    SNOOZE_ALERT = "snooze_alert"
    DISMISS_ALERT = "dismiss_alert"
    ESCALATE_TO_MANAGER = "escalate_to_manager"
```

### 7.2 Action Lifecycle

**1. Generation** (enrichment stage):
- Service context: What's possible in this situation?
- User agency: Can user execute this action?
- Permissions: Does user have required role?
- Conditional rendering: Disable if prerequisites not met

**2. Display** (frontend):
- Primary action highlighted (most recommended)
- Secondary actions offered (alternatives)
- Disabled actions shown with reason tooltip
- Consequence preview on hover

**3. Execution** (API call):
- Handler routes by action type
- Executes business logic (PO approval, schedule change, etc.)
- Creates audit trail
- Emits follow-up events/notifications
- May create new alerts

**4. Escalation** (if unacted):
- 24h: Alert priority boosted
- 48h: Type changed to escalation
- 72h: Priority boosted further, countdown timer shown
- System may auto-execute if configured

### 7.3 Consequence Preview

**Purpose**: Build trust by showing impact before action

**Example**:
```typescript
{
  action: "approve_po",
  consequence: {
    immediate: "Order will be sent to supplier within 5 minutes",
    timing: "Delivery expected in 2-3 business days",
    cost: "€1,250.00 will be added to monthly expenses",
    impact: "Resolves low stock for 3 ingredients affecting 8 orders"
  }
}
```

**Display**:
- Shown on hover or in confirmation modal
- Highlights positive outcomes (orders fulfilled)
- Notes financial impact (€ amount)
- Clarifies timing (when effect occurs)

---

## 8. Alert Lifecycle & State Transitions

### 8.1 Alert States

```
Created → Active
            ↓
            ├─→ Acknowledged (user saw it)
            ├─→ In Progress (user taking action)
            ├─→ Resolved (action completed)
            ├─→ Dismissed (user chose to ignore)
            └─→ Snoozed (remind me later)
```

### 8.2 State Transitions

**Created → Active**:
- Automatic on creation
- Appears in relevant UI sections based on placement hints

**Active → Acknowledged**:
- User clicks alert or views action queue
- Tracked for analytics (response time)

**Acknowledged → In Progress**:
- User starts working on resolution
- May set estimated completion time

**In Progress → Resolved**:
- Smart action executed successfully
- Or user manually marks as resolved
- `resolved_at` timestamp set

**Active → Dismissed**:
- User chooses not to act
- May require dismissal reason (for audit)

**Active → Snoozed**:
- User requests reminder later (e.g., in 1 hour, tomorrow morning)
- Returns to Active at scheduled time

### 8.3 Key Fields

**Lifecycle Tracking**:
```python
status: AlertStatus          # Current state
created_at: datetime         # When alert was created
acknowledged_at: datetime    # When user first viewed
resolved_at: datetime        # When action completed
action_created_at: datetime  # For escalation age calculation
```

**Interaction Tracking**:
```python
interactions: List[AlertInteraction]  # All user interactions
last_interaction_at: datetime         # Most recent interaction
response_time_seconds: int            # Time to first action
resolution_time_seconds: int          # Time to resolution
```

### 8.4 Alert Interactions

**Tracked Events**:
- `view`: User viewed alert
- `acknowledge`: User acknowledged alert
- `action_taken`: User executed smart action
- `snooze`: User snoozed alert
- `dismiss`: User dismissed alert
- `resolve`: User resolved alert

**Interaction Record**:
```python
class AlertInteraction(Base):
    id: UUID
    tenant_id: UUID
    alert_id: UUID
    user_id: UUID
    interaction_type: InteractionType
    action_type: Optional[SmartActionType]
    metadata: dict  # Context of interaction
    created_at: datetime
```

**Analytics Usage**:
- Measure alert effectiveness (% resolved)
- Track response times (how quickly users act)
- Identify ignored alerts (high dismiss rate)
- Optimize smart action suggestions

---

## 9. Escalation System

### 9.1 Time-Based Escalation

**Purpose**: Prevent action fatigue and ensure critical alerts don't age

**Escalation Rules**:
```python
# Applied hourly to action_needed alerts
if alert.status == "active" and alert.type_class == "action_needed":
    age_hours = (now - alert.action_created_at).total_seconds() / 3600
    escalation_boost = 0

    # Age-based escalation
    if age_hours > 72:
        escalation_boost = 20
    elif age_hours > 48:
        escalation_boost = 10

    # Deadline-based escalation
    if alert.deadline:
        hours_to_deadline = (alert.deadline - now).total_seconds() / 3600
        if hours_to_deadline < 6:
            escalation_boost = max(escalation_boost, 30)
        elif hours_to_deadline < 24:
            escalation_boost = max(escalation_boost, 15)

    # Skip if already critical
    if alert.priority_score >= 90:
        escalation_boost = 0

    # Apply boost (capped at +30)
    alert.priority_score += min(escalation_boost, 30)
    alert.priority_level = calculate_level(alert.priority_score)
```

### 9.2 Escalation Cronjob

**Schedule**: Every hour at minute 15 (`15 * * * *`)

**Configuration**:
```yaml
alert-priority-recalculation-cronjob:
  schedule: "15 * * * *"
  resources:
    memory: 256Mi
    cpu: 100m
  timeout: 30 minutes
  concurrency: Forbid
  batch_size: 50
```

**Processing Logic**:
1. Query all `action_needed` alerts with `status=active`
2. Batch process (50 alerts at a time)
3. Calculate escalation boost for each
4. Update `priority_score` and `priority_level`
5. Add `escalation_metadata` (boost amount, reason)
6. Invalidate Redis cache (`tenant:{id}:alerts:*`)
7. Log escalation events for analytics

### 9.3 Escalation Metadata

**Stored in enrichment_context**:
```json
{
  "escalation": {
    "applied_at": "2025-11-25T15:00:00Z",
    "boost_amount": 20,
    "reason": "pending_72h",
    "previous_score": 65,
    "new_score": 85,
    "previous_level": "standard",
    "new_level": "important"
  }
}
```

### 9.4 Escalation to Auto-Action

**When**:
- Alert >72h old
- Priority ≥90 (critical)
- Has auto-action configured

**Process**:
```python
if age_hours > 72 and priority_score >= 90:
    alert.type_class = "escalation"
    alert.auto_action_countdown_seconds = 7200  # 2 hours
    alert.auto_action_type = determine_auto_action(alert)
    alert.auto_action_metadata = {...}
```

**Frontend Display**:
- Shows countdown timer: "Auto-approving PO in 1h 23m"
- Primary action becomes "Cancel Auto-Action"
- User can cancel or let system proceed

---

## 10. Alert Chaining & Deduplication

### 10.1 Deduplication Strategy

**Purpose**: Prevent alert spam when same issue detected multiple times

**Deduplication Key**:
```python
def generate_dedup_key(tenant_id, alert_type, entity_ids):
    key_parts = [alert_type]

    # Add entity identifiers
    if product_id := entity_ids.get("product_id"):
        key_parts.append(f"product:{product_id}")
    if supplier_id := entity_ids.get("supplier_id"):
        key_parts.append(f"supplier:{supplier_id}")
    if batch_id := entity_ids.get("batch_id"):
        key_parts.append(f"batch:{batch_id}")

    key = ":".join(key_parts)
    return f"{tenant_id}:alert:{key}"
```

**Redis Check**:
```python
dedup_key = generate_dedup_key(...)
if redis.exists(dedup_key):
    return  # Skip, alert already exists
else:
    redis.setex(dedup_key, 900, "1")  # 15-minute window
    create_alert(...)
```

### 10.2 Alert Chaining

**Purpose**: Link related alerts to tell coherent story

**Database Fields** (added in migration 20251123):
```python
action_created_at: datetime      # Original creation time (for age)
superseded_by_action_id: UUID    # Links to solving action
hidden_from_ui: bool             # Hide superseded alerts
```

### 10.3 Chaining Methods

**1. Mark as Superseded**:
```python
def mark_alert_as_superseded(alert_id, solving_action_id):
    alert = db.query(Alert).filter(Alert.id == alert_id).first()
    alert.superseded_by_action_id = solving_action_id
    alert.hidden_from_ui = True
    alert.updated_at = now()
    db.commit()

    # Invalidate cache
    redis.delete(f"tenant:{alert.tenant_id}:alerts:*")
```

**2. Create Combined Alert**:
```python
def create_combined_alert(original_alert, solving_action):
    # Create new prevented_issue alert
    combined_alert = Alert(
        tenant_id=original_alert.tenant_id,
        alert_type="prevented_issue",
        type_class="prevented_issue",
        title=f"Stock shortage prevented",
        message=f"I detected low stock for {product_name} and created "
                f"PO-{po_number} automatically. Order will arrive in 2 days.",
        priority_level="info",
        metadata={
            "original_alert_id": str(original_alert.id),
            "solving_action_id": str(solving_action.id),
            "problem": original_alert.message,
            "solution": solving_action.description
        }
    )
    db.add(combined_alert)
    db.commit()

    # Mark original as superseded
    mark_alert_as_superseded(original_alert.id, combined_alert.id)
```

**3. Find Related Alerts**:
```python
def find_related_alert(tenant_id, alert_type, product_id):
    return db.query(Alert).filter(
        Alert.tenant_id == tenant_id,
        Alert.alert_type == alert_type,
        Alert.metadata['product_id'].astext == product_id,
        Alert.created_at > now() - timedelta(hours=24),
        Alert.hidden_from_ui == False
    ).first()
```

**4. Filter Hidden Alerts**:
```python
def get_active_alerts(tenant_id):
    return db.query(Alert).filter(
        Alert.tenant_id == tenant_id,
        Alert.status.in_(["active", "acknowledged"]),
        Alert.hidden_from_ui == False  # Exclude superseded alerts
    ).all()
```

### 10.4 Chaining Example Flow

```
Step 1: Low stock detected
  → Create LOW_STOCK alert (action_needed, priority: 75)
  → User sees "Low stock for flour, action needed"

Step 2: Daily Orchestrator runs
  → Finds LOW_STOCK alert
  → Creates purchase order automatically
  → PO-12345 created with delivery date

Step 3: Orchestrator chains alerts
  → Calls mark_alert_as_superseded(low_stock_alert.id, po.id)
  → Creates PREVENTED_ISSUE alert
  → Message: "I prevented flour shortage by creating PO-12345.
    Delivery arrives Nov 28. Approve or modify if needed."

Step 4: User sees only prevented_issue alert
  → Original low stock alert hidden from UI
  → User understands: problem detected → AI acted → needs approval
  → Single coherent narrative, not 3 separate alerts
```

---

## 11. Cronjob Integration

### 11.1 Why CronJobs Are Needed

**Event System Cannot**:
- Emit events "2 hours before delivery"
- Detect "alert is now 48 hours old"
- Poll external state (procurement PO status)

**CronJobs Excel At**:
- Time-based conditions
- Periodic checks
- Predictive alerts
- Batch recalculations

### 11.2 Delivery Tracking CronJob

**Schedule**: Every hour at minute 30 (`30 * * * *`)

**Configuration**:
```yaml
delivery-tracking-cronjob:
  schedule: "30 * * * *"
  resources:
    memory: 256Mi
    cpu: 100m
  timeout: 30 minutes
  concurrency: Forbid
```

**Service**: `DeliveryTrackingService` in Orchestrator

**Processing Flow**:
```python
def check_expected_deliveries():
    # Query procurement service for expected deliveries
    deliveries = procurement_api.get_expected_deliveries(
        from_date=now(),
        to_date=now() + timedelta(days=3)
    )

    for delivery in deliveries:
        current_time = now()
        expected_time = delivery.expected_delivery_datetime
        window_start = delivery.delivery_window_start
        window_end = delivery.delivery_window_end

        # T-2h: Arriving soon alert
        if current_time >= (window_start - timedelta(hours=2)) and \
           current_time < window_start:
            send_arriving_soon_alert(delivery)

        # T+30min: Overdue alert
        elif current_time > (window_end + timedelta(minutes=30)) and \
             not delivery.marked_received:
            send_overdue_alert(delivery)

        # Window passed, not received: Incomplete alert
        elif current_time > (window_end + timedelta(hours=2)) and \
             not delivery.marked_received and \
             not delivery.stock_receipt_id:
            send_receipt_incomplete_alert(delivery)
```

**Alert Types Generated**:

1. **DELIVERY_ARRIVING_SOON** (T-2h):
```python
{
    "alert_type": "delivery_arriving_soon",
    "type_class": "action_needed",
    "priority_level": "important",
    "placement": "action_queue",
    "smart_actions": [
        {
            "type": "mark_delivery_received",
            "label": "Mark as Received",
            "variant": "primary"
        }
    ]
}
```

2. **DELIVERY_OVERDUE** (T+30min):
```python
{
    "alert_type": "delivery_overdue",
    "type_class": "action_needed",
    "priority_level": "critical",
    "priority_score": 95,
    "smart_actions": [
        {
            "type": "call_supplier",
            "label": "Call Supplier",
            "metadata": { "supplier_contact": "+34-123-456-789" }
        }
    ]
}
```

3. **STOCK_RECEIPT_INCOMPLETE** (Post-window):
```python
{
    "alert_type": "stock_receipt_incomplete",
    "type_class": "action_needed",
    "priority_level": "important",
    "priority_score": 80,
    "smart_actions": [
        {
            "type": "complete_stock_receipt",
            "label": "Complete Stock Receipt",
            "metadata": { "po_id": "...", "draft_receipt_id": "..." }
        }
    ]
}
```

### 11.3 Delivery Alert Lifecycle

```
PO Approved
  ↓
DELIVERY_SCHEDULED (informational, notification_panel)
  ↓ T-2 hours
DELIVERY_ARRIVING_SOON (action_needed, action_queue)
  ↓ Expected time + 30 min
DELIVERY_OVERDUE (critical, action_queue + toast)
  ↓ Window passed + 2 hours
STOCK_RECEIPT_INCOMPLETE (important, action_queue)
```

### 11.4 Priority Recalculation CronJob

See [Section 9.2](#92-escalation-cronjob) for details.

### 11.5 Decision Matrix: Events vs CronJobs

| Feature | Event System | CronJob | Best Choice |
|---------|--------------|---------|-------------|
| State change notification | ✅ Excellent | ❌ Poor | Event System |
| Time-based alerts | ❌ Complex | ✅ Simple | CronJob ✅ |
| Real-time updates | ✅ Instant | ❌ Delayed | Event System |
| Predictive alerts | ❌ Hard | ✅ Easy | CronJob ✅ |
| Priority escalation | ❌ Complex | ✅ Natural | CronJob ✅ |
| Deadline tracking | ❌ Complex | ✅ Simple | CronJob ✅ |
| Batch processing | ❌ Not designed | ✅ Ideal | CronJob ✅ |

---

## 12. Service Integration Patterns

### 12.1 Base Alert Service

**All services extend**: `BaseAlertService` from `shared/alerts/base_service.py`

**Core Method**:
```python
async def publish_item(
    self,
    tenant_id: UUID,
    item_data: dict,
    item_type: ItemType = ItemType.ALERT
):
    # Validate schema
    validated_item = validate_item(item_data, item_type)

    # Generate deduplication key
    dedup_key = self.generate_dedup_key(tenant_id, validated_item)

    # Check Redis for duplicates (15-minute window)
    if await self.redis.exists(dedup_key):
        logger.info(f"Skipping duplicate {item_type}: {dedup_key}")
        return

    # Publish to RabbitMQ
    await self.rabbitmq.publish(
        exchange="alerts.exchange",
        routing_key=f"{item_type}.{validated_item['severity']}",
        message={
            "tenant_id": str(tenant_id),
            "item_type": item_type,
            "data": validated_item
        }
    )

    # Set deduplication key
    await self.redis.setex(dedup_key, 900, "1")  # 15 minutes
```

### 12.2 Inventory Service

**Service Class**: `InventoryAlertService`

**Background Jobs**:
```python
# Check stock levels every 5 minutes
@scheduler.scheduled_job('interval', minutes=5)
async def check_stock_levels():
    service = InventoryAlertService()
    critical_items = await service.find_critical_stock()

    for item in critical_items:
        await service.publish_item(
            tenant_id=item.tenant_id,
            item_data={
                "type": "critical_stock_shortage",
                "severity": "high",
                "title": f"Critical: {item.name} stock depleted",
                "message": f"Only {item.current_stock}{item.unit} remaining. "
                           f"Required: {item.minimum_stock}{item.unit}",
                "actions": ["approve_po", "call_supplier"],
                "metadata": {
                    "ingredient_id": str(item.id),
                    "current_stock": item.current_stock,
                    "minimum_stock": item.minimum_stock,
                    "unit": item.unit
                }
            },
            item_type=ItemType.ALERT
        )

# Check expiring products every 2 hours
@scheduler.scheduled_job('interval', hours=2)
async def check_expiring_products():
    # Similar pattern...
```

**Event-Driven Alerts**:
```python
# Listen to order events
@event_handler("order.created")
async def on_order_created(event):
    service = InventoryAlertService()
    order = event.data

    # Check if order depletes stock below threshold
    for item in order.items:
        stock_after_order = calculate_remaining_stock(item)
        if stock_after_order < item.minimum_stock:
            await service.publish_item(
                tenant_id=order.tenant_id,
                item_data={
                    "type": "stock_depleted_by_order",
                    "severity": "medium",
                    # ... details
                },
                item_type=ItemType.ALERT
            )
```

**Recommendations**:
```python
async def analyze_inventory_optimization():
    # Analyze stock patterns
    # Generate optimization recommendations
    await service.publish_item(
        tenant_id=tenant_id,
        item_data={
            "type": "inventory_optimization",
            "title": "Reduce waste by adjusting par levels",
            "suggested_actions": ["adjust_par_levels"],
            "estimated_impact": "Save €250/month",
            "confidence_score": 0.85
        },
        item_type=ItemType.RECOMMENDATION
    )
```

### 12.3 Production Service

**Service Class**: `ProductionAlertService`

**Background Jobs**:
```python
@scheduler.scheduled_job('interval', minutes=15)
async def check_production_capacity():
    # Check if scheduled batches exceed capacity
    # Emit capacity_overload alerts

@scheduler.scheduled_job('interval', minutes=10)
async def check_production_delays():
    # Check batches behind schedule
    # Emit production_delay alerts
```

**Event-Driven**:
```python
@event_handler("equipment.status_changed")
async def on_equipment_failure(event):
    if event.data.status == "failed":
        await service.publish_item(
            item_data={
                "type": "equipment_failure",
                "severity": "high",
                "priority_score": 95,  # Manual override
                # ...
            }
        )
```

### 12.4 Forecasting Service

**Service Class**: `ForecastingRecommendationService`

**Scheduled Analysis**:
```python
@scheduler.scheduled_job('cron', day_of_week='fri', hour=15)
async def check_weekend_demand_surge():
    forecast = await get_weekend_forecast()

    if forecast.predicted_demand > (forecast.baseline * 1.3):
        await service.publish_item(
            item_data={
                "type": "demand_surge_weekend",
                "title": "Weekend demand surge predicted",
                "message": f"Demand trending up {forecast.increase_pct}%. "
                           f"Consider increasing production.",
                "suggested_actions": ["increase_production"],
                "confidence_score": forecast.confidence
            },
            item_type=ItemType.RECOMMENDATION
        )
```

### 12.5 Procurement Service

**Service Class**: `ProcurementEventService` (mixed alerts + notifications)

**Event-Driven**:
```python
@event_handler("po.created")
async def on_po_created(event):
    po = event.data

    if po.amount > APPROVAL_THRESHOLD:
        # Emit alert requiring approval
        await service.publish_item(
            item_data={
                "type": "po_approval_needed",
                "severity": "medium",
                # ...
            },
            item_type=ItemType.ALERT
        )
    else:
        # Emit notification (auto-approved)
        await service.publish_item(
            item_data={
                "type": "po_approved",
                "message": f"PO-{po.number} auto-approved (€{po.amount})",
                "old_state": "draft",
                "new_state": "approved"
            },
            item_type=ItemType.NOTIFICATION
        )
```

---

## 13. Frontend Integration

### 13.1 React Hooks Catalog (18 hooks)

#### Alert Hooks (4)
```typescript
// Subscribe to all critical alerts
const { alerts, criticalAlerts, isLoading } = useAlerts({
  domains: ['inventory', 'production'],
  minPriority: 'important'
});

// Critical alerts only
const { criticalAlerts } = useCriticalAlerts();

// Action-needed alerts only
const { alerts } = useActionNeededAlerts();

// Domain-specific alerts
const { alerts } = useAlertsByDomain('inventory');
```

#### Notification Hooks (9)
```typescript
// All notifications
const { notifications } = useEventNotifications();

// Domain-specific notifications
const { notifications } = useProductionNotifications();
const { notifications } = useInventoryNotifications();
const { notifications } = useSupplyChainNotifications();
const { notifications } = useOperationsNotifications();

// Type-specific notifications
const { notifications } = useBatchNotifications();
const { notifications } = useDeliveryNotifications();
const { notifications } = useOrchestrationNotifications();

// Generic domain filter
const { notifications } = useNotificationsByDomain('production');
```

#### Recommendation Hooks (5)
```typescript
// All recommendations
const { recommendations } = useRecommendations();

// Type-specific recommendations
const { recommendations } = useDemandRecommendations();
const { recommendations } = useInventoryOptimizationRecommendations();
const { recommendations } = useCostReductionRecommendations();

// High confidence only
const { recommendations } = useHighConfidenceRecommendations(0.8);

// Generic filters
const { recommendations } = useRecommendationsByDomain('forecasting');
const { recommendations } = useRecommendationsByType('demand_surge');
```

### 13.2 Base SSE Hook

**`useSSE` Hook**:
```typescript
function useSSE(channels: string[]) {
  const [events, setEvents] = useState([]);
  const [isConnected, setIsConnected] = useState(false);

  useEffect(() => {
    const eventSource = new EventSource(
      `/api/events/sse?channels=${channels.join(',')}`
    );

    eventSource.onopen = () => setIsConnected(true);
    eventSource.onmessage = (event) => {
      const data = JSON.parse(event.data);
      setEvents(prev => [data, ...prev]);
    };
    eventSource.onerror = () => setIsConnected(false);

    return () => eventSource.close();
  }, [channels.join(',')]); // depend on a primitive so a fresh array literal each render doesn't tear down the connection

  return { events, isConnected };
}
```

### 13.3 TypeScript Definitions

**Alert Type**:

```typescript
interface Alert {
  id: string;
  tenant_id: string;
  alert_type: string;
  type_class: AlertTypeClass;
  service: string;
  title: string;
  message: string;
  status: AlertStatus;
  priority_score: number;
  priority_level: PriorityLevel;

  // Enrichment
  orchestrator_context?: OrchestratorContext;
  business_impact?: BusinessImpact;
  urgency_context?: UrgencyContext;
  user_agency?: UserAgency;
  trend_context?: TrendContext;

  // Actions
  smart_actions?: SmartAction[];

  // Metadata
  alert_metadata?: Record<string, unknown>;
  created_at: string;
  updated_at: string;
  resolved_at?: string;
}

enum AlertTypeClass {
  ACTION_NEEDED = "action_needed",
  PREVENTED_ISSUE = "prevented_issue",
  TREND_WARNING = "trend_warning",
  ESCALATION = "escalation",
  INFORMATION = "information"
}

enum PriorityLevel {
  CRITICAL = "critical",
  IMPORTANT = "important",
  STANDARD = "standard",
  INFO = "info"
}

enum AlertStatus {
  ACTIVE = "active",
  ACKNOWLEDGED = "acknowledged",
  IN_PROGRESS = "in_progress",
  RESOLVED = "resolved",
  DISMISSED = "dismissed",
  SNOOZED = "snoozed"
}
```

### 13.4 Component Integration Examples

**Action Queue Card**:

```typescript
function UnifiedActionQueueCard() {
  const { alerts } = useAlerts({
    typeClass: ['action_needed', 'escalation'],
    includeResolved: false
  });

  const groupedAlerts = useMemo(() => {
    return groupByTimeCategory(alerts);
    // Returns: { urgent: [...], today: [...], thisWeek: [...] }
  }, [alerts]);

  return (

    <Card>
      {/* Card, CardHeader, CardContent, and AlertGroup are illustrative
          placeholder components */}
      <CardHeader>Actions Needed</CardHeader>
      <CardContent>
        {groupedAlerts.urgent.length > 0 && (
          <AlertGroup title="Urgent" alerts={groupedAlerts.urgent} />
        )}
        {groupedAlerts.today.length > 0 && (
          <AlertGroup title="Today" alerts={groupedAlerts.today} />
        )}
      </CardContent>
    </Card>
  );
}
```

**Health Hero Component**:

```typescript
function GlanceableHealthHero() {
  const { criticalAlerts } = useCriticalAlerts();
  const { notifications } = useEventNotifications();

  const healthStatus = useMemo(() => {
    if (criticalAlerts.length > 0) return 'red';
    if (hasUrgentNotifications(notifications)) return 'yellow';
    return 'green';
  }, [criticalAlerts, notifications]);

  // HeroBanner and CriticalAlertList are illustrative placeholders
  return (
    <HeroBanner status={healthStatus}>
      {healthStatus === 'red' && (
        <CriticalAlertList alerts={criticalAlerts} />
      )}
    </HeroBanner>
  );
}
```

**Event-Driven Refetch**:

```typescript
function InventoryStats() {
  const { data, refetch } = useInventoryStats();
  const { notifications } = useInventoryNotifications();

  useEffect(() => {
    const relevantEvent = notifications.find(
      n => n.event_type === 'stock_received'
    );
    if (relevantEvent) {
      refetch(); // Update stats on stock change
    }
  }, [notifications, refetch]);

  // StatsDisplay is an illustrative placeholder
  return <StatsDisplay data={data} />;
}
```

---

## 14. Redis Pub/Sub Architecture

### 14.1 Channel Naming Convention

**Pattern**: `tenant:{tenant_id}:{domain}.{event_type}`

**Examples**:

```
tenant:123e4567-e89b-12d3-a456-426614174000:inventory.alerts
tenant:123e4567-e89b-12d3-a456-426614174000:inventory.notifications
tenant:123e4567-e89b-12d3-a456-426614174000:production.alerts
tenant:123e4567-e89b-12d3-a456-426614174000:production.notifications
tenant:123e4567-e89b-12d3-a456-426614174000:supply_chain.alerts
tenant:123e4567-e89b-12d3-a456-426614174000:supply_chain.notifications
tenant:123e4567-e89b-12d3-a456-426614174000:operations.notifications
tenant:123e4567-e89b-12d3-a456-426614174000:recommendations
```

### 14.2 Domain-Based Routing

**Alert Processor publishes to Redis**:

```python
def publish_to_redis(alert):
    domain = alert.domain  # inventory, production, etc.
    channel = f"tenant:{alert.tenant_id}:{domain}.alerts"
    redis.publish(channel, json.dumps({
        "id": str(alert.id),
        "alert_type": alert.alert_type,
        "type_class": alert.type_class,
        "priority_level": alert.priority_level,
        "title": alert.title,
        "message": alert.message,
        # ...
full alert data })) ``` ### 14.3 Gateway SSE Endpoint **Multi-Channel Subscription**: ```python @app.get("/api/events/sse") async def sse_endpoint( channels: str, # Comma-separated: "inventory.alerts,production.alerts" tenant_id: UUID = Depends(get_current_tenant) ): async def event_stream(): pubsub = redis.pubsub() # Subscribe to requested channels for channel in channels.split(','): full_channel = f"tenant:{tenant_id}:{channel}" await pubsub.subscribe(full_channel) # Stream events async for message in pubsub.listen(): if message['type'] == 'message': yield f"data: {message['data']}\n\n" return StreamingResponse( event_stream(), media_type="text/event-stream" ) ``` **Wildcard Support**: ```typescript // Frontend can subscribe to: "*.alerts" // All alert channels "inventory.*" // All inventory events "*.notifications" // All notification channels ``` ### 14.4 Traffic Reduction **Before (legacy)**: - All pages subscribe to single `tenant:{id}:events` channel - 100% of events sent to all pages - High bandwidth, slow filtering **After (domain-based)**: - Dashboard: Subscribes to `*.alerts`, `*.notifications`, `recommendations` - Inventory page: Subscribes to `inventory.alerts`, `inventory.notifications` - Production page: Subscribes to `production.alerts`, `production.notifications` **Traffic Reduction by Page**: | Page | Old Traffic | New Traffic | Reduction | |------|-------------|-------------|-----------| | Dashboard | 100% | 100% | 0% (needs all) | | Inventory | 100% | 15% | **85%** | | Production | 100% | 20% | **80%** | | Supply Chain | 100% | 18% | **82%** | **Average**: 70% reduction on specialized pages --- ## 15. Database Schema ### 15.1 Alerts Table ```sql CREATE TABLE alerts ( -- Identity id UUID PRIMARY KEY DEFAULT gen_random_uuid(), tenant_id UUID NOT NULL, -- Classification alert_type VARCHAR(100) NOT NULL, type_class VARCHAR(50) NOT NULL, -- action_needed, prevented_issue, etc. 
service VARCHAR(50) NOT NULL, event_domain VARCHAR(50), -- Added in migration 20251125 -- Content title VARCHAR(500) NOT NULL, message TEXT NOT NULL, -- Status status VARCHAR(50) NOT NULL DEFAULT 'active', -- Priority priority_score INTEGER NOT NULL DEFAULT 50, priority_level VARCHAR(50) NOT NULL DEFAULT 'standard', -- Enrichment Context (JSONB) orchestrator_context JSONB, business_impact JSONB, urgency_context JSONB, user_agency JSONB, trend_context JSONB, -- Smart Actions smart_actions JSONB, -- Array of action objects -- Timing timing_decision VARCHAR(50), scheduled_send_time TIMESTAMP, -- Escalation (Added in migration 20251123) action_created_at TIMESTAMP, -- For age calculation superseded_by_action_id UUID, -- Links to solving action hidden_from_ui BOOLEAN DEFAULT FALSE, -- Metadata alert_metadata JSONB, -- Timestamps created_at TIMESTAMP NOT NULL DEFAULT NOW(), updated_at TIMESTAMP NOT NULL DEFAULT NOW(), resolved_at TIMESTAMP, -- Foreign Keys FOREIGN KEY (tenant_id) REFERENCES tenants(id) ON DELETE CASCADE ); ``` ### 15.2 Indexes ```sql -- Tenant filtering CREATE INDEX idx_alerts_tenant_status ON alerts(tenant_id, status); -- Priority sorting CREATE INDEX idx_alerts_tenant_priority_created ON alerts(tenant_id, priority_score DESC, created_at DESC); -- Type class filtering CREATE INDEX idx_alerts_tenant_typeclass_status ON alerts(tenant_id, type_class, status); -- Timing queries CREATE INDEX idx_alerts_timing_scheduled ON alerts(timing_decision, scheduled_send_time); -- Escalation queries (Added in migration 20251123) CREATE INDEX idx_alerts_tenant_action_created ON alerts(tenant_id, action_created_at); CREATE INDEX idx_alerts_superseded_by ON alerts(superseded_by_action_id); CREATE INDEX idx_alerts_tenant_hidden_status ON alerts(tenant_id, hidden_from_ui, status); -- Domain filtering (Added in migration 20251125) CREATE INDEX idx_alerts_tenant_domain ON alerts(tenant_id, event_domain); ``` ### 15.3 Alert Interactions Table ```sql CREATE TABLE 
alert_interactions ( id UUID PRIMARY KEY DEFAULT gen_random_uuid(), tenant_id UUID NOT NULL, alert_id UUID NOT NULL, user_id UUID NOT NULL, -- Interaction type interaction_type VARCHAR(50) NOT NULL, -- view, acknowledge, action_taken, etc. action_type VARCHAR(50), -- Smart action type if applicable -- Context metadata JSONB, response_time_seconds INTEGER, -- Time from alert creation to this interaction -- Timestamps created_at TIMESTAMP NOT NULL DEFAULT NOW(), -- Foreign Keys FOREIGN KEY (tenant_id) REFERENCES tenants(id) ON DELETE CASCADE, FOREIGN KEY (alert_id) REFERENCES alerts(id) ON DELETE CASCADE, FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE CASCADE ); CREATE INDEX idx_interactions_alert ON alert_interactions(alert_id); CREATE INDEX idx_interactions_tenant_user ON alert_interactions(tenant_id, user_id); ``` ### 15.4 Notifications Table ```sql CREATE TABLE notifications ( id UUID PRIMARY KEY DEFAULT gen_random_uuid(), tenant_id UUID NOT NULL, -- Classification event_type VARCHAR(100) NOT NULL, event_domain VARCHAR(50) NOT NULL, -- Content title VARCHAR(500) NOT NULL, message TEXT NOT NULL, -- State change tracking entity_type VARCHAR(50), -- "purchase_order", "batch", etc. 
entity_id UUID, old_state VARCHAR(50), new_state VARCHAR(50), -- Display placement_hint VARCHAR(50) DEFAULT 'notification_panel', -- Metadata notification_metadata JSONB, -- Timestamps created_at TIMESTAMP NOT NULL DEFAULT NOW(), expires_at TIMESTAMP DEFAULT (NOW() + INTERVAL '7 days'), -- Foreign Keys FOREIGN KEY (tenant_id) REFERENCES tenants(id) ON DELETE CASCADE ); CREATE INDEX idx_notifications_tenant_created ON notifications(tenant_id, created_at DESC); CREATE INDEX idx_notifications_tenant_domain ON notifications(tenant_id, event_domain); ``` ### 15.5 Recommendations Table ```sql CREATE TABLE recommendations ( id UUID PRIMARY KEY DEFAULT gen_random_uuid(), tenant_id UUID NOT NULL, -- Classification recommendation_type VARCHAR(100) NOT NULL, event_domain VARCHAR(50) NOT NULL, -- Content title VARCHAR(500) NOT NULL, message TEXT NOT NULL, -- Actions & Impact suggested_actions JSONB, -- Array of suggested action types estimated_impact TEXT, -- "Save €250/month" confidence_score DECIMAL(3, 2), -- 0.00 - 1.00 -- Status status VARCHAR(50) DEFAULT 'active', -- active, dismissed, implemented -- Metadata recommendation_metadata JSONB, -- Timestamps created_at TIMESTAMP NOT NULL DEFAULT NOW(), expires_at TIMESTAMP DEFAULT (NOW() + INTERVAL '30 days'), dismissed_at TIMESTAMP, -- Foreign Keys FOREIGN KEY (tenant_id) REFERENCES tenants(id) ON DELETE CASCADE ); CREATE INDEX idx_recommendations_tenant_status ON recommendations(tenant_id, status); CREATE INDEX idx_recommendations_tenant_domain ON recommendations(tenant_id, event_domain); ``` ### 15.6 Migrations **Key Migrations**: 1. **20251015_1230_initial_schema.py** - Created alerts, notifications, recommendations tables - Initial indexes - Full enrichment fields 2. 
**20251123_add_alert_enhancements.py** - Added `action_created_at` for escalation tracking - Added `superseded_by_action_id` for chaining - Added `hidden_from_ui` flag - Created indexes for escalation queries - Backfilled `action_created_at` for existing alerts 3. **20251125_add_event_domain_column.py** - Added `event_domain` to alerts table - Added index on (tenant_id, event_domain) - Populated domain from existing alert_type patterns --- ## 16. Performance & Monitoring ### 16.1 Performance Metrics **Processing Speed**: - Alert enrichment: 500-800ms (full pipeline) - Notification processing: 20-30ms (80% faster) - Recommendation processing: 50-80ms (60% faster) - Average improvement: 54% **Database Query Performance**: - Get active alerts by tenant: <50ms - Get critical alerts with priority sort: <100ms - Escalation age calculation: <150ms - Alert chaining lookup: <75ms **API Response Times**: - GET /alerts (paginated): <200ms - POST /alerts/{id}/acknowledge: <50ms - POST /alerts/{id}/resolve: <100ms **SSE Traffic**: - Legacy (single channel): 100% of events to all pages - New (domain-based): 70% reduction on specialized pages - Dashboard: No change (needs all events) - Domain pages: 80-85% reduction ### 16.2 Caching Strategy **Redis Cache Keys**: ``` tenant:{tenant_id}:alerts:active tenant:{tenant_id}:alerts:critical tenant:{tenant_id}:orchestrator_context:{action_id} ``` **Cache Invalidation**: - On alert creation: Invalidate `alerts:active` - On priority update: Invalidate `alerts:critical` - On escalation: Invalidate all alert caches - On resolution: Invalidate both active and critical **TTL**: - Alert lists: 5 minutes - Orchestrator context: 15 minutes - Deduplication keys: 15 minutes ### 16.3 Monitoring Metrics **Prometheus Metrics**: ```python # Alert creation rate alert_created_total = Counter('alert_created_total', 'Total alerts created', ['tenant_id', 'alert_type']) # Enrichment timing enrichment_duration_seconds = 
Histogram('enrichment_duration_seconds', 'Enrichment processing time', ['event_type']) # Priority distribution alert_priority_distribution = Histogram('alert_priority_distribution', 'Alert priority scores', ['priority_level']) # Resolution metrics alert_resolution_time_seconds = Histogram('alert_resolution_time_seconds', 'Time to resolve alerts', ['alert_type']) # Escalation tracking alert_escalated_total = Counter('alert_escalated_total', 'Alerts escalated', ['escalation_reason']) # Deduplication hits alert_deduplicated_total = Counter('alert_deduplicated_total', 'Alerts deduplicated', ['alert_type']) ``` **Key Metrics to Monitor**: - Alert creation rate (per tenant, per type) - Average resolution time (should decrease over time) - Escalation rate (high rate indicates alerts being ignored) - Deduplication hit rate (should be 10-20%) - Enrichment performance (p50, p95, p99) - SSE connection count and duration ### 16.4 Health Checks **Alert Processor Health**: ```python @app.get("/health") async def health_check(): checks = { "database": await check_db_connection(), "redis": await check_redis_connection(), "rabbitmq": await check_rabbitmq_connection(), "orchestrator_api": await check_orchestrator_api() } overall_healthy = all(checks.values()) status_code = 200 if overall_healthy else 503 return JSONResponse( status_code=status_code, content={ "status": "healthy" if overall_healthy else "unhealthy", "checks": checks, "timestamp": datetime.utcnow().isoformat() } ) ``` **CronJob Monitoring**: ```yaml # Kubernetes CronJob metrics - Last successful run timestamp - Last failed run timestamp - Average execution duration - Alert count processed per run - Error count per run ``` ### 16.5 Troubleshooting Guide **Problem**: Alerts not appearing in frontend **Diagnosis**: 1. Check alert created in database: `SELECT * FROM alerts WHERE tenant_id=... ORDER BY created_at DESC LIMIT 10;` 2. Check Redis pub/sub: `SUBSCRIBE tenant:{id}:inventory.alerts` 3. 
Check SSE connection: Browser dev tools → Network → EventStream 4. Check frontend hook subscription: Console logs **Problem**: Slow enrichment **Diagnosis**: 1. Check Prometheus metrics for `enrichment_duration_seconds` 2. Identify slow enrichment service (orchestrator, priority scoring, etc.) 3. Check orchestrator API response time 4. Review database query performance (EXPLAIN ANALYZE) **Problem**: High escalation rate **Diagnosis**: 1. Query alerts by age: `SELECT alert_type, COUNT(*) FROM alerts WHERE action_created_at < NOW() - INTERVAL '48 hours' GROUP BY alert_type;` 2. Check if certain alert types are consistently ignored 3. Review smart actions (are they actionable?) 4. Check user permissions (can users actually execute actions?) **Problem**: Duplicate alerts **Diagnosis**: 1. Check deduplication key generation logic 2. Verify Redis connection (dedup keys being set?) 3. Review deduplication window (15 minutes may be too short) 4. Check for race conditions in concurrent alert creation --- ## 17. Deployment Guide ### 17.1 5-Week Deployment Timeline **Week 1: Backend & Gateway** - Day 1: Database migration in dev environment - Day 2-3: Deploy alert processor with dual publishing - Day 4: Deploy updated gateway - Day 5: Monitoring & validation **Week 2-3: Frontend Integration** - Dashboard components with event hooks - Priority components (ActionQueue, HealthHero, ExecutionTracker) - Domain pages (Inventory, Production, Supply Chain) **Week 4: Cutover** - Verify complete migration - Remove dual publishing - Database cleanup (remove legacy columns) **Week 5: Optimization** - Performance tuning - Monitoring dashboards - Alert rules refinement ### 17.2 Pre-Deployment Checklist - ✅ Database migration scripts tested - ✅ Backward compatibility verified - ✅ Rollback procedure documented - ✅ Monitoring metrics defined - ✅ Performance benchmarks set - ✅ Example integrations tested - ✅ Documentation complete ### 17.3 Rollback Procedure **If issues occur**: 1. 
Stop new alert processor deployment 2. Revert gateway to previous version 3. Roll back database migration (if safe) 4. Resume dual publishing if partially migrated 5. Investigate root cause 6. Fix and redeploy --- ## Appendix ### Related Documentation - [Frontend README](../frontend/README.md) - Frontend architecture and components - [Alert Processor Service README](../services/alert_processor/README.md) - Service implementation details - [Inventory Service README](../services/inventory/README.md) - Stock receipt system - [Orchestrator Service README](../services/orchestrator/README.md) - Delivery tracking - [Technical Documentation Summary](./TECHNICAL-DOCUMENTATION-SUMMARY.md) - System overview ### Version History - **v2.0** (2025-11-25): Complete architecture with escalation, chaining, cronjobs - **v1.5** (2025-11-23): Added stock receipt system and delivery tracking - **v1.0** (2025-11-15): Initial three-tier enrichment system ### Contributors This alert system was designed and implemented collaboratively to support the Bakery-IA platform's mission of providing intelligent, context-aware alerts that respect user time and decision-making agency. --- **Last Updated**: 2025-11-25 **Status**: Production-Ready ✅ **Next Review**: As needed based on system evolution