bakery-ia/docs/ALERT-SYSTEM-ARCHITECTURE.md

Alert System Architecture

Last Updated: 2025-11-25 | Status: Production-Ready | Version: 2.0


Table of Contents

  1. Overview
  2. Event Flow & Lifecycle
  3. Three-Tier Enrichment Strategy
  4. Enrichment Process
  5. Priority Scoring Algorithm
  6. Alert Types & Classification
  7. Smart Actions & User Agency
  8. Alert Lifecycle & State Transitions
  9. Escalation System
  10. Alert Chaining & Deduplication
  11. Cronjob Integration
  12. Service Integration Patterns
  13. Frontend Integration
  14. Redis Pub/Sub Architecture
  15. Database Schema
  16. Performance & Monitoring

1. Overview

1.1 Philosophy

The Bakery-IA alert system transforms passive notifications into context-aware, actionable guidance. Every alert includes enrichment context, priority scoring, and suggested actions, enabling users to make informed decisions quickly.

Core Principles:

  • Alerts are not just notifications - they're AI-enhanced action items
  • Context over noise - Every alert includes business impact and suggested actions
  • Smart prioritization - Multi-factor scoring ensures critical issues surface first
  • Progressive enhancement - Different event types get appropriate enrichment levels
  • User agency - System respects what users can actually control

1.2 Architecture Goals

  • Performance: 80% faster notification processing, 70% less SSE traffic
  • Type Safety: Complete TypeScript definitions matching backend
  • Developer Experience: 18 specialized React hooks for different use cases
  • Production Ready: Backward compatible, fully documented, deployment-ready


2. Event Flow & Lifecycle

2.1 Event Generation

Services detect issues via three patterns:

Scheduled Background Jobs

  • Inventory service: Stock checks every 5-15 minutes
  • Production service: Capacity checks every 10-45 minutes
  • Forecasting service: Demand analysis (Friday 3 PM weekly)

Event-Driven

  • RabbitMQ subscriptions to business events
  • Example: Order created → Check stock availability → Emit low stock alert

Database Triggers

  • Direct PostgreSQL notifications for critical state changes
  • Example: Stock quantity falls below threshold → Immediate alert

2.2 Alert Publishing Flow

Service detects issue
    ↓
Validates against RawAlert schema (title, message, type, severity, metadata)
    ↓
Generates deduplication key (type + entity IDs)
    ↓
Checks Redis (prevent duplicates within 15-minute window)
    ↓
Publishes to RabbitMQ (alerts.exchange with routing key)
    ↓
Alert Processor consumes message
    ↓
Conditional enrichment based on event type
    ↓
Stores in PostgreSQL
    ↓
Publishes to Redis (domain-based channels)
    ↓
Gateway streams via SSE
    ↓
Frontend hooks receive and display
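The validation step above can be sketched as a minimal check of the RawAlert shape. The field set mirrors the flow description; the severity vocabulary and helper name are assumptions, not the actual shared schema.

```python
# Hypothetical sketch of RawAlert validation; field names come from the
# flow above, but VALID_SEVERITIES and the helper name are assumptions.
REQUIRED_FIELDS = {"title", "message", "type", "severity", "metadata"}
VALID_SEVERITIES = {"low", "medium", "high", "critical"}

def validate_raw_alert(payload: dict) -> dict:
    """Reject payloads missing required fields or using an unknown severity."""
    missing = REQUIRED_FIELDS - payload.keys()
    if missing:
        raise ValueError(f"RawAlert missing fields: {sorted(missing)}")
    if payload["severity"] not in VALID_SEVERITIES:
        raise ValueError(f"Unknown severity: {payload['severity']}")
    return payload
```

A service would run a check like this before computing the deduplication key, so malformed events never reach RabbitMQ.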

2.3 Complete Event Flow Diagram

Domain Service → RabbitMQ → Alert Processor → PostgreSQL → Redis → Gateway → Frontend
                                    ↓                                    ↓
                            Conditional Enrichment              SSE Stream
                            - Alert: Full (500-800ms)          - Domain filtered
                            - Notification: Fast (20-30ms)     - Wildcard support
                            - Recommendation: Medium (50-80ms) - Real-time updates

3. Three-Tier Enrichment Strategy

3.1 Tier 1: ALERTS (Full Enrichment)

When: Critical business events requiring user decisions

Enrichment Pipeline (7 steps):

  1. Orchestrator Context Query
  2. Business Impact Analysis
  3. Urgency Assessment
  4. User Agency Evaluation
  5. Multi-Factor Priority Scoring
  6. Timing Intelligence
  7. Smart Action Generation

Processing Time: 500-800ms
Database: Full alert record with all enrichment fields
TTL: Indefinite (until resolved)

Examples:

  • Low stock warning requiring PO approval
  • Production delay affecting customer orders
  • Equipment failure needing immediate attention

3.2 Tier 2: NOTIFICATIONS (Lightweight)

When: Informational state changes

Enrichment:

  • Format title/message
  • Set placement hint
  • Assign domain
  • No priority scoring
  • No orchestrator queries

Processing Time: 20-30ms (80% faster than alerts)
Database: Minimal notification record
TTL: 7 days (automatic cleanup)

Examples:

  • Stock received confirmation
  • Batch completed notification
  • PO sent to supplier

3.3 Tier 3: RECOMMENDATIONS (Moderate)

When: AI suggestions for optimization

Enrichment:

  • Light priority scoring (info level by default)
  • Confidence assessment
  • Estimated impact calculation
  • No orchestrator context
  • Dismissible by users

Processing Time: 50-80ms
Database: Recommendation record with impact fields
TTL: 30 days or until dismissed

Examples:

  • Demand surge prediction
  • Inventory optimization suggestion
  • Cost reduction opportunity

3.4 Performance Comparison

| Event Class    | Old       | New       | Improvement                |
|----------------|-----------|-----------|----------------------------|
| Alert          | 200-300ms | 500-800ms | Baseline (more enrichment) |
| Notification   | 200-300ms | 20-30ms   | 80% faster                 |
| Recommendation | 200-300ms | 50-80ms   | 60% faster                 |

Overall: 54% average improvement due to selective enrichment


4. Enrichment Process

4.1 Orchestrator Context Enrichment

Purpose: Determine if AI has already addressed the alert

Service: orchestrator_client.py

Query: Daily Orchestrator microservice for related actions

Questions Answered:

  • Has AI already created a purchase order for this low stock?
  • What's the PO ID and current status?
  • When will the delivery arrive?
  • What's the estimated cost savings?

Response Fields:

{
    "already_addressed": bool,
    "action_type": "purchase_order" | "production_batch" | "schedule_adjustment",
    "action_id": str,  # e.g., "PO-12345"
    "action_status": "pending_approval" | "approved" | "in_progress",
    "delivery_date": datetime,
    "estimated_savings_eur": Decimal
}

Caching: Results cached to avoid redundant queries
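A minimal sketch of that caching, using simple TTL-based memoization; the in-process dict and the 300-second TTL are assumptions, and the real orchestrator_client.py may cache in Redis instead.

```python
import time

# Sketch: memoize orchestrator context lookups per cache key with a short
# TTL. The in-process dict and CACHE_TTL_SECONDS value are assumptions.
_cache: dict = {}
CACHE_TTL_SECONDS = 300

def get_orchestrator_context(cache_key: str, query_fn) -> dict:
    now = time.monotonic()
    hit = _cache.get(cache_key)
    if hit is not None and now - hit[0] < CACHE_TTL_SECONDS:
        return hit[1]            # cache hit: no orchestrator round-trip
    result = query_fn()          # e.g. HTTP call to the Daily Orchestrator
    _cache[cache_key] = (now, result)
    return result
```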

4.2 Business Impact Analysis

Service: context_enrichment.py

Dimensions Analyzed:

Financial Impact

financial_impact_eur: Decimal
# Calculation examples:
# - Low stock: lost_sales = out_of_stock_days × avg_daily_revenue_per_product
# - Production delay: penalty_fees + rush_order_costs
# - Equipment failure: repair_cost + lost_production_value

Customer Impact

affected_customers: List[str]  # Customer names
affected_orders: int  # Count of at-risk orders
customer_satisfaction_impact: "low" | "medium" | "high"
# Based on order priority, customer tier, delay duration

Operational Impact

production_batches_at_risk: List[str]  # Batch IDs
waste_risk_kg: Decimal  # Spoilage or overproduction
equipment_downtime_hours: Decimal

4.3 Urgency Context

Fields:

deadline: datetime  # When consequences occur
time_until_consequence_hours: Decimal  # Countdown
can_wait_until_tomorrow: bool  # For overnight batch processing
auto_action_countdown_seconds: int  # For escalation alerts

Urgency Scoring:

  • >48h until consequence: Low urgency (20 points)
  • 24-48h: Medium urgency (50 points)
  • 6-24h: High urgency (80 points)
  • <6h: Critical urgency (100 points)
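The bands above translate directly into a small scoring helper (a sketch; which side of the 24h and 48h boundaries each band owns is an assumption):

```python
def urgency_score(hours_until_consequence: float) -> int:
    """Map time-until-consequence to the urgency bands listed above."""
    if hours_until_consequence < 6:
        return 100  # critical urgency
    if hours_until_consequence < 24:
        return 80   # high urgency
    if hours_until_consequence <= 48:
        return 50   # medium urgency
    return 20       # low urgency
```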

4.4 User Agency Assessment

Purpose: Determine what user can actually do

Fields:

can_user_fix: bool  # Can user resolve this directly?
requires_external_party: bool  # Need supplier/customer action?
external_party_name: str  # "Supplier Inc."
external_party_contact: str  # "+34-123-456-789"
blockers: List[str]  # What prevents immediate action

User Agency Scoring:

  • Can fix directly: 80 points
  • Requires external party: 50 points
  • Has blockers: -30 penalty
  • No control: 20 points

4.5 Trend Context (for trend_warning alerts)

Fields:

metric_name: str  # "weekend_demand"
current_value: Decimal  # 450
baseline_value: Decimal  # 300
change_percentage: Decimal  # 50
direction: "increasing" | "decreasing" | "volatile"
significance: "low" | "medium" | "high"
period_days: int  # 7
possible_causes: List[str]  # ["Holiday weekend", "Promotion"]

4.6 Timing Intelligence

Service: timing_intelligence.py

Delivery Method Decisions:

def decide_timing(alert):
    priority = alert.priority_score

    if priority >= 90:  # Critical
        return "SEND_NOW"  # Immediate push notification

    if is_business_hours() and priority >= 70:
        return "SEND_NOW"  # Important during work hours

    if is_night_hours() and priority < 90:
        return "SCHEDULE_LATER"  # Queue for 8 AM

    if priority < 50:
        return "BATCH_FOR_DIGEST"  # Daily summary email

    return "SEND_NOW"  # Remaining daytime cases

Considerations:

  • Priority level
  • Business hours (8 AM - 8 PM)
  • User preferences (digest settings)
  • Alert type (action_needed vs informational)

4.7 Smart Actions Generation

Service: context_enrichment.py

Action Structure:

{
    label: string,  // "Approve Purchase Order"
    type: SmartActionType,  // approve_po
    variant: "primary" | "secondary" | "tertiary",
    metadata: object,  // Context for action handler
    disabled: boolean,  // Based on user permissions/state
    estimated_time_minutes: number,  // How long action takes
    consequence: string  // "Order will be placed immediately"
}

Action Examples by Alert Type:

Low Stock Alert:

[
    {
        label: "Approve Purchase Order",
        type: "approve_po",
        variant: "primary",
        metadata: { po_id: "PO-12345", amount: 1500.00 }
    },
    {
        label: "Contact Supplier",
        type: "call_supplier",
        variant: "secondary",
        metadata: { supplier_contact: "+34-123-456-789" }
    }
]

Production Delay Alert:

[
    {
        label: "Adjust Schedule",
        type: "reschedule_production",
        variant: "primary",
        metadata: { batch_id: "BATCH-001", delay_minutes: 30 }
    },
    {
        label: "Notify Customer",
        type: "send_notification",
        variant: "secondary",
        metadata: { customer_id: "CUST-456" }
    }
]

5. Priority Scoring Algorithm

5.1 Multi-Factor Weighted Scoring

Formula:

Priority Score (0-100) =
    (Business_Impact × 0.40) +
    (Urgency × 0.30) +
    (User_Agency × 0.20) +
    (Confidence × 0.10)

5.2 Business Impact Score (40% weight)

Financial Impact:

  • ≤€50: 20 points
  • €50-200: 40 points
  • €200-500: 60 points
  • >€500: 100 points

Customer Impact:

  • 1 affected customer: 30 points
  • 2-5 customers: 50 points
  • 5+ customers: 100 points

Operational Impact:

  • 1 order at risk: 30 points
  • 2-10 orders: 60 points
  • 10+ orders: 100 points

Weighted Average:

business_impact_score = (
    financial_score * 0.5 +
    customer_score * 0.3 +
    operational_score * 0.2
)

5.3 Urgency Score (30% weight)

Time Until Consequence:

  • >48 hours: 20 points
  • 24-48 hours: 50 points
  • 6-24 hours: 80 points
  • <6 hours: 100 points

Deadline Approaching Bonus:

  • Within 24h of deadline: +30 points
  • Within 6h of deadline: +50 points (capped at 100)

5.4 User Agency Score (20% weight)

Base Score:

  • Can user fix directly: 80 points
  • Requires coordination: 50 points
  • No control: 20 points

Modifiers:

  • Has external party contact: +20 bonus
  • Requires supplier action: -20 penalty
  • Has known blockers: -30 penalty
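Combining the base scores and modifiers above (a sketch; treating "requires supplier action" as the general external-party case and clamping to 0-100 are assumptions):

```python
def user_agency_score(can_user_fix: bool,
                      requires_external_party: bool,
                      has_external_contact: bool,
                      has_blockers: bool) -> int:
    # Base score (table above)
    if can_user_fix:
        score = 80
    elif requires_external_party:
        score = 50  # requires coordination
    else:
        score = 20  # no control
    # Modifiers (clamping to 0-100 is an assumption)
    if has_external_contact:
        score += 20
    if requires_external_party:
        score -= 20
    if has_blockers:
        score -= 30
    return max(0, min(100, score))
```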

5.5 Confidence Score (10% weight)

Data Quality Assessment:

  • High confidence (complete data): 100 points
  • Medium confidence (some assumptions): 70 points
  • Low confidence (many unknowns): 40 points

5.6 Priority Levels

Mapping:

  • CRITICAL (90-100): Immediate action required, high business impact
  • IMPORTANT (70-89): Action needed today, moderate impact
  • STANDARD (50-69): Action recommended this week
  • INFO (0-49): Informational, no urgency
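The weighted formula from 5.1 and the level mapping above, as a runnable sketch:

```python
def priority_score(business_impact: float, urgency: float,
                   user_agency: float, confidence: float) -> float:
    """Weighted sum from Section 5.1; all inputs are 0-100 sub-scores."""
    return (business_impact * 0.40 + urgency * 0.30
            + user_agency * 0.20 + confidence * 0.10)

def priority_level(score: float) -> str:
    """Map a 0-100 score to the levels above."""
    if score >= 90:
        return "critical"
    if score >= 70:
        return "important"
    if score >= 50:
        return "standard"
    return "info"
```

For example, sub-scores of (100, 80, 80, 70) combine to 87, an IMPORTANT alert.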

6. Alert Types & Classification

6.1 Alert Type Classes

ACTION_NEEDED (~70% of alerts):

  • User decision required
  • Appears in action queue
  • Has deadline
  • Examples: Low stock, pending PO approval, equipment failure

PREVENTED_ISSUE (~10% of alerts):

  • AI already handled the problem
  • Positive framing: "I prevented X by doing Y"
  • User awareness only, no action needed
  • Examples: "Stock shortage prevented by auto-PO"

TREND_WARNING (~15% of alerts):

  • Proactive insight about emerging patterns
  • Gives user time to prepare
  • May become action_needed if ignored
  • Examples: "Demand trending up 35% this week"

ESCALATION (~3% of alerts):

  • Time-sensitive with auto-action countdown
  • System will act automatically if user doesn't
  • Countdown timer shown prominently
  • Examples: "Critical stock, auto-ordering in 2 hours"

INFORMATION (~2% of alerts):

  • FYI only, no action expected
  • Low priority
  • Often batched for digest emails
  • Examples: "Production batch completed"

6.2 Event Domains

  • inventory: Stock levels, expiration, movements
  • production: Batches, capacity, equipment
  • procurement: Purchase orders, deliveries, suppliers
  • forecasting: Demand predictions, trends
  • orders: Customer orders, fulfillment
  • orchestrator: AI-driven automation actions
  • delivery: Delivery tracking, receipt
  • sales: Sales analytics, patterns

6.3 Alert Type Catalog (40+ types)

Inventory Domain

critical_stock_shortage (action_needed, critical)
low_stock_warning (action_needed, important)
expired_products (action_needed, critical)
stock_depleted_by_order (information, standard)
stock_received (notification, info)
stock_movement (notification, info)

Production Domain

production_delay (action_needed, important)
equipment_failure (action_needed, critical)
capacity_overload (action_needed, important)
quality_control_failure (action_needed, critical)
batch_state_changed (notification, info)
batch_completed (notification, info)

Procurement Domain

po_approval_needed (action_needed, important)
po_approval_escalation (escalation, critical)
delivery_overdue (action_needed, critical)
po_approved (notification, info)
po_sent (notification, info)
delivery_scheduled (notification, info)
delivery_received (notification, info)

Delivery Tracking

delivery_scheduled (information, info)
delivery_arriving_soon (action_needed, important)
delivery_overdue (action_needed, critical)
stock_receipt_incomplete (action_needed, important)

Forecasting Domain

demand_surge_predicted (trend_warning, important)
weekend_demand_surge (trend_warning, standard)
weather_impact_forecast (trend_warning, standard)
holiday_preparation (trend_warning, important)

Operations Domain

orchestration_run_started (notification, info)
orchestration_run_completed (notification, info)
action_created (notification, info)

6.4 Placement Hints

Where alerts appear:

  • ACTION_QUEUE: Dashboard action section (action_needed)
  • NOTIFICATION_PANEL: Bell icon dropdown (notifications)
  • DASHBOARD_INLINE: Embedded in relevant page section
  • TOAST: Immediate popup (critical alerts)
  • EMAIL_DIGEST: End-of-day summary email
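These hints suggest an enum in the same style as SmartActionType in Section 7.1 (a sketch; the exact member values are assumptions based on the lowercase placement strings used in the payload examples later in this document):

```python
from enum import Enum

# Sketch of a placement-hint enum; member values are assumptions.
class PlacementHint(str, Enum):
    ACTION_QUEUE = "action_queue"
    NOTIFICATION_PANEL = "notification_panel"
    DASHBOARD_INLINE = "dashboard_inline"
    TOAST = "toast"
    EMAIL_DIGEST = "email_digest"
```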

7. Smart Actions & User Agency

7.1 Action Types

Complete Enumeration:

class SmartActionType(str, Enum):
    # Procurement
    APPROVE_PO = "approve_po"
    REJECT_PO = "reject_po"
    MODIFY_PO = "modify_po"
    CALL_SUPPLIER = "call_supplier"

    # Production
    START_PRODUCTION_BATCH = "start_production_batch"
    RESCHEDULE_PRODUCTION = "reschedule_production"
    HALT_PRODUCTION = "halt_production"

    # Inventory
    MARK_DELIVERY_RECEIVED = "mark_delivery_received"
    COMPLETE_STOCK_RECEIPT = "complete_stock_receipt"
    ADJUST_STOCK_MANUALLY = "adjust_stock_manually"

    # Customer Service
    NOTIFY_CUSTOMER = "notify_customer"
    CANCEL_ORDER = "cancel_order"
    ADJUST_DELIVERY_DATE = "adjust_delivery_date"

    # System
    SNOOZE_ALERT = "snooze_alert"
    DISMISS_ALERT = "dismiss_alert"
    ESCALATE_TO_MANAGER = "escalate_to_manager"

7.2 Action Lifecycle

1. Generation (enrichment stage):

  • Service context: What's possible in this situation?
  • User agency: Can user execute this action?
  • Permissions: Does user have required role?
  • Conditional rendering: Disable if prerequisites not met

2. Display (frontend):

  • Primary action highlighted (most recommended)
  • Secondary actions offered (alternatives)
  • Disabled actions shown with reason tooltip
  • Consequence preview on hover

3. Execution (API call):

  • Handler routes by action type
  • Executes business logic (PO approval, schedule change, etc.)
  • Creates audit trail
  • Emits follow-up events/notifications
  • May create new alerts

4. Escalation (if unacted):

  • 24h: Alert priority boosted
  • 48h: Type changed to escalation
  • 72h: Priority boosted further, countdown timer shown
  • System may auto-execute if configured

7.3 Consequence Preview

Purpose: Build trust by showing impact before action

Example:

{
    action: "approve_po",
    consequence: {
        immediate: "Order will be sent to supplier within 5 minutes",
        timing: "Delivery expected in 2-3 business days",
        cost: "€1,250.00 will be added to monthly expenses",
        impact: "Resolves low stock for 3 ingredients affecting 8 orders"
    }
}

Display:

  • Shown on hover or in confirmation modal
  • Highlights positive outcomes (orders fulfilled)
  • Notes financial impact (€ amount)
  • Clarifies timing (when effect occurs)

8. Alert Lifecycle & State Transitions

8.1 Alert States

Created → Active
    ↓
    ├─→ Acknowledged (user saw it)
    ├─→ In Progress (user taking action)
    ├─→ Resolved (action completed)
    ├─→ Dismissed (user chose to ignore)
    └─→ Snoozed (remind me later)

8.2 State Transitions

Created → Active:

  • Automatic on creation
  • Appears in relevant UI sections based on placement hints

Active → Acknowledged:

  • User clicks alert or views action queue
  • Tracked for analytics (response time)

Acknowledged → In Progress:

  • User starts working on resolution
  • May set estimated completion time

In Progress → Resolved:

  • Smart action executed successfully
  • Or user manually marks as resolved
  • resolved_at timestamp set

Active → Dismissed:

  • User chooses not to act
  • May require dismissal reason (for audit)

Active → Snoozed:

  • User requests reminder later (e.g., in 1 hour, tomorrow morning)
  • Returns to Active at scheduled time
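The transition rules above can be captured as an allow-list guard. This sketch is restricted to exactly the edges listed; the real AlertStatus values and guard logic may differ.

```python
# Allow-list of legal state transitions, mirroring Section 8.2.
ALLOWED_TRANSITIONS = {
    "active": {"acknowledged", "dismissed", "snoozed"},
    "acknowledged": {"in_progress"},
    "in_progress": {"resolved"},
    "snoozed": {"active"},
}

def transition(current: str, target: str) -> str:
    """Return the new state, or raise if the edge is not in the allow-list."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"Illegal transition: {current} -> {target}")
    return target
```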

8.3 Key Fields

Lifecycle Tracking:

status: AlertStatus  # Current state
created_at: datetime  # When alert was created
acknowledged_at: datetime  # When user first viewed
resolved_at: datetime  # When action completed
action_created_at: datetime  # For escalation age calculation

Interaction Tracking:

interactions: List[AlertInteraction]  # All user interactions
last_interaction_at: datetime  # Most recent interaction
response_time_seconds: int  # Time to first action
resolution_time_seconds: int  # Time to resolution

8.4 Alert Interactions

Tracked Events:

  • view: User viewed alert
  • acknowledge: User acknowledged alert
  • action_taken: User executed smart action
  • snooze: User snoozed alert
  • dismiss: User dismissed alert
  • resolve: User resolved alert

Interaction Record:

class AlertInteraction(Base):
    id: UUID
    tenant_id: UUID
    alert_id: UUID
    user_id: UUID
    interaction_type: InteractionType
    action_type: Optional[SmartActionType]
    metadata: dict  # Context of interaction
    created_at: datetime

Analytics Usage:

  • Measure alert effectiveness (% resolved)
  • Track response times (how quickly users act)
  • Identify ignored alerts (high dismiss rate)
  • Optimize smart action suggestions
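A sketch of how the effectiveness and dismiss-rate analytics above might be computed from stored alert statuses; the list-of-dicts shape and the "status" field name are assumptions, and the real analytics likely aggregate AlertInteraction rows in SQL.

```python
# Sketch: compute resolution and dismiss rates from alert statuses.
def alert_effectiveness(alerts: list) -> dict:
    total = len(alerts)
    if not total:
        return {"resolution_rate": 0.0, "dismiss_rate": 0.0}
    resolved = sum(1 for a in alerts if a["status"] == "resolved")
    dismissed = sum(1 for a in alerts if a["status"] == "dismissed")
    return {
        "resolution_rate": resolved / total,   # % of alerts resolved
        "dismiss_rate": dismissed / total,     # % of alerts ignored
    }
```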

9. Escalation System

9.1 Time-Based Escalation

Purpose: Prevent alert fatigue while ensuring unresolved action items don't quietly age out of attention

Escalation Rules:

# Applied hourly to action_needed alerts

if alert.status == "active" and alert.type_class == "action_needed":
    age_hours = (now - alert.action_created_at).total_seconds() / 3600

    escalation_boost = 0

    # Age-based escalation
    if age_hours > 72:
        escalation_boost = 20
    elif age_hours > 48:
        escalation_boost = 10

    # Deadline-based escalation
    if alert.deadline:
        hours_to_deadline = (alert.deadline - now).total_seconds() / 3600
        if hours_to_deadline < 6:
            escalation_boost = max(escalation_boost, 30)
        elif hours_to_deadline < 24:
            escalation_boost = max(escalation_boost, 15)

    # Skip if already critical
    if alert.priority_score >= 90:
        escalation_boost = 0

    # Apply boost (capped at +30)
    alert.priority_score += min(escalation_boost, 30)
    alert.priority_level = calculate_level(alert.priority_score)

9.2 Escalation Cronjob

Schedule: Every hour at :15 ("15 * * * *")

Configuration:

alert-priority-recalculation-cronjob:
  schedule: "15 * * * *"
  resources:
    memory: 256Mi
    cpu: 100m
  timeout: 30 minutes
  concurrency: Forbid
  batch_size: 50

Processing Logic:

  1. Query all action_needed alerts with status=active
  2. Batch process (50 alerts at a time)
  3. Calculate escalation boost for each
  4. Update priority_score and priority_level
  5. Add escalation_metadata (boost amount, reason)
  6. Invalidate Redis cache (tenant:{id}:alerts:*)
  7. Log escalation events for analytics
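The seven steps above, reduced to a runnable sketch of the batching loop; boost computation and cache invalidation are passed in as stand-ins for the real service calls.

```python
# Sketch of the hourly recalculation loop; compute_boost and
# invalidate_cache are hypothetical stand-ins for the real services.
BATCH_SIZE = 50

def recalculate_priorities(alerts, compute_boost, invalidate_cache):
    escalated = []
    for start in range(0, len(alerts), BATCH_SIZE):       # step 2: batches of 50
        for alert in alerts[start:start + BATCH_SIZE]:
            boost = compute_boost(alert)                   # step 3
            if boost:
                alert["priority_score"] = min(100, alert["priority_score"] + boost)
                escalated.append(alert["id"])              # step 7: collect for logging
    if escalated:
        invalidate_cache()                                 # step 6: drop tenant alert cache
    return escalated
```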

9.3 Escalation Metadata

Stored in enrichment_context:

{
    "escalation": {
        "applied_at": "2025-11-25T15:00:00Z",
        "boost_amount": 20,
        "reason": "pending_72h",
        "previous_score": 65,
        "new_score": 85,
        "previous_level": "standard",
        "new_level": "important"
    }
}

9.4 Escalation to Auto-Action

When:

  • Alert >72h old
  • Priority ≥90 (critical)
  • Has auto-action configured

Process:

if age_hours > 72 and priority_score >= 90:
    alert.type_class = "escalation"
    alert.auto_action_countdown_seconds = 7200  # 2 hours
    alert.auto_action_type = determine_auto_action(alert)
    alert.auto_action_metadata = {...}

Frontend Display:

  • Shows countdown timer: "Auto-approving PO in 1h 23m"
  • Primary action becomes "Cancel Auto-Action"
  • User can cancel or let system proceed

10. Alert Chaining & Deduplication

10.1 Deduplication Strategy

Purpose: Prevent alert spam when same issue detected multiple times

Deduplication Key:

def generate_dedup_key(tenant_id, alert_type, entity_ids):
    key_parts = [alert_type]

    # Add entity identifiers
    if entity_ids.get("product_id"):
        key_parts.append(f"product:{entity_ids['product_id']}")
    if entity_ids.get("supplier_id"):
        key_parts.append(f"supplier:{entity_ids['supplier_id']}")
    if entity_ids.get("batch_id"):
        key_parts.append(f"batch:{entity_ids['batch_id']}")

    key = ":".join(key_parts)
    return f"{tenant_id}:alert:{key}"

Redis Check:

dedup_key = generate_dedup_key(...)
# SET NX makes check-and-claim atomic, avoiding a race between exists()
# and setex() when two workers detect the same issue simultaneously
if redis.set(dedup_key, "1", nx=True, ex=900):  # 15-minute window
    create_alert(...)
else:
    return  # Skip, alert already exists

10.2 Alert Chaining

Purpose: Link related alerts to tell coherent story

Database Fields (added in migration 20251123):

action_created_at: datetime  # Original creation time (for age)
superseded_by_action_id: UUID  # Links to solving action
hidden_from_ui: bool  # Hide superseded alerts

10.3 Chaining Methods

1. Mark as Superseded:

def mark_alert_as_superseded(alert_id, solving_action_id):
    alert = db.query(Alert).filter(Alert.id == alert_id).first()
    alert.superseded_by_action_id = solving_action_id
    alert.hidden_from_ui = True
    alert.updated_at = now()
    db.commit()

    # Invalidate cache (DEL does not accept glob patterns; expand first)
    for key in redis.scan_iter(f"tenant:{alert.tenant_id}:alerts:*"):
        redis.delete(key)

2. Create Combined Alert:

def create_combined_alert(original_alert, solving_action):
    # Create new prevented_issue alert
    combined_alert = Alert(
        tenant_id=original_alert.tenant_id,
        alert_type="prevented_issue",
        type_class="prevented_issue",
        title="Stock shortage prevented",
        message=f"I detected low stock for {product_name} and created "
                f"PO-{po_number} automatically. Order will arrive in 2 days.",
        priority_level="info",
        metadata={
            "original_alert_id": str(original_alert.id),
            "solving_action_id": str(solving_action.id),
            "problem": original_alert.message,
            "solution": solving_action.description
        }
    )
    db.add(combined_alert)
    db.commit()

    # Mark original as superseded
    mark_alert_as_superseded(original_alert.id, combined_alert.id)

3. Find Related Alerts:

def find_related_alert(tenant_id, alert_type, product_id):
    return db.query(Alert).filter(
        Alert.tenant_id == tenant_id,
        Alert.alert_type == alert_type,
        Alert.metadata['product_id'].astext == product_id,
        Alert.created_at > now() - timedelta(hours=24),
        Alert.hidden_from_ui == False
    ).first()

4. Filter Hidden Alerts:

def get_active_alerts(tenant_id):
    return db.query(Alert).filter(
        Alert.tenant_id == tenant_id,
        Alert.status.in_(["active", "acknowledged"]),
        Alert.hidden_from_ui == False  # Exclude superseded alerts
    ).all()

10.4 Chaining Example Flow

Step 1: Low stock detected
    → Create LOW_STOCK alert (action_needed, priority: 75)
    → User sees "Low stock for flour, action needed"

Step 2: Daily Orchestrator runs
    → Finds LOW_STOCK alert
    → Creates purchase order automatically
    → PO-12345 created with delivery date

Step 3: Orchestrator chains alerts
    → Calls mark_alert_as_superseded(low_stock_alert.id, po.id)
    → Creates PREVENTED_ISSUE alert
    → Message: "I prevented flour shortage by creating PO-12345.
                Delivery arrives Nov 28. Approve or modify if needed."

Step 4: User sees only prevented_issue alert
    → Original low stock alert hidden from UI
    → User understands: problem detected → AI acted → needs approval
    → Single coherent narrative, not 3 separate alerts

11. Cronjob Integration

11.1 Why CronJobs Are Needed

Event System Cannot:

  • Emit events "2 hours before delivery"
  • Detect "alert is now 48 hours old"
  • Poll external state (procurement PO status)

CronJobs Excel At:

  • Time-based conditions
  • Periodic checks
  • Predictive alerts
  • Batch recalculations

11.2 Delivery Tracking CronJob

Schedule: Every hour at :30 ("30 * * * *")

Configuration:

delivery-tracking-cronjob:
  schedule: "30 * * * *"
  resources:
    memory: 256Mi
    cpu: 100m
  timeout: 30 minutes
  concurrency: Forbid

Service: DeliveryTrackingService in Orchestrator

Processing Flow:

def check_expected_deliveries():
    # Query procurement service for expected deliveries
    deliveries = procurement_api.get_expected_deliveries(
        from_date=now(),
        to_date=now() + timedelta(days=3)
    )

    for delivery in deliveries:
        current_time = now()
        expected_time = delivery.expected_delivery_datetime
        window_start = delivery.delivery_window_start
        window_end = delivery.delivery_window_end

        # Window passed + 2h, not received, no receipt: Incomplete alert
        # (checked first, since the overdue condition below also matches here)
        if current_time > (window_end + timedelta(hours=2)) and \
           not delivery.marked_received and \
           not delivery.stock_receipt_id:
            send_receipt_incomplete_alert(delivery)

        # T+30min: Overdue alert
        elif current_time > (window_end + timedelta(minutes=30)) and \
             not delivery.marked_received:
            send_overdue_alert(delivery)

        # T-2h: Arriving soon alert
        elif current_time >= (window_start - timedelta(hours=2)) and \
             current_time < window_start:
            send_arriving_soon_alert(delivery)

Alert Types Generated:

  1. DELIVERY_ARRIVING_SOON (T-2h):
{
    "alert_type": "delivery_arriving_soon",
    "type_class": "action_needed",
    "priority_level": "important",
    "placement": "action_queue",
    "smart_actions": [
        {
            "type": "mark_delivery_received",
            "label": "Mark as Received",
            "variant": "primary"
        }
    ]
}
  2. DELIVERY_OVERDUE (T+30min):
{
    "alert_type": "delivery_overdue",
    "type_class": "action_needed",
    "priority_level": "critical",
    "priority_score": 95,
    "smart_actions": [
        {
            "type": "call_supplier",
            "label": "Call Supplier",
            "metadata": {
                "supplier_contact": "+34-123-456-789"
            }
        }
    ]
}
  3. STOCK_RECEIPT_INCOMPLETE (Post-window):
{
    "alert_type": "stock_receipt_incomplete",
    "type_class": "action_needed",
    "priority_level": "important",
    "priority_score": 80,
    "smart_actions": [
        {
            "type": "complete_stock_receipt",
            "label": "Complete Stock Receipt",
            "metadata": {
                "po_id": "...",
                "draft_receipt_id": "..."
            }
        }
    ]
}

11.3 Delivery Alert Lifecycle

PO Approved
    ↓
DELIVERY_SCHEDULED (informational, notification_panel)
    ↓ T-2 hours
DELIVERY_ARRIVING_SOON (action_needed, action_queue)
    ↓ Expected time + 30 min
DELIVERY_OVERDUE (critical, action_queue + toast)
    ↓ Window passed + 2 hours
STOCK_RECEIPT_INCOMPLETE (important, action_queue)

11.4 Priority Recalculation CronJob

See Section 9.2 for details.

11.5 Decision Matrix: Events vs CronJobs

| Feature                   | Event System | CronJob | Best Choice  |
|---------------------------|--------------|---------|--------------|
| State change notification | Excellent    | Poor    | Event System |
| Time-based alerts         | Complex      | Simple  | CronJob      |
| Real-time updates         | Instant      | Delayed | Event System |
| Predictive alerts         | Hard         | Easy    | CronJob      |
| Priority escalation       | Complex      | Natural | CronJob      |
| Deadline tracking         | Complex      | Simple  | CronJob      |
| Batch processing          | Not designed | Ideal   | CronJob      |

12. Service Integration Patterns

12.1 Base Alert Service

All services extend: BaseAlertService from shared/alerts/base_service.py

Core Method:

async def publish_item(
    self,
    tenant_id: UUID,
    item_data: dict,
    item_type: ItemType = ItemType.ALERT
):
    # Validate schema
    validated_item = validate_item(item_data, item_type)

    # Generate deduplication key
    dedup_key = self.generate_dedup_key(tenant_id, validated_item)

    # Check Redis for duplicates (15-minute window)
    if await self.redis.exists(dedup_key):
        logger.info(f"Skipping duplicate {item_type}: {dedup_key}")
        return

    # Publish to RabbitMQ
    await self.rabbitmq.publish(
        exchange="alerts.exchange",
        routing_key=f"{item_type}.{validated_item['severity']}",
        message={
            "tenant_id": str(tenant_id),
            "item_type": item_type,
            "data": validated_item
        }
    )

    # Set deduplication key
    await self.redis.setex(dedup_key, 900, "1")  # 15 minutes

12.2 Inventory Service

Service Class: InventoryAlertService

Background Jobs:

# Check stock levels every 5 minutes
@scheduler.scheduled_job('interval', minutes=5)
async def check_stock_levels():
    service = InventoryAlertService()
    critical_items = await service.find_critical_stock()

    for item in critical_items:
        await service.publish_item(
            tenant_id=item.tenant_id,
            item_data={
                "type": "critical_stock_shortage",
                "severity": "high",
                "title": f"Critical: {item.name} stock depleted",
                "message": f"Only {item.current_stock}{item.unit} remaining. "
                          f"Required: {item.minimum_stock}{item.unit}",
                "actions": ["approve_po", "call_supplier"],
                "metadata": {
                    "ingredient_id": str(item.id),
                    "current_stock": item.current_stock,
                    "minimum_stock": item.minimum_stock,
                    "unit": item.unit
                }
            },
            item_type=ItemType.ALERT
        )

# Check expiring products every 2 hours
@scheduler.scheduled_job('interval', hours=2)
async def check_expiring_products():
    ...  # Similar pattern to check_stock_levels

Event-Driven Alerts:

# Listen to order events
@event_handler("order.created")
async def on_order_created(event):
    service = InventoryAlertService()
    order = event.data

    # Check if order depletes stock below threshold
    for item in order.items:
        stock_after_order = calculate_remaining_stock(item)

        if stock_after_order < item.minimum_stock:
            await service.publish_item(
                tenant_id=order.tenant_id,
                item_data={
                    "type": "stock_depleted_by_order",
                    "severity": "medium",
                    # ... details
                },
                item_type=ItemType.ALERT
            )

Recommendations:

async def analyze_inventory_optimization():
    # Analyze stock patterns
    # Generate optimization recommendations
    await service.publish_item(
        tenant_id=tenant_id,
        item_data={
            "type": "inventory_optimization",
            "title": "Reduce waste by adjusting par levels",
            "suggested_actions": ["adjust_par_levels"],
            "estimated_impact": "Save €250/month",
            "confidence_score": 0.85
        },
        item_type=ItemType.RECOMMENDATION
    )

12.3 Production Service

Service Class: ProductionAlertService

Background Jobs:

@scheduler.scheduled_job('interval', minutes=15)
async def check_production_capacity():
    # Check if scheduled batches exceed capacity
    # Emit capacity_overload alerts

@scheduler.scheduled_job('interval', minutes=10)
async def check_production_delays():
    # Check batches behind schedule
    # Emit production_delay alerts
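The stubbed capacity check above can be reduced to comparing scheduled batch hours against available capacity; a minimal sketch under assumed names (`Batch` and `capacity_hours` are illustrative, not the real production models):

```python
from dataclasses import dataclass

@dataclass
class Batch:
    product: str
    duration_hours: float

def capacity_overload(batches: list[Batch], capacity_hours: float) -> float:
    """Return overload hours for a schedule (0.0 if it fits in capacity)."""
    scheduled = sum(b.duration_hours for b in batches)
    return max(0.0, scheduled - capacity_hours)
```

A nonzero result would trigger a `capacity_overload` alert carrying the overload amount in its metadata.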

Event-Driven:

@event_handler("equipment.status_changed")
async def on_equipment_failure(event):
    if event.data.status == "failed":
        await service.publish_item(
            item_data={
                "type": "equipment_failure",
                "severity": "high",
                "priority_score": 95,  # Manual override
                # ...
            }
        )

12.4 Forecasting Service

Service Class: ForecastingRecommendationService

Scheduled Analysis:

@scheduler.scheduled_job('cron', day_of_week='fri', hour=15)
async def check_weekend_demand_surge():
    forecast = await get_weekend_forecast()

    if forecast.predicted_demand > (forecast.baseline * 1.3):
        await service.publish_item(
            item_data={
                "type": "demand_surge_weekend",
                "title": "Weekend demand surge predicted",
                "message": f"Demand trending up {forecast.increase_pct}%. "
                          f"Consider increasing production.",
                "suggested_actions": ["increase_production"],
                "confidence_score": forecast.confidence
            },
            item_type=ItemType.RECOMMENDATION
        )

12.5 Procurement Service

Service Class: ProcurementEventService (mixed alerts + notifications)

Event-Driven:

@event_handler("po.created")
async def on_po_created(event):
    po = event.data

    if po.amount > APPROVAL_THRESHOLD:
        # Emit alert requiring approval
        await service.publish_item(
            item_data={
                "type": "po_approval_needed",
                "severity": "medium",
                # ...
            },
            item_type=ItemType.ALERT
        )
    else:
        # Emit notification (auto-approved)
        await service.publish_item(
            item_data={
                "type": "po_approved",
                "message": f"PO-{po.number} auto-approved (€{po.amount})",
                "old_state": "draft",
                "new_state": "approved"
            },
            item_type=ItemType.NOTIFICATION
        )

13. Frontend Integration

13.1 React Hooks Catalog (18 hooks)

Alert Hooks (4)

// Subscribe to all critical alerts
const { alerts, criticalAlerts, isLoading } = useAlerts({
    domains: ['inventory', 'production'],
    minPriority: 'important'
});

// Critical alerts only
const { criticalAlerts } = useCriticalAlerts();

// Action-needed alerts only
const { alerts } = useActionNeededAlerts();

// Domain-specific alerts
const { alerts } = useAlertsByDomain('inventory');

Notification Hooks (9)

// All notifications
const { notifications } = useEventNotifications();

// Domain-specific notifications
const { notifications } = useProductionNotifications();
const { notifications } = useInventoryNotifications();
const { notifications } = useSupplyChainNotifications();
const { notifications } = useOperationsNotifications();

// Type-specific notifications
const { notifications } = useBatchNotifications();
const { notifications } = useDeliveryNotifications();
const { notifications } = useOrchestrationNotifications();

// Generic domain filter
const { notifications } = useNotificationsByDomain('production');

Recommendation Hooks (5)

// All recommendations
const { recommendations } = useRecommendations();

// Type-specific recommendations
const { recommendations } = useDemandRecommendations();
const { recommendations } = useInventoryOptimizationRecommendations();
const { recommendations } = useCostReductionRecommendations();

// High confidence only
const { recommendations } = useHighConfidenceRecommendations(0.8);

// Generic filters
const { recommendations } = useRecommendationsByDomain('forecasting');
const { recommendations } = useRecommendationsByType('demand_surge');

13.2 Base SSE Hook

useSSE Hook:

function useSSE(channels: string[]) {
    const [events, setEvents] = useState<Event[]>([]);
    const [isConnected, setIsConnected] = useState(false);

    useEffect(() => {
        const eventSource = new EventSource(
            `/api/events/sse?channels=${channels.join(',')}`
        );

        eventSource.onopen = () => setIsConnected(true);

        eventSource.onmessage = (event) => {
            const data = JSON.parse(event.data);
            setEvents(prev => [data, ...prev]);
        };

        eventSource.onerror = () => setIsConnected(false);

        return () => eventSource.close();
    }, [channels.join(',')]);  // stable dependency: avoids reconnecting when a new array is passed each render

    return { events, isConnected };
}

13.3 TypeScript Definitions

Alert Type:

interface Alert {
    id: string;
    tenant_id: string;
    alert_type: string;
    type_class: AlertTypeClass;
    service: string;
    title: string;
    message: string;
    status: AlertStatus;
    priority_score: number;
    priority_level: PriorityLevel;

    // Enrichment
    orchestrator_context?: OrchestratorContext;
    business_impact?: BusinessImpact;
    urgency_context?: UrgencyContext;
    user_agency?: UserAgency;
    trend_context?: TrendContext;

    // Actions
    smart_actions?: SmartAction[];

    // Metadata
    alert_metadata?: Record<string, any>;
    created_at: string;
    updated_at: string;
    resolved_at?: string;
}

enum AlertTypeClass {
    ACTION_NEEDED = "action_needed",
    PREVENTED_ISSUE = "prevented_issue",
    TREND_WARNING = "trend_warning",
    ESCALATION = "escalation",
    INFORMATION = "information"
}

enum PriorityLevel {
    CRITICAL = "critical",
    IMPORTANT = "important",
    STANDARD = "standard",
    INFO = "info"
}

enum AlertStatus {
    ACTIVE = "active",
    ACKNOWLEDGED = "acknowledged",
    IN_PROGRESS = "in_progress",
    RESOLVED = "resolved",
    DISMISSED = "dismissed",
    SNOOZED = "snoozed"
}

13.4 Component Integration Examples

Action Queue Card:

function UnifiedActionQueueCard() {
    const { alerts } = useAlerts({
        typeClass: ['action_needed', 'escalation'],
        includeResolved: false
    });

    const groupedAlerts = useMemo(() => {
        return groupByTimeCategory(alerts);
        // Returns: { urgent: [...], today: [...], thisWeek: [...] }
    }, [alerts]);

    return (
        <Card>
            <h2>Actions Needed</h2>
            {groupedAlerts.urgent.length > 0 && (
                <UrgentSection alerts={groupedAlerts.urgent} />
            )}
            {groupedAlerts.today.length > 0 && (
                <TodaySection alerts={groupedAlerts.today} />
            )}
        </Card>
    );
}

Health Hero Component:

function GlanceableHealthHero() {
    const { criticalAlerts } = useCriticalAlerts();
    const { notifications } = useEventNotifications();

    const healthStatus = useMemo(() => {
        if (criticalAlerts.length > 0) return 'red';
        if (hasUrgentNotifications(notifications)) return 'yellow';
        return 'green';
    }, [criticalAlerts, notifications]);

    return (
        <Card>
            <StatusIndicator color={healthStatus} />
            {healthStatus === 'red' && (
                <UrgentBadge count={criticalAlerts.length} />
            )}
        </Card>
    );
}

Event-Driven Refetch:

function InventoryStats() {
    const { data, refetch } = useInventoryStats();
    const { notifications } = useInventoryNotifications();

    useEffect(() => {
        const relevantEvent = notifications.find(
            n => n.event_type === 'stock_received'
        );

        if (relevantEvent) {
            refetch();  // Update stats on stock change
        }
    }, [notifications, refetch]);

    return <StatsCard data={data} />;
}

14. Redis Pub/Sub Architecture

14.1 Channel Naming Convention

Pattern: tenant:{tenant_id}:{domain}.{event_type}

Examples:

tenant:123e4567-e89b-12d3-a456-426614174000:inventory.alerts
tenant:123e4567-e89b-12d3-a456-426614174000:inventory.notifications
tenant:123e4567-e89b-12d3-a456-426614174000:production.alerts
tenant:123e4567-e89b-12d3-a456-426614174000:production.notifications
tenant:123e4567-e89b-12d3-a456-426614174000:supply_chain.alerts
tenant:123e4567-e89b-12d3-a456-426614174000:supply_chain.notifications
tenant:123e4567-e89b-12d3-a456-426614174000:operations.notifications
tenant:123e4567-e89b-12d3-a456-426614174000:recommendations
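A trivial helper makes the convention explicit (a sketch; the real codebase may simply inline the f-string, and note the `recommendations` channel above omits the `{domain}.` segment):

```python
def channel_name(tenant_id: str, domain: str, event_type: str) -> str:
    """Build a pub/sub channel following tenant:{tenant_id}:{domain}.{event_type}."""
    return f"tenant:{tenant_id}:{domain}.{event_type}"
```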

14.2 Domain-Based Routing

Alert Processor publishes to Redis:

def publish_to_redis(alert):
    domain = alert.domain  # inventory, production, etc.
    channel = f"tenant:{alert.tenant_id}:{domain}.alerts"

    redis.publish(channel, json.dumps({
        "id": str(alert.id),
        "alert_type": alert.alert_type,
        "type_class": alert.type_class,
        "priority_level": alert.priority_level,
        "title": alert.title,
        "message": alert.message,
        # ... full alert data
    }))

14.3 Gateway SSE Endpoint

Multi-Channel Subscription:

@app.get("/api/events/sse")
async def sse_endpoint(
    channels: str,  # Comma-separated: "inventory.alerts,production.alerts"
    tenant_id: UUID = Depends(get_current_tenant)
):
    async def event_stream():
        pubsub = redis.pubsub()

        # Subscribe to requested channels
        for channel in channels.split(','):
            full_channel = f"tenant:{tenant_id}:{channel}"
            await pubsub.subscribe(full_channel)

        # Stream events (redis-py yields bytes unless decode_responses=True)
        async for message in pubsub.listen():
            if message['type'] == 'message':
                data = message['data']
                payload = data.decode() if isinstance(data, bytes) else data
                yield f"data: {payload}\n\n"

    return StreamingResponse(
        event_stream(),
        media_type="text/event-stream"
    )

Wildcard Support:

// Frontend can subscribe to:
"*.alerts"  // All alert channels
"inventory.*"  // All inventory events
"*.notifications"  // All notification channels
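Gateway-side, these wildcard forms map naturally onto Redis glob patterns via `PSUBSCRIBE`; a sketch of the translation (it assumes the gateway always scopes every pattern to the caller's tenant, so a client can never subscribe across tenants):

```python
def to_redis_pattern(tenant_id: str, requested: str) -> str:
    """Scope a client channel spec (possibly wildcarded) to the tenant."""
    return f"tenant:{tenant_id}:{requested}"

def split_subscriptions(tenant_id: str, channels: list[str]) -> dict[str, list[str]]:
    """Split requests into exact channels (SUBSCRIBE) and globs (PSUBSCRIBE)."""
    exact, patterns = [], []
    for spec in channels:
        target = to_redis_pattern(tenant_id, spec)
        (patterns if "*" in spec else exact).append(target)
    return {"subscribe": exact, "psubscribe": patterns}
```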

14.4 Traffic Reduction

Before (legacy):

  • All pages subscribe to single tenant:{id}:events channel
  • 100% of events sent to all pages
  • High bandwidth, slow filtering

After (domain-based):

  • Dashboard: Subscribes to *.alerts, *.notifications, recommendations
  • Inventory page: Subscribes to inventory.alerts, inventory.notifications
  • Production page: Subscribes to production.alerts, production.notifications

Traffic Reduction by Page:

| Page         | Old Traffic | New Traffic | Reduction      |
|--------------|-------------|-------------|----------------|
| Dashboard    | 100%        | 100%        | 0% (needs all) |
| Inventory    | 100%        | 15%         | 85%            |
| Production   | 100%        | 20%         | 80%            |
| Supply Chain | 100%        | 18%         | 82%            |

Average: 70% reduction on specialized pages


15. Database Schema

15.1 Alerts Table

CREATE TABLE alerts (
    -- Identity
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id UUID NOT NULL,

    -- Classification
    alert_type VARCHAR(100) NOT NULL,
    type_class VARCHAR(50) NOT NULL,  -- action_needed, prevented_issue, etc.
    service VARCHAR(50) NOT NULL,
    event_domain VARCHAR(50),  -- Added in migration 20251125

    -- Content
    title VARCHAR(500) NOT NULL,
    message TEXT NOT NULL,

    -- Status
    status VARCHAR(50) NOT NULL DEFAULT 'active',

    -- Priority
    priority_score INTEGER NOT NULL DEFAULT 50,
    priority_level VARCHAR(50) NOT NULL DEFAULT 'standard',

    -- Enrichment Context (JSONB)
    orchestrator_context JSONB,
    business_impact JSONB,
    urgency_context JSONB,
    user_agency JSONB,
    trend_context JSONB,

    -- Smart Actions
    smart_actions JSONB,  -- Array of action objects

    -- Timing
    timing_decision VARCHAR(50),
    scheduled_send_time TIMESTAMP,

    -- Escalation (Added in migration 20251123)
    action_created_at TIMESTAMP,  -- For age calculation
    superseded_by_action_id UUID,  -- Links to solving action
    hidden_from_ui BOOLEAN DEFAULT FALSE,

    -- Metadata
    alert_metadata JSONB,

    -- Timestamps
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMP NOT NULL DEFAULT NOW(),
    resolved_at TIMESTAMP,

    -- Foreign Keys
    FOREIGN KEY (tenant_id) REFERENCES tenants(id) ON DELETE CASCADE
);

15.2 Indexes

-- Tenant filtering
CREATE INDEX idx_alerts_tenant_status
ON alerts(tenant_id, status);

-- Priority sorting
CREATE INDEX idx_alerts_tenant_priority_created
ON alerts(tenant_id, priority_score DESC, created_at DESC);

-- Type class filtering
CREATE INDEX idx_alerts_tenant_typeclass_status
ON alerts(tenant_id, type_class, status);

-- Timing queries
CREATE INDEX idx_alerts_timing_scheduled
ON alerts(timing_decision, scheduled_send_time);

-- Escalation queries (Added in migration 20251123)
CREATE INDEX idx_alerts_tenant_action_created
ON alerts(tenant_id, action_created_at);

CREATE INDEX idx_alerts_superseded_by
ON alerts(superseded_by_action_id);

CREATE INDEX idx_alerts_tenant_hidden_status
ON alerts(tenant_id, hidden_from_ui, status);

-- Domain filtering (Added in migration 20251125)
CREATE INDEX idx_alerts_tenant_domain
ON alerts(tenant_id, event_domain);

15.3 Alert Interactions Table

CREATE TABLE alert_interactions (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id UUID NOT NULL,
    alert_id UUID NOT NULL,
    user_id UUID NOT NULL,

    -- Interaction type
    interaction_type VARCHAR(50) NOT NULL,  -- view, acknowledge, action_taken, etc.
    action_type VARCHAR(50),  -- Smart action type if applicable

    -- Context
    metadata JSONB,
    response_time_seconds INTEGER,  -- Time from alert creation to this interaction

    -- Timestamps
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),

    -- Foreign Keys
    FOREIGN KEY (tenant_id) REFERENCES tenants(id) ON DELETE CASCADE,
    FOREIGN KEY (alert_id) REFERENCES alerts(id) ON DELETE CASCADE,
    FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE CASCADE
);

CREATE INDEX idx_interactions_alert ON alert_interactions(alert_id);
CREATE INDEX idx_interactions_tenant_user ON alert_interactions(tenant_id, user_id);

15.4 Notifications Table

CREATE TABLE notifications (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id UUID NOT NULL,

    -- Classification
    event_type VARCHAR(100) NOT NULL,
    event_domain VARCHAR(50) NOT NULL,

    -- Content
    title VARCHAR(500) NOT NULL,
    message TEXT NOT NULL,

    -- State change tracking
    entity_type VARCHAR(50),  -- "purchase_order", "batch", etc.
    entity_id UUID,
    old_state VARCHAR(50),
    new_state VARCHAR(50),

    -- Display
    placement_hint VARCHAR(50) DEFAULT 'notification_panel',

    -- Metadata
    notification_metadata JSONB,

    -- Timestamps
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    expires_at TIMESTAMP DEFAULT (NOW() + INTERVAL '7 days'),

    -- Foreign Keys
    FOREIGN KEY (tenant_id) REFERENCES tenants(id) ON DELETE CASCADE
);

CREATE INDEX idx_notifications_tenant_created
ON notifications(tenant_id, created_at DESC);

CREATE INDEX idx_notifications_tenant_domain
ON notifications(tenant_id, event_domain);

15.5 Recommendations Table

CREATE TABLE recommendations (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id UUID NOT NULL,

    -- Classification
    recommendation_type VARCHAR(100) NOT NULL,
    event_domain VARCHAR(50) NOT NULL,

    -- Content
    title VARCHAR(500) NOT NULL,
    message TEXT NOT NULL,

    -- Actions & Impact
    suggested_actions JSONB,  -- Array of suggested action types
    estimated_impact TEXT,  -- "Save €250/month"
    confidence_score DECIMAL(3, 2),  -- 0.00 - 1.00

    -- Status
    status VARCHAR(50) DEFAULT 'active',  -- active, dismissed, implemented

    -- Metadata
    recommendation_metadata JSONB,

    -- Timestamps
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    expires_at TIMESTAMP DEFAULT (NOW() + INTERVAL '30 days'),
    dismissed_at TIMESTAMP,

    -- Foreign Keys
    FOREIGN KEY (tenant_id) REFERENCES tenants(id) ON DELETE CASCADE
);

CREATE INDEX idx_recommendations_tenant_status
ON recommendations(tenant_id, status);

CREATE INDEX idx_recommendations_tenant_domain
ON recommendations(tenant_id, event_domain);

15.6 Migrations

Key Migrations:

  1. 20251015_1230_initial_schema.py

    • Created alerts, notifications, recommendations tables
    • Initial indexes
    • Full enrichment fields
  2. 20251123_add_alert_enhancements.py

    • Added action_created_at for escalation tracking
    • Added superseded_by_action_id for chaining
    • Added hidden_from_ui flag
    • Created indexes for escalation queries
    • Backfilled action_created_at for existing alerts
  3. 20251125_add_event_domain_column.py

    • Added event_domain to alerts table
    • Added index on (tenant_id, event_domain)
    • Populated domain from existing alert_type patterns

16. Performance & Monitoring

16.1 Performance Metrics

Processing Speed:

  • Alert enrichment: 500-800ms (full pipeline)
  • Notification processing: 20-30ms (80% faster)
  • Recommendation processing: 50-80ms (60% faster)
  • Average improvement: 54%

Database Query Performance:

  • Get active alerts by tenant: <50ms
  • Get critical alerts with priority sort: <100ms
  • Escalation age calculation: <150ms
  • Alert chaining lookup: <75ms

API Response Times:

  • GET /alerts (paginated): <200ms
  • POST /alerts/{id}/acknowledge: <50ms
  • POST /alerts/{id}/resolve: <100ms

SSE Traffic:

  • Legacy (single channel): 100% of events to all pages
  • New (domain-based): 70% reduction on specialized pages
  • Dashboard: No change (needs all events)
  • Domain pages: 80-85% reduction

16.2 Caching Strategy

Redis Cache Keys:

tenant:{tenant_id}:alerts:active
tenant:{tenant_id}:alerts:critical
tenant:{tenant_id}:orchestrator_context:{action_id}

Cache Invalidation:

  • On alert creation: Invalidate alerts:active
  • On priority update: Invalidate alerts:critical
  • On escalation: Invalidate all alert caches
  • On resolution: Invalidate both active and critical

TTL:

  • Alert lists: 5 minutes
  • Orchestrator context: 15 minutes
  • Deduplication keys: 15 minutes
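The invalidation rules above can be centralized in one helper that maps a lifecycle event to the cache keys it should clear; a sketch (the event names are assumptions inferred from the rules listed, not confirmed identifiers):

```python
def keys_to_invalidate(tenant_id: str, event: str) -> list[str]:
    """Map an alert lifecycle event to the Redis cache keys it invalidates."""
    active = f"tenant:{tenant_id}:alerts:active"
    critical = f"tenant:{tenant_id}:alerts:critical"
    rules = {
        "created": [active],              # new alert changes the active list
        "priority_updated": [critical],   # may enter/leave the critical set
        "escalated": [active, critical],  # escalation touches both caches
        "resolved": [active, critical],
    }
    return rules.get(event, [])
```

The caller would then issue a single `DEL` over the returned keys, keeping invalidation logic in one place rather than scattered across handlers.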

16.3 Monitoring Metrics

Prometheus Metrics:

from prometheus_client import Counter, Histogram

# Alert creation rate
alert_created_total = Counter('alert_created_total', 'Total alerts created', ['tenant_id', 'alert_type'])

# Enrichment timing
enrichment_duration_seconds = Histogram('enrichment_duration_seconds', 'Enrichment processing time', ['event_type'])

# Priority distribution
alert_priority_distribution = Histogram('alert_priority_distribution', 'Alert priority scores', ['priority_level'])

# Resolution metrics
alert_resolution_time_seconds = Histogram('alert_resolution_time_seconds', 'Time to resolve alerts', ['alert_type'])

# Escalation tracking
alert_escalated_total = Counter('alert_escalated_total', 'Alerts escalated', ['escalation_reason'])

# Deduplication hits
alert_deduplicated_total = Counter('alert_deduplicated_total', 'Alerts deduplicated', ['alert_type'])

Key Metrics to Monitor:

  • Alert creation rate (per tenant, per type)
  • Average resolution time (should decrease over time)
  • Escalation rate (high rate indicates alerts being ignored)
  • Deduplication hit rate (should be 10-20%)
  • Enrichment performance (p50, p95, p99)
  • SSE connection count and duration

16.4 Health Checks

Alert Processor Health:

@app.get("/health")
async def health_check():
    checks = {
        "database": await check_db_connection(),
        "redis": await check_redis_connection(),
        "rabbitmq": await check_rabbitmq_connection(),
        "orchestrator_api": await check_orchestrator_api()
    }

    overall_healthy = all(checks.values())
    status_code = 200 if overall_healthy else 503

    return JSONResponse(
        status_code=status_code,
        content={
            "status": "healthy" if overall_healthy else "unhealthy",
            "checks": checks,
            "timestamp": datetime.now(timezone.utc).isoformat()
        }
    )

CronJob Monitoring:

# Kubernetes CronJob metrics
- Last successful run timestamp
- Last failed run timestamp
- Average execution duration
- Alert count processed per run
- Error count per run

16.5 Troubleshooting Guide

Problem: Alerts not appearing in frontend

Diagnosis:

  1. Check alert created in database: SELECT * FROM alerts WHERE tenant_id=... ORDER BY created_at DESC LIMIT 10;
  2. Check Redis pub/sub: SUBSCRIBE tenant:{id}:inventory.alerts
  3. Check SSE connection: Browser dev tools → Network → EventStream
  4. Check frontend hook subscription: Console logs

Problem: Slow enrichment

Diagnosis:

  1. Check Prometheus metrics for enrichment_duration_seconds
  2. Identify slow enrichment service (orchestrator, priority scoring, etc.)
  3. Check orchestrator API response time
  4. Review database query performance (EXPLAIN ANALYZE)

Problem: High escalation rate

Diagnosis:

  1. Query alerts by age: SELECT alert_type, COUNT(*) FROM alerts WHERE action_created_at < NOW() - INTERVAL '48 hours' GROUP BY alert_type;
  2. Check if certain alert types are consistently ignored
  3. Review smart actions (are they actionable?)
  4. Check user permissions (can users actually execute actions?)

Problem: Duplicate alerts

Diagnosis:

  1. Check deduplication key generation logic
  2. Verify Redis connection (dedup keys being set?)
  3. Review deduplication window (15 minutes may be too short)
  4. Check for race conditions in concurrent alert creation
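For the race-condition case, the usual fix is an atomic set-if-absent on the dedup key (`SET key 1 NX EX 900` in Redis) instead of a separate check followed by `SETEX`. An in-memory stand-in sketching the same claim semantics (single-process only; real cross-worker atomicity requires the Redis call itself):

```python
import time

class DedupGuard:
    """In-memory stand-in for Redis `SET key 1 NX EX ttl`: the first caller
    to claim a key wins; later callers within the TTL see a duplicate."""

    def __init__(self):
        self._claims: dict[str, float] = {}

    def claim(self, key: str, ttl_seconds: int = 900) -> bool:
        now = time.monotonic()
        expiry = self._claims.get(key)
        if expiry is not None and expiry > now:
            return False  # duplicate within the dedup window
        self._claims[key] = now + ttl_seconds
        return True
```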

17. Deployment Guide

17.1 5-Week Deployment Timeline

Week 1: Backend & Gateway

  • Day 1: Database migration in dev environment
  • Day 2-3: Deploy alert processor with dual publishing
  • Day 4: Deploy updated gateway
  • Day 5: Monitoring & validation

Week 2-3: Frontend Integration

  • Dashboard components with event hooks
  • Priority components (ActionQueue, HealthHero, ExecutionTracker)
  • Domain pages (Inventory, Production, Supply Chain)

Week 4: Cutover

  • Verify complete migration
  • Remove dual publishing
  • Database cleanup (remove legacy columns)

Week 5: Optimization

  • Performance tuning
  • Monitoring dashboards
  • Alert rules refinement

17.2 Pre-Deployment Checklist

  • Database migration scripts tested
  • Backward compatibility verified
  • Rollback procedure documented
  • Monitoring metrics defined
  • Performance benchmarks set
  • Example integrations tested
  • Documentation complete

17.3 Rollback Procedure

If issues occur:

  1. Stop new alert processor deployment
  2. Revert gateway to previous version
  3. Roll back database migration (if safe)
  4. Resume dual publishing if partially migrated
  5. Investigate root cause
  6. Fix and redeploy

Appendix

Version History

  • v2.0 (2025-11-25): Complete architecture with escalation, chaining, cronjobs
  • v1.5 (2025-11-23): Added stock receipt system and delivery tracking
  • v1.0 (2025-11-15): Initial three-tier enrichment system

Contributors

This alert system was designed and implemented collaboratively to support the Bakery-IA platform's mission of providing intelligent, context-aware alerts that respect user time and decision-making agency.


Last Updated: 2025-11-25 Status: Production-Ready Next Review: As needed based on system evolution