📊 Bakery-ia Monitoring System Documentation

🎯 Overview

The bakery-ia platform features a comprehensive, modern monitoring system built on OpenTelemetry and SigNoz. This documentation provides a complete guide to the monitoring architecture, setup, and usage.

🚀 Monitoring Architecture

Core Components

graph TD
    A[Microservices] -->|OTLP| B[OpenTelemetry Collector]
    B -->|gRPC| C[SigNoz]
    C --> D[Traces Dashboard]
    C --> E[Metrics Dashboard]
    C --> F[Logs Dashboard]
    C --> G[Alerts]

Technology Stack

  • Instrumentation: OpenTelemetry Python SDK
  • Protocol: OTLP (OpenTelemetry Protocol) over gRPC
  • Backend: SigNoz (open-source observability platform)
  • Metrics: Prometheus-compatible metrics via OTLP
  • Traces: Jaeger-compatible tracing via OTLP
  • Logs: Structured logging with trace correlation

📋 Monitoring Coverage

Service Coverage (100%)

| Service Category | Services | Monitoring Type | Status |
| --- | --- | --- | --- |
| Critical Services | auth, orders, sales, external | Base Class | Monitored |
| AI Services | ai-insights, training | Direct | Monitored |
| Data Services | inventory, procurement, production, forecasting | Base Class | Monitored |
| Operational Services | tenant, notification, distribution | Base Class | Monitored |
| Specialized Services | suppliers, pos, recipes, orchestrator | Base Class | Monitored |
| Infrastructure | gateway, alert-processor, demo-session | Direct | Monitored |

Total: 20 services with 100% monitoring coverage

🔧 Monitoring Implementation

Implementation Patterns

1. Base Class Pattern (16 services)

Services using StandardFastAPIService inherit comprehensive monitoring:

from shared.service_base import StandardFastAPIService

class MyService(StandardFastAPIService):
    def __init__(self):
        super().__init__(
            service_name="my-service",
            app_name="My Service",
            description="Service description",
            version="1.0.0",
            # Monitoring enabled by default
            enable_metrics=True,      # ✅ Metrics collection
            enable_tracing=True,      # ✅ Distributed tracing
            enable_health_checks=True # ✅ Health endpoints
        )

2. Direct Pattern (4 services)

Critical services with custom monitoring needs:

# services/ai_insights/app/main.py
from shared.monitoring.metrics import MetricsCollector, add_metrics_middleware
from shared.monitoring.system_metrics import SystemMetricsCollector

# Initialize metrics collectors
metrics_collector = MetricsCollector("ai-insights")
system_metrics = SystemMetricsCollector("ai-insights")

# Add middleware
add_metrics_middleware(app, metrics_collector)

Monitoring Components

OpenTelemetry Instrumentation

# Automatic instrumentation applied by the base class
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.redis import RedisInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor

FastAPIInstrumentor.instrument_app(app)      # HTTP requests
HTTPXClientInstrumentor().instrument()       # Outgoing HTTP
RedisInstrumentor().instrument()             # Redis operations
SQLAlchemyInstrumentor().instrument()        # Database queries

Metrics Collection

# Standard metrics automatically collected
metrics_collector.register_counter("http_requests_total", "Total HTTP requests")
metrics_collector.register_histogram("http_request_duration", "Request duration")
metrics_collector.register_gauge("active_requests", "Active requests")

# System metrics automatically collected
system_metrics = SystemMetricsCollector("service-name")
# → CPU, Memory, Disk I/O, Network I/O, Threads, File Descriptors

Health Checks

# Automatic health check endpoints
GET /health          # Overall service health
GET /health/detailed # Detailed health with dependencies
GET /health/ready    # Readiness probe
GET /health/live     # Liveness probe
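
Both probes are served by the base class; the sketch below is a hypothetical FastAPI illustration (not the base-class source) of the difference between the basic check and the dependency-aware detailed check:

# Hypothetical sketch: the real endpoints come from StandardFastAPIService.
from fastapi import FastAPI

app = FastAPI()

async def check_database() -> bool:
    # Placeholder dependency check; a real implementation would run e.g. SELECT 1.
    return True

@app.get("/health")
async def health():
    # Cheap check: the process is up and serving requests.
    return {"status": "ok"}

@app.get("/health/detailed")
async def health_detailed():
    # Dependency-aware check used for debugging and readiness decisions.
    db_ok = await check_database()
    return {
        "status": "ok" if db_ok else "degraded",
        "dependencies": {"database": db_ok},
    }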

📊 Metrics Reference

Standard Metrics (All Services)

| Metric Type | Metric Name | Description | Labels |
| --- | --- | --- | --- |
| HTTP Metrics | {service}_http_requests_total | Total HTTP requests | method, endpoint, status_code |
| HTTP Metrics | {service}_http_request_duration_seconds | Request duration histogram | method, endpoint, status_code |
| HTTP Metrics | {service}_active_requests | Currently active requests | - |
| System Metrics | process.cpu.utilization | Process CPU usage | - |
| System Metrics | process.memory.usage | Process memory usage | - |
| System Metrics | system.cpu.utilization | System CPU usage | - |
| System Metrics | system.memory.usage | System memory usage | - |
| Database Metrics | db.query.duration | Database query duration | operation, table |
| Cache Metrics | cache.operation.duration | Cache operation duration | operation, key |

Custom Metrics (Service-Specific)

Examples of service-specific metrics (a registration sketch follows these lists):

Auth Service:

  • auth_registration_total (by status)
  • auth_login_success_total
  • auth_login_failure_total (by reason)
  • auth_registration_duration_seconds

Orders Service:

  • orders_created_total
  • orders_processed_total (by status)
  • orders_processing_duration_seconds

AI Insights Service:

  • ai_insights_generated_total
  • ai_model_inference_duration_seconds
  • ai_feedback_received_total
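
As noted above, a hedged sketch of how one of these counters can be registered and incremented using the MetricsCollector API shown elsewhere in this document; the label name is illustrative:

from shared.monitoring.metrics import MetricsCollector

metrics_collector = MetricsCollector("auth-service")

# Register once at startup.
metrics_collector.register_counter(
    "auth_login_failure_total",
    "Total failed login attempts",
    labels=["reason"]
)

# Increment wherever a login fails.
def record_login_failure(reason: str) -> None:
    metrics_collector.increment_counter(
        "auth_login_failure_total",
        value=1,
        labels={"reason": reason}
    )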

🔍 Tracing Guide

Trace Propagation

Traces automatically flow across service boundaries:

sequenceDiagram
    participant Client
    participant Gateway
    participant Auth
    participant Orders
    
    Client->>Gateway: HTTP Request (trace_id: abc123)
    Gateway->>Auth: Auth Check (trace_id: abc123)
    Auth-->>Gateway: Auth Response (trace_id: abc123)
    Gateway->>Orders: Create Order (trace_id: abc123)
    Orders-->>Gateway: Order Created (trace_id: abc123)
    Gateway-->>Client: Final Response (trace_id: abc123)
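
Propagation is handled automatically by the HTTPX instrumentation; the sketch below only illustrates the underlying mechanism (W3C trace context headers injected into the outgoing request). The target URL is illustrative:

import httpx
from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer("orders-service")

def call_auth_service():
    with tracer.start_as_current_span("auth_check"):
        headers = {}
        inject(headers)  # writes the current context, e.g. the traceparent header
        # The downstream service extracts the same trace_id, so its spans join this trace.
        return httpx.get("http://auth-service/verify", headers=headers)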

Trace Context in Logs

All logs include trace correlation:

{
    "level": "info",
    "message": "Processing order",
    "service": "orders-service",
    "trace_id": "abc123def456",
    "span_id": "789ghi",
    "order_id": "12345",
    "timestamp": "2024-01-08T19:00:00Z"
}
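
The trace_id and span_id fields come from the active span. A minimal sketch of the mechanism using the OpenTelemetry API (the actual wiring lives in the shared logging setup):

import json
import logging
from opentelemetry import trace

def log_with_trace(message: str, **fields):
    # Pull the identifiers of the currently active span and attach them to the log record.
    ctx = trace.get_current_span().get_span_context()
    record = {
        "message": message,
        "trace_id": trace.format_trace_id(ctx.trace_id),
        "span_id": trace.format_span_id(ctx.span_id),
        **fields,
    }
    logging.getLogger("orders-service").info(json.dumps(record))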

Manual Trace Enhancement

Add custom trace attributes and events:

from shared.monitoring.tracing import add_trace_attributes, add_trace_event

# Add custom attributes
add_trace_attributes(
    user_id="123",
    tenant_id="abc",
    operation="order_creation"
)

# Add trace events
add_trace_event("order_validation_started")
# ... validation logic ...
add_trace_event("order_validation_completed", status="success")

🚨 Alerting Guide

| Alert Name | Condition | Severity | Notification |
| --- | --- | --- | --- |
| High Error Rate | error_rate > 5% for 5m | High | PagerDuty + Slack |
| High Latency | p99_latency > 2s for 5m | High | PagerDuty + Slack |
| Service Unavailable | up == 0 for 1m | Critical | PagerDuty + Slack + Email |
| High Memory Usage | memory_usage > 80% for 10m | Medium | Slack |
| High CPU Usage | cpu_usage > 90% for 5m | Medium | Slack |
| Database Connection Issues | db_connections < minimum_pool_size | High | PagerDuty + Slack |
| Cache Hit Ratio Low | cache_hit_ratio < 70% for 15m | Low | Slack |

Creating Alerts in SigNoz

  1. Navigate to Alerts: SigNoz UI → Alerts → Create Alert
  2. Select Metric: Choose from available metrics
  3. Set Condition: Define threshold and duration
  4. Configure Notifications: Add notification channels
  5. Set Severity: Critical, High, Medium, Low
  6. Add Description: Explain alert purpose and resolution steps

Example Alert Configuration (YAML)

# Example expressed as a Kubernetes PrometheusRule resource (Prometheus Operator CRD)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: bakery-ia-alerts
  namespace: monitoring
spec:
  groups:
  - name: service-health
    rules:
    - alert: ServiceDown
      expr: up{service!~"signoz.*"} == 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "Service {{ $labels.service }} is down"
        description: "{{ $labels.service }} has been down for more than 1 minute"
        runbook: "https://github.com/yourorg/bakery-ia/blob/main/RUNBOOKS.md#service-down"

    - alert: HighErrorRate
      expr: rate(http_requests_total{status_code=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
      for: 5m
      labels:
        severity: high
      annotations:
        summary: "High error rate in {{ $labels.service }}"
        description: "Error rate is {{ $value }}% (threshold: 5%)"
        runbook: "https://github.com/yourorg/bakery-ia/blob/main/RUNBOOKS.md#high-error-rate"

📈 Dashboard Guide

1. Service Overview Dashboard

  • HTTP Request Rate
  • Error Rate
  • Latency Percentiles (p50, p90, p99)
  • Active Requests
  • System Resource Usage

2. Performance Dashboard

  • Request Duration Histogram
  • Database Query Performance
  • Cache Performance
  • External API Call Performance

3. System Health Dashboard

  • CPU Usage (Process & System)
  • Memory Usage (Process & System)
  • Disk I/O
  • Network I/O
  • File Descriptors
  • Thread Count

4. Business Metrics Dashboard

  • User Registrations
  • Order Volume
  • AI Insights Generated
  • API Usage by Tenant

Creating Dashboards in SigNoz

  1. Navigate to Dashboards: SigNoz UI → Dashboards → Create Dashboard
  2. Add Panels: Click "Add Panel" and select metric
  3. Configure Visualization: Choose chart type and settings
  4. Set Time Range: Default to the last 1h, 6h, 24h, or 7d
  5. Add Variables: For dynamic filtering (service, environment)
  6. Save Dashboard: Give it a descriptive name

🛠️ Troubleshooting Guide

Common Issues & Solutions

Issue: No Metrics Appearing in SigNoz

Checklist:

  • OpenTelemetry Collector running? kubectl get pods -n bakery-ia -l app.kubernetes.io/instance=signoz
  • Service can reach collector? telnet signoz-otel-collector.bakery-ia 4318
  • OTLP endpoint configured correctly? Check OTEL_EXPORTER_OTLP_ENDPOINT
  • Service logs show OTLP export? Look for "Exporting metrics"
  • No network policies blocking? Check Kubernetes network policies

Debugging:

# Check OpenTelemetry Collector logs
kubectl logs -n bakery-ia -l app=otel-collector

# Check service logs for OTLP errors
kubectl logs -l app=auth-service | grep -i otel

# Test OTLP connectivity from service pod
kubectl exec -it auth-service-pod -- curl -v http://signoz-otel-collector.bakery-ia:4318
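
If connectivity looks fine, confirm the SDK is producing data at all. A hedged debugging aid, assuming the standard OpenTelemetry SDK: temporarily attach a console exporter so spans are printed to stdout (not for production use):

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Only possible when the SDK TracerProvider is installed (i.e. tracing is initialized).
provider = trace.get_tracer_provider()
if isinstance(provider, TracerProvider):
    provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))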

Issue: High Latency in Specific Service

Checklist:

  • Database queries slow? Check db.query.duration metrics
  • External API calls slow? Check trace waterfall
  • High CPU usage? Check system metrics
  • Memory pressure? Check memory metrics
  • Too many active requests? Check concurrency

Debugging:

# Add detailed tracing to suspicious code
from shared.monitoring.tracing import add_trace_event

add_trace_event("database_query_started", table="users")
# ... database query ...
add_trace_event("database_query_completed", duration_ms=45)

Issue: High Error Rate

Checklist:

  • Database connection issues? Check health endpoints
  • External API failures? Check dependency metrics
  • Authentication failures? Check auth service logs
  • Validation errors? Check application logs
  • Rate limiting? Check gateway metrics

Debugging:

# Check error logs with trace correlation
kubectl logs -l app=auth-service | grep -i error | grep -i trace

# Filter traces by error status
# In SigNoz: Add filter http.status_code >= 400

📚 Runbook Reference

See RUNBOOKS.md for detailed troubleshooting procedures.

🔧 Development Guide

Adding Custom Metrics

# In any service using direct monitoring
self.metrics_collector.register_counter(
    "custom_metric_name",
    "Description of what this metric tracks",
    labels=["label1", "label2"]  # Optional labels
)

# Increment the counter
self.metrics_collector.increment_counter(
    "custom_metric_name",
    value=1,
    labels={"label1": "value1", "label2": "value2"}
)

Adding Custom Trace Attributes

# Add context to current span
from shared.monitoring.tracing import add_trace_attributes

add_trace_attributes(
    user_id=user.id,
    tenant_id=tenant.id,
    operation="premium_feature_access",
    feature_name="advanced_forecasting"
)

Service-Specific Monitoring Setup

For services needing custom monitoring beyond the base class:

# In your service's __init__ method
from shared.monitoring.system_metrics import SystemMetricsCollector
from shared.monitoring.metrics import MetricsCollector

class MyService(StandardFastAPIService):
    def __init__(self):
        # Call parent constructor first
        super().__init__(...)
        
        # Add custom metrics collector
        self.custom_metrics = MetricsCollector("my-service")
        
        # Register custom metrics
        self.custom_metrics.register_counter(
            "business_specific_events",
            "Custom business event counter"
        )
        
        # Add system metrics if not using base class defaults
        self.system_metrics = SystemMetricsCollector("my-service")

📊 SigNoz Configuration

Environment Variables

# OpenTelemetry Collector endpoint
OTEL_EXPORTER_OTLP_ENDPOINT=http://signoz-otel-collector.bakery-ia:4318

# Service-specific configuration
OTEL_SERVICE_NAME=auth-service
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production,k8s.namespace=bakery-ia

# Metrics export interval (default: 60000ms = 60s)
OTEL_METRIC_EXPORT_INTERVAL=60000

# Batch span processor configuration
OTEL_BSP_SCHEDULE_DELAY=5000
OTEL_BSP_MAX_QUEUE_SIZE=2048
OTEL_BSP_MAX_EXPORT_BATCH_SIZE=512
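
A minimal sketch, assuming the standard opentelemetry-sdk and OTLP exporter packages: exporters constructed without arguments read OTEL_EXPORTER_OTLP_ENDPOINT and the other variables above from the environment, so no endpoint needs to be hard-coded in service code:

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter

# Endpoint comes from OTEL_EXPORTER_OTLP_ENDPOINT, the export interval from
# OTEL_METRIC_EXPORT_INTERVAL; OTEL_SERVICE_NAME / OTEL_RESOURCE_ATTRIBUTES are
# applied when the default Resource is built.
reader = PeriodicExportingMetricReader(OTLPMetricExporter())
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))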

Kubernetes Configuration

# Example deployment with OpenTelemetry environment configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: auth-service
spec:
  template:
    spec:
      containers:
      - name: auth-service
        image: auth-service:latest
        env:
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: "http://signoz-otel-collector.bakery-ia:4318"
        - name: OTEL_SERVICE_NAME
          value: "auth-service"
        - name: ENVIRONMENT
          value: "production"
        resources:
          limits:
            cpu: "1"
            memory: "512Mi"
          requests:
            cpu: "200m"
            memory: "256Mi"

🎯 Best Practices

Monitoring Best Practices

  1. Use Consistent Naming: Follow OpenTelemetry semantic conventions
  2. Add Context to Traces: Include user/tenant IDs in trace attributes
  3. Monitor Dependencies: Track external API and database performance
  4. Set Appropriate Alerts: Avoid alert fatigue with meaningful thresholds
  5. Document Metrics: Keep metrics documentation up to date
  6. Review Regularly: Update dashboards as services evolve
  7. Test Alerts: Ensure alerts fire correctly before production

Performance Best Practices

  1. Batch Metrics Export: Use the default 60s export interval for most services
  2. Sample Traces: Consider sampling for high-volume services (see the sketch after this list)
  3. Limit Custom Metrics: Only track metrics that provide value
  4. Use Histograms Wisely: Histograms can be resource-intensive
  5. Monitor Monitoring: Track OTLP export success/failure rates
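
As referenced in item 2, a hedged sketch of head-based sampling with the standard OpenTelemetry SDK: keep a fixed ratio of new traces while respecting the upstream decision so distributed traces stay complete. The same effect is available without code via OTEL_TRACES_SAMPLER=parentbased_traceidratio and OTEL_TRACES_SAMPLER_ARG=0.1:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample 10% of new (root) traces; child spans follow the parent's decision.
sampler = ParentBased(root=TraceIdRatioBased(0.10))
trace.set_tracer_provider(TracerProvider(sampler=sampler))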

📞 Support

Getting Help

  1. Check Documentation: This file and RUNBOOKS.md
  2. Review SigNoz Docs: https://signoz.io/docs/
  3. OpenTelemetry Docs: https://opentelemetry.io/docs/
  4. Team Channel: #monitoring in Slack
  5. GitHub Issues: https://github.com/yourorg/bakery-ia/issues

Escalation Path

  1. First Line: Development team (service owners)
  2. Second Line: DevOps team (monitoring specialists)
  3. Third Line: SigNoz support (vendor support)

🎉 Summary

The bakery-ia monitoring system provides:

  • 📊 100% Service Coverage: All 20 services monitored
  • 🚀 Modern Architecture: OpenTelemetry + SigNoz
  • 🔧 Comprehensive Metrics: System, HTTP, database, cache
  • 🔍 Full Observability: Traces, metrics, logs integrated
  • ✅ Production Ready: Battle-tested and scalable

All services are fully instrumented and ready for production monitoring! 🎉