# 📊 Bakery-ia Monitoring System Documentation

## 🎯 Overview

The bakery-ia platform features a comprehensive, modern monitoring system built on **OpenTelemetry** and **SigNoz**. This documentation provides a complete guide to the monitoring architecture, setup, and usage.

## 🚀 Monitoring Architecture

### Core Components

```mermaid
graph TD
    A[Microservices] -->|OTLP| B[OpenTelemetry Collector]
    B -->|gRPC| C[SigNoz]
    C --> D[Traces Dashboard]
    C --> E[Metrics Dashboard]
    C --> F[Logs Dashboard]
    C --> G[Alerts]
```

### Technology Stack

- **Instrumentation**: OpenTelemetry Python SDK
- **Protocol**: OTLP (OpenTelemetry Protocol) over gRPC
- **Backend**: SigNoz (open-source observability platform)
- **Metrics**: Prometheus-compatible metrics via OTLP
- **Traces**: Jaeger-compatible tracing via OTLP
- **Logs**: Structured logging with trace correlation

## 📋 Monitoring Coverage

### Service Coverage (100%)

| Service Category | Services | Monitoring Type | Status |
|-----------------|----------|----------------|--------|
| **Critical Services** | auth, orders, sales, external | Base Class | ✅ Monitored |
| **AI Services** | ai-insights, training | Direct | ✅ Monitored |
| **Data Services** | inventory, procurement, production, forecasting | Base Class | ✅ Monitored |
| **Operational Services** | tenant, notification, distribution | Base Class | ✅ Monitored |
| **Specialized Services** | suppliers, pos, recipes, orchestrator | Base Class | ✅ Monitored |
| **Infrastructure** | gateway, alert-processor, demo-session | Direct | ✅ Monitored |

**Total: 20 services with 100% monitoring coverage**

## 🔧 Monitoring Implementation

### Implementation Patterns

#### 1. Base Class Pattern (15 services)

Services using `StandardFastAPIService` inherit comprehensive monitoring:

```python
from shared.service_base import StandardFastAPIService

class MyService(StandardFastAPIService):
    def __init__(self):
        super().__init__(
            service_name="my-service",
            app_name="My Service",
            description="Service description",
            version="1.0.0",
            # Monitoring enabled by default
            enable_metrics=True,        # ✅ Metrics collection
            enable_tracing=True,        # ✅ Distributed tracing
            enable_health_checks=True   # ✅ Health endpoints
        )
```
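For reference, here is a minimal sketch of the tracer wiring that `enable_tracing=True` implies, built with the standard OpenTelemetry Python SDK. The actual setup lives in `shared.service_base`; the exporter flavor and endpoint default below are assumptions (OTLP/gRPC conventionally listens on port 4317, OTLP/HTTP on 4318), so treat this as illustrative only.

```python
# Illustrative only: the real wiring lives in shared.service_base.
import os

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Assumes the gRPC exporter; match the port to the exporter flavor
# (4317 for OTLP/gRPC, 4318 for OTLP/HTTP).
endpoint = os.getenv(
    "OTEL_EXPORTER_OTLP_ENDPOINT",
    "http://signoz-otel-collector.bakery-ia:4317",
)

# Tag every span with the service name so SigNoz can group them per service
provider = TracerProvider(resource=Resource.create({"service.name": "my-service"}))

# Export spans in batches to the OpenTelemetry Collector
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint=endpoint)))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
```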
#### 2. Direct Pattern (5 services)

Critical services with custom monitoring needs:

```python
# services/ai_insights/app/main.py
from shared.monitoring.metrics import MetricsCollector, add_metrics_middleware
from shared.monitoring.system_metrics import SystemMetricsCollector

# Initialize metrics collectors
metrics_collector = MetricsCollector("ai-insights")
system_metrics = SystemMetricsCollector("ai-insights")

# Add middleware
add_metrics_middleware(app, metrics_collector)
```

### Monitoring Components

#### OpenTelemetry Instrumentation

```python
# Automatic instrumentation in base class
FastAPIInstrumentor.instrument_app(app)   # HTTP requests
HTTPXClientInstrumentor().instrument()    # Outgoing HTTP
RedisInstrumentor().instrument()          # Redis operations
SQLAlchemyInstrumentor().instrument()     # Database queries
```

#### Metrics Collection

```python
# Standard metrics automatically collected
metrics_collector.register_counter("http_requests_total", "Total HTTP requests")
metrics_collector.register_histogram("http_request_duration", "Request duration")
metrics_collector.register_gauge("active_requests", "Active requests")

# System metrics automatically collected
system_metrics = SystemMetricsCollector("service-name")
# → CPU, Memory, Disk I/O, Network I/O, Threads, File Descriptors
```

#### Health Checks

```text
# Automatic health check endpoints
GET /health           # Overall service health
GET /health/detailed  # Detailed health with dependencies
GET /health/ready     # Readiness probe
GET /health/live      # Liveness probe
```

## 📊 Metrics Reference

### Standard Metrics (All Services)

| Metric Type | Metric Name | Description | Labels |
|-------------|------------|-------------|--------|
| **HTTP Metrics** | `{service}_http_requests_total` | Total HTTP requests | method, endpoint, status_code |
| **HTTP Metrics** | `{service}_http_request_duration_seconds` | Request duration histogram | method, endpoint, status_code |
| **HTTP Metrics** | `{service}_active_requests` | Currently active requests | - |
| **System Metrics** | `process.cpu.utilization` | Process CPU usage | - |
| **System Metrics** | `process.memory.usage` | Process memory usage | - |
| **System Metrics** | `system.cpu.utilization` | System CPU usage | - |
| **System Metrics** | `system.memory.usage` | System memory usage | - |
| **Database Metrics** | `db.query.duration` | Database query duration | operation, table |
| **Cache Metrics** | `cache.operation.duration` | Cache operation duration | operation, key |

### Custom Metrics (Service-Specific)

Examples of service-specific metrics:

**Auth Service:**
- `auth_registration_total` (by status)
- `auth_login_success_total`
- `auth_login_failure_total` (by reason)
- `auth_registration_duration_seconds`

**Orders Service:**
- `orders_created_total`
- `orders_processed_total` (by status)
- `orders_processing_duration_seconds`

**AI Insights Service:**
- `ai_insights_generated_total`
- `ai_model_inference_duration_seconds`
- `ai_feedback_received_total`

## 🔍 Tracing Guide

### Trace Propagation

Traces automatically flow across service boundaries:

```mermaid
sequenceDiagram
    participant Client
    participant Gateway
    participant Auth
    participant Orders

    Client->>Gateway: HTTP Request (trace_id: abc123)
    Gateway->>Auth: Auth Check (trace_id: abc123)
    Auth-->>Gateway: Auth Response (trace_id: abc123)
    Gateway->>Orders: Create Order (trace_id: abc123)
    Orders-->>Gateway: Order Created (trace_id: abc123)
    Gateway-->>Client: Final Response (trace_id: abc123)
```
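The instrumented HTTPX client injects this context automatically. For an HTTP client that is not auto-instrumented, the standard OpenTelemetry propagation API can add the W3C `traceparent` header by hand; the target URL below is a placeholder.

```python
import httpx
from opentelemetry.propagate import inject

# Inject the current trace context (W3C traceparent header) into the
# outgoing request so the downstream service joins the same trace.
headers = {}
inject(headers)

# Placeholder URL; any non-instrumented HTTP call works the same way.
response = httpx.get("http://orders-service/api/orders/12345", headers=headers)
```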
"level": "info", "message": "Processing order", "service": "orders-service", "trace_id": "abc123def456", "span_id": "789ghi", "order_id": "12345", "timestamp": "2024-01-08T19:00:00Z" } ``` ### Manual Trace Enhancement Add custom trace attributes: ```python from shared.monitoring.tracing import add_trace_attributes, add_trace_event # Add custom attributes add_trace_attributes( user_id="123", tenant_id="abc", operation="order_creation" ) # Add trace events add_trace_event("order_validation_started") # ... validation logic ... add_trace_event("order_validation_completed", status="success") ``` ## 🚨 Alerting Guide ### Standard Alerts (Recommended) | Alert Name | Condition | Severity | Notification | |------------|-----------|----------|--------------| | **High Error Rate** | `error_rate > 5%` for 5m | High | PagerDuty + Slack | | **High Latency** | `p99_latency > 2s` for 5m | High | PagerDuty + Slack | | **Service Unavailable** | `up == 0` for 1m | Critical | PagerDuty + Slack + Email | | **High Memory Usage** | `memory_usage > 80%` for 10m | Medium | Slack | | **High CPU Usage** | `cpu_usage > 90%` for 5m | Medium | Slack | | **Database Connection Issues** | `db_connections < minimum_pool_size` | High | PagerDuty + Slack | | **Cache Hit Ratio Low** | `cache_hit_ratio < 70%` for 15m | Low | Slack | ### Creating Alerts in SigNoz 1. **Navigate to Alerts**: SigNoz UI → Alerts → Create Alert 2. **Select Metric**: Choose from available metrics 3. **Set Condition**: Define threshold and duration 4. **Configure Notifications**: Add notification channels 5. **Set Severity**: Critical, High, Medium, Low 6. **Add Description**: Explain alert purpose and resolution steps ### Example Alert Configuration (YAML) ```yaml # Example for Terraform/Kubernetes apiVersion: monitoring.coreos.com/v1 kind: PrometheusRule metadata: name: bakery-ia-alerts namespace: monitoring spec: groups: - name: service-health rules: - alert: ServiceDown expr: up{service!~"signoz.*"} == 0 for: 1m labels: severity: critical annotations: summary: "Service {{ $labels.service }} is down" description: "{{ $labels.service }} has been down for more than 1 minute" runbook: "https://github.com/yourorg/bakery-ia/blob/main/RUNBOOKS.md#service-down" - alert: HighErrorRate expr: rate(http_requests_total{status_code=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05 for: 5m labels: severity: high annotations: summary: "High error rate in {{ $labels.service }}" description: "Error rate is {{ $value }}% (threshold: 5%)" runbook: "https://github.com/yourorg/bakery-ia/blob/main/RUNBOOKS.md#high-error-rate" ``` ## 📈 Dashboard Guide ### Recommended Dashboards #### 1. Service Overview Dashboard - HTTP Request Rate - Error Rate - Latency Percentiles (p50, p90, p99) - Active Requests - System Resource Usage #### 2. Performance Dashboard - Request Duration Histogram - Database Query Performance - Cache Performance - External API Call Performance #### 3. System Health Dashboard - CPU Usage (Process & System) - Memory Usage (Process & System) - Disk I/O - Network I/O - File Descriptors - Thread Count #### 4. Business Metrics Dashboard - User Registrations - Order Volume - AI Insights Generated - API Usage by Tenant ### Creating Dashboards in SigNoz 1. **Navigate to Dashboards**: SigNoz UI → Dashboards → Create Dashboard 2. **Add Panels**: Click "Add Panel" and select metric 3. **Configure Visualization**: Choose chart type and settings 4. **Set Time Range**: Default to last 1h, 6h, 24h, 7d 5. 
## 📈 Dashboard Guide

### Recommended Dashboards

#### 1. Service Overview Dashboard
- HTTP Request Rate
- Error Rate
- Latency Percentiles (p50, p90, p99)
- Active Requests
- System Resource Usage

#### 2. Performance Dashboard
- Request Duration Histogram
- Database Query Performance
- Cache Performance
- External API Call Performance

#### 3. System Health Dashboard
- CPU Usage (Process & System)
- Memory Usage (Process & System)
- Disk I/O
- Network I/O
- File Descriptors
- Thread Count

#### 4. Business Metrics Dashboard
- User Registrations
- Order Volume
- AI Insights Generated
- API Usage by Tenant

### Creating Dashboards in SigNoz

1. **Navigate to Dashboards**: SigNoz UI → Dashboards → Create Dashboard
2. **Add Panels**: Click "Add Panel" and select metric
3. **Configure Visualization**: Choose chart type and settings
4. **Set Time Range**: Default to last 1h, 6h, 24h, 7d
5. **Add Variables**: For dynamic filtering (service, environment)
6. **Save Dashboard**: Give it a descriptive name

## 🛠️ Troubleshooting Guide

### Common Issues & Solutions

#### Issue: No Metrics Appearing in SigNoz

**Checklist:**
- ✅ OpenTelemetry Collector running? `kubectl get pods -n bakery-ia -l app.kubernetes.io/instance=signoz`
- ✅ Service can reach collector? `telnet signoz-otel-collector.bakery-ia 4318`
- ✅ OTLP endpoint configured correctly? Check `OTEL_EXPORTER_OTLP_ENDPOINT`
- ✅ Service logs show OTLP export? Look for "Exporting metrics"
- ✅ No network policies blocking? Check Kubernetes network policies

**Debugging:**

```bash
# Check OpenTelemetry Collector logs
kubectl logs -n bakery-ia -l app=otel-collector

# Check service logs for OTLP errors
kubectl logs -l app=auth-service | grep -i otel

# Test OTLP connectivity from service pod
kubectl exec -it auth-service-pod -- curl -v http://signoz-otel-collector.bakery-ia:4318
```

#### Issue: High Latency in Specific Service

**Checklist:**
- ✅ Database queries slow? Check `db.query.duration` metrics
- ✅ External API calls slow? Check trace waterfall
- ✅ High CPU usage? Check system metrics
- ✅ Memory pressure? Check memory metrics
- ✅ Too many active requests? Check concurrency

**Debugging:**

```python
# Add detailed tracing to suspicious code
from shared.monitoring.tracing import add_trace_event

add_trace_event("database_query_started", table="users")
# ... database query ...
add_trace_event("database_query_completed", duration_ms=45)
```

#### Issue: High Error Rate

**Checklist:**
- ✅ Database connection issues? Check health endpoints
- ✅ External API failures? Check dependency metrics
- ✅ Authentication failures? Check auth service logs
- ✅ Validation errors? Check application logs
- ✅ Rate limiting? Check gateway metrics

**Debugging:**

```bash
# Check error logs with trace correlation
kubectl logs -l app=auth-service | grep -i error | grep -i trace

# Filter traces by error status
# In SigNoz: add filter http.status_code >= 400
```

## 📚 Runbook Reference

See [RUNBOOKS.md](RUNBOOKS.md) for detailed troubleshooting procedures.

## 🔧 Development Guide

### Adding Custom Metrics

```python
# In any service using direct monitoring
self.metrics_collector.register_counter(
    "custom_metric_name",
    "Description of what this metric tracks",
    labels=["label1", "label2"]  # Optional labels
)

# Increment the counter
self.metrics_collector.increment_counter(
    "custom_metric_name",
    value=1,
    labels={"label1": "value1", "label2": "value2"}
)
```

### Adding Custom Trace Attributes

```python
# Add context to current span
from shared.monitoring.tracing import add_trace_attributes

add_trace_attributes(
    user_id=user.id,
    tenant_id=tenant.id,
    operation="premium_feature_access",
    feature_name="advanced_forecasting"
)
```

### Service-Specific Monitoring Setup

For services needing custom monitoring beyond the base class:

```python
# In your service's __init__ method
from shared.monitoring.system_metrics import SystemMetricsCollector
from shared.monitoring.metrics import MetricsCollector

class MyService(StandardFastAPIService):
    def __init__(self):
        # Call parent constructor first
        super().__init__(...)

        # Add custom metrics collector
        self.custom_metrics = MetricsCollector("my-service")

        # Register custom metrics
        self.custom_metrics.register_counter(
            "business_specific_events",
            "Custom business event counter"
        )

        # Add system metrics if not using base class defaults
        self.system_metrics = SystemMetricsCollector("my-service")
```
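A hypothetical usage sketch for the counter registered above: the handler method and event payload are invented for illustration, and only `increment_counter()` follows the interface documented under "Adding Custom Metrics".

```python
# Hypothetical call site; continues the MyService class sketched above.
# The method name and event payload are illustrative only.
class MyService(StandardFastAPIService):
    async def handle_business_event(self, event: dict) -> None:
        # ... domain logic for the event ...

        # Count the event so it shows up in SigNoz alongside the standard metrics
        self.custom_metrics.increment_counter(
            "business_specific_events",
            value=1
        )
```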
## 📊 SigNoz Configuration

### Environment Variables

```env
# OpenTelemetry Collector endpoint
OTEL_EXPORTER_OTLP_ENDPOINT=http://signoz-otel-collector.bakery-ia:4318

# Service-specific configuration
OTEL_SERVICE_NAME=auth-service
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production,k8s.namespace=bakery-ia

# Metrics export interval (default: 60000ms = 60s)
OTEL_METRIC_EXPORT_INTERVAL=60000

# Batch span processor configuration
OTEL_BSP_SCHEDULE_DELAY=5000
OTEL_BSP_MAX_QUEUE_SIZE=2048
OTEL_BSP_MAX_EXPORT_BATCH_SIZE=512
```

### Kubernetes Configuration

```yaml
# Example deployment with monitoring environment variables
apiVersion: apps/v1
kind: Deployment
metadata:
  name: auth-service
spec:
  template:
    spec:
      containers:
        - name: auth-service
          image: auth-service:latest
          env:
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://signoz-otel-collector.bakery-ia:4318"
            - name: OTEL_SERVICE_NAME
              value: "auth-service"
            - name: ENVIRONMENT
              value: "production"
          resources:
            limits:
              cpu: "1"
              memory: "512Mi"
            requests:
              cpu: "200m"
              memory: "256Mi"
```

## 🎯 Best Practices

### Monitoring Best Practices

1. **Use Consistent Naming**: Follow OpenTelemetry semantic conventions
2. **Add Context to Traces**: Include user/tenant IDs in trace attributes
3. **Monitor Dependencies**: Track external API and database performance
4. **Set Appropriate Alerts**: Avoid alert fatigue with meaningful thresholds
5. **Document Metrics**: Keep metrics documentation up to date
6. **Review Regularly**: Update dashboards as services evolve
7. **Test Alerts**: Ensure alerts fire correctly before production

### Performance Best Practices

1. **Batch Metrics Export**: Use the default 60s interval for most services
2. **Sample Traces**: Consider sampling for high-volume services
3. **Limit Custom Metrics**: Only track metrics that provide value
4. **Use Histograms Wisely**: Histograms can be resource-intensive
5. **Monitor Monitoring**: Track OTLP export success/failure rates

## 📞 Support

### Getting Help

1. **Check Documentation**: This file and RUNBOOKS.md
2. **Review SigNoz Docs**: https://signoz.io/docs/
3. **OpenTelemetry Docs**: https://opentelemetry.io/docs/
4. **Team Channel**: #monitoring in Slack
5. **GitHub Issues**: https://github.com/yourorg/bakery-ia/issues

### Escalation Path

1. **First Line**: Development team (service owners)
2. **Second Line**: DevOps team (monitoring specialists)
3. **Third Line**: SigNoz support (vendor support)

## 🎉 Summary

The bakery-ia monitoring system provides:

- **📊 100% Service Coverage**: All 20 services monitored
- **🚀 Modern Architecture**: OpenTelemetry + SigNoz
- **🔧 Comprehensive Metrics**: System, HTTP, database, cache
- **🔍 Full Observability**: Traces, metrics, logs integrated
- **✅ Production Ready**: Battle-tested and scalable

**All services are fully instrumented and ready for production monitoring!** 🎉