# 📊 Bakery-ia Monitoring System Documentation

## 🎯 Overview

The bakery-ia platform features a comprehensive, modern monitoring system built on **OpenTelemetry** and **SigNoz**. This documentation provides a complete guide to the monitoring architecture, setup, and usage.

## 🚀 Monitoring Architecture

### Core Components

```mermaid
graph TD
    A[Microservices] -->|OTLP| B[OpenTelemetry Collector]
    B -->|gRPC| C[SigNoz]
    C --> D[Traces Dashboard]
    C --> E[Metrics Dashboard]
    C --> F[Logs Dashboard]
    C --> G[Alerts]
```

### Technology Stack

- **Instrumentation**: OpenTelemetry Python SDK
- **Protocol**: OTLP (OpenTelemetry Protocol) over gRPC (port 4317) or HTTP (port 4318)
- **Backend**: SigNoz (open-source observability platform)
- **Metrics**: Prometheus-compatible metrics via OTLP
- **Traces**: Jaeger-compatible tracing via OTLP
- **Logs**: Structured logging with trace correlation

## 📋 Monitoring Coverage

### Service Coverage (100%)

| Service Category | Services | Monitoring Type | Status |
|-----------------|----------|----------------|--------|
| **Critical Services** | auth, orders, sales, external | Base Class | ✅ Monitored |
| **AI Services** | ai-insights, training | Direct | ✅ Monitored |
| **Data Services** | inventory, procurement, production, forecasting | Base Class | ✅ Monitored |
| **Operational Services** | tenant, notification, distribution | Base Class | ✅ Monitored |
| **Specialized Services** | suppliers, pos, recipes, orchestrator | Base Class | ✅ Monitored |
| **Infrastructure** | gateway, alert-processor, demo-session | Direct | ✅ Monitored |

**Total: 20 services with 100% monitoring coverage**

## 🔧 Monitoring Implementation

### Implementation Patterns

#### 1. Base Class Pattern (16 services)

Services using `StandardFastAPIService` inherit comprehensive monitoring:

```python
from shared.service_base import StandardFastAPIService

class MyService(StandardFastAPIService):
    def __init__(self):
        super().__init__(
            service_name="my-service",
            app_name="My Service",
            description="Service description",
            version="1.0.0",
            # Monitoring enabled by default
            enable_metrics=True,        # ✅ Metrics collection
            enable_tracing=True,        # ✅ Distributed tracing
            enable_health_checks=True   # ✅ Health endpoints
        )
```
|
||||
|
||||
#### 2. Direct Pattern (4 services)
|
||||
|
||||
Critical services with custom monitoring needs:
|
||||
|
||||
```python
|
||||
# services/ai_insights/app/main.py
|
||||
from shared.monitoring.metrics import MetricsCollector, add_metrics_middleware
|
||||
from shared.monitoring.system_metrics import SystemMetricsCollector
|
||||
|
||||
# Initialize metrics collectors
|
||||
metrics_collector = MetricsCollector("ai-insights")
|
||||
system_metrics = SystemMetricsCollector("ai-insights")
|
||||
|
||||
# Add middleware
|
||||
add_metrics_middleware(app, metrics_collector)
|
||||
```
|
||||
|
||||
### Monitoring Components
|
||||
|
||||
#### OpenTelemetry Instrumentation
|
||||
|
||||
```python
|
||||
# Automatic instrumentation in base class
|
||||
FastAPIInstrumentor.instrument_app(app) # HTTP requests
|
||||
HTTPXClientInstrumentor().instrument() # Outgoing HTTP
|
||||
RedisInstrumentor().instrument() # Redis operations
|
||||
SQLAlchemyInstrumentor().instrument() # Database queries
|
||||
```
|
||||
|
||||
#### Metrics Collection
|
||||
|
||||
```python
|
||||
# Standard metrics automatically collected
|
||||
metrics_collector.register_counter("http_requests_total", "Total HTTP requests")
|
||||
metrics_collector.register_histogram("http_request_duration", "Request duration")
|
||||
metrics_collector.register_gauge("active_requests", "Active requests")
|
||||
|
||||
# System metrics automatically collected
|
||||
system_metrics = SystemMetricsCollector("service-name")
|
||||
# → CPU, Memory, Disk I/O, Network I/O, Threads, File Descriptors
|
||||
```
|
||||
|
||||
#### Health Checks
|
||||
|
||||
```python
|
||||
# Automatic health check endpoints
|
||||
GET /health # Overall service health
|
||||
GET /health/detailed # Detailed health with dependencies
|
||||
GET /health/ready # Readiness probe
|
||||
GET /health/live # Liveness probe
|
||||
```
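Conceptually, the detailed health endpoint aggregates per-dependency checks into one report. A minimal sketch of that aggregation, assuming a plain callable per dependency; the check names, statuses, and return shape here are illustrative, not the actual `StandardFastAPIService` internals:

```python
from typing import Callable, Dict

def detailed_health(checks: Dict[str, Callable[[], bool]]) -> dict:
    """Run each dependency check and report overall plus per-check status."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "healthy" if check() else "unhealthy"
        except Exception:
            # A crashing check counts as unhealthy rather than failing the endpoint
            results[name] = "unhealthy"
    overall = "healthy" if all(v == "healthy" for v in results.values()) else "degraded"
    return {"status": overall, "dependencies": results}

# Example: database up, cache down -> overall degraded
report = detailed_health({"database": lambda: True, "redis": lambda: False})
print(report["status"])  # degraded
```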
## 📊 Metrics Reference

### Standard Metrics (All Services)

| Metric Type | Metric Name | Description | Labels |
|-------------|------------|-------------|--------|
| **HTTP Metrics** | `{service}_http_requests_total` | Total HTTP requests | method, endpoint, status_code |
| **HTTP Metrics** | `{service}_http_request_duration_seconds` | Request duration histogram | method, endpoint, status_code |
| **HTTP Metrics** | `{service}_active_requests` | Currently active requests | - |
| **System Metrics** | `process.cpu.utilization` | Process CPU usage | - |
| **System Metrics** | `process.memory.usage` | Process memory usage | - |
| **System Metrics** | `system.cpu.utilization` | System CPU usage | - |
| **System Metrics** | `system.memory.usage` | System memory usage | - |
| **Database Metrics** | `db.query.duration` | Database query duration | operation, table |
| **Cache Metrics** | `cache.operation.duration` | Cache operation duration | operation, key |

### Custom Metrics (Service-Specific)

Examples of service-specific metrics:

**Auth Service:**
- `auth_registration_total` (by status)
- `auth_login_success_total`
- `auth_login_failure_total` (by reason)
- `auth_registration_duration_seconds`

**Orders Service:**
- `orders_created_total`
- `orders_processed_total` (by status)
- `orders_processing_duration_seconds`

**AI Insights Service:**
- `ai_insights_generated_total`
- `ai_model_inference_duration_seconds`
- `ai_feedback_received_total`
## 🔍 Tracing Guide

### Trace Propagation

Traces automatically flow across service boundaries:

```mermaid
sequenceDiagram
    participant Client
    participant Gateway
    participant Auth
    participant Orders

    Client->>Gateway: HTTP Request (trace_id: abc123)
    Gateway->>Auth: Auth Check (trace_id: abc123)
    Auth-->>Gateway: Auth Response (trace_id: abc123)
    Gateway->>Orders: Create Order (trace_id: abc123)
    Orders-->>Gateway: Order Created (trace_id: abc123)
    Gateway-->>Client: Final Response (trace_id: abc123)
```

### Trace Context in Logs

All logs include trace correlation:

```json
{
  "level": "info",
  "message": "Processing order",
  "service": "orders-service",
  "trace_id": "abc123def456",
  "span_id": "789ghi",
  "order_id": "12345",
  "timestamp": "2024-01-08T19:00:00Z"
}
```
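A common way to get this correlation is a `logging.Filter` that copies the active trace context onto every record before it is formatted. A self-contained sketch using a context variable as the source of the trace id; the platform's logging presumably pulls these ids from the OpenTelemetry SDK instead:

```python
import logging
from contextvars import ContextVar

# Stand-in for the OpenTelemetry trace context
current_trace_id: ContextVar[str] = ContextVar("trace_id", default="")

class TraceContextFilter(logging.Filter):
    """Attach the active trace id to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True  # never drop records, only annotate them

logger = logging.getLogger("orders-service")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    '{"message": "%(message)s", "trace_id": "%(trace_id)s"}'))
logger.addHandler(handler)
logger.addFilter(TraceContextFilter())

current_trace_id.set("abc123def456")
logger.warning("Processing order")  # the emitted JSON carries trace_id abc123def456
```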
### Manual Trace Enhancement

Add custom trace attributes:

```python
from shared.monitoring.tracing import add_trace_attributes, add_trace_event

# Add custom attributes
add_trace_attributes(
    user_id="123",
    tenant_id="abc",
    operation="order_creation"
)

# Add trace events
add_trace_event("order_validation_started")
# ... validation logic ...
add_trace_event("order_validation_completed", status="success")
```
## 🚨 Alerting Guide

### Standard Alerts (Recommended)

| Alert Name | Condition | Severity | Notification |
|------------|-----------|----------|--------------|
| **High Error Rate** | `error_rate > 5%` for 5m | High | PagerDuty + Slack |
| **High Latency** | `p99_latency > 2s` for 5m | High | PagerDuty + Slack |
| **Service Unavailable** | `up == 0` for 1m | Critical | PagerDuty + Slack + Email |
| **High Memory Usage** | `memory_usage > 80%` for 10m | Medium | Slack |
| **High CPU Usage** | `cpu_usage > 90%` for 5m | Medium | Slack |
| **Database Connection Issues** | `db_connections < minimum_pool_size` | High | PagerDuty + Slack |
| **Cache Hit Ratio Low** | `cache_hit_ratio < 70%` for 15m | Low | Slack |
### Creating Alerts in SigNoz

1. **Navigate to Alerts**: SigNoz UI → Alerts → Create Alert
2. **Select Metric**: Choose from available metrics
3. **Set Condition**: Define threshold and duration
4. **Configure Notifications**: Add notification channels
5. **Set Severity**: Critical, High, Medium, Low
6. **Add Description**: Explain alert purpose and resolution steps

### Example Alert Configuration (YAML)

```yaml
# Example for Terraform/Kubernetes
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: bakery-ia-alerts
  namespace: monitoring
spec:
  groups:
    - name: service-health
      rules:
        - alert: ServiceDown
          expr: up{service!~"signoz.*"} == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Service {{ $labels.service }} is down"
            description: "{{ $labels.service }} has been down for more than 1 minute"
            runbook: "https://github.com/yourorg/bakery-ia/blob/main/RUNBOOKS.md#service-down"

        - alert: HighErrorRate
          expr: rate(http_requests_total{status_code=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
          for: 5m
          labels:
            severity: high
          annotations:
            summary: "High error rate in {{ $labels.service }}"
            # $value is a ratio (0..1), so format it as a percentage
            description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
            runbook: "https://github.com/yourorg/bakery-ia/blob/main/RUNBOOKS.md#high-error-rate"
```
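The `HighErrorRate` expression is simply the 5xx request rate divided by the total request rate over the same window. The arithmetic can be sanity-checked in isolation (the request counts below are illustrative):

```python
def error_rate(errors_5xx: float, total: float) -> float:
    """Fraction of requests that returned 5xx over the same window."""
    return errors_5xx / total if total else 0.0

# 12 server errors out of 200 requests in the 5m window -> 6%
rate = error_rate(12, 200)
print(f"{rate:.1%}")  # 6.0%
print(rate > 0.05)    # True -> the alert condition holds
```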
## 📈 Dashboard Guide

### Recommended Dashboards

#### 1. Service Overview Dashboard
- HTTP Request Rate
- Error Rate
- Latency Percentiles (p50, p90, p99)
- Active Requests
- System Resource Usage

#### 2. Performance Dashboard
- Request Duration Histogram
- Database Query Performance
- Cache Performance
- External API Call Performance

#### 3. System Health Dashboard
- CPU Usage (Process & System)
- Memory Usage (Process & System)
- Disk I/O
- Network I/O
- File Descriptors
- Thread Count

#### 4. Business Metrics Dashboard
- User Registrations
- Order Volume
- AI Insights Generated
- API Usage by Tenant

### Creating Dashboards in SigNoz

1. **Navigate to Dashboards**: SigNoz UI → Dashboards → Create Dashboard
2. **Add Panels**: Click "Add Panel" and select a metric
3. **Configure Visualization**: Choose chart type and settings
4. **Set Time Range**: Pick a sensible default (last 1h, 6h, 24h, or 7d)
5. **Add Variables**: For dynamic filtering (service, environment)
6. **Save Dashboard**: Give it a descriptive name
## 🛠️ Troubleshooting Guide

### Common Issues & Solutions

#### Issue: No Metrics Appearing in SigNoz

**Checklist:**
- ✅ OpenTelemetry Collector running? `kubectl get pods -n signoz`
- ✅ Service can reach collector? `telnet signoz-otel-collector.signoz 4318`
- ✅ OTLP endpoint configured correctly? Check `OTEL_EXPORTER_OTLP_ENDPOINT`
- ✅ Service logs show OTLP export? Look for "Exporting metrics"
- ✅ No network policies blocking? Check Kubernetes network policies

**Debugging:**
```bash
# Check OpenTelemetry Collector logs
kubectl logs -n signoz -l app=otel-collector

# Check service logs for OTLP errors
kubectl logs -l app=auth-service | grep -i otel

# Test OTLP connectivity from service pod
kubectl exec -it auth-service-pod -- curl -v http://signoz-otel-collector.signoz:4318
```

#### Issue: High Latency in Specific Service

**Checklist:**
- ✅ Database queries slow? Check `db.query.duration` metrics
- ✅ External API calls slow? Check trace waterfall
- ✅ High CPU usage? Check system metrics
- ✅ Memory pressure? Check memory metrics
- ✅ Too many active requests? Check concurrency

**Debugging:**
```python
# Add detailed tracing to suspicious code
from shared.monitoring.tracing import add_trace_event

add_trace_event("database_query_started", table="users")
# ... database query ...
add_trace_event("database_query_completed", duration_ms=45)
```

#### Issue: High Error Rate

**Checklist:**
- ✅ Database connection issues? Check health endpoints
- ✅ External API failures? Check dependency metrics
- ✅ Authentication failures? Check auth service logs
- ✅ Validation errors? Check application logs
- ✅ Rate limiting? Check gateway metrics

**Debugging:**
```bash
# Check error logs with trace correlation
kubectl logs -l app=auth-service | grep -i error | grep -i trace

# Filter traces by error status
# In SigNoz: Add filter http.status_code >= 400
```
## 📚 Runbook Reference

See [RUNBOOKS.md](RUNBOOKS.md) for detailed troubleshooting procedures.

## 🔧 Development Guide

### Adding Custom Metrics

```python
# In any service using direct monitoring
self.metrics_collector.register_counter(
    "custom_metric_name",
    "Description of what this metric tracks",
    labels=["label1", "label2"]  # Optional labels
)

# Increment the counter
self.metrics_collector.increment_counter(
    "custom_metric_name",
    value=1,
    labels={"label1": "value1", "label2": "value2"}
)
```
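Histograms such as `http_request_duration_seconds` record each observation into cumulative latency buckets rather than storing raw values, which is what makes percentile queries cheap. A minimal illustration of that bucketing; the real collector delegates this to the OpenTelemetry SDK:

```python
import bisect

class DurationHistogram:
    """Count observations into latency buckets (seconds)."""

    def __init__(self, bounds=(0.1, 0.5, 1.0, 2.0)):
        self.bounds = list(bounds)
        self.counts = [0] * (len(self.bounds) + 1)  # last bucket = +Inf
        self.total = 0.0

    def observe(self, seconds: float) -> None:
        # First bound >= value picks the bucket; values past all bounds go to +Inf
        self.counts[bisect.bisect_left(self.bounds, seconds)] += 1
        self.total += seconds

h = DurationHistogram()
for duration in (0.05, 0.3, 0.7, 3.2):
    h.observe(duration)
print(h.counts)  # [1, 1, 1, 0, 1]
```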
### Adding Custom Trace Attributes

```python
# Add context to current span
from shared.monitoring.tracing import add_trace_attributes

add_trace_attributes(
    user_id=user.id,
    tenant_id=tenant.id,
    operation="premium_feature_access",
    feature_name="advanced_forecasting"
)
```
### Service-Specific Monitoring Setup

For services needing custom monitoring beyond the base class:

```python
# In your service's __init__ method
from shared.monitoring.system_metrics import SystemMetricsCollector
from shared.monitoring.metrics import MetricsCollector

class MyService(StandardFastAPIService):
    def __init__(self):
        # Call parent constructor first
        super().__init__(...)

        # Add custom metrics collector
        self.custom_metrics = MetricsCollector("my-service")

        # Register custom metrics
        self.custom_metrics.register_counter(
            "business_specific_events",
            "Custom business event counter"
        )

        # Add system metrics if not using base class defaults
        self.system_metrics = SystemMetricsCollector("my-service")
```
## 📊 SigNoz Configuration

### Environment Variables

```env
# OpenTelemetry Collector endpoint
OTEL_EXPORTER_OTLP_ENDPOINT=http://signoz-otel-collector.signoz:4318

# Service-specific configuration
OTEL_SERVICE_NAME=auth-service
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production,k8s.namespace=bakery-ia

# Metrics export interval (default: 60000ms = 60s)
OTEL_METRIC_EXPORT_INTERVAL=60000

# Batch span processor configuration
OTEL_BSP_SCHEDULE_DELAY=5000
OTEL_BSP_MAX_QUEUE_SIZE=2048
OTEL_BSP_MAX_EXPORT_BATCH_SIZE=512
```
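A service typically reads these variables once at startup, falling back to defaults when they are unset. A hedged sketch of that lookup; the variable names match the block above, while the helper name and defaults are assumptions for illustration:

```python
import os

def otel_config() -> dict:
    """Collect OTLP exporter settings from the environment."""
    return {
        "endpoint": os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT",
                              "http://signoz-otel-collector.signoz:4318"),
        "service_name": os.getenv("OTEL_SERVICE_NAME", "unknown-service"),
        # Exported as milliseconds, matching OTEL_METRIC_EXPORT_INTERVAL
        "export_interval_ms": int(os.getenv("OTEL_METRIC_EXPORT_INTERVAL", "60000")),
    }

os.environ["OTEL_SERVICE_NAME"] = "auth-service"
print(otel_config()["service_name"])        # auth-service
print(otel_config()["export_interval_ms"])  # 60000
```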
### Kubernetes Configuration

```yaml
# Example deployment with monitoring configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: auth-service
spec:
  template:
    spec:
      containers:
        - name: auth-service
          image: auth-service:latest
          env:
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://signoz-otel-collector.signoz:4318"
            - name: OTEL_SERVICE_NAME
              value: "auth-service"
            - name: ENVIRONMENT
              value: "production"
          resources:
            limits:
              cpu: "1"
              memory: "512Mi"
            requests:
              cpu: "200m"
              memory: "256Mi"
```
## 🎯 Best Practices

### Monitoring Best Practices

1. **Use Consistent Naming**: Follow OpenTelemetry semantic conventions
2. **Add Context to Traces**: Include user/tenant IDs in trace attributes
3. **Monitor Dependencies**: Track external API and database performance
4. **Set Appropriate Alerts**: Avoid alert fatigue with meaningful thresholds
5. **Document Metrics**: Keep metrics documentation up to date
6. **Review Regularly**: Update dashboards as services evolve
7. **Test Alerts**: Ensure alerts fire correctly before production

### Performance Best Practices

1. **Batch Metrics Export**: Use default 60s interval for most services
2. **Sample Traces**: Consider sampling for high-volume services
3. **Limit Custom Metrics**: Only track metrics that provide value
4. **Use Histograms Wisely**: Histograms can be resource-intensive
5. **Monitor Monitoring**: Track OTLP export success/failure rates
## 📞 Support

### Getting Help

1. **Check Documentation**: This file and RUNBOOKS.md
2. **Review SigNoz Docs**: https://signoz.io/docs/
3. **OpenTelemetry Docs**: https://opentelemetry.io/docs/
4. **Team Channel**: #monitoring in Slack
5. **GitHub Issues**: https://github.com/yourorg/bakery-ia/issues

### Escalation Path

1. **First Line**: Development team (service owners)
2. **Second Line**: DevOps team (monitoring specialists)
3. **Third Line**: SigNoz support (vendor support)

## 🎉 Summary

The bakery-ia monitoring system provides:

- **📊 100% Service Coverage**: All 20 services monitored
- **🚀 Modern Architecture**: OpenTelemetry + SigNoz
- **🔧 Comprehensive Metrics**: System, HTTP, database, cache
- **🔍 Full Observability**: Traces, metrics, logs integrated
- **✅ Production Ready**: Battle-tested and scalable

**All services are fully instrumented and ready for production monitoring!** 🎉