# 📊 Bakery-ia Monitoring System Documentation
## 🎯 Overview
The bakery-ia platform features a comprehensive, modern monitoring system built on **OpenTelemetry** and **SigNoz**. This documentation provides a complete guide to the monitoring architecture, setup, and usage.
## 🚀 Monitoring Architecture
### Core Components
```mermaid
graph TD
A[Microservices] -->|OTLP| B[OpenTelemetry Collector]
B -->|gRPC| C[SigNoz]
C --> D[Traces Dashboard]
C --> E[Metrics Dashboard]
C --> F[Logs Dashboard]
C --> G[Alerts]
```
### Technology Stack
- **Instrumentation**: OpenTelemetry Python SDK
- **Protocol**: OTLP (OpenTelemetry Protocol) over gRPC
- **Backend**: SigNoz (open-source observability platform)
- **Metrics**: Prometheus-compatible metrics via OTLP
- **Traces**: Jaeger-compatible tracing via OTLP
- **Logs**: Structured logging with trace correlation
## 📋 Monitoring Coverage
### Service Coverage (100%)
| Service Category | Services | Monitoring Type | Status |
|-----------------|----------|----------------|--------|
| **Critical Services** | auth, orders, sales, external | Base Class | ✅ Monitored |
| **AI Services** | ai-insights, training | Direct | ✅ Monitored |
| **Data Services** | inventory, procurement, production, forecasting | Base Class | ✅ Monitored |
| **Operational Services** | tenant, notification, distribution | Base Class | ✅ Monitored |
| **Specialized Services** | suppliers, pos, recipes, orchestrator | Base Class | ✅ Monitored |
| **Infrastructure** | gateway, alert-processor, demo-session | Direct | ✅ Monitored |
**Total: 20 services with 100% monitoring coverage**
## 🔧 Monitoring Implementation
### Implementation Patterns
#### 1. Base Class Pattern (16 services)
Services using `StandardFastAPIService` inherit comprehensive monitoring:
```python
from shared.service_base import StandardFastAPIService

class MyService(StandardFastAPIService):
    def __init__(self):
        super().__init__(
            service_name="my-service",
            app_name="My Service",
            description="Service description",
            version="1.0.0",
            # Monitoring enabled by default
            enable_metrics=True,        # ✅ Metrics collection
            enable_tracing=True,        # ✅ Distributed tracing
            enable_health_checks=True   # ✅ Health endpoints
        )
```
#### 2. Direct Pattern (4 services)
Critical services with custom monitoring needs:
```python
# services/ai_insights/app/main.py
from shared.monitoring.metrics import MetricsCollector, add_metrics_middleware
from shared.monitoring.system_metrics import SystemMetricsCollector
# Initialize metrics collectors
metrics_collector = MetricsCollector("ai-insights")
system_metrics = SystemMetricsCollector("ai-insights")
# Add middleware
add_metrics_middleware(app, metrics_collector)
```
### Monitoring Components
#### OpenTelemetry Instrumentation
```python
# Automatic instrumentation in base class
FastAPIInstrumentor.instrument_app(app) # HTTP requests
HTTPXClientInstrumentor().instrument() # Outgoing HTTP
RedisInstrumentor().instrument() # Redis operations
SQLAlchemyInstrumentor().instrument() # Database queries
```
#### Metrics Collection
```python
# Standard metrics automatically collected
metrics_collector.register_counter("http_requests_total", "Total HTTP requests")
metrics_collector.register_histogram("http_request_duration", "Request duration")
metrics_collector.register_gauge("active_requests", "Active requests")
# System metrics automatically collected
system_metrics = SystemMetricsCollector("service-name")
# → CPU, Memory, Disk I/O, Network I/O, Threads, File Descriptors
```
#### Health Checks
```text
# Automatic health check endpoints
GET /health           # Overall service health
GET /health/detailed  # Detailed health with dependencies
GET /health/ready     # Readiness probe
GET /health/live      # Liveness probe
```
## 📊 Metrics Reference
### Standard Metrics (All Services)
| Metric Type | Metric Name | Description | Labels |
|-------------|------------|-------------|--------|
| **HTTP Metrics** | `{service}_http_requests_total` | Total HTTP requests | method, endpoint, status_code |
| **HTTP Metrics** | `{service}_http_request_duration_seconds` | Request duration histogram | method, endpoint, status_code |
| **HTTP Metrics** | `{service}_active_requests` | Currently active requests | - |
| **System Metrics** | `process.cpu.utilization` | Process CPU usage | - |
| **System Metrics** | `process.memory.usage` | Process memory usage | - |
| **System Metrics** | `system.cpu.utilization` | System CPU usage | - |
| **System Metrics** | `system.memory.usage` | System memory usage | - |
| **Database Metrics** | `db.query.duration` | Database query duration | operation, table |
| **Cache Metrics** | `cache.operation.duration` | Cache operation duration | operation, key |
### Custom Metrics (Service-Specific)
Examples of service-specific metrics:
**Auth Service:**
- `auth_registration_total` (by status)
- `auth_login_success_total`
- `auth_login_failure_total` (by reason)
- `auth_registration_duration_seconds`
**Orders Service:**
- `orders_created_total`
- `orders_processed_total` (by status)
- `orders_processing_duration_seconds`
**AI Insights Service:**
- `ai_insights_generated_total`
- `ai_model_inference_duration_seconds`
- `ai_feedback_received_total`
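These counters use the same `MetricsCollector` API documented in the Development Guide below. A hedged sketch for the auth login-failure counter (the helper function and reason values are illustrative):
```python
# Sketch: registering and incrementing a service-specific counter via the
# shared MetricsCollector API (register_counter / increment_counter).
# The record_login_failure helper and reason values are illustrative.
from shared.monitoring.metrics import MetricsCollector

metrics_collector = MetricsCollector("auth")

metrics_collector.register_counter(
    "auth_login_failure_total",
    "Failed login attempts",
    labels=["reason"]
)

def record_login_failure(reason: str) -> None:
    # reason examples: "bad_password", "unknown_user", "account_locked"
    metrics_collector.increment_counter(
        "auth_login_failure_total",
        value=1,
        labels={"reason": reason}
    )
```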
## 🔍 Tracing Guide
### Trace Propagation
Traces automatically flow across service boundaries:
```mermaid
sequenceDiagram
participant Client
participant Gateway
participant Auth
participant Orders
Client->>Gateway: HTTP Request (trace_id: abc123)
Gateway->>Auth: Auth Check (trace_id: abc123)
Auth-->>Gateway: Auth Response (trace_id: abc123)
Gateway->>Orders: Create Order (trace_id: abc123)
Orders-->>Gateway: Order Created (trace_id: abc123)
Gateway-->>Client: Final Response (trace_id: abc123)
```
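FastAPI and HTTPX instrumentation propagate this context automatically. For an outgoing call that is not auto-instrumented, the W3C trace context can be injected by hand; a minimal sketch with the standard OpenTelemetry propagation API (the downstream URL is a placeholder):
```python
# Sketch: manual trace-context propagation for a client that is not
# auto-instrumented. Uses the standard OpenTelemetry API; the downstream
# URL is a placeholder.
import httpx
from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer(__name__)

def call_orders_service(payload: dict) -> httpx.Response:
    with tracer.start_as_current_span("create_order_request"):
        headers: dict[str, str] = {}
        inject(headers)  # adds traceparent/tracestate headers from the current span
        return httpx.post(
            "http://orders-service:8000/orders",  # placeholder URL
            json=payload,
            headers=headers,
            timeout=10.0,
        )
```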
### Trace Context in Logs
All logs include trace correlation:
```json
{
  "level": "info",
  "message": "Processing order",
  "service": "orders-service",
  "trace_id": "abc123def456",
  "span_id": "789ghi",
  "order_id": "12345",
  "timestamp": "2024-01-08T19:00:00Z"
}
```
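The `trace_id` and `span_id` fields come from the active span. If a custom logger needs the same correlation, the IDs can be read from the current span context; a sketch using the standard OpenTelemetry trace API (the logger setup is illustrative, not the project's actual logging module):
```python
# Sketch: reading trace/span IDs from the active span for log correlation.
# The logging setup is illustrative only.
import logging
from opentelemetry import trace

logger = logging.getLogger("orders-service")

def log_with_trace(message: str, **fields: str) -> None:
    ctx = trace.get_current_span().get_span_context()
    if ctx.is_valid:
        fields["trace_id"] = format(ctx.trace_id, "032x")  # 32-hex-char trace id
        fields["span_id"] = format(ctx.span_id, "016x")    # 16-hex-char span id
    logger.info(message, extra=fields)
```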
### Manual Trace Enhancement
Add custom trace attributes:
```python
from shared.monitoring.tracing import add_trace_attributes, add_trace_event
# Add custom attributes
add_trace_attributes(
    user_id="123",
    tenant_id="abc",
    operation="order_creation"
)
# Add trace events
add_trace_event("order_validation_started")
# ... validation logic ...
add_trace_event("order_validation_completed", status="success")
```
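These helpers presumably wrap the underlying OpenTelemetry span API; for services that do not import the shared helpers, the raw equivalent looks roughly like this:
```python
# Sketch: raw OpenTelemetry equivalent of the shared helpers above.
from opentelemetry import trace

span = trace.get_current_span()
span.set_attribute("user_id", "123")
span.set_attribute("tenant_id", "abc")
span.set_attribute("operation", "order_creation")

span.add_event("order_validation_started")
# ... validation logic ...
span.add_event("order_validation_completed", attributes={"status": "success"})
```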
## 🚨 Alerting Guide
### Standard Alerts (Recommended)
| Alert Name | Condition | Severity | Notification |
|------------|-----------|----------|--------------|
| **High Error Rate** | `error_rate > 5%` for 5m | High | PagerDuty + Slack |
| **High Latency** | `p99_latency > 2s` for 5m | High | PagerDuty + Slack |
| **Service Unavailable** | `up == 0` for 1m | Critical | PagerDuty + Slack + Email |
| **High Memory Usage** | `memory_usage > 80%` for 10m | Medium | Slack |
| **High CPU Usage** | `cpu_usage > 90%` for 5m | Medium | Slack |
| **Database Connection Issues** | `db_connections < minimum_pool_size` | High | PagerDuty + Slack |
| **Cache Hit Ratio Low** | `cache_hit_ratio < 70%` for 15m | Low | Slack |
### Creating Alerts in SigNoz
1. **Navigate to Alerts**: SigNoz UI → Alerts → Create Alert
2. **Select Metric**: Choose from available metrics
3. **Set Condition**: Define threshold and duration
4. **Configure Notifications**: Add notification channels
5. **Set Severity**: Critical, High, Medium, Low
6. **Add Description**: Explain alert purpose and resolution steps
### Example Alert Configuration (YAML)
```yaml
# Example for Terraform/Kubernetes
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: bakery-ia-alerts
  namespace: monitoring
spec:
  groups:
  - name: service-health
    rules:
    - alert: ServiceDown
      expr: up{service!~"signoz.*"} == 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "Service {{ $labels.service }} is down"
        description: "{{ $labels.service }} has been down for more than 1 minute"
        runbook: "https://github.com/yourorg/bakery-ia/blob/main/RUNBOOKS.md#service-down"
    - alert: HighErrorRate
      expr: rate(http_requests_total{status_code=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
      for: 5m
      labels:
        severity: high
      annotations:
        summary: "High error rate in {{ $labels.service }}"
        description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
        runbook: "https://github.com/yourorg/bakery-ia/blob/main/RUNBOOKS.md#high-error-rate"
```
## 📈 Dashboard Guide
### Recommended Dashboards
#### 1. Service Overview Dashboard
- HTTP Request Rate
- Error Rate
- Latency Percentiles (p50, p90, p99)
- Active Requests
- System Resource Usage
#### 2. Performance Dashboard
- Request Duration Histogram
- Database Query Performance
- Cache Performance
- External API Call Performance
#### 3. System Health Dashboard
- CPU Usage (Process & System)
- Memory Usage (Process & System)
- Disk I/O
- Network I/O
- File Descriptors
- Thread Count
#### 4. Business Metrics Dashboard
- User Registrations
- Order Volume
- AI Insights Generated
- API Usage by Tenant
### Creating Dashboards in SigNoz
1. **Navigate to Dashboards**: SigNoz UI → Dashboards → Create Dashboard
2. **Add Panels**: Click "Add Panel" and select metric
3. **Configure Visualization**: Choose chart type and settings
4. **Set Time Range**: Choose a default range (last 1h, 6h, 24h, or 7d)
5. **Add Variables**: For dynamic filtering (service, environment)
6. **Save Dashboard**: Give it a descriptive name
## 🛠️ Troubleshooting Guide
### Common Issues & Solutions
#### Issue: No Metrics Appearing in SigNoz
**Checklist:**
- ✅ OpenTelemetry Collector running? `kubectl get pods -n bakery-ia -l app.kubernetes.io/instance=signoz`
- ✅ Service can reach collector? `telnet signoz-otel-collector.bakery-ia 4318`
- ✅ OTLP endpoint configured correctly? Check `OTEL_EXPORTER_OTLP_ENDPOINT`
- ✅ Service logs show OTLP export? Look for "Exporting metrics"
- ✅ No network policies blocking? Check Kubernetes network policies
**Debugging:**
```bash
# Check OpenTelemetry Collector logs
kubectl logs -n bakery-ia -l app=otel-collector
# Check service logs for OTLP errors
kubectl logs -l app=auth-service | grep -i otel
# Test OTLP connectivity from service pod
kubectl exec -it auth-service-pod -- curl -v http://signoz-otel-collector.bakery-ia:4318
```
#### Issue: High Latency in Specific Service
**Checklist:**
- ✅ Database queries slow? Check `db.query.duration` metrics
- ✅ External API calls slow? Check trace waterfall
- ✅ High CPU usage? Check system metrics
- ✅ Memory pressure? Check memory metrics
- ✅ Too many active requests? Check concurrency
**Debugging:**
```python
# Add detailed tracing to suspicious code
from shared.monitoring.tracing import add_trace_event
add_trace_event("database_query_started", table="users")
# ... database query ...
add_trace_event("database_query_completed", duration_ms=45)
```
#### Issue: High Error Rate
**Checklist:**
- ✅ Database connection issues? Check health endpoints
- ✅ External API failures? Check dependency metrics
- ✅ Authentication failures? Check auth service logs
- ✅ Validation errors? Check application logs
- ✅ Rate limiting? Check gateway metrics
**Debugging:**
```bash
# Check error logs with trace correlation
kubectl logs -l app=auth-service | grep -i error | grep -i trace
# Filter traces by error status
# In SigNoz: Add filter http.status_code >= 400
```
## 📚 Runbook Reference
See [RUNBOOKS.md](RUNBOOKS.md) for detailed troubleshooting procedures.
## 🔧 Development Guide
### Adding Custom Metrics
```python
# In any service using direct monitoring
self.metrics_collector.register_counter(
    "custom_metric_name",
    "Description of what this metric tracks",
    labels=["label1", "label2"]  # Optional labels
)

# Increment the counter
self.metrics_collector.increment_counter(
    "custom_metric_name",
    value=1,
    labels={"label1": "value1", "label2": "value2"}
)
```
### Adding Custom Trace Attributes
```python
# Add context to current span
from shared.monitoring.tracing import add_trace_attributes
add_trace_attributes(
    user_id=user.id,
    tenant_id=tenant.id,
    operation="premium_feature_access",
    feature_name="advanced_forecasting"
)
```
### Service-Specific Monitoring Setup
For services needing custom monitoring beyond the base class:
```python
# In your service's __init__ method
from shared.service_base import StandardFastAPIService
from shared.monitoring.system_metrics import SystemMetricsCollector
from shared.monitoring.metrics import MetricsCollector

class MyService(StandardFastAPIService):
    def __init__(self):
        # Call parent constructor first
        super().__init__(...)

        # Add custom metrics collector
        self.custom_metrics = MetricsCollector("my-service")

        # Register custom metrics
        self.custom_metrics.register_counter(
            "business_specific_events",
            "Custom business event counter"
        )

        # Add system metrics if not using base class defaults
        self.system_metrics = SystemMetricsCollector("my-service")
```
## 📊 SigNoz Configuration
### Environment Variables
```env
# OpenTelemetry Collector endpoint
OTEL_EXPORTER_OTLP_ENDPOINT=http://signoz-otel-collector.bakery-ia:4318
# Service-specific configuration
OTEL_SERVICE_NAME=auth-service
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production,k8s.namespace=bakery-ia
# Metrics export interval (default: 60000ms = 60s)
OTEL_METRIC_EXPORT_INTERVAL=60000
# Batch span processor configuration
OTEL_BSP_SCHEDULE_DELAY=5000
OTEL_BSP_MAX_QUEUE_SIZE=2048
OTEL_BSP_MAX_EXPORT_BATCH_SIZE=512
```
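For reference, the base class presumably configures the SDK along these lines. A minimal manual setup that honors `OTEL_EXPORTER_OTLP_ENDPOINT` from the environment (the OTLP/HTTP exporter is assumed here, since 4318 is the HTTP port):
```python
# Sketch: manual OpenTelemetry SDK setup roughly equivalent to what the
# base class wires up. OTLPSpanExporter reads OTEL_EXPORTER_OTLP_ENDPOINT
# from the environment when no endpoint argument is passed.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

resource = Resource.create({
    "service.name": "auth-service",
    "deployment.environment": "production",
})

provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)
```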
### Kubernetes Configuration
```yaml
# Example deployment with OpenTelemetry environment configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: auth-service
spec:
  template:
    spec:
      containers:
      - name: auth-service
        image: auth-service:latest
        env:
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: "http://signoz-otel-collector.bakery-ia:4318"
        - name: OTEL_SERVICE_NAME
          value: "auth-service"
        - name: ENVIRONMENT
          value: "production"
        resources:
          limits:
            cpu: "1"
            memory: "512Mi"
          requests:
            cpu: "200m"
            memory: "256Mi"
```
## 🎯 Best Practices
### Monitoring Best Practices
1. **Use Consistent Naming**: Follow OpenTelemetry semantic conventions
2. **Add Context to Traces**: Include user/tenant IDs in trace attributes
3. **Monitor Dependencies**: Track external API and database performance
4. **Set Appropriate Alerts**: Avoid alert fatigue with meaningful thresholds
5. **Document Metrics**: Keep metrics documentation up to date
6. **Review Regularly**: Update dashboards as services evolve
7. **Test Alerts**: Ensure alerts fire correctly before production
### Performance Best Practices
1. **Batch Metrics Export**: Use default 60s interval for most services
2. **Sample Traces**: Consider sampling for high-volume services (see the sampler sketch after this list)
3. **Limit Custom Metrics**: Only track metrics that provide value
4. **Use Histograms Wisely**: Histograms can be resource-intensive
5. **Monitor Monitoring**: Track OTLP export success/failure rates
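For trace sampling (item 2 above), a probabilistic sampler can be configured with the standard OpenTelemetry SDK; the 10% ratio is an illustrative value:
```python
# Sketch: probabilistic trace sampling for high-volume services.
# The 10% ratio is illustrative; tune per service.
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample 10% of new traces; follow the parent's decision for propagated traces
sampler = ParentBased(root=TraceIdRatioBased(0.10))
provider = TracerProvider(sampler=sampler)
```
The same policy can also be set without code via `OTEL_TRACES_SAMPLER=parentbased_traceidratio` and `OTEL_TRACES_SAMPLER_ARG=0.1`.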
## 📞 Support
### Getting Help
1. **Check Documentation**: This file and RUNBOOKS.md
2. **Review SigNoz Docs**: https://signoz.io/docs/
3. **OpenTelemetry Docs**: https://opentelemetry.io/docs/
4. **Team Channel**: #monitoring in Slack
5. **GitHub Issues**: https://github.com/yourorg/bakery-ia/issues
### Escalation Path
1. **First Line**: Development team (service owners)
2. **Second Line**: DevOps team (monitoring specialists)
3. **Third Line**: SigNoz support (vendor support)
## 🎉 Summary
The bakery-ia monitoring system provides:
- **📊 100% Service Coverage**: All 20 services monitored
- **🚀 Modern Architecture**: OpenTelemetry + SigNoz
- **🔧 Comprehensive Metrics**: System, HTTP, database, cache
- **🔍 Full Observability**: Traces, metrics, logs integrated
- **✅ Production Ready**: Battle-tested and scalable
**All services are fully instrumented and ready for production monitoring!** 🎉