# 📊 Bakery-ia Monitoring System Documentation
## 🎯 Overview
The bakery-ia platform features a comprehensive, modern monitoring system built on OpenTelemetry and SigNoz. This documentation provides a complete guide to the monitoring architecture, setup, and usage.
## 🚀 Monitoring Architecture

### Core Components
```mermaid
graph TD
    A[Microservices] -->|OTLP| B[OpenTelemetry Collector]
    B -->|gRPC| C[SigNoz]
    C --> D[Traces Dashboard]
    C --> E[Metrics Dashboard]
    C --> F[Logs Dashboard]
    C --> G[Alerts]
```
### Technology Stack
- Instrumentation: OpenTelemetry Python SDK
- Protocol: OTLP (OpenTelemetry Protocol) over gRPC (see the export sketch after this list)
- Backend: SigNoz (open-source observability platform)
- Metrics: Prometheus-compatible metrics via OTLP
- Traces: Jaeger-compatible tracing via OTLP
- Logs: Structured logging with trace correlation
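To make the export path concrete, here is a minimal sketch of OTLP-over-gRPC setup with the OpenTelemetry Python SDK. The shared base class performs the equivalent wiring automatically; the endpoint, port, and service name below are illustrative assumptions, not the project's exact values.

```python
# Minimal OTLP-over-gRPC export sketch (illustrative; the shared base class
# already does equivalent setup for every service).
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# service.name is what SigNoz displays as the service identity
resource = Resource.create({"service.name": "my-service"})

provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(
        # gRPC collectors conventionally listen on 4317; adjust to your
        # collector address (assumption, not the project's configured value)
        OTLPSpanExporter(endpoint="signoz-otel-collector.signoz:4317", insecure=True)
    )
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("startup-check"):
    pass  # spans created here are batched and exported via OTLP
```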
## 📋 Monitoring Coverage

### Service Coverage (100%)
| Service Category | Services | Monitoring Type | Status |
|---|---|---|---|
| Critical Services | auth, orders, sales, external | Base Class | ✅ Monitored |
| AI Services | ai-insights, training | Direct | ✅ Monitored |
| Data Services | inventory, procurement, production, forecasting | Base Class | ✅ Monitored |
| Operational Services | tenant, notification, distribution | Base Class | ✅ Monitored |
| Specialized Services | suppliers, pos, recipes, orchestrator | Base Class | ✅ Monitored |
| Infrastructure | gateway, alert-processor, demo-session | Direct | ✅ Monitored |
Total: 20 services with 100% monitoring coverage
## 🔧 Monitoring Implementation

### Implementation Patterns

#### 1. Base Class Pattern (16 services)

Services using `StandardFastAPIService` inherit comprehensive monitoring:
```python
from shared.service_base import StandardFastAPIService

class MyService(StandardFastAPIService):
    def __init__(self):
        super().__init__(
            service_name="my-service",
            app_name="My Service",
            description="Service description",
            version="1.0.0",
            # Monitoring enabled by default
            enable_metrics=True,        # ✅ Metrics collection
            enable_tracing=True,        # ✅ Distributed tracing
            enable_health_checks=True,  # ✅ Health endpoints
        )
```
#### 2. Direct Pattern (4 services)

Critical services with custom monitoring needs:
```python
# services/ai_insights/app/main.py
from shared.monitoring.metrics import MetricsCollector, add_metrics_middleware
from shared.monitoring.system_metrics import SystemMetricsCollector

# Initialize metrics collectors
metrics_collector = MetricsCollector("ai-insights")
system_metrics = SystemMetricsCollector("ai-insights")

# Add middleware
add_metrics_middleware(app, metrics_collector)
```
### Monitoring Components

#### OpenTelemetry Instrumentation
```python
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.redis import RedisInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor

# Automatic instrumentation in base class
FastAPIInstrumentor.instrument_app(app)   # HTTP requests
HTTPXClientInstrumentor().instrument()    # Outgoing HTTP
RedisInstrumentor().instrument()          # Redis operations
SQLAlchemyInstrumentor().instrument()     # Database queries
```
#### Metrics Collection
```python
# Standard metrics automatically collected
metrics_collector.register_counter("http_requests_total", "Total HTTP requests")
metrics_collector.register_histogram("http_request_duration", "Request duration")
metrics_collector.register_gauge("active_requests", "Active requests")

# System metrics automatically collected
system_metrics = SystemMetricsCollector("service-name")
# → CPU, Memory, Disk I/O, Network I/O, Threads, File Descriptors
```
#### Health Checks
```text
# Automatic health check endpoints
GET /health           # Overall service health
GET /health/detailed  # Detailed health with dependencies
GET /health/ready     # Readiness probe
GET /health/live      # Liveness probe
```
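The base class exposes these endpoints automatically. As an illustration of what a dependency-aware check can look like, here is a hand-rolled FastAPI sketch; it is not the shared implementation, and the `check_database` helper is hypothetical.

```python
# Hand-rolled sketch of a detailed health endpoint (illustrative only;
# the shared base class provides its own implementation).
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

async def check_database() -> bool:
    """Hypothetical dependency probe, e.g. a SELECT 1 against the pool."""
    return True

@app.get("/health/detailed")
async def health_detailed() -> JSONResponse:
    db_ok = await check_database()
    # 200 when healthy, 503 so Kubernetes probes and alerting can react
    return JSONResponse(
        status_code=200 if db_ok else 503,
        content={"status": "healthy" if db_ok else "degraded",
                 "dependencies": {"database": db_ok}},
    )
```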
## 📊 Metrics Reference

### Standard Metrics (All Services)
| Metric Type | Metric Name | Description | Labels |
|---|---|---|---|
| HTTP Metrics | `{service}_http_requests_total` | Total HTTP requests | method, endpoint, status_code |
| HTTP Metrics | `{service}_http_request_duration_seconds` | Request duration histogram | method, endpoint, status_code |
| HTTP Metrics | `{service}_active_requests` | Currently active requests | - |
| System Metrics | `process.cpu.utilization` | Process CPU usage | - |
| System Metrics | `process.memory.usage` | Process memory usage | - |
| System Metrics | `system.cpu.utilization` | System CPU usage | - |
| System Metrics | `system.memory.usage` | System memory usage | - |
| Database Metrics | `db.query.duration` | Database query duration | operation, table |
| Cache Metrics | `cache.operation.duration` | Cache operation duration | operation, key |
### Custom Metrics (Service-Specific)

Examples of service-specific metrics:

**Auth Service:**
- `auth_registration_total` (by status)
- `auth_login_success_total`
- `auth_login_failure_total` (by reason)
- `auth_registration_duration_seconds`

**Orders Service:**
- `orders_created_total`
- `orders_processed_total` (by status)
- `orders_processing_duration_seconds`

**AI Insights Service:**
- `ai_insights_generated_total`
- `ai_model_inference_duration_seconds`
- `ai_feedback_received_total`
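These counters are registered and incremented through the same `MetricsCollector` API used for the standard metrics (see the Development Guide below). A short sketch; the label set and label value are illustrative assumptions:

```python
# Sketch: wiring a service-specific counter with the shared collector.
from shared.monitoring.metrics import MetricsCollector

metrics_collector = MetricsCollector("ai-insights")
metrics_collector.register_counter(
    "ai_insights_generated_total",
    "Number of AI insights generated",
    labels=["insight_type"],  # illustrative label
)

# Called from the request handler after an insight is produced
metrics_collector.increment_counter(
    "ai_insights_generated_total",
    value=1,
    labels={"insight_type": "demand_forecast"},
)
```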
## 🔍 Tracing Guide

### Trace Propagation
Traces automatically flow across service boundaries:
```mermaid
sequenceDiagram
    participant Client
    participant Gateway
    participant Auth
    participant Orders
    Client->>Gateway: HTTP Request (trace_id: abc123)
    Gateway->>Auth: Auth Check (trace_id: abc123)
    Auth-->>Gateway: Auth Response (trace_id: abc123)
    Gateway->>Orders: Create Order (trace_id: abc123)
    Orders-->>Gateway: Order Created (trace_id: abc123)
    Gateway-->>Client: Final Response (trace_id: abc123)
```
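Propagation across these hops is handled automatically by the HTTPX instrumentation. For a client that is not auto-instrumented, the W3C trace context can be injected by hand; a sketch using the standard OpenTelemetry propagation API (the target URL is illustrative):

```python
# Sketch: manual W3C traceparent propagation for a non-instrumented client
# (HTTPX calls are already covered by the automatic instrumentation).
import httpx
from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer(__name__)

def call_orders_service() -> httpx.Response:
    with tracer.start_as_current_span("create-order"):
        headers: dict[str, str] = {}
        inject(headers)  # adds the traceparent header for the current span
        return httpx.post("http://orders-service/orders", headers=headers, json={})
```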
### Trace Context in Logs
All logs include trace correlation:
```json
{
  "level": "info",
  "message": "Processing order",
  "service": "orders-service",
  "trace_id": "abc123def456",
  "span_id": "789ghi",
  "order_id": "12345",
  "timestamp": "2024-01-08T19:00:00Z"
}
```
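The shared logging setup injects these identifiers for you. If you need them outside that setup, they can be read from the active span with the standard OpenTelemetry API; a sketch (the logger configuration is illustrative):

```python
# Sketch: pulling trace/span IDs from the active span for log correlation.
import logging
from opentelemetry import trace

logger = logging.getLogger("orders-service")

def log_with_trace_context(message: str) -> None:
    ctx = trace.get_current_span().get_span_context()
    logger.info(
        message,
        extra={
            # 32-hex-char trace id and 16-hex-char span id, as shown in SigNoz
            "trace_id": format(ctx.trace_id, "032x"),
            "span_id": format(ctx.span_id, "016x"),
        },
    )
```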
### Manual Trace Enhancement
Add custom trace attributes:
```python
from shared.monitoring.tracing import add_trace_attributes, add_trace_event

# Add custom attributes
add_trace_attributes(
    user_id="123",
    tenant_id="abc",
    operation="order_creation",
)

# Add trace events
add_trace_event("order_validation_started")
# ... validation logic ...
add_trace_event("order_validation_completed", status="success")
```
## 🚨 Alerting Guide

### Standard Alerts (Recommended)
| Alert Name | Condition | Severity | Notification |
|---|---|---|---|
| High Error Rate | `error_rate > 5%` for 5m | High | PagerDuty + Slack |
| High Latency | `p99_latency > 2s` for 5m | High | PagerDuty + Slack |
| Service Unavailable | `up == 0` for 1m | Critical | PagerDuty + Slack + Email |
| High Memory Usage | `memory_usage > 80%` for 10m | Medium | Slack |
| High CPU Usage | `cpu_usage > 90%` for 5m | Medium | Slack |
| Database Connection Issues | `db_connections < minimum_pool_size` | High | PagerDuty + Slack |
| Cache Hit Ratio Low | `cache_hit_ratio < 70%` for 15m | Low | Slack |
### Creating Alerts in SigNoz
- Navigate to Alerts: SigNoz UI → Alerts → Create Alert
- Select Metric: Choose from available metrics
- Set Condition: Define threshold and duration
- Configure Notifications: Add notification channels
- Set Severity: Critical, High, Medium, Low
- Add Description: Explain alert purpose and resolution steps
### Example Alert Configuration (YAML)
```yaml
# Example PrometheusRule for Kubernetes (prometheus-operator); also usable via Terraform
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: bakery-ia-alerts
  namespace: monitoring
spec:
  groups:
    - name: service-health
      rules:
        - alert: ServiceDown
          expr: up{service!~"signoz.*"} == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Service {{ $labels.service }} is down"
            description: "{{ $labels.service }} has been down for more than 1 minute"
            runbook: "https://github.com/yourorg/bakery-ia/blob/main/RUNBOOKS.md#service-down"
        - alert: HighErrorRate
          expr: rate(http_requests_total{status_code=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
          for: 5m
          labels:
            severity: high
          annotations:
            summary: "High error rate in {{ $labels.service }}"
            description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
            runbook: "https://github.com/yourorg/bakery-ia/blob/main/RUNBOOKS.md#high-error-rate"
```
## 📈 Dashboard Guide

### Recommended Dashboards

#### 1. Service Overview Dashboard
- HTTP Request Rate
- Error Rate
- Latency Percentiles (p50, p90, p99)
- Active Requests
- System Resource Usage
#### 2. Performance Dashboard
- Request Duration Histogram
- Database Query Performance
- Cache Performance
- External API Call Performance
#### 3. System Health Dashboard
- CPU Usage (Process & System)
- Memory Usage (Process & System)
- Disk I/O
- Network I/O
- File Descriptors
- Thread Count
#### 4. Business Metrics Dashboard
- User Registrations
- Order Volume
- AI Insights Generated
- API Usage by Tenant
### Creating Dashboards in SigNoz
- Navigate to Dashboards: SigNoz UI → Dashboards → Create Dashboard
- Add Panels: Click "Add Panel" and select metric
- Configure Visualization: Choose chart type and settings
- Set Time Range: Default to last 1h, 6h, 24h, 7d
- Add Variables: For dynamic filtering (service, environment)
- Save Dashboard: Give it a descriptive name
## 🛠️ Troubleshooting Guide

### Common Issues & Solutions

#### Issue: No Metrics Appearing in SigNoz
Checklist:
- ✅ OpenTelemetry Collector running? `kubectl get pods -n signoz`
- ✅ Service can reach collector? `telnet signoz-otel-collector.signoz 4318`
- ✅ OTLP endpoint configured correctly? Check `OTEL_EXPORTER_OTLP_ENDPOINT`
- ✅ Service logs show OTLP export? Look for "Exporting metrics"
- ✅ No network policies blocking? Check Kubernetes network policies
Debugging:
```bash
# Check OpenTelemetry Collector logs
kubectl logs -n signoz -l app=otel-collector

# Check service logs for OTLP errors
kubectl logs -l app=auth-service | grep -i otel

# Test OTLP connectivity from service pod
kubectl exec -it auth-service-pod -- curl -v http://signoz-otel-collector.signoz:4318
```
#### Issue: High Latency in Specific Service
Checklist:
- ✅ Database queries slow? Check `db.query.duration` metrics
- ✅ External API calls slow? Check trace waterfall
- ✅ High CPU usage? Check system metrics
- ✅ Memory pressure? Check memory metrics
- ✅ Too many active requests? Check concurrency
Debugging:
```python
# Add detailed tracing to suspicious code
from shared.monitoring.tracing import add_trace_event

add_trace_event("database_query_started", table="users")
# ... database query ...
add_trace_event("database_query_completed", duration_ms=45)
```
#### Issue: High Error Rate
Checklist:
- ✅ Database connection issues? Check health endpoints
- ✅ External API failures? Check dependency metrics
- ✅ Authentication failures? Check auth service logs
- ✅ Validation errors? Check application logs
- ✅ Rate limiting? Check gateway metrics
Debugging:
```bash
# Check error logs with trace correlation
kubectl logs -l app=auth-service | grep -i error | grep -i trace

# Filter traces by error status
# In SigNoz: add filter http.status_code >= 400
```
## 📚 Runbook Reference
See RUNBOOKS.md for detailed troubleshooting procedures.
## 🔧 Development Guide

### Adding Custom Metrics
```python
# In any service using direct monitoring
self.metrics_collector.register_counter(
    "custom_metric_name",
    "Description of what this metric tracks",
    labels=["label1", "label2"],  # Optional labels
)

# Increment the counter
self.metrics_collector.increment_counter(
    "custom_metric_name",
    value=1,
    labels={"label1": "value1", "label2": "value2"},
)
```
### Adding Custom Trace Attributes
```python
# Add context to current span
from shared.monitoring.tracing import add_trace_attributes

add_trace_attributes(
    user_id=user.id,
    tenant_id=tenant.id,
    operation="premium_feature_access",
    feature_name="advanced_forecasting",
)
```
### Service-Specific Monitoring Setup
For services needing custom monitoring beyond the base class:
```python
# In your service's __init__ method
from shared.monitoring.system_metrics import SystemMetricsCollector
from shared.monitoring.metrics import MetricsCollector

class MyService(StandardFastAPIService):
    def __init__(self):
        # Call parent constructor first
        super().__init__(...)

        # Add custom metrics collector
        self.custom_metrics = MetricsCollector("my-service")

        # Register custom metrics
        self.custom_metrics.register_counter(
            "business_specific_events",
            "Custom business event counter",
        )

        # Add system metrics if not using base class defaults
        self.system_metrics = SystemMetricsCollector("my-service")
```
## 📊 SigNoz Configuration

### Environment Variables
```bash
# OpenTelemetry Collector endpoint
OTEL_EXPORTER_OTLP_ENDPOINT=http://signoz-otel-collector.signoz:4318

# Service-specific configuration
OTEL_SERVICE_NAME=auth-service
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production,k8s.namespace=bakery-ia

# Metrics export interval (default: 60000ms = 60s)
OTEL_METRIC_EXPORT_INTERVAL=60000

# Batch span processor configuration
OTEL_BSP_SCHEDULE_DELAY=5000
OTEL_BSP_MAX_QUEUE_SIZE=2048
OTEL_BSP_MAX_EXPORT_BATCH_SIZE=512
```
### Kubernetes Configuration
```yaml
# Example deployment with OpenTelemetry environment configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: auth-service
spec:
  template:
    spec:
      containers:
        - name: auth-service
          image: auth-service:latest
          env:
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://signoz-otel-collector.signoz:4318"
            - name: OTEL_SERVICE_NAME
              value: "auth-service"
            - name: ENVIRONMENT
              value: "production"
          resources:
            limits:
              cpu: "1"
              memory: "512Mi"
            requests:
              cpu: "200m"
              memory: "256Mi"
```
## 🎯 Best Practices

### Monitoring Best Practices
- Use Consistent Naming: Follow OpenTelemetry semantic conventions
- Add Context to Traces: Include user/tenant IDs in trace attributes
- Monitor Dependencies: Track external API and database performance
- Set Appropriate Alerts: Avoid alert fatigue with meaningful thresholds
- Document Metrics: Keep metrics documentation up to date
- Review Regularly: Update dashboards as services evolve
- Test Alerts: Ensure alerts fire correctly before production
### Performance Best Practices
- Batch Metrics Export: Use default 60s interval for most services
- Sample Traces: Consider sampling for high-volume services (see the sketch after this list)
- Limit Custom Metrics: Only track metrics that provide value
- Use Histograms Wisely: Histograms can be resource-intensive
- Monitor Monitoring: Track OTLP export success/failure rates
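For the trace-sampling recommendation above, a parent-based ratio sampler is the usual approach with the OpenTelemetry SDK. A short sketch; the 10% ratio is an illustrative assumption, not a project default:

```python
# Sketch: keep ~10% of root traces and follow the parent's decision for
# child spans, so cross-service traces stay complete end to end.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

sampler = ParentBased(root=TraceIdRatioBased(0.10))
trace.set_tracer_provider(TracerProvider(sampler=sampler))
```

The same policy can also be applied without code changes via the standard `OTEL_TRACES_SAMPLER=parentbased_traceidratio` and `OTEL_TRACES_SAMPLER_ARG=0.1` environment variables.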
## 📞 Support

### Getting Help
- Check Documentation: This file and RUNBOOKS.md
- Review SigNoz Docs: https://signoz.io/docs/
- OpenTelemetry Docs: https://opentelemetry.io/docs/
- Team Channel: #monitoring in Slack
- GitHub Issues: https://github.com/yourorg/bakery-ia/issues
### Escalation Path
- First Line: Development team (service owners)
- Second Line: DevOps team (monitoring specialists)
- Third Line: SigNoz support (vendor support)
## 🎉 Summary
The bakery-ia monitoring system provides:
- 📊 100% Service Coverage: All 20 services monitored
- 🚀 Modern Architecture: OpenTelemetry + SigNoz
- 🔧 Comprehensive Metrics: System, HTTP, database, cache
- 🔍 Full Observability: Traces, metrics, logs integrated
- ✅ Production Ready: Battle-tested and scalable
All services are fully instrumented and ready for production monitoring! 🎉