# 📊 Bakery-ia Monitoring System Documentation
## 🎯 Overview
The bakery-ia platform features a comprehensive, modern monitoring system built on OpenTelemetry and SigNoz. This documentation provides a complete guide to the monitoring architecture, setup, and usage.
## 🚀 Monitoring Architecture

### Core Components
```mermaid
graph TD
    A[Microservices] -->|OTLP| B[OpenTelemetry Collector]
    B -->|gRPC| C[SigNoz]
    C --> D[Traces Dashboard]
    C --> E[Metrics Dashboard]
    C --> F[Logs Dashboard]
    C --> G[Alerts]
```
### Technology Stack
- Instrumentation: OpenTelemetry Python SDK
- Protocol: OTLP (OpenTelemetry Protocol) over gRPC (see the export sketch after this list)
- Backend: SigNoz (open-source observability platform)
- Metrics: Prometheus-compatible metrics via OTLP
- Traces: Jaeger-compatible tracing via OTLP
- Logs: Structured logging with trace correlation
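To make the export path concrete, here is a minimal sketch of OTLP-over-gRPC setup with the OpenTelemetry Python SDK. The shared base class performs the equivalent wiring automatically; the endpoint, port, and service name below are illustrative assumptions, not the project's exact values.

```python
# Minimal OTLP-over-gRPC export sketch (illustrative; the shared base class
# already does equivalent setup for every service).
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# service.name is what SigNoz displays as the service identity
resource = Resource.create({"service.name": "my-service"})

provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(
        # gRPC collectors conventionally listen on 4317; adjust to your
        # collector address (assumption, not the project's configured value)
        OTLPSpanExporter(endpoint="signoz-otel-collector.signoz:4317", insecure=True)
    )
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("startup-check"):
    pass  # spans created here are batched and exported via OTLP
```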
## 📋 Monitoring Coverage

### Service Coverage (100%)
| Service Category | Services | Monitoring Type | Status |
|---|---|---|---|
| Critical Services | auth, orders, sales, external | Base Class | ✅ Monitored |
| AI Services | ai-insights, training | Direct | ✅ Monitored |
| Data Services | inventory, procurement, production, forecasting | Base Class | ✅ Monitored |
| Operational Services | tenant, notification, distribution | Base Class | ✅ Monitored |
| Specialized Services | suppliers, pos, recipes, orchestrator | Base Class | ✅ Monitored |
| Infrastructure | gateway, alert-processor, demo-session | Direct | ✅ Monitored |
Total: 20 services with 100% monitoring coverage
## 🔧 Monitoring Implementation

### Implementation Patterns

#### 1. Base Class Pattern (16 services)

Services using `StandardFastAPIService` inherit comprehensive monitoring:
```python
from shared.service_base import StandardFastAPIService

class MyService(StandardFastAPIService):
    def __init__(self):
        super().__init__(
            service_name="my-service",
            app_name="My Service",
            description="Service description",
            version="1.0.0",
            # Monitoring enabled by default
            enable_metrics=True,        # ✅ Metrics collection
            enable_tracing=True,        # ✅ Distributed tracing
            enable_health_checks=True,  # ✅ Health endpoints
        )
```
#### 2. Direct Pattern (4 services)

Critical services with custom monitoring needs:
```python
# services/ai_insights/app/main.py
from shared.monitoring.metrics import MetricsCollector, add_metrics_middleware
from shared.monitoring.system_metrics import SystemMetricsCollector

# Initialize metrics collectors
metrics_collector = MetricsCollector("ai-insights")
system_metrics = SystemMetricsCollector("ai-insights")

# Add middleware
add_metrics_middleware(app, metrics_collector)
```
### Monitoring Components

#### OpenTelemetry Instrumentation
```python
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.redis import RedisInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor

# Automatic instrumentation in base class
FastAPIInstrumentor.instrument_app(app)   # HTTP requests
HTTPXClientInstrumentor().instrument()    # Outgoing HTTP
RedisInstrumentor().instrument()          # Redis operations
SQLAlchemyInstrumentor().instrument()     # Database queries
```
#### Metrics Collection
```python
# Standard metrics automatically collected
metrics_collector.register_counter("http_requests_total", "Total HTTP requests")
metrics_collector.register_histogram("http_request_duration", "Request duration")
metrics_collector.register_gauge("active_requests", "Active requests")

# System metrics automatically collected
system_metrics = SystemMetricsCollector("service-name")
# → CPU, Memory, Disk I/O, Network I/O, Threads, File Descriptors
```
#### Health Checks
```text
# Automatic health check endpoints
GET /health           # Overall service health
GET /health/detailed  # Detailed health with dependencies
GET /health/ready     # Readiness probe
GET /health/live      # Liveness probe
```
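The base class exposes these endpoints automatically. As an illustration of what a dependency-aware check can look like, here is a hand-rolled FastAPI sketch; it is not the shared implementation, and the `check_database` helper is hypothetical.

```python
# Hand-rolled sketch of a detailed health endpoint (illustrative only;
# the shared base class provides its own implementation).
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

async def check_database() -> bool:
    """Hypothetical dependency probe, e.g. a SELECT 1 against the pool."""
    return True

@app.get("/health/detailed")
async def health_detailed() -> JSONResponse:
    db_ok = await check_database()
    # 200 when healthy, 503 so Kubernetes probes and alerting can react
    return JSONResponse(
        status_code=200 if db_ok else 503,
        content={"status": "healthy" if db_ok else "degraded",
                 "dependencies": {"database": db_ok}},
    )
```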
## 📊 Metrics Reference

### Standard Metrics (All Services)
| Metric Type | Metric Name | Description | Labels |
|---|---|---|---|
| HTTP Metrics | `{service}_http_requests_total` | Total HTTP requests | method, endpoint, status_code |
| HTTP Metrics | `{service}_http_request_duration_seconds` | Request duration histogram | method, endpoint, status_code |
| HTTP Metrics | `{service}_active_requests` | Currently active requests | - |
| System Metrics | `process.cpu.utilization` | Process CPU usage | - |
| System Metrics | `process.memory.usage` | Process memory usage | - |
| System Metrics | `system.cpu.utilization` | System CPU usage | - |
| System Metrics | `system.memory.usage` | System memory usage | - |
| Database Metrics | `db.query.duration` | Database query duration | operation, table |
| Cache Metrics | `cache.operation.duration` | Cache operation duration | operation, key |
### Custom Metrics (Service-Specific)

Examples of service-specific metrics:

**Auth Service:**
- `auth_registration_total` (by status)
- `auth_login_success_total`
- `auth_login_failure_total` (by reason)
- `auth_registration_duration_seconds`

**Orders Service:**
- `orders_created_total`
- `orders_processed_total` (by status)
- `orders_processing_duration_seconds`

**AI Insights Service:**
- `ai_insights_generated_total`
- `ai_model_inference_duration_seconds`
- `ai_feedback_received_total`
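These counters are registered and incremented through the same `MetricsCollector` API used for the standard metrics (see the Development Guide below). A short sketch; the label set and label value are illustrative assumptions:

```python
# Sketch: wiring a service-specific counter with the shared collector.
from shared.monitoring.metrics import MetricsCollector

metrics_collector = MetricsCollector("ai-insights")
metrics_collector.register_counter(
    "ai_insights_generated_total",
    "Number of AI insights generated",
    labels=["insight_type"],  # illustrative label
)

# Called from the request handler after an insight is produced
metrics_collector.increment_counter(
    "ai_insights_generated_total",
    value=1,
    labels={"insight_type": "demand_forecast"},
)
```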
## 🔍 Tracing Guide

### Trace Propagation
Traces automatically flow across service boundaries:
```mermaid
sequenceDiagram
    participant Client
    participant Gateway
    participant Auth
    participant Orders
    Client->>Gateway: HTTP Request (trace_id: abc123)
    Gateway->>Auth: Auth Check (trace_id: abc123)
    Auth-->>Gateway: Auth Response (trace_id: abc123)
    Gateway->>Orders: Create Order (trace_id: abc123)
    Orders-->>Gateway: Order Created (trace_id: abc123)
    Gateway-->>Client: Final Response (trace_id: abc123)
```
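Propagation across these hops is handled automatically by the HTTPX instrumentation. For a client that is not auto-instrumented, the W3C trace context can be injected by hand; a sketch using the standard OpenTelemetry propagation API (the target URL is illustrative):

```python
# Sketch: manual W3C traceparent propagation for a non-instrumented client
# (HTTPX calls are already covered by the automatic instrumentation).
import httpx
from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer(__name__)

def call_orders_service() -> httpx.Response:
    with tracer.start_as_current_span("create-order"):
        headers: dict[str, str] = {}
        inject(headers)  # adds the traceparent header for the current span
        return httpx.post("http://orders-service/orders", headers=headers, json={})
```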
### Trace Context in Logs
All logs include trace correlation:
```json
{
  "level": "info",
  "message": "Processing order",
  "service": "orders-service",
  "trace_id": "abc123def456",
  "span_id": "789ghi",
  "order_id": "12345",
  "timestamp": "2024-01-08T19:00:00Z"
}
```
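The shared logging setup injects these identifiers for you. If you need them outside that setup, they can be read from the active span with the standard OpenTelemetry API; a sketch (the logger configuration is illustrative):

```python
# Sketch: pulling trace/span IDs from the active span for log correlation.
import logging
from opentelemetry import trace

logger = logging.getLogger("orders-service")

def log_with_trace_context(message: str) -> None:
    ctx = trace.get_current_span().get_span_context()
    logger.info(
        message,
        extra={
            # 32-hex-char trace id and 16-hex-char span id, as shown in SigNoz
            "trace_id": format(ctx.trace_id, "032x"),
            "span_id": format(ctx.span_id, "016x"),
        },
    )
```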
### Manual Trace Enhancement
Add custom trace attributes:
```python
from shared.monitoring.tracing import add_trace_attributes, add_trace_event

# Add custom attributes
add_trace_attributes(
    user_id="123",
    tenant_id="abc",
    operation="order_creation",
)

# Add trace events
add_trace_event("order_validation_started")
# ... validation logic ...
add_trace_event("order_validation_completed", status="success")
```
## 🚨 Alerting Guide

### Standard Alerts (Recommended)
| Alert Name | Condition | Severity | Notification |
|---|---|---|---|
| High Error Rate | `error_rate > 5%` for 5m | High | PagerDuty + Slack |
| High Latency | `p99_latency > 2s` for 5m | High | PagerDuty + Slack |
| Service Unavailable | `up == 0` for 1m | Critical | PagerDuty + Slack + Email |
| High Memory Usage | `memory_usage > 80%` for 10m | Medium | Slack |
| High CPU Usage | `cpu_usage > 90%` for 5m | Medium | Slack |
| Database Connection Issues | `db_connections < minimum_pool_size` | High | PagerDuty + Slack |
| Cache Hit Ratio Low | `cache_hit_ratio < 70%` for 15m | Low | Slack |
### Creating Alerts in SigNoz
- Navigate to Alerts: SigNoz UI → Alerts → Create Alert
- Select Metric: Choose from available metrics
- Set Condition: Define threshold and duration
- Configure Notifications: Add notification channels
- Set Severity: Critical, High, Medium, Low
- Add Description: Explain alert purpose and resolution steps
### Example Alert Configuration (YAML)
```yaml
# Example PrometheusRule for Kubernetes (prometheus-operator); also usable via Terraform
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: bakery-ia-alerts
  namespace: monitoring
spec:
  groups:
    - name: service-health
      rules:
        - alert: ServiceDown
          expr: up{service!~"signoz.*"} == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Service {{ $labels.service }} is down"
            description: "{{ $labels.service }} has been down for more than 1 minute"
            runbook: "https://github.com/yourorg/bakery-ia/blob/main/RUNBOOKS.md#service-down"
        - alert: HighErrorRate
          expr: rate(http_requests_total{status_code=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
          for: 5m
          labels:
            severity: high
          annotations:
            summary: "High error rate in {{ $labels.service }}"
            description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
            runbook: "https://github.com/yourorg/bakery-ia/blob/main/RUNBOOKS.md#high-error-rate"
```
## 📈 Dashboard Guide

### Recommended Dashboards

#### 1. Service Overview Dashboard
- HTTP Request Rate
- Error Rate
- Latency Percentiles (p50, p90, p99)
- Active Requests
- System Resource Usage
#### 2. Performance Dashboard
- Request Duration Histogram
- Database Query Performance
- Cache Performance
- External API Call Performance
#### 3. System Health Dashboard
- CPU Usage (Process & System)
- Memory Usage (Process & System)
- Disk I/O
- Network I/O
- File Descriptors
- Thread Count
#### 4. Business Metrics Dashboard
- User Registrations
- Order Volume
- AI Insights Generated
- API Usage by Tenant
### Creating Dashboards in SigNoz
- Navigate to Dashboards: SigNoz UI → Dashboards → Create Dashboard
- Add Panels: Click "Add Panel" and select metric
- Configure Visualization: Choose chart type and settings
- Set Time Range: Default to last 1h, 6h, 24h, 7d
- Add Variables: For dynamic filtering (service, environment)
- Save Dashboard: Give it a descriptive name
## 🛠️ Troubleshooting Guide

### Common Issues & Solutions

#### Issue: No Metrics Appearing in SigNoz
Checklist:
- ✅ OpenTelemetry Collector running? `kubectl get pods -n signoz`
- ✅ Service can reach collector? `telnet signoz-otel-collector.signoz 4318`
- ✅ OTLP endpoint configured correctly? Check `OTEL_EXPORTER_OTLP_ENDPOINT`
- ✅ Service logs show OTLP export? Look for "Exporting metrics"
- ✅ No network policies blocking? Check Kubernetes network policies
Debugging:
```bash
# Check OpenTelemetry Collector logs
kubectl logs -n signoz -l app=otel-collector

# Check service logs for OTLP errors
kubectl logs -l app=auth-service | grep -i otel

# Test OTLP connectivity from service pod
kubectl exec -it auth-service-pod -- curl -v http://signoz-otel-collector.signoz:4318
```
#### Issue: High Latency in Specific Service
Checklist:
- ✅ Database queries slow? Check `db.query.duration` metrics
- ✅ External API calls slow? Check trace waterfall
- ✅ High CPU usage? Check system metrics
- ✅ Memory pressure? Check memory metrics
- ✅ Too many active requests? Check concurrency
Debugging:
```python
# Add detailed tracing to suspicious code
from shared.monitoring.tracing import add_trace_event

add_trace_event("database_query_started", table="users")
# ... database query ...
add_trace_event("database_query_completed", duration_ms=45)
```
#### Issue: High Error Rate
Checklist:
- ✅ Database connection issues? Check health endpoints
- ✅ External API failures? Check dependency metrics
- ✅ Authentication failures? Check auth service logs
- ✅ Validation errors? Check application logs
- ✅ Rate limiting? Check gateway metrics
Debugging:
```bash
# Check error logs with trace correlation
kubectl logs -l app=auth-service | grep -i error | grep -i trace

# Filter traces by error status
# In SigNoz: add filter http.status_code >= 400
```
## 📚 Runbook Reference
See RUNBOOKS.md for detailed troubleshooting procedures.
## 🔧 Development Guide

### Adding Custom Metrics
```python
# In any service using direct monitoring
self.metrics_collector.register_counter(
    "custom_metric_name",
    "Description of what this metric tracks",
    labels=["label1", "label2"],  # Optional labels
)

# Increment the counter
self.metrics_collector.increment_counter(
    "custom_metric_name",
    value=1,
    labels={"label1": "value1", "label2": "value2"},
)
```
### Adding Custom Trace Attributes
```python
# Add context to current span
from shared.monitoring.tracing import add_trace_attributes

add_trace_attributes(
    user_id=user.id,
    tenant_id=tenant.id,
    operation="premium_feature_access",
    feature_name="advanced_forecasting",
)
```
### Service-Specific Monitoring Setup
For services needing custom monitoring beyond the base class:
```python
# In your service's __init__ method
from shared.monitoring.system_metrics import SystemMetricsCollector
from shared.monitoring.metrics import MetricsCollector

class MyService(StandardFastAPIService):
    def __init__(self):
        # Call parent constructor first
        super().__init__(...)

        # Add custom metrics collector
        self.custom_metrics = MetricsCollector("my-service")

        # Register custom metrics
        self.custom_metrics.register_counter(
            "business_specific_events",
            "Custom business event counter",
        )

        # Add system metrics if not using base class defaults
        self.system_metrics = SystemMetricsCollector("my-service")
```
## 📊 SigNoz Configuration

### Environment Variables
```bash
# OpenTelemetry Collector endpoint
OTEL_EXPORTER_OTLP_ENDPOINT=http://signoz-otel-collector.signoz:4318

# Service-specific configuration
OTEL_SERVICE_NAME=auth-service
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production,k8s.namespace=bakery-ia

# Metrics export interval (default: 60000ms = 60s)
OTEL_METRIC_EXPORT_INTERVAL=60000

# Batch span processor configuration
OTEL_BSP_SCHEDULE_DELAY=5000
OTEL_BSP_MAX_QUEUE_SIZE=2048
OTEL_BSP_MAX_EXPORT_BATCH_SIZE=512
```
### Kubernetes Configuration
```yaml
# Example deployment with OpenTelemetry environment configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: auth-service
spec:
  template:
    spec:
      containers:
        - name: auth-service
          image: auth-service:latest
          env:
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://signoz-otel-collector.signoz:4318"
            - name: OTEL_SERVICE_NAME
              value: "auth-service"
            - name: ENVIRONMENT
              value: "production"
          resources:
            limits:
              cpu: "1"
              memory: "512Mi"
            requests:
              cpu: "200m"
              memory: "256Mi"
```
## 🎯 Best Practices

### Monitoring Best Practices
- Use Consistent Naming: Follow OpenTelemetry semantic conventions
- Add Context to Traces: Include user/tenant IDs in trace attributes
- Monitor Dependencies: Track external API and database performance
- Set Appropriate Alerts: Avoid alert fatigue with meaningful thresholds
- Document Metrics: Keep metrics documentation up to date
- Review Regularly: Update dashboards as services evolve
- Test Alerts: Ensure alerts fire correctly before production
### Performance Best Practices
- Batch Metrics Export: Use default 60s interval for most services
- Sample Traces: Consider sampling for high-volume services (see the sketch after this list)
- Limit Custom Metrics: Only track metrics that provide value
- Use Histograms Wisely: Histograms can be resource-intensive
- Monitor Monitoring: Track OTLP export success/failure rates
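For the trace-sampling recommendation above, a parent-based ratio sampler is the usual approach with the OpenTelemetry SDK. A short sketch; the 10% ratio is an illustrative assumption, not a project default:

```python
# Sketch: keep ~10% of root traces and follow the parent's decision for
# child spans, so cross-service traces stay complete end to end.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

sampler = ParentBased(root=TraceIdRatioBased(0.10))
trace.set_tracer_provider(TracerProvider(sampler=sampler))
```

The same policy can also be applied without code changes via the standard `OTEL_TRACES_SAMPLER=parentbased_traceidratio` and `OTEL_TRACES_SAMPLER_ARG=0.1` environment variables.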
## 📞 Support

### Getting Help
- Check Documentation: This file and RUNBOOKS.md
- Review SigNoz Docs: https://signoz.io/docs/
- OpenTelemetry Docs: https://opentelemetry.io/docs/
- Team Channel: #monitoring in Slack
- GitHub Issues: https://github.com/yourorg/bakery-ia/issues
### Escalation Path
- First Line: Development team (service owners)
- Second Line: DevOps team (monitoring specialists)
- Third Line: SigNoz support (vendor support)
## 🎉 Summary
The bakery-ia monitoring system provides:
- 📊 100% Service Coverage: All 20 services monitored
- 🚀 Modern Architecture: OpenTelemetry + SigNoz
- 🔧 Comprehensive Metrics: System, HTTP, database, cache
- 🔍 Full Observability: Traces, metrics, logs integrated
- ✅ Production Ready: Battle-tested and scalable
All services are fully instrumented and ready for production monitoring! 🎉