# 📊 Bakery-ia Monitoring System Documentation

## 🎯 Overview

The bakery-ia platform features a comprehensive, modern monitoring system built on **OpenTelemetry** and **SigNoz**. This documentation provides a complete guide to the monitoring architecture, setup, and usage.

## 🚀 Monitoring Architecture

### Core Components

```mermaid
graph TD
    A[Microservices] -->|OTLP| B[OpenTelemetry Collector]
    B -->|gRPC| C[SigNoz]
    C --> D[Traces Dashboard]
    C --> E[Metrics Dashboard]
    C --> F[Logs Dashboard]
    C --> G[Alerts]
```

### Technology Stack

- **Instrumentation**: OpenTelemetry Python SDK
- **Protocol**: OTLP (OpenTelemetry Protocol) over gRPC (port 4317) or HTTP (port 4318)
- **Backend**: SigNoz (open-source observability platform)
- **Metrics**: Prometheus-compatible metrics via OTLP
- **Traces**: Jaeger-compatible tracing via OTLP
- **Logs**: Structured logging with trace correlation

## 📋 Monitoring Coverage

### Service Coverage (100%)

| Service Category | Services | Monitoring Type | Status |
|-----------------|----------|----------------|--------|
| **Critical Services** | auth, orders, sales, external | Base Class | ✅ Monitored |
| **AI Services** | ai-insights, training | Direct | ✅ Monitored |
| **Data Services** | inventory, procurement, production, forecasting | Base Class | ✅ Monitored |
| **Operational Services** | tenant, notification, distribution | Base Class | ✅ Monitored |
| **Specialized Services** | suppliers, pos, recipes, orchestrator | Base Class | ✅ Monitored |
| **Infrastructure** | gateway, alert-processor, demo-session | Direct | ✅ Monitored |

**Total: 20 services with 100% monitoring coverage**

## 🔧 Monitoring Implementation

### Implementation Patterns

#### 1. Base Class Pattern (16 services)

Services using `StandardFastAPIService` inherit comprehensive monitoring:

```python
from shared.service_base import StandardFastAPIService

class MyService(StandardFastAPIService):
    def __init__(self):
        super().__init__(
            service_name="my-service",
            app_name="My Service",
            description="Service description",
            version="1.0.0",
            # Monitoring enabled by default
            enable_metrics=True,        # ✅ Metrics collection
            enable_tracing=True,        # ✅ Distributed tracing
            enable_health_checks=True   # ✅ Health endpoints
        )
```
|
||||
|
||||
#### 2. Direct Pattern (4 services)
|
||||
|
||||
Critical services with custom monitoring needs:
|
||||
|
||||
```python
|
||||
# services/ai_insights/app/main.py
|
||||
from shared.monitoring.metrics import MetricsCollector, add_metrics_middleware
|
||||
from shared.monitoring.system_metrics import SystemMetricsCollector
|
||||
|
||||
# Initialize metrics collectors
|
||||
metrics_collector = MetricsCollector("ai-insights")
|
||||
system_metrics = SystemMetricsCollector("ai-insights")
|
||||
|
||||
# Add middleware
|
||||
add_metrics_middleware(app, metrics_collector)
|
||||
```
|
||||
|
||||
### Monitoring Components
|
||||
|
||||
#### OpenTelemetry Instrumentation
|
||||
|
||||
```python
|
||||
# Automatic instrumentation in base class
|
||||
FastAPIInstrumentor.instrument_app(app) # HTTP requests
|
||||
HTTPXClientInstrumentor().instrument() # Outgoing HTTP
|
||||
RedisInstrumentor().instrument() # Redis operations
|
||||
SQLAlchemyInstrumentor().instrument() # Database queries
|
||||
```
|
||||
|
||||
#### Metrics Collection
|
||||
|
||||
```python
|
||||
# Standard metrics automatically collected
|
||||
metrics_collector.register_counter("http_requests_total", "Total HTTP requests")
|
||||
metrics_collector.register_histogram("http_request_duration", "Request duration")
|
||||
metrics_collector.register_gauge("active_requests", "Active requests")
|
||||
|
||||
# System metrics automatically collected
|
||||
system_metrics = SystemMetricsCollector("service-name")
|
||||
# → CPU, Memory, Disk I/O, Network I/O, Threads, File Descriptors
|
||||
```
|
||||
|
||||
#### Health Checks
|
||||
|
||||
```python
|
||||
# Automatic health check endpoints
|
||||
GET /health # Overall service health
|
||||
GET /health/detailed # Detailed health with dependencies
|
||||
GET /health/ready # Readiness probe
|
||||
GET /health/live # Liveness probe
|
||||
```
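Conceptually, the detailed health endpoint aggregates per-dependency checks into one report. A minimal sketch of that aggregation, assuming a plain callable per dependency; the check names, statuses, and return shape here are illustrative, not the actual `StandardFastAPIService` internals:

```python
from typing import Callable, Dict

def detailed_health(checks: Dict[str, Callable[[], bool]]) -> dict:
    """Run each dependency check and report overall plus per-check status."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "healthy" if check() else "unhealthy"
        except Exception:
            # A crashing check counts as unhealthy rather than failing the endpoint
            results[name] = "unhealthy"
    overall = "healthy" if all(v == "healthy" for v in results.values()) else "degraded"
    return {"status": overall, "dependencies": results}

# Example: database up, cache down -> overall degraded
report = detailed_health({"database": lambda: True, "redis": lambda: False})
print(report["status"])  # degraded
```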
## 📊 Metrics Reference

### Standard Metrics (All Services)

| Metric Type | Metric Name | Description | Labels |
|-------------|------------|-------------|--------|
| **HTTP Metrics** | `{service}_http_requests_total` | Total HTTP requests | method, endpoint, status_code |
| **HTTP Metrics** | `{service}_http_request_duration_seconds` | Request duration histogram | method, endpoint, status_code |
| **HTTP Metrics** | `{service}_active_requests` | Currently active requests | - |
| **System Metrics** | `process.cpu.utilization` | Process CPU usage | - |
| **System Metrics** | `process.memory.usage` | Process memory usage | - |
| **System Metrics** | `system.cpu.utilization` | System CPU usage | - |
| **System Metrics** | `system.memory.usage` | System memory usage | - |
| **Database Metrics** | `db.query.duration` | Database query duration | operation, table |
| **Cache Metrics** | `cache.operation.duration` | Cache operation duration | operation, key |

### Custom Metrics (Service-Specific)

Examples of service-specific metrics:

**Auth Service:**
- `auth_registration_total` (by status)
- `auth_login_success_total`
- `auth_login_failure_total` (by reason)
- `auth_registration_duration_seconds`

**Orders Service:**
- `orders_created_total`
- `orders_processed_total` (by status)
- `orders_processing_duration_seconds`

**AI Insights Service:**
- `ai_insights_generated_total`
- `ai_model_inference_duration_seconds`
- `ai_feedback_received_total`
## 🔍 Tracing Guide

### Trace Propagation

Traces automatically flow across service boundaries:

```mermaid
sequenceDiagram
    participant Client
    participant Gateway
    participant Auth
    participant Orders

    Client->>Gateway: HTTP Request (trace_id: abc123)
    Gateway->>Auth: Auth Check (trace_id: abc123)
    Auth-->>Gateway: Auth Response (trace_id: abc123)
    Gateway->>Orders: Create Order (trace_id: abc123)
    Orders-->>Gateway: Order Created (trace_id: abc123)
    Gateway-->>Client: Final Response (trace_id: abc123)
```

### Trace Context in Logs

All logs include trace correlation:

```json
{
  "level": "info",
  "message": "Processing order",
  "service": "orders-service",
  "trace_id": "abc123def456",
  "span_id": "789ghi",
  "order_id": "12345",
  "timestamp": "2024-01-08T19:00:00Z"
}
```
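A common way to get this correlation is a `logging.Filter` that copies the active trace context onto every record before it is formatted. A self-contained sketch using a context variable as the source of the trace id; the platform's logging presumably pulls these ids from the OpenTelemetry SDK instead:

```python
import logging
from contextvars import ContextVar

# Stand-in for the OpenTelemetry trace context
current_trace_id: ContextVar[str] = ContextVar("trace_id", default="")

class TraceContextFilter(logging.Filter):
    """Attach the active trace id to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True  # never drop records, only annotate them

logger = logging.getLogger("orders-service")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    '{"message": "%(message)s", "trace_id": "%(trace_id)s"}'))
logger.addHandler(handler)
logger.addFilter(TraceContextFilter())

current_trace_id.set("abc123def456")
logger.warning("Processing order")  # the emitted JSON carries trace_id abc123def456
```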
### Manual Trace Enhancement

Add custom trace attributes:

```python
from shared.monitoring.tracing import add_trace_attributes, add_trace_event

# Add custom attributes
add_trace_attributes(
    user_id="123",
    tenant_id="abc",
    operation="order_creation"
)

# Add trace events
add_trace_event("order_validation_started")
# ... validation logic ...
add_trace_event("order_validation_completed", status="success")
```
## 🚨 Alerting Guide

### Standard Alerts (Recommended)

| Alert Name | Condition | Severity | Notification |
|------------|-----------|----------|--------------|
| **High Error Rate** | `error_rate > 5%` for 5m | High | PagerDuty + Slack |
| **High Latency** | `p99_latency > 2s` for 5m | High | PagerDuty + Slack |
| **Service Unavailable** | `up == 0` for 1m | Critical | PagerDuty + Slack + Email |
| **High Memory Usage** | `memory_usage > 80%` for 10m | Medium | Slack |
| **High CPU Usage** | `cpu_usage > 90%` for 5m | Medium | Slack |
| **Database Connection Issues** | `db_connections < minimum_pool_size` | High | PagerDuty + Slack |
| **Cache Hit Ratio Low** | `cache_hit_ratio < 70%` for 15m | Low | Slack |
### Creating Alerts in SigNoz

1. **Navigate to Alerts**: SigNoz UI → Alerts → Create Alert
2. **Select Metric**: Choose from available metrics
3. **Set Condition**: Define threshold and duration
4. **Configure Notifications**: Add notification channels
5. **Set Severity**: Critical, High, Medium, Low
6. **Add Description**: Explain alert purpose and resolution steps

### Example Alert Configuration (YAML)

```yaml
# Example for Terraform/Kubernetes
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: bakery-ia-alerts
  namespace: monitoring
spec:
  groups:
    - name: service-health
      rules:
        - alert: ServiceDown
          expr: up{service!~"signoz.*"} == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Service {{ $labels.service }} is down"
            description: "{{ $labels.service }} has been down for more than 1 minute"
            runbook: "https://github.com/yourorg/bakery-ia/blob/main/RUNBOOKS.md#service-down"

        - alert: HighErrorRate
          expr: rate(http_requests_total{status_code=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
          for: 5m
          labels:
            severity: high
          annotations:
            summary: "High error rate in {{ $labels.service }}"
            # $value is a ratio (0..1), so format it as a percentage
            description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
            runbook: "https://github.com/yourorg/bakery-ia/blob/main/RUNBOOKS.md#high-error-rate"
```
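The `HighErrorRate` expression is simply the 5xx request rate divided by the total request rate over the same window. The arithmetic can be sanity-checked in isolation (the request counts below are illustrative):

```python
def error_rate(errors_5xx: float, total: float) -> float:
    """Fraction of requests that returned 5xx over the same window."""
    return errors_5xx / total if total else 0.0

# 12 server errors out of 200 requests in the 5m window -> 6%
rate = error_rate(12, 200)
print(f"{rate:.1%}")  # 6.0%
print(rate > 0.05)    # True -> the alert condition holds
```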
## 📈 Dashboard Guide

### Recommended Dashboards

#### 1. Service Overview Dashboard
- HTTP Request Rate
- Error Rate
- Latency Percentiles (p50, p90, p99)
- Active Requests
- System Resource Usage

#### 2. Performance Dashboard
- Request Duration Histogram
- Database Query Performance
- Cache Performance
- External API Call Performance

#### 3. System Health Dashboard
- CPU Usage (Process & System)
- Memory Usage (Process & System)
- Disk I/O
- Network I/O
- File Descriptors
- Thread Count

#### 4. Business Metrics Dashboard
- User Registrations
- Order Volume
- AI Insights Generated
- API Usage by Tenant

### Creating Dashboards in SigNoz

1. **Navigate to Dashboards**: SigNoz UI → Dashboards → Create Dashboard
2. **Add Panels**: Click "Add Panel" and select a metric
3. **Configure Visualization**: Choose chart type and settings
4. **Set Time Range**: Pick a sensible default (last 1h, 6h, 24h, or 7d)
5. **Add Variables**: For dynamic filtering (service, environment)
6. **Save Dashboard**: Give it a descriptive name
## 🛠️ Troubleshooting Guide

### Common Issues & Solutions

#### Issue: No Metrics Appearing in SigNoz

**Checklist:**
- ✅ OpenTelemetry Collector running? `kubectl get pods -n signoz`
- ✅ Service can reach collector? `telnet signoz-otel-collector.signoz 4318`
- ✅ OTLP endpoint configured correctly? Check `OTEL_EXPORTER_OTLP_ENDPOINT`
- ✅ Service logs show OTLP export? Look for "Exporting metrics"
- ✅ No network policies blocking? Check Kubernetes network policies

**Debugging:**
```bash
# Check OpenTelemetry Collector logs
kubectl logs -n signoz -l app=otel-collector

# Check service logs for OTLP errors
kubectl logs -l app=auth-service | grep -i otel

# Test OTLP connectivity from service pod
kubectl exec -it auth-service-pod -- curl -v http://signoz-otel-collector.signoz:4318
```

#### Issue: High Latency in Specific Service

**Checklist:**
- ✅ Database queries slow? Check `db.query.duration` metrics
- ✅ External API calls slow? Check trace waterfall
- ✅ High CPU usage? Check system metrics
- ✅ Memory pressure? Check memory metrics
- ✅ Too many active requests? Check concurrency

**Debugging:**
```python
# Add detailed tracing to suspicious code
from shared.monitoring.tracing import add_trace_event

add_trace_event("database_query_started", table="users")
# ... database query ...
add_trace_event("database_query_completed", duration_ms=45)
```

#### Issue: High Error Rate

**Checklist:**
- ✅ Database connection issues? Check health endpoints
- ✅ External API failures? Check dependency metrics
- ✅ Authentication failures? Check auth service logs
- ✅ Validation errors? Check application logs
- ✅ Rate limiting? Check gateway metrics

**Debugging:**
```bash
# Check error logs with trace correlation
kubectl logs -l app=auth-service | grep -i error | grep -i trace

# Filter traces by error status
# In SigNoz: Add filter http.status_code >= 400
```
## 📚 Runbook Reference

See [RUNBOOKS.md](RUNBOOKS.md) for detailed troubleshooting procedures.

## 🔧 Development Guide

### Adding Custom Metrics

```python
# In any service using direct monitoring
self.metrics_collector.register_counter(
    "custom_metric_name",
    "Description of what this metric tracks",
    labels=["label1", "label2"]  # Optional labels
)

# Increment the counter
self.metrics_collector.increment_counter(
    "custom_metric_name",
    value=1,
    labels={"label1": "value1", "label2": "value2"}
)
```
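Histograms such as `http_request_duration_seconds` record each observation into cumulative latency buckets rather than storing raw values, which is what makes percentile queries cheap. A minimal illustration of that bucketing; the real collector delegates this to the OpenTelemetry SDK:

```python
import bisect

class DurationHistogram:
    """Count observations into latency buckets (seconds)."""

    def __init__(self, bounds=(0.1, 0.5, 1.0, 2.0)):
        self.bounds = list(bounds)
        self.counts = [0] * (len(self.bounds) + 1)  # last bucket = +Inf
        self.total = 0.0

    def observe(self, seconds: float) -> None:
        # First bound >= value picks the bucket; values past all bounds go to +Inf
        self.counts[bisect.bisect_left(self.bounds, seconds)] += 1
        self.total += seconds

h = DurationHistogram()
for duration in (0.05, 0.3, 0.7, 3.2):
    h.observe(duration)
print(h.counts)  # [1, 1, 1, 0, 1]
```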
### Adding Custom Trace Attributes

```python
# Add context to current span
from shared.monitoring.tracing import add_trace_attributes

add_trace_attributes(
    user_id=user.id,
    tenant_id=tenant.id,
    operation="premium_feature_access",
    feature_name="advanced_forecasting"
)
```
### Service-Specific Monitoring Setup

For services needing custom monitoring beyond the base class:

```python
# In your service's __init__ method
from shared.monitoring.system_metrics import SystemMetricsCollector
from shared.monitoring.metrics import MetricsCollector

class MyService(StandardFastAPIService):
    def __init__(self):
        # Call parent constructor first
        super().__init__(...)

        # Add custom metrics collector
        self.custom_metrics = MetricsCollector("my-service")

        # Register custom metrics
        self.custom_metrics.register_counter(
            "business_specific_events",
            "Custom business event counter"
        )

        # Add system metrics if not using base class defaults
        self.system_metrics = SystemMetricsCollector("my-service")
```
## 📊 SigNoz Configuration

### Environment Variables

```env
# OpenTelemetry Collector endpoint
OTEL_EXPORTER_OTLP_ENDPOINT=http://signoz-otel-collector.signoz:4318

# Service-specific configuration
OTEL_SERVICE_NAME=auth-service
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production,k8s.namespace=bakery-ia

# Metrics export interval (default: 60000ms = 60s)
OTEL_METRIC_EXPORT_INTERVAL=60000

# Batch span processor configuration
OTEL_BSP_SCHEDULE_DELAY=5000
OTEL_BSP_MAX_QUEUE_SIZE=2048
OTEL_BSP_MAX_EXPORT_BATCH_SIZE=512
```
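A service typically reads these variables once at startup, falling back to defaults when they are unset. A hedged sketch of that lookup; the variable names match the block above, while the helper name and defaults are assumptions for illustration:

```python
import os

def otel_config() -> dict:
    """Collect OTLP exporter settings from the environment."""
    return {
        "endpoint": os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT",
                              "http://signoz-otel-collector.signoz:4318"),
        "service_name": os.getenv("OTEL_SERVICE_NAME", "unknown-service"),
        # Exported as milliseconds, matching OTEL_METRIC_EXPORT_INTERVAL
        "export_interval_ms": int(os.getenv("OTEL_METRIC_EXPORT_INTERVAL", "60000")),
    }

os.environ["OTEL_SERVICE_NAME"] = "auth-service"
print(otel_config()["service_name"])        # auth-service
print(otel_config()["export_interval_ms"])  # 60000
```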
### Kubernetes Configuration

```yaml
# Example deployment with monitoring configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: auth-service
spec:
  template:
    spec:
      containers:
        - name: auth-service
          image: auth-service:latest
          env:
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://signoz-otel-collector.signoz:4318"
            - name: OTEL_SERVICE_NAME
              value: "auth-service"
            - name: ENVIRONMENT
              value: "production"
          resources:
            limits:
              cpu: "1"
              memory: "512Mi"
            requests:
              cpu: "200m"
              memory: "256Mi"
```
## 🎯 Best Practices

### Monitoring Best Practices

1. **Use Consistent Naming**: Follow OpenTelemetry semantic conventions
2. **Add Context to Traces**: Include user/tenant IDs in trace attributes
3. **Monitor Dependencies**: Track external API and database performance
4. **Set Appropriate Alerts**: Avoid alert fatigue with meaningful thresholds
5. **Document Metrics**: Keep metrics documentation up to date
6. **Review Regularly**: Update dashboards as services evolve
7. **Test Alerts**: Ensure alerts fire correctly before production

### Performance Best Practices

1. **Batch Metrics Export**: Use default 60s interval for most services
2. **Sample Traces**: Consider sampling for high-volume services
3. **Limit Custom Metrics**: Only track metrics that provide value
4. **Use Histograms Wisely**: Histograms can be resource-intensive
5. **Monitor Monitoring**: Track OTLP export success/failure rates
## 📞 Support

### Getting Help

1. **Check Documentation**: This file and RUNBOOKS.md
2. **Review SigNoz Docs**: https://signoz.io/docs/
3. **OpenTelemetry Docs**: https://opentelemetry.io/docs/
4. **Team Channel**: #monitoring in Slack
5. **GitHub Issues**: https://github.com/yourorg/bakery-ia/issues

### Escalation Path

1. **First Line**: Development team (service owners)
2. **Second Line**: DevOps team (monitoring specialists)
3. **Third Line**: SigNoz support (vendor support)

## 🎉 Summary

The bakery-ia monitoring system provides:

- **📊 100% Service Coverage**: All 20 services monitored
- **🚀 Modern Architecture**: OpenTelemetry + SigNoz
- **🔧 Comprehensive Metrics**: System, HTTP, database, cache
- **🔍 Full Observability**: Traces, metrics, logs integrated
- **✅ Production Ready**: Battle-tested and scalable

**All services are fully instrumented and ready for production monitoring!** 🎉