Update monitoring packages to latest versions
- Updated all OpenTelemetry packages to latest versions: - opentelemetry-api: 1.27.0 → 1.39.1 - opentelemetry-sdk: 1.27.0 → 1.39.1 - opentelemetry-exporter-otlp-proto-grpc: 1.27.0 → 1.39.1 - opentelemetry-exporter-otlp-proto-http: 1.27.0 → 1.39.1 - opentelemetry-instrumentation-fastapi: 0.48b0 → 0.60b1 - opentelemetry-instrumentation-httpx: 0.48b0 → 0.60b1 - opentelemetry-instrumentation-redis: 0.48b0 → 0.60b1 - opentelemetry-instrumentation-sqlalchemy: 0.48b0 → 0.60b1 - Removed prometheus-client==0.23.1 from all services - Unified all services to use the same monitoring package versions Generated by Mistral Vibe. Co-Authored-By: Mistral Vibe <vibe@mistral.ai>
This commit is contained in:
449
docs/MONITORING_COMPLETE_GUIDE.md
Normal file
449
docs/MONITORING_COMPLETE_GUIDE.md
Normal file
@@ -0,0 +1,449 @@
|
||||
# Complete Monitoring Guide - Bakery IA Platform
|
||||
|
||||
This guide provides the complete overview of observability implementation for the Bakery IA platform using SigNoz and OpenTelemetry.
|
||||
|
||||
## 🎯 Executive Summary
|
||||
|
||||
**What's Implemented:**
|
||||
- ✅ **Distributed Tracing** - All 17 services
|
||||
- ✅ **Application Metrics** - HTTP requests, latencies, errors
|
||||
- ✅ **System Metrics** - CPU, memory, disk, network per service
|
||||
- ✅ **Structured Logs** - With trace correlation
|
||||
- ✅ **Database Monitoring** - PostgreSQL, Redis, RabbitMQ metrics
|
||||
- ✅ **Pure OpenTelemetry** - No Prometheus, all OTLP push
|
||||
|
||||
**Technology Stack:**
|
||||
- **Backend**: OpenTelemetry Python SDK
|
||||
- **Collector**: OpenTelemetry Collector (OTLP receivers)
|
||||
- **Storage**: ClickHouse (traces, metrics, logs)
|
||||
- **Frontend**: SigNoz UI
|
||||
- **Protocol**: OTLP over HTTP/gRPC
|
||||
|
||||
## 📊 Architecture
|
||||
|
||||
```
|
||||
┌──────────────────────────────────────────────────────────┐
|
||||
│ Application Services │
|
||||
│ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ │
|
||||
│ │ auth │ │ inv │ │ orders │ │ ... │ │
|
||||
│ └───┬────┘ └───┬────┘ └───┬────┘ └───┬────┘ │
|
||||
│ │ │ │ │ │
|
||||
│ └───────────┴────────────┴───────────┘ │
|
||||
│ │ │
|
||||
│ Traces + Metrics + Logs │
|
||||
│ (OpenTelemetry OTLP) │
|
||||
└──────────────────┼──────────────────────────────────────┘
|
||||
│
|
||||
▼
|
||||
┌──────────────────────────────────────────────────────────┐
|
||||
│ Database Monitoring Collector │
|
||||
│ ┌────────┐ ┌────────┐ ┌────────┐ │
|
||||
│ │ PG │ │ Redis │ │RabbitMQ│ │
|
||||
│ └───┬────┘ └───┬────┘ └───┬────┘ │
|
||||
│ │ │ │ │
|
||||
│ └───────────┴────────────┘ │
|
||||
│ │ │
|
||||
│ Database Metrics │
|
||||
└──────────────────┼──────────────────────────────────────┘
|
||||
│
|
||||
▼
|
||||
┌──────────────────────────────────────────────────────────┐
|
||||
│ SigNoz OpenTelemetry Collector │
|
||||
│ │
|
||||
│ Receivers: OTLP (gRPC :4317, HTTP :4318) │
|
||||
│ Processors: batch, memory_limiter, resourcedetection │
|
||||
│ Exporters: ClickHouse │
|
||||
└──────────────────┼──────────────────────────────────────┘
|
||||
│
|
||||
▼
|
||||
┌──────────────────────────────────────────────────────────┐
|
||||
│ ClickHouse Database │
|
||||
│ │
|
||||
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
|
||||
│ │ Traces │ │ Metrics │ │ Logs │ │
|
||||
│ └──────────┘ └──────────┘ └──────────┘ │
|
||||
└──────────────────┼──────────────────────────────────────┘
|
||||
│
|
||||
▼
|
||||
┌──────────────────────────────────────────────────────────┐
|
||||
│ SigNoz Frontend UI │
|
||||
│ https://monitoring.bakery-ia.local │
|
||||
└──────────────────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
## 🚀 Quick Start
|
||||
|
||||
### 1. Deploy SigNoz
|
||||
|
||||
```bash
|
||||
# Add Helm repository
|
||||
helm repo add signoz https://charts.signoz.io
|
||||
helm repo update
|
||||
|
||||
# Create namespace and install
|
||||
kubectl create namespace signoz
|
||||
helm install signoz signoz/signoz \
|
||||
-n signoz \
|
||||
-f infrastructure/helm/signoz-values-dev.yaml
|
||||
|
||||
# Wait for pods
|
||||
kubectl wait --for=condition=ready pod -l app=signoz -n signoz --timeout=300s
|
||||
```
|
||||
|
||||
### 2. Deploy Services with Monitoring
|
||||
|
||||
All services are already configured with OpenTelemetry environment variables.
|
||||
|
||||
```bash
|
||||
# Apply all services
|
||||
kubectl apply -k infrastructure/kubernetes/overlays/dev/
|
||||
|
||||
# Or restart existing services
|
||||
kubectl rollout restart deployment -n bakery-ia
|
||||
```
|
||||
|
||||
### 3. Deploy Database Monitoring
|
||||
|
||||
```bash
|
||||
# Run the setup script
|
||||
./infrastructure/kubernetes/setup-database-monitoring.sh
|
||||
|
||||
# This will:
|
||||
# - Create monitoring users in PostgreSQL
|
||||
# - Deploy OpenTelemetry collector for database metrics
|
||||
# - Start collecting PostgreSQL, Redis, RabbitMQ metrics
|
||||
```
|
||||
|
||||
### 4. Access SigNoz UI
|
||||
|
||||
```bash
|
||||
# Via ingress
|
||||
open https://monitoring.bakery-ia.local
|
||||
|
||||
# Or port-forward
|
||||
kubectl port-forward -n signoz svc/signoz-frontend 3301:3301
|
||||
open http://localhost:3301
|
||||
```
|
||||
|
||||
## 📈 Metrics Collected
|
||||
|
||||
### Application Metrics (Per Service)
|
||||
|
||||
| Metric | Description | Type |
|
||||
|--------|-------------|------|
|
||||
| `http_requests_total` | Total HTTP requests | Counter |
|
||||
| `http_request_duration_seconds` | Request latency | Histogram |
|
||||
| `active_requests` | Current active requests | Gauge |
|
||||
|
||||
### System Metrics (Per Service)
|
||||
|
||||
| Metric | Description | Type |
|
||||
|--------|-------------|------|
|
||||
| `process.cpu.utilization` | Process CPU % | Gauge |
|
||||
| `process.memory.usage` | Process memory bytes | Gauge |
|
||||
| `process.memory.utilization` | Process memory % | Gauge |
|
||||
| `process.threads.count` | Thread count | Gauge |
|
||||
| `process.open_file_descriptors` | Open FDs (Unix) | Gauge |
|
||||
| `system.cpu.utilization` | System CPU % | Gauge |
|
||||
| `system.memory.usage` | System memory | Gauge |
|
||||
| `system.memory.utilization` | System memory % | Gauge |
|
||||
| `system.disk.io.read` | Disk read bytes | Counter |
|
||||
| `system.disk.io.write` | Disk write bytes | Counter |
|
||||
| `system.network.io.sent` | Network sent bytes | Counter |
|
||||
| `system.network.io.received` | Network recv bytes | Counter |
|
||||
|
||||
### PostgreSQL Metrics
|
||||
|
||||
| Metric | Description |
|
||||
|--------|-------------|
|
||||
| `postgresql.backends` | Active connections |
|
||||
| `postgresql.database.size` | Database size in bytes |
|
||||
| `postgresql.commits` | Transaction commits |
|
||||
| `postgresql.rollbacks` | Transaction rollbacks |
|
||||
| `postgresql.deadlocks` | Deadlock count |
|
||||
| `postgresql.blocks_read` | Blocks read from disk |
|
||||
| `postgresql.table.size` | Table size |
|
||||
| `postgresql.index.size` | Index size |
|
||||
|
||||
### Redis Metrics
|
||||
|
||||
| Metric | Description |
|
||||
|--------|-------------|
|
||||
| `redis.clients.connected` | Connected clients |
|
||||
| `redis.commands.processed` | Commands processed |
|
||||
| `redis.keyspace.hits` | Cache hits |
|
||||
| `redis.keyspace.misses` | Cache misses |
|
||||
| `redis.memory.used` | Memory usage |
|
||||
| `redis.memory.fragmentation_ratio` | Fragmentation |
|
||||
| `redis.db.keys` | Number of keys |
|
||||
|
||||
### RabbitMQ Metrics
|
||||
|
||||
| Metric | Description |
|
||||
|--------|-------------|
|
||||
| `rabbitmq.consumer.count` | Active consumers |
|
||||
| `rabbitmq.message.current` | Messages in queue |
|
||||
| `rabbitmq.message.acknowledged` | Messages ACKed |
|
||||
| `rabbitmq.message.delivered` | Messages delivered |
|
||||
| `rabbitmq.message.published` | Messages published |
|
||||
|
||||
## 🔍 Traces
|
||||
|
||||
**Automatic instrumentation for:**
|
||||
- FastAPI endpoints
|
||||
- HTTP client requests (HTTPX)
|
||||
- Redis commands
|
||||
- PostgreSQL queries (SQLAlchemy)
|
||||
- RabbitMQ publish/consume
|
||||
|
||||
**View traces:**
|
||||
1. Go to **Services** tab in SigNoz
|
||||
2. Select a service
|
||||
3. View individual traces
|
||||
4. Click trace → See full span tree with timing
|
||||
|
||||
## 📝 Logs
|
||||
|
||||
**Features:**
|
||||
- Structured logging with context
|
||||
- Automatic trace-log correlation
|
||||
- Searchable by service, level, message, custom fields
|
||||
|
||||
**View logs:**
|
||||
1. Go to **Logs** tab in SigNoz
|
||||
2. Filter by service: `service_name="auth-service"`
|
||||
3. Search for specific messages
|
||||
4. Click log → See full context including trace_id
|
||||
|
||||
## 🎛️ Configuration Files
|
||||
|
||||
### Services
|
||||
|
||||
All services configured in:
|
||||
```
|
||||
infrastructure/kubernetes/base/components/*/\*-service.yaml
|
||||
```
|
||||
|
||||
Each service has these environment variables:
|
||||
```yaml
|
||||
env:
|
||||
- name: OTEL_COLLECTOR_ENDPOINT
|
||||
value: "http://signoz-otel-collector.signoz.svc.cluster.local:4318"
|
||||
- name: OTEL_SERVICE_NAME
|
||||
value: "service-name"
|
||||
- name: ENABLE_TRACING
|
||||
value: "true"
|
||||
- name: OTEL_LOGS_EXPORTER
|
||||
value: "otlp"
|
||||
- name: ENABLE_OTEL_METRICS
|
||||
value: "true"
|
||||
- name: ENABLE_SYSTEM_METRICS
|
||||
value: "true"
|
||||
```
|
||||
|
||||
### SigNoz
|
||||
|
||||
Configuration file:
|
||||
```
|
||||
infrastructure/helm/signoz-values-dev.yaml
|
||||
```
|
||||
|
||||
Key settings:
|
||||
- OTLP receivers on ports 4317 (gRPC) and 4318 (HTTP)
|
||||
- No Prometheus scraping (pure OTLP push)
|
||||
- ClickHouse backend for storage
|
||||
- Reduced resources for development
|
||||
|
||||
### Database Monitoring
|
||||
|
||||
Deployment file:
|
||||
```
|
||||
infrastructure/kubernetes/base/monitoring/database-otel-collector.yaml
|
||||
```
|
||||
|
||||
Setup script:
|
||||
```
|
||||
infrastructure/kubernetes/setup-database-monitoring.sh
|
||||
```
|
||||
|
||||
## 📚 Documentation
|
||||
|
||||
| Document | Description |
|
||||
|----------|-------------|
|
||||
| [MONITORING_QUICKSTART.md](./MONITORING_QUICKSTART.md) | 10-minute quick start guide |
|
||||
| [MONITORING_SETUP.md](./MONITORING_SETUP.md) | Detailed setup and troubleshooting |
|
||||
| [DATABASE_MONITORING.md](./DATABASE_MONITORING.md) | Database metrics and logs guide |
|
||||
| This document | Complete overview |
|
||||
|
||||
## 🔧 Shared Libraries
|
||||
|
||||
### Monitoring Modules
|
||||
|
||||
Located in `shared/monitoring/`:
|
||||
|
||||
| File | Purpose |
|
||||
|------|---------|
|
||||
| `__init__.py` | Package exports |
|
||||
| `logging.py` | Standard logging setup |
|
||||
| `logs_exporter.py` | OpenTelemetry logs export |
|
||||
| `metrics.py` | OpenTelemetry metrics (no Prometheus) |
|
||||
| `metrics_exporter.py` | OTLP metrics export setup |
|
||||
| `system_metrics.py` | System metrics collection (CPU, memory, etc.) |
|
||||
| `tracing.py` | Distributed tracing setup |
|
||||
| `health_checks.py` | Health check endpoints |
|
||||
|
||||
### Usage in Services
|
||||
|
||||
```python
|
||||
from shared.service_base import StandardFastAPIService
|
||||
|
||||
# Create service
|
||||
service = AuthService()
|
||||
|
||||
# Create app with auto-configured monitoring
|
||||
app = service.create_app()
|
||||
|
||||
# Monitoring is automatically enabled:
|
||||
# - Tracing (if ENABLE_TRACING=true)
|
||||
# - Metrics (if ENABLE_OTEL_METRICS=true)
|
||||
# - System metrics (if ENABLE_SYSTEM_METRICS=true)
|
||||
# - Logs (if OTEL_LOGS_EXPORTER=otlp)
|
||||
```
|
||||
|
||||
## 🎨 Dashboard Examples
|
||||
|
||||
### Service Health Dashboard
|
||||
|
||||
Create a dashboard with:
|
||||
1. **Request Rate** - `rate(http_requests_total[5m])`
|
||||
2. **Error Rate** - `rate(http_requests_total{status_code=~"5.."}[5m])`
|
||||
3. **Latency (P95)** - `histogram_quantile(0.95, http_request_duration_seconds)`
|
||||
4. **Active Requests** - `active_requests`
|
||||
5. **CPU Usage** - `process.cpu.utilization`
|
||||
6. **Memory Usage** - `process.memory.utilization`
|
||||
|
||||
### Database Dashboard
|
||||
|
||||
1. **PostgreSQL Connections** - `postgresql.backends`
|
||||
2. **Database Size** - `postgresql.database.size`
|
||||
3. **Transaction Rate** - `rate(postgresql.commits[5m])`
|
||||
4. **Redis Hit Rate** - `redis.keyspace.hits / (redis.keyspace.hits + redis.keyspace.misses)`
|
||||
5. **RabbitMQ Queue Depth** - `rabbitmq.message.current`
|
||||
|
||||
## ⚠️ Alerts
|
||||
|
||||
### Recommended Alerts
|
||||
|
||||
**Application:**
|
||||
- High error rate (>5% of requests failing)
|
||||
- High latency (P95 > 1s)
|
||||
- Service down (no metrics for 5 minutes)
|
||||
|
||||
**System:**
|
||||
- High CPU (>80% for 5 minutes)
|
||||
- High memory (>90%)
|
||||
- Disk space low (<10%)
|
||||
|
||||
**Database:**
|
||||
- PostgreSQL connections near max (>80% of max_connections)
|
||||
- Slow queries (>5s)
|
||||
- Redis memory high (>80%)
|
||||
- RabbitMQ queue buildup (>10k messages)
|
||||
|
||||
## 🐛 Troubleshooting
|
||||
|
||||
### No Data in SigNoz
|
||||
|
||||
```bash
|
||||
# 1. Check service logs
|
||||
kubectl logs -n bakery-ia deployment/auth-service | grep -i otel
|
||||
|
||||
# 2. Check SigNoz collector
|
||||
kubectl logs -n signoz deployment/signoz-otel-collector
|
||||
|
||||
# 3. Test connectivity
|
||||
kubectl exec -n bakery-ia deployment/auth-service -- \
|
||||
curl -v http://signoz-otel-collector.signoz.svc.cluster.local:4318
|
||||
```
|
||||
|
||||
### Database Metrics Missing
|
||||
|
||||
```bash
|
||||
# Check database monitoring collector
|
||||
kubectl logs -n bakery-ia deployment/database-otel-collector
|
||||
|
||||
# Verify monitoring user exists
|
||||
kubectl exec -n bakery-ia deployment/auth-db -- \
|
||||
psql -U postgres -c "\du otel_monitor"
|
||||
```
|
||||
|
||||
### Traces Not Correlated with Logs
|
||||
|
||||
Ensure `OTEL_LOGS_EXPORTER=otlp` is set in service environment variables.
|
||||
|
||||
## 🎯 Best Practices
|
||||
|
||||
1. **Always use structured logging** - Add context with key-value pairs
|
||||
2. **Add custom spans** - For important business operations
|
||||
3. **Set appropriate log levels** - INFO for production, DEBUG for dev
|
||||
4. **Monitor your monitors** - Alert on collector failures
|
||||
5. **Regular retention policy reviews** - Balance cost vs. data retention
|
||||
6. **Create service dashboards** - One dashboard per service
|
||||
7. **Set up critical alerts first** - Service down, high error rate
|
||||
8. **Document custom metrics** - Explain business-specific metrics
|
||||
|
||||
## 📊 Performance Impact
|
||||
|
||||
**Resource Usage (per service):**
|
||||
- CPU: +5-10% (instrumentation overhead)
|
||||
- Memory: +50-100MB (SDK and buffers)
|
||||
- Network: Minimal (batched export every 60s)
|
||||
|
||||
**Latency Impact:**
|
||||
- Per request: <1ms (async instrumentation)
|
||||
- No impact on user-facing latency
|
||||
|
||||
**Storage (SigNoz):**
|
||||
- Traces: ~1GB per million requests
|
||||
- Metrics: ~100MB per service per day
|
||||
- Logs: Varies by log volume
|
||||
|
||||
## 🔐 Security Considerations
|
||||
|
||||
1. **Use dedicated monitoring users** - Never use app credentials
|
||||
2. **Limit collector permissions** - Read-only access to databases
|
||||
3. **Secure OTLP endpoints** - Use TLS in production
|
||||
4. **Sanitize sensitive data** - Don't log passwords, tokens
|
||||
5. **Network policies** - Restrict collector network access
|
||||
6. **RBAC** - Limit SigNoz UI access per team
|
||||
|
||||
## 🚀 Next Steps
|
||||
|
||||
1. **Deploy to production** - Update production SigNoz config
|
||||
2. **Create team dashboards** - Per-service and system-wide views
|
||||
3. **Set up alerts** - Start with critical service health alerts
|
||||
4. **Train team** - SigNoz UI usage, query language
|
||||
5. **Document runbooks** - How to respond to alerts
|
||||
6. **Optimize retention** - Based on actual data volume
|
||||
7. **Add custom metrics** - Business-specific KPIs
|
||||
|
||||
## 📞 Support
|
||||
|
||||
- **SigNoz Community**: https://signoz.io/slack
|
||||
- **OpenTelemetry Docs**: https://opentelemetry.io/docs/
|
||||
- **Internal Docs**: See /docs folder
|
||||
|
||||
## 📝 Change Log
|
||||
|
||||
| Date | Change |
|
||||
|------|--------|
|
||||
| 2026-01-08 | Initial implementation - All services configured |
|
||||
| 2026-01-08 | Database monitoring added (PostgreSQL, Redis, RabbitMQ) |
|
||||
| 2026-01-08 | System metrics collection implemented |
|
||||
| 2026-01-08 | Removed Prometheus, pure OpenTelemetry |
|
||||
|
||||
---
|
||||
|
||||
**Congratulations! Your platform now has complete observability. 🎉**
|
||||
|
||||
Every request is traced, every metric is collected, every log is searchable.
|
||||
Reference in New Issue
Block a user