# Complete Monitoring Guide - Bakery IA Platform

This guide gives a complete overview of the observability implementation for the Bakery IA platform, built on SigNoz and OpenTelemetry.

## 🎯 Executive Summary

**What's Implemented:**

- ✅ **Distributed Tracing** - All 17 services
- ✅ **Application Metrics** - HTTP requests, latencies, errors
- ✅ **System Metrics** - CPU, memory, disk, network per service
- ✅ **Structured Logs** - With trace correlation
- ✅ **Database Monitoring** - PostgreSQL, Redis, RabbitMQ metrics
- ✅ **Pure OpenTelemetry** - No Prometheus, all OTLP push

**Technology Stack:**

- **Backend**: OpenTelemetry Python SDK
- **Collector**: OpenTelemetry Collector (OTLP receivers)
- **Storage**: ClickHouse (traces, metrics, logs)
- **Frontend**: SigNoz UI
- **Protocol**: OTLP over HTTP/gRPC

## 📊 Architecture

```
┌─────────────────────────────────────────────────────────┐
│                  Application Services                   │
│  ┌────────┐  ┌────────┐  ┌────────┐  ┌────────┐         │
│  │  auth  │  │  inv   │  │ orders │  │  ...   │         │
│  └───┬────┘  └───┬────┘  └───┬────┘  └───┬────┘         │
│      │           │           │           │              │
│      └───────────┴───────────┴───────────┘              │
│                  │                                      │
│       Traces + Metrics + Logs                           │
│        (OpenTelemetry OTLP)                             │
└──────────────────┼──────────────────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────────────────────────┐
│              Database Monitoring Collector              │
│  ┌────────┐  ┌────────┐  ┌────────┐                     │
│  │   PG   │  │ Redis  │  │RabbitMQ│                     │
│  └───┬────┘  └───┬────┘  └───┬────┘                     │
│      │           │           │                          │
│      └───────────┴───────────┘                          │
│                  │                                      │
│          Database Metrics                               │
└──────────────────┼──────────────────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────────────────────────┐
│             SigNoz OpenTelemetry Collector              │
│                                                         │
│  Receivers: OTLP (gRPC :4317, HTTP :4318)               │
│  Processors: batch, memory_limiter, resourcedetection   │
│  Exporters: ClickHouse                                  │
└──────────────────┼──────────────────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────────────────────────┐
│                   ClickHouse Database                   │
│                                                         │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐             │
│  │  Traces  │  │ Metrics  │  │   Logs   │             │
│  └──────────┘  └──────────┘  └──────────┘             │
└──────────────────┼──────────────────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────────────────────────┐
│                   SigNoz Frontend UI                    │
│           https://monitoring.bakery-ia.local            │
└─────────────────────────────────────────────────────────┘
```

## 🚀 Quick Start

### 1. Deploy SigNoz

```bash
# Add Helm repository
helm repo add signoz https://charts.signoz.io
helm repo update

# Create namespace and install
kubectl create namespace signoz
helm install signoz signoz/signoz \
  -n signoz \
  -f infrastructure/helm/signoz-values-dev.yaml

# Wait for pods
kubectl wait --for=condition=ready pod -l app=signoz -n signoz --timeout=300s
```

### 2. Deploy Services with Monitoring

All services are already configured with OpenTelemetry environment variables.

```bash
# Apply all services
kubectl apply -k infrastructure/kubernetes/overlays/dev/

# Or restart existing services
kubectl rollout restart deployment -n bakery-ia
```

### 3. Deploy Database Monitoring

```bash
# Run the setup script
./infrastructure/kubernetes/setup-database-monitoring.sh

# This will:
# - Create monitoring users in PostgreSQL
# - Deploy OpenTelemetry collector for database metrics
# - Start collecting PostgreSQL, Redis, RabbitMQ metrics
```

### 4. Access SigNoz UI

```bash
# Via ingress
open https://monitoring.bakery-ia.local

# Or port-forward
kubectl port-forward -n signoz svc/signoz-frontend 3301:3301
open http://localhost:3301
```

## 📈 Metrics Collected

### Application Metrics (Per Service)

| Metric | Description | Type |
|--------|-------------|------|
| `http_requests_total` | Total HTTP requests | Counter |
| `http_request_duration_seconds` | Request latency | Histogram |
| `active_requests` | Current active requests | Gauge |

### System Metrics (Per Service)

| Metric | Description | Type |
|--------|-------------|------|
| `process.cpu.utilization` | Process CPU % | Gauge |
| `process.memory.usage` | Process memory bytes | Gauge |
| `process.memory.utilization` | Process memory % | Gauge |
| `process.threads.count` | Thread count | Gauge |
| `process.open_file_descriptors` | Open FDs (Unix) | Gauge |
| `system.cpu.utilization` | System CPU % | Gauge |
| `system.memory.usage` | System memory | Gauge |
| `system.memory.utilization` | System memory % | Gauge |
| `system.disk.io.read` | Disk read bytes | Counter |
| `system.disk.io.write` | Disk write bytes | Counter |
| `system.network.io.sent` | Network sent bytes | Counter |
| `system.network.io.received` | Network recv bytes | Counter |

### PostgreSQL Metrics

| Metric | Description |
|--------|-------------|
| `postgresql.backends` | Active connections |
| `postgresql.database.size` | Database size in bytes |
| `postgresql.commits` | Transaction commits |
| `postgresql.rollbacks` | Transaction rollbacks |
| `postgresql.deadlocks` | Deadlock count |
| `postgresql.blocks_read` | Blocks read from disk |
| `postgresql.table.size` | Table size |
| `postgresql.index.size` | Index size |

### Redis Metrics

| Metric | Description |
|--------|-------------|
| `redis.clients.connected` | Connected clients |
| `redis.commands.processed` | Commands processed |
| `redis.keyspace.hits` | Cache hits |
| `redis.keyspace.misses` | Cache misses |
| `redis.memory.used` | Memory usage |
| `redis.memory.fragmentation_ratio` | Fragmentation |
| `redis.db.keys` | Number of keys |

### RabbitMQ Metrics

| Metric | Description |
|--------|-------------|
| `rabbitmq.consumer.count` | Active consumers |
| `rabbitmq.message.current` | Messages in queue |
| `rabbitmq.message.acknowledged` | Messages ACKed |
| `rabbitmq.message.delivered` | Messages delivered |
| `rabbitmq.message.published` | Messages published |

## 🔍 Traces

**Automatic instrumentation for:**

- FastAPI endpoints
- HTTP client requests (HTTPX)
- Redis commands
- PostgreSQL queries (SQLAlchemy)
- RabbitMQ publish/consume

**View traces:**

1. Go to **Services** tab in SigNoz
2. Select a service
3. View individual traces
4. Click trace → See full span tree with timing

## 📝 Logs

**Features:**

- Structured logging with context
- Automatic trace-log correlation
- Searchable by service, level, message, custom fields

**View logs:**

1. Go to **Logs** tab in SigNoz
2. Filter by service: `service_name="auth-service"`
3. Search for specific messages
4. Click log → See full context including trace_id

## 🎛️ Configuration Files

### Services

All services configured in:

```
infrastructure/kubernetes/base/components/*/*-service.yaml
```

Each service has these environment variables:

```yaml
env:
  - name: OTEL_COLLECTOR_ENDPOINT
    value: "http://signoz-otel-collector.signoz.svc.cluster.local:4318"
  - name: OTEL_SERVICE_NAME
    value: "service-name"
  - name: ENABLE_TRACING
    value: "true"
  - name: OTEL_LOGS_EXPORTER
    value: "otlp"
  - name: ENABLE_OTEL_METRICS
    value: "true"
  - name: ENABLE_SYSTEM_METRICS
    value: "true"
```

### SigNoz

Configuration file:

```
infrastructure/helm/signoz-values-dev.yaml
```

Key settings:

- OTLP receivers on ports 4317 (gRPC) and 4318 (HTTP)
- No Prometheus scraping (pure OTLP push)
- ClickHouse backend for storage
- Reduced resources for development

### Database Monitoring

Deployment file:

```
infrastructure/kubernetes/base/monitoring/database-otel-collector.yaml
```

Setup script:

```
infrastructure/kubernetes/setup-database-monitoring.sh
```

## 📚 Documentation

| Document | Description |
|----------|-------------|
| [MONITORING_QUICKSTART.md](./MONITORING_QUICKSTART.md) | 10-minute quick start guide |
| [MONITORING_SETUP.md](./MONITORING_SETUP.md) | Detailed setup and troubleshooting |
| [DATABASE_MONITORING.md](./DATABASE_MONITORING.md) | Database metrics and logs guide |
| This document | Complete overview |

## 🔧 Shared Libraries

### Monitoring Modules

Located in `shared/monitoring/`:

| File | Purpose |
|------|---------|
| `__init__.py` | Package exports |
| `logging.py` | Standard logging setup |
| `logs_exporter.py` | OpenTelemetry logs export |
| `metrics.py` | OpenTelemetry metrics (no Prometheus) |
| `metrics_exporter.py` | OTLP metrics export setup |
| `system_metrics.py` | System metrics collection (CPU, memory, etc.) |
| `tracing.py` | Distributed tracing setup |
| `health_checks.py` | Health check endpoints |

### Usage in Services

```python
from shared.service_base import StandardFastAPIService

# AuthService is assumed to be a StandardFastAPIService subclass
# defined in the service's own code (definition not shown here).

# Create service
service = AuthService()

# Create app with auto-configured monitoring
app = service.create_app()

# Monitoring is automatically enabled:
# - Tracing (if ENABLE_TRACING=true)
# - Metrics (if ENABLE_OTEL_METRICS=true)
# - System metrics (if ENABLE_SYSTEM_METRICS=true)
# - Logs (if OTEL_LOGS_EXPORTER=otlp)
```

## 🎨 Dashboard Examples

### Service Health Dashboard

Create a dashboard with:

1. **Request Rate** - `rate(http_requests_total[5m])`
2. **Error Rate** - `rate(http_requests_total{status_code=~"5.."}[5m])`
3. **Latency (P95)** - `histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))`
4. **Active Requests** - `active_requests`
5. **CPU Usage** - `process.cpu.utilization`
6. **Memory Usage** - `process.memory.utilization`

### Database Dashboard

1. **PostgreSQL Connections** - `postgresql.backends`
2. **Database Size** - `postgresql.database.size`
3. **Transaction Rate** - `rate(postgresql.commits[5m])`
4. **Redis Hit Rate** - `redis.keyspace.hits / (redis.keyspace.hits + redis.keyspace.misses)`
5. **RabbitMQ Queue Depth** - `rabbitmq.message.current`

## ⚠️ Alerts

### Recommended Alerts

**Application:**

- High error rate (>5% of requests failing)
- High latency (P95 > 1s)
- Service down (no metrics for 5 minutes)

**System:**

- High CPU (>80% for 5 minutes)
- High memory (>90%)
- Disk space low (<10%)

**Database:**

- PostgreSQL connections near max (>80% of max_connections)
- Slow queries (>5s)
- Redis memory high (>80%)
- RabbitMQ queue buildup (>10k messages)

## 🐛 Troubleshooting

### No Data in SigNoz

```bash
# 1. Check service logs
kubectl logs -n bakery-ia deployment/auth-service | grep -i otel

# 2. Check SigNoz collector
kubectl logs -n signoz deployment/signoz-otel-collector

# 3. Test connectivity
kubectl exec -n bakery-ia deployment/auth-service -- \
  curl -v http://signoz-otel-collector.signoz.svc.cluster.local:4318
```

### Database Metrics Missing

```bash
# Check database monitoring collector
kubectl logs -n bakery-ia deployment/database-otel-collector

# Verify monitoring user exists
kubectl exec -n bakery-ia deployment/auth-db -- \
  psql -U postgres -c "\du otel_monitor"
```

### Traces Not Correlated with Logs

Ensure `OTEL_LOGS_EXPORTER=otlp` is set in service environment variables.

## 🎯 Best Practices

1. **Always use structured logging** - Add context with key-value pairs
2. **Add custom spans** - For important business operations
3. **Set appropriate log levels** - INFO for production, DEBUG for dev
4. **Monitor your monitors** - Alert on collector failures
5. **Regular retention policy reviews** - Balance cost vs. data retention
6. **Create service dashboards** - One dashboard per service
7. **Set up critical alerts first** - Service down, high error rate
8. **Document custom metrics** - Explain business-specific metrics

## 📊 Performance Impact

**Resource Usage (per service):**

- CPU: +5-10% (instrumentation overhead)
- Memory: +50-100MB (SDK and buffers)
- Network: Minimal (batched export every 60s)

**Latency Impact:**

- Per request: <1ms (async instrumentation)
- Negligible impact on user-facing latency

**Storage (SigNoz):**

- Traces: ~1GB per million requests
- Metrics: ~100MB per service per day
- Logs: Varies by log volume

## 🔐 Security Considerations

1. **Use dedicated monitoring users** - Never use app credentials
2. **Limit collector permissions** - Read-only access to databases
3. **Secure OTLP endpoints** - Use TLS in production
4. **Sanitize sensitive data** - Don't log passwords, tokens
5. **Network policies** - Restrict collector network access
6. **RBAC** - Limit SigNoz UI access per team

## 🚀 Next Steps

1. **Deploy to production** - Update production SigNoz config
2. **Create team dashboards** - Per-service and system-wide views
3. **Set up alerts** - Start with critical service health alerts
4. **Train team** - SigNoz UI usage, query language
5. **Document runbooks** - How to respond to alerts
6. **Optimize retention** - Based on actual data volume
7. **Add custom metrics** - Business-specific KPIs

## 📞 Support

- **SigNoz Community**: https://signoz.io/slack
- **OpenTelemetry Docs**: https://opentelemetry.io/docs/
- **Internal Docs**: See /docs folder

## 📝 Change Log

| Date | Change |
|------|--------|
| 2026-01-08 | Initial implementation - All services configured |
| 2026-01-08 | Database monitoring added (PostgreSQL, Redis, RabbitMQ) |
| 2026-01-08 | System metrics collection implemented |
| 2026-01-08 | Removed Prometheus, pure OpenTelemetry |

---

**Congratulations! Your platform now has complete observability. 🎉**

Every request is traced, every metric is collected, every log is searchable.
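
As a worked check of the Redis hit rate expression and two of the recommended alert thresholds (error rate above 5%, RabbitMQ queue above 10k messages), here is a minimal sketch in plain Python. The names `Thresholds`, `redis_hit_rate`, and `firing_alerts` are hypothetical and are not part of `shared/monitoring/`. Real alert rules would be defined in SigNoz; this only illustrates the arithmetic behind them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Illustrative limits copied from the Recommended Alerts list."""
    max_error_rate: float = 0.05   # more than 5% of requests failing
    max_queue_depth: int = 10_000  # more than 10k messages queued


def ratio(part: float, whole: float) -> float:
    """Return part / whole, or 0.0 when whole is zero (no traffic yet)."""
    return part / whole if whole else 0.0


def redis_hit_rate(hits: int, misses: int) -> float:
    """hits / (hits + misses), as in the Database Dashboard expression."""
    return ratio(hits, hits + misses)


def firing_alerts(errors: int, requests: int, queue_depth: int,
                  limits: Thresholds = Thresholds()) -> list[str]:
    """Return the names of the alerts whose threshold is strictly exceeded."""
    alerts = []
    if ratio(errors, requests) > limits.max_error_rate:
        alerts.append("high_error_rate")
    if queue_depth > limits.max_queue_depth:
        alerts.append("rabbitmq_queue_buildup")
    return alerts
```

The zero-traffic guard in `ratio` matters in practice: a freshly started service or an idle cache has no requests yet, and the sketch reports a rate of 0.0 in that case rather than raising a division error.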