# Complete Monitoring Guide - Bakery IA Platform

This guide provides a complete overview of the observability implementation for the Bakery IA platform, built on SigNoz and OpenTelemetry.

## 🎯 Executive Summary

**What's Implemented:**
- ✅ **Distributed Tracing** - All 17 services
- ✅ **Application Metrics** - HTTP requests, latencies, errors
- ✅ **System Metrics** - CPU, memory, disk, network per service
- ✅ **Structured Logs** - With trace correlation
- ✅ **Database Monitoring** - PostgreSQL, Redis, RabbitMQ metrics
- ✅ **Pure OpenTelemetry** - No Prometheus, all OTLP push

**Technology Stack:**
- **Backend**: OpenTelemetry Python SDK
- **Collector**: OpenTelemetry Collector (OTLP receivers)
- **Storage**: ClickHouse (traces, metrics, logs)
- **Frontend**: SigNoz UI
- **Protocol**: OTLP over HTTP/gRPC

## 📊 Architecture

```
┌──────────────────────────────────────────────────────────┐
│                   Application Services                   │
│  ┌────────┐  ┌────────┐  ┌────────┐  ┌────────┐          │
│  │  auth  │  │  inv   │  │ orders │  │  ...   │          │
│  └───┬────┘  └───┬────┘  └───┬────┘  └───┬────┘          │
│      │           │           │           │               │
│      └───────────┴───────────┴───────────┘               │
│                  │                                       │
│       Traces + Metrics + Logs                            │
│        (OpenTelemetry OTLP)                              │
└──────────────────┼───────────────────────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────────────────────────┐
│              Database Monitoring Collector               │
│  ┌────────┐  ┌────────┐  ┌────────┐                      │
│  │   PG   │  │ Redis  │  │RabbitMQ│                      │
│  └───┬────┘  └───┬────┘  └───┬────┘                      │
│      │           │           │                           │
│      └───────────┴───────────┘                           │
│                  │                                       │
│          Database Metrics                                │
└──────────────────┼───────────────────────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────────────────────────┐
│              SigNoz OpenTelemetry Collector              │
│                                                          │
│  Receivers: OTLP (gRPC :4317, HTTP :4318)                │
│  Processors: batch, memory_limiter, resourcedetection    │
│  Exporters: ClickHouse                                   │
└──────────────────┼───────────────────────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────────────────────────┐
│                   ClickHouse Database                    │
│                                                          │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐              │
│  │  Traces  │  │ Metrics  │  │   Logs   │              │
│  └──────────┘  └──────────┘  └──────────┘              │
└──────────────────┼───────────────────────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────────────────────────┐
│                    SigNoz Frontend UI                    │
│            https://monitoring.bakery-ia.local            │
└──────────────────────────────────────────────────────────┘
```

## 🚀 Quick Start

### 1. Deploy SigNoz

```bash
# Add Helm repository
helm repo add signoz https://charts.signoz.io
helm repo update

# Create namespace and install
kubectl create namespace signoz
helm install signoz signoz/signoz \
  -n signoz \
  -f infrastructure/helm/signoz-values-dev.yaml

# Wait for pods
kubectl wait --for=condition=ready pod -l app=signoz -n signoz --timeout=300s
```

### 2. Deploy Services with Monitoring

All services are already configured with OpenTelemetry environment variables.

```bash
# Apply all services
kubectl apply -k infrastructure/kubernetes/overlays/dev/

# Or restart existing services
kubectl rollout restart deployment -n bakery-ia
```

### 3. Deploy Database Monitoring

```bash
# Run the setup script
./infrastructure/kubernetes/setup-database-monitoring.sh

# This will:
# - Create monitoring users in PostgreSQL
# - Deploy OpenTelemetry collector for database metrics
# - Start collecting PostgreSQL, Redis, RabbitMQ metrics
```

### 4. Access SigNoz UI

```bash
# Via ingress
open https://monitoring.bakery-ia.local

# Or port-forward (this command blocks; open the URL from a second terminal)
kubectl port-forward -n signoz svc/signoz-frontend 3301:3301
open http://localhost:3301
```

## 📈 Metrics Collected

### Application Metrics (Per Service)

| Metric | Description | Type |
|--------|-------------|------|
| `http_requests_total` | Total HTTP requests | Counter |
| `http_request_duration_seconds` | Request latency | Histogram |
| `active_requests` | Current active requests | Gauge |

### System Metrics (Per Service)

| Metric | Description | Type |
|--------|-------------|------|
| `process.cpu.utilization` | Process CPU % | Gauge |
| `process.memory.usage` | Process memory bytes | Gauge |
| `process.memory.utilization` | Process memory % | Gauge |
| `process.threads.count` | Thread count | Gauge |
| `process.open_file_descriptors` | Open FDs (Unix) | Gauge |
| `system.cpu.utilization` | System CPU % | Gauge |
| `system.memory.usage` | System memory | Gauge |
| `system.memory.utilization` | System memory % | Gauge |
| `system.disk.io.read` | Disk read bytes | Counter |
| `system.disk.io.write` | Disk write bytes | Counter |
| `system.network.io.sent` | Network sent bytes | Counter |
| `system.network.io.received` | Network recv bytes | Counter |

### PostgreSQL Metrics

| Metric | Description |
|--------|-------------|
| `postgresql.backends` | Active connections |
| `postgresql.database.size` | Database size in bytes |
| `postgresql.commits` | Transaction commits |
| `postgresql.rollbacks` | Transaction rollbacks |
| `postgresql.deadlocks` | Deadlock count |
| `postgresql.blocks_read` | Blocks read from disk |
| `postgresql.table.size` | Table size |
| `postgresql.index.size` | Index size |

### Redis Metrics

| Metric | Description |
|--------|-------------|
| `redis.clients.connected` | Connected clients |
| `redis.commands.processed` | Commands processed |
| `redis.keyspace.hits` | Cache hits |
| `redis.keyspace.misses` | Cache misses |
| `redis.memory.used` | Memory usage |
| `redis.memory.fragmentation_ratio` | Fragmentation |
| `redis.db.keys` | Number of keys |

### RabbitMQ Metrics

| Metric | Description |
|--------|-------------|
| `rabbitmq.consumer.count` | Active consumers |
| `rabbitmq.message.current` | Messages in queue |
| `rabbitmq.message.acknowledged` | Messages ACKed |
| `rabbitmq.message.delivered` | Messages delivered |
| `rabbitmq.message.published` | Messages published |

## 🔍 Traces

**Automatic instrumentation for:**
- FastAPI endpoints
- HTTP client requests (HTTPX)
- Redis commands
- PostgreSQL queries (SQLAlchemy)
- RabbitMQ publish/consume

**View traces:**
1. Go to **Services** tab in SigNoz
2. Select a service
3. View individual traces
4. Click trace → See full span tree with timing

## 📝 Logs

**Features:**
- Structured logging with context
- Automatic trace-log correlation
- Searchable by service, level, message, custom fields

**View logs:**
1. Go to **Logs** tab in SigNoz
2. Filter by service: `service_name="auth-service"`
3. Search for specific messages
4. Click log → See full context including trace_id
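
To make the correlation concrete, here is a minimal sketch of a structured log record that carries a `trace_id`, using only the Python standard library. It is an illustration, not the platform's `shared/monitoring` implementation: the `JsonFormatter` class, the `build_logger` helper, and the JSON field names are assumptions, and in the real services the trace context is attached automatically rather than passed by hand.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object (hypothetical schema)."""

    def __init__(self, service_name: str) -> None:
        super().__init__()
        self.service_name = service_name

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps(
            {
                "service_name": self.service_name,
                "level": record.levelname,
                "message": record.getMessage(),
                # Supplied by the caller via `extra=`; None when no trace is active.
                "trace_id": getattr(record, "trace_id", None),
            }
        )


def build_logger(service_name: str, stream: io.TextIOBase) -> logging.Logger:
    """Return a logger that writes one JSON line per record to `stream`."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service_name))
    logger = logging.getLogger(f"demo.{service_name}")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


# Usage: the trace_id below is a made-up value for illustration.
buffer = io.StringIO()
log = build_logger("auth-service", buffer)
log.info("login succeeded", extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"})
print(buffer.getvalue().strip())
```

Because every record carries the same `trace_id` as the spans of the request that produced it, filtering logs by that value returns everything emitted while the trace was active.
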

## 🎛️ Configuration Files

### Services

All services are configured in:
```
infrastructure/kubernetes/base/components/*/*-service.yaml
```

Each service has these environment variables:
```yaml
env:
  - name: OTEL_COLLECTOR_ENDPOINT
    value: "http://signoz-otel-collector.signoz.svc.cluster.local:4318"
  - name: OTEL_SERVICE_NAME
    value: "service-name"
  - name: ENABLE_TRACING
    value: "true"
  - name: OTEL_LOGS_EXPORTER
    value: "otlp"
  - name: ENABLE_OTEL_METRICS
    value: "true"
  - name: ENABLE_SYSTEM_METRICS
    value: "true"
```
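
As a sketch of how a service might read these variables at startup, the following uses only the standard library. The `MonitoringSettings` dataclass and its defaults (every feature disabled when its variable is unset) are assumptions for illustration; the actual parsing lives in the shared libraries and may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping


def _is_true(env: Mapping[str, str], name: str) -> bool:
    """Treat only the string "true" (any case) as enabled."""
    return env.get(name, "false").strip().lower() == "true"


@dataclass(frozen=True)
class MonitoringSettings:
    """Hypothetical container for the environment variables listed above."""

    collector_endpoint: str
    service_name: str
    tracing: bool
    metrics: bool
    system_metrics: bool
    logs: bool

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "MonitoringSettings":
        return cls(
            collector_endpoint=env.get("OTEL_COLLECTOR_ENDPOINT", ""),
            service_name=env.get("OTEL_SERVICE_NAME", "unknown-service"),
            tracing=_is_true(env, "ENABLE_TRACING"),
            metrics=_is_true(env, "ENABLE_OTEL_METRICS"),
            system_metrics=_is_true(env, "ENABLE_SYSTEM_METRICS"),
            logs=env.get("OTEL_LOGS_EXPORTER", "").strip().lower() == "otlp",
        )


# Usage with an explicit mapping instead of the real process environment.
settings = MonitoringSettings.from_env(
    {
        "OTEL_COLLECTOR_ENDPOINT": "http://signoz-otel-collector.signoz.svc.cluster.local:4318",
        "OTEL_SERVICE_NAME": "auth-service",
        "ENABLE_TRACING": "true",
        "OTEL_LOGS_EXPORTER": "otlp",
    }
)
print(settings)
```

Passing the environment as a mapping keeps the parsing testable without touching the real process environment.
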

### SigNoz

Configuration file:
```
infrastructure/helm/signoz-values-dev.yaml
```

Key settings:
- OTLP receivers on ports 4317 (gRPC) and 4318 (HTTP)
- No Prometheus scraping (pure OTLP push)
- ClickHouse backend for storage
- Reduced resources for development

### Database Monitoring

Deployment file:
```
infrastructure/kubernetes/base/monitoring/database-otel-collector.yaml
```

Setup script:
```
infrastructure/kubernetes/setup-database-monitoring.sh
```

## 📚 Documentation

| Document | Description |
|----------|-------------|
| [MONITORING_QUICKSTART.md](./MONITORING_QUICKSTART.md) | 10-minute quick start guide |
| [MONITORING_SETUP.md](./MONITORING_SETUP.md) | Detailed setup and troubleshooting |
| [DATABASE_MONITORING.md](./DATABASE_MONITORING.md) | Database metrics and logs guide |
| This document | Complete overview |

## 🔧 Shared Libraries

### Monitoring Modules

Located in `shared/monitoring/`:

| File | Purpose |
|------|---------|
| `__init__.py` | Package exports |
| `logging.py` | Standard logging setup |
| `logs_exporter.py` | OpenTelemetry logs export |
| `metrics.py` | OpenTelemetry metrics (no Prometheus) |
| `metrics_exporter.py` | OTLP metrics export setup |
| `system_metrics.py` | System metrics collection (CPU, memory, etc.) |
| `tracing.py` | Distributed tracing setup |
| `health_checks.py` | Health check endpoints |

### Usage in Services

```python
from shared.service_base import StandardFastAPIService


class AuthService(StandardFastAPIService):
    """Placeholder subclass; constructor arguments and overrides are omitted here."""


# Create service
service = AuthService()

# Create app with auto-configured monitoring
app = service.create_app()

# Monitoring is automatically enabled:
# - Tracing (if ENABLE_TRACING=true)
# - Metrics (if ENABLE_OTEL_METRICS=true)
# - System metrics (if ENABLE_SYSTEM_METRICS=true)
# - Logs (if OTEL_LOGS_EXPORTER=otlp)
```

## 🎨 Dashboard Examples

### Service Health Dashboard

Create a dashboard with:
1. **Request Rate** - `rate(http_requests_total[5m])`
2. **Error Rate** - `sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))`
3. **Latency (P95)** - `histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))`
4. **Active Requests** - `active_requests`
5. **CPU Usage** - `process.cpu.utilization`
6. **Memory Usage** - `process.memory.utilization`
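
The P95 panel relies on quantile estimation over histogram buckets. As a rough illustration of that arithmetic, the sketch below linearly interpolates inside the bucket that contains the requested rank, which is the approach Prometheus-style `histogram_quantile` takes. The bucket boundaries and counts are invented for the example, and `estimate_quantile` is a hypothetical helper, not a SigNoz API.

```python
import math
from typing import Sequence, Tuple


def estimate_quantile(q: float, buckets: Sequence[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with a +Inf bucket.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the overflow bucket: report the last finite bound.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return upper_bound
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Invented latency buckets in seconds: 50 requests <= 0.1s, 90 <= 0.5s, 98 <= 1.0s.
latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
p95 = estimate_quantile(0.95, latency_buckets)
print(f"estimated P95: {p95:.4f}s")
```

With these numbers the 95th of 100 requests falls in the 0.5 to 1.0 s bucket, five-eighths of the way through its 8 observations, so the estimate is 0.8125 s. The result is only as precise as the bucket boundaries allow.
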

### Database Dashboard

1. **PostgreSQL Connections** - `postgresql.backends`
2. **Database Size** - `postgresql.database.size`
3. **Transaction Rate** - `rate(postgresql.commits[5m])`
4. **Redis Hit Rate** - `redis.keyspace.hits / (redis.keyspace.hits + redis.keyspace.misses)`
5. **RabbitMQ Queue Depth** - `rabbitmq.message.current`
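
Because `redis.keyspace.hits` and `redis.keyspace.misses` are cumulative counters, the hit-rate expression above yields the ratio since the Redis server started. A ratio for a recent window needs the difference between two samples. The sketch below shows that arithmetic; the `CounterSample` type and the sample values are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CounterSample:
    """One reading of the two cumulative Redis keyspace counters."""

    hits: int
    misses: int


def hit_rate(previous: CounterSample, current: CounterSample) -> Optional[float]:
    """Return the cache hit ratio between two samples, or None if undefined."""
    hits = current.hits - previous.hits
    misses = current.misses - previous.misses
    if hits < 0 or misses < 0:
        # Counters went backwards, for example after a Redis restart.
        return None
    lookups = hits + misses
    return hits / lookups if lookups else None


# Made-up samples taken five minutes apart.
earlier = CounterSample(hits=10_000, misses=2_000)
later = CounterSample(hits=10_900, misses=2_100)
print(hit_rate(earlier, later))
```

Returning `None` for an idle or reset window avoids plotting a misleading 0% hit rate when no lookups happened.
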

## ⚠️ Alerts

### Recommended Alerts

**Application:**
- High error rate (>5% of requests failing)
- High latency (P95 > 1s)
- Service down (no metrics for 5 minutes)

**System:**
- High CPU (>80% for 5 minutes)
- High memory (>90%)
- Disk space low (<10%)

**Database:**
- PostgreSQL connections near max (>80% of max_connections)
- Slow queries (>5s)
- Redis memory high (>80%)
- RabbitMQ queue buildup (>10k messages)
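
Several of the thresholds above can be expressed as simple predicates. The following sketch evaluates a metrics snapshot against them; it is illustrative only, since real alert rules would be defined in SigNoz, and the `Snapshot` fields and the `firing_alerts` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical point-in-time values for one service and its backends."""

    error_ratio: float          # failed requests / total requests
    p95_latency_s: float        # seconds
    cpu_utilization: float      # 0.0 to 1.0
    memory_utilization: float   # 0.0 to 1.0
    pg_backends: int
    pg_max_connections: int
    rabbitmq_queue_depth: int


def firing_alerts(s: Snapshot) -> List[str]:
    """Return the names of the recommended alerts that the snapshot violates."""
    checks = [
        ("high_error_rate", s.error_ratio > 0.05),
        ("high_latency", s.p95_latency_s > 1.0),
        ("high_cpu", s.cpu_utilization > 0.80),
        ("high_memory", s.memory_utilization > 0.90),
        ("pg_connections_near_max", s.pg_backends > 0.80 * s.pg_max_connections),
        ("rabbitmq_queue_buildup", s.rabbitmq_queue_depth > 10_000),
    ]
    return [name for name, violated in checks if violated]


# Invented values: only latency and PostgreSQL connections exceed their limits.
snapshot = Snapshot(
    error_ratio=0.01,
    p95_latency_s=1.4,
    cpu_utilization=0.55,
    memory_utilization=0.70,
    pg_backends=85,
    pg_max_connections=100,
    rabbitmq_queue_depth=120,
)
print(firing_alerts(snapshot))
```

Duration conditions such as "for 5 minutes" are not modeled here; an alerting system evaluates the same predicate over a time window before firing.
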

## 🐛 Troubleshooting

### No Data in SigNoz

```bash
# 1. Check service logs
kubectl logs -n bakery-ia deployment/auth-service | grep -i otel

# 2. Check SigNoz collector
kubectl logs -n signoz deployment/signoz-otel-collector

# 3. Test connectivity
kubectl exec -n bakery-ia deployment/auth-service -- \
  curl -v http://signoz-otel-collector.signoz.svc.cluster.local:4318
```
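
If `curl` is not available in the service image, the connectivity test can be approximated from Python with a plain TCP connection. This is a sketch using only the standard library; `parse_endpoint` and `can_connect` are hypothetical helper names, and a successful TCP handshake only shows that the port is reachable, not that the collector accepts OTLP data.

```python
import socket
from typing import Tuple
from urllib.parse import urlparse


def parse_endpoint(endpoint: str) -> Tuple[str, int]:
    """Split an endpoint URL such as http://host:4318 into (host, port)."""
    parsed = urlparse(endpoint)
    if parsed.hostname is None or parsed.port is None:
        raise ValueError(f"endpoint must include host and port: {endpoint!r}")
    return parsed.hostname, parsed.port


def can_connect(endpoint: str, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the endpoint succeeds."""
    host, port = parse_endpoint(endpoint)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

Calling `can_connect("http://signoz-otel-collector.signoz.svc.cluster.local:4318")` from inside a service pod should return `True` when the collector's HTTP receiver is reachable.
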

### Database Metrics Missing

```bash
# Check database monitoring collector
kubectl logs -n bakery-ia deployment/database-otel-collector

# Verify monitoring user exists
kubectl exec -n bakery-ia deployment/auth-db -- \
  psql -U postgres -c "\du otel_monitor"
```

### Traces Not Correlated with Logs

Ensure `OTEL_LOGS_EXPORTER=otlp` is set in the service's environment variables.

## 🎯 Best Practices

1. **Always use structured logging** - Add context with key-value pairs
2. **Add custom spans** - For important business operations
3. **Set appropriate log levels** - INFO for production, DEBUG for dev
4. **Monitor your monitors** - Alert on collector failures
5. **Regular retention policy reviews** - Balance cost vs. data retention
6. **Create service dashboards** - One dashboard per service
7. **Set up critical alerts first** - Service down, high error rate
8. **Document custom metrics** - Explain business-specific metrics

## 📊 Performance Impact

**Resource Usage (per service, approximate):**
- CPU: +5-10% (instrumentation overhead)
- Memory: +50-100MB (SDK and buffers)
- Network: Minimal (batched export every 60s)

**Latency Impact:**
- Per request: <1ms (async instrumentation)
- Export runs in the background, so user-facing latency is not noticeably affected

**Storage (SigNoz):**
- Traces: ~1GB per million requests
- Metrics: ~100MB per service per day
- Logs: Varies by log volume
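
These rules of thumb can be combined into a back-of-the-envelope sizing estimate. The sketch below applies the storage figures above (about 1 GB of traces per million requests and about 100 MB of metrics per service per day) and treats 1 GB as 1000 MB. The traffic numbers are invented, and log volume is left as an input because it varies.

```python
def estimate_daily_storage_gb(
    requests_per_day: int,
    service_count: int,
    logs_gb_per_day: float = 0.0,
) -> float:
    """Estimate daily SigNoz storage in GB from the rule-of-thumb figures."""
    traces_gb = requests_per_day / 1_000_000 * 1.0   # ~1 GB per million requests
    metrics_gb = service_count * 100 / 1000          # ~100 MB per service per day
    return traces_gb + metrics_gb + logs_gb_per_day


# Invented workload: 17 services handling 2 million requests per day in total.
daily_gb = estimate_daily_storage_gb(requests_per_day=2_000_000, service_count=17)
print(f"~{daily_gb:.1f} GB/day, ~{daily_gb * 30:.0f} GB for 30 days of retention")
```

For this workload the estimate is roughly 3.7 GB per day before logs, which is the kind of number to compare against actual ClickHouse usage when reviewing retention.
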

## 🔐 Security Considerations

1. **Use dedicated monitoring users** - Never use app credentials
2. **Limit collector permissions** - Read-only access to databases
3. **Secure OTLP endpoints** - Use TLS in production
4. **Sanitize sensitive data** - Don't log passwords, tokens
5. **Network policies** - Restrict collector network access
6. **RBAC** - Limit SigNoz UI access per team
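
For item 4, one common approach is to redact known-sensitive keys before structured fields reach the logger. The sketch below is a minimal illustration; the key list and the `redact` helper are assumptions, not part of the platform's shared libraries.

```python
from typing import Any, Dict, Mapping

# Hypothetical deny-list; extend it to match the fields your services log.
SENSITIVE_KEYS = frozenset({"password", "token", "authorization", "api_key", "secret"})


def redact(fields: Mapping[str, Any]) -> Dict[str, Any]:
    """Return a copy of `fields` with sensitive values masked, recursing into dicts."""
    cleaned: Dict[str, Any] = {}
    for key, value in fields.items():
        if key.lower() in SENSITIVE_KEYS:
            cleaned[key] = "[REDACTED]"
        elif isinstance(value, Mapping):
            cleaned[key] = redact(value)
        else:
            cleaned[key] = value
    return cleaned


# Usage with made-up values.
event = {"user": "maria", "password": "hunter2", "headers": {"Authorization": "Bearer abc"}}
print(redact(event))
```

A key-based deny-list only catches fields with predictable names; secrets embedded in free-text messages still need care at the call site.
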

## 🚀 Next Steps

1. **Deploy to production** - Update production SigNoz config
2. **Create team dashboards** - Per-service and system-wide views
3. **Set up alerts** - Start with critical service health alerts
4. **Train team** - SigNoz UI usage, query language
5. **Document runbooks** - How to respond to alerts
6. **Optimize retention** - Based on actual data volume
7. **Add custom metrics** - Business-specific KPIs

## 📞 Support

- **SigNoz Community**: https://signoz.io/slack
- **OpenTelemetry Docs**: https://opentelemetry.io/docs/
- **Internal Docs**: See /docs folder

## 📝 Change Log

| Date | Change |
|------|--------|
| 2026-01-08 | Initial implementation - All services configured |
| 2026-01-08 | Database monitoring added (PostgreSQL, Redis, RabbitMQ) |
| 2026-01-08 | System metrics collection implemented |
| 2026-01-08 | Removed Prometheus, pure OpenTelemetry |

---

**Congratulations! Your platform now has complete observability. 🎉**

Every request is traced, every metric is collected, every log is searchable.