# Complete Monitoring Guide - Bakery IA Platform

This guide gives a complete overview of the observability implementation for the Bakery IA platform, built on SigNoz and OpenTelemetry.

## 🎯 Executive Summary

**What's Implemented:**
- ✅ **Distributed Tracing** - All 17 services
- ✅ **Application Metrics** - HTTP requests, latencies, errors
- ✅ **System Metrics** - CPU, memory, disk, network per service
- ✅ **Structured Logs** - With trace correlation
- ✅ **Database Monitoring** - PostgreSQL, Redis, RabbitMQ metrics
- ✅ **Pure OpenTelemetry** - No Prometheus, all OTLP push

**Technology Stack:**
- **Backend**: OpenTelemetry Python SDK
- **Collector**: OpenTelemetry Collector (OTLP receivers)
- **Storage**: ClickHouse (traces, metrics, logs)
- **Frontend**: SigNoz UI
- **Protocol**: OTLP over HTTP/gRPC

## 📊 Architecture

```
┌──────────────────────────────────────────────────────────┐
│                  Application Services                     │
│  ┌────────┐  ┌────────┐  ┌────────┐  ┌────────┐        │
│  │  auth  │  │  inv   │  │ orders │  │  ...   │        │
│  └───┬────┘  └───┬────┘  └───┬────┘  └───┬────┘        │
│      │           │            │           │              │
│      └───────────┴────────────┴───────────┘              │
│                  │                                        │
│         Traces + Metrics + Logs                          │
│         (OpenTelemetry OTLP)                             │
└──────────────────┼──────────────────────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────────────────────────┐
│            Database Monitoring Collector                  │
│  ┌────────┐  ┌────────┐  ┌────────┐                     │
│  │   PG   │  │ Redis  │  │RabbitMQ│                     │
│  └───┬────┘  └───┬────┘  └───┬────┘                     │
│      │           │            │                           │
│      └───────────┴────────────┘                           │
│                  │                                        │
│         Database Metrics                                  │
└──────────────────┼──────────────────────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────────────────────────┐
│           SigNoz OpenTelemetry Collector                  │
│                                                           │
│  Receivers: OTLP (gRPC :4317, HTTP :4318)               │
│  Processors: batch, memory_limiter, resourcedetection   │
│  Exporters: ClickHouse                                   │
└──────────────────┼──────────────────────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────────────────────────┐
│               ClickHouse Database                         │
│                                                           │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐              │
│  │  Traces  │  │  Metrics │  │   Logs   │              │
│  └──────────┘  └──────────┘  └──────────┘              │
└──────────────────┼──────────────────────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────────────────────────┐
│               SigNoz Frontend UI                          │
│         https://monitoring.bakery-ia.local                │
└──────────────────────────────────────────────────────────┘
```

## 🚀 Quick Start

### 1. Deploy SigNoz

```bash
# Add Helm repository
helm repo add signoz https://charts.signoz.io
helm repo update

# Create namespace and install
kubectl create namespace signoz
helm install signoz signoz/signoz \
  -n signoz \
  -f infrastructure/helm/signoz-values-dev.yaml

# Wait for pods
kubectl wait --for=condition=ready pod -l app=signoz -n signoz --timeout=300s
```

### 2. Deploy Services with Monitoring

All services are already configured with OpenTelemetry environment variables.

```bash
# Apply all services
kubectl apply -k infrastructure/kubernetes/overlays/dev/

# Or restart existing services
kubectl rollout restart deployment -n bakery-ia
```

### 3. Deploy Database Monitoring

```bash
# Run the setup script
./infrastructure/kubernetes/setup-database-monitoring.sh

# This will:
# - Create monitoring users in PostgreSQL
# - Deploy OpenTelemetry collector for database metrics
# - Start collecting PostgreSQL, Redis, RabbitMQ metrics
```

### 4. Access SigNoz UI

```bash
# Via ingress
open https://monitoring.bakery-ia.local

# Or port-forward
kubectl port-forward -n signoz svc/signoz-frontend 3301:3301
open http://localhost:3301
```

## 📈 Metrics Collected

### Application Metrics (Per Service)

| Metric | Description | Type |
|--------|-------------|------|
| `http_requests_total` | Total HTTP requests | Counter |
| `http_request_duration_seconds` | Request latency | Histogram |
| `active_requests` | Current active requests | Gauge |

### System Metrics (Per Service)

| Metric | Description | Type |
|--------|-------------|------|
| `process.cpu.utilization` | Process CPU % | Gauge |
| `process.memory.usage` | Process memory bytes | Gauge |
| `process.memory.utilization` | Process memory % | Gauge |
| `process.threads.count` | Thread count | Gauge |
| `process.open_file_descriptors` | Open FDs (Unix) | Gauge |
| `system.cpu.utilization` | System CPU % | Gauge |
| `system.memory.usage` | System memory | Gauge |
| `system.memory.utilization` | System memory % | Gauge |
| `system.disk.io.read` | Disk read bytes | Counter |
| `system.disk.io.write` | Disk write bytes | Counter |
| `system.network.io.sent` | Network sent bytes | Counter |
| `system.network.io.received` | Network recv bytes | Counter |

### PostgreSQL Metrics

| Metric | Description |
|--------|-------------|
| `postgresql.backends` | Active connections |
| `postgresql.database.size` | Database size in bytes |
| `postgresql.commits` | Transaction commits |
| `postgresql.rollbacks` | Transaction rollbacks |
| `postgresql.deadlocks` | Deadlock count |
| `postgresql.blocks_read` | Blocks read from disk |
| `postgresql.table.size` | Table size |
| `postgresql.index.size` | Index size |

### Redis Metrics

| Metric | Description |
|--------|-------------|
| `redis.clients.connected` | Connected clients |
| `redis.commands.processed` | Commands processed |
| `redis.keyspace.hits` | Cache hits |
| `redis.keyspace.misses` | Cache misses |
| `redis.memory.used` | Memory usage |
| `redis.memory.fragmentation_ratio` | Fragmentation |
| `redis.db.keys` | Number of keys |

### RabbitMQ Metrics

| Metric | Description |
|--------|-------------|
| `rabbitmq.consumer.count` | Active consumers |
| `rabbitmq.message.current` | Messages in queue |
| `rabbitmq.message.acknowledged` | Messages ACKed |
| `rabbitmq.message.delivered` | Messages delivered |
| `rabbitmq.message.published` | Messages published |

## 🔍 Traces

**Automatic instrumentation for:**
- FastAPI endpoints
- HTTP client requests (HTTPX)
- Redis commands
- PostgreSQL queries (SQLAlchemy)
- RabbitMQ publish/consume

**View traces:**
1. Go to **Services** tab in SigNoz
2. Select a service
3. View individual traces
4. Click trace → See full span tree with timing

## 📝 Logs

**Features:**
- Structured logging with context
- Automatic trace-log correlation
- Searchable by service, level, message, custom fields

**View logs:**
1. Go to **Logs** tab in SigNoz
2. Filter by service: `service_name="auth-service"`
3. Search for specific messages
4. Click log → See full context including trace_id
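
To illustrate what trace-log correlation means in practice, here is a minimal standard-library sketch of a JSON log line that carries a `trace_id`. It is not the code in `shared/monitoring/`: the `current_trace_id` context variable, `JsonFormatter`, and `build_logger` are hypothetical names, and in the real services the trace ID would come from the active OpenTelemetry span rather than being set by hand.

```python
import contextvars
import io
import json
import logging

# Hypothetical holder for the active trace ID (assumption for this sketch).
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with service and trace context."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "service_name": getattr(record, "service_name", None),
            "message": record.getMessage(),
            "trace_id": current_trace_id.get(),
        }
        return json.dumps(payload)


def build_logger(stream, service_name):
    """Return a logger that writes structured JSON lines to `stream`."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(f"sketch.{service_name}")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    # The adapter attaches service_name to every record it emits.
    return logging.LoggerAdapter(logger, {"service_name": service_name})


if __name__ == "__main__":
    buffer = io.StringIO()
    log = build_logger(buffer, "auth-service")
    current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
    log.info("login succeeded")
    print(buffer.getvalue(), end="")
```

Because every line carries both `service_name` and `trace_id`, a log entry can be matched to the trace that produced it.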

## 🎛️ Configuration Files

### Services

All services configured in:
```
infrastructure/kubernetes/base/components/*/*-service.yaml
```

Each service has these environment variables:
```yaml
env:
  - name: OTEL_COLLECTOR_ENDPOINT
    value: "http://signoz-otel-collector.signoz.svc.cluster.local:4318"
  - name: OTEL_SERVICE_NAME
    value: "service-name"
  - name: ENABLE_TRACING
    value: "true"
  - name: OTEL_LOGS_EXPORTER
    value: "otlp"
  - name: ENABLE_OTEL_METRICS
    value: "true"
  - name: ENABLE_SYSTEM_METRICS
    value: "true"
```
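
The following sketch shows one way a service could read these variables at startup. It is an illustration only, not the platform's actual loader: `MonitoringConfig`, `load_monitoring_config`, and the fallback values (empty endpoint, `"unknown-service"`, flags off when unset) are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping


def _flag(env: Mapping[str, str], name: str) -> bool:
    """Treat only the string "true" (case-insensitive) as enabled."""
    return env.get(name, "false").strip().lower() == "true"


@dataclass(frozen=True)
class MonitoringConfig:
    collector_endpoint: str
    service_name: str
    tracing: bool
    metrics: bool
    system_metrics: bool
    logs: bool


def load_monitoring_config(env: Mapping[str, str] = os.environ) -> MonitoringConfig:
    """Build a config object from the environment variables listed above."""
    return MonitoringConfig(
        collector_endpoint=env.get("OTEL_COLLECTOR_ENDPOINT", ""),
        service_name=env.get("OTEL_SERVICE_NAME", "unknown-service"),
        tracing=_flag(env, "ENABLE_TRACING"),
        metrics=_flag(env, "ENABLE_OTEL_METRICS"),
        system_metrics=_flag(env, "ENABLE_SYSTEM_METRICS"),
        # Log export is on only when the exporter is explicitly "otlp".
        logs=env.get("OTEL_LOGS_EXPORTER", "").strip().lower() == "otlp",
    )
```

Passing the environment as a mapping keeps the function easy to test with a plain dictionary.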

### SigNoz

Configuration file:
```
infrastructure/helm/signoz-values-dev.yaml
```

Key settings:
- OTLP receivers on ports 4317 (gRPC) and 4318 (HTTP)
- No Prometheus scraping (pure OTLP push)
- ClickHouse backend for storage
- Reduced resources for development

### Database Monitoring

Deployment file:
```
infrastructure/kubernetes/base/monitoring/database-otel-collector.yaml
```

Setup script:
```
infrastructure/kubernetes/setup-database-monitoring.sh
```

## 📚 Documentation

| Document | Description |
|----------|-------------|
| [MONITORING_QUICKSTART.md](./MONITORING_QUICKSTART.md) | 10-minute quick start guide |
| [MONITORING_SETUP.md](./MONITORING_SETUP.md) | Detailed setup and troubleshooting |
| [DATABASE_MONITORING.md](./DATABASE_MONITORING.md) | Database metrics and logs guide |
| This document | Complete overview |

## 🔧 Shared Libraries

### Monitoring Modules

Located in `shared/monitoring/`:

| File | Purpose |
|------|---------|
| `__init__.py` | Package exports |
| `logging.py` | Standard logging setup |
| `logs_exporter.py` | OpenTelemetry logs export |
| `metrics.py` | OpenTelemetry metrics (no Prometheus) |
| `metrics_exporter.py` | OTLP metrics export setup |
| `system_metrics.py` | System metrics collection (CPU, memory, etc.) |
| `tracing.py` | Distributed tracing setup |
| `health_checks.py` | Health check endpoints |

### Usage in Services

```python
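from shared.service_base import StandardFastAPIService

# Create service. AuthService is assumed here to be the service's own
# StandardFastAPIService subclass, defined in the service code (not shown).
service = AuthService()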

# Create app with auto-configured monitoring
app = service.create_app()

# Monitoring is automatically enabled:
# - Tracing (if ENABLE_TRACING=true)
# - Metrics (if ENABLE_OTEL_METRICS=true)
# - System metrics (if ENABLE_SYSTEM_METRICS=true)
# - Logs (if OTEL_LOGS_EXPORTER=otlp)
```

## 🎨 Dashboard Examples

### Service Health Dashboard

Create a dashboard with:
1. **Request Rate** - `rate(http_requests_total[5m])`
2. **Error Rate** - `sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))`
3. **Latency (P95)** - `histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))`
4. **Active Requests** - `active_requests`
5. **CPU Usage** - `process.cpu.utilization`
6. **Memory Usage** - `process.memory.utilization`
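
For readers unfamiliar with how a P95 is derived from a histogram, the sketch below estimates a quantile from cumulative bucket counts using linear interpolation inside the matching bucket. It is a simplified illustration in the spirit of `histogram_quantile`, not a reimplementation of the query engine; the function name and the bucket layout are assumptions.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate quantile q from cumulative histogram buckets.

    buckets: list of (upper_bound_seconds, cumulative_count) pairs sorted by
    upper bound; the last bound may be math.inf.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    if not buckets:
        raise ValueError("at least one bucket is required")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations, so no meaningful quantile
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Cannot interpolate in the open-ended bucket.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return bound
            # Assume observations are spread evenly within the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


if __name__ == "__main__":
    sample = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
    print(estimate_quantile(0.95, sample))
```

The estimate is only as precise as the bucket boundaries, so buckets should be chosen around the latencies that matter for alerting.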

### Database Dashboard

1. **PostgreSQL Connections** - `postgresql.backends`
2. **Database Size** - `postgresql.database.size`
3. **Transaction Rate** - `rate(postgresql.commits[5m])`
4. **Redis Hit Rate** - `redis.keyspace.hits / (redis.keyspace.hits + redis.keyspace.misses)`
5. **RabbitMQ Queue Depth** - `rabbitmq.message.current`
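
The Redis hit-rate expression divides by zero when there has been no traffic, and the underlying counters are cumulative. The sketch below, with hypothetical function names, shows the arithmetic for a single sample and for a window between two counter samples.

```python
def hit_rate(hits, misses):
    """Return hits / (hits + misses), or None when there were no lookups."""
    total = hits + misses
    return hits / total if total else None


def windowed_hit_rate(previous, current):
    """Hit rate over the interval between two cumulative (hits, misses) samples."""
    delta_hits = current[0] - previous[0]
    delta_misses = current[1] - previous[1]
    if delta_hits < 0 or delta_misses < 0:
        # A counter went backwards, e.g. after a Redis restart.
        return None
    return hit_rate(delta_hits, delta_misses)
```

A windowed rate reflects recent cache behavior, whereas a ratio of lifetime counters changes very slowly on a long-running instance.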

## ⚠️ Alerts

### Recommended Alerts

**Application:**
- High error rate (>5% of requests failing)
- High latency (P95 > 1s)
- Service down (no metrics for 5 minutes)

**System:**
- High CPU (>80% for 5 minutes)
- High memory (>90%)
- Disk space low (<10%)

**Database:**
- PostgreSQL connections near max (>80% of max_connections)
- Slow queries (>5s)
- Redis memory high (>80%)
- RabbitMQ queue buildup (>10k messages)
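
As a rough illustration, the thresholds above can be written down as data and evaluated against a snapshot of current values. This is a sketch, not how SigNoz evaluates alert rules: the metric keys are hypothetical, ratios are expressed on a 0 to 1 scale, and duration conditions such as "for 5 minutes" are ignored. The "service down" and "slow queries" alerts are left out because they depend on missing data and per-query timings rather than a single value.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class AlertRule:
    name: str
    metric: str  # hypothetical snapshot key
    breached: Callable[[float], bool]


RULES = [
    AlertRule("High error rate", "error_ratio", lambda v: v > 0.05),
    AlertRule("High latency (P95)", "p95_latency_seconds", lambda v: v > 1.0),
    AlertRule("High CPU", "cpu_utilization", lambda v: v > 0.80),
    AlertRule("High memory", "memory_utilization", lambda v: v > 0.90),
    AlertRule("Disk space low", "disk_free_ratio", lambda v: v < 0.10),
    AlertRule("PostgreSQL connections near max", "pg_connection_ratio", lambda v: v > 0.80),
    AlertRule("Redis memory high", "redis_memory_ratio", lambda v: v > 0.80),
    AlertRule("RabbitMQ queue buildup", "rabbitmq_queue_depth", lambda v: v > 10_000),
]


def firing_alerts(snapshot: Dict[str, float]) -> List[str]:
    """Return the names of rules whose threshold is breached in the snapshot."""
    return [
        rule.name
        for rule in RULES
        if rule.metric in snapshot and rule.breached(snapshot[rule.metric])
    ]
```

Keeping thresholds in one table like this makes them easy to review alongside the runbooks that describe how to respond.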

## 🐛 Troubleshooting

### No Data in SigNoz

```bash
# 1. Check service logs
kubectl logs -n bakery-ia deployment/auth-service | grep -i otel

# 2. Check SigNoz collector
kubectl logs -n signoz deployment/signoz-otel-collector

# 3. Test connectivity
kubectl exec -n bakery-ia deployment/auth-service -- \
  curl -v http://signoz-otel-collector.signoz.svc.cluster.local:4318
```
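
If `curl` is not available in the service image, a TCP-level check can be run from a Python shell inside the pod instead. The helper below is a minimal sketch with a hypothetical name; it only confirms that the port accepts connections, not that the collector accepts OTLP payloads.

```python
import socket


def collector_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


if __name__ == "__main__":
    host = "signoz-otel-collector.signoz.svc.cluster.local"
    print(collector_reachable(host, 4318))
```

A `False` result points at DNS, network policy, or a collector that is not running, which narrows down where to look next.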

### Database Metrics Missing

```bash
# Check database monitoring collector
kubectl logs -n bakery-ia deployment/database-otel-collector

# Verify monitoring user exists
kubectl exec -n bakery-ia deployment/auth-db -- \
  psql -U postgres -c "\du otel_monitor"
```

### Traces Not Correlated with Logs

Ensure `OTEL_LOGS_EXPORTER=otlp` is set in service environment variables.

## 🎯 Best Practices

1. **Always use structured logging** - Add context with key-value pairs
2. **Add custom spans** - For important business operations
3. **Set appropriate log levels** - INFO for production, DEBUG for dev
4. **Monitor your monitors** - Alert on collector failures
5. **Regular retention policy reviews** - Balance cost vs. data retention
6. **Create service dashboards** - One dashboard per service
7. **Set up critical alerts first** - Service down, high error rate
8. **Document custom metrics** - Explain business-specific metrics

## 📊 Performance Impact

**Resource Usage (per service):**
- CPU: +5-10% (instrumentation overhead)
- Memory: +50-100MB (SDK and buffers)
- Network: Minimal (batched export every 60s)

**Latency Impact:**
- Per request: <1ms (telemetry is exported asynchronously in batches)
- Negligible impact on user-facing latency

**Storage (SigNoz):**
- Traces: ~1GB per million requests
- Metrics: ~100MB per service per day
- Logs: Varies by log volume
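
The storage figures above can be turned into a quick capacity estimate. The sketch below uses the document's numbers (about 1 GB of traces per million requests, about 100 MB of metrics per service per day, 17 services) and assumes decimal units (1 GB = 1000 MB). The function name is hypothetical, and logs are excluded because their volume varies.

```python
def estimate_daily_storage_gb(
    requests_per_day,
    services=17,
    trace_gb_per_million=1.0,
    metric_mb_per_service=100.0,
):
    """Rough daily storage estimate in GB for traces and metrics (logs excluded)."""
    traces_gb = requests_per_day / 1_000_000 * trace_gb_per_million
    metrics_gb = services * metric_mb_per_service / 1000  # MB to GB
    return {
        "traces_gb": traces_gb,
        "metrics_gb": metrics_gb,
        "total_gb": traces_gb + metrics_gb,
    }


if __name__ == "__main__":
    print(estimate_daily_storage_gb(2_000_000))
```

Multiplying the daily total by the retention period in days gives a first approximation of the ClickHouse disk needed before compression and log volume are taken into account.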

## 🔐 Security Considerations

1. **Use dedicated monitoring users** - Never use app credentials
2. **Limit collector permissions** - Read-only access to databases
3. **Secure OTLP endpoints** - Use TLS in production
4. **Sanitize sensitive data** - Don't log passwords, tokens
5. **Network policies** - Restrict collector network access
6. **RBAC** - Limit SigNoz UI access per team
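
For item 4, one possible approach is a logging filter that masks sensitive values before a record is exported. The sketch below is an assumption-laden example, not part of `shared/monitoring/`: the key list and names are hypothetical, and the pattern only handles `key=value` or `key: value` pairs whose value contains no spaces (it does not cover JSON bodies or `Bearer` headers).

```python
import logging
import re

# Hypothetical list of keys whose values should never reach the log backend.
SENSITIVE_KEYS = ("password", "token", "secret")
_PATTERN = re.compile(
    r"(?i)\b(" + "|".join(SENSITIVE_KEYS) + r")(\s*[=:]\s*)(\S+)"
)


def redact(text):
    """Replace the value of any sensitive key=value pair with a placeholder."""
    return _PATTERN.sub(r"\1\2[REDACTED]", text)


class RedactingFilter(logging.Filter):
    """Logging filter that redacts the fully formatted message."""

    def filter(self, record):
        record.msg = redact(record.getMessage())
        record.args = ()  # message is already formatted
        return True
```

Attaching the filter with `handler.addFilter(RedactingFilter())` applies it to every record that handler emits. Treat it as a safety net; the more reliable fix is not logging secrets in the first place.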

## 🚀 Next Steps

1. **Deploy to production** - Update production SigNoz config
2. **Create team dashboards** - Per-service and system-wide views
3. **Set up alerts** - Start with critical service health alerts
4. **Train team** - SigNoz UI usage, query language
5. **Document runbooks** - How to respond to alerts
6. **Optimize retention** - Based on actual data volume
7. **Add custom metrics** - Business-specific KPIs

## 📞 Support

- **SigNoz Community**: https://signoz.io/slack
- **OpenTelemetry Docs**: https://opentelemetry.io/docs/
- **Internal Docs**: See the `/docs` folder

## 📝 Change Log

| Date | Change |
|------|--------|
| 2026-01-08 | Initial implementation - All services configured |
| 2026-01-08 | Database monitoring added (PostgreSQL, Redis, RabbitMQ) |
| 2026-01-08 | System metrics collection implemented |
| 2026-01-08 | Removed Prometheus, pure OpenTelemetry |

---

**Congratulations! Your platform now has complete observability. 🎉**

Every request is traced, every metric is collected, every log is searchable.