Improve monitoring 6

Urtzi Alfaro
2026-01-10 13:43:38 +01:00
parent c05538cafb
commit b089c216db
13 changed files with 1248 additions and 2546 deletions


@@ -60,84 +60,129 @@
**Production URLs:**
```
https://monitoring.bakewise.ai/grafana        # Dashboards & visualization
https://monitoring.bakewise.ai/prometheus     # Metrics & alerts
https://monitoring.bakewise.ai/signoz         # SigNoz - Unified observability (PRIMARY)
https://monitoring.bakewise.ai/alertmanager   # AlertManager - Alert management
```
**What is SigNoz?**
SigNoz is a comprehensive, open-source observability platform that provides:
- **Distributed Tracing** - End-to-end request tracking across all microservices
- **Metrics Monitoring** - Application and infrastructure metrics
- **Log Management** - Centralized log aggregation with trace correlation
- **Service Performance Monitoring (SPM)** - RED metrics (Rate, Error, Duration) from traces
- **Database Monitoring** - All 18 PostgreSQL databases + Redis + RabbitMQ
- **Kubernetes Monitoring** - Cluster, node, pod, and container metrics
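A quick way to confirm the stack is actually deployed before relying on it (a sketch, assuming the `bakery-ia` namespace and the `app.kubernetes.io/instance=signoz` label used by the health-check script later in this guide):
```bash
# All SigNoz components should be Running with few restarts
kubectl get pods -n bakery-ia -l app.kubernetes.io/instance=signoz
# The OTel Collector is the ingest point for traces, metrics and logs
kubectl get deployment -n bakery-ia signoz-otel-collector
```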
**Port Forwarding (if ingress not available):**
```bash
# SigNoz Frontend (Main UI)
kubectl port-forward -n bakery-ia svc/signoz 8080:8080
# SigNoz AlertManager
kubectl port-forward -n bakery-ia svc/signoz-alertmanager 9093:9093
# OTel Collector (for debugging)
kubectl port-forward -n bakery-ia svc/signoz-otel-collector 4317:4317   # gRPC
kubectl port-forward -n bakery-ia svc/signoz-otel-collector 4318:4318   # HTTP
# Grafana
kubectl port-forward -n monitoring svc/grafana 3000:3000
# Prometheus
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090
```
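Once the port-forwards are running, a few quick smoke tests (a sketch; the ports and the v1 AlertManager API path match the defaults used elsewhere in this guide):
```bash
# SigNoz UI should answer with an HTTP status line
curl -sI http://localhost:8080 | head -1
# AlertManager API reachable?
curl -s http://localhost:9093/api/v1/alerts | jq '.status'
# Any HTTP response from 4318 (even a 404 on /) means the OTLP/HTTP listener is up
curl -s -o /dev/null -w "OTLP HTTP listener: %{http_code}\n" http://localhost:4318
```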
### Key SigNoz Dashboards and Features
#### 1. Services Tab - APM Overview
**What to Monitor:**
- **Service List** - All 18 microservices with health status
- **Request Rate** - Requests per second per service
- **Error Rate** - Percentage of failed requests (aim: <1%)
- **P50/P90/P99 Latency** - Response time percentiles (aim: P99 <2s)
- **Operations** - Breakdown by endpoint/operation
**Red Flags:**
- ❌ Error rate >5% sustained
- ❌ P99 latency >3s
- ❌ Sudden drop in request rate (service might be down)
- ❌ High latency on specific endpoints
**How to Access:**
- Navigate to `Services` tab in SigNoz
- Click on any service for detailed metrics
- Use "Traces" tab to see sample requests
#### 2. Traces Tab - Distributed Tracing
**What to Monitor:**
- **End-to-end request flows** across microservices
- **Span duration** - Time spent in each service
- **Database query performance** - Auto-captured from SQLAlchemy
- **External API calls** - Auto-captured from HTTPX
- **Error traces** - Requests that failed with stack traces
**Features:**
- Filter by service, operation, status code, duration
- Search by trace ID or span ID
- Correlate traces with logs
- Identify slow database queries and N+1 problems
**Red Flags:**
- ❌ Traces showing >10 database queries per request (N+1 issue)
- ❌ External API calls taking >1s
- ❌ Services with >500ms internal processing time
- ❌ Error spans with exceptions
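To correlate a specific slow or failing trace with raw pod logs, copy the trace ID from the SigNoz trace view and grep for it (a sketch; it assumes the services include the trace ID in their log lines via the OpenTelemetry logging instrumentation):
```bash
# Paste a trace ID taken from the SigNoz Traces tab
TRACE_ID="<paste-trace-id-here>"
kubectl logs -n bakery-ia deployment/SERVICE_NAME --since=1h | grep "$TRACE_ID"
```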
#### 3. Dashboards Tab - Infrastructure Metrics
**Pre-built Dashboards:**
- **PostgreSQL Monitoring** - All 18 databases
  - Active connections, transactions/sec, cache hit ratio
  - Slow queries, lock waits, replication lag
  - Database size, disk I/O
- **Redis Monitoring** - Cache performance
  - Memory usage, hit rate, evictions
  - Commands/sec, latency
- **RabbitMQ Monitoring** - Message queue health
  - Queue depth, message rates
  - Consumer status, connections
- **Kubernetes Cluster** - Node and pod metrics
  - CPU, memory, disk, network per node
  - Pod resource utilization
  - Container restarts and OOM kills
**Red Flags:**
- ❌ PostgreSQL: Cache hit ratio <80%, active connections >80% of max
- ❌ Redis: Memory >90%, evictions increasing
- ❌ RabbitMQ: Queue depth growing, no consumers
- ❌ Kubernetes: CPU >85%, memory >90%, disk <20% free
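CLI cross-checks for the red flags above (a sketch; `deployment/auth-db` matches the naming used elsewhere in this guide, while the Redis and RabbitMQ deployment names are placeholders):
```bash
# PostgreSQL: active connections (compare against max_connections)
kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c \
  "SELECT count(*) AS connections FROM pg_stat_activity;"
# Redis: memory usage and evictions
kubectl exec -n bakery-ia deployment/redis -- redis-cli INFO memory | grep -E "used_memory_human|maxmemory_human"
kubectl exec -n bakery-ia deployment/redis -- redis-cli INFO stats | grep evicted_keys
# RabbitMQ: queue depth and consumer counts
kubectl exec -n bakery-ia deployment/rabbitmq -- rabbitmqctl list_queues name messages consumers
```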
#### 4. Logs Tab - Centralized Logging
**Features:**
- **Unified logs** from all 18 microservices + databases
- **Trace correlation** - Click on trace ID to see related logs
- **Kubernetes metadata** - Auto-tagged with pod, namespace, container
- **Search and filter** - By service, severity, time range, content
- **Log patterns** - Automatically detect common patterns
**What to Monitor:**
- Error and warning logs across all services
- Database connection errors
- Authentication failures
- API request/response logs
**Red Flags:**
- ❌ Increasing error logs
- ❌ Repeated "connection refused" or "timeout" messages
- ❌ Authentication failures (potential security issue)
- ❌ Out of memory errors
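A rough CLI sweep for these patterns across every deployment in the namespace (only a sketch; the Logs tab is the better tool for history and filtering):
```bash
for d in $(kubectl get deploy -n bakery-ia -o name); do
  echo "--- $d"
  kubectl logs -n bakery-ia "$d" --since=30m --tail=200 2>/dev/null \
    | grep -iE "error|warning|connection refused|timeout|out of memory" | tail -5
done
```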
#### 5. Alerts Tab - Alert Management
**Features:**
- Create alerts based on metrics, traces, or logs
- Configure notification channels (email, Slack, webhook)
- View firing alerts and alert history
- Alert silencing and acknowledgment
**Pre-configured Alerts (viewable in SigNoz):**
- High error rate (>5% for 5 minutes)
- High latency (P99 >3s for 5 minutes)
- Service down (no requests for 2 minutes)
- Database connection errors
- High memory/CPU usage
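To silence a noisy alert during a planned maintenance window, a sketch assuming the bundled AlertManager exposes the standard Alertmanager v2 silences API on the same port-forward used above (the `HighErrorRate` alert name is illustrative):
```bash
curl -s -X POST http://localhost:9093/api/v2/silences \
  -H "Content-Type: application/json" \
  -d '{
    "matchers": [{"name": "alertname", "value": "HighErrorRate", "isRegex": false}],
    "startsAt": "2026-01-10T12:00:00Z",
    "endsAt": "2026-01-10T14:00:00Z",
    "createdBy": "ops",
    "comment": "Planned maintenance window"
  }'
```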
### Alert Severity Levels
@@ -195,7 +240,35 @@ Response:
3. See "Certificate Rotation" section below
```
### Daily Monitoring Workflow with SigNoz
#### Morning Health Check (5 minutes)
1. **Open SigNoz Dashboard**
```
https://monitoring.bakewise.ai/signoz
```
2. **Check Services Tab:**
- Verify all 18 services are reporting metrics
- Check error rate <1% for all services
- Check P99 latency <2s for critical services
3. **Check Alerts Tab:**
- Review any firing alerts
- Check for patterns (repeated alerts on same service)
- Acknowledge or resolve as needed
4. **Quick Infrastructure Check:**
- Navigate to Dashboards → PostgreSQL
- Verify all 18 databases are up
- Check connection counts are healthy
- Navigate to Dashboards → Redis
- Check memory usage <80%
- Navigate to Dashboards → Kubernetes
- Verify node health, no OOM kills
#### Command-Line Health Check (Alternative)
```bash
# Quick health check command
@@ -211,19 +284,19 @@ echo ""
echo "2. Resource Usage:"
kubectl top nodes
kubectl top pods -n bakery-ia --sort-by=memory | head -10
echo ""
echo "3. Database Connections:"
kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c \
"SELECT count(*) as connections FROM pg_stat_activity;"
echo "3. SigNoz Components:"
kubectl get pods -n bakery-ia -l app.kubernetes.io/instance=signoz
echo ""
echo "4. Recent Alerts:"
curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts[] | {alert: .labels.alertname, state: .state}' | head -10
echo "4. Recent Alerts (from SigNoz AlertManager):"
curl -s http://localhost:9093/api/v1/alerts 2>/dev/null | jq '.data[] | select(.status.state=="firing") | {alert: .labels.alertname, severity: .labels.severity}' | head -10
echo ""
echo "5. Disk Usage:"
kubectl exec -n bakery-ia deployment/auth-db -- df -h /var/lib/postgresql/data
echo "5. OTel Collector Health:"
kubectl exec -n bakery-ia deployment/signoz-otel-collector -- wget -qO- http://localhost:13133 2>/dev/null || echo "✅ Health check endpoint responding"
echo ""
echo "=== End Health Check ==="
@@ -233,6 +306,38 @@ chmod +x ~/health-check.sh
./health-check.sh
```
#### Troubleshooting Common Issues
**Issue: Service not showing in SigNoz**
```bash
# Check if service is sending telemetry
kubectl logs -n bakery-ia deployment/SERVICE_NAME | grep -i "telemetry\|otel\|signoz"
# Check OTel Collector is receiving data
kubectl logs -n bakery-ia deployment/signoz-otel-collector | grep SERVICE_NAME
# Verify service has proper OTEL endpoints configured
kubectl exec -n bakery-ia deployment/SERVICE_NAME -- env | grep OTEL
```
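If the `OTEL` variables are missing, a sketch of pointing the service at the collector (the variable names are the standard OpenTelemetry SDK ones; whether the service reads them directly depends on its own configuration):
```bash
kubectl set env -n bakery-ia deployment/SERVICE_NAME \
  OTEL_EXPORTER_OTLP_ENDPOINT=http://signoz-otel-collector:4317 \
  OTEL_SERVICE_NAME=SERVICE_NAME
kubectl rollout status -n bakery-ia deployment/SERVICE_NAME
```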
**Issue: No traces appearing**
```bash
# Check tracing is enabled in service
kubectl exec -n bakery-ia deployment/SERVICE_NAME -- env | grep ENABLE_TRACING
# Verify OTel Collector gRPC endpoint is reachable
kubectl exec -n bakery-ia deployment/SERVICE_NAME -- nc -zv signoz-otel-collector 4317
```
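If both checks pass but traces still do not appear, a minimal OTLP/HTTP test span sent straight to the collector can confirm end-to-end ingest (a sketch; the trace/span IDs and the `otlp-test` service name are arbitrary test values):
```bash
kubectl port-forward -n bakery-ia svc/signoz-otel-collector 4318:4318 &
PF_PID=$!
sleep 2
NOW=$(date +%s)000000000   # current time in nanoseconds (second precision is fine)
curl -s -X POST http://localhost:4318/v1/traces \
  -H "Content-Type: application/json" \
  -d '{
    "resourceSpans": [{
      "resource": {"attributes": [{"key": "service.name", "value": {"stringValue": "otlp-test"}}]},
      "scopeSpans": [{"spans": [{
        "traceId": "5b8efff798038103d269b633813fc60c",
        "spanId": "eee19b7ec3c1b174",
        "name": "manual-test-span",
        "kind": 1,
        "startTimeUnixNano": "'"$NOW"'",
        "endTimeUnixNano": "'"$NOW"'"
      }]}]
    }]
  }'
# A 200 response (often {"partialSuccess":{}}) means the collector accepted the span;
# the "otlp-test" service should then show up in the Traces tab shortly afterwards.
kill $PF_PID
```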
**Issue: Logs not appearing**
```bash
# Check filelog receiver is working
kubectl logs -n bakery-ia deployment/signoz-otel-collector | grep filelog
# Check k8sattributes processor
kubectl logs -n bakery-ia deployment/signoz-otel-collector | grep k8sattributes
```
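Two other causes worth ruling out (a sketch; `SERVICE_NAME` is a placeholder):
```bash
# 1. Confirm the application pods are actually writing to stdout/stderr
kubectl logs -n bakery-ia deployment/SERVICE_NAME --tail=5
# 2. Check whether the collector receives logs but fails to export them
#    to ClickHouse (SigNoz's storage backend)
kubectl logs -n bakery-ia deployment/signoz-otel-collector --since=15m | grep -iE "export.*(error|fail)" | tail -10
```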
---
## Security Operations