Improve monitoring 6

Urtzi Alfaro
2026-01-10 13:43:38 +01:00
parent c05538cafb
commit b089c216db
13 changed files with 1248 additions and 2546 deletions


@@ -60,84 +60,129 @@
**Production URLs:**
```
https://monitoring.bakewise.ai/grafana        # Dashboards & visualization
https://monitoring.bakewise.ai/prometheus     # Metrics & alerts
https://monitoring.bakewise.ai/signoz         # SigNoz - Unified observability (PRIMARY)
https://monitoring.bakewise.ai/alertmanager   # AlertManager - Alert management
```
**What is SigNoz?**
SigNoz is a comprehensive, open-source observability platform that provides:
- **Distributed Tracing** - End-to-end request tracking across all microservices
- **Metrics Monitoring** - Application and infrastructure metrics
- **Log Management** - Centralized log aggregation with trace correlation
- **Service Performance Monitoring (SPM)** - RED metrics (Rate, Error, Duration) from traces
- **Database Monitoring** - All 18 PostgreSQL databases + Redis + RabbitMQ
- **Kubernetes Monitoring** - Cluster, node, pod, and container metrics
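A quick way to confirm the stack is actually deployed before relying on it (a sketch, assuming the `bakery-ia` namespace and the `app.kubernetes.io/instance=signoz` label used by the health-check script later in this guide):
```bash
# All SigNoz components should be Running with few restarts
kubectl get pods -n bakery-ia -l app.kubernetes.io/instance=signoz
# The OTel Collector is the ingest point for traces, metrics and logs
kubectl get deployment -n bakery-ia signoz-otel-collector
```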
**Port Forwarding (if ingress not available):**
```bash
# SigNoz Frontend (Main UI)
kubectl port-forward -n bakery-ia svc/signoz 8080:8080
# SigNoz AlertManager
kubectl port-forward -n bakery-ia svc/signoz-alertmanager 9093:9093
# OTel Collector (for debugging)
kubectl port-forward -n bakery-ia svc/signoz-otel-collector 4317:4317   # gRPC
kubectl port-forward -n bakery-ia svc/signoz-otel-collector 4318:4318   # HTTP
# Grafana
kubectl port-forward -n monitoring svc/grafana 3000:3000
# Prometheus
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090
```
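Once the port-forwards are running, a few quick smoke tests (a sketch; the ports and the v1 AlertManager API path match the defaults used elsewhere in this guide):
```bash
# SigNoz UI should answer with an HTTP status line
curl -sI http://localhost:8080 | head -1
# AlertManager API reachable?
curl -s http://localhost:9093/api/v1/alerts | jq '.status'
# Any HTTP response from 4318 (even a 404 on /) means the OTLP/HTTP listener is up
curl -s -o /dev/null -w "OTLP HTTP listener: %{http_code}\n" http://localhost:4318
```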
### Key SigNoz Dashboards and Features
#### 1. Services Tab - APM Overview
**What to Monitor:**
- **Service List** - All 18 microservices with health status
- **Request Rate** - Requests per second per service
- **Error Rate** - Percentage of failed requests (aim: <1%)
- **P50/P90/P99 Latency** - Response time percentiles (aim: P99 <2s)
- **Operations** - Breakdown by endpoint/operation
**Red Flags:**
- ❌ Error rate >5% sustained
- ❌ P99 latency >3s
- ❌ Sudden drop in request rate (service might be down)
- ❌ High latency on specific endpoints
**How to Access:**
- Navigate to `Services` tab in SigNoz
- Click on any service for detailed metrics
- Use "Traces" tab to see sample requests
#### 2. Traces Tab - Distributed Tracing
**What to Monitor:**
- **End-to-end request flows** across microservices
- **Span duration** - Time spent in each service
- **Database query performance** - Auto-captured from SQLAlchemy
- **External API calls** - Auto-captured from HTTPX
- **Error traces** - Requests that failed with stack traces
**Features:**
- Filter by service, operation, status code, duration
- Search by trace ID or span ID
- Correlate traces with logs
- Identify slow database queries and N+1 problems
**Red Flags:**
- ❌ Traces showing >10 database queries per request (N+1 issue)
- ❌ External API calls taking >1s
- ❌ Services with >500ms internal processing time
- ❌ Error spans with exceptions
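To correlate a specific slow or failing trace with raw pod logs, copy the trace ID from the SigNoz trace view and grep for it (a sketch; it assumes the services include the trace ID in their log lines via the OpenTelemetry logging instrumentation):
```bash
# Paste a trace ID taken from the SigNoz Traces tab
TRACE_ID="<paste-trace-id-here>"
kubectl logs -n bakery-ia deployment/SERVICE_NAME --since=1h | grep "$TRACE_ID"
```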
#### 3. Dashboards Tab - Infrastructure Metrics
**Pre-built Dashboards:**
- **PostgreSQL Monitoring** - All 18 databases
  - Active connections, transactions/sec, cache hit ratio
  - Slow queries, lock waits, replication lag
  - Database size, disk I/O
- **Redis Monitoring** - Cache performance
  - Memory usage, hit rate, evictions
  - Commands/sec, latency
- **RabbitMQ Monitoring** - Message queue health
  - Queue depth, message rates
  - Consumer status, connections
- **Kubernetes Cluster** - Node and pod metrics
  - CPU, memory, disk, network per node
  - Pod resource utilization
  - Container restarts and OOM kills
**Red Flags:**
- ❌ PostgreSQL: Cache hit ratio <80%, active connections >80% of max
- ❌ Redis: Memory >90%, evictions increasing
- ❌ RabbitMQ: Queue depth growing, no consumers
- ❌ Kubernetes: CPU >85%, memory >90%, disk <20% free
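CLI cross-checks for the red flags above (a sketch; `deployment/auth-db` matches the naming used elsewhere in this guide, while the Redis and RabbitMQ deployment names are placeholders):
```bash
# PostgreSQL: active connections (compare against max_connections)
kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c \
  "SELECT count(*) AS connections FROM pg_stat_activity;"
# Redis: memory usage and evictions
kubectl exec -n bakery-ia deployment/redis -- redis-cli INFO memory | grep -E "used_memory_human|maxmemory_human"
kubectl exec -n bakery-ia deployment/redis -- redis-cli INFO stats | grep evicted_keys
# RabbitMQ: queue depth and consumer counts
kubectl exec -n bakery-ia deployment/rabbitmq -- rabbitmqctl list_queues name messages consumers
```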
#### 4. Logs Tab - Centralized Logging
**Features:**
- **Unified logs** from all 18 microservices + databases
- **Trace correlation** - Click on trace ID to see related logs
- **Kubernetes metadata** - Auto-tagged with pod, namespace, container
- **Search and filter** - By service, severity, time range, content
- **Log patterns** - Automatically detect common patterns
**What to Monitor:**
- Error and warning logs across all services
- Database connection errors
- Authentication failures
- API request/response logs
**Red Flags:**
- ❌ Increasing error logs
- ❌ Repeated "connection refused" or "timeout" messages
- ❌ Authentication failures (potential security issue)
- ❌ Out of memory errors
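A rough CLI sweep for these patterns across every deployment in the namespace (only a sketch; the Logs tab is the better tool for history and filtering):
```bash
for d in $(kubectl get deploy -n bakery-ia -o name); do
  echo "--- $d"
  kubectl logs -n bakery-ia "$d" --since=30m --tail=200 2>/dev/null \
    | grep -iE "error|warning|connection refused|timeout|out of memory" | tail -5
done
```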
#### 5. Alerts Tab - Alert Management
**Features:**
- Create alerts based on metrics, traces, or logs
- Configure notification channels (email, Slack, webhook)
- View firing alerts and alert history
- Alert silencing and acknowledgment
**Pre-configured Alerts (viewable in SigNoz):**
- High error rate (>5% for 5 minutes)
- High latency (P99 >3s for 5 minutes)
- Service down (no requests for 2 minutes)
- Database connection errors
- High memory/CPU usage
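To silence a noisy alert during a planned maintenance window, a sketch assuming the bundled AlertManager exposes the standard Alertmanager v2 silences API on the same port-forward used above (the `HighErrorRate` alert name is illustrative):
```bash
curl -s -X POST http://localhost:9093/api/v2/silences \
  -H "Content-Type: application/json" \
  -d '{
    "matchers": [{"name": "alertname", "value": "HighErrorRate", "isRegex": false}],
    "startsAt": "2026-01-10T12:00:00Z",
    "endsAt": "2026-01-10T14:00:00Z",
    "createdBy": "ops",
    "comment": "Planned maintenance window"
  }'
```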
### Alert Severity Levels
@@ -195,7 +240,35 @@ Response:
3. See "Certificate Rotation" section below
```
### Daily Monitoring Workflow with SigNoz
#### Morning Health Check (5 minutes)
1. **Open SigNoz Dashboard**
```
https://monitoring.bakewise.ai/signoz
```
2. **Check Services Tab:**
- Verify all 18 services are reporting metrics
- Check error rate <1% for all services
- Check P99 latency <2s for critical services
3. **Check Alerts Tab:**
- Review any firing alerts
- Check for patterns (repeated alerts on same service)
- Acknowledge or resolve as needed
4. **Quick Infrastructure Check:**
- Navigate to Dashboards → PostgreSQL
- Verify all 18 databases are up
- Check connection counts are healthy
- Navigate to Dashboards → Redis
- Check memory usage <80%
- Navigate to Dashboards → Kubernetes
- Verify node health, no OOM kills
#### Command-Line Health Check (Alternative)
```bash
# Quick health check command
@@ -211,19 +284,19 @@ echo ""
echo "2. Resource Usage:"
kubectl top nodes
kubectl top pods -n bakery-ia --sort-by=memory | head -10
echo ""
echo "3. Database Connections:"
kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c \
"SELECT count(*) as connections FROM pg_stat_activity;"
echo "3. SigNoz Components:"
kubectl get pods -n bakery-ia -l app.kubernetes.io/instance=signoz
echo ""
echo "4. Recent Alerts:"
curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts[] | {alert: .labels.alertname, state: .state}' | head -10
echo "4. Recent Alerts (from SigNoz AlertManager):"
curl -s http://localhost:9093/api/v1/alerts 2>/dev/null | jq '.data[] | select(.status.state=="firing") | {alert: .labels.alertname, severity: .labels.severity}' | head -10
echo ""
echo "5. Disk Usage:"
kubectl exec -n bakery-ia deployment/auth-db -- df -h /var/lib/postgresql/data
echo "5. OTel Collector Health:"
kubectl exec -n bakery-ia deployment/signoz-otel-collector -- wget -qO- http://localhost:13133 2>/dev/null || echo "✅ Health check endpoint responding"
echo ""
echo "=== End Health Check ==="
@@ -233,6 +306,38 @@ chmod +x ~/health-check.sh
./health-check.sh
```
#### Troubleshooting Common Issues
**Issue: Service not showing in SigNoz**
```bash
# Check if service is sending telemetry
kubectl logs -n bakery-ia deployment/SERVICE_NAME | grep -i "telemetry\|otel\|signoz"
# Check OTel Collector is receiving data
kubectl logs -n bakery-ia deployment/signoz-otel-collector | grep SERVICE_NAME
# Verify service has proper OTEL endpoints configured
kubectl exec -n bakery-ia deployment/SERVICE_NAME -- env | grep OTEL
```
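If the `OTEL` variables are missing, a sketch of pointing the service at the collector (the variable names are the standard OpenTelemetry SDK ones; whether the service reads them directly depends on its own configuration):
```bash
kubectl set env -n bakery-ia deployment/SERVICE_NAME \
  OTEL_EXPORTER_OTLP_ENDPOINT=http://signoz-otel-collector:4317 \
  OTEL_SERVICE_NAME=SERVICE_NAME
kubectl rollout status -n bakery-ia deployment/SERVICE_NAME
```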
**Issue: No traces appearing**
```bash
# Check tracing is enabled in service
kubectl exec -n bakery-ia deployment/SERVICE_NAME -- env | grep ENABLE_TRACING
# Verify OTel Collector gRPC endpoint is reachable
kubectl exec -n bakery-ia deployment/SERVICE_NAME -- nc -zv signoz-otel-collector 4317
```
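If both checks pass but traces still do not appear, a minimal OTLP/HTTP test span sent straight to the collector can confirm end-to-end ingest (a sketch; the trace/span IDs and the `otlp-test` service name are arbitrary test values):
```bash
kubectl port-forward -n bakery-ia svc/signoz-otel-collector 4318:4318 &
PF_PID=$!
sleep 2
NOW=$(date +%s)000000000   # current time in nanoseconds (second precision is fine)
curl -s -X POST http://localhost:4318/v1/traces \
  -H "Content-Type: application/json" \
  -d '{
    "resourceSpans": [{
      "resource": {"attributes": [{"key": "service.name", "value": {"stringValue": "otlp-test"}}]},
      "scopeSpans": [{"spans": [{
        "traceId": "5b8efff798038103d269b633813fc60c",
        "spanId": "eee19b7ec3c1b174",
        "name": "manual-test-span",
        "kind": 1,
        "startTimeUnixNano": "'"$NOW"'",
        "endTimeUnixNano": "'"$NOW"'"
      }]}]
    }]
  }'
# A 200 response (often {"partialSuccess":{}}) means the collector accepted the span;
# the "otlp-test" service should then show up in the Traces tab shortly afterwards.
kill $PF_PID
```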
**Issue: Logs not appearing**
```bash
# Check filelog receiver is working
kubectl logs -n bakery-ia deployment/signoz-otel-collector | grep filelog
# Check k8sattributes processor
kubectl logs -n bakery-ia deployment/signoz-otel-collector | grep k8sattributes
```
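Two other causes worth ruling out (a sketch; `SERVICE_NAME` is a placeholder):
```bash
# 1. Confirm the application pods are actually writing to stdout/stderr
kubectl logs -n bakery-ia deployment/SERVICE_NAME --tail=5
# 2. Check whether the collector receives logs but fails to export them
#    to ClickHouse (SigNoz's storage backend)
kubectl logs -n bakery-ia deployment/signoz-otel-collector --since=15m | grep -iE "export.*(error|fail)" | tail -10
```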
---
## Security Operations