Improve monitoring 6
@@ -60,84 +60,129 @@
**Production URLs:**
```
https://monitoring.bakewise.ai/grafana # Dashboards & visualization
https://monitoring.bakewise.ai/prometheus # Metrics & alerts
https://monitoring.bakewise.ai/alertmanager # Alert management
https://monitoring.bakewise.ai/signoz # Unified observability platform (traces, metrics, logs)
https://monitoring.bakewise.ai/signoz # SigNoz - Unified observability (PRIMARY)
https://monitoring.bakewise.ai/alertmanager # AlertManager - Alert management
```

**What is SigNoz?**
SigNoz is a comprehensive, open-source observability platform that provides:
- **Distributed Tracing** - End-to-end request tracking across all microservices
- **Metrics Monitoring** - Application and infrastructure metrics
- **Log Management** - Centralized log aggregation with trace correlation
- **Service Performance Monitoring (SPM)** - RED metrics (Rate, Error, Duration) from traces
- **Database Monitoring** - All 18 PostgreSQL databases + Redis + RabbitMQ
- **Kubernetes Monitoring** - Cluster, node, pod, and container metrics

**Port Forwarding (if ingress not available):**
```bash
# Grafana
kubectl port-forward -n monitoring svc/grafana 3000:3000
# SigNoz Frontend (Main UI)
kubectl port-forward -n bakery-ia svc/signoz 8080:8080

# Prometheus
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090
# SigNoz AlertManager
kubectl port-forward -n bakery-ia svc/signoz-alertmanager 9093:9093

# AlertManager
kubectl port-forward -n monitoring svc/alertmanager-external 9093:9093

# SigNoz
kubectl port-forward -n monitoring svc/signoz-frontend 3301:3301
# OTel Collector (for debugging)
kubectl port-forward -n bakery-ia svc/signoz-otel-collector 4317:4317 # gRPC
kubectl port-forward -n bakery-ia svc/signoz-otel-collector 4318:4318 # HTTP
```

### Key Dashboards
### Key SigNoz Dashboards and Features

#### 1. Services Overview Dashboard
#### 1. Services Tab - APM Overview
**What to Monitor:**
- Request rate per service
- Error rate (aim: <1%)
- P95/P99 latency (aim: <2s)
- Active connections
- Pod health status
- **Service List** - All 18 microservices with health status
- **Request Rate** - Requests per second per service
- **Error Rate** - Percentage of failed requests (aim: <1%)
- **P50/P90/P99 Latency** - Response time percentiles (aim: P99 <2s)
- **Operations** - Breakdown by endpoint/operation

**Red Flags:**
- ❌ Error rate >5%
- ❌ P95 latency >3s
- ❌ Any service showing 0 requests (might be down)
- ❌ Pod restarts >3 in last hour
- ❌ Error rate >5% sustained
- ❌ P99 latency >3s
- ❌ Sudden drop in request rate (service might be down)
- ❌ High latency on specific endpoints
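The targets and red flags above can be turned into a quick triage helper. The sketch below is illustrative only: `classify_service` is a hypothetical function name, the inputs are values read by hand from the Services tab, and the intermediate `WARN` band (between the "aim" and "red flag" thresholds) is an assumption, not something SigNoz reports.

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify one service from two values read off the Services tab.
# Thresholds taken from the lists above: aim <1% errors and P99 <2s;
# red flag at >5% errors or P99 >3s. The WARN band in between is an assumption.
classify_service() {
  local error_pct="$1" p99_seconds="$2"
  # awk is used because plain shell arithmetic cannot compare decimals.
  awk -v e="$error_pct" -v p="$p99_seconds" 'BEGIN {
    if (e > 5 || p > 3)        print "RED"
    else if (e >= 1 || p >= 2) print "WARN"
    else                       print "OK"
  }'
}
```

For example, `classify_service 0.4 1.2` prints `OK`, while `classify_service 0.2 3.4` prints `RED` because the P99 latency exceeds 3s.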

#### 2. Database Dashboard (PostgreSQL)
**How to Access:**
- Navigate to `Services` tab in SigNoz
- Click on any service for detailed metrics
- Use "Traces" tab to see sample requests

#### 2. Traces Tab - Distributed Tracing
**What to Monitor:**
- Active connections per database
- Cache hit ratio (aim: >90%)
- Query duration (P95)
- Transaction rate
- Replication lag (if applicable)
- **End-to-end request flows** across microservices
- **Span duration** - Time spent in each service
- **Database query performance** - Auto-captured from SQLAlchemy
- **External API calls** - Auto-captured from HTTPX
- **Error traces** - Requests that failed with stack traces

**Features:**
- Filter by service, operation, status code, duration
- Search by trace ID or span ID
- Correlate traces with logs
- Identify slow database queries and N+1 problems

**Red Flags:**
- ❌ Connection count >80% of max
- ❌ Cache hit ratio <80%
- ❌ Slow queries >1s frequently
- ❌ Locks increasing
- ❌ Traces showing >10 database queries per request (N+1 issue)
- ❌ External API calls taking >1s
- ❌ Services with >500ms internal processing time
- ❌ Error spans with exceptions
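As a rough illustration of how the trace red flags above might be applied to a single trace, the sketch below takes three numbers noted down from a trace view and lists which thresholds were crossed. The function name `trace_red_flags` and the flag labels are hypothetical; SigNoz does not emit these labels itself.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list the trace red flags that apply to one request.
# Inputs (read manually from the trace view):
#   $1 = number of database query spans in the trace
#   $2 = slowest external API call, in milliseconds
#   $3 = internal processing time of a service, in milliseconds
trace_red_flags() {
  local db_queries="$1" external_ms="$2" internal_ms="$3"
  local flags=""
  if [ "$db_queries" -gt 10 ]; then flags="$flags possible_n_plus_one"; fi
  if [ "$external_ms" -gt 1000 ]; then flags="$flags slow_external_call"; fi
  if [ "$internal_ms" -gt 500 ]; then flags="$flags slow_internal_processing"; fi
  # Strip the leading space; prints an empty line when nothing is flagged.
  echo "${flags# }"
}
```

Calling `trace_red_flags 14 300 120` prints `possible_n_plus_one`; a trace with no crossed thresholds prints an empty line.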

#### 3. Node Exporter (Infrastructure)
**What to Monitor:**
- CPU usage per node
- Memory usage and swap
- Disk I/O and latency
- Network throughput
- Disk space remaining
#### 3. Dashboards Tab - Infrastructure Metrics
**Pre-built Dashboards:**
- **PostgreSQL Monitoring** - All 18 databases
  - Active connections, transactions/sec, cache hit ratio
  - Slow queries, lock waits, replication lag
  - Database size, disk I/O
- **Redis Monitoring** - Cache performance
  - Memory usage, hit rate, evictions
  - Commands/sec, latency
- **RabbitMQ Monitoring** - Message queue health
  - Queue depth, message rates
  - Consumer status, connections
- **Kubernetes Cluster** - Node and pod metrics
  - CPU, memory, disk, network per node
  - Pod resource utilization
  - Container restarts and OOM kills

**Red Flags:**
- ❌ CPU usage >85% sustained
- ❌ Memory usage >90%
- ❌ Swap usage >0 (indicates memory pressure)
- ❌ Disk space <20% remaining
- ❌ Disk I/O latency >100ms
- ❌ PostgreSQL: Cache hit ratio <80%, active connections >80% of max
- ❌ Redis: Memory >90%, evictions increasing
- ❌ RabbitMQ: Queue depth growing, no consumers
- ❌ Kubernetes: CPU >85%, memory >90%, disk <20% free
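The PostgreSQL cache hit ratio referenced above is commonly derived from the `blks_hit` and `blks_read` counters in `pg_stat_database`, as `blks_hit / (blks_hit + blks_read)`. The sketch below shows that arithmetic; `cache_hit_ratio` is a hypothetical helper name, and whether the SigNoz dashboard computes the ratio in exactly this way is an assumption.

```bash
#!/usr/bin/env bash
# Hypothetical helper: cache hit ratio in percent from two pg_stat_database counters.
#   $1 = blks_hit  (blocks served from shared buffers)
#   $2 = blks_read (blocks read from disk)
cache_hit_ratio() {
  local blks_hit="$1" blks_read="$2"
  awk -v h="$blks_hit" -v r="$blks_read" 'BEGIN {
    total = h + r
    if (total == 0) { print "n/a"; exit }   # no block activity recorded yet
    printf "%.1f\n", 100 * h / total
  }'
}
```

With 9500 hits and 500 reads the result is `95.0`, above the 90% target; 700 hits and 300 reads give `70.0`, below the 80% red-flag threshold. The two counters could be fetched with a query such as `SELECT sum(blks_hit), sum(blks_read) FROM pg_stat_database;` run through `psql` in the database pod.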

#### 4. Logs Tab - Centralized Logging
**Features:**
- **Unified logs** from all 18 microservices + databases
- **Trace correlation** - Click on trace ID to see related logs
- **Kubernetes metadata** - Auto-tagged with pod, namespace, container
- **Search and filter** - By service, severity, time range, content
- **Log patterns** - Automatically detect common patterns

#### 4. Business Metrics Dashboard
**What to Monitor:**
- Active tenants
- ML training jobs (success/failure rate)
- Forecast requests per hour
- Alert volume
- API health score
- Error and warning logs across all services
- Database connection errors
- Authentication failures
- API request/response logs

**Red Flags:**
- ❌ Training failure rate >10%
- ❌ No forecast requests (might indicate issue)
- ❌ Alert volume spike (investigate cause)
- ❌ Increasing error logs
- ❌ Repeated "connection refused" or "timeout" messages
- ❌ Authentication failures (potential security issue)
- ❌ Out of memory errors
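When the SigNoz UI is unavailable, the "connection refused" and "timeout" red flag above can be approximated from raw pod logs. The following is a minimal sketch; `count_connection_errors` is a hypothetical helper that reads log lines on standard input, and the case-insensitive match on those two phrases is an assumption about how services word their errors.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count log lines mentioning "connection refused" or "timeout".
# Reads log text on stdin and prints the number of matching lines.
count_connection_errors() {
  # grep -c exits non-zero when the count is 0, so "|| true" keeps the
  # helper usable in scripts that run with "set -e".
  grep -ciE 'connection refused|timeout' || true
}
```

A possible use, with `SERVICE_NAME` as a placeholder: `kubectl logs -n bakery-ia deployment/SERVICE_NAME --since=1h | count_connection_errors`. A count that keeps growing between runs suggests the repeated-failure pattern described above.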

#### 5. Alerts Tab - Alert Management
**Features:**
- Create alerts based on metrics, traces, or logs
- Configure notification channels (email, Slack, webhook)
- View firing alerts and alert history
- Alert silencing and acknowledgment

**Pre-configured Alerts (see SigNoz):**
- High error rate (>5% for 5 minutes)
- High latency (P99 >3s for 5 minutes)
- Service down (no requests for 2 minutes)
- Database connection errors
- High memory/CPU usage

### Alert Severity Levels

@@ -195,7 +240,35 @@ Response:
3. See "Certificate Rotation" section below
```

### Metrics to Track Daily
### Daily Monitoring Workflow with SigNoz

#### Morning Health Check (5 minutes)

1. **Open SigNoz Dashboard**
   ```
   https://monitoring.bakewise.ai/signoz
   ```

2. **Check Services Tab:**
   - Verify all 18 services are reporting metrics
   - Check error rate <1% for all services
   - Check P99 latency <2s for critical services

3. **Check Alerts Tab:**
   - Review any firing alerts
   - Check for patterns (repeated alerts on same service)
   - Acknowledge or resolve as needed

4. **Quick Infrastructure Check:**
   - Navigate to Dashboards → PostgreSQL
     - Verify all 18 databases are up
     - Check connection counts are healthy
   - Navigate to Dashboards → Redis
     - Check memory usage <80%
   - Navigate to Dashboards → Kubernetes
     - Verify node health, no OOM kills

#### Command-Line Health Check (Alternative)

```bash
# Quick health check command
@@ -211,19 +284,19 @@ echo ""

echo "2. Resource Usage:"
kubectl top nodes
kubectl top pods -n bakery-ia --sort-by=memory | head -10
echo ""

echo "3. Database Connections:"
kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c \
  "SELECT count(*) as connections FROM pg_stat_activity;"
echo "3. SigNoz Components:"
kubectl get pods -n bakery-ia -l app.kubernetes.io/instance=signoz
echo ""

echo "4. Recent Alerts:"
curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts[] | {alert: .labels.alertname, state: .state}' | head -10
echo "4. Recent Alerts (from SigNoz AlertManager):"
# Alertmanager reports firing, unsilenced alerts with status.state "active"
curl -s http://localhost:9093/api/v1/alerts 2>/dev/null | jq '.data[] | select(.status.state=="active") | {alert: .labels.alertname, severity: .labels.severity}' | head -10
echo ""

echo "5. Disk Usage:"
kubectl exec -n bakery-ia deployment/auth-db -- df -h /var/lib/postgresql/data
echo "5. OTel Collector Health:"
kubectl exec -n bakery-ia deployment/signoz-otel-collector -- wget -qO- http://localhost:13133 >/dev/null 2>&1 && echo "✅ Health check endpoint responding" || echo "❌ Health check endpoint not responding"
echo ""

echo "=== End Health Check ==="
@@ -233,6 +306,38 @@ chmod +x ~/health-check.sh
./health-check.sh
```

#### Troubleshooting Common Issues

**Issue: Service not showing in SigNoz**
```bash
# Check if service is sending telemetry
kubectl logs -n bakery-ia deployment/SERVICE_NAME | grep -i "telemetry\|otel\|signoz"

# Check OTel Collector is receiving data
kubectl logs -n bakery-ia deployment/signoz-otel-collector | grep SERVICE_NAME

# Verify service has proper OTEL endpoints configured
kubectl exec -n bakery-ia deployment/SERVICE_NAME -- env | grep OTEL
```
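The last command above only lists the `OTEL` variables; interpreting them is left to the reader. The sketch below shows one way to check the result, assuming the services are configured through the standard OpenTelemetry variable `OTEL_EXPORTER_OTLP_ENDPOINT` (this runbook does not confirm which variable names the services actually use). `check_otel_env` is a hypothetical helper that reads `env` output on standard input.

```bash
#!/usr/bin/env bash
# Hypothetical helper: inspect `env` output for an OTLP endpoint.
# Assumes the standard OTEL_EXPORTER_OTLP_ENDPOINT variable is in use.
# Prints "grpc <url>" for port 4317, "http <url>" for port 4318,
# "unknown-port <url>" otherwise, or "missing" (exit status 1) if unset.
check_otel_env() {
  local endpoint
  endpoint="$(grep -E '^OTEL_EXPORTER_OTLP_ENDPOINT=' | head -n1 | cut -d= -f2- || true)"
  if [ -z "$endpoint" ]; then
    echo "missing"
    return 1
  fi
  case "$endpoint" in
    *:4317*) echo "grpc $endpoint" ;;
    *:4318*) echo "http $endpoint" ;;
    *)       echo "unknown-port $endpoint" ;;
  esac
}
```

It could be combined with the command above as `kubectl exec -n bakery-ia deployment/SERVICE_NAME -- env | check_otel_env`. The ports 4317 (gRPC) and 4318 (HTTP) match the OTel Collector port-forwards listed earlier.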

**Issue: No traces appearing**
```bash
# Check tracing is enabled in service
kubectl exec -n bakery-ia deployment/SERVICE_NAME -- env | grep ENABLE_TRACING

# Verify OTel Collector gRPC endpoint is reachable
kubectl exec -n bakery-ia deployment/SERVICE_NAME -- nc -zv signoz-otel-collector 4317
```

**Issue: Logs not appearing**
```bash
# Check filelog receiver is working
kubectl logs -n bakery-ia deployment/signoz-otel-collector | grep filelog

# Check k8sattributes processor
kubectl logs -n bakery-ia deployment/signoz-otel-collector | grep k8sattributes
```

---

## Security Operations