Improve monitoring 6

This commit is contained in:
Urtzi Alfaro
2026-01-10 13:43:38 +01:00
parent c05538cafb
commit b089c216db
13 changed files with 1248 additions and 2546 deletions


@@ -856,87 +856,227 @@ kubectl logs -n bakery-ia deployment/auth-service | grep -i "email\|smtp"
## Post-Deployment
### Step 1: Access Monitoring Stack
### Step 1: Access SigNoz Monitoring Stack
Your production monitoring stack provides complete observability with multiple tools:
Your production deployment includes **SigNoz**, a unified observability platform that provides complete visibility into your application:
#### What is SigNoz?
SigNoz is an **open-source, all-in-one observability platform** that provides:
- **📊 Distributed Tracing** - See end-to-end request flows across all 18 microservices
- **📈 Metrics Monitoring** - Application performance and infrastructure metrics
- **📝 Log Management** - Centralized logs from all services with trace correlation
- **🔍 Service Performance Monitoring (SPM)** - Automatic RED metrics (Rate, Error, Duration)
- **🗄️ Database Monitoring** - All 18 PostgreSQL databases + Redis + RabbitMQ
- **☸️ Kubernetes Monitoring** - Cluster, node, pod, and container metrics
**Why SigNoz instead of Prometheus/Grafana?**
- Single unified UI for traces, metrics, and logs (no context switching)
- Automatic service dependency mapping
- Built-in APM (Application Performance Monitoring)
- Log-trace correlation with one click
- Better query performance with ClickHouse backend
- Modern UI designed for microservices
#### Production Monitoring URLs
Access via domain (recommended):
Access via domain:
```
https://monitoring.bakewise.ai/grafana # Dashboards & visualization
https://monitoring.bakewise.ai/prometheus # Metrics & queries
https://monitoring.bakewise.ai/signoz # Unified observability platform (traces, metrics, logs)
https://monitoring.bakewise.ai/alertmanager # Alert management
https://monitoring.bakewise.ai/signoz # SigNoz - Main observability UI
https://monitoring.bakewise.ai/alertmanager # AlertManager - Alert management
```
Or via port forwarding (if needed):
```bash
# Grafana
kubectl port-forward -n monitoring svc/grafana 3000:3000 &
# SigNoz Frontend (Main UI)
kubectl port-forward -n bakery-ia svc/signoz 8080:8080 &
# Open: http://localhost:8080
# Prometheus
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090 &
# SigNoz AlertManager
kubectl port-forward -n bakery-ia svc/signoz-alertmanager 9093:9093 &
# Open: http://localhost:9093
# SigNoz
kubectl port-forward -n monitoring svc/signoz-frontend 3301:3301 &
# AlertManager
kubectl port-forward -n monitoring svc/alertmanager-external 9093:9093 &
# OTel Collector (for debugging)
kubectl port-forward -n bakery-ia svc/signoz-otel-collector 4317:4317 & # gRPC
kubectl port-forward -n bakery-ia svc/signoz-otel-collector 4318:4318 & # HTTP
```
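After port-forwarding the collector, you can quickly confirm the OTLP/HTTP receiver is accepting data. A minimal sketch (assumes the standard OTLP/HTTP traces path on port 4318; an empty export should return HTTP status 200):

```bash
# Send an empty OTLP/HTTP trace export to the forwarded collector port.
# A 200 status code means the receiver is up and parsing requests.
curl -s -o /dev/null -w "%{http_code}\n" \
  -X POST http://localhost:4318/v1/traces \
  -H "Content-Type: application/json" \
  -d '{"resourceSpans": []}'
```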
#### Available Dashboards
#### Key SigNoz Features to Explore
Login to Grafana (admin / your-password) and explore:
Once you open SigNoz (https://monitoring.bakewise.ai/signoz), explore these tabs:
**Main Dashboards:**
1. **Gateway Metrics** - HTTP request rates, latencies, error rates
2. **Services Overview** - Multi-service health and performance
3. **Circuit Breakers** - Reliability metrics
**1. Services Tab - Application Performance**
- View all 18 microservices with live metrics
- See request rate, error rate, and latency (P50/P90/P99)
- Click on any service to drill down into operations
- Identify slow endpoints and error-prone operations
**Extended Dashboards:**
4. **Service Performance Monitoring (SPM)** - RED metrics from distributed traces
5. **PostgreSQL Database** - Database health, connections, query performance
6. **Node Exporter Infrastructure** - CPU, memory, disk, network per node
7. **AlertManager Monitoring** - Alert tracking and notification status
8. **Business Metrics & KPIs** - Tenant activity, ML jobs, forecasts
**2. Traces Tab - Request Flow Visualization**
- See complete request journeys across services
- Identify bottlenecks (slow database queries, API calls)
- Debug errors with full stack traces
- Correlate with logs for complete context
**3. Dashboards Tab - Infrastructure & Database Metrics**
- **PostgreSQL** - Monitor all 18 databases (connections, queries, cache hit ratio)
- **Redis** - Cache performance (memory, hit rate, commands/sec)
- **RabbitMQ** - Message queue health (depth, rates, consumers)
- **Kubernetes** - Cluster metrics (nodes, pods, containers)
**4. Logs Tab - Centralized Log Management**
- Search and filter logs from all services
- Click on trace ID in logs to see related request trace
- Auto-enriched with Kubernetes metadata (pod, namespace, container)
- Identify patterns and anomalies
**5. Alerts Tab - Proactive Monitoring**
- Configure alerts on metrics, traces, or logs
- Email/Slack/Webhook notifications
- View firing alerts and alert history
#### Quick Health Check
```bash
# Verify all monitoring pods are running
kubectl get pods -n monitoring
# Verify SigNoz components are running
kubectl get pods -n bakery-ia -l app.kubernetes.io/instance=signoz
# Check Prometheus targets (all should be UP)
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090
# Open: http://localhost:9090/targets
# Expected output:
# signoz-0 READY 1/1
# signoz-otel-collector-xxx READY 1/1
# signoz-alertmanager-xxx READY 1/1
# signoz-clickhouse-xxx READY 1/1
# signoz-zookeeper-xxx READY 1/1
# View active alerts
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090
# Open: http://localhost:9090/alerts
# Check OTel Collector health
kubectl exec -n bakery-ia deployment/signoz-otel-collector -- wget -qO- http://localhost:13133
# View recent telemetry in OTel Collector logs
kubectl logs -n bakery-ia deployment/signoz-otel-collector --tail=50 | grep -i "traces\|metrics\|logs"
```
#### Verify Telemetry is Working
1. **Check Services are Reporting:**
```bash
# Open SigNoz and navigate to Services tab
# You should see all 18 microservices listed
# If services are missing, check if they're sending telemetry:
kubectl logs -n bakery-ia deployment/auth-service | grep -i "telemetry\|otel"
```
2. **Check Database Metrics:**
```bash
# Navigate to Dashboards → PostgreSQL in SigNoz
# You should see metrics from all 18 databases
# Verify OTel Collector is scraping databases:
kubectl logs -n bakery-ia deployment/signoz-otel-collector | grep postgresql
```
3. **Check Traces are Being Collected:**
```bash
# Make a test API request
curl https://bakewise.ai/api/v1/health
# Navigate to Traces tab in SigNoz
# Search for "gateway" service
# You should see the trace for your request
```
4. **Check Logs are Being Collected:**
```bash
# Navigate to Logs tab in SigNoz
# Filter by namespace: bakery-ia
# You should see logs from all pods
# Verify filelog receiver is working:
kubectl logs -n bakery-ia deployment/signoz-otel-collector | grep filelog
```
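As an optional lower-level check, you can ask ClickHouse directly whether telemetry tables are receiving rows. A hedged sketch (the `signoz_traces` database name follows SigNoz's default schema; adjust if your deployment uses different names):

```bash
# List trace tables and their row counts; non-zero totals confirm spans
# are being written. Mirrors the system.tables check used later in this guide.
kubectl exec -n bakery-ia deployment/signoz-clickhouse -- \
  clickhouse-client --query="SELECT table, total_rows FROM system.tables WHERE database = 'signoz_traces' ORDER BY total_rows DESC LIMIT 10"
```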
### Step 2: Configure Alerting
Update AlertManager with your notification email addresses:
SigNoz includes integrated alerting with AlertManager. Configure it for your team:
#### Update Email Notification Settings
The alerting configuration is in the SigNoz Helm values. To update:
```bash
# Edit alertmanager configuration
kubectl edit configmap -n monitoring alertmanager-config
# For production, edit the values file:
nano infrastructure/helm/signoz-values-prod.yaml
# Update recipient emails in the routes section:
# - alerts@bakewise.ai (general alerts)
# - critical-alerts@bakewise.ai (critical issues)
# - oncall@bakewise.ai (on-call rotation)
# Update the alertmanager.config section:
# 1. Update SMTP settings:
# - smtp_from: 'your-alerts@bakewise.ai'
# - smtp_auth_username: 'your-alerts@bakewise.ai'
# - smtp_auth_password: (use Kubernetes secret)
#
# 2. Update receivers:
# - critical-alerts email: critical-alerts@bakewise.ai
# - warning-alerts email: oncall@bakewise.ai
#
# 3. (Optional) Add Slack webhook for critical alerts
# Apply the updated configuration:
helm upgrade signoz signoz/signoz \
-n bakery-ia \
-f infrastructure/helm/signoz-values-prod.yaml
```
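To avoid committing the SMTP password to the values file, you can keep it in a Kubernetes Secret and override only the non-sensitive fields at upgrade time. A hedged sketch (the secret name and the `alertmanager.config.*` value paths are assumptions; check the chart's schema and your `signoz-values-prod.yaml` for the real keys):

```bash
# Store the SMTP password in a Secret instead of the values file.
kubectl create secret generic alertmanager-smtp \
  --namespace bakery-ia \
  --from-literal=smtp_auth_password='REPLACE_WITH_REAL_PASSWORD' \
  --dry-run=client -o yaml | kubectl apply -f -

# Override the sender/username at upgrade time; reference the secret from
# the values file rather than pasting the password inline.
# (Value paths below are assumptions; adjust to match the chart.)
helm upgrade signoz signoz/signoz \
  -n bakery-ia \
  -f infrastructure/helm/signoz-values-prod.yaml \
  --set alertmanager.config.global.smtp_from='alerts@bakewise.ai' \
  --set alertmanager.config.global.smtp_auth_username='alerts@bakewise.ai'
```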
Test alert delivery:
#### Create Alerts in SigNoz UI
1. **Open SigNoz Alerts Tab:**
```
https://monitoring.bakewise.ai/signoz → Alerts
```
2. **Create Common Alerts:**
**Alert 1: High Error Rate**
- Name: `HighErrorRate`
- Query: `error_rate > 5` for `5 minutes`
- Severity: `critical`
- Description: "Service {{service_name}} has error rate >5%"
**Alert 2: High Latency**
- Name: `HighLatency`
- Query: `P99_latency > 3000ms` for `5 minutes`
- Severity: `warning`
- Description: "Service {{service_name}} P99 latency >3s"
**Alert 3: Service Down**
- Name: `ServiceDown`
- Query: `request_rate == 0` for `2 minutes`
- Severity: `critical`
- Description: "Service {{service_name}} not receiving requests"
**Alert 4: Database Connection Issues**
- Name: `DatabaseConnectionsHigh`
- Query: `pg_active_connections > 80` for `5 minutes`
- Severity: `warning`
- Description: "Database {{database}} connection count >80%"
**Alert 5: High Memory Usage**
- Name: `HighMemoryUsage`
- Query: `container_memory_percent > 85` for `5 minutes`
- Severity: `warning`
- Description: "Pod {{pod_name}} using >85% memory"
#### Test Alert Delivery
```bash
# Fire a test alert
# Method 1: Create a test alert in SigNoz UI
# Go to Alerts → New Alert → Set a test condition that will fire
# Method 2: Fire a test alert via stress test
kubectl run memory-test --image=polinux/stress --restart=Never \
--namespace=bakery-ia -- stress --vm 1 --vm-bytes 600M --timeout 300s
# Check alert appears in AlertManager
# Check alert appears in SigNoz Alerts tab
# https://monitoring.bakewise.ai/signoz → Alerts
# Also check AlertManager
# https://monitoring.bakewise.ai/alertmanager
# Verify email notification received
@@ -945,6 +1085,26 @@ kubectl run memory-test --image=polinux/stress --restart=Never \
kubectl delete pod memory-test -n bakery-ia
```
#### Configure Notification Channels
In SigNoz Alerts tab, configure channels:
1. **Email Channel:**
- Already configured via AlertManager
- Emails sent to addresses in signoz-values-prod.yaml
2. **Slack Channel (Optional):**
```bash
# Add Slack webhook URL to signoz-values-prod.yaml
# Under alertmanager.config.receivers.critical-alerts.slack_configs:
# - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
# channel: '#alerts-critical'
```
3. **Webhook Channel (Optional):**
- Configure custom webhook for integration with PagerDuty, OpsGenie, etc.
- Add to alertmanager.config.receivers
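Before wiring Slack or a custom webhook into AlertManager, it helps to confirm the endpoint itself accepts messages. A minimal sketch assuming a standard Slack incoming webhook (replace the placeholder URL with your real one):

```bash
# Post a test message directly to the webhook; Slack replies with "ok"
# when the webhook URL is valid.
curl -s -X POST 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL' \
  -H 'Content-Type: application/json' \
  -d '{"text": "Test notification from bakery-ia monitoring setup"}'
```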
### Step 3: Configure Backups
```bash
@@ -992,26 +1152,61 @@ kubectl edit configmap -n monitoring alertmanager-config
# Update recipient emails in the routes section
```
### Step 4: Verify Monitoring is Working
### Step 4: Verify SigNoz Monitoring is Working
Before proceeding, ensure all monitoring components are operational:
```bash
# 1. Check Prometheus targets
# Open: https://monitoring.bakewise.ai/prometheus/targets
# All targets should show "UP" status
# 1. Verify SigNoz pods are running
kubectl get pods -n bakery-ia -l app.kubernetes.io/instance=signoz
# 2. Verify Grafana dashboards load data
# Open: https://monitoring.bakewise.ai/grafana
# Navigate to any dashboard and verify metrics are displaying
# Expected pods (all should be Running/Ready):
# - signoz-0 (or signoz-1, signoz-2 for HA)
# - signoz-otel-collector-xxx
# - signoz-alertmanager-xxx
# - signoz-clickhouse-xxx
# - signoz-zookeeper-xxx
# 3. Check SigNoz is receiving traces
# Open: https://monitoring.bakewise.ai/signoz
# Search for traces from "gateway" service
# 2. Check SigNoz UI is accessible
curl -I https://monitoring.bakewise.ai/signoz
# Should return: HTTP/2 200
# 4. Verify AlertManager cluster
# Open: https://monitoring.bakewise.ai/alertmanager
# Check that all 3 AlertManager instances are connected
# 3. Verify OTel Collector is receiving data
kubectl logs -n bakery-ia deployment/signoz-otel-collector --tail=100 | grep -i "received"
# Should show: "Traces received: X" "Metrics received: Y" "Logs received: Z"
# 4. Check ClickHouse database is healthy
kubectl exec -n bakery-ia deployment/signoz-clickhouse -- clickhouse-client --query="SELECT count() FROM system.tables WHERE database LIKE 'signoz_%'"
# Should return a number > 0 (tables exist)
```
**Complete Verification Checklist:**
- [ ] **SigNoz UI loads** at https://monitoring.bakewise.ai/signoz
- [ ] **Services tab shows all 18 microservices** with metrics
- [ ] **Traces tab has sample traces** from gateway and other services
- [ ] **Dashboards tab shows PostgreSQL metrics** from all 18 databases
- [ ] **Dashboards tab shows Redis metrics** (memory, commands, etc.)
- [ ] **Dashboards tab shows RabbitMQ metrics** (queues, messages)
- [ ] **Dashboards tab shows Kubernetes metrics** (nodes, pods)
- [ ] **Logs tab displays logs** from all services in bakery-ia namespace
- [ ] **Alerts tab is accessible** and can create new alerts
- [ ] **AlertManager** is reachable at https://monitoring.bakewise.ai/alertmanager
**If any checks fail, troubleshoot:**
```bash
# Check OTel Collector configuration
kubectl describe configmap -n bakery-ia signoz-otel-collector
# Check for errors in OTel Collector
kubectl logs -n bakery-ia deployment/signoz-otel-collector | grep -i error
# Check ClickHouse is accepting writes
kubectl logs -n bakery-ia deployment/signoz-clickhouse | grep -i error
# Restart OTel Collector if needed
kubectl rollout restart deployment/signoz-otel-collector -n bakery-ia
```
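One failure mode worth checking explicitly is ClickHouse running out of disk, which stalls ingestion without an obvious error in the UI. A hedged sketch (assumes the default ClickHouse data path of `/var/lib/clickhouse` inside the container):

```bash
# Check free space on the ClickHouse data volume
kubectl exec -n bakery-ia deployment/signoz-clickhouse -- df -h /var/lib/clickhouse

# Compare against the PVC size requested for ClickHouse
kubectl get pvc -n bakery-ia | grep -i clickhouse
```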
### Step 5: Document Everything
@@ -1033,41 +1228,113 @@ Create a secure runbook with all credentials and procedures:
### Step 6: Train Your Team
Conduct a training session covering:
Conduct a training session covering SigNoz and operational procedures:
- [ ] **Access monitoring dashboards**
- Show how to login to https://monitoring.bakewise.ai/grafana
- Walk through key dashboards (Services Overview, Database, Infrastructure)
- Explain how to interpret metrics and identify issues
#### Part 1: SigNoz Navigation (30 minutes)
- [ ] **Check application logs**
- [ ] **Login and Overview**
- Show how to access https://monitoring.bakewise.ai/signoz
- Navigate through main tabs: Services, Traces, Dashboards, Logs, Alerts
- Explain the unified nature of SigNoz (all-in-one platform)
- [ ] **Services Tab - Application Performance Monitoring**
- Show all 18 microservices
- Explain RED metrics (Request rate, Error rate, Duration/latency)
- Demo: Click on a service → Operations → See endpoint breakdown
- Demo: Identify slow endpoints and high error rates
- [ ] **Traces Tab - Request Flow Debugging**
- Show how to search for traces by service, operation, or time
- Demo: Click on a trace → See full waterfall (service → database → cache)
- Demo: Find slow database queries in trace spans
- Demo: Click "View Logs" to correlate trace with logs
- [ ] **Dashboards Tab - Infrastructure Monitoring**
- Navigate to PostgreSQL dashboard → Show all 18 databases
- Navigate to Redis dashboard → Show cache metrics
- Navigate to Kubernetes dashboard → Show node/pod metrics
- Explain what metrics indicate issues (connection %, memory %, etc.)
- [ ] **Logs Tab - Log Search and Analysis**
- Show how to filter by service, severity, time range
- Demo: Search for "error" in last hour
- Demo: Click on trace_id in log → Jump to related trace
- Show Kubernetes metadata (pod, namespace, container)
- [ ] **Alerts Tab - Proactive Monitoring**
- Show how to create alerts on metrics
- Review pre-configured alerts
- Show alert history and firing alerts
- Explain how to acknowledge/silence alerts
#### Part 2: Operational Tasks (30 minutes)
- [ ] **Check application logs** (multiple ways)
```bash
# View logs for a service
# Method 1: Via kubectl (for immediate debugging)
kubectl logs -n bakery-ia deployment/orders-service --tail=100 -f
# Search for errors
kubectl logs -n bakery-ia deployment/gateway | grep ERROR
# Method 2: Via SigNoz Logs tab (for analysis and correlation)
# 1. Open https://monitoring.bakewise.ai/signoz → Logs
# 2. Filter by k8s_deployment_name: orders-service
# 3. Click on trace_id to see related request flow
```
- [ ] **Restart services when needed**
```bash
# Restart a service (rolling update, no downtime)
kubectl rollout restart deployment/orders-service -n bakery-ia
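# Hedged follow-up (timeout value is an assumption): wait for the rollout
# to complete before checking SigNoz for the recovery.
kubectl rollout status deployment/orders-service -n bakery-ia --timeout=120s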
# Verify restart in SigNoz:
# 1. Check Services tab → orders-service → Should show brief dip then recovery
# 2. Check Logs tab → Filter by orders-service → See restart logs
```
- [ ] **Investigate performance issues**
```bash
# Scenario: "Orders API is slow"
# 1. SigNoz → Services → orders-service → Check P99 latency
# 2. SigNoz → Traces → Filter service:orders-service, duration:>1s
# 3. Click on slow trace → Identify bottleneck (DB query? External API?)
# 4. SigNoz → Dashboards → PostgreSQL → Check orders_db connections/queries
# 5. Fix identified issue (add index, optimize query, scale service)
```
- [ ] **Respond to alerts**
- Show how to access AlertManager at https://monitoring.bakewise.ai/alertmanager
- Show how to access alerts in SigNoz → Alerts tab
- Show AlertManager UI at https://monitoring.bakewise.ai/alertmanager
- Review common alerts and their resolution steps
- Reference the [Production Operations Guide](./PRODUCTION_OPERATIONS_GUIDE.md)
#### Part 3: Documentation and Resources (10 minutes)
- [ ] **Share documentation**
- [PILOT_LAUNCH_GUIDE.md](./PILOT_LAUNCH_GUIDE.md) - This guide
- [PRODUCTION_OPERATIONS_GUIDE.md](./PRODUCTION_OPERATIONS_GUIDE.md) - Daily operations
- [PILOT_LAUNCH_GUIDE.md](./PILOT_LAUNCH_GUIDE.md) - This guide (deployment)
- [PRODUCTION_OPERATIONS_GUIDE.md](./PRODUCTION_OPERATIONS_GUIDE.md) - Daily operations with SigNoz
- [security-checklist.md](./security-checklist.md) - Security procedures
- [ ] **Bookmark key URLs**
- SigNoz: https://monitoring.bakewise.ai/signoz
- AlertManager: https://monitoring.bakewise.ai/alertmanager
- Production app: https://bakewise.ai
- [ ] **Setup on-call rotation** (if applicable)
- Configure in AlertManager
- Configure rotation schedule in AlertManager
- Document escalation procedures
- Test alert delivery to on-call phone/email
#### Part 4: Hands-On Exercise (15 minutes)
**Exercise: Investigate a Simulated Issue**
1. Create a load test to generate traffic (see the sketch below)
2. Use SigNoz to find the slowest endpoint
3. Identify the root cause using traces
4. Correlate with logs to confirm
5. Check infrastructure metrics (DB, memory, CPU)
6. Propose a fix based on findings
This trains the team to use SigNoz effectively for real incidents.
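For step 1, a hedged traffic-generation sketch using the public health endpoint referenced earlier in this guide (adjust the URL and request count to something representative of your workload):

```bash
# Generate ~200 requests of steady traffic so traces, metrics, and logs
# appear in SigNoz within a few minutes.
for i in $(seq 1 200); do
  curl -s -o /dev/null -w "%{http_code} " https://bakewise.ai/api/v1/health
  sleep 0.2
done
echo
```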
---
@@ -1204,17 +1471,33 @@ kubectl scale deployment monitoring -n bakery-ia --replicas=0
- **RBAC Implementation:** [rbac-implementation.md](./rbac-implementation.md) - Access control
**Monitoring Access:**
- **Grafana:** https://monitoring.bakewise.ai/grafana (admin / your-password)
- **Prometheus:** https://monitoring.bakewise.ai/prometheus
- **SigNoz:** https://monitoring.bakewise.ai/signoz
- **AlertManager:** https://monitoring.bakewise.ai/alertmanager
- **SigNoz (Primary):** https://monitoring.bakewise.ai/signoz - All-in-one observability
- Services: Application performance monitoring (APM)
- Traces: Distributed tracing across all services
- Dashboards: PostgreSQL, Redis, RabbitMQ, Kubernetes metrics
- Logs: Centralized log management with trace correlation
- Alerts: Alert configuration and management
- **AlertManager:** https://monitoring.bakewise.ai/alertmanager - Alert routing and notifications
**External Resources:**
- **MicroK8s Docs:** https://microk8s.io/docs
- **Kubernetes Docs:** https://kubernetes.io/docs
- **Let's Encrypt:** https://letsencrypt.org/docs
- **Cloudflare DNS:** https://developers.cloudflare.com/dns
- **Monitoring Stack README:** infrastructure/kubernetes/base/components/monitoring/README.md
- **SigNoz Documentation:** https://signoz.io/docs/
- **OpenTelemetry Documentation:** https://opentelemetry.io/docs/
**Monitoring Architecture:**
- **OpenTelemetry:** Industry-standard instrumentation framework
- Auto-instruments FastAPI, HTTPX, SQLAlchemy, Redis
- Collects traces, metrics, and logs from all services
- Exports to SigNoz via OTLP protocol (gRPC port 4317, HTTP port 4318)
- **SigNoz Components:**
- **Frontend:** Web UI for visualization and analysis
- **OTel Collector:** Receives and processes telemetry data
- **ClickHouse:** Time-series database for fast queries
- **AlertManager:** Alert routing and notification delivery
- **Zookeeper:** Coordination service for ClickHouse cluster
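To confirm a given service is actually wired into this pipeline, inspect its OTLP-related environment variables. A hedged sketch (the variable names follow standard OpenTelemetry SDK conventions and are assumptions about how these services are configured):

```bash
# Look for the OTLP exporter endpoint and service name on a service pod
kubectl exec -n bakery-ia deployment/auth-service -- env | grep -E "OTEL_EXPORTER_OTLP|OTEL_SERVICE_NAME"

# Expected (approximately; exact values depend on your deployment):
#   OTEL_EXPORTER_OTLP_ENDPOINT=http://signoz-otel-collector:4317
#   OTEL_SERVICE_NAME=auth-service
```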
---