Improve monitoring for prod

This commit is contained in:
Urtzi Alfaro
2026-01-07 19:12:35 +01:00
parent 560c7ba86f
commit 07178f8972
44 changed files with 6581 additions and 5111 deletions

# 🚀 Quick Start: Deploy Monitoring to Production
**Time to deploy: ~15 minutes**
---
## Step 1: Update Secrets (5 min)
```bash
cd infrastructure/kubernetes/base/components/monitoring
# 1. Generate strong passwords
GRAFANA_PASS=$(openssl rand -base64 32)
echo "Grafana Password: $GRAFANA_PASS" > ~/SAVE_THIS_PASSWORD.txt
# 2. Edit secrets.yaml and replace:
#    - CHANGE_ME_IN_PRODUCTION (paste the Grafana password generated above)
# - SMTP settings (your email server)
# - PostgreSQL connection string (your DB)
nano secrets.yaml
```
**Required Changes in secrets.yaml:**
```yaml
# Line 13: Change Grafana password
admin-password: "YOUR_STRONG_PASSWORD_HERE"
# Lines 30-33: Update SMTP settings
smtp-host: "smtp.gmail.com:587"
smtp-username: "your-alerts@yourdomain.com"
smtp-password: "YOUR_SMTP_PASSWORD"
smtp-from: "alerts@yourdomain.com"
# Line 49: Update PostgreSQL connection
data-source-name: "postgresql://USER:PASSWORD@postgres.bakery-ia:5432/bakery?sslmode=require"
```
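If you'd rather not open an editor for the password itself, a search-and-replace works too. A minimal sketch, assuming the placeholder string in secrets.yaml is exactly `CHANGE_ME_IN_PRODUCTION`:
```bash
# Drop the generated password into secrets.yaml (keeps a .bak copy of the original).
# base64 output contains only A-Za-z0-9+/=, so it is safe inside this sed replacement.
sed -i.bak "s|CHANGE_ME_IN_PRODUCTION|${GRAFANA_PASS}|" secrets.yaml
# Verify the placeholder is gone
grep -n "CHANGE_ME_IN_PRODUCTION" secrets.yaml || echo "placeholder replaced"
```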
---
## Step 2: Update Alert Email Addresses (2 min)
```bash
# Edit alertmanager.yaml to set your team's email addresses
nano alertmanager.yaml
# Update these lines (search for @yourdomain.com):
# - Line 93: to: 'alerts@yourdomain.com'
# - Line 101: to: 'critical-alerts@yourdomain.com,oncall@yourdomain.com'
# - Line 116: to: 'alerts@yourdomain.com'
# - Line 125: to: 'alert-system-team@yourdomain.com'
# - Line 134: to: 'database-team@yourdomain.com'
# - Line 143: to: 'infra-team@yourdomain.com'
```
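If every recipient lives on the same domain, one pass with `sed` keeps the addresses consistent. A sketch, assuming only the `@yourdomain.com` placeholders should change (swap in your real domain):
```bash
# Preview which lines will be touched
grep -n "@yourdomain.com" alertmanager.yaml
# Replace the placeholder domain everywhere (keeps a .bak backup)
sed -i.bak 's/@yourdomain\.com/@example.com/g' alertmanager.yaml
```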
---
## Step 3: Deploy to Production (3 min)
```bash
# Return to the project root (adjust the path to your checkout)
cd /path/to/bakery-ia
# Deploy the entire stack
kubectl apply -k infrastructure/kubernetes/overlays/prod
# Watch the pods come up
kubectl get pods -n monitoring -w
```
**Expected Output:**
```
NAME READY STATUS RESTARTS AGE
prometheus-0 1/1 Running 0 2m
prometheus-1 1/1 Running 0 1m
alertmanager-0 2/2 Running 0 2m
alertmanager-1 2/2 Running 0 1m
alertmanager-2 2/2 Running 0 1m
grafana-xxxxx 1/1 Running 0 2m
postgres-exporter-xxxxx 1/1 Running 0 2m
node-exporter-xxxxx 1/1 Running 0 2m
jaeger-xxxxx 1/1 Running 0 2m
```
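If you'd rather block until everything is Ready instead of watching interactively, `kubectl` can wait for you. A sketch, assuming the workload names match the pod names above (Prometheus and AlertManager as StatefulSets, Grafana as a Deployment):
```bash
# Wait for the stateful components to finish rolling out
kubectl rollout status statefulset/prometheus -n monitoring --timeout=5m
kubectl rollout status statefulset/alertmanager -n monitoring --timeout=5m
kubectl rollout status deployment/grafana -n monitoring --timeout=5m
# Or simply wait for every pod in the namespace to report Ready
kubectl wait pods --all -n monitoring --for=condition=Ready --timeout=5m
```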
---
## Step 4: Verify Deployment (3 min)
```bash
# Check all pods are running
kubectl get pods -n monitoring
# Check storage is provisioned
kubectl get pvc -n monitoring
# Check services are created
kubectl get svc -n monitoring
```
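To spot trouble at a glance, filter for anything that is not healthy:
```bash
# Any pod not in the Running phase deserves a closer look
kubectl get pods -n monitoring --field-selector=status.phase!=Running
# Any PVC that is not Bound will block the pod that mounts it
kubectl get pvc -n monitoring | grep -v Bound || true
```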
---
## Step 5: Access Dashboards (2 min)
### **Option A: Via Ingress (if configured)**
```
https://monitoring.yourdomain.com/grafana
https://monitoring.yourdomain.com/prometheus
https://monitoring.yourdomain.com/alertmanager
https://monitoring.yourdomain.com/jaeger
```
### **Option B: Via Port Forwarding**
```bash
# Grafana
kubectl port-forward -n monitoring svc/grafana 3000:3000 &
# Prometheus
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090 &
# AlertManager
kubectl port-forward -n monitoring svc/alertmanager-external 9093:9093 &
# Jaeger
kubectl port-forward -n monitoring svc/jaeger-query 16686:16686 &
# Now access:
# - Grafana: http://localhost:3000 (admin / YOUR_PASSWORD)
# - Prometheus: http://localhost:9090
# - AlertManager: http://localhost:9093
# - Jaeger: http://localhost:16686
```
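The trailing `&` runs each port-forward in the background of your current shell; stop them when you're done:
```bash
# Stop the background port-forwards started above
kill %1 %2 %3 %4 2>/dev/null
# Or, more bluntly, kill any kubectl port-forward still running
pkill -f "kubectl port-forward"
```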
---
## Step 6: Verify Everything Works (5 min)
### **Check Prometheus Targets**
1. Open Prometheus: http://localhost:9090
2. Go to Status → Targets
3. Verify all targets are **UP**:
- prometheus (1/1 up)
- bakery-services (multiple pods up)
- alertmanager (3/3 up)
- postgres-exporter (1/1 up)
- node-exporter (N/N up, where N = number of nodes)
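You can also check target health from the command line through the Prometheus HTTP API (uses the Step 5 port-forward and `jq`):
```bash
# Count targets per job and health state; everything should report "up"
curl -s http://localhost:9090/api/v1/targets \
  | jq -r '.data.activeTargets[] | "\(.labels.job) \(.health)"' \
  | sort | uniq -c
```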
### **Check Grafana Dashboards**
1. Open Grafana: http://localhost:3000
2. Login with admin / YOUR_PASSWORD
3. Go to Dashboards → Browse
4. You should see 11 provisioned dashboards, including:
- Bakery IA folder: Gateway Metrics, Services Overview, Circuit Breakers
- Bakery IA - Extended folder: PostgreSQL, Node Exporter, AlertManager, Business Metrics
5. Open any dashboard and verify data is loading
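Grafana's HTTP API gives a quick scripted sanity check as well (uses the Step 5 port-forward and the admin password you set in Step 1):
```bash
# Health endpoint: "database" should be "ok"
curl -s http://localhost:3000/api/health
# List provisioned dashboards by title
curl -s -u admin:"$GRAFANA_PASS" "http://localhost:3000/api/search?type=dash-db" | jq -r '.[].title'
```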
### **Test Alert Flow**
```bash
# Fire a test alert by creating a high-memory pod
kubectl run memory-test --image=polinux/stress --restart=Never \
--namespace=bakery-ia -- stress --vm 1 --vm-bytes 600M --timeout 300s
# Wait 5 minutes, then check:
# 1. Prometheus Alerts: http://localhost:9090/alerts
# - Should see "HighMemoryUsage" firing
# 2. AlertManager: http://localhost:9093
# - Should see the alert
# 3. Email inbox - Should receive notification
# Clean up
kubectl delete pod memory-test -n bakery-ia
```
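To inspect the active alerts without opening the UI, `amtool` (shipped in the official Alertmanager image; adjust if yours differs) can query the running instance:
```bash
# List alerts currently held by AlertManager
kubectl exec -n monitoring alertmanager-0 -- \
  amtool alert query --alertmanager.url=http://localhost:9093
```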
### **Verify Jaeger Tracing**
1. Make a request to your API:
   ```bash
   curl -H "Authorization: Bearer YOUR_TOKEN" \
     https://api.yourdomain.com/api/v1/health
   ```
2. Open Jaeger: http://localhost:16686
3. Select a service from dropdown
4. Click "Find Traces"
5. You should see traces appearing
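Jaeger's query service also exposes a small HTTP API you can poke from the terminal (uses the Step 5 port-forward; the service name below is just an example):
```bash
# Services that have reported at least one span
curl -s http://localhost:16686/api/services | jq
# Recent traces for one service -- replace "gateway" with a name from the list above
curl -s "http://localhost:16686/api/traces?service=gateway&limit=5" | jq '.data | length'
```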
---
## ✅ Success Criteria
Your monitoring is working correctly if:
- [ ] All Prometheus targets show "UP" status
- [ ] Grafana dashboards display metrics
- [ ] AlertManager cluster shows 3/3 members
- [ ] Test alert fired and email received
- [ ] Jaeger shows traces from services
- [ ] No pods in CrashLoopBackOff state
- [ ] All PVCs are Bound
---
## 🔧 Troubleshooting
### **Problem: Pods not starting**
```bash
# Check pod status
kubectl describe pod POD_NAME -n monitoring
# Check logs
kubectl logs POD_NAME -n monitoring
# Common issues:
# - Insufficient resources: Check node capacity
# - PVC not binding: Check storage class exists
# - Image pull errors: Check network/registry access
```
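Namespace events usually explain scheduling and storage failures faster than pod logs:
```bash
# Recent events, newest last -- look for FailedScheduling, FailedMount, ImagePullBackOff
kubectl get events -n monitoring --sort-by=.lastTimestamp | tail -20
```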
### **Problem: Prometheus targets DOWN**
```bash
# Check if services exist
kubectl get svc -n bakery-ia
# Check if pods have correct labels
kubectl get pods -n bakery-ia --show-labels
# Check if pods expose metrics port (8080)
kubectl get pod POD_NAME -n bakery-ia -o yaml | grep -A 5 ports
```
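To rule out the application side, hit a pod's metrics endpoint directly. A sketch, assuming the services expose Prometheus metrics on port 8080 as noted above:
```bash
# Forward the metrics port of one application pod (replace POD_NAME)
kubectl port-forward -n bakery-ia pod/POD_NAME 8080:8080 &
# If this prints Prometheus-format metrics, the target itself is healthy
curl -s http://localhost:8080/metrics | head -20
```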
### **Problem: Grafana shows "No Data"**
```bash
# Test Prometheus datasource
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090
# Run a test query in Prometheus
curl "http://localhost:9090/api/v1/query?query=up" | jq
# If Prometheus has data but Grafana doesn't, check Grafana datasource config
```
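If Prometheus answers the query above, check what Grafana's datasource is actually pointing at. A sketch using the Grafana API with the admin account:
```bash
# List configured datasources -- the Prometheus entry's url should match the in-cluster service
kubectl port-forward -n monitoring svc/grafana 3000:3000 &
curl -s -u admin:"$GRAFANA_PASS" http://localhost:3000/api/datasources | jq '.[] | {name, type, url}'
```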
### **Problem: Alerts not firing**
```bash
# Check alert rules are loaded
kubectl logs -n monitoring prometheus-0 | grep "Loading configuration"
# Check AlertManager config
kubectl exec -n monitoring alertmanager-0 -- cat /etc/alertmanager/alertmanager.yml
# Test SMTP connection
kubectl exec -n monitoring alertmanager-0 -- \
nc -zv smtp.gmail.com 587
```
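You can also confirm the alert rules actually loaded by asking Prometheus directly (uses the Step 5 port-forward):
```bash
# Every alerting rule with its current state (inactive / pending / firing)
curl -s http://localhost:9090/api/v1/rules \
  | jq -r '.data.groups[].rules[] | select(.type=="alerting") | "\(.name) \(.state)"'
```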
---
## 📞 Need Help?
1. Check full documentation: [infrastructure/kubernetes/base/components/monitoring/README.md](infrastructure/kubernetes/base/components/monitoring/README.md)
2. Review deployment summary: [MONITORING_DEPLOYMENT_SUMMARY.md](MONITORING_DEPLOYMENT_SUMMARY.md)
3. Check Prometheus logs: `kubectl logs -n monitoring prometheus-0`
4. Check AlertManager logs: `kubectl logs -n monitoring alertmanager-0`
5. Check Grafana logs: `kubectl logs -n monitoring deployment/grafana`
---
## 🎉 You're Done!
Your monitoring stack is now running in production!
**Next steps:**
1. Save your Grafana password securely
2. Set up on-call rotation
3. Review alert thresholds and adjust as needed
4. Create team-specific dashboards
5. Train team on using monitoring tools
**Access your monitoring:**
- Grafana: https://monitoring.yourdomain.com/grafana
- Prometheus: https://monitoring.yourdomain.com/prometheus
- AlertManager: https://monitoring.yourdomain.com/alertmanager
- Jaeger: https://monitoring.yourdomain.com/jaeger
---
*Deployment time: ~15 minutes*
*Last updated: 2026-01-07*