Improve monitoring for prod
# 🚀 Quick Start: Deploy Monitoring to Production

**Time to deploy: ~15 minutes**

---

## Step 1: Update Secrets (5 min)
```bash
cd infrastructure/kubernetes/base/components/monitoring

# 1. Generate a strong password
GRAFANA_PASS=$(openssl rand -base64 32)
echo "Grafana Password: $GRAFANA_PASS" > ~/SAVE_THIS_PASSWORD.txt

# 2. Edit secrets.yaml and replace:
# - CHANGE_ME_IN_PRODUCTION (Grafana password)
# - SMTP settings (your email server)
# - PostgreSQL connection string (your DB)
nano secrets.yaml
```
**Required Changes in secrets.yaml:**
```yaml
# Line 13: Change Grafana password
admin-password: "YOUR_STRONG_PASSWORD_HERE"

# Lines 30-33: Update SMTP settings
smtp-host: "smtp.gmail.com:587"
smtp-username: "your-alerts@yourdomain.com"
smtp-password: "YOUR_SMTP_PASSWORD"
smtp-from: "alerts@yourdomain.com"

# Line 49: Update PostgreSQL connection
data-source-name: "postgresql://USER:PASSWORD@postgres.bakery-ia:5432/bakery?sslmode=require"
```
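If you would rather not keep real credentials in a file, the same values can be applied as a Secret straight from the shell. A minimal sketch, assuming a Secret named `monitoring-secrets` in the `monitoring` namespace with the key names shown above — align both with what secrets.yaml actually defines before running:

```bash
# Hypothetical Secret name/namespace -- match them to secrets.yaml
kubectl create secret generic monitoring-secrets \
  --namespace=monitoring \
  --from-literal=admin-password="$GRAFANA_PASS" \
  --from-literal=smtp-password="YOUR_SMTP_PASSWORD" \
  --dry-run=client -o yaml | kubectl apply -f -
```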
---

## Step 2: Update Alert Email Addresses (2 min)

```bash
# Edit alertmanager.yaml to set your team's email addresses
nano alertmanager.yaml

# Update these lines (search for @yourdomain.com):
# - Line 93: to: 'alerts@yourdomain.com'
# - Line 101: to: 'critical-alerts@yourdomain.com,oncall@yourdomain.com'
# - Line 116: to: 'alerts@yourdomain.com'
# - Line 125: to: 'alert-system-team@yourdomain.com'
# - Line 134: to: 'database-team@yourdomain.com'
# - Line 143: to: 'infra-team@yourdomain.com'
```
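Instead of editing each recipient by hand, the placeholder domain can be swapped in one pass. A sketch, assuming your real domain is `example.com` (GNU sed shown; on macOS/BSD use `sed -i ''`), and worth previewing before applying:

```bash
# Preview which lines will change
grep -n '@yourdomain.com' alertmanager.yaml

# Replace the placeholder domain everywhere in the file
sed -i 's/@yourdomain\.com/@example.com/g' alertmanager.yaml
```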
---

## Step 3: Deploy to Production (3 min)

```bash
# Return to the project root
cd /path/to/bakery-ia  # adjust to where you cloned the repo

# Deploy the entire stack
kubectl apply -k infrastructure/kubernetes/overlays/prod

# Watch the pods come up
kubectl get pods -n monitoring -w
```
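If you want to see what the overlay will create before touching the cluster, kubectl can render and diff it first:

```bash
# Render the manifests locally without applying
kubectl kustomize infrastructure/kubernetes/overlays/prod | less

# Show the diff against the live cluster state
kubectl diff -k infrastructure/kubernetes/overlays/prod
```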
**Expected Output:**
```
NAME                      READY   STATUS    RESTARTS   AGE
prometheus-0              1/1     Running   0          2m
prometheus-1              1/1     Running   0          1m
alertmanager-0            2/2     Running   0          2m
alertmanager-1            2/2     Running   0          1m
alertmanager-2            2/2     Running   0          1m
grafana-xxxxx             1/1     Running   0          2m
postgres-exporter-xxxxx   1/1     Running   0          2m
node-exporter-xxxxx       1/1     Running   0          2m
jaeger-xxxxx              1/1     Running   0          2m
```
---

## Step 4: Verify Deployment (3 min)

```bash
# Check all pods are running
kubectl get pods -n monitoring

# Check storage is provisioned
kubectl get pvc -n monitoring

# Check services are created
kubectl get svc -n monitoring
```
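The same verification, scripted so it can fail fast in a deploy pipeline; `kubectl wait` blocks until the pods report Ready or the timeout expires:

```bash
# Fail if any monitoring pod is not Ready within 5 minutes
kubectl wait pods --all -n monitoring --for=condition=Ready --timeout=300s

# Every PVC should report Bound
kubectl get pvc -n monitoring -o custom-columns=NAME:.metadata.name,STATUS:.status.phase
```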
---

## Step 5: Access Dashboards (2 min)

### **Option A: Via Ingress (if configured)**
```
https://monitoring.yourdomain.com/grafana
https://monitoring.yourdomain.com/prometheus
https://monitoring.yourdomain.com/alertmanager
https://monitoring.yourdomain.com/jaeger
```
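If you are not sure the ingress is configured, check for it before relying on these URLs (empty output means fall back to Option B):

```bash
# List ingresses in the monitoring namespace
kubectl get ingress -n monitoring
```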
### **Option B: Via Port Forwarding**
```bash
# Grafana
kubectl port-forward -n monitoring svc/grafana 3000:3000 &

# Prometheus
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090 &

# AlertManager
kubectl port-forward -n monitoring svc/alertmanager-external 9093:9093 &

# Jaeger
kubectl port-forward -n monitoring svc/jaeger-query 16686:16686 &

# Now access:
# - Grafana: http://localhost:3000 (admin / YOUR_PASSWORD)
# - Prometheus: http://localhost:9090
# - AlertManager: http://localhost:9093
# - Jaeger: http://localhost:16686
```
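Each of those commands backgrounds a process in your shell, so clean them up when you are done:

```bash
# Stop the port-forwards started above (job numbers depend on your shell session)
kill %1 %2 %3 %4 2>/dev/null

# Or stop them by matching the command line
pkill -f 'kubectl port-forward -n monitoring'
```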
---

## Step 6: Verify Everything Works (5 min)

### **Check Prometheus Targets**
1. Open Prometheus: http://localhost:9090
2. Go to Status → Targets
3. Verify all targets are **UP** (a scripted check follows this list):
   - prometheus (1/1 up)
   - bakery-services (multiple pods up)
   - alertmanager (3/3 up)
   - postgres-exporter (1/1 up)
   - node-exporter (N/N up, where N = number of nodes)
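
The same verification can be scripted against the Prometheus HTTP API (this assumes the Step 5 port-forward to localhost:9090 is still active):

```bash
# Summarize scrape-target health; everything should land in the "up" group
curl -s http://localhost:9090/api/v1/targets \
  | jq '.data.activeTargets | group_by(.health) | map({health: .[0].health, count: length})'
```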
### **Check Grafana Dashboards**
1. Open Grafana: http://localhost:3000
2. Log in with admin / YOUR_PASSWORD
3. Go to Dashboards → Browse
4. You should see 11 dashboards, including:
   - Bakery IA folder: Gateway Metrics, Services Overview, Circuit Breakers
   - Bakery IA - Extended folder: PostgreSQL, Node Exporter, AlertManager, Business Metrics
5. Open any dashboard and verify data is loading
### **Test Alert Flow**
```bash
# Fire a test alert by creating a high-memory pod
kubectl run memory-test --image=polinux/stress --restart=Never \
  --namespace=bakery-ia -- stress --vm 1 --vm-bytes 600M --timeout 300s

# Wait 5 minutes, then check:
# 1. Prometheus Alerts: http://localhost:9090/alerts
#    - Should see "HighMemoryUsage" firing
# 2. AlertManager: http://localhost:9093
#    - Should see the alert
# 3. Email inbox - should receive a notification

# Clean up
kubectl delete pod memory-test -n bakery-ia
```
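The alert's progress can also be followed from the command line (assumes the Step 5 port-forwards are still running):

```bash
# Alerts as Prometheus sees them (pending/firing)
curl -s http://localhost:9090/api/v1/alerts \
  | jq '.data.alerts[] | {name: .labels.alertname, state: .state}'

# Alerts currently held by AlertManager (v2 API)
curl -s http://localhost:9093/api/v2/alerts | jq '.[].labels.alertname'
```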
### **Verify Jaeger Tracing**
1. Make a request to your API:
```bash
curl -H "Authorization: Bearer YOUR_TOKEN" \
  https://api.yourdomain.com/api/v1/health
```
2. Open Jaeger: http://localhost:16686
3. Select a service from the dropdown
4. Click "Find Traces"
5. You should see traces appearing

---

## ✅ Success Criteria

Your monitoring is working correctly if:

- [ ] All Prometheus targets show "UP" status
- [ ] Grafana dashboards display metrics
- [ ] AlertManager cluster shows 3/3 members
- [ ] Test alert fired and email received
- [ ] Jaeger shows traces from services
- [ ] No pods in CrashLoopBackOff state
- [ ] All PVCs are Bound

---
## 🔧 Troubleshooting

### **Problem: Pods not starting**
```bash
# Check pod status
kubectl describe pod POD_NAME -n monitoring

# Check logs
kubectl logs POD_NAME -n monitoring

# Common issues:
# - Insufficient resources: Check node capacity
# - PVC not binding: Check storage class exists
# - Image pull errors: Check network/registry access
```
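Namespace events often surface scheduling, storage, and image problems faster than per-pod digging:

```bash
# Recent events in the monitoring namespace, newest last
kubectl get events -n monitoring --sort-by=.lastTimestamp | tail -n 20
```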
### **Problem: Prometheus targets DOWN**
```bash
# Check if services exist
kubectl get svc -n bakery-ia

# Check if pods have correct labels
kubectl get pods -n bakery-ia --show-labels

# Check if pods expose metrics port (8080)
kubectl get pod POD_NAME -n bakery-ia -o yaml | grep -A 5 ports
```
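To confirm a pod actually serves metrics on that port, fetch from it directly. A sketch, assuming the conventional `/metrics` path:

```bash
# Forward the suspect pod's port and pull a sample of its metrics
kubectl port-forward -n bakery-ia POD_NAME 8080:8080 &
sleep 2
curl -s http://localhost:8080/metrics | head -n 5
kill %1
```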
### **Problem: Grafana shows "No Data"**
```bash
# Test the Prometheus datasource
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090 &

# Run a test query against Prometheus
curl "http://localhost:9090/api/v1/query?query=up" | jq

# If Prometheus has data but Grafana doesn't, check Grafana's datasource config
```
### **Problem: Alerts not firing**
```bash
# Check alert rules are loaded
kubectl logs -n monitoring prometheus-0 | grep "Loading configuration"

# Check AlertManager config
kubectl exec -n monitoring alertmanager-0 -- cat /etc/alertmanager/alertmanager.yml

# Test SMTP connection
kubectl exec -n monitoring alertmanager-0 -- \
  nc -zv smtp.gmail.com 587
```
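AlertManager ships with `amtool`, which can validate the configuration in place; a sketch, assuming the binary is on the container's PATH (it is in the official image):

```bash
# Validate the running AlertManager configuration
kubectl exec -n monitoring alertmanager-0 -- \
  amtool check-config /etc/alertmanager/alertmanager.yml
```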
---

## 📞 Need Help?

1. Check full documentation: [infrastructure/kubernetes/base/components/monitoring/README.md](infrastructure/kubernetes/base/components/monitoring/README.md)
2. Review deployment summary: [MONITORING_DEPLOYMENT_SUMMARY.md](MONITORING_DEPLOYMENT_SUMMARY.md)
3. Check Prometheus logs: `kubectl logs -n monitoring prometheus-0`
4. Check AlertManager logs: `kubectl logs -n monitoring alertmanager-0`
5. Check Grafana logs: `kubectl logs -n monitoring deployment/grafana`

---
## 🎉 You're Done!

Your monitoring stack is now running in production!

**Next steps:**
1. Save your Grafana password securely
2. Set up on-call rotation
3. Review alert thresholds and adjust as needed
4. Create team-specific dashboards
5. Train team on using monitoring tools

**Access your monitoring:**
- Grafana: https://monitoring.yourdomain.com/grafana
- Prometheus: https://monitoring.yourdomain.com/prometheus
- AlertManager: https://monitoring.yourdomain.com/alertmanager
- Jaeger: https://monitoring.yourdomain.com/jaeger

---

*Deployment time: ~15 minutes*

*Last updated: 2026-01-07*