# 🚀 Quick Start: Deploy Monitoring to Production

**Time to deploy: ~15 minutes**

---

## Step 1: Update Secrets (5 min)

```bash
cd infrastructure/kubernetes/base/components/monitoring

# 1. Generate a strong password
GRAFANA_PASS=$(openssl rand -base64 32)
echo "Grafana Password: $GRAFANA_PASS" > ~/SAVE_THIS_PASSWORD.txt

# 2. Edit secrets.yaml and replace:
# - CHANGE_ME_IN_PRODUCTION (Grafana password)
# - SMTP settings (your email server)
# - PostgreSQL connection string (your DB)

nano secrets.yaml
```
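The saved password file sits on disk in plaintext, so it is worth locking down its permissions at creation time. A minimal sketch (the `/tmp` path is purely illustrative; in practice point it at `~/SAVE_THIS_PASSWORD.txt`):

```shell
# Restrict the password file to the current user (mode 600) before writing it.
# Illustrative /tmp path; use ~/SAVE_THIS_PASSWORD.txt as above in practice.
umask 077
GRAFANA_PASS=$(openssl rand -base64 32)
printf 'Grafana Password: %s\n' "$GRAFANA_PASS" > /tmp/SAVE_THIS_PASSWORD.txt
stat -c '%a' /tmp/SAVE_THIS_PASSWORD.txt
```

Note that `stat -c` is GNU coreutils (Linux); on macOS use `stat -f '%Lp'` instead.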
**Required Changes in secrets.yaml:**
```yaml
# Line 13: Change Grafana password
admin-password: "YOUR_STRONG_PASSWORD_HERE"

# Lines 30-33: Update SMTP settings
smtp-host: "smtp.gmail.com:587"
smtp-username: "your-alerts@yourdomain.com"
smtp-password: "YOUR_SMTP_PASSWORD"
smtp-from: "alerts@yourdomain.com"

# Line 49: Update PostgreSQL connection
data-source-name: "postgresql://USER:PASSWORD@postgres.bakery-ia:5432/bakery?sslmode=require"
```
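One caveat, as an aside: if your `secrets.yaml` stores these values under `data:` rather than `stringData:`, every value must be base64-encoded first. A quick sketch of encoding a generated password and sanity-checking the round trip:

```shell
# Kubernetes Secret `data:` values must be base64-encoded; `stringData:` accepts plaintext.
GRAFANA_PASS=$(openssl rand -base64 32 | tr -d '\n')
ENCODED=$(printf '%s' "$GRAFANA_PASS" | base64 | tr -d '\n')
DECODED=$(printf '%s' "$ENCODED" | base64 -d)
# Decoding must return the original value exactly
[ "$DECODED" = "$GRAFANA_PASS" ] && echo "round-trip OK"
```

Use `printf '%s'` rather than `echo` here so no trailing newline sneaks into the encoded secret.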
---

## Step 2: Update Alert Email Addresses (2 min)

```bash
# Edit alertmanager.yaml to set your team's email addresses
nano alertmanager.yaml

# Update these lines (search for @yourdomain.com):
# - Line 93: to: 'alerts@yourdomain.com'
# - Line 101: to: 'critical-alerts@yourdomain.com,oncall@yourdomain.com'
# - Line 116: to: 'alerts@yourdomain.com'
# - Line 125: to: 'alert-system-team@yourdomain.com'
# - Line 134: to: 'database-team@yourdomain.com'
# - Line 143: to: 'infra-team@yourdomain.com'
```
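If every recipient lives on the same real domain, a single `sed` pass can replace the placeholder domain instead of editing each line by hand. The sketch below runs on a throwaway demo file; in practice point it at `alertmanager.yaml` and substitute your domain for the hypothetical `example.com`:

```shell
# Throwaway demo file standing in for alertmanager.yaml
printf "to: 'alerts@yourdomain.com'\n" > /tmp/alertmanager-demo.yaml
# -i.bak edits in place and keeps a .bak backup (works with GNU and BSD sed)
sed -i.bak 's/@yourdomain\.com/@example.com/g' /tmp/alertmanager-demo.yaml
cat /tmp/alertmanager-demo.yaml
```

The `.bak` backup lets you diff against the original before committing the change.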
---

## Step 3: Deploy to Production (3 min)

```bash
# Return to the project root (adjust the path to your checkout)
cd /path/to/bakery-ia

# Deploy the entire stack
kubectl apply -k infrastructure/kubernetes/overlays/prod

# Watch the pods come up
kubectl get pods -n monitoring -w
```
**Expected Output:**
```
NAME                      READY   STATUS    RESTARTS   AGE
prometheus-0              1/1     Running   0          2m
prometheus-1              1/1     Running   0          1m
alertmanager-0            2/2     Running   0          2m
alertmanager-1            2/2     Running   0          1m
alertmanager-2            2/2     Running   0          1m
grafana-xxxxx             1/1     Running   0          2m
postgres-exporter-xxxxx   1/1     Running   0          2m
node-exporter-xxxxx       1/1     Running   0          2m
jaeger-xxxxx              1/1     Running   0          2m
```
---

## Step 4: Verify Deployment (3 min)

```bash
# Check all pods are running
kubectl get pods -n monitoring

# Check storage is provisioned
kubectl get pvc -n monitoring

# Check services are created
kubectl get svc -n monitoring
```
---

## Step 5: Access Dashboards (2 min)

### **Option A: Via Ingress (if configured)**
```
https://monitoring.yourdomain.com/grafana
https://monitoring.yourdomain.com/prometheus
https://monitoring.yourdomain.com/alertmanager
https://monitoring.yourdomain.com/jaeger
```
### **Option B: Via Port Forwarding**
```bash
# Grafana
kubectl port-forward -n monitoring svc/grafana 3000:3000 &

# Prometheus
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090 &

# AlertManager
kubectl port-forward -n monitoring svc/alertmanager-external 9093:9093 &

# Jaeger
kubectl port-forward -n monitoring svc/jaeger-query 16686:16686 &

# Now access:
# - Grafana: http://localhost:3000 (admin / YOUR_PASSWORD)
# - Prometheus: http://localhost:9090
# - AlertManager: http://localhost:9093
# - Jaeger: http://localhost:16686
```
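Each `&` leaves a port-forward running in the background. When you are finished, stop them all at once; the pattern below assumes you have no other `kubectl port-forward` sessions you want to keep alive:

```shell
# Kill every background kubectl port-forward started above
pkill -f 'kubectl port-forward' || echo "no port-forwards running"
```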
---

## Step 6: Verify Everything Works (5 min)

### **Check Prometheus Targets**
1. Open Prometheus: http://localhost:9090
2. Go to Status → Targets
3. Verify all targets are **UP**:
   - prometheus (1/1 up)
   - bakery-services (multiple pods up)
   - alertmanager (3/3 up)
   - postgres-exporter (1/1 up)
   - node-exporter (N/N up, where N = number of nodes)

### **Check Grafana Dashboards**
1. Open Grafana: http://localhost:3000
2. Login with admin / YOUR_PASSWORD
3. Go to Dashboards → Browse
4. You should see 11 dashboards:
   - Bakery IA folder: Gateway Metrics, Services Overview, Circuit Breakers
   - Bakery IA - Extended folder: PostgreSQL, Node Exporter, AlertManager, Business Metrics
5. Open any dashboard and verify data is loading
### **Test Alert Flow**
```bash
# Fire a test alert by creating a high-memory pod
kubectl run memory-test --image=polinux/stress --restart=Never \
  --namespace=bakery-ia -- stress --vm 1 --vm-bytes 600M --timeout 300s

# Wait 5 minutes, then check:
# 1. Prometheus Alerts: http://localhost:9090/alerts
#    - Should see "HighMemoryUsage" firing
# 2. AlertManager: http://localhost:9093
#    - Should see the alert
# 3. Email inbox - should receive a notification

# Clean up
kubectl delete pod memory-test -n bakery-ia
```
### **Verify Jaeger Tracing**
1. Make a request to your API:
   ```bash
   curl -H "Authorization: Bearer YOUR_TOKEN" \
     https://api.yourdomain.com/api/v1/health
   ```
2. Open Jaeger: http://localhost:16686
3. Select a service from the dropdown
4. Click "Find Traces"
5. You should see traces appearing
---

## ✅ Success Criteria

Your monitoring is working correctly if:

- [ ] All Prometheus targets show "UP" status
- [ ] Grafana dashboards display metrics
- [ ] AlertManager cluster shows 3/3 members
- [ ] Test alert fired and email received
- [ ] Jaeger shows traces from services
- [ ] No pods in CrashLoopBackOff state
- [ ] All PVCs are Bound
---

## 🔧 Troubleshooting

### **Problem: Pods not starting**
```bash
# Check pod status
kubectl describe pod POD_NAME -n monitoring

# Check logs
kubectl logs POD_NAME -n monitoring

# Common issues:
# - Insufficient resources: Check node capacity
# - PVC not binding: Check storage class exists
# - Image pull errors: Check network/registry access
```
### **Problem: Prometheus targets DOWN**
```bash
# Check if services exist
kubectl get svc -n bakery-ia

# Check if pods have correct labels
kubectl get pods -n bakery-ia --show-labels

# Check if pods expose metrics port (8080)
kubectl get pod POD_NAME -n bakery-ia -o yaml | grep -A 5 ports
```
### **Problem: Grafana shows "No Data"**
```bash
# Test Prometheus datasource
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090

# Run a test query in Prometheus
curl "http://localhost:9090/api/v1/query?query=up" | jq

# If Prometheus has data but Grafana doesn't, check Grafana datasource config
```
### **Problem: Alerts not firing**
```bash
# Check alert rules are loaded
kubectl logs -n monitoring prometheus-0 | grep "Loading configuration"

# Check AlertManager config
kubectl exec -n monitoring alertmanager-0 -- cat /etc/alertmanager/alertmanager.yml

# Test SMTP connection
kubectl exec -n monitoring alertmanager-0 -- \
  nc -zv smtp.gmail.com 587
```
---

## 📞 Need Help?

1. Check full documentation: [infrastructure/kubernetes/base/components/monitoring/README.md](infrastructure/kubernetes/base/components/monitoring/README.md)
2. Review deployment summary: [MONITORING_DEPLOYMENT_SUMMARY.md](MONITORING_DEPLOYMENT_SUMMARY.md)
3. Check Prometheus logs: `kubectl logs -n monitoring prometheus-0`
4. Check AlertManager logs: `kubectl logs -n monitoring alertmanager-0`
5. Check Grafana logs: `kubectl logs -n monitoring deployment/grafana`

---
## 🎉 You're Done!

Your monitoring stack is now running in production!

**Next steps:**
1. Save your Grafana password securely
2. Set up on-call rotation
3. Review alert thresholds and adjust as needed
4. Create team-specific dashboards
5. Train team on using monitoring tools

**Access your monitoring:**
- Grafana: https://monitoring.yourdomain.com/grafana
- Prometheus: https://monitoring.yourdomain.com/prometheus
- AlertManager: https://monitoring.yourdomain.com/alertmanager
- Jaeger: https://monitoring.yourdomain.com/jaeger

---

*Deployment time: ~15 minutes*
*Last updated: 2026-01-07*