🚀 Quick Start: Deploy Monitoring to Production
Time to deploy: ~20 minutes
Step 1: Update Secrets (5 min)
cd infrastructure/kubernetes/base/components/monitoring
# 1. Generate a strong password and stash it temporarily
GRAFANA_PASS=$(openssl rand -base64 32)
echo "Grafana Password: $GRAFANA_PASS" > ~/SAVE_THIS_PASSWORD.txt
# Move it to your password manager and delete this file once you're done
# 2. Edit secrets.yaml and replace:
# - CHANGE_ME_IN_PRODUCTION (Grafana password)
# - SMTP settings (your email server)
# - PostgreSQL connection string (your DB)
nano secrets.yaml
Required Changes in secrets.yaml:
# Line 13: Change Grafana password
admin-password: "YOUR_STRONG_PASSWORD_HERE"
# Lines 30-33: Update SMTP settings
smtp-host: "smtp.gmail.com:587"
smtp-username: "your-alerts@yourdomain.com"
smtp-password: "YOUR_SMTP_PASSWORD"
smtp-from: "alerts@yourdomain.com"
# Line 49: Update PostgreSQL connection
data-source-name: "postgresql://USER:PASSWORD@postgres.bakery-ia:5432/bakery?sslmode=require"
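If you prefer a non-interactive edit, here is a minimal sketch, assuming the CHANGE_ME_IN_PRODUCTION placeholder appears verbatim in secrets.yaml (BSD/macOS sed shown; on GNU sed, drop the empty string after -i):
# Swap in the password generated in step 1; the '|' delimiter avoids
# clashing with '/' characters in the base64 output
sed -i '' "s|CHANGE_ME_IN_PRODUCTION|${GRAFANA_PASS}|" secrets.yaml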
Step 2: Update Alert Email Addresses (2 min)
# Edit alertmanager.yaml to set your team's email addresses
nano alertmanager.yaml
# Update these lines (search for @yourdomain.com):
# - Line 93: to: 'alerts@yourdomain.com'
# - Line 101: to: 'critical-alerts@yourdomain.com,oncall@yourdomain.com'
# - Line 116: to: 'alerts@yourdomain.com'
# - Line 125: to: 'alert-system-team@yourdomain.com'
# - Line 134: to: 'database-team@yourdomain.com'
# - Line 143: to: 'infra-team@yourdomain.com'
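To confirm no address was missed, a quick check (assuming every placeholder uses the @yourdomain.com suffix shown above):
# Any remaining hits are placeholders you still need to replace
grep -n '@yourdomain.com' alertmanager.yaml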
Step 3: Deploy to Production (3 min)
# Return to project root
cd /path/to/bakery-ia
# Deploy the entire stack
kubectl apply -k infrastructure/kubernetes/overlays/prod
# Watch the pods come up
kubectl get pods -n monitoring -w
Expected Output:
NAME                      READY   STATUS    RESTARTS   AGE
prometheus-0              1/1     Running   0          2m
prometheus-1              1/1     Running   0          1m
alertmanager-0            2/2     Running   0          2m
alertmanager-1            2/2     Running   0          1m
alertmanager-2            2/2     Running   0          1m
grafana-xxxxx             1/1     Running   0          2m
postgres-exporter-xxxxx   1/1     Running   0          2m
node-exporter-xxxxx       1/1     Running   0          2m
jaeger-xxxxx              1/1     Running   0          2m
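If you'd rather block than watch, kubectl can wait for readiness directly (a sketch; the 5-minute timeout is an assumption, tune it to your cluster):
# Exits non-zero if any pod fails to become Ready within the timeout
kubectl wait --for=condition=Ready pods --all -n monitoring --timeout=300s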
Step 4: Verify Deployment (3 min)
# Check all pods are running
kubectl get pods -n monitoring
# Check storage is provisioned
kubectl get pvc -n monitoring
# Check services are created
kubectl get svc -n monitoring
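For a scriptable pass/fail version of the same checks, a sketch using standard kubectl output:
# Should print nothing beyond "No resources found": every pod is Running
kubectl get pods -n monitoring --field-selector=status.phase!=Running
# Should print nothing: every PVC is Bound
kubectl get pvc -n monitoring --no-headers | grep -v Bound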
Step 5: Access Dashboards (2 min)
Option A: Via Ingress (if configured)
https://monitoring.yourdomain.com/grafana
https://monitoring.yourdomain.com/prometheus
https://monitoring.yourdomain.com/alertmanager
https://monitoring.yourdomain.com/jaeger
Option B: Via Port Forwarding
# Grafana
kubectl port-forward -n monitoring svc/grafana 3000:3000 &
# Prometheus
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090 &
# AlertManager
kubectl port-forward -n monitoring svc/alertmanager-external 9093:9093 &
# Jaeger
kubectl port-forward -n monitoring svc/jaeger-query 16686:16686 &
# Now access:
# - Grafana: http://localhost:3000 (admin / YOUR_PASSWORD)
# - Prometheus: http://localhost:9090
# - AlertManager: http://localhost:9093
# - Jaeger: http://localhost:16686
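The trailing & runs each port-forward in your shell's background. When you're done, clean them up (a sketch; this kills every kubectl port-forward you own, so adjust if you have others running):
# Stop all background port-forwards started above
pkill -f "kubectl port-forward"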
Step 6: Verify Everything Works (5 min)
Check Prometheus Targets
- Open Prometheus: http://localhost:9090
- Go to Status → Targets
- Verify all targets are UP:
- prometheus (1/1 up)
- bakery-services (multiple pods up)
- alertmanager (3/3 up)
- postgres-exporter (1/1 up)
- node-exporter (N/N up, where N = number of nodes)
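The same check can be scripted against the Prometheus HTTP API, assuming the Step 5 port-forward is still running and jq is installed:
# Summarize target health per scrape job; every entry should report "up"
curl -s http://localhost:9090/api/v1/targets \
  | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'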
Check Grafana Dashboards
- Open Grafana: http://localhost:3000
- Login with admin / YOUR_PASSWORD
- Go to Dashboards → Browse
- You should see the provisioned dashboards, grouped in two folders:
- Bakery IA folder: Gateway Metrics, Services Overview, Circuit Breakers
- Bakery IA - Extended folder: PostgreSQL, Node Exporter, AlertManager, Business Metrics
- Open any dashboard and verify data is loading
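Dashboards can also be listed via the Grafana HTTP API, assuming the Step 5 port-forward and the admin credentials from Step 1:
# Print the title of every provisioned dashboard
curl -s -u admin:YOUR_PASSWORD "http://localhost:3000/api/search?type=dash-db" \
  | jq -r '.[].title'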
Test Alert Flow
# Fire a test alert by creating high memory pod
kubectl run memory-test --image=polinux/stress --restart=Never \
--namespace=bakery-ia -- stress --vm 1 --vm-bytes 600M --timeout 300s
# Wait 5 minutes, then check:
# 1. Prometheus Alerts: http://localhost:9090/alerts
# - Should see "HighMemoryUsage" firing
# 2. AlertManager: http://localhost:9093
# - Should see the alert
# 3. Email inbox - Should receive notification
# Clean up
kubectl delete pod memory-test -n bakery-ia
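While waiting, you can poll alert state from the command line instead of the UIs (assuming the Step 5 port-forwards are active):
# Alerts as evaluated by Prometheus (state: pending or firing)
curl -s http://localhost:9090/api/v1/alerts \
  | jq '.data.alerts[] | {name: .labels.alertname, state: .state}'
# Alerts as received by AlertManager (v2 API)
curl -s http://localhost:9093/api/v2/alerts | jq '.[].labels.alertname'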
Verify Jaeger Tracing
- Make a request to your API:
curl -H "Authorization: Bearer YOUR_TOKEN" \
  https://api.yourdomain.com/api/v1/health
- Open Jaeger: http://localhost:16686
- Select a service from the dropdown
- Click "Find Traces"
- You should see traces appearing
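If no traces show up, first confirm Jaeger has registered your services at all. A sketch against the jaeger-query API (the same endpoint the UI uses), via the Step 5 port-forward:
# Services that have reported spans; your API services should be listed
curl -s http://localhost:16686/api/services | jq '.data'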
✅ Success Criteria
Your monitoring is working correctly if:
- All Prometheus targets show "UP" status
- Grafana dashboards display metrics
- AlertManager cluster shows 3/3 members
- Test alert fired and email received
- Jaeger shows traces from services
- No pods in CrashLoopBackOff state
- All PVCs are Bound
🔧 Troubleshooting
Problem: Pods not starting
# Check pod status
kubectl describe pod POD_NAME -n monitoring
# Check logs
kubectl logs POD_NAME -n monitoring
# Common issues:
# - Insufficient resources: Check node capacity
# - PVC not binding: Check storage class exists
# - Image pull errors: Check network/registry access
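Two more commands that usually pinpoint the cause quickly:
# Recent scheduling, image-pull, and mount failures, newest last
kubectl get events -n monitoring --sort-by=.lastTimestamp
# Per-node capacity vs. requests, to spot resource starvation
kubectl describe nodes | grep -A 5 "Allocated resources"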
Problem: Prometheus targets DOWN
# Check if services exist
kubectl get svc -n bakery-ia
# Check if pods have correct labels
kubectl get pods -n bakery-ia --show-labels
# Check if pods expose metrics port (8080)
kubectl get pod POD_NAME -n bakery-ia -o yaml | grep -A 5 ports
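If your Prometheus discovers pods via the common prometheus.io/* annotations (an assumption; confirm against your scrape config), verify the pods actually carry them:
# Print each pod with its scrape annotation; a blank means it won't be scraped
kubectl get pods -n bakery-ia -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.prometheus\.io/scrape}{"\n"}{end}'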
Problem: Grafana shows "No Data"
# Test Prometheus datasource
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090
# Run a test query in Prometheus
curl "http://localhost:9090/api/v1/query?query=up" | jq
# If Prometheus has data but Grafana doesn't, check Grafana datasource config
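To inspect the datasource from the Grafana side, a sketch using the Grafana HTTP API with the admin credentials from Step 1:
# Confirm a Prometheus datasource exists and note the URL it points at
curl -s -u admin:YOUR_PASSWORD http://localhost:3000/api/datasources \
  | jq '.[] | {name: .name, type: .type, url: .url}'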
Problem: Alerts not firing
# Check alert rules are loaded
kubectl logs -n monitoring prometheus-0 | grep "Loading configuration"
# Check AlertManager config
kubectl exec -n monitoring alertmanager-0 -- cat /etc/alertmanager/alertmanager.yml
# Test SMTP connection
kubectl exec -n monitoring alertmanager-0 -- \
nc -zv smtp.gmail.com 587
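The official prom/alertmanager and prom/prometheus images ship their own validators, so you can lint the configs in place (an assumption if you run custom images; the rules path below is also an assumption, adjust it to your mount):
# Validate the AlertManager config with amtool
kubectl exec -n monitoring alertmanager-0 -- \
  amtool check-config /etc/alertmanager/alertmanager.yml
# Validate alert rules with promtool (sh -c so the glob expands in the pod)
kubectl exec -n monitoring prometheus-0 -- \
  sh -c 'promtool check rules /etc/prometheus/rules/*.yml'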
📞 Need Help?
- Check full documentation: infrastructure/kubernetes/base/components/monitoring/README.md
- Review deployment summary: MONITORING_DEPLOYMENT_SUMMARY.md
- Check Prometheus logs:
kubectl logs -n monitoring prometheus-0
- Check AlertManager logs:
kubectl logs -n monitoring alertmanager-0
- Check Grafana logs:
kubectl logs -n monitoring deployment/grafana
🎉 You're Done!
Your monitoring stack is now running in production!
Next steps:
- Save your Grafana password securely
- Set up on-call rotation
- Review alert thresholds and adjust as needed
- Create team-specific dashboards
- Train team on using monitoring tools
Access your monitoring:
- Grafana: https://monitoring.yourdomain.com/grafana
- Prometheus: https://monitoring.yourdomain.com/prometheus
- AlertManager: https://monitoring.yourdomain.com/alertmanager
- Jaeger: https://monitoring.yourdomain.com/jaeger
Deployment time: ~20 minutes
Last updated: 2026-01-07