
🚀 Quick Start: Deploy Monitoring to Production

Time to deploy: ~15 minutes


Step 1: Update Secrets (5 min)

cd infrastructure/kubernetes/base/components/monitoring

# 1. Generate strong passwords
GRAFANA_PASS=$(openssl rand -base64 32)
echo "Grafana Password: $GRAFANA_PASS" > ~/SAVE_THIS_PASSWORD.txt

# 2. Edit secrets.yaml and replace:
#    - CHANGE_ME_IN_PRODUCTION (Grafana password)
#    - SMTP settings (your email server)
#    - PostgreSQL connection string (your DB)

nano secrets.yaml

Required Changes in secrets.yaml:

# Line 13: Change Grafana password
admin-password: "YOUR_STRONG_PASSWORD_HERE"

# Lines 30-33: Update SMTP settings
smtp-host: "smtp.gmail.com:587"
smtp-username: "your-alerts@yourdomain.com"
smtp-password: "YOUR_SMTP_PASSWORD"
smtp-from: "alerts@yourdomain.com"

# Line 49: Update PostgreSQL connection
data-source-name: "postgresql://USER:PASSWORD@postgres.bakery-ia:5432/bakery?sslmode=require"
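
If you would rather not paste the password by hand, the generated value can be substituted in place. A minimal sketch, assuming the CHANGE_ME_IN_PRODUCTION placeholder is still present in secrets.yaml (openssl base64 output never contains '|', so it is safe as a sed delimiter):

# Replace the Grafana placeholder with the password generated above
sed -i '' "s|CHANGE_ME_IN_PRODUCTION|$GRAFANA_PASS|" secrets.yaml   # macOS/BSD sed
# sed -i "s|CHANGE_ME_IN_PRODUCTION|$GRAFANA_PASS|" secrets.yaml    # GNU/Linux sed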

Step 2: Update Alert Email Addresses (2 min)

# Edit alertmanager.yaml to set your team's email addresses
nano alertmanager.yaml

# Update these lines (search for @yourdomain.com):
# - Line 93: to: 'alerts@yourdomain.com'
# - Line 101: to: 'critical-alerts@yourdomain.com,oncall@yourdomain.com'
# - Line 116: to: 'alerts@yourdomain.com'
# - Line 125: to: 'alert-system-team@yourdomain.com'
# - Line 134: to: 'database-team@yourdomain.com'
# - Line 143: to: 'infra-team@yourdomain.com'
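
To double-check that no placeholder address was missed, and optionally replace the domain in bulk (a sketch that assumes every receiver uses the same real domain):

# List every line that still contains the placeholder domain
grep -n "@yourdomain.com" alertmanager.yaml

# Optional bulk replace (only if all receivers share one domain)
sed -i '' "s/yourdomain.com/YOUR_REAL_DOMAIN/g" alertmanager.yaml   # macOS/BSD sed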

Step 3: Deploy to Production (3 min)

# Return to project root
cd "$(git rev-parse --show-toplevel)"   # or wherever your bakery-ia checkout lives

# Deploy the entire stack
kubectl apply -k infrastructure/kubernetes/overlays/prod

# Watch the pods come up
kubectl get pods -n monitoring -w

Expected Output:

NAME                                  READY   STATUS    RESTARTS   AGE
prometheus-0                          1/1     Running   0          2m
prometheus-1                          1/1     Running   0          1m
alertmanager-0                        2/2     Running   0          2m
alertmanager-1                        2/2     Running   0          1m
alertmanager-2                        2/2     Running   0          1m
grafana-xxxxx                         1/1     Running   0          2m
postgres-exporter-xxxxx               1/1     Running   0          2m
node-exporter-xxxxx                   1/1     Running   0          2m
jaeger-xxxxx                          1/1     Running   0          2m
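
If you prefer a blocking check to watching the pods, kubectl can wait for the workloads to become ready. A sketch assuming the workload names match the pod prefixes above (StatefulSets prometheus and alertmanager, Deployment grafana):

# Block until the core workloads report ready
kubectl rollout status statefulset/prometheus -n monitoring --timeout=5m
kubectl rollout status statefulset/alertmanager -n monitoring --timeout=5m
kubectl rollout status deployment/grafana -n monitoring --timeout=5m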

Step 4: Verify Deployment (3 min)

# Check all pods are running
kubectl get pods -n monitoring

# Check storage is provisioned
kubectl get pvc -n monitoring

# Check services are created
kubectl get svc -n monitoring
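
Two quick checks that should both come back empty on a healthy deployment:

# Any pod not in the Running phase (should print nothing)
kubectl get pods -n monitoring --no-headers --field-selector=status.phase!=Running

# Any PVC not yet Bound (should print nothing)
kubectl get pvc -n monitoring --no-headers | grep -v Bound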

Step 5: Access Dashboards (2 min)

Option A: Via Ingress (if configured)

https://monitoring.yourdomain.com/grafana
https://monitoring.yourdomain.com/prometheus
https://monitoring.yourdomain.com/alertmanager
https://monitoring.yourdomain.com/jaeger

Option B: Via Port Forwarding

# Grafana
kubectl port-forward -n monitoring svc/grafana 3000:3000 &

# Prometheus
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090 &

# AlertManager
kubectl port-forward -n monitoring svc/alertmanager-external 9093:9093 &

# Jaeger
kubectl port-forward -n monitoring svc/jaeger-query 16686:16686 &

# Now access:
# - Grafana: http://localhost:3000 (admin / YOUR_PASSWORD)
# - Prometheus: http://localhost:9090
# - AlertManager: http://localhost:9093
# - Jaeger: http://localhost:16686
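
The commands above run as background jobs; when you are finished, stop them with something like:

# Stop all background port-forwards started above
pkill -f "kubectl port-forward -n monitoring"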

Step 6: Verify Everything Works (5 min)

Check Prometheus Targets

  1. Open Prometheus: http://localhost:9090
  2. Go to Status → Targets
  3. Verify all targets are UP (the API check after this list does the same from the command line):
    • prometheus (1/1 up)
    • bakery-services (multiple pods up)
    • alertmanager (3/3 up)
    • postgres-exporter (1/1 up)
    • node-exporter (N/N up, where N = number of nodes)
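
The same check can be scripted against the Prometheus HTTP API. A sketch assuming the Step 5 port-forward is active and jq is installed:

# Summarise target health per scrape job
curl -s "http://localhost:9090/api/v1/targets" \
  | jq -r '.data.activeTargets[] | "\(.labels.job): \(.health)"' \
  | sort | uniq -c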

Check Grafana Dashboards

  1. Open Grafana: http://localhost:3000
  2. Login with admin / YOUR_PASSWORD
  3. Go to Dashboards → Browse
  4. You should see 11 dashboards across two folders, including:
    • Bakery IA folder: Gateway Metrics, Services Overview, Circuit Breakers
    • Bakery IA - Extended folder: PostgreSQL, Node Exporter, AlertManager, Business Metrics
  5. Open any dashboard and verify data is loading (the API call after this list prints the full dashboard list)
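
As an alternative to browsing the UI, the dashboards can be listed over the Grafana HTTP API. A sketch assuming the Step 5 port-forward is active, jq is installed, and the admin account uses the password set in Step 1:

# List provisioned dashboards with their folders
curl -s -u "admin:YOUR_PASSWORD" "http://localhost:3000/api/search?type=dash-db" \
  | jq -r '.[] | "\(.folderTitle // "General") / \(.title)"'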

Test Alert Flow

# Fire a test alert by creating a high-memory pod
kubectl run memory-test --image=polinux/stress --restart=Never \
  --namespace=bakery-ia -- stress --vm 1 --vm-bytes 600M --timeout 300s

# Wait 5 minutes, then check:
# 1. Prometheus Alerts: http://localhost:9090/alerts
#    - Should see "HighMemoryUsage" firing
# 2. AlertManager: http://localhost:9093
#    - Should see the alert
# 3. Email inbox - Should receive notification

# Clean up
kubectl delete pod memory-test -n bakery-ia
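
While the test alert is firing (i.e. before running the clean-up command above), it can also be inspected from the command line. A sketch assuming the Step 5 AlertManager port-forward and jq:

# Count currently firing alerts by name
curl -s "http://localhost:9093/api/v2/alerts" \
  | jq -r '.[].labels.alertname' | sort | uniq -c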

Verify Jaeger Tracing

  1. Make a request to your API:
    curl -H "Authorization: Bearer YOUR_TOKEN" \
      https://api.yourdomain.com/api/v1/health
    
  2. Open Jaeger: http://localhost:16686
  3. Select a service from the dropdown (the API call after this list shows which services are reporting)
  4. Click "Find Traces"
  5. You should see traces appearing
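
The list of services that have actually reported traces can also be pulled from the Jaeger query API. A sketch assuming the Step 5 port-forward and jq:

# List the services Jaeger has received traces for
curl -s "http://localhost:16686/api/services" | jq -r '.data[]'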

Success Criteria

Your monitoring is working correctly if:

  • All Prometheus targets show "UP" status
  • Grafana dashboards display metrics
  • AlertManager cluster shows 3/3 members
  • Test alert fired and email received
  • Jaeger shows traces from services
  • No pods in CrashLoopBackOff state
  • All PVCs are Bound

🔧 Troubleshooting

Problem: Pods not starting

# Check pod status
kubectl describe pod POD_NAME -n monitoring

# Check logs
kubectl logs POD_NAME -n monitoring

# Common issues:
# - Insufficient resources: Check node capacity
# - PVC not binding: Check storage class exists
# - Image pull errors: Check network/registry access
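
Cluster events often point at the root cause (scheduling, storage, image pulls) faster than per-pod inspection:

# Show the most recent events in the namespace, oldest first
kubectl get events -n monitoring --sort-by=.lastTimestamp | tail -20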

Problem: Prometheus targets DOWN

# Check if services exist
kubectl get svc -n bakery-ia

# Check if pods have correct labels
kubectl get pods -n bakery-ia --show-labels

# Check if pods expose metrics port (8080)
kubectl get pod POD_NAME -n bakery-ia -o yaml | grep -A 5 ports
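
It is also worth confirming that each Service has backing endpoints, since a label-selector mismatch leaves its targets permanently DOWN:

# A Service with no ENDPOINTS listed has a selector that matches no pods
kubectl get endpoints -n bakery-ia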

Problem: Grafana shows "No Data"

# Test Prometheus datasource
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090

# Run a test query in Prometheus
curl "http://localhost:9090/api/v1/query?query=up" | jq

# If Prometheus has data but Grafana doesn't, check Grafana datasource config
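
To inspect the datasource from the command line, the Grafana API can be queried directly. A sketch assuming the Step 5 Grafana port-forward, the admin password from Step 1, and jq:

# List configured datasources and the URLs they point at
curl -s -u "admin:YOUR_PASSWORD" "http://localhost:3000/api/datasources" \
  | jq -r '.[] | "\(.name): \(.url)"'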

Problem: Alerts not firing

# Check alert rules are loaded
kubectl logs -n monitoring prometheus-0 | grep "Loading configuration"

# Check AlertManager config
kubectl exec -n monitoring alertmanager-0 -- cat /etc/alertmanager/alertmanager.yml

# Test SMTP connection
kubectl exec -n monitoring alertmanager-0 -- \
  nc -zv smtp.gmail.com 587
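
To confirm that the alerting rules were actually loaded, the Prometheus rules API can be queried. A sketch assuming the Step 5 Prometheus port-forward and jq:

# Show every alerting rule Prometheus has loaded, with its current state
curl -s "http://localhost:9090/api/v1/rules" \
  | jq -r '.data.groups[].rules[] | select(.type=="alerting") | "\(.name): \(.state)"'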

📞 Need Help?

  1. Check full documentation: infrastructure/kubernetes/base/components/monitoring/README.md
  2. Review deployment summary: MONITORING_DEPLOYMENT_SUMMARY.md
  3. Check Prometheus logs: kubectl logs -n monitoring prometheus-0
  4. Check AlertManager logs: kubectl logs -n monitoring alertmanager-0
  5. Check Grafana logs: kubectl logs -n monitoring deployment/grafana

🎉 You're Done!

Your monitoring stack is now running in production!

Next steps:

  1. Save your Grafana password securely
  2. Set up on-call rotation
  3. Review alert thresholds and adjust as needed
  4. Create team-specific dashboards
  5. Train team on using monitoring tools

Access your monitoring:

  • Grafana: https://monitoring.yourdomain.com/grafana (or http://localhost:3000 via port-forward)
  • Prometheus: https://monitoring.yourdomain.com/prometheus (or http://localhost:9090)
  • AlertManager: https://monitoring.yourdomain.com/alertmanager (or http://localhost:9093)
  • Jaeger: https://monitoring.yourdomain.com/jaeger (or http://localhost:16686)

Deployment time: ~15 minutes
Last updated: 2026-01-07