# 🚀 Quick Start: Deploy Monitoring to Production

**Time to deploy: ~15 minutes**

---

## Step 1: Update Secrets (5 min)

```bash
cd infrastructure/kubernetes/base/components/monitoring

# 1. Generate a strong password
GRAFANA_PASS=$(openssl rand -base64 32)
echo "Grafana Password: $GRAFANA_PASS" > ~/SAVE_THIS_PASSWORD.txt

# 2. Edit secrets.yaml and replace:
# - CHANGE_ME_IN_PRODUCTION (Grafana password)
# - SMTP settings (your email server)
# - PostgreSQL connection string (your DB)

nano secrets.yaml
```
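The saved password file sits on disk in plaintext, so it is worth locking down its permissions at creation time. A minimal sketch (the `/tmp` path is purely illustrative; in practice point it at `~/SAVE_THIS_PASSWORD.txt`):

```shell
# Restrict the password file to the current user (mode 600) before writing it.
# Illustrative /tmp path; use ~/SAVE_THIS_PASSWORD.txt as above in practice.
umask 077
GRAFANA_PASS=$(openssl rand -base64 32)
printf 'Grafana Password: %s\n' "$GRAFANA_PASS" > /tmp/SAVE_THIS_PASSWORD.txt
stat -c '%a' /tmp/SAVE_THIS_PASSWORD.txt
```

Note that `stat -c` is GNU coreutils (Linux); on macOS use `stat -f '%Lp'` instead.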
**Required Changes in secrets.yaml:**
```yaml
# Line 13: Change Grafana password
admin-password: "YOUR_STRONG_PASSWORD_HERE"

# Lines 30-33: Update SMTP settings
smtp-host: "smtp.gmail.com:587"
smtp-username: "your-alerts@yourdomain.com"
smtp-password: "YOUR_SMTP_PASSWORD"
smtp-from: "alerts@yourdomain.com"

# Line 49: Update PostgreSQL connection
data-source-name: "postgresql://USER:PASSWORD@postgres.bakery-ia:5432/bakery?sslmode=require"
```
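One caveat, as an aside: if your `secrets.yaml` stores these values under `data:` rather than `stringData:`, every value must be base64-encoded first. A quick sketch of encoding a generated password and sanity-checking the round trip:

```shell
# Kubernetes Secret `data:` values must be base64-encoded; `stringData:` accepts plaintext.
GRAFANA_PASS=$(openssl rand -base64 32 | tr -d '\n')
ENCODED=$(printf '%s' "$GRAFANA_PASS" | base64 | tr -d '\n')
DECODED=$(printf '%s' "$ENCODED" | base64 -d)
# Decoding must return the original value exactly
[ "$DECODED" = "$GRAFANA_PASS" ] && echo "round-trip OK"
```

Use `printf '%s'` rather than `echo` here so no trailing newline sneaks into the encoded secret.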
---

## Step 2: Update Alert Email Addresses (2 min)

```bash
# Edit alertmanager.yaml to set your team's email addresses
nano alertmanager.yaml

# Update these lines (search for @yourdomain.com):
# - Line 93: to: 'alerts@yourdomain.com'
# - Line 101: to: 'critical-alerts@yourdomain.com,oncall@yourdomain.com'
# - Line 116: to: 'alerts@yourdomain.com'
# - Line 125: to: 'alert-system-team@yourdomain.com'
# - Line 134: to: 'database-team@yourdomain.com'
# - Line 143: to: 'infra-team@yourdomain.com'
```
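If every recipient lives on the same real domain, a single `sed` pass can replace the placeholder domain instead of editing each line by hand. The sketch below runs on a throwaway demo file; in practice point it at `alertmanager.yaml` and substitute your domain for the hypothetical `example.com`:

```shell
# Throwaway demo file standing in for alertmanager.yaml
printf "to: 'alerts@yourdomain.com'\n" > /tmp/alertmanager-demo.yaml
# -i.bak edits in place and keeps a .bak backup (works with GNU and BSD sed)
sed -i.bak 's/@yourdomain\.com/@example.com/g' /tmp/alertmanager-demo.yaml
cat /tmp/alertmanager-demo.yaml
```

The `.bak` backup lets you diff against the original before committing the change.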
---

## Step 3: Deploy to Production (3 min)

```bash
# Return to the project root (adjust the path to your checkout)
cd /path/to/bakery-ia

# Deploy the entire stack
kubectl apply -k infrastructure/kubernetes/overlays/prod

# Watch the pods come up
kubectl get pods -n monitoring -w
```
**Expected Output:**
```
NAME                      READY   STATUS    RESTARTS   AGE
prometheus-0              1/1     Running   0          2m
prometheus-1              1/1     Running   0          1m
alertmanager-0            2/2     Running   0          2m
alertmanager-1            2/2     Running   0          1m
alertmanager-2            2/2     Running   0          1m
grafana-xxxxx             1/1     Running   0          2m
postgres-exporter-xxxxx   1/1     Running   0          2m
node-exporter-xxxxx       1/1     Running   0          2m
jaeger-xxxxx              1/1     Running   0          2m
```
---

## Step 4: Verify Deployment (3 min)

```bash
# Check all pods are running
kubectl get pods -n monitoring

# Check storage is provisioned
kubectl get pvc -n monitoring

# Check services are created
kubectl get svc -n monitoring
```
---

## Step 5: Access Dashboards (2 min)

### **Option A: Via Ingress (if configured)**
```
https://monitoring.yourdomain.com/grafana
https://monitoring.yourdomain.com/prometheus
https://monitoring.yourdomain.com/alertmanager
https://monitoring.yourdomain.com/jaeger
```
### **Option B: Via Port Forwarding**
```bash
# Grafana
kubectl port-forward -n monitoring svc/grafana 3000:3000 &

# Prometheus
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090 &

# AlertManager
kubectl port-forward -n monitoring svc/alertmanager-external 9093:9093 &

# Jaeger
kubectl port-forward -n monitoring svc/jaeger-query 16686:16686 &

# Now access:
# - Grafana: http://localhost:3000 (admin / YOUR_PASSWORD)
# - Prometheus: http://localhost:9090
# - AlertManager: http://localhost:9093
# - Jaeger: http://localhost:16686
```
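Each `&` leaves a port-forward running in the background. When you are finished, stop them all at once; the pattern below assumes you have no other `kubectl port-forward` sessions you want to keep alive:

```shell
# Kill every background kubectl port-forward started above
pkill -f 'kubectl port-forward' || echo "no port-forwards running"
```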
---

## Step 6: Verify Everything Works (5 min)

### **Check Prometheus Targets**
1. Open Prometheus: http://localhost:9090
2. Go to Status → Targets
3. Verify all targets are **UP**:
   - prometheus (1/1 up)
   - bakery-services (multiple pods up)
   - alertmanager (3/3 up)
   - postgres-exporter (1/1 up)
   - node-exporter (N/N up, where N = number of nodes)

### **Check Grafana Dashboards**
1. Open Grafana: http://localhost:3000
2. Login with admin / YOUR_PASSWORD
3. Go to Dashboards → Browse
4. You should see 11 dashboards:
   - Bakery IA folder: Gateway Metrics, Services Overview, Circuit Breakers
   - Bakery IA - Extended folder: PostgreSQL, Node Exporter, AlertManager, Business Metrics
5. Open any dashboard and verify data is loading
### **Test Alert Flow**
```bash
# Fire a test alert by creating a high-memory pod
kubectl run memory-test --image=polinux/stress --restart=Never \
  --namespace=bakery-ia -- stress --vm 1 --vm-bytes 600M --timeout 300s

# Wait 5 minutes, then check:
# 1. Prometheus Alerts: http://localhost:9090/alerts
#    - Should see "HighMemoryUsage" firing
# 2. AlertManager: http://localhost:9093
#    - Should see the alert
# 3. Email inbox - should receive a notification

# Clean up
kubectl delete pod memory-test -n bakery-ia
```
### **Verify Jaeger Tracing**
1. Make a request to your API:
   ```bash
   curl -H "Authorization: Bearer YOUR_TOKEN" \
     https://api.yourdomain.com/api/v1/health
   ```
2. Open Jaeger: http://localhost:16686
3. Select a service from the dropdown
4. Click "Find Traces"
5. You should see traces appearing
---

## ✅ Success Criteria

Your monitoring is working correctly if:

- [ ] All Prometheus targets show "UP" status
- [ ] Grafana dashboards display metrics
- [ ] AlertManager cluster shows 3/3 members
- [ ] Test alert fired and email received
- [ ] Jaeger shows traces from services
- [ ] No pods in CrashLoopBackOff state
- [ ] All PVCs are Bound
---

## 🔧 Troubleshooting

### **Problem: Pods not starting**
```bash
# Check pod status
kubectl describe pod POD_NAME -n monitoring

# Check logs
kubectl logs POD_NAME -n monitoring

# Common issues:
# - Insufficient resources: Check node capacity
# - PVC not binding: Check storage class exists
# - Image pull errors: Check network/registry access
```
### **Problem: Prometheus targets DOWN**
```bash
# Check if services exist
kubectl get svc -n bakery-ia

# Check if pods have correct labels
kubectl get pods -n bakery-ia --show-labels

# Check if pods expose metrics port (8080)
kubectl get pod POD_NAME -n bakery-ia -o yaml | grep -A 5 ports
```
### **Problem: Grafana shows "No Data"**
```bash
# Test Prometheus datasource
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090

# Run a test query in Prometheus
curl "http://localhost:9090/api/v1/query?query=up" | jq

# If Prometheus has data but Grafana doesn't, check Grafana datasource config
```
### **Problem: Alerts not firing**
```bash
# Check alert rules are loaded
kubectl logs -n monitoring prometheus-0 | grep "Loading configuration"

# Check AlertManager config
kubectl exec -n monitoring alertmanager-0 -- cat /etc/alertmanager/alertmanager.yml

# Test SMTP connection
kubectl exec -n monitoring alertmanager-0 -- \
  nc -zv smtp.gmail.com 587
```
---

## 📞 Need Help?

1. Check full documentation: [infrastructure/kubernetes/base/components/monitoring/README.md](infrastructure/kubernetes/base/components/monitoring/README.md)
2. Review deployment summary: [MONITORING_DEPLOYMENT_SUMMARY.md](MONITORING_DEPLOYMENT_SUMMARY.md)
3. Check Prometheus logs: `kubectl logs -n monitoring prometheus-0`
4. Check AlertManager logs: `kubectl logs -n monitoring alertmanager-0`
5. Check Grafana logs: `kubectl logs -n monitoring deployment/grafana`

---
## 🎉 You're Done!

Your monitoring stack is now running in production!

**Next steps:**
1. Save your Grafana password securely
2. Set up on-call rotation
3. Review alert thresholds and adjust as needed
4. Create team-specific dashboards
5. Train team on using monitoring tools

**Access your monitoring:**
- Grafana: https://monitoring.yourdomain.com/grafana
- Prometheus: https://monitoring.yourdomain.com/prometheus
- AlertManager: https://monitoring.yourdomain.com/alertmanager
- Jaeger: https://monitoring.yourdomain.com/jaeger

---

*Deployment time: ~15 minutes*
*Last updated: 2026-01-07*