Bakery IA - Production Monitoring Stack
This directory contains the complete production-ready monitoring infrastructure for the Bakery IA platform.
📊 Components
Core Monitoring
- Prometheus v3.0.1 - Time-series metrics database (2 replicas with HA)
- Grafana v12.3.0 - Visualization and dashboarding
- AlertManager v0.27.0 - Alert routing and notification (3 replicas with HA)
Distributed Tracing
- Jaeger v1.51 - Distributed tracing with persistent storage
Exporters
- PostgreSQL Exporter v0.15.0 - Database metrics and health
- Node Exporter v1.7.0 - Infrastructure and OS-level metrics (DaemonSet)
🚀 Deployment
Prerequisites
- Kubernetes cluster (v1.24+)
- kubectl configured
- kustomize (v4.0+) or kubectl with kustomize support
- Storage class available for PersistentVolumeClaims
Production Deployment
# 1. Update secrets with production values
kubectl create secret generic grafana-admin \
--from-literal=admin-user=admin \
--from-literal=admin-password=$(openssl rand -base64 32) \
--namespace monitoring --dry-run=client -o yaml > secrets.yaml
# 2. Update AlertManager SMTP credentials
# (the "---" keeps the appended secrets as separate YAML documents)
echo "---" >> secrets.yaml
kubectl create secret generic alertmanager-secrets \
--from-literal=smtp-host="smtp.gmail.com:587" \
--from-literal=smtp-username="alerts@yourdomain.com" \
--from-literal=smtp-password="YOUR_SMTP_PASSWORD" \
--from-literal=smtp-from="alerts@yourdomain.com" \
--from-literal=slack-webhook-url="https://hooks.slack.com/services/YOUR/WEBHOOK/URL" \
--namespace monitoring --dry-run=client -o yaml >> secrets.yaml
# 3. Update PostgreSQL exporter connection string
echo "---" >> secrets.yaml
kubectl create secret generic postgres-exporter \
--from-literal=data-source-name="postgresql://user:password@postgres.bakery-ia:5432/bakery?sslmode=require" \
--namespace monitoring --dry-run=client -o yaml >> secrets.yaml
# 4. Deploy monitoring stack
kubectl apply -k infrastructure/kubernetes/overlays/prod
# 5. Verify deployment
kubectl get pods -n monitoring
kubectl get pvc -n monitoring
Local Development Deployment
For local Kind clusters, monitoring is disabled by default to save resources. To enable:
# Uncomment monitoring in overlays/dev/kustomization.yaml
# Then apply:
kubectl apply -k infrastructure/kubernetes/overlays/dev
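The exact layout is repo-specific, but enabling monitoring in the dev overlay usually amounts to uncommenting a single resource entry. A minimal sketch, assuming the monitoring manifests live in a base directory referenced by the overlay (paths are illustrative):
# overlays/dev/kustomization.yaml (sketch; adjust paths to the actual repo layout)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
  # Uncomment to enable monitoring locally:
  # - ../../base/monitoring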
🔐 Security Configuration
Important Security Notes
⚠️ NEVER commit real secrets to Git!
The secrets.yaml file contains placeholder values. In production, use one of:
- Sealed Secrets (Recommended)

  kubectl apply -f https://github.com/bitnami-labs/sealed-secrets/releases/download/v0.24.0/controller.yaml
  kubeseal --format=yaml < secrets.yaml > sealed-secrets.yaml

- External Secrets Operator (an example ExternalSecret is sketched after this list)

  helm repo add external-secrets https://charts.external-secrets.io
  helm install external-secrets external-secrets/external-secrets -n external-secrets --create-namespace

- Cloud Provider Secrets
  - AWS Secrets Manager
  - GCP Secret Manager
  - Azure Key Vault
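With the External Secrets Operator, the Secrets created manually above can instead be synced from the cloud store. A minimal sketch for the Grafana admin secret, assuming a ClusterSecretStore named cloud-secrets already exists (store name and remote key paths are placeholders):
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: grafana-admin
  namespace: monitoring
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: cloud-secrets          # hypothetical ClusterSecretStore backed by your provider
    kind: ClusterSecretStore
  target:
    name: grafana-admin          # Kubernetes Secret managed by the operator
  data:
    - secretKey: admin-password
      remoteRef:
        key: bakery-ia/grafana   # placeholder path in the external store
        property: admin-password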
Grafana Admin Password
Change the default password immediately:
# Generate strong password
NEW_PASSWORD=$(openssl rand -base64 32)
# Update secret
kubectl patch secret grafana-admin -n monitoring \
-p="{\"data\":{\"admin-password\":\"$(echo -n $NEW_PASSWORD | base64)\"}}"
# Restart Grafana
kubectl rollout restart deployment grafana -n monitoring
📈 Accessing Monitoring Services
Via Ingress (Production)
https://monitoring.yourdomain.com/grafana
https://monitoring.yourdomain.com/prometheus
https://monitoring.yourdomain.com/alertmanager
https://monitoring.yourdomain.com/jaeger
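The Ingress itself is environment-specific (controller, TLS, authentication), but path-based routing to the existing Services looks roughly like the sketch below; the host, ingress class, and path handling are assumptions. Note that serving Grafana and Prometheus under sub-paths also requires their external URL settings (e.g. GF_SERVER_ROOT_URL for Grafana, --web.external-url for Prometheus) to match.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: monitoring
  namespace: monitoring
spec:
  ingressClassName: nginx              # assumption; use your controller's class
  rules:
    - host: monitoring.yourdomain.com
      http:
        paths:
          - path: /grafana
            pathType: Prefix
            backend:
              service:
                name: grafana
                port:
                  number: 3000
          - path: /prometheus
            pathType: Prefix
            backend:
              service:
                name: prometheus-external
                port:
                  number: 9090
          # /alertmanager (alertmanager-external:9093) and /jaeger (jaeger-query:16686)
          # follow the same pattern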
Via Port Forwarding (Development)
# Grafana
kubectl port-forward -n monitoring svc/grafana 3000:3000
# Prometheus
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090
# AlertManager
kubectl port-forward -n monitoring svc/alertmanager-external 9093:9093
# Jaeger
kubectl port-forward -n monitoring svc/jaeger-query 16686:16686
Then access:
- Grafana: http://localhost:3000
- Prometheus: http://localhost:9090
- AlertManager: http://localhost:9093
- Jaeger: http://localhost:16686
📊 Grafana Dashboards
Pre-configured Dashboards
- Gateway Metrics - API gateway performance
  - Request rate by endpoint
  - P95 latency
  - Error rates
  - Authentication metrics
- Services Overview - Microservices health
  - Request rate by service
  - P99 latency
  - Error rates by service
  - Service health status
- Circuit Breakers - Resilience patterns
  - Circuit breaker states
  - Trip rates
  - Rejected requests
- PostgreSQL Monitoring - Database health
  - Connections, transactions, cache hit ratio
  - Slow queries, locks, replication lag
- Node Metrics - Infrastructure monitoring
  - CPU, memory, disk, network per node
- AlertManager - Alert management
  - Active alerts, firing rate, notifications
- Business Metrics - KPIs
  - Service performance, tenant activity, ML metrics
Creating Custom Dashboards
- Log in to Grafana (admin/[your-password])
- Click "+ → Dashboard"
- Add panels with Prometheus queries
- Save the dashboard
- Export the JSON and add it to grafana-dashboards.yaml (see the sketch below)
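How the exported JSON gets wired in depends on how grafana-dashboards.yaml is structured in this repo; a common pattern is a ConfigMap whose keys are dashboard JSON files picked up by Grafana's dashboard provisioning. A rough sketch (names and the provisioning mechanism are assumptions):
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboards
  namespace: monitoring
data:
  my-custom-dashboard.json: |
    {
      "title": "My Custom Dashboard",
      "uid": "my-custom-dashboard",
      "panels": []
    }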
🚨 Alert Configuration
Alert Rules
Alert rules are defined in alert-rules.yaml and organized by category (a representative rule is sketched after this list):
- bakery_services - Service health, errors, latency, memory
- bakery_business - Training jobs, ML accuracy, API limits
- alert_system_health - Alert system components, RabbitMQ, Redis
- alert_system_performance - Processing errors, delivery failures
- alert_system_business - Alert volume, response times
- alert_system_capacity - Queue sizes, storage performance
- alert_system_critical - System failures, data loss
- monitoring_health - Prometheus, AlertManager self-monitoring
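For orientation, each category is a standard Prometheus rule group. The rule below is illustrative only; thresholds, labels, and names are assumptions, so check alert-rules.yaml for the rules that are actually deployed.
groups:
  - name: bakery_services
    rules:
      - alert: HighErrorRate
        expr: |
          rate(http_requests_total{status_code=~"5.."}[5m])
            / rate(http_requests_total[5m]) > 0.05
        for: 10m
        labels:
          severity: warning
          component: bakery-services
        annotations:
          summary: "High 5xx rate on {{ $labels.service }}"
          description: "More than 5% of requests have failed for 10 minutes."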
Alert Routing
Alerts are routed based on the following labels (a simplified routing tree is sketched after this list):
- Severity (critical, warning, info)
- Component (alert-system, database, infrastructure)
- Service name
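In alertmanager.yaml this shows up as a routing tree keyed on those labels. A simplified sketch (receiver names are assumptions; the deployed tree may differ):
route:
  receiver: email-default            # assumed default receiver
  group_by: ['alertname', 'service']
  routes:
    - match:
        severity: critical
      receiver: oncall-critical      # e.g. PagerDuty or Slack
    - match:
        component: database
      receiver: database-team
    - match:
        component: alert-system
      receiver: alert-system-team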
Notification Channels
Configure in alertmanager.yaml:
- Email (default)
- Slack (optional, commented out)
  - Update slack-webhook-url in secrets
  - Uncomment slack_configs in alertmanager.yaml
- PagerDuty (add if needed):

  pagerduty_configs:
    - routing_key: YOUR_ROUTING_KEY
      severity: '{{ .Labels.severity }}'
Testing Alerts
# Fire a test alert
kubectl run test-alert --image=busybox -n bakery-ia --restart=Never -- sleep 3600
# Check alert in Prometheus
# Navigate to http://localhost:9090/alerts
# Check AlertManager
# Navigate to http://localhost:9093
🔍 Troubleshooting
Prometheus Issues
# Check Prometheus logs
kubectl logs -n monitoring prometheus-0 -f
# Check Prometheus targets
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090
# Visit http://localhost:9090/targets
# Check Prometheus configuration
kubectl get configmap prometheus-config -n monitoring -o yaml
AlertManager Issues
# Check AlertManager logs
kubectl logs -n monitoring alertmanager-0 -f
# Check AlertManager configuration
kubectl exec -n monitoring alertmanager-0 -- cat /etc/alertmanager/alertmanager.yml
# Check that the SMTP host is reachable (TCP-level check; nc flag support varies by busybox build)
kubectl run smtp-check --rm -it --restart=Never -n monitoring --image=busybox -- \
nc -zv -w 10 smtp.gmail.com 587
Grafana Issues
# Check Grafana logs
kubectl logs -n monitoring deployment/grafana -f
# Reset Grafana admin password
kubectl exec -n monitoring deployment/grafana -- \
grafana-cli admin reset-admin-password NEW_PASSWORD
PostgreSQL Exporter Issues
# Check exporter logs
kubectl logs -n monitoring deployment/postgres-exporter -f
# Test database connection
kubectl exec -n monitoring deployment/postgres-exporter -- \
wget -O- http://localhost:9187/metrics | grep pg_up
Node Exporter Issues
# Check node exporter on a specific node (assumes the pods carry an app=node-exporter label; adjust to your manifests)
kubectl logs -n monitoring -f \
$(kubectl get pods -n monitoring -l app=node-exporter \
--field-selector spec.nodeName=NODE_NAME -o name)
# Check metrics endpoint
kubectl exec -n monitoring daemonset/node-exporter -- \
wget -O- http://localhost:9100/metrics | head -n 20
📏 Resource Requirements
Minimum Requirements (Development)
- CPU: 2 cores
- Memory: 4Gi
- Storage: 30Gi
Recommended Requirements (Production)
- CPU: 6-8 cores
- Memory: 16Gi
- Storage: 100Gi
Component Resource Allocation
| Component | Replicas | CPU Request | Memory Request | CPU Limit | Memory Limit |
|---|---|---|---|---|---|
| Prometheus | 2 | 500m | 1Gi | 1 | 2Gi |
| AlertManager | 3 | 100m | 128Mi | 500m | 256Mi |
| Grafana | 1 | 100m | 256Mi | 500m | 512Mi |
| Postgres Exporter | 1 | 50m | 64Mi | 200m | 128Mi |
| Node Exporter | 1/node | 50m | 64Mi | 200m | 128Mi |
| Jaeger | 1 | 250m | 512Mi | 500m | 1Gi |
🔄 High Availability
Prometheus HA
- 2 replicas in StatefulSet
- Each has independent storage (volumeClaimTemplates)
- Anti-affinity to spread across nodes (see the sketch after this list)
- Both scrape the same targets independently
- Use Thanos for long-term storage and global query view (future enhancement)
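The anti-affinity mentioned above is expressed on the StatefulSet's pod template; a minimal sketch, assuming the pods carry an app: prometheus label:
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: prometheus              # assumed pod label
        topologyKey: kubernetes.io/hostname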
AlertManager HA
- 3 replicas in StatefulSet
- Clustered mode (gossip protocol; see the flag sketch below)
- Alert deduplication across instances, coordinated over gossip rather than leader election
- Anti-affinity to spread across nodes
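Clustering is configured through the --cluster.* flags, with each replica pointed at its peers via the headless service. A sketch of the container args (the headless service name alertmanager is an assumption):
args:
  - --config.file=/etc/alertmanager/alertmanager.yml
  - --storage.path=/alertmanager
  - --cluster.listen-address=0.0.0.0:9094
  - --cluster.peer=alertmanager-0.alertmanager.monitoring.svc:9094
  - --cluster.peer=alertmanager-1.alertmanager.monitoring.svc:9094
  - --cluster.peer=alertmanager-2.alertmanager.monitoring.svc:9094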
PodDisruptionBudgets
Ensure minimum availability during:
- Node maintenance
- Cluster upgrades
- Rolling updates
Prometheus: minAvailable=1 (out of 2)
AlertManager: minAvailable=2 (out of 3)
Grafana: minAvailable=1 (out of 1)
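Each budget is a small PodDisruptionBudget; the Prometheus one, for example, looks roughly like this (the selector label is an assumption):
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: prometheus
  namespace: monitoring
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: prometheus    # assumed pod label; must match the StatefulSet's pods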
📊 Metrics Reference
Application Metrics (from services)
# HTTP request rate
rate(http_requests_total[5m])
# HTTP error rate
rate(http_requests_total{status_code=~"5.."}[5m]) / rate(http_requests_total[5m])
# Request latency (P95)
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Active connections
active_connections
PostgreSQL Metrics
# Active connections
pg_stat_database_numbackends
# Transaction rate
rate(pg_stat_database_xact_commit[5m])
# Cache hit ratio
rate(pg_stat_database_blks_hit[5m]) /
(rate(pg_stat_database_blks_hit[5m]) + rate(pg_stat_database_blks_read[5m]))
# Replication lag
pg_replication_lag_seconds
Node Metrics
# CPU usage
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
# Disk I/O
rate(node_disk_read_bytes_total[5m])
rate(node_disk_written_bytes_total[5m])
# Network traffic
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])
🔗 Distributed Tracing
Jaeger Configuration
Services automatically send traces when JAEGER_ENABLED=true:
# In prod-configmap.yaml
JAEGER_ENABLED: "true"
JAEGER_AGENT_HOST: "jaeger-agent.monitoring.svc.cluster.local"
JAEGER_AGENT_PORT: "6831"
Viewing Traces
- Access Jaeger UI: https://monitoring.yourdomain.com/jaeger
- Select service from dropdown
- Click "Find Traces"
- Explore trace details, spans, and timing
Trace Sampling
Current sampling: 100% (all traces collected)
For high-traffic production environments, lower the sampling rate:
# Set via environment/ConfigMap; read by shared/monitoring/tracing.py
JAEGER_SAMPLE_RATE: "0.1" # 10% of traces
📚 Additional Resources
- Prometheus Documentation
- Grafana Documentation
- AlertManager Documentation
- Jaeger Documentation
- PostgreSQL Exporter
- Node Exporter
🆘 Support
For monitoring issues:
- Check component logs (see Troubleshooting section)
- Verify Prometheus targets are UP
- Check AlertManager configuration and routing
- Review resource usage and quotas
- Contact platform team: platform-team@yourdomain.com
🔄 Maintenance
Regular Tasks
Daily:
- Review critical alerts
- Check service health dashboards
Weekly:
- Review alert noise and adjust thresholds
- Check storage usage for Prometheus and Jaeger
- Review slow queries in PostgreSQL dashboard
Monthly:
- Update dashboard with new metrics
- Review and update alert runbooks
- Capacity planning based on trends
Backup and Recovery
Prometheus Data:
# Backup Prometheus data
kubectl exec -n monitoring prometheus-0 -- tar czf /tmp/prometheus-backup.tar.gz /prometheus
kubectl cp monitoring/prometheus-0:/tmp/prometheus-backup.tar.gz ./prometheus-backup.tar.gz
# Restore (stop Prometheus first)
kubectl cp ./prometheus-backup.tar.gz monitoring/prometheus-0:/tmp/
kubectl exec -n monitoring prometheus-0 -- tar xzf /tmp/prometheus-backup.tar.gz -C /
Grafana Dashboards:
# Export all dashboards via API
curl -s -u admin:password "http://localhost:3000/api/search?type=dash-db" | \
jq -r '.[] | .uid' | \
xargs -I{} curl -s -u admin:password http://localhost:3000/api/dashboards/uid/{} > dashboards-backup.json
📝 Version History
- v1.0.0 (2026-01-07) - Initial production-ready monitoring stack
- Prometheus v3.0.1 with HA
- AlertManager v0.27.0 with clustering
- Grafana v12.3.0 with 7 dashboards
- PostgreSQL and Node exporters
- 50+ alert rules
- Comprehensive documentation