# Bakery IA - Production Monitoring Stack

This directory contains the complete production-ready monitoring infrastructure for the Bakery IA platform.

## 📊 Components

### Core Monitoring

- **Prometheus v3.0.1** - Time-series metrics database (2 replicas with HA)
- **Grafana v12.3.0** - Visualization and dashboarding
- **AlertManager v0.27.0** - Alert routing and notification (3 replicas with HA)

### Distributed Tracing

- **Jaeger v1.51** - Distributed tracing with persistent storage

### Exporters

- **PostgreSQL Exporter v0.15.0** - Database metrics and health
- **Node Exporter v1.7.0** - Infrastructure and OS-level metrics (DaemonSet)

## 🚀 Deployment

### Prerequisites

1. Kubernetes cluster (v1.24+)
2. kubectl configured
3. kustomize (v4.0+) or kubectl with kustomize support
4. Storage class available for PersistentVolumeClaims

### Production Deployment

```bash
# 1. Update secrets with production values
kubectl create secret generic grafana-admin \
  --from-literal=admin-user=admin \
  --from-literal=admin-password=$(openssl rand -base64 32) \
  --namespace monitoring --dry-run=client -o yaml > secrets.yaml

# 2. Update AlertManager SMTP credentials
echo "---" >> secrets.yaml   # separate the appended YAML documents
kubectl create secret generic alertmanager-secrets \
  --from-literal=smtp-host="smtp.gmail.com:587" \
  --from-literal=smtp-username="alerts@yourdomain.com" \
  --from-literal=smtp-password="YOUR_SMTP_PASSWORD" \
  --from-literal=smtp-from="alerts@yourdomain.com" \
  --from-literal=slack-webhook-url="https://hooks.slack.com/services/YOUR/WEBHOOK/URL" \
  --namespace monitoring --dry-run=client -o yaml >> secrets.yaml

# 3. Update PostgreSQL exporter connection string
echo "---" >> secrets.yaml
kubectl create secret generic postgres-exporter \
  --from-literal=data-source-name="postgresql://user:password@postgres.bakery-ia:5432/bakery?sslmode=require" \
  --namespace monitoring --dry-run=client -o yaml >> secrets.yaml

# 4. Deploy monitoring stack
kubectl apply -k infrastructure/kubernetes/overlays/prod

# 5. Verify deployment
kubectl get pods -n monitoring
kubectl get pvc -n monitoring
```

### Local Development Deployment

For local Kind clusters, monitoring is disabled by default to save resources. To enable:

```bash
# Uncomment monitoring in overlays/dev/kustomization.yaml
# Then apply:
kubectl apply -k infrastructure/kubernetes/overlays/dev
```

## 🔐 Security Configuration

### Important Security Notes

⚠️ **NEVER commit real secrets to Git!**

The `secrets.yaml` file contains placeholder values. In production, use one of:

1. **Sealed Secrets** (Recommended)

   ```bash
   kubectl apply -f https://github.com/bitnami-labs/sealed-secrets/releases/download/v0.24.0/controller.yaml
   kubeseal --format=yaml < secrets.yaml > sealed-secrets.yaml
   ```

2. **External Secrets Operator** (see the sketch after this list)

   ```bash
   helm install external-secrets external-secrets/external-secrets -n external-secrets
   ```

3. **Cloud Provider Secrets**
   - AWS Secrets Manager
   - GCP Secret Manager
   - Azure Key Vault
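With the External Secrets Operator, the Kubernetes Secrets above are materialized from your cloud provider's secret store instead of being committed as manifests. A minimal sketch, assuming the operator is installed and a `ClusterSecretStore` named `cloud-secrets` already exists (the store name and remote key below are placeholders, not part of this repo):

```bash
# Sketch only: creates the grafana-admin Secret from an external store.
# "cloud-secrets" and "bakery-ia/grafana-admin" are hypothetical names.
kubectl apply -f - <<'EOF'
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: grafana-admin
  namespace: monitoring
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: cloud-secrets          # placeholder ClusterSecretStore
    kind: ClusterSecretStore
  target:
    name: grafana-admin          # Kubernetes Secret created by the operator
  data:
    - secretKey: admin-password
      remoteRef:
        key: bakery-ia/grafana-admin   # placeholder key in the provider
EOF
```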
### Grafana Admin Password

Change the default password immediately:

```bash
# Generate strong password
NEW_PASSWORD=$(openssl rand -base64 32)

# Update secret
kubectl patch secret grafana-admin -n monitoring \
  -p="{\"data\":{\"admin-password\":\"$(echo -n $NEW_PASSWORD | base64)\"}}"

# Restart Grafana
kubectl rollout restart deployment grafana -n monitoring
```

## 📈 Accessing Monitoring Services

### Via Ingress (Production)

```
https://monitoring.yourdomain.com/grafana
https://monitoring.yourdomain.com/prometheus
https://monitoring.yourdomain.com/alertmanager
https://monitoring.yourdomain.com/jaeger
```

### Via Port Forwarding (Development)

```bash
# Grafana
kubectl port-forward -n monitoring svc/grafana 3000:3000

# Prometheus
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090

# AlertManager
kubectl port-forward -n monitoring svc/alertmanager-external 9093:9093

# Jaeger
kubectl port-forward -n monitoring svc/jaeger-query 16686:16686
```

Then access:

- Grafana: http://localhost:3000
- Prometheus: http://localhost:9090
- AlertManager: http://localhost:9093
- Jaeger: http://localhost:16686

## 📊 Grafana Dashboards

### Pre-configured Dashboards

1. **Gateway Metrics** - API gateway performance
   - Request rate by endpoint
   - P95 latency
   - Error rates
   - Authentication metrics
2. **Services Overview** - Microservices health
   - Request rate by service
   - P99 latency
   - Error rates by service
   - Service health status
3. **Circuit Breakers** - Resilience patterns
   - Circuit breaker states
   - Trip rates
   - Rejected requests
4. **PostgreSQL Monitoring** - Database health
   - Connections, transactions, cache hit ratio
   - Slow queries, locks, replication lag
5. **Node Metrics** - Infrastructure monitoring
   - CPU, memory, disk, network per node
6. **AlertManager** - Alert management
   - Active alerts, firing rate, notifications
7. **Business Metrics** - KPIs
   - Service performance, tenant activity, ML metrics

### Creating Custom Dashboards

1. Login to Grafana (admin/[your-password])
2. Click "+ → Dashboard"
3. Add panels with Prometheus queries
4. Save dashboard
5. Export JSON and add to `grafana-dashboards.yaml`

## 🚨 Alert Configuration

### Alert Rules

Alert rules are defined in `alert-rules.yaml` and organized by category:

- **bakery_services** - Service health, errors, latency, memory
- **bakery_business** - Training jobs, ML accuracy, API limits
- **alert_system_health** - Alert system components, RabbitMQ, Redis
- **alert_system_performance** - Processing errors, delivery failures
- **alert_system_business** - Alert volume, response times
- **alert_system_capacity** - Queue sizes, storage performance
- **alert_system_critical** - System failures, data loss
- **monitoring_health** - Prometheus, AlertManager self-monitoring

### Alert Routing

Alerts are routed based on:

- **Severity** (critical, warning, info)
- **Component** (alert-system, database, infrastructure)
- **Service** name

### Notification Channels

Configure in `alertmanager.yaml`:

1. **Email** (default)
   - critical-alerts@yourdomain.com
   - oncall@yourdomain.com
2. **Slack** (optional, commented out)
   - Update slack-webhook-url in secrets
   - Uncomment slack_configs in alertmanager.yaml
3. **PagerDuty** (add if needed)

   ```yaml
   pagerduty_configs:
     - routing_key: YOUR_ROUTING_KEY
       severity: '{{ .Labels.severity }}'
   ```
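After changing routes or receivers, the configuration can be sanity-checked without waiting for a real alert. A sketch using `amtool`, which ships in the upstream AlertManager image (the label values below are examples, adjust them to your routing tree):

```bash
# Validate the rendered configuration file
kubectl exec -n monitoring alertmanager-0 -- \
  amtool check-config /etc/alertmanager/alertmanager.yml

# Show the routing tree, then test which receiver a label set would reach
kubectl exec -n monitoring alertmanager-0 -- \
  amtool config routes show --config.file=/etc/alertmanager/alertmanager.yml
kubectl exec -n monitoring alertmanager-0 -- \
  amtool config routes test --config.file=/etc/alertmanager/alertmanager.yml \
  severity=critical component=alert-system
```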
### Testing Alerts

```bash
# Create a test workload (adjust it to trigger the alert rule you want to exercise)
kubectl run test-alert --image=busybox -n bakery-ia --restart=Never -- sleep 3600

# Check alert in Prometheus
# Navigate to http://localhost:9090/alerts

# Check AlertManager
# Navigate to http://localhost:9093
```

## 🔍 Troubleshooting

### Prometheus Issues

```bash
# Check Prometheus logs
kubectl logs -n monitoring prometheus-0 -f

# Check Prometheus targets
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090
# Visit http://localhost:9090/targets

# Check Prometheus configuration
kubectl get configmap prometheus-config -n monitoring -o yaml
```

### AlertManager Issues

```bash
# Check AlertManager logs
kubectl logs -n monitoring alertmanager-0 -f

# Check AlertManager configuration
kubectl exec -n monitoring alertmanager-0 -- cat /etc/alertmanager/alertmanager.yml

# Test SMTP connectivity (nc may not be present in minimal images)
kubectl exec -n monitoring alertmanager-0 -- nc -zv -w 10 smtp.gmail.com 587
```

### Grafana Issues

```bash
# Check Grafana logs
kubectl logs -n monitoring deployment/grafana -f

# Reset Grafana admin password
kubectl exec -n monitoring deployment/grafana -- \
  grafana-cli admin reset-admin-password NEW_PASSWORD
```

### PostgreSQL Exporter Issues

```bash
# Check exporter logs
kubectl logs -n monitoring deployment/postgres-exporter -f

# Test database connection
kubectl exec -n monitoring deployment/postgres-exporter -- \
  wget -O- http://localhost:9187/metrics | grep pg_up
```

### Node Exporter Issues

```bash
# Check node exporter logs (picks one pod; list the pods to target a specific node)
kubectl logs -n monitoring daemonset/node-exporter -f
kubectl get pods -n monitoring -o wide | grep node-exporter

# Check metrics endpoint
kubectl exec -n monitoring daemonset/node-exporter -- \
  wget -O- http://localhost:9100/metrics | head -n 20
```

## 📏 Resource Requirements

### Minimum Requirements (Development)

- CPU: 2 cores
- Memory: 4Gi
- Storage: 30Gi

### Recommended Requirements (Production)

- CPU: 6-8 cores
- Memory: 16Gi
- Storage: 100Gi

### Component Resource Allocation

| Component | Replicas | CPU Request | Memory Request | CPU Limit | Memory Limit |
|-----------|----------|-------------|----------------|-----------|--------------|
| Prometheus | 2 | 500m | 1Gi | 1 | 2Gi |
| AlertManager | 3 | 100m | 128Mi | 500m | 256Mi |
| Grafana | 1 | 100m | 256Mi | 500m | 512Mi |
| Postgres Exporter | 1 | 50m | 64Mi | 200m | 128Mi |
| Node Exporter | 1/node | 50m | 64Mi | 200m | 128Mi |
| Jaeger | 1 | 250m | 512Mi | 500m | 1Gi |

## 🔄 High Availability

### Prometheus HA

- 2 replicas in StatefulSet
- Each has independent storage (volumeClaimTemplates)
- Anti-affinity to spread across nodes
- Both scrape the same targets independently
- Use Thanos for long-term storage and global query view (future enhancement)

### AlertManager HA

- 3 replicas in StatefulSet
- Clustered mode (gossip protocol)
- Automatic leader election
- Alert deduplication across instances
- Anti-affinity to spread across nodes

### PodDisruptionBudgets

Ensure minimum availability during:

- Node maintenance
- Cluster upgrades
- Rolling updates

```yaml
Prometheus: minAvailable=1 (out of 2)
AlertManager: minAvailable=2 (out of 3)
Grafana: minAvailable=1 (out of 1)
```

## 📊 Metrics Reference

### Application Metrics (from services)

```promql
# HTTP request rate
rate(http_requests_total[5m])

# HTTP error rate
rate(http_requests_total{status_code=~"5.."}[5m]) / rate(http_requests_total[5m])

# Request latency (P95)
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Active connections
active_connections
```
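The same expressions can be run outside Grafana, directly against the Prometheus HTTP API. A quick sketch with curl and jq (the range query assumes GNU `date` for the timestamps):

```bash
# Port-forward Prometheus first
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090 &

# Instant query: overall request rate over the last 5 minutes
curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=sum(rate(http_requests_total[5m]))' | jq '.data.result'

# Range query: P95 latency over the last hour at 1-minute resolution
curl -s http://localhost:9090/api/v1/query_range \
  --data-urlencode 'query=histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))' \
  --data-urlencode "start=$(date -u -d '1 hour ago' +%s)" \
  --data-urlencode "end=$(date -u +%s)" \
  --data-urlencode 'step=60' | jq '.data.result'
```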
### PostgreSQL Metrics

```promql
# Active connections
pg_stat_database_numbackends

# Transaction rate
rate(pg_stat_database_xact_commit[5m])

# Cache hit ratio
rate(pg_stat_database_blks_hit[5m]) / (rate(pg_stat_database_blks_hit[5m]) + rate(pg_stat_database_blks_read[5m]))

# Replication lag
pg_replication_lag_seconds
```

### Node Metrics

```promql
# CPU usage
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100

# Disk I/O
rate(node_disk_read_bytes_total[5m])
rate(node_disk_written_bytes_total[5m])

# Network traffic
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])
```

## 🔗 Distributed Tracing

### Jaeger Configuration

Services automatically send traces when `JAEGER_ENABLED=true`:

```yaml
# In prod-configmap.yaml
JAEGER_ENABLED: "true"
JAEGER_AGENT_HOST: "jaeger-agent.monitoring.svc.cluster.local"
JAEGER_AGENT_PORT: "6831"
```

### Viewing Traces

1. Access Jaeger UI: https://monitoring.yourdomain.com/jaeger
2. Select service from dropdown
3. Click "Find Traces"
4. Explore trace details, spans, and timing

### Trace Sampling

Current sampling: 100% (all traces are collected). For high-traffic production, reduce the rate:

```yaml
# Adjust in shared/monitoring/tracing.py
JAEGER_SAMPLE_RATE: "0.1"  # 10% of traces
```
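Besides the UI, the Jaeger query service exposes the JSON API that the UI itself uses, which is handy for quick smoke tests of trace ingestion. A sketch (the `gateway` service name is a placeholder, and this API is internal to the UI rather than a stable contract):

```bash
# Port-forward the Jaeger query service
kubectl port-forward -n monitoring svc/jaeger-query 16686:16686 &

# List services that have reported traces
curl -s http://localhost:16686/api/services | jq '.data'

# Fetch a few recent traces for one service and print an operation name from each
curl -s 'http://localhost:16686/api/traces?service=gateway&limit=5' | \
  jq '.data[].spans[0].operationName'
```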
## 📚 Additional Resources

- [Prometheus Documentation](https://prometheus.io/docs/)
- [Grafana Documentation](https://grafana.com/docs/)
- [AlertManager Documentation](https://prometheus.io/docs/alerting/latest/alertmanager/)
- [Jaeger Documentation](https://www.jaegertracing.io/docs/)
- [PostgreSQL Exporter](https://github.com/prometheus-community/postgres_exporter)
- [Node Exporter](https://github.com/prometheus/node_exporter)

## 🆘 Support

For monitoring issues:

1. Check component logs (see Troubleshooting section)
2. Verify Prometheus targets are UP
3. Check AlertManager configuration and routing
4. Review resource usage and quotas
5. Contact platform team: platform-team@yourdomain.com

## 🔄 Maintenance

### Regular Tasks

**Daily:**

- Review critical alerts
- Check service health dashboards

**Weekly:**

- Review alert noise and adjust thresholds
- Check storage usage for Prometheus and Jaeger
- Review slow queries in PostgreSQL dashboard

**Monthly:**

- Update dashboards with new metrics
- Review and update alert runbooks
- Capacity planning based on trends

### Backup and Recovery

**Prometheus Data:**

```bash
# Backup Prometheus data
kubectl exec -n monitoring prometheus-0 -- tar czf /tmp/prometheus-backup.tar.gz /prometheus
kubectl cp monitoring/prometheus-0:/tmp/prometheus-backup.tar.gz ./prometheus-backup.tar.gz

# Restore (stop Prometheus first)
kubectl cp ./prometheus-backup.tar.gz monitoring/prometheus-0:/tmp/
kubectl exec -n monitoring prometheus-0 -- tar xzf /tmp/prometheus-backup.tar.gz -C /
```

**Grafana Dashboards:**

```bash
# Export all dashboards via the API, one JSON file per dashboard
curl -s -u admin:password "http://localhost:3000/api/search?type=dash-db" | \
  jq -r '.[].uid' | \
  while read uid; do
    curl -s -u admin:password "http://localhost:3000/api/dashboards/uid/${uid}" \
      > "dashboard-${uid}.json"
  done
```

## 📝 Version History

- **v1.0.0** (2026-01-07)
  - Initial production-ready monitoring stack
  - Prometheus v3.0.1 with HA
  - AlertManager v0.27.0 with clustering
  - Grafana v12.3.0 with 7 dashboards
  - PostgreSQL and Node exporters
  - 50+ alert rules
  - Comprehensive documentation