Improve monitoring for prod

2026-01-07 19:12:35 +01:00
parent 560c7ba86f
commit 07178f8972
44 changed files with 6581 additions and 5111 deletions
--- a/infrastructure/kubernetes/base/components/monitoring/README.md
+++ b/infrastructure/kubernetes/base/components/monitoring/README.md
@@ -0,0 +1,501 @@
+# Bakery IA - Production Monitoring Stack
+
+This directory contains the complete production-ready monitoring infrastructure for the Bakery IA platform.
+
+## 📊 Components
+
+### Core Monitoring
+- **Prometheus v3.0.1** - Time-series metrics database (2 replicas with HA)
+- **Grafana v12.3.0** - Visualization and dashboarding
+- **AlertManager v0.27.0** - Alert routing and notification (3 replicas with HA)
+
+### Distributed Tracing
+- **Jaeger v1.51** - Distributed tracing with persistent storage
+
+### Exporters
+- **PostgreSQL Exporter v0.15.0** - Database metrics and health
+- **Node Exporter v1.7.0** - Infrastructure and OS-level metrics (DaemonSet)
+
+## 🚀 Deployment
+
+### Prerequisites
+1. Kubernetes cluster (v1.24+)
+2. kubectl configured
+3. kustomize (v4.0+) or kubectl with kustomize support
+4. Storage class available for PersistentVolumeClaims
+
+### Production Deployment
+
+```bash
+# 1. Update secrets with production values
+kubectl create secret generic grafana-admin \
+  --from-literal=admin-user=admin \
+  --from-literal=admin-password="$(openssl rand -base64 32)" \
+  --namespace monitoring --dry-run=client -o yaml > secrets.yaml
+
+# 2. Update AlertManager SMTP credentials
+#    ('---' keeps the appended manifests as separate YAML documents)
+echo '---' >> secrets.yaml
+kubectl create secret generic alertmanager-secrets \
+  --from-literal=smtp-host="smtp.gmail.com:587" \
+  --from-literal=smtp-username="alerts@yourdomain.com" \
+  --from-literal=smtp-password="YOUR_SMTP_PASSWORD" \
+  --from-literal=smtp-from="alerts@yourdomain.com" \
+  --from-literal=slack-webhook-url="https://hooks.slack.com/services/YOUR/WEBHOOK/URL" \
+  --namespace monitoring --dry-run=client -o yaml >> secrets.yaml
+
+# 3. Update PostgreSQL exporter connection string
+echo '---' >> secrets.yaml
+kubectl create secret generic postgres-exporter \
+  --from-literal=data-source-name="postgresql://user:password@postgres.bakery-ia:5432/bakery?sslmode=require" \
+  --namespace monitoring --dry-run=client -o yaml >> secrets.yaml
+
+# 4. Deploy monitoring stack
+kubectl apply -k infrastructure/kubernetes/overlays/prod
+
+# 5. Verify deployment
+kubectl get pods -n monitoring
+kubectl get pvc -n monitoring
+```
+
+### Local Development Deployment
+
+For local Kind clusters, monitoring is disabled by default to save resources. To enable:
+
+```bash
+# Uncomment monitoring in overlays/dev/kustomization.yaml
+# Then apply:
+kubectl apply -k infrastructure/kubernetes/overlays/dev
+```
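+
+As a rough sketch, the dev overlay could look like the following once monitoring is enabled. The `resources` entry, the use of `components`, and the relative path are assumptions inferred from where this README lives; check them against the actual `overlays/dev/kustomization.yaml`.
+
+```yaml
+# overlays/dev/kustomization.yaml (sketch; entries are assumptions)
+apiVersion: kustomize.config.k8s.io/v1beta1
+kind: Kustomization
+
+resources:
+  - ../../base                           # assumed base layout
+
+components:
+  - ../../base/components/monitoring     # the line that is commented out by default
+```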
+
+## 🔐 Security Configuration
+
+### Important Security Notes
+
+⚠️ **NEVER commit real secrets to Git!**
+
+The `secrets.yaml` file contains placeholder values. In production, use one of:
+
+1. **Sealed Secrets** (Recommended)
+   ```bash
+   kubectl apply -f https://github.com/bitnami-labs/sealed-secrets/releases/download/v0.24.0/controller.yaml
+   kubeseal --format=yaml < secrets.yaml > sealed-secrets.yaml
+   ```
+
+2. **External Secrets Operator**
+   ```bash
+   helm repo add external-secrets https://charts.external-secrets.io
+   helm install external-secrets external-secrets/external-secrets \
+     -n external-secrets --create-namespace
+   ```
+
+3. **Cloud Provider Secrets**
+   - AWS Secrets Manager
+   - GCP Secret Manager
+   - Azure Key Vault
+
+### Grafana Admin Password
+
+Change the default password immediately:
+```bash
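+# Generate strong password
+NEW_PASSWORD="$(openssl rand -base64 32)"
+
+# Update secret (values under .data must be base64-encoded)
+kubectl patch secret grafana-admin -n monitoring \
+  -p="{\"data\":{\"admin-password\":\"$(printf '%s' "$NEW_PASSWORD" | base64)\"}}"
+
+# Restart Grafana
+# Note: Grafana reads the configured admin password only when it first creates
+# its database. If that database is persisted, use the grafana-cli reset shown
+# under Troubleshooting instead.
+kubectl rollout restart deployment grafana -n monitoring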
+```
+
+## 📈 Accessing Monitoring Services
+
+### Via Ingress (Production)
+
+```
+https://monitoring.yourdomain.com/grafana
+https://monitoring.yourdomain.com/prometheus
+https://monitoring.yourdomain.com/alertmanager
+https://monitoring.yourdomain.com/jaeger
+```
+
+### Via Port Forwarding (Development)
+
+```bash
+# Grafana
+kubectl port-forward -n monitoring svc/grafana 3000:3000
+
+# Prometheus
+kubectl port-forward -n monitoring svc/prometheus-external 9090:9090
+
+# AlertManager
+kubectl port-forward -n monitoring svc/alertmanager-external 9093:9093
+
+# Jaeger
+kubectl port-forward -n monitoring svc/jaeger-query 16686:16686
+```
+
+Then access:
+- Grafana: http://localhost:3000
+- Prometheus: http://localhost:9090
+- AlertManager: http://localhost:9093
+- Jaeger: http://localhost:16686
+
+## 📊 Grafana Dashboards
+
+### Pre-configured Dashboards
+
+1. **Gateway Metrics** - API gateway performance
+   - Request rate by endpoint
+   - P95 latency
+   - Error rates
+   - Authentication metrics
+
+2. **Services Overview** - Microservices health
+   - Request rate by service
+   - P99 latency
+   - Error rates by service
+   - Service health status
+
+3. **Circuit Breakers** - Resilience patterns
+   - Circuit breaker states
+   - Trip rates
+   - Rejected requests
+
+4. **PostgreSQL Monitoring** - Database health
+   - Connections, transactions, cache hit ratio
+   - Slow queries, locks, replication lag
+
+5. **Node Metrics** - Infrastructure monitoring
+   - CPU, memory, disk, network per node
+
+6. **AlertManager** - Alert management
+   - Active alerts, firing rate, notifications
+
+7. **Business Metrics** - KPIs
+   - Service performance, tenant activity, ML metrics
+
+### Creating Custom Dashboards
+
+1. Log in to Grafana (admin/[your-password])
+2. Click "+ → Dashboard"
+3. Add panels with Prometheus queries
+4. Save dashboard
+5. Export JSON and add to `grafana-dashboards.yaml`
+
+## 🚨 Alert Configuration
+
+### Alert Rules
+
+Alert rules are defined in `alert-rules.yaml` and organized by category:
+
+- **bakery_services** - Service health, errors, latency, memory
+- **bakery_business** - Training jobs, ML accuracy, API limits
+- **alert_system_health** - Alert system components, RabbitMQ, Redis
+- **alert_system_performance** - Processing errors, delivery failures
+- **alert_system_business** - Alert volume, response times
+- **alert_system_capacity** - Queue sizes, storage performance
+- **alert_system_critical** - System failures, data loss
+- **monitoring_health** - Prometheus, AlertManager self-monitoring
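+
+For orientation, a rule in one of these groups might look like the sketch below. The alert name, the 5% threshold, the `for` duration, and the `service` label are illustrative assumptions, not the contents of `alert-rules.yaml`; the metric name is taken from the Metrics Reference section.
+
+```yaml
+groups:
+  - name: bakery_services
+    rules:
+      - alert: HighErrorRate            # hypothetical alert name
+        expr: |
+          sum by (service) (rate(http_requests_total{status_code=~"5.."}[5m]))
+            /
+          sum by (service) (rate(http_requests_total[5m])) > 0.05
+        for: 10m                         # assumed: must hold for 10 minutes before firing
+        labels:
+          severity: critical
+        annotations:
+          summary: "High 5xx rate on {{ $labels.service }}"
+          description: "Error ratio is {{ $value | humanizePercentage }} over the last 5 minutes."
+```
+
+The `severity` label set here is what the routing tree in AlertManager matches on.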
+
+### Alert Routing
+
+Alerts are routed based on:
+- **Severity** (critical, warning, info)
+- **Component** (alert-system, database, infrastructure)
+- **Service** name
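+
+A routing tree along these lines would implement that split. This is a minimal sketch: the receiver names, grouping labels, and which address receives which severity are assumptions, so compare it with the real `alertmanager.yaml` before reusing it.
+
+```yaml
+route:
+  receiver: default-email                # fallback for anything not matched below
+  group_by: ['alertname', 'service']
+  routes:
+    - matchers:
+        - severity="critical"
+      receiver: critical-email
+    - matchers:
+        - component="database"
+      receiver: default-email
+
+receivers:
+  - name: default-email                  # hypothetical receiver name
+    email_configs:
+      - to: oncall@yourdomain.com
+  - name: critical-email                 # hypothetical receiver name
+    email_configs:
+      - to: critical-alerts@yourdomain.com
+```
+
+Child routes are evaluated in order and the first match wins unless `continue: true` is set, so more specific routes belong above broader ones.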
+
+### Notification Channels
+
+Configure in `alertmanager.yaml`:
+
+1. **Email** (default)
+   - critical-alerts@yourdomain.com
+   - oncall@yourdomain.com
+
+2. **Slack** (optional, commented out)
+   - Update slack-webhook-url in secrets
+   - Uncomment slack_configs in alertmanager.yaml
+
+3. **PagerDuty** (add if needed)
+   ```yaml
+   pagerduty_configs:
+   - routing_key: YOUR_ROUTING_KEY
+     severity: '{{ .CommonLabels.severity }}'
+   ```
+
+### Testing Alerts
+
+```bash
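+# Start a throwaway pod (this fires an alert only if an existing rule matches it)
+kubectl run test-alert --image=busybox -n bakery-ia --restart=Never -- sleep 3600
+
+# Or post a synthetic alert straight to AlertManager
+# (requires the port-forward to localhost:9093; it will not appear in Prometheus)
+curl -X POST http://localhost:9093/api/v2/alerts \
+  -H 'Content-Type: application/json' \
+  -d '[{"labels":{"alertname":"TestAlert","severity":"info"}}]'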
+
+# Check alert in Prometheus
+# Navigate to http://localhost:9090/alerts
+
+# Check AlertManager
+# Navigate to http://localhost:9093
+```
+
+## 🔍 Troubleshooting
+
+### Prometheus Issues
+
+```bash
+# Check Prometheus logs
+kubectl logs -n monitoring prometheus-0 -f
+
+# Check Prometheus targets
+kubectl port-forward -n monitoring svc/prometheus-external 9090:9090
+# Visit http://localhost:9090/targets
+
+# Check Prometheus configuration
+kubectl get configmap prometheus-config -n monitoring -o yaml
+```
+
+### AlertManager Issues
+
+```bash
+# Check AlertManager logs
+kubectl logs -n monitoring alertmanager-0 -f
+
+# Check AlertManager configuration
+kubectl exec -n monitoring alertmanager-0 -- cat /etc/alertmanager/alertmanager.yml
+
+# Test SMTP connectivity (wget cannot speak SMTP; this assumes the image ships BusyBox nc)
+# A "220" greeting from the server means the port is reachable
+kubectl exec -n monitoring alertmanager-0 -- \
+  nc -w 10 smtp.gmail.com 587
+```
+
+### Grafana Issues
+
+```bash
+# Check Grafana logs
+kubectl logs -n monitoring deployment/grafana -f
+
+# Reset Grafana admin password
+kubectl exec -n monitoring deployment/grafana -- \
+  grafana-cli admin reset-admin-password NEW_PASSWORD
+```
+
+### PostgreSQL Exporter Issues
+
+```bash
+# Check exporter logs
+kubectl logs -n monitoring deployment/postgres-exporter -f
+
+# Test database connection
+kubectl exec -n monitoring deployment/postgres-exporter -- \
+  wget -O- http://localhost:9187/metrics | grep pg_up
+```
+
+### Node Exporter Issues
+
+```bash
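+# Find the node exporter pod on a specific node, then follow its logs
+# (kubectl logs cannot combine a resource name with --selector)
+kubectl get pods -n monitoring -o wide --field-selector spec.nodeName=NODE_NAME
+kubectl logs -n monitoring POD_NAME -f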
+
+# Check metrics endpoint
+kubectl exec -n monitoring daemonset/node-exporter -- \
+  wget -O- http://localhost:9100/metrics | head -n 20
+```
+
+## 📏 Resource Requirements
+
+### Minimum Requirements (Development)
+- CPU: 2 cores
+- Memory: 4Gi
+- Storage: 30Gi
+
+### Recommended Requirements (Production)
+- CPU: 6-8 cores
+- Memory: 16Gi
+- Storage: 100Gi
+
+### Component Resource Allocation
+
+| Component | Replicas | CPU Request | Memory Request | CPU Limit | Memory Limit |
+|-----------|----------|-------------|----------------|-----------|--------------|
+| Prometheus | 2 | 500m | 1Gi | 1 | 2Gi |
+| AlertManager | 3 | 100m | 128Mi | 500m | 256Mi |
+| Grafana | 1 | 100m | 256Mi | 500m | 512Mi |
+| Postgres Exporter | 1 | 50m | 64Mi | 200m | 128Mi |
+| Node Exporter | 1/node | 50m | 64Mi | 200m | 128Mi |
+| Jaeger | 1 | 250m | 512Mi | 500m | 1Gi |
+
+## 🔄 High Availability
+
+### Prometheus HA
+
+- 2 replicas in StatefulSet
+- Each has independent storage (volumeClaimTemplates)
+- Anti-affinity to spread across nodes
+- Both scrape the same targets independently
+- Use Thanos for long-term storage and global query view (future enhancement)
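+
+The anti-affinity mentioned above is typically expressed in the StatefulSet pod template. The fragment below is a sketch that assumes the pods carry an `app: prometheus` label; the actual manifest may use different labels or a preferred (soft) rule instead of a required one.
+
+```yaml
+# StatefulSet fragment (sketch; the pod label is an assumption)
+spec:
+  template:
+    spec:
+      affinity:
+        podAntiAffinity:
+          requiredDuringSchedulingIgnoredDuringExecution:
+            - labelSelector:
+                matchLabels:
+                  app: prometheus
+              topologyKey: kubernetes.io/hostname   # at most one replica per node
+```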
+
+### AlertManager HA
+
+- 3 replicas in StatefulSet
+- Clustered mode (gossip protocol)
+- No leader election: replicas are equal peers that share silences and the notification log via gossip
+- Alert deduplication across instances
+- Anti-affinity to spread across nodes
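+
+Deduplication across replicas depends on each Prometheus sending its alerts to every AlertManager instance rather than to a load-balanced address. A sketch of the corresponding `prometheus.yml` section follows; the headless service name `alertmanager` and the resulting pod DNS names are assumptions.
+
+```yaml
+# prometheus.yml fragment (sketch; DNS names are assumptions)
+alerting:
+  alertmanagers:
+    - static_configs:
+        - targets:
+            - alertmanager-0.alertmanager.monitoring.svc.cluster.local:9093
+            - alertmanager-1.alertmanager.monitoring.svc.cluster.local:9093
+            - alertmanager-2.alertmanager.monitoring.svc.cluster.local:9093
+```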
+
+### PodDisruptionBudgets
+
+Ensure minimum availability during:
+- Node maintenance
+- Cluster upgrades
+- Rolling updates
+
+```yaml
+Prometheus: minAvailable=1 (out of 2)
+AlertManager: minAvailable=2 (out of 3)
+Grafana: minAvailable=1 (out of 1)
+```
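+
+Expressed as a manifest, the AlertManager budget could look like the sketch below. The object name and the `app: alertmanager` selector are assumptions; the real selector has to match the pod labels used by the StatefulSet.
+
+```yaml
+apiVersion: policy/v1
+kind: PodDisruptionBudget
+metadata:
+  name: alertmanager-pdb        # hypothetical name
+  namespace: monitoring
+spec:
+  minAvailable: 2               # keep 2 of the 3 replicas through voluntary disruptions
+  selector:
+    matchLabels:
+      app: alertmanager         # assumed pod label
+```
+
+Note that `minAvailable=1` on the single-replica Grafana deployment means voluntary evictions of that pod are refused, so draining its node requires manual intervention.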
+
+## 📊 Metrics Reference
+
+### Application Metrics (from services)
+
+```promql
+# HTTP request rate
+rate(http_requests_total[5m])
+
+# HTTP error rate (aggregate both sides so the label sets match)
+sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
+
+# Request latency (P95)
+histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
+
+# Active connections
+active_connections
+```
+
+### PostgreSQL Metrics
+
+```promql
+# Active connections
+pg_stat_database_numbackends
+
+# Transaction rate
+rate(pg_stat_database_xact_commit[5m])
+
+# Cache hit ratio
+rate(pg_stat_database_blks_hit[5m]) /
+(rate(pg_stat_database_blks_hit[5m]) + rate(pg_stat_database_blks_read[5m]))
+
+# Replication lag
+pg_replication_lag_seconds
+```
+
+### Node Metrics
+
+```promql
+# CPU usage
+100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
+
+# Memory usage
+(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
+
+# Disk I/O
+rate(node_disk_read_bytes_total[5m])
+rate(node_disk_written_bytes_total[5m])
+
+# Network traffic
+rate(node_network_receive_bytes_total[5m])
+rate(node_network_transmit_bytes_total[5m])
+```
+
+## 🔗 Distributed Tracing
+
+### Jaeger Configuration
+
+Services automatically send traces when `JAEGER_ENABLED=true`:
+
+```yaml
+# In prod-configmap.yaml
+JAEGER_ENABLED: "true"
+JAEGER_AGENT_HOST: "jaeger-agent.monitoring.svc.cluster.local"
+JAEGER_AGENT_PORT: "6831"
+```
+
+### Viewing Traces
+
+1. Access Jaeger UI: https://monitoring.yourdomain.com/jaeger
+2. Select service from dropdown
+3. Click "Find Traces"
+4. Explore trace details, spans, and timing
+
+### Trace Sampling
+
+Current sampling: 100% (all traces collected)
+
+For high-traffic production:
+```yaml
+# Adjust in shared/monitoring/tracing.py
+JAEGER_SAMPLE_RATE: "0.1"  # 10% of traces
+```
+
+## 📚 Additional Resources
+
+- [Prometheus Documentation](https://prometheus.io/docs/)
+- [Grafana Documentation](https://grafana.com/docs/)
+- [AlertManager Documentation](https://prometheus.io/docs/alerting/latest/alertmanager/)
+- [Jaeger Documentation](https://www.jaegertracing.io/docs/)
+- [PostgreSQL Exporter](https://github.com/prometheus-community/postgres_exporter)
+- [Node Exporter](https://github.com/prometheus/node_exporter)
+
+## 🆘 Support
+
+For monitoring issues:
+1. Check component logs (see Troubleshooting section)
+2. Verify Prometheus targets are UP
+3. Check AlertManager configuration and routing
+4. Review resource usage and quotas
+5. Contact platform team: platform-team@yourdomain.com
+
+## 🔄 Maintenance
+
+### Regular Tasks
+
+**Daily:**
+- Review critical alerts
+- Check service health dashboards
+
+**Weekly:**
+- Review alert noise and adjust thresholds
+- Check storage usage for Prometheus and Jaeger
+- Review slow queries in PostgreSQL dashboard
+
+**Monthly:**
+- Update dashboards with new metrics
+- Review and update alert runbooks
+- Capacity planning based on trends
+
+### Backup and Recovery
+
+**Prometheus Data:**
+```bash
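+# Backup Prometheus data
+# Note: archiving a live TSDB can capture an inconsistent state. The snapshot endpoint
+# (POST /api/v1/admin/tsdb/snapshot, requires --web.enable-admin-api) is a safer source.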
+kubectl exec -n monitoring prometheus-0 -- tar czf /tmp/prometheus-backup.tar.gz /prometheus
+kubectl cp monitoring/prometheus-0:/tmp/prometheus-backup.tar.gz ./prometheus-backup.tar.gz
+
+# Restore (stop Prometheus first)
+kubectl cp ./prometheus-backup.tar.gz monitoring/prometheus-0:/tmp/
+kubectl exec -n monitoring prometheus-0 -- tar xzf /tmp/prometheus-backup.tar.gz -C /
+```
+
+**Grafana Dashboards:**
+```bash
+# Export all dashboards via API (folders excluded, output collected into one JSON array)
+curl -s -u admin:password 'http://localhost:3000/api/search?type=dash-db' | \
+  jq -r '.[] | .uid' | \
+  xargs -I{} curl -s -u admin:password http://localhost:3000/api/dashboards/uid/{} | \
+  jq -s '.' > dashboards-backup.json
+```
+
+## 📝 Version History
+
+- **v1.0.0** (2026-01-07) - Initial production-ready monitoring stack
+  - Prometheus v3.0.1 with HA
+  - AlertManager v0.27.0 with clustering
+  - Grafana v12.3.0 with 7 dashboards
+  - PostgreSQL and Node exporters
+  - 50+ alert rules
+  - Comprehensive documentation