Improve monitoring for prod

Urtzi Alfaro
2026-01-07 19:12:35 +01:00
parent 560c7ba86f
commit 07178f8972
44 changed files with 6581 additions and 5111 deletions


@@ -0,0 +1,501 @@
# Bakery IA - Production Monitoring Stack
This directory contains the complete production-ready monitoring infrastructure for the Bakery IA platform.
## 📊 Components
### Core Monitoring
- **Prometheus v3.0.1** - Time-series metrics database (2 replicas with HA)
- **Grafana v12.3.0** - Visualization and dashboarding
- **AlertManager v0.27.0** - Alert routing and notification (3 replicas with HA)
### Distributed Tracing
- **Jaeger v1.51** - Distributed tracing with persistent storage
### Exporters
- **PostgreSQL Exporter v0.15.0** - Database metrics and health
- **Node Exporter v1.7.0** - Infrastructure and OS-level metrics (DaemonSet)
## 🚀 Deployment
### Prerequisites
1. Kubernetes cluster (v1.24+)
2. kubectl configured
3. kustomize (v4.0+) or kubectl with kustomize support
4. Storage class available for PersistentVolumeClaims
### Production Deployment
```bash
# 1. Update secrets with production values
kubectl create secret generic grafana-admin \
  --from-literal=admin-user=admin \
  --from-literal=admin-password="$(openssl rand -base64 32)" \
  --namespace monitoring --dry-run=client -o yaml > secrets.yaml
# 2. Update AlertManager SMTP credentials (appended as a separate YAML document)
echo "---" >> secrets.yaml
kubectl create secret generic alertmanager-secrets \
  --from-literal=smtp-host="smtp.gmail.com:587" \
  --from-literal=smtp-username="alerts@yourdomain.com" \
  --from-literal=smtp-password="YOUR_SMTP_PASSWORD" \
  --from-literal=smtp-from="alerts@yourdomain.com" \
  --from-literal=slack-webhook-url="https://hooks.slack.com/services/YOUR/WEBHOOK/URL" \
  --namespace monitoring --dry-run=client -o yaml >> secrets.yaml
# 3. Update PostgreSQL exporter connection string (appended as a separate YAML document)
echo "---" >> secrets.yaml
kubectl create secret generic postgres-exporter \
  --from-literal=data-source-name="postgresql://user:password@postgres.bakery-ia:5432/bakery?sslmode=require" \
  --namespace monitoring --dry-run=client -o yaml >> secrets.yaml
# 4. Deploy monitoring stack
kubectl apply -k infrastructure/kubernetes/overlays/prod
# 5. Verify deployment
kubectl get pods -n monitoring
kubectl get pvc -n monitoring
```
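For reference, the generated `secrets.yaml` should end up as three `Secret` documents separated by `---`; the shape is roughly as below (placeholders stand in for the base64 values kubectl produces):
```yaml
# Sketch of the generated secrets.yaml (first document shown; values are base64-encoded)
apiVersion: v1
kind: Secret
metadata:
  name: grafana-admin
  namespace: monitoring
data:
  admin-user: YWRtaW4=               # base64("admin")
  admin-password: <base64 value>     # generated by openssl above
---
# alertmanager-secrets (smtp-* and slack-webhook-url keys) and postgres-exporter
# (data-source-name key) follow as further documents in the same file
```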
### Local Development Deployment
For local Kind clusters, monitoring is disabled by default to save resources. To enable:
```bash
# Uncomment monitoring in overlays/dev/kustomization.yaml
# Then apply:
kubectl apply -k infrastructure/kubernetes/overlays/dev
```
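The exact resource entry depends on how the dev overlay references the monitoring manifests; a hypothetical `overlays/dev/kustomization.yaml` would look roughly like this, with the monitoring line commented out by default (the relative path below is illustrative, not taken from the repo):
```yaml
# overlays/dev/kustomization.yaml (illustrative layout)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
  # - ../../monitoring   # uncomment to enable the monitoring stack locally
```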
## 🔐 Security Configuration
### Important Security Notes
⚠️ **NEVER commit real secrets to Git!**
The `secrets.yaml` file contains placeholder values. In production, use one of:
1. **Sealed Secrets** (Recommended)
```bash
kubectl apply -f https://github.com/bitnami-labs/sealed-secrets/releases/download/v0.24.0/controller.yaml
kubeseal --format=yaml < secrets.yaml > sealed-secrets.yaml
```
2. **External Secrets Operator** (see the sketch after this list)
```bash
helm install external-secrets external-secrets/external-secrets -n external-secrets
```
3. **Cloud Provider Secrets**
- AWS Secrets Manager
- GCP Secret Manager
- Azure Key Vault
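As a minimal sketch of option 2, an `ExternalSecret` could pull the Grafana credentials from a cloud secret store. The store name and remote key below are assumptions for illustration, not objects that exist in this repo:
```yaml
# Illustrative ExternalSecret (assumes a ClusterSecretStore named "cloud-secrets"
# and a remote secret "bakery-ia/grafana-admin" already exist)
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: grafana-admin
  namespace: monitoring
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: cloud-secrets
    kind: ClusterSecretStore
  target:
    name: grafana-admin        # Kubernetes Secret created and kept in sync by the operator
  data:
    - secretKey: admin-user
      remoteRef:
        key: bakery-ia/grafana-admin
        property: admin-user
    - secretKey: admin-password
      remoteRef:
        key: bakery-ia/grafana-admin
        property: admin-password
```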
### Grafana Admin Password
Change the default password immediately:
```bash
# Generate strong password
NEW_PASSWORD=$(openssl rand -base64 32)
# Update secret
kubectl patch secret grafana-admin -n monitoring \
-p="{\"data\":{\"admin-password\":\"$(echo -n $NEW_PASSWORD | base64)\"}}"
# Restart Grafana
kubectl rollout restart deployment grafana -n monitoring
```
## 📈 Accessing Monitoring Services
### Via Ingress (Production)
```
https://monitoring.yourdomain.com/grafana
https://monitoring.yourdomain.com/prometheus
https://monitoring.yourdomain.com/alertmanager
https://monitoring.yourdomain.com/jaeger
```
### Via Port Forwarding (Development)
```bash
# Grafana
kubectl port-forward -n monitoring svc/grafana 3000:3000
# Prometheus
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090
# AlertManager
kubectl port-forward -n monitoring svc/alertmanager-external 9093:9093
# Jaeger
kubectl port-forward -n monitoring svc/jaeger-query 16686:16686
```
Then access:
- Grafana: http://localhost:3000
- Prometheus: http://localhost:9090
- AlertManager: http://localhost:9093
- Jaeger: http://localhost:16686
## 📊 Grafana Dashboards
### Pre-configured Dashboards
1. **Gateway Metrics** - API gateway performance
- Request rate by endpoint
- P95 latency
- Error rates
- Authentication metrics
2. **Services Overview** - Microservices health
- Request rate by service
- P99 latency
- Error rates by service
- Service health status
3. **Circuit Breakers** - Resilience patterns
- Circuit breaker states
- Trip rates
- Rejected requests
4. **PostgreSQL Monitoring** - Database health
- Connections, transactions, cache hit ratio
- Slow queries, locks, replication lag
5. **Node Metrics** - Infrastructure monitoring
- CPU, memory, disk, network per node
6. **AlertManager** - Alert management
- Active alerts, firing rate, notifications
7. **Business Metrics** - KPIs
- Service performance, tenant activity, ML metrics
### Creating Custom Dashboards
1. Log in to Grafana (admin / your configured password)
2. Click "+ → Dashboard"
3. Add panels with Prometheus queries
4. Save dashboard
5. Export the JSON and add it to `grafana-dashboards.yaml` (see the example below)
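For example, the exported JSON can be added as another key in the `grafana-dashboards` ConfigMap; the key name and dashboard body below are placeholders:
```yaml
# In grafana-dashboards.yaml - add the exported dashboard as a new data key
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboards
  namespace: monitoring
data:
  my-custom-dashboard.json: |
    {
      "dashboard": {
        "title": "My Custom Dashboard",
        "panels": []
      }
    }
```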
## 🚨 Alert Configuration
### Alert Rules
Alert rules are defined in `alert-rules.yaml` and organized by category:
- **bakery_services** - Service health, errors, latency, memory
- **bakery_business** - Training jobs, ML accuracy, API limits
- **alert_system_health** - Alert system components, RabbitMQ, Redis
- **alert_system_performance** - Processing errors, delivery failures
- **alert_system_business** - Alert volume, response times
- **alert_system_capacity** - Queue sizes, storage performance
- **alert_system_critical** - System failures, data loss
- **monitoring_health** - Prometheus, AlertManager self-monitoring
### Alert Routing
Alerts are routed based on:
- **Severity** (critical, warning, info)
- **Component** (alert-system, database, infrastructure)
- **Service** name
### Notification Channels
Configure in `alertmanager.yaml`:
1. **Email** (default)
- critical-alerts@yourdomain.com
- oncall@yourdomain.com
2. **Slack** (optional, commented out)
- Update slack-webhook-url in secrets
- Uncomment slack_configs in alertmanager.yaml
3. **PagerDuty** (add if needed)
```yaml
pagerduty_configs:
  - routing_key: YOUR_ROUTING_KEY
    severity: '{{ .Labels.severity }}'
```
### Testing Alerts
```bash
# Create a test workload (this alone does not fire an alert; see the synthetic alert example below)
kubectl run test-alert --image=busybox -n bakery-ia --restart=Never -- sleep 3600
# Check alert in Prometheus
# Navigate to http://localhost:9090/alerts
# Check AlertManager
# Navigate to http://localhost:9093
```
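Alternatively, a synthetic alert can be pushed directly to AlertManager's v2 API (with the port-forward to 9093 active). This exercises routing, grouping, and notification channels without involving Prometheus; the label values below are arbitrary test values:
```bash
# Send a synthetic alert to AlertManager (API v2)
curl -s -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{
        "labels": {"alertname": "ManualTestAlert", "severity": "warning", "service": "manual-test"},
        "annotations": {"summary": "Manual test alert", "description": "Safe to ignore"},
        "startsAt": "'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'"
      }]'
```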
## 🔍 Troubleshooting
### Prometheus Issues
```bash
# Check Prometheus logs
kubectl logs -n monitoring prometheus-0 -f
# Check Prometheus targets
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090
# Visit http://localhost:9090/targets
# Check Prometheus configuration
kubectl get configmap prometheus-config -n monitoring -o yaml
```
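Prometheus ships with `promtool`, so the loaded configuration can also be validated in place. The config path below assumes the `prometheus-config` ConfigMap is mounted at `/etc/prometheus`; adjust if the mount differs:
```bash
# Validate the Prometheus configuration (and referenced rule files) inside the pod
kubectl exec -n monitoring prometheus-0 -- promtool check config /etc/prometheus/prometheus.yml
```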
### AlertManager Issues
```bash
# Check AlertManager logs
kubectl logs -n monitoring alertmanager-0 -f
# Check AlertManager configuration
kubectl exec -n monitoring alertmanager-0 -- cat /etc/alertmanager/alertmanager.yml
# Test SMTP delivery: trigger a notification (see "Testing Alerts") and watch for SMTP errors
kubectl logs -n monitoring alertmanager-0 | grep -i smtp
```
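The AlertManager image also includes `amtool`, which can validate the rendered configuration and show how a given label set would be routed; the labels in the routing test are examples:
```bash
# Validate the rendered AlertManager config
kubectl exec -n monitoring alertmanager-0 -- amtool check-config /etc/alertmanager/alertmanager.yml
# Show which receiver an example label set would hit
kubectl exec -n monitoring alertmanager-0 -- \
  amtool config routes test --config.file=/etc/alertmanager/alertmanager.yml severity=critical service=gateway
```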
### Grafana Issues
```bash
# Check Grafana logs
kubectl logs -n monitoring deployment/grafana -f
# Reset Grafana admin password
kubectl exec -n monitoring deployment/grafana -- \
grafana-cli admin reset-admin-password NEW_PASSWORD
```
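Grafana also exposes a simple health endpoint that reports database connectivity; with the port-forward from the access section active, it can be checked from your workstation:
```bash
# Quick API health check (expects "database": "ok" in the response)
curl -s http://localhost:3000/api/health
```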
### PostgreSQL Exporter Issues
```bash
# Check exporter logs
kubectl logs -n monitoring deployment/postgres-exporter -f
# Test database connection
kubectl exec -n monitoring deployment/postgres-exporter -- \
wget -O- http://localhost:9187/metrics | grep pg_up
```
### Node Exporter Issues
```bash
# Check node exporter logs on a specific node (resolve the pod scheduled on that node first)
kubectl logs -n monitoring -f \
  $(kubectl get pods -n monitoring -l app=node-exporter --field-selector spec.nodeName=NODE_NAME -o name)
# Check metrics endpoint
kubectl exec -n monitoring daemonset/node-exporter -- \
wget -O- http://localhost:9100/metrics | head -n 20
```
## 📏 Resource Requirements
### Minimum Requirements (Development)
- CPU: 2 cores
- Memory: 4Gi
- Storage: 30Gi
### Recommended Requirements (Production)
- CPU: 6-8 cores
- Memory: 16Gi
- Storage: 100Gi
### Component Resource Allocation
| Component | Replicas | CPU Request | Memory Request | CPU Limit | Memory Limit |
|-----------|----------|-------------|----------------|-----------|--------------|
| Prometheus | 2 | 500m | 1Gi | 1 | 2Gi |
| AlertManager | 3 | 100m | 128Mi | 500m | 256Mi |
| Grafana | 1 | 100m | 256Mi | 500m | 512Mi |
| Postgres Exporter | 1 | 50m | 64Mi | 200m | 128Mi |
| Node Exporter | 1/node | 50m | 64Mi | 200m | 128Mi |
| Jaeger | 1 | 250m | 512Mi | 500m | 1Gi |
## 🔄 High Availability
### Prometheus HA
- 2 replicas in StatefulSet
- Each has independent storage (volumeClaimTemplates)
- Anti-affinity to spread across nodes
- Both scrape the same targets independently
- Use Thanos for long-term storage and global query view (future enhancement)
### AlertManager HA
- 3 replicas in StatefulSet
- Clustered mode (gossip protocol)
- Automatic leader election
- Alert deduplication across instances
- Anti-affinity to spread across nodes
### PodDisruptionBudgets
Ensure minimum availability during:
- Node maintenance
- Cluster upgrades
- Rolling updates
```yaml
Prometheus: minAvailable=1 (out of 2)
AlertManager: minAvailable=2 (out of 3)
Grafana: minAvailable=1 (out of 1)
```
## 📊 Metrics Reference
### Application Metrics (from services)
```promql
# HTTP request rate
rate(http_requests_total[5m])
# HTTP error rate
rate(http_requests_total{status_code=~"5.."}[5m]) / rate(http_requests_total[5m])
# Request latency (P95)
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Active connections
active_connections
```
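If these queries become dashboard hot paths, they can be precomputed with Prometheus recording rules. The sketch below is hypothetical (it is not part of the shipped `alert-rules.yaml` and would need to be wired into the Prometheus rule files):
```yaml
# Hypothetical recording rules precomputing the queries above
groups:
  - name: bakery_http_recording
    interval: 30s
    rules:
      - record: service:http_requests:rate5m
        expr: sum by (service) (rate(http_requests_total[5m]))
      - record: service:http_errors:ratio5m
        expr: |
          sum by (service) (rate(http_requests_total{status_code=~"5.."}[5m]))
          /
          sum by (service) (rate(http_requests_total[5m]))
      - record: service:http_request_duration_seconds:p95
        expr: histogram_quantile(0.95, sum by (service, le) (rate(http_request_duration_seconds_bucket[5m])))
```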
### PostgreSQL Metrics
```promql
# Active connections
pg_stat_database_numbackends
# Transaction rate
rate(pg_stat_database_xact_commit[5m])
# Cache hit ratio
rate(pg_stat_database_blks_hit[5m]) /
(rate(pg_stat_database_blks_hit[5m]) + rate(pg_stat_database_blks_read[5m]))
# Replication lag
pg_replication_lag_seconds
```
### Node Metrics
```promql
# CPU usage
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
# Disk I/O
rate(node_disk_read_bytes_total[5m])
rate(node_disk_written_bytes_total[5m])
# Network traffic
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])
```
## 🔗 Distributed Tracing
### Jaeger Configuration
Services automatically send traces when `JAEGER_ENABLED=true`:
```yaml
# In prod-configmap.yaml
JAEGER_ENABLED: "true"
JAEGER_AGENT_HOST: "jaeger-agent.monitoring.svc.cluster.local"
JAEGER_AGENT_PORT: "6831"
```
### Viewing Traces
1. Access Jaeger UI: https://monitoring.yourdomain.com/jaeger
2. Select service from dropdown
3. Click "Find Traces"
4. Explore trace details, spans, and timing
### Trace Sampling
Current sampling: 100% (all traces collected)
For high-traffic production environments, reduce the sampling rate:
```yaml
# Adjust in shared/monitoring/tracing.py
JAEGER_SAMPLE_RATE: "0.1" # 10% of traces
```
## 📚 Additional Resources
- [Prometheus Documentation](https://prometheus.io/docs/)
- [Grafana Documentation](https://grafana.com/docs/)
- [AlertManager Documentation](https://prometheus.io/docs/alerting/latest/alertmanager/)
- [Jaeger Documentation](https://www.jaegertracing.io/docs/)
- [PostgreSQL Exporter](https://github.com/prometheus-community/postgres_exporter)
- [Node Exporter](https://github.com/prometheus/node_exporter)
## 🆘 Support
For monitoring issues:
1. Check component logs (see Troubleshooting section)
2. Verify Prometheus targets are UP
3. Check AlertManager configuration and routing
4. Review resource usage and quotas
5. Contact platform team: platform-team@yourdomain.com
## 🔄 Maintenance
### Regular Tasks
**Daily:**
- Review critical alerts
- Check service health dashboards
**Weekly:**
- Review alert noise and adjust thresholds
- Check storage usage for Prometheus and Jaeger
- Review slow queries in PostgreSQL dashboard
**Monthly:**
- Update dashboard with new metrics
- Review and update alert runbooks
- Capacity planning based on trends
### Backup and Recovery
**Prometheus Data:**
```bash
# Backup Prometheus data
kubectl exec -n monitoring prometheus-0 -- tar czf /tmp/prometheus-backup.tar.gz /prometheus
kubectl cp monitoring/prometheus-0:/tmp/prometheus-backup.tar.gz ./prometheus-backup.tar.gz
# Restore (stop Prometheus first)
kubectl cp ./prometheus-backup.tar.gz monitoring/prometheus-0:/tmp/
kubectl exec -n monitoring prometheus-0 -- tar xzf /tmp/prometheus-backup.tar.gz -C /
```
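If consistent backups matter, Prometheus's TSDB snapshot API is a cleaner option than tarring the live data directory. Note that it requires starting Prometheus with `--web.enable-admin-api`, which this deployment does not necessarily enable:
```bash
# Trigger a TSDB snapshot via the admin API (requires --web.enable-admin-api)
kubectl port-forward -n monitoring prometheus-0 9090:9090 &
curl -s -X POST http://localhost:9090/api/v1/admin/tsdb/snapshot
# Snapshots are written under /prometheus/snapshots/<snapshot-name> inside the pod
# and can be copied out with kubectl cp
```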
**Grafana Dashboards:**
```bash
# Export all dashboards via the HTTP API (writes one JSON document per dashboard into the backup file)
curl -s -u admin:password "http://localhost:3000/api/search?type=dash-db" | \
  jq -r '.[] | .uid' | \
  xargs -I{} curl -s -u admin:password http://localhost:3000/api/dashboards/uid/{} > dashboards-backup.json
```
## 📝 Version History
- **v1.0.0** (2026-01-07) - Initial production-ready monitoring stack
- Prometheus v3.0.1 with HA
- AlertManager v0.27.0 with clustering
- Grafana v12.3.0 with 7 dashboards
- PostgreSQL and Node exporters
- 50+ alert rules
- Comprehensive documentation


@@ -0,0 +1,429 @@
---
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-alert-rules
namespace: monitoring
data:
alert-rules.yml: |
groups:
# Basic Infrastructure Alerts
- name: bakery_services
interval: 30s
rules:
- alert: ServiceDown
expr: up{job="bakery-services"} == 0
for: 2m
labels:
severity: critical
component: infrastructure
annotations:
summary: "Service {{ $labels.service }} is down"
description: "Service {{ $labels.service }} in namespace {{ $labels.namespace }} has been down for more than 2 minutes."
runbook_url: "https://runbooks.bakery-ia.local/ServiceDown"
- alert: HighErrorRate
expr: |
(
sum(rate(http_requests_total{status_code=~"5..", job="bakery-services"}[5m])) by (service)
/
sum(rate(http_requests_total{job="bakery-services"}[5m])) by (service)
) > 0.10
for: 5m
labels:
severity: critical
component: application
annotations:
summary: "High error rate on {{ $labels.service }}"
description: "Service {{ $labels.service }} has error rate above 10% (current: {{ $value | humanizePercentage }})."
runbook_url: "https://runbooks.bakery-ia.local/HighErrorRate"
- alert: HighResponseTime
expr: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket{job="bakery-services"}[5m])) by (service, le)
) > 1
for: 5m
labels:
severity: warning
component: performance
annotations:
summary: "High response time on {{ $labels.service }}"
description: "Service {{ $labels.service }} P95 latency is above 1 second (current: {{ $value }}s)."
runbook_url: "https://runbooks.bakery-ia.local/HighResponseTime"
- alert: HighMemoryUsage
expr: |
container_memory_usage_bytes{namespace="bakery-ia", container!=""} > 500000000
for: 5m
labels:
severity: warning
component: infrastructure
annotations:
summary: "High memory usage in {{ $labels.pod }}"
description: "Container {{ $labels.container }} in pod {{ $labels.pod }} is using more than 500MB of memory (current: {{ $value | humanize }}B)."
runbook_url: "https://runbooks.bakery-ia.local/HighMemoryUsage"
- alert: DatabaseConnectionHigh
expr: |
pg_stat_database_numbackends{datname="bakery"} > 80
for: 5m
labels:
severity: warning
component: database
annotations:
summary: "High database connection count"
description: "Database has more than 80 active connections (current: {{ $value }})."
runbook_url: "https://runbooks.bakery-ia.local/DatabaseConnectionHigh"
# Business Logic Alerts
- name: bakery_business
interval: 30s
rules:
- alert: TrainingJobFailed
expr: |
increase(training_job_failures_total[1h]) > 0
for: 5m
labels:
severity: warning
component: ml-training
annotations:
summary: "Training job failures detected"
description: "{{ $value }} training job(s) failed in the last hour."
runbook_url: "https://runbooks.bakery-ia.local/TrainingJobFailed"
- alert: LowPredictionAccuracy
expr: |
prediction_model_accuracy < 0.70
for: 15m
labels:
severity: warning
component: ml-inference
annotations:
summary: "Model prediction accuracy is low"
description: "Model {{ $labels.model_name }} accuracy is below 70% (current: {{ $value | humanizePercentage }})."
runbook_url: "https://runbooks.bakery-ia.local/LowPredictionAccuracy"
- alert: APIRateLimitHit
expr: |
increase(rate_limit_hits_total[5m]) > 10
for: 5m
labels:
severity: info
component: api-gateway
annotations:
summary: "API rate limits being hit frequently"
description: "Rate limits hit {{ $value }} times in the last 5 minutes."
runbook_url: "https://runbooks.bakery-ia.local/APIRateLimitHit"
# Alert System Health
- name: alert_system_health
interval: 30s
rules:
- alert: AlertSystemComponentDown
expr: |
alert_system_component_health{component=~"processor|notifier|scheduler"} == 0
for: 2m
labels:
severity: critical
component: alert-system
annotations:
summary: "Alert system component {{ $labels.component }} is unhealthy"
description: "Component {{ $labels.component }} has been unhealthy for more than 2 minutes."
runbook_url: "https://runbooks.bakery-ia.local/AlertSystemComponentDown"
- alert: RabbitMQConnectionDown
expr: |
rabbitmq_up == 0
for: 1m
labels:
severity: critical
component: alert-system
annotations:
summary: "RabbitMQ connection is down"
description: "Alert system has lost connection to RabbitMQ message queue."
runbook_url: "https://runbooks.bakery-ia.local/RabbitMQConnectionDown"
- alert: RedisConnectionDown
expr: |
redis_up == 0
for: 1m
labels:
severity: critical
component: alert-system
annotations:
summary: "Redis connection is down"
description: "Alert system has lost connection to Redis cache."
runbook_url: "https://runbooks.bakery-ia.local/RedisConnectionDown"
- alert: NoSchedulerLeader
expr: |
sum(alert_system_scheduler_leader) == 0
for: 5m
labels:
severity: warning
component: alert-system
annotations:
summary: "No alert scheduler leader elected"
description: "No scheduler instance has been elected as leader for 5 minutes."
runbook_url: "https://runbooks.bakery-ia.local/NoSchedulerLeader"
# Alert System Performance
- name: alert_system_performance
interval: 30s
rules:
- alert: HighAlertProcessingErrorRate
expr: |
(
sum(rate(alert_processing_errors_total[2m]))
/
sum(rate(alerts_processed_total[2m]))
) > 0.10
for: 2m
labels:
severity: critical
component: alert-system
annotations:
summary: "High alert processing error rate"
description: "Alert processing error rate is above 10% (current: {{ $value | humanizePercentage }})."
runbook_url: "https://runbooks.bakery-ia.local/HighAlertProcessingErrorRate"
- alert: HighNotificationDeliveryFailureRate
expr: |
(
sum(rate(notification_delivery_failures_total[3m]))
/
sum(rate(notifications_sent_total[3m]))
) > 0.05
for: 3m
labels:
severity: warning
component: alert-system
annotations:
summary: "High notification delivery failure rate"
description: "Notification delivery failure rate is above 5% (current: {{ $value | humanizePercentage }})."
runbook_url: "https://runbooks.bakery-ia.local/HighNotificationDeliveryFailureRate"
- alert: HighAlertProcessingLatency
expr: |
histogram_quantile(0.95,
sum(rate(alert_processing_duration_seconds_bucket[5m])) by (le)
) > 5
for: 5m
labels:
severity: warning
component: alert-system
annotations:
summary: "High alert processing latency"
description: "P95 alert processing latency is above 5 seconds (current: {{ $value }}s)."
runbook_url: "https://runbooks.bakery-ia.local/HighAlertProcessingLatency"
- alert: TooManySSEConnections
expr: |
sse_active_connections > 1000
for: 2m
labels:
severity: warning
component: alert-system
annotations:
summary: "Too many active SSE connections"
description: "More than 1000 active SSE connections (current: {{ $value }})."
runbook_url: "https://runbooks.bakery-ia.local/TooManySSEConnections"
- alert: SSEConnectionErrors
expr: |
rate(sse_connection_errors_total[3m]) > 0.5
for: 3m
labels:
severity: warning
component: alert-system
annotations:
summary: "High rate of SSE connection errors"
description: "SSE connection error rate is {{ $value }} errors/sec."
runbook_url: "https://runbooks.bakery-ia.local/SSEConnectionErrors"
# Alert System Business Logic
- name: alert_system_business
interval: 30s
rules:
- alert: UnusuallyHighAlertVolume
expr: |
rate(alerts_generated_total[5m]) > 2
for: 5m
labels:
severity: warning
component: alert-system
annotations:
summary: "Unusually high alert generation volume"
description: "More than 2 alerts per second being generated (current: {{ $value }}/sec)."
runbook_url: "https://runbooks.bakery-ia.local/UnusuallyHighAlertVolume"
- alert: NoAlertsGenerated
expr: |
rate(alerts_generated_total[30m]) == 0
for: 15m
labels:
severity: info
component: alert-system
annotations:
summary: "No alerts generated recently"
description: "No alerts have been generated in the last 30 minutes. This might indicate a problem with alert detection."
runbook_url: "https://runbooks.bakery-ia.local/NoAlertsGenerated"
- alert: SlowAlertResponseTime
expr: |
histogram_quantile(0.95,
sum(rate(alert_response_time_seconds_bucket[10m])) by (le)
) > 3600
for: 10m
labels:
severity: warning
component: alert-system
annotations:
summary: "Slow alert response times"
description: "P95 alert response time is above 1 hour (current: {{ $value | humanizeDuration }})."
runbook_url: "https://runbooks.bakery-ia.local/SlowAlertResponseTime"
- alert: CriticalAlertsUnacknowledged
expr: |
sum(alerts_unacknowledged{severity="critical"}) > 5
for: 10m
labels:
severity: warning
component: alert-system
annotations:
summary: "Multiple critical alerts unacknowledged"
description: "{{ $value }} critical alerts have not been acknowledged for 10+ minutes."
runbook_url: "https://runbooks.bakery-ia.local/CriticalAlertsUnacknowledged"
# Alert System Capacity
- name: alert_system_capacity
interval: 30s
rules:
- alert: LargeSSEMessageQueues
expr: |
sse_message_queue_size > 100
for: 5m
labels:
severity: warning
component: alert-system
annotations:
summary: "Large SSE message queues detected"
description: "SSE message queue for tenant {{ $labels.tenant_id }} has {{ $value }} messages queued."
runbook_url: "https://runbooks.bakery-ia.local/LargeSSEMessageQueues"
- alert: SlowDatabaseStorage
expr: |
histogram_quantile(0.95,
sum(rate(alert_storage_duration_seconds_bucket[5m])) by (le)
) > 1
for: 5m
labels:
severity: warning
component: alert-system
annotations:
summary: "Slow alert database storage"
description: "P95 alert storage latency is above 1 second (current: {{ $value }}s)."
runbook_url: "https://runbooks.bakery-ia.local/SlowDatabaseStorage"
# Alert System Critical Scenarios
- name: alert_system_critical
interval: 15s
rules:
- alert: AlertSystemDown
expr: |
up{service=~"alert-processor|notification-service"} == 0
for: 1m
labels:
severity: critical
component: alert-system
annotations:
summary: "Alert system is completely down"
description: "Core alert system service {{ $labels.service }} is down."
runbook_url: "https://runbooks.bakery-ia.local/AlertSystemDown"
- alert: AlertDataNotPersisted
expr: |
(
sum(rate(alerts_processed_total[2m]))
-
sum(rate(alerts_stored_total[2m]))
) > 0
for: 2m
labels:
severity: critical
component: alert-system
annotations:
summary: "Alerts not being persisted to database"
description: "Alerts are being processed but not stored in the database."
runbook_url: "https://runbooks.bakery-ia.local/AlertDataNotPersisted"
- alert: NotificationsNotDelivered
expr: |
(
sum(rate(alerts_processed_total[3m]))
-
sum(rate(notifications_sent_total[3m]))
) > 0
for: 3m
labels:
severity: critical
component: alert-system
annotations:
summary: "Notifications not being delivered"
description: "Alerts are being processed but notifications are not being sent."
runbook_url: "https://runbooks.bakery-ia.local/NotificationsNotDelivered"
# Monitoring System Self-Monitoring
- name: monitoring_health
interval: 30s
rules:
- alert: PrometheusDown
expr: up{job="prometheus"} == 0
for: 5m
labels:
severity: critical
component: monitoring
annotations:
summary: "Prometheus is down"
description: "Prometheus monitoring system is not responding."
runbook_url: "https://runbooks.bakery-ia.local/PrometheusDown"
- alert: AlertManagerDown
expr: up{job="alertmanager"} == 0
for: 2m
labels:
severity: critical
component: monitoring
annotations:
summary: "AlertManager is down"
description: "AlertManager is not responding. Alerts will not be routed."
runbook_url: "https://runbooks.bakery-ia.local/AlertManagerDown"
- alert: PrometheusStorageFull
expr: |
(
prometheus_tsdb_storage_blocks_bytes
/
(prometheus_tsdb_storage_blocks_bytes + prometheus_tsdb_wal_size_bytes)
) > 0.90
for: 10m
labels:
severity: warning
component: monitoring
annotations:
summary: "Prometheus storage almost full"
description: "Prometheus storage is {{ $value | humanizePercentage }} full."
runbook_url: "https://runbooks.bakery-ia.local/PrometheusStorageFull"
- alert: PrometheusScrapeErrors
expr: |
rate(prometheus_target_scrapes_exceeded_sample_limit_total[5m]) > 0
for: 5m
labels:
severity: warning
component: monitoring
annotations:
summary: "Prometheus scrape errors detected"
description: "Prometheus is experiencing scrape errors for target {{ $labels.job }}."
runbook_url: "https://runbooks.bakery-ia.local/PrometheusScrapeErrors"


@@ -0,0 +1,27 @@
---
# InitContainer to substitute secrets into AlertManager config
# This allows us to use environment variables from secrets in the config file
apiVersion: v1
kind: ConfigMap
metadata:
name: alertmanager-init-script
namespace: monitoring
data:
init-config.sh: |
#!/bin/sh
set -e
# Read the template config
TEMPLATE=$(cat /etc/alertmanager-template/alertmanager.yml)
# Substitute environment variables
echo "$TEMPLATE" | \
sed "s|{{ .smtp_host }}|${SMTP_HOST}|g" | \
sed "s|{{ .smtp_from }}|${SMTP_FROM}|g" | \
sed "s|{{ .smtp_username }}|${SMTP_USERNAME}|g" | \
sed "s|{{ .smtp_password }}|${SMTP_PASSWORD}|g" | \
sed "s|{{ .slack_webhook_url }}|${SLACK_WEBHOOK_URL}|g" \
> /etc/alertmanager-final/alertmanager.yml
echo "AlertManager config initialized successfully"
cat /etc/alertmanager-final/alertmanager.yml


@@ -0,0 +1,391 @@
---
apiVersion: v1
kind: ConfigMap
metadata:
name: alertmanager-config
namespace: monitoring
data:
alertmanager.yml: |
global:
resolve_timeout: 5m
smtp_smarthost: '{{ .smtp_host }}'
smtp_from: '{{ .smtp_from }}'
smtp_auth_username: '{{ .smtp_username }}'
smtp_auth_password: '{{ .smtp_password }}'
smtp_require_tls: true
# Define notification templates
templates:
- '/etc/alertmanager/templates/*.tmpl'
# Route alerts to appropriate receivers
route:
# Default receiver
receiver: 'default-email'
# Group alerts by these labels
group_by: ['alertname', 'cluster', 'service']
# Wait time before sending initial notification
group_wait: 10s
# Wait time before sending notifications about new alerts in the group
group_interval: 10s
# Wait time before re-sending a notification
repeat_interval: 12h
# Child routes for specific alert routing
routes:
# Critical alerts - send immediately to all channels
- match:
severity: critical
receiver: 'critical-alerts'
group_wait: 0s
group_interval: 5m
repeat_interval: 4h
continue: true
# Warning alerts - less urgent
- match:
severity: warning
receiver: 'warning-alerts'
group_wait: 30s
group_interval: 5m
repeat_interval: 12h
# Alert system specific alerts
- match:
component: alert-system
receiver: 'alert-system-team'
group_wait: 10s
repeat_interval: 6h
# Database alerts
- match_re:
alertname: ^(DatabaseConnectionHigh|SlowDatabaseStorage)$
receiver: 'database-team'
group_wait: 30s
repeat_interval: 8h
# Infrastructure alerts
- match_re:
alertname: ^(HighMemoryUsage|ServiceDown)$
receiver: 'infra-team'
group_wait: 30s
repeat_interval: 6h
# Inhibition rules - prevent alert spam
inhibit_rules:
# If service is down, inhibit all other alerts for that service
- source_match:
alertname: 'ServiceDown'
target_match_re:
alertname: '(HighErrorRate|HighResponseTime|HighMemoryUsage)'
equal: ['service']
# If AlertSystem is completely down, inhibit component alerts
- source_match:
alertname: 'AlertSystemDown'
target_match_re:
alertname: 'AlertSystemComponent.*'
equal: ['namespace']
# If RabbitMQ is down, inhibit alert processing errors
- source_match:
alertname: 'RabbitMQConnectionDown'
target_match:
alertname: 'HighAlertProcessingErrorRate'
equal: ['namespace']
# Receivers - notification destinations
receivers:
# Default email receiver
- name: 'default-email'
email_configs:
- to: 'alerts@yourdomain.com'
headers:
Subject: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }} - {{ .GroupLabels.service }}'
html: |
{{ range .Alerts }}
<h2>{{ .Labels.alertname }}</h2>
<p><strong>Status:</strong> {{ .Status }}</p>
<p><strong>Severity:</strong> {{ .Labels.severity }}</p>
<p><strong>Service:</strong> {{ .Labels.service }}</p>
<p><strong>Summary:</strong> {{ .Annotations.summary }}</p>
<p><strong>Description:</strong> {{ .Annotations.description }}</p>
<p><strong>Started:</strong> {{ .StartsAt }}</p>
{{ if .EndsAt }}<p><strong>Ended:</strong> {{ .EndsAt }}</p>{{ end }}
{{ end }}
# Critical alerts - multiple channels
- name: 'critical-alerts'
email_configs:
- to: 'critical-alerts@yourdomain.com,oncall@yourdomain.com'
headers:
Subject: '🚨 [CRITICAL] {{ .GroupLabels.alertname }} - {{ .GroupLabels.service }}'
send_resolved: true
# Uncomment to enable Slack notifications
# slack_configs:
# - api_url: '{{ .slack_webhook_url }}'
# channel: '#alerts-critical'
# title: '🚨 Critical Alert'
# text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
# send_resolved: true
# Warning alerts
- name: 'warning-alerts'
email_configs:
- to: 'alerts@yourdomain.com'
headers:
Subject: '⚠️ [WARNING] {{ .GroupLabels.alertname }} - {{ .GroupLabels.service }}'
send_resolved: true
# Alert system team
- name: 'alert-system-team'
email_configs:
- to: 'alert-system-team@yourdomain.com'
headers:
Subject: '[Alert System] {{ .GroupLabels.alertname }}'
send_resolved: true
# Database team
- name: 'database-team'
email_configs:
- to: 'database-team@yourdomain.com'
headers:
Subject: '[Database] {{ .GroupLabels.alertname }}'
send_resolved: true
# Infrastructure team
- name: 'infra-team'
email_configs:
- to: 'infra-team@yourdomain.com'
headers:
Subject: '[Infrastructure] {{ .GroupLabels.alertname }}'
send_resolved: true
---
apiVersion: v1
kind: ConfigMap
metadata:
name: alertmanager-templates
namespace: monitoring
data:
default.tmpl: |
{{ define "cluster" }}{{ .ExternalURL | reReplaceAll ".*alertmanager\\.(.*)" "$1" }}{{ end }}
{{ define "slack.default.title" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.alertname }}
{{ end }}
{{ define "slack.default.text" }}
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
*Severity:* `{{ .Labels.severity }}`
*Service:* `{{ .Labels.service }}`
{{ end }}
{{ end }}
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: alertmanager
namespace: monitoring
labels:
app: alertmanager
spec:
serviceName: alertmanager
replicas: 3
selector:
matchLabels:
app: alertmanager
template:
metadata:
labels:
app: alertmanager
spec:
serviceAccountName: prometheus
initContainers:
- name: init-config
image: busybox:1.36
command: ['/bin/sh', '/scripts/init-config.sh']
env:
- name: SMTP_HOST
valueFrom:
secretKeyRef:
name: alertmanager-secrets
key: smtp-host
- name: SMTP_USERNAME
valueFrom:
secretKeyRef:
name: alertmanager-secrets
key: smtp-username
- name: SMTP_PASSWORD
valueFrom:
secretKeyRef:
name: alertmanager-secrets
key: smtp-password
- name: SMTP_FROM
valueFrom:
secretKeyRef:
name: alertmanager-secrets
key: smtp-from
- name: SLACK_WEBHOOK_URL
valueFrom:
secretKeyRef:
name: alertmanager-secrets
key: slack-webhook-url
optional: true
volumeMounts:
- name: init-script
mountPath: /scripts
- name: config-template
mountPath: /etc/alertmanager-template
- name: config-final
mountPath: /etc/alertmanager-final
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- alertmanager
topologyKey: kubernetes.io/hostname
containers:
- name: alertmanager
image: prom/alertmanager:v0.27.0
args:
- '--config.file=/etc/alertmanager/alertmanager.yml'
- '--storage.path=/alertmanager'
- '--cluster.listen-address=0.0.0.0:9094'
- '--cluster.peer=alertmanager-0.alertmanager.monitoring.svc.cluster.local:9094'
- '--cluster.peer=alertmanager-1.alertmanager.monitoring.svc.cluster.local:9094'
- '--cluster.peer=alertmanager-2.alertmanager.monitoring.svc.cluster.local:9094'
- '--cluster.reconnect-timeout=5m'
- '--web.external-url=http://monitoring.bakery-ia.local/alertmanager'
- '--web.route-prefix=/'
ports:
- name: web
containerPort: 9093
- name: mesh-tcp
containerPort: 9094
- name: mesh-udp
containerPort: 9094
protocol: UDP
env:
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
volumeMounts:
- name: config-final
mountPath: /etc/alertmanager
- name: templates
mountPath: /etc/alertmanager/templates
- name: storage
mountPath: /alertmanager
resources:
requests:
memory: "128Mi"
cpu: "100m"
limits:
memory: "256Mi"
cpu: "500m"
livenessProbe:
httpGet:
path: /-/healthy
port: 9093
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /-/ready
port: 9093
initialDelaySeconds: 5
periodSeconds: 5
# Config reloader sidecar
- name: configmap-reload
image: jimmidyson/configmap-reload:v0.12.0
args:
- '--webhook-url=http://localhost:9093/-/reload'
- '--volume-dir=/etc/alertmanager'
volumeMounts:
- name: config-final
mountPath: /etc/alertmanager
readOnly: true
resources:
requests:
memory: "16Mi"
cpu: "10m"
limits:
memory: "32Mi"
cpu: "50m"
volumes:
- name: init-script
configMap:
name: alertmanager-init-script
defaultMode: 0755
- name: config-template
configMap:
name: alertmanager-config
- name: config-final
emptyDir: {}
- name: templates
configMap:
name: alertmanager-templates
volumeClaimTemplates:
- metadata:
name: storage
spec:
accessModes: [ "ReadWriteOnce" ]
resources:
requests:
storage: 2Gi
---
apiVersion: v1
kind: Service
metadata:
name: alertmanager
namespace: monitoring
labels:
app: alertmanager
spec:
type: ClusterIP
clusterIP: None
ports:
- name: web
port: 9093
targetPort: 9093
- name: mesh-tcp
port: 9094
targetPort: 9094
- name: mesh-udp
port: 9094
targetPort: 9094
protocol: UDP
selector:
app: alertmanager
---
apiVersion: v1
kind: Service
metadata:
name: alertmanager-external
namespace: monitoring
labels:
app: alertmanager
spec:
type: ClusterIP
ports:
- name: web
port: 9093
targetPort: 9093
selector:
app: alertmanager


@@ -0,0 +1,949 @@
apiVersion: v1
kind: ConfigMap
metadata:
name: grafana-dashboards-extended
namespace: monitoring
data:
postgresql-dashboard.json: |
{
"dashboard": {
"title": "Bakery IA - PostgreSQL Database",
"tags": ["bakery-ia", "postgresql", "database"],
"timezone": "browser",
"refresh": "30s",
"schemaVersion": 16,
"version": 1,
"panels": [
{
"id": 1,
"title": "Active Connections by Database",
"type": "graph",
"gridPos": {"x": 0, "y": 0, "w": 12, "h": 8},
"targets": [
{
"expr": "pg_stat_activity_count{state=\"active\"}",
"legendFormat": "{{datname}} - active"
},
{
"expr": "pg_stat_activity_count{state=\"idle\"}",
"legendFormat": "{{datname}} - idle"
},
{
"expr": "pg_stat_activity_count{state=\"idle in transaction\"}",
"legendFormat": "{{datname}} - idle tx"
}
]
},
{
"id": 2,
"title": "Total Connections",
"type": "stat",
"gridPos": {"x": 12, "y": 0, "w": 6, "h": 4},
"targets": [
{
"expr": "sum(pg_stat_activity_count)",
"legendFormat": "Total connections"
}
]
},
{
"id": 3,
"title": "Max Connections",
"type": "stat",
"gridPos": {"x": 18, "y": 0, "w": 6, "h": 4},
"targets": [
{
"expr": "pg_settings_max_connections",
"legendFormat": "Max connections"
}
]
},
{
"id": 4,
"title": "Transaction Rate (Commits vs Rollbacks)",
"type": "graph",
"gridPos": {"x": 0, "y": 8, "w": 12, "h": 8},
"targets": [
{
"expr": "rate(pg_stat_database_xact_commit[5m])",
"legendFormat": "{{datname}} - commits"
},
{
"expr": "rate(pg_stat_database_xact_rollback[5m])",
"legendFormat": "{{datname}} - rollbacks"
}
]
},
{
"id": 5,
"title": "Cache Hit Ratio",
"type": "graph",
"gridPos": {"x": 12, "y": 8, "w": 12, "h": 8},
"targets": [
{
"expr": "100 * (1 - (sum(rate(pg_stat_io_blocks_read_total[5m])) / (sum(rate(pg_stat_io_blocks_read_total[5m])) + sum(rate(pg_stat_io_blocks_hit_total[5m])))))",
"legendFormat": "Cache hit ratio %"
}
]
},
{
"id": 6,
"title": "Slow Queries (> 30s)",
"type": "table",
"gridPos": {"x": 0, "y": 16, "w": 12, "h": 8},
"targets": [
{
"expr": "pg_slow_queries{duration_ms > 30000}",
"format": "table",
"instant": true
}
],
"transformations": [
{
"id": "organize",
"options": {
"excludeByName": {},
"indexByName": {},
"renameByName": {
"query": "Query",
"duration_ms": "Duration (ms)",
"datname": "Database"
}
}
}
]
},
{
"id": 7,
"title": "Dead Tuples by Table",
"type": "graph",
"gridPos": {"x": 12, "y": 16, "w": 12, "h": 8},
"targets": [
{
"expr": "pg_stat_user_tables_n_dead_tup",
"legendFormat": "{{schemaname}}.{{relname}}"
}
]
},
{
"id": 8,
"title": "Table Bloat Estimate",
"type": "graph",
"gridPos": {"x": 0, "y": 24, "w": 12, "h": 8},
"targets": [
{
"expr": "100 * (pg_stat_user_tables_n_dead_tup * avg_tuple_size) / (pg_total_relation_size * 8192)",
"legendFormat": "{{schemaname}}.{{relname}} bloat %"
}
]
},
{
"id": 9,
"title": "Replication Lag (bytes)",
"type": "graph",
"gridPos": {"x": 12, "y": 24, "w": 12, "h": 8},
"targets": [
{
"expr": "pg_replication_lag_bytes",
"legendFormat": "{{slot_name}} - {{application_name}}"
}
]
},
{
"id": 10,
"title": "Database Size (GB)",
"type": "graph",
"gridPos": {"x": 0, "y": 32, "w": 12, "h": 8},
"targets": [
{
"expr": "pg_database_size_bytes / 1024 / 1024 / 1024",
"legendFormat": "{{datname}}"
}
]
},
{
"id": 11,
"title": "Database Size Growth (per hour)",
"type": "graph",
"gridPos": {"x": 12, "y": 32, "w": 12, "h": 8},
"targets": [
{
"expr": "rate(pg_database_size_bytes[1h])",
"legendFormat": "{{datname}} - bytes/hour"
}
]
},
{
"id": 12,
"title": "Lock Counts by Type",
"type": "graph",
"gridPos": {"x": 0, "y": 40, "w": 12, "h": 8},
"targets": [
{
"expr": "pg_locks_count",
"legendFormat": "{{datname}} - {{locktype}} - {{mode}}"
}
]
},
{
"id": 13,
"title": "Query Duration (p95)",
"type": "graph",
"gridPos": {"x": 12, "y": 40, "w": 12, "h": 8},
"targets": [
{
"expr": "histogram_quantile(0.95, rate(pg_query_duration_seconds_bucket[5m]))",
"legendFormat": "p95"
}
]
}
]
}
}
node-exporter-dashboard.json: |
{
"dashboard": {
"title": "Bakery IA - Node Exporter Infrastructure",
"tags": ["bakery-ia", "node-exporter", "infrastructure"],
"timezone": "browser",
"refresh": "15s",
"schemaVersion": 16,
"version": 1,
"panels": [
{
"id": 1,
"title": "CPU Usage by Node",
"type": "graph",
"gridPos": {"x": 0, "y": 0, "w": 12, "h": 8},
"targets": [
{
"expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
"legendFormat": "{{instance}} - {{cpu}}"
}
]
},
{
"id": 2,
"title": "Average CPU Usage",
"type": "stat",
"gridPos": {"x": 12, "y": 0, "w": 6, "h": 4},
"targets": [
{
"expr": "100 - (avg(rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
"legendFormat": "Average CPU %"
}
]
},
{
"id": 3,
"title": "CPU Load (1m, 5m, 15m)",
"type": "stat",
"gridPos": {"x": 18, "y": 0, "w": 6, "h": 4},
"targets": [
{
"expr": "avg(node_load1)",
"legendFormat": "1m"
},
{
"expr": "avg(node_load5)",
"legendFormat": "5m"
},
{
"expr": "avg(node_load15)",
"legendFormat": "15m"
}
]
},
{
"id": 4,
"title": "Memory Usage by Node",
"type": "graph",
"gridPos": {"x": 0, "y": 8, "w": 12, "h": 8},
"targets": [
{
"expr": "100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))",
"legendFormat": "{{instance}}"
}
]
},
{
"id": 5,
"title": "Memory Used (GB)",
"type": "stat",
"gridPos": {"x": 12, "y": 8, "w": 6, "h": 4},
"targets": [
{
"expr": "(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / 1024 / 1024 / 1024",
"legendFormat": "{{instance}}"
}
]
},
{
"id": 6,
"title": "Memory Available (GB)",
"type": "stat",
"gridPos": {"x": 18, "y": 8, "w": 6, "h": 4},
"targets": [
{
"expr": "node_memory_MemAvailable_bytes / 1024 / 1024 / 1024",
"legendFormat": "{{instance}}"
}
]
},
{
"id": 7,
"title": "Disk I/O Read Rate (MB/s)",
"type": "graph",
"gridPos": {"x": 0, "y": 16, "w": 12, "h": 8},
"targets": [
{
"expr": "rate(node_disk_read_bytes_total[5m]) / 1024 / 1024",
"legendFormat": "{{instance}} - {{device}}"
}
]
},
{
"id": 8,
"title": "Disk I/O Write Rate (MB/s)",
"type": "graph",
"gridPos": {"x": 12, "y": 16, "w": 12, "h": 8},
"targets": [
{
"expr": "rate(node_disk_written_bytes_total[5m]) / 1024 / 1024",
"legendFormat": "{{instance}} - {{device}}"
}
]
},
{
"id": 9,
"title": "Disk I/O Operations (IOPS)",
"type": "graph",
"gridPos": {"x": 0, "y": 24, "w": 12, "h": 8},
"targets": [
{
"expr": "rate(node_disk_reads_completed_total[5m]) + rate(node_disk_writes_completed_total[5m])",
"legendFormat": "{{instance}} - {{device}}"
}
]
},
{
"id": 10,
"title": "Network Receive Rate (Mbps)",
"type": "graph",
"gridPos": {"x": 12, "y": 24, "w": 12, "h": 8},
"targets": [
{
"expr": "rate(node_network_receive_bytes_total{device!=\"lo\"}[5m]) * 8 / 1024 / 1024",
"legendFormat": "{{instance}} - {{device}}"
}
]
},
{
"id": 11,
"title": "Network Transmit Rate (Mbps)",
"type": "graph",
"gridPos": {"x": 0, "y": 32, "w": 12, "h": 8},
"targets": [
{
"expr": "rate(node_network_transmit_bytes_total{device!=\"lo\"}[5m]) * 8 / 1024 / 1024",
"legendFormat": "{{instance}} - {{device}}"
}
]
},
{
"id": 12,
"title": "Network Errors",
"type": "graph",
"gridPos": {"x": 12, "y": 32, "w": 12, "h": 8},
"targets": [
{
"expr": "rate(node_network_receive_errs_total[5m]) + rate(node_network_transmit_errs_total[5m])",
"legendFormat": "{{instance}} - {{device}}"
}
]
},
{
"id": 13,
"title": "Filesystem Usage by Mount",
"type": "graph",
"gridPos": {"x": 0, "y": 40, "w": 12, "h": 8},
"targets": [
{
"expr": "100 * (1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes))",
"legendFormat": "{{instance}} - {{mountpoint}}"
}
]
},
{
"id": 14,
"title": "Filesystem Available (GB)",
"type": "stat",
"gridPos": {"x": 12, "y": 40, "w": 6, "h": 4},
"targets": [
{
"expr": "node_filesystem_avail_bytes / 1024 / 1024 / 1024",
"legendFormat": "{{instance}} - {{mountpoint}}"
}
]
},
{
"id": 15,
"title": "Filesystem Size (GB)",
"type": "stat",
"gridPos": {"x": 18, "y": 40, "w": 6, "h": 4},
"targets": [
{
"expr": "node_filesystem_size_bytes / 1024 / 1024 / 1024",
"legendFormat": "{{instance}} - {{mountpoint}}"
}
]
},
{
"id": 16,
"title": "Load Average (1m, 5m, 15m)",
"type": "graph",
"gridPos": {"x": 0, "y": 48, "w": 12, "h": 8},
"targets": [
{
"expr": "node_load1",
"legendFormat": "{{instance}} - 1m"
},
{
"expr": "node_load5",
"legendFormat": "{{instance}} - 5m"
},
{
"expr": "node_load15",
"legendFormat": "{{instance}} - 15m"
}
]
},
{
"id": 17,
"title": "System Up Time",
"type": "stat",
"gridPos": {"x": 12, "y": 48, "w": 12, "h": 8},
"targets": [
{
"expr": "node_boot_time_seconds",
"legendFormat": "{{instance}} - uptime"
}
]
},
{
"id": 18,
"title": "Context Switches",
"type": "graph",
"gridPos": {"x": 0, "y": 56, "w": 12, "h": 8},
"targets": [
{
"expr": "rate(node_context_switches_total[5m])",
"legendFormat": "{{instance}}"
}
]
},
{
"id": 19,
"title": "Interrupts",
"type": "graph",
"gridPos": {"x": 12, "y": 56, "w": 12, "h": 8},
"targets": [
{
"expr": "rate(node_intr_total[5m])",
"legendFormat": "{{instance}}"
}
]
}
]
}
}
alertmanager-dashboard.json: |
{
"dashboard": {
"title": "Bakery IA - AlertManager Monitoring",
"tags": ["bakery-ia", "alertmanager", "alerting"],
"timezone": "browser",
"refresh": "10s",
"schemaVersion": 16,
"version": 1,
"panels": [
{
"id": 1,
"title": "Active Alerts by Severity",
"type": "graph",
"gridPos": {"x": 0, "y": 0, "w": 12, "h": 8},
"targets": [
{
"expr": "count by (severity) (ALERTS{alertstate=\"firing\"})",
"legendFormat": "{{severity}}"
}
]
},
{
"id": 2,
"title": "Total Active Alerts",
"type": "stat",
"gridPos": {"x": 12, "y": 0, "w": 6, "h": 4},
"targets": [
{
"expr": "count(ALERTS{alertstate=\"firing\"})",
"legendFormat": "Active alerts"
}
]
},
{
"id": 3,
"title": "Critical Alerts",
"type": "stat",
"gridPos": {"x": 18, "y": 0, "w": 6, "h": 4},
"targets": [
{
"expr": "count(ALERTS{alertstate=\"firing\", severity=\"critical\"})",
"legendFormat": "Critical"
}
]
},
{
"id": 4,
"title": "Alert Firing Rate (per minute)",
"type": "graph",
"gridPos": {"x": 0, "y": 8, "w": 12, "h": 8},
"targets": [
{
"expr": "rate(alertmanager_alerts_fired_total[1m])",
"legendFormat": "Alerts fired/min"
}
]
},
{
"id": 5,
"title": "Alert Resolution Rate (per minute)",
"type": "graph",
"gridPos": {"x": 12, "y": 8, "w": 12, "h": 8},
"targets": [
{
"expr": "rate(alertmanager_alerts_resolved_total[1m])",
"legendFormat": "Alerts resolved/min"
}
]
},
{
"id": 6,
"title": "Notification Success Rate",
"type": "graph",
"gridPos": {"x": 0, "y": 16, "w": 12, "h": 8},
"targets": [
{
"expr": "100 * (rate(alertmanager_notifications_total{status=\"success\"}[5m]) / rate(alertmanager_notifications_total[5m]))",
"legendFormat": "Success rate %"
}
]
},
{
"id": 7,
"title": "Notification Failures",
"type": "graph",
"gridPos": {"x": 12, "y": 16, "w": 12, "h": 8},
"targets": [
{
"expr": "rate(alertmanager_notifications_total{status=\"failed\"}[5m])",
"legendFormat": "{{integration}}"
}
]
},
{
"id": 8,
"title": "Silenced Alerts",
"type": "stat",
"gridPos": {"x": 0, "y": 24, "w": 6, "h": 4},
"targets": [
{
"expr": "count(ALERTS{alertstate=\"silenced\"})",
"legendFormat": "Silenced"
}
]
},
{
"id": 9,
"title": "AlertManager Cluster Size",
"type": "stat",
"gridPos": {"x": 6, "y": 24, "w": 6, "h": 4},
"targets": [
{
"expr": "count(alertmanager_cluster_peers)",
"legendFormat": "Cluster peers"
}
]
},
{
"id": 10,
"title": "AlertManager Peers",
"type": "stat",
"gridPos": {"x": 12, "y": 24, "w": 6, "h": 4},
"targets": [
{
"expr": "alertmanager_cluster_peers",
"legendFormat": "{{instance}}"
}
]
},
{
"id": 11,
"title": "Cluster Status",
"type": "stat",
"gridPos": {"x": 18, "y": 24, "w": 6, "h": 4},
"targets": [
{
"expr": "up{job=\"alertmanager\"}",
"legendFormat": "{{instance}}"
}
]
},
{
"id": 12,
"title": "Alerts by Group",
"type": "table",
"gridPos": {"x": 0, "y": 28, "w": 12, "h": 8},
"targets": [
{
"expr": "count by (alertname) (ALERTS{alertstate=\"firing\"})",
"format": "table",
"instant": true
}
],
"transformations": [
{
"id": "organize",
"options": {
"excludeByName": {},
"indexByName": {},
"renameByName": {
"alertname": "Alert Name",
"Value": "Count"
}
}
}
]
},
{
"id": 13,
"title": "Alert Duration (p99)",
"type": "graph",
"gridPos": {"x": 12, "y": 28, "w": 12, "h": 8},
"targets": [
{
"expr": "histogram_quantile(0.99, rate(alertmanager_alert_duration_seconds_bucket[5m]))",
"legendFormat": "p99 duration"
}
]
},
{
"id": 14,
"title": "Processing Time",
"type": "graph",
"gridPos": {"x": 0, "y": 36, "w": 12, "h": 8},
"targets": [
{
"expr": "rate(alertmanager_receiver_processing_duration_seconds_sum[5m]) / rate(alertmanager_receiver_processing_duration_seconds_count[5m])",
"legendFormat": "{{receiver}}"
}
]
},
{
"id": 15,
"title": "Memory Usage",
"type": "stat",
"gridPos": {"x": 12, "y": 36, "w": 12, "h": 8},
"targets": [
{
"expr": "process_resident_memory_bytes{job=\"alertmanager\"} / 1024 / 1024",
"legendFormat": "{{instance}} - MB"
}
]
}
]
}
}
business-metrics-dashboard.json: |
{
"dashboard": {
"title": "Bakery IA - Business Metrics & KPIs",
"tags": ["bakery-ia", "business-metrics", "kpis"],
"timezone": "browser",
"refresh": "30s",
"schemaVersion": 16,
"version": 1,
"panels": [
{
"id": 1,
"title": "Requests per Service (Rate)",
"type": "graph",
"gridPos": {"x": 0, "y": 0, "w": 12, "h": 8},
"targets": [
{
"expr": "sum by (service) (rate(http_requests_total[5m]))",
"legendFormat": "{{service}}"
}
]
},
{
"id": 2,
"title": "Total Request Rate",
"type": "stat",
"gridPos": {"x": 12, "y": 0, "w": 6, "h": 4},
"targets": [
{
"expr": "sum(rate(http_requests_total[5m]))",
"legendFormat": "requests/sec"
}
]
},
{
"id": 3,
"title": "Peak Request Rate (5m)",
"type": "stat",
"gridPos": {"x": 18, "y": 0, "w": 6, "h": 4},
"targets": [
{
"expr": "max(sum(rate(http_requests_total[5m])))",
"legendFormat": "Peak requests/sec"
}
]
},
{
"id": 4,
"title": "Error Rates by Service",
"type": "graph",
"gridPos": {"x": 0, "y": 8, "w": 12, "h": 8},
"targets": [
{
"expr": "sum by (service) (rate(http_requests_total{status_code=~\"5..\"}[5m]))",
"legendFormat": "{{service}}"
}
]
},
{
"id": 5,
"title": "Overall Error Rate",
"type": "stat",
"gridPos": {"x": 12, "y": 8, "w": 6, "h": 4},
"targets": [
{
"expr": "100 * (sum(rate(http_requests_total{status_code=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])))",
"legendFormat": "Error %"
}
]
},
{
"id": 6,
"title": "4xx Error Rate",
"type": "stat",
"gridPos": {"x": 18, "y": 8, "w": 6, "h": 4},
"targets": [
{
"expr": "100 * (sum(rate(http_requests_total{status_code=~\"4..\"}[5m])) / sum(rate(http_requests_total[5m])))",
"legendFormat": "4xx %"
}
]
},
{
"id": 7,
"title": "P95 Latency by Service (ms)",
"type": "graph",
"gridPos": {"x": 0, "y": 16, "w": 12, "h": 8},
"targets": [
{
"expr": "histogram_quantile(0.95, sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))) * 1000",
"legendFormat": "{{service}} p95"
}
]
},
{
"id": 8,
"title": "P99 Latency by Service (ms)",
"type": "graph",
"gridPos": {"x": 12, "y": 16, "w": 12, "h": 8},
"targets": [
{
"expr": "histogram_quantile(0.99, sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))) * 1000",
"legendFormat": "{{service}} p99"
}
]
},
{
"id": 9,
"title": "Average Latency (ms)",
"type": "stat",
"gridPos": {"x": 0, "y": 24, "w": 6, "h": 4},
"targets": [
{
"expr": "(sum(rate(http_request_duration_seconds_sum[5m])) / sum(rate(http_request_duration_seconds_count[5m]))) * 1000",
"legendFormat": "Avg latency ms"
}
]
},
{
"id": 10,
"title": "Active Tenants",
"type": "stat",
"gridPos": {"x": 6, "y": 24, "w": 6, "h": 4},
"targets": [
{
"expr": "count(count by (tenant_id) (rate(http_requests_total[5m])))",
"legendFormat": "Active tenants"
}
]
},
{
"id": 11,
"title": "Requests per Tenant",
"type": "stat",
"gridPos": {"x": 12, "y": 24, "w": 12, "h": 4},
"targets": [
{
"expr": "sum by (tenant_id) (rate(http_requests_total[5m]))",
"legendFormat": "Tenant {{tenant_id}}"
}
]
},
{
"id": 12,
"title": "Alert Generation Rate (per minute)",
"type": "graph",
"gridPos": {"x": 0, "y": 32, "w": 12, "h": 8},
"targets": [
{
"expr": "rate(ALERTS_FOR_STATE[1m])",
"legendFormat": "{{alertname}}"
}
]
},
{
"id": 13,
"title": "Training Job Success Rate",
"type": "stat",
"gridPos": {"x": 12, "y": 32, "w": 12, "h": 8},
"targets": [
{
"expr": "100 * (sum(training_job_completed_total{status=\"success\"}) / sum(training_job_completed_total))",
"legendFormat": "Success rate %"
}
]
},
{
"id": 14,
"title": "Training Jobs in Progress",
"type": "stat",
"gridPos": {"x": 0, "y": 40, "w": 6, "h": 4},
"targets": [
{
"expr": "count(training_job_in_progress)",
"legendFormat": "Jobs running"
}
]
},
{
"id": 15,
"title": "Training Job Completion Time (p95, minutes)",
"type": "stat",
"gridPos": {"x": 6, "y": 40, "w": 6, "h": 4},
"targets": [
{
"expr": "histogram_quantile(0.95, training_job_duration_seconds) / 60",
"legendFormat": "p95 minutes"
}
]
},
{
"id": 16,
"title": "Failed Training Jobs",
"type": "stat",
"gridPos": {"x": 12, "y": 40, "w": 6, "h": 4},
"targets": [
{
"expr": "sum(training_job_completed_total{status=\"failed\"})",
"legendFormat": "Failed jobs"
}
]
},
{
"id": 17,
"title": "Total Training Jobs Completed",
"type": "stat",
"gridPos": {"x": 18, "y": 40, "w": 6, "h": 4},
"targets": [
{
"expr": "sum(training_job_completed_total)",
"legendFormat": "Total completed"
}
]
},
{
"id": 18,
"title": "API Health Status",
"type": "table",
"gridPos": {"x": 0, "y": 48, "w": 12, "h": 8},
"targets": [
{
"expr": "up{job=\"bakery-services\"}",
"format": "table",
"instant": true
}
],
"transformations": [
{
"id": "organize",
"options": {
"excludeByName": {},
"indexByName": {},
"renameByName": {
"service": "Service",
"Value": "Status",
"instance": "Instance"
}
}
}
]
},
{
"id": 19,
"title": "Service Success Rate (%)",
"type": "graph",
"gridPos": {"x": 12, "y": 48, "w": 12, "h": 8},
"targets": [
{
"expr": "100 * (1 - (sum by (service) (rate(http_requests_total{status_code=~\"5..\"}[5m])) / sum by (service) (rate(http_requests_total[5m]))))",
"legendFormat": "{{service}}"
}
]
},
{
"id": 20,
"title": "Requests Processed Today",
"type": "stat",
"gridPos": {"x": 0, "y": 56, "w": 12, "h": 4},
"targets": [
{
"expr": "sum(increase(http_requests_total[24h]))",
"legendFormat": "Requests (24h)"
}
]
},
{
"id": 21,
"title": "Distinct Users Today",
"type": "stat",
"gridPos": {"x": 12, "y": 56, "w": 12, "h": 4},
"targets": [
{
"expr": "count(count by (user_id) (increase(http_requests_total{user_id!=\"\"}[24h])))",
"legendFormat": "Users (24h)"
}
]
}
]
}
}


@@ -34,6 +34,15 @@ data:
allowUiUpdates: true
options:
path: /var/lib/grafana/dashboards
- name: 'extended'
orgId: 1
folder: 'Bakery IA - Extended'
type: file
disableDeletion: false
updateIntervalSeconds: 10
allowUiUpdates: true
options:
path: /var/lib/grafana/dashboards-extended
---
apiVersion: apps/v1
@@ -61,9 +70,15 @@ spec:
name: http
env:
- name: GF_SECURITY_ADMIN_USER
value: admin
valueFrom:
secretKeyRef:
name: grafana-admin
key: admin-user
- name: GF_SECURITY_ADMIN_PASSWORD
value: admin
valueFrom:
secretKeyRef:
name: grafana-admin
key: admin-password
- name: GF_SERVER_ROOT_URL
value: "http://monitoring.bakery-ia.local/grafana"
- name: GF_SERVER_SERVE_FROM_SUB_PATH
@@ -81,6 +96,8 @@ spec:
mountPath: /etc/grafana/provisioning/dashboards
- name: grafana-dashboards
mountPath: /var/lib/grafana/dashboards
- name: grafana-dashboards-extended
mountPath: /var/lib/grafana/dashboards-extended
resources:
requests:
memory: "256Mi"
@@ -113,6 +130,9 @@ spec:
- name: grafana-dashboards
configMap:
name: grafana-dashboards
- name: grafana-dashboards-extended
configMap:
name: grafana-dashboards-extended
---
apiVersion: v1


@@ -0,0 +1,100 @@
---
# PodDisruptionBudgets ensure minimum availability during voluntary disruptions
# (node drains, rolling updates, etc.)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: prometheus-pdb
namespace: monitoring
spec:
minAvailable: 1
selector:
matchLabels:
app: prometheus
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: alertmanager-pdb
namespace: monitoring
spec:
minAvailable: 2
selector:
matchLabels:
app: alertmanager
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: grafana-pdb
namespace: monitoring
spec:
minAvailable: 1
selector:
matchLabels:
app: grafana
---
# ResourceQuota limits total resources in monitoring namespace
apiVersion: v1
kind: ResourceQuota
metadata:
name: monitoring-quota
namespace: monitoring
spec:
hard:
# Compute resources
requests.cpu: "10"
requests.memory: "16Gi"
limits.cpu: "20"
limits.memory: "32Gi"
# Storage
persistentvolumeclaims: "10"
requests.storage: "100Gi"
# Object counts
pods: "50"
services: "20"
configmaps: "30"
secrets: "20"
---
# LimitRange sets default resource limits for pods in monitoring namespace
apiVersion: v1
kind: LimitRange
metadata:
name: monitoring-limits
namespace: monitoring
spec:
limits:
# Default container limits
- max:
cpu: "2"
memory: "4Gi"
min:
cpu: "10m"
memory: "16Mi"
default:
cpu: "500m"
memory: "512Mi"
defaultRequest:
cpu: "100m"
memory: "128Mi"
type: Container
# Pod limits
- max:
cpu: "4"
memory: "8Gi"
type: Pod
# PVC limits
- max:
storage: "50Gi"
min:
storage: "1Gi"
type: PersistentVolumeClaim

View File

@@ -23,7 +23,7 @@ spec:
pathType: ImplementationSpecific
backend:
service:
name: prometheus
name: prometheus-external
port:
number: 9090
- path: /jaeger(/|$)(.*)
@@ -33,3 +33,10 @@ spec:
name: jaeger-query
port:
number: 16686
- path: /alertmanager(/|$)(.*)
pathType: ImplementationSpecific
backend:
service:
name: alertmanager-external
port:
number: 9093


@@ -3,8 +3,16 @@ kind: Kustomization
resources:
- namespace.yaml
- secrets.yaml
- prometheus.yaml
- alert-rules.yaml
- alertmanager.yaml
- alertmanager-init.yaml
- grafana.yaml
- grafana-dashboards.yaml
- grafana-dashboards-extended.yaml
- postgres-exporter.yaml
- node-exporter.yaml
- jaeger.yaml
- ha-policies.yaml
- ingress.yaml


@@ -0,0 +1,103 @@
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: node-exporter
namespace: monitoring
labels:
app: node-exporter
spec:
selector:
matchLabels:
app: node-exporter
updateStrategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 1
template:
metadata:
labels:
app: node-exporter
spec:
hostNetwork: true
hostPID: true
nodeSelector:
kubernetes.io/os: linux
tolerations:
      # Run on all nodes, including control-plane nodes
- operator: Exists
effect: NoSchedule
containers:
- name: node-exporter
image: quay.io/prometheus/node-exporter:v1.7.0
args:
- '--path.sysfs=/host/sys'
- '--path.rootfs=/host/root'
- '--path.procfs=/host/proc'
- '--collector.filesystem.mount-points-exclude=^/(dev|proc|sys|var/lib/docker/.+|var/lib/kubelet/.+)($|/)'
- '--collector.filesystem.fs-types-exclude=^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$'
- '--collector.netclass.ignored-devices=^(veth.*|[a-f0-9]{15})$'
- '--collector.netdev.device-exclude=^(veth.*|[a-f0-9]{15})$'
- '--web.listen-address=:9100'
ports:
- containerPort: 9100
protocol: TCP
name: metrics
resources:
requests:
memory: "64Mi"
cpu: "50m"
limits:
memory: "128Mi"
cpu: "200m"
volumeMounts:
- name: sys
mountPath: /host/sys
mountPropagation: HostToContainer
readOnly: true
- name: root
mountPath: /host/root
mountPropagation: HostToContainer
readOnly: true
- name: proc
mountPath: /host/proc
mountPropagation: HostToContainer
readOnly: true
securityContext:
runAsNonRoot: true
runAsUser: 65534
capabilities:
drop:
- ALL
readOnlyRootFilesystem: true
volumes:
- name: sys
hostPath:
path: /sys
- name: root
hostPath:
path: /
- name: proc
hostPath:
path: /proc
---
apiVersion: v1
kind: Service
metadata:
name: node-exporter
namespace: monitoring
labels:
app: node-exporter
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "9100"
spec:
clusterIP: None
ports:
- name: metrics
port: 9100
protocol: TCP
targetPort: 9100
selector:
app: node-exporter
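# Hedged spot check that a node's metrics are exposed; the pod name is looked up
# at runtime rather than assumed:
#   POD=$(kubectl -n monitoring get pods -l app=node-exporter -o jsonpath='{.items[0].metadata.name}')
#   kubectl -n monitoring port-forward "$POD" 9100:9100 &
#   curl -s http://localhost:9100/metrics | grep '^node_cpu_seconds_total' | head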

View File

@@ -0,0 +1,306 @@
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: postgres-exporter
namespace: monitoring
labels:
app: postgres-exporter
spec:
replicas: 1
selector:
matchLabels:
app: postgres-exporter
template:
metadata:
labels:
app: postgres-exporter
spec:
containers:
- name: postgres-exporter
image: prometheuscommunity/postgres-exporter:v0.15.0
ports:
- containerPort: 9187
name: metrics
env:
- name: DATA_SOURCE_NAME
valueFrom:
secretKeyRef:
name: postgres-exporter
key: data-source-name
# Enable extended metrics
- name: PG_EXPORTER_EXTEND_QUERY_PATH
value: "/etc/postgres-exporter/queries.yaml"
        # Keep default metrics enabled; the custom queries below supplement them
- name: PG_EXPORTER_DISABLE_DEFAULT_METRICS
value: "false"
        # Keep pg_settings metrics enabled (set to "true" if they prove too noisy)
- name: PG_EXPORTER_DISABLE_SETTINGS_METRICS
value: "false"
volumeMounts:
- name: queries
mountPath: /etc/postgres-exporter
resources:
requests:
memory: "64Mi"
cpu: "50m"
limits:
memory: "128Mi"
cpu: "200m"
livenessProbe:
httpGet:
path: /
port: 9187
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /
port: 9187
initialDelaySeconds: 5
periodSeconds: 5
volumes:
- name: queries
configMap:
name: postgres-exporter-queries
---
apiVersion: v1
kind: ConfigMap
metadata:
name: postgres-exporter-queries
namespace: monitoring
data:
queries.yaml: |
# Custom PostgreSQL queries for bakery-ia metrics
pg_database:
query: |
SELECT
datname,
numbackends as connections,
xact_commit as transactions_committed,
xact_rollback as transactions_rolled_back,
blks_read as blocks_read,
blks_hit as blocks_hit,
tup_returned as tuples_returned,
tup_fetched as tuples_fetched,
tup_inserted as tuples_inserted,
tup_updated as tuples_updated,
tup_deleted as tuples_deleted,
conflicts as conflicts,
temp_files as temp_files,
temp_bytes as temp_bytes,
deadlocks as deadlocks
FROM pg_stat_database
WHERE datname NOT IN ('template0', 'template1', 'postgres')
metrics:
- datname:
usage: "LABEL"
description: "Name of the database"
- connections:
usage: "GAUGE"
description: "Number of backends currently connected to this database"
- transactions_committed:
usage: "COUNTER"
description: "Number of transactions in this database that have been committed"
- transactions_rolled_back:
usage: "COUNTER"
description: "Number of transactions in this database that have been rolled back"
- blocks_read:
usage: "COUNTER"
description: "Number of disk blocks read in this database"
- blocks_hit:
usage: "COUNTER"
description: "Number of times disk blocks were found in the buffer cache"
- tuples_returned:
usage: "COUNTER"
description: "Number of rows returned by queries in this database"
- tuples_fetched:
usage: "COUNTER"
description: "Number of rows fetched by queries in this database"
- tuples_inserted:
usage: "COUNTER"
description: "Number of rows inserted by queries in this database"
- tuples_updated:
usage: "COUNTER"
description: "Number of rows updated by queries in this database"
- tuples_deleted:
usage: "COUNTER"
description: "Number of rows deleted by queries in this database"
- conflicts:
usage: "COUNTER"
description: "Number of queries canceled due to conflicts with recovery"
- temp_files:
usage: "COUNTER"
description: "Number of temporary files created by queries"
- temp_bytes:
usage: "COUNTER"
description: "Total amount of data written to temporary files by queries"
- deadlocks:
usage: "COUNTER"
description: "Number of deadlocks detected in this database"
pg_replication:
query: |
SELECT
          CASE WHEN pg_is_in_recovery() THEN 1 ELSE 0 END as is_replica,
          CASE WHEN pg_is_in_recovery()
            THEN COALESCE(EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))::INT, 0)
            ELSE 0 END as lag_seconds
metrics:
- is_replica:
usage: "GAUGE"
description: "1 if this is a replica, 0 if primary"
- lag_seconds:
usage: "GAUGE"
description: "Replication lag in seconds (only on replicas)"
pg_slow_queries:
query: |
SELECT
datname,
usename,
state,
COUNT(*) as count,
MAX(EXTRACT(EPOCH FROM (now() - query_start))) as max_duration_seconds
FROM pg_stat_activity
WHERE state != 'idle'
AND query NOT LIKE '%pg_stat_activity%'
AND query_start < now() - interval '30 seconds'
GROUP BY datname, usename, state
metrics:
- datname:
usage: "LABEL"
description: "Database name"
- usename:
usage: "LABEL"
description: "User name"
- state:
usage: "LABEL"
description: "Query state"
- count:
usage: "GAUGE"
description: "Number of slow queries"
- max_duration_seconds:
usage: "GAUGE"
description: "Maximum query duration in seconds"
pg_table_stats:
query: |
SELECT
schemaname,
relname,
seq_scan,
seq_tup_read,
idx_scan,
idx_tup_fetch,
n_tup_ins,
n_tup_upd,
n_tup_del,
n_tup_hot_upd,
n_live_tup,
n_dead_tup,
          n_mod_since_analyze
FROM pg_stat_user_tables
WHERE schemaname = 'public'
ORDER BY n_live_tup DESC
LIMIT 20
metrics:
- schemaname:
usage: "LABEL"
description: "Schema name"
- relname:
usage: "LABEL"
description: "Table name"
- seq_scan:
usage: "COUNTER"
description: "Number of sequential scans"
- seq_tup_read:
usage: "COUNTER"
description: "Number of tuples read by sequential scans"
- idx_scan:
usage: "COUNTER"
description: "Number of index scans"
- idx_tup_fetch:
usage: "COUNTER"
description: "Number of tuples fetched by index scans"
- n_tup_ins:
usage: "COUNTER"
description: "Number of tuples inserted"
- n_tup_upd:
usage: "COUNTER"
description: "Number of tuples updated"
- n_tup_del:
usage: "COUNTER"
description: "Number of tuples deleted"
- n_tup_hot_upd:
usage: "COUNTER"
description: "Number of tuples HOT updated"
- n_live_tup:
usage: "GAUGE"
description: "Estimated number of live rows"
- n_dead_tup:
usage: "GAUGE"
description: "Estimated number of dead rows"
- n_mod_since_analyze:
usage: "GAUGE"
description: "Number of rows modified since last analyze"
pg_locks:
query: |
SELECT
mode,
locktype,
COUNT(*) as count
FROM pg_locks
GROUP BY mode, locktype
metrics:
- mode:
usage: "LABEL"
description: "Lock mode"
- locktype:
usage: "LABEL"
description: "Lock type"
- count:
usage: "GAUGE"
description: "Number of locks"
pg_connection_pool:
query: |
SELECT
state,
COUNT(*) as count,
MAX(EXTRACT(EPOCH FROM (now() - state_change))) as max_state_duration_seconds
FROM pg_stat_activity
GROUP BY state
metrics:
- state:
usage: "LABEL"
description: "Connection state"
- count:
usage: "GAUGE"
description: "Number of connections in this state"
- max_state_duration_seconds:
usage: "GAUGE"
description: "Maximum time a connection has been in this state"
---
apiVersion: v1
kind: Service
metadata:
name: postgres-exporter
namespace: monitoring
labels:
app: postgres-exporter
spec:
type: ClusterIP
ports:
- port: 9187
targetPort: 9187
protocol: TCP
name: metrics
selector:
app: postgres-exporter
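# Hedged check that the custom queries above load and export; metric names follow the
# exporter's <namespace>_<column> convention (e.g. pg_database_connections, pg_locks_count):
#   kubectl -n monitoring port-forward deploy/postgres-exporter 9187:9187 &
#   curl -s http://localhost:9187/metrics | grep -E '^pg_(database_connections|locks_count)'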

View File

@@ -56,6 +56,19 @@ data:
cluster: 'bakery-ia'
environment: 'production'
# AlertManager configuration
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager-0.alertmanager.monitoring.svc.cluster.local:9093
- alertmanager-1.alertmanager.monitoring.svc.cluster.local:9093
- alertmanager-2.alertmanager.monitoring.svc.cluster.local:9093
# Load alert rules
rule_files:
- '/etc/prometheus/rules/*.yml'
scrape_configs:
# Scrape Prometheus itself
- job_name: 'prometheus'
@@ -114,16 +127,42 @@ data:
target_label: __metrics_path__
replacement: /api/v1/nodes/${1}/proxy/metrics
# Scrape AlertManager
- job_name: 'alertmanager'
static_configs:
- targets:
- alertmanager-0.alertmanager.monitoring.svc.cluster.local:9093
- alertmanager-1.alertmanager.monitoring.svc.cluster.local:9093
- alertmanager-2.alertmanager.monitoring.svc.cluster.local:9093
# Scrape PostgreSQL exporter
- job_name: 'postgres-exporter'
static_configs:
- targets: ['postgres-exporter.monitoring.svc.cluster.local:9187']
# Scrape Node Exporter
- job_name: 'node-exporter'
kubernetes_sd_configs:
- role: node
relabel_configs:
- source_labels: [__address__]
regex: '(.*):10250'
replacement: '${1}:9100'
target_label: __address__
- source_labels: [__meta_kubernetes_node_name]
target_label: node
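      # Hedged post-rollout check that these jobs are actually being scraped; prometheus-0
      # is the first replica of the StatefulSet defined below:
      #   kubectl -n monitoring port-forward prometheus-0 9090:9090 &
      #   curl -s 'http://localhost:9090/api/v1/targets?state=active' | head -c 500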
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: prometheus
namespace: monitoring
labels:
app: prometheus
spec:
serviceName: prometheus
replicas: 2
selector:
matchLabels:
app: prometheus
@@ -133,6 +172,18 @@ spec:
app: prometheus
spec:
serviceAccountName: prometheus
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- prometheus
topologyKey: kubernetes.io/hostname
containers:
- name: prometheus
image: prom/prometheus:v3.0.1
@@ -149,6 +200,8 @@ spec:
volumeMounts:
- name: prometheus-config
mountPath: /etc/prometheus
- name: prometheus-rules
mountPath: /etc/prometheus/rules
- name: prometheus-storage
mountPath: /prometheus
resources:
@@ -174,22 +227,18 @@ spec:
- name: prometheus-config
configMap:
name: prometheus-config
- name: prometheus-rules
configMap:
name: prometheus-alert-rules
volumeClaimTemplates:
- metadata:
name: prometheus-storage
spec:
accessModes: [ "ReadWriteOnce" ]
resources:
requests:
storage: 20Gi
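  # Each replica gets its own PVC from this template, named
  # prometheus-storage-prometheus-0, prometheus-storage-prometheus-1, and so on.
  # A hedged check after rollout:
  #   kubectl get pvc -n monitoring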
---
apiVersion: v1
@@ -199,6 +248,25 @@ metadata:
namespace: monitoring
labels:
app: prometheus
spec:
type: ClusterIP
clusterIP: None
ports:
- port: 9090
targetPort: 9090
protocol: TCP
name: web
selector:
app: prometheus
---
apiVersion: v1
kind: Service
metadata:
name: prometheus-external
namespace: monitoring
labels:
app: prometheus
spec:
type: ClusterIP
ports:

View File

@@ -0,0 +1,52 @@
---
# NOTE: This file contains example secrets for development.
# For production, use one of the following:
# 1. Sealed Secrets (bitnami-labs/sealed-secrets)
# 2. External Secrets Operator
# 3. HashiCorp Vault
# 4. Cloud provider secret managers (AWS Secrets Manager, GCP Secret Manager, Azure Key Vault)
#
# NEVER commit real production secrets to git!
apiVersion: v1
kind: Secret
metadata:
name: grafana-admin
namespace: monitoring
type: Opaque
stringData:
admin-user: admin
# CHANGE THIS PASSWORD IN PRODUCTION!
# Generate with: openssl rand -base64 32
admin-password: "CHANGE_ME_IN_PRODUCTION"
---
apiVersion: v1
kind: Secret
metadata:
name: alertmanager-secrets
namespace: monitoring
type: Opaque
stringData:
# SMTP configuration for email alerts
# CHANGE THESE VALUES IN PRODUCTION!
smtp-host: "smtp.gmail.com:587"
smtp-username: "alerts@yourdomain.com"
smtp-password: "CHANGE_ME_IN_PRODUCTION"
smtp-from: "alerts@yourdomain.com"
# Slack webhook URL (optional)
slack-webhook-url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
---
apiVersion: v1
kind: Secret
metadata:
name: postgres-exporter
namespace: monitoring
type: Opaque
stringData:
# PostgreSQL connection string
  # Format: postgresql://username:password@hostname:port/database?sslmode=require
  # CHANGE THIS IN PRODUCTION (including sslmode; the sslmode=disable value below is for local development only)!
data-source-name: "postgresql://postgres:postgres@postgres.bakery-ia:5432/bakery?sslmode=disable"
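# Hedged check that all three secrets exist without echoing their values:
#   kubectl -n monitoring get secret grafana-admin alertmanager-secrets postgres-exporter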