bakery-ia/infrastructure/kubernetes/base/components/monitoring

Bakery IA - Production Monitoring Stack

This directory contains the complete production-ready monitoring infrastructure for the Bakery IA platform.

📊 Components

Core Monitoring

  • Prometheus v3.0.1 - Time-series metrics database (2 replicas for HA)
  • Grafana v12.3.0 - Visualization and dashboarding
  • AlertManager v0.27.0 - Alert routing and notification (3 replicas for HA)

Distributed Tracing

  • Jaeger v1.51 - Distributed tracing with persistent storage

Exporters

  • PostgreSQL Exporter v0.15.0 - Database metrics and health
  • Node Exporter v1.7.0 - Infrastructure and OS-level metrics (DaemonSet)

🚀 Deployment

Prerequisites

  1. Kubernetes cluster (v1.24+)
  2. kubectl configured
  3. kustomize (v4.0+) or kubectl with kustomize support
  4. Storage class available for PersistentVolumeClaims
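
A quick pre-flight check for these prerequisites (the storage class name is cluster-specific):

# Verify cluster version, kustomize support, and available storage classes
kubectl version
kubectl kustomize --help > /dev/null && echo "kustomize support OK"
kubectl get storageclass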

Production Deployment

# 1. Update secrets with production values
kubectl create secret generic grafana-admin \
  --from-literal=admin-user=admin \
  --from-literal=admin-password=$(openssl rand -base64 32) \
  --namespace monitoring --dry-run=client -o yaml > secrets.yaml

# 2. Update AlertManager SMTP credentials (separate the YAML documents with ---)
echo "---" >> secrets.yaml
kubectl create secret generic alertmanager-secrets \
  --from-literal=smtp-host="smtp.gmail.com:587" \
  --from-literal=smtp-username="alerts@yourdomain.com" \
  --from-literal=smtp-password="YOUR_SMTP_PASSWORD" \
  --from-literal=smtp-from="alerts@yourdomain.com" \
  --from-literal=slack-webhook-url="https://hooks.slack.com/services/YOUR/WEBHOOK/URL" \
  --namespace monitoring --dry-run=client -o yaml >> secrets.yaml

# 3. Update PostgreSQL exporter connection string
echo "---" >> secrets.yaml
kubectl create secret generic postgres-exporter \
  --from-literal=data-source-name="postgresql://user:password@postgres.bakery-ia:5432/bakery?sslmode=require" \
  --namespace monitoring --dry-run=client -o yaml >> secrets.yaml

# 4. Deploy monitoring stack
kubectl apply -k infrastructure/kubernetes/overlays/prod

# 5. Verify deployment
kubectl get pods -n monitoring
kubectl get pvc -n monitoring

Local Development Deployment

For local Kind clusters, monitoring is disabled by default to save resources. To enable:

# Uncomment monitoring in overlays/dev/kustomization.yaml
# Then apply:
kubectl apply -k infrastructure/kubernetes/overlays/dev

🔐 Security Configuration

Important Security Notes

⚠️ NEVER commit real secrets to Git!

The secrets.yaml file contains placeholder values. In production, use one of:

  1. Sealed Secrets (Recommended)

    kubectl apply -f https://github.com/bitnami-labs/sealed-secrets/releases/download/v0.24.0/controller.yaml
    kubeseal --format=yaml < secrets.yaml > sealed-secrets.yaml
    
  2. External Secrets Operator

    helm repo add external-secrets https://charts.external-secrets.io
    helm install external-secrets external-secrets/external-secrets \
      -n external-secrets --create-namespace
    
  3. Cloud Provider Secrets

    • AWS Secrets Manager
    • GCP Secret Manager
    • Azure Key Vault
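
With External Secrets Operator, the Grafana credentials could instead be pulled from a cloud secret store. A minimal sketch, assuming a ClusterSecretStore named aws-secrets backed by AWS Secrets Manager and a stored secret named bakery/grafana-admin (both names are illustrative):

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: grafana-admin
  namespace: monitoring
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets                 # hypothetical store name
    kind: ClusterSecretStore
  target:
    name: grafana-admin               # Kubernetes Secret created by the operator
  data:
    - secretKey: admin-user
      remoteRef:
        key: bakery/grafana-admin     # illustrative path in AWS Secrets Manager
        property: admin-user
    - secretKey: admin-password
      remoteRef:
        key: bakery/grafana-admin
        property: admin-password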

Grafana Admin Password

Change the default password immediately:

# Generate strong password
NEW_PASSWORD=$(openssl rand -base64 32)

# Update secret
kubectl patch secret grafana-admin -n monitoring \
  -p="{\"data\":{\"admin-password\":\"$(echo -n $NEW_PASSWORD | base64)\"}}"

# Restart Grafana
kubectl rollout restart deployment grafana -n monitoring

📈 Accessing Monitoring Services

Via Ingress (Production)

https://monitoring.yourdomain.com/grafana
https://monitoring.yourdomain.com/prometheus
https://monitoring.yourdomain.com/alertmanager
https://monitoring.yourdomain.com/jaeger
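
A minimal sketch of the path-based Ingress behind these URLs, assuming an NGINX ingress controller and a TLS secret named monitoring-tls (the remaining paths follow the same pattern):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: monitoring
  namespace: monitoring
spec:
  ingressClassName: nginx
  tls:
    - hosts: [monitoring.yourdomain.com]
      secretName: monitoring-tls          # assumed TLS secret
  rules:
    - host: monitoring.yourdomain.com
      http:
        paths:
          - path: /grafana
            pathType: Prefix
            backend:
              service:
                name: grafana
                port:
                  number: 3000
          - path: /prometheus
            pathType: Prefix
            backend:
              service:
                name: prometheus-external
                port:
                  number: 9090

Serving Grafana under a sub-path typically also requires setting GF_SERVER_ROOT_URL and GF_SERVER_SERVE_FROM_SUB_PATH on the Grafana deployment.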

Via Port Forwarding (Development)

# Grafana
kubectl port-forward -n monitoring svc/grafana 3000:3000

# Prometheus
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090

# AlertManager
kubectl port-forward -n monitoring svc/alertmanager-external 9093:9093

# Jaeger
kubectl port-forward -n monitoring svc/jaeger-query 16686:16686

Then access:

  • Grafana: http://localhost:3000
  • Prometheus: http://localhost:9090
  • AlertManager: http://localhost:9093
  • Jaeger: http://localhost:16686

📊 Grafana Dashboards

Pre-configured Dashboards

  1. Gateway Metrics - API gateway performance

    • Request rate by endpoint
    • P95 latency
    • Error rates
    • Authentication metrics
  2. Services Overview - Microservices health

    • Request rate by service
    • P99 latency
    • Error rates by service
    • Service health status
  3. Circuit Breakers - Resilience patterns

    • Circuit breaker states
    • Trip rates
    • Rejected requests
  4. PostgreSQL Monitoring - Database health

    • Connections, transactions, cache hit ratio
    • Slow queries, locks, replication lag
  5. Node Metrics - Infrastructure monitoring

    • CPU, memory, disk, network per node
  6. AlertManager - Alert management

    • Active alerts, firing rate, notifications
  7. Business Metrics - KPIs

    • Service performance, tenant activity, ML metrics

Creating Custom Dashboards

  1. Log in to Grafana (admin/[your-password])
  2. Click "+ → Dashboard"
  3. Add panels with Prometheus queries
  4. Save dashboard
  5. Export JSON and add to grafana-dashboards.yaml
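
As a sketch, assuming dashboards are provisioned from ConfigMaps mounted into Grafana (the exact mechanism is defined in grafana-dashboards.yaml), an exported dashboard could be wrapped like this:

apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboard-my-service     # hypothetical name
  namespace: monitoring
  labels:
    grafana_dashboard: "1"                # convention used by dashboard sidecars; adjust to this setup
data:
  my-service.json: |
    {
      "title": "My Service",
      "panels": []
    }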

🚨 Alert Configuration

Alert Rules

Alert rules are defined in alert-rules.yaml and organized by category (an illustrative rule is sketched after this list):

  • bakery_services - Service health, errors, latency, memory
  • bakery_business - Training jobs, ML accuracy, API limits
  • alert_system_health - Alert system components, RabbitMQ, Redis
  • alert_system_performance - Processing errors, delivery failures
  • alert_system_business - Alert volume, response times
  • alert_system_capacity - Queue sizes, storage performance
  • alert_system_critical - System failures, data loss
  • monitoring_health - Prometheus, AlertManager self-monitoring
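
A minimal sketch of one rule in the bakery_services group, using the HTTP metrics listed under Metrics Reference (the threshold and labels are illustrative):

groups:
  - name: bakery_services
    rules:
      - alert: HighErrorRate
        expr: |
          rate(http_requests_total{status_code=~"5.."}[5m])
            / rate(http_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: warning
          component: services
        annotations:
          summary: "High 5xx error rate on {{ $labels.service }}"
          description: "More than 5% of requests have failed for 5 minutes."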

Alert Routing

Alerts are routed based on the following; a condensed routing sketch follows the list:

  • Severity (critical, warning, info)
  • Component (alert-system, database, infrastructure)
  • Service name
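
A condensed sketch of the corresponding route tree in alertmanager.yaml (receiver names are illustrative):

route:
  receiver: email-default
  group_by: ['alertname', 'service']
  routes:
    - matchers: ['severity="critical"']
      receiver: oncall-critical          # e.g. PagerDuty or Slack
    - matchers: ['component="alert-system"']
      receiver: alert-system-team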

Notification Channels

Configure in alertmanager.yaml:

  1. Email (default)

  2. Slack (optional, commented out)

    • Update slack-webhook-url in secrets
    • Uncomment slack_configs in alertmanager.yaml
  3. PagerDuty (add if needed)

    pagerduty_configs:
    - routing_key: YOUR_ROUTING_KEY
      severity: '{{ .CommonLabels.severity }}'
    

Testing Alerts

# Fire a synthetic test alert directly against the AlertManager API
# (requires the AlertManager port-forward from the "Accessing Monitoring Services" section)
curl -XPOST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{"labels":{"alertname":"TestAlert","severity":"warning","service":"test"}}]'

# Check configured alert rules and any firing alerts in Prometheus
# Navigate to http://localhost:9090/alerts

# Check that the test alert appears in AlertManager
# Navigate to http://localhost:9093
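
The same check can also be done from inside an AlertManager pod with amtool, which ships in the official image:

# List the alerts AlertManager currently knows about
kubectl exec -n monitoring alertmanager-0 -- \
  amtool alert query --alertmanager.url=http://localhost:9093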

🔍 Troubleshooting

Prometheus Issues

# Check Prometheus logs
kubectl logs -n monitoring prometheus-0 -f

# Check Prometheus targets
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090
# Visit http://localhost:9090/targets

# Check Prometheus configuration
kubectl get configmap prometheus-config -n monitoring -o yaml

AlertManager Issues

# Check AlertManager logs
kubectl logs -n monitoring alertmanager-0 -f

# Check AlertManager configuration
kubectl exec -n monitoring alertmanager-0 -- cat /etc/alertmanager/alertmanager.yml

# Test SMTP connectivity from a temporary pod
# (the server's 220 greeting confirms the port is reachable; type QUIT to close the session)
kubectl run smtp-test --rm -it --restart=Never --image=busybox -n monitoring -- \
  nc -w 10 smtp.gmail.com 587

Grafana Issues

# Check Grafana logs
kubectl logs -n monitoring deployment/grafana -f

# Reset Grafana admin password
kubectl exec -n monitoring deployment/grafana -- \
  grafana-cli admin reset-admin-password NEW_PASSWORD

PostgreSQL Exporter Issues

# Check exporter logs
kubectl logs -n monitoring deployment/postgres-exporter -f

# Test database connection
kubectl exec -n monitoring deployment/postgres-exporter -- \
  wget -O- http://localhost:9187/metrics | grep pg_up

Node Exporter Issues

# Check node exporter on a specific node (adjust the label selector to match your manifests)
kubectl get pods -n monitoring -l app=node-exporter -o wide --field-selector spec.nodeName=NODE_NAME
kubectl logs -n monitoring POD_NAME -f

# Check metrics endpoint
kubectl exec -n monitoring daemonset/node-exporter -- \
  wget -O- http://localhost:9100/metrics | head -n 20

📏 Resource Requirements

Minimum Requirements (Development)

  • CPU: 2 cores
  • Memory: 4Gi
  • Storage: 30Gi

Recommended Requirements (Production)

  • CPU: 6-8 cores
  • Memory: 16Gi
  • Storage: 100Gi

Component Resource Allocation

Component          Replicas  CPU Request  Memory Request  CPU Limit  Memory Limit
Prometheus         2         500m         1Gi             1          2Gi
AlertManager       3         100m         128Mi           500m       256Mi
Grafana            1         100m         256Mi           500m       512Mi
Postgres Exporter  1         50m          64Mi            200m       128Mi
Node Exporter      1/node    50m          64Mi            200m       128Mi
Jaeger             1         250m         512Mi           500m       1Gi

🔄 High Availability

Prometheus HA

  • 2 replicas in StatefulSet
  • Each has independent storage (volumeClaimTemplates)
  • Anti-affinity to spread across nodes
  • Both scrape the same targets independently
  • Use Thanos for long-term storage and global query view (future enhancement)
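
For reference, the node spreading is typically expressed as pod anti-affinity on the StatefulSet; a minimal sketch, assuming the pods carry an app=prometheus label:

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: prometheus          # assumed pod label; match the StatefulSet template
        topologyKey: kubernetes.io/hostname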

AlertManager HA

  • 3 replicas in StatefulSet
  • Clustered mode (gossip protocol)
  • Notification dispatch staggered by peer position (no separate leader election)
  • Alert deduplication across instances
  • Anti-affinity to spread across nodes
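
Clustering is driven by AlertManager's --cluster.* flags; a sketch of the StatefulSet container args, assuming a headless service named alertmanager in the monitoring namespace:

args:
  - --config.file=/etc/alertmanager/alertmanager.yml
  - --cluster.listen-address=0.0.0.0:9094
  - --cluster.peer=alertmanager-0.alertmanager.monitoring.svc.cluster.local:9094
  - --cluster.peer=alertmanager-1.alertmanager.monitoring.svc.cluster.local:9094
  - --cluster.peer=alertmanager-2.alertmanager.monitoring.svc.cluster.local:9094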

PodDisruptionBudgets

Ensure minimum availability during:

  • Node maintenance
  • Cluster upgrades
  • Rolling updates

Configured budgets (a minimal manifest sketch follows):

  • Prometheus: minAvailable=1 (out of 2)
  • AlertManager: minAvailable=2 (out of 3)
  • Grafana: minAvailable=1 (out of 1)
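
A minimal sketch of one of these PodDisruptionBudgets, assuming an app=prometheus pod label (align with the labels used in the actual manifests):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: prometheus
  namespace: monitoring
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: prometheus   # assumed label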

📊 Metrics Reference

Application Metrics (from services)

# HTTP request rate
rate(http_requests_total[5m])

# HTTP error rate
rate(http_requests_total{status_code=~"5.."}[5m]) / rate(http_requests_total[5m])

# Request latency (P95)
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Active connections
active_connections

PostgreSQL Metrics

# Active connections
pg_stat_database_numbackends

# Transaction rate
rate(pg_stat_database_xact_commit[5m])

# Cache hit ratio
rate(pg_stat_database_blks_hit[5m]) /
(rate(pg_stat_database_blks_hit[5m]) + rate(pg_stat_database_blks_read[5m]))

# Replication lag
pg_replication_lag_seconds

Node Metrics

# CPU usage
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100

# Disk I/O
rate(node_disk_read_bytes_total[5m])
rate(node_disk_written_bytes_total[5m])

# Network traffic
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])

🔗 Distributed Tracing

Jaeger Configuration

Services automatically send traces when JAEGER_ENABLED=true:

# In prod-configmap.yaml
JAEGER_ENABLED: "true"
JAEGER_AGENT_HOST: "jaeger-agent.monitoring.svc.cluster.local"
JAEGER_AGENT_PORT: "6831"

Viewing Traces

  1. Access Jaeger UI: https://monitoring.yourdomain.com/jaeger
  2. Select service from dropdown
  3. Click "Find Traces"
  4. Explore trace details, spans, and timing

Trace Sampling

Current sampling: 100% (all traces collected)

For high-traffic production:

# Adjust in shared/monitoring/tracing.py
JAEGER_SAMPLE_RATE: "0.1"  # 10% of traces

📚 Additional Resources
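
  • Prometheus documentation: https://prometheus.io/docs/
  • AlertManager documentation: https://prometheus.io/docs/alerting/latest/alertmanager/
  • Grafana documentation: https://grafana.com/docs/
  • Jaeger documentation: https://www.jaegertracing.io/docs/
  • postgres_exporter: https://github.com/prometheus-community/postgres_exporter
  • node_exporter: https://github.com/prometheus/node_exporter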

🆘 Support

For monitoring issues:

  1. Check component logs (see Troubleshooting section)
  2. Verify Prometheus targets are UP
  3. Check AlertManager configuration and routing
  4. Review resource usage and quotas
  5. Contact platform team: platform-team@yourdomain.com

🔄 Maintenance

Regular Tasks

Daily:

  • Review critical alerts
  • Check service health dashboards

Weekly:

  • Review alert noise and adjust thresholds
  • Check storage usage for Prometheus and Jaeger
  • Review slow queries in PostgreSQL dashboard

Monthly:

  • Update dashboard with new metrics
  • Review and update alert runbooks
  • Capacity planning based on trends

Backup and Recovery

Prometheus Data:

# Backup Prometheus data
kubectl exec -n monitoring prometheus-0 -- tar czf /tmp/prometheus-backup.tar.gz /prometheus
kubectl cp monitoring/prometheus-0:/tmp/prometheus-backup.tar.gz ./prometheus-backup.tar.gz

# Restore (copy back into the pod, extract, then restart it so Prometheus reloads the data)
kubectl cp ./prometheus-backup.tar.gz monitoring/prometheus-0:/tmp/
kubectl exec -n monitoring prometheus-0 -- tar xzf /tmp/prometheus-backup.tar.gz -C /
kubectl delete pod -n monitoring prometheus-0
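
A consistent alternative to tarring the live data directory is the TSDB snapshot API, which requires Prometheus to be started with --web.enable-admin-api; with the Prometheus port-forward active:

# Trigger a snapshot; the response contains the snapshot name
curl -XPOST http://localhost:9090/api/v1/admin/tsdb/snapshot

# Copy the snapshot out of the pod (replace SNAPSHOT_NAME with the returned name)
kubectl cp monitoring/prometheus-0:/prometheus/snapshots/SNAPSHOT_NAME ./prometheus-snapshot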

Grafana Dashboards:

# Export all dashboards via the HTTP API, one JSON file per dashboard
# (requires the Grafana port-forward from the "Accessing Monitoring Services" section)
curl -s -u admin:password "http://localhost:3000/api/search?type=dash-db" | \
  jq -r '.[].uid' | \
  xargs -I{} sh -c 'curl -s -u admin:password http://localhost:3000/api/dashboards/uid/{} > dashboard-{}.json'

📝 Version History

  • v1.0.0 (2026-01-07) - Initial production-ready monitoring stack
    • Prometheus v3.0.1 with HA
    • AlertManager v0.27.0 with clustering
    • Grafana v12.3.0 with 7 dashboards
    • PostgreSQL and Node exporters
    • 50+ alert rules
    • Comprehensive documentation