# Bakery IA - Production Monitoring Stack

This directory contains the complete production-ready monitoring infrastructure for the Bakery IA platform.

## 📊 Components

### Core Monitoring

- **Prometheus v3.0.1** - Time-series metrics database (2 replicas for HA)
- **Grafana v12.3.0** - Visualization and dashboarding
- **AlertManager v0.27.0** - Alert routing and notification (3 replicas for HA)

### Distributed Tracing

- **Jaeger v1.51** - Distributed tracing with persistent storage

### Exporters

- **PostgreSQL Exporter v0.15.0** - Database metrics and health
- **Node Exporter v1.7.0** - Infrastructure and OS-level metrics (DaemonSet)

## 🚀 Deployment

### Prerequisites

1. Kubernetes cluster (v1.24+)
2. kubectl configured
3. kustomize (v4.0+) or kubectl with kustomize support
4. Storage class available for PersistentVolumeClaims

### Production Deployment

```bash
# 1. Create the Grafana admin secret with production values
kubectl create secret generic grafana-admin \
  --from-literal=admin-user=admin \
  --from-literal=admin-password=$(openssl rand -base64 32) \
  --namespace monitoring --dry-run=client -o yaml > secrets.yaml
echo "---" >> secrets.yaml

# 2. Add AlertManager SMTP and Slack credentials
kubectl create secret generic alertmanager-secrets \
  --from-literal=smtp-host="smtp.gmail.com:587" \
  --from-literal=smtp-username="alerts@yourdomain.com" \
  --from-literal=smtp-password="YOUR_SMTP_PASSWORD" \
  --from-literal=smtp-from="alerts@yourdomain.com" \
  --from-literal=slack-webhook-url="https://hooks.slack.com/services/YOUR/WEBHOOK/URL" \
  --namespace monitoring --dry-run=client -o yaml >> secrets.yaml
echo "---" >> secrets.yaml

# 3. Add the PostgreSQL exporter connection string
kubectl create secret generic postgres-exporter \
  --from-literal=data-source-name="postgresql://user:password@postgres.bakery-ia:5432/bakery?sslmode=require" \
  --namespace monitoring --dry-run=client -o yaml >> secrets.yaml

# 4. Deploy the monitoring stack
kubectl apply -k infrastructure/kubernetes/overlays/prod

# 5. Verify the deployment
kubectl get pods -n monitoring
kubectl get pvc -n monitoring
```

### Local Development Deployment

For local Kind clusters, monitoring is disabled by default to save resources. To enable it:

```bash
# Uncomment monitoring in overlays/dev/kustomization.yaml (see the sketch below),
# then apply:
kubectl apply -k infrastructure/kubernetes/overlays/dev
```
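
The exact contents of `overlays/dev/kustomization.yaml` depend on this repository's layout, but the toggle generally looks like the following sketch; the relative path to the monitoring manifests is an assumption, not the real path.

```yaml
# Illustrative sketch only -- the real overlays/dev/kustomization.yaml and the
# relative path to the monitoring manifests live in this repository.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
  # Uncomment to enable the monitoring stack on a local Kind cluster:
  # - ../../monitoring
```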

## 🔐 Security Configuration

### Important Security Notes

⚠️ **NEVER commit real secrets to Git!**

The `secrets.yaml` file contains placeholder values. In production, use one of:

1. **Sealed Secrets** (Recommended)
   ```bash
   kubectl apply -f https://github.com/bitnami-labs/sealed-secrets/releases/download/v0.24.0/controller.yaml
   kubeseal --format=yaml < secrets.yaml > sealed-secrets.yaml
   ```

2. **External Secrets Operator**
   ```bash
   helm repo add external-secrets https://charts.external-secrets.io
   helm install external-secrets external-secrets/external-secrets \
     -n external-secrets --create-namespace
   ```

3. **Cloud Provider Secrets** (see the sketch after this list)
   - AWS Secrets Manager
   - GCP Secret Manager
   - Azure Key Vault
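
As an illustration of the cloud-provider route, the following is a minimal sketch of an External Secrets Operator `ExternalSecret` that could populate the `grafana-admin` secret from AWS Secrets Manager. The `SecretStore` name (`aws-secrets-manager`) and the remote key (`bakery-ia/grafana-admin`) are assumptions, not part of this repository.

```yaml
# Sketch only: assumes a SecretStore named "aws-secrets-manager" already exists
# and that the credentials are stored under "bakery-ia/grafana-admin" in AWS.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: grafana-admin
  namespace: monitoring
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: SecretStore
  target:
    name: grafana-admin          # Kubernetes Secret created/updated by the operator
  data:
    - secretKey: admin-user      # key in the resulting Kubernetes Secret
      remoteRef:
        key: bakery-ia/grafana-admin
        property: admin-user     # field within the remote secret
    - secretKey: admin-password
      remoteRef:
        key: bakery-ia/grafana-admin
        property: admin-password
```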

### Grafana Admin Password

Change the default password immediately:

```bash
# Generate strong password
NEW_PASSWORD=$(openssl rand -base64 32)

# Update secret
kubectl patch secret grafana-admin -n monitoring \
  -p="{\"data\":{\"admin-password\":\"$(echo -n $NEW_PASSWORD | base64)\"}}"

# Restart Grafana
kubectl rollout restart deployment grafana -n monitoring
```

## 📈 Accessing Monitoring Services

### Via Ingress (Production)

```
https://monitoring.yourdomain.com/grafana
https://monitoring.yourdomain.com/prometheus
https://monitoring.yourdomain.com/alertmanager
https://monitoring.yourdomain.com/jaeger
```
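
For reference, path-based routing of this kind is typically implemented with an Ingress along the lines of the sketch below. It assumes an NGINX ingress controller and the service names/ports used elsewhere in this document (`grafana:3000`, `prometheus-external:9090`, `alertmanager-external:9093`, `jaeger-query:16686`); the actual manifest lives in this repository's ingress configuration.

```yaml
# Illustrative sketch, not the repository's actual Ingress. Each backend must be
# configured to serve under its sub-path (e.g. Grafana's root_url) for this to work.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: monitoring
  namespace: monitoring
spec:
  ingressClassName: nginx
  tls:
    - hosts: [monitoring.yourdomain.com]
      secretName: monitoring-tls
  rules:
    - host: monitoring.yourdomain.com
      http:
        paths:
          - path: /grafana
            pathType: Prefix
            backend: { service: { name: grafana, port: { number: 3000 } } }
          - path: /prometheus
            pathType: Prefix
            backend: { service: { name: prometheus-external, port: { number: 9090 } } }
          - path: /alertmanager
            pathType: Prefix
            backend: { service: { name: alertmanager-external, port: { number: 9093 } } }
          - path: /jaeger
            pathType: Prefix
            backend: { service: { name: jaeger-query, port: { number: 16686 } } }
```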

### Via Port Forwarding (Development)

```bash
# Grafana
kubectl port-forward -n monitoring svc/grafana 3000:3000

# Prometheus
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090

# AlertManager
kubectl port-forward -n monitoring svc/alertmanager-external 9093:9093

# Jaeger
kubectl port-forward -n monitoring svc/jaeger-query 16686:16686
```

Then access:
- Grafana: http://localhost:3000
- Prometheus: http://localhost:9090
- AlertManager: http://localhost:9093
- Jaeger: http://localhost:16686

## 📊 Grafana Dashboards

### Pre-configured Dashboards

1. **Gateway Metrics** - API gateway performance
   - Request rate by endpoint
   - P95 latency
   - Error rates
   - Authentication metrics

2. **Services Overview** - Microservices health
   - Request rate by service
   - P99 latency
   - Error rates by service
   - Service health status

3. **Circuit Breakers** - Resilience patterns
   - Circuit breaker states
   - Trip rates
   - Rejected requests

4. **PostgreSQL Monitoring** - Database health
   - Connections, transactions, cache hit ratio
   - Slow queries, locks, replication lag

5. **Node Metrics** - Infrastructure monitoring
   - CPU, memory, disk, network per node

6. **AlertManager** - Alert management
   - Active alerts, firing rate, notifications

7. **Business Metrics** - KPIs
   - Service performance, tenant activity, ML metrics

### Creating Custom Dashboards

1. Log in to Grafana (admin / your password)
2. Click "+ → Dashboard"
3. Add panels with Prometheus queries
4. Save the dashboard
5. Export the JSON and add it to `grafana-dashboards.yaml` (see the sketch below)
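
A dashboard exported this way is provisioned back into Grafana from a ConfigMap. The exact layout of `grafana-dashboards.yaml` is defined in this directory; the sketch below only illustrates the common pattern of shipping dashboard JSON in a labeled ConfigMap, and the `grafana_dashboard` label key and file name are assumptions.

```yaml
# Sketch of the usual ConfigMap-based provisioning pattern; adapt to how
# grafana-dashboards.yaml is actually structured in this directory.
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboard-my-service
  namespace: monitoring
  labels:
    grafana_dashboard: "1"   # assumed label used by the dashboard provisioner
data:
  my-service.json: |
    {
      "title": "My Service",
      "panels": []
    }
```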

## 🚨 Alert Configuration

### Alert Rules

Alert rules are defined in `alert-rules.yaml` and organized by category (a sketch of a single rule follows the list):

- **bakery_services** - Service health, errors, latency, memory
- **bakery_business** - Training jobs, ML accuracy, API limits
- **alert_system_health** - Alert system components, RabbitMQ, Redis
- **alert_system_performance** - Processing errors, delivery failures
- **alert_system_business** - Alert volume, response times
- **alert_system_capacity** - Queue sizes, storage performance
- **alert_system_critical** - System failures, data loss
- **monitoring_health** - Prometheus, AlertManager self-monitoring
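
To illustrate the format, a Prometheus rule in one of these groups generally looks like the sketch below; the rule name, threshold, and labels are illustrative, not copies of entries in `alert-rules.yaml`.

```yaml
# Illustrative rule only -- see alert-rules.yaml for the real definitions.
groups:
  - name: bakery_services
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (service)
            / sum(rate(http_requests_total[5m])) by (service) > 0.05
        for: 5m
        labels:
          severity: critical
          component: bakery-services
        annotations:
          summary: "High 5xx error rate on {{ $labels.service }}"
          description: "More than 5% of requests have failed for 5 minutes."
```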

### Alert Routing

Alerts are routed based on (see the route sketch below):
- **Severity** (critical, warning, info)
- **Component** (alert-system, database, infrastructure)
- **Service** name
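
In `alertmanager.yaml` this translates into a route tree roughly like the following sketch; the receiver names are illustrative, and the authoritative tree is the one in this directory.

```yaml
# Illustrative route tree -- the real version lives in alertmanager.yaml.
route:
  receiver: default-email
  group_by: [alertname, component, service]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: oncall-email
    - match:
        component: alert-system
      receiver: alert-system-team
```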

### Notification Channels

Configure in `alertmanager.yaml`:

1. **Email** (default)
   - critical-alerts@yourdomain.com
   - oncall@yourdomain.com

2. **Slack** (optional, commented out)
   - Update slack-webhook-url in secrets
   - Uncomment slack_configs in alertmanager.yaml

3. **PagerDuty** (add if needed)
   ```yaml
   pagerduty_configs:
     - routing_key: YOUR_ROUTING_KEY
       severity: '{{ .Labels.severity }}'
   ```

### Testing Alerts

```bash
# Fire a test alert
kubectl run test-alert --image=busybox -n bakery-ia --restart=Never -- sleep 3600

# Check alert in Prometheus
# Navigate to http://localhost:9090/alerts

# Check AlertManager
# Navigate to http://localhost:9093
```

## 🔍 Troubleshooting

### Prometheus Issues

```bash
# Check Prometheus logs
kubectl logs -n monitoring prometheus-0 -f

# Check Prometheus targets
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090
# Visit http://localhost:9090/targets

# Check Prometheus configuration
kubectl get configmap prometheus-config -n monitoring -o yaml
```

### AlertManager Issues

```bash
# Check AlertManager logs
kubectl logs -n monitoring alertmanager-0 -f

# Check AlertManager configuration
kubectl exec -n monitoring alertmanager-0 -- cat /etc/alertmanager/alertmanager.yml

# Test SMTP connectivity with a plain TCP connect from a throwaway pod
# (wget cannot speak SMTP, and the AlertManager image is minimal)
kubectl run smtp-test --rm -it --image=busybox -n monitoring --restart=Never -- \
  nc -zv -w 10 smtp.gmail.com 587
```

### Grafana Issues

```bash
# Check Grafana logs
kubectl logs -n monitoring deployment/grafana -f

# Reset Grafana admin password
kubectl exec -n monitoring deployment/grafana -- \
  grafana-cli admin reset-admin-password NEW_PASSWORD
```

### PostgreSQL Exporter Issues

```bash
# Check exporter logs
kubectl logs -n monitoring deployment/postgres-exporter -f

# Test database connection
kubectl exec -n monitoring deployment/postgres-exporter -- \
  wget -O- http://localhost:9187/metrics | grep pg_up
```

### Node Exporter Issues

```bash
# Find the node-exporter pod running on a specific node, then follow its logs
kubectl get pods -n monitoring -o wide --field-selector spec.nodeName=NODE_NAME | grep node-exporter
kubectl logs -n monitoring POD_NAME -f

# Check metrics endpoint
kubectl exec -n monitoring daemonset/node-exporter -- \
  wget -O- http://localhost:9100/metrics | head -n 20
```

## 📏 Resource Requirements

### Minimum Requirements (Development)
- CPU: 2 cores
- Memory: 4Gi
- Storage: 30Gi

### Recommended Requirements (Production)
- CPU: 6-8 cores
- Memory: 16Gi
- Storage: 100Gi

### Component Resource Allocation

| Component | Replicas | CPU Request | Memory Request | CPU Limit | Memory Limit |
|-----------|----------|-------------|----------------|-----------|--------------|
| Prometheus | 2 | 500m | 1Gi | 1 | 2Gi |
| AlertManager | 3 | 100m | 128Mi | 500m | 256Mi |
| Grafana | 1 | 100m | 256Mi | 500m | 512Mi |
| Postgres Exporter | 1 | 50m | 64Mi | 200m | 128Mi |
| Node Exporter | 1/node | 50m | 64Mi | 200m | 128Mi |
| Jaeger | 1 | 250m | 512Mi | 500m | 1Gi |
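
Each row corresponds to a standard container `resources` block in the component's manifest; for example, the Prometheus values above map onto a spec like this sketch.

```yaml
# Sketch of how the Prometheus row translates into a container spec.
resources:
  requests:
    cpu: 500m
    memory: 1Gi
  limits:
    cpu: "1"
    memory: 2Gi
```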

## 🔄 High Availability

### Prometheus HA

- 2 replicas in a StatefulSet
- Each replica has independent storage (volumeClaimTemplates)
- Anti-affinity spreads the replicas across nodes (see the sketch below)
- Both replicas scrape the same targets independently
- Use Thanos for long-term storage and a global query view (future enhancement)
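
The anti-affinity rule is expressed in the StatefulSet's pod template; a typical form is sketched below, where the `app: prometheus` label is an assumption about how the pods are labeled in this repository.

```yaml
# Sketch of the anti-affinity used to spread Prometheus replicas across nodes.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: prometheus   # assumed pod label
        topologyKey: kubernetes.io/hostname
```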

### AlertManager HA

- 3 replicas in a StatefulSet
- Clustered mode (gossip protocol); peers are passed as flags (see the sketch below)
- Automatic peer coordination (no external leader-election service required)
- Alert deduplication across instances
- Anti-affinity to spread the replicas across nodes
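
Clustering is enabled through AlertManager's `--cluster.*` flags. With a headless service in front of the StatefulSet, the container arguments usually look something like the sketch below; the headless-service name `alertmanager` is an assumption.

```yaml
# Sketch of typical AlertManager clustering flags in the StatefulSet container spec.
args:
  - --config.file=/etc/alertmanager/alertmanager.yml
  - --cluster.listen-address=0.0.0.0:9094
  - --cluster.peer=alertmanager-0.alertmanager.monitoring.svc.cluster.local:9094
  - --cluster.peer=alertmanager-1.alertmanager.monitoring.svc.cluster.local:9094
  - --cluster.peer=alertmanager-2.alertmanager.monitoring.svc.cluster.local:9094
```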

### PodDisruptionBudgets

Ensure minimum availability during:
- Node maintenance
- Cluster upgrades
- Rolling updates

```
Prometheus:    minAvailable=1 (out of 2)
AlertManager:  minAvailable=2 (out of 3)
Grafana:       minAvailable=1 (out of 1)
```
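
Each of these is an ordinary `PodDisruptionBudget`; the Prometheus one, for instance, follows the pattern sketched below (the `app: prometheus` selector label is an assumption).

```yaml
# Sketch of the Prometheus PodDisruptionBudget (minAvailable=1 out of 2 replicas).
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: prometheus
  namespace: monitoring
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: prometheus   # assumed pod label
```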

## 📊 Metrics Reference

### Application Metrics (from services)

```promql
# HTTP request rate
rate(http_requests_total[5m])

# HTTP error rate
sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# Request latency (P95)
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Active connections
active_connections
```

### PostgreSQL Metrics

```promql
# Active connections
pg_stat_database_numbackends

# Transaction rate
rate(pg_stat_database_xact_commit[5m])

# Cache hit ratio
rate(pg_stat_database_blks_hit[5m]) /
  (rate(pg_stat_database_blks_hit[5m]) + rate(pg_stat_database_blks_read[5m]))

# Replication lag
pg_replication_lag_seconds
```

### Node Metrics

```promql
# CPU usage
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100

# Disk I/O
rate(node_disk_read_bytes_total[5m])
rate(node_disk_written_bytes_total[5m])

# Network traffic
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])
```

## 🔗 Distributed Tracing

### Jaeger Configuration

Services automatically send traces when `JAEGER_ENABLED=true`:

```yaml
# In prod-configmap.yaml
JAEGER_ENABLED: "true"
JAEGER_AGENT_HOST: "jaeger-agent.monitoring.svc.cluster.local"
JAEGER_AGENT_PORT: "6831"
```

### Viewing Traces

1. Access the Jaeger UI: https://monitoring.yourdomain.com/jaeger
2. Select a service from the dropdown
3. Click "Find Traces"
4. Explore trace details, spans, and timing

### Trace Sampling

Current sampling: 100% (all traces are collected).

For high-traffic production environments, reduce the sampling rate:

```yaml
# Adjust in shared/monitoring/tracing.py
JAEGER_SAMPLE_RATE: "0.1"  # 10% of traces
```

## 📚 Additional Resources

- [Prometheus Documentation](https://prometheus.io/docs/)
- [Grafana Documentation](https://grafana.com/docs/)
- [AlertManager Documentation](https://prometheus.io/docs/alerting/latest/alertmanager/)
- [Jaeger Documentation](https://www.jaegertracing.io/docs/)
- [PostgreSQL Exporter](https://github.com/prometheus-community/postgres_exporter)
- [Node Exporter](https://github.com/prometheus/node_exporter)

## 🆘 Support

For monitoring issues:
1. Check component logs (see the Troubleshooting section)
2. Verify Prometheus targets are UP
3. Check AlertManager configuration and routing
4. Review resource usage and quotas
5. Contact the platform team: platform-team@yourdomain.com

## 🔄 Maintenance

### Regular Tasks

**Daily:**
- Review critical alerts
- Check service health dashboards

**Weekly:**
- Review alert noise and adjust thresholds
- Check storage usage for Prometheus and Jaeger
- Review slow queries in the PostgreSQL dashboard

**Monthly:**
- Update dashboards with new metrics
- Review and update alert runbooks
- Capacity planning based on trends

### Backup and Recovery

**Prometheus Data:**
```bash
# Backup Prometheus data
kubectl exec -n monitoring prometheus-0 -- tar czf /tmp/prometheus-backup.tar.gz /prometheus
kubectl cp monitoring/prometheus-0:/tmp/prometheus-backup.tar.gz ./prometheus-backup.tar.gz

# Restore (stop Prometheus first)
kubectl cp ./prometheus-backup.tar.gz monitoring/prometheus-0:/tmp/
kubectl exec -n monitoring prometheus-0 -- tar xzf /tmp/prometheus-backup.tar.gz -C /
```

**Grafana Dashboards:**
```bash
# Export all dashboards via the API (port-forward Grafana to localhost:3000 first)
curl -u admin:password "http://localhost:3000/api/search?type=dash-db" | \
  jq -r '.[] | .uid' | \
  xargs -I{} curl -u admin:password http://localhost:3000/api/dashboards/uid/{} > dashboards-backup.json
```

## 📝 Version History

- **v1.0.0** (2026-01-07) - Initial production-ready monitoring stack
  - Prometheus v3.0.1 with HA
  - AlertManager v0.27.0 with clustering
  - Grafana v12.3.0 with 7 dashboards
  - PostgreSQL and Node exporters
  - 50+ alert rules
  - Comprehensive documentation