Improve monitoring for prod

Urtzi Alfaro
2026-01-07 19:12:35 +01:00
parent 560c7ba86f
commit 07178f8972
44 changed files with 6581 additions and 5111 deletions


@@ -0,0 +1,501 @@
# Bakery IA - Production Monitoring Stack
This directory contains the complete production-ready monitoring infrastructure for the Bakery IA platform.
## 📊 Components
### Core Monitoring
- **Prometheus v3.0.1** - Time-series metrics database (2 replicas with HA)
- **Grafana v12.3.0** - Visualization and dashboarding
- **AlertManager v0.27.0** - Alert routing and notification (3 replicas with HA)
### Distributed Tracing
- **Jaeger v1.51** - Distributed tracing with persistent storage
### Exporters
- **PostgreSQL Exporter v0.15.0** - Database metrics and health
- **Node Exporter v1.7.0** - Infrastructure and OS-level metrics (DaemonSet)
## 🚀 Deployment
### Prerequisites
1. Kubernetes cluster (v1.24+)
2. kubectl configured
3. kustomize (v4.0+) or kubectl with kustomize support
4. Storage class available for PersistentVolumeClaims
### Production Deployment
```bash
# 1. Update secrets with production values
kubectl create secret generic grafana-admin \
  --from-literal=admin-user=admin \
  --from-literal=admin-password="$(openssl rand -base64 32)" \
  --namespace monitoring --dry-run=client -o yaml > secrets.yaml
# 2. Update AlertManager SMTP credentials (appended as a separate YAML document)
echo "---" >> secrets.yaml
kubectl create secret generic alertmanager-secrets \
  --from-literal=smtp-host="smtp.gmail.com:587" \
  --from-literal=smtp-username="alerts@yourdomain.com" \
  --from-literal=smtp-password="YOUR_SMTP_PASSWORD" \
  --from-literal=smtp-from="alerts@yourdomain.com" \
  --from-literal=slack-webhook-url="https://hooks.slack.com/services/YOUR/WEBHOOK/URL" \
  --namespace monitoring --dry-run=client -o yaml >> secrets.yaml
# 3. Update PostgreSQL exporter connection string (appended as a separate YAML document)
echo "---" >> secrets.yaml
kubectl create secret generic postgres-exporter \
  --from-literal=data-source-name="postgresql://user:password@postgres.bakery-ia:5432/bakery?sslmode=require" \
  --namespace monitoring --dry-run=client -o yaml >> secrets.yaml
# 4. Deploy monitoring stack
kubectl apply -k infrastructure/kubernetes/overlays/prod
# 5. Verify deployment
kubectl get pods -n monitoring
kubectl get pvc -n monitoring
```
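For reference, the generated `secrets.yaml` should end up as three `Secret` documents separated by `---`; the shape is roughly as below (placeholders stand in for the base64 values kubectl produces):
```yaml
# Sketch of the generated secrets.yaml (first document shown; values are base64-encoded)
apiVersion: v1
kind: Secret
metadata:
  name: grafana-admin
  namespace: monitoring
data:
  admin-user: YWRtaW4=               # base64("admin")
  admin-password: <base64 value>     # generated by openssl above
---
# alertmanager-secrets (smtp-* and slack-webhook-url keys) and postgres-exporter
# (data-source-name key) follow as further documents in the same file
```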
### Local Development Deployment
For local Kind clusters, monitoring is disabled by default to save resources. To enable:
```bash
# Uncomment monitoring in overlays/dev/kustomization.yaml
# Then apply:
kubectl apply -k infrastructure/kubernetes/overlays/dev
```
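The exact resource entry depends on how the dev overlay references the monitoring manifests; a hypothetical `overlays/dev/kustomization.yaml` would look roughly like this, with the monitoring line commented out by default (the relative path below is illustrative, not taken from the repo):
```yaml
# overlays/dev/kustomization.yaml (illustrative layout)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
  # - ../../monitoring   # uncomment to enable the monitoring stack locally
```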
## 🔐 Security Configuration
### Important Security Notes
⚠️ **NEVER commit real secrets to Git!**
The `secrets.yaml` file contains placeholder values. In production, use one of:
1. **Sealed Secrets** (Recommended)
```bash
kubectl apply -f https://github.com/bitnami-labs/sealed-secrets/releases/download/v0.24.0/controller.yaml
kubeseal --format=yaml < secrets.yaml > sealed-secrets.yaml
```
2. **External Secrets Operator** (see the sketch after this list)
```bash
helm install external-secrets external-secrets/external-secrets -n external-secrets
```
3. **Cloud Provider Secrets**
- AWS Secrets Manager
- GCP Secret Manager
- Azure Key Vault
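As a minimal sketch of option 2, an `ExternalSecret` could pull the Grafana credentials from a cloud secret store. The store name and remote key below are assumptions for illustration, not objects that exist in this repo:
```yaml
# Illustrative ExternalSecret (assumes a ClusterSecretStore named "cloud-secrets"
# and a remote secret "bakery-ia/grafana-admin" already exist)
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: grafana-admin
  namespace: monitoring
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: cloud-secrets
    kind: ClusterSecretStore
  target:
    name: grafana-admin        # Kubernetes Secret created and kept in sync by the operator
  data:
    - secretKey: admin-user
      remoteRef:
        key: bakery-ia/grafana-admin
        property: admin-user
    - secretKey: admin-password
      remoteRef:
        key: bakery-ia/grafana-admin
        property: admin-password
```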
### Grafana Admin Password
Change the default password immediately:
```bash
# Generate strong password
NEW_PASSWORD=$(openssl rand -base64 32)
# Update secret
kubectl patch secret grafana-admin -n monitoring \
-p="{\"data\":{\"admin-password\":\"$(echo -n $NEW_PASSWORD | base64)\"}}"
# Restart Grafana
kubectl rollout restart deployment grafana -n monitoring
```
## 📈 Accessing Monitoring Services
### Via Ingress (Production)
```
https://monitoring.yourdomain.com/grafana
https://monitoring.yourdomain.com/prometheus
https://monitoring.yourdomain.com/alertmanager
https://monitoring.yourdomain.com/jaeger
```
### Via Port Forwarding (Development)
```bash
# Grafana
kubectl port-forward -n monitoring svc/grafana 3000:3000
# Prometheus
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090
# AlertManager
kubectl port-forward -n monitoring svc/alertmanager-external 9093:9093
# Jaeger
kubectl port-forward -n monitoring svc/jaeger-query 16686:16686
```
Then access:
- Grafana: http://localhost:3000
- Prometheus: http://localhost:9090
- AlertManager: http://localhost:9093
- Jaeger: http://localhost:16686
## 📊 Grafana Dashboards
### Pre-configured Dashboards
1. **Gateway Metrics** - API gateway performance
- Request rate by endpoint
- P95 latency
- Error rates
- Authentication metrics
2. **Services Overview** - Microservices health
- Request rate by service
- P99 latency
- Error rates by service
- Service health status
3. **Circuit Breakers** - Resilience patterns
- Circuit breaker states
- Trip rates
- Rejected requests
4. **PostgreSQL Monitoring** - Database health
- Connections, transactions, cache hit ratio
- Slow queries, locks, replication lag
5. **Node Metrics** - Infrastructure monitoring
- CPU, memory, disk, network per node
6. **AlertManager** - Alert management
- Active alerts, firing rate, notifications
7. **Business Metrics** - KPIs
- Service performance, tenant activity, ML metrics
### Creating Custom Dashboards
1. Log in to Grafana (admin / your configured password)
2. Click "+ → Dashboard"
3. Add panels with Prometheus queries
4. Save dashboard
5. Export the JSON and add it to `grafana-dashboards.yaml` (see the example below)
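For example, the exported JSON can be added as another key in the `grafana-dashboards` ConfigMap; the key name and dashboard body below are placeholders:
```yaml
# In grafana-dashboards.yaml - add the exported dashboard as a new data key
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboards
  namespace: monitoring
data:
  my-custom-dashboard.json: |
    {
      "dashboard": {
        "title": "My Custom Dashboard",
        "panels": []
      }
    }
```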
## 🚨 Alert Configuration
### Alert Rules
Alert rules are defined in `alert-rules.yaml` and organized by category:
- **bakery_services** - Service health, errors, latency, memory
- **bakery_business** - Training jobs, ML accuracy, API limits
- **alert_system_health** - Alert system components, RabbitMQ, Redis
- **alert_system_performance** - Processing errors, delivery failures
- **alert_system_business** - Alert volume, response times
- **alert_system_capacity** - Queue sizes, storage performance
- **alert_system_critical** - System failures, data loss
- **monitoring_health** - Prometheus, AlertManager self-monitoring
### Alert Routing
Alerts are routed based on:
- **Severity** (critical, warning, info)
- **Component** (alert-system, database, infrastructure)
- **Service** name
### Notification Channels
Configure in `alertmanager.yaml`:
1. **Email** (default)
- critical-alerts@yourdomain.com
- oncall@yourdomain.com
2. **Slack** (optional, commented out)
- Update slack-webhook-url in secrets
- Uncomment slack_configs in alertmanager.yaml
3. **PagerDuty** (add if needed)
```yaml
pagerduty_configs:
  - routing_key: YOUR_ROUTING_KEY
    severity: '{{ .Labels.severity }}'
```
### Testing Alerts
```bash
# Create a test workload (this alone does not fire an alert; see the synthetic alert example below)
kubectl run test-alert --image=busybox -n bakery-ia --restart=Never -- sleep 3600
# Check alert in Prometheus
# Navigate to http://localhost:9090/alerts
# Check AlertManager
# Navigate to http://localhost:9093
```
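Alternatively, a synthetic alert can be pushed directly to AlertManager's v2 API (with the port-forward to 9093 active). This exercises routing, grouping, and notification channels without involving Prometheus; the label values below are arbitrary test values:
```bash
# Send a synthetic alert to AlertManager (API v2)
curl -s -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{
        "labels": {"alertname": "ManualTestAlert", "severity": "warning", "service": "manual-test"},
        "annotations": {"summary": "Manual test alert", "description": "Safe to ignore"},
        "startsAt": "'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'"
      }]'
```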
## 🔍 Troubleshooting
### Prometheus Issues
```bash
# Check Prometheus logs
kubectl logs -n monitoring prometheus-0 -f
# Check Prometheus targets
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090
# Visit http://localhost:9090/targets
# Check Prometheus configuration
kubectl get configmap prometheus-config -n monitoring -o yaml
```
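Prometheus ships with `promtool`, so the loaded configuration can also be validated in place. The config path below assumes the `prometheus-config` ConfigMap is mounted at `/etc/prometheus`; adjust if the mount differs:
```bash
# Validate the Prometheus configuration (and referenced rule files) inside the pod
kubectl exec -n monitoring prometheus-0 -- promtool check config /etc/prometheus/prometheus.yml
```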
### AlertManager Issues
```bash
# Check AlertManager logs
kubectl logs -n monitoring alertmanager-0 -f
# Check AlertManager configuration
kubectl exec -n monitoring alertmanager-0 -- cat /etc/alertmanager/alertmanager.yml
# Test SMTP delivery: trigger a notification (see "Testing Alerts") and watch for SMTP errors
kubectl logs -n monitoring alertmanager-0 | grep -i smtp
```
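The AlertManager image also includes `amtool`, which can validate the rendered configuration and show how a given label set would be routed; the labels in the routing test are examples:
```bash
# Validate the rendered AlertManager config
kubectl exec -n monitoring alertmanager-0 -- amtool check-config /etc/alertmanager/alertmanager.yml
# Show which receiver an example label set would hit
kubectl exec -n monitoring alertmanager-0 -- \
  amtool config routes test --config.file=/etc/alertmanager/alertmanager.yml severity=critical service=gateway
```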
### Grafana Issues
```bash
# Check Grafana logs
kubectl logs -n monitoring deployment/grafana -f
# Reset Grafana admin password
kubectl exec -n monitoring deployment/grafana -- \
grafana-cli admin reset-admin-password NEW_PASSWORD
```
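Grafana also exposes a simple health endpoint that reports database connectivity; with the port-forward from the access section active, it can be checked from your workstation:
```bash
# Quick API health check (expects "database": "ok" in the response)
curl -s http://localhost:3000/api/health
```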
### PostgreSQL Exporter Issues
```bash
# Check exporter logs
kubectl logs -n monitoring deployment/postgres-exporter -f
# Test database connection
kubectl exec -n monitoring deployment/postgres-exporter -- \
wget -O- http://localhost:9187/metrics | grep pg_up
```
### Node Exporter Issues
```bash
# Check node exporter logs on a specific node (resolve the pod scheduled on that node first)
kubectl logs -n monitoring -f \
  $(kubectl get pods -n monitoring -l app=node-exporter --field-selector spec.nodeName=NODE_NAME -o name)
# Check metrics endpoint
kubectl exec -n monitoring daemonset/node-exporter -- \
wget -O- http://localhost:9100/metrics | head -n 20
```
## 📏 Resource Requirements
### Minimum Requirements (Development)
- CPU: 2 cores
- Memory: 4Gi
- Storage: 30Gi
### Recommended Requirements (Production)
- CPU: 6-8 cores
- Memory: 16Gi
- Storage: 100Gi
### Component Resource Allocation
| Component | Replicas | CPU Request | Memory Request | CPU Limit | Memory Limit |
|-----------|----------|-------------|----------------|-----------|--------------|
| Prometheus | 2 | 500m | 1Gi | 1 | 2Gi |
| AlertManager | 3 | 100m | 128Mi | 500m | 256Mi |
| Grafana | 1 | 100m | 256Mi | 500m | 512Mi |
| Postgres Exporter | 1 | 50m | 64Mi | 200m | 128Mi |
| Node Exporter | 1/node | 50m | 64Mi | 200m | 128Mi |
| Jaeger | 1 | 250m | 512Mi | 500m | 1Gi |
## 🔄 High Availability
### Prometheus HA
- 2 replicas in StatefulSet
- Each has independent storage (volumeClaimTemplates)
- Anti-affinity to spread across nodes
- Both scrape the same targets independently
- Use Thanos for long-term storage and global query view (future enhancement)
### AlertManager HA
- 3 replicas in StatefulSet
- Clustered mode (gossip protocol)
- Automatic leader election
- Alert deduplication across instances
- Anti-affinity to spread across nodes
### PodDisruptionBudgets
Ensure minimum availability during:
- Node maintenance
- Cluster upgrades
- Rolling updates
```yaml
Prometheus: minAvailable=1 (out of 2)
AlertManager: minAvailable=2 (out of 3)
Grafana: minAvailable=1 (out of 1)
```
## 📊 Metrics Reference
### Application Metrics (from services)
```promql
# HTTP request rate
rate(http_requests_total[5m])
# HTTP error rate
rate(http_requests_total{status_code=~"5.."}[5m]) / rate(http_requests_total[5m])
# Request latency (P95)
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Active connections
active_connections
```
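If these queries become dashboard hot paths, they can be precomputed with Prometheus recording rules. The sketch below is hypothetical (it is not part of the shipped `alert-rules.yaml` and would need to be wired into the Prometheus rule files):
```yaml
# Hypothetical recording rules precomputing the queries above
groups:
  - name: bakery_http_recording
    interval: 30s
    rules:
      - record: service:http_requests:rate5m
        expr: sum by (service) (rate(http_requests_total[5m]))
      - record: service:http_errors:ratio5m
        expr: |
          sum by (service) (rate(http_requests_total{status_code=~"5.."}[5m]))
          /
          sum by (service) (rate(http_requests_total[5m]))
      - record: service:http_request_duration_seconds:p95
        expr: histogram_quantile(0.95, sum by (service, le) (rate(http_request_duration_seconds_bucket[5m])))
```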
### PostgreSQL Metrics
```promql
# Active connections
pg_stat_database_numbackends
# Transaction rate
rate(pg_stat_database_xact_commit[5m])
# Cache hit ratio
rate(pg_stat_database_blks_hit[5m]) /
(rate(pg_stat_database_blks_hit[5m]) + rate(pg_stat_database_blks_read[5m]))
# Replication lag
pg_replication_lag_seconds
```
### Node Metrics
```promql
# CPU usage
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
# Disk I/O
rate(node_disk_read_bytes_total[5m])
rate(node_disk_written_bytes_total[5m])
# Network traffic
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])
```
## 🔗 Distributed Tracing
### Jaeger Configuration
Services automatically send traces when `JAEGER_ENABLED=true`:
```yaml
# In prod-configmap.yaml
JAEGER_ENABLED: "true"
JAEGER_AGENT_HOST: "jaeger-agent.monitoring.svc.cluster.local"
JAEGER_AGENT_PORT: "6831"
```
### Viewing Traces
1. Access Jaeger UI: https://monitoring.yourdomain.com/jaeger
2. Select service from dropdown
3. Click "Find Traces"
4. Explore trace details, spans, and timing
### Trace Sampling
Current sampling: 100% (all traces collected)
For high-traffic production environments, reduce the sampling rate:
```yaml
# Adjust in shared/monitoring/tracing.py
JAEGER_SAMPLE_RATE: "0.1" # 10% of traces
```
## 📚 Additional Resources
- [Prometheus Documentation](https://prometheus.io/docs/)
- [Grafana Documentation](https://grafana.com/docs/)
- [AlertManager Documentation](https://prometheus.io/docs/alerting/latest/alertmanager/)
- [Jaeger Documentation](https://www.jaegertracing.io/docs/)
- [PostgreSQL Exporter](https://github.com/prometheus-community/postgres_exporter)
- [Node Exporter](https://github.com/prometheus/node_exporter)
## 🆘 Support
For monitoring issues:
1. Check component logs (see Troubleshooting section)
2. Verify Prometheus targets are UP
3. Check AlertManager configuration and routing
4. Review resource usage and quotas
5. Contact platform team: platform-team@yourdomain.com
## 🔄 Maintenance
### Regular Tasks
**Daily:**
- Review critical alerts
- Check service health dashboards
**Weekly:**
- Review alert noise and adjust thresholds
- Check storage usage for Prometheus and Jaeger
- Review slow queries in PostgreSQL dashboard
**Monthly:**
- Update dashboard with new metrics
- Review and update alert runbooks
- Capacity planning based on trends
### Backup and Recovery
**Prometheus Data:**
```bash
# Backup Prometheus data
kubectl exec -n monitoring prometheus-0 -- tar czf /tmp/prometheus-backup.tar.gz /prometheus
kubectl cp monitoring/prometheus-0:/tmp/prometheus-backup.tar.gz ./prometheus-backup.tar.gz
# Restore (stop Prometheus first)
kubectl cp ./prometheus-backup.tar.gz monitoring/prometheus-0:/tmp/
kubectl exec -n monitoring prometheus-0 -- tar xzf /tmp/prometheus-backup.tar.gz -C /
```
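If consistent backups matter, Prometheus's TSDB snapshot API is a cleaner option than tarring the live data directory. Note that it requires starting Prometheus with `--web.enable-admin-api`, which this deployment does not necessarily enable:
```bash
# Trigger a TSDB snapshot via the admin API (requires --web.enable-admin-api)
kubectl port-forward -n monitoring prometheus-0 9090:9090 &
curl -s -X POST http://localhost:9090/api/v1/admin/tsdb/snapshot
# Snapshots are written under /prometheus/snapshots/<snapshot-name> inside the pod
# and can be copied out with kubectl cp
```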
**Grafana Dashboards:**
```bash
# Export all dashboards via the HTTP API (writes one JSON document per dashboard into the backup file)
curl -s -u admin:password "http://localhost:3000/api/search?type=dash-db" | \
  jq -r '.[] | .uid' | \
  xargs -I{} curl -s -u admin:password http://localhost:3000/api/dashboards/uid/{} > dashboards-backup.json
```
## 📝 Version History
- **v1.0.0** (2026-01-07) - Initial production-ready monitoring stack
- Prometheus v3.0.1 with HA
- AlertManager v0.27.0 with clustering
- Grafana v12.3.0 with 7 dashboards
- PostgreSQL and Node exporters
- 50+ alert rules
- Comprehensive documentation


@@ -0,0 +1,429 @@
---
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-alert-rules
namespace: monitoring
data:
alert-rules.yml: |
groups:
# Basic Infrastructure Alerts
- name: bakery_services
interval: 30s
rules:
- alert: ServiceDown
expr: up{job="bakery-services"} == 0
for: 2m
labels:
severity: critical
component: infrastructure
annotations:
summary: "Service {{ $labels.service }} is down"
description: "Service {{ $labels.service }} in namespace {{ $labels.namespace }} has been down for more than 2 minutes."
runbook_url: "https://runbooks.bakery-ia.local/ServiceDown"
- alert: HighErrorRate
expr: |
(
sum(rate(http_requests_total{status_code=~"5..", job="bakery-services"}[5m])) by (service)
/
sum(rate(http_requests_total{job="bakery-services"}[5m])) by (service)
) > 0.10
for: 5m
labels:
severity: critical
component: application
annotations:
summary: "High error rate on {{ $labels.service }}"
description: "Service {{ $labels.service }} has error rate above 10% (current: {{ $value | humanizePercentage }})."
runbook_url: "https://runbooks.bakery-ia.local/HighErrorRate"
- alert: HighResponseTime
expr: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket{job="bakery-services"}[5m])) by (service, le)
) > 1
for: 5m
labels:
severity: warning
component: performance
annotations:
summary: "High response time on {{ $labels.service }}"
description: "Service {{ $labels.service }} P95 latency is above 1 second (current: {{ $value }}s)."
runbook_url: "https://runbooks.bakery-ia.local/HighResponseTime"
- alert: HighMemoryUsage
expr: |
container_memory_usage_bytes{namespace="bakery-ia", container!=""} > 500000000
for: 5m
labels:
severity: warning
component: infrastructure
annotations:
summary: "High memory usage in {{ $labels.pod }}"
description: "Container {{ $labels.container }} in pod {{ $labels.pod }} is using more than 500MB of memory (current: {{ $value | humanize }}B)."
runbook_url: "https://runbooks.bakery-ia.local/HighMemoryUsage"
- alert: DatabaseConnectionHigh
expr: |
pg_stat_database_numbackends{datname="bakery"} > 80
for: 5m
labels:
severity: warning
component: database
annotations:
summary: "High database connection count"
description: "Database has more than 80 active connections (current: {{ $value }})."
runbook_url: "https://runbooks.bakery-ia.local/DatabaseConnectionHigh"
# Business Logic Alerts
- name: bakery_business
interval: 30s
rules:
- alert: TrainingJobFailed
expr: |
increase(training_job_failures_total[1h]) > 0
for: 5m
labels:
severity: warning
component: ml-training
annotations:
summary: "Training job failures detected"
description: "{{ $value }} training job(s) failed in the last hour."
runbook_url: "https://runbooks.bakery-ia.local/TrainingJobFailed"
- alert: LowPredictionAccuracy
expr: |
prediction_model_accuracy < 0.70
for: 15m
labels:
severity: warning
component: ml-inference
annotations:
summary: "Model prediction accuracy is low"
description: "Model {{ $labels.model_name }} accuracy is below 70% (current: {{ $value | humanizePercentage }})."
runbook_url: "https://runbooks.bakery-ia.local/LowPredictionAccuracy"
- alert: APIRateLimitHit
expr: |
increase(rate_limit_hits_total[5m]) > 10
for: 5m
labels:
severity: info
component: api-gateway
annotations:
summary: "API rate limits being hit frequently"
description: "Rate limits hit {{ $value }} times in the last 5 minutes."
runbook_url: "https://runbooks.bakery-ia.local/APIRateLimitHit"
# Alert System Health
- name: alert_system_health
interval: 30s
rules:
- alert: AlertSystemComponentDown
expr: |
alert_system_component_health{component=~"processor|notifier|scheduler"} == 0
for: 2m
labels:
severity: critical
component: alert-system
annotations:
summary: "Alert system component {{ $labels.component }} is unhealthy"
description: "Component {{ $labels.component }} has been unhealthy for more than 2 minutes."
runbook_url: "https://runbooks.bakery-ia.local/AlertSystemComponentDown"
- alert: RabbitMQConnectionDown
expr: |
rabbitmq_up == 0
for: 1m
labels:
severity: critical
component: alert-system
annotations:
summary: "RabbitMQ connection is down"
description: "Alert system has lost connection to RabbitMQ message queue."
runbook_url: "https://runbooks.bakery-ia.local/RabbitMQConnectionDown"
- alert: RedisConnectionDown
expr: |
redis_up == 0
for: 1m
labels:
severity: critical
component: alert-system
annotations:
summary: "Redis connection is down"
description: "Alert system has lost connection to Redis cache."
runbook_url: "https://runbooks.bakery-ia.local/RedisConnectionDown"
- alert: NoSchedulerLeader
expr: |
sum(alert_system_scheduler_leader) == 0
for: 5m
labels:
severity: warning
component: alert-system
annotations:
summary: "No alert scheduler leader elected"
description: "No scheduler instance has been elected as leader for 5 minutes."
runbook_url: "https://runbooks.bakery-ia.local/NoSchedulerLeader"
# Alert System Performance
- name: alert_system_performance
interval: 30s
rules:
- alert: HighAlertProcessingErrorRate
expr: |
(
sum(rate(alert_processing_errors_total[2m]))
/
sum(rate(alerts_processed_total[2m]))
) > 0.10
for: 2m
labels:
severity: critical
component: alert-system
annotations:
summary: "High alert processing error rate"
description: "Alert processing error rate is above 10% (current: {{ $value | humanizePercentage }})."
runbook_url: "https://runbooks.bakery-ia.local/HighAlertProcessingErrorRate"
- alert: HighNotificationDeliveryFailureRate
expr: |
(
sum(rate(notification_delivery_failures_total[3m]))
/
sum(rate(notifications_sent_total[3m]))
) > 0.05
for: 3m
labels:
severity: warning
component: alert-system
annotations:
summary: "High notification delivery failure rate"
description: "Notification delivery failure rate is above 5% (current: {{ $value | humanizePercentage }})."
runbook_url: "https://runbooks.bakery-ia.local/HighNotificationDeliveryFailureRate"
- alert: HighAlertProcessingLatency
expr: |
histogram_quantile(0.95,
sum(rate(alert_processing_duration_seconds_bucket[5m])) by (le)
) > 5
for: 5m
labels:
severity: warning
component: alert-system
annotations:
summary: "High alert processing latency"
description: "P95 alert processing latency is above 5 seconds (current: {{ $value }}s)."
runbook_url: "https://runbooks.bakery-ia.local/HighAlertProcessingLatency"
- alert: TooManySSEConnections
expr: |
sse_active_connections > 1000
for: 2m
labels:
severity: warning
component: alert-system
annotations:
summary: "Too many active SSE connections"
description: "More than 1000 active SSE connections (current: {{ $value }})."
runbook_url: "https://runbooks.bakery-ia.local/TooManySSEConnections"
- alert: SSEConnectionErrors
expr: |
rate(sse_connection_errors_total[3m]) > 0.5
for: 3m
labels:
severity: warning
component: alert-system
annotations:
summary: "High rate of SSE connection errors"
description: "SSE connection error rate is {{ $value }} errors/sec."
runbook_url: "https://runbooks.bakery-ia.local/SSEConnectionErrors"
# Alert System Business Logic
- name: alert_system_business
interval: 30s
rules:
- alert: UnusuallyHighAlertVolume
expr: |
rate(alerts_generated_total[5m]) > 2
for: 5m
labels:
severity: warning
component: alert-system
annotations:
summary: "Unusually high alert generation volume"
description: "More than 2 alerts per second being generated (current: {{ $value }}/sec)."
runbook_url: "https://runbooks.bakery-ia.local/UnusuallyHighAlertVolume"
- alert: NoAlertsGenerated
expr: |
rate(alerts_generated_total[30m]) == 0
for: 15m
labels:
severity: info
component: alert-system
annotations:
summary: "No alerts generated recently"
description: "No alerts have been generated in the last 30 minutes. This might indicate a problem with alert detection."
runbook_url: "https://runbooks.bakery-ia.local/NoAlertsGenerated"
- alert: SlowAlertResponseTime
expr: |
histogram_quantile(0.95,
sum(rate(alert_response_time_seconds_bucket[10m])) by (le)
) > 3600
for: 10m
labels:
severity: warning
component: alert-system
annotations:
summary: "Slow alert response times"
description: "P95 alert response time is above 1 hour (current: {{ $value | humanizeDuration }})."
runbook_url: "https://runbooks.bakery-ia.local/SlowAlertResponseTime"
- alert: CriticalAlertsUnacknowledged
expr: |
sum(alerts_unacknowledged{severity="critical"}) > 5
for: 10m
labels:
severity: warning
component: alert-system
annotations:
summary: "Multiple critical alerts unacknowledged"
description: "{{ $value }} critical alerts have not been acknowledged for 10+ minutes."
runbook_url: "https://runbooks.bakery-ia.local/CriticalAlertsUnacknowledged"
# Alert System Capacity
- name: alert_system_capacity
interval: 30s
rules:
- alert: LargeSSEMessageQueues
expr: |
sse_message_queue_size > 100
for: 5m
labels:
severity: warning
component: alert-system
annotations:
summary: "Large SSE message queues detected"
description: "SSE message queue for tenant {{ $labels.tenant_id }} has {{ $value }} messages queued."
runbook_url: "https://runbooks.bakery-ia.local/LargeSSEMessageQueues"
- alert: SlowDatabaseStorage
expr: |
histogram_quantile(0.95,
sum(rate(alert_storage_duration_seconds_bucket[5m])) by (le)
) > 1
for: 5m
labels:
severity: warning
component: alert-system
annotations:
summary: "Slow alert database storage"
description: "P95 alert storage latency is above 1 second (current: {{ $value }}s)."
runbook_url: "https://runbooks.bakery-ia.local/SlowDatabaseStorage"
# Alert System Critical Scenarios
- name: alert_system_critical
interval: 15s
rules:
- alert: AlertSystemDown
expr: |
up{service=~"alert-processor|notification-service"} == 0
for: 1m
labels:
severity: critical
component: alert-system
annotations:
summary: "Alert system is completely down"
description: "Core alert system service {{ $labels.service }} is down."
runbook_url: "https://runbooks.bakery-ia.local/AlertSystemDown"
- alert: AlertDataNotPersisted
expr: |
(
sum(rate(alerts_processed_total[2m]))
-
sum(rate(alerts_stored_total[2m]))
) > 0
for: 2m
labels:
severity: critical
component: alert-system
annotations:
summary: "Alerts not being persisted to database"
description: "Alerts are being processed but not stored in the database."
runbook_url: "https://runbooks.bakery-ia.local/AlertDataNotPersisted"
- alert: NotificationsNotDelivered
expr: |
(
sum(rate(alerts_processed_total[3m]))
-
sum(rate(notifications_sent_total[3m]))
) > 0
for: 3m
labels:
severity: critical
component: alert-system
annotations:
summary: "Notifications not being delivered"
description: "Alerts are being processed but notifications are not being sent."
runbook_url: "https://runbooks.bakery-ia.local/NotificationsNotDelivered"
# Monitoring System Self-Monitoring
- name: monitoring_health
interval: 30s
rules:
- alert: PrometheusDown
expr: up{job="prometheus"} == 0
for: 5m
labels:
severity: critical
component: monitoring
annotations:
summary: "Prometheus is down"
description: "Prometheus monitoring system is not responding."
runbook_url: "https://runbooks.bakery-ia.local/PrometheusDown"
- alert: AlertManagerDown
expr: up{job="alertmanager"} == 0
for: 2m
labels:
severity: critical
component: monitoring
annotations:
summary: "AlertManager is down"
description: "AlertManager is not responding. Alerts will not be routed."
runbook_url: "https://runbooks.bakery-ia.local/AlertManagerDown"
- alert: PrometheusStorageFull
expr: |
(
prometheus_tsdb_storage_blocks_bytes
/
(prometheus_tsdb_storage_blocks_bytes + prometheus_tsdb_wal_size_bytes)
) > 0.90
for: 10m
labels:
severity: warning
component: monitoring
annotations:
summary: "Prometheus storage almost full"
description: "Prometheus storage is {{ $value | humanizePercentage }} full."
runbook_url: "https://runbooks.bakery-ia.local/PrometheusStorageFull"
- alert: PrometheusScrapeErrors
expr: |
rate(prometheus_target_scrapes_exceeded_sample_limit_total[5m]) > 0
for: 5m
labels:
severity: warning
component: monitoring
annotations:
summary: "Prometheus scrape errors detected"
description: "Prometheus is experiencing scrape errors for target {{ $labels.job }}."
runbook_url: "https://runbooks.bakery-ia.local/PrometheusScrapeErrors"


@@ -0,0 +1,27 @@
---
# InitContainer to substitute secrets into AlertManager config
# This allows us to use environment variables from secrets in the config file
apiVersion: v1
kind: ConfigMap
metadata:
name: alertmanager-init-script
namespace: monitoring
data:
init-config.sh: |
#!/bin/sh
set -e
# Read the template config
TEMPLATE=$(cat /etc/alertmanager-template/alertmanager.yml)
# Substitute environment variables
echo "$TEMPLATE" | \
sed "s|{{ .smtp_host }}|${SMTP_HOST}|g" | \
sed "s|{{ .smtp_from }}|${SMTP_FROM}|g" | \
sed "s|{{ .smtp_username }}|${SMTP_USERNAME}|g" | \
sed "s|{{ .smtp_password }}|${SMTP_PASSWORD}|g" | \
sed "s|{{ .slack_webhook_url }}|${SLACK_WEBHOOK_URL}|g" \
> /etc/alertmanager-final/alertmanager.yml
echo "AlertManager config initialized successfully"
cat /etc/alertmanager-final/alertmanager.yml


@@ -0,0 +1,391 @@
---
apiVersion: v1
kind: ConfigMap
metadata:
name: alertmanager-config
namespace: monitoring
data:
alertmanager.yml: |
global:
resolve_timeout: 5m
smtp_smarthost: '{{ .smtp_host }}'
smtp_from: '{{ .smtp_from }}'
smtp_auth_username: '{{ .smtp_username }}'
smtp_auth_password: '{{ .smtp_password }}'
smtp_require_tls: true
# Define notification templates
templates:
- '/etc/alertmanager/templates/*.tmpl'
# Route alerts to appropriate receivers
route:
# Default receiver
receiver: 'default-email'
# Group alerts by these labels
group_by: ['alertname', 'cluster', 'service']
# Wait time before sending initial notification
group_wait: 10s
# Wait time before sending notifications about new alerts in the group
group_interval: 10s
# Wait time before re-sending a notification
repeat_interval: 12h
# Child routes for specific alert routing
routes:
# Critical alerts - send immediately to all channels
- match:
severity: critical
receiver: 'critical-alerts'
group_wait: 0s
group_interval: 5m
repeat_interval: 4h
continue: true
# Warning alerts - less urgent
- match:
severity: warning
receiver: 'warning-alerts'
group_wait: 30s
group_interval: 5m
repeat_interval: 12h
# Alert system specific alerts
- match:
component: alert-system
receiver: 'alert-system-team'
group_wait: 10s
repeat_interval: 6h
# Database alerts
- match_re:
alertname: ^(DatabaseConnectionHigh|SlowDatabaseStorage)$
receiver: 'database-team'
group_wait: 30s
repeat_interval: 8h
# Infrastructure alerts
- match_re:
alertname: ^(HighMemoryUsage|ServiceDown)$
receiver: 'infra-team'
group_wait: 30s
repeat_interval: 6h
# Inhibition rules - prevent alert spam
inhibit_rules:
# If service is down, inhibit all other alerts for that service
- source_match:
alertname: 'ServiceDown'
target_match_re:
alertname: '(HighErrorRate|HighResponseTime|HighMemoryUsage)'
equal: ['service']
# If AlertSystem is completely down, inhibit component alerts
- source_match:
alertname: 'AlertSystemDown'
target_match_re:
alertname: 'AlertSystemComponent.*'
equal: ['namespace']
# If RabbitMQ is down, inhibit alert processing errors
- source_match:
alertname: 'RabbitMQConnectionDown'
target_match:
alertname: 'HighAlertProcessingErrorRate'
equal: ['namespace']
# Receivers - notification destinations
receivers:
# Default email receiver
- name: 'default-email'
email_configs:
- to: 'alerts@yourdomain.com'
headers:
Subject: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }} - {{ .GroupLabels.service }}'
html: |
{{ range .Alerts }}
<h2>{{ .Labels.alertname }}</h2>
<p><strong>Status:</strong> {{ .Status }}</p>
<p><strong>Severity:</strong> {{ .Labels.severity }}</p>
<p><strong>Service:</strong> {{ .Labels.service }}</p>
<p><strong>Summary:</strong> {{ .Annotations.summary }}</p>
<p><strong>Description:</strong> {{ .Annotations.description }}</p>
<p><strong>Started:</strong> {{ .StartsAt }}</p>
{{ if .EndsAt }}<p><strong>Ended:</strong> {{ .EndsAt }}</p>{{ end }}
{{ end }}
# Critical alerts - multiple channels
- name: 'critical-alerts'
email_configs:
- to: 'critical-alerts@yourdomain.com,oncall@yourdomain.com'
headers:
Subject: '🚨 [CRITICAL] {{ .GroupLabels.alertname }} - {{ .GroupLabels.service }}'
send_resolved: true
# Uncomment to enable Slack notifications
# slack_configs:
# - api_url: '{{ .slack_webhook_url }}'
# channel: '#alerts-critical'
# title: '🚨 Critical Alert'
# text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
# send_resolved: true
# Warning alerts
- name: 'warning-alerts'
email_configs:
- to: 'alerts@yourdomain.com'
headers:
Subject: '⚠️ [WARNING] {{ .GroupLabels.alertname }} - {{ .GroupLabels.service }}'
send_resolved: true
# Alert system team
- name: 'alert-system-team'
email_configs:
- to: 'alert-system-team@yourdomain.com'
headers:
Subject: '[Alert System] {{ .GroupLabels.alertname }}'
send_resolved: true
# Database team
- name: 'database-team'
email_configs:
- to: 'database-team@yourdomain.com'
headers:
Subject: '[Database] {{ .GroupLabels.alertname }}'
send_resolved: true
# Infrastructure team
- name: 'infra-team'
email_configs:
- to: 'infra-team@yourdomain.com'
headers:
Subject: '[Infrastructure] {{ .GroupLabels.alertname }}'
send_resolved: true
---
apiVersion: v1
kind: ConfigMap
metadata:
name: alertmanager-templates
namespace: monitoring
data:
default.tmpl: |
{{ define "cluster" }}{{ .ExternalURL | reReplaceAll ".*alertmanager\\.(.*)" "$1" }}{{ end }}
{{ define "slack.default.title" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.alertname }}
{{ end }}
{{ define "slack.default.text" }}
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
*Severity:* `{{ .Labels.severity }}`
*Service:* `{{ .Labels.service }}`
{{ end }}
{{ end }}
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: alertmanager
namespace: monitoring
labels:
app: alertmanager
spec:
serviceName: alertmanager
replicas: 3
selector:
matchLabels:
app: alertmanager
template:
metadata:
labels:
app: alertmanager
spec:
serviceAccountName: prometheus
initContainers:
- name: init-config
image: busybox:1.36
command: ['/bin/sh', '/scripts/init-config.sh']
env:
- name: SMTP_HOST
valueFrom:
secretKeyRef:
name: alertmanager-secrets
key: smtp-host
- name: SMTP_USERNAME
valueFrom:
secretKeyRef:
name: alertmanager-secrets
key: smtp-username
- name: SMTP_PASSWORD
valueFrom:
secretKeyRef:
name: alertmanager-secrets
key: smtp-password
- name: SMTP_FROM
valueFrom:
secretKeyRef:
name: alertmanager-secrets
key: smtp-from
- name: SLACK_WEBHOOK_URL
valueFrom:
secretKeyRef:
name: alertmanager-secrets
key: slack-webhook-url
optional: true
volumeMounts:
- name: init-script
mountPath: /scripts
- name: config-template
mountPath: /etc/alertmanager-template
- name: config-final
mountPath: /etc/alertmanager-final
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- alertmanager
topologyKey: kubernetes.io/hostname
containers:
- name: alertmanager
image: prom/alertmanager:v0.27.0
args:
- '--config.file=/etc/alertmanager/alertmanager.yml'
- '--storage.path=/alertmanager'
- '--cluster.listen-address=0.0.0.0:9094'
- '--cluster.peer=alertmanager-0.alertmanager.monitoring.svc.cluster.local:9094'
- '--cluster.peer=alertmanager-1.alertmanager.monitoring.svc.cluster.local:9094'
- '--cluster.peer=alertmanager-2.alertmanager.monitoring.svc.cluster.local:9094'
- '--cluster.reconnect-timeout=5m'
- '--web.external-url=http://monitoring.bakery-ia.local/alertmanager'
- '--web.route-prefix=/'
ports:
- name: web
containerPort: 9093
- name: mesh-tcp
containerPort: 9094
- name: mesh-udp
containerPort: 9094
protocol: UDP
env:
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
volumeMounts:
- name: config-final
mountPath: /etc/alertmanager
- name: templates
mountPath: /etc/alertmanager/templates
- name: storage
mountPath: /alertmanager
resources:
requests:
memory: "128Mi"
cpu: "100m"
limits:
memory: "256Mi"
cpu: "500m"
livenessProbe:
httpGet:
path: /-/healthy
port: 9093
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /-/ready
port: 9093
initialDelaySeconds: 5
periodSeconds: 5
# Config reloader sidecar
- name: configmap-reload
image: jimmidyson/configmap-reload:v0.12.0
args:
- '--webhook-url=http://localhost:9093/-/reload'
- '--volume-dir=/etc/alertmanager'
volumeMounts:
- name: config-final
mountPath: /etc/alertmanager
readOnly: true
resources:
requests:
memory: "16Mi"
cpu: "10m"
limits:
memory: "32Mi"
cpu: "50m"
volumes:
- name: init-script
configMap:
name: alertmanager-init-script
defaultMode: 0755
- name: config-template
configMap:
name: alertmanager-config
- name: config-final
emptyDir: {}
- name: templates
configMap:
name: alertmanager-templates
volumeClaimTemplates:
- metadata:
name: storage
spec:
accessModes: [ "ReadWriteOnce" ]
resources:
requests:
storage: 2Gi
---
apiVersion: v1
kind: Service
metadata:
name: alertmanager
namespace: monitoring
labels:
app: alertmanager
spec:
type: ClusterIP
clusterIP: None
ports:
- name: web
port: 9093
targetPort: 9093
- name: mesh-tcp
port: 9094
targetPort: 9094
- name: mesh-udp
port: 9094
targetPort: 9094
protocol: UDP
selector:
app: alertmanager
---
apiVersion: v1
kind: Service
metadata:
name: alertmanager-external
namespace: monitoring
labels:
app: alertmanager
spec:
type: ClusterIP
ports:
- name: web
port: 9093
targetPort: 9093
selector:
app: alertmanager


@@ -0,0 +1,949 @@
apiVersion: v1
kind: ConfigMap
metadata:
name: grafana-dashboards-extended
namespace: monitoring
data:
postgresql-dashboard.json: |
{
"dashboard": {
"title": "Bakery IA - PostgreSQL Database",
"tags": ["bakery-ia", "postgresql", "database"],
"timezone": "browser",
"refresh": "30s",
"schemaVersion": 16,
"version": 1,
"panels": [
{
"id": 1,
"title": "Active Connections by Database",
"type": "graph",
"gridPos": {"x": 0, "y": 0, "w": 12, "h": 8},
"targets": [
{
"expr": "pg_stat_activity_count{state=\"active\"}",
"legendFormat": "{{datname}} - active"
},
{
"expr": "pg_stat_activity_count{state=\"idle\"}",
"legendFormat": "{{datname}} - idle"
},
{
"expr": "pg_stat_activity_count{state=\"idle in transaction\"}",
"legendFormat": "{{datname}} - idle tx"
}
]
},
{
"id": 2,
"title": "Total Connections",
"type": "stat",
"gridPos": {"x": 12, "y": 0, "w": 6, "h": 4},
"targets": [
{
"expr": "sum(pg_stat_activity_count)",
"legendFormat": "Total connections"
}
]
},
{
"id": 3,
"title": "Max Connections",
"type": "stat",
"gridPos": {"x": 18, "y": 0, "w": 6, "h": 4},
"targets": [
{
"expr": "pg_settings_max_connections",
"legendFormat": "Max connections"
}
]
},
{
"id": 4,
"title": "Transaction Rate (Commits vs Rollbacks)",
"type": "graph",
"gridPos": {"x": 0, "y": 8, "w": 12, "h": 8},
"targets": [
{
"expr": "rate(pg_stat_database_xact_commit[5m])",
"legendFormat": "{{datname}} - commits"
},
{
"expr": "rate(pg_stat_database_xact_rollback[5m])",
"legendFormat": "{{datname}} - rollbacks"
}
]
},
{
"id": 5,
"title": "Cache Hit Ratio",
"type": "graph",
"gridPos": {"x": 12, "y": 8, "w": 12, "h": 8},
"targets": [
{
"expr": "100 * (1 - (sum(rate(pg_stat_io_blocks_read_total[5m])) / (sum(rate(pg_stat_io_blocks_read_total[5m])) + sum(rate(pg_stat_io_blocks_hit_total[5m])))))",
"legendFormat": "Cache hit ratio %"
}
]
},
{
"id": 6,
"title": "Slow Queries (> 30s)",
"type": "table",
"gridPos": {"x": 0, "y": 16, "w": 12, "h": 8},
"targets": [
{
"expr": "pg_slow_queries{duration_ms > 30000}",
"format": "table",
"instant": true
}
],
"transformations": [
{
"id": "organize",
"options": {
"excludeByName": {},
"indexByName": {},
"renameByName": {
"query": "Query",
"duration_ms": "Duration (ms)",
"datname": "Database"
}
}
}
]
},
{
"id": 7,
"title": "Dead Tuples by Table",
"type": "graph",
"gridPos": {"x": 12, "y": 16, "w": 12, "h": 8},
"targets": [
{
"expr": "pg_stat_user_tables_n_dead_tup",
"legendFormat": "{{schemaname}}.{{relname}}"
}
]
},
{
"id": 8,
"title": "Table Bloat Estimate",
"type": "graph",
"gridPos": {"x": 0, "y": 24, "w": 12, "h": 8},
"targets": [
{
"expr": "100 * (pg_stat_user_tables_n_dead_tup * avg_tuple_size) / (pg_total_relation_size * 8192)",
"legendFormat": "{{schemaname}}.{{relname}} bloat %"
}
]
},
{
"id": 9,
"title": "Replication Lag (bytes)",
"type": "graph",
"gridPos": {"x": 12, "y": 24, "w": 12, "h": 8},
"targets": [
{
"expr": "pg_replication_lag_bytes",
"legendFormat": "{{slot_name}} - {{application_name}}"
}
]
},
{
"id": 10,
"title": "Database Size (GB)",
"type": "graph",
"gridPos": {"x": 0, "y": 32, "w": 12, "h": 8},
"targets": [
{
"expr": "pg_database_size_bytes / 1024 / 1024 / 1024",
"legendFormat": "{{datname}}"
}
]
},
{
"id": 11,
"title": "Database Size Growth (per hour)",
"type": "graph",
"gridPos": {"x": 12, "y": 32, "w": 12, "h": 8},
"targets": [
{
"expr": "rate(pg_database_size_bytes[1h])",
"legendFormat": "{{datname}} - bytes/hour"
}
]
},
{
"id": 12,
"title": "Lock Counts by Type",
"type": "graph",
"gridPos": {"x": 0, "y": 40, "w": 12, "h": 8},
"targets": [
{
"expr": "pg_locks_count",
"legendFormat": "{{datname}} - {{locktype}} - {{mode}}"
}
]
},
{
"id": 13,
"title": "Query Duration (p95)",
"type": "graph",
"gridPos": {"x": 12, "y": 40, "w": 12, "h": 8},
"targets": [
{
"expr": "histogram_quantile(0.95, rate(pg_query_duration_seconds_bucket[5m]))",
"legendFormat": "p95"
}
]
}
]
}
}
node-exporter-dashboard.json: |
{
"dashboard": {
"title": "Bakery IA - Node Exporter Infrastructure",
"tags": ["bakery-ia", "node-exporter", "infrastructure"],
"timezone": "browser",
"refresh": "15s",
"schemaVersion": 16,
"version": 1,
"panels": [
{
"id": 1,
"title": "CPU Usage by Node",
"type": "graph",
"gridPos": {"x": 0, "y": 0, "w": 12, "h": 8},
"targets": [
{
"expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
"legendFormat": "{{instance}} - {{cpu}}"
}
]
},
{
"id": 2,
"title": "Average CPU Usage",
"type": "stat",
"gridPos": {"x": 12, "y": 0, "w": 6, "h": 4},
"targets": [
{
"expr": "100 - (avg(rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
"legendFormat": "Average CPU %"
}
]
},
{
"id": 3,
"title": "CPU Load (1m, 5m, 15m)",
"type": "stat",
"gridPos": {"x": 18, "y": 0, "w": 6, "h": 4},
"targets": [
{
"expr": "avg(node_load1)",
"legendFormat": "1m"
},
{
"expr": "avg(node_load5)",
"legendFormat": "5m"
},
{
"expr": "avg(node_load15)",
"legendFormat": "15m"
}
]
},
{
"id": 4,
"title": "Memory Usage by Node",
"type": "graph",
"gridPos": {"x": 0, "y": 8, "w": 12, "h": 8},
"targets": [
{
"expr": "100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))",
"legendFormat": "{{instance}}"
}
]
},
{
"id": 5,
"title": "Memory Used (GB)",
"type": "stat",
"gridPos": {"x": 12, "y": 8, "w": 6, "h": 4},
"targets": [
{
"expr": "(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / 1024 / 1024 / 1024",
"legendFormat": "{{instance}}"
}
]
},
{
"id": 6,
"title": "Memory Available (GB)",
"type": "stat",
"gridPos": {"x": 18, "y": 8, "w": 6, "h": 4},
"targets": [
{
"expr": "node_memory_MemAvailable_bytes / 1024 / 1024 / 1024",
"legendFormat": "{{instance}}"
}
]
},
{
"id": 7,
"title": "Disk I/O Read Rate (MB/s)",
"type": "graph",
"gridPos": {"x": 0, "y": 16, "w": 12, "h": 8},
"targets": [
{
"expr": "rate(node_disk_read_bytes_total[5m]) / 1024 / 1024",
"legendFormat": "{{instance}} - {{device}}"
}
]
},
{
"id": 8,
"title": "Disk I/O Write Rate (MB/s)",
"type": "graph",
"gridPos": {"x": 12, "y": 16, "w": 12, "h": 8},
"targets": [
{
"expr": "rate(node_disk_written_bytes_total[5m]) / 1024 / 1024",
"legendFormat": "{{instance}} - {{device}}"
}
]
},
{
"id": 9,
"title": "Disk I/O Operations (IOPS)",
"type": "graph",
"gridPos": {"x": 0, "y": 24, "w": 12, "h": 8},
"targets": [
{
"expr": "rate(node_disk_reads_completed_total[5m]) + rate(node_disk_writes_completed_total[5m])",
"legendFormat": "{{instance}} - {{device}}"
}
]
},
{
"id": 10,
"title": "Network Receive Rate (Mbps)",
"type": "graph",
"gridPos": {"x": 12, "y": 24, "w": 12, "h": 8},
"targets": [
{
"expr": "rate(node_network_receive_bytes_total{device!=\"lo\"}[5m]) * 8 / 1024 / 1024",
"legendFormat": "{{instance}} - {{device}}"
}
]
},
{
"id": 11,
"title": "Network Transmit Rate (Mbps)",
"type": "graph",
"gridPos": {"x": 0, "y": 32, "w": 12, "h": 8},
"targets": [
{
"expr": "rate(node_network_transmit_bytes_total{device!=\"lo\"}[5m]) * 8 / 1024 / 1024",
"legendFormat": "{{instance}} - {{device}}"
}
]
},
{
"id": 12,
"title": "Network Errors",
"type": "graph",
"gridPos": {"x": 12, "y": 32, "w": 12, "h": 8},
"targets": [
{
"expr": "rate(node_network_receive_errs_total[5m]) + rate(node_network_transmit_errs_total[5m])",
"legendFormat": "{{instance}} - {{device}}"
}
]
},
{
"id": 13,
"title": "Filesystem Usage by Mount",
"type": "graph",
"gridPos": {"x": 0, "y": 40, "w": 12, "h": 8},
"targets": [
{
"expr": "100 * (1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes))",
"legendFormat": "{{instance}} - {{mountpoint}}"
}
]
},
{
"id": 14,
"title": "Filesystem Available (GB)",
"type": "stat",
"gridPos": {"x": 12, "y": 40, "w": 6, "h": 4},
"targets": [
{
"expr": "node_filesystem_avail_bytes / 1024 / 1024 / 1024",
"legendFormat": "{{instance}} - {{mountpoint}}"
}
]
},
{
"id": 15,
"title": "Filesystem Size (GB)",
"type": "stat",
"gridPos": {"x": 18, "y": 40, "w": 6, "h": 4},
"targets": [
{
"expr": "node_filesystem_size_bytes / 1024 / 1024 / 1024",
"legendFormat": "{{instance}} - {{mountpoint}}"
}
]
},
{
"id": 16,
"title": "Load Average (1m, 5m, 15m)",
"type": "graph",
"gridPos": {"x": 0, "y": 48, "w": 12, "h": 8},
"targets": [
{
"expr": "node_load1",
"legendFormat": "{{instance}} - 1m"
},
{
"expr": "node_load5",
"legendFormat": "{{instance}} - 5m"
},
{
"expr": "node_load15",
"legendFormat": "{{instance}} - 15m"
}
]
},
{
"id": 17,
"title": "System Up Time",
"type": "stat",
"gridPos": {"x": 12, "y": 48, "w": 12, "h": 8},
"targets": [
{
"expr": "node_boot_time_seconds",
"legendFormat": "{{instance}} - uptime"
}
]
},
{
"id": 18,
"title": "Context Switches",
"type": "graph",
"gridPos": {"x": 0, "y": 56, "w": 12, "h": 8},
"targets": [
{
"expr": "rate(node_context_switches_total[5m])",
"legendFormat": "{{instance}}"
}
]
},
{
"id": 19,
"title": "Interrupts",
"type": "graph",
"gridPos": {"x": 12, "y": 56, "w": 12, "h": 8},
"targets": [
{
"expr": "rate(node_intr_total[5m])",
"legendFormat": "{{instance}}"
}
]
}
]
}
}
alertmanager-dashboard.json: |
{
"dashboard": {
"title": "Bakery IA - AlertManager Monitoring",
"tags": ["bakery-ia", "alertmanager", "alerting"],
"timezone": "browser",
"refresh": "10s",
"schemaVersion": 16,
"version": 1,
"panels": [
{
"id": 1,
"title": "Active Alerts by Severity",
"type": "graph",
"gridPos": {"x": 0, "y": 0, "w": 12, "h": 8},
"targets": [
{
"expr": "count by (severity) (ALERTS{alertstate=\"firing\"})",
"legendFormat": "{{severity}}"
}
]
},
{
"id": 2,
"title": "Total Active Alerts",
"type": "stat",
"gridPos": {"x": 12, "y": 0, "w": 6, "h": 4},
"targets": [
{
"expr": "count(ALERTS{alertstate=\"firing\"})",
"legendFormat": "Active alerts"
}
]
},
{
"id": 3,
"title": "Critical Alerts",
"type": "stat",
"gridPos": {"x": 18, "y": 0, "w": 6, "h": 4},
"targets": [
{
"expr": "count(ALERTS{alertstate=\"firing\", severity=\"critical\"})",
"legendFormat": "Critical"
}
]
},
{
"id": 4,
"title": "Alert Firing Rate (per minute)",
"type": "graph",
"gridPos": {"x": 0, "y": 8, "w": 12, "h": 8},
"targets": [
{
"expr": "rate(alertmanager_alerts_fired_total[1m])",
"legendFormat": "Alerts fired/min"
}
]
},
{
"id": 5,
"title": "Alert Resolution Rate (per minute)",
"type": "graph",
"gridPos": {"x": 12, "y": 8, "w": 12, "h": 8},
"targets": [
{
"expr": "rate(alertmanager_alerts_resolved_total[1m])",
"legendFormat": "Alerts resolved/min"
}
]
},
{
"id": 6,
"title": "Notification Success Rate",
"type": "graph",
"gridPos": {"x": 0, "y": 16, "w": 12, "h": 8},
"targets": [
{
"expr": "100 * (rate(alertmanager_notifications_total{status=\"success\"}[5m]) / rate(alertmanager_notifications_total[5m]))",
"legendFormat": "Success rate %"
}
]
},
{
"id": 7,
"title": "Notification Failures",
"type": "graph",
"gridPos": {"x": 12, "y": 16, "w": 12, "h": 8},
"targets": [
{
"expr": "rate(alertmanager_notifications_total{status=\"failed\"}[5m])",
"legendFormat": "{{integration}}"
}
]
},
{
"id": 8,
"title": "Silenced Alerts",
"type": "stat",
"gridPos": {"x": 0, "y": 24, "w": 6, "h": 4},
"targets": [
{
"expr": "count(ALERTS{alertstate=\"silenced\"})",
"legendFormat": "Silenced"
}
]
},
{
"id": 9,
"title": "AlertManager Cluster Size",
"type": "stat",
"gridPos": {"x": 6, "y": 24, "w": 6, "h": 4},
"targets": [
{
"expr": "count(alertmanager_cluster_peers)",
"legendFormat": "Cluster peers"
}
]
},
{
"id": 10,
"title": "AlertManager Peers",
"type": "stat",
"gridPos": {"x": 12, "y": 24, "w": 6, "h": 4},
"targets": [
{
"expr": "alertmanager_cluster_peers",
"legendFormat": "{{instance}}"
}
]
},
{
"id": 11,
"title": "Cluster Status",
"type": "stat",
"gridPos": {"x": 18, "y": 24, "w": 6, "h": 4},
"targets": [
{
"expr": "up{job=\"alertmanager\"}",
"legendFormat": "{{instance}}"
}
]
},
{
"id": 12,
"title": "Alerts by Group",
"type": "table",
"gridPos": {"x": 0, "y": 28, "w": 12, "h": 8},
"targets": [
{
"expr": "count by (alertname) (ALERTS{alertstate=\"firing\"})",
"format": "table",
"instant": true
}
],
"transformations": [
{
"id": "organize",
"options": {
"excludeByName": {},
"indexByName": {},
"renameByName": {
"alertname": "Alert Name",
"Value": "Count"
}
}
}
]
},
{
"id": 13,
"title": "Alert Duration (p99)",
"type": "graph",
"gridPos": {"x": 12, "y": 28, "w": 12, "h": 8},
"targets": [
{
"expr": "histogram_quantile(0.99, rate(alertmanager_alert_duration_seconds_bucket[5m]))",
"legendFormat": "p99 duration"
}
]
},
{
"id": 14,
"title": "Processing Time",
"type": "graph",
"gridPos": {"x": 0, "y": 36, "w": 12, "h": 8},
"targets": [
{
"expr": "rate(alertmanager_receiver_processing_duration_seconds_sum[5m]) / rate(alertmanager_receiver_processing_duration_seconds_count[5m])",
"legendFormat": "{{receiver}}"
}
]
},
{
"id": 15,
"title": "Memory Usage",
"type": "stat",
"gridPos": {"x": 12, "y": 36, "w": 12, "h": 8},
"targets": [
{
"expr": "process_resident_memory_bytes{job=\"alertmanager\"} / 1024 / 1024",
"legendFormat": "{{instance}} - MB"
}
]
}
]
}
}
business-metrics-dashboard.json: |
{
"dashboard": {
"title": "Bakery IA - Business Metrics & KPIs",
"tags": ["bakery-ia", "business-metrics", "kpis"],
"timezone": "browser",
"refresh": "30s",
"schemaVersion": 16,
"version": 1,
"panels": [
{
"id": 1,
"title": "Requests per Service (Rate)",
"type": "graph",
"gridPos": {"x": 0, "y": 0, "w": 12, "h": 8},
"targets": [
{
"expr": "sum by (service) (rate(http_requests_total[5m]))",
"legendFormat": "{{service}}"
}
]
},
{
"id": 2,
"title": "Total Request Rate",
"type": "stat",
"gridPos": {"x": 12, "y": 0, "w": 6, "h": 4},
"targets": [
{
"expr": "sum(rate(http_requests_total[5m]))",
"legendFormat": "requests/sec"
}
]
},
{
"id": 3,
"title": "Peak Request Rate (5m)",
"type": "stat",
"gridPos": {"x": 18, "y": 0, "w": 6, "h": 4},
"targets": [
{
"expr": "max(sum(rate(http_requests_total[5m])))",
"legendFormat": "Peak requests/sec"
}
]
},
{
"id": 4,
"title": "Error Rates by Service",
"type": "graph",
"gridPos": {"x": 0, "y": 8, "w": 12, "h": 8},
"targets": [
{
"expr": "sum by (service) (rate(http_requests_total{status_code=~\"5..\"}[5m]))",
"legendFormat": "{{service}}"
}
]
},
{
"id": 5,
"title": "Overall Error Rate",
"type": "stat",
"gridPos": {"x": 12, "y": 8, "w": 6, "h": 4},
"targets": [
{
"expr": "100 * (sum(rate(http_requests_total{status_code=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])))",
"legendFormat": "Error %"
}
]
},
{
"id": 6,
"title": "4xx Error Rate",
"type": "stat",
"gridPos": {"x": 18, "y": 8, "w": 6, "h": 4},
"targets": [
{
"expr": "100 * (sum(rate(http_requests_total{status_code=~\"4..\"}[5m])) / sum(rate(http_requests_total[5m])))",
"legendFormat": "4xx %"
}
]
},
{
"id": 7,
"title": "P95 Latency by Service (ms)",
"type": "graph",
"gridPos": {"x": 0, "y": 16, "w": 12, "h": 8},
"targets": [
{
"expr": "histogram_quantile(0.95, sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))) * 1000",
"legendFormat": "{{service}} p95"
}
]
},
{
"id": 8,
"title": "P99 Latency by Service (ms)",
"type": "graph",
"gridPos": {"x": 12, "y": 16, "w": 12, "h": 8},
"targets": [
{
"expr": "histogram_quantile(0.99, sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))) * 1000",
"legendFormat": "{{service}} p99"
}
]
},
{
"id": 9,
"title": "Average Latency (ms)",
"type": "stat",
"gridPos": {"x": 0, "y": 24, "w": 6, "h": 4},
"targets": [
{
"expr": "(sum(rate(http_request_duration_seconds_sum[5m])) / sum(rate(http_request_duration_seconds_count[5m]))) * 1000",
"legendFormat": "Avg latency ms"
}
]
},
{
"id": 10,
"title": "Active Tenants",
"type": "stat",
"gridPos": {"x": 6, "y": 24, "w": 6, "h": 4},
"targets": [
{
"expr": "count(count by (tenant_id) (rate(http_requests_total[5m])))",
"legendFormat": "Active tenants"
}
]
},
{
"id": 11,
"title": "Requests per Tenant",
"type": "stat",
"gridPos": {"x": 12, "y": 24, "w": 12, "h": 4},
"targets": [
{
"expr": "sum by (tenant_id) (rate(http_requests_total[5m]))",
"legendFormat": "Tenant {{tenant_id}}"
}
]
},
{
"id": 12,
"title": "Alert Generation Rate (per minute)",
"type": "graph",
"gridPos": {"x": 0, "y": 32, "w": 12, "h": 8},
"targets": [
{
"expr": "rate(ALERTS_FOR_STATE[1m])",
"legendFormat": "{{alertname}}"
}
]
},
{
"id": 13,
"title": "Training Job Success Rate",
"type": "stat",
"gridPos": {"x": 12, "y": 32, "w": 12, "h": 8},
"targets": [
{
"expr": "100 * (sum(training_job_completed_total{status=\"success\"}) / sum(training_job_completed_total))",
"legendFormat": "Success rate %"
}
]
},
{
"id": 14,
"title": "Training Jobs in Progress",
"type": "stat",
"gridPos": {"x": 0, "y": 40, "w": 6, "h": 4},
"targets": [
{
"expr": "count(training_job_in_progress)",
"legendFormat": "Jobs running"
}
]
},
{
"id": 15,
"title": "Training Job Completion Time (p95, minutes)",
"type": "stat",
"gridPos": {"x": 6, "y": 40, "w": 6, "h": 4},
"targets": [
{
"expr": "histogram_quantile(0.95, training_job_duration_seconds) / 60",
"legendFormat": "p95 minutes"
}
]
},
{
"id": 16,
"title": "Failed Training Jobs",
"type": "stat",
"gridPos": {"x": 12, "y": 40, "w": 6, "h": 4},
"targets": [
{
"expr": "sum(training_job_completed_total{status=\"failed\"})",
"legendFormat": "Failed jobs"
}
]
},
{
"id": 17,
"title": "Total Training Jobs Completed",
"type": "stat",
"gridPos": {"x": 18, "y": 40, "w": 6, "h": 4},
"targets": [
{
"expr": "sum(training_job_completed_total)",
"legendFormat": "Total completed"
}
]
},
{
"id": 18,
"title": "API Health Status",
"type": "table",
"gridPos": {"x": 0, "y": 48, "w": 12, "h": 8},
"targets": [
{
"expr": "up{job=\"bakery-services\"}",
"format": "table",
"instant": true
}
],
"transformations": [
{
"id": "organize",
"options": {
"excludeByName": {},
"indexByName": {},
"renameByName": {
"service": "Service",
"Value": "Status",
"instance": "Instance"
}
}
}
]
},
{
"id": 19,
"title": "Service Success Rate (%)",
"type": "graph",
"gridPos": {"x": 12, "y": 48, "w": 12, "h": 8},
"targets": [
{
"expr": "100 * (1 - (sum by (service) (rate(http_requests_total{status_code=~\"5..\"}[5m])) / sum by (service) (rate(http_requests_total[5m]))))",
"legendFormat": "{{service}}"
}
]
},
{
"id": 20,
"title": "Requests Processed Today",
"type": "stat",
"gridPos": {"x": 0, "y": 56, "w": 12, "h": 4},
"targets": [
{
"expr": "sum(increase(http_requests_total[24h]))",
"legendFormat": "Requests (24h)"
}
]
},
{
"id": 21,
"title": "Distinct Users Today",
"type": "stat",
"gridPos": {"x": 12, "y": 56, "w": 12, "h": 4},
"targets": [
{
"expr": "count(count by (user_id) (increase(http_requests_total{user_id!=\"\"}[24h])))",
"legendFormat": "Users (24h)"
}
]
}
]
}
}


@@ -34,6 +34,15 @@ data:
allowUiUpdates: true
options:
path: /var/lib/grafana/dashboards
- name: 'extended'
orgId: 1
folder: 'Bakery IA - Extended'
type: file
disableDeletion: false
updateIntervalSeconds: 10
allowUiUpdates: true
options:
path: /var/lib/grafana/dashboards-extended
---
apiVersion: apps/v1
@@ -61,9 +70,15 @@ spec:
name: http
env:
- name: GF_SECURITY_ADMIN_USER
value: admin
valueFrom:
secretKeyRef:
name: grafana-admin
key: admin-user
- name: GF_SECURITY_ADMIN_PASSWORD
value: admin
valueFrom:
secretKeyRef:
name: grafana-admin
key: admin-password
- name: GF_SERVER_ROOT_URL
value: "http://monitoring.bakery-ia.local/grafana"
- name: GF_SERVER_SERVE_FROM_SUB_PATH
@@ -81,6 +96,8 @@ spec:
mountPath: /etc/grafana/provisioning/dashboards
- name: grafana-dashboards
mountPath: /var/lib/grafana/dashboards
- name: grafana-dashboards-extended
mountPath: /var/lib/grafana/dashboards-extended
resources:
requests:
memory: "256Mi"
@@ -113,6 +130,9 @@ spec:
- name: grafana-dashboards
configMap:
name: grafana-dashboards
- name: grafana-dashboards-extended
configMap:
name: grafana-dashboards-extended
---
apiVersion: v1


@@ -0,0 +1,100 @@
---
# PodDisruptionBudgets ensure minimum availability during voluntary disruptions
# (node drains, rolling updates, etc.)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: prometheus-pdb
namespace: monitoring
spec:
minAvailable: 1
selector:
matchLabels:
app: prometheus
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: alertmanager-pdb
namespace: monitoring
spec:
minAvailable: 2
selector:
matchLabels:
app: alertmanager
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: grafana-pdb
namespace: monitoring
spec:
minAvailable: 1
selector:
matchLabels:
app: grafana
---
# ResourceQuota limits total resources in monitoring namespace
apiVersion: v1
kind: ResourceQuota
metadata:
name: monitoring-quota
namespace: monitoring
spec:
hard:
# Compute resources
requests.cpu: "10"
requests.memory: "16Gi"
limits.cpu: "20"
limits.memory: "32Gi"
# Storage
persistentvolumeclaims: "10"
requests.storage: "100Gi"
# Object counts
pods: "50"
services: "20"
configmaps: "30"
secrets: "20"
---
# LimitRange sets default resource limits for pods in monitoring namespace
apiVersion: v1
kind: LimitRange
metadata:
name: monitoring-limits
namespace: monitoring
spec:
limits:
# Default container limits
- max:
cpu: "2"
memory: "4Gi"
min:
cpu: "10m"
memory: "16Mi"
default:
cpu: "500m"
memory: "512Mi"
defaultRequest:
cpu: "100m"
memory: "128Mi"
type: Container
# Pod limits
- max:
cpu: "4"
memory: "8Gi"
type: Pod
# PVC limits
- max:
storage: "50Gi"
min:
storage: "1Gi"
type: PersistentVolumeClaim

View File

@@ -23,7 +23,7 @@ spec:
pathType: ImplementationSpecific
backend:
service:
name: prometheus
name: prometheus-external
port:
number: 9090
- path: /jaeger(/|$)(.*)
@@ -33,3 +33,10 @@ spec:
name: jaeger-query
port:
number: 16686
- path: /alertmanager(/|$)(.*)
pathType: ImplementationSpecific
backend:
service:
name: alertmanager-external
port:
number: 9093


@@ -3,8 +3,16 @@ kind: Kustomization
resources:
- namespace.yaml
- secrets.yaml
- prometheus.yaml
- alert-rules.yaml
- alertmanager.yaml
- alertmanager-init.yaml
- grafana.yaml
- grafana-dashboards.yaml
- grafana-dashboards-extended.yaml
- postgres-exporter.yaml
- node-exporter.yaml
- jaeger.yaml
- ha-policies.yaml
- ingress.yaml


@@ -0,0 +1,103 @@
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: node-exporter
namespace: monitoring
labels:
app: node-exporter
spec:
selector:
matchLabels:
app: node-exporter
updateStrategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 1
template:
metadata:
labels:
app: node-exporter
spec:
hostNetwork: true
hostPID: true
nodeSelector:
kubernetes.io/os: linux
tolerations:
      # Run on all nodes, including control-plane nodes
- operator: Exists
effect: NoSchedule
containers:
- name: node-exporter
image: quay.io/prometheus/node-exporter:v1.7.0
args:
- '--path.sysfs=/host/sys'
- '--path.rootfs=/host/root'
- '--path.procfs=/host/proc'
- '--collector.filesystem.mount-points-exclude=^/(dev|proc|sys|var/lib/docker/.+|var/lib/kubelet/.+)($|/)'
- '--collector.filesystem.fs-types-exclude=^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$'
- '--collector.netclass.ignored-devices=^(veth.*|[a-f0-9]{15})$'
- '--collector.netdev.device-exclude=^(veth.*|[a-f0-9]{15})$'
- '--web.listen-address=:9100'
ports:
- containerPort: 9100
protocol: TCP
name: metrics
resources:
requests:
memory: "64Mi"
cpu: "50m"
limits:
memory: "128Mi"
cpu: "200m"
volumeMounts:
- name: sys
mountPath: /host/sys
mountPropagation: HostToContainer
readOnly: true
- name: root
mountPath: /host/root
mountPropagation: HostToContainer
readOnly: true
- name: proc
mountPath: /host/proc
mountPropagation: HostToContainer
readOnly: true
securityContext:
runAsNonRoot: true
runAsUser: 65534
capabilities:
drop:
- ALL
readOnlyRootFilesystem: true
volumes:
- name: sys
hostPath:
path: /sys
- name: root
hostPath:
path: /
- name: proc
hostPath:
path: /proc
---
apiVersion: v1
kind: Service
metadata:
name: node-exporter
namespace: monitoring
labels:
app: node-exporter
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "9100"
spec:
clusterIP: None
ports:
- name: metrics
port: 9100
protocol: TCP
targetPort: 9100
selector:
app: node-exporter
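# Hedged spot check that a node's metrics are exposed; the pod name is looked up
# at runtime rather than assumed:
#   POD=$(kubectl -n monitoring get pods -l app=node-exporter -o jsonpath='{.items[0].metadata.name}')
#   kubectl -n monitoring port-forward "$POD" 9100:9100 &
#   curl -s http://localhost:9100/metrics | grep '^node_cpu_seconds_total' | head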

View File

@@ -0,0 +1,306 @@
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: postgres-exporter
namespace: monitoring
labels:
app: postgres-exporter
spec:
replicas: 1
selector:
matchLabels:
app: postgres-exporter
template:
metadata:
labels:
app: postgres-exporter
spec:
containers:
- name: postgres-exporter
image: prometheuscommunity/postgres-exporter:v0.15.0
ports:
- containerPort: 9187
name: metrics
env:
- name: DATA_SOURCE_NAME
valueFrom:
secretKeyRef:
name: postgres-exporter
key: data-source-name
# Enable extended metrics
- name: PG_EXPORTER_EXTEND_QUERY_PATH
value: "/etc/postgres-exporter/queries.yaml"
        # Keep default metrics enabled; the custom queries below supplement them
- name: PG_EXPORTER_DISABLE_DEFAULT_METRICS
value: "false"
        # Keep pg_settings metrics enabled (set to "true" if they prove too noisy)
- name: PG_EXPORTER_DISABLE_SETTINGS_METRICS
value: "false"
volumeMounts:
- name: queries
mountPath: /etc/postgres-exporter
resources:
requests:
memory: "64Mi"
cpu: "50m"
limits:
memory: "128Mi"
cpu: "200m"
livenessProbe:
httpGet:
path: /
port: 9187
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /
port: 9187
initialDelaySeconds: 5
periodSeconds: 5
volumes:
- name: queries
configMap:
name: postgres-exporter-queries
---
apiVersion: v1
kind: ConfigMap
metadata:
name: postgres-exporter-queries
namespace: monitoring
data:
queries.yaml: |
# Custom PostgreSQL queries for bakery-ia metrics
pg_database:
query: |
SELECT
datname,
numbackends as connections,
xact_commit as transactions_committed,
xact_rollback as transactions_rolled_back,
blks_read as blocks_read,
blks_hit as blocks_hit,
tup_returned as tuples_returned,
tup_fetched as tuples_fetched,
tup_inserted as tuples_inserted,
tup_updated as tuples_updated,
tup_deleted as tuples_deleted,
conflicts as conflicts,
temp_files as temp_files,
temp_bytes as temp_bytes,
deadlocks as deadlocks
FROM pg_stat_database
WHERE datname NOT IN ('template0', 'template1', 'postgres')
metrics:
- datname:
usage: "LABEL"
description: "Name of the database"
- connections:
usage: "GAUGE"
description: "Number of backends currently connected to this database"
- transactions_committed:
usage: "COUNTER"
description: "Number of transactions in this database that have been committed"
- transactions_rolled_back:
usage: "COUNTER"
description: "Number of transactions in this database that have been rolled back"
- blocks_read:
usage: "COUNTER"
description: "Number of disk blocks read in this database"
- blocks_hit:
usage: "COUNTER"
description: "Number of times disk blocks were found in the buffer cache"
- tuples_returned:
usage: "COUNTER"
description: "Number of rows returned by queries in this database"
- tuples_fetched:
usage: "COUNTER"
description: "Number of rows fetched by queries in this database"
- tuples_inserted:
usage: "COUNTER"
description: "Number of rows inserted by queries in this database"
- tuples_updated:
usage: "COUNTER"
description: "Number of rows updated by queries in this database"
- tuples_deleted:
usage: "COUNTER"
description: "Number of rows deleted by queries in this database"
- conflicts:
usage: "COUNTER"
description: "Number of queries canceled due to conflicts with recovery"
- temp_files:
usage: "COUNTER"
description: "Number of temporary files created by queries"
- temp_bytes:
usage: "COUNTER"
description: "Total amount of data written to temporary files by queries"
- deadlocks:
usage: "COUNTER"
description: "Number of deadlocks detected in this database"
pg_replication:
query: |
SELECT
          CASE WHEN pg_is_in_recovery() THEN 1 ELSE 0 END as is_replica,
          CASE WHEN pg_is_in_recovery()
            THEN COALESCE(EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))::INT, 0)
            ELSE 0 END as lag_seconds
metrics:
- is_replica:
usage: "GAUGE"
description: "1 if this is a replica, 0 if primary"
- lag_seconds:
usage: "GAUGE"
description: "Replication lag in seconds (only on replicas)"
pg_slow_queries:
query: |
SELECT
datname,
usename,
state,
COUNT(*) as count,
MAX(EXTRACT(EPOCH FROM (now() - query_start))) as max_duration_seconds
FROM pg_stat_activity
WHERE state != 'idle'
AND query NOT LIKE '%pg_stat_activity%'
AND query_start < now() - interval '30 seconds'
GROUP BY datname, usename, state
metrics:
- datname:
usage: "LABEL"
description: "Database name"
- usename:
usage: "LABEL"
description: "User name"
- state:
usage: "LABEL"
description: "Query state"
- count:
usage: "GAUGE"
description: "Number of slow queries"
- max_duration_seconds:
usage: "GAUGE"
description: "Maximum query duration in seconds"
pg_table_stats:
query: |
SELECT
schemaname,
relname,
seq_scan,
seq_tup_read,
idx_scan,
idx_tup_fetch,
n_tup_ins,
n_tup_upd,
n_tup_del,
n_tup_hot_upd,
n_live_tup,
n_dead_tup,
          n_mod_since_analyze
FROM pg_stat_user_tables
WHERE schemaname = 'public'
ORDER BY n_live_tup DESC
LIMIT 20
metrics:
- schemaname:
usage: "LABEL"
description: "Schema name"
- relname:
usage: "LABEL"
description: "Table name"
- seq_scan:
usage: "COUNTER"
description: "Number of sequential scans"
- seq_tup_read:
usage: "COUNTER"
description: "Number of tuples read by sequential scans"
- idx_scan:
usage: "COUNTER"
description: "Number of index scans"
- idx_tup_fetch:
usage: "COUNTER"
description: "Number of tuples fetched by index scans"
- n_tup_ins:
usage: "COUNTER"
description: "Number of tuples inserted"
- n_tup_upd:
usage: "COUNTER"
description: "Number of tuples updated"
- n_tup_del:
usage: "COUNTER"
description: "Number of tuples deleted"
- n_tup_hot_upd:
usage: "COUNTER"
description: "Number of tuples HOT updated"
- n_live_tup:
usage: "GAUGE"
description: "Estimated number of live rows"
- n_dead_tup:
usage: "GAUGE"
description: "Estimated number of dead rows"
- n_mod_since_analyze:
usage: "GAUGE"
description: "Number of rows modified since last analyze"
pg_locks:
query: |
SELECT
mode,
locktype,
COUNT(*) as count
FROM pg_locks
GROUP BY mode, locktype
metrics:
- mode:
usage: "LABEL"
description: "Lock mode"
- locktype:
usage: "LABEL"
description: "Lock type"
- count:
usage: "GAUGE"
description: "Number of locks"
pg_connection_pool:
query: |
SELECT
state,
COUNT(*) as count,
MAX(EXTRACT(EPOCH FROM (now() - state_change))) as max_state_duration_seconds
FROM pg_stat_activity
GROUP BY state
metrics:
- state:
usage: "LABEL"
description: "Connection state"
- count:
usage: "GAUGE"
description: "Number of connections in this state"
- max_state_duration_seconds:
usage: "GAUGE"
description: "Maximum time a connection has been in this state"
---
apiVersion: v1
kind: Service
metadata:
name: postgres-exporter
namespace: monitoring
labels:
app: postgres-exporter
spec:
type: ClusterIP
ports:
- port: 9187
targetPort: 9187
protocol: TCP
name: metrics
selector:
app: postgres-exporter
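# Hedged check that the custom queries above load and export; metric names follow the
# exporter's <namespace>_<column> convention (e.g. pg_database_connections, pg_locks_count):
#   kubectl -n monitoring port-forward deploy/postgres-exporter 9187:9187 &
#   curl -s http://localhost:9187/metrics | grep -E '^pg_(database_connections|locks_count)'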

View File

@@ -56,6 +56,19 @@ data:
cluster: 'bakery-ia'
environment: 'production'
# AlertManager configuration
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager-0.alertmanager.monitoring.svc.cluster.local:9093
- alertmanager-1.alertmanager.monitoring.svc.cluster.local:9093
- alertmanager-2.alertmanager.monitoring.svc.cluster.local:9093
# Load alert rules
rule_files:
- '/etc/prometheus/rules/*.yml'
scrape_configs:
# Scrape Prometheus itself
- job_name: 'prometheus'
@@ -114,16 +127,42 @@ data:
target_label: __metrics_path__
replacement: /api/v1/nodes/${1}/proxy/metrics
# Scrape AlertManager
- job_name: 'alertmanager'
static_configs:
- targets:
- alertmanager-0.alertmanager.monitoring.svc.cluster.local:9093
- alertmanager-1.alertmanager.monitoring.svc.cluster.local:9093
- alertmanager-2.alertmanager.monitoring.svc.cluster.local:9093
# Scrape PostgreSQL exporter
- job_name: 'postgres-exporter'
static_configs:
- targets: ['postgres-exporter.monitoring.svc.cluster.local:9187']
# Scrape Node Exporter
- job_name: 'node-exporter'
kubernetes_sd_configs:
- role: node
relabel_configs:
- source_labels: [__address__]
regex: '(.*):10250'
replacement: '${1}:9100'
target_label: __address__
- source_labels: [__meta_kubernetes_node_name]
target_label: node
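      # Hedged post-rollout check that these jobs are actually being scraped; prometheus-0
      # is the first replica of the StatefulSet defined below:
      #   kubectl -n monitoring port-forward prometheus-0 9090:9090 &
      #   curl -s 'http://localhost:9090/api/v1/targets?state=active' | head -c 500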
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: prometheus
namespace: monitoring
labels:
app: prometheus
spec:
serviceName: prometheus
replicas: 2
selector:
matchLabels:
app: prometheus
@@ -133,6 +172,18 @@ spec:
app: prometheus
spec:
serviceAccountName: prometheus
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- prometheus
topologyKey: kubernetes.io/hostname
containers:
- name: prometheus
image: prom/prometheus:v3.0.1
@@ -149,6 +200,8 @@ spec:
volumeMounts:
- name: prometheus-config
mountPath: /etc/prometheus
- name: prometheus-rules
mountPath: /etc/prometheus/rules
- name: prometheus-storage
mountPath: /prometheus
resources:
@@ -174,22 +227,18 @@ spec:
- name: prometheus-config
configMap:
name: prometheus-config
- name: prometheus-rules
configMap:
name: prometheus-alert-rules
volumeClaimTemplates:
- metadata:
name: prometheus-storage
spec:
accessModes: [ "ReadWriteOnce" ]
resources:
requests:
storage: 20Gi
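  # Each replica gets its own PVC from this template, named
  # prometheus-storage-prometheus-0, prometheus-storage-prometheus-1, and so on.
  # A hedged check after rollout:
  #   kubectl get pvc -n monitoring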
---
apiVersion: v1
@@ -199,6 +248,25 @@ metadata:
namespace: monitoring
labels:
app: prometheus
spec:
type: ClusterIP
clusterIP: None
ports:
- port: 9090
targetPort: 9090
protocol: TCP
name: web
selector:
app: prometheus
---
apiVersion: v1
kind: Service
metadata:
name: prometheus-external
namespace: monitoring
labels:
app: prometheus
spec:
type: ClusterIP
ports:

View File

@@ -0,0 +1,52 @@
---
# NOTE: This file contains example secrets for development.
# For production, use one of the following:
# 1. Sealed Secrets (bitnami-labs/sealed-secrets)
# 2. External Secrets Operator
# 3. HashiCorp Vault
# 4. Cloud provider secret managers (AWS Secrets Manager, GCP Secret Manager, Azure Key Vault)
#
# NEVER commit real production secrets to git!
apiVersion: v1
kind: Secret
metadata:
name: grafana-admin
namespace: monitoring
type: Opaque
stringData:
admin-user: admin
# CHANGE THIS PASSWORD IN PRODUCTION!
# Generate with: openssl rand -base64 32
admin-password: "CHANGE_ME_IN_PRODUCTION"
---
apiVersion: v1
kind: Secret
metadata:
name: alertmanager-secrets
namespace: monitoring
type: Opaque
stringData:
# SMTP configuration for email alerts
# CHANGE THESE VALUES IN PRODUCTION!
smtp-host: "smtp.gmail.com:587"
smtp-username: "alerts@yourdomain.com"
smtp-password: "CHANGE_ME_IN_PRODUCTION"
smtp-from: "alerts@yourdomain.com"
# Slack webhook URL (optional)
slack-webhook-url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
---
apiVersion: v1
kind: Secret
metadata:
name: postgres-exporter
namespace: monitoring
type: Opaque
stringData:
# PostgreSQL connection string
  # Format: postgresql://username:password@hostname:port/database?sslmode=require
  # CHANGE THIS IN PRODUCTION (including sslmode; the sslmode=disable value below is for local development only)!
data-source-name: "postgresql://postgres:postgres@postgres.bakery-ia:5432/bakery?sslmode=disable"
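# Hedged check that all three secrets exist without echoing their values:
#   kubectl -n monitoring get secret grafana-admin alertmanager-secrets postgres-exporter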