# Bakery IA - Production Monitoring Stack
This directory contains the complete production-ready monitoring infrastructure for the Bakery IA platform.
## 📊 Components
### Core Monitoring
- **Prometheus v3.0.1** - Time-series metrics database (HA, 2 replicas)
- **Grafana v12.3.0** - Visualization and dashboarding
- **AlertManager v0.27.0** - Alert routing and notification (HA, 3 replicas)
### Distributed Tracing
- **Jaeger v1.51** - Distributed tracing with persistent storage
### Exporters
- **PostgreSQL Exporter v0.15.0** - Database metrics and health
- **Node Exporter v1.7.0** - Infrastructure and OS-level metrics (DaemonSet)
## 🚀 Deployment
### Prerequisites
1. Kubernetes cluster (v1.24+)
2. kubectl configured
3. kustomize (v4.0+) or kubectl with kustomize support
4. Storage class available for PersistentVolumeClaims
### Production Deployment
```bash
# 1. Create the Grafana admin secret with a generated password
kubectl create secret generic grafana-admin \
  --from-literal=admin-user=admin \
  --from-literal=admin-password="$(openssl rand -base64 32)" \
  --namespace monitoring --dry-run=client -o yaml > secrets.yaml

# 2. Append the AlertManager SMTP credentials
#    (the "---" keeps secrets.yaml a valid multi-document YAML file)
echo "---" >> secrets.yaml
kubectl create secret generic alertmanager-secrets \
  --from-literal=smtp-host="smtp.gmail.com:587" \
  --from-literal=smtp-username="alerts@yourdomain.com" \
  --from-literal=smtp-password="YOUR_SMTP_PASSWORD" \
  --from-literal=smtp-from="alerts@yourdomain.com" \
  --from-literal=slack-webhook-url="https://hooks.slack.com/services/YOUR/WEBHOOK/URL" \
  --namespace monitoring --dry-run=client -o yaml >> secrets.yaml

# 3. Append the PostgreSQL exporter connection string
echo "---" >> secrets.yaml
kubectl create secret generic postgres-exporter \
  --from-literal=data-source-name="postgresql://user:password@postgres.bakery-ia:5432/bakery?sslmode=require" \
  --namespace monitoring --dry-run=client -o yaml >> secrets.yaml

# 4. Deploy the monitoring stack
kubectl apply -k infrastructure/kubernetes/overlays/prod

# 5. Verify the deployment
kubectl get pods -n monitoring
kubectl get pvc -n monitoring
```
### Local Development Deployment
For local Kind clusters, monitoring is disabled by default to save resources. To enable:
```bash
# Uncomment monitoring in overlays/dev/kustomization.yaml
# Then apply:
kubectl apply -k infrastructure/kubernetes/overlays/dev
```
## 🔐 Security Configuration
### Important Security Notes
⚠️ **NEVER commit real secrets to Git!**
The `secrets.yaml` file contains placeholder values. In production, use one of:
1. **Sealed Secrets** (Recommended)
```bash
kubectl apply -f https://github.com/bitnami-labs/sealed-secrets/releases/download/v0.24.0/controller.yaml
kubeseal --format=yaml < secrets.yaml > sealed-secrets.yaml
```
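The resulting `sealed-secrets.yaml` is safe to commit: only the in-cluster controller holds the private key needed to decrypt it. Apply it with `kubectl apply -f sealed-secrets.yaml`.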
2. **External Secrets Operator**
```bash
helm repo add external-secrets https://charts.external-secrets.io
helm install external-secrets external-secrets/external-secrets -n external-secrets --create-namespace
```
3. **Cloud Provider Secrets**
- AWS Secrets Manager
- GCP Secret Manager
- Azure Key Vault
### Grafana Admin Password
Change the default password immediately:
```bash
# Generate strong password
NEW_PASSWORD=$(openssl rand -base64 32)
# Update secret
kubectl patch secret grafana-admin -n monitoring \
  -p="{\"data\":{\"admin-password\":\"$(echo -n "$NEW_PASSWORD" | base64)\"}}"
# Restart Grafana
kubectl rollout restart deployment grafana -n monitoring
```
## 📈 Accessing Monitoring Services
### Via Ingress (Production)
```
https://monitoring.yourdomain.com/grafana
https://monitoring.yourdomain.com/prometheus
https://monitoring.yourdomain.com/alertmanager
https://monitoring.yourdomain.com/jaeger
```
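Once DNS and TLS are in place, each component exposes a lightweight health endpoint; a quick smoke test, assuming the ingress rewrites the path prefixes above to the service roots:

```bash
# Health endpoints for Grafana, Prometheus, and AlertManager
curl -fsS https://monitoring.yourdomain.com/grafana/api/health
curl -fsS https://monitoring.yourdomain.com/prometheus/-/healthy
curl -fsS https://monitoring.yourdomain.com/alertmanager/-/healthy
```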
### Via Port Forwarding (Development)
```bash
# Grafana
kubectl port-forward -n monitoring svc/grafana 3000:3000
# Prometheus
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090
# AlertManager
kubectl port-forward -n monitoring svc/alertmanager-external 9093:9093
# Jaeger
kubectl port-forward -n monitoring svc/jaeger-query 16686:16686
```
Then access:
- Grafana: http://localhost:3000
- Prometheus: http://localhost:9090
- AlertManager: http://localhost:9093
- Jaeger: http://localhost:16686
## 📊 Grafana Dashboards
### Pre-configured Dashboards
1. **Gateway Metrics** - API gateway performance
- Request rate by endpoint
- P95 latency
- Error rates
- Authentication metrics
2. **Services Overview** - Microservices health
- Request rate by service
- P99 latency
- Error rates by service
- Service health status
3. **Circuit Breakers** - Resilience patterns
- Circuit breaker states
- Trip rates
- Rejected requests
4. **PostgreSQL Monitoring** - Database health
- Connections, transactions, cache hit ratio
- Slow queries, locks, replication lag
5. **Node Metrics** - Infrastructure monitoring
- CPU, memory, disk, network per node
6. **AlertManager** - Alert management
- Active alerts, firing rate, notifications
7. **Business Metrics** - KPIs
- Service performance, tenant activity, ML metrics
### Creating Custom Dashboards
1. Log in to Grafana (admin/[your-password])
2. Click "+ → Dashboard"
3. Add panels with Prometheus queries
4. Save dashboard
5. Export JSON and add to `grafana-dashboards.yaml`
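For step 5, a minimal sketch of how an exported dashboard might sit inside `grafana-dashboards.yaml`, assuming that file is a ConfigMap consumed by Grafana's dashboard provisioning (the names here are illustrative):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboards
  namespace: monitoring
data:
  # Key = file name the provisioner sees; value = the JSON exported in step 5
  my-dashboard.json: |
    { "title": "My Dashboard", "panels": [], "schemaVersion": 39 }
```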
## 🚨 Alert Configuration
### Alert Rules
Alert rules are defined in `alert-rules.yaml` and organized by category:
- **bakery_services** - Service health, errors, latency, memory
- **bakery_business** - Training jobs, ML accuracy, API limits
- **alert_system_health** - Alert system components, RabbitMQ, Redis
- **alert_system_performance** - Processing errors, delivery failures
- **alert_system_business** - Alert volume, response times
- **alert_system_capacity** - Queue sizes, storage performance
- **alert_system_critical** - System failures, data loss
- **monitoring_health** - Prometheus, AlertManager self-monitoring
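As an illustration of the rule format (not a rule copied from `alert-rules.yaml`), a minimal entry for the `bakery_services` group, built on the `http_requests_total` metric shown in the Metrics Reference below:

```yaml
groups:
  - name: bakery_services
    rules:
      - alert: HighErrorRate
        # Fire when more than 5% of requests return 5xx for 10 minutes
        expr: |
          sum(rate(http_requests_total{status_code=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "5xx error ratio above 5% for 10 minutes"
```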
### Alert Routing
Alerts are routed based on:
- **Severity** (critical, warning, info)
- **Component** (alert-system, database, infrastructure)
- **Service** name
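A simplified route tree illustrating this precedence (receiver names here are placeholders; the real tree lives in `alertmanager.yaml`):

```yaml
route:
  receiver: default-email        # fallback for anything unmatched
  group_by: [alertname, service]
  routes:
    - match:
        severity: critical
      receiver: oncall-email     # critical alerts page on-call
    - match:
        component: database
      receiver: dba-email        # database alerts go to the DBA channel
```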
### Notification Channels
Configure in `alertmanager.yaml`:
1. **Email** (default)
- critical-alerts@yourdomain.com
- oncall@yourdomain.com
2. **Slack** (optional, commented out)
- Update slack-webhook-url in secrets
- Uncomment slack_configs in alertmanager.yaml
3. **PagerDuty** (add if needed)
```yaml
pagerduty_configs:
  - routing_key: YOUR_ROUTING_KEY
    severity: '{{ .CommonLabels.severity }}'
```
### Testing Alerts
```bash
# Send a synthetic test alert straight to the AlertManager v2 API
kubectl port-forward -n monitoring svc/alertmanager-external 9093:9093 &
sleep 2  # give the port-forward a moment to establish
curl -XPOST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels":{"alertname":"TestAlert","severity":"warning"},"annotations":{"summary":"Manual test alert"}}]'

# Check the alert in AlertManager
# Navigate to http://localhost:9093
# Rule-generated alerts also appear in Prometheus at http://localhost:9090/alerts
```
## 🔍 Troubleshooting
### Prometheus Issues
```bash
# Check Prometheus logs
kubectl logs -n monitoring prometheus-0 -f
# Check Prometheus targets
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090
# Visit http://localhost:9090/targets
# Check Prometheus configuration
kubectl get configmap prometheus-config -n monitoring -o yaml
```
### AlertManager Issues
```bash
# Check AlertManager logs
kubectl logs -n monitoring alertmanager-0 -f
# Check AlertManager configuration
kubectl exec -n monitoring alertmanager-0 -- cat /etc/alertmanager/alertmanager.yml
# Check that the SMTP host is reachable from inside the cluster
# (one-off busybox pod; plain wget cannot speak smtp://)
kubectl run smtp-test --rm -it --image=busybox --restart=Never -- \
  sh -c 'nc -w 5 smtp.gmail.com 587 </dev/null && echo reachable'
```
### Grafana Issues
```bash
# Check Grafana logs
kubectl logs -n monitoring deployment/grafana -f
# Reset Grafana admin password
kubectl exec -n monitoring deployment/grafana -- \
grafana-cli admin reset-admin-password NEW_PASSWORD
```
### PostgreSQL Exporter Issues
```bash
# Check exporter logs
kubectl logs -n monitoring deployment/postgres-exporter -f
# Verify database connectivity via the exporter's own metric (pg_up 1 = connected)
kubectl exec -n monitoring deployment/postgres-exporter -- \
  wget -qO- http://localhost:9187/metrics | grep pg_up
```
### Node Exporter Issues
```bash
# Tail the node-exporter pod on a specific node
# (kubectl logs cannot combine a daemonset/ name with --selector;
#  the app label below is assumed from the manifests)
POD=$(kubectl get pods -n monitoring -l app=node-exporter \
  --field-selector spec.nodeName=NODE_NAME -o name)
kubectl logs -n monitoring "$POD" -f
# Check metrics endpoint
kubectl exec -n monitoring daemonset/node-exporter -- \
wget -O- http://localhost:9100/metrics | head -n 20
```
## 📏 Resource Requirements
### Minimum Requirements (Development)
- CPU: 2 cores
- Memory: 4Gi
- Storage: 30Gi
### Recommended Requirements (Production)
- CPU: 6-8 cores
- Memory: 16Gi
- Storage: 100Gi
### Component Resource Allocation
| Component | Replicas | CPU Request | Memory Request | CPU Limit | Memory Limit |
|-----------|----------|-------------|----------------|-----------|--------------|
| Prometheus | 2 | 500m | 1Gi | 1 | 2Gi |
| AlertManager | 3 | 100m | 128Mi | 500m | 256Mi |
| Grafana | 1 | 100m | 256Mi | 500m | 512Mi |
| Postgres Exporter | 1 | 50m | 64Mi | 200m | 128Mi |
| Node Exporter | 1/node | 50m | 64Mi | 200m | 128Mi |
| Jaeger | 1 | 250m | 512Mi | 500m | 1Gi |
## 🔄 High Availability
### Prometheus HA
- 2 replicas in StatefulSet
- Each has independent storage (volumeClaimTemplates)
- Anti-affinity to spread across nodes
- Both scrape the same targets independently
- Use Thanos for long-term storage and global query view (future enhancement)
### AlertManager HA
- 3 replicas in StatefulSet
- Clustered mode (gossip protocol)
- Automatic leader election
- Alert deduplication across instances
- Anti-affinity to spread across nodes
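To verify that anti-affinity actually spread the replicas, list the pods with their nodes (the `app` labels are assumed from the manifests):

```bash
kubectl get pods -n monitoring -l app=prometheus -o wide
kubectl get pods -n monitoring -l app=alertmanager -o wide
```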
### PodDisruptionBudgets
Ensure minimum availability during:
- Node maintenance
- Cluster upgrades
- Rolling updates
```
Prometheus:    minAvailable: 1 (of 2 replicas)
AlertManager:  minAvailable: 2 (of 3 replicas)
Grafana:       minAvailable: 1 (of 1 replica)
```
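For reference, a sketch of the Prometheus budget as a manifest (the label selector is an assumption; match it to the StatefulSet's pod labels):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: prometheus
  namespace: monitoring
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: prometheus   # assumed pod label
```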
## 📊 Metrics Reference
### Application Metrics (from services)
```promql
# HTTP request rate
rate(http_requests_total[5m])
# HTTP error rate
rate(http_requests_total{status_code=~"5.."}[5m]) / rate(http_requests_total[5m])
# Request latency (P95)
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Active connections
active_connections
```
### PostgreSQL Metrics
```promql
# Active connections
pg_stat_database_numbackends
# Transaction rate
rate(pg_stat_database_xact_commit[5m])
# Cache hit ratio
rate(pg_stat_database_blks_hit[5m]) /
(rate(pg_stat_database_blks_hit[5m]) + rate(pg_stat_database_blks_read[5m]))
# Replication lag
pg_replication_lag_seconds
```
### Node Metrics
```promql
# CPU usage
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
# Disk I/O
rate(node_disk_read_bytes_total[5m])
rate(node_disk_written_bytes_total[5m])
# Network traffic
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])
```
## 🔗 Distributed Tracing
### Jaeger Configuration
Services automatically send traces when `JAEGER_ENABLED=true`:
```yaml
# In prod-configmap.yaml
JAEGER_ENABLED: "true"
JAEGER_AGENT_HOST: "jaeger-agent.monitoring.svc.cluster.local"
JAEGER_AGENT_PORT: "6831"
```
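To confirm that application pods can resolve the agent address, run a one-off DNS lookup from inside the cluster:

```bash
kubectl run dns-test --rm -it --image=busybox --restart=Never -- \
  nslookup jaeger-agent.monitoring.svc.cluster.local
```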
### Viewing Traces
1. Access Jaeger UI: https://monitoring.yourdomain.com/jaeger
2. Select service from dropdown
3. Click "Find Traces"
4. Explore trace details, spans, and timing
### Trace Sampling
Current sampling: 100% (all traces are collected).
For high-traffic production, reduce the sampling rate:
```yaml
# Set via environment; consumed by shared/monitoring/tracing.py
JAEGER_SAMPLE_RATE: "0.1" # 10% of traces
```
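If the services use the standard Jaeger client libraries, the equivalent upstream environment variables are `JAEGER_SAMPLER_TYPE` and `JAEGER_SAMPLER_PARAM` (`JAEGER_SAMPLE_RATE` above is this project's own variable):

```yaml
JAEGER_SAMPLER_TYPE: "probabilistic"
JAEGER_SAMPLER_PARAM: "0.1"   # 10% of traces
```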
## 📚 Additional Resources
- [Prometheus Documentation](https://prometheus.io/docs/)
- [Grafana Documentation](https://grafana.com/docs/)
- [AlertManager Documentation](https://prometheus.io/docs/alerting/latest/alertmanager/)
- [Jaeger Documentation](https://www.jaegertracing.io/docs/)
- [PostgreSQL Exporter](https://github.com/prometheus-community/postgres_exporter)
- [Node Exporter](https://github.com/prometheus/node_exporter)
## 🆘 Support
For monitoring issues:
1. Check component logs (see Troubleshooting section)
2. Verify Prometheus targets are UP
3. Check AlertManager configuration and routing
4. Review resource usage and quotas
5. Contact platform team: platform-team@yourdomain.com
## 🔄 Maintenance
### Regular Tasks
**Daily:**
- Review critical alerts
- Check service health dashboards
**Weekly:**
- Review alert noise and adjust thresholds
- Check storage usage for Prometheus and Jaeger
- Review slow queries in PostgreSQL dashboard
**Monthly:**
- Update dashboard with new metrics
- Review and update alert runbooks
- Capacity planning based on trends
### Backup and Recovery
**Prometheus Data:**
```bash
# Back up the Prometheus data directory
# (a live TSDB copy may be inconsistent; see the snapshot sketch below for a safer option)
kubectl exec -n monitoring prometheus-0 -- tar czf /tmp/prometheus-backup.tar.gz /prometheus
kubectl cp monitoring/prometheus-0:/tmp/prometheus-backup.tar.gz ./prometheus-backup.tar.gz

# Restore (Prometheus must not be writing during the restore)
kubectl cp ./prometheus-backup.tar.gz monitoring/prometheus-0:/tmp/
kubectl exec -n monitoring prometheus-0 -- tar xzf /tmp/prometheus-backup.tar.gz -C /
```
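For a consistent copy, Prometheus can also write a point-in-time snapshot through its admin API; a sketch, assuming the server was started with `--web.enable-admin-api` and the image's wget supports `--post-data`:

```bash
# The snapshot lands under /prometheus/snapshots/<name> and is safe to tar while live
kubectl exec -n monitoring prometheus-0 -- \
  wget -qO- --post-data='' http://localhost:9090/api/v1/admin/tsdb/snapshot
```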
**Grafana Dashboards:**
```bash
# Export every dashboard via the HTTP API (one JSON file per dashboard)
curl -s -u admin:password "http://localhost:3000/api/search?type=dash-db" | \
  jq -r '.[].uid' | \
  while read -r uid; do
    curl -s -u admin:password "http://localhost:3000/api/dashboards/uid/$uid" \
      > "dashboard-$uid.json"
  done
```
## 📝 Version History
- **v1.0.0** (2026-01-07) - Initial production-ready monitoring stack
- Prometheus v3.0.1 with HA
- AlertManager v0.27.0 with clustering
- Grafana v12.3.0 with 7 dashboards
- PostgreSQL and Node exporters
- 50+ alert rules
- Comprehensive documentation