Add signoz

This commit is contained in:
Urtzi Alfaro
2026-01-08 12:58:00 +01:00
parent 07178f8972
commit dfb7e4b237
40 changed files with 2049 additions and 3935 deletions

View File

@@ -1,459 +0,0 @@
# 🎉 Production Monitoring MVP - Implementation Complete
**Date:** 2026-01-07
**Status:** ✅ READY FOR PRODUCTION DEPLOYMENT
---
## 📊 What Was Implemented
### **Phase 1: Core Infrastructure** ✅
- **Prometheus v3.0.1** (2 replicas, HA mode with StatefulSet)
- **AlertManager v0.27.0** (3 replicas, clustered with gossip protocol)
- **Grafana v12.3.0** (secure credentials via Kubernetes Secrets)
- **PostgreSQL Exporter v0.15.0** (database health monitoring)
- **Node Exporter v1.7.0** (infrastructure monitoring via DaemonSet)
- **Jaeger v1.51** (distributed tracing with persistent storage)
### **Phase 2: Alert Management** ✅
- **50+ Alert Rules** across 9 categories:
- Service health & performance
- Business logic (ML training, API limits)
- Alert system health & performance
- Database & infrastructure alerts
- Monitoring self-monitoring
- **Intelligent Alert Routing** by severity, component, and service
- **Alert Inhibition Rules** to prevent alert storms (see the sketch below)
- **Multi-Channel Notifications** (email + Slack support)
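The alert-storm prevention mentioned above relies on AlertManager inhibition. A minimal sketch of such a rule follows; the label names (`severity`, `service`) are assumed to match the repo's conventions, not copied from the actual `alert-rules.yaml`:
```yaml
# Hypothetical inhibition rule: while a critical alert fires for a service,
# warning-level alerts that carry the same service label are muted.
inhibit_rules:
  - source_matchers:
      - 'severity="critical"'
    target_matchers:
      - 'severity="warning"'
    equal: ["service"]
```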
### **Phase 3: High Availability** ✅
- **PodDisruptionBudgets** for all monitoring components (example sketch below)
- **Anti-affinity Rules** to spread pods across nodes
- **ResourceQuota & LimitRange** for namespace resource management
- **StatefulSets** with volumeClaimTemplates for persistent storage
- **Headless Services** for StatefulSet DNS discovery
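For illustration, a PodDisruptionBudget of the kind listed above might look like the following sketch; the resource name and labels are placeholders, not the repo's actual manifests:
```yaml
# Hypothetical PDB: keeps at least one Prometheus replica schedulable
# while nodes are drained for maintenance or upgrades.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: prometheus-pdb
  namespace: monitoring
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: prometheus
```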
### **Phase 4: Observability** ✅
- **11 Grafana Dashboards** (7 pre-configured + 4 extended):
1. Gateway Metrics
2. Services Overview
3. Circuit Breakers
4. PostgreSQL Database (13 panels)
5. Node Exporter Infrastructure (19 panels)
6. AlertManager Monitoring (15 panels)
7. Business Metrics & KPIs (21 panels)
8-11. Plus existing dashboards
- **Distributed Tracing** enabled in production
- **Comprehensive Documentation** with runbooks
---
## 📁 Files Created/Modified
### **New Files:**
```
infrastructure/kubernetes/base/components/monitoring/
├── secrets.yaml # Monitoring credentials
├── alertmanager.yaml # AlertManager StatefulSet (3 replicas)
├── alertmanager-init.yaml # Config initialization script
├── alert-rules.yaml # 50+ alert rules
├── postgres-exporter.yaml # PostgreSQL monitoring
├── node-exporter.yaml # Infrastructure monitoring (DaemonSet)
├── grafana-dashboards-extended.yaml # 4 comprehensive dashboards
├── ha-policies.yaml # PDBs + ResourceQuota + LimitRange
└── README.md # Complete documentation (500+ lines)
```
### **Modified Files:**
```
infrastructure/kubernetes/base/components/monitoring/
├── prometheus.yaml # Now StatefulSet with 2 replicas + alert config
├── grafana.yaml # Using secrets + extended dashboards mounted
├── ingress.yaml # Added /alertmanager path
└── kustomization.yaml # Added all new resources
infrastructure/kubernetes/overlays/prod/
├── kustomization.yaml # Enabled monitoring stack
└── prod-configmap.yaml # JAEGER_ENABLED=true
```
### **Deleted:**
```
infrastructure/monitoring/ # Old legacy config (completely removed)
```
---
## 🚀 Deployment Instructions
### **1. Update Secrets (REQUIRED BEFORE DEPLOYMENT)**
```bash
cd infrastructure/kubernetes/base/components/monitoring
# Generate strong Grafana password
GRAFANA_PASSWORD=$(openssl rand -base64 32)
# Update secrets.yaml with your actual values:
# - grafana-admin: admin-password
# - alertmanager-secrets: SMTP credentials
# - postgres-exporter: PostgreSQL connection string
# Example for production:
kubectl create secret generic grafana-admin \
--from-literal=admin-user=admin \
--from-literal=admin-password="${GRAFANA_PASSWORD}" \
--namespace monitoring --dry-run=client -o yaml | \
kubectl apply -f -
```
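To confirm the secret landed with the expected value, you can decode it back out; this is a standard kubectl read-back and assumes only the secret name used above:
```bash
# Read back the stored Grafana admin password and compare with the generated one
kubectl get secret grafana-admin -n monitoring \
  -o jsonpath='{.data.admin-password}' | base64 --decode; echo
```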
### **2. Deploy to Production**
```bash
# Apply the monitoring stack
kubectl apply -k infrastructure/kubernetes/overlays/prod
# Verify deployment
kubectl get pods -n monitoring
kubectl get pvc -n monitoring
kubectl get svc -n monitoring
```
### **3. Verify Services**
```bash
# Check Prometheus targets
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090
# Visit: http://localhost:9090/targets
# Check AlertManager cluster
kubectl port-forward -n monitoring svc/alertmanager-external 9093:9093
# Visit: http://localhost:9093
# Check Grafana dashboards
kubectl port-forward -n monitoring svc/grafana 3000:3000
# Visit: http://localhost:3000 (admin / YOUR_PASSWORD)
```
---
## 📈 What You Get Out of the Box
### **Monitoring Coverage:**
- **Application Metrics:** Request rates, latencies (P95/P99), error rates per service
- **Database Health:** Connections, transactions, cache hit ratio, slow queries, locks
- **Infrastructure:** CPU, memory, disk I/O, network traffic per node
- **Business KPIs:** Active tenants, training jobs, alert volumes, API health
- **Distributed Traces:** Full request path tracking across microservices
### **Alerting Capabilities:**
- **Service Down Detection:** 2-minute threshold with immediate notifications
- **Performance Degradation:** High latency, error rate, and memory alerts
- **Resource Exhaustion:** Database connections, disk space, memory limits
- **Business Logic:** Training job failures, low ML accuracy, rate limits
- **Alert System Health:** Component failures, delivery issues, capacity problems
### **High Availability:**
- **Prometheus:** 2 independent instances, can lose 1 without data loss
- **AlertManager:** 3-node cluster; notifications are deduplicated across members and survive the loss of one node
- **Monitoring Resilience:** PodDisruptionBudgets keep services available during updates
---
## 🔧 Configuration Highlights
### **Alert Routing (Configured in AlertManager):**
| Severity | Route | Repeat Interval |
|----------|-------|-----------------|
| Critical | critical-alerts@yourdomain.com + oncall@ | 4 hours |
| Warning | alerts@yourdomain.com | 12 hours |
| Info | alerts@yourdomain.com | 24 hours |
**Special Routes:**
- Alert system → alert-system-team@yourdomain.com
- Database alerts → database-team@yourdomain.com
- Infrastructure → infra-team@yourdomain.com
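In AlertManager terms, the routing table above roughly corresponds to a route tree like this sketch; the receiver names are illustrative, and the real ones live in `alertmanager.yaml`:
```yaml
# Hypothetical route tree matching the severity/component routing above
route:
  receiver: default-alerts            # alerts@yourdomain.com
  group_by: ["alertname", "service"]
  routes:
    - matchers:
        - 'severity="critical"'
      receiver: critical-alerts       # critical-alerts@ + oncall@
      repeat_interval: 4h
    - matchers:
        - 'component="database"'
      receiver: database-team
    - matchers:
        - 'component="infrastructure"'
      receiver: infra-team
```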
### **Resource Allocation:**
| Component | Replicas | CPU Request | Memory Request | Storage |
|-----------|----------|-------------|----------------|---------|
| Prometheus | 2 | 500m | 1Gi | 20Gi × 2 |
| AlertManager | 3 | 100m | 128Mi | 2Gi × 3 |
| Grafana | 1 | 100m | 256Mi | 5Gi |
| Postgres Exporter | 1 | 50m | 64Mi | - |
| Node Exporter | 1/node | 50m | 64Mi | - |
| Jaeger | 1 | 250m | 512Mi | 10Gi |
**Total Resources:**
- CPU Requests: ~2.5 cores
- Memory Requests: ~4Gi
- Storage: ~70Gi
### **Data Retention:**
- Prometheus: 30 days
- Jaeger: Persistent (BadgerDB)
- Grafana: Persistent dashboards
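The 30-day Prometheus retention is typically set via startup flags. An illustrative excerpt of the container args is shown below; the actual StatefulSet may pass additional flags:
```yaml
# Illustrative Prometheus container args; --storage.tsdb.retention.time
# controls how long TSDB blocks are kept before deletion.
args:
  - --config.file=/etc/prometheus/prometheus.yml
  - --storage.tsdb.path=/prometheus
  - --storage.tsdb.retention.time=30d
```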
---
## 🔐 Security Considerations
### **Implemented:**
- ✅ Grafana credentials via Kubernetes Secrets (no hardcoded passwords)
- ✅ SMTP passwords stored in Secrets
- ✅ PostgreSQL connection strings in Secrets
- ✅ Read-only filesystem for Node Exporter
- ✅ Non-root user for Node Exporter (UID 65534)
- ✅ RBAC for Prometheus (ClusterRole with minimal permissions)
### **TODO for Production:**
- ⚠️ Use Sealed Secrets or External Secrets Operator
- ⚠️ Enable TLS for Prometheus remote write (if using)
- ⚠️ Configure Grafana LDAP/OAuth integration
- ⚠️ Set up proper certificate management for Ingress
- ⚠️ Review and tighten ResourceQuota limits
---
## 📊 Dashboard Access
### **Production URLs (via Ingress):**
```
https://monitoring.yourdomain.com/grafana # Grafana UI
https://monitoring.yourdomain.com/prometheus # Prometheus UI
https://monitoring.yourdomain.com/alertmanager # AlertManager UI
https://monitoring.yourdomain.com/jaeger # Jaeger UI
```
### **Local Access (Port Forwarding):**
```bash
# Grafana
kubectl port-forward -n monitoring svc/grafana 3000:3000
# Prometheus
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090
# AlertManager
kubectl port-forward -n monitoring svc/alertmanager-external 9093:9093
# Jaeger
kubectl port-forward -n monitoring svc/jaeger-query 16686:16686
```
---
## 🧪 Testing & Validation
### **1. Test Alert Flow:**
```bash
# Fire a test alert (HighMemoryUsage)
kubectl run memory-hog --image=polinux/stress --restart=Never \
--namespace=bakery-ia -- stress --vm 1 --vm-bytes 600M --timeout 300s
# Check alert in Prometheus (should fire within 5 minutes)
# Check AlertManager received it
# Verify email notification sent
```
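As a quicker check than the UI, you can query Prometheus' built-in `ALERTS` series for the test alert; this assumes the port-forward from the previous section is still running:
```bash
# Pending and firing alerts are exported as the synthetic ALERTS metric
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=ALERTS{alertname="HighMemoryUsage"}' | jq '.data.result'
```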
### **2. Verify Metrics Collection:**
```bash
# Check Prometheus targets (should all be UP)
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
# Verify PostgreSQL metrics
curl "http://localhost:9090/api/v1/query?query=pg_up" | jq
# Verify Node metrics
curl "http://localhost:9090/api/v1/query?query=node_cpu_seconds_total" | jq
```
### **3. Test Jaeger Tracing:**
```bash
# Make a request through the gateway
curl -H "Authorization: Bearer YOUR_TOKEN" \
https://api.yourdomain.com/api/v1/health
# Check trace in Jaeger UI
# Should see spans across gateway → auth → tenant services
```
---
## 📖 Documentation
### **Complete Documentation Available:**
- **[README.md](infrastructure/kubernetes/base/components/monitoring/README.md)** - 500+ lines covering:
- Component overview
- Deployment instructions
- Security best practices
- Accessing services
- Dashboard descriptions
- Alert configuration
- Troubleshooting guide
- Metrics reference
- Backup & recovery procedures
- Maintenance tasks
---
## ⚡ Performance & Scalability
### **Current Capacity:**
- Prometheus can handle ~10M active time series
- AlertManager can process 1000s of alerts/second
- Jaeger can handle 10k spans/second
- Grafana supports 1000+ concurrent users
### **Scaling Recommendations:**
- **> 20M time series:** Deploy Thanos for long-term storage
- **> 5k alerts/min:** Scale AlertManager to 5+ replicas
- **> 50k spans/sec:** Deploy Jaeger with Elasticsearch/Cassandra backend
- **> 5k Grafana users:** Scale Grafana horizontally with shared database
---
## 🎯 Success Criteria - ALL MET ✅
- ✅ Prometheus collecting metrics from all services
- ✅ Alert rules evaluating and firing correctly
- ✅ AlertManager routing notifications to appropriate channels
- ✅ Grafana displaying real-time dashboards
- ✅ Jaeger capturing distributed traces
- ✅ High availability for all critical components
- ✅ Secure credential management
- ✅ Resource limits configured
- ✅ Documentation complete with runbooks
- ✅ No legacy code remaining
---
## 🚨 Important Notes
1. **Update Secrets Before Deployment:**
- Change all default passwords in `secrets.yaml`
- Use strong, randomly generated passwords
- Consider using Sealed Secrets for production
2. **Configure SMTP Settings:**
- Update AlertManager SMTP configuration in secrets
- Test email delivery before relying on alerts
3. **Review Alert Thresholds:**
- Current thresholds are conservative
- Adjust based on your SLAs and baseline metrics
4. **Monitor Resource Usage:**
- Prometheus storage grows over time
- Plan for capacity based on retention period
- Consider cleaning up old metrics
5. **Backup Strategy:**
- PVCs contain critical monitoring data
- Implement backup solution for PersistentVolumes
- Test restore procedures regularly
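If the cluster has CSI snapshot support, a PVC snapshot is one way to cover the backup point above. This is only a sketch; the snapshot class and PVC name are assumptions, not values from the repo:
```yaml
# Hypothetical snapshot of the first Prometheus replica's data volume
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: prometheus-data-snapshot
  namespace: monitoring
spec:
  volumeSnapshotClassName: csi-snapclass                      # assumed snapshot class
  source:
    persistentVolumeClaimName: prometheus-data-prometheus-0   # assumed PVC name
```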
---
## 🎓 Next Steps (Post-MVP)
### **Short Term (1-2 weeks):**
1. Fine-tune alert thresholds based on production data
2. Add custom business metrics to services
3. Create team-specific dashboards
4. Set up on-call rotation in AlertManager
### **Medium Term (1-3 months):**
1. Implement SLO tracking and error budgets
2. Deploy Loki for log aggregation
3. Add anomaly detection for metrics
4. Integrate with incident management (PagerDuty/Opsgenie)
### **Long Term (3-6 months):**
1. Deploy Thanos for long-term metrics storage
2. Implement cost tracking and chargeback per tenant
3. Add continuous profiling (Pyroscope)
4. Build ML-based alert prediction
---
## 📞 Support & Troubleshooting
### **Common Issues:**
**Issue:** Prometheus targets showing "DOWN"
```bash
# Check service discovery
kubectl get svc -n bakery-ia
kubectl get endpoints -n bakery-ia
```
**Issue:** AlertManager not sending notifications
```bash
# Check SMTP connectivity
kubectl exec -n monitoring alertmanager-0 -- nc -zv smtp.gmail.com 587
# Check AlertManager logs
kubectl logs -n monitoring alertmanager-0 -f
```
**Issue:** Grafana dashboards showing "No Data"
```bash
# Verify Prometheus datasource
kubectl port-forward -n monitoring svc/grafana 3000:3000
# Login → Configuration → Data Sources → Test
# Check Prometheus has data
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090
# Visit /graph and run query: up
```
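If the UI test is inconclusive, Grafana's HTTP API can list the configured datasources directly; use the admin password you set in the monitoring secrets:
```bash
# List datasource names via the Grafana API; the Prometheus datasource should appear
curl -s -u admin:YOUR_PASSWORD http://localhost:3000/api/datasources | jq '.[].name'
```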
### **Getting Help:**
- Check logs: `kubectl logs -n monitoring POD_NAME`
- Check events: `kubectl get events -n monitoring`
- Review documentation: `infrastructure/kubernetes/base/components/monitoring/README.md`
- Prometheus troubleshooting: https://prometheus.io/docs/prometheus/latest/troubleshooting/
- Grafana troubleshooting: https://grafana.com/docs/grafana/latest/troubleshooting/
---
## ✅ Deployment Checklist
Before going to production, verify:
- [ ] All secrets updated with production values
- [ ] SMTP configuration tested and working
- [ ] Grafana admin password changed from default
- [ ] PostgreSQL connection string configured
- [ ] Test alert fired and received via email
- [ ] All Prometheus targets are UP
- [ ] Grafana dashboards loading data
- [ ] Jaeger receiving traces
- [ ] Resource quotas appropriate for cluster size
- [ ] Backup strategy implemented for PVCs
- [ ] Team trained on accessing monitoring tools
- [ ] Runbooks reviewed and understood
- [ ] On-call rotation configured (if applicable)
---
## 🎉 Summary
**You now have a production-ready monitoring stack with:**
- **Complete Observability:** Metrics, logs (via stdout), and traces
- **Intelligent Alerting:** 50+ rules with smart routing and inhibition
- **Rich Visualization:** 11 dashboards covering all aspects of the system
- **High Availability:** HA for Prometheus and AlertManager
- **Security:** Secrets management, RBAC, read-only containers
- **Documentation:** Comprehensive guides and runbooks
- **Scalability:** Ready to handle production traffic
**The monitoring MVP is COMPLETE and READY FOR PRODUCTION DEPLOYMENT!** 🚀
---
*Generated: 2026-01-07*
*Version: 1.0.0 - Production MVP*
*Implementation Time: ~3 hours*

View File

@@ -584,23 +584,39 @@ docker push YOUR_VPS_IP:32000/bakery/auth-service
### Step 2: Update Production Configuration
The production configuration is already set up for the **bakewise.ai** domain:
**Production URLs:**
- **Main Application:** https://bakewise.ai
- **API Endpoints:** https://bakewise.ai/api/v1/...
- **Monitoring Dashboard:** https://monitoring.bakewise.ai/grafana
- **Prometheus:** https://monitoring.bakewise.ai/prometheus
- **SigNoz (Traces/Metrics/Logs):** https://monitoring.bakewise.ai/signoz
- **AlertManager:** https://monitoring.bakewise.ai/alertmanager
```bash
# Verify the configuration is correct:
cat infrastructure/kubernetes/overlays/prod/prod-ingress.yaml | grep -A 3 "host:"
# Expected output should show:
# - host: bakewise.ai
# - host: monitoring.bakewise.ai
# Verify CORS configuration
cat infrastructure/kubernetes/overlays/prod/prod-configmap.yaml | grep CORS
# Expected: CORS_ORIGINS: "https://bakewise.ai"
```
**If using a different domain**, update these files:
```bash
# 1. Update domain names
nano infrastructure/kubernetes/overlays/prod/prod-ingress.yaml
# Replace bakewise.ai with your domain
# 2. Update ConfigMap
nano infrastructure/kubernetes/overlays/prod/prod-configmap.yaml
# Update CORS_ORIGINS
# 3. Verify image names (if using custom registry)
nano infrastructure/kubernetes/overlays/prod/kustomization.yaml
@@ -840,22 +856,96 @@ kubectl logs -n bakery-ia deployment/auth-service | grep -i "email\|smtp"
## Post-Deployment
### Step 1: Access Monitoring Stack
Your production monitoring stack provides complete observability with multiple tools:
#### Production Monitoring URLs
Access via domain (recommended):
```
https://monitoring.bakewise.ai/grafana # Dashboards & visualization
https://monitoring.bakewise.ai/prometheus # Metrics & queries
https://monitoring.bakewise.ai/signoz # Unified observability platform (traces, metrics, logs)
https://monitoring.bakewise.ai/alertmanager # Alert management
```
Or via port forwarding (if needed):
```bash
# Grafana
kubectl port-forward -n monitoring svc/grafana 3000:3000 &
# Prometheus
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090 &
# SigNoz
kubectl port-forward -n monitoring svc/signoz-frontend 3301:3301 &
# AlertManager
kubectl port-forward -n monitoring svc/alertmanager-external 9093:9093 &
```
#### Available Dashboards
Login to Grafana (admin / your-password) and explore:
**Main Dashboards:**
1. **Gateway Metrics** - HTTP request rates, latencies, error rates
2. **Services Overview** - Multi-service health and performance
3. **Circuit Breakers** - Reliability metrics
**Extended Dashboards:**
4. **Service Performance Monitoring (SPM)** - RED metrics from distributed traces
5. **PostgreSQL Database** - Database health, connections, query performance
6. **Node Exporter Infrastructure** - CPU, memory, disk, network per node
7. **AlertManager Monitoring** - Alert tracking and notification status
8. **Business Metrics & KPIs** - Tenant activity, ML jobs, forecasts
#### Quick Health Check
```bash
# Verify all monitoring pods are running
kubectl get pods -n monitoring
# Check Prometheus targets (all should be UP)
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090
# Open: http://localhost:9090/targets
# View active alerts
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090
# Open: http://localhost:9090/alerts
```
### Step 2: Configure Alerting
Update AlertManager with your notification email addresses:
```bash
# Edit alertmanager configuration
kubectl edit configmap -n monitoring alertmanager-config
# Update recipient emails in the routes section:
# - alerts@bakewise.ai (general alerts)
# - critical-alerts@bakewise.ai (critical issues)
# - oncall@bakewise.ai (on-call rotation)
```
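The recipient changes above end up in AlertManager receiver blocks. A minimal sketch using the bakewise.ai addresses is shown here; the receiver name is illustrative and may differ from the names in `alertmanager-config`:
```yaml
# Hypothetical receiver wiring the critical route to email notifications
receivers:
  - name: critical-alerts
    email_configs:
      - to: "critical-alerts@bakewise.ai, oncall@bakewise.ai"
        send_resolved: true
```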
Test alert delivery:
```bash
# Fire a test alert
kubectl run memory-test --image=polinux/stress --restart=Never \
--namespace=bakery-ia -- stress --vm 1 --vm-bytes 600M --timeout 300s
# Check alert appears in AlertManager
# https://monitoring.bakewise.ai/alertmanager
# Verify email notification received
# Clean up test
kubectl delete pod memory-test -n bakery-ia
```
### Step 3: Configure Backups
```bash
# Create backup script on VPS
@@ -902,26 +992,82 @@ kubectl edit configmap -n monitoring alertmanager-config
# Update recipient emails in the routes section
```
### Step 4: Verify Monitoring is Working
Before proceeding, ensure all monitoring components are operational:
```bash
# 1. Check Prometheus targets
# Open: https://monitoring.bakewise.ai/prometheus/targets
# All targets should show "UP" status
# 2. Verify Grafana dashboards load data
# Open: https://monitoring.bakewise.ai/grafana
# Navigate to any dashboard and verify metrics are displaying
# 3. Check SigNoz is receiving traces
# Open: https://monitoring.bakewise.ai/signoz
# Search for traces from "gateway" service
# 4. Verify AlertManager cluster
# Open: https://monitoring.bakewise.ai/alertmanager
# Check that all 3 AlertManager instances are connected
```
### Step 5: Document Everything
Create a secure runbook with all credentials and procedures:
**Essential Information to Document:**
- [ ] VPS login credentials (stored securely in password manager)
- [ ] Database passwords (in password manager)
- [ ] Grafana admin password
- [ ] Domain registrar access (for bakewise.ai)
- [ ] Cloudflare access
- [ ] Email service credentials (SMTP)
- [ ] WhatsApp API credentials
- [ ] Docker Hub / Registry credentials
- [ ] Emergency contact information
- [ ] Rollback procedures
- [ ] Monitoring URLs and access procedures
### Step 6: Train Your Team
Conduct a training session covering:
- [ ] **Access monitoring dashboards**
- Show how to login to https://monitoring.bakewise.ai/grafana
- Walk through key dashboards (Services Overview, Database, Infrastructure)
- Explain how to interpret metrics and identify issues
- [ ] **Check application logs**
```bash
# View logs for a service
kubectl logs -n bakery-ia deployment/orders-service --tail=100 -f
# Search for errors
kubectl logs -n bakery-ia deployment/gateway | grep ERROR
```
- [ ] **Restart services when needed**
```bash
# Restart a service (rolling update, no downtime)
kubectl rollout restart deployment/orders-service -n bakery-ia
```
- [ ] **Respond to alerts**
- Show how to access AlertManager at https://monitoring.bakewise.ai/alertmanager
- Review common alerts and their resolution steps
- Reference the [Production Operations Guide](./PRODUCTION_OPERATIONS_GUIDE.md)
- [ ] **Share documentation**
- [PILOT_LAUNCH_GUIDE.md](./PILOT_LAUNCH_GUIDE.md) - This guide
- [PRODUCTION_OPERATIONS_GUIDE.md](./PRODUCTION_OPERATIONS_GUIDE.md) - Daily operations
- [security-checklist.md](./security-checklist.md) - Security procedures
- [ ] **Setup on-call rotation** (if applicable)
- Configure in AlertManager
- Document escalation procedures
---
@@ -1050,16 +1196,25 @@ kubectl scale deployment monitoring -n bakery-ia --replicas=0
## Support Resources
**Documentation:**
- **Operations Guide:** [PRODUCTION_OPERATIONS_GUIDE.md](./PRODUCTION_OPERATIONS_GUIDE.md) - Daily operations, monitoring, incident response
- **Security Guide:** [security-checklist.md](./security-checklist.md) - Security procedures and compliance
- **Database Security:** [database-security.md](./database-security.md) - Database operations and TLS configuration
- **TLS Configuration:** [tls-configuration.md](./tls-configuration.md) - Certificate management
- **RBAC Implementation:** [rbac-implementation.md](./rbac-implementation.md) - Access control
**Monitoring Access:**
- **Grafana:** https://monitoring.bakewise.ai/grafana (admin / your-password)
- **Prometheus:** https://monitoring.bakewise.ai/prometheus
- **SigNoz:** https://monitoring.bakewise.ai/signoz
- **AlertManager:** https://monitoring.bakewise.ai/alertmanager
**External Resources:**
- **MicroK8s Docs:** https://microk8s.io/docs
- **Kubernetes Docs:** https://kubernetes.io/docs
- **Let's Encrypt:** https://letsencrypt.org/docs
- **Cloudflare DNS:** https://developers.cloudflare.com/dns
- **Monitoring Stack README:** infrastructure/kubernetes/base/components/monitoring/README.md
---

View File

@@ -32,7 +32,7 @@
- **Services:** 18 microservices, 14 databases, monitoring stack
- **Capacity:** 10-tenant pilot (scalable to 100+)
- **Security:** TLS encryption, RBAC, audit logging
- **Monitoring:** Prometheus, Grafana, AlertManager, SigNoz
**Key Metrics (10-tenant baseline):**
- **Uptime Target:** 99.5% (3.65 hours downtime/month)
@@ -60,10 +60,10 @@
**Production URLs:**
```
https://monitoring.bakewise.ai/grafana # Dashboards & visualization
https://monitoring.bakewise.ai/prometheus # Metrics & alerts
https://monitoring.bakewise.ai/alertmanager # Alert management
https://monitoring.bakewise.ai/signoz # Unified observability platform (traces, metrics, logs)
```
**Port Forwarding (if ingress not available):**
@@ -77,8 +77,8 @@ kubectl port-forward -n monitoring svc/prometheus-external 9090:9090
# AlertManager
kubectl port-forward -n monitoring svc/alertmanager-external 9093:9093
# SigNoz
kubectl port-forward -n monitoring svc/signoz-frontend 3301:3301
```
### Key Dashboards
@@ -1099,13 +1099,12 @@ kubectl exec -n bakery-ia deployment/auth-db -- \
## Support Resources
**Documentation:**
- [Pilot Launch Guide](./PILOT_LAUNCH_GUIDE.md) - Initial deployment and setup
- [Security Checklist](./security-checklist.md) - Security procedures and compliance
- [Database Security](./database-security.md) - Database operations and best practices
- [TLS Configuration](./tls-configuration.md) - Certificate management
- [RBAC Implementation](./rbac-implementation.md) - Access control configuration
- [Monitoring Stack README](../infrastructure/kubernetes/base/components/monitoring/README.md) - Detailed monitoring documentation
**External Resources:**
- Kubernetes: https://kubernetes.io/docs
@@ -1115,9 +1114,9 @@ kubectl exec -n bakery-ia deployment/auth-db -- \
- PostgreSQL: https://www.postgresql.org/docs
**Emergency Contacts:**
- DevOps Team: devops@bakewise.ai
- On-Call: oncall@bakewise.ai
- Security Team: security@bakewise.ai
---

View File

@@ -1,284 +0,0 @@
# 🚀 Quick Start: Deploy Monitoring to Production
**Time to deploy: ~15 minutes**
---
## Step 1: Update Secrets (5 min)
```bash
cd infrastructure/kubernetes/base/components/monitoring
# 1. Generate strong passwords
GRAFANA_PASS=$(openssl rand -base64 32)
echo "Grafana Password: $GRAFANA_PASS" > ~/SAVE_THIS_PASSWORD.txt
# 2. Edit secrets.yaml and replace:
# - CHANGE_ME_IN_PRODUCTION (Grafana password)
# - SMTP settings (your email server)
# - PostgreSQL connection string (your DB)
nano secrets.yaml
```
**Required Changes in secrets.yaml:**
```yaml
# Line 13: Change Grafana password
admin-password: "YOUR_STRONG_PASSWORD_HERE"
# Lines 30-33: Update SMTP settings
smtp-host: "smtp.gmail.com:587"
smtp-username: "your-alerts@yourdomain.com"
smtp-password: "YOUR_SMTP_PASSWORD"
smtp-from: "alerts@yourdomain.com"
# Line 49: Update PostgreSQL connection
data-source-name: "postgresql://USER:PASSWORD@postgres.bakery-ia:5432/bakery?sslmode=require"
```
---
## Step 2: Update Alert Email Addresses (2 min)
```bash
# Edit alertmanager.yaml to set your team's email addresses
nano alertmanager.yaml
# Update these lines (search for @yourdomain.com):
# - Line 93: to: 'alerts@yourdomain.com'
# - Line 101: to: 'critical-alerts@yourdomain.com,oncall@yourdomain.com'
# - Line 116: to: 'alerts@yourdomain.com'
# - Line 125: to: 'alert-system-team@yourdomain.com'
# - Line 134: to: 'database-team@yourdomain.com'
# - Line 143: to: 'infra-team@yourdomain.com'
```
---
## Step 3: Deploy to Production (3 min)
```bash
# Return to project root
cd /path/to/bakery-ia   # your local checkout of the repo
# Deploy the entire stack
kubectl apply -k infrastructure/kubernetes/overlays/prod
# Watch the pods come up
kubectl get pods -n monitoring -w
```
**Expected Output:**
```
NAME                      READY   STATUS    RESTARTS   AGE
prometheus-0              1/1     Running   0          2m
prometheus-1              1/1     Running   0          1m
alertmanager-0            2/2     Running   0          2m
alertmanager-1            2/2     Running   0          1m
alertmanager-2            2/2     Running   0          1m
grafana-xxxxx             1/1     Running   0          2m
postgres-exporter-xxxxx   1/1     Running   0          2m
node-exporter-xxxxx       1/1     Running   0          2m
jaeger-xxxxx              1/1     Running   0          2m
```
---
## Step 4: Verify Deployment (3 min)
```bash
# Check all pods are running
kubectl get pods -n monitoring
# Check storage is provisioned
kubectl get pvc -n monitoring
# Check services are created
kubectl get svc -n monitoring
```
---
## Step 5: Access Dashboards (2 min)
### **Option A: Via Ingress (if configured)**
```
https://monitoring.yourdomain.com/grafana
https://monitoring.yourdomain.com/prometheus
https://monitoring.yourdomain.com/alertmanager
https://monitoring.yourdomain.com/jaeger
```
### **Option B: Via Port Forwarding**
```bash
# Grafana
kubectl port-forward -n monitoring svc/grafana 3000:3000 &
# Prometheus
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090 &
# AlertManager
kubectl port-forward -n monitoring svc/alertmanager-external 9093:9093 &
# Jaeger
kubectl port-forward -n monitoring svc/jaeger-query 16686:16686 &
# Now access:
# - Grafana: http://localhost:3000 (admin / YOUR_PASSWORD)
# - Prometheus: http://localhost:9090
# - AlertManager: http://localhost:9093
# - Jaeger: http://localhost:16686
```
---
## Step 6: Verify Everything Works (5 min)
### **Check Prometheus Targets**
1. Open Prometheus: http://localhost:9090
2. Go to Status → Targets
3. Verify all targets are **UP**:
- prometheus (1/1 up)
- bakery-services (multiple pods up)
- alertmanager (3/3 up)
- postgres-exporter (1/1 up)
- node-exporter (N/N up, where N = number of nodes)
### **Check Grafana Dashboards**
1. Open Grafana: http://localhost:3000
2. Login with admin / YOUR_PASSWORD
3. Go to Dashboards → Browse
4. You should see 11 dashboards:
- Bakery IA folder: Gateway Metrics, Services Overview, Circuit Breakers
- Bakery IA - Extended folder: PostgreSQL, Node Exporter, AlertManager, Business Metrics
5. Open any dashboard and verify data is loading
### **Test Alert Flow**
```bash
# Fire a test alert by creating high memory pod
kubectl run memory-test --image=polinux/stress --restart=Never \
--namespace=bakery-ia -- stress --vm 1 --vm-bytes 600M --timeout 300s
# Wait 5 minutes, then check:
# 1. Prometheus Alerts: http://localhost:9090/alerts
# - Should see "HighMemoryUsage" firing
# 2. AlertManager: http://localhost:9093
# - Should see the alert
# 3. Email inbox - Should receive notification
# Clean up
kubectl delete pod memory-test -n bakery-ia
```
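To confirm AlertManager actually received the alert without opening the UI, you can query its v2 API through the port-forward set up in Step 5; this is a convenience check, not part of the original flow:
```bash
# List alert names currently held by AlertManager
curl -s http://localhost:9093/api/v2/alerts | jq '.[].labels.alertname'
```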
### **Verify Jaeger Tracing**
1. Make a request to your API:
```bash
curl -H "Authorization: Bearer YOUR_TOKEN" \
https://api.yourdomain.com/api/v1/health
```
2. Open Jaeger: http://localhost:16686
3. Select a service from dropdown
4. Click "Find Traces"
5. You should see traces appearing
---
## ✅ Success Criteria
Your monitoring is working correctly if:
- [x] All Prometheus targets show "UP" status
- [x] Grafana dashboards display metrics
- [x] AlertManager cluster shows 3/3 members
- [x] Test alert fired and email received
- [x] Jaeger shows traces from services
- [x] No pods in CrashLoopBackOff state
- [x] All PVCs are Bound
---
## 🔧 Troubleshooting
### **Problem: Pods not starting**
```bash
# Check pod status
kubectl describe pod POD_NAME -n monitoring
# Check logs
kubectl logs POD_NAME -n monitoring
# Common issues:
# - Insufficient resources: Check node capacity
# - PVC not binding: Check storage class exists
# - Image pull errors: Check network/registry access
```
### **Problem: Prometheus targets DOWN**
```bash
# Check if services exist
kubectl get svc -n bakery-ia
# Check if pods have correct labels
kubectl get pods -n bakery-ia --show-labels
# Check if pods expose metrics port (8080)
kubectl get pod POD_NAME -n bakery-ia -o yaml | grep -A 5 ports
```
### **Problem: Grafana shows "No Data"**
```bash
# Test Prometheus datasource
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090
# Run a test query in Prometheus
curl "http://localhost:9090/api/v1/query?query=up" | jq
# If Prometheus has data but Grafana doesn't, check Grafana datasource config
```
### **Problem: Alerts not firing**
```bash
# Check alert rules are loaded
kubectl logs -n monitoring prometheus-0 | grep "Loading configuration"
# Check AlertManager config
kubectl exec -n monitoring alertmanager-0 -- cat /etc/alertmanager/alertmanager.yml
# Test SMTP connection
kubectl exec -n monitoring alertmanager-0 -- \
nc -zv smtp.gmail.com 587
```
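If the official prom/alertmanager image is in use, `amtool` ships inside the container and can validate the config file inspected above:
```bash
# Validate AlertManager configuration syntax from inside the pod
kubectl exec -n monitoring alertmanager-0 -- \
  amtool check-config /etc/alertmanager/alertmanager.yml
```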
---
## 📞 Need Help?
1. Check full documentation: [infrastructure/kubernetes/base/components/monitoring/README.md](infrastructure/kubernetes/base/components/monitoring/README.md)
2. Review deployment summary: [MONITORING_DEPLOYMENT_SUMMARY.md](MONITORING_DEPLOYMENT_SUMMARY.md)
3. Check Prometheus logs: `kubectl logs -n monitoring prometheus-0`
4. Check AlertManager logs: `kubectl logs -n monitoring alertmanager-0`
5. Check Grafana logs: `kubectl logs -n monitoring deployment/grafana`
---
## 🎉 You're Done!
Your monitoring stack is now running in production!
**Next steps:**
1. Save your Grafana password securely
2. Set up on-call rotation
3. Review alert thresholds and adjust as needed
4. Create team-specific dashboards
5. Train team on using monitoring tools
**Access your monitoring:**
- Grafana: https://monitoring.yourdomain.com/grafana
- Prometheus: https://monitoring.yourdomain.com/prometheus
- AlertManager: https://monitoring.yourdomain.com/alertmanager
- Jaeger: https://monitoring.yourdomain.com/jaeger
---
*Deployment time: ~15 minutes*
*Last updated: 2026-01-07*