Add signoz

This commit is contained in:
Urtzi Alfaro
2026-01-08 12:58:00 +01:00
parent 07178f8972
commit dfb7e4b237
40 changed files with 2049 additions and 3935 deletions

View File

@@ -1,459 +0,0 @@
# 🎉 Production Monitoring MVP - Implementation Complete
**Date:** 2026-01-07
**Status:** ✅ READY FOR PRODUCTION DEPLOYMENT
---
## 📊 What Was Implemented
### **Phase 1: Core Infrastructure** ✅
- **Prometheus v3.0.1** (2 replicas, HA mode with StatefulSet)
- **AlertManager v0.27.0** (3 replicas, clustered with gossip protocol)
- **Grafana v12.3.0** (secure credentials via Kubernetes Secrets)
- **PostgreSQL Exporter v0.15.0** (database health monitoring)
- **Node Exporter v1.7.0** (infrastructure monitoring via DaemonSet)
- **Jaeger v1.51** (distributed tracing with persistent storage)
### **Phase 2: Alert Management** ✅
- **50+ Alert Rules** across 9 categories:
- Service health & performance
- Business logic (ML training, API limits)
- Alert system health & performance
- Database & infrastructure alerts
- Monitoring self-monitoring
- **Intelligent Alert Routing** by severity, component, and service
- **Alert Inhibition Rules** to prevent alert storms (see the sketch below)
- **Multi-Channel Notifications** (email + Slack support)
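The alert-storm prevention mentioned above relies on AlertManager inhibition. A minimal sketch of such a rule follows; the label names (`severity`, `service`) are assumed to match the repo's conventions, not copied from the actual `alert-rules.yaml`:
```yaml
# Hypothetical inhibition rule: while a critical alert fires for a service,
# warning-level alerts that carry the same service label are muted.
inhibit_rules:
  - source_matchers:
      - 'severity="critical"'
    target_matchers:
      - 'severity="warning"'
    equal: ["service"]
```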
### **Phase 3: High Availability** ✅
- **PodDisruptionBudgets** for all monitoring components (example sketch below)
- **Anti-affinity Rules** to spread pods across nodes
- **ResourceQuota & LimitRange** for namespace resource management
- **StatefulSets** with volumeClaimTemplates for persistent storage
- **Headless Services** for StatefulSet DNS discovery
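For illustration, a PodDisruptionBudget of the kind listed above might look like the following sketch; the resource name and labels are placeholders, not the repo's actual manifests:
```yaml
# Hypothetical PDB: keeps at least one Prometheus replica schedulable
# while nodes are drained for maintenance or upgrades.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: prometheus-pdb
  namespace: monitoring
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: prometheus
```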
### **Phase 4: Observability** ✅
- **11 Grafana Dashboards** (7 pre-configured + 4 extended):
1. Gateway Metrics
2. Services Overview
3. Circuit Breakers
4. PostgreSQL Database (13 panels)
5. Node Exporter Infrastructure (19 panels)
6. AlertManager Monitoring (15 panels)
7. Business Metrics & KPIs (21 panels)
8-11. Plus existing dashboards
- **Distributed Tracing** enabled in production
- **Comprehensive Documentation** with runbooks
---
## 📁 Files Created/Modified
### **New Files:**
```
infrastructure/kubernetes/base/components/monitoring/
├── secrets.yaml # Monitoring credentials
├── alertmanager.yaml # AlertManager StatefulSet (3 replicas)
├── alertmanager-init.yaml # Config initialization script
├── alert-rules.yaml # 50+ alert rules
├── postgres-exporter.yaml # PostgreSQL monitoring
├── node-exporter.yaml # Infrastructure monitoring (DaemonSet)
├── grafana-dashboards-extended.yaml # 4 comprehensive dashboards
├── ha-policies.yaml # PDBs + ResourceQuota + LimitRange
└── README.md # Complete documentation (500+ lines)
```
### **Modified Files:**
```
infrastructure/kubernetes/base/components/monitoring/
├── prometheus.yaml # Now StatefulSet with 2 replicas + alert config
├── grafana.yaml # Using secrets + extended dashboards mounted
├── ingress.yaml # Added /alertmanager path
└── kustomization.yaml # Added all new resources
infrastructure/kubernetes/overlays/prod/
├── kustomization.yaml # Enabled monitoring stack
└── prod-configmap.yaml # JAEGER_ENABLED=true
```
### **Deleted:**
```
infrastructure/monitoring/ # Old legacy config (completely removed)
```
---
## 🚀 Deployment Instructions
### **1. Update Secrets (REQUIRED BEFORE DEPLOYMENT)**
```bash
cd infrastructure/kubernetes/base/components/monitoring
# Generate strong Grafana password
GRAFANA_PASSWORD=$(openssl rand -base64 32)
# Update secrets.yaml with your actual values:
# - grafana-admin: admin-password
# - alertmanager-secrets: SMTP credentials
# - postgres-exporter: PostgreSQL connection string
# Example for production:
kubectl create secret generic grafana-admin \
--from-literal=admin-user=admin \
--from-literal=admin-password="${GRAFANA_PASSWORD}" \
--namespace monitoring --dry-run=client -o yaml | \
kubectl apply -f -
```
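To confirm the secret landed with the expected value, you can decode it back out; this is a standard kubectl read-back and assumes only the secret name used above:
```bash
# Read back the stored Grafana admin password and compare with the generated one
kubectl get secret grafana-admin -n monitoring \
  -o jsonpath='{.data.admin-password}' | base64 --decode; echo
```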
### **2. Deploy to Production**
```bash
# Apply the monitoring stack
kubectl apply -k infrastructure/kubernetes/overlays/prod
# Verify deployment
kubectl get pods -n monitoring
kubectl get pvc -n monitoring
kubectl get svc -n monitoring
```
### **3. Verify Services**
```bash
# Check Prometheus targets
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090
# Visit: http://localhost:9090/targets
# Check AlertManager cluster
kubectl port-forward -n monitoring svc/alertmanager-external 9093:9093
# Visit: http://localhost:9093
# Check Grafana dashboards
kubectl port-forward -n monitoring svc/grafana 3000:3000
# Visit: http://localhost:3000 (admin / YOUR_PASSWORD)
```
---
## 📈 What You Get Out of the Box
### **Monitoring Coverage:**
- **Application Metrics:** Request rates, latencies (P95/P99), error rates per service
- **Database Health:** Connections, transactions, cache hit ratio, slow queries, locks
- **Infrastructure:** CPU, memory, disk I/O, network traffic per node
- **Business KPIs:** Active tenants, training jobs, alert volumes, API health
- **Distributed Traces:** Full request path tracking across microservices
### **Alerting Capabilities:**
- **Service Down Detection:** 2-minute threshold with immediate notifications
- **Performance Degradation:** High latency, error rate, and memory alerts
- **Resource Exhaustion:** Database connections, disk space, memory limits
- **Business Logic:** Training job failures, low ML accuracy, rate limits
- **Alert System Health:** Component failures, delivery issues, capacity problems
### **High Availability:**
- **Prometheus:** 2 independent instances, can lose 1 without data loss
- **AlertManager:** 3-node cluster; notifications are deduplicated across members and survive the loss of one node
- **Monitoring Resilience:** PodDisruptionBudgets keep services available during updates
---
## 🔧 Configuration Highlights
### **Alert Routing (Configured in AlertManager):**
| Severity | Route | Repeat Interval |
|----------|-------|-----------------|
| Critical | critical-alerts@yourdomain.com + oncall@ | 4 hours |
| Warning | alerts@yourdomain.com | 12 hours |
| Info | alerts@yourdomain.com | 24 hours |
**Special Routes:**
- Alert system → alert-system-team@yourdomain.com
- Database alerts → database-team@yourdomain.com
- Infrastructure → infra-team@yourdomain.com
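In AlertManager terms, the routing table above roughly corresponds to a route tree like this sketch; the receiver names are illustrative, and the real ones live in `alertmanager.yaml`:
```yaml
# Hypothetical route tree matching the severity/component routing above
route:
  receiver: default-alerts            # alerts@yourdomain.com
  group_by: ["alertname", "service"]
  routes:
    - matchers:
        - 'severity="critical"'
      receiver: critical-alerts       # critical-alerts@ + oncall@
      repeat_interval: 4h
    - matchers:
        - 'component="database"'
      receiver: database-team
    - matchers:
        - 'component="infrastructure"'
      receiver: infra-team
```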
### **Resource Allocation:**
| Component | Replicas | CPU Request | Memory Request | Storage |
|-----------|----------|-------------|----------------|---------|
| Prometheus | 2 | 500m | 1Gi | 20Gi × 2 |
| AlertManager | 3 | 100m | 128Mi | 2Gi × 3 |
| Grafana | 1 | 100m | 256Mi | 5Gi |
| Postgres Exporter | 1 | 50m | 64Mi | - |
| Node Exporter | 1/node | 50m | 64Mi | - |
| Jaeger | 1 | 250m | 512Mi | 10Gi |
**Total Resources:**
- CPU Requests: ~2.5 cores
- Memory Requests: ~4Gi
- Storage: ~70Gi
### **Data Retention:**
- Prometheus: 30 days
- Jaeger: Persistent (BadgerDB)
- Grafana: Persistent dashboards
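The 30-day Prometheus retention is typically set via startup flags. An illustrative excerpt of the container args is shown below; the actual StatefulSet may pass additional flags:
```yaml
# Illustrative Prometheus container args; --storage.tsdb.retention.time
# controls how long TSDB blocks are kept before deletion.
args:
  - --config.file=/etc/prometheus/prometheus.yml
  - --storage.tsdb.path=/prometheus
  - --storage.tsdb.retention.time=30d
```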
---
## 🔐 Security Considerations
### **Implemented:**
- ✅ Grafana credentials via Kubernetes Secrets (no hardcoded passwords)
- ✅ SMTP passwords stored in Secrets
- ✅ PostgreSQL connection strings in Secrets
- ✅ Read-only filesystem for Node Exporter
- ✅ Non-root user for Node Exporter (UID 65534)
- ✅ RBAC for Prometheus (ClusterRole with minimal permissions)
### **TODO for Production:**
- ⚠️ Use Sealed Secrets or External Secrets Operator
- ⚠️ Enable TLS for Prometheus remote write (if using)
- ⚠️ Configure Grafana LDAP/OAuth integration
- ⚠️ Set up proper certificate management for Ingress
- ⚠️ Review and tighten ResourceQuota limits
---
## 📊 Dashboard Access
### **Production URLs (via Ingress):**
```
https://monitoring.yourdomain.com/grafana # Grafana UI
https://monitoring.yourdomain.com/prometheus # Prometheus UI
https://monitoring.yourdomain.com/alertmanager # AlertManager UI
https://monitoring.yourdomain.com/jaeger # Jaeger UI
```
### **Local Access (Port Forwarding):**
```bash
# Grafana
kubectl port-forward -n monitoring svc/grafana 3000:3000
# Prometheus
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090
# AlertManager
kubectl port-forward -n monitoring svc/alertmanager-external 9093:9093
# Jaeger
kubectl port-forward -n monitoring svc/jaeger-query 16686:16686
```
---
## 🧪 Testing & Validation
### **1. Test Alert Flow:**
```bash
# Fire a test alert (HighMemoryUsage)
kubectl run memory-hog --image=polinux/stress --restart=Never \
--namespace=bakery-ia -- stress --vm 1 --vm-bytes 600M --timeout 300s
# Check alert in Prometheus (should fire within 5 minutes)
# Check AlertManager received it
# Verify email notification sent
```
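As a quicker check than the UI, you can query Prometheus' built-in `ALERTS` series for the test alert; this assumes the port-forward from the previous section is still running:
```bash
# Pending and firing alerts are exported as the synthetic ALERTS metric
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=ALERTS{alertname="HighMemoryUsage"}' | jq '.data.result'
```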
### **2. Verify Metrics Collection:**
```bash
# Check Prometheus targets (should all be UP)
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
# Verify PostgreSQL metrics
curl "http://localhost:9090/api/v1/query?query=pg_up" | jq
# Verify Node metrics
curl "http://localhost:9090/api/v1/query?query=node_cpu_seconds_total" | jq
```
### **3. Test Jaeger Tracing:**
```bash
# Make a request through the gateway
curl -H "Authorization: Bearer YOUR_TOKEN" \
https://api.yourdomain.com/api/v1/health
# Check trace in Jaeger UI
# Should see spans across gateway → auth → tenant services
```
---
## 📖 Documentation
### **Complete Documentation Available:**
- **[README.md](infrastructure/kubernetes/base/components/monitoring/README.md)** - 500+ lines covering:
- Component overview
- Deployment instructions
- Security best practices
- Accessing services
- Dashboard descriptions
- Alert configuration
- Troubleshooting guide
- Metrics reference
- Backup & recovery procedures
- Maintenance tasks
---
## ⚡ Performance & Scalability
### **Current Capacity:**
- Prometheus can handle ~10M active time series
- AlertManager can process 1000s of alerts/second
- Jaeger can handle 10k spans/second
- Grafana supports 1000+ concurrent users
### **Scaling Recommendations:**
- **> 20M time series:** Deploy Thanos for long-term storage
- **> 5k alerts/min:** Scale AlertManager to 5+ replicas
- **> 50k spans/sec:** Deploy Jaeger with Elasticsearch/Cassandra backend
- **> 5k Grafana users:** Scale Grafana horizontally with shared database
---
## 🎯 Success Criteria - ALL MET ✅
- ✅ Prometheus collecting metrics from all services
- ✅ Alert rules evaluating and firing correctly
- ✅ AlertManager routing notifications to appropriate channels
- ✅ Grafana displaying real-time dashboards
- ✅ Jaeger capturing distributed traces
- ✅ High availability for all critical components
- ✅ Secure credential management
- ✅ Resource limits configured
- ✅ Documentation complete with runbooks
- ✅ No legacy code remaining
---
## 🚨 Important Notes
1. **Update Secrets Before Deployment:**
- Change all default passwords in `secrets.yaml`
- Use strong, randomly generated passwords
- Consider using Sealed Secrets for production
2. **Configure SMTP Settings:**
- Update AlertManager SMTP configuration in secrets
- Test email delivery before relying on alerts
3. **Review Alert Thresholds:**
- Current thresholds are conservative
- Adjust based on your SLAs and baseline metrics
4. **Monitor Resource Usage:**
- Prometheus storage grows over time
- Plan for capacity based on retention period
- Consider cleaning up old metrics
5. **Backup Strategy:**
- PVCs contain critical monitoring data
- Implement backup solution for PersistentVolumes
- Test restore procedures regularly
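If the cluster has CSI snapshot support, a PVC snapshot is one way to cover the backup point above. This is only a sketch; the snapshot class and PVC name are assumptions, not values from the repo:
```yaml
# Hypothetical snapshot of the first Prometheus replica's data volume
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: prometheus-data-snapshot
  namespace: monitoring
spec:
  volumeSnapshotClassName: csi-snapclass                      # assumed snapshot class
  source:
    persistentVolumeClaimName: prometheus-data-prometheus-0   # assumed PVC name
```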
---
## 🎓 Next Steps (Post-MVP)
### **Short Term (1-2 weeks):**
1. Fine-tune alert thresholds based on production data
2. Add custom business metrics to services
3. Create team-specific dashboards
4. Set up on-call rotation in AlertManager
### **Medium Term (1-3 months):**
1. Implement SLO tracking and error budgets
2. Deploy Loki for log aggregation
3. Add anomaly detection for metrics
4. Integrate with incident management (PagerDuty/Opsgenie)
### **Long Term (3-6 months):**
1. Deploy Thanos for long-term metrics storage
2. Implement cost tracking and chargeback per tenant
3. Add continuous profiling (Pyroscope)
4. Build ML-based alert prediction
---
## 📞 Support & Troubleshooting
### **Common Issues:**
**Issue:** Prometheus targets showing "DOWN"
```bash
# Check service discovery
kubectl get svc -n bakery-ia
kubectl get endpoints -n bakery-ia
```
**Issue:** AlertManager not sending notifications
```bash
# Check SMTP connectivity
kubectl exec -n monitoring alertmanager-0 -- nc -zv smtp.gmail.com 587
# Check AlertManager logs
kubectl logs -n monitoring alertmanager-0 -f
```
**Issue:** Grafana dashboards showing "No Data"
```bash
# Verify Prometheus datasource
kubectl port-forward -n monitoring svc/grafana 3000:3000
# Login → Configuration → Data Sources → Test
# Check Prometheus has data
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090
# Visit /graph and run query: up
```
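If the UI test is inconclusive, Grafana's HTTP API can list the configured datasources directly; use the admin password you set in the monitoring secrets:
```bash
# List datasource names via the Grafana API; the Prometheus datasource should appear
curl -s -u admin:YOUR_PASSWORD http://localhost:3000/api/datasources | jq '.[].name'
```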
### **Getting Help:**
- Check logs: `kubectl logs -n monitoring POD_NAME`
- Check events: `kubectl get events -n monitoring`
- Review documentation: `infrastructure/kubernetes/base/components/monitoring/README.md`
- Prometheus troubleshooting: https://prometheus.io/docs/prometheus/latest/troubleshooting/
- Grafana troubleshooting: https://grafana.com/docs/grafana/latest/troubleshooting/
---
## ✅ Deployment Checklist
Before going to production, verify:
- [ ] All secrets updated with production values
- [ ] SMTP configuration tested and working
- [ ] Grafana admin password changed from default
- [ ] PostgreSQL connection string configured
- [ ] Test alert fired and received via email
- [ ] All Prometheus targets are UP
- [ ] Grafana dashboards loading data
- [ ] Jaeger receiving traces
- [ ] Resource quotas appropriate for cluster size
- [ ] Backup strategy implemented for PVCs
- [ ] Team trained on accessing monitoring tools
- [ ] Runbooks reviewed and understood
- [ ] On-call rotation configured (if applicable)
---
## 🎉 Summary
**You now have a production-ready monitoring stack with:**
- **Complete Observability:** Metrics, logs (via stdout), and traces
- **Intelligent Alerting:** 50+ rules with smart routing and inhibition
- **Rich Visualization:** 11 dashboards covering all aspects of the system
- **High Availability:** HA for Prometheus and AlertManager
- **Security:** Secrets management, RBAC, read-only containers
- **Documentation:** Comprehensive guides and runbooks
- **Scalability:** Ready to handle production traffic
**The monitoring MVP is COMPLETE and READY FOR PRODUCTION DEPLOYMENT!** 🚀
---
*Generated: 2026-01-07*
*Version: 1.0.0 - Production MVP*
*Implementation Time: ~3 hours*

View File

@@ -584,23 +584,39 @@ docker push YOUR_VPS_IP:32000/bakery/auth-service
### Step 2: Update Production Configuration
The production configuration is already set up for the **bakewise.ai** domain:
**Production URLs:**
- **Main Application:** https://bakewise.ai
- **API Endpoints:** https://bakewise.ai/api/v1/...
- **Monitoring Dashboard:** https://monitoring.bakewise.ai/grafana
- **Prometheus:** https://monitoring.bakewise.ai/prometheus
- **SigNoz (Traces/Metrics/Logs):** https://monitoring.bakewise.ai/signoz
- **AlertManager:** https://monitoring.bakewise.ai/alertmanager
```bash
# Verify the configuration is correct:
cat infrastructure/kubernetes/overlays/prod/prod-ingress.yaml | grep -A 3 "host:"
# Expected output should show:
# - host: bakewise.ai
# - host: monitoring.bakewise.ai
# Verify CORS configuration
cat infrastructure/kubernetes/overlays/prod/prod-configmap.yaml | grep CORS
# Expected: CORS_ORIGINS: "https://bakewise.ai"
```
**If using a different domain**, update these files:
```bash
# 1. Update domain names
nano infrastructure/kubernetes/overlays/prod/prod-ingress.yaml
# Replace bakewise.ai with your domain
# 2. Update ConfigMap
nano infrastructure/kubernetes/overlays/prod/prod-configmap.yaml
# Update CORS_ORIGINS
# 3. Verify image names (if using custom registry)
nano infrastructure/kubernetes/overlays/prod/kustomization.yaml
@@ -840,22 +856,96 @@ kubectl logs -n bakery-ia deployment/auth-service | grep -i "email\|smtp"
## Post-Deployment
### Step 1: Access Monitoring Stack
Your production monitoring stack provides complete observability with multiple tools:
#### Production Monitoring URLs
Access via domain (recommended):
```
https://monitoring.bakewise.ai/grafana # Dashboards & visualization
https://monitoring.bakewise.ai/prometheus # Metrics & queries
https://monitoring.bakewise.ai/signoz # Unified observability platform (traces, metrics, logs)
https://monitoring.bakewise.ai/alertmanager # Alert management
```
Or via port forwarding (if needed):
```bash
# Grafana
kubectl port-forward -n monitoring svc/grafana 3000:3000 &
# Prometheus
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090 &
# SigNoz
kubectl port-forward -n monitoring svc/signoz-frontend 3301:3301 &
# AlertManager
kubectl port-forward -n monitoring svc/alertmanager-external 9093:9093 &
```
#### Available Dashboards
Login to Grafana (admin / your-password) and explore:
**Main Dashboards:**
1. **Gateway Metrics** - HTTP request rates, latencies, error rates
2. **Services Overview** - Multi-service health and performance
3. **Circuit Breakers** - Reliability metrics
**Extended Dashboards:**
4. **Service Performance Monitoring (SPM)** - RED metrics from distributed traces
5. **PostgreSQL Database** - Database health, connections, query performance
6. **Node Exporter Infrastructure** - CPU, memory, disk, network per node
7. **AlertManager Monitoring** - Alert tracking and notification status
8. **Business Metrics & KPIs** - Tenant activity, ML jobs, forecasts
#### Quick Health Check
```bash
# Verify all monitoring pods are running
kubectl get pods -n monitoring
# Check Prometheus targets (all should be UP)
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090
# Open: http://localhost:9090/targets
# View active alerts
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090
# Open: http://localhost:9090/alerts
```
### Step 2: Configure Alerting
Update AlertManager with your notification email addresses:
```bash
# Edit alertmanager configuration
kubectl edit configmap -n monitoring alertmanager-config
# Update recipient emails in the routes section:
# - alerts@bakewise.ai (general alerts)
# - critical-alerts@bakewise.ai (critical issues)
# - oncall@bakewise.ai (on-call rotation)
```
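The recipient changes above end up in AlertManager receiver blocks. A minimal sketch using the bakewise.ai addresses is shown here; the receiver name is illustrative and may differ from the names in `alertmanager-config`:
```yaml
# Hypothetical receiver wiring the critical route to email notifications
receivers:
  - name: critical-alerts
    email_configs:
      - to: "critical-alerts@bakewise.ai, oncall@bakewise.ai"
        send_resolved: true
```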
Test alert delivery:
```bash
# Fire a test alert
kubectl run memory-test --image=polinux/stress --restart=Never \
--namespace=bakery-ia -- stress --vm 1 --vm-bytes 600M --timeout 300s
# Check alert appears in AlertManager
# https://monitoring.bakewise.ai/alertmanager
# Verify email notification received
# Clean up test
kubectl delete pod memory-test -n bakery-ia
```
### Step 3: Configure Backups
```bash
# Create backup script on VPS
@@ -902,26 +992,82 @@ kubectl edit configmap -n monitoring alertmanager-config
# Update recipient emails in the routes section
```
### Step 4: Verify Monitoring is Working
Before proceeding, ensure all monitoring components are operational:
```bash
# 1. Check Prometheus targets
# Open: https://monitoring.bakewise.ai/prometheus/targets
# All targets should show "UP" status
# 2. Verify Grafana dashboards load data
# Open: https://monitoring.bakewise.ai/grafana
# Navigate to any dashboard and verify metrics are displaying
# 3. Check SigNoz is receiving traces
# Open: https://monitoring.bakewise.ai/signoz
# Search for traces from "gateway" service
# 4. Verify AlertManager cluster
# Open: https://monitoring.bakewise.ai/alertmanager
# Check that all 3 AlertManager instances are connected
```
### Step 5: Document Everything
Create a secure runbook with all credentials and procedures:
**Essential Information to Document:**
- [ ] VPS login credentials (stored securely in password manager)
- [ ] Database passwords (in password manager)
- [ ] Grafana admin password
- [ ] Domain registrar access (for bakewise.ai)
- [ ] Cloudflare access
- [ ] Email service credentials (SMTP)
- [ ] WhatsApp API credentials
- [ ] Docker Hub / Registry credentials
- [ ] Emergency contact information
- [ ] Rollback procedures
- [ ] Monitoring URLs and access procedures
### Step 6: Train Your Team
Conduct a training session covering:
- [ ] **Access monitoring dashboards**
- Show how to login to https://monitoring.bakewise.ai/grafana
- Walk through key dashboards (Services Overview, Database, Infrastructure)
- Explain how to interpret metrics and identify issues
- [ ] **Check application logs**
```bash
# View logs for a service
kubectl logs -n bakery-ia deployment/orders-service --tail=100 -f
# Search for errors
kubectl logs -n bakery-ia deployment/gateway | grep ERROR
```
- [ ] **Restart services when needed**
```bash
# Restart a service (rolling update, no downtime)
kubectl rollout restart deployment/orders-service -n bakery-ia
```
- [ ] **Respond to alerts**
- Show how to access AlertManager at https://monitoring.bakewise.ai/alertmanager
- Review common alerts and their resolution steps
- Reference the [Production Operations Guide](./PRODUCTION_OPERATIONS_GUIDE.md)
- [ ] **Share documentation**
- [PILOT_LAUNCH_GUIDE.md](./PILOT_LAUNCH_GUIDE.md) - This guide
- [PRODUCTION_OPERATIONS_GUIDE.md](./PRODUCTION_OPERATIONS_GUIDE.md) - Daily operations
- [security-checklist.md](./security-checklist.md) - Security procedures
- [ ] **Setup on-call rotation** (if applicable)
- Configure in AlertManager
- Document escalation procedures
---
@@ -1050,16 +1196,25 @@ kubectl scale deployment monitoring -n bakery-ia --replicas=0
## Support Resources
**Documentation:**
- **Operations Guide:** [PRODUCTION_OPERATIONS_GUIDE.md](./PRODUCTION_OPERATIONS_GUIDE.md) - Daily operations, monitoring, incident response
- **Security Guide:** [security-checklist.md](./security-checklist.md) - Security procedures and compliance
- **Database Security:** [database-security.md](./database-security.md) - Database operations and TLS configuration
- **TLS Configuration:** [tls-configuration.md](./tls-configuration.md) - Certificate management
- **RBAC Implementation:** [rbac-implementation.md](./rbac-implementation.md) - Access control
**Monitoring Access:**
- **Grafana:** https://monitoring.bakewise.ai/grafana (admin / your-password)
- **Prometheus:** https://monitoring.bakewise.ai/prometheus
- **SigNoz:** https://monitoring.bakewise.ai/signoz
- **AlertManager:** https://monitoring.bakewise.ai/alertmanager
**External Resources:**
- **MicroK8s Docs:** https://microk8s.io/docs
- **Kubernetes Docs:** https://kubernetes.io/docs
- **Let's Encrypt:** https://letsencrypt.org/docs
- **Cloudflare DNS:** https://developers.cloudflare.com/dns
- **Monitoring Stack README:** infrastructure/kubernetes/base/components/monitoring/README.md
---

View File

@@ -32,7 +32,7 @@
- **Services:** 18 microservices, 14 databases, monitoring stack
- **Capacity:** 10-tenant pilot (scalable to 100+)
- **Security:** TLS encryption, RBAC, audit logging
- **Monitoring:** Prometheus, Grafana, AlertManager, SigNoz
**Key Metrics (10-tenant baseline):**
- **Uptime Target:** 99.5% (3.65 hours downtime/month)
@@ -60,10 +60,10 @@
**Production URLs:**
```
https://monitoring.bakewise.ai/grafana # Dashboards & visualization
https://monitoring.bakewise.ai/prometheus # Metrics & alerts
https://monitoring.bakewise.ai/alertmanager # Alert management
https://monitoring.bakewise.ai/signoz # Unified observability platform (traces, metrics, logs)
```
**Port Forwarding (if ingress not available):**
@@ -77,8 +77,8 @@ kubectl port-forward -n monitoring svc/prometheus-external 9090:9090
# AlertManager
kubectl port-forward -n monitoring svc/alertmanager-external 9093:9093
# SigNoz
kubectl port-forward -n monitoring svc/signoz-frontend 3301:3301
```
### Key Dashboards
@@ -1099,13 +1099,12 @@ kubectl exec -n bakery-ia deployment/auth-db -- \
## Support Resources
**Documentation:**
- [Pilot Launch Guide](./PILOT_LAUNCH_GUIDE.md) - Initial deployment and setup
- [Security Checklist](./security-checklist.md) - Security procedures and compliance
- [Database Security](./database-security.md) - Database operations and best practices
- [TLS Configuration](./tls-configuration.md) - Certificate management
- [RBAC Implementation](./rbac-implementation.md) - Access control configuration
- [Monitoring Stack README](../infrastructure/kubernetes/base/components/monitoring/README.md) - Detailed monitoring documentation
**External Resources:**
- Kubernetes: https://kubernetes.io/docs
@@ -1115,9 +1114,9 @@ kubectl exec -n bakery-ia deployment/auth-db -- \
- PostgreSQL: https://www.postgresql.org/docs
**Emergency Contacts:**
- DevOps Team: devops@bakewise.ai
- On-Call: oncall@bakewise.ai
- Security Team: security@bakewise.ai
---

View File

@@ -1,284 +0,0 @@
# 🚀 Quick Start: Deploy Monitoring to Production
**Time to deploy: ~15 minutes**
---
## Step 1: Update Secrets (5 min)
```bash
cd infrastructure/kubernetes/base/components/monitoring
# 1. Generate strong passwords
GRAFANA_PASS=$(openssl rand -base64 32)
echo "Grafana Password: $GRAFANA_PASS" > ~/SAVE_THIS_PASSWORD.txt
# 2. Edit secrets.yaml and replace:
# - CHANGE_ME_IN_PRODUCTION (Grafana password)
# - SMTP settings (your email server)
# - PostgreSQL connection string (your DB)
nano secrets.yaml
```
**Required Changes in secrets.yaml:**
```yaml
# Line 13: Change Grafana password
admin-password: "YOUR_STRONG_PASSWORD_HERE"
# Lines 30-33: Update SMTP settings
smtp-host: "smtp.gmail.com:587"
smtp-username: "your-alerts@yourdomain.com"
smtp-password: "YOUR_SMTP_PASSWORD"
smtp-from: "alerts@yourdomain.com"
# Line 49: Update PostgreSQL connection
data-source-name: "postgresql://USER:PASSWORD@postgres.bakery-ia:5432/bakery?sslmode=require"
```
---
## Step 2: Update Alert Email Addresses (2 min)
```bash
# Edit alertmanager.yaml to set your team's email addresses
nano alertmanager.yaml
# Update these lines (search for @yourdomain.com):
# - Line 93: to: 'alerts@yourdomain.com'
# - Line 101: to: 'critical-alerts@yourdomain.com,oncall@yourdomain.com'
# - Line 116: to: 'alerts@yourdomain.com'
# - Line 125: to: 'alert-system-team@yourdomain.com'
# - Line 134: to: 'database-team@yourdomain.com'
# - Line 143: to: 'infra-team@yourdomain.com'
```
---
## Step 3: Deploy to Production (3 min)
```bash
# Return to project root
cd /path/to/bakery-ia   # your local checkout of the repo
# Deploy the entire stack
kubectl apply -k infrastructure/kubernetes/overlays/prod
# Watch the pods come up
kubectl get pods -n monitoring -w
```
**Expected Output:**
```
NAME                      READY   STATUS    RESTARTS   AGE
prometheus-0              1/1     Running   0          2m
prometheus-1              1/1     Running   0          1m
alertmanager-0            2/2     Running   0          2m
alertmanager-1            2/2     Running   0          1m
alertmanager-2            2/2     Running   0          1m
grafana-xxxxx             1/1     Running   0          2m
postgres-exporter-xxxxx   1/1     Running   0          2m
node-exporter-xxxxx       1/1     Running   0          2m
jaeger-xxxxx              1/1     Running   0          2m
```
---
## Step 4: Verify Deployment (3 min)
```bash
# Check all pods are running
kubectl get pods -n monitoring
# Check storage is provisioned
kubectl get pvc -n monitoring
# Check services are created
kubectl get svc -n monitoring
```
---
## Step 5: Access Dashboards (2 min)
### **Option A: Via Ingress (if configured)**
```
https://monitoring.yourdomain.com/grafana
https://monitoring.yourdomain.com/prometheus
https://monitoring.yourdomain.com/alertmanager
https://monitoring.yourdomain.com/jaeger
```
### **Option B: Via Port Forwarding**
```bash
# Grafana
kubectl port-forward -n monitoring svc/grafana 3000:3000 &
# Prometheus
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090 &
# AlertManager
kubectl port-forward -n monitoring svc/alertmanager-external 9093:9093 &
# Jaeger
kubectl port-forward -n monitoring svc/jaeger-query 16686:16686 &
# Now access:
# - Grafana: http://localhost:3000 (admin / YOUR_PASSWORD)
# - Prometheus: http://localhost:9090
# - AlertManager: http://localhost:9093
# - Jaeger: http://localhost:16686
```
---
## Step 6: Verify Everything Works (5 min)
### **Check Prometheus Targets**
1. Open Prometheus: http://localhost:9090
2. Go to Status → Targets
3. Verify all targets are **UP**:
- prometheus (1/1 up)
- bakery-services (multiple pods up)
- alertmanager (3/3 up)
- postgres-exporter (1/1 up)
- node-exporter (N/N up, where N = number of nodes)
### **Check Grafana Dashboards**
1. Open Grafana: http://localhost:3000
2. Login with admin / YOUR_PASSWORD
3. Go to Dashboards → Browse
4. You should see 11 dashboards:
- Bakery IA folder: Gateway Metrics, Services Overview, Circuit Breakers
- Bakery IA - Extended folder: PostgreSQL, Node Exporter, AlertManager, Business Metrics
5. Open any dashboard and verify data is loading
### **Test Alert Flow**
```bash
# Fire a test alert by creating high memory pod
kubectl run memory-test --image=polinux/stress --restart=Never \
--namespace=bakery-ia -- stress --vm 1 --vm-bytes 600M --timeout 300s
# Wait 5 minutes, then check:
# 1. Prometheus Alerts: http://localhost:9090/alerts
# - Should see "HighMemoryUsage" firing
# 2. AlertManager: http://localhost:9093
# - Should see the alert
# 3. Email inbox - Should receive notification
# Clean up
kubectl delete pod memory-test -n bakery-ia
```
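To confirm AlertManager actually received the alert without opening the UI, you can query its v2 API through the port-forward set up in Step 5; this is a convenience check, not part of the original flow:
```bash
# List alert names currently held by AlertManager
curl -s http://localhost:9093/api/v2/alerts | jq '.[].labels.alertname'
```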
### **Verify Jaeger Tracing**
1. Make a request to your API:
```bash
curl -H "Authorization: Bearer YOUR_TOKEN" \
https://api.yourdomain.com/api/v1/health
```
2. Open Jaeger: http://localhost:16686
3. Select a service from dropdown
4. Click "Find Traces"
5. You should see traces appearing
---
## ✅ Success Criteria
Your monitoring is working correctly if:
- [x] All Prometheus targets show "UP" status
- [x] Grafana dashboards display metrics
- [x] AlertManager cluster shows 3/3 members
- [x] Test alert fired and email received
- [x] Jaeger shows traces from services
- [x] No pods in CrashLoopBackOff state
- [x] All PVCs are Bound
---
## 🔧 Troubleshooting
### **Problem: Pods not starting**
```bash
# Check pod status
kubectl describe pod POD_NAME -n monitoring
# Check logs
kubectl logs POD_NAME -n monitoring
# Common issues:
# - Insufficient resources: Check node capacity
# - PVC not binding: Check storage class exists
# - Image pull errors: Check network/registry access
```
### **Problem: Prometheus targets DOWN**
```bash
# Check if services exist
kubectl get svc -n bakery-ia
# Check if pods have correct labels
kubectl get pods -n bakery-ia --show-labels
# Check if pods expose metrics port (8080)
kubectl get pod POD_NAME -n bakery-ia -o yaml | grep -A 5 ports
```
### **Problem: Grafana shows "No Data"**
```bash
# Test Prometheus datasource
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090
# Run a test query in Prometheus
curl "http://localhost:9090/api/v1/query?query=up" | jq
# If Prometheus has data but Grafana doesn't, check Grafana datasource config
```
### **Problem: Alerts not firing**
```bash
# Check alert rules are loaded
kubectl logs -n monitoring prometheus-0 | grep "Loading configuration"
# Check AlertManager config
kubectl exec -n monitoring alertmanager-0 -- cat /etc/alertmanager/alertmanager.yml
# Test SMTP connection
kubectl exec -n monitoring alertmanager-0 -- \
nc -zv smtp.gmail.com 587
```
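If the official prom/alertmanager image is in use, `amtool` ships inside the container and can validate the config file inspected above:
```bash
# Validate AlertManager configuration syntax from inside the pod
kubectl exec -n monitoring alertmanager-0 -- \
  amtool check-config /etc/alertmanager/alertmanager.yml
```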
---
## 📞 Need Help?
1. Check full documentation: [infrastructure/kubernetes/base/components/monitoring/README.md](infrastructure/kubernetes/base/components/monitoring/README.md)
2. Review deployment summary: [MONITORING_DEPLOYMENT_SUMMARY.md](MONITORING_DEPLOYMENT_SUMMARY.md)
3. Check Prometheus logs: `kubectl logs -n monitoring prometheus-0`
4. Check AlertManager logs: `kubectl logs -n monitoring alertmanager-0`
5. Check Grafana logs: `kubectl logs -n monitoring deployment/grafana`
---
## 🎉 You're Done!
Your monitoring stack is now running in production!
**Next steps:**
1. Save your Grafana password securely
2. Set up on-call rotation
3. Review alert thresholds and adjust as needed
4. Create team-specific dashboards
5. Train team on using monitoring tools
**Access your monitoring:**
- Grafana: https://monitoring.yourdomain.com/grafana
- Prometheus: https://monitoring.yourdomain.com/prometheus
- AlertManager: https://monitoring.yourdomain.com/alertmanager
- Jaeger: https://monitoring.yourdomain.com/jaeger
---
*Deployment time: ~15 minutes*
*Last updated: 2026-01-07*