Add signoz

2026-01-08 12:58:00 +01:00
parent 07178f8972
commit dfb7e4b237
40 changed files with 2049 additions and 3935 deletions
--- a/docs/PILOT_LAUNCH_GUIDE.md
+++ b/docs/PILOT_LAUNCH_GUIDE.md
@@ -584,23 +584,39 @@ docker push YOUR_VPS_IP:32000/bakery/auth-service

 ### Step 2: Update Production Configuration

-```bash
-# On local machine, edit these files:
+The production configuration is already set up for the **bakewise.ai** domain:

+**Production URLs:**
+- **Main Application:** https://bakewise.ai
+- **API Endpoints:** https://bakewise.ai/api/v1/...
+- **Monitoring Dashboard:** https://monitoring.bakewise.ai/grafana
+- **Prometheus:** https://monitoring.bakewise.ai/prometheus
+- **SigNoz (Traces/Metrics/Logs):** https://monitoring.bakewise.ai/signoz
+- **AlertManager:** https://monitoring.bakewise.ai/alertmanager
+
+```bash
+# Verify the configuration is correct:
+grep -A 3 "host:" infrastructure/kubernetes/overlays/prod/prod-ingress.yaml
+
+# Expected output should show:
+# - host: bakewise.ai
+# - host: monitoring.bakewise.ai
+
+# Verify CORS configuration
+grep CORS infrastructure/kubernetes/overlays/prod/prod-configmap.yaml
+
+# Expected: CORS_ORIGINS: "https://bakewise.ai"
+```
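
The two manual checks above can also be scripted so they fail loudly when a host or CORS origin is missing. This is a minimal sketch, not part of the commit: the function name `check_prod_config` is hypothetical, and it assumes the ingress lists hosts as `host: <name>` and the ConfigMap has a `CORS_ORIGINS:` key, as the expected output above suggests.

```bash
# Hypothetical helper (not part of the repository).
# Fails if the expected ingress hosts or CORS origin are missing.
check_prod_config() {
  ingress="$1"     # path to prod-ingress.yaml
  configmap="$2"   # path to prod-configmap.yaml
  domain="$3"      # e.g. bakewise.ai

  for host in "$domain" "monitoring.$domain"; do
    # Match "host: <name>" with optional quotes, anchored at end of line
    pattern="host:[[:space:]]*\"?${host}\"?[[:space:]]*\$"
    if ! grep -Eq "$pattern" "$ingress"; then
      echo "missing ingress host: $host" >&2
      return 1
    fi
  done

  if ! grep -q "CORS_ORIGINS:.*https://${domain}" "$configmap"; then
    echo "CORS_ORIGINS does not include https://${domain}" >&2
    return 1
  fi
  echo "production config matches ${domain}"
}
```

Usage, from the repository root: `check_prod_config infrastructure/kubernetes/overlays/prod/prod-ingress.yaml infrastructure/kubernetes/overlays/prod/prod-configmap.yaml bakewise.ai`.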
+
+**If using a different domain**, update these files:
+```bash
 # 1. Update domain names
 nano infrastructure/kubernetes/overlays/prod/prod-ingress.yaml
-# Replace:
-# - bakery.yourdomain.com → bakery.your-actual-domain.com
-# - api.yourdomain.com → api.your-actual-domain.com
-# - monitoring.yourdomain.com → monitoring.your-actual-domain.com
-# - Update CORS origins
-# - Update cert-manager email
+# Replace bakewise.ai with your domain

 # 2. Update ConfigMap
 nano infrastructure/kubernetes/overlays/prod/prod-configmap.yaml
-# Set:
-# - DOMAIN: "your-actual-domain.com"
-# - CORS_ORIGINS: "https://bakery.your-actual-domain.com,https://www.your-actual-domain.com"
+# Update CORS_ORIGINS

 # 3. Verify image names (if using custom registry)
 nano infrastructure/kubernetes/overlays/prod/kustomization.yaml
@@ -840,22 +856,96 @@ kubectl logs -n bakery-ia deployment/auth-service | grep -i "email\|smtp"

 ## Post-Deployment

-### Step 1: Enable Monitoring
+### Step 1: Access Monitoring Stack

-```bash
-# Monitoring is already configured, verify it's running
-kubectl get pods -n monitoring
+The production monitoring stack covers dashboards, metrics, traces, logs, and alerting through four tools:

-# Access Grafana
-kubectl port-forward -n monitoring svc/grafana 3000:3000
+#### Production Monitoring URLs

-# Visit http://localhost:3000
-# Login: admin / (password from monitoring secrets)
-
-# Check dashboards are working
+Access via domain (recommended):
+```
+https://monitoring.bakewise.ai/grafana       # Dashboards & visualization
+https://monitoring.bakewise.ai/prometheus    # Metrics & queries
+https://monitoring.bakewise.ai/signoz        # Unified observability platform (traces, metrics, logs)
+https://monitoring.bakewise.ai/alertmanager  # Alert management
 ```

-### Step 2: Configure Backups
+Or via port forwarding (if needed):
+```bash
+# Grafana
+kubectl port-forward -n monitoring svc/grafana 3000:3000 &
+
+# Prometheus
+kubectl port-forward -n monitoring svc/prometheus-external 9090:9090 &
+
+# SigNoz
+kubectl port-forward -n monitoring svc/signoz-frontend 3301:3301 &
+
+# AlertManager
+kubectl port-forward -n monitoring svc/alertmanager-external 9093:9093 &
+```
+
+#### Available Dashboards
+
+Log in to Grafana (admin / your-password) and explore:
+
+**Main Dashboards:**
+1. **Gateway Metrics** - HTTP request rates, latencies, error rates
+2. **Services Overview** - Multi-service health and performance
+3. **Circuit Breakers** - Reliability metrics
+
+**Extended Dashboards:**
+4. **Service Performance Monitoring (SPM)** - RED metrics from distributed traces
+5. **PostgreSQL Database** - Database health, connections, query performance
+6. **Node Exporter Infrastructure** - CPU, memory, disk, network per node
+7. **AlertManager Monitoring** - Alert tracking and notification status
+8. **Business Metrics & KPIs** - Tenant activity, ML jobs, forecasts
+
+#### Quick Health Check
+
+```bash
+# Verify all monitoring pods are running
+kubectl get pods -n monitoring
+
+# Check Prometheus targets (all should be UP)
+# Run the port-forward in the background so the shell stays usable
+kubectl port-forward -n monitoring svc/prometheus-external 9090:9090 &
+PF_PID=$!
+# Open: http://localhost:9090/targets
+
+# View active alerts (reuses the same port-forward)
+# Open: http://localhost:9090/alerts
+
+# Stop the port-forward when done
+kill "$PF_PID"
+```
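
Scanning `kubectl get pods` output by eye is error-prone on a busy namespace. The sketch below filters it down to pods that are not fully ready. The helper name `unhealthy_pods` is hypothetical, and it assumes the default column layout of `kubectl get pods` (NAME, READY, STATUS, RESTARTS, AGE).

```bash
# Hypothetical helper (not part of the repository).
# Prints name, ready count, and status for every pod that is not healthy.
unhealthy_pods() {
  ns="${1:-monitoring}"
  kubectl get pods -n "$ns" --no-headers | awk '
    {
      split($2, r, "/")                 # READY column, e.g. "1/1"
      if ($3 == "Completed") next       # finished jobs are fine
      if ($3 != "Running" || r[1] != r[2]) print $1, $2, $3
    }'
}
```

`unhealthy_pods monitoring` prints nothing when every pod is Running with all containers ready, so empty output is the healthy case.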
+
+### Step 2: Configure Alerting
+
+Update AlertManager with your notification email addresses:
+
+```bash
+# Edit alertmanager configuration
+kubectl edit configmap -n monitoring alertmanager-config
+
+# Update recipient emails in the routes section:
+# - alerts@bakewise.ai (general alerts)
+# - critical-alerts@bakewise.ai (critical issues)
+# - oncall@bakewise.ai (on-call rotation)
+```
+
+Test alert delivery:
+```bash
+# Trigger a test alert: run a pod that allocates about 600 MB of memory for 300 seconds
+kubectl run memory-test --image=polinux/stress --restart=Never \
+  --namespace=bakery-ia -- stress --vm 1 --vm-bytes 600M --timeout 300s
+
+# Check alert appears in AlertManager
+# https://monitoring.bakewise.ai/alertmanager
+
+# Verify email notification received
+
+# Clean up test
+kubectl delete pod memory-test -n bakery-ia
+```
+
+### Step 3: Configure Backups

 ```bash
 # Create backup script on VPS
@@ -902,26 +992,82 @@ kubectl edit configmap -n monitoring alertmanager-config
 # Update recipient emails in the routes section
 ```

-### Step 4: Document Everything
+### Step 4: Verify Monitoring is Working

-Create a runbook with:
- [ ] VPS login credentials (stored securely)
+Before proceeding, ensure all monitoring components are operational:
+
+```bash
+# 1. Check Prometheus targets
+# Open: https://monitoring.bakewise.ai/prometheus/targets
+# All targets should show "UP" status
+
+# 2. Verify Grafana dashboards load data
+# Open: https://monitoring.bakewise.ai/grafana
+# Navigate to any dashboard and verify metrics are displaying
+
+# 3. Check SigNoz is receiving traces
+# Open: https://monitoring.bakewise.ai/signoz
+# Search for traces from "gateway" service
+
+# 4. Verify AlertManager cluster
+# Open: https://monitoring.bakewise.ai/alertmanager
+# Check that all 3 AlertManager instances are connected
+```
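
The four browser checks above confirm that data is flowing. A quicker first pass is to confirm each endpoint answers at all. This is a rough sketch under assumptions: the function names are hypothetical, the paths are the ones listed in this guide, and any 2xx or 3xx status (for example a redirect to a login page) is treated as reachable.

```bash
# Hypothetical helpers (not part of the repository).
# check_url prints OK/FAIL with the HTTP status for a single URL.
check_url() {
  url="$1"
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url") || code="000"
  case "$code" in
    2??|3??) echo "OK   $code $url" ;;
    *)       echo "FAIL $code $url"; return 1 ;;
  esac
}

# check_monitoring_urls probes all four tools and returns 1 if any failed.
check_monitoring_urls() {
  base="${1:-https://monitoring.bakewise.ai}"
  failed=0
  for p in grafana prometheus signoz alertmanager; do
    check_url "$base/$p" || failed=1
  done
  return "$failed"
}
```

Running `check_monitoring_urls` with no argument probes the production host. A passing result only shows the ingress routes respond, so the manual checks for targets, traces, and cluster status still apply.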
+
+### Step 5: Document Everything
+
+Create a secure runbook with all credentials and procedures:
+
+**Essential Information to Document:**
+- [ ] VPS login credentials (stored securely in password manager)
 - [ ] Database passwords (in password manager)
- [ ] Domain registrar access
+- [ ] Grafana admin password
+- [ ] Domain registrar access (for bakewise.ai)
 - [ ] Cloudflare access
- [ ] Email service credentials
+- [ ] Email service credentials (SMTP)
 - [ ] WhatsApp API credentials
 - [ ] Docker Hub / Registry credentials
 - [ ] Emergency contact information
 - [ ] Rollback procedures
+- [ ] Monitoring URLs and access procedures

-### Step 5: Train Your Team
+### Step 6: Train Your Team

- [ ] Show team how to access Grafana dashboards
- [ ] Demonstrate how to check logs: `kubectl logs`
- [ ] Explain how to restart services if needed
- [ ] Share this documentation with the team
- [ ] Setup on-call rotation (if applicable)
+Conduct a training session covering:
+
+- [ ] **Access monitoring dashboards**
+  - Show how to log in to https://monitoring.bakewise.ai/grafana
+  - Walk through key dashboards (Services Overview, Database, Infrastructure)
+  - Explain how to interpret metrics and identify issues
+
+- [ ] **Check application logs**
+  ```bash
+  # View logs for a service
+  kubectl logs -n bakery-ia deployment/orders-service --tail=100 -f
+
+  # Search for errors
+  kubectl logs -n bakery-ia deployment/gateway | grep ERROR
+  ```
+
+- [ ] **Restart services when needed**
+  ```bash
+  # Restart a service (rolling update, no downtime)
+  kubectl rollout restart deployment/orders-service -n bakery-ia
+  ```
+
+- [ ] **Respond to alerts**
+  - Show how to access AlertManager at https://monitoring.bakewise.ai/alertmanager
+  - Review common alerts and their resolution steps
+  - Reference the [Production Operations Guide](./PRODUCTION_OPERATIONS_GUIDE.md)
+
+- [ ] **Share documentation**
+  - [PILOT_LAUNCH_GUIDE.md](./PILOT_LAUNCH_GUIDE.md) - This guide
+  - [PRODUCTION_OPERATIONS_GUIDE.md](./PRODUCTION_OPERATIONS_GUIDE.md) - Daily operations
+  - [security-checklist.md](./security-checklist.md) - Security procedures
+
+- [ ] **Set up on-call rotation** (if applicable)
+  - Configure in AlertManager
+  - Document escalation procedures
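
For the restart step in the training list, it may help to show the team how to confirm that a rolling restart actually finished. The sketch below pairs `kubectl rollout restart` with `kubectl rollout status`. The wrapper name `restart_and_wait` and the 120s default timeout are illustrative assumptions, not values from the repository.

```bash
# Hypothetical wrapper (not part of the repository).
# Restarts a deployment and blocks until the rollout completes or times out.
restart_and_wait() {
  deploy="$1"
  ns="${2:-bakery-ia}"
  timeout="${3:-120s}"
  kubectl rollout restart "deployment/$deploy" -n "$ns" &&
    kubectl rollout status "deployment/$deploy" -n "$ns" --timeout="$timeout"
}
```

For example, `restart_and_wait orders-service` returns non-zero if the new pods do not become ready within the timeout, which is a cue to check the logs before retrying.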

 ---

@@ -1050,16 +1196,25 @@ kubectl scale deployment monitoring -n bakery-ia --replicas=0

 ## Support Resources

- **Full Monitoring Guide:** [MONITORING_DEPLOYMENT_SUMMARY.md](./MONITORING_DEPLOYMENT_SUMMARY.md)
- **Operations Guide:** [PRODUCTION_OPERATIONS_GUIDE.md](./PRODUCTION_OPERATIONS_GUIDE.md)
- **Security Guide:** [security-checklist.md](./security-checklist.md)
- **Database Security:** [database-security.md](./database-security.md)
- **TLS Configuration:** [tls-configuration.md](./tls-configuration.md)
+**Documentation:**
+- **Operations Guide:** [PRODUCTION_OPERATIONS_GUIDE.md](./PRODUCTION_OPERATIONS_GUIDE.md) - Daily operations, monitoring, incident response
+- **Security Guide:** [security-checklist.md](./security-checklist.md) - Security procedures and compliance
+- **Database Security:** [database-security.md](./database-security.md) - Database operations and TLS configuration
+- **TLS Configuration:** [tls-configuration.md](./tls-configuration.md) - Certificate management
+- **RBAC Implementation:** [rbac-implementation.md](./rbac-implementation.md) - Access control

+**Monitoring Access:**
+- **Grafana:** https://monitoring.bakewise.ai/grafana (admin / your-password)
+- **Prometheus:** https://monitoring.bakewise.ai/prometheus
+- **SigNoz:** https://monitoring.bakewise.ai/signoz
+- **AlertManager:** https://monitoring.bakewise.ai/alertmanager
+
+**External Resources:**
 - **MicroK8s Docs:** https://microk8s.io/docs
 - **Kubernetes Docs:** https://kubernetes.io/docs
 - **Let's Encrypt:** https://letsencrypt.org/docs
 - **Cloudflare DNS:** https://developers.cloudflare.com/dns
+- **Monitoring Stack README:** infrastructure/kubernetes/base/components/monitoring/README.md

 ---