Improve monitoring 6

This commit is contained in:
Urtzi Alfaro
2026-01-10 13:43:38 +01:00
parent c05538cafb
commit b089c216db
13 changed files with 1248 additions and 2546 deletions


@@ -856,87 +856,227 @@ kubectl logs -n bakery-ia deployment/auth-service | grep -i "email\|smtp"
## Post-Deployment
### Step 1: Access Monitoring Stack
### Step 1: Access SigNoz Monitoring Stack
Your production monitoring stack provides complete observability with multiple tools:
Your production deployment includes **SigNoz**, a unified observability platform that provides complete visibility into your application:
#### What is SigNoz?
SigNoz is an **open-source, all-in-one observability platform** that provides:
- **📊 Distributed Tracing** - See end-to-end request flows across all 18 microservices
- **📈 Metrics Monitoring** - Application performance and infrastructure metrics
- **📝 Log Management** - Centralized logs from all services with trace correlation
- **🔍 Service Performance Monitoring (SPM)** - Automatic RED metrics (Rate, Error, Duration)
- **🗄️ Database Monitoring** - All 18 PostgreSQL databases + Redis + RabbitMQ
- **☸️ Kubernetes Monitoring** - Cluster, node, pod, and container metrics
**Why SigNoz instead of Prometheus/Grafana?**
- Single unified UI for traces, metrics, and logs (no context switching)
- Automatic service dependency mapping
- Built-in APM (Application Performance Monitoring)
- Log-trace correlation with one click
- Better query performance with ClickHouse backend
- Modern UI designed for microservices
#### Production Monitoring URLs
Access via domain (recommended):
Access via domain:
```
https://monitoring.bakewise.ai/grafana # Dashboards & visualization
https://monitoring.bakewise.ai/prometheus # Metrics & queries
https://monitoring.bakewise.ai/signoz # Unified observability platform (traces, metrics, logs)
https://monitoring.bakewise.ai/alertmanager # Alert management
https://monitoring.bakewise.ai/signoz # SigNoz - Main observability UI
https://monitoring.bakewise.ai/alertmanager # AlertManager - Alert management
```
Or via port forwarding (if needed):
```bash
# Grafana
kubectl port-forward -n monitoring svc/grafana 3000:3000 &
# SigNoz Frontend (Main UI)
kubectl port-forward -n bakery-ia svc/signoz 8080:8080 &
# Open: http://localhost:8080
# Prometheus
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090 &
# SigNoz AlertManager
kubectl port-forward -n bakery-ia svc/signoz-alertmanager 9093:9093 &
# Open: http://localhost:9093
# SigNoz
kubectl port-forward -n monitoring svc/signoz-frontend 3301:3301 &
# AlertManager
kubectl port-forward -n monitoring svc/alertmanager-external 9093:9093 &
# OTel Collector (for debugging)
kubectl port-forward -n bakery-ia svc/signoz-otel-collector 4317:4317 & # gRPC
kubectl port-forward -n bakery-ia svc/signoz-otel-collector 4318:4318 & # HTTP
```
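After port-forwarding the collector, you can quickly confirm the OTLP/HTTP receiver is accepting data. A minimal sketch (assumes the standard OTLP/HTTP traces path on port 4318; an empty export should return HTTP status 200):

```bash
# Send an empty OTLP/HTTP trace export to the forwarded collector port.
# A 200 status code means the receiver is up and parsing requests.
curl -s -o /dev/null -w "%{http_code}\n" \
  -X POST http://localhost:4318/v1/traces \
  -H "Content-Type: application/json" \
  -d '{"resourceSpans": []}'
```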
#### Available Dashboards
#### Key SigNoz Features to Explore
Login to Grafana (admin / your-password) and explore:
Once you open SigNoz (https://monitoring.bakewise.ai/signoz), explore these tabs:
**Main Dashboards:**
1. **Gateway Metrics** - HTTP request rates, latencies, error rates
2. **Services Overview** - Multi-service health and performance
3. **Circuit Breakers** - Reliability metrics
**1. Services Tab - Application Performance**
- View all 18 microservices with live metrics
- See request rate, error rate, and latency (P50/P90/P99)
- Click on any service to drill down into operations
- Identify slow endpoints and error-prone operations
**Extended Dashboards:**
4. **Service Performance Monitoring (SPM)** - RED metrics from distributed traces
5. **PostgreSQL Database** - Database health, connections, query performance
6. **Node Exporter Infrastructure** - CPU, memory, disk, network per node
7. **AlertManager Monitoring** - Alert tracking and notification status
8. **Business Metrics & KPIs** - Tenant activity, ML jobs, forecasts
**2. Traces Tab - Request Flow Visualization**
- See complete request journeys across services
- Identify bottlenecks (slow database queries, API calls)
- Debug errors with full stack traces
- Correlate with logs for complete context
**3. Dashboards Tab - Infrastructure & Database Metrics**
- **PostgreSQL** - Monitor all 18 databases (connections, queries, cache hit ratio)
- **Redis** - Cache performance (memory, hit rate, commands/sec)
- **RabbitMQ** - Message queue health (depth, rates, consumers)
- **Kubernetes** - Cluster metrics (nodes, pods, containers)
**4. Logs Tab - Centralized Log Management**
- Search and filter logs from all services
- Click on trace ID in logs to see related request trace
- Auto-enriched with Kubernetes metadata (pod, namespace, container)
- Identify patterns and anomalies
**5. Alerts Tab - Proactive Monitoring**
- Configure alerts on metrics, traces, or logs
- Email/Slack/Webhook notifications
- View firing alerts and alert history
#### Quick Health Check
```bash
# Verify all monitoring pods are running
kubectl get pods -n monitoring
# Verify SigNoz components are running
kubectl get pods -n bakery-ia -l app.kubernetes.io/instance=signoz
# Check Prometheus targets (all should be UP)
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090
# Open: http://localhost:9090/targets
# Expected output:
# signoz-0 READY 1/1
# signoz-otel-collector-xxx READY 1/1
# signoz-alertmanager-xxx READY 1/1
# signoz-clickhouse-xxx READY 1/1
# signoz-zookeeper-xxx READY 1/1
# View active alerts
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090
# Open: http://localhost:9090/alerts
# Check OTel Collector health
kubectl exec -n bakery-ia deployment/signoz-otel-collector -- wget -qO- http://localhost:13133
# View recent telemetry in OTel Collector logs
kubectl logs -n bakery-ia deployment/signoz-otel-collector --tail=50 | grep -i "traces\|metrics\|logs"
```
#### Verify Telemetry is Working
1. **Check Services are Reporting:**
```bash
# Open SigNoz and navigate to Services tab
# You should see all 18 microservices listed
# If services are missing, check if they're sending telemetry:
kubectl logs -n bakery-ia deployment/auth-service | grep -i "telemetry\|otel"
```
2. **Check Database Metrics:**
```bash
# Navigate to Dashboards → PostgreSQL in SigNoz
# You should see metrics from all 18 databases
# Verify OTel Collector is scraping databases:
kubectl logs -n bakery-ia deployment/signoz-otel-collector | grep postgresql
```
3. **Check Traces are Being Collected:**
```bash
# Make a test API request
curl https://bakewise.ai/api/v1/health
# Navigate to Traces tab in SigNoz
# Search for "gateway" service
# You should see the trace for your request
```
4. **Check Logs are Being Collected:**
```bash
# Navigate to Logs tab in SigNoz
# Filter by namespace: bakery-ia
# You should see logs from all pods
# Verify filelog receiver is working:
kubectl logs -n bakery-ia deployment/signoz-otel-collector | grep filelog
```
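As an optional lower-level check, you can ask ClickHouse directly whether telemetry tables are receiving rows. A hedged sketch (the `signoz_traces` database name follows SigNoz's default schema; adjust if your deployment uses different names):

```bash
# List trace tables and their row counts; non-zero totals confirm spans
# are being written. Mirrors the system.tables check used later in this guide.
kubectl exec -n bakery-ia deployment/signoz-clickhouse -- \
  clickhouse-client --query="SELECT table, total_rows FROM system.tables WHERE database = 'signoz_traces' ORDER BY total_rows DESC LIMIT 10"
```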
### Step 2: Configure Alerting
Update AlertManager with your notification email addresses:
SigNoz includes integrated alerting with AlertManager. Configure it for your team:
#### Update Email Notification Settings
The alerting configuration is in the SigNoz Helm values. To update:
```bash
# Edit alertmanager configuration
kubectl edit configmap -n monitoring alertmanager-config
# For production, edit the values file:
nano infrastructure/helm/signoz-values-prod.yaml
# Update recipient emails in the routes section:
# - alerts@bakewise.ai (general alerts)
# - critical-alerts@bakewise.ai (critical issues)
# - oncall@bakewise.ai (on-call rotation)
# Update the alertmanager.config section:
# 1. Update SMTP settings:
# - smtp_from: 'your-alerts@bakewise.ai'
# - smtp_auth_username: 'your-alerts@bakewise.ai'
# - smtp_auth_password: (use Kubernetes secret)
#
# 2. Update receivers:
# - critical-alerts email: critical-alerts@bakewise.ai
# - warning-alerts email: oncall@bakewise.ai
#
# 3. (Optional) Add Slack webhook for critical alerts
# Apply the updated configuration:
helm upgrade signoz signoz/signoz \
-n bakery-ia \
-f infrastructure/helm/signoz-values-prod.yaml
```
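To avoid committing the SMTP password to the values file, you can keep it in a Kubernetes Secret and override only the non-sensitive fields at upgrade time. A hedged sketch (the secret name and the `alertmanager.config.*` value paths are assumptions; check the chart's schema and your `signoz-values-prod.yaml` for the real keys):

```bash
# Store the SMTP password in a Secret instead of the values file.
kubectl create secret generic alertmanager-smtp \
  --namespace bakery-ia \
  --from-literal=smtp_auth_password='REPLACE_WITH_REAL_PASSWORD' \
  --dry-run=client -o yaml | kubectl apply -f -

# Override the sender/username at upgrade time; reference the secret from
# the values file rather than pasting the password inline.
# (Value paths below are assumptions; adjust to match the chart.)
helm upgrade signoz signoz/signoz \
  -n bakery-ia \
  -f infrastructure/helm/signoz-values-prod.yaml \
  --set alertmanager.config.global.smtp_from='alerts@bakewise.ai' \
  --set alertmanager.config.global.smtp_auth_username='alerts@bakewise.ai'
```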
Test alert delivery:
#### Create Alerts in SigNoz UI
1. **Open SigNoz Alerts Tab:**
```
https://monitoring.bakewise.ai/signoz → Alerts
```
2. **Create Common Alerts:**
**Alert 1: High Error Rate**
- Name: `HighErrorRate`
- Query: `error_rate > 5` for `5 minutes`
- Severity: `critical`
- Description: "Service {{service_name}} has error rate >5%"
**Alert 2: High Latency**
- Name: `HighLatency`
- Query: `P99_latency > 3000ms` for `5 minutes`
- Severity: `warning`
- Description: "Service {{service_name}} P99 latency >3s"
**Alert 3: Service Down**
- Name: `ServiceDown`
- Query: `request_rate == 0` for `2 minutes`
- Severity: `critical`
- Description: "Service {{service_name}} not receiving requests"
**Alert 4: Database Connection Issues**
- Name: `DatabaseConnectionsHigh`
- Query: `pg_active_connections > 80` for `5 minutes`
- Severity: `warning`
- Description: "Database {{database}} connection count >80%"
**Alert 5: High Memory Usage**
- Name: `HighMemoryUsage`
- Query: `container_memory_percent > 85` for `5 minutes`
- Severity: `warning`
- Description: "Pod {{pod_name}} using >85% memory"
#### Test Alert Delivery
```bash
# Fire a test alert
# Method 1: Create a test alert in SigNoz UI
# Go to Alerts → New Alert → Set a test condition that will fire
# Method 2: Fire a test alert via stress test
kubectl run memory-test --image=polinux/stress --restart=Never \
--namespace=bakery-ia -- stress --vm 1 --vm-bytes 600M --timeout 300s
# Check alert appears in AlertManager
# Check alert appears in SigNoz Alerts tab
# https://monitoring.bakewise.ai/signoz → Alerts
# Also check AlertManager
# https://monitoring.bakewise.ai/alertmanager
# Verify email notification received
@@ -945,6 +1085,26 @@ kubectl run memory-test --image=polinux/stress --restart=Never \
kubectl delete pod memory-test -n bakery-ia
```
#### Configure Notification Channels
In SigNoz Alerts tab, configure channels:
1. **Email Channel:**
- Already configured via AlertManager
- Emails sent to addresses in signoz-values-prod.yaml
2. **Slack Channel (Optional):**
```bash
# Add Slack webhook URL to signoz-values-prod.yaml
# Under alertmanager.config.receivers.critical-alerts.slack_configs:
# - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
# channel: '#alerts-critical'
```
3. **Webhook Channel (Optional):**
- Configure custom webhook for integration with PagerDuty, OpsGenie, etc.
- Add to alertmanager.config.receivers
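Before wiring Slack or a custom webhook into AlertManager, it helps to confirm the endpoint itself accepts messages. A minimal sketch assuming a standard Slack incoming webhook (replace the placeholder URL with your real one):

```bash
# Post a test message directly to the webhook; Slack replies with "ok"
# when the webhook URL is valid.
curl -s -X POST 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL' \
  -H 'Content-Type: application/json' \
  -d '{"text": "Test notification from bakery-ia monitoring setup"}'
```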
### Step 3: Configure Backups
```bash
@@ -992,26 +1152,61 @@ kubectl edit configmap -n monitoring alertmanager-config
# Update recipient emails in the routes section
```
### Step 4: Verify Monitoring is Working
### Step 4: Verify SigNoz Monitoring is Working
Before proceeding, ensure all monitoring components are operational:
```bash
# 1. Check Prometheus targets
# Open: https://monitoring.bakewise.ai/prometheus/targets
# All targets should show "UP" status
# 1. Verify SigNoz pods are running
kubectl get pods -n bakery-ia -l app.kubernetes.io/instance=signoz
# 2. Verify Grafana dashboards load data
# Open: https://monitoring.bakewise.ai/grafana
# Navigate to any dashboard and verify metrics are displaying
# Expected pods (all should be Running/Ready):
# - signoz-0 (or signoz-1, signoz-2 for HA)
# - signoz-otel-collector-xxx
# - signoz-alertmanager-xxx
# - signoz-clickhouse-xxx
# - signoz-zookeeper-xxx
# 3. Check SigNoz is receiving traces
# Open: https://monitoring.bakewise.ai/signoz
# Search for traces from "gateway" service
# 2. Check SigNoz UI is accessible
curl -I https://monitoring.bakewise.ai/signoz
# Should return: HTTP/2 200
# 4. Verify AlertManager cluster
# Open: https://monitoring.bakewise.ai/alertmanager
# Check that all 3 AlertManager instances are connected
# 3. Verify OTel Collector is receiving data
kubectl logs -n bakery-ia deployment/signoz-otel-collector --tail=100 | grep -i "received"
# Should show: "Traces received: X" "Metrics received: Y" "Logs received: Z"
# 4. Check ClickHouse database is healthy
kubectl exec -n bakery-ia deployment/signoz-clickhouse -- clickhouse-client --query="SELECT count() FROM system.tables WHERE database LIKE 'signoz_%'"
# Should return a number > 0 (tables exist)
```
**Complete Verification Checklist:**
- [ ] **SigNoz UI loads** at https://monitoring.bakewise.ai/signoz
- [ ] **Services tab shows all 18 microservices** with metrics
- [ ] **Traces tab has sample traces** from gateway and other services
- [ ] **Dashboards tab shows PostgreSQL metrics** from all 18 databases
- [ ] **Dashboards tab shows Redis metrics** (memory, commands, etc.)
- [ ] **Dashboards tab shows RabbitMQ metrics** (queues, messages)
- [ ] **Dashboards tab shows Kubernetes metrics** (nodes, pods)
- [ ] **Logs tab displays logs** from all services in bakery-ia namespace
- [ ] **Alerts tab is accessible** and can create new alerts
- [ ] **AlertManager** is reachable at https://monitoring.bakewise.ai/alertmanager
**If any checks fail, troubleshoot:**
```bash
# Check OTel Collector configuration
kubectl describe configmap -n bakery-ia signoz-otel-collector
# Check for errors in OTel Collector
kubectl logs -n bakery-ia deployment/signoz-otel-collector | grep -i error
# Check ClickHouse is accepting writes
kubectl logs -n bakery-ia deployment/signoz-clickhouse | grep -i error
# Restart OTel Collector if needed
kubectl rollout restart deployment/signoz-otel-collector -n bakery-ia
```
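One failure mode worth checking explicitly is ClickHouse running out of disk, which stalls ingestion without an obvious error in the UI. A hedged sketch (assumes the default ClickHouse data path of `/var/lib/clickhouse` inside the container):

```bash
# Check free space on the ClickHouse data volume
kubectl exec -n bakery-ia deployment/signoz-clickhouse -- df -h /var/lib/clickhouse

# Compare against the PVC size requested for ClickHouse
kubectl get pvc -n bakery-ia | grep -i clickhouse
```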
### Step 5: Document Everything
@@ -1033,41 +1228,113 @@ Create a secure runbook with all credentials and procedures:
### Step 6: Train Your Team
Conduct a training session covering:
Conduct a training session covering SigNoz and operational procedures:
- [ ] **Access monitoring dashboards**
- Show how to login to https://monitoring.bakewise.ai/grafana
- Walk through key dashboards (Services Overview, Database, Infrastructure)
- Explain how to interpret metrics and identify issues
#### Part 1: SigNoz Navigation (30 minutes)
- [ ] **Check application logs**
- [ ] **Login and Overview**
- Show how to access https://monitoring.bakewise.ai/signoz
- Navigate through main tabs: Services, Traces, Dashboards, Logs, Alerts
- Explain the unified nature of SigNoz (all-in-one platform)
- [ ] **Services Tab - Application Performance Monitoring**
- Show all 18 microservices
- Explain RED metrics (Request rate, Error rate, Duration/latency)
- Demo: Click on a service → Operations → See endpoint breakdown
- Demo: Identify slow endpoints and high error rates
- [ ] **Traces Tab - Request Flow Debugging**
- Show how to search for traces by service, operation, or time
- Demo: Click on a trace → See full waterfall (service → database → cache)
- Demo: Find slow database queries in trace spans
- Demo: Click "View Logs" to correlate trace with logs
- [ ] **Dashboards Tab - Infrastructure Monitoring**
- Navigate to PostgreSQL dashboard → Show all 18 databases
- Navigate to Redis dashboard → Show cache metrics
- Navigate to Kubernetes dashboard → Show node/pod metrics
- Explain what metrics indicate issues (connection %, memory %, etc.)
- [ ] **Logs Tab - Log Search and Analysis**
- Show how to filter by service, severity, time range
- Demo: Search for "error" in last hour
- Demo: Click on trace_id in log → Jump to related trace
- Show Kubernetes metadata (pod, namespace, container)
- [ ] **Alerts Tab - Proactive Monitoring**
- Show how to create alerts on metrics
- Review pre-configured alerts
- Show alert history and firing alerts
- Explain how to acknowledge/silence alerts
#### Part 2: Operational Tasks (30 minutes)
- [ ] **Check application logs** (multiple ways)
```bash
# View logs for a service
# Method 1: Via kubectl (for immediate debugging)
kubectl logs -n bakery-ia deployment/orders-service --tail=100 -f
# Search for errors
kubectl logs -n bakery-ia deployment/gateway | grep ERROR
# Method 2: Via SigNoz Logs tab (for analysis and correlation)
# 1. Open https://monitoring.bakewise.ai/signoz → Logs
# 2. Filter by k8s_deployment_name: orders-service
# 3. Click on trace_id to see related request flow
```
- [ ] **Restart services when needed**
```bash
# Restart a service (rolling update, no downtime)
kubectl rollout restart deployment/orders-service -n bakery-ia
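# Hedged follow-up (timeout value is an assumption): wait for the rollout
# to complete before checking SigNoz for the recovery.
kubectl rollout status deployment/orders-service -n bakery-ia --timeout=120s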
# Verify restart in SigNoz:
# 1. Check Services tab → orders-service → Should show brief dip then recovery
# 2. Check Logs tab → Filter by orders-service → See restart logs
```
- [ ] **Investigate performance issues**
```bash
# Scenario: "Orders API is slow"
# 1. SigNoz → Services → orders-service → Check P99 latency
# 2. SigNoz → Traces → Filter service:orders-service, duration:>1s
# 3. Click on slow trace → Identify bottleneck (DB query? External API?)
# 4. SigNoz → Dashboards → PostgreSQL → Check orders_db connections/queries
# 5. Fix identified issue (add index, optimize query, scale service)
```
- [ ] **Respond to alerts**
- Show how to access AlertManager at https://monitoring.bakewise.ai/alertmanager
- Show how to access alerts in SigNoz → Alerts tab
- Show AlertManager UI at https://monitoring.bakewise.ai/alertmanager
- Review common alerts and their resolution steps
- Reference the [Production Operations Guide](./PRODUCTION_OPERATIONS_GUIDE.md)
#### Part 3: Documentation and Resources (10 minutes)
- [ ] **Share documentation**
- [PILOT_LAUNCH_GUIDE.md](./PILOT_LAUNCH_GUIDE.md) - This guide
- [PRODUCTION_OPERATIONS_GUIDE.md](./PRODUCTION_OPERATIONS_GUIDE.md) - Daily operations
- [PILOT_LAUNCH_GUIDE.md](./PILOT_LAUNCH_GUIDE.md) - This guide (deployment)
- [PRODUCTION_OPERATIONS_GUIDE.md](./PRODUCTION_OPERATIONS_GUIDE.md) - Daily operations with SigNoz
- [security-checklist.md](./security-checklist.md) - Security procedures
- [ ] **Bookmark key URLs**
- SigNoz: https://monitoring.bakewise.ai/signoz
- AlertManager: https://monitoring.bakewise.ai/alertmanager
- Production app: https://bakewise.ai
- [ ] **Setup on-call rotation** (if applicable)
- Configure in AlertManager
- Configure rotation schedule in AlertManager
- Document escalation procedures
- Test alert delivery to on-call phone/email
#### Part 4: Hands-On Exercise (15 minutes)
**Exercise: Investigate a Simulated Issue**
1. Create a load test to generate traffic (see the sketch below)
2. Use SigNoz to find the slowest endpoint
3. Identify the root cause using traces
4. Correlate with logs to confirm
5. Check infrastructure metrics (DB, memory, CPU)
6. Propose a fix based on findings
This trains the team to use SigNoz effectively for real incidents.
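For step 1, a hedged traffic-generation sketch using the public health endpoint referenced earlier in this guide (adjust the URL and request count to something representative of your workload):

```bash
# Generate ~200 requests of steady traffic so traces, metrics, and logs
# appear in SigNoz within a few minutes.
for i in $(seq 1 200); do
  curl -s -o /dev/null -w "%{http_code} " https://bakewise.ai/api/v1/health
  sleep 0.2
done
echo
```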
---
@@ -1204,17 +1471,33 @@ kubectl scale deployment monitoring -n bakery-ia --replicas=0
- **RBAC Implementation:** [rbac-implementation.md](./rbac-implementation.md) - Access control
**Monitoring Access:**
- **Grafana:** https://monitoring.bakewise.ai/grafana (admin / your-password)
- **Prometheus:** https://monitoring.bakewise.ai/prometheus
- **SigNoz:** https://monitoring.bakewise.ai/signoz
- **AlertManager:** https://monitoring.bakewise.ai/alertmanager
- **SigNoz (Primary):** https://monitoring.bakewise.ai/signoz - All-in-one observability
- Services: Application performance monitoring (APM)
- Traces: Distributed tracing across all services
- Dashboards: PostgreSQL, Redis, RabbitMQ, Kubernetes metrics
- Logs: Centralized log management with trace correlation
- Alerts: Alert configuration and management
- **AlertManager:** https://monitoring.bakewise.ai/alertmanager - Alert routing and notifications
**External Resources:**
- **MicroK8s Docs:** https://microk8s.io/docs
- **Kubernetes Docs:** https://kubernetes.io/docs
- **Let's Encrypt:** https://letsencrypt.org/docs
- **Cloudflare DNS:** https://developers.cloudflare.com/dns
- **Monitoring Stack README:** infrastructure/kubernetes/base/components/monitoring/README.md
- **SigNoz Documentation:** https://signoz.io/docs/
- **OpenTelemetry Documentation:** https://opentelemetry.io/docs/
**Monitoring Architecture:**
- **OpenTelemetry:** Industry-standard instrumentation framework
- Auto-instruments FastAPI, HTTPX, SQLAlchemy, Redis
- Collects traces, metrics, and logs from all services
- Exports to SigNoz via OTLP protocol (gRPC port 4317, HTTP port 4318)
- **SigNoz Components:**
- **Frontend:** Web UI for visualization and analysis
- **OTel Collector:** Receives and processes telemetry data
- **ClickHouse:** Time-series database for fast queries
- **AlertManager:** Alert routing and notification delivery
- **Zookeeper:** Coordination service for ClickHouse cluster
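To confirm a given service is actually wired into this pipeline, inspect its OTLP-related environment variables. A hedged sketch (the variable names follow standard OpenTelemetry SDK conventions and are assumptions about how these services are configured):

```bash
# Look for the OTLP exporter endpoint and service name on a service pod
kubectl exec -n bakery-ia deployment/auth-service -- env | grep -E "OTEL_EXPORTER_OTLP|OTEL_SERVICE_NAME"

# Expected (approximately; exact values depend on your deployment):
#   OTEL_EXPORTER_OTLP_ENDPOINT=http://signoz-otel-collector:4317
#   OTEL_SERVICE_NAME=auth-service
```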
---