Improve monitoring 6
@@ -856,87 +856,227 @@ kubectl logs -n bakery-ia deployment/auth-service | grep -i "email\|smtp"

## Post-Deployment

### Step 1: Access SigNoz Monitoring Stack

Your production deployment includes **SigNoz**, a unified observability platform that provides complete visibility into your application:

#### What is SigNoz?

SigNoz is an **open-source, all-in-one observability platform** that provides:
- **📊 Distributed Tracing** - See end-to-end request flows across all 18 microservices
- **📈 Metrics Monitoring** - Application performance and infrastructure metrics
- **📝 Log Management** - Centralized logs from all services with trace correlation
- **🔍 Service Performance Monitoring (SPM)** - Automatic RED metrics (Rate, Error, Duration)
- **🗄️ Database Monitoring** - All 18 PostgreSQL databases + Redis + RabbitMQ
- **☸️ Kubernetes Monitoring** - Cluster, node, pod, and container metrics

**Why SigNoz instead of Prometheus/Grafana?**
- Single unified UI for traces, metrics, and logs (no context switching)
- Automatic service dependency mapping
- Built-in APM (Application Performance Monitoring)
- Log-trace correlation with one click
- Better query performance with ClickHouse backend
- Modern UI designed for microservices

#### Production Monitoring URLs

Access via domain:
```
https://monitoring.bakewise.ai/signoz        # SigNoz - Main observability UI
https://monitoring.bakewise.ai/alertmanager  # AlertManager - Alert management
```

Or via port forwarding (if needed):
```bash
# SigNoz Frontend (Main UI)
kubectl port-forward -n bakery-ia svc/signoz 8080:8080 &
# Open: http://localhost:8080

# SigNoz AlertManager
kubectl port-forward -n bakery-ia svc/signoz-alertmanager 9093:9093 &
# Open: http://localhost:9093

# OTel Collector (for debugging)
kubectl port-forward -n bakery-ia svc/signoz-otel-collector 4317:4317 & # gRPC
kubectl port-forward -n bakery-ia svc/signoz-otel-collector 4318:4318 & # HTTP
```

#### Key SigNoz Features to Explore

Once you open SigNoz (https://monitoring.bakewise.ai/signoz), explore these tabs; a short traffic-generation sketch follows the list so each tab has data to show:

**1. Services Tab - Application Performance**
- View all 18 microservices with live metrics
- See request rate, error rate, and latency (P50/P90/P99)
- Click on any service to drill down into operations
- Identify slow endpoints and error-prone operations

**2. Traces Tab - Request Flow Visualization**
- See complete request journeys across services
- Identify bottlenecks (slow database queries, API calls)
- Debug errors with full stack traces
- Correlate with logs for complete context

**3. Dashboards Tab - Infrastructure & Database Metrics**
- **PostgreSQL** - Monitor all 18 databases (connections, queries, cache hit ratio)
- **Redis** - Cache performance (memory, hit rate, commands/sec)
- **RabbitMQ** - Message queue health (depth, rates, consumers)
- **Kubernetes** - Cluster metrics (nodes, pods, containers)

**4. Logs Tab - Centralized Log Management**
- Search and filter logs from all services
- Click on a trace ID in a log to see the related request trace
- Auto-enriched with Kubernetes metadata (pod, namespace, container)
- Identify patterns and anomalies

**5. Alerts Tab - Proactive Monitoring**
- Configure alerts on metrics, traces, or logs
- Email/Slack/Webhook notifications
- View firing alerts and alert history
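
On a fresh install the tabs may be nearly empty; a minimal sketch for generating some traffic first, using the public health endpoint referenced later in this guide (authenticated endpoints from your pilot tenant will produce richer traces):

```bash
# Generate ~60 requests over a minute so Services, Traces and Logs have data
for i in $(seq 1 60); do
  curl -s -o /dev/null https://bakewise.ai/api/v1/health
  sleep 1
done
```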

#### Quick Health Check

```bash
# Verify SigNoz components are running
kubectl get pods -n bakery-ia -l app.kubernetes.io/instance=signoz
# Expected output:
# signoz-0                   READY 1/1
# signoz-otel-collector-xxx  READY 1/1
# signoz-alertmanager-xxx    READY 1/1
# signoz-clickhouse-xxx      READY 1/1
# signoz-zookeeper-xxx       READY 1/1

# Check OTel Collector health
kubectl exec -n bakery-ia deployment/signoz-otel-collector -- wget -qO- http://localhost:13133

# View recent telemetry in OTel Collector logs
kubectl logs -n bakery-ia deployment/signoz-otel-collector --tail=50 | grep -i "traces\|metrics\|logs"
```

#### Verify Telemetry is Working

1. **Check Services are Reporting:**
   ```bash
   # Open SigNoz and navigate to Services tab
   # You should see all 18 microservices listed

   # If services are missing, check if they're sending telemetry:
   kubectl logs -n bakery-ia deployment/auth-service | grep -i "telemetry\|otel"
   ```

2. **Check Database Metrics:**
   ```bash
   # Navigate to Dashboards → PostgreSQL in SigNoz
   # You should see metrics from all 18 databases

   # Verify OTel Collector is scraping databases:
   kubectl logs -n bakery-ia deployment/signoz-otel-collector | grep postgresql
   ```

3. **Check Traces are Being Collected:**
   ```bash
   # Make a test API request
   curl https://bakewise.ai/api/v1/health

   # Navigate to Traces tab in SigNoz
   # Search for "gateway" service
   # You should see the trace for your request
   ```

4. **Check Logs are Being Collected:**
   ```bash
   # Navigate to Logs tab in SigNoz
   # Filter by namespace: bakery-ia
   # You should see logs from all pods

   # Verify filelog receiver is working:
   kubectl logs -n bakery-ia deployment/signoz-otel-collector | grep filelog
   ```
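
If the checks in item 1 come up empty, it can help to separate "collector is down" from "services are not instrumented" by pushing a minimal payload straight to the collector. A sketch, assuming the OTLP/HTTP port-forward shown earlier; `/v1/traces` is the standard OTLP/HTTP path:

```bash
# Forward the collector's OTLP/HTTP port (see the port-forward commands above)
kubectl port-forward -n bakery-ia svc/signoz-otel-collector 4318:4318 &

# POST an empty-but-valid OTLP trace payload; any 2xx means the collector
# is reachable and accepting OTLP over HTTP
curl -s -o /dev/null -w "HTTP %{http_code}\n" \
  -X POST http://localhost:4318/v1/traces \
  -H "Content-Type: application/json" \
  -d '{"resourceSpans":[]}'
```

If this returns a 2xx but services still do not appear, the problem is on the instrumentation side (check the services' OpenTelemetry configuration).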

### Step 2: Configure Alerting

SigNoz includes integrated alerting with AlertManager. Configure it for your team:

#### Update Email Notification Settings

The alerting configuration is in the SigNoz Helm values. To update:

```bash
# For production, edit the values file:
nano infrastructure/helm/signoz-values-prod.yaml

# Update the alertmanager.config section:
# 1. Update SMTP settings:
#    - smtp_from: 'your-alerts@bakewise.ai'
#    - smtp_auth_username: 'your-alerts@bakewise.ai'
#    - smtp_auth_password: (use a Kubernetes secret)
#
# 2. Update receivers:
#    - critical-alerts email: critical-alerts@bakewise.ai
#    - warning-alerts email: oncall@bakewise.ai
#
# 3. (Optional) Add Slack webhook for critical alerts

# Apply the updated configuration:
helm upgrade signoz signoz/signoz \
  -n bakery-ia \
  -f infrastructure/helm/signoz-values-prod.yaml
```
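
The SMTP password referenced above should live in a Kubernetes secret rather than in the values file. A minimal sketch — the secret name `alertmanager-smtp` and key `smtp-password` are placeholders; use whatever names your signoz-values-prod.yaml actually references:

```bash
# Create or update the SMTP credentials secret (names are illustrative)
kubectl create secret generic alertmanager-smtp \
  --namespace bakery-ia \
  --from-literal=smtp-password='REPLACE_WITH_REAL_PASSWORD' \
  --dry-run=client -o yaml | kubectl apply -f -

# Confirm it exists without printing the value
kubectl get secret alertmanager-smtp -n bakery-ia
```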

#### Create Alerts in SigNoz UI

1. **Open SigNoz Alerts Tab:**
   ```
   https://monitoring.bakewise.ai/signoz → Alerts
   ```

2. **Create Common Alerts:**

   **Alert 1: High Error Rate**
   - Name: `HighErrorRate`
   - Query: `error_rate > 5` for `5 minutes`
   - Severity: `critical`
   - Description: "Service {{service_name}} has error rate >5%"

   **Alert 2: High Latency**
   - Name: `HighLatency`
   - Query: `P99_latency > 3000ms` for `5 minutes`
   - Severity: `warning`
   - Description: "Service {{service_name}} P99 latency >3s"

   **Alert 3: Service Down**
   - Name: `ServiceDown`
   - Query: `request_rate == 0` for `2 minutes`
   - Severity: `critical`
   - Description: "Service {{service_name}} not receiving requests"

   **Alert 4: Database Connection Issues**
   - Name: `DatabaseConnectionsHigh`
   - Query: `pg_active_connections > 80` for `5 minutes`
   - Severity: `warning`
   - Description: "Database {{database}} connection count >80%"

   **Alert 5: High Memory Usage**
   - Name: `HighMemoryUsage`
   - Query: `container_memory_percent > 85` for `5 minutes`
   - Severity: `warning`
   - Description: "Pod {{pod_name}} using >85% memory"

#### Test Alert Delivery

```bash
# Method 1: Create a test alert in SigNoz UI
# Go to Alerts → New Alert → Set a test condition that will fire

# Method 2: Fire a test alert via stress test
kubectl run memory-test --image=polinux/stress --restart=Never \
  --namespace=bakery-ia -- stress --vm 1 --vm-bytes 600M --timeout 300s

# Check alert appears in SigNoz Alerts tab
# https://monitoring.bakewise.ai/signoz → Alerts

# Also check AlertManager
# https://monitoring.bakewise.ai/alertmanager

# Verify email notification received
@@ -945,6 +1085,26 @@ kubectl run memory-test --image=polinux/stress --restart=Never \
kubectl delete pod memory-test -n bakery-ia
```
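
To exercise the notification path without stressing a pod, you can also post a synthetic alert directly to AlertManager's v2 API. A sketch, assuming the signoz-alertmanager port-forward from Step 1; the alert name, labels, and annotations are arbitrary test values:

```bash
# Forward AlertManager (as shown in Step 1)
kubectl port-forward -n bakery-ia svc/signoz-alertmanager 9093:9093 &

# Post a synthetic alert; labels/annotations here are illustrative only
curl -s -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{
        "labels": {"alertname": "TestAlert", "severity": "warning"},
        "annotations": {"summary": "Synthetic alert to verify notification delivery"}
      }]'

# The alert should appear in the AlertManager UI and be routed to whichever
# receiver your configuration maps to the "warning" severity
```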

#### Configure Notification Channels

In the SigNoz Alerts tab, configure channels:

1. **Email Channel:**
   - Already configured via AlertManager
   - Emails sent to addresses in signoz-values-prod.yaml

2. **Slack Channel (Optional):**
   ```bash
   # Add Slack webhook URL to signoz-values-prod.yaml
   # Under alertmanager.config.receivers.critical-alerts.slack_configs:
   #   - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
   #     channel: '#alerts-critical'
   ```

3. **Webhook Channel (Optional):**
   - Configure a custom webhook for integration with PagerDuty, OpsGenie, etc.
   - Add to alertmanager.config.receivers

### Step 3: Configure Backups

```bash
@@ -992,26 +1152,61 @@ kubectl edit configmap -n monitoring alertmanager-config
# Update recipient emails in the routes section
```

### Step 4: Verify SigNoz Monitoring is Working

Before proceeding, ensure all monitoring components are operational:

```bash
# 1. Verify SigNoz pods are running
kubectl get pods -n bakery-ia -l app.kubernetes.io/instance=signoz

# Expected pods (all should be Running/Ready):
# - signoz-0 (or signoz-1, signoz-2 for HA)
# - signoz-otel-collector-xxx
# - signoz-alertmanager-xxx
# - signoz-clickhouse-xxx
# - signoz-zookeeper-xxx

# 2. Check SigNoz UI is accessible
curl -I https://monitoring.bakewise.ai/signoz
# Should return: HTTP/2 200

# 3. Verify OTel Collector is receiving data
kubectl logs -n bakery-ia deployment/signoz-otel-collector --tail=100 | grep -i "received"
# Should show log lines mentioning received traces, metrics, and logs

# 4. Check ClickHouse database is healthy
kubectl exec -n bakery-ia deployment/signoz-clickhouse -- clickhouse-client --query="SELECT count() FROM system.tables WHERE database LIKE 'signoz_%'"
# Should return a number > 0 (tables exist)
```

**Complete Verification Checklist:**

- [ ] **SigNoz UI loads** at https://monitoring.bakewise.ai/signoz
- [ ] **Services tab shows all 18 microservices** with metrics
- [ ] **Traces tab has sample traces** from gateway and other services
- [ ] **Dashboards tab shows PostgreSQL metrics** from all 18 databases
- [ ] **Dashboards tab shows Redis metrics** (memory, commands, etc.)
- [ ] **Dashboards tab shows RabbitMQ metrics** (queues, messages)
- [ ] **Dashboards tab shows Kubernetes metrics** (nodes, pods)
- [ ] **Logs tab displays logs** from all services in the bakery-ia namespace
- [ ] **Alerts tab is accessible** and new alerts can be created
- [ ] **AlertManager** is reachable at https://monitoring.bakewise.ai/alertmanager

**If any checks fail, troubleshoot:**

```bash
# Check OTel Collector configuration
kubectl describe configmap -n bakery-ia signoz-otel-collector

# Check for errors in OTel Collector
kubectl logs -n bakery-ia deployment/signoz-otel-collector | grep -i error

# Check ClickHouse is accepting writes
kubectl logs -n bakery-ia deployment/signoz-clickhouse | grep -i error

# Restart OTel Collector if needed
kubectl rollout restart deployment/signoz-otel-collector -n bakery-ia
```
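
It can also help to inspect the collector's own counters to see whether telemetry is being accepted or dropped. A sketch, assuming the collector exposes its default self-telemetry metrics on port 8888 (check the collector config if the port-forward fails):

```bash
# Forward the collector's internal metrics port (8888 is the OTel Collector default)
kubectl port-forward -n bakery-ia deployment/signoz-otel-collector 8888:8888 &

# Accepted vs. refused data at the receivers, and failed sends at the exporters
curl -s http://localhost:8888/metrics | grep -E "otelcol_receiver_(accepted|refused)_"
curl -s http://localhost:8888/metrics | grep -E "otelcol_exporter_(sent|send_failed)_"
```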

### Step 5: Document Everything
@@ -1033,41 +1228,113 @@ Create a secure runbook with all credentials and procedures:

### Step 6: Train Your Team

Conduct a training session covering SigNoz and operational procedures:

#### Part 1: SigNoz Navigation (30 minutes)

- [ ] **Login and Overview**
  - Show how to access https://monitoring.bakewise.ai/signoz
  - Navigate through the main tabs: Services, Traces, Dashboards, Logs, Alerts
  - Explain the unified nature of SigNoz (all-in-one platform)

- [ ] **Services Tab - Application Performance Monitoring**
  - Show all 18 microservices
  - Explain RED metrics (Request rate, Error rate, Duration/latency)
  - Demo: Click on a service → Operations → See endpoint breakdown
  - Demo: Identify slow endpoints and high error rates

- [ ] **Traces Tab - Request Flow Debugging**
  - Show how to search for traces by service, operation, or time
  - Demo: Click on a trace → See the full waterfall (service → database → cache)
  - Demo: Find slow database queries in trace spans
  - Demo: Click "View Logs" to correlate a trace with its logs

- [ ] **Dashboards Tab - Infrastructure Monitoring**
  - Navigate to the PostgreSQL dashboard → Show all 18 databases
  - Navigate to the Redis dashboard → Show cache metrics
  - Navigate to the Kubernetes dashboard → Show node/pod metrics
  - Explain which metrics indicate issues (connection %, memory %, etc.)

- [ ] **Logs Tab - Log Search and Analysis**
  - Show how to filter by service, severity, time range
  - Demo: Search for "error" in the last hour
  - Demo: Click on trace_id in a log → Jump to the related trace
  - Show Kubernetes metadata (pod, namespace, container)

- [ ] **Alerts Tab - Proactive Monitoring**
  - Show how to create alerts on metrics
  - Review pre-configured alerts
  - Show alert history and firing alerts
  - Explain how to acknowledge/silence alerts

#### Part 2: Operational Tasks (30 minutes)

- [ ] **Check application logs** (multiple ways)
  ```bash
  # Method 1: Via kubectl (for immediate debugging)
  kubectl logs -n bakery-ia deployment/orders-service --tail=100 -f

  # Method 2: Via SigNoz Logs tab (for analysis and correlation)
  # 1. Open https://monitoring.bakewise.ai/signoz → Logs
  # 2. Filter by k8s_deployment_name: orders-service
  # 3. Click on trace_id to see the related request flow
  ```

- [ ] **Restart services when needed**
  ```bash
  # Restart a service (rolling update, no downtime)
  kubectl rollout restart deployment/orders-service -n bakery-ia

  # Verify restart in SigNoz:
  # 1. Check Services tab → orders-service → Should show a brief dip then recovery
  # 2. Check Logs tab → Filter by orders-service → See restart logs
  ```

- [ ] **Investigate performance issues**
  ```bash
  # Scenario: "Orders API is slow"
  # 1. SigNoz → Services → orders-service → Check P99 latency
  # 2. SigNoz → Traces → Filter service:orders-service, duration:>1s
  # 3. Click on a slow trace → Identify the bottleneck (DB query? External API?)
  # 4. SigNoz → Dashboards → PostgreSQL → Check orders_db connections/queries
  # 5. Fix the identified issue (add index, optimize query, scale service)
  ```

- [ ] **Respond to alerts**
  - Show how to access alerts in SigNoz → Alerts tab
  - Show the AlertManager UI at https://monitoring.bakewise.ai/alertmanager
  - Review common alerts and their resolution steps
  - Reference the [Production Operations Guide](./PRODUCTION_OPERATIONS_GUIDE.md)

#### Part 3: Documentation and Resources (10 minutes)

- [ ] **Share documentation**
  - [PILOT_LAUNCH_GUIDE.md](./PILOT_LAUNCH_GUIDE.md) - This guide (deployment)
  - [PRODUCTION_OPERATIONS_GUIDE.md](./PRODUCTION_OPERATIONS_GUIDE.md) - Daily operations with SigNoz
  - [security-checklist.md](./security-checklist.md) - Security procedures

- [ ] **Bookmark key URLs**
  - SigNoz: https://monitoring.bakewise.ai/signoz
  - AlertManager: https://monitoring.bakewise.ai/alertmanager
  - Production app: https://bakewise.ai

- [ ] **Setup on-call rotation** (if applicable)
  - Configure the rotation schedule in AlertManager
  - Document escalation procedures
  - Test alert delivery to on-call phone/email

#### Part 4: Hands-On Exercise (15 minutes)

**Exercise: Investigate a Simulated Issue**

1. Create a load test to generate traffic (a sketch follows below)
2. Use SigNoz to find the slowest endpoint
3. Identify the root cause using traces
4. Correlate with logs to confirm
5. Check infrastructure metrics (DB, memory, CPU)
6. Propose a fix based on findings

This trains the team to use SigNoz effectively for real incidents.
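
A minimal load-generator sketch for step 1 of the exercise — it only assumes the public health endpoint used earlier in this guide; point it at a heavier endpoint (for example, an orders listing) if you want more interesting traces:

```bash
# Send 500 requests with ~20 in flight at a time
for i in $(seq 1 500); do
  curl -s -o /dev/null https://bakewise.ai/api/v1/health &
  if (( i % 20 == 0 )); then wait; fi   # throttle concurrency
done
wait
```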

---

@@ -1204,17 +1471,33 @@ kubectl scale deployment monitoring -n bakery-ia --replicas=0
- **RBAC Implementation:** [rbac-implementation.md](./rbac-implementation.md) - Access control

**Monitoring Access:**
- **SigNoz (Primary):** https://monitoring.bakewise.ai/signoz - All-in-one observability
  - Services: Application performance monitoring (APM)
  - Traces: Distributed tracing across all services
  - Dashboards: PostgreSQL, Redis, RabbitMQ, Kubernetes metrics
  - Logs: Centralized log management with trace correlation
  - Alerts: Alert configuration and management
- **AlertManager:** https://monitoring.bakewise.ai/alertmanager - Alert routing and notifications

**External Resources:**
- **MicroK8s Docs:** https://microk8s.io/docs
- **Kubernetes Docs:** https://kubernetes.io/docs
- **Let's Encrypt:** https://letsencrypt.org/docs
- **Cloudflare DNS:** https://developers.cloudflare.com/dns
- **Monitoring Stack README:** infrastructure/kubernetes/base/components/monitoring/README.md
- **SigNoz Documentation:** https://signoz.io/docs/
- **OpenTelemetry Documentation:** https://opentelemetry.io/docs/

**Monitoring Architecture:**
- **OpenTelemetry:** Industry-standard instrumentation framework
  - Auto-instruments FastAPI, HTTPX, SQLAlchemy, Redis
  - Collects traces, metrics, and logs from all services
  - Exports to SigNoz via the OTLP protocol (gRPC port 4317, HTTP port 4318); see the sketch after this list for the standard exporter settings
- **SigNoz Components:**
  - **Frontend:** Web UI for visualization and analysis
  - **OTel Collector:** Receives and processes telemetry data
  - **ClickHouse:** Columnar database for fast telemetry queries
  - **AlertManager:** Alert routing and notification delivery
  - **Zookeeper:** Coordination service for the ClickHouse cluster
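
For reference, a sketch of the standard OpenTelemetry SDK environment variables a service would use to reach the collector. The variable names come from the OTel specification; the target deployment, endpoint scheme, and attribute values are illustrative, and in practice these would be set in the Helm values/manifests rather than with `kubectl set env`:

```bash
# Illustration only — normally these live in the deployment manifests
kubectl set env deployment/orders-service -n bakery-ia \
  OTEL_SERVICE_NAME=orders-service \
  OTEL_EXPORTER_OTLP_ENDPOINT=http://signoz-otel-collector.bakery-ia.svc.cluster.local:4317 \
  OTEL_EXPORTER_OTLP_PROTOCOL=grpc \
  OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production
```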

---