Update monitoring packages to latest versions
- Updated all OpenTelemetry packages to latest versions:
  - opentelemetry-api: 1.27.0 → 1.39.1
  - opentelemetry-sdk: 1.27.0 → 1.39.1
  - opentelemetry-exporter-otlp-proto-grpc: 1.27.0 → 1.39.1
  - opentelemetry-exporter-otlp-proto-http: 1.27.0 → 1.39.1
  - opentelemetry-instrumentation-fastapi: 0.48b0 → 0.60b1
  - opentelemetry-instrumentation-httpx: 0.48b0 → 0.60b1
  - opentelemetry-instrumentation-redis: 0.48b0 → 0.60b1
  - opentelemetry-instrumentation-sqlalchemy: 0.48b0 → 0.60b1
- Removed prometheus-client==0.23.1 from all services
- Unified all services to use the same monitoring package versions

Generated by Mistral Vibe.
Co-Authored-By: Mistral Vibe <vibe@mistral.ai>
283
docs/MONITORING_QUICKSTART.md
Normal file
@@ -0,0 +1,283 @@
# SigNoz Monitoring Quick Start

Get complete observability (traces, logs, application metrics, and system metrics) in under 10 minutes using OpenTelemetry.

## What You'll Get

✅ **Distributed Tracing** - Complete request flows across all services
✅ **Application Metrics** - HTTP requests, durations, error rates, custom business metrics
✅ **System Metrics** - CPU usage, memory usage, disk I/O, network I/O per service
✅ **Structured Logs** - Searchable logs correlated with traces
✅ **Unified Dashboard** - Single UI for all telemetry data

**All data is pushed via the OpenTelemetry Protocol (OTLP). No Prometheus, no scraping needed!**

## Prerequisites

- Kubernetes cluster running (Kind/Minikube/Production)
- Helm 3.x installed
- kubectl configured

## Step 1: Deploy SigNoz

```bash
# Add Helm repository
helm repo add signoz https://charts.signoz.io
helm repo update

# Create namespace
kubectl create namespace signoz

# Install SigNoz
helm install signoz signoz/signoz \
  -n signoz \
  -f infrastructure/helm/signoz-values-dev.yaml

# Wait for pods to be ready (2-3 minutes)
kubectl wait --for=condition=ready pod -l app=signoz -n signoz --timeout=300s
```

## Step 2: Configure Services

Each service needs OpenTelemetry environment variables. The auth-service is already configured as an example.

### Quick Configuration (for remaining services)

Add these environment variables to each service deployment:

```yaml
env:
  # OpenTelemetry Collector endpoint
  - name: OTEL_COLLECTOR_ENDPOINT
    value: "http://signoz-otel-collector.signoz.svc.cluster.local:4318"
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://signoz-otel-collector.signoz.svc.cluster.local:4318"
  - name: OTEL_SERVICE_NAME
    value: "your-service-name"  # e.g., "inventory-service"

  # Enable tracing
  - name: ENABLE_TRACING
    value: "true"

  # Enable logs export
  - name: OTEL_LOGS_EXPORTER
    value: "otlp"

  # Enable metrics export (includes system metrics)
  - name: ENABLE_OTEL_METRICS
    value: "true"
  - name: ENABLE_SYSTEM_METRICS
    value: "true"
```
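
How the services consume these variables is not shown in this guide. As a rough sketch only, a Python service could read them with a small helper like the one below. `load_telemetry_settings` and `TelemetrySettings` are hypothetical names, not part of this repository or of OpenTelemetry. `OTEL_EXPORTER_OTLP_ENDPOINT`, `OTEL_SERVICE_NAME`, and `OTEL_LOGS_EXPORTER` are standard OpenTelemetry variables; the `ENABLE_*` flags and `OTEL_COLLECTOR_ENDPOINT` are treated here as project-specific, and every default value is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Strings accepted as "enabled" (an assumption; adjust to the project's convention).
_TRUE_VALUES = {"1", "true", "yes", "on"}


def _as_bool(value: Optional[str], default: bool = False) -> bool:
    """Interpret an environment string as a boolean flag."""
    if value is None:
        return default
    return value.strip().lower() in _TRUE_VALUES


@dataclass(frozen=True)
class TelemetrySettings:
    """Hypothetical container for the variables listed in the YAML above."""
    service_name: str
    otlp_endpoint: str
    tracing_enabled: bool
    metrics_enabled: bool
    system_metrics_enabled: bool
    logs_exporter: str


def load_telemetry_settings(env: Mapping[str, str] = os.environ) -> TelemetrySettings:
    """Build settings from an environment mapping (defaults are assumptions)."""
    # Prefer the standard variable, fall back to the project-specific one.
    endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT") or env.get("OTEL_COLLECTOR_ENDPOINT", "")
    return TelemetrySettings(
        service_name=env.get("OTEL_SERVICE_NAME", "unknown-service"),
        otlp_endpoint=endpoint.rstrip("/"),
        tracing_enabled=_as_bool(env.get("ENABLE_TRACING")),
        metrics_enabled=_as_bool(env.get("ENABLE_OTEL_METRICS")),
        system_metrics_enabled=_as_bool(env.get("ENABLE_SYSTEM_METRICS")),
        logs_exporter=env.get("OTEL_LOGS_EXPORTER", "none"),
    )
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to exercise in a unit test before a deployment is changed.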

### Using the Configuration Script

```bash
# Generate configuration patches for all services
./infrastructure/kubernetes/add-monitoring-config.sh

# This creates /tmp/*-otel-patch.yaml files
# Review and manually add to each service deployment
```

## Step 3: Deploy Updated Services

```bash
# Apply updated configurations
kubectl apply -k infrastructure/kubernetes/overlays/dev/

# Or restart services to pick up new env vars
kubectl rollout restart deployment -n bakery-ia

# Wait for rollout
kubectl rollout status deployment -n bakery-ia --timeout=5m
```

## Step 4: Access SigNoz UI

### Via Ingress

```bash
# Add to /etc/hosts if needed
echo "127.0.0.1 monitoring.bakery-ia.local" | sudo tee -a /etc/hosts

# Access UI (`open` is the macOS command; on Linux use `xdg-open`)
open https://monitoring.bakery-ia.local
```

### Via Port Forward

```bash
kubectl port-forward -n signoz svc/signoz-frontend 3301:3301
open http://localhost:3301
```

## Step 5: Explore Your Data

### Traces

1. Go to the **Services** tab
2. See all your services listed
3. Click on a service → View traces
4. Click on a trace → See the detailed span tree with timing

### Metrics

**HTTP Metrics** (automatically collected):
- `http_requests_total` - Total requests by method, endpoint, status
- `http_request_duration_seconds` - Request latency
- `active_requests` - Current active HTTP requests
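
To make "total requests by method, endpoint, status" concrete, here is a minimal in-memory sketch of a counter keyed by those three attributes, with an error rate derived from it. It uses only the standard library and is not how the services or the OpenTelemetry SDK record metrics; `RequestStats` and the 5xx-based definition of "error" are assumptions made for illustration.

```python
from collections import Counter
from typing import Tuple

# One counter series per (method, endpoint, status) combination.
RequestKey = Tuple[str, str, int]


class RequestStats:
    """Hypothetical stand-in for a labelled `http_requests_total` counter."""

    def __init__(self) -> None:
        self.requests_total: Counter = Counter()

    def record(self, method: str, endpoint: str, status: int) -> None:
        """Count one finished request under its attribute combination."""
        self.requests_total[(method, endpoint, status)] += 1

    def error_rate(self, endpoint: str) -> float:
        """Share of requests to `endpoint` that ended with a 5xx status."""
        total = 0
        errors = 0
        for (_, ep, status), count in self.requests_total.items():
            if ep != endpoint:
                continue
            total += count
            if status >= 500:
                errors += count
        return errors / total if total else 0.0
```

A dashboard panel for error rate performs the same kind of aggregation: it sums the series that match a filter and divides the failing subset by the total.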

**System Metrics** (automatically collected per service):
- `process.cpu.utilization` - Process CPU usage %
- `process.memory.usage` - Process memory in bytes
- `process.memory.utilization` - Process memory %
- `process.threads.count` - Number of threads
- `system.cpu.utilization` - System-wide CPU %
- `system.memory.usage` - System memory usage
- `system.disk.io.read` - Disk bytes read
- `system.disk.io.write` - Disk bytes written
- `system.network.io.sent` - Network bytes sent
- `system.network.io.received` - Network bytes received

**Custom Business Metrics** (if configured):
- User registrations
- Orders created
- Login attempts
- etc.

### Logs

1. Go to the **Logs** tab
2. Filter by service: `service_name="auth-service"`
3. Search for specific messages
4. See structured fields (user_id, tenant_id, etc.)

### Trace-Log Correlation

1. Find a trace in the **Traces** tab
2. Note the `trace_id`
3. Go to the **Logs** tab
4. Filter: `trace_id="<the-trace-id>"`
5. See all logs for that specific request!
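
The correlation above only works if each log record carries the ID of the active trace. The sketch below shows the general idea with the standard library: a logging filter that stamps every record with a `trace_id` field. `TraceIdFilter` and `format_trace_id` are hypothetical names, and the way this project actually injects the ID is not shown in this guide. OpenTelemetry represents a trace ID as a 128-bit integer, usually displayed as 32 lowercase hex characters.

```python
import logging
from typing import Callable, Optional


def format_trace_id(trace_id: int) -> str:
    """Render a 128-bit trace ID as a 32-character lowercase hex string."""
    return format(trace_id, "032x")


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every log record as `record.trace_id`."""

    def __init__(self, get_trace_id: Callable[[], Optional[int]]) -> None:
        super().__init__()
        # In a real service this callable would typically wrap
        # opentelemetry.trace.get_current_span().get_span_context().trace_id
        # (assumption: the OpenTelemetry API package is installed).
        self._get_trace_id = get_trace_id

    def filter(self, record: logging.LogRecord) -> bool:
        trace_id = self._get_trace_id()
        # A missing or zero ID means "no active trace"; keep the field empty.
        record.trace_id = format_trace_id(trace_id) if trace_id else ""
        return True  # never drop the record, only annotate it
```

With the filter attached to a handler, a format string such as `"%(message)s trace_id=%(trace_id)s"` exposes the field, which is what makes a `trace_id` filter in the log view possible.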

## Verification Commands

```bash
# Check if services are sending telemetry
kubectl logs -n bakery-ia deployment/auth-service | grep -i "telemetry\|otel"

# Check SigNoz collector is receiving data
kubectl logs -n signoz deployment/signoz-otel-collector | tail -50

# Test connectivity to collector
kubectl exec -n bakery-ia deployment/auth-service -- \
  curl -v http://signoz-otel-collector.signoz.svc.cluster.local:4318
```

## Common Issues

### No data in SigNoz

```bash
# 1. Verify environment variables are set
kubectl get deployment auth-service -n bakery-ia -o yaml | grep OTEL

# 2. Check collector logs
kubectl logs -n signoz deployment/signoz-otel-collector

# 3. Restart service
kubectl rollout restart deployment/auth-service -n bakery-ia
```

### Services not appearing

```bash
# Check network connectivity
kubectl exec -n bakery-ia deployment/auth-service -- \
  curl http://signoz-otel-collector.signoz.svc.cluster.local:4318

# Any HTTP response (even a 404) means the collector is reachable;
# "connection refused" or a timeout means it is not
```
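
If `curl` is not available in the service image, the same reachability check can be approximated from Python. This is a sketch under assumptions: `collector_reachable` is a hypothetical helper, and it only confirms that a TCP connection to the collector port can be opened, not that the collector accepts telemetry.

```python
import socket
from urllib.parse import urlparse


def collector_reachable(endpoint: str, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the endpoint's host and port succeeds."""
    parsed = urlparse(endpoint)
    if parsed.hostname is None:
        raise ValueError(f"endpoint has no host: {endpoint!r}")
    # Fall back to the scheme's default port when none is given.
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((parsed.hostname, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, DNS failures, and timeouts.
        return False


if __name__ == "__main__":
    url = "http://signoz-otel-collector.signoz.svc.cluster.local:4318"
    print(url, "reachable" if collector_reachable(url) else "NOT reachable")
```

A `False` result from inside a service pod points at DNS, a network policy, or the collector Service itself rather than at the application's telemetry settings.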

## Architecture

```
┌──────────────────────────────────────────────┐
│              Your Microservices              │
│  ┌──────┐  ┌──────┐  ┌──────┐                │
│  │ auth │  │ inv  │  │orders│  ...           │
│  └──┬───┘  └──┬───┘  └──┬───┘                │
│     │         │         │                    │
│     └─────────┼─────────┘                    │
│               │                              │
│           OTLP Push                          │
│    (traces, metrics, logs)                   │
└───────────────┼──────────────────────────────┘
                │
                ▼
┌──────────────────────────────────────────────┐
│        SigNoz OpenTelemetry Collector        │
│        :4317 (gRPC)     :4318 (HTTP)         │
│                                              │
│  Receivers: OTLP only (no Prometheus)        │
│  Processors: batch, memory_limiter           │
│  Exporters: ClickHouse                       │
└───────────────┬──────────────────────────────┘
                │
                ▼
┌──────────────────────────────────────────────┐
│             ClickHouse Database              │
│  Stores: traces, metrics, logs               │
└───────────────┬──────────────────────────────┘
                │
                ▼
┌──────────────────────────────────────────────┐
│              SigNoz Frontend UI              │
│     monitoring.bakery-ia.local or :3301      │
└──────────────────────────────────────────────┘
```
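
On the HTTP port (`:4318`), OTLP uses one path per signal: `/v1/traces`, `/v1/metrics`, and `/v1/logs`. When only a base endpoint is configured, OTLP/HTTP exporters append these paths themselves. The sketch below reproduces that mapping so the three URLs behind the "OTLP Push" arrow are explicit; `otlp_http_urls` is a hypothetical helper written for illustration, not an OpenTelemetry API.

```python
from typing import Dict

# Default OTLP/HTTP paths, one per signal type.
SIGNAL_PATHS: Dict[str, str] = {
    "traces": "v1/traces",
    "metrics": "v1/metrics",
    "logs": "v1/logs",
}


def otlp_http_urls(base_endpoint: str) -> Dict[str, str]:
    """Derive the per-signal OTLP/HTTP URLs from a base collector endpoint."""
    base = base_endpoint.rstrip("/")  # tolerate a trailing slash
    return {signal: f"{base}/{path}" for signal, path in SIGNAL_PATHS.items()}


if __name__ == "__main__":
    base = "http://signoz-otel-collector.signoz.svc.cluster.local:4318"
    for signal, url in otlp_http_urls(base).items():
        print(f"{signal:8s} {url}")
```

The gRPC port (`:4317`) does not use these paths; there the signal is selected by the gRPC service being called.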

## What Makes This Different

**Pure OpenTelemetry** - No Prometheus involved:
- ✅ All metrics pushed via OTLP (not scraped)
- ✅ Automatic system metrics collection (CPU, memory, disk, network)
- ✅ Unified data model for all telemetry
- ✅ Native trace-metric-log correlation
- ✅ Lower resource usage (no scraping overhead)

## Next Steps

- **Create Dashboards** - Build custom views for your metrics
- **Set Up Alerts** - Configure alerts for errors, latency, resource usage
- **Explore System Metrics** - Monitor CPU and memory per service
- **Query Logs** - Use the powerful log query language
- **Correlate Everything** - Jump from traces → logs → metrics

## Need Help?

- [Full Documentation](./MONITORING_SETUP.md) - Detailed setup guide
- [SigNoz Docs](https://signoz.io/docs/) - Official documentation
- [OpenTelemetry Python](https://opentelemetry.io/docs/instrumentation/python/) - Python instrumentation

---

**Metrics You Get Out of the Box:**

| Category | Metric | Description |
|----------|--------|-------------|
| HTTP | `http_requests_total` | Total requests by method, endpoint, status |
| HTTP | `http_request_duration_seconds` | Request latency histogram |
| HTTP | `active_requests` | Current active requests |
| Process | `process.cpu.utilization` | Process CPU usage % |
| Process | `process.memory.usage` | Process memory in bytes |
| Process | `process.memory.utilization` | Process memory % |
| Process | `process.threads.count` | Thread count |
| System | `system.cpu.utilization` | System CPU % |
| System | `system.memory.usage` | System memory usage |
| System | `system.memory.utilization` | System memory % |
| Disk | `system.disk.io.read` | Disk read bytes |
| Disk | `system.disk.io.write` | Disk write bytes |
| Network | `system.network.io.sent` | Network sent bytes |
| Network | `system.network.io.received` | Network received bytes |