Update monitoring packages to latest versions

- Updated all OpenTelemetry packages to latest versions:
  - opentelemetry-api: 1.27.0 → 1.39.1
  - opentelemetry-sdk: 1.27.0 → 1.39.1
  - opentelemetry-exporter-otlp-proto-grpc: 1.27.0 → 1.39.1
  - opentelemetry-exporter-otlp-proto-http: 1.27.0 → 1.39.1
  - opentelemetry-instrumentation-fastapi: 0.48b0 → 0.60b1
  - opentelemetry-instrumentation-httpx: 0.48b0 → 0.60b1
  - opentelemetry-instrumentation-redis: 0.48b0 → 0.60b1
  - opentelemetry-instrumentation-sqlalchemy: 0.48b0 → 0.60b1

- Removed prometheus-client==0.23.1 from all services
- Unified all services to use the same monitoring package versions

Generated by Mistral Vibe.
Co-Authored-By: Mistral Vibe <vibe@mistral.ai>
This commit is contained in:
Urtzi Alfaro
2026-01-08 19:25:52 +01:00
parent dfb7e4b237
commit 29d19087f1
129 changed files with 5718 additions and 1821 deletions

@@ -0,0 +1,283 @@
# SigNoz Monitoring Quick Start
Get complete observability (metrics, logs, traces, system metrics) in under 10 minutes using OpenTelemetry.
## What You'll Get
- **Distributed Tracing** - Complete request flows across all services
- **Application Metrics** - HTTP requests, durations, error rates, custom business metrics
- **System Metrics** - CPU usage, memory usage, disk I/O, network I/O per service
- **Structured Logs** - Searchable logs correlated with traces
- **Unified Dashboard** - Single UI for all telemetry data

**All data is pushed via the OpenTelemetry OTLP protocol. No Prometheus, no scraping needed!**
## Prerequisites
- Kubernetes cluster running (Kind/Minikube/Production)
- Helm 3.x installed
- kubectl configured
## Step 1: Deploy SigNoz
```bash
# Add Helm repository
helm repo add signoz https://charts.signoz.io
helm repo update

# Create namespace
kubectl create namespace signoz

# Install SigNoz
helm install signoz signoz/signoz \
  -n signoz \
  -f infrastructure/helm/signoz-values-dev.yaml

# Wait for pods to be ready (2-3 minutes)
kubectl wait --for=condition=ready pod -l app=signoz -n signoz --timeout=300s
```
## Step 2: Configure Services
Each service needs OpenTelemetry environment variables. The auth-service is already configured as an example.
### Quick Configuration (for remaining services)
Add these environment variables to each service deployment:
```yaml
env:
  # OpenTelemetry Collector endpoint
  - name: OTEL_COLLECTOR_ENDPOINT
    value: "http://signoz-otel-collector.signoz.svc.cluster.local:4318"
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://signoz-otel-collector.signoz.svc.cluster.local:4318"
  - name: OTEL_SERVICE_NAME
    value: "your-service-name"  # e.g., "inventory-service"
  # Enable tracing
  - name: ENABLE_TRACING
    value: "true"
  # Enable logs export
  - name: OTEL_LOGS_EXPORTER
    value: "otlp"
  # Enable metrics export (includes system metrics)
  - name: ENABLE_OTEL_METRICS
    value: "true"
  - name: ENABLE_SYSTEM_METRICS
    value: "true"
```
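
To make the role of these variables concrete, here is a minimal Python sketch of how a service might read them at startup. It is not the project's actual loader: `TelemetrySettings`, `load_telemetry_settings`, the set of accepted truthy strings, and the fallback defaults are all assumptions made for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Strings treated as "enabled"; this set is an assumption of the sketch.
_TRUTHY = {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class TelemetrySettings:
    """Hypothetical container for the variables listed above."""
    service_name: str
    otlp_endpoint: str
    tracing_enabled: bool
    metrics_enabled: bool
    system_metrics_enabled: bool
    logs_exporter: str


def _flag(env: Mapping[str, str], name: str) -> bool:
    """Return True when the variable holds a truthy string (case-insensitive)."""
    return env.get(name, "").strip().lower() in _TRUTHY


def load_telemetry_settings(env: Optional[Mapping[str, str]] = None) -> TelemetrySettings:
    """Build settings from a mapping, defaulting to the process environment."""
    env = os.environ if env is None else env
    return TelemetrySettings(
        # Fallback values below are assumptions, not project defaults.
        service_name=env.get("OTEL_SERVICE_NAME", "unknown_service"),
        # Trailing slash removed so paths can be appended predictably.
        otlp_endpoint=env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4318").rstrip("/"),
        tracing_enabled=_flag(env, "ENABLE_TRACING"),
        metrics_enabled=_flag(env, "ENABLE_OTEL_METRICS"),
        system_metrics_enabled=_flag(env, "ENABLE_SYSTEM_METRICS"),
        logs_exporter=env.get("OTEL_LOGS_EXPORTER", "none"),
    )
```

Passing a plain dictionary instead of relying on `os.environ` keeps the function easy to exercise in unit tests without touching the real environment.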
### Using the Configuration Script
```bash
# Generate configuration patches for all services
./infrastructure/kubernetes/add-monitoring-config.sh
# This creates /tmp/*-otel-patch.yaml files
# Review and manually add to each service deployment
```
## Step 3: Deploy Updated Services
```bash
# Apply updated configurations
kubectl apply -k infrastructure/kubernetes/overlays/dev/
# Or restart services to pick up new env vars
kubectl rollout restart deployment -n bakery-ia
# Wait for rollout
kubectl rollout status deployment -n bakery-ia --timeout=5m
```
## Step 4: Access SigNoz UI
### Via Ingress
```bash
# Add to /etc/hosts if needed
echo "127.0.0.1 monitoring.bakery-ia.local" | sudo tee -a /etc/hosts
# Access UI
open https://monitoring.bakery-ia.local
```
### Via Port Forward
```bash
kubectl port-forward -n signoz svc/signoz-frontend 3301:3301
open http://localhost:3301
```
## Step 5: Explore Your Data
### Traces
1. Go to **Services** tab
2. See all your services listed
3. Click on a service → View traces
4. Click on a trace → See detailed span tree with timing
### Metrics
**HTTP Metrics** (automatically collected):
- `http_requests_total` - Total requests by method, endpoint, status
- `http_request_duration_seconds` - Request latency
- `active_requests` - Current active HTTP requests
**System Metrics** (automatically collected per service):
- `process.cpu.utilization` - Process CPU usage %
- `process.memory.usage` - Process memory in bytes
- `process.memory.utilization` - Process memory %
- `process.threads.count` - Number of threads
- `system.cpu.utilization` - System-wide CPU %
- `system.memory.usage` - System memory usage
- `system.memory.utilization` - System memory %
- `system.disk.io.read` - Disk bytes read
- `system.disk.io.write` - Disk bytes written
- `system.network.io.sent` - Network bytes sent
- `system.network.io.received` - Network bytes received
**Custom Business Metrics** (if configured):
- User registrations
- Orders created
- Login attempts
- etc.
### Logs
1. Go to **Logs** tab
2. Filter by service: `service_name="auth-service"`
3. Search for specific messages
4. See structured fields (user_id, tenant_id, etc.)
### Trace-Log Correlation
1. Find a trace in **Traces** tab
2. Note the `trace_id`
3. Go to **Logs** tab
4. Filter: `trace_id="<the-trace-id>"`
5. See all logs for that specific request!
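
The correlation steps above work only if every log record carries the trace ID of the request that produced it. The sketch below illustrates that idea with the Python standard library alone; it is a stand-in, not the OpenTelemetry logging integration the services use. The names `current_trace_id`, `TraceIdFilter`, `build_logger`, and `demo` are hypothetical, and the trace ID is an example value (32 lowercase hex characters, the W3C Trace Context format).

```python
import contextvars
import io
import logging
from typing import List, TextIO

# Holds the trace ID of the request being handled; empty outside a request.
current_trace_id: contextvars.ContextVar = contextvars.ContextVar(
    "current_trace_id", default=""
)


class TraceIdFilter(logging.Filter):
    """Attach the active trace ID to every record passing through the handler."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get() or "-"
        return True  # never drop the record, only annotate it


def build_logger(stream: TextIO) -> logging.Logger:
    """Create a logger whose lines include a trace_id field."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
    )
    handler.addFilter(TraceIdFilter())
    logger = logging.getLogger("correlation-demo")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def demo() -> List[str]:
    """Log once inside a simulated request and once outside it."""
    buffer = io.StringIO()
    logger = build_logger(buffer)
    token = current_trace_id.set("0af7651916cd43dd8448eb211c80319c")  # example ID
    try:
        logger.info("login ok")
    finally:
        current_trace_id.reset(token)
    logger.info("no active request")
    return buffer.getvalue().splitlines()
```

Calling `demo()` yields one line tagged with the example trace ID and one tagged with `-`, which is the distinction a `trace_id` filter in the Logs tab relies on.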
## Verification Commands
```bash
# Check if services are sending telemetry
kubectl logs -n bakery-ia deployment/auth-service | grep -i "telemetry\|otel"
# Check SigNoz collector is receiving data
kubectl logs -n signoz deployment/signoz-otel-collector | tail -50
# Test connectivity to collector
kubectl exec -n bakery-ia deployment/auth-service -- \
  curl -v http://signoz-otel-collector.signoz.svc.cluster.local:4318
```
## Common Issues
### No data in SigNoz
```bash
# 1. Verify environment variables are set
kubectl get deployment auth-service -n bakery-ia -o yaml | grep OTEL
# 2. Check collector logs
kubectl logs -n signoz deployment/signoz-otel-collector
# 3. Restart service
kubectl rollout restart deployment/auth-service -n bakery-ia
```
### Services not appearing
```bash
# Check network connectivity
kubectl exec -n bakery-ia deployment/auth-service -- \
  curl http://signoz-otel-collector.signoz.svc.cluster.local:4318
# Any HTTP response, even an error status, means the collector is reachable.
# "Connection refused" or a timeout means it is not.
```
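
If the service image does not ship `curl`, the same reachability check can be approximated from Python. This is a minimal sketch using only the standard library; `can_connect` is a hypothetical helper, and it tests TCP reachability only, not whether the collector accepts OTLP payloads.

```python
import socket
from urllib.parse import urlsplit


def can_connect(endpoint: str, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the endpoint's host and port succeeds."""
    parts = urlsplit(endpoint)
    if parts.hostname is None:
        raise ValueError(f"endpoint has no host: {endpoint!r}")
    # Fall back to the scheme's usual port when none is given.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    try:
        with socket.create_connection((parts.hostname, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, DNS failure, and timeouts.
        return False
```

For example, `can_connect("http://signoz-otel-collector.signoz.svc.cluster.local:4318")` run inside a pod returning `False` points at DNS, network policy, or a collector that is not listening.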
## Architecture
```
┌──────────────────────────────────────────────┐
│              Your Microservices              │
│   ┌──────┐  ┌──────┐  ┌──────┐               │
│   │ auth │  │ inv  │  │orders│  ...          │
│   └──┬───┘  └──┬───┘  └──┬───┘               │
│      │         │         │                   │
│      └─────────┼─────────┘                   │
│                │                             │
│            OTLP Push                         │
│     (traces, metrics, logs)                  │
└────────────────┬─────────────────────────────┘
                 ▼
┌──────────────────────────────────────────────┐
│        SigNoz OpenTelemetry Collector        │
│         :4317 (gRPC)   :4318 (HTTP)          │
│                                              │
│  Receivers: OTLP only (no Prometheus)        │
│  Processors: batch, memory_limiter           │
│  Exporters: ClickHouse                       │
└────────────────┬─────────────────────────────┘
                 ▼
┌──────────────────────────────────────────────┐
│             ClickHouse Database              │
│        Stores: traces, metrics, logs         │
└────────────────┬─────────────────────────────┘
                 ▼
┌──────────────────────────────────────────────┐
│              SigNoz Frontend UI              │
│     monitoring.bakery-ia.local or :3301      │
└──────────────────────────────────────────────┘
```
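
On the HTTP port (:4318), OTLP conventionally exposes one path per signal: `/v1/traces`, `/v1/metrics`, and `/v1/logs`. The sketch below shows how a base endpoint such as the one in `OTEL_EXPORTER_OTLP_ENDPOINT` maps to those per-signal URLs. The helper name `otlp_http_signal_urls` is hypothetical, and whether the project's shared code builds URLs this way or leaves it to the exporter is not shown in this guide.

```python
from typing import Dict

# The three signals pushed to the collector in the diagram above.
OTLP_HTTP_SIGNALS = ("traces", "metrics", "logs")


def otlp_http_signal_urls(base_endpoint: str) -> Dict[str, str]:
    """Map each signal to its conventional OTLP/HTTP URL under the base endpoint."""
    base = base_endpoint.rstrip("/")  # avoid a double slash when joining
    return {signal: f"{base}/v1/{signal}" for signal in OTLP_HTTP_SIGNALS}
```

With the collector address used in this guide, the traces URL would come out as `http://signoz-otel-collector.signoz.svc.cluster.local:4318/v1/traces`.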
## What Makes This Different
**Pure OpenTelemetry** - No Prometheus involved:
- ✅ All metrics pushed via OTLP (not scraped)
- ✅ Automatic system metrics collection (CPU, memory, disk, network)
- ✅ Unified data model for all telemetry
- ✅ Native trace-metric-log correlation
- ✅ No scrape targets or scrape intervals to configure
## Next Steps
- **Create Dashboards** - Build custom views for your metrics
- **Set Up Alerts** - Configure alerts for errors, latency, resource usage
- **Explore System Metrics** - Monitor CPU, memory per service
- **Query Logs** - Use the log query language to filter and search
- **Correlate Everything** - Jump from traces → logs → metrics
## Need Help?
- [Full Documentation](./MONITORING_SETUP.md) - Detailed setup guide
- [SigNoz Docs](https://signoz.io/docs/) - Official documentation
- [OpenTelemetry Python](https://opentelemetry.io/docs/instrumentation/python/) - Python instrumentation
---
**Metrics You Get Out of the Box:**
| Category | Metrics | Description |
|----------|---------|-------------|
| HTTP | `http_requests_total` | Total requests by method, endpoint, status |
| HTTP | `http_request_duration_seconds` | Request latency histogram |
| HTTP | `active_requests` | Current active requests |
| Process | `process.cpu.utilization` | Process CPU usage % |
| Process | `process.memory.usage` | Process memory in bytes |
| Process | `process.memory.utilization` | Process memory % |
| Process | `process.threads.count` | Thread count |
| System | `system.cpu.utilization` | System CPU % |
| System | `system.memory.usage` | System memory usage |
| System | `system.memory.utilization` | System memory % |
| Disk | `system.disk.io.read` | Disk read bytes |
| Disk | `system.disk.io.write` | Disk write bytes |
| Network | `system.network.io.sent` | Network sent bytes |
| Network | `system.network.io.received` | Network received bytes |
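
The table describes `http_request_duration_seconds` as a histogram. As a rough illustration of what that means, the sketch below sorts request durations into explicit buckets. The function name `bucket_counts` and the bucket boundaries in the usage note are hypothetical, and the sketch assumes upper-inclusive buckets; the boundaries the services actually use are not stated in this guide.

```python
import bisect
from typing import List, Sequence


def bucket_counts(durations: Sequence[float], bounds: Sequence[float]) -> List[int]:
    """Count durations per bucket; bounds must be sorted ascending.

    Bucket i holds values in (bounds[i-1], bounds[i]]; the last bucket
    holds everything above the final bound.
    """
    counts = [0] * (len(bounds) + 1)
    for duration in durations:
        # bisect_left returns the first bound >= duration (upper-inclusive).
        counts[bisect.bisect_left(bounds, duration)] += 1
    return counts
```

For instance, with boundaries of 0.1, 0.5, and 1.0 seconds, a 0.3 s request falls into the second bucket and a 2.0 s request into the overflow bucket. Latency percentiles shown in a dashboard are estimated from counts like these rather than from individual requests.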