Improve monitoring 3

This commit is contained in:
Urtzi Alfaro
2026-01-09 11:18:20 +01:00
parent 8ca5d9c100
commit 43a3f35bd1
27 changed files with 1279 additions and 32 deletions


# SigNoz Telemetry Verification Guide
## Overview
This guide explains how to verify that your services are correctly sending metrics, logs, and traces to SigNoz, and that SigNoz is collecting them properly.
## Current Configuration
### SigNoz Components
- **Version**: v0.106.0
- **OTel Collector**: v0.129.12
- **Namespace**: `bakery-ia`
- **Ingress URL**: https://monitoring.bakery-ia.local
### Telemetry Endpoints
The OTel Collector exposes the following endpoints:
| Protocol | Port | Purpose |
|----------|------|---------|
| OTLP gRPC | 4317 | Traces, Metrics, Logs (gRPC) |
| OTLP HTTP | 4318 | Traces, Metrics, Logs (HTTP) |
| Jaeger gRPC | 14250 | Jaeger traces (gRPC) |
| Jaeger HTTP | 14268 | Jaeger traces (HTTP) |
| Metrics | 8888 | Prometheus metrics from collector |
| Health Check | 13133 | Collector health status |
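A quick way to confirm the collector is actually up before digging deeper is to probe the health check port through a temporary port-forward (a minimal sketch; it assumes the `signoz-otel-collector` Service also exposes port 13133):
```bash
# Forward the collector's health check port locally and probe it.
kubectl port-forward -n bakery-ia svc/signoz-otel-collector 13133:13133 &
PF_PID=$!
sleep 2

# The health_check extension answers with HTTP 200 and a short status payload when the collector is ready.
curl -s -w "\nHTTP %{http_code}\n" http://localhost:13133

kill $PF_PID
```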
### Service Configuration
Services are configured via the `bakery-config` ConfigMap:
```yaml
# Observability enabled
ENABLE_TRACING: "true"
ENABLE_METRICS: "true"
ENABLE_LOGS: "true"
# OTel Collector endpoint
OTEL_EXPORTER_OTLP_ENDPOINT: "http://signoz-otel-collector.bakery-ia.svc.cluster.local:4317"
OTEL_EXPORTER_OTLP_PROTOCOL: "grpc"
```
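To confirm the deployed ConfigMap actually carries these values (a quick check; it assumes the ConfigMap is named `bakery-config` as above):
```bash
# Show the observability-related keys currently set in the ConfigMap.
kubectl get configmap bakery-config -n bakery-ia -o yaml | grep -E 'ENABLE_(TRACING|METRICS|LOGS)|OTEL_EXPORTER'
```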
### Shared Tracing Library
Services use `shared/monitoring/tracing.py` which:
- Auto-instruments FastAPI endpoints
- Auto-instruments HTTPX (inter-service calls)
- Auto-instruments Redis operations
- Auto-instruments SQLAlchemy (PostgreSQL)
- Uses OTLP exporter to send traces to SigNoz
**Default endpoint**: `http://signoz-otel-collector.bakery-ia:4318` (HTTP)
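If spans for HTTPX, Redis, or SQLAlchemy never show up, it is worth confirming that the corresponding instrumentation packages are installed in the service image (a hedged check; the package names assume the standard OpenTelemetry contrib instrumentations):
```bash
# List the OpenTelemetry packages available inside a running service pod (replace <service-pod>).
kubectl exec -n bakery-ia <service-pod> -- pip list 2>/dev/null | grep -i opentelemetry

# Expect entries such as opentelemetry-instrumentation-fastapi, -httpx,
# -redis and -sqlalchemy alongside the OTLP exporter packages.
```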
## Verification Steps
### 1. Quick Verification Script
Run the automated verification script:
```bash
./infrastructure/helm/verify-signoz-telemetry.sh
```
This script checks:
- ✅ SigNoz components are running
- ✅ OTel Collector endpoints are exposed
- ✅ Configuration is correct
- ✅ Health checks pass
- ✅ Data is being collected in ClickHouse
### 2. Manual Verification
#### Check SigNoz Components Status
```bash
kubectl get pods -n bakery-ia | grep signoz
```
Expected output:
```
signoz-0 1/1 Running
signoz-otel-collector-xxxxx 1/1 Running
chi-signoz-clickhouse-cluster-0-0-0 1/1 Running
signoz-zookeeper-0 1/1 Running
signoz-clickhouse-operator-xxxxx 2/2 Running
```
#### Check OTel Collector Logs
```bash
kubectl logs -n bakery-ia -l app.kubernetes.io/component=otel-collector --tail=50
```
Look for:
- `"msg":"Everything is ready. Begin running and processing data."`
- No error messages about invalid processors
- Evidence of data reception (traces/metrics/logs)
#### Check Service Logs for Tracing
```bash
# Check a specific service (e.g., gateway)
kubectl logs -n bakery-ia -l app=gateway --tail=100 | grep -i "tracing\|otel"
```
Expected output:
```
Distributed tracing configured
service=gateway-service
otel_endpoint=http://signoz-otel-collector.bakery-ia:4318
```
### 3. Generate Test Traffic
Run the traffic generation script:
```bash
./infrastructure/helm/generate-test-traffic.sh
```
This script:
1. Makes API calls to various service endpoints
2. Checks service logs for telemetry
3. Waits for data processing (30 seconds)
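If you prefer to generate a little traffic by hand, a few requests through a port-forward work just as well (a sketch; the gateway Service name, port, and `/health` path are assumptions, adjust them to your deployment):
```bash
# Port-forward the gateway and send a handful of requests.
kubectl port-forward -n bakery-ia svc/gateway-service 8000:8000 &
PF_PID=$!
sleep 2

for i in $(seq 1 5); do
  curl -s -o /dev/null -w "request $i -> HTTP %{http_code}\n" http://localhost:8000/health
done

kill $PF_PID
```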
### 4. Verify Data in ClickHouse
```bash
# Get ClickHouse password
CH_PASSWORD=$(kubectl get secret -n bakery-ia signoz-clickhouse -o jsonpath='{.data.admin-password}' 2>/dev/null | base64 -d)
# Get ClickHouse pod
CH_POD=$(kubectl get pods -n bakery-ia -l clickhouse.altinity.com/chi=signoz-clickhouse -o jsonpath='{.items[0].metadata.name}')
# Check traces
kubectl exec -n bakery-ia $CH_POD -- clickhouse-client --user=admin --password=$CH_PASSWORD --query="
  SELECT
    serviceName,
    COUNT() as trace_count,
    min(timestamp) as first_trace,
    max(timestamp) as last_trace
  FROM signoz_traces.signoz_index_v2
  WHERE timestamp >= now() - INTERVAL 1 HOUR
  GROUP BY serviceName
  ORDER BY trace_count DESC
"

# Check metrics
kubectl exec -n bakery-ia $CH_POD -- clickhouse-client --user=admin --password=$CH_PASSWORD --query="
  SELECT
    metric_name,
    COUNT() as sample_count
  FROM signoz_metrics.samples_v4
  WHERE unix_milli >= toUnixTimestamp(now() - INTERVAL 1 HOUR) * 1000
  GROUP BY metric_name
  ORDER BY sample_count DESC
  LIMIT 10
"

# Check logs
kubectl exec -n bakery-ia $CH_POD -- clickhouse-client --user=admin --password=$CH_PASSWORD --query="
  SELECT
    COUNT() as log_count,
    min(timestamp) as first_log,
    max(timestamp) as last_log
  FROM signoz_logs.logs
  WHERE timestamp >= now() - INTERVAL 1 HOUR
"
```
### 5. Access SigNoz UI
#### Via Ingress (Recommended)
1. Add to `/etc/hosts`:
   ```
   127.0.0.1 monitoring.bakery-ia.local
   ```
2. Access: https://monitoring.bakery-ia.local
#### Via Port-Forward
```bash
kubectl port-forward -n bakery-ia svc/signoz 3301:8080
```
Then access: http://localhost:3301
### 6. Explore Telemetry Data in SigNoz UI
1. **Traces**:
   - Go to the "Services" tab
   - You should see your services listed (gateway, auth-service, inventory-service, etc.)
   - Click on a service to see its traces
   - Click on individual traces to see span details
2. **Metrics**:
   - Go to the "Dashboards" or "Metrics" tab
   - You should see infrastructure metrics (PostgreSQL, Redis, RabbitMQ)
   - You should see service metrics (request rate, latency, errors)
3. **Logs**:
   - Go to the "Logs" tab
   - You should see logs from your services
   - You can filter by service name, log level, etc.
## Troubleshooting
### Services Can't Connect to OTel Collector
**Symptoms**:
```
[ERROR] opentelemetry.exporter.otlp.proto.grpc.exporter: Failed to export traces
error code: StatusCode.UNAVAILABLE
```
**Solutions**:
1. **Check that the OTel Collector is running**:
   ```bash
   kubectl get pods -n bakery-ia -l app.kubernetes.io/component=otel-collector
   ```
2. **Verify the service can reach the collector**:
   ```bash
   # From a service pod
   kubectl exec -it -n bakery-ia <service-pod> -- curl -v http://signoz-otel-collector.bakery-ia.svc.cluster.local:4318
   ```
3. **Check the endpoint configuration**:
   - The gRPC endpoint should NOT have an `http://` prefix
   - The HTTP endpoint should have an `http://` prefix

   Update your service's tracing setup accordingly:
   ```python
   # For gRPC (recommended)
   setup_tracing(app, "my-service", otel_endpoint="signoz-otel-collector.bakery-ia.svc.cluster.local:4317")

   # For HTTP
   setup_tracing(app, "my-service", otel_endpoint="http://signoz-otel-collector.bakery-ia.svc.cluster.local:4318")
   ```
4. **Restart services after config changes**:
   ```bash
   kubectl rollout restart deployment/<service-name> -n bakery-ia
   ```
### No Data in SigNoz
**Possible causes**:
1. **Services haven't been called yet**
   - Solution: generate traffic using the test script
2. **Tracing not initialized**
   - Check service logs for tracing initialization messages
   - Verify `ENABLE_TRACING=true` in the ConfigMap
3. **Wrong OTel endpoint**
   - Verify `OTEL_EXPORTER_OTLP_ENDPOINT` in the ConfigMap (a quick runtime check is sketched after this list)
   - It should be: `http://signoz-otel-collector.bakery-ia.svc.cluster.local:4317`
4. **Service not using the tracing library**
   - Check whether the service imports and calls `setup_tracing()` in `main.py`:
   ```python
   from shared.monitoring.tracing import setup_tracing

   app = FastAPI(title="My Service")
   setup_tracing(app, "my-service")
   ```
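For causes 2 and 3, it can be quicker to inspect the environment of a running pod than the ConfigMap itself, since pods only pick up ConfigMap changes on restart (a minimal check; replace `<service-pod>` with an actual pod name):
```bash
# Inspect the observability-related environment of a running service pod.
kubectl exec -n bakery-ia <service-pod> -- env | grep -E 'ENABLE_(TRACING|METRICS|LOGS)|OTEL_EXPORTER'
```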
### OTel Collector Errors
**Check collector logs**:
```bash
kubectl logs -n bakery-ia -l app.kubernetes.io/component=otel-collector --tail=100
```
**Common errors**:
1. **Invalid processor error**:
   - Check that `signoz-values-dev.yaml` uses `signozspanmetrics/delta` (not `spanmetrics`)
   - Already fixed in your configuration
2. **ClickHouse connection error**:
   - Verify ClickHouse is running
   - Check that the ClickHouse service is accessible
3. **Configuration validation error**:
   - Validate the YAML syntax in `signoz-values-dev.yaml` (a quick render check is sketched below)
   - Check that all processors used in pipelines are defined
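For the configuration validation case, rendering the chart locally catches most YAML and values mistakes before anything is applied to the cluster (a sketch; it assumes the standard `signoz/signoz` Helm chart and the values file path used elsewhere in this guide):
```bash
# Render the chart with the dev values; errors surface without touching the cluster.
helm template signoz signoz/signoz \
  -n bakery-ia \
  -f infrastructure/helm/signoz-values-dev.yaml > /dev/null && echo "values render cleanly"
```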
## Infrastructure Metrics
SigNoz automatically collects metrics from your infrastructure:
### PostgreSQL Databases
- **Receivers configured for**:
  - auth_db (auth-db-service:5432)
  - inventory_db (inventory-db-service:5432)
  - orders_db (orders-db-service:5432)
- **Metrics collected**:
  - Connection counts
  - Query performance
  - Database size
  - Table statistics
### Redis
- **Endpoint**: redis-service:6379
- **Metrics collected**:
  - Memory usage
  - Key count
  - Hit/miss ratio
  - Command stats
### RabbitMQ
- **Endpoint**: rabbitmq-service:15672 (management API)
- **Metrics collected**:
  - Queue lengths
  - Message rates
  - Connection counts
  - Consumer activity
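To confirm these receivers are actually producing data, you can query the metrics table for their metric prefixes (a sketch; it reuses the `$CH_POD` and `$CH_PASSWORD` variables from the ClickHouse section above and assumes the receiver metrics keep their `postgresql`/`redis`/`rabbitmq` name prefixes):
```bash
# Count recent samples per infrastructure metric.
kubectl exec -n bakery-ia $CH_POD -- clickhouse-client --user=admin --password=$CH_PASSWORD --query="
  SELECT
    metric_name,
    COUNT() as sample_count
  FROM signoz_metrics.samples_v4
  WHERE unix_milli >= toUnixTimestamp(now() - INTERVAL 1 HOUR) * 1000
    AND (metric_name LIKE 'postgresql%' OR metric_name LIKE 'redis%' OR metric_name LIKE 'rabbitmq%')
  GROUP BY metric_name
  ORDER BY sample_count DESC
"
```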
## Best Practices
### 1. Service Implementation
Always initialize tracing in your service's `main.py`:
```python
from fastapi import FastAPI
from shared.monitoring.tracing import setup_tracing
import os

app = FastAPI(title="My Service")

# Initialize tracing
otel_endpoint = os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT", "http://signoz-otel-collector.bakery-ia:4318")
setup_tracing(
    app,
    service_name="my-service",
    service_version=os.getenv("SERVICE_VERSION", "1.0.0"),
    otel_endpoint=otel_endpoint
)
```
### 2. Custom Spans
Add custom spans for important operations:
```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

@app.post("/process")
async def process_data(data: dict):
    with tracer.start_as_current_span("process_data") as span:
        span.set_attribute("data.size", len(data))
        span.set_attribute("data.type", data.get("type"))
        # Your processing logic
        result = process(data)
        span.set_attribute("result.status", "success")
        return result
```
### 3. Error Tracking
Record exceptions in spans:
```python
from shared.monitoring.tracing import record_exception

try:
    result = risky_operation()
except Exception as e:
    record_exception(e)
    raise
```
### 4. Correlation
Use trace IDs in logs for correlation:
```python
from shared.monitoring.tracing import get_current_trace_id
trace_id = get_current_trace_id()
logger.info("Processing request", trace_id=trace_id)
```
## Next Steps
1. ✅ **Verify SigNoz is running** - Run verification script
2. ✅ **Generate test traffic** - Run traffic generation script
3. ✅ **Check data collection** - Query ClickHouse or use UI
4. ✅ **Access SigNoz UI** - Visualize traces, metrics, and logs
5. ⏭️ **Set up dashboards** - Create custom dashboards for your use cases
6. ⏭️ **Configure alerts** - Set up alerts for critical metrics
7. ⏭️ **Document** - Document common queries and dashboard configurations
## Useful Commands
```bash
# Quick status check
kubectl get pods -n bakery-ia | grep signoz
# View OTel Collector metrics
kubectl port-forward -n bakery-ia svc/signoz-otel-collector 8888:8888
# Then visit: http://localhost:8888/metrics
# Restart OTel Collector
kubectl rollout restart deployment/signoz-otel-collector -n bakery-ia
# View all services with telemetry
kubectl get pods -n bakery-ia -l tier!=infrastructure
# Check specific service logs
kubectl logs -n bakery-ia -l app=<service-name> --tail=100 -f
# Port-forward to SigNoz UI
kubectl port-forward -n bakery-ia svc/signoz 3301:8080
```
## Resources
- [SigNoz Documentation](https://signoz.io/docs/)
- [OpenTelemetry Python](https://opentelemetry.io/docs/languages/python/)
- [SigNoz GitHub](https://github.com/SigNoz/signoz)
- [Helm Chart Values](infrastructure/helm/signoz-values-dev.yaml)
- [Verification Script](infrastructure/helm/verify-signoz-telemetry.sh)
- [Traffic Generation Script](infrastructure/helm/generate-test-traffic.sh)