# SigNoz Telemetry Verification Guide

## Overview
This guide explains how to verify that your services are correctly sending metrics, logs, and traces to SigNoz, and that SigNoz is collecting them properly.

## Current Configuration

### SigNoz Components
- **Version**: v0.106.0
- **OTel Collector**: v0.129.12
- **Namespace**: `bakery-ia`
- **Ingress URL**: https://monitoring.bakery-ia.local

### Telemetry Endpoints

The OTel Collector exposes the following endpoints:

| Protocol | Port | Purpose |
|----------|------|---------|
| OTLP gRPC | 4317 | Traces, Metrics, Logs (gRPC) |
| OTLP HTTP | 4318 | Traces, Metrics, Logs (HTTP) |
| Jaeger gRPC | 14250 | Jaeger traces (gRPC) |
| Jaeger HTTP | 14268 | Jaeger traces (HTTP) |
| Metrics | 8888 | Prometheus metrics from collector |
| Health Check | 13133 | Collector health status |
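
A quick way to confirm that the ports in the table are reachable is a plain TCP connect. The following is a minimal sketch using only the Python standard library; the helper names `port_is_open` and `probe_collector` are illustrative and not part of the project. A successful connect only shows that something is listening on the port, not that the collector accepts telemetry on it.

```python
import socket

# Ports copied from the table above; the labels are only used for display.
COLLECTOR_PORTS = {
    "OTLP gRPC": 4317,
    "OTLP HTTP": 4318,
    "Jaeger gRPC": 14250,
    "Jaeger HTTP": 14268,
    "Metrics": 8888,
    "Health Check": 13133,
}


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def probe_collector(host: str, ports: dict[str, int] = COLLECTOR_PORTS) -> dict[str, bool]:
    """Map each endpoint label to whether its port accepted a connection."""
    return {label: port_is_open(host, port) for label, port in ports.items()}
```

Run `probe_collector("signoz-otel-collector.bakery-ia.svc.cluster.local")` from a pod inside the cluster, or point it at `localhost` after a `kubectl port-forward` of the relevant ports.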

### Service Configuration

Services are configured via the `bakery-config` ConfigMap:

```yaml
# Observability enabled
ENABLE_TRACING: "true"
ENABLE_METRICS: "true"
ENABLE_LOGS: "true"

# OTel Collector endpoint
OTEL_EXPORTER_OTLP_ENDPOINT: "http://signoz-otel-collector.bakery-ia.svc.cluster.local:4317"
OTEL_EXPORTER_OTLP_PROTOCOL: "grpc"
```
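
ConfigMap values reach a container as strings, so `"true"` has to be interpreted by the service. The sketch below shows one way a service might read these settings; the helper names and the set of accepted truthy spellings are assumptions, not the behaviour of the shared library.

```python
import os
from typing import Mapping

# Assumed set of spellings treated as "enabled"; adjust to your own convention.
TRUE_VALUES = {"1", "true", "yes", "on"}


def env_flag(name: str, env: Mapping[str, str] = os.environ, default: bool = False) -> bool:
    """Interpret an environment variable as a boolean flag."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in TRUE_VALUES


def telemetry_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Collect the observability settings shown in the ConfigMap above."""
    return {
        "tracing": env_flag("ENABLE_TRACING", env),
        "metrics": env_flag("ENABLE_METRICS", env),
        "logs": env_flag("ENABLE_LOGS", env),
        "endpoint": env.get("OTEL_EXPORTER_OTLP_ENDPOINT", ""),
        "protocol": env.get("OTEL_EXPORTER_OTLP_PROTOCOL", ""),
    }
```

Printing `telemetry_settings()` at startup makes it easy to confirm from the pod logs that the ConfigMap was actually mounted into the environment.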

### Shared Tracing Library

Services use `shared/monitoring/tracing.py`, which:
- Auto-instruments FastAPI endpoints
- Auto-instruments HTTPX (inter-service calls)
- Auto-instruments Redis operations
- Auto-instruments SQLAlchemy (PostgreSQL)
- Uses the OTLP exporter to send traces to SigNoz

**Default endpoint**: `http://signoz-otel-collector.bakery-ia:4318` (HTTP)
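
The ConfigMap points at the gRPC port (4317), while the library default uses the HTTP port (4318), so it helps to know which one a service ends up with. The sketch below assumes a service reads `OTEL_EXPORTER_OTLP_ENDPOINT` and falls back to the default above when it is unset; `resolve_endpoint` is a hypothetical helper, and inferring the transport from the port number is a heuristic based on the endpoint table, not something the library is known to do.

```python
from urllib.parse import urlsplit

DEFAULT_ENDPOINT = "http://signoz-otel-collector.bakery-ia:4318"

# Heuristic mapping taken from the collector's port table.
PORT_TRANSPORT = {4317: "grpc", 4318: "http"}


def resolve_endpoint(env):
    """Return (endpoint, transport) for the given environment mapping."""
    endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT") or DEFAULT_ENDPOINT
    # urlsplit only recognises host:port when a scheme or leading // is present.
    parts = urlsplit(endpoint if "://" in endpoint else f"//{endpoint}")
    transport = PORT_TRANSPORT.get(parts.port, "unknown")
    return endpoint, transport
```

Calling `resolve_endpoint(os.environ)` inside a running pod shows at a glance whether the service is using the ConfigMap value or the library default.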

## Verification Steps

### 1. Quick Verification Script

Run the automated verification script:

```bash
./infrastructure/helm/verify-signoz-telemetry.sh
```

This script checks:
- ✅ SigNoz components are running
- ✅ OTel Collector endpoints are exposed
- ✅ Configuration is correct
- ✅ Health checks pass
- ✅ Data is being collected in ClickHouse

### 2. Manual Verification

#### Check SigNoz Components Status

```bash
kubectl get pods -n bakery-ia | grep signoz
```

Expected output:
```
signoz-0                                 1/1     Running
signoz-otel-collector-xxxxx              1/1     Running
chi-signoz-clickhouse-cluster-0-0-0      1/1     Running
signoz-zookeeper-0                       1/1     Running
signoz-clickhouse-operator-xxxxx         2/2     Running
```
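
Comparing the pod list against the expected output by eye gets tedious, so the check can be scripted. This is a small sketch that parses the text printed by `kubectl get pods` and reports pods that are not `Running` or do not have all containers ready; `unready_pods` is an illustrative name.

```python
def unready_pods(kubectl_output: str) -> list[str]:
    """Return names of pods that are not Running with all containers ready."""
    problems = []
    for line in kubectl_output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row (absent when piped through grep).
        if len(fields) < 3 or fields[0] == "NAME":
            continue
        name, ready, status = fields[0], fields[1], fields[2]
        ready_count, _, total = ready.partition("/")
        if status != "Running" or ready_count != total:
            problems.append(name)
    return problems
```

An empty list means every listed pod matches the expected state. Feed it the captured command output, for example via `subprocess.run([...], capture_output=True, text=True).stdout`.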

#### Check OTel Collector Logs

```bash
kubectl logs -n bakery-ia -l app.kubernetes.io/component=otel-collector --tail=50
```

Look for:
- `"msg":"Everything is ready. Begin running and processing data."`
- No error messages about invalid processors
- Evidence of data reception (traces/metrics/logs)
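
The log checks listed above can be sketched as a small scanner. It looks for the readiness message quoted in the list and collects lines that mention an error; matching on the word "error" is an assumed heuristic, since the exact log format depends on the collector configuration.

```python
import re

READY_MESSAGE = "Everything is ready. Begin running and processing data."

# Assumed heuristic: any line containing the standalone word "error" is suspicious.
ERROR_PATTERN = re.compile(r"\berror\b", re.IGNORECASE)


def summarize_collector_logs(lines):
    """Report whether the collector became ready and which lines look like errors."""
    ready = False
    errors = []
    for line in lines:
        if READY_MESSAGE in line:
            ready = True
        elif ERROR_PATTERN.search(line):
            errors.append(line.rstrip())
    return {"ready": ready, "errors": errors}
```

Because the function accepts any iterable of lines, it can read piped input directly: `kubectl logs ... | python scan_logs.py` with `summarize_collector_logs(sys.stdin)` in the script (the file name is hypothetical).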

#### Check Service Logs for Tracing

```bash
# Check a specific service (e.g., gateway)
kubectl logs -n bakery-ia -l app=gateway --tail=100 | grep -i "tracing\|otel"
```

Expected output:
```
Distributed tracing configured
service=gateway-service
otel_endpoint=http://signoz-otel-collector.bakery-ia:4318
```

### 3. Generate Test Traffic

Run the traffic generation script:

```bash
./infrastructure/helm/generate-test-traffic.sh
```

This script:
1. Makes API calls to various service endpoints
2. Checks service logs for telemetry
3. Waits for data processing (30 seconds)

### 4. Verify Data in ClickHouse

```bash
# Get ClickHouse password
CH_PASSWORD=$(kubectl get secret -n bakery-ia signoz-clickhouse -o jsonpath='{.data.admin-password}' 2>/dev/null | base64 -d)

# Get ClickHouse pod
CH_POD=$(kubectl get pods -n bakery-ia -l clickhouse.altinity.com/chi=signoz-clickhouse -o jsonpath='{.items[0].metadata.name}')

# Check traces (variables are quoted so an empty or unusual value cannot split the command)
kubectl exec -n bakery-ia "$CH_POD" -- clickhouse-client --user=admin --password="$CH_PASSWORD" --query="
  SELECT
    serviceName,
    COUNT() as trace_count,
    min(timestamp) as first_trace,
    max(timestamp) as last_trace
  FROM signoz_traces.signoz_index_v2
  WHERE timestamp >= now() - INTERVAL 1 HOUR
  GROUP BY serviceName
  ORDER BY trace_count DESC
"

# Check metrics
kubectl exec -n bakery-ia "$CH_POD" -- clickhouse-client --user=admin --password="$CH_PASSWORD" --query="
  SELECT
    metric_name,
    COUNT() as sample_count
  FROM signoz_metrics.samples_v4
  WHERE unix_milli >= toUnixTimestamp(now() - INTERVAL 1 HOUR) * 1000
  GROUP BY metric_name
  ORDER BY sample_count DESC
  LIMIT 10
"

# Check logs
kubectl exec -n bakery-ia "$CH_POD" -- clickhouse-client --user=admin --password="$CH_PASSWORD" --query="
  SELECT
    COUNT() as log_count,
    min(timestamp) as first_log,
    max(timestamp) as last_log
  FROM signoz_logs.logs
  WHERE timestamp >= now() - INTERVAL 1 HOUR
"
```
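
The output of the traces query can be checked programmatically, for example to flag services that sent nothing in the last hour. The sketch below assumes `clickhouse-client` prints tab-separated rows for a `--query` call, which is its usual non-interactive default; the function names and the list of expected services are illustrative.

```python
def parse_trace_counts(tsv_output: str) -> dict[str, int]:
    """Parse `serviceName<TAB>trace_count<TAB>...` rows into a dict."""
    counts = {}
    for line in tsv_output.splitlines():
        if not line.strip():
            continue
        # Only the first two columns are needed; the timestamps are ignored.
        service, count, *_ = line.split("\t")
        counts[service] = int(count)
    return counts


def missing_services(counts: dict[str, int], expected: list[str]) -> list[str]:
    """Return expected services with no recorded traces, sorted by name."""
    return sorted(name for name in expected if counts.get(name, 0) == 0)
```

For instance, `missing_services(parse_trace_counts(output), ["gateway-service", "auth-service"])` returns the names that still need traffic or a tracing fix.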

### 5. Access SigNoz UI

#### Via Ingress (Recommended)

1. Add to `/etc/hosts`:
   ```
   127.0.0.1 monitoring.bakery-ia.local
   ```

2. Access: https://monitoring.bakery-ia.local

#### Via Port-Forward

```bash
kubectl port-forward -n bakery-ia svc/signoz 3301:8080
```

Then access: http://localhost:3301

### 6. Explore Telemetry Data in SigNoz UI

1. **Traces**:
   - Go to the "Services" tab
   - You should see your services listed (gateway, auth-service, inventory-service, etc.)
   - Click on a service to see its traces
   - Click on individual traces to see span details

2. **Metrics**:
   - Go to the "Dashboards" or "Metrics" tab
   - You should see infrastructure metrics (PostgreSQL, Redis, RabbitMQ)
   - You should see service metrics (request rate, latency, errors)

3. **Logs**:
   - Go to the "Logs" tab
   - You should see logs from your services
   - You can filter by service name, log level, etc.

## Troubleshooting

### Services Can't Connect to OTel Collector

**Symptoms**:
```
[ERROR] opentelemetry.exporter.otlp.proto.grpc.exporter: Failed to export traces
error code: StatusCode.UNAVAILABLE
```

**Solutions**:

1. **Check that the OTel Collector is running**:
   ```bash
   kubectl get pods -n bakery-ia -l app.kubernetes.io/component=otel-collector
   ```

2. **Verify the service can reach the collector**:
   ```bash
   # From a service pod
   kubectl exec -it -n bakery-ia <service-pod> -- curl -v http://signoz-otel-collector.bakery-ia.svc.cluster.local:4318
   ```

3. **Check endpoint configuration**:
   - The gRPC endpoint should NOT have an `http://` prefix
   - The HTTP endpoint should have an `http://` prefix

   Update your service's tracing setup:
   ```python
   # For gRPC (recommended)
   setup_tracing(app, "my-service", otel_endpoint="signoz-otel-collector.bakery-ia.svc.cluster.local:4317")

   # For HTTP
   setup_tracing(app, "my-service", otel_endpoint="http://signoz-otel-collector.bakery-ia.svc.cluster.local:4318")
   ```

4. **Restart services after config changes**:
   ```bash
   kubectl rollout restart deployment/<service-name> -n bakery-ia
   ```

### No Data in SigNoz

**Possible causes**:

1. **Services haven't been called yet**
   - Solution: Generate traffic using the test script

2. **Tracing not initialized**
   - Check service logs for tracing initialization messages
   - Verify `ENABLE_TRACING=true` in the ConfigMap

3. **Wrong OTel endpoint**
   - Verify `OTEL_EXPORTER_OTLP_ENDPOINT` in the ConfigMap
   - Should be: `http://signoz-otel-collector.bakery-ia.svc.cluster.local:4317`

4. **Service not using tracing library**
   - Check that the service imports and calls `setup_tracing()` in `main.py`
   ```python
   from fastapi import FastAPI
   from shared.monitoring.tracing import setup_tracing

   app = FastAPI(title="My Service")
   setup_tracing(app, "my-service")
   ```

### OTel Collector Errors

**Check collector logs**:
```bash
kubectl logs -n bakery-ia -l app.kubernetes.io/component=otel-collector --tail=100
```

**Common errors**:

1. **Invalid processor error**:
   - Check `signoz-values-dev.yaml` has `signozspanmetrics/delta` (not `spanmetrics`)
   - Already fixed in your configuration

2. **ClickHouse connection error**:
   - Verify ClickHouse is running
   - Check ClickHouse service is accessible

3. **Configuration validation error**:
   - Validate YAML syntax in `signoz-values-dev.yaml`
   - Check all processors used in pipelines are defined

## Infrastructure Metrics

SigNoz collects infrastructure metrics through receivers configured on the OTel Collector:

### PostgreSQL Databases
- **Receivers configured for**:
  - auth_db (auth-db-service:5432)
  - inventory_db (inventory-db-service:5432)
  - orders_db (orders-db-service:5432)

- **Metrics collected**:
  - Connection counts
  - Query performance
  - Database size
  - Table statistics

### Redis
- **Endpoint**: redis-service:6379
- **Metrics collected**:
  - Memory usage
  - Key count
  - Hit/miss ratio
  - Command stats

### RabbitMQ
- **Endpoint**: rabbitmq-service:15672 (management API)
- **Metrics collected**:
  - Queue lengths
  - Message rates
  - Connection counts
  - Consumer activity

## Best Practices

### 1. Service Implementation

Always initialize tracing in your service's `main.py`:

```python
import os

from fastapi import FastAPI
from shared.monitoring.tracing import setup_tracing

app = FastAPI(title="My Service")

# Initialize tracing
otel_endpoint = os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT", "http://signoz-otel-collector.bakery-ia:4318")
setup_tracing(
    app,
    service_name="my-service",
    service_version=os.getenv("SERVICE_VERSION", "1.0.0"),
    otel_endpoint=otel_endpoint
)
```

### 2. Custom Spans

Add custom spans for important operations:

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

@app.post("/process")
async def process_data(data: dict):
    with tracer.start_as_current_span("process_data") as span:
        span.set_attribute("data.size", len(data))
        # Attribute values must not be None, so fall back to a placeholder
        span.set_attribute("data.type", data.get("type", "unknown"))

        # Your processing logic (`process` stands in for your own function)
        result = process(data)

        span.set_attribute("result.status", "success")
        return result
```

### 3. Error Tracking

Record exceptions in spans:

```python
from shared.monitoring.tracing import record_exception

try:
    result = risky_operation()
except Exception as e:
    record_exception(e)
    raise
```

### 4. Correlation

Use trace IDs in logs for correlation:

```python
from shared.monitoring.tracing import get_current_trace_id

trace_id = get_current_trace_id()
logger.info("Processing request", trace_id=trace_id)
```
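
For services that use the standard library `logging` module rather than a logger that accepts keyword arguments, the same correlation can be done with a logging filter. This is a sketch under assumptions: `TraceIdFilter` and `build_handler` are hypothetical helpers, and the callable passed in (for example `get_current_trace_id`) is assumed to return a string or `None`.

```python
import logging


class TraceIdFilter(logging.Filter):
    """Attach a trace_id attribute to every record that passes through."""

    def __init__(self, get_trace_id):
        super().__init__()
        self._get_trace_id = get_trace_id  # e.g. get_current_trace_id

    def filter(self, record: logging.LogRecord) -> bool:
        # Use "-" when no span is active so the format string never fails.
        record.trace_id = self._get_trace_id() or "-"
        return True


def build_handler(get_trace_id, stream=None) -> logging.Handler:
    """Create a stream handler whose output includes the current trace ID."""
    handler = logging.StreamHandler(stream)
    handler.addFilter(TraceIdFilter(get_trace_id))
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
    )
    return handler
```

Attaching the filter to the handler rather than to individual loggers means every record written through that handler carries the ID, including records from third-party libraries.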

## Next Steps

1. ✅ **Verify SigNoz is running** - Run verification script
2. ✅ **Generate test traffic** - Run traffic generation script
3. ✅ **Check data collection** - Query ClickHouse or use UI
4. ✅ **Access SigNoz UI** - Visualize traces, metrics, and logs
5. ⏭️ **Set up dashboards** - Create custom dashboards for your use cases
6. ⏭️ **Configure alerts** - Set up alerts for critical metrics
7. ⏭️ **Document** - Document common queries and dashboard configurations

## Useful Commands

```bash
# Quick status check
kubectl get pods -n bakery-ia | grep signoz

# View OTel Collector metrics
kubectl port-forward -n bakery-ia svc/signoz-otel-collector 8888:8888
# Then visit: http://localhost:8888/metrics

# Restart OTel Collector
kubectl rollout restart deployment/signoz-otel-collector -n bakery-ia

# View all services with telemetry (selector quoted so the shell does not interpret "!")
kubectl get pods -n bakery-ia -l 'tier!=infrastructure'

# Check specific service logs
kubectl logs -n bakery-ia -l app=<service-name> --tail=100 -f

# Port-forward to SigNoz UI
kubectl port-forward -n bakery-ia svc/signoz 3301:8080
```
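
The collector's `/metrics` page uses the Prometheus text format, so totals can be read with a few lines of parsing. The sketch below sums every sample of one metric; `sum_metric` is an illustrative helper, and a name such as `otelcol_receiver_accepted_spans` is only an example, since exact metric names vary by collector version and should be checked in the `/metrics` output first.

```python
def sum_metric(metrics_text: str, metric_name: str) -> float:
    """Sum the values of every sample of metric_name in Prometheus text format."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blank lines and # HELP / # TYPE comments.
        if not line or line.startswith("#"):
            continue
        # A sample line is `name{labels} value [timestamp]` or `name value`.
        if "{" in line:
            name = line[: line.index("{")]
            rest = line[line.rindex("}") + 1 :]
        else:
            name, _, rest = line.partition(" ")
        if name != metric_name:
            continue
        fields = rest.split()
        if fields:
            total += float(fields[0])
    return total
```

With the port-forward from the commands above running, the text can be fetched with `urllib.request.urlopen("http://localhost:8888/metrics").read().decode()` and passed to `sum_metric`. A total that grows while test traffic is running suggests the collector is receiving data.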

## Resources

- [SigNoz Documentation](https://signoz.io/docs/)
- [OpenTelemetry Python](https://opentelemetry.io/docs/languages/python/)
- [SigNoz GitHub](https://github.com/SigNoz/signoz)
- [Helm Chart Values](infrastructure/helm/signoz-values-dev.yaml)
- [Verification Script](infrastructure/helm/verify-signoz-telemetry.sh)
- [Traffic Generation Script](infrastructure/helm/generate-test-traffic.sh)