Improve monitoring 3
435
docs/SIGNOZ_VERIFICATION_GUIDE.md
Normal file
@@ -0,0 +1,435 @@
# SigNoz Telemetry Verification Guide

## Overview

This guide explains how to verify that your services are correctly sending metrics, logs, and traces to SigNoz, and that SigNoz is collecting them properly.

## Current Configuration

### SigNoz Components
- **Version**: v0.106.0
- **OTel Collector**: v0.129.12
- **Namespace**: `bakery-ia`
- **Ingress URL**: https://monitoring.bakery-ia.local

### Telemetry Endpoints

The OTel Collector exposes the following endpoints:

| Protocol | Port | Purpose |
|----------|------|---------|
| OTLP gRPC | 4317 | Traces, Metrics, Logs (gRPC) |
| OTLP HTTP | 4318 | Traces, Metrics, Logs (HTTP) |
| Jaeger gRPC | 14250 | Jaeger traces (gRPC) |
| Jaeger HTTP | 14268 | Jaeger traces (HTTP) |
| Metrics | 8888 | Prometheus metrics from collector |
| Health Check | 13133 | Collector health status |

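As a quick way to confirm that the ports in the table accept connections, the following is a minimal sketch using only the Python standard library. The helper names `port_is_open` and `check_collector_ports` are hypothetical and not part of the repository; the sketch only tests TCP reachability and does not validate the protocol spoken on each port.

```python
import socket

# Ports copied from the table above.
COLLECTOR_PORTS = {
    "OTLP gRPC": 4317,
    "OTLP HTTP": 4318,
    "Jaeger gRPC": 14250,
    "Jaeger HTTP": 14268,
    "Metrics": 8888,
    "Health Check": 13133,
}


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def check_collector_ports(host: str, ports: dict = COLLECTOR_PORTS) -> dict:
    """Map each endpoint name to whether its port accepts connections."""
    return {name: port_is_open(host, port) for name, port in ports.items()}
```

Calling `check_collector_ports("signoz-otel-collector.bakery-ia.svc.cluster.local")` from a pod inside the cluster (or against `localhost` after a port-forward) returns one boolean per row of the table.
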
### Service Configuration

Services are configured via the `bakery-config` ConfigMap:

```yaml
# Observability enabled
ENABLE_TRACING: "true"
ENABLE_METRICS: "true"
ENABLE_LOGS: "true"

# OTel Collector endpoint
OTEL_EXPORTER_OTLP_ENDPOINT: "http://signoz-otel-collector.bakery-ia.svc.cluster.local:4317"
OTEL_EXPORTER_OTLP_PROTOCOL: "grpc"
```

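ConfigMap values reach a service as environment-variable strings, so a flag such as `"true"` has to be parsed explicitly. The sketch below shows one way a service might read these settings; `read_observability_config` and the set of accepted truthy strings are assumptions for illustration, not the behaviour of `shared/monitoring/tracing.py`.

```python
import os

# Strings treated as "enabled"; this set is an assumption, adjust to taste.
TRUE_VALUES = {"1", "true", "yes", "on"}


def read_observability_config(env=None):
    """Read the observability settings from a mapping (default: os.environ)."""
    env = os.environ if env is None else env

    def flag(name):
        # Environment values are strings, so "true" must be parsed explicitly.
        return env.get(name, "false").strip().lower() in TRUE_VALUES

    return {
        "tracing": flag("ENABLE_TRACING"),
        "metrics": flag("ENABLE_METRICS"),
        "logs": flag("ENABLE_LOGS"),
        "endpoint": env.get("OTEL_EXPORTER_OTLP_ENDPOINT", ""),
        "protocol": env.get("OTEL_EXPORTER_OTLP_PROTOCOL", ""),
    }
```

Passing a plain dictionary instead of `os.environ` makes the parsing easy to check in isolation.
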
### Shared Tracing Library

Services use `shared/monitoring/tracing.py`, which:
- Auto-instruments FastAPI endpoints
- Auto-instruments HTTPX (inter-service calls)
- Auto-instruments Redis operations
- Auto-instruments SQLAlchemy (PostgreSQL)
- Uses the OTLP exporter to send traces to SigNoz

**Default endpoint**: `http://signoz-otel-collector.bakery-ia:4318` (HTTP). Note that this differs from the `bakery-config` ConfigMap value, which points at the gRPC port (4317).

## Verification Steps

### 1. Quick Verification Script

Run the automated verification script:

```bash
./infrastructure/helm/verify-signoz-telemetry.sh
```

This script checks:
- ✅ SigNoz components are running
- ✅ OTel Collector endpoints are exposed
- ✅ Configuration is correct
- ✅ Health checks pass
- ✅ Data is being collected in ClickHouse

### 2. Manual Verification

#### Check SigNoz Components Status

```bash
kubectl get pods -n bakery-ia | grep signoz
```

Expected output:
```
signoz-0                              1/1     Running
signoz-otel-collector-xxxxx           1/1     Running
chi-signoz-clickhouse-cluster-0-0-0   1/1     Running
signoz-zookeeper-0                    1/1     Running
signoz-clickhouse-operator-xxxxx      2/2     Running
```

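If you want to check this output programmatically rather than by eye, a small parser is enough. The following is a sketch under the assumption that each line starts with the pod name, the `READY` column, and the `STATUS` column, as `kubectl get pods` prints them; `unhealthy_pods` is a hypothetical helper, and it will also flag pods in states such as `Completed`, which may or may not be what you want.

```python
def unhealthy_pods(kubectl_output: str) -> list:
    """Return names of pods that are not Running or not fully ready.

    Expects lines of the form "<name> <ready>/<total> <status> ...";
    a header line starting with NAME is skipped.
    """
    bad = []
    for line in kubectl_output.splitlines():
        fields = line.split()
        if len(fields) < 3 or fields[0] == "NAME":
            continue
        name, ready, status = fields[0], fields[1], fields[2]
        ready_count, _, total = ready.partition("/")
        if status != "Running" or ready_count != total:
            bad.append(name)
    return bad
```

An empty list means every listed pod is `Running` with all of its containers ready.
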
#### Check OTel Collector Logs

```bash
kubectl logs -n bakery-ia -l app.kubernetes.io/component=otel-collector --tail=50
```

Look for:
- `"msg":"Everything is ready. Begin running and processing data."`
- No error messages about invalid processors
- Evidence of data reception (traces/metrics/logs)

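The first two checks in the list above can be automated by piping the log output through a short script. This is a sketch, not a definitive parser: it assumes the collector emits JSON log lines with `msg` and `level` fields (the `level` field name is an assumption), and falls back to substring matching for anything that is not JSON. `summarize_collector_logs` is a hypothetical helper name.

```python
import json

READY_MESSAGE = "Everything is ready. Begin running and processing data."


def summarize_collector_logs(lines):
    """Return (ready, error_lines) for an iterable of collector log lines."""
    ready = False
    errors = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except ValueError:
            record = None
        if isinstance(record, dict):
            # Structured line: inspect the fields directly.
            if record.get("msg") == READY_MESSAGE:
                ready = True
            if str(record.get("level", "")).lower() in ("error", "fatal"):
                errors.append(line)
        else:
            # Unstructured line: fall back to substring checks.
            if READY_MESSAGE in line:
                ready = True
            if "error" in line.lower():
                errors.append(line)
    return ready, errors
```

For example, save the `kubectl logs` output to a file and pass the open file object to `summarize_collector_logs`; a result of `(True, [])` corresponds to a healthy startup with no logged errors.
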
#### Check Service Logs for Tracing

```bash
# Check a specific service (e.g., gateway)
kubectl logs -n bakery-ia -l app=gateway --tail=100 | grep -i "tracing\|otel"
```

Expected output:
```
Distributed tracing configured
service=gateway-service
otel_endpoint=http://signoz-otel-collector.bakery-ia:4318
```

### 3. Generate Test Traffic

Run the traffic generation script:

```bash
./infrastructure/helm/generate-test-traffic.sh
```

This script:
1. Makes API calls to various service endpoints
2. Checks service logs for telemetry
3. Waits for data processing (30 seconds)

### 4. Verify Data in ClickHouse

```bash
# Get ClickHouse password
CH_PASSWORD=$(kubectl get secret -n bakery-ia signoz-clickhouse -o jsonpath='{.data.admin-password}' 2>/dev/null | base64 -d)

# Get ClickHouse pod
CH_POD=$(kubectl get pods -n bakery-ia -l clickhouse.altinity.com/chi=signoz-clickhouse -o jsonpath='{.items[0].metadata.name}')

# Check traces (variables are quoted so special characters in the password survive)
kubectl exec -n bakery-ia "$CH_POD" -- clickhouse-client --user=admin --password="$CH_PASSWORD" --query="
SELECT
    serviceName,
    COUNT() as trace_count,
    min(timestamp) as first_trace,
    max(timestamp) as last_trace
FROM signoz_traces.signoz_index_v2
WHERE timestamp >= now() - INTERVAL 1 HOUR
GROUP BY serviceName
ORDER BY trace_count DESC
"

# Check metrics
kubectl exec -n bakery-ia "$CH_POD" -- clickhouse-client --user=admin --password="$CH_PASSWORD" --query="
SELECT
    metric_name,
    COUNT() as sample_count
FROM signoz_metrics.samples_v4
WHERE unix_milli >= toUnixTimestamp(now() - INTERVAL 1 HOUR) * 1000
GROUP BY metric_name
ORDER BY sample_count DESC
LIMIT 10
"

# Check logs
kubectl exec -n bakery-ia "$CH_POD" -- clickhouse-client --user=admin --password="$CH_PASSWORD" --query="
SELECT
    COUNT() as log_count,
    min(timestamp) as first_log,
    max(timestamp) as last_log
FROM signoz_logs.logs
WHERE timestamp >= now() - INTERVAL 1 HOUR
"
```

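The metrics query filters on `unix_milli`, an integer column in milliseconds, which is why its cutoff is written as `toUnixTimestamp(...) * 1000` instead of a plain timestamp comparison. The arithmetic can be reproduced in Python as a sanity check; `cutoff_unix_milli` is a hypothetical helper written for this illustration.

```python
from datetime import datetime, timedelta, timezone


def cutoff_unix_milli(now: datetime, window: timedelta = timedelta(hours=1)) -> int:
    """Mirror toUnixTimestamp(now() - INTERVAL 1 HOUR) * 1000 from the metrics query."""
    # toUnixTimestamp yields whole seconds, so truncate before scaling to ms.
    seconds = int((now - window).timestamp())
    return seconds * 1000


# Example: one hour before 2024-01-01 12:00:00 UTC, expressed in milliseconds.
example = cutoff_unix_milli(datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc))
```

Any sample whose `unix_milli` is greater than or equal to this value falls inside the one-hour window.
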
### 5. Access SigNoz UI

#### Via Ingress (Recommended)

1. Add to `/etc/hosts`:
   ```
   127.0.0.1 monitoring.bakery-ia.local
   ```

2. Access: https://monitoring.bakery-ia.local

#### Via Port-Forward

```bash
kubectl port-forward -n bakery-ia svc/signoz 3301:8080
```

Then access: http://localhost:3301

### 6. Explore Telemetry Data in SigNoz UI

1. **Traces**:
   - Go to the "Services" tab
   - You should see your services listed (gateway, auth-service, inventory-service, etc.)
   - Click on a service to see its traces
   - Click on individual traces to see span details

2. **Metrics**:
   - Go to the "Dashboards" or "Metrics" tab
   - You should see infrastructure metrics (PostgreSQL, Redis, RabbitMQ)
   - You should see service metrics (request rate, latency, errors)

3. **Logs**:
   - Go to the "Logs" tab
   - You should see logs from your services
   - You can filter by service name, log level, etc.

## Troubleshooting

### Services Can't Connect to OTel Collector

**Symptoms**:
```
[ERROR] opentelemetry.exporter.otlp.proto.grpc.exporter: Failed to export traces
error code: StatusCode.UNAVAILABLE
```

**Solutions**:

1. **Check that the OTel Collector is running**:
   ```bash
   kubectl get pods -n bakery-ia -l app.kubernetes.io/component=otel-collector
   ```

2. **Verify that the service can reach the collector**:
   ```bash
   # From a service pod; any HTTP response shows the port is reachable
   kubectl exec -it -n bakery-ia <service-pod> -- curl -v http://signoz-otel-collector.bakery-ia.svc.cluster.local:4318
   ```

3. **Check endpoint configuration**:
   - The gRPC endpoint should NOT have an `http://` prefix
   - The HTTP endpoint should have an `http://` prefix

   Update your service's tracing setup:
   ```python
   # For gRPC (recommended)
   setup_tracing(app, "my-service", otel_endpoint="signoz-otel-collector.bakery-ia.svc.cluster.local:4317")

   # For HTTP
   setup_tracing(app, "my-service", otel_endpoint="http://signoz-otel-collector.bakery-ia.svc.cluster.local:4318")
   ```

4. **Restart services after config changes**:
   ```bash
   kubectl rollout restart deployment/<service-name> -n bakery-ia
   ```

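The endpoint rule in step 3 (bare `host:port` for gRPC, `http://` URL for HTTP) is easy to get wrong when the same value is shared between the ConfigMap and code. The sketch below shows one way to normalize an endpoint before passing it to `setup_tracing()`. `normalize_otlp_endpoint` is a hypothetical helper and not part of `shared/monitoring/tracing.py`; the protocol labels `"grpc"` and `"http"` are likewise assumptions.

```python
def normalize_otlp_endpoint(endpoint: str, protocol: str) -> str:
    """Shape an endpoint per step 3: bare host:port for gRPC, URL for HTTP."""
    endpoint = endpoint.strip()
    scheme, sep, rest = endpoint.partition("://")
    host_port = rest if sep else endpoint  # drop any existing scheme
    if protocol == "grpc":
        return host_port
    if protocol == "http":
        # Keep an explicit https:// if one was given; otherwise use http://.
        prefix = scheme if sep and scheme in ("http", "https") else "http"
        return f"{prefix}://{host_port}"
    raise ValueError(f"unsupported protocol: {protocol!r}")
```

With this helper, the ConfigMap value `http://signoz-otel-collector.bakery-ia.svc.cluster.local:4317` combined with protocol `grpc` becomes the bare `host:port` form shown in the gRPC example above.
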
### No Data in SigNoz

**Possible causes**:

1. **Services haven't been called yet**
   - Solution: Generate traffic using the test script

2. **Tracing not initialized**
   - Check service logs for tracing initialization messages
   - Verify `ENABLE_TRACING=true` in ConfigMap

3. **Wrong OTel endpoint**
   - Verify `OTEL_EXPORTER_OTLP_ENDPOINT` in ConfigMap
   - Should be: `http://signoz-otel-collector.bakery-ia.svc.cluster.local:4317`

4. **Service not using tracing library**
   - Check that the service imports and calls `setup_tracing()` in `main.py`
   ```python
   from fastapi import FastAPI
   from shared.monitoring.tracing import setup_tracing

   app = FastAPI(title="My Service")
   setup_tracing(app, "my-service")
   ```

### OTel Collector Errors

**Check collector logs**:
```bash
kubectl logs -n bakery-ia -l app.kubernetes.io/component=otel-collector --tail=100
```

**Common errors**:

1. **Invalid processor error**:
   - Check `signoz-values-dev.yaml` has `signozspanmetrics/delta` (not `spanmetrics`)
   - Already fixed in your configuration

2. **ClickHouse connection error**:
   - Verify ClickHouse is running
   - Check ClickHouse service is accessible

3. **Configuration validation error**:
   - Validate YAML syntax in `signoz-values-dev.yaml`
   - Check all processors used in pipelines are defined

## Infrastructure Metrics

SigNoz automatically collects metrics from your infrastructure:

### PostgreSQL Databases
- **Receivers configured for**:
  - auth_db (auth-db-service:5432)
  - inventory_db (inventory-db-service:5432)
  - orders_db (orders-db-service:5432)

- **Metrics collected**:
  - Connection counts
  - Query performance
  - Database size
  - Table statistics

### Redis
- **Endpoint**: redis-service:6379
- **Metrics collected**:
  - Memory usage
  - Keys count
  - Hit/miss ratio
  - Command stats

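The hit/miss ratio listed above is derived from two counters that Redis reports in `INFO stats`, `keyspace_hits` and `keyspace_misses`. A minimal sketch of the calculation follows; the metric names under which these counters appear in SigNoz depend on the receiver configuration and are not assumed here, and `hit_ratio` is a hypothetical helper.

```python
def hit_ratio(keyspace_hits: int, keyspace_misses: int) -> float:
    """Fraction of key lookups served from the cache, from 0.0 to 1.0."""
    total = keyspace_hits + keyspace_misses
    if total == 0:
        return 0.0  # no lookups yet; avoid dividing by zero
    return keyspace_hits / total
```

For example, 900 hits and 100 misses give a ratio of 0.9. A ratio that stays low under steady traffic usually points at keys expiring or being evicted before they are read again.
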
### RabbitMQ
- **Endpoint**: rabbitmq-service:15672 (management API)
- **Metrics collected**:
  - Queue lengths
  - Message rates
  - Connection counts
  - Consumer activity

## Best Practices

### 1. Service Implementation

Always initialize tracing in your service's `main.py`:

```python
import os

from fastapi import FastAPI
from shared.monitoring.tracing import setup_tracing

app = FastAPI(title="My Service")

# Initialize tracing
otel_endpoint = os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT", "http://signoz-otel-collector.bakery-ia:4318")
setup_tracing(
    app,
    service_name="my-service",
    service_version=os.getenv("SERVICE_VERSION", "1.0.0"),
    otel_endpoint=otel_endpoint
)
```

### 2. Custom Spans

Add custom spans for important operations:

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

@app.post("/process")
async def process_data(data: dict):
    with tracer.start_as_current_span("process_data") as span:
        span.set_attribute("data.size", len(data))
        # None is not a valid attribute value, so fall back to a placeholder
        span.set_attribute("data.type", str(data.get("type", "unknown")))

        # Your processing logic (process() stands in for your own function)
        result = process(data)

        span.set_attribute("result.status", "success")
        return result
```

### 3. Error Tracking

Record exceptions in spans:

```python
from shared.monitoring.tracing import record_exception

try:
    result = risky_operation()
except Exception as e:
    record_exception(e)
    raise
```

### 4. Correlation

Use trace IDs in logs for correlation:

```python
from shared.monitoring.tracing import get_current_trace_id

trace_id = get_current_trace_id()
# Assumes a structlog-style logger that accepts keyword arguments
logger.info("Processing request", trace_id=trace_id)
```

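For the correlation to work, the trace ID written to the log has to match the form in which trace IDs are conventionally displayed and searched: 32 lowercase hexadecimal characters. OpenTelemetry's Python API stores the trace ID as a 128-bit integer, so a conversion like the sketch below may be needed. Whether `get_current_trace_id()` already returns the hex string is not stated in this guide, so treat `format_trace_id` as a hypothetical helper for the case where it does not.

```python
def format_trace_id(trace_id: int) -> str:
    """Render a 128-bit trace ID as 32 lowercase hex characters."""
    if not 0 <= trace_id < 2**128:
        raise ValueError("trace ID must fit in 128 bits")
    # Zero-pad to 32 digits so short IDs still match the standard form.
    return format(trace_id, "032x")
```

Logging the padded hex form lets you paste the value from a log line straight into a trace search.
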
## Next Steps

1. ✅ **Verify SigNoz is running** - Run verification script
2. ✅ **Generate test traffic** - Run traffic generation script
3. ✅ **Check data collection** - Query ClickHouse or use UI
4. ✅ **Access SigNoz UI** - Visualize traces, metrics, and logs
5. ⏭️ **Set up dashboards** - Create custom dashboards for your use cases
6. ⏭️ **Configure alerts** - Set up alerts for critical metrics
7. ⏭️ **Document** - Document common queries and dashboard configurations

## Useful Commands

```bash
# Quick status check
kubectl get pods -n bakery-ia | grep signoz

# View OTel Collector metrics
kubectl port-forward -n bakery-ia svc/signoz-otel-collector 8888:8888
# Then visit: http://localhost:8888/metrics

# Restart OTel Collector
kubectl rollout restart deployment/signoz-otel-collector -n bakery-ia

# View all services with telemetry (selector quoted so the shell leaves "!" alone)
kubectl get pods -n bakery-ia -l 'tier!=infrastructure'

# Check specific service logs
kubectl logs -n bakery-ia -l app=<service-name> --tail=100 -f

# Port-forward to SigNoz UI
kubectl port-forward -n bakery-ia svc/signoz 3301:8080
```

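The collector's `/metrics` page on port 8888 is in the Prometheus text format, so a few lines of Python can total a counter across all of its label sets. This is a sketch rather than a full parser: `sum_metric` is a hypothetical helper, and the metric name used in the usage note (`otelcol_receiver_accepted_spans`) is an assumption that may differ between collector versions, for example with a `_total` suffix.

```python
def sum_metric(metrics_text: str, metric_name: str) -> float:
    """Sum all samples of one metric in Prometheus text exposition format."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            # name{labels} value [timestamp]
            name = line.split("{", 1)[0]
            rest = line.rsplit("}", 1)[1]
        else:
            # name value [timestamp]
            name, _, rest = line.partition(" ")
        if name == metric_name:
            total += float(rest.split()[0])
    return total
```

After the port-forward above, fetch `http://localhost:8888/metrics`, pass the body to `sum_metric(body, "otelcol_receiver_accepted_spans")`, and compare the value before and after generating test traffic; an increasing total suggests the collector is receiving spans.
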
## Resources

- [SigNoz Documentation](https://signoz.io/docs/)
- [OpenTelemetry Python](https://opentelemetry.io/docs/languages/python/)
- [SigNoz GitHub](https://github.com/SigNoz/signoz)
- [Helm Chart Values](infrastructure/helm/signoz-values-dev.yaml)
- [Verification Script](infrastructure/helm/verify-signoz-telemetry.sh)
- [Traffic Generation Script](infrastructure/helm/generate-test-traffic.sh)