Improve monitoring 3
This commit is contained in:
289
docs/SIGNOZ_ROOT_CAUSE_ANALYSIS.md
Normal file
@@ -0,0 +1,289 @@
# SigNoz OpAMP Root Cause Analysis & Resolution

## Problem Statement

Services received `StatusCode.UNAVAILABLE` errors when sending traces to the SigNoz OTel Collector on port 4317. The collector was restarting continuously because OpAMP kept applying an invalid remote configuration.

## Root Cause Analysis

### Primary Issue: Missing `signozmeter` Connector Pipeline

**Error Message:**
```
connector "signozmeter" used as receiver in [metrics/meter] pipeline
but not used in any supported exporter pipeline
```

**Root Cause:**
The OpAMP server was pushing a remote configuration that included:
1. A `metrics/meter` pipeline that uses `signozmeter` as a receiver
2. No pipeline that exports to the `signozmeter` connector

**Technical Explanation:**
- **Connectors** in OpenTelemetry are special components that act as BOTH exporters AND receivers
- They bridge between pipelines (e.g., traces → metrics)
- The `signozmeter` connector generates usage/meter metrics from trace data
- For a connector to work, it must be:
  1. Used as an **exporter** in one pipeline (the source)
  2. Used as a **receiver** in another pipeline (the destination)
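
The pairing rule can be checked mechanically before a configuration is deployed. The following is a minimal sketch, not part of the collector: `unpaired_connectors` is a hypothetical helper, and the dictionaries are an assumed, simplified stand-in for the pipelines section of a collector config.

```python
from typing import Dict, List

Pipelines = Dict[str, Dict[str, List[str]]]


def unpaired_connectors(connectors: List[str], pipelines: Pipelines) -> Dict[str, str]:
    """Return {connector: problem} for connectors not used on both sides."""
    problems: Dict[str, str] = {}
    for name in connectors:
        # Pipelines that feed the connector (it appears in their exporters).
        sources = [p for p, cfg in pipelines.items() if name in cfg.get("exporters", [])]
        # Pipelines that read from the connector (it appears in their receivers).
        sinks = [p for p, cfg in pipelines.items() if name in cfg.get("receivers", [])]
        if sinks and not sources:
            problems[name] = f"used as receiver in {sinks} but not exported to by any pipeline"
        elif sources and not sinks:
            problems[name] = f"used as exporter in {sources} but not received by any pipeline"
    return problems


# The failing shape described in this analysis: nothing exports to signozmeter.
broken: Pipelines = {
    "traces": {"receivers": ["otlp"], "exporters": ["clickhousetraces", "metadataexporter"]},
    "metrics/meter": {"receivers": ["signozmeter"], "exporters": ["signozclickhousemeter"]},
}

# The corrected shape: the traces pipeline also exports to signozmeter.
fixed: Pipelines = {
    "traces": {
        "receivers": ["otlp"],
        "exporters": ["clickhousetraces", "metadataexporter", "signozmeter"],
    },
    "metrics/meter": {"receivers": ["signozmeter"], "exporters": ["signozclickhousemeter"]},
}

print(unpaired_connectors(["signozmeter"], broken))
print(unpaired_connectors(["signozmeter"], fixed))
```

Running a check like this in CI against the rendered Helm values would catch the missing exporter before the collector tries to build its pipelines.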

**What Was Missing:**
Our configuration had:
- ✅ `signozmeter` connector defined
- ✅ `metrics/meter` pipeline receiving from `signozmeter`
- ❌ **No pipeline exporting TO `signozmeter`**

The traces pipeline needed to export to `signozmeter`:
```yaml
traces:
  receivers: [otlp]
  processors: [...]
  exporters: [clickhousetraces, metadataexporter, signozmeter]  # <-- signozmeter was missing
```

### Secondary Issue: gRPC Endpoint Format

**Problem:** Services had an `http://` prefix in their gRPC endpoints
**Solution:** Removed the `http://` prefix so the value is a plain `host:port` gRPC target

**Before:**
```yaml
OTEL_EXPORTER_OTLP_ENDPOINT: "http://signoz-otel-collector.bakery-ia.svc.cluster.local:4317"
```

**After:**
```yaml
OTEL_EXPORTER_OTLP_ENDPOINT: "signoz-otel-collector.bakery-ia.svc.cluster.local:4317"
```
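
The before/after change amounts to stripping the scheme from the endpoint string. A minimal sketch of that normalization is shown here; `grpc_target` is a hypothetical helper name and is not taken from the shared tracing library.

```python
def grpc_target(endpoint: str) -> str:
    """Return the bare host:port form of an endpoint, dropping any http(s):// prefix."""
    for prefix in ("http://", "https://"):
        if endpoint.startswith(prefix):
            # Drop the scheme and any trailing slash left over from URL form.
            return endpoint[len(prefix):].rstrip("/")
    return endpoint


before = "http://signoz-otel-collector.bakery-ia.svc.cluster.local:4317"
print(grpc_target(before))
```

Applying the normalization in one place (the ConfigMap, or a shared helper) avoids each service handling the prefix differently.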

### Tertiary Issue: Hardcoded Endpoints

**Problem:** Each service manifest hardcoded the OTEL endpoint instead of referencing the ConfigMap
**Solution:** Updated all 18 services to use `valueFrom: configMapKeyRef`

## Solution Implemented

### 1. Added Complete Meter Pipeline Configuration

**Added Connector:**
```yaml
connectors:
  signozmeter:
    dimensions:
      - name: service.name
      - name: deployment.environment
      - name: host.name
    metrics_flush_interval: 1h
```

**Added Batch Processor:**
```yaml
processors:
  batch/meter:
    timeout: 1s
    send_batch_size: 20000
    send_batch_max_size: 25000
```

**Added Exporters:**
```yaml
exporters:
  # Meter exporter
  signozclickhousemeter:
    dsn: "tcp://admin:PASSWORD@signoz-clickhouse:9000/signoz_meter"
    timeout: 45s
    sending_queue:
      enabled: false

  # Metadata exporter
  metadataexporter:
    dsn: "tcp://admin:PASSWORD@signoz-clickhouse:9000/signoz_metadata"
    timeout: 10s
    cache:
      provider: in_memory
```

**Updated Traces Pipeline:**
```yaml
traces:
  receivers: [otlp]
  processors: [memory_limiter, batch, signozspanmetrics/delta, resourcedetection]
  exporters: [clickhousetraces, metadataexporter, signozmeter]  # Added signozmeter
```

**Added Meter Pipeline:**
```yaml
metrics/meter:
  receivers: [signozmeter]
  processors: [batch/meter]
  exporters: [signozclickhousemeter]
```

### 2. Fixed gRPC Endpoint Configuration

Updated ConfigMaps:
- `infrastructure/kubernetes/base/configmap.yaml`
- `infrastructure/kubernetes/overlays/prod/prod-configmap.yaml`

### 3. Centralized OTEL Configuration

Created script: `infrastructure/kubernetes/fix-otel-endpoints.sh`

Updated 18 service manifests to use ConfigMap reference instead of hardcoded values.

## Results

### Before Fix
- ❌ OTel Collector continuously restarting
- ❌ Services unable to export traces (`StatusCode.UNAVAILABLE`)
- ❌ Error: `connector "signozmeter" used as receiver but not used in any supported exporter pipeline`
- ❌ OpAMP constantly trying to reload bad config

### After Fix
- ✅ OTel Collector stable and running
- ✅ Message: `"Everything is ready. Begin running and processing data."`
- ✅ No more signozmeter connector errors
- ✅ Remaining OpAMP errors are non-fatal (remote server issues, not local config)
- ⚠️ Service connectivity still showing transient errors (separate investigation needed)

## OpAMP Behavior

**What is OpAMP?**
- Open Agent Management Protocol (OpAMP), a specification from the OpenTelemetry project
- Allows remote management and configuration of collectors
- SigNoz uses it for central configuration management

**Current State:**
- OpAMP continues to report errors, but they are now **non-fatal**
- The errors come from the remote OpAMP server (signoz:4320), not from the local config
- Local configuration is valid and working
- Collector is stable and processing data

**OpAMP Error Pattern:**
```
[ERROR] opamp/server_client.go:146
Server returned an error response
```

Although logged at `ERROR` level, this entry only indicates that the remote OpAMP server has configuration issues; it does not affect the locally configured collector.

## Files Modified

### Helm Values
1. `infrastructure/helm/signoz-values-dev.yaml`
   - Added connectors section
   - Added batch/meter processor
   - Added signozclickhousemeter exporter
   - Added metadataexporter
   - Updated traces pipeline to export to signozmeter
   - Added metrics/meter pipeline

2. `infrastructure/helm/signoz-values-prod.yaml`
   - Same changes as dev

### ConfigMaps
3. `infrastructure/kubernetes/base/configmap.yaml`
   - Fixed OTEL_EXPORTER_OTLP_ENDPOINT (removed http://)

4. `infrastructure/kubernetes/overlays/prod/prod-configmap.yaml`
   - Fixed OTEL_EXPORTER_OTLP_ENDPOINT (removed http://)

### Service Manifests (18 files)
All services in `infrastructure/kubernetes/base/components/*/` changed from:
```yaml
- name: OTEL_EXPORTER_OTLP_ENDPOINT
  value: "http://..."
```
To:
```yaml
- name: OTEL_EXPORTER_OTLP_ENDPOINT
  valueFrom:
    configMapKeyRef:
      name: bakery-config
      key: OTEL_EXPORTER_OTLP_ENDPOINT
```
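
To keep manifests from drifting back to literal values, the `env` entries can be scanned for a hardcoded endpoint. This is a minimal sketch under stated assumptions: `hardcoded_otel_env` is a hypothetical helper, and its input is the list of container specs as plain dictionaries (for example, after loading a Deployment manifest with a YAML parser of your choice).

```python
from typing import Any, Dict, List

OTEL_VAR = "OTEL_EXPORTER_OTLP_ENDPOINT"


def hardcoded_otel_env(containers: List[Dict[str, Any]]) -> List[str]:
    """Return names of containers that set OTEL_VAR through a literal `value`."""
    offenders: List[str] = []
    for container in containers:
        for env in container.get("env", []):
            # A literal `value` key means the endpoint bypasses the ConfigMap.
            if env.get("name") == OTEL_VAR and "value" in env:
                offenders.append(container.get("name", "<unnamed>"))
    return offenders


old_style = [{"name": "gateway", "env": [{"name": OTEL_VAR, "value": "http://..."}]}]
new_style = [{
    "name": "gateway",
    "env": [{
        "name": OTEL_VAR,
        "valueFrom": {"configMapKeyRef": {"name": "bakery-config", "key": OTEL_VAR}},
    }],
}]

print(hardcoded_otel_env(old_style))
print(hardcoded_otel_env(new_style))
```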

## Verification Commands

```bash
# 1. Check OTel Collector is stable
kubectl get pods -n bakery-ia | grep otel-collector
# Should show: 1/1 Running

# 2. Check for configuration errors
kubectl logs -n bakery-ia deployment/signoz-otel-collector --tail=50 | grep -E "failed to apply config|signozmeter"
# Should show: NO errors about signozmeter

# 3. Verify collector is ready
kubectl logs -n bakery-ia deployment/signoz-otel-collector | grep "Everything is ready"
# Should show: "Everything is ready. Begin running and processing data."

# 4. Check service configuration
kubectl get configmap bakery-config -n bakery-ia -o jsonpath='{.data.OTEL_EXPORTER_OTLP_ENDPOINT}'
# Should show: signoz-otel-collector.bakery-ia.svc.cluster.local:4317 (no http://)

# 5. Verify service is using ConfigMap
kubectl get deployment gateway -n bakery-ia -o yaml | grep -A 5 "OTEL_EXPORTER"
# Should show: valueFrom / configMapKeyRef

# 6. Run verification script
./infrastructure/helm/verify-signoz-telemetry.sh
```

## Next Steps

### Immediate
1. ✅ OTel Collector is stable with OpAMP enabled
2. ⏭️ Investigate remaining service connectivity issues
3. ⏭️ Generate test traffic and verify data collection
4. ⏭️ Check ClickHouse for traces/metrics/logs

### Short-term
1. Monitor OpAMP errors; they are non-fatal and do not block the collector
2. Consider contacting SigNoz about OpAMP server configuration
3. Set up SigNoz dashboards and alerts
4. Document common queries

### Long-term
1. Evaluate if OpAMP remote management is needed
2. Consider HTTP exporter as alternative to gRPC
3. Implement service mesh if connectivity issues persist
4. Set up proper TLS for production

## Key Learnings

### About OpenTelemetry Connectors
- Connectors must be used in BOTH directions
- Source pipeline must export TO the connector
- Destination pipeline must receive FROM the connector
- Missing either direction causes pipeline build failures

### About OpAMP
- OpAMP can push remote configurations
- Local config takes precedence
- Remote server errors don't prevent local operation
- Collector continues with last known good config

### About gRPC Configuration
- In this setup, gRPC endpoints are given without an `http://` or `https://` prefix
- Only use `hostname:port` format
- HTTP/REST endpoints DO need the protocol prefix

### About Configuration Management
- Centralize configuration in ConfigMaps
- Use `valueFrom: configMapKeyRef` pattern
- Single source of truth prevents drift
- Makes updates easier across all services

## References

- [SigNoz Helm Charts](https://github.com/SigNoz/charts)
- [OpenTelemetry Connectors](https://opentelemetry.io/docs/collector/configuration/#connectors)
- [OpAMP Specification](https://github.com/open-telemetry/opamp-spec)
- [SigNoz OTel Collector](https://github.com/SigNoz/signoz-otel-collector)

---

**Resolution Date:** 2026-01-09
**Status:** ✅ Resolved - OTel Collector stable, OpAMP functional
**Remaining:** Service connectivity investigation ongoing
435
docs/SIGNOZ_VERIFICATION_GUIDE.md
Normal file
@@ -0,0 +1,435 @@
# SigNoz Telemetry Verification Guide

## Overview
This guide explains how to verify that your services are correctly sending metrics, logs, and traces to SigNoz, and that SigNoz is collecting them properly.

## Current Configuration

### SigNoz Components
- **Version**: v0.106.0
- **OTel Collector**: v0.129.12
- **Namespace**: `bakery-ia`
- **Ingress URL**: https://monitoring.bakery-ia.local

### Telemetry Endpoints

The OTel Collector exposes the following endpoints:

| Protocol | Port | Purpose |
|----------|------|---------|
| OTLP gRPC | 4317 | Traces, Metrics, Logs (gRPC) |
| OTLP HTTP | 4318 | Traces, Metrics, Logs (HTTP) |
| Jaeger gRPC | 14250 | Jaeger traces (gRPC) |
| Jaeger HTTP | 14268 | Jaeger traces (HTTP) |
| Metrics | 8888 | Prometheus metrics from collector |
| Health Check | 13133 | Collector health status |
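
A quick way to confirm these ports are reachable is a plain TCP connect test. The sketch below assumes it runs somewhere that can reach the collector (inside the cluster or through a port-forward); `port_open` and `COLLECTOR_PORTS` are illustrative names, and a successful connect only shows that something is listening, not that the protocol behind it is healthy.

```python
import socket

# Ports taken from the table above.
COLLECTOR_PORTS = {
    "OTLP gRPC": 4317,
    "OTLP HTTP": 4318,
    "Jaeger gRPC": 14250,
    "Jaeger HTTP": 14268,
    "Metrics": 8888,
    "Health Check": 13133,
}


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def check_collector(host: str) -> dict:
    """Map each endpoint name from the table to its reachability."""
    return {name: port_open(host, port) for name, port in COLLECTOR_PORTS.items()}
```

For example, `check_collector("signoz-otel-collector.bakery-ia.svc.cluster.local")` from a pod in the namespace returns one boolean per endpoint.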

### Service Configuration

Services are configured via the `bakery-config` ConfigMap:

```yaml
# Observability enabled
ENABLE_TRACING: "true"
ENABLE_METRICS: "true"
ENABLE_LOGS: "true"

# OTel Collector endpoint (gRPC: host:port, no http:// prefix)
OTEL_EXPORTER_OTLP_ENDPOINT: "signoz-otel-collector.bakery-ia.svc.cluster.local:4317"
OTEL_EXPORTER_OTLP_PROTOCOL: "grpc"
```

### Shared Tracing Library

Services use `shared/monitoring/tracing.py` which:
- Auto-instruments FastAPI endpoints
- Auto-instruments HTTPX (inter-service calls)
- Auto-instruments Redis operations
- Auto-instruments SQLAlchemy (PostgreSQL)
- Uses OTLP exporter to send traces to SigNoz

**Default endpoint**: `http://signoz-otel-collector.bakery-ia:4318` (HTTP)

## Verification Steps

### 1. Quick Verification Script

Run the automated verification script:

```bash
./infrastructure/helm/verify-signoz-telemetry.sh
```

This script checks:
- ✅ SigNoz components are running
- ✅ OTel Collector endpoints are exposed
- ✅ Configuration is correct
- ✅ Health checks pass
- ✅ Data is being collected in ClickHouse

### 2. Manual Verification

#### Check SigNoz Components Status

```bash
kubectl get pods -n bakery-ia | grep signoz
```

Expected output:
```
signoz-0                                1/1     Running
signoz-otel-collector-xxxxx             1/1     Running
chi-signoz-clickhouse-cluster-0-0-0     1/1     Running
signoz-zookeeper-0                      1/1     Running
signoz-clickhouse-operator-xxxxx        2/2     Running
```
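
The same check can be scripted by parsing the `kubectl get pods` text. This is a minimal sketch: `not_ready` is a hypothetical helper that assumes the default column order (name, ready count, status) shown in the expected output above.

```python
from typing import List


def not_ready(pods_output: str) -> List[str]:
    """Return names of pods that are not Running with all containers ready."""
    bad: List[str] = []
    for line in pods_output.strip().splitlines():
        fields = line.split()
        # Skip the header row and anything that does not look like a pod row.
        if len(fields) < 3 or fields[0] == "NAME" or "/" not in fields[1]:
            continue
        name, ready, status = fields[:3]
        have, want = ready.split("/", 1)
        if status != "Running" or have != want:
            bad.append(name)
    return bad


sample = """
NAME                                    READY   STATUS             RESTARTS
signoz-0                                1/1     Running            0
signoz-otel-collector-xxxxx             0/1     CrashLoopBackOff   12
signoz-clickhouse-operator-xxxxx        2/2     Running            0
"""
print(not_ready(sample))
```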

#### Check OTel Collector Logs

```bash
kubectl logs -n bakery-ia -l app.kubernetes.io/component=otel-collector --tail=50
```

Look for:
- `"msg":"Everything is ready. Begin running and processing data."`
- No error messages about invalid processors
- Evidence of data reception (traces/metrics/logs)
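
Scanning the logs for these signals can be automated. The sketch below assumes the collector emits one JSON object per line with a `msg` field (as in the ready message above) and a `level` field; the `level` key and the `summarize_collector_log` helper are assumptions for illustration, so adjust them to the actual log format.

```python
import json
from typing import List, Tuple

READY_MSG = "Everything is ready. Begin running and processing data."


def summarize_collector_log(log_text: str) -> Tuple[bool, List[str]]:
    """Return (saw_ready_message, error_messages) from JSON-lines collector logs."""
    ready = False
    errors: List[str] = []
    for line in log_text.splitlines():
        try:
            record = json.loads(line)
        except ValueError:
            continue  # Not JSON (e.g., a plain-text line); ignore it.
        if not isinstance(record, dict):
            continue
        if record.get("msg") == READY_MSG:
            ready = True
        if record.get("level") == "error":
            errors.append(str(record.get("msg", "")))
    return ready, errors


sample_log = "\n".join([
    '{"level":"info","msg":"Everything is ready. Begin running and processing data."}',
    '{"level":"error","msg":"Server returned an error response"}',
    "plain text line that is skipped",
])
print(summarize_collector_log(sample_log))
```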

#### Check Service Logs for Tracing

```bash
# Check a specific service (e.g., gateway)
kubectl logs -n bakery-ia -l app=gateway --tail=100 | grep -i "tracing\|otel"
```

Expected output:
```
Distributed tracing configured
service=gateway-service
otel_endpoint=http://signoz-otel-collector.bakery-ia:4318
```

### 3. Generate Test Traffic

Run the traffic generation script:

```bash
./infrastructure/helm/generate-test-traffic.sh
```

This script:
1. Makes API calls to various service endpoints
2. Checks service logs for telemetry
3. Waits for data processing (30 seconds)

### 4. Verify Data in ClickHouse

```bash
# Get ClickHouse password
CH_PASSWORD=$(kubectl get secret -n bakery-ia signoz-clickhouse -o jsonpath='{.data.admin-password}' 2>/dev/null | base64 -d)

# Get ClickHouse pod
CH_POD=$(kubectl get pods -n bakery-ia -l clickhouse.altinity.com/chi=signoz-clickhouse -o jsonpath='{.items[0].metadata.name}')

# Check traces (variables are quoted so special characters in the password survive)
kubectl exec -n bakery-ia "$CH_POD" -- clickhouse-client --user=admin --password="$CH_PASSWORD" --query="
SELECT
    serviceName,
    COUNT() as trace_count,
    min(timestamp) as first_trace,
    max(timestamp) as last_trace
FROM signoz_traces.signoz_index_v2
WHERE timestamp >= now() - INTERVAL 1 HOUR
GROUP BY serviceName
ORDER BY trace_count DESC
"

# Check metrics
kubectl exec -n bakery-ia "$CH_POD" -- clickhouse-client --user=admin --password="$CH_PASSWORD" --query="
SELECT
    metric_name,
    COUNT() as sample_count
FROM signoz_metrics.samples_v4
WHERE unix_milli >= toUnixTimestamp(now() - INTERVAL 1 HOUR) * 1000
GROUP BY metric_name
ORDER BY sample_count DESC
LIMIT 10
"

# Check logs
kubectl exec -n bakery-ia "$CH_POD" -- clickhouse-client --user=admin --password="$CH_PASSWORD" --query="
SELECT
    COUNT() as log_count,
    min(timestamp) as first_log,
    max(timestamp) as last_log
FROM signoz_logs.logs
WHERE timestamp >= now() - INTERVAL 1 HOUR
"
```
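
The metrics query filters on `unix_milli`, so its cutoff is a Unix timestamp in seconds multiplied by 1000. As a worked example of that arithmetic, the sketch below computes the same cutoff in Python; `cutoff_unix_milli` is an illustrative helper, not something the scripts in this repository define.

```python
from datetime import datetime, timedelta, timezone


def cutoff_unix_milli(now: datetime, hours: int = 1) -> int:
    """Milliseconds since the Unix epoch for `now - hours`, truncated to whole seconds.

    Mirrors: toUnixTimestamp(now() - INTERVAL 1 HOUR) * 1000
    """
    seconds = int((now - timedelta(hours=hours)).timestamp())
    return seconds * 1000


# Current cutoff, comparable to the value the query computes server-side.
print(cutoff_unix_milli(datetime.now(timezone.utc)))
```

Comparing this value with the newest `unix_milli` in the table is a quick way to tell whether samples are merely old or missing entirely.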

### 5. Access SigNoz UI

#### Via Ingress (Recommended)

1. Add to `/etc/hosts`:
   ```
   127.0.0.1 monitoring.bakery-ia.local
   ```

2. Access: https://monitoring.bakery-ia.local

#### Via Port-Forward

```bash
kubectl port-forward -n bakery-ia svc/signoz 3301:8080
```

Then access: http://localhost:3301

### 6. Explore Telemetry Data in SigNoz UI

1. **Traces**:
   - Go to "Services" tab
   - You should see your services listed (gateway, auth-service, inventory-service, etc.)
   - Click on a service to see its traces
   - Click on individual traces to see span details

2. **Metrics**:
   - Go to "Dashboards" or "Metrics" tab
   - Should see infrastructure metrics (PostgreSQL, Redis, RabbitMQ)
   - Should see service metrics (request rate, latency, errors)

3. **Logs**:
   - Go to "Logs" tab
   - Should see logs from your services
   - Can filter by service name, log level, etc.

## Troubleshooting

### Services Can't Connect to OTel Collector

**Symptoms**:
```
[ERROR] opentelemetry.exporter.otlp.proto.grpc.exporter: Failed to export traces
error code: StatusCode.UNAVAILABLE
```

**Solutions**:

1. **Check OTel Collector is running**:
   ```bash
   kubectl get pods -n bakery-ia -l app.kubernetes.io/component=otel-collector
   ```

2. **Verify service can reach collector**:
   ```bash
   # From a service pod
   kubectl exec -it -n bakery-ia <service-pod> -- curl -v http://signoz-otel-collector.bakery-ia.svc.cluster.local:4318
   ```

3. **Check endpoint configuration**:
   - gRPC endpoint should NOT have `http://` prefix
   - HTTP endpoint should have `http://` prefix

   Update your service's tracing setup:
   ```python
   # For gRPC (recommended)
   setup_tracing(app, "my-service", otel_endpoint="signoz-otel-collector.bakery-ia.svc.cluster.local:4317")

   # For HTTP
   setup_tracing(app, "my-service", otel_endpoint="http://signoz-otel-collector.bakery-ia.svc.cluster.local:4318")
   ```

4. **Restart services after config changes**:
   ```bash
   kubectl rollout restart deployment/<service-name> -n bakery-ia
   ```

### No Data in SigNoz

**Possible causes**:

1. **Services haven't been called yet**
   - Solution: Generate traffic using the test script

2. **Tracing not initialized**
   - Check service logs for tracing initialization messages
   - Verify `ENABLE_TRACING=true` in ConfigMap

3. **Wrong OTel endpoint**
   - Verify `OTEL_EXPORTER_OTLP_ENDPOINT` in ConfigMap
   - Should be: `signoz-otel-collector.bakery-ia.svc.cluster.local:4317` (no `http://` prefix for gRPC)

4. **Service not using tracing library**
   - Check if service imports and calls `setup_tracing()` in main.py
   ```python
   from fastapi import FastAPI
   from shared.monitoring.tracing import setup_tracing

   app = FastAPI(title="My Service")
   setup_tracing(app, "my-service")
   ```

### OTel Collector Errors

**Check collector logs**:
```bash
kubectl logs -n bakery-ia -l app.kubernetes.io/component=otel-collector --tail=100
```

**Common errors**:

1. **Invalid processor error**:
   - Check `signoz-values-dev.yaml` has `signozspanmetrics/delta` (not `spanmetrics`)
   - Already fixed in your configuration

2. **ClickHouse connection error**:
   - Verify ClickHouse is running
   - Check ClickHouse service is accessible

3. **Configuration validation error**:
   - Validate YAML syntax in `signoz-values-dev.yaml`
   - Check all processors used in pipelines are defined

## Infrastructure Metrics

SigNoz automatically collects metrics from your infrastructure:

### PostgreSQL Databases
- **Receivers configured for**:
  - auth_db (auth-db-service:5432)
  - inventory_db (inventory-db-service:5432)
  - orders_db (orders-db-service:5432)

- **Metrics collected**:
  - Connection counts
  - Query performance
  - Database size
  - Table statistics

### Redis
- **Endpoint**: redis-service:6379
- **Metrics collected**:
  - Memory usage
  - Keys count
  - Hit/miss ratio
  - Command stats

### RabbitMQ
- **Endpoint**: rabbitmq-service:15672 (management API)
- **Metrics collected**:
  - Queue lengths
  - Message rates
  - Connection counts
  - Consumer activity

## Best Practices

### 1. Service Implementation

Always initialize tracing in your service's `main.py`:

```python
import os

from fastapi import FastAPI
from shared.monitoring.tracing import setup_tracing

app = FastAPI(title="My Service")

# Initialize tracing
otel_endpoint = os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT", "http://signoz-otel-collector.bakery-ia:4318")
setup_tracing(
    app,
    service_name="my-service",
    service_version=os.getenv("SERVICE_VERSION", "1.0.0"),
    otel_endpoint=otel_endpoint
)
```

### 2. Custom Spans

Add custom spans for important operations:

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# `app` is the FastAPI instance created in main.py
@app.post("/process")
async def process_data(data: dict):
    with tracer.start_as_current_span("process_data") as span:
        span.set_attribute("data.size", len(data))
        # None is not a valid attribute value, so fall back to a placeholder
        span.set_attribute("data.type", data.get("type", "unknown"))

        # Your processing logic (process() stands in for your own function)
        result = process(data)

        span.set_attribute("result.status", "success")
        return result
```

### 3. Error Tracking

Record exceptions in spans:

```python
from shared.monitoring.tracing import record_exception

try:
    result = risky_operation()
except Exception as e:
    # Attach the exception to the current span, then re-raise it
    record_exception(e)
    raise
```

### 4. Correlation

Use trace IDs in logs for correlation:

```python
from shared.monitoring.tracing import get_current_trace_id

trace_id = get_current_trace_id()
# Keyword fields assume a structlog-style logger; stdlib logging rejects extra kwargs
logger.info("Processing request", trace_id=trace_id)
```
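
For correlation to work, the trace ID in the log line must match the form shown in the tracing UI. OpenTelemetry represents a trace ID as a 128-bit integer that is conventionally rendered as 32 lowercase hexadecimal characters. The sketch below shows that rendering; `format_trace_id` and `log_fields` are hypothetical helpers, and whether `get_current_trace_id` already returns the hex form is not stated here, so treat this as an illustration only.

```python
def format_trace_id(trace_id: int) -> str:
    """Render a 128-bit trace ID as 32 zero-padded lowercase hex characters."""
    return format(trace_id, "032x")


def log_fields(message: str, trace_id: int) -> dict:
    """Build structured log fields that carry the trace ID for correlation."""
    return {"message": message, "trace_id": format_trace_id(trace_id)}


example_id = 0x4BF92F3577B34DA6A3CE929D0E0E4736
print(log_fields("Processing request", example_id))
```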

## Next Steps

1. ✅ **Verify SigNoz is running** - Run verification script
2. ✅ **Generate test traffic** - Run traffic generation script
3. ✅ **Check data collection** - Query ClickHouse or use UI
4. ✅ **Access SigNoz UI** - Visualize traces, metrics, and logs
5. ⏭️ **Set up dashboards** - Create custom dashboards for your use cases
6. ⏭️ **Configure alerts** - Set up alerts for critical metrics
7. ⏭️ **Document** - Document common queries and dashboard configurations

## Useful Commands

```bash
# Quick status check
kubectl get pods -n bakery-ia | grep signoz

# View OTel Collector metrics
kubectl port-forward -n bakery-ia svc/signoz-otel-collector 8888:8888
# Then visit: http://localhost:8888/metrics

# Restart OTel Collector
kubectl rollout restart deployment/signoz-otel-collector -n bakery-ia

# View all services with telemetry
kubectl get pods -n bakery-ia -l tier!=infrastructure

# Check specific service logs
kubectl logs -n bakery-ia -l app=<service-name> --tail=100 -f

# Port-forward to SigNoz UI
kubectl port-forward -n bakery-ia svc/signoz 3301:8080
```

## Resources

- [SigNoz Documentation](https://signoz.io/docs/)
- [OpenTelemetry Python](https://opentelemetry.io/docs/languages/python/)
- [SigNoz GitHub](https://github.com/SigNoz/signoz)
- [Helm Chart Values](infrastructure/helm/signoz-values-dev.yaml)
- [Verification Script](infrastructure/helm/verify-signoz-telemetry.sh)
- [Traffic Generation Script](infrastructure/helm/generate-test-traffic.sh)