Improve monitoring 3

This commit is contained in:
Urtzi Alfaro
2026-01-09 11:18:20 +01:00
parent 8ca5d9c100
commit 43a3f35bd1
27 changed files with 1279 additions and 32 deletions


@@ -0,0 +1,289 @@
# SigNoz OpenAMP Root Cause Analysis & Resolution
## Problem Statement
Services were getting `StatusCode.UNAVAILABLE` errors when trying to send traces to the SigNoz OTel Collector at port 4317. The OTel Collector was continuously restarting due to OpenAMP trying to apply invalid remote configurations.
## Root Cause Analysis
### Primary Issue: Missing `signozmeter` Connector Pipeline
**Error Message:**
```
connector "signozmeter" used as receiver in [metrics/meter] pipeline
but not used in any supported exporter pipeline
```
**Root Cause:**
The OpenAMP server was pushing a remote configuration that included:
1. A `metrics/meter` pipeline that uses `signozmeter` as a receiver
2. However, no pipeline was exporting TO the `signozmeter` connector
**Technical Explanation:**
- **Connectors** in OpenTelemetry are special components that act as BOTH exporters AND receivers
- They bridge between pipelines (e.g., traces → metrics)
- The `signozmeter` connector generates usage/meter metrics from trace data
- For a connector to work, it must be:
1. Used as an **exporter** in one pipeline (the source)
2. Used as a **receiver** in another pipeline (the destination)
**What Was Missing:**
Our configuration had:
- ✅ `signozmeter` connector defined
- ✅ `metrics/meter` pipeline receiving from `signozmeter`
- ❌ **No pipeline exporting TO `signozmeter`**
The traces pipeline needed to export to `signozmeter`:
```yaml
traces:
receivers: [otlp]
processors: [...]
exporters: [clickhousetraces, metadataexporter, signozmeter] # <-- signozmeter was missing
```
### Secondary Issue: gRPC Endpoint Format
**Problem:** Services had `http://` prefix in gRPC endpoints
**Solution:** Removed `http://` prefix (gRPC doesn't use HTTP protocol prefix)
**Before:**
```yaml
OTEL_EXPORTER_OTLP_ENDPOINT: "http://signoz-otel-collector.bakery-ia.svc.cluster.local:4317"
```
**After:**
```yaml
OTEL_EXPORTER_OTLP_ENDPOINT: "signoz-otel-collector.bakery-ia.svc.cluster.local:4317"
```
### Tertiary Issue: Hardcoded Endpoints
**Problem:** Each service manifest had hardcoded OTEL endpoints instead of referencing ConfigMap
**Solution:** Updated all 18 services to use `valueFrom: configMapKeyRef`
## Solution Implemented
### 1. Added Complete Meter Pipeline Configuration
**Added Connector:**
```yaml
connectors:
signozmeter:
dimensions:
- name: service.name
- name: deployment.environment
- name: host.name
metrics_flush_interval: 1h
```
**Added Batch Processor:**
```yaml
processors:
batch/meter:
timeout: 1s
send_batch_size: 20000
send_batch_max_size: 25000
```
**Added Exporters:**
```yaml
exporters:
# Meter exporter
signozclickhousemeter:
dsn: "tcp://admin:PASSWORD@signoz-clickhouse:9000/signoz_meter"
timeout: 45s
sending_queue:
enabled: false
# Metadata exporter
metadataexporter:
dsn: "tcp://admin:PASSWORD@signoz-clickhouse:9000/signoz_metadata"
timeout: 10s
cache:
provider: in_memory
```
**Updated Traces Pipeline:**
```yaml
traces:
receivers: [otlp]
processors: [memory_limiter, batch, signozspanmetrics/delta, resourcedetection]
exporters: [clickhousetraces, metadataexporter, signozmeter] # Added signozmeter
```
**Added Meter Pipeline:**
```yaml
metrics/meter:
receivers: [signozmeter]
processors: [batch/meter]
exporters: [signozclickhousemeter]
```
### 2. Fixed gRPC Endpoint Configuration
Updated ConfigMaps:
- `infrastructure/kubernetes/base/configmap.yaml`
- `infrastructure/kubernetes/overlays/prod/prod-configmap.yaml`
### 3. Centralized OTEL Configuration
Created script: `infrastructure/kubernetes/fix-otel-endpoints.sh`
Updated 18 service manifests to use ConfigMap reference instead of hardcoded values.
## Results
### Before Fix
- ❌ OTel Collector continuously restarting
- ❌ Services unable to export traces (StatusCode.UNAVAILABLE)
- ❌ Error: `connector "signozmeter" used as receiver but not used in any supported exporter pipeline`
- ❌ OpenAMP constantly trying to reload bad config
### After Fix
- ✅ OTel Collector stable and running
- ✅ Message: `"Everything is ready. Begin running and processing data."`
- ✅ No more signozmeter connector errors
- ✅ OpenAMP errors are now just warnings (remote server issues, not local config)
- ⚠️ Service connectivity still showing transient errors (separate investigation needed)
## OpenAMP Behavior
**What is OpenAMP?**
- Open Agent Management Protocol (OpAMP)
- Allows remote management and configuration of collectors
- SigNoz uses it for central configuration management
**Current State:**
- OpenAMP continues to show errors, but they're now **non-fatal**
- The errors are from the remote OpAMP server (signoz:4320), not local config
- Local configuration is valid and working
- Collector is stable and processing data
**OpenAMP Error Pattern:**
```
[ERROR] opamp/server_client.go:146
Server returned an error response
```
Although logged at ERROR level, this is effectively a **warning**: the remote OpAMP server has configuration issues, but it doesn't affect the locally-configured collector.
## Files Modified
### Helm Values
1. `infrastructure/helm/signoz-values-dev.yaml`
- Added connectors section
- Added batch/meter processor
- Added signozclickhousemeter exporter
- Added metadataexporter
- Updated traces pipeline to export to signozmeter
- Added metrics/meter pipeline
2. `infrastructure/helm/signoz-values-prod.yaml`
- Same changes as dev
### ConfigMaps
3. `infrastructure/kubernetes/base/configmap.yaml`
- Fixed OTEL_EXPORTER_OTLP_ENDPOINT (removed http://)
4. `infrastructure/kubernetes/overlays/prod/prod-configmap.yaml`
- Fixed OTEL_EXPORTER_OTLP_ENDPOINT (removed http://)
### Service Manifests (18 files)
All services in `infrastructure/kubernetes/base/components/*/` changed from:
```yaml
- name: OTEL_EXPORTER_OTLP_ENDPOINT
value: "http://..."
```
To:
```yaml
- name: OTEL_EXPORTER_OTLP_ENDPOINT
valueFrom:
configMapKeyRef:
name: bakery-config
key: OTEL_EXPORTER_OTLP_ENDPOINT
```
## Verification Commands
```bash
# 1. Check OTel Collector is stable
kubectl get pods -n bakery-ia | grep otel-collector
# Should show: 1/1 Running
# 2. Check for configuration errors
kubectl logs -n bakery-ia deployment/signoz-otel-collector --tail=50 | grep -E "failed to apply config|signozmeter"
# Should show: NO errors about signozmeter
# 3. Verify collector is ready
kubectl logs -n bakery-ia deployment/signoz-otel-collector | grep "Everything is ready"
# Should show: "Everything is ready. Begin running and processing data."
# 4. Check service configuration
kubectl get configmap bakery-config -n bakery-ia -o jsonpath='{.data.OTEL_EXPORTER_OTLP_ENDPOINT}'
# Should show: signoz-otel-collector.bakery-ia.svc.cluster.local:4317 (no http://)
# 5. Verify service is using ConfigMap
kubectl get deployment gateway -n bakery-ia -o yaml | grep -A 5 "OTEL_EXPORTER"
# Should show: valueFrom / configMapKeyRef
# 6. Run verification script
./infrastructure/helm/verify-signoz-telemetry.sh
```
## Next Steps
### Immediate
1. ✅ OTel Collector is stable with OpenAMP enabled
2. ⏭️ Investigate remaining service connectivity issues
3. ⏭️ Generate test traffic and verify data collection
4. ⏭️ Check ClickHouse for traces/metrics/logs
### Short-term
1. Monitor OpenAMP errors - they're warnings, not blocking
2. Consider contacting SigNoz about OpAMP server configuration
3. Set up SigNoz dashboards and alerts
4. Document common queries
### Long-term
1. Evaluate if OpAMP remote management is needed
2. Consider HTTP exporter as alternative to gRPC
3. Implement service mesh if connectivity issues persist
4. Set up proper TLS for production
## Key Learnings
### About OpenTelemetry Connectors
- Connectors must be used in BOTH directions
- Source pipeline must export TO the connector
- Destination pipeline must receive FROM the connector
- Missing either direction causes pipeline build failures (see the generic sketch below)
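For illustration, here is a minimal collector config that wires a connector in both directions. It uses the upstream `spanmetrics` connector and `debug` exporter as generic stand-ins; the project's actual components are `signozmeter` and `signozclickhousemeter`, as shown earlier.
```yaml
receivers:
  otlp:
    protocols:
      grpc: {}

connectors:
  spanmetrics: {}                  # example connector: derives metrics from incoming spans

exporters:
  debug: {}

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [spanmetrics]     # source pipeline exports TO the connector
    metrics:
      receivers: [spanmetrics]     # destination pipeline receives FROM the connector
      exporters: [debug]
```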
### About OpenAMP
- OpenAMP can push remote configurations
- Local config takes precedence
- Remote server errors don't prevent local operation
- Collector continues with last known good config
### About gRPC Configuration
- gRPC endpoints don't use `http://` or `https://` prefixes
- Only use `hostname:port` format
- HTTP/REST endpoints DO need the protocol prefix
### About Configuration Management
- Centralize configuration in ConfigMaps
- Use `valueFrom: configMapKeyRef` pattern
- Single source of truth prevents drift
- Makes updates easier across all services
## References
- [SigNoz Helm Charts](https://github.com/SigNoz/charts)
- [OpenTelemetry Connectors](https://opentelemetry.io/docs/collector/configuration/#connectors)
- [OpAMP Specification](https://github.com/open-telemetry/opamp-spec)
- [SigNoz OTel Collector](https://github.com/SigNoz/signoz-otel-collector)
---
**Resolution Date:** 2026-01-09
**Status:** ✅ Resolved - OTel Collector stable, OpenAMP functional
**Remaining:** Service connectivity investigation ongoing


@@ -0,0 +1,435 @@
# SigNoz Telemetry Verification Guide
## Overview
This guide explains how to verify that your services are correctly sending metrics, logs, and traces to SigNoz, and that SigNoz is collecting them properly.
## Current Configuration
### SigNoz Components
- **Version**: v0.106.0
- **OTel Collector**: v0.129.12
- **Namespace**: `bakery-ia`
- **Ingress URL**: https://monitoring.bakery-ia.local
### Telemetry Endpoints
The OTel Collector exposes the following endpoints:
| Protocol | Port | Purpose |
|----------|------|---------|
| OTLP gRPC | 4317 | Traces, Metrics, Logs (gRPC) |
| OTLP HTTP | 4318 | Traces, Metrics, Logs (HTTP) |
| Jaeger gRPC | 14250 | Jaeger traces (gRPC) |
| Jaeger HTTP | 14268 | Jaeger traces (HTTP) |
| Metrics | 8888 | Prometheus metrics from collector |
| Health Check | 13133 | Collector health status |
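A quick way to confirm the collector is actually listening is the health-check port. A minimal check, assuming you can port-forward into the cluster (if the Service doesn't expose 13133, port-forward the collector pod directly):
```bash
# Forward the collector's health-check port locally
kubectl port-forward -n bakery-ia svc/signoz-otel-collector 13133:13133 &

# The health_check extension answers on the root path with a small JSON status
curl -s http://localhost:13133/
```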
### Service Configuration
Services are configured via the `bakery-config` ConfigMap:
```yaml
# Observability enabled
ENABLE_TRACING: "true"
ENABLE_METRICS: "true"
ENABLE_LOGS: "true"
# OTel Collector endpoint
OTEL_EXPORTER_OTLP_ENDPOINT: "http://signoz-otel-collector.bakery-ia.svc.cluster.local:4317"
OTEL_EXPORTER_OTLP_PROTOCOL: "grpc"
```
### Shared Tracing Library
Services use `shared/monitoring/tracing.py` which:
- Auto-instruments FastAPI endpoints
- Auto-instruments HTTPX (inter-service calls)
- Auto-instruments Redis operations
- Auto-instruments SQLAlchemy (PostgreSQL)
- Uses OTLP exporter to send traces to SigNoz
**Default endpoint**: `http://signoz-otel-collector.bakery-ia:4318` (HTTP)
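For reference, the wiring `setup_tracing` performs looks roughly like the sketch below. This is an illustrative outline, not the actual contents of `shared/monitoring/tracing.py`; names and defaults may differ in the real module.
```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.redis import RedisInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor


def setup_tracing(app, service_name: str, service_version: str = "1.0.0",
                  otel_endpoint: str = "http://signoz-otel-collector.bakery-ia:4318") -> None:
    # Identify the service on every exported span
    resource = Resource.create({
        "service.name": service_name,
        "service.version": service_version,
    })
    provider = TracerProvider(resource=resource)

    # OTLP/HTTP exporter: traces are POSTed to the collector's /v1/traces path
    exporter = OTLPSpanExporter(endpoint=f"{otel_endpoint}/v1/traces")
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)

    # Auto-instrument the frameworks the services use
    FastAPIInstrumentor.instrument_app(app, tracer_provider=provider)
    HTTPXClientInstrumentor().instrument(tracer_provider=provider)
    RedisInstrumentor().instrument(tracer_provider=provider)
    SQLAlchemyInstrumentor().instrument(tracer_provider=provider)
```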
## Verification Steps
### 1. Quick Verification Script
Run the automated verification script:
```bash
./infrastructure/helm/verify-signoz-telemetry.sh
```
This script checks:
- ✅ SigNoz components are running
- ✅ OTel Collector endpoints are exposed
- ✅ Configuration is correct
- ✅ Health checks pass
- ✅ Data is being collected in ClickHouse
### 2. Manual Verification
#### Check SigNoz Components Status
```bash
kubectl get pods -n bakery-ia | grep signoz
```
Expected output:
```
signoz-0 1/1 Running
signoz-otel-collector-xxxxx 1/1 Running
chi-signoz-clickhouse-cluster-0-0-0 1/1 Running
signoz-zookeeper-0 1/1 Running
signoz-clickhouse-operator-xxxxx 2/2 Running
```
#### Check OTel Collector Logs
```bash
kubectl logs -n bakery-ia -l app.kubernetes.io/component=otel-collector --tail=50
```
Look for:
- `"msg":"Everything is ready. Begin running and processing data."`
- No error messages about invalid processors
- Evidence of data reception (traces/metrics/logs)
#### Check Service Logs for Tracing
```bash
# Check a specific service (e.g., gateway)
kubectl logs -n bakery-ia -l app=gateway --tail=100 | grep -i "tracing\|otel"
```
Expected output:
```
Distributed tracing configured
service=gateway-service
otel_endpoint=http://signoz-otel-collector.bakery-ia:4318
```
### 3. Generate Test Traffic
Run the traffic generation script:
```bash
./infrastructure/helm/generate-test-traffic.sh
```
This script:
1. Makes API calls to various service endpoints
2. Checks service logs for telemetry
3. Waits for data processing (30 seconds)
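If you prefer to generate a few requests by hand instead, something like the following works. The service name, port, and path below are examples only; substitute real endpoints from your API.
```bash
# Port-forward the gateway locally (adjust the Service name and port to your setup)
kubectl port-forward -n bakery-ia svc/gateway 8080:8080 &

# Fire a handful of requests so every hop produces spans
for i in $(seq 1 20); do
  curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/health
done

# Give the batch processors time to flush before checking SigNoz
sleep 30
```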
### 4. Verify Data in ClickHouse
```bash
# Get ClickHouse password
CH_PASSWORD=$(kubectl get secret -n bakery-ia signoz-clickhouse -o jsonpath='{.data.admin-password}' 2>/dev/null | base64 -d)
# Get ClickHouse pod
CH_POD=$(kubectl get pods -n bakery-ia -l clickhouse.altinity.com/chi=signoz-clickhouse -o jsonpath='{.items[0].metadata.name}')
# Check traces
kubectl exec -n bakery-ia $CH_POD -- clickhouse-client --user=admin --password=$CH_PASSWORD --query="
SELECT
serviceName,
COUNT() as trace_count,
min(timestamp) as first_trace,
max(timestamp) as last_trace
FROM signoz_traces.signoz_index_v2
WHERE timestamp >= now() - INTERVAL 1 HOUR
GROUP BY serviceName
ORDER BY trace_count DESC
"
# Check metrics
kubectl exec -n bakery-ia $CH_POD -- clickhouse-client --user=admin --password=$CH_PASSWORD --query="
SELECT
metric_name,
COUNT() as sample_count
FROM signoz_metrics.samples_v4
WHERE unix_milli >= toUnixTimestamp(now() - INTERVAL 1 HOUR) * 1000
GROUP BY metric_name
ORDER BY sample_count DESC
LIMIT 10
"
# Check logs
kubectl exec -n bakery-ia $CH_POD -- clickhouse-client --user=admin --password=$CH_PASSWORD --query="
SELECT
COUNT() as log_count,
min(timestamp) as first_log,
max(timestamp) as last_log
FROM signoz_logs.logs
WHERE timestamp >= now() - INTERVAL 1 HOUR
"
```
### 5. Access SigNoz UI
#### Via Ingress (Recommended)
1. Add to `/etc/hosts`:
```
127.0.0.1 monitoring.bakery-ia.local
```
2. Access: https://monitoring.bakery-ia.local
#### Via Port-Forward
```bash
kubectl port-forward -n bakery-ia svc/signoz 3301:8080
```
Then access: http://localhost:3301
### 6. Explore Telemetry Data in SigNoz UI
1. **Traces**:
- Go to "Services" tab
- You should see your services listed (gateway, auth-service, inventory-service, etc.)
- Click on a service to see its traces
- Click on individual traces to see span details
2. **Metrics**:
- Go to "Dashboards" or "Metrics" tab
- Should see infrastructure metrics (PostgreSQL, Redis, RabbitMQ)
- Should see service metrics (request rate, latency, errors)
3. **Logs**:
- Go to "Logs" tab
- Should see logs from your services
- Can filter by service name, log level, etc.
## Troubleshooting
### Services Can't Connect to OTel Collector
**Symptoms**:
```
[ERROR] opentelemetry.exporter.otlp.proto.grpc.exporter: Failed to export traces
error code: StatusCode.UNAVAILABLE
```
**Solutions**:
1. **Check OTel Collector is running**:
```bash
kubectl get pods -n bakery-ia -l app.kubernetes.io/component=otel-collector
```
2. **Verify service can reach collector**:
```bash
# From a service pod
kubectl exec -it -n bakery-ia <service-pod> -- curl -v http://signoz-otel-collector.bakery-ia.svc.cluster.local:4318
```
3. **Check endpoint configuration**:
- gRPC endpoint should NOT have `http://` prefix
- HTTP endpoint should have `http://` prefix
Update your service's tracing setup:
```python
# For gRPC (recommended)
setup_tracing(app, "my-service", otel_endpoint="signoz-otel-collector.bakery-ia.svc.cluster.local:4317")
# For HTTP
setup_tracing(app, "my-service", otel_endpoint="http://signoz-otel-collector.bakery-ia.svc.cluster.local:4318")
```
4. **Restart services after config changes**:
```bash
kubectl rollout restart deployment/<service-name> -n bakery-ia
```
### No Data in SigNoz
**Possible causes**:
1. **Services haven't been called yet**
- Solution: Generate traffic using the test script
2. **Tracing not initialized**
- Check service logs for tracing initialization messages
- Verify `ENABLE_TRACING=true` in ConfigMap
3. **Wrong OTel endpoint**
- Verify `OTEL_EXPORTER_OTLP_ENDPOINT` in ConfigMap
- Should be: `signoz-otel-collector.bakery-ia.svc.cluster.local:4317` (gRPC, no `http://` prefix)
4. **Service not using tracing library**
- Check if service imports and calls `setup_tracing()` in main.py
```python
from shared.monitoring.tracing import setup_tracing
app = FastAPI(title="My Service")
setup_tracing(app, "my-service")
```
### OTel Collector Errors
**Check collector logs**:
```bash
kubectl logs -n bakery-ia -l app.kubernetes.io/component=otel-collector --tail=100
```
**Common errors**:
1. **Invalid processor error**:
- Check `signoz-values-dev.yaml` has `signozspanmetrics/delta` (not `spanmetrics`)
- Already fixed in your configuration
2. **ClickHouse connection error**:
- Verify ClickHouse is running
- Check ClickHouse service is accessible
3. **Configuration validation error**:
- Validate YAML syntax in `signoz-values-dev.yaml`
- Check all processors used in pipelines are defined
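A quick way to catch YAML and chart-templating mistakes before deploying (assuming `yq` v4 is installed and the SigNoz chart repo is added as `signoz`); note that pipeline/processor wiring is only validated when the collector itself starts:
```bash
# Syntax check: yq fails on malformed YAML
yq eval '.' infrastructure/helm/signoz-values-dev.yaml > /dev/null && echo "YAML OK"

# Render the chart with your values: templating and values errors surface here
helm template signoz signoz/signoz \
  -f infrastructure/helm/signoz-values-dev.yaml > /dev/null && echo "Chart renders OK"
```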
## Infrastructure Metrics
SigNoz automatically collects metrics from your infrastructure (a sample receiver definition is sketched after the lists below):
### PostgreSQL Databases
- **Receivers configured for**:
- auth_db (auth-db-service:5432)
- inventory_db (inventory-db-service:5432)
- orders_db (orders-db-service:5432)
- **Metrics collected**:
- Connection counts
- Query performance
- Database size
- Table statistics
### Redis
- **Endpoint**: redis-service:6379
- **Metrics collected**:
- Memory usage
- Keys count
- Hit/miss ratio
- Command stats
### RabbitMQ
- **Endpoint**: rabbitmq-service:15672 (management API)
- **Metrics collected**:
- Queue lengths
- Message rates
- Connection counts
- Consumer activity
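These receivers are defined in the collector section of the Helm values. As an illustration of their shape (the field values here are placeholders; the actual definitions live in `signoz-values-dev.yaml`):
```yaml
receivers:
  postgresql:
    endpoint: auth-db-service:5432              # host:port of the database service
    username: monitoring                        # read-only monitoring user (placeholder)
    password: ${env:POSTGRES_MONITORING_PASSWORD}
    databases: [auth_db]
    collection_interval: 60s
    tls:
      insecure: true

  redis:
    endpoint: redis-service:6379
    collection_interval: 60s
```
Each receiver is then listed in a metrics pipeline alongside the ClickHouse metrics exporter so the scraped samples end up in SigNoz.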
## Best Practices
### 1. Service Implementation
Always initialize tracing in your service's `main.py`:
```python
from fastapi import FastAPI
from shared.monitoring.tracing import setup_tracing
import os
app = FastAPI(title="My Service")
# Initialize tracing
otel_endpoint = os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT", "http://signoz-otel-collector.bakery-ia:4318")
setup_tracing(
app,
service_name="my-service",
service_version=os.getenv("SERVICE_VERSION", "1.0.0"),
otel_endpoint=otel_endpoint
)
```
### 2. Custom Spans
Add custom spans for important operations:
```python
from opentelemetry import trace
tracer = trace.get_tracer(__name__)
@app.post("/process")
async def process_data(data: dict):
with tracer.start_as_current_span("process_data") as span:
span.set_attribute("data.size", len(data))
span.set_attribute("data.type", data.get("type"))
# Your processing logic
result = process(data)
span.set_attribute("result.status", "success")
return result
```
### 3. Error Tracking
Record exceptions in spans:
```python
from shared.monitoring.tracing import record_exception
try:
result = risky_operation()
except Exception as e:
record_exception(e)
raise
```
### 4. Correlation
Use trace IDs in logs for correlation:
```python
from shared.monitoring.tracing import get_current_trace_id
trace_id = get_current_trace_id()
logger.info("Processing request", trace_id=trace_id)
```
## Next Steps
1. ✅ **Verify SigNoz is running** - Run verification script
2. ✅ **Generate test traffic** - Run traffic generation script
3. ✅ **Check data collection** - Query ClickHouse or use UI
4. ✅ **Access SigNoz UI** - Visualize traces, metrics, and logs
5. ⏭️ **Set up dashboards** - Create custom dashboards for your use cases
6. ⏭️ **Configure alerts** - Set up alerts for critical metrics
7. ⏭️ **Document** - Document common queries and dashboard configurations
## Useful Commands
```bash
# Quick status check
kubectl get pods -n bakery-ia | grep signoz
# View OTel Collector metrics
kubectl port-forward -n bakery-ia svc/signoz-otel-collector 8888:8888
# Then visit: http://localhost:8888/metrics
# Restart OTel Collector
kubectl rollout restart deployment/signoz-otel-collector -n bakery-ia
# View all services with telemetry
kubectl get pods -n bakery-ia -l tier!=infrastructure
# Check specific service logs
kubectl logs -n bakery-ia -l app=<service-name> --tail=100 -f
# Port-forward to SigNoz UI
kubectl port-forward -n bakery-ia svc/signoz 3301:8080
```
## Resources
- [SigNoz Documentation](https://signoz.io/docs/)
- [OpenTelemetry Python](https://opentelemetry.io/docs/languages/python/)
- [SigNoz GitHub](https://github.com/SigNoz/signoz)
- [Helm Chart Values](infrastructure/helm/signoz-values-dev.yaml)
- [Verification Script](infrastructure/helm/verify-signoz-telemetry.sh)
- [Traffic Generation Script](infrastructure/helm/generate-test-traffic.sh)