# SigNoz OpAMP Root Cause Analysis & Resolution

## Problem Statement

Services were receiving `StatusCode.UNAVAILABLE` errors when sending traces to the SigNoz OTel Collector on port 4317. The collector was restarting continuously because OpAMP kept trying to apply an invalid remote configuration.

## Root Cause Analysis

### Primary Issue: Missing `signozmeter` Connector Pipeline

**Error Message:**
```
connector "signozmeter" used as receiver in [metrics/meter] pipeline
but not used in any supported exporter pipeline
```

**Root Cause:**
The OpAMP server was pushing a remote configuration in which:
1. A `metrics/meter` pipeline used `signozmeter` as a receiver
2. No pipeline exported to the `signozmeter` connector

**Technical Explanation:**
- **Connectors** in OpenTelemetry are components that act as both an exporter and a receiver
- They bridge pipelines (e.g., traces → metrics)
- The `signozmeter` connector generates usage/meter metrics from trace data
- For a connector to work, it must be:
  1. Used as an **exporter** in one pipeline (the source)
  2. Used as a **receiver** in another pipeline (the destination)

**What Was Missing:**
Our configuration had:
- ✅ `signozmeter` connector defined
- ✅ `metrics/meter` pipeline receiving from `signozmeter`
- ❌ **No pipeline exporting TO `signozmeter`**

The traces pipeline needed to export to `signozmeter`:
```yaml
traces:
  receivers: [otlp]
  processors: [...]
  exporters: [clickhousetraces, metadataexporter, signozmeter]  # <-- signozmeter was missing
```

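The rule that a connector must appear on both sides can be checked mechanically before a config is deployed. The following is a minimal sketch, assuming pipelines use single-line flow-style lists (`exporters: [a, b]`) as in the snippets in this document. The function name `connector_is_wired` is hypothetical, and the check is a plain text match, not a YAML parser, so it does not confirm that the two lists belong to different pipelines.

```bash
#!/usr/bin/env bash
# Sketch: succeed only if CONNECTOR appears in at least one `exporters: [...]`
# list and at least one `receivers: [...]` list of CONFIG_FILE.
# Assumes flow-style lists on a single line; block-style lists are not handled.
connector_is_wired() {
  local connector="$1" config_file="$2"
  # Match the name as a whole list item: preceded by "[" or ","
  # and followed by "," or "]".
  local item="(\[|,)[[:space:]]*${connector}[[:space:]]*(,|\])"
  grep -Eq "^[[:space:]]*exporters:.*${item}" "$config_file" || return 1
  grep -Eq "^[[:space:]]*receivers:.*${item}" "$config_file" || return 1
}
```

Called as `connector_is_wired signozmeter collector-config.yaml` (the file name is a placeholder), it returns non-zero for the configuration that triggered this incident, because `signozmeter` appears only in a `receivers` list.
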
### Secondary Issue: gRPC Endpoint Format

**Problem:** Services had an `http://` prefix in their gRPC endpoints
**Solution:** Removed the `http://` prefix (a gRPC channel target is `hostname:port`, not an HTTP URL; some OTLP SDKs accept a scheme and strip it, so the effect of the prefix depends on the SDK in use)

**Before:**
```yaml
OTEL_EXPORTER_OTLP_ENDPOINT: "http://signoz-otel-collector.bakery-ia.svc.cluster.local:4317"
```

**After:**
```yaml
OTEL_EXPORTER_OTLP_ENDPOINT: "signoz-otel-collector.bakery-ia.svc.cluster.local:4317"
```

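The before/after change amounts to dropping the scheme from the endpoint string. A small sketch of that transformation is shown here; the function name `strip_http_scheme` is hypothetical and is not part of any script referenced in this document.

```bash
# Sketch: print ENDPOINT without a leading http:// or https:// scheme.
# Any other input is printed unchanged.
strip_http_scheme() {
  local endpoint="$1"
  endpoint="${endpoint#http://}"
  endpoint="${endpoint#https://}"
  printf '%s\n' "$endpoint"
}

strip_http_scheme "http://signoz-otel-collector.bakery-ia.svc.cluster.local:4317"
# prints: signoz-otel-collector.bakery-ia.svc.cluster.local:4317
```
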
### Tertiary Issue: Hardcoded Endpoints

**Problem:** Each service manifest had a hardcoded OTEL endpoint instead of referencing the ConfigMap
**Solution:** Updated all 18 services to use `valueFrom: configMapKeyRef`

## Solution Implemented

### 1. Added Complete Meter Pipeline Configuration

**Added Connector:**
```yaml
connectors:
  signozmeter:
    dimensions:
      - name: service.name
      - name: deployment.environment
      - name: host.name
    metrics_flush_interval: 1h
```

**Added Batch Processor:**
```yaml
processors:
  batch/meter:
    timeout: 1s
    send_batch_size: 20000
    send_batch_max_size: 25000
```

**Added Exporters:**
```yaml
exporters:
  # Meter exporter
  signozclickhousemeter:
    dsn: "tcp://admin:PASSWORD@signoz-clickhouse:9000/signoz_meter"
    timeout: 45s
    sending_queue:
      enabled: false

  # Metadata exporter
  metadataexporter:
    dsn: "tcp://admin:PASSWORD@signoz-clickhouse:9000/signoz_metadata"
    timeout: 10s
    cache:
      provider: in_memory
```

**Updated Traces Pipeline:**
```yaml
traces:
  receivers: [otlp]
  processors: [memory_limiter, batch, signozspanmetrics/delta, resourcedetection]
  exporters: [clickhousetraces, metadataexporter, signozmeter]  # Added signozmeter
```

**Added Meter Pipeline:**
```yaml
metrics/meter:
  receivers: [signozmeter]
  processors: [batch/meter]
  exporters: [signozclickhousemeter]
```

### 2. Fixed gRPC Endpoint Configuration

Updated ConfigMaps:
- `infrastructure/kubernetes/base/configmap.yaml`
- `infrastructure/kubernetes/overlays/prod/prod-configmap.yaml`

### 3. Centralized OTEL Configuration

Created script: `infrastructure/kubernetes/fix-otel-endpoints.sh`

Updated 18 service manifests to use ConfigMap reference instead of hardcoded values.

## Results

### Before Fix
- ❌ OTel Collector continuously restarting
- ❌ Services unable to export traces (`StatusCode.UNAVAILABLE`)
- ❌ Error: `connector "signozmeter" used as receiver but not used in any supported exporter pipeline`
- ❌ OpAMP constantly trying to reload the bad config

### After Fix
- ✅ OTel Collector stable and running
- ✅ Message: `"Everything is ready. Begin running and processing data."`
- ✅ No more `signozmeter` connector errors
- ✅ OpAMP errors are now non-fatal (remote server issues, not local config)
- ⚠️ Service connectivity still showing transient errors (separate investigation needed)

## OpAMP Behavior

**What is OpAMP?**
- Open Agent Management Protocol
- Allows remote management and configuration of collectors
- SigNoz uses it for central configuration management

**Current State:**
- OpAMP continues to show errors, but they're now **non-fatal**
- The errors are from the remote OpAMP server (signoz:4320), not local config
- Local configuration is valid and working
- Collector is stable and processing data

**OpAMP Error Pattern:**
```
[ERROR] opamp/server_client.go:146
Server returned an error response
```

Although it is logged at `ERROR` level, this message indicates that the remote OpAMP server has configuration issues; it doesn't affect the locally configured collector.

## Files Modified

### Helm Values
1. `infrastructure/helm/signoz-values-dev.yaml`
   - Added connectors section
   - Added batch/meter processor
   - Added signozclickhousemeter exporter
   - Added metadataexporter
   - Updated traces pipeline to export to signozmeter
   - Added metrics/meter pipeline

2. `infrastructure/helm/signoz-values-prod.yaml`
   - Same changes as dev

### ConfigMaps
3. `infrastructure/kubernetes/base/configmap.yaml`
   - Fixed OTEL_EXPORTER_OTLP_ENDPOINT (removed http://)

4. `infrastructure/kubernetes/overlays/prod/prod-configmap.yaml`
   - Fixed OTEL_EXPORTER_OTLP_ENDPOINT (removed http://)

### Service Manifests (18 files)
All services in `infrastructure/kubernetes/base/components/*/` changed from:
```yaml
- name: OTEL_EXPORTER_OTLP_ENDPOINT
  value: "http://..."
```
To:
```yaml
- name: OTEL_EXPORTER_OTLP_ENDPOINT
  valueFrom:
    configMapKeyRef:
      name: bakery-config
      key: OTEL_EXPORTER_OTLP_ENDPOINT
```

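To confirm that no manifest still carries the hardcoded form, the two patterns above can be told apart with a line-oriented scan. This is a minimal sketch, not the contents of `fix-otel-endpoints.sh` (which are not shown here). The function name `find_hardcoded_otel_endpoints` is hypothetical, and the scan assumes the `value:` or `valueFrom:` line directly follows the `name:` line, as in the snippets above.

```bash
# Sketch: print the paths of *.yaml files under DIR in which the
# OTEL_EXPORTER_OTLP_ENDPOINT env entry is immediately followed by a literal
# `value:` line rather than `valueFrom:`.
find_hardcoded_otel_endpoints() {
  local dir="$1"
  find "$dir" -type f -name '*.yaml' -print0 |
    while IFS= read -r -d '' manifest; do
      if awk '
        /name:[[:space:]]*OTEL_EXPORTER_OTLP_ENDPOINT[[:space:]]*$/ { pending = 1; next }
        pending && /^[[:space:]]*value:/ { hardcoded = 1 }
        { pending = 0 }
        END { exit (hardcoded ? 0 : 1) }
      ' "$manifest"; then
        printf '%s\n' "$manifest"
      fi
    done
}
```

Run against `infrastructure/kubernetes/base/components`, an empty result would indicate that every matching entry uses the ConfigMap reference.
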
## Verification Commands

```bash
# 1. Check OTel Collector is stable
kubectl get pods -n bakery-ia | grep otel-collector
# Should show: 1/1 Running

# 2. Check for configuration errors
kubectl logs -n bakery-ia deployment/signoz-otel-collector --tail=50 | grep -E "failed to apply config|signozmeter"
# Should show: NO errors about signozmeter

# 3. Verify collector is ready
kubectl logs -n bakery-ia deployment/signoz-otel-collector | grep "Everything is ready"
# Should show: "Everything is ready. Begin running and processing data."

# 4. Check service configuration
kubectl get configmap bakery-config -n bakery-ia -o jsonpath='{.data.OTEL_EXPORTER_OTLP_ENDPOINT}'
# Should show: signoz-otel-collector.bakery-ia.svc.cluster.local:4317 (no http://)

# 5. Verify service is using ConfigMap
kubectl get deployment gateway -n bakery-ia -o yaml | grep -A 5 "OTEL_EXPORTER"
# Should show: valueFrom / configMapKeyRef

# 6. Run verification script
./infrastructure/helm/verify-signoz-telemetry.sh
```

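Check 1 above relies on reading the output by eye. It could be turned into a pass/fail step along these lines; this is a sketch that assumes the default `kubectl get pods` column order (NAME, READY, STATUS) and a single-container collector pod, and the function name `collector_pods_ready` is hypothetical.

```bash
# Sketch: read `kubectl get pods` output on stdin and succeed only if at least
# one otel-collector pod is listed and every such pod is 1/1 Running.
collector_pods_ready() {
  awk '
    /otel-collector/ {
      found = 1
      if ($2 != "1/1" || $3 != "Running") bad = 1
    }
    END { exit ((found && !bad) ? 0 : 1) }
  '
}

# Usage (requires cluster access):
#   kubectl get pods -n bakery-ia | collector_pods_ready && echo "collector stable"
```
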
## Next Steps

### Immediate
1. ✅ OTel Collector is stable with OpAMP enabled
2. ⏭️ Investigate remaining service connectivity issues
3. ⏭️ Generate test traffic and verify data collection
4. ⏭️ Check ClickHouse for traces/metrics/logs

### Short-term
1. Monitor OpAMP errors (they are non-blocking)
2. Consider contacting SigNoz about OpAMP server configuration
3. Set up SigNoz dashboards and alerts
4. Document common queries

### Long-term
1. Evaluate if OpAMP remote management is needed
2. Consider HTTP exporter as alternative to gRPC
3. Implement service mesh if connectivity issues persist
4. Set up proper TLS for production

## Key Learnings

### About OpenTelemetry Connectors
- Connectors must be used in BOTH directions
- Source pipeline must export TO the connector
- Destination pipeline must receive FROM the connector
- Missing either direction causes pipeline build failures

### About OpAMP
- OpAMP can push remote configurations
- Local config takes precedence
- Remote server errors don't prevent local operation
- Collector continues with last known good config

### About gRPC Configuration
- A gRPC channel target is `hostname:port`, not a URL with an `http://` or `https://` prefix
- Use the bare `hostname:port` format for gRPC exporters so the result does not depend on how an SDK handles a scheme
- HTTP/REST endpoints DO need the protocol prefix

### About Configuration Management
- Centralize configuration in ConfigMaps
- Use `valueFrom: configMapKeyRef` pattern
- Single source of truth prevents drift
- Makes updates easier across all services

## References

- [SigNoz Helm Charts](https://github.com/SigNoz/charts)
- [OpenTelemetry Connectors](https://opentelemetry.io/docs/collector/configuration/#connectors)
- [OpAMP Specification](https://github.com/open-telemetry/opamp-spec)
- [SigNoz OTel Collector](https://github.com/SigNoz/signoz-otel-collector)

---

**Resolution Date:** 2026-01-09
**Status:** ✅ Resolved - OTel Collector stable, OpAMP functional
**Remaining:** Service connectivity investigation ongoing