Improve monitoring 3
289
docs/SIGNOZ_ROOT_CAUSE_ANALYSIS.md
Normal file
@@ -0,0 +1,289 @@
# SigNoz OpAMP Root Cause Analysis & Resolution

## Problem Statement

Services received `StatusCode.UNAVAILABLE` errors when sending traces to the SigNoz OTel Collector on port 4317. The collector was restarting continuously because OpAMP kept trying to apply an invalid remote configuration.

## Root Cause Analysis

### Primary Issue: Missing `signozmeter` Connector Pipeline

**Error Message:**
```
connector "signozmeter" used as receiver in [metrics/meter] pipeline
but not used in any supported exporter pipeline
```

**Root Cause:**
The OpAMP server was pushing a remote configuration in which:
1. A `metrics/meter` pipeline used `signozmeter` as a receiver
2. No pipeline exported TO the `signozmeter` connector

**Technical Explanation:**
- **Connectors** in OpenTelemetry are components that act as BOTH an exporter AND a receiver
- They bridge pipelines (e.g., traces → metrics)
- The `signozmeter` connector generates usage/meter metrics from trace data
- For a connector to work, it must be:
  1. Used as an **exporter** in one pipeline (the source)
  2. Used as a **receiver** in another pipeline (the destination)

**What Was Missing:**
Our configuration had:
- ✅ `signozmeter` connector defined
- ✅ `metrics/meter` pipeline receiving from `signozmeter`
- ❌ **No pipeline exporting TO `signozmeter`**

The traces pipeline needed to export to `signozmeter`:
```yaml
traces:
  receivers: [otlp]
  processors: [...]
  exporters: [clickhousetraces, metadataexporter, signozmeter]  # <-- signozmeter was missing
```
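
The wiring rule above can be checked before a config is deployed. The following is a minimal sketch, assuming the collector config has already been loaded into a plain dictionary with the usual `connectors` and `service.pipelines` keys. The function name `connector_wiring_errors` and the helper `example_config` are hypothetical, and the check is much simpler than the collector's own validation (it ignores signal types, for instance).

```python
from typing import Any


def connector_wiring_errors(config: dict[str, Any]) -> list[str]:
    """Return one message per connector that is wired in only one direction."""
    connectors = config.get("connectors") or {}
    pipelines = (config.get("service") or {}).get("pipelines") or {}

    # Collect every component name used on either side of any pipeline.
    exported_to: set[str] = set()
    received_from: set[str] = set()
    for pipeline in pipelines.values():
        exported_to.update(pipeline.get("exporters") or [])
        received_from.update(pipeline.get("receivers") or [])

    errors: list[str] = []
    for name in connectors:
        if name in received_from and name not in exported_to:
            errors.append(
                f'connector "{name}" is used as a receiver but no pipeline exports to it'
            )
        elif name in exported_to and name not in received_from:
            errors.append(
                f'connector "{name}" is used as an exporter but no pipeline receives from it'
            )
    return errors


def example_config(trace_exporters: list[str]) -> dict[str, Any]:
    """Hypothetical, trimmed-down config mirroring the pipelines in this document."""
    return {
        "connectors": {"signozmeter": {}},
        "service": {
            "pipelines": {
                "traces": {"receivers": ["otlp"], "exporters": trace_exporters},
                "metrics/meter": {
                    "receivers": ["signozmeter"],
                    "exporters": ["signozclickhousemeter"],
                },
            }
        },
    }


broken_config = example_config(["clickhousetraces", "metadataexporter"])
fixed_config = example_config(["clickhousetraces", "metadataexporter", "signozmeter"])
```

With these inputs, `connector_wiring_errors(broken_config)` reports the missing exporter side for `signozmeter`, while `connector_wiring_errors(fixed_config)` returns an empty list.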

### Secondary Issue: gRPC Endpoint Format

**Problem:** Services had an `http://` prefix in their gRPC endpoints
**Solution:** Removed the `http://` prefix (a gRPC target is a bare `host:port`, not a URL)

**Before:**
```yaml
OTEL_EXPORTER_OTLP_ENDPOINT: "http://signoz-otel-collector.bakery-ia.svc.cluster.local:4317"
```

**After:**
```yaml
OTEL_EXPORTER_OTLP_ENDPOINT: "signoz-otel-collector.bakery-ia.svc.cluster.local:4317"
```
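
The before/after change amounts to stripping the URL scheme from the endpoint. A minimal sketch of that transformation, using only the standard library, is shown here. The function name `grpc_target` is hypothetical and is not part of any OpenTelemetry SDK.

```python
from urllib.parse import urlparse


def grpc_target(endpoint: str) -> str:
    """Return the host:port part of an endpoint, dropping any URL scheme."""
    if "://" in endpoint:
        # urlparse puts "host:port" in netloc when a scheme is present.
        return urlparse(endpoint).netloc
    # Already a bare host:port target; leave it untouched.
    return endpoint
```

For example, `grpc_target("http://signoz-otel-collector.bakery-ia.svc.cluster.local:4317")` yields the bare `signoz-otel-collector.bakery-ia.svc.cluster.local:4317` value shown in the "After" block, and a value that is already bare passes through unchanged.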

### Tertiary Issue: Hardcoded Endpoints

**Problem:** Each service manifest hardcoded the OTEL endpoint instead of referencing the ConfigMap
**Solution:** Updated all 18 services to use `valueFrom: configMapKeyRef`

## Solution Implemented

### 1. Added Complete Meter Pipeline Configuration

**Added Connector:**
```yaml
connectors:
  signozmeter:
    dimensions:
      - name: service.name
      - name: deployment.environment
      - name: host.name
    metrics_flush_interval: 1h
```

**Added Batch Processor:**
```yaml
processors:
  batch/meter:
    timeout: 1s
    send_batch_size: 20000
    send_batch_max_size: 25000
```

**Added Exporters:**
```yaml
exporters:
  # Meter exporter
  signozclickhousemeter:
    dsn: "tcp://admin:PASSWORD@signoz-clickhouse:9000/signoz_meter"
    timeout: 45s
    sending_queue:
      enabled: false

  # Metadata exporter
  metadataexporter:
    dsn: "tcp://admin:PASSWORD@signoz-clickhouse:9000/signoz_metadata"
    timeout: 10s
    cache:
      provider: in_memory
```

**Updated Traces Pipeline:**
```yaml
traces:
  receivers: [otlp]
  processors: [memory_limiter, batch, signozspanmetrics/delta, resourcedetection]
  exporters: [clickhousetraces, metadataexporter, signozmeter]  # Added signozmeter
```

**Added Meter Pipeline:**
```yaml
metrics/meter:
  receivers: [signozmeter]
  processors: [batch/meter]
  exporters: [signozclickhousemeter]
```

### 2. Fixed gRPC Endpoint Configuration

Updated ConfigMaps:
- `infrastructure/kubernetes/base/configmap.yaml`
- `infrastructure/kubernetes/overlays/prod/prod-configmap.yaml`

### 3. Centralized OTEL Configuration

Created script: `infrastructure/kubernetes/fix-otel-endpoints.sh`

Updated 18 service manifests to reference the ConfigMap instead of hardcoded values.

## Results

### Before Fix
- ❌ OTel Collector continuously restarting
- ❌ Services unable to export traces (`StatusCode.UNAVAILABLE`)
- ❌ Error: `connector "signozmeter" used as receiver but not used in any supported exporter pipeline`
- ❌ OpAMP constantly trying to reload the bad config

### After Fix
- ✅ OTel Collector stable and running
- ✅ Message: `"Everything is ready. Begin running and processing data."`
- ✅ No more `signozmeter` connector errors
- ✅ OpAMP errors are now non-fatal (they come from the remote server, not the local config)
- ⚠️ Service connectivity still shows transient errors (separate investigation needed)

## OpAMP Behavior

**What is OpAMP?**
- Open Agent Management Protocol
- Allows remote management and configuration of collectors
- SigNoz uses it for central configuration management

**Current State:**
- OpAMP continues to log errors, but they are now **non-fatal**
- The errors come from the remote OpAMP server (signoz:4320), not the local config
- Local configuration is valid and working
- Collector is stable and processing data

**OpAMP Error Pattern:**
```
[ERROR] opamp/server_client.go:146
Server returned an error response
```

Although it is logged at `ERROR` level, this message only indicates that the remote OpAMP server has configuration issues; it does not affect the locally configured collector.

## Files Modified

### Helm Values
1. `infrastructure/helm/signoz-values-dev.yaml`
   - Added connectors section
   - Added `batch/meter` processor
   - Added `signozclickhousemeter` exporter
   - Added `metadataexporter`
   - Updated traces pipeline to export to `signozmeter`
   - Added `metrics/meter` pipeline

2. `infrastructure/helm/signoz-values-prod.yaml`
   - Same changes as dev

### ConfigMaps
3. `infrastructure/kubernetes/base/configmap.yaml`
   - Fixed `OTEL_EXPORTER_OTLP_ENDPOINT` (removed `http://`)

4. `infrastructure/kubernetes/overlays/prod/prod-configmap.yaml`
   - Fixed `OTEL_EXPORTER_OTLP_ENDPOINT` (removed `http://`)

### Service Manifests (18 files)
All services in `infrastructure/kubernetes/base/components/*/` changed from:
```yaml
- name: OTEL_EXPORTER_OTLP_ENDPOINT
  value: "http://..."
```
To:
```yaml
- name: OTEL_EXPORTER_OTLP_ENDPOINT
  valueFrom:
    configMapKeyRef:
      name: bakery-config
      key: OTEL_EXPORTER_OTLP_ENDPOINT
```
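
To catch any manifest that still hardcodes the endpoint, a check along the following lines could be run over each Deployment once it has been loaded into a dictionary. This is a sketch under assumptions: the function name `hardcoded_otel_env` and the sample `gateway` manifests are illustrative only, and loading the YAML files is left out.

```python
from typing import Any

OTEL_ENDPOINT_VAR = "OTEL_EXPORTER_OTLP_ENDPOINT"


def hardcoded_otel_env(deployment: dict[str, Any], var: str = OTEL_ENDPOINT_VAR) -> list[str]:
    """Return the names of containers that set `var` with a literal `value`."""
    pod_spec = deployment.get("spec", {}).get("template", {}).get("spec", {})
    offenders: list[str] = []
    for container in pod_spec.get("containers", []):
        for env in container.get("env", []):
            # A literal "value" key means the endpoint is hardcoded;
            # the desired form uses "valueFrom" with a configMapKeyRef.
            if env.get("name") == var and "value" in env:
                offenders.append(container.get("name", "<unnamed>"))
    return offenders


def sample_deployment(env_entry: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical minimal Deployment with a single container."""
    return {
        "spec": {
            "template": {
                "spec": {"containers": [{"name": "gateway", "env": [env_entry]}]}
            }
        }
    }


old_manifest = sample_deployment({"name": OTEL_ENDPOINT_VAR, "value": "http://..."})
new_manifest = sample_deployment(
    {
        "name": OTEL_ENDPOINT_VAR,
        "valueFrom": {
            "configMapKeyRef": {"name": "bakery-config", "key": OTEL_ENDPOINT_VAR}
        },
    }
)
```

Here `hardcoded_otel_env(old_manifest)` flags the `gateway` container, and `hardcoded_otel_env(new_manifest)` returns an empty list.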

## Verification Commands

```bash
# 1. Check OTel Collector is stable
kubectl get pods -n bakery-ia | grep otel-collector
# Should show: 1/1 Running

# 2. Check for configuration errors
kubectl logs -n bakery-ia deployment/signoz-otel-collector --tail=50 | grep -E "failed to apply config|signozmeter"
# Should print nothing: no config-apply or signozmeter errors

# 3. Verify collector is ready
kubectl logs -n bakery-ia deployment/signoz-otel-collector | grep "Everything is ready"
# Should show: "Everything is ready. Begin running and processing data."

# 4. Check service configuration
kubectl get configmap bakery-config -n bakery-ia -o jsonpath='{.data.OTEL_EXPORTER_OTLP_ENDPOINT}'
# Should show: signoz-otel-collector.bakery-ia.svc.cluster.local:4317 (no http://)

# 5. Verify service is using ConfigMap
kubectl get deployment gateway -n bakery-ia -o yaml | grep -A 5 "OTEL_EXPORTER"
# Should show: valueFrom / configMapKeyRef

# 6. Run verification script
./infrastructure/helm/verify-signoz-telemetry.sh
```
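
Checks 2 and 3 can also be applied to saved log output in one pass. The sketch below assumes the collector log has been captured as text (for example by redirecting `kubectl logs` to a file). It reuses the same search strings as the `grep` commands; the function name `summarize_collector_log` and the two sample logs are hypothetical.

```python
import re

READY_MARKER = "Everything is ready. Begin running and processing data."
# Same alternation as the grep -E pattern in check 2.
CONFIG_ERROR_PATTERN = re.compile(r"failed to apply config|signozmeter")


def summarize_collector_log(log_text: str) -> dict[str, bool]:
    """Report whether the log shows readiness and whether it shows config errors."""
    lines = log_text.splitlines()
    return {
        "ready": any(READY_MARKER in line for line in lines),
        "config_errors": any(CONFIG_ERROR_PATTERN.search(line) for line in lines),
    }


# Hypothetical log excerpts built from the messages quoted in this document.
healthy_log = "Everything is ready. Begin running and processing data.\n"
failing_log = (
    'connector "signozmeter" used as receiver in [metrics/meter] pipeline '
    "but not used in any supported exporter pipeline\n"
)
```

A healthy collector should produce `ready` as true and `config_errors` as false; the failing excerpt produces the opposite.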

## Next Steps

### Immediate
1. ✅ OTel Collector is stable with OpAMP enabled
2. ⏭️ Investigate remaining service connectivity issues
3. ⏭️ Generate test traffic and verify data collection
4. ⏭️ Check ClickHouse for traces/metrics/logs
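
As a first step in the connectivity investigation, a plain TCP probe from inside a service pod can help separate network-level problems from exporter configuration problems. This is a minimal sketch using only the standard library; the function name `can_connect` is hypothetical, and a successful connection only shows that the port is reachable, not that the collector accepts OTLP traffic.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

From a pod in the cluster, this could be called as `can_connect("signoz-otel-collector.bakery-ia.svc.cluster.local", 4317)`, using the host and port from the ConfigMap value above.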

### Short-term
1. Monitor OpAMP errors (they are non-fatal, not blocking)
2. Consider contacting SigNoz about the OpAMP server configuration
3. Set up SigNoz dashboards and alerts
4. Document common queries

### Long-term
1. Evaluate whether OpAMP remote management is needed
2. Consider the HTTP exporter as an alternative to gRPC
3. Implement a service mesh if connectivity issues persist
4. Set up proper TLS for production

## Key Learnings

### About OpenTelemetry Connectors
- Connectors must be used in BOTH directions
- Source pipeline must export TO the connector
- Destination pipeline must receive FROM the connector
- Missing either direction causes pipeline build failures

### About OpAMP
- OpAMP can push remote configurations
- Local config takes precedence
- Remote server errors don't prevent local operation
- Collector continues with last known good config

### About gRPC Configuration
- A raw gRPC target is `hostname:port`, with no `http://` or `https://` prefix
- OTLP/HTTP endpoints DO need the protocol prefix
- Some OTLP gRPC exporters also accept a scheme and use it to choose between TLS and plaintext, so check the SDK's behavior before assuming the prefix is the cause of a failure

### About Configuration Management
- Centralize configuration in ConfigMaps
- Use the `valueFrom: configMapKeyRef` pattern
- A single source of truth prevents drift
- Updates become easier across all services

## References

- [SigNoz Helm Charts](https://github.com/SigNoz/charts)
- [OpenTelemetry Connectors](https://opentelemetry.io/docs/collector/configuration/#connectors)
- [OpAMP Specification](https://github.com/open-telemetry/opamp-spec)
- [SigNoz OTel Collector](https://github.com/SigNoz/signoz-otel-collector)

---

**Resolution Date:** 2026-01-09
**Status:** ✅ Resolved - OTel Collector stable, OpAMP functional
**Remaining:** Service connectivity investigation ongoing