bakery-ia/docs/SIGNOZ_ROOT_CAUSE_ANALYSIS.md
2026-01-09 11:18:20 +01:00

SigNoz OpAMP Root Cause Analysis & Resolution

Problem Statement

Services received StatusCode.UNAVAILABLE errors when sending traces to the SigNoz OTel Collector on port 4317. The collector was restarting continuously because OpAMP kept trying to apply an invalid remote configuration.

Root Cause Analysis

Primary Issue: Missing signozmeter Connector Pipeline

Error Message:

connector "signozmeter" used as receiver in [metrics/meter] pipeline
but not used in any supported exporter pipeline

Root Cause: The OpAMP server was pushing a remote configuration in which:

  1. A metrics/meter pipeline used signozmeter as a receiver
  2. No pipeline exported TO the signozmeter connector

Technical Explanation:

  • Connectors in OpenTelemetry are special components that act as BOTH exporters AND receivers
  • They bridge between pipelines (e.g., traces → metrics)
  • The signozmeter connector generates usage/meter metrics from trace data
  • For a connector to work, it must be:
    1. Used as an exporter in one pipeline (the source)
    2. Used as a receiver in another pipeline (the destination)
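The two-sided rule above can be checked mechanically. The sketch below assumes a hypothetical flattened listing of the pipelines, one "pipeline role component" triple per line. That listing is not the collector's own config format, and check_connector is an illustrative name, not part of SigNoz or the OpenTelemetry Collector.

```shell
#!/usr/bin/env bash
# Sketch only: reads "pipeline role component" triples from stdin
# (a hypothetical flattened view of the pipelines, not real collector config)
# and reports how the named connector is wired.
check_connector() {
  local connector="$1"
  awk -v c="$connector" '
    $3 == c && $2 == "exporter" { as_exporter = 1 }
    $3 == c && $2 == "receiver" { as_receiver = 1 }
    END {
      if (as_receiver && !as_exporter) { print "receiver-only"; exit 1 }
      if (as_exporter && !as_receiver) { print "exporter-only"; exit 1 }
      if (!as_exporter && !as_receiver) { print "unused"; exit 0 }
      print "ok"
    }
  '
}

# The broken layout from this incident: only the destination side is wired.
# Prints: receiver-only
printf '%s\n' \
  'traces exporter clickhousetraces' \
  'metrics/meter receiver signozmeter' | check_connector signozmeter || true
```

A non-zero exit status marks a half-wired connector, which is the condition that made the collector refuse to build its pipelines.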

What Was Missing: Our configuration had:

  • signozmeter connector defined
  • metrics/meter pipeline receiving from signozmeter
  • No pipeline exporting TO signozmeter

The traces pipeline needed to export to signozmeter:

traces:
  receivers: [otlp]
  processors: [...]
  exporters: [clickhousetraces, metadataexporter, signozmeter]  # <-- signozmeter was missing

Secondary Issue: gRPC Endpoint Format

Problem: Services had an http:// prefix in their gRPC endpoints.

Solution: Removed the http:// prefix (gRPC doesn't use an HTTP protocol prefix).

Before:

OTEL_EXPORTER_OTLP_ENDPOINT: "http://signoz-otel-collector.bakery-ia.svc.cluster.local:4317"

After:

OTEL_EXPORTER_OTLP_ENDPOINT: "signoz-otel-collector.bakery-ia.svc.cluster.local:4317"
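The before/after change amounts to dropping the URL scheme from the value. A minimal shell sketch of that normalization follows; strip_scheme is a hypothetical helper name used for illustration and is not something shipped in the repository.

```shell
#!/usr/bin/env bash
# Sketch only: strip a leading http:// or https:// so the value is a
# bare host:port target. Values without a scheme pass through unchanged.
strip_scheme() {
  local endpoint="$1"
  endpoint="${endpoint#http://}"
  endpoint="${endpoint#https://}"
  printf '%s\n' "$endpoint"
}

# Prints: signoz-otel-collector.bakery-ia.svc.cluster.local:4317
strip_scheme "http://signoz-otel-collector.bakery-ia.svc.cluster.local:4317"
```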

Tertiary Issue: Hardcoded Endpoints

Problem: Each service manifest had hardcoded OTEL endpoints instead of referencing the ConfigMap.

Solution: Updated all 18 services to use valueFrom: configMapKeyRef.

Solution Implemented

1. Added Complete Meter Pipeline Configuration

Added Connector:

connectors:
  signozmeter:
    dimensions:
      - name: service.name
      - name: deployment.environment
      - name: host.name
    metrics_flush_interval: 1h

Added Batch Processor:

processors:
  batch/meter:
    timeout: 1s
    send_batch_size: 20000
    send_batch_max_size: 25000

Added Exporters:

exporters:
  # Meter exporter
  signozclickhousemeter:
    dsn: "tcp://admin:PASSWORD@signoz-clickhouse:9000/signoz_meter"
    timeout: 45s
    sending_queue:
      enabled: false

  # Metadata exporter
  metadataexporter:
    dsn: "tcp://admin:PASSWORD@signoz-clickhouse:9000/signoz_metadata"
    timeout: 10s
    cache:
      provider: in_memory

Updated Traces Pipeline:

traces:
  receivers: [otlp]
  processors: [memory_limiter, batch, signozspanmetrics/delta, resourcedetection]
  exporters: [clickhousetraces, metadataexporter, signozmeter]  # Added signozmeter

Added Meter Pipeline:

metrics/meter:
  receivers: [signozmeter]
  processors: [batch/meter]
  exporters: [signozclickhousemeter]

2. Fixed gRPC Endpoint Configuration

Updated ConfigMaps:

  • infrastructure/kubernetes/base/configmap.yaml
  • infrastructure/kubernetes/overlays/prod/prod-configmap.yaml

3. Centralized OTEL Configuration

Created script: infrastructure/kubernetes/fix-otel-endpoints.sh

Updated 18 service manifests to use ConfigMap reference instead of hardcoded values.

Results

Before Fix

  • OTel Collector continuously restarting
  • Services unable to export traces (StatusCode.UNAVAILABLE)
  • Error: connector "signozmeter" used as receiver but not used in any supported exporter pipeline
  • OpAMP repeatedly trying to apply the bad config

After Fix

  • OTel Collector stable and running
  • Message: "Everything is ready. Begin running and processing data."
  • No more signozmeter connector errors
  • OpAMP errors are now non-fatal (remote server issues, not local config)
  • ⚠️ Service connectivity still showing transient errors (separate investigation needed)

OpAMP Behavior

What is OpAMP?

  • Open Agent Management Protocol
  • Allows remote management and configuration of collectors
  • SigNoz uses it for central configuration management

Current State:

  • OpAMP continues to show errors, but they're now non-fatal
  • The errors are from the remote OpAMP server (signoz:4320), not local config
  • Local configuration is valid and working
  • Collector is stable and processing data

OpAMP Error Pattern:

[ERROR] opamp/server_client.go:146
Server returned an error response

Although it is logged at ERROR level, this entry only reports that the remote OpAMP server has configuration issues; it doesn't affect the locally configured collector.
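To tell the non-fatal OpAMP noise apart from the fatal connector error, the collector log can be summarized by pattern. The sketch below matches only the message fragments quoted in this document; the real log format may differ, and summarize_collector_log is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: count the three kinds of log lines discussed in this report.
# Patterns are taken from the messages quoted here, not from SigNoz docs.
summarize_collector_log() {
  awk '
    /opamp\/server_client\.go/   { opamp++ }
    /connector "signozmeter"/    { connector++ }
    /Everything is ready/        { ready++ }
    END {
      printf "opamp_errors=%d connector_errors=%d ready=%d\n", opamp, connector, ready
    }
  '
}

# Example with two sample lines.
# Prints: opamp_errors=1 connector_errors=0 ready=1
printf '%s\n' \
  '[ERROR] opamp/server_client.go:146' \
  'Everything is ready. Begin running and processing data.' | summarize_collector_log
```

Against a live cluster, the same function could be fed from `kubectl logs -n bakery-ia deployment/signoz-otel-collector --tail=500`. A healthy collector would be expected to show connector_errors=0 and ready of at least 1, even if opamp_errors is non-zero.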

Files Modified

Helm Values

  1. infrastructure/helm/signoz-values-dev.yaml

    • Added connectors section
    • Added batch/meter processor
    • Added signozclickhousemeter exporter
    • Added metadataexporter
    • Updated traces pipeline to export to signozmeter
    • Added metrics/meter pipeline

  2. infrastructure/helm/signoz-values-prod.yaml

    • Same changes as dev

ConfigMaps

  1. infrastructure/kubernetes/base/configmap.yaml

    • Fixed OTEL_EXPORTER_OTLP_ENDPOINT (removed http://)

  2. infrastructure/kubernetes/overlays/prod/prod-configmap.yaml

    • Fixed OTEL_EXPORTER_OTLP_ENDPOINT (removed http://)

Service Manifests (18 files)

All services in infrastructure/kubernetes/base/components/*/ changed from:

- name: OTEL_EXPORTER_OTLP_ENDPOINT
  value: "http://..."

To:

- name: OTEL_EXPORTER_OTLP_ENDPOINT
  valueFrom:
    configMapKeyRef:
      name: bakery-config
      key: OTEL_EXPORTER_OTLP_ENDPOINT
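To confirm that no manifest was missed, the tree can be scanned for the old pattern. The contents of fix-otel-endpoints.sh are not shown in this report, so the sketch below is not that script. It is a hypothetical read-only check (list_hardcoded_otel is an illustrative name) and assumes that `value:` sits on the line directly after the `name:` line, as in the snippet above.

```shell
#!/usr/bin/env bash
# Sketch only: list YAML files under a directory where
# OTEL_EXPORTER_OTLP_ENDPOINT is followed directly by a literal "value:".
# Entries using "valueFrom:" are not reported. Manifests that put
# "value:" before "name:" would be missed by this simple line-based scan.
list_hardcoded_otel() {
  local dir="$1"
  find "$dir" -type f -name '*.yaml' -exec awk '
    FNR == 1 { pending = 0 }
    /name:[ \t]*OTEL_EXPORTER_OTLP_ENDPOINT/ { pending = 1; next }
    pending && /^[ \t]*value:/ { print FILENAME }
    { pending = 0 }
  ' {} + | sort -u
}

# Usage (prints nothing once every service references the ConfigMap):
#   list_hardcoded_otel infrastructure/kubernetes/base/components
```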

Verification Commands

# 1. Check OTel Collector is stable
kubectl get pods -n bakery-ia | grep otel-collector
# Should show: 1/1 Running

# 2. Check for configuration errors
kubectl logs -n bakery-ia deployment/signoz-otel-collector --tail=50 | grep -E "failed to apply config|signozmeter"
# Should show: NO errors about signozmeter

# 3. Verify collector is ready
kubectl logs -n bakery-ia deployment/signoz-otel-collector | grep "Everything is ready"
# Should show: "Everything is ready. Begin running and processing data."

# 4. Check service configuration
kubectl get configmap bakery-config -n bakery-ia -o jsonpath='{.data.OTEL_EXPORTER_OTLP_ENDPOINT}'
# Should show: signoz-otel-collector.bakery-ia.svc.cluster.local:4317 (no http://)

# 5. Verify service is using ConfigMap
kubectl get deployment gateway -n bakery-ia -o yaml | grep -A 5 "OTEL_EXPORTER"
# Should show: valueFrom / configMapKeyRef

# 6. Run verification script
./infrastructure/helm/verify-signoz-telemetry.sh
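Check 1 above relies on reading the pod table by eye. A scripted variant is sketched below; it assumes the default `kubectl get pods` column order (NAME, READY, STATUS, ...) and uses a hypothetical function name, collector_pods_healthy. It is not taken from verify-signoz-telemetry.sh, whose contents are not shown here.

```shell
#!/usr/bin/env bash
# Sketch only: read `kubectl get pods` output on stdin and succeed when at
# least one otel-collector pod is listed and every one of them is Running
# with all of its containers ready (e.g. 1/1).
collector_pods_healthy() {
  awk '
    /otel-collector/ {
      seen = 1
      split($2, ready, "/")
      if ($3 != "Running" || ready[1] != ready[2]) bad = 1
    }
    END {
      if (seen && !bad) exit 0
      exit 1
    }
  '
}

# Usage:
#   kubectl get pods -n bakery-ia | collector_pods_healthy && echo "collector healthy"
```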

Next Steps

Immediate

  1. OTel Collector is stable with OpAMP enabled (done)
  2. ⏭️ Investigate remaining service connectivity issues
  3. ⏭️ Generate test traffic and verify data collection
  4. ⏭️ Check ClickHouse for traces/metrics/logs

Short-term

  1. Monitor OpAMP errors; they are non-blocking
  2. Consider contacting SigNoz about OpAMP server configuration
  3. Set up SigNoz dashboards and alerts
  4. Document common queries

Long-term

  1. Evaluate if OpAMP remote management is needed
  2. Consider HTTP exporter as alternative to gRPC
  3. Implement service mesh if connectivity issues persist
  4. Set up proper TLS for production

Key Learnings

About OpenTelemetry Connectors

  • Connectors must be used in BOTH directions
  • Source pipeline must export TO the connector
  • Destination pipeline must receive FROM the connector
  • Missing either direction causes pipeline build failures

About OpAMP

  • OpAMP can push remote configurations
  • Local config takes precedence
  • Remote server errors don't prevent local operation
  • Collector continues with last known good config

About gRPC Configuration

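  • gRPC channel targets use hostname:port format, without an http:// or https:// prefix
  • Some OTLP exporters also accept an http:// or https:// URL for gRPC and use the scheme only to choose plaintext or TLS, so check the behavior of the SDK in use
  • HTTP/REST endpoints DO need the protocol prefix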

About Configuration Management

  • Centralize configuration in ConfigMaps
  • Use valueFrom: configMapKeyRef pattern
  • Single source of truth prevents drift
  • Makes updates easier across all services

Resolution Date: 2026-01-09
Status: Resolved - OTel Collector stable, OpAMP functional
Remaining: Service connectivity investigation ongoing