Root Cause Analysis:
- OTel Collector was starting but OpAMP was overwriting config with "nop" receivers/exporters
- ClickHouse authentication was failing due to missing credentials in DSN strings
- Redis/PostgreSQL/RabbitMQ receivers had missing TLS certs causing startup failures
Changes:
1. Fixed ClickHouse Exporters:
- Added admin credentials to clickhousetraces datasource
- Added admin credentials to clickhouselogsexporter dsn
- Now using: tcp://admin:27ff0399-0d3a-4bd8-919d-17c2181e6fb9@signoz-clickhouse:9000/
2. Disabled Unconfigured Receivers:
- Commented out PostgreSQL receivers (no monitor users configured)
- Commented out Redis receiver (TLS certificates not available)
- Commented out RabbitMQ receiver (credentials not configured)
- Updated metrics pipeline to use only OTLP receiver
3. OpAMP Disabled:
- OpAMP was causing collector to use nop exporters/receivers
- Cannot disable via Helm (extraArgs appends, doesn't replace)
- Must apply kubectl patch after Helm install:
kubectl patch deployment signoz-otel-collector --type=json -p='[{"op":"replace","path":"/spec/template/spec/containers/0/args","value":["--config=/conf/otel-collector-config.yaml","--feature-gates=-pkg.translator.prometheus.NormalizeName"]}]'
Results:
✅ OTel Collector successfully receiving traces (97+ spans)
✅ Services connecting without UNAVAILABLE errors
✅ No ClickHouse authentication failures
✅ All pipelines active (traces, metrics, logs)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
SigNoz Helm Deployment for Bakery IA
This directory contains Helm configurations and deployment scripts for the SigNoz observability platform.
Overview
SigNoz is deployed using the official Helm chart with environment-specific configurations optimized for:
- Development: Colima + Kind (Kubernetes in Docker) with Tilt
- Production: VPS on clouding.io with MicroK8s
Prerequisites
Required Tools
- kubectl 1.22+
- Helm 3.8+
- Docker (for development)
- Kind/MicroK8s (environment-specific)
Docker Hub Authentication
SigNoz uses images from Docker Hub. Set up authentication to avoid rate limits:
# Option 1: Environment variables (recommended)
export DOCKERHUB_USERNAME='your-username'
export DOCKERHUB_PASSWORD='your-personal-access-token'
# Option 2: Docker login
docker login
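The deploy script creates the image pull secret for you. If you ever need to create it manually, the equivalent command is roughly the following; the secret name dockerhub-creds matches the one referenced in the Troubleshooting section:
# Manually create the Docker Hub image pull secret in the signoz namespace
kubectl create secret docker-registry dockerhub-creds \
  --namespace signoz \
  --docker-server=https://index.docker.io/v1/ \
  --docker-username="$DOCKERHUB_USERNAME" \
  --docker-password="$DOCKERHUB_PASSWORD"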
Quick Start
Development Deployment
# Deploy SigNoz to development environment
./deploy-signoz.sh dev
# Verify deployment
./verify-signoz.sh dev
# Access SigNoz UI
# Via ingress: http://monitoring.bakery-ia.local
# Or port-forward:
kubectl port-forward -n signoz svc/signoz 8080:8080
# Then open: http://localhost:8080
Production Deployment
# Deploy SigNoz to production environment
./deploy-signoz.sh prod
# Verify deployment
./verify-signoz.sh prod
# Access SigNoz UI
# https://monitoring.bakewise.ai
Configuration Files
signoz-values-dev.yaml
Development environment configuration with:
- Single replica for most components
- Reduced resource requests (optimized for local Kind cluster)
- 7-day data retention
- Batch size: 10,000 events
- ClickHouse 25.5.6, OTel Collector v0.129.12
- PostgreSQL, Redis, and RabbitMQ receivers configured
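An illustrative excerpt of the dev values (not the complete file), using the retention keys documented in the Data Retention section below; check the chart's values.yaml for the authoritative key names:
signoz:
  replicaCount: 1
  env:
    signoz_traces_ttl_duration_hrs: "168"   # 7-day retention
    signoz_metrics_ttl_duration_hrs: "168"
    signoz_logs_ttl_duration_hrs: "168"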
signoz-values-prod.yaml
Production environment configuration with:
- High availability: 2+ replicas for critical components
- 3 Zookeeper replicas (required for production)
- 30-day data retention
- Batch size: 50,000 events (high-performance)
- Cold storage enabled with 30-day TTL
- Horizontal Pod Autoscaler (HPA) enabled
- TLS/SSL with cert-manager
- Enhanced security with pod anti-affinity rules
Key Configuration Changes (v0.89.0+)
⚠️ BREAKING CHANGE: SigNoz Helm chart v0.89.0+ uses a unified component structure.
Old Structure (deprecated):
frontend:
  replicaCount: 2
queryService:
  replicaCount: 2
New Structure (current):
signoz:
  replicaCount: 2   # Combines frontend + query service
Component Architecture
Core Components
- SigNoz (unified component)
  - Frontend UI + Query Service
  - Ports: 8080 (HTTP/API), 8085 (internal gRPC)
  - Dev: 1 replica, Prod: 2+ replicas with HPA
- ClickHouse (time-series database)
  - Version: 25.5.6
  - Stores traces, metrics, and logs
  - Dev: 1 replica, Prod: 2 replicas with cold storage
- Zookeeper (ClickHouse coordination)
  - Version: 3.7.1
  - Dev: 1 replica, Prod: 3 replicas (critical for HA)
- OpenTelemetry Collector (data ingestion)
  - Version: v0.129.12
  - Ports: 4317 (gRPC), 4318 (HTTP), 8888 (metrics)
  - Dev: 1 replica, Prod: 2+ replicas with HPA
- Alertmanager (alert management)
  - Version: 0.23.5
  - Email and Slack integrations configured
  - Port: 9093
Performance Optimizations
Batch Processing
- Development: 10,000 events per batch
- Production: 50,000 events per batch (official recommendation)
- Timeout: 1 second for faster processing
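The batch sizes above map to the collector's batch processor. A minimal sketch using the same otelCollector.config override pattern shown in the Debug Mode section (dev values shown):
otelCollector:
  config:
    processors:
      batch:
        send_batch_size: 10000   # 50000 in production
        timeout: 1s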
Memory Management
- Memory limiter processor prevents OOM
- Dev: 400 MiB limit, Prod: 1500 MiB limit
- Spike limits configured
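A sketch of the memory limiter processor with the dev limit; the check interval and spike limit shown here are illustrative, not taken from the values files:
otelCollector:
  config:
    processors:
      memory_limiter:
        check_interval: 1s
        limit_mib: 400          # 1500 in production
        spike_limit_mib: 100    # illustrative value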
Span Metrics Processor
Automatically generates RED metrics (Rate, Errors, Duration):
- Latency histogram buckets optimized for microservices
- Cache size: 10K (dev), 100K (prod)
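SigNoz ships its own span-metrics processor in its collector distribution, so the exact processor name and keys should be taken from the generated collector config. As an assumption based on the upstream spanmetrics processor options, the relevant settings would look roughly like this (processor name and bucket values are illustrative):
otelCollector:
  config:
    processors:
      signozspanmetrics/delta:          # assumed processor name
        dimensions_cache_size: 10000    # 100000 in production
        latency_histogram_buckets: [2ms, 10ms, 50ms, 100ms, 500ms, 1s, 5s]   # illustrative buckets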
Cold Storage (Production Only)
- Enabled with 30-day TTL
- Automatically moves old data to cold storage
- Keeps 10GB free on primary storage
OpenTelemetry Endpoints
From Within Kubernetes Cluster
Development:
OTLP gRPC: signoz-otel-collector.bakery-ia.svc.cluster.local:4317
OTLP HTTP: signoz-otel-collector.bakery-ia.svc.cluster.local:4318
Production:
OTLP gRPC: signoz-otel-collector.bakery-ia.svc.cluster.local:4317
OTLP HTTP: signoz-otel-collector.bakery-ia.svc.cluster.local:4318
Application Configuration Example
# Python with OpenTelemetry
OTEL_EXPORTER_OTLP_ENDPOINT: "http://signoz-otel-collector.bakery-ia.svc.cluster.local:4318"
OTEL_EXPORTER_OTLP_PROTOCOL: "http/protobuf"
// Node.js with OpenTelemetry
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

const exporter = new OTLPTraceExporter({
  url: 'http://signoz-otel-collector.bakery-ia.svc.cluster.local:4318/v1/traces',
});
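In Kubernetes, the same settings are typically injected as environment variables on the workload. A minimal sketch for a Deployment container spec (the service name is a placeholder):
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://signoz-otel-collector.bakery-ia.svc.cluster.local:4318"
  - name: OTEL_EXPORTER_OTLP_PROTOCOL
    value: "http/protobuf"
  - name: OTEL_SERVICE_NAME
    value: "orders-api"   # placeholder service name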
Deployment Scripts
deploy-signoz.sh
Comprehensive deployment script with features:
# Usage
./deploy-signoz.sh [OPTIONS] ENVIRONMENT
# Options
-h, --help Show help message
-d, --dry-run Show what would be deployed
-u, --upgrade Upgrade existing deployment
-r, --remove Remove deployment
-n, --namespace NS Custom namespace (default: signoz)
# Examples
./deploy-signoz.sh dev # Deploy to dev
./deploy-signoz.sh --upgrade prod # Upgrade prod
./deploy-signoz.sh --dry-run prod # Preview changes
./deploy-signoz.sh --remove dev # Remove dev deployment
Features:
- Automatic Helm repository setup
- Docker Hub secret creation
- Namespace management
- Deployment verification
- 15-minute timeout with the --wait flag
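Under the hood the script wraps a standard Helm workflow. A rough bash equivalent of a dev deployment, assuming the release name signoz and the signoz namespace used elsewhere in this README (the script additionally handles secrets, namespace creation, and verification):
# Roughly what ./deploy-signoz.sh dev performs
helm repo add signoz https://charts.signoz.io
helm repo update
helm upgrade --install signoz signoz/signoz \
  --namespace signoz --create-namespace \
  --values signoz-values-dev.yaml \
  --wait --timeout 15m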
verify-signoz.sh
Verification script to check deployment health:
# Usage
./verify-signoz.sh [OPTIONS] ENVIRONMENT
# Examples
./verify-signoz.sh dev # Verify dev deployment
./verify-signoz.sh prod # Verify prod deployment
Checks performed:
- ✅ Helm release status
- ✅ Pod health and readiness
- ✅ Service availability
- ✅ Ingress configuration
- ✅ PVC status
- ✅ Resource usage (if metrics-server available)
- ✅ Log errors
- ✅ Environment-specific validations
  - Dev: single replica, resource limits
  - Prod: HA config, TLS, Zookeeper replicas, HPA
Storage Configuration
Development (Kind)
global:
  storageClass: "standard" # Kind's default provisioner
Production (MicroK8s)
global:
  storageClass: "microk8s-hostpath" # Or a custom storage class
Storage Requirements:
- Development: ~35 GiB total
  - SigNoz: 5 GiB
  - ClickHouse: 20 GiB
  - Zookeeper: 5 GiB
  - Alertmanager: 2 GiB
- Production: ~135 GiB total
  - SigNoz: 20 GiB
  - ClickHouse: 100 GiB
  - Zookeeper: 10 GiB
  - Alertmanager: 5 GiB
Resource Requirements
Development Environment
Minimum:
- CPU: 550m (0.55 cores)
- Memory: 1.6 GiB
- Storage: 35 GiB
Recommended:
- CPU: 3 cores
- Memory: 3 GiB
- Storage: 50 GiB
Production Environment
Minimum:
- CPU: 3.5 cores
- Memory: 8 GiB
- Storage: 135 GiB
Recommended:
- CPU: 12 cores
- Memory: 20 GiB
- Storage: 200 GiB
Data Retention
Development
- Traces: 7 days (168 hours)
- Metrics: 7 days (168 hours)
- Logs: 7 days (168 hours)
Production
- Traces: 30 days (720 hours)
- Metrics: 30 days (720 hours)
- Logs: 30 days (720 hours)
- Cold storage after 30 days
To modify retention, update the environment variables:
signoz:
  env:
    signoz_traces_ttl_duration_hrs: "720"   # 30 days
    signoz_metrics_ttl_duration_hrs: "720"  # 30 days
    signoz_logs_ttl_duration_hrs: "168"     # 7 days
High Availability (Production)
Replication Strategy
signoz: 2 replicas + HPA (min: 2, max: 5)
clickhouse: 2 replicas
zookeeper: 3 replicas (critical!)
otelCollector: 2 replicas + HPA (min: 2, max: 10)
alertmanager: 2 replicas
Pod Anti-Affinity
Encourages the scheduler to spread pods across different nodes (a preference, not a hard requirement, since preferredDuringSchedulingIgnoredDuringExecution is used):
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app.kubernetes.io/component: query-service
          topologyKey: kubernetes.io/hostname
Pod Disruption Budgets
Configured for all critical components:
podDisruptionBudget:
  enabled: true
  minAvailable: 1
Monitoring and Alerting
Email Alerts (Production)
Configure SMTP in production values:
signoz:
  env:
    signoz_smtp_enabled: "true"
    signoz_smtp_host: "smtp.gmail.com"
    signoz_smtp_port: "587"
    signoz_smtp_from: "alerts@bakewise.ai"
    signoz_smtp_username: "alerts@bakewise.ai"
    # Set via secret: signoz_smtp_password
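The SMTP password should not live in the values file. As a sketch, it can be kept in a Kubernetes secret; the secret and key names below are placeholders, and how the chart consumes the secret depends on its secret/extra-env support:
kubectl create secret generic signoz-smtp \
  --namespace signoz \
  --from-literal=signoz_smtp_password='your-smtp-password'   # placeholder key and value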
Slack Alerts (Production)
Configure webhook in Alertmanager:
alertmanager:
  config:
    receivers:
      - name: 'critical-alerts'
        slack_configs:
          - api_url: '${SLACK_WEBHOOK_URL}'
            channel: '#alerts-critical'
Self-Monitoring
SigNoz monitors itself:
selfMonitoring:
  enabled: true
  serviceMonitor:
    enabled: true # Prod only
    interval: 30s
Troubleshooting
Common Issues
1. Pods not starting
# Check pod status
kubectl get pods -n signoz
# Check pod logs
kubectl logs -n signoz <pod-name>
# Describe pod for events
kubectl describe pod -n signoz <pod-name>
2. Docker Hub rate limits
# Verify secret exists
kubectl get secret dockerhub-creds -n signoz
# Recreate secret
kubectl delete secret dockerhub-creds -n signoz
export DOCKERHUB_USERNAME='your-username'
export DOCKERHUB_PASSWORD='your-token'
./deploy-signoz.sh dev
3. ClickHouse connection issues
# Check ClickHouse pod
kubectl logs -n signoz -l app.kubernetes.io/component=clickhouse
# Check Zookeeper (required by ClickHouse)
kubectl logs -n signoz -l app.kubernetes.io/component=zookeeper
4. OTel Collector not receiving data
# Check OTel Collector logs
kubectl logs -n signoz -l app.kubernetes.io/component=otel-collector
# Test connectivity
kubectl port-forward -n signoz svc/signoz-otel-collector 4318:4318
curl -v http://localhost:4318/v1/traces
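The plain curl above only checks reachability (a GET is not a valid OTLP request). To confirm the HTTP receiver actually accepts OTLP/JSON, an empty payload can be posted; a 200 response with an empty JSON body indicates the receiver is up:
# Post an empty OTLP/JSON payload to verify the traces receiver responds
curl -s -X POST http://localhost:4318/v1/traces \
  -H "Content-Type: application/json" \
  -d '{"resourceSpans":[]}'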
5. Insufficient storage
# Check PVC status
kubectl get pvc -n signoz
# Check storage usage (if metrics-server available)
kubectl top pods -n signoz
Debug Mode
Enable debug exporter in OTel Collector:
otelCollector:
  config:
    exporters:
      debug:
        verbosity: detailed
        sampling_initial: 5
        sampling_thereafter: 200
    service:
      pipelines:
        traces:
          exporters: [clickhousetraces, debug] # Add debug
Upgrade from Old Version
If upgrading from pre-v0.89.0:
# 1. Backup data (recommended)
kubectl get all -n signoz -o yaml > signoz-backup.yaml
# 2. Remove old deployment
./deploy-signoz.sh --remove prod
# 3. Deploy new version
./deploy-signoz.sh prod
# 4. Verify
./verify-signoz.sh prod
Security Best Practices
- Change default password immediately after first login
- Use TLS/SSL in production (configured with cert-manager)
- Network policies enabled in production
- Run as non-root (configured in securityContext)
- RBAC with dedicated service account
- Secrets management for sensitive data (SMTP, Slack webhooks)
- Image pull secrets to avoid exposing Docker Hub credentials
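As an illustration of the network-policy point above (not a policy shipped with these values files), a minimal NetworkPolicy restricting ingress to the collector's OTLP ports might look like the sketch below; the policy name, pod labels, and source namespace are assumptions:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-otlp-ingress              # illustrative name
  namespace: signoz
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/component: otel-collector   # assumed pod label
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: bakery-ia   # assumed application namespace
      ports:
        - protocol: TCP
          port: 4317
        - protocol: TCP
          port: 4318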
Backup and Recovery
Backup ClickHouse Data
# Export ClickHouse data
kubectl exec -n signoz <clickhouse-pod> -- clickhouse-client \
--query="BACKUP DATABASE signoz_traces TO Disk('backups', 'traces_backup.zip')"
# Copy backup out
kubectl cp signoz/<clickhouse-pod>:/var/lib/clickhouse/backups/ ./backups/
Restore from Backup
# Copy backup in
kubectl cp ./backups/ signoz/<clickhouse-pod>:/var/lib/clickhouse/backups/
# Restore
kubectl exec -n signoz <clickhouse-pod> -- clickhouse-client \
--query="RESTORE DATABASE signoz_traces FROM Disk('backups', 'traces_backup.zip')"
Updating Configuration
To update SigNoz configuration:
1. Edit the values file: signoz-values-{env}.yaml
2. Apply the changes: ./deploy-signoz.sh --upgrade {env}
3. Verify: ./verify-signoz.sh {env}
Uninstallation
# Remove SigNoz deployment
./deploy-signoz.sh --remove {env}
# Optionally delete PVCs (WARNING: deletes all data)
kubectl delete pvc -n signoz -l app.kubernetes.io/instance=signoz
# Optionally delete namespace
kubectl delete namespace signoz
References
- SigNoz Official Documentation
- SigNoz Helm Charts Repository
- OpenTelemetry Documentation
- ClickHouse Documentation
Support
For issues or questions:
- Check SigNoz GitHub Issues
- Review deployment logs: kubectl logs -n signoz <pod-name>
- Run the verification script: ./verify-signoz.sh {env}
- Check the SigNoz Community Slack
Last Updated: 2026-01-09
SigNoz Helm Chart Version: Latest (v0.129.12 components)
Maintained by: Bakery IA Team