Root Cause Analysis:
- OTel Collector was starting but OpAMP was overwriting config with "nop" receivers/exporters
- ClickHouse authentication was failing due to missing credentials in DSN strings
- Redis/PostgreSQL/RabbitMQ receivers had missing TLS certs causing startup failures
Changes:
1. Fixed ClickHouse Exporters:
- Added admin credentials to clickhousetraces datasource
- Added admin credentials to clickhouselogsexporter dsn
- Now using: tcp://admin:27ff0399-0d3a-4bd8-919d-17c2181e6fb9@signoz-clickhouse:9000/
2. Disabled Unconfigured Receivers:
- Commented out PostgreSQL receivers (no monitor users configured)
- Commented out Redis receiver (TLS certificates not available)
- Commented out RabbitMQ receiver (credentials not configured)
- Updated metrics pipeline to use only OTLP receiver
3. OpAMP Disabled:
- OpAMP was causing collector to use nop exporters/receivers
- Cannot disable via Helm (extraArgs appends, doesn't replace)
- Must apply kubectl patch after Helm install:
kubectl patch deployment signoz-otel-collector --type=json -p='[{"op":"replace","path":"/spec/template/spec/containers/0/args","value":["--config=/conf/otel-collector-config.yaml","--feature-gates=-pkg.translator.prometheus.NormalizeName"]}]'
Results:
✅ OTel Collector successfully receiving traces (97+ spans)
✅ Services connecting without UNAVAILABLE errors
✅ No ClickHouse authentication failures
✅ All pipelines active (traces, metrics, logs)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
SigNoz Helm Deployment for Bakery IA
This directory contains Helm configurations and deployment scripts for the SigNoz observability platform.
Overview
SigNoz is deployed using the official Helm chart with environment-specific configurations optimized for:
- Development: Colima + Kind (Kubernetes in Docker) with Tilt
- Production: VPS on clouding.io with MicroK8s
Prerequisites
Required Tools
- kubectl 1.22+
- Helm 3.8+
- Docker (for development)
- Kind/MicroK8s (environment-specific)
Docker Hub Authentication
SigNoz uses images from Docker Hub. Set up authentication to avoid rate limits:
# Option 1: Environment variables (recommended)
export DOCKERHUB_USERNAME='your-username'
export DOCKERHUB_PASSWORD='your-personal-access-token'
# Option 2: Docker login
docker login
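The deploy script creates the image pull secret for you. If you ever need to create it manually, the equivalent command is roughly the following; the secret name dockerhub-creds matches the one referenced in the Troubleshooting section:
# Manually create the Docker Hub image pull secret in the signoz namespace
kubectl create secret docker-registry dockerhub-creds \
  --namespace signoz \
  --docker-server=https://index.docker.io/v1/ \
  --docker-username="$DOCKERHUB_USERNAME" \
  --docker-password="$DOCKERHUB_PASSWORD"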
Quick Start
Development Deployment
# Deploy SigNoz to development environment
./deploy-signoz.sh dev
# Verify deployment
./verify-signoz.sh dev
# Access SigNoz UI
# Via ingress: http://monitoring.bakery-ia.local
# Or port-forward:
kubectl port-forward -n signoz svc/signoz 8080:8080
# Then open: http://localhost:8080
Production Deployment
# Deploy SigNoz to production environment
./deploy-signoz.sh prod
# Verify deployment
./verify-signoz.sh prod
# Access SigNoz UI
# https://monitoring.bakewise.ai
Configuration Files
signoz-values-dev.yaml
Development environment configuration with:
- Single replica for most components
- Reduced resource requests (optimized for local Kind cluster)
- 7-day data retention
- Batch size: 10,000 events
- ClickHouse 25.5.6, OTel Collector v0.129.12
- PostgreSQL, Redis, and RabbitMQ receivers configured
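An illustrative excerpt of the dev values (not the complete file), using the retention keys documented in the Data Retention section below; check the chart's values.yaml for the authoritative key names:
signoz:
  replicaCount: 1
  env:
    signoz_traces_ttl_duration_hrs: "168"   # 7-day retention
    signoz_metrics_ttl_duration_hrs: "168"
    signoz_logs_ttl_duration_hrs: "168"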
signoz-values-prod.yaml
Production environment configuration with:
- High availability: 2+ replicas for critical components
- 3 Zookeeper replicas (required for production)
- 30-day data retention
- Batch size: 50,000 events (high-performance)
- Cold storage enabled with 30-day TTL
- Horizontal Pod Autoscaler (HPA) enabled
- TLS/SSL with cert-manager
- Enhanced security with pod anti-affinity rules
Key Configuration Changes (v0.89.0+)
⚠️ BREAKING CHANGE: SigNoz Helm chart v0.89.0+ uses a unified component structure.
Old Structure (deprecated):
frontend:
  replicaCount: 2
queryService:
  replicaCount: 2
New Structure (current):
signoz:
  replicaCount: 2   # Combines frontend + query service
Component Architecture
Core Components
- SigNoz (unified component)
  - Frontend UI + Query Service
  - Ports: 8080 (HTTP/API), 8085 (internal gRPC)
  - Dev: 1 replica, Prod: 2+ replicas with HPA
- ClickHouse (time-series database)
  - Version: 25.5.6
  - Stores traces, metrics, and logs
  - Dev: 1 replica, Prod: 2 replicas with cold storage
- Zookeeper (ClickHouse coordination)
  - Version: 3.7.1
  - Dev: 1 replica, Prod: 3 replicas (critical for HA)
- OpenTelemetry Collector (data ingestion)
  - Version: v0.129.12
  - Ports: 4317 (gRPC), 4318 (HTTP), 8888 (metrics)
  - Dev: 1 replica, Prod: 2+ replicas with HPA
- Alertmanager (alert management)
  - Version: 0.23.5
  - Email and Slack integrations configured
  - Port: 9093
Performance Optimizations
Batch Processing
- Development: 10,000 events per batch
- Production: 50,000 events per batch (official recommendation)
- Timeout: 1 second for faster processing
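The batch sizes above map to the collector's batch processor. A minimal sketch using the same otelCollector.config override pattern shown in the Debug Mode section (dev values shown):
otelCollector:
  config:
    processors:
      batch:
        send_batch_size: 10000   # 50000 in production
        timeout: 1s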
Memory Management
- Memory limiter processor prevents OOM
- Dev: 400 MiB limit, Prod: 1500 MiB limit
- Spike limits configured
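A sketch of the memory limiter processor with the dev limit; the check interval and spike limit shown here are illustrative, not taken from the values files:
otelCollector:
  config:
    processors:
      memory_limiter:
        check_interval: 1s
        limit_mib: 400          # 1500 in production
        spike_limit_mib: 100    # illustrative value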
Span Metrics Processor
Automatically generates RED metrics (Rate, Errors, Duration):
- Latency histogram buckets optimized for microservices
- Cache size: 10K (dev), 100K (prod)
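SigNoz ships its own span-metrics processor in its collector distribution, so the exact processor name and keys should be taken from the generated collector config. As an assumption based on the upstream spanmetrics processor options, the relevant settings would look roughly like this (processor name and bucket values are illustrative):
otelCollector:
  config:
    processors:
      signozspanmetrics/delta:          # assumed processor name
        dimensions_cache_size: 10000    # 100000 in production
        latency_histogram_buckets: [2ms, 10ms, 50ms, 100ms, 500ms, 1s, 5s]   # illustrative buckets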
Cold Storage (Production Only)
- Enabled with 30-day TTL
- Automatically moves old data to cold storage
- Keeps 10GB free on primary storage
OpenTelemetry Endpoints
From Within Kubernetes Cluster
Development:
OTLP gRPC: signoz-otel-collector.bakery-ia.svc.cluster.local:4317
OTLP HTTP: signoz-otel-collector.bakery-ia.svc.cluster.local:4318
Production:
OTLP gRPC: signoz-otel-collector.bakery-ia.svc.cluster.local:4317
OTLP HTTP: signoz-otel-collector.bakery-ia.svc.cluster.local:4318
Application Configuration Example
# Python with OpenTelemetry
OTEL_EXPORTER_OTLP_ENDPOINT: "http://signoz-otel-collector.bakery-ia.svc.cluster.local:4318"
OTEL_EXPORTER_OTLP_PROTOCOL: "http/protobuf"
// Node.js with OpenTelemetry
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

const exporter = new OTLPTraceExporter({
  url: 'http://signoz-otel-collector.bakery-ia.svc.cluster.local:4318/v1/traces',
});
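In Kubernetes, the same settings are typically injected as environment variables on the workload. A minimal sketch for a Deployment container spec (the service name is a placeholder):
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://signoz-otel-collector.bakery-ia.svc.cluster.local:4318"
  - name: OTEL_EXPORTER_OTLP_PROTOCOL
    value: "http/protobuf"
  - name: OTEL_SERVICE_NAME
    value: "orders-api"   # placeholder service name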
Deployment Scripts
deploy-signoz.sh
Comprehensive deployment script with features:
# Usage
./deploy-signoz.sh [OPTIONS] ENVIRONMENT
# Options
-h, --help Show help message
-d, --dry-run Show what would be deployed
-u, --upgrade Upgrade existing deployment
-r, --remove Remove deployment
-n, --namespace NS Custom namespace (default: signoz)
# Examples
./deploy-signoz.sh dev # Deploy to dev
./deploy-signoz.sh --upgrade prod # Upgrade prod
./deploy-signoz.sh --dry-run prod # Preview changes
./deploy-signoz.sh --remove dev # Remove dev deployment
Features:
- Automatic Helm repository setup
- Docker Hub secret creation
- Namespace management
- Deployment verification
- 15-minute timeout with the --wait flag
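Under the hood the script wraps a standard Helm workflow. A rough bash equivalent of a dev deployment, assuming the release name signoz and the signoz namespace used elsewhere in this README (the script additionally handles secrets, namespace creation, and verification):
# Roughly what ./deploy-signoz.sh dev performs
helm repo add signoz https://charts.signoz.io
helm repo update
helm upgrade --install signoz signoz/signoz \
  --namespace signoz --create-namespace \
  --values signoz-values-dev.yaml \
  --wait --timeout 15m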
verify-signoz.sh
Verification script to check deployment health:
# Usage
./verify-signoz.sh [OPTIONS] ENVIRONMENT
# Examples
./verify-signoz.sh dev # Verify dev deployment
./verify-signoz.sh prod # Verify prod deployment
Checks performed:
- ✅ Helm release status
- ✅ Pod health and readiness
- ✅ Service availability
- ✅ Ingress configuration
- ✅ PVC status
- ✅ Resource usage (if metrics-server available)
- ✅ Log errors
- ✅ Environment-specific validations
  - Dev: single replica, resource limits
  - Prod: HA config, TLS, Zookeeper replicas, HPA
Storage Configuration
Development (Kind)
global:
  storageClass: "standard" # Kind's default provisioner
Production (MicroK8s)
global:
  storageClass: "microk8s-hostpath" # Or a custom storage class
Storage Requirements:
- Development: ~35 GiB total
  - SigNoz: 5 GiB
  - ClickHouse: 20 GiB
  - Zookeeper: 5 GiB
  - Alertmanager: 2 GiB
- Production: ~135 GiB total
  - SigNoz: 20 GiB
  - ClickHouse: 100 GiB
  - Zookeeper: 10 GiB
  - Alertmanager: 5 GiB
Resource Requirements
Development Environment
Minimum:
- CPU: 550m (0.55 cores)
- Memory: 1.6 GiB
- Storage: 35 GiB
Recommended:
- CPU: 3 cores
- Memory: 3 GiB
- Storage: 50 GiB
Production Environment
Minimum:
- CPU: 3.5 cores
- Memory: 8 GiB
- Storage: 135 GiB
Recommended:
- CPU: 12 cores
- Memory: 20 GiB
- Storage: 200 GiB
Data Retention
Development
- Traces: 7 days (168 hours)
- Metrics: 7 days (168 hours)
- Logs: 7 days (168 hours)
Production
- Traces: 30 days (720 hours)
- Metrics: 30 days (720 hours)
- Logs: 30 days (720 hours)
- Cold storage after 30 days
To modify retention, update the environment variables:
signoz:
  env:
    signoz_traces_ttl_duration_hrs: "720"   # 30 days
    signoz_metrics_ttl_duration_hrs: "720"  # 30 days
    signoz_logs_ttl_duration_hrs: "168"     # 7 days
High Availability (Production)
Replication Strategy
signoz: 2 replicas + HPA (min: 2, max: 5)
clickhouse: 2 replicas
zookeeper: 3 replicas (critical!)
otelCollector: 2 replicas + HPA (min: 2, max: 10)
alertmanager: 2 replicas
Pod Anti-Affinity
Encourages the scheduler to spread pods across different nodes (a preference, not a hard requirement, since preferredDuringSchedulingIgnoredDuringExecution is used):
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app.kubernetes.io/component: query-service
          topologyKey: kubernetes.io/hostname
Pod Disruption Budgets
Configured for all critical components:
podDisruptionBudget:
  enabled: true
  minAvailable: 1
Monitoring and Alerting
Email Alerts (Production)
Configure SMTP in production values:
signoz:
  env:
    signoz_smtp_enabled: "true"
    signoz_smtp_host: "smtp.gmail.com"
    signoz_smtp_port: "587"
    signoz_smtp_from: "alerts@bakewise.ai"
    signoz_smtp_username: "alerts@bakewise.ai"
    # Set via secret: signoz_smtp_password
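The SMTP password should not live in the values file. As a sketch, it can be kept in a Kubernetes secret; the secret and key names below are placeholders, and how the chart consumes the secret depends on its secret/extra-env support:
kubectl create secret generic signoz-smtp \
  --namespace signoz \
  --from-literal=signoz_smtp_password='your-smtp-password'   # placeholder key and value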
Slack Alerts (Production)
Configure webhook in Alertmanager:
alertmanager:
  config:
    receivers:
      - name: 'critical-alerts'
        slack_configs:
          - api_url: '${SLACK_WEBHOOK_URL}'
            channel: '#alerts-critical'
Self-Monitoring
SigNoz monitors itself:
selfMonitoring:
  enabled: true
  serviceMonitor:
    enabled: true # Prod only
    interval: 30s
Troubleshooting
Common Issues
1. Pods not starting
# Check pod status
kubectl get pods -n signoz
# Check pod logs
kubectl logs -n signoz <pod-name>
# Describe pod for events
kubectl describe pod -n signoz <pod-name>
2. Docker Hub rate limits
# Verify secret exists
kubectl get secret dockerhub-creds -n signoz
# Recreate secret
kubectl delete secret dockerhub-creds -n signoz
export DOCKERHUB_USERNAME='your-username'
export DOCKERHUB_PASSWORD='your-token'
./deploy-signoz.sh dev
3. ClickHouse connection issues
# Check ClickHouse pod
kubectl logs -n signoz -l app.kubernetes.io/component=clickhouse
# Check Zookeeper (required by ClickHouse)
kubectl logs -n signoz -l app.kubernetes.io/component=zookeeper
4. OTel Collector not receiving data
# Check OTel Collector logs
kubectl logs -n signoz -l app.kubernetes.io/component=otel-collector
# Test connectivity
kubectl port-forward -n signoz svc/signoz-otel-collector 4318:4318
curl -v http://localhost:4318/v1/traces
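The plain curl above only checks reachability (a GET is not a valid OTLP request). To confirm the HTTP receiver actually accepts OTLP/JSON, an empty payload can be posted; a 200 response with an empty JSON body indicates the receiver is up:
# Post an empty OTLP/JSON payload to verify the traces receiver responds
curl -s -X POST http://localhost:4318/v1/traces \
  -H "Content-Type: application/json" \
  -d '{"resourceSpans":[]}'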
5. Insufficient storage
# Check PVC status
kubectl get pvc -n signoz
# Check storage usage (if metrics-server available)
kubectl top pods -n signoz
Debug Mode
Enable debug exporter in OTel Collector:
otelCollector:
  config:
    exporters:
      debug:
        verbosity: detailed
        sampling_initial: 5
        sampling_thereafter: 200
    service:
      pipelines:
        traces:
          exporters: [clickhousetraces, debug] # Add debug
Upgrade from Old Version
If upgrading from pre-v0.89.0:
# 1. Backup data (recommended)
kubectl get all -n signoz -o yaml > signoz-backup.yaml
# 2. Remove old deployment
./deploy-signoz.sh --remove prod
# 3. Deploy new version
./deploy-signoz.sh prod
# 4. Verify
./verify-signoz.sh prod
Security Best Practices
- Change default password immediately after first login
- Use TLS/SSL in production (configured with cert-manager)
- Network policies enabled in production
- Run as non-root (configured in securityContext)
- RBAC with dedicated service account
- Secrets management for sensitive data (SMTP, Slack webhooks)
- Image pull secrets to avoid exposing Docker Hub credentials
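As an illustration of the network-policy point above (not a policy shipped with these values files), a minimal NetworkPolicy restricting ingress to the collector's OTLP ports might look like the sketch below; the policy name, pod labels, and source namespace are assumptions:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-otlp-ingress              # illustrative name
  namespace: signoz
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/component: otel-collector   # assumed pod label
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: bakery-ia   # assumed application namespace
      ports:
        - protocol: TCP
          port: 4317
        - protocol: TCP
          port: 4318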
Backup and Recovery
Backup ClickHouse Data
# Export ClickHouse data
kubectl exec -n signoz <clickhouse-pod> -- clickhouse-client \
--query="BACKUP DATABASE signoz_traces TO Disk('backups', 'traces_backup.zip')"
# Copy backup out
kubectl cp signoz/<clickhouse-pod>:/var/lib/clickhouse/backups/ ./backups/
Restore from Backup
# Copy backup in
kubectl cp ./backups/ signoz/<clickhouse-pod>:/var/lib/clickhouse/backups/
# Restore
kubectl exec -n signoz <clickhouse-pod> -- clickhouse-client \
--query="RESTORE DATABASE signoz_traces FROM Disk('backups', 'traces_backup.zip')"
Updating Configuration
To update SigNoz configuration:
1. Edit the values file: signoz-values-{env}.yaml
2. Apply the changes: ./deploy-signoz.sh --upgrade {env}
3. Verify: ./verify-signoz.sh {env}
Uninstallation
# Remove SigNoz deployment
./deploy-signoz.sh --remove {env}
# Optionally delete PVCs (WARNING: deletes all data)
kubectl delete pvc -n signoz -l app.kubernetes.io/instance=signoz
# Optionally delete namespace
kubectl delete namespace signoz
References
- SigNoz Official Documentation
- SigNoz Helm Charts Repository
- OpenTelemetry Documentation
- ClickHouse Documentation
Support
For issues or questions:
- Check SigNoz GitHub Issues
- Review deployment logs: kubectl logs -n signoz <pod-name>
- Run the verification script: ./verify-signoz.sh {env}
- Check the SigNoz Community Slack
Last Updated: 2026-01-09
SigNoz Helm Chart Version: Latest (v0.129.12 components)
Maintained by: Bakery IA Team