SigNoz Helm Deployment for Bakery IA

This directory contains the Helm configuration and deployment scripts for the SigNoz observability platform.

Overview

SigNoz is deployed using the official Helm chart with environment-specific configurations optimized for:

  • Development: Colima + Kind (Kubernetes in Docker) with Tilt
  • Production: VPS on clouding.io with MicroK8s

Prerequisites

Required Tools

  • kubectl 1.22+
  • Helm 3.8+
  • Docker (for development)
  • Kind/MicroK8s (environment-specific)

Docker Hub Authentication

SigNoz uses images from Docker Hub. Set up authentication to avoid rate limits:

# Option 1: Environment variables (recommended)
export DOCKERHUB_USERNAME='your-username'
export DOCKERHUB_PASSWORD='your-personal-access-token'

# Option 2: Docker login
docker login
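
If you need to create the image pull secret by hand (the deploy script normally does this for you; the name dockerhub-creds matches the secret referenced under Troubleshooting):

kubectl create secret docker-registry dockerhub-creds \
  --docker-username="$DOCKERHUB_USERNAME" \
  --docker-password="$DOCKERHUB_PASSWORD" \
  -n signoz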

Quick Start

Development Deployment

# Deploy SigNoz to development environment
./deploy-signoz.sh dev

# Verify deployment
./verify-signoz.sh dev

# Access SigNoz UI
# Via ingress: http://monitoring.bakery-ia.local
# Or port-forward:
kubectl port-forward -n signoz svc/signoz 8080:8080
# Then open: http://localhost:8080
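
If monitoring.bakery-ia.local does not resolve on your machine, add a hosts entry pointing at your ingress address (127.0.0.1 below is an assumption that fits the common Kind setup where the ingress is published on localhost):

echo "127.0.0.1 monitoring.bakery-ia.local" | sudo tee -a /etc/hosts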

Production Deployment

# Deploy SigNoz to production environment
./deploy-signoz.sh prod

# Verify deployment
./verify-signoz.sh prod

# Access SigNoz UI
# https://monitoring.bakewise.ai

Configuration Files

signoz-values-dev.yaml

Development environment configuration with:

  • Single replica for most components
  • Reduced resource requests (optimized for local Kind cluster)
  • 7-day data retention
  • Batch size: 10,000 events
  • ClickHouse 25.5.6, OTel Collector v0.129.12
  • PostgreSQL, Redis, and RabbitMQ receivers configured

signoz-values-prod.yaml

Production environment configuration with:

  • High availability: 2+ replicas for critical components
  • 3 Zookeeper replicas (required for production)
  • 30-day data retention
  • Batch size: 50,000 events (high-performance)
  • Cold storage enabled with 30-day TTL
  • Horizontal Pod Autoscaler (HPA) enabled
  • TLS/SSL with cert-manager
  • Enhanced security with pod anti-affinity rules

Key Configuration Changes (v0.89.0+)

⚠️ BREAKING CHANGE: SigNoz Helm chart v0.89.0+ uses a unified component structure.

Old Structure (deprecated):

frontend:
  replicaCount: 2
queryService:
  replicaCount: 2

New Structure (current):

signoz:
  replicaCount: 2
  # Combines frontend + query service
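
To confirm which structure your installed chart version expects, inspect the chart's default values (this assumes the standard SigNoz chart repository at charts.signoz.io):

helm repo add signoz https://charts.signoz.io
helm repo update
helm show values signoz/signoz | grep -B 1 -A 2 replicaCount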

Component Architecture

Core Components

  1. SigNoz (unified component)

    • Frontend UI + Query Service
    • Port 8080 (HTTP/API), 8085 (internal gRPC)
    • Dev: 1 replica, Prod: 2+ replicas with HPA
  2. ClickHouse (Time-series database)

    • Version: 25.5.6
    • Stores traces, metrics, and logs
    • Dev: 1 replica, Prod: 2 replicas with cold storage
  3. Zookeeper (ClickHouse coordination)

    • Version: 3.7.1
    • Dev: 1 replica, Prod: 3 replicas (critical for HA)
  4. OpenTelemetry Collector (Data ingestion)

    • Version: v0.129.12
    • Ports: 4317 (gRPC), 4318 (HTTP), 8888 (metrics)
    • Dev: 1 replica, Prod: 2+ replicas with HPA
  5. Alertmanager (Alert management)

    • Version: 0.23.5
    • Email and Slack integrations configured
    • Port: 9093
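
A quick way to see these components in a running cluster, grouped by the app.kubernetes.io/component label used throughout this README:

kubectl get pods -n signoz -L app.kubernetes.io/component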

Performance Optimizations

Batch Processing

  • Development: 10,000 events per batch
  • Production: 50,000 events per batch (official recommendation)
  • Timeout: 1 second for faster processing

Memory Management

  • Memory limiter processor prevents OOM
  • Dev: 400 MiB limit, Prod: 1500 MiB limit
  • Spike limits configured

Span Metrics Processor

Automatically generates RED metrics (Rate, Errors, Duration):

  • Latency histogram buckets optimized for microservices
  • Cache size: 10K (dev), 100K (prod)

Cold Storage (Production Only)

  • Enabled with 30-day TTL
  • Automatically moves old data to cold storage
  • Keeps 10GB free on primary storage
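
To check free space per disk from inside ClickHouse (a sketch; system.disks is a standard ClickHouse system table, and <clickhouse-pod> is a placeholder as elsewhere in this README):

kubectl exec -n signoz <clickhouse-pod> -- clickhouse-client \
  --query="SELECT name, path, formatReadableSize(free_space) AS free FROM system.disks"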

OpenTelemetry Endpoints

From Within Kubernetes Cluster

Both environments use the same in-cluster service endpoints:

OTLP gRPC: signoz-otel-collector.bakery-ia.svc.cluster.local:4317
OTLP HTTP: signoz-otel-collector.bakery-ia.svc.cluster.local:4318
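
To verify reachability from an application namespace, a throwaway curl pod works; an empty OTLP/HTTP POST should return a JSON body rather than a connection error (a sketch, not part of the deployment scripts):

kubectl run otlp-test --rm -it --restart=Never --image=curlimages/curl -- \
  curl -s -X POST http://signoz-otel-collector.bakery-ia.svc.cluster.local:4318/v1/traces \
  -H 'Content-Type: application/json' -d '{}'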

Application Configuration Example

# Python with OpenTelemetry (environment variables)
OTEL_EXPORTER_OTLP_ENDPOINT: "http://signoz-otel-collector.bakery-ia.svc.cluster.local:4318"
OTEL_EXPORTER_OTLP_PROTOCOL: "http/protobuf"

// Node.js with OpenTelemetry
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

const exporter = new OTLPTraceExporter({
  url: 'http://signoz-otel-collector.bakery-ia.svc.cluster.local:4318/v1/traces',
});

Deployment Scripts

deploy-signoz.sh

Comprehensive deployment script with features:

# Usage
./deploy-signoz.sh [OPTIONS] ENVIRONMENT

# Options
-h, --help             Show help message
-d, --dry-run          Show what would be deployed
-u, --upgrade          Upgrade existing deployment
-r, --remove           Remove deployment
-n, --namespace NS     Custom namespace (default: signoz)

# Examples
./deploy-signoz.sh dev                    # Deploy to dev
./deploy-signoz.sh --upgrade prod         # Upgrade prod
./deploy-signoz.sh --dry-run prod         # Preview changes
./deploy-signoz.sh --remove dev           # Remove dev deployment

Features:

  • Automatic Helm repository setup
  • Docker Hub secret creation
  • Namespace management
  • Deployment verification
  • 15-minute timeout with --wait flag
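
Under the hood the script wraps a standard Helm release; the rough manual equivalent, inferred from the features above (the --wait and --timeout flags mirror the script's 15-minute timeout), is:

helm upgrade --install signoz signoz/signoz \
  --namespace signoz --create-namespace \
  -f signoz-values-dev.yaml \
  --wait --timeout 15m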

verify-signoz.sh

Verification script to check deployment health:

# Usage
./verify-signoz.sh [OPTIONS] ENVIRONMENT

# Examples
./verify-signoz.sh dev                    # Verify dev deployment
./verify-signoz.sh prod                   # Verify prod deployment

Checks performed:

  1. Helm release status
  2. Pod health and readiness
  3. Service availability
  4. Ingress configuration
  5. PVC status
  6. Resource usage (if metrics-server available)
  7. Log errors
  8. Environment-specific validations
    • Dev: Single replica, resource limits
    • Prod: HA config, TLS, Zookeeper replicas, HPA
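
The same checks can be run by hand when the script is unavailable, using standard kubectl and Helm commands:

helm status signoz -n signoz
kubectl get pods,svc,ingress,pvc -n signoz
kubectl top pods -n signoz   # requires metrics-server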

Storage Configuration

Development (Kind)

global:
  storageClass: "standard"  # Kind's default provisioner

Production (MicroK8s)

global:
  storageClass: "microk8s-hostpath"  # Or custom storage class

Storage Requirements:

  • Development: ~35 GiB total

    • SigNoz: 5 GiB
    • ClickHouse: 20 GiB
    • Zookeeper: 5 GiB
    • Alertmanager: 2 GiB
  • Production: ~135 GiB total

    • SigNoz: 20 GiB
    • ClickHouse: 100 GiB
    • Zookeeper: 10 GiB
    • Alertmanager: 5 GiB

Resource Requirements

Development Environment

Minimum:

  • CPU: 550m (0.55 cores)
  • Memory: 1.6 GiB
  • Storage: 35 GiB

Recommended:

  • CPU: 3 cores
  • Memory: 3 GiB
  • Storage: 50 GiB

Production Environment

Minimum:

  • CPU: 3.5 cores
  • Memory: 8 GiB
  • Storage: 135 GiB

Recommended:

  • CPU: 12 cores
  • Memory: 20 GiB
  • Storage: 200 GiB

Data Retention

Development

  • Traces: 7 days (168 hours)
  • Metrics: 7 days (168 hours)
  • Logs: 7 days (168 hours)

Production

  • Traces: 30 days (720 hours)
  • Metrics: 30 days (720 hours)
  • Logs: 30 days (720 hours)
  • Cold storage after 30 days

To modify retention, update the environment variables:

signoz:
  env:
    signoz_traces_ttl_duration_hrs: "720"   # 30 days
    signoz_metrics_ttl_duration_hrs: "720"  # 30 days
    signoz_logs_ttl_duration_hrs: "168"     # 7 days
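
After upgrading, you can confirm the TTLs took effect by inspecting the table definitions in ClickHouse (a sketch; the TTL clause appears in the engine_full column, and the signoz_ database prefix matches the schemas referenced under Backup and Recovery):

kubectl exec -n signoz <clickhouse-pod> -- clickhouse-client \
  --query="SELECT database, name, engine_full FROM system.tables WHERE database LIKE 'signoz_%'"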

High Availability (Production)

Replication Strategy

signoz: 2 replicas + HPA (min: 2, max: 5)
clickhouse: 2 replicas
zookeeper: 3 replicas (critical!)
otelCollector: 2 replicas + HPA (min: 2, max: 10)
alertmanager: 2 replicas

Pod Anti-Affinity

Ensures pods are distributed across different nodes:

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app.kubernetes.io/component: query-service
          topologyKey: kubernetes.io/hostname

Pod Disruption Budgets

Configured for all critical components:

podDisruptionBudget:
  enabled: true
  minAvailable: 1
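
To confirm both the disruption budgets and the autoscalers exist after a production deploy:

kubectl get pdb,hpa -n signoz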

Monitoring and Alerting

Email Alerts (Production)

Configure SMTP in production values:

signoz:
  env:
    signoz_smtp_enabled: "true"
    signoz_smtp_host: "smtp.gmail.com"
    signoz_smtp_port: "587"
    signoz_smtp_from: "alerts@bakewise.ai"
    signoz_smtp_username: "alerts@bakewise.ai"
    # Set via secret: signoz_smtp_password

Slack Alerts (Production)

Configure webhook in Alertmanager:

alertmanager:
  config:
    receivers:
      - name: 'critical-alerts'
        slack_configs:
          - api_url: '${SLACK_WEBHOOK_URL}'
            channel: '#alerts-critical'

Self-Monitoring

SigNoz monitors itself:

selfMonitoring:
  enabled: true
  serviceMonitor:
    enabled: true  # Prod only
    interval: 30s

Troubleshooting

Common Issues

1. Pods not starting

# Check pod status
kubectl get pods -n signoz

# Check pod logs
kubectl logs -n signoz <pod-name>

# Describe pod for events
kubectl describe pod -n signoz <pod-name>

2. Docker Hub rate limits

# Verify secret exists
kubectl get secret dockerhub-creds -n signoz

# Recreate secret
kubectl delete secret dockerhub-creds -n signoz
export DOCKERHUB_USERNAME='your-username'
export DOCKERHUB_PASSWORD='your-token'
./deploy-signoz.sh dev

3. ClickHouse connection issues

# Check ClickHouse pod
kubectl logs -n signoz -l app.kubernetes.io/component=clickhouse

# Check Zookeeper (required by ClickHouse)
kubectl logs -n signoz -l app.kubernetes.io/component=zookeeper

4. OTel Collector not receiving data

# Check OTel Collector logs
kubectl logs -n signoz -l app.kubernetes.io/component=otel-collector

# Test connectivity (an empty OTLP/HTTP POST should return a JSON response)
kubectl port-forward -n signoz svc/signoz-otel-collector 4318:4318
curl -s -X POST http://localhost:4318/v1/traces -H 'Content-Type: application/json' -d '{}'

5. Insufficient storage

# Check PVC status
kubectl get pvc -n signoz

# Check storage usage (if metrics-server available)
kubectl top pods -n signoz

Debug Mode

Enable debug exporter in OTel Collector:

otelCollector:
  config:
    exporters:
      debug:
        verbosity: detailed
        sampling_initial: 5
        sampling_thereafter: 200
    service:
      pipelines:
        traces:
          exporters: [clickhousetraces, debug]  # Add debug
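
After redeploying with the debug exporter enabled, exported spans show up in the collector logs:

kubectl logs -n signoz -l app.kubernetes.io/component=otel-collector --tail=100 -f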

Upgrade from Old Version

If upgrading from pre-v0.89.0:

# 1. Back up Kubernetes manifests (for ClickHouse data itself, see Backup and Recovery below)
kubectl get all -n signoz -o yaml > signoz-backup.yaml

# 2. Remove old deployment
./deploy-signoz.sh --remove prod

# 3. Deploy new version
./deploy-signoz.sh prod

# 4. Verify
./verify-signoz.sh prod

Security Best Practices

  1. Change default password immediately after first login
  2. Use TLS/SSL in production (configured with cert-manager)
  3. Network policies enabled in production
  4. Run as non-root (configured in securityContext)
  5. RBAC with dedicated service account
  6. Secrets management for sensitive data (SMTP, Slack webhooks)
  7. Image pull secrets to avoid exposing Docker Hub credentials

Backup and Recovery

Backup ClickHouse Data

# Export ClickHouse data
kubectl exec -n signoz <clickhouse-pod> -- clickhouse-client \
  --query="BACKUP DATABASE signoz_traces TO Disk('backups', 'traces_backup.zip')"

# Copy backup out
kubectl cp signoz/<clickhouse-pod>:/var/lib/clickhouse/backups/ ./backups/

Restore from Backup

# Copy backup in
kubectl cp ./backups/ signoz/<clickhouse-pod>:/var/lib/clickhouse/backups/

# Restore
kubectl exec -n signoz <clickhouse-pod> -- clickhouse-client \
  --query="RESTORE DATABASE signoz_traces FROM Disk('backups', 'traces_backup.zip')"

Updating Configuration

To update SigNoz configuration:

  1. Edit the values file: signoz-values-{env}.yaml
  2. Apply the changes: ./deploy-signoz.sh --upgrade {env}
  3. Verify: ./verify-signoz.sh {env}

Uninstallation

# Remove SigNoz deployment
./deploy-signoz.sh --remove {env}

# Optionally delete PVCs (WARNING: deletes all data)
kubectl delete pvc -n signoz -l app.kubernetes.io/instance=signoz

# Optionally delete namespace
kubectl delete namespace signoz

Support

For issues or questions:

  1. Check SigNoz GitHub Issues
  2. Review deployment logs: kubectl logs -n signoz <pod-name>
  3. Run verification script: ./verify-signoz.sh {env}
  4. Check SigNoz Community Slack

Last Updated: 2026-01-09
SigNoz Helm Chart Version: Latest (v0.129.12 components)
Maintained by: Bakery IA Team