# SigNoz Helm Deployment for Bakery IA

This directory contains the Helm configuration and deployment scripts for the SigNoz observability platform.

## Overview

SigNoz is deployed using the official Helm chart, with environment-specific configurations optimized for:

- **Development**: Colima + Kind (Kubernetes in Docker) with Tilt
- **Production**: VPS on clouding.io with MicroK8s

## Prerequisites

### Required Tools

- **kubectl** 1.22+
- **Helm** 3.8+
- **Docker** (for development)
- **Kind/MicroK8s** (environment-specific)

### Docker Hub Authentication

SigNoz pulls its images from Docker Hub. Set up authentication to avoid anonymous pull rate limits:

```bash
# Option 1: Environment variables (recommended)
export DOCKERHUB_USERNAME='your-username'
export DOCKERHUB_PASSWORD='your-personal-access-token'

# Option 2: Docker login
docker login
```

## Quick Start

### Development Deployment

```bash
# Deploy SigNoz to the development environment
./deploy-signoz.sh dev

# Verify the deployment
./verify-signoz.sh dev

# Access the SigNoz UI
# Via ingress: http://monitoring.bakery-ia.local
# Or via port-forward:
kubectl port-forward -n signoz svc/signoz 8080:8080
# Then open: http://localhost:8080
```

### Production Deployment

```bash
# Deploy SigNoz to the production environment
./deploy-signoz.sh prod

# Verify the deployment
./verify-signoz.sh prod

# Access the SigNoz UI
# https://monitoring.bakewise.ai
```

## Configuration Files

### signoz-values-dev.yaml

Development environment configuration:

- Single replica for most components
- Reduced resource requests (sized for a local Kind cluster)
- 7-day data retention
- Batch size: 10,000 events
- ClickHouse 25.5.6, OTel Collector v0.129.12
- PostgreSQL, Redis, and RabbitMQ receivers configured

### signoz-values-prod.yaml

Production environment configuration:

- High availability: 2+ replicas for critical components
- 3 Zookeeper replicas (required for production)
- 30-day data retention
- Batch size: 50,000 events (high throughput)
- Cold storage enabled with 30-day TTL
- Horizontal Pod Autoscaler (HPA) enabled
- TLS/SSL with cert-manager
- Pod anti-affinity rules to spread replicas across nodes

## Key Configuration Changes (v0.89.0+)

⚠️ **BREAKING CHANGE**: SigNoz Helm chart v0.89.0+ uses a unified component structure.

**Old structure (deprecated):**

```yaml
frontend:
  replicaCount: 2
queryService:
  replicaCount: 2
```

**New structure (current):**

```yaml
signoz:
  replicaCount: 2  # Combines frontend + query service
```

## Component Architecture

### Core Components

1. **SigNoz** (unified component)
   - Frontend UI + Query Service
   - Ports: 8080 (HTTP/API), 8085 (internal gRPC)
   - Dev: 1 replica; Prod: 2+ replicas with HPA
2. **ClickHouse** (columnar database used as the telemetry store)
   - Version: 25.5.6
   - Stores traces, metrics, and logs
   - Dev: 1 replica; Prod: 2 replicas with cold storage
3. **Zookeeper** (ClickHouse coordination)
   - Version: 3.7.1
   - Dev: 1 replica; Prod: 3 replicas (critical for HA)
4. **OpenTelemetry Collector** (data ingestion)
   - Version: v0.129.12
   - Ports: 4317 (gRPC), 4318 (HTTP), 8888 (metrics)
   - Dev: 1 replica; Prod: 2+ replicas with HPA
5. **Alertmanager** (alert management)
   - Version: 0.23.5
   - Email and Slack integrations configured
   - Port: 9093

## Performance Optimizations

### Batch Processing

- **Development**: 10,000 events per batch
- **Production**: 50,000 events per batch (official recommendation)
- Timeout: 1 second (a batch is sent after this delay even if it is not full)

### Memory Management

- The memory limiter processor helps prevent out-of-memory (OOM) kills
- Dev: 400 MiB limit; Prod: 1500 MiB limit
- Spike limits configured

### Span Metrics Processor

Automatically generates RED metrics (Rate, Errors, Duration) from spans:

- Latency histogram buckets tuned for microservices
- Cache size: 10K (dev), 100K (prod)

### Cold Storage (Production Only)

- Enabled with a 30-day TTL
- Automatically moves older data to cold storage
- Keeps 10 GB free on primary storage

## OpenTelemetry Endpoints

### From Within the Kubernetes Cluster

Development and production use the same in-cluster endpoints:

```
OTLP gRPC: signoz-otel-collector.bakery-ia.svc.cluster.local:4317
OTLP HTTP: signoz-otel-collector.bakery-ia.svc.cluster.local:4318
```

These hostnames use the `bakery-ia` namespace. `deploy-signoz.sh` defaults to the `signoz` namespace, so if SigNoz was installed there, use `signoz-otel-collector.signoz.svc.cluster.local` instead.

### Application Configuration Example

```yaml
# Standard OpenTelemetry SDK environment variables (e.g. for Python services)
OTEL_EXPORTER_OTLP_ENDPOINT: "http://signoz-otel-collector.bakery-ia.svc.cluster.local:4318"
OTEL_EXPORTER_OTLP_PROTOCOL: "http/protobuf"
```

```javascript
// Node.js with OpenTelemetry (package: @opentelemetry/exporter-trace-otlp-http)
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

// An explicit per-signal URL must include the /v1/traces path
const exporter = new OTLPTraceExporter({
  url: 'http://signoz-otel-collector.bakery-ia.svc.cluster.local:4318/v1/traces',
});
```

## Deployment Scripts

### deploy-signoz.sh

Deployment script covering install, upgrade, dry run, and removal:

```bash
# Usage
./deploy-signoz.sh [OPTIONS] ENVIRONMENT

# Options
#   -h, --help            Show help message
#   -d, --dry-run         Show what would be deployed
#   -u, --upgrade         Upgrade existing deployment
#   -r, --remove          Remove deployment
#   -n, --namespace NS    Custom namespace (default: signoz)

# Examples
./deploy-signoz.sh dev                # Deploy to dev
./deploy-signoz.sh --upgrade prod     # Upgrade prod
./deploy-signoz.sh --dry-run prod     # Preview changes
./deploy-signoz.sh --remove dev       # Remove dev deployment
```

**Features:**

- Automatic Helm repository setup
- Docker Hub secret creation
- Namespace management
- Deployment verification
- 15-minute timeout with the `--wait` flag

### verify-signoz.sh

Verification script that checks deployment health:

```bash
# Usage
./verify-signoz.sh [OPTIONS] ENVIRONMENT

# Examples
./verify-signoz.sh dev    # Verify dev deployment
./verify-signoz.sh prod   # Verify prod deployment
```

**Checks performed:**

1. ✅ Helm release status
2. ✅ Pod health and readiness
3. ✅ Service availability
4. ✅ Ingress configuration
5. ✅ PVC status
6. ✅ Resource usage (if metrics-server is available)
7. ✅ Log errors
8. ✅ Environment-specific validations
   - Dev: single replica, resource limits
   - Prod: HA configuration, TLS, Zookeeper replicas, HPA

## Storage Configuration

### Development (Kind)

```yaml
global:
  storageClass: "standard"  # Kind's default storage class
```

### Production (MicroK8s)

```yaml
global:
  storageClass: "microk8s-hostpath"  # Or a custom storage class
```

**Storage Requirements:**

- **Development**: ~35 GiB total (the volumes below add up to 32 GiB)
  - SigNoz: 5 GiB
  - ClickHouse: 20 GiB
  - Zookeeper: 5 GiB
  - Alertmanager: 2 GiB
- **Production**: ~135 GiB total
  - SigNoz: 20 GiB
  - ClickHouse: 100 GiB
  - Zookeeper: 10 GiB
  - Alertmanager: 5 GiB

## Resource Requirements

### Development Environment

**Minimum:**

- CPU: 550m (0.55 cores)
- Memory: 1.6 GiB
- Storage: 35 GiB

**Recommended:**

- CPU: 3 cores
- Memory: 3 GiB
- Storage: 50 GiB

### Production Environment

**Minimum:**

- CPU: 3.5 cores
- Memory: 8 GiB
- Storage: 135 GiB

**Recommended:**

- CPU: 12 cores
- Memory: 20 GiB
- Storage: 200 GiB

## Data Retention

### Development

- Traces: 7 days (168 hours)
- Metrics: 7 days (168 hours)
- Logs: 7 days (168 hours)

### Production

- Traces: 30 days (720 hours)
- Metrics: 30 days (720 hours)
- Logs: 30 days (720 hours)
- Cold storage after 30 days

To modify retention, update the TTL environment variables (values are in hours):

```yaml
signoz:
  env:
    signoz_traces_ttl_duration_hrs: "720"   # 30 days
    signoz_metrics_ttl_duration_hrs: "720"  # 30 days
    signoz_logs_ttl_duration_hrs: "168"     # 7 days (example of a shorter log retention)
```

## High Availability (Production)

### Replication Strategy

- **signoz**: 2 replicas + HPA (min 2, max 5)
- **clickhouse**: 2 replicas
- **zookeeper**: 3 replicas (critical!)
- **otelCollector**: 2 replicas + HPA (min 2, max 10)
- **alertmanager**: 2 replicas

### Pod Anti-Affinity

Prefers to schedule replicas on different nodes. This is a soft (preferred) rule, so co-location is still possible when nodes are scarce:

```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app.kubernetes.io/component: query-service
          topologyKey: kubernetes.io/hostname
```

### Pod Disruption Budgets

Configured for all critical components:

```yaml
podDisruptionBudget:
  enabled: true
  minAvailable: 1
```

## Monitoring and Alerting

### Email Alerts (Production)

Configure SMTP in the production values file (Mailu with a Mailgun relay):

```yaml
signoz:
  env:
    signoz_smtp_enabled: "true"
    signoz_smtp_host: "mailu-smtp.bakery-ia.svc.cluster.local"
    signoz_smtp_port: "587"
    signoz_smtp_from: "alerts@bakewise.ai"
    signoz_smtp_username: "alerts@bakewise.ai"
    # Set via secret: signoz_smtp_password
```

**Note**: SigNoz uses the internal Mailu SMTP service, which relays to Mailgun for better deliverability and centralized email management.

### Slack Alerts (Production)

Configure the webhook in Alertmanager:

```yaml
alertmanager:
  config:
    receivers:
      - name: 'critical-alerts'
        slack_configs:
          - api_url: '${SLACK_WEBHOOK_URL}'  # placeholder; supply the real URL from a secret
            channel: '#alerts-critical'
```

### Mailgun Integration for Alert Emails

SigNoz is configured to send alert emails through the Mailu SMTP service, which relays them via Mailgun.
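
When a test alert never arrives, it can help to exercise the SigNoz-to-Mailu hop on its own. The script below is a minimal sketch, not part of this repository: it assumes curl was built with SMTP support, that it runs somewhere that can resolve the in-cluster Mailu service (for example a debug pod), and that `SMTP_PASSWORD` holds the password for `alerts@bakewise.ai`. The file name `smtp-relay-test.sh` and the functions `build_test_message` and `send_test_message` are hypothetical.

```bash
#!/usr/bin/env bash
# smtp-relay-test.sh (hypothetical): send one test email through the same
# Mailu submission endpoint SigNoz uses. Host, port and sender default to the
# values in signoz-values-prod.yaml; override them via environment variables.
SMTP_HOST="${SMTP_HOST:-mailu-smtp.bakery-ia.svc.cluster.local}"
SMTP_PORT="${SMTP_PORT:-587}"
SMTP_FROM="${SMTP_FROM:-alerts@bakewise.ai}"
SMTP_USER="${SMTP_USER:-alerts@bakewise.ai}"

# Print a minimal RFC 5322 message (headers, blank line, body) with CRLF endings.
build_test_message() {
  local to="$1"
  printf 'From: %s\r\nTo: %s\r\nSubject: SigNoz SMTP relay test\r\n\r\n%s\r\n' \
    "$SMTP_FROM" "$to" "Test message sent via Mailu and the Mailgun relay."
}

# Submit the message. With an smtp:// URL, --ssl-reqd makes curl upgrade the
# session via STARTTLS and abort if the server does not offer it.
# Assumes SMTP_PASSWORD is exported (the alerts@ mailbox password).
send_test_message() {
  local to="$1"
  build_test_message "$to" | curl --silent --show-error \
    --url "smtp://${SMTP_HOST}:${SMTP_PORT}" \
    --ssl-reqd \
    --mail-from "$SMTP_FROM" \
    --mail-rcpt "$to" \
    --user "${SMTP_USER}:${SMTP_PASSWORD:?export SMTP_PASSWORD first}" \
    --upload-file -
}
```

Usage example (placeholder password and recipient):

```bash
source ./smtp-relay-test.sh
export SMTP_PASSWORD='your-smtp-password'
send_test_message you@example.com
```

If curl exits cleanly but no email arrives, Mailu accepted the message, so the Mailu logs and the Mailgun dashboard are the next places to look.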

**Benefits:**

- Better email deliverability through Mailgun's infrastructure
- Centralized email management via Mailu
- Improved tracking and analytics for alert emails
- Compliance with email-sending best practices

**Architecture:**

```
SigNoz Alertmanager → Mailu SMTP → Mailgun Relay → Recipients
```

**Configuration Requirements:**

1. **Mailu Configuration** (`infrastructure/platform/mail/mailu/mailu-configmap.yaml`):

   ```yaml
   RELAYHOST: "smtp.mailgun.org:587"
   RELAY_LOGIN: "postmaster@bakewise.ai"
   ```

2. **Mailu Secrets** (`infrastructure/platform/mail/mailu/mailu-secrets.yaml`):

   ```yaml
   RELAY_PASSWORD: ""  # Base64-encoded Mailgun API key
   ```

3. **DNS Configuration** (required for Mailgun):

   ```
   # MX record
   bakewise.ai.                IN MX  10 mail.bakewise.ai.

   # SPF record (authorize Mailgun)
   bakewise.ai.                IN TXT "v=spf1 include:mailgun.org ~all"

   # DKIM record (provided by Mailgun)
   m1._domainkey.bakewise.ai.  IN TXT "v=DKIM1; k=rsa; p="

   # DMARC record
   _dmarc.bakewise.ai.         IN TXT "v=DMARC1; p=quarantine; rua=mailto:dmarc@bakewise.ai"
   ```

4. **SigNoz SMTP Configuration** (already set in `signoz-values-prod.yaml`):

   ```yaml
   signoz_smtp_host: "mailu-smtp.bakery-ia.svc.cluster.local"
   signoz_smtp_port: "587"
   signoz_smtp_from: "alerts@bakewise.ai"
   ```

**Testing the Integration:**

1. Trigger a test alert from the SigNoz UI
2. Check the Mailu logs: `kubectl logs -f mailu-smtp- -n bakery-ia`
3. Check the Mailgun dashboard for delivery status
4. Verify receipt in the destination inbox

**Troubleshooting:**

- **SMTP authentication failed**: Verify the Mailu credentials and the Mailgun API key
- **Email delivery delays**: Check the Mailu queue with `kubectl exec -it mailu-smtp- -n bakery-ia -- mailq`
- **SPF/DKIM issues**: Verify the DNS records and the Mailgun domain verification

### Self-Monitoring

SigNoz monitors itself:

```yaml
selfMonitoring:
  enabled: true
  serviceMonitor:
    enabled: true  # Prod only
    interval: 30s
```

## Troubleshooting

### Common Issues
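
Before working through the individual issues, a quick pass over pod readiness often shows which component to start with. The helpers below are a hypothetical sketch (not part of `verify-signoz.sh`): they assume the default column layout of `kubectl get pods` (NAME, READY, STATUS, ...) and the `signoz` namespace used by the scripts. The function names `list_unhealthy_pods` and `triage_signoz` are illustrative.

```bash
#!/usr/bin/env bash
# Hypothetical triage helpers; not part of verify-signoz.sh.

# Read `kubectl get pods --no-headers` output on stdin and print
# "NAME READY STATUS" for every pod that is not Running with all containers ready.
# Pods in the Completed state (finished Jobs) are skipped.
list_unhealthy_pods() {
  awk '{
    split($2, ready, "/")            # READY column, e.g. "1/2" -> ready[1]=1, ready[2]=2
    if ($3 == "Completed") next      # finished Jobs report 0/1 by design
    if ($3 != "Running" || ready[1] != ready[2]) print $1, $2, $3
  }'
}

# Run the check against the live cluster (namespace defaults to "signoz").
triage_signoz() {
  local ns="${1:-signoz}"
  kubectl get pods -n "$ns" --no-headers | list_unhealthy_pods
}
```

Each printed name can then be inspected with `kubectl describe pod <name> -n signoz` and `kubectl logs <name> -n signoz`. Parsing the text columns keeps the sketch dependency-free; a JSON-based variant (`-o json` plus `jq`) would be more robust to layout changes.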

**1. Pods not starting**

```bash
# Check pod status
kubectl get pods -n signoz

# Check pod logs
kubectl logs -n signoz 

# Describe pod for events
kubectl describe pod -n signoz 
```

**2. Docker Hub rate limits**

```bash
# Verify the secret exists
kubectl get secret dockerhub-creds -n signoz

# Recreate the secret
kubectl delete secret dockerhub-creds -n signoz
export DOCKERHUB_USERNAME='your-username'
export DOCKERHUB_PASSWORD='your-token'
./deploy-signoz.sh dev
```

**3. ClickHouse connection issues**

```bash
# Check the ClickHouse pod
kubectl logs -n signoz -l app.kubernetes.io/component=clickhouse

# Check Zookeeper (required by ClickHouse)
kubectl logs -n signoz -l app.kubernetes.io/component=zookeeper
```

**4. OTel Collector not receiving data**

```bash
# Check OTel Collector logs
kubectl logs -n signoz -l app.kubernetes.io/component=otel-collector

# Test connectivity: forward the OTLP HTTP port (leave this running in its own terminal)
kubectl port-forward -n signoz svc/signoz-otel-collector 4318:4318

# OTLP/HTTP expects POST; an empty JSON body is a valid export request with no data
curl -v -X POST -H 'Content-Type: application/json' -d '{}' http://localhost:4318/v1/traces
```

**5. Insufficient storage**

```bash
# Check PVC status
kubectl get pvc -n signoz

# CPU/memory usage per pod (requires metrics-server; does not report disk usage)
kubectl top pods -n signoz
```

### Debug Mode

Enable the debug exporter in the OTel Collector:

```yaml
otelCollector:
  config:
    exporters:
      debug:
        verbosity: detailed
        sampling_initial: 5
        sampling_thereafter: 200
    service:
      pipelines:
        traces:
          exporters: [clickhousetraces, debug]  # Add debug
```

### Upgrade from Old Version

If upgrading from a chart version before v0.89.0:

```bash
# 1. Back up resource manifests (recommended; this does NOT back up
#    ClickHouse data, see "Backup and Recovery" below)
kubectl get all -n signoz -o yaml > signoz-backup.yaml

# 2. Remove the old deployment
./deploy-signoz.sh --remove prod

# 3. Deploy the new version
./deploy-signoz.sh prod

# 4. Verify
./verify-signoz.sh prod
```

## Security Best Practices

1. **Change the default password** immediately after first login
2. **Use TLS/SSL** in production (configured with cert-manager)
3. **Network policies** enabled in production
4. **Run as non-root** (configured in securityContext)
5. **RBAC** with a dedicated service account
6. **Secrets management** for sensitive data (SMTP, Slack webhooks)
7. **Image pull secrets** to avoid exposing Docker Hub credentials

## Backup and Recovery

### Backup ClickHouse Data

`BACKUP ... TO Disk('backups', ...)` typically requires a disk named `backups` to be defined in the ClickHouse storage configuration and allowed under `backups.allowed_disk`; adjust the copy path below to wherever that disk points.

```bash
# Back up the traces database to the 'backups' disk
kubectl exec -n signoz -- clickhouse-client \
  --query="BACKUP DATABASE signoz_traces TO Disk('backups', 'traces_backup.zip')"

# Copy the backup out of the pod
kubectl cp signoz/:/var/lib/clickhouse/backups/ ./backups/
```

### Restore from Backup

```bash
# Copy the backup into the pod
kubectl cp ./backups/ signoz/:/var/lib/clickhouse/backups/

# Restore
kubectl exec -n signoz -- clickhouse-client \
  --query="RESTORE DATABASE signoz_traces FROM Disk('backups', 'traces_backup.zip')"
```

## Updating Configuration

To update the SigNoz configuration:

1. Edit the values file: `signoz-values-{env}.yaml`
2. Apply the changes:

   ```bash
   ./deploy-signoz.sh --upgrade {env}
   ```

3. Verify:

   ```bash
   ./verify-signoz.sh {env}
   ```

## Uninstallation

```bash
# Remove the SigNoz deployment
./deploy-signoz.sh --remove {env}

# Optionally delete PVCs (WARNING: deletes all data)
kubectl delete pvc -n signoz -l app.kubernetes.io/instance=signoz

# Optionally delete the namespace
kubectl delete namespace signoz
```

## References

- [SigNoz Official Documentation](https://signoz.io/docs/)
- [SigNoz Helm Charts Repository](https://github.com/SigNoz/charts)
- [OpenTelemetry Documentation](https://opentelemetry.io/docs/)
- [ClickHouse Documentation](https://clickhouse.com/docs/)

## Support

For issues or questions:

1. Check [SigNoz GitHub Issues](https://github.com/SigNoz/signoz/issues)
2. Review deployment logs: `kubectl logs -n signoz `
3. Run the verification script: `./verify-signoz.sh {env}`
4. Check the [SigNoz Community Slack](https://signoz.io/slack)

---

**Last Updated**: 2026-01-09

**SigNoz Helm Chart Version**: Latest (v0.129.12 components)

**Maintained by**: Bakery IA Team