# SigNoz Helm Deployment for Bakery IA
This directory contains Helm configurations and deployment scripts for the SigNoz observability platform.
## Overview
SigNoz is deployed using the official Helm chart with environment-specific configurations optimized for:
- **Development**: Colima + Kind (Kubernetes in Docker) with Tilt
- **Production**: VPS on clouding.io with MicroK8s
## Prerequisites
### Required Tools
- **kubectl** 1.22+
- **Helm** 3.8+
- **Docker** (for development)
- **Kind/MicroK8s** (environment-specific)
### Docker Hub Authentication
SigNoz uses images from Docker Hub. Set up authentication to avoid rate limits:
```bash
# Option 1: Environment variables (recommended)
export DOCKERHUB_USERNAME='your-username'
export DOCKERHUB_PASSWORD='your-personal-access-token'
# Option 2: Docker login
docker login
```
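The deploy script creates the image pull secret automatically, but the manual equivalent looks roughly like this (sketch; secret name `dockerhub-creds` matches what the troubleshooting section below expects, namespace `signoz` is the default):

```shell
# Create the namespace if it does not exist yet, then the pull secret.
kubectl create namespace signoz --dry-run=client -o yaml | kubectl apply -f -
kubectl create secret docker-registry dockerhub-creds \
  --namespace signoz \
  --docker-server=https://index.docker.io/v1/ \
  --docker-username="$DOCKERHUB_USERNAME" \
  --docker-password="$DOCKERHUB_PASSWORD"
```

The secret is referenced by the chart via `imagePullSecrets`, so pulls count against your authenticated Docker Hub quota instead of the anonymous rate limit.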
## Quick Start
### Development Deployment
```bash
# Deploy SigNoz to development environment
./deploy-signoz.sh dev
# Verify deployment
./verify-signoz.sh dev
# Access SigNoz UI
# Via ingress: http://monitoring.bakery-ia.local
# Or port-forward:
kubectl port-forward -n signoz svc/signoz 8080:8080
# Then open: http://localhost:8080
```
### Production Deployment
```bash
# Deploy SigNoz to production environment
./deploy-signoz.sh prod
# Verify deployment
./verify-signoz.sh prod
# Access SigNoz UI
# https://monitoring.bakewise.ai
```
## Configuration Files
### signoz-values-dev.yaml
Development environment configuration with:
- Single replica for most components
- Reduced resource requests (optimized for local Kind cluster)
- 7-day data retention
- Batch size: 10,000 events
- ClickHouse 25.5.6, OTel Collector v0.129.12
- PostgreSQL, Redis, and RabbitMQ receivers configured
### signoz-values-prod.yaml
Production environment configuration with:
- High availability: 2+ replicas for critical components
- 3 Zookeeper replicas (required for production)
- 30-day data retention
- Batch size: 50,000 events (high-performance)
- Cold storage enabled with 30-day TTL
- Horizontal Pod Autoscaler (HPA) enabled
- TLS/SSL with cert-manager
- Enhanced security with pod anti-affinity rules
## Key Configuration Changes (v0.89.0+)
⚠️ **BREAKING CHANGE**: SigNoz Helm chart v0.89.0+ uses a unified component structure.
**Old Structure (deprecated):**
```yaml
frontend:
  replicaCount: 2
queryService:
  replicaCount: 2
```
**New Structure (current):**
```yaml
signoz:
  replicaCount: 2  # Combines frontend + query service
```
## Component Architecture
### Core Components
1. **SigNoz** (unified component)
   - Frontend UI + Query Service
   - Ports: 8080 (HTTP/API), 8085 (internal gRPC)
   - Dev: 1 replica; Prod: 2+ replicas with HPA
2. **ClickHouse** (time-series database)
   - Version: 25.5.6
   - Stores traces, metrics, and logs
   - Dev: 1 replica; Prod: 2 replicas with cold storage
3. **Zookeeper** (ClickHouse coordination)
   - Version: 3.7.1
   - Dev: 1 replica; Prod: 3 replicas (critical for HA)
4. **OpenTelemetry Collector** (data ingestion)
   - Version: v0.129.12
   - Ports: 4317 (gRPC), 4318 (HTTP), 8888 (metrics)
   - Dev: 1 replica; Prod: 2+ replicas with HPA
5. **Alertmanager** (alert management)
   - Version: 0.23.5
   - Email and Slack integrations configured
   - Port: 9093
## Performance Optimizations
### Batch Processing
- **Development**: 10,000 events per batch
- **Production**: 50,000 events per batch (official recommendation)
- Timeout: 1 second for faster processing
### Memory Management
- Memory limiter processor prevents OOM
- Dev: 400 MiB limit, Prod: 1500 MiB limit
- Spike limits configured
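In OpenTelemetry Collector terms, the batch and memory-limiter settings above map to processor configuration along these lines (an illustrative sketch with dev values; the chart nests this under the collector's config, and the exact spike limit here is an assumption):

```yaml
processors:
  batch:
    send_batch_size: 10000   # 50000 in production
    timeout: 1s
  memory_limiter:
    check_interval: 1s
    limit_mib: 400           # 1500 in production
    spike_limit_mib: 100     # illustrative value, not from the values files
```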
### Span Metrics Processor
Automatically generates RED metrics (Rate, Errors, Duration):
- Latency histogram buckets optimized for microservices
- Cache size: 10K (dev), 100K (prod)
### Cold Storage (Production Only)
- Enabled with 30-day TTL
- Automatically moves old data to cold storage
- Keeps 10GB free on primary storage
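In the production values file this corresponds roughly to the chart's ClickHouse cold-storage block (a sketch; key names are assumptions to verify against `signoz-values-prod.yaml`):

```yaml
clickhouse:
  coldStorage:
    enabled: true
    # Keep ~10 GB free on the hot disk before parts are moved to cold storage
    defaultKeepFreeSpaceBytes: "10737418240"
```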
## OpenTelemetry Endpoints
### From Within Kubernetes Cluster
Both environments use the same cluster-internal service endpoints:
```
OTLP gRPC: signoz-otel-collector.bakery-ia.svc.cluster.local:4317
OTLP HTTP: signoz-otel-collector.bakery-ia.svc.cluster.local:4318
```
### Application Configuration Example
```yaml
# Python with OpenTelemetry
OTEL_EXPORTER_OTLP_ENDPOINT: "http://signoz-otel-collector.bakery-ia.svc.cluster.local:4318"
OTEL_EXPORTER_OTLP_PROTOCOL: "http/protobuf"
```
```javascript
// Node.js with OpenTelemetry
const exporter = new OTLPTraceExporter({
  url: 'http://signoz-otel-collector.bakery-ia.svc.cluster.local:4318/v1/traces',
});
```
## Deployment Scripts
### deploy-signoz.sh
Comprehensive deployment script with features:
```bash
# Usage
./deploy-signoz.sh [OPTIONS] ENVIRONMENT
# Options
-h, --help          Show help message
-d, --dry-run       Show what would be deployed
-u, --upgrade       Upgrade existing deployment
-r, --remove        Remove deployment
-n, --namespace NS  Custom namespace (default: signoz)
# Examples
./deploy-signoz.sh dev # Deploy to dev
./deploy-signoz.sh --upgrade prod # Upgrade prod
./deploy-signoz.sh --dry-run prod # Preview changes
./deploy-signoz.sh --remove dev # Remove dev deployment
```
**Features:**
- Automatic Helm repository setup
- Docker Hub secret creation
- Namespace management
- Deployment verification
- 15-minute timeout with `--wait` flag
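Under the hood, the deployment boils down to a Helm invocation roughly like the following (a sketch based on the features listed above; chart and repo names are those of the official SigNoz charts):

```shell
# Add the official SigNoz chart repository and deploy with env-specific values.
helm repo add signoz https://charts.signoz.io
helm repo update
helm upgrade --install signoz signoz/signoz \
  --namespace signoz --create-namespace \
  -f signoz-values-prod.yaml \
  --wait --timeout 15m
```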
### verify-signoz.sh
Verification script to check deployment health:
```bash
# Usage
./verify-signoz.sh [OPTIONS] ENVIRONMENT
# Examples
./verify-signoz.sh dev # Verify dev deployment
./verify-signoz.sh prod # Verify prod deployment
```
**Checks performed:**
1. ✅ Helm release status
2. ✅ Pod health and readiness
3. ✅ Service availability
4. ✅ Ingress configuration
5. ✅ PVC status
6. ✅ Resource usage (if metrics-server available)
7. ✅ Log errors
8. ✅ Environment-specific validations
   - Dev: single replica, resource limits
   - Prod: HA config, TLS, Zookeeper replicas, HPA
## Storage Configuration
### Development (Kind)
```yaml
global:
  storageClass: "standard"  # Kind's default provisioner
```
### Production (MicroK8s)
```yaml
global:
  storageClass: "microk8s-hostpath"  # Or custom storage class
```
**Storage Requirements:**
- **Development**: ~35 GiB total
  - SigNoz: 5 GiB
  - ClickHouse: 20 GiB
  - Zookeeper: 5 GiB
  - Alertmanager: 2 GiB
- **Production**: ~135 GiB total
  - SigNoz: 20 GiB
  - ClickHouse: 100 GiB
  - Zookeeper: 10 GiB
  - Alertmanager: 5 GiB
## Resource Requirements
### Development Environment
**Minimum:**
- CPU: 550m (0.55 cores)
- Memory: 1.6 GiB
- Storage: 35 GiB
**Recommended:**
- CPU: 3 cores
- Memory: 3 GiB
- Storage: 50 GiB
### Production Environment
**Minimum:**
- CPU: 3.5 cores
- Memory: 8 GiB
- Storage: 135 GiB
**Recommended:**
- CPU: 12 cores
- Memory: 20 GiB
- Storage: 200 GiB
## Data Retention
### Development
- Traces: 7 days (168 hours)
- Metrics: 7 days (168 hours)
- Logs: 7 days (168 hours)
### Production
- Traces: 30 days (720 hours)
- Metrics: 30 days (720 hours)
- Logs: 30 days (720 hours)
- Cold storage after 30 days
To modify retention, update the environment variables:
```yaml
signoz:
  env:
    signoz_traces_ttl_duration_hrs: "720"   # 30 days
    signoz_metrics_ttl_duration_hrs: "720"  # 30 days
    signoz_logs_ttl_duration_hrs: "168"     # 7 days
```
## High Availability (Production)
### Replication Strategy
```yaml
signoz: 2 replicas + HPA (min: 2, max: 5)
clickhouse: 2 replicas
zookeeper: 3 replicas (critical!)
otelCollector: 2 replicas + HPA (min: 2, max: 10)
alertmanager: 2 replicas
```
### Pod Anti-Affinity
Ensures pods are distributed across different nodes:
```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app.kubernetes.io/component: query-service
          topologyKey: kubernetes.io/hostname
```
### Pod Disruption Budgets
Configured for all critical components:
```yaml
podDisruptionBudget:
  enabled: true
  minAvailable: 1
```
## Monitoring and Alerting
### Email Alerts (Production)
Configure SMTP in production values (using Mailu with Mailgun relay):
```yaml
signoz:
  env:
    signoz_smtp_enabled: "true"
    signoz_smtp_host: "mailu-smtp.bakery-ia.svc.cluster.local"
    signoz_smtp_port: "587"
    signoz_smtp_from: "alerts@bakewise.ai"
    signoz_smtp_username: "alerts@bakewise.ai"
    # Set via secret: signoz_smtp_password
```
**Note**: SigNoz now uses the internal Mailu SMTP service, which relays to Mailgun for better deliverability and centralized email management.
### Slack Alerts (Production)
Configure webhook in Alertmanager:
```yaml
alertmanager:
  config:
    receivers:
      - name: 'critical-alerts'
        slack_configs:
          - api_url: '${SLACK_WEBHOOK_URL}'
            channel: '#alerts-critical'
```
### Mailgun Integration for Alert Emails
SigNoz is configured to send alert emails through the Mailu SMTP service, which relays via Mailgun. This provides:
**Benefits:**
- Better email deliverability through Mailgun's infrastructure
- Centralized email management via Mailu
- Improved tracking and analytics for alert emails
- Compliance with email sending best practices
**Architecture:**
```
Signoz Alertmanager → Mailu SMTP → Mailgun Relay → Recipients
```
**Configuration Requirements:**
1. **Mailu Configuration** (`infrastructure/platform/mail/mailu/mailu-configmap.yaml`):
```yaml
RELAYHOST: "smtp.mailgun.org:587"
RELAY_LOGIN: "postmaster@bakewise.ai"
```
2. **Mailu Secrets** (`infrastructure/platform/mail/mailu/mailu-secrets.yaml`):
```yaml
RELAY_PASSWORD: "<mailgun-api-key>" # Base64 encoded Mailgun API key
```
3. **DNS Configuration** (required for Mailgun):
```
# MX record
bakewise.ai. IN MX 10 mail.bakewise.ai.
# SPF record (authorize Mailgun)
bakewise.ai. IN TXT "v=spf1 include:mailgun.org ~all"
# DKIM record (provided by Mailgun)
m1._domainkey.bakewise.ai. IN TXT "v=DKIM1; k=rsa; p=<mailgun-public-key>"
# DMARC record
_dmarc.bakewise.ai. IN TXT "v=DMARC1; p=quarantine; rua=mailto:dmarc@bakewise.ai"
```
4. **Signoz SMTP Configuration** (already configured in `signoz-values-prod.yaml`):
```yaml
signoz_smtp_host: "mailu-smtp.bakery-ia.svc.cluster.local"
signoz_smtp_port: "587"
signoz_smtp_from: "alerts@bakewise.ai"
```
**Testing the Integration:**
1. Trigger a test alert from the SigNoz UI
2. Check Mailu logs: `kubectl logs -f mailu-smtp-<pod-id> -n bakery-ia`
3. Check Mailgun dashboard for delivery status
4. Verify email receipt in destination inbox
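If a test alert never arrives, a quick connectivity check from inside the cluster can rule out networking before digging into SMTP auth (a sketch using a throwaway busybox pod):

```shell
# Verify the Mailu SMTP service is reachable on the submission port.
kubectl run smtp-check -n bakery-ia --rm -it --restart=Never \
  --image=busybox -- nc -zv mailu-smtp.bakery-ia.svc.cluster.local 587
```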
**Troubleshooting:**
- **SMTP Authentication Failed**: Verify Mailu credentials and Mailgun API key
- **Email Delivery Delays**: Check Mailu queue with `kubectl exec -it mailu-smtp-<pod-id> -n bakery-ia -- mailq`
- **SPF/DKIM Issues**: Verify DNS records and Mailgun domain verification
### Self-Monitoring
SigNoz monitors itself:
```yaml
selfMonitoring:
  enabled: true
  serviceMonitor:
    enabled: true  # Prod only
    interval: 30s
```
## Troubleshooting
### Common Issues
**1. Pods not starting**
```bash
# Check pod status
kubectl get pods -n signoz
# Check pod logs
kubectl logs -n signoz <pod-name>
# Describe pod for events
kubectl describe pod -n signoz <pod-name>
```
**2. Docker Hub rate limits**
```bash
# Verify secret exists
kubectl get secret dockerhub-creds -n signoz
# Recreate secret
kubectl delete secret dockerhub-creds -n signoz
export DOCKERHUB_USERNAME='your-username'
export DOCKERHUB_PASSWORD='your-token'
./deploy-signoz.sh dev
```
**3. ClickHouse connection issues**
```bash
# Check ClickHouse pod
kubectl logs -n signoz -l app.kubernetes.io/component=clickhouse
# Check Zookeeper (required by ClickHouse)
kubectl logs -n signoz -l app.kubernetes.io/component=zookeeper
```
**4. OTel Collector not receiving data**
```bash
# Check OTel Collector logs
kubectl logs -n signoz -l app.kubernetes.io/component=otel-collector
# Test connectivity
kubectl port-forward -n signoz svc/signoz-otel-collector 4318:4318
curl -v http://localhost:4318/v1/traces
```
**5. Insufficient storage**
```bash
# Check PVC status
kubectl get pvc -n signoz
# Check storage usage (if metrics-server available)
kubectl top pods -n signoz
```
### Debug Mode
Enable debug exporter in OTel Collector:
```yaml
otelCollector:
  config:
    exporters:
      debug:
        verbosity: detailed
        sampling_initial: 5
        sampling_thereafter: 200
    service:
      pipelines:
        traces:
          exporters: [clickhousetraces, debug]  # Add debug
```
### Upgrade from Old Version
If upgrading from pre-v0.89.0:
```bash
# 1. Backup data (recommended)
kubectl get all -n signoz -o yaml > signoz-backup.yaml
# 2. Remove old deployment
./deploy-signoz.sh --remove prod
# 3. Deploy new version
./deploy-signoz.sh prod
# 4. Verify
./verify-signoz.sh prod
```
## Security Best Practices
1. **Change default password** immediately after first login
2. **Use TLS/SSL** in production (configured with cert-manager)
3. **Network policies** enabled in production
4. **Run as non-root** (configured in securityContext)
5. **RBAC** with dedicated service account
6. **Secrets management** for sensitive data (SMTP, Slack webhooks)
7. **Image pull secrets** to avoid exposing Docker Hub credentials
## Backup and Recovery
### Backup ClickHouse Data
```bash
# Export ClickHouse data
kubectl exec -n signoz <clickhouse-pod> -- clickhouse-client \
--query="BACKUP DATABASE signoz_traces TO Disk('backups', 'traces_backup.zip')"
# Copy backup out
kubectl cp signoz/<clickhouse-pod>:/var/lib/clickhouse/backups/ ./backups/
```
### Restore from Backup
```bash
# Copy backup in
kubectl cp ./backups/ signoz/<clickhouse-pod>:/var/lib/clickhouse/backups/
# Restore
kubectl exec -n signoz <clickhouse-pod> -- clickhouse-client \
--query="RESTORE DATABASE signoz_traces FROM Disk('backups', 'traces_backup.zip')"
```
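After a restore it is worth confirming that the tables came back before pointing traffic at the instance (a sketch; the exact table set varies by SigNoz version):

```shell
# List the restored tables in the traces database.
kubectl exec -n signoz <clickhouse-pod> -- clickhouse-client \
  --query="SHOW TABLES FROM signoz_traces"
```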
## Updating Configuration
To update SigNoz configuration:
1. Edit values file: `signoz-values-{env}.yaml`
2. Apply changes:
```bash
./deploy-signoz.sh --upgrade {env}
```
3. Verify:
```bash
./verify-signoz.sh {env}
```
## Uninstallation
```bash
# Remove SigNoz deployment
./deploy-signoz.sh --remove {env}
# Optionally delete PVCs (WARNING: deletes all data)
kubectl delete pvc -n signoz -l app.kubernetes.io/instance=signoz
# Optionally delete namespace
kubectl delete namespace signoz
```
## References
- [SigNoz Official Documentation](https://signoz.io/docs/)
- [SigNoz Helm Charts Repository](https://github.com/SigNoz/charts)
- [OpenTelemetry Documentation](https://opentelemetry.io/docs/)
- [ClickHouse Documentation](https://clickhouse.com/docs/)
## Support
For issues or questions:
1. Check [SigNoz GitHub Issues](https://github.com/SigNoz/signoz/issues)
2. Review deployment logs: `kubectl logs -n signoz <pod-name>`
3. Run verification script: `./verify-signoz.sh {env}`
4. Check [SigNoz Community Slack](https://signoz.io/slack)
---
**Last Updated**: 2026-01-09
**SigNoz Helm Chart Version**: Latest (OTel Collector v0.129.12)
**Maintained by**: Bakery IA Team