Improve monitoring 2
554
infrastructure/helm/README.md
Normal file
@@ -0,0 +1,554 @@
# SigNoz Helm Deployment for Bakery IA

This directory contains Helm configurations and deployment scripts for the SigNoz observability platform.

## Overview

SigNoz is deployed using the official Helm chart with environment-specific configurations optimized for:
- **Development**: Colima + Kind (Kubernetes in Docker) with Tilt
- **Production**: VPS on clouding.io with MicroK8s

## Prerequisites

### Required Tools
- **kubectl** 1.22+
- **Helm** 3.8+
- **Docker** (for development)
- **Kind/MicroK8s** (environment-specific)

### Docker Hub Authentication

SigNoz uses images from Docker Hub. Set up authentication to avoid rate limits:

```bash
# Option 1: Environment variables (recommended)
export DOCKERHUB_USERNAME='your-username'
export DOCKERHUB_PASSWORD='your-personal-access-token'

# Option 2: Docker login
docker login
```

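The deployment script is described as creating the Docker Hub secret itself. For reference, the step might look roughly like the sketch below when done by hand. It assumes the secret name `dockerhub-creds` and the `signoz` namespace that appear in Troubleshooting; the helper names `require_dockerhub_env` and `create_dockerhub_secret` are hypothetical and are not taken from `deploy-signoz.sh`.

```bash
#!/usr/bin/env bash
# Hypothetical helpers; deploy-signoz.sh may implement this step differently.

# Returns 0 only when both credentials are set and non-empty.
require_dockerhub_env() {
  [ -n "${DOCKERHUB_USERNAME:-}" ] && [ -n "${DOCKERHUB_PASSWORD:-}" ]
}

# Creates or updates the image pull secret.
# Assumed names: secret "dockerhub-creds", default namespace "signoz".
create_dockerhub_secret() {
  local namespace="${1:-signoz}"
  if ! require_dockerhub_env; then
    echo "DOCKERHUB_USERNAME and DOCKERHUB_PASSWORD must be set" >&2
    return 1
  fi
  # --dry-run=client piped into apply makes the step safe to repeat.
  kubectl create secret docker-registry dockerhub-creds \
    --namespace "$namespace" \
    --docker-server=https://index.docker.io/v1/ \
    --docker-username="$DOCKERHUB_USERNAME" \
    --docker-password="$DOCKERHUB_PASSWORD" \
    --dry-run=client -o yaml | kubectl apply -f -
}
```

Calling `create_dockerhub_secret signoz` after exporting the two variables would create or refresh the secret; without them the function stops before contacting the cluster.
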
## Quick Start

### Development Deployment

```bash
# Deploy SigNoz to development environment
./deploy-signoz.sh dev

# Verify deployment
./verify-signoz.sh dev

# Access SigNoz UI
# Via ingress: http://monitoring.bakery-ia.local
# Or port-forward:
kubectl port-forward -n signoz svc/signoz 8080:8080
# Then open: http://localhost:8080
```

### Production Deployment

```bash
# Deploy SigNoz to production environment
./deploy-signoz.sh prod

# Verify deployment
./verify-signoz.sh prod

# Access SigNoz UI
# https://monitoring.bakewise.ai
```

## Configuration Files

### signoz-values-dev.yaml

Development environment configuration with:
- Single replica for most components
- Reduced resource requests (optimized for local Kind cluster)
- 7-day data retention
- Batch size: 10,000 events
- ClickHouse 25.5.6, OTel Collector v0.129.12
- PostgreSQL, Redis, and RabbitMQ receivers configured

### signoz-values-prod.yaml

Production environment configuration with:
- High availability: 2+ replicas for critical components
- 3 Zookeeper replicas (required for production)
- 30-day data retention
- Batch size: 50,000 events (high-performance)
- Cold storage enabled with 30-day TTL
- Horizontal Pod Autoscaler (HPA) enabled
- TLS/SSL with cert-manager
- Pod anti-affinity rules to spread replicas across nodes

## Key Configuration Changes (v0.89.0+)

⚠️ **BREAKING CHANGE**: SigNoz Helm chart v0.89.0+ uses a unified component structure.

**Old Structure (deprecated):**
```yaml
frontend:
  replicaCount: 2
queryService:
  replicaCount: 2
```

**New Structure (current):**
```yaml
signoz:
  replicaCount: 2
  # Combines frontend + query service
```

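When migrating an older values file, it can help to check whether any of the deprecated top-level keys are still present. The following is a minimal sketch; the function name `find_deprecated_keys` is hypothetical, and it only looks for the two keys shown above at the start of a line.

```bash
#!/usr/bin/env bash
# Hypothetical helper: lists deprecated top-level keys left in a values file.
# Prints each match as "<line-number>:<key>:" and returns 0 if any were found,
# non-zero otherwise (standard grep exit status).
find_deprecated_keys() {
  local values_file="$1"
  grep -nE '^(frontend|queryService):' "$values_file"
}
```

For example, `find_deprecated_keys signoz-values-prod.yaml && echo "move these settings under signoz:"` prints a reminder only when old keys remain.
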
## Component Architecture

### Core Components

1. **SigNoz** (unified component)
   - Frontend UI + Query Service
   - Port 8080 (HTTP/API), 8085 (internal gRPC)
   - Dev: 1 replica, Prod: 2+ replicas with HPA

2. **ClickHouse** (Time-series database)
   - Version: 25.5.6
   - Stores traces, metrics, and logs
   - Dev: 1 replica, Prod: 2 replicas with cold storage

3. **Zookeeper** (ClickHouse coordination)
   - Version: 3.7.1
   - Dev: 1 replica, Prod: 3 replicas (critical for HA)

4. **OpenTelemetry Collector** (Data ingestion)
   - Version: v0.129.12
   - Ports: 4317 (gRPC), 4318 (HTTP), 8888 (metrics)
   - Dev: 1 replica, Prod: 2+ replicas with HPA

5. **Alertmanager** (Alert management)
   - Version: 0.23.5
   - Email and Slack integrations configured
   - Port: 9093

## Performance Optimizations

### Batch Processing
- **Development**: 10,000 events per batch
- **Production**: 50,000 events per batch (official recommendation)
- Timeout: 1 second, so partially filled batches are still flushed quickly

### Memory Management
- The `memory_limiter` processor protects the collector from running out of memory
- Dev: 400 MiB limit, Prod: 1500 MiB limit
- Spike limits configured

### Span Metrics Processor
Automatically generates RED metrics (Rate, Errors, Duration):
- Latency histogram buckets optimized for microservices
- Cache size: 10K (dev), 100K (prod)

### Cold Storage (Production Only)
- Enabled with 30-day TTL
- Automatically moves old data to cold storage
- Keeps 10 GB free on primary storage

## OpenTelemetry Endpoints

### From Within Kubernetes Cluster

**Development:**
```
OTLP gRPC: signoz-otel-collector.bakery-ia.svc.cluster.local:4317
OTLP HTTP: signoz-otel-collector.bakery-ia.svc.cluster.local:4318
```

**Production:**
```
OTLP gRPC: signoz-otel-collector.bakery-ia.svc.cluster.local:4317
OTLP HTTP: signoz-otel-collector.bakery-ia.svc.cluster.local:4318
```

Both environments use the same in-cluster address. The hostnames follow the `<service>.<namespace>.svc.cluster.local` pattern, so the namespace segment must match the namespace the collector actually runs in (the deployment script defaults to `signoz`).

### Application Configuration Example

```yaml
# Python with OpenTelemetry
OTEL_EXPORTER_OTLP_ENDPOINT: "http://signoz-otel-collector.bakery-ia.svc.cluster.local:4318"
OTEL_EXPORTER_OTLP_PROTOCOL: "http/protobuf"
```

```javascript
// Node.js with OpenTelemetry
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

const exporter = new OTLPTraceExporter({
  url: 'http://signoz-otel-collector.bakery-ia.svc.cluster.local:4318/v1/traces',
});
```

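The endpoint strings above can be assembled from their parts, which makes it easier to switch namespaces without retyping the hostname. This is a small sketch; the function name `otlp_endpoint` is hypothetical, and it assumes the cluster uses the default `cluster.local` DNS domain.

```bash
#!/usr/bin/env bash
# Builds the in-cluster DNS name for a Service port.
# Arguments: service name, namespace, port.
# Assumes the default cluster domain "cluster.local".
otlp_endpoint() {
  local service="$1" namespace="$2" port="$3"
  printf '%s.%s.svc.cluster.local:%s\n' "$service" "$namespace" "$port"
}

# Example: export the OTLP/HTTP settings for an application container.
export OTEL_EXPORTER_OTLP_ENDPOINT="http://$(otlp_endpoint signoz-otel-collector bakery-ia 4318)"
export OTEL_EXPORTER_OTLP_PROTOCOL="http/protobuf"
echo "$OTEL_EXPORTER_OTLP_ENDPOINT"
```

Passing `signoz` instead of `bakery-ia` as the second argument yields the address for a collector deployed in the script's default namespace.
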
## Deployment Scripts

### deploy-signoz.sh

Deployment script for installing, upgrading, previewing, and removing SigNoz:

```bash
# Usage
./deploy-signoz.sh [OPTIONS] ENVIRONMENT

# Options
-h, --help              Show help message
-d, --dry-run           Show what would be deployed
-u, --upgrade           Upgrade existing deployment
-r, --remove            Remove deployment
-n, --namespace NS      Custom namespace (default: signoz)

# Examples
./deploy-signoz.sh dev                    # Deploy to dev
./deploy-signoz.sh --upgrade prod         # Upgrade prod
./deploy-signoz.sh --dry-run prod         # Preview changes
./deploy-signoz.sh --remove dev           # Remove dev deployment
```

**Features:**
- Automatic Helm repository setup
- Docker Hub secret creation
- Namespace management
- Deployment verification
- 15-minute timeout with `--wait` flag

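To make the feature list more concrete, here is a sketch of the Helm command such a script might end up running. It is not taken from `deploy-signoz.sh`: the function name `build_helm_command`, the release name `signoz`, and the assumption that the values files sit in the current directory are all illustrative. The chart reference `signoz/signoz` and repository URL are those of the official SigNoz chart.

```bash
#!/usr/bin/env bash
# Sketch only: prints the Helm command a deployment like this might run.
# Assumed: release name "signoz", chart "signoz/signoz", values files named
# signoz-values-<env>.yaml in the current directory.
build_helm_command() {
  local environment="$1" namespace="${2:-signoz}"
  case "$environment" in
    dev|prod) ;;
    *) echo "unknown environment: $environment" >&2; return 1 ;;
  esac
  echo "helm upgrade --install signoz signoz/signoz" \
       "--namespace $namespace --create-namespace" \
       "--values signoz-values-$environment.yaml" \
       "--wait --timeout 15m"
}

# One-time repository setup would come first:
#   helm repo add signoz https://charts.signoz.io && helm repo update
build_helm_command dev
```

Printing the command rather than executing it mirrors the idea behind `--dry-run`: the result can be reviewed before anything touches the cluster.
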
### verify-signoz.sh

Verification script to check deployment health:

```bash
# Usage
./verify-signoz.sh [OPTIONS] ENVIRONMENT

# Examples
./verify-signoz.sh dev                    # Verify dev deployment
./verify-signoz.sh prod                   # Verify prod deployment
```

**Checks performed:**
1. ✅ Helm release status
2. ✅ Pod health and readiness
3. ✅ Service availability
4. ✅ Ingress configuration
5. ✅ PVC status
6. ✅ Resource usage (if metrics-server available)
7. ✅ Log errors
8. ✅ Environment-specific validations
   - Dev: Single replica, resource limits
   - Prod: HA config, TLS, Zookeeper replicas, HPA

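The pod readiness check (item 2) can be approximated by hand. The sketch below is not the logic of `verify-signoz.sh`; the helper name `count_not_ready` is hypothetical. It parses the READY column of `kubectl get pods --no-headers` output and counts pods whose ready containers do not match the total.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads `kubectl get pods --no-headers` output on stdin
# and prints how many pods are not fully ready (READY column "x/y", x != y).
# Pods of completed Jobs show 0/1 and are counted too; filter them if needed.
count_not_ready() {
  awk '{ split($2, ready, "/"); if (ready[1] != ready[2]) n++ } END { print n + 0 }'
}

# Usage against a live cluster (namespace "signoz" as used in this README):
#   kubectl get pods -n signoz --no-headers | count_not_ready
```

A result of `0` suggests every listed pod has all of its containers ready; any other value is a cue to inspect the pods with `kubectl describe`.
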
## Storage Configuration

### Development (Kind)
```yaml
global:
  storageClass: "standard"  # Kind's default provisioner
```

### Production (MicroK8s)
```yaml
global:
  storageClass: "microk8s-hostpath"  # Or custom storage class
```

**Storage Requirements:**
- **Development**: ~35 GiB total
  - SigNoz: 5 GiB
  - ClickHouse: 20 GiB
  - Zookeeper: 5 GiB
  - Alertmanager: 2 GiB

- **Production**: ~135 GiB total
  - SigNoz: 20 GiB
  - ClickHouse: 100 GiB
  - Zookeeper: 10 GiB
  - Alertmanager: 5 GiB

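The totals can be cross-checked by adding up the per-component sizes. This is a quick sketch with a hypothetical helper `sum_gib`; it treats each figure as a single volume and does not multiply by replica count, which is an assumption about how the list above was put together.

```bash
#!/usr/bin/env bash
# Sums per-component volume sizes (in GiB) passed as arguments.
sum_gib() {
  local total=0 size
  for size in "$@"; do
    total=$(( total + size ))
  done
  echo "$total"
}

# Argument order: SigNoz, ClickHouse, Zookeeper, Alertmanager
echo "dev:  $(sum_gib 5 20 5 2) GiB"       # prints "dev:  32 GiB"
echo "prod: $(sum_gib 20 100 10 5) GiB"    # prints "prod: 135 GiB"
```

The development components add up to 32 GiB, slightly below the quoted ~35 GiB, so that figure presumably includes some headroom; the production sum matches 135 GiB exactly.
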
## Resource Requirements

### Development Environment
**Minimum:**
- CPU: 550m (0.55 cores)
- Memory: 1.6 GiB
- Storage: 35 GiB

**Recommended:**
- CPU: 3 cores
- Memory: 3 GiB
- Storage: 50 GiB

### Production Environment
**Minimum:**
- CPU: 3.5 cores
- Memory: 8 GiB
- Storage: 135 GiB

**Recommended:**
- CPU: 12 cores
- Memory: 20 GiB
- Storage: 200 GiB

## Data Retention

### Development
- Traces: 7 days (168 hours)
- Metrics: 7 days (168 hours)
- Logs: 7 days (168 hours)

### Production
- Traces: 30 days (720 hours)
- Metrics: 30 days (720 hours)
- Logs: 30 days (720 hours)
- Cold storage after 30 days

To modify retention, update the TTL environment variables in the values file (durations are in hours):
```yaml
signoz:
  env:
    signoz_traces_ttl_duration_hrs: "720"  # 30 days
    signoz_metrics_ttl_duration_hrs: "720"  # 30 days
    signoz_logs_ttl_duration_hrs: "168"  # 7 days
```

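Because the TTL variables take hours while retention is usually discussed in days, a small conversion avoids arithmetic slips. The helper name `days_to_hours` is hypothetical.

```bash
#!/usr/bin/env bash
# Converts a retention period in whole days to the hour value expected by the
# *_ttl_duration_hrs variables.
days_to_hours() {
  local days="$1"
  echo $(( days * 24 ))
}

for days in 7 30 90; do
  echo "$days days = $(days_to_hours "$days") hours"
done
# prints:
#   7 days = 168 hours
#   30 days = 720 hours
#   90 days = 2160 hours
```
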
## High Availability (Production)

### Replication Strategy
```
signoz:        2 replicas + HPA (min: 2, max: 5)
clickhouse:    2 replicas
zookeeper:     3 replicas (critical!)
otelCollector: 2 replicas + HPA (min: 2, max: 10)
alertmanager:  2 replicas
```

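The reason three Zookeeper replicas matter is that Zookeeper needs a majority of its ensemble to stay available. The sketch below works through that arithmetic; the helper name `quorum_fault_tolerance` is hypothetical.

```bash
#!/usr/bin/env bash
# Number of member failures a majority-quorum ensemble of N members survives:
# floor((N - 1) / 2). Shell integer division already rounds down.
quorum_fault_tolerance() {
  local members="$1"
  echo $(( (members - 1) / 2 ))
}

for members in 1 2 3 5; do
  echo "$members replicas tolerate $(quorum_fault_tolerance "$members") failure(s)"
done
# prints:
#   1 replicas tolerate 0 failure(s)
#   2 replicas tolerate 0 failure(s)
#   3 replicas tolerate 1 failure(s)
#   5 replicas tolerate 2 failure(s)
```

Two replicas tolerate no more failures than one, which is why the production profile goes straight from one to three.
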
### Pod Anti-Affinity
Prefers scheduling replicas on different nodes (a soft rule, so the scheduler may still co-locate them when no other node fits):
```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app.kubernetes.io/component: query-service
          topologyKey: kubernetes.io/hostname
```

### Pod Disruption Budgets
Configured for all critical components:
```yaml
podDisruptionBudget:
  enabled: true
  minAvailable: 1
```

## Monitoring and Alerting

### Email Alerts (Production)
Configure SMTP in production values:
```yaml
signoz:
  env:
    signoz_smtp_enabled: "true"
    signoz_smtp_host: "smtp.gmail.com"
    signoz_smtp_port: "587"
    signoz_smtp_from: "alerts@bakewise.ai"
    signoz_smtp_username: "alerts@bakewise.ai"
    # Set via secret: signoz_smtp_password
```

### Slack Alerts (Production)
Configure webhook in Alertmanager:
```yaml
alertmanager:
  config:
    receivers:
      - name: 'critical-alerts'
        slack_configs:
          - api_url: '${SLACK_WEBHOOK_URL}'
            channel: '#alerts-critical'
```

### Self-Monitoring
SigNoz monitors itself:
```yaml
selfMonitoring:
  enabled: true
  serviceMonitor:
    enabled: true  # Prod only
    interval: 30s
```

## Troubleshooting

### Common Issues

**1. Pods not starting**
```bash
# Check pod status
kubectl get pods -n signoz

# Check pod logs
kubectl logs -n signoz <pod-name>

# Describe pod for events
kubectl describe pod -n signoz <pod-name>
```

**2. Docker Hub rate limits**
```bash
# Verify secret exists
kubectl get secret dockerhub-creds -n signoz

# Recreate secret
kubectl delete secret dockerhub-creds -n signoz
export DOCKERHUB_USERNAME='your-username'
export DOCKERHUB_PASSWORD='your-token'
./deploy-signoz.sh dev
```

**3. ClickHouse connection issues**
```bash
# Check ClickHouse pod
kubectl logs -n signoz -l app.kubernetes.io/component=clickhouse

# Check Zookeeper (required by ClickHouse)
kubectl logs -n signoz -l app.kubernetes.io/component=zookeeper
```

**4. OTel Collector not receiving data**
```bash
# Check OTel Collector logs
kubectl logs -n signoz -l app.kubernetes.io/component=otel-collector

# Test connectivity (run the port-forward in a separate terminal)
kubectl port-forward -n signoz svc/signoz-otel-collector 4318:4318
# OTLP/HTTP expects POST; an empty JSON payload should be accepted if the receiver is up
curl -v -X POST -H 'Content-Type: application/json' -d '{}' http://localhost:4318/v1/traces
```

**5. Insufficient storage**
```bash
# Check PVC status
kubectl get pvc -n signoz

# Check CPU and memory usage (requires metrics-server; does not report disk usage)
kubectl top pods -n signoz

# Check disk usage inside the ClickHouse pod
kubectl exec -n signoz <clickhouse-pod> -- df -h /var/lib/clickhouse
```

### Debug Mode

Enable the debug exporter in the OTel Collector:
```yaml
otelCollector:
  config:
    exporters:
      debug:
        verbosity: detailed
        sampling_initial: 5
        sampling_thereafter: 200
    service:
      pipelines:
        traces:
          exporters: [clickhousetraces, debug]  # Add debug
```

### Upgrade from Old Version

If upgrading from a chart version older than v0.89.0:
```bash
# 1. Back up Kubernetes resource manifests (recommended)
#    This does not copy ClickHouse data; see "Backup and Recovery" for that
kubectl get all -n signoz -o yaml > signoz-backup.yaml

# 2. Remove old deployment
./deploy-signoz.sh --remove prod

# 3. Deploy new version
./deploy-signoz.sh prod

# 4. Verify
./verify-signoz.sh prod
```

## Security Best Practices

1. **Change default password** immediately after first login
2. **Use TLS/SSL** in production (configured with cert-manager)
3. **Network policies** enabled in production
4. **Run as non-root** (configured in securityContext)
5. **RBAC** with dedicated service account
6. **Secrets management** for sensitive data (SMTP, Slack webhooks)
7. **Image pull secrets** to avoid exposing Docker Hub credentials

## Backup and Recovery

### Backup ClickHouse Data
```bash
# Export ClickHouse data
kubectl exec -n signoz <clickhouse-pod> -- clickhouse-client \
  --query="BACKUP DATABASE signoz_traces TO Disk('backups', 'traces_backup.zip')"

# Copy backup out
kubectl cp signoz/<clickhouse-pod>:/var/lib/clickhouse/backups/ ./backups/
```

### Restore from Backup
```bash
# Copy backup in
kubectl cp ./backups/ signoz/<clickhouse-pod>:/var/lib/clickhouse/backups/

# Restore
kubectl exec -n signoz <clickhouse-pod> -- clickhouse-client \
  --query="RESTORE DATABASE signoz_traces FROM Disk('backups', 'traces_backup.zip')"
```

## Updating Configuration

To update SigNoz configuration:

1. Edit values file: `signoz-values-{env}.yaml`
2. Apply changes:
   ```bash
   ./deploy-signoz.sh --upgrade {env}
   ```
3. Verify:
   ```bash
   ./verify-signoz.sh {env}
   ```

## Uninstallation

```bash
# Remove SigNoz deployment
./deploy-signoz.sh --remove {env}

# Optionally delete PVCs (WARNING: deletes all data)
kubectl delete pvc -n signoz -l app.kubernetes.io/instance=signoz

# Optionally delete namespace
kubectl delete namespace signoz
```

## References

- [SigNoz Official Documentation](https://signoz.io/docs/)
- [SigNoz Helm Charts Repository](https://github.com/SigNoz/charts)
- [OpenTelemetry Documentation](https://opentelemetry.io/docs/)
- [ClickHouse Documentation](https://clickhouse.com/docs/)

## Support

For issues or questions:
1. Check [SigNoz GitHub Issues](https://github.com/SigNoz/signoz/issues)
2. Review deployment logs: `kubectl logs -n signoz <pod-name>`
3. Run verification script: `./verify-signoz.sh {env}`
4. Check [SigNoz Community Slack](https://signoz.io/slack)

---

**Last Updated**: 2026-01-09
**SigNoz Helm Chart Version**: Latest (v0.129.12 components)
**Maintained by**: Bakery IA Team