Add new infra architecture
This commit is contained in:
619
infrastructure/monitoring/signoz/README.md
Normal file
@@ -0,0 +1,619 @@

# SigNoz Helm Deployment for Bakery IA

This directory contains Helm configurations and deployment scripts for the SigNoz observability platform.

## Overview

SigNoz is deployed using the official Helm chart with environment-specific configurations optimized for:
- **Development**: Colima + Kind (Kubernetes in Docker) with Tilt
- **Production**: VPS on clouding.io with MicroK8s

## Prerequisites

### Required Tools
- **kubectl** 1.22+
- **Helm** 3.8+
- **Docker** (for development)
- **Kind/MicroK8s** (environment-specific)

### Docker Hub Authentication

SigNoz uses images from Docker Hub. Set up authentication to avoid rate limits:

```bash
# Option 1: Environment variables (recommended)
export DOCKERHUB_USERNAME='your-username'
export DOCKERHUB_PASSWORD='your-personal-access-token'

# Option 2: Docker login
docker login
```

## Quick Start

### Development Deployment

```bash
# Deploy SigNoz to the development environment
./deploy-signoz.sh dev

# Verify deployment
./verify-signoz.sh dev

# Access SigNoz UI
# Via ingress: http://monitoring.bakery-ia.local
# Or port-forward:
kubectl port-forward -n signoz svc/signoz 8080:8080
# Then open: http://localhost:8080
```

### Production Deployment

```bash
# Deploy SigNoz to the production environment
./deploy-signoz.sh prod

# Verify deployment
./verify-signoz.sh prod

# Access SigNoz UI
# https://monitoring.bakewise.ai
```

## Configuration Files

### signoz-values-dev.yaml

Development environment configuration with:
- Single replica for most components
- Reduced resource requests (optimized for a local Kind cluster)
- 7-day data retention
- Batch size: 10,000 events
- ClickHouse 25.5.6, OTel Collector v0.129.12
- PostgreSQL, Redis, and RabbitMQ receivers configured

### signoz-values-prod.yaml

Production environment configuration with:
- High availability: 2+ replicas for critical components
- 3 Zookeeper replicas (required for production)
- 30-day data retention
- Batch size: 50,000 events (high-performance)
- Cold storage enabled with 30-day TTL
- Horizontal Pod Autoscaler (HPA) enabled
- TLS/SSL with cert-manager
- Enhanced security with pod anti-affinity rules

## Key Configuration Changes (v0.89.0+)

⚠️ **BREAKING CHANGE**: SigNoz Helm chart v0.89.0+ uses a unified component structure.

**Old Structure (deprecated):**
```yaml
frontend:
  replicaCount: 2
queryService:
  replicaCount: 2
```

**New Structure (current):**
```yaml
signoz:
  replicaCount: 2
  # Combines frontend + query service
```

## Component Architecture

### Core Components

1. **SigNoz** (unified component)
   - Frontend UI + Query Service
   - Ports: 8080 (HTTP/API), 8085 (internal gRPC)
   - Dev: 1 replica; Prod: 2+ replicas with HPA

2. **ClickHouse** (time-series database)
   - Version: 25.5.6
   - Stores traces, metrics, and logs
   - Dev: 1 replica; Prod: 2 replicas with cold storage

3. **Zookeeper** (ClickHouse coordination)
   - Version: 3.7.1
   - Dev: 1 replica; Prod: 3 replicas (critical for HA)

4. **OpenTelemetry Collector** (data ingestion)
   - Version: v0.129.12
   - Ports: 4317 (gRPC), 4318 (HTTP), 8888 (metrics)
   - Dev: 1 replica; Prod: 2+ replicas with HPA

5. **Alertmanager** (alert management)
   - Version: 0.23.5
   - Email and Slack integrations configured
   - Port: 9093

## Performance Optimizations

### Batch Processing
- **Development**: 10,000 events per batch
- **Production**: 50,000 events per batch (official recommendation)
- Timeout: 1 second for faster processing
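
The batch figures above correspond to the OpenTelemetry Collector's standard `batch` processor. A sketch of what the production settings could look like in the values file (key names follow the upstream collector; the exact nesting under `otelCollector.config` is an assumption here, not copied from the values files):

```yaml
otelCollector:
  config:
    processors:
      batch:
        send_batch_size: 50000  # production batch size noted above
        timeout: 1s             # flush at least once per second
```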

### Memory Management
- Memory limiter processor prevents OOM
- Dev: 400 MiB limit; Prod: 1500 MiB limit
- Spike limits configured
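
These limits map onto the collector's `memory_limiter` processor. A hedged sketch sized for dev (the 400 MiB hard limit comes from the bullet above; the spike limit value is illustrative):

```yaml
otelCollector:
  config:
    processors:
      memory_limiter:
        check_interval: 1s
        limit_mib: 400        # dev hard limit from above
        spike_limit_mib: 100  # illustrative spike headroom
```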

### Span Metrics Processor
Automatically generates RED metrics (Rate, Errors, Duration):
- Latency histogram buckets optimized for microservices
- Cache size: 10K (dev), 100K (prod)

### Cold Storage (Production Only)
- Enabled with 30-day TTL
- Automatically moves old data to cold storage
- Keeps 10 GB free on primary storage

## OpenTelemetry Endpoints

### From Within the Kubernetes Cluster

**Development:**
```
OTLP gRPC: signoz-otel-collector.bakery-ia.svc.cluster.local:4317
OTLP HTTP: signoz-otel-collector.bakery-ia.svc.cluster.local:4318
```

**Production:**
```
OTLP gRPC: signoz-otel-collector.bakery-ia.svc.cluster.local:4317
OTLP HTTP: signoz-otel-collector.bakery-ia.svc.cluster.local:4318
```

### Application Configuration Example

```yaml
# Python with OpenTelemetry
OTEL_EXPORTER_OTLP_ENDPOINT: "http://signoz-otel-collector.bakery-ia.svc.cluster.local:4318"
OTEL_EXPORTER_OTLP_PROTOCOL: "http/protobuf"
```

```javascript
// Node.js with OpenTelemetry
const exporter = new OTLPTraceExporter({
  url: 'http://signoz-otel-collector.bakery-ia.svc.cluster.local:4318/v1/traces',
});
```
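
With OTLP over HTTP, signal-specific paths (`/v1/traces`, `/v1/metrics`, `/v1/logs`) are appended to the base endpoint, which is why the Node.js example above uses the full `/v1/traces` URL while the Python example sets only the base endpoint. A small helper illustrating the convention (the default below reuses the in-cluster service name from this README):

```python
import os

OTLP_HTTP_DEFAULT = "http://signoz-otel-collector.bakery-ia.svc.cluster.local:4318"

def otlp_signal_url(signal, base=None):
    """Build the OTLP/HTTP URL for a signal ("traces", "metrics", or "logs")."""
    base = base or os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", OTLP_HTTP_DEFAULT)
    return f"{base.rstrip('/')}/v1/{signal}"

# otlp_signal_url("traces") yields the same URL the Node.js exporter above is given.
```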

## Deployment Scripts

### deploy-signoz.sh

Comprehensive deployment script with the following features:

```bash
# Usage
./deploy-signoz.sh [OPTIONS] ENVIRONMENT

# Options
-h, --help           Show help message
-d, --dry-run        Show what would be deployed
-u, --upgrade        Upgrade existing deployment
-r, --remove         Remove deployment
-n, --namespace NS   Custom namespace (default: signoz)

# Examples
./deploy-signoz.sh dev                # Deploy to dev
./deploy-signoz.sh --upgrade prod     # Upgrade prod
./deploy-signoz.sh --dry-run prod     # Preview changes
./deploy-signoz.sh --remove dev       # Remove dev deployment
```

**Features:**
- Automatic Helm repository setup
- Docker Hub secret creation
- Namespace management
- Deployment verification
- 15-minute timeout with `--wait` flag
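
The Docker Hub secret step can also be performed by hand. A sketch of the likely equivalent `kubectl` command (the secret name `dockerhub-creds` matches the one referenced in the troubleshooting section; the exact flags used by the script are an assumption):

```shell
# Manual equivalent of the script's Docker Hub secret creation (sketch)
kubectl create secret docker-registry dockerhub-creds \
  --docker-username="$DOCKERHUB_USERNAME" \
  --docker-password="$DOCKERHUB_PASSWORD" \
  --namespace signoz
```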

### verify-signoz.sh

Verification script to check deployment health:

```bash
# Usage
./verify-signoz.sh [OPTIONS] ENVIRONMENT

# Examples
./verify-signoz.sh dev    # Verify dev deployment
./verify-signoz.sh prod   # Verify prod deployment
```

**Checks performed:**
1. ✅ Helm release status
2. ✅ Pod health and readiness
3. ✅ Service availability
4. ✅ Ingress configuration
5. ✅ PVC status
6. ✅ Resource usage (if metrics-server is available)
7. ✅ Log errors
8. ✅ Environment-specific validations
   - Dev: single replica, resource limits
   - Prod: HA config, TLS, Zookeeper replicas, HPA

## Storage Configuration

### Development (Kind)
```yaml
global:
  storageClass: "standard"  # Kind's default provisioner
```

### Production (MicroK8s)
```yaml
global:
  storageClass: "microk8s-hostpath"  # Or a custom storage class
```

**Storage Requirements:**
- **Development**: ~35 GiB total
  - SigNoz: 5 GiB
  - ClickHouse: 20 GiB
  - Zookeeper: 5 GiB
  - Alertmanager: 2 GiB

- **Production**: ~135 GiB total
  - SigNoz: 20 GiB
  - ClickHouse: 100 GiB
  - Zookeeper: 10 GiB
  - Alertmanager: 5 GiB
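
As a quick sanity check, the per-component figures sum as follows. Note the listed dev components add up to 32 GiB, so the ~35 GiB figure includes a little headroom:

```python
def total_gib(components):
    """Sum per-component PVC sizes (GiB) from the lists above."""
    return sum(components.values())

dev = {"signoz": 5, "clickhouse": 20, "zookeeper": 5, "alertmanager": 2}
prod = {"signoz": 20, "clickhouse": 100, "zookeeper": 10, "alertmanager": 5}

print(total_gib(dev))   # 32
print(total_gib(prod))  # 135
```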

## Resource Requirements

### Development Environment
**Minimum:**
- CPU: 550m (0.55 cores)
- Memory: 1.6 GiB
- Storage: 35 GiB

**Recommended:**
- CPU: 3 cores
- Memory: 3 GiB
- Storage: 50 GiB

### Production Environment
**Minimum:**
- CPU: 3.5 cores
- Memory: 8 GiB
- Storage: 135 GiB

**Recommended:**
- CPU: 12 cores
- Memory: 20 GiB
- Storage: 200 GiB

## Data Retention

### Development
- Traces: 7 days (168 hours)
- Metrics: 7 days (168 hours)
- Logs: 7 days (168 hours)

### Production
- Traces: 30 days (720 hours)
- Metrics: 30 days (720 hours)
- Logs: 30 days (720 hours)
- Cold storage after 30 days

To modify retention, update the environment variables:
```yaml
signoz:
  env:
    signoz_traces_ttl_duration_hrs: "720"   # 30 days
    signoz_metrics_ttl_duration_hrs: "720"  # 30 days
    signoz_logs_ttl_duration_hrs: "168"     # 7 days
```
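
The TTL values are plain hour counts, so converting a retention policy expressed in days is just a multiplication. A tiny helper for generating the strings used above:

```python
def ttl_hours(days):
    """Convert a retention period in days to the hour-string format of the TTL env vars."""
    return str(days * 24)

print(ttl_hours(30))  # "720" -> e.g. signoz_traces_ttl_duration_hrs
print(ttl_hours(7))   # "168"
```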

## High Availability (Production)

### Replication Strategy
```yaml
signoz: 2 replicas + HPA (min: 2, max: 5)
clickhouse: 2 replicas
zookeeper: 3 replicas (critical!)
otelCollector: 2 replicas + HPA (min: 2, max: 10)
alertmanager: 2 replicas
```

### Pod Anti-Affinity
Ensures pods are distributed across different nodes:
```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app.kubernetes.io/component: query-service
          topologyKey: kubernetes.io/hostname
```

### Pod Disruption Budgets
Configured for all critical components:
```yaml
podDisruptionBudget:
  enabled: true
  minAvailable: 1
```

## Monitoring and Alerting

### Email Alerts (Production)
Configure SMTP in the production values (using Mailu with a Mailgun relay):
```yaml
signoz:
  env:
    signoz_smtp_enabled: "true"
    signoz_smtp_host: "mailu-smtp.bakery-ia.svc.cluster.local"
    signoz_smtp_port: "587"
    signoz_smtp_from: "alerts@bakewise.ai"
    signoz_smtp_username: "alerts@bakewise.ai"
    # Set via secret: signoz_smtp_password
```

**Note**: SigNoz now uses the internal Mailu SMTP service, which relays to Mailgun for better deliverability and centralized email management.

### Slack Alerts (Production)
Configure the webhook in Alertmanager:
```yaml
alertmanager:
  config:
    receivers:
      - name: 'critical-alerts'
        slack_configs:
          - api_url: '${SLACK_WEBHOOK_URL}'
            channel: '#alerts-critical'
```

### Mailgun Integration for Alert Emails

SigNoz is configured to send alert emails via Mailgun through the Mailu SMTP service. This provides:

**Benefits:**
- Better email deliverability through Mailgun's infrastructure
- Centralized email management via Mailu
- Improved tracking and analytics for alert emails
- Compliance with email-sending best practices

**Architecture:**
```
SigNoz Alertmanager → Mailu SMTP → Mailgun Relay → Recipients
```

**Configuration Requirements:**

1. **Mailu Configuration** (`infrastructure/platform/mail/mailu/mailu-configmap.yaml`):
   ```yaml
   RELAYHOST: "smtp.mailgun.org:587"
   RELAY_LOGIN: "postmaster@bakewise.ai"
   ```

2. **Mailu Secrets** (`infrastructure/platform/mail/mailu/mailu-secrets.yaml`):
   ```yaml
   RELAY_PASSWORD: "<mailgun-api-key>"  # Base64-encoded Mailgun API key
   ```

3. **DNS Configuration** (required for Mailgun):
   ```
   # MX record
   bakewise.ai. IN MX 10 mail.bakewise.ai.

   # SPF record (authorize Mailgun)
   bakewise.ai. IN TXT "v=spf1 include:mailgun.org ~all"

   # DKIM record (provided by Mailgun)
   m1._domainkey.bakewise.ai. IN TXT "v=DKIM1; k=rsa; p=<mailgun-public-key>"

   # DMARC record
   _dmarc.bakewise.ai. IN TXT "v=DMARC1; p=quarantine; rua=mailto:dmarc@bakewise.ai"
   ```

4. **SigNoz SMTP Configuration** (already configured in `signoz-values-prod.yaml`):
   ```yaml
   signoz_smtp_host: "mailu-smtp.bakery-ia.svc.cluster.local"
   signoz_smtp_port: "587"
   signoz_smtp_from: "alerts@bakewise.ai"
   ```

**Testing the Integration:**

1. Trigger a test alert from the SigNoz UI
2. Check Mailu logs: `kubectl logs -f mailu-smtp-<pod-id> -n bakery-ia`
3. Check the Mailgun dashboard for delivery status
4. Verify email receipt in the destination inbox

**Troubleshooting:**

- **SMTP Authentication Failed**: Verify the Mailu credentials and Mailgun API key
- **Email Delivery Delays**: Check the Mailu queue with `kubectl exec -it mailu-smtp-<pod-id> -n bakery-ia -- mailq`
- **SPF/DKIM Issues**: Verify the DNS records and Mailgun domain verification

### Self-Monitoring
SigNoz monitors itself:
```yaml
selfMonitoring:
  enabled: true
  serviceMonitor:
    enabled: true  # Prod only
    interval: 30s
```

## Troubleshooting

### Common Issues

**1. Pods not starting**
```bash
# Check pod status
kubectl get pods -n signoz

# Check pod logs
kubectl logs -n signoz <pod-name>

# Describe pod for events
kubectl describe pod -n signoz <pod-name>
```

**2. Docker Hub rate limits**
```bash
# Verify the secret exists
kubectl get secret dockerhub-creds -n signoz

# Recreate the secret
kubectl delete secret dockerhub-creds -n signoz
export DOCKERHUB_USERNAME='your-username'
export DOCKERHUB_PASSWORD='your-token'
./deploy-signoz.sh dev
```

**3. ClickHouse connection issues**
```bash
# Check the ClickHouse pod
kubectl logs -n signoz -l app.kubernetes.io/component=clickhouse

# Check Zookeeper (required by ClickHouse)
kubectl logs -n signoz -l app.kubernetes.io/component=zookeeper
```

**4. OTel Collector not receiving data**
```bash
# Check OTel Collector logs
kubectl logs -n signoz -l app.kubernetes.io/component=otel-collector

# Test connectivity
kubectl port-forward -n signoz svc/signoz-otel-collector 4318:4318
curl -v http://localhost:4318/v1/traces
```

**5. Insufficient storage**
```bash
# Check PVC status
kubectl get pvc -n signoz

# Check storage usage (if metrics-server is available)
kubectl top pods -n signoz
```

### Debug Mode

Enable the debug exporter in the OTel Collector:
```yaml
otelCollector:
  config:
    exporters:
      debug:
        verbosity: detailed
        sampling_initial: 5
        sampling_thereafter: 200
    service:
      pipelines:
        traces:
          exporters: [clickhousetraces, debug]  # Add debug
```

### Upgrade from an Old Version

If upgrading from pre-v0.89.0:
```bash
# 1. Back up data (recommended)
kubectl get all -n signoz -o yaml > signoz-backup.yaml

# 2. Remove the old deployment
./deploy-signoz.sh --remove prod

# 3. Deploy the new version
./deploy-signoz.sh prod

# 4. Verify
./verify-signoz.sh prod
```

## Security Best Practices

1. **Change the default password** immediately after first login
2. **Use TLS/SSL** in production (configured with cert-manager)
3. **Network policies** enabled in production
4. **Run as non-root** (configured in securityContext)
5. **RBAC** with a dedicated service account
6. **Secrets management** for sensitive data (SMTP, Slack webhooks)
7. **Image pull secrets** to avoid exposing Docker Hub credentials

## Backup and Recovery

### Backup ClickHouse Data
```bash
# Export ClickHouse data
kubectl exec -n signoz <clickhouse-pod> -- clickhouse-client \
  --query="BACKUP DATABASE signoz_traces TO Disk('backups', 'traces_backup.zip')"

# Copy the backup out
kubectl cp signoz/<clickhouse-pod>:/var/lib/clickhouse/backups/ ./backups/
```

### Restore from Backup
```bash
# Copy the backup in
kubectl cp ./backups/ signoz/<clickhouse-pod>:/var/lib/clickhouse/backups/

# Restore
kubectl exec -n signoz <clickhouse-pod> -- clickhouse-client \
  --query="RESTORE DATABASE signoz_traces FROM Disk('backups', 'traces_backup.zip')"
```

## Updating Configuration

To update the SigNoz configuration:

1. Edit the values file: `signoz-values-{env}.yaml`
2. Apply the changes:
   ```bash
   ./deploy-signoz.sh --upgrade {env}
   ```
3. Verify:
   ```bash
   ./verify-signoz.sh {env}
   ```

## Uninstallation

```bash
# Remove the SigNoz deployment
./deploy-signoz.sh --remove {env}

# Optionally delete PVCs (WARNING: deletes all data)
kubectl delete pvc -n signoz -l app.kubernetes.io/instance=signoz

# Optionally delete the namespace
kubectl delete namespace signoz
```

## References

- [SigNoz Official Documentation](https://signoz.io/docs/)
- [SigNoz Helm Charts Repository](https://github.com/SigNoz/charts)
- [OpenTelemetry Documentation](https://opentelemetry.io/docs/)
- [ClickHouse Documentation](https://clickhouse.com/docs/)

## Support

For issues or questions:
1. Check [SigNoz GitHub Issues](https://github.com/SigNoz/signoz/issues)
2. Review deployment logs: `kubectl logs -n signoz <pod-name>`
3. Run the verification script: `./verify-signoz.sh {env}`
4. Check the [SigNoz Community Slack](https://signoz.io/slack)

---

**Last Updated**: 2026-01-09
**SigNoz Helm Chart Version**: Latest (v0.129.12 components)
**Maintained by**: Bakery IA Team
190
infrastructure/monitoring/signoz/dashboards/README.md
Normal file
@@ -0,0 +1,190 @@

# SigNoz Dashboards for Bakery IA

This directory contains comprehensive SigNoz dashboard configurations for monitoring the Bakery IA system.

## Available Dashboards

### 1. Infrastructure Monitoring
- **File**: `infrastructure-monitoring.json`
- **Purpose**: Monitor Kubernetes infrastructure, pod health, and resource utilization
- **Key Metrics**: CPU usage, memory usage, network traffic, pod status, container health

### 2. Application Performance
- **File**: `application-performance.json`
- **Purpose**: Monitor microservice performance and API metrics
- **Key Metrics**: Request rate, error rate, latency percentiles, endpoint performance

### 3. Database Performance
- **File**: `database-performance.json`
- **Purpose**: Monitor PostgreSQL and Redis database performance
- **Key Metrics**: Connections, query execution time, cache hit ratio, locks, replication status

### 4. API Performance
- **File**: `api-performance.json`
- **Purpose**: Monitor REST and GraphQL API performance
- **Key Metrics**: Request volume, response times, status codes, endpoint analysis

### 5. Error Tracking
- **File**: `error-tracking.json`
- **Purpose**: Track and analyze system errors
- **Key Metrics**: Error rates, error distribution, recent errors, HTTP errors, database errors

### 6. User Activity
- **File**: `user-activity.json`
- **Purpose**: Monitor user behavior and activity patterns
- **Key Metrics**: Active users, sessions, API calls per user, session duration

### 7. System Health
- **File**: `system-health.json`
- **Purpose**: Overall system health monitoring
- **Key Metrics**: Availability, health scores, resource utilization, service status

### 8. Alert Management
- **File**: `alert-management.json`
- **Purpose**: Monitor and manage system alerts
- **Key Metrics**: Active alerts, alert rates, alert distribution, firing alerts

### 9. Log Analysis
- **File**: `log-analysis.json`
- **Purpose**: Search and analyze system logs
- **Key Metrics**: Log volume, error logs, log distribution, log search

## How to Import Dashboards

### Method 1: Using the SigNoz UI

1. **Access the SigNoz UI**: Open your SigNoz instance in a web browser
2. **Navigate to Dashboards**: Go to the "Dashboards" section
3. **Import Dashboard**: Click the "Import Dashboard" button
4. **Upload JSON**: Select the JSON file from this directory
5. **Configure**: Adjust any variables or settings as needed
6. **Save**: Save the imported dashboard

**Note**: The dashboards now use the correct SigNoz JSON schema with proper filter arrays.

### Method 2: Using the SigNoz API

```bash
# Import a single dashboard
curl -X POST "http://<SIGNOZ_HOST>:3301/api/v1/dashboards/import" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <API_KEY>" \
  -d @infrastructure-monitoring.json

# Import all dashboards
for file in *.json; do
  curl -X POST "http://<SIGNOZ_HOST>:3301/api/v1/dashboards/import" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer <API_KEY>" \
    -d @"$file"
done
```

### Method 3: Using a Kubernetes ConfigMap

```bash
# Create a ConfigMap with all dashboards
kubectl create configmap signoz-dashboards \
  --from-file=infrastructure-monitoring.json \
  --from-file=application-performance.json \
  --from-file=database-performance.json \
  --from-file=api-performance.json \
  --from-file=error-tracking.json \
  --from-file=user-activity.json \
  --from-file=system-health.json \
  --from-file=alert-management.json \
  --from-file=log-analysis.json \
  -n signoz
```

## Dashboard Variables

Most dashboards include variables that let you filter and customize the view:

- **Namespace**: Filter by Kubernetes namespace (e.g., `bakery-ia`, `default`)
- **Service**: Filter by specific microservice
- **Severity**: Filter by error/alert severity
- **Environment**: Filter by deployment environment
- **Time Range**: Adjust the time window for analysis
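
In the dashboard JSON, these variables follow SigNoz's variable schema, as used by the files in this directory. For example, the `service` variable in `alert-management.json` is defined roughly like this (abridged excerpt; see the file itself for the full field list):

```json
"variables": {
  "service": {
    "name": "service",
    "description": "Filter by service name",
    "type": "QUERY",
    "queryValue": "SELECT DISTINCT(resource_attrs['service.name']) as value FROM ...",
    "showALLOption": true,
    "multiSelect": false,
    "sort": "ASC"
  }
}
```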

## Metrics Reference

The dashboards use standard OpenTelemetry metrics. If you need to add custom metrics, ensure they are properly instrumented in your services.

## Troubleshooting

### Dashboard Import Errors

If you encounter errors when importing dashboards:

1. **Validate JSON**: Ensure the JSON files are valid
   ```bash
   jq . infrastructure-monitoring.json
   ```

2. **Check Metrics**: Verify that the metrics exist in your SigNoz instance

3. **Adjust Time Range**: Try different time ranges if no data appears

4. **Check Filters**: Ensure filters match your actual service names and tags

### "e.filter is not a function" Error

This error occurs when the dashboard JSON uses an incorrect filter format. The fix has been applied:

**Before (incorrect)**:
```json
"filters": {
  "namespace": "${namespace}"
}
```

**After (correct)**:
```json
"filters": [
  {
    "key": "namespace",
    "operator": "=",
    "value": "${namespace}"
  }
]
```

All dashboards in this directory now use the correct array format for filters.

### Missing Data

If dashboards show no data:

1. **Verify Instrumentation**: Ensure your services are properly instrumented with OpenTelemetry
2. **Check Time Range**: Adjust the time range to include recent data
3. **Validate Metrics**: Confirm the metrics are being collected and stored
4. **Review Filters**: Check that filters match your actual deployment

## Customization

You can customize these dashboards by:

1. **Editing JSON**: Modify the JSON files to add/remove panels or adjust queries
2. **Cloning in the UI**: Clone existing dashboards and modify them in the SigNoz UI
3. **Adding Variables**: Add new variables for additional filtering options
4. **Adjusting Layout**: Change the grid layout and panel sizes

## Best Practices

1. **Regular Reviews**: Review dashboards regularly to ensure they meet your monitoring needs
2. **Alert Integration**: Set up alerts based on key metrics shown in these dashboards
3. **Team Access**: Share relevant dashboards with appropriate team members
4. **Documentation**: Document any custom metrics or specific monitoring requirements

## Support

For issues with these dashboards:

1. Check the [SigNoz documentation](https://signoz.io/docs/)
2. Review the [Bakery IA monitoring guide](../SIGNOZ_COMPLETE_CONFIGURATION_GUIDE.md)
3. Consult the OpenTelemetry metrics specification

## License

These dashboard configurations are provided under the same license as the Bakery IA project.

@@ -0,0 +1,170 @@
{
  "description": "Alert monitoring and management dashboard",
  "tags": ["alerts", "monitoring", "management"],
  "name": "bakery-ia-alert-management",
  "title": "Bakery IA - Alert Management",
  "uploadedGrafana": false,
  "uuid": "bakery-ia-alerts-01",
  "version": "v4",
  "collapsableRowsMigrated": true,
  "layout": [
    {
      "x": 0,
      "y": 0,
      "w": 6,
      "h": 3,
      "i": "active-alerts",
      "moved": false,
      "static": false
    },
    {
      "x": 6,
      "y": 0,
      "w": 6,
      "h": 3,
      "i": "alert-rate",
      "moved": false,
      "static": false
    }
  ],
  "variables": {
    "service": {
      "id": "service-var",
      "name": "service",
      "description": "Filter by service name",
      "type": "QUERY",
      "queryValue": "SELECT DISTINCT(resource_attrs['service.name']) as value FROM signoz_metrics.distributed_time_series_v4_1day WHERE metric_name = 'alerts_active' AND value != '' ORDER BY value",
      "customValue": "",
      "textboxValue": "",
      "showALLOption": true,
      "multiSelect": false,
      "order": 1,
      "modificationUUID": "",
      "sort": "ASC",
      "selectedValue": null
    }
  },
  "widgets": [
    {
      "id": "active-alerts",
      "title": "Active Alerts",
      "description": "Number of currently active alerts",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "value",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "sum",
              "aggregateAttribute": {
                "key": "alerts_active",
                "dataType": "int64",
                "type": "Gauge",
                "isColumn": false
              },
              "timeAggregation": "latest",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": {
                "items": [
                  {
                    "key": {
                      "key": "serviceName",
                      "dataType": "string",
                      "type": "tag",
                      "isColumn": true
                    },
                    "op": "=",
                    "value": "{{.service}}"
                  }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [],
              "legend": "Active Alerts",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "none"
    },
    {
      "id": "alert-rate",
      "title": "Alert Rate",
      "description": "Rate of alerts over time",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "sum",
              "aggregateAttribute": {
                "key": "alerts_total",
                "dataType": "int64",
                "type": "Counter",
                "isColumn": false
              },
              "timeAggregation": "rate",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": {
                "items": [
                  {
                    "key": {
                      "key": "serviceName",
                      "dataType": "string",
                      "type": "tag",
                      "isColumn": true
                    },
                    "op": "=",
                    "value": "{{.service}}"
                  }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                {
                  "key": "serviceName",
                  "dataType": "string",
                  "type": "tag",
                  "isColumn": true
                }
              ],
              "legend": "{{serviceName}}",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "alerts/s"
    }
  ]
}
351
infrastructure/monitoring/signoz/dashboards/api-performance.json
Normal file
@@ -0,0 +1,351 @@
|
||||
{
  "description": "Comprehensive API performance monitoring for Bakery IA REST and GraphQL endpoints",
  "tags": ["api", "performance", "rest", "graphql"],
  "name": "bakery-ia-api-performance",
  "title": "Bakery IA - API Performance",
  "uploadedGrafana": false,
  "uuid": "bakery-ia-api-01",
  "version": "v4",
  "collapsableRowsMigrated": true,
  "layout": [
    { "x": 0, "y": 0, "w": 6, "h": 3, "i": "request-volume", "moved": false, "static": false },
    { "x": 6, "y": 0, "w": 6, "h": 3, "i": "error-rate", "moved": false, "static": false },
    { "x": 0, "y": 3, "w": 6, "h": 3, "i": "avg-response-time", "moved": false, "static": false },
    { "x": 6, "y": 3, "w": 6, "h": 3, "i": "p95-latency", "moved": false, "static": false }
  ],
  "variables": {
    "service": {
      "id": "service-var",
      "name": "service",
      "description": "Filter by API service",
      "type": "QUERY",
      "queryValue": "SELECT DISTINCT(resource_attrs['service.name']) as value FROM signoz_metrics.distributed_time_series_v4_1day WHERE metric_name = 'http_server_requests_seconds_count' AND value != '' ORDER BY value",
      "customValue": "",
      "textboxValue": "",
      "showALLOption": true,
      "multiSelect": false,
      "order": 1,
      "modificationUUID": "",
      "sort": "ASC",
      "selectedValue": null
    }
  },
  "widgets": [
    {
      "id": "request-volume",
      "title": "Request Volume",
      "description": "API request volume by service",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "sum",
              "aggregateAttribute": { "key": "http_server_requests_seconds_count", "dataType": "int64", "type": "Counter", "isColumn": false },
              "timeAggregation": "rate",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": {
                "items": [
                  { "key": { "key": "service.name", "dataType": "string", "type": "resource", "isColumn": false }, "op": "=", "value": "{{.service}}" }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "key": "api.name", "dataType": "string", "type": "resource", "isColumn": false }
              ],
              "legend": "{{api.name}}",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "req/s"
    },
    {
      "id": "error-rate",
      "title": "Error Rate",
      "description": "API error rate by service",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "sum",
              "aggregateAttribute": { "key": "http_server_requests_seconds_count", "dataType": "int64", "type": "Counter", "isColumn": false },
              "timeAggregation": "rate",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": {
                "items": [
                  { "key": { "key": "api.name", "dataType": "string", "type": "resource", "isColumn": false }, "op": "=", "value": "{{.api}}" },
                  { "key": { "key": "status_code", "dataType": "string", "type": "tag", "isColumn": false }, "op": "=~", "value": "5.." }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "key": "api.name", "dataType": "string", "type": "resource", "isColumn": false },
                { "key": "status_code", "dataType": "string", "type": "tag", "isColumn": false }
              ],
              "legend": "{{api.name}} - {{status_code}}",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "req/s"
    },
    {
      "id": "avg-response-time",
      "title": "Average Response Time",
      "description": "Average API response time by endpoint",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "avg",
              "aggregateAttribute": { "key": "http_server_requests_seconds_sum", "dataType": "float64", "type": "Counter", "isColumn": false },
              "timeAggregation": "avg",
              "spaceAggregation": "avg",
              "functions": [],
              "filters": {
                "items": [
                  { "key": { "key": "api.name", "dataType": "string", "type": "resource", "isColumn": false }, "op": "=", "value": "{{.api}}" }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "key": "api.name", "dataType": "string", "type": "resource", "isColumn": false },
                { "key": "endpoint", "dataType": "string", "type": "tag", "isColumn": false }
              ],
              "legend": "{{api.name}} - {{endpoint}}",
              "reduceTo": "avg"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "seconds"
    },
    {
      "id": "p95-latency",
      "title": "P95 Latency",
      "description": "95th percentile latency by endpoint",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "histogram_quantile",
              "aggregateAttribute": { "key": "http_server_requests_seconds_bucket", "dataType": "float64", "type": "Histogram", "isColumn": false },
              "timeAggregation": "avg",
              "spaceAggregation": "avg",
              "functions": [],
              "filters": {
                "items": [
                  { "key": { "key": "api.name", "dataType": "string", "type": "resource", "isColumn": false }, "op": "=", "value": "{{.api}}" }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "key": "api.name", "dataType": "string", "type": "resource", "isColumn": false },
                { "key": "endpoint", "dataType": "string", "type": "tag", "isColumn": false }
              ],
              "legend": "{{api.name}} - {{endpoint}}",
              "reduceTo": "avg"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "seconds"
    }
  ]
}
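Each dashboard above pairs every `layout` entry (`"i"`) with a widget of the same `"id"`, and relies on top-level fields like `uuid` and `title`. A small pre-import sanity check can catch mismatches before uploading a file; this is a minimal sketch (the function name and checks are our own, not part of SigNoz):

```python
import json

def check_dashboard(path: str) -> list[str]:
    """Return a list of problems found in a SigNoz dashboard JSON file."""
    with open(path) as fh:
        dash = json.load(fh)
    problems = []
    layout_ids = {cell["i"] for cell in dash.get("layout", [])}
    widget_ids = {w["id"] for w in dash.get("widgets", [])}
    # Every layout cell should reference an existing widget, and vice versa.
    for missing in sorted(layout_ids - widget_ids):
        problems.append(f"layout entry '{missing}' has no widget")
    for missing in sorted(widget_ids - layout_ids):
        problems.append(f"widget '{missing}' has no layout entry")
    # Fields the dashboards in this directory always set.
    for field in ("uuid", "title", "version"):
        if field not in dash:
            problems.append(f"missing top-level field '{field}'")
    return problems
```

Running it over `dashboards/*.json` before `deploy-signoz.sh` would, for example, flag a widget whose `id` was renamed without updating the matching `layout` entry.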
@@ -0,0 +1,333 @@
{
  "description": "Application performance monitoring dashboard using distributed traces and metrics",
  "tags": ["application", "performance", "traces", "apm"],
  "name": "bakery-ia-application-performance",
  "title": "Bakery IA - Application Performance (APM)",
  "uploadedGrafana": false,
  "uuid": "bakery-ia-apm-01",
  "version": "v4",
  "collapsableRowsMigrated": true,
  "layout": [
    { "x": 0, "y": 0, "w": 6, "h": 3, "i": "latency-p99", "moved": false, "static": false },
    { "x": 6, "y": 0, "w": 6, "h": 3, "i": "request-rate", "moved": false, "static": false },
    { "x": 0, "y": 3, "w": 6, "h": 3, "i": "error-rate", "moved": false, "static": false },
    { "x": 6, "y": 3, "w": 6, "h": 3, "i": "avg-duration", "moved": false, "static": false }
  ],
  "variables": {
    "service_name": {
      "id": "service-var",
      "name": "service_name",
      "description": "Filter by service name",
      "type": "QUERY",
      "queryValue": "SELECT DISTINCT(serviceName) FROM signoz_traces.distributed_signoz_index_v2 ORDER BY serviceName",
      "customValue": "",
      "textboxValue": "",
      "showALLOption": true,
      "multiSelect": false,
      "order": 1,
      "modificationUUID": "",
      "sort": "ASC",
      "selectedValue": null
    }
  },
  "widgets": [
    {
      "id": "latency-p99",
      "title": "P99 Latency",
      "description": "99th percentile latency for selected service",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "traces",
              "queryName": "A",
              "aggregateOperator": "p99",
              "aggregateAttribute": { "key": "duration_ns", "dataType": "float64", "type": "", "isColumn": true },
              "timeAggregation": "avg",
              "spaceAggregation": "p99",
              "functions": [],
              "filters": {
                "items": [
                  { "key": { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true }, "op": "=", "value": "{{.service_name}}" }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true }
              ],
              "legend": "{{serviceName}}",
              "reduceTo": "avg"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "ms"
    },
    {
      "id": "request-rate",
      "title": "Request Rate",
      "description": "Requests per second for the service",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "traces",
              "queryName": "A",
              "aggregateOperator": "count",
              "aggregateAttribute": { "key": "", "dataType": "", "type": "", "isColumn": false },
              "timeAggregation": "rate",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": {
                "items": [
                  { "key": { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true }, "op": "=", "value": "{{.service_name}}" }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true }
              ],
              "legend": "{{serviceName}}",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "reqps"
    },
    {
      "id": "error-rate",
      "title": "Error Rate",
      "description": "Rate of error-status spans for the service",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "traces",
              "queryName": "A",
              "aggregateOperator": "count",
              "aggregateAttribute": { "key": "", "dataType": "", "type": "", "isColumn": false },
              "timeAggregation": "rate",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": {
                "items": [
                  { "key": { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true }, "op": "=", "value": "{{.service_name}}" },
                  { "key": { "key": "status_code", "dataType": "string", "type": "tag", "isColumn": true }, "op": "=", "value": "STATUS_CODE_ERROR" }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true }
              ],
              "legend": "{{serviceName}}",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "reqps"
    },
    {
      "id": "avg-duration",
      "title": "Average Duration",
      "description": "Average request duration",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "traces",
              "queryName": "A",
              "aggregateOperator": "avg",
              "aggregateAttribute": { "key": "duration_ns", "dataType": "float64", "type": "", "isColumn": true },
              "timeAggregation": "avg",
              "spaceAggregation": "avg",
              "functions": [],
              "filters": {
                "items": [
                  { "key": { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true }, "op": "=", "value": "{{.service_name}}" }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true }
              ],
              "legend": "{{serviceName}}",
              "reduceTo": "avg"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "ms"
    }
  ]
}
@@ -0,0 +1,425 @@
{
  "description": "Comprehensive database performance monitoring for PostgreSQL, Redis, and RabbitMQ",
  "tags": ["database", "postgresql", "redis", "rabbitmq", "performance"],
  "name": "bakery-ia-database-performance",
  "title": "Bakery IA - Database Performance",
  "uploadedGrafana": false,
  "uuid": "bakery-ia-db-01",
  "version": "v4",
  "collapsableRowsMigrated": true,
  "layout": [
    { "x": 0, "y": 0, "w": 6, "h": 3, "i": "pg-connections", "moved": false, "static": false },
    { "x": 6, "y": 0, "w": 6, "h": 3, "i": "pg-db-size", "moved": false, "static": false },
    { "x": 0, "y": 3, "w": 6, "h": 3, "i": "redis-connected-clients", "moved": false, "static": false },
    { "x": 6, "y": 3, "w": 6, "h": 3, "i": "redis-memory", "moved": false, "static": false },
    { "x": 0, "y": 6, "w": 6, "h": 3, "i": "rabbitmq-messages", "moved": false, "static": false },
    { "x": 6, "y": 6, "w": 6, "h": 3, "i": "rabbitmq-consumers", "moved": false, "static": false }
  ],
  "variables": {
    "database": {
      "id": "database-var",
      "name": "database",
      "description": "Filter by PostgreSQL database name",
      "type": "QUERY",
      "queryValue": "SELECT DISTINCT(resource_attrs['postgresql.database.name']) as value FROM signoz_metrics.distributed_time_series_v4_1day WHERE metric_name = 'postgresql.db_size' AND value != '' ORDER BY value",
      "customValue": "",
      "textboxValue": "",
      "showALLOption": true,
      "multiSelect": false,
      "order": 1,
      "modificationUUID": "",
      "sort": "ASC",
      "selectedValue": null
    }
  },
  "widgets": [
    {
      "id": "pg-connections",
      "title": "PostgreSQL - Active Connections",
      "description": "Number of active PostgreSQL connections",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "sum",
              "aggregateAttribute": { "key": "postgresql.backends", "dataType": "float64", "type": "Gauge", "isColumn": false },
              "timeAggregation": "latest",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": {
                "items": [
                  { "key": { "key": "postgresql.database.name", "dataType": "string", "type": "resource", "isColumn": false }, "op": "=", "value": "{{.database}}" }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "key": "postgresql.database.name", "dataType": "string", "type": "resource", "isColumn": false }
              ],
              "legend": "{{postgresql.database.name}}",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "none"
    },
    {
      "id": "pg-db-size",
      "title": "PostgreSQL - Database Size",
      "description": "Size of PostgreSQL databases in bytes",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "sum",
              "aggregateAttribute": { "key": "postgresql.db_size", "dataType": "int64", "type": "Gauge", "isColumn": false },
              "timeAggregation": "latest",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": {
                "items": [
                  { "key": { "key": "postgresql.database.name", "dataType": "string", "type": "resource", "isColumn": false }, "op": "=", "value": "{{.database}}" }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "key": "postgresql.database.name", "dataType": "string", "type": "resource", "isColumn": false }
              ],
              "legend": "{{postgresql.database.name}}",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "bytes"
    },
    {
      "id": "redis-connected-clients",
      "title": "Redis - Connected Clients",
      "description": "Number of clients connected to Redis",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "avg",
              "aggregateAttribute": { "key": "redis.clients.connected", "dataType": "int64", "type": "Gauge", "isColumn": false },
              "timeAggregation": "latest",
              "spaceAggregation": "avg",
              "functions": [],
              "filters": { "items": [], "op": "AND" },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "key": "host.name", "dataType": "string", "type": "resource", "isColumn": false }
              ],
              "legend": "{{host.name}}",
              "reduceTo": "avg"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "none"
    },
    {
      "id": "redis-memory",
      "title": "Redis - Memory Usage",
      "description": "Redis memory usage in bytes",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "avg",
              "aggregateAttribute": { "key": "redis.memory.used", "dataType": "int64", "type": "Gauge", "isColumn": false },
              "timeAggregation": "latest",
              "spaceAggregation": "avg",
              "functions": [],
              "filters": { "items": [], "op": "AND" },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "key": "host.name", "dataType": "string", "type": "resource", "isColumn": false }
              ],
              "legend": "{{host.name}}",
              "reduceTo": "avg"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "bytes"
    },
    {
      "id": "rabbitmq-messages",
      "title": "RabbitMQ - Current Messages",
      "description": "Number of messages currently in RabbitMQ queues",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "sum",
              "aggregateAttribute": { "key": "rabbitmq.message.current", "dataType": "int64", "type": "Gauge", "isColumn": false },
              "timeAggregation": "latest",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": { "items": [], "op": "AND" },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "key": "queue", "dataType": "string", "type": "tag", "isColumn": false }
              ],
              "legend": "Queue: {{queue}}",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "none"
    },
    {
      "id": "rabbitmq-consumers",
      "title": "RabbitMQ - Consumer Count",
      "description": "Number of consumers per queue",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "sum",
              "aggregateAttribute": { "key": "rabbitmq.consumer.count", "dataType": "int64", "type": "Gauge", "isColumn": false },
              "timeAggregation": "latest",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": { "items": [], "op": "AND" },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "key": "queue", "dataType": "string", "type": "tag", "isColumn": false }
              ],
              "legend": "Queue: {{queue}}",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "none"
    }
  ]
}
348
infrastructure/monitoring/signoz/dashboards/error-tracking.json
Normal file
@@ -0,0 +1,348 @@
{
  "description": "Comprehensive error tracking and analysis dashboard",
  "tags": ["errors", "exceptions", "tracking"],
  "name": "bakery-ia-error-tracking",
  "title": "Bakery IA - Error Tracking",
  "uploadedGrafana": false,
  "uuid": "bakery-ia-errors-01",
  "version": "v4",
  "collapsableRowsMigrated": true,
  "layout": [
    { "x": 0, "y": 0, "w": 6, "h": 3, "i": "total-errors", "moved": false, "static": false },
    { "x": 6, "y": 0, "w": 6, "h": 3, "i": "error-rate", "moved": false, "static": false },
    { "x": 0, "y": 3, "w": 6, "h": 3, "i": "http-5xx", "moved": false, "static": false },
    { "x": 6, "y": 3, "w": 6, "h": 3, "i": "http-4xx", "moved": false, "static": false }
  ],
  "variables": {
    "service": {
      "id": "service-var",
      "name": "service",
      "description": "Filter by service name",
      "type": "QUERY",
      "queryValue": "SELECT DISTINCT(resource_attrs['service.name']) as value FROM signoz_metrics.distributed_time_series_v4_1day WHERE metric_name = 'error_total' AND value != '' ORDER BY value",
      "customValue": "",
      "textboxValue": "",
      "showALLOption": true,
      "multiSelect": false,
      "order": 1,
      "modificationUUID": "",
      "sort": "ASC",
      "selectedValue": null
    }
  },
  "widgets": [
    {
      "id": "total-errors",
      "title": "Total Errors",
      "description": "Total number of errors across all services",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "value",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "sum",
              "aggregateAttribute": { "key": "error_total", "dataType": "int64", "type": "Counter", "isColumn": false },
              "timeAggregation": "sum",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": {
                "items": [
                  { "key": { "key": "service.name", "dataType": "string", "type": "resource", "isColumn": false }, "op": "=", "value": "{{.service}}" }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [],
              "legend": "Total Errors",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "none"
    },
    {
      "id": "error-rate",
      "title": "Error Rate",
      "description": "Error rate over time",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "sum",
              "aggregateAttribute": { "key": "error_total", "dataType": "int64", "type": "Counter", "isColumn": false },
              "timeAggregation": "rate",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": {
                "items": [
                  { "key": { "key": "service.name", "dataType": "string", "type": "resource", "isColumn": false }, "op": "=", "value": "{{.service}}" }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true }
              ],
              "legend": "{{serviceName}}",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "errors/s"
    },
    {
      "id": "http-5xx",
      "title": "HTTP 5xx Errors",
      "description": "Server errors (5xx status codes)",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "sum",
              "aggregateAttribute": { "key": "http_server_requests_seconds_count", "dataType": "int64", "type": "Counter", "isColumn": false },
              "timeAggregation": "sum",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": {
                "items": [
                  { "key": { "key": "service.name", "dataType": "string", "type": "resource", "isColumn": false }, "op": "=", "value": "{{.service}}" },
                  { "key": { "key": "status_code", "dataType": "string", "type": "tag", "isColumn": false }, "op": "=~", "value": "5.." }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true },
                { "key": "status_code", "dataType": "string", "type": "tag", "isColumn": false }
              ],
              "legend": "{{serviceName}} - {{status_code}}",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "number"
    },
    {
      "id": "http-4xx",
      "title": "HTTP 4xx Errors",
      "description": "Client errors (4xx status codes)",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "sum",
              "aggregateAttribute": { "key": "http_server_requests_seconds_count", "dataType": "int64", "type": "Counter", "isColumn": false },
              "timeAggregation": "sum",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": {
                "items": [
                  { "key": { "key": "service.name", "dataType": "string", "type": "resource", "isColumn": false }, "op": "=", "value": "{{.service}}" },
                  { "key": { "key": "status_code", "dataType": "string", "type": "tag", "isColumn": false }, "op": "=~", "value": "4.." }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true },
                { "key": "status_code", "dataType": "string", "type": "tag", "isColumn": false }
              ],
              "legend": "{{serviceName}} - {{status_code}}",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "number"
    }
  ]
}
213
infrastructure/monitoring/signoz/dashboards/index.json
Normal file
@@ -0,0 +1,213 @@
{
  "name": "Bakery IA Dashboard Collection",
  "description": "Complete set of SigNoz dashboards for Bakery IA monitoring",
  "version": "1.0.0",
  "author": "Bakery IA Team",
  "license": "MIT",
  "dashboards": [
    { "id": "infrastructure-monitoring", "name": "Infrastructure Monitoring", "description": "Kubernetes infrastructure and resource monitoring", "file": "infrastructure-monitoring.json", "tags": ["infrastructure", "kubernetes", "system"], "category": "infrastructure" },
    { "id": "application-performance", "name": "Application Performance", "description": "Microservice performance and API metrics", "file": "application-performance.json", "tags": ["application", "performance", "apm"], "category": "performance" },
    { "id": "database-performance", "name": "Database Performance", "description": "PostgreSQL and Redis database monitoring", "file": "database-performance.json", "tags": ["database", "postgresql", "redis"], "category": "database" },
    { "id": "api-performance", "name": "API Performance", "description": "REST and GraphQL API performance monitoring", "file": "api-performance.json", "tags": ["api", "rest", "graphql"], "category": "api" },
    { "id": "error-tracking", "name": "Error Tracking", "description": "System error tracking and analysis", "file": "error-tracking.json", "tags": ["errors", "exceptions", "tracking"], "category": "monitoring" },
    { "id": "user-activity", "name": "User Activity", "description": "User behavior and activity monitoring", "file": "user-activity.json", "tags": ["user", "activity", "behavior"], "category": "user" },
    { "id": "system-health", "name": "System Health", "description": "Overall system health monitoring", "file": "system-health.json", "tags": ["system", "health", "overview"], "category": "overview" },
    { "id": "alert-management", "name": "Alert Management", "description": "Alert monitoring and management", "file": "alert-management.json", "tags": ["alerts", "notifications", "management"], "category": "alerts" },
    { "id": "log-analysis", "name": "Log Analysis", "description": "Log search and analysis", "file": "log-analysis.json", "tags": ["logs", "search", "analysis"], "category": "logs" }
  ],
  "categories": [
    { "id": "infrastructure", "name": "Infrastructure", "description": "Kubernetes and system infrastructure monitoring" },
    { "id": "performance", "name": "Performance", "description": "Application and service performance monitoring" },
    { "id": "database", "name": "Database", "description": "Database performance and health monitoring" },
    { "id": "api", "name": "API", "description": "API performance and usage monitoring" },
    { "id": "monitoring", "name": "Monitoring", "description": "Error tracking and system monitoring" },
    { "id": "user", "name": "User", "description": "User activity and behavior monitoring" },
    { "id": "overview", "name": "Overview", "description": "System-wide overview and health dashboards" },
    { "id": "alerts", "name": "Alerts", "description": "Alert management and monitoring" },
    { "id": "logs", "name": "Logs", "description": "Log analysis and search" }
  ],
  "usage": {
    "import_methods": ["ui_import", "api_import", "kubernetes_configmap"],
    "recommended_import_order": [
      "infrastructure-monitoring",
      "system-health",
      "application-performance",
      "api-performance",
      "database-performance",
      "error-tracking",
      "alert-management",
      "log-analysis",
      "user-activity"
    ]
  },
  "requirements": {
    "signoz_version": ">= 0.10.0",
    "opentelemetry_collector": ">= 0.45.0",
    "metrics": [
      "container_cpu_usage_seconds_total",
      "container_memory_working_set_bytes",
      "http_server_requests_seconds_count",
      "http_server_requests_seconds_sum",
      "pg_stat_activity_count",
      "pg_stat_statements_total_time",
      "error_total",
      "alerts_total",
      "kube_pod_status_phase",
      "container_network_receive_bytes_total",
      "kube_pod_container_status_restarts_total",
      "kube_pod_container_status_ready",
      "container_fs_reads_total",
      "kubernetes_events",
      "http_server_requests_seconds_bucket",
      "http_server_active_requests",
      "http_server_up",
      "db_query_duration_seconds_sum",
      "db_connections_active",
      "http_client_request_duration_seconds_count",
      "http_client_request_duration_seconds_sum",
      "graphql_execution_time_seconds",
      "graphql_errors_total",
      "pg_stat_database_blks_hit",
      "pg_stat_database_xact_commit",
      "pg_locks_count",
      "pg_table_size_bytes",
      "pg_stat_user_tables_seq_scan",
      "redis_memory_used_bytes",
      "redis_commands_processed_total",
      "redis_keyspace_hits",
      "pg_stat_database_deadlocks",
      "pg_stat_database_conn_errors",
      "pg_replication_lag_bytes",
      "pg_replication_is_replica",
      "active_users",
      "user_sessions_total",
      "api_calls_per_user",
      "session_duration_seconds",
      "system_availability",
      "service_health_score",
      "system_cpu_usage",
      "system_memory_usage",
      "service_availability",
      "alerts_active",
      "log_lines_total"
    ]
  },
  "support": {
    "documentation": "https://signoz.io/docs/",
    "bakery_ia_docs": "../SIGNOZ_COMPLETE_CONFIGURATION_GUIDE.md",
    "issues": "https://github.com/your-repo/issues"
  },
  "notes": {
    "format_fix": "All dashboards have been updated to use the correct SigNoz JSON schema with proper filter arrays to resolve the 'e.filter is not a function' error.",
    "compatibility": "Tested with SigNoz v0.10.0+ and OpenTelemetry Collector v0.45.0+",
    "customization": "You can customize these dashboards by editing the JSON files or cloning them in the SigNoz UI"
  }
}
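The `usage.recommended_import_order` field above lists dashboard ids, not file names, so an import helper has to resolve each id against the `dashboards` array before loading files. A minimal sketch of that lookup (the trimmed `index` dict below is a hand-copied excerpt of index.json, not the full file, and the actual import call is out of scope):

```python
# Resolve the recommended import order from the dashboard index to file names.
# `index` is a trimmed excerpt of index.json above; in practice you would
# json.load() the real file from disk.
index = {
    "dashboards": [
        {"id": "infrastructure-monitoring", "file": "infrastructure-monitoring.json"},
        {"id": "system-health", "file": "system-health.json"},
        {"id": "log-analysis", "file": "log-analysis.json"},
    ],
    "usage": {
        "recommended_import_order": [
            "infrastructure-monitoring",
            "system-health",
            "log-analysis",
        ]
    },
}

def files_in_import_order(index: dict) -> list:
    """Map each dashboard id in recommended_import_order to its JSON file."""
    by_id = {d["id"]: d["file"] for d in index["dashboards"]}
    return [by_id[dashboard_id] for dashboard_id in index["usage"]["recommended_import_order"]]

print(files_in_import_order(index))
```

Iterating this list and POSTing each file to the SigNoz dashboard API (or wrapping them in a ConfigMap) corresponds to the `api_import` and `kubernetes_configmap` methods named above.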
@@ -0,0 +1,437 @@
{
  "description": "Comprehensive infrastructure monitoring dashboard for Bakery IA Kubernetes cluster",
  "tags": ["infrastructure", "kubernetes", "k8s", "system"],
  "name": "bakery-ia-infrastructure-monitoring",
  "title": "Bakery IA - Infrastructure Monitoring",
  "uploadedGrafana": false,
  "uuid": "bakery-ia-infra-01",
  "version": "v4",
  "collapsableRowsMigrated": true,
  "layout": [
    { "x": 0, "y": 0, "w": 6, "h": 3, "i": "pod-count", "moved": false, "static": false },
    { "x": 6, "y": 0, "w": 6, "h": 3, "i": "pod-phase", "moved": false, "static": false },
    { "x": 0, "y": 3, "w": 6, "h": 3, "i": "container-restarts", "moved": false, "static": false },
    { "x": 6, "y": 3, "w": 6, "h": 3, "i": "node-condition", "moved": false, "static": false },
    { "x": 0, "y": 6, "w": 12, "h": 3, "i": "deployment-status", "moved": false, "static": false }
  ],
  "variables": {
    "namespace": {
      "id": "namespace-var",
      "name": "namespace",
      "description": "Filter by Kubernetes namespace",
      "type": "QUERY",
      "queryValue": "SELECT DISTINCT(resource_attrs['k8s.namespace.name']) as value FROM signoz_metrics.distributed_time_series_v4_1day WHERE metric_name = 'k8s.pod.phase' AND value != '' ORDER BY value",
      "customValue": "",
      "textboxValue": "",
      "showALLOption": true,
      "multiSelect": false,
      "order": 1,
      "modificationUUID": "",
      "sort": "ASC",
      "selectedValue": "bakery-ia"
    }
  },
  "widgets": [
    {
      "id": "pod-count",
      "title": "Total Pods",
      "description": "Total number of pods in the namespace",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "value",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "count",
              "aggregateAttribute": { "key": "k8s.pod.phase", "dataType": "int64", "type": "Gauge", "isColumn": false },
              "timeAggregation": "latest",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": {
                "items": [
                  {
                    "id": "filter-k8s-namespace",
                    "key": { "id": "k8s.namespace.name--string--tag--false", "key": "k8s.namespace.name", "dataType": "string", "type": "tag", "isColumn": false },
                    "op": "=",
                    "value": "{{.namespace}}"
                  }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [],
              "legend": "Total Pods",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "none"
    },
    {
      "id": "pod-phase",
      "title": "Pod Phase Distribution",
      "description": "Pods by phase (Running, Pending, Failed, etc.)",
      "isStacked": true,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "sum",
              "aggregateAttribute": { "key": "k8s.pod.phase", "dataType": "int64", "type": "Gauge", "isColumn": false },
              "timeAggregation": "latest",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": {
                "items": [
                  {
                    "id": "filter-k8s-namespace",
                    "key": { "id": "k8s.namespace.name--string--tag--false", "key": "k8s.namespace.name", "dataType": "string", "type": "tag", "isColumn": false },
                    "op": "=",
                    "value": "{{.namespace}}"
                  }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "key": "phase", "dataType": "string", "type": "tag", "isColumn": false }
              ],
              "legend": "{{phase}}",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "none"
    },
    {
      "id": "container-restarts",
      "title": "Container Restarts",
      "description": "Container restart count over time",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "sum",
              "aggregateAttribute": { "key": "k8s.container.restarts", "dataType": "int64", "type": "Gauge", "isColumn": false },
              "timeAggregation": "increase",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": {
                "items": [
                  {
                    "id": "filter-k8s-namespace",
                    "key": { "id": "k8s.namespace.name--string--tag--false", "key": "k8s.namespace.name", "dataType": "string", "type": "tag", "isColumn": false },
                    "op": "=",
                    "value": "{{.namespace}}"
                  }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "id": "k8s.pod.name--string--tag--false", "key": "k8s.pod.name", "dataType": "string", "type": "tag", "isColumn": false }
              ],
              "legend": "{{k8s.pod.name}}",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "none"
    },
    {
      "id": "node-condition",
      "title": "Node Conditions",
      "description": "Node condition status (Ready, MemoryPressure, DiskPressure, etc.)",
      "isStacked": true,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "sum",
              "aggregateAttribute": { "key": "k8s.node.condition_ready", "dataType": "int64", "type": "Gauge", "isColumn": false },
              "timeAggregation": "latest",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": {
                "items": [],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "id": "k8s.node.name--string--tag--false", "key": "k8s.node.name", "dataType": "string", "type": "tag", "isColumn": false }
              ],
              "legend": "{{k8s.node.name}} Ready",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "none"
    },
    {
      "id": "deployment-status",
      "title": "Deployment Status (Desired vs Available)",
      "description": "Deployment replicas: desired vs available",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "avg",
              "aggregateAttribute": { "key": "k8s.deployment.desired", "dataType": "int64", "type": "Gauge", "isColumn": false },
              "timeAggregation": "latest",
              "spaceAggregation": "avg",
              "functions": [],
              "filters": {
                "items": [
                  {
                    "id": "filter-k8s-namespace",
                    "key": { "id": "k8s.namespace.name--string--tag--false", "key": "k8s.namespace.name", "dataType": "string", "type": "tag", "isColumn": false },
                    "op": "=",
                    "value": "{{.namespace}}"
                  }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "id": "k8s.deployment.name--string--tag--false", "key": "k8s.deployment.name", "dataType": "string", "type": "tag", "isColumn": false }
              ],
              "legend": "{{k8s.deployment.name}} (desired)",
              "reduceTo": "avg"
            },
            {
              "dataSource": "metrics",
              "queryName": "B",
              "aggregateOperator": "avg",
              "aggregateAttribute": { "key": "k8s.deployment.available", "dataType": "int64", "type": "Gauge", "isColumn": false },
              "timeAggregation": "latest",
              "spaceAggregation": "avg",
              "functions": [],
              "filters": {
                "items": [
                  {
                    "id": "filter-k8s-namespace",
                    "key": { "id": "k8s.namespace.name--string--tag--false", "key": "k8s.namespace.name", "dataType": "string", "type": "tag", "isColumn": false },
                    "op": "=",
                    "value": "{{.namespace}}"
                  }
                ],
                "op": "AND"
              },
              "expression": "B",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "id": "k8s.deployment.name--string--tag--false", "key": "k8s.deployment.name", "dataType": "string", "type": "tag", "isColumn": false }
              ],
              "legend": "{{k8s.deployment.name}} (available)",
              "reduceTo": "avg"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "none"
    }
  ]
}
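Every entry in the dashboard's `layout` array references a widget through its `i` field, which must match a widget's `id`; a mismatch leaves a grid slot with no panel behind it. A quick consistency check one could run over these files before importing (the `dashboard` dict below is a trimmed excerpt of the file above):

```python
# Check that every layout slot in a SigNoz dashboard maps to a real widget.
# `dashboard` is a trimmed excerpt of infrastructure-monitoring.json above.
dashboard = {
    "layout": [
        {"x": 0, "y": 0, "w": 6, "h": 3, "i": "pod-count"},
        {"x": 6, "y": 0, "w": 6, "h": 3, "i": "pod-phase"},
    ],
    "widgets": [
        {"id": "pod-count", "title": "Total Pods"},
        {"id": "pod-phase", "title": "Pod Phase Distribution"},
    ],
}

def unmatched_layout_ids(dashboard: dict) -> set:
    """Return layout 'i' values with no matching widget 'id' (empty set = consistent)."""
    widget_ids = {widget["id"] for widget in dashboard["widgets"]}
    return {item["i"] for item in dashboard["layout"]} - widget_ids

print(unmatched_layout_ids(dashboard))  # set() when layout and widgets agree
```

Running the same check against each full dashboard JSON (after `json.load`) is a cheap guard before an API or ConfigMap import.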
333
infrastructure/monitoring/signoz/dashboards/log-analysis.json
Normal file
@@ -0,0 +1,333 @@
{
  "description": "Comprehensive log analysis and search dashboard",
  "tags": ["logs", "analysis", "search"],
  "name": "bakery-ia-log-analysis",
  "title": "Bakery IA - Log Analysis",
  "uploadedGrafana": false,
  "uuid": "bakery-ia-logs-01",
  "version": "v4",
  "collapsableRowsMigrated": true,
  "layout": [
    { "x": 0, "y": 0, "w": 6, "h": 3, "i": "log-volume", "moved": false, "static": false },
    { "x": 6, "y": 0, "w": 6, "h": 3, "i": "error-logs", "moved": false, "static": false },
    { "x": 0, "y": 3, "w": 6, "h": 3, "i": "logs-by-level", "moved": false, "static": false },
    { "x": 6, "y": 3, "w": 6, "h": 3, "i": "logs-by-service", "moved": false, "static": false }
  ],
  "variables": {
    "service": {
      "id": "service-var",
      "name": "service",
      "description": "Filter by service name",
      "type": "QUERY",
      "queryValue": "SELECT DISTINCT(resource_attrs['service.name']) as value FROM signoz_metrics.distributed_time_series_v4_1day WHERE metric_name = 'log_lines_total' AND value != '' ORDER BY value",
      "customValue": "",
      "textboxValue": "",
      "showALLOption": true,
      "multiSelect": false,
      "order": 1,
      "modificationUUID": "",
      "sort": "ASC",
      "selectedValue": null
    }
  },
  "widgets": [
    {
      "id": "log-volume",
      "title": "Log Volume",
      "description": "Total log volume by service",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "sum",
              "aggregateAttribute": { "key": "log_lines_total", "dataType": "int64", "type": "Counter", "isColumn": false },
              "timeAggregation": "rate",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": {
                "items": [
                  { "key": { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true }, "op": "=", "value": "{{.service}}" }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true }
              ],
              "legend": "{{serviceName}}",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "logs/s"
    },
    {
      "id": "error-logs",
      "title": "Error Logs",
      "description": "Error log volume by service",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "sum",
              "aggregateAttribute": { "key": "log_lines_total", "dataType": "int64", "type": "Counter", "isColumn": false },
              "timeAggregation": "rate",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": {
                "items": [
                  { "key": { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true }, "op": "=", "value": "{{.service}}" },
                  { "key": { "key": "level", "dataType": "string", "type": "tag", "isColumn": false }, "op": "=", "value": "error" }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true }
              ],
              "legend": "{{serviceName}} (errors)",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "logs/s"
    },
    {
      "id": "logs-by-level",
      "title": "Logs by Level",
      "description": "Distribution of logs by severity level",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "pie",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "sum",
              "aggregateAttribute": { "key": "log_lines_total", "dataType": "int64", "type": "Counter", "isColumn": false },
              "timeAggregation": "sum",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": {
                "items": [
                  { "key": { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true }, "op": "=", "value": "{{.service}}" }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "key": "level", "dataType": "string", "type": "tag", "isColumn": false }
              ],
              "legend": "{{level}}",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "none"
    },
    {
      "id": "logs-by-service",
      "title": "Logs by Service",
      "description": "Distribution of logs by service",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "pie",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "sum",
              "aggregateAttribute": { "key": "log_lines_total", "dataType": "int64", "type": "Counter", "isColumn": false },
              "timeAggregation": "sum",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": {
                "items": [
                  { "key": { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true }, "op": "=", "value": "{{.service}}" }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true }
              ],
              "legend": "{{serviceName}}",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "none"
    }
  ]
}
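Filter values such as `{{.service}}` in the file above are Go-template style placeholders that SigNoz resolves from the dashboard `variables` block at query time (the `{{serviceName}}` form in legends is a separate, dot-free legend syntax). A toy substitution to illustrate the intent, not SigNoz's actual template engine:

```python
import re

def resolve(value: str, variables: dict) -> str:
    """Replace {{.name}} placeholders with values from a variables mapping."""
    return re.sub(r"\{\{\.(\w+)\}\}", lambda m: str(variables[m.group(1)]), value)

print(resolve("{{.service}}", {"service": "orders"}))  # prints "orders"
```

This is why a dashboard only renders correctly when every placeholder name (`service`, `namespace`) has a matching entry under `variables`.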
303
infrastructure/monitoring/signoz/dashboards/system-health.json
Normal file
@@ -0,0 +1,303 @@
{
  "description": "Comprehensive system health monitoring dashboard",
  "tags": ["system", "health", "monitoring"],
  "name": "bakery-ia-system-health",
  "title": "Bakery IA - System Health",
  "uploadedGrafana": false,
  "uuid": "bakery-ia-health-01",
  "version": "v4",
  "collapsableRowsMigrated": true,
  "layout": [
    { "x": 0, "y": 0, "w": 6, "h": 3, "i": "system-availability", "moved": false, "static": false },
    { "x": 6, "y": 0, "w": 6, "h": 3, "i": "health-score", "moved": false, "static": false },
    { "x": 0, "y": 3, "w": 6, "h": 3, "i": "cpu-usage", "moved": false, "static": false },
    { "x": 6, "y": 3, "w": 6, "h": 3, "i": "memory-usage", "moved": false, "static": false }
  ],
  "variables": {
    "namespace": {
      "id": "namespace-var",
      "name": "namespace",
      "description": "Filter by Kubernetes namespace",
      "type": "QUERY",
      "queryValue": "SELECT DISTINCT(resource_attrs['k8s.namespace.name']) as value FROM signoz_metrics.distributed_time_series_v4_1day WHERE metric_name = 'system_availability' AND value != '' ORDER BY value",
      "customValue": "",
      "textboxValue": "",
      "showALLOption": true,
      "multiSelect": false,
      "order": 1,
      "modificationUUID": "",
      "sort": "ASC",
      "selectedValue": "bakery-ia"
    }
  },
  "widgets": [
    {
      "id": "system-availability",
      "title": "System Availability",
      "description": "Overall system availability percentage",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "value",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "avg",
              "aggregateAttribute": { "key": "system_availability", "dataType": "float64", "type": "Gauge", "isColumn": false },
              "timeAggregation": "latest",
              "spaceAggregation": "avg",
              "functions": [],
              "filters": {
                "items": [
                  {
                    "id": "filter-k8s-namespace",
                    "key": { "id": "k8s.namespace.name--string--tag--false", "key": "k8s.namespace.name", "dataType": "string", "type": "tag", "isColumn": false },
                    "op": "=",
                    "value": "{{.namespace}}"
                  }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [],
              "legend": "System Availability",
              "reduceTo": "avg"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "percent"
    },
    {
      "id": "health-score",
      "title": "Service Health Score",
      "description": "Overall service health score",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "value",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "avg",
              "aggregateAttribute": { "key": "service_health_score", "dataType": "float64", "type": "Gauge", "isColumn": false },
              "timeAggregation": "latest",
              "spaceAggregation": "avg",
              "functions": [],
              "filters": {
                "items": [
                  {
                    "id": "filter-k8s-namespace",
                    "key": { "id": "k8s.namespace.name--string--tag--false", "key": "k8s.namespace.name", "dataType": "string", "type": "tag", "isColumn": false },
                    "op": "=",
                    "value": "{{.namespace}}"
                  }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [],
              "legend": "Health Score",
              "reduceTo": "avg"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "none"
    },
    {
      "id": "cpu-usage",
      "title": "CPU Usage",
      "description": "System CPU usage over time",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "avg",
              "aggregateAttribute": { "key": "system_cpu_usage", "dataType": "float64", "type": "Gauge", "isColumn": false },
              "timeAggregation": "avg",
              "spaceAggregation": "avg",
              "functions": [],
              "filters": {
                "items": [
                  {
                    "id": "filter-k8s-namespace",
                    "key": { "id": "k8s.namespace.name--string--tag--false", "key": "k8s.namespace.name", "dataType": "string", "type": "tag", "isColumn": false },
                    "op": "=",
                    "value": "{{.namespace}}"
                  }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [],
              "legend": "CPU Usage",
              "reduceTo": "avg"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "percent"
    },
    {
      "id": "memory-usage",
      "title": "Memory Usage",
      "description": "System memory usage over time",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "avg",
              "aggregateAttribute": { "key": "system_memory_usage", "dataType": "float64", "type": "Gauge", "isColumn": false },
              "timeAggregation": "avg",
              "spaceAggregation": "avg",
              "functions": [],
              "filters": {
                "items": [
                  {
                    "id": "filter-k8s-namespace",
                    "key": { "id": "k8s.namespace.name--string--tag--false", "key": "k8s.namespace.name", "dataType": "string", "type": "tag", "isColumn": false },
                    "op": "=",
                    "value": "{{.namespace}}"
                  }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
|
||||
"having": [],
|
||||
"stepInterval": 60,
|
||||
"limit": null,
|
||||
"orderBy": [],
|
||||
"groupBy": [],
|
||||
"legend": "Memory Usage",
|
||||
"reduceTo": "avg"
|
||||
}
|
||||
],
|
||||
"queryFormulas": []
|
||||
},
|
||||
"queryType": "builder"
|
||||
},
|
||||
"fillSpans": false,
|
||||
"yAxisUnit": "percent"
|
||||
}
|
||||
]
|
||||
}
|
||||
429
infrastructure/monitoring/signoz/dashboards/user-activity.json
Normal file
@@ -0,0 +1,429 @@
{
  "description": "User activity and behavior monitoring dashboard",
  "tags": ["user", "activity", "behavior"],
  "name": "bakery-ia-user-activity",
  "title": "Bakery IA - User Activity",
  "uploadedGrafana": false,
  "uuid": "bakery-ia-user-01",
  "version": "v4",
  "collapsableRowsMigrated": true,
  "layout": [
    { "x": 0, "y": 0, "w": 6, "h": 3, "i": "active-users", "moved": false, "static": false },
    { "x": 6, "y": 0, "w": 6, "h": 3, "i": "user-sessions", "moved": false, "static": false },
    { "x": 0, "y": 3, "w": 6, "h": 3, "i": "user-actions", "moved": false, "static": false },
    { "x": 6, "y": 3, "w": 6, "h": 3, "i": "page-views", "moved": false, "static": false },
    { "x": 0, "y": 6, "w": 12, "h": 4, "i": "geo-visitors", "moved": false, "static": false }
  ],
  "variables": {
    "service": {
      "id": "service-var",
      "name": "service",
      "description": "Filter by service name",
      "type": "QUERY",
      "queryValue": "SELECT DISTINCT(serviceName) FROM signoz_traces.distributed_signoz_index_v2 ORDER BY serviceName",
      "customValue": "",
      "textboxValue": "",
      "showALLOption": true,
      "multiSelect": false,
      "order": 1,
      "modificationUUID": "",
      "sort": "ASC",
      "selectedValue": "bakery-frontend"
    }
  },
  "widgets": [
    {
      "id": "active-users",
      "title": "Active Users",
      "description": "Number of active users by service",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "traces",
              "queryName": "A",
              "aggregateOperator": "count_distinct",
              "aggregateAttribute": { "key": "user.id", "dataType": "string", "type": "tag", "isColumn": true },
              "timeAggregation": "count_distinct",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": {
                "items": [
                  { "key": { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true }, "op": "=", "value": "{{.service}}" }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true }
              ],
              "legend": "{{serviceName}}",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "none"
    },
    {
      "id": "user-sessions",
      "title": "User Sessions",
      "description": "Total user sessions by service",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "traces",
              "queryName": "A",
              "aggregateOperator": "count",
              "aggregateAttribute": { "key": "session.id", "dataType": "string", "type": "tag", "isColumn": true },
              "timeAggregation": "count",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": {
                "items": [
                  { "key": { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true }, "op": "=", "value": "{{.service}}" },
                  { "key": { "key": "span.name", "dataType": "string", "type": "tag", "isColumn": true }, "op": "=", "value": "user_session" }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true }
              ],
              "legend": "{{serviceName}}",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "none"
    },
    {
      "id": "user-actions",
      "title": "User Actions",
      "description": "Total user actions by service",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "traces",
              "queryName": "A",
              "aggregateOperator": "count",
              "aggregateAttribute": { "key": "user.action", "dataType": "string", "type": "tag", "isColumn": true },
              "timeAggregation": "count",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": {
                "items": [
                  { "key": { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true }, "op": "=", "value": "{{.service}}" },
                  { "key": { "key": "span.name", "dataType": "string", "type": "tag", "isColumn": true }, "op": "=", "value": "user_action" }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true }
              ],
              "legend": "{{serviceName}}",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "none"
    },
    {
      "id": "page-views",
      "title": "Page Views",
      "description": "Total page views by service",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "traces",
              "queryName": "A",
              "aggregateOperator": "count",
              "aggregateAttribute": { "key": "page.path", "dataType": "string", "type": "tag", "isColumn": true },
              "timeAggregation": "count",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": {
                "items": [
                  { "key": { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true }, "op": "=", "value": "{{.service}}" },
                  { "key": { "key": "span.name", "dataType": "string", "type": "tag", "isColumn": true }, "op": "=", "value": "page_view" }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true }
              ],
              "legend": "{{serviceName}}",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "none"
    },
    {
      "id": "geo-visitors",
      "title": "Geolocation Visitors",
      "description": "Number of visitors who shared location data",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "value",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "traces",
              "queryName": "A",
              "aggregateOperator": "count",
              "aggregateAttribute": { "key": "user.id", "dataType": "string", "type": "tag", "isColumn": true },
              "timeAggregation": "count",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": {
                "items": [
                  { "key": { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true }, "op": "=", "value": "{{.service}}" },
                  { "key": { "key": "span.name", "dataType": "string", "type": "tag", "isColumn": true }, "op": "=", "value": "user_location" }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [],
              "legend": "Visitors with Location Data (See GEOLOCATION_VISUALIZATION_GUIDE.md for map integration)",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "none"
    }
  ]
}
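The dashboard JSON above (like the others in this directory) can be pushed to a running SigNoz instance over its HTTP API. The `/api/v1/dashboards` endpoint and `SIGNOZ-API-KEY` header below are assumptions based on recent SigNoz versions — verify them against your instance's API documentation. This sketch only builds and prints the command rather than executing it:

```shell
# Hypothetical sketch: assemble an import command for one dashboard file.
# Endpoint path and auth header are assumptions; check your SigNoz version.
SIGNOZ_URL="http://localhost:3301"
DASHBOARD="infrastructure/monitoring/signoz/dashboards/user-activity.json"

CMD="curl -s -X POST $SIGNOZ_URL/api/v1/dashboards -H 'Content-Type: application/json' -d @$DASHBOARD"

# Print the command for inspection instead of running it blindly.
echo "$CMD"
```

In practice the `import-dashboards.sh` script added later in this commit is the supported way to do this in bulk.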
392
infrastructure/monitoring/signoz/deploy-signoz.sh
Executable file
@@ -0,0 +1,392 @@
#!/bin/bash

# ============================================================================
# SigNoz Deployment Script for Bakery IA
# ============================================================================
# This script deploys SigNoz monitoring stack using Helm
# Supports both development and production environments
# ============================================================================

set -e

# Color codes for output (printed with `echo -e` so escapes are interpreted)
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color

# Function to display help
show_help() {
    echo "Usage: $0 [OPTIONS] ENVIRONMENT"
    echo ""
    echo "Deploy SigNoz monitoring stack for Bakery IA"
    echo ""
    echo "Arguments:
  ENVIRONMENT                Environment to deploy to (dev|prod)"
    echo ""
    echo "Options:
  -h, --help                 Show this help message
  -d, --dry-run              Dry run - show what would be done without actually deploying
  -u, --upgrade              Upgrade existing deployment
  -r, --remove               Remove/Uninstall SigNoz deployment
  -n, --namespace NAMESPACE  Specify namespace (default: bakery-ia)"
    echo ""
    echo "Examples:
  $0 dev              # Deploy to development
  $0 prod             # Deploy to production
  $0 --upgrade prod   # Upgrade production deployment
  $0 --remove dev     # Remove development deployment"
    echo ""
    echo "Docker Hub Authentication:"
    echo "  This script automatically creates a Docker Hub secret for image pulls."
    echo "  Provide credentials via environment variables (recommended):"
    echo "    export DOCKERHUB_USERNAME='your-username'"
    echo "    export DOCKERHUB_PASSWORD='your-personal-access-token'"
    echo "  Or ensure you're logged in with Docker CLI:"
    echo "    docker login"
}

# Parse command line arguments
DRY_RUN=false
UPGRADE=false
REMOVE=false
NAMESPACE="bakery-ia"

while [[ $# -gt 0 ]]; do
    case $1 in
        -h|--help)
            show_help
            exit 0
            ;;
        -d|--dry-run)
            DRY_RUN=true
            shift
            ;;
        -u|--upgrade)
            UPGRADE=true
            shift
            ;;
        -r|--remove)
            REMOVE=true
            shift
            ;;
        -n|--namespace)
            NAMESPACE="$2"
            shift 2
            ;;
        dev|prod)
            ENVIRONMENT="$1"
            shift
            ;;
        *)
            echo "Unknown argument: $1"
            show_help
            exit 1
            ;;
    esac
done

# Validate environment
if [[ -z "$ENVIRONMENT" ]]; then
    echo "Error: Environment not specified. Use 'dev' or 'prod'."
    show_help
    exit 1
fi

if [[ "$ENVIRONMENT" != "dev" && "$ENVIRONMENT" != "prod" ]]; then
    echo "Error: Invalid environment. Use 'dev' or 'prod'."
    exit 1
fi

# Function to check if Helm is installed
check_helm() {
    if ! command -v helm &> /dev/null; then
        echo -e "${RED}Error: Helm is not installed. Please install Helm first.${NC}"
        echo "Installation instructions: https://helm.sh/docs/intro/install/"
        exit 1
    fi
}

# Function to check if kubectl is configured
check_kubectl() {
    if ! kubectl cluster-info &> /dev/null; then
        echo -e "${RED}Error: kubectl is not configured or cannot connect to cluster.${NC}"
        echo "Please ensure you have access to a Kubernetes cluster."
        exit 1
    fi
}

# Function to check if namespace exists, create if not
ensure_namespace() {
    if ! kubectl get namespace "$NAMESPACE" &> /dev/null; then
        echo -e "${BLUE}Creating namespace $NAMESPACE...${NC}"
        if [[ "$DRY_RUN" == true ]]; then
            echo "  (dry-run) Would create namespace $NAMESPACE"
        else
            kubectl create namespace "$NAMESPACE"
            echo -e "${GREEN}Namespace $NAMESPACE created.${NC}"
        fi
    else
        echo -e "${BLUE}Namespace $NAMESPACE already exists.${NC}"
    fi
}

# Function to create Docker Hub secret for image pulls
create_dockerhub_secret() {
    echo -e "${BLUE}Setting up Docker Hub image pull secret...${NC}"

    if [[ "$DRY_RUN" == true ]]; then
        echo "  (dry-run) Would create Docker Hub secret in namespace $NAMESPACE"
        return
    fi

    # Check if secret already exists
    if kubectl get secret dockerhub-creds -n "$NAMESPACE" &> /dev/null; then
        echo -e "${GREEN}Docker Hub secret already exists in namespace $NAMESPACE.${NC}"
        return
    fi

    # Check if Docker Hub credentials are available
    if [[ -n "$DOCKERHUB_USERNAME" ]] && [[ -n "$DOCKERHUB_PASSWORD" ]]; then
        echo -e "${BLUE}Found DOCKERHUB_USERNAME and DOCKERHUB_PASSWORD environment variables${NC}"

        kubectl create secret docker-registry dockerhub-creds \
            --docker-server=https://index.docker.io/v1/ \
            --docker-username="$DOCKERHUB_USERNAME" \
            --docker-password="$DOCKERHUB_PASSWORD" \
            --docker-email="${DOCKERHUB_EMAIL:-noreply@bakery-ia.local}" \
            -n "$NAMESPACE"

        echo -e "${GREEN}Docker Hub secret created successfully.${NC}"

    elif [[ -f "$HOME/.docker/config.json" ]]; then
        echo -e "${BLUE}Attempting to use Docker CLI credentials...${NC}"

        # Try to extract credentials from Docker config
        if grep -q "credsStore" "$HOME/.docker/config.json"; then
            echo -e "${YELLOW}Docker is using a credential store. Please set environment variables:${NC}"
            echo "  export DOCKERHUB_USERNAME='your-username'"
            echo "  export DOCKERHUB_PASSWORD='your-password-or-token'"
            echo -e "${YELLOW}Continuing without Docker Hub authentication...${NC}"
            return
        fi

        # Try to extract from base64 encoded auth
        AUTH=$(cat "$HOME/.docker/config.json" | jq -r '.auths["https://index.docker.io/v1/"].auth // empty' 2>/dev/null)
        if [[ -n "$AUTH" ]]; then
            echo -e "${GREEN}Found Docker Hub credentials in Docker config${NC}"
            local DOCKER_USERNAME=$(echo "$AUTH" | base64 -d | cut -d: -f1)
            local DOCKER_PASSWORD=$(echo "$AUTH" | base64 -d | cut -d: -f2-)

            kubectl create secret docker-registry dockerhub-creds \
                --docker-server=https://index.docker.io/v1/ \
                --docker-username="$DOCKER_USERNAME" \
                --docker-password="$DOCKER_PASSWORD" \
                --docker-email="${DOCKERHUB_EMAIL:-noreply@bakery-ia.local}" \
                -n "$NAMESPACE"

            echo -e "${GREEN}Docker Hub secret created successfully.${NC}"
        else
            echo -e "${YELLOW}Could not find Docker Hub credentials${NC}"
            echo -e "${YELLOW}To enable automatic Docker Hub authentication:${NC}"
            echo "  1. Run 'docker login', OR"
            echo "  2. Set environment variables:"
            echo "     export DOCKERHUB_USERNAME='your-username'"
            echo "     export DOCKERHUB_PASSWORD='your-password-or-token'"
            echo -e "${YELLOW}Continuing without Docker Hub authentication...${NC}"
        fi
    else
        echo -e "${YELLOW}Docker Hub credentials not found${NC}"
        echo -e "${YELLOW}To enable automatic Docker Hub authentication:${NC}"
        echo "  1. Run 'docker login', OR"
        echo "  2. Set environment variables:"
        echo "     export DOCKERHUB_USERNAME='your-username'"
        echo "     export DOCKERHUB_PASSWORD='your-password-or-token'"
        echo -e "${YELLOW}Continuing without Docker Hub authentication...${NC}"
    fi
    echo ""
}

# Function to add and update Helm repository
setup_helm_repo() {
    echo -e "${BLUE}Setting up SigNoz Helm repository...${NC}"

    if [[ "$DRY_RUN" == true ]]; then
        echo "  (dry-run) Would add SigNoz Helm repository"
        return
    fi

    # Add SigNoz Helm repository
    if helm repo list | grep -q "^signoz"; then
        echo -e "${BLUE}SigNoz repository already added, updating...${NC}"
        helm repo update signoz
    else
        echo -e "${BLUE}Adding SigNoz Helm repository...${NC}"
        helm repo add signoz https://charts.signoz.io
        helm repo update
    fi

    echo -e "${GREEN}Helm repository ready.${NC}"
    echo ""
}

# Function to deploy SigNoz
deploy_signoz() {
    local values_file="infrastructure/helm/signoz-values-$ENVIRONMENT.yaml"

    if [[ ! -f "$values_file" ]]; then
        echo -e "${RED}Error: Values file $values_file not found.${NC}"
        exit 1
    fi

    echo -e "${BLUE}Deploying SigNoz to $ENVIRONMENT environment...${NC}"
    echo "  Using values file: $values_file"
    echo "  Target namespace: $NAMESPACE"
    echo "  Chart version: Latest from signoz/signoz"

    if [[ "$DRY_RUN" == true ]]; then
        echo "  (dry-run) Would deploy SigNoz with:"
        echo "    helm upgrade --install signoz signoz/signoz -n $NAMESPACE -f $values_file --wait --timeout 15m"
        return
    fi

    # Use upgrade --install to handle both new installations and upgrades
    echo -e "${BLUE}Installing/Upgrading SigNoz...${NC}"
    echo "This may take 10-15 minutes..."

    helm upgrade --install signoz signoz/signoz \
        -n "$NAMESPACE" \
        -f "$values_file" \
        --wait \
        --timeout 15m \
        --create-namespace

    echo -e "${GREEN}SigNoz deployment completed.${NC}"
    echo ""

    # Show deployment status
    show_deployment_status
}

# Function to remove SigNoz
remove_signoz() {
    echo -e "${BLUE}Removing SigNoz deployment from namespace $NAMESPACE...${NC}"

    if [[ "$DRY_RUN" == true ]]; then
        echo "  (dry-run) Would remove SigNoz deployment"
        return
    fi

    if helm list -n "$NAMESPACE" | grep -q signoz; then
        helm uninstall signoz -n "$NAMESPACE" --wait
        echo -e "${GREEN}SigNoz deployment removed.${NC}"

        # PVCs are intentionally left in place for safety
        echo ""
        echo -e "${YELLOW}Note: Persistent Volume Claims (PVCs) were NOT deleted.${NC}"
        echo "To delete PVCs and all data, run:"
        echo "  kubectl delete pvc -n $NAMESPACE -l app.kubernetes.io/instance=signoz"
    else
        echo -e "${YELLOW}No SigNoz deployment found in namespace $NAMESPACE.${NC}"
    fi
}

# Function to show deployment status
show_deployment_status() {
    echo ""
    echo -e "${BLUE}=== SigNoz Deployment Status ===${NC}"
    echo ""

    # Get pods
    echo "Pods:"
    kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/instance=signoz
    echo ""

    # Get services
    echo "Services:"
    kubectl get svc -n "$NAMESPACE" -l app.kubernetes.io/instance=signoz
    echo ""

    # Get ingress
    echo "Ingress:"
    kubectl get ingress -n "$NAMESPACE" -l app.kubernetes.io/instance=signoz
    echo ""

    # Show access information
    show_access_info
}

# Function to show access information
show_access_info() {
    echo -e "${BLUE}=== Access Information ===${NC}"

    if [[ "$ENVIRONMENT" == "dev" ]]; then
        echo "SigNoz UI: http://monitoring.bakery-ia.local"
        echo ""
        echo "OpenTelemetry Collector Endpoints (from within cluster):"
        echo "  gRPC: signoz-otel-collector.$NAMESPACE.svc.cluster.local:4317"
        echo "  HTTP: signoz-otel-collector.$NAMESPACE.svc.cluster.local:4318"
        echo ""
        echo "Port-forward for local access:"
        echo "  kubectl port-forward -n $NAMESPACE svc/signoz 8080:8080"
        echo "  kubectl port-forward -n $NAMESPACE svc/signoz-otel-collector 4317:4317"
        echo "  kubectl port-forward -n $NAMESPACE svc/signoz-otel-collector 4318:4318"
    else
        echo "SigNoz UI: https://monitoring.bakewise.ai"
        echo ""
        echo "OpenTelemetry Collector Endpoints (from within cluster):"
        echo "  gRPC: signoz-otel-collector.$NAMESPACE.svc.cluster.local:4317"
        echo "  HTTP: signoz-otel-collector.$NAMESPACE.svc.cluster.local:4318"
        echo ""
        echo "External endpoints (if exposed):"
        echo "  Check ingress configuration for external OTLP endpoints"
    fi

    echo ""
    echo "Default credentials:"
    echo "  Username: admin@example.com"
    echo "  Password: admin"
    echo ""
    echo "Note: Change default password after first login!"
    echo ""
}

# Main execution
main() {
    echo -e "${BLUE}"
    echo "=========================================="
    echo "🚀 SigNoz Deployment for Bakery IA"
    echo "=========================================="
    echo -e "${NC}"

    # Check prerequisites
    check_helm
    check_kubectl

    # Ensure namespace
    ensure_namespace

    if [[ "$REMOVE" == true ]]; then
        remove_signoz
        exit 0
    fi

    # Setup Helm repository
    setup_helm_repo

    # Create Docker Hub secret for image pulls
    create_dockerhub_secret

    # Deploy SigNoz
    deploy_signoz

    echo -e "${GREEN}"
    echo "=========================================="
    echo "✅ SigNoz deployment completed!"
    echo "=========================================="
    echo -e "${NC}"
}

# Run main function
main
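The credential-recovery branch in `create_dockerhub_secret` above decodes the base64 `auth` entry that `docker login` writes to `~/.docker/config.json` into a username/password pair. A minimal, self-contained illustration of that decode step, using a made-up credential rather than a real one:

```shell
# Build a sample "auth" value the way Docker stores it: base64("user:password").
# "alice:s3cret-token" is a fabricated example credential.
AUTH=$(printf 'alice:s3cret-token' | base64)

# Split it back apart exactly as the deploy script does.
DOCKER_USERNAME=$(echo "$AUTH" | base64 -d | cut -d: -f1)
DOCKER_PASSWORD=$(echo "$AUTH" | base64 -d | cut -d: -f2-)

echo "$DOCKER_USERNAME"   # alice
```

The `-f2-` on the second `cut` matters: personal access tokens may themselves contain `:` characters, and `-f2-` keeps everything after the first colon intact.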
141
infrastructure/monitoring/signoz/generate-test-traffic.sh
Executable file
@@ -0,0 +1,141 @@
#!/bin/bash

# Generate Test Traffic to Services
# This script generates API calls to verify telemetry data collection

set -e

NAMESPACE="bakery-ia"
GREEN='\033[0;32m'
BLUE='\033[0;34m'
YELLOW='\033[1;33m'
NC='\033[0m'

echo -e "${BLUE}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
echo -e "${BLUE}  Generating Test Traffic for SigNoz Verification${NC}"
echo -e "${BLUE}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
echo ""

# Check if ingress is accessible
echo -e "${BLUE}Step 1: Verifying Gateway Access${NC}"
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"

GATEWAY_POD=$(kubectl get pods -n $NAMESPACE -l app=gateway --field-selector=status.phase=Running -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
if [[ -z "$GATEWAY_POD" ]]; then
    echo -e "${YELLOW}⚠ Gateway pod not running. Starting port-forward...${NC}"
    # Port forward in background
    kubectl port-forward -n $NAMESPACE svc/gateway-service 8000:8000 &
    PORT_FORWARD_PID=$!
    sleep 3
    API_URL="http://localhost:8000"
else
    echo -e "${GREEN}✓ Gateway is running: $GATEWAY_POD${NC}"
    # Use internal service
    API_URL="http://gateway-service.$NAMESPACE.svc.cluster.local:8000"
fi
echo ""

# Function to make API call from inside cluster
make_request() {
    local endpoint=$1
    local description=$2

    echo -e "${BLUE}→ Testing: $description${NC}"
    echo "  Endpoint: $endpoint"

    if [[ -n "$GATEWAY_POD" ]]; then
        # Make request from inside the gateway pod
        RESPONSE=$(kubectl exec -n $NAMESPACE $GATEWAY_POD -- curl -s -w "\nHTTP_CODE:%{http_code}" "$API_URL$endpoint" 2>/dev/null || echo "FAILED")
    else
        # Make request from localhost
        RESPONSE=$(curl -s -w "\nHTTP_CODE:%{http_code}" "$API_URL$endpoint" 2>/dev/null || echo "FAILED")
    fi

    if [[ "$RESPONSE" == "FAILED" ]]; then
        echo -e "  ${YELLOW}⚠ Request failed${NC}"
    else
        HTTP_CODE=$(echo "$RESPONSE" | grep "HTTP_CODE" | cut -d: -f2)
        if [[ "$HTTP_CODE" == "200" ]] || [[ "$HTTP_CODE" == "401" ]] || [[ "$HTTP_CODE" == "404" ]]; then
            echo -e "  ${GREEN}✓ Response received (HTTP $HTTP_CODE)${NC}"
        else
            echo -e "  ${YELLOW}⚠ Unexpected response (HTTP $HTTP_CODE)${NC}"
        fi
    fi
    echo ""
    sleep 1
}

# Generate traffic to various endpoints
echo -e "${BLUE}Step 2: Generating Traffic to Services${NC}"
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
echo ""

# Health checks (should generate traces)
make_request "/health" "Gateway Health Check"
make_request "/api/health" "API Health Check"

# Auth service endpoints
make_request "/api/auth/health" "Auth Service Health"

# Tenant service endpoints
make_request "/api/tenants/health" "Tenant Service Health"

# Inventory service endpoints
make_request "/api/inventory/health" "Inventory Service Health"

# Orders service endpoints
make_request "/api/orders/health" "Orders Service Health"

# Forecasting service endpoints
make_request "/api/forecasting/health" "Forecasting Service Health"

echo -e "${BLUE}Step 3: Checking Service Logs for Telemetry${NC}"
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
echo ""

# Check a few service pods for tracing logs
SERVICES=("auth-service" "inventory-service" "gateway")

for service in "${SERVICES[@]}"; do
    POD=$(kubectl get pods -n $NAMESPACE -l app=$service --field-selector=status.phase=Running -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
    if [[ -n "$POD" ]]; then
        echo -e "${BLUE}Checking $service ($POD)...${NC}"
        TRACING_LOG=$(kubectl logs -n $NAMESPACE $POD --tail=100 2>/dev/null | grep -i "tracing\|otel" | head -n 2 || echo "")
        if [[ -n "$TRACING_LOG" ]]; then
            echo -e "${GREEN}✓ Tracing configured:${NC}"
            echo "$TRACING_LOG" | sed 's/^/  /'
        else
            echo -e "${YELLOW}⚠ No tracing logs found${NC}"
        fi
        echo ""
    fi
done

# Wait for data to be processed
echo -e "${BLUE}Step 4: Waiting for Data Processing${NC}"
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
echo "Waiting 30 seconds for telemetry data to be processed..."
for i in {30..1}; do
    echo -ne "\r  ${i} seconds remaining..."
    sleep 1
done
echo -e "\n"

# Cleanup port-forward if started
if [[ -n "$PORT_FORWARD_PID" ]]; then
    kill $PORT_FORWARD_PID 2>/dev/null || true
fi

echo -e "${GREEN}✓ Test traffic generation complete!${NC}"
echo ""
echo -e "${BLUE}Next Steps:${NC}"
echo "1. Run the verification script to check for collected data:"
echo "   ./infrastructure/helm/verify-signoz-telemetry.sh"
echo ""
echo "2. Access SigNoz UI to visualize the data:"
echo "   http://monitoring.bakery-ia.local"
echo "   or"
echo "   kubectl port-forward -n bakery-ia svc/signoz 3301:8080"
echo "   Then go to: http://localhost:3301"
echo ""
echo -e "${BLUE}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
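The `make_request` helper above separates the response body from the HTTP status by appending a `HTTP_CODE:` marker through curl's `-w` write-out flag, then parsing the marker back out. The parsing step in isolation, with a simulated response:

```shell
# Simulate what `curl -s -w "\nHTTP_CODE:%{http_code}" ...` produces:
# the response body, a newline, then the marker line.
RESPONSE=$(printf '{"status":"ok"}\nHTTP_CODE:200')

# Extract the status code the same way make_request does.
HTTP_CODE=$(echo "$RESPONSE" | grep "HTTP_CODE" | cut -d: -f2)

echo "$HTTP_CODE"   # 200
```

This keeps the script to a single curl invocation per endpoint instead of one request for the body and another for the status code.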
175
infrastructure/monitoring/signoz/import-dashboards.sh
Executable file
@@ -0,0 +1,175 @@
#!/bin/bash

# SigNoz Dashboard Importer for Bakery IA
# This script imports all SigNoz dashboards into your SigNoz instance

# Configuration
SIGNOZ_HOST="localhost"
SIGNOZ_PORT="3301"
SIGNOZ_API_KEY=""  # Add your API key if authentication is required
DASHBOARDS_DIR="infrastructure/signoz/dashboards"

# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color

# Function to display help
show_help() {
    echo "Usage: $0 [options]"
    echo ""
    echo "Options:
  -h, --host      SigNoz host (default: localhost)
  -p, --port      SigNoz port (default: 3301)
  -k, --api-key   SigNoz API key (if required)
  -d, --dir       Dashboards directory (default: infrastructure/signoz/dashboards)
      --help      Show this help message"
    echo ""
    echo "Example:
  $0 --host signoz.example.com --port 3301 --api-key your-api-key"
}

# Parse command line arguments
# Note: -h is taken by --host, so help is available only as --help
while [[ $# -gt 0 ]]; do
    case $1 in
        -h|--host)
            SIGNOZ_HOST="$2"
            shift 2
            ;;
        -p|--port)
            SIGNOZ_PORT="$2"
            shift 2
            ;;
        -k|--api-key)
            SIGNOZ_API_KEY="$2"
            shift 2
            ;;
        -d|--dir)
            DASHBOARDS_DIR="$2"
            shift 2
            ;;
        --help)
            show_help
            exit 0
            ;;
        *)
            echo "Unknown option: $1"
            show_help
            exit 1
            ;;
    esac
done

# Check that the dashboards directory exists
if [ ! -d "$DASHBOARDS_DIR" ]; then
    echo -e "${RED}Error: Dashboards directory not found: $DASHBOARDS_DIR${NC}"
    exit 1
fi

# Check whether jq is installed for JSON validation
if ! command -v jq &> /dev/null; then
    echo -e "${YELLOW}Warning: jq not found. Skipping JSON validation.${NC}"
    VALIDATE_JSON=false
else
    VALIDATE_JSON=true
fi

# Function to validate JSON
validate_json() {
    local file="$1"
    if [ "$VALIDATE_JSON" = true ]; then
        if ! jq empty "$file" &> /dev/null; then
            echo -e "${RED}Error: Invalid JSON in file: $file${NC}"
            return 1
        fi
    fi
    return 0
}

# Function to import a single dashboard
import_dashboard() {
    local file="$1"
    local filename=$(basename "$file")
    local dashboard_name=$(jq -r '.name' "$file" 2>/dev/null || echo "Unknown")

    echo -e "${BLUE}Importing dashboard: $dashboard_name ($filename)${NC}"

    # Build the curl command as a string so optional headers can be appended
    local curl_cmd="curl -s -X POST http://$SIGNOZ_HOST:$SIGNOZ_PORT/api/v1/dashboards/import"

    if [ -n "$SIGNOZ_API_KEY" ]; then
        curl_cmd="$curl_cmd -H \"Authorization: Bearer $SIGNOZ_API_KEY\""
    fi

    curl_cmd="$curl_cmd -H \"Content-Type: application/json\" -d @\"$file\""

    # Execute the import
    local response=$(eval "$curl_cmd")

    # Check the response
    if echo "$response" | grep -q "success"; then
        echo -e "${GREEN}✓ Successfully imported: $dashboard_name${NC}"
        return 0
    else
        echo -e "${RED}✗ Failed to import: $dashboard_name${NC}"
        echo "Response: $response"
        return 1
    fi
}

# Main import process
echo -e "${YELLOW}=== SigNoz Dashboard Importer for Bakery IA ===${NC}"
echo -e "${BLUE}Configuration:${NC}"
echo "  Host: $SIGNOZ_HOST"
echo "  Port: $SIGNOZ_PORT"
echo "  Dashboards Directory: $DASHBOARDS_DIR"
if [ -n "$SIGNOZ_API_KEY" ]; then
    echo "  API Key: ******** (set)"
else
    echo "  API Key: Not configured"
fi
echo ""

# Count dashboards
DASHBOARD_COUNT=$(find "$DASHBOARDS_DIR" -name "*.json" | wc -l)
echo -e "${BLUE}Found $DASHBOARD_COUNT dashboards to import${NC}"
echo ""

# Import each dashboard
SUCCESS_COUNT=0
FAILURE_COUNT=0

for file in "$DASHBOARDS_DIR"/*.json; do
    if [ -f "$file" ]; then
        # Validate JSON before importing
        if validate_json "$file"; then
            if import_dashboard "$file"; then
                ((SUCCESS_COUNT++))
            else
                ((FAILURE_COUNT++))
            fi
        else
            ((FAILURE_COUNT++))
        fi
        echo ""
    fi
done

# Summary
echo -e "${YELLOW}=== Import Summary ===${NC}"
echo -e "${GREEN}Successfully imported: $SUCCESS_COUNT dashboards${NC}"
if [ $FAILURE_COUNT -gt 0 ]; then
    echo -e "${RED}Failed to import: $FAILURE_COUNT dashboards${NC}"
fi
echo ""

if [ $FAILURE_COUNT -eq 0 ]; then
    echo -e "${GREEN}All dashboards imported successfully!${NC}"
    echo "You can now access them in your SigNoz UI at:"
    echo "http://$SIGNOZ_HOST:$SIGNOZ_PORT/dashboards"
else
    echo -e "${YELLOW}Some dashboards failed to import. Check the errors above.${NC}"
    exit 1
fi
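The validation step above can be exercised standalone, without a SigNoz instance. The following sketch mirrors the script's jq check against a throwaway directory; the temp file names are illustrative, and the python3 fallback is an assumption for machines without jq (the real script simply skips validation in that case).

```shell
# Standalone sketch of the dashboard-JSON validation loop.
tmpdir=$(mktemp -d)
printf '%s' '{"name": "Bakery Overview"}' > "$tmpdir/ok.json"
printf '%s' '{broken' > "$tmpdir/bad.json"

# Prefer jq (as the script does); fall back to python3 if jq is absent.
if command -v jq >/dev/null 2>&1; then
  validate() { jq empty "$1" 2>/dev/null; }
else
  validate() { python3 -c 'import json,sys; json.load(open(sys.argv[1]))' "$1" 2>/dev/null; }
fi

valid=0; invalid=0
for f in "$tmpdir"/*.json; do
  if validate "$f"; then valid=$((valid+1)); else invalid=$((invalid+1)); fi
done
echo "valid=$valid invalid=$invalid"   # → valid=1 invalid=1
rm -rf "$tmpdir"
```

Running this before pointing the importer at a real instance catches malformed dashboard files early, which is exactly why the script counts invalid files as failures rather than silently skipping them.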
853
infrastructure/monitoring/signoz/signoz-values-dev.yaml
Normal file
@@ -0,0 +1,853 @@
# SigNoz Helm Chart Values - Development Environment
# Optimized for local development with minimal resource usage
# DEPLOYED IN bakery-ia NAMESPACE - Ingress managed by bakery-ingress
#
# Official Chart: https://github.com/SigNoz/charts
# Install Command: helm install signoz signoz/signoz -n bakery-ia -f signoz-values-dev.yaml

global:
  storageClass: "standard"
  clusterName: "bakery-ia-dev"
  domain: "monitoring.bakery-ia.local"
  # Docker Hub credentials - applied to all sub-charts (including Zookeeper, ClickHouse, etc.)
  imagePullSecrets:
    - dockerhub-creds

# Docker Hub credentials for pulling images (root level for SigNoz components)
imagePullSecrets:
  - dockerhub-creds

# SigNoz main component (includes frontend and query service)
signoz:
  replicaCount: 1

  service:
    type: ClusterIP
    port: 8080

  # DISABLE the built-in ingress - using the unified bakery-ingress instead
  # Route configured in infrastructure/kubernetes/overlays/dev/dev-ingress.yaml
  ingress:
    enabled: false

  resources:
    requests:
      cpu: 100m        # Combined frontend + query service
      memory: 256Mi
    limits:
      cpu: 1000m
      memory: 1Gi

  # Environment variables (new format - replaces configVars)
  env:
    signoz_telemetrystore_provider: "clickhouse"
    dot_metrics_enabled: "true"
    signoz_emailing_enabled: "false"
    signoz_alertmanager_provider: "signoz"
    # Retention for dev (7 days)
    signoz_traces_ttl_duration_hrs: "168"
    signoz_metrics_ttl_duration_hrs: "168"
    signoz_logs_ttl_duration_hrs: "168"
    # OpAMP server configuration - DISABLED for dev (causes gRPC instability)
    signoz_opamp_server_enabled: "false"
    # signoz_opamp_server_endpoint: "0.0.0.0:4320"

  persistence:
    enabled: true
    size: 5Gi
    storageClass: "standard"

# AlertManager configuration
alertmanager:
  replicaCount: 1
  image:
    repository: signoz/alertmanager
    tag: 0.23.5
    pullPolicy: IfNotPresent

  service:
    type: ClusterIP
    port: 9093

  resources:
    requests:
      cpu: 25m       # Reduced for local dev
      memory: 64Mi   # Reduced for local dev
    limits:
      cpu: 200m
      memory: 256Mi

  persistence:
    enabled: true
    size: 2Gi
    storageClass: "standard"

  config:
    global:
      resolve_timeout: 5m
    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 12h
      receiver: 'default'
    receivers:
      - name: 'default'
        # Add email, Slack, or webhook configs here

# ClickHouse configuration - time-series database
# Minimal resources for local development on a constrained Kind cluster
clickhouse:
  enabled: true
  installCustomStorageClass: false

  image:
    registry: docker.io
    repository: clickhouse/clickhouse-server
    tag: 25.5.6  # Officially recommended version

  # Reduce ClickHouse resource requests for local dev
  clickhouse:
    resources:
      requests:
        cpu: 200m      # Reduced from the default 500m
        memory: 512Mi
      limits:
        cpu: 1000m
        memory: 1Gi

  persistence:
    enabled: true
    size: 20Gi

# Zookeeper configuration (required by ClickHouse)
zookeeper:
  enabled: true
  replicaCount: 1  # Single replica for dev

  image:
    tag: 3.7.1  # Officially recommended version

  resources:
    requests:
      cpu: 100m
      memory: 256Mi
    limits:
      cpu: 500m
      memory: 512Mi

  persistence:
    enabled: true
    size: 5Gi

# OpenTelemetry Collector - data ingestion endpoint for all telemetry
otelCollector:
  enabled: true
  replicaCount: 1

  image:
    repository: signoz/signoz-otel-collector
    tag: v0.129.12  # Latest recommended version

  # OpAMP configuration - DISABLED for development
  # OpAMP is designed for production with remote config management.
  # In dev it causes gRPC instability and collector reloads,
  # so we use static configuration instead.

  # Init containers for the OTel Collector pod
  initContainers:
    fix-postgres-tls:
      enabled: true
      image:
        registry: docker.io
        repository: busybox
        tag: 1.35
        pullPolicy: IfNotPresent
      command:
        - sh
        - -c
        - |
          echo "Fixing PostgreSQL TLS file permissions..."
          cp /etc/postgres-tls-source/* /etc/postgres-tls/
          chmod 600 /etc/postgres-tls/server-key.pem
          chmod 644 /etc/postgres-tls/server-cert.pem
          chmod 644 /etc/postgres-tls/ca-cert.pem
          echo "PostgreSQL TLS permissions fixed"
      volumeMounts:
        - name: postgres-tls-source
          mountPath: /etc/postgres-tls-source
          readOnly: true
        - name: postgres-tls-fixed
          mountPath: /etc/postgres-tls
          readOnly: false

  # Service configuration - expose both gRPC and HTTP endpoints
  service:
    type: ClusterIP
    ports:
      # gRPC receivers
      - name: otlp-grpc
        port: 4317
        targetPort: 4317
        protocol: TCP
      # HTTP receivers
      - name: otlp-http
        port: 4318
        targetPort: 4318
        protocol: TCP
      # Prometheus remote write
      - name: prometheus
        port: 8889
        targetPort: 8889
        protocol: TCP
      # Metrics
      - name: metrics
        port: 8888
        targetPort: 8888
        protocol: TCP

  resources:
    requests:
      cpu: 50m       # Reduced from 100m
      memory: 128Mi  # Reduced from 256Mi
    limits:
      cpu: 500m
      memory: 512Mi

  # Additional environment variables for receivers
  additionalEnvs:
    POSTGRES_MONITOR_USER: "monitoring"
    POSTGRES_MONITOR_PASSWORD: "monitoring_369f9c001f242b07ef9e2826e17169ca"
    REDIS_PASSWORD: "OxdmdJjdVNXp37MNC2IFoMnTpfGGFv1k"
    RABBITMQ_USER: "bakery"
    RABBITMQ_PASSWORD: "forecast123"

  # Mount TLS certificates for secure connections
  extraVolumes:
    - name: redis-tls
      secret:
        secretName: redis-tls-secret
    - name: postgres-tls
      secret:
        secretName: postgres-tls
    - name: postgres-tls-fixed
      emptyDir: {}
    - name: varlogpods
      hostPath:
        path: /var/log/pods

  extraVolumeMounts:
    - name: redis-tls
      mountPath: /etc/redis-tls
      readOnly: true
    - name: postgres-tls
      mountPath: /etc/postgres-tls-source
      readOnly: true
    - name: postgres-tls-fixed
      mountPath: /etc/postgres-tls
      readOnly: false
    - name: varlogpods
      mountPath: /var/log/pods
      readOnly: true

  # Disable OpAMP - use static configuration only
  # Use 'args' instead of 'extraArgs' to completely override the command
  command:
    name: /signoz-otel-collector
    args:
      - --config=/conf/otel-collector-config.yaml
      - --feature-gates=-pkg.translator.prometheus.NormalizeName

  # OpenTelemetry Collector configuration
  config:
    # Connectors - bridge between pipelines
    connectors:
      signozmeter:
        dimensions:
          - name: service.name
          - name: deployment.environment
          - name: host.name
        metrics_flush_interval: 1h

    receivers:
      # OTLP receivers for traces, metrics, and logs from applications
      # All application telemetry is pushed via the OTLP protocol
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
            cors:
              allowed_origins:
                - "*"

      # Filelog receiver for Kubernetes pod logs
      # Collects container stdout/stderr from /var/log/pods
      filelog:
        include:
          - /var/log/pods/*/*/*.log
        exclude:
          # Exclude SigNoz's own logs to avoid recursive collection
          - /var/log/pods/bakery-ia_signoz-*/*/*.log
        include_file_path: true
        include_file_name: false
        operators:
          # Parse the CRI-O / containerd log format
          - type: regex_parser
            regex: '^(?P<time>[^ ]+) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) (?P<log>.*)$'
            timestamp:
              parse_from: attributes.time
              layout: '%Y-%m-%dT%H:%M:%S.%LZ'
          # Preserve the parsed time field as a timestamp attribute
          - type: move
            from: attributes.time
            to: attributes.timestamp
          # Extract Kubernetes metadata from the file path
          - type: regex_parser
            id: extract_metadata_from_filepath
            regex: '^.*\/(?P<namespace>[^_]+)_(?P<pod_name>[^_]+)_(?P<uid>[^\/]+)\/(?P<container_name>[^\._]+)\/(?P<restart_count>\d+)\.log$'
            parse_from: attributes["log.file.path"]
          # Move metadata to resource attributes
          - type: move
            from: attributes.namespace
            to: resource["k8s.namespace.name"]
          - type: move
            from: attributes.pod_name
            to: resource["k8s.pod.name"]
          - type: move
            from: attributes.container_name
            to: resource["k8s.container.name"]
          - type: move
            from: attributes.log
            to: body

      # Kubernetes cluster receiver - collects cluster-level metrics
      # Provides information about nodes, namespaces, pods, and other cluster resources
      k8s_cluster:
        collection_interval: 30s
        node_conditions_to_report:
          - Ready
          - MemoryPressure
          - DiskPressure
          - PIDPressure
          - NetworkUnavailable
        allocatable_types_to_report:
          - cpu
          - memory
          - pods

      # PostgreSQL receivers for database metrics
      # ENABLED: monitor users configured and credentials stored in secrets
      # Collects metrics directly from the PostgreSQL databases with proper TLS
      postgresql/auth:
        endpoint: auth-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - auth_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/inventory:
        endpoint: inventory-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - inventory_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/orders:
        endpoint: orders-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - orders_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/ai-insights:
        endpoint: ai-insights-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - ai_insights_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/alert-processor:
        endpoint: alert-processor-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - alert_processor_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/distribution:
        endpoint: distribution-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - distribution_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/external:
        endpoint: external-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - external_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/forecasting:
        endpoint: forecasting-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - forecasting_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/notification:
        endpoint: notification-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - notification_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/orchestrator:
        endpoint: orchestrator-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - orchestrator_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/pos:
        endpoint: pos-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - pos_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/procurement:
        endpoint: procurement-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - procurement_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/production:
        endpoint: production-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - production_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/recipes:
        endpoint: recipes-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - recipes_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/sales:
        endpoint: sales-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - sales_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/suppliers:
        endpoint: suppliers-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - suppliers_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/tenant:
        endpoint: tenant-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - tenant_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/training:
        endpoint: training-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - training_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      # Redis receiver for cache metrics
      # ENABLED: using existing credentials from redis-secrets with TLS
      redis:
        endpoint: redis-service.bakery-ia:6379
        password: ${env:REDIS_PASSWORD}
        collection_interval: 60s
        transport: tcp
        tls:
          insecure_skip_verify: false
          cert_file: /etc/redis-tls/redis-cert.pem
          key_file: /etc/redis-tls/redis-key.pem
          ca_file: /etc/redis-tls/ca-cert.pem
        metrics:
          redis.maxmemory:
            enabled: true
          redis.cmd.latency:
            enabled: true

      # RabbitMQ receiver via the management API
      # ENABLED: using existing credentials from rabbitmq-secrets
      rabbitmq:
        endpoint: http://rabbitmq-service.bakery-ia:15672
        username: ${env:RABBITMQ_USER}
        password: ${env:RABBITMQ_PASSWORD}
        collection_interval: 30s

      # Prometheus receiver - scrapes metrics from the Kubernetes API
      # Simplified configuration using only Kubernetes API metrics
      prometheus:
        config:
          scrape_configs:
            - job_name: 'kubernetes-nodes-cadvisor'
              scrape_interval: 30s
              scrape_timeout: 10s
              scheme: https
              tls_config:
                insecure_skip_verify: true
              bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              kubernetes_sd_configs:
                - role: node
              relabel_configs:
                - action: labelmap
                  regex: __meta_kubernetes_node_label_(.+)
                - target_label: __address__
                  replacement: kubernetes.default.svc:443
                - source_labels: [__meta_kubernetes_node_name]
                  regex: (.+)
                  target_label: __metrics_path__
                  replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
            - job_name: 'kubernetes-apiserver'
              scrape_interval: 30s
              scrape_timeout: 10s
              scheme: https
              tls_config:
                insecure_skip_verify: true
              bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              kubernetes_sd_configs:
                - role: endpoints
              relabel_configs:
                - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
                  action: keep
                  regex: default;kubernetes;https

    processors:
      # Batch processor for better performance (optimized for high throughput)
      batch:
        timeout: 1s
        send_batch_size: 10000      # Increased from 1024 for better throughput
        send_batch_max_size: 10000

      # Batch processor for meter data
      batch/meter:
        timeout: 1s
        send_batch_size: 20000
        send_batch_max_size: 25000

      # Memory limiter to prevent OOM kills
      memory_limiter:
        check_interval: 1s
        limit_mib: 400
        spike_limit_mib: 100

      # Resource detection
      resourcedetection:
        detectors: [env, system, docker]
        timeout: 5s

      # Kubernetes attributes processor - CRITICAL for logs
      # Enriches telemetry with pod, namespace, and container metadata
      k8sattributes:
        auth_type: "serviceAccount"
        passthrough: false
        extract:
          metadata:
            - k8s.pod.name
            - k8s.pod.uid
            - k8s.deployment.name
            - k8s.namespace.name
            - k8s.node.name
            - k8s.container.name
          labels:
            - tag_name: "app"
            - tag_name: "pod-template-hash"
          annotations:
            - tag_name: "description"

      # SigNoz span metrics processor with delta aggregation (recommended)
      # Generates RED metrics (Rate, Errors, Duration) from trace spans
      signozspanmetrics/delta:
        aggregation_temporality: AGGREGATION_TEMPORALITY_DELTA
        metrics_exporter: signozclickhousemetrics
        latency_histogram_buckets: [100us, 1ms, 2ms, 6ms, 10ms, 50ms, 100ms, 250ms, 500ms, 1000ms, 1400ms, 2000ms, 5s, 10s, 20s, 40s, 60s]
        dimensions_cache_size: 100000
        dimensions:
          - name: service.namespace
            default: default
          - name: deployment.environment
            default: default
          - name: signoz.collector.id

    exporters:
      # ClickHouse exporter for traces
      clickhousetraces:
        datasource: tcp://admin:27ff0399-0d3a-4bd8-919d-17c2181e6fb9@signoz-clickhouse:9000/?database=signoz_traces
        timeout: 10s
        retry_on_failure:
          enabled: true
          initial_interval: 5s
          max_interval: 30s
          max_elapsed_time: 300s

      # ClickHouse exporter for metrics
      signozclickhousemetrics:
        dsn: "tcp://admin:27ff0399-0d3a-4bd8-919d-17c2181e6fb9@signoz-clickhouse:9000/signoz_metrics"
        timeout: 10s
        retry_on_failure:
          enabled: true
          initial_interval: 5s
          max_interval: 30s
          max_elapsed_time: 300s

      # ClickHouse exporter for meter data (usage metrics)
      signozclickhousemeter:
        dsn: "tcp://admin:27ff0399-0d3a-4bd8-919d-17c2181e6fb9@signoz-clickhouse:9000/signoz_meter"
        timeout: 45s
        sending_queue:
          enabled: false

      # ClickHouse exporter for logs
      clickhouselogsexporter:
        dsn: tcp://admin:27ff0399-0d3a-4bd8-919d-17c2181e6fb9@signoz-clickhouse:9000/?database=signoz_logs
        timeout: 10s
        retry_on_failure:
          enabled: true
          initial_interval: 5s
          max_interval: 30s

      # Metadata exporter for service metadata
      metadataexporter:
        dsn: "tcp://admin:27ff0399-0d3a-4bd8-919d-17c2181e6fb9@signoz-clickhouse:9000/signoz_metadata"
        timeout: 10s
        cache:
          provider: in_memory

      # Debug exporter (optional)
      debug:
        verbosity: detailed
        sampling_initial: 5
        sampling_thereafter: 200

    service:
      pipelines:
        # Traces pipeline - exports to ClickHouse and the signozmeter connector
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch, signozspanmetrics/delta, resourcedetection]
          exporters: [clickhousetraces, metadataexporter, signozmeter]

        # Metrics pipeline
        metrics:
          receivers: [otlp,
            postgresql/auth, postgresql/inventory, postgresql/orders,
            postgresql/ai-insights, postgresql/alert-processor, postgresql/distribution,
            postgresql/external, postgresql/forecasting, postgresql/notification,
            postgresql/orchestrator, postgresql/pos, postgresql/procurement,
            postgresql/production, postgresql/recipes, postgresql/sales,
            postgresql/suppliers, postgresql/tenant, postgresql/training,
            redis, rabbitmq, k8s_cluster, prometheus]
          processors: [memory_limiter, batch, resourcedetection]
          exporters: [signozclickhousemetrics]

        # Meter pipeline - receives from the signozmeter connector
        metrics/meter:
          receivers: [signozmeter]
          processors: [batch/meter]
          exporters: [signozclickhousemeter]

        # Logs pipeline - includes both OTLP and Kubernetes pod logs
        logs:
          receivers: [otlp, filelog]
          processors: [memory_limiter, batch, resourcedetection, k8sattributes]
          exporters: [clickhouselogsexporter]

  # ClusterRole configuration for Kubernetes monitoring
  # CRITICAL: required for the k8s_cluster receiver to access the Kubernetes API
  # Without these permissions, k8s metrics will not appear in the SigNoz UI
  clusterRole:
    create: true
    name: "signoz-otel-collector-bakery-ia"
    annotations: {}
    # Complete RBAC rules required by the k8sclusterreceiver
    # Based on OpenTelemetry and SigNoz official documentation
    rules:
      # Core API group - fundamental Kubernetes resources
      - apiGroups: [""]
        resources:
          - "events"
          - "namespaces"
          - "nodes"
          - "nodes/proxy"
          - "nodes/metrics"
          - "nodes/spec"
          - "pods"
          - "pods/status"
          - "replicationcontrollers"
          - "replicationcontrollers/status"
          - "resourcequotas"
          - "services"
          - "endpoints"
        verbs: ["get", "list", "watch"]
      # Apps API group - modern workload controllers
      - apiGroups: ["apps"]
        resources: ["deployments", "daemonsets", "statefulsets", "replicasets"]
        verbs: ["get", "list", "watch"]
      # Batch API group - job management
      - apiGroups: ["batch"]
        resources: ["jobs", "cronjobs"]
        verbs: ["get", "list", "watch"]
      # Autoscaling API group - HPA metrics (CRITICAL)
      - apiGroups: ["autoscaling"]
        resources: ["horizontalpodautoscalers"]
        verbs: ["get", "list", "watch"]
      # Extensions API group - legacy support
      - apiGroups: ["extensions"]
        resources: ["deployments", "daemonsets", "replicasets"]
        verbs: ["get", "list", "watch"]
      # Metrics API group - resource metrics
      - apiGroups: ["metrics.k8s.io"]
        resources: ["nodes", "pods"]
        verbs: ["get", "list", "watch"]
    clusterRoleBinding:
      annotations: {}
      name: "signoz-otel-collector-bakery-ia"

# Additional configuration
serviceAccount:
  create: true
  annotations: {}
  name: "signoz-otel-collector"

# Security context
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  fsGroup: 1000

# Network policies (disabled for dev)
networkPolicy:
  enabled: false

# Monitoring SigNoz itself
selfMonitoring:
  enabled: true
  serviceMonitor:
    enabled: false
998	infrastructure/monitoring/signoz/signoz-values-prod.yaml	Normal file
@@ -0,0 +1,998 @@
# SigNoz Helm Chart Values - Production Environment
# High-availability configuration with resource optimization
# DEPLOYED IN bakery-ia NAMESPACE - Ingress managed by bakery-ingress-prod
#
# Official Chart: https://github.com/SigNoz/charts
# Install Command: helm install signoz signoz/signoz -n bakery-ia -f signoz-values-prod.yaml

global:
  storageClass: "microk8s-hostpath"  # For MicroK8s use "microk8s-hostpath"; otherwise a custom storage class
  clusterName: "bakery-ia-prod"
  domain: "monitoring.bakewise.ai"
  # Docker Hub credentials - applied to all sub-charts (Zookeeper, ClickHouse, etc.)
  imagePullSecrets:
    - dockerhub-creds

# Docker Hub credentials for pulling images (root level, for the SigNoz components)
imagePullSecrets:
  - dockerhub-creds
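The chart only references this secret by name; it must already exist in the target namespace. A minimal sketch of creating it and of the payload it carries (credentials below are illustrative placeholders):

```shell
# On a live cluster the secret would be created with:
#   kubectl create secret docker-registry dockerhub-creds \
#     --namespace bakery-ia \
#     --docker-server=https://index.docker.io/v1/ \
#     --docker-username="$DOCKERHUB_USERNAME" \
#     --docker-password="$DOCKERHUB_PASSWORD"
# The generated .dockerconfigjson payload has this shape:
DOCKERHUB_USERNAME='demo-user'
DOCKERHUB_PASSWORD='demo-token'
AUTH=$(printf '%s:%s' "$DOCKERHUB_USERNAME" "$DOCKERHUB_PASSWORD" | base64)
printf '{"auths":{"https://index.docker.io/v1/":{"auth":"%s"}}}\n' "$AUTH"
```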

# SigNoz Main Component (unified frontend + query service)
# BREAKING CHANGE: v0.89.0+ uses a unified component instead of separate frontend/queryService
signoz:
  replicaCount: 2

  image:
    repository: signoz/signoz
    tag: v0.106.0  # Latest stable version
    pullPolicy: IfNotPresent

  service:
    type: ClusterIP
    port: 8080  # HTTP/API port
    internalPort: 8085  # Internal gRPC port

  # DISABLE the built-in ingress - the unified bakery-ingress-prod is used instead.
  # The route is configured in infrastructure/kubernetes/overlays/prod/prod-ingress.yaml
  ingress:
    enabled: false

  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: 2000m
      memory: 4Gi

  # Pod anti-affinity for HA
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app.kubernetes.io/component: query-service
            topologyKey: kubernetes.io/hostname

  # Environment variables (new format - replaces configVars)
  env:
    signoz_telemetrystore_provider: "clickhouse"
    dot_metrics_enabled: "true"
    signoz_emailing_enabled: "true"
    signoz_alertmanager_provider: "signoz"
    # Retention configuration (30 days for prod)
    signoz_traces_ttl_duration_hrs: "720"
    signoz_metrics_ttl_duration_hrs: "720"
    signoz_logs_ttl_duration_hrs: "720"
    # OpAMP Server Configuration
    # WARNING: OpAMP can cause gRPC instability and collector reloads.
    # Only enable it if you have a stable OpAMP backend server.
    signoz_opamp_server_enabled: "false"
    # signoz_opamp_server_endpoint: "0.0.0.0:4320"
    # SMTP configuration for email alerts - now using Mailu as the SMTP server
    signoz_smtp_enabled: "true"
    signoz_smtp_host: "email-smtp.bakery-ia.svc.cluster.local"
    signoz_smtp_port: "587"
    signoz_smtp_from: "alerts@bakewise.ai"
    signoz_smtp_username: "alerts@bakewise.ai"
    # The password should be set via a secret: signoz_smtp_password
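As the comment notes, the SMTP password should not live in this values file. One possible shape for such a secret (the name and key below are illustrative; how it is wired into the pod depends on the chart version in use):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: signoz-smtp          # hypothetical secret name
  namespace: bakery-ia
type: Opaque
stringData:
  signoz_smtp_password: "change-me"
```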

  persistence:
    enabled: true
    size: 20Gi
    storageClass: "standard"

  # Horizontal Pod Autoscaler
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 5
    targetCPUUtilizationPercentage: 70
    targetMemoryUtilizationPercentage: 80

# AlertManager Configuration
alertmanager:
  enabled: true
  replicaCount: 2

  image:
    repository: signoz/alertmanager
    tag: 0.23.5
    pullPolicy: IfNotPresent

  service:
    type: ClusterIP
    port: 9093

  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 500m
      memory: 512Mi

  # Pod anti-affinity for HA
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: app
                  operator: In
                  values:
                    - signoz-alertmanager
            topologyKey: kubernetes.io/hostname

  persistence:
    enabled: true
    size: 5Gi
    storageClass: "standard"

  config:
    global:
      resolve_timeout: 5m
      smtp_smarthost: 'email-smtp.bakery-ia.svc.cluster.local:587'
      smtp_from: 'alerts@bakewise.ai'
      smtp_auth_username: 'alerts@bakewise.ai'
      smtp_auth_password: '${SMTP_PASSWORD}'
      smtp_require_tls: true

    route:
      group_by: ['alertname', 'cluster', 'service', 'severity']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 12h
      receiver: 'critical-alerts'
      routes:
        - match:
            severity: critical
          receiver: 'critical-alerts'
          continue: true
        - match:
            severity: warning
          receiver: 'warning-alerts'

    receivers:
      - name: 'critical-alerts'
        email_configs:
          - to: 'critical-alerts@bakewise.ai'
            headers:
              Subject: '[CRITICAL] {{ .GroupLabels.alertname }} - Bakery IA'
        # Slack webhook for critical alerts
        slack_configs:
          - api_url: '${SLACK_WEBHOOK_URL}'
            channel: '#alerts-critical'
            title: '[CRITICAL] {{ .GroupLabels.alertname }}'
            text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

      - name: 'warning-alerts'
        email_configs:
          - to: 'oncall@bakewise.ai'
            headers:
              Subject: '[WARNING] {{ .GroupLabels.alertname }} - Bakery IA'

# ClickHouse Configuration - Time Series Database
clickhouse:
  enabled: true
  installCustomStorageClass: false

  image:
    registry: docker.io
    repository: clickhouse/clickhouse-server
    tag: 25.5.6  # Updated to the officially recommended version
    pullPolicy: IfNotPresent

  # ClickHouse resources (nested config)
  clickhouse:
    resources:
      requests:
        cpu: 1000m
        memory: 2Gi
      limits:
        cpu: 4000m
        memory: 8Gi

  # Pod anti-affinity for HA
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - signoz-clickhouse
          topologyKey: kubernetes.io/hostname

  persistence:
    enabled: true
    size: 100Gi
    storageClass: "standard"

  # Cold storage configuration for better disk-space management
  coldStorage:
    enabled: true
    defaultKeepFreeSpaceBytes: 10737418240  # Keep 10GB free
    ttl:
      deleteTTLDays: 30  # Move old data to cold storage after 30 days

# Zookeeper Configuration (required by ClickHouse for coordination)
zookeeper:
  enabled: true
  replicaCount: 3  # CRITICAL: always use 3 replicas for production HA

  image:
    tag: 3.7.1  # Officially recommended version

  resources:
    requests:
      cpu: 100m
      memory: 256Mi
    limits:
      cpu: 500m
      memory: 512Mi

  persistence:
    enabled: true
    size: 10Gi
    storageClass: "standard"

# OpenTelemetry Collector - integrated with SigNoz
otelCollector:
  enabled: true
  replicaCount: 2

  image:
    repository: signoz/signoz-otel-collector
    tag: v0.129.12  # Updated to the latest recommended version
    pullPolicy: IfNotPresent

  # Init containers for the OTel Collector pod
  initContainers:
    fix-postgres-tls:
      enabled: true
      image:
        registry: docker.io
        repository: busybox
        tag: 1.35
        pullPolicy: IfNotPresent
      command:
        - sh
        - -c
        - |
          echo "Fixing PostgreSQL TLS file permissions..."
          cp /etc/postgres-tls-source/* /etc/postgres-tls/
          chmod 600 /etc/postgres-tls/server-key.pem
          chmod 644 /etc/postgres-tls/server-cert.pem
          chmod 644 /etc/postgres-tls/ca-cert.pem
          echo "PostgreSQL TLS permissions fixed"
      volumeMounts:
        - name: postgres-tls-source
          mountPath: /etc/postgres-tls-source
          readOnly: true
        - name: postgres-tls-fixed
          mountPath: /etc/postgres-tls
          readOnly: false

  service:
    type: ClusterIP
    ports:
      - name: otlp-grpc
        port: 4317
        targetPort: 4317
        protocol: TCP
      - name: otlp-http
        port: 4318
        targetPort: 4318
        protocol: TCP
      - name: prometheus
        port: 8889
        targetPort: 8889
        protocol: TCP
      - name: metrics
        port: 8888
        targetPort: 8888
        protocol: TCP
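Instrumented services point their OTLP exporters at these ports. A sketch of the standard OpenTelemetry SDK environment variables an application pod might set (the service DNS name below is an assumption based on a release named `signoz` in the `bakery-ia` namespace and may differ in your cluster):

```shell
# Standard OTel SDK configuration; host name is an assumption, ports match the
# service definition above (4318 = OTLP/HTTP, 4317 = OTLP/gRPC).
export OTEL_EXPORTER_OTLP_ENDPOINT="http://signoz-otel-collector.bakery-ia.svc.cluster.local:4318"
export OTEL_EXPORTER_OTLP_PROTOCOL="http/protobuf"   # use "grpc" with port 4317
export OTEL_RESOURCE_ATTRIBUTES="deployment.environment=production"
echo "$OTEL_EXPORTER_OTLP_ENDPOINT"
```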

  resources:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      cpu: 2000m
      memory: 2Gi

  # Additional environment variables for receivers
  additionalEnvs:
    POSTGRES_MONITOR_USER: "monitoring"
    POSTGRES_MONITOR_PASSWORD: "monitoring_369f9c001f242b07ef9e2826e17169ca"
    REDIS_PASSWORD: "OxdmdJjdVNXp37MNC2IFoMnTpfGGFv1k"
    RABBITMQ_USER: "bakery"
    RABBITMQ_PASSWORD: "forecast123"

  # Mount TLS certificates for secure connections
  extraVolumes:
    - name: redis-tls
      secret:
        secretName: redis-tls-secret
    - name: postgres-tls
      secret:
        secretName: postgres-tls
    - name: postgres-tls-fixed
      emptyDir: {}
    - name: varlogpods
      hostPath:
        path: /var/log/pods

  extraVolumeMounts:
    - name: redis-tls
      mountPath: /etc/redis-tls
      readOnly: true
    - name: postgres-tls
      mountPath: /etc/postgres-tls-source
      readOnly: true
    - name: postgres-tls-fixed
      mountPath: /etc/postgres-tls
      readOnly: false
    - name: varlogpods
      mountPath: /var/log/pods
      readOnly: true

  # Enable OpAMP for dynamic configuration management
  command:
    name: /signoz-otel-collector
    extraArgs:
      - --config=/conf/otel-collector-config.yaml
      - --manager-config=/conf/otel-collector-opamp-config.yaml
      - --feature-gates=-pkg.translator.prometheus.NormalizeName

  # Full OTel Collector configuration
  config:
    # Connectors - bridge between pipelines
    connectors:
      signozmeter:
        dimensions:
          - name: service.name
          - name: deployment.environment
          - name: host.name
        metrics_flush_interval: 1h

    extensions:
      health_check:
        endpoint: 0.0.0.0:13133
      zpages:
        endpoint: 0.0.0.0:55679

    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
            max_recv_msg_size_mib: 32  # Increased for larger payloads
          http:
            endpoint: 0.0.0.0:4318
            cors:
              allowed_origins:
                - "https://monitoring.bakewise.ai"
                - "https://*.bakewise.ai"

      # Filelog receiver for Kubernetes pod logs
      # Collects container stdout/stderr from /var/log/pods
      filelog:
        include:
          - /var/log/pods/*/*/*.log
        exclude:
          # Exclude SigNoz's own logs to avoid recursive collection
          - /var/log/pods/bakery-ia_signoz-*/*/*.log
        include_file_path: true
        include_file_name: false
        operators:
          # Parse the CRI-O / containerd log format
          - type: regex_parser
            regex: '^(?P<time>[^ ]+) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) (?P<log>.*)$'
            timestamp:
              parse_from: attributes.time
              layout: '%Y-%m-%dT%H:%M:%S.%LZ'
          # Fix timestamp parsing - extract from the parsed time field
          - type: move
            from: attributes.time
            to: attributes.timestamp
          # Extract Kubernetes metadata from the file path
          - type: regex_parser
            id: extract_metadata_from_filepath
            regex: '^.*\/(?P<namespace>[^_]+)_(?P<pod_name>[^_]+)_(?P<uid>[^\/]+)\/(?P<container_name>[^\._]+)\/(?P<restart_count>\d+)\.log$'
            parse_from: attributes["log.file.path"]
          # Move metadata to resource attributes
          - type: move
            from: attributes.namespace
            to: resource["k8s.namespace.name"]
          - type: move
            from: attributes.pod_name
            to: resource["k8s.pod.name"]
          - type: move
            from: attributes.container_name
            to: resource["k8s.container.name"]
          - type: move
            from: attributes.log
            to: body
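The two `regex_parser` operators above can be sanity-checked outside the collector. This shell sketch is illustrative only (the collector applies the regexes itself); it uses plain string splitting to show which fields each operator extracts from a sample containerd log line and pod log path:

```shell
# CRI/containerd log lines are "<time> <stream> <logtag> <log body>":
line='2024-05-01T12:00:00.123456789Z stdout F hello from the bakery'
printf '%s\n' "$line" | {
  read -r ts stream logtag log
  echo "stream=$stream logtag=$logtag"
  echo "log=$log"
}

# Pod log paths encode <namespace>_<pod>_<uid>/<container>/<restart>.log:
path='/var/log/pods/bakery-ia_auth-service-7d9f_0f7c1a2b/auth-service/0.log'
poddir=${path#/var/log/pods/}; poddir=${poddir%%/*}   # bakery-ia_auth-service-7d9f_0f7c1a2b
namespace=${poddir%%_*}                               # bakery-ia
container=$(basename "$(dirname "$path")")            # auth-service
echo "k8s.namespace.name=$namespace k8s.container.name=$container"
```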

      # Kubernetes Cluster Receiver - collects cluster-level metrics
      # Provides information about nodes, namespaces, pods, and other cluster resources
      k8s_cluster:
        collection_interval: 30s
        node_conditions_to_report:
          - Ready
          - MemoryPressure
          - DiskPressure
          - PIDPressure
          - NetworkUnavailable
        allocatable_types_to_report:
          - cpu
          - memory
          - pods

      # Prometheus receiver for scraping metrics
      prometheus:
        config:
          scrape_configs:
            - job_name: 'kubernetes-nodes-cadvisor'
              scrape_interval: 30s
              scrape_timeout: 10s
              scheme: https
              tls_config:
                insecure_skip_verify: true
              bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              kubernetes_sd_configs:
                - role: node
              relabel_configs:
                - action: labelmap
                  regex: __meta_kubernetes_node_label_(.+)
                - target_label: __address__
                  replacement: kubernetes.default.svc:443
                - source_labels: [__meta_kubernetes_node_name]
                  regex: (.+)
                  target_label: __metrics_path__
                  replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
            - job_name: 'kubernetes-apiserver'
              scrape_interval: 30s
              scrape_timeout: 10s
              scheme: https
              tls_config:
                insecure_skip_verify: true
              bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              kubernetes_sd_configs:
                - role: endpoints
              relabel_configs:
                - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
                  action: keep
                  regex: default;kubernetes;https

      # Redis receiver for cache metrics
      # ENABLED: uses the existing credentials from redis-secrets, with TLS
      redis:
        endpoint: redis-service.bakery-ia:6379
        password: ${env:REDIS_PASSWORD}
        collection_interval: 60s
        transport: tcp
        tls:
          insecure_skip_verify: false
          cert_file: /etc/redis-tls/redis-cert.pem
          key_file: /etc/redis-tls/redis-key.pem
          ca_file: /etc/redis-tls/ca-cert.pem
        metrics:
          redis.maxmemory:
            enabled: true
          redis.cmd.latency:
            enabled: true

      # RabbitMQ receiver via the management API
      # ENABLED: uses the existing credentials from rabbitmq-secrets
      rabbitmq:
        endpoint: http://rabbitmq-service.bakery-ia:15672
        username: ${env:RABBITMQ_USER}
        password: ${env:RABBITMQ_PASSWORD}
        collection_interval: 30s

      # PostgreSQL receivers for database metrics
      # Monitor all databases with proper TLS configuration
      postgresql/auth:
        endpoint: auth-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - auth_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/inventory:
        endpoint: inventory-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - inventory_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/orders:
        endpoint: orders-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - orders_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/ai-insights:
        endpoint: ai-insights-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - ai_insights_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/alert-processor:
        endpoint: alert-processor-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - alert_processor_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/distribution:
        endpoint: distribution-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - distribution_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/external:
        endpoint: external-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - external_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/forecasting:
        endpoint: forecasting-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - forecasting_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/notification:
        endpoint: notification-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - notification_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/orchestrator:
        endpoint: orchestrator-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - orchestrator_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/pos:
        endpoint: pos-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - pos_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/procurement:
        endpoint: procurement-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - procurement_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/production:
        endpoint: production-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - production_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/recipes:
        endpoint: recipes-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - recipes_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/sales:
        endpoint: sales-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - sales_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/suppliers:
        endpoint: suppliers-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - suppliers_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/tenant:
        endpoint: tenant-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - tenant_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/training:
        endpoint: training-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - training_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

    processors:
      # High-performance batch processing (official recommendation)
      batch:
        timeout: 1s  # Reduced from 10s for faster processing
        send_batch_size: 50000  # Increased from 2048 (official recommendation for traces)
        send_batch_max_size: 50000

      # Batch processor for meter data
      batch/meter:
        timeout: 1s
        send_batch_size: 20000
        send_batch_max_size: 25000

      memory_limiter:
        check_interval: 1s
        limit_mib: 1500  # ~75% of container memory (2Gi = 2048Mi)
        spike_limit_mib: 300
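The limiter sizing above is tied to the collector's 2Gi container limit; a quick arithmetic check of the values copied from this config (purely illustrative):

```shell
# memory_limiter sizing relative to the 2Gi container limit configured above
container_limit_mib=$((2 * 1024))   # resources.limits.memory = 2Gi
limit_mib=1500                      # hard limit at which data is refused
spike_limit_mib=300                 # headroom reserved for allocation spikes
soft_limit_mib=$((limit_mib - spike_limit_mib))
echo "hard limit: $((100 * limit_mib / container_limit_mib))% of container"  # 73%
echo "soft limit: ${soft_limit_mib} MiB"                                     # 1200 MiB
```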

      # Resource detection for K8s
      resourcedetection:
        detectors: [env, system, docker]
        timeout: 5s

      # Add resource attributes
      resource:
        attributes:
          - key: deployment.environment
            value: production
            action: upsert
          - key: cluster.name
            value: bakery-ia-prod
            action: upsert

      # Kubernetes attributes processor - CRITICAL for logs
      # Extracts pod, namespace, and container metadata from log attributes
      k8sattributes:
        auth_type: "serviceAccount"
        passthrough: false
        extract:
          metadata:
            - k8s.pod.name
            - k8s.pod.uid
            - k8s.deployment.name
            - k8s.namespace.name
            - k8s.node.name
            - k8s.container.name
          labels:
            - tag_name: "app"
            - tag_name: "pod-template-hash"
            - tag_name: "version"
          annotations:
            - tag_name: "description"

      # SigNoz span-metrics processor with delta aggregation (recommended)
      # Generates RED metrics (Rate, Error, Duration) from trace spans
      signozspanmetrics/delta:
        aggregation_temporality: AGGREGATION_TEMPORALITY_DELTA
        metrics_exporter: signozclickhousemetrics
        latency_histogram_buckets: [100us, 1ms, 2ms, 6ms, 10ms, 50ms, 100ms, 250ms, 500ms, 1000ms, 1400ms, 2000ms, 5s, 10s, 20s, 40s, 60s]
        dimensions_cache_size: 100000
        dimensions:
          - name: service.namespace
            default: default
          - name: deployment.environment
            default: production
          - name: signoz.collector.id

    exporters:
      # ClickHouse exporter for traces
      clickhousetraces:
        datasource: tcp://admin:27ff0399-0d3a-4bd8-919d-17c2181e6fb9@signoz-clickhouse:9000/?database=signoz_traces
        timeout: 10s
        retry_on_failure:
          enabled: true
          initial_interval: 5s
          max_interval: 30s
          max_elapsed_time: 300s

      # ClickHouse exporter for metrics
      signozclickhousemetrics:
        dsn: "tcp://admin:27ff0399-0d3a-4bd8-919d-17c2181e6fb9@signoz-clickhouse:9000/signoz_metrics"
        timeout: 10s
        retry_on_failure:
          enabled: true
          initial_interval: 5s
          max_interval: 30s
          max_elapsed_time: 300s

      # ClickHouse exporter for meter data (usage metrics)
      signozclickhousemeter:
        dsn: "tcp://admin:27ff0399-0d3a-4bd8-919d-17c2181e6fb9@signoz-clickhouse:9000/signoz_meter"
        timeout: 45s
        sending_queue:
          enabled: false

      # ClickHouse exporter for logs
      clickhouselogsexporter:
        dsn: tcp://admin:27ff0399-0d3a-4bd8-919d-17c2181e6fb9@signoz-clickhouse:9000/?database=signoz_logs
        timeout: 10s
        retry_on_failure:
          enabled: true
          initial_interval: 5s
          max_interval: 30s

      # Metadata exporter for service metadata
      metadataexporter:
        dsn: "tcp://admin:27ff0399-0d3a-4bd8-919d-17c2181e6fb9@signoz-clickhouse:9000/signoz_metadata"
        timeout: 10s
        cache:
          provider: in_memory

      # Debug exporter (optional)
      debug:
        verbosity: detailed
        sampling_initial: 5
        sampling_thereafter: 200

    service:
      extensions: [health_check, zpages]
      pipelines:
        # Traces pipeline - exports to ClickHouse and the signozmeter connector
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch, signozspanmetrics/delta, resourcedetection, resource]
          exporters: [clickhousetraces, metadataexporter, signozmeter]

        # Metrics pipeline - includes all infrastructure receivers
        metrics:
          receivers: [otlp,
            postgresql/auth, postgresql/inventory, postgresql/orders,
            postgresql/ai-insights, postgresql/alert-processor, postgresql/distribution,
            postgresql/external, postgresql/forecasting, postgresql/notification,
            postgresql/orchestrator, postgresql/pos, postgresql/procurement,
            postgresql/production, postgresql/recipes, postgresql/sales,
            postgresql/suppliers, postgresql/tenant, postgresql/training,
            redis, rabbitmq, k8s_cluster, prometheus]
          processors: [memory_limiter, batch, resourcedetection, resource]
          exporters: [signozclickhousemetrics]

        # Meter pipeline - receives from the signozmeter connector
        metrics/meter:
          receivers: [signozmeter]
          processors: [batch/meter]
          exporters: [signozclickhousemeter]

        # Logs pipeline - includes both OTLP and Kubernetes pod logs
        logs:
          receivers: [otlp, filelog]
          processors: [memory_limiter, batch, resourcedetection, resource, k8sattributes]
          exporters: [clickhouselogsexporter]

  # HPA for the OTel Collector
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70
    targetMemoryUtilizationPercentage: 80

  # ClusterRole configuration for Kubernetes monitoring
  # CRITICAL: required for the k8s_cluster receiver to access the Kubernetes API.
  # Without these permissions, k8s metrics will not appear in the SigNoz UI.
  clusterRole:
    create: true
    name: "signoz-otel-collector-bakery-ia"
    annotations: {}
    # Complete RBAC rules required by the k8sclusterreceiver,
    # based on the official OpenTelemetry and SigNoz documentation.
    rules:
      # Core API group - fundamental Kubernetes resources
      - apiGroups: [""]
        resources:
          - "events"
          - "namespaces"
          - "nodes"
          - "nodes/proxy"
          - "nodes/metrics"
          - "nodes/spec"
          - "pods"
          - "pods/status"
          - "replicationcontrollers"
          - "replicationcontrollers/status"
          - "resourcequotas"
          - "services"
          - "endpoints"
        verbs: ["get", "list", "watch"]
      # Apps API group - modern workload controllers
      - apiGroups: ["apps"]
        resources: ["deployments", "daemonsets", "statefulsets", "replicasets"]
        verbs: ["get", "list", "watch"]
      # Batch API group - job management
      - apiGroups: ["batch"]
        resources: ["jobs", "cronjobs"]
        verbs: ["get", "list", "watch"]
      # Autoscaling API group - HPA metrics (CRITICAL)
      - apiGroups: ["autoscaling"]
        resources: ["horizontalpodautoscalers"]
        verbs: ["get", "list", "watch"]
      # Extensions API group - legacy support
      - apiGroups: ["extensions"]
        resources: ["deployments", "daemonsets", "replicasets"]
        verbs: ["get", "list", "watch"]
      # Metrics API group - resource metrics
      - apiGroups: ["metrics.k8s.io"]
        resources: ["nodes", "pods"]
        verbs: ["get", "list", "watch"]

  clusterRoleBinding:
    annotations: {}
    name: "signoz-otel-collector-bakery-ia"

# Schema Migrator - manages ClickHouse schema migrations
schemaMigrator:
  enabled: true

  image:
    repository: signoz/signoz-schema-migrator
    tag: v0.129.12  # Updated to the latest version
    pullPolicy: IfNotPresent

  # Enable Helm hooks for proper upgrade handling
  upgradeHelmHooks: true

# Additional Configuration
serviceAccount:
  create: true
  annotations: {}
  name: "signoz"

# Security Context
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  fsGroup: 1000

# Pod Disruption Budgets for HA
podDisruptionBudget:
  frontend:
    enabled: true
    minAvailable: 1
  queryService:
    enabled: true
    minAvailable: 1
  alertmanager:
    enabled: true
    minAvailable: 1
  clickhouse:
    enabled: true
    minAvailable: 1

# Network Policies for security
networkPolicy:
  enabled: true
  policyTypes:
    - Ingress
    - Egress

# Monitoring SigNoz itself
selfMonitoring:
  enabled: true
  serviceMonitor:
    enabled: true
    interval: 30s
177
infrastructure/monitoring/signoz/verify-signoz-telemetry.sh
Executable file
@@ -0,0 +1,177 @@
#!/bin/bash

# SigNoz Telemetry Verification Script
# Verifies that services are correctly sending metrics, logs, and traces to SigNoz,
# and that SigNoz is collecting them properly.

set -e

NAMESPACE="bakery-ia"
GREEN='\033[0;32m'
RED='\033[0;31m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color

echo -e "${BLUE}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
echo -e "${BLUE} SigNoz Telemetry Verification Script${NC}"
echo -e "${BLUE}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
echo ""

# Step 1: Verify SigNoz components are running.
# The "|| true" guards keep "set -e" from aborting before we can print a
# useful error message when a pod is missing.
echo -e "${BLUE}[1/7] Checking SigNoz Components Status...${NC}"
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"

OTEL_POD=$(kubectl get pods -n $NAMESPACE -l app.kubernetes.io/name=signoz,app.kubernetes.io/component=otel-collector --field-selector=status.phase=Running -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || true)
SIGNOZ_POD=$(kubectl get pods -n $NAMESPACE -l app.kubernetes.io/name=signoz,app.kubernetes.io/component=signoz --field-selector=status.phase=Running -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || true)
CLICKHOUSE_POD=$(kubectl get pods -n $NAMESPACE -l clickhouse.altinity.com/chi=signoz-clickhouse --field-selector=status.phase=Running -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || true)

if [[ -n "$OTEL_POD" && -n "$SIGNOZ_POD" && -n "$CLICKHOUSE_POD" ]]; then
    echo -e "${GREEN}✓ All SigNoz components are running${NC}"
    echo "  - OTel Collector: $OTEL_POD"
    echo "  - SigNoz Frontend: $SIGNOZ_POD"
    echo "  - ClickHouse: $CLICKHOUSE_POD"
else
    echo -e "${RED}✗ Some SigNoz components are not running${NC}"
    kubectl get pods -n $NAMESPACE | grep signoz || true
    exit 1
fi
echo ""

# Step 2: Check OTel Collector endpoints
echo -e "${BLUE}[2/7] Verifying OTel Collector Endpoints...${NC}"
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"

OTEL_SVC=$(kubectl get svc -n $NAMESPACE signoz-otel-collector -o jsonpath='{.spec.clusterIP}')
echo "OTel Collector Service IP: $OTEL_SVC"
echo ""
echo "Available endpoints:"
kubectl get svc -n $NAMESPACE signoz-otel-collector -o jsonpath='{range .spec.ports[*]}{.name}{"\t"}{.port}{"\n"}{end}' | column -t
echo ""
echo -e "${GREEN}✓ OTel Collector endpoints are exposed${NC}"
echo ""

# Step 3: Check OTel Collector logs for data reception
echo -e "${BLUE}[3/7] Checking OTel Collector for Recent Activity...${NC}"
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"

echo "Recent OTel Collector logs (last 20 lines):"
kubectl logs -n $NAMESPACE $OTEL_POD --tail=20 | grep -E "received|exported|traces|metrics|logs" || echo "No recent telemetry data found in logs"
echo ""

# Step 4: Check service configurations
echo -e "${BLUE}[4/7] Verifying Service Telemetry Configuration...${NC}"
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"

# Read OTEL settings from the shared ConfigMap
OTEL_ENDPOINT=$(kubectl get configmap bakery-config -n $NAMESPACE -o jsonpath='{.data.OTEL_EXPORTER_OTLP_ENDPOINT}')
ENABLE_TRACING=$(kubectl get configmap bakery-config -n $NAMESPACE -o jsonpath='{.data.ENABLE_TRACING}')
ENABLE_METRICS=$(kubectl get configmap bakery-config -n $NAMESPACE -o jsonpath='{.data.ENABLE_METRICS}')
ENABLE_LOGS=$(kubectl get configmap bakery-config -n $NAMESPACE -o jsonpath='{.data.ENABLE_LOGS}')

echo "Configuration from bakery-config ConfigMap:"
echo "  OTEL_EXPORTER_OTLP_ENDPOINT: $OTEL_ENDPOINT"
echo "  ENABLE_TRACING: $ENABLE_TRACING"
echo "  ENABLE_METRICS: $ENABLE_METRICS"
echo "  ENABLE_LOGS: $ENABLE_LOGS"
echo ""

if [[ "$ENABLE_TRACING" == "true" && "$ENABLE_METRICS" == "true" && "$ENABLE_LOGS" == "true" ]]; then
    echo -e "${GREEN}✓ Telemetry is enabled in configuration${NC}"
else
    echo -e "${YELLOW}⚠ Some telemetry features may be disabled${NC}"
fi
echo ""

# Step 5: Test OTel Collector health
echo -e "${BLUE}[5/7] Testing OTel Collector Health Endpoint...${NC}"
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"

HEALTH_CHECK=$(kubectl exec -n $NAMESPACE $OTEL_POD -- wget -qO- http://localhost:13133/ 2>/dev/null || echo "FAILED")
if [[ "$HEALTH_CHECK" == *"Server available"* ]] || [[ "$HEALTH_CHECK" == "{}" ]]; then
    echo -e "${GREEN}✓ OTel Collector health check passed${NC}"
else
    echo -e "${RED}✗ OTel Collector health check failed${NC}"
    echo "Response: $HEALTH_CHECK"
fi
echo ""

# Step 6: Query ClickHouse for telemetry data
echo -e "${BLUE}[6/7] Querying ClickHouse for Telemetry Data...${NC}"
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"

# Get ClickHouse credentials from the chart-managed secret
CH_PASSWORD=$(kubectl get secret -n $NAMESPACE signoz-clickhouse -o jsonpath='{.data.admin-password}' 2>/dev/null | base64 -d || echo "27ff0399-0d3a-4bd8-919d-17c2181e6fb9")

echo "Checking for traces in ClickHouse..."
TRACES_COUNT=$(kubectl exec -n $NAMESPACE $CLICKHOUSE_POD -- clickhouse-client --user=admin --password="$CH_PASSWORD" --query="SELECT count() FROM signoz_traces.signoz_index_v2 WHERE timestamp >= now() - INTERVAL 1 HOUR" 2>/dev/null || echo "0")
echo "  Traces in last hour: $TRACES_COUNT"

echo "Checking for metrics in ClickHouse..."
METRICS_COUNT=$(kubectl exec -n $NAMESPACE $CLICKHOUSE_POD -- clickhouse-client --user=admin --password="$CH_PASSWORD" --query="SELECT count() FROM signoz_metrics.samples_v4 WHERE unix_milli >= toUnixTimestamp(now() - INTERVAL 1 HOUR) * 1000" 2>/dev/null || echo "0")
echo "  Metrics in last hour: $METRICS_COUNT"

echo "Checking for logs in ClickHouse..."
LOGS_COUNT=$(kubectl exec -n $NAMESPACE $CLICKHOUSE_POD -- clickhouse-client --user=admin --password="$CH_PASSWORD" --query="SELECT count() FROM signoz_logs.logs WHERE timestamp >= now() - INTERVAL 1 HOUR" 2>/dev/null || echo "0")
echo "  Logs in last hour: $LOGS_COUNT"
echo ""

if [[ "$TRACES_COUNT" -gt "0" || "$METRICS_COUNT" -gt "0" || "$LOGS_COUNT" -gt "0" ]]; then
    echo -e "${GREEN}✓ Telemetry data found in ClickHouse!${NC}"
else
    echo -e "${YELLOW}⚠ No telemetry data found in the last hour${NC}"
    echo "  This might be normal if:"
    echo "  - Services were just deployed"
    echo "  - No traffic has been generated yet"
    echo "  - Services haven't finished initializing"
fi
echo ""

# Step 7: Access information
echo -e "${BLUE}[7/7] SigNoz UI Access Information${NC}"
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
echo ""
echo "SigNoz is accessible via ingress at:"
echo -e "  ${GREEN}https://monitoring.bakery-ia.local${NC}"
echo ""
echo "Or via port-forward:"
echo -e "  ${YELLOW}kubectl port-forward -n $NAMESPACE svc/signoz 3301:8080${NC}"
echo "  Then access: http://localhost:3301"
echo ""
echo "To view OTel Collector metrics:"
echo -e "  ${YELLOW}kubectl port-forward -n $NAMESPACE svc/signoz-otel-collector 8888:8888${NC}"
echo "  Then access: http://localhost:8888/metrics"
echo ""

# Summary
echo ""
echo -e "${BLUE}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
echo -e "${BLUE} Verification Summary${NC}"
echo -e "${BLUE}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
echo ""
echo "Component Status:"
echo "  ✓ SigNoz components running"
echo "  ✓ OTel Collector healthy"
echo "  ✓ Configuration correct"
echo ""
echo "Data Collection (last hour):"
echo "  Traces:  $TRACES_COUNT"
echo "  Metrics: $METRICS_COUNT"
echo "  Logs:    $LOGS_COUNT"
echo ""

if [[ "$TRACES_COUNT" -gt "0" || "$METRICS_COUNT" -gt "0" || "$LOGS_COUNT" -gt "0" ]]; then
    echo -e "${GREEN}✓ SigNoz is collecting telemetry data successfully!${NC}"
else
    echo -e "${YELLOW}⚠ To generate telemetry data, try:${NC}"
    echo ""
    echo "1. Generate traffic to your services:"
    echo "   curl http://localhost/api/health"
    echo ""
    echo "2. Check service logs for tracing initialization:"
    echo "   kubectl logs -n $NAMESPACE <service-pod> | grep -i 'tracing\\|otel\\|signoz'"
    echo ""
    echo "3. Wait a few minutes and run this script again"
fi
echo ""
echo -e "${BLUE}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
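The keys read in step 4 come from a shared `bakery-config` ConfigMap consumed by the application services. As a hedged sketch (the key names are taken from the script above; the endpoint value is an assumption based on the in-cluster collector service name used elsewhere in this document), such a ConfigMap might look like:

```yaml
# Hypothetical bakery-config ConfigMap; adjust the OTLP endpoint (gRPC :4317
# or HTTP :4318) to match how your services export telemetry.
apiVersion: v1
kind: ConfigMap
metadata:
  name: bakery-config
  namespace: bakery-ia
data:
  OTEL_EXPORTER_OTLP_ENDPOINT: "http://signoz-otel-collector.bakery-ia.svc.cluster.local:4317"
  ENABLE_TRACING: "true"
  ENABLE_METRICS: "true"
  ENABLE_LOGS: "true"
```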
446
infrastructure/monitoring/signoz/verify-signoz.sh
Executable file
@@ -0,0 +1,446 @@
#!/bin/bash

# ============================================================================
# SigNoz Verification Script for Bakery IA
# ============================================================================
# This script verifies that SigNoz is properly deployed and functioning
# ============================================================================

set -e

# Color codes for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color

# Function to display help
show_help() {
    echo "Usage: $0 [OPTIONS] ENVIRONMENT"
    echo ""
    echo "Verify SigNoz deployment for Bakery IA"
    echo ""
    echo "Arguments:
  ENVIRONMENT                Environment to verify (dev|prod)"
    echo ""
    echo "Options:
  -h, --help                 Show this help message
  -n, --namespace NAMESPACE  Specify namespace (default: bakery-ia)"
    echo ""
    echo "Examples:
  $0 dev                          # Verify development deployment
  $0 prod                         # Verify production deployment
  $0 --namespace monitoring dev   # Verify with custom namespace"
}

# Parse command line arguments
NAMESPACE="bakery-ia"

while [[ $# -gt 0 ]]; do
    case $1 in
        -h|--help)
            show_help
            exit 0
            ;;
        -n|--namespace)
            NAMESPACE="$2"
            shift 2
            ;;
        dev|prod)
            ENVIRONMENT="$1"
            shift
            ;;
        *)
            echo "Unknown argument: $1"
            show_help
            exit 1
            ;;
    esac
done

# Validate environment
if [[ -z "$ENVIRONMENT" ]]; then
    echo "Error: Environment not specified. Use 'dev' or 'prod'."
    show_help
    exit 1
fi

if [[ "$ENVIRONMENT" != "dev" && "$ENVIRONMENT" != "prod" ]]; then
    echo "Error: Invalid environment. Use 'dev' or 'prod'."
    exit 1
fi

# Function to check if kubectl is configured
check_kubectl() {
    if ! kubectl cluster-info &> /dev/null; then
        echo -e "${RED}Error: kubectl is not configured or cannot connect to the cluster.${NC}"
        echo "Please ensure you have access to a Kubernetes cluster."
        exit 1
    fi
}

# Function to check that the namespace exists
check_namespace() {
    if ! kubectl get namespace "$NAMESPACE" &> /dev/null; then
        echo -e "${RED}Error: Namespace $NAMESPACE does not exist.${NC}"
        echo "Please deploy SigNoz first using: ./deploy-signoz.sh $ENVIRONMENT"
        exit 1
    fi
}

# Function to verify SigNoz deployment
verify_deployment() {
    echo -e "${BLUE}"
    echo "=========================================="
    echo "🔍 Verifying SigNoz Deployment"
    echo "=========================================="
    echo "Environment: $ENVIRONMENT"
    echo "Namespace: $NAMESPACE"
    echo -e "${NC}"
    echo ""

    # Check if the SigNoz Helm release exists
    echo -e "${BLUE}1. Checking Helm release...${NC}"
    if helm list -n "$NAMESPACE" | grep -q signoz; then
        echo -e "${GREEN}✅ SigNoz Helm release found${NC}"
    else
        echo -e "${RED}❌ SigNoz Helm release not found${NC}"
        echo "Please deploy SigNoz first using: ./deploy-signoz.sh $ENVIRONMENT"
        exit 1
    fi
    echo ""

    # Check pod status. Note: "grep -c" prints "0" itself on no match, so the
    # fallback is "|| true" rather than "|| echo 0" (which would emit "0" twice).
    echo -e "${BLUE}2. Checking pod status...${NC}"
    local total_pods=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/instance=signoz 2>/dev/null | grep -v "NAME" | wc -l | tr -d ' ' || echo "0")
    local running_pods=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/instance=signoz --field-selector=status.phase=Running 2>/dev/null | grep -c "Running" || true)
    local ready_pods=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/instance=signoz 2>/dev/null | grep "Running" | grep "1/1" | wc -l | tr -d ' ' || echo "0")

    echo "Total pods: $total_pods"
    echo "Running pods: $running_pods"
    echo "Ready pods: $ready_pods"

    if [[ $total_pods -eq 0 ]]; then
        echo -e "${RED}❌ No SigNoz pods found${NC}"
        exit 1
    fi

    if [[ $running_pods -eq $total_pods ]]; then
        echo -e "${GREEN}✅ All pods are running${NC}"
    else
        echo -e "${YELLOW}⚠️ Some pods are not running${NC}"
    fi

    if [[ $ready_pods -eq $total_pods ]]; then
        echo -e "${GREEN}✅ All pods are ready${NC}"
    else
        echo -e "${YELLOW}⚠️ Some pods are not ready${NC}"
    fi
    echo ""

    # Show pod details
    echo -e "${BLUE}Pod Details:${NC}"
    kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/instance=signoz
    echo ""

    # Check services
    echo -e "${BLUE}3. Checking services...${NC}"
    local service_count=$(kubectl get svc -n "$NAMESPACE" -l app.kubernetes.io/instance=signoz 2>/dev/null | grep -v "NAME" | wc -l | tr -d ' ' || echo "0")

    if [[ $service_count -gt 0 ]]; then
        echo -e "${GREEN}✅ Services found ($service_count services)${NC}"
        kubectl get svc -n "$NAMESPACE" -l app.kubernetes.io/instance=signoz
    else
        echo -e "${RED}❌ No services found${NC}"
    fi
    echo ""

    # Check ingress
    echo -e "${BLUE}4. Checking ingress...${NC}"
    local ingress_count=$(kubectl get ingress -n "$NAMESPACE" -l app.kubernetes.io/instance=signoz 2>/dev/null | grep -v "NAME" | wc -l | tr -d ' ' || echo "0")

    if [[ $ingress_count -gt 0 ]]; then
        echo -e "${GREEN}✅ Ingress found ($ingress_count ingress resources)${NC}"
        kubectl get ingress -n "$NAMESPACE" -l app.kubernetes.io/instance=signoz
    else
        echo -e "${YELLOW}⚠️ No ingress found (may be configured in main namespace)${NC}"
    fi
    echo ""

    # Check PVCs
    echo -e "${BLUE}5. Checking persistent volume claims...${NC}"
    local pvc_count=$(kubectl get pvc -n "$NAMESPACE" -l app.kubernetes.io/instance=signoz 2>/dev/null | grep -v "NAME" | wc -l | tr -d ' ' || echo "0")

    if [[ $pvc_count -gt 0 ]]; then
        echo -e "${GREEN}✅ PVCs found ($pvc_count PVCs)${NC}"
        kubectl get pvc -n "$NAMESPACE" -l app.kubernetes.io/instance=signoz
    else
        echo -e "${YELLOW}⚠️ No PVCs found (may not be required for all components)${NC}"
    fi
    echo ""

    # Check resource usage (requires metrics-server)
    echo -e "${BLUE}6. Checking resource usage...${NC}"
    if kubectl top pods -n "$NAMESPACE" &> /dev/null; then
        echo -e "${GREEN}✅ Resource usage:${NC}"
        kubectl top pods -n "$NAMESPACE" -l app.kubernetes.io/instance=signoz
    else
        echo -e "${YELLOW}⚠️ Metrics server not available or no resource usage data${NC}"
    fi
    echo ""

    # Check logs for errors
    echo -e "${BLUE}7. Checking for errors in logs...${NC}"
    local error_found=false

    # Check each pod for errors
    while IFS= read -r pod; do
        if [[ -n "$pod" ]]; then
            local pod_errors=$(kubectl logs -n "$NAMESPACE" "$pod" 2>/dev/null | grep -ci "error\|exception\|fail\|crash" || true)
            if [[ $pod_errors -gt 0 ]]; then
                echo -e "${RED}❌ Errors found in pod $pod ($pod_errors errors)${NC}"
                error_found=true
            fi
        fi
    done < <(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/instance=signoz -o name | sed 's|pod/||')

    if [[ "$error_found" == false ]]; then
        echo -e "${GREEN}✅ No errors found in logs${NC}"
    fi
    echo ""

    # Environment-specific checks
    if [[ "$ENVIRONMENT" == "dev" ]]; then
        verify_dev_specific
    else
        verify_prod_specific
    fi

    # Show access information
    show_access_info
}

# Function for development-specific verification
verify_dev_specific() {
    echo -e "${BLUE}8. Development-specific checks...${NC}"

    # Check if the dev ingress is configured
    if kubectl get ingress -n "$NAMESPACE" 2>/dev/null | grep -q "monitoring.bakery-ia.local"; then
        echo -e "${GREEN}✅ Development ingress configured${NC}"
    else
        echo -e "${YELLOW}⚠️ Development ingress not found${NC}"
    fi

    # Check unified signoz component resource limits (should be lower for dev)
    local signoz_mem=$(kubectl get deployment -n "$NAMESPACE" -l app.kubernetes.io/component=query-service -o jsonpath='{.items[0].spec.template.spec.containers[0].resources.limits.memory}' 2>/dev/null || echo "")
    if [[ -n "$signoz_mem" ]]; then
        echo -e "${GREEN}✅ SigNoz component found (memory limit: $signoz_mem)${NC}"
    else
        echo -e "${YELLOW}⚠️ Could not verify SigNoz component resources${NC}"
    fi

    # Check single-replica setup for dev
    local replicas=$(kubectl get deployment -n "$NAMESPACE" -l app.kubernetes.io/component=query-service -o jsonpath='{.items[0].spec.replicas}' 2>/dev/null || echo "0")
    if [[ $replicas -eq 1 ]]; then
        echo -e "${GREEN}✅ Single replica configuration (appropriate for dev)${NC}"
    else
        echo -e "${YELLOW}⚠️ Multiple replicas detected (replicas: $replicas)${NC}"
    fi
    echo ""
}

# Function for production-specific verification
verify_prod_specific() {
    echo -e "${BLUE}8. Production-specific checks...${NC}"

    # Check if TLS is configured
    if kubectl get ingress -n "$NAMESPACE" 2>/dev/null | grep -q "signoz-tls"; then
        echo -e "${GREEN}✅ TLS certificate configured${NC}"
    else
        echo -e "${YELLOW}⚠️ TLS certificate not found${NC}"
    fi

    # Check if multiple replicas are running for HA
    local signoz_replicas=$(kubectl get deployment -n "$NAMESPACE" -l app.kubernetes.io/component=query-service -o jsonpath='{.items[0].spec.replicas}' 2>/dev/null || echo "1")
    if [[ $signoz_replicas -gt 1 ]]; then
        echo -e "${GREEN}✅ High availability configured ($signoz_replicas SigNoz replicas)${NC}"
    else
        echo -e "${YELLOW}⚠️ Single SigNoz replica detected (not highly available)${NC}"
    fi

    # Check Zookeeper replicas (critical for production)
    local zk_replicas=$(kubectl get statefulset -n "$NAMESPACE" -l app.kubernetes.io/component=zookeeper -o jsonpath='{.items[0].spec.replicas}' 2>/dev/null || echo "0")
    if [[ $zk_replicas -eq 3 ]]; then
        echo -e "${GREEN}✅ Zookeeper properly configured with 3 replicas${NC}"
    elif [[ $zk_replicas -gt 0 ]]; then
        echo -e "${YELLOW}⚠️ Zookeeper has $zk_replicas replicas (recommend 3 for production)${NC}"
    else
        echo -e "${RED}❌ Zookeeper not found${NC}"
    fi

    # Check OTel Collector replicas
    local otel_replicas=$(kubectl get deployment -n "$NAMESPACE" -l app.kubernetes.io/component=otel-collector -o jsonpath='{.items[0].spec.replicas}' 2>/dev/null || echo "1")
    if [[ $otel_replicas -gt 1 ]]; then
        echo -e "${GREEN}✅ OTel Collector HA configured ($otel_replicas replicas)${NC}"
    else
        echo -e "${YELLOW}⚠️ Single OTel Collector replica${NC}"
    fi

    # Check resource limits (should be higher for prod)
    local signoz_mem=$(kubectl get deployment -n "$NAMESPACE" -l app.kubernetes.io/component=query-service -o jsonpath='{.items[0].spec.template.spec.containers[0].resources.limits.memory}' 2>/dev/null || echo "")
    if [[ -n "$signoz_mem" ]]; then
        echo -e "${GREEN}✅ Production resource limits applied (memory: $signoz_mem)${NC}"
    else
        echo -e "${YELLOW}⚠️ Could not verify resource limits${NC}"
    fi

    # Check HPA (Horizontal Pod Autoscaler); "grep -c" prints "0" on no match,
    # so "|| true" avoids a doubled "0"
    local hpa_count=$(kubectl get hpa -n "$NAMESPACE" 2>/dev/null | grep -c signoz || true)
    if [[ $hpa_count -gt 0 ]]; then
        echo -e "${GREEN}✅ Horizontal Pod Autoscaler configured${NC}"
    else
        echo -e "${YELLOW}⚠️ No HPA found (consider enabling for production)${NC}"
    fi
    echo ""
}

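One counting pitfall is worth calling out for scripts like these: `grep -c` prints `0` itself when nothing matches and signals the miss only through its exit status, so a `|| echo "0"` fallback produces a two-line value (`0` twice) that later breaks numeric comparisons. A small standalone sketch of the difference:

```bash
# Broken: the fallback fires even though grep already printed 0,
# so the variable holds two lines ("0" twice).
broken=$(printf 'foo\nbar\n' | grep -c signoz || echo "0")

# Fixed: swallow only the exit status; the single "0" from grep is kept.
fixed=$(printf 'foo\nbar\n' | grep -c signoz || true)

# grep -c '' counts lines, so this shows how many lines each value spans.
echo "broken spans $(printf '%s\n' "$broken" | grep -c '') line(s)"   # prints 2
echo "fixed spans $(printf '%s\n' "$fixed" | grep -c '') line(s)"     # prints 1
```

The same reasoning applies to every `grep -c … || echo "0"` in the original drafts of these scripts.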
# Function to show access information
show_access_info() {
    echo -e "${BLUE}"
    echo "=========================================="
    echo "📋 Access Information"
    echo "=========================================="
    echo -e "${NC}"

    if [[ "$ENVIRONMENT" == "dev" ]]; then
        echo "SigNoz UI: http://monitoring.bakery-ia.local"
        echo ""
        echo "OpenTelemetry Collector (within cluster):"
        echo "  gRPC: signoz-otel-collector.$NAMESPACE.svc.cluster.local:4317"
        echo "  HTTP: signoz-otel-collector.$NAMESPACE.svc.cluster.local:4318"
        echo ""
        echo "Port-forward for local access:"
        echo "  kubectl port-forward -n $NAMESPACE svc/signoz 8080:8080"
        echo "  kubectl port-forward -n $NAMESPACE svc/signoz-otel-collector 4317:4317"
        echo "  kubectl port-forward -n $NAMESPACE svc/signoz-otel-collector 4318:4318"
    else
        echo "SigNoz UI: https://monitoring.bakewise.ai"
        echo ""
        echo "OpenTelemetry Collector (within cluster):"
        echo "  gRPC: signoz-otel-collector.$NAMESPACE.svc.cluster.local:4317"
        echo "  HTTP: signoz-otel-collector.$NAMESPACE.svc.cluster.local:4318"
    fi

    echo ""
    echo "Default Credentials:"
    echo "  Username: admin@example.com"
    echo "  Password: admin"
    echo ""
    echo "⚠️ IMPORTANT: Change the default password after first login!"
    echo ""

    # Show connection test commands
    echo "Connection Test Commands:"
    if [[ "$ENVIRONMENT" == "dev" ]]; then
        echo "  # Test SigNoz UI"
        echo "  curl http://monitoring.bakery-ia.local"
        echo ""
        echo "  # Test via port-forward"
        echo "  kubectl port-forward -n $NAMESPACE svc/signoz 8080:8080"
        echo "  curl http://localhost:8080"
    else
        echo "  # Test SigNoz UI"
        echo "  curl https://monitoring.bakewise.ai"
        echo ""
        echo "  # Test API health"
        echo "  kubectl port-forward -n $NAMESPACE svc/signoz 8080:8080"
        echo "  curl http://localhost:8080/api/v1/health"
    fi
    echo ""
}

# Function to run connectivity tests
run_connectivity_tests() {
    echo -e "${BLUE}"
    echo "=========================================="
    echo "🔗 Running Connectivity Tests"
    echo "=========================================="
    echo -e "${NC}"

    # Test pod readiness first ("grep -c" prints "0" on no match, hence "|| true")
    echo "Checking pod readiness..."
    local ready_pods=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/instance=signoz --field-selector=status.phase=Running 2>/dev/null | grep "Running" | grep -c "1/1\|2/2" || true)
    local total_pods=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/instance=signoz 2>/dev/null | grep -v "NAME" | wc -l | tr -d ' ' || echo "0")

    if [[ $ready_pods -eq $total_pods && $total_pods -gt 0 ]]; then
        echo -e "${GREEN}✅ All pods are ready ($ready_pods/$total_pods)${NC}"
    else
        echo -e "${YELLOW}⚠️ Some pods not ready ($ready_pods/$total_pods)${NC}"
    fi
    echo ""

    # Test internal service connectivity
    echo "Testing internal service connectivity..."
    local signoz_svc=$(kubectl get svc -n "$NAMESPACE" signoz -o jsonpath='{.spec.clusterIP}' 2>/dev/null || echo "")
    if [[ -n "$signoz_svc" ]]; then
        echo -e "${GREEN}✅ SigNoz service accessible at $signoz_svc:8080${NC}"
    else
        echo -e "${RED}❌ SigNoz service not found${NC}"
    fi

    local otel_svc=$(kubectl get svc -n "$NAMESPACE" signoz-otel-collector -o jsonpath='{.spec.clusterIP}' 2>/dev/null || echo "")
    if [[ -n "$otel_svc" ]]; then
        echo -e "${GREEN}✅ OTel Collector service accessible at $otel_svc:4317 (gRPC), $otel_svc:4318 (HTTP)${NC}"
    else
        echo -e "${RED}❌ OTel Collector service not found${NC}"
    fi
    echo ""

    if [[ "$ENVIRONMENT" == "prod" ]]; then
        echo -e "${YELLOW}⚠️ Production connectivity tests require valid DNS and TLS${NC}"
        echo "  Please ensure monitoring.bakewise.ai resolves to your cluster"
        echo ""
        echo "Manual test:"
        echo "  curl -I https://monitoring.bakewise.ai"
    fi
}

# Main execution
main() {
    echo -e "${BLUE}"
    echo "=========================================="
    echo "🔍 SigNoz Verification for Bakery IA"
    echo "=========================================="
    echo -e "${NC}"

    # Check prerequisites
    check_kubectl
    check_namespace

    # Verify deployment
    verify_deployment

    # Run connectivity tests
    run_connectivity_tests

    echo -e "${GREEN}"
    echo "=========================================="
    echo "✅ Verification Complete"
    echo "=========================================="
    echo -e "${NC}"

    echo "Summary:"
    echo "  Environment: $ENVIRONMENT"
    echo "  Namespace: $NAMESPACE"
    echo ""
    echo "Next Steps:"
    echo "  1. Access SigNoz UI and verify dashboards"
    echo "  2. Configure alert rules for your services"
    echo "  3. Instrument your applications with OpenTelemetry"
    echo "  4. Set up custom dashboards for key metrics"
    echo ""
}

# Run main function
main
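To exercise the pipeline end to end without instrumenting an application, one option is to hand-craft a single OTLP/HTTP span and POST it to the collector. This is a sketch, not part of the deployed scripts: the trace/span IDs and the `smoke-test` service name are arbitrary placeholders, and the `curl` target assumes the port-forward shown above.

```bash
# Build a minimal OTLP/HTTP JSON trace payload (one span). The IDs are
# arbitrary hex strings of the required lengths (32 and 16 characters).
TRACE_ID=$(printf '%032d' 1)
SPAN_ID=$(printf '%016d' 1)
NOW_NS=$(( $(date +%s) * 1000000000 ))

PAYLOAD='{"resourceSpans":[{"resource":{"attributes":[{"key":"service.name","value":{"stringValue":"smoke-test"}}]},"scopeSpans":[{"spans":[{"traceId":"'"$TRACE_ID"'","spanId":"'"$SPAN_ID"'","name":"smoke","kind":1,"startTimeUnixNano":"'"$NOW_NS"'","endTimeUnixNano":"'"$NOW_NS"'"}]}]}]}'

echo "$PAYLOAD"

# With the collector port-forwarded (kubectl port-forward ... 4318:4318):
#   curl -s -X POST -H 'Content-Type: application/json' \
#        -d "$PAYLOAD" http://localhost:4318/v1/traces
# The span should then appear under the "smoke-test" service in the SigNoz UI.
```

If the span shows up, ingestion, the collector pipeline, and ClickHouse storage are all working; if not, the verification scripts above narrow down which stage is failing.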