Add new infra architecture

This commit is contained in:
Urtzi Alfaro
2026-01-19 11:55:17 +01:00
parent 21d35ea92b
commit 35f164f0cd
311 changed files with 13241 additions and 3700 deletions


@@ -0,0 +1,619 @@
# SigNoz Helm Deployment for Bakery IA
This directory contains Helm configurations and deployment scripts for the SigNoz observability platform.
## Overview
SigNoz is deployed using the official Helm chart with environment-specific configurations optimized for:
- **Development**: Colima + Kind (Kubernetes in Docker) with Tilt
- **Production**: VPS on clouding.io with MicroK8s
## Prerequisites
### Required Tools
- **kubectl** 1.22+
- **Helm** 3.8+
- **Docker** (for development)
- **Kind/MicroK8s** (environment-specific)
### Docker Hub Authentication
SigNoz uses images from Docker Hub. Set up authentication to avoid rate limits:
```bash
# Option 1: Environment variables (recommended)
export DOCKERHUB_USERNAME='your-username'
export DOCKERHUB_PASSWORD='your-personal-access-token'
# Option 2: Docker login
docker login
```
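Under the hood, the image pull secret is just a base64-encoded Docker config. As a sketch, this is roughly the payload that a `kubectl create secret docker-registry` call (such as the one the deploy script performs for the `dockerhub-creds` secret referenced later in this README) ends up storing:

```shell
# Illustrative only: build the .dockerconfigjson payload of an image pull secret.
# The placeholder credentials match the example exports above.
DOCKERHUB_USERNAME='your-username'
DOCKERHUB_PASSWORD='your-personal-access-token'

# Docker auth is "username:password" base64-encoded, keyed by registry URL.
AUTH=$(printf '%s:%s' "$DOCKERHUB_USERNAME" "$DOCKERHUB_PASSWORD" | base64 | tr -d '\n')
printf '{"auths":{"https://index.docker.io/v1/":{"auth":"%s"}}}\n' "$AUTH"
```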
## Quick Start
### Development Deployment
```bash
# Deploy SigNoz to development environment
./deploy-signoz.sh dev
# Verify deployment
./verify-signoz.sh dev
# Access SigNoz UI
# Via ingress: http://monitoring.bakery-ia.local
# Or port-forward:
kubectl port-forward -n signoz svc/signoz 8080:8080
# Then open: http://localhost:8080
```
### Production Deployment
```bash
# Deploy SigNoz to production environment
./deploy-signoz.sh prod
# Verify deployment
./verify-signoz.sh prod
# Access SigNoz UI
# https://monitoring.bakewise.ai
```
## Configuration Files
### signoz-values-dev.yaml
Development environment configuration with:
- Single replica for most components
- Reduced resource requests (optimized for local Kind cluster)
- 7-day data retention
- Batch size: 10,000 events
- ClickHouse 25.5.6, OTel Collector v0.129.12
- PostgreSQL, Redis, and RabbitMQ receivers configured
### signoz-values-prod.yaml
Production environment configuration with:
- High availability: 2+ replicas for critical components
- 3 Zookeeper replicas (required for production)
- 30-day data retention
- Batch size: 50,000 events (high-performance)
- Cold storage enabled with 30-day TTL
- Horizontal Pod Autoscaler (HPA) enabled
- TLS/SSL with cert-manager
- Enhanced security with pod anti-affinity rules
## Key Configuration Changes (v0.89.0+)
⚠️ **BREAKING CHANGE**: SigNoz Helm chart v0.89.0+ uses a unified component structure.
**Old Structure (deprecated):**
```yaml
frontend:
replicaCount: 2
queryService:
replicaCount: 2
```
**New Structure (current):**
```yaml
signoz:
replicaCount: 2
# Combines frontend + query service
```
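Before upgrading, it can help to check whether an existing values file still uses the deprecated layout. A minimal sketch (the sample file below stands in for your real `signoz-values-*.yaml`):

```shell
# Sketch: flag deprecated pre-v0.89.0 top-level keys in a values file.
cat > /tmp/old-signoz-values.yaml <<'EOF'
frontend:
  replicaCount: 2
queryService:
  replicaCount: 2
EOF

for key in frontend queryService; do
  if grep -qE "^${key}:" /tmp/old-signoz-values.yaml; then
    echo "deprecated key found: ${key} (move settings under 'signoz:')"
  fi
done
```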
## Component Architecture
### Core Components
1. **SigNoz** (unified component)
- Frontend UI + Query Service
- Port 8080 (HTTP/API), 8085 (internal gRPC)
- Dev: 1 replica, Prod: 2+ replicas with HPA
2. **ClickHouse** (Time-series database)
- Version: 25.5.6
- Stores traces, metrics, and logs
- Dev: 1 replica, Prod: 2 replicas with cold storage
3. **Zookeeper** (ClickHouse coordination)
- Version: 3.7.1
- Dev: 1 replica, Prod: 3 replicas (critical for HA)
4. **OpenTelemetry Collector** (Data ingestion)
- Version: v0.129.12
- Ports: 4317 (gRPC), 4318 (HTTP), 8888 (metrics)
- Dev: 1 replica, Prod: 2+ replicas with HPA
5. **Alertmanager** (Alert management)
- Version: 0.23.5
- Email and Slack integrations configured
- Port: 9093
## Performance Optimizations
### Batch Processing
- **Development**: 10,000 events per batch
- **Production**: 50,000 events per batch (official recommendation)
- Timeout: 1 second for faster processing
### Memory Management
- Memory limiter processor prevents OOM
- Dev: 400 MiB limit, Prod: 1500 MiB limit
- Spike limits configured
### Span Metrics Processor
Automatically generates RED metrics (Rate, Errors, Duration):
- Latency histogram buckets optimized for microservices
- Cache size: 10K (dev), 100K (prod)
### Cold Storage (Production Only)
- Enabled with 30-day TTL
- Automatically moves old data to cold storage
- Keeps 10GB free on primary storage
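As a sketch, cold storage is enabled under the `clickhouse` section of the production values file; the exact key names vary by chart version, so verify them against the chart's values reference before applying (the S3 endpoint and credentials below are placeholders):

```yaml
clickhouse:
  coldStorage:
    enabled: true
    # Keep ~10 GiB free on the primary (hot) disk before moving data
    defaultKeepFreeSpaceBytes: "10737418240"
    type: s3
    endpoint: https://<bucket>.s3.amazonaws.com/data/
    accessKey: <access-key>
    secretAccess: <secret-key>
```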
## OpenTelemetry Endpoints
### From Within Kubernetes Cluster
The in-cluster OTLP endpoints are identical in development and production:
```
OTLP gRPC: signoz-otel-collector.bakery-ia.svc.cluster.local:4317
OTLP HTTP: signoz-otel-collector.bakery-ia.svc.cluster.local:4318
```
### Application Configuration Example
```yaml
# Python with OpenTelemetry
OTEL_EXPORTER_OTLP_ENDPOINT: "http://signoz-otel-collector.bakery-ia.svc.cluster.local:4318"
OTEL_EXPORTER_OTLP_PROTOCOL: "http/protobuf"
```
```javascript
// Node.js with OpenTelemetry
const exporter = new OTLPTraceExporter({
url: 'http://signoz-otel-collector.bakery-ia.svc.cluster.local:4318/v1/traces',
});
```
## Deployment Scripts
### deploy-signoz.sh
Comprehensive deployment script with features:
```bash
# Usage
./deploy-signoz.sh [OPTIONS] ENVIRONMENT
# Options
-h, --help Show help message
-d, --dry-run Show what would be deployed
-u, --upgrade Upgrade existing deployment
-r, --remove Remove deployment
-n, --namespace NS Custom namespace (default: signoz)
# Examples
./deploy-signoz.sh dev # Deploy to dev
./deploy-signoz.sh --upgrade prod # Upgrade prod
./deploy-signoz.sh --dry-run prod # Preview changes
./deploy-signoz.sh --remove dev # Remove dev deployment
```
**Features:**
- Automatic Helm repository setup
- Docker Hub secret creation
- Namespace management
- Deployment verification
- 15-minute timeout with `--wait` flag
### verify-signoz.sh
Verification script to check deployment health:
```bash
# Usage
./verify-signoz.sh [OPTIONS] ENVIRONMENT
# Examples
./verify-signoz.sh dev # Verify dev deployment
./verify-signoz.sh prod # Verify prod deployment
```
**Checks performed:**
1. ✅ Helm release status
2. ✅ Pod health and readiness
3. ✅ Service availability
4. ✅ Ingress configuration
5. ✅ PVC status
6. ✅ Resource usage (if metrics-server available)
7. ✅ Log errors
8. ✅ Environment-specific validations
- Dev: Single replica, resource limits
- Prod: HA config, TLS, Zookeeper replicas, HPA
## Storage Configuration
### Development (Kind)
```yaml
global:
storageClass: "standard" # Kind's default provisioner
```
### Production (MicroK8s)
```yaml
global:
storageClass: "microk8s-hostpath" # Or custom storage class
```
**Storage Requirements:**
- **Development**: ~35 GiB total
- SigNoz: 5 GiB
- ClickHouse: 20 GiB
- Zookeeper: 5 GiB
- Alertmanager: 2 GiB
- **Production**: ~135 GiB total
- SigNoz: 20 GiB
- ClickHouse: 100 GiB
- Zookeeper: 10 GiB
- Alertmanager: 5 GiB
## Resource Requirements
### Development Environment
**Minimum:**
- CPU: 550m (0.55 cores)
- Memory: 1.6 GiB
- Storage: 35 GiB
**Recommended:**
- CPU: 3 cores
- Memory: 3 GiB
- Storage: 50 GiB
### Production Environment
**Minimum:**
- CPU: 3.5 cores
- Memory: 8 GiB
- Storage: 135 GiB
**Recommended:**
- CPU: 12 cores
- Memory: 20 GiB
- Storage: 200 GiB
## Data Retention
### Development
- Traces: 7 days (168 hours)
- Metrics: 7 days (168 hours)
- Logs: 7 days (168 hours)
### Production
- Traces: 30 days (720 hours)
- Metrics: 30 days (720 hours)
- Logs: 30 days (720 hours)
- Cold storage after 30 days
To modify retention, update the environment variables:
```yaml
signoz:
env:
signoz_traces_ttl_duration_hrs: "720" # 30 days
signoz_metrics_ttl_duration_hrs: "720" # 30 days
signoz_logs_ttl_duration_hrs: "168" # 7 days
```
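The TTL settings are plain hour counts; a quick check of the day-to-hour conversions used above:

```shell
# Days-to-hours conversions behind the retention TTL settings
echo "7 days  = $((7 * 24)) hours"    # dev retention
echo "30 days = $((30 * 24)) hours"   # prod retention
```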
## High Availability (Production)
### Replication Strategy
```yaml
signoz: 2 replicas + HPA (min: 2, max: 5)
clickhouse: 2 replicas
zookeeper: 3 replicas (critical!)
otelCollector: 2 replicas + HPA (min: 2, max: 10)
alertmanager: 2 replicas
```
### Pod Anti-Affinity
Ensures pods are distributed across different nodes:
```yaml
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchLabels:
app.kubernetes.io/component: query-service
topologyKey: kubernetes.io/hostname
```
### Pod Disruption Budgets
Configured for all critical components:
```yaml
podDisruptionBudget:
enabled: true
minAvailable: 1
```
## Monitoring and Alerting
### Email Alerts (Production)
Configure SMTP in production values (using Mailu with Mailgun relay):
```yaml
signoz:
env:
signoz_smtp_enabled: "true"
signoz_smtp_host: "mailu-smtp.bakery-ia.svc.cluster.local"
signoz_smtp_port: "587"
signoz_smtp_from: "alerts@bakewise.ai"
signoz_smtp_username: "alerts@bakewise.ai"
# Set via secret: signoz_smtp_password
```
**Note**: SigNoz now uses the internal Mailu SMTP service, which relays through Mailgun for better deliverability and centralized email management.
### Slack Alerts (Production)
Configure webhook in Alertmanager:
```yaml
alertmanager:
config:
receivers:
- name: 'critical-alerts'
slack_configs:
- api_url: '${SLACK_WEBHOOK_URL}'
channel: '#alerts-critical'
```
### Mailgun Integration for Alert Emails
SigNoz has been configured to send alert emails via Mailgun through the Mailu SMTP service. This provides:
**Benefits:**
- Better email deliverability through Mailgun's infrastructure
- Centralized email management via Mailu
- Improved tracking and analytics for alert emails
- Compliance with email sending best practices
**Architecture:**
```
SigNoz Alertmanager → Mailu SMTP → Mailgun Relay → Recipients
```
**Configuration Requirements:**
1. **Mailu Configuration** (`infrastructure/platform/mail/mailu/mailu-configmap.yaml`):
```yaml
RELAYHOST: "smtp.mailgun.org:587"
RELAY_LOGIN: "postmaster@bakewise.ai"
```
2. **Mailu Secrets** (`infrastructure/platform/mail/mailu/mailu-secrets.yaml`):
```yaml
RELAY_PASSWORD: "<mailgun-api-key>" # Base64 encoded Mailgun API key
```
3. **DNS Configuration** (required for Mailgun):
```
# MX record
bakewise.ai. IN MX 10 mail.bakewise.ai.
# SPF record (authorize Mailgun)
bakewise.ai. IN TXT "v=spf1 include:mailgun.org ~all"
# DKIM record (provided by Mailgun)
m1._domainkey.bakewise.ai. IN TXT "v=DKIM1; k=rsa; p=<mailgun-public-key>"
# DMARC record
_dmarc.bakewise.ai. IN TXT "v=DMARC1; p=quarantine; rua=mailto:dmarc@bakewise.ai"
```
4. **SigNoz SMTP Configuration** (already configured in `signoz-values-prod.yaml`):
```yaml
signoz_smtp_host: "mailu-smtp.bakery-ia.svc.cluster.local"
signoz_smtp_port: "587"
signoz_smtp_from: "alerts@bakewise.ai"
```
**Testing the Integration:**
1. Trigger a test alert from the SigNoz UI
2. Check Mailu logs: `kubectl logs -f mailu-smtp-<pod-id> -n bakery-ia`
3. Check Mailgun dashboard for delivery status
4. Verify email receipt in destination inbox
**Troubleshooting:**
- **SMTP Authentication Failed**: Verify Mailu credentials and Mailgun API key
- **Email Delivery Delays**: Check Mailu queue with `kubectl exec -it mailu-smtp-<pod-id> -n bakery-ia -- mailq`
- **SPF/DKIM Issues**: Verify DNS records and Mailgun domain verification
### Self-Monitoring
SigNoz monitors itself:
```yaml
selfMonitoring:
enabled: true
serviceMonitor:
enabled: true # Prod only
interval: 30s
```
## Troubleshooting
### Common Issues
**1. Pods not starting**
```bash
# Check pod status
kubectl get pods -n signoz
# Check pod logs
kubectl logs -n signoz <pod-name>
# Describe pod for events
kubectl describe pod -n signoz <pod-name>
```
**2. Docker Hub rate limits**
```bash
# Verify secret exists
kubectl get secret dockerhub-creds -n signoz
# Recreate secret
kubectl delete secret dockerhub-creds -n signoz
export DOCKERHUB_USERNAME='your-username'
export DOCKERHUB_PASSWORD='your-token'
./deploy-signoz.sh dev
```
**3. ClickHouse connection issues**
```bash
# Check ClickHouse pod
kubectl logs -n signoz -l app.kubernetes.io/component=clickhouse
# Check Zookeeper (required by ClickHouse)
kubectl logs -n signoz -l app.kubernetes.io/component=zookeeper
```
**4. OTel Collector not receiving data**
```bash
# Check OTel Collector logs
kubectl logs -n signoz -l app.kubernetes.io/component=otel-collector
# Test connectivity
kubectl port-forward -n signoz svc/signoz-otel-collector 4318:4318
curl -v http://localhost:4318/v1/traces
```
**5. Insufficient storage**
```bash
# Check PVC status
kubectl get pvc -n signoz
# Check storage usage (if metrics-server available)
kubectl top pods -n signoz
```
### Debug Mode
Enable debug exporter in OTel Collector:
```yaml
otelCollector:
config:
exporters:
debug:
verbosity: detailed
sampling_initial: 5
sampling_thereafter: 200
service:
pipelines:
traces:
exporters: [clickhousetraces, debug] # Add debug
```
### Upgrade from Old Version
If upgrading from pre-v0.89.0:
```bash
# 1. Backup data (recommended)
kubectl get all -n signoz -o yaml > signoz-backup.yaml
# 2. Remove old deployment
./deploy-signoz.sh --remove prod
# 3. Deploy new version
./deploy-signoz.sh prod
# 4. Verify
./verify-signoz.sh prod
```
## Security Best Practices
1. **Change default password** immediately after first login
2. **Use TLS/SSL** in production (configured with cert-manager)
3. **Network policies** enabled in production
4. **Run as non-root** (configured in securityContext)
5. **RBAC** with dedicated service account
6. **Secrets management** for sensitive data (SMTP, Slack webhooks)
7. **Image pull secrets** to avoid exposing Docker Hub credentials
## Backup and Recovery
### Backup ClickHouse Data
```bash
# Export ClickHouse data
kubectl exec -n signoz <clickhouse-pod> -- clickhouse-client \
--query="BACKUP DATABASE signoz_traces TO Disk('backups', 'traces_backup.zip')"
# Copy backup out
kubectl cp signoz/<clickhouse-pod>:/var/lib/clickhouse/backups/ ./backups/
```
### Restore from Backup
```bash
# Copy backup in
kubectl cp ./backups/ signoz/<clickhouse-pod>:/var/lib/clickhouse/backups/
# Restore
kubectl exec -n signoz <clickhouse-pod> -- clickhouse-client \
--query="RESTORE DATABASE signoz_traces FROM Disk('backups', 'traces_backup.zip')"
```
## Updating Configuration
To update SigNoz configuration:
1. Edit values file: `signoz-values-{env}.yaml`
2. Apply changes:
```bash
./deploy-signoz.sh --upgrade {env}
```
3. Verify:
```bash
./verify-signoz.sh {env}
```
## Uninstallation
```bash
# Remove SigNoz deployment
./deploy-signoz.sh --remove {env}
# Optionally delete PVCs (WARNING: deletes all data)
kubectl delete pvc -n signoz -l app.kubernetes.io/instance=signoz
# Optionally delete namespace
kubectl delete namespace signoz
```
## References
- [SigNoz Official Documentation](https://signoz.io/docs/)
- [SigNoz Helm Charts Repository](https://github.com/SigNoz/charts)
- [OpenTelemetry Documentation](https://opentelemetry.io/docs/)
- [ClickHouse Documentation](https://clickhouse.com/docs/)
## Support
For issues or questions:
1. Check [SigNoz GitHub Issues](https://github.com/SigNoz/signoz/issues)
2. Review deployment logs: `kubectl logs -n signoz <pod-name>`
3. Run verification script: `./verify-signoz.sh {env}`
4. Check [SigNoz Community Slack](https://signoz.io/slack)
---
**Last Updated**: 2026-01-09
**SigNoz Helm Chart Version**: Latest (OTel Collector v0.129.12)
**Maintained by**: Bakery IA Team


@@ -0,0 +1,190 @@
# SigNoz Dashboards for Bakery IA
This directory contains comprehensive SigNoz dashboard configurations for monitoring the Bakery IA system.
## Available Dashboards
### 1. Infrastructure Monitoring
- **File**: `infrastructure-monitoring.json`
- **Purpose**: Monitor Kubernetes infrastructure, pod health, and resource utilization
- **Key Metrics**: CPU usage, memory usage, network traffic, pod status, container health
### 2. Application Performance
- **File**: `application-performance.json`
- **Purpose**: Monitor microservice performance and API metrics
- **Key Metrics**: Request rate, error rate, latency percentiles, endpoint performance
### 3. Database Performance
- **File**: `database-performance.json`
- **Purpose**: Monitor PostgreSQL and Redis database performance
- **Key Metrics**: Connections, query execution time, cache hit ratio, locks, replication status
### 4. API Performance
- **File**: `api-performance.json`
- **Purpose**: Monitor REST and GraphQL API performance
- **Key Metrics**: Request volume, response times, status codes, endpoint analysis
### 5. Error Tracking
- **File**: `error-tracking.json`
- **Purpose**: Track and analyze system errors
- **Key Metrics**: Error rates, error distribution, recent errors, HTTP errors, database errors
### 6. User Activity
- **File**: `user-activity.json`
- **Purpose**: Monitor user behavior and activity patterns
- **Key Metrics**: Active users, sessions, API calls per user, session duration
### 7. System Health
- **File**: `system-health.json`
- **Purpose**: Overall system health monitoring
- **Key Metrics**: Availability, health scores, resource utilization, service status
### 8. Alert Management
- **File**: `alert-management.json`
- **Purpose**: Monitor and manage system alerts
- **Key Metrics**: Active alerts, alert rates, alert distribution, firing alerts
### 9. Log Analysis
- **File**: `log-analysis.json`
- **Purpose**: Search and analyze system logs
- **Key Metrics**: Log volume, error logs, log distribution, log search
## How to Import Dashboards
### Method 1: Using SigNoz UI
1. **Access SigNoz UI**: Open your SigNoz instance in a web browser
2. **Navigate to Dashboards**: Go to the "Dashboards" section
3. **Import Dashboard**: Click on "Import Dashboard" button
4. **Upload JSON**: Select the JSON file from this directory
5. **Configure**: Adjust any variables or settings as needed
6. **Save**: Save the imported dashboard
**Note**: The dashboards now use the correct SigNoz JSON schema with proper filter arrays.
### Method 2: Using SigNoz API
```bash
# Import a single dashboard
curl -X POST "http://<SIGNOZ_HOST>:3301/api/v1/dashboards/import" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <API_KEY>" \
-d @infrastructure-monitoring.json
# Import all dashboards
for file in *.json; do
curl -X POST "http://<SIGNOZ_HOST>:3301/api/v1/dashboards/import" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <API_KEY>" \
-d @"$file"
done
```
### Method 3: Using Kubernetes ConfigMap
```bash
# Create a ConfigMap with all dashboards
kubectl create configmap signoz-dashboards \
--from-file=infrastructure-monitoring.json \
--from-file=application-performance.json \
--from-file=database-performance.json \
--from-file=api-performance.json \
--from-file=error-tracking.json \
--from-file=user-activity.json \
--from-file=system-health.json \
--from-file=alert-management.json \
--from-file=log-analysis.json \
-n signoz
```
## Dashboard Variables
Most dashboards include variables that allow you to filter and customize the view:
- **Namespace**: Filter by Kubernetes namespace (e.g., `bakery-ia`, `default`)
- **Service**: Filter by specific microservice
- **Severity**: Filter by error/alert severity
- **Environment**: Filter by deployment environment
- **Time Range**: Adjust the time window for analysis
## Metrics Reference
The dashboards use standard OpenTelemetry metrics. If you need to add custom metrics, ensure they are properly instrumented in your services.
## Troubleshooting
### Dashboard Import Errors
If you encounter errors when importing dashboards:
1. **Validate JSON**: Ensure the JSON files are valid
```bash
jq . infrastructure-monitoring.json
```
2. **Check Metrics**: Verify that the metrics exist in your SigNoz instance
3. **Adjust Time Range**: Try different time ranges if no data appears
4. **Check Filters**: Ensure filters match your actual service names and tags
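The JSON validation step can be run over all dashboards at once; this sketch uses `python3`'s built-in JSON parser so no extra tooling like `jq` is required (the sample file stands in for the real dashboards when trying it out):

```shell
# Validate every dashboard JSON in a directory before importing.
mkdir -p /tmp/dashboards
printf '{"title": "sample"}' > /tmp/dashboards/sample.json

for f in /tmp/dashboards/*.json; do
  if python3 -m json.tool "$f" >/dev/null 2>&1; then
    echo "valid: $f"
  else
    echo "INVALID: $f"
  fi
done
```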
### "e.filter is not a function" Error
This error occurs when the dashboard JSON uses an incorrect filter format. The fix has been applied:
**Before (incorrect)**:
```json
"filters": {
"namespace": "${namespace}"
}
```
**After (correct)**:
```json
"filters": [
{
"key": "namespace",
"operator": "=",
"value": "${namespace}"
}
]
```
All dashboards in this directory now use the correct array format for filters.
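If you still have older dashboard exports in the object format, a small script can migrate them. A sketch of the transformation (it assumes every migrated filter uses the `=` operator, so adjust as needed):

```shell
# Sketch: convert object-style "filters" to the array format SigNoz expects.
python3 - <<'EOF'
import json

# Old object-style filters, as in the "Before (incorrect)" example above
old = {"filters": {"namespace": "${namespace}"}}

# Rewrite each key/value pair as a {key, operator, value} entry
fixed = {"filters": [
    {"key": k, "operator": "=", "value": v}
    for k, v in old["filters"].items()
]}
print(json.dumps(fixed, indent=2))
EOF
```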
### Missing Data
If dashboards show no data:
1. **Verify Instrumentation**: Ensure your services are properly instrumented with OpenTelemetry
2. **Check Time Range**: Adjust the time range to include recent data
3. **Validate Metrics**: Confirm the metrics are being collected and stored
4. **Review Filters**: Check that filters match your actual deployment
## Customization
You can customize these dashboards by:
1. **Editing JSON**: Modify the JSON files to add/remove panels or adjust queries
2. **Cloning in UI**: Clone existing dashboards and modify them in the SigNoz UI
3. **Adding Variables**: Add new variables for additional filtering options
4. **Adjusting Layout**: Change the grid layout and panel sizes
## Best Practices
1. **Regular Reviews**: Review dashboards regularly to ensure they meet your monitoring needs
2. **Alert Integration**: Set up alerts based on key metrics shown in these dashboards
3. **Team Access**: Share relevant dashboards with appropriate team members
4. **Documentation**: Document any custom metrics or specific monitoring requirements
## Support
For issues with these dashboards:
1. Check the [SigNoz documentation](https://signoz.io/docs/)
2. Review the [Bakery IA monitoring guide](../SIGNOZ_COMPLETE_CONFIGURATION_GUIDE.md)
3. Consult the OpenTelemetry metrics specification
## License
These dashboard configurations are provided under the same license as the Bakery IA project.


@@ -0,0 +1,170 @@
{
"description": "Alert monitoring and management dashboard",
"tags": ["alerts", "monitoring", "management"],
"name": "bakery-ia-alert-management",
"title": "Bakery IA - Alert Management",
"uploadedGrafana": false,
"uuid": "bakery-ia-alerts-01",
"version": "v4",
"collapsableRowsMigrated": true,
"layout": [
{
"x": 0,
"y": 0,
"w": 6,
"h": 3,
"i": "active-alerts",
"moved": false,
"static": false
},
{
"x": 6,
"y": 0,
"w": 6,
"h": 3,
"i": "alert-rate",
"moved": false,
"static": false
}
],
"variables": {
"service": {
"id": "service-var",
"name": "service",
"description": "Filter by service name",
"type": "QUERY",
"queryValue": "SELECT DISTINCT(resource_attrs['service.name']) as value FROM signoz_metrics.distributed_time_series_v4_1day WHERE metric_name = 'alerts_active' AND value != '' ORDER BY value",
"customValue": "",
"textboxValue": "",
"showALLOption": true,
"multiSelect": false,
"order": 1,
"modificationUUID": "",
"sort": "ASC",
"selectedValue": null
}
},
"widgets": [
{
"id": "active-alerts",
"title": "Active Alerts",
"description": "Number of currently active alerts",
"isStacked": false,
"nullZeroValues": "zero",
"opacity": "1",
"panelTypes": "value",
"query": {
"builder": {
"queryData": [
{
"dataSource": "metrics",
"queryName": "A",
"aggregateOperator": "sum",
"aggregateAttribute": {
"key": "alerts_active",
"dataType": "int64",
"type": "Gauge",
"isColumn": false
},
"timeAggregation": "latest",
"spaceAggregation": "sum",
"functions": [],
"filters": {
"items": [
{
"key": {
"key": "serviceName",
"dataType": "string",
"type": "tag",
"isColumn": true
},
"op": "=",
"value": "{{.service}}"
}
],
"op": "AND"
},
"expression": "A",
"disabled": false,
"having": [],
"stepInterval": 60,
"limit": null,
"orderBy": [],
"groupBy": [],
"legend": "Active Alerts",
"reduceTo": "sum"
}
],
"queryFormulas": []
},
"queryType": "builder"
},
"fillSpans": false,
"yAxisUnit": "none"
},
{
"id": "alert-rate",
"title": "Alert Rate",
"description": "Rate of alerts over time",
"isStacked": false,
"nullZeroValues": "zero",
"opacity": "1",
"panelTypes": "graph",
"query": {
"builder": {
"queryData": [
{
"dataSource": "metrics",
"queryName": "A",
"aggregateOperator": "sum",
"aggregateAttribute": {
"key": "alerts_total",
"dataType": "int64",
"type": "Counter",
"isColumn": false
},
"timeAggregation": "rate",
"spaceAggregation": "sum",
"functions": [],
"filters": {
"items": [
{
"key": {
"key": "serviceName",
"dataType": "string",
"type": "tag",
"isColumn": true
},
"op": "=",
"value": "{{.service}}"
}
],
"op": "AND"
},
"expression": "A",
"disabled": false,
"having": [],
"stepInterval": 60,
"limit": null,
"orderBy": [],
"groupBy": [
{
"key": "serviceName",
"dataType": "string",
"type": "tag",
"isColumn": true
}
],
"legend": "{{serviceName}}",
"reduceTo": "sum"
}
],
"queryFormulas": []
},
"queryType": "builder"
},
"fillSpans": false,
"yAxisUnit": "alerts/s"
}
]
}


@@ -0,0 +1,351 @@
{
"description": "Comprehensive API performance monitoring for Bakery IA REST and GraphQL endpoints",
"tags": ["api", "performance", "rest", "graphql"],
"name": "bakery-ia-api-performance",
"title": "Bakery IA - API Performance",
"uploadedGrafana": false,
"uuid": "bakery-ia-api-01",
"version": "v4",
"collapsableRowsMigrated": true,
"layout": [
{
"x": 0,
"y": 0,
"w": 6,
"h": 3,
"i": "request-volume",
"moved": false,
"static": false
},
{
"x": 6,
"y": 0,
"w": 6,
"h": 3,
"i": "error-rate",
"moved": false,
"static": false
},
{
"x": 0,
"y": 3,
"w": 6,
"h": 3,
"i": "avg-response-time",
"moved": false,
"static": false
},
{
"x": 6,
"y": 3,
"w": 6,
"h": 3,
"i": "p95-latency",
"moved": false,
"static": false
}
],
"variables": {
"service": {
"id": "service-var",
"name": "service",
"description": "Filter by API service",
"type": "QUERY",
"queryValue": "SELECT DISTINCT(resource_attrs['service.name']) as value FROM signoz_metrics.distributed_time_series_v4_1day WHERE metric_name = 'http_server_requests_seconds_count' AND value != '' ORDER BY value",
"customValue": "",
"textboxValue": "",
"showALLOption": true,
"multiSelect": false,
"order": 1,
"modificationUUID": "",
"sort": "ASC",
"selectedValue": null
}
},
"widgets": [
{
"id": "request-volume",
"title": "Request Volume",
"description": "API request volume by service",
"isStacked": false,
"nullZeroValues": "zero",
"opacity": "1",
"panelTypes": "graph",
"query": {
"builder": {
"queryData": [
{
"dataSource": "metrics",
"queryName": "A",
"aggregateOperator": "sum",
"aggregateAttribute": {
"key": "http_server_requests_seconds_count",
"dataType": "int64",
"type": "Counter",
"isColumn": false
},
"timeAggregation": "rate",
"spaceAggregation": "sum",
"functions": [],
"filters": {
"items": [
{
"key": {
"key": "service.name",
"dataType": "string",
"type": "resource",
"isColumn": false
},
"op": "=",
"value": "{{.service}}"
}
],
"op": "AND"
},
"expression": "A",
"disabled": false,
"having": [],
"stepInterval": 60,
"limit": null,
"orderBy": [],
"groupBy": [
{
"key": "api.name",
"dataType": "string",
"type": "resource",
"isColumn": false
}
],
"legend": "{{api.name}}",
"reduceTo": "sum"
}
],
"queryFormulas": []
},
"queryType": "builder"
},
"fillSpans": false,
"yAxisUnit": "req/s"
},
{
"id": "error-rate",
"title": "Error Rate",
"description": "API error rate by service",
"isStacked": false,
"nullZeroValues": "zero",
"opacity": "1",
"panelTypes": "graph",
"query": {
"builder": {
"queryData": [
{
"dataSource": "metrics",
"queryName": "A",
"aggregateOperator": "sum",
"aggregateAttribute": {
"key": "http_server_requests_seconds_count",
"dataType": "int64",
"type": "Counter",
"isColumn": false
},
"timeAggregation": "rate",
"spaceAggregation": "sum",
"functions": [],
"filters": {
"items": [
{
"key": {
"key": "api.name",
"dataType": "string",
"type": "resource",
"isColumn": false
},
"op": "=",
"value": "{{.api}}"
},
{
"key": {
"key": "status_code",
"dataType": "string",
"type": "tag",
"isColumn": false
},
"op": "=~",
"value": "5.."
}
],
"op": "AND"
},
"expression": "A",
"disabled": false,
"having": [],
"stepInterval": 60,
"limit": null,
"orderBy": [],
"groupBy": [
{
"key": "api.name",
"dataType": "string",
"type": "resource",
"isColumn": false
},
{
"key": "status_code",
"dataType": "string",
"type": "tag",
"isColumn": false
}
],
"legend": "{{api.name}} - {{status_code}}",
"reduceTo": "sum"
}
],
"queryFormulas": []
},
"queryType": "builder"
},
"fillSpans": false,
"yAxisUnit": "req/s"
},
{
"id": "avg-response-time",
"title": "Average Response Time",
"description": "Average API response time by endpoint",
"isStacked": false,
"nullZeroValues": "zero",
"opacity": "1",
"panelTypes": "graph",
"query": {
"builder": {
"queryData": [
{
"dataSource": "metrics",
"queryName": "A",
"aggregateOperator": "avg",
"aggregateAttribute": {
"key": "http_server_requests_seconds_sum",
"dataType": "float64",
"type": "Counter",
"isColumn": false
},
"timeAggregation": "avg",
"spaceAggregation": "avg",
"functions": [],
"filters": {
"items": [
{
"key": {
"key": "api.name",
"dataType": "string",
"type": "resource",
"isColumn": false
},
"op": "=",
"value": "{{.api}}"
}
],
"op": "AND"
},
"expression": "A",
"disabled": false,
"having": [],
"stepInterval": 60,
"limit": null,
"orderBy": [],
"groupBy": [
{
"key": "api.name",
"dataType": "string",
"type": "resource",
"isColumn": false
},
{
"key": "endpoint",
"dataType": "string",
"type": "tag",
"isColumn": false
}
],
"legend": "{{api.name}} - {{endpoint}}",
"reduceTo": "avg"
}
],
"queryFormulas": []
},
"queryType": "builder"
},
"fillSpans": false,
"yAxisUnit": "seconds"
},
{
"id": "p95-latency",
"title": "P95 Latency",
"description": "95th percentile latency by endpoint",
"isStacked": false,
"nullZeroValues": "zero",
"opacity": "1",
"panelTypes": "graph",
"query": {
"builder": {
"queryData": [
{
"dataSource": "metrics",
"queryName": "A",
"aggregateOperator": "histogram_quantile",
"aggregateAttribute": {
"key": "http_server_requests_seconds_bucket",
"dataType": "float64",
"type": "Histogram",
"isColumn": false
},
"timeAggregation": "avg",
"spaceAggregation": "avg",
"functions": [],
"filters": {
"items": [
{
"key": {
"key": "api.name",
"dataType": "string",
"type": "resource",
"isColumn": false
},
"op": "=",
"value": "{{.api}}"
}
],
"op": "AND"
},
"expression": "A",
"disabled": false,
"having": [],
"stepInterval": 60,
"limit": null,
"orderBy": [],
"groupBy": [
{
"key": "api.name",
"dataType": "string",
"type": "resource",
"isColumn": false
},
{
"key": "endpoint",
"dataType": "string",
"type": "tag",
"isColumn": false
}
],
"legend": "{{api.name}} - {{endpoint}}",
"reduceTo": "avg"
}
],
"queryFormulas": []
},
"queryType": "builder"
},
"fillSpans": false,
"yAxisUnit": "seconds"
}
]
}


@@ -0,0 +1,333 @@
{
"description": "Application performance monitoring dashboard using distributed traces and metrics",
"tags": ["application", "performance", "traces", "apm"],
"name": "bakery-ia-application-performance",
"title": "Bakery IA - Application Performance (APM)",
"uploadedGrafana": false,
"uuid": "bakery-ia-apm-01",
"version": "v4",
"collapsableRowsMigrated": true,
"layout": [
{
"x": 0,
"y": 0,
"w": 6,
"h": 3,
"i": "latency-p99",
"moved": false,
"static": false
},
{
"x": 6,
"y": 0,
"w": 6,
"h": 3,
"i": "request-rate",
"moved": false,
"static": false
},
{
"x": 0,
"y": 3,
"w": 6,
"h": 3,
"i": "error-rate",
"moved": false,
"static": false
},
{
"x": 6,
"y": 3,
"w": 6,
"h": 3,
"i": "avg-duration",
"moved": false,
"static": false
}
],
"variables": {
"service_name": {
"id": "service-var",
"name": "service_name",
"description": "Filter by service name",
"type": "QUERY",
"queryValue": "SELECT DISTINCT(serviceName) FROM signoz_traces.distributed_signoz_index_v2 ORDER BY serviceName",
"customValue": "",
"textboxValue": "",
"showALLOption": true,
"multiSelect": false,
"order": 1,
"modificationUUID": "",
"sort": "ASC",
"selectedValue": null
}
},
"widgets": [
{
"id": "latency-p99",
"title": "P99 Latency",
"description": "99th percentile latency for selected service",
"isStacked": false,
"nullZeroValues": "zero",
"opacity": "1",
"panelTypes": "graph",
"query": {
"builder": {
"queryData": [
{
"dataSource": "traces",
"queryName": "A",
"aggregateOperator": "p99",
"aggregateAttribute": {
"key": "duration_ns",
"dataType": "float64",
"type": "",
"isColumn": true
},
"timeAggregation": "avg",
"spaceAggregation": "p99",
"functions": [],
"filters": {
"items": [
{
"key": {
"key": "serviceName",
"dataType": "string",
"type": "tag",
"isColumn": true
},
"op": "=",
"value": "{{.service_name}}"
}
],
"op": "AND"
},
"expression": "A",
"disabled": false,
"having": [],
"stepInterval": 60,
"limit": null,
"orderBy": [],
"groupBy": [
{
"key": "serviceName",
"dataType": "string",
"type": "tag",
"isColumn": true
}
],
"legend": "{{serviceName}}",
"reduceTo": "avg"
}
],
"queryFormulas": []
},
"queryType": "builder"
},
"fillSpans": false,
"yAxisUnit": "ms"
},
{
"id": "request-rate",
"title": "Request Rate",
"description": "Requests per second for the service",
"isStacked": false,
"nullZeroValues": "zero",
"opacity": "1",
"panelTypes": "graph",
"query": {
"builder": {
"queryData": [
{
"dataSource": "traces",
"queryName": "A",
"aggregateOperator": "count",
"aggregateAttribute": {
"key": "",
"dataType": "",
"type": "",
"isColumn": false
},
"timeAggregation": "rate",
"spaceAggregation": "sum",
"functions": [],
"filters": {
"items": [
{
"key": {
"key": "serviceName",
"dataType": "string",
"type": "tag",
"isColumn": true
},
"op": "=",
"value": "{{.service_name}}"
}
],
"op": "AND"
},
"expression": "A",
"disabled": false,
"having": [],
"stepInterval": 60,
"limit": null,
"orderBy": [],
"groupBy": [
{
"key": "serviceName",
"dataType": "string",
"type": "tag",
"isColumn": true
}
],
"legend": "{{serviceName}}",
"reduceTo": "sum"
}
],
"queryFormulas": []
},
"queryType": "builder"
},
"fillSpans": false,
"yAxisUnit": "reqps"
},
{
"id": "error-rate",
"title": "Error Rate",
"description": "Error rate percentage for the service",
"isStacked": false,
"nullZeroValues": "zero",
"opacity": "1",
"panelTypes": "graph",
"query": {
"builder": {
"queryData": [
{
"dataSource": "traces",
"queryName": "A",
"aggregateOperator": "count",
"aggregateAttribute": {
"key": "",
"dataType": "",
"type": "",
"isColumn": false
},
"timeAggregation": "rate",
"spaceAggregation": "sum",
"functions": [],
"filters": {
"items": [
{
"key": {
"key": "serviceName",
"dataType": "string",
"type": "tag",
"isColumn": true
},
"op": "=",
"value": "{{.service_name}}"
},
{
"key": {
"key": "status_code",
"dataType": "string",
"type": "tag",
"isColumn": true
},
"op": "=",
"value": "STATUS_CODE_ERROR"
}
],
"op": "AND"
},
"expression": "A",
"disabled": false,
"having": [],
"stepInterval": 60,
"limit": null,
"orderBy": [],
"groupBy": [
{
"key": "serviceName",
"dataType": "string",
"type": "tag",
"isColumn": true
}
],
"legend": "{{serviceName}}",
"reduceTo": "sum"
}
],
"queryFormulas": []
},
"queryType": "builder"
},
"fillSpans": false,
"yAxisUnit": "reqps"
},
{
"id": "avg-duration",
"title": "Average Duration",
"description": "Average request duration",
"isStacked": false,
"nullZeroValues": "zero",
"opacity": "1",
"panelTypes": "graph",
"query": {
"builder": {
"queryData": [
{
"dataSource": "traces",
"queryName": "A",
"aggregateOperator": "avg",
"aggregateAttribute": {
"key": "duration_ns",
"dataType": "float64",
"type": "",
"isColumn": true
},
"timeAggregation": "avg",
"spaceAggregation": "avg",
"functions": [],
"filters": {
"items": [
{
"key": {
"key": "serviceName",
"dataType": "string",
"type": "tag",
"isColumn": true
},
"op": "=",
"value": "{{.service_name}}"
}
],
"op": "AND"
},
"expression": "A",
"disabled": false,
"having": [],
"stepInterval": 60,
"limit": null,
"orderBy": [],
"groupBy": [
{
"key": "serviceName",
"dataType": "string",
"type": "tag",
"isColumn": true
}
],
"legend": "{{serviceName}}",
"reduceTo": "avg"
}
],
"queryFormulas": []
},
"queryType": "builder"
},
"fillSpans": false,
"yAxisUnit": "ms"
}
]
}

View File

@@ -0,0 +1,425 @@
{
"description": "Comprehensive database performance monitoring for PostgreSQL, Redis, and RabbitMQ",
"tags": ["database", "postgresql", "redis", "rabbitmq", "performance"],
"name": "bakery-ia-database-performance",
"title": "Bakery IA - Database Performance",
"uploadedGrafana": false,
"uuid": "bakery-ia-db-01",
"version": "v4",
"collapsableRowsMigrated": true,
"layout": [
{
"x": 0,
"y": 0,
"w": 6,
"h": 3,
"i": "pg-connections",
"moved": false,
"static": false
},
{
"x": 6,
"y": 0,
"w": 6,
"h": 3,
"i": "pg-db-size",
"moved": false,
"static": false
},
{
"x": 0,
"y": 3,
"w": 6,
"h": 3,
"i": "redis-connected-clients",
"moved": false,
"static": false
},
{
"x": 6,
"y": 3,
"w": 6,
"h": 3,
"i": "redis-memory",
"moved": false,
"static": false
},
{
"x": 0,
"y": 6,
"w": 6,
"h": 3,
"i": "rabbitmq-messages",
"moved": false,
"static": false
},
{
"x": 6,
"y": 6,
"w": 6,
"h": 3,
"i": "rabbitmq-consumers",
"moved": false,
"static": false
}
],
"variables": {
"database": {
"id": "database-var",
"name": "database",
"description": "Filter by PostgreSQL database name",
"type": "QUERY",
"queryValue": "SELECT DISTINCT(resource_attrs['postgresql.database.name']) as value FROM signoz_metrics.distributed_time_series_v4_1day WHERE metric_name = 'postgresql.db_size' AND value != '' ORDER BY value",
"customValue": "",
"textboxValue": "",
"showALLOption": true,
"multiSelect": false,
"order": 1,
"modificationUUID": "",
"sort": "ASC",
"selectedValue": null
}
},
"widgets": [
{
"id": "pg-connections",
"title": "PostgreSQL - Active Connections",
"description": "Number of active PostgreSQL connections",
"isStacked": false,
"nullZeroValues": "zero",
"opacity": "1",
"panelTypes": "graph",
"query": {
"builder": {
"queryData": [
{
"dataSource": "metrics",
"queryName": "A",
"aggregateOperator": "sum",
"aggregateAttribute": {
"key": "postgresql.backends",
"dataType": "float64",
"type": "Gauge",
"isColumn": false
},
"timeAggregation": "latest",
"spaceAggregation": "sum",
"functions": [],
"filters": {
"items": [
{
"key": {
"key": "postgresql.database.name",
"dataType": "string",
"type": "resource",
"isColumn": false
},
"op": "=",
"value": "{{.database}}"
}
],
"op": "AND"
},
"expression": "A",
"disabled": false,
"having": [],
"stepInterval": 60,
"limit": null,
"orderBy": [],
"groupBy": [
{
"key": "postgresql.database.name",
"dataType": "string",
"type": "resource",
"isColumn": false
}
],
"legend": "{{postgresql.database.name}}",
"reduceTo": "sum"
}
],
"queryFormulas": []
},
"queryType": "builder"
},
"fillSpans": false,
"yAxisUnit": "none"
},
{
"id": "pg-db-size",
"title": "PostgreSQL - Database Size",
"description": "Size of PostgreSQL databases in bytes",
"isStacked": false,
"nullZeroValues": "zero",
"opacity": "1",
"panelTypes": "graph",
"query": {
"builder": {
"queryData": [
{
"dataSource": "metrics",
"queryName": "A",
"aggregateOperator": "sum",
"aggregateAttribute": {
"key": "postgresql.db_size",
"dataType": "int64",
"type": "Gauge",
"isColumn": false
},
"timeAggregation": "latest",
"spaceAggregation": "sum",
"functions": [],
"filters": {
"items": [
{
"key": {
"key": "postgresql.database.name",
"dataType": "string",
"type": "resource",
"isColumn": false
},
"op": "=",
"value": "{{.database}}"
}
],
"op": "AND"
},
"expression": "A",
"disabled": false,
"having": [],
"stepInterval": 60,
"limit": null,
"orderBy": [],
"groupBy": [
{
"key": "postgresql.database.name",
"dataType": "string",
"type": "resource",
"isColumn": false
}
],
"legend": "{{postgresql.database.name}}",
"reduceTo": "sum"
}
],
"queryFormulas": []
},
"queryType": "builder"
},
"fillSpans": false,
"yAxisUnit": "bytes"
},
{
"id": "redis-connected-clients",
"title": "Redis - Connected Clients",
"description": "Number of clients connected to Redis",
"isStacked": false,
"nullZeroValues": "zero",
"opacity": "1",
"panelTypes": "graph",
"query": {
"builder": {
"queryData": [
{
"dataSource": "metrics",
"queryName": "A",
"aggregateOperator": "avg",
"aggregateAttribute": {
"key": "redis.clients.connected",
"dataType": "int64",
"type": "Gauge",
"isColumn": false
},
"timeAggregation": "latest",
"spaceAggregation": "avg",
"functions": [],
"filters": {
"items": [],
"op": "AND"
},
"expression": "A",
"disabled": false,
"having": [],
"stepInterval": 60,
"limit": null,
"orderBy": [],
"groupBy": [
{
"key": "host.name",
"dataType": "string",
"type": "resource",
"isColumn": false
}
],
"legend": "{{host.name}}",
"reduceTo": "avg"
}
],
"queryFormulas": []
},
"queryType": "builder"
},
"fillSpans": false,
"yAxisUnit": "none"
},
{
"id": "redis-memory",
"title": "Redis - Memory Usage",
"description": "Redis memory usage in bytes",
"isStacked": false,
"nullZeroValues": "zero",
"opacity": "1",
"panelTypes": "graph",
"query": {
"builder": {
"queryData": [
{
"dataSource": "metrics",
"queryName": "A",
"aggregateOperator": "avg",
"aggregateAttribute": {
"key": "redis.memory.used",
"dataType": "int64",
"type": "Gauge",
"isColumn": false
},
"timeAggregation": "latest",
"spaceAggregation": "avg",
"functions": [],
"filters": {
"items": [],
"op": "AND"
},
"expression": "A",
"disabled": false,
"having": [],
"stepInterval": 60,
"limit": null,
"orderBy": [],
"groupBy": [
{
"key": "host.name",
"dataType": "string",
"type": "resource",
"isColumn": false
}
],
"legend": "{{host.name}}",
"reduceTo": "avg"
}
],
"queryFormulas": []
},
"queryType": "builder"
},
"fillSpans": false,
"yAxisUnit": "bytes"
},
{
"id": "rabbitmq-messages",
"title": "RabbitMQ - Current Messages",
"description": "Number of messages currently in RabbitMQ queues",
"isStacked": false,
"nullZeroValues": "zero",
"opacity": "1",
"panelTypes": "graph",
"query": {
"builder": {
"queryData": [
{
"dataSource": "metrics",
"queryName": "A",
"aggregateOperator": "sum",
"aggregateAttribute": {
"key": "rabbitmq.message.current",
"dataType": "int64",
"type": "Gauge",
"isColumn": false
},
"timeAggregation": "latest",
"spaceAggregation": "sum",
"functions": [],
"filters": {
"items": [],
"op": "AND"
},
"expression": "A",
"disabled": false,
"having": [],
"stepInterval": 60,
"limit": null,
"orderBy": [],
"groupBy": [
{
"key": "queue",
"dataType": "string",
"type": "tag",
"isColumn": false
}
],
"legend": "Queue: {{queue}}",
"reduceTo": "sum"
}
],
"queryFormulas": []
},
"queryType": "builder"
},
"fillSpans": false,
"yAxisUnit": "none"
},
{
"id": "rabbitmq-consumers",
"title": "RabbitMQ - Consumer Count",
"description": "Number of consumers per queue",
"isStacked": false,
"nullZeroValues": "zero",
"opacity": "1",
"panelTypes": "graph",
"query": {
"builder": {
"queryData": [
{
"dataSource": "metrics",
"queryName": "A",
"aggregateOperator": "sum",
"aggregateAttribute": {
"key": "rabbitmq.consumer.count",
"dataType": "int64",
"type": "Gauge",
"isColumn": false
},
"timeAggregation": "latest",
"spaceAggregation": "sum",
"functions": [],
"filters": {
"items": [],
"op": "AND"
},
"expression": "A",
"disabled": false,
"having": [],
"stepInterval": 60,
"limit": null,
"orderBy": [],
"groupBy": [
{
"key": "queue",
"dataType": "string",
"type": "tag",
"isColumn": false
}
],
"legend": "Queue: {{queue}}",
"reduceTo": "sum"
}
],
"queryFormulas": []
},
"queryType": "builder"
},
"fillSpans": false,
"yAxisUnit": "none"
}
]
}

View File

@@ -0,0 +1,348 @@
{
"description": "Comprehensive error tracking and analysis dashboard",
"tags": ["errors", "exceptions", "tracking"],
"name": "bakery-ia-error-tracking",
"title": "Bakery IA - Error Tracking",
"uploadedGrafana": false,
"uuid": "bakery-ia-errors-01",
"version": "v4",
"collapsableRowsMigrated": true,
"layout": [
{
"x": 0,
"y": 0,
"w": 6,
"h": 3,
"i": "total-errors",
"moved": false,
"static": false
},
{
"x": 6,
"y": 0,
"w": 6,
"h": 3,
"i": "error-rate",
"moved": false,
"static": false
},
{
"x": 0,
"y": 3,
"w": 6,
"h": 3,
"i": "http-5xx",
"moved": false,
"static": false
},
{
"x": 6,
"y": 3,
"w": 6,
"h": 3,
"i": "http-4xx",
"moved": false,
"static": false
}
],
"variables": {
"service": {
"id": "service-var",
"name": "service",
"description": "Filter by service name",
"type": "QUERY",
"queryValue": "SELECT DISTINCT(resource_attrs['service.name']) as value FROM signoz_metrics.distributed_time_series_v4_1day WHERE metric_name = 'error_total' AND value != '' ORDER BY value",
"customValue": "",
"textboxValue": "",
"showALLOption": true,
"multiSelect": false,
"order": 1,
"modificationUUID": "",
"sort": "ASC",
"selectedValue": null
}
},
"widgets": [
{
"id": "total-errors",
"title": "Total Errors",
"description": "Total number of errors across all services",
"isStacked": false,
"nullZeroValues": "zero",
"opacity": "1",
"panelTypes": "value",
"query": {
"builder": {
"queryData": [
{
"dataSource": "metrics",
"queryName": "A",
"aggregateOperator": "sum",
"aggregateAttribute": {
"key": "error_total",
"dataType": "int64",
"type": "Counter",
"isColumn": false
},
"timeAggregation": "sum",
"spaceAggregation": "sum",
"functions": [],
"filters": {
"items": [
{
"key": {
"key": "service.name",
"dataType": "string",
"type": "resource",
"isColumn": false
},
"op": "=",
"value": "{{.service}}"
}
],
"op": "AND"
},
"expression": "A",
"disabled": false,
"having": [],
"stepInterval": 60,
"limit": null,
"orderBy": [],
"groupBy": [],
"legend": "Total Errors",
"reduceTo": "sum"
}
],
"queryFormulas": []
},
"queryType": "builder"
},
"fillSpans": false,
"yAxisUnit": "none"
},
{
"id": "error-rate",
"title": "Error Rate",
"description": "Error rate over time",
"isStacked": false,
"nullZeroValues": "zero",
"opacity": "1",
"panelTypes": "graph",
"query": {
"builder": {
"queryData": [
{
"dataSource": "metrics",
"queryName": "A",
"aggregateOperator": "sum",
"aggregateAttribute": {
"key": "error_total",
"dataType": "int64",
"type": "Counter",
"isColumn": false
},
"timeAggregation": "rate",
"spaceAggregation": "sum",
"functions": [],
"filters": {
"items": [
{
"key": {
"key": "service.name",
"dataType": "string",
"type": "resource",
"isColumn": false
},
"op": "=",
"value": "{{.service}}"
}
],
"op": "AND"
},
"expression": "A",
"disabled": false,
"having": [],
"stepInterval": 60,
"limit": null,
"orderBy": [],
"groupBy": [
{
"key": "serviceName",
"dataType": "string",
"type": "tag",
"isColumn": true
}
],
"legend": "{{serviceName}}",
"reduceTo": "sum"
}
],
"queryFormulas": []
},
"queryType": "builder"
},
"fillSpans": false,
"yAxisUnit": "errors/s"
},
{
"id": "http-5xx",
"title": "HTTP 5xx Errors",
"description": "Server errors (5xx status codes)",
"isStacked": false,
"nullZeroValues": "zero",
"opacity": "1",
"panelTypes": "graph",
"query": {
"builder": {
"queryData": [
{
"dataSource": "metrics",
"queryName": "A",
"aggregateOperator": "sum",
"aggregateAttribute": {
"key": "http_server_requests_seconds_count",
"dataType": "int64",
"type": "Counter",
"isColumn": false
},
"timeAggregation": "sum",
"spaceAggregation": "sum",
"functions": [],
"filters": {
"items": [
{
"key": {
"key": "service.name",
"dataType": "string",
"type": "resource",
"isColumn": false
},
"op": "=",
"value": "{{.service}}"
},
{
"key": {
"key": "status_code",
"dataType": "string",
"type": "tag",
"isColumn": false
},
"op": "=~",
"value": "5.."
}
],
"op": "AND"
},
"expression": "A",
"disabled": false,
"having": [],
"stepInterval": 60,
"limit": null,
"orderBy": [],
"groupBy": [
{
"key": "serviceName",
"dataType": "string",
"type": "tag",
"isColumn": true
},
{
"key": "status_code",
"dataType": "string",
"type": "tag",
"isColumn": false
}
],
"legend": "{{serviceName}} - {{status_code}}",
"reduceTo": "sum"
}
],
"queryFormulas": []
},
"queryType": "builder"
},
"fillSpans": false,
"yAxisUnit": "number"
},
{
"id": "http-4xx",
"title": "HTTP 4xx Errors",
"description": "Client errors (4xx status codes)",
"isStacked": false,
"nullZeroValues": "zero",
"opacity": "1",
"panelTypes": "graph",
"query": {
"builder": {
"queryData": [
{
"dataSource": "metrics",
"queryName": "A",
"aggregateOperator": "sum",
"aggregateAttribute": {
"key": "http_server_requests_seconds_count",
"dataType": "int64",
"type": "Counter",
"isColumn": false
},
"timeAggregation": "sum",
"spaceAggregation": "sum",
"functions": [],
"filters": {
"items": [
{
"key": {
"key": "service.name",
"dataType": "string",
"type": "resource",
"isColumn": false
},
"op": "=",
"value": "{{.service}}"
},
{
"key": {
"key": "status_code",
"dataType": "string",
"type": "tag",
"isColumn": false
},
"op": "=~",
"value": "4.."
}
],
"op": "AND"
},
"expression": "A",
"disabled": false,
"having": [],
"stepInterval": 60,
"limit": null,
"orderBy": [],
"groupBy": [
{
"key": "serviceName",
"dataType": "string",
"type": "tag",
"isColumn": true
},
{
"key": "status_code",
"dataType": "string",
"type": "tag",
"isColumn": false
}
],
"legend": "{{serviceName}} - {{status_code}}",
"reduceTo": "sum"
}
],
"queryFormulas": []
},
"queryType": "builder"
},
"fillSpans": false,
"yAxisUnit": "number"
}
]
}

View File

@@ -0,0 +1,213 @@
{
"name": "Bakery IA Dashboard Collection",
"description": "Complete set of SigNoz dashboards for Bakery IA monitoring",
"version": "1.0.0",
"author": "Bakery IA Team",
"license": "MIT",
"dashboards": [
{
"id": "infrastructure-monitoring",
"name": "Infrastructure Monitoring",
"description": "Kubernetes infrastructure and resource monitoring",
"file": "infrastructure-monitoring.json",
"tags": ["infrastructure", "kubernetes", "system"],
"category": "infrastructure"
},
{
"id": "application-performance",
"name": "Application Performance",
"description": "Microservice performance and API metrics",
"file": "application-performance.json",
"tags": ["application", "performance", "apm"],
"category": "performance"
},
{
"id": "database-performance",
"name": "Database Performance",
"description": "PostgreSQL and Redis database monitoring",
"file": "database-performance.json",
"tags": ["database", "postgresql", "redis"],
"category": "database"
},
{
"id": "api-performance",
"name": "API Performance",
"description": "REST and GraphQL API performance monitoring",
"file": "api-performance.json",
"tags": ["api", "rest", "graphql"],
"category": "api"
},
{
"id": "error-tracking",
"name": "Error Tracking",
"description": "System error tracking and analysis",
"file": "error-tracking.json",
"tags": ["errors", "exceptions", "tracking"],
"category": "monitoring"
},
{
"id": "user-activity",
"name": "User Activity",
"description": "User behavior and activity monitoring",
"file": "user-activity.json",
"tags": ["user", "activity", "behavior"],
"category": "user"
},
{
"id": "system-health",
"name": "System Health",
"description": "Overall system health monitoring",
"file": "system-health.json",
"tags": ["system", "health", "overview"],
"category": "overview"
},
{
"id": "alert-management",
"name": "Alert Management",
"description": "Alert monitoring and management",
"file": "alert-management.json",
"tags": ["alerts", "notifications", "management"],
"category": "alerts"
},
{
"id": "log-analysis",
"name": "Log Analysis",
"description": "Log search and analysis",
"file": "log-analysis.json",
"tags": ["logs", "search", "analysis"],
"category": "logs"
}
],
"categories": [
{
"id": "infrastructure",
"name": "Infrastructure",
"description": "Kubernetes and system infrastructure monitoring"
},
{
"id": "performance",
"name": "Performance",
"description": "Application and service performance monitoring"
},
{
"id": "database",
"name": "Database",
"description": "Database performance and health monitoring"
},
{
"id": "api",
"name": "API",
"description": "API performance and usage monitoring"
},
{
"id": "monitoring",
"name": "Monitoring",
"description": "Error tracking and system monitoring"
},
{
"id": "user",
"name": "User",
"description": "User activity and behavior monitoring"
},
{
"id": "overview",
"name": "Overview",
"description": "System-wide overview and health dashboards"
},
{
"id": "alerts",
"name": "Alerts",
"description": "Alert management and monitoring"
},
{
"id": "logs",
"name": "Logs",
"description": "Log analysis and search"
}
],
"usage": {
"import_methods": [
"ui_import",
"api_import",
"kubernetes_configmap"
],
"recommended_import_order": [
"infrastructure-monitoring",
"system-health",
"application-performance",
"api-performance",
"database-performance",
"error-tracking",
"alert-management",
"log-analysis",
"user-activity"
]
},
"requirements": {
"signoz_version": ">= 0.10.0",
"opentelemetry_collector": ">= 0.45.0",
"metrics": [
"container_cpu_usage_seconds_total",
"container_memory_working_set_bytes",
"http_server_requests_seconds_count",
"http_server_requests_seconds_sum",
"pg_stat_activity_count",
"pg_stat_statements_total_time",
"error_total",
"alerts_total",
"kube_pod_status_phase",
"container_network_receive_bytes_total",
"kube_pod_container_status_restarts_total",
"kube_pod_container_status_ready",
"container_fs_reads_total",
"kube_pod_status_phase",
"kube_pod_container_status_restarts_total",
"kube_pod_container_status_ready",
"container_fs_reads_total",
"kubernetes_events",
"http_server_requests_seconds_bucket",
"http_server_active_requests",
"http_server_up",
"db_query_duration_seconds_sum",
"db_connections_active",
"http_client_request_duration_seconds_count",
"http_client_request_duration_seconds_sum",
"graphql_execution_time_seconds",
"graphql_errors_total",
"pg_stat_database_blks_hit",
"pg_stat_database_xact_commit",
"pg_locks_count",
"pg_table_size_bytes",
"pg_stat_user_tables_seq_scan",
"redis_memory_used_bytes",
"redis_commands_processed_total",
"redis_keyspace_hits",
"pg_stat_database_deadlocks",
"pg_stat_database_conn_errors",
"pg_replication_lag_bytes",
"pg_replication_is_replica",
"active_users",
"user_sessions_total",
"api_calls_per_user",
"session_duration_seconds",
"system_availability",
"service_health_score",
"system_cpu_usage",
"system_memory_usage",
"service_availability",
"alerts_active",
"alerts_total",
"log_lines_total"
]
},
"support": {
"documentation": "https://signoz.io/docs/",
"bakery_ia_docs": "../SIGNOZ_COMPLETE_CONFIGURATION_GUIDE.md",
"issues": "https://github.com/your-repo/issues"
},
"notes": {
"format_fix": "All dashboards have been updated to use the correct SigNoz JSON schema with proper filter arrays to resolve the 'e.filter is not a function' error.",
"compatibility": "Tested with SigNoz v0.10.0+ and OpenTelemetry Collector v0.45.0+",
"customization": "You can customize these dashboards by editing the JSON files or cloning them in the SigNoz UI"
}
}

View File

@@ -0,0 +1,437 @@
{
"description": "Comprehensive infrastructure monitoring dashboard for Bakery IA Kubernetes cluster",
"tags": ["infrastructure", "kubernetes", "k8s", "system"],
"name": "bakery-ia-infrastructure-monitoring",
"title": "Bakery IA - Infrastructure Monitoring",
"uploadedGrafana": false,
"uuid": "bakery-ia-infra-01",
"version": "v4",
"collapsableRowsMigrated": true,
"layout": [
{
"x": 0,
"y": 0,
"w": 6,
"h": 3,
"i": "pod-count",
"moved": false,
"static": false
},
{
"x": 6,
"y": 0,
"w": 6,
"h": 3,
"i": "pod-phase",
"moved": false,
"static": false
},
{
"x": 0,
"y": 3,
"w": 6,
"h": 3,
"i": "container-restarts",
"moved": false,
"static": false
},
{
"x": 6,
"y": 3,
"w": 6,
"h": 3,
"i": "node-condition",
"moved": false,
"static": false
},
{
"x": 0,
"y": 6,
"w": 12,
"h": 3,
"i": "deployment-status",
"moved": false,
"static": false
}
],
"variables": {
"namespace": {
"id": "namespace-var",
"name": "namespace",
"description": "Filter by Kubernetes namespace",
"type": "QUERY",
"queryValue": "SELECT DISTINCT(resource_attrs['k8s.namespace.name']) as value FROM signoz_metrics.distributed_time_series_v4_1day WHERE metric_name = 'k8s.pod.phase' AND value != '' ORDER BY value",
"customValue": "",
"textboxValue": "",
"showALLOption": true,
"multiSelect": false,
"order": 1,
"modificationUUID": "",
"sort": "ASC",
"selectedValue": "bakery-ia"
}
},
"widgets": [
{
"id": "pod-count",
"title": "Total Pods",
"description": "Total number of pods in the namespace",
"isStacked": false,
"nullZeroValues": "zero",
"opacity": "1",
"panelTypes": "value",
"query": {
"builder": {
"queryData": [
{
"dataSource": "metrics",
"queryName": "A",
"aggregateOperator": "count",
"aggregateAttribute": {
"key": "k8s.pod.phase",
"dataType": "int64",
"type": "Gauge",
"isColumn": false
},
"timeAggregation": "latest",
"spaceAggregation": "sum",
"functions": [],
"filters": {
"items": [
{
"id": "filter-k8s-namespace",
"key": {
"id": "k8s.namespace.name--string--tag--false",
"key": "k8s.namespace.name",
"dataType": "string",
"type": "tag",
"isColumn": false
},
"op": "=",
"value": "{{.namespace}}"
}
],
"op": "AND"
},
"expression": "A",
"disabled": false,
"having": [],
"stepInterval": 60,
"limit": null,
"orderBy": [],
"groupBy": [],
"legend": "Total Pods",
"reduceTo": "sum"
}
],
"queryFormulas": []
},
"queryType": "builder"
},
"fillSpans": false,
"yAxisUnit": "none"
},
{
"id": "pod-phase",
"title": "Pod Phase Distribution",
"description": "Pods by phase (Running, Pending, Failed, etc.)",
"isStacked": true,
"nullZeroValues": "zero",
"opacity": "1",
"panelTypes": "graph",
"query": {
"builder": {
"queryData": [
{
"dataSource": "metrics",
"queryName": "A",
"aggregateOperator": "sum",
"aggregateAttribute": {
"key": "k8s.pod.phase",
"dataType": "int64",
"type": "Gauge",
"isColumn": false
},
"timeAggregation": "latest",
"spaceAggregation": "sum",
"functions": [],
"filters": {
"items": [
{
"id": "filter-k8s-namespace",
"key": {
"id": "k8s.namespace.name--string--tag--false",
"key": "k8s.namespace.name",
"dataType": "string",
"type": "tag",
"isColumn": false
},
"op": "=",
"value": "{{.namespace}}"
}
],
"op": "AND"
},
"expression": "A",
"disabled": false,
"having": [],
"stepInterval": 60,
"limit": null,
"orderBy": [],
"groupBy": [
{
"key": "phase",
"dataType": "string",
"type": "tag",
"isColumn": false
}
],
"legend": "{{phase}}",
"reduceTo": "sum"
}
],
"queryFormulas": []
},
"queryType": "builder"
},
"fillSpans": false,
"yAxisUnit": "none"
},
{
"id": "container-restarts",
"title": "Container Restarts",
"description": "Container restart count over time",
"isStacked": false,
"nullZeroValues": "zero",
"opacity": "1",
"panelTypes": "graph",
"query": {
"builder": {
"queryData": [
{
"dataSource": "metrics",
"queryName": "A",
"aggregateOperator": "sum",
"aggregateAttribute": {
"key": "k8s.container.restarts",
"dataType": "int64",
"type": "Gauge",
"isColumn": false
},
"timeAggregation": "increase",
"spaceAggregation": "sum",
"functions": [],
"filters": {
"items": [
{
"id": "filter-k8s-namespace",
"key": {
"id": "k8s.namespace.name--string--tag--false",
"key": "k8s.namespace.name",
"dataType": "string",
"type": "tag",
"isColumn": false
},
"op": "=",
"value": "{{.namespace}}"
}
],
"op": "AND"
},
"expression": "A",
"disabled": false,
"having": [],
"stepInterval": 60,
"limit": null,
"orderBy": [],
"groupBy": [
{
"id": "k8s.pod.name--string--tag--false",
"key": "k8s.pod.name",
"dataType": "string",
"type": "tag",
"isColumn": false
}
],
"legend": "{{k8s.pod.name}}",
"reduceTo": "sum"
}
],
"queryFormulas": []
},
"queryType": "builder"
},
"fillSpans": false,
"yAxisUnit": "none"
},
{
"id": "node-condition",
"title": "Node Conditions",
"description": "Node condition status (Ready, MemoryPressure, DiskPressure, etc.)",
"isStacked": true,
"nullZeroValues": "zero",
"opacity": "1",
"panelTypes": "graph",
"query": {
"builder": {
"queryData": [
{
"dataSource": "metrics",
"queryName": "A",
"aggregateOperator": "sum",
"aggregateAttribute": {
"key": "k8s.node.condition_ready",
"dataType": "int64",
"type": "Gauge",
"isColumn": false
},
"timeAggregation": "latest",
"spaceAggregation": "sum",
"functions": [],
"filters": {
"items": [],
"op": "AND"
},
"expression": "A",
"disabled": false,
"having": [],
"stepInterval": 60,
"limit": null,
"orderBy": [],
"groupBy": [
{
"id": "k8s.node.name--string--tag--false",
"key": "k8s.node.name",
"dataType": "string",
"type": "tag",
"isColumn": false
}
],
"legend": "{{k8s.node.name}} Ready",
"reduceTo": "sum"
}
],
"queryFormulas": []
},
"queryType": "builder"
},
"fillSpans": false,
"yAxisUnit": "none"
},
{
"id": "deployment-status",
"title": "Deployment Status (Desired vs Available)",
"description": "Deployment replicas: desired vs available",
"isStacked": false,
"nullZeroValues": "zero",
"opacity": "1",
"panelTypes": "graph",
"query": {
"builder": {
"queryData": [
{
"dataSource": "metrics",
"queryName": "A",
"aggregateOperator": "avg",
"aggregateAttribute": {
"key": "k8s.deployment.desired",
"dataType": "int64",
"type": "Gauge",
"isColumn": false
},
"timeAggregation": "latest",
"spaceAggregation": "avg",
"functions": [],
"filters": {
"items": [
{
"id": "filter-k8s-namespace",
"key": {
"id": "k8s.namespace.name--string--tag--false",
"key": "k8s.namespace.name",
"dataType": "string",
"type": "tag",
"isColumn": false
},
"op": "=",
"value": "{{.namespace}}"
}
],
"op": "AND"
},
"expression": "A",
"disabled": false,
"having": [],
"stepInterval": 60,
"limit": null,
"orderBy": [],
"groupBy": [
{
"id": "k8s.deployment.name--string--tag--false",
"key": "k8s.deployment.name",
"dataType": "string",
"type": "tag",
"isColumn": false
}
],
"legend": "{{k8s.deployment.name}} (desired)",
"reduceTo": "avg"
},
{
"dataSource": "metrics",
"queryName": "B",
"aggregateOperator": "avg",
"aggregateAttribute": {
"key": "k8s.deployment.available",
"dataType": "int64",
"type": "Gauge",
"isColumn": false
},
"timeAggregation": "latest",
"spaceAggregation": "avg",
"functions": [],
"filters": {
"items": [
{
"id": "filter-k8s-namespace",
"key": {
"id": "k8s.namespace.name--string--tag--false",
"key": "k8s.namespace.name",
"dataType": "string",
"type": "tag",
"isColumn": false
},
"op": "=",
"value": "{{.namespace}}"
}
],
"op": "AND"
},
"expression": "B",
"disabled": false,
"having": [],
"stepInterval": 60,
"limit": null,
"orderBy": [],
"groupBy": [
{
"id": "k8s.deployment.name--string--tag--false",
"key": "k8s.deployment.name",
"dataType": "string",
"type": "tag",
"isColumn": false
}
],
"legend": "{{k8s.deployment.name}} (available)",
"reduceTo": "avg"
}
],
"queryFormulas": []
},
"queryType": "builder"
},
"fillSpans": false,
"yAxisUnit": "none"
}
]
}

View File

@@ -0,0 +1,333 @@
{
"description": "Comprehensive log analysis and search dashboard",
"tags": ["logs", "analysis", "search"],
"name": "bakery-ia-log-analysis",
"title": "Bakery IA - Log Analysis",
"uploadedGrafana": false,
"uuid": "bakery-ia-logs-01",
"version": "v4",
"collapsableRowsMigrated": true,
"layout": [
{
"x": 0,
"y": 0,
"w": 6,
"h": 3,
"i": "log-volume",
"moved": false,
"static": false
},
{
"x": 6,
"y": 0,
"w": 6,
"h": 3,
"i": "error-logs",
"moved": false,
"static": false
},
{
"x": 0,
"y": 3,
"w": 6,
"h": 3,
"i": "logs-by-level",
"moved": false,
"static": false
},
{
"x": 6,
"y": 3,
"w": 6,
"h": 3,
"i": "logs-by-service",
"moved": false,
"static": false
}
],
"variables": {
"service": {
"id": "service-var",
"name": "service",
"description": "Filter by service name",
"type": "QUERY",
"queryValue": "SELECT DISTINCT(resource_attrs['service.name']) as value FROM signoz_metrics.distributed_time_series_v4_1day WHERE metric_name = 'log_lines_total' AND value != '' ORDER BY value",
"customValue": "",
"textboxValue": "",
"showALLOption": true,
"multiSelect": false,
"order": 1,
"modificationUUID": "",
"sort": "ASC",
"selectedValue": null
}
},
"widgets": [
{
"id": "log-volume",
"title": "Log Volume",
"description": "Total log volume by service",
"isStacked": false,
"nullZeroValues": "zero",
"opacity": "1",
"panelTypes": "graph",
"query": {
"builder": {
"queryData": [
{
"dataSource": "metrics",
"queryName": "A",
"aggregateOperator": "sum",
"aggregateAttribute": {
"key": "log_lines_total",
"dataType": "int64",
"type": "Counter",
"isColumn": false
},
"timeAggregation": "rate",
"spaceAggregation": "sum",
"functions": [],
"filters": {
"items": [
{
"key": {
"key": "serviceName",
"dataType": "string",
"type": "tag",
"isColumn": true
},
"op": "=",
"value": "{{.service}}"
}
],
"op": "AND"
},
"expression": "A",
"disabled": false,
"having": [],
"stepInterval": 60,
"limit": null,
"orderBy": [],
"groupBy": [
{
"key": "serviceName",
"dataType": "string",
"type": "tag",
"isColumn": true
}
],
"legend": "{{serviceName}}",
"reduceTo": "sum"
}
],
"queryFormulas": []
},
"queryType": "builder"
},
"fillSpans": false,
"yAxisUnit": "logs/s"
},
{
"id": "error-logs",
"title": "Error Logs",
"description": "Error log volume by service",
"isStacked": false,
"nullZeroValues": "zero",
"opacity": "1",
"panelTypes": "graph",
"query": {
"builder": {
"queryData": [
{
"dataSource": "metrics",
"queryName": "A",
"aggregateOperator": "sum",
"aggregateAttribute": {
"key": "log_lines_total",
"dataType": "int64",
"type": "Counter",
"isColumn": false
},
"timeAggregation": "rate",
"spaceAggregation": "sum",
"functions": [],
"filters": {
"items": [
{
"key": {
"key": "serviceName",
"dataType": "string",
"type": "tag",
"isColumn": true
},
"op": "=",
"value": "{{.service}}"
},
{
"key": {
"key": "level",
"dataType": "string",
"type": "tag",
"isColumn": false
},
"op": "=",
"value": "error"
}
],
"op": "AND"
},
"expression": "A",
"disabled": false,
"having": [],
"stepInterval": 60,
"limit": null,
"orderBy": [],
"groupBy": [
{
"key": "serviceName",
"dataType": "string",
"type": "tag",
"isColumn": true
}
],
"legend": "{{serviceName}} (errors)",
"reduceTo": "sum"
}
],
"queryFormulas": []
},
"queryType": "builder"
},
"fillSpans": false,
"yAxisUnit": "logs/s"
},
{
"id": "logs-by-level",
"title": "Logs by Level",
"description": "Distribution of logs by severity level",
"isStacked": false,
"nullZeroValues": "zero",
"opacity": "1",
"panelTypes": "pie",
"query": {
"builder": {
"queryData": [
{
"dataSource": "metrics",
"queryName": "A",
"aggregateOperator": "sum",
"aggregateAttribute": {
"key": "log_lines_total",
"dataType": "int64",
"type": "Counter",
"isColumn": false
},
"timeAggregation": "sum",
"spaceAggregation": "sum",
"functions": [],
"filters": {
"items": [
{
"key": {
"key": "serviceName",
"dataType": "string",
"type": "tag",
"isColumn": true
},
"op": "=",
"value": "{{.service}}"
}
],
"op": "AND"
},
"expression": "A",
"disabled": false,
"having": [],
"stepInterval": 60,
"limit": null,
"orderBy": [],
"groupBy": [
{
"key": "level",
"dataType": "string",
"type": "tag",
"isColumn": false
}
],
"legend": "{{level}}",
"reduceTo": "sum"
}
],
"queryFormulas": []
},
"queryType": "builder"
},
"fillSpans": false,
"yAxisUnit": "none"
},
{
"id": "logs-by-service",
"title": "Logs by Service",
"description": "Distribution of logs by service",
"isStacked": false,
"nullZeroValues": "zero",
"opacity": "1",
"panelTypes": "pie",
"query": {
"builder": {
"queryData": [
{
"dataSource": "metrics",
"queryName": "A",
"aggregateOperator": "sum",
"aggregateAttribute": {
"key": "log_lines_total",
"dataType": "int64",
"type": "Counter",
"isColumn": false
},
"timeAggregation": "sum",
"spaceAggregation": "sum",
"functions": [],
"filters": {
"items": [
{
"key": {
"key": "serviceName",
"dataType": "string",
"type": "tag",
"isColumn": true
},
"op": "=",
"value": "{{.service}}"
}
],
"op": "AND"
},
"expression": "A",
"disabled": false,
"having": [],
"stepInterval": 60,
"limit": null,
"orderBy": [],
"groupBy": [
{
"key": "serviceName",
"dataType": "string",
"type": "tag",
"isColumn": true
}
],
"legend": "{{serviceName}}",
"reduceTo": "sum"
}
],
"queryFormulas": []
},
"queryType": "builder"
},
"fillSpans": false,
"yAxisUnit": "none"
}
]
}

View File

@@ -0,0 +1,303 @@
{
"description": "Comprehensive system health monitoring dashboard",
"tags": ["system", "health", "monitoring"],
"name": "bakery-ia-system-health",
"title": "Bakery IA - System Health",
"uploadedGrafana": false,
"uuid": "bakery-ia-health-01",
"version": "v4",
"collapsableRowsMigrated": true,
"layout": [
{
"x": 0,
"y": 0,
"w": 6,
"h": 3,
"i": "system-availability",
"moved": false,
"static": false
},
{
"x": 6,
"y": 0,
"w": 6,
"h": 3,
"i": "health-score",
"moved": false,
"static": false
},
{
"x": 0,
"y": 3,
"w": 6,
"h": 3,
"i": "cpu-usage",
"moved": false,
"static": false
},
{
"x": 6,
"y": 3,
"w": 6,
"h": 3,
"i": "memory-usage",
"moved": false,
"static": false
}
],
"variables": {
"namespace": {
"id": "namespace-var",
"name": "namespace",
"description": "Filter by Kubernetes namespace",
"type": "QUERY",
"queryValue": "SELECT DISTINCT(resource_attrs['k8s.namespace.name']) as value FROM signoz_metrics.distributed_time_series_v4_1day WHERE metric_name = 'system_availability' AND value != '' ORDER BY value",
"customValue": "",
"textboxValue": "",
"showALLOption": true,
"multiSelect": false,
"order": 1,
"modificationUUID": "",
"sort": "ASC",
"selectedValue": "bakery-ia"
}
},
"widgets": [
{
"id": "system-availability",
"title": "System Availability",
"description": "Overall system availability percentage",
"isStacked": false,
"nullZeroValues": "zero",
"opacity": "1",
"panelTypes": "value",
"query": {
"builder": {
"queryData": [
{
"dataSource": "metrics",
"queryName": "A",
"aggregateOperator": "avg",
"aggregateAttribute": {
"key": "system_availability",
"dataType": "float64",
"type": "Gauge",
"isColumn": false
},
"timeAggregation": "latest",
"spaceAggregation": "avg",
"functions": [],
"filters": {
"items": [
{
"id": "filter-k8s-namespace",
"key": {
"id": "k8s.namespace.name--string--tag--false",
"key": "k8s.namespace.name",
"dataType": "string",
"type": "tag",
"isColumn": false
},
"op": "=",
"value": "{{.namespace}}"
}
],
"op": "AND"
},
"expression": "A",
"disabled": false,
"having": [],
"stepInterval": 60,
"limit": null,
"orderBy": [],
"groupBy": [],
"legend": "System Availability",
"reduceTo": "avg"
}
],
"queryFormulas": []
},
"queryType": "builder"
},
"fillSpans": false,
"yAxisUnit": "percent"
},
{
"id": "health-score",
"title": "Service Health Score",
"description": "Overall service health score",
"isStacked": false,
"nullZeroValues": "zero",
"opacity": "1",
"panelTypes": "value",
"query": {
"builder": {
"queryData": [
{
"dataSource": "metrics",
"queryName": "A",
"aggregateOperator": "avg",
"aggregateAttribute": {
"key": "service_health_score",
"dataType": "float64",
"type": "Gauge",
"isColumn": false
},
"timeAggregation": "latest",
"spaceAggregation": "avg",
"functions": [],
"filters": {
"items": [
{
"id": "filter-k8s-namespace",
"key": {
"id": "k8s.namespace.name--string--tag--false",
"key": "k8s.namespace.name",
"dataType": "string",
"type": "tag",
"isColumn": false
},
"op": "=",
"value": "{{.namespace}}"
}
],
"op": "AND"
},
"expression": "A",
"disabled": false,
"having": [],
"stepInterval": 60,
"limit": null,
"orderBy": [],
"groupBy": [],
"legend": "Health Score",
"reduceTo": "avg"
}
],
"queryFormulas": []
},
"queryType": "builder"
},
"fillSpans": false,
"yAxisUnit": "none"
},
{
"id": "cpu-usage",
"title": "CPU Usage",
"description": "System CPU usage over time",
"isStacked": false,
"nullZeroValues": "zero",
"opacity": "1",
"panelTypes": "graph",
"query": {
"builder": {
"queryData": [
{
"dataSource": "metrics",
"queryName": "A",
"aggregateOperator": "avg",
"aggregateAttribute": {
"key": "system_cpu_usage",
"dataType": "float64",
"type": "Gauge",
"isColumn": false
},
"timeAggregation": "avg",
"spaceAggregation": "avg",
"functions": [],
"filters": {
"items": [
{
"id": "filter-k8s-namespace",
"key": {
"id": "k8s.namespace.name--string--tag--false",
"key": "k8s.namespace.name",
"dataType": "string",
"type": "tag",
"isColumn": false
},
"op": "=",
"value": "{{.namespace}}"
}
],
"op": "AND"
},
"expression": "A",
"disabled": false,
"having": [],
"stepInterval": 60,
"limit": null,
"orderBy": [],
"groupBy": [],
"legend": "CPU Usage",
"reduceTo": "avg"
}
],
"queryFormulas": []
},
"queryType": "builder"
},
"fillSpans": false,
"yAxisUnit": "percent"
},
{
"id": "memory-usage",
"title": "Memory Usage",
"description": "System memory usage over time",
"isStacked": false,
"nullZeroValues": "zero",
"opacity": "1",
"panelTypes": "graph",
"query": {
"builder": {
"queryData": [
{
"dataSource": "metrics",
"queryName": "A",
"aggregateOperator": "avg",
"aggregateAttribute": {
"key": "system_memory_usage",
"dataType": "float64",
"type": "Gauge",
"isColumn": false
},
"timeAggregation": "avg",
"spaceAggregation": "avg",
"functions": [],
"filters": {
"items": [
{
"id": "filter-k8s-namespace",
"key": {
"id": "k8s.namespace.name--string--tag--false",
"key": "k8s.namespace.name",
"dataType": "string",
"type": "tag",
"isColumn": false
},
"op": "=",
"value": "{{.namespace}}"
}
],
"op": "AND"
},
"expression": "A",
"disabled": false,
"having": [],
"stepInterval": 60,
"limit": null,
"orderBy": [],
"groupBy": [],
"legend": "Memory Usage",
"reduceTo": "avg"
}
],
"queryFormulas": []
},
"queryType": "builder"
},
"fillSpans": false,
"yAxisUnit": "percent"
}
]
}

View File

@@ -0,0 +1,429 @@
{
"description": "User activity and behavior monitoring dashboard",
"tags": ["user", "activity", "behavior"],
"name": "bakery-ia-user-activity",
"title": "Bakery IA - User Activity",
"uploadedGrafana": false,
"uuid": "bakery-ia-user-01",
"version": "v4",
"collapsableRowsMigrated": true,
"layout": [
{
"x": 0,
"y": 0,
"w": 6,
"h": 3,
"i": "active-users",
"moved": false,
"static": false
},
{
"x": 6,
"y": 0,
"w": 6,
"h": 3,
"i": "user-sessions",
"moved": false,
"static": false
},
{
"x": 0,
"y": 3,
"w": 6,
"h": 3,
"i": "user-actions",
"moved": false,
"static": false
},
{
"x": 6,
"y": 3,
"w": 6,
"h": 3,
"i": "page-views",
"moved": false,
"static": false
},
{
"x": 0,
"y": 6,
"w": 12,
"h": 4,
"i": "geo-visitors",
"moved": false,
"static": false
}
],
"variables": {
"service": {
"id": "service-var",
"name": "service",
"description": "Filter by service name",
"type": "QUERY",
"queryValue": "SELECT DISTINCT(serviceName) FROM signoz_traces.distributed_signoz_index_v2 ORDER BY serviceName",
"customValue": "",
"textboxValue": "",
"showALLOption": true,
"multiSelect": false,
"order": 1,
"modificationUUID": "",
"sort": "ASC",
"selectedValue": "bakery-frontend"
}
},
"widgets": [
{
"id": "active-users",
"title": "Active Users",
"description": "Number of active users by service",
"isStacked": false,
"nullZeroValues": "zero",
"opacity": "1",
"panelTypes": "graph",
"query": {
"builder": {
"queryData": [
{
"dataSource": "traces",
"queryName": "A",
"aggregateOperator": "count_distinct",
"aggregateAttribute": {
"key": "user.id",
"dataType": "string",
"type": "tag",
"isColumn": true
},
"timeAggregation": "count_distinct",
"spaceAggregation": "sum",
"functions": [],
"filters": {
"items": [
{
"key": {
"key": "serviceName",
"dataType": "string",
"type": "tag",
"isColumn": true
},
"op": "=",
"value": "{{.service}}"
}
],
"op": "AND"
},
"expression": "A",
"disabled": false,
"having": [],
"stepInterval": 60,
"limit": null,
"orderBy": [],
"groupBy": [
{
"key": "serviceName",
"dataType": "string",
"type": "tag",
"isColumn": true
}
],
"legend": "{{serviceName}}",
"reduceTo": "sum"
}
],
"queryFormulas": []
},
"queryType": "builder"
},
"fillSpans": false,
"yAxisUnit": "none"
},
{
"id": "user-sessions",
"title": "User Sessions",
"description": "Total user sessions by service",
"isStacked": false,
"nullZeroValues": "zero",
"opacity": "1",
"panelTypes": "graph",
"query": {
"builder": {
"queryData": [
{
"dataSource": "traces",
"queryName": "A",
"aggregateOperator": "count",
"aggregateAttribute": {
"key": "session.id",
"dataType": "string",
"type": "tag",
"isColumn": true
},
"timeAggregation": "count",
"spaceAggregation": "sum",
"functions": [],
"filters": {
"items": [
{
"key": {
"key": "serviceName",
"dataType": "string",
"type": "tag",
"isColumn": true
},
"op": "=",
"value": "{{.service}}"
},
{
"key": {
"key": "span.name",
"dataType": "string",
"type": "tag",
"isColumn": true
},
"op": "=",
"value": "user_session"
}
],
"op": "AND"
},
"expression": "A",
"disabled": false,
"having": [],
"stepInterval": 60,
"limit": null,
"orderBy": [],
"groupBy": [
{
"key": "serviceName",
"dataType": "string",
"type": "tag",
"isColumn": true
}
],
"legend": "{{serviceName}}",
"reduceTo": "sum"
}
],
"queryFormulas": []
},
"queryType": "builder"
},
"fillSpans": false,
"yAxisUnit": "none"
},
{
"id": "user-actions",
"title": "User Actions",
"description": "Total user actions by service",
"isStacked": false,
"nullZeroValues": "zero",
"opacity": "1",
"panelTypes": "graph",
"query": {
"builder": {
"queryData": [
{
"dataSource": "traces",
"queryName": "A",
"aggregateOperator": "count",
"aggregateAttribute": {
"key": "user.action",
"dataType": "string",
"type": "tag",
"isColumn": true
},
"timeAggregation": "count",
"spaceAggregation": "sum",
"functions": [],
"filters": {
"items": [
{
"key": {
"key": "serviceName",
"dataType": "string",
"type": "tag",
"isColumn": true
},
"op": "=",
"value": "{{.service}}"
},
{
"key": {
"key": "span.name",
"dataType": "string",
"type": "tag",
"isColumn": true
},
"op": "=",
"value": "user_action"
}
],
"op": "AND"
},
"expression": "A",
"disabled": false,
"having": [],
"stepInterval": 60,
"limit": null,
"orderBy": [],
"groupBy": [
{
"key": "serviceName",
"dataType": "string",
"type": "tag",
"isColumn": true
}
],
"legend": "{{serviceName}}",
"reduceTo": "sum"
}
],
"queryFormulas": []
},
"queryType": "builder"
},
"fillSpans": false,
"yAxisUnit": "none"
},
{
"id": "page-views",
"title": "Page Views",
"description": "Total page views by service",
"isStacked": false,
"nullZeroValues": "zero",
"opacity": "1",
"panelTypes": "graph",
"query": {
"builder": {
"queryData": [
{
"dataSource": "traces",
"queryName": "A",
"aggregateOperator": "count",
"aggregateAttribute": {
"key": "page.path",
"dataType": "string",
"type": "tag",
"isColumn": true
},
"timeAggregation": "count",
"spaceAggregation": "sum",
"functions": [],
"filters": {
"items": [
{
"key": {
"key": "serviceName",
"dataType": "string",
"type": "tag",
"isColumn": true
},
"op": "=",
"value": "{{.service}}"
},
{
"key": {
"key": "span.name",
"dataType": "string",
"type": "tag",
"isColumn": true
},
"op": "=",
"value": "page_view"
}
],
"op": "AND"
},
"expression": "A",
"disabled": false,
"having": [],
"stepInterval": 60,
"limit": null,
"orderBy": [],
"groupBy": [
{
"key": "serviceName",
"dataType": "string",
"type": "tag",
"isColumn": true
}
],
"legend": "{{serviceName}}",
"reduceTo": "sum"
}
],
"queryFormulas": []
},
"queryType": "builder"
},
"fillSpans": false,
"yAxisUnit": "none"
},
{
"id": "geo-visitors",
"title": "Geolocation Visitors",
"description": "Number of visitors who shared location data",
"isStacked": false,
"nullZeroValues": "zero",
"opacity": "1",
"panelTypes": "value",
"query": {
"builder": {
"queryData": [
{
"dataSource": "traces",
"queryName": "A",
"aggregateOperator": "count",
"aggregateAttribute": {
"key": "user.id",
"dataType": "string",
"type": "tag",
"isColumn": true
},
"timeAggregation": "count",
"spaceAggregation": "sum",
"functions": [],
"filters": {
"items": [
{
"key": {
"key": "serviceName",
"dataType": "string",
"type": "tag",
"isColumn": true
},
"op": "=",
"value": "{{.service}}"
},
{
"key": {
"key": "span.name",
"dataType": "string",
"type": "tag",
"isColumn": true
},
"op": "=",
"value": "user_location"
}
],
"op": "AND"
},
"expression": "A",
"disabled": false,
"having": [],
"stepInterval": 60,
"limit": null,
"orderBy": [],
"groupBy": [],
"legend": "Visitors with Location Data (See GEOLOCATION_VISUALIZATION_GUIDE.md for map integration)",
"reduceTo": "sum"
}
],
"queryFormulas": []
},
"queryType": "builder"
},
"fillSpans": false,
"yAxisUnit": "none"
}
]
}

View File

@@ -0,0 +1,392 @@
#!/bin/bash
# ============================================================================
# SigNoz Deployment Script for Bakery IA
# ============================================================================
# This script deploys SigNoz monitoring stack using Helm
# Supports both development and production environments
# ============================================================================
set -e
# Color codes for output (ANSI-C quoting so plain echo renders the escapes)
RED=$'\033[0;31m'
GREEN=$'\033[0;32m'
YELLOW=$'\033[1;33m'
BLUE=$'\033[0;34m'
NC=$'\033[0m' # No Color
# Function to display help
show_help() {
echo "Usage: $0 [OPTIONS] ENVIRONMENT"
echo ""
echo "Deploy SigNoz monitoring stack for Bakery IA"
echo ""
echo "Arguments:
ENVIRONMENT Environment to deploy to (dev|prod)"
echo ""
echo "Options:
-h, --help Show this help message
-d, --dry-run Dry run - show what would be done without actually deploying
-u, --upgrade Upgrade existing deployment
-r, --remove Remove/Uninstall SigNoz deployment
-n, --namespace NAMESPACE Specify namespace (default: bakery-ia)"
echo ""
echo "Examples:
$0 dev # Deploy to development
$0 prod # Deploy to production
$0 --upgrade prod # Upgrade production deployment
$0 --remove dev # Remove development deployment"
echo ""
echo "Docker Hub Authentication:"
echo " This script automatically creates a Docker Hub secret for image pulls."
echo " Provide credentials via environment variables (recommended):"
echo " export DOCKERHUB_USERNAME='your-username'"
echo " export DOCKERHUB_PASSWORD='your-personal-access-token'"
echo " Or ensure you're logged in with Docker CLI:"
echo " docker login"
}
# Parse command line arguments
DRY_RUN=false
UPGRADE=false
REMOVE=false
NAMESPACE="bakery-ia"
while [[ $# -gt 0 ]]; do
case $1 in
-h|--help)
show_help
exit 0
;;
-d|--dry-run)
DRY_RUN=true
shift
;;
-u|--upgrade)
UPGRADE=true
shift
;;
-r|--remove)
REMOVE=true
shift
;;
-n|--namespace)
NAMESPACE="$2"
shift 2
;;
dev|prod)
ENVIRONMENT="$1"
shift
;;
*)
echo "Unknown argument: $1"
show_help
exit 1
;;
esac
done
# Validate environment
if [[ -z "$ENVIRONMENT" ]]; then
echo "Error: Environment not specified. Use 'dev' or 'prod'."
show_help
exit 1
fi
if [[ "$ENVIRONMENT" != "dev" && "$ENVIRONMENT" != "prod" ]]; then
echo "Error: Invalid environment. Use 'dev' or 'prod'."
exit 1
fi
# Function to check if Helm is installed
check_helm() {
if ! command -v helm &> /dev/null; then
echo "${RED}Error: Helm is not installed. Please install Helm first.${NC}"
echo "Installation instructions: https://helm.sh/docs/intro/install/"
exit 1
fi
}
# Function to check if kubectl is configured
check_kubectl() {
if ! kubectl cluster-info &> /dev/null; then
echo "${RED}Error: kubectl is not configured or cannot connect to cluster.${NC}"
echo "Please ensure you have access to a Kubernetes cluster."
exit 1
fi
}
# Function to check if namespace exists, create if not
ensure_namespace() {
if ! kubectl get namespace "$NAMESPACE" &> /dev/null; then
echo "${BLUE}Creating namespace $NAMESPACE...${NC}"
if [[ "$DRY_RUN" == true ]]; then
echo " (dry-run) Would create namespace $NAMESPACE"
else
kubectl create namespace "$NAMESPACE"
echo "${GREEN}Namespace $NAMESPACE created.${NC}"
fi
else
echo "${BLUE}Namespace $NAMESPACE already exists.${NC}"
fi
}
# Function to create Docker Hub secret for image pulls
create_dockerhub_secret() {
echo "${BLUE}Setting up Docker Hub image pull secret...${NC}"
if [[ "$DRY_RUN" == true ]]; then
echo " (dry-run) Would create Docker Hub secret in namespace $NAMESPACE"
return
fi
# Check if secret already exists
if kubectl get secret dockerhub-creds -n "$NAMESPACE" &> /dev/null; then
echo "${GREEN}Docker Hub secret already exists in namespace $NAMESPACE.${NC}"
return
fi
# Check if Docker Hub credentials are available
if [[ -n "$DOCKERHUB_USERNAME" ]] && [[ -n "$DOCKERHUB_PASSWORD" ]]; then
echo "${BLUE}Found DOCKERHUB_USERNAME and DOCKERHUB_PASSWORD environment variables${NC}"
kubectl create secret docker-registry dockerhub-creds \
--docker-server=https://index.docker.io/v1/ \
--docker-username="$DOCKERHUB_USERNAME" \
--docker-password="$DOCKERHUB_PASSWORD" \
--docker-email="${DOCKERHUB_EMAIL:-noreply@bakery-ia.local}" \
-n "$NAMESPACE"
echo "${GREEN}Docker Hub secret created successfully.${NC}"
elif [[ -f "$HOME/.docker/config.json" ]]; then
echo "${BLUE}Attempting to use Docker CLI credentials...${NC}"
# Try to extract credentials from Docker config
if grep -q "credsStore" "$HOME/.docker/config.json"; then
echo "${YELLOW}Docker is using a credential store. Please set environment variables:${NC}"
echo " export DOCKERHUB_USERNAME='your-username'"
echo " export DOCKERHUB_PASSWORD='your-password-or-token'"
echo "${YELLOW}Continuing without Docker Hub authentication...${NC}"
return
fi
# Try to extract from base64 encoded auth
        AUTH=$(jq -r '.auths["https://index.docker.io/v1/"].auth // empty' "$HOME/.docker/config.json" 2>/dev/null)
if [[ -n "$AUTH" ]]; then
echo "${GREEN}Found Docker Hub credentials in Docker config${NC}"
local DOCKER_USERNAME=$(echo "$AUTH" | base64 -d | cut -d: -f1)
local DOCKER_PASSWORD=$(echo "$AUTH" | base64 -d | cut -d: -f2-)
kubectl create secret docker-registry dockerhub-creds \
--docker-server=https://index.docker.io/v1/ \
--docker-username="$DOCKER_USERNAME" \
--docker-password="$DOCKER_PASSWORD" \
--docker-email="${DOCKERHUB_EMAIL:-noreply@bakery-ia.local}" \
-n "$NAMESPACE"
echo "${GREEN}Docker Hub secret created successfully.${NC}"
else
echo "${YELLOW}Could not find Docker Hub credentials${NC}"
echo "${YELLOW}To enable automatic Docker Hub authentication:${NC}"
echo " 1. Run 'docker login', OR"
echo " 2. Set environment variables:"
echo " export DOCKERHUB_USERNAME='your-username'"
echo " export DOCKERHUB_PASSWORD='your-password-or-token'"
echo "${YELLOW}Continuing without Docker Hub authentication...${NC}"
fi
else
echo "${YELLOW}Docker Hub credentials not found${NC}"
echo "${YELLOW}To enable automatic Docker Hub authentication:${NC}"
echo " 1. Run 'docker login', OR"
echo " 2. Set environment variables:"
echo " export DOCKERHUB_USERNAME='your-username'"
echo " export DOCKERHUB_PASSWORD='your-password-or-token'"
echo "${YELLOW}Continuing without Docker Hub authentication...${NC}"
fi
echo ""
}
# Function to add and update Helm repository
setup_helm_repo() {
echo "${BLUE}Setting up SigNoz Helm repository...${NC}"
if [[ "$DRY_RUN" == true ]]; then
echo " (dry-run) Would add SigNoz Helm repository"
return
fi
# Add SigNoz Helm repository
if helm repo list | grep -q "^signoz"; then
echo "${BLUE}SigNoz repository already added, updating...${NC}"
helm repo update signoz
else
echo "${BLUE}Adding SigNoz Helm repository...${NC}"
helm repo add signoz https://charts.signoz.io
helm repo update
fi
echo "${GREEN}Helm repository ready.${NC}"
echo ""
}
# Function to deploy SigNoz
deploy_signoz() {
local values_file="infrastructure/helm/signoz-values-$ENVIRONMENT.yaml"
if [[ ! -f "$values_file" ]]; then
echo "${RED}Error: Values file $values_file not found.${NC}"
exit 1
fi
echo "${BLUE}Deploying SigNoz to $ENVIRONMENT environment...${NC}"
echo " Using values file: $values_file"
echo " Target namespace: $NAMESPACE"
echo " Chart version: Latest from signoz/signoz"
if [[ "$DRY_RUN" == true ]]; then
echo " (dry-run) Would deploy SigNoz with:"
echo " helm upgrade --install signoz signoz/signoz -n $NAMESPACE -f $values_file --wait --timeout 15m"
return
fi
# Use upgrade --install to handle both new installations and upgrades
echo "${BLUE}Installing/Upgrading SigNoz...${NC}"
echo "This may take 10-15 minutes..."
helm upgrade --install signoz signoz/signoz \
-n "$NAMESPACE" \
-f "$values_file" \
--wait \
--timeout 15m \
--create-namespace
echo "${GREEN}SigNoz deployment completed.${NC}"
echo ""
# Show deployment status
show_deployment_status
}
# Function to remove SigNoz
remove_signoz() {
echo "${BLUE}Removing SigNoz deployment from namespace $NAMESPACE...${NC}"
if [[ "$DRY_RUN" == true ]]; then
echo " (dry-run) Would remove SigNoz deployment"
return
fi
if helm list -n "$NAMESPACE" | grep -q signoz; then
helm uninstall signoz -n "$NAMESPACE" --wait
echo "${GREEN}SigNoz deployment removed.${NC}"
# Optionally remove PVCs (commented out by default for safety)
echo ""
echo "${YELLOW}Note: Persistent Volume Claims (PVCs) were NOT deleted.${NC}"
echo "To delete PVCs and all data, run:"
echo " kubectl delete pvc -n $NAMESPACE -l app.kubernetes.io/instance=signoz"
else
echo "${YELLOW}No SigNoz deployment found in namespace $NAMESPACE.${NC}"
fi
}
# Function to show deployment status
show_deployment_status() {
echo ""
echo "${BLUE}=== SigNoz Deployment Status ===${NC}"
echo ""
# Get pods
echo "Pods:"
kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/instance=signoz
echo ""
# Get services
echo "Services:"
kubectl get svc -n "$NAMESPACE" -l app.kubernetes.io/instance=signoz
echo ""
# Get ingress
echo "Ingress:"
kubectl get ingress -n "$NAMESPACE" -l app.kubernetes.io/instance=signoz
echo ""
# Show access information
show_access_info
}
# Function to show access information
show_access_info() {
echo "${BLUE}=== Access Information ===${NC}"
if [[ "$ENVIRONMENT" == "dev" ]]; then
echo "SigNoz UI: http://monitoring.bakery-ia.local"
echo ""
echo "OpenTelemetry Collector Endpoints (from within cluster):"
echo " gRPC: signoz-otel-collector.$NAMESPACE.svc.cluster.local:4317"
echo " HTTP: signoz-otel-collector.$NAMESPACE.svc.cluster.local:4318"
echo ""
echo "Port-forward for local access:"
echo " kubectl port-forward -n $NAMESPACE svc/signoz 8080:8080"
echo " kubectl port-forward -n $NAMESPACE svc/signoz-otel-collector 4317:4317"
echo " kubectl port-forward -n $NAMESPACE svc/signoz-otel-collector 4318:4318"
else
echo "SigNoz UI: https://monitoring.bakewise.ai"
echo ""
echo "OpenTelemetry Collector Endpoints (from within cluster):"
echo " gRPC: signoz-otel-collector.$NAMESPACE.svc.cluster.local:4317"
echo " HTTP: signoz-otel-collector.$NAMESPACE.svc.cluster.local:4318"
echo ""
echo "External endpoints (if exposed):"
echo " Check ingress configuration for external OTLP endpoints"
fi
echo ""
echo "Default credentials:"
echo " Username: admin@example.com"
echo " Password: admin"
echo ""
echo "Note: Change default password after first login!"
echo ""
}
# Main execution
main() {
echo "${BLUE}"
echo "=========================================="
echo "🚀 SigNoz Deployment for Bakery IA"
echo "=========================================="
echo "${NC}"
# Check prerequisites
check_helm
check_kubectl
    # Handle removal first so --remove never creates a missing namespace
    if [[ "$REMOVE" == true ]]; then
        remove_signoz
        exit 0
    fi
    # Ensure namespace
    ensure_namespace
# Setup Helm repository
setup_helm_repo
# Create Docker Hub secret for image pulls
create_dockerhub_secret
# Deploy SigNoz
deploy_signoz
echo "${GREEN}"
echo "=========================================="
echo "✅ SigNoz deployment completed!"
echo "=========================================="
echo "${NC}"
}
# Run main function
main
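The collector endpoints printed by `show_access_info` are consumed by services through the standard OpenTelemetry environment variables. A minimal sketch, assuming the default release name `signoz` and the `bakery-ia` namespace this script uses (the service name is a hypothetical example):

```shell
# Standard OTel SDK env vars; endpoint and namespace match this script's defaults
export OTEL_EXPORTER_OTLP_ENDPOINT="http://signoz-otel-collector.bakery-ia.svc.cluster.local:4318"
export OTEL_EXPORTER_OTLP_PROTOCOL="http/protobuf"  # use grpc with port 4317 for the gRPC endpoint
export OTEL_SERVICE_NAME="auth-service"             # hypothetical service name
echo "Exporting telemetry to $OTEL_EXPORTER_OTLP_ENDPOINT as $OTEL_SERVICE_NAME"
```

Any SDK that honors the OTLP environment-variable convention will pick these up without code changes.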

View File

@@ -0,0 +1,141 @@
#!/bin/bash
# Generate Test Traffic to Services
# This script generates API calls to verify telemetry data collection
set -e
NAMESPACE="bakery-ia"
GREEN='\033[0;32m'
BLUE='\033[0;34m'
YELLOW='\033[1;33m'
NC='\033[0m'
echo -e "${BLUE}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
echo -e "${BLUE} Generating Test Traffic for SigNoz Verification${NC}"
echo -e "${BLUE}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
echo ""
# Check if ingress is accessible
echo -e "${BLUE}Step 1: Verifying Gateway Access${NC}"
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
GATEWAY_POD=$(kubectl get pods -n $NAMESPACE -l app=gateway --field-selector=status.phase=Running -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || true)
if [[ -z "$GATEWAY_POD" ]]; then
echo -e "${YELLOW}⚠ Gateway pod not running. Starting port-forward...${NC}"
# Port forward in background
kubectl port-forward -n $NAMESPACE svc/gateway-service 8000:8000 &
PORT_FORWARD_PID=$!
sleep 3
API_URL="http://localhost:8000"
else
echo -e "${GREEN}✓ Gateway is running: $GATEWAY_POD${NC}"
# Use internal service
API_URL="http://gateway-service.$NAMESPACE.svc.cluster.local:8000"
fi
echo ""
# Function to make API call from inside cluster
make_request() {
local endpoint=$1
local description=$2
echo -e "${BLUE}→ Testing: $description${NC}"
echo " Endpoint: $endpoint"
if [[ -n "$GATEWAY_POD" ]]; then
# Make request from inside the gateway pod
RESPONSE=$(kubectl exec -n $NAMESPACE $GATEWAY_POD -- curl -s -w "\nHTTP_CODE:%{http_code}" "$API_URL$endpoint" 2>/dev/null || echo "FAILED")
else
# Make request from localhost
RESPONSE=$(curl -s -w "\nHTTP_CODE:%{http_code}" "$API_URL$endpoint" 2>/dev/null || echo "FAILED")
fi
if [[ "$RESPONSE" == "FAILED" ]]; then
echo -e " ${YELLOW}⚠ Request failed${NC}"
else
HTTP_CODE=$(echo "$RESPONSE" | grep "HTTP_CODE" | cut -d: -f2)
if [[ "$HTTP_CODE" == "200" ]] || [[ "$HTTP_CODE" == "401" ]] || [[ "$HTTP_CODE" == "404" ]]; then
echo -e " ${GREEN}✓ Response received (HTTP $HTTP_CODE)${NC}"
else
echo -e " ${YELLOW}⚠ Unexpected response (HTTP $HTTP_CODE)${NC}"
fi
fi
echo ""
sleep 1
}
# Generate traffic to various endpoints
echo -e "${BLUE}Step 2: Generating Traffic to Services${NC}"
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
echo ""
# Health checks (should generate traces)
make_request "/health" "Gateway Health Check"
make_request "/api/health" "API Health Check"
# Auth service endpoints
make_request "/api/auth/health" "Auth Service Health"
# Tenant service endpoints
make_request "/api/tenants/health" "Tenant Service Health"
# Inventory service endpoints
make_request "/api/inventory/health" "Inventory Service Health"
# Orders service endpoints
make_request "/api/orders/health" "Orders Service Health"
# Forecasting service endpoints
make_request "/api/forecasting/health" "Forecasting Service Health"
echo -e "${BLUE}Step 3: Checking Service Logs for Telemetry${NC}"
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
echo ""
# Check a few service pods for tracing logs
SERVICES=("auth-service" "inventory-service" "gateway")
for service in "${SERVICES[@]}"; do
    POD=$(kubectl get pods -n $NAMESPACE -l app=$service --field-selector=status.phase=Running -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || true)
if [[ -n "$POD" ]]; then
echo -e "${BLUE}Checking $service ($POD)...${NC}"
TRACING_LOG=$(kubectl logs -n $NAMESPACE $POD --tail=100 2>/dev/null | grep -i "tracing\|otel" | head -n 2 || echo "")
if [[ -n "$TRACING_LOG" ]]; then
echo -e "${GREEN}✓ Tracing configured:${NC}"
echo "$TRACING_LOG" | sed 's/^/ /'
else
echo -e "${YELLOW}⚠ No tracing logs found${NC}"
fi
echo ""
fi
done
# Wait for data to be processed
echo -e "${BLUE}Step 4: Waiting for Data Processing${NC}"
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
echo "Waiting 30 seconds for telemetry data to be processed..."
for i in {30..1}; do
echo -ne "\r ${i} seconds remaining..."
sleep 1
done
echo -e "\n"
# Cleanup port-forward if started
if [[ -n "$PORT_FORWARD_PID" ]]; then
kill $PORT_FORWARD_PID 2>/dev/null || true
fi
echo -e "${GREEN}✓ Test traffic generation complete!${NC}"
echo ""
echo -e "${BLUE}Next Steps:${NC}"
echo "1. Run the verification script to check for collected data:"
echo " ./infrastructure/helm/verify-signoz-telemetry.sh"
echo ""
echo "2. Access SigNoz UI to visualize the data:"
echo " https://monitoring.bakery-ia.local"
echo " or"
echo " kubectl port-forward -n bakery-ia svc/signoz 3301:8080"
echo " Then go to: http://localhost:3301"
echo ""
echo -e "${BLUE}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"

View File

@@ -0,0 +1,175 @@
#!/bin/bash
# SigNoz Dashboard Importer for Bakery IA
# This script imports all SigNoz dashboards into your SigNoz instance
# Configuration
SIGNOZ_HOST="localhost"
SIGNOZ_PORT="3301"
SIGNOZ_API_KEY="" # Add your API key if authentication is required
DASHBOARDS_DIR="infrastructure/signoz/dashboards"
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color
# Function to display help
show_help() {
echo "Usage: $0 [options]"
echo ""
echo "Options:
-h, --host SigNoz host (default: localhost)
-p, --port SigNoz port (default: 3301)
-k, --api-key SigNoz API key (if required)
-d, --dir Dashboards directory (default: infrastructure/signoz/dashboards)
  --help           Show this help message"
echo ""
echo "Example:
$0 --host signoz.example.com --port 3301 --api-key your-api-key"
}
# Parse command line arguments
while [[ $# -gt 0 ]]; do
case $1 in
-h|--host)
SIGNOZ_HOST="$2"
shift 2
;;
-p|--port)
SIGNOZ_PORT="$2"
shift 2
;;
-k|--api-key)
SIGNOZ_API_KEY="$2"
shift 2
;;
-d|--dir)
DASHBOARDS_DIR="$2"
shift 2
;;
--help)
show_help
exit 0
;;
*)
echo "Unknown option: $1"
show_help
exit 1
;;
esac
done
# Check if dashboards directory exists
if [ ! -d "$DASHBOARDS_DIR" ]; then
echo -e "${RED}Error: Dashboards directory not found: $DASHBOARDS_DIR${NC}"
exit 1
fi
# Check if jq is installed for JSON validation
if ! command -v jq &> /dev/null; then
echo -e "${YELLOW}Warning: jq not found. Skipping JSON validation.${NC}"
VALIDATE_JSON=false
else
VALIDATE_JSON=true
fi
# Function to validate JSON
validate_json() {
local file="$1"
if [ "$VALIDATE_JSON" = true ]; then
if ! jq empty "$file" &> /dev/null; then
echo -e "${RED}Error: Invalid JSON in file: $file${NC}"
return 1
fi
fi
return 0
}
# Function to import a single dashboard
import_dashboard() {
local file="$1"
local filename=$(basename "$file")
local dashboard_name=$(jq -r '.name' "$file" 2>/dev/null || echo "Unknown")
echo -e "${BLUE}Importing dashboard: $dashboard_name ($filename)${NC}"
    # Build the curl invocation as an array to avoid eval/quoting pitfalls
    local curl_args=(-s -X POST "http://$SIGNOZ_HOST:$SIGNOZ_PORT/api/v1/dashboards/import")
    if [ -n "$SIGNOZ_API_KEY" ]; then
        curl_args+=(-H "Authorization: Bearer $SIGNOZ_API_KEY")
    fi
    curl_args+=(-H "Content-Type: application/json" -d @"$file")
    # Execute import
    local response
    response=$(curl "${curl_args[@]}")
# Check response
if echo "$response" | grep -q "success"; then
echo -e "${GREEN}✓ Successfully imported: $dashboard_name${NC}"
return 0
else
echo -e "${RED}✗ Failed to import: $dashboard_name${NC}"
echo "Response: $response"
return 1
fi
}
# Main import process
echo -e "${YELLOW}=== SigNoz Dashboard Importer for Bakery IA ===${NC}"
echo -e "${BLUE}Configuration:${NC}"
echo " Host: $SIGNOZ_HOST"
echo " Port: $SIGNOZ_PORT"
echo " Dashboards Directory: $DASHBOARDS_DIR"
if [ -n "$SIGNOZ_API_KEY" ]; then
echo " API Key: ******** (set)"
else
echo " API Key: Not configured"
fi
echo ""
# Count dashboards
DASHBOARD_COUNT=$(find "$DASHBOARDS_DIR" -maxdepth 1 -name "*.json" | wc -l)
echo -e "${BLUE}Found $DASHBOARD_COUNT dashboards to import${NC}"
echo ""
# Import each dashboard
SUCCESS_COUNT=0
FAILURE_COUNT=0
for file in "$DASHBOARDS_DIR"/*.json; do
if [ -f "$file" ]; then
# Validate JSON
if validate_json "$file"; then
if import_dashboard "$file"; then
((SUCCESS_COUNT++))
else
((FAILURE_COUNT++))
fi
else
((FAILURE_COUNT++))
fi
echo ""
fi
done
# Summary
echo -e "${YELLOW}=== Import Summary ===${NC}"
echo -e "${GREEN}Successfully imported: $SUCCESS_COUNT dashboards${NC}"
if [ $FAILURE_COUNT -gt 0 ]; then
echo -e "${RED}Failed to import: $FAILURE_COUNT dashboards${NC}"
fi
echo ""
if [ $FAILURE_COUNT -eq 0 ]; then
echo -e "${GREEN}All dashboards imported successfully!${NC}"
echo "You can now access them in your SigNoz UI at:"
echo "http://$SIGNOZ_HOST:$SIGNOZ_PORT/dashboards"
else
echo -e "${YELLOW}Some dashboards failed to import. Check the errors above.${NC}"
exit 1
fi

View File

@@ -0,0 +1,853 @@
# SigNoz Helm Chart Values - Development Environment
# Optimized for local development with minimal resource usage
# DEPLOYED IN bakery-ia NAMESPACE - Ingress managed by bakery-ingress
#
# Official Chart: https://github.com/SigNoz/charts
# Install Command: helm install signoz signoz/signoz -n bakery-ia -f signoz-values-dev.yaml
global:
storageClass: "standard"
clusterName: "bakery-ia-dev"
domain: "monitoring.bakery-ia.local"
# Docker Hub credentials - applied to all sub-charts (including Zookeeper, ClickHouse, etc)
imagePullSecrets:
- dockerhub-creds
# Docker Hub credentials for pulling images (root level for SigNoz components)
imagePullSecrets:
- dockerhub-creds
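# The 'dockerhub-creds' pull secret referenced above must already exist in the
# bakery-ia namespace before installing the chart. A sketch of creating it
# (assumes DOCKERHUB_USERNAME / DOCKERHUB_PASSWORD are exported, as described
# in the README):
#   kubectl create secret docker-registry dockerhub-creds \
#     --namespace bakery-ia \
#     --docker-username="$DOCKERHUB_USERNAME" \
#     --docker-password="$DOCKERHUB_PASSWORD"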
# SigNoz Main Component (includes frontend and query service)
signoz:
replicaCount: 1
service:
type: ClusterIP
port: 8080
# DISABLE built-in ingress - using unified bakery-ingress instead
# Route configured in infrastructure/kubernetes/overlays/dev/dev-ingress.yaml
ingress:
enabled: false
resources:
requests:
cpu: 100m # Combined frontend + query service
memory: 256Mi
limits:
cpu: 1000m
memory: 1Gi
# Environment variables (new format - replaces configVars)
env:
signoz_telemetrystore_provider: "clickhouse"
dot_metrics_enabled: "true"
signoz_emailing_enabled: "false"
signoz_alertmanager_provider: "signoz"
# Retention for dev (7 days)
signoz_traces_ttl_duration_hrs: "168"
signoz_metrics_ttl_duration_hrs: "168"
signoz_logs_ttl_duration_hrs: "168"
# OpAMP Server Configuration - DISABLED for dev (causes gRPC instability)
signoz_opamp_server_enabled: "false"
# signoz_opamp_server_endpoint: "0.0.0.0:4320"
persistence:
enabled: true
size: 5Gi
storageClass: "standard"
# AlertManager Configuration
alertmanager:
replicaCount: 1
image:
repository: signoz/alertmanager
tag: 0.23.5
pullPolicy: IfNotPresent
service:
type: ClusterIP
port: 9093
resources:
requests:
cpu: 25m # Reduced for local dev
memory: 64Mi # Reduced for local dev
limits:
cpu: 200m
memory: 256Mi
persistence:
enabled: true
size: 2Gi
storageClass: "standard"
config:
global:
resolve_timeout: 5m
route:
group_by: ['alertname', 'cluster', 'service']
group_wait: 10s
group_interval: 10s
repeat_interval: 12h
receiver: 'default'
receivers:
- name: 'default'
# Add email, slack, webhook configs here
# ClickHouse Configuration - Time Series Database
# Minimal resources for local development on constrained Kind cluster
clickhouse:
enabled: true
installCustomStorageClass: false
image:
registry: docker.io
repository: clickhouse/clickhouse-server
tag: 25.5.6 # Official recommended version
# Reduce ClickHouse resource requests for local dev
clickhouse:
resources:
requests:
cpu: 200m # Reduced from default 500m
memory: 512Mi
limits:
cpu: 1000m
memory: 1Gi
persistence:
enabled: true
size: 20Gi
# Zookeeper Configuration (required by ClickHouse)
zookeeper:
enabled: true
replicaCount: 1 # Single replica for dev
image:
tag: 3.7.1 # Official recommended version
resources:
requests:
cpu: 100m
memory: 256Mi
limits:
cpu: 500m
memory: 512Mi
persistence:
enabled: true
size: 5Gi
# OpenTelemetry Collector - Data ingestion endpoint for all telemetry
otelCollector:
enabled: true
replicaCount: 1
image:
repository: signoz/signoz-otel-collector
tag: v0.129.12 # Latest recommended version
# OpAMP Configuration - DISABLED for development
# OpAMP is designed for production with remote config management
# In dev, it causes gRPC instability and collector reloads
# We use static configuration instead
# Init containers for the Otel Collector pod
initContainers:
fix-postgres-tls:
enabled: true
image:
registry: docker.io
repository: busybox
tag: 1.35
pullPolicy: IfNotPresent
command:
- sh
- -c
- |
echo "Fixing PostgreSQL TLS file permissions..."
cp /etc/postgres-tls-source/* /etc/postgres-tls/
chmod 600 /etc/postgres-tls/server-key.pem
chmod 644 /etc/postgres-tls/server-cert.pem
chmod 644 /etc/postgres-tls/ca-cert.pem
echo "PostgreSQL TLS permissions fixed"
volumeMounts:
- name: postgres-tls-source
mountPath: /etc/postgres-tls-source
readOnly: true
- name: postgres-tls-fixed
mountPath: /etc/postgres-tls
readOnly: false
# Service configuration - expose both gRPC and HTTP endpoints
service:
type: ClusterIP
ports:
# gRPC receivers
- name: otlp-grpc
port: 4317
targetPort: 4317
protocol: TCP
# HTTP receivers
- name: otlp-http
port: 4318
targetPort: 4318
protocol: TCP
# Prometheus remote write
- name: prometheus
port: 8889
targetPort: 8889
protocol: TCP
# Metrics
- name: metrics
port: 8888
targetPort: 8888
protocol: TCP
resources:
requests:
cpu: 50m # Reduced from 100m
memory: 128Mi # Reduced from 256Mi
limits:
cpu: 500m
memory: 512Mi
  # Additional environment variables for receivers
  # NOTE: plaintext credentials are tolerable for local dev only; do not reuse
  # these values outside the local cluster
additionalEnvs:
POSTGRES_MONITOR_USER: "monitoring"
POSTGRES_MONITOR_PASSWORD: "monitoring_369f9c001f242b07ef9e2826e17169ca"
REDIS_PASSWORD: "OxdmdJjdVNXp37MNC2IFoMnTpfGGFv1k"
RABBITMQ_USER: "bakery"
RABBITMQ_PASSWORD: "forecast123"
# Mount TLS certificates for secure connections
extraVolumes:
- name: redis-tls
secret:
secretName: redis-tls-secret
- name: postgres-tls
secret:
secretName: postgres-tls
- name: postgres-tls-fixed
emptyDir: {}
- name: varlogpods
hostPath:
path: /var/log/pods
extraVolumeMounts:
- name: redis-tls
mountPath: /etc/redis-tls
readOnly: true
- name: postgres-tls
mountPath: /etc/postgres-tls-source
readOnly: true
- name: postgres-tls-fixed
mountPath: /etc/postgres-tls
readOnly: false
- name: varlogpods
mountPath: /var/log/pods
readOnly: true
# Disable OpAMP - use static configuration only
# Use 'args' instead of 'extraArgs' to completely override the command
command:
name: /signoz-otel-collector
args:
- --config=/conf/otel-collector-config.yaml
- --feature-gates=-pkg.translator.prometheus.NormalizeName
# OpenTelemetry Collector configuration
config:
# Connectors - bridge between pipelines
connectors:
signozmeter:
dimensions:
- name: service.name
- name: deployment.environment
- name: host.name
metrics_flush_interval: 1h
receivers:
# OTLP receivers for traces, metrics, and logs from applications
# All application telemetry is pushed via OTLP protocol
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
cors:
allowed_origins:
- "*"
# Filelog receiver for Kubernetes pod logs
# Collects container stdout/stderr from /var/log/pods
filelog:
include:
- /var/log/pods/*/*/*.log
exclude:
# Exclude SigNoz's own logs to avoid recursive collection
- /var/log/pods/bakery-ia_signoz-*/*/*.log
include_file_path: true
include_file_name: false
operators:
# Parse CRI-O / containerd log format
- type: regex_parser
regex: '^(?P<time>[^ ]+) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) (?P<log>.*)$'
timestamp:
parse_from: attributes.time
layout: '%Y-%m-%dT%H:%M:%S.%LZ'
# Fix timestamp parsing - extract from the parsed time field
- type: move
from: attributes.time
to: attributes.timestamp
# Extract Kubernetes metadata from file path
- type: regex_parser
id: extract_metadata_from_filepath
regex: '^.*\/(?P<namespace>[^_]+)_(?P<pod_name>[^_]+)_(?P<uid>[^\/]+)\/(?P<container_name>[^\._]+)\/(?P<restart_count>\d+)\.log$'
parse_from: attributes["log.file.path"]
# Move metadata to resource attributes
- type: move
from: attributes.namespace
to: resource["k8s.namespace.name"]
- type: move
from: attributes.pod_name
to: resource["k8s.pod.name"]
- type: move
from: attributes.container_name
to: resource["k8s.container.name"]
- type: move
from: attributes.log
to: body
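        # Example containerd/CRI log line matched by the regex_parser above
        # (illustrative, assumed format):
        #   2026-01-19T10:15:30.123456789Z stdout F {"level":"info","msg":"server ready"}
        # yields: time=2026-01-19T10:15:30.123456789Z, stream=stdout, logtag=F,
        # and log={"level":"info","msg":"server ready"}, which the move operators
        # then promote into resource attributes and the log body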
# Kubernetes Cluster Receiver - Collects cluster-level metrics
# Provides information about nodes, namespaces, pods, and other cluster resources
k8s_cluster:
collection_interval: 30s
node_conditions_to_report:
- Ready
- MemoryPressure
- DiskPressure
- PIDPressure
- NetworkUnavailable
allocatable_types_to_report:
- cpu
- memory
- pods
# PostgreSQL receivers for database metrics
# ENABLED: Monitor users configured and credentials stored in secrets
# Collects metrics directly from PostgreSQL databases with proper TLS
postgresql/auth:
endpoint: auth-db-service.bakery-ia:5432
username: ${env:POSTGRES_MONITOR_USER}
password: ${env:POSTGRES_MONITOR_PASSWORD}
databases:
- auth_db
collection_interval: 60s
tls:
insecure: false
cert_file: /etc/postgres-tls/server-cert.pem
key_file: /etc/postgres-tls/server-key.pem
ca_file: /etc/postgres-tls/ca-cert.pem
postgresql/inventory:
endpoint: inventory-db-service.bakery-ia:5432
username: ${env:POSTGRES_MONITOR_USER}
password: ${env:POSTGRES_MONITOR_PASSWORD}
databases:
- inventory_db
collection_interval: 60s
tls:
insecure: false
cert_file: /etc/postgres-tls/server-cert.pem
key_file: /etc/postgres-tls/server-key.pem
ca_file: /etc/postgres-tls/ca-cert.pem
postgresql/orders:
endpoint: orders-db-service.bakery-ia:5432
username: ${env:POSTGRES_MONITOR_USER}
password: ${env:POSTGRES_MONITOR_PASSWORD}
databases:
- orders_db
collection_interval: 60s
tls:
insecure: false
cert_file: /etc/postgres-tls/server-cert.pem
key_file: /etc/postgres-tls/server-key.pem
ca_file: /etc/postgres-tls/ca-cert.pem
postgresql/ai-insights:
endpoint: ai-insights-db-service.bakery-ia:5432
username: ${env:POSTGRES_MONITOR_USER}
password: ${env:POSTGRES_MONITOR_PASSWORD}
databases:
- ai_insights_db
collection_interval: 60s
tls:
insecure: false
cert_file: /etc/postgres-tls/server-cert.pem
key_file: /etc/postgres-tls/server-key.pem
ca_file: /etc/postgres-tls/ca-cert.pem
postgresql/alert-processor:
endpoint: alert-processor-db-service.bakery-ia:5432
username: ${env:POSTGRES_MONITOR_USER}
password: ${env:POSTGRES_MONITOR_PASSWORD}
databases:
- alert_processor_db
collection_interval: 60s
tls:
insecure: false
cert_file: /etc/postgres-tls/server-cert.pem
key_file: /etc/postgres-tls/server-key.pem
ca_file: /etc/postgres-tls/ca-cert.pem
postgresql/distribution:
endpoint: distribution-db-service.bakery-ia:5432
username: ${env:POSTGRES_MONITOR_USER}
password: ${env:POSTGRES_MONITOR_PASSWORD}
databases:
- distribution_db
collection_interval: 60s
tls:
insecure: false
cert_file: /etc/postgres-tls/server-cert.pem
key_file: /etc/postgres-tls/server-key.pem
ca_file: /etc/postgres-tls/ca-cert.pem
postgresql/external:
endpoint: external-db-service.bakery-ia:5432
username: ${env:POSTGRES_MONITOR_USER}
password: ${env:POSTGRES_MONITOR_PASSWORD}
databases:
- external_db
collection_interval: 60s
tls:
insecure: false
cert_file: /etc/postgres-tls/server-cert.pem
key_file: /etc/postgres-tls/server-key.pem
ca_file: /etc/postgres-tls/ca-cert.pem
postgresql/forecasting:
endpoint: forecasting-db-service.bakery-ia:5432
username: ${env:POSTGRES_MONITOR_USER}
password: ${env:POSTGRES_MONITOR_PASSWORD}
databases:
- forecasting_db
collection_interval: 60s
tls:
insecure: false
cert_file: /etc/postgres-tls/server-cert.pem
key_file: /etc/postgres-tls/server-key.pem
ca_file: /etc/postgres-tls/ca-cert.pem
postgresql/notification:
endpoint: notification-db-service.bakery-ia:5432
username: ${env:POSTGRES_MONITOR_USER}
password: ${env:POSTGRES_MONITOR_PASSWORD}
databases:
- notification_db
collection_interval: 60s
tls:
insecure: false
cert_file: /etc/postgres-tls/server-cert.pem
key_file: /etc/postgres-tls/server-key.pem
ca_file: /etc/postgres-tls/ca-cert.pem
postgresql/orchestrator:
endpoint: orchestrator-db-service.bakery-ia:5432
username: ${env:POSTGRES_MONITOR_USER}
password: ${env:POSTGRES_MONITOR_PASSWORD}
databases:
- orchestrator_db
collection_interval: 60s
tls:
insecure: false
cert_file: /etc/postgres-tls/server-cert.pem
key_file: /etc/postgres-tls/server-key.pem
ca_file: /etc/postgres-tls/ca-cert.pem
postgresql/pos:
endpoint: pos-db-service.bakery-ia:5432
username: ${env:POSTGRES_MONITOR_USER}
password: ${env:POSTGRES_MONITOR_PASSWORD}
databases:
- pos_db
collection_interval: 60s
tls:
insecure: false
cert_file: /etc/postgres-tls/server-cert.pem
key_file: /etc/postgres-tls/server-key.pem
ca_file: /etc/postgres-tls/ca-cert.pem
postgresql/procurement:
endpoint: procurement-db-service.bakery-ia:5432
username: ${env:POSTGRES_MONITOR_USER}
password: ${env:POSTGRES_MONITOR_PASSWORD}
databases:
- procurement_db
collection_interval: 60s
tls:
insecure: false
cert_file: /etc/postgres-tls/server-cert.pem
key_file: /etc/postgres-tls/server-key.pem
ca_file: /etc/postgres-tls/ca-cert.pem
postgresql/production:
endpoint: production-db-service.bakery-ia:5432
username: ${env:POSTGRES_MONITOR_USER}
password: ${env:POSTGRES_MONITOR_PASSWORD}
databases:
- production_db
collection_interval: 60s
tls:
insecure: false
cert_file: /etc/postgres-tls/server-cert.pem
key_file: /etc/postgres-tls/server-key.pem
ca_file: /etc/postgres-tls/ca-cert.pem
postgresql/recipes:
endpoint: recipes-db-service.bakery-ia:5432
username: ${env:POSTGRES_MONITOR_USER}
password: ${env:POSTGRES_MONITOR_PASSWORD}
databases:
- recipes_db
collection_interval: 60s
tls:
insecure: false
cert_file: /etc/postgres-tls/server-cert.pem
key_file: /etc/postgres-tls/server-key.pem
ca_file: /etc/postgres-tls/ca-cert.pem
postgresql/sales:
endpoint: sales-db-service.bakery-ia:5432
username: ${env:POSTGRES_MONITOR_USER}
password: ${env:POSTGRES_MONITOR_PASSWORD}
databases:
- sales_db
collection_interval: 60s
tls:
insecure: false
cert_file: /etc/postgres-tls/server-cert.pem
key_file: /etc/postgres-tls/server-key.pem
ca_file: /etc/postgres-tls/ca-cert.pem
postgresql/suppliers:
endpoint: suppliers-db-service.bakery-ia:5432
username: ${env:POSTGRES_MONITOR_USER}
password: ${env:POSTGRES_MONITOR_PASSWORD}
databases:
- suppliers_db
collection_interval: 60s
tls:
insecure: false
cert_file: /etc/postgres-tls/server-cert.pem
key_file: /etc/postgres-tls/server-key.pem
ca_file: /etc/postgres-tls/ca-cert.pem
postgresql/tenant:
endpoint: tenant-db-service.bakery-ia:5432
username: ${env:POSTGRES_MONITOR_USER}
password: ${env:POSTGRES_MONITOR_PASSWORD}
databases:
- tenant_db
collection_interval: 60s
tls:
insecure: false
cert_file: /etc/postgres-tls/server-cert.pem
key_file: /etc/postgres-tls/server-key.pem
ca_file: /etc/postgres-tls/ca-cert.pem
postgresql/training:
endpoint: training-db-service.bakery-ia:5432
username: ${env:POSTGRES_MONITOR_USER}
password: ${env:POSTGRES_MONITOR_PASSWORD}
databases:
- training_db
collection_interval: 60s
tls:
insecure: false
cert_file: /etc/postgres-tls/server-cert.pem
key_file: /etc/postgres-tls/server-key.pem
ca_file: /etc/postgres-tls/ca-cert.pem
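      # NOTE: the eighteen postgresql/* receiver blocks above differ only in the
      # service name and database. If they drift out of sync, they can be
      # regenerated with a small loop (sketch, run outside the cluster):
      #   for db in auth inventory orders ...; do
      #     printf 'postgresql/%s:\n  endpoint: %s-db-service.bakery-ia:5432\n' "$db" "$db"
      #   done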
# Redis receiver for cache metrics
# ENABLED: Using existing credentials from redis-secrets with TLS
redis:
endpoint: redis-service.bakery-ia:6379
password: ${env:REDIS_PASSWORD}
collection_interval: 60s
transport: tcp
tls:
insecure_skip_verify: false
cert_file: /etc/redis-tls/redis-cert.pem
key_file: /etc/redis-tls/redis-key.pem
ca_file: /etc/redis-tls/ca-cert.pem
metrics:
redis.maxmemory:
enabled: true
redis.cmd.latency:
enabled: true
# RabbitMQ receiver via management API
# ENABLED: Using existing credentials from rabbitmq-secrets
rabbitmq:
endpoint: http://rabbitmq-service.bakery-ia:15672
username: ${env:RABBITMQ_USER}
password: ${env:RABBITMQ_PASSWORD}
collection_interval: 30s
# Prometheus Receiver - Scrapes metrics from Kubernetes API
# Simplified configuration using only Kubernetes API metrics
prometheus:
config:
scrape_configs:
- job_name: 'kubernetes-nodes-cadvisor'
scrape_interval: 30s
scrape_timeout: 10s
scheme: https
tls_config:
insecure_skip_verify: true
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
kubernetes_sd_configs:
- role: node
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- target_label: __address__
replacement: kubernetes.default.svc:443
- source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __metrics_path__
              replacement: /api/v1/nodes/$${1}/proxy/metrics/cadvisor # '$$' escapes the collector's ${...} env expansion
- job_name: 'kubernetes-apiserver'
scrape_interval: 30s
scrape_timeout: 10s
scheme: https
tls_config:
insecure_skip_verify: true
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
kubernetes_sd_configs:
- role: endpoints
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: default;kubernetes;https
processors:
# Batch processor for better performance (optimized for high throughput)
batch:
timeout: 1s
send_batch_size: 10000 # Increased from 1024 for better performance
send_batch_max_size: 10000
# Batch processor for meter data
batch/meter:
timeout: 1s
send_batch_size: 20000
send_batch_max_size: 25000
# Memory limiter to prevent OOM
memory_limiter:
check_interval: 1s
limit_mib: 400
spike_limit_mib: 100
# Resource detection
resourcedetection:
detectors: [env, system, docker]
timeout: 5s
# Kubernetes attributes processor - CRITICAL for logs
# Extracts pod, namespace, container metadata from log attributes
k8sattributes:
auth_type: "serviceAccount"
passthrough: false
extract:
metadata:
- k8s.pod.name
- k8s.pod.uid
- k8s.deployment.name
- k8s.namespace.name
- k8s.node.name
- k8s.container.name
labels:
- tag_name: "app"
- tag_name: "pod-template-hash"
annotations:
- tag_name: "description"
# SigNoz span metrics processor with delta aggregation (recommended)
# Generates RED metrics (Rate, Error, Duration) from trace spans
signozspanmetrics/delta:
aggregation_temporality: AGGREGATION_TEMPORALITY_DELTA
metrics_exporter: signozclickhousemetrics
latency_histogram_buckets: [100us, 1ms, 2ms, 6ms, 10ms, 50ms, 100ms, 250ms, 500ms, 1000ms, 1400ms, 2000ms, 5s, 10s, 20s, 40s, 60s]
dimensions_cache_size: 100000
dimensions:
- name: service.namespace
default: default
- name: deployment.environment
default: default
- name: signoz.collector.id
exporters:
# ClickHouse exporter for traces
clickhousetraces:
datasource: tcp://admin:27ff0399-0d3a-4bd8-919d-17c2181e6fb9@signoz-clickhouse:9000/?database=signoz_traces
timeout: 10s
retry_on_failure:
enabled: true
initial_interval: 5s
max_interval: 30s
max_elapsed_time: 300s
# ClickHouse exporter for metrics
signozclickhousemetrics:
dsn: "tcp://admin:27ff0399-0d3a-4bd8-919d-17c2181e6fb9@signoz-clickhouse:9000/signoz_metrics"
timeout: 10s
retry_on_failure:
enabled: true
initial_interval: 5s
max_interval: 30s
max_elapsed_time: 300s
# ClickHouse exporter for meter data (usage metrics)
signozclickhousemeter:
dsn: "tcp://admin:27ff0399-0d3a-4bd8-919d-17c2181e6fb9@signoz-clickhouse:9000/signoz_meter"
timeout: 45s
sending_queue:
enabled: false
# ClickHouse exporter for logs
clickhouselogsexporter:
dsn: tcp://admin:27ff0399-0d3a-4bd8-919d-17c2181e6fb9@signoz-clickhouse:9000/?database=signoz_logs
timeout: 10s
retry_on_failure:
enabled: true
initial_interval: 5s
max_interval: 30s
# Metadata exporter for service metadata
metadataexporter:
dsn: "tcp://admin:27ff0399-0d3a-4bd8-919d-17c2181e6fb9@signoz-clickhouse:9000/signoz_metadata"
timeout: 10s
cache:
provider: in_memory
# Debug exporter for debugging (optional)
debug:
verbosity: detailed
sampling_initial: 5
sampling_thereafter: 200
service:
pipelines:
# Traces pipeline - exports to ClickHouse and signozmeter connector
traces:
receivers: [otlp]
processors: [memory_limiter, batch, signozspanmetrics/delta, resourcedetection]
exporters: [clickhousetraces, metadataexporter, signozmeter]
# Metrics pipeline
metrics:
receivers: [otlp,
postgresql/auth, postgresql/inventory, postgresql/orders,
postgresql/ai-insights, postgresql/alert-processor, postgresql/distribution,
postgresql/external, postgresql/forecasting, postgresql/notification,
postgresql/orchestrator, postgresql/pos, postgresql/procurement,
postgresql/production, postgresql/recipes, postgresql/sales,
postgresql/suppliers, postgresql/tenant, postgresql/training,
redis, rabbitmq, k8s_cluster, prometheus]
processors: [memory_limiter, batch, resourcedetection]
exporters: [signozclickhousemetrics]
# Meter pipeline - receives from signozmeter connector
metrics/meter:
receivers: [signozmeter]
processors: [batch/meter]
exporters: [signozclickhousemeter]
# Logs pipeline - includes both OTLP and Kubernetes pod logs
logs:
receivers: [otlp, filelog]
processors: [memory_limiter, batch, resourcedetection, k8sattributes]
exporters: [clickhouselogsexporter]
# ClusterRole configuration for Kubernetes monitoring
# CRITICAL: Required for k8s_cluster receiver to access Kubernetes API
# Without these permissions, k8s metrics will not appear in SigNoz UI
clusterRole:
create: true
name: "signoz-otel-collector-bakery-ia"
annotations: {}
# Complete RBAC rules required by k8sclusterreceiver
# Based on OpenTelemetry and SigNoz official documentation
rules:
# Core API group - fundamental Kubernetes resources
- apiGroups: [""]
resources:
- "events"
- "namespaces"
- "nodes"
- "nodes/proxy"
- "nodes/metrics"
- "nodes/spec"
- "pods"
- "pods/status"
- "replicationcontrollers"
- "replicationcontrollers/status"
- "resourcequotas"
- "services"
- "endpoints"
verbs: ["get", "list", "watch"]
# Apps API group - modern workload controllers
- apiGroups: ["apps"]
resources: ["deployments", "daemonsets", "statefulsets", "replicasets"]
verbs: ["get", "list", "watch"]
# Batch API group - job management
- apiGroups: ["batch"]
resources: ["jobs", "cronjobs"]
verbs: ["get", "list", "watch"]
# Autoscaling API group - HPA metrics (CRITICAL)
- apiGroups: ["autoscaling"]
resources: ["horizontalpodautoscalers"]
verbs: ["get", "list", "watch"]
# Extensions API group - legacy support
- apiGroups: ["extensions"]
resources: ["deployments", "daemonsets", "replicasets"]
verbs: ["get", "list", "watch"]
# Metrics API group - resource metrics
- apiGroups: ["metrics.k8s.io"]
resources: ["nodes", "pods"]
verbs: ["get", "list", "watch"]
clusterRoleBinding:
annotations: {}
name: "signoz-otel-collector-bakery-ia"
# Additional Configuration
serviceAccount:
create: true
annotations: {}
name: "signoz-otel-collector"
# Security Context
securityContext:
runAsNonRoot: true
runAsUser: 1000
fsGroup: 1000
# Network Policies (disabled for dev)
networkPolicy:
enabled: false
# Monitoring SigNoz itself
selfMonitoring:
enabled: true
serviceMonitor:
enabled: false
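# To preview the manifests this file produces without applying them:
#   helm template signoz signoz/signoz -n bakery-ia -f signoz-values-dev.yaml \
#     | kubectl apply --dry-run=client -f -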

View File

@@ -0,0 +1,998 @@
# SigNoz Helm Chart Values - Production Environment
# High-availability configuration with resource optimization
# DEPLOYED IN bakery-ia NAMESPACE - Ingress managed by bakery-ingress-prod
#
# Official Chart: https://github.com/SigNoz/charts
# Install Command: helm install signoz signoz/signoz -n bakery-ia -f signoz-values-prod.yaml
global:
storageClass: "microk8s-hostpath" # For MicroK8s, use "microk8s-hostpath" or custom storage class
clusterName: "bakery-ia-prod"
domain: "monitoring.bakewise.ai"
# Docker Hub credentials - applied to all sub-charts (including Zookeeper, ClickHouse, etc)
imagePullSecrets:
- dockerhub-creds
# Docker Hub credentials for pulling images (root level for SigNoz components)
imagePullSecrets:
- dockerhub-creds
# SigNoz Main Component (unified frontend + query service)
# BREAKING CHANGE: v0.89.0+ uses unified component instead of separate frontend/queryService
signoz:
replicaCount: 2
image:
repository: signoz/signoz
tag: v0.106.0 # Latest stable version
pullPolicy: IfNotPresent
service:
type: ClusterIP
port: 8080 # HTTP/API port
internalPort: 8085 # Internal gRPC port
# DISABLE built-in ingress - using unified bakery-ingress-prod instead
# Route configured in infrastructure/kubernetes/overlays/prod/prod-ingress.yaml
ingress:
enabled: false
resources:
requests:
cpu: 500m
memory: 1Gi
limits:
cpu: 2000m
memory: 4Gi
# Pod Anti-affinity for HA
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchLabels:
app.kubernetes.io/component: query-service
topologyKey: kubernetes.io/hostname
# Environment variables (new format - replaces configVars)
env:
signoz_telemetrystore_provider: "clickhouse"
dot_metrics_enabled: "true"
signoz_emailing_enabled: "true"
signoz_alertmanager_provider: "signoz"
# Retention configuration (30 days for prod)
signoz_traces_ttl_duration_hrs: "720"
signoz_metrics_ttl_duration_hrs: "720"
signoz_logs_ttl_duration_hrs: "720"
# OpAMP Server Configuration
# WARNING: OpAMP can cause gRPC instability and collector reloads
# Only enable if you have a stable OpAMP backend server
signoz_opamp_server_enabled: "false"
# signoz_opamp_server_endpoint: "0.0.0.0:4320"
# SMTP configuration for email alerts - now using Mailu as SMTP server
signoz_smtp_enabled: "true"
signoz_smtp_host: "email-smtp.bakery-ia.svc.cluster.local"
signoz_smtp_port: "587"
signoz_smtp_from: "alerts@bakewise.ai"
signoz_smtp_username: "alerts@bakewise.ai"
# Password should be set via secret: signoz_smtp_password
persistence:
enabled: true
size: 20Gi
    storageClass: "microk8s-hostpath" # match the MicroK8s class set in global
# Horizontal Pod Autoscaler
autoscaling:
enabled: true
minReplicas: 2
maxReplicas: 5
targetCPUUtilizationPercentage: 70
targetMemoryUtilizationPercentage: 80
# AlertManager Configuration
alertmanager:
enabled: true
replicaCount: 2
image:
repository: signoz/alertmanager
tag: 0.23.5
pullPolicy: IfNotPresent
service:
type: ClusterIP
port: 9093
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 512Mi
# Pod Anti-affinity for HA
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- signoz-alertmanager
topologyKey: kubernetes.io/hostname
persistence:
enabled: true
size: 5Gi
    storageClass: "microk8s-hostpath" # match the MicroK8s class set in global
config:
global:
resolve_timeout: 5m
smtp_smarthost: 'email-smtp.bakery-ia.svc.cluster.local:587'
smtp_from: 'alerts@bakewise.ai'
smtp_auth_username: 'alerts@bakewise.ai'
smtp_auth_password: '${SMTP_PASSWORD}'
smtp_require_tls: true
route:
group_by: ['alertname', 'cluster', 'service', 'severity']
group_wait: 10s
group_interval: 10s
repeat_interval: 12h
receiver: 'critical-alerts'
routes:
- match:
severity: critical
receiver: 'critical-alerts'
continue: true
- match:
severity: warning
receiver: 'warning-alerts'
receivers:
- name: 'critical-alerts'
email_configs:
- to: 'critical-alerts@bakewise.ai'
headers:
Subject: '[CRITICAL] {{ .GroupLabels.alertname }} - Bakery IA'
# Slack webhook for critical alerts
slack_configs:
- api_url: '${SLACK_WEBHOOK_URL}'
channel: '#alerts-critical'
title: '[CRITICAL] {{ .GroupLabels.alertname }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
- name: 'warning-alerts'
email_configs:
- to: 'oncall@bakewise.ai'
headers:
Subject: '[WARNING] {{ .GroupLabels.alertname }} - Bakery IA'
# ClickHouse Configuration - Time Series Database
clickhouse:
enabled: true
installCustomStorageClass: false
image:
registry: docker.io
repository: clickhouse/clickhouse-server
tag: 25.5.6 # Updated to official recommended version
pullPolicy: IfNotPresent
# ClickHouse resources (nested config)
clickhouse:
resources:
requests:
cpu: 1000m
memory: 2Gi
limits:
cpu: 4000m
memory: 8Gi
# Pod Anti-affinity for HA
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- signoz-clickhouse
topologyKey: kubernetes.io/hostname
persistence:
enabled: true
size: 100Gi
    storageClass: "microk8s-hostpath" # match the MicroK8s class set in global
# Cold storage configuration for better disk space management
coldStorage:
enabled: true
defaultKeepFreeSpaceBytes: 10737418240 # Keep 10GB free
ttl:
deleteTTLDays: 30 # Move old data to cold storage after 30 days
# Zookeeper Configuration (required by ClickHouse for coordination)
zookeeper:
enabled: true
replicaCount: 3 # CRITICAL: Always use 3 replicas for production HA
image:
tag: 3.7.1 # Official recommended version
resources:
requests:
cpu: 100m
memory: 256Mi
limits:
cpu: 500m
memory: 512Mi
persistence:
enabled: true
size: 10Gi
    storageClass: "microk8s-hostpath" # match the MicroK8s class set in global
# OpenTelemetry Collector - Integrated with SigNoz
otelCollector:
enabled: true
replicaCount: 2
image:
repository: signoz/signoz-otel-collector
tag: v0.129.12 # Updated to latest recommended version
pullPolicy: IfNotPresent
# Init containers for the Otel Collector pod
initContainers:
fix-postgres-tls:
enabled: true
image:
registry: docker.io
repository: busybox
tag: 1.35
pullPolicy: IfNotPresent
command:
- sh
- -c
- |
echo "Fixing PostgreSQL TLS file permissions..."
cp /etc/postgres-tls-source/* /etc/postgres-tls/
chmod 600 /etc/postgres-tls/server-key.pem
chmod 644 /etc/postgres-tls/server-cert.pem
chmod 644 /etc/postgres-tls/ca-cert.pem
echo "PostgreSQL TLS permissions fixed"
volumeMounts:
- name: postgres-tls-source
mountPath: /etc/postgres-tls-source
readOnly: true
- name: postgres-tls-fixed
mountPath: /etc/postgres-tls
readOnly: false
service:
type: ClusterIP
ports:
- name: otlp-grpc
port: 4317
targetPort: 4317
protocol: TCP
- name: otlp-http
port: 4318
targetPort: 4318
protocol: TCP
- name: prometheus
port: 8889
targetPort: 8889
protocol: TCP
- name: metrics
port: 8888
targetPort: 8888
protocol: TCP
resources:
requests:
cpu: 500m
memory: 512Mi
limits:
cpu: 2000m
memory: 2Gi
  # Additional environment variables for receivers
  # WARNING: production credentials should be sourced from Kubernetes Secrets,
  # not committed in plaintext in this values file
additionalEnvs:
POSTGRES_MONITOR_USER: "monitoring"
POSTGRES_MONITOR_PASSWORD: "monitoring_369f9c001f242b07ef9e2826e17169ca"
REDIS_PASSWORD: "OxdmdJjdVNXp37MNC2IFoMnTpfGGFv1k"
RABBITMQ_USER: "bakery"
RABBITMQ_PASSWORD: "forecast123"
# Mount TLS certificates for secure connections
extraVolumes:
- name: redis-tls
secret:
secretName: redis-tls-secret
- name: postgres-tls
secret:
secretName: postgres-tls
- name: postgres-tls-fixed
emptyDir: {}
- name: varlogpods
hostPath:
path: /var/log/pods
extraVolumeMounts:
- name: redis-tls
mountPath: /etc/redis-tls
readOnly: true
- name: postgres-tls
mountPath: /etc/postgres-tls-source
readOnly: true
- name: postgres-tls-fixed
mountPath: /etc/postgres-tls
readOnly: false
- name: varlogpods
mountPath: /var/log/pods
readOnly: true
# Enable OpAMP for dynamic configuration management
command:
name: /signoz-otel-collector
extraArgs:
- --config=/conf/otel-collector-config.yaml
- --manager-config=/conf/otel-collector-opamp-config.yaml
- --feature-gates=-pkg.translator.prometheus.NormalizeName
# Full OTEL Collector Configuration
config:
# Connectors - bridge between pipelines
connectors:
signozmeter:
dimensions:
- name: service.name
- name: deployment.environment
- name: host.name
metrics_flush_interval: 1h
extensions:
health_check:
endpoint: 0.0.0.0:13133
zpages:
endpoint: 0.0.0.0:55679
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
max_recv_msg_size_mib: 32 # Increased for larger payloads
http:
endpoint: 0.0.0.0:4318
cors:
allowed_origins:
- "https://monitoring.bakewise.ai"
- "https://*.bakewise.ai"
# Filelog receiver for Kubernetes pod logs
# Collects container stdout/stderr from /var/log/pods
filelog:
include:
- /var/log/pods/*/*/*.log
exclude:
# Exclude SigNoz's own logs to avoid recursive collection
- /var/log/pods/bakery-ia_signoz-*/*/*.log
include_file_path: true
include_file_name: false
operators:
# Parse CRI-O / containerd log format
- type: regex_parser
regex: '^(?P<time>[^ ]+) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) (?P<log>.*)$'
timestamp:
parse_from: attributes.time
layout: '%Y-%m-%dT%H:%M:%S.%LZ'
# Fix timestamp parsing - extract from the parsed time field
- type: move
from: attributes.time
to: attributes.timestamp
# Extract Kubernetes metadata from file path
- type: regex_parser
id: extract_metadata_from_filepath
regex: '^.*\/(?P<namespace>[^_]+)_(?P<pod_name>[^_]+)_(?P<uid>[^\/]+)\/(?P<container_name>[^\._]+)\/(?P<restart_count>\d+)\.log$'
parse_from: attributes["log.file.path"]
# Move metadata to resource attributes
- type: move
from: attributes.namespace
to: resource["k8s.namespace.name"]
- type: move
from: attributes.pod_name
to: resource["k8s.pod.name"]
- type: move
from: attributes.container_name
to: resource["k8s.container.name"]
- type: move
from: attributes.log
to: body
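# Illustrative example (not from a live cluster): given this containerd line in
#   /var/log/pods/bakery-ia_orders-7f9b_1a2b3c/orders/0.log
#     2026-01-19T10:00:00.123456789Z stdout F {"msg":"order created"}
# the operators above yield:
#   resource: k8s.namespace.name=bakery-ia, k8s.pod.name=orders-7f9b,
#             k8s.container.name=orders
#   body: {"msg":"order created"}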
# Kubernetes Cluster Receiver - Collects cluster-level metrics
# Provides information about nodes, namespaces, pods, and other cluster resources
k8s_cluster:
collection_interval: 30s
node_conditions_to_report:
- Ready
- MemoryPressure
- DiskPressure
- PIDPressure
- NetworkUnavailable
allocatable_types_to_report:
- cpu
- memory
- pods
# Prometheus receiver for scraping metrics
prometheus:
config:
scrape_configs:
- job_name: 'kubernetes-nodes-cadvisor'
scrape_interval: 30s
scrape_timeout: 10s
scheme: https
tls_config:
insecure_skip_verify: true
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
kubernetes_sd_configs:
- role: node
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- target_label: __address__
replacement: kubernetes.default.svc:443
- source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __metrics_path__
replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
- job_name: 'kubernetes-apiserver'
scrape_interval: 30s
scrape_timeout: 10s
scheme: https
tls_config:
insecure_skip_verify: true
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
kubernetes_sd_configs:
- role: endpoints
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: default;kubernetes;https
# Redis receiver for cache metrics
# ENABLED: Using existing credentials from redis-secrets with TLS
redis:
endpoint: redis-service.bakery-ia:6379
password: ${env:REDIS_PASSWORD}
collection_interval: 60s
transport: tcp
tls:
insecure_skip_verify: false
cert_file: /etc/redis-tls/redis-cert.pem
key_file: /etc/redis-tls/redis-key.pem
ca_file: /etc/redis-tls/ca-cert.pem
metrics:
redis.maxmemory:
enabled: true
redis.cmd.latency:
enabled: true
# RabbitMQ receiver via management API
# ENABLED: Using existing credentials from rabbitmq-secrets
rabbitmq:
endpoint: http://rabbitmq-service.bakery-ia:15672
username: ${env:RABBITMQ_USER}
password: ${env:RABBITMQ_PASSWORD}
collection_interval: 30s
# PostgreSQL receivers for database metrics
# Monitor all databases with proper TLS configuration
postgresql/auth:
endpoint: auth-db-service.bakery-ia:5432
username: ${env:POSTGRES_MONITOR_USER}
password: ${env:POSTGRES_MONITOR_PASSWORD}
databases:
- auth_db
collection_interval: 60s
tls:
insecure: false
cert_file: /etc/postgres-tls/server-cert.pem
key_file: /etc/postgres-tls/server-key.pem
ca_file: /etc/postgres-tls/ca-cert.pem
postgresql/inventory:
endpoint: inventory-db-service.bakery-ia:5432
username: ${env:POSTGRES_MONITOR_USER}
password: ${env:POSTGRES_MONITOR_PASSWORD}
databases:
- inventory_db
collection_interval: 60s
tls:
insecure: false
cert_file: /etc/postgres-tls/server-cert.pem
key_file: /etc/postgres-tls/server-key.pem
ca_file: /etc/postgres-tls/ca-cert.pem
postgresql/orders:
endpoint: orders-db-service.bakery-ia:5432
username: ${env:POSTGRES_MONITOR_USER}
password: ${env:POSTGRES_MONITOR_PASSWORD}
databases:
- orders_db
collection_interval: 60s
tls:
insecure: false
cert_file: /etc/postgres-tls/server-cert.pem
key_file: /etc/postgres-tls/server-key.pem
ca_file: /etc/postgres-tls/ca-cert.pem
postgresql/ai-insights:
endpoint: ai-insights-db-service.bakery-ia:5432
username: ${env:POSTGRES_MONITOR_USER}
password: ${env:POSTGRES_MONITOR_PASSWORD}
databases:
- ai_insights_db
collection_interval: 60s
tls:
insecure: false
cert_file: /etc/postgres-tls/server-cert.pem
key_file: /etc/postgres-tls/server-key.pem
ca_file: /etc/postgres-tls/ca-cert.pem
postgresql/alert-processor:
endpoint: alert-processor-db-service.bakery-ia:5432
username: ${env:POSTGRES_MONITOR_USER}
password: ${env:POSTGRES_MONITOR_PASSWORD}
databases:
- alert_processor_db
collection_interval: 60s
tls:
insecure: false
cert_file: /etc/postgres-tls/server-cert.pem
key_file: /etc/postgres-tls/server-key.pem
ca_file: /etc/postgres-tls/ca-cert.pem
postgresql/distribution:
endpoint: distribution-db-service.bakery-ia:5432
username: ${env:POSTGRES_MONITOR_USER}
password: ${env:POSTGRES_MONITOR_PASSWORD}
databases:
- distribution_db
collection_interval: 60s
tls:
insecure: false
cert_file: /etc/postgres-tls/server-cert.pem
key_file: /etc/postgres-tls/server-key.pem
ca_file: /etc/postgres-tls/ca-cert.pem
postgresql/external:
endpoint: external-db-service.bakery-ia:5432
username: ${env:POSTGRES_MONITOR_USER}
password: ${env:POSTGRES_MONITOR_PASSWORD}
databases:
- external_db
collection_interval: 60s
tls:
insecure: false
cert_file: /etc/postgres-tls/server-cert.pem
key_file: /etc/postgres-tls/server-key.pem
ca_file: /etc/postgres-tls/ca-cert.pem
postgresql/forecasting:
endpoint: forecasting-db-service.bakery-ia:5432
username: ${env:POSTGRES_MONITOR_USER}
password: ${env:POSTGRES_MONITOR_PASSWORD}
databases:
- forecasting_db
collection_interval: 60s
tls:
insecure: false
cert_file: /etc/postgres-tls/server-cert.pem
key_file: /etc/postgres-tls/server-key.pem
ca_file: /etc/postgres-tls/ca-cert.pem
postgresql/notification:
endpoint: notification-db-service.bakery-ia:5432
username: ${env:POSTGRES_MONITOR_USER}
password: ${env:POSTGRES_MONITOR_PASSWORD}
databases:
- notification_db
collection_interval: 60s
tls:
insecure: false
cert_file: /etc/postgres-tls/server-cert.pem
key_file: /etc/postgres-tls/server-key.pem
ca_file: /etc/postgres-tls/ca-cert.pem
postgresql/orchestrator:
endpoint: orchestrator-db-service.bakery-ia:5432
username: ${env:POSTGRES_MONITOR_USER}
password: ${env:POSTGRES_MONITOR_PASSWORD}
databases:
- orchestrator_db
collection_interval: 60s
tls:
insecure: false
cert_file: /etc/postgres-tls/server-cert.pem
key_file: /etc/postgres-tls/server-key.pem
ca_file: /etc/postgres-tls/ca-cert.pem
postgresql/pos:
endpoint: pos-db-service.bakery-ia:5432
username: ${env:POSTGRES_MONITOR_USER}
password: ${env:POSTGRES_MONITOR_PASSWORD}
databases:
- pos_db
collection_interval: 60s
tls:
insecure: false
cert_file: /etc/postgres-tls/server-cert.pem
key_file: /etc/postgres-tls/server-key.pem
ca_file: /etc/postgres-tls/ca-cert.pem
postgresql/procurement:
endpoint: procurement-db-service.bakery-ia:5432
username: ${env:POSTGRES_MONITOR_USER}
password: ${env:POSTGRES_MONITOR_PASSWORD}
databases:
- procurement_db
collection_interval: 60s
tls:
insecure: false
cert_file: /etc/postgres-tls/server-cert.pem
key_file: /etc/postgres-tls/server-key.pem
ca_file: /etc/postgres-tls/ca-cert.pem
postgresql/production:
endpoint: production-db-service.bakery-ia:5432
username: ${env:POSTGRES_MONITOR_USER}
password: ${env:POSTGRES_MONITOR_PASSWORD}
databases:
- production_db
collection_interval: 60s
tls:
insecure: false
cert_file: /etc/postgres-tls/server-cert.pem
key_file: /etc/postgres-tls/server-key.pem
ca_file: /etc/postgres-tls/ca-cert.pem
postgresql/recipes:
endpoint: recipes-db-service.bakery-ia:5432
username: ${env:POSTGRES_MONITOR_USER}
password: ${env:POSTGRES_MONITOR_PASSWORD}
databases:
- recipes_db
collection_interval: 60s
tls:
insecure: false
cert_file: /etc/postgres-tls/server-cert.pem
key_file: /etc/postgres-tls/server-key.pem
ca_file: /etc/postgres-tls/ca-cert.pem
postgresql/sales:
endpoint: sales-db-service.bakery-ia:5432
username: ${env:POSTGRES_MONITOR_USER}
password: ${env:POSTGRES_MONITOR_PASSWORD}
databases:
- sales_db
collection_interval: 60s
tls:
insecure: false
cert_file: /etc/postgres-tls/server-cert.pem
key_file: /etc/postgres-tls/server-key.pem
ca_file: /etc/postgres-tls/ca-cert.pem
postgresql/suppliers:
endpoint: suppliers-db-service.bakery-ia:5432
username: ${env:POSTGRES_MONITOR_USER}
password: ${env:POSTGRES_MONITOR_PASSWORD}
databases:
- suppliers_db
collection_interval: 60s
tls:
insecure: false
cert_file: /etc/postgres-tls/server-cert.pem
key_file: /etc/postgres-tls/server-key.pem
ca_file: /etc/postgres-tls/ca-cert.pem
postgresql/tenant:
endpoint: tenant-db-service.bakery-ia:5432
username: ${env:POSTGRES_MONITOR_USER}
password: ${env:POSTGRES_MONITOR_PASSWORD}
databases:
- tenant_db
collection_interval: 60s
tls:
insecure: false
cert_file: /etc/postgres-tls/server-cert.pem
key_file: /etc/postgres-tls/server-key.pem
ca_file: /etc/postgres-tls/ca-cert.pem
postgresql/training:
endpoint: training-db-service.bakery-ia:5432
username: ${env:POSTGRES_MONITOR_USER}
password: ${env:POSTGRES_MONITOR_PASSWORD}
databases:
- training_db
collection_interval: 60s
tls:
insecure: false
cert_file: /etc/postgres-tls/server-cert.pem
key_file: /etc/postgres-tls/server-key.pem
ca_file: /etc/postgres-tls/ca-cert.pem
processors:
# High-performance batch processing (official recommendation)
batch:
timeout: 1s # Reduced from 10s for faster processing
send_batch_size: 50000 # Increased from 2048 (official recommendation for traces)
send_batch_max_size: 50000
# Batch processor for meter data
batch/meter:
timeout: 1s
send_batch_size: 20000
send_batch_max_size: 25000
memory_limiter:
check_interval: 1s
limit_mib: 1500 # 75% of container memory (2Gi = ~2048Mi)
spike_limit_mib: 300
# Resource detection for K8s
resourcedetection:
detectors: [env, system, docker]
timeout: 5s
# Add resource attributes
resource:
attributes:
- key: deployment.environment
value: production
action: upsert
- key: cluster.name
value: bakery-ia-prod
action: upsert
# Kubernetes attributes processor - CRITICAL for logs
# Extracts pod, namespace, container metadata from log attributes
k8sattributes:
auth_type: "serviceAccount"
passthrough: false
extract:
metadata:
- k8s.pod.name
- k8s.pod.uid
- k8s.deployment.name
- k8s.namespace.name
- k8s.node.name
- k8s.container.name
labels:
- tag_name: "app"
- tag_name: "pod-template-hash"
- tag_name: "version"
annotations:
- tag_name: "description"
# SigNoz span metrics processor with delta aggregation (recommended)
# Generates RED metrics (Rate, Error, Duration) from trace spans
signozspanmetrics/delta:
aggregation_temporality: AGGREGATION_TEMPORALITY_DELTA
metrics_exporter: signozclickhousemetrics
latency_histogram_buckets: [100us, 1ms, 2ms, 6ms, 10ms, 50ms, 100ms, 250ms, 500ms, 1000ms, 1400ms, 2000ms, 5s, 10s, 20s, 40s, 60s]
dimensions_cache_size: 100000
dimensions:
- name: service.namespace
default: default
- name: deployment.environment
default: production
- name: signoz.collector.id
exporters:
# ClickHouse exporter for traces
clickhousetraces:
datasource: tcp://admin:27ff0399-0d3a-4bd8-919d-17c2181e6fb9@signoz-clickhouse:9000/?database=signoz_traces
timeout: 10s
retry_on_failure:
enabled: true
initial_interval: 5s
max_interval: 30s
max_elapsed_time: 300s
# ClickHouse exporter for metrics
signozclickhousemetrics:
dsn: "tcp://admin:27ff0399-0d3a-4bd8-919d-17c2181e6fb9@signoz-clickhouse:9000/signoz_metrics"
timeout: 10s
retry_on_failure:
enabled: true
initial_interval: 5s
max_interval: 30s
max_elapsed_time: 300s
# ClickHouse exporter for meter data (usage metrics)
signozclickhousemeter:
dsn: "tcp://admin:27ff0399-0d3a-4bd8-919d-17c2181e6fb9@signoz-clickhouse:9000/signoz_meter"
timeout: 45s
sending_queue:
enabled: false
# ClickHouse exporter for logs
clickhouselogsexporter:
dsn: tcp://admin:27ff0399-0d3a-4bd8-919d-17c2181e6fb9@signoz-clickhouse:9000/?database=signoz_logs
timeout: 10s
retry_on_failure:
enabled: true
initial_interval: 5s
max_interval: 30s
# Metadata exporter for service metadata
metadataexporter:
dsn: "tcp://admin:27ff0399-0d3a-4bd8-919d-17c2181e6fb9@signoz-clickhouse:9000/signoz_metadata"
timeout: 10s
cache:
provider: in_memory
# Debug exporter for debugging (optional)
debug:
verbosity: detailed
sampling_initial: 5
sampling_thereafter: 200
service:
extensions: [health_check, zpages]
pipelines:
# Traces pipeline - exports to ClickHouse and signozmeter connector
traces:
receivers: [otlp]
processors: [memory_limiter, batch, signozspanmetrics/delta, resourcedetection, resource]
exporters: [clickhousetraces, metadataexporter, signozmeter]
# Metrics pipeline - includes all infrastructure receivers
metrics:
receivers: [otlp,
postgresql/auth, postgresql/inventory, postgresql/orders,
postgresql/ai-insights, postgresql/alert-processor, postgresql/distribution,
postgresql/external, postgresql/forecasting, postgresql/notification,
postgresql/orchestrator, postgresql/pos, postgresql/procurement,
postgresql/production, postgresql/recipes, postgresql/sales,
postgresql/suppliers, postgresql/tenant, postgresql/training,
redis, rabbitmq, k8s_cluster, prometheus]
processors: [memory_limiter, batch, resourcedetection, resource]
exporters: [signozclickhousemetrics]
# Meter pipeline - receives from signozmeter connector
metrics/meter:
receivers: [signozmeter]
processors: [batch/meter]
exporters: [signozclickhousemeter]
# Logs pipeline - includes both OTLP and Kubernetes pod logs
logs:
receivers: [otlp, filelog]
processors: [memory_limiter, batch, resourcedetection, resource, k8sattributes]
exporters: [clickhouselogsexporter]
# HPA for OTEL Collector
autoscaling:
enabled: true
minReplicas: 2
maxReplicas: 10
targetCPUUtilizationPercentage: 70
targetMemoryUtilizationPercentage: 80
# ClusterRole configuration for Kubernetes monitoring
# CRITICAL: Required for k8s_cluster receiver to access Kubernetes API
# Without these permissions, k8s metrics will not appear in SigNoz UI
clusterRole:
create: true
name: "signoz-otel-collector-bakery-ia"
annotations: {}
# Complete RBAC rules required by k8sclusterreceiver
# Based on OpenTelemetry and SigNoz official documentation
rules:
# Core API group - fundamental Kubernetes resources
- apiGroups: [""]
resources:
- "events"
- "namespaces"
- "nodes"
- "nodes/proxy"
- "nodes/metrics"
- "nodes/spec"
- "pods"
- "pods/status"
- "replicationcontrollers"
- "replicationcontrollers/status"
- "resourcequotas"
- "services"
- "endpoints"
verbs: ["get", "list", "watch"]
# Apps API group - modern workload controllers
- apiGroups: ["apps"]
resources: ["deployments", "daemonsets", "statefulsets", "replicasets"]
verbs: ["get", "list", "watch"]
# Batch API group - job management
- apiGroups: ["batch"]
resources: ["jobs", "cronjobs"]
verbs: ["get", "list", "watch"]
# Autoscaling API group - HPA metrics (CRITICAL)
- apiGroups: ["autoscaling"]
resources: ["horizontalpodautoscalers"]
verbs: ["get", "list", "watch"]
# Extensions API group - legacy support
- apiGroups: ["extensions"]
resources: ["deployments", "daemonsets", "replicasets"]
verbs: ["get", "list", "watch"]
# Metrics API group - resource metrics
- apiGroups: ["metrics.k8s.io"]
resources: ["nodes", "pods"]
verbs: ["get", "list", "watch"]
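# Hypothetical spot check that the binding took effect once deployed
# (assumes the service account below; adjust namespace/name to your release):
#   kubectl auth can-i list pods --as=system:serviceaccount:bakery-ia:signoz
#   kubectl auth can-i watch horizontalpodautoscalers.autoscaling \
#     --as=system:serviceaccount:bakery-ia:signoz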
clusterRoleBinding:
annotations: {}
name: "signoz-otel-collector-bakery-ia"
# Schema Migrator - Manages ClickHouse schema migrations
schemaMigrator:
enabled: true
image:
repository: signoz/signoz-schema-migrator
tag: v0.129.12 # Updated to latest version
pullPolicy: IfNotPresent
# Enable Helm hooks for proper upgrade handling
upgradeHelmHooks: true
# Additional Configuration
serviceAccount:
create: true
annotations: {}
name: "signoz"
# Security Context
securityContext:
runAsNonRoot: true
runAsUser: 1000
fsGroup: 1000
# Pod Disruption Budgets for HA
podDisruptionBudget:
frontend:
enabled: true
minAvailable: 1
queryService:
enabled: true
minAvailable: 1
alertmanager:
enabled: true
minAvailable: 1
clickhouse:
enabled: true
minAvailable: 1
# Network Policies for security
networkPolicy:
enabled: true
policyTypes:
- Ingress
- Egress
# Monitoring SigNoz itself
selfMonitoring:
enabled: true
serviceMonitor:
enabled: true
interval: 30s
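The `extract_metadata_from_filepath` regex in the filelog receiver above can be sanity-checked with plain bash before deploying. This is a local sketch with a made-up sample path; the regex structure mirrors the collector config:

```shell
# Sample path in the containerd layout: <namespace>_<pod>_<uid>/<container>/<restart>.log
path="/var/log/pods/bakery-ia_orders-7f9b_1a2b3c/orders/0.log"

# Same capture structure as the collector regex, adapted to bash's =~ operator
if [[ $path =~ ^.*/([^_]+)_([^_]+)_([^/]+)/([^._]+)/([0-9]+)\.log$ ]]; then
  echo "namespace=${BASH_REMATCH[1]} pod=${BASH_REMATCH[2]} container=${BASH_REMATCH[4]}"
else
  echo "path did not match"
fi
```

For the sample path this prints `namespace=bakery-ia pod=orders-7f9b container=orders`, matching the resource attributes the move operators assign.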

#!/bin/bash
# SigNoz Telemetry Verification Script
# This script verifies that services are correctly sending metrics, logs, and traces to SigNoz
# and that SigNoz is collecting them properly.
set -e
NAMESPACE="bakery-ia"
GREEN='\033[0;32m'
RED='\033[0;31m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color
echo -e "${BLUE}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
echo -e "${BLUE} SigNoz Telemetry Verification Script${NC}"
echo -e "${BLUE}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
echo ""
# Step 1: Verify SigNoz Components are Running
echo -e "${BLUE}[1/7] Checking SigNoz Components Status...${NC}"
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
OTEL_POD=$(kubectl get pods -n $NAMESPACE -l app.kubernetes.io/name=signoz,app.kubernetes.io/component=otel-collector --field-selector=status.phase=Running -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
SIGNOZ_POD=$(kubectl get pods -n $NAMESPACE -l app.kubernetes.io/name=signoz,app.kubernetes.io/component=signoz --field-selector=status.phase=Running -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
CLICKHOUSE_POD=$(kubectl get pods -n $NAMESPACE -l clickhouse.altinity.com/chi=signoz-clickhouse --field-selector=status.phase=Running -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
if [[ -n "$OTEL_POD" && -n "$SIGNOZ_POD" && -n "$CLICKHOUSE_POD" ]]; then
echo -e "${GREEN}✓ All SigNoz components are running${NC}"
echo " - OTel Collector: $OTEL_POD"
echo " - SigNoz Frontend: $SIGNOZ_POD"
echo " - ClickHouse: $CLICKHOUSE_POD"
else
echo -e "${RED}✗ Some SigNoz components are not running${NC}"
kubectl get pods -n $NAMESPACE | grep signoz
exit 1
fi
echo ""
# Step 2: Check OTel Collector Endpoints
echo -e "${BLUE}[2/7] Verifying OTel Collector Endpoints...${NC}"
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
OTEL_SVC=$(kubectl get svc -n $NAMESPACE signoz-otel-collector -o jsonpath='{.spec.clusterIP}')
echo "OTel Collector Service IP: $OTEL_SVC"
echo ""
echo "Available endpoints:"
kubectl get svc -n $NAMESPACE signoz-otel-collector -o jsonpath='{range .spec.ports[*]}{.name}{"\t"}{.port}{"\n"}{end}' | column -t
echo ""
echo -e "${GREEN}✓ OTel Collector endpoints are exposed${NC}"
echo ""
# Step 3: Check OTel Collector Logs for Data Reception
echo -e "${BLUE}[3/7] Checking OTel Collector for Recent Activity...${NC}"
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
echo "Recent OTel Collector logs (last 20 lines):"
kubectl logs -n $NAMESPACE $OTEL_POD --tail=20 | grep -E "received|exported|traces|metrics|logs" || echo "No recent telemetry data found in logs"
echo ""
# Step 4: Check Service Configurations
echo -e "${BLUE}[4/7] Verifying Service Telemetry Configuration...${NC}"
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
# Check ConfigMap for OTEL settings
OTEL_ENDPOINT=$(kubectl get configmap bakery-config -n $NAMESPACE -o jsonpath='{.data.OTEL_EXPORTER_OTLP_ENDPOINT}')
ENABLE_TRACING=$(kubectl get configmap bakery-config -n $NAMESPACE -o jsonpath='{.data.ENABLE_TRACING}')
ENABLE_METRICS=$(kubectl get configmap bakery-config -n $NAMESPACE -o jsonpath='{.data.ENABLE_METRICS}')
ENABLE_LOGS=$(kubectl get configmap bakery-config -n $NAMESPACE -o jsonpath='{.data.ENABLE_LOGS}')
echo "Configuration from bakery-config ConfigMap:"
echo " OTEL_EXPORTER_OTLP_ENDPOINT: $OTEL_ENDPOINT"
echo " ENABLE_TRACING: $ENABLE_TRACING"
echo " ENABLE_METRICS: $ENABLE_METRICS"
echo " ENABLE_LOGS: $ENABLE_LOGS"
echo ""
if [[ "$ENABLE_TRACING" == "true" && "$ENABLE_METRICS" == "true" && "$ENABLE_LOGS" == "true" ]]; then
echo -e "${GREEN}✓ Telemetry is enabled in configuration${NC}"
else
echo -e "${YELLOW}⚠ Some telemetry features may be disabled${NC}"
fi
echo ""
# Step 5: Test OTel Collector Health
echo -e "${BLUE}[5/7] Testing OTel Collector Health Endpoint...${NC}"
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
HEALTH_CHECK=$(kubectl exec -n $NAMESPACE $OTEL_POD -- wget -qO- http://localhost:13133/ 2>/dev/null || echo "FAILED")
if [[ "$HEALTH_CHECK" == *"Server available"* ]] || [[ "$HEALTH_CHECK" == "{}" ]]; then
echo -e "${GREEN}✓ OTel Collector health check passed${NC}"
else
echo -e "${RED}✗ OTel Collector health check failed${NC}"
echo "Response: $HEALTH_CHECK"
fi
echo ""
# Step 6: Query ClickHouse for Telemetry Data
echo -e "${BLUE}[6/7] Querying ClickHouse for Telemetry Data...${NC}"
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
# Get ClickHouse credentials
CH_PASSWORD=$(kubectl get secret -n $NAMESPACE signoz-clickhouse -o jsonpath='{.data.admin-password}' 2>/dev/null | base64 -d || echo "27ff0399-0d3a-4bd8-919d-17c2181e6fb9")
echo "Checking for traces in ClickHouse..."
TRACES_COUNT=$(kubectl exec -n $NAMESPACE $CLICKHOUSE_POD -- clickhouse-client --user=admin --password=$CH_PASSWORD --query="SELECT count() FROM signoz_traces.signoz_index_v2 WHERE timestamp >= now() - INTERVAL 1 HOUR" 2>/dev/null || echo "0")
echo " Traces in last hour: $TRACES_COUNT"
echo "Checking for metrics in ClickHouse..."
METRICS_COUNT=$(kubectl exec -n $NAMESPACE $CLICKHOUSE_POD -- clickhouse-client --user=admin --password=$CH_PASSWORD --query="SELECT count() FROM signoz_metrics.samples_v4 WHERE unix_milli >= toUnixTimestamp(now() - INTERVAL 1 HOUR) * 1000" 2>/dev/null || echo "0")
echo " Metrics in last hour: $METRICS_COUNT"
echo "Checking for logs in ClickHouse..."
LOGS_COUNT=$(kubectl exec -n $NAMESPACE $CLICKHOUSE_POD -- clickhouse-client --user=admin --password=$CH_PASSWORD --query="SELECT count() FROM signoz_logs.logs WHERE timestamp >= now() - INTERVAL 1 HOUR" 2>/dev/null || echo "0")
echo " Logs in last hour: $LOGS_COUNT"
echo ""
if [[ "$TRACES_COUNT" -gt "0" || "$METRICS_COUNT" -gt "0" || "$LOGS_COUNT" -gt "0" ]]; then
echo -e "${GREEN}✓ Telemetry data found in ClickHouse!${NC}"
else
echo -e "${YELLOW}⚠ No telemetry data found in the last hour${NC}"
echo " This might be normal if:"
echo " - Services were just deployed"
echo " - No traffic has been generated yet"
echo " - Services haven't finished initializing"
fi
echo ""
# Step 7: Access Information
echo -e "${BLUE}[7/7] SigNoz UI Access Information${NC}"
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
echo ""
echo "SigNoz is accessible via ingress at:"
echo -e " ${GREEN}https://monitoring.bakery-ia.local${NC}"
echo ""
echo "Or via port-forward:"
echo -e " ${YELLOW}kubectl port-forward -n $NAMESPACE svc/signoz 3301:8080${NC}"
echo " Then access: http://localhost:3301"
echo ""
echo "To view OTel Collector metrics:"
echo -e " ${YELLOW}kubectl port-forward -n $NAMESPACE svc/signoz-otel-collector 8888:8888${NC}"
echo " Then access: http://localhost:8888/metrics"
echo ""
# Summary
echo ""
echo -e "${BLUE}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
echo -e "${BLUE} Verification Summary${NC}"
echo -e "${BLUE}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
echo ""
echo "Component Status:"
echo " ✓ SigNoz components running"
echo " ✓ OTel Collector healthy"
echo " ✓ Configuration correct"
echo ""
echo "Data Collection (last hour):"
echo " Traces: $TRACES_COUNT"
echo " Metrics: $METRICS_COUNT"
echo " Logs: $LOGS_COUNT"
echo ""
if [[ "$TRACES_COUNT" -gt "0" || "$METRICS_COUNT" -gt "0" || "$LOGS_COUNT" -gt "0" ]]; then
echo -e "${GREEN}✓ SigNoz is collecting telemetry data successfully!${NC}"
else
echo -e "${YELLOW}⚠ To generate telemetry data, try:${NC}"
echo ""
echo "1. Generate traffic to your services:"
echo " curl http://localhost/api/health"
echo ""
echo "2. Check service logs for tracing initialization:"
echo " kubectl logs -n $NAMESPACE <service-pod> | grep -i 'tracing\\|otel\\|signoz'"
echo ""
echo "3. Wait a few minutes and run this script again"
fi
echo ""
echo -e "${BLUE}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
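If the checks above show zero telemetry, pushing one log record by hand is a quick end-to-end test of the ingest path. The `/v1/logs` route comes from the OTLP/HTTP spec; the payload is a minimal sketch and the endpoint address assumes a local port-forward to the collector:

```shell
# Minimal OTLP/HTTP log record (sketch); endpoint assumes a local port-forward
ENDPOINT="${OTEL_ENDPOINT:-http://localhost:4318}"
payload='{"resourceLogs":[{"resource":{"attributes":[{"key":"service.name","value":{"stringValue":"smoke-test"}}]},"scopeLogs":[{"logRecords":[{"body":{"stringValue":"hello from smoke-test"}}]}]}]}'

# Check the JSON is well formed before sending anything
echo "$payload" | python3 -m json.tool > /dev/null && echo "payload ok"

# Requires access to the collector, e.g.:
#   kubectl port-forward -n bakery-ia svc/signoz-otel-collector 4318:4318
# curl -sS -X POST "$ENDPOINT/v1/logs" -H 'Content-Type: application/json' -d "$payload"
```

After POSTing, the record should show up within a minute in the Logs view (service `smoke-test`) and in the `signoz_logs.logs` ClickHouse count queried by step 6.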

#!/bin/bash
# ============================================================================
# SigNoz Verification Script for Bakery IA
# ============================================================================
# This script verifies that SigNoz is properly deployed and functioning
# ============================================================================
set -e
# Color codes for output
# ANSI-C quoting ($'...') so plain echo renders the colors; this script does not use echo -e
RED=$'\033[0;31m'
GREEN=$'\033[0;32m'
YELLOW=$'\033[1;33m'
BLUE=$'\033[0;34m'
NC=$'\033[0m' # No Color
# Function to display help
show_help() {
echo "Usage: $0 [OPTIONS] ENVIRONMENT"
echo ""
echo "Verify SigNoz deployment for Bakery IA"
echo ""
echo "Arguments:"
echo "  ENVIRONMENT                 Environment to verify (dev|prod)"
echo ""
echo "Options:"
echo "  -h, --help                  Show this help message"
echo "  -n, --namespace NAMESPACE   Specify namespace (default: bakery-ia)"
echo ""
echo "Examples:"
echo "  $0 dev                          # Verify development deployment"
echo "  $0 prod                         # Verify production deployment"
echo "  $0 --namespace monitoring dev   # Verify with custom namespace"
}
# Parse command line arguments
NAMESPACE="bakery-ia"
while [[ $# -gt 0 ]]; do
case $1 in
-h|--help)
show_help
exit 0
;;
-n|--namespace)
NAMESPACE="$2"
shift 2
;;
dev|prod)
ENVIRONMENT="$1"
shift
;;
*)
echo "Unknown argument: $1"
show_help
exit 1
;;
esac
done
# Validate environment
if [[ -z "$ENVIRONMENT" ]]; then
echo "Error: Environment not specified. Use 'dev' or 'prod'."
show_help
exit 1
fi
if [[ "$ENVIRONMENT" != "dev" && "$ENVIRONMENT" != "prod" ]]; then
echo "Error: Invalid environment. Use 'dev' or 'prod'."
exit 1
fi
# Function to check if kubectl is configured
check_kubectl() {
if ! kubectl cluster-info &> /dev/null; then
echo "${RED}Error: kubectl is not configured or cannot connect to cluster.${NC}"
echo "Please ensure you have access to a Kubernetes cluster."
exit 1
fi
}
# Function to check namespace exists
check_namespace() {
if ! kubectl get namespace "$NAMESPACE" &> /dev/null; then
echo "${RED}Error: Namespace $NAMESPACE does not exist.${NC}"
echo "Please deploy SigNoz first using: ./deploy-signoz.sh $ENVIRONMENT"
exit 1
fi
}
# Function to verify SigNoz deployment
verify_deployment() {
echo "${BLUE}"
echo "=========================================="
echo "🔍 Verifying SigNoz Deployment"
echo "=========================================="
echo "Environment: $ENVIRONMENT"
echo "Namespace: $NAMESPACE"
echo "${NC}"
echo ""
# Check if SigNoz helm release exists
echo "${BLUE}1. Checking Helm release...${NC}"
if helm list -n "$NAMESPACE" | grep -q signoz; then
echo "${GREEN}✅ SigNoz Helm release found${NC}"
else
echo "${RED}❌ SigNoz Helm release not found${NC}"
echo "Please deploy SigNoz first using: ./deploy-signoz.sh $ENVIRONMENT"
exit 1
fi
echo ""
# Check pod status
echo "${BLUE}2. Checking pod status...${NC}"
local total_pods=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/instance=signoz --no-headers 2>/dev/null | wc -l | tr -d ' ')
local running_pods=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/instance=signoz --field-selector=status.phase=Running --no-headers 2>/dev/null | wc -l | tr -d ' ')
# Count pods whose READY column shows all containers ready (handles multi-container pods, not just 1/1)
local ready_pods=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/instance=signoz --no-headers 2>/dev/null | awk '{split($2, a, "/"); if (a[1] == a[2] && $3 == "Running") c++} END {print c+0}')
echo "Total pods: $total_pods"
echo "Running pods: $running_pods"
echo "Ready pods: $ready_pods"
if [[ $total_pods -eq 0 ]]; then
echo "${RED}❌ No SigNoz pods found${NC}"
exit 1
fi
if [[ $running_pods -eq $total_pods ]]; then
echo "${GREEN}✅ All pods are running${NC}"
else
echo "${YELLOW}⚠️ Some pods are not running${NC}"
fi
if [[ $ready_pods -eq $total_pods ]]; then
echo "${GREEN}✅ All pods are ready${NC}"
else
echo "${YELLOW}⚠️ Some pods are not ready${NC}"
fi
echo ""
# Show pod details
echo "${BLUE}Pod Details:${NC}"
kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/instance=signoz
echo ""
# Check services
echo "${BLUE}3. Checking services...${NC}"
local service_count=$(kubectl get svc -n "$NAMESPACE" -l app.kubernetes.io/instance=signoz 2>/dev/null | grep -v "NAME" | wc -l | tr -d ' ' || echo "0")
if [[ $service_count -gt 0 ]]; then
echo "${GREEN}✅ Services found ($service_count services)${NC}"
kubectl get svc -n "$NAMESPACE" -l app.kubernetes.io/instance=signoz
else
echo "${RED}❌ No services found${NC}"
fi
echo ""
# Check ingress
echo "${BLUE}4. Checking ingress...${NC}"
local ingress_count=$(kubectl get ingress -n "$NAMESPACE" -l app.kubernetes.io/instance=signoz 2>/dev/null | grep -v "NAME" | wc -l | tr -d ' ' || echo "0")
if [[ $ingress_count -gt 0 ]]; then
echo "${GREEN}✅ Ingress found ($ingress_count ingress resources)${NC}"
kubectl get ingress -n "$NAMESPACE" -l app.kubernetes.io/instance=signoz
else
echo "${YELLOW}⚠️ No ingress found (may be configured in main namespace)${NC}"
fi
echo ""
# Check PVCs
echo "${BLUE}5. Checking persistent volume claims...${NC}"
local pvc_count=$(kubectl get pvc -n "$NAMESPACE" -l app.kubernetes.io/instance=signoz 2>/dev/null | grep -v "NAME" | wc -l | tr -d ' ' || echo "0")
if [[ $pvc_count -gt 0 ]]; then
echo "${GREEN}✅ PVCs found ($pvc_count PVCs)${NC}"
kubectl get pvc -n "$NAMESPACE" -l app.kubernetes.io/instance=signoz
else
echo "${YELLOW}⚠️ No PVCs found (may not be required for all components)${NC}"
fi
echo ""
# Check resource usage
echo "${BLUE}6. Checking resource usage...${NC}"
if command -v kubectl &> /dev/null && kubectl top pods -n "$NAMESPACE" &> /dev/null; then
echo "${GREEN}✅ Resource usage:${NC}"
kubectl top pods -n "$NAMESPACE" -l app.kubernetes.io/instance=signoz
else
echo "${YELLOW}⚠️ Metrics server not available or no resource usage data${NC}"
fi
echo ""
# Check logs for errors
echo "${BLUE}7. Checking for errors in logs...${NC}"
local error_found=false
# Check each pod for errors
while IFS= read -r pod; do
if [[ -n "$pod" ]]; then
local pod_errors=$(kubectl logs -n "$NAMESPACE" "$pod" 2>/dev/null | grep -i "error\|exception\|fail\|crash" | wc -l || echo "0")
if [[ $pod_errors -gt 0 ]]; then
echo "${RED}❌ Errors found in pod $pod ($pod_errors errors)${NC}"
error_found=true
fi
fi
done < <(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/instance=signoz -o name | sed 's|pod/||')
if [[ "$error_found" == false ]]; then
echo "${GREEN}✅ No errors found in logs${NC}"
fi
echo ""
# Environment-specific checks
if [[ "$ENVIRONMENT" == "dev" ]]; then
verify_dev_specific
else
verify_prod_specific
fi
# Show access information
show_access_info
}
# Function for development-specific verification
verify_dev_specific() {
echo "${BLUE}8. Development-specific checks...${NC}"
# Check if ingress is configured
if kubectl get ingress -n "$NAMESPACE" 2>/dev/null | grep -q "monitoring.bakery-ia.local"; then
echo "${GREEN}✅ Development ingress configured${NC}"
else
echo "${YELLOW}⚠️ Development ingress not found${NC}"
fi
# Check unified signoz component resource limits (should be lower for dev)
local signoz_mem=$(kubectl get deployment -n "$NAMESPACE" -l app.kubernetes.io/component=query-service -o jsonpath='{.items[0].spec.template.spec.containers[0].resources.limits.memory}' 2>/dev/null || echo "")
if [[ -n "$signoz_mem" ]]; then
echo "${GREEN}✅ SigNoz component found (memory limit: $signoz_mem)${NC}"
else
echo "${YELLOW}⚠️ Could not verify SigNoz component resources${NC}"
fi
# Check single replica setup for dev
local replicas=$(kubectl get deployment -n "$NAMESPACE" -l app.kubernetes.io/component=query-service -o jsonpath='{.items[0].spec.replicas}' 2>/dev/null || echo "0")
if [[ $replicas -eq 1 ]]; then
echo "${GREEN}✅ Single replica configuration (appropriate for dev)${NC}"
else
echo "${YELLOW}⚠️ Multiple replicas detected (replicas: $replicas)${NC}"
fi
echo ""
}
# Function for production-specific verification
verify_prod_specific() {
echo "${BLUE}8. Production-specific checks...${NC}"
# Check if TLS is configured
if kubectl get ingress -n "$NAMESPACE" 2>/dev/null | grep -q "signoz-tls"; then
echo "${GREEN}✅ TLS certificate configured${NC}"
else
echo "${YELLOW}⚠️ TLS certificate not found${NC}"
fi
# Check if multiple replicas are running for HA
local signoz_replicas=$(kubectl get deployment -n "$NAMESPACE" -l app.kubernetes.io/component=query-service -o jsonpath='{.items[0].spec.replicas}' 2>/dev/null || echo "1")
if [[ $signoz_replicas -gt 1 ]]; then
echo "${GREEN}✅ High availability configured ($signoz_replicas SigNoz replicas)${NC}"
else
echo "${YELLOW}⚠️ Single SigNoz replica detected (not highly available)${NC}"
fi
# Check Zookeeper replicas (critical for production)
local zk_replicas=$(kubectl get statefulset -n "$NAMESPACE" -l app.kubernetes.io/component=zookeeper -o jsonpath='{.items[0].spec.replicas}' 2>/dev/null || echo "0")
if [[ $zk_replicas -eq 3 ]]; then
echo "${GREEN}✅ Zookeeper properly configured with 3 replicas${NC}"
elif [[ $zk_replicas -gt 0 ]]; then
echo "${YELLOW}⚠️ Zookeeper has $zk_replicas replicas (recommend 3 for production)${NC}"
else
echo "${RED}❌ Zookeeper not found${NC}"
fi
# Check OTel Collector replicas
local otel_replicas=$(kubectl get deployment -n "$NAMESPACE" -l app.kubernetes.io/component=otel-collector -o jsonpath='{.items[0].spec.replicas}' 2>/dev/null || echo "1")
if [[ $otel_replicas -gt 1 ]]; then
echo "${GREEN}✅ OTel Collector HA configured ($otel_replicas replicas)${NC}"
else
echo "${YELLOW}⚠️ Single OTel Collector replica${NC}"
fi
# Check resource limits (should be higher for prod)
local signoz_mem=$(kubectl get deployment -n "$NAMESPACE" -l app.kubernetes.io/component=query-service -o jsonpath='{.items[0].spec.template.spec.containers[0].resources.limits.memory}' 2>/dev/null || echo "")
if [[ -n "$signoz_mem" ]]; then
echo "${GREEN}✅ Production resource limits applied (memory: $signoz_mem)${NC}"
else
echo "${YELLOW}⚠️ Could not verify resource limits${NC}"
fi
# Check HPA (Horizontal Pod Autoscaler)
local hpa_count=$(kubectl get hpa -n "$NAMESPACE" 2>/dev/null | grep -c signoz || true)
if [[ $hpa_count -gt 0 ]]; then
echo "${GREEN}✅ Horizontal Pod Autoscaler configured${NC}"
else
echo "${YELLOW}⚠️ No HPA found (consider enabling for production)${NC}"
fi
echo ""
}
# Function to show access information
show_access_info() {
echo "${BLUE}"
echo "=========================================="
echo "📋 Access Information"
echo "=========================================="
echo "${NC}"
if [[ "$ENVIRONMENT" == "dev" ]]; then
echo "SigNoz UI: http://monitoring.bakery-ia.local"
echo ""
echo "OpenTelemetry Collector (within cluster):"
echo " gRPC: signoz-otel-collector.$NAMESPACE.svc.cluster.local:4317"
echo " HTTP: signoz-otel-collector.$NAMESPACE.svc.cluster.local:4318"
echo ""
echo "Port-forward for local access:"
echo " kubectl port-forward -n $NAMESPACE svc/signoz 8080:8080"
echo " kubectl port-forward -n $NAMESPACE svc/signoz-otel-collector 4317:4317"
echo " kubectl port-forward -n $NAMESPACE svc/signoz-otel-collector 4318:4318"
else
echo "SigNoz UI: https://monitoring.bakewise.ai"
echo ""
echo "OpenTelemetry Collector (within cluster):"
echo " gRPC: signoz-otel-collector.$NAMESPACE.svc.cluster.local:4317"
echo " HTTP: signoz-otel-collector.$NAMESPACE.svc.cluster.local:4318"
fi
echo ""
echo "Default Credentials:"
echo " Username: admin@example.com"
echo " Password: admin"
echo ""
echo "⚠️ IMPORTANT: Change default password after first login!"
echo ""
# Show connection test commands
echo "Connection Test Commands:"
if [[ "$ENVIRONMENT" == "dev" ]]; then
echo " # Test SigNoz UI"
echo " curl http://monitoring.bakery-ia.local"
echo ""
echo " # Test via port-forward"
echo " kubectl port-forward -n $NAMESPACE svc/signoz 8080:8080"
echo " curl http://localhost:8080"
else
echo " # Test SigNoz UI"
echo " curl https://monitoring.bakewise.ai"
echo ""
echo " # Test API health"
echo " kubectl port-forward -n $NAMESPACE svc/signoz 8080:8080"
echo " curl http://localhost:8080/api/v1/health"
fi
echo ""
}
# Function to run connectivity tests
run_connectivity_tests() {
echo "${BLUE}"
echo "=========================================="
echo "🔗 Running Connectivity Tests"
echo "=========================================="
echo "${NC}"
# Test pod readiness first
echo "Checking pod readiness..."
local ready_pods=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/instance=signoz --field-selector=status.phase=Running --no-headers 2>/dev/null | grep -c "1/1\|2/2" || true)
local total_pods=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/instance=signoz --no-headers 2>/dev/null | wc -l | tr -d ' ')
if [[ $ready_pods -eq $total_pods && $total_pods -gt 0 ]]; then
echo "${GREEN}✅ All pods are ready ($ready_pods/$total_pods)${NC}"
else
echo "${YELLOW}⚠️ Some pods not ready ($ready_pods/$total_pods)${NC}"
fi
echo ""
# Test internal service connectivity
echo "Testing internal service connectivity..."
local signoz_svc=$(kubectl get svc -n "$NAMESPACE" signoz -o jsonpath='{.spec.clusterIP}' 2>/dev/null || echo "")
if [[ -n "$signoz_svc" ]]; then
echo "${GREEN}✅ SigNoz service accessible at $signoz_svc:8080${NC}"
else
echo "${RED}❌ SigNoz service not found${NC}"
fi
local otel_svc=$(kubectl get svc -n "$NAMESPACE" signoz-otel-collector -o jsonpath='{.spec.clusterIP}' 2>/dev/null || echo "")
if [[ -n "$otel_svc" ]]; then
echo "${GREEN}✅ OTel Collector service accessible at $otel_svc:4317 (gRPC), $otel_svc:4318 (HTTP)${NC}"
else
echo "${RED}❌ OTel Collector service not found${NC}"
fi
echo ""
if [[ "$ENVIRONMENT" == "prod" ]]; then
echo "${YELLOW}⚠️ Production connectivity tests require valid DNS and TLS${NC}"
echo " Please ensure monitoring.bakewise.ai resolves to your cluster"
echo ""
echo "Manual test:"
echo " curl -I https://monitoring.bakewise.ai"
fi
}
# Main execution
main() {
echo "${BLUE}"
echo "=========================================="
echo "🔍 SigNoz Verification for Bakery IA"
echo "=========================================="
echo "${NC}"
# Check prerequisites
check_kubectl
check_namespace
# Verify deployment
verify_deployment
# Run connectivity tests
run_connectivity_tests
echo "${GREEN}"
echo "=========================================="
echo "✅ Verification Complete"
echo "=========================================="
echo "${NC}"
echo "Summary:"
echo " Environment: $ENVIRONMENT"
echo " Namespace: $NAMESPACE"
echo ""
echo "Next Steps:"
echo " 1. Access SigNoz UI and verify dashboards"
echo " 2. Configure alert rules for your services"
echo " 3. Instrument your applications with OpenTelemetry"
echo " 4. Set up custom dashboards for key metrics"
echo ""
}
# Run main function, forwarding any script arguments
main "$@"