Add new infra architecture
This commit is contained in:
619
infrastructure/monitoring/signoz/README.md
Normal file
@@ -0,0 +1,619 @@

# SigNoz Helm Deployment for Bakery IA

This directory contains Helm configurations and deployment scripts for the SigNoz observability platform.

## Overview

SigNoz is deployed using the official Helm chart with environment-specific configurations optimized for:
- **Development**: Colima + Kind (Kubernetes in Docker) with Tilt
- **Production**: VPS on clouding.io with MicroK8s

## Prerequisites

### Required Tools
- **kubectl** 1.22+
- **Helm** 3.8+
- **Docker** (for development)
- **Kind/MicroK8s** (environment-specific)

### Docker Hub Authentication

SigNoz uses images from Docker Hub. Set up authentication to avoid rate limits:

```bash
# Option 1: Environment variables (recommended)
export DOCKERHUB_USERNAME='your-username'
export DOCKERHUB_PASSWORD='your-personal-access-token'

# Option 2: Docker login
docker login
```

## Quick Start

### Development Deployment

```bash
# Deploy SigNoz to the development environment
./deploy-signoz.sh dev

# Verify deployment
./verify-signoz.sh dev

# Access SigNoz UI
# Via ingress: http://monitoring.bakery-ia.local
# Or port-forward:
kubectl port-forward -n signoz svc/signoz 8080:8080
# Then open: http://localhost:8080
```

### Production Deployment

```bash
# Deploy SigNoz to the production environment
./deploy-signoz.sh prod

# Verify deployment
./verify-signoz.sh prod

# Access SigNoz UI
# https://monitoring.bakewise.ai
```

## Configuration Files

### signoz-values-dev.yaml

Development environment configuration with:
- Single replica for most components
- Reduced resource requests (optimized for a local Kind cluster)
- 7-day data retention
- Batch size: 10,000 events
- ClickHouse 25.5.6, OTel Collector v0.129.12
- PostgreSQL, Redis, and RabbitMQ receivers configured

### signoz-values-prod.yaml

Production environment configuration with:
- High availability: 2+ replicas for critical components
- 3 Zookeeper replicas (required for production)
- 30-day data retention
- Batch size: 50,000 events (high-performance)
- Cold storage enabled with 30-day TTL
- Horizontal Pod Autoscaler (HPA) enabled
- TLS/SSL with cert-manager
- Enhanced security with pod anti-affinity rules

## Key Configuration Changes (v0.89.0+)

⚠️ **BREAKING CHANGE**: SigNoz Helm chart v0.89.0+ uses a unified component structure.

**Old Structure (deprecated):**
```yaml
frontend:
  replicaCount: 2
queryService:
  replicaCount: 2
```

**New Structure (current):**
```yaml
signoz:
  replicaCount: 2
  # Combines frontend + query service
```

## Component Architecture

### Core Components

1. **SigNoz** (unified component)
   - Frontend UI + Query Service
   - Ports: 8080 (HTTP/API), 8085 (internal gRPC)
   - Dev: 1 replica; Prod: 2+ replicas with HPA

2. **ClickHouse** (time-series database)
   - Version: 25.5.6
   - Stores traces, metrics, and logs
   - Dev: 1 replica; Prod: 2 replicas with cold storage

3. **Zookeeper** (ClickHouse coordination)
   - Version: 3.7.1
   - Dev: 1 replica; Prod: 3 replicas (critical for HA)

4. **OpenTelemetry Collector** (data ingestion)
   - Version: v0.129.12
   - Ports: 4317 (gRPC), 4318 (HTTP), 8888 (metrics)
   - Dev: 1 replica; Prod: 2+ replicas with HPA

5. **Alertmanager** (alert management)
   - Version: 0.23.5
   - Email and Slack integrations configured
   - Port: 9093

## Performance Optimizations

### Batch Processing
- **Development**: 10,000 events per batch
- **Production**: 50,000 events per batch (official recommendation)
- Timeout: 1 second for faster processing
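
The batch figures above correspond to the OpenTelemetry Collector's standard `batch` processor. A sketch of what the production settings could look like in the values file (key names follow the upstream collector; the exact nesting under `otelCollector.config` is an assumption here, not copied from the values files):

```yaml
otelCollector:
  config:
    processors:
      batch:
        send_batch_size: 50000  # production batch size noted above
        timeout: 1s             # flush at least once per second
```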

### Memory Management
- Memory limiter processor prevents OOM
- Dev: 400 MiB limit; Prod: 1500 MiB limit
- Spike limits configured
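
These limits map onto the collector's `memory_limiter` processor. A hedged sketch sized for dev (the 400 MiB hard limit comes from the bullet above; the spike limit value is illustrative):

```yaml
otelCollector:
  config:
    processors:
      memory_limiter:
        check_interval: 1s
        limit_mib: 400        # dev hard limit from above
        spike_limit_mib: 100  # illustrative spike headroom
```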

### Span Metrics Processor
Automatically generates RED metrics (Rate, Errors, Duration):
- Latency histogram buckets optimized for microservices
- Cache size: 10K (dev), 100K (prod)

### Cold Storage (Production Only)
- Enabled with 30-day TTL
- Automatically moves old data to cold storage
- Keeps 10 GB free on primary storage

## OpenTelemetry Endpoints

### From Within the Kubernetes Cluster

**Development:**
```
OTLP gRPC: signoz-otel-collector.bakery-ia.svc.cluster.local:4317
OTLP HTTP: signoz-otel-collector.bakery-ia.svc.cluster.local:4318
```

**Production:**
```
OTLP gRPC: signoz-otel-collector.bakery-ia.svc.cluster.local:4317
OTLP HTTP: signoz-otel-collector.bakery-ia.svc.cluster.local:4318
```

### Application Configuration Example

```yaml
# Python with OpenTelemetry
OTEL_EXPORTER_OTLP_ENDPOINT: "http://signoz-otel-collector.bakery-ia.svc.cluster.local:4318"
OTEL_EXPORTER_OTLP_PROTOCOL: "http/protobuf"
```

```javascript
// Node.js with OpenTelemetry
const exporter = new OTLPTraceExporter({
  url: 'http://signoz-otel-collector.bakery-ia.svc.cluster.local:4318/v1/traces',
});
```
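
With OTLP over HTTP, signal-specific paths (`/v1/traces`, `/v1/metrics`, `/v1/logs`) are appended to the base endpoint, which is why the Node.js example above uses the full `/v1/traces` URL while the Python example sets only the base endpoint. A small helper illustrating the convention (the default below reuses the in-cluster service name from this README):

```python
import os

OTLP_HTTP_DEFAULT = "http://signoz-otel-collector.bakery-ia.svc.cluster.local:4318"

def otlp_signal_url(signal, base=None):
    """Build the OTLP/HTTP URL for a signal ("traces", "metrics", or "logs")."""
    base = base or os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", OTLP_HTTP_DEFAULT)
    return f"{base.rstrip('/')}/v1/{signal}"

# otlp_signal_url("traces") yields the same URL the Node.js exporter above is given.
```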

## Deployment Scripts

### deploy-signoz.sh

Comprehensive deployment script with the following features:

```bash
# Usage
./deploy-signoz.sh [OPTIONS] ENVIRONMENT

# Options
-h, --help           Show help message
-d, --dry-run        Show what would be deployed
-u, --upgrade        Upgrade existing deployment
-r, --remove         Remove deployment
-n, --namespace NS   Custom namespace (default: signoz)

# Examples
./deploy-signoz.sh dev                # Deploy to dev
./deploy-signoz.sh --upgrade prod     # Upgrade prod
./deploy-signoz.sh --dry-run prod     # Preview changes
./deploy-signoz.sh --remove dev       # Remove dev deployment
```

**Features:**
- Automatic Helm repository setup
- Docker Hub secret creation
- Namespace management
- Deployment verification
- 15-minute timeout with `--wait` flag
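
The Docker Hub secret step can also be performed by hand. A sketch of the likely equivalent `kubectl` command (the secret name `dockerhub-creds` matches the one referenced in the troubleshooting section; the exact flags used by the script are an assumption):

```shell
# Manual equivalent of the script's Docker Hub secret creation (sketch)
kubectl create secret docker-registry dockerhub-creds \
  --docker-username="$DOCKERHUB_USERNAME" \
  --docker-password="$DOCKERHUB_PASSWORD" \
  --namespace signoz
```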

### verify-signoz.sh

Verification script to check deployment health:

```bash
# Usage
./verify-signoz.sh [OPTIONS] ENVIRONMENT

# Examples
./verify-signoz.sh dev    # Verify dev deployment
./verify-signoz.sh prod   # Verify prod deployment
```

**Checks performed:**
1. ✅ Helm release status
2. ✅ Pod health and readiness
3. ✅ Service availability
4. ✅ Ingress configuration
5. ✅ PVC status
6. ✅ Resource usage (if metrics-server is available)
7. ✅ Log errors
8. ✅ Environment-specific validations
   - Dev: single replica, resource limits
   - Prod: HA config, TLS, Zookeeper replicas, HPA

## Storage Configuration

### Development (Kind)
```yaml
global:
  storageClass: "standard"  # Kind's default provisioner
```

### Production (MicroK8s)
```yaml
global:
  storageClass: "microk8s-hostpath"  # Or a custom storage class
```

**Storage Requirements:**
- **Development**: ~35 GiB total
  - SigNoz: 5 GiB
  - ClickHouse: 20 GiB
  - Zookeeper: 5 GiB
  - Alertmanager: 2 GiB

- **Production**: ~135 GiB total
  - SigNoz: 20 GiB
  - ClickHouse: 100 GiB
  - Zookeeper: 10 GiB
  - Alertmanager: 5 GiB
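
As a quick sanity check, the per-component figures sum as follows. Note the listed dev components add up to 32 GiB, so the ~35 GiB figure includes a little headroom:

```python
def total_gib(components):
    """Sum per-component PVC sizes (GiB) from the lists above."""
    return sum(components.values())

dev = {"signoz": 5, "clickhouse": 20, "zookeeper": 5, "alertmanager": 2}
prod = {"signoz": 20, "clickhouse": 100, "zookeeper": 10, "alertmanager": 5}

print(total_gib(dev))   # 32
print(total_gib(prod))  # 135
```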

## Resource Requirements

### Development Environment
**Minimum:**
- CPU: 550m (0.55 cores)
- Memory: 1.6 GiB
- Storage: 35 GiB

**Recommended:**
- CPU: 3 cores
- Memory: 3 GiB
- Storage: 50 GiB

### Production Environment
**Minimum:**
- CPU: 3.5 cores
- Memory: 8 GiB
- Storage: 135 GiB

**Recommended:**
- CPU: 12 cores
- Memory: 20 GiB
- Storage: 200 GiB

## Data Retention

### Development
- Traces: 7 days (168 hours)
- Metrics: 7 days (168 hours)
- Logs: 7 days (168 hours)

### Production
- Traces: 30 days (720 hours)
- Metrics: 30 days (720 hours)
- Logs: 30 days (720 hours)
- Cold storage after 30 days

To modify retention, update the environment variables:
```yaml
signoz:
  env:
    signoz_traces_ttl_duration_hrs: "720"   # 30 days
    signoz_metrics_ttl_duration_hrs: "720"  # 30 days
    signoz_logs_ttl_duration_hrs: "168"     # 7 days
```
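
The TTL values are plain hour counts, so converting a retention policy expressed in days is just a multiplication. A tiny helper for generating the strings used above:

```python
def ttl_hours(days):
    """Convert a retention period in days to the hour-string format of the TTL env vars."""
    return str(days * 24)

print(ttl_hours(30))  # "720" -> e.g. signoz_traces_ttl_duration_hrs
print(ttl_hours(7))   # "168"
```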

## High Availability (Production)

### Replication Strategy
```yaml
signoz: 2 replicas + HPA (min: 2, max: 5)
clickhouse: 2 replicas
zookeeper: 3 replicas (critical!)
otelCollector: 2 replicas + HPA (min: 2, max: 10)
alertmanager: 2 replicas
```

### Pod Anti-Affinity
Ensures pods are distributed across different nodes:
```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app.kubernetes.io/component: query-service
          topologyKey: kubernetes.io/hostname
```

### Pod Disruption Budgets
Configured for all critical components:
```yaml
podDisruptionBudget:
  enabled: true
  minAvailable: 1
```

## Monitoring and Alerting

### Email Alerts (Production)
Configure SMTP in the production values (using Mailu with a Mailgun relay):
```yaml
signoz:
  env:
    signoz_smtp_enabled: "true"
    signoz_smtp_host: "mailu-smtp.bakery-ia.svc.cluster.local"
    signoz_smtp_port: "587"
    signoz_smtp_from: "alerts@bakewise.ai"
    signoz_smtp_username: "alerts@bakewise.ai"
    # Set via secret: signoz_smtp_password
```

**Note**: SigNoz now uses the internal Mailu SMTP service, which relays to Mailgun for better deliverability and centralized email management.

### Slack Alerts (Production)
Configure the webhook in Alertmanager:
```yaml
alertmanager:
  config:
    receivers:
      - name: 'critical-alerts'
        slack_configs:
          - api_url: '${SLACK_WEBHOOK_URL}'
            channel: '#alerts-critical'
```

### Mailgun Integration for Alert Emails

SigNoz is configured to send alert emails via Mailgun through the Mailu SMTP service. This provides:

**Benefits:**
- Better email deliverability through Mailgun's infrastructure
- Centralized email management via Mailu
- Improved tracking and analytics for alert emails
- Compliance with email-sending best practices

**Architecture:**
```
SigNoz Alertmanager → Mailu SMTP → Mailgun Relay → Recipients
```

**Configuration Requirements:**

1. **Mailu Configuration** (`infrastructure/platform/mail/mailu/mailu-configmap.yaml`):
   ```yaml
   RELAYHOST: "smtp.mailgun.org:587"
   RELAY_LOGIN: "postmaster@bakewise.ai"
   ```

2. **Mailu Secrets** (`infrastructure/platform/mail/mailu/mailu-secrets.yaml`):
   ```yaml
   RELAY_PASSWORD: "<mailgun-api-key>"  # Base64-encoded Mailgun API key
   ```

3. **DNS Configuration** (required for Mailgun):
   ```
   # MX record
   bakewise.ai. IN MX 10 mail.bakewise.ai.

   # SPF record (authorize Mailgun)
   bakewise.ai. IN TXT "v=spf1 include:mailgun.org ~all"

   # DKIM record (provided by Mailgun)
   m1._domainkey.bakewise.ai. IN TXT "v=DKIM1; k=rsa; p=<mailgun-public-key>"

   # DMARC record
   _dmarc.bakewise.ai. IN TXT "v=DMARC1; p=quarantine; rua=mailto:dmarc@bakewise.ai"
   ```

4. **SigNoz SMTP Configuration** (already configured in `signoz-values-prod.yaml`):
   ```yaml
   signoz_smtp_host: "mailu-smtp.bakery-ia.svc.cluster.local"
   signoz_smtp_port: "587"
   signoz_smtp_from: "alerts@bakewise.ai"
   ```

**Testing the Integration:**

1. Trigger a test alert from the SigNoz UI
2. Check Mailu logs: `kubectl logs -f mailu-smtp-<pod-id> -n bakery-ia`
3. Check the Mailgun dashboard for delivery status
4. Verify email receipt in the destination inbox

**Troubleshooting:**

- **SMTP Authentication Failed**: Verify the Mailu credentials and Mailgun API key
- **Email Delivery Delays**: Check the Mailu queue with `kubectl exec -it mailu-smtp-<pod-id> -n bakery-ia -- mailq`
- **SPF/DKIM Issues**: Verify the DNS records and Mailgun domain verification

### Self-Monitoring
SigNoz monitors itself:
```yaml
selfMonitoring:
  enabled: true
  serviceMonitor:
    enabled: true  # Prod only
    interval: 30s
```

## Troubleshooting

### Common Issues

**1. Pods not starting**
```bash
# Check pod status
kubectl get pods -n signoz

# Check pod logs
kubectl logs -n signoz <pod-name>

# Describe pod for events
kubectl describe pod -n signoz <pod-name>
```

**2. Docker Hub rate limits**
```bash
# Verify the secret exists
kubectl get secret dockerhub-creds -n signoz

# Recreate the secret
kubectl delete secret dockerhub-creds -n signoz
export DOCKERHUB_USERNAME='your-username'
export DOCKERHUB_PASSWORD='your-token'
./deploy-signoz.sh dev
```

**3. ClickHouse connection issues**
```bash
# Check the ClickHouse pod
kubectl logs -n signoz -l app.kubernetes.io/component=clickhouse

# Check Zookeeper (required by ClickHouse)
kubectl logs -n signoz -l app.kubernetes.io/component=zookeeper
```

**4. OTel Collector not receiving data**
```bash
# Check OTel Collector logs
kubectl logs -n signoz -l app.kubernetes.io/component=otel-collector

# Test connectivity
kubectl port-forward -n signoz svc/signoz-otel-collector 4318:4318
curl -v http://localhost:4318/v1/traces
```

**5. Insufficient storage**
```bash
# Check PVC status
kubectl get pvc -n signoz

# Check storage usage (if metrics-server is available)
kubectl top pods -n signoz
```

### Debug Mode

Enable the debug exporter in the OTel Collector:
```yaml
otelCollector:
  config:
    exporters:
      debug:
        verbosity: detailed
        sampling_initial: 5
        sampling_thereafter: 200
    service:
      pipelines:
        traces:
          exporters: [clickhousetraces, debug]  # Add debug
```

### Upgrade from an Old Version

If upgrading from pre-v0.89.0:
```bash
# 1. Back up data (recommended)
kubectl get all -n signoz -o yaml > signoz-backup.yaml

# 2. Remove the old deployment
./deploy-signoz.sh --remove prod

# 3. Deploy the new version
./deploy-signoz.sh prod

# 4. Verify
./verify-signoz.sh prod
```

## Security Best Practices

1. **Change the default password** immediately after first login
2. **Use TLS/SSL** in production (configured with cert-manager)
3. **Network policies** enabled in production
4. **Run as non-root** (configured in securityContext)
5. **RBAC** with a dedicated service account
6. **Secrets management** for sensitive data (SMTP, Slack webhooks)
7. **Image pull secrets** to avoid exposing Docker Hub credentials

## Backup and Recovery

### Backup ClickHouse Data
```bash
# Export ClickHouse data
kubectl exec -n signoz <clickhouse-pod> -- clickhouse-client \
  --query="BACKUP DATABASE signoz_traces TO Disk('backups', 'traces_backup.zip')"

# Copy the backup out
kubectl cp signoz/<clickhouse-pod>:/var/lib/clickhouse/backups/ ./backups/
```

### Restore from Backup
```bash
# Copy the backup in
kubectl cp ./backups/ signoz/<clickhouse-pod>:/var/lib/clickhouse/backups/

# Restore
kubectl exec -n signoz <clickhouse-pod> -- clickhouse-client \
  --query="RESTORE DATABASE signoz_traces FROM Disk('backups', 'traces_backup.zip')"
```

## Updating Configuration

To update the SigNoz configuration:

1. Edit the values file: `signoz-values-{env}.yaml`
2. Apply the changes:
   ```bash
   ./deploy-signoz.sh --upgrade {env}
   ```
3. Verify:
   ```bash
   ./verify-signoz.sh {env}
   ```

## Uninstallation

```bash
# Remove the SigNoz deployment
./deploy-signoz.sh --remove {env}

# Optionally delete PVCs (WARNING: deletes all data)
kubectl delete pvc -n signoz -l app.kubernetes.io/instance=signoz

# Optionally delete the namespace
kubectl delete namespace signoz
```

## References

- [SigNoz Official Documentation](https://signoz.io/docs/)
- [SigNoz Helm Charts Repository](https://github.com/SigNoz/charts)
- [OpenTelemetry Documentation](https://opentelemetry.io/docs/)
- [ClickHouse Documentation](https://clickhouse.com/docs/)

## Support

For issues or questions:
1. Check [SigNoz GitHub Issues](https://github.com/SigNoz/signoz/issues)
2. Review deployment logs: `kubectl logs -n signoz <pod-name>`
3. Run the verification script: `./verify-signoz.sh {env}`
4. Check the [SigNoz Community Slack](https://signoz.io/slack)

---

**Last Updated**: 2026-01-09
**SigNoz Helm Chart Version**: Latest (v0.129.12 components)
**Maintained by**: Bakery IA Team
190
infrastructure/monitoring/signoz/dashboards/README.md
Normal file
@@ -0,0 +1,190 @@

# SigNoz Dashboards for Bakery IA

This directory contains comprehensive SigNoz dashboard configurations for monitoring the Bakery IA system.

## Available Dashboards

### 1. Infrastructure Monitoring
- **File**: `infrastructure-monitoring.json`
- **Purpose**: Monitor Kubernetes infrastructure, pod health, and resource utilization
- **Key Metrics**: CPU usage, memory usage, network traffic, pod status, container health

### 2. Application Performance
- **File**: `application-performance.json`
- **Purpose**: Monitor microservice performance and API metrics
- **Key Metrics**: Request rate, error rate, latency percentiles, endpoint performance

### 3. Database Performance
- **File**: `database-performance.json`
- **Purpose**: Monitor PostgreSQL and Redis database performance
- **Key Metrics**: Connections, query execution time, cache hit ratio, locks, replication status

### 4. API Performance
- **File**: `api-performance.json`
- **Purpose**: Monitor REST and GraphQL API performance
- **Key Metrics**: Request volume, response times, status codes, endpoint analysis

### 5. Error Tracking
- **File**: `error-tracking.json`
- **Purpose**: Track and analyze system errors
- **Key Metrics**: Error rates, error distribution, recent errors, HTTP errors, database errors

### 6. User Activity
- **File**: `user-activity.json`
- **Purpose**: Monitor user behavior and activity patterns
- **Key Metrics**: Active users, sessions, API calls per user, session duration

### 7. System Health
- **File**: `system-health.json`
- **Purpose**: Overall system health monitoring
- **Key Metrics**: Availability, health scores, resource utilization, service status

### 8. Alert Management
- **File**: `alert-management.json`
- **Purpose**: Monitor and manage system alerts
- **Key Metrics**: Active alerts, alert rates, alert distribution, firing alerts

### 9. Log Analysis
- **File**: `log-analysis.json`
- **Purpose**: Search and analyze system logs
- **Key Metrics**: Log volume, error logs, log distribution, log search

## How to Import Dashboards

### Method 1: Using the SigNoz UI

1. **Access the SigNoz UI**: Open your SigNoz instance in a web browser
2. **Navigate to Dashboards**: Go to the "Dashboards" section
3. **Import Dashboard**: Click the "Import Dashboard" button
4. **Upload JSON**: Select the JSON file from this directory
5. **Configure**: Adjust any variables or settings as needed
6. **Save**: Save the imported dashboard

**Note**: The dashboards now use the correct SigNoz JSON schema with proper filter arrays.

### Method 2: Using the SigNoz API

```bash
# Import a single dashboard
curl -X POST "http://<SIGNOZ_HOST>:3301/api/v1/dashboards/import" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <API_KEY>" \
  -d @infrastructure-monitoring.json

# Import all dashboards
for file in *.json; do
  curl -X POST "http://<SIGNOZ_HOST>:3301/api/v1/dashboards/import" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer <API_KEY>" \
    -d @"$file"
done
```

### Method 3: Using a Kubernetes ConfigMap

```bash
# Create a ConfigMap with all dashboards
kubectl create configmap signoz-dashboards \
  --from-file=infrastructure-monitoring.json \
  --from-file=application-performance.json \
  --from-file=database-performance.json \
  --from-file=api-performance.json \
  --from-file=error-tracking.json \
  --from-file=user-activity.json \
  --from-file=system-health.json \
  --from-file=alert-management.json \
  --from-file=log-analysis.json \
  -n signoz
```

## Dashboard Variables

Most dashboards include variables that let you filter and customize the view:

- **Namespace**: Filter by Kubernetes namespace (e.g., `bakery-ia`, `default`)
- **Service**: Filter by specific microservice
- **Severity**: Filter by error/alert severity
- **Environment**: Filter by deployment environment
- **Time Range**: Adjust the time window for analysis
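
In the dashboard JSON, these variables follow SigNoz's variable schema, as used by the files in this directory. For example, the `service` variable in `alert-management.json` is defined roughly like this (abridged excerpt; see the file itself for the full field list):

```json
"variables": {
  "service": {
    "name": "service",
    "description": "Filter by service name",
    "type": "QUERY",
    "queryValue": "SELECT DISTINCT(resource_attrs['service.name']) as value FROM ...",
    "showALLOption": true,
    "multiSelect": false,
    "sort": "ASC"
  }
}
```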

## Metrics Reference

The dashboards use standard OpenTelemetry metrics. If you need to add custom metrics, ensure they are properly instrumented in your services.

## Troubleshooting

### Dashboard Import Errors

If you encounter errors when importing dashboards:

1. **Validate JSON**: Ensure the JSON files are valid
   ```bash
   jq . infrastructure-monitoring.json
   ```

2. **Check Metrics**: Verify that the metrics exist in your SigNoz instance

3. **Adjust Time Range**: Try different time ranges if no data appears

4. **Check Filters**: Ensure filters match your actual service names and tags

### "e.filter is not a function" Error

This error occurs when the dashboard JSON uses an incorrect filter format. The fix has been applied:

**Before (incorrect)**:
```json
"filters": {
  "namespace": "${namespace}"
}
```

**After (correct)**:
```json
"filters": [
  {
    "key": "namespace",
    "operator": "=",
    "value": "${namespace}"
  }
]
```

All dashboards in this directory now use the correct array format for filters.

### Missing Data

If dashboards show no data:

1. **Verify Instrumentation**: Ensure your services are properly instrumented with OpenTelemetry
2. **Check Time Range**: Adjust the time range to include recent data
3. **Validate Metrics**: Confirm the metrics are being collected and stored
4. **Review Filters**: Check that filters match your actual deployment

## Customization

You can customize these dashboards by:

1. **Editing JSON**: Modify the JSON files to add/remove panels or adjust queries
2. **Cloning in the UI**: Clone existing dashboards and modify them in the SigNoz UI
3. **Adding Variables**: Add new variables for additional filtering options
4. **Adjusting Layout**: Change the grid layout and panel sizes

## Best Practices

1. **Regular Reviews**: Review dashboards regularly to ensure they meet your monitoring needs
2. **Alert Integration**: Set up alerts based on key metrics shown in these dashboards
3. **Team Access**: Share relevant dashboards with appropriate team members
4. **Documentation**: Document any custom metrics or specific monitoring requirements

## Support

For issues with these dashboards:

1. Check the [SigNoz documentation](https://signoz.io/docs/)
2. Review the [Bakery IA monitoring guide](../SIGNOZ_COMPLETE_CONFIGURATION_GUIDE.md)
3. Consult the OpenTelemetry metrics specification

## License

These dashboard configurations are provided under the same license as the Bakery IA project.

@@ -0,0 +1,170 @@
{
  "description": "Alert monitoring and management dashboard",
  "tags": ["alerts", "monitoring", "management"],
  "name": "bakery-ia-alert-management",
  "title": "Bakery IA - Alert Management",
  "uploadedGrafana": false,
  "uuid": "bakery-ia-alerts-01",
  "version": "v4",
  "collapsableRowsMigrated": true,
  "layout": [
    {
      "x": 0,
      "y": 0,
      "w": 6,
      "h": 3,
      "i": "active-alerts",
      "moved": false,
      "static": false
    },
    {
      "x": 6,
      "y": 0,
      "w": 6,
      "h": 3,
      "i": "alert-rate",
      "moved": false,
      "static": false
    }
  ],
  "variables": {
    "service": {
      "id": "service-var",
      "name": "service",
      "description": "Filter by service name",
      "type": "QUERY",
      "queryValue": "SELECT DISTINCT(resource_attrs['service.name']) as value FROM signoz_metrics.distributed_time_series_v4_1day WHERE metric_name = 'alerts_active' AND value != '' ORDER BY value",
      "customValue": "",
      "textboxValue": "",
      "showALLOption": true,
      "multiSelect": false,
      "order": 1,
      "modificationUUID": "",
      "sort": "ASC",
      "selectedValue": null
    }
  },
  "widgets": [
    {
      "id": "active-alerts",
      "title": "Active Alerts",
      "description": "Number of currently active alerts",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "value",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "sum",
              "aggregateAttribute": {
                "key": "alerts_active",
                "dataType": "int64",
                "type": "Gauge",
                "isColumn": false
              },
              "timeAggregation": "latest",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": {
                "items": [
                  {
                    "key": {
                      "key": "serviceName",
                      "dataType": "string",
                      "type": "tag",
                      "isColumn": true
                    },
                    "op": "=",
                    "value": "{{.service}}"
                  }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [],
              "legend": "Active Alerts",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "none"
    },
    {
      "id": "alert-rate",
      "title": "Alert Rate",
      "description": "Rate of alerts over time",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "sum",
              "aggregateAttribute": {
                "key": "alerts_total",
                "dataType": "int64",
                "type": "Counter",
                "isColumn": false
              },
              "timeAggregation": "rate",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": {
                "items": [
                  {
                    "key": {
                      "key": "serviceName",
                      "dataType": "string",
                      "type": "tag",
                      "isColumn": true
                    },
                    "op": "=",
                    "value": "{{.service}}"
                  }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                {
                  "key": "serviceName",
                  "dataType": "string",
                  "type": "tag",
                  "isColumn": true
                }
              ],
              "legend": "{{serviceName}}",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "alerts/s"
    }
  ]
}
351
infrastructure/monitoring/signoz/dashboards/api-performance.json
Normal file
@@ -0,0 +1,351 @@
|
||||
{
  "description": "Comprehensive API performance monitoring for Bakery IA REST and GraphQL endpoints",
  "tags": ["api", "performance", "rest", "graphql"],
  "name": "bakery-ia-api-performance",
  "title": "Bakery IA - API Performance",
  "uploadedGrafana": false,
  "uuid": "bakery-ia-api-01",
  "version": "v4",
  "collapsableRowsMigrated": true,
  "layout": [
    { "x": 0, "y": 0, "w": 6, "h": 3, "i": "request-volume", "moved": false, "static": false },
    { "x": 6, "y": 0, "w": 6, "h": 3, "i": "error-rate", "moved": false, "static": false },
    { "x": 0, "y": 3, "w": 6, "h": 3, "i": "avg-response-time", "moved": false, "static": false },
    { "x": 6, "y": 3, "w": 6, "h": 3, "i": "p95-latency", "moved": false, "static": false }
  ],
  "variables": {
    "service": {
      "id": "service-var",
      "name": "service",
      "description": "Filter by API service",
      "type": "QUERY",
      "queryValue": "SELECT DISTINCT(resource_attrs['service.name']) as value FROM signoz_metrics.distributed_time_series_v4_1day WHERE metric_name = 'http_server_requests_seconds_count' AND value != '' ORDER BY value",
      "customValue": "",
      "textboxValue": "",
      "showALLOption": true,
      "multiSelect": false,
      "order": 1,
      "modificationUUID": "",
      "sort": "ASC",
      "selectedValue": null
    }
  },
  "widgets": [
    {
      "id": "request-volume",
      "title": "Request Volume",
      "description": "API request volume by service",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "sum",
              "aggregateAttribute": { "key": "http_server_requests_seconds_count", "dataType": "int64", "type": "Counter", "isColumn": false },
              "timeAggregation": "rate",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": {
                "items": [
                  { "key": { "key": "service.name", "dataType": "string", "type": "resource", "isColumn": false }, "op": "=", "value": "{{.service}}" }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "key": "api.name", "dataType": "string", "type": "resource", "isColumn": false }
              ],
              "legend": "{{api.name}}",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "req/s"
    },
    {
      "id": "error-rate",
      "title": "Error Rate",
      "description": "API error rate by service",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "sum",
              "aggregateAttribute": { "key": "http_server_requests_seconds_count", "dataType": "int64", "type": "Counter", "isColumn": false },
              "timeAggregation": "rate",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": {
                "items": [
                  { "key": { "key": "api.name", "dataType": "string", "type": "resource", "isColumn": false }, "op": "=", "value": "{{.api}}" },
                  { "key": { "key": "status_code", "dataType": "string", "type": "tag", "isColumn": false }, "op": "=~", "value": "5.." }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "key": "api.name", "dataType": "string", "type": "resource", "isColumn": false },
                { "key": "status_code", "dataType": "string", "type": "tag", "isColumn": false }
              ],
              "legend": "{{api.name}} - {{status_code}}",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "req/s"
    },
    {
      "id": "avg-response-time",
      "title": "Average Response Time",
      "description": "Average API response time by endpoint",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "avg",
              "aggregateAttribute": { "key": "http_server_requests_seconds_sum", "dataType": "float64", "type": "Counter", "isColumn": false },
              "timeAggregation": "avg",
              "spaceAggregation": "avg",
              "functions": [],
              "filters": {
                "items": [
                  { "key": { "key": "api.name", "dataType": "string", "type": "resource", "isColumn": false }, "op": "=", "value": "{{.api}}" }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "key": "api.name", "dataType": "string", "type": "resource", "isColumn": false },
                { "key": "endpoint", "dataType": "string", "type": "tag", "isColumn": false }
              ],
              "legend": "{{api.name}} - {{endpoint}}",
              "reduceTo": "avg"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "seconds"
    },
    {
      "id": "p95-latency",
      "title": "P95 Latency",
      "description": "95th percentile latency by endpoint",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "histogram_quantile",
              "aggregateAttribute": { "key": "http_server_requests_seconds_bucket", "dataType": "float64", "type": "Histogram", "isColumn": false },
              "timeAggregation": "avg",
              "spaceAggregation": "avg",
              "functions": [],
              "filters": {
                "items": [
                  { "key": { "key": "api.name", "dataType": "string", "type": "resource", "isColumn": false }, "op": "=", "value": "{{.api}}" }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "key": "api.name", "dataType": "string", "type": "resource", "isColumn": false },
                { "key": "endpoint", "dataType": "string", "type": "tag", "isColumn": false }
              ],
              "legend": "{{api.name}} - {{endpoint}}",
              "reduceTo": "avg"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "seconds"
    }
  ]
}
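Each dashboard above pairs every `layout` entry (`"i"`) with a widget of the same `"id"`, and relies on top-level fields like `uuid` and `title`. A small pre-import sanity check can catch mismatches before uploading a file; this is a minimal sketch (the function name and checks are our own, not part of SigNoz):

```python
import json

def check_dashboard(path: str) -> list[str]:
    """Return a list of problems found in a SigNoz dashboard JSON file."""
    with open(path) as fh:
        dash = json.load(fh)
    problems = []
    layout_ids = {cell["i"] for cell in dash.get("layout", [])}
    widget_ids = {w["id"] for w in dash.get("widgets", [])}
    # Every layout cell should reference an existing widget, and vice versa.
    for missing in sorted(layout_ids - widget_ids):
        problems.append(f"layout entry '{missing}' has no widget")
    for missing in sorted(widget_ids - layout_ids):
        problems.append(f"widget '{missing}' has no layout entry")
    # Fields the dashboards in this directory always set.
    for field in ("uuid", "title", "version"):
        if field not in dash:
            problems.append(f"missing top-level field '{field}'")
    return problems
```

Running it over `dashboards/*.json` before `deploy-signoz.sh` would, for example, flag a widget whose `id` was renamed without updating the matching `layout` entry.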
@@ -0,0 +1,333 @@
{
  "description": "Application performance monitoring dashboard using distributed traces and metrics",
  "tags": ["application", "performance", "traces", "apm"],
  "name": "bakery-ia-application-performance",
  "title": "Bakery IA - Application Performance (APM)",
  "uploadedGrafana": false,
  "uuid": "bakery-ia-apm-01",
  "version": "v4",
  "collapsableRowsMigrated": true,
  "layout": [
    { "x": 0, "y": 0, "w": 6, "h": 3, "i": "latency-p99", "moved": false, "static": false },
    { "x": 6, "y": 0, "w": 6, "h": 3, "i": "request-rate", "moved": false, "static": false },
    { "x": 0, "y": 3, "w": 6, "h": 3, "i": "error-rate", "moved": false, "static": false },
    { "x": 6, "y": 3, "w": 6, "h": 3, "i": "avg-duration", "moved": false, "static": false }
  ],
  "variables": {
    "service_name": {
      "id": "service-var",
      "name": "service_name",
      "description": "Filter by service name",
      "type": "QUERY",
      "queryValue": "SELECT DISTINCT(serviceName) FROM signoz_traces.distributed_signoz_index_v2 ORDER BY serviceName",
      "customValue": "",
      "textboxValue": "",
      "showALLOption": true,
      "multiSelect": false,
      "order": 1,
      "modificationUUID": "",
      "sort": "ASC",
      "selectedValue": null
    }
  },
  "widgets": [
    {
      "id": "latency-p99",
      "title": "P99 Latency",
      "description": "99th percentile latency for selected service",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "traces",
              "queryName": "A",
              "aggregateOperator": "p99",
              "aggregateAttribute": { "key": "duration_ns", "dataType": "float64", "type": "", "isColumn": true },
              "timeAggregation": "avg",
              "spaceAggregation": "p99",
              "functions": [],
              "filters": {
                "items": [
                  { "key": { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true }, "op": "=", "value": "{{.service_name}}" }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true }
              ],
              "legend": "{{serviceName}}",
              "reduceTo": "avg"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "ms"
    },
    {
      "id": "request-rate",
      "title": "Request Rate",
      "description": "Requests per second for the service",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "traces",
              "queryName": "A",
              "aggregateOperator": "count",
              "aggregateAttribute": { "key": "", "dataType": "", "type": "", "isColumn": false },
              "timeAggregation": "rate",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": {
                "items": [
                  { "key": { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true }, "op": "=", "value": "{{.service_name}}" }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true }
              ],
              "legend": "{{serviceName}}",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "reqps"
    },
    {
      "id": "error-rate",
      "title": "Error Rate",
      "description": "Rate of error-status spans for the service",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "traces",
              "queryName": "A",
              "aggregateOperator": "count",
              "aggregateAttribute": { "key": "", "dataType": "", "type": "", "isColumn": false },
              "timeAggregation": "rate",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": {
                "items": [
                  { "key": { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true }, "op": "=", "value": "{{.service_name}}" },
                  { "key": { "key": "status_code", "dataType": "string", "type": "tag", "isColumn": true }, "op": "=", "value": "STATUS_CODE_ERROR" }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true }
              ],
              "legend": "{{serviceName}}",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "reqps"
    },
    {
      "id": "avg-duration",
      "title": "Average Duration",
      "description": "Average request duration",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "traces",
              "queryName": "A",
              "aggregateOperator": "avg",
              "aggregateAttribute": { "key": "duration_ns", "dataType": "float64", "type": "", "isColumn": true },
              "timeAggregation": "avg",
              "spaceAggregation": "avg",
              "functions": [],
              "filters": {
                "items": [
                  { "key": { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true }, "op": "=", "value": "{{.service_name}}" }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true }
              ],
              "legend": "{{serviceName}}",
              "reduceTo": "avg"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "ms"
    }
  ]
}
@@ -0,0 +1,425 @@
{
  "description": "Comprehensive database performance monitoring for PostgreSQL, Redis, and RabbitMQ",
  "tags": ["database", "postgresql", "redis", "rabbitmq", "performance"],
  "name": "bakery-ia-database-performance",
  "title": "Bakery IA - Database Performance",
  "uploadedGrafana": false,
  "uuid": "bakery-ia-db-01",
  "version": "v4",
  "collapsableRowsMigrated": true,
  "layout": [
    { "x": 0, "y": 0, "w": 6, "h": 3, "i": "pg-connections", "moved": false, "static": false },
    { "x": 6, "y": 0, "w": 6, "h": 3, "i": "pg-db-size", "moved": false, "static": false },
    { "x": 0, "y": 3, "w": 6, "h": 3, "i": "redis-connected-clients", "moved": false, "static": false },
    { "x": 6, "y": 3, "w": 6, "h": 3, "i": "redis-memory", "moved": false, "static": false },
    { "x": 0, "y": 6, "w": 6, "h": 3, "i": "rabbitmq-messages", "moved": false, "static": false },
    { "x": 6, "y": 6, "w": 6, "h": 3, "i": "rabbitmq-consumers", "moved": false, "static": false }
  ],
  "variables": {
    "database": {
      "id": "database-var",
      "name": "database",
      "description": "Filter by PostgreSQL database name",
      "type": "QUERY",
      "queryValue": "SELECT DISTINCT(resource_attrs['postgresql.database.name']) as value FROM signoz_metrics.distributed_time_series_v4_1day WHERE metric_name = 'postgresql.db_size' AND value != '' ORDER BY value",
      "customValue": "",
      "textboxValue": "",
      "showALLOption": true,
      "multiSelect": false,
      "order": 1,
      "modificationUUID": "",
      "sort": "ASC",
      "selectedValue": null
    }
  },
  "widgets": [
    {
      "id": "pg-connections",
      "title": "PostgreSQL - Active Connections",
      "description": "Number of active PostgreSQL connections",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "sum",
              "aggregateAttribute": { "key": "postgresql.backends", "dataType": "float64", "type": "Gauge", "isColumn": false },
              "timeAggregation": "latest",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": {
                "items": [
                  { "key": { "key": "postgresql.database.name", "dataType": "string", "type": "resource", "isColumn": false }, "op": "=", "value": "{{.database}}" }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "key": "postgresql.database.name", "dataType": "string", "type": "resource", "isColumn": false }
              ],
              "legend": "{{postgresql.database.name}}",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "none"
    },
    {
      "id": "pg-db-size",
      "title": "PostgreSQL - Database Size",
      "description": "Size of PostgreSQL databases in bytes",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "sum",
              "aggregateAttribute": { "key": "postgresql.db_size", "dataType": "int64", "type": "Gauge", "isColumn": false },
              "timeAggregation": "latest",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": {
                "items": [
                  { "key": { "key": "postgresql.database.name", "dataType": "string", "type": "resource", "isColumn": false }, "op": "=", "value": "{{.database}}" }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "key": "postgresql.database.name", "dataType": "string", "type": "resource", "isColumn": false }
              ],
              "legend": "{{postgresql.database.name}}",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "bytes"
    },
    {
      "id": "redis-connected-clients",
      "title": "Redis - Connected Clients",
      "description": "Number of clients connected to Redis",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "avg",
              "aggregateAttribute": { "key": "redis.clients.connected", "dataType": "int64", "type": "Gauge", "isColumn": false },
              "timeAggregation": "latest",
              "spaceAggregation": "avg",
              "functions": [],
              "filters": { "items": [], "op": "AND" },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "key": "host.name", "dataType": "string", "type": "resource", "isColumn": false }
              ],
              "legend": "{{host.name}}",
              "reduceTo": "avg"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "none"
    },
    {
      "id": "redis-memory",
      "title": "Redis - Memory Usage",
      "description": "Redis memory usage in bytes",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "avg",
              "aggregateAttribute": { "key": "redis.memory.used", "dataType": "int64", "type": "Gauge", "isColumn": false },
              "timeAggregation": "latest",
              "spaceAggregation": "avg",
              "functions": [],
              "filters": { "items": [], "op": "AND" },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "key": "host.name", "dataType": "string", "type": "resource", "isColumn": false }
              ],
              "legend": "{{host.name}}",
              "reduceTo": "avg"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "bytes"
    },
    {
      "id": "rabbitmq-messages",
      "title": "RabbitMQ - Current Messages",
      "description": "Number of messages currently in RabbitMQ queues",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "sum",
              "aggregateAttribute": { "key": "rabbitmq.message.current", "dataType": "int64", "type": "Gauge", "isColumn": false },
              "timeAggregation": "latest",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": { "items": [], "op": "AND" },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "key": "queue", "dataType": "string", "type": "tag", "isColumn": false }
              ],
              "legend": "Queue: {{queue}}",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "none"
    },
    {
      "id": "rabbitmq-consumers",
      "title": "RabbitMQ - Consumer Count",
      "description": "Number of consumers per queue",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "sum",
              "aggregateAttribute": { "key": "rabbitmq.consumer.count", "dataType": "int64", "type": "Gauge", "isColumn": false },
              "timeAggregation": "latest",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": { "items": [], "op": "AND" },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "key": "queue", "dataType": "string", "type": "tag", "isColumn": false }
              ],
              "legend": "Queue: {{queue}}",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "none"
    }
  ]
}
348
infrastructure/monitoring/signoz/dashboards/error-tracking.json
Normal file
@@ -0,0 +1,348 @@
{
  "description": "Comprehensive error tracking and analysis dashboard",
  "tags": ["errors", "exceptions", "tracking"],
  "name": "bakery-ia-error-tracking",
  "title": "Bakery IA - Error Tracking",
  "uploadedGrafana": false,
  "uuid": "bakery-ia-errors-01",
  "version": "v4",
  "collapsableRowsMigrated": true,
  "layout": [
    { "x": 0, "y": 0, "w": 6, "h": 3, "i": "total-errors", "moved": false, "static": false },
    { "x": 6, "y": 0, "w": 6, "h": 3, "i": "error-rate", "moved": false, "static": false },
    { "x": 0, "y": 3, "w": 6, "h": 3, "i": "http-5xx", "moved": false, "static": false },
    { "x": 6, "y": 3, "w": 6, "h": 3, "i": "http-4xx", "moved": false, "static": false }
  ],
  "variables": {
    "service": {
      "id": "service-var",
      "name": "service",
      "description": "Filter by service name",
      "type": "QUERY",
      "queryValue": "SELECT DISTINCT(resource_attrs['service.name']) as value FROM signoz_metrics.distributed_time_series_v4_1day WHERE metric_name = 'error_total' AND value != '' ORDER BY value",
      "customValue": "",
      "textboxValue": "",
      "showALLOption": true,
      "multiSelect": false,
      "order": 1,
      "modificationUUID": "",
      "sort": "ASC",
      "selectedValue": null
    }
  },
  "widgets": [
    {
      "id": "total-errors",
      "title": "Total Errors",
      "description": "Total number of errors across all services",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "value",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "sum",
              "aggregateAttribute": { "key": "error_total", "dataType": "int64", "type": "Counter", "isColumn": false },
              "timeAggregation": "sum",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": {
                "items": [
                  { "key": { "key": "service.name", "dataType": "string", "type": "resource", "isColumn": false }, "op": "=", "value": "{{.service}}" }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [],
              "legend": "Total Errors",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "none"
    },
    {
      "id": "error-rate",
      "title": "Error Rate",
      "description": "Error rate over time",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "sum",
              "aggregateAttribute": { "key": "error_total", "dataType": "int64", "type": "Counter", "isColumn": false },
              "timeAggregation": "rate",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": {
                "items": [
                  { "key": { "key": "service.name", "dataType": "string", "type": "resource", "isColumn": false }, "op": "=", "value": "{{.service}}" }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true }
              ],
              "legend": "{{serviceName}}",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "errors/s"
    },
    {
      "id": "http-5xx",
      "title": "HTTP 5xx Errors",
      "description": "Server errors (5xx status codes)",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "sum",
              "aggregateAttribute": { "key": "http_server_requests_seconds_count", "dataType": "int64", "type": "Counter", "isColumn": false },
              "timeAggregation": "sum",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": {
                "items": [
                  { "key": { "key": "service.name", "dataType": "string", "type": "resource", "isColumn": false }, "op": "=", "value": "{{.service}}" },
                  { "key": { "key": "status_code", "dataType": "string", "type": "tag", "isColumn": false }, "op": "=~", "value": "5.." }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true },
                { "key": "status_code", "dataType": "string", "type": "tag", "isColumn": false }
              ],
              "legend": "{{serviceName}} - {{status_code}}",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "number"
    },
    {
      "id": "http-4xx",
      "title": "HTTP 4xx Errors",
      "description": "Client errors (4xx status codes)",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "sum",
              "aggregateAttribute": { "key": "http_server_requests_seconds_count", "dataType": "int64", "type": "Counter", "isColumn": false },
              "timeAggregation": "sum",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": {
                "items": [
                  { "key": { "key": "service.name", "dataType": "string", "type": "resource", "isColumn": false }, "op": "=", "value": "{{.service}}" },
                  { "key": { "key": "status_code", "dataType": "string", "type": "tag", "isColumn": false }, "op": "=~", "value": "4.." }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true },
                { "key": "status_code", "dataType": "string", "type": "tag", "isColumn": false }
              ],
              "legend": "{{serviceName}} - {{status_code}}",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "number"
    }
  ]
}
213
infrastructure/monitoring/signoz/dashboards/index.json
Normal file
@@ -0,0 +1,213 @@
{
  "name": "Bakery IA Dashboard Collection",
  "description": "Complete set of SigNoz dashboards for Bakery IA monitoring",
  "version": "1.0.0",
  "author": "Bakery IA Team",
  "license": "MIT",
  "dashboards": [
    { "id": "infrastructure-monitoring", "name": "Infrastructure Monitoring", "description": "Kubernetes infrastructure and resource monitoring", "file": "infrastructure-monitoring.json", "tags": ["infrastructure", "kubernetes", "system"], "category": "infrastructure" },
    { "id": "application-performance", "name": "Application Performance", "description": "Microservice performance and API metrics", "file": "application-performance.json", "tags": ["application", "performance", "apm"], "category": "performance" },
    { "id": "database-performance", "name": "Database Performance", "description": "PostgreSQL and Redis database monitoring", "file": "database-performance.json", "tags": ["database", "postgresql", "redis"], "category": "database" },
    { "id": "api-performance", "name": "API Performance", "description": "REST and GraphQL API performance monitoring", "file": "api-performance.json", "tags": ["api", "rest", "graphql"], "category": "api" },
    { "id": "error-tracking", "name": "Error Tracking", "description": "System error tracking and analysis", "file": "error-tracking.json", "tags": ["errors", "exceptions", "tracking"], "category": "monitoring" },
    { "id": "user-activity", "name": "User Activity", "description": "User behavior and activity monitoring", "file": "user-activity.json", "tags": ["user", "activity", "behavior"], "category": "user" },
    { "id": "system-health", "name": "System Health", "description": "Overall system health monitoring", "file": "system-health.json", "tags": ["system", "health", "overview"], "category": "overview" },
    { "id": "alert-management", "name": "Alert Management", "description": "Alert monitoring and management", "file": "alert-management.json", "tags": ["alerts", "notifications", "management"], "category": "alerts" },
    { "id": "log-analysis", "name": "Log Analysis", "description": "Log search and analysis", "file": "log-analysis.json", "tags": ["logs", "search", "analysis"], "category": "logs" }
  ],
  "categories": [
    { "id": "infrastructure", "name": "Infrastructure", "description": "Kubernetes and system infrastructure monitoring" },
    { "id": "performance", "name": "Performance", "description": "Application and service performance monitoring" },
    { "id": "database", "name": "Database", "description": "Database performance and health monitoring" },
    { "id": "api", "name": "API", "description": "API performance and usage monitoring" },
    { "id": "monitoring", "name": "Monitoring", "description": "Error tracking and system monitoring" },
    { "id": "user", "name": "User", "description": "User activity and behavior monitoring" },
    { "id": "overview", "name": "Overview", "description": "System-wide overview and health dashboards" },
    { "id": "alerts", "name": "Alerts", "description": "Alert management and monitoring" },
    { "id": "logs", "name": "Logs", "description": "Log analysis and search" }
  ],
  "usage": {
    "import_methods": ["ui_import", "api_import", "kubernetes_configmap"],
    "recommended_import_order": [
      "infrastructure-monitoring",
      "system-health",
      "application-performance",
      "api-performance",
      "database-performance",
      "error-tracking",
      "alert-management",
      "log-analysis",
      "user-activity"
    ]
  },
  "requirements": {
    "signoz_version": ">= 0.10.0",
    "opentelemetry_collector": ">= 0.45.0",
    "metrics": [
      "container_cpu_usage_seconds_total",
      "container_memory_working_set_bytes",
      "http_server_requests_seconds_count",
      "http_server_requests_seconds_sum",
      "pg_stat_activity_count",
      "pg_stat_statements_total_time",
      "error_total",
      "alerts_total",
      "kube_pod_status_phase",
      "container_network_receive_bytes_total",
      "kube_pod_container_status_restarts_total",
      "kube_pod_container_status_ready",
      "container_fs_reads_total",
      "kubernetes_events",
      "http_server_requests_seconds_bucket",
      "http_server_active_requests",
      "http_server_up",
      "db_query_duration_seconds_sum",
      "db_connections_active",
      "http_client_request_duration_seconds_count",
      "http_client_request_duration_seconds_sum",
      "graphql_execution_time_seconds",
      "graphql_errors_total",
      "pg_stat_database_blks_hit",
      "pg_stat_database_xact_commit",
      "pg_locks_count",
      "pg_table_size_bytes",
      "pg_stat_user_tables_seq_scan",
      "redis_memory_used_bytes",
      "redis_commands_processed_total",
      "redis_keyspace_hits",
      "pg_stat_database_deadlocks",
      "pg_stat_database_conn_errors",
      "pg_replication_lag_bytes",
      "pg_replication_is_replica",
      "active_users",
      "user_sessions_total",
      "api_calls_per_user",
      "session_duration_seconds",
      "system_availability",
      "service_health_score",
      "system_cpu_usage",
      "system_memory_usage",
      "service_availability",
      "alerts_active",
      "log_lines_total"
    ]
  },
  "support": {
    "documentation": "https://signoz.io/docs/",
    "bakery_ia_docs": "../SIGNOZ_COMPLETE_CONFIGURATION_GUIDE.md",
    "issues": "https://github.com/your-repo/issues"
  },
  "notes": {
    "format_fix": "All dashboards have been updated to use the correct SigNoz JSON schema with proper filter arrays to resolve the 'e.filter is not a function' error.",
    "compatibility": "Tested with SigNoz v0.10.0+ and OpenTelemetry Collector v0.45.0+",
    "customization": "You can customize these dashboards by editing the JSON files or cloning them in the SigNoz UI"
  }
}
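The `usage.recommended_import_order` field above lists dashboard ids, not file names, so an import helper has to resolve each id against the `dashboards` array before loading files. A minimal sketch of that lookup (the trimmed `index` dict below is a hand-copied excerpt of index.json, not the full file, and the actual import call is out of scope):

```python
# Resolve the recommended import order from the dashboard index to file names.
# `index` is a trimmed excerpt of index.json above; in practice you would
# json.load() the real file from disk.
index = {
    "dashboards": [
        {"id": "infrastructure-monitoring", "file": "infrastructure-monitoring.json"},
        {"id": "system-health", "file": "system-health.json"},
        {"id": "log-analysis", "file": "log-analysis.json"},
    ],
    "usage": {
        "recommended_import_order": [
            "infrastructure-monitoring",
            "system-health",
            "log-analysis",
        ]
    },
}

def files_in_import_order(index: dict) -> list:
    """Map each dashboard id in recommended_import_order to its JSON file."""
    by_id = {d["id"]: d["file"] for d in index["dashboards"]}
    return [by_id[dashboard_id] for dashboard_id in index["usage"]["recommended_import_order"]]

print(files_in_import_order(index))
```

Iterating this list and POSTing each file to the SigNoz dashboard API (or wrapping them in a ConfigMap) corresponds to the `api_import` and `kubernetes_configmap` methods named above.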
@@ -0,0 +1,437 @@
{
  "description": "Comprehensive infrastructure monitoring dashboard for Bakery IA Kubernetes cluster",
  "tags": ["infrastructure", "kubernetes", "k8s", "system"],
  "name": "bakery-ia-infrastructure-monitoring",
  "title": "Bakery IA - Infrastructure Monitoring",
  "uploadedGrafana": false,
  "uuid": "bakery-ia-infra-01",
  "version": "v4",
  "collapsableRowsMigrated": true,
  "layout": [
    { "x": 0, "y": 0, "w": 6, "h": 3, "i": "pod-count", "moved": false, "static": false },
    { "x": 6, "y": 0, "w": 6, "h": 3, "i": "pod-phase", "moved": false, "static": false },
    { "x": 0, "y": 3, "w": 6, "h": 3, "i": "container-restarts", "moved": false, "static": false },
    { "x": 6, "y": 3, "w": 6, "h": 3, "i": "node-condition", "moved": false, "static": false },
    { "x": 0, "y": 6, "w": 12, "h": 3, "i": "deployment-status", "moved": false, "static": false }
  ],
  "variables": {
    "namespace": {
      "id": "namespace-var",
      "name": "namespace",
      "description": "Filter by Kubernetes namespace",
      "type": "QUERY",
      "queryValue": "SELECT DISTINCT(resource_attrs['k8s.namespace.name']) as value FROM signoz_metrics.distributed_time_series_v4_1day WHERE metric_name = 'k8s.pod.phase' AND value != '' ORDER BY value",
      "customValue": "",
      "textboxValue": "",
      "showALLOption": true,
      "multiSelect": false,
      "order": 1,
      "modificationUUID": "",
      "sort": "ASC",
      "selectedValue": "bakery-ia"
    }
  },
  "widgets": [
    {
      "id": "pod-count",
      "title": "Total Pods",
      "description": "Total number of pods in the namespace",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "value",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "count",
              "aggregateAttribute": { "key": "k8s.pod.phase", "dataType": "int64", "type": "Gauge", "isColumn": false },
              "timeAggregation": "latest",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": {
                "items": [
                  {
                    "id": "filter-k8s-namespace",
                    "key": { "id": "k8s.namespace.name--string--tag--false", "key": "k8s.namespace.name", "dataType": "string", "type": "tag", "isColumn": false },
                    "op": "=",
                    "value": "{{.namespace}}"
                  }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [],
              "legend": "Total Pods",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "none"
    },
    {
      "id": "pod-phase",
      "title": "Pod Phase Distribution",
      "description": "Pods by phase (Running, Pending, Failed, etc.)",
      "isStacked": true,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "sum",
              "aggregateAttribute": { "key": "k8s.pod.phase", "dataType": "int64", "type": "Gauge", "isColumn": false },
              "timeAggregation": "latest",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": {
                "items": [
                  {
                    "id": "filter-k8s-namespace",
                    "key": { "id": "k8s.namespace.name--string--tag--false", "key": "k8s.namespace.name", "dataType": "string", "type": "tag", "isColumn": false },
                    "op": "=",
                    "value": "{{.namespace}}"
                  }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "key": "phase", "dataType": "string", "type": "tag", "isColumn": false }
              ],
              "legend": "{{phase}}",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "none"
    },
    {
      "id": "container-restarts",
      "title": "Container Restarts",
      "description": "Container restart count over time",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "sum",
              "aggregateAttribute": { "key": "k8s.container.restarts", "dataType": "int64", "type": "Gauge", "isColumn": false },
              "timeAggregation": "increase",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": {
                "items": [
                  {
                    "id": "filter-k8s-namespace",
                    "key": { "id": "k8s.namespace.name--string--tag--false", "key": "k8s.namespace.name", "dataType": "string", "type": "tag", "isColumn": false },
                    "op": "=",
                    "value": "{{.namespace}}"
                  }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "id": "k8s.pod.name--string--tag--false", "key": "k8s.pod.name", "dataType": "string", "type": "tag", "isColumn": false }
              ],
              "legend": "{{k8s.pod.name}}",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "none"
    },
    {
      "id": "node-condition",
      "title": "Node Conditions",
      "description": "Node condition status (Ready, MemoryPressure, DiskPressure, etc.)",
      "isStacked": true,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "sum",
              "aggregateAttribute": { "key": "k8s.node.condition_ready", "dataType": "int64", "type": "Gauge", "isColumn": false },
              "timeAggregation": "latest",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": {
                "items": [],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "id": "k8s.node.name--string--tag--false", "key": "k8s.node.name", "dataType": "string", "type": "tag", "isColumn": false }
              ],
              "legend": "{{k8s.node.name}} Ready",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "none"
    },
    {
      "id": "deployment-status",
      "title": "Deployment Status (Desired vs Available)",
      "description": "Deployment replicas: desired vs available",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "avg",
              "aggregateAttribute": { "key": "k8s.deployment.desired", "dataType": "int64", "type": "Gauge", "isColumn": false },
              "timeAggregation": "latest",
              "spaceAggregation": "avg",
              "functions": [],
              "filters": {
                "items": [
                  {
                    "id": "filter-k8s-namespace",
                    "key": { "id": "k8s.namespace.name--string--tag--false", "key": "k8s.namespace.name", "dataType": "string", "type": "tag", "isColumn": false },
                    "op": "=",
                    "value": "{{.namespace}}"
                  }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "id": "k8s.deployment.name--string--tag--false", "key": "k8s.deployment.name", "dataType": "string", "type": "tag", "isColumn": false }
              ],
              "legend": "{{k8s.deployment.name}} (desired)",
              "reduceTo": "avg"
            },
            {
              "dataSource": "metrics",
              "queryName": "B",
              "aggregateOperator": "avg",
              "aggregateAttribute": { "key": "k8s.deployment.available", "dataType": "int64", "type": "Gauge", "isColumn": false },
              "timeAggregation": "latest",
              "spaceAggregation": "avg",
              "functions": [],
              "filters": {
                "items": [
                  {
                    "id": "filter-k8s-namespace",
                    "key": { "id": "k8s.namespace.name--string--tag--false", "key": "k8s.namespace.name", "dataType": "string", "type": "tag", "isColumn": false },
                    "op": "=",
                    "value": "{{.namespace}}"
                  }
                ],
                "op": "AND"
              },
              "expression": "B",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "id": "k8s.deployment.name--string--tag--false", "key": "k8s.deployment.name", "dataType": "string", "type": "tag", "isColumn": false }
              ],
              "legend": "{{k8s.deployment.name}} (available)",
              "reduceTo": "avg"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "none"
    }
  ]
}
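Every entry in the dashboard's `layout` array references a widget through its `i` field, which must match a widget's `id`; a mismatch leaves a grid slot with no panel behind it. A quick consistency check one could run over these files before importing (the `dashboard` dict below is a trimmed excerpt of the file above):

```python
# Check that every layout slot in a SigNoz dashboard maps to a real widget.
# `dashboard` is a trimmed excerpt of infrastructure-monitoring.json above.
dashboard = {
    "layout": [
        {"x": 0, "y": 0, "w": 6, "h": 3, "i": "pod-count"},
        {"x": 6, "y": 0, "w": 6, "h": 3, "i": "pod-phase"},
    ],
    "widgets": [
        {"id": "pod-count", "title": "Total Pods"},
        {"id": "pod-phase", "title": "Pod Phase Distribution"},
    ],
}

def unmatched_layout_ids(dashboard: dict) -> set:
    """Return layout 'i' values with no matching widget 'id' (empty set = consistent)."""
    widget_ids = {widget["id"] for widget in dashboard["widgets"]}
    return {item["i"] for item in dashboard["layout"]} - widget_ids

print(unmatched_layout_ids(dashboard))  # set() when layout and widgets agree
```

Running the same check against each full dashboard JSON (after `json.load`) is a cheap guard before an API or ConfigMap import.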
333
infrastructure/monitoring/signoz/dashboards/log-analysis.json
Normal file
@@ -0,0 +1,333 @@
{
  "description": "Comprehensive log analysis and search dashboard",
  "tags": ["logs", "analysis", "search"],
  "name": "bakery-ia-log-analysis",
  "title": "Bakery IA - Log Analysis",
  "uploadedGrafana": false,
  "uuid": "bakery-ia-logs-01",
  "version": "v4",
  "collapsableRowsMigrated": true,
  "layout": [
    { "x": 0, "y": 0, "w": 6, "h": 3, "i": "log-volume", "moved": false, "static": false },
    { "x": 6, "y": 0, "w": 6, "h": 3, "i": "error-logs", "moved": false, "static": false },
    { "x": 0, "y": 3, "w": 6, "h": 3, "i": "logs-by-level", "moved": false, "static": false },
    { "x": 6, "y": 3, "w": 6, "h": 3, "i": "logs-by-service", "moved": false, "static": false }
  ],
  "variables": {
    "service": {
      "id": "service-var",
      "name": "service",
      "description": "Filter by service name",
      "type": "QUERY",
      "queryValue": "SELECT DISTINCT(resource_attrs['service.name']) as value FROM signoz_metrics.distributed_time_series_v4_1day WHERE metric_name = 'log_lines_total' AND value != '' ORDER BY value",
      "customValue": "",
      "textboxValue": "",
      "showALLOption": true,
      "multiSelect": false,
      "order": 1,
      "modificationUUID": "",
      "sort": "ASC",
      "selectedValue": null
    }
  },
  "widgets": [
    {
      "id": "log-volume",
      "title": "Log Volume",
      "description": "Total log volume by service",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "sum",
              "aggregateAttribute": { "key": "log_lines_total", "dataType": "int64", "type": "Counter", "isColumn": false },
              "timeAggregation": "rate",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": {
                "items": [
                  { "key": { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true }, "op": "=", "value": "{{.service}}" }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true }
              ],
              "legend": "{{serviceName}}",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "logs/s"
    },
    {
      "id": "error-logs",
      "title": "Error Logs",
      "description": "Error log volume by service",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "sum",
              "aggregateAttribute": { "key": "log_lines_total", "dataType": "int64", "type": "Counter", "isColumn": false },
              "timeAggregation": "rate",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": {
                "items": [
                  { "key": { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true }, "op": "=", "value": "{{.service}}" },
                  { "key": { "key": "level", "dataType": "string", "type": "tag", "isColumn": false }, "op": "=", "value": "error" }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true }
              ],
              "legend": "{{serviceName}} (errors)",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "logs/s"
    },
    {
      "id": "logs-by-level",
      "title": "Logs by Level",
      "description": "Distribution of logs by severity level",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "pie",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "sum",
              "aggregateAttribute": { "key": "log_lines_total", "dataType": "int64", "type": "Counter", "isColumn": false },
              "timeAggregation": "sum",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": {
                "items": [
                  { "key": { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true }, "op": "=", "value": "{{.service}}" }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "key": "level", "dataType": "string", "type": "tag", "isColumn": false }
              ],
              "legend": "{{level}}",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "none"
    },
    {
      "id": "logs-by-service",
      "title": "Logs by Service",
      "description": "Distribution of logs by service",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "pie",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "sum",
              "aggregateAttribute": { "key": "log_lines_total", "dataType": "int64", "type": "Counter", "isColumn": false },
              "timeAggregation": "sum",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": {
                "items": [
                  { "key": { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true }, "op": "=", "value": "{{.service}}" }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true }
              ],
              "legend": "{{serviceName}}",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "none"
    }
  ]
}
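Filter values such as `{{.service}}` in the file above are Go-template style placeholders that SigNoz resolves from the dashboard `variables` block at query time (the `{{serviceName}}` form in legends is a separate, dot-free legend syntax). A toy substitution to illustrate the intent, not SigNoz's actual template engine:

```python
import re

def resolve(value: str, variables: dict) -> str:
    """Replace {{.name}} placeholders with values from a variables mapping."""
    return re.sub(r"\{\{\.(\w+)\}\}", lambda m: str(variables[m.group(1)]), value)

print(resolve("{{.service}}", {"service": "orders"}))  # prints "orders"
```

This is why a dashboard only renders correctly when every placeholder name (`service`, `namespace`) has a matching entry under `variables`.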
303
infrastructure/monitoring/signoz/dashboards/system-health.json
Normal file
@@ -0,0 +1,303 @@
{
  "description": "Comprehensive system health monitoring dashboard",
  "tags": ["system", "health", "monitoring"],
  "name": "bakery-ia-system-health",
  "title": "Bakery IA - System Health",
  "uploadedGrafana": false,
  "uuid": "bakery-ia-health-01",
  "version": "v4",
  "collapsableRowsMigrated": true,
  "layout": [
    { "x": 0, "y": 0, "w": 6, "h": 3, "i": "system-availability", "moved": false, "static": false },
    { "x": 6, "y": 0, "w": 6, "h": 3, "i": "health-score", "moved": false, "static": false },
    { "x": 0, "y": 3, "w": 6, "h": 3, "i": "cpu-usage", "moved": false, "static": false },
    { "x": 6, "y": 3, "w": 6, "h": 3, "i": "memory-usage", "moved": false, "static": false }
  ],
  "variables": {
    "namespace": {
      "id": "namespace-var",
      "name": "namespace",
      "description": "Filter by Kubernetes namespace",
      "type": "QUERY",
      "queryValue": "SELECT DISTINCT(resource_attrs['k8s.namespace.name']) as value FROM signoz_metrics.distributed_time_series_v4_1day WHERE metric_name = 'system_availability' AND value != '' ORDER BY value",
      "customValue": "",
      "textboxValue": "",
      "showALLOption": true,
      "multiSelect": false,
      "order": 1,
      "modificationUUID": "",
      "sort": "ASC",
      "selectedValue": "bakery-ia"
    }
  },
  "widgets": [
    {
      "id": "system-availability",
      "title": "System Availability",
      "description": "Overall system availability percentage",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "value",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "avg",
              "aggregateAttribute": { "key": "system_availability", "dataType": "float64", "type": "Gauge", "isColumn": false },
              "timeAggregation": "latest",
              "spaceAggregation": "avg",
              "functions": [],
              "filters": {
                "items": [
                  {
                    "id": "filter-k8s-namespace",
                    "key": { "id": "k8s.namespace.name--string--tag--false", "key": "k8s.namespace.name", "dataType": "string", "type": "tag", "isColumn": false },
                    "op": "=",
                    "value": "{{.namespace}}"
                  }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [],
              "legend": "System Availability",
              "reduceTo": "avg"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "percent"
    },
    {
      "id": "health-score",
      "title": "Service Health Score",
      "description": "Overall service health score",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "value",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "avg",
              "aggregateAttribute": { "key": "service_health_score", "dataType": "float64", "type": "Gauge", "isColumn": false },
              "timeAggregation": "latest",
              "spaceAggregation": "avg",
              "functions": [],
              "filters": {
                "items": [
                  {
                    "id": "filter-k8s-namespace",
                    "key": { "id": "k8s.namespace.name--string--tag--false", "key": "k8s.namespace.name", "dataType": "string", "type": "tag", "isColumn": false },
                    "op": "=",
                    "value": "{{.namespace}}"
                  }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [],
              "legend": "Health Score",
              "reduceTo": "avg"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "none"
    },
    {
      "id": "cpu-usage",
      "title": "CPU Usage",
      "description": "System CPU usage over time",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "avg",
              "aggregateAttribute": { "key": "system_cpu_usage", "dataType": "float64", "type": "Gauge", "isColumn": false },
              "timeAggregation": "avg",
              "spaceAggregation": "avg",
              "functions": [],
              "filters": {
                "items": [
                  {
                    "id": "filter-k8s-namespace",
                    "key": { "id": "k8s.namespace.name--string--tag--false", "key": "k8s.namespace.name", "dataType": "string", "type": "tag", "isColumn": false },
                    "op": "=",
                    "value": "{{.namespace}}"
                  }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [],
              "legend": "CPU Usage",
              "reduceTo": "avg"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "percent"
    },
    {
      "id": "memory-usage",
      "title": "Memory Usage",
      "description": "System memory usage over time",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "metrics",
              "queryName": "A",
              "aggregateOperator": "avg",
              "aggregateAttribute": { "key": "system_memory_usage", "dataType": "float64", "type": "Gauge", "isColumn": false },
              "timeAggregation": "avg",
              "spaceAggregation": "avg",
              "functions": [],
              "filters": {
                "items": [
                  {
                    "id": "filter-k8s-namespace",
                    "key": { "id": "k8s.namespace.name--string--tag--false", "key": "k8s.namespace.name", "dataType": "string", "type": "tag", "isColumn": false },
                    "op": "=",
                    "value": "{{.namespace}}"
                  }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
|
||||
"having": [],
|
||||
"stepInterval": 60,
|
||||
"limit": null,
|
||||
"orderBy": [],
|
||||
"groupBy": [],
|
||||
"legend": "Memory Usage",
|
||||
"reduceTo": "avg"
|
||||
}
|
||||
],
|
||||
"queryFormulas": []
|
||||
},
|
||||
"queryType": "builder"
|
||||
},
|
||||
"fillSpans": false,
|
||||
"yAxisUnit": "percent"
|
||||
}
|
||||
]
|
||||
}
|
||||
429
infrastructure/monitoring/signoz/dashboards/user-activity.json
Normal file
@@ -0,0 +1,429 @@
{
  "description": "User activity and behavior monitoring dashboard",
  "tags": ["user", "activity", "behavior"],
  "name": "bakery-ia-user-activity",
  "title": "Bakery IA - User Activity",
  "uploadedGrafana": false,
  "uuid": "bakery-ia-user-01",
  "version": "v4",
  "collapsableRowsMigrated": true,
  "layout": [
    { "x": 0, "y": 0, "w": 6, "h": 3, "i": "active-users", "moved": false, "static": false },
    { "x": 6, "y": 0, "w": 6, "h": 3, "i": "user-sessions", "moved": false, "static": false },
    { "x": 0, "y": 3, "w": 6, "h": 3, "i": "user-actions", "moved": false, "static": false },
    { "x": 6, "y": 3, "w": 6, "h": 3, "i": "page-views", "moved": false, "static": false },
    { "x": 0, "y": 6, "w": 12, "h": 4, "i": "geo-visitors", "moved": false, "static": false }
  ],
  "variables": {
    "service": {
      "id": "service-var",
      "name": "service",
      "description": "Filter by service name",
      "type": "QUERY",
      "queryValue": "SELECT DISTINCT(serviceName) FROM signoz_traces.distributed_signoz_index_v2 ORDER BY serviceName",
      "customValue": "",
      "textboxValue": "",
      "showALLOption": true,
      "multiSelect": false,
      "order": 1,
      "modificationUUID": "",
      "sort": "ASC",
      "selectedValue": "bakery-frontend"
    }
  },
  "widgets": [
    {
      "id": "active-users",
      "title": "Active Users",
      "description": "Number of active users by service",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "traces",
              "queryName": "A",
              "aggregateOperator": "count_distinct",
              "aggregateAttribute": { "key": "user.id", "dataType": "string", "type": "tag", "isColumn": true },
              "timeAggregation": "count_distinct",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": {
                "items": [
                  { "key": { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true }, "op": "=", "value": "{{.service}}" }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true }
              ],
              "legend": "{{serviceName}}",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "none"
    },
    {
      "id": "user-sessions",
      "title": "User Sessions",
      "description": "Total user sessions by service",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "traces",
              "queryName": "A",
              "aggregateOperator": "count",
              "aggregateAttribute": { "key": "session.id", "dataType": "string", "type": "tag", "isColumn": true },
              "timeAggregation": "count",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": {
                "items": [
                  { "key": { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true }, "op": "=", "value": "{{.service}}" },
                  { "key": { "key": "span.name", "dataType": "string", "type": "tag", "isColumn": true }, "op": "=", "value": "user_session" }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true }
              ],
              "legend": "{{serviceName}}",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "none"
    },
    {
      "id": "user-actions",
      "title": "User Actions",
      "description": "Total user actions by service",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "traces",
              "queryName": "A",
              "aggregateOperator": "count",
              "aggregateAttribute": { "key": "user.action", "dataType": "string", "type": "tag", "isColumn": true },
              "timeAggregation": "count",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": {
                "items": [
                  { "key": { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true }, "op": "=", "value": "{{.service}}" },
                  { "key": { "key": "span.name", "dataType": "string", "type": "tag", "isColumn": true }, "op": "=", "value": "user_action" }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true }
              ],
              "legend": "{{serviceName}}",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "none"
    },
    {
      "id": "page-views",
      "title": "Page Views",
      "description": "Total page views by service",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "graph",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "traces",
              "queryName": "A",
              "aggregateOperator": "count",
              "aggregateAttribute": { "key": "page.path", "dataType": "string", "type": "tag", "isColumn": true },
              "timeAggregation": "count",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": {
                "items": [
                  { "key": { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true }, "op": "=", "value": "{{.service}}" },
                  { "key": { "key": "span.name", "dataType": "string", "type": "tag", "isColumn": true }, "op": "=", "value": "page_view" }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [
                { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true }
              ],
              "legend": "{{serviceName}}",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "none"
    },
    {
      "id": "geo-visitors",
      "title": "Geolocation Visitors",
      "description": "Number of visitors who shared location data",
      "isStacked": false,
      "nullZeroValues": "zero",
      "opacity": "1",
      "panelTypes": "value",
      "query": {
        "builder": {
          "queryData": [
            {
              "dataSource": "traces",
              "queryName": "A",
              "aggregateOperator": "count",
              "aggregateAttribute": { "key": "user.id", "dataType": "string", "type": "tag", "isColumn": true },
              "timeAggregation": "count",
              "spaceAggregation": "sum",
              "functions": [],
              "filters": {
                "items": [
                  { "key": { "key": "serviceName", "dataType": "string", "type": "tag", "isColumn": true }, "op": "=", "value": "{{.service}}" },
                  { "key": { "key": "span.name", "dataType": "string", "type": "tag", "isColumn": true }, "op": "=", "value": "user_location" }
                ],
                "op": "AND"
              },
              "expression": "A",
              "disabled": false,
              "having": [],
              "stepInterval": 60,
              "limit": null,
              "orderBy": [],
              "groupBy": [],
              "legend": "Visitors with Location Data (See GEOLOCATION_VISUALIZATION_GUIDE.md for map integration)",
              "reduceTo": "sum"
            }
          ],
          "queryFormulas": []
        },
        "queryType": "builder"
      },
      "fillSpans": false,
      "yAxisUnit": "none"
    }
  ]
}
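The dashboard JSON above (like the others in this directory) can be pushed to a running SigNoz instance over its HTTP API. The `/api/v1/dashboards` endpoint and `SIGNOZ-API-KEY` header below are assumptions based on recent SigNoz versions — verify them against your instance's API documentation. This sketch only builds and prints the command rather than executing it:

```shell
# Hypothetical sketch: assemble an import command for one dashboard file.
# Endpoint path and auth header are assumptions; check your SigNoz version.
SIGNOZ_URL="http://localhost:3301"
DASHBOARD="infrastructure/monitoring/signoz/dashboards/user-activity.json"

CMD="curl -s -X POST $SIGNOZ_URL/api/v1/dashboards -H 'Content-Type: application/json' -d @$DASHBOARD"

# Print the command for inspection instead of running it blindly.
echo "$CMD"
```

In practice the `import-dashboards.sh` script added later in this commit is the supported way to do this in bulk.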
392
infrastructure/monitoring/signoz/deploy-signoz.sh
Executable file
@@ -0,0 +1,392 @@
#!/bin/bash

# ============================================================================
# SigNoz Deployment Script for Bakery IA
# ============================================================================
# This script deploys SigNoz monitoring stack using Helm
# Supports both development and production environments
# ============================================================================

set -e

# Color codes for output (printed with `echo -e` so escapes are interpreted)
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color

# Function to display help
show_help() {
    echo "Usage: $0 [OPTIONS] ENVIRONMENT"
    echo ""
    echo "Deploy SigNoz monitoring stack for Bakery IA"
    echo ""
    echo "Arguments:
  ENVIRONMENT                Environment to deploy to (dev|prod)"
    echo ""
    echo "Options:
  -h, --help                 Show this help message
  -d, --dry-run              Dry run - show what would be done without actually deploying
  -u, --upgrade              Upgrade existing deployment
  -r, --remove               Remove/Uninstall SigNoz deployment
  -n, --namespace NAMESPACE  Specify namespace (default: bakery-ia)"
    echo ""
    echo "Examples:
  $0 dev              # Deploy to development
  $0 prod             # Deploy to production
  $0 --upgrade prod   # Upgrade production deployment
  $0 --remove dev     # Remove development deployment"
    echo ""
    echo "Docker Hub Authentication:"
    echo "  This script automatically creates a Docker Hub secret for image pulls."
    echo "  Provide credentials via environment variables (recommended):"
    echo "    export DOCKERHUB_USERNAME='your-username'"
    echo "    export DOCKERHUB_PASSWORD='your-personal-access-token'"
    echo "  Or ensure you're logged in with Docker CLI:"
    echo "    docker login"
}

# Parse command line arguments
DRY_RUN=false
UPGRADE=false
REMOVE=false
NAMESPACE="bakery-ia"

while [[ $# -gt 0 ]]; do
    case $1 in
        -h|--help)
            show_help
            exit 0
            ;;
        -d|--dry-run)
            DRY_RUN=true
            shift
            ;;
        -u|--upgrade)
            UPGRADE=true
            shift
            ;;
        -r|--remove)
            REMOVE=true
            shift
            ;;
        -n|--namespace)
            NAMESPACE="$2"
            shift 2
            ;;
        dev|prod)
            ENVIRONMENT="$1"
            shift
            ;;
        *)
            echo "Unknown argument: $1"
            show_help
            exit 1
            ;;
    esac
done

# Validate environment
if [[ -z "$ENVIRONMENT" ]]; then
    echo "Error: Environment not specified. Use 'dev' or 'prod'."
    show_help
    exit 1
fi

if [[ "$ENVIRONMENT" != "dev" && "$ENVIRONMENT" != "prod" ]]; then
    echo "Error: Invalid environment. Use 'dev' or 'prod'."
    exit 1
fi

# Function to check if Helm is installed
check_helm() {
    if ! command -v helm &> /dev/null; then
        echo -e "${RED}Error: Helm is not installed. Please install Helm first.${NC}"
        echo "Installation instructions: https://helm.sh/docs/intro/install/"
        exit 1
    fi
}

# Function to check if kubectl is configured
check_kubectl() {
    if ! kubectl cluster-info &> /dev/null; then
        echo -e "${RED}Error: kubectl is not configured or cannot connect to cluster.${NC}"
        echo "Please ensure you have access to a Kubernetes cluster."
        exit 1
    fi
}

# Function to check if namespace exists, create if not
ensure_namespace() {
    if ! kubectl get namespace "$NAMESPACE" &> /dev/null; then
        echo -e "${BLUE}Creating namespace $NAMESPACE...${NC}"
        if [[ "$DRY_RUN" == true ]]; then
            echo "  (dry-run) Would create namespace $NAMESPACE"
        else
            kubectl create namespace "$NAMESPACE"
            echo -e "${GREEN}Namespace $NAMESPACE created.${NC}"
        fi
    else
        echo -e "${BLUE}Namespace $NAMESPACE already exists.${NC}"
    fi
}

# Function to create Docker Hub secret for image pulls
create_dockerhub_secret() {
    echo -e "${BLUE}Setting up Docker Hub image pull secret...${NC}"

    if [[ "$DRY_RUN" == true ]]; then
        echo "  (dry-run) Would create Docker Hub secret in namespace $NAMESPACE"
        return
    fi

    # Check if secret already exists
    if kubectl get secret dockerhub-creds -n "$NAMESPACE" &> /dev/null; then
        echo -e "${GREEN}Docker Hub secret already exists in namespace $NAMESPACE.${NC}"
        return
    fi

    # Check if Docker Hub credentials are available
    if [[ -n "$DOCKERHUB_USERNAME" ]] && [[ -n "$DOCKERHUB_PASSWORD" ]]; then
        echo -e "${BLUE}Found DOCKERHUB_USERNAME and DOCKERHUB_PASSWORD environment variables${NC}"

        kubectl create secret docker-registry dockerhub-creds \
            --docker-server=https://index.docker.io/v1/ \
            --docker-username="$DOCKERHUB_USERNAME" \
            --docker-password="$DOCKERHUB_PASSWORD" \
            --docker-email="${DOCKERHUB_EMAIL:-noreply@bakery-ia.local}" \
            -n "$NAMESPACE"

        echo -e "${GREEN}Docker Hub secret created successfully.${NC}"

    elif [[ -f "$HOME/.docker/config.json" ]]; then
        echo -e "${BLUE}Attempting to use Docker CLI credentials...${NC}"

        # Try to extract credentials from Docker config
        if grep -q "credsStore" "$HOME/.docker/config.json"; then
            echo -e "${YELLOW}Docker is using a credential store. Please set environment variables:${NC}"
            echo "  export DOCKERHUB_USERNAME='your-username'"
            echo "  export DOCKERHUB_PASSWORD='your-password-or-token'"
            echo -e "${YELLOW}Continuing without Docker Hub authentication...${NC}"
            return
        fi

        # Try to extract from base64 encoded auth
        AUTH=$(cat "$HOME/.docker/config.json" | jq -r '.auths["https://index.docker.io/v1/"].auth // empty' 2>/dev/null)
        if [[ -n "$AUTH" ]]; then
            echo -e "${GREEN}Found Docker Hub credentials in Docker config${NC}"
            local DOCKER_USERNAME=$(echo "$AUTH" | base64 -d | cut -d: -f1)
            local DOCKER_PASSWORD=$(echo "$AUTH" | base64 -d | cut -d: -f2-)

            kubectl create secret docker-registry dockerhub-creds \
                --docker-server=https://index.docker.io/v1/ \
                --docker-username="$DOCKER_USERNAME" \
                --docker-password="$DOCKER_PASSWORD" \
                --docker-email="${DOCKERHUB_EMAIL:-noreply@bakery-ia.local}" \
                -n "$NAMESPACE"

            echo -e "${GREEN}Docker Hub secret created successfully.${NC}"
        else
            echo -e "${YELLOW}Could not find Docker Hub credentials${NC}"
            echo -e "${YELLOW}To enable automatic Docker Hub authentication:${NC}"
            echo "  1. Run 'docker login', OR"
            echo "  2. Set environment variables:"
            echo "     export DOCKERHUB_USERNAME='your-username'"
            echo "     export DOCKERHUB_PASSWORD='your-password-or-token'"
            echo -e "${YELLOW}Continuing without Docker Hub authentication...${NC}"
        fi
    else
        echo -e "${YELLOW}Docker Hub credentials not found${NC}"
        echo -e "${YELLOW}To enable automatic Docker Hub authentication:${NC}"
        echo "  1. Run 'docker login', OR"
        echo "  2. Set environment variables:"
        echo "     export DOCKERHUB_USERNAME='your-username'"
        echo "     export DOCKERHUB_PASSWORD='your-password-or-token'"
        echo -e "${YELLOW}Continuing without Docker Hub authentication...${NC}"
    fi
    echo ""
}

# Function to add and update Helm repository
setup_helm_repo() {
    echo -e "${BLUE}Setting up SigNoz Helm repository...${NC}"

    if [[ "$DRY_RUN" == true ]]; then
        echo "  (dry-run) Would add SigNoz Helm repository"
        return
    fi

    # Add SigNoz Helm repository
    if helm repo list | grep -q "^signoz"; then
        echo -e "${BLUE}SigNoz repository already added, updating...${NC}"
        helm repo update signoz
    else
        echo -e "${BLUE}Adding SigNoz Helm repository...${NC}"
        helm repo add signoz https://charts.signoz.io
        helm repo update
    fi

    echo -e "${GREEN}Helm repository ready.${NC}"
    echo ""
}

# Function to deploy SigNoz
deploy_signoz() {
    local values_file="infrastructure/helm/signoz-values-$ENVIRONMENT.yaml"

    if [[ ! -f "$values_file" ]]; then
        echo -e "${RED}Error: Values file $values_file not found.${NC}"
        exit 1
    fi

    echo -e "${BLUE}Deploying SigNoz to $ENVIRONMENT environment...${NC}"
    echo "  Using values file: $values_file"
    echo "  Target namespace: $NAMESPACE"
    echo "  Chart version: Latest from signoz/signoz"

    if [[ "$DRY_RUN" == true ]]; then
        echo "  (dry-run) Would deploy SigNoz with:"
        echo "    helm upgrade --install signoz signoz/signoz -n $NAMESPACE -f $values_file --wait --timeout 15m"
        return
    fi

    # Use upgrade --install to handle both new installations and upgrades
    echo -e "${BLUE}Installing/Upgrading SigNoz...${NC}"
    echo "This may take 10-15 minutes..."

    helm upgrade --install signoz signoz/signoz \
        -n "$NAMESPACE" \
        -f "$values_file" \
        --wait \
        --timeout 15m \
        --create-namespace

    echo -e "${GREEN}SigNoz deployment completed.${NC}"
    echo ""

    # Show deployment status
    show_deployment_status
}

# Function to remove SigNoz
remove_signoz() {
    echo -e "${BLUE}Removing SigNoz deployment from namespace $NAMESPACE...${NC}"

    if [[ "$DRY_RUN" == true ]]; then
        echo "  (dry-run) Would remove SigNoz deployment"
        return
    fi

    if helm list -n "$NAMESPACE" | grep -q signoz; then
        helm uninstall signoz -n "$NAMESPACE" --wait
        echo -e "${GREEN}SigNoz deployment removed.${NC}"

        # PVCs are intentionally left in place for safety
        echo ""
        echo -e "${YELLOW}Note: Persistent Volume Claims (PVCs) were NOT deleted.${NC}"
        echo "To delete PVCs and all data, run:"
        echo "  kubectl delete pvc -n $NAMESPACE -l app.kubernetes.io/instance=signoz"
    else
        echo -e "${YELLOW}No SigNoz deployment found in namespace $NAMESPACE.${NC}"
    fi
}

# Function to show deployment status
show_deployment_status() {
    echo ""
    echo -e "${BLUE}=== SigNoz Deployment Status ===${NC}"
    echo ""

    # Get pods
    echo "Pods:"
    kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/instance=signoz
    echo ""

    # Get services
    echo "Services:"
    kubectl get svc -n "$NAMESPACE" -l app.kubernetes.io/instance=signoz
    echo ""

    # Get ingress
    echo "Ingress:"
    kubectl get ingress -n "$NAMESPACE" -l app.kubernetes.io/instance=signoz
    echo ""

    # Show access information
    show_access_info
}

# Function to show access information
show_access_info() {
    echo -e "${BLUE}=== Access Information ===${NC}"

    if [[ "$ENVIRONMENT" == "dev" ]]; then
        echo "SigNoz UI: http://monitoring.bakery-ia.local"
        echo ""
        echo "OpenTelemetry Collector Endpoints (from within cluster):"
        echo "  gRPC: signoz-otel-collector.$NAMESPACE.svc.cluster.local:4317"
        echo "  HTTP: signoz-otel-collector.$NAMESPACE.svc.cluster.local:4318"
        echo ""
        echo "Port-forward for local access:"
        echo "  kubectl port-forward -n $NAMESPACE svc/signoz 8080:8080"
        echo "  kubectl port-forward -n $NAMESPACE svc/signoz-otel-collector 4317:4317"
        echo "  kubectl port-forward -n $NAMESPACE svc/signoz-otel-collector 4318:4318"
    else
        echo "SigNoz UI: https://monitoring.bakewise.ai"
        echo ""
        echo "OpenTelemetry Collector Endpoints (from within cluster):"
        echo "  gRPC: signoz-otel-collector.$NAMESPACE.svc.cluster.local:4317"
        echo "  HTTP: signoz-otel-collector.$NAMESPACE.svc.cluster.local:4318"
        echo ""
        echo "External endpoints (if exposed):"
        echo "  Check ingress configuration for external OTLP endpoints"
    fi

    echo ""
    echo "Default credentials:"
    echo "  Username: admin@example.com"
    echo "  Password: admin"
    echo ""
    echo "Note: Change default password after first login!"
    echo ""
}

# Main execution
main() {
    echo -e "${BLUE}"
    echo "=========================================="
    echo "🚀 SigNoz Deployment for Bakery IA"
    echo "=========================================="
    echo -e "${NC}"

    # Check prerequisites
    check_helm
    check_kubectl

    # Ensure namespace
    ensure_namespace

    if [[ "$REMOVE" == true ]]; then
        remove_signoz
        exit 0
    fi

    # Setup Helm repository
    setup_helm_repo

    # Create Docker Hub secret for image pulls
    create_dockerhub_secret

    # Deploy SigNoz
    deploy_signoz

    echo -e "${GREEN}"
    echo "=========================================="
    echo "✅ SigNoz deployment completed!"
    echo "=========================================="
    echo -e "${NC}"
}

# Run main function
main
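The credential-recovery branch in `create_dockerhub_secret` above decodes the base64 `auth` entry that `docker login` writes to `~/.docker/config.json` into a username/password pair. A minimal, self-contained illustration of that decode step, using a made-up credential rather than a real one:

```shell
# Build a sample "auth" value the way Docker stores it: base64("user:password").
# "alice:s3cret-token" is a fabricated example credential.
AUTH=$(printf 'alice:s3cret-token' | base64)

# Split it back apart exactly as the deploy script does.
DOCKER_USERNAME=$(echo "$AUTH" | base64 -d | cut -d: -f1)
DOCKER_PASSWORD=$(echo "$AUTH" | base64 -d | cut -d: -f2-)

echo "$DOCKER_USERNAME"   # alice
```

The `-f2-` on the second `cut` matters: personal access tokens may themselves contain `:` characters, and `-f2-` keeps everything after the first colon intact.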
141
infrastructure/monitoring/signoz/generate-test-traffic.sh
Executable file
@@ -0,0 +1,141 @@
#!/bin/bash

# Generate Test Traffic to Services
# This script generates API calls to verify telemetry data collection

set -e

NAMESPACE="bakery-ia"
GREEN='\033[0;32m'
BLUE='\033[0;34m'
YELLOW='\033[1;33m'
NC='\033[0m'

echo -e "${BLUE}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
echo -e "${BLUE}  Generating Test Traffic for SigNoz Verification${NC}"
echo -e "${BLUE}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
echo ""

# Check if ingress is accessible
echo -e "${BLUE}Step 1: Verifying Gateway Access${NC}"
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"

GATEWAY_POD=$(kubectl get pods -n $NAMESPACE -l app=gateway --field-selector=status.phase=Running -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
if [[ -z "$GATEWAY_POD" ]]; then
    echo -e "${YELLOW}⚠ Gateway pod not running. Starting port-forward...${NC}"
    # Port forward in background
    kubectl port-forward -n $NAMESPACE svc/gateway-service 8000:8000 &
    PORT_FORWARD_PID=$!
    sleep 3
    API_URL="http://localhost:8000"
else
    echo -e "${GREEN}✓ Gateway is running: $GATEWAY_POD${NC}"
    # Use internal service
    API_URL="http://gateway-service.$NAMESPACE.svc.cluster.local:8000"
fi
echo ""

# Function to make API call from inside cluster
make_request() {
    local endpoint=$1
    local description=$2

    echo -e "${BLUE}→ Testing: $description${NC}"
    echo "  Endpoint: $endpoint"

    if [[ -n "$GATEWAY_POD" ]]; then
        # Make request from inside the gateway pod
        RESPONSE=$(kubectl exec -n $NAMESPACE $GATEWAY_POD -- curl -s -w "\nHTTP_CODE:%{http_code}" "$API_URL$endpoint" 2>/dev/null || echo "FAILED")
    else
        # Make request from localhost
        RESPONSE=$(curl -s -w "\nHTTP_CODE:%{http_code}" "$API_URL$endpoint" 2>/dev/null || echo "FAILED")
    fi

    if [[ "$RESPONSE" == "FAILED" ]]; then
        echo -e "  ${YELLOW}⚠ Request failed${NC}"
    else
        HTTP_CODE=$(echo "$RESPONSE" | grep "HTTP_CODE" | cut -d: -f2)
        if [[ "$HTTP_CODE" == "200" ]] || [[ "$HTTP_CODE" == "401" ]] || [[ "$HTTP_CODE" == "404" ]]; then
            echo -e "  ${GREEN}✓ Response received (HTTP $HTTP_CODE)${NC}"
        else
            echo -e "  ${YELLOW}⚠ Unexpected response (HTTP $HTTP_CODE)${NC}"
        fi
    fi
    echo ""
    sleep 1
}

# Generate traffic to various endpoints
echo -e "${BLUE}Step 2: Generating Traffic to Services${NC}"
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
echo ""

# Health checks (should generate traces)
make_request "/health" "Gateway Health Check"
make_request "/api/health" "API Health Check"

# Auth service endpoints
make_request "/api/auth/health" "Auth Service Health"

# Tenant service endpoints
make_request "/api/tenants/health" "Tenant Service Health"

# Inventory service endpoints
make_request "/api/inventory/health" "Inventory Service Health"

# Orders service endpoints
make_request "/api/orders/health" "Orders Service Health"

# Forecasting service endpoints
make_request "/api/forecasting/health" "Forecasting Service Health"

echo -e "${BLUE}Step 3: Checking Service Logs for Telemetry${NC}"
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
echo ""

# Check a few service pods for tracing logs
SERVICES=("auth-service" "inventory-service" "gateway")

for service in "${SERVICES[@]}"; do
    POD=$(kubectl get pods -n $NAMESPACE -l app=$service --field-selector=status.phase=Running -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
    if [[ -n "$POD" ]]; then
        echo -e "${BLUE}Checking $service ($POD)...${NC}"
        TRACING_LOG=$(kubectl logs -n $NAMESPACE $POD --tail=100 2>/dev/null | grep -i "tracing\|otel" | head -n 2 || echo "")
        if [[ -n "$TRACING_LOG" ]]; then
            echo -e "${GREEN}✓ Tracing configured:${NC}"
            echo "$TRACING_LOG" | sed 's/^/  /'
        else
            echo -e "${YELLOW}⚠ No tracing logs found${NC}"
        fi
        echo ""
    fi
done

# Wait for data to be processed
echo -e "${BLUE}Step 4: Waiting for Data Processing${NC}"
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
echo "Waiting 30 seconds for telemetry data to be processed..."
for i in {30..1}; do
    echo -ne "\r  ${i} seconds remaining..."
    sleep 1
done
echo -e "\n"

# Cleanup port-forward if started
if [[ -n "$PORT_FORWARD_PID" ]]; then
    kill $PORT_FORWARD_PID 2>/dev/null || true
fi

echo -e "${GREEN}✓ Test traffic generation complete!${NC}"
echo ""
echo -e "${BLUE}Next Steps:${NC}"
echo "1. Run the verification script to check for collected data:"
echo "   ./infrastructure/helm/verify-signoz-telemetry.sh"
echo ""
echo "2. Access SigNoz UI to visualize the data:"
echo "   http://monitoring.bakery-ia.local"
echo "   or"
echo "   kubectl port-forward -n bakery-ia svc/signoz 3301:8080"
echo "   Then go to: http://localhost:3301"
echo ""
echo -e "${BLUE}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
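The `make_request` helper above separates the response body from the HTTP status by appending a `HTTP_CODE:` marker through curl's `-w` write-out flag, then parsing the marker back out. The parsing step in isolation, with a simulated response:

```shell
# Simulate what `curl -s -w "\nHTTP_CODE:%{http_code}" ...` produces:
# the response body, a newline, then the marker line.
RESPONSE=$(printf '{"status":"ok"}\nHTTP_CODE:200')

# Extract the status code the same way make_request does.
HTTP_CODE=$(echo "$RESPONSE" | grep "HTTP_CODE" | cut -d: -f2)

echo "$HTTP_CODE"   # 200
```

This keeps the script to a single curl invocation per endpoint instead of one request for the body and another for the status code.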
175
infrastructure/monitoring/signoz/import-dashboards.sh
Executable file
@@ -0,0 +1,175 @@
#!/bin/bash

# SigNoz Dashboard Importer for Bakery IA
# This script imports all SigNoz dashboards into your SigNoz instance

# Configuration
SIGNOZ_HOST="localhost"
SIGNOZ_PORT="3301"
SIGNOZ_API_KEY=""  # Add your API key if authentication is required
DASHBOARDS_DIR="infrastructure/signoz/dashboards"

# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color

# Function to display help
show_help() {
    echo "Usage: $0 [options]"
    echo ""
    echo "Options:
  -h, --host      SigNoz host (default: localhost)
  -p, --port      SigNoz port (default: 3301)
  -k, --api-key   SigNoz API key (if required)
  -d, --dir       Dashboards directory (default: infrastructure/signoz/dashboards)
      --help      Show this help message"
    echo ""
    echo "Example:
  $0 --host signoz.example.com --port 3301 --api-key your-api-key"
}

# Parse command line arguments
# Note: -h is taken by --host, so help is available only as --help
while [[ $# -gt 0 ]]; do
    case $1 in
        -h|--host)
            SIGNOZ_HOST="$2"
            shift 2
            ;;
        -p|--port)
            SIGNOZ_PORT="$2"
            shift 2
            ;;
        -k|--api-key)
            SIGNOZ_API_KEY="$2"
            shift 2
            ;;
        -d|--dir)
            DASHBOARDS_DIR="$2"
            shift 2
            ;;
        --help)
            show_help
            exit 0
            ;;
        *)
            echo "Unknown option: $1"
            show_help
            exit 1
            ;;
    esac
done

# Check that the dashboards directory exists
if [ ! -d "$DASHBOARDS_DIR" ]; then
    echo -e "${RED}Error: Dashboards directory not found: $DASHBOARDS_DIR${NC}"
    exit 1
fi

# Check whether jq is installed for JSON validation
if ! command -v jq &> /dev/null; then
    echo -e "${YELLOW}Warning: jq not found. Skipping JSON validation.${NC}"
    VALIDATE_JSON=false
else
    VALIDATE_JSON=true
fi

# Function to validate JSON
validate_json() {
    local file="$1"
    if [ "$VALIDATE_JSON" = true ]; then
        if ! jq empty "$file" &> /dev/null; then
            echo -e "${RED}Error: Invalid JSON in file: $file${NC}"
            return 1
        fi
    fi
    return 0
}

# Function to import a single dashboard
import_dashboard() {
    local file="$1"
    local filename=$(basename "$file")
    local dashboard_name=$(jq -r '.name' "$file" 2>/dev/null || echo "Unknown")

    echo -e "${BLUE}Importing dashboard: $dashboard_name ($filename)${NC}"

    # Build the curl command as a string so optional headers can be appended
    local curl_cmd="curl -s -X POST http://$SIGNOZ_HOST:$SIGNOZ_PORT/api/v1/dashboards/import"

    if [ -n "$SIGNOZ_API_KEY" ]; then
        curl_cmd="$curl_cmd -H \"Authorization: Bearer $SIGNOZ_API_KEY\""
    fi

    curl_cmd="$curl_cmd -H \"Content-Type: application/json\" -d @\"$file\""

    # Execute the import
    local response=$(eval "$curl_cmd")

    # Check the response
    if echo "$response" | grep -q "success"; then
        echo -e "${GREEN}✓ Successfully imported: $dashboard_name${NC}"
        return 0
    else
        echo -e "${RED}✗ Failed to import: $dashboard_name${NC}"
        echo "Response: $response"
        return 1
    fi
}

# Main import process
echo -e "${YELLOW}=== SigNoz Dashboard Importer for Bakery IA ===${NC}"
echo -e "${BLUE}Configuration:${NC}"
echo "  Host: $SIGNOZ_HOST"
echo "  Port: $SIGNOZ_PORT"
echo "  Dashboards Directory: $DASHBOARDS_DIR"
if [ -n "$SIGNOZ_API_KEY" ]; then
    echo "  API Key: ******** (set)"
else
    echo "  API Key: Not configured"
fi
echo ""

# Count dashboards
DASHBOARD_COUNT=$(find "$DASHBOARDS_DIR" -name "*.json" | wc -l)
echo -e "${BLUE}Found $DASHBOARD_COUNT dashboards to import${NC}"
echo ""

# Import each dashboard
SUCCESS_COUNT=0
FAILURE_COUNT=0

for file in "$DASHBOARDS_DIR"/*.json; do
    if [ -f "$file" ]; then
        # Validate JSON before importing
        if validate_json "$file"; then
            if import_dashboard "$file"; then
                ((SUCCESS_COUNT++))
            else
                ((FAILURE_COUNT++))
            fi
        else
            ((FAILURE_COUNT++))
        fi
        echo ""
    fi
done

# Summary
echo -e "${YELLOW}=== Import Summary ===${NC}"
echo -e "${GREEN}Successfully imported: $SUCCESS_COUNT dashboards${NC}"
if [ $FAILURE_COUNT -gt 0 ]; then
    echo -e "${RED}Failed to import: $FAILURE_COUNT dashboards${NC}"
fi
echo ""

if [ $FAILURE_COUNT -eq 0 ]; then
    echo -e "${GREEN}All dashboards imported successfully!${NC}"
    echo "You can now access them in your SigNoz UI at:"
    echo "http://$SIGNOZ_HOST:$SIGNOZ_PORT/dashboards"
else
    echo -e "${YELLOW}Some dashboards failed to import. Check the errors above.${NC}"
    exit 1
fi
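The validation step above can be exercised standalone, without a SigNoz instance. The following sketch mirrors the script's jq check against a throwaway directory; the temp file names are illustrative, and the python3 fallback is an assumption for machines without jq (the real script simply skips validation in that case).

```shell
# Standalone sketch of the dashboard-JSON validation loop.
tmpdir=$(mktemp -d)
printf '%s' '{"name": "Bakery Overview"}' > "$tmpdir/ok.json"
printf '%s' '{broken' > "$tmpdir/bad.json"

# Prefer jq (as the script does); fall back to python3 if jq is absent.
if command -v jq >/dev/null 2>&1; then
  validate() { jq empty "$1" 2>/dev/null; }
else
  validate() { python3 -c 'import json,sys; json.load(open(sys.argv[1]))' "$1" 2>/dev/null; }
fi

valid=0; invalid=0
for f in "$tmpdir"/*.json; do
  if validate "$f"; then valid=$((valid+1)); else invalid=$((invalid+1)); fi
done
echo "valid=$valid invalid=$invalid"   # → valid=1 invalid=1
rm -rf "$tmpdir"
```

Running this before pointing the importer at a real instance catches malformed dashboard files early, which is exactly why the script counts invalid files as failures rather than silently skipping them.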
853
infrastructure/monitoring/signoz/signoz-values-dev.yaml
Normal file
@@ -0,0 +1,853 @@
# SigNoz Helm Chart Values - Development Environment
# Optimized for local development with minimal resource usage
# DEPLOYED IN bakery-ia NAMESPACE - Ingress managed by bakery-ingress
#
# Official Chart: https://github.com/SigNoz/charts
# Install Command: helm install signoz signoz/signoz -n bakery-ia -f signoz-values-dev.yaml

global:
  storageClass: "standard"
  clusterName: "bakery-ia-dev"
  domain: "monitoring.bakery-ia.local"
  # Docker Hub credentials - applied to all sub-charts (including Zookeeper, ClickHouse, etc.)
  imagePullSecrets:
    - dockerhub-creds

# Docker Hub credentials for pulling images (root level for SigNoz components)
imagePullSecrets:
  - dockerhub-creds

# SigNoz main component (includes frontend and query service)
signoz:
  replicaCount: 1

  service:
    type: ClusterIP
    port: 8080

  # DISABLE the built-in ingress - using the unified bakery-ingress instead
  # Route configured in infrastructure/kubernetes/overlays/dev/dev-ingress.yaml
  ingress:
    enabled: false

  resources:
    requests:
      cpu: 100m        # Combined frontend + query service
      memory: 256Mi
    limits:
      cpu: 1000m
      memory: 1Gi

  # Environment variables (new format - replaces configVars)
  env:
    signoz_telemetrystore_provider: "clickhouse"
    dot_metrics_enabled: "true"
    signoz_emailing_enabled: "false"
    signoz_alertmanager_provider: "signoz"
    # Retention for dev (7 days)
    signoz_traces_ttl_duration_hrs: "168"
    signoz_metrics_ttl_duration_hrs: "168"
    signoz_logs_ttl_duration_hrs: "168"
    # OpAMP server configuration - DISABLED for dev (causes gRPC instability)
    signoz_opamp_server_enabled: "false"
    # signoz_opamp_server_endpoint: "0.0.0.0:4320"

  persistence:
    enabled: true
    size: 5Gi
    storageClass: "standard"

# AlertManager configuration
alertmanager:
  replicaCount: 1
  image:
    repository: signoz/alertmanager
    tag: 0.23.5
    pullPolicy: IfNotPresent

  service:
    type: ClusterIP
    port: 9093

  resources:
    requests:
      cpu: 25m       # Reduced for local dev
      memory: 64Mi   # Reduced for local dev
    limits:
      cpu: 200m
      memory: 256Mi

  persistence:
    enabled: true
    size: 2Gi
    storageClass: "standard"

  config:
    global:
      resolve_timeout: 5m
    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 12h
      receiver: 'default'
    receivers:
      - name: 'default'
        # Add email, Slack, or webhook configs here

# ClickHouse configuration - time-series database
# Minimal resources for local development on a constrained Kind cluster
clickhouse:
  enabled: true
  installCustomStorageClass: false

  image:
    registry: docker.io
    repository: clickhouse/clickhouse-server
    tag: 25.5.6  # Officially recommended version

  # Reduce ClickHouse resource requests for local dev
  clickhouse:
    resources:
      requests:
        cpu: 200m      # Reduced from the default 500m
        memory: 512Mi
      limits:
        cpu: 1000m
        memory: 1Gi

  persistence:
    enabled: true
    size: 20Gi

# Zookeeper configuration (required by ClickHouse)
zookeeper:
  enabled: true
  replicaCount: 1  # Single replica for dev

  image:
    tag: 3.7.1  # Officially recommended version

  resources:
    requests:
      cpu: 100m
      memory: 256Mi
    limits:
      cpu: 500m
      memory: 512Mi

  persistence:
    enabled: true
    size: 5Gi

# OpenTelemetry Collector - data ingestion endpoint for all telemetry
otelCollector:
  enabled: true
  replicaCount: 1

  image:
    repository: signoz/signoz-otel-collector
    tag: v0.129.12  # Latest recommended version

  # OpAMP configuration - DISABLED for development
  # OpAMP is designed for production with remote config management.
  # In dev it causes gRPC instability and collector reloads,
  # so we use static configuration instead.

  # Init containers for the OTel Collector pod
  initContainers:
    fix-postgres-tls:
      enabled: true
      image:
        registry: docker.io
        repository: busybox
        tag: 1.35
        pullPolicy: IfNotPresent
      command:
        - sh
        - -c
        - |
          echo "Fixing PostgreSQL TLS file permissions..."
          cp /etc/postgres-tls-source/* /etc/postgres-tls/
          chmod 600 /etc/postgres-tls/server-key.pem
          chmod 644 /etc/postgres-tls/server-cert.pem
          chmod 644 /etc/postgres-tls/ca-cert.pem
          echo "PostgreSQL TLS permissions fixed"
      volumeMounts:
        - name: postgres-tls-source
          mountPath: /etc/postgres-tls-source
          readOnly: true
        - name: postgres-tls-fixed
          mountPath: /etc/postgres-tls
          readOnly: false

  # Service configuration - expose both gRPC and HTTP endpoints
  service:
    type: ClusterIP
    ports:
      # gRPC receivers
      - name: otlp-grpc
        port: 4317
        targetPort: 4317
        protocol: TCP
      # HTTP receivers
      - name: otlp-http
        port: 4318
        targetPort: 4318
        protocol: TCP
      # Prometheus remote write
      - name: prometheus
        port: 8889
        targetPort: 8889
        protocol: TCP
      # Metrics
      - name: metrics
        port: 8888
        targetPort: 8888
        protocol: TCP

  resources:
    requests:
      cpu: 50m       # Reduced from 100m
      memory: 128Mi  # Reduced from 256Mi
    limits:
      cpu: 500m
      memory: 512Mi

  # Additional environment variables for receivers
  additionalEnvs:
    POSTGRES_MONITOR_USER: "monitoring"
    POSTGRES_MONITOR_PASSWORD: "monitoring_369f9c001f242b07ef9e2826e17169ca"
    REDIS_PASSWORD: "OxdmdJjdVNXp37MNC2IFoMnTpfGGFv1k"
    RABBITMQ_USER: "bakery"
    RABBITMQ_PASSWORD: "forecast123"

  # Mount TLS certificates for secure connections
  extraVolumes:
    - name: redis-tls
      secret:
        secretName: redis-tls-secret
    - name: postgres-tls
      secret:
        secretName: postgres-tls
    - name: postgres-tls-fixed
      emptyDir: {}
    - name: varlogpods
      hostPath:
        path: /var/log/pods

  extraVolumeMounts:
    - name: redis-tls
      mountPath: /etc/redis-tls
      readOnly: true
    - name: postgres-tls
      mountPath: /etc/postgres-tls-source
      readOnly: true
    - name: postgres-tls-fixed
      mountPath: /etc/postgres-tls
      readOnly: false
    - name: varlogpods
      mountPath: /var/log/pods
      readOnly: true

  # Disable OpAMP - use static configuration only
  # Use 'args' instead of 'extraArgs' to completely override the command
  command:
    name: /signoz-otel-collector
    args:
      - --config=/conf/otel-collector-config.yaml
      - --feature-gates=-pkg.translator.prometheus.NormalizeName

  # OpenTelemetry Collector configuration
  config:
    # Connectors - bridge between pipelines
    connectors:
      signozmeter:
        dimensions:
          - name: service.name
          - name: deployment.environment
          - name: host.name
        metrics_flush_interval: 1h

    receivers:
      # OTLP receivers for traces, metrics, and logs from applications
      # All application telemetry is pushed via the OTLP protocol
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
            cors:
              allowed_origins:
                - "*"

      # Filelog receiver for Kubernetes pod logs
      # Collects container stdout/stderr from /var/log/pods
      filelog:
        include:
          - /var/log/pods/*/*/*.log
        exclude:
          # Exclude SigNoz's own logs to avoid recursive collection
          - /var/log/pods/bakery-ia_signoz-*/*/*.log
        include_file_path: true
        include_file_name: false
        operators:
          # Parse the CRI-O / containerd log format
          - type: regex_parser
            regex: '^(?P<time>[^ ]+) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) (?P<log>.*)$'
            timestamp:
              parse_from: attributes.time
              layout: '%Y-%m-%dT%H:%M:%S.%LZ'
          # Preserve the parsed time field as a timestamp attribute
          - type: move
            from: attributes.time
            to: attributes.timestamp
          # Extract Kubernetes metadata from the file path
          - type: regex_parser
            id: extract_metadata_from_filepath
            regex: '^.*\/(?P<namespace>[^_]+)_(?P<pod_name>[^_]+)_(?P<uid>[^\/]+)\/(?P<container_name>[^\._]+)\/(?P<restart_count>\d+)\.log$'
            parse_from: attributes["log.file.path"]
          # Move metadata to resource attributes
          - type: move
            from: attributes.namespace
            to: resource["k8s.namespace.name"]
          - type: move
            from: attributes.pod_name
            to: resource["k8s.pod.name"]
          - type: move
            from: attributes.container_name
            to: resource["k8s.container.name"]
          - type: move
            from: attributes.log
            to: body

      # Kubernetes cluster receiver - collects cluster-level metrics
      # Provides information about nodes, namespaces, pods, and other cluster resources
      k8s_cluster:
        collection_interval: 30s
        node_conditions_to_report:
          - Ready
          - MemoryPressure
          - DiskPressure
          - PIDPressure
          - NetworkUnavailable
        allocatable_types_to_report:
          - cpu
          - memory
          - pods

      # PostgreSQL receivers for database metrics
      # ENABLED: monitor users configured and credentials stored in secrets
      # Collects metrics directly from the PostgreSQL databases with proper TLS
      postgresql/auth:
        endpoint: auth-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - auth_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/inventory:
        endpoint: inventory-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - inventory_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/orders:
        endpoint: orders-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - orders_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/ai-insights:
        endpoint: ai-insights-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - ai_insights_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/alert-processor:
        endpoint: alert-processor-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - alert_processor_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/distribution:
        endpoint: distribution-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - distribution_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/external:
        endpoint: external-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - external_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/forecasting:
        endpoint: forecasting-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - forecasting_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/notification:
        endpoint: notification-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - notification_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/orchestrator:
        endpoint: orchestrator-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - orchestrator_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/pos:
        endpoint: pos-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - pos_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/procurement:
        endpoint: procurement-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - procurement_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/production:
        endpoint: production-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - production_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/recipes:
        endpoint: recipes-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - recipes_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/sales:
        endpoint: sales-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - sales_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/suppliers:
        endpoint: suppliers-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - suppliers_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/tenant:
        endpoint: tenant-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - tenant_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/training:
        endpoint: training-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - training_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      # Redis receiver for cache metrics
      # ENABLED: using existing credentials from redis-secrets with TLS
      redis:
        endpoint: redis-service.bakery-ia:6379
        password: ${env:REDIS_PASSWORD}
        collection_interval: 60s
        transport: tcp
        tls:
          insecure_skip_verify: false
          cert_file: /etc/redis-tls/redis-cert.pem
          key_file: /etc/redis-tls/redis-key.pem
          ca_file: /etc/redis-tls/ca-cert.pem
        metrics:
          redis.maxmemory:
            enabled: true
          redis.cmd.latency:
            enabled: true

      # RabbitMQ receiver via the management API
      # ENABLED: using existing credentials from rabbitmq-secrets
      rabbitmq:
        endpoint: http://rabbitmq-service.bakery-ia:15672
        username: ${env:RABBITMQ_USER}
        password: ${env:RABBITMQ_PASSWORD}
        collection_interval: 30s

      # Prometheus receiver - scrapes metrics from the Kubernetes API
      # Simplified configuration using only Kubernetes API metrics
      prometheus:
        config:
          scrape_configs:
            - job_name: 'kubernetes-nodes-cadvisor'
              scrape_interval: 30s
              scrape_timeout: 10s
              scheme: https
              tls_config:
                insecure_skip_verify: true
              bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              kubernetes_sd_configs:
                - role: node
              relabel_configs:
                - action: labelmap
                  regex: __meta_kubernetes_node_label_(.+)
                - target_label: __address__
                  replacement: kubernetes.default.svc:443
                - source_labels: [__meta_kubernetes_node_name]
                  regex: (.+)
                  target_label: __metrics_path__
                  replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
            - job_name: 'kubernetes-apiserver'
              scrape_interval: 30s
              scrape_timeout: 10s
              scheme: https
              tls_config:
                insecure_skip_verify: true
              bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              kubernetes_sd_configs:
                - role: endpoints
              relabel_configs:
                - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
                  action: keep
                  regex: default;kubernetes;https

    processors:
      # Batch processor for better performance (optimized for high throughput)
      batch:
        timeout: 1s
        send_batch_size: 10000      # Increased from 1024 for better throughput
        send_batch_max_size: 10000

      # Batch processor for meter data
      batch/meter:
        timeout: 1s
        send_batch_size: 20000
        send_batch_max_size: 25000

      # Memory limiter to prevent OOM kills
      memory_limiter:
        check_interval: 1s
        limit_mib: 400
        spike_limit_mib: 100

      # Resource detection
      resourcedetection:
        detectors: [env, system, docker]
        timeout: 5s

      # Kubernetes attributes processor - CRITICAL for logs
      # Enriches telemetry with pod, namespace, and container metadata
      k8sattributes:
        auth_type: "serviceAccount"
        passthrough: false
        extract:
          metadata:
            - k8s.pod.name
            - k8s.pod.uid
            - k8s.deployment.name
            - k8s.namespace.name
            - k8s.node.name
            - k8s.container.name
          labels:
            - tag_name: "app"
            - tag_name: "pod-template-hash"
          annotations:
            - tag_name: "description"

      # SigNoz span metrics processor with delta aggregation (recommended)
      # Generates RED metrics (Rate, Errors, Duration) from trace spans
      signozspanmetrics/delta:
        aggregation_temporality: AGGREGATION_TEMPORALITY_DELTA
        metrics_exporter: signozclickhousemetrics
        latency_histogram_buckets: [100us, 1ms, 2ms, 6ms, 10ms, 50ms, 100ms, 250ms, 500ms, 1000ms, 1400ms, 2000ms, 5s, 10s, 20s, 40s, 60s]
        dimensions_cache_size: 100000
        dimensions:
          - name: service.namespace
            default: default
          - name: deployment.environment
            default: default
          - name: signoz.collector.id

    exporters:
      # ClickHouse exporter for traces
      clickhousetraces:
        datasource: tcp://admin:27ff0399-0d3a-4bd8-919d-17c2181e6fb9@signoz-clickhouse:9000/?database=signoz_traces
        timeout: 10s
        retry_on_failure:
          enabled: true
          initial_interval: 5s
          max_interval: 30s
          max_elapsed_time: 300s

      # ClickHouse exporter for metrics
      signozclickhousemetrics:
        dsn: "tcp://admin:27ff0399-0d3a-4bd8-919d-17c2181e6fb9@signoz-clickhouse:9000/signoz_metrics"
        timeout: 10s
        retry_on_failure:
          enabled: true
          initial_interval: 5s
          max_interval: 30s
          max_elapsed_time: 300s

      # ClickHouse exporter for meter data (usage metrics)
      signozclickhousemeter:
        dsn: "tcp://admin:27ff0399-0d3a-4bd8-919d-17c2181e6fb9@signoz-clickhouse:9000/signoz_meter"
        timeout: 45s
        sending_queue:
          enabled: false

      # ClickHouse exporter for logs
      clickhouselogsexporter:
        dsn: tcp://admin:27ff0399-0d3a-4bd8-919d-17c2181e6fb9@signoz-clickhouse:9000/?database=signoz_logs
        timeout: 10s
        retry_on_failure:
          enabled: true
          initial_interval: 5s
          max_interval: 30s

      # Metadata exporter for service metadata
      metadataexporter:
        dsn: "tcp://admin:27ff0399-0d3a-4bd8-919d-17c2181e6fb9@signoz-clickhouse:9000/signoz_metadata"
        timeout: 10s
        cache:
          provider: in_memory

      # Debug exporter (optional)
      debug:
        verbosity: detailed
        sampling_initial: 5
        sampling_thereafter: 200

    service:
      pipelines:
        # Traces pipeline - exports to ClickHouse and the signozmeter connector
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch, signozspanmetrics/delta, resourcedetection]
          exporters: [clickhousetraces, metadataexporter, signozmeter]

        # Metrics pipeline
        metrics:
          receivers: [otlp,
            postgresql/auth, postgresql/inventory, postgresql/orders,
            postgresql/ai-insights, postgresql/alert-processor, postgresql/distribution,
            postgresql/external, postgresql/forecasting, postgresql/notification,
            postgresql/orchestrator, postgresql/pos, postgresql/procurement,
            postgresql/production, postgresql/recipes, postgresql/sales,
            postgresql/suppliers, postgresql/tenant, postgresql/training,
            redis, rabbitmq, k8s_cluster, prometheus]
          processors: [memory_limiter, batch, resourcedetection]
          exporters: [signozclickhousemetrics]

        # Meter pipeline - receives from the signozmeter connector
        metrics/meter:
          receivers: [signozmeter]
          processors: [batch/meter]
          exporters: [signozclickhousemeter]

        # Logs pipeline - includes both OTLP and Kubernetes pod logs
        logs:
          receivers: [otlp, filelog]
          processors: [memory_limiter, batch, resourcedetection, k8sattributes]
          exporters: [clickhouselogsexporter]

  # ClusterRole configuration for Kubernetes monitoring
  # CRITICAL: required for the k8s_cluster receiver to access the Kubernetes API
  # Without these permissions, k8s metrics will not appear in the SigNoz UI
  clusterRole:
    create: true
    name: "signoz-otel-collector-bakery-ia"
    annotations: {}
    # Complete RBAC rules required by the k8sclusterreceiver
    # Based on OpenTelemetry and SigNoz official documentation
    rules:
      # Core API group - fundamental Kubernetes resources
      - apiGroups: [""]
        resources:
          - "events"
          - "namespaces"
          - "nodes"
          - "nodes/proxy"
          - "nodes/metrics"
          - "nodes/spec"
          - "pods"
          - "pods/status"
          - "replicationcontrollers"
          - "replicationcontrollers/status"
          - "resourcequotas"
          - "services"
          - "endpoints"
        verbs: ["get", "list", "watch"]
      # Apps API group - modern workload controllers
      - apiGroups: ["apps"]
        resources: ["deployments", "daemonsets", "statefulsets", "replicasets"]
        verbs: ["get", "list", "watch"]
      # Batch API group - job management
      - apiGroups: ["batch"]
        resources: ["jobs", "cronjobs"]
        verbs: ["get", "list", "watch"]
      # Autoscaling API group - HPA metrics (CRITICAL)
      - apiGroups: ["autoscaling"]
        resources: ["horizontalpodautoscalers"]
        verbs: ["get", "list", "watch"]
      # Extensions API group - legacy support
      - apiGroups: ["extensions"]
        resources: ["deployments", "daemonsets", "replicasets"]
        verbs: ["get", "list", "watch"]
      # Metrics API group - resource metrics
      - apiGroups: ["metrics.k8s.io"]
        resources: ["nodes", "pods"]
        verbs: ["get", "list", "watch"]
    clusterRoleBinding:
      annotations: {}
      name: "signoz-otel-collector-bakery-ia"

# Additional configuration
serviceAccount:
  create: true
  annotations: {}
  name: "signoz-otel-collector"

# Security context
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  fsGroup: 1000

# Network policies (disabled for dev)
networkPolicy:
  enabled: false

# Monitoring SigNoz itself
selfMonitoring:
  enabled: true
  serviceMonitor:
    enabled: false
998	infrastructure/monitoring/signoz/signoz-values-prod.yaml	Normal file
@@ -0,0 +1,998 @@
# SigNoz Helm Chart Values - Production Environment
# High-availability configuration with resource optimization
# DEPLOYED IN bakery-ia NAMESPACE - Ingress managed by bakery-ingress-prod
#
# Official Chart: https://github.com/SigNoz/charts
# Install Command: helm install signoz signoz/signoz -n bakery-ia -f signoz-values-prod.yaml

global:
  storageClass: "microk8s-hostpath"  # For MicroK8s use "microk8s-hostpath"; otherwise a custom storage class
  clusterName: "bakery-ia-prod"
  domain: "monitoring.bakewise.ai"
  # Docker Hub credentials - applied to all sub-charts (Zookeeper, ClickHouse, etc.)
  imagePullSecrets:
    - dockerhub-creds

# Docker Hub credentials for pulling images (root level, for the SigNoz components)
imagePullSecrets:
  - dockerhub-creds
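The chart only references this secret by name; it must already exist in the target namespace. A minimal sketch of creating it and of the payload it carries (credentials below are illustrative placeholders):

```shell
# On a live cluster the secret would be created with:
#   kubectl create secret docker-registry dockerhub-creds \
#     --namespace bakery-ia \
#     --docker-server=https://index.docker.io/v1/ \
#     --docker-username="$DOCKERHUB_USERNAME" \
#     --docker-password="$DOCKERHUB_PASSWORD"
# The generated .dockerconfigjson payload has this shape:
DOCKERHUB_USERNAME='demo-user'
DOCKERHUB_PASSWORD='demo-token'
AUTH=$(printf '%s:%s' "$DOCKERHUB_USERNAME" "$DOCKERHUB_PASSWORD" | base64)
printf '{"auths":{"https://index.docker.io/v1/":{"auth":"%s"}}}\n' "$AUTH"
```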

# SigNoz Main Component (unified frontend + query service)
# BREAKING CHANGE: v0.89.0+ uses a unified component instead of separate frontend/queryService
signoz:
  replicaCount: 2

  image:
    repository: signoz/signoz
    tag: v0.106.0  # Latest stable version
    pullPolicy: IfNotPresent

  service:
    type: ClusterIP
    port: 8080  # HTTP/API port
    internalPort: 8085  # Internal gRPC port

  # DISABLE the built-in ingress - the unified bakery-ingress-prod is used instead.
  # The route is configured in infrastructure/kubernetes/overlays/prod/prod-ingress.yaml
  ingress:
    enabled: false

  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: 2000m
      memory: 4Gi

  # Pod anti-affinity for HA
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app.kubernetes.io/component: query-service
            topologyKey: kubernetes.io/hostname

  # Environment variables (new format - replaces configVars)
  env:
    signoz_telemetrystore_provider: "clickhouse"
    dot_metrics_enabled: "true"
    signoz_emailing_enabled: "true"
    signoz_alertmanager_provider: "signoz"
    # Retention configuration (30 days for prod)
    signoz_traces_ttl_duration_hrs: "720"
    signoz_metrics_ttl_duration_hrs: "720"
    signoz_logs_ttl_duration_hrs: "720"
    # OpAMP Server Configuration
    # WARNING: OpAMP can cause gRPC instability and collector reloads.
    # Only enable it if you have a stable OpAMP backend server.
    signoz_opamp_server_enabled: "false"
    # signoz_opamp_server_endpoint: "0.0.0.0:4320"
    # SMTP configuration for email alerts - now using Mailu as the SMTP server
    signoz_smtp_enabled: "true"
    signoz_smtp_host: "email-smtp.bakery-ia.svc.cluster.local"
    signoz_smtp_port: "587"
    signoz_smtp_from: "alerts@bakewise.ai"
    signoz_smtp_username: "alerts@bakewise.ai"
    # The password should be set via a secret: signoz_smtp_password
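As the comment notes, the SMTP password should not live in this values file. One possible shape for such a secret (the name and key below are illustrative; how it is wired into the pod depends on the chart version in use):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: signoz-smtp          # hypothetical secret name
  namespace: bakery-ia
type: Opaque
stringData:
  signoz_smtp_password: "change-me"
```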

  persistence:
    enabled: true
    size: 20Gi
    storageClass: "standard"

  # Horizontal Pod Autoscaler
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 5
    targetCPUUtilizationPercentage: 70
    targetMemoryUtilizationPercentage: 80

# AlertManager Configuration
alertmanager:
  enabled: true
  replicaCount: 2

  image:
    repository: signoz/alertmanager
    tag: 0.23.5
    pullPolicy: IfNotPresent

  service:
    type: ClusterIP
    port: 9093

  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 500m
      memory: 512Mi

  # Pod anti-affinity for HA
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: app
                  operator: In
                  values:
                    - signoz-alertmanager
            topologyKey: kubernetes.io/hostname

  persistence:
    enabled: true
    size: 5Gi
    storageClass: "standard"

  config:
    global:
      resolve_timeout: 5m
      smtp_smarthost: 'email-smtp.bakery-ia.svc.cluster.local:587'
      smtp_from: 'alerts@bakewise.ai'
      smtp_auth_username: 'alerts@bakewise.ai'
      smtp_auth_password: '${SMTP_PASSWORD}'
      smtp_require_tls: true

    route:
      group_by: ['alertname', 'cluster', 'service', 'severity']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 12h
      receiver: 'critical-alerts'
      routes:
        - match:
            severity: critical
          receiver: 'critical-alerts'
          continue: true
        - match:
            severity: warning
          receiver: 'warning-alerts'

    receivers:
      - name: 'critical-alerts'
        email_configs:
          - to: 'critical-alerts@bakewise.ai'
            headers:
              Subject: '[CRITICAL] {{ .GroupLabels.alertname }} - Bakery IA'
        # Slack webhook for critical alerts
        slack_configs:
          - api_url: '${SLACK_WEBHOOK_URL}'
            channel: '#alerts-critical'
            title: '[CRITICAL] {{ .GroupLabels.alertname }}'
            text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

      - name: 'warning-alerts'
        email_configs:
          - to: 'oncall@bakewise.ai'
            headers:
              Subject: '[WARNING] {{ .GroupLabels.alertname }} - Bakery IA'

# ClickHouse Configuration - Time Series Database
clickhouse:
  enabled: true
  installCustomStorageClass: false

  image:
    registry: docker.io
    repository: clickhouse/clickhouse-server
    tag: 25.5.6  # Updated to the officially recommended version
    pullPolicy: IfNotPresent

  # ClickHouse resources (nested config)
  clickhouse:
    resources:
      requests:
        cpu: 1000m
        memory: 2Gi
      limits:
        cpu: 4000m
        memory: 8Gi

  # Pod anti-affinity for HA
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - signoz-clickhouse
          topologyKey: kubernetes.io/hostname

  persistence:
    enabled: true
    size: 100Gi
    storageClass: "standard"

  # Cold storage configuration for better disk-space management
  coldStorage:
    enabled: true
    defaultKeepFreeSpaceBytes: 10737418240  # Keep 10GB free
    ttl:
      deleteTTLDays: 30  # Move old data to cold storage after 30 days

# Zookeeper Configuration (required by ClickHouse for coordination)
zookeeper:
  enabled: true
  replicaCount: 3  # CRITICAL: always use 3 replicas for production HA

  image:
    tag: 3.7.1  # Officially recommended version

  resources:
    requests:
      cpu: 100m
      memory: 256Mi
    limits:
      cpu: 500m
      memory: 512Mi

  persistence:
    enabled: true
    size: 10Gi
    storageClass: "standard"

# OpenTelemetry Collector - integrated with SigNoz
otelCollector:
  enabled: true
  replicaCount: 2

  image:
    repository: signoz/signoz-otel-collector
    tag: v0.129.12  # Updated to the latest recommended version
    pullPolicy: IfNotPresent

  # Init containers for the OTel Collector pod
  initContainers:
    fix-postgres-tls:
      enabled: true
      image:
        registry: docker.io
        repository: busybox
        tag: 1.35
        pullPolicy: IfNotPresent
      command:
        - sh
        - -c
        - |
          echo "Fixing PostgreSQL TLS file permissions..."
          cp /etc/postgres-tls-source/* /etc/postgres-tls/
          chmod 600 /etc/postgres-tls/server-key.pem
          chmod 644 /etc/postgres-tls/server-cert.pem
          chmod 644 /etc/postgres-tls/ca-cert.pem
          echo "PostgreSQL TLS permissions fixed"
      volumeMounts:
        - name: postgres-tls-source
          mountPath: /etc/postgres-tls-source
          readOnly: true
        - name: postgres-tls-fixed
          mountPath: /etc/postgres-tls
          readOnly: false

  service:
    type: ClusterIP
    ports:
      - name: otlp-grpc
        port: 4317
        targetPort: 4317
        protocol: TCP
      - name: otlp-http
        port: 4318
        targetPort: 4318
        protocol: TCP
      - name: prometheus
        port: 8889
        targetPort: 8889
        protocol: TCP
      - name: metrics
        port: 8888
        targetPort: 8888
        protocol: TCP
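Instrumented services point their OTLP exporters at these ports. A sketch of the standard OpenTelemetry SDK environment variables an application pod might set (the service DNS name below is an assumption based on a release named `signoz` in the `bakery-ia` namespace and may differ in your cluster):

```shell
# Standard OTel SDK configuration; host name is an assumption, ports match the
# service definition above (4318 = OTLP/HTTP, 4317 = OTLP/gRPC).
export OTEL_EXPORTER_OTLP_ENDPOINT="http://signoz-otel-collector.bakery-ia.svc.cluster.local:4318"
export OTEL_EXPORTER_OTLP_PROTOCOL="http/protobuf"   # use "grpc" with port 4317
export OTEL_RESOURCE_ATTRIBUTES="deployment.environment=production"
echo "$OTEL_EXPORTER_OTLP_ENDPOINT"
```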

  resources:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      cpu: 2000m
      memory: 2Gi

  # Additional environment variables for receivers
  additionalEnvs:
    POSTGRES_MONITOR_USER: "monitoring"
    POSTGRES_MONITOR_PASSWORD: "monitoring_369f9c001f242b07ef9e2826e17169ca"
    REDIS_PASSWORD: "OxdmdJjdVNXp37MNC2IFoMnTpfGGFv1k"
    RABBITMQ_USER: "bakery"
    RABBITMQ_PASSWORD: "forecast123"

  # Mount TLS certificates for secure connections
  extraVolumes:
    - name: redis-tls
      secret:
        secretName: redis-tls-secret
    - name: postgres-tls
      secret:
        secretName: postgres-tls
    - name: postgres-tls-fixed
      emptyDir: {}
    - name: varlogpods
      hostPath:
        path: /var/log/pods

  extraVolumeMounts:
    - name: redis-tls
      mountPath: /etc/redis-tls
      readOnly: true
    - name: postgres-tls
      mountPath: /etc/postgres-tls-source
      readOnly: true
    - name: postgres-tls-fixed
      mountPath: /etc/postgres-tls
      readOnly: false
    - name: varlogpods
      mountPath: /var/log/pods
      readOnly: true

  # Enable OpAMP for dynamic configuration management
  command:
    name: /signoz-otel-collector
    extraArgs:
      - --config=/conf/otel-collector-config.yaml
      - --manager-config=/conf/otel-collector-opamp-config.yaml
      - --feature-gates=-pkg.translator.prometheus.NormalizeName

  # Full OTel Collector configuration
  config:
    # Connectors - bridge between pipelines
    connectors:
      signozmeter:
        dimensions:
          - name: service.name
          - name: deployment.environment
          - name: host.name
        metrics_flush_interval: 1h

    extensions:
      health_check:
        endpoint: 0.0.0.0:13133
      zpages:
        endpoint: 0.0.0.0:55679

    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
            max_recv_msg_size_mib: 32  # Increased for larger payloads
          http:
            endpoint: 0.0.0.0:4318
            cors:
              allowed_origins:
                - "https://monitoring.bakewise.ai"
                - "https://*.bakewise.ai"

      # Filelog receiver for Kubernetes pod logs
      # Collects container stdout/stderr from /var/log/pods
      filelog:
        include:
          - /var/log/pods/*/*/*.log
        exclude:
          # Exclude SigNoz's own logs to avoid recursive collection
          - /var/log/pods/bakery-ia_signoz-*/*/*.log
        include_file_path: true
        include_file_name: false
        operators:
          # Parse the CRI-O / containerd log format
          - type: regex_parser
            regex: '^(?P<time>[^ ]+) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) (?P<log>.*)$'
            timestamp:
              parse_from: attributes.time
              layout: '%Y-%m-%dT%H:%M:%S.%LZ'
          # Fix timestamp parsing - extract from the parsed time field
          - type: move
            from: attributes.time
            to: attributes.timestamp
          # Extract Kubernetes metadata from the file path
          - type: regex_parser
            id: extract_metadata_from_filepath
            regex: '^.*\/(?P<namespace>[^_]+)_(?P<pod_name>[^_]+)_(?P<uid>[^\/]+)\/(?P<container_name>[^\._]+)\/(?P<restart_count>\d+)\.log$'
            parse_from: attributes["log.file.path"]
          # Move metadata to resource attributes
          - type: move
            from: attributes.namespace
            to: resource["k8s.namespace.name"]
          - type: move
            from: attributes.pod_name
            to: resource["k8s.pod.name"]
          - type: move
            from: attributes.container_name
            to: resource["k8s.container.name"]
          - type: move
            from: attributes.log
            to: body
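The two `regex_parser` operators above can be sanity-checked outside the collector. This shell sketch is illustrative only (the collector applies the regexes itself); it uses plain string splitting to show which fields each operator extracts from a sample containerd log line and pod log path:

```shell
# CRI/containerd log lines are "<time> <stream> <logtag> <log body>":
line='2024-05-01T12:00:00.123456789Z stdout F hello from the bakery'
printf '%s\n' "$line" | {
  read -r ts stream logtag log
  echo "stream=$stream logtag=$logtag"
  echo "log=$log"
}

# Pod log paths encode <namespace>_<pod>_<uid>/<container>/<restart>.log:
path='/var/log/pods/bakery-ia_auth-service-7d9f_0f7c1a2b/auth-service/0.log'
poddir=${path#/var/log/pods/}; poddir=${poddir%%/*}   # bakery-ia_auth-service-7d9f_0f7c1a2b
namespace=${poddir%%_*}                               # bakery-ia
container=$(basename "$(dirname "$path")")            # auth-service
echo "k8s.namespace.name=$namespace k8s.container.name=$container"
```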

      # Kubernetes Cluster Receiver - collects cluster-level metrics
      # Provides information about nodes, namespaces, pods, and other cluster resources
      k8s_cluster:
        collection_interval: 30s
        node_conditions_to_report:
          - Ready
          - MemoryPressure
          - DiskPressure
          - PIDPressure
          - NetworkUnavailable
        allocatable_types_to_report:
          - cpu
          - memory
          - pods

      # Prometheus receiver for scraping metrics
      prometheus:
        config:
          scrape_configs:
            - job_name: 'kubernetes-nodes-cadvisor'
              scrape_interval: 30s
              scrape_timeout: 10s
              scheme: https
              tls_config:
                insecure_skip_verify: true
              bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              kubernetes_sd_configs:
                - role: node
              relabel_configs:
                - action: labelmap
                  regex: __meta_kubernetes_node_label_(.+)
                - target_label: __address__
                  replacement: kubernetes.default.svc:443
                - source_labels: [__meta_kubernetes_node_name]
                  regex: (.+)
                  target_label: __metrics_path__
                  replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
            - job_name: 'kubernetes-apiserver'
              scrape_interval: 30s
              scrape_timeout: 10s
              scheme: https
              tls_config:
                insecure_skip_verify: true
              bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              kubernetes_sd_configs:
                - role: endpoints
              relabel_configs:
                - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
                  action: keep
                  regex: default;kubernetes;https

      # Redis receiver for cache metrics
      # ENABLED: uses the existing credentials from redis-secrets, with TLS
      redis:
        endpoint: redis-service.bakery-ia:6379
        password: ${env:REDIS_PASSWORD}
        collection_interval: 60s
        transport: tcp
        tls:
          insecure_skip_verify: false
          cert_file: /etc/redis-tls/redis-cert.pem
          key_file: /etc/redis-tls/redis-key.pem
          ca_file: /etc/redis-tls/ca-cert.pem
        metrics:
          redis.maxmemory:
            enabled: true
          redis.cmd.latency:
            enabled: true

      # RabbitMQ receiver via the management API
      # ENABLED: uses the existing credentials from rabbitmq-secrets
      rabbitmq:
        endpoint: http://rabbitmq-service.bakery-ia:15672
        username: ${env:RABBITMQ_USER}
        password: ${env:RABBITMQ_PASSWORD}
        collection_interval: 30s

      # PostgreSQL receivers for database metrics
      # Monitor all databases with proper TLS configuration
      postgresql/auth:
        endpoint: auth-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - auth_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/inventory:
        endpoint: inventory-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - inventory_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/orders:
        endpoint: orders-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - orders_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/ai-insights:
        endpoint: ai-insights-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - ai_insights_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/alert-processor:
        endpoint: alert-processor-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - alert_processor_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/distribution:
        endpoint: distribution-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - distribution_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/external:
        endpoint: external-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - external_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/forecasting:
        endpoint: forecasting-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - forecasting_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/notification:
        endpoint: notification-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - notification_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/orchestrator:
        endpoint: orchestrator-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - orchestrator_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/pos:
        endpoint: pos-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - pos_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/procurement:
        endpoint: procurement-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - procurement_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/production:
        endpoint: production-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - production_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/recipes:
        endpoint: recipes-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - recipes_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/sales:
        endpoint: sales-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - sales_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/suppliers:
        endpoint: suppliers-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - suppliers_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/tenant:
        endpoint: tenant-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - tenant_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

      postgresql/training:
        endpoint: training-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - training_db
        collection_interval: 60s
        tls:
          insecure: false
          cert_file: /etc/postgres-tls/server-cert.pem
          key_file: /etc/postgres-tls/server-key.pem
          ca_file: /etc/postgres-tls/ca-cert.pem

    processors:
      # High-performance batch processing (official recommendation)
      batch:
        timeout: 1s  # Reduced from 10s for faster processing
        send_batch_size: 50000  # Increased from 2048 (official recommendation for traces)
        send_batch_max_size: 50000

      # Batch processor for meter data
      batch/meter:
        timeout: 1s
        send_batch_size: 20000
        send_batch_max_size: 25000

      memory_limiter:
        check_interval: 1s
        limit_mib: 1500  # ~75% of container memory (2Gi = 2048Mi)
        spike_limit_mib: 300
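The limiter sizing above is tied to the collector's 2Gi container limit; a quick arithmetic check of the values copied from this config (purely illustrative):

```shell
# memory_limiter sizing relative to the 2Gi container limit configured above
container_limit_mib=$((2 * 1024))   # resources.limits.memory = 2Gi
limit_mib=1500                      # hard limit at which data is refused
spike_limit_mib=300                 # headroom reserved for allocation spikes
soft_limit_mib=$((limit_mib - spike_limit_mib))
echo "hard limit: $((100 * limit_mib / container_limit_mib))% of container"  # 73%
echo "soft limit: ${soft_limit_mib} MiB"                                     # 1200 MiB
```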

      # Resource detection for K8s
      resourcedetection:
        detectors: [env, system, docker]
        timeout: 5s

      # Add resource attributes
      resource:
        attributes:
          - key: deployment.environment
            value: production
            action: upsert
          - key: cluster.name
            value: bakery-ia-prod
            action: upsert

      # Kubernetes attributes processor - CRITICAL for logs
      # Extracts pod, namespace, and container metadata from log attributes
      k8sattributes:
        auth_type: "serviceAccount"
        passthrough: false
        extract:
          metadata:
            - k8s.pod.name
            - k8s.pod.uid
            - k8s.deployment.name
            - k8s.namespace.name
            - k8s.node.name
            - k8s.container.name
          labels:
            - tag_name: "app"
            - tag_name: "pod-template-hash"
            - tag_name: "version"
          annotations:
            - tag_name: "description"

      # SigNoz span-metrics processor with delta aggregation (recommended)
      # Generates RED metrics (Rate, Error, Duration) from trace spans
      signozspanmetrics/delta:
        aggregation_temporality: AGGREGATION_TEMPORALITY_DELTA
        metrics_exporter: signozclickhousemetrics
        latency_histogram_buckets: [100us, 1ms, 2ms, 6ms, 10ms, 50ms, 100ms, 250ms, 500ms, 1000ms, 1400ms, 2000ms, 5s, 10s, 20s, 40s, 60s]
        dimensions_cache_size: 100000
        dimensions:
          - name: service.namespace
            default: default
          - name: deployment.environment
            default: production
          - name: signoz.collector.id

    exporters:
      # ClickHouse exporter for traces
      clickhousetraces:
        datasource: tcp://admin:27ff0399-0d3a-4bd8-919d-17c2181e6fb9@signoz-clickhouse:9000/?database=signoz_traces
        timeout: 10s
        retry_on_failure:
          enabled: true
          initial_interval: 5s
          max_interval: 30s
          max_elapsed_time: 300s

      # ClickHouse exporter for metrics
      signozclickhousemetrics:
        dsn: "tcp://admin:27ff0399-0d3a-4bd8-919d-17c2181e6fb9@signoz-clickhouse:9000/signoz_metrics"
        timeout: 10s
        retry_on_failure:
          enabled: true
          initial_interval: 5s
          max_interval: 30s
          max_elapsed_time: 300s

      # ClickHouse exporter for meter data (usage metrics)
      signozclickhousemeter:
        dsn: "tcp://admin:27ff0399-0d3a-4bd8-919d-17c2181e6fb9@signoz-clickhouse:9000/signoz_meter"
        timeout: 45s
        sending_queue:
          enabled: false

      # ClickHouse exporter for logs
      clickhouselogsexporter:
        dsn: tcp://admin:27ff0399-0d3a-4bd8-919d-17c2181e6fb9@signoz-clickhouse:9000/?database=signoz_logs
        timeout: 10s
        retry_on_failure:
          enabled: true
          initial_interval: 5s
          max_interval: 30s

      # Metadata exporter for service metadata
      metadataexporter:
        dsn: "tcp://admin:27ff0399-0d3a-4bd8-919d-17c2181e6fb9@signoz-clickhouse:9000/signoz_metadata"
        timeout: 10s
        cache:
          provider: in_memory

      # Debug exporter (optional)
      debug:
        verbosity: detailed
        sampling_initial: 5
        sampling_thereafter: 200

    service:
      extensions: [health_check, zpages]
      pipelines:
        # Traces pipeline - exports to ClickHouse and the signozmeter connector
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch, signozspanmetrics/delta, resourcedetection, resource]
          exporters: [clickhousetraces, metadataexporter, signozmeter]

        # Metrics pipeline - includes all infrastructure receivers
        metrics:
          receivers: [otlp,
            postgresql/auth, postgresql/inventory, postgresql/orders,
            postgresql/ai-insights, postgresql/alert-processor, postgresql/distribution,
            postgresql/external, postgresql/forecasting, postgresql/notification,
            postgresql/orchestrator, postgresql/pos, postgresql/procurement,
            postgresql/production, postgresql/recipes, postgresql/sales,
            postgresql/suppliers, postgresql/tenant, postgresql/training,
            redis, rabbitmq, k8s_cluster, prometheus]
          processors: [memory_limiter, batch, resourcedetection, resource]
          exporters: [signozclickhousemetrics]

        # Meter pipeline - receives from the signozmeter connector
        metrics/meter:
          receivers: [signozmeter]
          processors: [batch/meter]
          exporters: [signozclickhousemeter]

        # Logs pipeline - includes both OTLP and Kubernetes pod logs
        logs:
          receivers: [otlp, filelog]
          processors: [memory_limiter, batch, resourcedetection, resource, k8sattributes]
          exporters: [clickhouselogsexporter]

  # HPA for the OTel Collector
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70
    targetMemoryUtilizationPercentage: 80

  # ClusterRole configuration for Kubernetes monitoring
  # CRITICAL: required for the k8s_cluster receiver to access the Kubernetes API.
  # Without these permissions, k8s metrics will not appear in the SigNoz UI.
  clusterRole:
    create: true
    name: "signoz-otel-collector-bakery-ia"
    annotations: {}
    # Complete RBAC rules required by the k8sclusterreceiver,
    # based on the official OpenTelemetry and SigNoz documentation.
    rules:
      # Core API group - fundamental Kubernetes resources
      - apiGroups: [""]
        resources:
          - "events"
          - "namespaces"
          - "nodes"
          - "nodes/proxy"
          - "nodes/metrics"
          - "nodes/spec"
          - "pods"
          - "pods/status"
          - "replicationcontrollers"
          - "replicationcontrollers/status"
          - "resourcequotas"
          - "services"
          - "endpoints"
        verbs: ["get", "list", "watch"]
      # Apps API group - modern workload controllers
      - apiGroups: ["apps"]
        resources: ["deployments", "daemonsets", "statefulsets", "replicasets"]
        verbs: ["get", "list", "watch"]
      # Batch API group - job management
      - apiGroups: ["batch"]
        resources: ["jobs", "cronjobs"]
        verbs: ["get", "list", "watch"]
      # Autoscaling API group - HPA metrics (CRITICAL)
      - apiGroups: ["autoscaling"]
        resources: ["horizontalpodautoscalers"]
        verbs: ["get", "list", "watch"]
      # Extensions API group - legacy support
      - apiGroups: ["extensions"]
        resources: ["deployments", "daemonsets", "replicasets"]
        verbs: ["get", "list", "watch"]
      # Metrics API group - resource metrics
      - apiGroups: ["metrics.k8s.io"]
        resources: ["nodes", "pods"]
        verbs: ["get", "list", "watch"]

  clusterRoleBinding:
    annotations: {}
    name: "signoz-otel-collector-bakery-ia"

# Schema Migrator - manages ClickHouse schema migrations
schemaMigrator:
  enabled: true

  image:
    repository: signoz/signoz-schema-migrator
    tag: v0.129.12  # Updated to the latest version
    pullPolicy: IfNotPresent

  # Enable Helm hooks for proper upgrade handling
  upgradeHelmHooks: true

# Additional Configuration
serviceAccount:
  create: true
  annotations: {}
  name: "signoz"

# Security Context
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  fsGroup: 1000

# Pod Disruption Budgets for HA
podDisruptionBudget:
  frontend:
    enabled: true
    minAvailable: 1
  queryService:
    enabled: true
    minAvailable: 1
  alertmanager:
    enabled: true
    minAvailable: 1
  clickhouse:
    enabled: true
    minAvailable: 1

# Network Policies for security
networkPolicy:
  enabled: true
  policyTypes:
    - Ingress
    - Egress

# Monitoring SigNoz itself
selfMonitoring:
  enabled: true
  serviceMonitor:
    enabled: true
    interval: 30s
177
infrastructure/monitoring/signoz/verify-signoz-telemetry.sh
Executable file
@@ -0,0 +1,177 @@
#!/bin/bash

# SigNoz Telemetry Verification Script
# Verifies that services are correctly sending metrics, logs, and traces to SigNoz,
# and that SigNoz is collecting them properly.

set -e

NAMESPACE="bakery-ia"
GREEN='\033[0;32m'
RED='\033[0;31m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color

echo -e "${BLUE}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
echo -e "${BLUE} SigNoz Telemetry Verification Script${NC}"
echo -e "${BLUE}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
echo ""

# Step 1: Verify SigNoz components are running.
# The "|| true" guards keep "set -e" from aborting before we can print a
# useful error message when a pod is missing.
echo -e "${BLUE}[1/7] Checking SigNoz Components Status...${NC}"
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"

OTEL_POD=$(kubectl get pods -n $NAMESPACE -l app.kubernetes.io/name=signoz,app.kubernetes.io/component=otel-collector --field-selector=status.phase=Running -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || true)
SIGNOZ_POD=$(kubectl get pods -n $NAMESPACE -l app.kubernetes.io/name=signoz,app.kubernetes.io/component=signoz --field-selector=status.phase=Running -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || true)
CLICKHOUSE_POD=$(kubectl get pods -n $NAMESPACE -l clickhouse.altinity.com/chi=signoz-clickhouse --field-selector=status.phase=Running -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || true)

if [[ -n "$OTEL_POD" && -n "$SIGNOZ_POD" && -n "$CLICKHOUSE_POD" ]]; then
    echo -e "${GREEN}✓ All SigNoz components are running${NC}"
    echo "  - OTel Collector: $OTEL_POD"
    echo "  - SigNoz Frontend: $SIGNOZ_POD"
    echo "  - ClickHouse: $CLICKHOUSE_POD"
else
    echo -e "${RED}✗ Some SigNoz components are not running${NC}"
    kubectl get pods -n $NAMESPACE | grep signoz || true
    exit 1
fi
echo ""

# Step 2: Check OTel Collector endpoints
echo -e "${BLUE}[2/7] Verifying OTel Collector Endpoints...${NC}"
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"

OTEL_SVC=$(kubectl get svc -n $NAMESPACE signoz-otel-collector -o jsonpath='{.spec.clusterIP}')
echo "OTel Collector Service IP: $OTEL_SVC"
echo ""
echo "Available endpoints:"
kubectl get svc -n $NAMESPACE signoz-otel-collector -o jsonpath='{range .spec.ports[*]}{.name}{"\t"}{.port}{"\n"}{end}' | column -t
echo ""
echo -e "${GREEN}✓ OTel Collector endpoints are exposed${NC}"
echo ""

# Step 3: Check OTel Collector logs for data reception
echo -e "${BLUE}[3/7] Checking OTel Collector for Recent Activity...${NC}"
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"

echo "Recent OTel Collector logs (last 20 lines):"
kubectl logs -n $NAMESPACE $OTEL_POD --tail=20 | grep -E "received|exported|traces|metrics|logs" || echo "No recent telemetry data found in logs"
echo ""

# Step 4: Check service configurations
echo -e "${BLUE}[4/7] Verifying Service Telemetry Configuration...${NC}"
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"

# Read OTEL settings from the shared ConfigMap
OTEL_ENDPOINT=$(kubectl get configmap bakery-config -n $NAMESPACE -o jsonpath='{.data.OTEL_EXPORTER_OTLP_ENDPOINT}')
ENABLE_TRACING=$(kubectl get configmap bakery-config -n $NAMESPACE -o jsonpath='{.data.ENABLE_TRACING}')
ENABLE_METRICS=$(kubectl get configmap bakery-config -n $NAMESPACE -o jsonpath='{.data.ENABLE_METRICS}')
ENABLE_LOGS=$(kubectl get configmap bakery-config -n $NAMESPACE -o jsonpath='{.data.ENABLE_LOGS}')

echo "Configuration from bakery-config ConfigMap:"
echo "  OTEL_EXPORTER_OTLP_ENDPOINT: $OTEL_ENDPOINT"
echo "  ENABLE_TRACING: $ENABLE_TRACING"
echo "  ENABLE_METRICS: $ENABLE_METRICS"
echo "  ENABLE_LOGS: $ENABLE_LOGS"
echo ""

if [[ "$ENABLE_TRACING" == "true" && "$ENABLE_METRICS" == "true" && "$ENABLE_LOGS" == "true" ]]; then
    echo -e "${GREEN}✓ Telemetry is enabled in configuration${NC}"
else
    echo -e "${YELLOW}⚠ Some telemetry features may be disabled${NC}"
fi
echo ""

# Step 5: Test OTel Collector health
echo -e "${BLUE}[5/7] Testing OTel Collector Health Endpoint...${NC}"
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"

HEALTH_CHECK=$(kubectl exec -n $NAMESPACE $OTEL_POD -- wget -qO- http://localhost:13133/ 2>/dev/null || echo "FAILED")
if [[ "$HEALTH_CHECK" == *"Server available"* ]] || [[ "$HEALTH_CHECK" == "{}" ]]; then
    echo -e "${GREEN}✓ OTel Collector health check passed${NC}"
else
    echo -e "${RED}✗ OTel Collector health check failed${NC}"
    echo "Response: $HEALTH_CHECK"
fi
echo ""

# Step 6: Query ClickHouse for telemetry data
echo -e "${BLUE}[6/7] Querying ClickHouse for Telemetry Data...${NC}"
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"

# Get ClickHouse credentials from the chart-managed secret
CH_PASSWORD=$(kubectl get secret -n $NAMESPACE signoz-clickhouse -o jsonpath='{.data.admin-password}' 2>/dev/null | base64 -d || echo "27ff0399-0d3a-4bd8-919d-17c2181e6fb9")

echo "Checking for traces in ClickHouse..."
TRACES_COUNT=$(kubectl exec -n $NAMESPACE $CLICKHOUSE_POD -- clickhouse-client --user=admin --password="$CH_PASSWORD" --query="SELECT count() FROM signoz_traces.signoz_index_v2 WHERE timestamp >= now() - INTERVAL 1 HOUR" 2>/dev/null || echo "0")
echo "  Traces in last hour: $TRACES_COUNT"

echo "Checking for metrics in ClickHouse..."
METRICS_COUNT=$(kubectl exec -n $NAMESPACE $CLICKHOUSE_POD -- clickhouse-client --user=admin --password="$CH_PASSWORD" --query="SELECT count() FROM signoz_metrics.samples_v4 WHERE unix_milli >= toUnixTimestamp(now() - INTERVAL 1 HOUR) * 1000" 2>/dev/null || echo "0")
echo "  Metrics in last hour: $METRICS_COUNT"

echo "Checking for logs in ClickHouse..."
LOGS_COUNT=$(kubectl exec -n $NAMESPACE $CLICKHOUSE_POD -- clickhouse-client --user=admin --password="$CH_PASSWORD" --query="SELECT count() FROM signoz_logs.logs WHERE timestamp >= now() - INTERVAL 1 HOUR" 2>/dev/null || echo "0")
echo "  Logs in last hour: $LOGS_COUNT"
echo ""

if [[ "$TRACES_COUNT" -gt "0" || "$METRICS_COUNT" -gt "0" || "$LOGS_COUNT" -gt "0" ]]; then
    echo -e "${GREEN}✓ Telemetry data found in ClickHouse!${NC}"
else
    echo -e "${YELLOW}⚠ No telemetry data found in the last hour${NC}"
    echo "  This might be normal if:"
    echo "  - Services were just deployed"
    echo "  - No traffic has been generated yet"
    echo "  - Services haven't finished initializing"
fi
echo ""

# Step 7: Access information
echo -e "${BLUE}[7/7] SigNoz UI Access Information${NC}"
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
echo ""
echo "SigNoz is accessible via ingress at:"
echo -e "  ${GREEN}https://monitoring.bakery-ia.local${NC}"
echo ""
echo "Or via port-forward:"
echo -e "  ${YELLOW}kubectl port-forward -n $NAMESPACE svc/signoz 3301:8080${NC}"
echo "  Then access: http://localhost:3301"
echo ""
echo "To view OTel Collector metrics:"
echo -e "  ${YELLOW}kubectl port-forward -n $NAMESPACE svc/signoz-otel-collector 8888:8888${NC}"
echo "  Then access: http://localhost:8888/metrics"
echo ""

# Summary
echo ""
echo -e "${BLUE}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
echo -e "${BLUE} Verification Summary${NC}"
echo -e "${BLUE}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
echo ""
echo "Component Status:"
echo "  ✓ SigNoz components running"
echo "  ✓ OTel Collector healthy"
echo "  ✓ Configuration correct"
echo ""
echo "Data Collection (last hour):"
echo "  Traces:  $TRACES_COUNT"
echo "  Metrics: $METRICS_COUNT"
echo "  Logs:    $LOGS_COUNT"
echo ""

if [[ "$TRACES_COUNT" -gt "0" || "$METRICS_COUNT" -gt "0" || "$LOGS_COUNT" -gt "0" ]]; then
    echo -e "${GREEN}✓ SigNoz is collecting telemetry data successfully!${NC}"
else
    echo -e "${YELLOW}⚠ To generate telemetry data, try:${NC}"
    echo ""
    echo "1. Generate traffic to your services:"
    echo "   curl http://localhost/api/health"
    echo ""
    echo "2. Check service logs for tracing initialization:"
    echo "   kubectl logs -n $NAMESPACE <service-pod> | grep -i 'tracing\\|otel\\|signoz'"
    echo ""
    echo "3. Wait a few minutes and run this script again"
fi
echo ""
echo -e "${BLUE}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
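The keys read in step 4 come from a shared `bakery-config` ConfigMap consumed by the application services. As a hedged sketch (the key names are taken from the script above; the endpoint value is an assumption based on the in-cluster collector service name used elsewhere in this document), such a ConfigMap might look like:

```yaml
# Hypothetical bakery-config ConfigMap; adjust the OTLP endpoint (gRPC :4317
# or HTTP :4318) to match how your services export telemetry.
apiVersion: v1
kind: ConfigMap
metadata:
  name: bakery-config
  namespace: bakery-ia
data:
  OTEL_EXPORTER_OTLP_ENDPOINT: "http://signoz-otel-collector.bakery-ia.svc.cluster.local:4317"
  ENABLE_TRACING: "true"
  ENABLE_METRICS: "true"
  ENABLE_LOGS: "true"
```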
446
infrastructure/monitoring/signoz/verify-signoz.sh
Executable file
@@ -0,0 +1,446 @@
#!/bin/bash

# ============================================================================
# SigNoz Verification Script for Bakery IA
# ============================================================================
# This script verifies that SigNoz is properly deployed and functioning
# ============================================================================

set -e

# Color codes for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color

# Function to display help
show_help() {
    echo "Usage: $0 [OPTIONS] ENVIRONMENT"
    echo ""
    echo "Verify SigNoz deployment for Bakery IA"
    echo ""
    echo "Arguments:
  ENVIRONMENT                Environment to verify (dev|prod)"
    echo ""
    echo "Options:
  -h, --help                 Show this help message
  -n, --namespace NAMESPACE  Specify namespace (default: bakery-ia)"
    echo ""
    echo "Examples:
  $0 dev                          # Verify development deployment
  $0 prod                         # Verify production deployment
  $0 --namespace monitoring dev   # Verify with custom namespace"
}

# Parse command line arguments
NAMESPACE="bakery-ia"

while [[ $# -gt 0 ]]; do
    case $1 in
        -h|--help)
            show_help
            exit 0
            ;;
        -n|--namespace)
            NAMESPACE="$2"
            shift 2
            ;;
        dev|prod)
            ENVIRONMENT="$1"
            shift
            ;;
        *)
            echo "Unknown argument: $1"
            show_help
            exit 1
            ;;
    esac
done

# Validate environment
if [[ -z "$ENVIRONMENT" ]]; then
    echo "Error: Environment not specified. Use 'dev' or 'prod'."
    show_help
    exit 1
fi

if [[ "$ENVIRONMENT" != "dev" && "$ENVIRONMENT" != "prod" ]]; then
    echo "Error: Invalid environment. Use 'dev' or 'prod'."
    exit 1
fi

# Function to check if kubectl is configured
check_kubectl() {
    if ! kubectl cluster-info &> /dev/null; then
        echo -e "${RED}Error: kubectl is not configured or cannot connect to the cluster.${NC}"
        echo "Please ensure you have access to a Kubernetes cluster."
        exit 1
    fi
}

# Function to check that the namespace exists
check_namespace() {
    if ! kubectl get namespace "$NAMESPACE" &> /dev/null; then
        echo -e "${RED}Error: Namespace $NAMESPACE does not exist.${NC}"
        echo "Please deploy SigNoz first using: ./deploy-signoz.sh $ENVIRONMENT"
        exit 1
    fi
}

# Function to verify SigNoz deployment
verify_deployment() {
    echo -e "${BLUE}"
    echo "=========================================="
    echo "🔍 Verifying SigNoz Deployment"
    echo "=========================================="
    echo "Environment: $ENVIRONMENT"
    echo "Namespace: $NAMESPACE"
    echo -e "${NC}"
    echo ""

    # Check if the SigNoz Helm release exists
    echo -e "${BLUE}1. Checking Helm release...${NC}"
    if helm list -n "$NAMESPACE" | grep -q signoz; then
        echo -e "${GREEN}✅ SigNoz Helm release found${NC}"
    else
        echo -e "${RED}❌ SigNoz Helm release not found${NC}"
        echo "Please deploy SigNoz first using: ./deploy-signoz.sh $ENVIRONMENT"
        exit 1
    fi
    echo ""

    # Check pod status. Note: "grep -c" prints "0" itself on no match, so the
    # fallback is "|| true" rather than "|| echo 0" (which would emit "0" twice).
    echo -e "${BLUE}2. Checking pod status...${NC}"
    local total_pods=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/instance=signoz 2>/dev/null | grep -v "NAME" | wc -l | tr -d ' ' || echo "0")
    local running_pods=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/instance=signoz --field-selector=status.phase=Running 2>/dev/null | grep -c "Running" || true)
    local ready_pods=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/instance=signoz 2>/dev/null | grep "Running" | grep "1/1" | wc -l | tr -d ' ' || echo "0")

    echo "Total pods: $total_pods"
    echo "Running pods: $running_pods"
    echo "Ready pods: $ready_pods"

    if [[ $total_pods -eq 0 ]]; then
        echo -e "${RED}❌ No SigNoz pods found${NC}"
        exit 1
    fi

    if [[ $running_pods -eq $total_pods ]]; then
        echo -e "${GREEN}✅ All pods are running${NC}"
    else
        echo -e "${YELLOW}⚠️ Some pods are not running${NC}"
    fi

    if [[ $ready_pods -eq $total_pods ]]; then
        echo -e "${GREEN}✅ All pods are ready${NC}"
    else
        echo -e "${YELLOW}⚠️ Some pods are not ready${NC}"
    fi
    echo ""

    # Show pod details
    echo -e "${BLUE}Pod Details:${NC}"
    kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/instance=signoz
    echo ""

    # Check services
    echo -e "${BLUE}3. Checking services...${NC}"
    local service_count=$(kubectl get svc -n "$NAMESPACE" -l app.kubernetes.io/instance=signoz 2>/dev/null | grep -v "NAME" | wc -l | tr -d ' ' || echo "0")

    if [[ $service_count -gt 0 ]]; then
        echo -e "${GREEN}✅ Services found ($service_count services)${NC}"
        kubectl get svc -n "$NAMESPACE" -l app.kubernetes.io/instance=signoz
    else
        echo -e "${RED}❌ No services found${NC}"
    fi
    echo ""

    # Check ingress
    echo -e "${BLUE}4. Checking ingress...${NC}"
    local ingress_count=$(kubectl get ingress -n "$NAMESPACE" -l app.kubernetes.io/instance=signoz 2>/dev/null | grep -v "NAME" | wc -l | tr -d ' ' || echo "0")

    if [[ $ingress_count -gt 0 ]]; then
        echo -e "${GREEN}✅ Ingress found ($ingress_count ingress resources)${NC}"
        kubectl get ingress -n "$NAMESPACE" -l app.kubernetes.io/instance=signoz
    else
        echo -e "${YELLOW}⚠️ No ingress found (may be configured in main namespace)${NC}"
    fi
    echo ""

    # Check PVCs
    echo -e "${BLUE}5. Checking persistent volume claims...${NC}"
    local pvc_count=$(kubectl get pvc -n "$NAMESPACE" -l app.kubernetes.io/instance=signoz 2>/dev/null | grep -v "NAME" | wc -l | tr -d ' ' || echo "0")

    if [[ $pvc_count -gt 0 ]]; then
        echo -e "${GREEN}✅ PVCs found ($pvc_count PVCs)${NC}"
        kubectl get pvc -n "$NAMESPACE" -l app.kubernetes.io/instance=signoz
    else
        echo -e "${YELLOW}⚠️ No PVCs found (may not be required for all components)${NC}"
    fi
    echo ""

    # Check resource usage (requires metrics-server)
    echo -e "${BLUE}6. Checking resource usage...${NC}"
    if kubectl top pods -n "$NAMESPACE" &> /dev/null; then
        echo -e "${GREEN}✅ Resource usage:${NC}"
        kubectl top pods -n "$NAMESPACE" -l app.kubernetes.io/instance=signoz
    else
        echo -e "${YELLOW}⚠️ Metrics server not available or no resource usage data${NC}"
    fi
    echo ""

    # Check logs for errors
    echo -e "${BLUE}7. Checking for errors in logs...${NC}"
    local error_found=false

    # Check each pod for errors
    while IFS= read -r pod; do
        if [[ -n "$pod" ]]; then
            local pod_errors=$(kubectl logs -n "$NAMESPACE" "$pod" 2>/dev/null | grep -ci "error\|exception\|fail\|crash" || true)
            if [[ $pod_errors -gt 0 ]]; then
                echo -e "${RED}❌ Errors found in pod $pod ($pod_errors errors)${NC}"
                error_found=true
            fi
        fi
    done < <(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/instance=signoz -o name | sed 's|pod/||')

    if [[ "$error_found" == false ]]; then
        echo -e "${GREEN}✅ No errors found in logs${NC}"
    fi
    echo ""

    # Environment-specific checks
    if [[ "$ENVIRONMENT" == "dev" ]]; then
        verify_dev_specific
    else
        verify_prod_specific
    fi

    # Show access information
    show_access_info
}

# Function for development-specific verification
verify_dev_specific() {
    echo -e "${BLUE}8. Development-specific checks...${NC}"

    # Check if the dev ingress is configured
    if kubectl get ingress -n "$NAMESPACE" 2>/dev/null | grep -q "monitoring.bakery-ia.local"; then
        echo -e "${GREEN}✅ Development ingress configured${NC}"
    else
        echo -e "${YELLOW}⚠️ Development ingress not found${NC}"
    fi

    # Check unified signoz component resource limits (should be lower for dev)
    local signoz_mem=$(kubectl get deployment -n "$NAMESPACE" -l app.kubernetes.io/component=query-service -o jsonpath='{.items[0].spec.template.spec.containers[0].resources.limits.memory}' 2>/dev/null || echo "")
    if [[ -n "$signoz_mem" ]]; then
        echo -e "${GREEN}✅ SigNoz component found (memory limit: $signoz_mem)${NC}"
    else
        echo -e "${YELLOW}⚠️ Could not verify SigNoz component resources${NC}"
    fi

    # Check single-replica setup for dev
    local replicas=$(kubectl get deployment -n "$NAMESPACE" -l app.kubernetes.io/component=query-service -o jsonpath='{.items[0].spec.replicas}' 2>/dev/null || echo "0")
    if [[ $replicas -eq 1 ]]; then
        echo -e "${GREEN}✅ Single replica configuration (appropriate for dev)${NC}"
    else
        echo -e "${YELLOW}⚠️ Multiple replicas detected (replicas: $replicas)${NC}"
    fi
    echo ""
}

# Function for production-specific verification
verify_prod_specific() {
    echo -e "${BLUE}8. Production-specific checks...${NC}"

    # Check if TLS is configured
    if kubectl get ingress -n "$NAMESPACE" 2>/dev/null | grep -q "signoz-tls"; then
        echo -e "${GREEN}✅ TLS certificate configured${NC}"
    else
        echo -e "${YELLOW}⚠️ TLS certificate not found${NC}"
    fi

    # Check if multiple replicas are running for HA
    local signoz_replicas=$(kubectl get deployment -n "$NAMESPACE" -l app.kubernetes.io/component=query-service -o jsonpath='{.items[0].spec.replicas}' 2>/dev/null || echo "1")
    if [[ $signoz_replicas -gt 1 ]]; then
        echo -e "${GREEN}✅ High availability configured ($signoz_replicas SigNoz replicas)${NC}"
    else
        echo -e "${YELLOW}⚠️ Single SigNoz replica detected (not highly available)${NC}"
    fi

    # Check Zookeeper replicas (critical for production)
    local zk_replicas=$(kubectl get statefulset -n "$NAMESPACE" -l app.kubernetes.io/component=zookeeper -o jsonpath='{.items[0].spec.replicas}' 2>/dev/null || echo "0")
    if [[ $zk_replicas -eq 3 ]]; then
        echo -e "${GREEN}✅ Zookeeper properly configured with 3 replicas${NC}"
    elif [[ $zk_replicas -gt 0 ]]; then
        echo -e "${YELLOW}⚠️ Zookeeper has $zk_replicas replicas (recommend 3 for production)${NC}"
    else
        echo -e "${RED}❌ Zookeeper not found${NC}"
    fi

    # Check OTel Collector replicas
    local otel_replicas=$(kubectl get deployment -n "$NAMESPACE" -l app.kubernetes.io/component=otel-collector -o jsonpath='{.items[0].spec.replicas}' 2>/dev/null || echo "1")
    if [[ $otel_replicas -gt 1 ]]; then
        echo -e "${GREEN}✅ OTel Collector HA configured ($otel_replicas replicas)${NC}"
    else
        echo -e "${YELLOW}⚠️ Single OTel Collector replica${NC}"
    fi

    # Check resource limits (should be higher for prod)
    local signoz_mem=$(kubectl get deployment -n "$NAMESPACE" -l app.kubernetes.io/component=query-service -o jsonpath='{.items[0].spec.template.spec.containers[0].resources.limits.memory}' 2>/dev/null || echo "")
    if [[ -n "$signoz_mem" ]]; then
        echo -e "${GREEN}✅ Production resource limits applied (memory: $signoz_mem)${NC}"
    else
        echo -e "${YELLOW}⚠️ Could not verify resource limits${NC}"
    fi

    # Check HPA (Horizontal Pod Autoscaler); "grep -c" prints "0" on no match,
    # so "|| true" avoids a doubled "0"
    local hpa_count=$(kubectl get hpa -n "$NAMESPACE" 2>/dev/null | grep -c signoz || true)
    if [[ $hpa_count -gt 0 ]]; then
        echo -e "${GREEN}✅ Horizontal Pod Autoscaler configured${NC}"
    else
        echo -e "${YELLOW}⚠️ No HPA found (consider enabling for production)${NC}"
    fi
    echo ""
}

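One counting pitfall is worth calling out for scripts like these: `grep -c` prints `0` itself when nothing matches and signals the miss only through its exit status, so a `|| echo "0"` fallback produces a two-line value (`0` twice) that later breaks numeric comparisons. A small standalone sketch of the difference:

```bash
# Broken: the fallback fires even though grep already printed 0,
# so the variable holds two lines ("0" twice).
broken=$(printf 'foo\nbar\n' | grep -c signoz || echo "0")

# Fixed: swallow only the exit status; the single "0" from grep is kept.
fixed=$(printf 'foo\nbar\n' | grep -c signoz || true)

# grep -c '' counts lines, so this shows how many lines each value spans.
echo "broken spans $(printf '%s\n' "$broken" | grep -c '') line(s)"   # prints 2
echo "fixed spans $(printf '%s\n' "$fixed" | grep -c '') line(s)"     # prints 1
```

The same reasoning applies to every `grep -c … || echo "0"` in the original drafts of these scripts.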
# Function to show access information
show_access_info() {
    echo -e "${BLUE}"
    echo "=========================================="
    echo "📋 Access Information"
    echo "=========================================="
    echo -e "${NC}"

    if [[ "$ENVIRONMENT" == "dev" ]]; then
        echo "SigNoz UI: http://monitoring.bakery-ia.local"
        echo ""
        echo "OpenTelemetry Collector (within cluster):"
        echo "  gRPC: signoz-otel-collector.$NAMESPACE.svc.cluster.local:4317"
        echo "  HTTP: signoz-otel-collector.$NAMESPACE.svc.cluster.local:4318"
        echo ""
        echo "Port-forward for local access:"
        echo "  kubectl port-forward -n $NAMESPACE svc/signoz 8080:8080"
        echo "  kubectl port-forward -n $NAMESPACE svc/signoz-otel-collector 4317:4317"
        echo "  kubectl port-forward -n $NAMESPACE svc/signoz-otel-collector 4318:4318"
    else
        echo "SigNoz UI: https://monitoring.bakewise.ai"
        echo ""
        echo "OpenTelemetry Collector (within cluster):"
        echo "  gRPC: signoz-otel-collector.$NAMESPACE.svc.cluster.local:4317"
        echo "  HTTP: signoz-otel-collector.$NAMESPACE.svc.cluster.local:4318"
    fi

    echo ""
    echo "Default Credentials:"
    echo "  Username: admin@example.com"
    echo "  Password: admin"
    echo ""
    echo "⚠️ IMPORTANT: Change the default password after first login!"
    echo ""

    # Show connection test commands
    echo "Connection Test Commands:"
    if [[ "$ENVIRONMENT" == "dev" ]]; then
        echo "  # Test SigNoz UI"
        echo "  curl http://monitoring.bakery-ia.local"
        echo ""
        echo "  # Test via port-forward"
        echo "  kubectl port-forward -n $NAMESPACE svc/signoz 8080:8080"
        echo "  curl http://localhost:8080"
    else
        echo "  # Test SigNoz UI"
        echo "  curl https://monitoring.bakewise.ai"
        echo ""
        echo "  # Test API health"
        echo "  kubectl port-forward -n $NAMESPACE svc/signoz 8080:8080"
        echo "  curl http://localhost:8080/api/v1/health"
    fi
    echo ""
}

# Function to run connectivity tests
run_connectivity_tests() {
    echo -e "${BLUE}"
    echo "=========================================="
    echo "🔗 Running Connectivity Tests"
    echo "=========================================="
    echo -e "${NC}"

    # Test pod readiness first ("grep -c" prints "0" on no match, hence "|| true")
    echo "Checking pod readiness..."
    local ready_pods=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/instance=signoz --field-selector=status.phase=Running 2>/dev/null | grep "Running" | grep -c "1/1\|2/2" || true)
    local total_pods=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/instance=signoz 2>/dev/null | grep -v "NAME" | wc -l | tr -d ' ' || echo "0")

    if [[ $ready_pods -eq $total_pods && $total_pods -gt 0 ]]; then
        echo -e "${GREEN}✅ All pods are ready ($ready_pods/$total_pods)${NC}"
    else
        echo -e "${YELLOW}⚠️ Some pods not ready ($ready_pods/$total_pods)${NC}"
    fi
    echo ""

    # Test internal service connectivity
    echo "Testing internal service connectivity..."
    local signoz_svc=$(kubectl get svc -n "$NAMESPACE" signoz -o jsonpath='{.spec.clusterIP}' 2>/dev/null || echo "")
    if [[ -n "$signoz_svc" ]]; then
        echo -e "${GREEN}✅ SigNoz service accessible at $signoz_svc:8080${NC}"
    else
        echo -e "${RED}❌ SigNoz service not found${NC}"
    fi

    local otel_svc=$(kubectl get svc -n "$NAMESPACE" signoz-otel-collector -o jsonpath='{.spec.clusterIP}' 2>/dev/null || echo "")
    if [[ -n "$otel_svc" ]]; then
        echo -e "${GREEN}✅ OTel Collector service accessible at $otel_svc:4317 (gRPC), $otel_svc:4318 (HTTP)${NC}"
    else
        echo -e "${RED}❌ OTel Collector service not found${NC}"
    fi
    echo ""

    if [[ "$ENVIRONMENT" == "prod" ]]; then
        echo -e "${YELLOW}⚠️ Production connectivity tests require valid DNS and TLS${NC}"
        echo "  Please ensure monitoring.bakewise.ai resolves to your cluster"
        echo ""
        echo "Manual test:"
        echo "  curl -I https://monitoring.bakewise.ai"
    fi
}

# Main execution
main() {
    echo -e "${BLUE}"
    echo "=========================================="
    echo "🔍 SigNoz Verification for Bakery IA"
    echo "=========================================="
    echo -e "${NC}"

    # Check prerequisites
    check_kubectl
    check_namespace

    # Verify deployment
    verify_deployment

    # Run connectivity tests
    run_connectivity_tests

    echo -e "${GREEN}"
    echo "=========================================="
    echo "✅ Verification Complete"
    echo "=========================================="
    echo -e "${NC}"

    echo "Summary:"
    echo "  Environment: $ENVIRONMENT"
    echo "  Namespace: $NAMESPACE"
    echo ""
    echo "Next Steps:"
    echo "  1. Access SigNoz UI and verify dashboards"
    echo "  2. Configure alert rules for your services"
    echo "  3. Instrument your applications with OpenTelemetry"
    echo "  4. Set up custom dashboards for key metrics"
    echo ""
}

# Run main function
main
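To exercise the pipeline end to end without instrumenting an application, one option is to hand-craft a single OTLP/HTTP span and POST it to the collector. This is a sketch, not part of the deployed scripts: the trace/span IDs and the `smoke-test` service name are arbitrary placeholders, and the `curl` target assumes the port-forward shown above.

```bash
# Build a minimal OTLP/HTTP JSON trace payload (one span). The IDs are
# arbitrary hex strings of the required lengths (32 and 16 characters).
TRACE_ID=$(printf '%032d' 1)
SPAN_ID=$(printf '%016d' 1)
NOW_NS=$(( $(date +%s) * 1000000000 ))

PAYLOAD='{"resourceSpans":[{"resource":{"attributes":[{"key":"service.name","value":{"stringValue":"smoke-test"}}]},"scopeSpans":[{"spans":[{"traceId":"'"$TRACE_ID"'","spanId":"'"$SPAN_ID"'","name":"smoke","kind":1,"startTimeUnixNano":"'"$NOW_NS"'","endTimeUnixNano":"'"$NOW_NS"'"}]}]}]}'

echo "$PAYLOAD"

# With the collector port-forwarded (kubectl port-forward ... 4318:4318):
#   curl -s -X POST -H 'Content-Type: application/json' \
#        -d "$PAYLOAD" http://localhost:4318/v1/traces
# The span should then appear under the "smoke-test" service in the SigNoz UI.
```

If the span shows up, ingestion, the collector pipeline, and ClickHouse storage are all working; if not, the verification scripts above narrow down which stage is failing.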