diff --git a/docs/SIGNOZ_COMPLETE_CONFIGURATION_GUIDE.md b/docs/SIGNOZ_COMPLETE_CONFIGURATION_GUIDE.md
new file mode 100644
index 00000000..6f2fafb8
--- /dev/null
+++ b/docs/SIGNOZ_COMPLETE_CONFIGURATION_GUIDE.md
@@ -0,0 +1,518 @@
+# SigNoz Complete Configuration Guide
+
+## Root Cause Analysis and Solutions
+
+This document provides a comprehensive analysis of the SigNoz telemetry collection issues and the proper configuration for all receivers.
+
+---
+
+## Problem 1: OpAMP Configuration Corruption
+
+### Root Cause
+
+**What is OpAMP?**
+[OpAMP (Open Agent Management Protocol)](https://signoz.io/docs/operate/configuration/) is a protocol for remote configuration management in OpenTelemetry Collectors. In SigNoz, OpAMP runs a server that dynamically configures log pipelines in the SigNoz OTel collector.
+
+**The Issue:**
+- OpAMP was successfully connecting to the SigNoz backend and receiving remote configuration
+- The remote configuration contained only `nop` (no-operation) receivers and exporters
+- This overwrote the local collector configuration at runtime
+- Result: The collector appeared healthy but couldn't receive or export any data
+
+**Why This Happened:**
+1. The SigNoz backend's OpAMP server was pushing an invalid/incomplete configuration
+2. The collector's `--manager-config` flag pointed to OpAMP configuration
+3. OpAMP's `--copy-path=/var/tmp/collector-config.yaml` overwrote the good config
+
+### Solution Options
+
+#### Option 1: Disable OpAMP (Current Solution)
+
+Since OpAMP is pushing bad configuration and we have a working static configuration, we disabled it:
+
+```bash
+kubectl patch deployment -n bakery-ia signoz-otel-collector --type=json -p='[
+  {
+    "op": "replace",
+    "path": "/spec/template/spec/containers/0/args",
+    "value": [
+      "--config=/conf/otel-collector-config.yaml",
+      "--feature-gates=-pkg.translator.prometheus.NormalizeName"
+    ]
+  }
+]'
+```
+
+**Important:** This patch must be applied after every `helm install` or `helm upgrade` because the Helm chart doesn't support disabling OpAMP via values.
+
+#### Option 2: Fix OpAMP Configuration (Recommended for Production)
+
+To properly use OpAMP:
+
+1. **Check SigNoz Backend Configuration:**
+   - Verify the SigNoz service is properly configured to serve OpAMP
+   - Check logs: `kubectl logs -n bakery-ia statefulset/signoz`
+   - Look for OpAMP-related errors
+
+2. **Configure OpAMP Server Settings:**
+   According to [SigNoz configuration documentation](https://signoz.io/docs/operate/configuration/), set these environment variables in the SigNoz statefulset:
+
+   ```yaml
+   signoz:
+     env:
+       OPAMP_ENABLED: "true"
+       OPAMP_SERVER_ENDPOINT: "ws://signoz:4320/v1/opamp"
+   ```
+
+3. **Verify OpAMP Configuration File:**
+   ```bash
+   kubectl get configmap -n bakery-ia signoz-otel-collector -o yaml
+   ```
+
+   Should contain:
+   ```yaml
+   otel-collector-opamp-config.yaml: |
+     server_endpoint: "ws://signoz:4320/v1/opamp"
+   ```
+
+4. **Monitor OpAMP Status:**
+   ```bash
+   kubectl logs -n bakery-ia deployment/signoz-otel-collector | grep opamp
+   ```
+
+### References
+- [SigNoz Architecture](https://signoz.io/docs/architecture/)
+- [OpenTelemetry Collector Configuration](https://signoz.io/docs/opentelemetry-collection-agents/opentelemetry-collector/configuration/)
+- [SigNoz Helm Chart](https://github.com/SigNoz/charts)
+
+---
+
+## Problem 2: Database and Infrastructure Receivers Configuration
+
+### Overview
+
+You have the following infrastructure requiring monitoring:
+
+- **21 PostgreSQL databases** (auth, inventory, orders, forecasting, production, etc.)
+- **1 Redis instance** (caching layer)
+- **1 RabbitMQ instance** (message queue)
+
+All receivers were disabled because they lacked proper credentials and configuration.
+
+---
+
+## PostgreSQL Receiver Configuration
+
+### Prerequisites
+
+Based on [SigNoz PostgreSQL Integration Guide](https://signoz.io/docs/integrations/postgresql/), each PostgreSQL instance needs a monitoring user with proper permissions.
+
+### Step 1: Create Monitoring Users
+
+For each PostgreSQL database, create a dedicated monitoring user:
+
+**For PostgreSQL 10 and newer:**
+```sql
+CREATE USER monitoring WITH PASSWORD 'your_secure_password';
+GRANT pg_monitor TO monitoring;
+GRANT SELECT ON pg_stat_database TO monitoring;
+```
+
+**For PostgreSQL 9.6 to 9.x:**
+```sql
+CREATE USER monitoring WITH PASSWORD 'your_secure_password';
+GRANT SELECT ON pg_stat_database TO monitoring;
+```
+
+### Step 2: Create Monitoring User for All Databases
+
+Run this script to create monitoring users in all PostgreSQL databases:
+
+```bash
+#!/bin/bash
+# File: infrastructure/scripts/create-pg-monitoring-users.sh
+
+DATABASES=(
+  "auth-db"
+  "inventory-db"
+  "orders-db"
+  "ai-insights-db"
+  "alert-processor-db"
+  "demo-session-db"
+  "distribution-db"
+  "external-db"
+  "forecasting-db"
+  "notification-db"
+  "orchestrator-db"
+  "pos-db"
+  "procurement-db"
+  "production-db"
+  "recipes-db"
+  "sales-db"
+  "suppliers-db"
+  "tenant-db"
+  "training-db"
+)
+
+MONITORING_PASSWORD="monitoring_secure_pass_$(openssl rand -hex 16)"
+
+echo "Creating monitoring users with password: $MONITORING_PASSWORD"
+echo "Save this password for your SigNoz configuration!"
+
+for db in "${DATABASES[@]}"; do
+  echo "Processing $db..."
+  kubectl exec -n bakery-ia deployment/$db -- psql -U postgres -c "
+    CREATE USER monitoring WITH PASSWORD '$MONITORING_PASSWORD';
+    GRANT pg_monitor TO monitoring;
+    GRANT SELECT ON pg_stat_database TO monitoring;
+  " 2>&1 | grep -v "already exists" || true
+done
+
+echo ""
+echo "Monitoring users created!"
+echo "Password: $MONITORING_PASSWORD"
+```
+
+### Step 3: Store Credentials in Kubernetes Secret
+
+```bash
+kubectl create secret generic -n bakery-ia postgres-monitoring-secrets \
+  --from-literal=POSTGRES_MONITOR_USER=monitoring \
+  --from-literal=POSTGRES_MONITOR_PASSWORD='your_secure_password'
+```
+
+### Step 4: Configure PostgreSQL Receivers in SigNoz
+
+Update `infrastructure/helm/signoz-values-dev.yaml`:
+
+```yaml
+otelCollector:
+  config:
+    receivers:
+      # PostgreSQL receivers for database metrics
+      postgresql/auth:
+        endpoint: auth-db-service.bakery-ia:5432
+        username: ${env:POSTGRES_MONITOR_USER}
+        password: ${env:POSTGRES_MONITOR_PASSWORD}
+        databases:
+          - auth_db
+        collection_interval: 60s
+        tls:
+          insecure: true  # Set to false if using TLS
+
+      postgresql/inventory:
+        endpoint: inventory-db-service.bakery-ia:5432
+        username: ${env:POSTGRES_MONITOR_USER}
+        password: ${env:POSTGRES_MONITOR_PASSWORD}
+        databases:
+          - inventory_db
+        collection_interval: 60s
+        tls:
+          insecure: true
+
+      # Add all other databases...
+      postgresql/orders:
+        endpoint: orders-db-service.bakery-ia:5432
+        username: ${env:POSTGRES_MONITOR_USER}
+        password: ${env:POSTGRES_MONITOR_PASSWORD}
+        databases:
+          - orders_db
+        collection_interval: 60s
+        tls:
+          insecure: true
+
+    # Update metrics pipeline
+    service:
+      pipelines:
+        metrics:
+          receivers:
+            - otlp
+            - postgresql/auth
+            - postgresql/inventory
+            - postgresql/orders
+            # Add all PostgreSQL receivers
+          processors: [memory_limiter, batch, resourcedetection]
+          exporters: [signozclickhousemetrics]
+```
+
+### Step 5: Add Environment Variables to OTel Collector Deployment
+
+The Helm chart needs to inject these environment variables. Modify your Helm values:
+
+```yaml
+otelCollector:
+  env:
+    - name: POSTGRES_MONITOR_USER
+      valueFrom:
+        secretKeyRef:
+          name: postgres-monitoring-secrets
+          key: POSTGRES_MONITOR_USER
+    - name: POSTGRES_MONITOR_PASSWORD
+      valueFrom:
+        secretKeyRef:
+          name: postgres-monitoring-secrets
+          key: POSTGRES_MONITOR_PASSWORD
+```
+
+### References
+- [PostgreSQL Monitoring with OpenTelemetry | SigNoz](https://signoz.io/blog/opentelemetry-postgresql-metrics-monitoring/)
+- [PostgreSQL Integration | SigNoz](https://signoz.io/docs/integrations/postgresql/)
+
+---
+
+## Redis Receiver Configuration
+
+### Current Infrastructure
+
+- **Service**: `redis-service.bakery-ia:6379`
+- **Password**: Available in secret `redis-secrets`
+- **TLS**: Currently not configured
+
+### Step 1: Check if Redis Requires TLS
+
+```bash
+kubectl exec -n bakery-ia deployment/redis -- redis-cli CONFIG GET tls-port
+```
+
+If TLS is not configured (tls-port is 0 or empty), you can use `insecure: true`.
+
+### Step 2: Configure Redis Receiver
+
+Update `infrastructure/helm/signoz-values-dev.yaml`:
+
+```yaml
+otelCollector:
+  config:
+    receivers:
+      # Redis receiver for cache metrics
+      redis:
+        endpoint: redis-service.bakery-ia:6379
+        password: ${env:REDIS_PASSWORD}
+        collection_interval: 60s
+        transport: tcp
+        tls:
+          insecure: true  # Change to false if using TLS
+        metrics:
+          redis.maxmemory:
+            enabled: true
+          redis.cmd.latency:
+            enabled: true
+
+    service:
+      pipelines:
+        metrics:
+          receivers: [otlp, redis, ...]
+
+  env:
+    - name: REDIS_PASSWORD
+      valueFrom:
+        secretKeyRef:
+          name: redis-secrets
+          key: REDIS_PASSWORD
+```
+
+### Optional: Configure TLS for Redis
+
+If you want to enable TLS for Redis (recommended for production):
+
+1. **Generate TLS Certificates:**
+```bash
+# Create CA
+openssl genrsa -out ca-key.pem 4096
+openssl req -new -x509 -days 3650 -key ca-key.pem -out ca-cert.pem
+
+# Create Redis server certificate
+openssl genrsa -out redis-key.pem 4096
+openssl req -new -key redis-key.pem -out redis.csr
+openssl x509 -req -days 3650 -in redis.csr -CA ca-cert.pem -CAkey ca-key.pem -CAcreateserial -out redis-cert.pem
+
+# Create Kubernetes secret
+kubectl create secret generic -n bakery-ia redis-tls \
+  --from-file=ca-cert.pem=ca-cert.pem \
+  --from-file=redis-cert.pem=redis-cert.pem \
+  --from-file=redis-key.pem=redis-key.pem
+```
+
+2. **Mount Certificates in OTel Collector:**
+```yaml
+otelCollector:
+  volumes:
+    - name: redis-tls
+      secret:
+        secretName: redis-tls
+
+  volumeMounts:
+    - name: redis-tls
+      mountPath: /etc/redis-tls
+      readOnly: true
+
+  config:
+    receivers:
+      redis:
+        tls:
+          insecure: false
+          cert_file: /etc/redis-tls/redis-cert.pem
+          key_file: /etc/redis-tls/redis-key.pem
+          ca_file: /etc/redis-tls/ca-cert.pem
+```
+
+### References
+- [Redis Monitoring with OpenTelemetry | SigNoz](https://signoz.io/blog/redis-opentelemetry/)
+- [Redis Monitoring 101 | SigNoz](https://signoz.io/blog/redis-monitoring/)
+
+---
+
+## RabbitMQ Receiver Configuration
+
+### Current Infrastructure
+
+- **Service**: `rabbitmq-service.bakery-ia`
+  - Port 5672: AMQP protocol
+  - Port 15672: Management API (required for metrics)
+- **Credentials**:
+  - Username: `bakery`
+  - Password: Available in secret `rabbitmq-secrets`
+
+### Step 1: Enable RabbitMQ Management Plugin
+
+```bash
+kubectl exec -n bakery-ia deployment/rabbitmq -- rabbitmq-plugins enable rabbitmq_management
+```
+
+### Step 2: Verify Management API Access
+
+```bash
+kubectl port-forward -n bakery-ia svc/rabbitmq-service 15672:15672
+# In browser: http://localhost:15672
+# Login with: bakery / (password from the rabbitmq-secrets secret)
+```
+
+### Step 3: Configure RabbitMQ Receiver
+
+Update `infrastructure/helm/signoz-values-dev.yaml`:
+
+```yaml
+otelCollector:
+  config:
+    receivers:
+      # RabbitMQ receiver via management API
+      rabbitmq:
+        endpoint: http://rabbitmq-service.bakery-ia:15672
+        username: ${env:RABBITMQ_USER}
+        password: ${env:RABBITMQ_PASSWORD}
+        collection_interval: 30s
+
+    service:
+      pipelines:
+        metrics:
+          receivers: [otlp, rabbitmq, ...]
+
+  env:
+    - name: RABBITMQ_USER
+      valueFrom:
+        secretKeyRef:
+          name: rabbitmq-secrets
+          key: RABBITMQ_USER
+    - name: RABBITMQ_PASSWORD
+      valueFrom:
+        secretKeyRef:
+          name: rabbitmq-secrets
+          key: RABBITMQ_PASSWORD
+```
+
+### References
+- [RabbitMQ Monitoring with OpenTelemetry | SigNoz](https://signoz.io/blog/opentelemetry-rabbitmq-metrics-monitoring/)
+- [OpenTelemetry Receivers | SigNoz](https://signoz.io/docs/userguide/otel-metrics-receivers/)
+
+---
+
+## Complete Implementation Plan
+
+### Phase 1: Enable Basic Infrastructure Monitoring (No TLS)
+
+1. **Create PostgreSQL monitoring users** (all 21 databases)
+2. **Create Kubernetes secrets** for credentials
+3. **Update Helm values** with receiver configurations
+4. **Configure environment variables** in OTel Collector
+5. **Apply Helm upgrade** and OpAMP patch
+6. **Verify metrics collection**
+
+### Phase 2: Enable TLS (Optional, Production-Ready)
+
+1. **Generate TLS certificates** for Redis
+2. **Configure Redis TLS** in deployment
+3. **Update Redis receiver** with TLS settings
+4. **Configure PostgreSQL TLS** if required
+5. **Test and verify** secure connections
+
+### Phase 3: Enable OpAMP (Optional, Advanced)
+
+1. **Fix SigNoz OpAMP server configuration**
+2. **Test remote configuration** in dev environment
+3. **Gradually enable** OpAMP after validation
+4. **Monitor** for configuration corruption
+
+---
+
+## Verification Commands
+
+### Check Collector Metrics
+```bash
+kubectl port-forward -n bakery-ia svc/signoz-otel-collector 8888:8888
+curl http://localhost:8888/metrics | grep "otelcol_receiver_accepted"
+```
+
+### Check Database Connectivity
+```bash
+kubectl exec -n bakery-ia deployment/signoz-otel-collector -- \
+  /bin/sh -c "nc -zv auth-db-service 5432"
+```
+
+### Check RabbitMQ Management API
+```bash
+kubectl exec -n bakery-ia deployment/signoz-otel-collector -- \
+  /bin/sh -c "wget -O- http://rabbitmq-service:15672/api/overview"
+```
+
+### Check Redis Connectivity
+```bash
+kubectl exec -n bakery-ia deployment/signoz-otel-collector -- \
+  /bin/sh -c "nc -zv redis-service 6379"
+```
+
+---
+
+## Troubleshooting
+
+### PostgreSQL Connection Refused
+- Verify monitoring user exists: `kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c "\du"`
+- Check user permissions: `kubectl exec -n bakery-ia deployment/auth-db -- psql -U monitoring -d postgres -c "SELECT 1"`
+
+### Redis Authentication Failed
+- Verify password: `kubectl get secret -n bakery-ia redis-secrets -o jsonpath='{.data.REDIS_PASSWORD}' | base64 -d`
+- Test connection: `kubectl exec -n bakery-ia deployment/redis -- redis-cli -a 'your_redis_password' PING`
+
+### RabbitMQ Management API Not Available
+- Check plugin status: `kubectl exec -n bakery-ia deployment/rabbitmq -- rabbitmq-plugins list`
+- Enable plugin: `kubectl exec -n bakery-ia deployment/rabbitmq -- rabbitmq-plugins enable rabbitmq_management`
+
+---
+
+## Summary
+
+**Current Status:**
+- ✅ OTel Collector receiving traces (97+ spans)
+- ✅ ClickHouse authentication fixed
+- ✅ OpAMP disabled (preventing config corruption)
+- ❌ PostgreSQL receivers not configured (no monitoring users)
+- ❌ Redis receiver not configured (missing in pipeline)
+- ❌ RabbitMQ receiver not configured (missing in pipeline)
+
+**Next Steps:**
+1. Create PostgreSQL monitoring users across all 21 databases
+2. Configure Redis receiver with existing credentials
+3. Configure RabbitMQ receiver with existing credentials
+4. Test and verify all metrics are flowing
+5. Optionally enable TLS for production
+6. Optionally fix and re-enable OpAMP for dynamic configuration
+
diff --git a/infrastructure/scripts/create-pg-monitoring-users.sh b/infrastructure/scripts/create-pg-monitoring-users.sh
new file mode 100755
index 00000000..9b955b86
--- /dev/null
+++ b/infrastructure/scripts/create-pg-monitoring-users.sh
@@ -0,0 +1,145 @@
+#!/bin/bash
+# Create monitoring users in all PostgreSQL databases for SigNoz metrics collection
+#
+# This script creates a 'monitoring' user with pg_monitor role in each PostgreSQL database
+# Based on: https://signoz.io/docs/integrations/postgresql/
+#
+# Usage: ./create-pg-monitoring-users.sh
+
+set -e
+
+NAMESPACE="bakery-ia"
+MONITORING_USER="monitoring"
+MONITORING_PASSWORD="monitoring_$(openssl rand -hex 16)"
+
+# List of all PostgreSQL database deployments
+DATABASES=(
+  "auth-db"
+  "inventory-db"
+  "orders-db"
+  "ai-insights-db"
+  "alert-processor-db"
+  "demo-session-db"
+  "distribution-db"
+  "external-db"
+  "forecasting-db"
+  "notification-db"
+  "orchestrator-db"
+  "pos-db"
+  "procurement-db"
+  "production-db"
+  "recipes-db"
+  "sales-db"
+  "suppliers-db"
+  "tenant-db"
+  "training-db"
+)
+
+echo "=================================================="
+echo "PostgreSQL Monitoring User Setup for SigNoz"
+echo "=================================================="
+echo ""
+echo "This script will create a monitoring user in all PostgreSQL databases"
+echo "User: $MONITORING_USER"
+echo "Password: $MONITORING_PASSWORD"
+echo ""
+echo "IMPORTANT: Save this password! You'll need it for SigNoz configuration."
+echo ""
+read -p "Press Enter to continue or Ctrl+C to cancel..."
+
+SUCCESS_COUNT=0
+FAILED_COUNT=0
+FAILED_DBS=()
+
+for db in "${DATABASES[@]}"; do
+  echo ""
+  echo "Processing: $db"
+  echo "---"
+
+  # Create monitoring user with pg_monitor role (PostgreSQL 10+)
+  if kubectl exec -n "$NAMESPACE" "deployment/$db" -- psql -U postgres -c "
+    DO \$\$
+    BEGIN
+      -- Try to create the user
+      CREATE USER $MONITORING_USER WITH PASSWORD '$MONITORING_PASSWORD';
+      RAISE NOTICE 'User created successfully';
+    EXCEPTION
+      WHEN duplicate_object THEN
+        -- User already exists, update password
+        ALTER USER $MONITORING_USER WITH PASSWORD '$MONITORING_PASSWORD';
+        RAISE NOTICE 'User already exists, password updated';
+    END
+    \$\$;
+
+    -- Grant pg_monitor role (PostgreSQL 10+)
+    GRANT pg_monitor TO $MONITORING_USER;
+
+    -- Grant SELECT on pg_stat_database
+    GRANT SELECT ON pg_stat_database TO $MONITORING_USER;
+
+    -- Verify permissions
+    SELECT
+      r.rolname as role_name,
+      ARRAY_AGG(b.rolname) as granted_roles
+    FROM pg_auth_members m
+    JOIN pg_roles r ON (m.member = r.oid)
+    JOIN pg_roles b ON (m.roleid = b.oid)
+    WHERE r.rolname = '$MONITORING_USER'
+    GROUP BY r.rolname;
+  " 2>&1; then
+    echo "✅ SUCCESS: $db"
+    SUCCESS_COUNT=$((SUCCESS_COUNT + 1))  # ((SUCCESS_COUNT++)) returns non-zero at 0 and would trip set -e
+  else
+    echo "❌ FAILED: $db"
+    FAILED_COUNT=$((FAILED_COUNT + 1))
+    FAILED_DBS+=("$db")
+  fi
+done
+
+echo ""
+echo "=================================================="
+echo "Summary"
+echo "=================================================="
+echo "Successful: $SUCCESS_COUNT databases"
+echo "Failed: $FAILED_COUNT databases"
+
+if [ "$FAILED_COUNT" -gt 0 ]; then
+  echo ""
+  echo "Failed databases:"
+  for db in "${FAILED_DBS[@]}"; do
+    echo "  - $db"
+  done
+fi
+
+echo ""
+echo "=================================================="
+echo "Next Steps"
+echo "=================================================="
+echo ""
+echo "1. Create Kubernetes secret with monitoring credentials:"
+echo ""
+echo "kubectl create secret generic -n $NAMESPACE postgres-monitoring-secrets \\"
+echo "  --from-literal=POSTGRES_MONITOR_USER=$MONITORING_USER \\"
+echo "  --from-literal=POSTGRES_MONITOR_PASSWORD='$MONITORING_PASSWORD'"
+echo ""
+echo "2. Update infrastructure/helm/signoz-values-dev.yaml with PostgreSQL receivers"
+echo ""
+echo "3. Add environment variables to otelCollector configuration"
+echo ""
+echo "4. Run: helm upgrade signoz signoz/signoz -n $NAMESPACE -f infrastructure/helm/signoz-values-dev.yaml"
+echo ""
+echo "5. Apply OpAMP patch:"
+echo ""
+echo "kubectl patch deployment -n $NAMESPACE signoz-otel-collector --type=json -p='["
+echo "  {\"op\":\"replace\",\"path\":\"/spec/template/spec/containers/0/args\",\"value\":["
+echo "    \"--config=/conf/otel-collector-config.yaml\","
+echo "    \"--feature-gates=-pkg.translator.prometheus.NormalizeName\""
+echo "  ]}"
+echo "]'"
+echo ""
+echo "=================================================="
+echo "SAVE THIS INFORMATION!"
+echo "=================================================="
+echo "Username: $MONITORING_USER"
+echo "Password: $MONITORING_PASSWORD"
+echo "=================================================="