Add comprehensive SigNoz configuration guide and monitoring setup

Documentation includes:

1. OpAMP Root Cause Analysis:
   - Explains OpAMP (Open Agent Management Protocol) functionality
   - Documents how OpAMP was overwriting config with "nop" receivers
   - Provides two solution paths:
     * Option 1: Disable OpAMP (current solution)
     * Option 2: Fix OpAMP server configuration (recommended for prod)
   - References: SigNoz architecture and OTel collector docs

2. Database Receivers Configuration:
   - PostgreSQL: Complete setup for 21 database instances
     * SQL commands to create monitoring users
     * Proper pg_monitor role permissions
     * Environment variable configuration
   - Redis: Configuration with/without TLS
     * Uses existing redis-secrets
     * Optional TLS certificate generation
   - RabbitMQ: Management API setup
     * Uses existing rabbitmq-secrets
     * Port 15672 management interface

3. Automation Script:
   - create-pg-monitoring-users.sh
   - Creates monitoring user in all 21 PostgreSQL databases
   - Generates secure random password
   - Verifies permissions
   - Provides next-step commands

Resources Referenced:
- PostgreSQL: https://signoz.io/docs/integrations/postgresql/
- Redis: https://signoz.io/blog/redis-opentelemetry/
- RabbitMQ: https://signoz.io/blog/opentelemetry-rabbitmq-metrics-monitoring/
- OpAMP: https://signoz.io/docs/operate/configuration/
- OTel Config: https://signoz.io/docs/opentelemetry-collection-agents/opentelemetry-collector/configuration/

Current Infrastructure Discovered:
- 21 PostgreSQL databases (all services have dedicated DBs)
- 1 Redis instance (password in redis-secrets)
- 1 RabbitMQ instance (credentials in rabbitmq-secrets)

Next Implementation Steps:
1. Run create-pg-monitoring-users.sh script
2. Create Kubernetes secrets for monitoring credentials
3. Update signoz-values-dev.yaml with receivers
4. Enable receivers in metrics pipeline
5. Test and verify metric collection

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Urtzi Alfaro
2026-01-09 12:15:58 +01:00
parent 1329bae784
commit 7ef85c1188
2 changed files with 663 additions and 0 deletions


@@ -0,0 +1,518 @@
# SigNoz Complete Configuration Guide
## Root Cause Analysis and Solutions
This document provides a comprehensive analysis of the SigNoz telemetry collection issues and the proper configuration for all receivers.
---
## Problem 1: OpAMP Configuration Corruption
### Root Cause
**What is OpAMP?**
[OpAMP (Open Agent Management Protocol)](https://signoz.io/docs/operate/configuration/) is a protocol for remote configuration management in OpenTelemetry Collectors. In SigNoz, OpAMP runs a server that dynamically configures log pipelines in the SigNoz OTel collector.
**The Issue:**
- OpAMP was successfully connecting to the SigNoz backend and receiving remote configuration
- The remote configuration contained only `nop` (no-operation) receivers and exporters
- This overwrote the local collector configuration at runtime
- Result: The collector appeared healthy but couldn't receive or export any data
**Why This Happened:**
1. The SigNoz backend's OpAMP server was pushing an invalid/incomplete configuration
2. The collector's `--manager-config` flag pointed to OpAMP configuration
3. OpAMP's `--copy-path=/var/tmp/collector-config.yaml` overwrote the good config
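One way to confirm this failure mode is to inspect the configuration the collector is actually running. The sketch below assumes the file written by `--copy-path` has been copied out of the pod (for example with `kubectl cp`) and that top-level receivers are indented by two spaces. The function names `list_receivers` and `only_nop_receivers` are illustrative and are not part of SigNoz or the collector.

```bash
# List the names defined under the top-level `receivers:` key of a collector config.
# Assumes two-space indentation; this is a text scan, not a real YAML parser.
list_receivers() {
  awk '
    /^receivers:/   { in_receivers = 1; next }
    /^[^ \t#]/      { in_receivers = 0 }
    in_receivers && /^  [^ \t#][^:]*:/ {
      sub(/^  /, ""); sub(/:.*/, ""); print
    }
  ' "$1"
}

# Succeed (exit 0) only when at least one receiver is defined and every one is `nop`.
only_nop_receivers() {
  names="$(list_receivers "$1")"
  [ -n "$names" ] && ! printf '%s\n' "$names" | grep -qvx 'nop'
}

# Intended use on a local copy of the effective config:
#   if only_nop_receivers ./collector-config.yaml; then echo "Config was replaced by nop receivers"; fi
```

If the check succeeds, the collector is running a configuration that cannot receive any data, even though its health endpoint may still report OK.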
### Solution Options
#### Option 1: Disable OpAMP (Current Solution)
Since OpAMP is pushing bad configuration and we have a working static configuration, we disabled it:
```bash
kubectl patch deployment -n bakery-ia signoz-otel-collector --type=json -p='[
{
"op": "replace",
"path": "/spec/template/spec/containers/0/args",
"value": [
"--config=/conf/otel-collector-config.yaml",
"--feature-gates=-pkg.translator.prometheus.NormalizeName"
]
}
]'
```
**Important:** This patch must be applied after every `helm install` or `helm upgrade` because the Helm chart doesn't support disabling OpAMP via values.
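Because the patch is lost on every Helm operation, it can help to check whether the running deployment has reverted to the OpAMP arguments. The following is a minimal sketch that assumes the `--manager-config` flag described above is what marks OpAMP as active; `collector_args` and `args_use_opamp` are hypothetical helper names.

```bash
NAMESPACE="${NAMESPACE:-bakery-ia}"

# Print the collector container's args as kubectl renders them (a JSON array on one line).
collector_args() {
  kubectl get deployment -n "$NAMESPACE" signoz-otel-collector \
    -o jsonpath='{.spec.template.spec.containers[0].args}'
}

# Succeed (exit 0) when the given args string still references the OpAMP manager config.
args_use_opamp() {
  case "$1" in
    *--manager-config*) return 0 ;;
    *) return 1 ;;
  esac
}

# Intended use after `helm upgrade`:
#   if args_use_opamp "$(collector_args)"; then echo "OpAMP is active again; re-apply the patch"; fi
```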
#### Option 2: Fix OpAMP Configuration (Recommended for Production)
To properly use OpAMP:
1. **Check SigNoz Backend Configuration:**
- Verify the SigNoz service is properly configured to serve OpAMP
- Check logs: `kubectl logs -n bakery-ia statefulset/signoz`
- Look for OpAMP-related errors
2. **Configure OpAMP Server Settings:**
According to [SigNoz configuration documentation](https://signoz.io/docs/operate/configuration/), set these environment variables in the SigNoz statefulset:
```yaml
signoz:
  env:
    OPAMP_ENABLED: "true"
    OPAMP_SERVER_ENDPOINT: "ws://signoz:4320/v1/opamp"
```
3. **Verify OpAMP Configuration File:**
```bash
kubectl get configmap -n bakery-ia signoz-otel-collector -o yaml
```
Should contain:
```yaml
otel-collector-opamp-config.yaml: |
  server_endpoint: "ws://signoz:4320/v1/opamp"
```
4. **Monitor OpAMP Status:**
```bash
kubectl logs -n bakery-ia deployment/signoz-otel-collector | grep opamp
```
### References
- [SigNoz Architecture](https://signoz.io/docs/architecture/)
- [OpenTelemetry Collector Configuration](https://signoz.io/docs/opentelemetry-collection-agents/opentelemetry-collector/configuration/)
- [SigNoz Helm Chart](https://github.com/SigNoz/charts)
---
## Problem 2: Database and Infrastructure Receivers Configuration
### Overview
You have the following infrastructure requiring monitoring:
- **21 PostgreSQL databases** (auth, inventory, orders, forecasting, production, etc.)
- **1 Redis instance** (caching layer)
- **1 RabbitMQ instance** (message queue)
All receivers were disabled because they lacked proper credentials and configuration.
---
## PostgreSQL Receiver Configuration
### Prerequisites
According to the [SigNoz PostgreSQL Integration Guide](https://signoz.io/docs/integrations/postgresql/), each PostgreSQL instance needs a monitoring user with proper permissions.
### Step 1: Create Monitoring Users
For each PostgreSQL database, create a dedicated monitoring user:
**For PostgreSQL 10 and newer:**
```sql
CREATE USER monitoring WITH PASSWORD 'your_secure_password';
GRANT pg_monitor TO monitoring;
GRANT SELECT ON pg_stat_database TO monitoring;
```
**For PostgreSQL 9.6 to 9.x:**
```sql
CREATE USER monitoring WITH PASSWORD 'your_secure_password';
GRANT SELECT ON pg_stat_database TO monitoring;
```
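The choice between the two variants can be made from `server_version_num`, which PostgreSQL reports as an integer (100000 or higher for version 10 and newer). The sketch below shows that decision; the function name `monitoring_grants` is hypothetical.

```bash
# Print the GRANT statements appropriate for a server, given its server_version_num
# (obtainable with: psql -U postgres -At -c "SHOW server_version_num").
monitoring_grants() {
  version_num="$1"
  if [ "$version_num" -ge 100000 ]; then
    # PostgreSQL 10+: the built-in pg_monitor role exists.
    echo "GRANT pg_monitor TO monitoring;"
  fi
  echo "GRANT SELECT ON pg_stat_database TO monitoring;"
}

# Example: a 14.x server reports a value such as 140005.
monitoring_grants 140005
```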
### Step 2: Create Monitoring User for All Databases
Run this script to create monitoring users in all PostgreSQL databases:
```bash
#!/bin/bash
# File: infrastructure/scripts/create-pg-monitoring-users.sh
DATABASES=(
"auth-db"
"inventory-db"
"orders-db"
"ai-insights-db"
"alert-processor-db"
"demo-session-db"
"distribution-db"
"external-db"
"forecasting-db"
"notification-db"
"orchestrator-db"
"pos-db"
"procurement-db"
"production-db"
"recipes-db"
"sales-db"
"suppliers-db"
"tenant-db"
"training-db"
)
MONITORING_PASSWORD="monitoring_secure_pass_$(openssl rand -hex 16)"
echo "Creating monitoring users with password: $MONITORING_PASSWORD"
echo "Save this password for your SigNoz configuration!"
for db in "${DATABASES[@]}"; do
  echo "Processing $db..."
  # Separate -c options run as separate transactions, so the GRANTs still apply
  # when CREATE USER fails because the user already exists.
  kubectl exec -n bakery-ia "deployment/$db" -- psql -U postgres \
    -c "CREATE USER monitoring WITH PASSWORD '$MONITORING_PASSWORD';" \
    -c "GRANT pg_monitor TO monitoring;" \
    -c "GRANT SELECT ON pg_stat_database TO monitoring;" \
    2>&1 | grep -v "already exists" || true
done
echo ""
echo "Monitoring users created!"
echo "Password: $MONITORING_PASSWORD"
```
### Step 3: Store Credentials in Kubernetes Secret
```bash
kubectl create secret generic -n bakery-ia postgres-monitoring-secrets \
--from-literal=POSTGRES_MONITOR_USER=monitoring \
--from-literal=POSTGRES_MONITOR_PASSWORD=<password-from-script>
```
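`kubectl create secret` fails if the secret already exists, which is awkward when the user-creation script is re-run with a new password. As an alternative sketch (not what this commit uses), the same secret can be written as a manifest with `stringData` and piped to `kubectl apply`, which is repeatable. The helper name `monitoring_secret_manifest` is hypothetical, and the password is assumed to contain no double quotes or backslashes, which holds for the hex passwords generated above.

```bash
# Print a Secret manifest for the monitoring credentials.
# Arguments: namespace, username, password.
monitoring_secret_manifest() {
  cat <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: postgres-monitoring-secrets
  namespace: $1
type: Opaque
stringData:
  POSTGRES_MONITOR_USER: "$2"
  POSTGRES_MONITOR_PASSWORD: "$3"
EOF
}

# Intended use:
#   monitoring_secret_manifest bakery-ia monitoring "$MONITORING_PASSWORD" | kubectl apply -f -
```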
### Step 4: Configure PostgreSQL Receivers in SigNoz
Update `infrastructure/helm/signoz-values-dev.yaml`:
```yaml
otelCollector:
  config:
    receivers:
      # PostgreSQL receivers for database metrics
      postgresql/auth:
        endpoint: auth-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - auth_db
        collection_interval: 60s
        tls:
          insecure: true # Set to false if using TLS
      postgresql/inventory:
        endpoint: inventory-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - inventory_db
        collection_interval: 60s
        tls:
          insecure: true
      # Add all other databases...
      postgresql/orders:
        endpoint: orders-db-service.bakery-ia:5432
        username: ${env:POSTGRES_MONITOR_USER}
        password: ${env:POSTGRES_MONITOR_PASSWORD}
        databases:
          - orders_db
        collection_interval: 60s
        tls:
          insecure: true
    # Update metrics pipeline
    service:
      pipelines:
        metrics:
          receivers:
            - otlp
            - postgresql/auth
            - postgresql/inventory
            - postgresql/orders
            # Add all PostgreSQL receivers
          processors: [memory_limiter, batch, resourcedetection]
          exporters: [signozclickhousemetrics]
```
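Writing one stanza per database by hand is repetitive, so the stanzas can be generated instead. The sketch below infers the naming pattern from the three examples above: the receiver is `postgresql/<name without -db>`, the service is `<deployment>-service.bakery-ia`, and the database name is the deployment name with hyphens replaced by underscores. That pattern is an assumption for the remaining databases and should be checked against each service before use; `pg_receiver_stanza` is a hypothetical helper.

```bash
# Print one postgresql receiver stanza for a database deployment name such as "auth-db".
# Output is indented by six spaces so it can be pasted under otelCollector.config.receivers.
pg_receiver_stanza() {
  deployment="$1"
  short="${deployment%-db}"
  database="$(printf '%s' "$deployment" | tr '-' '_')"
  # Single-quoted format strings keep ${env:...} literal for the collector to expand.
  printf '      postgresql/%s:\n' "$short"
  printf '        endpoint: %s-service.bakery-ia:5432\n' "$deployment"
  printf '        username: ${env:POSTGRES_MONITOR_USER}\n'
  printf '        password: ${env:POSTGRES_MONITOR_PASSWORD}\n'
  printf '        databases:\n          - %s\n' "$database"
  printf '        collection_interval: 60s\n'
  printf '        tls:\n          insecure: true\n'
}

# Example: emit stanzas for a few deployments (extend the list to cover every database).
for deployment in auth-db inventory-db ai-insights-db; do
  pg_receiver_stanza "$deployment"
done
```

Each generated receiver name must also be added to the `receivers` list of the metrics pipeline, otherwise the collector defines the receiver but never starts it.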
### Step 5: Add Environment Variables to OTel Collector Deployment
The Helm chart needs to inject these environment variables. Modify your Helm values:
```yaml
otelCollector:
  env:
    - name: POSTGRES_MONITOR_USER
      valueFrom:
        secretKeyRef:
          name: postgres-monitoring-secrets
          key: POSTGRES_MONITOR_USER
    - name: POSTGRES_MONITOR_PASSWORD
      valueFrom:
        secretKeyRef:
          name: postgres-monitoring-secrets
          key: POSTGRES_MONITOR_PASSWORD
```
### References
- [PostgreSQL Monitoring with OpenTelemetry | SigNoz](https://signoz.io/blog/opentelemetry-postgresql-metrics-monitoring/)
- [PostgreSQL Integration | SigNoz](https://signoz.io/docs/integrations/postgresql/)
---
## Redis Receiver Configuration
### Current Infrastructure
- **Service**: `redis-service.bakery-ia:6379`
- **Password**: Available in secret `redis-secrets`
- **TLS**: Currently not configured
### Step 1: Check if Redis Requires TLS
```bash
kubectl exec -n bakery-ia deployment/redis -- redis-cli CONFIG GET tls-port
```
If TLS is not configured (tls-port is 0 or empty), you can use `insecure: true`.
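The output of that command can be interpreted automatically. The sketch below reads the `redis-cli` output on stdin and treats any non-zero numeric port as "TLS enabled"; the helper name `redis_tls_enabled` is hypothetical. If the Redis instance requires a password, the `redis-cli` call needs the appropriate authentication option as well.

```bash
# Read `redis-cli CONFIG GET tls-port` output on stdin and succeed (exit 0) if TLS is on.
# redis-cli prints the parameter name followed by its value; the value is taken from the
# last non-empty line, with any surrounding quotes removed.
redis_tls_enabled() {
  port="$(awk 'NF { last = $NF } END { print last }' | tr -d '"')"
  case "$port" in
    '' | *[!0-9]*) return 1 ;;  # empty or non-numeric: treat as not configured
    0) return 1 ;;              # 0 means the TLS listener is disabled
    *) return 0 ;;
  esac
}

# Intended use:
#   kubectl exec -n bakery-ia deployment/redis -- redis-cli CONFIG GET tls-port | redis_tls_enabled
```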
### Step 2: Configure Redis Receiver
Update `infrastructure/helm/signoz-values-dev.yaml`:
```yaml
otelCollector:
  config:
    receivers:
      # Redis receiver for cache metrics
      redis:
        endpoint: redis-service.bakery-ia:6379
        password: ${env:REDIS_PASSWORD}
        collection_interval: 60s
        transport: tcp
        tls:
          insecure: true # Change to false if using TLS
        metrics:
          redis.maxmemory:
            enabled: true
          redis.cmd.latency:
            enabled: true
    service:
      pipelines:
        metrics:
          receivers: [otlp, redis, ...]
  env:
    - name: REDIS_PASSWORD
      valueFrom:
        secretKeyRef:
          name: redis-secrets
          key: REDIS_PASSWORD
```
### Optional: Configure TLS for Redis
If you want to enable TLS for Redis (recommended for production):
1. **Generate TLS Certificates:**
```bash
# Create CA
openssl genrsa -out ca-key.pem 4096
openssl req -new -x509 -days 3650 -key ca-key.pem -out ca-cert.pem
# Create Redis server certificate
openssl genrsa -out redis-key.pem 4096
openssl req -new -key redis-key.pem -out redis.csr
openssl x509 -req -days 3650 -in redis.csr -CA ca-cert.pem -CAkey ca-key.pem -CAcreateserial -out redis-cert.pem
# Create Kubernetes secret
kubectl create secret generic -n bakery-ia redis-tls \
--from-file=ca-cert.pem=ca-cert.pem \
--from-file=redis-cert.pem=redis-cert.pem \
--from-file=redis-key.pem=redis-key.pem
```
2. **Mount Certificates in OTel Collector:**
```yaml
otelCollector:
  volumes:
    - name: redis-tls
      secret:
        secretName: redis-tls
  volumeMounts:
    - name: redis-tls
      mountPath: /etc/redis-tls
      readOnly: true
  config:
    receivers:
      redis:
        tls:
          insecure: false
          cert_file: /etc/redis-tls/redis-cert.pem
          key_file: /etc/redis-tls/redis-key.pem
          ca_file: /etc/redis-tls/ca-cert.pem
```
### References
- [Redis Monitoring with OpenTelemetry | SigNoz](https://signoz.io/blog/redis-opentelemetry/)
- [Redis Monitoring 101 | SigNoz](https://signoz.io/blog/redis-monitoring/)
---
## RabbitMQ Receiver Configuration
### Current Infrastructure
- **Service**: `rabbitmq-service.bakery-ia`
- Port 5672: AMQP protocol
- Port 15672: Management API (required for metrics)
- **Credentials**:
- Username: `bakery`
- Password: Available in secret `rabbitmq-secrets`
### Step 1: Enable RabbitMQ Management Plugin
```bash
kubectl exec -n bakery-ia deployment/rabbitmq -- rabbitmq-plugins enable rabbitmq_management
```
### Step 2: Verify Management API Access
```bash
kubectl port-forward -n bakery-ia svc/rabbitmq-service 15672:15672
# In browser: http://localhost:15672
# Login with: bakery / <password>
```
### Step 3: Configure RabbitMQ Receiver
Update `infrastructure/helm/signoz-values-dev.yaml`:
```yaml
otelCollector:
  config:
    receivers:
      # RabbitMQ receiver via management API
      rabbitmq:
        endpoint: http://rabbitmq-service.bakery-ia:15672
        username: ${env:RABBITMQ_USER}
        password: ${env:RABBITMQ_PASSWORD}
        collection_interval: 30s
    service:
      pipelines:
        metrics:
          receivers: [otlp, rabbitmq, ...]
  env:
    - name: RABBITMQ_USER
      valueFrom:
        secretKeyRef:
          name: rabbitmq-secrets
          key: RABBITMQ_USER
    - name: RABBITMQ_PASSWORD
      valueFrom:
        secretKeyRef:
          name: rabbitmq-secrets
          key: RABBITMQ_PASSWORD
```
### References
- [RabbitMQ Monitoring with OpenTelemetry | SigNoz](https://signoz.io/blog/opentelemetry-rabbitmq-metrics-monitoring/)
- [OpenTelemetry Receivers | SigNoz](https://signoz.io/docs/userguide/otel-metrics-receivers/)
---
## Complete Implementation Plan
### Phase 1: Enable Basic Infrastructure Monitoring (No TLS)
1. **Create PostgreSQL monitoring users** (all 21 databases)
2. **Create Kubernetes secrets** for credentials
3. **Update Helm values** with receiver configurations
4. **Configure environment variables** in OTel Collector
5. **Apply Helm upgrade** and OpAMP patch
6. **Verify metrics collection**
### Phase 2: Enable TLS (Optional, Production-Ready)
1. **Generate TLS certificates** for Redis
2. **Configure Redis TLS** in deployment
3. **Update Redis receiver** with TLS settings
4. **Configure PostgreSQL TLS** if required
5. **Test and verify** secure connections
### Phase 3: Enable OpAMP (Optional, Advanced)
1. **Fix SigNoz OpAMP server configuration**
2. **Test remote configuration** in dev environment
3. **Gradually enable** OpAMP after validation
4. **Monitor** for configuration corruption
---
## Verification Commands
### Check Collector Metrics
```bash
kubectl port-forward -n bakery-ia svc/signoz-otel-collector 8888:8888
curl http://localhost:8888/metrics | grep "otelcol_receiver_accepted"
```
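A collector that was silently reconfigured with `nop` receivers still serves this endpoint, so the useful signal is whether the accepted counters are non-zero. The sketch below sums every `otelcol_receiver_accepted_*` sample from the Prometheus text output; it assumes the value is the last field on each line (no trailing timestamps), and `accepted_total` is a hypothetical helper name.

```bash
# Sum the values of all otelcol_receiver_accepted_* samples read from stdin.
# Comment lines start with "#" and are skipped because they do not match the prefix.
accepted_total() {
  awk '/^otelcol_receiver_accepted/ { total += $NF } END { printf "%d\n", total }'
}

# Intended use, with the port-forward above running:
#   curl -s http://localhost:8888/metrics | accepted_total
```

A result of `0` after traffic has been sent suggests the receivers are not part of the running pipeline.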
### Check Database Connectivity
```bash
kubectl exec -n bakery-ia deployment/signoz-otel-collector -- \
/bin/sh -c "nc -zv auth-db-service 5432"
```
### Check RabbitMQ Management API
```bash
kubectl exec -n bakery-ia deployment/signoz-otel-collector -- \
/bin/sh -c "wget -O- http://rabbitmq-service:15672/api/overview"
```
### Check Redis Connectivity
```bash
kubectl exec -n bakery-ia deployment/signoz-otel-collector -- \
/bin/sh -c "nc -zv redis-service 6379"
```
---
## Troubleshooting
### PostgreSQL Connection Refused
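- Verify monitoring user exists: `kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c "\du"`
- Check user permissions: `kubectl exec -n bakery-ia deployment/auth-db -- psql -U monitoring -d postgres -c "SELECT 1"` (without `-d`, psql tries to connect to a database named after the user)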
### Redis Authentication Failed
- Verify password: `kubectl get secret -n bakery-ia redis-secrets -o jsonpath='{.data.REDIS_PASSWORD}' | base64 -d`
- Test connection: `kubectl exec -n bakery-ia deployment/redis -- redis-cli -a <password> PING`
### RabbitMQ Management API Not Available
- Check plugin status: `kubectl exec -n bakery-ia deployment/rabbitmq -- rabbitmq-plugins list`
- Enable plugin: `kubectl exec -n bakery-ia deployment/rabbitmq -- rabbitmq-plugins enable rabbitmq_management`
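
The plugin check can be scripted rather than read by eye. The sketch below assumes the plugin list is produced with `rabbitmq-plugins list -e -m` (enabled plugins only, names only) and simply looks for an exact `rabbitmq_management` line; `management_plugin_enabled` is a hypothetical helper name.

```bash
# Read plugin names (one per line) on stdin and succeed if rabbitmq_management is among them.
# Exact-line matching avoids a false positive from rabbitmq_management_agent.
management_plugin_enabled() {
  grep -qx 'rabbitmq_management'
}

# Intended use:
#   kubectl exec -n bakery-ia deployment/rabbitmq -- rabbitmq-plugins list -e -m | management_plugin_enabled
```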
---
## Summary
**Current Status:**
- ✅ OTel Collector receiving traces (97+ spans)
- ✅ ClickHouse authentication fixed
- ✅ OpAMP disabled (preventing config corruption)
- ❌ PostgreSQL receivers not configured (no monitoring users)
- ❌ Redis receiver not configured (missing in pipeline)
- ❌ RabbitMQ receiver not configured (missing in pipeline)
**Next Steps:**
1. Create PostgreSQL monitoring users across all 21 databases
2. Configure Redis receiver with existing credentials
3. Configure RabbitMQ receiver with existing credentials
4. Test and verify all metrics are flowing
5. Optionally enable TLS for production
6. Optionally fix and re-enable OpAMP for dynamic configuration


@@ -0,0 +1,145 @@
#!/bin/bash
# Create monitoring users in all PostgreSQL databases for SigNoz metrics collection
#
# This script creates a 'monitoring' user with pg_monitor role in each PostgreSQL database
# Based on: https://signoz.io/docs/integrations/postgresql/
#
# Usage: ./create-pg-monitoring-users.sh
set -e
NAMESPACE="bakery-ia"
MONITORING_USER="monitoring"
MONITORING_PASSWORD="monitoring_$(openssl rand -hex 16)"
# PostgreSQL database deployments to process (19 are listed here; the guide counts 21, so add any that are missing)
DATABASES=(
"auth-db"
"inventory-db"
"orders-db"
"ai-insights-db"
"alert-processor-db"
"demo-session-db"
"distribution-db"
"external-db"
"forecasting-db"
"notification-db"
"orchestrator-db"
"pos-db"
"procurement-db"
"production-db"
"recipes-db"
"sales-db"
"suppliers-db"
"tenant-db"
"training-db"
)
echo "=================================================="
echo "PostgreSQL Monitoring User Setup for SigNoz"
echo "=================================================="
echo ""
echo "This script will create a monitoring user in all PostgreSQL databases"
echo "User: $MONITORING_USER"
echo "Password: $MONITORING_PASSWORD"
echo ""
echo "IMPORTANT: Save this password! You'll need it for SigNoz configuration."
echo ""
read -r -p "Press Enter to continue or Ctrl+C to cancel..."
SUCCESS_COUNT=0
FAILED_COUNT=0
FAILED_DBS=()
for db in "${DATABASES[@]}"; do
echo ""
echo "Processing: $db"
echo "---"
# Create monitoring user with pg_monitor role (PostgreSQL 10+)
if kubectl exec -n "$NAMESPACE" "deployment/$db" -- psql -U postgres -c "
DO \$\$
BEGIN
-- Try to create the user
CREATE USER $MONITORING_USER WITH PASSWORD '$MONITORING_PASSWORD';
RAISE NOTICE 'User created successfully';
EXCEPTION
WHEN duplicate_object THEN
-- User already exists, update password
ALTER USER $MONITORING_USER WITH PASSWORD '$MONITORING_PASSWORD';
RAISE NOTICE 'User already exists, password updated';
END
\$\$;
-- Grant pg_monitor role (PostgreSQL 10+)
GRANT pg_monitor TO $MONITORING_USER;
-- Grant SELECT on pg_stat_database
GRANT SELECT ON pg_stat_database TO $MONITORING_USER;
-- Verify permissions
SELECT
r.rolname as role_name,
ARRAY_AGG(b.rolname) as granted_roles
FROM pg_auth_members m
JOIN pg_roles r ON (m.member = r.oid)
JOIN pg_roles b ON (m.roleid = b.oid)
WHERE r.rolname = '$MONITORING_USER'
GROUP BY r.rolname;
" 2>&1; then
echo "✅ SUCCESS: $db"
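# Plain assignment: ((SUCCESS_COUNT++)) returns status 1 when the old value is 0, which aborts under set -e
SUCCESS_COUNT=$((SUCCESS_COUNT + 1))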
else
echo "❌ FAILED: $db"
FAILED_COUNT=$((FAILED_COUNT + 1))
FAILED_DBS+=("$db")
fi
done
echo ""
echo "=================================================="
echo "Summary"
echo "=================================================="
echo "Successful: $SUCCESS_COUNT databases"
echo "Failed: $FAILED_COUNT databases"
if [ $FAILED_COUNT -gt 0 ]; then
echo ""
echo "Failed databases:"
for db in "${FAILED_DBS[@]}"; do
echo " - $db"
done
fi
echo ""
echo "=================================================="
echo "Next Steps"
echo "=================================================="
echo ""
echo "1. Create Kubernetes secret with monitoring credentials:"
echo ""
echo "kubectl create secret generic -n $NAMESPACE postgres-monitoring-secrets \\"
echo " --from-literal=POSTGRES_MONITOR_USER=$MONITORING_USER \\"
echo " --from-literal=POSTGRES_MONITOR_PASSWORD='$MONITORING_PASSWORD'"
echo ""
echo "2. Update infrastructure/helm/signoz-values-dev.yaml with PostgreSQL receivers"
echo ""
echo "3. Add environment variables to otelCollector configuration"
echo ""
echo "4. Run: helm upgrade signoz signoz/signoz -n $NAMESPACE -f infrastructure/helm/signoz-values-dev.yaml"
echo ""
echo "5. Apply OpAMP patch:"
echo ""
echo "kubectl patch deployment -n $NAMESPACE signoz-otel-collector --type=json -p='["
echo " {\"op\":\"replace\",\"path\":\"/spec/template/spec/containers/0/args\",\"value\":["
echo " \"--config=/conf/otel-collector-config.yaml\","
echo " \"--feature-gates=-pkg.translator.prometheus.NormalizeName\""
echo " ]}"
echo "]'"
echo ""
echo "=================================================="
echo "SAVE THIS INFORMATION!"
echo "=================================================="
echo "Username: $MONITORING_USER"
echo "Password: $MONITORING_PASSWORD"
echo "=================================================="