# SigNoz Monitoring Setup Guide

This guide explains how to set up complete observability for the Bakery IA platform using SigNoz, which provides unified visualization of metrics, logs, and traces.

## Table of Contents

1. [Architecture Overview](#architecture-overview)
2. [Prerequisites](#prerequisites)
3. [SigNoz Deployment](#signoz-deployment)
4. [Service Configuration](#service-configuration)
5. [Data Flow](#data-flow)
6. [Verification](#verification)
7. [Troubleshooting](#troubleshooting)
8. [Performance Tuning](#performance-tuning)
9. [Best Practices](#best-practices)
10. [Additional Resources](#additional-resources)
11. [Support](#support)

## Architecture Overview

The monitoring setup uses a three-tier approach:

```
┌──────────────────────────────────────────────────────────────┐
│                      Bakery IA Services                       │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐      │
│  │   Auth   │  │ Inventory│  │  Orders  │  │   ...    │      │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘      │
│       │             │             │             │            │
│       └─────────────┴──────┬──────┴─────────────┘            │
│                            │                                 │
│              OpenTelemetry Protocol (OTLP)                    │
│                 Traces / Metrics / Logs                       │
└────────────────────────────┼──────────────────────────────────┘
                             │
                             ▼
┌──────────────────────────────────────────────────────────────┐
│                SigNoz OpenTelemetry Collector                 │
│  ┌────────────────────────────────────────────────────────┐  │
│  │ Receivers:                                              │  │
│  │  - OTLP gRPC (4317)      - OTLP HTTP (4318)             │  │
│  │  - Prometheus Scraper (service discovery)               │  │
│  └────────────────────┬───────────────────────────────────┘  │
│                       │                                       │
│  ┌────────────────────┴───────────────────────────────────┐  │
│  │ Processors: batch, memory_limiter, resourcedetection    │  │
│  └────────────────────┬───────────────────────────────────┘  │
│                       │                                       │
│  ┌────────────────────┴───────────────────────────────────┐  │
│  │ Exporters: ClickHouse (traces, metrics, logs)           │  │
│  └────────────────────────────────────────────────────────┘  │
└──────────────────────────────┼────────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────┐
│                     ClickHouse Database                       │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐                    │
│  │  Traces  │  │ Metrics  │  │   Logs   │                    │
│  └──────────┘  └──────────┘  └──────────┘                    │
└──────────────────────────────┼────────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────┐
│                     SigNoz Query Service                      │
│                        & Frontend UI                          │
│              https://monitoring.bakery-ia.local               │
└──────────────────────────────────────────────────────────────┘
```

### Key Components

1. **Services**: Generate telemetry data using the OpenTelemetry SDK
2. **OpenTelemetry Collector**: Receives, processes, and exports telemetry
3. **ClickHouse**: Stores traces, metrics, and logs
4. **SigNoz UI**: Queries and visualizes all telemetry data

## Prerequisites

- Kubernetes cluster (Kind, Minikube, or production cluster)
- Helm 3.x installed
- kubectl configured
- At least 4GB RAM available for SigNoz components

## SigNoz Deployment

### 1. Add SigNoz Helm Repository

```bash
helm repo add signoz https://charts.signoz.io
helm repo update
```

### 2. Create Namespace

```bash
kubectl create namespace signoz
```

### 3. Deploy SigNoz

```bash
# For development environment
helm install signoz signoz/signoz \
  -n signoz \
  -f infrastructure/helm/signoz-values-dev.yaml

# For production environment
helm install signoz signoz/signoz \
  -n signoz \
  -f infrastructure/helm/signoz-values-prod.yaml
```
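The values files referenced above live in `infrastructure/helm/`. As a rough illustration, a development override typically just trims collector resources and replicas; the keys below mirror the production tuning example later in this guide, and the committed `signoz-values-dev.yaml` is the source of truth:

```yaml
# Illustrative sketch only; see infrastructure/helm/signoz-values-dev.yaml for the real values.
otelCollector:
  replicaCount: 1
  resources:
    requests:
      cpu: 100m
      memory: 256Mi
    limits:
      cpu: 500m
      memory: 512Mi
```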
### 4. Verify Deployment

```bash
# Check all pods are running
kubectl get pods -n signoz

# Expected output:
# signoz-alertmanager-0
# signoz-clickhouse-0
# signoz-frontend-*
# signoz-otel-collector-*
# signoz-query-service-*

# Check services
kubectl get svc -n signoz
```

## Service Configuration

Each microservice needs to be configured to send telemetry to SigNoz.

### Environment Variables

Add these environment variables to your service deployments:

```yaml
env:
  # OpenTelemetry Collector endpoint
  - name: OTEL_COLLECTOR_ENDPOINT
    value: "http://signoz-otel-collector.signoz.svc.cluster.local:4318"
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://signoz-otel-collector.signoz.svc.cluster.local:4318"

  # Service identification
  - name: OTEL_SERVICE_NAME
    value: "your-service-name"  # e.g., "auth-service"

  # Enable tracing
  - name: ENABLE_TRACING
    value: "true"

  # Enable logs export
  - name: OTEL_LOGS_EXPORTER
    value: "otlp"
  - name: OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED
    value: "true"

  # Enable metrics export (optional, default: true)
  - name: ENABLE_OTEL_METRICS
    value: "true"
```

### Prometheus Annotations

Add these annotations to enable Prometheus metrics scraping:

```yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8000"
    prometheus.io/path: "/metrics"
```

### Complete Example

See [infrastructure/kubernetes/base/components/auth/auth-service.yaml](../infrastructure/kubernetes/base/components/auth/auth-service.yaml) for a complete example.

### Automated Configuration Script

Use the provided script to add the monitoring configuration to all services:

```bash
# Run from project root
./infrastructure/kubernetes/add-monitoring-config.sh
```

## Data Flow

### 1. Traces

**Automatic Instrumentation:**

```python
# In your service's main.py
from shared.service_base import StandardFastAPIService

service = AuthService()  # Extends StandardFastAPIService
app = service.create_app()

# Tracing is automatically enabled if ENABLE_TRACING=true
# All FastAPI endpoints, HTTP clients, Redis, PostgreSQL are auto-instrumented
```

**Manual Instrumentation:**

```python
from shared.monitoring.tracing import add_trace_attributes, add_trace_event

# Add custom attributes to the current span
add_trace_attributes(
    user_id="123",
    tenant_id="abc",
    operation="user_registration"
)

# Add events for important operations
add_trace_event("user_authenticated", user_id="123", method="jwt")
```

### 2. Metrics

**Dual Export Strategy:**

Services export metrics in two ways:

1. **Prometheus format** at the `/metrics` endpoint (scraped by SigNoz)
2. **OTLP push** directly to the SigNoz collector (real-time)

**Built-in Metrics:**

```python
# Automatically collected by BaseFastAPIService:
# - http_requests_total
# - http_request_duration_seconds
# - active_connections
```

**Custom Metrics:**

```python
# Define in your service
custom_metrics = {
    "user_registrations": {
        "type": "counter",
        "description": "Total user registrations",
        "labels": ["status"]
    },
    "login_duration_seconds": {
        "type": "histogram",
        "description": "Login request duration"
    }
}

service = AuthService(custom_metrics=custom_metrics)

# Use in your code
service.metrics_collector.increment_counter(
    "user_registrations",
    labels={"status": "success"}
)
```
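The OTLP push half of the dual export strategy is handled by the shared library, but it helps to see what that wiring amounts to. Below is a minimal, self-contained sketch using the standard OpenTelemetry Python SDK with the OTLP/HTTP exporter, pointed at the collector endpoint configured above; the actual setup inside `shared/monitoring` may differ in detail.

```python
from opentelemetry import metrics
from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource

# Push metrics to the SigNoz collector over OTLP/HTTP every 15 seconds.
exporter = OTLPMetricExporter(
    endpoint="http://signoz-otel-collector.signoz.svc.cluster.local:4318/v1/metrics"
)
reader = PeriodicExportingMetricReader(exporter, export_interval_millis=15_000)
provider = MeterProvider(
    resource=Resource.create({"service.name": "auth-service"}),
    metric_readers=[reader],
)
metrics.set_meter_provider(provider)

# Instruments created from this meter appear in SigNoz under the service name above.
meter = metrics.get_meter("auth-service")
registrations = meter.create_counter(
    "user_registrations", description="Total user registrations"
)
registrations.add(1, {"status": "success"})
```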
### 3. Logs

**Automatic Export:**

```python
# Logs are automatically exported if OTEL_LOGS_EXPORTER=otlp
import logging

logger = logging.getLogger(__name__)

# This will appear in SigNoz
logger.info("User logged in", extra={"user_id": "123", "tenant_id": "abc"})
```

**Structured Logging with Context:**

```python
from shared.monitoring.logs_exporter import add_log_context

# Add context that persists across log calls
log_ctx = add_log_context(
    request_id="req_123",
    user_id="user_456",
    tenant_id="tenant_789"
)

# All subsequent logs include this context
log_ctx.info("Processing order")  # Includes request_id, user_id, tenant_id
```

**Trace Correlation:**

```python
from shared.monitoring.logs_exporter import get_current_trace_context

# Get trace context for correlation
trace_ctx = get_current_trace_context()
logger.info("Processing request", extra=trace_ctx)

# Logs now include trace_id and span_id for correlation
```

## Verification

### 1. Check Service Health

```bash
# Check that services are exporting telemetry
kubectl logs -n bakery-ia deployment/auth-service | grep -i "telemetry\|otel\|signoz"

# Expected output includes:
# - "Distributed tracing configured"
# - "OpenTelemetry logs export configured"
# - "OpenTelemetry metrics export configured"
```

### 2. Access SigNoz UI

```bash
# Port-forward (for local development)
kubectl port-forward -n signoz svc/signoz-frontend 3301:3301

# Or via Ingress
open https://monitoring.bakery-ia.local
```

### 3. Verify Data Ingestion

**Traces:**

1. Go to SigNoz UI → Traces
2. You should see traces from your services
3. Click on a trace to see the full span tree

**Metrics:**

1. Go to SigNoz UI → Metrics
2. Query: `http_requests_total`
3. Filter by service: `service="auth-service"`

**Logs:**

1. Go to SigNoz UI → Logs
2. Filter by service: `service_name="auth-service"`
3. Search for specific log messages

### 4. Test Trace-Log Correlation

1. Find a trace in the SigNoz UI
2. Copy its `trace_id`
3. Go to the Logs tab
4. Search: `trace_id="<copied-trace-id>"`
5. You should see all logs for that trace

## Troubleshooting

### No Data in SigNoz

**1. Check the OpenTelemetry Collector:**

```bash
# Check collector logs
kubectl logs -n signoz deployment/signoz-otel-collector

# Should see:
# - "Receiver is starting"
# - "Exporter is starting"
# - No error messages
```

**2. Check Service Configuration:**

```bash
# Verify environment variables
kubectl get deployment auth-service -n bakery-ia -o yaml | grep -A 20 "env:"

# Verify annotations
kubectl get deployment auth-service -n bakery-ia -o yaml | grep -A 5 "annotations:"
```

**3. Check Network Connectivity:**

```bash
# Test from a service pod
kubectl exec -n bakery-ia deployment/auth-service -- \
  curl -v http://signoz-otel-collector.signoz.svc.cluster.local:4318/v1/traces

# Should return: 405 Method Not Allowed (POST required)
# If connection refused, check network policies
```

### Traces Not Appearing

**Check instrumentation:**

```python
# Verify tracing is enabled
import os

print(os.getenv("ENABLE_TRACING"))           # Should be "true"
print(os.getenv("OTEL_COLLECTOR_ENDPOINT"))  # Should be set
```

**Check trace sampling:**

```bash
# Verify sampling rate (default 100%)
kubectl logs -n bakery-ia deployment/auth-service | grep "sampling"
```
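If the environment variables look correct but no traces arrive, it can help to confirm that spans are being produced at all before suspecting the export path. The one-off debug script below is a minimal sketch using the plain OpenTelemetry SDK (it bypasses the shared library entirely and prints spans to stdout instead of exporting them):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Print spans to stdout instead of sending them over OTLP.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("trace-debug")
with tracer.start_as_current_span("debug-span"):
    pass  # If this span prints, instrumentation works; focus on the collector/network.
```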
### Metrics Not Appearing

**1. Verify Prometheus annotations:**

```bash
kubectl get pods -n bakery-ia -o yaml | grep "prometheus.io"
```

**2. Test the metrics endpoint:**

```bash
# Port-forward the service
kubectl port-forward -n bakery-ia deployment/auth-service 8000:8000

# Test the endpoint
curl http://localhost:8000/metrics

# Should return metrics in Prometheus format
```

**3. Check the SigNoz scrape configuration:**

```bash
# Check collector config
kubectl get configmap -n signoz signoz-otel-collector -o yaml | grep -A 30 "prometheus:"
```

### Logs Not Appearing

**1. Verify log export is enabled:**

```bash
kubectl get deployment auth-service -n bakery-ia -o yaml | grep OTEL_LOGS_EXPORTER

# Should return: OTEL_LOGS_EXPORTER=otlp
```

**2. Check the log format:**

```bash
# Logs should be JSON formatted
kubectl logs -n bakery-ia deployment/auth-service | head -5
```

**3. Verify the OTLP endpoint:**

```bash
# Test the logs endpoint
kubectl exec -n bakery-ia deployment/auth-service -- \
  curl -X POST http://signoz-otel-collector.signoz.svc.cluster.local:4318/v1/logs \
  -H "Content-Type: application/json" \
  -d '{"resourceLogs":[]}'

# Should return 200 OK or 400 Bad Request (not a connection error)
```

## Performance Tuning

### For Development

The default configuration is optimized for local development with minimal resources.

### For Production

Update the following in `signoz-values-prod.yaml`:

```yaml
# Increase collector resources
otelCollector:
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: 2000m
      memory: 2Gi

  # Increase batch sizes
  config:
    processors:
      batch:
        timeout: 10s
        send_batch_size: 10000  # Increased from 1024

  # Add more replicas
  replicaCount: 2
```

## Best Practices

1. **Use Structured Logging**: Always use key-value pairs for better querying
2. **Add Context**: Include `user_id`, `tenant_id`, and `request_id` in logs
3. **Trace Business Operations**: Add custom spans for important operations (see the sketch at the end of this guide)
4. **Monitor Collector Health**: Set up alerts for collector errors
5. **Retention Policy**: Configure ClickHouse retention based on your needs

## Additional Resources

- [SigNoz Documentation](https://signoz.io/docs/)
- [OpenTelemetry Python](https://opentelemetry.io/docs/instrumentation/python/)
- [Bakery IA Monitoring Shared Library](../shared/monitoring/)

## Support

For issues or questions:

1. Check the SigNoz community: https://signoz.io/slack
2. Review the OpenTelemetry docs: https://opentelemetry.io/docs/
3. Create an issue in the project repository
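As referenced in the Best Practices list, a custom span for a business operation can be created directly with the OpenTelemetry API. A minimal sketch follows; `process_order` and its attributes are illustrative, and the project's `shared.monitoring` package may expose its own helper for this:

```python
from opentelemetry import trace

tracer = trace.get_tracer("orders-service")

def process_order(order_id: str, tenant_id: str) -> None:
    # The child span nests under the incoming FastAPI request span.
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("tenant.id", tenant_id)
        # ... business logic ...
```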