bakery-ia/docs/MONITORING_SETUP.md
Urtzi Alfaro 29d19087f1 Update monitoring packages to latest versions
- Updated all OpenTelemetry packages to latest versions:
  - opentelemetry-api: 1.27.0 → 1.39.1
  - opentelemetry-sdk: 1.27.0 → 1.39.1
  - opentelemetry-exporter-otlp-proto-grpc: 1.27.0 → 1.39.1
  - opentelemetry-exporter-otlp-proto-http: 1.27.0 → 1.39.1
  - opentelemetry-instrumentation-fastapi: 0.48b0 → 0.60b1
  - opentelemetry-instrumentation-httpx: 0.48b0 → 0.60b1
  - opentelemetry-instrumentation-redis: 0.48b0 → 0.60b1
  - opentelemetry-instrumentation-sqlalchemy: 0.48b0 → 0.60b1

- Removed prometheus-client==0.23.1 from all services
- Unified all services to use the same monitoring package versions

Generated by Mistral Vibe.
Co-Authored-By: Mistral Vibe <vibe@mistral.ai>
2026-01-08 19:25:52 +01:00

SigNoz Monitoring Setup Guide

This guide explains how to set up complete observability for the Bakery IA platform using SigNoz, which provides unified metrics, logs, and traces visualization.

Table of Contents

  1. Architecture Overview
  2. Prerequisites
  3. SigNoz Deployment
  4. Service Configuration
  5. Data Flow
  6. Verification
  7. Troubleshooting
  8. Performance Tuning
  9. Best Practices
  10. Support

Architecture Overview

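The monitoring setup is a four-stage pipeline: services emit telemetry, the collector processes it, ClickHouse stores it, and SigNoz queries and displays it: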

┌─────────────────────────────────────────────────────────────┐
│                    Bakery IA Services                        │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐   │
│  │  Auth    │  │ Inventory│  │  Orders  │  │   ...    │   │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘   │
│       │             │             │             │           │
│       └─────────────┴─────────────┴─────────────┘           │
│                          │                                   │
│              OpenTelemetry Protocol (OTLP)                   │
│                  Traces / Metrics / Logs                     │
└──────────────────────────┼───────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────────────┐
│              SigNoz OpenTelemetry Collector                   │
│  ┌────────────────────────────────────────────────────────┐  │
│  │  Receivers:                                            │  │
│  │  - OTLP gRPC (4317)  - OTLP HTTP (4318)              │  │
│  │  - Prometheus Scraper (service discovery)             │  │
│  └────────────────────┬───────────────────────────────────┘  │
│                       │                                       │
│  ┌────────────────────┴───────────────────────────────────┐  │
│  │  Processors: batch, memory_limiter, resourcedetection │  │
│  └────────────────────┬───────────────────────────────────┘  │
│                       │                                       │
│  ┌────────────────────┴───────────────────────────────────┐  │
│  │  Exporters: ClickHouse (traces, metrics, logs)        │  │
│  └────────────────────────────────────────────────────────┘  │
└──────────────────────────┼───────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────────────┐
│                    ClickHouse Database                        │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐                   │
│  │  Traces  │  │ Metrics  │  │   Logs   │                   │
│  └──────────┘  └──────────┘  └──────────┘                   │
└──────────────────────────┼───────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────────────┐
│                    SigNoz Query Service                       │
│                     & Frontend UI                             │
│         https://monitoring.bakery-ia.local                    │
└──────────────────────────────────────────────────────────────┘

Key Components

  1. Services: Generate telemetry data using the OpenTelemetry SDK
  2. OpenTelemetry Collector: Receives, processes, and exports telemetry to ClickHouse
  3. ClickHouse: Stores traces, metrics, and logs
  4. SigNoz UI: Queries and visualizes all telemetry data

Prerequisites

  • Kubernetes cluster (Kind, Minikube, or production cluster)
  • Helm 3.x installed
  • kubectl configured
  • At least 4 GB of RAM available for SigNoz components

SigNoz Deployment

1. Add SigNoz Helm Repository

helm repo add signoz https://charts.signoz.io
helm repo update

2. Create Namespace

kubectl create namespace signoz

3. Deploy SigNoz

# For development environment
helm install signoz signoz/signoz \
  -n signoz \
  -f infrastructure/helm/signoz-values-dev.yaml

# For production environment
helm install signoz signoz/signoz \
  -n signoz \
  -f infrastructure/helm/signoz-values-prod.yaml

4. Verify Deployment

# Check all pods are running
kubectl get pods -n signoz

# Expected output:
# signoz-alertmanager-0
# signoz-clickhouse-0
# signoz-frontend-*
# signoz-otel-collector-*
# signoz-query-service-*

# Check services
kubectl get svc -n signoz

Service Configuration

Each microservice needs to be configured to send telemetry to SigNoz.

Environment Variables

Add these environment variables to your service deployments:

env:
  # OpenTelemetry Collector endpoint (OTLP over HTTP, port 4318)
  - name: OTEL_COLLECTOR_ENDPOINT
    value: "http://signoz-otel-collector.signoz.svc.cluster.local:4318"
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://signoz-otel-collector.signoz.svc.cluster.local:4318"

  # Service identification
  - name: OTEL_SERVICE_NAME
    value: "your-service-name"  # e.g., "auth-service"

  # Enable tracing
  - name: ENABLE_TRACING
    value: "true"

  # Enable logs export
  - name: OTEL_LOGS_EXPORTER
    value: "otlp"
  - name: OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED
    value: "true"

  # Enable metrics export (optional, default: true)
  - name: ENABLE_OTEL_METRICS
    value: "true"
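
A missing or malformed variable tends to fail silently, so it can help to check the environment at startup. The following is a minimal sketch, not part of the project: `check_otel_env` and `REQUIRED_VARS` are hypothetical names, and the choice of which variables count as required is an assumption based on the list above.

```python
from urllib.parse import urlparse

# Variable names come from the list above; treating these two as mandatory
# is an assumption made for this sketch.
REQUIRED_VARS = ("OTEL_EXPORTER_OTLP_ENDPOINT", "OTEL_SERVICE_NAME")


def check_otel_env(env):
    """Return a list of human-readable problems found in an env mapping."""
    problems = []
    for name in REQUIRED_VARS:
        if not env.get(name):
            problems.append(f"{name} is not set")

    endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "")
    if endpoint:
        parsed = urlparse(endpoint)
        if parsed.scheme not in ("http", "https"):
            problems.append("OTEL_EXPORTER_OTLP_ENDPOINT must start with http:// or https://")
        elif parsed.port is None:
            problems.append("OTEL_EXPORTER_OTLP_ENDPOINT has no explicit port")

    if env.get("ENABLE_TRACING", "").lower() != "true":
        problems.append("ENABLE_TRACING is not 'true'")
    return problems
```

Calling `check_otel_env(os.environ)` early in `main.py` and logging the returned list would surface misconfiguration before any telemetry is expected.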

Prometheus Annotations

Add these annotations to enable Prometheus metrics scraping:

metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8000"
    prometheus.io/path: "/metrics"
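
These annotations are a convention, not a Kubernetes or Prometheus built-in: the scraper's relabeling rules read them and assemble a target URL. The sketch below illustrates that mapping. `scrape_url` is a hypothetical helper, and treating a missing port as "not scrapeable" is a simplification chosen for the example.

```python
def scrape_url(pod_ip, annotations):
    """Build the URL a scraper would request, or None if the pod opts out."""
    if annotations.get("prometheus.io/scrape") != "true":
        return None
    port = annotations.get("prometheus.io/port")
    if port is None:
        # Assumption for this sketch: without an explicit port, skip the pod.
        return None
    # /metrics is the conventional default path when none is given.
    path = annotations.get("prometheus.io/path", "/metrics")
    return f"http://{pod_ip}:{port}{path}"
```

With the annotations above and a pod IP of `10.0.0.5`, this yields `http://10.0.0.5:8000/metrics`, the same endpoint tested manually in the Troubleshooting section.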

Complete Example

See infrastructure/kubernetes/base/components/auth/auth-service.yaml for a complete example.

Automated Configuration Script

Use the provided script to add monitoring configuration to all services:

# Run from project root
./infrastructure/kubernetes/add-monitoring-config.sh

Data Flow

1. Traces

Automatic Instrumentation:

# In your service's main.py
from shared.service_base import StandardFastAPIService


class AuthService(StandardFastAPIService):
    """Service-specific setup goes here."""


service = AuthService()
app = service.create_app()

# Tracing is automatically enabled if ENABLE_TRACING=true
# FastAPI endpoints, HTTP clients (httpx), Redis, and PostgreSQL (via SQLAlchemy) are auto-instrumented

Manual Instrumentation:

from shared.monitoring.tracing import add_trace_attributes, add_trace_event

# Add custom attributes to current span
add_trace_attributes(
    user_id="123",
    tenant_id="abc",
    operation="user_registration"
)

# Add events for important operations
add_trace_event("user_authenticated", user_id="123", method="jwt")

2. Metrics

Dual Export Strategy:

Services export metrics in two ways:

  1. Prometheus format at the /metrics endpoint, scraped by the SigNoz collector's Prometheus receiver
  2. OTLP push directly to the SigNoz collector (real-time)

Built-in Metrics:

# Automatically collected by BaseFastAPIService:
# - http_requests_total
# - http_request_duration_seconds
# - active_connections
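
For orientation, the text a scraper reads from /metrics follows the Prometheus exposition format: a HELP line, a TYPE line, then one sample per label combination. The sketch below renders a counter by hand to show the shape. `render_counter` is a hypothetical helper, the `method` and `status` labels are illustrative assumptions, and label-value escaping is omitted.

```python
def render_counter(name, help_text, samples):
    """Render one counter in Prometheus text format.

    `samples` is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        # Sort labels so the output is stable between calls.
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        suffix = f"{{{label_str}}}" if label_str else ""
        lines.append(f"{name}{suffix} {value}")
    return "\n".join(lines) + "\n"


print(render_counter(
    "http_requests_total",
    "Total HTTP requests",
    [({"method": "GET", "status": "200"}, 42)],
))
```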

Custom Metrics:

# Define in your service
custom_metrics = {
    "user_registrations": {
        "type": "counter",
        "description": "Total user registrations",
        "labels": ["status"]
    },
    "login_duration_seconds": {
        "type": "histogram",
        "description": "Login request duration"
    }
}

service = AuthService(custom_metrics=custom_metrics)

# Use in your code
service.metrics_collector.increment_counter(
    "user_registrations",
    labels={"status": "success"}
)
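
The label names passed to `increment_counter` have to match the ones declared in `custom_metrics`. The following in-memory stand-in illustrates that contract. It is not the project's metrics collector: `CounterRegistry` is a hypothetical class, and its validation behavior is an assumption made for the example.

```python
from collections import Counter


class CounterRegistry:
    """Tiny in-memory stand-in for a metrics collector (illustration only)."""

    def __init__(self, definitions):
        # Keep only counter definitions and remember their declared labels.
        self._labels = {
            name: tuple(spec.get("labels", []))
            for name, spec in definitions.items()
            if spec["type"] == "counter"
        }
        self._values = Counter()

    def increment_counter(self, name, labels=None, amount=1):
        """Increment a declared counter and return its new value."""
        labels = labels or {}
        if name not in self._labels:
            raise KeyError(f"unknown counter: {name}")
        if set(labels) != set(self._labels[name]):
            raise ValueError(f"{name} expects labels {self._labels[name]}")
        key = (name, tuple(sorted(labels.items())))
        self._values[key] += amount
        return self._values[key]
```

Each distinct label combination becomes its own time series, so labels should stay low-cardinality (a status, not a user ID).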

3. Logs

Automatic Export:

# Logs are automatically exported if OTEL_LOGS_EXPORTER=otlp
import logging
logger = logging.getLogger(__name__)

# This will appear in SigNoz
logger.info("User logged in", extra={"user_id": "123", "tenant_id": "abc"})

Structured Logging with Context:

from shared.monitoring.logs_exporter import add_log_context

# Add context that persists across log calls
log_ctx = add_log_context(
    request_id="req_123",
    user_id="user_456",
    tenant_id="tenant_789"
)

# All subsequent logs include this context
log_ctx.info("Processing order")  # Includes request_id, user_id, tenant_id
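
The implementation of `add_log_context` is not shown in this guide. As a rough mental model, the standard library's `logging.LoggerAdapter` gives similar behavior: it wraps a logger and attaches a fixed set of fields to every record. The helper name `add_log_context_sketch` below is hypothetical.

```python
import logging


def add_log_context_sketch(logger, **context):
    """Return a logger-like object that attaches `context` to every record."""
    return logging.LoggerAdapter(logger, context)


logger = logging.getLogger("orders")
log_ctx = add_log_context_sketch(logger, request_id="req_123", tenant_id="tenant_789")

# Each record emitted through log_ctx carries request_id and tenant_id
# as attributes, which a JSON formatter or OTLP handler can then export.
log_ctx.warning("Processing order")
```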

Trace Correlation:

from shared.monitoring.logs_exporter import get_current_trace_context

# Get trace context for correlation
trace_ctx = get_current_trace_context()
logger.info("Processing request", extra=trace_ctx)
# Logs now include trace_id and span_id for correlation
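
When comparing IDs between logs and the Traces view, keep in mind that the OpenTelemetry Python API exposes trace and span IDs as integers, while UIs display them as lowercase hex (32 characters for a trace ID, 16 for a span ID). The sketch below shows the conversion. `format_trace_context` is a hypothetical helper and not the project's `get_current_trace_context`.

```python
def format_trace_context(trace_id, span_id):
    """Convert integer IDs to the zero-padded hex strings shown in UIs."""
    return {
        "trace_id": format(trace_id, "032x"),  # 128-bit ID, 32 hex chars
        "span_id": format(span_id, "016x"),    # 64-bit ID, 16 hex chars
    }


print(format_trace_context(0x4BF92F3577B34DA6A3CE929D0E0E4736, 0x00F067AA0BA902B7))
```

Zero padding matters: an ID logged without it will not match a search for the padded form.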

Verification

1. Check Service Health

# Check that services are exporting telemetry
kubectl logs -n bakery-ia deployment/auth-service | grep -i "telemetry\|otel\|signoz"

# Expected output includes:
# - "Distributed tracing configured"
# - "OpenTelemetry logs export configured"
# - "OpenTelemetry metrics export configured"

2. Access SigNoz UI

# Port-forward (for local development), then open http://localhost:3301
kubectl port-forward -n signoz svc/signoz-frontend 3301:3301

# Or via Ingress
open https://monitoring.bakery-ia.local

3. Verify Data Ingestion

Traces:

  1. Go to SigNoz UI → Traces
  2. You should see traces from your services
  3. Click on a trace to see the full span tree

Metrics:

  1. Go to SigNoz UI → Metrics
  2. Query: http_requests_total
  3. Filter by service: service="auth-service"

Logs:

  1. Go to SigNoz UI → Logs
  2. Filter by service: service_name="auth-service"
  3. Search for specific log messages

4. Test Trace-Log Correlation

  1. Find a trace in SigNoz UI
  2. Copy the trace_id
  3. Go to Logs tab
  4. Search: trace_id="<your-trace-id>"
  5. You should see all logs for that trace

Troubleshooting

No Data in SigNoz

1. Check OpenTelemetry Collector:

# Check collector logs
kubectl logs -n signoz deployment/signoz-otel-collector

# Should see:
# - "Receiver is starting"
# - "Exporter is starting"
# - No error messages

2. Check Service Configuration:

# Verify environment variables
kubectl get deployment auth-service -n bakery-ia -o yaml | grep -A 20 "env:"

# Verify annotations
kubectl get deployment auth-service -n bakery-ia -o yaml | grep -A 5 "annotations:"

3. Check Network Connectivity:

# Test from service pod
kubectl exec -n bakery-ia deployment/auth-service -- \
  curl -v http://signoz-otel-collector.signoz.svc.cluster.local:4318/v1/traces

# Should return: 405 Method Not Allowed (POST required)
# If connection refused, check network policies
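
If the service image does not ship curl, the same probe can be run with the Python standard library. This is a sketch under that assumption. `interpret_status` and `probe` are hypothetical helpers, and the wording of the returned messages is illustrative.

```python
import urllib.error
import urllib.request


def interpret_status(status):
    """Translate an HTTP status from the collector into a diagnosis."""
    if status == 405:
        return "reachable: receiver is up and expects POST"
    if 200 <= status < 500:
        return "reachable: unexpected status, check the path"
    return "reachable but failing: check collector logs"


def probe(url, timeout=3.0):
    """Send a GET to the collector and describe the outcome."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return interpret_status(resp.status)
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses still prove the network path works.
        return interpret_status(exc.code)
    except (urllib.error.URLError, OSError) as exc:
        return f"unreachable: {exc}"
```

Running `probe("http://signoz-otel-collector.signoz.svc.cluster.local:4318/v1/traces")` inside the pod distinguishes a network problem ("unreachable") from a collector that is up but rejecting the request.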

Traces Not Appearing

Check instrumentation:

# Verify tracing is enabled
import os
print(os.getenv("ENABLE_TRACING"))  # Should be "true"
print(os.getenv("OTEL_COLLECTOR_ENDPOINT"))  # Should be set

Check trace sampling:

# Verify sampling rate (default 100%)
kubectl logs -n bakery-ia deployment/auth-service | grep "sampling"
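
If the sampling rate is below 100%, some traces are dropped by design. How the services configure sampling is not shown in this guide. The sketch below illustrates the general trace-ID-ratio technique, similar in spirit to the OpenTelemetry SDK's `TraceIdRatioBased` sampler: the decision depends only on the trace ID, so every service makes the same choice for a given trace. The function name is hypothetical.

```python
TRACE_ID_LIMIT = (1 << 64) - 1  # mask for the low 64 bits of a trace ID


def should_sample(trace_id, ratio):
    """Decide deterministically whether a trace is kept at the given ratio."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    bound = round(ratio * (TRACE_ID_LIMIT + 1))
    # Keep the trace when its low 64 bits fall below the bound.
    return (trace_id & TRACE_ID_LIMIT) < bound
```

At a ratio of 1.0 every trace is kept and at 0.0 none are, so a trace that is missing at 100% sampling points to an export or instrumentation problem instead.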

Metrics Not Appearing

1. Verify Prometheus annotations:

kubectl get pods -n bakery-ia -o yaml | grep "prometheus.io"

2. Test metrics endpoint:

# Port-forward service
kubectl port-forward -n bakery-ia deployment/auth-service 8000:8000

# Test endpoint
curl http://localhost:8000/metrics

# Should return Prometheus format metrics

3. Check SigNoz scrape configuration:

# Check collector config
kubectl get configmap -n signoz signoz-otel-collector -o yaml | grep -A 30 "prometheus:"

Logs Not Appearing

1. Verify log export is enabled:

kubectl get deployment auth-service -n bakery-ia -o yaml | grep -A 1 OTEL_LOGS_EXPORTER
# Should show the variable name followed by its value:
#   - name: OTEL_LOGS_EXPORTER
#     value: otlp

2. Check log format:

# Logs should be JSON formatted
kubectl logs -n bakery-ia deployment/auth-service | head -5

3. Verify OTLP endpoint:

# Test logs endpoint
kubectl exec -n bakery-ia deployment/auth-service -- \
  curl -X POST http://signoz-otel-collector.signoz.svc.cluster.local:4318/v1/logs \
  -H "Content-Type: application/json" \
  -d '{"resourceLogs":[]}'

# Should return 200 OK or 400 Bad Request (not connection error)
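
An empty `resourceLogs` array only proves the endpoint is reachable. To check that a record actually lands in SigNoz, the request body needs a resource, a scope, and at least one log record. The sketch below builds such a body in the OTLP/JSON shape. `build_otlp_log_payload` is a hypothetical helper and the scope name `manual-test` is an arbitrary choice.

```python
import json
import time


def build_otlp_log_payload(service_name, message, severity="INFO"):
    """Build a minimal OTLP/JSON body containing a single log record."""
    return {
        "resourceLogs": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service_name}},
            ]},
            "scopeLogs": [{
                "scope": {"name": "manual-test"},
                "logRecords": [{
                    # OTLP/JSON encodes 64-bit integers as strings.
                    "timeUnixNano": str(time.time_ns()),
                    "severityText": severity,
                    "body": {"stringValue": message},
                }],
            }],
        }]
    }


print(json.dumps(build_otlp_log_payload("auth-service", "manual connectivity test")))
```

The printed JSON can be passed to the curl command above in place of `{"resourceLogs":[]}`. The record should then be searchable in the Logs view by its message text.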

Performance Tuning

For Development

The default configuration is optimized for local development with minimal resources.

For Production

Update the following in signoz-values-prod.yaml:

otelCollector:
  # Add more replicas
  replicaCount: 2

  # Increase collector resources
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: 2000m
      memory: 2Gi

  # Increase batch sizes
  config:
    processors:
      batch:
        timeout: 10s
        send_batch_size: 10000  # Increased from 1024
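
The two batch settings interact: a batch is sent when it reaches `send_batch_size` items or when `timeout` has elapsed, whichever comes first. The simplified model below illustrates that trade-off. `BatchBuffer` is a hypothetical class that starts its timer at the first buffered item, which may differ from the collector's actual timer handling.

```python
class BatchBuffer:
    """Simplified model of size-or-timeout batching (illustration only)."""

    def __init__(self, send_batch_size, timeout_s):
        self.send_batch_size = send_batch_size
        self.timeout_s = timeout_s
        self._items = []
        self._first_at = None

    def _flush(self):
        batch, self._items, self._first_at = self._items, [], None
        return batch

    def add(self, item, now):
        """Buffer an item; return a full batch if the size limit is hit."""
        if not self._items:
            self._first_at = now
        self._items.append(item)
        if len(self._items) >= self.send_batch_size:
            return self._flush()
        return None

    def tick(self, now):
        """Return a partial batch if the timeout has elapsed, else None."""
        if self._items and now - self._first_at >= self.timeout_s:
            return self._flush()
        return None
```

Larger batches mean fewer writes to ClickHouse at the cost of more collector memory and up to `timeout` of extra delay before data becomes visible.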

Best Practices

  1. Use Structured Logging: Always use key-value pairs for better querying
  2. Add Context: Include user_id, tenant_id, request_id in logs
  3. Trace Business Operations: Add custom spans for important operations
  4. Monitor Collector Health: Set up alerts for collector errors
  5. Retention Policy: Configure ClickHouse retention based on needs

Support

For issues or questions:

  1. Ask in the SigNoz community Slack: https://signoz.io/slack
  2. Review the OpenTelemetry docs: https://opentelemetry.io/docs/
  3. Create an issue in the project repository