# SigNoz Monitoring Setup Guide

This guide explains how to set up complete observability for the Bakery IA platform using SigNoz, which provides a unified view of metrics, logs, and traces.

## Table of Contents

1. [Architecture Overview](#architecture-overview)
2. [Prerequisites](#prerequisites)
3. [SigNoz Deployment](#signoz-deployment)
4. [Service Configuration](#service-configuration)
5. [Data Flow](#data-flow)
6. [Verification](#verification)
7. [Troubleshooting](#troubleshooting)
8. [Performance Tuning](#performance-tuning)
9. [Best Practices](#best-practices)
10. [Additional Resources](#additional-resources)
11. [Support](#support)

## Architecture Overview

The monitoring setup is a four-stage pipeline: services emit telemetry, the collector processes it, ClickHouse stores it, and the SigNoz UI queries it.

```
┌──────────────────────────────────────────────────────────────┐
│                      Bakery IA Services                      │
│    ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐    │
│    │   Auth   │  │ Inventory│  │  Orders  │  │   ...    │    │
│    └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘    │
│         │             │             │             │          │
│         └─────────────┴──┬──────────┴─────────────┘          │
│                          │                                   │
│                OpenTelemetry Protocol (OTLP)                 │
│                   Traces / Metrics / Logs                    │
└──────────────────────────┼───────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────────────┐
│                SigNoz OpenTelemetry Collector                │
│  ┌────────────────────────────────────────────────────────┐  │
│  │ Receivers:                                             │  │
│  │   - OTLP gRPC (4317)    - OTLP HTTP (4318)             │  │
│  │   - Prometheus Scraper (service discovery)             │  │
│  └───────────────────────┬────────────────────────────────┘  │
│                          │                                   │
│  ┌───────────────────────┴────────────────────────────────┐  │
│  │ Processors: batch, memory_limiter, resourcedetection   │  │
│  └───────────────────────┬────────────────────────────────┘  │
│                          │                                   │
│  ┌───────────────────────┴────────────────────────────────┐  │
│  │ Exporters: ClickHouse (traces, metrics, logs)          │  │
│  └────────────────────────────────────────────────────────┘  │
└──────────────────────────┼───────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────────────┐
│                     ClickHouse Database                      │
│           ┌──────────┐  ┌──────────┐  ┌──────────┐           │
│           │  Traces  │  │ Metrics  │  │   Logs   │           │
│           └──────────┘  └──────────┘  └──────────┘           │
└──────────────────────────┼───────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────────────┐
│                     SigNoz Query Service                     │
│                        & Frontend UI                         │
│              https://monitoring.bakery-ia.local              │
└──────────────────────────────────────────────────────────────┘
```

### Key Components

1. **Services**: Generate telemetry data using the OpenTelemetry SDK
2. **OpenTelemetry Collector**: Receives, processes, and exports telemetry
3. **ClickHouse**: Stores traces, metrics, and logs
4. **SigNoz UI**: Queries and visualizes all telemetry data

## Prerequisites

- Kubernetes cluster (Kind, Minikube, or production cluster)
- Helm 3.x installed
- kubectl configured
- At least 4 GB RAM available for SigNoz components

## SigNoz Deployment

### 1. Add SigNoz Helm Repository

```bash
helm repo add signoz https://charts.signoz.io
helm repo update
```

### 2. Create Namespace

```bash
kubectl create namespace signoz
```

### 3. Deploy SigNoz

```bash
# For development environment
helm install signoz signoz/signoz \
  -n signoz \
  -f infrastructure/helm/signoz-values-dev.yaml

# For production environment
helm install signoz signoz/signoz \
  -n signoz \
  -f infrastructure/helm/signoz-values-prod.yaml
```

### 4. Verify Deployment

```bash
# Check all pods are running
kubectl get pods -n signoz

# Expected output:
# signoz-alertmanager-0
# signoz-clickhouse-0
# signoz-frontend-*
# signoz-otel-collector-*
# signoz-query-service-*

# Check services
kubectl get svc -n signoz
```

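The pod check in step 4 can be scripted rather than read by eye. The sketch below is one way to do it, assuming you collect pod names yourself (for example from `kubectl get pods -n signoz -o name`, with the `pod/` prefix stripped). The helper `missing_components` is hypothetical, and the prefixes are taken from the expected output listed in step 4.

```python
# Component name prefixes copied from the expected output in step 4.
EXPECTED_PREFIXES = (
    "signoz-alertmanager",
    "signoz-clickhouse",
    "signoz-frontend",
    "signoz-otel-collector",
    "signoz-query-service",
)


def missing_components(pod_names):
    """Return the expected prefixes that no pod name starts with."""
    return [
        prefix
        for prefix in EXPECTED_PREFIXES
        if not any(name.startswith(prefix) for name in pod_names)
    ]
```

An empty list means every expected component has at least one pod. It does not say whether those pods are `Running`, so it complements the `kubectl get pods` check instead of replacing it.
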
## Service Configuration

Each microservice needs to be configured to send telemetry to SigNoz.

### Environment Variables

Add these environment variables to your service deployments:

```yaml
env:
  # OpenTelemetry Collector endpoint
  - name: OTEL_COLLECTOR_ENDPOINT
    value: "http://signoz-otel-collector.signoz.svc.cluster.local:4318"
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://signoz-otel-collector.signoz.svc.cluster.local:4318"

  # Service identification
  - name: OTEL_SERVICE_NAME
    value: "your-service-name"  # e.g., "auth-service"

  # Enable tracing
  - name: ENABLE_TRACING
    value: "true"

  # Enable logs export
  - name: OTEL_LOGS_EXPORTER
    value: "otlp"
  - name: OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED
    value: "true"

  # Enable metrics export (optional, default: true)
  - name: ENABLE_OTEL_METRICS
    value: "true"
```

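To make the role of each variable concrete, here is a minimal sketch of how a service could read and validate them at startup. `TelemetryConfig` and `load_telemetry_config` are hypothetical names for illustration, not part of `shared.monitoring`. The fallbacks for `ENABLE_TRACING` (off), `OTEL_SERVICE_NAME`, and `OTEL_LOGS_EXPORTER` are assumptions; only the `ENABLE_OTEL_METRICS` default comes from the listing above.

```python
import os
from dataclasses import dataclass
from urllib.parse import urlparse


def _as_bool(value, default):
    """Interpret common truthy strings; fall back to default when unset."""
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class TelemetryConfig:  # hypothetical container, for illustration only
    endpoint: str
    service_name: str
    tracing_enabled: bool
    metrics_enabled: bool
    logs_exporter: str


def load_telemetry_config(env=None):
    """Build a TelemetryConfig from the variables listed above."""
    env = os.environ if env is None else env
    endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT") or env.get(
        "OTEL_COLLECTOR_ENDPOINT", ""
    )
    parsed = urlparse(endpoint)
    if parsed.scheme not in {"http", "https"} or not parsed.hostname:
        raise ValueError(f"invalid collector endpoint: {endpoint!r}")
    return TelemetryConfig(
        endpoint=endpoint.rstrip("/"),
        service_name=env.get("OTEL_SERVICE_NAME", "unknown-service"),  # assumed fallback
        tracing_enabled=_as_bool(env.get("ENABLE_TRACING"), default=False),  # assumed default
        metrics_enabled=_as_bool(env.get("ENABLE_OTEL_METRICS"), default=True),
        logs_exporter=env.get("OTEL_LOGS_EXPORTER", "none"),  # assumed fallback
    )
```

Failing fast on a malformed endpoint surfaces configuration mistakes at startup instead of as silently missing telemetry later.
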
### Prometheus Annotations

Add these annotations to enable Prometheus metrics scraping:

```yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8000"
    prometheus.io/path: "/metrics"
```

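These annotations are a convention that a scraper reads to decide whether and where to poll a pod. The sketch below shows how the three values combine into a scrape URL. `scrape_url` is a hypothetical helper, and the fallback port and path are assumptions chosen to match the example above, not the collector's actual discovery logic.

```python
def scrape_url(pod_ip, annotations):
    """Return the URL a scraper would poll, or None if scraping is disabled."""
    if annotations.get("prometheus.io/scrape", "false").lower() != "true":
        return None
    port = annotations.get("prometheus.io/port", "8000")  # assumed fallback
    path = annotations.get("prometheus.io/path", "/metrics")  # assumed fallback
    if not path.startswith("/"):
        path = "/" + path
    return f"http://{pod_ip}:{port}{path}"
```

Note that the annotation values are strings, which is why the YAML quotes `"true"` and `"8000"`.
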
### Complete Example

See [infrastructure/kubernetes/base/components/auth/auth-service.yaml](../infrastructure/kubernetes/base/components/auth/auth-service.yaml) for a complete example.

### Automated Configuration Script

Use the provided script to add monitoring configuration to all services:

```bash
# Run from project root
./infrastructure/kubernetes/add-monitoring-config.sh
```

## Data Flow

### 1. Traces

**Automatic Instrumentation:**

```python
# In your service's main.py
from shared.service_base import StandardFastAPIService

service = AuthService()  # Extends StandardFastAPIService
app = service.create_app()

# Tracing is automatically enabled if ENABLE_TRACING=true
# All FastAPI endpoints, HTTP clients, Redis, PostgreSQL are auto-instrumented
```

**Manual Instrumentation:**

```python
from shared.monitoring.tracing import add_trace_attributes, add_trace_event

# Add custom attributes to current span
add_trace_attributes(
    user_id="123",
    tenant_id="abc",
    operation="user_registration"
)

# Add events for important operations
add_trace_event("user_authenticated", user_id="123", method="jwt")
```

### 2. Metrics

**Dual Export Strategy:**

Services export metrics in two ways:

1. **Prometheus format** at `/metrics` endpoint (scraped by SigNoz)
2. **OTLP push** directly to SigNoz collector (real-time)

**Built-in Metrics:**

```python
# Automatically collected by BaseFastAPIService:
# - http_requests_total
# - http_request_duration_seconds
# - active_connections
```

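For readers unfamiliar with the scrape side of the dual export, the sketch below renders a counter such as `http_requests_total` in the Prometheus text exposition format, which is what a `/metrics` endpoint returns. `render_counter` is a hypothetical helper written for illustration. It does not escape special characters in label values, and it is not how the shared library produces its output.

```python
def render_counter(name, help_text, samples):
    """Render one counter family in the Prometheus text exposition format.

    `samples` maps a tuple of (label, value) pairs to the counter value.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        label_str = ",".join(f'{key}="{val}"' for key, val in labels)
        suffix = f"{{{label_str}}}" if label_str else ""
        lines.append(f"{name}{suffix} {value}")
    return "\n".join(lines) + "\n"
```

Each distinct label combination becomes its own line, so labels with many possible values multiply the number of series the collector has to scrape and store.
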
**Custom Metrics:**

```python
# Define in your service
custom_metrics = {
    "user_registrations": {
        "type": "counter",
        "description": "Total user registrations",
        "labels": ["status"]
    },
    "login_duration_seconds": {
        "type": "histogram",
        "description": "Login request duration"
    }
}

service = AuthService(custom_metrics=custom_metrics)

# Use in your code
service.metrics_collector.increment_counter(
    "user_registrations",
    labels={"status": "success"}
)
```

### 3. Logs

**Automatic Export:**

```python
# Logs are automatically exported if OTEL_LOGS_EXPORTER=otlp
import logging
logger = logging.getLogger(__name__)

# This will appear in SigNoz
logger.info("User logged in", extra={"user_id": "123", "tenant_id": "abc"})
```

**Structured Logging with Context:**

```python
from shared.monitoring.logs_exporter import add_log_context

# Add context that persists across log calls
log_ctx = add_log_context(
    request_id="req_123",
    user_id="user_456",
    tenant_id="tenant_789"
)

# All subsequent logs include this context
log_ctx.info("Processing order")  # Includes request_id, user_id, tenant_id
```

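The idea of context that persists across log calls can be illustrated with the standard library alone. The sketch below uses `logging.LoggerAdapter`; `ContextLogger` and `bind_context` are hypothetical stand-ins and make no claim about how `add_log_context` is actually implemented.

```python
import logging


class ContextLogger(logging.LoggerAdapter):
    """Merge persistent context into the `extra` of every log call."""

    def process(self, msg, kwargs):
        extra = dict(self.extra)  # persistent context bound at creation
        extra.update(kwargs.get("extra") or {})  # per-call fields win on conflict
        kwargs["extra"] = extra
        return msg, kwargs


def bind_context(logger, **context):
    """Hypothetical stand-in for add_log_context, for illustration only."""
    return ContextLogger(logger, context)
```

With this shape, `bind_context(logger, request_id="req_123").info("Processing order")` produces a record whose `request_id` attribute is set without repeating it at every call site.
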
**Trace Correlation:**

```python
from shared.monitoring.logs_exporter import get_current_trace_context

# Get trace context for correlation
trace_ctx = get_current_trace_context()
logger.info("Processing request", extra=trace_ctx)
# Logs now include trace_id and span_id for correlation
```

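Passing `extra=trace_ctx` on every call is easy to forget. One alternative is a logging filter that attaches the IDs to every record. The sketch below is a stdlib-only illustration: `TraceContextFilter` and its `get_ids` callable are hypothetical, and in a real service the IDs would come from the active span rather than from a stub. The fixed-width lowercase hex rendering (32 characters for a trace ID, 16 for a span ID) follows the W3C trace context convention.

```python
import logging


def format_ids(trace_id, span_id):
    """Render integer IDs as fixed-width lowercase hex strings."""
    return {
        "trace_id": format(trace_id, "032x"),
        "span_id": format(span_id, "016x"),
    }


class TraceContextFilter(logging.Filter):
    """Attach trace_id and span_id to each record.

    `get_ids` is a hypothetical callable returning (trace_id, span_id)
    as integers, or None when no span is active.
    """

    def __init__(self, get_ids):
        super().__init__()
        self._get_ids = get_ids

    def filter(self, record):
        ids = self._get_ids()
        if ids is not None:
            for key, value in format_ids(*ids).items():
                setattr(record, key, value)
        return True  # never drop the record, only enrich it
```

Attach it once with `logger.addFilter(TraceContextFilter(get_ids))` and every record passing through that logger carries the correlation fields.
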
## Verification

### 1. Check Service Health

```bash
# Check that services are exporting telemetry
kubectl logs -n bakery-ia deployment/auth-service | grep -i "telemetry\|otel\|signoz"

# Expected output includes:
# - "Distributed tracing configured"
# - "OpenTelemetry logs export configured"
# - "OpenTelemetry metrics export configured"
```

### 2. Access SigNoz UI

```bash
# Port-forward (for local development)
kubectl port-forward -n signoz svc/signoz-frontend 3301:3301

# Or via Ingress
open https://monitoring.bakery-ia.local
```

### 3. Verify Data Ingestion

**Traces:**
1. Go to SigNoz UI → Traces
2. You should see traces from your services
3. Click on a trace to see the full span tree

**Metrics:**
1. Go to SigNoz UI → Metrics
2. Query: `http_requests_total`
3. Filter by service: `service="auth-service"`

**Logs:**
1. Go to SigNoz UI → Logs
2. Filter by service: `service_name="auth-service"`
3. Search for specific log messages

### 4. Test Trace-Log Correlation

1. Find a trace in SigNoz UI
2. Copy the `trace_id`
3. Go to Logs tab
4. Search: `trace_id="<your-trace-id>"`
5. You should see all logs for that trace

## Troubleshooting

### No Data in SigNoz

**1. Check OpenTelemetry Collector:**

```bash
# Check collector logs
kubectl logs -n signoz deployment/signoz-otel-collector

# Should see:
# - "Receiver is starting"
# - "Exporter is starting"
# - No error messages
```

**2. Check Service Configuration:**

```bash
# Verify environment variables
kubectl get deployment auth-service -n bakery-ia -o yaml | grep -A 20 "env:"

# Verify annotations
kubectl get deployment auth-service -n bakery-ia -o yaml | grep -A 5 "annotations:"
```

**3. Check Network Connectivity:**

```bash
# Test from service pod
kubectl exec -n bakery-ia deployment/auth-service -- \
  curl -v http://signoz-otel-collector.signoz.svc.cluster.local:4318/v1/traces

# Should return: 405 Method Not Allowed (POST required)
# If connection refused, check network policies
```

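The connectivity probe in step 3 has a counterintuitive success case: an HTTP 405 is the good outcome. The sketch below encodes that interpretation as a small lookup. `diagnose_probe` is a hypothetical helper, and the readings for codes other than 405 and for connection errors are reasonable guesses, not documented collector behavior.

```python
def diagnose_probe(status_code=None, error=None):
    """Map the outcome of a GET to /v1/traces onto a likely cause."""
    if error is not None:
        return "unreachable: check the Service name, port 4318, and network policies"
    if status_code == 405:
        return "ok: collector reachable (GET is rejected, POST is required)"
    if status_code == 404:
        return "reachable, but nothing is serving this path: check the OTLP HTTP receiver"
    if status_code is not None and 500 <= status_code < 600:
        return "collector reachable but failing: inspect the collector logs"
    return f"unexpected status {status_code}: inspect the collector logs"
```
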
### Traces Not Appearing

**Check instrumentation:**

```python
# Verify tracing is enabled
import os
print(os.getenv("ENABLE_TRACING"))  # Should be "true"
print(os.getenv("OTEL_COLLECTOR_ENDPOINT"))  # Should be set
```

**Check trace sampling:**

```bash
# Verify sampling rate (default 100%)
kubectl logs -n bakery-ia deployment/auth-service | grep "sampling"
```

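If the sampling rate is below 100%, some traces are dropped by design, which can look like missing data. The sketch below is a simplified model of trace-ID ratio sampling: the decision depends only on the trace ID, so every service that sees the same trace makes the same choice. `should_sample` is a hypothetical function for illustration and is not the SDK's sampler.

```python
TRACE_ID_LOW_BITS = (1 << 64) - 1  # mask selecting the low 64 bits of a trace ID


def should_sample(trace_id, ratio):
    """Keep a trace if the low 64 bits of its ID fall under the ratio bound."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    bound = round(ratio * (TRACE_ID_LOW_BITS + 1))
    return (trace_id & TRACE_ID_LOW_BITS) < bound
```

With `ratio=1.0` every trace is kept and with `ratio=0.0` none are. A specific request is therefore only guaranteed to appear in SigNoz when the rate is 100%.
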
### Metrics Not Appearing

**1. Verify Prometheus annotations:**

```bash
kubectl get pods -n bakery-ia -o yaml | grep "prometheus.io"
```

**2. Test metrics endpoint:**

```bash
# Port-forward service
kubectl port-forward -n bakery-ia deployment/auth-service 8000:8000

# Test endpoint
curl http://localhost:8000/metrics

# Should return Prometheus format metrics
```

**3. Check SigNoz scrape configuration:**

```bash
# Check collector config
kubectl get configmap -n signoz signoz-otel-collector -o yaml | grep -A 30 "prometheus:"
```

### Logs Not Appearing

**1. Verify log export is enabled:**

```bash
# -A 1 prints the line after the match, which holds the value
kubectl get deployment auth-service -n bakery-ia -o yaml | grep -A 1 OTEL_LOGS_EXPORTER
# Should show the variable name followed by: value: otlp
```

**2. Check log format:**

```bash
# Logs should be JSON formatted
kubectl logs -n bakery-ia deployment/auth-service | head -5
```

**3. Verify OTLP endpoint:**

```bash
# Test logs endpoint
kubectl exec -n bakery-ia deployment/auth-service -- \
  curl -X POST http://signoz-otel-collector.signoz.svc.cluster.local:4318/v1/logs \
  -H "Content-Type: application/json" \
  -d '{"resourceLogs":[]}'

# Should return 200 OK or 400 Bad Request (not connection error)
```

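The empty `resourceLogs` payload only proves the endpoint is reachable. To confirm that a record actually lands in SigNoz, you can post a payload with one log record. The sketch below builds such a body in the OTLP/JSON encoding. `build_log_payload` and the scope name `manual-smoke-test` are hypothetical, chosen here for illustration.

```python
import json
import time


def build_log_payload(service_name, message, severity="INFO", now_ns=None):
    """Return a one-record OTLP/JSON logs payload as a JSON string."""
    now_ns = time.time_ns() if now_ns is None else now_ns
    payload = {
        "resourceLogs": [
            {
                "resource": {
                    "attributes": [
                        {"key": "service.name", "value": {"stringValue": service_name}}
                    ]
                },
                "scopeLogs": [
                    {
                        "scope": {"name": "manual-smoke-test"},  # hypothetical scope name
                        "logRecords": [
                            {
                                # 64-bit integers are encoded as strings in OTLP/JSON
                                "timeUnixNano": str(now_ns),
                                "severityText": severity,
                                "body": {"stringValue": message},
                            }
                        ],
                    }
                ],
            }
        ]
    }
    return json.dumps(payload)
```

Pass the returned string as the `-d` argument of the `curl` command in step 3, then filter the Logs view by the service name you used.
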
## Performance Tuning

### For Development

The default configuration is optimized for local development with minimal resources.

### For Production

Update the following in `signoz-values-prod.yaml`:

```yaml
# Increase collector resources
otelCollector:
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: 2000m
      memory: 2Gi

  # Increase batch sizes
  config:
    processors:
      batch:
        timeout: 10s
        send_batch_size: 10000  # Increased from 1024

  # Add more replicas
  replicaCount: 2
```

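The two batch settings interact: a batch is sent when it reaches `send_batch_size` items or when `timeout` has elapsed, whichever comes first. The toy model below illustrates that trade-off. `Batcher` is a hypothetical class with time passed in explicitly so the behavior is easy to follow; it is not the collector's implementation.

```python
class Batcher:
    """Toy model of size-or-timeout batching."""

    def __init__(self, send_batch_size, timeout_s, export):
        self.send_batch_size = send_batch_size
        self.timeout_s = timeout_s
        self.export = export  # callable that receives a list of items
        self._items = []
        self._first_item_at = None

    def add(self, item, now):
        """Buffer an item and flush immediately once the batch is full."""
        if not self._items:
            self._first_item_at = now
        self._items.append(item)
        if len(self._items) >= self.send_batch_size:
            self._flush()

    def tick(self, now):
        """Flush a partial batch once the timeout has elapsed."""
        if self._items and now - self._first_item_at >= self.timeout_s:
            self._flush()

    def _flush(self):
        batch, self._items = self._items, []
        self._first_item_at = None
        self.export(batch)
```

Larger batches mean fewer writes to ClickHouse at the cost of more memory held in the collector and more delay before data is visible, which is why the batch size is raised together with the memory limits.
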
## Best Practices

1. **Use Structured Logging**: Always use key-value pairs for better querying
2. **Add Context**: Include `user_id`, `tenant_id`, and `request_id` in logs
3. **Trace Business Operations**: Add custom spans for important operations
4. **Monitor Collector Health**: Set up alerts for collector errors
5. **Retention Policy**: Configure ClickHouse retention based on your needs

## Additional Resources

- [SigNoz Documentation](https://signoz.io/docs/)
- [OpenTelemetry Python](https://opentelemetry.io/docs/instrumentation/python/)
- [Bakery IA Monitoring Shared Library](../shared/monitoring/)

## Support

For issues or questions:

1. Check the SigNoz community: https://signoz.io/slack
2. Review the OpenTelemetry docs: https://opentelemetry.io/docs/
3. Create an issue in the project repository