Update monitoring packages to latest versions
- Updated all OpenTelemetry packages to latest versions:
  - opentelemetry-api: 1.27.0 → 1.39.1
  - opentelemetry-sdk: 1.27.0 → 1.39.1
  - opentelemetry-exporter-otlp-proto-grpc: 1.27.0 → 1.39.1
  - opentelemetry-exporter-otlp-proto-http: 1.27.0 → 1.39.1
  - opentelemetry-instrumentation-fastapi: 0.48b0 → 0.60b1
  - opentelemetry-instrumentation-httpx: 0.48b0 → 0.60b1
  - opentelemetry-instrumentation-redis: 0.48b0 → 0.60b1
  - opentelemetry-instrumentation-sqlalchemy: 0.48b0 → 0.60b1
- Removed prometheus-client==0.23.1 from all services
- Unified all services to use the same monitoring package versions

Generated by Mistral Vibe.
Co-Authored-By: Mistral Vibe <vibe@mistral.ai>
511
docs/MONITORING_SETUP.md
Normal file
@@ -0,0 +1,511 @@
# SigNoz Monitoring Setup Guide

This guide explains how to set up complete observability for the Bakery IA platform using SigNoz, which provides unified metrics, logs, and traces visualization.

## Table of Contents

1. [Architecture Overview](#architecture-overview)
2. [Prerequisites](#prerequisites)
3. [SigNoz Deployment](#signoz-deployment)
4. [Service Configuration](#service-configuration)
5. [Data Flow](#data-flow)
6. [Verification](#verification)
7. [Troubleshooting](#troubleshooting)
8. [Performance Tuning](#performance-tuning)
9. [Best Practices](#best-practices)
10. [Additional Resources](#additional-resources)
11. [Support](#support)

## Architecture Overview

The monitoring setup is a four-stage pipeline: services emit telemetry, the collector processes it, ClickHouse stores it, and the SigNoz UI queries it:

```
┌──────────────────────────────────────────────────────────────┐
│                      Bakery IA Services                      │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐      │
│  │   Auth   │  │ Inventory│  │  Orders  │  │   ...    │      │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘      │
│       │             │             │             │            │
│       └─────────────┴─────────────┴─────────────┘            │
│                          │                                   │
│                OpenTelemetry Protocol (OTLP)                 │
│                   Traces / Metrics / Logs                    │
└──────────────────────────┼───────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────────────┐
│                SigNoz OpenTelemetry Collector                │
│  ┌────────────────────────────────────────────────────────┐  │
│  │  Receivers:                                            │  │
│  │  - OTLP gRPC (4317)   - OTLP HTTP (4318)               │  │
│  │  - Prometheus Scraper (service discovery)              │  │
│  └────────────────────┬───────────────────────────────────┘  │
│                       │                                      │
│  ┌────────────────────┴───────────────────────────────────┐  │
│  │  Processors: batch, memory_limiter, resourcedetection  │  │
│  └────────────────────┬───────────────────────────────────┘  │
│                       │                                      │
│  ┌────────────────────┴───────────────────────────────────┐  │
│  │  Exporters: ClickHouse (traces, metrics, logs)         │  │
│  └────────────────────────────────────────────────────────┘  │
└──────────────────────────┼───────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────────────┐
│                     ClickHouse Database                      │
│           ┌──────────┐  ┌──────────┐  ┌──────────┐           │
│           │  Traces  │  │ Metrics  │  │   Logs   │           │
│           └──────────┘  └──────────┘  └──────────┘           │
└──────────────────────────┼───────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────────────┐
│                     SigNoz Query Service                     │
│                        & Frontend UI                         │
│              https://monitoring.bakery-ia.local              │
└──────────────────────────────────────────────────────────────┘
```

### Key Components

1. **Services**: Generate telemetry data using the OpenTelemetry SDK
2. **OpenTelemetry Collector**: Receives, processes, and exports telemetry
3. **ClickHouse**: Stores traces, metrics, and logs
4. **SigNoz UI**: Queries and visualizes all telemetry data

## Prerequisites

- Kubernetes cluster (Kind, Minikube, or production cluster)
- Helm 3.x installed
- kubectl configured
- At least 4GB RAM available for SigNoz components

## SigNoz Deployment

### 1. Add SigNoz Helm Repository

```bash
helm repo add signoz https://charts.signoz.io
helm repo update
```

### 2. Create Namespace

```bash
kubectl create namespace signoz
```

### 3. Deploy SigNoz

```bash
# For development environment
helm install signoz signoz/signoz \
  -n signoz \
  -f infrastructure/helm/signoz-values-dev.yaml

# For production environment
helm install signoz signoz/signoz \
  -n signoz \
  -f infrastructure/helm/signoz-values-prod.yaml
```

### 4. Verify Deployment

```bash
# Check all pods are running
kubectl get pods -n signoz

# Expected output:
# signoz-alertmanager-0
# signoz-clickhouse-0
# signoz-frontend-*
# signoz-otel-collector-*
# signoz-query-service-*

# Check services
kubectl get svc -n signoz
```
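The pod check above can also be scripted. The following is a minimal sketch, not part of the repository: the function names `pods_not_ready` and `fetch_pods` are hypothetical, and it assumes `kubectl` is on the PATH with access to the cluster. It relies only on the standard Pod fields `status.phase` and `status.containerStatuses[].ready`.

```python
import json
import subprocess


def pods_not_ready(pod_list: dict) -> list[str]:
    """Return names of pods that are not Running with every container ready."""
    failing = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        phase = status.get("phase")
        if phase == "Succeeded":
            # Completed Job pods (e.g. one-off migrations) are not failures.
            continue
        containers = status.get("containerStatuses", [])
        all_ready = bool(containers) and all(c.get("ready") for c in containers)
        if phase != "Running" or not all_ready:
            failing.append(pod["metadata"]["name"])
    return failing


def fetch_pods(namespace: str = "signoz") -> dict:
    """Run `kubectl get pods -o json` and parse the output (needs cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)
```

Calling `pods_not_ready(fetch_pods())` returns an empty list once every SigNoz pod is up, which makes it usable as a wait condition in a setup script.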

## Service Configuration

Each microservice needs to be configured to send telemetry to SigNoz.

### Environment Variables

Add these environment variables to your service deployments:

```yaml
env:
  # OpenTelemetry Collector endpoint
  - name: OTEL_COLLECTOR_ENDPOINT
    value: "http://signoz-otel-collector.signoz.svc.cluster.local:4318"
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://signoz-otel-collector.signoz.svc.cluster.local:4318"

  # Service identification
  - name: OTEL_SERVICE_NAME
    value: "your-service-name"  # e.g., "auth-service"

  # Enable tracing
  - name: ENABLE_TRACING
    value: "true"

  # Enable logs export
  - name: OTEL_LOGS_EXPORTER
    value: "otlp"
  - name: OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED
    value: "true"

  # Enable metrics export (optional, default: true)
  - name: ENABLE_OTEL_METRICS
    value: "true"
```
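To illustrate how a service might consume these variables at startup, here is a hedged sketch. `TelemetryConfig` and `load_telemetry_config` are hypothetical names, not the shared library's API, and the fallback values (other than metrics defaulting to enabled, which the YAML comment states) are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class TelemetryConfig:
    service_name: str
    endpoint: str
    tracing_enabled: bool
    metrics_enabled: bool
    logs_exporter: str


def _as_bool(value: str) -> bool:
    return value.strip().lower() in {"1", "true", "yes"}


def load_telemetry_config(env: Mapping[str, str] = os.environ) -> TelemetryConfig:
    """Read the variables from the YAML above; fail fast if no endpoint is set."""
    endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT") or env.get("OTEL_COLLECTOR_ENDPOINT", "")
    if not endpoint:
        raise ValueError("no collector endpoint configured")
    return TelemetryConfig(
        service_name=env.get("OTEL_SERVICE_NAME", "unknown-service"),  # assumed fallback
        endpoint=endpoint.rstrip("/"),
        tracing_enabled=_as_bool(env.get("ENABLE_TRACING", "false")),  # assumed default
        metrics_enabled=_as_bool(env.get("ENABLE_OTEL_METRICS", "true")),
        logs_exporter=env.get("OTEL_LOGS_EXPORTER", "none"),  # assumed fallback
    )
```

Failing at startup when the endpoint is missing surfaces a misconfigured deployment immediately, rather than as silently absent telemetry.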

### Prometheus Annotations

Add these annotations to enable Prometheus metrics scraping:

```yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8000"
    prometheus.io/path: "/metrics"
```

### Complete Example

See [infrastructure/kubernetes/base/components/auth/auth-service.yaml](../infrastructure/kubernetes/base/components/auth/auth-service.yaml) for a complete example.

### Automated Configuration Script

Use the provided script to add monitoring configuration to all services:

```bash
# Run from project root
./infrastructure/kubernetes/add-monitoring-config.sh
```

## Data Flow

### 1. Traces

**Automatic Instrumentation:**

```python
# In your service's main.py
from shared.service_base import StandardFastAPIService

# AuthService is your own service class, defined elsewhere as a
# subclass of StandardFastAPIService.
service = AuthService()
app = service.create_app()

# Tracing is automatically enabled if ENABLE_TRACING=true
# All FastAPI endpoints, HTTP clients, Redis, PostgreSQL are auto-instrumented
```

**Manual Instrumentation:**

```python
from shared.monitoring.tracing import add_trace_attributes, add_trace_event

# Add custom attributes to current span
add_trace_attributes(
    user_id="123",
    tenant_id="abc",
    operation="user_registration"
)

# Add events for important operations
add_trace_event("user_authenticated", user_id="123", method="jwt")
```

### 2. Metrics

**Dual Export Strategy:**

Services export metrics in two ways:
1. **Prometheus format** at `/metrics` endpoint (scraped by SigNoz)
2. **OTLP push** directly to SigNoz collector (real-time)

**Built-in Metrics:**

```python
# Automatically collected by BaseFastAPIService:
# - http_requests_total
# - http_request_duration_seconds
# - active_connections
```

**Custom Metrics:**

```python
# Define in your service
custom_metrics = {
    "user_registrations": {
        "type": "counter",
        "description": "Total user registrations",
        "labels": ["status"]
    },
    "login_duration_seconds": {
        "type": "histogram",
        "description": "Login request duration"
    }
}

service = AuthService(custom_metrics=custom_metrics)

# Use in your code
service.metrics_collector.increment_counter(
    "user_registrations",
    labels={"status": "success"}
)
```
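To make the Prometheus side of the dual export concrete, the sketch below shows how counter definitions like the ones above could end up as text on a `/metrics` endpoint. `CounterRegistry` is a hypothetical stand-in written for illustration; it is not the shared library's `metrics_collector`, it handles counters only, and it skips label-value escaping.

```python
from collections import defaultdict
from typing import Optional


class CounterRegistry:
    """Tiny in-memory counter store that renders Prometheus text format."""

    def __init__(self, definitions: dict):
        # Keep only counter definitions; histograms are out of scope here.
        self._defs = {n: d for n, d in definitions.items() if d["type"] == "counter"}
        self._values = defaultdict(float)

    def increment_counter(self, name: str, labels: Optional[dict] = None,
                          amount: float = 1.0) -> None:
        if name not in self._defs:
            raise KeyError(f"unknown counter: {name}")
        key = (name, tuple(sorted((labels or {}).items())))
        self._values[key] += amount

    def render(self) -> str:
        lines = []
        for name, spec in self._defs.items():
            lines.append(f"# HELP {name} {spec['description']}")
            lines.append(f"# TYPE {name} counter")
            for (metric, labels), value in sorted(self._values.items()):
                if metric != name:
                    continue
                label_text = ",".join(f'{k}="{v}"' for k, v in labels)
                suffix = f"{{{label_text}}}" if label_text else ""
                lines.append(f"{name}{suffix} {value:g}")
        return "\n".join(lines) + "\n"
```

Each distinct label combination becomes its own sample line, which is why the `labels` list in a metric definition should stay low-cardinality.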

### 3. Logs

**Automatic Export:**

```python
# Logs are automatically exported if OTEL_LOGS_EXPORTER=otlp
import logging
logger = logging.getLogger(__name__)

# This will appear in SigNoz
logger.info("User logged in", extra={"user_id": "123", "tenant_id": "abc"})
```

**Structured Logging with Context:**

```python
from shared.monitoring.logs_exporter import add_log_context

# Add context that persists across log calls
log_ctx = add_log_context(
    request_id="req_123",
    user_id="user_456",
    tenant_id="tenant_789"
)

# All subsequent logs include this context
log_ctx.info("Processing order")  # Includes request_id, user_id, tenant_id
```

**Trace Correlation:**

```python
import logging

from shared.monitoring.logs_exporter import get_current_trace_context

logger = logging.getLogger(__name__)

# Get trace context for correlation
trace_ctx = get_current_trace_context()
logger.info("Processing request", extra=trace_ctx)
# Logs now include trace_id and span_id for correlation
```
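For correlation to work, the IDs attached to a log record must match the form in which the trace is stored: OpenTelemetry trace IDs are 128-bit values rendered as 32 lowercase hex digits, and span IDs are 64-bit values rendered as 16. The sketch below shows that formatting with a standard `logging.Filter`. Both `format_trace_context` and `TraceContextFilter` are hypothetical names, and the ID source is injected as a plain callable so the example runs without the OpenTelemetry SDK.

```python
import logging
from typing import Callable, Tuple


def format_trace_context(trace_id: int, span_id: int) -> dict:
    """Render integer IDs as fixed-width lowercase hex strings."""
    return {"trace_id": f"{trace_id:032x}", "span_id": f"{span_id:016x}"}


class TraceContextFilter(logging.Filter):
    """Attach trace_id and span_id to every record that passes through."""

    def __init__(self, get_ids: Callable[[], Tuple[int, int]]):
        super().__init__()
        self._get_ids = get_ids  # returns (trace_id, span_id) as integers

    def filter(self, record: logging.LogRecord) -> bool:
        trace_id, span_id = self._get_ids()
        for key, value in format_trace_context(trace_id, span_id).items():
            setattr(record, key, value)
        return True  # never drop the record, only enrich it
```

Installing such a filter once with `logger.addFilter(...)` removes the need to pass `extra=` on every call.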

## Verification

### 1. Check Service Health

```bash
# Check that services are exporting telemetry
kubectl logs -n bakery-ia deployment/auth-service | grep -i "telemetry\|otel\|signoz"

# Expected output includes:
# - "Distributed tracing configured"
# - "OpenTelemetry logs export configured"
# - "OpenTelemetry metrics export configured"
```

### 2. Access SigNoz UI

```bash
# Port-forward (for local development), then browse to http://localhost:3301
kubectl port-forward -n signoz svc/signoz-frontend 3301:3301

# Or via Ingress
open https://monitoring.bakery-ia.local
```

### 3. Verify Data Ingestion

**Traces:**
1. Go to SigNoz UI → Traces
2. You should see traces from your services
3. Click on a trace to see the full span tree

**Metrics:**
1. Go to SigNoz UI → Metrics
2. Query: `http_requests_total`
3. Filter by service: `service="auth-service"`

**Logs:**
1. Go to SigNoz UI → Logs
2. Filter by service: `service_name="auth-service"`
3. Search for specific log messages

### 4. Test Trace-Log Correlation

1. Find a trace in SigNoz UI
2. Copy the `trace_id`
3. Go to Logs tab
4. Search: `trace_id="<your-trace-id>"`
5. You should see all logs for that trace

## Troubleshooting

### No Data in SigNoz

**1. Check OpenTelemetry Collector:**

```bash
# Check collector logs
kubectl logs -n signoz deployment/signoz-otel-collector

# Should see:
# - "Receiver is starting"
# - "Exporter is starting"
# - No error messages
```

**2. Check Service Configuration:**

```bash
# Verify environment variables
kubectl get deployment auth-service -n bakery-ia -o yaml | grep -A 20 "env:"

# Verify annotations
kubectl get deployment auth-service -n bakery-ia -o yaml | grep -A 5 "annotations:"
```

**3. Check Network Connectivity:**

```bash
# Test from service pod
kubectl exec -n bakery-ia deployment/auth-service -- \
  curl -v http://signoz-otel-collector.signoz.svc.cluster.local:4318/v1/traces

# Should return: 405 Method Not Allowed (POST required)
# If connection refused, check network policies
```

### Traces Not Appearing

**Check instrumentation:**

```python
# Verify tracing is enabled
import os
print(os.getenv("ENABLE_TRACING"))  # Should be "true"
print(os.getenv("OTEL_COLLECTOR_ENDPOINT"))  # Should be set
```

**Check trace sampling:**

```bash
# Verify sampling rate (default 100%)
kubectl logs -n bakery-ia deployment/auth-service | grep "sampling"
```

### Metrics Not Appearing

**1. Verify Prometheus annotations:**

```bash
kubectl get pods -n bakery-ia -o yaml | grep "prometheus.io"
```

**2. Test metrics endpoint:**

```bash
# Port-forward service
kubectl port-forward -n bakery-ia deployment/auth-service 8000:8000

# Test endpoint
curl http://localhost:8000/metrics

# Should return Prometheus format metrics
```
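Rather than reading the `curl` output by eye, the response can be checked for the built-in metric names. The helper below is a small sketch with a hypothetical name, `metric_names`. It extracts sample names from Prometheus text-format output and does not validate the format beyond that.

```python
def metric_names(exposition_text: str) -> set[str]:
    """Collect sample names from Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        # A sample name ends at the first '{' (labels) or space (value).
        cut_points = [i for i in (line.find("{"), line.find(" ")) if i != -1]
        end = min(cut_points, default=len(line))
        names.add(line[:end])
    return names
```

With the port-forward running, `metric_names(urllib.request.urlopen("http://localhost:8000/metrics").read().decode())` should contain `http_requests_total` if the built-in metrics are being exposed.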

**3. Check SigNoz scrape configuration:**

```bash
# Check collector config
kubectl get configmap -n signoz signoz-otel-collector -o yaml | grep -A 30 "prometheus:"
```

### Logs Not Appearing

**1. Verify log export is enabled:**

```bash
# -A 1 also prints the line after the match, which holds the value
kubectl get deployment auth-service -n bakery-ia -o yaml | grep -A 1 OTEL_LOGS_EXPORTER
# Should show the OTEL_LOGS_EXPORTER entry followed by: value: otlp
```

**2. Check log format:**

```bash
# Logs should be JSON formatted
kubectl logs -n bakery-ia deployment/auth-service | head -5
```

**3. Verify OTLP endpoint:**

```bash
# Test logs endpoint
kubectl exec -n bakery-ia deployment/auth-service -- \
  curl -X POST http://signoz-otel-collector.signoz.svc.cluster.local:4318/v1/logs \
  -H "Content-Type: application/json" \
  -d '{"resourceLogs":[]}'

# Should return 200 OK or 400 Bad Request (not connection error)
```
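An empty `resourceLogs` array only proves the endpoint is reachable. To push a record that should actually show up in the Logs tab, the body needs one resource, one scope, and one log record. The sketch below builds such a body in the OTLP/HTTP JSON encoding. `build_otlp_log_payload` and the scope name `manual-smoke-test` are made up for illustration.

```python
import json
import time


def build_otlp_log_payload(service_name: str, message: str,
                           severity: str = "INFO") -> dict:
    """Build a one-record OTLP/HTTP JSON logs body for a smoke test."""
    return {
        "resourceLogs": [{
            "resource": {
                "attributes": [
                    {"key": "service.name", "value": {"stringValue": service_name}}
                ]
            },
            "scopeLogs": [{
                "scope": {"name": "manual-smoke-test"},  # hypothetical scope name
                "logRecords": [{
                    # 64-bit integers are carried as decimal strings in OTLP JSON.
                    "timeUnixNano": str(time.time_ns()),
                    "severityText": severity,
                    "body": {"stringValue": message},
                }],
            }],
        }]
    }
```

Serializing the result with `json.dumps(...)` and passing it to the `curl -d` call above, in place of the empty array, should produce a visible log line for the named service.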

## Performance Tuning

### For Development

The default configuration is optimized for local development with minimal resources.

### For Production

Update the following in `signoz-values-prod.yaml`:

```yaml
# Increase collector resources
otelCollector:
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: 2000m
      memory: 2Gi

  # Increase batch sizes
  config:
    processors:
      batch:
        timeout: 10s
        send_batch_size: 10000  # Increased from 1024

  # Add more replicas
  replicaCount: 2
```
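The two batch settings interact: a batch is sent when it reaches `send_batch_size` items or when `timeout` has elapsed, whichever comes first. The toy model below illustrates that trade-off. It is a simplified sketch, not the collector's implementation, and `Batcher` is a hypothetical class that takes explicit timestamps so the behavior is deterministic.

```python
from typing import Any, List, Optional


class Batcher:
    """Flush when the batch is full or its oldest item has waited `timeout` seconds."""

    def __init__(self, send_batch_size: int, timeout: float):
        self.send_batch_size = send_batch_size
        self.timeout = timeout
        self._items: List[Any] = []
        self._first_at: Optional[float] = None

    def add(self, item: Any, now: float) -> Optional[List[Any]]:
        """Queue an item; return the batch if this item filled it."""
        if not self._items:
            self._first_at = now
        self._items.append(item)
        if len(self._items) >= self.send_batch_size:
            return self._flush()
        return None

    def tick(self, now: float) -> Optional[List[Any]]:
        """Called periodically; return a partial batch once the timeout has passed."""
        if self._items and now - self._first_at >= self.timeout:
            return self._flush()
        return None

    def _flush(self) -> List[Any]:
        batch, self._items, self._first_at = self._items, [], None
        return batch
```

A larger `send_batch_size` means fewer, bigger writes to ClickHouse under load, while `timeout` bounds how long telemetry can sit in the collector when traffic is light.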

## Best Practices

1. **Use Structured Logging**: Always use key-value pairs for better querying
2. **Add Context**: Include user_id, tenant_id, request_id in logs
3. **Trace Business Operations**: Add custom spans for important operations
4. **Monitor Collector Health**: Set up alerts for collector errors
5. **Retention Policy**: Configure ClickHouse retention based on needs

## Additional Resources

- [SigNoz Documentation](https://signoz.io/docs/)
- [OpenTelemetry Python](https://opentelemetry.io/docs/instrumentation/python/)
- [Bakery IA Monitoring Shared Library](../shared/monitoring/)

## Support

For issues or questions:
1. Check SigNoz community: https://signoz.io/slack
2. Review OpenTelemetry docs: https://opentelemetry.io/docs/
3. Create issue in project repository