# SigNoz Monitoring Setup Guide
This guide explains how to set up complete observability for the Bakery IA platform using SigNoz, which provides unified metrics, logs, and traces visualization.
## Table of Contents
1. [Architecture Overview](#architecture-overview)
2. [Prerequisites](#prerequisites)
3. [SigNoz Deployment](#signoz-deployment)
4. [Service Configuration](#service-configuration)
5. [Data Flow](#data-flow)
6. [Verification](#verification)
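7. [Troubleshooting](#troubleshooting)
8. [Performance Tuning](#performance-tuning)
9. [Best Practices](#best-practices)
10. [Additional Resources](#additional-resources)
11. [Support](#support)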
## Architecture Overview
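The monitoring pipeline has four stages: services emit telemetry, the collector receives and processes it, ClickHouse stores it, and the SigNoz UI queries it: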
```
┌──────────────────────────────────────────────────────────────┐
│                      Bakery IA Services                      │
│    ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐    │
│    │   Auth   │  │ Inventory│  │  Orders  │  │   ...    │    │
│    └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘    │
│         │             │             │             │          │
│         └─────────────┴──────┬──────┴─────────────┘          │
│                              │                               │
│                OpenTelemetry Protocol (OTLP)                 │
│                   Traces / Metrics / Logs                    │
└──────────────────────────────┬───────────────────────────────┘
                               ▼
┌──────────────────────────────────────────────────────────────┐
│                SigNoz OpenTelemetry Collector                │
│  ┌────────────────────────────────────────────────────────┐  │
│  │ Receivers:                                             │  │
│  │   - OTLP gRPC (4317)    - OTLP HTTP (4318)             │  │
│  │   - Prometheus Scraper (service discovery)             │  │
│  └───────────────────────────┬────────────────────────────┘  │
│                              │                               │
│  ┌───────────────────────────┴────────────────────────────┐  │
│  │ Processors: batch, memory_limiter, resourcedetection   │  │
│  └───────────────────────────┬────────────────────────────┘  │
│                              │                               │
│  ┌───────────────────────────┴────────────────────────────┐  │
│  │ Exporters: ClickHouse (traces, metrics, logs)          │  │
│  └────────────────────────────────────────────────────────┘  │
└──────────────────────────────┬───────────────────────────────┘
                               ▼
┌──────────────────────────────────────────────────────────────┐
│                     ClickHouse Database                      │
│           ┌──────────┐  ┌──────────┐  ┌──────────┐           │
│           │  Traces  │  │ Metrics  │  │   Logs   │           │
│           └──────────┘  └──────────┘  └──────────┘           │
└──────────────────────────────┬───────────────────────────────┘
                               ▼
┌──────────────────────────────────────────────────────────────┐
│                     SigNoz Query Service                     │
│                        & Frontend UI                         │
│              https://monitoring.bakery-ia.local              │
└──────────────────────────────────────────────────────────────┘
```
### Key Components
1. **Services**: Generate telemetry data using OpenTelemetry SDK
2. **OpenTelemetry Collector**: Receives, processes, and exports telemetry
3. **ClickHouse**: Stores traces, metrics, and logs
4. **SigNoz UI**: Query and visualize all telemetry data
## Prerequisites
- Kubernetes cluster (Kind, Minikube, or production cluster)
- Helm 3.x installed
- kubectl configured
- At least 4GB RAM available for SigNoz components
## SigNoz Deployment
### 1. Add SigNoz Helm Repository
```bash
helm repo add signoz https://charts.signoz.io
helm repo update
```
### 2. Create Namespace
```bash
kubectl create namespace signoz
```
### 3. Deploy SigNoz
```bash
# Run only one of the following; both use the release name "signoz".

# For development environment
helm install signoz signoz/signoz \
  -n signoz \
  -f infrastructure/helm/signoz-values-dev.yaml

# For production environment
helm install signoz signoz/signoz \
  -n signoz \
  -f infrastructure/helm/signoz-values-prod.yaml
```
### 4. Verify Deployment
```bash
# Check all pods are running
kubectl get pods -n signoz
# Expected output:
# signoz-alertmanager-0
# signoz-clickhouse-0
# signoz-frontend-*
# signoz-otel-collector-*
# signoz-query-service-*
# Check services
kubectl get svc -n signoz
```
## Service Configuration
Each microservice needs to be configured to send telemetry to SigNoz.
### Environment Variables
Add these environment variables to your service deployments:
```yaml
env:
  # OpenTelemetry Collector endpoint
  - name: OTEL_COLLECTOR_ENDPOINT
    value: "http://signoz-otel-collector.signoz.svc.cluster.local:4318"
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://signoz-otel-collector.signoz.svc.cluster.local:4318"

  # Service identification
  - name: OTEL_SERVICE_NAME
    value: "your-service-name"  # e.g., "auth-service"

  # Enable tracing
  - name: ENABLE_TRACING
    value: "true"

  # Enable logs export
  - name: OTEL_LOGS_EXPORTER
    value: "otlp"
  - name: OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED
    value: "true"

  # Enable metrics export (optional, default: true)
  - name: ENABLE_OTEL_METRICS
    value: "true"
```
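The variables above can be sanity-checked at startup. The following is a minimal, stdlib-only sketch and not part of the shared library: the helper names `missing_telemetry_env` and `otlp_signal_urls` are hypothetical. It relies on the OTLP/HTTP convention that a base endpoint is extended with `/v1/traces`, `/v1/metrics`, and `/v1/logs`.

```python
import os

# Variables this guide asks every service to define.
REQUIRED_VARS = ("OTEL_EXPORTER_OTLP_ENDPOINT", "OTEL_SERVICE_NAME")

# Per-signal paths appended to an OTLP/HTTP base endpoint.
SIGNAL_PATHS = {
    "traces": "/v1/traces",
    "metrics": "/v1/metrics",
    "logs": "/v1/logs",
}


def missing_telemetry_env(environ=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


def otlp_signal_urls(base_endpoint):
    """Map each signal to the URL an OTLP/HTTP exporter would post to."""
    base = base_endpoint.rstrip("/")
    return {signal: base + path for signal, path in SIGNAL_PATHS.items()}
```

Calling `missing_telemetry_env()` with no arguments inspects the process environment; passing a dict makes the check easy to unit test.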
### Prometheus Annotations
Add these annotations to enable Prometheus metrics scraping:
```yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8000"
    prometheus.io/path: "/metrics"
```
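To make the role of each annotation concrete, the sketch below shows how a Prometheus-style scraper could turn them into a target URL. It is an illustration only: the function `scrape_url` and its `default_port` parameter are hypothetical, and the real behavior depends on the relabeling rules in the collector's scrape configuration.

```python
def scrape_url(annotations, pod_ip, default_port):
    """Build the URL a scraper would poll, or None if the pod has not opted in."""
    if annotations.get("prometheus.io/scrape", "false").lower() != "true":
        return None
    # Fall back to the caller-supplied port when the annotation is absent.
    port = annotations.get("prometheus.io/port", str(default_port))
    # "/metrics" is the conventional default path.
    path = annotations.get("prometheus.io/path", "/metrics")
    if not path.startswith("/"):
        path = "/" + path
    return f"http://{pod_ip}:{port}{path}"
```

With the annotations shown above and a pod IP of `10.0.0.5`, the function returns `http://10.0.0.5:8000/metrics`.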
### Complete Example
See [infrastructure/kubernetes/base/components/auth/auth-service.yaml](../infrastructure/kubernetes/base/components/auth/auth-service.yaml) for a complete example.
### Automated Configuration Script
Use the provided script to add monitoring configuration to all services:
```bash
# Run from project root
./infrastructure/kubernetes/add-monitoring-config.sh
```
## Data Flow
### 1. Traces
**Automatic Instrumentation:**
```python
# In your service's main.py
from shared.service_base import StandardFastAPIService
service = AuthService() # Extends StandardFastAPIService
app = service.create_app()
# Tracing is automatically enabled if ENABLE_TRACING=true
# All FastAPI endpoints, HTTP clients, Redis, PostgreSQL are auto-instrumented
```
**Manual Instrumentation:**
```python
from shared.monitoring.tracing import add_trace_attributes, add_trace_event
# Add custom attributes to current span
add_trace_attributes(
    user_id="123",
    tenant_id="abc",
    operation="user_registration"
)
# Add events for important operations
add_trace_event("user_authenticated", user_id="123", method="jwt")
```
### 2. Metrics
**Dual Export Strategy:**
Services export metrics in two ways:
1. **Prometheus format** at `/metrics` endpoint (scraped by SigNoz)
2. **OTLP push** directly to SigNoz collector (real-time)
**Built-in Metrics:**
```python
# Automatically collected by BaseFastAPIService:
# - http_requests_total
# - http_request_duration_seconds
# - active_connections
```
**Custom Metrics:**
```python
# Define in your service
custom_metrics = {
    "user_registrations": {
        "type": "counter",
        "description": "Total user registrations",
        "labels": ["status"]
    },
    "login_duration_seconds": {
        "type": "histogram",
        "description": "Login request duration"
    }
}

service = AuthService(custom_metrics=custom_metrics)

# Use in your code
service.metrics_collector.increment_counter(
    "user_registrations",
    labels={"status": "success"}
)
```
### 3. Logs
**Automatic Export:**
```python
# Logs are automatically exported if OTEL_LOGS_EXPORTER=otlp
import logging
logger = logging.getLogger(__name__)
# This will appear in SigNoz
logger.info("User logged in", extra={"user_id": "123", "tenant_id": "abc"})
```
**Structured Logging with Context:**
```python
from shared.monitoring.logs_exporter import add_log_context
# Add context that persists across log calls
log_ctx = add_log_context(
    request_id="req_123",
    user_id="user_456",
    tenant_id="tenant_789"
)
# All subsequent logs include this context
log_ctx.info("Processing order") # Includes request_id, user_id, tenant_id
```
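For readers unfamiliar with context-bound loggers, the same idea can be approximated with the standard library's `logging.LoggerAdapter`. This is a sketch of the pattern, not the implementation of `add_log_context`; the names `bind_log_context` and `ListHandler` are hypothetical.

```python
import logging


def bind_log_context(logger, **context):
    """Return a logger-like object that attaches `context` to every record."""
    return logging.LoggerAdapter(logger, context)


class ListHandler(logging.Handler):
    """Collects records in memory so the attached fields can be inspected."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)


logger = logging.getLogger("orders-demo")
logger.setLevel(logging.INFO)
handler = ListHandler()
logger.addHandler(handler)

log_ctx = bind_log_context(logger, request_id="req_123", tenant_id="tenant_789")
log_ctx.info("Processing order")

# The context keys are now attributes on the emitted record.
record = handler.records[0]
print(record.getMessage(), record.request_id, record.tenant_id)
```

Because the context lives on the adapter rather than on the underlying logger, two requests handled concurrently can each hold their own adapter without leaking fields into each other's logs.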
**Trace Correlation:**
```python
from shared.monitoring.logs_exporter import get_current_trace_context
# Get trace context for correlation
trace_ctx = get_current_trace_context()
logger.info("Processing request", extra=trace_ctx)
# Logs now include trace_id and span_id for correlation
```
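Passing `extra=trace_ctx` on every call is easy to forget. One alternative, sketched here with the standard library only, is a `logging.Filter` that stamps the IDs onto each record. The class `TraceContextFilter` and the `get_ids` callable are hypothetical; in a real service `get_ids` would read the active span. The IDs are rendered as 32 and 16 lowercase hex characters, the usual OpenTelemetry text form.

```python
import logging


class TraceContextFilter(logging.Filter):
    """Stamps trace_id and span_id onto every record passing through."""

    def __init__(self, get_ids):
        super().__init__()
        # get_ids is a hypothetical callable returning (trace_id, span_id)
        # as integers, or None when no span is active.
        self._get_ids = get_ids

    def filter(self, record):
        ids = self._get_ids()
        if ids is None:
            record.trace_id = ""
            record.span_id = ""
        else:
            trace_id, span_id = ids
            record.trace_id = format(trace_id, "032x")  # 128-bit ID, 32 hex chars
            record.span_id = format(span_id, "016x")    # 64-bit ID, 16 hex chars
        return True  # annotate only, never drop the record


def format_with_trace(message, get_ids):
    """Run one message through the filter and return the formatted line."""
    record = logging.LogRecord("demo", logging.INFO, "demo.py", 0, message, None, None)
    TraceContextFilter(get_ids).filter(record)
    formatter = logging.Formatter("%(message)s trace_id=%(trace_id)s span_id=%(span_id)s")
    return formatter.format(record)
```

Attaching the filter to a handler with `handler.addFilter(...)` applies it to every record that handler emits.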
## Verification
### 1. Check Service Health
```bash
# Check that services are exporting telemetry
kubectl logs -n bakery-ia deployment/auth-service | grep -i "telemetry\|otel\|signoz"
# Expected output includes:
# - "Distributed tracing configured"
# - "OpenTelemetry logs export configured"
# - "OpenTelemetry metrics export configured"
```
### 2. Access SigNoz UI
```bash
# Port-forward (for local development)
kubectl port-forward -n signoz svc/signoz-frontend 3301:3301
# Or via Ingress
open https://monitoring.bakery-ia.local
```
### 3. Verify Data Ingestion
**Traces:**
1. Go to SigNoz UI → Traces
2. You should see traces from your services
3. Click on a trace to see the full span tree
**Metrics:**
1. Go to SigNoz UI → Metrics
2. Query: `http_requests_total`
3. Filter by service: `service="auth-service"`
**Logs:**
1. Go to SigNoz UI → Logs
2. Filter by service: `service_name="auth-service"`
3. Search for specific log messages
### 4. Test Trace-Log Correlation
1. Find a trace in SigNoz UI
2. Copy the `trace_id`
3. Go to Logs tab
4. Search: `trace_id="<your-trace-id>"`
5. You should see all logs for that trace
## Troubleshooting
### No Data in SigNoz
**1. Check OpenTelemetry Collector:**
```bash
# Check collector logs
kubectl logs -n signoz deployment/signoz-otel-collector
# Should see:
# - "Receiver is starting"
# - "Exporter is starting"
# - No error messages
```
**2. Check Service Configuration:**
```bash
# Verify environment variables
kubectl get deployment auth-service -n bakery-ia -o yaml | grep -A 20 "env:"
# Verify annotations
kubectl get deployment auth-service -n bakery-ia -o yaml | grep -A 5 "annotations:"
```
**3. Check Network Connectivity:**
```bash
# Test from service pod
kubectl exec -n bakery-ia deployment/auth-service -- \
  curl -v http://signoz-otel-collector.signoz.svc.cluster.local:4318/v1/traces
# Should return: 405 Method Not Allowed (POST required)
# If connection refused, check network policies
```
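If `curl` is not available in the service image, a plain TCP check from a Python shell inside the pod answers the same "can I reach the collector at all" question. This is a minimal sketch; `can_connect` is a hypothetical helper name.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

For example, `can_connect("signoz-otel-collector.signoz.svc.cluster.local", 4318)` returning `False` points at DNS, the Service, or a network policy rather than at the application's instrumentation.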
### Traces Not Appearing
**Check instrumentation:**
```python
# Verify tracing is enabled
import os
print(os.getenv("ENABLE_TRACING")) # Should be "true"
print(os.getenv("OTEL_COLLECTOR_ENDPOINT")) # Should be set
```
**Check trace sampling:**
```bash
# Verify sampling rate (default 100%)
kubectl logs -n bakery-ia deployment/auth-service | grep "sampling"
```
### Metrics Not Appearing
**1. Verify Prometheus annotations:**
```bash
kubectl get pods -n bakery-ia -o yaml | grep "prometheus.io"
```
**2. Test metrics endpoint:**
```bash
# Port-forward service
kubectl port-forward -n bakery-ia deployment/auth-service 8000:8000
# Test endpoint
curl http://localhost:8000/metrics
# Should return Prometheus format metrics
```
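Scanning the `/metrics` output by eye gets tedious once a service exposes many series. The sketch below extracts sample names from Prometheus text-format output and reports which of the built-in metrics listed earlier are absent. The helper names are hypothetical, and the parser is deliberately simple: it handles ordinary `name{labels} value` lines, not every corner of the exposition format.

```python
EXPECTED = (
    "http_requests_total",
    "http_request_duration_seconds",
    "active_connections",
)


def metric_names(exposition_text):
    """Return the set of sample names found in Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample line is `name{labels} value` or `name value`.
        names.add(line.split("{", 1)[0].split(None, 1)[0])
    return names


def missing_metrics(names, expected=EXPECTED):
    """List expected metrics with no sample; histograms match by name prefix."""
    return [
        metric for metric in expected
        if not any(n == metric or n.startswith(metric + "_") for n in names)
    ]
```

The prefix match exists because a histogram such as `http_request_duration_seconds` is exposed as `_bucket`, `_sum`, and `_count` series rather than under its bare name.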
**3. Check SigNoz scrape configuration:**
```bash
# Check collector config
kubectl get configmap -n signoz signoz-otel-collector -o yaml | grep -A 30 "prometheus:"
```
### Logs Not Appearing
**1. Verify log export is enabled:**
```bash
kubectl get deployment auth-service -n bakery-ia -o yaml | grep -A 1 OTEL_LOGS_EXPORTER
# Should show the variable name followed by: value: otlp
```
**2. Check log format:**
```bash
# Logs should be JSON formatted
kubectl logs -n bakery-ia deployment/auth-service | head -5
```
**3. Verify OTLP endpoint:**
```bash
# Test logs endpoint
kubectl exec -n bakery-ia deployment/auth-service -- \
  curl -X POST http://signoz-otel-collector.signoz.svc.cluster.local:4318/v1/logs \
  -H "Content-Type: application/json" \
  -d '{"resourceLogs":[]}'
# Should return 200 OK or 400 Bad Request (not connection error)
```
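An empty `resourceLogs` array proves the endpoint is reachable but leaves nothing to look for in the UI. The sketch below builds a body with one real log record in the OTLP JSON encoding, which can be saved and sent with the same `curl` command. The function name `build_log_payload` and the scope name `manual-test` are hypothetical.

```python
import json
import time


def build_log_payload(service_name, message, severity_text="INFO", severity_number=9):
    """Build an OTLP/HTTP JSON body containing a single log record."""
    return {
        "resourceLogs": [{
            "resource": {
                "attributes": [
                    {"key": "service.name", "value": {"stringValue": service_name}}
                ]
            },
            "scopeLogs": [{
                "scope": {"name": "manual-test"},
                "logRecords": [{
                    # 64-bit integers are written as decimal strings in OTLP JSON.
                    "timeUnixNano": str(time.time_ns()),
                    "severityNumber": severity_number,  # 9 is INFO
                    "severityText": severity_text,
                    "body": {"stringValue": message},
                }],
            }],
        }]
    }


print(json.dumps(build_log_payload("auth-service", "connectivity test")))
```

Once the payload is accepted, filtering the Logs view by the service name used here should surface the test message.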
## Performance Tuning
### For Development
The default configuration is optimized for local development with minimal resources.
### For Production
Update the following in `signoz-values-prod.yaml`:
```yaml
# Increase collector resources
otelCollector:
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: 2000m
      memory: 2Gi

  # Increase batch sizes
  config:
    processors:
      batch:
        timeout: 10s
        send_batch_size: 10000  # Increased from 1024

  # Add more replicas
  replicaCount: 2
```
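The two batch settings interact: a batch is sent when it reaches `send_batch_size` items or when `timeout` elapses, whichever comes first. The toy model below illustrates that trade-off. It is a simplified sketch for intuition, not the collector's implementation, and the class `BatchBuffer` is hypothetical.

```python
class BatchBuffer:
    """Simplified model of size-or-timeout batching."""

    def __init__(self, send_batch_size, timeout_s):
        self.send_batch_size = send_batch_size
        self.timeout_s = timeout_s
        self._items = []
        self._first_at = None  # arrival time of the oldest buffered item

    def add(self, item, now):
        """Buffer one item; return a full batch if the size threshold is hit."""
        if not self._items:
            self._first_at = now
        self._items.append(item)
        if len(self._items) >= self.send_batch_size:
            return self._flush()
        return None

    def tick(self, now):
        """Return a partial batch once the oldest item has waited timeout_s."""
        if self._items and now - self._first_at >= self.timeout_s:
            return self._flush()
        return None

    def _flush(self):
        batch, self._items = self._items, []
        self._first_at = None
        return batch
```

Under heavy traffic the size threshold dominates and larger batches mean fewer writes to ClickHouse; under light traffic the timeout dominates and sets the worst-case delay before data becomes queryable.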
## Best Practices
1. **Use Structured Logging**: Always use key-value pairs for better querying
2. **Add Context**: Include user_id, tenant_id, request_id in logs
3. **Trace Business Operations**: Add custom spans for important operations
4. **Monitor Collector Health**: Set up alerts for collector errors
5. **Retention Policy**: Configure ClickHouse retention based on needs
## Additional Resources
- [SigNoz Documentation](https://signoz.io/docs/)
- [OpenTelemetry Python](https://opentelemetry.io/docs/instrumentation/python/)
- [Bakery IA Monitoring Shared Library](../shared/monitoring/)
## Support
For issues or questions:
1. Check SigNoz community: https://signoz.io/slack
2. Review OpenTelemetry docs: https://opentelemetry.io/docs/
3. Create issue in project repository