Improve metrics

This commit is contained in:
Urtzi Alfaro
2026-01-08 20:48:24 +01:00
parent 29d19087f1
commit e8fda39e50
21 changed files with 615 additions and 3019 deletions

View File

@@ -1,337 +0,0 @@
# HTTPS in Development Environment
## Overview
The development environment now uses HTTPS by default to match production behavior and to catch SSL-related issues early.
**Benefits:**
- ✅ Matches production HTTPS behavior
- ✅ Tests SSL/TLS configurations
- ✅ Catches mixed content warnings
- ✅ Tests secure cookie handling
- ✅ Better dev-prod parity
---
## Quick Start
### 1. Deploy with HTTPS Enabled
```bash
# Start development environment
skaffold dev --profile=dev
# Wait for certificate to be issued
kubectl get certificate -n bakery-ia
# You should see:
# NAME READY SECRET AGE
# bakery-dev-tls-cert True bakery-dev-tls-cert 1m
```
### 2. Access Your Application
```bash
# Access via HTTPS (will show certificate warning in browser)
open https://localhost
# Or via curl (use -k to skip certificate verification)
curl -k https://localhost/api/health
```
---
## Trust the Self-Signed Certificate
To avoid browser certificate warnings, you need to trust the self-signed certificate.
### Option 1: Accept Browser Warning (Quick & Easy)
When you visit `https://localhost`:
1. Browser shows "Your connection is not private" or similar
2. Click "Advanced" or "Show details"
3. Click "Proceed to localhost" or "Accept the risk"
4. The certificate warning reappears only once per browser session
### Option 2: Trust Certificate in System (Recommended)
#### On macOS:
```bash
# 1. Export the certificate from Kubernetes
kubectl get secret bakery-dev-tls-cert -n bakery-ia -o jsonpath='{.data.tls\.crt}' | base64 -d > /tmp/bakery-dev-cert.crt
# 2. Add to Keychain
sudo security add-trusted-cert -d -r trustRoot -k /Library/Keychains/System.keychain /tmp/bakery-dev-cert.crt
# 3. Verify
security find-certificate -c localhost -a
# 4. Cleanup
rm /tmp/bakery-dev-cert.crt
```
**Alternative (GUI):**
1. Export certificate: `kubectl get secret bakery-dev-tls-cert -n bakery-ia -o jsonpath='{.data.tls\.crt}' | base64 -d > bakery-dev-cert.crt`
2. Double-click the `.crt` file to open Keychain Access
3. Find "localhost" certificate
4. Double-click → Trust → "Always Trust"
5. Close and enter your password
#### On Linux:
```bash
# 1. Export the certificate
kubectl get secret bakery-dev-tls-cert -n bakery-ia -o jsonpath='{.data.tls\.crt}' | base64 -d | sudo tee /usr/local/share/ca-certificates/bakery-dev.crt
# 2. Update CA certificates
sudo update-ca-certificates
# 3. For browsers (Chromium/Chrome)
mkdir -p $HOME/.pki/nssdb
certutil -d sql:$HOME/.pki/nssdb -A -t "P,," -n "Bakery Dev" -i /usr/local/share/ca-certificates/bakery-dev.crt
```
#### On Windows:
```powershell
# 1. Export the certificate (the secret data is base64-encoded and must be decoded)
$b64 = kubectl get secret bakery-dev-tls-cert -n bakery-ia -o jsonpath='{.data.tls\.crt}'
[IO.File]::WriteAllBytes("$PWD\bakery-dev-cert.crt", [Convert]::FromBase64String($b64))
# 2. Import to Trusted Root
Import-Certificate -FilePath .\bakery-dev-cert.crt -CertStoreLocation Cert:\LocalMachine\Root
# Or use GUI:
# - Double-click bakery-dev-cert.crt
# - Install Certificate
# - Store Location: Local Machine
# - Place in: Trusted Root Certification Authorities
```
---
## Testing HTTPS
### Test with curl
```bash
# Without certificate verification (quick test)
curl -k https://localhost/api/health
# With certificate verification (after trusting cert)
curl https://localhost/api/health
# Check certificate details
curl -vI https://localhost/api/health 2>&1 | grep -A 10 "Server certificate"
# Test CORS with HTTPS
curl -H "Origin: https://localhost:3000" \
-H "Access-Control-Request-Method: POST" \
-X OPTIONS https://localhost/api/health
```
### Test with Browser
1. Open `https://localhost`
2. Check for SSL/TLS padlock in address bar
3. Click padlock → View certificate
4. Verify (see also the `openssl` check below):
- Issued to: localhost
- Issued by: localhost (self-signed)
- Valid for: 90 days
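The same fields can be checked from the command line; a quick sketch using `openssl` (assuming it is installed and the ingress listens on 443):
```bash
# Print subject, issuer, and validity window of the served certificate
openssl s_client -connect localhost:443 -servername localhost </dev/null 2>/dev/null \
  | openssl x509 -noout -subject -issuer -dates
```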
### Test Frontend
```bash
# Update your frontend .env to use HTTPS
echo "VITE_API_URL=https://localhost/api" > frontend/.env.local
# Frontend should now make HTTPS requests
```
---
## Certificate Details
### Certificate Specifications
- **Type**: Self-signed (for development)
- **Algorithm**: RSA 2048-bit
- **Validity**: 90 days (auto-renews 15 days before expiration)
- **Common Name**: localhost
- **DNS Names**:
- localhost
- bakery-ia.local
- api.bakery-ia.local
- *.bakery-ia.local
- **IP Addresses**: 127.0.0.1, ::1
### Certificate Issuer
- **Issuer**: `selfsigned-issuer` (cert-manager ClusterIssuer)
- **Auto-renewal**: Managed by cert-manager
- **Secret Name**: `bakery-dev-tls-cert`
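To sanity-check that both the issuer and the issued secret exist:
```bash
# ClusterIssuer is cluster-scoped; the TLS secret lives in the app namespace
kubectl get clusterissuer selfsigned-issuer
kubectl get secret bakery-dev-tls-cert -n bakery-ia
```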
---
## Troubleshooting
### Certificate Not Issued
```bash
# Check certificate status
kubectl describe certificate bakery-dev-tls-cert -n bakery-ia
# Check cert-manager logs
kubectl logs -n cert-manager deployment/cert-manager
# Check if cert-manager is installed
kubectl get pods -n cert-manager
# If cert-manager is not installed:
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.2/cert-manager.yaml
```
### Certificate Warning in Browser
**Normal for self-signed certificates!** Choose one:
1. Click "Proceed" (quick, temporary)
2. Trust the certificate in your system (permanent)
### Mixed Content Warnings
If you see "mixed content" errors:
- Ensure all API calls use HTTPS
- Check for hardcoded HTTP URLs (see the grep below)
- Update `VITE_API_URL` to use HTTPS
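A rough way to hunt for hardcoded HTTP URLs in the frontend source (a sketch; adjust the path and file globs to your layout):
```bash
# List source lines that still reference plain http://
grep -rn "http://" frontend/src \
  --include='*.ts' --include='*.tsx' --include='*.js' --include='*.jsx'
```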
### Certificate Expired
```bash
# Check expiration
kubectl get certificate bakery-dev-tls-cert -n bakery-ia -o jsonpath='{.status.notAfter}'
# Force renewal
kubectl delete certificate bakery-dev-tls-cert -n bakery-ia
kubectl apply -k infrastructure/kubernetes/overlays/dev
# cert-manager will automatically recreate it
```
### Browser Shows "NET::ERR_CERT_AUTHORITY_INVALID"
This is expected for self-signed certificates. Options:
1. Click "Advanced" → "Proceed to localhost"
2. Trust the certificate (see instructions above)
3. Use curl with `-k` flag for testing
---
## Disable HTTPS (Not Recommended)
If you need to temporarily disable HTTPS:
```bash
# Edit dev-ingress.yaml
vim infrastructure/kubernetes/overlays/dev/dev-ingress.yaml
# Change:
# nginx.ingress.kubernetes.io/ssl-redirect: "true" → "false"
# nginx.ingress.kubernetes.io/force-ssl-redirect: "true" → "false"
# Comment out the tls section:
# tls:
# - hosts:
# - localhost
# secretName: bakery-dev-tls-cert
# Redeploy
skaffold dev --profile=dev
```
---
## Differences from Production
| Aspect | Development | Production |
|--------|-------------|------------|
| Certificate Type | Self-signed | Let's Encrypt |
| Validity | 90 days | 90 days |
| Auto-renewal | cert-manager | cert-manager |
| Trust | Manual trust needed | Automatically trusted |
| Domains | localhost | Real domains |
| Browser Warning | Yes (self-signed) | No (CA-signed) |
---
## FAQ
### Q: Why am I seeing certificate warnings?
**A:** Self-signed certificates aren't trusted by browsers by default. Trust the certificate or click "Proceed."
### Q: Do I need to trust the certificate?
**A:** No, but it makes development easier. Otherwise you'll need to click "Proceed" once per browser session.
### Q: Will this affect my frontend development?
**A:** Slightly. Update `VITE_API_URL` to use `https://`; otherwise everything works the same.
### Q: Can I use HTTP instead?
**A:** Yes, but not recommended. It reduces dev-prod parity and won't catch HTTPS issues.
### Q: How often do I need to re-trust the certificate?
**A:** Only when the certificate is recreated (every 90 days or when you delete the cluster).
### Q: Does this work with bakery-ia.local?
**A:** Yes! The certificate is valid for both `localhost` and `bakery-ia.local`.
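For `bakery-ia.local` to resolve, your machine needs a hosts entry (an assumption about your local setup, mirroring the pattern used for `monitoring.bakery-ia.local` in the monitoring docs):
```bash
# Point the dev hostnames at the local ingress
echo "127.0.0.1 bakery-ia.local api.bakery-ia.local" | sudo tee -a /etc/hosts
```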
---
## Additional Security Testing
With HTTPS enabled, you can now test:
### 1. Secure Cookies
```javascript
// In your frontend
document.cookie = "session=test; Secure; SameSite=Strict";
```
### 2. Mixed Content Detection
```javascript
// This will show warning in dev (good - catches prod issues!)
fetch('http://api.example.com/data') // ❌ Mixed content
fetch('https://api.example.com/data') // ✅ Secure
```
### 3. HSTS (HTTP Strict Transport Security)
```bash
# Check HSTS headers
curl -I https://localhost/api/health | grep -i strict
```
### 4. TLS Version Testing
```bash
# Test TLS 1.2
curl --tlsv1.2 https://localhost/api/health
# Test TLS 1.3
curl --tlsv1.3 https://localhost/api/health
```
---
## Summary
**Enabled**: HTTPS in development by default
**Certificate**: Self-signed, auto-renewed
**Access**: `https://localhost`
**Trust**: Optional but recommended
**Benefit**: Better dev-prod parity
**Next Steps:**
1. Deploy: `skaffold dev --profile=dev`
2. Access: `https://localhost`
3. Trust: Follow instructions above (optional)
4. Test: Verify HTTPS works
For issues, see Troubleshooting section or check cert-manager logs.

View File

@@ -1,337 +0,0 @@
# Docker Hub Configuration Guide
This guide explains how to configure Docker Hub for all image pulls in the Bakery IA project.
## Overview
The project has been configured to use Docker Hub credentials for pulling both:
- **Base images** (postgres, redis, python, node, nginx, etc.)
- **Custom bakery images** (bakery/auth-service, bakery/gateway, etc.)
## Quick Start
### 1. Create Docker Hub Secret in Kubernetes
Run the automated setup script:
```bash
./infrastructure/kubernetes/setup-dockerhub-secrets.sh
```
This script will:
- Create the `dockerhub-creds` secret in all namespaces (bakery-ia, bakery-ia-dev, bakery-ia-prod, default)
- Use the `uals` username together with a Docker Hub access token (`dckr_pat_…`; substitute your own)
### 2. Apply Updated Kubernetes Manifests
All manifests have been updated with `imagePullSecrets`. Apply them:
```bash
# For development
kubectl apply -k infrastructure/kubernetes/overlays/dev
# For production
kubectl apply -k infrastructure/kubernetes/overlays/prod
```
### 3. Verify Pods Can Pull Images
```bash
# Check pod status
kubectl get pods -n bakery-ia
# Check events for image pull status
kubectl get events -n bakery-ia --sort-by='.lastTimestamp'
# Describe a specific pod to see image pull details
kubectl describe pod <pod-name> -n bakery-ia
```
## Manual Setup
If you prefer to create the secret manually:
```bash
kubectl create secret docker-registry dockerhub-creds \
--docker-server=docker.io \
--docker-username=uals \
--docker-password=<your-dockerhub-token> \
--docker-email=ualfaro@gmail.com \
-n bakery-ia
```
Repeat for other namespaces:
```bash
kubectl create secret docker-registry dockerhub-creds \
--docker-server=docker.io \
--docker-username=uals \
--docker-password=<your-dockerhub-token> \
--docker-email=ualfaro@gmail.com \
-n bakery-ia-dev
kubectl create secret docker-registry dockerhub-creds \
--docker-server=docker.io \
--docker-username=uals \
--docker-password=<your-dockerhub-token> \
--docker-email=ualfaro@gmail.com \
-n bakery-ia-prod
```
## What Was Changed
### 1. Kubernetes Manifests (47 files updated)
All deployments, jobs, and cronjobs now include `imagePullSecrets`:
```yaml
spec:
  template:
    spec:
      imagePullSecrets:
        - name: dockerhub-creds
      containers:
        - name: ...
```
**Files Updated:**
- **19 Service Deployments**: All microservices (auth, tenant, forecasting, etc.)
- **21 Database Deployments**: All PostgreSQL instances, Redis, RabbitMQ
- **21 Migration Jobs**: All database migration jobs
- **2 CronJobs**: demo-cleanup, external-data-rotation
- **2 Standalone Jobs**: external-data-init, nominatim-init
- **1 Worker Deployment**: demo-cleanup-worker
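To spot-check that a given workload actually picked up the secret (a quick sketch using `auth-service` as the example):
```bash
# Prints the imagePullSecrets names attached to the pod template
kubectl get deployment auth-service -n bakery-ia \
  -o jsonpath='{.spec.template.spec.imagePullSecrets[*].name}'
```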
### 2. Tiltfile Configuration
The Tiltfile now supports both local registry and Docker Hub:
**Default (Local Registry):**
```bash
tilt up
```
**Docker Hub Mode:**
```bash
export USE_DOCKERHUB=true
export DOCKERHUB_USERNAME=uals
tilt up
```
### 3. Scripts
Two new scripts were created:
1. **[setup-dockerhub-secrets.sh](../infrastructure/kubernetes/setup-dockerhub-secrets.sh)**
- Creates Docker Hub secrets in all namespaces
- Idempotent (safe to run multiple times)
2. **[add-image-pull-secrets.sh](../infrastructure/kubernetes/add-image-pull-secrets.sh)**
- Adds `imagePullSecrets` to all Kubernetes manifests
- Already run (no need to run again unless adding new manifests)
## Using Docker Hub with Tilt
To use Docker Hub for development with Tilt:
```bash
# Login to Docker Hub first
docker login -u uals
# Enable Docker Hub mode
export USE_DOCKERHUB=true
export DOCKERHUB_USERNAME=uals
# Start Tilt
tilt up
```
This will:
- Build images locally
- Tag them as `docker.io/uals/<image-name>`
- Push them to Docker Hub
- Deploy to Kubernetes with imagePullSecrets
## Images Configuration
### Base Images (from Docker Hub)
These images are pulled from Docker Hub's public registry:
- `python:3.11-slim` - Python base for all microservices
- `node:18-alpine` - Node.js for frontend builder
- `nginx:1.25-alpine` - Nginx for frontend production
- `postgres:17-alpine` - PostgreSQL databases
- `redis:7.4-alpine` - Redis cache
- `rabbitmq:4.1-management-alpine` - RabbitMQ message broker
- `busybox:latest` - Utility container
- `curlimages/curl:latest` - Curl utility
- `mediagis/nominatim:4.4` - Geolocation service
### Custom Images (bakery/*)
These images are built by the project:
**Infrastructure:**
- `bakery/gateway`
- `bakery/dashboard`
**Core Services:**
- `bakery/auth-service`
- `bakery/tenant-service`
**Data & Analytics:**
- `bakery/training-service`
- `bakery/forecasting-service`
- `bakery/ai-insights-service`
**Operations:**
- `bakery/sales-service`
- `bakery/inventory-service`
- `bakery/production-service`
- `bakery/procurement-service`
- `bakery/distribution-service`
**Supporting:**
- `bakery/recipes-service`
- `bakery/suppliers-service`
- `bakery/pos-service`
- `bakery/orders-service`
- `bakery/external-service`
**Platform:**
- `bakery/notification-service`
- `bakery/alert-processor`
- `bakery/orchestrator-service`
**Demo:**
- `bakery/demo-session-service`
## Pushing Custom Images to Docker Hub
Use the existing tag-and-push script:
```bash
# Login first
docker login -u uals
# Tag and push all images
./scripts/tag-and-push-images.sh
```
Or manually for a specific image:
```bash
# Build
docker build -t bakery/auth-service:latest -f services/auth/Dockerfile .
# Tag for Docker Hub
docker tag bakery/auth-service:latest uals/bakery-auth-service:latest
# Push
docker push uals/bakery-auth-service:latest
```
## Troubleshooting
### Problem: ImagePullBackOff error
Check if the secret exists:
```bash
kubectl get secret dockerhub-creds -n bakery-ia
```
Verify secret is correctly configured:
```bash
kubectl get secret dockerhub-creds -n bakery-ia -o yaml
```
Check pod events:
```bash
kubectl describe pod <pod-name> -n bakery-ia
```
### Problem: Authentication failure
The Docker Hub credentials might be incorrect or expired. Update the secret:
```bash
# Delete old secret
kubectl delete secret dockerhub-creds -n bakery-ia
# Create new secret with updated credentials
kubectl create secret docker-registry dockerhub-creds \
--docker-server=docker.io \
--docker-username=<your-username> \
--docker-password=<your-token> \
--docker-email=<your-email> \
-n bakery-ia
```
### Problem: Pod still using old credentials
Restart the pod to pick up the new secret:
```bash
kubectl rollout restart deployment/<deployment-name> -n bakery-ia
```
## Security Best Practices
1. **Use Docker Hub Access Tokens** (not passwords)
- Create at: https://hub.docker.com/settings/security
- Set appropriate permissions (Read-only for pulls)
2. **Rotate Credentials Regularly**
- Update the secret every 90 days
- Use the setup script for consistent updates
3. **Limit Secret Access**
- Only grant access to necessary namespaces
- Use RBAC to control who can read secrets
4. **Monitor Usage**
- Check Docker Hub pull rate limits
- Monitor for unauthorized access
## Rate Limits
Docker Hub has rate limits for image pulls:
- **Anonymous users**: 100 pulls per 6 hours per IP
- **Authenticated users**: 200 pulls per 6 hours
- **Pro/Team**: Unlimited
Using authentication (imagePullSecrets) ensures you get the authenticated user rate limit.
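You can check your current allowance with Docker's documented rate-limit endpoint; a sketch (assumes `curl` and `jq`; substitute your own token):
```bash
# Fetch a pull token for Docker's rate-limit test repository
TOKEN=$(curl -s --user "uals:<your-dockerhub-token>" \
  "https://auth.docker.io/token?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull" \
  | jq -r .token)

# A HEAD request returns ratelimit-limit / ratelimit-remaining headers
curl -sI -H "Authorization: Bearer $TOKEN" \
  "https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest" | grep -i ratelimit
```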
## Environment Variables
For CI/CD or automated deployments, use these environment variables:
```bash
export DOCKER_USERNAME=uals
export DOCKER_PASSWORD=<your-dockerhub-token>
export DOCKER_EMAIL=ualfaro@gmail.com
```
## Next Steps
1. ✅ Docker Hub secret created in all namespaces
2. ✅ All Kubernetes manifests updated with imagePullSecrets
3. ✅ Tiltfile configured for optional Docker Hub usage
4. 🔄 Apply manifests to your cluster
5. 🔄 Verify pods can pull images successfully
## Related Documentation
- [Kubernetes Setup Guide](./KUBERNETES_SETUP.md)
- [Security Implementation](./SECURITY_IMPLEMENTATION_COMPLETE.md)
- [Tilt Development Workflow](../Tiltfile)
## Support
If you encounter issues:
1. Check the troubleshooting section above
2. Verify Docker Hub credentials at: https://hub.docker.com/settings/security
3. Check Kubernetes events: `kubectl get events -A --sort-by='.lastTimestamp'`
4. Review pod logs: `kubectl logs -n bakery-ia <pod-name>`

View File

@@ -1,449 +0,0 @@
# Complete Monitoring Guide - Bakery IA Platform
This guide provides the complete overview of observability implementation for the Bakery IA platform using SigNoz and OpenTelemetry.
## 🎯 Executive Summary
**What's Implemented:**
- **Distributed Tracing** - All 17 services
- **Application Metrics** - HTTP requests, latencies, errors
- **System Metrics** - CPU, memory, disk, network per service
- **Structured Logs** - With trace correlation
- **Database Monitoring** - PostgreSQL, Redis, RabbitMQ metrics
- **Pure OpenTelemetry** - No Prometheus, all OTLP push
**Technology Stack:**
- **Backend**: OpenTelemetry Python SDK
- **Collector**: OpenTelemetry Collector (OTLP receivers)
- **Storage**: ClickHouse (traces, metrics, logs)
- **Frontend**: SigNoz UI
- **Protocol**: OTLP over HTTP/gRPC
## 📊 Architecture
```
┌──────────────────────────────────────────────────────────┐
│ Application Services │
│ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ │
│ │ auth │ │ inv │ │ orders │ │ ... │ │
│ └───┬────┘ └───┬────┘ └───┬────┘ └───┬────┘ │
│ │ │ │ │ │
│ └───────────┴────────────┴───────────┘ │
│ │ │
│ Traces + Metrics + Logs │
│ (OpenTelemetry OTLP) │
└──────────────────┼──────────────────────────────────────┘
┌──────────────────────────────────────────────────────────┐
│ Database Monitoring Collector │
│ ┌────────┐ ┌────────┐ ┌────────┐ │
│ │ PG │ │ Redis │ │RabbitMQ│ │
│ └───┬────┘ └───┬────┘ └───┬────┘ │
│ │ │ │ │
│ └───────────┴────────────┘ │
│ │ │
│ Database Metrics │
└──────────────────┼──────────────────────────────────────┘
┌──────────────────────────────────────────────────────────┐
│ SigNoz OpenTelemetry Collector │
│ │
│ Receivers: OTLP (gRPC :4317, HTTP :4318) │
│ Processors: batch, memory_limiter, resourcedetection │
│ Exporters: ClickHouse │
└──────────────────┼──────────────────────────────────────┘
┌──────────────────────────────────────────────────────────┐
│ ClickHouse Database │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Traces │ │ Metrics │ │ Logs │ │
│ └──────────┘ └──────────┘ └──────────┘ │
└──────────────────┼──────────────────────────────────────┘
┌──────────────────────────────────────────────────────────┐
│ SigNoz Frontend UI │
│ https://monitoring.bakery-ia.local │
└──────────────────────────────────────────────────────────┘
```
## 🚀 Quick Start
### 1. Deploy SigNoz
```bash
# Add Helm repository
helm repo add signoz https://charts.signoz.io
helm repo update
# Create namespace and install
kubectl create namespace signoz
helm install signoz signoz/signoz \
-n signoz \
-f infrastructure/helm/signoz-values-dev.yaml
# Wait for pods
kubectl wait --for=condition=ready pod -l app=signoz -n signoz --timeout=300s
```
### 2. Deploy Services with Monitoring
All services are already configured with OpenTelemetry environment variables.
```bash
# Apply all services
kubectl apply -k infrastructure/kubernetes/overlays/dev/
# Or restart existing services
kubectl rollout restart deployment -n bakery-ia
```
### 3. Deploy Database Monitoring
```bash
# Run the setup script
./infrastructure/kubernetes/setup-database-monitoring.sh
# This will:
# - Create monitoring users in PostgreSQL
# - Deploy OpenTelemetry collector for database metrics
# - Start collecting PostgreSQL, Redis, RabbitMQ metrics
```
### 4. Access SigNoz UI
```bash
# Via ingress
open https://monitoring.bakery-ia.local
# Or port-forward
kubectl port-forward -n signoz svc/signoz-frontend 3301:3301
open http://localhost:3301
```
## 📈 Metrics Collected
### Application Metrics (Per Service)
| Metric | Description | Type |
|--------|-------------|------|
| `http_requests_total` | Total HTTP requests | Counter |
| `http_request_duration_seconds` | Request latency | Histogram |
| `active_requests` | Current active requests | Gauge |
### System Metrics (Per Service)
| Metric | Description | Type |
|--------|-------------|------|
| `process.cpu.utilization` | Process CPU % | Gauge |
| `process.memory.usage` | Process memory bytes | Gauge |
| `process.memory.utilization` | Process memory % | Gauge |
| `process.threads.count` | Thread count | Gauge |
| `process.open_file_descriptors` | Open FDs (Unix) | Gauge |
| `system.cpu.utilization` | System CPU % | Gauge |
| `system.memory.usage` | System memory | Gauge |
| `system.memory.utilization` | System memory % | Gauge |
| `system.disk.io.read` | Disk read bytes | Counter |
| `system.disk.io.write` | Disk write bytes | Counter |
| `system.network.io.sent` | Network sent bytes | Counter |
| `system.network.io.received` | Network recv bytes | Counter |
### PostgreSQL Metrics
| Metric | Description |
|--------|-------------|
| `postgresql.backends` | Active connections |
| `postgresql.database.size` | Database size in bytes |
| `postgresql.commits` | Transaction commits |
| `postgresql.rollbacks` | Transaction rollbacks |
| `postgresql.deadlocks` | Deadlock count |
| `postgresql.blocks_read` | Blocks read from disk |
| `postgresql.table.size` | Table size |
| `postgresql.index.size` | Index size |
### Redis Metrics
| Metric | Description |
|--------|-------------|
| `redis.clients.connected` | Connected clients |
| `redis.commands.processed` | Commands processed |
| `redis.keyspace.hits` | Cache hits |
| `redis.keyspace.misses` | Cache misses |
| `redis.memory.used` | Memory usage |
| `redis.memory.fragmentation_ratio` | Fragmentation |
| `redis.db.keys` | Number of keys |
### RabbitMQ Metrics
| Metric | Description |
|--------|-------------|
| `rabbitmq.consumer.count` | Active consumers |
| `rabbitmq.message.current` | Messages in queue |
| `rabbitmq.message.acknowledged` | Messages ACKed |
| `rabbitmq.message.delivered` | Messages delivered |
| `rabbitmq.message.published` | Messages published |
## 🔍 Traces
**Automatic instrumentation for:**
- FastAPI endpoints
- HTTP client requests (HTTPX)
- Redis commands
- PostgreSQL queries (SQLAlchemy)
- RabbitMQ publish/consume
**View traces:**
1. Go to **Services** tab in SigNoz
2. Select a service
3. View individual traces
4. Click trace → See full span tree with timing
## 📝 Logs
**Features:**
- Structured logging with context
- Automatic trace-log correlation
- Searchable by service, level, message, custom fields
**View logs:**
1. Go to **Logs** tab in SigNoz
2. Filter by service: `service_name="auth-service"`
3. Search for specific messages
4. Click log → See full context including trace_id
## 🎛️ Configuration Files
### Services
All services configured in:
```
infrastructure/kubernetes/base/components/*/*-service.yaml
```
Each service has these environment variables:
```yaml
env:
  - name: OTEL_COLLECTOR_ENDPOINT
    value: "http://signoz-otel-collector.signoz.svc.cluster.local:4318"
  - name: OTEL_SERVICE_NAME
    value: "service-name"
  - name: ENABLE_TRACING
    value: "true"
  - name: OTEL_LOGS_EXPORTER
    value: "otlp"
  - name: ENABLE_OTEL_METRICS
    value: "true"
  - name: ENABLE_SYSTEM_METRICS
    value: "true"
```
### SigNoz
Configuration file:
```
infrastructure/helm/signoz-values-dev.yaml
```
Key settings:
- OTLP receivers on ports 4317 (gRPC) and 4318 (HTTP)
- No Prometheus scraping (pure OTLP push)
- ClickHouse backend for storage
- Reduced resources for development
### Database Monitoring
Deployment file:
```
infrastructure/kubernetes/base/monitoring/database-otel-collector.yaml
```
Setup script:
```
infrastructure/kubernetes/setup-database-monitoring.sh
```
## 📚 Documentation
| Document | Description |
|----------|-------------|
| [MONITORING_QUICKSTART.md](./MONITORING_QUICKSTART.md) | 10-minute quick start guide |
| [MONITORING_SETUP.md](./MONITORING_SETUP.md) | Detailed setup and troubleshooting |
| [DATABASE_MONITORING.md](./DATABASE_MONITORING.md) | Database metrics and logs guide |
| This document | Complete overview |
## 🔧 Shared Libraries
### Monitoring Modules
Located in `shared/monitoring/`:
| File | Purpose |
|------|---------|
| `__init__.py` | Package exports |
| `logging.py` | Standard logging setup |
| `logs_exporter.py` | OpenTelemetry logs export |
| `metrics.py` | OpenTelemetry metrics (no Prometheus) |
| `metrics_exporter.py` | OTLP metrics export setup |
| `system_metrics.py` | System metrics collection (CPU, memory, etc.) |
| `tracing.py` | Distributed tracing setup |
| `health_checks.py` | Health check endpoints |
### Usage in Services
```python
from shared.service_base import StandardFastAPIService
# Create service
service = AuthService()
# Create app with auto-configured monitoring
app = service.create_app()
# Monitoring is automatically enabled:
# - Tracing (if ENABLE_TRACING=true)
# - Metrics (if ENABLE_OTEL_METRICS=true)
# - System metrics (if ENABLE_SYSTEM_METRICS=true)
# - Logs (if OTEL_LOGS_EXPORTER=otlp)
```
## 🎨 Dashboard Examples
### Service Health Dashboard
Create a dashboard with:
1. **Request Rate** - `rate(http_requests_total[5m])`
2. **Error Rate** - `rate(http_requests_total{status_code=~"5.."}[5m])`
3. **Latency (P95)** - `histogram_quantile(0.95, http_request_duration_seconds)`
4. **Active Requests** - `active_requests`
5. **CPU Usage** - `process.cpu.utilization`
6. **Memory Usage** - `process.memory.utilization`
### Database Dashboard
1. **PostgreSQL Connections** - `postgresql.backends`
2. **Database Size** - `postgresql.database.size`
3. **Transaction Rate** - `rate(postgresql.commits[5m])`
4. **Redis Hit Rate** - `redis.keyspace.hits / (redis.keyspace.hits + redis.keyspace.misses)`
5. **RabbitMQ Queue Depth** - `rabbitmq.message.current`
## ⚠️ Alerts
### Recommended Alerts
**Application:**
- High error rate (>5% of requests failing)
- High latency (P95 > 1s)
- Service down (no metrics for 5 minutes)
**System:**
- High CPU (>80% for 5 minutes)
- High memory (>90%)
- Disk space low (<10%)
**Database:**
- PostgreSQL connections near max (>80% of max_connections)
- Slow queries (>5s)
- Redis memory high (>80%)
- RabbitMQ queue buildup (>10k messages)
## 🐛 Troubleshooting
### No Data in SigNoz
```bash
# 1. Check service logs
kubectl logs -n bakery-ia deployment/auth-service | grep -i otel
# 2. Check SigNoz collector
kubectl logs -n signoz deployment/signoz-otel-collector
# 3. Test connectivity
kubectl exec -n bakery-ia deployment/auth-service -- \
curl -v http://signoz-otel-collector.signoz.svc.cluster.local:4318
```
### Database Metrics Missing
```bash
# Check database monitoring collector
kubectl logs -n bakery-ia deployment/database-otel-collector
# Verify monitoring user exists
kubectl exec -n bakery-ia deployment/auth-db -- \
psql -U postgres -c "\du otel_monitor"
```
### Traces Not Correlated with Logs
Ensure `OTEL_LOGS_EXPORTER=otlp` is set in service environment variables.
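A quick way to confirm the variable on a running deployment (a sketch, using `auth-service`):
```bash
# List the effective env vars and filter for the exporter flag
kubectl set env deployment/auth-service --list -n bakery-ia | grep OTEL_LOGS_EXPORTER
```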
## 🎯 Best Practices
1. **Always use structured logging** - Add context with key-value pairs
2. **Add custom spans** - For important business operations
3. **Set appropriate log levels** - INFO for production, DEBUG for dev
4. **Monitor your monitors** - Alert on collector failures
5. **Regular retention policy reviews** - Balance cost vs. data retention
6. **Create service dashboards** - One dashboard per service
7. **Set up critical alerts first** - Service down, high error rate
8. **Document custom metrics** - Explain business-specific metrics
## 📊 Performance Impact
**Resource Usage (per service):**
- CPU: +5-10% (instrumentation overhead)
- Memory: +50-100MB (SDK and buffers)
- Network: Minimal (batched export every 60s)
**Latency Impact:**
- Per request: <1ms (async instrumentation)
- No impact on user-facing latency
**Storage (SigNoz):**
- Traces: ~1GB per million requests
- Metrics: ~100MB per service per day
- Logs: Varies by log volume
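To see what each signal actually consumes on disk, you can query ClickHouse directly; a sketch assuming the default pod name `signoz-clickhouse-0` from the Helm install:
```bash
# Per-database on-disk size for active parts
kubectl exec -n signoz signoz-clickhouse-0 -- clickhouse-client -q \
  "SELECT database, formatReadableSize(sum(bytes_on_disk)) AS size
   FROM system.parts WHERE active GROUP BY database ORDER BY sum(bytes_on_disk) DESC"
```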
## 🔐 Security Considerations
1. **Use dedicated monitoring users** - Never use app credentials
2. **Limit collector permissions** - Read-only access to databases
3. **Secure OTLP endpoints** - Use TLS in production
4. **Sanitize sensitive data** - Don't log passwords, tokens
5. **Network policies** - Restrict collector network access
6. **RBAC** - Limit SigNoz UI access per team
## 🚀 Next Steps
1. **Deploy to production** - Update production SigNoz config
2. **Create team dashboards** - Per-service and system-wide views
3. **Set up alerts** - Start with critical service health alerts
4. **Train team** - SigNoz UI usage, query language
5. **Document runbooks** - How to respond to alerts
6. **Optimize retention** - Based on actual data volume
7. **Add custom metrics** - Business-specific KPIs
## 📞 Support
- **SigNoz Community**: https://signoz.io/slack
- **OpenTelemetry Docs**: https://opentelemetry.io/docs/
- **Internal Docs**: See /docs folder
## 📝 Change Log
| Date | Change |
|------|--------|
| 2026-01-08 | Initial implementation - All services configured |
| 2026-01-08 | Database monitoring added (PostgreSQL, Redis, RabbitMQ) |
| 2026-01-08 | System metrics collection implemented |
| 2026-01-08 | Removed Prometheus, pure OpenTelemetry |
---
**Congratulations! Your platform now has complete observability. 🎉**
Every request is traced, every metric is collected, every log is searchable.

View File

@@ -0,0 +1,536 @@
# 📊 Bakery-ia Monitoring System Documentation
## 🎯 Overview
The bakery-ia platform features a comprehensive, modern monitoring system built on **OpenTelemetry** and **SigNoz**. This documentation provides a complete guide to the monitoring architecture, setup, and usage.
## 🚀 Monitoring Architecture
### Core Components
```mermaid
graph TD
A[Microservices] -->|OTLP| B[OpenTelemetry Collector]
B -->|gRPC| C[SigNoz]
C --> D[Traces Dashboard]
C --> E[Metrics Dashboard]
C --> F[Logs Dashboard]
C --> G[Alerts]
```
### Technology Stack
- **Instrumentation**: OpenTelemetry Python SDK
- **Protocol**: OTLP (OpenTelemetry Protocol) over gRPC
- **Backend**: SigNoz (open-source observability platform)
- **Metrics**: Prometheus-compatible metrics via OTLP
- **Traces**: Jaeger-compatible tracing via OTLP
- **Logs**: Structured logging with trace correlation
## 📋 Monitoring Coverage
### Service Coverage (100%)
| Service Category | Services | Monitoring Type | Status |
|-----------------|----------|----------------|--------|
| **Critical Services** | auth, orders, sales, external | Base Class | ✅ Monitored |
| **AI Services** | ai-insights, training | Direct | ✅ Monitored |
| **Data Services** | inventory, procurement, production, forecasting | Base Class | ✅ Monitored |
| **Operational Services** | tenant, notification, distribution | Base Class | ✅ Monitored |
| **Specialized Services** | suppliers, pos, recipes, orchestrator | Base Class | ✅ Monitored |
| **Infrastructure** | gateway, alert-processor, demo-session | Direct | ✅ Monitored |
**Total: 20 services with 100% monitoring coverage**
## 🔧 Monitoring Implementation
### Implementation Patterns
#### 1. Base Class Pattern (16 services)
Services using `StandardFastAPIService` inherit comprehensive monitoring:
```python
from shared.service_base import StandardFastAPIService
class MyService(StandardFastAPIService):
    def __init__(self):
        super().__init__(
            service_name="my-service",
            app_name="My Service",
            description="Service description",
            version="1.0.0",
            # Monitoring enabled by default
            enable_metrics=True,          # ✅ Metrics collection
            enable_tracing=True,          # ✅ Distributed tracing
            enable_health_checks=True     # ✅ Health endpoints
        )
```
#### 2. Direct Pattern (4 services)
Critical services with custom monitoring needs:
```python
# services/ai_insights/app/main.py
from shared.monitoring.metrics import MetricsCollector, add_metrics_middleware
from shared.monitoring.system_metrics import SystemMetricsCollector
# Initialize metrics collectors
metrics_collector = MetricsCollector("ai-insights")
system_metrics = SystemMetricsCollector("ai-insights")
# Add middleware
add_metrics_middleware(app, metrics_collector)
```
### Monitoring Components
#### OpenTelemetry Instrumentation
```python
# Automatic instrumentation in base class
FastAPIInstrumentor.instrument_app(app) # HTTP requests
HTTPXClientInstrumentor().instrument() # Outgoing HTTP
RedisInstrumentor().instrument() # Redis operations
SQLAlchemyInstrumentor().instrument() # Database queries
```
#### Metrics Collection
```python
# Standard metrics automatically collected
metrics_collector.register_counter("http_requests_total", "Total HTTP requests")
metrics_collector.register_histogram("http_request_duration", "Request duration")
metrics_collector.register_gauge("active_requests", "Active requests")
# System metrics automatically collected
system_metrics = SystemMetricsCollector("service-name")
# → CPU, Memory, Disk I/O, Network I/O, Threads, File Descriptors
```
#### Health Checks
```python
# Automatic health check endpoints
GET /health # Overall service health
GET /health/detailed # Detailed health with dependencies
GET /health/ready # Readiness probe
GET /health/live # Liveness probe
```
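A quick way to exercise these endpoints (a sketch; assumes the service listens on port 8000, matching the `prometheus.io/port` annotation used elsewhere in this repo):
```bash
# Forward the service port and hit the probes
kubectl port-forward -n bakery-ia deployment/auth-service 8000:8000 &
curl -s http://localhost:8000/health
curl -s http://localhost:8000/health/ready
```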
## 📊 Metrics Reference
### Standard Metrics (All Services)
| Metric Type | Metric Name | Description | Labels |
|-------------|------------|-------------|--------|
| **HTTP Metrics** | `{service}_http_requests_total` | Total HTTP requests | method, endpoint, status_code |
| **HTTP Metrics** | `{service}_http_request_duration_seconds` | Request duration histogram | method, endpoint, status_code |
| **HTTP Metrics** | `{service}_active_requests` | Currently active requests | - |
| **System Metrics** | `process.cpu.utilization` | Process CPU usage | - |
| **System Metrics** | `process.memory.usage` | Process memory usage | - |
| **System Metrics** | `system.cpu.utilization` | System CPU usage | - |
| **System Metrics** | `system.memory.usage` | System memory usage | - |
| **Database Metrics** | `db.query.duration` | Database query duration | operation, table |
| **Cache Metrics** | `cache.operation.duration` | Cache operation duration | operation, key |
### Custom Metrics (Service-Specific)
Examples of service-specific metrics:
**Auth Service:**
- `auth_registration_total` (by status)
- `auth_login_success_total`
- `auth_login_failure_total` (by reason)
- `auth_registration_duration_seconds`
**Orders Service:**
- `orders_created_total`
- `orders_processed_total` (by status)
- `orders_processing_duration_seconds`
**AI Insights Service:**
- `ai_insights_generated_total`
- `ai_model_inference_duration_seconds`
- `ai_feedback_received_total`
## 🔍 Tracing Guide
### Trace Propagation
Traces automatically flow across service boundaries:
```mermaid
sequenceDiagram
participant Client
participant Gateway
participant Auth
participant Orders
Client->>Gateway: HTTP Request (trace_id: abc123)
Gateway->>Auth: Auth Check (trace_id: abc123)
Auth-->>Gateway: Auth Response (trace_id: abc123)
Gateway->>Orders: Create Order (trace_id: abc123)
Orders-->>Gateway: Order Created (trace_id: abc123)
Gateway-->>Client: Final Response (trace_id: abc123)
```
### Trace Context in Logs
All logs include trace correlation:
```json
{
  "level": "info",
  "message": "Processing order",
  "service": "orders-service",
  "trace_id": "abc123def456",
  "span_id": "789ghi",
  "order_id": "12345",
  "timestamp": "2024-01-08T19:00:00Z"
}
```
### Manual Trace Enhancement
Add custom trace attributes:
```python
from shared.monitoring.tracing import add_trace_attributes, add_trace_event
# Add custom attributes
add_trace_attributes(
    user_id="123",
    tenant_id="abc",
    operation="order_creation"
)
# Add trace events
add_trace_event("order_validation_started")
# ... validation logic ...
add_trace_event("order_validation_completed", status="success")
```
## 🚨 Alerting Guide
### Standard Alerts (Recommended)
| Alert Name | Condition | Severity | Notification |
|------------|-----------|----------|--------------|
| **High Error Rate** | `error_rate > 5%` for 5m | High | PagerDuty + Slack |
| **High Latency** | `p99_latency > 2s` for 5m | High | PagerDuty + Slack |
| **Service Unavailable** | `up == 0` for 1m | Critical | PagerDuty + Slack + Email |
| **High Memory Usage** | `memory_usage > 80%` for 10m | Medium | Slack |
| **High CPU Usage** | `cpu_usage > 90%` for 5m | Medium | Slack |
| **Database Connection Issues** | `db_connections < minimum_pool_size` | High | PagerDuty + Slack |
| **Cache Hit Ratio Low** | `cache_hit_ratio < 70%` for 15m | Low | Slack |
### Creating Alerts in SigNoz
1. **Navigate to Alerts**: SigNoz UI → Alerts → Create Alert
2. **Select Metric**: Choose from available metrics
3. **Set Condition**: Define threshold and duration
4. **Configure Notifications**: Add notification channels
5. **Set Severity**: Critical, High, Medium, Low
6. **Add Description**: Explain alert purpose and resolution steps
### Example Alert Configuration (YAML)
```yaml
# Example PrometheusRule for Kubernetes (Prometheus Operator)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: bakery-ia-alerts
  namespace: monitoring
spec:
  groups:
    - name: service-health
      rules:
        - alert: ServiceDown
          expr: up{service!~"signoz.*"} == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Service {{ $labels.service }} is down"
            description: "{{ $labels.service }} has been down for more than 1 minute"
            runbook: "https://github.com/yourorg/bakery-ia/blob/main/RUNBOOKS.md#service-down"
        - alert: HighErrorRate
          expr: rate(http_requests_total{status_code=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
          for: 5m
          labels:
            severity: high
          annotations:
            summary: "High error rate in {{ $labels.service }}"
            description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
            runbook: "https://github.com/yourorg/bakery-ia/blob/main/RUNBOOKS.md#high-error-rate"
```
## 📈 Dashboard Guide
### Recommended Dashboards
#### 1. Service Overview Dashboard
- HTTP Request Rate
- Error Rate
- Latency Percentiles (p50, p90, p99)
- Active Requests
- System Resource Usage
#### 2. Performance Dashboard
- Request Duration Histogram
- Database Query Performance
- Cache Performance
- External API Call Performance
#### 3. System Health Dashboard
- CPU Usage (Process & System)
- Memory Usage (Process & System)
- Disk I/O
- Network I/O
- File Descriptors
- Thread Count
#### 4. Business Metrics Dashboard
- User Registrations
- Order Volume
- AI Insights Generated
- API Usage by Tenant
### Creating Dashboards in SigNoz
1. **Navigate to Dashboards**: SigNoz UI → Dashboards → Create Dashboard
2. **Add Panels**: Click "Add Panel" and select metric
3. **Configure Visualization**: Choose chart type and settings
4. **Set Time Range**: Default to last 1h, 6h, 24h, 7d
5. **Add Variables**: For dynamic filtering (service, environment)
6. **Save Dashboard**: Give it a descriptive name
## 🛠️ Troubleshooting Guide
### Common Issues & Solutions
#### Issue: No Metrics Appearing in SigNoz
**Checklist:**
- ✅ OpenTelemetry Collector running? `kubectl get pods -n signoz`
- ✅ Service can reach collector? `telnet signoz-otel-collector.signoz 4318`
- ✅ OTLP endpoint configured correctly? Check `OTEL_EXPORTER_OTLP_ENDPOINT`
- ✅ Service logs show OTLP export? Look for "Exporting metrics"
- ✅ No network policies blocking? Check Kubernetes network policies
**Debugging:**
```bash
# Check OpenTelemetry Collector logs
kubectl logs -n signoz -l app=otel-collector
# Check service logs for OTLP errors
kubectl logs -l app=auth-service | grep -i otel
# Test OTLP connectivity from service pod
kubectl exec -it auth-service-pod -- curl -v http://signoz-otel-collector.signoz:4318
```
#### Issue: High Latency in Specific Service
**Checklist:**
- ✅ Database queries slow? Check `db.query.duration` metrics
- ✅ External API calls slow? Check trace waterfall
- ✅ High CPU usage? Check system metrics
- ✅ Memory pressure? Check memory metrics
- ✅ Too many active requests? Check concurrency
**Debugging:**
```python
# Add detailed tracing to suspicious code
from shared.monitoring.tracing import add_trace_event
add_trace_event("database_query_started", table="users")
# ... database query ...
add_trace_event("database_query_completed", duration_ms=45)
```
#### Issue: High Error Rate
**Checklist:**
- ✅ Database connection issues? Check health endpoints
- ✅ External API failures? Check dependency metrics
- ✅ Authentication failures? Check auth service logs
- ✅ Validation errors? Check application logs
- ✅ Rate limiting? Check gateway metrics
**Debugging:**
```bash
# Check error logs with trace correlation
kubectl logs -l app=auth-service | grep -i error | grep -i trace
# Filter traces by error status
# In SigNoz: Add filter http.status_code >= 400
```
## 📚 Runbook Reference
See [RUNBOOKS.md](RUNBOOKS.md) for detailed troubleshooting procedures.
## 🔧 Development Guide
### Adding Custom Metrics
```python
# In any service using direct monitoring
self.metrics_collector.register_counter(
    "custom_metric_name",
    "Description of what this metric tracks",
    labels=["label1", "label2"]  # Optional labels
)

# Increment the counter
self.metrics_collector.increment_counter(
    "custom_metric_name",
    value=1,
    labels={"label1": "value1", "label2": "value2"}
)
```
### Adding Custom Trace Attributes
```python
# Add context to current span
from shared.monitoring.tracing import add_trace_attributes
add_trace_attributes(
    user_id=user.id,
    tenant_id=tenant.id,
    operation="premium_feature_access",
    feature_name="advanced_forecasting"
)
```
### Service-Specific Monitoring Setup
For services needing custom monitoring beyond the base class:
```python
# In your service's __init__ method
from shared.monitoring.system_metrics import SystemMetricsCollector
from shared.monitoring.metrics import MetricsCollector
class MyService(StandardFastAPIService):
    def __init__(self):
        # Call parent constructor first
        super().__init__(...)
        # Add custom metrics collector
        self.custom_metrics = MetricsCollector("my-service")
        # Register custom metrics
        self.custom_metrics.register_counter(
            "business_specific_events",
            "Custom business event counter"
        )
        # Add system metrics if not using base class defaults
        self.system_metrics = SystemMetricsCollector("my-service")
```
## 📊 SigNoz Configuration
### Environment Variables
```env
# OpenTelemetry Collector endpoint
OTEL_EXPORTER_OTLP_ENDPOINT=http://signoz-otel-collector.signoz:4318
# Service-specific configuration
OTEL_SERVICE_NAME=auth-service
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production,k8s.namespace=bakery-ia
# Metrics export interval (default: 60000ms = 60s)
OTEL_METRIC_EXPORT_INTERVAL=60000
# Batch span processor configuration
OTEL_BSP_SCHEDULE_DELAY=5000
OTEL_BSP_MAX_QUEUE_SIZE=2048
OTEL_BSP_MAX_EXPORT_BATCH_SIZE=512
```
### Kubernetes Configuration
```yaml
# Example deployment with monitoring environment variables
apiVersion: apps/v1
kind: Deployment
metadata:
  name: auth-service
spec:
  template:
    spec:
      containers:
        - name: auth-service
          image: auth-service:latest
          env:
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://signoz-otel-collector.signoz:4318"
            - name: OTEL_SERVICE_NAME
              value: "auth-service"
            - name: ENVIRONMENT
              value: "production"
          resources:
            limits:
              cpu: "1"
              memory: "512Mi"
            requests:
              cpu: "200m"
              memory: "256Mi"
```
## 🎯 Best Practices
### Monitoring Best Practices
1. **Use Consistent Naming**: Follow OpenTelemetry semantic conventions
2. **Add Context to Traces**: Include user/tenant IDs in trace attributes
3. **Monitor Dependencies**: Track external API and database performance
4. **Set Appropriate Alerts**: Avoid alert fatigue with meaningful thresholds
5. **Document Metrics**: Keep metrics documentation up to date
6. **Review Regularly**: Update dashboards as services evolve
7. **Test Alerts**: Ensure alerts fire correctly before production
### Performance Best Practices
1. **Batch Metrics Export**: Use default 60s interval for most services
2. **Sample Traces**: Consider sampling for high-volume services
3. **Limit Custom Metrics**: Only track metrics that provide value
4. **Use Histograms Wisely**: Histograms can be resource-intensive
5. **Monitor Monitoring**: Track OTLP export success/failure rates (see the check below)
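For the last point, the collector exposes its own telemetry; a sketch assuming the standard otel-collector self-metrics port 8888 (verify against your Helm values):
```bash
# Export success/failure counters reported by the collector itself
kubectl port-forward -n signoz deployment/signoz-otel-collector 8888:8888 &
curl -s http://localhost:8888/metrics | grep -E 'otelcol_exporter_(sent|send_failed)'
```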
## 📞 Support
### Getting Help
1. **Check Documentation**: This file and RUNBOOKS.md
2. **Review SigNoz Docs**: https://signoz.io/docs/
3. **OpenTelemetry Docs**: https://opentelemetry.io/docs/
4. **Team Channel**: #monitoring in Slack
5. **GitHub Issues**: https://github.com/yourorg/bakery-ia/issues
### Escalation Path
1. **First Line**: Development team (service owners)
2. **Second Line**: DevOps team (monitoring specialists)
3. **Third Line**: SigNoz support (vendor support)
## 🎉 Summary
The bakery-ia monitoring system provides:
- **📊 100% Service Coverage**: All 20 services monitored
- **🚀 Modern Architecture**: OpenTelemetry + SigNoz
- **🔧 Comprehensive Metrics**: System, HTTP, database, cache
- **🔍 Full Observability**: Traces, metrics, logs integrated
- **✅ Production Ready**: Battle-tested and scalable
**All services are fully instrumented and ready for production monitoring!** 🎉

View File

@@ -1,283 +0,0 @@
# SigNoz Monitoring Quick Start
Get complete observability (metrics, logs, traces, system metrics) in under 10 minutes using OpenTelemetry.
## What You'll Get
- **Distributed Tracing** - Complete request flows across all services
- **Application Metrics** - HTTP requests, durations, error rates, custom business metrics
- **System Metrics** - CPU usage, memory usage, disk I/O, network I/O per service
- **Structured Logs** - Searchable logs correlated with traces
- **Unified Dashboard** - Single UI for all telemetry data
**All data pushed via OpenTelemetry OTLP protocol - No Prometheus, no scraping needed!**
## Prerequisites
- Kubernetes cluster running (Kind/Minikube/Production)
- Helm 3.x installed
- kubectl configured
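A quick preflight to confirm the prerequisites:
```bash
# Cluster reachable, Helm 3.x present, correct kubectl context
kubectl cluster-info
helm version --short
kubectl config current-context
```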
## Step 1: Deploy SigNoz
```bash
# Add Helm repository
helm repo add signoz https://charts.signoz.io
helm repo update
# Create namespace
kubectl create namespace signoz
# Install SigNoz
helm install signoz signoz/signoz \
-n signoz \
-f infrastructure/helm/signoz-values-dev.yaml
# Wait for pods to be ready (2-3 minutes)
kubectl wait --for=condition=ready pod -l app=signoz -n signoz --timeout=300s
```
## Step 2: Configure Services
Each service needs OpenTelemetry environment variables. The auth-service is already configured as an example.
### Quick Configuration (for remaining services)
Add these environment variables to each service deployment:
```yaml
env:
  # OpenTelemetry Collector endpoint
  - name: OTEL_COLLECTOR_ENDPOINT
    value: "http://signoz-otel-collector.signoz.svc.cluster.local:4318"
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://signoz-otel-collector.signoz.svc.cluster.local:4318"
  - name: OTEL_SERVICE_NAME
    value: "your-service-name" # e.g., "inventory-service"
  # Enable tracing
  - name: ENABLE_TRACING
    value: "true"
  # Enable logs export
  - name: OTEL_LOGS_EXPORTER
    value: "otlp"
  # Enable metrics export (includes system metrics)
  - name: ENABLE_OTEL_METRICS
    value: "true"
  - name: ENABLE_SYSTEM_METRICS
    value: "true"
```
### Using the Configuration Script
```bash
# Generate configuration patches for all services
./infrastructure/kubernetes/add-monitoring-config.sh
# This creates /tmp/*-otel-patch.yaml files
# Review and manually add to each service deployment
```
## Step 3: Deploy Updated Services
```bash
# Apply updated configurations
kubectl apply -k infrastructure/kubernetes/overlays/dev/
# Or restart services to pick up new env vars
kubectl rollout restart deployment -n bakery-ia
# Wait for rollout
kubectl rollout status deployment -n bakery-ia --timeout=5m
```
## Step 4: Access SigNoz UI
### Via Ingress
```bash
# Add to /etc/hosts if needed
echo "127.0.0.1 monitoring.bakery-ia.local" | sudo tee -a /etc/hosts
# Access UI
open https://monitoring.bakery-ia.local
```
### Via Port Forward
```bash
kubectl port-forward -n signoz svc/signoz-frontend 3301:3301
open http://localhost:3301
```
## Step 5: Explore Your Data
### Traces
1. Go to **Services** tab
2. See all your services listed
3. Click on a service → View traces
4. Click on a trace → See detailed span tree with timing
### Metrics
**HTTP Metrics** (automatically collected):
- `http_requests_total` - Total requests by method, endpoint, status
- `http_request_duration_seconds` - Request latency
- `active_requests` - Current active HTTP requests
**System Metrics** (automatically collected per service):
- `process.cpu.utilization` - Process CPU usage %
- `process.memory.usage` - Process memory in bytes
- `process.memory.utilization` - Process memory %
- `process.threads.count` - Number of threads
- `system.cpu.utilization` - System-wide CPU %
- `system.memory.usage` - System memory usage
- `system.disk.io.read` - Disk bytes read
- `system.disk.io.write` - Disk bytes written
- `system.network.io.sent` - Network bytes sent
- `system.network.io.received` - Network bytes received
**Custom Business Metrics** (if configured):
- User registrations
- Orders created
- Login attempts
- etc.
### Logs
1. Go to **Logs** tab
2. Filter by service: `service_name="auth-service"`
3. Search for specific messages
4. See structured fields (user_id, tenant_id, etc.)
### Trace-Log Correlation
1. Find a trace in **Traces** tab
2. Note the `trace_id`
3. Go to **Logs** tab
4. Filter: `trace_id="<the-trace-id>"`
5. See all logs for that specific request!
## Verification Commands
```bash
# Check if services are sending telemetry
kubectl logs -n bakery-ia deployment/auth-service | grep -i "telemetry\|otel"
# Check SigNoz collector is receiving data
kubectl logs -n signoz deployment/signoz-otel-collector | tail -50
# Test connectivity to collector
kubectl exec -n bakery-ia deployment/auth-service -- \
curl -v http://signoz-otel-collector.signoz.svc.cluster.local:4318
```
## Common Issues
### No data in SigNoz
```bash
# 1. Verify environment variables are set
kubectl get deployment auth-service -n bakery-ia -o yaml | grep OTEL
# 2. Check collector logs
kubectl logs -n signoz deployment/signoz-otel-collector
# 3. Restart service
kubectl rollout restart deployment/auth-service -n bakery-ia
```
### Services not appearing
```bash
# Check network connectivity
kubectl exec -n bakery-ia deployment/auth-service -- \
curl http://signoz-otel-collector.signoz.svc.cluster.local:4318
# Should return: connection successful (not connection refused)
```
## Architecture
```
┌─────────────────────────────────────────────┐
│ Your Microservices │
│ ┌──────┐ ┌──────┐ ┌──────┐ │
│ │ auth │ │ inv │ │orders│ ... │
│ └──┬───┘ └──┬───┘ └──┬───┘ │
│ │ │ │ │
│ └─────────┴─────────┘ │
│ │ │
│ OTLP Push │
│ (traces, metrics, logs) │
└──────────────┼──────────────────────────────┘
┌──────────────────────────────────────────────┐
│ SigNoz OpenTelemetry Collector │
│ :4317 (gRPC) :4318 (HTTP) │
│ │
│ Receivers: OTLP only (no Prometheus) │
│ Processors: batch, memory_limiter │
│ Exporters: ClickHouse │
└──────────────┼──────────────────────────────┘
┌──────────────────────────────────────────────┐
│ ClickHouse Database │
│ Stores: traces, metrics, logs │
└──────────────┼──────────────────────────────┘
┌──────────────────────────────────────────────┐
│ SigNoz Frontend UI │
│ monitoring.bakery-ia.local or :3301 │
└──────────────────────────────────────────────┘
```
## What Makes This Different
**Pure OpenTelemetry** - No Prometheus involved:
- ✅ All metrics pushed via OTLP (not scraped)
- ✅ Automatic system metrics collection (CPU, memory, disk, network)
- ✅ Unified data model for all telemetry
- ✅ Native trace-metric-log correlation
- ✅ Lower resource usage (no scraping overhead)
## Next Steps
- **Create Dashboards** - Build custom views for your metrics
- **Set Up Alerts** - Configure alerts for errors, latency, resource usage
- **Explore System Metrics** - Monitor CPU, memory per service
- **Query Logs** - Use powerful log query language
- **Correlate Everything** - Jump from traces → logs → metrics
## Need Help?
- [Full Documentation](./MONITORING_SETUP.md) - Detailed setup guide
- [SigNoz Docs](https://signoz.io/docs/) - Official documentation
- [OpenTelemetry Python](https://opentelemetry.io/docs/instrumentation/python/) - Python instrumentation
---
**Metrics You Get Out of the Box:**
| Category | Metrics | Description |
|----------|---------|-------------|
| HTTP | `http_requests_total` | Total requests by method, endpoint, status |
| HTTP | `http_request_duration_seconds` | Request latency histogram |
| HTTP | `active_requests` | Current active requests |
| Process | `process.cpu.utilization` | Process CPU usage % |
| Process | `process.memory.usage` | Process memory in bytes |
| Process | `process.memory.utilization` | Process memory % |
| Process | `process.threads.count` | Thread count |
| System | `system.cpu.utilization` | System CPU % |
| System | `system.memory.usage` | System memory usage |
| System | `system.memory.utilization` | System memory % |
| Disk | `system.disk.io.read` | Disk read bytes |
| Disk | `system.disk.io.write` | Disk write bytes |
| Network | `system.network.io.sent` | Network sent bytes |
| Network | `system.network.io.received` | Network received bytes |

View File

@@ -1,511 +0,0 @@
# SigNoz Monitoring Setup Guide
This guide explains how to set up complete observability for the Bakery IA platform using SigNoz, which provides unified metrics, logs, and traces visualization.
## Table of Contents
1. [Architecture Overview](#architecture-overview)
2. [Prerequisites](#prerequisites)
3. [SigNoz Deployment](#signoz-deployment)
4. [Service Configuration](#service-configuration)
5. [Data Flow](#data-flow)
6. [Verification](#verification)
7. [Troubleshooting](#troubleshooting)
## Architecture Overview
The monitoring setup uses a three-tier approach:
```
┌─────────────────────────────────────────────────────────────┐
│ Bakery IA Services │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Auth │ │ Inventory│ │ Orders │ │ ... │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│ │ │ │ │ │
│ └─────────────┴─────────────┴─────────────┘ │
│ │ │
│ OpenTelemetry Protocol (OTLP) │
│ Traces / Metrics / Logs │
└──────────────────────────┼───────────────────────────────────┘
┌──────────────────────────────────────────────────────────────┐
│ SigNoz OpenTelemetry Collector │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ Receivers: │ │
│ │ - OTLP gRPC (4317) - OTLP HTTP (4318) │ │
│ │ - Prometheus Scraper (service discovery) │ │
│ └────────────────────┬───────────────────────────────────┘ │
│ │ │
│ ┌────────────────────┴───────────────────────────────────┐ │
│ │ Processors: batch, memory_limiter, resourcedetection │ │
│ └────────────────────┬───────────────────────────────────┘ │
│ │ │
│ ┌────────────────────┴───────────────────────────────────┐ │
│ │ Exporters: ClickHouse (traces, metrics, logs) │ │
│ └────────────────────────────────────────────────────────┘ │
└──────────────────────────┼───────────────────────────────────┘
┌──────────────────────────────────────────────────────────────┐
│ ClickHouse Database │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Traces │ │ Metrics │ │ Logs │ │
│ └──────────┘ └──────────┘ └──────────┘ │
└──────────────────────────┼───────────────────────────────────┘
┌──────────────────────────────────────────────────────────────┐
│ SigNoz Query Service │
│ & Frontend UI │
│ https://monitoring.bakery-ia.local │
└──────────────────────────────────────────────────────────────┘
```
### Key Components
1. **Services**: Generate telemetry data using OpenTelemetry SDK
2. **OpenTelemetry Collector**: Receives, processes, and exports telemetry
3. **ClickHouse**: Stores traces, metrics, and logs
4. **SigNoz UI**: Query and visualize all telemetry data
## Prerequisites
- Kubernetes cluster (Kind, Minikube, or production cluster)
- Helm 3.x installed
- kubectl configured
- At least 4GB RAM available for SigNoz components
## SigNoz Deployment
### 1. Add SigNoz Helm Repository
```bash
helm repo add signoz https://charts.signoz.io
helm repo update
```
### 2. Create Namespace
```bash
kubectl create namespace signoz
```
### 3. Deploy SigNoz
```bash
# For development environment
helm install signoz signoz/signoz \
-n signoz \
-f infrastructure/helm/signoz-values-dev.yaml
# For production environment
helm install signoz signoz/signoz \
-n signoz \
-f infrastructure/helm/signoz-values-prod.yaml
```
### 4. Verify Deployment
```bash
# Check all pods are running
kubectl get pods -n signoz
# Expected output:
# signoz-alertmanager-0
# signoz-clickhouse-0
# signoz-frontend-*
# signoz-otel-collector-*
# signoz-query-service-*
# Check services
kubectl get svc -n signoz
```
## Service Configuration
Each microservice needs to be configured to send telemetry to SigNoz.
### Environment Variables
Add these environment variables to your service deployments:
```yaml
env:
# OpenTelemetry Collector endpoint
- name: OTEL_COLLECTOR_ENDPOINT
value: "http://signoz-otel-collector.signoz.svc.cluster.local:4318"
- name: OTEL_EXPORTER_OTLP_ENDPOINT
value: "http://signoz-otel-collector.signoz.svc.cluster.local:4318"
# Service identification
- name: OTEL_SERVICE_NAME
value: "your-service-name" # e.g., "auth-service"
# Enable tracing
- name: ENABLE_TRACING
value: "true"
# Enable logs export
- name: OTEL_LOGS_EXPORTER
value: "otlp"
- name: OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED
value: "true"
# Enable metrics export (optional, default: true)
- name: ENABLE_OTEL_METRICS
value: "true"
```
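The shared service base presumably consumes these variables at startup along the following lines. This is illustrative only; the actual `shared.service_base` logic may differ:

```python
import os

# Hypothetical startup snippet -- names mirror the variables above
otlp_endpoint = os.getenv(
    "OTEL_EXPORTER_OTLP_ENDPOINT",
    "http://signoz-otel-collector.signoz.svc.cluster.local:4318",
)
service_name = os.getenv("OTEL_SERVICE_NAME", "unknown-service")
tracing_enabled = os.getenv("ENABLE_TRACING", "false").lower() == "true"
metrics_enabled = os.getenv("ENABLE_OTEL_METRICS", "true").lower() == "true"
```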
### Prometheus Annotations
Add these annotations to enable Prometheus metrics scraping:
```yaml
metadata:
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8000"
prometheus.io/path: "/metrics"
```
### Complete Example
See [infrastructure/kubernetes/base/components/auth/auth-service.yaml](../infrastructure/kubernetes/base/components/auth/auth-service.yaml) for a complete example.
### Automated Configuration Script
Use the provided script to add monitoring configuration to all services:
```bash
# Run from project root
./infrastructure/kubernetes/add-monitoring-config.sh
```
## Data Flow
### 1. Traces
**Automatic Instrumentation:**
```python
# In your service's main.py
from shared.service_base import StandardFastAPIService
service = AuthService() # Extends StandardFastAPIService
app = service.create_app()
# Tracing is automatically enabled if ENABLE_TRACING=true
# All FastAPI endpoints, HTTP clients, Redis, PostgreSQL are auto-instrumented
```
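Under the hood, the shared base plausibly applies the standard OpenTelemetry contrib instrumentors, roughly as below. Here `app` is the FastAPI instance from the snippet above; the actual internals of `StandardFastAPIService` may differ:

```python
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.redis import RedisInstrumentor

FastAPIInstrumentor.instrument_app(app)  # every endpoint becomes a span
HTTPXClientInstrumentor().instrument()   # outgoing HTTP client calls
RedisInstrumentor().instrument()         # Redis commands
```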
**Manual Instrumentation:**
```python
from shared.monitoring.tracing import add_trace_attributes, add_trace_event
# Add custom attributes to current span
add_trace_attributes(
user_id="123",
tenant_id="abc",
operation="user_registration"
)
# Add events for important operations
add_trace_event("user_authenticated", user_id="123", method="jwt")
```
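These helpers are likely thin wrappers over the public span API. A sketch of one possible implementation (an assumption; the shared library may differ):

```python
from opentelemetry import trace

def add_trace_attributes(**attributes):
    # Attach key-value attributes to whatever span is currently active
    span = trace.get_current_span()
    for key, value in attributes.items():
        span.set_attribute(key, value)

def add_trace_event(name, **attributes):
    # Record a timestamped event on the current span
    trace.get_current_span().add_event(name, attributes=attributes)
```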
### 2. Metrics
**Dual Export Strategy:**
Services export metrics in two ways:
1. **Prometheus format** at `/metrics` endpoint (scraped by SigNoz)
2. **OTLP push** directly to SigNoz collector (real-time)
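A sketch of how both paths can coexist in one process, using `prometheus_client` for the pull path and the OTel SDK for the push path. Here `app` is assumed to be the FastAPI instance; the shared base wires this up for you:

```python
from prometheus_client import make_asgi_app
from opentelemetry.metrics import set_meter_provider
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter

# Path 1: expose /metrics for the SigNoz Prometheus scraper (pull)
app.mount("/metrics", make_asgi_app())

# Path 2: push metrics to the collector over OTLP at a fixed interval
reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(
        endpoint="http://signoz-otel-collector.signoz.svc.cluster.local:4318/v1/metrics"
    )
)
set_meter_provider(MeterProvider(metric_readers=[reader]))
```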
**Built-in Metrics:**
```python
# Automatically collected by StandardFastAPIService:
# - http_requests_total
# - http_request_duration_seconds
# - active_connections
```
**Custom Metrics:**
```python
# Define in your service
custom_metrics = {
"user_registrations": {
"type": "counter",
"description": "Total user registrations",
"labels": ["status"]
},
"login_duration_seconds": {
"type": "histogram",
"description": "Login request duration"
}
}
service = AuthService(custom_metrics=custom_metrics)
# Use in your code
service.metrics_collector.increment_counter(
"user_registrations",
labels={"status": "success"}
)
```
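The histogram defined above would be fed in a similar way. Note that `observe_histogram` below is a hypothetical method name, since the exact `metrics_collector` interface is defined by the shared library:

```python
import time

start = time.monotonic()
# ... handle the login request ...
# Hypothetical helper name; check the shared library for the real one
service.metrics_collector.observe_histogram(
    "login_duration_seconds", time.monotonic() - start
)
```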
### 3. Logs
**Automatic Export:**
```python
# Logs are automatically exported if OTEL_LOGS_EXPORTER=otlp
import logging
logger = logging.getLogger(__name__)
# This will appear in SigNoz
logger.info("User logged in", extra={"user_id": "123", "tenant_id": "abc"})
```
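Behind that flag, the export path is typically wired with the standard SDK log modules, roughly as follows. This is a sketch; the `_logs` package has been provisional in some SDK releases, so pin your `opentelemetry-sdk` version:

```python
import logging
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry.exporter.otlp.proto.http._log_exporter import OTLPLogExporter

provider = LoggerProvider()
provider.add_log_record_processor(
    BatchLogRecordProcessor(
        OTLPLogExporter(
            endpoint="http://signoz-otel-collector.signoz.svc.cluster.local:4318/v1/logs"
        )
    )
)
# Route stdlib logging records through the OTLP pipeline
logging.getLogger().addHandler(LoggingHandler(logger_provider=provider))
```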
**Structured Logging with Context:**
```python
from shared.monitoring.logs_exporter import add_log_context
# Add context that persists across log calls
log_ctx = add_log_context(
request_id="req_123",
user_id="user_456",
tenant_id="tenant_789"
)
# All subsequent logs include this context
log_ctx.info("Processing order") # Includes request_id, user_id, tenant_id
```
**Trace Correlation:**
```python
from shared.monitoring.logs_exporter import get_current_trace_context
# Get trace context for correlation
trace_ctx = get_current_trace_context()
logger.info("Processing request", extra=trace_ctx)
# Logs now include trace_id and span_id for correlation
```
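A likely shape for `get_current_trace_context`, built on the public span API (an assumption; the shared helper may format the IDs differently):

```python
from opentelemetry import trace

def get_current_trace_context():
    ctx = trace.get_current_span().get_span_context()
    if not ctx.is_valid:
        return {}  # no active span, nothing to correlate
    # W3C Trace Context formatting: 32 hex chars for trace_id, 16 for span_id
    return {
        "trace_id": format(ctx.trace_id, "032x"),
        "span_id": format(ctx.span_id, "016x"),
    }
```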
## Verification
### 1. Check Service Health
```bash
# Check that services are exporting telemetry
kubectl logs -n bakery-ia deployment/auth-service | grep -i "telemetry\|otel\|signoz"
# Expected output includes:
# - "Distributed tracing configured"
# - "OpenTelemetry logs export configured"
# - "OpenTelemetry metrics export configured"
```
### 2. Access SigNoz UI
```bash
# Port-forward (for local development)
kubectl port-forward -n signoz svc/signoz-frontend 3301:3301
# Or via Ingress
open https://monitoring.bakery-ia.local
```
### 3. Verify Data Ingestion
**Traces:**
1. Go to SigNoz UI → Traces
2. You should see traces from your services
3. Click on a trace to see the full span tree
**Metrics:**
1. Go to SigNoz UI → Metrics
2. Query: `http_requests_total`
3. Filter by service: `service="auth-service"`
**Logs:**
1. Go to SigNoz UI → Logs
2. Filter by service: `service_name="auth-service"`
3. Search for specific log messages
### 4. Test Trace-Log Correlation
1. Find a trace in SigNoz UI
2. Copy the `trace_id`
3. Go to Logs tab
4. Search: `trace_id="<your-trace-id>"`
5. You should see all logs for that trace
## Troubleshooting
### No Data in SigNoz
**1. Check OpenTelemetry Collector:**
```bash
# Check collector logs
kubectl logs -n signoz deployment/signoz-otel-collector
# Should see:
# - "Receiver is starting"
# - "Exporter is starting"
# - No error messages
```
**2. Check Service Configuration:**
```bash
# Verify environment variables
kubectl get deployment auth-service -n bakery-ia -o yaml | grep -A 20 "env:"
# Verify annotations
kubectl get deployment auth-service -n bakery-ia -o yaml | grep -A 5 "annotations:"
```
**3. Check Network Connectivity:**
```bash
# Test from service pod
kubectl exec -n bakery-ia deployment/auth-service -- \
curl -v http://signoz-otel-collector.signoz.svc.cluster.local:4318/v1/traces
# Should return: 405 Method Not Allowed (POST required)
# If connection refused, check network policies
```
### Traces Not Appearing
**Check instrumentation:**
```python
# Verify tracing is enabled
import os
print(os.getenv("ENABLE_TRACING")) # Should be "true"
print(os.getenv("OTEL_COLLECTOR_ENDPOINT")) # Should be set
```
**Check trace sampling:**
```bash
# Verify sampling rate (default 100%)
kubectl logs -n bakery-ia deployment/auth-service | grep "sampling"
```
### Metrics Not Appearing
**1. Verify Prometheus annotations:**
```bash
kubectl get pods -n bakery-ia -o yaml | grep "prometheus.io"
```
**2. Test metrics endpoint:**
```bash
# Port-forward service
kubectl port-forward -n bakery-ia deployment/auth-service 8000:8000
# Test endpoint
curl http://localhost:8000/metrics
# Should return Prometheus format metrics
```
**3. Check SigNoz scrape configuration:**
```bash
# Check collector config
kubectl get configmap -n signoz signoz-otel-collector -o yaml | grep -A 30 "prometheus:"
```
### Logs Not Appearing
**1. Verify log export is enabled:**
```bash
kubectl get deployment auth-service -n bakery-ia -o yaml | grep -A 1 OTEL_LOGS_EXPORTER
# Should show the OTEL_LOGS_EXPORTER variable with value: otlp
```
**2. Check log format:**
```bash
# Logs should be JSON formatted
kubectl logs -n bakery-ia deployment/auth-service | head -5
```
**3. Verify OTLP endpoint:**
```bash
# Test logs endpoint
kubectl exec -n bakery-ia deployment/auth-service -- \
curl -X POST http://signoz-otel-collector.signoz.svc.cluster.local:4318/v1/logs \
-H "Content-Type: application/json" \
-d '{"resourceLogs":[]}'
# Should return 200 OK or 400 Bad Request (not connection error)
```
## Performance Tuning
### For Development
The default configuration is optimized for local development with minimal resources.
### For Production
Update the following in `signoz-values-prod.yaml`:
```yaml
# Increase collector resources
otelCollector:
resources:
requests:
cpu: 500m
memory: 1Gi
limits:
cpu: 2000m
memory: 2Gi
# Increase batch sizes
config:
processors:
batch:
timeout: 10s
send_batch_size: 10000 # Increased from 1024
# Add more replicas
replicaCount: 2
```
## Best Practices
1. **Use Structured Logging**: Always use key-value pairs for better querying (see the sketch after this list)
2. **Add Context**: Include user_id, tenant_id, request_id in logs
3. **Trace Business Operations**: Add custom spans for important operations
4. **Monitor Collector Health**: Set up alerts for collector errors
5. **Retention Policy**: Configure ClickHouse retention based on needs
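Practices 1 through 3 often meet in the same function. A short illustrative sketch (function and attribute names are invented for the example):

```python
import logging
from opentelemetry import trace

logger = logging.getLogger(__name__)
tracer = trace.get_tracer(__name__)

def place_order(order_id: str, user_id: str, tenant_id: str):
    # Practice 3: a custom span around a business operation
    with tracer.start_as_current_span("orders.place_order") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("tenant.id", tenant_id)
        # Practices 1-2: structured log with identifying context
        logger.info(
            "Order placed",
            extra={"order_id": order_id, "user_id": user_id, "tenant_id": tenant_id},
        )
```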
## Additional Resources
- [SigNoz Documentation](https://signoz.io/docs/)
- [OpenTelemetry Python](https://opentelemetry.io/docs/instrumentation/python/)
- [Bakery IA Monitoring Shared Library](../shared/monitoring/)
## Support
For issues or questions:
1. Check SigNoz community: https://signoz.io/slack
2. Review OpenTelemetry docs: https://opentelemetry.io/docs/
3. Create issue in project repository