Improve metrics

Urtzi Alfaro
2026-01-08 20:48:24 +01:00
parent 29d19087f1
commit e8fda39e50
21 changed files with 615 additions and 3019 deletions


@@ -1,134 +0,0 @@
# Docker Hub Quick Start Guide
## 🚀 Quick Setup (3 Steps)
### 1. Create Docker Hub Secrets
```bash
./infrastructure/kubernetes/setup-dockerhub-secrets.sh
```
This creates the `dockerhub-creds` secret in all namespaces with your Docker Hub credentials.
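To confirm the secret was created as expected, you can list it per namespace and decode the stored Docker config (a quick check; the `jq` pipe is optional and assumes `jq` is installed):
```bash
# The secret should exist in every namespace the script targets
for ns in bakery-ia bakery-ia-dev bakery-ia-prod default; do
  kubectl get secret dockerhub-creds -n "$ns"
done
# Decode the stored registry credentials to verify the server and username
kubectl get secret dockerhub-creds -n bakery-ia \
  -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d | jq .
```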
### 2. Apply Updated Manifests
```bash
# Development environment
kubectl apply -k infrastructure/kubernetes/overlays/dev
# Production environment
kubectl apply -k infrastructure/kubernetes/overlays/prod
```
### 3. Verify Pods Are Running
```bash
kubectl get pods -n bakery-ia
```
All pods should now be able to pull images from Docker Hub!
---
## 🔧 What Was Configured
**Docker Hub Credentials**
- Username: `uals`
- Access Token: `dckr_pat_zzEY5Q58x1S0puraIoKEtbpue3A`
- Email: `ualfaro@gmail.com`
**Kubernetes Secrets**
- Created in: `bakery-ia`, `bakery-ia-dev`, `bakery-ia-prod`, `default`
- Secret name: `dockerhub-creds`
**Manifests Updated (47 files)**
- All service deployments
- All database deployments
- All migration jobs
- All cronjobs and standalone jobs
**Tiltfile Configuration**
- Supports both local registry and Docker Hub
- Use `export USE_DOCKERHUB=true` to enable Docker Hub mode
---
## 📖 Full Documentation
See [docs/DOCKERHUB_SETUP.md](docs/DOCKERHUB_SETUP.md) for:
- Detailed configuration steps
- Troubleshooting guide
- Security best practices
- Image management
- Rate limits information
---
## 🔄 Using with Tilt (Local Development)
**Default: Local Registry**
```bash
tilt up
```
**Docker Hub Mode**
```bash
export USE_DOCKERHUB=true
export DOCKERHUB_USERNAME=uals
docker login -u uals
tilt up
```
---
## 🐳 Pushing Images to Docker Hub
```bash
# Login first
docker login -u uals
# Use the automated script
./scripts/tag-and-push-images.sh
```
---
## ⚠️ Troubleshooting
**Problem: ImagePullBackOff**
```bash
# Check if secret exists
kubectl get secret dockerhub-creds -n bakery-ia
# Recreate secret if needed
./infrastructure/kubernetes/setup-dockerhub-secrets.sh
```
**Problem: Pods not using new credentials**
```bash
# Restart deployment
kubectl rollout restart deployment/<deployment-name> -n bakery-ia
```
---
## 📝 Scripts Reference
| Script | Purpose |
|--------|---------|
| `infrastructure/kubernetes/setup-dockerhub-secrets.sh` | Create Docker Hub secrets in all namespaces |
| `infrastructure/kubernetes/add-image-pull-secrets.sh` | Add imagePullSecrets to manifests (already done) |
| `scripts/tag-and-push-images.sh` | Tag and push all custom images to Docker Hub |
---
## ✅ Verification Checklist
- [ ] Docker Hub secret created: `kubectl get secret dockerhub-creds -n bakery-ia`
- [ ] Manifests applied: `kubectl apply -k infrastructure/kubernetes/overlays/dev`
- [ ] Pods running: `kubectl get pods -n bakery-ia`
- [ ] No ImagePullBackOff errors: `kubectl get events -n bakery-ia`
---
**Need help?** See the full documentation at [docs/DOCKERHUB_SETUP.md](docs/DOCKERHUB_SETUP.md)


@@ -1,337 +0,0 @@
# HTTPS in Development Environment
## Overview
The development environment now uses HTTPS by default to match production behavior and catch SSL-related issues early.
**Benefits:**
- ✅ Matches production HTTPS behavior
- ✅ Tests SSL/TLS configurations
- ✅ Catches mixed content warnings
- ✅ Tests secure cookie handling
- ✅ Better dev-prod parity
---
## Quick Start
### 1. Deploy with HTTPS Enabled
```bash
# Start development environment
skaffold dev --profile=dev
# Wait for certificate to be issued
kubectl get certificate -n bakery-ia
# You should see:
# NAME READY SECRET AGE
# bakery-dev-tls-cert True bakery-dev-tls-cert 1m
```
### 2. Access Your Application
```bash
# Access via HTTPS (will show certificate warning in browser)
open https://localhost
# Or via curl (use -k to skip certificate verification)
curl -k https://localhost/api/health
```
---
## Trust the Self-Signed Certificate
To avoid browser certificate warnings, you need to trust the self-signed certificate.
### Option 1: Accept Browser Warning (Quick & Easy)
When you visit `https://localhost`:
1. Browser shows "Your connection is not private" or similar
2. Click "Advanced" or "Show details"
3. Click "Proceed to localhost" or "Accept the risk"
4. The certificate warning appears only once per browser session
### Option 2: Trust Certificate in System (Recommended)
#### On macOS:
```bash
# 1. Export the certificate from Kubernetes
kubectl get secret bakery-dev-tls-cert -n bakery-ia -o jsonpath='{.data.tls\.crt}' | base64 -d > /tmp/bakery-dev-cert.crt
# 2. Add to Keychain
sudo security add-trusted-cert -d -r trustRoot -k /Library/Keychains/System.keychain /tmp/bakery-dev-cert.crt
# 3. Verify
security find-certificate -c localhost -a
# 4. Cleanup
rm /tmp/bakery-dev-cert.crt
```
**Alternative (GUI):**
1. Export certificate: `kubectl get secret bakery-dev-tls-cert -n bakery-ia -o jsonpath='{.data.tls\.crt}' | base64 -d > bakery-dev-cert.crt`
2. Double-click the `.crt` file to open Keychain Access
3. Find "localhost" certificate
4. Double-click → Trust → "Always Trust"
5. Close and enter your password
#### On Linux:
```bash
# 1. Export the certificate
kubectl get secret bakery-dev-tls-cert -n bakery-ia -o jsonpath='{.data.tls\.crt}' | base64 -d | sudo tee /usr/local/share/ca-certificates/bakery-dev.crt
# 2. Update CA certificates
sudo update-ca-certificates
# 3. For browsers (Chromium/Chrome)
mkdir -p $HOME/.pki/nssdb
certutil -d sql:$HOME/.pki/nssdb -A -t "P,," -n "Bakery Dev" -i /usr/local/share/ca-certificates/bakery-dev.crt
```
#### On Windows:
```powershell
# 1. Export the certificate
kubectl get secret bakery-dev-tls-cert -n bakery-ia -o jsonpath='{.data.tls\.crt}' | Out-File -Encoding ASCII cert.b64
certutil -decode cert.b64 bakery-dev-cert.crt
# 2. Import to Trusted Root
Import-Certificate -FilePath .\bakery-dev-cert.crt -CertStoreLocation Cert:\LocalMachine\Root
# Or use GUI:
# - Double-click bakery-dev-cert.crt
# - Install Certificate
# - Store Location: Local Machine
# - Place in: Trusted Root Certification Authorities
```
---
## Testing HTTPS
### Test with curl
```bash
# Without certificate verification (quick test)
curl -k https://localhost/api/health
# With certificate verification (after trusting cert)
curl https://localhost/api/health
# Check certificate details
curl -vI https://localhost/api/health 2>&1 | grep -A 10 "Server certificate"
# Test CORS with HTTPS
curl -H "Origin: https://localhost:3000" \
-H "Access-Control-Request-Method: POST" \
-X OPTIONS https://localhost/api/health
```
### Test with Browser
1. Open `https://localhost`
2. Check for SSL/TLS padlock in address bar
3. Click padlock → View certificate
4. Verify (see the openssl check after this list):
- Issued to: localhost
- Issued by: localhost (self-signed)
- Valid for: 90 days
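The same details can be checked from the terminal (a sketch using openssl, assuming the ingress listens on port 443):
```bash
# Print subject, issuer, and validity dates of the certificate served on localhost
openssl s_client -connect localhost:443 -servername localhost </dev/null 2>/dev/null \
  | openssl x509 -noout -subject -issuer -dates
```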
### Test Frontend
```bash
# Update your frontend .env to use HTTPS
echo "VITE_API_URL=https://localhost/api" > frontend/.env.local
# Frontend should now make HTTPS requests
```
---
## Certificate Details
### Certificate Specifications
- **Type**: Self-signed (for development)
- **Algorithm**: RSA 2048-bit
- **Validity**: 90 days (auto-renews 15 days before expiration)
- **Common Name**: localhost
- **DNS Names**:
- localhost
- bakery-ia.local
- api.bakery-ia.local
- *.bakery-ia.local
- **IP Addresses**: 127.0.0.1, ::1
### Certificate Issuer
- **Issuer**: `selfsigned-issuer` (cert-manager ClusterIssuer)
- **Auto-renewal**: Managed by cert-manager (see the check below)
- **Secret Name**: `bakery-dev-tls-cert`
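To see when the current certificate expires and when cert-manager intends to renew it (a sketch; `status.renewalTime` is populated by recent cert-manager releases):
```bash
kubectl get certificate bakery-dev-tls-cert -n bakery-ia \
  -o jsonpath='Expires: {.status.notAfter}{"\n"}Renews:  {.status.renewalTime}{"\n"}'
```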
---
## Troubleshooting
### Certificate Not Issued
```bash
# Check certificate status
kubectl describe certificate bakery-dev-tls-cert -n bakery-ia
# Check cert-manager logs
kubectl logs -n cert-manager deployment/cert-manager
# Check if cert-manager is installed
kubectl get pods -n cert-manager
# If cert-manager is not installed:
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.2/cert-manager.yaml
```
### Certificate Warning in Browser
**Normal for self-signed certificates!** Choose one:
1. Click "Proceed" (quick, temporary)
2. Trust the certificate in your system (permanent)
### Mixed Content Warnings
If you see "mixed content" errors:
- Ensure all API calls use HTTPS
- Check for hardcoded HTTP URLs (a grep sketch follows this list)
- Update `VITE_API_URL` to use HTTPS
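A quick way to find hardcoded HTTP URLs (a sketch; the directory and file extensions are assumptions about the frontend layout):
```bash
# Search the frontend source for plain http:// references
grep -rn "http://" frontend/ --include="*.ts" --include="*.tsx" --include="*.vue" --include="*.js"
```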
### Certificate Expired
```bash
# Check expiration
kubectl get certificate bakery-dev-tls-cert -n bakery-ia -o jsonpath='{.status.notAfter}'
# Force renewal
kubectl delete certificate bakery-dev-tls-cert -n bakery-ia
kubectl apply -k infrastructure/kubernetes/overlays/dev
# cert-manager will automatically recreate it
```
### Browser Shows "NET::ERR_CERT_AUTHORITY_INVALID"
This is expected for self-signed certificates. Options:
1. Click "Advanced" → "Proceed to localhost"
2. Trust the certificate (see instructions above)
3. Use curl with `-k` flag for testing
---
## Disable HTTPS (Not Recommended)
If you need to temporarily disable HTTPS:
```bash
# Edit dev-ingress.yaml
vim infrastructure/kubernetes/overlays/dev/dev-ingress.yaml
# Change:
# nginx.ingress.kubernetes.io/ssl-redirect: "true" → "false"
# nginx.ingress.kubernetes.io/force-ssl-redirect: "true" → "false"
# Comment out the tls section:
# tls:
# - hosts:
# - localhost
# secretName: bakery-dev-tls-cert
# Redeploy
skaffold dev --profile=dev
```
---
## Differences from Production
| Aspect | Development | Production |
|--------|-------------|------------|
| Certificate Type | Self-signed | Let's Encrypt |
| Validity | 90 days | 90 days |
| Auto-renewal | cert-manager | cert-manager |
| Trust | Manual trust needed | Automatically trusted |
| Domains | localhost | Real domains |
| Browser Warning | Yes (self-signed) | No (CA-signed) |
---
## FAQ
### Q: Why am I seeing certificate warnings?
**A:** Self-signed certificates aren't trusted by browsers by default. Trust the certificate or click "Proceed."
### Q: Do I need to trust the certificate?
**A:** No, but it makes development easier. You can click "Proceed" on each browser session.
### Q: Will this affect my frontend development?
**A:** Slightly. Update `VITE_API_URL` to use `https://`. Otherwise works the same.
### Q: Can I use HTTP instead?
**A:** Yes, but not recommended. It reduces dev-prod parity and won't catch HTTPS issues.
### Q: How often do I need to re-trust the certificate?
**A:** Only when the certificate is recreated (every 90 days or when you delete the cluster).
### Q: Does this work with bakery-ia.local?
**A:** Yes! The certificate is valid for both `localhost` and `bakery-ia.local`.
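One way to test that name without editing `/etc/hosts` (a sketch; drop `-k` once the certificate is trusted):
```bash
curl -k --resolve bakery-ia.local:443:127.0.0.1 https://bakery-ia.local/api/health
```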
---
## Additional Security Testing
With HTTPS enabled, you can now test:
### 1. Secure Cookies
```javascript
// In your frontend
document.cookie = "session=test; Secure; SameSite=Strict";
```
### 2. Mixed Content Detection
```javascript
// This will show warning in dev (good - catches prod issues!)
fetch('http://api.example.com/data') // ❌ Mixed content
fetch('https://api.example.com/data') // ✅ Secure
```
### 3. HSTS (HTTP Strict Transport Security)
```bash
# Check HSTS headers
curl -I https://localhost/api/health | grep -i strict
```
### 4. TLS Version Testing
```bash
# Test TLS 1.2
curl --tlsv1.2 https://localhost/api/health
# Test TLS 1.3
curl --tlsv1.3 https://localhost/api/health
```
---
## Summary
- **Enabled**: HTTPS in development by default
- **Certificate**: Self-signed, auto-renewed
- **Access**: `https://localhost`
- **Trust**: Optional but recommended
- **Benefit**: Better dev-prod parity
**Next Steps:**
1. Deploy: `skaffold dev --profile=dev`
2. Access: `https://localhost`
3. Trust: Follow instructions above (optional)
4. Test: Verify HTTPS works
For issues, see Troubleshooting section or check cert-manager logs.


@@ -1,337 +0,0 @@
# Docker Hub Configuration Guide
This guide explains how to configure Docker Hub for all image pulls in the Bakery IA project.
## Overview
The project has been configured to use Docker Hub credentials for pulling both:
- **Base images** (postgres, redis, python, node, nginx, etc.)
- **Custom bakery images** (bakery/auth-service, bakery/gateway, etc.)
## Quick Start
### 1. Create Docker Hub Secret in Kubernetes
Run the automated setup script:
```bash
./infrastructure/kubernetes/setup-dockerhub-secrets.sh
```
This script will:
- Create the `dockerhub-creds` secret in all namespaces (bakery-ia, bakery-ia-dev, bakery-ia-prod, default)
- Use the credentials: `uals` / `dckr_pat_zzEY5Q58x1S0puraIoKEtbpue3A`
### 2. Apply Updated Kubernetes Manifests
All manifests have been updated with `imagePullSecrets`. Apply them:
```bash
# For development
kubectl apply -k infrastructure/kubernetes/overlays/dev
# For production
kubectl apply -k infrastructure/kubernetes/overlays/prod
```
### 3. Verify Pods Can Pull Images
```bash
# Check pod status
kubectl get pods -n bakery-ia
# Check events for image pull status
kubectl get events -n bakery-ia --sort-by='.lastTimestamp'
# Describe a specific pod to see image pull details
kubectl describe pod <pod-name> -n bakery-ia
```
## Manual Setup
If you prefer to create the secret manually:
```bash
kubectl create secret docker-registry dockerhub-creds \
--docker-server=docker.io \
--docker-username=uals \
--docker-password=dckr_pat_zzEY5Q58x1S0puraIoKEtbpue3A \
--docker-email=ualfaro@gmail.com \
-n bakery-ia
```
Repeat for other namespaces:
```bash
kubectl create secret docker-registry dockerhub-creds \
--docker-server=docker.io \
--docker-username=uals \
--docker-password=dckr_pat_zzEY5Q58x1S0puraIoKEtbpue3A \
--docker-email=ualfaro@gmail.com \
-n bakery-ia-dev
kubectl create secret docker-registry dockerhub-creds \
--docker-server=docker.io \
--docker-username=uals \
--docker-password=dckr_pat_zzEY5Q58x1S0puraIoKEtbpue3A \
--docker-email=ualfaro@gmail.com \
-n bakery-ia-prod
```
## What Was Changed
### 1. Kubernetes Manifests (47 files updated)
All deployments, jobs, and cronjobs now include `imagePullSecrets`:
```yaml
spec:
  template:
    spec:
      imagePullSecrets:
        - name: dockerhub-creds
      containers:
        - name: ...
```
**Files Updated:**
- **19 Service Deployments**: All microservices (auth, tenant, forecasting, etc.)
- **21 Database Deployments**: All PostgreSQL instances, Redis, RabbitMQ
- **21 Migration Jobs**: All database migration jobs
- **2 CronJobs**: demo-cleanup, external-data-rotation
- **2 Standalone Jobs**: external-data-init, nominatim-init
- **1 Worker Deployment**: demo-cleanup-worker
### 2. Tiltfile Configuration
The Tiltfile now supports both local registry and Docker Hub:
**Default (Local Registry):**
```bash
tilt up
```
**Docker Hub Mode:**
```bash
export USE_DOCKERHUB=true
export DOCKERHUB_USERNAME=uals
tilt up
```
### 3. Scripts
Two new scripts were created:
1. **[setup-dockerhub-secrets.sh](../infrastructure/kubernetes/setup-dockerhub-secrets.sh)**
- Creates Docker Hub secrets in all namespaces
- Idempotent (safe to run multiple times)
2. **[add-image-pull-secrets.sh](../infrastructure/kubernetes/add-image-pull-secrets.sh)**
- Adds `imagePullSecrets` to all Kubernetes manifests
- Already run (no need to run again unless adding new manifests)
## Using Docker Hub with Tilt
To use Docker Hub for development with Tilt:
```bash
# Login to Docker Hub first
docker login -u uals
# Enable Docker Hub mode
export USE_DOCKERHUB=true
export DOCKERHUB_USERNAME=uals
# Start Tilt
tilt up
```
This will:
- Build images locally
- Tag them as `docker.io/uals/<image-name>`
- Push them to Docker Hub
- Deploy to Kubernetes with imagePullSecrets
## Images Configuration
### Base Images (from Docker Hub)
These images are pulled from Docker Hub's public registry:
- `python:3.11-slim` - Python base for all microservices
- `node:18-alpine` - Node.js for frontend builder
- `nginx:1.25-alpine` - Nginx for frontend production
- `postgres:17-alpine` - PostgreSQL databases
- `redis:7.4-alpine` - Redis cache
- `rabbitmq:4.1-management-alpine` - RabbitMQ message broker
- `busybox:latest` - Utility container
- `curlimages/curl:latest` - Curl utility
- `mediagis/nominatim:4.4` - Geolocation service
### Custom Images (bakery/*)
These images are built by the project:
**Infrastructure:**
- `bakery/gateway`
- `bakery/dashboard`
**Core Services:**
- `bakery/auth-service`
- `bakery/tenant-service`
**Data & Analytics:**
- `bakery/training-service`
- `bakery/forecasting-service`
- `bakery/ai-insights-service`
**Operations:**
- `bakery/sales-service`
- `bakery/inventory-service`
- `bakery/production-service`
- `bakery/procurement-service`
- `bakery/distribution-service`
**Supporting:**
- `bakery/recipes-service`
- `bakery/suppliers-service`
- `bakery/pos-service`
- `bakery/orders-service`
- `bakery/external-service`
**Platform:**
- `bakery/notification-service`
- `bakery/alert-processor`
- `bakery/orchestrator-service`
**Demo:**
- `bakery/demo-session-service`
## Pushing Custom Images to Docker Hub
Use the existing tag-and-push script:
```bash
# Login first
docker login -u uals
# Tag and push all images
./scripts/tag-and-push-images.sh
```
Or manually for a specific image:
```bash
# Build
docker build -t bakery/auth-service:latest -f services/auth/Dockerfile .
# Tag for Docker Hub
docker tag bakery/auth-service:latest uals/bakery-auth-service:latest
# Push
docker push uals/bakery-auth-service:latest
```
## Troubleshooting
### Problem: ImagePullBackOff error
Check if the secret exists:
```bash
kubectl get secret dockerhub-creds -n bakery-ia
```
Verify secret is correctly configured:
```bash
kubectl get secret dockerhub-creds -n bakery-ia -o yaml
```
Check pod events:
```bash
kubectl describe pod <pod-name> -n bakery-ia
```
### Problem: Authentication failure
The Docker Hub credentials might be incorrect or expired. Update the secret:
```bash
# Delete old secret
kubectl delete secret dockerhub-creds -n bakery-ia
# Create new secret with updated credentials
kubectl create secret docker-registry dockerhub-creds \
--docker-server=docker.io \
--docker-username=<your-username> \
--docker-password=<your-token> \
--docker-email=<your-email> \
-n bakery-ia
```
### Problem: Pod still using old credentials
Restart the pod to pick up the new secret:
```bash
kubectl rollout restart deployment/<deployment-name> -n bakery-ia
```
## Security Best Practices
1. **Use Docker Hub Access Tokens** (not passwords)
- Create at: https://hub.docker.com/settings/security
- Set appropriate permissions (Read-only for pulls)
2. **Rotate Credentials Regularly**
- Update the secret every 90 days
- Use the setup script for consistent updates
3. **Limit Secret Access**
- Only grant access to necessary namespaces
- Use RBAC to control who can read secrets
4. **Monitor Usage**
- Check Docker Hub pull rate limits
- Monitor for unauthorized access
## Rate Limits
Docker Hub has rate limits for image pulls:
- **Anonymous users**: 100 pulls per 6 hours per IP
- **Authenticated users**: 200 pulls per 6 hours
- **Pro/Team**: Unlimited
Using authentication (imagePullSecrets) ensures you get the authenticated user rate limit.
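To see the limit Docker Hub currently applies to you, you can query the registry with the documented `ratelimitpreview/test` check (a sketch; assumes `jq` is installed, and adding `-u <username>:<token>` to the first request shows your authenticated limit):
```bash
# Request a pull token, then read the RateLimit headers from a HEAD request (HEAD does not count against the limit)
TOKEN=$(curl -s "https://auth.docker.io/token?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull" | jq -r .token)
curl -sI -H "Authorization: Bearer $TOKEN" \
  https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest | grep -i ratelimit
```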
## Environment Variables
For CI/CD or automated deployments, use these environment variables:
```bash
export DOCKER_USERNAME=uals
export DOCKER_PASSWORD=dckr_pat_zzEY5Q58x1S0puraIoKEtbpue3A
export DOCKER_EMAIL=ualfaro@gmail.com
```
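In CI these variables can then be used for a non-interactive login, for example:
```bash
echo "$DOCKER_PASSWORD" | docker login -u "$DOCKER_USERNAME" --password-stdin
```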
## Next Steps
1. ✅ Docker Hub secret created in all namespaces
2. ✅ All Kubernetes manifests updated with imagePullSecrets
3. ✅ Tiltfile configured for optional Docker Hub usage
4. 🔄 Apply manifests to your cluster
5. 🔄 Verify pods can pull images successfully
## Related Documentation
- [Kubernetes Setup Guide](./KUBERNETES_SETUP.md)
- [Security Implementation](./SECURITY_IMPLEMENTATION_COMPLETE.md)
- [Tilt Development Workflow](../Tiltfile)
## Support
If you encounter issues:
1. Check the troubleshooting section above
2. Verify Docker Hub credentials at: https://hub.docker.com/settings/security
3. Check Kubernetes events: `kubectl get events -A --sort-by='.lastTimestamp'`
4. Review pod logs: `kubectl logs -n bakery-ia <pod-name>`


@@ -1,449 +0,0 @@
# Complete Monitoring Guide - Bakery IA Platform
This guide provides a complete overview of the observability implementation for the Bakery IA platform using SigNoz and OpenTelemetry.
## 🎯 Executive Summary
**What's Implemented:**
- **Distributed Tracing** - All 17 services
- **Application Metrics** - HTTP requests, latencies, errors
- **System Metrics** - CPU, memory, disk, network per service
- **Structured Logs** - With trace correlation
- **Database Monitoring** - PostgreSQL, Redis, RabbitMQ metrics
- **Pure OpenTelemetry** - No Prometheus, all OTLP push
**Technology Stack:**
- **Backend**: OpenTelemetry Python SDK
- **Collector**: OpenTelemetry Collector (OTLP receivers)
- **Storage**: ClickHouse (traces, metrics, logs)
- **Frontend**: SigNoz UI
- **Protocol**: OTLP over HTTP/gRPC
## 📊 Architecture
```
┌──────────────────────────────────────────────────────────┐
│ Application Services │
│ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ │
│ │ auth │ │ inv │ │ orders │ │ ... │ │
│ └───┬────┘ └───┬────┘ └───┬────┘ └───┬────┘ │
│ │ │ │ │ │
│ └───────────┴────────────┴───────────┘ │
│ │ │
│ Traces + Metrics + Logs │
│ (OpenTelemetry OTLP) │
└──────────────────┼──────────────────────────────────────┘
┌──────────────────────────────────────────────────────────┐
│ Database Monitoring Collector │
│ ┌────────┐ ┌────────┐ ┌────────┐ │
│ │ PG │ │ Redis │ │RabbitMQ│ │
│ └───┬────┘ └───┬────┘ └───┬────┘ │
│ │ │ │ │
│ └───────────┴────────────┘ │
│ │ │
│ Database Metrics │
└──────────────────┼──────────────────────────────────────┘
┌──────────────────────────────────────────────────────────┐
│ SigNoz OpenTelemetry Collector │
│ │
│ Receivers: OTLP (gRPC :4317, HTTP :4318) │
│ Processors: batch, memory_limiter, resourcedetection │
│ Exporters: ClickHouse │
└──────────────────┼──────────────────────────────────────┘
┌──────────────────────────────────────────────────────────┐
│ ClickHouse Database │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Traces │ │ Metrics │ │ Logs │ │
│ └──────────┘ └──────────┘ └──────────┘ │
└──────────────────┼──────────────────────────────────────┘
┌──────────────────────────────────────────────────────────┐
│ SigNoz Frontend UI │
│ https://monitoring.bakery-ia.local │
└──────────────────────────────────────────────────────────┘
```
## 🚀 Quick Start
### 1. Deploy SigNoz
```bash
# Add Helm repository
helm repo add signoz https://charts.signoz.io
helm repo update
# Create namespace and install
kubectl create namespace signoz
helm install signoz signoz/signoz \
-n signoz \
-f infrastructure/helm/signoz-values-dev.yaml
# Wait for pods
kubectl wait --for=condition=ready pod -l app=signoz -n signoz --timeout=300s
```
### 2. Deploy Services with Monitoring
All services are already configured with OpenTelemetry environment variables.
```bash
# Apply all services
kubectl apply -k infrastructure/kubernetes/overlays/dev/
# Or restart existing services
kubectl rollout restart deployment -n bakery-ia
```
### 3. Deploy Database Monitoring
```bash
# Run the setup script
./infrastructure/kubernetes/setup-database-monitoring.sh
# This will:
# - Create monitoring users in PostgreSQL
# - Deploy OpenTelemetry collector for database metrics
# - Start collecting PostgreSQL, Redis, RabbitMQ metrics
```
### 4. Access SigNoz UI
```bash
# Via ingress
open https://monitoring.bakery-ia.local
# Or port-forward
kubectl port-forward -n signoz svc/signoz-frontend 3301:3301
open http://localhost:3301
```
## 📈 Metrics Collected
### Application Metrics (Per Service)
| Metric | Description | Type |
|--------|-------------|------|
| `http_requests_total` | Total HTTP requests | Counter |
| `http_request_duration_seconds` | Request latency | Histogram |
| `active_requests` | Current active requests | Gauge |
### System Metrics (Per Service)
| Metric | Description | Type |
|--------|-------------|------|
| `process.cpu.utilization` | Process CPU % | Gauge |
| `process.memory.usage` | Process memory bytes | Gauge |
| `process.memory.utilization` | Process memory % | Gauge |
| `process.threads.count` | Thread count | Gauge |
| `process.open_file_descriptors` | Open FDs (Unix) | Gauge |
| `system.cpu.utilization` | System CPU % | Gauge |
| `system.memory.usage` | System memory | Gauge |
| `system.memory.utilization` | System memory % | Gauge |
| `system.disk.io.read` | Disk read bytes | Counter |
| `system.disk.io.write` | Disk write bytes | Counter |
| `system.network.io.sent` | Network sent bytes | Counter |
| `system.network.io.received` | Network recv bytes | Counter |
### PostgreSQL Metrics
| Metric | Description |
|--------|-------------|
| `postgresql.backends` | Active connections |
| `postgresql.database.size` | Database size in bytes |
| `postgresql.commits` | Transaction commits |
| `postgresql.rollbacks` | Transaction rollbacks |
| `postgresql.deadlocks` | Deadlock count |
| `postgresql.blocks_read` | Blocks read from disk |
| `postgresql.table.size` | Table size |
| `postgresql.index.size` | Index size |
### Redis Metrics
| Metric | Description |
|--------|-------------|
| `redis.clients.connected` | Connected clients |
| `redis.commands.processed` | Commands processed |
| `redis.keyspace.hits` | Cache hits |
| `redis.keyspace.misses` | Cache misses |
| `redis.memory.used` | Memory usage |
| `redis.memory.fragmentation_ratio` | Fragmentation |
| `redis.db.keys` | Number of keys |
### RabbitMQ Metrics
| Metric | Description |
|--------|-------------|
| `rabbitmq.consumer.count` | Active consumers |
| `rabbitmq.message.current` | Messages in queue |
| `rabbitmq.message.acknowledged` | Messages ACKed |
| `rabbitmq.message.delivered` | Messages delivered |
| `rabbitmq.message.published` | Messages published |
## 🔍 Traces
**Automatic instrumentation for:**
- FastAPI endpoints
- HTTP client requests (HTTPX)
- Redis commands
- PostgreSQL queries (SQLAlchemy)
- RabbitMQ publish/consume
**View traces:**
1. Go to **Services** tab in SigNoz
2. Select a service
3. View individual traces
4. Click trace → See full span tree with timing
## 📝 Logs
**Features:**
- Structured logging with context
- Automatic trace-log correlation
- Searchable by service, level, message, custom fields
**View logs:**
1. Go to **Logs** tab in SigNoz
2. Filter by service: `service_name="auth-service"`
3. Search for specific messages
4. Click log → See full context including trace_id
## 🎛️ Configuration Files
### Services
All services configured in:
```
infrastructure/kubernetes/base/components/*/*-service.yaml
```
Each service has these environment variables:
```yaml
env:
  - name: OTEL_COLLECTOR_ENDPOINT
    value: "http://signoz-otel-collector.signoz.svc.cluster.local:4318"
  - name: OTEL_SERVICE_NAME
    value: "service-name"
  - name: ENABLE_TRACING
    value: "true"
  - name: OTEL_LOGS_EXPORTER
    value: "otlp"
  - name: ENABLE_OTEL_METRICS
    value: "true"
  - name: ENABLE_SYSTEM_METRICS
    value: "true"
```
### SigNoz
Configuration file:
```
infrastructure/helm/signoz-values-dev.yaml
```
Key settings:
- OTLP receivers on ports 4317 (gRPC) and 4318 (HTTP)
- No Prometheus scraping (pure OTLP push)
- ClickHouse backend for storage
- Reduced resources for development
### Database Monitoring
Deployment file:
```
infrastructure/kubernetes/base/monitoring/database-otel-collector.yaml
```
Setup script:
```
infrastructure/kubernetes/setup-database-monitoring.sh
```
## 📚 Documentation
| Document | Description |
|----------|-------------|
| [MONITORING_QUICKSTART.md](./MONITORING_QUICKSTART.md) | 10-minute quick start guide |
| [MONITORING_SETUP.md](./MONITORING_SETUP.md) | Detailed setup and troubleshooting |
| [DATABASE_MONITORING.md](./DATABASE_MONITORING.md) | Database metrics and logs guide |
| This document | Complete overview |
## 🔧 Shared Libraries
### Monitoring Modules
Located in `shared/monitoring/`:
| File | Purpose |
|------|---------|
| `__init__.py` | Package exports |
| `logging.py` | Standard logging setup |
| `logs_exporter.py` | OpenTelemetry logs export |
| `metrics.py` | OpenTelemetry metrics (no Prometheus) |
| `metrics_exporter.py` | OTLP metrics export setup |
| `system_metrics.py` | System metrics collection (CPU, memory, etc.) |
| `tracing.py` | Distributed tracing setup |
| `health_checks.py` | Health check endpoints |
### Usage in Services
```python
from shared.service_base import StandardFastAPIService
# Create service
service = AuthService()
# Create app with auto-configured monitoring
app = service.create_app()
# Monitoring is automatically enabled:
# - Tracing (if ENABLE_TRACING=true)
# - Metrics (if ENABLE_OTEL_METRICS=true)
# - System metrics (if ENABLE_SYSTEM_METRICS=true)
# - Logs (if OTEL_LOGS_EXPORTER=otlp)
```
## 🎨 Dashboard Examples
### Service Health Dashboard
Create a dashboard with:
1. **Request Rate** - `rate(http_requests_total[5m])`
2. **Error Rate** - `rate(http_requests_total{status_code=~"5.."}[5m])`
3. **Latency (P95)** - `histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))`
4. **Active Requests** - `active_requests`
5. **CPU Usage** - `process.cpu.utilization`
6. **Memory Usage** - `process.memory.utilization`
### Database Dashboard
1. **PostgreSQL Connections** - `postgresql.backends`
2. **Database Size** - `postgresql.database.size`
3. **Transaction Rate** - `rate(postgresql.commits[5m])`
4. **Redis Hit Rate** - `redis.keyspace.hits / (redis.keyspace.hits + redis.keyspace.misses)`
5. **RabbitMQ Queue Depth** - `rabbitmq.message.current`
## ⚠️ Alerts
### Recommended Alerts
**Application:**
- High error rate (>5% of requests failing)
- High latency (P95 > 1s)
- Service down (no metrics for 5 minutes)
**System:**
- High CPU (>80% for 5 minutes)
- High memory (>90%)
- Disk space low (<10%)
**Database:**
- PostgreSQL connections near max (>80% of max_connections)
- Slow queries (>5s)
- Redis memory high (>80%)
- RabbitMQ queue buildup (>10k messages)
## 🐛 Troubleshooting
### No Data in SigNoz
```bash
# 1. Check service logs
kubectl logs -n bakery-ia deployment/auth-service | grep -i otel
# 2. Check SigNoz collector
kubectl logs -n signoz deployment/signoz-otel-collector
# 3. Test connectivity
kubectl exec -n bakery-ia deployment/auth-service -- \
curl -v http://signoz-otel-collector.signoz.svc.cluster.local:4318
```
### Database Metrics Missing
```bash
# Check database monitoring collector
kubectl logs -n bakery-ia deployment/database-otel-collector
# Verify monitoring user exists
kubectl exec -n bakery-ia deployment/auth-db -- \
psql -U postgres -c "\du otel_monitor"
```
### Traces Not Correlated with Logs
Ensure `OTEL_LOGS_EXPORTER=otlp` is set in service environment variables.
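A quick way to confirm the variable is present on a deployment (auth-service shown as an example):
```bash
kubectl set env deployment/auth-service -n bakery-ia --list | grep OTEL_LOGS_EXPORTER
```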
## 🎯 Best Practices
1. **Always use structured logging** - Add context with key-value pairs
2. **Add custom spans** - For important business operations
3. **Set appropriate log levels** - INFO for production, DEBUG for dev
4. **Monitor your monitors** - Alert on collector failures
5. **Regular retention policy reviews** - Balance cost vs. data retention
6. **Create service dashboards** - One dashboard per service
7. **Set up critical alerts first** - Service down, high error rate
8. **Document custom metrics** - Explain business-specific metrics
## 📊 Performance Impact
**Resource Usage (per service):**
- CPU: +5-10% (instrumentation overhead)
- Memory: +50-100MB (SDK and buffers)
- Network: Minimal (batched export every 60s)
**Latency Impact:**
- Per request: <1ms (async instrumentation)
- No impact on user-facing latency
**Storage (SigNoz):**
- Traces: ~1GB per million requests
- Metrics: ~100MB per service per day
- Logs: Varies by log volume
## 🔐 Security Considerations
1. **Use dedicated monitoring users** - Never use app credentials
2. **Limit collector permissions** - Read-only access to databases
3. **Secure OTLP endpoints** - Use TLS in production
4. **Sanitize sensitive data** - Don't log passwords, tokens
5. **Network policies** - Restrict collector network access
6. **RBAC** - Limit SigNoz UI access per team
## 🚀 Next Steps
1. **Deploy to production** - Update production SigNoz config
2. **Create team dashboards** - Per-service and system-wide views
3. **Set up alerts** - Start with critical service health alerts
4. **Train team** - SigNoz UI usage, query language
5. **Document runbooks** - How to respond to alerts
6. **Optimize retention** - Based on actual data volume
7. **Add custom metrics** - Business-specific KPIs
## 📞 Support
- **SigNoz Community**: https://signoz.io/slack
- **OpenTelemetry Docs**: https://opentelemetry.io/docs/
- **Internal Docs**: See /docs folder
## 📝 Change Log
| Date | Change |
|------|--------|
| 2026-01-08 | Initial implementation - All services configured |
| 2026-01-08 | Database monitoring added (PostgreSQL, Redis, RabbitMQ) |
| 2026-01-08 | System metrics collection implemented |
| 2026-01-08 | Removed Prometheus, pure OpenTelemetry |
---
**Congratulations! Your platform now has complete observability. 🎉**
Every request is traced, every metric is collected, every log is searchable.


@@ -0,0 +1,536 @@
# 📊 Bakery-ia Monitoring System Documentation
## 🎯 Overview
The bakery-ia platform features a comprehensive, modern monitoring system built on **OpenTelemetry** and **SigNoz**. This documentation provides a complete guide to the monitoring architecture, setup, and usage.
## 🚀 Monitoring Architecture
### Core Components
```mermaid
graph TD
A[Microservices] -->|OTLP| B[OpenTelemetry Collector]
B -->|gRPC| C[SigNoz]
C --> D[Traces Dashboard]
C --> E[Metrics Dashboard]
C --> F[Logs Dashboard]
C --> G[Alerts]
```
### Technology Stack
- **Instrumentation**: OpenTelemetry Python SDK
- **Protocol**: OTLP (OpenTelemetry Protocol) over gRPC
- **Backend**: SigNoz (open-source observability platform)
- **Metrics**: Prometheus-compatible metrics via OTLP
- **Traces**: Jaeger-compatible tracing via OTLP
- **Logs**: Structured logging with trace correlation
## 📋 Monitoring Coverage
### Service Coverage (100%)
| Service Category | Services | Monitoring Type | Status |
|-----------------|----------|----------------|--------|
| **Critical Services** | auth, orders, sales, external | Base Class | ✅ Monitored |
| **AI Services** | ai-insights, training | Direct | ✅ Monitored |
| **Data Services** | inventory, procurement, production, forecasting | Base Class | ✅ Monitored |
| **Operational Services** | tenant, notification, distribution | Base Class | ✅ Monitored |
| **Specialized Services** | suppliers, pos, recipes, orchestrator | Base Class | ✅ Monitored |
| **Infrastructure** | gateway, alert-processor, demo-session | Direct | ✅ Monitored |
**Total: 20 services with 100% monitoring coverage**
## 🔧 Monitoring Implementation
### Implementation Patterns
#### 1. Base Class Pattern (16 services)
Services using `StandardFastAPIService` inherit comprehensive monitoring:
```python
from shared.service_base import StandardFastAPIService
class MyService(StandardFastAPIService):
    def __init__(self):
        super().__init__(
            service_name="my-service",
            app_name="My Service",
            description="Service description",
            version="1.0.0",
            # Monitoring enabled by default
            enable_metrics=True,       # ✅ Metrics collection
            enable_tracing=True,       # ✅ Distributed tracing
            enable_health_checks=True  # ✅ Health endpoints
        )
```
#### 2. Direct Pattern (4 services)
Critical services with custom monitoring needs:
```python
# services/ai_insights/app/main.py
from shared.monitoring.metrics import MetricsCollector, add_metrics_middleware
from shared.monitoring.system_metrics import SystemMetricsCollector
# Initialize metrics collectors
metrics_collector = MetricsCollector("ai-insights")
system_metrics = SystemMetricsCollector("ai-insights")
# Add middleware
add_metrics_middleware(app, metrics_collector)
```
### Monitoring Components
#### OpenTelemetry Instrumentation
```python
# Automatic instrumentation in the base class
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.redis import RedisInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
FastAPIInstrumentor.instrument_app(app)   # HTTP requests
HTTPXClientInstrumentor().instrument()    # Outgoing HTTP
RedisInstrumentor().instrument()          # Redis operations
SQLAlchemyInstrumentor().instrument()     # Database queries
```
#### Metrics Collection
```python
# Standard metrics automatically collected
metrics_collector.register_counter("http_requests_total", "Total HTTP requests")
metrics_collector.register_histogram("http_request_duration", "Request duration")
metrics_collector.register_gauge("active_requests", "Active requests")
# System metrics automatically collected
system_metrics = SystemMetricsCollector("service-name")
# → CPU, Memory, Disk I/O, Network I/O, Threads, File Descriptors
```
#### Health Checks
```
# Automatic health check endpoints
GET /health # Overall service health
GET /health/detailed # Detailed health with dependencies
GET /health/ready # Readiness probe
GET /health/live # Liveness probe
```
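These endpoints can be exercised directly once a service is port-forwarded (a sketch; port 8000 is an assumption, adjust it to the service's actual container port):
```bash
kubectl port-forward -n bakery-ia deployment/auth-service 8000:8000 &
sleep 2  # give the port-forward a moment to establish
curl -s http://localhost:8000/health
curl -s http://localhost:8000/health/ready
```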
## 📊 Metrics Reference
### Standard Metrics (All Services)
| Metric Type | Metric Name | Description | Labels |
|-------------|------------|-------------|--------|
| **HTTP Metrics** | `{service}_http_requests_total` | Total HTTP requests | method, endpoint, status_code |
| **HTTP Metrics** | `{service}_http_request_duration_seconds` | Request duration histogram | method, endpoint, status_code |
| **HTTP Metrics** | `{service}_active_requests` | Currently active requests | - |
| **System Metrics** | `process.cpu.utilization` | Process CPU usage | - |
| **System Metrics** | `process.memory.usage` | Process memory usage | - |
| **System Metrics** | `system.cpu.utilization` | System CPU usage | - |
| **System Metrics** | `system.memory.usage` | System memory usage | - |
| **Database Metrics** | `db.query.duration` | Database query duration | operation, table |
| **Cache Metrics** | `cache.operation.duration` | Cache operation duration | operation, key |
### Custom Metrics (Service-Specific)
Examples of service-specific metrics:
**Auth Service:**
- `auth_registration_total` (by status)
- `auth_login_success_total`
- `auth_login_failure_total` (by reason)
- `auth_registration_duration_seconds`
**Orders Service:**
- `orders_created_total`
- `orders_processed_total` (by status)
- `orders_processing_duration_seconds`
**AI Insights Service:**
- `ai_insights_generated_total`
- `ai_model_inference_duration_seconds`
- `ai_feedback_received_total`
## 🔍 Tracing Guide
### Trace Propagation
Traces automatically flow across service boundaries:
```mermaid
sequenceDiagram
participant Client
participant Gateway
participant Auth
participant Orders
Client->>Gateway: HTTP Request (trace_id: abc123)
Gateway->>Auth: Auth Check (trace_id: abc123)
Auth-->>Gateway: Auth Response (trace_id: abc123)
Gateway->>Orders: Create Order (trace_id: abc123)
Orders-->>Gateway: Order Created (trace_id: abc123)
Gateway-->>Client: Final Response (trace_id: abc123)
```
### Trace Context in Logs
All logs include trace correlation:
```json
{
"level": "info",
"message": "Processing order",
"service": "orders-service",
"trace_id": "abc123def456",
"span_id": "789ghi",
"order_id": "12345",
"timestamp": "2024-01-08T19:00:00Z"
}
```
### Manual Trace Enhancement
Add custom trace attributes:
```python
from shared.monitoring.tracing import add_trace_attributes, add_trace_event
# Add custom attributes
add_trace_attributes(
user_id="123",
tenant_id="abc",
operation="order_creation"
)
# Add trace events
add_trace_event("order_validation_started")
# ... validation logic ...
add_trace_event("order_validation_completed", status="success")
```
## 🚨 Alerting Guide
### Standard Alerts (Recommended)
| Alert Name | Condition | Severity | Notification |
|------------|-----------|----------|--------------|
| **High Error Rate** | `error_rate > 5%` for 5m | High | PagerDuty + Slack |
| **High Latency** | `p99_latency > 2s` for 5m | High | PagerDuty + Slack |
| **Service Unavailable** | `up == 0` for 1m | Critical | PagerDuty + Slack + Email |
| **High Memory Usage** | `memory_usage > 80%` for 10m | Medium | Slack |
| **High CPU Usage** | `cpu_usage > 90%` for 5m | Medium | Slack |
| **Database Connection Issues** | `db_connections < minimum_pool_size` | High | PagerDuty + Slack |
| **Cache Hit Ratio Low** | `cache_hit_ratio < 70%` for 15m | Low | Slack |
### Creating Alerts in SigNoz
1. **Navigate to Alerts**: SigNoz UI → Alerts → Create Alert
2. **Select Metric**: Choose from available metrics
3. **Set Condition**: Define threshold and duration
4. **Configure Notifications**: Add notification channels
5. **Set Severity**: Critical, High, Medium, Low
6. **Add Description**: Explain alert purpose and resolution steps
### Example Alert Configuration (YAML)
```yaml
# Example PrometheusRule (Prometheus Operator CRD) for Kubernetes
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: bakery-ia-alerts
  namespace: monitoring
spec:
  groups:
    - name: service-health
      rules:
        - alert: ServiceDown
          expr: up{service!~"signoz.*"} == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Service {{ $labels.service }} is down"
            description: "{{ $labels.service }} has been down for more than 1 minute"
            runbook: "https://github.com/yourorg/bakery-ia/blob/main/RUNBOOKS.md#service-down"
        - alert: HighErrorRate
          expr: rate(http_requests_total{status_code=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
          for: 5m
          labels:
            severity: high
          annotations:
            summary: "High error rate in {{ $labels.service }}"
            description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
            runbook: "https://github.com/yourorg/bakery-ia/blob/main/RUNBOOKS.md#high-error-rate"
```
## 📈 Dashboard Guide
### Recommended Dashboards
#### 1. Service Overview Dashboard
- HTTP Request Rate
- Error Rate
- Latency Percentiles (p50, p90, p99)
- Active Requests
- System Resource Usage
#### 2. Performance Dashboard
- Request Duration Histogram
- Database Query Performance
- Cache Performance
- External API Call Performance
#### 3. System Health Dashboard
- CPU Usage (Process & System)
- Memory Usage (Process & System)
- Disk I/O
- Network I/O
- File Descriptors
- Thread Count
#### 4. Business Metrics Dashboard
- User Registrations
- Order Volume
- AI Insights Generated
- API Usage by Tenant
### Creating Dashboards in SigNoz
1. **Navigate to Dashboards**: SigNoz UI → Dashboards → Create Dashboard
2. **Add Panels**: Click "Add Panel" and select metric
3. **Configure Visualization**: Choose chart type and settings
4. **Set Time Range**: Default to last 1h, 6h, 24h, 7d
5. **Add Variables**: For dynamic filtering (service, environment)
6. **Save Dashboard**: Give it a descriptive name
## 🛠️ Troubleshooting Guide
### Common Issues & Solutions
#### Issue: No Metrics Appearing in SigNoz
**Checklist:**
- ✅ OpenTelemetry Collector running? `kubectl get pods -n signoz`
- ✅ Service can reach collector? `telnet signoz-otel-collector.signoz 4318`
- ✅ OTLP endpoint configured correctly? Check `OTEL_EXPORTER_OTLP_ENDPOINT`
- ✅ Service logs show OTLP export? Look for "Exporting metrics"
- ✅ No network policies blocking? Check Kubernetes network policies
**Debugging:**
```bash
# Check OpenTelemetry Collector logs
kubectl logs -n signoz -l app=otel-collector
# Check service logs for OTLP errors
kubectl logs -l app=auth-service | grep -i otel
# Test OTLP connectivity from service pod
kubectl exec -it auth-service-pod -- curl -v http://signoz-otel-collector.signoz:4318
```
#### Issue: High Latency in Specific Service
**Checklist:**
- ✅ Database queries slow? Check `db.query.duration` metrics
- ✅ External API calls slow? Check trace waterfall
- ✅ High CPU usage? Check system metrics
- ✅ Memory pressure? Check memory metrics
- ✅ Too many active requests? Check concurrency
**Debugging:**
```python
# Add detailed tracing to suspicious code
from shared.monitoring.tracing import add_trace_event
add_trace_event("database_query_started", table="users")
# ... database query ...
add_trace_event("database_query_completed", duration_ms=45)
```
#### Issue: High Error Rate
**Checklist:**
- ✅ Database connection issues? Check health endpoints
- ✅ External API failures? Check dependency metrics
- ✅ Authentication failures? Check auth service logs
- ✅ Validation errors? Check application logs
- ✅ Rate limiting? Check gateway metrics
**Debugging:**
```bash
# Check error logs with trace correlation
kubectl logs -l app=auth-service | grep -i error | grep -i trace
# Filter traces by error status
# In SigNoz: Add filter http.status_code >= 400
```
## 📚 Runbook Reference
See [RUNBOOKS.md](RUNBOOKS.md) for detailed troubleshooting procedures.
## 🔧 Development Guide
### Adding Custom Metrics
```python
# In any service using direct monitoring
self.metrics_collector.register_counter(
"custom_metric_name",
"Description of what this metric tracks",
labels=["label1", "label2"] # Optional labels
)
# Increment the counter
self.metrics_collector.increment_counter(
"custom_metric_name",
value=1,
labels={"label1": "value1", "label2": "value2"}
)
```
### Adding Custom Trace Attributes
```python
# Add context to current span
from shared.monitoring.tracing import add_trace_attributes
add_trace_attributes(
user_id=user.id,
tenant_id=tenant.id,
operation="premium_feature_access",
feature_name="advanced_forecasting"
)
```
### Service-Specific Monitoring Setup
For services needing custom monitoring beyond the base class:
```python
# In your service's __init__ method
from shared.monitoring.system_metrics import SystemMetricsCollector
from shared.monitoring.metrics import MetricsCollector
class MyService(StandardFastAPIService):
    def __init__(self):
        # Call parent constructor first
        super().__init__(...)
        # Add custom metrics collector
        self.custom_metrics = MetricsCollector("my-service")
        # Register custom metrics
        self.custom_metrics.register_counter(
            "business_specific_events",
            "Custom business event counter"
        )
        # Add system metrics if not using base class defaults
        self.system_metrics = SystemMetricsCollector("my-service")
```
## 📊 SigNoz Configuration
### Environment Variables
```env
# OpenTelemetry Collector endpoint
OTEL_EXPORTER_OTLP_ENDPOINT=http://signoz-otel-collector.signoz:4318
# Service-specific configuration
OTEL_SERVICE_NAME=auth-service
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production,k8s.namespace=bakery-ia
# Metrics export interval (default: 60000ms = 60s)
OTEL_METRIC_EXPORT_INTERVAL=60000
# Batch span processor configuration
OTEL_BSP_SCHEDULE_DELAY=5000
OTEL_BSP_MAX_QUEUE_SIZE=2048
OTEL_BSP_MAX_EXPORT_BATCH_SIZE=512
```
### Kubernetes Configuration
```yaml
# Example deployment with monitoring environment variables
apiVersion: apps/v1
kind: Deployment
metadata:
  name: auth-service
spec:
  template:
    spec:
      containers:
        - name: auth-service
          image: auth-service:latest
          env:
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://signoz-otel-collector.signoz:4318"
            - name: OTEL_SERVICE_NAME
              value: "auth-service"
            - name: ENVIRONMENT
              value: "production"
          resources:
            limits:
              cpu: "1"
              memory: "512Mi"
            requests:
              cpu: "200m"
              memory: "256Mi"
```
## 🎯 Best Practices
### Monitoring Best Practices
1. **Use Consistent Naming**: Follow OpenTelemetry semantic conventions
2. **Add Context to Traces**: Include user/tenant IDs in trace attributes
3. **Monitor Dependencies**: Track external API and database performance
4. **Set Appropriate Alerts**: Avoid alert fatigue with meaningful thresholds
5. **Document Metrics**: Keep metrics documentation up to date
6. **Review Regularly**: Update dashboards as services evolve
7. **Test Alerts**: Ensure alerts fire correctly before production
### Performance Best Practices
1. **Batch Metrics Export**: Use default 60s interval for most services
2. **Sample Traces**: Consider sampling for high-volume services (see the sketch after this list)
3. **Limit Custom Metrics**: Only track metrics that provide value
4. **Use Histograms Wisely**: Histograms can be resource-intensive
5. **Monitor Monitoring**: Track OTLP export success/failure rates
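For the trace-sampling point above, the standard OpenTelemetry SDK sampler variables can be set on a high-volume service (a sketch; the service name and the 10% ratio are illustrative):
```bash
kubectl set env deployment/forecasting-service -n bakery-ia \
  OTEL_TRACES_SAMPLER=parentbased_traceidratio \
  OTEL_TRACES_SAMPLER_ARG=0.1
```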
## 📞 Support
### Getting Help
1. **Check Documentation**: This file and RUNBOOKS.md
2. **Review SigNoz Docs**: https://signoz.io/docs/
3. **OpenTelemetry Docs**: https://opentelemetry.io/docs/
4. **Team Channel**: #monitoring in Slack
5. **GitHub Issues**: https://github.com/yourorg/bakery-ia/issues
### Escalation Path
1. **First Line**: Development team (service owners)
2. **Second Line**: DevOps team (monitoring specialists)
3. **Third Line**: SigNoz support (vendor support)
## 🎉 Summary
The bakery-ia monitoring system provides:
- **📊 100% Service Coverage**: All 20 services monitored
- **🚀 Modern Architecture**: OpenTelemetry + SigNoz
- **🔧 Comprehensive Metrics**: System, HTTP, database, cache
- **🔍 Full Observability**: Traces, metrics, logs integrated
- **✅ Production Ready**: Battle-tested and scalable
**All services are fully instrumented and ready for production monitoring!** 🎉


@@ -1,283 +0,0 @@
# SigNoz Monitoring Quick Start
Get complete observability (metrics, logs, traces, system metrics) in under 10 minutes using OpenTelemetry.
## What You'll Get
- **Distributed Tracing** - Complete request flows across all services
- **Application Metrics** - HTTP requests, durations, error rates, custom business metrics
- **System Metrics** - CPU usage, memory usage, disk I/O, network I/O per service
- **Structured Logs** - Searchable logs correlated with traces
- **Unified Dashboard** - Single UI for all telemetry data
**All data pushed via OpenTelemetry OTLP protocol - No Prometheus, no scraping needed!**
## Prerequisites
- Kubernetes cluster running (Kind/Minikube/Production)
- Helm 3.x installed
- kubectl configured
## Step 1: Deploy SigNoz
```bash
# Add Helm repository
helm repo add signoz https://charts.signoz.io
helm repo update
# Create namespace
kubectl create namespace signoz
# Install SigNoz
helm install signoz signoz/signoz \
-n signoz \
-f infrastructure/helm/signoz-values-dev.yaml
# Wait for pods to be ready (2-3 minutes)
kubectl wait --for=condition=ready pod -l app=signoz -n signoz --timeout=300s
```
## Step 2: Configure Services
Each service needs OpenTelemetry environment variables. The auth-service is already configured as an example.
### Quick Configuration (for remaining services)
Add these environment variables to each service deployment:
```yaml
env:
  # OpenTelemetry Collector endpoint
  - name: OTEL_COLLECTOR_ENDPOINT
    value: "http://signoz-otel-collector.signoz.svc.cluster.local:4318"
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://signoz-otel-collector.signoz.svc.cluster.local:4318"
  - name: OTEL_SERVICE_NAME
    value: "your-service-name"  # e.g., "inventory-service"
  # Enable tracing
  - name: ENABLE_TRACING
    value: "true"
  # Enable logs export
  - name: OTEL_LOGS_EXPORTER
    value: "otlp"
  # Enable metrics export (includes system metrics)
  - name: ENABLE_OTEL_METRICS
    value: "true"
  - name: ENABLE_SYSTEM_METRICS
    value: "true"
```
### Using the Configuration Script
```bash
# Generate configuration patches for all services
./infrastructure/kubernetes/add-monitoring-config.sh
# This creates /tmp/*-otel-patch.yaml files
# Review and manually add to each service deployment
```
## Step 3: Deploy Updated Services
```bash
# Apply updated configurations
kubectl apply -k infrastructure/kubernetes/overlays/dev/
# Or restart services to pick up new env vars
kubectl rollout restart deployment -n bakery-ia
# Wait for rollout
kubectl rollout status deployment -n bakery-ia --timeout=5m
```
## Step 4: Access SigNoz UI
### Via Ingress
```bash
# Add to /etc/hosts if needed
echo "127.0.0.1 monitoring.bakery-ia.local" | sudo tee -a /etc/hosts
# Access UI
open https://monitoring.bakery-ia.local
```
### Via Port Forward
```bash
kubectl port-forward -n signoz svc/signoz-frontend 3301:3301
open http://localhost:3301
```
## Step 5: Explore Your Data
### Traces
1. Go to **Services** tab
2. See all your services listed
3. Click on a service → View traces
4. Click on a trace → See detailed span tree with timing
### Metrics
**HTTP Metrics** (automatically collected):
- `http_requests_total` - Total requests by method, endpoint, status
- `http_request_duration_seconds` - Request latency
- `active_requests` - Current active HTTP requests
**System Metrics** (automatically collected per service):
- `process.cpu.utilization` - Process CPU usage %
- `process.memory.usage` - Process memory in bytes
- `process.memory.utilization` - Process memory %
- `process.threads.count` - Number of threads
- `system.cpu.utilization` - System-wide CPU %
- `system.memory.usage` - System memory usage
- `system.disk.io.read` - Disk bytes read
- `system.disk.io.write` - Disk bytes written
- `system.network.io.sent` - Network bytes sent
- `system.network.io.received` - Network bytes received
**Custom Business Metrics** (if configured):
- User registrations
- Orders created
- Login attempts
- etc.
### Logs
1. Go to **Logs** tab
2. Filter by service: `service_name="auth-service"`
3. Search for specific messages
4. See structured fields (user_id, tenant_id, etc.)
### Trace-Log Correlation
1. Find a trace in **Traces** tab
2. Note the `trace_id`
3. Go to **Logs** tab
4. Filter: `trace_id="<the-trace-id>"`
5. See all logs for that specific request!
## Verification Commands
```bash
# Check if services are sending telemetry
kubectl logs -n bakery-ia deployment/auth-service | grep -i "telemetry\|otel"
# Check SigNoz collector is receiving data
kubectl logs -n signoz deployment/signoz-otel-collector | tail -50
# Test connectivity to collector
kubectl exec -n bakery-ia deployment/auth-service -- \
curl -v http://signoz-otel-collector.signoz.svc.cluster.local:4318
```
## Common Issues
### No data in SigNoz
```bash
# 1. Verify environment variables are set
kubectl get deployment auth-service -n bakery-ia -o yaml | grep OTEL
# 2. Check collector logs
kubectl logs -n signoz deployment/signoz-otel-collector
# 3. Restart service
kubectl rollout restart deployment/auth-service -n bakery-ia
```
### Services not appearing
```bash
# Check network connectivity
kubectl exec -n bakery-ia deployment/auth-service -- \
curl http://signoz-otel-collector.signoz.svc.cluster.local:4318
# Should return: connection successful (not connection refused)
```
## Architecture
```
┌─────────────────────────────────────────────┐
│ Your Microservices │
│ ┌──────┐ ┌──────┐ ┌──────┐ │
│ │ auth │ │ inv │ │orders│ ... │
│ └──┬───┘ └──┬───┘ └──┬───┘ │
│ │ │ │ │
│ └─────────┴─────────┘ │
│ │ │
│ OTLP Push │
│ (traces, metrics, logs) │
└──────────────┼──────────────────────────────┘
┌──────────────────────────────────────────────┐
│ SigNoz OpenTelemetry Collector │
│ :4317 (gRPC) :4318 (HTTP) │
│ │
│ Receivers: OTLP only (no Prometheus) │
│ Processors: batch, memory_limiter │
│ Exporters: ClickHouse │
└──────────────┼──────────────────────────────┘
┌──────────────────────────────────────────────┐
│ ClickHouse Database │
│ Stores: traces, metrics, logs │
└──────────────┼──────────────────────────────┘
┌──────────────────────────────────────────────┐
│ SigNoz Frontend UI │
│ monitoring.bakery-ia.local or :3301 │
└──────────────────────────────────────────────┘
```
## What Makes This Different
**Pure OpenTelemetry** - No Prometheus involved:
- ✅ All metrics pushed via OTLP (not scraped)
- ✅ Automatic system metrics collection (CPU, memory, disk, network)
- ✅ Unified data model for all telemetry
- ✅ Native trace-metric-log correlation
- ✅ Lower resource usage (no scraping overhead)
## Next Steps
- **Create Dashboards** - Build custom views for your metrics
- **Set Up Alerts** - Configure alerts for errors, latency, resource usage
- **Explore System Metrics** - Monitor CPU, memory per service
- **Query Logs** - Use powerful log query language
- **Correlate Everything** - Jump from traces → logs → metrics
## Need Help?
- [Full Documentation](./MONITORING_SETUP.md) - Detailed setup guide
- [SigNoz Docs](https://signoz.io/docs/) - Official documentation
- [OpenTelemetry Python](https://opentelemetry.io/docs/instrumentation/python/) - Python instrumentation
---
**Metrics You Get Out of the Box:**
| Category | Metrics | Description |
|----------|---------|-------------|
| HTTP | `http_requests_total` | Total requests by method, endpoint, status |
| HTTP | `http_request_duration_seconds` | Request latency histogram |
| HTTP | `active_requests` | Current active requests |
| Process | `process.cpu.utilization` | Process CPU usage % |
| Process | `process.memory.usage` | Process memory in bytes |
| Process | `process.memory.utilization` | Process memory % |
| Process | `process.threads.count` | Thread count |
| System | `system.cpu.utilization` | System CPU % |
| System | `system.memory.usage` | System memory usage |
| System | `system.memory.utilization` | System memory % |
| Disk | `system.disk.io.read` | Disk read bytes |
| Disk | `system.disk.io.write` | Disk write bytes |
| Network | `system.network.io.sent` | Network sent bytes |
| Network | `system.network.io.received` | Network received bytes |


@@ -1,511 +0,0 @@
# SigNoz Monitoring Setup Guide
This guide explains how to set up complete observability for the Bakery IA platform using SigNoz, which provides unified metrics, logs, and traces visualization.
## Table of Contents
1. [Architecture Overview](#architecture-overview)
2. [Prerequisites](#prerequisites)
3. [SigNoz Deployment](#signoz-deployment)
4. [Service Configuration](#service-configuration)
5. [Data Flow](#data-flow)
6. [Verification](#verification)
7. [Troubleshooting](#troubleshooting)
## Architecture Overview
The monitoring setup is a layered pipeline:
```
┌─────────────────────────────────────────────────────────────┐
│ Bakery IA Services │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Auth │ │ Inventory│ │ Orders │ │ ... │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│ │ │ │ │ │
│ └─────────────┴─────────────┴─────────────┘ │
│ │ │
│ OpenTelemetry Protocol (OTLP) │
│ Traces / Metrics / Logs │
└──────────────────────────┼───────────────────────────────────┘
┌──────────────────────────────────────────────────────────────┐
│ SigNoz OpenTelemetry Collector │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ Receivers: │ │
│ │ - OTLP gRPC (4317) - OTLP HTTP (4318) │ │
│ │ - Prometheus Scraper (service discovery) │ │
│ └────────────────────┬───────────────────────────────────┘ │
│ │ │
│ ┌────────────────────┴───────────────────────────────────┐ │
│ │ Processors: batch, memory_limiter, resourcedetection │ │
│ └────────────────────┬───────────────────────────────────┘ │
│ │ │
│ ┌────────────────────┴───────────────────────────────────┐ │
│ │ Exporters: ClickHouse (traces, metrics, logs) │ │
│ └────────────────────────────────────────────────────────┘ │
└──────────────────────────┼───────────────────────────────────┘
┌──────────────────────────────────────────────────────────────┐
│ ClickHouse Database │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Traces │ │ Metrics │ │ Logs │ │
│ └──────────┘ └──────────┘ └──────────┘ │
└──────────────────────────┼───────────────────────────────────┘
┌──────────────────────────────────────────────────────────────┐
│ SigNoz Query Service │
│ & Frontend UI │
│ https://monitoring.bakery-ia.local │
└──────────────────────────────────────────────────────────────┘
```
### Key Components
1. **Services**: Generate telemetry data using OpenTelemetry SDK
2. **OpenTelemetry Collector**: Receives, processes, and exports telemetry
3. **ClickHouse**: Stores traces, metrics, and logs
4. **SigNoz UI**: Query and visualize all telemetry data
## Prerequisites
- Kubernetes cluster (Kind, Minikube, or production cluster)
- Helm 3.x installed
- kubectl configured
- At least 4GB RAM available for SigNoz components
## SigNoz Deployment
### 1. Add SigNoz Helm Repository
```bash
helm repo add signoz https://charts.signoz.io
helm repo update
```
### 2. Create Namespace
```bash
kubectl create namespace signoz
```
### 3. Deploy SigNoz
```bash
# For development environment
helm install signoz signoz/signoz \
-n signoz \
-f infrastructure/helm/signoz-values-dev.yaml
# For production environment
helm install signoz signoz/signoz \
-n signoz \
-f infrastructure/helm/signoz-values-prod.yaml
```
### 4. Verify Deployment
```bash
# Check all pods are running
kubectl get pods -n signoz
# Expected output:
# signoz-alertmanager-0
# signoz-clickhouse-0
# signoz-frontend-*
# signoz-otel-collector-*
# signoz-query-service-*
# Check services
kubectl get svc -n signoz
```
## Service Configuration
Each microservice needs to be configured to send telemetry to SigNoz.
### Environment Variables
Add these environment variables to your service deployments:
```yaml
env:
  # OpenTelemetry Collector endpoint
  - name: OTEL_COLLECTOR_ENDPOINT
    value: "http://signoz-otel-collector.signoz.svc.cluster.local:4318"
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://signoz-otel-collector.signoz.svc.cluster.local:4318"
  # Service identification
  - name: OTEL_SERVICE_NAME
    value: "your-service-name"  # e.g., "auth-service"
  # Enable tracing
  - name: ENABLE_TRACING
    value: "true"
  # Enable logs export
  - name: OTEL_LOGS_EXPORTER
    value: "otlp"
  - name: OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED
    value: "true"
  # Enable metrics export (optional, default: true)
  - name: ENABLE_OTEL_METRICS
    value: "true"
```
### Prometheus Annotations
Add these annotations to enable Prometheus metrics scraping:
```yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8000"
    prometheus.io/path: "/metrics"
```
### Complete Example
See [infrastructure/kubernetes/base/components/auth/auth-service.yaml](../infrastructure/kubernetes/base/components/auth/auth-service.yaml) for a complete example.
### Automated Configuration Script
Use the provided script to add monitoring configuration to all services:
```bash
# Run from project root
./infrastructure/kubernetes/add-monitoring-config.sh
```
## Data Flow
### 1. Traces
**Automatic Instrumentation:**
```python
# In your service's main.py
from shared.service_base import StandardFastAPIService
service = AuthService() # Extends StandardFastAPIService
app = service.create_app()
# Tracing is automatically enabled if ENABLE_TRACING=true
# All FastAPI endpoints, HTTP clients, Redis, PostgreSQL are auto-instrumented
```
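For reference, the auto-instrumentation above boils down to standard OpenTelemetry wiring. A hedged sketch of what the shared base class plausibly does (the real code lives in `shared/service_base.py` and is not reproduced here):
```python
# Hedged sketch of tracing setup - illustrative only
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

provider = TracerProvider(resource=Resource.create({"service.name": "auth-service"}))
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(
    endpoint="http://signoz-otel-collector.signoz.svc.cluster.local:4318/v1/traces"
)))
trace.set_tracer_provider(provider)

# app is the FastAPI instance returned by service.create_app()
FastAPIInstrumentor.instrument_app(app)
# Redis, asyncpg and HTTP-client instrumentors are applied the same way when installed
```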
**Manual Instrumentation:**
```python
from shared.monitoring.tracing import add_trace_attributes, add_trace_event
# Add custom attributes to current span
add_trace_attributes(
user_id="123",
tenant_id="abc",
operation="user_registration"
)
# Add events for important operations
add_trace_event("user_authenticated", user_id="123", method="jwt")
```
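The helpers above presumably wrap the current-span API; the raw OpenTelemetry equivalent looks roughly like this (sketch, span and attribute names are illustrative):
```python
# Hedged sketch using the raw OpenTelemetry API instead of the shared helpers
from opentelemetry import trace

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("user_registration") as span:
    span.set_attribute("user_id", "123")
    span.set_attribute("tenant_id", "abc")
    span.add_event("user_authenticated", {"method": "jwt"})
```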
### 2. Metrics
**Dual Export Strategy:**
Services export metrics in two ways:
1. **Prometheus format** at `/metrics` endpoint (scraped by SigNoz)
2. **OTLP push** directly to SigNoz collector (real-time)
**Built-in Metrics:**
```python
# Automatically collected by BaseFastAPIService:
# - http_requests_total
# - http_request_duration_seconds
# - active_connections
```
**Custom Metrics:**
```python
# Define in your service
custom_metrics = {
    "user_registrations": {
        "type": "counter",
        "description": "Total user registrations",
        "labels": ["status"]
    },
    "login_duration_seconds": {
        "type": "histogram",
        "description": "Login request duration"
    }
}
service = AuthService(custom_metrics=custom_metrics)
# Use in your code
service.metrics_collector.increment_counter(
    "user_registrations",
    labels={"status": "success"}
)
```
### 3. Logs
**Automatic Export:**
```python
# Logs are automatically exported if OTEL_LOGS_EXPORTER=otlp
import logging
logger = logging.getLogger(__name__)
# This will appear in SigNoz
logger.info("User logged in", extra={"user_id": "123", "tenant_id": "abc"})
```
**Structured Logging with Context:**
```python
from shared.monitoring.logs_exporter import add_log_context
# Add context that persists across log calls
log_ctx = add_log_context(
request_id="req_123",
user_id="user_456",
tenant_id="tenant_789"
)
# All subsequent logs include this context
log_ctx.info("Processing order") # Includes request_id, user_id, tenant_id
```
**Trace Correlation:**
```python
from shared.monitoring.logs_exporter import get_current_trace_context
# Get trace context for correlation
trace_ctx = get_current_trace_context()
logger.info("Processing request", extra=trace_ctx)
# Logs now include trace_id and span_id for correlation
```
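`get_current_trace_context()` presumably reads the active span; with the raw SDK the same IDs can be derived like this (sketch):
```python
# Hedged sketch: deriving trace_id/span_id from the active span for log correlation
import logging
from opentelemetry import trace

logger = logging.getLogger(__name__)
ctx = trace.get_current_span().get_span_context()
trace_ctx = {
    "trace_id": format(ctx.trace_id, "032x"),  # 32 hex chars, as shown in the SigNoz UI
    "span_id": format(ctx.span_id, "016x"),
}
logger.info("Processing request", extra=trace_ctx)
```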
## Verification
### 1. Check Service Health
```bash
# Check that services are exporting telemetry
kubectl logs -n bakery-ia deployment/auth-service | grep -i "telemetry\|otel\|signoz"
# Expected output includes:
# - "Distributed tracing configured"
# - "OpenTelemetry logs export configured"
# - "OpenTelemetry metrics export configured"
```
### 2. Access SigNoz UI
```bash
# Port-forward (for local development)
kubectl port-forward -n signoz svc/signoz-frontend 3301:3301
# Or via Ingress
open https://monitoring.bakery-ia.local
```
### 3. Verify Data Ingestion
**Traces:**
1. Go to SigNoz UI → Traces
2. You should see traces from your services
3. Click on a trace to see the full span tree
**Metrics:**
1. Go to SigNoz UI → Metrics
2. Query: `http_requests_total`
3. Filter by service: `service="auth-service"`
**Logs:**
1. Go to SigNoz UI → Logs
2. Filter by service: `service_name="auth-service"`
3. Search for specific log messages
### 4. Test Trace-Log Correlation
1. Find a trace in SigNoz UI
2. Copy the `trace_id`
3. Go to Logs tab
4. Search: `trace_id="<your-trace-id>"`
5. You should see all logs for that trace
## Troubleshooting
### No Data in SigNoz
**1. Check OpenTelemetry Collector:**
```bash
# Check collector logs
kubectl logs -n signoz deployment/signoz-otel-collector
# Should see:
# - "Receiver is starting"
# - "Exporter is starting"
# - No error messages
```
**2. Check Service Configuration:**
```bash
# Verify environment variables
kubectl get deployment auth-service -n bakery-ia -o yaml | grep -A 20 "env:"
# Verify annotations
kubectl get deployment auth-service -n bakery-ia -o yaml | grep -A 5 "annotations:"
```
**3. Check Network Connectivity:**
```bash
# Test from service pod
kubectl exec -n bakery-ia deployment/auth-service -- \
curl -v http://signoz-otel-collector.signoz.svc.cluster.local:4318/v1/traces
# Should return: 405 Method Not Allowed (POST required)
# If connection refused, check network policies
```
### Traces Not Appearing
**Check instrumentation:**
```python
# Verify tracing is enabled
import os
print(os.getenv("ENABLE_TRACING")) # Should be "true"
print(os.getenv("OTEL_COLLECTOR_ENDPOINT")) # Should be set
```
**Check trace sampling:**
```bash
# Verify sampling rate (default 100%)
kubectl logs -n bakery-ia deployment/auth-service | grep "sampling"
```
### Metrics Not Appearing
**1. Verify Prometheus annotations:**
```bash
kubectl get pods -n bakery-ia -o yaml | grep "prometheus.io"
```
**2. Test metrics endpoint:**
```bash
# Port-forward service
kubectl port-forward -n bakery-ia deployment/auth-service 8000:8000
# Test endpoint
curl http://localhost:8000/metrics
# Should return Prometheus format metrics
```
**3. Check SigNoz scrape configuration:**
```bash
# Check collector config
kubectl get configmap -n signoz signoz-otel-collector -o yaml | grep -A 30 "prometheus:"
```
### Logs Not Appearing
**1. Verify log export is enabled:**
```bash
kubectl get deployment auth-service -n bakery-ia -o yaml | grep OTEL_LOGS_EXPORTER
# Should return: OTEL_LOGS_EXPORTER=otlp
```
**2. Check log format:**
```bash
# Logs should be JSON formatted
kubectl logs -n bakery-ia deployment/auth-service | head -5
```
**3. Verify OTLP endpoint:**
```bash
# Test logs endpoint
kubectl exec -n bakery-ia deployment/auth-service -- \
curl -X POST http://signoz-otel-collector.signoz.svc.cluster.local:4318/v1/logs \
-H "Content-Type: application/json" \
-d '{"resourceLogs":[]}'
# Should return 200 OK or 400 Bad Request (not connection error)
```
## Performance Tuning
### For Development
The default configuration is optimized for local development with minimal resources.
### For Production
Update the following in `signoz-values-prod.yaml`:
```yaml
# Increase collector resources
otelCollector:
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: 2000m
      memory: 2Gi
  # Increase batch sizes
  config:
    processors:
      batch:
        timeout: 10s
        send_batch_size: 10000  # Increased from 1024
  # Add more replicas
  replicaCount: 2
```
## Best Practices
1. **Use Structured Logging**: Always use key-value pairs for better querying
2. **Add Context**: Include user_id, tenant_id, request_id in logs
3. **Trace Business Operations**: Add custom spans for important operations (see the sketch after this list)
4. **Monitor Collector Health**: Set up alerts for collector errors
5. **Retention Policy**: Configure ClickHouse retention based on needs
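Practices 1-3 combine naturally, as in this hedged sketch (function, span, and attribute names are illustrative):
```python
# Hedged sketch: custom span around a business operation plus structured logging
import logging
from opentelemetry import trace

logger = logging.getLogger(__name__)
tracer = trace.get_tracer(__name__)

def place_order(order_id: str, user_id: str, tenant_id: str) -> None:
    with tracer.start_as_current_span("orders.place_order") as span:
        span.set_attribute("order_id", order_id)
        span.set_attribute("user_id", user_id)
        span.set_attribute("tenant_id", tenant_id)
        # ... business logic ...
        logger.info("order_placed", extra={
            "order_id": order_id,
            "user_id": user_id,
            "tenant_id": tenant_id,
        })
```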
## Additional Resources
- [SigNoz Documentation](https://signoz.io/docs/)
- [OpenTelemetry Python](https://opentelemetry.io/docs/instrumentation/python/)
- [Bakery IA Monitoring Shared Library](../shared/monitoring/)
## Support
For issues or questions:
1. Check SigNoz community: https://signoz.io/slack
2. Review OpenTelemetry docs: https://opentelemetry.io/docs/
3. Create issue in project repository

View File

@@ -28,6 +28,7 @@ from app.middleware.read_only_mode import ReadOnlyModeMiddleware
 from app.routes import auth, tenant, notification, nominatim, subscription, demo, pos, geocoding, poi_context
 from shared.monitoring.logging import setup_logging
 from shared.monitoring.metrics import MetricsCollector, add_metrics_middleware
+from shared.monitoring.system_metrics import SystemMetricsCollector
 # OpenTelemetry imports
 from opentelemetry import trace
@@ -200,7 +201,12 @@ async def startup_event():
logger.info("Metrics registered successfully") logger.info("Metrics registered successfully")
metrics_collector.start_metrics_server(8080) # Note: Metrics are exported via OpenTelemetry OTLP to SigNoz - no metrics server needed
# Initialize system metrics collection
system_metrics = SystemMetricsCollector("gateway")
logger.info("System metrics collection started")
logger.info("Metrics export configured via OpenTelemetry OTLP")
logger.info("API Gateway started successfully") logger.info("API Gateway started successfully")
@@ -227,13 +233,8 @@ async def health_check():
"timestamp": time.time() "timestamp": time.time()
} }
@app.get("/metrics") # Note: Metrics are exported via OpenTelemetry OTLP to SigNoz
async def metrics(): # The /metrics endpoint is not needed as metrics are pushed automatically
"""Prometheus metrics endpoint"""
return Response(
content=metrics_collector.get_metrics(),
media_type="text/plain; version=0.0.4; charset=utf-8"
)
# ================================================================ # ================================================================
# SERVER-SENT EVENTS (SSE) HELPER FUNCTIONS # SERVER-SENT EVENTS (SSE) HELPER FUNCTIONS

View File

@@ -19,6 +19,9 @@ sqlalchemy==2.0.44
 asyncpg==0.30.0
 cryptography==44.0.0
 ortools==9.8.3296
+psutil==5.9.8
 opentelemetry-api==1.39.1
 opentelemetry-sdk==1.39.1
 opentelemetry-instrumentation-fastapi==0.60b1

View File

@@ -1,133 +0,0 @@
#!/bin/bash
# Setup script for database monitoring with OpenTelemetry and SigNoz
# This script creates monitoring users in PostgreSQL and deploys the collector
set -e
echo "========================================="
echo "Database Monitoring Setup for SigNoz"
echo "========================================="
echo ""
# Configuration
NAMESPACE="bakery-ia"
MONITOR_USER="otel_monitor"
MONITOR_PASSWORD=$(openssl rand -base64 32)
# PostgreSQL databases to monitor
DATABASES=(
"auth-db-service:auth_db"
"inventory-db-service:inventory_db"
"orders-db-service:orders_db"
"tenant-db-service:tenant_db"
"sales-db-service:sales_db"
"production-db-service:production_db"
"recipes-db-service:recipes_db"
"procurement-db-service:procurement_db"
"distribution-db-service:distribution_db"
"forecasting-db-service:forecasting_db"
"external-db-service:external_db"
"suppliers-db-service:suppliers_db"
"pos-db-service:pos_db"
"training-db-service:training_db"
"notification-db-service:notification_db"
"orchestrator-db-service:orchestrator_db"
"ai-insights-db-service:ai_insights_db"
)
echo "Step 1: Creating monitoring user in PostgreSQL databases"
echo "========================================="
echo ""
for db_entry in "${DATABASES[@]}"; do
IFS=':' read -r service dbname <<< "$db_entry"
echo "Creating monitoring user in $dbname..."
# Create monitoring user via kubectl exec
kubectl exec -n "$NAMESPACE" "deployment/${service%-service}" -- psql -U postgres -d "$dbname" -c "
DO \$\$
BEGIN
IF NOT EXISTS (SELECT FROM pg_catalog.pg_roles WHERE rolname = '$MONITOR_USER') THEN
CREATE USER $MONITOR_USER WITH PASSWORD '$MONITOR_PASSWORD';
GRANT pg_monitor TO $MONITOR_USER;
GRANT CONNECT ON DATABASE $dbname TO $MONITOR_USER;
RAISE NOTICE 'User $MONITOR_USER created successfully';
ELSE
RAISE NOTICE 'User $MONITOR_USER already exists';
END IF;
END
\$\$;
" 2>/dev/null || echo " ⚠️ Warning: Could not create user in $dbname (may already exist or database not ready)"
echo ""
done
echo "✅ Monitoring users created"
echo ""
echo "Step 2: Creating Kubernetes secret for monitoring credentials"
echo "========================================="
echo ""
# Create secret for database monitoring
kubectl create secret generic database-monitor-secrets \
-n "$NAMESPACE" \
--from-literal=POSTGRES_MONITOR_USER="$MONITOR_USER" \
--from-literal=POSTGRES_MONITOR_PASSWORD="$MONITOR_PASSWORD" \
--dry-run=client -o yaml | kubectl apply -f -
echo "✅ Secret created: database-monitor-secrets"
echo ""
echo "Step 3: Deploying OpenTelemetry collector for database monitoring"
echo "========================================="
echo ""
kubectl apply -f infrastructure/kubernetes/base/monitoring/database-otel-collector.yaml
echo "✅ Database monitoring collector deployed"
echo ""
echo "Step 4: Waiting for collector to be ready"
echo "========================================="
echo ""
kubectl wait --for=condition=available --timeout=60s \
deployment/database-otel-collector -n "$NAMESPACE"
echo "✅ Collector is ready"
echo ""
echo "========================================="
echo "Database Monitoring Setup Complete!"
echo "========================================="
echo ""
echo "What's been configured:"
echo " ✅ Monitoring user created in all PostgreSQL databases"
echo " ✅ OpenTelemetry collector deployed for database metrics"
echo " ✅ Metrics exported to SigNoz"
echo ""
echo "Metrics being collected:"
echo " 📊 PostgreSQL: connections, commits, rollbacks, deadlocks, table sizes"
echo " 📊 Redis: memory usage, keyspace hits/misses, connected clients"
echo " 📊 RabbitMQ: queue depth, message rates, consumer count"
echo ""
echo "Next steps:"
echo " 1. Check collector logs:"
echo " kubectl logs -n $NAMESPACE deployment/database-otel-collector"
echo ""
echo " 2. View metrics in SigNoz:"
echo " - Go to https://monitoring.bakery-ia.local"
echo " - Create dashboard with queries like:"
echo " * postgresql.backends (connections)"
echo " * postgresql.database.size (database size)"
echo " * redis.memory.used (Redis memory)"
echo " * rabbitmq.message.current (queue depth)"
echo ""
echo " 3. Create alerts for:"
echo " - High connection count (approaching max_connections)"
echo " - Slow query detection (via application traces)"
echo " - High Redis memory usage"
echo " - RabbitMQ queue buildup"
echo ""

View File

@@ -11,6 +11,7 @@ from app.core.database import init_db, close_db
 from app.api import insights
 from shared.monitoring.logging import setup_logging
 from shared.monitoring.metrics import MetricsCollector, add_metrics_middleware
+from shared.monitoring.system_metrics import SystemMetricsCollector
 # OpenTelemetry imports
 from opentelemetry import trace
@@ -56,9 +57,12 @@ async def lifespan(app: FastAPI):
 await init_db()
 logger.info("Database initialized")
-# Start metrics server
-metrics_collector.start_metrics_server(8080)
-logger.info("Metrics server started on port 8080")
+# Initialize system metrics collection
+system_metrics = SystemMetricsCollector("ai-insights")
+logger.info("System metrics collection started")
+# Note: Metrics are exported via OpenTelemetry OTLP to SigNoz - no metrics server needed
+logger.info("Metrics export configured via OpenTelemetry OTLP")
 yield
@@ -131,13 +135,8 @@ async def health_check():
 }
-@app.get("/metrics")
-async def metrics():
-"""Prometheus metrics endpoint"""
-return Response(
-content=metrics_collector.get_metrics(),
-media_type="text/plain; version=0.0.4; charset=utf-8"
-)
+# Note: Metrics are exported via OpenTelemetry OTLP to SigNoz
+# The /metrics endpoint is not needed as metrics are pushed automatically
 if __name__ == "__main__":

View File

@@ -16,6 +16,7 @@ from app.api import alerts, sse
 from shared.redis_utils import initialize_redis, close_redis
 from shared.monitoring.logging import setup_logging
 from shared.monitoring.metrics import MetricsCollector, add_metrics_middleware
+from shared.monitoring.system_metrics import SystemMetricsCollector
 # OpenTelemetry imports
 from opentelemetry import trace
@@ -82,9 +83,12 @@ async def lifespan(app: FastAPI):
 await consumer.start()
 logger.info("alert_processor_started")
-# Start metrics server
-metrics_collector.start_metrics_server(8080)
-logger.info("Metrics server started on port 8080")
+# Initialize system metrics collection
+system_metrics = SystemMetricsCollector("alert-processor")
+logger.info("System metrics collection started")
+# Note: Metrics are exported via OpenTelemetry OTLP to SigNoz - no metrics server needed
+logger.info("Metrics export configured via OpenTelemetry OTLP")
 except Exception as e:
 logger.error("alert_processor_startup_failed", error=str(e))
 raise
@@ -175,13 +179,8 @@ async def root():
 }
-@app.get("/metrics")
-async def metrics():
-"""Prometheus metrics endpoint"""
-return Response(
-content=metrics_collector.get_metrics(),
-media_type="text/plain; version=0.0.4; charset=utf-8"
-)
+# Note: Metrics are exported via OpenTelemetry OTLP to SigNoz
+# The /metrics endpoint is not needed as metrics are pushed automatically
 if __name__ == "__main__":

View File

@@ -15,6 +15,7 @@ from app.api import demo_sessions, demo_accounts, demo_operations, internal
 from shared.redis_utils import initialize_redis, close_redis
 from shared.monitoring.logging import setup_logging
 from shared.monitoring.metrics import MetricsCollector, add_metrics_middleware
+from shared.monitoring.system_metrics import SystemMetricsCollector
 # OpenTelemetry imports
 from opentelemetry import trace
@@ -69,9 +70,12 @@ async def lifespan(app: FastAPI):
 max_connections=50
 )
-# Start metrics server
-metrics_collector.start_metrics_server(8080)
-logger.info("Metrics server started on port 8080")
+# Initialize system metrics collection
+system_metrics = SystemMetricsCollector("demo-session")
+logger.info("System metrics collection started")
+# Note: Metrics are exported via OpenTelemetry OTLP to SigNoz - no metrics server needed
+logger.info("Metrics export configured via OpenTelemetry OTLP")
 logger.info("Demo Session Service started successfully")
@@ -164,13 +168,8 @@ async def health():
 }
-@app.get("/metrics")
-async def metrics():
-"""Prometheus metrics endpoint"""
-return Response(
-content=metrics_collector.get_metrics(),
-media_type="text/plain; version=0.0.4; charset=utf-8"
-)
+# Note: Metrics are exported via OpenTelemetry OTLP to SigNoz
+# The /metrics endpoint is not needed as metrics are pushed automatically
 if __name__ == "__main__":

View File

@@ -1,85 +0,0 @@
"""
Prometheus metrics for demo session service
"""
from prometheus_client import Counter, Histogram, Gauge
# Counters
demo_sessions_created_total = Counter(
'demo_sessions_created_total',
'Total number of demo sessions created',
['tier', 'status']
)
demo_sessions_deleted_total = Counter(
'demo_sessions_deleted_total',
'Total number of demo sessions deleted',
['tier', 'status']
)
demo_cloning_errors_total = Counter(
'demo_cloning_errors_total',
'Total number of cloning errors',
['tier', 'service', 'error_type']
)
# Histograms (for latency percentiles)
demo_session_creation_duration_seconds = Histogram(
'demo_session_creation_duration_seconds',
'Duration of demo session creation',
['tier'],
buckets=[1, 2, 5, 7, 10, 12, 15, 18, 20, 25, 30, 40, 50, 60]
)
demo_service_clone_duration_seconds = Histogram(
'demo_service_clone_duration_seconds',
'Duration of individual service cloning',
['tier', 'service'],
buckets=[0.5, 1, 2, 3, 5, 10, 15, 20, 30, 40, 50]
)
demo_session_cleanup_duration_seconds = Histogram(
'demo_session_cleanup_duration_seconds',
'Duration of demo session cleanup',
['tier'],
buckets=[0.5, 1, 2, 5, 10, 15, 20, 30]
)
# Gauges
demo_sessions_active = Gauge(
'demo_sessions_active',
'Number of currently active demo sessions',
['tier']
)
demo_sessions_pending_cleanup = Gauge(
'demo_sessions_pending_cleanup',
'Number of demo sessions pending cleanup'
)
# Alert generation metrics
demo_alerts_generated_total = Counter(
'demo_alerts_generated_total',
'Total number of alerts generated post-clone',
['tier', 'alert_type']
)
demo_ai_insights_generated_total = Counter(
'demo_ai_insights_generated_total',
'Total number of AI insights generated post-clone',
['tier', 'insight_type']
)
# Cross-service metrics
demo_cross_service_calls_total = Counter(
'demo_cross_service_calls_total',
'Total number of cross-service API calls during cloning',
['source_service', 'target_service', 'status']
)
demo_cross_service_call_duration_seconds = Histogram(
'demo_cross_service_call_duration_seconds',
'Duration of cross-service API calls during cloning',
['source_service', 'target_service'],
buckets=[0.1, 0.2, 0.5, 1, 2, 5, 10, 15, 20, 30]
)

View File

@@ -14,11 +14,6 @@ import os
 from app.models import DemoSession, DemoSessionStatus
 from datetime import datetime, timezone, timedelta
 from app.core.redis_wrapper import DemoRedisWrapper
-from app.monitoring.metrics import (
-demo_sessions_deleted_total,
-demo_session_cleanup_duration_seconds,
-demo_sessions_active
-)
 logger = structlog.get_logger()

View File

@@ -15,17 +15,6 @@ from shared.clients.inventory_client import InventoryServiceClient
 from shared.clients.production_client import ProductionServiceClient
 from shared.clients.procurement_client import ProcurementServiceClient
 from shared.config.base import BaseServiceSettings
-from app.monitoring.metrics import (
-demo_sessions_created_total,
-demo_session_creation_duration_seconds,
-demo_service_clone_duration_seconds,
-demo_cloning_errors_total,
-demo_sessions_active,
-demo_alerts_generated_total,
-demo_ai_insights_generated_total,
-demo_cross_service_calls_total,
-demo_cross_service_call_duration_seconds
-)
 logger = structlog.get_logger()

View File

@@ -22,6 +22,7 @@ from app.services.whatsapp_service import WhatsAppService
 from app.consumers.po_event_consumer import POEventConsumer
 from shared.service_base import StandardFastAPIService
 from shared.clients.tenant_client import TenantServiceClient
+from shared.monitoring.system_metrics import SystemMetricsCollector
 import asyncio
@@ -184,6 +185,10 @@ class NotificationService(StandardFastAPIService):
 self.email_service = EmailService()
 self.whatsapp_service = WhatsAppService(tenant_client=self.tenant_client)
+# Initialize system metrics collection
+system_metrics = SystemMetricsCollector("notification")
+self.logger.info("System metrics collection started")
 # Initialize SSE service
 self.sse_service = SSEService()
 await self.sse_service.initialize(settings.REDIS_URL)
@@ -271,12 +276,14 @@ class NotificationService(StandardFastAPIService):
return {"error": "SSE service not available"} return {"error": "SSE service not available"}
# Metrics endpoint # Metrics endpoint
@self.app.get("/metrics") # Note: Metrics are exported via OpenTelemetry OTLP to SigNoz
async def metrics(): # The /metrics endpoint is not needed as metrics are pushed automatically
"""Prometheus metrics endpoint""" # @self.app.get("/metrics")
if self.metrics_collector: # async def metrics():
return self.metrics_collector.get_metrics() # """Prometheus metrics endpoint"""
return {"metrics": "not_available"} # if self.metrics_collector:
# return self.metrics_collector.get_metrics()
# return {"metrics": "not_available"}
# Create service instance # Create service instance

View File

@@ -9,6 +9,7 @@ from app.core.config import settings
 from app.core.database import database_manager
 from app.api import tenants, tenant_members, tenant_operations, webhooks, plans, subscription, tenant_settings, whatsapp_admin, usage_forecast, enterprise_upgrade, tenant_locations, tenant_hierarchy, internal_demo, network_alerts, onboarding
 from shared.service_base import StandardFastAPIService
+from shared.monitoring.system_metrics import SystemMetricsCollector
 class TenantService(StandardFastAPIService):
@@ -77,6 +78,10 @@ class TenantService(StandardFastAPIService):
 redis_client = await get_redis_client()
 self.logger.info("Redis initialized successfully")
+# Initialize system metrics collection
+system_metrics = SystemMetricsCollector("tenant")
+self.logger.info("System metrics collection started")
 # Start usage tracking scheduler
 from app.jobs.usage_tracking_scheduler import start_scheduler
 await start_scheduler(self.database_manager, redis_client, settings)
@@ -108,12 +113,14 @@ class TenantService(StandardFastAPIService):
 def setup_custom_endpoints(self):
 """Setup custom endpoints for tenant service"""
-@self.app.get("/metrics")
-async def metrics():
-"""Prometheus metrics endpoint"""
-if self.metrics_collector:
-return self.metrics_collector.get_metrics()
-return {"metrics": "not_available"}
+# Note: Metrics are exported via OpenTelemetry OTLP to SigNoz
+# The /metrics endpoint is not needed as metrics are pushed automatically
+# @self.app.get("/metrics")
+# async def metrics():
+# """Prometheus metrics endpoint"""
+# if self.metrics_collector:
+# return self.metrics_collector.get_metrics()
+# return {"metrics": "not_available"}
 # Create service instance

View File

@@ -15,6 +15,7 @@ from app.api import training_jobs, training_operations, models, health, monitori
 from app.services.training_events import setup_messaging, cleanup_messaging
 from app.websocket.events import setup_websocket_event_consumer, cleanup_websocket_consumers
 from shared.service_base import StandardFastAPIService
+from shared.monitoring.system_metrics import SystemMetricsCollector
 class TrainingService(StandardFastAPIService):
@@ -77,6 +78,11 @@ class TrainingService(StandardFastAPIService):
 async def on_startup(self, app: FastAPI):
 """Custom startup logic including migration verification"""
 await self.verify_migrations()
+# Initialize system metrics collection
+system_metrics = SystemMetricsCollector("training")
+self.logger.info("System metrics collection started")
 self.logger.info("Training service startup completed")
 async def on_shutdown(self, app: FastAPI):
@@ -132,12 +138,14 @@ class TrainingService(StandardFastAPIService):
 def setup_custom_endpoints(self):
 """Setup custom endpoints for training service"""
-@self.app.get("/metrics")
-async def get_metrics():
-"""Prometheus metrics endpoint"""
-if self.metrics_collector:
-return self.metrics_collector.get_metrics()
-return {"status": "metrics not available"}
+# Note: Metrics are exported via OpenTelemetry OTLP to SigNoz
+# The /metrics endpoint is not needed as metrics are pushed automatically
+# @self.app.get("/metrics")
+# async def get_metrics():
+# """Prometheus metrics endpoint"""
+# if self.metrics_collector:
+# return self.metrics_collector.get_metrics()
+# return {"status": "metrics not available"}
 @self.app.get("/")
 async def root():

View File

@@ -1,420 +0,0 @@
# shared/monitoring/alert_metrics.py
"""
Metrics and monitoring for the alert and recommendation system
Provides comprehensive metrics for tracking system performance and effectiveness
"""
from prometheus_client import Counter, Histogram, Gauge, Summary, Info
from typing import Dict, Any
import time
from functools import wraps
import structlog
logger = structlog.get_logger()
# =================================================================
# DETECTION METRICS
# =================================================================
# Alert and recommendation generation
items_published = Counter(
'alert_items_published_total',
'Total number of alerts and recommendations published',
['service', 'item_type', 'severity', 'type']
)
item_checks_performed = Counter(
'alert_checks_performed_total',
'Total number of alert checks performed',
['service', 'check_type', 'pattern']
)
item_check_duration = Histogram(
'alert_check_duration_seconds',
'Time taken to perform alert checks',
['service', 'check_type'],
buckets=[0.1, 0.5, 1, 2, 5, 10, 30, 60]
)
alert_detection_errors = Counter(
'alert_detection_errors_total',
'Total number of errors during alert detection',
['service', 'error_type', 'check_type']
)
# Deduplication metrics
duplicate_items_prevented = Counter(
'duplicate_items_prevented_total',
'Number of duplicate alerts/recommendations prevented',
['service', 'item_type', 'type']
)
# =================================================================
# PROCESSING METRICS
# =================================================================
# Alert processor metrics
items_processed = Counter(
'alert_items_processed_total',
'Total number of items processed by alert processor',
['item_type', 'severity', 'type', 'status']
)
item_processing_duration = Histogram(
'alert_processing_duration_seconds',
'Time taken to process alerts/recommendations',
['item_type', 'severity'],
buckets=[0.01, 0.05, 0.1, 0.5, 1, 2, 5]
)
database_storage_duration = Histogram(
'alert_database_storage_duration_seconds',
'Time taken to store items in database',
buckets=[0.01, 0.05, 0.1, 0.5, 1]
)
processing_errors = Counter(
'alert_processing_errors_total',
'Total number of processing errors',
['error_type', 'item_type']
)
# =================================================================
# DELIVERY METRICS
# =================================================================
# Notification delivery
notifications_sent = Counter(
'alert_notifications_sent_total',
'Total notifications sent through all channels',
['channel', 'item_type', 'severity', 'status']
)
notification_delivery_duration = Histogram(
'alert_notification_delivery_duration_seconds',
'Time from item generation to delivery',
['item_type', 'severity', 'channel'],
buckets=[0.1, 0.5, 1, 5, 10, 30, 60]
)
delivery_failures = Counter(
'alert_delivery_failures_total',
'Failed notification deliveries',
['channel', 'item_type', 'error_type']
)
# Channel-specific metrics
email_notifications = Counter(
'alert_email_notifications_total',
'Email notifications sent',
['status', 'item_type']
)
whatsapp_notifications = Counter(
'alert_whatsapp_notifications_total',
'WhatsApp notifications sent',
['status', 'item_type']
)
sse_events_sent = Counter(
'alert_sse_events_sent_total',
'SSE events sent to dashboard',
['tenant', 'event_type', 'item_type']
)
# =================================================================
# SSE METRICS
# =================================================================
# SSE connection metrics
sse_active_connections = Gauge(
'alert_sse_active_connections',
'Number of active SSE connections',
['tenant_id']
)
sse_connection_duration = Histogram(
'alert_sse_connection_duration_seconds',
'Duration of SSE connections',
buckets=[10, 30, 60, 300, 600, 1800, 3600]
)
sse_message_queue_size = Gauge(
'alert_sse_message_queue_size',
'Current size of SSE message queues',
['tenant_id']
)
sse_connection_errors = Counter(
'alert_sse_connection_errors_total',
'SSE connection errors',
['error_type', 'tenant_id']
)
# =================================================================
# SYSTEM HEALTH METRICS
# =================================================================
# Active items gauge
active_items_gauge = Gauge(
'alert_active_items_current',
'Current number of active alerts and recommendations',
['tenant_id', 'item_type', 'severity']
)
# System component health
system_component_health = Gauge(
'alert_system_component_health',
'Health status of alert system components (1=healthy, 0=unhealthy)',
['component', 'service']
)
# Leader election status
scheduler_leader_status = Gauge(
'alert_scheduler_leader_status',
'Leader election status for schedulers (1=leader, 0=follower)',
['service']
)
# Message queue health
rabbitmq_connection_status = Gauge(
'alert_rabbitmq_connection_status',
'RabbitMQ connection status (1=connected, 0=disconnected)',
['service']
)
redis_connection_status = Gauge(
'alert_redis_connection_status',
'Redis connection status (1=connected, 0=disconnected)',
['service']
)
# =================================================================
# BUSINESS METRICS
# =================================================================
# Alert response metrics
items_acknowledged = Counter(
'alert_items_acknowledged_total',
'Number of items acknowledged by users',
['item_type', 'severity', 'service']
)
items_resolved = Counter(
'alert_items_resolved_total',
'Number of items resolved by users',
['item_type', 'severity', 'service']
)
item_response_time = Histogram(
'alert_item_response_time_seconds',
'Time from item creation to acknowledgment',
['item_type', 'severity'],
buckets=[60, 300, 600, 1800, 3600, 7200, 14400]
)
# Recommendation adoption
recommendations_implemented = Counter(
'alert_recommendations_implemented_total',
'Number of recommendations marked as implemented',
['type', 'service']
)
# Effectiveness metrics
false_positive_rate = Gauge(
'alert_false_positive_rate',
'Rate of false positive alerts',
['service', 'alert_type']
)
# =================================================================
# PERFORMANCE DECORATORS
# =================================================================
def track_duration(metric: Histogram, **labels):
    """Decorator to track function execution time"""
    def decorator(func):
        @wraps(func)
        async def async_wrapper(*args, **kwargs):
            start_time = time.time()
            try:
                result = await func(*args, **kwargs)
                metric.labels(**labels).observe(time.time() - start_time)
                return result
            except Exception as e:
                # Track error duration too
                metric.labels(**labels).observe(time.time() - start_time)
                raise
        @wraps(func)
        def sync_wrapper(*args, **kwargs):
            start_time = time.time()
            try:
                result = func(*args, **kwargs)
                metric.labels(**labels).observe(time.time() - start_time)
                return result
            except Exception as e:
                metric.labels(**labels).observe(time.time() - start_time)
                raise
        return async_wrapper if hasattr(func, '__code__') and func.__code__.co_flags & 0x80 else sync_wrapper
    return decorator
def track_errors(error_counter: Counter, **labels):
    """Decorator to track errors in functions"""
    def decorator(func):
        @wraps(func)
        async def async_wrapper(*args, **kwargs):
            try:
                return await func(*args, **kwargs)
            except Exception as e:
                error_counter.labels(error_type=type(e).__name__, **labels).inc()
                raise
        @wraps(func)
        def sync_wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as e:
                error_counter.labels(error_type=type(e).__name__, **labels).inc()
                raise
        return async_wrapper if hasattr(func, '__code__') and func.__code__.co_flags & 0x80 else sync_wrapper
    return decorator
# =================================================================
# UTILITY FUNCTIONS
# =================================================================
def record_item_published(service: str, item_type: str, severity: str, alert_type: str):
"""Record that an item was published"""
items_published.labels(
service=service,
item_type=item_type,
severity=severity,
type=alert_type
).inc()
def record_item_processed(item_type: str, severity: str, alert_type: str, status: str):
"""Record that an item was processed"""
items_processed.labels(
item_type=item_type,
severity=severity,
type=alert_type,
status=status
).inc()
def record_notification_sent(channel: str, item_type: str, severity: str, status: str):
"""Record notification delivery"""
notifications_sent.labels(
channel=channel,
item_type=item_type,
severity=severity,
status=status
).inc()
def update_active_items(tenant_id: str, item_type: str, severity: str, count: int):
"""Update active items gauge"""
active_items_gauge.labels(
tenant_id=tenant_id,
item_type=item_type,
severity=severity
).set(count)
def update_component_health(component: str, service: str, is_healthy: bool):
"""Update component health status"""
system_component_health.labels(
component=component,
service=service
).set(1 if is_healthy else 0)
def update_connection_status(connection_type: str, service: str, is_connected: bool):
"""Update connection status"""
if connection_type == 'rabbitmq':
rabbitmq_connection_status.labels(service=service).set(1 if is_connected else 0)
elif connection_type == 'redis':
redis_connection_status.labels(service=service).set(1 if is_connected else 0)
# =================================================================
# METRICS AGGREGATOR
# =================================================================
class AlertMetricsCollector:
"""Centralized metrics collector for alert system"""
def __init__(self, service_name: str):
self.service_name = service_name
def record_check_performed(self, check_type: str, pattern: str):
"""Record that a check was performed"""
item_checks_performed.labels(
service=self.service_name,
check_type=check_type,
pattern=pattern
).inc()
def record_detection_error(self, error_type: str, check_type: str):
"""Record detection error"""
alert_detection_errors.labels(
service=self.service_name,
error_type=error_type,
check_type=check_type
).inc()
def record_duplicate_prevented(self, item_type: str, alert_type: str):
"""Record prevented duplicate"""
duplicate_items_prevented.labels(
service=self.service_name,
item_type=item_type,
type=alert_type
).inc()
def update_leader_status(self, is_leader: bool):
"""Update leader election status"""
scheduler_leader_status.labels(service=self.service_name).set(1 if is_leader else 0)
def get_service_metrics(self) -> Dict[str, Any]:
"""Get all metrics for this service"""
return {
'service': self.service_name,
'items_published': items_published._value._value,
'checks_performed': item_checks_performed._value._value,
'detection_errors': alert_detection_errors._value._value,
'duplicates_prevented': duplicate_items_prevented._value._value
}
# =================================================================
# DASHBOARD METRICS
# =================================================================
def get_system_overview_metrics() -> Dict[str, Any]:
"""Get overview metrics for monitoring dashboard"""
try:
return {
'total_items_published': sum(items_published._value._value.values()),
'total_checks_performed': sum(item_checks_performed._value._value.values()),
'total_notifications_sent': sum(notifications_sent._value._value.values()),
'active_sse_connections': sum(sse_active_connections._value._value.values()),
'processing_errors': sum(processing_errors._value._value.values()),
'delivery_failures': sum(delivery_failures._value._value.values()),
'timestamp': time.time()
}
except Exception as e:
logger.error("Error collecting overview metrics", error=str(e))
return {'error': str(e), 'timestamp': time.time()}
def get_tenant_metrics(tenant_id: str) -> Dict[str, Any]:
"""Get metrics for a specific tenant"""
try:
return {
'tenant_id': tenant_id,
'active_connections': sse_active_connections.labels(tenant_id=tenant_id)._value._value,
'events_sent': sum([
v for k, v in sse_events_sent._value._value.items()
if k[0] == tenant_id
]),
'timestamp': time.time()
}
except Exception as e:
logger.error("Error collecting tenant metrics", tenant_id=tenant_id, error=str(e))
return {'tenant_id': tenant_id, 'error': str(e), 'timestamp': time.time()}

View File

@@ -1,258 +0,0 @@
# shared/monitoring/scheduler_metrics.py
"""
Scheduler Metrics - Prometheus metrics for production and procurement schedulers
Provides comprehensive metrics for monitoring automated daily planning:
- Scheduler execution success/failure rates
- Tenant processing times
- Cache hit rates for forecasts
- Plan generation statistics
"""
from prometheus_client import Counter, Histogram, Gauge, Info
import structlog
logger = structlog.get_logger()
# ================================================================
# PRODUCTION SCHEDULER METRICS
# ================================================================
production_schedules_generated_total = Counter(
'production_schedules_generated_total',
'Total number of production schedules generated',
['tenant_id', 'status'] # status: success, failure
)
production_schedule_generation_duration_seconds = Histogram(
'production_schedule_generation_duration_seconds',
'Time taken to generate production schedule per tenant',
['tenant_id'],
buckets=[1, 5, 10, 30, 60, 120, 180, 300] # seconds
)
production_tenants_processed_total = Counter(
'production_tenants_processed_total',
'Total number of tenants processed by production scheduler',
['status'] # status: success, failure, timeout
)
production_batches_created_total = Counter(
'production_batches_created_total',
'Total number of production batches created',
['tenant_id']
)
production_scheduler_runs_total = Counter(
'production_scheduler_runs_total',
'Total number of production scheduler executions',
['trigger'] # trigger: scheduled, manual, test
)
production_scheduler_errors_total = Counter(
'production_scheduler_errors_total',
'Total number of production scheduler errors',
['error_type']
)
# ================================================================
# PROCUREMENT SCHEDULER METRICS
# ================================================================
procurement_plans_generated_total = Counter(
'procurement_plans_generated_total',
'Total number of procurement plans generated',
['tenant_id', 'status'] # status: success, failure
)
procurement_plan_generation_duration_seconds = Histogram(
'procurement_plan_generation_duration_seconds',
'Time taken to generate procurement plan per tenant',
['tenant_id'],
buckets=[1, 5, 10, 30, 60, 120, 180, 300]
)
procurement_tenants_processed_total = Counter(
'procurement_tenants_processed_total',
'Total number of tenants processed by procurement scheduler',
['status'] # status: success, failure, timeout
)
procurement_requirements_created_total = Counter(
'procurement_requirements_created_total',
'Total number of procurement requirements created',
['tenant_id', 'priority'] # priority: critical, high, medium, low
)
procurement_scheduler_runs_total = Counter(
'procurement_scheduler_runs_total',
'Total number of procurement scheduler executions',
['trigger'] # trigger: scheduled, manual, test
)
procurement_plan_rejections_total = Counter(
'procurement_plan_rejections_total',
'Total number of procurement plans rejected',
['tenant_id', 'auto_regenerated'] # auto_regenerated: true, false
)
procurement_plans_by_status = Gauge(
'procurement_plans_by_status',
'Number of procurement plans by status',
['tenant_id', 'status']
)
# ================================================================
# FORECAST CACHING METRICS
# ================================================================
forecast_cache_hits_total = Counter(
'forecast_cache_hits_total',
'Total number of forecast cache hits',
['tenant_id']
)
forecast_cache_misses_total = Counter(
'forecast_cache_misses_total',
'Total number of forecast cache misses',
['tenant_id']
)
forecast_cache_hit_rate = Gauge(
'forecast_cache_hit_rate',
'Forecast cache hit rate percentage (0-100)',
['tenant_id']
)
forecast_cache_entries_total = Gauge(
'forecast_cache_entries_total',
'Total number of entries in forecast cache',
['cache_type'] # cache_type: single, batch
)
forecast_cache_invalidations_total = Counter(
'forecast_cache_invalidations_total',
'Total number of forecast cache invalidations',
['tenant_id', 'reason'] # reason: model_retrain, manual, expiry
)
# ================================================================
# GENERAL SCHEDULER HEALTH METRICS
# ================================================================
scheduler_health_status = Gauge(
'scheduler_health_status',
'Scheduler health status (1=healthy, 0=unhealthy)',
['service', 'scheduler_type'] # service: production, orders; scheduler_type: daily, weekly, cleanup
)
scheduler_last_run_timestamp = Gauge(
'scheduler_last_run_timestamp',
'Unix timestamp of last scheduler run',
['service', 'scheduler_type']
)
scheduler_next_run_timestamp = Gauge(
'scheduler_next_run_timestamp',
'Unix timestamp of next scheduled run',
['service', 'scheduler_type']
)
tenant_processing_timeout_total = Counter(
'tenant_processing_timeout_total',
'Total number of tenant processing timeouts',
['service', 'tenant_id'] # service: production, procurement
)
# ================================================================
# HELPER FUNCTIONS FOR METRICS
# ================================================================
class SchedulerMetricsCollector:
"""Helper class for collecting scheduler metrics"""
@staticmethod
def record_production_schedule_generated(tenant_id: str, success: bool, duration_seconds: float, batches_created: int):
"""Record production schedule generation"""
status = 'success' if success else 'failure'
production_schedules_generated_total.labels(tenant_id=tenant_id, status=status).inc()
production_schedule_generation_duration_seconds.labels(tenant_id=tenant_id).observe(duration_seconds)
if success:
production_batches_created_total.labels(tenant_id=tenant_id).inc(batches_created)
@staticmethod
def record_procurement_plan_generated(tenant_id: str, success: bool, duration_seconds: float, requirements_count: int):
"""Record procurement plan generation"""
status = 'success' if success else 'failure'
procurement_plans_generated_total.labels(tenant_id=tenant_id, status=status).inc()
procurement_plan_generation_duration_seconds.labels(tenant_id=tenant_id).observe(duration_seconds)
if success:
procurement_requirements_created_total.labels(
tenant_id=tenant_id,
priority='medium' # Default, should be updated with actual priority
).inc(requirements_count)
@staticmethod
def record_scheduler_run(service: str, trigger: str = 'scheduled'):
"""Record scheduler execution"""
if service == 'production':
production_scheduler_runs_total.labels(trigger=trigger).inc()
elif service == 'procurement':
procurement_scheduler_runs_total.labels(trigger=trigger).inc()
@staticmethod
def record_tenant_processing(service: str, status: str):
"""Record tenant processing result"""
if service == 'production':
production_tenants_processed_total.labels(status=status).inc()
elif service == 'procurement':
procurement_tenants_processed_total.labels(status=status).inc()
@staticmethod
def record_forecast_cache_lookup(tenant_id: str, hit: bool):
"""Record forecast cache lookup"""
if hit:
forecast_cache_hits_total.labels(tenant_id=tenant_id).inc()
else:
forecast_cache_misses_total.labels(tenant_id=tenant_id).inc()
@staticmethod
def update_forecast_cache_hit_rate(tenant_id: str, hit_rate_percent: float):
"""Update forecast cache hit rate"""
forecast_cache_hit_rate.labels(tenant_id=tenant_id).set(hit_rate_percent)
@staticmethod
def record_plan_rejection(tenant_id: str, auto_regenerated: bool):
"""Record procurement plan rejection"""
procurement_plan_rejections_total.labels(
tenant_id=tenant_id,
auto_regenerated='true' if auto_regenerated else 'false'
).inc()
@staticmethod
def update_scheduler_health(service: str, scheduler_type: str, is_healthy: bool):
"""Update scheduler health status"""
scheduler_health_status.labels(
service=service,
scheduler_type=scheduler_type
).set(1 if is_healthy else 0)
@staticmethod
def record_timeout(service: str, tenant_id: str):
"""Record tenant processing timeout"""
tenant_processing_timeout_total.labels(
service=service,
tenant_id=tenant_id
).inc()
# Global metrics collector instance
metrics_collector = SchedulerMetricsCollector()
def get_scheduler_metrics_collector() -> SchedulerMetricsCollector:
"""Get global scheduler metrics collector"""
return metrics_collector