Improve metrics

This commit is contained in:
Urtzi Alfaro
2026-01-08 20:48:24 +01:00
parent 29d19087f1
commit e8fda39e50
21 changed files with 615 additions and 3019 deletions

View File

@@ -1,337 +0,0 @@
# HTTPS in Development Environment
## Overview
The development environment now uses HTTPS by default to match production behavior and to catch SSL-related issues early.
**Benefits:**
- ✅ Matches production HTTPS behavior
- ✅ Tests SSL/TLS configurations
- ✅ Catches mixed content warnings
- ✅ Tests secure cookie handling
- ✅ Better dev-prod parity
---
## Quick Start
### 1. Deploy with HTTPS Enabled
```bash
# Start development environment
skaffold dev --profile=dev
# Wait for certificate to be issued
kubectl get certificate -n bakery-ia
# You should see:
# NAME READY SECRET AGE
# bakery-dev-tls-cert True bakery-dev-tls-cert 1m
```
### 2. Access Your Application
```bash
# Access via HTTPS (will show certificate warning in browser)
open https://localhost
# Or via curl (use -k to skip certificate verification)
curl -k https://localhost/api/health
```
---
## Trust the Self-Signed Certificate
To avoid browser certificate warnings, you need to trust the self-signed certificate.
### Option 1: Accept Browser Warning (Quick & Easy)
When you visit `https://localhost`:
1. Browser shows "Your connection is not private" or similar
2. Click "Advanced" or "Show details"
3. Click "Proceed to localhost" or "Accept the risk"
4. The certificate warning reappears only once per browser session
### Option 2: Trust Certificate in System (Recommended)
#### On macOS:
```bash
# 1. Export the certificate from Kubernetes
kubectl get secret bakery-dev-tls-cert -n bakery-ia -o jsonpath='{.data.tls\.crt}' | base64 -d > /tmp/bakery-dev-cert.crt
# 2. Add to Keychain
sudo security add-trusted-cert -d -r trustRoot -k /Library/Keychains/System.keychain /tmp/bakery-dev-cert.crt
# 3. Verify
security find-certificate -c localhost -a
# 4. Cleanup
rm /tmp/bakery-dev-cert.crt
```
**Alternative (GUI):**
1. Export certificate: `kubectl get secret bakery-dev-tls-cert -n bakery-ia -o jsonpath='{.data.tls\.crt}' | base64 -d > bakery-dev-cert.crt`
2. Double-click the `.crt` file to open Keychain Access
3. Find "localhost" certificate
4. Double-click → Trust → "Always Trust"
5. Close and enter your password
#### On Linux:
```bash
# 1. Export the certificate
kubectl get secret bakery-dev-tls-cert -n bakery-ia -o jsonpath='{.data.tls\.crt}' | base64 -d | sudo tee /usr/local/share/ca-certificates/bakery-dev.crt
# 2. Update CA certificates
sudo update-ca-certificates
# 3. For browsers (Chromium/Chrome)
mkdir -p $HOME/.pki/nssdb
certutil -d sql:$HOME/.pki/nssdb -A -t "P,," -n "Bakery Dev" -i /usr/local/share/ca-certificates/bakery-dev.crt
```
#### On Windows:
```powershell
# 1. Export the certificate (the secret data is base64-encoded and must be decoded)
$b64 = kubectl get secret bakery-dev-tls-cert -n bakery-ia -o jsonpath='{.data.tls\.crt}'
[IO.File]::WriteAllBytes("$PWD\bakery-dev-cert.crt", [Convert]::FromBase64String($b64))
# 2. Import to Trusted Root
Import-Certificate -FilePath .\bakery-dev-cert.crt -CertStoreLocation Cert:\LocalMachine\Root
# Or use GUI:
# - Double-click bakery-dev-cert.crt
# - Install Certificate
# - Store Location: Local Machine
# - Place in: Trusted Root Certification Authorities
```
---
## Testing HTTPS
### Test with curl
```bash
# Without certificate verification (quick test)
curl -k https://localhost/api/health
# With certificate verification (after trusting cert)
curl https://localhost/api/health
# Check certificate details
curl -vI https://localhost/api/health 2>&1 | grep -A 10 "Server certificate"
# Test CORS with HTTPS
curl -H "Origin: https://localhost:3000" \
-H "Access-Control-Request-Method: POST" \
-X OPTIONS https://localhost/api/health
```
### Test with Browser
1. Open `https://localhost`
2. Check for SSL/TLS padlock in address bar
3. Click padlock → View certificate
4. Verify (see also the `openssl` check below):
- Issued to: localhost
- Issued by: localhost (self-signed)
- Valid for: 90 days
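The same fields can be checked from the command line; a quick sketch using `openssl` (assuming it is installed and the ingress listens on 443):
```bash
# Print subject, issuer, and validity window of the served certificate
openssl s_client -connect localhost:443 -servername localhost </dev/null 2>/dev/null \
  | openssl x509 -noout -subject -issuer -dates
```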
### Test Frontend
```bash
# Update your frontend .env to use HTTPS
echo "VITE_API_URL=https://localhost/api" > frontend/.env.local
# Frontend should now make HTTPS requests
```
---
## Certificate Details
### Certificate Specifications
- **Type**: Self-signed (for development)
- **Algorithm**: RSA 2048-bit
- **Validity**: 90 days (auto-renews 15 days before expiration)
- **Common Name**: localhost
- **DNS Names**:
- localhost
- bakery-ia.local
- api.bakery-ia.local
- *.bakery-ia.local
- **IP Addresses**: 127.0.0.1, ::1
### Certificate Issuer
- **Issuer**: `selfsigned-issuer` (cert-manager ClusterIssuer)
- **Auto-renewal**: Managed by cert-manager
- **Secret Name**: `bakery-dev-tls-cert`
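To sanity-check that both the issuer and the issued secret exist:
```bash
# ClusterIssuer is cluster-scoped; the TLS secret lives in the app namespace
kubectl get clusterissuer selfsigned-issuer
kubectl get secret bakery-dev-tls-cert -n bakery-ia
```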
---
## Troubleshooting
### Certificate Not Issued
```bash
# Check certificate status
kubectl describe certificate bakery-dev-tls-cert -n bakery-ia
# Check cert-manager logs
kubectl logs -n cert-manager deployment/cert-manager
# Check if cert-manager is installed
kubectl get pods -n cert-manager
# If cert-manager is not installed:
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.2/cert-manager.yaml
```
### Certificate Warning in Browser
**Normal for self-signed certificates!** Choose one:
1. Click "Proceed" (quick, temporary)
2. Trust the certificate in your system (permanent)
### Mixed Content Warnings
If you see "mixed content" errors:
- Ensure all API calls use HTTPS
- Check for hardcoded HTTP URLs (see the grep below)
- Update `VITE_API_URL` to use HTTPS
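A rough way to hunt for hardcoded HTTP URLs in the frontend source (a sketch; adjust the path and file globs to your layout):
```bash
# List source lines that still reference plain http://
grep -rn "http://" frontend/src \
  --include='*.ts' --include='*.tsx' --include='*.js' --include='*.jsx'
```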
### Certificate Expired
```bash
# Check expiration
kubectl get certificate bakery-dev-tls-cert -n bakery-ia -o jsonpath='{.status.notAfter}'
# Force renewal
kubectl delete certificate bakery-dev-tls-cert -n bakery-ia
kubectl apply -k infrastructure/kubernetes/overlays/dev
# cert-manager will automatically recreate it
```
### Browser Shows "NET::ERR_CERT_AUTHORITY_INVALID"
This is expected for self-signed certificates. Options:
1. Click "Advanced" → "Proceed to localhost"
2. Trust the certificate (see instructions above)
3. Use curl with `-k` flag for testing
---
## Disable HTTPS (Not Recommended)
If you need to temporarily disable HTTPS:
```bash
# Edit dev-ingress.yaml
vim infrastructure/kubernetes/overlays/dev/dev-ingress.yaml
# Change:
# nginx.ingress.kubernetes.io/ssl-redirect: "true" → "false"
# nginx.ingress.kubernetes.io/force-ssl-redirect: "true" → "false"
# Comment out the tls section:
# tls:
# - hosts:
# - localhost
# secretName: bakery-dev-tls-cert
# Redeploy
skaffold dev --profile=dev
```
---
## Differences from Production
| Aspect | Development | Production |
|--------|-------------|------------|
| Certificate Type | Self-signed | Let's Encrypt |
| Validity | 90 days | 90 days |
| Auto-renewal | cert-manager | cert-manager |
| Trust | Manual trust needed | Automatically trusted |
| Domains | localhost | Real domains |
| Browser Warning | Yes (self-signed) | No (CA-signed) |
---
## FAQ
### Q: Why am I seeing certificate warnings?
**A:** Self-signed certificates aren't trusted by browsers by default. Trust the certificate or click "Proceed."
### Q: Do I need to trust the certificate?
**A:** No, but it makes development easier. Otherwise you'll need to click "Proceed" once per browser session.
### Q: Will this affect my frontend development?
**A:** Slightly. Update `VITE_API_URL` to use `https://`; otherwise everything works the same.
### Q: Can I use HTTP instead?
**A:** Yes, but not recommended. It reduces dev-prod parity and won't catch HTTPS issues.
### Q: How often do I need to re-trust the certificate?
**A:** Only when the certificate is recreated (every 90 days or when you delete the cluster).
### Q: Does this work with bakery-ia.local?
**A:** Yes! The certificate is valid for both `localhost` and `bakery-ia.local`.
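For `bakery-ia.local` to resolve, your machine needs a hosts entry (an assumption about your local setup, mirroring the pattern used for `monitoring.bakery-ia.local` in the monitoring docs):
```bash
# Point the dev hostnames at the local ingress
echo "127.0.0.1 bakery-ia.local api.bakery-ia.local" | sudo tee -a /etc/hosts
```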
---
## Additional Security Testing
With HTTPS enabled, you can now test:
### 1. Secure Cookies
```javascript
// In your frontend
document.cookie = "session=test; Secure; SameSite=Strict";
```
### 2. Mixed Content Detection
```javascript
// This will show warning in dev (good - catches prod issues!)
fetch('http://api.example.com/data') // ❌ Mixed content
fetch('https://api.example.com/data') // ✅ Secure
```
### 3. HSTS (HTTP Strict Transport Security)
```bash
# Check HSTS headers
curl -I https://localhost/api/health | grep -i strict
```
### 4. TLS Version Testing
```bash
# Test TLS 1.2
curl --tlsv1.2 https://localhost/api/health
# Test TLS 1.3
curl --tlsv1.3 https://localhost/api/health
```
---
## Summary
**Enabled**: HTTPS in development by default
**Certificate**: Self-signed, auto-renewed
**Access**: `https://localhost`
**Trust**: Optional but recommended
**Benefit**: Better dev-prod parity
**Next Steps:**
1. Deploy: `skaffold dev --profile=dev`
2. Access: `https://localhost`
3. Trust: Follow instructions above (optional)
4. Test: Verify HTTPS works
For issues, see Troubleshooting section or check cert-manager logs.

View File

@@ -1,337 +0,0 @@
# Docker Hub Configuration Guide
This guide explains how to configure Docker Hub for all image pulls in the Bakery IA project.
## Overview
The project has been configured to use Docker Hub credentials for pulling both:
- **Base images** (postgres, redis, python, node, nginx, etc.)
- **Custom bakery images** (bakery/auth-service, bakery/gateway, etc.)
## Quick Start
### 1. Create Docker Hub Secret in Kubernetes
Run the automated setup script:
```bash
./infrastructure/kubernetes/setup-dockerhub-secrets.sh
```
This script will:
- Create the `dockerhub-creds` secret in all namespaces (bakery-ia, bakery-ia-dev, bakery-ia-prod, default)
- Use the `uals` username together with a Docker Hub access token (`dckr_pat_…`; substitute your own)
### 2. Apply Updated Kubernetes Manifests
All manifests have been updated with `imagePullSecrets`. Apply them:
```bash
# For development
kubectl apply -k infrastructure/kubernetes/overlays/dev
# For production
kubectl apply -k infrastructure/kubernetes/overlays/prod
```
### 3. Verify Pods Can Pull Images
```bash
# Check pod status
kubectl get pods -n bakery-ia
# Check events for image pull status
kubectl get events -n bakery-ia --sort-by='.lastTimestamp'
# Describe a specific pod to see image pull details
kubectl describe pod <pod-name> -n bakery-ia
```
## Manual Setup
If you prefer to create the secret manually:
```bash
kubectl create secret docker-registry dockerhub-creds \
--docker-server=docker.io \
--docker-username=uals \
--docker-password=<your-dockerhub-token> \
--docker-email=ualfaro@gmail.com \
-n bakery-ia
```
Repeat for other namespaces:
```bash
kubectl create secret docker-registry dockerhub-creds \
--docker-server=docker.io \
--docker-username=uals \
--docker-password=<your-dockerhub-token> \
--docker-email=ualfaro@gmail.com \
-n bakery-ia-dev
kubectl create secret docker-registry dockerhub-creds \
--docker-server=docker.io \
--docker-username=uals \
--docker-password=<your-dockerhub-token> \
--docker-email=ualfaro@gmail.com \
-n bakery-ia-prod
```
## What Was Changed
### 1. Kubernetes Manifests (47 files updated)
All deployments, jobs, and cronjobs now include `imagePullSecrets`:
```yaml
spec:
  template:
    spec:
      imagePullSecrets:
        - name: dockerhub-creds
      containers:
        - name: ...
```
**Files Updated:**
- **19 Service Deployments**: All microservices (auth, tenant, forecasting, etc.)
- **21 Database Deployments**: All PostgreSQL instances, Redis, RabbitMQ
- **21 Migration Jobs**: All database migration jobs
- **2 CronJobs**: demo-cleanup, external-data-rotation
- **2 Standalone Jobs**: external-data-init, nominatim-init
- **1 Worker Deployment**: demo-cleanup-worker
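To spot-check that a given workload actually picked up the secret (a quick sketch using `auth-service` as the example):
```bash
# Prints the imagePullSecrets names attached to the pod template
kubectl get deployment auth-service -n bakery-ia \
  -o jsonpath='{.spec.template.spec.imagePullSecrets[*].name}'
```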
### 2. Tiltfile Configuration
The Tiltfile now supports both local registry and Docker Hub:
**Default (Local Registry):**
```bash
tilt up
```
**Docker Hub Mode:**
```bash
export USE_DOCKERHUB=true
export DOCKERHUB_USERNAME=uals
tilt up
```
### 3. Scripts
Two new scripts were created:
1. **[setup-dockerhub-secrets.sh](../infrastructure/kubernetes/setup-dockerhub-secrets.sh)**
- Creates Docker Hub secrets in all namespaces
- Idempotent (safe to run multiple times)
2. **[add-image-pull-secrets.sh](../infrastructure/kubernetes/add-image-pull-secrets.sh)**
- Adds `imagePullSecrets` to all Kubernetes manifests
- Already run (no need to run again unless adding new manifests)
## Using Docker Hub with Tilt
To use Docker Hub for development with Tilt:
```bash
# Login to Docker Hub first
docker login -u uals
# Enable Docker Hub mode
export USE_DOCKERHUB=true
export DOCKERHUB_USERNAME=uals
# Start Tilt
tilt up
```
This will:
- Build images locally
- Tag them as `docker.io/uals/<image-name>`
- Push them to Docker Hub
- Deploy to Kubernetes with imagePullSecrets
## Images Configuration
### Base Images (from Docker Hub)
These images are pulled from Docker Hub's public registry:
- `python:3.11-slim` - Python base for all microservices
- `node:18-alpine` - Node.js for frontend builder
- `nginx:1.25-alpine` - Nginx for frontend production
- `postgres:17-alpine` - PostgreSQL databases
- `redis:7.4-alpine` - Redis cache
- `rabbitmq:4.1-management-alpine` - RabbitMQ message broker
- `busybox:latest` - Utility container
- `curlimages/curl:latest` - Curl utility
- `mediagis/nominatim:4.4` - Geolocation service
### Custom Images (bakery/*)
These images are built by the project:
**Infrastructure:**
- `bakery/gateway`
- `bakery/dashboard`
**Core Services:**
- `bakery/auth-service`
- `bakery/tenant-service`
**Data & Analytics:**
- `bakery/training-service`
- `bakery/forecasting-service`
- `bakery/ai-insights-service`
**Operations:**
- `bakery/sales-service`
- `bakery/inventory-service`
- `bakery/production-service`
- `bakery/procurement-service`
- `bakery/distribution-service`
**Supporting:**
- `bakery/recipes-service`
- `bakery/suppliers-service`
- `bakery/pos-service`
- `bakery/orders-service`
- `bakery/external-service`
**Platform:**
- `bakery/notification-service`
- `bakery/alert-processor`
- `bakery/orchestrator-service`
**Demo:**
- `bakery/demo-session-service`
## Pushing Custom Images to Docker Hub
Use the existing tag-and-push script:
```bash
# Login first
docker login -u uals
# Tag and push all images
./scripts/tag-and-push-images.sh
```
Or manually for a specific image:
```bash
# Build
docker build -t bakery/auth-service:latest -f services/auth/Dockerfile .
# Tag for Docker Hub
docker tag bakery/auth-service:latest uals/bakery-auth-service:latest
# Push
docker push uals/bakery-auth-service:latest
```
## Troubleshooting
### Problem: ImagePullBackOff error
Check if the secret exists:
```bash
kubectl get secret dockerhub-creds -n bakery-ia
```
Verify secret is correctly configured:
```bash
kubectl get secret dockerhub-creds -n bakery-ia -o yaml
```
Check pod events:
```bash
kubectl describe pod <pod-name> -n bakery-ia
```
### Problem: Authentication failure
The Docker Hub credentials might be incorrect or expired. Update the secret:
```bash
# Delete old secret
kubectl delete secret dockerhub-creds -n bakery-ia
# Create new secret with updated credentials
kubectl create secret docker-registry dockerhub-creds \
--docker-server=docker.io \
--docker-username=<your-username> \
--docker-password=<your-token> \
--docker-email=<your-email> \
-n bakery-ia
```
### Problem: Pod still using old credentials
Restart the pod to pick up the new secret:
```bash
kubectl rollout restart deployment/<deployment-name> -n bakery-ia
```
## Security Best Practices
1. **Use Docker Hub Access Tokens** (not passwords)
- Create at: https://hub.docker.com/settings/security
- Set appropriate permissions (Read-only for pulls)
2. **Rotate Credentials Regularly**
- Update the secret every 90 days
- Use the setup script for consistent updates
3. **Limit Secret Access**
- Only grant access to necessary namespaces
- Use RBAC to control who can read secrets
4. **Monitor Usage**
- Check Docker Hub pull rate limits
- Monitor for unauthorized access
## Rate Limits
Docker Hub has rate limits for image pulls:
- **Anonymous users**: 100 pulls per 6 hours per IP
- **Authenticated users**: 200 pulls per 6 hours
- **Pro/Team**: Unlimited
Using authentication (imagePullSecrets) ensures you get the authenticated user rate limit.
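You can check your current allowance with Docker's documented rate-limit endpoint; a sketch (assumes `curl` and `jq`; substitute your own token):
```bash
# Fetch a pull token for Docker's rate-limit test repository
TOKEN=$(curl -s --user "uals:<your-dockerhub-token>" \
  "https://auth.docker.io/token?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull" \
  | jq -r .token)

# A HEAD request returns ratelimit-limit / ratelimit-remaining headers
curl -sI -H "Authorization: Bearer $TOKEN" \
  "https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest" | grep -i ratelimit
```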
## Environment Variables
For CI/CD or automated deployments, use these environment variables:
```bash
export DOCKER_USERNAME=uals
export DOCKER_PASSWORD=<your-dockerhub-token>
export DOCKER_EMAIL=ualfaro@gmail.com
```
## Next Steps
1. ✅ Docker Hub secret created in all namespaces
2. ✅ All Kubernetes manifests updated with imagePullSecrets
3. ✅ Tiltfile configured for optional Docker Hub usage
4. 🔄 Apply manifests to your cluster
5. 🔄 Verify pods can pull images successfully
## Related Documentation
- [Kubernetes Setup Guide](./KUBERNETES_SETUP.md)
- [Security Implementation](./SECURITY_IMPLEMENTATION_COMPLETE.md)
- [Tilt Development Workflow](../Tiltfile)
## Support
If you encounter issues:
1. Check the troubleshooting section above
2. Verify Docker Hub credentials at: https://hub.docker.com/settings/security
3. Check Kubernetes events: `kubectl get events -A --sort-by='.lastTimestamp'`
4. Review pod logs: `kubectl logs -n bakery-ia <pod-name>`

View File

@@ -1,449 +0,0 @@
# Complete Monitoring Guide - Bakery IA Platform
This guide provides the complete overview of observability implementation for the Bakery IA platform using SigNoz and OpenTelemetry.
## 🎯 Executive Summary
**What's Implemented:**
- **Distributed Tracing** - All 17 services
- **Application Metrics** - HTTP requests, latencies, errors
- **System Metrics** - CPU, memory, disk, network per service
- **Structured Logs** - With trace correlation
- **Database Monitoring** - PostgreSQL, Redis, RabbitMQ metrics
- **Pure OpenTelemetry** - No Prometheus, all OTLP push
**Technology Stack:**
- **Backend**: OpenTelemetry Python SDK
- **Collector**: OpenTelemetry Collector (OTLP receivers)
- **Storage**: ClickHouse (traces, metrics, logs)
- **Frontend**: SigNoz UI
- **Protocol**: OTLP over HTTP/gRPC
## 📊 Architecture
```
┌──────────────────────────────────────────────────────────┐
│ Application Services │
│ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ │
│ │ auth │ │ inv │ │ orders │ │ ... │ │
│ └───┬────┘ └───┬────┘ └───┬────┘ └───┬────┘ │
│ │ │ │ │ │
│ └───────────┴────────────┴───────────┘ │
│ │ │
│ Traces + Metrics + Logs │
│ (OpenTelemetry OTLP) │
└──────────────────┼──────────────────────────────────────┘
┌──────────────────────────────────────────────────────────┐
│ Database Monitoring Collector │
│ ┌────────┐ ┌────────┐ ┌────────┐ │
│ │ PG │ │ Redis │ │RabbitMQ│ │
│ └───┬────┘ └───┬────┘ └───┬────┘ │
│ │ │ │ │
│ └───────────┴────────────┘ │
│ │ │
│ Database Metrics │
└──────────────────┼──────────────────────────────────────┘
┌──────────────────────────────────────────────────────────┐
│ SigNoz OpenTelemetry Collector │
│ │
│ Receivers: OTLP (gRPC :4317, HTTP :4318) │
│ Processors: batch, memory_limiter, resourcedetection │
│ Exporters: ClickHouse │
└──────────────────┼──────────────────────────────────────┘
┌──────────────────────────────────────────────────────────┐
│ ClickHouse Database │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Traces │ │ Metrics │ │ Logs │ │
│ └──────────┘ └──────────┘ └──────────┘ │
└──────────────────┼──────────────────────────────────────┘
┌──────────────────────────────────────────────────────────┐
│ SigNoz Frontend UI │
│ https://monitoring.bakery-ia.local │
└──────────────────────────────────────────────────────────┘
```
## 🚀 Quick Start
### 1. Deploy SigNoz
```bash
# Add Helm repository
helm repo add signoz https://charts.signoz.io
helm repo update
# Create namespace and install
kubectl create namespace signoz
helm install signoz signoz/signoz \
-n signoz \
-f infrastructure/helm/signoz-values-dev.yaml
# Wait for pods
kubectl wait --for=condition=ready pod -l app=signoz -n signoz --timeout=300s
```
### 2. Deploy Services with Monitoring
All services are already configured with OpenTelemetry environment variables.
```bash
# Apply all services
kubectl apply -k infrastructure/kubernetes/overlays/dev/
# Or restart existing services
kubectl rollout restart deployment -n bakery-ia
```
### 3. Deploy Database Monitoring
```bash
# Run the setup script
./infrastructure/kubernetes/setup-database-monitoring.sh
# This will:
# - Create monitoring users in PostgreSQL
# - Deploy OpenTelemetry collector for database metrics
# - Start collecting PostgreSQL, Redis, RabbitMQ metrics
```
### 4. Access SigNoz UI
```bash
# Via ingress
open https://monitoring.bakery-ia.local
# Or port-forward
kubectl port-forward -n signoz svc/signoz-frontend 3301:3301
open http://localhost:3301
```
## 📈 Metrics Collected
### Application Metrics (Per Service)
| Metric | Description | Type |
|--------|-------------|------|
| `http_requests_total` | Total HTTP requests | Counter |
| `http_request_duration_seconds` | Request latency | Histogram |
| `active_requests` | Current active requests | Gauge |
### System Metrics (Per Service)
| Metric | Description | Type |
|--------|-------------|------|
| `process.cpu.utilization` | Process CPU % | Gauge |
| `process.memory.usage` | Process memory bytes | Gauge |
| `process.memory.utilization` | Process memory % | Gauge |
| `process.threads.count` | Thread count | Gauge |
| `process.open_file_descriptors` | Open FDs (Unix) | Gauge |
| `system.cpu.utilization` | System CPU % | Gauge |
| `system.memory.usage` | System memory | Gauge |
| `system.memory.utilization` | System memory % | Gauge |
| `system.disk.io.read` | Disk read bytes | Counter |
| `system.disk.io.write` | Disk write bytes | Counter |
| `system.network.io.sent` | Network sent bytes | Counter |
| `system.network.io.received` | Network recv bytes | Counter |
### PostgreSQL Metrics
| Metric | Description |
|--------|-------------|
| `postgresql.backends` | Active connections |
| `postgresql.database.size` | Database size in bytes |
| `postgresql.commits` | Transaction commits |
| `postgresql.rollbacks` | Transaction rollbacks |
| `postgresql.deadlocks` | Deadlock count |
| `postgresql.blocks_read` | Blocks read from disk |
| `postgresql.table.size` | Table size |
| `postgresql.index.size` | Index size |
### Redis Metrics
| Metric | Description |
|--------|-------------|
| `redis.clients.connected` | Connected clients |
| `redis.commands.processed` | Commands processed |
| `redis.keyspace.hits` | Cache hits |
| `redis.keyspace.misses` | Cache misses |
| `redis.memory.used` | Memory usage |
| `redis.memory.fragmentation_ratio` | Fragmentation |
| `redis.db.keys` | Number of keys |
### RabbitMQ Metrics
| Metric | Description |
|--------|-------------|
| `rabbitmq.consumer.count` | Active consumers |
| `rabbitmq.message.current` | Messages in queue |
| `rabbitmq.message.acknowledged` | Messages ACKed |
| `rabbitmq.message.delivered` | Messages delivered |
| `rabbitmq.message.published` | Messages published |
## 🔍 Traces
**Automatic instrumentation for:**
- FastAPI endpoints
- HTTP client requests (HTTPX)
- Redis commands
- PostgreSQL queries (SQLAlchemy)
- RabbitMQ publish/consume
**View traces:**
1. Go to **Services** tab in SigNoz
2. Select a service
3. View individual traces
4. Click trace → See full span tree with timing
## 📝 Logs
**Features:**
- Structured logging with context
- Automatic trace-log correlation
- Searchable by service, level, message, custom fields
**View logs:**
1. Go to **Logs** tab in SigNoz
2. Filter by service: `service_name="auth-service"`
3. Search for specific messages
4. Click log → See full context including trace_id
## 🎛️ Configuration Files
### Services
All services configured in:
```
infrastructure/kubernetes/base/components/*/*-service.yaml
```
Each service has these environment variables:
```yaml
env:
  - name: OTEL_COLLECTOR_ENDPOINT
    value: "http://signoz-otel-collector.signoz.svc.cluster.local:4318"
  - name: OTEL_SERVICE_NAME
    value: "service-name"
  - name: ENABLE_TRACING
    value: "true"
  - name: OTEL_LOGS_EXPORTER
    value: "otlp"
  - name: ENABLE_OTEL_METRICS
    value: "true"
  - name: ENABLE_SYSTEM_METRICS
    value: "true"
```
### SigNoz
Configuration file:
```
infrastructure/helm/signoz-values-dev.yaml
```
Key settings:
- OTLP receivers on ports 4317 (gRPC) and 4318 (HTTP)
- No Prometheus scraping (pure OTLP push)
- ClickHouse backend for storage
- Reduced resources for development
### Database Monitoring
Deployment file:
```
infrastructure/kubernetes/base/monitoring/database-otel-collector.yaml
```
Setup script:
```
infrastructure/kubernetes/setup-database-monitoring.sh
```
## 📚 Documentation
| Document | Description |
|----------|-------------|
| [MONITORING_QUICKSTART.md](./MONITORING_QUICKSTART.md) | 10-minute quick start guide |
| [MONITORING_SETUP.md](./MONITORING_SETUP.md) | Detailed setup and troubleshooting |
| [DATABASE_MONITORING.md](./DATABASE_MONITORING.md) | Database metrics and logs guide |
| This document | Complete overview |
## 🔧 Shared Libraries
### Monitoring Modules
Located in `shared/monitoring/`:
| File | Purpose |
|------|---------|
| `__init__.py` | Package exports |
| `logging.py` | Standard logging setup |
| `logs_exporter.py` | OpenTelemetry logs export |
| `metrics.py` | OpenTelemetry metrics (no Prometheus) |
| `metrics_exporter.py` | OTLP metrics export setup |
| `system_metrics.py` | System metrics collection (CPU, memory, etc.) |
| `tracing.py` | Distributed tracing setup |
| `health_checks.py` | Health check endpoints |
### Usage in Services
```python
from shared.service_base import StandardFastAPIService
# Create service
service = AuthService()
# Create app with auto-configured monitoring
app = service.create_app()
# Monitoring is automatically enabled:
# - Tracing (if ENABLE_TRACING=true)
# - Metrics (if ENABLE_OTEL_METRICS=true)
# - System metrics (if ENABLE_SYSTEM_METRICS=true)
# - Logs (if OTEL_LOGS_EXPORTER=otlp)
```
## 🎨 Dashboard Examples
### Service Health Dashboard
Create a dashboard with:
1. **Request Rate** - `rate(http_requests_total[5m])`
2. **Error Rate** - `rate(http_requests_total{status_code=~"5.."}[5m])`
3. **Latency (P95)** - `histogram_quantile(0.95, http_request_duration_seconds)`
4. **Active Requests** - `active_requests`
5. **CPU Usage** - `process.cpu.utilization`
6. **Memory Usage** - `process.memory.utilization`
### Database Dashboard
1. **PostgreSQL Connections** - `postgresql.backends`
2. **Database Size** - `postgresql.database.size`
3. **Transaction Rate** - `rate(postgresql.commits[5m])`
4. **Redis Hit Rate** - `redis.keyspace.hits / (redis.keyspace.hits + redis.keyspace.misses)`
5. **RabbitMQ Queue Depth** - `rabbitmq.message.current`
## ⚠️ Alerts
### Recommended Alerts
**Application:**
- High error rate (>5% of requests failing)
- High latency (P95 > 1s)
- Service down (no metrics for 5 minutes)
**System:**
- High CPU (>80% for 5 minutes)
- High memory (>90%)
- Disk space low (<10%)
**Database:**
- PostgreSQL connections near max (>80% of max_connections)
- Slow queries (>5s)
- Redis memory high (>80%)
- RabbitMQ queue buildup (>10k messages)
## 🐛 Troubleshooting
### No Data in SigNoz
```bash
# 1. Check service logs
kubectl logs -n bakery-ia deployment/auth-service | grep -i otel
# 2. Check SigNoz collector
kubectl logs -n signoz deployment/signoz-otel-collector
# 3. Test connectivity
kubectl exec -n bakery-ia deployment/auth-service -- \
curl -v http://signoz-otel-collector.signoz.svc.cluster.local:4318
```
### Database Metrics Missing
```bash
# Check database monitoring collector
kubectl logs -n bakery-ia deployment/database-otel-collector
# Verify monitoring user exists
kubectl exec -n bakery-ia deployment/auth-db -- \
psql -U postgres -c "\du otel_monitor"
```
### Traces Not Correlated with Logs
Ensure `OTEL_LOGS_EXPORTER=otlp` is set in service environment variables.
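A quick way to confirm the variable on a running deployment (a sketch, using `auth-service`):
```bash
# List the effective env vars and filter for the exporter flag
kubectl set env deployment/auth-service --list -n bakery-ia | grep OTEL_LOGS_EXPORTER
```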
## 🎯 Best Practices
1. **Always use structured logging** - Add context with key-value pairs
2. **Add custom spans** - For important business operations
3. **Set appropriate log levels** - INFO for production, DEBUG for dev
4. **Monitor your monitors** - Alert on collector failures
5. **Regular retention policy reviews** - Balance cost vs. data retention
6. **Create service dashboards** - One dashboard per service
7. **Set up critical alerts first** - Service down, high error rate
8. **Document custom metrics** - Explain business-specific metrics
## 📊 Performance Impact
**Resource Usage (per service):**
- CPU: +5-10% (instrumentation overhead)
- Memory: +50-100MB (SDK and buffers)
- Network: Minimal (batched export every 60s)
**Latency Impact:**
- Per request: <1ms (async instrumentation)
- No impact on user-facing latency
**Storage (SigNoz):**
- Traces: ~1GB per million requests
- Metrics: ~100MB per service per day
- Logs: Varies by log volume
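To see what each signal actually consumes on disk, you can query ClickHouse directly; a sketch assuming the default pod name `signoz-clickhouse-0` from the Helm install:
```bash
# Per-database on-disk size for active parts
kubectl exec -n signoz signoz-clickhouse-0 -- clickhouse-client -q \
  "SELECT database, formatReadableSize(sum(bytes_on_disk)) AS size
   FROM system.parts WHERE active GROUP BY database ORDER BY sum(bytes_on_disk) DESC"
```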
## 🔐 Security Considerations
1. **Use dedicated monitoring users** - Never use app credentials
2. **Limit collector permissions** - Read-only access to databases
3. **Secure OTLP endpoints** - Use TLS in production
4. **Sanitize sensitive data** - Don't log passwords, tokens
5. **Network policies** - Restrict collector network access
6. **RBAC** - Limit SigNoz UI access per team
## 🚀 Next Steps
1. **Deploy to production** - Update production SigNoz config
2. **Create team dashboards** - Per-service and system-wide views
3. **Set up alerts** - Start with critical service health alerts
4. **Train team** - SigNoz UI usage, query language
5. **Document runbooks** - How to respond to alerts
6. **Optimize retention** - Based on actual data volume
7. **Add custom metrics** - Business-specific KPIs
## 📞 Support
- **SigNoz Community**: https://signoz.io/slack
- **OpenTelemetry Docs**: https://opentelemetry.io/docs/
- **Internal Docs**: See /docs folder
## 📝 Change Log
| Date | Change |
|------|--------|
| 2026-01-08 | Initial implementation - All services configured |
| 2026-01-08 | Database monitoring added (PostgreSQL, Redis, RabbitMQ) |
| 2026-01-08 | System metrics collection implemented |
| 2026-01-08 | Removed Prometheus, pure OpenTelemetry |
---
**Congratulations! Your platform now has complete observability. 🎉**
Every request is traced, every metric is collected, every log is searchable.

View File

@@ -0,0 +1,536 @@
# 📊 Bakery-ia Monitoring System Documentation
## 🎯 Overview
The bakery-ia platform features a comprehensive, modern monitoring system built on **OpenTelemetry** and **SigNoz**. This documentation provides a complete guide to the monitoring architecture, setup, and usage.
## 🚀 Monitoring Architecture
### Core Components
```mermaid
graph TD
A[Microservices] -->|OTLP| B[OpenTelemetry Collector]
B -->|gRPC| C[SigNoz]
C --> D[Traces Dashboard]
C --> E[Metrics Dashboard]
C --> F[Logs Dashboard]
C --> G[Alerts]
```
### Technology Stack
- **Instrumentation**: OpenTelemetry Python SDK
- **Protocol**: OTLP (OpenTelemetry Protocol) over gRPC
- **Backend**: SigNoz (open-source observability platform)
- **Metrics**: Prometheus-compatible metrics via OTLP
- **Traces**: Jaeger-compatible tracing via OTLP
- **Logs**: Structured logging with trace correlation
## 📋 Monitoring Coverage
### Service Coverage (100%)
| Service Category | Services | Monitoring Type | Status |
|-----------------|----------|----------------|--------|
| **Critical Services** | auth, orders, sales, external | Base Class | ✅ Monitored |
| **AI Services** | ai-insights, training | Direct | ✅ Monitored |
| **Data Services** | inventory, procurement, production, forecasting | Base Class | ✅ Monitored |
| **Operational Services** | tenant, notification, distribution | Base Class | ✅ Monitored |
| **Specialized Services** | suppliers, pos, recipes, orchestrator | Base Class | ✅ Monitored |
| **Infrastructure** | gateway, alert-processor, demo-session | Direct | ✅ Monitored |
**Total: 20 services with 100% monitoring coverage**
## 🔧 Monitoring Implementation
### Implementation Patterns
#### 1. Base Class Pattern (16 services)
Services using `StandardFastAPIService` inherit comprehensive monitoring:
```python
from shared.service_base import StandardFastAPIService
class MyService(StandardFastAPIService):
    def __init__(self):
        super().__init__(
            service_name="my-service",
            app_name="My Service",
            description="Service description",
            version="1.0.0",
            # Monitoring enabled by default
            enable_metrics=True,          # ✅ Metrics collection
            enable_tracing=True,          # ✅ Distributed tracing
            enable_health_checks=True     # ✅ Health endpoints
        )
```
#### 2. Direct Pattern (4 services)
Critical services with custom monitoring needs:
```python
# services/ai_insights/app/main.py
from shared.monitoring.metrics import MetricsCollector, add_metrics_middleware
from shared.monitoring.system_metrics import SystemMetricsCollector
# Initialize metrics collectors
metrics_collector = MetricsCollector("ai-insights")
system_metrics = SystemMetricsCollector("ai-insights")
# Add middleware
add_metrics_middleware(app, metrics_collector)
```
### Monitoring Components
#### OpenTelemetry Instrumentation
```python
# Automatic instrumentation in base class
FastAPIInstrumentor.instrument_app(app) # HTTP requests
HTTPXClientInstrumentor().instrument() # Outgoing HTTP
RedisInstrumentor().instrument() # Redis operations
SQLAlchemyInstrumentor().instrument() # Database queries
```
#### Metrics Collection
```python
# Standard metrics automatically collected
metrics_collector.register_counter("http_requests_total", "Total HTTP requests")
metrics_collector.register_histogram("http_request_duration", "Request duration")
metrics_collector.register_gauge("active_requests", "Active requests")
# System metrics automatically collected
system_metrics = SystemMetricsCollector("service-name")
# → CPU, Memory, Disk I/O, Network I/O, Threads, File Descriptors
```
#### Health Checks
```python
# Automatic health check endpoints
GET /health # Overall service health
GET /health/detailed # Detailed health with dependencies
GET /health/ready # Readiness probe
GET /health/live # Liveness probe
```
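A quick way to exercise these endpoints (a sketch; assumes the service listens on port 8000, matching the `prometheus.io/port` annotation used elsewhere in this repo):
```bash
# Forward the service port and hit the probes
kubectl port-forward -n bakery-ia deployment/auth-service 8000:8000 &
curl -s http://localhost:8000/health
curl -s http://localhost:8000/health/ready
```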
## 📊 Metrics Reference
### Standard Metrics (All Services)
| Metric Type | Metric Name | Description | Labels |
|-------------|------------|-------------|--------|
| **HTTP Metrics** | `{service}_http_requests_total` | Total HTTP requests | method, endpoint, status_code |
| **HTTP Metrics** | `{service}_http_request_duration_seconds` | Request duration histogram | method, endpoint, status_code |
| **HTTP Metrics** | `{service}_active_requests` | Currently active requests | - |
| **System Metrics** | `process.cpu.utilization` | Process CPU usage | - |
| **System Metrics** | `process.memory.usage` | Process memory usage | - |
| **System Metrics** | `system.cpu.utilization` | System CPU usage | - |
| **System Metrics** | `system.memory.usage` | System memory usage | - |
| **Database Metrics** | `db.query.duration` | Database query duration | operation, table |
| **Cache Metrics** | `cache.operation.duration` | Cache operation duration | operation, key |
### Custom Metrics (Service-Specific)
Examples of service-specific metrics:
**Auth Service:**
- `auth_registration_total` (by status)
- `auth_login_success_total`
- `auth_login_failure_total` (by reason)
- `auth_registration_duration_seconds`
**Orders Service:**
- `orders_created_total`
- `orders_processed_total` (by status)
- `orders_processing_duration_seconds`
**AI Insights Service:**
- `ai_insights_generated_total`
- `ai_model_inference_duration_seconds`
- `ai_feedback_received_total`
## 🔍 Tracing Guide
### Trace Propagation
Traces automatically flow across service boundaries:
```mermaid
sequenceDiagram
participant Client
participant Gateway
participant Auth
participant Orders
Client->>Gateway: HTTP Request (trace_id: abc123)
Gateway->>Auth: Auth Check (trace_id: abc123)
Auth-->>Gateway: Auth Response (trace_id: abc123)
Gateway->>Orders: Create Order (trace_id: abc123)
Orders-->>Gateway: Order Created (trace_id: abc123)
Gateway-->>Client: Final Response (trace_id: abc123)
```
### Trace Context in Logs
All logs include trace correlation:
```json
{
  "level": "info",
  "message": "Processing order",
  "service": "orders-service",
  "trace_id": "abc123def456",
  "span_id": "789ghi",
  "order_id": "12345",
  "timestamp": "2024-01-08T19:00:00Z"
}
```
### Manual Trace Enhancement
Add custom trace attributes:
```python
from shared.monitoring.tracing import add_trace_attributes, add_trace_event
# Add custom attributes
add_trace_attributes(
    user_id="123",
    tenant_id="abc",
    operation="order_creation"
)
# Add trace events
add_trace_event("order_validation_started")
# ... validation logic ...
add_trace_event("order_validation_completed", status="success")
```
## 🚨 Alerting Guide
### Standard Alerts (Recommended)
| Alert Name | Condition | Severity | Notification |
|------------|-----------|----------|--------------|
| **High Error Rate** | `error_rate > 5%` for 5m | High | PagerDuty + Slack |
| **High Latency** | `p99_latency > 2s` for 5m | High | PagerDuty + Slack |
| **Service Unavailable** | `up == 0` for 1m | Critical | PagerDuty + Slack + Email |
| **High Memory Usage** | `memory_usage > 80%` for 10m | Medium | Slack |
| **High CPU Usage** | `cpu_usage > 90%` for 5m | Medium | Slack |
| **Database Connection Issues** | `db_connections < minimum_pool_size` | High | PagerDuty + Slack |
| **Cache Hit Ratio Low** | `cache_hit_ratio < 70%` for 15m | Low | Slack |
### Creating Alerts in SigNoz
1. **Navigate to Alerts**: SigNoz UI → Alerts → Create Alert
2. **Select Metric**: Choose from available metrics
3. **Set Condition**: Define threshold and duration
4. **Configure Notifications**: Add notification channels
5. **Set Severity**: Critical, High, Medium, Low
6. **Add Description**: Explain alert purpose and resolution steps
### Example Alert Configuration (YAML)
```yaml
# Example PrometheusRule for Kubernetes (Prometheus Operator)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: bakery-ia-alerts
  namespace: monitoring
spec:
  groups:
    - name: service-health
      rules:
        - alert: ServiceDown
          expr: up{service!~"signoz.*"} == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Service {{ $labels.service }} is down"
            description: "{{ $labels.service }} has been down for more than 1 minute"
            runbook: "https://github.com/yourorg/bakery-ia/blob/main/RUNBOOKS.md#service-down"
        - alert: HighErrorRate
          expr: rate(http_requests_total{status_code=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
          for: 5m
          labels:
            severity: high
          annotations:
            summary: "High error rate in {{ $labels.service }}"
            description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
            runbook: "https://github.com/yourorg/bakery-ia/blob/main/RUNBOOKS.md#high-error-rate"
```
## 📈 Dashboard Guide
### Recommended Dashboards
#### 1. Service Overview Dashboard
- HTTP Request Rate
- Error Rate
- Latency Percentiles (p50, p90, p99)
- Active Requests
- System Resource Usage
#### 2. Performance Dashboard
- Request Duration Histogram
- Database Query Performance
- Cache Performance
- External API Call Performance
#### 3. System Health Dashboard
- CPU Usage (Process & System)
- Memory Usage (Process & System)
- Disk I/O
- Network I/O
- File Descriptors
- Thread Count
#### 4. Business Metrics Dashboard
- User Registrations
- Order Volume
- AI Insights Generated
- API Usage by Tenant
### Creating Dashboards in SigNoz
1. **Navigate to Dashboards**: SigNoz UI → Dashboards → Create Dashboard
2. **Add Panels**: Click "Add Panel" and select metric
3. **Configure Visualization**: Choose chart type and settings
4. **Set Time Range**: Default to last 1h, 6h, 24h, 7d
5. **Add Variables**: For dynamic filtering (service, environment)
6. **Save Dashboard**: Give it a descriptive name
## 🛠️ Troubleshooting Guide
### Common Issues & Solutions
#### Issue: No Metrics Appearing in SigNoz
**Checklist:**
- ✅ OpenTelemetry Collector running? `kubectl get pods -n signoz`
- ✅ Service can reach collector? `telnet signoz-otel-collector.signoz 4318`
- ✅ OTLP endpoint configured correctly? Check `OTEL_EXPORTER_OTLP_ENDPOINT`
- ✅ Service logs show OTLP export? Look for "Exporting metrics"
- ✅ No network policies blocking? Check Kubernetes network policies
**Debugging:**
```bash
# Check OpenTelemetry Collector logs
kubectl logs -n signoz -l app=otel-collector
# Check service logs for OTLP errors
kubectl logs -l app=auth-service | grep -i otel
# Test OTLP connectivity from service pod
kubectl exec -it auth-service-pod -- curl -v http://signoz-otel-collector.signoz:4318
```
#### Issue: High Latency in Specific Service
**Checklist:**
- ✅ Database queries slow? Check `db.query.duration` metrics
- ✅ External API calls slow? Check trace waterfall
- ✅ High CPU usage? Check system metrics
- ✅ Memory pressure? Check memory metrics
- ✅ Too many active requests? Check concurrency
**Debugging:**
```python
# Add detailed tracing to suspicious code
from shared.monitoring.tracing import add_trace_event
add_trace_event("database_query_started", table="users")
# ... database query ...
add_trace_event("database_query_completed", duration_ms=45)
```
#### Issue: High Error Rate
**Checklist:**
- ✅ Database connection issues? Check health endpoints
- ✅ External API failures? Check dependency metrics
- ✅ Authentication failures? Check auth service logs
- ✅ Validation errors? Check application logs
- ✅ Rate limiting? Check gateway metrics
**Debugging:**
```bash
# Check error logs with trace correlation
kubectl logs -l app=auth-service | grep -i error | grep -i trace
# Filter traces by error status
# In SigNoz: Add filter http.status_code >= 400
```
## 📚 Runbook Reference
See [RUNBOOKS.md](RUNBOOKS.md) for detailed troubleshooting procedures.
## 🔧 Development Guide
### Adding Custom Metrics
```python
# In any service using direct monitoring
self.metrics_collector.register_counter(
    "custom_metric_name",
    "Description of what this metric tracks",
    labels=["label1", "label2"]  # Optional labels
)

# Increment the counter
self.metrics_collector.increment_counter(
    "custom_metric_name",
    value=1,
    labels={"label1": "value1", "label2": "value2"}
)
```
### Adding Custom Trace Attributes
```python
# Add context to current span
from shared.monitoring.tracing import add_trace_attributes
add_trace_attributes(
    user_id=user.id,
    tenant_id=tenant.id,
    operation="premium_feature_access",
    feature_name="advanced_forecasting"
)
```
### Service-Specific Monitoring Setup
For services needing custom monitoring beyond the base class:
```python
# In your service's __init__ method
from shared.monitoring.system_metrics import SystemMetricsCollector
from shared.monitoring.metrics import MetricsCollector
class MyService(StandardFastAPIService):
    def __init__(self):
        # Call parent constructor first
        super().__init__(...)
        # Add custom metrics collector
        self.custom_metrics = MetricsCollector("my-service")
        # Register custom metrics
        self.custom_metrics.register_counter(
            "business_specific_events",
            "Custom business event counter"
        )
        # Add system metrics if not using base class defaults
        self.system_metrics = SystemMetricsCollector("my-service")
```
## 📊 SigNoz Configuration
### Environment Variables
```env
# OpenTelemetry Collector endpoint
OTEL_EXPORTER_OTLP_ENDPOINT=http://signoz-otel-collector.signoz:4318
# Service-specific configuration
OTEL_SERVICE_NAME=auth-service
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production,k8s.namespace=bakery-ia
# Metrics export interval (default: 60000ms = 60s)
OTEL_METRIC_EXPORT_INTERVAL=60000
# Batch span processor configuration
OTEL_BSP_SCHEDULE_DELAY=5000
OTEL_BSP_MAX_QUEUE_SIZE=2048
OTEL_BSP_MAX_EXPORT_BATCH_SIZE=512
```
### Kubernetes Configuration
```yaml
# Example deployment with monitoring environment variables
apiVersion: apps/v1
kind: Deployment
metadata:
  name: auth-service
spec:
  template:
    spec:
      containers:
        - name: auth-service
          image: auth-service:latest
          env:
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://signoz-otel-collector.signoz:4318"
            - name: OTEL_SERVICE_NAME
              value: "auth-service"
            - name: ENVIRONMENT
              value: "production"
          resources:
            limits:
              cpu: "1"
              memory: "512Mi"
            requests:
              cpu: "200m"
              memory: "256Mi"
```
## 🎯 Best Practices
### Monitoring Best Practices
1. **Use Consistent Naming**: Follow OpenTelemetry semantic conventions
2. **Add Context to Traces**: Include user/tenant IDs in trace attributes
3. **Monitor Dependencies**: Track external API and database performance
4. **Set Appropriate Alerts**: Avoid alert fatigue with meaningful thresholds
5. **Document Metrics**: Keep metrics documentation up to date
6. **Review Regularly**: Update dashboards as services evolve
7. **Test Alerts**: Ensure alerts fire correctly before production
### Performance Best Practices
1. **Batch Metrics Export**: Use default 60s interval for most services
2. **Sample Traces**: Consider sampling for high-volume services
3. **Limit Custom Metrics**: Only track metrics that provide value
4. **Use Histograms Wisely**: Histograms can be resource-intensive
5. **Monitor Monitoring**: Track OTLP export success/failure rates (see the check below)
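For the last point, the collector exposes its own telemetry; a sketch assuming the standard otel-collector self-metrics port 8888 (verify against your Helm values):
```bash
# Export success/failure counters reported by the collector itself
kubectl port-forward -n signoz deployment/signoz-otel-collector 8888:8888 &
curl -s http://localhost:8888/metrics | grep -E 'otelcol_exporter_(sent|send_failed)'
```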
## 📞 Support
### Getting Help
1. **Check Documentation**: This file and RUNBOOKS.md
2. **Review SigNoz Docs**: https://signoz.io/docs/
3. **OpenTelemetry Docs**: https://opentelemetry.io/docs/
4. **Team Channel**: #monitoring in Slack
5. **GitHub Issues**: https://github.com/yourorg/bakery-ia/issues
### Escalation Path
1. **First Line**: Development team (service owners)
2. **Second Line**: DevOps team (monitoring specialists)
3. **Third Line**: SigNoz support (vendor support)
## 🎉 Summary
The bakery-ia monitoring system provides:
- **📊 100% Service Coverage**: All 20 services monitored
- **🚀 Modern Architecture**: OpenTelemetry + SigNoz
- **🔧 Comprehensive Metrics**: System, HTTP, database, cache
- **🔍 Full Observability**: Traces, metrics, logs integrated
- **✅ Production Ready**: Battle-tested and scalable
**All services are fully instrumented and ready for production monitoring!** 🎉

View File

@@ -1,283 +0,0 @@
# SigNoz Monitoring Quick Start
Get complete observability (metrics, logs, traces, system metrics) in under 10 minutes using OpenTelemetry.
## What You'll Get
- **Distributed Tracing** - Complete request flows across all services
- **Application Metrics** - HTTP requests, durations, error rates, custom business metrics
- **System Metrics** - CPU usage, memory usage, disk I/O, network I/O per service
- **Structured Logs** - Searchable logs correlated with traces
- **Unified Dashboard** - Single UI for all telemetry data
**All data pushed via OpenTelemetry OTLP protocol - No Prometheus, no scraping needed!**
## Prerequisites
- Kubernetes cluster running (Kind/Minikube/Production)
- Helm 3.x installed
- kubectl configured
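A quick preflight to confirm the prerequisites:
```bash
# Cluster reachable, Helm 3.x present, correct kubectl context
kubectl cluster-info
helm version --short
kubectl config current-context
```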
## Step 1: Deploy SigNoz
```bash
# Add Helm repository
helm repo add signoz https://charts.signoz.io
helm repo update
# Create namespace
kubectl create namespace signoz
# Install SigNoz
helm install signoz signoz/signoz \
-n signoz \
-f infrastructure/helm/signoz-values-dev.yaml
# Wait for pods to be ready (2-3 minutes)
kubectl wait --for=condition=ready pod -l app=signoz -n signoz --timeout=300s
```
## Step 2: Configure Services
Each service needs OpenTelemetry environment variables. The auth-service is already configured as an example.
### Quick Configuration (for remaining services)
Add these environment variables to each service deployment:
```yaml
env:
  # OpenTelemetry Collector endpoint
  - name: OTEL_COLLECTOR_ENDPOINT
    value: "http://signoz-otel-collector.signoz.svc.cluster.local:4318"
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://signoz-otel-collector.signoz.svc.cluster.local:4318"
  - name: OTEL_SERVICE_NAME
    value: "your-service-name" # e.g., "inventory-service"
  # Enable tracing
  - name: ENABLE_TRACING
    value: "true"
  # Enable logs export
  - name: OTEL_LOGS_EXPORTER
    value: "otlp"
  # Enable metrics export (includes system metrics)
  - name: ENABLE_OTEL_METRICS
    value: "true"
  - name: ENABLE_SYSTEM_METRICS
    value: "true"
```
### Using the Configuration Script
```bash
# Generate configuration patches for all services
./infrastructure/kubernetes/add-monitoring-config.sh
# This creates /tmp/*-otel-patch.yaml files
# Review and manually add to each service deployment
```
## Step 3: Deploy Updated Services
```bash
# Apply updated configurations
kubectl apply -k infrastructure/kubernetes/overlays/dev/
# Or restart services to pick up new env vars
kubectl rollout restart deployment -n bakery-ia
# Wait for rollout
kubectl rollout status deployment -n bakery-ia --timeout=5m
```
## Step 4: Access SigNoz UI
### Via Ingress
```bash
# Add to /etc/hosts if needed
echo "127.0.0.1 monitoring.bakery-ia.local" | sudo tee -a /etc/hosts
# Access UI
open https://monitoring.bakery-ia.local
```
### Via Port Forward
```bash
kubectl port-forward -n signoz svc/signoz-frontend 3301:3301
open http://localhost:3301
```
## Step 5: Explore Your Data
### Traces
1. Go to **Services** tab
2. See all your services listed
3. Click on a service → View traces
4. Click on a trace → See detailed span tree with timing
### Metrics
**HTTP Metrics** (automatically collected):
- `http_requests_total` - Total requests by method, endpoint, status
- `http_request_duration_seconds` - Request latency
- `active_requests` - Current active HTTP requests
**System Metrics** (automatically collected per service):
- `process.cpu.utilization` - Process CPU usage %
- `process.memory.usage` - Process memory in bytes
- `process.memory.utilization` - Process memory %
- `process.threads.count` - Number of threads
- `system.cpu.utilization` - System-wide CPU %
- `system.memory.usage` - System memory usage
- `system.disk.io.read` - Disk bytes read
- `system.disk.io.write` - Disk bytes written
- `system.network.io.sent` - Network bytes sent
- `system.network.io.received` - Network bytes received
**Custom Business Metrics** (if configured):
- User registrations
- Orders created
- Login attempts
- etc.
### Logs
1. Go to **Logs** tab
2. Filter by service: `service_name="auth-service"`
3. Search for specific messages
4. See structured fields (user_id, tenant_id, etc.)
### Trace-Log Correlation
1. Find a trace in **Traces** tab
2. Note the `trace_id`
3. Go to **Logs** tab
4. Filter: `trace_id="<the-trace-id>"`
5. See all logs for that specific request!
## Verification Commands
```bash
# Check if services are sending telemetry
kubectl logs -n bakery-ia deployment/auth-service | grep -i "telemetry\|otel"
# Check SigNoz collector is receiving data
kubectl logs -n signoz deployment/signoz-otel-collector | tail -50
# Test connectivity to collector
kubectl exec -n bakery-ia deployment/auth-service -- \
curl -v http://signoz-otel-collector.signoz.svc.cluster.local:4318
```
## Common Issues
### No data in SigNoz
```bash
# 1. Verify environment variables are set
kubectl get deployment auth-service -n bakery-ia -o yaml | grep OTEL
# 2. Check collector logs
kubectl logs -n signoz deployment/signoz-otel-collector
# 3. Restart service
kubectl rollout restart deployment/auth-service -n bakery-ia
```
### Services not appearing
```bash
# Check network connectivity
kubectl exec -n bakery-ia deployment/auth-service -- \
curl http://signoz-otel-collector.signoz.svc.cluster.local:4318
# Should return: connection successful (not connection refused)
```
## Architecture
```
┌─────────────────────────────────────────────┐
│ Your Microservices │
│ ┌──────┐ ┌──────┐ ┌──────┐ │
│ │ auth │ │ inv │ │orders│ ... │
│ └──┬───┘ └──┬───┘ └──┬───┘ │
│ │ │ │ │
│ └─────────┴─────────┘ │
│ │ │
│ OTLP Push │
│ (traces, metrics, logs) │
└──────────────┼──────────────────────────────┘
┌──────────────────────────────────────────────┐
│ SigNoz OpenTelemetry Collector │
│ :4317 (gRPC) :4318 (HTTP) │
│ │
│ Receivers: OTLP only (no Prometheus) │
│ Processors: batch, memory_limiter │
│ Exporters: ClickHouse │
└──────────────┼──────────────────────────────┘
┌──────────────────────────────────────────────┐
│ ClickHouse Database │
│ Stores: traces, metrics, logs │
└──────────────┼──────────────────────────────┘
┌──────────────────────────────────────────────┐
│ SigNoz Frontend UI │
│ monitoring.bakery-ia.local or :3301 │
└──────────────────────────────────────────────┘
```
## What Makes This Different
**Pure OpenTelemetry** - No Prometheus involved:
- ✅ All metrics pushed via OTLP (not scraped)
- ✅ Automatic system metrics collection (CPU, memory, disk, network)
- ✅ Unified data model for all telemetry
- ✅ Native trace-metric-log correlation
- ✅ Lower resource usage (no scraping overhead)
## Next Steps
- **Create Dashboards** - Build custom views for your metrics
- **Set Up Alerts** - Configure alerts for errors, latency, resource usage
- **Explore System Metrics** - Monitor CPU, memory per service
- **Query Logs** - Use powerful log query language
- **Correlate Everything** - Jump from traces → logs → metrics
## Need Help?
- [Full Documentation](./MONITORING_SETUP.md) - Detailed setup guide
- [SigNoz Docs](https://signoz.io/docs/) - Official documentation
- [OpenTelemetry Python](https://opentelemetry.io/docs/instrumentation/python/) - Python instrumentation
---
**Metrics You Get Out of the Box:**
| Category | Metrics | Description |
|----------|---------|-------------|
| HTTP | `http_requests_total` | Total requests by method, endpoint, status |
| HTTP | `http_request_duration_seconds` | Request latency histogram |
| HTTP | `active_requests` | Current active requests |
| Process | `process.cpu.utilization` | Process CPU usage % |
| Process | `process.memory.usage` | Process memory in bytes |
| Process | `process.memory.utilization` | Process memory % |
| Process | `process.threads.count` | Thread count |
| System | `system.cpu.utilization` | System CPU % |
| System | `system.memory.usage` | System memory usage |
| System | `system.memory.utilization` | System memory % |
| Disk | `system.disk.io.read` | Disk read bytes |
| Disk | `system.disk.io.write` | Disk write bytes |
| Network | `system.network.io.sent` | Network sent bytes |
| Network | `system.network.io.received` | Network received bytes |

View File

@@ -1,511 +0,0 @@
# SigNoz Monitoring Setup Guide
This guide explains how to set up complete observability for the Bakery IA platform using SigNoz, which provides unified metrics, logs, and traces visualization.
## Table of Contents
1. [Architecture Overview](#architecture-overview)
2. [Prerequisites](#prerequisites)
3. [SigNoz Deployment](#signoz-deployment)
4. [Service Configuration](#service-configuration)
5. [Data Flow](#data-flow)
6. [Verification](#verification)
7. [Troubleshooting](#troubleshooting)
## Architecture Overview
The monitoring setup uses a three-tier approach:
```
┌─────────────────────────────────────────────────────────────┐
│ Bakery IA Services │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Auth │ │ Inventory│ │ Orders │ │ ... │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│ │ │ │ │ │
│ └─────────────┴─────────────┴─────────────┘ │
│ │ │
│ OpenTelemetry Protocol (OTLP) │
│ Traces / Metrics / Logs │
└──────────────────────────┼───────────────────────────────────┘
┌──────────────────────────────────────────────────────────────┐
│ SigNoz OpenTelemetry Collector │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ Receivers: │ │
│ │ - OTLP gRPC (4317) - OTLP HTTP (4318) │ │
│ │ - Prometheus Scraper (service discovery) │ │
│ └────────────────────┬───────────────────────────────────┘ │
│ │ │
│ ┌────────────────────┴───────────────────────────────────┐ │
│ │ Processors: batch, memory_limiter, resourcedetection │ │
│ └────────────────────┬───────────────────────────────────┘ │
│ │ │
│ ┌────────────────────┴───────────────────────────────────┐ │
│ │ Exporters: ClickHouse (traces, metrics, logs) │ │
│ └────────────────────────────────────────────────────────┘ │
└──────────────────────────┼───────────────────────────────────┘
┌──────────────────────────────────────────────────────────────┐
│ ClickHouse Database │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Traces │ │ Metrics │ │ Logs │ │
│ └──────────┘ └──────────┘ └──────────┘ │
└──────────────────────────┼───────────────────────────────────┘
┌──────────────────────────────────────────────────────────────┐
│ SigNoz Query Service │
│ & Frontend UI │
│ https://monitoring.bakery-ia.local │
└──────────────────────────────────────────────────────────────┘
```
### Key Components
1. **Services**: Generate telemetry data using OpenTelemetry SDK
2. **OpenTelemetry Collector**: Receives, processes, and exports telemetry
3. **ClickHouse**: Stores traces, metrics, and logs
4. **SigNoz UI**: Query and visualize all telemetry data
## Prerequisites
- Kubernetes cluster (Kind, Minikube, or production cluster)
- Helm 3.x installed
- kubectl configured
- At least 4GB RAM available for SigNoz components
## SigNoz Deployment
### 1. Add SigNoz Helm Repository
```bash
helm repo add signoz https://charts.signoz.io
helm repo update
```
### 2. Create Namespace
```bash
kubectl create namespace signoz
```
### 3. Deploy SigNoz
```bash
# For development environment
helm install signoz signoz/signoz \
-n signoz \
-f infrastructure/helm/signoz-values-dev.yaml
# For production environment
helm install signoz signoz/signoz \
-n signoz \
-f infrastructure/helm/signoz-values-prod.yaml
```
### 4. Verify Deployment
```bash
# Check all pods are running
kubectl get pods -n signoz
# Expected output:
# signoz-alertmanager-0
# signoz-clickhouse-0
# signoz-frontend-*
# signoz-otel-collector-*
# signoz-query-service-*
# Check services
kubectl get svc -n signoz
```
## Service Configuration
Each microservice needs to be configured to send telemetry to SigNoz.
### Environment Variables
Add these environment variables to your service deployments:
```yaml
env:
# OpenTelemetry Collector endpoint
- name: OTEL_COLLECTOR_ENDPOINT
value: "http://signoz-otel-collector.signoz.svc.cluster.local:4318"
- name: OTEL_EXPORTER_OTLP_ENDPOINT
value: "http://signoz-otel-collector.signoz.svc.cluster.local:4318"
# Service identification
- name: OTEL_SERVICE_NAME
value: "your-service-name" # e.g., "auth-service"
# Enable tracing
- name: ENABLE_TRACING
value: "true"
# Enable logs export
- name: OTEL_LOGS_EXPORTER
value: "otlp"
- name: OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED
value: "true"
# Enable metrics export (optional, default: true)
- name: ENABLE_OTEL_METRICS
value: "true"
```
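The shared service base presumably consumes these variables at startup along the following lines. This is illustrative only; the actual `shared.service_base` logic may differ:

```python
import os

# Hypothetical startup snippet -- names mirror the variables above
otlp_endpoint = os.getenv(
    "OTEL_EXPORTER_OTLP_ENDPOINT",
    "http://signoz-otel-collector.signoz.svc.cluster.local:4318",
)
service_name = os.getenv("OTEL_SERVICE_NAME", "unknown-service")
tracing_enabled = os.getenv("ENABLE_TRACING", "false").lower() == "true"
metrics_enabled = os.getenv("ENABLE_OTEL_METRICS", "true").lower() == "true"
```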
### Prometheus Annotations
Add these annotations to enable Prometheus metrics scraping:
```yaml
metadata:
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8000"
prometheus.io/path: "/metrics"
```
### Complete Example
See [infrastructure/kubernetes/base/components/auth/auth-service.yaml](../infrastructure/kubernetes/base/components/auth/auth-service.yaml) for a complete example.
### Automated Configuration Script
Use the provided script to add monitoring configuration to all services:
```bash
# Run from project root
./infrastructure/kubernetes/add-monitoring-config.sh
```
## Data Flow
### 1. Traces
**Automatic Instrumentation:**
```python
# In your service's main.py
from shared.service_base import StandardFastAPIService
service = AuthService() # Extends StandardFastAPIService
app = service.create_app()
# Tracing is automatically enabled if ENABLE_TRACING=true
# All FastAPI endpoints, HTTP clients, Redis, PostgreSQL are auto-instrumented
```
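Under the hood, the shared base plausibly applies the standard OpenTelemetry contrib instrumentors, roughly as below. Here `app` is the FastAPI instance from the snippet above; the actual internals of `StandardFastAPIService` may differ:

```python
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.redis import RedisInstrumentor

FastAPIInstrumentor.instrument_app(app)  # every endpoint becomes a span
HTTPXClientInstrumentor().instrument()   # outgoing HTTP client calls
RedisInstrumentor().instrument()         # Redis commands
```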
**Manual Instrumentation:**
```python
from shared.monitoring.tracing import add_trace_attributes, add_trace_event
# Add custom attributes to current span
add_trace_attributes(
user_id="123",
tenant_id="abc",
operation="user_registration"
)
# Add events for important operations
add_trace_event("user_authenticated", user_id="123", method="jwt")
```
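These helpers are likely thin wrappers over the public span API. A sketch of one possible implementation (an assumption; the shared library may differ):

```python
from opentelemetry import trace

def add_trace_attributes(**attributes):
    # Attach key-value attributes to whatever span is currently active
    span = trace.get_current_span()
    for key, value in attributes.items():
        span.set_attribute(key, value)

def add_trace_event(name, **attributes):
    # Record a timestamped event on the current span
    trace.get_current_span().add_event(name, attributes=attributes)
```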
### 2. Metrics
**Dual Export Strategy:**
Services export metrics in two ways:
1. **Prometheus format** at `/metrics` endpoint (scraped by SigNoz)
2. **OTLP push** directly to SigNoz collector (real-time)
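A sketch of how both paths can coexist in one process, using `prometheus_client` for the pull path and the OTel SDK for the push path. Here `app` is assumed to be the FastAPI instance; the shared base wires this up for you:

```python
from prometheus_client import make_asgi_app
from opentelemetry.metrics import set_meter_provider
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter

# Path 1: expose /metrics for the SigNoz Prometheus scraper (pull)
app.mount("/metrics", make_asgi_app())

# Path 2: push metrics to the collector over OTLP at a fixed interval
reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(
        endpoint="http://signoz-otel-collector.signoz.svc.cluster.local:4318/v1/metrics"
    )
)
set_meter_provider(MeterProvider(metric_readers=[reader]))
```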
**Built-in Metrics:**
```python
# Automatically collected by StandardFastAPIService:
# - http_requests_total
# - http_request_duration_seconds
# - active_connections
```
**Custom Metrics:**
```python
# Define in your service
custom_metrics = {
"user_registrations": {
"type": "counter",
"description": "Total user registrations",
"labels": ["status"]
},
"login_duration_seconds": {
"type": "histogram",
"description": "Login request duration"
}
}
service = AuthService(custom_metrics=custom_metrics)
# Use in your code
service.metrics_collector.increment_counter(
"user_registrations",
labels={"status": "success"}
)
```
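The histogram defined above would be fed in a similar way. Note that `observe_histogram` below is a hypothetical method name, since the exact `metrics_collector` interface is defined by the shared library:

```python
import time

start = time.monotonic()
# ... handle the login request ...
# Hypothetical helper name; check the shared library for the real one
service.metrics_collector.observe_histogram(
    "login_duration_seconds", time.monotonic() - start
)
```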
### 3. Logs
**Automatic Export:**
```python
# Logs are automatically exported if OTEL_LOGS_EXPORTER=otlp
import logging
logger = logging.getLogger(__name__)
# This will appear in SigNoz
logger.info("User logged in", extra={"user_id": "123", "tenant_id": "abc"})
```
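Behind that flag, the export path is typically wired with the standard SDK log modules, roughly as follows. This is a sketch; the `_logs` package has been provisional in some SDK releases, so pin your `opentelemetry-sdk` version:

```python
import logging
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry.exporter.otlp.proto.http._log_exporter import OTLPLogExporter

provider = LoggerProvider()
provider.add_log_record_processor(
    BatchLogRecordProcessor(
        OTLPLogExporter(
            endpoint="http://signoz-otel-collector.signoz.svc.cluster.local:4318/v1/logs"
        )
    )
)
# Route stdlib logging records through the OTLP pipeline
logging.getLogger().addHandler(LoggingHandler(logger_provider=provider))
```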
**Structured Logging with Context:**
```python
from shared.monitoring.logs_exporter import add_log_context
# Add context that persists across log calls
log_ctx = add_log_context(
request_id="req_123",
user_id="user_456",
tenant_id="tenant_789"
)
# All subsequent logs include this context
log_ctx.info("Processing order") # Includes request_id, user_id, tenant_id
```
**Trace Correlation:**
```python
from shared.monitoring.logs_exporter import get_current_trace_context
# Get trace context for correlation
trace_ctx = get_current_trace_context()
logger.info("Processing request", extra=trace_ctx)
# Logs now include trace_id and span_id for correlation
```
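A likely shape for `get_current_trace_context`, built on the public span API (an assumption; the shared helper may format the IDs differently):

```python
from opentelemetry import trace

def get_current_trace_context():
    ctx = trace.get_current_span().get_span_context()
    if not ctx.is_valid:
        return {}  # no active span, nothing to correlate
    # W3C Trace Context formatting: 32 hex chars for trace_id, 16 for span_id
    return {
        "trace_id": format(ctx.trace_id, "032x"),
        "span_id": format(ctx.span_id, "016x"),
    }
```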
## Verification
### 1. Check Service Health
```bash
# Check that services are exporting telemetry
kubectl logs -n bakery-ia deployment/auth-service | grep -i "telemetry\|otel\|signoz"
# Expected output includes:
# - "Distributed tracing configured"
# - "OpenTelemetry logs export configured"
# - "OpenTelemetry metrics export configured"
```
### 2. Access SigNoz UI
```bash
# Port-forward (for local development)
kubectl port-forward -n signoz svc/signoz-frontend 3301:3301
# Or via Ingress
open https://monitoring.bakery-ia.local
```
### 3. Verify Data Ingestion
**Traces:**
1. Go to SigNoz UI → Traces
2. You should see traces from your services
3. Click on a trace to see the full span tree
**Metrics:**
1. Go to SigNoz UI → Metrics
2. Query: `http_requests_total`
3. Filter by service: `service="auth-service"`
**Logs:**
1. Go to SigNoz UI → Logs
2. Filter by service: `service_name="auth-service"`
3. Search for specific log messages
### 4. Test Trace-Log Correlation
1. Find a trace in SigNoz UI
2. Copy the `trace_id`
3. Go to Logs tab
4. Search: `trace_id="<your-trace-id>"`
5. You should see all logs for that trace
## Troubleshooting
### No Data in SigNoz
**1. Check OpenTelemetry Collector:**
```bash
# Check collector logs
kubectl logs -n signoz deployment/signoz-otel-collector
# Should see:
# - "Receiver is starting"
# - "Exporter is starting"
# - No error messages
```
**2. Check Service Configuration:**
```bash
# Verify environment variables
kubectl get deployment auth-service -n bakery-ia -o yaml | grep -A 20 "env:"
# Verify annotations
kubectl get deployment auth-service -n bakery-ia -o yaml | grep -A 5 "annotations:"
```
**3. Check Network Connectivity:**
```bash
# Test from service pod
kubectl exec -n bakery-ia deployment/auth-service -- \
curl -v http://signoz-otel-collector.signoz.svc.cluster.local:4318/v1/traces
# Should return: 405 Method Not Allowed (POST required)
# If connection refused, check network policies
```
### Traces Not Appearing
**Check instrumentation:**
```python
# Verify tracing is enabled
import os
print(os.getenv("ENABLE_TRACING")) # Should be "true"
print(os.getenv("OTEL_COLLECTOR_ENDPOINT")) # Should be set
```
**Check trace sampling:**
```bash
# Verify sampling rate (default 100%)
kubectl logs -n bakery-ia deployment/auth-service | grep "sampling"
```
### Metrics Not Appearing
**1. Verify Prometheus annotations:**
```bash
kubectl get pods -n bakery-ia -o yaml | grep "prometheus.io"
```
**2. Test metrics endpoint:**
```bash
# Port-forward service
kubectl port-forward -n bakery-ia deployment/auth-service 8000:8000
# Test endpoint
curl http://localhost:8000/metrics
# Should return Prometheus format metrics
```
**3. Check SigNoz scrape configuration:**
```bash
# Check collector config
kubectl get configmap -n signoz signoz-otel-collector -o yaml | grep -A 30 "prometheus:"
```
### Logs Not Appearing
**1. Verify log export is enabled:**
```bash
kubectl get deployment auth-service -n bakery-ia -o yaml | grep -A 1 OTEL_LOGS_EXPORTER
# Should show the OTEL_LOGS_EXPORTER variable with value: otlp
```
**2. Check log format:**
```bash
# Logs should be JSON formatted
kubectl logs -n bakery-ia deployment/auth-service | head -5
```
**3. Verify OTLP endpoint:**
```bash
# Test logs endpoint
kubectl exec -n bakery-ia deployment/auth-service -- \
curl -X POST http://signoz-otel-collector.signoz.svc.cluster.local:4318/v1/logs \
-H "Content-Type: application/json" \
-d '{"resourceLogs":[]}'
# Should return 200 OK or 400 Bad Request (not connection error)
```
## Performance Tuning
### For Development
The default configuration is optimized for local development with minimal resources.
### For Production
Update the following in `signoz-values-prod.yaml`:
```yaml
# Increase collector resources
otelCollector:
resources:
requests:
cpu: 500m
memory: 1Gi
limits:
cpu: 2000m
memory: 2Gi
# Increase batch sizes
config:
processors:
batch:
timeout: 10s
send_batch_size: 10000 # Increased from 1024
# Add more replicas
replicaCount: 2
```
## Best Practices
1. **Use Structured Logging**: Always use key-value pairs for better querying (see the sketch after this list)
2. **Add Context**: Include user_id, tenant_id, request_id in logs
3. **Trace Business Operations**: Add custom spans for important operations
4. **Monitor Collector Health**: Set up alerts for collector errors
5. **Retention Policy**: Configure ClickHouse retention based on needs
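Practices 1 through 3 often meet in the same function. A short illustrative sketch (function and attribute names are invented for the example):

```python
import logging
from opentelemetry import trace

logger = logging.getLogger(__name__)
tracer = trace.get_tracer(__name__)

def place_order(order_id: str, user_id: str, tenant_id: str):
    # Practice 3: a custom span around a business operation
    with tracer.start_as_current_span("orders.place_order") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("tenant.id", tenant_id)
        # Practices 1-2: structured log with identifying context
        logger.info(
            "Order placed",
            extra={"order_id": order_id, "user_id": user_id, "tenant_id": tenant_id},
        )
```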
## Additional Resources
- [SigNoz Documentation](https://signoz.io/docs/)
- [OpenTelemetry Python](https://opentelemetry.io/docs/instrumentation/python/)
- [Bakery IA Monitoring Shared Library](../shared/monitoring/)
## Support
For issues or questions:
1. Check SigNoz community: https://signoz.io/slack
2. Review OpenTelemetry docs: https://opentelemetry.io/docs/
3. Create issue in project repository