Improve monitoring 6

2026-01-10 13:43:38 +01:00
parent c05538cafb
commit b089c216db
13 changed files with 1248 additions and 2546 deletions
--- a/docs/TECHNICAL-DOCUMENTATION-SUMMARY.md
+++ b/docs/TECHNICAL-DOCUMENTATION-SUMMARY.md
@@ -38,7 +38,8 @@ Bakery-IA is an **AI-powered SaaS platform** designed specifically for the Spani
 **Infrastructure:**
 - Docker containers, Kubernetes orchestration
 - PostgreSQL 17, Redis 7.4, RabbitMQ 4.1
-- Prometheus + Grafana monitoring
+- **SigNoz unified observability platform** - Traces, metrics, logs
+- OpenTelemetry instrumentation across all services
 - HTTPS with automatic certificate renewal

 ---
@@ -711,6 +712,14 @@ Data Collection → Feature Engineering → Prophet Training
 - Service decoupling
 - Asynchronous processing

+**4. Distributed Tracing (OpenTelemetry)**
+- End-to-end request tracking across all 18 microservices
+- Automatic instrumentation for FastAPI, HTTPX, SQLAlchemy, Redis
+- Performance bottleneck identification
+- Database query performance analysis
+- External API call monitoring
+- Error tracking with full context
+
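The end-to-end request tracking described in the added lines above depends on every service forwarding the same trace ID. OpenTelemetry does this by default through the W3C `traceparent` header. The sketch below shows only that header mechanics in plain Python; the function names are hypothetical and this is not the project's code, since the auto-instrumentation libraries handle propagation without any of it.

```python
import re
import secrets

# W3C Trace Context header layout: version-traceid-parentid-flags (lowercase hex).
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace at the edge (hypothetical helper, e.g. for a gateway)."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex characters
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(incoming: str) -> str:
    """Continue an incoming trace: keep the trace ID, mint a new span ID."""
    match = TRACEPARENT_RE.match(incoming)
    if match is None:
        # Missing or malformed header: start a fresh trace instead of failing.
        return new_traceparent()
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"
```

Each downstream call would send `child_traceparent(incoming_header)` as its own `traceparent`, so all spans of one request share a trace ID and the backend can stitch them into a single trace.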
 ### Scalability & Performance

 **1. Microservices Architecture**
@@ -731,6 +740,16 @@ Data Collection → Feature Engineering → Prophet Training
 - 1,000+ req/sec per gateway instance
 - 10,000+ concurrent connections

+**4. Observability & Monitoring**
+- **SigNoz Platform**: Unified traces, metrics, and logs
+- **Auto-Instrumentation**: Zero-code instrumentation via OpenTelemetry
+- **Application Monitoring**: All 18 services reporting metrics
+- **Infrastructure Monitoring**: 18 PostgreSQL databases, Redis, RabbitMQ
+- **Kubernetes Monitoring**: Node, pod, container metrics
+- **Log Aggregation**: Centralized logs with trace correlation
+- **Real-Time Alerting**: Email and Slack notifications
+- **Query Performance**: ClickHouse backend for fast analytics
+
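"Centralized logs with trace correlation" means each log line carries the ID of the trace it belongs to, so a log entry can be matched to its trace in the backend. A minimal standard-library sketch of the idea follows. The `current_trace_id` variable and the logger name are assumptions for illustration; in an instrumented service the tracing SDK would supply the ID.

```python
import contextvars
import io
import logging

# Hypothetical holder for the active trace ID; a tracing SDK would manage this.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the active trace ID to every log record passing through."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True  # never drop the record, only annotate it


def build_logger(stream) -> logging.Logger:
    """Return a logger whose output lines include a trace_id field."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
    )
    handler.addFilter(TraceIdFilter())
    logger = logging.getLogger("bakery.example")  # illustrative name
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

Setting `current_trace_id` at the start of a request and resetting it at the end is enough for every log call in between to carry the same ID.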
 ---

 ## Security & Compliance
@@ -786,8 +805,13 @@ Data Collection → Feature Engineering → Prophet Training
 - **Orchestration**: Kubernetes
 - **Ingress**: NGINX Ingress Controller
 - **Certificates**: Let's Encrypt (auto-renewal)
-- **Monitoring**: Prometheus + Grafana
-- **Logging**: ELK Stack (planned)
+- **Observability**: SigNoz (unified traces, metrics, logs)
+  - **Distributed Tracing**: OpenTelemetry auto-instrumentation (FastAPI, HTTPX, SQLAlchemy, Redis)
+  - **Application Metrics**: RED metrics (Rate, Errors, Duration) from all 18 services
+  - **Infrastructure Metrics**: PostgreSQL (18 databases), Redis, RabbitMQ, Kubernetes cluster
+  - **Log Management**: Centralized logs with trace correlation and Kubernetes metadata
+  - **Alerting**: Multi-channel notifications (email, Slack) via AlertManager
+- **Telemetry Backend**: ClickHouse for high-performance time-series storage
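To make the RED metrics in the list above concrete, here is a rough sketch that derives rate, error ratio, and a p95 duration from a batch of request records. The record shape, the "5xx counts as an error" rule, and the nearest-rank percentile are assumptions chosen for illustration; in the deployed stack the observability backend would derive these figures from span data.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class RequestRecord:
    """One finished request (hypothetical shape for this example)."""
    status_code: int
    duration_ms: float


def red_metrics(records: Sequence[RequestRecord], window_seconds: float) -> dict:
    """Compute Rate, Errors, and Duration over one observation window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    total = len(records)
    # Assumption: only server-side failures (HTTP 5xx) count as errors.
    errors = sum(1 for r in records if r.status_code >= 500)
    durations = sorted(r.duration_ms for r in records)
    if durations:
        # Nearest-rank 95th percentile.
        rank = max(1, math.ceil(0.95 * total))
        p95 = durations[rank - 1]
    else:
        p95 = 0.0
    return {
        "rate_per_sec": total / window_seconds,
        "error_ratio": errors / total if total else 0.0,
        "p95_duration_ms": p95,
    }
```

For example, nine fast successful requests and one slow failure in a 10-second window give a rate of 1 request per second and an error ratio of 0.1.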

 ### CI/CD Pipeline
 1. Code push to GitHub
@@ -834,11 +858,14 @@ Data Collection → Feature Engineering → Prophet Training
 - Stripe integration
 - Automated billing

-### 5. Real-Time Operations
+### 5. Real-Time Operations & Observability
 - SSE for instant alerts
 - WebSocket for live updates
 - Sub-second dashboard refresh
 - Always up-to-date data
+- **Full-stack observability** with SigNoz
+- Distributed tracing for performance debugging
+- Real-time metrics from all layers (app, DB, cache, queue, cluster)

 ### 6. Developer-Friendly
 - RESTful APIs