Commit Graph

138 Commits

Author SHA1 Message Date
Urtzi Alfaro
e8fda39e50 Improve metrics 2026-01-08 20:48:24 +01:00
Urtzi Alfaro
29d19087f1 Update monitoring packages to latest versions
- Updated all OpenTelemetry packages to latest versions:
  - opentelemetry-api: 1.27.0 → 1.39.1
  - opentelemetry-sdk: 1.27.0 → 1.39.1
  - opentelemetry-exporter-otlp-proto-grpc: 1.27.0 → 1.39.1
  - opentelemetry-exporter-otlp-proto-http: 1.27.0 → 1.39.1
  - opentelemetry-instrumentation-fastapi: 0.48b0 → 0.60b1
  - opentelemetry-instrumentation-httpx: 0.48b0 → 0.60b1
  - opentelemetry-instrumentation-redis: 0.48b0 → 0.60b1
  - opentelemetry-instrumentation-sqlalchemy: 0.48b0 → 0.60b1

- Removed prometheus-client==0.23.1 from all services
- Unified all services to use the same monitoring package versions

Generated by Mistral Vibe.
Co-Authored-By: Mistral Vibe <vibe@mistral.ai>
2026-01-08 19:25:52 +01:00
Urtzi Alfaro
c07df124fb Improve UI 2025-12-30 14:40:20 +01:00
Urtzi Alfaro
02f0c91a15 Fix UI issues 2025-12-29 19:33:35 +01:00
Urtzi Alfaro
ff830a3415 demo seed change 2025-12-13 23:57:54 +01:00
Urtzi Alfaro
667e6e0404 New alert service 2025-12-05 20:07:01 +01:00
Urtzi Alfaro
972db02f6d New enterprise feature 2025-11-30 09:12:40 +01:00
Urtzi Alfaro
843cd2bf5c Improve the UI and training 2025-11-15 15:20:10 +01:00
Urtzi Alfaro
c349b845a6 Bug fixes of training 2025-11-14 20:27:39 +01:00
Urtzi Alfaro
a8d8828935 imporve features 2025-11-14 07:23:56 +01:00
Urtzi Alfaro
5783c7ed05 Add POI feature and imporve the overall backend implementation 2025-11-12 15:34:10 +01:00
Urtzi Alfaro
3007bde05b Improve kubernetes for prod 2025-11-06 11:04:50 +01:00
Urtzi Alfaro
3ad093d38b Fix orchestrator issues 2025-11-05 22:54:14 +01:00
Claude
0c4a941ca8 Fix critical double commit bug and Spanish holidays warning
CRITICAL FIX - Database Transaction:
- Removed duplicate commit logic from _store_in_db() inner function (lines 805-811)
- Prevents 'Method commit() can't be called here' error
- Now only outer scope handles commits (line 821 for new sessions, parent for parent sessions)
- Fixes issue where all 5 models trained successfully but failed to store in DB

MINOR FIX - Logging:
- Fixed Spanish holidays logger call (line 977-978)
- Removed invalid keyword arguments (region=, years=) from logger.info()
- Now uses f-string format consistent with rest of codebase
- Prevents 'Logger._log() got an unexpected keyword argument' warning

Impact:
- Training pipeline can now complete successfully
- Models will be stored in database after training
- No more cascading transaction failures
- Cleaner logs without warnings

Root cause: Double commit introduced during recent session management fixes
Related commits: 673108e, 74215d3, fd0a96e, b2de56e, e585e9f
2025-11-05 18:34:00 +00:00
Urtzi Alfaro
673108e3c0 Fix deadlock issues in training 2 2025-11-05 19:26:52 +01:00
Urtzi Alfaro
74215d3e85 Fix deadlock issues in training 2025-11-05 18:47:20 +01:00
Urtzi Alfaro
fd0a96e254 Fix remaining nested session issues in training pipeline
Issues Fixed:

4️⃣ data_processor.py (Line 230-232):
- Second update_log_progress call without commit after data preparation
- Added commit() after completion update to prevent deadlock
- Added debug logging for visibility

5️⃣ prophet_manager.py _store_model (Line 750):
- Created TRIPLE nested session (training_service → trainer → lock → _store_model)
- Refactored _store_model to accept optional session parameter
- Uses parent session from lock context instead of creating new one
- Updated call site to pass db_session parameter

Complete Session Hierarchy After All Fixes:
training_service.py (session)
  └─ commit() ← FIX #2 (e585e9f)
  └─ trainer.py (new session)  OK
      └─ data_processor.py (new session)
          └─ commit() after first update ← FIX #3 (b2de56e)
          └─ commit() after second update ← FIX #4 (THIS)
      └─ prophet_manager.train_bakery_model (uses parent or new session) ← FIX #1 (caff497)
          └─ lock.acquire(session)
              └─ _store_model(session=parent) ← FIX #5 (THIS)
                  └─ NO NESTED SESSION 

All nested session deadlocks in training path are now resolved.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-05 16:41:53 +01:00
Urtzi Alfaro
b2de56ead3 Fix additional nested session deadlock in data_processor.py
Root Cause:
After fixing the training_service.py deadlock, training progressed to
data preparation but got stuck there. The data_processor.py creates
another nested session at line 143, updates training_log without
committing, causing another deadlock scenario.

Session Hierarchy:
1. training_service.py: outer session (fixed in e585e9f)
2. trainer.py: creates own session (passes deadlock due to commit)
3. data_processor.py: creates ANOTHER nested session (THIS FIX)

Fix:
Added explicit db_session.commit() after progress update in data_processor
(line 153) to ensure the UPDATE is committed before continuing with data
processing operations that may interact with other sessions.

This completes the chain of nested session fixes:
- caff497: prophet_manager + hybrid_trainer session passing
- e585e9f: training_service commit before trainer call
- THIS: data_processor commit after progress update

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-05 16:39:05 +01:00
Urtzi Alfaro
e585e9fac0 Fix critical nested session deadlock in training_service.py
Root Cause (Actual):
The actual nested session issue was in training_service.py, not just in
the trainer methods. The flow was:

1. training_service.py creates outer session (line 173)
2. Updates training_log at line 235-237 (uncommitted)
3. Calls trainer.train_tenant_models() at line 239
4. Trainer creates its own session at line 93
5. DEADLOCK: Outer session has uncommitted UPDATE, inner session can't proceed

Fix:
Added explicit session.commit() after the ml_training progress update
(line 241) to ensure the UPDATE is committed before trainer creates
its own session. This prevents the deadlock condition.

Related to previous commit caff497 which fixed nested sessions in
prophet_manager and hybrid_trainer, but missed the actual root cause
in training_service.py.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-05 16:30:15 +01:00
Urtzi Alfaro
caff49761d Fix training hang caused by nested database sessions and deadlocks
Root Cause:
The training process was hanging at the first progress update due to a
nested database session issue. The main trainer created a session and
repositories, then called prophet_manager.train_bakery_model() which
created another nested session with an advisory lock. This caused a
deadlock where:
1. Outer session had uncommitted UPDATE on model_training_logs
2. Inner session tried to acquire advisory lock
3. Neither could proceed, causing training to hang indefinitely

Changes Made:
1. prophet_manager.py:
   - Added optional 'session' parameter to train_bakery_model()
   - Refactored to use parent session if provided, otherwise create new one
   - Prevents nested session creation during training

2. hybrid_trainer.py:
   - Added optional 'session' parameter to train_hybrid_model()
   - Passes session to prophet_manager to maintain single session context

3. trainer.py:
   - Updated _train_single_product() to accept and pass session
   - Updated _train_all_models_enhanced() to accept and pass session
   - Pass db_session from main training context to all training methods
   - Added explicit db_session.flush() after critical progress update
   - This ensures updates are visible before acquiring locks

Impact:
- Eliminates nested session deadlocks
- Training now proceeds past initial progress update
- Maintains single database session context throughout training
- Prevents database transaction conflicts

Related Issues:
- Fixes training hang during onboarding process
- Not directly related to audit_metadata changes but exposed by them

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-05 16:13:32 +01:00
Claude
c64585af57 Fix training hang by wrapping blocking ML operations in thread pool
Root Cause:
Training process was stuck at 40% because blocking synchronous ML operations
(model.fit(), model.predict(), study.optimize()) were freezing the asyncio
event loop, preventing RabbitMQ heartbeats, WebSocket communication, and
progress updates.

Changes:
1. prophet_manager.py:
   - Wrapped model.fit() at line 189 with asyncio.to_thread()
   - Wrapped study.optimize() at line 453 with asyncio.to_thread()

2. hybrid_trainer.py:
   - Made _train_xgboost() async and wrapped model.fit() with asyncio.to_thread()
   - Made _evaluate_hybrid_model() async and wrapped predict() calls
   - Fixed predict() method to wrap blocking predict() calls

Impact:
- Event loop no longer blocks during ML training
- RabbitMQ heartbeats continue during training
- WebSocket progress updates work correctly
- Training can now complete successfully

Fixes: Training hang at 40% during onboarding phase
2025-11-05 14:34:53 +00:00
Claude
136761af19 Fix AuditLogger.log_event() parameter name: metadata -> audit_metadata 2025-11-05 14:17:39 +00:00
Claude
8df90338b2 Fix training log race conditions and audit event error
Critical fixes for training session logging:

1. Training log race condition fix:
   - Add explicit session commits after creating training logs
   - Handle duplicate key errors gracefully when multiple sessions
     try to create the same log simultaneously
   - Implement retry logic to query for existing logs after
     duplicate key violations
   - Prevents "Training log not found" errors during training

2. Audit event async generator error fix:
   - Replace incorrect next(get_db()) usage with proper
     async context manager (database_manager.get_session())
   - Fixes "'async_generator' object is not an iterator" error
   - Ensures audit logging works correctly

These changes address race conditions in concurrent database
sessions and ensure training logs are properly synchronized
across the training pipeline.
2025-11-05 13:24:22 +00:00
Claude
5a84be83d6 Fix multiple critical bugs in onboarding training step
This commit addresses all identified bugs and issues in the training code path:

## Critical Fixes:
- Add get_start_time() method to TrainingLogRepository and fix non-existent method call
- Remove duplicate training.started event from API endpoint (trainer publishes the accurate one)
- Add missing progress events for 80-100% range (85%, 92%, 94%) to eliminate progress "dead zone"

## High Priority Fixes:
- Fix division by zero risk in time estimation with double-check and max() safety
- Remove unreachable exception handler in training_operations.py
- Simplify WebSocket token refresh logic to only reconnect on actual user session changes

## Medium Priority Fixes:
- Fix auto-start training effect with useRef to prevent duplicate starts
- Add HTTP polling debounce delay (5s) to prevent race conditions with WebSocket
- Extract all magic numbers to centralized constants files:
  - Backend: services/training/app/core/training_constants.py
  - Frontend: frontend/src/constants/training.ts
- Standardize error logging with exc_info=True on critical errors

## Code Quality Improvements:
- All progress percentages now use named constants
- All timeouts and intervals now use named constants
- Improved code maintainability and readability
- Better separation of concerns

## Files Changed:
- Backend: training_service.py, trainer.py, training_events.py, progress_tracker.py
- Backend: training_operations.py, training_log_repository.py, training_constants.py (new)
- Frontend: training.ts (hooks), MLTrainingStep.tsx, training.ts (constants, new)

All training progress events now properly flow from 0% to 100% with no gaps.
2025-11-05 13:02:39 +00:00
Claude
799e7dbaeb Fix training job concurrent database session conflicts
Root Cause:
- Multiple parallel training tasks (3 at a time) were sharing the same database session
- This caused SQLAlchemy session state conflicts: "Session is already flushing" and "rollback() is already in progress"
- Additionally, duplicate model records were being created by both trainer and training_service

Fixes:
1. Separated model training from database writes:
   - Training happens in parallel (CPU-intensive)
   - Database writes happen sequentially after training completes
   - This eliminates concurrent session access

2. Removed duplicate database writes:
   - Trainer now writes all model records sequentially after parallel training
   - Training service now retrieves models instead of creating duplicates
   - Performance metrics are also created by trainer (no duplicates)

3. Added proper data flow:
   - _train_single_product: Only trains models, stores results
   - _write_training_results_to_database: Sequential DB writes after training
   - _store_trained_models: Changed to retrieve existing models
   - _create_performance_metrics: Changed to verify existing metrics

Benefits:
- Eliminates database session conflicts
- Prevents duplicate model records
- Maintains parallel training performance
- Ensures data consistency

Files Modified:
- services/training/app/ml/trainer.py
- services/training/app/services/training_service.py

Resolves: Onboarding training job database session conflicts
2025-11-05 12:41:42 +00:00
Urtzi Alfaro
394ad3aea4 Improve AI logic 2025-11-05 13:34:56 +01:00
Urtzi Alfaro
5adb0e39c0 Improve the frontend 5 2025-11-02 20:24:44 +01:00
Urtzi Alfaro
269d3b5032 Add user delete process 2025-10-31 11:54:19 +01:00
Urtzi Alfaro
36217a2729 Improve the frontend 2 2025-10-29 06:58:05 +01:00
Urtzi Alfaro
858d985c92 Improve the frontend modals 2025-10-27 16:33:26 +01:00
Urtzi Alfaro
8d30172483 Improve the frontend 2025-10-21 19:50:07 +02:00
Urtzi Alfaro
05da20357d Improve teh securty of teh DB 2025-10-19 19:22:37 +02:00
Urtzi Alfaro
312e36c893 Update requirements and insfra versions 2025-10-17 23:09:40 +02:00
Urtzi Alfaro
dbb48d8e2c Improve the sales import 2025-10-15 21:09:42 +02:00
Urtzi Alfaro
8f9e9a7edc Add role-based filtering and imporve code 2025-10-15 16:12:49 +02:00
Urtzi Alfaro
96ad5c6692 Refactor datetime and timezone utils 2025-10-12 23:16:04 +02:00
Urtzi Alfaro
7556a00db7 Improve the demo feature of the project 2025-10-12 18:47:33 +02:00
Urtzi Alfaro
dbc7f2fa0d Re-create migrations init tables 2025-10-09 20:47:31 +02:00
Urtzi Alfaro
3c689b4f98 REFACTOR external service and improve websocket training 2025-10-09 14:11:02 +02:00
Urtzi Alfaro
7c72f83c51 REFACTOR ALL APIs fix 1 2025-10-07 07:15:07 +02:00
Urtzi Alfaro
38fb98bc27 REFACTOR ALL APIs 2025-10-06 15:27:01 +02:00
Urtzi Alfaro
0fdc3b0211 Fix issues 2025-10-01 16:25:53 +02:00
Urtzi Alfaro
2eeebfc1e0 Fix Alembic issue 2025-10-01 11:24:06 +02:00
Urtzi Alfaro
7cc4b957a5 Fix DB issue 2s 2025-09-30 21:58:10 +02:00
Urtzi Alfaro
147893015e Fix DB issues 2025-09-30 13:32:51 +02:00
Urtzi Alfaro
ec6bcb4c7d Add migration services 2025-09-30 08:12:45 +02:00
Urtzi Alfaro
2712a60a2a Refactor services alembic 2025-09-29 19:16:34 +02:00
Urtzi Alfaro
befcc126b0 Refactor all main.py 2025-09-29 13:13:12 +02:00
Urtzi Alfaro
4777e59e7a Add base kubernetes support final fix 4 2025-09-29 07:54:25 +02:00
Urtzi Alfaro
57f77638cc Add base kubernetes support final fix 2 2025-09-28 19:48:05 +02:00