Commit Graph

31 Commits

Author SHA1 Message Date
Urtzi Alfaro
b2de56ead3 Fix additional nested session deadlock in data_processor.py
Root Cause:
After fixing the training_service.py deadlock, training progressed to
data preparation but got stuck there. data_processor.py creates another
nested session at line 143 and updates training_log without committing,
causing another deadlock.

Session Hierarchy:
1. training_service.py: outer session (fixed in e585e9f)
2. trainer.py: creates its own session (gets past the deadlock thanks to the commit in e585e9f)
3. data_processor.py: creates ANOTHER nested session (THIS FIX)

Fix:
Added an explicit db_session.commit() after the progress update in
data_processor (line 153) so the UPDATE is committed before continuing
with data-processing operations that may interact with other sessions
(sketched below, after the fix chain).

This completes the chain of nested session fixes:
- caff497: prophet_manager + hybrid_trainer session passing
- e585e9f: training_service commit before trainer call
- THIS: data_processor commit after progress update
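The shape of the fix, as a minimal sketch: the function name and the raw
SQL are illustrative stand-ins (the real code lives in data_processor.py
around lines 143-153), but the commit-immediately-after-UPDATE ordering
is the point.

```python
from sqlalchemy import text

def prepare_training_data(db_session, log_id: int, products: list) -> list:
    # The progress update takes a row lock on training_log via UPDATE.
    db_session.execute(
        text("UPDATE training_log SET progress = :p WHERE id = :id"),
        {"p": 20, "id": log_id},
    )
    # THE FIX: commit immediately so the row lock is released. Without
    # this, the uncommitted UPDATE blocks any nested session that later
    # touches the same row, and the two sessions deadlock.
    db_session.commit()

    # Data-preparation work that may open other sessions is now safe.
    return products
```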

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-05 16:39:05 +01:00
Urtzi Alfaro
caff49761d Fix training hang caused by nested database sessions and deadlocks
Root Cause:
The training process was hanging at the first progress update due to a
nested database session issue. The main trainer created a session and
repositories, then called prophet_manager.train_bakery_model() which
created another nested session with an advisory lock. This caused a
deadlock where:
1. Outer session had uncommitted UPDATE on model_training_logs
2. Inner session tried to acquire advisory lock
3. Neither could proceed, causing training to hang indefinitely

Changes Made:
1. prophet_manager.py:
   - Added optional 'session' parameter to train_bakery_model()
   - Refactored to use the parent session if provided, otherwise create a new one (pattern sketched after this list)
   - Prevents nested session creation during training

2. hybrid_trainer.py:
   - Added optional 'session' parameter to train_hybrid_model()
   - Passes session to prophet_manager to maintain single session context

3. trainer.py:
   - Updated _train_single_product() to accept and pass session
   - Updated _train_all_models_enhanced() to accept and pass session
   - Pass db_session from main training context to all training methods
   - Added explicit db_session.flush() after critical progress update
   - This ensures updates are visible before acquiring locks
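A hedged illustration of the optional-session pattern described above;
the exact signatures in prophet_manager.py and hybrid_trainer.py may
differ, and SessionLocal stands in for the project's session factory.

```python
from contextlib import contextmanager
from typing import Optional
from sqlalchemy.orm import Session, sessionmaker

SessionLocal = sessionmaker()  # assumed factory; bind to the real engine

@contextmanager
def _session_scope(parent: Optional[Session]):
    """Reuse the caller's session if given; otherwise own a fresh one."""
    if parent is not None:
        yield parent  # caller keeps ownership of commit/rollback
        return
    session = SessionLocal()
    try:
        yield session
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()

def train_bakery_model(product_id: str, session: Optional[Session] = None):
    with _session_scope(session) as db:
        # All reads and writes share one transaction, so no inner session
        # can block on the outer session's uncommitted UPDATE.
        pass  # elided: training reads/writes on the shared session `db`
```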

Impact:
- Eliminates nested session deadlocks
- Training now proceeds past initial progress update
- Maintains single database session context throughout training
- Prevents database transaction conflicts

Related Issues:
- Fixes training hang during onboarding process
- Not directly related to audit_metadata changes but exposed by them

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-05 16:13:32 +01:00
Claude
c64585af57 Fix training hang by wrapping blocking ML operations in thread pool
Root Cause:
Training process was stuck at 40% because blocking synchronous ML operations
(model.fit(), model.predict(), study.optimize()) were freezing the asyncio
event loop, preventing RabbitMQ heartbeats, WebSocket communication, and
progress updates.

Changes:
1. prophet_manager.py:
   - Wrapped model.fit() at line 189 with asyncio.to_thread()
   - Wrapped study.optimize() at line 453 with asyncio.to_thread()

2. hybrid_trainer.py:
   - Made _train_xgboost() async and wrapped model.fit() with asyncio.to_thread()
   - Made _evaluate_hybrid_model() async and wrapped predict() calls
   - Fixed predict() method to wrap blocking predict() calls
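A minimal sketch of the wrapping technique, using stand-in model and
study objects; asyncio.to_thread() forwards positional and keyword
arguments to the blocking callable on a worker thread.

```python
import asyncio

async def fit_model(model, df):
    # model.fit() is synchronous and CPU-bound; running it on a worker
    # thread keeps the event loop free for heartbeats and WebSocket I/O.
    return await asyncio.to_thread(model.fit, df)

async def optimize_study(study, objective, n_trials: int = 50):
    # Same idea for an Optuna-style study.optimize(): the blocking call
    # runs off the loop while progress updates keep flowing.
    return await asyncio.to_thread(study.optimize, objective, n_trials=n_trials)
```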

Impact:
- Event loop no longer blocks during ML training
- RabbitMQ heartbeats continue during training
- WebSocket progress updates work correctly
- Training can now complete successfully

Fixes: Training hang at 40% during onboarding phase
2025-11-05 14:34:53 +00:00
Claude
5a84be83d6 Fix multiple critical bugs in onboarding training step
This commit addresses all identified bugs and issues in the training code path:

## Critical Fixes:
- Add get_start_time() method to TrainingLogRepository and fix non-existent method call
- Remove duplicate training.started event from API endpoint (trainer publishes the accurate one)
- Add missing progress events for 80-100% range (85%, 92%, 94%) to eliminate progress "dead zone"

## High Priority Fixes:
- Fix division-by-zero risk in time estimation with a double-check plus max() guard (see the sketch after this list)
- Remove unreachable exception handler in training_operations.py
- Simplify WebSocket token refresh logic to only reconnect on actual user session changes
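An illustrative sketch of that guard; the function and variable names
are assumptions, not the actual progress_tracker.py code.

```python
def estimate_remaining_seconds(elapsed: float, progress_pct: float) -> float:
    """Estimate time remaining from elapsed seconds and percent complete."""
    if progress_pct <= 0:
        return 0.0  # double-check: no progress yet, no meaningful estimate
    # max() safety: never divide by a near-zero percentage.
    fraction_done = max(progress_pct, 0.01) / 100.0
    total = elapsed / fraction_done
    return max(total - elapsed, 0.0)
```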

## Medium Priority Fixes:
- Fix auto-start training effect with useRef to prevent duplicate starts
- Add HTTP polling debounce delay (5s) to prevent race conditions with WebSocket
- Extract all magic numbers to centralized constants files:
  - Backend: services/training/app/core/training_constants.py
  - Frontend: frontend/src/constants/training.ts
- Standardize error logging with exc_info=True on critical errors

## Code Quality Improvements:
- All progress percentages now use named constants
- All timeouts and intervals now use named constants
- Improved code maintainability and readability
- Better separation of concerns

## Files Changed:
- Backend: training_service.py, trainer.py, training_events.py, progress_tracker.py
- Backend: training_operations.py, training_log_repository.py, training_constants.py (new)
- Frontend: training.ts (hooks), MLTrainingStep.tsx, training.ts (constants, new)

All training progress events now properly flow from 0% to 100% with no gaps.
2025-11-05 13:02:39 +00:00
Claude
799e7dbaeb Fix training job concurrent database session conflicts
Root Cause:
- Multiple parallel training tasks (3 at a time) were sharing the same database session
- This caused SQLAlchemy session state conflicts: "Session is already flushing" and "rollback() is already in progress"
- Additionally, duplicate model records were being created by both trainer and training_service

Fixes:
1. Separated model training from database writes:
   - Training happens in parallel (CPU-intensive)
   - Database writes happen sequentially after training completes
   - This eliminates concurrent session access (see the sketch after this list)

2. Removed duplicate database writes:
   - Trainer now writes all model records sequentially after parallel training
   - Training service now retrieves models instead of creating duplicates
   - Performance metrics are also created by trainer (no duplicates)

3. Added proper data flow:
   - _train_single_product: Only trains models, stores results
   - _write_training_results_to_database: Sequential DB writes after training
   - _store_trained_models: Changed to retrieve existing models
   - _create_performance_metrics: Changed to verify existing metrics
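A sketch of the two-phase structure, with the helper names taken from
the list above and their bodies reduced to stand-ins.

```python
import asyncio

async def _train_single_product(product):
    # Phase 1 work: CPU-bound model fitting; touches no database session.
    return {"product": product, "metrics": {}}

def _write_training_results_to_database(db_session, result):
    # Phase 2 work: create the model record and performance metrics rows.
    pass  # elided: repository calls on db_session

async def train_all(products, db_session, max_parallel: int = 3):
    sem = asyncio.Semaphore(max_parallel)

    async def train_one(product):
        async with sem:  # at most `max_parallel` trainings at a time
            return await _train_single_product(product)

    # Phase 1: parallel, CPU-intensive, session-free.
    results = await asyncio.gather(*(train_one(p) for p in products))

    # Phase 2: one writer, one session, strictly sequential, so no
    # "Session is already flushing" or concurrent-rollback conflicts.
    for result in results:
        _write_training_results_to_database(db_session, result)
    db_session.commit()
```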

Benefits:
- Eliminates database session conflicts
- Prevents duplicate model records
- Maintains parallel training performance
- Ensures data consistency

Files Modified:
- services/training/app/ml/trainer.py
- services/training/app/services/training_service.py

Resolves: Onboarding training job database session conflicts
2025-11-05 12:41:42 +00:00
Urtzi Alfaro
394ad3aea4 Improve AI logic 2025-11-05 13:34:56 +01:00
Urtzi Alfaro
5adb0e39c0 Improve the frontend 5 2025-11-02 20:24:44 +01:00
Urtzi Alfaro
8d30172483 Improve the frontend 2025-10-21 19:50:07 +02:00
Urtzi Alfaro
05da20357d Improve the security of the DB 2025-10-19 19:22:37 +02:00
Urtzi Alfaro
96ad5c6692 Refactor datetime and timezone utils 2025-10-12 23:16:04 +02:00
Urtzi Alfaro
3c689b4f98 REFACTOR external service and improve websocket training 2025-10-09 14:11:02 +02:00
Urtzi Alfaro
f7de9115d1 Fix new services implementation 5 2025-08-15 17:53:59 +02:00
Urtzi Alfaro
03737430ee Fix new services implementation 3 2025-08-14 16:47:34 +02:00
Urtzi Alfaro
fbe7470ad9 REFACTOR data service 2025-08-12 18:17:30 +02:00
Urtzi Alfaro
8d125ab0d5 Refactor the traffic fetching system 2025-08-10 18:32:47 +02:00
Urtzi Alfaro
488bb3ef93 REFACTOR - Database logic 2025-08-08 09:08:41 +02:00
Urtzi Alfaro
32a7b913d0 Fix new Frontend 15 2025-08-04 21:46:12 +02:00
Urtzi Alfaro
8bb14ecc4f Fix new Frontend 14 2025-08-04 19:17:31 +02:00
Urtzi Alfaro
0ba543a19a Fix new Frontend 13 2025-08-04 18:58:12 +02:00
Urtzi Alfaro
35b02ca364 Fix new Frontend 12 2025-08-04 18:21:42 +02:00
Urtzi Alfaro
e581a144be Improve the event messaging for training service 2 2025-07-31 15:34:35 +02:00
Urtzi Alfaro
024290e4c0 Start fixing forecast service 18 2025-07-30 08:41:47 +02:00
Urtzi Alfaro
c10a745695 Start fixing forecast service 17 2025-07-30 08:29:40 +02:00
Urtzi Alfaro
e5c3375d53 Start fixing forecast service 16 2025-07-30 08:13:32 +02:00
Urtzi Alfaro
cd6fd875f7 Improve training test 3 2025-07-29 12:45:39 +02:00
Urtzi Alfaro
7cd595df81 Improve training code 2 2025-07-28 20:20:54 +02:00
Urtzi Alfaro
98f546af12 Improve training code 2025-07-28 19:28:39 +02:00
Urtzi Alfaro
e63a99b818 Checking onboarding flow - fix 4 2025-07-27 16:29:53 +02:00
Urtzi Alfaro
f3071c00bd Add all the code for training service 2025-07-19 16:59:37 +02:00
Urtzi Alfaro
4073222888 Fix imports 2025-07-18 14:41:39 +02:00
Urtzi Alfaro
347ff51bd7 Initial microservices setup from artifacts 2025-07-17 13:09:24 +02:00