Commit Graph

3 Commits

Author SHA1 Message Date
Urtzi Alfaro
caff49761d Fix training hang caused by nested database sessions and deadlocks
Root Cause:
The training process was hanging at the first progress update due to a
nested database session issue. The main trainer created a session and
repositories, then called prophet_manager.train_bakery_model() which
created another nested session with an advisory lock. This caused a
deadlock where:
1. Outer session had uncommitted UPDATE on model_training_logs
2. Inner session tried to acquire advisory lock
3. Neither could proceed, causing training to hang indefinitely

Changes Made:
1. prophet_manager.py:
   - Added optional 'session' parameter to train_bakery_model()
   - Refactored to use parent session if provided, otherwise create new one
   - Prevents nested session creation during training

2. hybrid_trainer.py:
   - Added optional 'session' parameter to train_hybrid_model()
   - Passes session to prophet_manager to maintain single session context

3. trainer.py:
   - Updated _train_single_product() to accept and pass session
   - Updated _train_all_models_enhanced() to accept and pass session
   - Pass db_session from main training context to all training methods
   - Added explicit db_session.flush() after critical progress update
   - This ensures updates are visible before acquiring locks

Impact:
- Eliminates nested session deadlocks
- Training now proceeds past initial progress update
- Maintains single database session context throughout training
- Prevents database transaction conflicts

Related Issues:
- Fixes training hang during onboarding process
- Not directly related to audit_metadata changes but exposed by them

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-05 16:13:32 +01:00
Claude
c64585af57 Fix training hang by wrapping blocking ML operations in thread pool
Root Cause:
Training process was stuck at 40% because blocking synchronous ML operations
(model.fit(), model.predict(), study.optimize()) were freezing the asyncio
event loop, preventing RabbitMQ heartbeats, WebSocket communication, and
progress updates.

Changes:
1. prophet_manager.py:
   - Wrapped model.fit() at line 189 with asyncio.to_thread()
   - Wrapped study.optimize() at line 453 with asyncio.to_thread()

2. hybrid_trainer.py:
   - Made _train_xgboost() async and wrapped model.fit() with asyncio.to_thread()
   - Made _evaluate_hybrid_model() async and wrapped predict() calls
   - Fixed predict() method to wrap blocking predict() calls

Impact:
- Event loop no longer blocks during ML training
- RabbitMQ heartbeats continue during training
- WebSocket progress updates work correctly
- Training can now complete successfully

Fixes: Training hang at 40% during onboarding phase
2025-11-05 14:34:53 +00:00
Urtzi Alfaro
394ad3aea4 Improve AI logic 2025-11-05 13:34:56 +01:00