Root Cause:
The training process was hanging at the first progress update due to a
nested database session issue. The main trainer created a session and
repositories, then called prophet_manager.train_bakery_model() which
created another nested session with an advisory lock. This caused a
deadlock where:
1. Outer session had uncommitted UPDATE on model_training_logs
2. Inner session tried to acquire advisory lock
3. Neither could proceed, causing training to hang indefinitely
Changes Made:
1. prophet_manager.py:
- Added optional 'session' parameter to train_bakery_model()
- Refactored to use parent session if provided, otherwise create new one
- Prevents nested session creation during training
2. hybrid_trainer.py:
- Added optional 'session' parameter to train_hybrid_model()
- Passes session to prophet_manager to maintain single session context
3. trainer.py:
- Updated _train_single_product() to accept and pass session
- Updated _train_all_models_enhanced() to accept and pass session
- Pass db_session from main training context to all training methods
- Added explicit db_session.flush() after critical progress update
- This ensures updates are visible before acquiring locks
Impact:
- Eliminates nested session deadlocks
- Training now proceeds past initial progress update
- Maintains single database session context throughout training
- Prevents database transaction conflicts
Related Issues:
- Fixes training hang during onboarding process
- Not directly related to audit_metadata changes but exposed by them
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Root Cause:
Training process was stuck at 40% because blocking synchronous ML operations
(model.fit(), model.predict(), study.optimize()) were freezing the asyncio
event loop, preventing RabbitMQ heartbeats, WebSocket communication, and
progress updates.
Changes:
1. prophet_manager.py:
- Wrapped model.fit() at line 189 with asyncio.to_thread()
- Wrapped study.optimize() at line 453 with asyncio.to_thread()
2. hybrid_trainer.py:
- Made _train_xgboost() async and wrapped model.fit() with asyncio.to_thread()
- Made _evaluate_hybrid_model() async and wrapped predict() calls
- Fixed predict() method to wrap blocking predict() calls
Impact:
- Event loop no longer blocks during ML training
- RabbitMQ heartbeats continue during training
- WebSocket progress updates work correctly
- Training can now complete successfully
Fixes: Training hang at 40% during onboarding phase