CRITICAL FIX - Database Transaction:
- Removed duplicate commit logic from _store_in_db() inner function (lines 805-811)
- Prevents 'Method commit() can't be called here' error
- Only the outer scope commits now: line 821 for sessions it creates itself, the parent caller for sessions passed in
- Fixes issue where all 5 models trained successfully but failed to store in DB
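A minimal sketch of the commit-ownership rule described above, not the actual code. FakeSession, store_in_db, and store_model are hypothetical stand-ins (the real helper is _store_in_db() and the real session comes from the database layer). The inner helper only stages the record, and the outer scope commits only when it created the session itself.

```python
class FakeSession:
    """Hypothetical stand-in for a database session; it only counts commits."""

    def __init__(self):
        self.added = []
        self.commits = 0

    def add(self, obj):
        self.added.append(obj)

    def commit(self):
        self.commits += 1


def store_in_db(session, model_record):
    # Inner helper: stages the record but never commits.
    session.add(model_record)


def store_model(model_record, parent_session=None):
    # The outer scope owns the commit only if it created the session.
    owns_session = parent_session is None
    session = FakeSession() if owns_session else parent_session
    store_in_db(session, model_record)
    if owns_session:
        session.commit()  # single commit for a session created here
    # Otherwise the parent that passed the session in commits later.
    return session
```

With a parent session, store_model() leaves the commit count at zero, so the caller remains the only place where the transaction is finalized.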
MINOR FIX - Logging:
- Fixed Spanish holidays logger call (lines 977-978)
- Removed invalid keyword arguments (region=, years=) from logger.info()
- Now uses f-string format consistent with rest of codebase
- Prevents 'Logger._log() got an unexpected keyword argument' warning
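A short illustration of the logging fix, assuming the standard library logging module. The logger name, message text, and the region/years values are made up for the example. Arbitrary keyword arguments are forwarded to Logger._log(), which rejects them with a TypeError; formatting the values into the message avoids that.

```python
import logging

logger = logging.getLogger("holidays_demo")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(logging.NullHandler())

region = "ES"         # illustrative values only
years = [2024, 2025]


def log_with_kwargs():
    # Old style: region= and years= are not accepted by Logger._log().
    try:
        logger.info("Loaded Spanish holidays", region=region, years=years)
    except TypeError as exc:
        return str(exc)
    return None


def log_with_fstring():
    # New style: values are formatted into the message itself.
    logger.info(f"Loaded Spanish holidays: region={region}, years={years}")
    return True
```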
Impact:
- Training pipeline can now complete successfully
- Models will be stored in database after training
- No more cascading transaction failures
- Cleaner logs without warnings
Root cause: Double commit introduced during recent session management fixes
Related commits: 673108e, 74215d3, fd0a96e, b2de56e, e585e9f
Issues Fixed:
4️⃣ data_processor.py (Lines 230-232):
- Problem: the second update_log_progress call, made after data preparation, was not followed by a commit
- Added commit() after completion update to prevent deadlock
- Added debug logging for visibility
5️⃣ prophet_manager.py _store_model (Line 750):
- Problem: _store_model opened its own session, creating a triple-nested session (training_service → trainer → lock → _store_model)
- Refactored _store_model to accept optional session parameter
- Uses parent session from lock context instead of creating new one
- Updated call site to pass db_session parameter
Complete Session Hierarchy After All Fixes:
training_service.py (session)
  └─ commit() ← FIX#2 (e585e9f)
  └─ trainer.py (new session) ✅ OK
      └─ data_processor.py (new session)
          └─ commit() after first update ← FIX#3 (b2de56e)
          └─ commit() after second update ← FIX#4 (THIS)
      └─ prophet_manager.train_bakery_model (uses parent or new session) ← FIX#1 (caff497)
          └─ lock.acquire(session)
              └─ _store_model(session=parent) ← FIX#5 (THIS)
                  └─ NO NESTED SESSION ✅
All nested session deadlocks in training path are now resolved.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Root Cause:
The training process was hanging at the first progress update due to a
nested database session issue. The main trainer created a session and
repositories, then called prophet_manager.train_bakery_model() which
created another nested session with an advisory lock. This caused a
deadlock where:
1. Outer session had uncommitted UPDATE on model_training_logs
2. Inner session tried to acquire advisory lock
3. Neither could proceed, causing training to hang indefinitely
Changes Made:
1. prophet_manager.py:
- Added optional 'session' parameter to train_bakery_model()
- Refactored to use parent session if provided, otherwise create new one
- Prevents nested session creation during training
2. hybrid_trainer.py:
- Added optional 'session' parameter to train_hybrid_model()
- Passes session to prophet_manager to maintain single session context
3. trainer.py:
- Updated _train_single_product() to accept and pass session
- Updated _train_all_models_enhanced() to accept and pass session
- Pass db_session from main training context to all training methods
- Added explicit db_session.flush() after critical progress update
- This ensures the pending update is sent to the database, within the current transaction, before any lock is acquired
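A rough sketch of the "use the parent session if provided, otherwise create one" pattern from the changes above. new_session, session_scope, and the dict-based session are hypothetical placeholders for the real session factory, and the function bodies are reduced to the session handling only.

```python
import asyncio
from contextlib import asynccontextmanager

opened_sessions = []  # records every session the demo factory creates


@asynccontextmanager
async def new_session():
    # Hypothetical session factory; a dict stands in for a real session.
    session = {"id": len(opened_sessions) + 1}
    opened_sessions.append(session)
    yield session


@asynccontextmanager
async def session_scope(session=None):
    # Reuse the caller's session when given, otherwise open a new one.
    if session is not None:
        yield session
    else:
        async with new_session() as created:
            yield created


async def train_bakery_model(session=None):
    async with session_scope(session) as db:
        return db  # the real method would lock, train, and store here


async def train_single_product(session):
    # The trainer forwards its own session instead of letting a new one open.
    return await train_bakery_model(session=session)


async def main():
    async with new_session() as db:
        used = await train_single_product(db)
        return used is db
```

Running main() opens exactly one session in this sketch, because every layer below the trainer reuses the one it was handed.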
Impact:
- Eliminates nested session deadlocks
- Training now proceeds past initial progress update
- Maintains single database session context throughout training
- Prevents database transaction conflicts
Related Issues:
- Fixes training hang during onboarding process
- Not directly related to audit_metadata changes but exposed by them
Root Cause:
Training process was stuck at 40% because blocking synchronous ML operations
(model.fit(), model.predict(), study.optimize()) were freezing the asyncio
event loop, preventing RabbitMQ heartbeats, WebSocket communication, and
progress updates.
Changes:
1. prophet_manager.py:
- Wrapped model.fit() at line 189 with asyncio.to_thread()
- Wrapped study.optimize() at line 453 with asyncio.to_thread()
2. hybrid_trainer.py:
- Made _train_xgboost() async and wrapped model.fit() with asyncio.to_thread()
- Made _evaluate_hybrid_model() async and wrapped predict() calls
- Fixed predict() method to wrap its blocking predict() calls with asyncio.to_thread()
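A small sketch of the asyncio.to_thread() approach (Python 3.9+). blocking_fit is a stand-in that just sleeps, not a real model.fit(), and the heartbeat task is a simplified placeholder for RabbitMQ heartbeats or progress updates. The point is that the heartbeat keeps ticking while the blocking call runs in a worker thread.

```python
import asyncio
import time


def blocking_fit(seconds):
    """Stand-in for a blocking call such as model.fit(); it only sleeps."""
    time.sleep(seconds)
    return "fitted"


async def heartbeat(ticks, interval):
    # Simulates periodic work that needs the event loop to stay responsive.
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(interval)


async def train():
    ticks = []
    hb = asyncio.create_task(heartbeat(ticks, 0.05))
    # Run the blocking call in a thread so the event loop is not frozen.
    result = await asyncio.to_thread(blocking_fit, 0.3)
    hb.cancel()
    try:
        await hb
    except asyncio.CancelledError:
        pass
    return result, len(ticks)
```

Calling blocking_fit() directly inside train() instead would stop the heartbeat for the whole duration of the sleep, which matches the hang described in the root cause.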
Impact:
- Event loop no longer blocks during ML training
- RabbitMQ heartbeats continue during training
- WebSocket progress updates work correctly
- Training can now complete successfully
Fixes: Training hang at 40% during onboarding phase