Issues Fixed:
4️⃣ data_processor.py (Line 230-232):
- The second update_log_progress call (after data preparation) ran without a commit
- Added commit() after the completion update to prevent the deadlock
- Added debug logging for visibility
5️⃣ prophet_manager.py _store_model (Line 750):
- Previously created a triple-nested session (training_service → trainer → lock → _store_model)
- Refactored _store_model to accept optional session parameter
- Uses parent session from lock context instead of creating new one
- Updated call site to pass db_session parameter
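
A minimal sketch of the optional-session pattern described above. `FakeSession`, `session_scope`, and `store_model` are hypothetical stand-ins, not the project's actual classes, and the real code is presumably async; the sketch is synchronous for brevity. The point it illustrates: when a parent session is passed in, no new session is opened and the caller keeps control of the transaction.

```python
from contextlib import contextmanager


class FakeSession:
    """Hypothetical stand-in for a database session; counts how many were opened."""
    opened = 0

    def __init__(self):
        FakeSession.opened += 1
        self.stored = []
        self.committed = False

    def add(self, record):
        self.stored.append(record)

    def commit(self):
        self.committed = True


@contextmanager
def session_scope(parent=None):
    """Yield the parent session if one is given; otherwise open and commit a new one."""
    if parent is not None:
        yield parent  # the caller owns the transaction, so do not commit here
        return
    session = FakeSession()
    yield session
    session.commit()


def store_model(record, session=None):
    """Store a record, reusing the caller's session when provided."""
    with session_scope(session) as db:
        db.add(record)
        return db


parent = FakeSession()
used = store_model({"model_id": "m1"}, session=parent)  # reuses parent
standalone = store_model({"model_id": "m2"})            # opens its own session
```

Defaulting `session` to `None` keeps existing call sites working while letting the lock context pass its own session through.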
Complete Session Hierarchy After All Fixes:
training_service.py (session)
└─ commit() ← FIX#2 (e585e9f)
   └─ trainer.py (new session) ✅ OK
      ├─ data_processor.py (new session)
      │  ├─ commit() after first update ← FIX#3 (b2de56e)
      │  └─ commit() after second update ← FIX#4 (THIS)
      └─ prophet_manager.train_bakery_model (uses parent or new session) ← FIX#1 (caff497)
         └─ lock.acquire(session)
            └─ _store_model(session=parent) ← FIX#5 (THIS)
               └─ NO NESTED SESSION ✅
All nested session deadlocks in training path are now resolved.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Root Cause:
After fixing the training_service.py deadlock, training progressed to
data preparation but got stuck there. data_processor.py creates
another nested session at line 143 and updates training_log without
committing, which causes another deadlock.
Session Hierarchy:
1. training_service.py: outer session (fixed in e585e9f)
2. trainer.py: creates its own session (no longer deadlocks, thanks to the commit in e585e9f)
3. data_processor.py: creates ANOTHER nested session (THIS FIX)
Fix:
Added explicit db_session.commit() after progress update in data_processor
(line 153) to ensure the UPDATE is committed before continuing with data
processing operations that may interact with other sessions.
This completes the chain of nested session fixes:
- caff497: prophet_manager + hybrid_trainer session passing
- e585e9f: training_service commit before trainer call
- THIS: data_processor commit after progress update
Root Cause (Actual):
The actual nested session issue was in training_service.py, not just in
the trainer methods. The flow was:
1. training_service.py creates outer session (line 173)
2. Updates training_log at lines 235-237 (uncommitted)
3. Calls trainer.train_tenant_models() at line 239
4. Trainer creates its own session at line 93
5. DEADLOCK: Outer session has uncommitted UPDATE, inner session can't proceed
Fix:
Added explicit session.commit() after the ml_training progress update
(line 241) to ensure the UPDATE is committed before trainer creates
its own session. This prevents the deadlock condition.
Related to previous commit caff497 which fixed nested sessions in
prophet_manager and hybrid_trainer, but missed the actual root cause
in training_service.py.
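
The blocking condition can be reproduced in miniature. The sketch below uses SQLite from the standard library as a stand-in for the real database, so the locking granularity differs (SQLite locks the whole file, and it raises an error after a short timeout instead of hanging). The table and column names are illustrative. The shape of the problem is the same: a second connection cannot write while the first holds an uncommitted UPDATE, and it proceeds as soon as the first commits.

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "training_demo.db")

# Two connections play the roles of the outer and inner sessions.
outer = sqlite3.connect(db_path, timeout=0.1)
inner = sqlite3.connect(db_path, timeout=0.1)

outer.execute("CREATE TABLE training_log (job_id TEXT PRIMARY KEY, progress INTEGER)")
outer.execute("INSERT INTO training_log VALUES ('job-1', 0)")
outer.commit()


def inner_update(progress):
    """Try to write from the inner connection; report whether it was blocked."""
    try:
        inner.execute(
            "UPDATE training_log SET progress = ? WHERE job_id = 'job-1'",
            (progress,),
        )
        inner.commit()
        return "ok"
    except sqlite3.OperationalError:  # "database is locked"
        inner.rollback()
        return "blocked"


# The outer connection updates progress but does not commit yet.
outer.execute("UPDATE training_log SET progress = 10 WHERE job_id = 'job-1'")
before_commit = inner_update(20)

# Committing releases the write lock, so the inner connection can proceed.
outer.commit()
after_commit = inner_update(20)

final_progress = outer.execute("SELECT progress FROM training_log").fetchone()[0]
```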
Root Cause:
The training process was hanging at the first progress update due to a
nested database session issue. The main trainer created a session and
repositories, then called prophet_manager.train_bakery_model() which
created another nested session with an advisory lock. This caused a
deadlock where:
1. Outer session had uncommitted UPDATE on model_training_logs
2. Inner session tried to acquire advisory lock
3. Neither could proceed, causing training to hang indefinitely
Changes Made:
1. prophet_manager.py:
- Added optional 'session' parameter to train_bakery_model()
- Refactored to use parent session if provided, otherwise create new one
- Prevents nested session creation during training
2. hybrid_trainer.py:
- Added optional 'session' parameter to train_hybrid_model()
- Passes session to prophet_manager to maintain single session context
3. trainer.py:
- Updated _train_single_product() to accept and pass session
- Updated _train_all_models_enhanced() to accept and pass session
- Pass db_session from main training context to all training methods
- Added explicit db_session.flush() after critical progress update
- This ensures updates are visible before acquiring locks
Impact:
- Eliminates nested session deadlocks
- Training now proceeds past initial progress update
- Maintains single database session context throughout training
- Prevents database transaction conflicts
Related Issues:
- Fixes training hang during onboarding process
- Not directly related to audit_metadata changes but exposed by them
Root Cause:
Training process was stuck at 40% because blocking synchronous ML operations
(model.fit(), model.predict(), study.optimize()) were freezing the asyncio
event loop, preventing RabbitMQ heartbeats, WebSocket communication, and
progress updates.
Changes:
1. prophet_manager.py:
- Wrapped model.fit() at line 189 with asyncio.to_thread()
- Wrapped study.optimize() at line 453 with asyncio.to_thread()
2. hybrid_trainer.py:
- Made _train_xgboost() async and wrapped model.fit() with asyncio.to_thread()
- Made _evaluate_hybrid_model() async and wrapped predict() calls
- Fixed the predict() method to wrap its blocking predict() calls with asyncio.to_thread()
Impact:
- Event loop no longer blocks during ML training
- RabbitMQ heartbeats continue during training
- WebSocket progress updates work correctly
- Training can now complete successfully
Fixes: Training hang at 40% during onboarding phase
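
A minimal sketch of the technique, assuming Python 3.9+ for `asyncio.to_thread()`. `blocking_fit` and `heartbeat` are hypothetical stand-ins for `model.fit()` and the RabbitMQ/WebSocket traffic; the sketch shows that a periodic coroutine keeps running while the blocking call executes in a worker thread.

```python
import asyncio
import time


def blocking_fit(seconds):
    """Stand-in for a synchronous model.fit() call."""
    time.sleep(seconds)
    return "fitted"


async def heartbeat(ticks, interval):
    """Stand-in for heartbeats or progress updates on the event loop."""
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(interval)


async def train():
    ticks = []
    beat = asyncio.create_task(heartbeat(ticks, 0.02))
    # Run the blocking call in a worker thread so the event loop stays free.
    result = await asyncio.to_thread(blocking_fit, 0.3)
    beat.cancel()
    try:
        await beat
    except asyncio.CancelledError:
        pass
    return result, len(ticks)


result, tick_count = asyncio.run(train())
```

Calling `blocking_fit(0.3)` directly inside `train()` would instead stall the loop for the whole duration, and the heartbeat task would not run until it returned.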
Critical fixes for training session logging:
1. Training log race condition fix:
- Add explicit session commits after creating training logs
- Handle duplicate key errors gracefully when multiple sessions
  try to create the same log simultaneously
- Implement retry logic to query for existing logs after
  duplicate key violations
- Prevents "Training log not found" errors during training
2. Audit event async generator error fix:
- Replace incorrect next(get_db()) usage with proper
  async context manager (database_manager.get_session())
- Fixes "'async_generator' object is not an iterator" error
- Ensures audit logging works correctly
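
The async generator error is easy to reproduce. In the sketch below, `get_db` and `get_session` are simplified, hypothetical stand-ins for the project's helpers: calling `next()` on an async generator raises `TypeError`, while an async context manager hands out and cleans up the session correctly.

```python
import asyncio
from contextlib import asynccontextmanager


async def get_db():
    """Stand-in for an async-generator dependency."""
    yield "session"


# next() only works on synchronous iterators, so this raises TypeError.
try:
    next(get_db())
    error_message = None
except TypeError as exc:
    error_message = str(exc)


@asynccontextmanager
async def get_session():
    """Stand-in for an async context manager that yields a session."""
    session = {"open": True}
    try:
        yield session
    finally:
        session["open"] = False  # cleanup always runs on exit


async def log_audit_event():
    async with get_session() as session:
        was_open = session["open"]
    return was_open, session["open"]


open_inside, open_after = asyncio.run(log_audit_event())
```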
These changes address race conditions in concurrent database
sessions and ensure training logs are properly synchronized
across the training pipeline.
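
A minimal sketch of the duplicate-key handling, using SQLite from the standard library in place of the real database. The table layout and the `get_or_create_log` name are assumptions for illustration: attempt the insert, and if another session already created the row, roll back and read the existing one instead of failing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE training_log (job_id TEXT PRIMARY KEY, status TEXT)")


def get_or_create_log(conn, job_id):
    """Insert a training log, falling back to the existing row on a duplicate key."""
    try:
        conn.execute(
            "INSERT INTO training_log (job_id, status) VALUES (?, 'running')",
            (job_id,),
        )
        conn.commit()  # commit right away so other sessions can see the log
        created = True
    except sqlite3.IntegrityError:
        conn.rollback()  # someone else created it first; reuse their row
        created = False
    row = conn.execute(
        "SELECT job_id, status FROM training_log WHERE job_id = ?", (job_id,)
    ).fetchone()
    return row, created


first = get_or_create_log(conn, "job-1")
second = get_or_create_log(conn, "job-1")
row_count = conn.execute("SELECT COUNT(*) FROM training_log").fetchone()[0]
```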
This commit addresses all identified bugs and issues in the training code path:
## Critical Fixes:
- Add get_start_time() method to TrainingLogRepository and fix non-existent method call
- Remove duplicate training.started event from API endpoint (trainer publishes the accurate one)
- Add missing progress events for 80-100% range (85%, 92%, 94%) to eliminate progress "dead zone"
## High Priority Fixes:
- Guard against division by zero in time estimation with a second zero check and a max() floor
- Remove unreachable exception handler in training_operations.py
- Simplify WebSocket token refresh logic to only reconnect on actual user session changes
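
A sketch of what a guarded time estimate could look like. The function name, the `MIN_PROGRESS` floor, and the linear extrapolation are assumptions for illustration, not the code from this commit.

```python
MIN_PROGRESS = 0.1  # hypothetical floor, in percent, to keep the divisor away from zero


def estimate_remaining_seconds(elapsed_seconds, progress_percent):
    """Linearly extrapolate remaining time; return None when no estimate is possible."""
    if progress_percent <= 0:
        return None  # explicit check: nothing to extrapolate from
    # max() floor: even a tiny positive progress value cannot blow up the division.
    progress = max(min(progress_percent, 100.0), MIN_PROGRESS)
    total_seconds = elapsed_seconds * 100.0 / progress
    return max(total_seconds - elapsed_seconds, 0.0)


quarter_done = estimate_remaining_seconds(30.0, 25.0)
not_started = estimate_remaining_seconds(10.0, 0.0)
finished = estimate_remaining_seconds(50.0, 100.0)
```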
## Medium Priority Fixes:
- Fix auto-start training effect with useRef to prevent duplicate starts
- Add HTTP polling debounce delay (5s) to prevent race conditions with WebSocket
- Extract all magic numbers to centralized constants files:
  - Backend: services/training/app/core/training_constants.py
  - Frontend: frontend/src/constants/training.ts
- Standardize error logging with exc_info=True on critical errors
## Code Quality Improvements:
- All progress percentages now use named constants
- All timeouts and intervals now use named constants
- Improved code maintainability and readability
- Better separation of concerns
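
As an illustration of the named-constants approach, a constants module might look like the sketch below. All names are hypothetical; only the values 85, 92, 94 and the 5-second polling delay come from this commit message.

```python
# Hypothetical names; only the values 85, 92, 94 and the 5 s delay are from the commit message.
LATE_STAGE_PROGRESS_STEPS = (85, 92, 94)
PROGRESS_COMPLETE = 100
HTTP_POLLING_DEBOUNCE_SECONDS = 5


def next_progress_step(current):
    """Return the next late-stage progress value after `current`."""
    for step in LATE_STAGE_PROGRESS_STEPS:
        if step > current:
            return step
    return PROGRESS_COMPLETE
```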
## Files Changed:
- Backend: training_service.py, trainer.py, training_events.py, progress_tracker.py
- Backend: training_operations.py, training_log_repository.py, training_constants.py (new)
- Frontend: training.ts (hooks), MLTrainingStep.tsx, training.ts (constants, new)
All training progress events now properly flow from 0% to 100% with no gaps.
Root Cause:
- Multiple parallel training tasks (3 at a time) were sharing the same database session
- This caused SQLAlchemy session state conflicts: "Session is already flushing" and "rollback() is already in progress"
- Additionally, duplicate model records were being created by both trainer and training_service
Fixes:
1. Separated model training from database writes:
- Training happens in parallel (CPU-intensive)
- Database writes happen sequentially after training completes
- This eliminates concurrent session access
2. Removed duplicate database writes:
- Trainer now writes all model records sequentially after parallel training
- Training service now retrieves models instead of creating duplicates
- Performance metrics are also created by trainer (no duplicates)
3. Added proper data flow:
- _train_single_product: Only trains models, stores results
- _write_training_results_to_database: Sequential DB writes after training
- _store_trained_models: Changed to retrieve existing models
- _create_performance_metrics: Changed to verify existing metrics
Benefits:
- Eliminates database session conflicts
- Prevents duplicate model records
- Maintains parallel training performance
- Ensures data consistency
Files Modified:
- services/training/app/ml/trainer.py
- services/training/app/services/training_service.py
Resolves: Onboarding training job database session conflicts
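
The split between parallel training and sequential writes can be sketched as follows. The function names loosely mirror those above, but the bodies are hypothetical stand-ins: training tasks run concurrently (at most three at a time) and only return results, and a single writer then persists them one by one, so no session is ever shared between concurrent tasks.

```python
import asyncio

MAX_PARALLEL = 3  # from the commit message: three training tasks at a time


async def train_single_product(product_id, semaphore, stats):
    """Train only; return a result and never touch the database."""
    async with semaphore:
        stats["running"] += 1
        stats["peak"] = max(stats["peak"], stats["running"])
        await asyncio.sleep(0.01)  # stand-in for the real training work
        stats["running"] -= 1
    return {"product_id": product_id, "model": f"model-{product_id}"}


def write_results(session, results):
    """Sequential writes on one session, after all training has finished."""
    for result in results:
        session.append(result)


async def train_all(product_ids):
    semaphore = asyncio.Semaphore(MAX_PARALLEL)
    stats = {"running": 0, "peak": 0}
    results = await asyncio.gather(
        *(train_single_product(pid, semaphore, stats) for pid in product_ids)
    )
    session = []  # stand-in for a single database session
    write_results(session, results)
    return session, stats["peak"]


written, peak = asyncio.run(train_all(["a", "b", "c", "d", "e"]))
```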