Fix training hang caused by nested database sessions and deadlocks

Root Cause:
The training process hung at the first progress update because of a
nested database session. The main trainer created a session and
repositories, then called prophet_manager.train_bakery_model(), which
opened a second, nested session and tried to take an advisory lock.
This caused a deadlock where:
1. The outer session held an uncommitted UPDATE on model_training_logs
2. The inner session tried to acquire the advisory lock
3. Neither could proceed, so training hung indefinitely

Changes Made:
1. prophet_manager.py:
   - Added optional 'session' parameter to train_bakery_model()
   - Refactored to use parent session if provided, otherwise create new one
   - Prevents nested session creation during training

2. hybrid_trainer.py:
   - Added optional 'session' parameter to train_hybrid_model()
   - Passes session to prophet_manager to maintain single session context

3. trainer.py:
   - Updated _train_single_product() to accept and pass session
   - Updated _train_all_models_enhanced() to accept and pass session
   - Passes db_session from the main training context to all training methods
   - Added explicit db_session.flush() after the critical progress update
     so the pending update is sent to the database (not yet committed)
     before any lock is acquired
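
The "use parent session if provided, otherwise create new one" behaviour
described above can be sketched roughly as follows. This is an illustration
under assumptions, not the code in this commit: FakeSession and use_session
are hypothetical names, the real session type is not shown in this message,
and the body of train_bakery_model is reduced to a placeholder.

```python
import asyncio
from contextlib import asynccontextmanager
from typing import Any, AsyncIterator, Dict, Optional


class FakeSession:
    """Hypothetical stand-in for a database session (assumption, not the real type)."""

    opened = 0  # counts how many sessions have been created

    def __init__(self) -> None:
        FakeSession.opened += 1
        self.closed = False

    async def close(self) -> None:
        self.closed = True


@asynccontextmanager
async def use_session(
    session: Optional[FakeSession] = None,
) -> AsyncIterator[FakeSession]:
    """Yield the caller's session if given; otherwise open one and close it afterwards."""
    if session is not None:
        # The parent owns this session, so it is neither closed nor replaced here.
        yield session
        return
    own = FakeSession()
    try:
        yield own
    finally:
        await own.close()


async def train_bakery_model(
    job_id: str, session: Optional[FakeSession] = None
) -> Dict[str, Any]:
    """Placeholder for the training entry point; only the session handling is shown."""
    async with use_session(session) as db:
        # Lock acquisition and model training would run here, all on `db`.
        return {"job_id": job_id, "session": db}


if __name__ == "__main__":
    parent = FakeSession()
    result = asyncio.run(train_bakery_model("job-1", session=parent))
    print(result["session"] is parent, parent.closed)
```

With a parent session passed in, no second session is opened and the parent
stays open for the caller to commit; without one, the function manages its
own session as before.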

Impact:
- Eliminates nested session deadlocks
- Training now proceeds past initial progress update
- Maintains single database session context throughout training
- Prevents database transaction conflicts

Related Issues:
- Fixes training hang during onboarding process
- Not directly related to audit_metadata changes but exposed by them

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Urtzi Alfaro
2025-11-05 16:13:32 +01:00
parent 7a315afa62
commit caff49761d
3 changed files with 174 additions and 133 deletions

@@ -56,7 +56,8 @@ class HybridProphetXGBoost:
inventory_product_id: str,
df: pd.DataFrame,
job_id: str,
- validation_split: float = 0.2
+ validation_split: float = 0.2,
+ session = None
) -> Dict[str, Any]:
"""
Train hybrid Prophet + XGBoost model.
@@ -67,6 +68,7 @@ class HybridProphetXGBoost:
df: Training data (must have 'ds', 'y' and regressor columns)
job_id: Training job identifier
validation_split: Fraction of data for validation
+ session: Optional database session (uses parent session if provided to avoid nested sessions)
Returns:
Dictionary with model metadata and performance metrics
@@ -80,11 +82,13 @@ class HybridProphetXGBoost:
# Step 1: Train Prophet model (base forecaster)
logger.info("Step 1: Training Prophet base model")
+ # ✅ FIX: Pass session to prophet_manager to avoid nested session issues
prophet_result = await self.prophet_manager.train_bakery_model(
tenant_id=tenant_id,
inventory_product_id=inventory_product_id,
df=df.copy(),
- job_id=job_id
+ job_id=job_id,
+ session=session
)
self.prophet_model_data = prophet_result