- Identified root cause: double commit bug in prophet_manager.py - Models successfully trained (MAPE 11.96%-14.16%) but DB storage failed - Transaction state errors preventing database record creation - Detailed timeline and recommended fix provided
9.7 KiB
Training Log Analysis - 2025-11-05
Executive Summary
Status: Training FAILED - 5/5 products failed to store in database despite successful model training
Root Cause: Double commit bug in prophet_manager.py causing database transaction conflicts
Impact: All models successfully trained and saved to disk (with good MAPE scores 11.96%-14.16%), but database metadata storage failed due to transaction state errors
Detailed Findings
1. Warnings (Non-Critical)
Spanish Holidays Loading Failure
- Count: 5 occurrences
- Message:
Could not load Spanish holidays dynamically: Logger._log() got an unexpected keyword argument 'region' - Impact: Minor - Prophet models may not properly account for Spanish holidays
- Location:
app/ml/prophet_manager.py- holiday configuration - Severity: LOW
2. Critical Errors
A. Double Commit Transaction Error (ROOT CAUSE)
- Location:
services/training/app/ml/prophet_manager.py:750-824 - Function:
_store_model() - Error Message:
Failed to store model in database: Method 'commit()' can't be called here; method '_prepare_impl()' is already in progress and this would cause an unexpected state change to <SessionTransactionState.CLOSED: 5> - Affected Products: All 5 products
- Occurrences: 3 direct errors
Technical Explanation:
The bug occurs due to duplicate commit logic when no parent session is provided:
- Line 751:
use_parent_session = session is not None→ False when session=None - Lines 819-825: Code creates new dedicated session and calls
_store_in_db(new_session) - Line 807-808: Inside
_store_in_db(), commits becauseuse_parent_session=False - Line 824: Tries to commit again on same session → CONFLICT!
Code Flow:
# Line 751
use_parent_session = session is not None # False
# Lines 819-825 (when no parent session)
else:
async with self.database_manager.get_session() as new_session:
await _store_in_db(new_session) # Commits inside (line 808)
await new_session.commit() # ❌ Double commit!
B. Cascading Transaction Failures
- Error:
InFailedSQLTransactionError: current transaction is aborted, commands ignored until end of transaction block - Count: ~20 occurrences
- Cause: After initial double-commit error, transaction enters failed state
- Impact: All subsequent database operations fail
Affected Operations:
- Getting active models
- Creating trained model records
- Updating training logs
- Retrieving database statistics
3. Training Results
| Product ID | Model Trained | File Saved | MAPE | DB Record | Final Status |
|---|---|---|---|---|---|
5a2155a0-6ccb-4d29-a6db-d25f8e4cc806 |
✅ Yes | ✅ Yes | 13.75% | ❌ No | PARTIAL |
dc4a9ca0-a0e0-436c-94ef-d76c9551ab8c |
✅ Yes | ✅ Yes | 13.04% | ❌ No | PARTIAL |
bcb395bd-b8a4-40de-ba1d-549de48e1688 |
✅ Yes | ✅ Yes | 11.96% | ❌ No | PARTIAL |
ba7aa8bf-e4be-4bbc-a9df-6929c7a8eaa9 |
✅ Yes | ✅ Yes | 14.16% | ❌ No | PARTIAL |
3adadeb8-9d42-4565-afd4-37e0367280ba |
✅ Yes | ✅ Yes | 12.77% | ❌ No | PARTIAL |
Summary:
- ✅ 5/5 models successfully trained with good accuracy (MAPE 11.96% - 14.16%)
- ✅ 5/5 model files saved to disk at
/app/models/{tenant_id}/{model_id}.pkl - ❌ 0/5 database records created - all failed due to transaction errors
- ❌ Final status: FAILED - models cannot be used without DB records
Model Files Created:
/app/models/40c4c65b-f2e7-4d4a-a930-97ac60844fb5/a1cdc2c3-2b11-4e8d-a9a9-008c8bad2369.pkl ✅
/app/models/40c4c65b-f2e7-4d4a-a930-97ac60844fb5/b02bec7b-736b-4cc8-85d9-9caa48479d6d.pkl ✅
/app/models/40c4c65b-f2e7-4d4a-a930-97ac60844fb5/b5865fca-4901-472c-8803-b4031a250661.pkl ✅
/app/models/40c4c65b-f2e7-4d4a-a930-97ac60844fb5/a9f94bcd-9266-4d24-987b-2aca0d0448a5.pkl ✅
/app/models/40c4c65b-f2e7-4d4a-a930-97ac60844fb5/dc8402d0-a053-4b1b-a9c6-fc77480b5170.pkl ✅
4. Timeline of Events
17:39:45 - Data preparation completed for 5 products
17:39:45 - Product categorization: 3 pastries, 2 bread
17:39:45 - Parallel training started (3 concurrent operations)
17:39:45-46 - BATCH 1: Training 3 products in parallel
✅ Prophet models fitted successfully
✅ Models saved to disk with checksums
✅ Previous models deactivated
❌ First double-commit error at 17:39:46
❌ Transaction enters failed state
❌ All subsequent DB operations fail
17:39:46-47 - BATCH 2: Training remaining 2 products
✅ Prophet models fitted successfully
✅ Models saved to disk with checksums
❌ Cannot write to DB (transaction still failed)
17:39:47 - Training completion attempt
❌ Cannot update training logs
❌ Overall status reported as FAILED
✅ Event published to RabbitMQ (training.failed)
5. Root Cause Analysis
File: services/training/app/ml/prophet_manager.py
Method: _store_model() (lines 697-832)
Issue: Double commit logic error
The Bug Pattern:
# Line 751: Set flag based on parent session
use_parent_session = session is not None # False when session=None
# Inner function at lines 753-811
async def _store_in_db(db_session):
# ... database operations ...
db_session.add(db_model)
# Line 807-808: Commit if NOT using parent session
if not use_parent_session:
await db_session.commit() # ❌ FIRST COMMIT
# Lines 813-825: Outer try block
try:
if use_parent_session:
await _store_in_db(session) # ✅ No commit (parent handles it)
else:
# Line 819-825: Create new session
async with self.database_manager.get_session() as new_session:
await _store_in_db(new_session) # ❌ Commits at line 808
await new_session.commit() # ❌ SECOND COMMIT - ERROR!
Why This Fails:
- SQLAlchemy's async session starts commit with
_prepare_impl() - First commit (line 808) begins this process
- Second commit (line 824) tries to commit while prepare is in progress
- Session state machine rejects this: transition to CLOSED would be unexpected
- Transaction marked as failed, all subsequent operations rejected
Historical Context:
Recent commit history shows ongoing session management issues:
673108e - Fix deadlock issues in training 2
74215d3 - Fix deadlock issues in training
fd0a96e - Fix remaining nested session issues in training pipeline
b2de56e - Fix additional nested session deadlock in data_processor.py
e585e9f - Fix critical nested session deadlock in training_service.py
This double-commit bug likely introduced during these "fixes" as a regression.
6. Recommended Solution
Option 1: Remove Inner Commit (Recommended)
Remove the commit logic from _store_in_db() entirely:
# DELETE lines 805-811
# if not use_parent_session:
# await db_session.commit()
# logger.info(f"Model {model_id} stored in database successfully")
# else:
# logger.debug(f"Added model {model_id} to parent session (commit deferred)")
All commits should be handled by the outer scope (either parent session or new session block).
Option 2: Pass Commit Control Flag
async def _store_in_db(db_session, should_commit=False):
"""Inner function to store model in database"""
# ... existing code ...
db_session.add(db_model)
if should_commit:
await db_session.commit()
logger.info(f"Model {model_id} stored successfully")
# Then call it:
if use_parent_session:
await _store_in_db(session, should_commit=False) # Parent commits
else:
async with self.database_manager.get_session() as new_session:
await _store_in_db(new_session, should_commit=False) # We commit here
await new_session.commit()
Recommendation: Option 1 is simpler and follows the "single responsibility principle" - the inner function should only perform the database operations, not manage transactions.
7. Testing Recommendations
After fix implementation:
-
Unit Tests:
- Test
_store_model()withsession=None(new session path) - Test
_store_model()with provided session (parent session path) - Verify only one commit occurs in each path
- Test
-
Integration Tests:
- Run parallel training with 5 products
- Verify all database records created
- Verify no transaction conflicts
- Check training logs updated correctly
-
Monitoring:
- Watch for
InFailedSQLTransactionErrorin logs - Monitor training success rate
- Verify model records appear in
trained_modelstable
- Watch for
Action Items
- Fix double commit bug in
prophet_manager.py:805-811 - Test fix with parallel training
- Add unit tests for session management
- Review all recent session management changes for similar patterns
- Fix Spanish holidays loading warning (optional)
- Document session management patterns for future development
Additional Notes
Good News:
- ✅ All models trained successfully with good accuracy
- ✅ Model files saved to disk and can be recovered
- ✅ Parallel training infrastructure works correctly
- ✅ Prophet optimization producing quality models (MAPE 11.96%-14.16%)
Bad News:
- ❌ Database integration broken due to double commit
- ❌ Models cannot be used without DB records
- ❌ Training pipeline reports as failed despite successful training
- ❌ Recent "fixes" introduced regression
Impact Assessment:
- Severity: HIGH - Training pipeline completely broken
- User Impact: Cannot train new models
- Data Loss: None - models saved to disk
- Recovery: Fix code + re-run training OR manually insert DB records
Analysis Date: 2025-11-05 Analyzed By: Claude Code Agent Log Source: Training service logs 17:39:45 - 17:39:47