Files
bakery-ia/TRAINING_LOG_ANALYSIS.md
Claude 0b129b070d Add comprehensive training log analysis
- Identified root cause: double commit bug in prophet_manager.py
- Models successfully trained (MAPE 11.96%-14.16%) but DB storage failed
- Transaction state errors preventing database record creation
- Detailed timeline and recommended fix provided
2025-11-05 18:31:43 +00:00

9.7 KiB

Training Log Analysis - 2025-11-05

Executive Summary

Status: Training FAILED - 5/5 products failed to store in database despite successful model training

Root Cause: Double commit bug in prophet_manager.py causing database transaction conflicts

Impact: All models successfully trained and saved to disk (with good MAPE scores 11.96%-14.16%), but database metadata storage failed due to transaction state errors


Detailed Findings

1. Warnings (Non-Critical)

Spanish Holidays Loading Failure

  • Count: 5 occurrences
  • Message: Could not load Spanish holidays dynamically: Logger._log() got an unexpected keyword argument 'region'
  • Impact: Minor - Prophet models may not properly account for Spanish holidays
  • Location: app/ml/prophet_manager.py - holiday configuration
  • Severity: LOW

2. Critical Errors

A. Double Commit Transaction Error (ROOT CAUSE)

  • Location: services/training/app/ml/prophet_manager.py:750-824
  • Function: _store_model()
  • Error Message:
    Failed to store model in database: Method 'commit()' can't be called here;
    method '_prepare_impl()' is already in progress and this would cause an
    unexpected state change to <SessionTransactionState.CLOSED: 5>
    
  • Affected Products: All 5 products
  • Occurrences: 3 direct errors

Technical Explanation:

The bug occurs due to duplicate commit logic when no parent session is provided:

  1. Line 751: use_parent_session = session is not None → False when session=None
  2. Lines 819-825: Code creates new dedicated session and calls _store_in_db(new_session)
  3. Line 807-808: Inside _store_in_db(), commits because use_parent_session=False
  4. Line 824: Tries to commit again on same session → CONFLICT!

Code Flow:

# Line 751
use_parent_session = session is not None  # False

# Lines 819-825 (when no parent session)
else:
    async with self.database_manager.get_session() as new_session:
        await _store_in_db(new_session)  # Commits inside (line 808)
        await new_session.commit()  # ❌ Double commit!

B. Cascading Transaction Failures

  • Error: InFailedSQLTransactionError: current transaction is aborted, commands ignored until end of transaction block
  • Count: ~20 occurrences
  • Cause: After initial double-commit error, transaction enters failed state
  • Impact: All subsequent database operations fail

Affected Operations:

  • Getting active models
  • Creating trained model records
  • Updating training logs
  • Retrieving database statistics

3. Training Results

Product ID Model Trained File Saved MAPE DB Record Final Status
5a2155a0-6ccb-4d29-a6db-d25f8e4cc806 Yes Yes 13.75% No PARTIAL
dc4a9ca0-a0e0-436c-94ef-d76c9551ab8c Yes Yes 13.04% No PARTIAL
bcb395bd-b8a4-40de-ba1d-549de48e1688 Yes Yes 11.96% No PARTIAL
ba7aa8bf-e4be-4bbc-a9df-6929c7a8eaa9 Yes Yes 14.16% No PARTIAL
3adadeb8-9d42-4565-afd4-37e0367280ba Yes Yes 12.77% No PARTIAL

Summary:

  • 5/5 models successfully trained with good accuracy (MAPE 11.96% - 14.16%)
  • 5/5 model files saved to disk at /app/models/{tenant_id}/{model_id}.pkl
  • 0/5 database records created - all failed due to transaction errors
  • Final status: FAILED - models cannot be used without DB records

Model Files Created:

/app/models/40c4c65b-f2e7-4d4a-a930-97ac60844fb5/a1cdc2c3-2b11-4e8d-a9a9-008c8bad2369.pkl ✅
/app/models/40c4c65b-f2e7-4d4a-a930-97ac60844fb5/b02bec7b-736b-4cc8-85d9-9caa48479d6d.pkl ✅
/app/models/40c4c65b-f2e7-4d4a-a930-97ac60844fb5/b5865fca-4901-472c-8803-b4031a250661.pkl ✅
/app/models/40c4c65b-f2e7-4d4a-a930-97ac60844fb5/a9f94bcd-9266-4d24-987b-2aca0d0448a5.pkl ✅
/app/models/40c4c65b-f2e7-4d4a-a930-97ac60844fb5/dc8402d0-a053-4b1b-a9c6-fc77480b5170.pkl ✅

4. Timeline of Events

17:39:45 - Data preparation completed for 5 products
17:39:45 - Product categorization: 3 pastries, 2 bread
17:39:45 - Parallel training started (3 concurrent operations)

17:39:45-46 - BATCH 1: Training 3 products in parallel
  ✅ Prophet models fitted successfully
  ✅ Models saved to disk with checksums
  ✅ Previous models deactivated
  ❌ First double-commit error at 17:39:46
  ❌ Transaction enters failed state
  ❌ All subsequent DB operations fail

17:39:46-47 - BATCH 2: Training remaining 2 products
  ✅ Prophet models fitted successfully
  ✅ Models saved to disk with checksums
  ❌ Cannot write to DB (transaction still failed)

17:39:47 - Training completion attempt
  ❌ Cannot update training logs
  ❌ Overall status reported as FAILED
  ✅ Event published to RabbitMQ (training.failed)

5. Root Cause Analysis

File: services/training/app/ml/prophet_manager.py Method: _store_model() (lines 697-832) Issue: Double commit logic error

The Bug Pattern:

# Line 751: Set flag based on parent session
use_parent_session = session is not None  # False when session=None

# Inner function at lines 753-811
async def _store_in_db(db_session):
    # ... database operations ...
    db_session.add(db_model)

    # Line 807-808: Commit if NOT using parent session
    if not use_parent_session:
        await db_session.commit()  # ❌ FIRST COMMIT

# Lines 813-825: Outer try block
try:
    if use_parent_session:
        await _store_in_db(session)  # ✅ No commit (parent handles it)
    else:
        # Line 819-825: Create new session
        async with self.database_manager.get_session() as new_session:
            await _store_in_db(new_session)  # ❌ Commits at line 808
            await new_session.commit()  # ❌ SECOND COMMIT - ERROR!

Why This Fails:

  1. SQLAlchemy's async session starts commit with _prepare_impl()
  2. First commit (line 808) begins this process
  3. Second commit (line 824) tries to commit while prepare is in progress
  4. Session state machine rejects this: transition to CLOSED would be unexpected
  5. Transaction marked as failed, all subsequent operations rejected

Historical Context:

Recent commit history shows ongoing session management issues:

673108e - Fix deadlock issues in training 2
74215d3 - Fix deadlock issues in training
fd0a96e - Fix remaining nested session issues in training pipeline
b2de56e - Fix additional nested session deadlock in data_processor.py
e585e9f - Fix critical nested session deadlock in training_service.py

This double-commit bug likely introduced during these "fixes" as a regression.


Option 1: Remove Inner Commit (Recommended)

Remove the commit logic from _store_in_db() entirely:

# DELETE lines 805-811
# if not use_parent_session:
#     await db_session.commit()
#     logger.info(f"Model {model_id} stored in database successfully")
# else:
#     logger.debug(f"Added model {model_id} to parent session (commit deferred)")

All commits should be handled by the outer scope (either parent session or new session block).

Option 2: Pass Commit Control Flag

async def _store_in_db(db_session, should_commit=False):
    """Inner function to store model in database"""
    # ... existing code ...
    db_session.add(db_model)

    if should_commit:
        await db_session.commit()
        logger.info(f"Model {model_id} stored successfully")

# Then call it:
if use_parent_session:
    await _store_in_db(session, should_commit=False)  # Parent commits
else:
    async with self.database_manager.get_session() as new_session:
        await _store_in_db(new_session, should_commit=False)  # We commit here
        await new_session.commit()

Recommendation: Option 1 is simpler and follows the "single responsibility principle" - the inner function should only perform the database operations, not manage transactions.


7. Testing Recommendations

After fix implementation:

  1. Unit Tests:

    • Test _store_model() with session=None (new session path)
    • Test _store_model() with provided session (parent session path)
    • Verify only one commit occurs in each path
  2. Integration Tests:

    • Run parallel training with 5 products
    • Verify all database records created
    • Verify no transaction conflicts
    • Check training logs updated correctly
  3. Monitoring:

    • Watch for InFailedSQLTransactionError in logs
    • Monitor training success rate
    • Verify model records appear in trained_models table

Action Items

  • Fix double commit bug in prophet_manager.py:805-811
  • Test fix with parallel training
  • Add unit tests for session management
  • Review all recent session management changes for similar patterns
  • Fix Spanish holidays loading warning (optional)
  • Document session management patterns for future development

Additional Notes

Good News:

  • All models trained successfully with good accuracy
  • Model files saved to disk and can be recovered
  • Parallel training infrastructure works correctly
  • Prophet optimization producing quality models (MAPE 11.96%-14.16%)

Bad News:

  • Database integration broken due to double commit
  • Models cannot be used without DB records
  • Training pipeline reports as failed despite successful training
  • Recent "fixes" introduced regression

Impact Assessment:

  • Severity: HIGH - Training pipeline completely broken
  • User Impact: Cannot train new models
  • Data Loss: None - models saved to disk
  • Recovery: Fix code + re-run training OR manually insert DB records

Analysis Date: 2025-11-05 Analyzed By: Claude Code Agent Log Source: Training service logs 17:39:45 - 17:39:47