From 0b129b070d9c727b381666d1b10a9e45c33a8c08 Mon Sep 17 00:00:00 2001 From: Claude Date: Wed, 5 Nov 2025 18:31:43 +0000 Subject: [PATCH] Add comprehensive training log analysis - Identified root cause: double commit bug in prophet_manager.py - Models successfully trained (MAPE 11.96%-14.16%) but DB storage failed - Transaction state errors preventing database record creation - Detailed timeline and recommended fix provided --- TRAINING_LOG_ANALYSIS.md | 283 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 283 insertions(+) create mode 100644 TRAINING_LOG_ANALYSIS.md diff --git a/TRAINING_LOG_ANALYSIS.md b/TRAINING_LOG_ANALYSIS.md new file mode 100644 index 00000000..19a0704e --- /dev/null +++ b/TRAINING_LOG_ANALYSIS.md @@ -0,0 +1,283 @@ +# Training Log Analysis - 2025-11-05 + +## Executive Summary + +**Status:** Training FAILED - 5/5 products failed to store in database despite successful model training + +**Root Cause:** Double commit bug in `prophet_manager.py` causing database transaction conflicts + +**Impact:** All models successfully trained and saved to disk (with good MAPE scores 11.96%-14.16%), but database metadata storage failed due to transaction state errors + +--- + +## Detailed Findings + +### 1. Warnings (Non-Critical) + +#### Spanish Holidays Loading Failure +- **Count:** 5 occurrences +- **Message:** `Could not load Spanish holidays dynamically: Logger._log() got an unexpected keyword argument 'region'` +- **Impact:** Minor - Prophet models may not properly account for Spanish holidays +- **Location:** `app/ml/prophet_manager.py` - holiday configuration +- **Severity:** LOW + +--- + +### 2. Critical Errors + +#### A. Double Commit Transaction Error (ROOT CAUSE) +- **Location:** `services/training/app/ml/prophet_manager.py:750-824` +- **Function:** `_store_model()` +- **Error Message:** + ``` + Failed to store model in database: Method 'commit()' can't be called here; + method '_prepare_impl()' is already in progress and this would cause an + unexpected state change to + ``` +- **Affected Products:** All 5 products +- **Occurrences:** 3 direct errors + +**Technical Explanation:** + +The bug occurs due to duplicate commit logic when no parent session is provided: + +1. Line 751: `use_parent_session = session is not None` → False when session=None +2. Lines 819-825: Code creates new dedicated session and calls `_store_in_db(new_session)` +3. Line 807-808: Inside `_store_in_db()`, commits because `use_parent_session=False` +4. Line 824: Tries to commit again on same session → CONFLICT! + +**Code Flow:** +```python +# Line 751 +use_parent_session = session is not None # False + +# Lines 819-825 (when no parent session) +else: + async with self.database_manager.get_session() as new_session: + await _store_in_db(new_session) # Commits inside (line 808) + await new_session.commit() # ❌ Double commit! +``` + +#### B. Cascading Transaction Failures +- **Error:** `InFailedSQLTransactionError: current transaction is aborted, commands ignored until end of transaction block` +- **Count:** ~20 occurrences +- **Cause:** After initial double-commit error, transaction enters failed state +- **Impact:** All subsequent database operations fail + +**Affected Operations:** +- Getting active models +- Creating trained model records +- Updating training logs +- Retrieving database statistics + +--- + +### 3. Training Results + +| Product ID | Model Trained | File Saved | MAPE | DB Record | Final Status | +|------------|---------------|------------|------|-----------|--------------| +| `5a2155a0-6ccb-4d29-a6db-d25f8e4cc806` | ✅ Yes | ✅ Yes | 13.75% | ❌ No | PARTIAL | +| `dc4a9ca0-a0e0-436c-94ef-d76c9551ab8c` | ✅ Yes | ✅ Yes | 13.04% | ❌ No | PARTIAL | +| `bcb395bd-b8a4-40de-ba1d-549de48e1688` | ✅ Yes | ✅ Yes | 11.96% | ❌ No | PARTIAL | +| `ba7aa8bf-e4be-4bbc-a9df-6929c7a8eaa9` | ✅ Yes | ✅ Yes | 14.16% | ❌ No | PARTIAL | +| `3adadeb8-9d42-4565-afd4-37e0367280ba` | ✅ Yes | ✅ Yes | 12.77% | ❌ No | PARTIAL | + +**Summary:** +- ✅ **5/5 models successfully trained** with good accuracy (MAPE 11.96% - 14.16%) +- ✅ **5/5 model files saved to disk** at `/app/models/{tenant_id}/{model_id}.pkl` +- ❌ **0/5 database records created** - all failed due to transaction errors +- ❌ **Final status: FAILED** - models cannot be used without DB records + +**Model Files Created:** +``` +/app/models/40c4c65b-f2e7-4d4a-a930-97ac60844fb5/a1cdc2c3-2b11-4e8d-a9a9-008c8bad2369.pkl ✅ +/app/models/40c4c65b-f2e7-4d4a-a930-97ac60844fb5/b02bec7b-736b-4cc8-85d9-9caa48479d6d.pkl ✅ +/app/models/40c4c65b-f2e7-4d4a-a930-97ac60844fb5/b5865fca-4901-472c-8803-b4031a250661.pkl ✅ +/app/models/40c4c65b-f2e7-4d4a-a930-97ac60844fb5/a9f94bcd-9266-4d24-987b-2aca0d0448a5.pkl ✅ +/app/models/40c4c65b-f2e7-4d4a-a930-97ac60844fb5/dc8402d0-a053-4b1b-a9c6-fc77480b5170.pkl ✅ +``` + +--- + +### 4. Timeline of Events + +``` +17:39:45 - Data preparation completed for 5 products +17:39:45 - Product categorization: 3 pastries, 2 bread +17:39:45 - Parallel training started (3 concurrent operations) + +17:39:45-46 - BATCH 1: Training 3 products in parallel + ✅ Prophet models fitted successfully + ✅ Models saved to disk with checksums + ✅ Previous models deactivated + ❌ First double-commit error at 17:39:46 + ❌ Transaction enters failed state + ❌ All subsequent DB operations fail + +17:39:46-47 - BATCH 2: Training remaining 2 products + ✅ Prophet models fitted successfully + ✅ Models saved to disk with checksums + ❌ Cannot write to DB (transaction still failed) + +17:39:47 - Training completion attempt + ❌ Cannot update training logs + ❌ Overall status reported as FAILED + ✅ Event published to RabbitMQ (training.failed) +``` + +--- + +### 5. Root Cause Analysis + +**File:** `services/training/app/ml/prophet_manager.py` +**Method:** `_store_model()` (lines 697-832) +**Issue:** Double commit logic error + +**The Bug Pattern:** + +```python +# Line 751: Set flag based on parent session +use_parent_session = session is not None # False when session=None + +# Inner function at lines 753-811 +async def _store_in_db(db_session): + # ... database operations ... + db_session.add(db_model) + + # Line 807-808: Commit if NOT using parent session + if not use_parent_session: + await db_session.commit() # ❌ FIRST COMMIT + +# Lines 813-825: Outer try block +try: + if use_parent_session: + await _store_in_db(session) # ✅ No commit (parent handles it) + else: + # Line 819-825: Create new session + async with self.database_manager.get_session() as new_session: + await _store_in_db(new_session) # ❌ Commits at line 808 + await new_session.commit() # ❌ SECOND COMMIT - ERROR! +``` + +**Why This Fails:** +1. SQLAlchemy's async session starts commit with `_prepare_impl()` +2. First commit (line 808) begins this process +3. Second commit (line 824) tries to commit while prepare is in progress +4. Session state machine rejects this: transition to CLOSED would be unexpected +5. Transaction marked as failed, all subsequent operations rejected + +**Historical Context:** + +Recent commit history shows ongoing session management issues: +``` +673108e - Fix deadlock issues in training 2 +74215d3 - Fix deadlock issues in training +fd0a96e - Fix remaining nested session issues in training pipeline +b2de56e - Fix additional nested session deadlock in data_processor.py +e585e9f - Fix critical nested session deadlock in training_service.py +``` + +This double-commit bug likely introduced during these "fixes" as a regression. + +--- + +### 6. Recommended Solution + +**Option 1: Remove Inner Commit (Recommended)** + +Remove the commit logic from `_store_in_db()` entirely: + +```python +# DELETE lines 805-811 +# if not use_parent_session: +# await db_session.commit() +# logger.info(f"Model {model_id} stored in database successfully") +# else: +# logger.debug(f"Added model {model_id} to parent session (commit deferred)") +``` + +All commits should be handled by the outer scope (either parent session or new session block). + +**Option 2: Pass Commit Control Flag** + +```python +async def _store_in_db(db_session, should_commit=False): + """Inner function to store model in database""" + # ... existing code ... + db_session.add(db_model) + + if should_commit: + await db_session.commit() + logger.info(f"Model {model_id} stored successfully") + +# Then call it: +if use_parent_session: + await _store_in_db(session, should_commit=False) # Parent commits +else: + async with self.database_manager.get_session() as new_session: + await _store_in_db(new_session, should_commit=False) # We commit here + await new_session.commit() +``` + +**Recommendation:** Option 1 is simpler and follows the "single responsibility principle" - the inner function should only perform the database operations, not manage transactions. + +--- + +### 7. Testing Recommendations + +After fix implementation: + +1. **Unit Tests:** + - Test `_store_model()` with `session=None` (new session path) + - Test `_store_model()` with provided session (parent session path) + - Verify only one commit occurs in each path + +2. **Integration Tests:** + - Run parallel training with 5 products + - Verify all database records created + - Verify no transaction conflicts + - Check training logs updated correctly + +3. **Monitoring:** + - Watch for `InFailedSQLTransactionError` in logs + - Monitor training success rate + - Verify model records appear in `trained_models` table + +--- + +## Action Items + +- [ ] Fix double commit bug in `prophet_manager.py:805-811` +- [ ] Test fix with parallel training +- [ ] Add unit tests for session management +- [ ] Review all recent session management changes for similar patterns +- [ ] Fix Spanish holidays loading warning (optional) +- [ ] Document session management patterns for future development + +--- + +## Additional Notes + +**Good News:** +- ✅ All models trained successfully with good accuracy +- ✅ Model files saved to disk and can be recovered +- ✅ Parallel training infrastructure works correctly +- ✅ Prophet optimization producing quality models (MAPE 11.96%-14.16%) + +**Bad News:** +- ❌ Database integration broken due to double commit +- ❌ Models cannot be used without DB records +- ❌ Training pipeline reports as failed despite successful training +- ❌ Recent "fixes" introduced regression + +**Impact Assessment:** +- **Severity:** HIGH - Training pipeline completely broken +- **User Impact:** Cannot train new models +- **Data Loss:** None - models saved to disk +- **Recovery:** Fix code + re-run training OR manually insert DB records + +--- + +**Analysis Date:** 2025-11-05 +**Analyzed By:** Claude Code Agent +**Log Source:** Training service logs 17:39:45 - 17:39:47