Files
bakery-ia/TRAINING_LOG_ANALYSIS.md

284 lines
9.7 KiB
Markdown
Raw Normal View History

# Training Log Analysis - 2025-11-05
## Executive Summary
**Status:** Training FAILED - 5/5 products failed to store in database despite successful model training
**Root Cause:** Double commit bug in `prophet_manager.py` causing database transaction conflicts
**Impact:** All models successfully trained and saved to disk (with good MAPE scores 11.96%-14.16%), but database metadata storage failed due to transaction state errors
---
## Detailed Findings
### 1. Warnings (Non-Critical)
#### Spanish Holidays Loading Failure
- **Count:** 5 occurrences
- **Message:** `Could not load Spanish holidays dynamically: Logger._log() got an unexpected keyword argument 'region'`
- **Impact:** Minor - Prophet models may not properly account for Spanish holidays
- **Location:** `app/ml/prophet_manager.py` - holiday configuration
- **Severity:** LOW
---
### 2. Critical Errors
#### A. Double Commit Transaction Error (ROOT CAUSE)
- **Location:** `services/training/app/ml/prophet_manager.py:750-824`
- **Function:** `_store_model()`
- **Error Message:**
```
Failed to store model in database: Method 'commit()' can't be called here;
method '_prepare_impl()' is already in progress and this would cause an
unexpected state change to <SessionTransactionState.CLOSED: 5>
```
- **Affected Products:** All 5 products
- **Occurrences:** 3 direct errors
**Technical Explanation:**
The bug occurs due to duplicate commit logic when no parent session is provided:
1. Line 751: `use_parent_session = session is not None` → False when session=None
2. Lines 819-825: Code creates new dedicated session and calls `_store_in_db(new_session)`
3. Line 807-808: Inside `_store_in_db()`, commits because `use_parent_session=False`
4. Line 824: Tries to commit again on same session → CONFLICT!
**Code Flow:**
```python
# Line 751
use_parent_session = session is not None # False
# Lines 819-825 (when no parent session)
else:
async with self.database_manager.get_session() as new_session:
await _store_in_db(new_session) # Commits inside (line 808)
await new_session.commit() # ❌ Double commit!
```
#### B. Cascading Transaction Failures
- **Error:** `InFailedSQLTransactionError: current transaction is aborted, commands ignored until end of transaction block`
- **Count:** ~20 occurrences
- **Cause:** After initial double-commit error, transaction enters failed state
- **Impact:** All subsequent database operations fail
**Affected Operations:**
- Getting active models
- Creating trained model records
- Updating training logs
- Retrieving database statistics
---
### 3. Training Results
| Product ID | Model Trained | File Saved | MAPE | DB Record | Final Status |
|------------|---------------|------------|------|-----------|--------------|
| `5a2155a0-6ccb-4d29-a6db-d25f8e4cc806` | ✅ Yes | ✅ Yes | 13.75% | ❌ No | PARTIAL |
| `dc4a9ca0-a0e0-436c-94ef-d76c9551ab8c` | ✅ Yes | ✅ Yes | 13.04% | ❌ No | PARTIAL |
| `bcb395bd-b8a4-40de-ba1d-549de48e1688` | ✅ Yes | ✅ Yes | 11.96% | ❌ No | PARTIAL |
| `ba7aa8bf-e4be-4bbc-a9df-6929c7a8eaa9` | ✅ Yes | ✅ Yes | 14.16% | ❌ No | PARTIAL |
| `3adadeb8-9d42-4565-afd4-37e0367280ba` | ✅ Yes | ✅ Yes | 12.77% | ❌ No | PARTIAL |
**Summary:**
-**5/5 models successfully trained** with good accuracy (MAPE 11.96% - 14.16%)
-**5/5 model files saved to disk** at `/app/models/{tenant_id}/{model_id}.pkl`
-**0/5 database records created** - all failed due to transaction errors
-**Final status: FAILED** - models cannot be used without DB records
**Model Files Created:**
```
/app/models/40c4c65b-f2e7-4d4a-a930-97ac60844fb5/a1cdc2c3-2b11-4e8d-a9a9-008c8bad2369.pkl ✅
/app/models/40c4c65b-f2e7-4d4a-a930-97ac60844fb5/b02bec7b-736b-4cc8-85d9-9caa48479d6d.pkl ✅
/app/models/40c4c65b-f2e7-4d4a-a930-97ac60844fb5/b5865fca-4901-472c-8803-b4031a250661.pkl ✅
/app/models/40c4c65b-f2e7-4d4a-a930-97ac60844fb5/a9f94bcd-9266-4d24-987b-2aca0d0448a5.pkl ✅
/app/models/40c4c65b-f2e7-4d4a-a930-97ac60844fb5/dc8402d0-a053-4b1b-a9c6-fc77480b5170.pkl ✅
```
---
### 4. Timeline of Events
```
17:39:45 - Data preparation completed for 5 products
17:39:45 - Product categorization: 3 pastries, 2 bread
17:39:45 - Parallel training started (3 concurrent operations)
17:39:45-46 - BATCH 1: Training 3 products in parallel
✅ Prophet models fitted successfully
✅ Models saved to disk with checksums
✅ Previous models deactivated
❌ First double-commit error at 17:39:46
❌ Transaction enters failed state
❌ All subsequent DB operations fail
17:39:46-47 - BATCH 2: Training remaining 2 products
✅ Prophet models fitted successfully
✅ Models saved to disk with checksums
❌ Cannot write to DB (transaction still failed)
17:39:47 - Training completion attempt
❌ Cannot update training logs
❌ Overall status reported as FAILED
✅ Event published to RabbitMQ (training.failed)
```
---
### 5. Root Cause Analysis
**File:** `services/training/app/ml/prophet_manager.py`
**Method:** `_store_model()` (lines 697-832)
**Issue:** Double commit logic error
**The Bug Pattern:**
```python
# Line 751: Set flag based on parent session
use_parent_session = session is not None # False when session=None
# Inner function at lines 753-811
async def _store_in_db(db_session):
# ... database operations ...
db_session.add(db_model)
# Line 807-808: Commit if NOT using parent session
if not use_parent_session:
await db_session.commit() # ❌ FIRST COMMIT
# Lines 813-825: Outer try block
try:
if use_parent_session:
await _store_in_db(session) # ✅ No commit (parent handles it)
else:
# Line 819-825: Create new session
async with self.database_manager.get_session() as new_session:
await _store_in_db(new_session) # ❌ Commits at line 808
await new_session.commit() # ❌ SECOND COMMIT - ERROR!
```
**Why This Fails:**
1. SQLAlchemy's async session starts commit with `_prepare_impl()`
2. First commit (line 808) begins this process
3. Second commit (line 824) tries to commit while prepare is in progress
4. Session state machine rejects this: transition to CLOSED would be unexpected
5. Transaction marked as failed, all subsequent operations rejected
**Historical Context:**
Recent commit history shows ongoing session management issues:
```
673108e - Fix deadlock issues in training 2
74215d3 - Fix deadlock issues in training
fd0a96e - Fix remaining nested session issues in training pipeline
b2de56e - Fix additional nested session deadlock in data_processor.py
e585e9f - Fix critical nested session deadlock in training_service.py
```
This double-commit bug likely introduced during these "fixes" as a regression.
---
### 6. Recommended Solution
**Option 1: Remove Inner Commit (Recommended)**
Remove the commit logic from `_store_in_db()` entirely:
```python
# DELETE lines 805-811
# if not use_parent_session:
# await db_session.commit()
# logger.info(f"Model {model_id} stored in database successfully")
# else:
# logger.debug(f"Added model {model_id} to parent session (commit deferred)")
```
All commits should be handled by the outer scope (either parent session or new session block).
**Option 2: Pass Commit Control Flag**
```python
async def _store_in_db(db_session, should_commit=False):
"""Inner function to store model in database"""
# ... existing code ...
db_session.add(db_model)
if should_commit:
await db_session.commit()
logger.info(f"Model {model_id} stored successfully")
# Then call it:
if use_parent_session:
await _store_in_db(session, should_commit=False) # Parent commits
else:
async with self.database_manager.get_session() as new_session:
await _store_in_db(new_session, should_commit=False) # We commit here
await new_session.commit()
```
**Recommendation:** Option 1 is simpler and follows the "single responsibility principle" - the inner function should only perform the database operations, not manage transactions.
---
### 7. Testing Recommendations
After fix implementation:
1. **Unit Tests:**
- Test `_store_model()` with `session=None` (new session path)
- Test `_store_model()` with provided session (parent session path)
- Verify only one commit occurs in each path
2. **Integration Tests:**
- Run parallel training with 5 products
- Verify all database records created
- Verify no transaction conflicts
- Check training logs updated correctly
3. **Monitoring:**
- Watch for `InFailedSQLTransactionError` in logs
- Monitor training success rate
- Verify model records appear in `trained_models` table
---
## Action Items
- [ ] Fix double commit bug in `prophet_manager.py:805-811`
- [ ] Test fix with parallel training
- [ ] Add unit tests for session management
- [ ] Review all recent session management changes for similar patterns
- [ ] Fix Spanish holidays loading warning (optional)
- [ ] Document session management patterns for future development
---
## Additional Notes
**Good News:**
- ✅ All models trained successfully with good accuracy
- ✅ Model files saved to disk and can be recovered
- ✅ Parallel training infrastructure works correctly
- ✅ Prophet optimization producing quality models (MAPE 11.96%-14.16%)
**Bad News:**
- ❌ Database integration broken due to double commit
- ❌ Models cannot be used without DB records
- ❌ Training pipeline reports as failed despite successful training
- ❌ Recent "fixes" introduced regression
**Impact Assessment:**
- **Severity:** HIGH - Training pipeline completely broken
- **User Impact:** Cannot train new models
- **Data Loss:** None - models saved to disk
- **Recovery:** Fix code + re-run training OR manually insert DB records
---
**Analysis Date:** 2025-11-05
**Analyzed By:** Claude Code Agent
**Log Source:** Training service logs 17:39:45 - 17:39:47