Merge pull request #11 from ualsweb/claude/check-training-logs-011CUqDP9mFd2L9mLtL4HiPa
Claude/check training logs 011 c uq dp9m fd2 l9m lt l4 hi pa
This commit is contained in:
283
TRAINING_LOG_ANALYSIS.md
Normal file
283
TRAINING_LOG_ANALYSIS.md
Normal file
@@ -0,0 +1,283 @@
|
|||||||
|
# Training Log Analysis - 2025-11-05
|
||||||
|
|
||||||
|
## Executive Summary
|
||||||
|
|
||||||
|
**Status:** Training FAILED - 5/5 products failed to store in database despite successful model training
|
||||||
|
|
||||||
|
**Root Cause:** Double commit bug in `prophet_manager.py` causing database transaction conflicts
|
||||||
|
|
||||||
|
**Impact:** All models successfully trained and saved to disk (with good MAPE scores 11.96%-14.16%), but database metadata storage failed due to transaction state errors
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Detailed Findings
|
||||||
|
|
||||||
|
### 1. Warnings (Non-Critical)
|
||||||
|
|
||||||
|
#### Spanish Holidays Loading Failure
|
||||||
|
- **Count:** 5 occurrences
|
||||||
|
- **Message:** `Could not load Spanish holidays dynamically: Logger._log() got an unexpected keyword argument 'region'`
|
||||||
|
- **Impact:** Minor - Prophet models may not properly account for Spanish holidays
|
||||||
|
- **Location:** `app/ml/prophet_manager.py` - holiday configuration
|
||||||
|
- **Severity:** LOW
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 2. Critical Errors
|
||||||
|
|
||||||
|
#### A. Double Commit Transaction Error (ROOT CAUSE)
|
||||||
|
- **Location:** `services/training/app/ml/prophet_manager.py:750-824`
|
||||||
|
- **Function:** `_store_model()`
|
||||||
|
- **Error Message:**
|
||||||
|
```
|
||||||
|
Failed to store model in database: Method 'commit()' can't be called here;
|
||||||
|
method '_prepare_impl()' is already in progress and this would cause an
|
||||||
|
unexpected state change to <SessionTransactionState.CLOSED: 5>
|
||||||
|
```
|
||||||
|
- **Affected Products:** All 5 products
|
||||||
|
- **Occurrences:** 3 direct errors
|
||||||
|
|
||||||
|
**Technical Explanation:**
|
||||||
|
|
||||||
|
The bug occurs due to duplicate commit logic when no parent session is provided:
|
||||||
|
|
||||||
|
1. Line 751: `use_parent_session = session is not None` → False when session=None
|
||||||
|
2. Lines 819-825: Code creates new dedicated session and calls `_store_in_db(new_session)`
|
||||||
|
3. Line 807-808: Inside `_store_in_db()`, commits because `use_parent_session=False`
|
||||||
|
4. Line 824: Tries to commit again on same session → CONFLICT!
|
||||||
|
|
||||||
|
**Code Flow:**
|
||||||
|
```python
|
||||||
|
# Line 751
|
||||||
|
use_parent_session = session is not None # False
|
||||||
|
|
||||||
|
# Lines 819-825 (when no parent session)
|
||||||
|
else:
|
||||||
|
async with self.database_manager.get_session() as new_session:
|
||||||
|
await _store_in_db(new_session) # Commits inside (line 808)
|
||||||
|
await new_session.commit() # ❌ Double commit!
|
||||||
|
```
|
||||||
|
|
||||||
|
#### B. Cascading Transaction Failures
|
||||||
|
- **Error:** `InFailedSQLTransactionError: current transaction is aborted, commands ignored until end of transaction block`
|
||||||
|
- **Count:** ~20 occurrences
|
||||||
|
- **Cause:** After initial double-commit error, transaction enters failed state
|
||||||
|
- **Impact:** All subsequent database operations fail
|
||||||
|
|
||||||
|
**Affected Operations:**
|
||||||
|
- Getting active models
|
||||||
|
- Creating trained model records
|
||||||
|
- Updating training logs
|
||||||
|
- Retrieving database statistics
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 3. Training Results
|
||||||
|
|
||||||
|
| Product ID | Model Trained | File Saved | MAPE | DB Record | Final Status |
|
||||||
|
|------------|---------------|------------|------|-----------|--------------|
|
||||||
|
| `5a2155a0-6ccb-4d29-a6db-d25f8e4cc806` | ✅ Yes | ✅ Yes | 13.75% | ❌ No | PARTIAL |
|
||||||
|
| `dc4a9ca0-a0e0-436c-94ef-d76c9551ab8c` | ✅ Yes | ✅ Yes | 13.04% | ❌ No | PARTIAL |
|
||||||
|
| `bcb395bd-b8a4-40de-ba1d-549de48e1688` | ✅ Yes | ✅ Yes | 11.96% | ❌ No | PARTIAL |
|
||||||
|
| `ba7aa8bf-e4be-4bbc-a9df-6929c7a8eaa9` | ✅ Yes | ✅ Yes | 14.16% | ❌ No | PARTIAL |
|
||||||
|
| `3adadeb8-9d42-4565-afd4-37e0367280ba` | ✅ Yes | ✅ Yes | 12.77% | ❌ No | PARTIAL |
|
||||||
|
|
||||||
|
**Summary:**
|
||||||
|
- ✅ **5/5 models successfully trained** with good accuracy (MAPE 11.96% - 14.16%)
|
||||||
|
- ✅ **5/5 model files saved to disk** at `/app/models/{tenant_id}/{model_id}.pkl`
|
||||||
|
- ❌ **0/5 database records created** - all failed due to transaction errors
|
||||||
|
- ❌ **Final status: FAILED** - models cannot be used without DB records
|
||||||
|
|
||||||
|
**Model Files Created:**
|
||||||
|
```
|
||||||
|
/app/models/40c4c65b-f2e7-4d4a-a930-97ac60844fb5/a1cdc2c3-2b11-4e8d-a9a9-008c8bad2369.pkl ✅
|
||||||
|
/app/models/40c4c65b-f2e7-4d4a-a930-97ac60844fb5/b02bec7b-736b-4cc8-85d9-9caa48479d6d.pkl ✅
|
||||||
|
/app/models/40c4c65b-f2e7-4d4a-a930-97ac60844fb5/b5865fca-4901-472c-8803-b4031a250661.pkl ✅
|
||||||
|
/app/models/40c4c65b-f2e7-4d4a-a930-97ac60844fb5/a9f94bcd-9266-4d24-987b-2aca0d0448a5.pkl ✅
|
||||||
|
/app/models/40c4c65b-f2e7-4d4a-a930-97ac60844fb5/dc8402d0-a053-4b1b-a9c6-fc77480b5170.pkl ✅
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 4. Timeline of Events
|
||||||
|
|
||||||
|
```
|
||||||
|
17:39:45 - Data preparation completed for 5 products
|
||||||
|
17:39:45 - Product categorization: 3 pastries, 2 bread
|
||||||
|
17:39:45 - Parallel training started (3 concurrent operations)
|
||||||
|
|
||||||
|
17:39:45-46 - BATCH 1: Training 3 products in parallel
|
||||||
|
✅ Prophet models fitted successfully
|
||||||
|
✅ Models saved to disk with checksums
|
||||||
|
✅ Previous models deactivated
|
||||||
|
❌ First double-commit error at 17:39:46
|
||||||
|
❌ Transaction enters failed state
|
||||||
|
❌ All subsequent DB operations fail
|
||||||
|
|
||||||
|
17:39:46-47 - BATCH 2: Training remaining 2 products
|
||||||
|
✅ Prophet models fitted successfully
|
||||||
|
✅ Models saved to disk with checksums
|
||||||
|
❌ Cannot write to DB (transaction still failed)
|
||||||
|
|
||||||
|
17:39:47 - Training completion attempt
|
||||||
|
❌ Cannot update training logs
|
||||||
|
❌ Overall status reported as FAILED
|
||||||
|
✅ Event published to RabbitMQ (training.failed)
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 5. Root Cause Analysis
|
||||||
|
|
||||||
|
**File:** `services/training/app/ml/prophet_manager.py`
|
||||||
|
**Method:** `_store_model()` (lines 697-832)
|
||||||
|
**Issue:** Double commit logic error
|
||||||
|
|
||||||
|
**The Bug Pattern:**
|
||||||
|
|
||||||
|
```python
|
||||||
|
# Line 751: Set flag based on parent session
|
||||||
|
use_parent_session = session is not None # False when session=None
|
||||||
|
|
||||||
|
# Inner function at lines 753-811
|
||||||
|
async def _store_in_db(db_session):
|
||||||
|
# ... database operations ...
|
||||||
|
db_session.add(db_model)
|
||||||
|
|
||||||
|
# Line 807-808: Commit if NOT using parent session
|
||||||
|
if not use_parent_session:
|
||||||
|
await db_session.commit() # ❌ FIRST COMMIT
|
||||||
|
|
||||||
|
# Lines 813-825: Outer try block
|
||||||
|
try:
|
||||||
|
if use_parent_session:
|
||||||
|
await _store_in_db(session) # ✅ No commit (parent handles it)
|
||||||
|
else:
|
||||||
|
# Line 819-825: Create new session
|
||||||
|
async with self.database_manager.get_session() as new_session:
|
||||||
|
await _store_in_db(new_session) # ❌ Commits at line 808
|
||||||
|
await new_session.commit() # ❌ SECOND COMMIT - ERROR!
|
||||||
|
```
|
||||||
|
|
||||||
|
**Why This Fails:**
|
||||||
|
1. SQLAlchemy's async session starts commit with `_prepare_impl()`
|
||||||
|
2. First commit (line 808) begins this process
|
||||||
|
3. Second commit (line 824) tries to commit while prepare is in progress
|
||||||
|
4. Session state machine rejects this: transition to CLOSED would be unexpected
|
||||||
|
5. Transaction marked as failed, all subsequent operations rejected
|
||||||
|
|
||||||
|
**Historical Context:**
|
||||||
|
|
||||||
|
Recent commit history shows ongoing session management issues:
|
||||||
|
```
|
||||||
|
673108e - Fix deadlock issues in training 2
|
||||||
|
74215d3 - Fix deadlock issues in training
|
||||||
|
fd0a96e - Fix remaining nested session issues in training pipeline
|
||||||
|
b2de56e - Fix additional nested session deadlock in data_processor.py
|
||||||
|
e585e9f - Fix critical nested session deadlock in training_service.py
|
||||||
|
```
|
||||||
|
|
||||||
|
This double-commit bug likely introduced during these "fixes" as a regression.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 6. Recommended Solution
|
||||||
|
|
||||||
|
**Option 1: Remove Inner Commit (Recommended)**
|
||||||
|
|
||||||
|
Remove the commit logic from `_store_in_db()` entirely:
|
||||||
|
|
||||||
|
```python
|
||||||
|
# DELETE lines 805-811
|
||||||
|
# if not use_parent_session:
|
||||||
|
# await db_session.commit()
|
||||||
|
# logger.info(f"Model {model_id} stored in database successfully")
|
||||||
|
# else:
|
||||||
|
# logger.debug(f"Added model {model_id} to parent session (commit deferred)")
|
||||||
|
```
|
||||||
|
|
||||||
|
All commits should be handled by the outer scope (either parent session or new session block).
|
||||||
|
|
||||||
|
**Option 2: Pass Commit Control Flag**
|
||||||
|
|
||||||
|
```python
|
||||||
|
async def _store_in_db(db_session, should_commit=False):
|
||||||
|
"""Inner function to store model in database"""
|
||||||
|
# ... existing code ...
|
||||||
|
db_session.add(db_model)
|
||||||
|
|
||||||
|
if should_commit:
|
||||||
|
await db_session.commit()
|
||||||
|
logger.info(f"Model {model_id} stored successfully")
|
||||||
|
|
||||||
|
# Then call it:
|
||||||
|
if use_parent_session:
|
||||||
|
await _store_in_db(session, should_commit=False) # Parent commits
|
||||||
|
else:
|
||||||
|
async with self.database_manager.get_session() as new_session:
|
||||||
|
await _store_in_db(new_session, should_commit=False) # We commit here
|
||||||
|
await new_session.commit()
|
||||||
|
```
|
||||||
|
|
||||||
|
**Recommendation:** Option 1 is simpler and follows the "single responsibility principle" - the inner function should only perform the database operations, not manage transactions.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 7. Testing Recommendations
|
||||||
|
|
||||||
|
After fix implementation:
|
||||||
|
|
||||||
|
1. **Unit Tests:**
|
||||||
|
- Test `_store_model()` with `session=None` (new session path)
|
||||||
|
- Test `_store_model()` with provided session (parent session path)
|
||||||
|
- Verify only one commit occurs in each path
|
||||||
|
|
||||||
|
2. **Integration Tests:**
|
||||||
|
- Run parallel training with 5 products
|
||||||
|
- Verify all database records created
|
||||||
|
- Verify no transaction conflicts
|
||||||
|
- Check training logs updated correctly
|
||||||
|
|
||||||
|
3. **Monitoring:**
|
||||||
|
- Watch for `InFailedSQLTransactionError` in logs
|
||||||
|
- Monitor training success rate
|
||||||
|
- Verify model records appear in `trained_models` table
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Action Items
|
||||||
|
|
||||||
|
- [ ] Fix double commit bug in `prophet_manager.py:805-811`
|
||||||
|
- [ ] Test fix with parallel training
|
||||||
|
- [ ] Add unit tests for session management
|
||||||
|
- [ ] Review all recent session management changes for similar patterns
|
||||||
|
- [ ] Fix Spanish holidays loading warning (optional)
|
||||||
|
- [ ] Document session management patterns for future development
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Additional Notes
|
||||||
|
|
||||||
|
**Good News:**
|
||||||
|
- ✅ All models trained successfully with good accuracy
|
||||||
|
- ✅ Model files saved to disk and can be recovered
|
||||||
|
- ✅ Parallel training infrastructure works correctly
|
||||||
|
- ✅ Prophet optimization producing quality models (MAPE 11.96%-14.16%)
|
||||||
|
|
||||||
|
**Bad News:**
|
||||||
|
- ❌ Database integration broken due to double commit
|
||||||
|
- ❌ Models cannot be used without DB records
|
||||||
|
- ❌ Training pipeline reports as failed despite successful training
|
||||||
|
- ❌ Recent "fixes" introduced regression
|
||||||
|
|
||||||
|
**Impact Assessment:**
|
||||||
|
- **Severity:** HIGH - Training pipeline completely broken
|
||||||
|
- **User Impact:** Cannot train new models
|
||||||
|
- **Data Loss:** None - models saved to disk
|
||||||
|
- **Recovery:** Fix code + re-run training OR manually insert DB records
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**Analysis Date:** 2025-11-05
|
||||||
|
**Analyzed By:** Claude Code Agent
|
||||||
|
**Log Source:** Training service logs 17:39:45 - 17:39:47
|
||||||
@@ -802,13 +802,10 @@ class BakeryProphetManager:
|
|||||||
|
|
||||||
db_session.add(db_model)
|
db_session.add(db_model)
|
||||||
|
|
||||||
# Only commit if using a new session, not if using parent session
|
# ✅ FIX: Don't commit here - let the outer scope handle commits
|
||||||
# Parent session commits will be handled by the calling method to prevent conflicts
|
# This prevents double commit when using a new dedicated session
|
||||||
if not use_parent_session:
|
# Parent sessions will commit when parent completes; new sessions commit in outer scope
|
||||||
await db_session.commit()
|
logger.debug(f"Added model {model_id} to session (commit handled by caller)")
|
||||||
logger.info(f"Model {model_id} stored in database successfully")
|
|
||||||
else:
|
|
||||||
logger.debug(f"Added model {model_id} to parent session (commit deferred)")
|
|
||||||
|
|
||||||
try:
|
try:
|
||||||
# ✅ FIX: Use parent session if provided, otherwise create new one
|
# ✅ FIX: Use parent session if provided, otherwise create new one
|
||||||
@@ -977,9 +974,8 @@ class BakeryProphetManager:
|
|||||||
holidays_df = holidays_df.drop_duplicates(subset=['ds', 'holiday'])
|
holidays_df = holidays_df.drop_duplicates(subset=['ds', 'holiday'])
|
||||||
holidays_df = holidays_df.sort_values('ds').reset_index(drop=True)
|
holidays_df = holidays_df.sort_values('ds').reset_index(drop=True)
|
||||||
|
|
||||||
logger.info(f"Loaded {len(holidays_df)} Spanish holidays dynamically",
|
# ✅ FIX: Don't pass keyword args to logger - use f-string instead
|
||||||
region=region or 'National',
|
logger.info(f"Loaded {len(holidays_df)} Spanish holidays dynamically (region={region or 'National'}, years={min(years)}-{max(years)})")
|
||||||
years=f"{min(years)}-{max(years)}")
|
|
||||||
|
|
||||||
return holidays_df
|
return holidays_df
|
||||||
else:
|
else:
|
||||||
|
|||||||
Reference in New Issue
Block a user