Fix multiple critical bugs in onboarding training step

This commit addresses all identified bugs and issues in the training code path:

## Critical Fixes:
- Add get_start_time() method to TrainingLogRepository and fix non-existent method call
- Remove duplicate training.started event from API endpoint (trainer publishes the accurate one)
- Add missing progress events for 80-100% range (85%, 92%, 94%) to eliminate progress "dead zone"

## High Priority Fixes:
- Fix division by zero risk in time estimation with double-check and max() safety
- Remove unreachable exception handler in training_operations.py
- Simplify WebSocket token refresh logic to only reconnect on actual user session changes

## Medium Priority Fixes:
- Fix auto-start training effect with useRef to prevent duplicate starts
- Add HTTP polling debounce delay (5s) to prevent race conditions with WebSocket
- Extract all magic numbers to centralized constants files:
  - Backend: services/training/app/core/training_constants.py
  - Frontend: frontend/src/constants/training.ts
- Standardize error logging with exc_info=True on critical errors

## Code Quality Improvements:
- All progress percentages now use named constants
- All timeouts and intervals now use named constants
- Improved code maintainability and readability
- Better separation of concerns

## Files Changed:
- Backend: training_service.py, trainer.py, training_events.py, progress_tracker.py
- Backend: training_operations.py, training_log_repository.py, training_constants.py (new)
- Frontend: training.ts (hooks), MLTrainingStep.tsx, training.ts (constants, new)

All training progress events now properly flow from 0% to 100% with no gaps.
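
## Illustrative sketch (not part of the diff):
The elapsed-time and division-by-zero fixes described above can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the code from this commit: the helper names `elapsed_seconds` and `estimate_remaining_seconds`, the constant `MIN_PROGRESS_PERCENT`, and the linear extrapolation are all hypothetical and chosen only to show the None check on the start time and the max() guard on the divisor.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical floor for the divisor; the real constants live in
# training_constants.py, which is not shown in this excerpt.
MIN_PROGRESS_PERCENT = 1


def elapsed_seconds(start_time: Optional[datetime],
                    now: Optional[datetime] = None) -> int:
    """Return whole seconds since start_time, or 0 if it is unknown."""
    if start_time is None:
        return 0
    now = now or datetime.now(timezone.utc)
    # Clamp to 0 in case of clock skew between writer and reader.
    return max(0, int((now - start_time).total_seconds()))


def estimate_remaining_seconds(elapsed: int, progress_percent: int) -> int:
    """Linearly extrapolate remaining time (an assumed estimation model)."""
    # max() keeps the divisor at 1 or above, so 0% progress cannot divide by zero.
    safe_percent = min(max(progress_percent, MIN_PROGRESS_PERCENT), 100)
    estimated_total = elapsed * 100 // safe_percent
    return max(0, estimated_total - elapsed)


start = datetime(2025, 11, 5, 13, 0, 0, tzinfo=timezone.utc)
now = start + timedelta(seconds=60)
remaining = estimate_remaining_seconds(elapsed_seconds(start, now), 20)
```

Integer percentages keep the arithmetic exact; a job with no recorded start time simply reports zero elapsed seconds instead of raising.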
2025-11-05 13:02:39 +00:00
parent e3ea92640b
commit 5a84be83d6
10 changed files with 291 additions and 106 deletions
--- a/services/training/app/ml/trainer.py
+++ b/services/training/app/ml/trainer.py
@@ -6,7 +6,7 @@ Main ML pipeline coordinator using repository pattern for data access and depend
 from typing import Dict, List, Any, Optional
 import pandas as pd
 import numpy as np
-from datetime import datetime
+from datetime import datetime, timezone
 import structlog
 import uuid
 import time
@@ -187,7 +187,10 @@ class EnhancedBakeryMLTrainer:

                # Event 2: Data Analysis (20%)
                # Recalculate time remaining based on elapsed time
-                elapsed_seconds = (datetime.now(timezone.utc) - repos['training_log']._get_start_time(job_id) if hasattr(repos['training_log'], '_get_start_time') else 0) or 0
+                start_time = await repos['training_log'].get_start_time(job_id)
+                elapsed_seconds = 0
+                if start_time:
+                    elapsed_seconds = int((datetime.now(timezone.utc) - start_time).total_seconds())

                # Estimate remaining time: we've done ~20% of work (data analysis)
                # Remaining 80% includes training all products
@@ -285,7 +288,8 @@ class EnhancedBakeryMLTrainer:
        except Exception as e:
            logger.error("Enhanced ML training pipeline failed",
                        job_id=job_id,
-                        error=str(e))
+                        error=str(e),
+                        exc_info=True)

            # Publish training failed event
            await publish_training_failed(job_id, tenant_id, str(e))
@@ -397,7 +401,8 @@ class EnhancedBakeryMLTrainer:
            logger.error("Single product model training failed",
                        job_id=job_id,
                        inventory_product_id=inventory_product_id,
-                        error=str(e))
+                        error=str(e),
+                        exc_info=True)
            raise
    
    def _serialize_scalers(self, scalers: Dict[str, Any]) -> Dict[str, Any]: