Fix multiple critical bugs in onboarding training step

This commit addresses all identified bugs and issues in the training code path:

## Critical Fixes:
- Add get_start_time() method to TrainingLogRepository and fix non-existent method call
- Remove duplicate training.started event from API endpoint (trainer publishes the accurate one)
- Add missing progress events for 80-100% range (85%, 92%, 94%) to eliminate progress "dead zone"

## High Priority Fixes:
- Fix division by zero risk in time estimation with double-check and max() safety
- Remove unreachable exception handler in training_operations.py
- Simplify WebSocket token refresh logic to only reconnect on actual user session changes

## Medium Priority Fixes:
- Fix auto-start training effect with useRef to prevent duplicate starts
- Add HTTP polling debounce delay (5s) to prevent race conditions with WebSocket
- Extract all magic numbers to centralized constants files:
  - Backend: services/training/app/core/training_constants.py
  - Frontend: frontend/src/constants/training.ts
- Standardize error logging with exc_info=True on critical errors

## Code Quality Improvements:
- All progress percentages now use named constants
- All timeouts and intervals now use named constants
- Improved code maintainability and readability
- Better separation of concerns

## Files Changed:
- Backend: training_service.py, trainer.py, training_events.py, progress_tracker.py
- Backend: training_operations.py, training_log_repository.py, training_constants.py (new)
- Frontend: training.ts (hooks), MLTrainingStep.tsx, training.ts (constants, new)

All training progress events now properly flow from 0% to 100% with no gaps.
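
For reviewers, the "double-check and max() safety" guard mentioned for the time estimation could look roughly like the sketch below. The progress tracker code is not part of this excerpt, so the function name `estimate_remaining_seconds` and the constant `MIN_PROGRESS_FOR_ESTIMATE` are hypothetical and only illustrate the pattern, not the actual implementation.

```python
from typing import Optional

# Hypothetical constant; the real values live in training_constants.py,
# which is not shown in this excerpt.
MIN_PROGRESS_FOR_ESTIMATE = 1.0


def estimate_remaining_seconds(
    elapsed_seconds: float, progress_percent: float
) -> Optional[float]:
    """Estimate remaining time from elapsed time and progress (0-100)."""
    # First check: no meaningful estimate before any progress is reported.
    if progress_percent <= 0:
        return None
    # Second guard: clamp the divisor so it can never be zero or near zero.
    safe_progress = max(progress_percent, MIN_PROGRESS_FOR_ESTIMATE)
    total_estimated = elapsed_seconds * 100.0 / safe_progress
    # Never report a negative remaining time.
    return max(total_estimated - elapsed_seconds, 0.0)
```

The early return covers the zero-progress case explicitly, and the `max()` clamp keeps the division safe even if a later refactor drops that check.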
2025-11-05 13:02:39 +00:00
parent e3ea92640b
commit 5a84be83d6
10 changed files with 291 additions and 106 deletions
--- a/services/training/app/api/training_operations.py
+++ b/services/training/app/api/training_operations.py
@@ -189,15 +189,8 @@ async def start_training_job(
        # Calculate estimated completion time
        estimated_completion_time = calculate_estimated_completion_time(estimated_duration_minutes)

-        # Publish training.started event immediately so WebSocket clients
-        # have initial state when they connect
-        await publish_training_started(
-            job_id=job_id,
-            tenant_id=tenant_id,
-            total_products=0,  # Will be updated when actual training starts
-            estimated_duration_minutes=estimated_duration_minutes,
-            estimated_completion_time=estimated_completion_time.isoformat()
-        )
+        # Note: training.started event will be published by the trainer with accurate product count
+        # We don't publish here to avoid duplicate events

        # Add enhanced background task
        background_tasks.add_task(
@@ -401,11 +394,6 @@ async def execute_training_job_background(
        # Failure event is published by the training service
        await publish_training_failed(job_id, tenant_id, str(training_error))

-    except Exception as background_error:
-        logger.error("Critical error in enhanced background training job",
-                    job_id=job_id,
-                    error=str(background_error))
-
    finally:
        logger.info("Enhanced background training job cleanup completed",
                   job_id=job_id)