Fix multiple critical bugs in onboarding training step

This commit addresses all identified bugs and issues in the training code path:

## Critical Fixes:
- Add get_start_time() method to TrainingLogRepository and fix non-existent method call
- Remove duplicate training.started event from API endpoint (trainer publishes the accurate one)
- Add missing progress events for 80-100% range (85%, 92%, 94%) to eliminate progress "dead zone"

## High Priority Fixes:
- Fix division by zero risk in time estimation with double-check and max() safety
- Remove unreachable exception handler in training_operations.py
- Simplify WebSocket token refresh logic to only reconnect on actual user session changes

## Medium Priority Fixes:
- Fix auto-start training effect with useRef to prevent duplicate starts
- Add HTTP polling debounce delay (5s) to prevent race conditions with WebSocket
- Extract all magic numbers to centralized constants files:
  - Backend: services/training/app/core/training_constants.py
  - Frontend: frontend/src/constants/training.ts
- Standardize error logging with exc_info=True on critical errors

## Code Quality Improvements:
- All progress percentages now use named constants
- All timeouts and intervals now use named constants
- Improved code maintainability and readability
- Better separation of concerns

## Files Changed:
- Backend: training_service.py, trainer.py, training_events.py, progress_tracker.py
- Backend: training_operations.py, training_log_repository.py, training_constants.py (new)
- Frontend: training.ts (hooks), MLTrainingStep.tsx, training.ts (constants, new)

All training progress events now properly flow from 0% to 100% with no gaps.
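
The division-by-zero fix in time estimation could look roughly like the sketch below. It is an illustration only, not the code from this commit: the function name `estimate_remaining_seconds` and the constant `MIN_PROGRESS_FOR_ESTIMATE` are hypothetical, and the real logic in the training service is not shown on this page. The idea is an explicit check for zero progress followed by a `max()` floor on the divisor.

```python
from datetime import datetime, timezone
from typing import Optional

# Hypothetical constant; the real training_constants.py is not shown here.
MIN_PROGRESS_FOR_ESTIMATE = 1.0  # percent


def estimate_remaining_seconds(
    start_time: Optional[datetime],
    progress_percent: float,
    now: Optional[datetime] = None,
) -> Optional[float]:
    """Estimate the seconds left for a job, or None if no estimate is possible."""
    if start_time is None:
        return None
    now = now or datetime.now(timezone.utc)
    elapsed = (now - start_time).total_seconds()

    # First check: no progress (or a clock skew) means no meaningful estimate.
    if progress_percent <= 0 or elapsed < 0:
        return None
    if progress_percent >= 100:
        return 0.0

    # Second check: floor the divisor so tiny progress values cannot blow up.
    safe_progress = max(progress_percent, MIN_PROGRESS_FOR_ESTIMATE)
    estimated_total = elapsed * 100.0 / safe_progress
    return max(estimated_total - elapsed, 0.0)
```

Returning `None` instead of a sentinel number lets the caller decide whether to hide the estimate or show a placeholder.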
@@ -329,4 +329,17 @@ class TrainingLogRepository(TrainingBaseRepository):
                "min_duration_minutes": 0.0,
                "max_duration_minutes": 0.0,
                "completed_jobs_with_duration": 0
            }
        }

    async def get_start_time(self, job_id: str) -> Optional[datetime]:
        """Get the start time for a training job"""
        try:
            log_entry = await self.get_by_job_id(job_id)
            if log_entry and log_entry.start_time:
                return log_entry.start_time
            return None
        except Exception as e:
            logger.error("Failed to get start time",
                         job_id=job_id,
                         error=str(e))
            return None
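
Because `get_start_time()` returns `None` both when the job has no recorded start time and when the lookup fails, callers need to handle the missing case explicitly. The sketch below shows one way a caller might do that. `StubTrainingLogRepository` and `elapsed_seconds` are hypothetical names used only for illustration; the stub stands in for the real `TrainingLogRepository`, whose database access is not shown here.

```python
import asyncio
from datetime import datetime, timezone
from typing import Dict, Optional


class StubTrainingLogRepository:
    """Hypothetical in-memory stand-in for TrainingLogRepository."""

    def __init__(self, start_times: Dict[str, datetime]) -> None:
        self._start_times = start_times

    async def get_start_time(self, job_id: str) -> Optional[datetime]:
        # Mirrors the contract in the diff: a datetime, or None if unknown.
        return self._start_times.get(job_id)


async def elapsed_seconds(
    repo: StubTrainingLogRepository, job_id: str, now: datetime
) -> Optional[float]:
    """Return seconds since the job started, or None if the start is unknown."""
    start_time = await repo.get_start_time(job_id)
    if start_time is None:
        return None
    # Clamp at zero in case the clocks disagree slightly.
    return max((now - start_time).total_seconds(), 0.0)


if __name__ == "__main__":
    repo = StubTrainingLogRepository(
        {"job-1": datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)}
    )
    now = datetime(2024, 1, 1, 12, 5, tzinfo=timezone.utc)
    print(asyncio.run(elapsed_seconds(repo, "job-1", now)))
    print(asyncio.run(elapsed_seconds(repo, "missing", now)))
```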