bakery-ia

Author	SHA1	Message	Date
Urtzi Alfaro	b2de56ead3	Fix additional nested session deadlock in data_processor.py Root Cause: After fixing the training_service.py deadlock, training progressed to data preparation but got stuck there. The data_processor.py creates another nested session at line 143, updates training_log without committing, causing another deadlock scenario. Session Hierarchy: 1. training_service.py: outer session (fixed in `e585e9f`) 2. trainer.py: creates own session (passes deadlock due to commit) 3. data_processor.py: creates ANOTHER nested session (THIS FIX) Fix: Added explicit db_session.commit() after progress update in data_processor (line 153) to ensure the UPDATE is committed before continuing with data processing operations that may interact with other sessions. This completes the chain of nested session fixes: - `caff497`: prophet_manager + hybrid_trainer session passing - `e585e9f`: training_service commit before trainer call - THIS: data_processor commit after progress update 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-05 16:39:05 +01:00
Urtzi Alfaro	caff49761d	Fix training hang caused by nested database sessions and deadlocks Root Cause: The training process was hanging at the first progress update due to a nested database session issue. The main trainer created a session and repositories, then called prophet_manager.train_bakery_model() which created another nested session with an advisory lock. This caused a deadlock where: 1. Outer session had uncommitted UPDATE on model_training_logs 2. Inner session tried to acquire advisory lock 3. Neither could proceed, causing training to hang indefinitely Changes Made: 1. prophet_manager.py: - Added optional 'session' parameter to train_bakery_model() - Refactored to use parent session if provided, otherwise create new one - Prevents nested session creation during training 2. hybrid_trainer.py: - Added optional 'session' parameter to train_hybrid_model() - Passes session to prophet_manager to maintain single session context 3. trainer.py: - Updated _train_single_product() to accept and pass session - Updated _train_all_models_enhanced() to accept and pass session - Pass db_session from main training context to all training methods - Added explicit db_session.flush() after critical progress update - This ensures updates are visible before acquiring locks Impact: - Eliminates nested session deadlocks - Training now proceeds past initial progress update - Maintains single database session context throughout training - Prevents database transaction conflicts Related Issues: - Fixes training hang during onboarding process - Not directly related to audit_metadata changes but exposed by them 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-05 16:13:32 +01:00
Claude	c64585af57	Fix training hang by wrapping blocking ML operations in thread pool Root Cause: Training process was stuck at 40% because blocking synchronous ML operations (model.fit(), model.predict(), study.optimize()) were freezing the asyncio event loop, preventing RabbitMQ heartbeats, WebSocket communication, and progress updates. Changes: 1. prophet_manager.py: - Wrapped model.fit() at line 189 with asyncio.to_thread() - Wrapped study.optimize() at line 453 with asyncio.to_thread() 2. hybrid_trainer.py: - Made _train_xgboost() async and wrapped model.fit() with asyncio.to_thread() - Made _evaluate_hybrid_model() async and wrapped predict() calls - Fixed predict() method to wrap blocking predict() calls Impact: - Event loop no longer blocks during ML training - RabbitMQ heartbeats continue during training - WebSocket progress updates work correctly - Training can now complete successfully Fixes: Training hang at 40% during onboarding phase	2025-11-05 14:34:53 +00:00
Claude	5a84be83d6	Fix multiple critical bugs in onboarding training step This commit addresses all identified bugs and issues in the training code path: ## Critical Fixes: - Add get_start_time() method to TrainingLogRepository and fix non-existent method call - Remove duplicate training.started event from API endpoint (trainer publishes the accurate one) - Add missing progress events for 80-100% range (85%, 92%, 94%) to eliminate progress "dead zone" ## High Priority Fixes: - Fix division by zero risk in time estimation with double-check and max() safety - Remove unreachable exception handler in training_operations.py - Simplify WebSocket token refresh logic to only reconnect on actual user session changes ## Medium Priority Fixes: - Fix auto-start training effect with useRef to prevent duplicate starts - Add HTTP polling debounce delay (5s) to prevent race conditions with WebSocket - Extract all magic numbers to centralized constants files: - Backend: services/training/app/core/training_constants.py - Frontend: frontend/src/constants/training.ts - Standardize error logging with exc_info=True on critical errors ## Code Quality Improvements: - All progress percentages now use named constants - All timeouts and intervals now use named constants - Improved code maintainability and readability - Better separation of concerns ## Files Changed: - Backend: training_service.py, trainer.py, training_events.py, progress_tracker.py - Backend: training_operations.py, training_log_repository.py, training_constants.py (new) - Frontend: training.ts (hooks), MLTrainingStep.tsx, training.ts (constants, new) All training progress events now properly flow from 0% to 100% with no gaps.	2025-11-05 13:02:39 +00:00
Claude	799e7dbaeb	Fix training job concurrent database session conflicts Root Cause: - Multiple parallel training tasks (3 at a time) were sharing the same database session - This caused SQLAlchemy session state conflicts: "Session is already flushing" and "rollback() is already in progress" - Additionally, duplicate model records were being created by both trainer and training_service Fixes: 1. Separated model training from database writes: - Training happens in parallel (CPU-intensive) - Database writes happen sequentially after training completes - This eliminates concurrent session access 2. Removed duplicate database writes: - Trainer now writes all model records sequentially after parallel training - Training service now retrieves models instead of creating duplicates - Performance metrics are also created by trainer (no duplicates) 3. Added proper data flow: - _train_single_product: Only trains models, stores results - _write_training_results_to_database: Sequential DB writes after training - _store_trained_models: Changed to retrieve existing models - _create_performance_metrics: Changed to verify existing metrics Benefits: - Eliminates database session conflicts - Prevents duplicate model records - Maintains parallel training performance - Ensures data consistency Files Modified: - services/training/app/ml/trainer.py - services/training/app/services/training_service.py Resolves: Onboarding training job database session conflicts	2025-11-05 12:41:42 +00:00
Urtzi Alfaro	394ad3aea4	Improve AI logic	2025-11-05 13:34:56 +01:00
Urtzi Alfaro	5adb0e39c0	Improve the frontend 5	2025-11-02 20:24:44 +01:00
Urtzi Alfaro	8d30172483	Improve the frontend	2025-10-21 19:50:07 +02:00
Urtzi Alfaro	05da20357d	Improve teh securty of teh DB	2025-10-19 19:22:37 +02:00
Urtzi Alfaro	96ad5c6692	Refactor datetime and timezone utils	2025-10-12 23:16:04 +02:00
Urtzi Alfaro	3c689b4f98	REFACTOR external service and improve websocket training	2025-10-09 14:11:02 +02:00
Urtzi Alfaro	f7de9115d1	Fix new services implementation 5	2025-08-15 17:53:59 +02:00
Urtzi Alfaro	03737430ee	Fix new services implementation 3	2025-08-14 16:47:34 +02:00
Urtzi Alfaro	fbe7470ad9	REFACTOR data service	2025-08-12 18:17:30 +02:00
Urtzi Alfaro	8d125ab0d5	Refactor the traffic fetching system	2025-08-10 18:32:47 +02:00
Urtzi Alfaro	488bb3ef93	REFACTOR - Database logic	2025-08-08 09:08:41 +02:00
Urtzi Alfaro	32a7b913d0	Fix new Frontend 15	2025-08-04 21:46:12 +02:00
Urtzi Alfaro	8bb14ecc4f	Fix new Frontend 14	2025-08-04 19:17:31 +02:00
Urtzi Alfaro	0ba543a19a	Fix new Frontend 13	2025-08-04 18:58:12 +02:00
Urtzi Alfaro	35b02ca364	Fix new Frontend 12	2025-08-04 18:21:42 +02:00
Urtzi Alfaro	e581a144be	Improve the event messaging for training service 2	2025-07-31 15:34:35 +02:00
Urtzi Alfaro	024290e4c0	Start fixing forecast service 18	2025-07-30 08:41:47 +02:00
Urtzi Alfaro	c10a745695	Start fixing forecast service 17	2025-07-30 08:29:40 +02:00
Urtzi Alfaro	e5c3375d53	Start fixing forecast service 16	2025-07-30 08:13:32 +02:00
Urtzi Alfaro	cd6fd875f7	Improve training test 3	2025-07-29 12:45:39 +02:00
Urtzi Alfaro	7cd595df81	Improve training code 2	2025-07-28 20:20:54 +02:00
Urtzi Alfaro	98f546af12	Improve training code	2025-07-28 19:28:39 +02:00
Urtzi Alfaro	e63a99b818	Checking onboardin flow - fix 4	2025-07-27 16:29:53 +02:00
Urtzi Alfaro	f3071c00bd	Add all the code for training service	2025-07-19 16:59:37 +02:00
Urtzi Alfaro	4073222888	Fix imports	2025-07-18 14:41:39 +02:00
Urtzi Alfaro	347ff51bd7	Initial microservices setup from artifacts	2025-07-17 13:09:24 +02:00

31 Commits