bakery-ia

Author	SHA1	Message	Date
Urtzi Alfaro	fd0a96e254	Fix remaining nested session issues in training pipeline Issues Fixed: 4️⃣ data_processor.py (Line 230-232): - Second update_log_progress call without commit after data preparation - Added commit() after completion update to prevent deadlock - Added debug logging for visibility 5️⃣ prophet_manager.py _store_model (Line 750): - Created TRIPLE nested session (training_service → trainer → lock → _store_model) - Refactored _store_model to accept optional session parameter - Uses parent session from lock context instead of creating new one - Updated call site to pass db_session parameter Complete Session Hierarchy After All Fixes: training_service.py (session) └─ commit() ← FIX #2 (`e585e9f`) └─ trainer.py (new session) ✅ OK └─ data_processor.py (new session) └─ commit() after first update ← FIX #3 (`b2de56e`) └─ commit() after second update ← FIX #4 (THIS) └─ prophet_manager.train_bakery_model (uses parent or new session) ← FIX #1 (`caff497`) └─ lock.acquire(session) └─ _store_model(session=parent) ← FIX #5 (THIS) └─ NO NESTED SESSION ✅ All nested session deadlocks in training path are now resolved. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-05 16:41:53 +01:00
Urtzi Alfaro	b2de56ead3	Fix additional nested session deadlock in data_processor.py Root Cause: After fixing the training_service.py deadlock, training progressed to data preparation but got stuck there. The data_processor.py creates another nested session at line 143, updates training_log without committing, causing another deadlock scenario. Session Hierarchy: 1. training_service.py: outer session (fixed in `e585e9f`) 2. trainer.py: creates own session (passes deadlock due to commit) 3. data_processor.py: creates ANOTHER nested session (THIS FIX) Fix: Added explicit db_session.commit() after progress update in data_processor (line 153) to ensure the UPDATE is committed before continuing with data processing operations that may interact with other sessions. This completes the chain of nested session fixes: - `caff497`: prophet_manager + hybrid_trainer session passing - `e585e9f`: training_service commit before trainer call - THIS: data_processor commit after progress update 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-05 16:39:05 +01:00
Urtzi Alfaro	e585e9fac0	Fix critical nested session deadlock in training_service.py Root Cause (Actual): The actual nested session issue was in training_service.py, not just in the trainer methods. The flow was: 1. training_service.py creates outer session (line 173) 2. Updates training_log at line 235-237 (uncommitted) 3. Calls trainer.train_tenant_models() at line 239 4. Trainer creates its own session at line 93 5. DEADLOCK: Outer session has uncommitted UPDATE, inner session can't proceed Fix: Added explicit session.commit() after the ml_training progress update (line 241) to ensure the UPDATE is committed before trainer creates its own session. This prevents the deadlock condition. Related to previous commit `caff497` which fixed nested sessions in prophet_manager and hybrid_trainer, but missed the actual root cause in training_service.py. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-05 16:30:15 +01:00
Urtzi Alfaro	caff49761d	Fix training hang caused by nested database sessions and deadlocks Root Cause: The training process was hanging at the first progress update due to a nested database session issue. The main trainer created a session and repositories, then called prophet_manager.train_bakery_model() which created another nested session with an advisory lock. This caused a deadlock where: 1. Outer session had uncommitted UPDATE on model_training_logs 2. Inner session tried to acquire advisory lock 3. Neither could proceed, causing training to hang indefinitely Changes Made: 1. prophet_manager.py: - Added optional 'session' parameter to train_bakery_model() - Refactored to use parent session if provided, otherwise create new one - Prevents nested session creation during training 2. hybrid_trainer.py: - Added optional 'session' parameter to train_hybrid_model() - Passes session to prophet_manager to maintain single session context 3. trainer.py: - Updated _train_single_product() to accept and pass session - Updated _train_all_models_enhanced() to accept and pass session - Pass db_session from main training context to all training methods - Added explicit db_session.flush() after critical progress update - This ensures updates are visible before acquiring locks Impact: - Eliminates nested session deadlocks - Training now proceeds past initial progress update - Maintains single database session context throughout training - Prevents database transaction conflicts Related Issues: - Fixes training hang during onboarding process - Not directly related to audit_metadata changes but exposed by them 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-05 16:13:32 +01:00
Claude	c64585af57	Fix training hang by wrapping blocking ML operations in thread pool Root Cause: Training process was stuck at 40% because blocking synchronous ML operations (model.fit(), model.predict(), study.optimize()) were freezing the asyncio event loop, preventing RabbitMQ heartbeats, WebSocket communication, and progress updates. Changes: 1. prophet_manager.py: - Wrapped model.fit() at line 189 with asyncio.to_thread() - Wrapped study.optimize() at line 453 with asyncio.to_thread() 2. hybrid_trainer.py: - Made _train_xgboost() async and wrapped model.fit() with asyncio.to_thread() - Made _evaluate_hybrid_model() async and wrapped predict() calls - Fixed predict() method to wrap blocking predict() calls Impact: - Event loop no longer blocks during ML training - RabbitMQ heartbeats continue during training - WebSocket progress updates work correctly - Training can now complete successfully Fixes: Training hang at 40% during onboarding phase	2025-11-05 14:34:53 +00:00
Claude	136761af19	Fix AuditLogger.log_event() parameter name: metadata -> audit_metadata	2025-11-05 14:17:39 +00:00
Claude	8df90338b2	Fix training log race conditions and audit event error Critical fixes for training session logging: 1. Training log race condition fix: - Add explicit session commits after creating training logs - Handle duplicate key errors gracefully when multiple sessions try to create the same log simultaneously - Implement retry logic to query for existing logs after duplicate key violations - Prevents "Training log not found" errors during training 2. Audit event async generator error fix: - Replace incorrect next(get_db()) usage with proper async context manager (database_manager.get_session()) - Fixes "'async_generator' object is not an iterator" error - Ensures audit logging works correctly These changes address race conditions in concurrent database sessions and ensure training logs are properly synchronized across the training pipeline.	2025-11-05 13:24:22 +00:00
Claude	5a84be83d6	Fix multiple critical bugs in onboarding training step This commit addresses all identified bugs and issues in the training code path: ## Critical Fixes: - Add get_start_time() method to TrainingLogRepository and fix non-existent method call - Remove duplicate training.started event from API endpoint (trainer publishes the accurate one) - Add missing progress events for 80-100% range (85%, 92%, 94%) to eliminate progress "dead zone" ## High Priority Fixes: - Fix division by zero risk in time estimation with double-check and max() safety - Remove unreachable exception handler in training_operations.py - Simplify WebSocket token refresh logic to only reconnect on actual user session changes ## Medium Priority Fixes: - Fix auto-start training effect with useRef to prevent duplicate starts - Add HTTP polling debounce delay (5s) to prevent race conditions with WebSocket - Extract all magic numbers to centralized constants files: - Backend: services/training/app/core/training_constants.py - Frontend: frontend/src/constants/training.ts - Standardize error logging with exc_info=True on critical errors ## Code Quality Improvements: - All progress percentages now use named constants - All timeouts and intervals now use named constants - Improved code maintainability and readability - Better separation of concerns ## Files Changed: - Backend: training_service.py, trainer.py, training_events.py, progress_tracker.py - Backend: training_operations.py, training_log_repository.py, training_constants.py (new) - Frontend: training.ts (hooks), MLTrainingStep.tsx, training.ts (constants, new) All training progress events now properly flow from 0% to 100% with no gaps.	2025-11-05 13:02:39 +00:00
Claude	799e7dbaeb	Fix training job concurrent database session conflicts Root Cause: - Multiple parallel training tasks (3 at a time) were sharing the same database session - This caused SQLAlchemy session state conflicts: "Session is already flushing" and "rollback() is already in progress" - Additionally, duplicate model records were being created by both trainer and training_service Fixes: 1. Separated model training from database writes: - Training happens in parallel (CPU-intensive) - Database writes happen sequentially after training completes - This eliminates concurrent session access 2. Removed duplicate database writes: - Trainer now writes all model records sequentially after parallel training - Training service now retrieves models instead of creating duplicates - Performance metrics are also created by trainer (no duplicates) 3. Added proper data flow: - _train_single_product: Only trains models, stores results - _write_training_results_to_database: Sequential DB writes after training - _store_trained_models: Changed to retrieve existing models - _create_performance_metrics: Changed to verify existing metrics Benefits: - Eliminates database session conflicts - Prevents duplicate model records - Maintains parallel training performance - Ensures data consistency Files Modified: - services/training/app/ml/trainer.py - services/training/app/services/training_service.py Resolves: Onboarding training job database session conflicts	2025-11-05 12:41:42 +00:00
Urtzi Alfaro	394ad3aea4	Improve AI logic	2025-11-05 13:34:56 +01:00
Urtzi Alfaro	5adb0e39c0	Improve the frontend 5	2025-11-02 20:24:44 +01:00
Urtzi Alfaro	269d3b5032	Add user delete process	2025-10-31 11:54:19 +01:00
Urtzi Alfaro	36217a2729	Improve the frontend 2	2025-10-29 06:58:05 +01:00
Urtzi Alfaro	858d985c92	Improve the frontend modals	2025-10-27 16:33:26 +01:00
Urtzi Alfaro	8d30172483	Improve the frontend	2025-10-21 19:50:07 +02:00
Urtzi Alfaro	05da20357d	Improve the security of the DB	2025-10-19 19:22:37 +02:00
Urtzi Alfaro	312e36c893	Update requirements and infra versions	2025-10-17 23:09:40 +02:00
Urtzi Alfaro	dbb48d8e2c	Improve the sales import	2025-10-15 21:09:42 +02:00
Urtzi Alfaro	8f9e9a7edc	Add role-based filtering and improve code	2025-10-15 16:12:49 +02:00
Urtzi Alfaro	96ad5c6692	Refactor datetime and timezone utils	2025-10-12 23:16:04 +02:00
Urtzi Alfaro	7556a00db7	Improve the demo feature of the project	2025-10-12 18:47:33 +02:00
Urtzi Alfaro	dbc7f2fa0d	Re-create migrations init tables	2025-10-09 20:47:31 +02:00
Urtzi Alfaro	3c689b4f98	REFACTOR external service and improve websocket training	2025-10-09 14:11:02 +02:00
Urtzi Alfaro	7c72f83c51	REFACTOR ALL APIs fix 1	2025-10-07 07:15:07 +02:00
Urtzi Alfaro	38fb98bc27	REFACTOR ALL APIs	2025-10-06 15:27:01 +02:00
Urtzi Alfaro	0fdc3b0211	Fix issues	2025-10-01 16:25:53 +02:00
Urtzi Alfaro	2eeebfc1e0	Fix Alembic issue	2025-10-01 11:24:06 +02:00
Urtzi Alfaro	7cc4b957a5	Fix DB issue 2s	2025-09-30 21:58:10 +02:00
Urtzi Alfaro	147893015e	Fix DB issues	2025-09-30 13:32:51 +02:00
Urtzi Alfaro	ec6bcb4c7d	Add migration services	2025-09-30 08:12:45 +02:00
Urtzi Alfaro	2712a60a2a	Refactor services alembic	2025-09-29 19:16:34 +02:00
Urtzi Alfaro	befcc126b0	Refactor all main.py	2025-09-29 13:13:12 +02:00
Urtzi Alfaro	4777e59e7a	Add base kubernetes support final fix 4	2025-09-29 07:54:25 +02:00
Urtzi Alfaro	57f77638cc	Add base kubernetes support final fix 2	2025-09-28 19:48:05 +02:00
Urtzi Alfaro	63a3f9c77a	Add base kubernetes support	2025-09-27 11:18:13 +02:00
Urtzi Alfaro	a8f6e9d593	Simplify the onboarding flow components 3	2025-09-08 21:52:56 +02:00
Urtzi Alfaro	0faaa25e58	Start integrating the onboarding flow with backend 3	2025-09-04 23:19:53 +02:00
Urtzi Alfaro	4b4268d640	Add new alert architecture	2025-08-23 10:19:58 +02:00
Urtzi Alfaro	f33f5d242a	Fix issues 4	2025-08-17 15:21:10 +02:00
Urtzi Alfaro	cafd316c4b	Fix issues 3	2025-08-17 13:35:05 +02:00
Urtzi Alfaro	d21094a940	Fix issues 2	2025-08-17 11:12:17 +02:00
Urtzi Alfaro	109961ef6e	Fix issues	2025-08-17 10:28:58 +02:00
Urtzi Alfaro	8914786973	New Frontend	2025-08-16 20:13:40 +02:00
Urtzi Alfaro	f7de9115d1	Fix new services implementation 5	2025-08-15 17:53:59 +02:00
Urtzi Alfaro	03737430ee	Fix new services implementation 3	2025-08-14 16:47:34 +02:00
Urtzi Alfaro	fbe7470ad9	REFACTOR data service	2025-08-12 18:17:30 +02:00
Urtzi Alfaro	8d125ab0d5	Refactor the traffic fetching system	2025-08-10 18:32:47 +02:00
Urtzi Alfaro	3c2acc934a	Improve the traffic fetching system	2025-08-10 17:31:38 +02:00
Urtzi Alfaro	312fdc8ef3	Improve the traffic fetching system	2025-08-08 23:29:48 +02:00
Urtzi Alfaro	8af17f1433	Improve the design of the frontend 2	2025-08-08 23:06:54 +02:00

122 Commits
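
Commit `c64585af57` in the list above attributes the training hang at 40% to synchronous ML calls freezing the asyncio event loop, and fixes it by wrapping those calls in asyncio.to_thread(). The sketch below illustrates that general pattern only; it is not the repository's code. The names blocking_fit, heartbeat, and train are hypothetical stand-ins, and time.sleep() takes the place of a real model.fit() call.

```python
import asyncio
import time


def blocking_fit(seconds: float) -> str:
    """Hypothetical stand-in for a synchronous call such as model.fit()."""
    time.sleep(seconds)  # blocks whichever thread runs it
    return "fitted"


async def heartbeat(ticks: list, interval: float, stop: asyncio.Event) -> None:
    """Hypothetical stand-in for periodic work (heartbeats, progress updates)."""
    while not stop.is_set():
        ticks.append(time.monotonic())
        await asyncio.sleep(interval)


async def train(offload: bool) -> int:
    """Run the blocking call and return how many heartbeats fired meanwhile."""
    ticks: list = []
    stop = asyncio.Event()
    hb_task = asyncio.create_task(heartbeat(ticks, 0.05, stop))
    await asyncio.sleep(0)  # give the heartbeat task its first turn

    if offload:
        # Runs in a worker thread; the event loop stays free for heartbeats.
        result = await asyncio.to_thread(blocking_fit, 0.3)
    else:
        # Runs on the event loop thread; no other task can make progress.
        result = blocking_fit(0.3)

    stop.set()
    await hb_task
    assert result == "fitted"
    return len(ticks)


if __name__ == "__main__":
    print("heartbeats while blocking the loop:", asyncio.run(train(offload=False)))
    print("heartbeats with asyncio.to_thread:", asyncio.run(train(offload=True)))
```

In the direct call the heartbeat task gets only its initial turn, while the offloaded call lets it keep firing for the duration of the fit. asyncio.to_thread() requires Python 3.9 or later.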