Commit Graph

118 Commits

Author SHA1 Message Date
Claude
c64585af57 Fix training hang by wrapping blocking ML operations in thread pool
Root Cause:
Training process was stuck at 40% because blocking synchronous ML operations
(model.fit(), model.predict(), study.optimize()) were freezing the asyncio
event loop, preventing RabbitMQ heartbeats, WebSocket communication, and
progress updates.

Changes:
1. prophet_manager.py:
   - Wrapped model.fit() at line 189 with asyncio.to_thread()
   - Wrapped study.optimize() at line 453 with asyncio.to_thread()

2. hybrid_trainer.py:
   - Made _train_xgboost() async and wrapped model.fit() with asyncio.to_thread()
   - Made _evaluate_hybrid_model() async and wrapped predict() calls
   - Fixed predict() method to wrap blocking predict() calls
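
A minimal sketch of the wrapping pattern described above, not the actual code from prophet_manager.py or hybrid_trainer.py. The names `blocking_fit`, `heartbeat`, and `train` are hypothetical stand-ins: `blocking_fit` plays the role of a synchronous call such as model.fit(), and `heartbeat` plays the role of the RabbitMQ heartbeats and progress updates that need the event loop.

```python
import asyncio
import time


def blocking_fit(seconds: float) -> str:
    """Hypothetical stand-in for a synchronous call such as model.fit()."""
    time.sleep(seconds)  # blocks whichever thread calls it
    return "fitted"


async def heartbeat(ticks: list, interval: float, stop: asyncio.Event) -> None:
    """Hypothetical stand-in for heartbeats or WebSocket progress updates."""
    while not stop.is_set():
        ticks.append(time.monotonic())
        await asyncio.sleep(interval)


async def train() -> tuple:
    ticks: list = []
    stop = asyncio.Event()
    hb = asyncio.create_task(heartbeat(ticks, 0.05, stop))
    # Calling blocking_fit(0.3) directly here would freeze the event loop.
    # asyncio.to_thread() runs it in a worker thread, so heartbeat() keeps
    # getting scheduled while the fit is in progress.
    result = await asyncio.to_thread(blocking_fit, 0.3)
    stop.set()
    await hb
    return result, len(ticks)


if __name__ == "__main__":
    print(asyncio.run(train()))
```

The heartbeat count returned by `train()` shows that the loop kept running during the blocking call. `asyncio.to_thread()` requires Python 3.9 or later.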

Impact:
- Event loop no longer blocks during ML training
- RabbitMQ heartbeats continue during training
- WebSocket progress updates work correctly
- Training can now complete successfully

Fixes: Training hang at 40% during onboarding phase
2025-11-05 14:34:53 +00:00
Claude
136761af19 Fix AuditLogger.log_event() parameter name: metadata -> audit_metadata 2025-11-05 14:17:39 +00:00
Claude
8df90338b2 Fix training log race conditions and audit event error
Critical fixes for training session logging:

1. Training log race condition fix:
   - Add explicit session commits after creating training logs
   - Handle duplicate key errors gracefully when multiple sessions
     try to create the same log simultaneously
   - Implement retry logic to query for existing logs after
     duplicate key violations
   - Prevents "Training log not found" errors during training

2. Audit event async generator error fix:
   - Replace incorrect next(get_db()) usage with proper
     async context manager (database_manager.get_session())
   - Fixes "'async_generator' object is not an iterator" error
   - Ensures audit logging works correctly
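
The error quoted above can be reproduced in a few lines. This is a sketch under assumptions: `FakeSession`, the body of `get_db()`, and `get_session()` are hypothetical stand-ins for the project's real session factory and for database_manager.get_session(), whose implementation is not shown in this log.

```python
import asyncio
from contextlib import asynccontextmanager


class FakeSession:
    """Hypothetical stand-in for a database session."""

    def __init__(self) -> None:
        self.closed = False


async def get_db():
    """Async-generator dependency (hypothetical body)."""
    session = FakeSession()
    try:
        yield session
    finally:
        session.closed = True


@asynccontextmanager
async def get_session():
    """Hypothetical equivalent of database_manager.get_session()."""
    session = FakeSession()
    try:
        yield session
    finally:
        session.closed = True


async def log_audit_event() -> tuple:
    try:
        next(get_db())  # the bug: next() only accepts synchronous iterators
        error = ""
    except TypeError as exc:
        error = str(exc)
    # The fix: enter the session through an async context manager, which
    # also guarantees cleanup when the block exits.
    async with get_session() as session:
        pass  # audit write would go here
    return error, session.closed


if __name__ == "__main__":
    print(asyncio.run(log_audit_event()))
```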

These changes address race conditions in concurrent database
sessions and ensure training logs are properly synchronized
across the training pipeline.
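
The duplicate-key handling from item 1 can be sketched as an insert-then-fallback-to-select routine. This example swaps in the standard-library `sqlite3` module so it runs on its own; the commit itself concerns SQLAlchemy sessions, and the table name, columns, and function names below are assumptions made for illustration.

```python
import sqlite3


def get_or_create_training_log(conn: sqlite3.Connection, job_id: str) -> tuple:
    """Return the log row for job_id, creating it if it does not exist."""
    try:
        conn.execute(
            "INSERT INTO training_logs (job_id, status) VALUES (?, ?)",
            (job_id, "pending"),
        )
        conn.commit()  # explicit commit so other sessions can see the row
    except sqlite3.IntegrityError:
        # Another session created the same log first. Roll back and fall
        # through to read the existing row instead of failing.
        conn.rollback()
    return conn.execute(
        "SELECT job_id, status FROM training_logs WHERE job_id = ?",
        (job_id,),
    ).fetchone()


def demo() -> list:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE training_logs (job_id TEXT PRIMARY KEY, status TEXT)"
    )
    first = get_or_create_training_log(conn, "job-1")
    second = get_or_create_training_log(conn, "job-1")  # duplicate-key path
    count = conn.execute("SELECT COUNT(*) FROM training_logs").fetchone()[0]
    conn.close()
    return [first, second, count]


if __name__ == "__main__":
    print(demo())
```

Both calls return the same row and the table ends up with a single record, which is the behavior the retry logic is meant to guarantee.
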
2025-11-05 13:24:22 +00:00
Claude
5a84be83d6 Fix multiple critical bugs in onboarding training step
This commit addresses all identified bugs and issues in the training code path:

## Critical Fixes:
- Add get_start_time() method to TrainingLogRepository and fix non-existent method call
- Remove duplicate training.started event from API endpoint (trainer publishes the accurate one)
- Add missing progress events for 80-100% range (85%, 92%, 94%) to eliminate progress "dead zone"

## High Priority Fixes:
- Fix division by zero risk in time estimation with double-check and max() safety
- Remove unreachable exception handler in training_operations.py
- Simplify WebSocket token refresh logic to only reconnect on actual user session changes
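
One way to read the division-by-zero fix is an early return followed by a max() floor on the divisor. The sketch below is an assumption about the shape of that guard, not the code in progress_tracker.py; the function name and the 1% floor are hypothetical.

```python
from typing import Optional


def estimate_remaining_seconds(
    elapsed_seconds: float, progress_percent: float
) -> Optional[float]:
    """Linear extrapolation of remaining time from progress so far."""
    # First check: with no progress or no elapsed time there is nothing
    # to extrapolate from, so report "unknown" instead of dividing.
    if progress_percent <= 0 or elapsed_seconds <= 0:
        return None
    # Second check: clamp to [1, 100] so the divisor can never be zero
    # (the 1% floor is an assumed value).
    progress = min(max(progress_percent, 1.0), 100.0)
    total = elapsed_seconds * 100.0 / progress
    return max(total - elapsed_seconds, 0.0)


if __name__ == "__main__":
    print(estimate_remaining_seconds(60.0, 40.0))
```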

## Medium Priority Fixes:
- Fix auto-start training effect with useRef to prevent duplicate starts
- Add HTTP polling debounce delay (5s) to prevent race conditions with WebSocket
- Extract all magic numbers to centralized constants files:
  - Backend: services/training/app/core/training_constants.py
  - Frontend: frontend/src/constants/training.ts
- Standardize error logging with exc_info=True on critical errors

## Code Quality Improvements:
- All progress percentages now use named constants
- All timeouts and intervals now use named constants
- Improved code maintainability and readability
- Better separation of concerns

## Files Changed:
- Backend: training_service.py, trainer.py, training_events.py, progress_tracker.py
- Backend: training_operations.py, training_log_repository.py, training_constants.py (new)
- Frontend: training.ts (hooks), MLTrainingStep.tsx, training.ts (constants, new)

All training progress events now properly flow from 0% to 100% with no gaps.
2025-11-05 13:02:39 +00:00
Claude
799e7dbaeb Fix training job concurrent database session conflicts
Root Cause:
- Multiple parallel training tasks (3 at a time) were sharing the same database session
- This caused SQLAlchemy session state conflicts: "Session is already flushing" and "rollback() is already in progress"
- Additionally, duplicate model records were being created by both trainer and training_service

Fixes:
1. Separated model training from database writes:
   - Training happens in parallel (CPU-intensive)
   - Database writes happen sequentially after training completes
   - This eliminates concurrent session access

2. Removed duplicate database writes:
   - Trainer now writes all model records sequentially after parallel training
   - Training service now retrieves models instead of creating duplicates
   - Performance metrics are also created by trainer (no duplicates)

3. Added proper data flow:
   - _train_single_product: Only trains models, stores results
   - _write_training_results_to_database: Sequential DB writes after training
   - _store_trained_models: Changed to retrieve existing models
   - _create_performance_metrics: Changed to verify existing metrics
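
The data flow above can be sketched as two phases: bounded parallel training, then sequential writes through a single session. This is an illustration under assumptions, not the code in trainer.py. The function bodies are hypothetical, the `session` is a plain list standing in for a database session, and the limit of 3 mirrors the "3 at a time" mentioned in the root cause.

```python
import asyncio


def train_single_product(product_id: str) -> dict:
    """Train only and return results; never touches the database."""
    return {"product_id": product_id, "score": float(len(product_id))}


async def train_all(product_ids: list, max_parallel: int = 3) -> list:
    semaphore = asyncio.Semaphore(max_parallel)

    async def guarded(product_id: str) -> dict:
        async with semaphore:  # at most max_parallel trainings at once
            return await asyncio.to_thread(train_single_product, product_id)

    # gather() returns results in the same order as product_ids.
    return await asyncio.gather(*(guarded(p) for p in product_ids))


async def write_training_results(session: list, results: list) -> None:
    """Sequential writes: only one coroutine uses the session at a time."""
    for result in results:
        session.append(result)  # stand-in for adding and committing a record
        await asyncio.sleep(0)  # yield to the loop between writes


async def run_job(product_ids: list) -> list:
    session: list = []
    results = await train_all(product_ids)           # phase 1: parallel
    await write_training_results(session, results)   # phase 2: sequential
    return session


if __name__ == "__main__":
    print(asyncio.run(run_job(["a", "bb", "ccc", "dddd"])))
```

Because no training task ever holds the session, errors such as "Session is already flushing" cannot arise from the parallel phase in this arrangement.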

Benefits:
- Eliminates database session conflicts
- Prevents duplicate model records
- Maintains parallel training performance
- Ensures data consistency

Files Modified:
- services/training/app/ml/trainer.py
- services/training/app/services/training_service.py

Resolves: Onboarding training job database session conflicts
2025-11-05 12:41:42 +00:00
Urtzi Alfaro
394ad3aea4 Improve AI logic 2025-11-05 13:34:56 +01:00
Urtzi Alfaro
5adb0e39c0 Improve the frontend 5 2025-11-02 20:24:44 +01:00
Urtzi Alfaro
269d3b5032 Add user delete process 2025-10-31 11:54:19 +01:00
Urtzi Alfaro
36217a2729 Improve the frontend 2 2025-10-29 06:58:05 +01:00
Urtzi Alfaro
858d985c92 Improve the frontend modals 2025-10-27 16:33:26 +01:00
Urtzi Alfaro
8d30172483 Improve the frontend 2025-10-21 19:50:07 +02:00
Urtzi Alfaro
05da20357d Improve the security of the DB 2025-10-19 19:22:37 +02:00
Urtzi Alfaro
312e36c893 Update requirements and infra versions 2025-10-17 23:09:40 +02:00
Urtzi Alfaro
dbb48d8e2c Improve the sales import 2025-10-15 21:09:42 +02:00
Urtzi Alfaro
8f9e9a7edc Add role-based filtering and improve code 2025-10-15 16:12:49 +02:00
Urtzi Alfaro
96ad5c6692 Refactor datetime and timezone utils 2025-10-12 23:16:04 +02:00
Urtzi Alfaro
7556a00db7 Improve the demo feature of the project 2025-10-12 18:47:33 +02:00
Urtzi Alfaro
dbc7f2fa0d Re-create migrations init tables 2025-10-09 20:47:31 +02:00
Urtzi Alfaro
3c689b4f98 REFACTOR external service and improve websocket training 2025-10-09 14:11:02 +02:00
Urtzi Alfaro
7c72f83c51 REFACTOR ALL APIs fix 1 2025-10-07 07:15:07 +02:00
Urtzi Alfaro
38fb98bc27 REFACTOR ALL APIs 2025-10-06 15:27:01 +02:00
Urtzi Alfaro
0fdc3b0211 Fix issues 2025-10-01 16:25:53 +02:00
Urtzi Alfaro
2eeebfc1e0 Fix Alembic issue 2025-10-01 11:24:06 +02:00
Urtzi Alfaro
7cc4b957a5 Fix DB issue 2s 2025-09-30 21:58:10 +02:00
Urtzi Alfaro
147893015e Fix DB issues 2025-09-30 13:32:51 +02:00
Urtzi Alfaro
ec6bcb4c7d Add migration services 2025-09-30 08:12:45 +02:00
Urtzi Alfaro
2712a60a2a Refactor services alembic 2025-09-29 19:16:34 +02:00
Urtzi Alfaro
befcc126b0 Refactor all main.py 2025-09-29 13:13:12 +02:00
Urtzi Alfaro
4777e59e7a Add base kubernetes support final fix 4 2025-09-29 07:54:25 +02:00
Urtzi Alfaro
57f77638cc Add base kubernetes support final fix 2 2025-09-28 19:48:05 +02:00
Urtzi Alfaro
63a3f9c77a Add base kubernetes support 2025-09-27 11:18:13 +02:00
Urtzi Alfaro
a8f6e9d593 Simplify the onboarding flow components 3 2025-09-08 21:52:56 +02:00
Urtzi Alfaro
0faaa25e58 Start integrating the onboarding flow with backend 3 2025-09-04 23:19:53 +02:00
Urtzi Alfaro
4b4268d640 Add new alert architecture 2025-08-23 10:19:58 +02:00
Urtzi Alfaro
f33f5d242a Fix issues 4 2025-08-17 15:21:10 +02:00
Urtzi Alfaro
cafd316c4b Fix issues 3 2025-08-17 13:35:05 +02:00
Urtzi Alfaro
d21094a940 Fix issues 2 2025-08-17 11:12:17 +02:00
Urtzi Alfaro
109961ef6e Fix issues 2025-08-17 10:28:58 +02:00
Urtzi Alfaro
8914786973 New Frontend 2025-08-16 20:13:40 +02:00
Urtzi Alfaro
f7de9115d1 Fix new services implementation 5 2025-08-15 17:53:59 +02:00
Urtzi Alfaro
03737430ee Fix new services implementation 3 2025-08-14 16:47:34 +02:00
Urtzi Alfaro
fbe7470ad9 REFACTOR data service 2025-08-12 18:17:30 +02:00
Urtzi Alfaro
8d125ab0d5 Refactor the traffic fetching system 2025-08-10 18:32:47 +02:00
Urtzi Alfaro
3c2acc934a Improve the traffic fetching system 2025-08-10 17:31:38 +02:00
Urtzi Alfaro
312fdc8ef3 Improve the traffic fetching system 2025-08-08 23:29:48 +02:00
Urtzi Alfaro
8af17f1433 Improve the design of the frontend 2 2025-08-08 23:06:54 +02:00
Urtzi Alfaro
488bb3ef93 REFACTOR - Database logic 2025-08-08 09:08:41 +02:00
Urtzi Alfaro
32a7b913d0 Fix new Frontend 15 2025-08-04 21:46:12 +02:00
Urtzi Alfaro
8bb14ecc4f Fix new Frontend 14 2025-08-04 19:17:31 +02:00
Urtzi Alfaro
0ba543a19a Fix new Frontend 13 2025-08-04 18:58:12 +02:00