Commit Graph

61 Commits

Author SHA1 Message Date
Urtzi Alfaro
667e6e0404 New alert service 2025-12-05 20:07:01 +01:00
Urtzi Alfaro
c349b845a6 Bug fixes of training 2025-11-14 20:27:39 +01:00
Urtzi Alfaro
5783c7ed05 Add POI feature and improve the overall backend implementation 2025-11-12 15:34:10 +01:00
Urtzi Alfaro
74215d3e85 Fix deadlock issues in training 2025-11-05 18:47:20 +01:00
Urtzi Alfaro
e585e9fac0 Fix critical nested session deadlock in training_service.py
Root Cause (Actual):
The actual nested session issue was in training_service.py, not just in
the trainer methods. The flow was:

1. training_service.py creates outer session (line 173)
2. Updates training_log at lines 235-237 (uncommitted)
3. Calls trainer.train_tenant_models() at line 239
4. Trainer creates its own session at line 93
5. DEADLOCK: Outer session has uncommitted UPDATE, inner session can't proceed

Fix:
Added explicit session.commit() after the ml_training progress update
(line 241) to ensure the UPDATE is committed before trainer creates
its own session. This prevents the deadlock condition.

Related to previous commit caff497 which fixed nested sessions in
prophet_manager and hybrid_trainer, but missed the actual root cause
in training_service.py.
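
The lock conflict described above can be reproduced in miniature. This is a minimal sketch, not the project's code: it assumes SQLite from the Python standard library as a stand-in for the real database and SQLAlchemy sessions, and the names make_training_db and update_then_train are hypothetical. SQLite with timeout=0 reports the conflict immediately as an error, whereas the service described in this commit hung instead.

```python
import os
import sqlite3
import tempfile


def make_training_db():
    """Create a throwaway SQLite file with a single training_log row."""
    path = os.path.join(tempfile.mkdtemp(), "training.db")
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE training_log (id INTEGER PRIMARY KEY, progress INTEGER)")
    conn.execute("INSERT INTO training_log (id, progress) VALUES (1, 0)")
    conn.commit()
    conn.close()
    return path


def update_then_train(path, commit_before_inner):
    """Return "ok" if the inner session can write, "blocked" otherwise."""
    # timeout=0 makes a lock conflict fail at once instead of waiting.
    outer = sqlite3.connect(path, timeout=0)
    inner = None
    try:
        # Outer session: progress update, still uncommitted at this point.
        outer.execute("UPDATE training_log SET progress = 10 WHERE id = 1")
        if commit_before_inner:
            outer.commit()  # the fix: release the write lock first
        # Inner session: stands in for the session the trainer opens.
        inner = sqlite3.connect(path, timeout=0)
        inner.execute("UPDATE training_log SET progress = 20 WHERE id = 1")
        inner.commit()
        return "ok"
    except sqlite3.OperationalError:
        # SQLite reports "database is locked" for the conflicting write.
        return "blocked"
    finally:
        if inner is not None:
            inner.close()
        outer.rollback()
        outer.close()
```

Calling update_then_train(path, False) exercises the pre-fix ordering, and update_then_train(path, True) exercises the ordering with the explicit commit.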

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-05 16:30:15 +01:00
Claude
8df90338b2 Fix training log race conditions and audit event error
Critical fixes for training session logging:

1. Training log race condition fix:
   - Add explicit session commits after creating training logs
   - Handle duplicate key errors gracefully when multiple sessions
     try to create the same log simultaneously
   - Implement retry logic to query for existing logs after
     duplicate key violations
   - Prevents "Training log not found" errors during training

2. Audit event async generator error fix:
   - Replace incorrect next(get_db()) usage with proper
     async context manager (database_manager.get_session())
   - Fixes "'async_generator' object is not an iterator" error
   - Ensures audit logging works correctly
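
The async generator error in item 2 can be illustrated with the standard library alone. This is a sketch under assumptions: get_db and get_session below are hypothetical stand-ins for the project's dependency and for database_manager.get_session(), and the "session" is a plain dict rather than a real database session.

```python
import asyncio
from contextlib import asynccontextmanager


async def get_db():
    """Hypothetical async-generator dependency, as used by web frameworks."""
    yield {"open": True}


@asynccontextmanager
async def get_session():
    """Hypothetical stand-in for an async session context manager."""
    session = {"open": True}
    try:
        yield session
    finally:
        session["open"] = False  # cleanup always runs on exit


def broken_usage():
    """Calling next() on an async generator raises TypeError."""
    try:
        next(get_db())
    except TypeError as exc:
        return str(exc)
    return None


async def fixed_usage():
    """Enter the context manager with `async with` instead."""
    async with get_session() as session:
        return session["open"]
```

Here broken_usage() returns the TypeError message quoted in the commit, while asyncio.run(fixed_usage()) obtains a usable session and closes it on exit.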

These changes address race conditions in concurrent database
sessions and ensure training logs are properly synchronized
across the training pipeline.
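
The duplicate-key handling in item 1 follows a common get-or-create pattern. The sketch below is illustrative only: it assumes SQLite from the Python standard library instead of the project's database layer, and the table layout and the names open_log_db and get_or_create_training_log are hypothetical.

```python
import sqlite3


def open_log_db():
    """In-memory database with a unique key on job_id."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE training_log (job_id TEXT PRIMARY KEY, status TEXT)")
    return conn


def get_or_create_training_log(conn, job_id):
    """Insert the log row, or fall back to the existing row on a duplicate key."""
    try:
        conn.execute(
            "INSERT INTO training_log (job_id, status) VALUES (?, 'pending')",
            (job_id,),
        )
        conn.commit()  # explicit commit so other sessions can see the row
        created = True
    except sqlite3.IntegrityError:
        # Another session created the same log first: roll back and re-query.
        conn.rollback()
        created = False
    row = conn.execute(
        "SELECT job_id, status FROM training_log WHERE job_id = ?", (job_id,)
    ).fetchone()
    return row, created
```

The second call for the same job_id returns the existing row instead of raising, which is the behavior the commit describes for simultaneous log creation.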
2025-11-05 13:24:22 +00:00
Claude
5a84be83d6 Fix multiple critical bugs in onboarding training step
This commit addresses all identified bugs and issues in the training code path:

## Critical Fixes:
- Add get_start_time() method to TrainingLogRepository and fix non-existent method call
- Remove duplicate training.started event from API endpoint (trainer publishes the accurate one)
- Add missing progress events for 80-100% range (85%, 92%, 94%) to eliminate progress "dead zone"

## High Priority Fixes:
- Fix division by zero risk in time estimation with double-check and max() safety
- Remove unreachable exception handler in training_operations.py
- Simplify WebSocket token refresh logic to only reconnect on actual user session changes

## Medium Priority Fixes:
- Fix auto-start training effect with useRef to prevent duplicate starts
- Add HTTP polling debounce delay (5s) to prevent race conditions with WebSocket
- Extract all magic numbers to centralized constants files:
  - Backend: services/training/app/core/training_constants.py
  - Frontend: frontend/src/constants/training.ts
- Standardize error logging with exc_info=True on critical errors

## Code Quality Improvements:
- All progress percentages now use named constants
- All timeouts and intervals now use named constants
- Improved code maintainability and readability
- Better separation of concerns

## Files Changed:
- Backend: training_service.py, trainer.py, training_events.py, progress_tracker.py
- Backend: training_operations.py, training_log_repository.py, training_constants.py (new)
- Frontend: training.ts (hooks), MLTrainingStep.tsx, training.ts (constants, new)

All training progress events now properly flow from 0% to 100% with no gaps.
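
As an illustration of the division-by-zero guard mentioned under High Priority Fixes, here is a minimal sketch of a time estimate that checks the progress value and then clamps it with max(). The constant MIN_PROGRESS_PERCENT and the function estimate_remaining_seconds are hypothetical names, not taken from training_constants.py.

```python
# Hypothetical named constant, in the spirit of the centralized constants file.
MIN_PROGRESS_PERCENT = 1.0


def estimate_remaining_seconds(elapsed_seconds, progress_percent):
    """Estimate remaining time by linear extrapolation from progress so far."""
    # First check: no estimate is possible before any progress is reported.
    if progress_percent <= 0:
        return None
    # Second check: clamp tiny values so the division stays well-behaved.
    safe_progress = max(progress_percent, MIN_PROGRESS_PERCENT)
    estimated_total = elapsed_seconds * 100.0 / safe_progress
    # Never report a negative remaining time.
    return max(estimated_total - elapsed_seconds, 0.0)
```

With 30 seconds elapsed at 50% progress, the estimate is another 30 seconds; at 0% the function returns None rather than dividing by zero.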
2025-11-05 13:02:39 +00:00
Claude
799e7dbaeb Fix training job concurrent database session conflicts
Root Cause:
- Multiple parallel training tasks (3 at a time) were sharing the same database session
- This caused SQLAlchemy session state conflicts: "Session is already flushing" and "rollback() is already in progress"
- Additionally, duplicate model records were being created by both trainer and training_service

Fixes:
1. Separated model training from database writes:
   - Training happens in parallel (CPU-intensive)
   - Database writes happen sequentially after training completes
   - This eliminates concurrent session access

2. Removed duplicate database writes:
   - Trainer now writes all model records sequentially after parallel training
   - Training service now retrieves models instead of creating duplicates
   - Performance metrics are also created by trainer (no duplicates)

3. Added proper data flow:
   - _train_single_product: Only trains models, stores results
   - _write_training_results_to_database: Sequential DB writes after training
   - _store_trained_models: Changed to retrieve existing models
   - _create_performance_metrics: Changed to verify existing metrics

Benefits:
- Eliminates database session conflicts
- Prevents duplicate model records
- Maintains parallel training performance
- Ensures data consistency
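
The split between parallel training and sequential writes can be sketched as follows. This is an assumption-laden illustration rather than the code in trainer.py: it uses a thread pool and SQLite from the standard library, a trivial placeholder instead of real model fitting, and hypothetical function and table names.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def train_single_product(product_id):
    """Stand-in for model fitting: pure computation, no database access."""
    return {"product_id": product_id, "score": len(product_id) / 10}


def write_training_results(conn, results):
    """Write all results from one thread, one row at a time."""
    for result in results:
        conn.execute(
            "INSERT INTO trained_models (product_id, score) VALUES (?, ?)",
            (result["product_id"], result["score"]),
        )
    conn.commit()


def train_tenant_models(product_ids, max_parallel=3):
    """Train in parallel, then persist sequentially through one session."""
    # Phase 1: workers only compute; they never touch a database session.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        results = list(pool.map(train_single_product, product_ids))
    # Phase 2: a single connection writes every result after training ends.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE trained_models (product_id TEXT PRIMARY KEY, score REAL)")
    write_training_results(conn, results)
    rows = conn.execute(
        "SELECT product_id FROM trained_models ORDER BY product_id"
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]
```

Because only one connection ever writes, there is no shared session state for the parallel tasks to corrupt, and each product produces exactly one row.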

Files Modified:
- services/training/app/ml/trainer.py
- services/training/app/services/training_service.py

Resolves: Onboarding training job database session conflicts
2025-11-05 12:41:42 +00:00
Urtzi Alfaro
394ad3aea4 Improve AI logic 2025-11-05 13:34:56 +01:00
Urtzi Alfaro
269d3b5032 Add user delete process 2025-10-31 11:54:19 +01:00
Urtzi Alfaro
dbb48d8e2c Improve the sales import 2025-10-15 21:09:42 +02:00
Urtzi Alfaro
8f9e9a7edc Add role-based filtering and improve code 2025-10-15 16:12:49 +02:00
Urtzi Alfaro
96ad5c6692 Refactor datetime and timezone utils 2025-10-12 23:16:04 +02:00
Urtzi Alfaro
3c689b4f98 REFACTOR external service and improve websocket training 2025-10-09 14:11:02 +02:00
Urtzi Alfaro
a8f6e9d593 Simplify the onboarding flow components 3 2025-09-08 21:52:56 +02:00
Urtzi Alfaro
f33f5d242a Fix issues 4 2025-08-17 15:21:10 +02:00
Urtzi Alfaro
cafd316c4b Fix issues 3 2025-08-17 13:35:05 +02:00
Urtzi Alfaro
d21094a940 Fix issues 2 2025-08-17 11:12:17 +02:00
Urtzi Alfaro
109961ef6e Fix issues 2025-08-17 10:28:58 +02:00
Urtzi Alfaro
8914786973 New Frontend 2025-08-16 20:13:40 +02:00
Urtzi Alfaro
f7de9115d1 Fix new services implementation 5 2025-08-15 17:53:59 +02:00
Urtzi Alfaro
03737430ee Fix new services implementation 3 2025-08-14 16:47:34 +02:00
Urtzi Alfaro
fbe7470ad9 REFACTOR data service 2025-08-12 18:17:30 +02:00
Urtzi Alfaro
3c2acc934a Improve the traffic fetching system 2025-08-10 17:31:38 +02:00
Urtzi Alfaro
312fdc8ef3 Improve the traffic fetching system 2025-08-08 23:29:48 +02:00
Urtzi Alfaro
8af17f1433 Improve the design of the frontend 2 2025-08-08 23:06:54 +02:00
Urtzi Alfaro
488bb3ef93 REFACTOR - Database logic 2025-08-08 09:08:41 +02:00
Urtzi Alfaro
32a7b913d0 Fix new Frontend 15 2025-08-04 21:46:12 +02:00
Urtzi Alfaro
0ba543a19a Fix new Frontend 13 2025-08-04 18:58:12 +02:00
Urtzi Alfaro
35b02ca364 Fix new Frontend 12 2025-08-04 18:21:42 +02:00
Urtzi Alfaro
935f45a283 Add training job status in the db 2025-08-03 14:55:13 +02:00
Urtzi Alfaro
277e8bec73 Add user role 2025-08-02 09:41:50 +02:00
Urtzi Alfaro
e581a144be Improve the event messaging for training service 2 2025-07-31 15:34:35 +02:00
Urtzi Alfaro
923b2d48d2 Improve the event messaging for training service 2025-07-30 21:21:02 +02:00
Urtzi Alfaro
2d1ce2d523 Add publish events to the training phase - fix 2025-07-29 21:43:26 +02:00
Urtzi Alfaro
1ebc6ec911 Add publish events to the training phase 2025-07-29 21:33:57 +02:00
Urtzi Alfaro
84ed4a7a2e Start fixing forecast service API 3 2025-07-29 15:08:55 +02:00
Urtzi Alfaro
cd6fd875f7 Improve training test 3 2025-07-29 12:45:39 +02:00
Urtzi Alfaro
ef62f05031 Improve training test 2 2025-07-29 12:01:56 +02:00
Urtzi Alfaro
c788c7e406 Improve training code 3 2025-07-28 21:30:49 +02:00
Urtzi Alfaro
7cd595df81 Improve training code 2 2025-07-28 20:20:54 +02:00
Urtzi Alfaro
98f546af12 Improve training code 2025-07-28 19:28:39 +02:00
Urtzi Alfaro
946015b80c Fix data fetch 7 2025-07-27 22:58:18 +02:00
Urtzi Alfaro
938fd24e3a Fix data fetch 5 2025-07-27 21:32:29 +02:00
Urtzi Alfaro
a627b566d2 Fix data fetch 2 2025-07-27 20:20:09 +02:00
Urtzi Alfaro
4684235111 Fix data fetch 2025-07-27 19:30:42 +02:00
Urtzi Alfaro
e63a99b818 Checking onboarding flow - fix 4 2025-07-27 16:29:53 +02:00
Urtzi Alfaro
0b14cf9eb2 Checking onboarding flow - fix 3 2025-07-27 11:04:32 +02:00
Urtzi Alfaro
30ac945058 Checking onboarding flow - fix 2 2025-07-27 10:30:42 +02:00
Urtzi Alfaro
e4885db828 REFACTOR API gateway 2025-07-26 18:46:52 +02:00