Commit Graph

636 Commits

Author SHA1 Message Date
Claude
961bd2328f Fix all critical orchestration scheduler issues and add improvements
This commit addresses all 15 issues identified in the orchestration scheduler analysis:

HIGH PRIORITY FIXES:
1.  Database update methods already in orchestrator service (not in saga)
2.  Add null check for training_client before using it
3.  Fix cron schedule config from "0 5" to "30 5" (5:30 AM)
4.  Standardize on timezone-aware datetime (datetime.now(timezone.utc))
5.  Implement saga compensation logic with actual deletion calls
6.  Extract actual counts from saga results (no placeholders)

MEDIUM PRIORITY FIXES:
7.  Add circuit breakers for inventory/suppliers/recipes clients
8.  Pass circuit breakers to saga and use them in all service calls
9.  Add calling_service_name to AI Insights client
10.  Add database indexes on (tenant_id, started_at) and (status, started_at)
11.  Handle empty shared data gracefully (fail if all 3 fetches fail)

LOW PRIORITY IMPROVEMENTS:
12.  Make notification/validation failures more visible with explicit logging
13.  Track AI insights status in orchestration_runs table
14.  Improve run number generation atomicity using MAX() approach
15.  Optimize tenant ID handling (consistent UUID usage)

CHANGES:
- services/orchestrator/app/core/config.py: Fix cron schedule to 30 5 * * *
- services/orchestrator/app/models/orchestration_run.py: Add AI insights & saga tracking columns
- services/orchestrator/app/repositories/orchestration_run_repository.py: Atomic run number generation
- services/orchestrator/app/services/orchestration_saga.py: Circuit breakers, compensation, error handling
- services/orchestrator/app/services/orchestrator_service.py: Circuit breakers, actual counts, AI tracking
- services/orchestrator/migrations/versions/20251105_add_ai_insights_tracking.py: New migration

All issues resolved. No backwards compatibility. No TODOs. Production-ready.
2025-11-05 13:33:13 +00:00
ualsweb
94099d8bde Merge pull request #3 from ualsweb/claude/fix-training-logs-011CUpoupuk9cR6iBusjfozQ
Fix training log race conditions and audit event error
2025-11-05 14:29:01 +01:00
Claude
8df90338b2 Fix training log race conditions and audit event error
Critical fixes for training session logging:

1. Training log race condition fix:
   - Add explicit session commits after creating training logs
   - Handle duplicate key errors gracefully when multiple sessions
     try to create the same log simultaneously
   - Implement retry logic to query for existing logs after
     duplicate key violations
   - Prevents "Training log not found" errors during training

2. Audit event async generator error fix:
   - Replace incorrect next(get_db()) usage with proper
     async context manager (database_manager.get_session())
   - Fixes "'async_generator' object is not an iterator" error
   - Ensures audit logging works correctly

These changes address race conditions in concurrent database
sessions and ensure training logs are properly synchronized
across the training pipeline.
2025-11-05 13:24:22 +00:00
ualsweb
15025fdf1d Merge pull request #2 from ualsweb/claude/debug-onboarding-training-step-011CUpmWixCPTKKW2re8qJ3A
Fix multiple critical bugs in onboarding training step
2025-11-05 14:05:40 +01:00
Claude
5a84be83d6 Fix multiple critical bugs in onboarding training step
This commit addresses all identified bugs and issues in the training code path:

## Critical Fixes:
- Add get_start_time() method to TrainingLogRepository and fix non-existent method call
- Remove duplicate training.started event from API endpoint (trainer publishes the accurate one)
- Add missing progress events for 80-100% range (85%, 92%, 94%) to eliminate progress "dead zone"

## High Priority Fixes:
- Fix division by zero risk in time estimation with double-check and max() safety
- Remove unreachable exception handler in training_operations.py
- Simplify WebSocket token refresh logic to only reconnect on actual user session changes

## Medium Priority Fixes:
- Fix auto-start training effect with useRef to prevent duplicate starts
- Add HTTP polling debounce delay (5s) to prevent race conditions with WebSocket
- Extract all magic numbers to centralized constants files:
  - Backend: services/training/app/core/training_constants.py
  - Frontend: frontend/src/constants/training.ts
- Standardize error logging with exc_info=True on critical errors

## Code Quality Improvements:
- All progress percentages now use named constants
- All timeouts and intervals now use named constants
- Improved code maintainability and readability
- Better separation of concerns

## Files Changed:
- Backend: training_service.py, trainer.py, training_events.py, progress_tracker.py
- Backend: training_operations.py, training_log_repository.py, training_constants.py (new)
- Frontend: training.ts (hooks), MLTrainingStep.tsx, training.ts (constants, new)

All training progress events now properly flow from 0% to 100% with no gaps.
2025-11-05 13:02:39 +00:00
ualsweb
e3ea92640b Merge pull request #1 from ualsweb/claude/fix-onboarding-training-job-011CUpkdtoMWGH7ANd33zRbm
Fix training job concurrent database session conflicts
2025-11-05 13:44:07 +01:00
Claude
799e7dbaeb Fix training job concurrent database session conflicts
Root Cause:
- Multiple parallel training tasks (3 at a time) were sharing the same database session
- This caused SQLAlchemy session state conflicts: "Session is already flushing" and "rollback() is already in progress"
- Additionally, duplicate model records were being created by both trainer and training_service

Fixes:
1. Separated model training from database writes:
   - Training happens in parallel (CPU-intensive)
   - Database writes happen sequentially after training completes
   - This eliminates concurrent session access

2. Removed duplicate database writes:
   - Trainer now writes all model records sequentially after parallel training
   - Training service now retrieves models instead of creating duplicates
   - Performance metrics are also created by trainer (no duplicates)

3. Added proper data flow:
   - _train_single_product: Only trains models, stores results
   - _write_training_results_to_database: Sequential DB writes after training
   - _store_trained_models: Changed to retrieve existing models
   - _create_performance_metrics: Changed to verify existing metrics

Benefits:
- Eliminates database session conflicts
- Prevents duplicate model records
- Maintains parallel training performance
- Ensures data consistency

Files Modified:
- services/training/app/ml/trainer.py
- services/training/app/services/training_service.py

Resolves: Onboarding training job database session conflicts
2025-11-05 12:41:42 +00:00
Urtzi Alfaro
394ad3aea4 Improve AI logic 2025-11-05 13:34:56 +01:00
Urtzi Alfaro
5c87fbcf48 Improve the frontend 6 2025-11-02 20:26:25 +01:00
Urtzi Alfaro
5adb0e39c0 Improve the frontend 5 2025-11-02 20:24:44 +01:00
Urtzi Alfaro
0220da1725 Improve the frontend 4 2025-11-01 21:35:03 +01:00
Urtzi Alfaro
f44d235c6d Add user delete process 2 2025-10-31 18:57:58 +01:00
Urtzi Alfaro
269d3b5032 Add user delete process 2025-10-31 11:54:19 +01:00
Urtzi Alfaro
63f5c6d512 Improve the frontend 3 2025-10-30 21:08:07 +01:00
Urtzi Alfaro
36217a2729 Improve the frontend 2 2025-10-29 06:58:05 +01:00
Urtzi Alfaro
858d985c92 Improve the frontend modals 2025-10-27 16:33:26 +01:00
Urtzi Alfaro
61376b7a9f Improve the frontend and fix TODOs 2025-10-24 13:05:04 +02:00
Urtzi Alfaro
07c33fa578 Improve the frontend and repository layer 2025-10-23 07:44:54 +02:00
Urtzi Alfaro
8d30172483 Improve the frontend 2025-10-21 19:50:07 +02:00
Urtzi Alfaro
05da20357d Improve teh securty of teh DB 2025-10-19 19:22:37 +02:00
Urtzi Alfaro
62971c07d7 Update landing page 2025-10-18 16:03:23 +02:00
Urtzi Alfaro
312e36c893 Update requirements and insfra versions 2025-10-17 23:09:40 +02:00
Urtzi Alfaro
7e089b80cf Improve public pages 2025-10-17 18:14:28 +02:00
Urtzi Alfaro
d4060962e4 Improve demo seed 2025-10-17 07:31:14 +02:00
Urtzi Alfaro
b6cb800758 Improve GDPR implementation 2025-10-16 07:28:04 +02:00
Urtzi Alfaro
dbb48d8e2c Improve the sales import 2025-10-15 21:09:42 +02:00
Urtzi Alfaro
8f9e9a7edc Add role-based filtering and imporve code 2025-10-15 16:12:49 +02:00
Urtzi Alfaro
96ad5c6692 Refactor datetime and timezone utils 2025-10-12 23:16:04 +02:00
Urtzi Alfaro
7556a00db7 Improve the demo feature of the project 2025-10-12 18:47:33 +02:00
Urtzi Alfaro
dbc7f2fa0d Re-create migrations init tables 2025-10-09 20:47:31 +02:00
Urtzi Alfaro
b420af32c5 REFACTOR production scheduler 2025-10-09 18:01:24 +02:00
Urtzi Alfaro
3c689b4f98 REFACTOR external service and improve websocket training 2025-10-09 14:11:02 +02:00
Urtzi Alfaro
7c72f83c51 REFACTOR ALL APIs fix 1 2025-10-07 07:15:07 +02:00
Urtzi Alfaro
38fb98bc27 REFACTOR ALL APIs 2025-10-06 15:27:01 +02:00
Urtzi Alfaro
dc8221bd2f Add DEMO feature to the project 2025-10-03 14:09:34 +02:00
Urtzi Alfaro
1243c2ca6d Add fixes to procurement logic and fix rel-time connections 2025-10-02 13:20:30 +02:00
Urtzi Alfaro
c9d8d1d071 Fix onboarding process not getting the subcription plan 2025-10-01 21:56:38 +02:00
Urtzi Alfaro
b93fb850c3 Add tilt support 2025-10-01 18:58:30 +02:00
Urtzi Alfaro
0fdc3b0211 Fix issues 2025-10-01 16:25:53 +02:00
Urtzi Alfaro
36b44c41f1 Fix issues 2025-10-01 14:39:10 +02:00
Urtzi Alfaro
6fa655275f Fix notification service health issues 2025-10-01 12:28:00 +02:00
Urtzi Alfaro
016742d63f Fix startup issues 2025-10-01 12:17:59 +02:00
Urtzi Alfaro
2eeebfc1e0 Fix Alembic issue 2025-10-01 11:24:06 +02:00
Urtzi Alfaro
7cc4b957a5 Fix DB issue 2s 2025-09-30 21:58:10 +02:00
Urtzi Alfaro
147893015e Fix DB issues 2025-09-30 13:32:51 +02:00
Urtzi Alfaro
ec6bcb4c7d Add migration services 2025-09-30 08:12:45 +02:00
Urtzi Alfaro
d1c83dce74 Clean code 2025-09-29 19:18:45 +02:00
Urtzi Alfaro
2712a60a2a Refactor services alembic 2025-09-29 19:16:34 +02:00
Urtzi Alfaro
befcc126b0 Refactor all main.py 2025-09-29 13:13:12 +02:00
Urtzi Alfaro
4777e59e7a Add base kubernetes support final fix 4 2025-09-29 07:54:25 +02:00