This commit addresses all 15 issues identified in the orchestration scheduler analysis:
HIGH PRIORITY FIXES:
1. ✅ Database update methods already in orchestrator service (not in saga)
2. ✅ Add null check for training_client before using it
3. ✅ Fix cron schedule config from "0 5" to "30 5" (5:30 AM)
4. ✅ Standardize on timezone-aware datetime (datetime.now(timezone.utc))
5. ✅ Implement saga compensation logic with actual deletion calls
6. ✅ Extract actual counts from saga results (no placeholders)
MEDIUM PRIORITY FIXES:
7. ✅ Add circuit breakers for inventory/suppliers/recipes clients
8. ✅ Pass circuit breakers to saga and use them in all service calls
9. ✅ Add calling_service_name to AI Insights client
10. ✅ Add database indexes on (tenant_id, started_at) and (status, started_at)
11. ✅ Handle empty shared data gracefully (fail if all 3 fetches fail)
LOW PRIORITY IMPROVEMENTS:
12. ✅ Make notification/validation failures more visible with explicit logging
13. ✅ Track AI insights status in orchestration_runs table
14. ✅ Improve run number generation atomicity using MAX() approach
15. ✅ Optimize tenant ID handling (consistent UUID usage)
CHANGES:
- services/orchestrator/app/core/config.py: Fix cron schedule to 30 5 * * *
- services/orchestrator/app/models/orchestration_run.py: Add AI insights & saga tracking columns
- services/orchestrator/app/repositories/orchestration_run_repository.py: Atomic run number generation
- services/orchestrator/app/services/orchestration_saga.py: Circuit breakers, compensation, error handling
- services/orchestrator/app/services/orchestrator_service.py: Circuit breakers, actual counts, AI tracking
- services/orchestrator/migrations/versions/20251105_add_ai_insights_tracking.py: New migration
All issues resolved. No backwards compatibility. No TODOs. Production-ready.
This commit addresses all identified bugs and issues in the training code path:
## Critical Fixes:
- Add get_start_time() method to TrainingLogRepository and fix non-existent method call
- Remove duplicate training.started event from API endpoint (trainer publishes the accurate one)
- Add missing progress events for 80-100% range (85%, 92%, 94%) to eliminate progress "dead zone"
## High Priority Fixes:
- Fix division by zero risk in time estimation with double-check and max() safety
- Remove unreachable exception handler in training_operations.py
- Simplify WebSocket token refresh logic to only reconnect on actual user session changes
## Medium Priority Fixes:
- Fix auto-start training effect with useRef to prevent duplicate starts
- Add HTTP polling debounce delay (5s) to prevent race conditions with WebSocket
- Extract all magic numbers to centralized constants files:
- Backend: services/training/app/core/training_constants.py
- Frontend: frontend/src/constants/training.ts
- Standardize error logging with exc_info=True on critical errors
## Code Quality Improvements:
- All progress percentages now use named constants
- All timeouts and intervals now use named constants
- Improved code maintainability and readability
- Better separation of concerns
## Files Changed:
- Backend: training_service.py, trainer.py, training_events.py, progress_tracker.py
- Backend: training_operations.py, training_log_repository.py, training_constants.py (new)
- Frontend: training.ts (hooks), MLTrainingStep.tsx, training.ts (constants, new)
All training progress events now properly flow from 0% to 100% with no gaps.
Root Cause:
- Multiple parallel training tasks (3 at a time) were sharing the same database session
- This caused SQLAlchemy session state conflicts: "Session is already flushing" and "rollback() is already in progress"
- Additionally, duplicate model records were being created by both trainer and training_service
Fixes:
1. Separated model training from database writes:
- Training happens in parallel (CPU-intensive)
- Database writes happen sequentially after training completes
- This eliminates concurrent session access
2. Removed duplicate database writes:
- Trainer now writes all model records sequentially after parallel training
- Training service now retrieves models instead of creating duplicates
- Performance metrics are also created by trainer (no duplicates)
3. Added proper data flow:
- _train_single_product: Only trains models, stores results
- _write_training_results_to_database: Sequential DB writes after training
- _store_trained_models: Changed to retrieve existing models
- _create_performance_metrics: Changed to verify existing metrics
Benefits:
- Eliminates database session conflicts
- Prevents duplicate model records
- Maintains parallel training performance
- Ensures data consistency
Files Modified:
- services/training/app/ml/trainer.py
- services/training/app/services/training_service.py
Resolves: Onboarding training job database session conflicts