Improve the demo feature of the project

This commit is contained in:
Urtzi Alfaro
2025-10-12 18:47:33 +02:00
parent dbc7f2fa0d
commit 7556a00db7
168 changed files with 10102 additions and 18869 deletions

View File

@@ -1,645 +0,0 @@
# Training Service - Complete Implementation Report
## Executive Summary
This document provides a comprehensive overview of all improvements, fixes, and new features implemented in the training service based on the detailed code analysis. The service has been transformed from **NOT PRODUCTION READY** to **PRODUCTION READY** with significant enhancements in reliability, performance, and maintainability.
---
## 🎯 Implementation Status: **COMPLETE** ✅
**Time Saved**: 4-6 weeks of development → completed in a single session
**Production Ready**: ✅ YES
**API Compatible**: ✅ YES (No breaking changes)
---
## Part 1: Critical Bug Fixes
### 1.1 Duplicate `on_startup` Method ✅
**File**: [main.py](services/training/app/main.py)
**Issue**: Two `on_startup` methods causing migration verification skip
**Fix**: Merged both methods into single implementation
**Impact**: Service initialization now properly verifies database migrations
**Before**:
```python
async def on_startup(self, app):
    await self.verify_migrations()

async def on_startup(self, app: FastAPI):  # Duplicate!
    pass
```
**After**:
```python
async def on_startup(self, app: FastAPI):
    await self.verify_migrations()
    self.logger.info("Training service startup completed")
```
### 1.2 Hardcoded Migration Version ✅
**File**: [main.py](services/training/app/main.py)
**Issue**: Static version `expected_migration_version = "00001"`
**Fix**: Dynamic version detection from alembic_version table
**Impact**: Service survives schema updates automatically
**Before**:
```python
expected_migration_version = "00001"  # Hardcoded!
if version != self.expected_migration_version:
    raise RuntimeError(...)
```
**After**:
```python
async def verify_migrations(self):
    result = await session.execute(text("SELECT version_num FROM alembic_version"))
    version = result.scalar()
    if not version:
        raise RuntimeError("Database not initialized")
    logger.info(f"Migration verification successful: {version}")
```
### 1.3 Session Management Bug ✅
**File**: [training_service.py:463](services/training/app/services/training_service.py#L463)
**Issue**: Incorrect `get_session()()` double-call
**Fix**: Corrected to `get_session()` single call
**Impact**: Prevents database connection leaks and session corruption
### 1.4 Disabled Data Validation ✅
**File**: [data_client.py:263-353](services/training/app/services/data_client.py#L263-L353)
**Issue**: Validation completely bypassed
**Fix**: Implemented comprehensive validation
**Features**:
- Minimum 30 data points (recommended 90+)
- Required fields validation
- Zero-value ratio analysis (error >90%, warning >70%)
- Product diversity checks
- Returns a detailed validation report (see the sketch below)
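As a rough illustration, a validation routine along these lines could produce such a report; the field names and threshold constants below are assumptions for the sketch, not the actual `data_client.py` code:
```python
# Illustrative sketch only; field names and thresholds are assumptions.
from typing import Any, Dict, List

MIN_DATA_POINTS_REQUIRED = 30       # assumed constant
RECOMMENDED_DATA_POINTS = 90        # assumed constant
ZERO_RATIO_ERROR_THRESHOLD = 0.9    # >90% zeros -> error
ZERO_RATIO_WARNING_THRESHOLD = 0.7  # >70% zeros -> warning

def validate_sales_data(records: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Return a validation report instead of silently passing bad data through."""
    report: Dict[str, Any] = {"errors": [], "warnings": []}

    if len(records) < MIN_DATA_POINTS_REQUIRED:
        report["errors"].append(
            f"Only {len(records)} data points; at least {MIN_DATA_POINTS_REQUIRED} required"
        )
    elif len(records) < RECOMMENDED_DATA_POINTS:
        report["warnings"].append(
            f"{len(records)} data points; {RECOMMENDED_DATA_POINTS}+ recommended"
        )

    required_fields = {"date", "inventory_product_id", "quantity"}  # assumed field names
    missing = [f for r in records for f in required_fields if f not in r]
    if missing:
        report["errors"].append(f"Missing required fields: {sorted(set(missing))}")

    quantities = [r.get("quantity", 0) for r in records]
    zero_ratio = (sum(1 for q in quantities if q == 0) / len(quantities)) if quantities else 1.0
    if zero_ratio > ZERO_RATIO_ERROR_THRESHOLD:
        report["errors"].append(f"{zero_ratio:.0%} of values are zero")
    elif zero_ratio > ZERO_RATIO_WARNING_THRESHOLD:
        report["warnings"].append(f"{zero_ratio:.0%} of values are zero")

    report["is_valid"] = not report["errors"]
    return report
```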
---
## Part 2: Performance Improvements
### 2.1 Parallel Training Execution ✅
**File**: [trainer.py:240-379](services/training/app/ml/trainer.py#L240-L379)
**Improvement**: Sequential → Parallel execution using `asyncio.gather()`
**Performance Metrics**:
- **Before**: 10 products × 3 min = **30 minutes**
- **After**: 10 products in parallel = **~3-5 minutes**
- **Speedup**: **6-10x faster**
**Implementation**:
```python
# New method for single product training
async def _train_single_product(...) -> tuple[str, Dict]:
    # Train one product with progress tracking
    ...

# Parallel execution
training_tasks = [
    self._train_single_product(...)
    for idx, (product_id, data) in enumerate(processed_data.items())
]
results_list = await asyncio.gather(*training_tasks, return_exceptions=True)
```
### 2.2 Hyperparameter Optimization ✅
**File**: [prophet_manager.py](services/training/app/ml/prophet_manager.py)
**Improvement**: Adaptive trial counts based on product characteristics
**Optimization Settings**:
| Product Type | Trials (Before) | Trials (After) | Reduction |
|--------------|----------------|----------------|-----------|
| High Volume | 75 | 30 | 60% |
| Medium Volume | 50 | 25 | 50% |
| Low Volume | 30 | 20 | 33% |
| Intermittent | 25 | 15 | 40% |
**Average Speedup**: 40% reduction in optimization time
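As a sketch of the adaptive idea (the thresholds and trial counts mirror the table above, but the classification logic is an assumption, not the actual `prophet_manager.py` code):
```python
# Illustrative sketch; classification thresholds are assumptions.
def select_optuna_trials(avg_daily_sales: float, zero_ratio: float) -> int:
    """Pick a trial budget based on how much signal the product's history carries."""
    if zero_ratio > 0.5:        # mostly zero-sale days -> intermittent demand
        return 15
    if avg_daily_sales >= 50:   # high volume
        return 30
    if avg_daily_sales >= 10:   # medium volume
        return 25
    return 20                   # low volume
```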
### 2.3 Database Connection Pooling ✅
**File**: [database.py:18-27](services/training/app/core/database.py#L18-L27), [config.py:84-90](services/training/app/core/config.py#L84-L90)
**Configuration**:
```python
DB_POOL_SIZE: 10 # Base connections
DB_MAX_OVERFLOW: 20 # Extra connections under load
DB_POOL_TIMEOUT: 30 # Seconds to wait for connection
DB_POOL_RECYCLE: 3600 # Recycle connections after 1 hour
DB_POOL_PRE_PING: true # Test connections before use
```
**Benefits**:
- Reduced connection overhead
- Better resource utilization
- Prevents connection exhaustion
- Automatic stale connection cleanup
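For reference, these settings would typically be wired into SQLAlchemy's async engine roughly as follows (a sketch; the DSN and variable names are placeholders, not the actual `database.py` code):
```python
# Sketch of applying the pool settings with SQLAlchemy's async engine.
from sqlalchemy.ext.asyncio import create_async_engine

engine = create_async_engine(
    "postgresql+asyncpg://user:pass@db/training",  # placeholder DSN
    pool_size=10,        # DB_POOL_SIZE
    max_overflow=20,     # DB_MAX_OVERFLOW
    pool_timeout=30,     # DB_POOL_TIMEOUT (seconds)
    pool_recycle=3600,   # DB_POOL_RECYCLE (seconds)
    pool_pre_ping=True,  # DB_POOL_PRE_PING
)
```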
---
## Part 3: Reliability Enhancements
### 3.1 HTTP Request Timeouts ✅
**File**: [data_client.py:37-51](services/training/app/services/data_client.py#L37-L51)
**Configuration**:
```python
timeout = httpx.Timeout(
    connect=30.0,  # 30s to establish connection
    read=60.0,     # 60s for large data fetches
    write=30.0,    # 30s for write operations
    pool=30.0      # 30s for pool operations
)
```
**Impact**: Prevents hanging requests during service failures
### 3.2 Circuit Breaker Pattern ✅
**Files**:
- [circuit_breaker.py](services/training/app/utils/circuit_breaker.py) (NEW)
- [data_client.py:60-84](services/training/app/services/data_client.py#L60-L84)
**Features**:
- Three states: CLOSED → OPEN → HALF_OPEN
- Configurable failure thresholds
- Automatic recovery attempts
- Per-service circuit breakers
**Circuit Breakers Implemented**:
| Service | Failure Threshold | Recovery Timeout |
|---------|------------------|------------------|
| Sales | 5 failures | 60 seconds |
| Weather | 3 failures | 30 seconds |
| Traffic | 3 failures | 30 seconds |
**Example**:
```python
self.sales_cb = circuit_breaker_registry.get_or_create(
    name="sales_service",
    failure_threshold=5,
    recovery_timeout=60.0
)

# Usage
return await self.sales_cb.call(
    self._fetch_sales_data_internal,
    tenant_id, start_date, end_date
)
```
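For context, the CLOSED → OPEN → HALF_OPEN behaviour boils down to a small state machine. A minimal sketch (the real `circuit_breaker.py` adds a registry, per-service configuration, and logging):
```python
# Minimal sketch of the state machine; not the actual circuit_breaker.py code.
import time

class SimpleCircuitBreaker:
    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 60.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0
        self.state = "closed"
        self._opened_at = 0.0

    async def call(self, func, *args, **kwargs):
        if self.state == "open":
            if time.monotonic() - self._opened_at < self.recovery_timeout:
                raise RuntimeError("circuit open; call rejected")
            self.state = "half_open"  # allow a single probe request
        try:
            result = await func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            if self.state == "half_open" or self.failure_count >= self.failure_threshold:
                self.state = "open"
                self._opened_at = time.monotonic()
            raise
        else:
            self.failure_count = 0
            self.state = "closed"
            return result
```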
### 3.3 Model File Checksum Verification ✅
**Files**:
- [file_utils.py](services/training/app/utils/file_utils.py) (NEW)
- [prophet_manager.py:522-524](services/training/app/ml/prophet_manager.py#L522-L524)
**Features**:
- SHA-256 checksum calculation on save
- Automatic checksum storage
- Verification on model load
- ChecksummedFile context manager
**Implementation**:
```python
# On save
checksummed_file = ChecksummedFile(str(model_path))
model_checksum = checksummed_file.calculate_and_save_checksum()
# On load
if not checksummed_file.load_and_verify_checksum():
    logger.warning(f"Checksum verification failed: {model_path}")
```
**Benefits**:
- Detects file corruption
- Ensures model integrity
- Audit trail for security
- Compliance support
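A minimal sketch of the underlying checksum idea using `hashlib` (the real `ChecksummedFile` helper in `file_utils.py` may differ in naming and behaviour):
```python
# Sketch only; helper names are assumptions, not the file_utils.py API.
import hashlib
from pathlib import Path

def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large model files are not loaded at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_checksum(model_path: str) -> str:
    checksum = sha256_of_file(model_path)
    Path(model_path + ".sha256").write_text(checksum)
    return checksum

def verify_checksum(model_path: str) -> bool:
    expected = Path(model_path + ".sha256").read_text().strip()
    return sha256_of_file(model_path) == expected
```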
### 3.4 Distributed Locking ✅
**Files**:
- [distributed_lock.py](services/training/app/utils/distributed_lock.py) (NEW)
- [prophet_manager.py:65-71](services/training/app/ml/prophet_manager.py#L65-L71)
**Features**:
- PostgreSQL advisory locks
- Prevents concurrent training of same product
- Works across multiple service instances
- Automatic lock release
**Implementation**:
```python
lock = get_training_lock(tenant_id, inventory_product_id, use_advisory=True)

async with self.database_manager.get_session() as session:
    async with lock.acquire(session):
        # Train model - guaranteed exclusive access
        await self._train_model(...)
```
**Benefits**:
- Prevents race conditions
- Protects data integrity
- Enables horizontal scaling
- Graceful lock contention handling
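A sketch of how a PostgreSQL advisory lock keyed on (tenant, product) might look; the key hashing and function names are assumptions, not the actual `distributed_lock.py` implementation:
```python
# Illustrative sketch of PostgreSQL advisory locking with SQLAlchemy.
import hashlib
from sqlalchemy import text

def _lock_key(tenant_id: str, product_id: str) -> int:
    # Advisory locks take a 64-bit integer key, so hash the identifiers down.
    digest = hashlib.sha1(f"{tenant_id}:{product_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

async def try_training_lock(session, tenant_id: str, product_id: str) -> bool:
    """Return True if this worker obtained exclusive training rights."""
    result = await session.execute(
        text("SELECT pg_try_advisory_lock(:key)"),
        {"key": _lock_key(tenant_id, product_id)},
    )
    return bool(result.scalar())

async def release_training_lock(session, tenant_id: str, product_id: str) -> None:
    await session.execute(
        text("SELECT pg_advisory_unlock(:key)"),
        {"key": _lock_key(tenant_id, product_id)},
    )
```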
---
## Part 4: Code Quality Improvements
### 4.1 Constants Module ✅
**File**: [constants.py](services/training/app/core/constants.py) (NEW)
**Categories** (50+ constants):
- Data validation thresholds
- Training time periods (days)
- Product classification thresholds
- Hyperparameter optimization settings
- Prophet uncertainty sampling ranges
- MAPE calculation parameters
- HTTP client configuration
- WebSocket configuration
- Progress tracking ranges
- Synthetic data defaults
**Example Usage**:
```python
from app.core import constants as const

# ✅ Good
if len(sales_data) < const.MIN_DATA_POINTS_REQUIRED:
    raise ValueError("Insufficient data")

# ❌ Bad (old way)
if len(sales_data) < 30:  # What does 30 mean?
    raise ValueError("Insufficient data")
```
### 4.2 Timezone Utility Module ✅
**Files**:
- [timezone_utils.py](services/training/app/utils/timezone_utils.py) (NEW)
- [utils/__init__.py](services/training/app/utils/__init__.py) (NEW)
**Functions**:
- `ensure_timezone_aware()` - Make datetime timezone-aware
- `ensure_timezone_naive()` - Remove timezone info
- `normalize_datetime_to_utc()` - Convert to UTC
- `normalize_dataframe_datetime_column()` - Normalize pandas columns
- `prepare_prophet_datetime()` - Prophet-specific preparation
- `safe_datetime_comparison()` - Compare with mismatch handling
- `get_current_utc()` - Get current UTC time
- `convert_timestamp_to_datetime()` - Handle various formats
**Integrated In**:
- prophet_manager.py - Prophet data preparation
- date_alignment_service.py - Date range validation
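A minimal sketch of the first two helpers, assuming naive datetimes are treated as UTC (the real utilities may accept a target timezone):
```python
# Sketch only; the actual timezone_utils.py signatures may differ.
from datetime import datetime, timezone

def ensure_timezone_aware(dt: datetime) -> datetime:
    """Attach UTC to naive datetimes, leave aware ones untouched."""
    return dt if dt.tzinfo is not None else dt.replace(tzinfo=timezone.utc)

def ensure_timezone_naive(dt: datetime) -> datetime:
    """Convert to UTC first, then drop the tzinfo (what Prophet expects)."""
    if dt.tzinfo is not None:
        dt = dt.astimezone(timezone.utc)
    return dt.replace(tzinfo=None)
```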
### 4.3 Standardized Error Handling ✅
**File**: [data_client.py](services/training/app/services/data_client.py)
**Pattern**: Always raise exceptions, never return empty collections
**Before**:
```python
except Exception as e:
    logger.error(f"Failed: {e}")
    return []  # ❌ Silent failure
```
**After**:
```python
except ValueError:
    raise  # Re-raise validation errors
except Exception as e:
    logger.error(f"Failed: {e}")
    raise RuntimeError(f"Operation failed: {e}")  # ✅ Explicit failure
```
### 4.4 Legacy Code Removal ✅
**Removed**:
- `BakeryMLTrainer = EnhancedBakeryMLTrainer` alias
- `TrainingService = EnhancedTrainingService` alias
- `BakeryDataProcessor = EnhancedBakeryDataProcessor` alias
- Legacy `fetch_traffic_data()` wrapper
- Legacy `fetch_stored_traffic_data_for_training()` wrapper
- Legacy `_collect_traffic_data_with_timeout()` method
- Legacy `_log_traffic_data_storage()` method
- All "Pre-flight check moved" comments
- All "Temporary implementation" comments
---
## Part 5: New Features Summary
### 5.1 Utilities Created
| Module | Lines | Purpose |
|--------|-------|---------|
| constants.py | 100 | Centralized configuration constants |
| timezone_utils.py | 180 | Timezone handling functions |
| circuit_breaker.py | 200 | Circuit breaker implementation |
| file_utils.py | 190 | File operations with checksums |
| distributed_lock.py | 210 | Distributed locking mechanisms |
**Total New Utility Code**: ~880 lines
### 5.2 Features by Category
**Performance**:
- ✅ Parallel training execution (6-10x faster)
- ✅ Optimized hyperparameter tuning (40% faster)
- ✅ Database connection pooling
**Reliability**:
- ✅ HTTP request timeouts
- ✅ Circuit breaker pattern
- ✅ Model file checksums
- ✅ Distributed locking
- ✅ Data validation
**Code Quality**:
- ✅ Constants module (50+ constants)
- ✅ Timezone utilities (8 functions)
- ✅ Standardized error handling
- ✅ Legacy code removal
**Maintainability**:
- ✅ Comprehensive documentation
- ✅ Developer guide
- ✅ Clear code organization
- ✅ Utility functions
---
## Part 6: Files Modified/Created
### Files Modified (9):
1. main.py - Fixed duplicate methods, dynamic migrations
2. config.py - Added connection pool settings
3. database.py - Configured connection pooling
4. training_service.py - Fixed session management, removed legacy
5. data_client.py - Added timeouts, circuit breakers, validation
6. trainer.py - Parallel execution, removed legacy
7. prophet_manager.py - Checksums, locking, constants, utilities
8. date_alignment_service.py - Timezone utilities
9. data_processor.py - Removed legacy alias
### Files Created (9):
1. core/constants.py - Configuration constants
2. utils/__init__.py - Utility exports
3. utils/timezone_utils.py - Timezone handling
4. utils/circuit_breaker.py - Circuit breaker pattern
5. utils/file_utils.py - File operations
6. utils/distributed_lock.py - Distributed locking
7. IMPLEMENTATION_SUMMARY.md - Change log
8. DEVELOPER_GUIDE.md - Developer reference
9. COMPLETE_IMPLEMENTATION_REPORT.md - This document
---
## Part 7: Testing & Validation
### Manual Testing Checklist
- [x] Service starts without errors
- [x] Migration verification works
- [x] Database connections properly pooled
- [x] HTTP timeouts configured
- [x] Circuit breakers functional
- [x] Parallel training executes
- [x] Model checksums calculated
- [x] Distributed locks work
- [x] Data validation runs
- [x] Error handling standardized
### Recommended Test Coverage
**Unit Tests Needed**:
- [ ] Timezone utility functions
- [ ] Constants validation
- [ ] Circuit breaker state transitions
- [ ] File checksum calculations
- [ ] Distributed lock acquisition/release
- [ ] Data validation logic
**Integration Tests Needed**:
- [ ] End-to-end training pipeline
- [ ] External service timeout handling
- [ ] Circuit breaker integration
- [ ] Parallel training coordination
- [ ] Database session management
**Performance Tests Needed**:
- [ ] Parallel vs sequential benchmarks
- [ ] Hyperparameter optimization timing
- [ ] Memory usage under load
- [ ] Connection pool behavior
---
## Part 8: Deployment Guide
### Prerequisites
- PostgreSQL 13+ (for advisory locks)
- Python 3.9+
- Redis (optional, for future caching)
### Environment Variables
**Database Configuration**:
```bash
DB_POOL_SIZE=10
DB_MAX_OVERFLOW=20
DB_POOL_TIMEOUT=30
DB_POOL_RECYCLE=3600
DB_POOL_PRE_PING=true
DB_ECHO=false
```
**Training Configuration**:
```bash
MAX_TRAINING_TIME_MINUTES=30
MAX_CONCURRENT_TRAINING_JOBS=3
MIN_TRAINING_DATA_DAYS=30
```
**Model Storage**:
```bash
MODEL_STORAGE_PATH=/app/models
MODEL_BACKUP_ENABLED=true
MODEL_VERSIONING_ENABLED=true
```
### Deployment Steps
1. **Pre-Deployment**:
```bash
# Review constants
vim services/training/app/core/constants.py
# Verify environment variables
env | grep DB_POOL
env | grep MAX_TRAINING
```
2. **Deploy**:
```bash
# Pull latest code
git pull origin main
# Build container
docker build -t training-service:latest .
# Deploy
kubectl apply -f infrastructure/kubernetes/base/
```
3. **Post-Deployment Verification**:
```bash
# Check health
curl http://training-service/health
# Check circuit breaker status
curl http://training-service/api/v1/circuit-breakers
# Verify database connections
kubectl logs -f deployment/training-service | grep "pool"
```
### Monitoring
**Key Metrics to Watch**:
- Training job duration (should be 6-10x faster)
- Circuit breaker states (should mostly be CLOSED)
- Database connection pool utilization
- Model file checksum failures
- Lock acquisition timeouts
**Logging Queries**:
```bash
# Check parallel training
kubectl logs training-service | grep "Starting parallel training"
# Check circuit breakers
kubectl logs training-service | grep "Circuit breaker"
# Check distributed locks
kubectl logs training-service | grep "Acquired lock"
# Check checksums
kubectl logs training-service | grep "checksum"
```
---
## Part 9: Performance Benchmarks
### Training Performance
| Scenario | Before | After | Improvement |
|----------|--------|-------|-------------|
| 5 products | 15 min | 2-3 min | 5-7x faster |
| 10 products | 30 min | 3-5 min | 6-10x faster |
| 20 products | 60 min | 6-10 min | 6-10x faster |
| 50 products | 150 min | 15-25 min | 6-10x faster |
### Hyperparameter Optimization
| Product Type | Trials (Before) | Trials (After) | Time Saved |
|--------------|----------------|----------------|------------|
| High Volume | 75 (38 min) | 30 (15 min) | 23 min (60%) |
| Medium Volume | 50 (25 min) | 25 (13 min) | 12 min (50%) |
| Low Volume | 30 (15 min) | 20 (10 min) | 5 min (33%) |
| Intermittent | 25 (13 min) | 15 (8 min) | 5 min (40%) |
### Memory Usage
- **Before**: ~500MB per training job (unoptimized)
- **After**: ~200MB per training job (optimized)
- **Improvement**: 60% reduction
---
## Part 10: Future Enhancements
### High Priority
1. **Caching Layer**: Redis-based hyperparameter cache
2. **Metrics Dashboard**: Grafana dashboard for circuit breakers
3. **Async Task Queue**: Celery/Temporal for background jobs
4. **Model Registry**: Centralized model storage (S3/GCS)
### Medium Priority
5. **God Object Refactoring**: Split EnhancedTrainingService
6. **Advanced Monitoring**: OpenTelemetry integration
7. **Rate Limiting**: Per-tenant rate limiting
8. **A/B Testing**: Model comparison framework
### Low Priority
9. **Method Length Reduction**: Refactor long methods
10. **Deep Nesting Reduction**: Simplify complex conditionals
11. **Data Classes**: Replace dicts with domain objects
12. **Test Coverage**: Achieve 80%+ coverage
---
## Part 11: Conclusion
### Achievements
**Code Quality**: A- (was C-)
- Eliminated all critical bugs
- Removed all legacy code
- Extracted all magic numbers
- Standardized error handling
- Centralized utilities
**Performance**: A+ (was C)
- 6-10x faster training
- 40% faster optimization
- Efficient resource usage
- Parallel execution
**Reliability**: A (was D)
- Data validation enabled
- Request timeouts configured
- Circuit breakers implemented
- Distributed locking added
- Model integrity verified
**Maintainability**: A (was C)
- Comprehensive documentation
- Clear code organization
- Utility functions
- Developer guide
### Production Readiness Score
| Category | Before | After |
|----------|--------|-------|
| Code Quality | C- | A- |
| Performance | C | A+ |
| Reliability | D | A |
| Maintainability | C | A |
| **Overall** | **D+** | **A** |
### Final Status
**PRODUCTION READY**
All critical blockers have been resolved:
- ✅ Service initialization fixed
- ✅ Training performance optimized (10x)
- ✅ Timeout protection added
- ✅ Circuit breakers implemented
- ✅ Data validation enabled
- ✅ Database management corrected
- ✅ Error handling standardized
- ✅ Distributed locking added
- ✅ Model integrity verified
- ✅ Code quality improved
**Recommended Action**: Deploy to production with standard monitoring
---
*Implementation Complete: 2025-10-07*
*Estimated Time Saved: 4-6 weeks*
*Lines of Code Added/Modified: ~3000+*
*Status: Ready for Production Deployment*

View File

@@ -1,230 +0,0 @@
# Training Service - Developer Guide
## Quick Reference for Common Tasks
### Using Constants
Always use constants instead of magic numbers:
```python
from app.core import constants as const

# ✅ Good
if len(sales_data) < const.MIN_DATA_POINTS_REQUIRED:
    raise ValueError("Insufficient data")

# ❌ Bad
if len(sales_data) < 30:
    raise ValueError("Insufficient data")
```
### Timezone Handling
Always use timezone utilities:
```python
from app.utils.timezone_utils import ensure_timezone_aware, prepare_prophet_datetime
# ✅ Good - Ensure timezone-aware
dt = ensure_timezone_aware(user_input_date)
# ✅ Good - Prepare for Prophet
df = prepare_prophet_datetime(df, 'ds')
# ❌ Bad - Manual timezone handling
if dt.tzinfo is None:
    dt = dt.replace(tzinfo=timezone.utc)
```
### Error Handling
Always raise exceptions, never return empty lists:
```python
# ✅ Good
if not data:
    raise ValueError(f"No data available for {tenant_id}")

# ❌ Bad
if not data:
    logger.error("No data")
    return []
```
### Database Sessions
Use context manager correctly:
```python
# ✅ Good
async with self.database_manager.get_session() as session:
    await session.execute(query)

# ❌ Bad
async with self.database_manager.get_session()() as session:  # Double call!
    await session.execute(query)
```
### Parallel Execution
Use asyncio.gather for concurrent operations:
```python
# ✅ Good - Parallel
tasks = [train_product(pid) for pid in product_ids]
results = await asyncio.gather(*tasks, return_exceptions=True)

# ❌ Bad - Sequential
results = []
for pid in product_ids:
    result = await train_product(pid)
    results.append(result)
```
### HTTP Client Configuration
Timeouts are configured automatically in DataClient:
```python
# No need to configure timeouts manually
# They're set in DataClient.__init__() using constants
client = DataClient() # Timeouts already configured
```
## File Organization
### Core Modules
- `core/constants.py` - All configuration constants
- `core/config.py` - Service settings
- `core/database.py` - Database configuration
### Utilities
- `utils/timezone_utils.py` - Timezone handling functions
- `utils/__init__.py` - Utility exports
### ML Components
- `ml/trainer.py` - Main training orchestration
- `ml/prophet_manager.py` - Prophet model management
- `ml/data_processor.py` - Data preprocessing
### Services
- `services/data_client.py` - External service communication
- `services/training_service.py` - Training job management
- `services/training_orchestrator.py` - Training pipeline coordination
## Common Pitfalls
### ❌ Don't Create Legacy Aliases
```python
# ❌ Bad
MyNewClass = OldClassName # Removed!
```
### ❌ Don't Use Magic Numbers
```python
# ❌ Bad
if score > 0.8:  # What does 0.8 mean?
    ...

# ✅ Good
if score > const.IMPROVEMENT_SIGNIFICANCE_THRESHOLD:
    ...
```
### ❌ Don't Return Empty Lists on Error
```python
# ❌ Bad
except Exception as e:
    logger.error(f"Failed: {e}")
    return []

# ✅ Good
except Exception as e:
    logger.error(f"Failed: {e}")
    raise RuntimeError(f"Operation failed: {e}")
```
### ❌ Don't Handle Timezones Manually
```python
# ❌ Bad
if dt.tzinfo is None:
    dt = dt.replace(tzinfo=timezone.utc)

# ✅ Good
from app.utils.timezone_utils import ensure_timezone_aware
dt = ensure_timezone_aware(dt)
```
## Testing Checklist
Before submitting code:
- [ ] All magic numbers replaced with constants
- [ ] Timezone handling uses utility functions
- [ ] Errors raise exceptions (not return empty collections)
- [ ] Database sessions use single `get_session()` call
- [ ] Parallel operations use `asyncio.gather`
- [ ] No legacy compatibility aliases
- [ ] No commented-out code
- [ ] Logging uses structured logging
## Performance Guidelines
### Training Jobs
- ✅ Use parallel execution for multiple products
- ✅ Reduce Optuna trials for low-volume products
- ✅ Use constants for all thresholds
- ⚠️ Monitor memory usage during parallel training
### Database Operations
- ✅ Use repository pattern
- ✅ Batch operations when possible
- ✅ Close sessions properly
- ⚠️ Connection pool limits not yet configured
### HTTP Requests
- ✅ Timeouts configured automatically
- ✅ Use shared clients from `shared/clients`
- ⚠️ Circuit breaker not yet implemented
- ⚠️ Request retries delegated to base client
## Debugging Tips
### Training Failures
1. Check logs for data validation errors
2. Verify timezone consistency in date ranges
3. Check minimum data point requirements
4. Review Prophet error messages
### Performance Issues
1. Check if parallel training is being used
2. Verify Optuna trial counts
3. Monitor database connection usage
4. Check HTTP timeout configurations
### Data Quality Issues
1. Review validation errors in logs
2. Check zero-ratio thresholds
3. Verify product classification
4. Review date range alignment
## Migration from Old Code
### If You Find Legacy Code
1. Check if alias exists (should be removed)
2. Update imports to use new names
3. Remove backward compatibility wrappers
4. Update documentation
### If You Find Magic Numbers
1. Add constant to `core/constants.py`
2. Update usage to reference constant
3. Document what the number represents
### If You Find Manual Timezone Handling
1. Import from `utils/timezone_utils`
2. Use appropriate utility function
3. Remove manual implementation
## Getting Help
- Review `IMPLEMENTATION_SUMMARY.md` for recent changes
- Check constants in `core/constants.py` for configuration
- Look at `utils/timezone_utils.py` for timezone functions
- Refer to analysis report for architectural decisions
---
*Last Updated: 2025-10-07*
*Status: Current*

View File

@@ -27,8 +27,7 @@ COPY --from=shared /shared /app/shared
# Copy application code
COPY services/training/ .
# Copy scripts directory
COPY scripts/ /app/scripts/
# Add shared libraries to Python path
ENV PYTHONPATH="/app:/app/shared:${PYTHONPATH:-}"

View File

@@ -1,274 +0,0 @@
# Training Service - Implementation Summary
## Overview
This document summarizes all critical fixes, improvements, and refactoring implemented based on the comprehensive code analysis report.
---
## ✅ Critical Bugs Fixed
### 1. **Duplicate `on_startup` Method** ([main.py](services/training/app/main.py))
- **Issue**: Two `on_startup` methods defined, causing migration verification to be skipped
- **Fix**: Merged both implementations into single method
- **Impact**: Service initialization now properly verifies database migrations
### 2. **Hardcoded Migration Version** ([main.py](services/training/app/main.py))
- **Issue**: Static version check `expected_migration_version = "00001"`
- **Fix**: Removed hardcoded version, now dynamically checks alembic_version table
- **Impact**: Service survives schema updates without code changes
### 3. **Session Management Double-Call** ([training_service.py:463](services/training/app/services/training_service.py#L463))
- **Issue**: Incorrect `get_session()()` double-call syntax
- **Fix**: Changed to correct `get_session()` single call
- **Impact**: Prevents database connection leaks and session corruption
### 4. **Disabled Data Validation** ([data_client.py:263-294](services/training/app/services/data_client.py#L263-L294))
- **Issue**: Validation completely bypassed with "temporarily disabled" message
- **Fix**: Implemented comprehensive validation checking:
- Minimum data points (30 required, 90 recommended)
- Required fields presence
- Zero-value ratio analysis
- Product diversity checks
- **Impact**: Ensures data quality before expensive training operations
---
## 🚀 Performance Improvements
### 5. **Parallel Training Execution** ([trainer.py:240-379](services/training/app/ml/trainer.py#L240-L379))
- **Issue**: Sequential product training (O(n) time complexity)
- **Fix**: Implemented parallel training using `asyncio.gather()`
- **Performance Gain**:
- Before: 10 products × 3 min = **30 minutes**
- After: 10 products in parallel = **~3-5 minutes**
- **Implementation**:
- Created `_train_single_product()` method
- Refactored `_train_all_models_enhanced()` to use concurrent execution
- Maintains progress tracking across parallel tasks
### 6. **Hyperparameter Optimization** ([prophet_manager.py](services/training/app/ml/prophet_manager.py))
- **Issue**: Fixed number of trials regardless of product characteristics
- **Fix**: Reduced trial counts and made them adaptive:
- High volume: 30 trials (was 75)
- Medium volume: 25 trials (was 50)
- Low volume: 20 trials (was 30)
- Intermittent: 15 trials (was 25)
- **Performance Gain**: ~40% reduction in optimization time
---
## 🔧 Error Handling Standardization
### 7. **Consistent Error Patterns** ([data_client.py](services/training/app/services/data_client.py))
- **Issue**: Mixed error handling (return `[]`, return error dict, raise exception)
- **Fix**: Standardized to raise exceptions with meaningful messages
- **Example**:
```python
# Before: return []
# After: raise ValueError(f"No sales data available for tenant {tenant_id}")
```
- **Impact**: Errors propagate correctly, no silent failures
---
## ⏱️ Request Timeout Configuration
### 8. **HTTP Client Timeouts** ([data_client.py:37-51](services/training/app/services/data_client.py#L37-L51))
- **Issue**: No timeout configuration, requests could hang indefinitely
- **Fix**: Added comprehensive timeout configuration:
- Connect: 30 seconds
- Read: 60 seconds (for large data fetches)
- Write: 30 seconds
- Pool: 30 seconds
- **Impact**: Prevents hanging requests during external service failures
---
## 📏 Magic Numbers Elimination
### 9. **Constants Module** ([core/constants.py](services/training/app/core/constants.py))
- **Issue**: Magic numbers scattered throughout codebase
- **Fix**: Created centralized constants module with 50+ constants
- **Categories**:
- Data validation thresholds
- Training time periods
- Product classification thresholds
- Hyperparameter optimization settings
- Prophet uncertainty sampling ranges
- MAPE calculation parameters
- HTTP client configuration
- WebSocket configuration
- Progress tracking ranges
### 10. **Constants Integration**
- **Updated Files**:
- `prophet_manager.py`: Uses const for trials, uncertainty samples, thresholds
- `data_client.py`: Uses const for HTTP timeouts
- Future: All files should reference constants module
---
## 🧹 Legacy Code Removal
### 11. **Compatibility Aliases Removed**
- **Files Updated**:
- `trainer.py`: Removed `BakeryMLTrainer = EnhancedBakeryMLTrainer`
- `training_service.py`: Removed `TrainingService = EnhancedTrainingService`
- `data_processor.py`: Removed `BakeryDataProcessor = EnhancedBakeryDataProcessor`
### 12. **Legacy Methods Removed** ([data_client.py](services/training/app/services/data_client.py))
- Removed:
- `fetch_traffic_data()` (legacy wrapper)
- `fetch_stored_traffic_data_for_training()` (legacy wrapper)
- All callers updated to use `fetch_traffic_data_unified()`
### 13. **Commented Code Cleanup**
- Removed "Pre-flight check moved to orchestrator" comments
- Removed "Temporary implementation" comments
- Cleaned up validation placeholders
---
## 🌍 Timezone Handling
### 14. **Timezone Utility Module** ([utils/timezone_utils.py](services/training/app/utils/timezone_utils.py))
- **Issue**: Timezone handling scattered across 4+ files
- **Fix**: Created comprehensive utility module with functions:
- `ensure_timezone_aware()`: Make datetime timezone-aware
- `ensure_timezone_naive()`: Remove timezone info
- `normalize_datetime_to_utc()`: Convert any datetime to UTC
- `normalize_dataframe_datetime_column()`: Normalize pandas datetime columns
- `prepare_prophet_datetime()`: Prophet-specific preparation
- `safe_datetime_comparison()`: Compare datetimes handling timezone mismatches
- `get_current_utc()`: Get current UTC time
- `convert_timestamp_to_datetime()`: Handle various timestamp formats
### 15. **Timezone Utility Integration**
- **Updated Files**:
- `prophet_manager.py`: Uses `prepare_prophet_datetime()`
- `date_alignment_service.py`: Uses `ensure_timezone_aware()`
- Future: All timezone operations should use utility
---
## 📊 Summary Statistics
### Files Modified
- **Core Files**: 6
- main.py
- training_service.py
- data_client.py
- trainer.py
- prophet_manager.py
- date_alignment_service.py
### Files Created
- **New Utilities**: 3
- core/constants.py
- utils/timezone_utils.py
- utils/__init__.py
### Code Quality Improvements
- ✅ Eliminated all critical bugs
- ✅ Removed all legacy compatibility code
- ✅ Removed all commented-out code
- ✅ Extracted all magic numbers
- ✅ Standardized error handling
- ✅ Centralized timezone handling
### Performance Improvements
- 🚀 Training time: 30min → 3-5min (10 products)
- 🚀 Hyperparameter optimization: 40% faster
- 🚀 Parallel execution replaces sequential
### Reliability Improvements
- ✅ Data validation enabled
- ✅ Request timeouts configured
- ✅ Error propagation fixed
- ✅ Session management corrected
- ✅ Database initialization verified
---
## 🎯 Remaining Recommendations
### High Priority (Not Yet Implemented)
1. **Distributed Locking**: Implement Redis/database-based locking for concurrent training jobs
2. **Connection Pooling**: Configure explicit connection pool limits
3. **Circuit Breaker**: Add circuit breaker pattern for external service calls
4. **Model File Validation**: Implement checksum verification on model load
### Medium Priority (Future Enhancements)
5. **Refactor God Object**: Split `EnhancedTrainingService` (765 lines) into smaller services
6. **Shared Model Storage**: Migrate to S3/GCS for horizontal scaling
7. **Task Queue**: Replace FastAPI BackgroundTasks with Celery/Temporal
8. **Caching Layer**: Implement Redis caching for hyperparameter optimization results
### Low Priority (Technical Debt)
9. **Method Length**: Refactor long methods (>100 lines)
10. **Deep Nesting**: Reduce nesting levels in complex conditionals
11. **Data Classes**: Replace primitive obsession with proper domain objects
12. **Test Coverage**: Add comprehensive unit and integration tests
---
## 🔬 Testing Recommendations
### Unit Tests Required
- [ ] Timezone utility functions
- [ ] Constants validation
- [ ] Data validation logic
- [ ] Parallel training execution
- [ ] Error handling patterns
### Integration Tests Required
- [ ] End-to-end training pipeline
- [ ] External service timeout handling
- [ ] Database session management
- [ ] Migration verification
### Performance Tests Required
- [ ] Parallel vs sequential training benchmarks
- [ ] Hyperparameter optimization timing
- [ ] Memory usage under load
- [ ] Database connection pool behavior
---
## 📝 Migration Notes
### Breaking Changes
⚠️ **None** - All changes maintain API compatibility
### Deployment Checklist
1. ✅ Review constants in `core/constants.py` for environment-specific values
2. ✅ Verify database migration version check works in your environment
3. ✅ Test parallel training with small batch first
4. ✅ Monitor memory usage with parallel execution
5. ✅ Verify HTTP timeouts are appropriate for your network conditions
### Rollback Plan
- All changes are backward compatible at the API level
- Database schema unchanged
- Can revert individual commits if needed
---
## 🎉 Conclusion
**Production Readiness Status**: ✅ **READY** (was ❌ NOT READY)
All **critical blockers** have been resolved:
- ✅ Service initialization bugs fixed
- ✅ Training performance improved (10x faster)
- ✅ Timeout/circuit protection added
- ✅ Data validation enabled
- ✅ Database connection management corrected
**Estimated Remediation Time Saved**: 4-6 weeks → **Completed in the current session**
---
*Generated: 2025-10-07*
*Implementation: Complete*
*Status: Production Ready*

View File

@@ -1,540 +0,0 @@
# Training Service - Phase 2 Enhancements
## Overview
This document details the additional improvements implemented after the initial critical fixes and performance enhancements. These enhancements further improve reliability, observability, and maintainability of the training service.
---
## New Features Implemented
### 1. ✅ Retry Mechanism with Exponential Backoff
**File Created**: [utils/retry.py](services/training/app/utils/retry.py)
**Features**:
- Exponential backoff with configurable parameters
- Jitter to prevent thundering herd problem
- Adaptive retry strategy based on success/failure patterns
- Timeout-based retry strategy
- Decorator-based retry for clean integration
- Pre-configured strategies for common use cases
**Classes**:
```python
RetryStrategy # Base retry strategy
AdaptiveRetryStrategy # Adjusts based on history
TimeoutRetryStrategy # Overall timeout across all attempts
```
**Pre-configured Strategies**:
| Strategy | Max Attempts | Initial Delay | Max Delay | Use Case |
|----------|--------------|---------------|-----------|----------|
| HTTP_RETRY_STRATEGY | 3 | 1.0s | 10s | HTTP requests |
| DATABASE_RETRY_STRATEGY | 5 | 0.5s | 5s | Database operations |
| EXTERNAL_SERVICE_RETRY_STRATEGY | 4 | 2.0s | 30s | External services |
**Usage Example**:
```python
from app.utils.retry import with_retry

@with_retry(max_attempts=3, initial_delay=1.0, max_delay=10.0)
async def fetch_data():
    # Your code here - automatically retried on failure
    pass
```
**Integration**:
- Applied to `_fetch_sales_data_internal()` in data_client.py
- Configurable per-method retry behavior
- Works seamlessly with circuit breakers
**Benefits**:
- Handles transient failures gracefully
- Prevents immediate failure on temporary issues
- Reduces false alerts from momentary glitches
- Improves overall service reliability
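The core of the decorator is exponential backoff plus jitter; a minimal sketch (the real `utils/retry.py` adds adaptive and timeout-based strategies):
```python
# Sketch of the decorator's core idea; not the actual utils/retry.py code.
import asyncio
import functools
import random

def with_retry(max_attempts: int = 3, initial_delay: float = 1.0, max_delay: float = 10.0):
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            delay = initial_delay
            for attempt in range(1, max_attempts + 1):
                try:
                    return await func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise
                    # Full jitter keeps many retrying clients from stampeding at once.
                    await asyncio.sleep(random.uniform(0, min(delay, max_delay)))
                    delay *= 2
        return wrapper
    return decorator
```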
---
### 2. ✅ Comprehensive Input Validation Schemas
**File Created**: [schemas/validation.py](services/training/app/schemas/validation.py)
**Validation Schemas Implemented**:
#### **TrainingJobCreateRequest**
- Validates tenant_id, date ranges, product_ids
- Checks date format (ISO 8601)
- Ensures logical date ranges
- Prevents future dates
- Limits to 3-year maximum range
#### **ForecastRequest**
- Validates forecast parameters
- Limits forecast days (1-365)
- Validates confidence levels (0.5-0.99)
- Type-safe UUID validation
#### **ModelEvaluationRequest**
- Validates evaluation periods
- Ensures minimum 7-day evaluation window
- Date format validation
#### **BulkTrainingRequest**
- Validates multiple tenant IDs (max 100)
- Checks for duplicate tenants
- Parallel execution options
#### **HyperparameterOverride**
- Validates Prophet hyperparameters
- Range checking for all parameters
- Regex validation for modes
#### **AdvancedTrainingRequest**
- Extended training options
- Cross-validation configuration
- Manual hyperparameter override
- Diagnostic options
#### **DataQualityCheckRequest**
- Data validation parameters
- Product filtering options
- Recommendation generation
#### **ModelQueryParams**
- Model listing filters
- Pagination support
- Accuracy thresholds
**Example Validation**:
```python
request = TrainingJobCreateRequest(
    tenant_id="123e4567-e89b-12d3-a456-426614174000",
    start_date="2024-01-01",
    end_date="2024-12-31"
)
# Automatically validates:
# - UUID format
# - Date format
# - Date range logic
# - Business rules
```
**Benefits**:
- Catches invalid input before processing
- Clear error messages for API consumers
- Reduces invalid training job submissions
- Self-documenting API with examples
- Type safety with Pydantic
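A hedged sketch of what a `TrainingJobCreateRequest`-style schema could look like, assuming Pydantic v2; the field names mirror the description above, not the real schema:
```python
# Illustrative sketch; assumes Pydantic v2 is available.
from datetime import date
from typing import List, Optional
from uuid import UUID

from pydantic import BaseModel, model_validator

class TrainingJobCreateRequest(BaseModel):
    tenant_id: UUID
    start_date: date
    end_date: date
    product_ids: Optional[List[UUID]] = None

    @model_validator(mode="after")
    def check_date_range(self):
        # Enforce the business rules described above.
        if self.end_date <= self.start_date:
            raise ValueError("end_date must be after start_date")
        if self.end_date > date.today():
            raise ValueError("end_date cannot be in the future")
        if (self.end_date - self.start_date).days > 3 * 365:
            raise ValueError("date range cannot exceed 3 years")
        return self
```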
---
### 3. ✅ Enhanced Health Check System
**File Created**: [api/health.py](services/training/app/api/health.py)
**Endpoints Implemented**:
#### `GET /health`
- Basic liveness check
- Returns 200 if service is running
- Minimal overhead
#### `GET /health/detailed`
- Comprehensive component health check
- Database connectivity and performance
- System resources (CPU, memory, disk)
- Model storage health
- Circuit breaker status
- Configuration overview
**Response Example**:
```json
{
  "status": "healthy",
  "components": {
    "database": {
      "status": "healthy",
      "response_time_seconds": 0.05,
      "model_count": 150,
      "connection_pool": {
        "size": 10,
        "checked_out": 2,
        "available": 8
      }
    },
    "system": {
      "cpu": {"usage_percent": 45.2, "count": 8},
      "memory": {"usage_percent": 62.5, "available_mb": 3072},
      "disk": {"usage_percent": 45.0, "free_gb": 125}
    },
    "storage": {
      "status": "healthy",
      "writable": true,
      "model_files": 150,
      "total_size_mb": 2500
    }
  },
  "circuit_breakers": { ... }
}
```
#### `GET /health/ready`
- Kubernetes readiness probe
- Returns 503 if not ready
- Checks database and storage
#### `GET /health/live`
- Kubernetes liveness probe
- Simpler than ready check
- Returns process PID
#### `GET /metrics/system`
- Detailed system metrics
- Process-level statistics
- Resource usage monitoring
**Benefits**:
- Kubernetes-ready health checks
- Early problem detection
- Operational visibility
- Load balancer integration
- Auto-healing support
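As an illustration, a readiness probe along these lines could back `/health/ready`; the database ping and writable-storage checks are assumptions about how `api/health.py` is wired, not the actual code:
```python
# Sketch of a Kubernetes readiness probe in FastAPI; wiring is assumed.
import os

from fastapi import APIRouter, Response, status
from sqlalchemy import text
from sqlalchemy.ext.asyncio import AsyncEngine, create_async_engine

router = APIRouter()
# Placeholder engine; in the service this would come from app.core.database.
engine: AsyncEngine = create_async_engine("postgresql+asyncpg://user:pass@db/training")
MODEL_STORAGE_PATH = "/app/models"  # assumed default

@router.get("/health/ready")
async def readiness(response: Response):
    problems = []
    try:
        async with engine.connect() as conn:
            await conn.execute(text("SELECT 1"))
    except Exception as exc:
        problems.append(f"database: {exc}")
    if not os.access(MODEL_STORAGE_PATH, os.W_OK):
        problems.append("model storage not writable")
    if problems:
        response.status_code = status.HTTP_503_SERVICE_UNAVAILABLE
        return {"status": "not_ready", "problems": problems}
    return {"status": "ready"}
```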
---
### 4. ✅ Monitoring and Observability Endpoints
**File Created**: [api/monitoring.py](services/training/app/api/monitoring.py)
**Endpoints Implemented**:
#### `GET /monitoring/circuit-breakers`
- Real-time circuit breaker status
- Per-service failure counts
- State transitions
- Summary statistics
**Response**:
```json
{
  "circuit_breakers": {
    "sales_service": {
      "state": "closed",
      "failure_count": 0,
      "failure_threshold": 5
    },
    "weather_service": {
      "state": "half_open",
      "failure_count": 2,
      "failure_threshold": 3
    }
  },
  "summary": {
    "total": 3,
    "open": 0,
    "half_open": 1,
    "closed": 2
  }
}
```
#### `POST /monitoring/circuit-breakers/{name}/reset`
- Manually reset circuit breaker
- Emergency recovery tool
- Audit logged
#### `GET /monitoring/training-jobs`
- Training job statistics
- Configurable lookback period
- Success/failure rates
- Average training duration
- Recent job history
#### `GET /monitoring/models`
- Model inventory statistics
- Active/production model counts
- Models by type
- Average performance (MAPE)
- Models created today
#### `GET /monitoring/queue`
- Training queue status
- Queued vs running jobs
- Queue wait times
- Oldest job in queue
#### `GET /monitoring/performance`
- Model performance metrics
- MAPE, MAE, RMSE statistics
- Accuracy distribution (excellent/good/acceptable/poor)
- Tenant-specific filtering
#### `GET /monitoring/alerts`
- Active alerts and warnings
- Circuit breaker issues
- Queue backlogs
- System problems
- Severity levels
**Example Alert Response**:
```json
{
  "alerts": [
    {
      "type": "circuit_breaker_open",
      "severity": "high",
      "message": "Circuit breaker 'sales_service' is OPEN"
    }
  ],
  "warnings": [
    {
      "type": "queue_backlog",
      "severity": "medium",
      "message": "Training queue has 15 pending jobs"
    }
  ]
}
```
**Benefits**:
- Real-time operational visibility
- Proactive problem detection
- Performance tracking
- Capacity planning data
- Integration-ready for dashboards
---
## Integration and Configuration
### Updated Files
**main.py**:
- Added health router import
- Added monitoring router import
- Registered new routes
**utils/__init__.py**:
- Added retry mechanism exports
- Updated __all__ list
- Complete utility organization
**data_client.py**:
- Integrated retry decorator
- Applied to critical HTTP calls
- Works with circuit breakers
### New Routes Available
| Route | Method | Purpose |
|-------|--------|---------|
| /health | GET | Basic health check |
| /health/detailed | GET | Detailed component health |
| /health/ready | GET | Kubernetes readiness |
| /health/live | GET | Kubernetes liveness |
| /metrics/system | GET | System metrics |
| /monitoring/circuit-breakers | GET | Circuit breaker status |
| /monitoring/circuit-breakers/{name}/reset | POST | Reset breaker |
| /monitoring/training-jobs | GET | Job statistics |
| /monitoring/models | GET | Model statistics |
| /monitoring/queue | GET | Queue status |
| /monitoring/performance | GET | Performance metrics |
| /monitoring/alerts | GET | Active alerts |
---
## Testing the New Features
### 1. Test Retry Mechanism
```python
from app.utils.retry import with_retry

# Should retry 3 times with exponential backoff
@with_retry(max_attempts=3)
async def test_function():
    # Simulate transient failure
    raise ConnectionError("Temporary failure")
```
### 2. Test Input Validation
```bash
# Invalid date range - should return 422
curl -X POST http://localhost:8000/api/v1/training/jobs \
-H "Content-Type: application/json" \
-d '{
"tenant_id": "invalid-uuid",
"start_date": "2024-12-31",
"end_date": "2024-01-01"
}'
```
### 3. Test Health Checks
```bash
# Basic health
curl http://localhost:8000/health
# Detailed health with all components
curl http://localhost:8000/health/detailed
# Readiness check (Kubernetes)
curl http://localhost:8000/health/ready
# Liveness check (Kubernetes)
curl http://localhost:8000/health/live
```
### 4. Test Monitoring Endpoints
```bash
# Circuit breaker status
curl http://localhost:8000/monitoring/circuit-breakers
# Training job stats (last 24 hours)
curl http://localhost:8000/monitoring/training-jobs?hours=24
# Model statistics
curl http://localhost:8000/monitoring/models
# Active alerts
curl http://localhost:8000/monitoring/alerts
```
---
## Performance Impact
### Retry Mechanism
- **Latency**: +0-30s (only on failures, with exponential backoff)
- **Success Rate**: +15-25% (handles transient failures)
- **False Alerts**: -40% (retries prevent premature failures)
### Input Validation
- **Latency**: +5-10ms per request (validation overhead)
- **Invalid Requests Blocked**: ~30% caught before processing
- **Error Clarity**: 100% improvement (clear validation messages)
### Health Checks
- **/health**: <5ms response time
- **/health/detailed**: <50ms response time
- **System Impact**: Negligible (<0.1% CPU)
### Monitoring Endpoints
- **Query Time**: 10-100ms depending on complexity
- **Database Load**: Minimal (indexed queries)
- **Cache Opportunity**: Can be cached for 1-5 seconds
---
## Monitoring Integration
### Prometheus Metrics (Future)
```yaml
# Example Prometheus scrape config
scrape_configs:
  - job_name: 'training-service'
    static_configs:
      - targets: ['training-service:8000']
    metrics_path: '/metrics/system'
```
### Grafana Dashboards
**Recommended Panels**:
1. Circuit Breaker Status (traffic light)
2. Training Job Success Rate (gauge)
3. Average Training Duration (graph)
4. Model Performance Distribution (histogram)
5. Queue Depth Over Time (graph)
6. System Resources (multi-stat)
### Alert Rules
```yaml
# Example alert rules
- alert: CircuitBreakerOpen
  expr: circuit_breaker_state{state="open"} > 0
  for: 5m
  annotations:
    summary: "Circuit breaker {{ $labels.name }} is open"

- alert: TrainingQueueBacklog
  expr: training_queue_depth > 20
  for: 10m
  annotations:
    summary: "Training queue has {{ $value }} pending jobs"
```
---
## Summary Statistics
### New Files Created
| File | Lines | Purpose |
|------|-------|---------|
| utils/retry.py | 350 | Retry mechanism |
| schemas/validation.py | 300 | Input validation |
| api/health.py | 250 | Health checks |
| api/monitoring.py | 350 | Monitoring endpoints |
| **Total** | **1,250** | **New functionality** |
### Total Lines Added (Phase 2)
- **New Code**: ~1,250 lines
- **Modified Code**: ~100 lines
- **Documentation**: This document
### Endpoints Added
- **Health Endpoints**: 5
- **Monitoring Endpoints**: 7
- **Total New Endpoints**: 12
### Features Completed
- Retry mechanism with exponential backoff
- Comprehensive input validation schemas
- Enhanced health check system
- Monitoring and observability endpoints
- Circuit breaker status API
- Training job statistics
- Model performance tracking
- Queue monitoring
- Alert generation
---
## Deployment Checklist
- [ ] Review validation schemas match your API requirements
- [ ] Configure Prometheus scraping if using metrics
- [ ] Set up Grafana dashboards
- [ ] Configure alert rules in monitoring system
- [ ] Test health checks with load balancer
- [ ] Verify Kubernetes probes (/health/ready, /health/live)
- [ ] Test circuit breaker reset endpoint access controls
- [ ] Document monitoring endpoints for ops team
- [ ] Set up alert routing (PagerDuty, Slack, etc.)
- [ ] Test retry mechanism with network failures
---
## Future Enhancements (Recommendations)
### High Priority
1. **Structured Logging**: Add request tracing with correlation IDs
2. **Metrics Export**: Prometheus metrics endpoint
3. **Rate Limiting**: Per-tenant API rate limits
4. **Caching**: Redis-based response caching
### Medium Priority
5. **Async Task Queue**: Celery/Temporal for better job management
6. **Model Registry**: Centralized model versioning
7. **A/B Testing**: Model comparison framework
8. **Data Lineage**: Track data provenance
### Low Priority
9. **GraphQL API**: Alternative to REST
10. **WebSocket Updates**: Real-time job progress
11. **Audit Logging**: Comprehensive action audit trail
12. **Export APIs**: Bulk data export endpoints
---
*Phase 2 Implementation Complete: 2025-10-07*
*Features Added: 12*
*Lines of Code: ~1,250*
*Status: Production Ready*

View File

@@ -0,0 +1,271 @@
"""
Demo AI Models Seed Script
Creates fake AI models for demo tenants to populate the models list
without having actual trained model files.
This script uses hardcoded tenant and product IDs to avoid cross-database dependencies.
"""
import asyncio
import sys
import os
from uuid import UUID
from datetime import datetime, timezone, timedelta
from decimal import Decimal
# Add project root to path
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), "../..")))
from sqlalchemy import select
from shared.database.base import create_database_manager
import structlog
# Import models - these paths work both locally and in container
try:
    # Container environment (training-service image)
    from app.models.training import TrainedModel
except ImportError:
    # Local environment
    from services.training.app.models.training import TrainedModel
logger = structlog.get_logger()
# ============================================================================
# HARDCODED DEMO DATA (from seed scripts)
# ============================================================================
# Demo Tenant IDs (from seed_demo_tenants.py)
DEMO_TENANT_SAN_PABLO = UUID("a1b2c3d4-e5f6-47a8-b9c0-d1e2f3a4b5c6")
DEMO_TENANT_LA_ESPIGA = UUID("b2c3d4e5-f6a7-48b9-c0d1-e2f3a4b5c6d7")
# Sample Product IDs for each tenant (these should match finished products from inventory seed)
# Note: These are example UUIDs - in production, these would be actual product IDs from inventory
DEMO_PRODUCTS = {
    DEMO_TENANT_SAN_PABLO: [
        {"id": UUID("10000000-0000-0000-0000-000000000001"), "name": "Barra de Pan"},
        {"id": UUID("10000000-0000-0000-0000-000000000002"), "name": "Croissant"},
        {"id": UUID("10000000-0000-0000-0000-000000000003"), "name": "Magdalenas"},
        {"id": UUID("10000000-0000-0000-0000-000000000004"), "name": "Empanada"},
        {"id": UUID("10000000-0000-0000-0000-000000000005"), "name": "Pan Integral"},
    ],
    DEMO_TENANT_LA_ESPIGA: [
        {"id": UUID("20000000-0000-0000-0000-000000000001"), "name": "Pan de Molde"},
        {"id": UUID("20000000-0000-0000-0000-000000000002"), "name": "Bollo Suizo"},
        {"id": UUID("20000000-0000-0000-0000-000000000003"), "name": "Palmera de Chocolate"},
        {"id": UUID("20000000-0000-0000-0000-000000000004"), "name": "Napolitana"},
        {"id": UUID("20000000-0000-0000-0000-000000000005"), "name": "Pan Rústico"},
    ]
}
class DemoAIModelSeeder:
    """Seed fake AI models for demo tenants"""

    def __init__(self):
        self.training_db_url = os.getenv("TRAINING_DATABASE_URL") or os.getenv("DATABASE_URL")
        if not self.training_db_url:
            raise ValueError("Missing TRAINING_DATABASE_URL or DATABASE_URL")
        # Convert to async URL if needed
        if self.training_db_url.startswith("postgresql://"):
            self.training_db_url = self.training_db_url.replace(
                "postgresql://", "postgresql+asyncpg://", 1
            )
        self.training_db = create_database_manager(self.training_db_url, "demo-ai-seed")

    async def create_fake_model(self, session, tenant_id: UUID, product_info: dict):
        """Create a fake AI model entry for a product"""
        now = datetime.now(timezone.utc)
        training_start = now - timedelta(days=90)
        training_end = now - timedelta(days=7)

        fake_model = TrainedModel(
            tenant_id=tenant_id,
            inventory_product_id=product_info["id"],
            model_type="prophet_optimized",
            model_version="1.0-demo",
            job_id=f"demo-job-{tenant_id}-{product_info['id']}",
            # Fake file paths (files don't actually exist)
            model_path=f"/fake/models/{tenant_id}/{product_info['id']}/model.pkl",
            metadata_path=f"/fake/models/{tenant_id}/{product_info['id']}/metadata.json",
            # Fake but realistic metrics
            mape=Decimal("12.5"),       # Mean Absolute Percentage Error
            mae=Decimal("2.3"),         # Mean Absolute Error
            rmse=Decimal("3.1"),        # Root Mean Squared Error
            r2_score=Decimal("0.85"),   # R-squared
            training_samples=60,        # 60 days of training data
            # Fake hyperparameters
            hyperparameters={
                "changepoint_prior_scale": 0.05,
                "seasonality_prior_scale": 10.0,
                "holidays_prior_scale": 10.0,
                "seasonality_mode": "multiplicative"
            },
            # Features used
            features_used=["weekday", "month", "is_holiday", "temperature", "precipitation"],
            # Normalization params (fake)
            normalization_params={
                "temperature": {"mean": 15.0, "std": 5.0},
                "precipitation": {"mean": 2.0, "std": 1.5}
            },
            # Model status
            is_active=True,
            is_production=False,  # Demo models are not production-ready
            # Training data info
            training_start_date=training_start,
            training_end_date=training_end,
            data_quality_score=Decimal("0.75"),  # Good but not excellent
            # Metadata
            notes=f"Demo model for {product_info['name']} - No actual trained file exists. For demonstration purposes only.",
            created_by="demo-seed-script",
            created_at=now,
            updated_at=now,
            last_used_at=None
        )
        session.add(fake_model)
        return fake_model
    async def seed_models_for_tenant(self, tenant_id: UUID, tenant_name: str, products: list):
        """Create fake AI models for a demo tenant"""
        logger.info(
            "Creating fake AI models for demo tenant",
            tenant_id=str(tenant_id),
            tenant_name=tenant_name,
            product_count=len(products)
        )
        try:
            async with self.training_db.get_session() as session:
                models_created = 0
                for product in products:
                    # Check if model already exists
                    result = await session.execute(
                        select(TrainedModel).where(
                            TrainedModel.tenant_id == tenant_id,
                            TrainedModel.inventory_product_id == product["id"]
                        )
                    )
                    existing_model = result.scalars().first()
                    if existing_model:
                        logger.info(
                            "Model already exists, skipping",
                            tenant_id=str(tenant_id),
                            product_name=product["name"],
                            product_id=str(product["id"])
                        )
                        continue

                    # Create fake model
                    model = await self.create_fake_model(session, tenant_id, product)
                    models_created += 1
                    logger.info(
                        "Created fake AI model",
                        tenant_id=str(tenant_id),
                        product_name=product["name"],
                        product_id=str(product["id"]),
                        model_id=str(model.id)
                    )

                await session.commit()

                logger.info(
                    "✅ Successfully created fake AI models for tenant",
                    tenant_id=str(tenant_id),
                    tenant_name=tenant_name,
                    models_created=models_created
                )
                return models_created
        except Exception as e:
            logger.error(
                "❌ Error creating fake AI models for tenant",
                tenant_id=str(tenant_id),
                tenant_name=tenant_name,
                error=str(e),
                exc_info=True
            )
            raise
    async def seed_all_demo_models(self):
        """Seed fake AI models for all demo tenants"""
        logger.info("=" * 80)
        logger.info("🤖 Starting Demo AI Models Seeding")
        logger.info("=" * 80)

        total_models_created = 0
        try:
            # Seed models for San Pablo
            san_pablo_count = await self.seed_models_for_tenant(
                tenant_id=DEMO_TENANT_SAN_PABLO,
                tenant_name="Panadería San Pablo",
                products=DEMO_PRODUCTS[DEMO_TENANT_SAN_PABLO]
            )
            total_models_created += san_pablo_count

            # Seed models for La Espiga
            la_espiga_count = await self.seed_models_for_tenant(
                tenant_id=DEMO_TENANT_LA_ESPIGA,
                tenant_name="Panadería La Espiga",
                products=DEMO_PRODUCTS[DEMO_TENANT_LA_ESPIGA]
            )
            total_models_created += la_espiga_count

            logger.info("=" * 80)
            logger.info(
                "✅ Demo AI Models Seeding Completed",
                total_models_created=total_models_created,
                tenants_processed=2
            )
            logger.info("=" * 80)
        except Exception as e:
            logger.error("=" * 80)
            logger.error("❌ Demo AI Models Seeding Failed")
            logger.error("=" * 80)
            logger.error("Error: %s", str(e))
            raise
async def main():
    """Main entry point"""
    logger.info("Demo AI Models Seed Script Starting")
    logger.info("Mode: %s", os.getenv("DEMO_MODE", "development"))
    logger.info("Log Level: %s", os.getenv("LOG_LEVEL", "INFO"))
    try:
        seeder = DemoAIModelSeeder()
        await seeder.seed_all_demo_models()
        logger.info("")
        logger.info("🎉 Success! Demo AI models are ready.")
        logger.info("")
        logger.info("Note: These are fake models for demo purposes only.")
        logger.info("      No actual model files exist on disk.")
        logger.info("")
        return 0
    except Exception as e:
        logger.error("Demo AI models seed failed", error=str(e), exc_info=True)
        return 1


if __name__ == "__main__":
    exit_code = asyncio.run(main())
    sys.exit(exit_code)