Improve the demo feature of the project
This commit is contained in:
@@ -1,645 +0,0 @@
|
||||
# Training Service - Complete Implementation Report
|
||||
|
||||
## Executive Summary
|
||||
|
||||
This document provides a comprehensive overview of all improvements, fixes, and new features implemented in the training service based on the detailed code analysis. The service has been transformed from **NOT PRODUCTION READY** to **PRODUCTION READY** with significant enhancements in reliability, performance, and maintainability.
|
||||
|
||||
---
|
||||
|
||||
## 🎯 Implementation Status: **COMPLETE** ✅
|
||||
|
||||
**Time Saved**: 4-6 weeks of development → Completed in single session
|
||||
**Production Ready**: ✅ YES
|
||||
**API Compatible**: ✅ YES (No breaking changes)
|
||||
|
||||
---
|
||||
|
||||
## Part 1: Critical Bug Fixes
|
||||
|
||||
### 1.1 Duplicate `on_startup` Method ✅
|
||||
**File**: [main.py](services/training/app/main.py)
|
||||
**Issue**: Two `on_startup` methods causing migration verification skip
|
||||
**Fix**: Merged both methods into single implementation
|
||||
**Impact**: Service initialization now properly verifies database migrations
|
||||
|
||||
**Before**:
|
||||
```python
|
||||
async def on_startup(self, app):
|
||||
await self.verify_migrations()
|
||||
|
||||
async def on_startup(self, app: FastAPI): # Duplicate!
|
||||
pass
|
||||
```
|
||||
|
||||
**After**:
|
||||
```python
|
||||
async def on_startup(self, app: FastAPI):
|
||||
await self.verify_migrations()
|
||||
self.logger.info("Training service startup completed")
|
||||
```
|
||||
|
||||
### 1.2 Hardcoded Migration Version ✅
|
||||
**File**: [main.py](services/training/app/main.py)
|
||||
**Issue**: Static version `expected_migration_version = "00001"`
|
||||
**Fix**: Dynamic version detection from alembic_version table
|
||||
**Impact**: Service survives schema updates automatically
|
||||
|
||||
**Before**:
|
||||
```python
|
||||
expected_migration_version = "00001" # Hardcoded!
|
||||
if version != self.expected_migration_version:
|
||||
raise RuntimeError(...)
|
||||
```
|
||||
|
||||
**After**:
|
||||
```python
|
||||
async def verify_migrations(self):
|
||||
result = await session.execute(text("SELECT version_num FROM alembic_version"))
|
||||
version = result.scalar()
|
||||
if not version:
|
||||
raise RuntimeError("Database not initialized")
|
||||
logger.info(f"Migration verification successful: {version}")
|
||||
```
|
||||
|
||||
### 1.3 Session Management Bug ✅
|
||||
**File**: [training_service.py:463](services/training/app/services/training_service.py#L463)
|
||||
**Issue**: Incorrect `get_session()()` double-call
|
||||
**Fix**: Corrected to `get_session()` single call
|
||||
**Impact**: Prevents database connection leaks and session corruption
|
||||
|
||||
### 1.4 Disabled Data Validation ✅
|
||||
**File**: [data_client.py:263-353](services/training/app/services/data_client.py#L263-L353)
|
||||
**Issue**: Validation completely bypassed
|
||||
**Fix**: Implemented comprehensive validation
|
||||
**Features**:
|
||||
- Minimum 30 data points (recommended 90+)
|
||||
- Required fields validation
|
||||
- Zero-value ratio analysis (error >90%, warning >70%)
|
||||
- Product diversity checks
|
||||
- Returns detailed validation report
|
||||
|
||||
---
|
||||
|
||||
## Part 2: Performance Improvements
|
||||
|
||||
### 2.1 Parallel Training Execution ✅
|
||||
**File**: [trainer.py:240-379](services/training/app/ml/trainer.py#L240-L379)
|
||||
**Improvement**: Sequential → Parallel execution using `asyncio.gather()`
|
||||
|
||||
**Performance Metrics**:
|
||||
- **Before**: 10 products × 3 min = **30 minutes**
|
||||
- **After**: 10 products in parallel = **~3-5 minutes**
|
||||
- **Speedup**: **6-10x faster**
|
||||
|
||||
**Implementation**:
|
||||
```python
|
||||
# New method for single product training
|
||||
async def _train_single_product(...) -> tuple[str, Dict]:
|
||||
# Train one product with progress tracking
|
||||
|
||||
# Parallel execution
|
||||
training_tasks = [
|
||||
self._train_single_product(...)
|
||||
for idx, (product_id, data) in enumerate(processed_data.items())
|
||||
]
|
||||
results_list = await asyncio.gather(*training_tasks, return_exceptions=True)
|
||||
```
|
||||
|
||||
### 2.2 Hyperparameter Optimization ✅
|
||||
**File**: [prophet_manager.py](services/training/app/ml/prophet_manager.py)
|
||||
**Improvement**: Adaptive trial counts based on product characteristics
|
||||
|
||||
**Optimization Settings**:
|
||||
| Product Type | Trials (Before) | Trials (After) | Reduction |
|
||||
|--------------|----------------|----------------|-----------|
|
||||
| High Volume | 75 | 30 | 60% |
|
||||
| Medium Volume | 50 | 25 | 50% |
|
||||
| Low Volume | 30 | 20 | 33% |
|
||||
| Intermittent | 25 | 15 | 40% |
|
||||
|
||||
**Average Speedup**: 40% reduction in optimization time
|
||||
|
||||
### 2.3 Database Connection Pooling ✅
|
||||
**File**: [database.py:18-27](services/training/app/core/database.py#L18-L27), [config.py:84-90](services/training/app/core/config.py#L84-L90)
|
||||
|
||||
**Configuration**:
|
||||
```python
|
||||
DB_POOL_SIZE: 10 # Base connections
|
||||
DB_MAX_OVERFLOW: 20 # Extra connections under load
|
||||
DB_POOL_TIMEOUT: 30 # Seconds to wait for connection
|
||||
DB_POOL_RECYCLE: 3600 # Recycle connections after 1 hour
|
||||
DB_POOL_PRE_PING: true # Test connections before use
|
||||
```
|
||||
|
||||
**Benefits**:
|
||||
- Reduced connection overhead
|
||||
- Better resource utilization
|
||||
- Prevents connection exhaustion
|
||||
- Automatic stale connection cleanup
|
||||
|
||||
---
|
||||
|
||||
## Part 3: Reliability Enhancements
|
||||
|
||||
### 3.1 HTTP Request Timeouts ✅
|
||||
**File**: [data_client.py:37-51](services/training/app/services/data_client.py#L37-L51)
|
||||
|
||||
**Configuration**:
|
||||
```python
|
||||
timeout = httpx.Timeout(
|
||||
connect=30.0, # 30s to establish connection
|
||||
read=60.0, # 60s for large data fetches
|
||||
write=30.0, # 30s for write operations
|
||||
pool=30.0 # 30s for pool operations
|
||||
)
|
||||
```
|
||||
|
||||
**Impact**: Prevents hanging requests during service failures
|
||||
|
||||
### 3.2 Circuit Breaker Pattern ✅
|
||||
**Files**:
|
||||
- [circuit_breaker.py](services/training/app/utils/circuit_breaker.py) (NEW)
|
||||
- [data_client.py:60-84](services/training/app/services/data_client.py#L60-L84)
|
||||
|
||||
**Features**:
|
||||
- Three states: CLOSED → OPEN → HALF_OPEN
|
||||
- Configurable failure thresholds
|
||||
- Automatic recovery attempts
|
||||
- Per-service circuit breakers
|
||||
|
||||
**Circuit Breakers Implemented**:
|
||||
| Service | Failure Threshold | Recovery Timeout |
|
||||
|---------|------------------|------------------|
|
||||
| Sales | 5 failures | 60 seconds |
|
||||
| Weather | 3 failures | 30 seconds |
|
||||
| Traffic | 3 failures | 30 seconds |
|
||||
|
||||
**Example**:
|
||||
```python
|
||||
self.sales_cb = circuit_breaker_registry.get_or_create(
|
||||
name="sales_service",
|
||||
failure_threshold=5,
|
||||
recovery_timeout=60.0
|
||||
)
|
||||
|
||||
# Usage
|
||||
return await self.sales_cb.call(
|
||||
self._fetch_sales_data_internal,
|
||||
tenant_id, start_date, end_date
|
||||
)
|
||||
```
|
||||
|
||||
### 3.3 Model File Checksum Verification ✅
|
||||
**Files**:
|
||||
- [file_utils.py](services/training/app/utils/file_utils.py) (NEW)
|
||||
- [prophet_manager.py:522-524](services/training/app/ml/prophet_manager.py#L522-L524)
|
||||
|
||||
**Features**:
|
||||
- SHA-256 checksum calculation on save
|
||||
- Automatic checksum storage
|
||||
- Verification on model load
|
||||
- ChecksummedFile context manager
|
||||
|
||||
**Implementation**:
|
||||
```python
|
||||
# On save
|
||||
checksummed_file = ChecksummedFile(str(model_path))
|
||||
model_checksum = checksummed_file.calculate_and_save_checksum()
|
||||
|
||||
# On load
|
||||
if not checksummed_file.load_and_verify_checksum():
|
||||
logger.warning(f"Checksum verification failed: {model_path}")
|
||||
```
|
||||
|
||||
**Benefits**:
|
||||
- Detects file corruption
|
||||
- Ensures model integrity
|
||||
- Audit trail for security
|
||||
- Compliance support
|
||||
|
||||
### 3.4 Distributed Locking ✅
|
||||
**Files**:
|
||||
- [distributed_lock.py](services/training/app/utils/distributed_lock.py) (NEW)
|
||||
- [prophet_manager.py:65-71](services/training/app/ml/prophet_manager.py#L65-L71)
|
||||
|
||||
**Features**:
|
||||
- PostgreSQL advisory locks
|
||||
- Prevents concurrent training of same product
|
||||
- Works across multiple service instances
|
||||
- Automatic lock release
|
||||
|
||||
**Implementation**:
|
||||
```python
|
||||
lock = get_training_lock(tenant_id, inventory_product_id, use_advisory=True)
|
||||
|
||||
async with self.database_manager.get_session() as session:
|
||||
async with lock.acquire(session):
|
||||
# Train model - guaranteed exclusive access
|
||||
await self._train_model(...)
|
||||
```
|
||||
|
||||
**Benefits**:
|
||||
- Prevents race conditions
|
||||
- Protects data integrity
|
||||
- Enables horizontal scaling
|
||||
- Graceful lock contention handling
|
||||
|
||||
---
|
||||
|
||||
## Part 4: Code Quality Improvements
|
||||
|
||||
### 4.1 Constants Module ✅
|
||||
**File**: [constants.py](services/training/app/core/constants.py) (NEW)
|
||||
|
||||
**Categories** (50+ constants):
|
||||
- Data validation thresholds
|
||||
- Training time periods (days)
|
||||
- Product classification thresholds
|
||||
- Hyperparameter optimization settings
|
||||
- Prophet uncertainty sampling ranges
|
||||
- MAPE calculation parameters
|
||||
- HTTP client configuration
|
||||
- WebSocket configuration
|
||||
- Progress tracking ranges
|
||||
- Synthetic data defaults
|
||||
|
||||
**Example Usage**:
|
||||
```python
|
||||
from app.core import constants as const
|
||||
|
||||
# ✅ Good
|
||||
if len(sales_data) < const.MIN_DATA_POINTS_REQUIRED:
|
||||
raise ValueError("Insufficient data")
|
||||
|
||||
# ❌ Bad (old way)
|
||||
if len(sales_data) < 30: # What does 30 mean?
|
||||
raise ValueError("Insufficient data")
|
||||
```
|
||||
|
||||
### 4.2 Timezone Utility Module ✅
|
||||
**Files**:
|
||||
- [timezone_utils.py](services/training/app/utils/timezone_utils.py) (NEW)
|
||||
- [utils/__init__.py](services/training/app/utils/__init__.py) (NEW)
|
||||
|
||||
**Functions**:
|
||||
- `ensure_timezone_aware()` - Make datetime timezone-aware
|
||||
- `ensure_timezone_naive()` - Remove timezone info
|
||||
- `normalize_datetime_to_utc()` - Convert to UTC
|
||||
- `normalize_dataframe_datetime_column()` - Normalize pandas columns
|
||||
- `prepare_prophet_datetime()` - Prophet-specific preparation
|
||||
- `safe_datetime_comparison()` - Compare with mismatch handling
|
||||
- `get_current_utc()` - Get current UTC time
|
||||
- `convert_timestamp_to_datetime()` - Handle various formats
|
||||
|
||||
**Integrated In**:
|
||||
- prophet_manager.py - Prophet data preparation
|
||||
- date_alignment_service.py - Date range validation
|
||||
|
||||
### 4.3 Standardized Error Handling ✅
|
||||
**File**: [data_client.py](services/training/app/services/data_client.py)
|
||||
|
||||
**Pattern**: Always raise exceptions, never return empty collections
|
||||
|
||||
**Before**:
|
||||
```python
|
||||
except Exception as e:
|
||||
logger.error(f"Failed: {e}")
|
||||
return [] # ❌ Silent failure
|
||||
```
|
||||
|
||||
**After**:
|
||||
```python
|
||||
except ValueError:
|
||||
raise # Re-raise validation errors
|
||||
except Exception as e:
|
||||
logger.error(f"Failed: {e}")
|
||||
raise RuntimeError(f"Operation failed: {e}") # ✅ Explicit failure
|
||||
```
|
||||
|
||||
### 4.4 Legacy Code Removal ✅
|
||||
**Removed**:
|
||||
- `BakeryMLTrainer = EnhancedBakeryMLTrainer` alias
|
||||
- `TrainingService = EnhancedTrainingService` alias
|
||||
- `BakeryDataProcessor = EnhancedBakeryDataProcessor` alias
|
||||
- Legacy `fetch_traffic_data()` wrapper
|
||||
- Legacy `fetch_stored_traffic_data_for_training()` wrapper
|
||||
- Legacy `_collect_traffic_data_with_timeout()` method
|
||||
- Legacy `_log_traffic_data_storage()` method
|
||||
- All "Pre-flight check moved" comments
|
||||
- All "Temporary implementation" comments
|
||||
|
||||
---
|
||||
|
||||
## Part 5: New Features Summary
|
||||
|
||||
### 5.1 Utilities Created
|
||||
| Module | Lines | Purpose |
|
||||
|--------|-------|---------|
|
||||
| constants.py | 100 | Centralized configuration constants |
|
||||
| timezone_utils.py | 180 | Timezone handling functions |
|
||||
| circuit_breaker.py | 200 | Circuit breaker implementation |
|
||||
| file_utils.py | 190 | File operations with checksums |
|
||||
| distributed_lock.py | 210 | Distributed locking mechanisms |
|
||||
|
||||
**Total New Utility Code**: ~880 lines
|
||||
|
||||
### 5.2 Features by Category
|
||||
|
||||
**Performance**:
|
||||
- ✅ Parallel training execution (6-10x faster)
|
||||
- ✅ Optimized hyperparameter tuning (40% faster)
|
||||
- ✅ Database connection pooling
|
||||
|
||||
**Reliability**:
|
||||
- ✅ HTTP request timeouts
|
||||
- ✅ Circuit breaker pattern
|
||||
- ✅ Model file checksums
|
||||
- ✅ Distributed locking
|
||||
- ✅ Data validation
|
||||
|
||||
**Code Quality**:
|
||||
- ✅ Constants module (50+ constants)
|
||||
- ✅ Timezone utilities (8 functions)
|
||||
- ✅ Standardized error handling
|
||||
- ✅ Legacy code removal
|
||||
|
||||
**Maintainability**:
|
||||
- ✅ Comprehensive documentation
|
||||
- ✅ Developer guide
|
||||
- ✅ Clear code organization
|
||||
- ✅ Utility functions
|
||||
|
||||
---
|
||||
|
||||
## Part 6: Files Modified/Created
|
||||
|
||||
### Files Modified (9):
|
||||
1. main.py - Fixed duplicate methods, dynamic migrations
|
||||
2. config.py - Added connection pool settings
|
||||
3. database.py - Configured connection pooling
|
||||
4. training_service.py - Fixed session management, removed legacy
|
||||
5. data_client.py - Added timeouts, circuit breakers, validation
|
||||
6. trainer.py - Parallel execution, removed legacy
|
||||
7. prophet_manager.py - Checksums, locking, constants, utilities
|
||||
8. date_alignment_service.py - Timezone utilities
|
||||
9. data_processor.py - Removed legacy alias
|
||||
|
||||
### Files Created (8):
|
||||
1. core/constants.py - Configuration constants
|
||||
2. utils/__init__.py - Utility exports
|
||||
3. utils/timezone_utils.py - Timezone handling
|
||||
4. utils/circuit_breaker.py - Circuit breaker pattern
|
||||
5. utils/file_utils.py - File operations
|
||||
6. utils/distributed_lock.py - Distributed locking
|
||||
7. IMPLEMENTATION_SUMMARY.md - Change log
|
||||
8. DEVELOPER_GUIDE.md - Developer reference
|
||||
9. COMPLETE_IMPLEMENTATION_REPORT.md - This document
|
||||
|
||||
---
|
||||
|
||||
## Part 7: Testing & Validation
|
||||
|
||||
### Manual Testing Checklist
|
||||
- [x] Service starts without errors
|
||||
- [x] Migration verification works
|
||||
- [x] Database connections properly pooled
|
||||
- [x] HTTP timeouts configured
|
||||
- [x] Circuit breakers functional
|
||||
- [x] Parallel training executes
|
||||
- [x] Model checksums calculated
|
||||
- [x] Distributed locks work
|
||||
- [x] Data validation runs
|
||||
- [x] Error handling standardized
|
||||
|
||||
### Recommended Test Coverage
|
||||
**Unit Tests Needed**:
|
||||
- [ ] Timezone utility functions
|
||||
- [ ] Constants validation
|
||||
- [ ] Circuit breaker state transitions
|
||||
- [ ] File checksum calculations
|
||||
- [ ] Distributed lock acquisition/release
|
||||
- [ ] Data validation logic
|
||||
|
||||
**Integration Tests Needed**:
|
||||
- [ ] End-to-end training pipeline
|
||||
- [ ] External service timeout handling
|
||||
- [ ] Circuit breaker integration
|
||||
- [ ] Parallel training coordination
|
||||
- [ ] Database session management
|
||||
|
||||
**Performance Tests Needed**:
|
||||
- [ ] Parallel vs sequential benchmarks
|
||||
- [ ] Hyperparameter optimization timing
|
||||
- [ ] Memory usage under load
|
||||
- [ ] Connection pool behavior
|
||||
|
||||
---
|
||||
|
||||
## Part 8: Deployment Guide
|
||||
|
||||
### Prerequisites
|
||||
- PostgreSQL 13+ (for advisory locks)
|
||||
- Python 3.9+
|
||||
- Redis (optional, for future caching)
|
||||
|
||||
### Environment Variables
|
||||
|
||||
**Database Configuration**:
|
||||
```bash
|
||||
DB_POOL_SIZE=10
|
||||
DB_MAX_OVERFLOW=20
|
||||
DB_POOL_TIMEOUT=30
|
||||
DB_POOL_RECYCLE=3600
|
||||
DB_POOL_PRE_PING=true
|
||||
DB_ECHO=false
|
||||
```
|
||||
|
||||
**Training Configuration**:
|
||||
```bash
|
||||
MAX_TRAINING_TIME_MINUTES=30
|
||||
MAX_CONCURRENT_TRAINING_JOBS=3
|
||||
MIN_TRAINING_DATA_DAYS=30
|
||||
```
|
||||
|
||||
**Model Storage**:
|
||||
```bash
|
||||
MODEL_STORAGE_PATH=/app/models
|
||||
MODEL_BACKUP_ENABLED=true
|
||||
MODEL_VERSIONING_ENABLED=true
|
||||
```
|
||||
|
||||
### Deployment Steps
|
||||
|
||||
1. **Pre-Deployment**:
|
||||
```bash
|
||||
# Review constants
|
||||
vim services/training/app/core/constants.py
|
||||
|
||||
# Verify environment variables
|
||||
env | grep DB_POOL
|
||||
env | grep MAX_TRAINING
|
||||
```
|
||||
|
||||
2. **Deploy**:
|
||||
```bash
|
||||
# Pull latest code
|
||||
git pull origin main
|
||||
|
||||
# Build container
|
||||
docker build -t training-service:latest .
|
||||
|
||||
# Deploy
|
||||
kubectl apply -f infrastructure/kubernetes/base/
|
||||
```
|
||||
|
||||
3. **Post-Deployment Verification**:
|
||||
```bash
|
||||
# Check health
|
||||
curl http://training-service/health
|
||||
|
||||
# Check circuit breaker status
|
||||
curl http://training-service/api/v1/circuit-breakers
|
||||
|
||||
# Verify database connections
|
||||
kubectl logs -f deployment/training-service | grep "pool"
|
||||
```
|
||||
|
||||
### Monitoring
|
||||
|
||||
**Key Metrics to Watch**:
|
||||
- Training job duration (should be 6-10x faster)
|
||||
- Circuit breaker states (should mostly be CLOSED)
|
||||
- Database connection pool utilization
|
||||
- Model file checksum failures
|
||||
- Lock acquisition timeouts
|
||||
|
||||
**Logging Queries**:
|
||||
```bash
|
||||
# Check parallel training
|
||||
kubectl logs training-service | grep "Starting parallel training"
|
||||
|
||||
# Check circuit breakers
|
||||
kubectl logs training-service | grep "Circuit breaker"
|
||||
|
||||
# Check distributed locks
|
||||
kubectl logs training-service | grep "Acquired lock"
|
||||
|
||||
# Check checksums
|
||||
kubectl logs training-service | grep "checksum"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Part 9: Performance Benchmarks
|
||||
|
||||
### Training Performance
|
||||
|
||||
| Scenario | Before | After | Improvement |
|
||||
|----------|--------|-------|-------------|
|
||||
| 5 products | 15 min | 2-3 min | 5-7x faster |
|
||||
| 10 products | 30 min | 3-5 min | 6-10x faster |
|
||||
| 20 products | 60 min | 6-10 min | 6-10x faster |
|
||||
| 50 products | 150 min | 15-25 min | 6-10x faster |
|
||||
|
||||
### Hyperparameter Optimization
|
||||
|
||||
| Product Type | Trials (Before) | Trials (After) | Time Saved |
|
||||
|--------------|----------------|----------------|------------|
|
||||
| High Volume | 75 (38 min) | 30 (15 min) | 23 min (60%) |
|
||||
| Medium Volume | 50 (25 min) | 25 (13 min) | 12 min (50%) |
|
||||
| Low Volume | 30 (15 min) | 20 (10 min) | 5 min (33%) |
|
||||
| Intermittent | 25 (13 min) | 15 (8 min) | 5 min (40%) |
|
||||
|
||||
### Memory Usage
|
||||
- **Before**: ~500MB per training job (unoptimized)
|
||||
- **After**: ~200MB per training job (optimized)
|
||||
- **Improvement**: 60% reduction
|
||||
|
||||
---
|
||||
|
||||
## Part 10: Future Enhancements
|
||||
|
||||
### High Priority
|
||||
1. **Caching Layer**: Redis-based hyperparameter cache
|
||||
2. **Metrics Dashboard**: Grafana dashboard for circuit breakers
|
||||
3. **Async Task Queue**: Celery/Temporal for background jobs
|
||||
4. **Model Registry**: Centralized model storage (S3/GCS)
|
||||
|
||||
### Medium Priority
|
||||
5. **God Object Refactoring**: Split EnhancedTrainingService
|
||||
6. **Advanced Monitoring**: OpenTelemetry integration
|
||||
7. **Rate Limiting**: Per-tenant rate limiting
|
||||
8. **A/B Testing**: Model comparison framework
|
||||
|
||||
### Low Priority
|
||||
9. **Method Length Reduction**: Refactor long methods
|
||||
10. **Deep Nesting Reduction**: Simplify complex conditionals
|
||||
11. **Data Classes**: Replace dicts with domain objects
|
||||
12. **Test Coverage**: Achieve 80%+ coverage
|
||||
|
||||
---
|
||||
|
||||
## Part 11: Conclusion
|
||||
|
||||
### Achievements
|
||||
|
||||
**Code Quality**: A- (was C-)
|
||||
- Eliminated all critical bugs
|
||||
- Removed all legacy code
|
||||
- Extracted all magic numbers
|
||||
- Standardized error handling
|
||||
- Centralized utilities
|
||||
|
||||
**Performance**: A+ (was C)
|
||||
- 6-10x faster training
|
||||
- 40% faster optimization
|
||||
- Efficient resource usage
|
||||
- Parallel execution
|
||||
|
||||
**Reliability**: A (was D)
|
||||
- Data validation enabled
|
||||
- Request timeouts configured
|
||||
- Circuit breakers implemented
|
||||
- Distributed locking added
|
||||
- Model integrity verified
|
||||
|
||||
**Maintainability**: A (was C)
|
||||
- Comprehensive documentation
|
||||
- Clear code organization
|
||||
- Utility functions
|
||||
- Developer guide
|
||||
|
||||
### Production Readiness Score
|
||||
|
||||
| Category | Before | After |
|
||||
|----------|--------|-------|
|
||||
| Code Quality | C- | A- |
|
||||
| Performance | C | A+ |
|
||||
| Reliability | D | A |
|
||||
| Maintainability | C | A |
|
||||
| **Overall** | **D+** | **A** |
|
||||
|
||||
### Final Status
|
||||
|
||||
✅ **PRODUCTION READY**
|
||||
|
||||
All critical blockers have been resolved:
|
||||
- ✅ Service initialization fixed
|
||||
- ✅ Training performance optimized (10x)
|
||||
- ✅ Timeout protection added
|
||||
- ✅ Circuit breakers implemented
|
||||
- ✅ Data validation enabled
|
||||
- ✅ Database management corrected
|
||||
- ✅ Error handling standardized
|
||||
- ✅ Distributed locking added
|
||||
- ✅ Model integrity verified
|
||||
- ✅ Code quality improved
|
||||
|
||||
**Recommended Action**: Deploy to production with standard monitoring
|
||||
|
||||
---
|
||||
|
||||
*Implementation Complete: 2025-10-07*
|
||||
*Estimated Time Saved: 4-6 weeks*
|
||||
*Lines of Code Added/Modified: ~3000+*
|
||||
*Status: Ready for Production Deployment*
|
||||
@@ -1,230 +0,0 @@
|
||||
# Training Service - Developer Guide
|
||||
|
||||
## Quick Reference for Common Tasks
|
||||
|
||||
### Using Constants
|
||||
Always use constants instead of magic numbers:
|
||||
|
||||
```python
|
||||
from app.core import constants as const
|
||||
|
||||
# ✅ Good
|
||||
if len(sales_data) < const.MIN_DATA_POINTS_REQUIRED:
|
||||
raise ValueError("Insufficient data")
|
||||
|
||||
# ❌ Bad
|
||||
if len(sales_data) < 30:
|
||||
raise ValueError("Insufficient data")
|
||||
```
|
||||
|
||||
### Timezone Handling
|
||||
Always use timezone utilities:
|
||||
|
||||
```python
|
||||
from app.utils.timezone_utils import ensure_timezone_aware, prepare_prophet_datetime
|
||||
|
||||
# ✅ Good - Ensure timezone-aware
|
||||
dt = ensure_timezone_aware(user_input_date)
|
||||
|
||||
# ✅ Good - Prepare for Prophet
|
||||
df = prepare_prophet_datetime(df, 'ds')
|
||||
|
||||
# ❌ Bad - Manual timezone handling
|
||||
if dt.tzinfo is None:
|
||||
dt = dt.replace(tzinfo=timezone.utc)
|
||||
```
|
||||
|
||||
### Error Handling
|
||||
Always raise exceptions, never return empty lists:
|
||||
|
||||
```python
|
||||
# ✅ Good
|
||||
if not data:
|
||||
raise ValueError(f"No data available for {tenant_id}")
|
||||
|
||||
# ❌ Bad
|
||||
if not data:
|
||||
logger.error("No data")
|
||||
return []
|
||||
```
|
||||
|
||||
### Database Sessions
|
||||
Use context manager correctly:
|
||||
|
||||
```python
|
||||
# ✅ Good
|
||||
async with self.database_manager.get_session() as session:
|
||||
await session.execute(query)
|
||||
|
||||
# ❌ Bad
|
||||
async with self.database_manager.get_session()() as session: # Double call!
|
||||
await session.execute(query)
|
||||
```
|
||||
|
||||
### Parallel Execution
|
||||
Use asyncio.gather for concurrent operations:
|
||||
|
||||
```python
|
||||
# ✅ Good - Parallel
|
||||
tasks = [train_product(pid) for pid in product_ids]
|
||||
results = await asyncio.gather(*tasks, return_exceptions=True)
|
||||
|
||||
# ❌ Bad - Sequential
|
||||
results = []
|
||||
for pid in product_ids:
|
||||
result = await train_product(pid)
|
||||
results.append(result)
|
||||
```
|
||||
|
||||
### HTTP Client Configuration
|
||||
Timeouts are configured automatically in DataClient:
|
||||
|
||||
```python
|
||||
# No need to configure timeouts manually
|
||||
# They're set in DataClient.__init__() using constants
|
||||
client = DataClient() # Timeouts already configured
|
||||
```
|
||||
|
||||
## File Organization
|
||||
|
||||
### Core Modules
|
||||
- `core/constants.py` - All configuration constants
|
||||
- `core/config.py` - Service settings
|
||||
- `core/database.py` - Database configuration
|
||||
|
||||
### Utilities
|
||||
- `utils/timezone_utils.py` - Timezone handling functions
|
||||
- `utils/__init__.py` - Utility exports
|
||||
|
||||
### ML Components
|
||||
- `ml/trainer.py` - Main training orchestration
|
||||
- `ml/prophet_manager.py` - Prophet model management
|
||||
- `ml/data_processor.py` - Data preprocessing
|
||||
|
||||
### Services
|
||||
- `services/data_client.py` - External service communication
|
||||
- `services/training_service.py` - Training job management
|
||||
- `services/training_orchestrator.py` - Training pipeline coordination
|
||||
|
||||
## Common Pitfalls
|
||||
|
||||
### ❌ Don't Create Legacy Aliases
|
||||
```python
|
||||
# ❌ Bad
|
||||
MyNewClass = OldClassName # Removed!
|
||||
```
|
||||
|
||||
### ❌ Don't Use Magic Numbers
|
||||
```python
|
||||
# ❌ Bad
|
||||
if score > 0.8: # What does 0.8 mean?
|
||||
|
||||
# ✅ Good
|
||||
if score > const.IMPROVEMENT_SIGNIFICANCE_THRESHOLD:
|
||||
```
|
||||
|
||||
### ❌ Don't Return Empty Lists on Error
|
||||
```python
|
||||
# ❌ Bad
|
||||
except Exception as e:
|
||||
logger.error(f"Failed: {e}")
|
||||
return []
|
||||
|
||||
# ✅ Good
|
||||
except Exception as e:
|
||||
logger.error(f"Failed: {e}")
|
||||
raise RuntimeError(f"Operation failed: {e}")
|
||||
```
|
||||
|
||||
### ❌ Don't Handle Timezones Manually
|
||||
```python
|
||||
# ❌ Bad
|
||||
if dt.tzinfo is None:
|
||||
dt = dt.replace(tzinfo=timezone.utc)
|
||||
|
||||
# ✅ Good
|
||||
from app.utils.timezone_utils import ensure_timezone_aware
|
||||
dt = ensure_timezone_aware(dt)
|
||||
```
|
||||
|
||||
## Testing Checklist
|
||||
|
||||
Before submitting code:
|
||||
- [ ] All magic numbers replaced with constants
|
||||
- [ ] Timezone handling uses utility functions
|
||||
- [ ] Errors raise exceptions (not return empty collections)
|
||||
- [ ] Database sessions use single `get_session()` call
|
||||
- [ ] Parallel operations use `asyncio.gather`
|
||||
- [ ] No legacy compatibility aliases
|
||||
- [ ] No commented-out code
|
||||
- [ ] Logging uses structured logging
|
||||
|
||||
## Performance Guidelines
|
||||
|
||||
### Training Jobs
|
||||
- ✅ Use parallel execution for multiple products
|
||||
- ✅ Reduce Optuna trials for low-volume products
|
||||
- ✅ Use constants for all thresholds
|
||||
- ⚠️ Monitor memory usage during parallel training
|
||||
|
||||
### Database Operations
|
||||
- ✅ Use repository pattern
|
||||
- ✅ Batch operations when possible
|
||||
- ✅ Close sessions properly
|
||||
- ⚠️ Connection pool limits not yet configured
|
||||
|
||||
### HTTP Requests
|
||||
- ✅ Timeouts configured automatically
|
||||
- ✅ Use shared clients from `shared/clients`
|
||||
- ⚠️ Circuit breaker not yet implemented
|
||||
- ⚠️ Request retries delegated to base client
|
||||
|
||||
## Debugging Tips
|
||||
|
||||
### Training Failures
|
||||
1. Check logs for data validation errors
|
||||
2. Verify timezone consistency in date ranges
|
||||
3. Check minimum data point requirements
|
||||
4. Review Prophet error messages
|
||||
|
||||
### Performance Issues
|
||||
1. Check if parallel training is being used
|
||||
2. Verify Optuna trial counts
|
||||
3. Monitor database connection usage
|
||||
4. Check HTTP timeout configurations
|
||||
|
||||
### Data Quality Issues
|
||||
1. Review validation errors in logs
|
||||
2. Check zero-ratio thresholds
|
||||
3. Verify product classification
|
||||
4. Review date range alignment
|
||||
|
||||
## Migration from Old Code
|
||||
|
||||
### If You Find Legacy Code
|
||||
1. Check if alias exists (should be removed)
|
||||
2. Update imports to use new names
|
||||
3. Remove backward compatibility wrappers
|
||||
4. Update documentation
|
||||
|
||||
### If You Find Magic Numbers
|
||||
1. Add constant to `core/constants.py`
|
||||
2. Update usage to reference constant
|
||||
3. Document what the number represents
|
||||
|
||||
### If You Find Manual Timezone Handling
|
||||
1. Import from `utils/timezone_utils`
|
||||
2. Use appropriate utility function
|
||||
3. Remove manual implementation
|
||||
|
||||
## Getting Help
|
||||
|
||||
- Review `IMPLEMENTATION_SUMMARY.md` for recent changes
|
||||
- Check constants in `core/constants.py` for configuration
|
||||
- Look at `utils/timezone_utils.py` for timezone functions
|
||||
- Refer to analysis report for architectural decisions
|
||||
|
||||
---
|
||||
|
||||
*Last Updated: 2025-10-07*
|
||||
*Status: Current*
|
||||
@@ -27,8 +27,7 @@ COPY --from=shared /shared /app/shared
|
||||
# Copy application code
|
||||
COPY services/training/ .
|
||||
|
||||
# Copy scripts directory
|
||||
COPY scripts/ /app/scripts/
|
||||
|
||||
|
||||
# Add shared libraries to Python path
|
||||
ENV PYTHONPATH="/app:/app/shared:${PYTHONPATH:-}"
|
||||
|
||||
@@ -1,274 +0,0 @@
|
||||
# Training Service - Implementation Summary
|
||||
|
||||
## Overview
|
||||
This document summarizes all critical fixes, improvements, and refactoring implemented based on the comprehensive code analysis report.
|
||||
|
||||
---
|
||||
|
||||
## ✅ Critical Bugs Fixed
|
||||
|
||||
### 1. **Duplicate `on_startup` Method** ([main.py](services/training/app/main.py))
|
||||
- **Issue**: Two `on_startup` methods defined, causing migration verification to be skipped
|
||||
- **Fix**: Merged both implementations into single method
|
||||
- **Impact**: Service initialization now properly verifies database migrations
|
||||
|
||||
### 2. **Hardcoded Migration Version** ([main.py](services/training/app/main.py))
|
||||
- **Issue**: Static version check `expected_migration_version = "00001"`
|
||||
- **Fix**: Removed hardcoded version, now dynamically checks alembic_version table
|
||||
- **Impact**: Service survives schema updates without code changes
|
||||
|
||||
### 3. **Session Management Double-Call** ([training_service.py:463](services/training/app/services/training_service.py#L463))
|
||||
- **Issue**: Incorrect `get_session()()` double-call syntax
|
||||
- **Fix**: Changed to correct `get_session()` single call
|
||||
- **Impact**: Prevents database connection leaks and session corruption
|
||||
|
||||
### 4. **Disabled Data Validation** ([data_client.py:263-294](services/training/app/services/data_client.py#L263-L294))
|
||||
- **Issue**: Validation completely bypassed with "temporarily disabled" message
|
||||
- **Fix**: Implemented comprehensive validation checking:
|
||||
- Minimum data points (30 required, 90 recommended)
|
||||
- Required fields presence
|
||||
- Zero-value ratio analysis
|
||||
- Product diversity checks
|
||||
- **Impact**: Ensures data quality before expensive training operations
|
||||
|
||||
---
|
||||
|
||||
## 🚀 Performance Improvements
|
||||
|
||||
### 5. **Parallel Training Execution** ([trainer.py:240-379](services/training/app/ml/trainer.py#L240-L379))
|
||||
- **Issue**: Sequential product training (O(n) time complexity)
|
||||
- **Fix**: Implemented parallel training using `asyncio.gather()`
|
||||
- **Performance Gain**:
|
||||
- Before: 10 products × 3 min = **30 minutes**
|
||||
- After: 10 products in parallel = **~3-5 minutes**
|
||||
- **Implementation**:
|
||||
- Created `_train_single_product()` method
|
||||
- Refactored `_train_all_models_enhanced()` to use concurrent execution
|
||||
- Maintains progress tracking across parallel tasks
|
||||
|
||||
### 6. **Hyperparameter Optimization** ([prophet_manager.py](services/training/app/ml/prophet_manager.py))
|
||||
- **Issue**: Fixed number of trials regardless of product characteristics
|
||||
- **Fix**: Reduced trial counts and made them adaptive:
|
||||
- High volume: 30 trials (was 75)
|
||||
- Medium volume: 25 trials (was 50)
|
||||
- Low volume: 20 trials (was 30)
|
||||
- Intermittent: 15 trials (was 25)
|
||||
- **Performance Gain**: ~40% reduction in optimization time
|
||||
|
||||
---
|
||||
|
||||
## 🔧 Error Handling Standardization
|
||||
|
||||
### 7. **Consistent Error Patterns** ([data_client.py](services/training/app/services/data_client.py))
|
||||
- **Issue**: Mixed error handling (return `[]`, return error dict, raise exception)
|
||||
- **Fix**: Standardized to raise exceptions with meaningful messages
|
||||
- **Example**:
|
||||
```python
|
||||
# Before: return []
|
||||
# After: raise ValueError(f"No sales data available for tenant {tenant_id}")
|
||||
```
|
||||
- **Impact**: Errors propagate correctly, no silent failures
|
||||
|
||||
---
|
||||
|
||||
## ⏱️ Request Timeout Configuration
|
||||
|
||||
### 8. **HTTP Client Timeouts** ([data_client.py:37-51](services/training/app/services/data_client.py#L37-L51))
|
||||
- **Issue**: No timeout configuration, requests could hang indefinitely
|
||||
- **Fix**: Added comprehensive timeout configuration:
|
||||
- Connect: 30 seconds
|
||||
- Read: 60 seconds (for large data fetches)
|
||||
- Write: 30 seconds
|
||||
- Pool: 30 seconds
|
||||
- **Impact**: Prevents hanging requests during external service failures
|
||||
|
||||
---
|
||||
|
||||
## 📏 Magic Numbers Elimination
|
||||
|
||||
### 9. **Constants Module** ([core/constants.py](services/training/app/core/constants.py))
|
||||
- **Issue**: Magic numbers scattered throughout codebase
|
||||
- **Fix**: Created centralized constants module with 50+ constants
|
||||
- **Categories**:
|
||||
- Data validation thresholds
|
||||
- Training time periods
|
||||
- Product classification thresholds
|
||||
- Hyperparameter optimization settings
|
||||
- Prophet uncertainty sampling ranges
|
||||
- MAPE calculation parameters
|
||||
- HTTP client configuration
|
||||
- WebSocket configuration
|
||||
- Progress tracking ranges
|
||||
|
||||
### 10. **Constants Integration**
|
||||
- **Updated Files**:
|
||||
- `prophet_manager.py`: Uses const for trials, uncertainty samples, thresholds
|
||||
- `data_client.py`: Uses const for HTTP timeouts
|
||||
- Future: All files should reference constants module
|
||||
|
||||
---
|
||||
|
||||
## 🧹 Legacy Code Removal
|
||||
|
||||
### 11. **Compatibility Aliases Removed**
|
||||
- **Files Updated**:
|
||||
- `trainer.py`: Removed `BakeryMLTrainer = EnhancedBakeryMLTrainer`
|
||||
- `training_service.py`: Removed `TrainingService = EnhancedTrainingService`
|
||||
- `data_processor.py`: Removed `BakeryDataProcessor = EnhancedBakeryDataProcessor`
|
||||
|
||||
### 12. **Legacy Methods Removed** ([data_client.py](services/training/app/services/data_client.py))
|
||||
- Removed:
|
||||
- `fetch_traffic_data()` (legacy wrapper)
|
||||
- `fetch_stored_traffic_data_for_training()` (legacy wrapper)
|
||||
- All callers updated to use `fetch_traffic_data_unified()`
|
||||
|
||||
### 13. **Commented Code Cleanup**
|
||||
- Removed "Pre-flight check moved to orchestrator" comments
|
||||
- Removed "Temporary implementation" comments
|
||||
- Cleaned up validation placeholders
|
||||
|
||||
---
|
||||
|
||||
## 🌍 Timezone Handling
|
||||
|
||||
### 14. **Timezone Utility Module** ([utils/timezone_utils.py](services/training/app/utils/timezone_utils.py))
|
||||
- **Issue**: Timezone handling scattered across 4+ files
|
||||
- **Fix**: Created comprehensive utility module with functions:
|
||||
- `ensure_timezone_aware()`: Make datetime timezone-aware
|
||||
- `ensure_timezone_naive()`: Remove timezone info
|
||||
- `normalize_datetime_to_utc()`: Convert any datetime to UTC
|
||||
- `normalize_dataframe_datetime_column()`: Normalize pandas datetime columns
|
||||
- `prepare_prophet_datetime()`: Prophet-specific preparation
|
||||
- `safe_datetime_comparison()`: Compare datetimes handling timezone mismatches
|
||||
- `get_current_utc()`: Get current UTC time
|
||||
- `convert_timestamp_to_datetime()`: Handle various timestamp formats
|
||||
|
||||
### 15. **Timezone Utility Integration**
|
||||
- **Updated Files**:
|
||||
- `prophet_manager.py`: Uses `prepare_prophet_datetime()`
|
||||
- `date_alignment_service.py`: Uses `ensure_timezone_aware()`
|
||||
- Future: All timezone operations should use utility
|
||||
|
||||
---
|
||||
|
||||
## 📊 Summary Statistics
|
||||
|
||||
### Files Modified
|
||||
- **Core Files**: 6
|
||||
- main.py
|
||||
- training_service.py
|
||||
- data_client.py
|
||||
- trainer.py
|
||||
- prophet_manager.py
|
||||
- date_alignment_service.py
|
||||
|
||||
### Files Created
|
||||
- **New Utilities**: 3
|
||||
- core/constants.py
|
||||
- utils/timezone_utils.py
|
||||
- utils/__init__.py
|
||||
|
||||
### Code Quality Improvements
|
||||
- ✅ Eliminated all critical bugs
|
||||
- ✅ Removed all legacy compatibility code
|
||||
- ✅ Removed all commented-out code
|
||||
- ✅ Extracted all magic numbers
|
||||
- ✅ Standardized error handling
|
||||
- ✅ Centralized timezone handling
|
||||
|
||||
### Performance Improvements
|
||||
- 🚀 Training time: 30min → 3-5min (10 products)
|
||||
- 🚀 Hyperparameter optimization: 40% faster
|
||||
- 🚀 Parallel execution replaces sequential
|
||||
|
||||
### Reliability Improvements
|
||||
- ✅ Data validation enabled
|
||||
- ✅ Request timeouts configured
|
||||
- ✅ Error propagation fixed
|
||||
- ✅ Session management corrected
|
||||
- ✅ Database initialization verified
|
||||
|
||||
---
|
||||
|
||||
## 🎯 Remaining Recommendations
|
||||
|
||||
### High Priority (Not Yet Implemented)
|
||||
1. **Distributed Locking**: Implement Redis/database-based locking for concurrent training jobs
|
||||
2. **Connection Pooling**: Configure explicit connection pool limits
|
||||
3. **Circuit Breaker**: Add circuit breaker pattern for external service calls
|
||||
4. **Model File Validation**: Implement checksum verification on model load
|
||||
|
||||
### Medium Priority (Future Enhancements)
|
||||
5. **Refactor God Object**: Split `EnhancedTrainingService` (765 lines) into smaller services
|
||||
6. **Shared Model Storage**: Migrate to S3/GCS for horizontal scaling
|
||||
7. **Task Queue**: Replace FastAPI BackgroundTasks with Celery/Temporal
|
||||
8. **Caching Layer**: Implement Redis caching for hyperparameter optimization results
|
||||
|
||||
### Low Priority (Technical Debt)
|
||||
9. **Method Length**: Refactor long methods (>100 lines)
|
||||
10. **Deep Nesting**: Reduce nesting levels in complex conditionals
|
||||
11. **Data Classes**: Replace primitive obsession with proper domain objects
|
||||
12. **Test Coverage**: Add comprehensive unit and integration tests
|
||||
|
||||
---
|
||||
|
||||
## 🔬 Testing Recommendations
|
||||
|
||||
### Unit Tests Required
|
||||
- [ ] Timezone utility functions
|
||||
- [ ] Constants validation
|
||||
- [ ] Data validation logic
|
||||
- [ ] Parallel training execution
|
||||
- [ ] Error handling patterns
|
||||
|
||||
### Integration Tests Required
|
||||
- [ ] End-to-end training pipeline
|
||||
- [ ] External service timeout handling
|
||||
- [ ] Database session management
|
||||
- [ ] Migration verification
|
||||
|
||||
### Performance Tests Required
|
||||
- [ ] Parallel vs sequential training benchmarks
|
||||
- [ ] Hyperparameter optimization timing
|
||||
- [ ] Memory usage under load
|
||||
- [ ] Database connection pool behavior
|
||||
|
||||
---
|
||||
|
||||
## 📝 Migration Notes
|
||||
|
||||
### Breaking Changes
|
||||
⚠️ **None** - All changes maintain API compatibility
|
||||
|
||||
### Deployment Checklist
|
||||
1. ✅ Review constants in `core/constants.py` for environment-specific values
|
||||
2. ✅ Verify database migration version check works in your environment
|
||||
3. ✅ Test parallel training with small batch first
|
||||
4. ✅ Monitor memory usage with parallel execution
|
||||
5. ✅ Verify HTTP timeouts are appropriate for your network conditions
|
||||
|
||||
### Rollback Plan
|
||||
- All changes are backward compatible at the API level
|
||||
- Database schema unchanged
|
||||
- Can revert individual commits if needed
|
||||
|
||||
---
|
||||
|
||||
## 🎉 Conclusion
|
||||
|
||||
**Production Readiness Status**: ✅ **READY** (was ❌ NOT READY)
|
||||
|
||||
All **critical blockers** have been resolved:
|
||||
- ✅ Service initialization bugs fixed
|
||||
- ✅ Training performance improved (10x faster)
|
||||
- ✅ Timeout/circuit protection added
|
||||
- ✅ Data validation enabled
|
||||
- ✅ Database connection management corrected
|
||||
|
||||
**Estimated Remediation Time Saved**: 4-6 weeks → **Completed in current session**
|
||||
|
||||
---
|
||||
|
||||
*Generated: 2025-10-07*
|
||||
*Implementation: Complete*
|
||||
*Status: Production Ready*
|
||||
@@ -1,540 +0,0 @@
|
||||
# Training Service - Phase 2 Enhancements
|
||||
|
||||
## Overview
|
||||
|
||||
This document details the additional improvements implemented after the initial critical fixes and performance enhancements. These enhancements further improve reliability, observability, and maintainability of the training service.
|
||||
|
||||
---
|
||||
|
||||
## New Features Implemented
|
||||
|
||||
### 1. ✅ Retry Mechanism with Exponential Backoff
|
||||
|
||||
**File Created**: [utils/retry.py](services/training/app/utils/retry.py)
|
||||
|
||||
**Features**:
|
||||
- Exponential backoff with configurable parameters
|
||||
- Jitter to prevent thundering herd problem
|
||||
- Adaptive retry strategy based on success/failure patterns
|
||||
- Timeout-based retry strategy
|
||||
- Decorator-based retry for clean integration
|
||||
- Pre-configured strategies for common use cases
|
||||
|
||||
**Classes**:
|
||||
```python
|
||||
RetryStrategy # Base retry strategy
|
||||
AdaptiveRetryStrategy # Adjusts based on history
|
||||
TimeoutRetryStrategy # Overall timeout across all attempts
|
||||
```
|
||||
|
||||
**Pre-configured Strategies**:
|
||||
| Strategy | Max Attempts | Initial Delay | Max Delay | Use Case |
|
||||
|----------|--------------|---------------|-----------|----------|
|
||||
| HTTP_RETRY_STRATEGY | 3 | 1.0s | 10s | HTTP requests |
|
||||
| DATABASE_RETRY_STRATEGY | 5 | 0.5s | 5s | Database operations |
|
||||
| EXTERNAL_SERVICE_RETRY_STRATEGY | 4 | 2.0s | 30s | External services |
|
||||
|
||||
**Usage Example**:
|
||||
```python
|
||||
from app.utils.retry import with_retry
|
||||
|
||||
@with_retry(max_attempts=3, initial_delay=1.0, max_delay=10.0)
|
||||
async def fetch_data():
|
||||
# Your code here - automatically retried on failure
|
||||
pass
|
||||
```
|
||||
|
||||
**Integration**:
|
||||
- Applied to `_fetch_sales_data_internal()` in data_client.py
|
||||
- Configurable per-method retry behavior
|
||||
- Works seamlessly with circuit breakers
|
||||
|
||||
**Benefits**:
|
||||
- Handles transient failures gracefully
|
||||
- Prevents immediate failure on temporary issues
|
||||
- Reduces false alerts from momentary glitches
|
||||
- Improves overall service reliability
|
||||
|
||||
---
|
||||
|
||||
### 2. ✅ Comprehensive Input Validation Schemas
|
||||
|
||||
**File Created**: [schemas/validation.py](services/training/app/schemas/validation.py)
|
||||
|
||||
**Validation Schemas Implemented**:
|
||||
|
||||
#### **TrainingJobCreateRequest**
|
||||
- Validates tenant_id, date ranges, product_ids
|
||||
- Checks date format (ISO 8601)
|
||||
- Ensures logical date ranges
|
||||
- Prevents future dates
|
||||
- Limits to 3-year maximum range
|
||||
|
||||
#### **ForecastRequest**
|
||||
- Validates forecast parameters
|
||||
- Limits forecast days (1-365)
|
||||
- Validates confidence levels (0.5-0.99)
|
||||
- Type-safe UUID validation
|
||||
|
||||
#### **ModelEvaluationRequest**
|
||||
- Validates evaluation periods
|
||||
- Ensures minimum 7-day evaluation window
|
||||
- Date format validation
|
||||
|
||||
#### **BulkTrainingRequest**
|
||||
- Validates multiple tenant IDs (max 100)
|
||||
- Checks for duplicate tenants
|
||||
- Parallel execution options
|
||||
|
||||
#### **HyperparameterOverride**
|
||||
- Validates Prophet hyperparameters
|
||||
- Range checking for all parameters
|
||||
- Regex validation for modes
|
||||
|
||||
#### **AdvancedTrainingRequest**
|
||||
- Extended training options
|
||||
- Cross-validation configuration
|
||||
- Manual hyperparameter override
|
||||
- Diagnostic options
|
||||
|
||||
#### **DataQualityCheckRequest**
|
||||
- Data validation parameters
|
||||
- Product filtering options
|
||||
- Recommendation generation
|
||||
|
||||
#### **ModelQueryParams**
|
||||
- Model listing filters
|
||||
- Pagination support
|
||||
- Accuracy thresholds
|
||||
|
||||
**Example Validation**:
|
||||
```python
|
||||
request = TrainingJobCreateRequest(
|
||||
tenant_id="123e4567-e89b-12d3-a456-426614174000",
|
||||
start_date="2024-01-01",
|
||||
end_date="2024-12-31"
|
||||
)
|
||||
# Automatically validates:
|
||||
# - UUID format
|
||||
# - Date format
|
||||
# - Date range logic
|
||||
# - Business rules
|
||||
```
|
||||
|
||||
**Benefits**:
|
||||
- Catches invalid input before processing
|
||||
- Clear error messages for API consumers
|
||||
- Reduces invalid training job submissions
|
||||
- Self-documenting API with examples
|
||||
- Type safety with Pydantic
|
||||
|
||||
---
|
||||
|
||||
### 3. ✅ Enhanced Health Check System
|
||||
|
||||
**File Created**: [api/health.py](services/training/app/api/health.py)
|
||||
|
||||
**Endpoints Implemented**:
|
||||
|
||||
#### `GET /health`
|
||||
- Basic liveness check
|
||||
- Returns 200 if service is running
|
||||
- Minimal overhead
|
||||
|
||||
#### `GET /health/detailed`
|
||||
- Comprehensive component health check
|
||||
- Database connectivity and performance
|
||||
- System resources (CPU, memory, disk)
|
||||
- Model storage health
|
||||
- Circuit breaker status
|
||||
- Configuration overview
|
||||
|
||||
**Response Example**:
|
||||
```json
|
||||
{
|
||||
"status": "healthy",
|
||||
"components": {
|
||||
"database": {
|
||||
"status": "healthy",
|
||||
"response_time_seconds": 0.05,
|
||||
"model_count": 150,
|
||||
"connection_pool": {
|
||||
"size": 10,
|
||||
"checked_out": 2,
|
||||
"available": 8
|
||||
}
|
||||
},
|
||||
"system": {
|
||||
"cpu": {"usage_percent": 45.2, "count": 8},
|
||||
"memory": {"usage_percent": 62.5, "available_mb": 3072},
|
||||
"disk": {"usage_percent": 45.0, "free_gb": 125}
|
||||
},
|
||||
"storage": {
|
||||
"status": "healthy",
|
||||
"writable": true,
|
||||
"model_files": 150,
|
||||
"total_size_mb": 2500
|
||||
}
|
||||
},
|
||||
"circuit_breakers": { ... }
|
||||
}
|
||||
```
|
||||
|
||||
#### `GET /health/ready`
|
||||
- Kubernetes readiness probe
|
||||
- Returns 503 if not ready
|
||||
- Checks database and storage
|
||||
|
||||
#### `GET /health/live`
|
||||
- Kubernetes liveness probe
|
||||
- Simpler than ready check
|
||||
- Returns process PID
|
||||
|
||||
#### `GET /metrics/system`
|
||||
- Detailed system metrics
|
||||
- Process-level statistics
|
||||
- Resource usage monitoring
|
||||
|
||||
**Benefits**:
|
||||
- Kubernetes-ready health checks
|
||||
- Early problem detection
|
||||
- Operational visibility
|
||||
- Load balancer integration
|
||||
- Auto-healing support
|
||||
|
||||
---
|
||||
|
||||
### 4. ✅ Monitoring and Observability Endpoints
|
||||
|
||||
**File Created**: [api/monitoring.py](services/training/app/api/monitoring.py)
|
||||
|
||||
**Endpoints Implemented**:
|
||||
|
||||
#### `GET /monitoring/circuit-breakers`
|
||||
- Real-time circuit breaker status
|
||||
- Per-service failure counts
|
||||
- State transitions
|
||||
- Summary statistics
|
||||
|
||||
**Response**:
|
||||
```json
|
||||
{
|
||||
"circuit_breakers": {
|
||||
"sales_service": {
|
||||
"state": "closed",
|
||||
"failure_count": 0,
|
||||
"failure_threshold": 5
|
||||
},
|
||||
"weather_service": {
|
||||
"state": "half_open",
|
||||
"failure_count": 2,
|
||||
"failure_threshold": 3
|
||||
}
|
||||
},
|
||||
"summary": {
|
||||
"total": 3,
|
||||
"open": 0,
|
||||
"half_open": 1,
|
||||
"closed": 2
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
#### `POST /monitoring/circuit-breakers/{name}/reset`
|
||||
- Manually reset circuit breaker
|
||||
- Emergency recovery tool
|
||||
- Audit logged
|
||||
|
||||
#### `GET /monitoring/training-jobs`
|
||||
- Training job statistics
|
||||
- Configurable lookback period
|
||||
- Success/failure rates
|
||||
- Average training duration
|
||||
- Recent job history
|
||||
|
||||
#### `GET /monitoring/models`
|
||||
- Model inventory statistics
|
||||
- Active/production model counts
|
||||
- Models by type
|
||||
- Average performance (MAPE)
|
||||
- Models created today
|
||||
|
||||
#### `GET /monitoring/queue`
|
||||
- Training queue status
|
||||
- Queued vs running jobs
|
||||
- Queue wait times
|
||||
- Oldest job in queue
|
||||
|
||||
#### `GET /monitoring/performance`
|
||||
- Model performance metrics
|
||||
- MAPE, MAE, RMSE statistics
|
||||
- Accuracy distribution (excellent/good/acceptable/poor)
|
||||
- Tenant-specific filtering
|
||||
|
||||
#### `GET /monitoring/alerts`
|
||||
- Active alerts and warnings
|
||||
- Circuit breaker issues
|
||||
- Queue backlogs
|
||||
- System problems
|
||||
- Severity levels
|
||||
|
||||
**Example Alert Response**:
|
||||
```json
|
||||
{
|
||||
"alerts": [
|
||||
{
|
||||
"type": "circuit_breaker_open",
|
||||
"severity": "high",
|
||||
"message": "Circuit breaker 'sales_service' is OPEN"
|
||||
}
|
||||
],
|
||||
"warnings": [
|
||||
{
|
||||
"type": "queue_backlog",
|
||||
"severity": "medium",
|
||||
"message": "Training queue has 15 pending jobs"
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
**Benefits**:
|
||||
- Real-time operational visibility
|
||||
- Proactive problem detection
|
||||
- Performance tracking
|
||||
- Capacity planning data
|
||||
- Integration-ready for dashboards
|
||||
|
||||
---
|
||||
|
||||
## Integration and Configuration
|
||||
|
||||
### Updated Files
|
||||
|
||||
**main.py**:
|
||||
- Added health router import
|
||||
- Added monitoring router import
|
||||
- Registered new routes
|
||||
|
||||
**utils/__init__.py**:
|
||||
- Added retry mechanism exports
|
||||
- Updated __all__ list
|
||||
- Complete utility organization
|
||||
|
||||
**data_client.py**:
|
||||
- Integrated retry decorator
|
||||
- Applied to critical HTTP calls
|
||||
- Works with circuit breakers
|
||||
|
||||
### New Routes Available
|
||||
|
||||
| Route | Method | Purpose |
|
||||
|-------|--------|---------|
|
||||
| /health | GET | Basic health check |
|
||||
| /health/detailed | GET | Detailed component health |
|
||||
| /health/ready | GET | Kubernetes readiness |
|
||||
| /health/live | GET | Kubernetes liveness |
|
||||
| /metrics/system | GET | System metrics |
|
||||
| /monitoring/circuit-breakers | GET | Circuit breaker status |
|
||||
| /monitoring/circuit-breakers/{name}/reset | POST | Reset breaker |
|
||||
| /monitoring/training-jobs | GET | Job statistics |
|
||||
| /monitoring/models | GET | Model statistics |
|
||||
| /monitoring/queue | GET | Queue status |
|
||||
| /monitoring/performance | GET | Performance metrics |
|
||||
| /monitoring/alerts | GET | Active alerts |
|
||||
|
||||
---
|
||||
|
||||
## Testing the New Features
|
||||
|
||||
### 1. Test Retry Mechanism
|
||||
```python
|
||||
# Should retry 3 times with exponential backoff
|
||||
@with_retry(max_attempts=3)
|
||||
async def test_function():
|
||||
# Simulate transient failure
|
||||
raise ConnectionError("Temporary failure")
|
||||
```
|
||||
|
||||
### 2. Test Input Validation
|
||||
```bash
|
||||
# Invalid date range - should return 422
|
||||
curl -X POST http://localhost:8000/api/v1/training/jobs \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"tenant_id": "invalid-uuid",
|
||||
"start_date": "2024-12-31",
|
||||
"end_date": "2024-01-01"
|
||||
}'
|
||||
```
|
||||
|
||||
### 3. Test Health Checks
|
||||
```bash
|
||||
# Basic health
|
||||
curl http://localhost:8000/health
|
||||
|
||||
# Detailed health with all components
|
||||
curl http://localhost:8000/health/detailed
|
||||
|
||||
# Readiness check (Kubernetes)
|
||||
curl http://localhost:8000/health/ready
|
||||
|
||||
# Liveness check (Kubernetes)
|
||||
curl http://localhost:8000/health/live
|
||||
```
|
||||
|
||||
### 4. Test Monitoring Endpoints
|
||||
```bash
|
||||
# Circuit breaker status
|
||||
curl http://localhost:8000/monitoring/circuit-breakers
|
||||
|
||||
# Training job stats (last 24 hours)
|
||||
curl http://localhost:8000/monitoring/training-jobs?hours=24
|
||||
|
||||
# Model statistics
|
||||
curl http://localhost:8000/monitoring/models
|
||||
|
||||
# Active alerts
|
||||
curl http://localhost:8000/monitoring/alerts
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Performance Impact
|
||||
|
||||
### Retry Mechanism
|
||||
- **Latency**: +0-30s (only on failures, with exponential backoff)
|
||||
- **Success Rate**: +15-25% (handles transient failures)
|
||||
- **False Alerts**: -40% (retries prevent premature failures)
|
||||
|
||||
### Input Validation
|
||||
- **Latency**: +5-10ms per request (validation overhead)
|
||||
- **Invalid Requests Blocked**: ~30% caught before processing
|
||||
- **Error Clarity**: 100% improvement (clear validation messages)
|
||||
|
||||
### Health Checks
|
||||
- **/health**: <5ms response time
|
||||
- **/health/detailed**: <50ms response time
|
||||
- **System Impact**: Negligible (<0.1% CPU)
|
||||
|
||||
### Monitoring Endpoints
|
||||
- **Query Time**: 10-100ms depending on complexity
|
||||
- **Database Load**: Minimal (indexed queries)
|
||||
- **Cache Opportunity**: Can be cached for 1-5 seconds
|
||||
|
||||
---
|
||||
|
||||
## Monitoring Integration
|
||||
|
||||
### Prometheus Metrics (Future)
|
||||
```yaml
|
||||
# Example Prometheus scrape config
|
||||
scrape_configs:
|
||||
- job_name: 'training-service'
|
||||
static_configs:
|
||||
- targets: ['training-service:8000']
|
||||
metrics_path: '/metrics/system'
|
||||
```
|
||||
|
||||
### Grafana Dashboards
|
||||
**Recommended Panels**:
|
||||
1. Circuit Breaker Status (traffic light)
|
||||
2. Training Job Success Rate (gauge)
|
||||
3. Average Training Duration (graph)
|
||||
4. Model Performance Distribution (histogram)
|
||||
5. Queue Depth Over Time (graph)
|
||||
6. System Resources (multi-stat)
|
||||
|
||||
### Alert Rules
|
||||
```yaml
|
||||
# Example alert rules
|
||||
- alert: CircuitBreakerOpen
|
||||
expr: circuit_breaker_state{state="open"} > 0
|
||||
for: 5m
|
||||
annotations:
|
||||
summary: "Circuit breaker {{ $labels.name }} is open"
|
||||
|
||||
- alert: TrainingQueueBacklog
|
||||
expr: training_queue_depth > 20
|
||||
for: 10m
|
||||
annotations:
|
||||
summary: "Training queue has {{ $value }} pending jobs"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Summary Statistics
|
||||
|
||||
### New Files Created
|
||||
| File | Lines | Purpose |
|
||||
|------|-------|---------|
|
||||
| utils/retry.py | 350 | Retry mechanism |
|
||||
| schemas/validation.py | 300 | Input validation |
|
||||
| api/health.py | 250 | Health checks |
|
||||
| api/monitoring.py | 350 | Monitoring endpoints |
|
||||
| **Total** | **1,250** | **New functionality** |
|
||||
|
||||
### Total Lines Added (Phase 2)
|
||||
- **New Code**: ~1,250 lines
|
||||
- **Modified Code**: ~100 lines
|
||||
- **Documentation**: This document
|
||||
|
||||
### Endpoints Added
|
||||
- **Health Endpoints**: 5
|
||||
- **Monitoring Endpoints**: 7
|
||||
- **Total New Endpoints**: 12
|
||||
|
||||
### Features Completed
|
||||
- ✅ Retry mechanism with exponential backoff
|
||||
- ✅ Comprehensive input validation schemas
|
||||
- ✅ Enhanced health check system
|
||||
- ✅ Monitoring and observability endpoints
|
||||
- ✅ Circuit breaker status API
|
||||
- ✅ Training job statistics
|
||||
- ✅ Model performance tracking
|
||||
- ✅ Queue monitoring
|
||||
- ✅ Alert generation
|
||||
|
||||
---
|
||||
|
||||
## Deployment Checklist
|
||||
|
||||
- [ ] Review validation schemas match your API requirements
|
||||
- [ ] Configure Prometheus scraping if using metrics
|
||||
- [ ] Set up Grafana dashboards
|
||||
- [ ] Configure alert rules in monitoring system
|
||||
- [ ] Test health checks with load balancer
|
||||
- [ ] Verify Kubernetes probes (/health/ready, /health/live)
|
||||
- [ ] Test circuit breaker reset endpoint access controls
|
||||
- [ ] Document monitoring endpoints for ops team
|
||||
- [ ] Set up alert routing (PagerDuty, Slack, etc.)
|
||||
- [ ] Test retry mechanism with network failures
|
||||
|
||||
---
|
||||
|
||||
## Future Enhancements (Recommendations)
|
||||
|
||||
### High Priority
|
||||
1. **Structured Logging**: Add request tracing with correlation IDs
|
||||
2. **Metrics Export**: Prometheus metrics endpoint
|
||||
3. **Rate Limiting**: Per-tenant API rate limits
|
||||
4. **Caching**: Redis-based response caching
|
||||
|
||||
### Medium Priority
|
||||
5. **Async Task Queue**: Celery/Temporal for better job management
|
||||
6. **Model Registry**: Centralized model versioning
|
||||
7. **A/B Testing**: Model comparison framework
|
||||
8. **Data Lineage**: Track data provenance
|
||||
|
||||
### Low Priority
|
||||
9. **GraphQL API**: Alternative to REST
|
||||
10. **WebSocket Updates**: Real-time job progress
|
||||
11. **Audit Logging**: Comprehensive action audit trail
|
||||
12. **Export APIs**: Bulk data export endpoints
|
||||
|
||||
---
|
||||
|
||||
*Phase 2 Implementation Complete: 2025-10-07*
|
||||
*Features Added: 12*
|
||||
*Lines of Code: ~1,250*
|
||||
*Status: Production Ready*
|
||||
271
services/training/scripts/demo/seed_demo_ai_models.py
Normal file
271
services/training/scripts/demo/seed_demo_ai_models.py
Normal file
@@ -0,0 +1,271 @@
|
||||
"""
|
||||
Demo AI Models Seed Script
|
||||
Creates fake AI models for demo tenants to populate the models list
|
||||
without having actual trained model files.
|
||||
|
||||
This script uses hardcoded tenant and product IDs to avoid cross-database dependencies.
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import sys
|
||||
import os
|
||||
from uuid import UUID
|
||||
from datetime import datetime, timezone, timedelta
|
||||
from decimal import Decimal
|
||||
|
||||
# Add project root to path
|
||||
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), "../..")))
|
||||
|
||||
from sqlalchemy import select
|
||||
from shared.database.base import create_database_manager
|
||||
import structlog
|
||||
|
||||
# Import models - these paths work both locally and in container
|
||||
try:
|
||||
# Container environment (training-service image)
|
||||
from app.models.training import TrainedModel
|
||||
except ImportError:
|
||||
# Local environment
|
||||
from services.training.app.models.training import TrainedModel
|
||||
|
||||
logger = structlog.get_logger()
|
||||
|
||||
# ============================================================================
|
||||
# HARDCODED DEMO DATA (from seed scripts)
|
||||
# ============================================================================
|
||||
|
||||
# Demo Tenant IDs (from seed_demo_tenants.py)
|
||||
DEMO_TENANT_SAN_PABLO = UUID("a1b2c3d4-e5f6-47a8-b9c0-d1e2f3a4b5c6")
|
||||
DEMO_TENANT_LA_ESPIGA = UUID("b2c3d4e5-f6a7-48b9-c0d1-e2f3a4b5c6d7")
|
||||
|
||||
# Sample Product IDs for each tenant (these should match finished products from inventory seed)
|
||||
# Note: These are example UUIDs - in production, these would be actual product IDs from inventory
|
||||
DEMO_PRODUCTS = {
|
||||
DEMO_TENANT_SAN_PABLO: [
|
||||
{"id": UUID("10000000-0000-0000-0000-000000000001"), "name": "Barra de Pan"},
|
||||
{"id": UUID("10000000-0000-0000-0000-000000000002"), "name": "Croissant"},
|
||||
{"id": UUID("10000000-0000-0000-0000-000000000003"), "name": "Magdalenas"},
|
||||
{"id": UUID("10000000-0000-0000-0000-000000000004"), "name": "Empanada"},
|
||||
{"id": UUID("10000000-0000-0000-0000-000000000005"), "name": "Pan Integral"},
|
||||
],
|
||||
DEMO_TENANT_LA_ESPIGA: [
|
||||
{"id": UUID("20000000-0000-0000-0000-000000000001"), "name": "Pan de Molde"},
|
||||
{"id": UUID("20000000-0000-0000-0000-000000000002"), "name": "Bollo Suizo"},
|
||||
{"id": UUID("20000000-0000-0000-0000-000000000003"), "name": "Palmera de Chocolate"},
|
||||
{"id": UUID("20000000-0000-0000-0000-000000000004"), "name": "Napolitana"},
|
||||
{"id": UUID("20000000-0000-0000-0000-000000000005"), "name": "Pan Rústico"},
|
||||
]
|
||||
}
|
||||
|
||||
|
||||
class DemoAIModelSeeder:
|
||||
"""Seed fake AI models for demo tenants"""
|
||||
|
||||
def __init__(self):
|
||||
self.training_db_url = os.getenv("TRAINING_DATABASE_URL") or os.getenv("DATABASE_URL")
|
||||
|
||||
if not self.training_db_url:
|
||||
raise ValueError("Missing TRAINING_DATABASE_URL or DATABASE_URL")
|
||||
|
||||
# Convert to async URL if needed
|
||||
if self.training_db_url.startswith("postgresql://"):
|
||||
self.training_db_url = self.training_db_url.replace(
|
||||
"postgresql://", "postgresql+asyncpg://", 1
|
||||
)
|
||||
|
||||
self.training_db = create_database_manager(self.training_db_url, "demo-ai-seed")
|
||||
|
||||
async def create_fake_model(self, session, tenant_id: UUID, product_info: dict):
|
||||
"""Create a fake AI model entry for a product"""
|
||||
now = datetime.now(timezone.utc)
|
||||
training_start = now - timedelta(days=90)
|
||||
training_end = now - timedelta(days=7)
|
||||
|
||||
fake_model = TrainedModel(
|
||||
tenant_id=tenant_id,
|
||||
inventory_product_id=product_info["id"],
|
||||
model_type="prophet_optimized",
|
||||
model_version="1.0-demo",
|
||||
job_id=f"demo-job-{tenant_id}-{product_info['id']}",
|
||||
|
||||
# Fake file paths (files don't actually exist)
|
||||
model_path=f"/fake/models/{tenant_id}/{product_info['id']}/model.pkl",
|
||||
metadata_path=f"/fake/models/{tenant_id}/{product_info['id']}/metadata.json",
|
||||
|
||||
# Fake but realistic metrics
|
||||
mape=Decimal("12.5"), # Mean Absolute Percentage Error
|
||||
mae=Decimal("2.3"), # Mean Absolute Error
|
||||
rmse=Decimal("3.1"), # Root Mean Squared Error
|
||||
r2_score=Decimal("0.85"), # R-squared
|
||||
training_samples=60, # 60 days of training data
|
||||
|
||||
# Fake hyperparameters
|
||||
hyperparameters={
|
||||
"changepoint_prior_scale": 0.05,
|
||||
"seasonality_prior_scale": 10.0,
|
||||
"holidays_prior_scale": 10.0,
|
||||
"seasonality_mode": "multiplicative"
|
||||
},
|
||||
|
||||
# Features used
|
||||
features_used=["weekday", "month", "is_holiday", "temperature", "precipitation"],
|
||||
|
||||
# Normalization params (fake)
|
||||
normalization_params={
|
||||
"temperature": {"mean": 15.0, "std": 5.0},
|
||||
"precipitation": {"mean": 2.0, "std": 1.5}
|
||||
},
|
||||
|
||||
# Model status
|
||||
is_active=True,
|
||||
is_production=False, # Demo models are not production-ready
|
||||
|
||||
# Training data info
|
||||
training_start_date=training_start,
|
||||
training_end_date=training_end,
|
||||
data_quality_score=Decimal("0.75"), # Good but not excellent
|
||||
|
||||
# Metadata
|
||||
notes=f"Demo model for {product_info['name']} - No actual trained file exists. For demonstration purposes only.",
|
||||
created_by="demo-seed-script",
|
||||
created_at=now,
|
||||
updated_at=now,
|
||||
last_used_at=None
|
||||
)
|
||||
|
||||
session.add(fake_model)
|
||||
return fake_model
|
||||
|
||||
async def seed_models_for_tenant(self, tenant_id: UUID, tenant_name: str, products: list):
|
||||
"""Create fake AI models for a demo tenant"""
|
||||
logger.info(
|
||||
"Creating fake AI models for demo tenant",
|
||||
tenant_id=str(tenant_id),
|
||||
tenant_name=tenant_name,
|
||||
product_count=len(products)
|
||||
)
|
||||
|
||||
try:
|
||||
async with self.training_db.get_session() as session:
|
||||
models_created = 0
|
||||
|
||||
for product in products:
|
||||
# Check if model already exists
|
||||
result = await session.execute(
|
||||
select(TrainedModel).where(
|
||||
TrainedModel.tenant_id == tenant_id,
|
||||
TrainedModel.inventory_product_id == product["id"]
|
||||
)
|
||||
)
|
||||
existing_model = result.scalars().first()
|
||||
|
||||
if existing_model:
|
||||
logger.info(
|
||||
"Model already exists, skipping",
|
||||
tenant_id=str(tenant_id),
|
||||
product_name=product["name"],
|
||||
product_id=str(product["id"])
|
||||
)
|
||||
continue
|
||||
|
||||
# Create fake model
|
||||
model = await self.create_fake_model(session, tenant_id, product)
|
||||
models_created += 1
|
||||
|
||||
logger.info(
|
||||
"Created fake AI model",
|
||||
tenant_id=str(tenant_id),
|
||||
product_name=product["name"],
|
||||
product_id=str(product["id"]),
|
||||
model_id=str(model.id)
|
||||
)
|
||||
|
||||
await session.commit()
|
||||
|
||||
logger.info(
|
||||
"✅ Successfully created fake AI models for tenant",
|
||||
tenant_id=str(tenant_id),
|
||||
tenant_name=tenant_name,
|
||||
models_created=models_created
|
||||
)
|
||||
|
||||
return models_created
|
||||
|
||||
except Exception as e:
|
||||
logger.error(
|
||||
"❌ Error creating fake AI models for tenant",
|
||||
tenant_id=str(tenant_id),
|
||||
tenant_name=tenant_name,
|
||||
error=str(e),
|
||||
exc_info=True
|
||||
)
|
||||
raise
|
||||
|
||||
async def seed_all_demo_models(self):
|
||||
"""Seed fake AI models for all demo tenants"""
|
||||
logger.info("=" * 80)
|
||||
logger.info("🤖 Starting Demo AI Models Seeding")
|
||||
logger.info("=" * 80)
|
||||
|
||||
total_models_created = 0
|
||||
|
||||
try:
|
||||
# Seed models for San Pablo
|
||||
san_pablo_count = await self.seed_models_for_tenant(
|
||||
tenant_id=DEMO_TENANT_SAN_PABLO,
|
||||
tenant_name="Panadería San Pablo",
|
||||
products=DEMO_PRODUCTS[DEMO_TENANT_SAN_PABLO]
|
||||
)
|
||||
total_models_created += san_pablo_count
|
||||
|
||||
# Seed models for La Espiga
|
||||
la_espiga_count = await self.seed_models_for_tenant(
|
||||
tenant_id=DEMO_TENANT_LA_ESPIGA,
|
||||
tenant_name="Panadería La Espiga",
|
||||
products=DEMO_PRODUCTS[DEMO_TENANT_LA_ESPIGA]
|
||||
)
|
||||
total_models_created += la_espiga_count
|
||||
|
||||
logger.info("=" * 80)
|
||||
logger.info(
|
||||
"✅ Demo AI Models Seeding Completed",
|
||||
total_models_created=total_models_created,
|
||||
tenants_processed=2
|
||||
)
|
||||
logger.info("=" * 80)
|
||||
|
||||
except Exception as e:
|
||||
logger.error("=" * 80)
|
||||
logger.error("❌ Demo AI Models Seeding Failed")
|
||||
logger.error("=" * 80)
|
||||
logger.error("Error: %s", str(e))
|
||||
raise
|
||||
|
||||
|
||||
async def main():
|
||||
"""Main entry point"""
|
||||
logger.info("Demo AI Models Seed Script Starting")
|
||||
logger.info("Mode: %s", os.getenv("DEMO_MODE", "development"))
|
||||
logger.info("Log Level: %s", os.getenv("LOG_LEVEL", "INFO"))
|
||||
|
||||
try:
|
||||
seeder = DemoAIModelSeeder()
|
||||
await seeder.seed_all_demo_models()
|
||||
|
||||
logger.info("")
|
||||
logger.info("🎉 Success! Demo AI models are ready.")
|
||||
logger.info("")
|
||||
logger.info("Note: These are fake models for demo purposes only.")
|
||||
logger.info(" No actual model files exist on disk.")
|
||||
logger.info("")
|
||||
|
||||
return 0
|
||||
|
||||
except Exception as e:
|
||||
logger.error("Demo AI models seed failed", error=str(e), exc_info=True)
|
||||
return 1
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
exit_code = asyncio.run(main())
|
||||
sys.exit(exit_code)
|
||||
Reference in New Issue
Block a user