# Clean WebSocket Implementation - Status Report

## Architecture Overview

### Clean KISS Design (Divide and Conquer)
```
Frontend WebSocket → Gateway (Token Verification Only) → Training Service WebSocket → RabbitMQ Events → Broadcast to All Clients
```

## ✅ COMPLETED Components

### 1. WebSocket Connection Manager (`services/training/app/websocket/manager.py`)
- **Status**: ✅ COMPLETE
- Simple connection manager for WebSocket clients
- Thread-safe connection tracking per `job_id`
- Broadcasting capability to all connected clients
- Auto-cleanup of failed connections

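
The responsibilities listed for the connection manager can be sketched roughly as follows. This is a minimal illustration and not the contents of `manager.py`: the class layout, the `JsonSocket` protocol, and the use of `asyncio.Lock` (which guards against interleaved coroutines rather than OS threads) are all assumptions.

```python
import asyncio
from collections import defaultdict
from typing import Any, Dict, Protocol, Set


class JsonSocket(Protocol):
    """Hypothetical stand-in for a WebSocket: anything with an async send_json."""

    async def send_json(self, data: Any) -> None: ...


class ConnectionManager:
    """Tracks sockets per job_id and broadcasts messages to them."""

    def __init__(self) -> None:
        self._connections: Dict[str, Set[JsonSocket]] = defaultdict(set)
        self._lock = asyncio.Lock()

    async def connect(self, job_id: str, socket: JsonSocket) -> None:
        async with self._lock:
            self._connections[job_id].add(socket)

    async def disconnect(self, job_id: str, socket: JsonSocket) -> None:
        async with self._lock:
            self._connections[job_id].discard(socket)
            if not self._connections[job_id]:
                del self._connections[job_id]

    async def broadcast(self, job_id: str, message: dict) -> int:
        """Send message to every client of job_id; return how many received it."""
        async with self._lock:
            # Copy so that sockets can be removed while we iterate.
            targets = list(self._connections.get(job_id, ()))
        delivered = 0
        for socket in targets:
            try:
                await socket.send_json(message)
                delivered += 1
            except Exception:
                # Auto-cleanup: drop sockets that fail on send.
                await self.disconnect(job_id, socket)
        return delivered
```

A socket that raises during `send_json` is removed on the spot, so the next broadcast only touches healthy connections.
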
### 2. RabbitMQ Event Consumer (`services/training/app/websocket/events.py`)
- **Status**: ✅ COMPLETE
- Global consumer that listens to all `training.*` events
- Automatically broadcasts events to WebSocket clients
- Maps RabbitMQ event types to WebSocket message types
- Sets up on service startup

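
The mapping from RabbitMQ event types to WebSocket message types might look like the sketch below. The routing keys and message types are the ones named elsewhere in this report, except `training.failed`, which is an assumed key. The names `ROUTING_KEY_TO_WS_TYPE` and `to_ws_message`, and the `job_id`/`data` payload fields, are also assumptions made for illustration.

```python
import json
from typing import Optional

# Routing key -> WebSocket message type.
ROUTING_KEY_TO_WS_TYPE = {
    "training.started": "started",
    "training.progress": "progress",
    "training.product.completed": "product_completed",
    "training.completed": "completed",
    "training.failed": "failed",  # assumed routing key; not stated in this report
}


def to_ws_message(routing_key: str, body: bytes) -> Optional[dict]:
    """Translate one RabbitMQ delivery into a WebSocket message.

    Returns None for routing keys that have no WebSocket equivalent,
    so the caller can skip them instead of broadcasting.
    """
    ws_type = ROUTING_KEY_TO_WS_TYPE.get(routing_key)
    if ws_type is None:
        return None
    event = json.loads(body)
    return {
        "type": ws_type,
        "job_id": event.get("job_id"),  # assumed payload field
        "data": event.get("data", {}),  # assumed payload field
    }
```

Keeping the table in one place means the consumer callback only has to call `to_ws_message` and hand the result to the connection manager's broadcast.
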
### 3. Clean Event Publishers (`services/training/app/services/training_events.py`)
- **Status**: ✅ COMPLETE
- **4 main events** as specified, plus a failure event:
  1. `publish_training_started()` - 0% progress
  2. `publish_data_analysis()` - 20% progress
  3. `publish_product_training_completed()` - contributes to 20-80% progress
  4. `publish_training_completed()` - 100% progress
  5. `publish_training_failed()` - error handling

### 4. WebSocket Endpoint (`services/training/app/api/websocket_operations.py`)
- **Status**: ✅ COMPLETE
- Simple endpoint at `/api/v1/tenants/{tenant_id}/training/jobs/{job_id}/live`
- Token validation
- Connection management
- Ping/pong support
- Receives broadcasts from RabbitMQ consumer

### 5. Gateway WebSocket Proxy (`gateway/app/main.py`)
- **Status**: ✅ COMPLETE
- **KISS**: Token verification ONLY
- Simple bidirectional forwarding
- No business logic
- Clean error handling

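
Bidirectional forwarding of this kind is often written as two one-way pumps that run concurrently. The sketch below shows that pattern under stated assumptions: each side is exposed as a hypothetical pair of `receive`/`send` coroutines, `receive` returns `None` when the peer closes, and token verification is left out. It is not the code in `gateway/app/main.py`.

```python
import asyncio
from typing import Awaitable, Callable, Optional

Receive = Callable[[], Awaitable[Optional[str]]]  # returns None when the peer closes
Send = Callable[[str], Awaitable[None]]


async def pump(receive: Receive, send: Send) -> None:
    """Copy messages one way until the source closes."""
    while True:
        message = await receive()
        if message is None:
            return
        await send(message)


async def proxy(
    client_recv: Receive,
    client_send: Send,
    upstream_recv: Receive,
    upstream_send: Send,
) -> None:
    """Forward in both directions; stop as soon as either side closes."""
    tasks = [
        asyncio.create_task(pump(client_recv, upstream_send)),
        asyncio.create_task(pump(upstream_recv, client_send)),
    ]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    # Wait for the cancelled pump to finish unwinding.
    await asyncio.gather(*pending, return_exceptions=True)
    for task in done:
        task.result()  # re-raise any forwarding error
```

Stopping on the first completed pump keeps the proxy free of business logic: when one side goes away, the other direction is cancelled and the caller can close both sockets.
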
### 6. Parallel Product Progress Tracker (`services/training/app/services/progress_tracker.py`)
- **Status**: ✅ COMPLETE
- Thread-safe tracking of parallel product training
- Automatic progress calculation (20-80% range)
- Each product completion adds 60/N percentage points, where N = total products
- Emits `publish_product_training_completed` events

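
The tracking and progress arithmetic described for the tracker can be illustrated with a small stand-in class. It is deliberately named `ProgressTrackerSketch` to make clear that it is not the real `ParallelProductProgressTracker`. The injected `publish` callback, the event payload fields, and the `asyncio.Lock` are assumptions chosen to keep the example self-contained.

```python
import asyncio
from typing import Awaitable, Callable

Publish = Callable[[dict], Awaitable[None]]  # hypothetical event sink


class ProgressTrackerSketch:
    """Counts completed products and maps the count onto the 20-80% range."""

    def __init__(self, job_id: str, tenant_id: str, total_products: int, publish: Publish) -> None:
        if total_products <= 0:
            raise ValueError("total_products must be positive")
        self.job_id = job_id
        self.tenant_id = tenant_id
        self.total_products = total_products
        self._publish = publish
        self._completed = 0
        self._lock = asyncio.Lock()

    async def mark_product_completed(self, product_name: str) -> int:
        """Record one finished product, emit an event, and return the progress."""
        async with self._lock:
            # Never count past the total, even if called too many times.
            self._completed = min(self._completed + 1, self.total_products)
            completed = self._completed
        # Integer form of 20 + floor(completed / total * 60).
        progress = 20 + (completed * 60) // self.total_products
        await self._publish(
            {
                "job_id": self.job_id,
                "tenant_id": self.tenant_id,
                "product_name": product_name,
                "products_completed": completed,
                "total_products": self.total_products,
                "progress": progress,
            }
        )
        return progress
```

With three products, the three completions report 40, 60, and 80, in whatever order the parallel tasks happen to finish.
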
### 7. Service Integration (`services/training/app/main.py`)
- **Status**: ✅ COMPLETE
- Added WebSocket router to FastAPI app
- Setup WebSocket event consumer on startup
- Cleanup on shutdown

### 8. Removed Legacy Code
- **Status**: ✅ COMPLETE
- ❌ Deleted all WebSocket code from `training_operations.py`
- ❌ Removed ConnectionManager, message cache, backfill logic
- ❌ Removed per-job RabbitMQ consumers
- ❌ Simplified event imports

## 🚧 PENDING Components

### 1. Update Training Service to Use New Events
- **File**: `services/training/app/services/training_service.py`
- **Current**: Uses old `TrainingStatusPublisher` with many granular events
- **Needed**: Replace with 4 clean events:
  ```python
  # 1. Start (0%)
  await publish_training_started(job_id, tenant_id, total_products)

  # 2. Data Analysis (20%)
  await publish_data_analysis(job_id, tenant_id, "Analysis details...")

  # 3. Product Training (20-80%) - use ParallelProductProgressTracker
  tracker = ParallelProductProgressTracker(job_id, tenant_id, total_products)
  # In parallel training loop:
  await tracker.mark_product_completed(product_name)

  # 4. Completion (100%)
  await publish_training_completed(job_id, tenant_id, successful, failed, duration)
  ```

### 2. Update Training Orchestrator/Trainer
- **File**: `services/training/app/ml/trainer.py` (likely)
- **Needed**: Integrate `ParallelProductProgressTracker` in parallel training loop
- Must emit event for each product completion (order doesn't matter)

### 3. Remove Old Messaging Module
- **File**: `services/training/app/services/messaging.py`
- **Status**: Still exists with old complex event publishers
- **Action**: Can be removed once `training_service.py` is updated
- Keep only the new `training_events.py`

### 4. Update Frontend WebSocket Client
- **File**: `frontend/src/api/hooks/training.ts`
- **Current**: Already implemented, but expects a different set of message types
- **Needed**: Update to handle new message types:
  - `started` - 0%
  - `progress` - for data_analysis (20%)
  - `product_completed` - for each product (calculate 20 + (completed/total * 60))
  - `completed` - 100%
  - `failed` - error

### 5. Frontend Progress Calculation
- **Location**: Frontend WebSocket message handler
- **Logic Needed**:
  ```typescript
  case 'product_completed': {
    // Braces scope the const declarations to this case
    const { products_completed, total_products } = message.data;
    const progress = 20 + Math.floor((products_completed / total_products) * 60);
    // Update UI with progress
    break;
  }
  ```

## Event Flow Diagram

```
Training Start
    ↓
[Event 1: training.started] → 0% progress
    ↓
Data Analysis
    ↓
[Event 2: training.progress] → 20% progress (data_analysis step)
    ↓
Product Training (Parallel)
    ↓
[Event 3a: training.product.completed] → Product 1 done
[Event 3b: training.product.completed] → Product 2 done
[Event 3c: training.product.completed] → Product 3 done
... (progress calculated as: 20 + (completed/total * 60))
    ↓
[Event 3n: training.product.completed] → Product N done → 80% progress
    ↓
Training Complete
    ↓
[Event 4: training.completed] → 100% progress
```

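
To make the flow concrete, the sketch below folds a sequence of WebSocket messages into progress percentages. The real handler lives in the TypeScript client. This Python version only illustrates the arithmetic, and the function name `progress_for` and the `data` field layout are assumptions.

```python
from typing import Optional


def progress_for(message: dict) -> Optional[int]:
    """Return the progress percentage implied by one message, or None if it has none."""
    kind = message.get("type")
    data = message.get("data", {})
    if kind == "started":
        return 0
    if kind == "progress":
        return 20  # data_analysis step
    if kind == "product_completed":
        done = data["products_completed"]
        total = data["total_products"]
        return 20 + (done * 60) // total
    if kind == "completed":
        return 100
    return None  # "failed" and unknown types carry no percentage


# A job with three products, in the order shown in the diagram.
messages = [
    {"type": "started"},
    {"type": "progress", "data": {"step": "data_analysis"}},
    {"type": "product_completed", "data": {"products_completed": 1, "total_products": 3}},
    {"type": "product_completed", "data": {"products_completed": 2, "total_products": 3}},
    {"type": "product_completed", "data": {"products_completed": 3, "total_products": 3}},
    {"type": "completed"},
]
print([progress_for(m) for m in messages])  # [0, 20, 40, 60, 80, 100]
```
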
## Key Design Principles

1. **KISS (Keep It Simple, Stupid)**
   - No complex caching or backfilling
   - No per-job consumers
   - One global consumer broadcasts to all clients
   - Simple, stateless WebSocket connections

2. **Divide and Conquer**
   - Gateway: Token verification only
   - Training Service: WebSocket connections + RabbitMQ consumer
   - Progress Tracker: Parallel training progress
   - Event Publishers: 4 simple event types, plus a failure event

3. **No Backward Compatibility**
   - Deleted all legacy WebSocket code
   - Clean slate implementation
   - No TODOs (implement everything)

## Next Steps

1. Update `training_service.py` to use new event publishers
2. Update trainer to integrate `ParallelProductProgressTracker`
3. Remove old `messaging.py` module
4. Update frontend WebSocket client message handlers
5. Test end-to-end flow
6. Monitor WebSocket connections in production

## Testing Checklist

- [ ] WebSocket connection established through gateway
- [ ] Token verification works (valid and invalid tokens)
- [ ] Event 1 (started) received with 0% progress
- [ ] Event 2 (data_analysis) received with 20% progress
- [ ] Event 3 (product_completed) received for each product
- [ ] Progress correctly calculated (20 + completed/total * 60)
- [ ] Event 4 (completed) received with 100% progress
- [ ] Error events handled correctly
- [ ] Multiple concurrent clients receive same events
- [ ] Connection survives network hiccups
- [ ] Clean disconnection when training completes

## Files Modified

### Created:
- `services/training/app/websocket/manager.py`
- `services/training/app/websocket/events.py`
- `services/training/app/websocket/__init__.py`
- `services/training/app/api/websocket_operations.py`
- `services/training/app/services/training_events.py`
- `services/training/app/services/progress_tracker.py`

### Modified:
- `services/training/app/main.py` - Added WebSocket router and event consumer setup
- `services/training/app/api/training_operations.py` - Removed all WebSocket code
- `gateway/app/main.py` - Simplified WebSocket proxy

### To Remove:
- `services/training/app/services/messaging.py` - Replace with `training_events.py`

## Notes

- RabbitMQ exchange: `training.events`
- Routing keys: `training.*` (wildcard). In a RabbitMQ topic exchange `*` matches exactly one word, so a binding that should also receive `training.product.completed` needs `training.#`
- WebSocket URL: `ws://gateway/api/v1/tenants/{tenant_id}/training/jobs/{job_id}/live?token={token}`
- Progress range: 0% → 20% → 20-80% (products) → 100%
- Each product contributes 60/N percentage points, where N = total products

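
The difference between the `*` and `#` wildcards in topic-exchange bindings can be checked with a small matcher. This is a simplified model of the matching rules written for illustration, not RabbitMQ's implementation, and `topic_matches` is a hypothetical helper name.

```python
def topic_matches(pattern: str, routing_key: str) -> bool:
    """Match a dot-separated routing key against a topic binding pattern.

    '*' stands for exactly one word, '#' for zero or more words.
    """
    p = pattern.split(".")
    k = routing_key.split(".")

    def match(i: int, j: int) -> bool:
        if i == len(p):
            return j == len(k)
        if p[i] == "#":
            # Let '#' absorb zero, one, or several of the remaining words.
            return any(match(i + 1, jj) for jj in range(j, len(k) + 1))
        if j < len(k) and (p[i] == "*" or p[i] == k[j]):
            return match(i + 1, j + 1)
        return False

    return match(0, 0)


print(topic_matches("training.*", "training.started"))            # True
print(topic_matches("training.*", "training.product.completed"))  # False
print(topic_matches("training.#", "training.product.completed"))  # True
```

Under these rules a single `training.*` binding would miss the three-word per-product key, which is why the binding pattern is worth verifying against the keys the publishers actually use.
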