REFACTOR external service and improve websocket training
167
MODEL_STORAGE_FIX.md
Normal file
@@ -0,0 +1,167 @@
# Model Storage Fix - Root Cause Analysis & Resolution

## Problem Summary

**Error**: `Model file not found: /app/models/{tenant_id}/{model_id}.pkl`

**Impact**: The forecasting service could not generate predictions and returned 500 errors.

## Root Cause Analysis

### The Issue

Both the training and forecasting services were configured to save and load ML models at `/app/models`, but **no persistent storage was configured** for that path. The failure unfolded as follows:

1. **Training service** saves the model file to `/app/models/{tenant_id}/{model_id}.pkl` on the in-container filesystem
2. **Model metadata** is saved to the database successfully
3. **The container restarts**, or a different pod instance handles the request → the file is no longer there
4. **Forecasting service** tries to load the model from `/app/models/...` → **File not found**

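The sequence above can be reproduced outside the cluster. The following is a minimal sketch, not the services' actual code: `save_model`, `load_model`, and the tenant/model IDs are hypothetical stand-ins, and two separate temporary directories play the role of two containers' private filesystems.

```python
import pickle
import tempfile
from pathlib import Path


def model_path(root: Path, tenant_id: str, model_id: str) -> Path:
    """Build <root>/<tenant_id>/<model_id>.pkl, the layout used in this document."""
    return root / tenant_id / f"{model_id}.pkl"


def save_model(root: Path, tenant_id: str, model_id: str, model: object) -> Path:
    """Hypothetical stand-in for the training service's save step."""
    path = model_path(root, tenant_id, model_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(model, fh)
    return path


def load_model(root: Path, tenant_id: str, model_id: str) -> object:
    """Hypothetical stand-in for the forecasting service's load step."""
    path = model_path(root, tenant_id, model_id)
    if not path.is_file():
        raise FileNotFoundError(f"Model file not found: {path}")
    with path.open("rb") as fh:
        return pickle.load(fh)


def demo() -> str:
    # Each temporary directory stands for one container's own /app/models.
    with tempfile.TemporaryDirectory() as training_fs, \
            tempfile.TemporaryDirectory() as forecasting_fs:
        save_model(Path(training_fs), "tenant-a", "model-1", {"weights": [1, 2, 3]})
        try:
            # Same relative path, different filesystem: the file is not there.
            load_model(Path(forecasting_fs), "tenant-a", "model-1")
        except FileNotFoundError as exc:
            return str(exc)
    return "loaded"


if __name__ == "__main__":
    print(demo())
```

The database row from step 2 would still point at the path, which is why the metadata lookup succeeds while the file load fails.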
### Evidence from Logs

```
[error] Model file not found: /app/models/d3fe350f-ffcb-439c-9d66-65851b0cf0c7/2096bc66-aef7-4499-a79c-c4d40d5aa9f1.pkl
[error] Model file not valid: /app/models/d3fe350f-ffcb-439c-9d66-65851b0cf0c7/2096bc66-aef7-4499-a79c-c4d40d5aa9f1.pkl
[error] Error generating prediction error=Model 2096bc66-aef7-4499-a79c-c4d40d5aa9f1 not found or failed to load
```

### Architecture Flaw

- Training service deployment: had only an `emptyDir` volume mounted at `/tmp`
- Forecasting service deployment: had no volumes at all
- Model files were stored in the ephemeral container filesystem
- No persistent storage was shared between the two services

## Solution Implemented

### 1. Created Persistent Volume Claim

**File**: `infrastructure/kubernetes/base/components/volumes/model-storage-pvc.yaml`

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-storage
  namespace: bakery-ia
spec:
  accessModes:
    - ReadWriteOnce  # Single node access
  resources:
    requests:
      storage: 10Gi
  storageClassName: standard  # Uses local-path provisioner
```

### 2. Updated Training Service

**File**: `infrastructure/kubernetes/base/components/training/training-service.yaml`

Added volume mount:

```yaml
volumeMounts:
  - name: model-storage
    mountPath: /app/models  # Training writes models here

volumes:
  - name: model-storage
    persistentVolumeClaim:
      claimName: model-storage
```

### 3. Updated Forecasting Service

**File**: `infrastructure/kubernetes/base/components/forecasting/forecasting-service.yaml`

Added READ-ONLY volume mount:

```yaml
volumeMounts:
  - name: model-storage
    mountPath: /app/models
    readOnly: true  # Forecasting only reads models

volumes:
  - name: model-storage
    persistentVolumeClaim:
      claimName: model-storage
      readOnly: true
```

### 4. Updated Kustomization

Added the PVC to the resource list in `infrastructure/kubernetes/base/kustomization.yaml`.

## Verification

### PVC Status

```bash
kubectl get pvc -n bakery-ia model-storage
# STATUS: Bound (10Gi, RWO)
```

### Volume Mounts Verified

```bash
# Training service
kubectl exec -n bakery-ia deployment/training-service -- ls -la /app/models
# ✅ Directory exists and is writable

# Forecasting service
kubectl exec -n bakery-ia deployment/forecasting-service -- ls -la /app/models
# ✅ Directory exists and is readable (same volume)
```

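`ls -la` shows that the directory exists and what its permission bits are, but it does not exercise a write. One way to check the mount from inside a service is a small probe run at startup. This is a sketch only: `check_model_dir` is a hypothetical helper, not something in the existing services.

```python
import os
import tempfile
from pathlib import Path


def check_model_dir(path: Path, need_write: bool) -> list[str]:
    """Return a list of problems; an empty list means the mount looks usable."""
    if not path.is_dir():
        return [f"{path} is not a directory"]

    problems = []
    if not os.access(path, os.R_OK | os.X_OK):
        problems.append(f"{path} is not readable")

    if need_write:
        # Probe by creating (and automatically deleting) a real file, since
        # permission bits alone do not reveal a read-only mount.
        try:
            with tempfile.NamedTemporaryFile(dir=path, prefix=".write-probe-"):
                pass
        except OSError as exc:
            problems.append(f"{path} is not writable: {exc}")
    return problems


if __name__ == "__main__":
    # Training would pass need_write=True; forecasting mounts read-only.
    for problem in check_model_dir(Path("/app/models"), need_write=True):
        print(problem)
```

The forecasting service would call it with `need_write=False`, because its mount is declared `readOnly: true`.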
## Deployment Steps

```bash
# 1. Create PVC
kubectl apply -f infrastructure/kubernetes/base/components/volumes/model-storage-pvc.yaml

# 2. Recreate training service (deployment selector is immutable)
kubectl delete deployment training-service -n bakery-ia
kubectl apply -f infrastructure/kubernetes/base/components/training/training-service.yaml

# 3. Recreate forecasting service
kubectl delete deployment forecasting-service -n bakery-ia
kubectl apply -f infrastructure/kubernetes/base/components/forecasting/forecasting-service.yaml

# 4. Verify pods are running
kubectl get pods -n bakery-ia | grep -E "(training|forecasting)"
```

## How It Works Now

1. **Training Flow**:
   - Model trained → saved to `/app/models/{tenant_id}/{model_id}.pkl`
   - File persisted to the PersistentVolume (survives pod restarts)
   - Metadata saved to the database with the model path

2. **Forecasting Flow**:
   - Retrieves model metadata from the database
   - Loads the model from `/app/models/{tenant_id}/{model_id}.pkl`
   - File exists in the shared PersistentVolume ✅
   - Prediction succeeds ✅

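With one writer and one reader on the same volume, it helps if a model file only becomes visible once it is complete. The sketch below illustrates the two flows with an atomic write (temporary file plus `os.replace`). The atomic write is an addition for illustration and is not described in this fix; `save_model_atomic` and `load_model` are hypothetical names, and a temporary directory stands in for the shared mount.

```python
import os
import pickle
import tempfile
from pathlib import Path


def save_model_atomic(root: Path, tenant_id: str, model_id: str, model: object) -> Path:
    """Training side: write <root>/<tenant_id>/<model_id>.pkl atomically."""
    target = root / tenant_id / f"{model_id}.pkl"
    target.parent.mkdir(parents=True, exist_ok=True)

    # Write to a temp file in the same directory (same filesystem), then
    # rename it into place so readers never observe a partial file.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as fh:
            pickle.dump(model, fh)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp_name, target)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return target


def load_model(root: Path, tenant_id: str, model_id: str) -> object:
    """Forecasting side: load the model from the shared root."""
    path = root / tenant_id / f"{model_id}.pkl"
    with path.open("rb") as fh:
        return pickle.load(fh)


def round_trip() -> object:
    # One directory plays the shared PersistentVolume seen by both services.
    with tempfile.TemporaryDirectory() as shared:
        root = Path(shared)
        save_model_atomic(root, "tenant-a", "model-1", {"weights": [1, 2, 3]})
        return load_model(root, "tenant-a", "model-1")


if __name__ == "__main__":
    print(round_trip())
```

Because both services resolve the same relative path under the same mount, the path stored in the database metadata stays valid across pod restarts.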
## Storage Configuration

- **Type**: PersistentVolumeClaim with local-path provisioner
- **Access Mode**: ReadWriteOnce (mountable by multiple pods, but only from a single node, so the training and forecasting pods must be scheduled on the same node)
- **Size**: 10Gi (adjustable)
- **Lifecycle**: Independent of pod lifecycle
- **Shared**: Same volume mounted by both services

## Benefits

1. **Data Persistence**: Models survive pod restarts and crashes
2. **Cross-Service Access**: Training writes, Forecasting reads
3. **Scalability**: Storage size can be increased as needed, provided the storage class supports volume expansion
4. **Reliability**: No data loss when containers are recreated

## Future Improvements

For production environments, consider:

1. **ReadWriteMany volumes**: Use NFS/CephFS for multi-node clusters
2. **Model versioning**: Implement model lifecycle management
3. **Backup strategy**: Regular backups of model storage
4. **Monitoring**: Track storage usage and model count
5. **Cloud storage**: S3/GCS for distributed deployments

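For the monitoring item, a report along these lines could feed whatever metrics system is in use. It is a rough sketch under stated assumptions: `storage_report` is a hypothetical helper, and it assumes the `<tenant_id>/<model_id>.pkl` layout described in this document.

```python
import shutil
import tempfile
from pathlib import Path


def storage_report(root: Path) -> dict:
    """Summarise model count and disk usage under the model storage root."""
    # Only files matching <tenant_id>/<model_id>.pkl are counted as models.
    model_files = list(root.glob("*/*.pkl"))

    # disk_usage reports on the filesystem that holds `root`.
    usage = shutil.disk_usage(root)
    used_fraction = usage.used / usage.total if usage.total else 0.0

    return {
        "model_count": len(model_files),
        "model_bytes": sum(f.stat().st_size for f in model_files),
        "tenants": len({f.parent.name for f in model_files}),
        "fs_used_fraction": used_fraction,
    }


if __name__ == "__main__":
    # A temporary directory stands in for /app/models in this demo.
    with tempfile.TemporaryDirectory() as d:
        demo_root = Path(d)
        (demo_root / "tenant-a").mkdir()
        (demo_root / "tenant-a" / "model-1.pkl").write_bytes(b"\x00" * 16)
        print(storage_report(demo_root))
```

If the provisioner backs the claim with a plain directory on the node's disk, `fs_used_fraction` reflects that underlying disk rather than the 10Gi request, so `model_bytes` is the more meaningful figure to compare against the claim size.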
## Testing Recommendations

1. Trigger a new model training run
2. Verify the model file exists in the PV
3. Test the prediction endpoint
4. Restart the pods and verify the models are still accessible
5. Monitor for any storage-related errors