# Model Storage Fix - Root Cause Analysis & Resolution

## Problem Summary
**Error**: `Model file not found: /app/models/{tenant_id}/{model_id}.pkl`

**Impact**: Forecasting service unable to generate predictions, causing 500 errors

## Root Cause Analysis

### The Issue
Both the training and forecasting services were configured to save and load ML models at `/app/models`, but **no persistent storage was configured** for that path. As a result:

1. **Training service** saves model files to `/app/models/{tenant_id}/{model_id}.pkl` (in-container filesystem)
2. **Model metadata** is successfully saved to the database
3. **Container restarts** or different pod instances → the in-container filesystem is lost or not visible
4. **Forecasting service** tries to load the model from `/app/models/...` → **File not found**

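
The failure sequence above can be reproduced outside Kubernetes. The following is a minimal sketch, not the services' actual code: it assumes models are pickled Python objects, and `save_model`, `load_model`, and the two temporary directories standing in for each container's private filesystem are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path


def model_path(base: Path, tenant_id: str, model_id: str) -> Path:
    """Path convention from this document: {base}/{tenant_id}/{model_id}.pkl."""
    return base / tenant_id / f"{model_id}.pkl"


def save_model(base: Path, tenant_id: str, model_id: str, model: object) -> Path:
    """Hypothetical stand-in for the training service writing a model file."""
    path = model_path(base, tenant_id, model_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(model, fh)
    return path


def load_model(base: Path, tenant_id: str, model_id: str) -> object:
    """Hypothetical stand-in for the forecasting service reading a model file."""
    path = model_path(base, tenant_id, model_id)
    if not path.is_file():
        raise FileNotFoundError(f"Model file not found: {path}")
    with path.open("rb") as fh:
        return pickle.load(fh)


def simulate_without_shared_storage() -> str:
    """Give each 'container' its own directory, like two ephemeral filesystems."""
    with tempfile.TemporaryDirectory() as training_fs, \
            tempfile.TemporaryDirectory() as forecasting_fs:
        save_model(Path(training_fs), "tenant-a", "model-1", {"weights": [1, 2, 3]})
        try:
            load_model(Path(forecasting_fs), "tenant-a", "model-1")
        except FileNotFoundError as exc:
            return str(exc)
    return "loaded"


if __name__ == "__main__":
    print(simulate_without_shared_storage())
```

The write succeeds and the read fails even though both sides use the same relative path, which mirrors metadata being saved while the file itself is unreachable.
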
### Evidence from Logs
```
[error] Model file not found: /app/models/d3fe350f-ffcb-439c-9d66-65851b0cf0c7/2096bc66-aef7-4499-a79c-c4d40d5aa9f1.pkl
[error] Model file not valid: /app/models/d3fe350f-ffcb-439c-9d66-65851b0cf0c7/2096bc66-aef7-4499-a79c-c4d40d5aa9f1.pkl
[error] Error generating prediction error=Model 2096bc66-aef7-4499-a79c-c4d40d5aa9f1 not found or failed to load
```

### Architecture Flaw
- Training service deployment: only had an `emptyDir` volume for `/tmp`
- Forecasting service deployment: had no volumes at all
- Model files were stored in the ephemeral container filesystem
- No shared persistent storage existed between the services

## Solution Implemented

### 1. Created Persistent Volume Claim
**File**: `infrastructure/kubernetes/base/components/volumes/model-storage-pvc.yaml`

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-storage
  namespace: bakery-ia
spec:
  accessModes:
    - ReadWriteOnce  # Single node access
  resources:
    requests:
      storage: 10Gi
  storageClassName: standard  # Uses local-path provisioner
```

### 2. Updated Training Service
**File**: `infrastructure/kubernetes/base/components/training/training-service.yaml`

Added volume mount:
```yaml
# In the container spec
volumeMounts:
  - name: model-storage
    mountPath: /app/models  # Training writes models here

# In the pod spec
volumes:
  - name: model-storage
    persistentVolumeClaim:
      claimName: model-storage
```

### 3. Updated Forecasting Service
**File**: `infrastructure/kubernetes/base/components/forecasting/forecasting-service.yaml`

Added READ-ONLY volume mount:
```yaml
# In the container spec
volumeMounts:
  - name: model-storage
    mountPath: /app/models
    readOnly: true  # Forecasting only reads models

# In the pod spec
volumes:
  - name: model-storage
    persistentVolumeClaim:
      claimName: model-storage
      readOnly: true
```

### 4. Updated Kustomization
Added the PVC to the `resources` list in `infrastructure/kubernetes/base/kustomization.yaml`.

## Verification

### PVC Status
```bash
kubectl get pvc -n bakery-ia model-storage
# STATUS: Bound (10Gi, RWO)
```

### Volume Mounts Verified
```bash
# Training service
kubectl exec -n bakery-ia deployment/training-service -- ls -la /app/models
# ✅ Directory exists and is writable

# Forecasting service
kubectl exec -n bakery-ia deployment/forecasting-service -- ls -la /app/models
# ✅ Directory exists and is readable (same volume)
```

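
A directory listing confirms that the mount point is present; a write probe is one way to confirm writability directly. The sketch below is an illustration only: `check_mount` and the `MODEL_DIR` environment variable are hypothetical names, and it assumes a Python interpreter is available inside the container.

```python
import os
import tempfile
from pathlib import Path


def check_mount(path: str) -> dict:
    """Report whether `path` exists, can be listed, and accepts writes."""
    directory = Path(path)
    result = {"exists": directory.is_dir(), "readable": False, "writable": False}
    if not result["exists"]:
        return result
    try:
        os.listdir(directory)
        result["readable"] = True
    except OSError:
        pass
    try:
        # Creating (and automatically deleting) a temp file is a direct write
        # test; it raises OSError on a read-only mount or missing permissions.
        with tempfile.NamedTemporaryFile(dir=directory):
            pass
        result["writable"] = True
    except OSError:
        pass
    return result


if __name__ == "__main__":
    # MODEL_DIR is a hypothetical override; the default matches this document.
    print(check_mount(os.environ.get("MODEL_DIR", "/app/models")))
```

Given the mounts described in this document, one would expect `writable` to be true in the training pod and false in the forecasting pod, where the volume is mounted with `readOnly: true`.
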
## Deployment Steps

```bash
# 1. Create PVC
kubectl apply -f infrastructure/kubernetes/base/components/volumes/model-storage-pvc.yaml

# 2. Recreate training service (deployment selector is immutable)
kubectl delete deployment training-service -n bakery-ia
kubectl apply -f infrastructure/kubernetes/base/components/training/training-service.yaml

# 3. Recreate forecasting service
kubectl delete deployment forecasting-service -n bakery-ia
kubectl apply -f infrastructure/kubernetes/base/components/forecasting/forecasting-service.yaml

# 4. Verify pods are running
kubectl get pods -n bakery-ia | grep -E "(training|forecasting)"
```

## How It Works Now

1. **Training Flow**:
   - Model trained → Saved to `/app/models/{tenant_id}/{model_id}.pkl`
   - File persisted to PersistentVolume (survives pod restarts)
   - Metadata saved to database with model path

2. **Forecasting Flow**:
   - Retrieves model metadata from database
   - Loads model from `/app/models/{tenant_id}/{model_id}.pkl`
   - File exists in shared PersistentVolume ✅
   - Prediction succeeds ✅

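
With a writer and a reader sharing one volume, the reader could in principle open a file while it is still being written. One common way to avoid that is to write to a temporary file in the same directory and rename it into place. The sketch below illustrates that pattern under assumptions: the function names are hypothetical, models are assumed to be pickled Python objects, and nothing here is taken from the services' actual code.

```python
import os
import pickle
import tempfile
from pathlib import Path


def save_model_atomically(base_dir: Path, tenant_id: str, model_id: str, model: object) -> Path:
    """Write the model to a temp file next to the target, then rename it into place."""
    target = base_dir / tenant_id / f"{model_id}.pkl"
    target.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as fh:
            pickle.dump(model, fh)
            fh.flush()
            os.fsync(fh.fileno())  # push the bytes to disk before the rename
        # On POSIX the rename is atomic when both paths are on the same filesystem,
        # so a reader sees either no file or the complete file.
        os.replace(tmp_name, target)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return target


def load_model(base_dir: Path, tenant_id: str, model_id: str) -> object:
    """Read-only counterpart; works on a volume mounted with readOnly: true."""
    target = base_dir / tenant_id / f"{model_id}.pkl"
    with target.open("rb") as fh:
        return pickle.load(fh)


if __name__ == "__main__":
    # A temporary directory stands in for the shared /app/models volume.
    with tempfile.TemporaryDirectory() as shared_volume:
        base = Path(shared_volume)
        save_model_atomically(base, "tenant-a", "model-1", {"weights": [0.1, 0.2]})
        print(load_model(base, "tenant-a", "model-1"))
```

Because `pickle.load` can execute arbitrary code, the volume should only ever contain files written by the training service.
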
## Storage Configuration

- **Type**: PersistentVolumeClaim with local-path provisioner
- **Access Mode**: ReadWriteOnce (multiple pods may mount it, but only on a single node)
- **Size**: 10Gi (adjustable)
- **Lifecycle**: Independent of pod lifecycle
- **Shared**: Same volume mounted by both services

## Benefits

1. **Data Persistence**: Models survive pod restarts and crashes
2. **Cross-Service Access**: Training writes, Forecasting reads
3. **Scalability**: Storage size can be increased as needed, provided the storage class allows volume expansion
4. **Reliability**: No data loss on container recreation

## Future Improvements

For production environments, consider:

1. **ReadWriteMany volumes**: Use NFS/CephFS for multi-node clusters
2. **Model versioning**: Implement model lifecycle management
3. **Backup strategy**: Regular backups of model storage
4. **Monitoring**: Track storage usage and model count
5. **Cloud storage**: S3/GCS for distributed deployments

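
For the monitoring item, a small report of disk usage and model count is a possible starting point. This is a sketch under assumptions: `storage_report` is a hypothetical helper, and it relies on the `{tenant_id}/{model_id}.pkl` layout described in this document.

```python
import shutil
import tempfile
from pathlib import Path


def storage_report(base_dir: str) -> dict:
    """Summarise filesystem usage and the model files found under base_dir."""
    base = Path(base_dir)
    usage = shutil.disk_usage(base)
    # Layout assumed from this document: {tenant_id}/{model_id}.pkl
    model_files = list(base.glob("*/*.pkl"))
    return {
        "total_bytes": usage.total,
        "used_bytes": usage.used,
        "free_bytes": usage.free,
        "model_count": len(model_files),
        "model_bytes": sum(f.stat().st_size for f in model_files),
        "tenant_count": len({f.parent.name for f in model_files}),
    }


if __name__ == "__main__":
    # Demo against a temporary directory standing in for /app/models.
    with tempfile.TemporaryDirectory() as demo_dir:
        tenant_dir = Path(demo_dir) / "tenant-a"
        tenant_dir.mkdir()
        (tenant_dir / "model-1.pkl").write_bytes(b"placeholder")
        print(storage_report(demo_dir))
```

Note that `shutil.disk_usage` reports the filesystem backing the path. With a host-path style provisioner this may be the node's disk rather than the 10Gi requested by the claim, so `model_bytes` is likely the more meaningful figure to track.
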
## Testing Recommendations

1. Trigger a new model training run
2. Verify the model file exists in the PV
3. Test the prediction endpoint
4. Restart the pods and verify the models are still accessible
5. Monitor for any storage-related errors

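
Since the original failure was metadata pointing at a file that did not exist, step 2 can be turned into a repeatable check that compares metadata records against the volume. The sketch below is hypothetical: `find_missing_models` is an illustrative name, and the `(tenant_id, model_id)` pairs are assumed to come from whatever query the database schema supports.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List, Tuple


def find_missing_models(records: Iterable[Tuple[str, str]], base_dir: str) -> List[Tuple[str, str]]:
    """Return the (tenant_id, model_id) pairs whose .pkl file is absent or empty."""
    base = Path(base_dir)
    missing = []
    for tenant_id, model_id in records:
        path = base / tenant_id / f"{model_id}.pkl"
        if not path.is_file() or path.stat().st_size == 0:
            missing.append((tenant_id, model_id))
    return missing


if __name__ == "__main__":
    # Demo: one record has a file on the stand-in volume, the other does not.
    with tempfile.TemporaryDirectory() as volume:
        present = Path(volume) / "tenant-a"
        present.mkdir()
        (present / "model-1.pkl").write_bytes(b"placeholder")
        records = [("tenant-a", "model-1"), ("tenant-a", "model-2")]
        print(find_missing_models(records, volume))
```

An empty result after a training run and again after a pod restart would cover steps 2 and 4.
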