# Model Storage Fix - Root Cause Analysis & Resolution
## Problem Summary
**Error**: `Model file not found: /app/models/{tenant_id}/{model_id}.pkl`
**Impact**: Forecasting service unable to generate predictions, causing 500 errors
## Root Cause Analysis
### The Issue
Both training and forecasting services were configured to save/load ML models at `/app/models`, but **no persistent storage was configured**. This caused:
1. **Training service** saves model files to `/app/models/{tenant_id}/{model_id}.pkl` (in-container filesystem)
2. **Model metadata** successfully saved to database
3. **Container restarts** or different pod instances → filesystem lost
4. **Forecasting service** tries to load model from `/app/models/...` → **File not found**
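
The sequence above can be reproduced locally with a minimal sketch. It assumes the services serialize models with `pickle` (suggested by the `.pkl` extension, but not confirmed here); the helper names `model_path`, `save_model`, `load_model`, and `simulate` are hypothetical and are not taken from the service code. Two temporary directories stand in for the separate in-container filesystems of the two pods.

```python
import pickle
import tempfile
from pathlib import Path


def model_path(root: Path, tenant_id: str, model_id: str) -> Path:
    """Build the {root}/{tenant_id}/{model_id}.pkl layout seen in the error."""
    return root / tenant_id / f"{model_id}.pkl"


def save_model(root: Path, tenant_id: str, model_id: str, model: object) -> Path:
    """Hypothetical stand-in for the training service's save step."""
    path = model_path(root, tenant_id, model_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(model, fh)
    return path


def load_model(root: Path, tenant_id: str, model_id: str) -> object:
    """Hypothetical stand-in for the forecasting service's load step."""
    path = model_path(root, tenant_id, model_id)
    if not path.is_file():
        raise FileNotFoundError(f"Model file not found: {path}")
    with path.open("rb") as fh:
        return pickle.load(fh)


def simulate() -> dict:
    """Compare loading from a separate directory with loading from a shared one."""
    model = {"weights": [0.1, 0.2]}  # placeholder for a trained model
    with tempfile.TemporaryDirectory() as training_fs, \
            tempfile.TemporaryDirectory() as forecasting_fs:
        save_model(Path(training_fs), "tenant-a", "model-1", model)
        try:
            # Separate filesystems: the file written by training is not here.
            load_model(Path(forecasting_fs), "tenant-a", "model-1")
            separate = "loaded"
        except FileNotFoundError:
            separate = "not found"
        # Shared directory: the same path resolves to the saved file.
        shared = load_model(Path(training_fs), "tenant-a", "model-1")
    return {"separate": separate, "shared": shared}


if __name__ == "__main__":
    print(simulate())
```

The path is identical in both calls; only the underlying filesystem differs, which is why the database metadata alone could not reveal the problem.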
### Evidence from Logs
```
[error] Model file not found: /app/models/d3fe350f-ffcb-439c-9d66-65851b0cf0c7/2096bc66-aef7-4499-a79c-c4d40d5aa9f1.pkl
[error] Model file not valid: /app/models/d3fe350f-ffcb-439c-9d66-65851b0cf0c7/2096bc66-aef7-4499-a79c-c4d40d5aa9f1.pkl
[error] Error generating prediction error=Model 2096bc66-aef7-4499-a79c-c4d40d5aa9f1 not found or failed to load
```
### Architecture Flaw
- Training service deployment: Only had an `emptyDir` volume mounted at `/tmp`
- Forecasting service deployment: Had NO volumes at all
- Model files stored in ephemeral container filesystem
- No shared persistent storage between services
## Solution Implemented
### 1. Created Persistent Volume Claim
**File**: `infrastructure/kubernetes/base/components/volumes/model-storage-pvc.yaml`
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-storage
  namespace: bakery-ia
spec:
  accessModes:
    - ReadWriteOnce  # Single node access
  resources:
    requests:
      storage: 10Gi
  storageClassName: standard  # Uses local-path provisioner
```
### 2. Updated Training Service
**File**: `infrastructure/kubernetes/base/components/training/training-service.yaml`
Added volume mount:
```yaml
# In the container spec
volumeMounts:
  - name: model-storage
    mountPath: /app/models  # Training writes models here
# In the pod spec
volumes:
  - name: model-storage
    persistentVolumeClaim:
      claimName: model-storage
```
### 3. Updated Forecasting Service
**File**: `infrastructure/kubernetes/base/components/forecasting/forecasting-service.yaml`
Added READ-ONLY volume mount:
```yaml
# In the container spec
volumeMounts:
  - name: model-storage
    mountPath: /app/models
    readOnly: true  # Forecasting only reads models
# In the pod spec
volumes:
  - name: model-storage
    persistentVolumeClaim:
      claimName: model-storage
      readOnly: true
```
### 4. Updated Kustomization
Added PVC to resource list in `infrastructure/kubernetes/base/kustomization.yaml`
## Verification
### PVC Status
```bash
kubectl get pvc -n bakery-ia model-storage
# STATUS: Bound (10Gi, RWO)
```
### Volume Mounts Verified
```bash
# Training service
kubectl exec -n bakery-ia deployment/training-service -- ls -la /app/models
# ✅ Directory exists and is writable
# Forecasting service
kubectl exec -n bakery-ia deployment/forecasting-service -- ls -la /app/models
# ✅ Directory exists and is readable (same volume)
```
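
Listing the directory shows that the mount exists, but it does not prove write access. As a complement, the following sketch classifies a directory by attempting to create a temporary file in it. The function name `probe_mount` is hypothetical, and running it inside the containers assumes a Python interpreter is available in both images.

```python
import tempfile
from pathlib import Path


def probe_mount(path: str) -> str:
    """Classify a directory as 'missing', 'not writable', or 'writable'."""
    directory = Path(path)
    if not directory.is_dir():
        return "missing"
    try:
        # Creating a temp file (deleted again on close) proves write access.
        with tempfile.NamedTemporaryFile(dir=directory):
            pass
    except OSError:
        # Covers read-only mounts and permission errors alike.
        return "not writable"
    return "writable"


if __name__ == "__main__":
    print(f"/app/models: {probe_mount('/app/models')}")
```

With the mounts described above, the expectation would be `writable` in the training container and `not writable` in the forecasting container, since the latter mounts the volume with `readOnly: true`.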
## Deployment Steps
```bash
# 1. Create PVC
kubectl apply -f infrastructure/kubernetes/base/components/volumes/model-storage-pvc.yaml
# 2. Recreate training service (deployment selector is immutable)
kubectl delete deployment training-service -n bakery-ia
kubectl apply -f infrastructure/kubernetes/base/components/training/training-service.yaml
# 3. Recreate forecasting service
kubectl delete deployment forecasting-service -n bakery-ia
kubectl apply -f infrastructure/kubernetes/base/components/forecasting/forecasting-service.yaml
# 4. Verify pods are running
kubectl get pods -n bakery-ia | grep -E "(training|forecasting)"
```
## How It Works Now
1. **Training Flow**:
   - Model trained → Saved to `/app/models/{tenant_id}/{model_id}.pkl`
   - File persisted to PersistentVolume (survives pod restarts)
   - Metadata saved to database with model path
2. **Forecasting Flow**:
   - Retrieves model metadata from database
   - Loads model from `/app/models/{tenant_id}/{model_id}.pkl`
   - File exists in shared PersistentVolume ✅
   - Prediction succeeds ✅
## Storage Configuration
- **Type**: PersistentVolumeClaim with local-path provisioner
- **Access Mode**: ReadWriteOnce (mounted by a single node; multiple pods can share it only if they are scheduled on that node)
- **Size**: 10Gi (adjustable)
- **Lifecycle**: Independent of pod lifecycle
- **Shared**: Same volume mounted by both services
## Benefits
1. **Data Persistence**: Models survive pod restarts/crashes
2. **Cross-Service Access**: Training writes, Forecasting reads
3. **Scalability**: Storage size can be increased as needed, provided the storage class supports volume expansion
4. **Reliability**: No data loss on container recreation
## Future Improvements
For production environments, consider:
1. **ReadWriteMany volumes**: Use NFS/CephFS for multi-node clusters
2. **Model versioning**: Implement model lifecycle management
3. **Backup strategy**: Regular backups of model storage
4. **Monitoring**: Track storage usage and model count
5. **Cloud storage**: S3/GCS for distributed deployments
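
For the monitoring item, a minimal sketch of what tracking storage usage and model count could look like. It assumes the `{tenant_id}/{model_id}.pkl` layout described in this document; the function name `storage_report` and the report keys are hypothetical, and exporting the numbers to a metrics system is left out.

```python
import shutil
from pathlib import Path


def storage_report(root: str) -> dict:
    """Summarize model files under root and the fill level of its volume."""
    base = Path(root)
    files = [p for p in base.rglob("*.pkl") if p.is_file()]
    usage = shutil.disk_usage(base)  # named tuple: total, used, free (bytes)
    return {
        "model_count": len(files),
        "model_bytes": sum(p.stat().st_size for p in files),
        # Assumes each model sits directly inside its tenant directory.
        "tenant_count": len({p.parent.name for p in files}),
        "volume_used_fraction": usage.used / usage.total,
    }
```

Calling `storage_report("/app/models")` from either service would give a snapshot that can be compared against the 10Gi request. Note that `shutil.disk_usage` reports on the filesystem backing the path, which may be larger than the claim when the provisioner does not enforce a quota.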
## Testing Recommendations
1. Trigger new model training
2. Verify model file exists in PV
3. Test prediction endpoint
4. Restart pods and verify models are still accessible
5. Monitor for any storage-related errors
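
Because metadata was written to the database even when the file was later lost, models trained before the fix may still have database rows that point at files that no longer exist. The sketch below checks a list of `(tenant_id, model_id)` pairs against the volume. How those pairs are read from the database is not shown, and the function name `find_missing_models` is hypothetical.

```python
from pathlib import Path
from typing import Iterable


def find_missing_models(root: str, records: Iterable[tuple[str, str]]) -> list[Path]:
    """Return the expected model paths that are absent under root."""
    base = Path(root)
    missing = []
    for tenant_id, model_id in records:
        path = base / tenant_id / f"{model_id}.pkl"
        if not path.is_file():
            missing.append(path)
    return missing
```

An empty result after step 4 suggests every recorded model survived the restart; any returned paths are candidates for retraining.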