# Model Storage Fix - Root Cause Analysis & Resolution

## Problem Summary

**Error:** `Model file not found: /app/models/{tenant_id}/{model_id}.pkl`

**Impact:** Forecasting service unable to generate predictions, causing 500 errors

## Root Cause Analysis

### The Issue

Both the training and forecasting services were configured to save/load ML models at `/app/models`, but no persistent storage was configured. This caused:
- Training service saves model files to `/app/models/{tenant_id}/{model_id}.pkl` (in-container filesystem)
- Model metadata is successfully saved to the database
- Container restarts or different pod instances → filesystem lost
- Forecasting service tries to load the model from `/app/models/...` → file not found
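
This failure mode is easy to reproduce by hand once the services are running. A minimal sketch, assuming `/app/models` already exists inside the training image (the marker filename is arbitrary):

```bash
# Write a file into the training container's filesystem
kubectl exec -n bakery-ia deployment/training-service -- touch /app/models/marker.pkl

# Restart the pod: without a persistent volume the filesystem is recreated from the image
kubectl rollout restart deployment/training-service -n bakery-ia
kubectl rollout status deployment/training-service -n bakery-ia

# The file is gone in the new pod (expected before the fix: "No such file or directory")
kubectl exec -n bakery-ia deployment/training-service -- ls /app/models/marker.pkl
```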
### Evidence from Logs

```text
[error] Model file not found: /app/models/d3fe350f-ffcb-439c-9d66-65851b0cf0c7/2096bc66-aef7-4499-a79c-c4d40d5aa9f1.pkl
[error] Model file not valid: /app/models/d3fe350f-ffcb-439c-9d66-65851b0cf0c7/2096bc66-aef7-4499-a79c-c4d40d5aa9f1.pkl
[error] Error generating prediction error=Model 2096bc66-aef7-4499-a79c-c4d40d5aa9f1 not found or failed to load
```
### Architecture Flaw

- Training service deployment: only had a `/tmp` emptyDir volume
- Forecasting service deployment: had no volumes at all
- Model files were stored in the ephemeral container filesystem
- No shared persistent storage between services
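
The missing volumes can also be confirmed against the live deployments before the fix; a quick check using `kubectl get -o jsonpath` (the output described is indicative):

```bash
# Show which volumes each deployment defines
kubectl get deployment training-service -n bakery-ia \
  -o jsonpath='{.spec.template.spec.volumes}{"\n"}'
kubectl get deployment forecasting-service -n bakery-ia \
  -o jsonpath='{.spec.template.spec.volumes}{"\n"}'
# Before the fix: training lists only an emptyDir for /tmp, forecasting prints nothing
```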
## Solution Implemented

### 1. Created Persistent Volume Claim

**File:** `infrastructure/kubernetes/base/components/volumes/model-storage-pvc.yaml`

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-storage
  namespace: bakery-ia
spec:
  accessModes:
    - ReadWriteOnce  # Single node access
  resources:
    requests:
      storage: 10Gi
  storageClassName: standard  # Uses local-path provisioner
```
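
The manifest assumes a `standard` StorageClass backed by a local-path style provisioner already exists in the cluster; this is worth verifying before applying the PVC, since the class name and provisioner vary by distribution:

```bash
kubectl get storageclass
# Expect an entry named "standard" (e.g. provisioner rancher.io/local-path or a hostpath variant)
```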
### 2. Updated Training Service

**File:** `infrastructure/kubernetes/base/components/training/training-service.yaml`

Added volume mount:

```yaml
volumeMounts:
  - name: model-storage
    mountPath: /app/models  # Training writes models here
volumes:
  - name: model-storage
    persistentVolumeClaim:
      claimName: model-storage
```
### 3. Updated Forecasting Service

**File:** `infrastructure/kubernetes/base/components/forecasting/forecasting-service.yaml`

Added read-only volume mount:

```yaml
volumeMounts:
  - name: model-storage
    mountPath: /app/models
    readOnly: true  # Forecasting only reads models
volumes:
  - name: model-storage
    persistentVolumeClaim:
      claimName: model-storage
      readOnly: true
```
### 4. Updated Kustomization

Added the PVC to the resource list in `infrastructure/kubernetes/base/kustomization.yaml`.
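
For reference, the resource list ends up looking roughly like the sketch below; only the PVC entry is new, and the surrounding entries are illustrative rather than a copy of the real file:

```yaml
# infrastructure/kubernetes/base/kustomization.yaml (excerpt, illustrative)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: bakery-ia
resources:
  - components/volumes/model-storage-pvc.yaml   # new: shared model storage
  - components/training/training-service.yaml
  - components/forecasting/forecasting-service.yaml
```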
## Verification

### PVC Status

```bash
kubectl get pvc -n bakery-ia model-storage
# STATUS: Bound (10Gi, RWO)
```
### Volume Mounts Verified

```bash
# Training service
kubectl exec -n bakery-ia deployment/training-service -- ls -la /app/models
# ✅ Directory exists and is writable

# Forecasting service
kubectl exec -n bakery-ia deployment/forecasting-service -- ls -la /app/models
# ✅ Directory exists and is readable (same volume)
```
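
A cross-pod smoke test gives stronger evidence that both services see the same data. A sketch, assuming a shell is available in both images and both pods are scheduled on the same node (required by ReadWriteOnce):

```bash
# Write a marker file from the training pod...
kubectl exec -n bakery-ia deployment/training-service -- \
  sh -c 'echo ok > /app/models/.volume-check'

# ...and read it back from the forecasting pod (same PVC, mounted read-only there)
kubectl exec -n bakery-ia deployment/forecasting-service -- cat /app/models/.volume-check
# Expected output: ok
```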
## Deployment Steps

```bash
# 1. Create PVC
kubectl apply -f infrastructure/kubernetes/base/components/volumes/model-storage-pvc.yaml

# 2. Recreate training service (deployment selector is immutable)
kubectl delete deployment training-service -n bakery-ia
kubectl apply -f infrastructure/kubernetes/base/components/training/training-service.yaml

# 3. Recreate forecasting service
kubectl delete deployment forecasting-service -n bakery-ia
kubectl apply -f infrastructure/kubernetes/base/components/forecasting/forecasting-service.yaml

# 4. Verify pods are running
kubectl get pods -n bakery-ia | grep -E "(training|forecasting)"
```
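
Optionally, wait for both rollouts to finish before moving on to verification; something along these lines works:

```bash
kubectl rollout status deployment/training-service -n bakery-ia --timeout=120s
kubectl rollout status deployment/forecasting-service -n bakery-ia --timeout=120s
```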
## How It Works Now

1. **Training Flow:**
   - Model trained → saved to `/app/models/{tenant_id}/{model_id}.pkl`
   - File persisted to the PersistentVolume (survives pod restarts)
   - Metadata saved to the database with the model path
2. **Forecasting Flow:**
   - Retrieves model metadata from the database
   - Loads the model from `/app/models/{tenant_id}/{model_id}.pkl`
   - File exists in the shared PersistentVolume ✅
   - Prediction succeeds ✅
## Storage Configuration

- **Type:** PersistentVolumeClaim with the local-path provisioner
- **Access Mode:** ReadWriteOnce (single node, multiple pods)
- **Size:** 10Gi (adjustable; see the expansion note below)
- **Lifecycle:** Independent of pod lifecycle
- **Shared:** Same volume mounted by both services
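
Growing the claim later is a single patch, but only if the StorageClass sets `allowVolumeExpansion: true`; the default local-path provisioner typically does not, in which case the PVC would have to be recreated instead. A hypothetical expansion to 20Gi:

```bash
# Only works when the StorageClass allows volume expansion
kubectl patch pvc model-storage -n bakery-ia \
  -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'
```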
## Benefits

- **Data Persistence:** Models survive pod restarts/crashes
- **Cross-Service Access:** Training writes, forecasting reads
- **Scalability:** Storage size can be increased as needed
- **Reliability:** No data loss on container recreation
## Future Improvements

For production environments, consider:

- **ReadWriteMany volumes:** Use NFS/CephFS for multi-node clusters
- **Model versioning:** Implement model lifecycle management
- **Backup strategy:** Regular backups of model storage
- **Monitoring:** Track storage usage and model count
- **Cloud storage:** S3/GCS for distributed deployments
## Testing Recommendations

- Trigger new model training
- Verify the model file exists in the PV
- Test the prediction endpoint
- Restart pods and verify models are still accessible (see the script below)
- Monitor for any storage-related errors
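
The restart test in particular is easy to script. A sketch, assuming at least one model has already been trained and `find` is available in both images:

```bash
# Capture the current model files, restart both services, then compare
kubectl exec -n bakery-ia deployment/training-service -- \
  find /app/models -name '*.pkl' | sort > /tmp/models-before.txt

kubectl rollout restart deployment/training-service deployment/forecasting-service -n bakery-ia
kubectl rollout status deployment/forecasting-service -n bakery-ia --timeout=120s

kubectl exec -n bakery-ia deployment/forecasting-service -- \
  find /app/models -name '*.pkl' | sort > /tmp/models-after.txt

diff /tmp/models-before.txt /tmp/models-after.txt && echo "Models survived the restart"
```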