# Model Storage Fix - Root Cause Analysis & Resolution

## Problem Summary

**Error**: `Model file not found: /app/models/{tenant_id}/{model_id}.pkl`

**Impact**: Forecasting service unable to generate predictions, causing 500 errors

## Root Cause Analysis

### The Issue

Both training and forecasting services were configured to save/load ML models at `/app/models`, but **no persistent storage was configured**. This caused:

1. **Training service** saves model files to `/app/models/{tenant_id}/{model_id}.pkl` (in-container filesystem)
2. **Model metadata** successfully saved to database
3. **Container restarts** or different pod instances → filesystem lost
4. **Forecasting service** tries to load model from `/app/models/...` → **File not found**

### Evidence from Logs

```
[error] Model file not found: /app/models/d3fe350f-ffcb-439c-9d66-65851b0cf0c7/2096bc66-aef7-4499-a79c-c4d40d5aa9f1.pkl
[error] Model file not valid: /app/models/d3fe350f-ffcb-439c-9d66-65851b0cf0c7/2096bc66-aef7-4499-a79c-c4d40d5aa9f1.pkl
[error] Error generating prediction error=Model 2096bc66-aef7-4499-a79c-c4d40d5aa9f1 not found or failed to load
```

### Architecture Flaw

- Training service deployment: Only had `/tmp` EmptyDir volume
- Forecasting service deployment: Had NO volumes at all
- Model files stored in ephemeral container filesystem
- No shared persistent storage between services

## Solution Implemented

### 1. Created Persistent Volume Claim

**File**: `infrastructure/kubernetes/base/components/volumes/model-storage-pvc.yaml`

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-storage
  namespace: bakery-ia
spec:
  accessModes:
  - ReadWriteOnce  # Single node access
  resources:
    requests:
      storage: 10Gi
  storageClassName: standard  # Uses local-path provisioner
```

### 2. Updated Training Service

**File**: `infrastructure/kubernetes/base/components/training/training-service.yaml`

Added volume mount:

```yaml
volumeMounts:
- name: model-storage
  mountPath: /app/models  # Training writes models here
volumes:
- name: model-storage
  persistentVolumeClaim:
    claimName: model-storage
```

### 3. Updated Forecasting Service

**File**: `infrastructure/kubernetes/base/components/forecasting/forecasting-service.yaml`

Added READ-ONLY volume mount:

```yaml
volumeMounts:
- name: model-storage
  mountPath: /app/models
  readOnly: true  # Forecasting only reads models
volumes:
- name: model-storage
  persistentVolumeClaim:
    claimName: model-storage
    readOnly: true
```

### 4. Updated Kustomization

Added PVC to resource list in `infrastructure/kubernetes/base/kustomization.yaml`

## Verification

### PVC Status

```bash
kubectl get pvc -n bakery-ia model-storage
# STATUS: Bound (10Gi, RWO)
```

### Volume Mounts Verified

```bash
# Training service
kubectl exec -n bakery-ia deployment/training-service -- ls -la /app/models
# ✅ Directory exists and is writable

# Forecasting service
kubectl exec -n bakery-ia deployment/forecasting-service -- ls -la /app/models
# ✅ Directory exists and is readable (same volume)
```

## Deployment Steps

```bash
# 1. Create PVC
kubectl apply -f infrastructure/kubernetes/base/components/volumes/model-storage-pvc.yaml

# 2. Recreate training service (deployment selector is immutable)
kubectl delete deployment training-service -n bakery-ia
kubectl apply -f infrastructure/kubernetes/base/components/training/training-service.yaml

# 3. Recreate forecasting service
kubectl delete deployment forecasting-service -n bakery-ia
kubectl apply -f infrastructure/kubernetes/base/components/forecasting/forecasting-service.yaml

# 4. Verify pods are running
kubectl get pods -n bakery-ia | grep -E "(training|forecasting)"
```

## How It Works Now

1. **Training Flow**:
   - Model trained → Saved to `/app/models/{tenant_id}/{model_id}.pkl`
   - File persisted to PersistentVolume (survives pod restarts)
   - Metadata saved to database with model path

2. **Forecasting Flow**:
   - Retrieves model metadata from database
   - Loads model from `/app/models/{tenant_id}/{model_id}.pkl`
   - File exists in shared PersistentVolume ✅
   - Prediction succeeds ✅

## Storage Configuration

- **Type**: PersistentVolumeClaim with local-path provisioner
- **Access Mode**: ReadWriteOnce (single node, multiple pods)
- **Size**: 10Gi (adjustable)
- **Lifecycle**: Independent of pod lifecycle
- **Shared**: Same volume mounted by both services

## Benefits

1. **Data Persistence**: Models survive pod restarts/crashes
2. **Cross-Service Access**: Training writes, Forecasting reads
3. **Scalability**: Can increase storage size as needed
4. **Reliability**: No data loss on container recreation

## Future Improvements

For production environments, consider:

1. **ReadWriteMany volumes**: Use NFS/CephFS for multi-node clusters
2. **Model versioning**: Implement model lifecycle management
3. **Backup strategy**: Regular backups of model storage
4. **Monitoring**: Track storage usage and model count
5. **Cloud storage**: S3/GCS for distributed deployments

## Testing Recommendations

1. Trigger new model training
2. Verify model file exists in PV
3. Test prediction endpoint
4. Restart pods and verify models still accessible
5. Monitor for any storage-related errors
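
To exercise the save/load path outside the cluster, the training and forecasting flows can be approximated with a small Python sketch. This is not the services' actual code: the function names (`model_path`, `save_model`, `load_model`) and the `MODELS_DIR` environment variable are hypothetical, and the use of `pickle` is only inferred from the `.pkl` extension. The atomic write is a design choice of this sketch, intended to keep a reader on the shared volume from seeing a half-written file.

```python
import os
import pickle
import tempfile
from pathlib import Path
from typing import Any

# Assumption: MODELS_DIR is a hypothetical override; the default mirrors the
# /app/models mount path used in the manifests.
MODELS_DIR = Path(os.environ.get("MODELS_DIR", "/app/models"))


def model_path(tenant_id: str, model_id: str, base_dir: Path = MODELS_DIR) -> Path:
    """Build {base_dir}/{tenant_id}/{model_id}.pkl."""
    return base_dir / tenant_id / f"{model_id}.pkl"


def save_model(model: Any, tenant_id: str, model_id: str,
               base_dir: Path = MODELS_DIR) -> Path:
    """Training side: write to a temp file, then rename it into place."""
    target = model_path(tenant_id, model_id, base_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as fh:
            pickle.dump(model, fh)
        # Rename is atomic because the temp file is on the same filesystem.
        os.replace(tmp_name, target)
    except BaseException:
        # Clean up the temp file if serialization or the rename failed.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return target


def load_model(tenant_id: str, model_id: str, base_dir: Path = MODELS_DIR) -> Any:
    """Forecasting side: read-only load with an explicit missing-file error."""
    source = model_path(tenant_id, model_id, base_dir)
    if not source.is_file():
        raise FileNotFoundError(f"Model file not found: {source}")
    with source.open("rb") as fh:
        return pickle.load(fh)
```

Pointing `base_dir` at a temporary directory lets the round trip run locally; in the cluster both services would resolve to the same PVC-backed `/app/models`. Since `pickle.load` can execute arbitrary code, the volume should only ever contain files written by the training service.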