Model Storage Fix - Root Cause Analysis & Resolution

Problem Summary

Error: Model file not found: /app/models/{tenant_id}/{model_id}.pkl

Impact: The forecasting service could not generate predictions and returned HTTP 500 errors

Root Cause Analysis

The Issue

Both the training and forecasting services were configured to save and load ML models under /app/models, but no persistent storage backed that path. This led to the following failure sequence:

  1. The training service saved model files to /app/models/{tenant_id}/{model_id}.pkl on the container's local filesystem
  2. The model metadata was saved to the database successfully
  3. When the container restarted, or a different pod instance handled the request, the filesystem contents were lost
  4. The forecasting service then tried to load the model from /app/models/... → file not found

Evidence from Logs

[error] Model file not found: /app/models/d3fe350f-ffcb-439c-9d66-65851b0cf0c7/2096bc66-aef7-4499-a79c-c4d40d5aa9f1.pkl
[error] Model file not valid: /app/models/d3fe350f-ffcb-439c-9d66-65851b0cf0c7/2096bc66-aef7-4499-a79c-c4d40d5aa9f1.pkl
[error] Error generating prediction error=Model 2096bc66-aef7-4499-a79c-c4d40d5aa9f1 not found or failed to load

Architecture Flaw

  • Training service deployment: only an emptyDir volume mounted at /tmp (sketched below)
  • Forecasting service deployment: no volumes at all
  • Model files were stored in the ephemeral container filesystem
  • No shared persistent storage existed between the services
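
For reference, the pre-fix volume configuration looked roughly like the sketch below. It is reconstructed from the bullets above, not copied from the original manifests:

# training-service (before the fix): only scratch space, nothing persistent
volumeMounts:
  - name: tmp
    mountPath: /tmp
volumes:
  - name: tmp
    emptyDir: {}  # wiped whenever the pod is recreated

# forecasting-service (before the fix): no volumes at all, so /app/models
# lived in the container's writable layer and vanished with the pod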

Solution Implemented

1. Created Persistent Volume Claim

File: infrastructure/kubernetes/base/components/volumes/model-storage-pvc.yaml

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-storage
  namespace: bakery-ia
spec:
  accessModes:
    - ReadWriteOnce  # Single node access
  resources:
    requests:
      storage: 10Gi
  storageClassName: standard  # Uses local-path provisioner

2. Updated Training Service

File: infrastructure/kubernetes/base/components/training/training-service.yaml

Added a volume mount and the corresponding volume:

volumeMounts:
  - name: model-storage
    mountPath: /app/models  # Training writes models here

volumes:
  - name: model-storage
    persistentVolumeClaim:
      claimName: model-storage
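
In context, the volumeMounts entry sits under the container and the volumes entry at the pod level of the Deployment template. A trimmed sketch for orientation (labels and image are placeholders, not the actual manifest):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: training-service
  namespace: bakery-ia
spec:
  selector:
    matchLabels:
      app: training-service  # placeholder label
  template:
    metadata:
      labels:
        app: training-service
    spec:
      containers:
        - name: training-service
          image: training-service:latest  # placeholder image
          volumeMounts:
            - name: model-storage
              mountPath: /app/models  # training writes models here
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: model-storage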

3. Updated Forecasting Service

File: infrastructure/kubernetes/base/components/forecasting/forecasting-service.yaml

Added a READ-ONLY volume mount and the corresponding volume:

volumeMounts:
  - name: model-storage
    mountPath: /app/models
    readOnly: true  # Forecasting only reads models

volumes:
  - name: model-storage
    persistentVolumeClaim:
      claimName: model-storage
      readOnly: true

4. Updated Kustomization

Added the PVC to the resources list in infrastructure/kubernetes/base/kustomization.yaml
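
The change itself is a single entry in the resources list; roughly (surrounding entries are illustrative only):

# infrastructure/kubernetes/base/kustomization.yaml (excerpt)
resources:
  # ... existing components ...
  - components/volumes/model-storage-pvc.yaml  # new entry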

Verification

PVC Status

kubectl get pvc -n bakery-ia model-storage
# STATUS: Bound (10Gi, RWO)

Volume Mounts Verified

# Training service
kubectl exec -n bakery-ia deployment/training-service -- ls -la /app/models
# ✅ Directory exists and is writable

# Forecasting service
kubectl exec -n bakery-ia deployment/forecasting-service -- ls -la /app/models
# ✅ Directory exists and is readable (same volume)
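
To confirm both deployments see the same files and that the forecasting mount really rejects writes, a quick round-trip check can be run (the marker filename is arbitrary):

# Write a marker file from the training pod...
kubectl exec -n bakery-ia deployment/training-service -- touch /app/models/.storage-check

# ...and read it back from the forecasting pod (same volume)
kubectl exec -n bakery-ia deployment/forecasting-service -- ls /app/models/.storage-check

# A write from the forecasting pod should fail with "Read-only file system"
kubectl exec -n bakery-ia deployment/forecasting-service -- touch /app/models/.should-fail

# Clean up the marker
kubectl exec -n bakery-ia deployment/training-service -- rm /app/models/.storage-check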

Deployment Steps

# 1. Create PVC
kubectl apply -f infrastructure/kubernetes/base/components/volumes/model-storage-pvc.yaml

# 2. Recreate training service (deployment selector is immutable)
kubectl delete deployment training-service -n bakery-ia
kubectl apply -f infrastructure/kubernetes/base/components/training/training-service.yaml

# 3. Recreate forecasting service
kubectl delete deployment forecasting-service -n bakery-ia
kubectl apply -f infrastructure/kubernetes/base/components/forecasting/forecasting-service.yaml

# 4. Verify pods are running
kubectl get pods -n bakery-ia | grep -E "(training|forecasting)"
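
If the stack is normally applied through the base kustomization, the apply commands in steps 1–3 collapse into a single kustomize apply (assuming infrastructure/kubernetes/base is the directory that gets applied; the delete commands are still needed whenever the selector changed). Either way, it is worth waiting for the rollouts before re-testing:

# Optional: apply everything, including the new PVC, via kustomize
kubectl apply -k infrastructure/kubernetes/base

# Wait for both deployments to come back up
kubectl rollout status deployment/training-service -n bakery-ia --timeout=120s
kubectl rollout status deployment/forecasting-service -n bakery-ia --timeout=120s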

How It Works Now

  1. Training Flow:

    • Model trained → Saved to /app/models/{tenant_id}/{model_id}.pkl
    • File persisted to PersistentVolume (survives pod restarts)
    • Metadata saved to database with model path
  2. Forecasting Flow:

    • Retrieves model metadata from database
    • Loads model from /app/models/{tenant_id}/{model_id}.pkl
    • File exists in shared PersistentVolume
    • Prediction succeeds

Storage Configuration

  • Type: PersistentVolumeClaim with local-path provisioner
  • Access Mode: ReadWriteOnce (single node, multiple pods)
  • Size: 10Gi (adjustable; see the expansion note after this list)
  • Lifecycle: Independent of pod lifecycle
  • Shared: Same volume mounted by both services
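
If more space is ever needed, the request can in principle be bumped in place, but only when the StorageClass has allowVolumeExpansion enabled; the local-path provisioner typically does not, in which case the PVC must be recreated with a larger size. The in-place route, for reference:

# Works only if the StorageClass allows volume expansion
kubectl patch pvc model-storage -n bakery-ia \
  -p '{"spec": {"resources": {"requests": {"storage": "20Gi"}}}}'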

Benefits

  1. Data Persistence: Models survive pod restarts/crashes
  2. Cross-Service Access: Training writes, Forecasting reads
  3. Scalability: Can increase storage size as needed
  4. Reliability: No data loss on container recreation

Future Improvements

For production environments, consider:

  1. ReadWriteMany volumes: Use NFS/CephFS for multi-node clusters
  2. Model versioning: Implement model lifecycle management
  3. Backup strategy: Regular backups of model storage
  4. Monitoring: Track storage usage and model count
  5. Cloud storage: S3/GCS for distributed deployments

Testing Recommendations

  1. Trigger new model training
  2. Verify model file exists in PV
  3. Test prediction endpoint
  4. Restart pods and verify models still accessible
  5. Monitor for any storage-related errors (see the sketch after this list)
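
The steps above can be scripted roughly as follows. The training and prediction routes shown here are assumptions about the API surface, not the actual endpoints, and should be replaced with the real ones:

# 1–2. Trigger a training run (hypothetical route), then check for the .pkl in the PV
curl -X POST http://training-service.bakery-ia/api/v1/train -d '{"tenant_id": "<tenant_id>"}'
kubectl exec -n bakery-ia deployment/training-service -- find /app/models -name '*.pkl'

# 3. Call the prediction endpoint (also hypothetical) and expect a 200 instead of a 500
curl -X POST http://forecasting-service.bakery-ia/api/v1/predict -d '{"model_id": "<model_id>"}'

# 4. Restart both pods and confirm the model files survive
kubectl rollout restart deployment/training-service deployment/forecasting-service -n bakery-ia
kubectl exec -n bakery-ia deployment/forecasting-service -- find /app/models -name '*.pkl'

# 5. Watch for storage-related errors after the restart
kubectl logs -n bakery-ia deployment/forecasting-service --since=10m | grep -i "model file"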