Model Storage Fix - Root Cause Analysis & Resolution

Problem Summary

Error: Model file not found: /app/models/{tenant_id}/{model_id}.pkl

Impact: The forecasting service could not generate predictions and returned HTTP 500 errors

Root Cause Analysis

The Issue

Both the training and forecasting services were configured to save and load ML models under /app/models, but no persistent storage backed that path. This led to the following failure sequence:

  1. The training service saved model files to /app/models/{tenant_id}/{model_id}.pkl on the container's local filesystem
  2. The model metadata was saved to the database successfully
  3. When the container restarted, or a different pod instance handled the request, the filesystem contents were lost
  4. The forecasting service then tried to load the model from /app/models/... → file not found

Evidence from Logs

[error] Model file not found: /app/models/d3fe350f-ffcb-439c-9d66-65851b0cf0c7/2096bc66-aef7-4499-a79c-c4d40d5aa9f1.pkl
[error] Model file not valid: /app/models/d3fe350f-ffcb-439c-9d66-65851b0cf0c7/2096bc66-aef7-4499-a79c-c4d40d5aa9f1.pkl
[error] Error generating prediction error=Model 2096bc66-aef7-4499-a79c-c4d40d5aa9f1 not found or failed to load

Architecture Flaw

  • Training service deployment: only an emptyDir volume mounted at /tmp (sketched below)
  • Forecasting service deployment: no volumes at all
  • Model files were stored in the ephemeral container filesystem
  • No shared persistent storage existed between the services
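
For reference, the pre-fix volume configuration looked roughly like the sketch below. It is reconstructed from the bullets above, not copied from the original manifests:

# training-service (before the fix): only scratch space, nothing persistent
volumeMounts:
  - name: tmp
    mountPath: /tmp
volumes:
  - name: tmp
    emptyDir: {}  # wiped whenever the pod is recreated

# forecasting-service (before the fix): no volumes at all, so /app/models
# lived in the container's writable layer and vanished with the pod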

Solution Implemented

1. Created Persistent Volume Claim

File: infrastructure/kubernetes/base/components/volumes/model-storage-pvc.yaml

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-storage
  namespace: bakery-ia
spec:
  accessModes:
    - ReadWriteOnce  # Single node access
  resources:
    requests:
      storage: 10Gi
  storageClassName: standard  # Uses local-path provisioner

2. Updated Training Service

File: infrastructure/kubernetes/base/components/training/training-service.yaml

Added a volume mount and the corresponding volume:

volumeMounts:
  - name: model-storage
    mountPath: /app/models  # Training writes models here

volumes:
  - name: model-storage
    persistentVolumeClaim:
      claimName: model-storage
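
In context, the volumeMounts entry sits under the container and the volumes entry at the pod level of the Deployment template. A trimmed sketch for orientation (labels and image are placeholders, not the actual manifest):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: training-service
  namespace: bakery-ia
spec:
  selector:
    matchLabels:
      app: training-service  # placeholder label
  template:
    metadata:
      labels:
        app: training-service
    spec:
      containers:
        - name: training-service
          image: training-service:latest  # placeholder image
          volumeMounts:
            - name: model-storage
              mountPath: /app/models  # training writes models here
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: model-storage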

3. Updated Forecasting Service

File: infrastructure/kubernetes/base/components/forecasting/forecasting-service.yaml

Added a READ-ONLY volume mount and the corresponding volume:

volumeMounts:
  - name: model-storage
    mountPath: /app/models
    readOnly: true  # Forecasting only reads models

volumes:
  - name: model-storage
    persistentVolumeClaim:
      claimName: model-storage
      readOnly: true

4. Updated Kustomization

Added the PVC to the resources list in infrastructure/kubernetes/base/kustomization.yaml
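
The change itself is a single entry in the resources list; roughly (surrounding entries are illustrative only):

# infrastructure/kubernetes/base/kustomization.yaml (excerpt)
resources:
  # ... existing components ...
  - components/volumes/model-storage-pvc.yaml  # new entry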

Verification

PVC Status

kubectl get pvc -n bakery-ia model-storage
# STATUS: Bound (10Gi, RWO)

Volume Mounts Verified

# Training service
kubectl exec -n bakery-ia deployment/training-service -- ls -la /app/models
# ✅ Directory exists and is writable

# Forecasting service
kubectl exec -n bakery-ia deployment/forecasting-service -- ls -la /app/models
# ✅ Directory exists and is readable (same volume)
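
To confirm both deployments see the same files and that the forecasting mount really rejects writes, a quick round-trip check can be run (the marker filename is arbitrary):

# Write a marker file from the training pod...
kubectl exec -n bakery-ia deployment/training-service -- touch /app/models/.storage-check

# ...and read it back from the forecasting pod (same volume)
kubectl exec -n bakery-ia deployment/forecasting-service -- ls /app/models/.storage-check

# A write from the forecasting pod should fail with "Read-only file system"
kubectl exec -n bakery-ia deployment/forecasting-service -- touch /app/models/.should-fail

# Clean up the marker
kubectl exec -n bakery-ia deployment/training-service -- rm /app/models/.storage-check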

Deployment Steps

# 1. Create PVC
kubectl apply -f infrastructure/kubernetes/base/components/volumes/model-storage-pvc.yaml

# 2. Recreate training service (deployment selector is immutable)
kubectl delete deployment training-service -n bakery-ia
kubectl apply -f infrastructure/kubernetes/base/components/training/training-service.yaml

# 3. Recreate forecasting service
kubectl delete deployment forecasting-service -n bakery-ia
kubectl apply -f infrastructure/kubernetes/base/components/forecasting/forecasting-service.yaml

# 4. Verify pods are running
kubectl get pods -n bakery-ia | grep -E "(training|forecasting)"
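
If the stack is normally applied through the base kustomization, the apply commands in steps 1–3 collapse into a single kustomize apply (assuming infrastructure/kubernetes/base is the directory that gets applied; the delete commands are still needed whenever the selector changed). Either way, it is worth waiting for the rollouts before re-testing:

# Optional: apply everything, including the new PVC, via kustomize
kubectl apply -k infrastructure/kubernetes/base

# Wait for both deployments to come back up
kubectl rollout status deployment/training-service -n bakery-ia --timeout=120s
kubectl rollout status deployment/forecasting-service -n bakery-ia --timeout=120s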

How It Works Now

  1. Training Flow:

    • Model trained → Saved to /app/models/{tenant_id}/{model_id}.pkl
    • File persisted to PersistentVolume (survives pod restarts)
    • Metadata saved to database with model path
  2. Forecasting Flow:

    • Retrieves model metadata from database
    • Loads model from /app/models/{tenant_id}/{model_id}.pkl
    • File exists in shared PersistentVolume
    • Prediction succeeds

Storage Configuration

  • Type: PersistentVolumeClaim with local-path provisioner
  • Access Mode: ReadWriteOnce (single node, multiple pods)
  • Size: 10Gi (adjustable; see the expansion note after this list)
  • Lifecycle: Independent of pod lifecycle
  • Shared: Same volume mounted by both services
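
If more space is ever needed, the request can in principle be bumped in place, but only when the StorageClass has allowVolumeExpansion enabled; the local-path provisioner typically does not, in which case the PVC must be recreated with a larger size. The in-place route, for reference:

# Works only if the StorageClass allows volume expansion
kubectl patch pvc model-storage -n bakery-ia \
  -p '{"spec": {"resources": {"requests": {"storage": "20Gi"}}}}'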

Benefits

  1. Data Persistence: Models survive pod restarts/crashes
  2. Cross-Service Access: Training writes, Forecasting reads
  3. Scalability: Can increase storage size as needed
  4. Reliability: No data loss on container recreation

Future Improvements

For production environments, consider:

  1. ReadWriteMany volumes: Use NFS/CephFS for multi-node clusters
  2. Model versioning: Implement model lifecycle management
  3. Backup strategy: Regular backups of model storage
  4. Monitoring: Track storage usage and model count
  5. Cloud storage: S3/GCS for distributed deployments

Testing Recommendations

  1. Trigger new model training
  2. Verify model file exists in PV
  3. Test prediction endpoint
  4. Restart pods and verify models still accessible
  5. Monitor for any storage-related errors (see the sketch after this list)
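
The steps above can be scripted roughly as follows. The training and prediction routes shown here are assumptions about the API surface, not the actual endpoints, and should be replaced with the real ones:

# 1–2. Trigger a training run (hypothetical route), then check for the .pkl in the PV
curl -X POST http://training-service.bakery-ia/api/v1/train -d '{"tenant_id": "<tenant_id>"}'
kubectl exec -n bakery-ia deployment/training-service -- find /app/models -name '*.pkl'

# 3. Call the prediction endpoint (also hypothetical) and expect a 200 instead of a 500
curl -X POST http://forecasting-service.bakery-ia/api/v1/predict -d '{"model_id": "<model_id>"}'

# 4. Restart both pods and confirm the model files survive
kubectl rollout restart deployment/training-service deployment/forecasting-service -n bakery-ia
kubectl exec -n bakery-ia deployment/forecasting-service -- find /app/models -name '*.pkl'

# 5. Watch for storage-related errors after the restart
kubectl logs -n bakery-ia deployment/forecasting-service --since=10m | grep -i "model file"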