
# Model Storage Fix - Root Cause Analysis & Resolution

## Problem Summary
**Error**: `Model file not found: /app/models/{tenant_id}/{model_id}.pkl`

**Impact**: The forecasting service could not generate predictions and returned HTTP 500 errors.

## Root Cause Analysis

### The Issue
Both the training and forecasting services were configured to save and load ML models under `/app/models`, but **no persistent storage was configured** for that path. The failure unfolded as follows:

1. **Training service** saved model files to `/app/models/{tenant_id}/{model_id}.pkl` (in-container filesystem)
2. **Model metadata** was saved to the database successfully, so the model appeared to exist
3. **Container restarts** or different pod instances → file gone, because each container has its own ephemeral filesystem that is neither persisted nor visible to other pods
4. **Forecasting service** tried to load the model from `/app/models/...` → **File not found**
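
The failure in step 4 can be made explicit in code. The following is a minimal sketch, assuming the services build the path exactly as it appears in the error message; `model_path` and `check_model_file` are hypothetical helper names, not functions taken from the codebase.

```python
from pathlib import Path

# Root directory named in this document; both services use the same path.
MODELS_ROOT = Path("/app/models")


def model_path(tenant_id: str, model_id: str, root: Path = MODELS_ROOT) -> Path:
    """Build the on-disk location {root}/{tenant_id}/{model_id}.pkl."""
    return root / tenant_id / f"{model_id}.pkl"


def check_model_file(tenant_id: str, model_id: str, root: Path = MODELS_ROOT) -> Path:
    """Return the model path, or fail loudly if the file is missing.

    Metadata in the database can outlive the file itself, so the existence
    of a metadata row is not proof that the model can be loaded.
    """
    path = model_path(tenant_id, model_id, root)
    if not path.is_file():
        raise FileNotFoundError(f"Model file not found: {path}")
    return path
```

Checking the file before loading separates "metadata exists" from "model is loadable", which is the gap this incident exposed.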

### Evidence from Logs
```
[error] Model file not found: /app/models/d3fe350f-ffcb-439c-9d66-65851b0cf0c7/2096bc66-aef7-4499-a79c-c4d40d5aa9f1.pkl
[error] Model file not valid: /app/models/d3fe350f-ffcb-439c-9d66-65851b0cf0c7/2096bc66-aef7-4499-a79c-c4d40d5aa9f1.pkl
[error] Error generating prediction error=Model 2096bc66-aef7-4499-a79c-c4d40d5aa9f1 not found or failed to load
```

### Architecture Flaw
- Training service deployment: had only an `emptyDir` volume mounted at `/tmp`
- Forecasting service deployment: had no volumes at all
- Model files were stored in the ephemeral container filesystem
- No persistent storage was shared between the services

## Solution Implemented

### 1. Created Persistent Volume Claim
**File**: `infrastructure/kubernetes/base/components/volumes/model-storage-pvc.yaml`

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-storage
  namespace: bakery-ia
spec:
  accessModes:
    - ReadWriteOnce  # Single node access
  resources:
    requests:
      storage: 10Gi
  storageClassName: standard  # Uses local-path provisioner
```

### 2. Updated Training Service
**File**: `infrastructure/kubernetes/base/components/training/training-service.yaml`

Added volume mount:
```yaml
# Container level: spec.template.spec.containers[]
volumeMounts:
  - name: model-storage
    mountPath: /app/models  # Training writes models here

# Pod level: spec.template.spec
volumes:
  - name: model-storage
    persistentVolumeClaim:
      claimName: model-storage

### 3. Updated Forecasting Service
**File**: `infrastructure/kubernetes/base/components/forecasting/forecasting-service.yaml`

Added a read-only volume mount:
```yaml
# Container level: spec.template.spec.containers[]
volumeMounts:
  - name: model-storage
    mountPath: /app/models
    readOnly: true  # Forecasting only reads models

# Pod level: spec.template.spec
volumes:
  - name: model-storage
    persistentVolumeClaim:
      claimName: model-storage
      readOnly: true

### 4. Updated Kustomization
Added PVC to resource list in `infrastructure/kubernetes/base/kustomization.yaml`

## Verification

### PVC Status
```bash
kubectl get pvc -n bakery-ia model-storage
# Expected: STATUS Bound, CAPACITY 10Gi, ACCESS MODES RWO
```

### Volume Mounts Verified
```bash
# Training service (read-write mount)
kubectl exec -n bakery-ia deployment/training-service -- ls -la /app/models
# ✅ Directory exists (mounted read-write)

# Forecasting service (read-only mount)
kubectl exec -n bakery-ia deployment/forecasting-service -- ls -la /app/models
# ✅ Directory exists and is readable (same volume)
```
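
`ls -la` confirms that the directory exists, but it does not by itself prove that the mount is writable. A small probe along the following lines could be run inside each pod, assuming a Python interpreter is available in the container; `probe_mount` is a hypothetical helper written for illustration.

```python
import os
import tempfile
from pathlib import Path


def probe_mount(path: Path) -> dict:
    """Report whether a directory exists, is readable, and is writable."""
    result = {"exists": path.is_dir(), "readable": False, "writable": False}
    if not result["exists"]:
        return result
    # Listing a directory needs both read and execute permission.
    result["readable"] = os.access(path, os.R_OK | os.X_OK)
    try:
        # Creating (and automatically deleting) a temp file is the most
        # direct writability test; a read-only mount raises OSError here.
        with tempfile.NamedTemporaryFile(dir=path, prefix=".probe-"):
            result["writable"] = True
    except OSError:
        result["writable"] = False
    return result


if __name__ == "__main__":
    print(probe_mount(Path("/app/models")))
```

With the mounts described above, the training pod would be expected to report `writable: True` and the forecasting pod `writable: False`.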

## Deployment Steps

```bash
# 1. Create PVC
kubectl apply -f infrastructure/kubernetes/base/components/volumes/model-storage-pvc.yaml

# 2. Recreate training service (delete first: the Deployment selector is immutable, so a changed selector cannot be applied in place)
kubectl delete deployment training-service -n bakery-ia
kubectl apply -f infrastructure/kubernetes/base/components/training/training-service.yaml

# 3. Recreate forecasting service
kubectl delete deployment forecasting-service -n bakery-ia
kubectl apply -f infrastructure/kubernetes/base/components/forecasting/forecasting-service.yaml

# 4. Verify pods are running
kubectl get pods -n bakery-ia | grep -E "(training|forecasting)"
```

## How It Works Now

1. **Training Flow**:
   - Model trained → Saved to `/app/models/{tenant_id}/{model_id}.pkl`
   - File persisted to PersistentVolume (survives pod restarts)
   - Metadata saved to database with model path

2. **Forecasting Flow**:
   - Retrieves model metadata from database
   - Loads model from `/app/models/{tenant_id}/{model_id}.pkl`
   - File exists in shared PersistentVolume ✅
   - Prediction succeeds ✅
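
The two flows above can be sketched end to end. This is a minimal illustration, not the services' actual code: it assumes the `.pkl` files are plain `pickle` dumps, and `save_model` and `load_model` are hypothetical names. The writer uses a temporary file plus an atomic rename so that a reader on the shared volume never sees a partially written model.

```python
import os
import pickle
import tempfile
from pathlib import Path


def save_model(model, root: Path, tenant_id: str, model_id: str) -> Path:
    """Training side: write {root}/{tenant_id}/{model_id}.pkl atomically."""
    target = root / tenant_id / f"{model_id}.pkl"
    target.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the target directory so the rename below stays
    # on one filesystem, which is what makes os.replace atomic.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as fh:
            pickle.dump(model, fh)
            fh.flush()
            os.fsync(fh.fileno())  # push bytes to disk before publishing
        os.replace(tmp_name, target)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return target


def load_model(root: Path, tenant_id: str, model_id: str):
    """Forecasting side: read the model back from the shared volume."""
    path = root / tenant_id / f"{model_id}.pkl"
    with path.open("rb") as fh:
        # Only unpickle files written by a trusted producer.
        return pickle.load(fh)


if __name__ == "__main__":
    # A temporary directory stands in for the shared volume at /app/models.
    with tempfile.TemporaryDirectory() as shared:
        save_model({"weights": [0.1, 0.2]}, Path(shared), "tenant-a", "model-1")
        print(load_model(Path(shared), "tenant-a", "model-1"))
```

The atomic rename matters once both services share one volume: without it, a prediction request arriving mid-write could load a truncated file.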

## Storage Configuration

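- **Type**: PersistentVolumeClaim backed by the local-path provisioner (node-local storage)
- **Access Mode**: ReadWriteOnce (mountable by a single node; multiple pods on that node can share it, so both services must be scheduled on the same node)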
- **Size**: 10Gi (adjustable)
- **Lifecycle**: Independent of pod lifecycle
- **Shared**: Same volume mounted by both services

## Benefits

1. **Data Persistence**: Models survive pod restarts/crashes
2. **Cross-Service Access**: Training writes, Forecasting reads
3. **Scalability**: Storage size can be increased as needed, provided the storage class supports volume expansion
4. **Reliability**: No data loss on container recreation

## Future Improvements

For production environments, consider:

1. **ReadWriteMany volumes**: Use NFS/CephFS for multi-node clusters
2. **Model versioning**: Implement model lifecycle management
3. **Backup strategy**: Regular backups of model storage
4. **Monitoring**: Track storage usage and model count
5. **Cloud storage**: S3/GCS for distributed deployments
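
As a starting point for the monitoring item, storage usage and model count can be gathered with the standard library alone. This is a rough sketch that assumes the `{tenant_id}/{model_id}.pkl` layout described earlier; `storage_report` is a hypothetical name, and depending on the provisioner the volume figure may reflect the underlying node disk rather than the 10Gi request.

```python
import shutil
import tempfile
from pathlib import Path


def storage_report(root: Path) -> dict:
    """Summarise model files stored as {root}/{tenant_id}/{model_id}.pkl."""
    files = [p for p in root.glob("*/*.pkl") if p.is_file()]
    # Total/used/free bytes of the filesystem that holds `root`.
    usage = shutil.disk_usage(root)
    return {
        "tenant_count": len({p.parent.name for p in files}),
        "model_count": len(files),
        "model_bytes": sum(p.stat().st_size for p in files),
        "volume_used_fraction": usage.used / usage.total if usage.total else 0.0,
    }


if __name__ == "__main__":
    # A temporary directory stands in for the mounted volume.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "tenant-a").mkdir()
        (root / "tenant-a" / "model-1.pkl").write_bytes(b"x" * 128)
        print(storage_report(root))
```

The returned dictionary could be logged periodically or exported as metrics; how it is surfaced is left open here.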

## Testing Recommendations

1. Trigger new model training
2. Verify model file exists in PV
3. Test prediction endpoint
4. Restart pods and verify models still accessible
5. Monitor for any storage-related errors