# Traffic Data Storage for Re-Training

## Overview

This document describes how the data service persists fetched traffic data in the database so that later model re-training runs can reuse it instead of calling the external API again.

## Architecture

### Database Schema

The `traffic_data` table stores all traffic data with the following schema:

```sql
CREATE TABLE traffic_data (
    id UUID PRIMARY KEY,
    location_id VARCHAR(100) NOT NULL,  -- Format: "lat,lon" (e.g., "40.4168,-3.7038")
    date TIMESTAMP WITH TIME ZONE NOT NULL,
    traffic_volume INTEGER,
    pedestrian_count INTEGER,
    congestion_level VARCHAR(20),  -- "low", "medium", "high", "blocked"
    average_speed FLOAT,
    source VARCHAR(50) NOT NULL DEFAULT 'madrid_opendata',
    raw_data TEXT,  -- JSON string of original data
    created_at TIMESTAMP WITH TIME ZONE NOT NULL,
    updated_at TIMESTAMP WITH TIME ZONE NOT NULL
);

-- Indexes for efficient querying
CREATE INDEX idx_traffic_location_date ON traffic_data(location_id, date);
CREATE INDEX idx_traffic_date_range ON traffic_data(date);
```
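Because `location_id` is a plain "lat,lon" string, stored rows are only found again if the key is built the same way on every write and read. The following is a minimal sketch of such a helper. The name `make_location_id` and the four-decimal precision are assumptions inferred from the example value in the schema comment, not the service's confirmed behaviour.

```python
def make_location_id(latitude: float, longitude: float, precision: int = 4) -> str:
    """Build a "lat,lon" key such as "40.4168,-3.7038".

    Hypothetical helper: the name and the default precision are assumptions
    based on the example in the schema comment.
    """
    if not -90.0 <= latitude <= 90.0:
        raise ValueError(f"latitude out of range: {latitude}")
    if not -180.0 <= longitude <= 180.0:
        raise ValueError(f"longitude out of range: {longitude}")
    # Fixed-precision formatting keeps keys stable across float noise
    return f"{latitude:.{precision}f},{longitude:.{precision}f}"


location_id = make_location_id(40.4168, -3.7038)
```

Fixing the precision in one place avoids near-duplicate keys when the same bakery coordinates arrive with slightly different float representations.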

### Key Components

#### 1. Enhanced TrafficService (`services/data/app/services/traffic_service.py`)

**New Methods:**
- `_store_traffic_data_batch()`: Efficiently stores multiple traffic records with duplicate detection
- `_validate_traffic_data()`: Validates traffic data before storage
- `get_stored_traffic_for_training()`: Retrieves stored traffic data specifically for training

**Enhanced Methods:**
- `get_historical_traffic()`: Now automatically stores fetched data for future re-training

#### 2. Training Data Orchestrator (`services/training/app/services/training_orchestrator.py`)

**New Methods:**
- `retrieve_stored_traffic_for_retraining()`: Retrieves previously stored traffic data for re-training
- `_log_traffic_data_storage()`: Logs traffic data storage for audit purposes

**Enhanced Methods:**
- `_collect_traffic_data_with_timeout()`: Now includes storage logging and validation

#### 3. Data Service Client (`shared/clients/data_client.py`)

**New Methods:**
- `get_stored_traffic_data_for_training()`: Dedicated method for retrieving stored training data

#### 4. API Endpoints (`services/data/app/api/traffic.py`)

**New Endpoint:**
- `POST /tenants/{tenant_id}/traffic/stored`: Retrieves stored traffic data for training purposes

## Data Flow

### Initial Training
1. Training orchestrator requests traffic data
2. Data service checks database first
3. If not found, fetches from Madrid Open Data API
4. **Data is automatically stored in database**
5. Returns data to training orchestrator
6. Training completes using fetched data
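
The steps above follow a check-store-first, fetch-on-miss pattern. The sketch below illustrates it with an in-memory dictionary standing in for the `traffic_data` table. The names `get_traffic` and `fake_api` are hypothetical; the real logic lives in `get_historical_traffic()` and may differ.

```python
from datetime import datetime, timezone
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]
# In-memory stand-in for the traffic_data table, keyed by (location_id, date)
Store = Dict[Tuple[str, datetime], Record]


def get_traffic(
    store: Store,
    location_id: str,
    start: datetime,
    end: datetime,
    fetch_from_api: Callable[[str, datetime, datetime], List[Record]],
) -> List[Record]:
    """Return stored records for the range, fetching and storing on a miss."""
    cached = [
        rec for (loc, date), rec in store.items()
        if loc == location_id and start <= date <= end
    ]
    if cached:
        return sorted(cached, key=lambda rec: rec["date"])

    fetched = fetch_from_api(location_id, start, end)
    for rec in fetched:
        # Persist every fetched record so re-training can reuse it
        store[(location_id, rec["date"])] = rec
    return fetched


calls = []


def fake_api(location_id, start, end):
    """Hypothetical stand-in for the Madrid Open Data API client."""
    calls.append(location_id)
    return [{"date": start, "traffic_volume": 120}]


store: Store = {}
start = datetime(2024, 1, 1, tzinfo=timezone.utc)
end = datetime(2024, 1, 31, tzinfo=timezone.utc)
first = get_traffic(store, "40.4168,-3.7038", start, end, fake_api)   # fetches and stores
second = get_traffic(store, "40.4168,-3.7038", start, end, fake_api)  # served from the store
```

This sketch treats any stored record in the range as a hit. How the service handles a range that is only partially covered by stored data is not specified here.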

### Re-Training
1. Training orchestrator requests stored traffic data
2. Data service queries database using location and date range
3. Returns stored data without making API calls
4. Training completes using stored data
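
The lookup in step 2 can be illustrated with a small self-contained query. This sketch uses an in-memory SQLite database purely so it runs anywhere; the production table uses `TIMESTAMP WITH TIME ZONE` columns, whereas here the dates are ISO 8601 strings in UTC. The function name `stored_traffic` and the sample rows are illustrative assumptions.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE traffic_data ("
    " location_id TEXT NOT NULL, date TEXT NOT NULL, traffic_volume INTEGER)"
)
conn.execute(
    "CREATE INDEX idx_traffic_location_date ON traffic_data(location_id, date)"
)


def iso(year, month, day):
    """UTC timestamp as an ISO 8601 string; these sort chronologically."""
    return datetime(year, month, day, tzinfo=timezone.utc).isoformat()


conn.executemany(
    "INSERT INTO traffic_data VALUES (?, ?, ?)",
    [
        ("40.4168,-3.7038", iso(2024, 1, 15), 120),
        ("40.4168,-3.7038", iso(2024, 6, 1), 340),
        ("40.4168,-3.7038", iso(2025, 2, 1), 210),  # outside the date range
        ("41.3874,2.1686", iso(2024, 3, 1), 500),   # different location
    ],
)


def stored_traffic(conn, location_id, start, end):
    """Fetch stored rows for one location and an inclusive date range."""
    rows = conn.execute(
        "SELECT date, traffic_volume FROM traffic_data"
        " WHERE location_id = ? AND date BETWEEN ? AND ?"
        " ORDER BY date",
        (location_id, start, end),
    )
    return rows.fetchall()


rows = stored_traffic(conn, "40.4168,-3.7038", iso(2024, 1, 1), iso(2024, 12, 31))
```

Filtering on `location_id` first and then on a `date` range matches the column order of the `idx_traffic_location_date` index.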

## Storage Logic

### Duplicate Prevention
- Before storing, the system checks for existing records with the same location and date
- Only new records are stored to avoid database bloat
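
A minimal sketch of the duplicate check, assuming records are dictionaries and the set of already stored `(location_id, date)` pairs has been loaded beforehand. The function name `filter_new_records` is hypothetical and does not describe the internals of `_store_traffic_data_batch()`.

```python
from datetime import datetime, timezone
from typing import Dict, Iterable, List, Set, Tuple

Key = Tuple[str, datetime]


def filter_new_records(records: Iterable[Dict], existing_keys: Set[Key]) -> List[Dict]:
    """Keep only records whose (location_id, date) pair is not stored yet."""
    seen = set(existing_keys)  # copy so the caller's set is left untouched
    new_records = []
    for rec in records:
        key = (rec["location_id"], rec["date"])
        if key in seen:
            continue  # already stored, or repeated within this input
        seen.add(key)
        new_records.append(rec)
    return new_records


ts = datetime(2024, 1, 1, 8, tzinfo=timezone.utc)
existing = {("40.4168,-3.7038", ts)}
incoming = [
    {"location_id": "40.4168,-3.7038", "date": ts, "traffic_volume": 120},
    {"location_id": "40.4168,-3.7038", "date": ts.replace(hour=9), "traffic_volume": 150},
    {"location_id": "40.4168,-3.7038", "date": ts.replace(hour=9), "traffic_volume": 150},
]
new_records = filter_new_records(incoming, existing)  # keeps only the first 09:00 record
```

The schema shown earlier defines a non-unique index on `(location_id, date)`, so the check happens in application code; a unique constraint on the same columns would be one way to enforce it at the database level as well.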

### Batch Processing
- Traffic data is stored in batches of 100 records for efficiency
- Each batch is committed separately to handle large datasets
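
The batching described above can be sketched as follows. The `commit_batch` callback is a hypothetical stand-in for whatever writes and commits one batch to the database; error handling is left out because the failure behaviour of the real service is not documented here.

```python
from typing import Callable, Iterator, List, Sequence

BATCH_SIZE = 100  # batch size stated in this section


def iter_batches(records: Sequence[dict], size: int = BATCH_SIZE) -> Iterator[List[dict]]:
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield list(records[start:start + size])


def store_in_batches(
    records: Sequence[dict],
    commit_batch: Callable[[List[dict]], None],
) -> int:
    """Commit each batch on its own and return the number of records stored."""
    stored = 0
    for batch in iter_batches(records):
        commit_batch(batch)  # one commit per batch, not one per dataset
        stored += len(batch)
    return stored


committed_sizes = []
stored = store_in_batches(
    [{"n": i} for i in range(250)],
    lambda batch: committed_sizes.append(len(batch)),
)
```

Committing per batch keeps each transaction small, so a large historical fetch does not hold one long-running transaction open.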

### Data Validation
Records are checked against the following limits before storage:
- Traffic volume: 0-10,000 vehicles per hour
- Pedestrian count: 0-10,000 people per hour
- Average speed: 0-200 km/h
- Congestion level: one of "low", "medium", "high", "blocked"
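
These limits can be expressed as a small validation function. This is a sketch under stated assumptions, not the body of `_validate_traffic_data()`: the name `validation_errors`, the error strings, and the decision to accept missing values (the columns are nullable in the schema) are all illustrative.

```python
from typing import Any, Dict, List

VALID_CONGESTION_LEVELS = {"low", "medium", "high", "blocked"}
NUMERIC_LIMITS = {
    "traffic_volume": (0, 10_000),    # vehicles per hour
    "pedestrian_count": (0, 10_000),  # people per hour
    "average_speed": (0, 200),        # km/h
}


def validation_errors(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the record is valid."""
    errors = []
    for field, (low, high) in NUMERIC_LIMITS.items():
        value = record.get(field)
        if value is None:
            continue  # assumption: nullable columns may be left empty
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            errors.append(f"{field} is not a number: {value!r}")
        elif not low <= value <= high:
            errors.append(f"{field}={value} outside {low}-{high}")
    level = record.get("congestion_level")
    if level is not None and level not in VALID_CONGESTION_LEVELS:
        errors.append(f"unknown congestion_level: {level!r}")
    return errors


ok = validation_errors(
    {"traffic_volume": 850, "average_speed": 32.5, "congestion_level": "medium"}
)
bad = validation_errors({"traffic_volume": 12_000, "congestion_level": "gridlock"})
```

Returning the full list of problems, rather than stopping at the first, makes the log entry for a skipped record more useful when inspecting a raw API response.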

## Benefits

### 1. Improved Re-Training Performance
- No need to re-fetch external API data
- Faster training iterations
- Fewer API rate-limiting issues

### 2. Data Consistency
- Same traffic data used across multiple training runs
- Reproducible training results
- Historical data preservation

### 3. Cost Efficiency
- Reduced API calls to external services
- Lower bandwidth usage
- Better resource utilization

### 4. Offline Training
- Training can proceed even if external APIs are unavailable
- Increased system resilience

## Usage Examples

### Retrieving Stored Traffic Data
```python
import asyncio
from datetime import datetime

from services.training.app.services.training_orchestrator import TrainingDataOrchestrator


async def main():
    orchestrator = TrainingDataOrchestrator()

    # Get stored traffic data for re-training
    traffic_data = await orchestrator.retrieve_stored_traffic_for_retraining(
        bakery_location=(40.4168, -3.7038),  # Madrid coordinates (lat, lon)
        start_date=datetime(2024, 1, 1),
        end_date=datetime(2024, 12, 31),
        tenant_id="tenant-123",
    )
    return traffic_data


# `await` is only valid inside a coroutine, so run the call through asyncio
asyncio.run(main())
```

### Checking Storage Status
```python
# The system automatically logs storage operations
# Check logs for entries like:
# "Traffic data stored for re-training" - indicates successful storage
# "Retrieved X stored traffic records for training" - indicates successful retrieval
```

## Monitoring

### Storage Metrics
- Number of records stored per location
- Storage success rate
- Duplicate detection rate

### Retrieval Metrics
- Query response time
- Records retrieved per request
- Re-training data availability

### Audit Trail
All traffic data operations are logged with:
- Location coordinates
- Date ranges
- Record counts
- Storage/retrieval timestamps
- Purpose (training/re-training)
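
One way to capture these fields is a structured log entry. The sketch below is illustrative only: the logger name, the function `audit_entry`, the JSON layout, and the sample record count are assumptions and do not reflect the actual output of `_log_traffic_data_storage()`.

```python
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("traffic_audit")  # hypothetical logger name


def audit_entry(operation, purpose, location, start, end, record_count):
    """Build one audit record covering the fields listed above."""
    return {
        "operation": operation,  # "storage" or "retrieval"
        "purpose": purpose,      # "training" or "re-training"
        "location": {"lat": location[0], "lon": location[1]},
        "date_range": [start.isoformat(), end.isoformat()],
        "record_count": record_count,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


entry = audit_entry(
    "storage",
    "training",
    (40.4168, -3.7038),
    datetime(2024, 1, 1, tzinfo=timezone.utc),
    datetime(2024, 12, 31, tzinfo=timezone.utc),
    record_count=250,  # illustrative value
)
# One JSON object per line is easy to filter in log tooling
logger.info("traffic_audit %s", json.dumps(entry, sort_keys=True))
```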

## Migration

To enable traffic data storage on existing deployments:

1. **Run Database Migration:**
   ```bash
   cd services/data
   alembic upgrade head
   ```

2. **Restart Data Service:**
   ```bash
   docker-compose restart data-service
   ```

3. **Verify Storage:**
   - Check logs for "Traffic data stored for re-training" messages
   - Query database: `SELECT COUNT(*) FROM traffic_data;`

## Configuration

No additional configuration is required. The system automatically:
- Detects when traffic data should be stored
- Handles duplicate prevention
- Manages database transactions
- Provides fallback mechanisms

## Troubleshooting

### Common Issues

**1. Storage Failures**
- Check database connectivity
- Verify table schema exists
- Review validation errors in logs

**2. No Stored Data Available**
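- Ensure initial training has completed for this location, since data is stored only after it has been fetched once
- Check that the requested date range falls within the stored data period
- Verify that the location coordinates match the stored `location_id` exactly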

**3. Performance Issues**
- Monitor database query performance
- Check index usage
- Consider data archival for old records

### Error Messages

- `"No stored traffic data found for re-training"`: Expected when no previous training has stored data for the requested location and date range
- `"Failed to store traffic data batch"`: Database connectivity or validation issue
- `"Invalid traffic data, skipping"`: A record failed validation and was skipped; check the raw API response

## Future Enhancements

1. **Data Archival**: Automatic archival of old traffic data
2. **Data Compression**: Compress the `raw_data` field for storage efficiency
3. **Regional Expansion**: Support for traffic data from other cities
4. **Real-time Updates**: Continuous traffic data collection and storage
5. **Analytics**: Traffic pattern analysis and reporting