# Traffic Data Storage for Re-Training

## Overview

This document describes the enhanced traffic data storage system, which persists fetched traffic data in the database so that it can be reused for model re-training.

## Architecture

### Database Schema

The `traffic_data` table stores all traffic data with the following schema:

```sql
CREATE TABLE traffic_data (
    id UUID PRIMARY KEY,
    location_id VARCHAR(100) NOT NULL,  -- Format: "lat,lon" (e.g., "40.4168,-3.7038")
    date TIMESTAMP WITH TIME ZONE NOT NULL,
    traffic_volume INTEGER,
    pedestrian_count INTEGER,
    congestion_level VARCHAR(20),  -- "low", "medium", "high", "blocked"
    average_speed FLOAT,
    source VARCHAR(50) NOT NULL DEFAULT 'madrid_opendata',
    raw_data TEXT,  -- JSON string of original data
    created_at TIMESTAMP WITH TIME ZONE NOT NULL,
    updated_at TIMESTAMP WITH TIME ZONE NOT NULL
);

-- Indexes for efficient querying
CREATE INDEX idx_traffic_location_date ON traffic_data(location_id, date);
CREATE INDEX idx_traffic_date_range ON traffic_data(date);
```

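Because `location_id` is a plain string key, lookups only match when the coordinates are formatted identically on write and on read. The following is a minimal sketch of helpers for the `"lat,lon"` format; the function names and the fixed four-decimal precision are assumptions taken from the example value in the schema comment, not the service's actual code.

```python
from __future__ import annotations


def make_location_id(lat: float, lon: float) -> str:
    """Build a "lat,lon" key.

    Four decimal places is an assumption based on the schema's example value.
    """
    return f"{lat:.4f},{lon:.4f}"


def parse_location_id(location_id: str) -> tuple[float, float]:
    """Split a "lat,lon" key back into a (lat, lon) tuple of floats."""
    lat_text, lon_text = location_id.split(",")
    return float(lat_text), float(lon_text)
```

Using one helper for both storage and retrieval avoids mismatches such as `"40.41680,-3.70380"` versus `"40.4168,-3.7038"`, which would look like missing data at query time.
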
### Key Components

#### 1. Enhanced TrafficService (`services/data/app/services/traffic_service.py`)

**New Methods:**
- `_store_traffic_data_batch()`: Efficiently stores multiple traffic records with duplicate detection
- `_validate_traffic_data()`: Validates traffic data before storage
- `get_stored_traffic_for_training()`: Retrieves stored traffic data specifically for training

**Enhanced Methods:**
- `get_historical_traffic()`: Now automatically stores fetched data for future re-training

#### 2. Training Data Orchestrator (`services/training/app/services/training_orchestrator.py`)

**New Methods:**
- `retrieve_stored_traffic_for_retraining()`: Retrieves previously stored traffic data for re-training
- `_log_traffic_data_storage()`: Logs traffic data storage for audit purposes

**Enhanced Methods:**
- `_collect_traffic_data_with_timeout()`: Now includes storage logging and validation

#### 3. Data Service Client (`shared/clients/data_client.py`)

**New Methods:**
- `get_stored_traffic_data_for_training()`: Dedicated method for retrieving stored training data

#### 4. API Endpoints (`services/data/app/api/traffic.py`)

**New Endpoint:**
- `POST /tenants/{tenant_id}/traffic/stored`: Retrieves stored traffic data for training purposes

## Data Flow

### Initial Training
1. The training orchestrator requests traffic data
2. The data service checks the database first
3. If no matching data is found, the data service fetches it from the Madrid Open Data API
4. **The fetched data is automatically stored in the database**
5. The data service returns the data to the training orchestrator
6. Training completes using the fetched data

### Re-Training
1. The training orchestrator requests stored traffic data
2. The data service queries the database by location and date range
3. The data service returns the stored data without making external API calls
4. Training completes using the stored data

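The two flows above can be sketched as a read-through cache. This is an illustrative outline only: `InMemoryTrafficStore`, `get_traffic`, `get_stored_traffic`, and the `fetch_external` callback are hypothetical names standing in for the database table, the data service, and the Madrid Open Data client.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]
Key = Tuple[str, str, str]  # (location_id, start_date, end_date)


class InMemoryTrafficStore:
    """Hypothetical stand-in for the traffic_data table."""

    def __init__(self) -> None:
        self._rows: Dict[Key, List[Record]] = {}

    def load(self, key: Key) -> List[Record]:
        return self._rows.get(key, [])

    def save(self, key: Key, records: List[Record]) -> None:
        self._rows[key] = list(records)


def get_traffic(
    store: InMemoryTrafficStore,
    key: Key,
    fetch_external: Callable[[Key], List[Record]],
) -> List[Record]:
    """Initial training: check storage first, fetch and store on a miss."""
    stored = store.load(key)
    if stored:
        return stored
    fetched = fetch_external(key)
    if fetched:
        store.save(key, fetched)  # keep the data for later re-training
    return fetched


def get_stored_traffic(store: InMemoryTrafficStore, key: Key) -> List[Record]:
    """Re-training: read from storage only, never call the external API."""
    return store.load(key)
```

In this sketch the external fetch runs at most once per key; every later request, including re-training, is served from storage.
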
## Storage Logic

### Duplicate Prevention
- Before storing, the system checks for existing records with the same location and date
- Only new records are stored to avoid database bloat

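A minimal sketch of the duplicate check described above, assuming records are dictionaries with `location_id` and `date` keys and that the set of already-stored `(location_id, date)` pairs has been loaded beforehand. The function name `filter_new_records` is hypothetical, and dropping repeats within the same incoming batch is an added assumption.

```python
from typing import Dict, Iterable, List, Set, Tuple

Record = Dict[str, object]
RecordKey = Tuple[object, object]  # (location_id, date)


def filter_new_records(
    records: Iterable[Record],
    existing_keys: Set[RecordKey],
) -> List[Record]:
    """Keep only records whose (location_id, date) pair is not stored yet."""
    seen = set(existing_keys)  # copy so the caller's set is not modified
    new_records: List[Record] = []
    for record in records:
        key = (record["location_id"], record["date"])
        if key in seen:
            continue  # already stored, or repeated earlier in this batch
        seen.add(key)
        new_records.append(record)
    return new_records
```
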
### Batch Processing
- Traffic data is stored in batches of 100 records for efficiency
- Each batch is committed separately to handle large datasets

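The batching described above might look like the sketch below. The batch size of 100 comes from this document; `iter_batches`, `store_in_batches`, and the `commit_batch` callback are hypothetical names, with the callback standing in for whatever inserts and commits one batch in the real service.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")
BATCH_SIZE = 100  # batch size stated in this document


def iter_batches(items: Sequence[T], batch_size: int = BATCH_SIZE) -> Iterator[List[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def store_in_batches(records: Sequence[T], commit_batch: Callable[[List[T]], None]) -> int:
    """Pass each batch to commit_batch and return the number of records handled."""
    stored = 0
    for batch in iter_batches(records):
        commit_batch(batch)  # hypothetical: insert and commit a single batch
        stored += len(batch)
    return stored
```

Committing per batch means a failure part-way through a large import loses at most the batch in flight rather than the whole dataset.
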
### Data Validation
- Traffic volume: 0-10,000 vehicles per hour
- Pedestrian count: 0-10,000 people per hour
- Average speed: 0-200 km/h
- Congestion level: "low", "medium", "high", "blocked"

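The rules above can be expressed as a small validation function. This is a sketch, not the body of `_validate_traffic_data()`: the function name is hypothetical, and treating a missing (`None`) value as valid is an assumption based on those columns being nullable in the schema.

```python
from typing import Dict, Optional

VALID_CONGESTION_LEVELS = {"low", "medium", "high", "blocked"}


def _in_range(value: Optional[float], low: float, high: float) -> bool:
    """Missing values pass (assumption: the columns are nullable)."""
    return value is None or low <= value <= high


def validate_traffic_record(record: Dict[str, object]) -> bool:
    """Return True when every present field is within the documented limits."""
    if not _in_range(record.get("traffic_volume"), 0, 10_000):
        return False
    if not _in_range(record.get("pedestrian_count"), 0, 10_000):
        return False
    if not _in_range(record.get("average_speed"), 0, 200):
        return False
    level = record.get("congestion_level")
    return level is None or level in VALID_CONGESTION_LEVELS
```
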
## Benefits

### 1. Improved Re-Training Performance
- No need to re-fetch external API data
- Faster training iterations
- Reduced API rate limiting issues

### 2. Data Consistency
- Same traffic data used across multiple training runs
- Reproducible training results
- Historical data preservation

### 3. Cost Efficiency
- Reduced API calls to external services
- Lower bandwidth usage
- Better resource utilization

### 4. Offline Training
- Training can proceed even if external APIs are unavailable
- Increased system resilience

## Usage Examples

### Retrieving Stored Traffic Data
```python
import asyncio
from datetime import datetime

from services.training.app.services.training_orchestrator import TrainingDataOrchestrator


async def main():
    orchestrator = TrainingDataOrchestrator()

    # Get stored traffic data for re-training
    traffic_data = await orchestrator.retrieve_stored_traffic_for_retraining(
        bakery_location=(40.4168, -3.7038),  # Madrid coordinates
        start_date=datetime(2024, 1, 1),
        end_date=datetime(2024, 12, 31),
        tenant_id="tenant-123"
    )
    return traffic_data


# `await` is only valid inside a coroutine, so run the call through asyncio
asyncio.run(main())
```

### Checking Storage Status
```python
# The system automatically logs storage operations
# Check logs for entries like:
# "Traffic data stored for re-training" - indicates successful storage
# "Retrieved X stored traffic records for training" - indicates successful retrieval
```

## Monitoring

### Storage Metrics
- Number of records stored per location
- Storage success rate
- Duplicate detection rate

### Retrieval Metrics
- Query response time
- Records retrieved per request
- Re-training data availability

### Audit Trail
All traffic data operations are logged with:
- Location coordinates
- Date ranges
- Record counts
- Storage/retrieval timestamps
- Purpose (training/re-training)

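One possible shape for such an audit entry is sketched below using the standard `logging` and `json` modules. The logger name, the field names, and both functions are illustrative assumptions; the actual output of `_log_traffic_data_storage()` may be structured differently.

```python
import json
import logging
from datetime import datetime, timezone
from typing import Dict, Tuple

logger = logging.getLogger("traffic_audit")  # hypothetical logger name


def build_audit_entry(
    operation: str,
    location: Tuple[float, float],
    start_date: datetime,
    end_date: datetime,
    record_count: int,
    purpose: str,
) -> Dict[str, object]:
    """Collect the audit fields listed above into one dictionary."""
    return {
        "operation": operation,  # e.g. "storage" or "retrieval"
        "location": {"lat": location[0], "lon": location[1]},
        "start_date": start_date.isoformat(),
        "end_date": end_date.isoformat(),
        "record_count": record_count,
        "purpose": purpose,  # "training" or "re-training"
        "logged_at": datetime.now(timezone.utc).isoformat(),
    }


def log_audit_entry(entry: Dict[str, object]) -> None:
    """Emit the entry as a single JSON line so it is easy to search."""
    logger.info(json.dumps(entry, sort_keys=True))
```
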
## Migration

To enable traffic data storage on existing deployments:

1. **Run Database Migration:**
   ```bash
   cd services/data
   alembic upgrade head
   ```

2. **Restart Data Service:**
   ```bash
   docker-compose restart data-service
   ```

3. **Verify Storage:**
   - Check logs for "Traffic data stored for re-training" messages
   - Query database: `SELECT COUNT(*) FROM traffic_data;`

## Configuration

No additional configuration is required. The system automatically:
- Detects when traffic data should be stored
- Handles duplicate prevention
- Manages database transactions
- Provides fallback mechanisms

## Troubleshooting

### Common Issues

**1. Storage Failures**
- Check database connectivity
- Verify that the `traffic_data` table exists with the expected schema
- Review validation errors in logs

**2. No Stored Data Available**
- Ensure initial training has been completed
- Check that the requested date range falls within the stored data period
- Verify that the location coordinates match the stored `location_id` values

**3. Performance Issues**
- Monitor database query performance
- Check index usage
- Consider data archival for old records

### Error Messages

- `"No stored traffic data found for re-training"`: Normal when no previous training has occurred
- `"Failed to store traffic data batch"`: Database connectivity or validation issue
- `"Invalid traffic data, skipping"`: Data validation failure - check raw API response

## Future Enhancements

1. **Data Archival**: Automatic archival of old traffic data
2. **Data Compression**: Compress the `raw_data` field for storage efficiency
3. **Regional Expansion**: Support for traffic data from other cities
4. **Real-time Updates**: Continuous traffic data collection and storage
5. **Analytics**: Traffic pattern analysis and reporting