Improve the traffic fetching system
docs/TRAFFIC_DATA_STORAGE.md

# Traffic Data Storage for Re-Training

## Overview

This document describes the traffic data storage system, which persists fetched traffic data in the database so that it can be reused for model re-training.

## Architecture

### Database Schema

The `traffic_data` table stores all traffic data with the following schema:

```sql
CREATE TABLE traffic_data (
    id UUID PRIMARY KEY,
    location_id VARCHAR(100) NOT NULL,  -- Format: "lat,lon" (e.g., "40.4168,-3.7038")
    date TIMESTAMP WITH TIME ZONE NOT NULL,
    traffic_volume INTEGER,
    pedestrian_count INTEGER,
    congestion_level VARCHAR(20),  -- "low", "medium", "high", "blocked"
    average_speed FLOAT,
    source VARCHAR(50) NOT NULL DEFAULT 'madrid_opendata',
    raw_data TEXT,  -- JSON string of original data
    created_at TIMESTAMP WITH TIME ZONE NOT NULL,
    updated_at TIMESTAMP WITH TIME ZONE NOT NULL
);

-- Indexes for efficient querying
CREATE INDEX idx_traffic_location_date ON traffic_data(location_id, date);
CREATE INDEX idx_traffic_date_range ON traffic_data(date);
```
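
Because `location_id` is a plain `"lat,lon"` string, lookups only match when coordinates are serialized identically at write and read time. The following is a minimal sketch of such a serializer. The helper names and the fixed four-decimal precision are assumptions taken from the example value in the schema comment, not the service's actual code.

```python
from typing import Tuple


def make_location_id(latitude: float, longitude: float, precision: int = 4) -> str:
    """Serialize coordinates as "lat,lon" (hypothetical helper, not the service's API)."""
    if not -90.0 <= latitude <= 90.0:
        raise ValueError(f"latitude out of range: {latitude}")
    if not -180.0 <= longitude <= 180.0:
        raise ValueError(f"longitude out of range: {longitude}")
    # Fixed precision is an assumption based on the "40.4168,-3.7038" example.
    return f"{latitude:.{precision}f},{longitude:.{precision}f}"


def parse_location_id(location_id: str) -> Tuple[float, float]:
    """Split a "lat,lon" string back into a (latitude, longitude) pair."""
    lat_text, lon_text = location_id.split(",")
    return float(lat_text), float(lon_text)


print(make_location_id(40.4168, -3.7038))  # prints 40.4168,-3.7038
```

Using one shared helper on both the storage and retrieval paths avoids mismatches such as `"40.41680,-3.7038"` versus `"40.4168,-3.7038"`.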

### Key Components

#### 1. Enhanced TrafficService (`services/data/app/services/traffic_service.py`)

**New Methods:**
- `_store_traffic_data_batch()`: Efficiently stores multiple traffic records with duplicate detection
- `_validate_traffic_data()`: Validates traffic data before storage
- `get_stored_traffic_for_training()`: Retrieves stored traffic data specifically for training

**Enhanced Methods:**
- `get_historical_traffic()`: Now automatically stores fetched data for future re-training

#### 2. Training Data Orchestrator (`services/training/app/services/training_orchestrator.py`)

**New Methods:**
- `retrieve_stored_traffic_for_retraining()`: Retrieves previously stored traffic data for re-training
- `_log_traffic_data_storage()`: Logs traffic data storage for audit purposes

**Enhanced Methods:**
- `_collect_traffic_data_with_timeout()`: Now includes storage logging and validation

#### 3. Data Service Client (`shared/clients/data_client.py`)

**New Methods:**
- `get_stored_traffic_data_for_training()`: Dedicated method for retrieving stored training data

#### 4. API Endpoints (`services/data/app/api/traffic.py`)

**New Endpoint:**
- `POST /tenants/{tenant_id}/traffic/stored`: Retrieves stored traffic data for training purposes
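
This document does not specify the request body for the endpoint above. Purely as an illustration, the sketch below builds (but does not send) a `POST` request with the standard library. The base URL, the function name, and every payload field name (`latitude`, `longitude`, `start_date`, `end_date`) are assumptions, so check `services/data/app/api/traffic.py` for the real contract.

```python
import json
from datetime import datetime
from urllib.request import Request


def build_stored_traffic_request(
    base_url: str,
    tenant_id: str,
    latitude: float,
    longitude: float,
    start_date: datetime,
    end_date: datetime,
) -> Request:
    """Build a POST request for the stored-traffic endpoint (hypothetical payload)."""
    url = f"{base_url.rstrip('/')}/tenants/{tenant_id}/traffic/stored"
    # Field names below are assumptions; the real schema is not documented here.
    payload = {
        "latitude": latitude,
        "longitude": longitude,
        "start_date": start_date.isoformat(),
        "end_date": end_date.isoformat(),
    }
    return Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The returned object can be passed to `urllib.request.urlopen` once the actual payload shape has been confirmed.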

## Data Flow

### Initial Training

1. The training orchestrator requests traffic data
2. The data service checks the database first
3. If the data is not found, the data service fetches it from the Madrid Open Data API
4. **The fetched data is automatically stored in the database**
5. The data service returns the data to the training orchestrator
6. Training completes using the fetched data
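
The initial-training steps can be sketched as a read-through pattern: query storage, fall back to the external source, persist, then return. This is a simplified illustration, not the `TrafficService` implementation. `InMemoryTrafficStore`, `get_traffic_for_training`, and the `fetch_remote` callable are hypothetical stand-ins for the database layer and the Madrid Open Data client.

```python
from datetime import datetime
from typing import Any, Callable, Dict, List, Tuple

Record = Dict[str, Any]
Fetcher = Callable[[str, datetime, datetime], List[Record]]


class InMemoryTrafficStore:
    """Stand-in for the traffic_data table, keyed by (location_id, date)."""

    def __init__(self) -> None:
        self._rows: Dict[Tuple[str, datetime], Record] = {}

    def query(self, location_id: str, start: datetime, end: datetime) -> List[Record]:
        matches = [
            row for (loc, ts), row in self._rows.items()
            if loc == location_id and start <= ts <= end
        ]
        return sorted(matches, key=lambda row: row["date"])

    def save(self, records: List[Record]) -> None:
        for record in records:
            # setdefault keeps the first record stored for a given key.
            self._rows.setdefault((record["location_id"], record["date"]), record)


def get_traffic_for_training(
    store: InMemoryTrafficStore,
    fetch_remote: Fetcher,
    location_id: str,
    start: datetime,
    end: datetime,
) -> List[Record]:
    stored = store.query(location_id, start, end)    # step 2: check storage first
    if stored:
        return stored
    fetched = fetch_remote(location_id, start, end)  # step 3: fall back to the API
    store.save(fetched)                              # step 4: persist for re-training
    return fetched                                   # step 5: hand back to the caller
```

This sketch treats any stored row in the range as a full hit. How the real service handles a partially covered date range is not described here.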

### Re-Training

1. The training orchestrator requests stored traffic data
2. The data service queries the database by location and date range
3. The data service returns the stored data without making API calls
4. Training completes using the stored data
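
The location and date-range lookup in step 2 might look like the sketch below. It uses an in-memory SQLite table in place of the production database so that it runs standalone. The function names and the reduced column set are assumptions. Timestamps are stored as ISO 8601 text, which only compares correctly when every value uses the same UTC offset and format.

```python
import sqlite3
from datetime import datetime, timezone
from typing import List, Tuple


def open_demo_db() -> sqlite3.Connection:
    """Create a reduced, SQLite-flavoured copy of traffic_data for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE traffic_data ("
        "location_id TEXT NOT NULL, date TEXT NOT NULL, traffic_volume INTEGER)"
    )
    conn.execute(
        "CREATE INDEX idx_traffic_location_date ON traffic_data(location_id, date)"
    )
    return conn


def fetch_stored_traffic(
    conn: sqlite3.Connection, location_id: str, start: datetime, end: datetime
) -> List[Tuple[str, int]]:
    """Return (date, traffic_volume) rows for one location within [start, end]."""
    cursor = conn.execute(
        "SELECT date, traffic_volume FROM traffic_data "
        "WHERE location_id = ? AND date BETWEEN ? AND ? ORDER BY date",
        (location_id, start.isoformat(), end.isoformat()),
    )
    return cursor.fetchall()
```

Filtering on `location_id` first and `date` second matches the column order of `idx_traffic_location_date`, so a query of this shape can use that composite index.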

## Storage Logic

### Duplicate Prevention

- Before storing, the system checks for existing records with the same location and date
- Only new records are stored to avoid database bloat
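
One way to express the duplicate check is to key each record by `(location_id, date)` and drop any record whose key is already known. The sketch below assumes the existing keys have already been loaded from the database. `filter_new_records` is a hypothetical name, and the real `_store_traffic_data_batch()` may work differently.

```python
from datetime import datetime
from typing import Any, Dict, Iterable, List, Set, Tuple

Record = Dict[str, Any]
RecordKey = Tuple[str, datetime]


def filter_new_records(
    records: Iterable[Record], existing_keys: Set[RecordKey]
) -> List[Record]:
    """Keep only records whose (location_id, date) key has not been seen yet."""
    seen = set(existing_keys)  # copy so the caller's set is not mutated
    fresh: List[Record] = []
    for record in records:
        key = (record["location_id"], record["date"])
        if key in seen:
            continue  # already stored, or repeated within this batch
        seen.add(key)
        fresh.append(record)
    return fresh
```

Tracking keys inside the loop also removes duplicates that appear twice within the same incoming batch, not only those already in storage.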

### Batch Processing

- Traffic data is stored in batches of 100 records for efficiency
- Each batch is committed separately to handle large datasets
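
The batching described above can be sketched as follows, again with SQLite standing in for the production database. `store_in_batches` and `chunked` are hypothetical names, and the rollback-and-continue error handling is an assumption rather than documented behavior.

```python
import sqlite3
from typing import Iterator, Sequence, Tuple

BATCH_SIZE = 100  # batch size stated in this document

Row = Tuple[str, str, int]  # (location_id, ISO date, traffic_volume)


def chunked(rows: Sequence[Row], size: int) -> Iterator[Sequence[Row]]:
    """Yield consecutive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def store_in_batches(
    conn: sqlite3.Connection, rows: Sequence[Row], batch_size: int = BATCH_SIZE
) -> int:
    """Insert rows batch by batch; return the number of committed batches."""
    committed = 0
    for batch in chunked(rows, batch_size):
        try:
            conn.executemany(
                "INSERT INTO traffic_data (location_id, date, traffic_volume) "
                "VALUES (?, ?, ?)",
                batch,
            )
            conn.commit()  # one commit per batch
            committed += 1
        except sqlite3.Error:
            conn.rollback()  # assumption: drop the failed batch and keep going
    return committed
```

Committing per batch means a failure late in a large import loses at most one batch of 100 records instead of the whole dataset.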

### Data Validation

- Traffic volume: 0-10,000 vehicles per hour
- Pedestrian count: 0-10,000 people per hour
- Average speed: 0-200 km/h
- Congestion level: "low", "medium", "high", "blocked"
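
The validation rules above translate directly into range and membership checks. The sketch below is one possible shape for such a validator, not the actual `_validate_traffic_data()` method. Treating missing or `None` fields as valid is an assumption, based on those columns being nullable in the schema.

```python
from typing import Any, Dict, List

VALID_CONGESTION_LEVELS = {"low", "medium", "high", "blocked"}

# Inclusive (min, max) limits taken from the list above.
NUMERIC_LIMITS = {
    "traffic_volume": (0, 10_000),    # vehicles per hour
    "pedestrian_count": (0, 10_000),  # people per hour
    "average_speed": (0, 200),        # km/h
}


def validate_traffic_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of validation errors; an empty list means the record is valid."""
    errors: List[str] = []
    for field, (low, high) in NUMERIC_LIMITS.items():
        value = record.get(field)
        if value is None:
            continue  # assumption: nullable columns may be absent
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not low <= value <= high:
            errors.append(f"{field} outside {low}-{high}: {value!r}")
    level = record.get("congestion_level")
    if level is not None and level not in VALID_CONGESTION_LEVELS:
        errors.append(f"unknown congestion_level: {level!r}")
    return errors
```

Returning the list of errors, rather than a bare boolean, makes it easier to log why a record was skipped.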

## Benefits

### 1. Improved Re-Training Performance
- No need to re-fetch external API data
- Faster training iterations
- Fewer API rate-limiting issues

### 2. Data Consistency
- Same traffic data used across multiple training runs
- Reproducible training results
- Historical data preservation

### 3. Cost Efficiency
- Fewer API calls to external services
- Lower bandwidth usage
- Better resource utilization

### 4. Offline Training
- Training can proceed from stored data even if external APIs are unavailable
- Increased system resilience

## Usage Examples

### Retrieving Stored Traffic Data

```python
import asyncio
from datetime import datetime

from services.training.app.services.training_orchestrator import TrainingDataOrchestrator


async def main() -> None:
    orchestrator = TrainingDataOrchestrator()

    # Get stored traffic data for re-training
    traffic_data = await orchestrator.retrieve_stored_traffic_for_retraining(
        bakery_location=(40.4168, -3.7038),  # Madrid coordinates
        start_date=datetime(2024, 1, 1),
        end_date=datetime(2024, 12, 31),
        tenant_id="tenant-123",
    )
    print(traffic_data)


# `await` is only valid inside a coroutine, so run the call through asyncio
asyncio.run(main())
```

### Checking Storage Status

```python
# The system automatically logs storage operations
# Check logs for entries like:
# "Traffic data stored for re-training" - indicates successful storage
# "Retrieved X stored traffic records for training" - indicates successful retrieval
```

## Monitoring

### Storage Metrics
- Number of records stored per location
- Storage success rate
- Duplicate detection rate

### Retrieval Metrics
- Query response time
- Records retrieved per request
- Re-training data availability

### Audit Trail
All traffic data operations are logged with:
- Location coordinates
- Date ranges
- Record counts
- Storage/retrieval timestamps
- Purpose (training/re-training)
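
An audit entry carrying the fields listed above could be emitted as one JSON log line, as in the sketch below. The logger name, function names, and key names are all assumptions. The real `_log_traffic_data_storage()` may use a different structure or logging library.

```python
import json
import logging
from datetime import datetime, timezone
from typing import Any, Dict

logger = logging.getLogger("traffic_audit")  # hypothetical logger name


def build_audit_entry(
    operation: str,        # "storage" or "retrieval"
    purpose: str,          # "training" or "re-training"
    latitude: float,
    longitude: float,
    start_date: datetime,
    end_date: datetime,
    record_count: int,
) -> Dict[str, Any]:
    """Assemble the audit fields into a JSON-serializable dict."""
    return {
        "operation": operation,
        "purpose": purpose,
        "location": {"latitude": latitude, "longitude": longitude},
        "date_range": {"start": start_date.isoformat(), "end": end_date.isoformat()},
        "record_count": record_count,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


def log_traffic_operation(**fields: Any) -> str:
    """Log one audit entry as a JSON line and return the serialized line."""
    line = json.dumps(build_audit_entry(**fields), sort_keys=True)
    logger.info(line)
    return line
```

Emitting each entry as a single JSON line keeps the audit trail easy to filter by location, purpose, or date range with standard log tooling.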

## Migration

To enable traffic data storage on existing deployments:

1. **Run Database Migration:**
   ```bash
   cd services/data
   alembic upgrade head
   ```

2. **Restart Data Service:**
   ```bash
   docker-compose restart data-service
   ```

3. **Verify Storage:**
   - Check logs for "Traffic data stored for re-training" messages
   - Query database: `SELECT COUNT(*) FROM traffic_data;`

## Configuration

No additional configuration is required. The system automatically:
- Detects when traffic data should be stored
- Handles duplicate prevention
- Manages database transactions
- Provides fallback mechanisms

## Troubleshooting

### Common Issues

**1. Storage Failures**
- Check database connectivity
- Verify that the `traffic_data` table exists
- Review validation errors in logs

**2. No Stored Data Available**
- Ensure initial training has been completed
- Check that the requested date range falls within the stored data period
- Verify that the location coordinates match the stored `location_id` values

**3. Performance Issues**
- Monitor database query performance
- Check index usage
- Consider data archival for old records

### Error Messages

- `"No stored traffic data found for re-training"`: Normal when no previous training has occurred
- `"Failed to store traffic data batch"`: Database connectivity or validation issue
- `"Invalid traffic data, skipping"`: Data validation failure; check the raw API response

## Future Enhancements

1. **Data Archival**: Automatic archival of old traffic data
2. **Data Compression**: Compress the `raw_data` field for storage efficiency
3. **Regional Expansion**: Support for traffic data from other cities
4. **Real-time Updates**: Continuous traffic data collection and storage
5. **Analytics**: Traffic pattern analysis and reporting