Traffic Data Storage for Re-Training
Overview
This document describes the enhanced traffic data storage system, which persists fetched traffic data in the database so that it can be reused for model re-training.
Architecture
Database Schema
The traffic_data table stores all traffic data with the following schema:
CREATE TABLE traffic_data (
    id UUID PRIMARY KEY,
    location_id VARCHAR(100) NOT NULL,  -- Format: "lat,lon" (e.g., "40.4168,-3.7038")
    date TIMESTAMP WITH TIME ZONE NOT NULL,
    traffic_volume INTEGER,
    pedestrian_count INTEGER,
    congestion_level VARCHAR(20),  -- "low", "medium", "high", "blocked"
    average_speed FLOAT,
    source VARCHAR(50) NOT NULL DEFAULT 'madrid_opendata',
    raw_data TEXT,  -- JSON string of original data
    created_at TIMESTAMP WITH TIME ZONE NOT NULL,
    updated_at TIMESTAMP WITH TIME ZONE NOT NULL
);

-- Indexes for efficient querying
CREATE INDEX idx_traffic_location_date ON traffic_data(location_id, date);
CREATE INDEX idx_traffic_date_range ON traffic_data(date);
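The "lat,lon" format used for location_id can be produced by a small helper. The following is a minimal sketch: the function name `make_location_id` and the four-decimal precision are assumptions inferred from the example value in the schema comment, not confirmed details of the service.

```python
def make_location_id(latitude: float, longitude: float, precision: int = 4) -> str:
    """Build a "lat,lon" key such as "40.4168,-3.7038" (hypothetical helper)."""
    if not -90.0 <= latitude <= 90.0:
        raise ValueError(f"latitude out of range: {latitude}")
    if not -180.0 <= longitude <= 180.0:
        raise ValueError(f"longitude out of range: {longitude}")
    # A fixed precision (assumed here) keeps keys stable, so the same
    # coordinates always map to the same location_id.
    return f"{latitude:.{precision}f},{longitude:.{precision}f}"
```

Using one consistent precision matters because retrieval matches on the exact location_id string.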
Key Components
1. Enhanced TrafficService (services/data/app/services/traffic_service.py)
New Methods:
- _store_traffic_data_batch(): Efficiently stores multiple traffic records with duplicate detection
- _validate_traffic_data(): Validates traffic data before storage
- get_stored_traffic_for_training(): Retrieves stored traffic data specifically for training
Enhanced Methods:
get_historical_traffic(): Now automatically stores fetched data for future re-training
2. Training Data Orchestrator (services/training/app/services/training_orchestrator.py)
New Methods:
- retrieve_stored_traffic_for_retraining(): Retrieves previously stored traffic data for re-training
- _log_traffic_data_storage(): Logs traffic data storage for audit purposes
Enhanced Methods:
_collect_traffic_data_with_timeout(): Now includes storage logging and validation
3. Data Service Client (shared/clients/data_client.py)
New Methods:
get_stored_traffic_data_for_training(): Dedicated method for retrieving stored training data
4. API Endpoints (services/data/app/api/traffic.py)
New Endpoint:
POST /tenants/{tenant_id}/traffic/stored: Retrieves stored traffic data for training purposes
Data Flow
Initial Training
- Training orchestrator requests traffic data
- Data service checks database first
- If not found, fetches from Madrid Open Data API
- Data is automatically stored in database
- Returns data to training orchestrator
- Training completes using fetched data
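The initial-training flow above is a read-through pattern: look in storage first, fall back to the external API, then persist what was fetched. A minimal sketch follows, assuming an in-memory dictionary in place of the traffic_data table and a caller-supplied `fetch_remote` coroutine in place of the Madrid Open Data client. The names `get_traffic`, `fetch_remote`, and `_store` are hypothetical.

```python
import asyncio
from datetime import datetime, timezone
from typing import Awaitable, Callable

Record = dict
# In-memory stand-in for the traffic_data table, keyed by (location_id, date).
_store: dict[tuple[str, datetime], Record] = {}


async def get_traffic(
    location_id: str,
    start: datetime,
    end: datetime,
    fetch_remote: Callable[[str, datetime, datetime], Awaitable[list[Record]]],
) -> list[Record]:
    """Return stored records if present; otherwise fetch, store, and return them."""
    stored = [
        rec for (loc, ts), rec in _store.items()
        if loc == location_id and start <= ts <= end
    ]
    if stored:
        return stored  # served from storage, no external call
    fetched = await fetch_remote(location_id, start, end)
    for rec in fetched:
        # Persist each fetched record so later re-training can reuse it.
        _store[(rec["location_id"], rec["date"])] = rec
    return fetched
```

In this sketch a second request for the same location and date range is answered from storage, which is the behaviour the re-training flow relies on.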
Re-Training
- Training orchestrator requests stored traffic data
- Data service queries database using location and date range
- Returns stored data without making API calls
- Training completes using stored data
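The re-training lookup by location and date range can be illustrated with a parameterized query. This sketch uses Python's built-in sqlite3 module with an in-memory database purely as a stand-in for the real database, and stores timestamps as ISO 8601 strings. The function name `query_stored_traffic` and the sample rows are invented for illustration.

```python
import sqlite3


def query_stored_traffic(conn, location_id, start_iso, end_iso):
    """Fetch stored rows for one location within [start_iso, end_iso]."""
    cursor = conn.execute(
        "SELECT date, traffic_volume FROM traffic_data "
        "WHERE location_id = ? AND date BETWEEN ? AND ? ORDER BY date",
        (location_id, start_iso, end_iso),
    )
    return cursor.fetchall()


# Stand-in table with only the columns this example needs.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE traffic_data ("
    "location_id TEXT NOT NULL, date TEXT NOT NULL, traffic_volume INTEGER)"
)
conn.execute(
    "CREATE INDEX idx_traffic_location_date ON traffic_data(location_id, date)"
)
conn.executemany(
    "INSERT INTO traffic_data VALUES (?, ?, ?)",
    [
        ("40.4168,-3.7038", "2024-01-15T08:00:00+00:00", 850),  # in range
        ("40.4168,-3.7038", "2024-03-01T08:00:00+00:00", 900),  # outside date range
        ("41.3874,2.1686", "2024-01-15T08:00:00+00:00", 700),   # other location
    ],
)
rows = query_stored_traffic(
    conn, "40.4168,-3.7038", "2024-01-01T00:00:00+00:00", "2024-01-31T23:59:59+00:00"
)
```

The query filters on location_id and date together, which is the column pair covered by idx_traffic_location_date.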
Storage Logic
Duplicate Prevention
- Before storing, the system checks for existing records with the same location and date
- Only new records are stored to avoid database bloat
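The duplicate check described above amounts to filtering on the (location_id, date) pair. A minimal sketch, assuming the existing keys have already been loaded from the database into a set; `filter_new_records` is a hypothetical name, not the service's actual method.

```python
from datetime import datetime, timezone


def filter_new_records(records, existing_keys):
    """Drop records whose (location_id, date) pair is already stored or repeated."""
    seen = set(existing_keys)
    new_records = []
    for record in records:
        key = (record["location_id"], record["date"])
        if key in seen:
            continue  # already stored, or a repeat within this batch
        seen.add(key)
        new_records.append(record)
    return new_records
```

Tracking keys seen within the incoming batch also guards against repeats inside a single API response.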
Batch Processing
- Traffic data is stored in batches of 100 records for efficiency
- Each batch is committed separately to handle large datasets
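Batching with a per-batch commit can be sketched as follows. The batch size of 100 comes from the text above; `iter_batches`, `store_in_batches`, and the `commit_batch` callback are hypothetical names standing in for the real storage call.

```python
from typing import Callable, Iterator, Sequence

BATCH_SIZE = 100  # batch size stated in this document


def iter_batches(records: Sequence[dict], batch_size: int = BATCH_SIZE) -> Iterator[Sequence[dict]]:
    """Yield consecutive slices of at most batch_size records."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


def store_in_batches(records: Sequence[dict], commit_batch: Callable[[Sequence[dict]], None]) -> int:
    """Call commit_batch once per batch and return the number of records handed over."""
    stored = 0
    for batch in iter_batches(records):
        commit_batch(batch)  # one commit per batch (assumed callback)
        stored += len(batch)
    return stored
```

Committing per batch keeps each transaction small, so a failure late in a large import does not discard earlier batches.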
Data Validation
- Traffic volume: 0-10,000 vehicles per hour
- Pedestrian count: 0-10,000 people per hour
- Average speed: 0-200 km/h
- Congestion level: "low", "medium", "high", "blocked"
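The validation ranges above translate directly into a predicate. This is a sketch under one assumption: because the measurement columns are nullable in the schema, a missing value is treated as acceptable here. `validate_traffic_record` is a hypothetical name and may differ from the service's `_validate_traffic_data()`.

```python
from typing import Any, Mapping, Optional

VALID_CONGESTION_LEVELS = {"low", "medium", "high", "blocked"}


def _within(value: Optional[float], low: float, high: float) -> bool:
    """Treat a missing value as acceptable; otherwise require low <= value <= high."""
    return value is None or low <= value <= high


def validate_traffic_record(record: Mapping[str, Any]) -> bool:
    """Return True when every populated field falls inside the documented range."""
    congestion = record.get("congestion_level")
    return (
        _within(record.get("traffic_volume"), 0, 10_000)       # vehicles per hour
        and _within(record.get("pedestrian_count"), 0, 10_000)  # people per hour
        and _within(record.get("average_speed"), 0, 200)        # km/h
        and (congestion is None or congestion in VALID_CONGESTION_LEVELS)
    )
```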
Benefits
1. Improved Re-Training Performance
- No need to re-fetch external API data
- Faster training iterations
- Reduced API rate limiting issues
2. Data Consistency
- Same traffic data used across multiple training runs
- Reproducible training results
- Historical data preservation
3. Cost Efficiency
- Reduced API calls to external services
- Lower bandwidth usage
- Better resource utilization
4. Offline Training
- Training can proceed even if external APIs are unavailable
- Increased system resilience
Usage Examples
Retrieving Stored Traffic Data
import asyncio
from datetime import datetime

from services.training.app.services.training_orchestrator import TrainingDataOrchestrator

async def main():
    orchestrator = TrainingDataOrchestrator()
    # Get stored traffic data for re-training
    return await orchestrator.retrieve_stored_traffic_for_retraining(
        bakery_location=(40.4168, -3.7038),  # Madrid coordinates (lat, lon)
        start_date=datetime(2024, 1, 1),
        end_date=datetime(2024, 12, 31),
        tenant_id="tenant-123",
    )

# The method is a coroutine, so it must be awaited inside an event loop
traffic_data = asyncio.run(main())
Checking Storage Status
# The system automatically logs storage operations
# Check logs for entries like:
# "Traffic data stored for re-training" - indicates successful storage
# "Retrieved X stored traffic records for training" - indicates successful retrieval
Monitoring
Storage Metrics
- Number of records stored per location
- Storage success rate
- Duplicate detection rate
Retrieval Metrics
- Query response time
- Records retrieved per request
- Re-training data availability
Audit Trail
All traffic data operations are logged with:
- Location coordinates
- Date ranges
- Record counts
- Storage/retrieval timestamps
- Purpose (training/re-training)
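An audit entry carrying the fields listed above might be assembled as in the sketch below. The dictionary layout, the logger name `traffic_audit`, the log message text, and both function names are assumptions for illustration; the service's actual log format is not shown in this document.

```python
import logging
from datetime import datetime, timezone

logger = logging.getLogger("traffic_audit")  # hypothetical logger name


def build_audit_entry(location, start_date, end_date, record_count, purpose, operation):
    """Collect the audit fields listed above into one dictionary."""
    if purpose not in {"training", "re-training"}:
        raise ValueError(f"unknown purpose: {purpose}")
    return {
        "operation": operation,  # e.g. "storage" or "retrieval"
        "location": {"latitude": location[0], "longitude": location[1]},
        "date_range": {"start": start_date.isoformat(), "end": end_date.isoformat()},
        "record_count": record_count,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "purpose": purpose,
    }


def log_traffic_operation(**fields):
    """Build the audit entry, write it to the log, and return it."""
    entry = build_audit_entry(**fields)
    logger.info("Traffic data %s: %s", entry["operation"], entry)
    return entry
```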
Migration
To enable traffic data storage on existing deployments:
- Run Database Migration:
cd services/data
alembic upgrade head
- Restart Data Service:
docker-compose restart data-service
- Verify Storage:
- Check logs for "Traffic data stored for re-training" messages
- Query database:
SELECT COUNT(*) FROM traffic_data;
Configuration
No additional configuration is required. The system automatically:
- Detects when traffic data should be stored
- Handles duplicate prevention
- Manages database transactions
- Provides fallback mechanisms
Troubleshooting
Common Issues
1. Storage Failures
- Check database connectivity
- Verify table schema exists
- Review validation errors in logs
2. No Stored Data Available
- Ensure initial training has been completed
- Check date ranges are within stored data period
- Verify location coordinates match stored data
3. Performance Issues
- Monitor database query performance
- Check index usage
- Consider data archival for old records
Error Messages
- "No stored traffic data found for re-training": Normal when no previous training has occurred
- "Failed to store traffic data batch": Database connectivity or validation issue
- "Invalid traffic data, skipping": Data validation failure; check the raw API response
Future Enhancements
- Data Archival: Automatic archival of old traffic data
- Data Compression: Compress raw_data field for storage efficiency
- Regional Expansion: Support for traffic data from other cities
- Real-time Updates: Continuous traffic data collection and storage
- Analytics: Traffic pattern analysis and reporting