Improve the traffic fetching system

2025-08-08 23:29:48 +02:00
parent 8af17f1433
commit 312fdc8ef3
8 changed files with 680 additions and 51 deletions
--- a/docs/TRAFFIC_DATA_STORAGE.md
+++ b/docs/TRAFFIC_DATA_STORAGE.md
@@ -0,0 +1,220 @@
+# Traffic Data Storage for Re-Training
+
+## Overview
+
+This document describes how fetched traffic data is persisted in the database so that model re-training can reuse it instead of fetching it again from the external API.
+
+## Architecture
+
+### Database Schema
+
+The `traffic_data` table stores all traffic data with the following schema:
+
+```sql
+CREATE TABLE traffic_data (
+    id UUID PRIMARY KEY,
+    location_id VARCHAR(100) NOT NULL,  -- Format: "lat,lon" (e.g., "40.4168,-3.7038")
+    date TIMESTAMP WITH TIME ZONE NOT NULL,
+    traffic_volume INTEGER,
+    pedestrian_count INTEGER,
+    congestion_level VARCHAR(20),  -- "low", "medium", "high", "blocked"
+    average_speed FLOAT,
+    source VARCHAR(50) NOT NULL DEFAULT 'madrid_opendata',
+    raw_data TEXT,  -- JSON string of original data
+    created_at TIMESTAMP WITH TIME ZONE NOT NULL,
+    updated_at TIMESTAMP WITH TIME ZONE NOT NULL
+);
+
+-- Indexes for efficient querying
+CREATE INDEX idx_traffic_location_date ON traffic_data(location_id, date);
+CREATE INDEX idx_traffic_date_range ON traffic_data(date);
+```
+
+### Key Components
+
+#### 1. Enhanced TrafficService (`services/data/app/services/traffic_service.py`)
+
+**New Methods:**
+- `_store_traffic_data_batch()`: Efficiently stores multiple traffic records with duplicate detection
+- `_validate_traffic_data()`: Validates traffic data before storage
+- `get_stored_traffic_for_training()`: Retrieves stored traffic data specifically for training
+
+**Enhanced Methods:**
+- `get_historical_traffic()`: Now automatically stores fetched data for future re-training
+
+#### 2. Training Data Orchestrator (`services/training/app/services/training_orchestrator.py`)
+
+**New Methods:**
+- `retrieve_stored_traffic_for_retraining()`: Retrieves previously stored traffic data for re-training
+- `_log_traffic_data_storage()`: Logs traffic data storage for audit purposes
+
+**Enhanced Methods:**
+- `_collect_traffic_data_with_timeout()`: Now includes storage logging and validation
+
+#### 3. Data Service Client (`shared/clients/data_client.py`)
+
+**New Methods:**
+- `get_stored_traffic_data_for_training()`: Dedicated method for retrieving stored training data
+
+#### 4. API Endpoints (`services/data/app/api/traffic.py`)
+
+**New Endpoint:**
+- `POST /tenants/{tenant_id}/traffic/stored`: Retrieves stored traffic data for training purposes
+
+## Data Flow
+
+### Initial Training
+1. Training orchestrator requests traffic data
+2. Data service checks database first
+3. If not found, fetches from Madrid Open Data API
+4. **Data is automatically stored in database**
+5. Returns data to training orchestrator
+6. Training completes using fetched data
+
+### Re-Training
+1. Training orchestrator requests stored traffic data
+2. Data service queries database using location and date range
+3. Returns stored data without making API calls
+4. Training completes using stored data
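+
+The two flows above behave like a read-through cache. The following is a minimal sketch of that idea, not the service's actual code: it assumes an in-memory dictionary in place of the `traffic_data` table, and `get_traffic`, `_store`, and the `fetch_from_api` callable are hypothetical names used only for illustration.
+
+```python
+from datetime import datetime
+from typing import Callable
+
+# Hypothetical stand-in for the traffic_data table, keyed by (location_id, date).
+_store: dict[tuple[str, datetime], dict] = {}
+
+
+def get_traffic(
+    location_id: str,
+    start: datetime,
+    end: datetime,
+    fetch_from_api: Callable[[str, datetime, datetime], list[dict]],
+) -> list[dict]:
+    """Return stored records if any exist; otherwise fetch, store, and return."""
+    stored = [
+        record
+        for (loc, date), record in _store.items()
+        if loc == location_id and start <= date <= end
+    ]
+    if stored:
+        # Re-training path: served from storage, no API call.
+        return sorted(stored, key=lambda record: record["date"])
+
+    # Initial-training path: fetch from the external source and persist.
+    fetched = fetch_from_api(location_id, start, end)
+    for record in fetched:
+        _store[(location_id, record["date"])] = record
+    return fetched
+```
+
+Calling `get_traffic` twice with the same arguments invokes `fetch_from_api` only on the first call.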
+
+## Storage Logic
+
+### Duplicate Prevention
+- Before storing, the system checks for existing records with the same location and date
+- Only new records are stored to avoid database bloat
+
+### Batch Processing
+- Traffic data is stored in batches of 100 records for efficiency
+- Each batch is committed separately to handle large datasets
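+
+Duplicate filtering and batching can be sketched as two small helpers. This is an illustration under assumptions, not the body of `_store_traffic_data_batch()`: the function names are hypothetical, records are plain dictionaries, and the set of existing keys is assumed to have been loaded from the database beforehand.
+
+```python
+from typing import Iterator
+
+BATCH_SIZE = 100  # batch size stated in this document
+
+
+def new_records_only(records: list[dict], existing_keys: set[tuple]) -> list[dict]:
+    """Drop records whose (location_id, date) pair is already stored or repeated."""
+    seen = set(existing_keys)
+    fresh = []
+    for record in records:
+        key = (record["location_id"], record["date"])
+        if key not in seen:
+            seen.add(key)
+            fresh.append(record)
+    return fresh
+
+
+def batches(records: list[dict], size: int = BATCH_SIZE) -> Iterator[list[dict]]:
+    """Yield consecutive slices of at most `size` records."""
+    for start in range(0, len(records), size):
+        yield records[start:start + size]
+```
+
+Each yielded batch would then be inserted and committed in its own transaction, which matches the per-batch commit described above.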
+
+### Data Validation
+- Traffic volume: 0-10,000 vehicles per hour
+- Pedestrian count: 0-10,000 people per hour
+- Average speed: 0-200 km/h
+- Congestion level: "low", "medium", "high", "blocked"
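+
+The ranges above translate into a simple predicate. The sketch below is hypothetical and is not the implementation of `_validate_traffic_data()`; in particular, it assumes that a missing value is acceptable because the corresponding columns are nullable in the schema.
+
+```python
+from typing import Optional
+
+VALID_CONGESTION_LEVELS = {"low", "medium", "high", "blocked"}
+
+
+def _in_range(value: Optional[float], low: float, high: float) -> bool:
+    """Assumption: a missing value passes, since these columns are nullable."""
+    return value is None or low <= value <= high
+
+
+def is_valid_traffic_record(record: dict) -> bool:
+    """Check one record against the ranges listed in this section."""
+    level = record.get("congestion_level")
+    return (
+        _in_range(record.get("traffic_volume"), 0, 10_000)
+        and _in_range(record.get("pedestrian_count"), 0, 10_000)
+        and _in_range(record.get("average_speed"), 0, 200)
+        and (level is None or level in VALID_CONGESTION_LEVELS)
+    )
+```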
+
+## Benefits
+
+### 1. Improved Re-Training Performance
+- No need to re-fetch external API data
+- Faster training iterations
+- Fewer API rate-limiting issues
+
+### 2. Data Consistency
+- Same traffic data used across multiple training runs
+- Reproducible training results
+- Historical data preservation
+
+### 3. Cost Efficiency
+- Reduced API calls to external services
+- Lower bandwidth usage
+- Better resource utilization
+
+### 4. Offline Training
+- Training can proceed even if external APIs are unavailable
+- Increased system resilience
+
+## Usage Examples
+
+### Retrieving Stored Traffic Data
+```python
+import asyncio
+from datetime import datetime
+
+from services.training.app.services.training_orchestrator import TrainingDataOrchestrator
+
+
+async def main():
+    orchestrator = TrainingDataOrchestrator()
+
+    # Get stored traffic data for re-training (no external API call is made)
+    traffic_data = await orchestrator.retrieve_stored_traffic_for_retraining(
+        bakery_location=(40.4168, -3.7038),  # Madrid coordinates (lat, lon)
+        start_date=datetime(2024, 1, 1),
+        end_date=datetime(2024, 12, 31),
+        tenant_id="tenant-123",
+    )
+    return traffic_data
+
+
+asyncio.run(main())
+```
+
+### Checking Storage Status
+```python
+# The system automatically logs storage operations
+# Check logs for entries like:
+# "Traffic data stored for re-training" - indicates successful storage
+# "Retrieved X stored traffic records for training" - indicates successful retrieval
+```
+
+## Monitoring
+
+### Storage Metrics
+- Number of records stored per location
+- Storage success rate
+- Duplicate detection rate
+
+### Retrieval Metrics
+- Query response time
+- Records retrieved per request
+- Re-training data availability
+
+### Audit Trail
+All traffic data operations are logged with:
+- Location coordinates
+- Date ranges
+- Record counts
+- Storage/retrieval timestamps
+- Purpose (training/re-training)
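+
+One way to picture such an audit record is as a small dictionary attached to a log line. This is a sketch under assumptions and does not reproduce `_log_traffic_data_storage()`: the logger name, field names, and log message are hypothetical.
+
+```python
+import logging
+from datetime import datetime, timezone
+
+logger = logging.getLogger("traffic_audit")  # hypothetical logger name
+
+
+def audit_entry(location, start_date, end_date, record_count, purpose):
+    """Build the audit fields listed above and emit them with one log line."""
+    entry = {
+        "location": f"{location[0]},{location[1]}",  # same "lat,lon" form as location_id
+        "date_range": f"{start_date.isoformat()}/{end_date.isoformat()}",
+        "record_count": record_count,
+        "logged_at": datetime.now(timezone.utc).isoformat(),
+        "purpose": purpose,  # "training" or "re-training"
+    }
+    # Hypothetical message text; the real service may log something different.
+    logger.info("traffic data operation", extra={"audit": entry})
+    return entry
+```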
+
+## Migration
+
+To enable traffic data storage on existing deployments:
+
+1. **Run Database Migration:**
+   ```bash
+   cd services/data
+   alembic upgrade head
+   ```
+
+2. **Restart Data Service:**
+   ```bash
+   docker-compose restart data-service
+   ```
+
+3. **Verify Storage:**
+   - Check logs for "Traffic data stored for re-training" messages
+   - Query database: `SELECT COUNT(*) FROM traffic_data;`
+
+## Configuration
+
+No additional configuration is required. The system automatically:
+- Detects when traffic data should be stored
+- Handles duplicate prevention
+- Manages database transactions
+- Provides fallback mechanisms
+
+## Troubleshooting
+
+### Common Issues
+
+**1. Storage Failures**
+- Check database connectivity
+- Verify that the `traffic_data` table exists
+- Review validation errors in the logs
+
+**2. No Stored Data Available**
+- Ensure that initial training has completed at least once
+- Check that the requested date range falls within the stored data period
+- Verify that the location coordinates match the stored `location_id` (format: "lat,lon")
+
+**3. Performance Issues**
+- Monitor database query performance
+- Check index usage
+- Consider data archival for old records
+
+### Error Messages
+
+- `"No stored traffic data found for re-training"`: Normal when no previous training has occurred
+- `"Failed to store traffic data batch"`: Database connectivity or validation issue
+- `"Invalid traffic data, skipping"`: Data validation failure - check raw API response
+
+## Future Enhancements
+
+1. **Data Archival**: Automatic archival of old traffic data
+2. **Data Compression**: Compress raw_data field for storage efficiency
+3. **Regional Expansion**: Support for traffic data from other cities
+4. **Real-time Updates**: Continuous traffic data collection and storage
+5. **Analytics**: Traffic pattern analysis and reporting