Improve the traffic fetching system

Urtzi Alfaro
2025-08-08 23:29:48 +02:00
parent 8af17f1433
commit 312fdc8ef3
8 changed files with 680 additions and 51 deletions


@@ -0,0 +1,220 @@
# Traffic Data Storage for Re-Training
## Overview
This document describes how traffic data fetched from the external API is persisted in the database, so that model re-training can reuse the stored records instead of fetching them again.
## Architecture
### Database Schema
The `traffic_data` table stores all traffic data with the following schema:
```sql
CREATE TABLE traffic_data (
id UUID PRIMARY KEY,
location_id VARCHAR(100) NOT NULL, -- Format: "lat,lon" (e.g., "40.4168,-3.7038")
date TIMESTAMP WITH TIME ZONE NOT NULL,
traffic_volume INTEGER,
pedestrian_count INTEGER,
congestion_level VARCHAR(20), -- "low", "medium", "high", "blocked"
average_speed FLOAT,
source VARCHAR(50) NOT NULL DEFAULT 'madrid_opendata',
raw_data TEXT, -- Serialized copy of the original record (written with str(), so not guaranteed to be valid JSON)
created_at TIMESTAMP WITH TIME ZONE NOT NULL,
updated_at TIMESTAMP WITH TIME ZONE NOT NULL
);
-- Indexes for efficient querying
CREATE INDEX idx_traffic_location_date ON traffic_data(location_id, date);
CREATE INDEX idx_traffic_date_range ON traffic_data(date);
```
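The `location_id` key can be reproduced with a small helper. This is a minimal sketch: `make_location_id` is a hypothetical name, and the four-decimal formatting is taken from how the service builds the key (`f"{latitude:.4f},{longitude:.4f}"`).

```python
def make_location_id(latitude: float, longitude: float) -> str:
    """Build the "lat,lon" key with four decimal places per coordinate."""
    return f"{latitude:.4f},{longitude:.4f}"


# Coordinates that differ only beyond the fourth decimal map to the same key.
print(make_location_id(40.4168, -3.7038))
print(make_location_id(40.41684, -3.70379))
```

Because the key is rounded, lookups only match stored data when the requested coordinates round to the same four-decimal pair.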
### Key Components
#### 1. Enhanced TrafficService (`services/data/app/services/traffic_service.py`)
**New Methods:**
- `_store_traffic_data_batch()`: Efficiently stores multiple traffic records with duplicate detection
- `_validate_traffic_data()`: Validates traffic data before storage
- `get_stored_traffic_for_training()`: Retrieves stored traffic data specifically for training
**Enhanced Methods:**
- `get_historical_traffic()`: Now automatically stores fetched data for future re-training
#### 2. Training Data Orchestrator (`services/training/app/services/training_orchestrator.py`)
**New Methods:**
- `retrieve_stored_traffic_for_retraining()`: Retrieves previously stored traffic data for re-training
- `_log_traffic_data_storage()`: Logs traffic data storage for audit purposes
**Enhanced Methods:**
- `_collect_traffic_data_with_timeout()`: Now includes storage logging and validation
#### 3. Data Service Client (`shared/clients/data_client.py`)
**New Methods:**
- `get_stored_traffic_data_for_training()`: Dedicated method for retrieving stored training data
#### 4. API Endpoints (`services/data/app/api/traffic.py`)
**New Endpoint:**
- `POST /tenants/{tenant_id}/traffic/stored`: Retrieves stored traffic data for training purposes
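The endpoint rejects requests whose end date is not after the start date, or whose range exceeds three years (1095 days). The sketch below mirrors those two checks outside the web framework; `validate_training_range` is a hypothetical helper, and the payload field names are assumed to follow the data client's request body.

```python
from datetime import datetime

MAX_TRAINING_RANGE_DAYS = 1095  # three years, the limit applied by the endpoint


def validate_training_range(start_date: datetime, end_date: datetime) -> None:
    """Raise ValueError for ranges the endpoint would reject with HTTP 400."""
    if end_date <= start_date:
        raise ValueError("End date must be after start date")
    if (end_date - start_date).days > MAX_TRAINING_RANGE_DAYS:
        raise ValueError("Date range cannot exceed 3 years for training data")


# Example request body (illustrative values).
payload = {
    "latitude": 40.4168,
    "longitude": -3.7038,
    "start_date": datetime(2024, 1, 1).isoformat(),
    "end_date": datetime(2024, 12, 31).isoformat(),
}
validate_training_range(datetime(2024, 1, 1), datetime(2024, 12, 31))
```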
## Data Flow
### Initial Training
1. Training orchestrator requests traffic data
2. Data service checks database first
3. If not found, fetches from Madrid Open Data API
4. **Data is automatically stored in database**
5. Returns data to training orchestrator
6. Training completes using fetched data
### Re-Training
1. Training orchestrator requests stored traffic data
2. Data service queries database using location and date range
3. Returns stored data without making API calls
4. Training completes using stored data
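The two flows above amount to a read-through store: the first request fetches and saves, later requests read what was saved. The sketch below illustrates that pattern with an in-memory dictionary standing in for the database and a fake fetcher standing in for the Madrid Open Data API; all names here are hypothetical and none of this is the service's actual code.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]
Store = Dict[Tuple[str, str], Record]  # keyed by (location_id, ISO date)


def get_traffic(store: Store, location_id: str, dates: List[str],
                fetch_remote: Callable[[str, List[str]], List[Record]]) -> List[Record]:
    """Return stored records if any exist; otherwise fetch, store, and return."""
    cached = [store[(location_id, d)] for d in dates if (location_id, d) in store]
    if cached:
        return cached  # re-training path: no external call is made
    fetched = fetch_remote(location_id, dates)
    for record in fetched:
        store[(location_id, str(record["date"]))] = record
    return fetched


calls: List[str] = []


def fake_api(location_id: str, dates: List[str]) -> List[Record]:
    """Stand-in for the external API; records each call for inspection."""
    calls.append(location_id)
    return [{"date": d, "traffic_volume": 100} for d in dates]


store: Store = {}
first = get_traffic(store, "40.4168,-3.7038", ["2024-01-01"], fake_api)
second = get_traffic(store, "40.4168,-3.7038", ["2024-01-01"], fake_api)
print(len(calls))  # the fake API is only called for the first request
```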
## Storage Logic
### Duplicate Prevention
- Before storing, the system checks for existing records with the same location and date
- Only new records are stored to avoid database bloat
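A minimal sketch of the duplicate check, assuming the set of already-stored dates for the location has been loaded first. `filter_new_records` is a hypothetical helper; dropping repeats within the same incoming batch is an extra assumption that goes beyond the behaviour described above.

```python
from typing import Any, Dict, Iterable, List, Set


def filter_new_records(records: Iterable[Dict[str, Any]],
                       existing_dates: Set[Any]) -> List[Dict[str, Any]]:
    """Keep records whose date is present and not already stored or seen."""
    seen = set(existing_dates)  # copy so the caller's set is not mutated
    fresh: List[Dict[str, Any]] = []
    for record in records:
        date = record.get("date")
        if not date or date in seen:
            continue  # missing date or duplicate: skip
        seen.add(date)  # assumption: also drop repeats inside this batch
        fresh.append(record)
    return fresh
```

The comparison only works when incoming and stored dates share a type, so string dates should be parsed before they are compared with `datetime` values read from the database.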
### Batch Processing
- New records are committed after every 100 stored records, followed by a final commit for the remainder
- Because each batch is committed separately, a failure late in a large run rolls back only the uncommitted records
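The batching idea can be sketched independently of the database layer. Here `batched` is a hypothetical helper that splits a list into chunks of 100; in the service, the commit would happen where each chunk is consumed.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")
BATCH_SIZE = 100  # matches the batch size described above


def batched(items: Sequence[T], size: int = BATCH_SIZE) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


sizes = [len(batch) for batch in batched(list(range(250)))]
print(sizes)  # [100, 100, 50]
```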
### Data Validation
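- Date: required; records without a `date` are skipped
- Traffic volume: 0-10,000 vehicles per hour (checked only when present)
- Pedestrian count: 0-10,000 people per hour (checked only when present)
- Average speed: 0-200 km/h (checked only when present)
- Congestion level: one of "low", "medium", "high", "blocked" (checked only when present)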
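The validation rules can be expressed as one standalone function. This is a sketch rather than the service's `_validate_traffic_data()` method: `is_valid_traffic_record` is a hypothetical name, and, as an assumption based on the ranges listed above, optional fields are only checked when they are present.

```python
from typing import Any, Dict

ALLOWED_CONGESTION = {"low", "medium", "high", "blocked"}
UPPER_LIMITS = {
    "traffic_volume": 10_000,    # vehicles per hour
    "pedestrian_count": 10_000,  # people per hour
    "average_speed": 200,        # km/h
}


def is_valid_traffic_record(data: Dict[str, Any]) -> bool:
    """Return True when the record has a date and all present fields are in range."""
    if not data.get("date"):
        return False
    for field, upper in UPPER_LIMITS.items():
        value = data.get(field)
        if value is not None and not (0 <= value <= upper):
            return False
    level = data.get("congestion_level")
    if level and level not in ALLOWED_CONGESTION:
        return False
    return True
```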
## Benefits
### 1. Improved Re-Training Performance
- No need to re-fetch external API data
- Faster training iterations
- Reduced API rate limiting issues
### 2. Data Consistency
- Same traffic data used across multiple training runs
- Reproducible training results
- Historical data preservation
### 3. Cost Efficiency
- Reduced API calls to external services
- Lower bandwidth usage
- Better resource utilization
### 4. Offline Training
- Training can proceed even if external APIs are unavailable
- Increased system resilience
## Usage Examples
### Retrieving Stored Traffic Data
```python
import asyncio
from datetime import datetime
from services.training.app.services.training_orchestrator import TrainingDataOrchestrator

async def main():
    orchestrator = TrainingDataOrchestrator()
    # Get stored traffic data for re-training (no external API call is made)
    traffic_data = await orchestrator.retrieve_stored_traffic_for_retraining(
        bakery_location=(40.4168, -3.7038),  # Madrid coordinates
        start_date=datetime(2024, 1, 1),
        end_date=datetime(2024, 12, 31),
        tenant_id="tenant-123",
    )
    return traffic_data

traffic_data = asyncio.run(main())
```
### Checking Storage Status
```python
# The system automatically logs storage operations
# Check logs for entries like:
# "Traffic data stored for re-training" - indicates successful storage
# "Retrieved X stored traffic records for training" - indicates successful retrieval
```
## Monitoring
### Storage Metrics
- Number of records stored per location
- Storage success rate
- Duplicate detection rate
### Retrieval Metrics
- Query response time
- Records retrieved per request
- Re-training data availability
### Audit Trail
All traffic data operations are logged with:
- Location coordinates
- Date ranges
- Record counts
- Storage/retrieval timestamps
- Purpose (training/re-training)
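As an illustration of what one audit entry might carry, the sketch below builds a dictionary with the fields listed above and serializes it as JSON. `build_audit_entry` and the exact key names are hypothetical; the service itself emits these values through its structured logger.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict


def build_audit_entry(lat: float, lon: float, start: datetime, end: datetime,
                      record_count: int, purpose: str) -> Dict[str, Any]:
    """Collect the audit fields for one storage or retrieval operation."""
    return {
        "location": f"{lat:.4f},{lon:.4f}",
        "date_range": f"{start.isoformat()} to {end.isoformat()}",
        "records": record_count,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "purpose": purpose,
    }


entry = build_audit_entry(40.4168, -3.7038, datetime(2024, 1, 1),
                          datetime(2024, 12, 31), 365, "re-training")
print(json.dumps(entry))
```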
## Migration
To enable traffic data storage on existing deployments:
1. **Run Database Migration:**
```bash
cd services/data
alembic upgrade head
```
2. **Restart Data Service:**
```bash
docker-compose restart data-service
```
3. **Verify Storage:**
- Check logs for "Traffic data stored for re-training" messages
- Query database: `SELECT COUNT(*) FROM traffic_data;`
## Configuration
No additional configuration is required. The system automatically:
- Detects when traffic data should be stored
- Handles duplicate prevention
- Manages database transactions
- Provides fallback mechanisms
## Troubleshooting
### Common Issues
**1. Storage Failures**
- Check database connectivity
- Verify table schema exists
- Review validation errors in logs
**2. No Stored Data Available**
- Ensure initial training has been completed
- Check date ranges are within stored data period
- Verify location coordinates match stored data
**3. Performance Issues**
- Monitor database query performance
- Check index usage
- Consider data archival for old records
### Error Messages
- `"No stored traffic data found for re-training"`: Normal when no previous training has occurred
- `"Failed to store traffic data batch"`: Database connectivity or validation issue
- `"Invalid traffic data, skipping"`: Data validation failure - check raw API response
## Future Enhancements
1. **Data Archival**: Automatic archival of old traffic data
2. **Data Compression**: Compress raw_data field for storage efficiency
3. **Regional Expansion**: Support for traffic data from other cities
4. **Real-time Updates**: Continuous traffic data collection and storage
5. **Analytics**: Traffic pattern analysis and reporting


@@ -149,8 +149,8 @@ export default function EnhancedTrainingProgress({ progress, onTimeout }: Traini
</p>
</div>
<div className="bg-white rounded-md shadow-md p-8 mb-8">
<div className="bg-red-50 border border-red-200 rounded-md p-6">
<div className="bg-white rounded-2xl shadow-soft p-8 mb-8">
<div className="bg-red-50 border border-red-200 rounded-xl p-6">
<div className="flex items-start space-x-4">
<AlertCircle className="w-6 h-6 text-red-600 flex-shrink-0 mt-1" />
<div>
@@ -172,7 +172,7 @@ export default function EnhancedTrainingProgress({ progress, onTimeout }: Traini
<div className="mt-6 text-center">
<button
onClick={() => window.location.reload()}
className="bg-blue-600 text-white px-6 py-3 rounded-lg font-medium hover:bg-blue-700 transition-colors"
className="bg-primary-500 text-white px-6 py-3 rounded-xl font-medium hover:bg-primary-600 transition-colors"
>
Intentar Nuevamente
</button>
@@ -186,7 +186,7 @@ export default function EnhancedTrainingProgress({ progress, onTimeout }: Traini
<div className="max-w-4xl mx-auto">
{/* Header */}
<div className="text-center mb-8">
<div className="inline-flex items-center justify-center w-20 h-20 bg-blue-600 rounded-full mb-4">
<div className="inline-flex items-center justify-center w-20 h-20 bg-primary-500 rounded-full mb-4">
<Brain className="w-10 h-10 text-white animate-pulse" />
</div>
<h2 className="text-3xl font-bold text-gray-900 mb-2">
@@ -198,7 +198,7 @@ export default function EnhancedTrainingProgress({ progress, onTimeout }: Traini
</div>
{/* Main Progress Section */}
<div className="bg-white rounded-md shadow-md p-8 mb-8">
<div className="bg-white rounded-2xl shadow-soft p-8 mb-8">
{/* Overall Progress Bar */}
<div className="mb-8">
<div className="flex justify-between items-center mb-3">
@@ -207,7 +207,7 @@ export default function EnhancedTrainingProgress({ progress, onTimeout }: Traini
</div>
<div className="w-full bg-gray-200 rounded-full h-4 overflow-hidden">
<div
className="bg-gradient-to-r from-blue-500 to-indigo-600 h-4 rounded-full transition-all duration-1000 ease-out relative"
className="bg-primary-500 h-4 rounded-full transition-all duration-1000 ease-out relative"
style={{ width: `${progress.progress}%` }}
>
<div className="absolute inset-0 opacity-20 animate-pulse">
@@ -218,10 +218,10 @@ export default function EnhancedTrainingProgress({ progress, onTimeout }: Traini
</div>
{/* Current Step Info */}
<div className={`bg-${currentStepInfo.color}-50 border border-${currentStepInfo.color}-200 rounded-md p-6 mb-6`}>
<div className={`bg-${currentStepInfo.color}-50 border border-${currentStepInfo.color}-200 rounded-xl p-6 mb-6`}>
<div className="flex items-start space-x-4">
<div className="flex-shrink-0">
<div className={`w-12 h-12 bg-${currentStepInfo.color}-600 rounded-full flex items-center justify-center`}>
<div className={`w-12 h-12 bg-primary-500 rounded-full flex items-center justify-center`}>
<currentStepInfo.icon className="w-6 h-6 text-white" />
</div>
</div>
@@ -232,8 +232,8 @@ export default function EnhancedTrainingProgress({ progress, onTimeout }: Traini
<p className="text-gray-700 mb-3">
{currentStepInfo.description}
</p>
<div className={`bg-${currentStepInfo.color}-100 border-l-4 border-${currentStepInfo.color}-500 p-3 rounded-r-lg`}>
<p className={`text-sm font-medium text-${currentStepInfo.color}-800`}>
<div className={`bg-primary-50 border-l-4 border-primary-500 p-3 rounded-r-xl`}>
<p className={`text-sm font-medium text-primary-700`}>
{currentStepInfo.tip}
</p>
</div>
@@ -246,11 +246,11 @@ export default function EnhancedTrainingProgress({ progress, onTimeout }: Traini
{progressSteps.map((step, index) => (
<div
key={step.id}
className={`p-4 rounded-md border-2 transition-all duration-300 ${
className={`p-4 rounded-xl border-2 transition-all duration-300 ${
step.completed
? 'bg-green-50 border-green-200'
: step.current
? 'bg-blue-50 border-blue-300 shadow-md'
? 'bg-primary-50 border-primary-300 shadow-soft'
: 'bg-gray-50 border-gray-200'
}`}
>
@@ -258,12 +258,12 @@ export default function EnhancedTrainingProgress({ progress, onTimeout }: Traini
{step.completed ? (
<CheckCircle className="w-5 h-5 text-green-600 mr-2" />
) : step.current ? (
<div className="w-5 h-5 border-2 border-blue-600 border-t-transparent rounded-full animate-spin mr-2"></div>
<div className="w-5 h-5 border-2 border-primary-500 border-t-transparent rounded-full animate-spin mr-2"></div>
) : (
<div className="w-5 h-5 border-2 border-gray-300 rounded-full mr-2"></div>
)}
<span className={`text-sm font-medium ${
step.completed ? 'text-green-800' : step.current ? 'text-blue-800' : 'text-gray-600'
step.completed ? 'text-green-800' : step.current ? 'text-primary-700' : 'text-gray-600'
}`}>
{step.name}
</span>
@@ -274,7 +274,7 @@ export default function EnhancedTrainingProgress({ progress, onTimeout }: Traini
{/* Enhanced Stats Grid */}
<div className="grid grid-cols-1 md:grid-cols-3 gap-6">
<div className="text-center p-4 bg-gray-50 rounded-md">
<div className="text-center p-4 bg-gray-50 rounded-xl">
<div className="flex items-center justify-center mb-2">
<Cpu className="w-5 h-5 text-gray-600 mr-2" />
<span className="text-sm font-medium text-gray-700">Productos Procesados</span>
@@ -285,14 +285,14 @@ export default function EnhancedTrainingProgress({ progress, onTimeout }: Traini
{progress.productsTotal > 0 && (
<div className="w-full bg-gray-200 rounded-full h-2 mt-2">
<div
className="bg-blue-500 h-2 rounded-full transition-all duration-500"
className="bg-primary-500 h-2 rounded-full transition-all duration-500"
style={{ width: `${(progress.productsCompleted / progress.productsTotal) * 100}%` }}
></div>
</div>
)}
</div>
<div className="text-center p-4 bg-gray-50 rounded-md">
<div className="text-center p-4 bg-gray-50 rounded-xl">
<div className="flex items-center justify-center mb-2">
<Clock className="w-5 h-5 text-gray-600 mr-2" />
<span className="text-sm font-medium text-gray-700">Tiempo Restante</span>
@@ -305,7 +305,7 @@ export default function EnhancedTrainingProgress({ progress, onTimeout }: Traini
</div>
</div>
<div className="text-center p-4 bg-gray-50 rounded-md">
<div className="text-center p-4 bg-gray-50 rounded-xl">
<div className="flex items-center justify-center mb-2">
<Target className="w-5 h-5 text-gray-600 mr-2" />
<span className="text-sm font-medium text-gray-700">Precisión Esperada</span>
@@ -329,14 +329,14 @@ export default function EnhancedTrainingProgress({ progress, onTimeout }: Traini
{/* Expected Benefits - Only show if progress < 80% to keep user engaged */}
{progress.progress < 80 && (
<div className="bg-white rounded-md shadow-md p-8">
<div className="bg-white rounded-2xl shadow-soft p-8">
<h3 className="text-2xl font-bold text-gray-900 mb-6 text-center">
Lo que podrás hacer una vez completado
</h3>
<div className="grid grid-cols-1 md:grid-cols-3 gap-6">
{EXPECTED_BENEFITS.map((benefit, index) => (
<div key={index} className="text-center p-6 bg-gradient-to-br from-indigo-50 to-purple-50 rounded-md">
<div className="inline-flex items-center justify-center w-12 h-12 bg-indigo-600 rounded-full mb-4">
<div key={index} className="text-center p-6 bg-gradient-to-br from-primary-50 to-blue-50 rounded-xl">
<div className="inline-flex items-center justify-center w-12 h-12 bg-primary-500 rounded-full mb-4">
<benefit.icon className="w-6 h-6 text-white" />
</div>
<h4 className="text-lg font-semibold text-gray-900 mb-2">
@@ -354,7 +354,7 @@ export default function EnhancedTrainingProgress({ progress, onTimeout }: Traini
{/* Timeout Warning Modal */}
{showTimeoutWarning && (
<div className="fixed inset-0 bg-black bg-opacity-50 flex items-center justify-center z-50">
<div className="bg-white rounded-md shadow-md p-8 max-w-md mx-4">
<div className="bg-white rounded-2xl shadow-soft p-8 max-w-md mx-4">
<div className="text-center">
<AlertCircle className="w-16 h-16 text-orange-500 mx-auto mb-4" />
<h3 className="text-xl font-bold text-gray-900 mb-4">
@@ -367,13 +367,13 @@ export default function EnhancedTrainingProgress({ progress, onTimeout }: Traini
<div className="flex flex-col sm:flex-row gap-3">
<button
onClick={handleContinueToDashboard}
className="flex-1 bg-blue-600 text-white px-6 py-3 rounded-lg font-medium hover:bg-blue-700 transition-colors"
className="flex-1 bg-primary-500 text-white px-6 py-3 rounded-xl font-medium hover:bg-primary-600 transition-colors"
>
Continuar al Dashboard
</button>
<button
onClick={handleKeepWaiting}
className="flex-1 bg-gray-200 text-gray-800 px-6 py-3 rounded-lg font-medium hover:bg-gray-300 transition-colors"
className="flex-1 bg-gray-200 text-gray-800 px-6 py-3 rounded-xl font-medium hover:bg-gray-300 transition-colors"
>
Seguir Esperando
</button>


@@ -110,4 +110,60 @@ async def get_historical_traffic(
raise
except Exception as e:
logger.error("Unexpected error in historical traffic API", error=str(e))
raise HTTPException(status_code=500, detail=f"Internal server error: {str(e)}")
@router.post("/tenants/{tenant_id}/traffic/stored")
async def get_stored_traffic_for_training(
request: HistoricalTrafficRequest,
db: AsyncSession = Depends(get_db),
tenant_id: UUID = Path(..., description="Tenant ID"),
current_user: Dict[str, Any] = Depends(get_current_user_dep),
):
"""Get stored traffic data specifically for training/re-training purposes"""
try:
# Validate date range
if request.end_date <= request.start_date:
raise HTTPException(status_code=400, detail="End date must be after start date")
# Allow longer date ranges for training (up to 3 years)
if (request.end_date - request.start_date).days > 1095:
raise HTTPException(status_code=400, detail="Date range cannot exceed 3 years for training data")
logger.info("Retrieving stored traffic data for training",
tenant_id=str(tenant_id),
location=f"{request.latitude},{request.longitude}",
date_range=f"{request.start_date} to {request.end_date}")
# Use the dedicated method for training data retrieval
stored_data = await traffic_service.get_stored_traffic_for_training(
request.latitude, request.longitude, request.start_date, request.end_date, db
)
# Log retrieval for audit purposes
logger.info("Stored traffic data retrieved for training",
records_count=len(stored_data),
tenant_id=str(tenant_id),
purpose="model_training")
# Publish event for monitoring
try:
await publish_traffic_updated({
"type": "stored_data_retrieved_for_training",
"latitude": request.latitude,
"longitude": request.longitude,
"start_date": request.start_date.isoformat(),
"end_date": request.end_date.isoformat(),
"records_count": len(stored_data),
"tenant_id": str(tenant_id),
"timestamp": datetime.utcnow().isoformat()
})
except Exception as pub_error:
logger.warning("Failed to publish stored traffic retrieval event", error=str(pub_error))
return stored_data
except HTTPException:
raise
except Exception as e:
logger.error("Unexpected error in stored traffic retrieval API", error=str(e))
raise HTTPException(status_code=500, detail=f"Internal server error: {str(e)}")


@@ -63,7 +63,7 @@ class TrafficService:
start_date: datetime,
end_date: datetime,
db: AsyncSession) -> List[TrafficDataResponse]:
"""Get historical traffic data"""
"""Get historical traffic data with enhanced storage for re-training"""
try:
logger.debug("Getting historical traffic",
lat=latitude, lon=longitude,
@@ -100,27 +100,12 @@ class TrafficService:
)
if traffic_data:
# Store in database for future use
try:
for data in traffic_data:
traffic_record = TrafficData(
location_id=location_id,
date=data.get('date', datetime.now()),
traffic_volume=data.get('traffic_volume'),
pedestrian_count=data.get('pedestrian_count'),
congestion_level=data.get('congestion_level'),
average_speed=data.get('average_speed'),
source="madrid_opendata",
raw_data=str(data),
created_at=datetime.now()
)
db.add(traffic_record)
await db.commit()
logger.debug("Historical data stored in database", count=len(traffic_data))
except Exception as db_error:
logger.warning("Failed to store historical data in database", error=str(db_error))
await db.rollback()
# Enhanced storage with better error handling and validation
stored_count = await self._store_traffic_data_batch(
traffic_data, location_id, db
)
logger.info("Traffic data stored for re-training",
fetched=len(traffic_data), stored=stored_count, location=location_id)
return [TrafficDataResponse(**item) for item in traffic_data]
@@ -137,7 +122,7 @@ class TrafficService:
longitude: float,
traffic_data: Dict[str, Any],
db: AsyncSession) -> bool:
"""Store traffic data to database"""
"""Store single traffic data record to database"""
try:
location_id = f"{latitude:.4f},{longitude:.4f}"
@@ -161,4 +146,152 @@ class TrafficService:
except Exception as e:
logger.error("Failed to store traffic data", error=str(e))
await db.rollback()
return False
return False
async def _store_traffic_data_batch(self,
traffic_data: List[Dict[str, Any]],
location_id: str,
db: AsyncSession) -> int:
"""Store batch of traffic data with enhanced validation and duplicate handling"""
stored_count = 0
try:
# Check for existing records to avoid duplicates
if traffic_data:
dates = [data.get('date') for data in traffic_data if data.get('date')]
if dates:
# Query existing records for this location and date range
existing_stmt = select(TrafficData.date).where(
and_(
TrafficData.location_id == location_id,
TrafficData.date.in_(dates)
)
)
result = await db.execute(existing_stmt)
existing_dates = {row[0] for row in result.fetchall()}
logger.debug(f"Found {len(existing_dates)} existing records for location {location_id}")
else:
existing_dates = set()
else:
existing_dates = set()
# Store only new records
for data in traffic_data:
try:
record_date = data.get('date')
if not record_date or record_date in existing_dates:
continue # Skip duplicates
# Validate required fields
if not self._validate_traffic_data(data):
logger.warning("Invalid traffic data, skipping", data=data)
continue
traffic_record = TrafficData(
location_id=location_id,
date=record_date,
traffic_volume=data.get('traffic_volume'),
pedestrian_count=data.get('pedestrian_count'),
congestion_level=data.get('congestion_level'),
average_speed=data.get('average_speed'),
source=data.get('source', 'madrid_opendata'),
raw_data=str(data)
)
db.add(traffic_record)
stored_count += 1
# Commit in batches to avoid memory issues
if stored_count % 100 == 0:
await db.commit()
logger.debug(f"Committed batch of {stored_count} records")
except Exception as record_error:
logger.warning("Failed to store individual traffic record",
error=str(record_error), data=data)
continue
# Final commit
await db.commit()
logger.info(f"Successfully stored {stored_count} traffic records for location {location_id}")
except Exception as e:
logger.error("Failed to store traffic data batch",
error=str(e), location_id=location_id)
await db.rollback()
return stored_count
def _validate_traffic_data(self, data: Dict[str, Any]) -> bool:
"""Validate traffic data before storage"""
required_fields = ['date']
# Check required fields
for field in required_fields:
if not data.get(field):
return False
# Validate data types and ranges
traffic_volume = data.get('traffic_volume')
if traffic_volume is not None and (traffic_volume < 0 or traffic_volume > 10000):
return False
pedestrian_count = data.get('pedestrian_count')
if pedestrian_count is not None and (pedestrian_count < 0 or pedestrian_count > 10000):
return False
average_speed = data.get('average_speed')
if average_speed is not None and (average_speed < 0 or average_speed > 200):
return False
congestion_level = data.get('congestion_level')
if congestion_level and congestion_level not in ['low', 'medium', 'high', 'blocked']:
return False
return True
async def get_stored_traffic_for_training(self,
latitude: float,
longitude: float,
start_date: datetime,
end_date: datetime,
db: AsyncSession) -> List[Dict[str, Any]]:
"""Retrieve stored traffic data specifically for training purposes"""
try:
location_id = f"{latitude:.4f},{longitude:.4f}"
stmt = select(TrafficData).where(
and_(
TrafficData.location_id == location_id,
TrafficData.date >= start_date,
TrafficData.date <= end_date
)
).order_by(TrafficData.date)
result = await db.execute(stmt)
records = result.scalars().all()
# Convert to training format
training_data = []
for record in records:
training_data.append({
'date': record.date,
'traffic_volume': record.traffic_volume,
'pedestrian_count': record.pedestrian_count,
'congestion_level': record.congestion_level,
'average_speed': record.average_speed,
'location_id': record.location_id,
'source': record.source,
'measurement_point_id': record.raw_data # Contains additional metadata
})
logger.info(f"Retrieved {len(training_data)} traffic records for training",
location_id=location_id, start=start_date, end=end_date)
return training_data
except Exception as e:
logger.error("Failed to retrieve traffic data for training",
error=str(e), location_id=location_id)
return []


@@ -0,0 +1,54 @@
"""Create traffic_data table for storing traffic data for re-training
Revision ID: 001_traffic_data
Revises:
Create Date: 2025-01-08 12:00:00.000000
"""
from alembic import op
import sqlalchemy as sa
from sqlalchemy.dialects.postgresql import UUID
# revision identifiers, used by Alembic.
revision = '001_traffic_data'
down_revision = None
branch_labels = None
depends_on = None
def upgrade():
"""Create traffic_data table"""
op.create_table('traffic_data',
sa.Column('id', UUID(as_uuid=True), nullable=False, primary_key=True),
sa.Column('location_id', sa.String(100), nullable=False, index=True),
sa.Column('date', sa.DateTime(timezone=True), nullable=False, index=True),
sa.Column('traffic_volume', sa.Integer, nullable=True),
sa.Column('pedestrian_count', sa.Integer, nullable=True),
sa.Column('congestion_level', sa.String(20), nullable=True),
sa.Column('average_speed', sa.Float, nullable=True),
sa.Column('source', sa.String(50), nullable=False, server_default='madrid_opendata'),
sa.Column('raw_data', sa.Text, nullable=True),
sa.Column('created_at', sa.DateTime(timezone=True), nullable=False),
sa.Column('updated_at', sa.DateTime(timezone=True), nullable=False),
)
# Create index for efficient querying by location and date
op.create_index(
'idx_traffic_location_date',
'traffic_data',
['location_id', 'date']
)
# Create index for date range queries
op.create_index(
'idx_traffic_date_range',
'traffic_data',
['date']
)
def downgrade():
"""Drop traffic_data table"""
op.drop_index('idx_traffic_date_range', table_name='traffic_data')
op.drop_index('idx_traffic_location_date', table_name='traffic_data')
op.drop_table('traffic_data')


@@ -24,6 +24,13 @@ class DataClient:
# Get the shared data client configured for this service
self.data_client = get_data_client(settings, "training")
# Check if the new method is available for stored traffic data
if hasattr(self.data_client, 'get_stored_traffic_data_for_training'):
self.supports_stored_traffic_data = True
else:
self.supports_stored_traffic_data = False
logger.warning("Stored traffic data method not available in data client")
# Or alternatively, get all clients at once:
# self.clients = get_service_clients(settings, "training")
# Then use: self.clients.data.get_sales_data(...)
@@ -147,6 +154,51 @@ class DataClient:
logger.error(f"Error fetching traffic data: {e}", tenant_id=tenant_id)
return []
async def fetch_stored_traffic_data_for_training(
self,
tenant_id: str,
start_date: str,
end_date: str,
latitude: Optional[float] = None,
longitude: Optional[float] = None
) -> List[Dict[str, Any]]:
"""
Fetch stored traffic data specifically for training/re-training
This method accesses previously stored traffic data without making new API calls
"""
try:
if self.supports_stored_traffic_data:
# Use the dedicated stored traffic data method
stored_traffic_data = await self.data_client.get_stored_traffic_data_for_training(
tenant_id=tenant_id,
start_date=start_date,
end_date=end_date,
latitude=latitude,
longitude=longitude
)
if stored_traffic_data:
logger.info(f"Retrieved {len(stored_traffic_data)} stored traffic records for training",
tenant_id=tenant_id)
return stored_traffic_data
else:
logger.warning("No stored traffic data available for training", tenant_id=tenant_id)
return []
else:
# Fallback to regular traffic data method
logger.info("Using fallback traffic data method for training")
return await self.fetch_traffic_data(
tenant_id=tenant_id,
start_date=start_date,
end_date=end_date,
latitude=latitude,
longitude=longitude
)
except Exception as e:
logger.error(f"Error fetching stored traffic data for training: {e}", tenant_id=tenant_id)
return []
async def validate_data_quality(
self,
tenant_id: str,


@@ -360,7 +360,7 @@ class TrainingDataOrchestrator:
aligned_range: AlignedDateRange,
tenant_id: str
) -> List[Dict[str, Any]]:
"""Collect traffic data with timeout and Madrid constraint validation"""
"""Collect traffic data with enhanced storage and retrieval for re-training"""
try:
# Double-check Madrid constraint before making request
@@ -374,6 +374,7 @@ class TrainingDataOrchestrator:
start_date_str = aligned_range.start.isoformat()
end_date_str = aligned_range.end.isoformat()
# Fetch traffic data - this will automatically store it for future re-training
traffic_data = await self.data_client.fetch_traffic_data(
tenant_id=tenant_id,
start_date=start_date_str,
@@ -383,7 +384,11 @@ class TrainingDataOrchestrator:
# Validate traffic data
if self._validate_traffic_data(traffic_data):
logger.info(f"Collected {len(traffic_data)} valid traffic records")
logger.info(f"Collected and stored {len(traffic_data)} valid traffic records for re-training")
# Log storage success for audit purposes
self._log_traffic_data_storage(lat, lon, aligned_range, len(traffic_data))
return traffic_data
else:
logger.warning("Invalid traffic data received")
@@ -396,6 +401,69 @@ class TrainingDataOrchestrator:
logger.warning(f"Traffic data collection failed: {e}")
return []
def _log_traffic_data_storage(self,
lat: float,
lon: float,
aligned_range: AlignedDateRange,
record_count: int):
"""Log traffic data storage for audit and re-training tracking"""
logger.info(
"Traffic data stored for re-training",
location=f"{lat:.4f},{lon:.4f}",
date_range=f"{aligned_range.start.isoformat()} to {aligned_range.end.isoformat()}",
records_stored=record_count,
storage_timestamp=datetime.now().isoformat(),
purpose="model_training_and_retraining"
)
async def retrieve_stored_traffic_for_retraining(
self,
bakery_location: Tuple[float, float],
start_date: datetime,
end_date: datetime,
tenant_id: str
) -> List[Dict[str, Any]]:
"""
Retrieve previously stored traffic data for model re-training
This method specifically accesses the stored traffic data without making new API calls
"""
lat, lon = bakery_location
try:
# Use the dedicated stored traffic data method for training
stored_traffic_data = await self.data_client.fetch_stored_traffic_data_for_training(
tenant_id=tenant_id,
start_date=start_date.isoformat(),
end_date=end_date.isoformat(),
latitude=lat,
longitude=lon
)
if stored_traffic_data:
logger.info(
f"Retrieved {len(stored_traffic_data)} stored traffic records for re-training",
location=f"{lat:.4f},{lon:.4f}",
date_range=f"{start_date.isoformat()} to {end_date.isoformat()}",
tenant_id=tenant_id
)
return stored_traffic_data
else:
logger.warning(
"No stored traffic data found for re-training",
location=f"{lat:.4f},{lon:.4f}",
date_range=f"{start_date.isoformat()} to {end_date.isoformat()}"
)
return []
except Exception as e:
logger.error(
f"Failed to retrieve stored traffic data for re-training: {e}",
location=f"{lat:.4f},{lon:.4f}",
tenant_id=tenant_id
)
return []
def _validate_weather_data(self, weather_data: List[Dict[str, Any]]) -> bool:
"""Validate weather data quality"""
if not weather_data:


@@ -317,7 +317,53 @@ class DataServiceClient(BaseServiceClient):
else:
logger.error("Failed to fetch traffic data - _make_request returned None")
logger.error("This could be due to: network timeout, HTTP error, authentication failure, or service unavailable")
return []
return None
async def get_stored_traffic_data_for_training(
self,
tenant_id: str,
start_date: str,
end_date: str,
latitude: Optional[float] = None,
longitude: Optional[float] = None
) -> Optional[List[Dict[str, Any]]]:
"""
Get stored traffic data specifically for model training/re-training
This method prioritizes database-stored data over API calls
"""
# Prepare request payload
payload = {
"start_date": start_date,
"end_date": end_date,
"latitude": latitude or 40.4168, # Default Madrid coordinates
"longitude": longitude or -3.7038,
"stored_only": True # Flag to indicate we want stored data only
}
logger.info(f"Training traffic data request: {payload}", tenant_id=tenant_id)
# Standard timeout since we're only querying the database
training_timeout = httpx.Timeout(
connect=30.0,
read=120.0, # 2 minutes should be enough for database query
write=30.0,
pool=30.0
)
result = await self._make_request(
"POST",
"traffic/stored", # New endpoint for stored traffic data
tenant_id=tenant_id,
data=payload,
timeout=training_timeout
)
if result:
logger.info(f"Successfully retrieved {len(result)} stored traffic records for training")
return result
else:
logger.warning("No stored traffic data available for training")
return None
# ================================================================
# PRODUCTS