bakery-ia/docs/poi-detection-system.md
2025-12-05 20:07:01 +01:00

POI Detection System - Implementation Documentation

Overview

The POI (Point of Interest) Detection System is a comprehensive location-based feature engineering solution for bakery demand forecasting. It automatically detects nearby points of interest (schools, offices, transport hubs, competitors, etc.) and generates ML features that improve prediction accuracy for location-specific demand patterns.

System Architecture

┌─────────────────────────────────────────────────────────────────┐
│                     Bakery SaaS Platform                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │           External Data Service (POI MODULE)             │  │
│  ├──────────────────────────────────────────────────────────┤  │
│  │  POI Detection Service → Overpass API (OpenStreetMap)   │  │
│  │  POI Feature Selector → Relevance Filtering             │  │
│  │  Competitor Analyzer → Competitive Pressure Modeling     │  │
│  │  POI Cache Service → Redis (90-day TTL)                 │  │
│  │  TenantPOIContext → PostgreSQL Storage                   │  │
│  └──────────────────────────────────────────────────────────┘  │
│                     │                                           │
│                     │ POI Features                              │
│                     ▼                                           │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │           Training Service (ENHANCED)                     │  │
│  ├──────────────────────────────────────────────────────────┤  │
│  │  Training Data Orchestrator → Fetches POI Features      │  │
│  │  Data Processor → Merges POI Features into Training Data │  │
│  │  Prophet + XGBoost Trainer → Uses POI as Regressors     │  │
│  └──────────────────────────────────────────────────────────┘  │
│                     │                                           │
│                     │ Trained Models                            │
│                     ▼                                           │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │         Forecasting Service (ENHANCED)                    │  │
│  ├──────────────────────────────────────────────────────────┤  │
│  │  POI Feature Service → Fetches POI Features             │  │
│  │  Prediction Engine → Uses Same POI Features as Training  │  │
│  └──────────────────────────────────────────────────────────┘  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Implementation Status

Phase 1: Core POI Detection Infrastructure (COMPLETED)

Files Created:

  • services/external/app/models/poi_context.py - POI context data model
  • services/external/app/core/poi_config.py - POI categories and configuration
  • services/external/app/services/poi_detection_service.py - POI detection via Overpass API
  • services/external/app/services/poi_feature_selector.py - Feature relevance filtering
  • services/external/app/services/competitor_analyzer.py - Competitive pressure analysis
  • services/external/app/cache/poi_cache_service.py - Redis caching layer
  • services/external/app/repositories/poi_context_repository.py - Data access layer
  • services/external/app/api/poi_context.py - REST API endpoints
  • services/external/app/core/redis_client.py - Redis client accessor
  • services/external/migrations/versions/20251110_1554_add_poi_context.py - Database migration

Files Modified:

  • services/external/app/main.py - Added POI router and table
  • services/external/requirements.txt - Added overpy dependency

Key Features:

  • 9 POI categories: schools, offices, gyms/sports, residential, tourism, competitors, transport hubs, coworking, retail
  • Research-based search radii (400m-1000m) per category
  • Multi-tier feature engineering:
    • Tier 1: Proximity-weighted scores (PRIMARY)
    • Tier 2: Distance band counts (0-100m, 100-300m, 300-500m, 500-1000m)
    • Tier 3: Distance to nearest POI
    • Tier 4: Binary flags
  • Feature relevance thresholds to filter low-signal features
  • Competitive pressure modeling with market classification
  • 90-day Redis cache with 180-day refresh cycle
  • Complete REST API for detection, retrieval, refresh, deletion

Phase 2: ML Training Pipeline Integration (COMPLETED)

Files Created:

  • services/training/app/ml/poi_feature_integrator.py - POI feature integration for training

Files Modified:

  • services/training/app/services/training_orchestrator.py:
    • Added poi_features to TrainingDataSet
    • Added POIFeatureIntegrator initialization
    • Modified _collect_external_data to fetch POI features concurrently
    • Added _collect_poi_features method
    • Updated TrainingDataSet creation to include POI features
  • services/training/app/ml/data_processor.py:
    • Added poi_features parameter to prepare_training_data
    • Added _add_poi_features method
    • Integrated POI features into training data preparation flow
    • Added poi_features parameter to prepare_prediction_features
    • Added POI features to prediction feature generation
  • services/training/app/ml/trainer.py:
    • Updated training calls to pass poi_features from training_dataset
    • Updated test data preparation to include POI features

Key Features:

  • Automatic POI feature fetching during training data preparation
  • POI features added as static columns (broadcast to all dates)
  • Concurrent fetching with weather and traffic data
  • Graceful fallback if POI service unavailable
  • Feature consistency between training and testing

Phase 3: Forecasting Service Integration (COMPLETED)

Files Created:

  • services/forecasting/app/services/poi_feature_service.py - POI feature service for forecasting

Files Modified:

  • services/forecasting/app/ml/predictor.py:
    • Added POIFeatureService initialization
    • Modified _prepare_prophet_dataframe to fetch POI features
    • Ensured feature parity between training and prediction

Key Features:

  • POI features fetched from External service for each prediction
  • Same POI features used in both training and prediction (consistency)
  • Automatic feature retrieval based on tenant_id
  • Graceful handling of missing POI context

Phase 4: Frontend POI Visualization (COMPLETED)

Status: Complete frontend implementation with geocoding and visualization

Files Created:

  • frontend/src/types/poi.ts - Complete TypeScript type definitions with POI_CATEGORY_METADATA
  • frontend/src/services/api/poiContextApi.ts - API client for POI operations
  • frontend/src/services/api/geocodingApi.ts - Geocoding API client (Nominatim)
  • frontend/src/hooks/usePOIContext.ts - React hook for POI state management
  • frontend/src/hooks/useAddressAutocomplete.ts - Address autocomplete hook with debouncing
  • frontend/src/components/ui/AddressAutocomplete.tsx - Reusable address input component
  • frontend/src/components/domain/settings/POIMap.tsx - Interactive Leaflet map with POI markers
  • frontend/src/components/domain/settings/POISummaryCard.tsx - POI summary statistics card
  • frontend/src/components/domain/settings/POICategoryAccordion.tsx - Expandable category details
  • frontend/src/components/domain/settings/POIContextView.tsx - Main POI management view
  • frontend/src/components/domain/onboarding/steps/POIDetectionStep.tsx - Onboarding wizard step

Key Features:

  • Address autocomplete with real-time suggestions (Nominatim API)
  • Interactive map with color-coded POI markers by category
  • Distance rings visualization (100m, 300m, 500m)
  • Detailed category analysis with distance distribution
  • Automatic POI detection during onboarding
  • POI refresh functionality with competitive insights
  • Full TypeScript type safety
  • Map with bakery marker at center
  • Expandable category accordions with details
  • Refresh button for manual POI re-detection
  • Integration into Settings page and Onboarding wizard

Phase 5: Background Refresh Jobs & Geocoding (COMPLETED)

Status: Complete implementation of periodic POI refresh and address geocoding

Files Created (Background Jobs):

  • services/external/app/models/poi_refresh_job.py - POI refresh job data model
  • services/external/app/services/poi_refresh_service.py - POI refresh job management service
  • services/external/app/services/poi_scheduler.py - Background scheduler for periodic refresh
  • services/external/app/api/poi_refresh_jobs.py - REST API for job management
  • services/external/migrations/versions/20251110_1801_df9709132952_add_poi_refresh_jobs_table.py - Database migration

Files Created (Geocoding):

  • services/external/app/services/nominatim_service.py - Nominatim geocoding service
  • services/external/app/api/geocoding.py - Geocoding REST API endpoints

Files Modified:

  • services/external/app/main.py - Integrated scheduler startup/shutdown, added routers
  • services/external/app/api/poi_context.py - Auto-schedules refresh job after POI detection

Key Features - Background Refresh:

  • Automatic 6-month refresh cycle: Jobs scheduled 180 days after initial POI detection
  • Hourly scheduler: Checks for pending jobs every hour and executes them
  • Change detection: Analyzes differences between old and new POI results
  • Retry logic: Up to 3 attempts with 1-hour retry delay
  • Concurrent execution: Configurable max concurrent jobs (default: 5)
  • Job tracking: Complete audit trail with status, timestamps, results, errors
  • Manual triggers: API endpoints for immediate job execution
  • Auto-scheduling: Next refresh automatically scheduled on completion

Key Features - Geocoding:

  • Address autocomplete: Real-time suggestions from Nominatim API
  • Forward geocoding: Convert address to coordinates
  • Reverse geocoding: Convert coordinates to address
  • Rate limiting: Respects 1 req/sec for public Nominatim API
  • Production ready: Easy switch to self-hosted Nominatim instance
  • Country filtering: Default to Spain (configurable)
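The 1 req/sec limit for the public Nominatim API could be enforced with a small client-side limiter. The following is a minimal sketch, not the actual `nominatim_service.py` code: the class name `MinIntervalLimiter` and its injectable `clock` and `sleep` parameters are hypothetical and exist only to make the behaviour easy to test.

```python
import time
from typing import Callable, Optional


class MinIntervalLimiter:
    """Ensures consecutive calls are at least `min_interval_s` seconds apart."""

    def __init__(
        self,
        min_interval_s: float = 1.0,  # 1 req/sec, as stated for public Nominatim
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval_s = min_interval_s
        self._clock = clock
        self._sleep = sleep
        self._last_call: Optional[float] = None

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last_call is not None:
            delay = max(0.0, self.min_interval_s - (now - self._last_call))
        if delay > 0:
            self._sleep(delay)
        # Record the moment the caller is allowed to proceed.
        self._last_call = now + delay
        return delay
```

A caller would invoke `limiter.wait()` immediately before each outgoing geocoding request. A self-hosted Nominatim instance could use a smaller interval.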

Background Job API Endpoints:

  • POST /api/v1/poi-refresh-jobs/schedule - Schedule a refresh job
  • GET /api/v1/poi-refresh-jobs/{job_id} - Get job details
  • GET /api/v1/poi-refresh-jobs/tenant/{tenant_id} - Get tenant's jobs
  • POST /api/v1/poi-refresh-jobs/{job_id}/execute - Manually execute job
  • GET /api/v1/poi-refresh-jobs/pending - Get pending jobs
  • POST /api/v1/poi-refresh-jobs/process-pending - Process all pending jobs
  • POST /api/v1/poi-refresh-jobs/trigger-scheduler - Trigger immediate scheduler check
  • GET /api/v1/poi-refresh-jobs/scheduler/status - Get scheduler status

Geocoding API Endpoints:

  • GET /api/v1/geocoding/search?q={query} - Address search/autocomplete
  • GET /api/v1/geocoding/geocode?address={address} - Forward geocoding
  • GET /api/v1/geocoding/reverse?lat={lat}&lon={lon} - Reverse geocoding
  • GET /api/v1/geocoding/validate?lat={lat}&lon={lon} - Coordinate validation
  • GET /api/v1/geocoding/health - Service health check

Scheduler Lifecycle:

  • Startup: Scheduler automatically starts with External service
  • Runtime: Runs in background, checking every 3600 seconds (1 hour)
  • Shutdown: Gracefully stops when service shuts down
  • Immediate check: Can be triggered via API for testing/debugging

POI Categories & Configuration

Detected Categories

Category        OSM Query                                 Search Radius  Weight  Impact
--------------  ----------------------------------------  -------------  ------  ------------------------
Schools         amenity~"school|kindergarten|university"  500m           1.5     Morning drop-off rush
Offices         office                                    800m           1.3     Weekday lunch demand
Gyms/Sports     leisure~"fitness_centre|sports_centre"    600m           0.8     Morning/evening activity
Residential     building~"residential|apartments"         400m           1.0     Base demand
Tourism         tourism~"attraction|museum|hotel"         1000m          1.2     Tourist foot traffic
Competitors     shop~"bakery|pastry"                      1000m          -0.5    Competition pressure
Transport Hubs  railway~"station|subway_entrance"         800m           1.4     Commuter traffic
Coworking       amenity="coworking_space"                 600m           1.1     Flexible workers
Retail          shop                                      500m           0.9     General foot traffic
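To illustrate how a category's OSM query and search radius might translate into an Overpass request, the sketch below builds an Overpass QL string using the `around` filter. The function name `build_around_query` and the choice to query both nodes and ways are assumptions. The real `poi_detection_service.py` may construct its queries differently (for example through `overpy`).

```python
def build_around_query(
    tag_filter: str,
    radius_m: int,
    lat: float,
    lon: float,
    timeout_s: int = 30,
) -> str:
    """Return an Overpass QL query for nodes and ways matching `tag_filter` near a point.

    `tag_filter` is the text placed inside the square brackets of an Overpass
    tag filter, e.g. '"office"' or '"shop"~"bakery|pastry"'.
    """
    around = f"(around:{radius_m},{lat},{lon})"
    return (
        f"[out:json][timeout:{timeout_s}];"
        f"(node[{tag_filter}]{around};way[{tag_filter}]{around};);"
        "out center;"  # `center` gives ways a single representative coordinate
    )


# Schools category: 500m radius around a bakery in central Madrid.
schools_query = build_around_query(
    '"amenity"~"school|kindergarten|university"', 500, 40.4168, -3.7038
)
```

Overpass regular expressions match substrings, so an unanchored pattern such as `school` would also match values like `driving_school`. Anchoring the pattern (`^(school|kindergarten|university)$`) may be worth considering if that is not intended.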

Feature Relevance Thresholds

Features are only included in ML models if they pass relevance criteria:

Example - Schools:

  • min_proximity_score: 0.5 (moderate proximity required)
  • max_distance_to_nearest_m: 500 (must be within 500m)
  • min_count: 1 (at least 1 school)

If a bakery has no schools within 500m, school features are not added to the model, which prevents noise.
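The relevance check described above can be sketched as follows. The threshold values come from the Schools example. The names `RelevanceThreshold` and `is_relevant`, and the rule that all three criteria must pass together, are assumptions rather than the actual `poi_feature_selector.py` logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelevanceThreshold:
    min_proximity_score: float
    max_distance_to_nearest_m: float
    min_count: int


# Values taken from the Schools example above.
SCHOOLS = RelevanceThreshold(
    min_proximity_score=0.5,
    max_distance_to_nearest_m=500,
    min_count=1,
)


def is_relevant(
    proximity_score: float,
    distance_to_nearest_m: float,
    count: int,
    threshold: RelevanceThreshold,
) -> bool:
    """Return True only if every criterion passes (assumed to be an AND)."""
    return (
        count >= threshold.min_count
        and distance_to_nearest_m <= threshold.max_distance_to_nearest_m
        and proximity_score >= threshold.min_proximity_score
    )


# Five schools at 200m pass; a single school 1.5km away does not.
print(is_relevant(4.17, 200, 5, SCHOOLS))
print(is_relevant(0.4, 1500, 1, SCHOOLS))
```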

Feature Engineering Strategy

Hybrid Multi-Tier Approach

Research Basis: Academic studies (2023-2024; see References) suggest that single-method approaches underperform hybrid ones

Tier 1: Proximity-Weighted Scores (PRIMARY)

proximity_score = Σ(1 / (1 + distance_km)) for each POI
weighted_proximity_score = proximity_score × category.weight

Example:

  • Bakery 200m from 5 schools: score = 5 × (1/1.2) = 4.17
  • Bakery 100m from 1 school: score = 1 × (1/1.1) = 0.91
  • The first bakery has the higher school impact despite the greater distance!
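The Tier 1 formula and the two examples above can be reproduced with a few lines of Python. This is a minimal sketch: the function names mirror the formula's variable names but are not taken from the codebase, and distances are assumed to be given in metres.

```python
from typing import Iterable


def proximity_score(distances_m: Iterable[float]) -> float:
    """Sum of 1 / (1 + distance_km) over all POIs in a category."""
    return sum(1.0 / (1.0 + d / 1000.0) for d in distances_m)


def weighted_proximity_score(distances_m: Iterable[float], category_weight: float) -> float:
    """Proximity score scaled by the category weight (e.g. 1.5 for schools)."""
    return proximity_score(distances_m) * category_weight


five_schools_at_200m = proximity_score([200] * 5)  # 5 × (1 / 1.2)
one_school_at_100m = proximity_score([100])        # 1 × (1 / 1.1)
print(round(five_schools_at_200m, 2), round(one_school_at_100m, 2))
```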

Tier 2: Distance Band Counts

count_0_100m = count(POIs within 100m)
count_100_300m = count(POIs within 100-300m)
count_300_500m = count(POIs within 300-500m)
count_500_1000m = count(POIs within 500-1000m)

Tier 3: Distance to Nearest

distance_to_nearest_m = min(distances)

Tier 4: Binary Flags

has_within_100m = any(distance <= 100m)
has_within_300m = any(distance <= 300m)
has_within_500m = any(distance <= 500m)
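Tiers 2 to 4 can all be derived from the same list of POI distances. The sketch below is one possible implementation. The band boundary convention (upper bound inclusive, so a POI at exactly 100m falls in the 0-100m band, consistent with `distance <= 100m` in Tier 4) and the use of `None` when a category has no POIs are assumptions.

```python
from typing import Dict, List, Union

BANDS_M = [(0, 100), (100, 300), (300, 500), (500, 1000)]
FLAG_RADII_M = [100, 300, 500]

FeatureValue = Union[int, float, bool, None]


def distance_features(distances_m: List[float]) -> Dict[str, FeatureValue]:
    """Compute band counts, nearest distance, and binary flags for one category."""
    features: Dict[str, FeatureValue] = {}

    # Tier 2: counts per distance band (lower bound exclusive, upper inclusive).
    for lower, upper in BANDS_M:
        features[f"count_{lower}_{upper}m"] = sum(
            1 for d in distances_m if d <= upper and (d > lower or lower == 0)
        )

    # Tier 3: distance to the nearest POI, or None if the category is empty.
    features["distance_to_nearest_m"] = min(distances_m) if distances_m else None

    # Tier 4: binary presence flags.
    for radius in FLAG_RADII_M:
        features[f"has_within_{radius}m"] = any(d <= radius for d in distances_m)

    return features


print(distance_features([85, 100, 250, 480, 900]))
```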

Competitive Pressure Modeling

Special treatment for competitor bakeries:

Zones:

  • Direct (<100m): -1.0 multiplier per competitor (strong negative)
  • Nearby (100-500m): -0.5 multiplier (moderate negative)
  • Market (500-1000m):
    • If 5+ bakeries → +0.3 (bakery district = destination area)
    • If 2-4 bakeries → -0.2 (competitive market)
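One way to read the zone rules above is sketched below. Several details are assumptions, because the text does not pin them down: the Nearby multiplier is applied per competitor, the Market adjustment is applied once based on the number of bakeries in the 500-1000m band, fewer than two bakeries in that band contribute nothing, and the zone boundaries are half-open as written in the code. The actual `competitor_analyzer.py` may differ.

```python
from typing import Iterable


def competitive_pressure_score(competitor_distances_m: Iterable[float]) -> float:
    """Combine the Direct, Nearby, and Market zone rules into a single score."""
    distances = list(competitor_distances_m)

    direct = sum(1 for d in distances if d < 100)           # Direct zone
    nearby = sum(1 for d in distances if 100 <= d < 500)    # Nearby zone
    market = sum(1 for d in distances if 500 <= d <= 1000)  # Market zone

    score = -1.0 * direct + -0.5 * nearby
    if market >= 5:
        score += 0.3  # bakery district acts as a destination area
    elif market >= 2:
        score -= 0.2  # competitive market
    return score


# One direct competitor plus one nearby competitor.
print(competitive_pressure_score([50, 300]))
```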

API Endpoints

POST /api/v1/poi-context/{tenant_id}/detect

Detect POIs for a tenant's bakery location.

Query Parameters:

  • latitude (float, required): Bakery latitude
  • longitude (float, required): Bakery longitude
  • force_refresh (bool, optional): Force re-detection, skip cache

Response:

{
  "status": "success",
  "source": "detection",  // or "cache"
  "poi_context": {
    "id": "uuid",
    "tenant_id": "uuid",
    "location": {"latitude": 40.4168, "longitude": -3.7038},
    "total_pois_detected": 42,
    "high_impact_categories": ["schools", "transport_hubs"],
    "ml_features": {
      "poi_schools_proximity_score": 3.45,
      "poi_schools_count_0_100m": 2,
      "poi_schools_distance_to_nearest_m": 85.0,
      // ... remaining features (81+ in total)
    }
  },
  "feature_selection": {
    "relevant_categories": ["schools", "transport_hubs", "offices"],
    "relevance_report": [...]
  },
  "competitor_analysis": {
    "competitive_pressure_score": -1.5,
    "direct_competitors_count": 1,
    "competitive_zone": "high_competition",
    "market_type": "competitive_market"
  },
  "competitive_insights": [
    "⚠️ High competition: 1 direct competitor(s) within 100m. Focus on differentiation and quality."
  ]
}
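A client call to the detect endpoint could be prepared as in the sketch below, which only builds the request and does not send it. The base URL and tenant ID are placeholders, the helper name `build_detect_request` is hypothetical, and any authentication headers the platform requires are not shown.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_detect_request(
    base_url: str,
    tenant_id: str,
    latitude: float,
    longitude: float,
    force_refresh: bool = False,
) -> Request:
    """Build a POST request for /api/v1/poi-context/{tenant_id}/detect."""
    query = urlencode(
        {
            "latitude": latitude,
            "longitude": longitude,
            "force_refresh": str(force_refresh).lower(),
        }
    )
    url = f"{base_url.rstrip('/')}/api/v1/poi-context/{tenant_id}/detect?{query}"
    return Request(url, method="POST")


# Placeholder host and tenant ID, for illustration only.
req = build_detect_request("http://localhost:8000", "tenant-123", 40.4168, -3.7038)
print(req.get_method(), req.full_url)
```

The request can then be sent with `urllib.request.urlopen(req)` and the body parsed with `json.loads`. The `//` comments in the sample response above are annotations and would not appear in the real JSON payload.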

GET /api/v1/poi-context/{tenant_id}

Retrieve stored POI context for a tenant.

Response:

{
  "poi_context": {...},
  "is_stale": false,
  "needs_refresh": false
}

POST /api/v1/poi-context/{tenant_id}/refresh

Refresh POI context (re-detect POIs).

DELETE /api/v1/poi-context/{tenant_id}

Delete POI context for a tenant.

GET /api/v1/poi-context/{tenant_id}/feature-importance

Get feature importance summary.

GET /api/v1/poi-context/{tenant_id}/competitor-analysis

Get detailed competitor analysis.

GET /api/v1/poi-context/health

Check POI detection service health (Overpass API accessibility).

GET /api/v1/poi-context/cache/stats

Get cache statistics (key count, memory usage).

Database Schema

Table: tenant_poi_contexts

CREATE TABLE tenant_poi_contexts (
    id UUID PRIMARY KEY,
    tenant_id UUID UNIQUE NOT NULL,

    -- Location
    latitude FLOAT NOT NULL,
    longitude FLOAT NOT NULL,

    -- POI Detection Data
    poi_detection_results JSONB NOT NULL DEFAULT '{}',
    ml_features JSONB NOT NULL DEFAULT '{}',
    total_pois_detected INTEGER DEFAULT 0,
    high_impact_categories JSONB DEFAULT '[]',
    relevant_categories JSONB DEFAULT '[]',

    -- Detection Metadata
    detection_timestamp TIMESTAMP WITH TIME ZONE NOT NULL,
    detection_source VARCHAR(50) DEFAULT 'overpass_api',
    detection_status VARCHAR(20) DEFAULT 'completed',
    detection_error VARCHAR(500),

    -- Refresh Strategy
    next_refresh_date TIMESTAMP WITH TIME ZONE,
    refresh_interval_days INTEGER DEFAULT 180,
    last_refreshed_at TIMESTAMP WITH TIME ZONE,

    -- Timestamps
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);

CREATE INDEX idx_tenant_poi_location ON tenant_poi_contexts (latitude, longitude);
CREATE INDEX idx_tenant_poi_refresh ON tenant_poi_contexts (next_refresh_date);
CREATE INDEX idx_tenant_poi_status ON tenant_poi_contexts (detection_status);

ML Model Integration

Training Pipeline

POI features are automatically fetched and integrated during training:

# TrainingDataOrchestrator fetches POI features
poi_features = await poi_feature_integrator.fetch_poi_features(
    tenant_id=tenant_id,
    latitude=lat,
    longitude=lon
)

# Features added to TrainingDataSet
training_dataset = TrainingDataSet(
    sales_data=filtered_sales,
    weather_data=weather_data,
    traffic_data=traffic_data,
    poi_features=poi_features,  # NEW
    # ... (other fields omitted)
)

# Data processor merges POI features into training data
daily_sales = self._add_poi_features(daily_sales, poi_features)

# Prophet model uses POI features as regressors
for feature_name in poi_features.keys():
    model.add_regressor(feature_name, mode='additive')

Forecasting Pipeline

POI features are fetched and used for predictions:

# POI Feature Service retrieves features
poi_features = await poi_feature_service.get_poi_features(tenant_id)

# Features added to prediction dataframe
df = await data_processor.prepare_prediction_features(
    future_dates=future_dates,
    weather_forecast=weather_df,
    poi_features=poi_features,  # SAME features as training
    # ... (other arguments omitted)
)

# Prophet generates forecast with POI features
forecast = model.predict(df)

Feature Consistency

Critical: POI features MUST be identical in training and prediction!

  • Training: POI features fetched from External service
  • Prediction: POI features fetched from External service (same tenant)
  • Features are static (location-based, don't vary by date)
  • Storing them in TenantPOIContext ensures consistency
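Feature parity can also be enforced defensively at prediction time by aligning the fetched features to the list of feature names the model was trained with. The sketch below assumes such a list (a "feature manifest") is available. The function name `align_to_manifest` and the choice of `0.0` as the fill value for missing features are illustrative assumptions.

```python
from typing import Dict, List


def align_to_manifest(
    poi_features: Dict[str, float],
    manifest: List[str],
    fill_value: float = 0.0,
) -> Dict[str, float]:
    """Return exactly the features named in `manifest`, in manifest order.

    Features missing from `poi_features` are filled with `fill_value`;
    features not in the manifest are dropped.
    """
    return {name: poi_features.get(name, fill_value) for name in manifest}


trained_with = ["poi_schools_proximity_score", "poi_schools_count_0_100m"]
fetched = {"poi_schools_proximity_score": 3.45, "poi_offices_proximity_score": 1.2}
print(align_to_manifest(fetched, trained_with))
```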

Performance Optimizations

Caching Strategy

Redis Cache:

  • TTL: 90 days
  • Cache key: Rounded coordinates (4 decimals ≈ 10m precision)
  • Allows reuse for bakeries in close proximity
  • Reduces Overpass API load
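A cache key based on rounded coordinates might look like the sketch below. The `poi_cache` prefix and the key layout are assumptions, since the real key format in `poi_cache_service.py` is not shown here. The TTL constant simply converts the stated 90 days into seconds.

```python
CACHE_TTL_SECONDS = 90 * 24 * 60 * 60  # 90 days


def poi_cache_key(
    latitude: float,
    longitude: float,
    decimals: int = 4,
    prefix: str = "poi_cache",
) -> str:
    """Build a cache key from coordinates rounded to `decimals` places."""
    # Adding 0.0 turns a rounded -0.0 into 0.0 so both format identically.
    lat = round(latitude, decimals) + 0.0
    lon = round(longitude, decimals) + 0.0
    return f"{prefix}:{lat:.{decimals}f}:{lon:.{decimals}f}"


# Two bakeries a few metres apart map to the same key.
print(poi_cache_key(40.41683, -3.70384))
print(poi_cache_key(40.41678, -3.70377))
```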

Database Storage:

  • POI context stored in PostgreSQL
  • Refresh cycle: 180 days (6 months)
  • Background job refreshes stale contexts
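The 180-day refresh cycle amounts to simple date arithmetic on the detection timestamp. The sketch below uses hypothetical helper names. In the schema, the corresponding values are stored in the `next_refresh_date` and `refresh_interval_days` columns.

```python
from datetime import datetime, timedelta, timezone

REFRESH_INTERVAL_DAYS = 180


def next_refresh_date(
    detected_at: datetime, interval_days: int = REFRESH_INTERVAL_DAYS
) -> datetime:
    """Date on which a POI context becomes due for refresh."""
    return detected_at + timedelta(days=interval_days)


def needs_refresh(
    detected_at: datetime,
    now: datetime,
    interval_days: int = REFRESH_INTERVAL_DAYS,
) -> bool:
    """True once `now` has reached the scheduled refresh date."""
    return now >= next_refresh_date(detected_at, interval_days)


detected = datetime(2025, 1, 1, tzinfo=timezone.utc)
print(next_refresh_date(detected).date())
print(needs_refresh(detected, datetime(2025, 7, 1, tzinfo=timezone.utc)))
```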

API Rate Limiting

Overpass API:

  • Public endpoint: Rate limited
  • Retry logic: 3 attempts with 2-second delay
  • Timeout: 30 seconds per query
  • Concurrent queries: All POI categories fetched in parallel
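The retry policy (3 attempts, 2-second delay) can be expressed as a small wrapper. This is a synchronous sketch with a hypothetical name, `call_with_retry`. The service itself fetches categories in parallel and may well use an async equivalent, and it may catch narrower exception types than `Exception`.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    attempts: int = 3,
    delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn` up to `attempts` times, sleeping `delay_s` between failures."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    last_error: Optional[Exception] = None
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc
            if attempt < attempts:
                sleep(delay_s)

    assert last_error is not None
    raise last_error
```

The injectable `sleep` parameter lets tests replace the real delay with a no-op.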

Recommendation: Self-host Overpass API instance for production

Testing & Validation

Model Performance Impact

Expected improvements with POI features:

  • MAPE improvement: 5-10% for bakeries with significant POI presence
  • Accuracy maintained: For bakeries with no relevant POIs (features filtered out)
  • Feature count: 81+ POI features per bakery (if all categories relevant)

A/B Testing

Compare models with and without POI features:

# Model A: Without POI features
model_a = train_model(sales, weather, traffic)

# Model B: With POI features
model_b = train_model(sales, weather, traffic, poi_features)

# Compare MAPE, MAE, R² score
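For the comparison step, MAPE and MAE can be computed with the standard library alone. The sketch below uses made-up numbers purely to show the calculation. Skipping days with zero actual sales in the MAPE (where the percentage error is undefined) is an assumption, and other conventions exist.

```python
from typing import Sequence


def mae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error, ignoring points where actual == 0."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    if not pairs:
        raise ValueError("MAPE is undefined when all actual values are zero")
    return 100.0 * sum(abs(a - p) / abs(a) for a, p in pairs) / len(pairs)


# Illustrative values only, not measured results.
actual_sales = [100, 200]
forecast_a = [90, 220]   # hypothetical model without POI features
forecast_b = [95, 210]   # hypothetical model with POI features

print(mape(actual_sales, forecast_a), mae(actual_sales, forecast_a))
print(mape(actual_sales, forecast_b), mae(actual_sales, forecast_b))
```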

Troubleshooting

Common Issues

1. No POI context found

  • Cause: POI detection not run during onboarding
  • Fix: Call /api/v1/poi-context/{tenant_id}/detect endpoint

2. Overpass API timeout

  • Cause: API overload or network issues
  • Fix: Retry mechanism handles this automatically; check health endpoint

3. POI features not in model

  • Cause: Feature relevance thresholds filter out low-signal features
  • Fix: Expected behavior; check relevance report

4. Feature count mismatch between training and prediction

  • Cause: POI context refreshed between training and prediction
  • Fix: Models store feature manifest; prediction uses same features

Future Enhancements

  1. Neighborhood Clustering

    • Group bakeries by neighborhood type (business district, residential, tourist)
    • Reduce from 81+ individual features to 4-5 cluster features
    • Enable transfer learning across similar neighborhoods
  2. Automated POI Verification

    • User confirmation of auto-detected POIs
    • Manual addition/removal of POIs
  3. Temporal POI Features

    • School session times (morning vs. afternoon)
    • Office hours variations (hybrid work)
    • Event-based POIs (concerts, sports matches)
  4. Multi-City Support

    • City-specific POI weights
    • Regional calendar integration (school holidays vary by region)
  5. POI Change Detection

    • Monitor for new POIs (e.g., new school opens)
    • Automatic re-training when significant POI changes detected

References

Academic Research

  1. "Gravity models for potential spatial healthcare access measurement" (2023)
  2. "What determines travel time and distance decay in spatial interaction" (2024)
  3. "Location Profiling for Retail-Site Recommendation Using Machine Learning" (2024)
  4. "Predicting ride-hailing passenger demand: A POI-based adaptive clustering" (2024)

Technical Documentation

License & Attribution

POI data from OpenStreetMap contributors (© OpenStreetMap contributors), licensed under the Open Database License (ODbL).