Files
bakery-ia/SERVICE_INITIALIZATION_ARCHITECTURE.md
2025-10-01 12:17:59 +02:00

16 KiB

Service Initialization Architecture Analysis

Current Architecture Problem

You've correctly identified a redundancy and architectural inconsistency in the current setup:

What's Happening Now:

Kubernetes Deployment Flow:
1. Migration Job runs → applies Alembic migrations → completes
2. Service Pod starts → runs migrations AGAIN in startup → service ready

The Redundancy:

Migration Job (external-migration):

  • Runs: /app/scripts/run_migrations.py external
  • Calls: initialize_service_database()
  • Applies: Alembic migrations via alembic upgrade head
  • Status: Completes successfully

Service Startup (external-service pod):

  • Runs: BaseFastAPIService._handle_database_tables() (line 219-241)
  • Calls: initialize_service_database() AGAIN
  • Applies: Alembic migrations via alembic upgrade head AGAIN
  • From logs:
    2025-10-01 09:26:01 [info] Running pending migrations service=external
    INFO [alembic.runtime.migration] Context impl PostgresqlImpl.
    INFO [alembic.runtime.migration] Will assume transactional DDL.
    2025-10-01 09:26:01 [info] Migrations applied successfully service=external
    

Why This Is Problematic

1. Duplicated Logic

  • Same code runs twice (initialize_service_database())
  • Both use same DatabaseInitManager
  • Both check migration state, run Alembic upgrade

2. Unclear Separation of Concerns

  • Migration Job: Supposed to handle migrations
  • Service Startup: Also handling migrations
  • Which one is the source of truth?

3. Race Conditions Potential

If multiple service replicas start simultaneously:

  • All replicas run migrations concurrently
  • Alembic has locking, but still adds overhead
  • Unnecessary database load

4. Slower Startup Times

Every service pod runs full migration check on startup:

  • Connects to database
  • Checks migration state
  • Runs alembic upgrade head (even if no-op)
  • Adds 1-2 seconds to startup

5. Confusion About Responsibilities

From logs, the service is doing migration work:

[info] Running pending migrations service=external

This is NOT what a service should do - it should assume DB is ready.

Architectural Patterns (Best Practices)

Deployment:
  initContainers:
    - name: wait-for-migrations
      # Wait for migration job to complete
    - name: run-migrations  # Optional: inline migrations
      command: alembic upgrade head

  containers:
    - name: service
      # Service starts AFTER migrations complete
      # Service does NOT run migrations

Pros:

  • Clear separation: Init containers handle setup, main container serves traffic
  • No race conditions: Init containers run sequentially
  • Fast service startup: Assumes DB is ready
  • Multiple replicas safe: Only first pod's init runs migrations

Cons:

  • ⚠ Init containers increase pod startup time
  • ⚠ Need proper migration locking (Alembic provides this)

Pattern 2: Standalone Migration Job (Your Current Approach - Almost)

Job: migration-job
  command: alembic upgrade head
  # Runs once on deployment

Deployment: service
  # Service assumes DB is ready
  # NO migration logic in service code

Pros:

  • Complete separation: Migrations are separate workload
  • Clear lifecycle: Job completes before service starts
  • Fast service startup: No migration checks
  • Easy rollback: Re-run job with specific version

Cons:

  • ⚠ Need orchestration: Ensure job completes before service starts
  • ⚠ Deployment complexity: Manage job + deployment separately

Pattern 3: Service Self-Migration (Anti-pattern in Production)

Deployment: service
  # Service runs migrations on startup
  # What you're doing now in both places

Pros:

  • Simple deployment: Single resource
  • Always in sync: Migrations bundled with service

Cons:

  • Race conditions with multiple replicas
  • Slower startup: Every pod checks migrations
  • Service code mixed with operational concerns
  • Harder to debug: Migration failures look like service failures

Hybrid Approach: Init Container + Fallback Check

# 1. Pre-deployment Migration Job (runs once)
apiVersion: batch/v1
kind: Job
metadata:
  name: external-migration
spec:
  template:
    spec:
      containers:
      - name: migrate
        command: ["alembic", "upgrade", "head"]
        # Runs FULL migration logic

---
# 2. Service Deployment (depends on job)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: external-service
spec:
  template:
    spec:
      initContainers:
      - name: wait-for-db
        # Wait for database to be ready

      # NEW: Wait for migrations to complete
      - name: wait-for-migrations
        command: ["sh", "-c", "
          until alembic current | grep -q 'head'; do
            echo 'Waiting for migrations...';
            sleep 2;
          done
        "]

      containers:
      - name: service
        # Service startup with MINIMAL migration check
        env:
        - name: SKIP_MIGRATIONS
          value: "true"  # Service won't run migrations

Service Code Changes:

Current (shared/service_base.py line 219-241):

async def _handle_database_tables(self):
    """Handle automatic table creation and migration management"""
    # Always runs full migration check
    result = await initialize_service_database(
        database_manager=self.database_manager,
        service_name=self.service_name,
        force_recreate=force_recreate
    )

Recommended:

async def _handle_database_tables(self):
    """Verify database is ready (migrations already applied)"""

    # Check if we should skip migrations (production mode)
    skip_migrations = os.getenv("SKIP_MIGRATIONS", "false").lower() == "true"

    if skip_migrations:
        # Production mode: Only verify, don't run migrations
        await self._verify_database_ready()
    else:
        # Development mode: Run full migration check
        result = await initialize_service_database(
            database_manager=self.database_manager,
            service_name=self.service_name,
            force_recreate=force_recreate
        )

async def _verify_database_ready(self):
    """Quick check that database and tables exist"""
    try:
        # Check connection
        if not await self.database_manager.test_connection():
            raise Exception("Database connection failed")

        # Check expected tables exist (if specified)
        if self.expected_tables:
            async with self.database_manager.get_session() as session:
                for table in self.expected_tables:
                    result = await session.execute(
                        text(f"SELECT EXISTS (
                            SELECT FROM information_schema.tables
                            WHERE table_schema = 'public'
                            AND table_name = '{table}'
                        )")
                    )
                    if not result.scalar():
                        raise Exception(f"Expected table '{table}' not found")

        self.logger.info("Database verification successful")
    except Exception as e:
        self.logger.error("Database verification failed", error=str(e))
        raise

Migration Strategy Comparison

Current State:

┌─────────────────┐
│ Migration Job   │ ──> Runs migrations
└─────────────────┘
         │
         ├─> Job completes
         │
         ▼
┌─────────────────┐
│ Service Pod 1   │ ──> Runs migrations AGAIN ❌
└─────────────────┘
         │
┌─────────────────┐
│ Service Pod 2   │ ──> Runs migrations AGAIN ❌
└─────────────────┘
         │
┌─────────────────┐
│ Service Pod 3   │ ──> Runs migrations AGAIN ❌
└─────────────────┘
┌─────────────────┐
│ Migration Job   │ ──> Runs migrations ONCE ✅
└─────────────────┘
         │
         ├─> Job completes
         │
         ▼
┌─────────────────┐
│ Service Pod 1   │ ──> Verifies DB ready only ✅
└─────────────────┘
         │
┌─────────────────┐
│ Service Pod 2   │ ──> Verifies DB ready only ✅
└─────────────────┘
         │
┌─────────────────┐
│ Service Pod 3   │ ──> Verifies DB ready only ✅
└─────────────────┘

Implementation Plan

Phase 1: Add Verification-Only Mode

File: shared/database/init_manager.py

Add new mode: verify_only

class DatabaseInitManager:
    def __init__(
        self,
        # ... existing params
        verify_only: bool = False  # NEW
    ):
        self.verify_only = verify_only

    async def initialize_database(self) -> Dict[str, Any]:
        if self.verify_only:
            return await self._verify_database_state()

        # Existing logic for full initialization
        # ...

    async def _verify_database_state(self) -> Dict[str, Any]:
        """Quick verification that database is properly initialized"""
        db_state = await self._check_database_state()

        if not db_state["has_migrations"]:
            raise Exception("No migrations found - database not initialized")

        if db_state["is_empty"]:
            raise Exception("Database has no tables - migrations not applied")

        if not db_state["has_alembic_version"]:
            raise Exception("No alembic_version table - migrations not tracked")

        return {
            "action": "verified",
            "message": "Database verified successfully",
            "current_revision": db_state["current_revision"]
        }

Phase 2: Update BaseFastAPIService

File: shared/service_base.py

async def _handle_database_tables(self):
    """Handle database initialization based on environment"""

    # Determine mode
    skip_migrations = os.getenv("SKIP_MIGRATIONS", "false").lower() == "true"
    force_recreate = os.getenv("DB_FORCE_RECREATE", "false").lower() == "true"

    # Import here to avoid circular imports
    from shared.database.init_manager import initialize_service_database

    try:
        if skip_migrations:
            self.logger.info("Migration skip enabled - verifying database only")
            result = await initialize_service_database(
                database_manager=self.database_manager,
                service_name=self.service_name.replace("-service", ""),
                verify_only=True  # NEW parameter
            )
        else:
            self.logger.info("Running full database initialization")
            result = await initialize_service_database(
                database_manager=self.database_manager,
                service_name=self.service_name.replace("-service", ""),
                force_recreate=force_recreate,
                verify_only=False
            )

        self.logger.info("Database initialization completed", result=result)

    except Exception as e:
        self.logger.error("Database initialization failed", error=str(e))
        raise  # Fail fast in production

Phase 3: Update Kubernetes Manifests

Add to all service deployments:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: external-service
spec:
  template:
    spec:
      containers:
      - name: external-service
        env:
        # NEW: Skip migrations in service, rely on Job
        - name: SKIP_MIGRATIONS
          value: "true"

        # Keep ENVIRONMENT for production safety
        - name: ENVIRONMENT
          value: "production"  # or "development"

Phase 4: Optional - Add Init Container Dependency

For production safety:

spec:
  template:
    spec:
      initContainers:
      - name: wait-for-migrations
        image: postgres:15-alpine
        command: ["sh", "-c"]
        args:
        - |
          echo "Waiting for migrations to be applied..."
          export PGPASSWORD="$DB_PASSWORD"

          # Wait for alembic_version table to exist
          until psql -h $DB_HOST -U $DB_USER -d $DB_NAME -c "SELECT version_num FROM alembic_version" > /dev/null 2>&1; do
            echo "Migrations not yet applied, waiting..."
            sleep 2
          done

          echo "Migrations detected, service can start"
        env:
        - name: DB_HOST
          value: "external-db-service"
        # ... other DB connection details

Environment Configuration Matrix

Environment Migration Job Service Startup Use Case
Development Optional Run migrations Fast iteration, create_all fallback OK
Staging Required Verify only Test migration workflow
Production Required Verify only Safety first, fail fast

Configuration Examples

Development (Current Behavior - OK)

env:
  - name: ENVIRONMENT
    value: "development"
  - name: SKIP_MIGRATIONS
    value: "false"
  - name: DB_FORCE_RECREATE
    value: "false"

Behavior: Service runs full migration check, allows create_all fallback

env:
  - name: ENVIRONMENT
    value: "production"
  - name: SKIP_MIGRATIONS
    value: "true"
  - name: DB_FORCE_RECREATE
    value: "false"

Behavior:

  • Service only verifies database is ready
  • No migration execution in service
  • Fails fast if database not properly initialized

Benefits of Proposed Architecture

Performance:

  • 50-80% faster service startup (skip migration check: ~1-2 seconds saved)
  • Reduced database load (no concurrent migration checks from multiple pods)
  • Faster horizontal scaling (new pods start immediately)

Reliability:

  • No race conditions (only job runs migrations)
  • Clearer error messages ("DB not ready" vs "migration failed")
  • Easier rollback (re-run job independently)

Maintainability:

  • Separation of concerns (ops vs service code)
  • Easier debugging (check job logs for migration issues)
  • Clear deployment flow (job → service)

Safety:

  • Fail-fast in production (service won't start if DB not ready)
  • No create_all in production (explicit migrations required)
  • Audit trail (job logs show when migrations ran)

Migration Path

Step 1: Implement verify_only Mode (Non-Breaking)

  • Add to DatabaseInitManager
  • Backwards compatible (default: full check)

Step 2: Add SKIP_MIGRATIONS Support (Non-Breaking)

  • Update BaseFastAPIService
  • Default: false (current behavior)

Step 3: Enable in Development First

  • Test with SKIP_MIGRATIONS=true locally
  • Verify services start correctly

Step 4: Enable in Staging

  • Update staging manifests
  • Monitor startup times and errors

Step 5: Enable in Production

  • Update production manifests
  • Services fail fast if migrations not applied
  1. Immediate: Document current redundancy ( this document)

  2. Short-term (1-2 days):

    • Implement verify_only mode in DatabaseInitManager
    • Add SKIP_MIGRATIONS support in BaseFastAPIService
    • Test in development environment
  3. Medium-term (1 week):

    • Update all service deployments with SKIP_MIGRATIONS=true
    • Add init container to wait for migrations (optional but recommended)
    • Monitor startup times and error rates
  4. Long-term (ongoing):

    • Document migration process in runbooks
    • Add migration rollback procedures
    • Consider migration versioning strategy

Summary

Current: Migration Job + Service both run migrations → redundant, slower, confusing

Recommended: Migration Job runs migrations → Service only verifies → clear, fast, reliable

The key insight: Migrations are operational concerns, not application concerns. Services should assume the database is ready, not try to fix it themselves.