16 KiB
Service Initialization Architecture Analysis
Current Architecture Problem
You've correctly identified a redundancy and architectural inconsistency in the current setup:
What's Happening Now:
Kubernetes Deployment Flow:
1. Migration Job runs → applies Alembic migrations → completes
2. Service Pod starts → runs migrations AGAIN in startup → service ready
The Redundancy:
Migration Job (external-migration):
- Runs:
/app/scripts/run_migrations.py external - Calls:
initialize_service_database() - Applies: Alembic migrations via
alembic upgrade head - Status: Completes successfully
Service Startup (external-service pod):
- Runs:
BaseFastAPIService._handle_database_tables()(line 219-241) - Calls:
initialize_service_database()AGAIN - Applies: Alembic migrations via
alembic upgrade headAGAIN - From logs:
2025-10-01 09:26:01 [info] Running pending migrations service=external INFO [alembic.runtime.migration] Context impl PostgresqlImpl. INFO [alembic.runtime.migration] Will assume transactional DDL. 2025-10-01 09:26:01 [info] Migrations applied successfully service=external
Why This Is Problematic
1. Duplicated Logic
- Same code runs twice (
initialize_service_database()) - Both use same
DatabaseInitManager - Both check migration state, run Alembic upgrade
2. Unclear Separation of Concerns
- Migration Job: Supposed to handle migrations
- Service Startup: Also handling migrations
- Which one is the source of truth?
3. Race Conditions Potential
If multiple service replicas start simultaneously:
- All replicas run migrations concurrently
- Alembic has locking, but still adds overhead
- Unnecessary database load
4. Slower Startup Times
Every service pod runs full migration check on startup:
- Connects to database
- Checks migration state
- Runs
alembic upgrade head(even if no-op) - Adds 1-2 seconds to startup
5. Confusion About Responsibilities
From logs, the service is doing migration work:
[info] Running pending migrations service=external
This is NOT what a service should do - it should assume DB is ready.
Architectural Patterns (Best Practices)
Pattern 1: Init Container Pattern (Recommended for K8s)
Deployment:
initContainers:
- name: wait-for-migrations
# Wait for migration job to complete
- name: run-migrations # Optional: inline migrations
command: alembic upgrade head
containers:
- name: service
# Service starts AFTER migrations complete
# Service does NOT run migrations
Pros:
- ✅ Clear separation: Init containers handle setup, main container serves traffic
- ✅ No race conditions: Init containers run sequentially
- ✅ Fast service startup: Assumes DB is ready
- ✅ Multiple replicas safe: Only first pod's init runs migrations
Cons:
- ⚠ Init containers increase pod startup time
- ⚠ Need proper migration locking (Alembic provides this)
Pattern 2: Standalone Migration Job (Your Current Approach - Almost)
Job: migration-job
command: alembic upgrade head
# Runs once on deployment
Deployment: service
# Service assumes DB is ready
# NO migration logic in service code
Pros:
- ✅ Complete separation: Migrations are separate workload
- ✅ Clear lifecycle: Job completes before service starts
- ✅ Fast service startup: No migration checks
- ✅ Easy rollback: Re-run job with specific version
Cons:
- ⚠ Need orchestration: Ensure job completes before service starts
- ⚠ Deployment complexity: Manage job + deployment separately
Pattern 3: Service Self-Migration (Anti-pattern in Production)
Deployment: service
# Service runs migrations on startup
# What you're doing now in both places
Pros:
- ✅ Simple deployment: Single resource
- ✅ Always in sync: Migrations bundled with service
Cons:
- ❌ Race conditions with multiple replicas
- ❌ Slower startup: Every pod checks migrations
- ❌ Service code mixed with operational concerns
- ❌ Harder to debug: Migration failures look like service failures
Recommended Architecture
Hybrid Approach: Init Container + Fallback Check
# 1. Pre-deployment Migration Job (runs once)
apiVersion: batch/v1
kind: Job
metadata:
name: external-migration
spec:
template:
spec:
containers:
- name: migrate
command: ["alembic", "upgrade", "head"]
# Runs FULL migration logic
---
# 2. Service Deployment (depends on job)
apiVersion: apps/v1
kind: Deployment
metadata:
name: external-service
spec:
template:
spec:
initContainers:
- name: wait-for-db
# Wait for database to be ready
# NEW: Wait for migrations to complete
- name: wait-for-migrations
command: ["sh", "-c", "
until alembic current | grep -q 'head'; do
echo 'Waiting for migrations...';
sleep 2;
done
"]
containers:
- name: service
# Service startup with MINIMAL migration check
env:
- name: SKIP_MIGRATIONS
value: "true" # Service won't run migrations
Service Code Changes:
Current (shared/service_base.py line 219-241):
async def _handle_database_tables(self):
"""Handle automatic table creation and migration management"""
# Always runs full migration check
result = await initialize_service_database(
database_manager=self.database_manager,
service_name=self.service_name,
force_recreate=force_recreate
)
Recommended:
async def _handle_database_tables(self):
"""Verify database is ready (migrations already applied)"""
# Check if we should skip migrations (production mode)
skip_migrations = os.getenv("SKIP_MIGRATIONS", "false").lower() == "true"
if skip_migrations:
# Production mode: Only verify, don't run migrations
await self._verify_database_ready()
else:
# Development mode: Run full migration check
result = await initialize_service_database(
database_manager=self.database_manager,
service_name=self.service_name,
force_recreate=force_recreate
)
async def _verify_database_ready(self):
"""Quick check that database and tables exist"""
try:
# Check connection
if not await self.database_manager.test_connection():
raise Exception("Database connection failed")
# Check expected tables exist (if specified)
if self.expected_tables:
async with self.database_manager.get_session() as session:
for table in self.expected_tables:
result = await session.execute(
text(f"SELECT EXISTS (
SELECT FROM information_schema.tables
WHERE table_schema = 'public'
AND table_name = '{table}'
)")
)
if not result.scalar():
raise Exception(f"Expected table '{table}' not found")
self.logger.info("Database verification successful")
except Exception as e:
self.logger.error("Database verification failed", error=str(e))
raise
Migration Strategy Comparison
Current State:
┌─────────────────┐
│ Migration Job │ ──> Runs migrations
└─────────────────┘
│
├─> Job completes
│
▼
┌─────────────────┐
│ Service Pod 1 │ ──> Runs migrations AGAIN ❌
└─────────────────┘
│
┌─────────────────┐
│ Service Pod 2 │ ──> Runs migrations AGAIN ❌
└─────────────────┘
│
┌─────────────────┐
│ Service Pod 3 │ ──> Runs migrations AGAIN ❌
└─────────────────┘
Recommended State:
┌─────────────────┐
│ Migration Job │ ──> Runs migrations ONCE ✅
└─────────────────┘
│
├─> Job completes
│
▼
┌─────────────────┐
│ Service Pod 1 │ ──> Verifies DB ready only ✅
└─────────────────┘
│
┌─────────────────┐
│ Service Pod 2 │ ──> Verifies DB ready only ✅
└─────────────────┘
│
┌─────────────────┐
│ Service Pod 3 │ ──> Verifies DB ready only ✅
└─────────────────┘
Implementation Plan
Phase 1: Add Verification-Only Mode
File: shared/database/init_manager.py
Add new mode: verify_only
class DatabaseInitManager:
def __init__(
self,
# ... existing params
verify_only: bool = False # NEW
):
self.verify_only = verify_only
async def initialize_database(self) -> Dict[str, Any]:
if self.verify_only:
return await self._verify_database_state()
# Existing logic for full initialization
# ...
async def _verify_database_state(self) -> Dict[str, Any]:
"""Quick verification that database is properly initialized"""
db_state = await self._check_database_state()
if not db_state["has_migrations"]:
raise Exception("No migrations found - database not initialized")
if db_state["is_empty"]:
raise Exception("Database has no tables - migrations not applied")
if not db_state["has_alembic_version"]:
raise Exception("No alembic_version table - migrations not tracked")
return {
"action": "verified",
"message": "Database verified successfully",
"current_revision": db_state["current_revision"]
}
Phase 2: Update BaseFastAPIService
File: shared/service_base.py
async def _handle_database_tables(self):
"""Handle database initialization based on environment"""
# Determine mode
skip_migrations = os.getenv("SKIP_MIGRATIONS", "false").lower() == "true"
force_recreate = os.getenv("DB_FORCE_RECREATE", "false").lower() == "true"
# Import here to avoid circular imports
from shared.database.init_manager import initialize_service_database
try:
if skip_migrations:
self.logger.info("Migration skip enabled - verifying database only")
result = await initialize_service_database(
database_manager=self.database_manager,
service_name=self.service_name.replace("-service", ""),
verify_only=True # NEW parameter
)
else:
self.logger.info("Running full database initialization")
result = await initialize_service_database(
database_manager=self.database_manager,
service_name=self.service_name.replace("-service", ""),
force_recreate=force_recreate,
verify_only=False
)
self.logger.info("Database initialization completed", result=result)
except Exception as e:
self.logger.error("Database initialization failed", error=str(e))
raise # Fail fast in production
Phase 3: Update Kubernetes Manifests
Add to all service deployments:
apiVersion: apps/v1
kind: Deployment
metadata:
name: external-service
spec:
template:
spec:
containers:
- name: external-service
env:
# NEW: Skip migrations in service, rely on Job
- name: SKIP_MIGRATIONS
value: "true"
# Keep ENVIRONMENT for production safety
- name: ENVIRONMENT
value: "production" # or "development"
Phase 4: Optional - Add Init Container Dependency
For production safety:
spec:
template:
spec:
initContainers:
- name: wait-for-migrations
image: postgres:15-alpine
command: ["sh", "-c"]
args:
- |
echo "Waiting for migrations to be applied..."
export PGPASSWORD="$DB_PASSWORD"
# Wait for alembic_version table to exist
until psql -h $DB_HOST -U $DB_USER -d $DB_NAME -c "SELECT version_num FROM alembic_version" > /dev/null 2>&1; do
echo "Migrations not yet applied, waiting..."
sleep 2
done
echo "Migrations detected, service can start"
env:
- name: DB_HOST
value: "external-db-service"
# ... other DB connection details
Environment Configuration Matrix
| Environment | Migration Job | Service Startup | Use Case |
|---|---|---|---|
| Development | Optional | Run migrations | Fast iteration, create_all fallback OK |
| Staging | Required | Verify only | Test migration workflow |
| Production | Required | Verify only | Safety first, fail fast |
Configuration Examples
Development (Current Behavior - OK)
env:
- name: ENVIRONMENT
value: "development"
- name: SKIP_MIGRATIONS
value: "false"
- name: DB_FORCE_RECREATE
value: "false"
Behavior: Service runs full migration check, allows create_all fallback
Staging/Production (Recommended)
env:
- name: ENVIRONMENT
value: "production"
- name: SKIP_MIGRATIONS
value: "true"
- name: DB_FORCE_RECREATE
value: "false"
Behavior:
- Service only verifies database is ready
- No migration execution in service
- Fails fast if database not properly initialized
Benefits of Proposed Architecture
Performance:
- ✅ 50-80% faster service startup (skip migration check: ~1-2 seconds saved)
- ✅ Reduced database load (no concurrent migration checks from multiple pods)
- ✅ Faster horizontal scaling (new pods start immediately)
Reliability:
- ✅ No race conditions (only job runs migrations)
- ✅ Clearer error messages ("DB not ready" vs "migration failed")
- ✅ Easier rollback (re-run job independently)
Maintainability:
- ✅ Separation of concerns (ops vs service code)
- ✅ Easier debugging (check job logs for migration issues)
- ✅ Clear deployment flow (job → service)
Safety:
- ✅ Fail-fast in production (service won't start if DB not ready)
- ✅ No create_all in production (explicit migrations required)
- ✅ Audit trail (job logs show when migrations ran)
Migration Path
Step 1: Implement verify_only Mode (Non-Breaking)
- Add to
DatabaseInitManager - Backwards compatible (default: full check)
Step 2: Add SKIP_MIGRATIONS Support (Non-Breaking)
- Update
BaseFastAPIService - Default: false (current behavior)
Step 3: Enable in Development First
- Test with
SKIP_MIGRATIONS=truelocally - Verify services start correctly
Step 4: Enable in Staging
- Update staging manifests
- Monitor startup times and errors
Step 5: Enable in Production
- Update production manifests
- Services fail fast if migrations not applied
Recommended Next Steps
-
Immediate: Document current redundancy (✅ this document)
-
Short-term (1-2 days):
- Implement
verify_onlymode inDatabaseInitManager - Add
SKIP_MIGRATIONSsupport inBaseFastAPIService - Test in development environment
- Implement
-
Medium-term (1 week):
- Update all service deployments with
SKIP_MIGRATIONS=true - Add init container to wait for migrations (optional but recommended)
- Monitor startup times and error rates
- Update all service deployments with
-
Long-term (ongoing):
- Document migration process in runbooks
- Add migration rollback procedures
- Consider migration versioning strategy
Summary
Current: Migration Job + Service both run migrations → redundant, slower, confusing
Recommended: Migration Job runs migrations → Service only verifies → clear, fast, reliable
The key insight: Migrations are operational concerns, not application concerns. Services should assume the database is ready, not try to fix it themselves.