Files
bakery-ia/SERVICE_INITIALIZATION_ARCHITECTURE.md
2025-10-01 12:17:59 +02:00

533 lines
16 KiB
Markdown

# Service Initialization Architecture Analysis
## Current Architecture Problem
You've correctly identified a **redundancy and architectural inconsistency** in the current setup:
### What's Happening Now:
```
Kubernetes Deployment Flow:
1. Migration Job runs → applies Alembic migrations → completes
2. Service Pod starts → runs migrations AGAIN in startup → service ready
```
### The Redundancy:
**Migration Job** (`external-migration`):
- Runs: `/app/scripts/run_migrations.py external`
- Calls: `initialize_service_database()`
- Applies: Alembic migrations via `alembic upgrade head`
- Status: Completes successfully
**Service Startup** (`external-service` pod):
- Runs: `BaseFastAPIService._handle_database_tables()` (line 219-241)
- Calls: `initialize_service_database()` **AGAIN**
- Applies: Alembic migrations via `alembic upgrade head` **AGAIN**
- From logs:
```
2025-10-01 09:26:01 [info] Running pending migrations service=external
INFO [alembic.runtime.migration] Context impl PostgresqlImpl.
INFO [alembic.runtime.migration] Will assume transactional DDL.
2025-10-01 09:26:01 [info] Migrations applied successfully service=external
```
## Why This Is Problematic
### 1. **Duplicated Logic**
- Same code runs twice (`initialize_service_database()`)
- Both use same `DatabaseInitManager`
- Both check migration state, run Alembic upgrade
### 2. **Unclear Separation of Concerns**
- **Migration Job**: Supposed to handle migrations
- **Service Startup**: Also handling migrations
- Which one is the source of truth?
### 3. **Race Conditions Potential**
If multiple service replicas start simultaneously:
- All replicas run migrations concurrently
- Alembic has locking, but still adds overhead
- Unnecessary database load
### 4. **Slower Startup Times**
Every service pod runs full migration check on startup:
- Connects to database
- Checks migration state
- Runs `alembic upgrade head` (even if no-op)
- Adds 1-2 seconds to startup
### 5. **Confusion About Responsibilities**
From logs, the service is doing migration work:
```
[info] Running pending migrations service=external
```
This is NOT what a service should do - it should assume DB is ready.
## Architectural Patterns (Best Practices)
### Pattern 1: **Init Container Pattern** (Recommended for K8s)
```yaml
Deployment:
initContainers:
- name: wait-for-migrations
# Wait for migration job to complete
- name: run-migrations # Optional: inline migrations
command: alembic upgrade head
containers:
- name: service
# Service starts AFTER migrations complete
# Service does NOT run migrations
```
**Pros:**
- ✅ Clear separation: Init containers handle setup, main container serves traffic
- ✅ No race conditions: Init containers run sequentially
- ✅ Fast service startup: Assumes DB is ready
- ✅ Multiple replicas safe: Only first pod's init runs migrations
**Cons:**
- ⚠ Init containers increase pod startup time
- ⚠ Need proper migration locking (Alembic provides this)
### Pattern 2: **Standalone Migration Job** (Your Current Approach - Almost)
```yaml
Job: migration-job
command: alembic upgrade head
# Runs once on deployment
Deployment: service
# Service assumes DB is ready
# NO migration logic in service code
```
**Pros:**
- ✅ Complete separation: Migrations are separate workload
- ✅ Clear lifecycle: Job completes before service starts
- ✅ Fast service startup: No migration checks
- ✅ Easy rollback: Re-run job with specific version
**Cons:**
- ⚠ Need orchestration: Ensure job completes before service starts
- ⚠ Deployment complexity: Manage job + deployment separately
### Pattern 3: **Service Self-Migration** (Anti-pattern in Production)
```yaml
Deployment: service
# Service runs migrations on startup
# What you're doing now in both places
```
**Pros:**
- ✅ Simple deployment: Single resource
- ✅ Always in sync: Migrations bundled with service
**Cons:**
- ❌ Race conditions with multiple replicas
- ❌ Slower startup: Every pod checks migrations
- ❌ Service code mixed with operational concerns
- ❌ Harder to debug: Migration failures look like service failures
## Recommended Architecture
### **Hybrid Approach: Init Container + Fallback Check**
```yaml
# 1. Pre-deployment Migration Job (runs once)
apiVersion: batch/v1
kind: Job
metadata:
name: external-migration
spec:
template:
spec:
containers:
- name: migrate
command: ["alembic", "upgrade", "head"]
# Runs FULL migration logic
---
# 2. Service Deployment (depends on job)
apiVersion: apps/v1
kind: Deployment
metadata:
name: external-service
spec:
template:
spec:
initContainers:
- name: wait-for-db
# Wait for database to be ready
# NEW: Wait for migrations to complete
- name: wait-for-migrations
command: ["sh", "-c", "
until alembic current | grep -q 'head'; do
echo 'Waiting for migrations...';
sleep 2;
done
"]
containers:
- name: service
# Service startup with MINIMAL migration check
env:
- name: SKIP_MIGRATIONS
value: "true" # Service won't run migrations
```
### Service Code Changes:
**Current** (`shared/service_base.py` line 219-241):
```python
async def _handle_database_tables(self):
"""Handle automatic table creation and migration management"""
# Always runs full migration check
result = await initialize_service_database(
database_manager=self.database_manager,
service_name=self.service_name,
force_recreate=force_recreate
)
```
**Recommended**:
```python
async def _handle_database_tables(self):
"""Verify database is ready (migrations already applied)"""
# Check if we should skip migrations (production mode)
skip_migrations = os.getenv("SKIP_MIGRATIONS", "false").lower() == "true"
if skip_migrations:
# Production mode: Only verify, don't run migrations
await self._verify_database_ready()
else:
# Development mode: Run full migration check
result = await initialize_service_database(
database_manager=self.database_manager,
service_name=self.service_name,
force_recreate=force_recreate
)
async def _verify_database_ready(self):
"""Quick check that database and tables exist"""
try:
# Check connection
if not await self.database_manager.test_connection():
raise Exception("Database connection failed")
# Check expected tables exist (if specified)
if self.expected_tables:
async with self.database_manager.get_session() as session:
for table in self.expected_tables:
result = await session.execute(
text(f"SELECT EXISTS (
SELECT FROM information_schema.tables
WHERE table_schema = 'public'
AND table_name = '{table}'
)")
)
if not result.scalar():
raise Exception(f"Expected table '{table}' not found")
self.logger.info("Database verification successful")
except Exception as e:
self.logger.error("Database verification failed", error=str(e))
raise
```
## Migration Strategy Comparison
### Current State:
```
┌─────────────────┐
│ Migration Job │ ──> Runs migrations
└─────────────────┘
├─> Job completes
┌─────────────────┐
│ Service Pod 1 │ ──> Runs migrations AGAIN ❌
└─────────────────┘
┌─────────────────┐
│ Service Pod 2 │ ──> Runs migrations AGAIN ❌
└─────────────────┘
┌─────────────────┐
│ Service Pod 3 │ ──> Runs migrations AGAIN ❌
└─────────────────┘
```
### Recommended State:
```
┌─────────────────┐
│ Migration Job │ ──> Runs migrations ONCE ✅
└─────────────────┘
├─> Job completes
┌─────────────────┐
│ Service Pod 1 │ ──> Verifies DB ready only ✅
└─────────────────┘
┌─────────────────┐
│ Service Pod 2 │ ──> Verifies DB ready only ✅
└─────────────────┘
┌─────────────────┐
│ Service Pod 3 │ ──> Verifies DB ready only ✅
└─────────────────┘
```
## Implementation Plan
### Phase 1: Add Verification-Only Mode
**File**: `shared/database/init_manager.py`
Add new mode: `verify_only`
```python
class DatabaseInitManager:
def __init__(
self,
# ... existing params
verify_only: bool = False # NEW
):
self.verify_only = verify_only
async def initialize_database(self) -> Dict[str, Any]:
if self.verify_only:
return await self._verify_database_state()
# Existing logic for full initialization
# ...
async def _verify_database_state(self) -> Dict[str, Any]:
"""Quick verification that database is properly initialized"""
db_state = await self._check_database_state()
if not db_state["has_migrations"]:
raise Exception("No migrations found - database not initialized")
if db_state["is_empty"]:
raise Exception("Database has no tables - migrations not applied")
if not db_state["has_alembic_version"]:
raise Exception("No alembic_version table - migrations not tracked")
return {
"action": "verified",
"message": "Database verified successfully",
"current_revision": db_state["current_revision"]
}
```
### Phase 2: Update BaseFastAPIService
**File**: `shared/service_base.py`
```python
async def _handle_database_tables(self):
"""Handle database initialization based on environment"""
# Determine mode
skip_migrations = os.getenv("SKIP_MIGRATIONS", "false").lower() == "true"
force_recreate = os.getenv("DB_FORCE_RECREATE", "false").lower() == "true"
# Import here to avoid circular imports
from shared.database.init_manager import initialize_service_database
try:
if skip_migrations:
self.logger.info("Migration skip enabled - verifying database only")
result = await initialize_service_database(
database_manager=self.database_manager,
service_name=self.service_name.replace("-service", ""),
verify_only=True # NEW parameter
)
else:
self.logger.info("Running full database initialization")
result = await initialize_service_database(
database_manager=self.database_manager,
service_name=self.service_name.replace("-service", ""),
force_recreate=force_recreate,
verify_only=False
)
self.logger.info("Database initialization completed", result=result)
except Exception as e:
self.logger.error("Database initialization failed", error=str(e))
raise # Fail fast in production
```
### Phase 3: Update Kubernetes Manifests
**Add to all service deployments**:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: external-service
spec:
template:
spec:
containers:
- name: external-service
env:
# NEW: Skip migrations in service, rely on Job
- name: SKIP_MIGRATIONS
value: "true"
# Keep ENVIRONMENT for production safety
- name: ENVIRONMENT
value: "production" # or "development"
```
### Phase 4: Optional - Add Init Container Dependency
**For production safety**:
```yaml
spec:
template:
spec:
initContainers:
- name: wait-for-migrations
image: postgres:15-alpine
command: ["sh", "-c"]
args:
- |
echo "Waiting for migrations to be applied..."
export PGPASSWORD="$DB_PASSWORD"
# Wait for alembic_version table to exist
until psql -h $DB_HOST -U $DB_USER -d $DB_NAME -c "SELECT version_num FROM alembic_version" > /dev/null 2>&1; do
echo "Migrations not yet applied, waiting..."
sleep 2
done
echo "Migrations detected, service can start"
env:
- name: DB_HOST
value: "external-db-service"
# ... other DB connection details
```
## Environment Configuration Matrix
| Environment | Migration Job | Service Startup | Use Case |
|-------------|---------------|-----------------|----------|
| **Development** | Optional | Run migrations | Fast iteration, create_all fallback OK |
| **Staging** | Required | Verify only | Test migration workflow |
| **Production** | Required | Verify only | Safety first, fail fast |
## Configuration Examples
### Development (Current Behavior - OK)
```yaml
env:
- name: ENVIRONMENT
value: "development"
- name: SKIP_MIGRATIONS
value: "false"
- name: DB_FORCE_RECREATE
value: "false"
```
**Behavior**: Service runs full migration check, allows create_all fallback
### Staging/Production (Recommended)
```yaml
env:
- name: ENVIRONMENT
value: "production"
- name: SKIP_MIGRATIONS
value: "true"
- name: DB_FORCE_RECREATE
value: "false"
```
**Behavior**:
- Service only verifies database is ready
- No migration execution in service
- Fails fast if database not properly initialized
## Benefits of Proposed Architecture
### Performance:
- ✅ **50-80% faster service startup** (skip migration check: ~1-2 seconds saved)
- ✅ **Reduced database load** (no concurrent migration checks from multiple pods)
- ✅ **Faster horizontal scaling** (new pods start immediately)
### Reliability:
- ✅ **No race conditions** (only job runs migrations)
- ✅ **Clearer error messages** ("DB not ready" vs "migration failed")
- ✅ **Easier rollback** (re-run job independently)
### Maintainability:
- ✅ **Separation of concerns** (ops vs service code)
- ✅ **Easier debugging** (check job logs for migration issues)
- ✅ **Clear deployment flow** (job → service)
### Safety:
- ✅ **Fail-fast in production** (service won't start if DB not ready)
- ✅ **No create_all in production** (explicit migrations required)
- ✅ **Audit trail** (job logs show when migrations ran)
## Migration Path
### Step 1: Implement verify_only Mode (Non-Breaking)
- Add to `DatabaseInitManager`
- Backwards compatible (default: full check)
### Step 2: Add SKIP_MIGRATIONS Support (Non-Breaking)
- Update `BaseFastAPIService`
- Default: false (current behavior)
### Step 3: Enable in Development First
- Test with `SKIP_MIGRATIONS=true` locally
- Verify services start correctly
### Step 4: Enable in Staging
- Update staging manifests
- Monitor startup times and errors
### Step 5: Enable in Production
- Update production manifests
- Services fail fast if migrations not applied
## Recommended Next Steps
1. **Immediate**: Document current redundancy (✅ this document)
2. **Short-term** (1-2 days):
- Implement `verify_only` mode in `DatabaseInitManager`
- Add `SKIP_MIGRATIONS` support in `BaseFastAPIService`
- Test in development environment
3. **Medium-term** (1 week):
- Update all service deployments with `SKIP_MIGRATIONS=true`
- Add init container to wait for migrations (optional but recommended)
- Monitor startup times and error rates
4. **Long-term** (ongoing):
- Document migration process in runbooks
- Add migration rollback procedures
- Consider migration versioning strategy
## Summary
**Current**: Migration Job + Service both run migrations → redundant, slower, confusing
**Recommended**: Migration Job runs migrations → Service only verifies → clear, fast, reliable
The key insight: **Migrations are operational concerns, not application concerns**. Services should assume the database is ready, not try to fix it themselves.