533 lines
16 KiB
Markdown
533 lines
16 KiB
Markdown
|
|
# Service Initialization Architecture Analysis
|
||
|
|
|
||
|
|
## Current Architecture Problem
|
||
|
|
|
||
|
|
You've correctly identified a **redundancy and architectural inconsistency** in the current setup:
|
||
|
|
|
||
|
|
### What's Happening Now:
|
||
|
|
|
||
|
|
```
|
||
|
|
Kubernetes Deployment Flow:
|
||
|
|
1. Migration Job runs → applies Alembic migrations → completes
|
||
|
|
2. Service Pod starts → runs migrations AGAIN in startup → service ready
|
||
|
|
```
|
||
|
|
|
||
|
|
### The Redundancy:
|
||
|
|
|
||
|
|
**Migration Job** (`external-migration`):
|
||
|
|
- Runs: `/app/scripts/run_migrations.py external`
|
||
|
|
- Calls: `initialize_service_database()`
|
||
|
|
- Applies: Alembic migrations via `alembic upgrade head`
|
||
|
|
- Status: Completes successfully
|
||
|
|
|
||
|
|
**Service Startup** (`external-service` pod):
|
||
|
|
- Runs: `BaseFastAPIService._handle_database_tables()` (line 219-241)
|
||
|
|
- Calls: `initialize_service_database()` **AGAIN**
|
||
|
|
- Applies: Alembic migrations via `alembic upgrade head` **AGAIN**
|
||
|
|
- From logs:
|
||
|
|
```
|
||
|
|
2025-10-01 09:26:01 [info] Running pending migrations service=external
|
||
|
|
INFO [alembic.runtime.migration] Context impl PostgresqlImpl.
|
||
|
|
INFO [alembic.runtime.migration] Will assume transactional DDL.
|
||
|
|
2025-10-01 09:26:01 [info] Migrations applied successfully service=external
|
||
|
|
```
|
||
|
|
|
||
|
|
## Why This Is Problematic
|
||
|
|
|
||
|
|
### 1. **Duplicated Logic**
|
||
|
|
- Same code runs twice (`initialize_service_database()`)
|
||
|
|
- Both use same `DatabaseInitManager`
|
||
|
|
- Both check migration state, run Alembic upgrade
|
||
|
|
|
||
|
|
### 2. **Unclear Separation of Concerns**
|
||
|
|
- **Migration Job**: Supposed to handle migrations
|
||
|
|
- **Service Startup**: Also handling migrations
|
||
|
|
- Which one is the source of truth?
|
||
|
|
|
||
|
|
### 3. **Race Conditions Potential**
|
||
|
|
If multiple service replicas start simultaneously:
|
||
|
|
- All replicas run migrations concurrently
|
||
|
|
- Alembic has locking, but still adds overhead
|
||
|
|
- Unnecessary database load
|
||
|
|
|
||
|
|
### 4. **Slower Startup Times**
|
||
|
|
Every service pod runs full migration check on startup:
|
||
|
|
- Connects to database
|
||
|
|
- Checks migration state
|
||
|
|
- Runs `alembic upgrade head` (even if no-op)
|
||
|
|
- Adds 1-2 seconds to startup
|
||
|
|
|
||
|
|
### 5. **Confusion About Responsibilities**
|
||
|
|
From logs, the service is doing migration work:
|
||
|
|
```
|
||
|
|
[info] Running pending migrations service=external
|
||
|
|
```
|
||
|
|
This is NOT what a service should do - it should assume DB is ready.
|
||
|
|
|
||
|
|
## Architectural Patterns (Best Practices)
|
||
|
|
|
||
|
|
### Pattern 1: **Init Container Pattern** (Recommended for K8s)
|
||
|
|
|
||
|
|
```yaml
|
||
|
|
Deployment:
|
||
|
|
initContainers:
|
||
|
|
- name: wait-for-migrations
|
||
|
|
# Wait for migration job to complete
|
||
|
|
- name: run-migrations # Optional: inline migrations
|
||
|
|
command: alembic upgrade head
|
||
|
|
|
||
|
|
containers:
|
||
|
|
- name: service
|
||
|
|
# Service starts AFTER migrations complete
|
||
|
|
# Service does NOT run migrations
|
||
|
|
```
|
||
|
|
|
||
|
|
**Pros:**
|
||
|
|
- ✅ Clear separation: Init containers handle setup, main container serves traffic
|
||
|
|
- ✅ No race conditions: Init containers run sequentially
|
||
|
|
- ✅ Fast service startup: Assumes DB is ready
|
||
|
|
- ✅ Multiple replicas safe: Only first pod's init runs migrations
|
||
|
|
|
||
|
|
**Cons:**
|
||
|
|
- ⚠ Init containers increase pod startup time
|
||
|
|
- ⚠ Need proper migration locking (Alembic provides this)
|
||
|
|
|
||
|
|
### Pattern 2: **Standalone Migration Job** (Your Current Approach - Almost)
|
||
|
|
|
||
|
|
```yaml
|
||
|
|
Job: migration-job
|
||
|
|
command: alembic upgrade head
|
||
|
|
# Runs once on deployment
|
||
|
|
|
||
|
|
Deployment: service
|
||
|
|
# Service assumes DB is ready
|
||
|
|
# NO migration logic in service code
|
||
|
|
```
|
||
|
|
|
||
|
|
**Pros:**
|
||
|
|
- ✅ Complete separation: Migrations are separate workload
|
||
|
|
- ✅ Clear lifecycle: Job completes before service starts
|
||
|
|
- ✅ Fast service startup: No migration checks
|
||
|
|
- ✅ Easy rollback: Re-run job with specific version
|
||
|
|
|
||
|
|
**Cons:**
|
||
|
|
- ⚠ Need orchestration: Ensure job completes before service starts
|
||
|
|
- ⚠ Deployment complexity: Manage job + deployment separately
|
||
|
|
|
||
|
|
### Pattern 3: **Service Self-Migration** (Anti-pattern in Production)
|
||
|
|
|
||
|
|
```yaml
|
||
|
|
Deployment: service
|
||
|
|
# Service runs migrations on startup
|
||
|
|
# What you're doing now in both places
|
||
|
|
```
|
||
|
|
|
||
|
|
**Pros:**
|
||
|
|
- ✅ Simple deployment: Single resource
|
||
|
|
- ✅ Always in sync: Migrations bundled with service
|
||
|
|
|
||
|
|
**Cons:**
|
||
|
|
- ❌ Race conditions with multiple replicas
|
||
|
|
- ❌ Slower startup: Every pod checks migrations
|
||
|
|
- ❌ Service code mixed with operational concerns
|
||
|
|
- ❌ Harder to debug: Migration failures look like service failures
|
||
|
|
|
||
|
|
## Recommended Architecture
|
||
|
|
|
||
|
|
### **Hybrid Approach: Init Container + Fallback Check**
|
||
|
|
|
||
|
|
```yaml
|
||
|
|
# 1. Pre-deployment Migration Job (runs once)
|
||
|
|
apiVersion: batch/v1
|
||
|
|
kind: Job
|
||
|
|
metadata:
|
||
|
|
name: external-migration
|
||
|
|
spec:
|
||
|
|
template:
|
||
|
|
spec:
|
||
|
|
containers:
|
||
|
|
- name: migrate
|
||
|
|
command: ["alembic", "upgrade", "head"]
|
||
|
|
# Runs FULL migration logic
|
||
|
|
|
||
|
|
---
|
||
|
|
# 2. Service Deployment (depends on job)
|
||
|
|
apiVersion: apps/v1
|
||
|
|
kind: Deployment
|
||
|
|
metadata:
|
||
|
|
name: external-service
|
||
|
|
spec:
|
||
|
|
template:
|
||
|
|
spec:
|
||
|
|
initContainers:
|
||
|
|
- name: wait-for-db
|
||
|
|
# Wait for database to be ready
|
||
|
|
|
||
|
|
# NEW: Wait for migrations to complete
|
||
|
|
- name: wait-for-migrations
|
||
|
|
command: ["sh", "-c", "
|
||
|
|
until alembic current | grep -q 'head'; do
|
||
|
|
echo 'Waiting for migrations...';
|
||
|
|
sleep 2;
|
||
|
|
done
|
||
|
|
"]
|
||
|
|
|
||
|
|
containers:
|
||
|
|
- name: service
|
||
|
|
# Service startup with MINIMAL migration check
|
||
|
|
env:
|
||
|
|
- name: SKIP_MIGRATIONS
|
||
|
|
value: "true" # Service won't run migrations
|
||
|
|
```
|
||
|
|
|
||
|
|
### Service Code Changes:
|
||
|
|
|
||
|
|
**Current** (`shared/service_base.py` line 219-241):
|
||
|
|
```python
|
||
|
|
async def _handle_database_tables(self):
|
||
|
|
"""Handle automatic table creation and migration management"""
|
||
|
|
# Always runs full migration check
|
||
|
|
result = await initialize_service_database(
|
||
|
|
database_manager=self.database_manager,
|
||
|
|
service_name=self.service_name,
|
||
|
|
force_recreate=force_recreate
|
||
|
|
)
|
||
|
|
```
|
||
|
|
|
||
|
|
**Recommended**:
|
||
|
|
```python
|
||
|
|
async def _handle_database_tables(self):
|
||
|
|
"""Verify database is ready (migrations already applied)"""
|
||
|
|
|
||
|
|
# Check if we should skip migrations (production mode)
|
||
|
|
skip_migrations = os.getenv("SKIP_MIGRATIONS", "false").lower() == "true"
|
||
|
|
|
||
|
|
if skip_migrations:
|
||
|
|
# Production mode: Only verify, don't run migrations
|
||
|
|
await self._verify_database_ready()
|
||
|
|
else:
|
||
|
|
# Development mode: Run full migration check
|
||
|
|
result = await initialize_service_database(
|
||
|
|
database_manager=self.database_manager,
|
||
|
|
service_name=self.service_name,
|
||
|
|
force_recreate=force_recreate
|
||
|
|
)
|
||
|
|
|
||
|
|
async def _verify_database_ready(self):
|
||
|
|
"""Quick check that database and tables exist"""
|
||
|
|
try:
|
||
|
|
# Check connection
|
||
|
|
if not await self.database_manager.test_connection():
|
||
|
|
raise Exception("Database connection failed")
|
||
|
|
|
||
|
|
# Check expected tables exist (if specified)
|
||
|
|
if self.expected_tables:
|
||
|
|
async with self.database_manager.get_session() as session:
|
||
|
|
for table in self.expected_tables:
|
||
|
|
result = await session.execute(
|
||
|
|
text(f"SELECT EXISTS (
|
||
|
|
SELECT FROM information_schema.tables
|
||
|
|
WHERE table_schema = 'public'
|
||
|
|
AND table_name = '{table}'
|
||
|
|
)")
|
||
|
|
)
|
||
|
|
if not result.scalar():
|
||
|
|
raise Exception(f"Expected table '{table}' not found")
|
||
|
|
|
||
|
|
self.logger.info("Database verification successful")
|
||
|
|
except Exception as e:
|
||
|
|
self.logger.error("Database verification failed", error=str(e))
|
||
|
|
raise
|
||
|
|
```
|
||
|
|
|
||
|
|
## Migration Strategy Comparison
|
||
|
|
|
||
|
|
### Current State:
|
||
|
|
```
|
||
|
|
┌─────────────────┐
|
||
|
|
│ Migration Job │ ──> Runs migrations
|
||
|
|
└─────────────────┘
|
||
|
|
│
|
||
|
|
├─> Job completes
|
||
|
|
│
|
||
|
|
▼
|
||
|
|
┌─────────────────┐
|
||
|
|
│ Service Pod 1 │ ──> Runs migrations AGAIN ❌
|
||
|
|
└─────────────────┘
|
||
|
|
│
|
||
|
|
┌─────────────────┐
|
||
|
|
│ Service Pod 2 │ ──> Runs migrations AGAIN ❌
|
||
|
|
└─────────────────┘
|
||
|
|
│
|
||
|
|
┌─────────────────┐
|
||
|
|
│ Service Pod 3 │ ──> Runs migrations AGAIN ❌
|
||
|
|
└─────────────────┘
|
||
|
|
```
|
||
|
|
|
||
|
|
### Recommended State:
|
||
|
|
```
|
||
|
|
┌─────────────────┐
|
||
|
|
│ Migration Job │ ──> Runs migrations ONCE ✅
|
||
|
|
└─────────────────┘
|
||
|
|
│
|
||
|
|
├─> Job completes
|
||
|
|
│
|
||
|
|
▼
|
||
|
|
┌─────────────────┐
|
||
|
|
│ Service Pod 1 │ ──> Verifies DB ready only ✅
|
||
|
|
└─────────────────┘
|
||
|
|
│
|
||
|
|
┌─────────────────┐
|
||
|
|
│ Service Pod 2 │ ──> Verifies DB ready only ✅
|
||
|
|
└─────────────────┘
|
||
|
|
│
|
||
|
|
┌─────────────────┐
|
||
|
|
│ Service Pod 3 │ ──> Verifies DB ready only ✅
|
||
|
|
└─────────────────┘
|
||
|
|
```
|
||
|
|
|
||
|
|
## Implementation Plan
|
||
|
|
|
||
|
|
### Phase 1: Add Verification-Only Mode
|
||
|
|
|
||
|
|
**File**: `shared/database/init_manager.py`
|
||
|
|
|
||
|
|
Add new mode: `verify_only`
|
||
|
|
|
||
|
|
```python
|
||
|
|
class DatabaseInitManager:
|
||
|
|
def __init__(
|
||
|
|
self,
|
||
|
|
# ... existing params
|
||
|
|
verify_only: bool = False # NEW
|
||
|
|
):
|
||
|
|
self.verify_only = verify_only
|
||
|
|
|
||
|
|
async def initialize_database(self) -> Dict[str, Any]:
|
||
|
|
if self.verify_only:
|
||
|
|
return await self._verify_database_state()
|
||
|
|
|
||
|
|
# Existing logic for full initialization
|
||
|
|
# ...
|
||
|
|
|
||
|
|
async def _verify_database_state(self) -> Dict[str, Any]:
|
||
|
|
"""Quick verification that database is properly initialized"""
|
||
|
|
db_state = await self._check_database_state()
|
||
|
|
|
||
|
|
if not db_state["has_migrations"]:
|
||
|
|
raise Exception("No migrations found - database not initialized")
|
||
|
|
|
||
|
|
if db_state["is_empty"]:
|
||
|
|
raise Exception("Database has no tables - migrations not applied")
|
||
|
|
|
||
|
|
if not db_state["has_alembic_version"]:
|
||
|
|
raise Exception("No alembic_version table - migrations not tracked")
|
||
|
|
|
||
|
|
return {
|
||
|
|
"action": "verified",
|
||
|
|
"message": "Database verified successfully",
|
||
|
|
"current_revision": db_state["current_revision"]
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
### Phase 2: Update BaseFastAPIService
|
||
|
|
|
||
|
|
**File**: `shared/service_base.py`
|
||
|
|
|
||
|
|
```python
|
||
|
|
async def _handle_database_tables(self):
|
||
|
|
"""Handle database initialization based on environment"""
|
||
|
|
|
||
|
|
# Determine mode
|
||
|
|
skip_migrations = os.getenv("SKIP_MIGRATIONS", "false").lower() == "true"
|
||
|
|
force_recreate = os.getenv("DB_FORCE_RECREATE", "false").lower() == "true"
|
||
|
|
|
||
|
|
# Import here to avoid circular imports
|
||
|
|
from shared.database.init_manager import initialize_service_database
|
||
|
|
|
||
|
|
try:
|
||
|
|
if skip_migrations:
|
||
|
|
self.logger.info("Migration skip enabled - verifying database only")
|
||
|
|
result = await initialize_service_database(
|
||
|
|
database_manager=self.database_manager,
|
||
|
|
service_name=self.service_name.replace("-service", ""),
|
||
|
|
verify_only=True # NEW parameter
|
||
|
|
)
|
||
|
|
else:
|
||
|
|
self.logger.info("Running full database initialization")
|
||
|
|
result = await initialize_service_database(
|
||
|
|
database_manager=self.database_manager,
|
||
|
|
service_name=self.service_name.replace("-service", ""),
|
||
|
|
force_recreate=force_recreate,
|
||
|
|
verify_only=False
|
||
|
|
)
|
||
|
|
|
||
|
|
self.logger.info("Database initialization completed", result=result)
|
||
|
|
|
||
|
|
except Exception as e:
|
||
|
|
self.logger.error("Database initialization failed", error=str(e))
|
||
|
|
raise # Fail fast in production
|
||
|
|
```
|
||
|
|
|
||
|
|
### Phase 3: Update Kubernetes Manifests
|
||
|
|
|
||
|
|
**Add to all service deployments**:
|
||
|
|
|
||
|
|
```yaml
|
||
|
|
apiVersion: apps/v1
|
||
|
|
kind: Deployment
|
||
|
|
metadata:
|
||
|
|
name: external-service
|
||
|
|
spec:
|
||
|
|
template:
|
||
|
|
spec:
|
||
|
|
containers:
|
||
|
|
- name: external-service
|
||
|
|
env:
|
||
|
|
# NEW: Skip migrations in service, rely on Job
|
||
|
|
- name: SKIP_MIGRATIONS
|
||
|
|
value: "true"
|
||
|
|
|
||
|
|
# Keep ENVIRONMENT for production safety
|
||
|
|
- name: ENVIRONMENT
|
||
|
|
value: "production" # or "development"
|
||
|
|
```
|
||
|
|
|
||
|
|
### Phase 4: Optional - Add Init Container Dependency
|
||
|
|
|
||
|
|
**For production safety**:
|
||
|
|
|
||
|
|
```yaml
|
||
|
|
spec:
|
||
|
|
template:
|
||
|
|
spec:
|
||
|
|
initContainers:
|
||
|
|
- name: wait-for-migrations
|
||
|
|
image: postgres:15-alpine
|
||
|
|
command: ["sh", "-c"]
|
||
|
|
args:
|
||
|
|
- |
|
||
|
|
echo "Waiting for migrations to be applied..."
|
||
|
|
export PGPASSWORD="$DB_PASSWORD"
|
||
|
|
|
||
|
|
# Wait for alembic_version table to exist
|
||
|
|
until psql -h $DB_HOST -U $DB_USER -d $DB_NAME -c "SELECT version_num FROM alembic_version" > /dev/null 2>&1; do
|
||
|
|
echo "Migrations not yet applied, waiting..."
|
||
|
|
sleep 2
|
||
|
|
done
|
||
|
|
|
||
|
|
echo "Migrations detected, service can start"
|
||
|
|
env:
|
||
|
|
- name: DB_HOST
|
||
|
|
value: "external-db-service"
|
||
|
|
# ... other DB connection details
|
||
|
|
```
|
||
|
|
|
||
|
|
## Environment Configuration Matrix
|
||
|
|
|
||
|
|
| Environment | Migration Job | Service Startup | Use Case |
|
||
|
|
|-------------|---------------|-----------------|----------|
|
||
|
|
| **Development** | Optional | Run migrations | Fast iteration, create_all fallback OK |
|
||
|
|
| **Staging** | Required | Verify only | Test migration workflow |
|
||
|
|
| **Production** | Required | Verify only | Safety first, fail fast |
|
||
|
|
|
||
|
|
## Configuration Examples
|
||
|
|
|
||
|
|
### Development (Current Behavior - OK)
|
||
|
|
```yaml
|
||
|
|
env:
|
||
|
|
- name: ENVIRONMENT
|
||
|
|
value: "development"
|
||
|
|
- name: SKIP_MIGRATIONS
|
||
|
|
value: "false"
|
||
|
|
- name: DB_FORCE_RECREATE
|
||
|
|
value: "false"
|
||
|
|
```
|
||
|
|
**Behavior**: Service runs full migration check, allows create_all fallback
|
||
|
|
|
||
|
|
### Staging/Production (Recommended)
|
||
|
|
```yaml
|
||
|
|
env:
|
||
|
|
- name: ENVIRONMENT
|
||
|
|
value: "production"
|
||
|
|
- name: SKIP_MIGRATIONS
|
||
|
|
value: "true"
|
||
|
|
- name: DB_FORCE_RECREATE
|
||
|
|
value: "false"
|
||
|
|
```
|
||
|
|
**Behavior**:
|
||
|
|
- Service only verifies database is ready
|
||
|
|
- No migration execution in service
|
||
|
|
- Fails fast if database not properly initialized
|
||
|
|
|
||
|
|
## Benefits of Proposed Architecture
|
||
|
|
|
||
|
|
### Performance:
|
||
|
|
- ✅ **50-80% faster service startup** (skip migration check: ~1-2 seconds saved)
|
||
|
|
- ✅ **Reduced database load** (no concurrent migration checks from multiple pods)
|
||
|
|
- ✅ **Faster horizontal scaling** (new pods start immediately)
|
||
|
|
|
||
|
|
### Reliability:
|
||
|
|
- ✅ **No race conditions** (only job runs migrations)
|
||
|
|
- ✅ **Clearer error messages** ("DB not ready" vs "migration failed")
|
||
|
|
- ✅ **Easier rollback** (re-run job independently)
|
||
|
|
|
||
|
|
### Maintainability:
|
||
|
|
- ✅ **Separation of concerns** (ops vs service code)
|
||
|
|
- ✅ **Easier debugging** (check job logs for migration issues)
|
||
|
|
- ✅ **Clear deployment flow** (job → service)
|
||
|
|
|
||
|
|
### Safety:
|
||
|
|
- ✅ **Fail-fast in production** (service won't start if DB not ready)
|
||
|
|
- ✅ **No create_all in production** (explicit migrations required)
|
||
|
|
- ✅ **Audit trail** (job logs show when migrations ran)
|
||
|
|
|
||
|
|
## Migration Path
|
||
|
|
|
||
|
|
### Step 1: Implement verify_only Mode (Non-Breaking)
|
||
|
|
- Add to `DatabaseInitManager`
|
||
|
|
- Backwards compatible (default: full check)
|
||
|
|
|
||
|
|
### Step 2: Add SKIP_MIGRATIONS Support (Non-Breaking)
|
||
|
|
- Update `BaseFastAPIService`
|
||
|
|
- Default: false (current behavior)
|
||
|
|
|
||
|
|
### Step 3: Enable in Development First
|
||
|
|
- Test with `SKIP_MIGRATIONS=true` locally
|
||
|
|
- Verify services start correctly
|
||
|
|
|
||
|
|
### Step 4: Enable in Staging
|
||
|
|
- Update staging manifests
|
||
|
|
- Monitor startup times and errors
|
||
|
|
|
||
|
|
### Step 5: Enable in Production
|
||
|
|
- Update production manifests
|
||
|
|
- Services fail fast if migrations not applied
|
||
|
|
|
||
|
|
## Recommended Next Steps
|
||
|
|
|
||
|
|
1. **Immediate**: Document current redundancy (✅ this document)
|
||
|
|
|
||
|
|
2. **Short-term** (1-2 days):
|
||
|
|
- Implement `verify_only` mode in `DatabaseInitManager`
|
||
|
|
- Add `SKIP_MIGRATIONS` support in `BaseFastAPIService`
|
||
|
|
- Test in development environment
|
||
|
|
|
||
|
|
3. **Medium-term** (1 week):
|
||
|
|
- Update all service deployments with `SKIP_MIGRATIONS=true`
|
||
|
|
- Add init container to wait for migrations (optional but recommended)
|
||
|
|
- Monitor startup times and error rates
|
||
|
|
|
||
|
|
4. **Long-term** (ongoing):
|
||
|
|
- Document migration process in runbooks
|
||
|
|
- Add migration rollback procedures
|
||
|
|
- Consider migration versioning strategy
|
||
|
|
|
||
|
|
## Summary
|
||
|
|
|
||
|
|
**Current**: Migration Job + Service both run migrations → redundant, slower, confusing
|
||
|
|
|
||
|
|
**Recommended**: Migration Job runs migrations → Service only verifies → clear, fast, reliable
|
||
|
|
|
||
|
|
The key insight: **Migrations are operational concerns, not application concerns**. Services should assume the database is ready, not try to fix it themselves.
|