Fix startup issues
532
SERVICE_INITIALIZATION_ARCHITECTURE.md
Normal file
@@ -0,0 +1,532 @@
# Service Initialization Architecture Analysis

## Current Architecture Problem

You've correctly identified a **redundancy and architectural inconsistency** in the current setup:

### What's Happening Now:

```
Kubernetes Deployment Flow:
1. Migration Job runs → applies Alembic migrations → completes
2. Service Pod starts → runs migrations AGAIN in startup → service ready
```

### The Redundancy:

**Migration Job** (`external-migration`):
- Runs: `/app/scripts/run_migrations.py external`
- Calls: `initialize_service_database()`
- Applies: Alembic migrations via `alembic upgrade head`
- Status: Completes successfully

**Service Startup** (`external-service` pod):
- Runs: `BaseFastAPIService._handle_database_tables()` (lines 219-241)
- Calls: `initialize_service_database()` **AGAIN**
- Applies: Alembic migrations via `alembic upgrade head` **AGAIN**
- From logs:
  ```
  2025-10-01 09:26:01 [info] Running pending migrations service=external
  INFO [alembic.runtime.migration] Context impl PostgresqlImpl.
  INFO [alembic.runtime.migration] Will assume transactional DDL.
  2025-10-01 09:26:01 [info] Migrations applied successfully service=external
  ```

## Why This Is Problematic

### 1. **Duplicated Logic**
- Same code runs twice (`initialize_service_database()`)
- Both use the same `DatabaseInitManager`
- Both check migration state and run the Alembic upgrade

### 2. **Unclear Separation of Concerns**
- **Migration Job**: Supposed to handle migrations
- **Service Startup**: Also handling migrations
- Which one is the source of truth?

### 3. **Potential Race Conditions**
If multiple service replicas start simultaneously:
- All replicas run migrations concurrently
- Alembic itself does not provide a migration lock; concurrent runs rely on the database's transactional DDL and locks, and each run still adds overhead
- Unnecessary database load
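
If concurrent runs ever need to be serialized explicitly, one common option on PostgreSQL is a session-level advisory lock held around the upgrade. The following is a minimal sketch, not part of the existing codebase: `migration_lock`, `MIGRATION_LOCK_KEY`, and the `execute` callable are hypothetical names, and `execute` is assumed to behave like a DB-API cursor's `execute(sql, params)` bound to a single connection.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, Tuple

# Hypothetical lock key: any integer shared by all replicas of one service.
MIGRATION_LOCK_KEY = 727001


@contextmanager
def migration_lock(
    execute: Callable[[str, Tuple[int]], object],
    key: int = MIGRATION_LOCK_KEY,
) -> Iterator[None]:
    """Hold a PostgreSQL session-level advisory lock while the block runs.

    `execute` must always use the same connection, because session-level
    advisory locks belong to the session that acquired them.
    """
    # Blocks until no other session holds the lock for this key.
    execute("SELECT pg_advisory_lock(%s)", (key,))
    try:
        yield
    finally:
        # Release the lock even if the migration raised.
        execute("SELECT pg_advisory_unlock(%s)", (key,))
```

A caller would wrap the upgrade step, for example `with migration_lock(cursor.execute): run_upgrade()`, where `run_upgrade` stands in for whatever invokes `alembic upgrade head`. Replicas that arrive later wait for the lock and then find nothing left to apply.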

### 4. **Slower Startup Times**
Every service pod runs a full migration check on startup:
- Connects to database
- Checks migration state
- Runs `alembic upgrade head` (even if no-op)
- Adds 1-2 seconds to startup

### 5. **Confusion About Responsibilities**
From logs, the service is doing migration work:
```
[info] Running pending migrations service=external
```
This is NOT what a service should do: it should assume the DB is ready.

## Architectural Patterns (Best Practices)

### Pattern 1: **Init Container Pattern** (Recommended for K8s)

```yaml
Deployment:
  initContainers:
    - name: wait-for-migrations
      # Wait for migration job to complete
    - name: run-migrations  # Optional: inline migrations
      command: alembic upgrade head

  containers:
    - name: service
      # Service starts AFTER migrations complete
      # Service does NOT run migrations
```

**Pros:**
- ✅ Clear separation: Init containers handle setup, main container serves traffic
- ✅ Ordered startup: Init containers within a pod run sequentially and must finish before the main container starts
- ✅ Fast service startup: Assumes DB is ready
- ✅ Multiple replicas safe once the schema is at head: every pod runs its init containers, but later runs are no-ops

**Cons:**
- ⚠ Init containers increase pod startup time
- ⚠ Need proper migration locking across pods (Alembic does not provide this; use a database-level lock)

### Pattern 2: **Standalone Migration Job** (Your Current Approach - Almost)

```yaml
# Schematic only, not a complete manifest
Job:
  name: migration-job
  command: alembic upgrade head
  # Runs once on deployment

Deployment:
  name: service
  # Service assumes DB is ready
  # NO migration logic in service code
```

**Pros:**
- ✅ Complete separation: Migrations are a separate workload
- ✅ Clear lifecycle: Job completes before service starts
- ✅ Fast service startup: No migration checks
- ✅ Easy rollback: Re-run job with specific version

**Cons:**
- ⚠ Need orchestration: Ensure job completes before service starts
- ⚠ Deployment complexity: Manage job + deployment separately

### Pattern 3: **Service Self-Migration** (Anti-pattern in Production)

```yaml
Deployment: service
  # Service runs migrations on startup
  # What you're doing now in both places
```

**Pros:**
- ✅ Simple deployment: Single resource
- ✅ Always in sync: Migrations bundled with service

**Cons:**
- ❌ Race conditions with multiple replicas
- ❌ Slower startup: Every pod checks migrations
- ❌ Service code mixed with operational concerns
- ❌ Harder to debug: Migration failures look like service failures

## Recommended Architecture

### **Hybrid Approach: Init Container + Fallback Check**

```yaml
# 1. Pre-deployment Migration Job (runs once)
apiVersion: batch/v1
kind: Job
metadata:
  name: external-migration
spec:
  template:
    spec:
      containers:
        - name: migrate
          command: ["alembic", "upgrade", "head"]
          # Runs FULL migration logic

---
# 2. Service Deployment (depends on job)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: external-service
spec:
  template:
    spec:
      initContainers:
        - name: wait-for-db
          # Wait for database to be ready

        # NEW: Wait for migrations to complete
        - name: wait-for-migrations
          command: ["sh", "-c", "
            until alembic current | grep -q 'head'; do
              echo 'Waiting for migrations...';
              sleep 2;
            done
          "]

      containers:
        - name: service
          # Service startup with MINIMAL migration check
          env:
            - name: SKIP_MIGRATIONS
              value: "true"  # Service won't run migrations
```

### Service Code Changes:

**Current** (`shared/service_base.py` lines 219-241):
```python
async def _handle_database_tables(self):
    """Handle automatic table creation and migration management"""
    # Always runs full migration check
    result = await initialize_service_database(
        database_manager=self.database_manager,
        service_name=self.service_name,
        force_recreate=force_recreate
    )
```

**Recommended**:
```python
import os

from sqlalchemy import text


async def _handle_database_tables(self):
    """Verify database is ready (migrations already applied)"""

    # Check if we should skip migrations (production mode)
    skip_migrations = os.getenv("SKIP_MIGRATIONS", "false").lower() == "true"
    force_recreate = os.getenv("DB_FORCE_RECREATE", "false").lower() == "true"

    if skip_migrations:
        # Production mode: Only verify, don't run migrations
        await self._verify_database_ready()
    else:
        # Development mode: Run full migration check
        result = await initialize_service_database(
            database_manager=self.database_manager,
            service_name=self.service_name,
            force_recreate=force_recreate
        )

async def _verify_database_ready(self):
    """Quick check that database and tables exist"""
    try:
        # Check connection
        if not await self.database_manager.test_connection():
            raise Exception("Database connection failed")

        # Check expected tables exist (if specified)
        if self.expected_tables:
            async with self.database_manager.get_session() as session:
                for table in self.expected_tables:
                    # Bind the table name as a parameter instead of
                    # interpolating it into the SQL string
                    result = await session.execute(
                        text("""
                            SELECT EXISTS (
                                SELECT FROM information_schema.tables
                                WHERE table_schema = 'public'
                                AND table_name = :table_name
                            )
                        """),
                        {"table_name": table},
                    )
                    if not result.scalar():
                        raise Exception(f"Expected table '{table}' not found")

        self.logger.info("Database verification successful")
    except Exception as e:
        self.logger.error("Database verification failed", error=str(e))
        raise
```

## Migration Strategy Comparison

### Current State:
```
┌─────────────────┐
│  Migration Job  │ ──> Runs migrations
└─────────────────┘
         │
         ├─> Job completes
         │
         ▼
┌─────────────────┐
│  Service Pod 1  │ ──> Runs migrations AGAIN ❌
└─────────────────┘
         │
┌─────────────────┐
│  Service Pod 2  │ ──> Runs migrations AGAIN ❌
└─────────────────┘
         │
┌─────────────────┐
│  Service Pod 3  │ ──> Runs migrations AGAIN ❌
└─────────────────┘
```

### Recommended State:
```
┌─────────────────┐
│  Migration Job  │ ──> Runs migrations ONCE ✅
└─────────────────┘
         │
         ├─> Job completes
         │
         ▼
┌─────────────────┐
│  Service Pod 1  │ ──> Verifies DB ready only ✅
└─────────────────┘
         │
┌─────────────────┐
│  Service Pod 2  │ ──> Verifies DB ready only ✅
└─────────────────┘
         │
┌─────────────────┐
│  Service Pod 3  │ ──> Verifies DB ready only ✅
└─────────────────┘
```

## Implementation Plan

### Phase 1: Add Verification-Only Mode

**File**: `shared/database/init_manager.py`

Add new mode: `verify_only`

```python
from typing import Any, Dict


class DatabaseInitManager:
    def __init__(
        self,
        # ... existing params
        verify_only: bool = False  # NEW
    ):
        self.verify_only = verify_only

    async def initialize_database(self) -> Dict[str, Any]:
        if self.verify_only:
            return await self._verify_database_state()

        # Existing logic for full initialization
        # ...

    async def _verify_database_state(self) -> Dict[str, Any]:
        """Quick verification that database is properly initialized"""
        db_state = await self._check_database_state()

        if not db_state["has_migrations"]:
            raise Exception("No migrations found - database not initialized")

        if db_state["is_empty"]:
            raise Exception("Database has no tables - migrations not applied")

        if not db_state["has_alembic_version"]:
            raise Exception("No alembic_version table - migrations not tracked")

        return {
            "action": "verified",
            "message": "Database verified successfully",
            "current_revision": db_state["current_revision"]
        }
```
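
The checks in `_verify_database_state` confirm that migrations are tracked, but they do not compare the recorded revision with the newest migration script. A stricter check could compare the two sets of heads. The sketch below is an assumption layered on top of the plan, not existing code: `is_at_head` is a hypothetical helper, and its inputs are assumed to come from Alembic's `MigrationContext.get_current_heads()` (database side) and `ScriptDirectory.get_heads()` (script side).

```python
from typing import Iterable


def is_at_head(current_heads: Iterable[str], script_heads: Iterable[str]) -> bool:
    """Return True when the database revisions match the script heads.

    An empty set of script heads means no migrations were found, which is
    treated as "not ready" rather than trivially up to date.
    """
    current = set(current_heads)
    expected = set(script_heads)
    return bool(expected) and current == expected
```

With a check like this, a pod started against a database that is one revision behind would fail verification instead of serving traffic against an outdated schema.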

### Phase 2: Update BaseFastAPIService

**File**: `shared/service_base.py`

```python
import os


async def _handle_database_tables(self):
    """Handle database initialization based on environment"""

    # Determine mode
    skip_migrations = os.getenv("SKIP_MIGRATIONS", "false").lower() == "true"
    force_recreate = os.getenv("DB_FORCE_RECREATE", "false").lower() == "true"

    # Import here to avoid circular imports
    from shared.database.init_manager import initialize_service_database

    try:
        if skip_migrations:
            self.logger.info("Migration skip enabled - verifying database only")
            result = await initialize_service_database(
                database_manager=self.database_manager,
                service_name=self.service_name.replace("-service", ""),
                verify_only=True  # NEW parameter
            )
        else:
            self.logger.info("Running full database initialization")
            result = await initialize_service_database(
                database_manager=self.database_manager,
                service_name=self.service_name.replace("-service", ""),
                force_recreate=force_recreate,
                verify_only=False
            )

        self.logger.info("Database initialization completed", result=result)

    except Exception as e:
        self.logger.error("Database initialization failed", error=str(e))
        raise  # Fail fast in production
```
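
Both flags above are parsed with the same `os.getenv(...).lower() == "true"` expression. If more flags are added, a small helper keeps the parsing in one place. This is only a sketch: `env_flag` is a hypothetical name, and accepting `1`, `yes`, and `on` in addition to `true` is an assumption that goes beyond the current behavior.

```python
import os
from typing import Mapping, Optional

# Assumption: these spellings all mean "enabled"; the plan itself only uses "true".
_TRUE_VALUES = {"1", "true", "yes", "on"}


def env_flag(
    name: str,
    default: bool = False,
    environ: Optional[Mapping[str, str]] = None,
) -> bool:
    """Read a boolean flag from the environment, case-insensitively."""
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None:
        # Unset variables fall back to the caller's default.
        return default
    return raw.strip().lower() in _TRUE_VALUES
```

Usage would look like `skip_migrations = env_flag("SKIP_MIGRATIONS")`. The optional `environ` argument exists so the helper can be exercised in tests without touching the real process environment.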

### Phase 3: Update Kubernetes Manifests

**Add to all service deployments**:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: external-service
spec:
  template:
    spec:
      containers:
        - name: external-service
          env:
            # NEW: Skip migrations in service, rely on Job
            - name: SKIP_MIGRATIONS
              value: "true"

            # Keep ENVIRONMENT for production safety
            - name: ENVIRONMENT
              value: "production"  # or "development"
```

### Phase 4: Optional - Add Init Container Dependency

**For production safety**:

```yaml
spec:
  template:
    spec:
      initContainers:
        - name: wait-for-migrations
          image: postgres:15-alpine
          command: ["sh", "-c"]
          args:
            - |
              echo "Waiting for migrations to be applied..."
              export PGPASSWORD="$DB_PASSWORD"

              # Wait for alembic_version table to exist
              until psql -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" -c "SELECT version_num FROM alembic_version" > /dev/null 2>&1; do
                echo "Migrations not yet applied, waiting..."
                sleep 2
              done

              echo "Migrations detected, service can start"
          env:
            - name: DB_HOST
              value: "external-db-service"
            # ... other DB connection details
```
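
The shell loop above retries forever, so a migration job that never succeeds leaves the pod stuck in its init phase. If the wait were done from Python instead, a bounded poll is straightforward. The sketch below is illustrative only: `wait_until` is a hypothetical helper, and `check` stands in for whatever probes the `alembic_version` table.

```python
import time
from typing import Callable


def wait_until(
    check: Callable[[], bool],
    timeout: float = 300.0,
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `check` until it returns True or `timeout` seconds have passed.

    Returns True on success and False on timeout, so the caller can exit
    non-zero and let Kubernetes report the failure.
    """
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

The `sleep` and `clock` parameters default to the real functions and are injectable only so the loop can be tested without waiting.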

## Environment Configuration Matrix

| Environment | Migration Job | Service Startup | Use Case |
|-------------|---------------|-----------------|----------|
| **Development** | Optional | Run migrations | Fast iteration, create_all fallback OK |
| **Staging** | Required | Verify only | Test migration workflow |
| **Production** | Required | Verify only | Safety first, fail fast |
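
The matrix can also be expressed as a small decision function. This is a sketch under assumptions rather than the planned behavior: `startup_mode` is a hypothetical helper, and deriving the default from `ENVIRONMENT` (with unknown environments treated like production) is a design choice the plan does not make, since it defaults `SKIP_MIGRATIONS` to `false` everywhere.

```python
from typing import Optional


def startup_mode(environment: str, skip_migrations: Optional[bool] = None) -> str:
    """Return "verify_only" or "run_migrations" for a service pod.

    An explicit SKIP_MIGRATIONS value always wins. Without one, only
    "development" runs migrations; every other environment verifies only.
    """
    if skip_migrations is not None:
        return "verify_only" if skip_migrations else "run_migrations"
    if environment.strip().lower() == "development":
        return "run_migrations"
    return "verify_only"
```

Defaulting unknown environments to verification keeps a mistyped `ENVIRONMENT` value from silently enabling migrations in a production-like cluster.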

## Configuration Examples

### Development (Current Behavior - OK)
```yaml
env:
  - name: ENVIRONMENT
    value: "development"
  - name: SKIP_MIGRATIONS
    value: "false"
  - name: DB_FORCE_RECREATE
    value: "false"
```
**Behavior**: Service runs full migration check, allows create_all fallback

### Staging/Production (Recommended)
```yaml
env:
  - name: ENVIRONMENT
    value: "production"
  - name: SKIP_MIGRATIONS
    value: "true"
  - name: DB_FORCE_RECREATE
    value: "false"
```
**Behavior**:
- Service only verifies database is ready
- No migration execution in service
- Fails fast if database not properly initialized

## Benefits of Proposed Architecture

### Performance:
- ✅ **An estimated 50-80% faster service startup** (skipping the migration check saves ~1-2 seconds per pod)
- ✅ **Reduced database load** (no concurrent migration checks from multiple pods)
- ✅ **Faster horizontal scaling** (new pods only verify the database instead of running migrations)

### Reliability:
- ✅ **No race conditions** (only the job runs migrations)
- ✅ **Clearer error messages** ("DB not ready" vs "migration failed")
- ✅ **Easier rollback** (re-run job independently)

### Maintainability:
- ✅ **Separation of concerns** (ops vs service code)
- ✅ **Easier debugging** (check job logs for migration issues)
- ✅ **Clear deployment flow** (job → service)

### Safety:
- ✅ **Fail-fast in production** (service won't start if DB not ready)
- ✅ **No create_all in production** (explicit migrations required)
- ✅ **Audit trail** (job logs show when migrations ran)

## Migration Path

### Step 1: Implement verify_only Mode (Non-Breaking)
- Add to `DatabaseInitManager`
- Backwards compatible (default: full check)

### Step 2: Add SKIP_MIGRATIONS Support (Non-Breaking)
- Update `BaseFastAPIService`
- Default: false (current behavior)

### Step 3: Enable in Development First
- Test with `SKIP_MIGRATIONS=true` locally
- Verify services start correctly

### Step 4: Enable in Staging
- Update staging manifests
- Monitor startup times and errors

### Step 5: Enable in Production
- Update production manifests
- Services fail fast if migrations not applied

## Recommended Next Steps

1. **Immediate**: Document current redundancy (✅ this document)

2. **Short-term** (1-2 days):
   - Implement `verify_only` mode in `DatabaseInitManager`
   - Add `SKIP_MIGRATIONS` support in `BaseFastAPIService`
   - Test in development environment

3. **Medium-term** (1 week):
   - Update all service deployments with `SKIP_MIGRATIONS=true`
   - Add init container to wait for migrations (optional but recommended)
   - Monitor startup times and error rates

4. **Long-term** (ongoing):
   - Document migration process in runbooks
   - Add migration rollback procedures
   - Consider migration versioning strategy

## Summary

**Current**: Migration Job + Service both run migrations → redundant, slower, confusing

**Recommended**: Migration Job runs migrations → Service only verifies → clear, fast, reliable

The key insight: **Migrations are operational concerns, not application concerns**. Services should assume the database is ready, not try to fix it themselves.