Add dev-prod parity analysis and recommendations

Analyze current differences between development and production environments and provide three options for improving parity:

1. Conservative: Minimal changes, maximum benefit
   - 2 replicas for critical services
   - Resource limits at 50% of prod
   - Specific CORS origins
   - Resource impact: +30% RAM

2. High Parity: Maximum similarity
   - Match all prod replica counts
   - Production resource limits
   - Enable SSL and monitoring
   - Resource impact: +200% RAM

3. Hybrid: Balanced approach
   - 2 replicas for stateful services
   - Resources at 60% of prod
   - Production configs with dev features
   - Resource impact: +100% RAM

Recommendation: Start with Option 1 for best cost/benefit ratio.
2026-01-02 19:04:49 +00:00
parent 23b8523b36
commit 50c1eb3469
1 changed files with 227 additions and 0 deletions
--- a/docs/DEV-PROD-PARITY-ANALYSIS.md
+++ b/docs/DEV-PROD-PARITY-ANALYSIS.md
@@ -0,0 +1,227 @@
+# Dev-Prod Parity Analysis
+
+## Current Differences Between Dev and Prod
+
+### 1. **Replicas**
+- **Dev**: 1 replica per service
+- **Prod**: 2-3 replicas per service
+- **Impact**: Multi-replica issues (race conditions, session handling, etc.) won't be caught in dev
+
+### 2. **Resource Limits**
+- **Dev**: Minimal (64Mi-256Mi RAM, 25m-200m CPU)
+- **Prod**: Not explicitly set (uses defaults from base manifests)
+- **Impact**: Resource exhaustion issues may appear only in prod
+
+### 3. **Environment Variables**
+- **Dev**: DEBUG=true, LOG_LEVEL=DEBUG, PROFILING_ENABLED=true
+- **Prod**: DEBUG=false, LOG_LEVEL=INFO, PROFILING_ENABLED=false
+- **Impact**: Dev and prod exercise different code paths and have different performance characteristics
+
+### 4. **CORS Configuration**
+- **Dev**: `*` (wildcard, accepts all origins)
+- **Prod**: Specific domains only
+- **Impact**: CORS issues won't be caught in dev
+
+### 5. **SSL/TLS**
+- **Dev**: HTTP only (ssl-redirect: false)
+- **Prod**: HTTPS required (Let's Encrypt)
+- **Impact**: SSL-related issues not tested in dev
+
+### 6. **Image Pull Policy**
+- **Dev**: `Never` (uses local images)
+- **Prod**: Default (pulls from registry)
+- **Impact**: Registry pull and image tag issues (missing tags, stale images) are not caught in dev
+
+### 7. **Storage Class**
+- **Dev**: Uses default Kind storage
+- **Prod**: Uses `microk8s-hostpath`
+- **Impact**: Volume provisioning behavior (storage class names, binding, persistence) may differ between environments
+
+### 8. **Rate Limiting**
+- **Dev**: RATE_LIMIT_ENABLED=false
+- **Prod**: RATE_LIMIT_ENABLED=true
+- **Impact**: Rate limit logic not tested in dev
+
+## Recommendations for Dev-Prod Parity
+
+### ✅ What SHOULD Be Aligned
+
+1. **Resource Limits Structure**
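+   - Keep dev limits lower, but use the same requests/limits structure as prod
+   - Use 50% of prod limits in dev; because prod currently inherits defaults from the base manifests, set explicit prod values first so there is a baseline to halve
+   - This catches resource issues early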
+
+2. **Critical Environment Variables**
+   - Same security settings (password requirements, JWT config)
+   - Same timeout values
+   - Same business rules
+   - Different: DEBUG, LOG_LEVEL (dev needs verbosity)
+
+3. **Some Replicas for Critical Services**
+   - Run 2 replicas of gateway, auth in dev
+   - Catches load balancing and state management issues
+   - Still saves resources vs prod
+
+4. **CORS Configuration**
+   - Use specific origins in dev (localhost, 127.0.0.1)
+   - Catches CORS issues early
+
+5. **Rate Limiting**
+   - Enable in dev with higher limits
+   - Tests the code path without being restrictive
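+
+As an illustration of items 2, 4, and 5, a dev ConfigMap might look roughly like the sketch below. Only `DEBUG`, `LOG_LEVEL`, and `RATE_LIMIT_ENABLED` come from the settings listed above; the ConfigMap name, the `CORS_ALLOWED_ORIGINS` and `RATE_LIMIT_REQUESTS_PER_MINUTE` keys, the ports, and the limit value are hypothetical and would need to match whatever the services actually read.
+
+```yaml
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: app-config                  # hypothetical name
+data:
+  # Stays different from prod: dev needs verbosity.
+  DEBUG: "true"
+  LOG_LEVEL: "DEBUG"
+  # Aligned with prod: explicit origins instead of "*".
+  # Key name and ports are assumptions.
+  CORS_ALLOWED_ORIGINS: "http://localhost:3000,http://127.0.0.1:3000"
+  # Aligned with prod: the rate-limit code path is on, with a generous ceiling.
+  RATE_LIMIT_ENABLED: "true"
+  RATE_LIMIT_REQUESTS_PER_MINUTE: "10000"   # hypothetical key and value
+```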
+
+### ⚠️ What SHOULD Stay Different
+
+1. **Debug Settings**
+   - Keep DEBUG=true in dev (needed for development)
+   - Keep verbose logging (LOG_LEVEL=DEBUG)
+   - Keep profiling enabled
+
+2. **SSL/TLS**
+   - Optional: Can enable self-signed certs in dev
+   - But HTTP is simpler for local development
+
+3. **Image Pull Policy**
+   - Keep `Never` in dev (faster iteration)
+   - Local builds are essential for dev workflow
+
+4. **Replica Counts**
+   - Dev runs 1-2 replicas per service versus 2-3 in prod (a balance between parity and resources)
+
+5. **Monitoring**
+   - Optional in dev to save resources
+   - Essential in prod
+
+## Proposed Changes for Better Dev-Prod Parity
+
+### Option 1: Conservative (Recommended)
+Minimal changes, maximum benefit:
+
+1. **Increase critical service replicas to 2**
+   - gateway: 1 → 2
+   - auth-service: 1 → 2
+   - Tests load balancing, keeps other services at 1
+
+2. **Align resource limits structure**
+   - Use same resource structure as prod
+   - Set to 50% of prod values
+
+3. **Fix CORS in dev**
+   - Use specific origins instead of wildcard
+   - Better matches prod behavior
+
+4. **Enable rate limiting with high limits**
+   - Tests the code path
+   - Won't interfere with development
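+
+A minimal sketch of how items 1 and 2 might look as a kustomize overlay. The paths, the Deployment names, the container index, and every resource value are assumptions for illustration. The 256Mi/250m limits stand in for "50% of prod" against a hypothetical prod limit of 512Mi/500m, since prod does not set explicit limits today.
+
+```yaml
+# overlays/dev/kustomization.yaml (hypothetical path)
+apiVersion: kustomize.config.k8s.io/v1beta1
+kind: Kustomization
+
+resources:
+  - ../../base            # assumed location of the shared base manifests
+
+# Item 1: two replicas for the critical services only.
+replicas:
+  - name: gateway
+    count: 2
+  - name: auth-service
+    count: 2
+
+# Item 2: same requests/limits structure as prod, at roughly half the values.
+# Repeat the patch for each Deployment that should carry limits.
+patches:
+  - target:
+      kind: Deployment
+      name: gateway
+    patch: |-
+      # "add" sets the field whether or not it already exists.
+      - op: add
+        path: /spec/template/spec/containers/0/resources
+        value:
+          requests:
+            memory: 128Mi  # placeholder value
+            cpu: 100m      # placeholder value
+          limits:
+            memory: 256Mi  # placeholder: 50% of an assumed 512Mi prod limit
+            cpu: 250m      # placeholder: 50% of an assumed 500m prod limit
+```
+
+Rendering the overlay with `kubectl kustomize overlays/dev` before applying it is a cheap way to confirm that the replica counts and limits landed where expected.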
+
+### Option 2: High Parity (More Resources Needed)
+Maximum similarity, higher resource usage:
+
+1. **Match prod replica counts**
+   - Run at least 2 replicas of every service (prod runs 2-3)
+   - Requires more RAM: plan for a 12-16GB machine to fit the estimated 8-10GB cluster footprint
+
+2. **Use production resource limits**
+   - Helps catch OOM issues early
+   - Requires powerful development machine
+
+3. **Enable SSL in dev**
+   - Use self-signed certs
+   - Matches prod HTTPS behavior
+
+4. **Enable all production features**
+   - Monitoring, tracing, etc.
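+
+For item 3, one possible shape is a self-signed issuer plus a TLS-enabled Ingress. This sketch assumes that cert-manager and ingress-nginx are installed in the dev cluster, which this analysis does not confirm. The issuer name, host, secret name, and backend Service name and port are all hypothetical.
+
+```yaml
+# Self-signed issuer, standing in for the Let's Encrypt issuer used in prod.
+apiVersion: cert-manager.io/v1
+kind: ClusterIssuer
+metadata:
+  name: selfsigned-dev              # hypothetical name
+spec:
+  selfSigned: {}
+---
+apiVersion: networking.k8s.io/v1
+kind: Ingress
+metadata:
+  name: gateway                     # assumed Ingress name
+  annotations:
+    cert-manager.io/cluster-issuer: selfsigned-dev
+    # Flip the dev setting from "false" to match prod's HTTPS-only behavior.
+    nginx.ingress.kubernetes.io/ssl-redirect: "true"
+spec:
+  ingressClassName: nginx
+  tls:
+    - hosts:
+        - dev.localhost             # hypothetical dev hostname
+      secretName: gateway-dev-tls   # cert-manager stores the certificate here
+  rules:
+    - host: dev.localhost
+      http:
+        paths:
+          - path: /
+            pathType: Prefix
+            backend:
+              service:
+                name: gateway       # assumed Service name
+                port:
+                  number: 80        # assumed Service port
+```
+
+Browsers and HTTP clients will reject the self-signed certificate unless it is explicitly trusted, which is part of the extra friction this option trades for parity.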
+
+### Option 3: Hybrid (Best Balance)
+Balance between parity and development speed:
+
+1. **2 replicas for stateful/critical services**
+   - gateway, auth, tenant, orders: 2 replicas
+   - Others: 1 replica
+
+2. **Resource limits at 60% of prod**
+   - Catches issues without being restrictive
+
+3. **Production-like configuration**
+   - Same CORS policy (with dev domains)
+   - Rate limiting enabled (higher limits)
+   - Same security settings
+
+4. **Keep dev-friendly features**
+   - DEBUG=true
+   - Verbose logging
+   - Hot reload
+   - HTTP (no SSL)
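+
+The replica split in item 1 could be layered on top of the existing dev overlay rather than replacing it. The sketch below assumes a kustomize layout with a sibling `dev` overlay; the paths and the `tenant-service` and `orders-service` Deployment names are guesses based on the service names above.
+
+```yaml
+# overlays/dev-hybrid/kustomization.yaml (hypothetical path)
+apiVersion: kustomize.config.k8s.io/v1beta1
+kind: Kustomization
+
+resources:
+  - ../dev                  # reuse the current dev overlay (assumed to exist)
+
+# Stateful/critical services get 2 replicas; everything else keeps
+# the single replica inherited from the dev overlay.
+replicas:
+  - name: gateway
+    count: 2
+  - name: auth-service
+    count: 2
+  - name: tenant-service    # assumed Deployment name
+    count: 2
+  - name: orders-service    # assumed Deployment name
+    count: 2
+```
+
+Keeping this as a separate overlay leaves the lightweight dev setup available for machines that cannot spare the extra RAM.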
+
+## Impact Analysis
+
+### Resource Usage Comparison
+
+**Current Dev Setup:**
+- ~20 pods running
+- ~2-3GB RAM
+- ~1-2 CPU cores
+
+**Option 1 (Conservative):**
+- ~22 pods (2 extra replicas)
+- ~3-4GB RAM (+30%)
+- ~1.5-2.5 CPU cores
+
+**Option 2 (High Parity):**
+- ~40 pods (double)
+- ~8-10GB RAM (+200%)
+- ~4-5 CPU cores
+
+**Option 3 (Hybrid):**
+- ~28 pods
+- ~5-6GB RAM (+100%)
+- ~2-3 CPU cores
+
+### Benefits of Increased Parity
+
+1. **Catch Multi-Instance Issues**
+   - Race conditions
+   - Distributed locks
+   - Session management
+   - Load balancing problems
+
+2. **Resource Issues Found Early**
+   - Memory leaks
+   - OOM errors
+   - CPU bottlenecks
+
+3. **Configuration Validation**
+   - CORS issues
+   - Rate limiting bugs
+   - Security misconfigurations
+
+4. **Deployment Confidence**
+   - Fewer surprises in production
+   - Better testing
+   - Reduced rollbacks
+
+### Tradeoffs
+
+**Pros:**
+- ✅ Catches more issues before production
+- ✅ More realistic testing environment
+- ✅ Better confidence in deployments
+- ✅ Team learns production behavior
+
+**Cons:**
+- ❌ Higher resource requirements
+- ❌ Slower startup times
+- ❌ More complex troubleshooting
+- ❌ Longer rebuild cycles
+
+## Implementation Guide
+
+To proceed with **Option 1 (Conservative)**:
+
+1. Update the dev kustomization to run 2 replicas of the critical services (gateway, auth-service)
+2. Add resource limits that mirror the prod structure (at 50% of prod values)
+3. Replace the wildcard CORS origin with specific origins
+4. Enable rate limiting with dev-friendly limits
+5. Create a "dev-high-parity" profile for those who want closer matching