diff --git a/docs/DEV-PROD-PARITY-ANALYSIS.md b/docs/DEV-PROD-PARITY-ANALYSIS.md
new file mode 100644
index 00000000..ed6d1e71
--- /dev/null
+++ b/docs/DEV-PROD-PARITY-ANALYSIS.md
@@ -0,0 +1,227 @@
+# Dev-Prod Parity Analysis
+
+## Current Differences Between Dev and Prod
+
+### 1. **Replicas**
+- **Dev**: 1 replica per service
+- **Prod**: 2-3 replicas per service
+- **Impact**: Multi-replica issues (race conditions, session handling, etc.) won't be caught in dev
+
+### 2. **Resource Limits**
+- **Dev**: Minimal (64Mi-256Mi RAM, 25m-200m CPU)
+- **Prod**: Not explicitly set (uses defaults from base manifests)
+- **Impact**: Resource exhaustion issues may appear only in prod
+
+### 3. **Environment Variables**
+- **Dev**: DEBUG=true, LOG_LEVEL=DEBUG, PROFILING_ENABLED=true
+- **Prod**: DEBUG=false, LOG_LEVEL=INFO, PROFILING_ENABLED=false
+- **Impact**: Different code paths, performance characteristics
+
+### 4. **CORS Configuration**
+- **Dev**: `*` (wildcard, accepts all origins)
+- **Prod**: Specific domains only
+- **Impact**: CORS issues won't be caught in dev
+
+### 5. **SSL/TLS**
+- **Dev**: HTTP only (ssl-redirect: false)
+- **Prod**: HTTPS required (Let's Encrypt)
+- **Impact**: SSL-related issues not tested in dev
+
+### 6. **Image Pull Policy**
+- **Dev**: `Never` (uses local images)
+- **Prod**: Default (pulls from registry)
+- **Impact**: Image versioning issues not caught in dev
+
+### 7. **Storage Class**
+- **Dev**: Uses default Kind storage
+- **Prod**: Uses `microk8s-hostpath`
+- **Impact**: Storage-related differences
+
+### 8. **Rate Limiting**
+- **Dev**: RATE_LIMIT_ENABLED=false
+- **Prod**: RATE_LIMIT_ENABLED=true
+- **Impact**: Rate limit logic not tested in dev
+
+## Recommendations for Dev-Prod Parity
+
+### ✅ What SHOULD Be Aligned
+
+1. **Resource Limits Structure**
+   - Keep dev limits lower, but use same structure
+   - Use 50% of prod limits in dev
+   - This catches resource issues early
+
+2. **Critical Environment Variables**
+   - Same security settings (password requirements, JWT config)
+   - Same timeout values
+   - Same business rules
+   - Different: DEBUG, LOG_LEVEL (dev needs verbosity)
+
+3. **Some Replicas for Critical Services**
+   - Run 2 replicas of gateway, auth in dev
+   - Catches load balancing and state management issues
+   - Still saves resources vs prod
+
+4. **CORS Configuration**
+   - Use specific origins in dev (localhost, 127.0.0.1)
+   - Catches CORS issues early
+
+5. **Rate Limiting**
+   - Enable in dev with higher limits
+   - Tests the code path without being restrictive
+
+### ⚠️ What SHOULD Stay Different
+
+1. **Debug Settings**
+   - Keep DEBUG=true in dev (needed for development)
+   - Keep verbose logging (LOG_LEVEL=DEBUG)
+   - Keep profiling enabled
+
+2. **SSL/TLS**
+   - Optional: Can enable self-signed certs in dev
+   - But HTTP is simpler for local development
+
+3. **Image Pull Policy**
+   - Keep `Never` in dev (faster iteration)
+   - Local builds are essential for dev workflow
+
+4. **Replica Counts**
+   - 1-2 in dev vs 2-3 in prod (balance between parity and resources)
+
+5. **Monitoring**
+   - Optional in dev to save resources
+   - Essential in prod
+
+## Proposed Changes for Better Dev-Prod Parity
+
+### Option 1: Conservative (Recommended)
+Minimal changes, maximum benefit:
+
+1. **Increase critical service replicas to 2**
+   - gateway: 1 → 2
+   - auth-service: 1 → 2
+   - Tests load balancing, keeps other services at 1
+
+2. **Align resource limits structure**
+   - Use same resource structure as prod
+   - Set to 50% of prod values
+
+3. **Fix CORS in dev**
+   - Use specific origins instead of wildcard
+   - Better matches prod behavior
+
+4. **Enable rate limiting with high limits**
+   - Tests the code path
+   - Won't interfere with development
+
+### Option 2: High Parity (More Resources Needed)
+Maximum similarity, higher resource usage:
+
+1. **Match prod replica counts**
+   - Run 2 replicas of all services
+   - Requires more RAM (12-16GB)
+
+2. **Use production resource limits**
+   - Helps catch OOM issues early
+   - Requires powerful development machine
+
+3. **Enable SSL in dev**
+   - Use self-signed certs
+   - Matches prod HTTPS behavior
+
+4. **Enable all production features**
+   - Monitoring, tracing, etc.
+
+### Option 3: Hybrid (Best Balance)
+Balance between parity and development speed:
+
+1. **2 replicas for stateful/critical services**
+   - gateway, auth, tenant, orders: 2 replicas
+   - Others: 1 replica
+
+2. **Resource limits at 60% of prod**
+   - Catches issues without being restrictive
+
+3. **Production-like configuration**
+   - Same CORS policy (with dev domains)
+   - Rate limiting enabled (higher limits)
+   - Same security settings
+
+4. **Keep dev-friendly features**
+   - DEBUG=true
+   - Verbose logging
+   - Hot reload
+   - HTTP (no SSL)
+
+## Impact Analysis
+
+### Resource Usage Comparison
+
+**Current Dev Setup:**
+- ~20 pods running
+- ~2-3GB RAM
+- ~1-2 CPU cores
+
+**Option 1 (Conservative):**
+- ~22 pods (2 extra replicas)
+- ~3-4GB RAM (+30%)
+- ~1.5-2.5 CPU cores
+
+**Option 2 (High Parity):**
+- ~40 pods (double)
+- ~8-10GB RAM (+200%)
+- ~4-5 CPU cores
+
+**Option 3 (Hybrid):**
+- ~28 pods
+- ~5-6GB RAM (+100%)
+- ~2-3 CPU cores
+
+### Benefits of Increased Parity
+
+1. **Catch Multi-Instance Issues**
+   - Race conditions
+   - Distributed locks
+   - Session management
+   - Load balancing problems
+
+2. **Resource Issues Found Early**
+   - Memory leaks
+   - OOM errors
+   - CPU bottlenecks
+
+3. **Configuration Validation**
+   - CORS issues
+   - Rate limiting bugs
+   - Security misconfigurations
+
+4. **Deployment Confidence**
+   - Fewer surprises in production
+   - Better testing
+   - Reduced rollbacks
+
+### Tradeoffs
+
+**Pros:**
+- ✅ Catches more issues before production
+- ✅ More realistic testing environment
+- ✅ Better confidence in deployments
+- ✅ Team learns production behavior
+
+**Cons:**
+- ❌ Higher resource requirements
+- ❌ Slower startup times
+- ❌ More complex troubleshooting
+- ❌ Longer rebuild cycles
+
+## Implementation Guide
+
+If you want to proceed with **Option 1 (Conservative)**, I can:
+
+1. Update dev kustomization to run 2 replicas of critical services
+2. Add resource limits that mirror prod structure (at 50%)
+3. Fix CORS to use specific origins
+4. Enable rate limiting with dev-friendly limits
+5. Create a "dev-high-parity" profile for those who want closer matching
+
+Would you like me to implement these changes?
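
The "50% of prod" rule in step 2 of the Implementation Guide is plain arithmetic on Kubernetes quantity strings, so it may help to see it spelled out. The sketch below is only an illustration: the helper names (`scale_memory`, `scale_cpu`, `derive_dev_limits`) are hypothetical, and the prod values in `PROD_LIMITS` are placeholders, since the document only says prod inherits its limits from the base manifests. It handles just the `Mi`/`Gi` memory suffixes and CPU given in cores or millicores.

```python
import re

# Memory suffixes supported by this sketch, expressed in mebibytes.
_MEMORY_IN_MI = {"Mi": 1, "Gi": 1024}


def scale_memory(quantity: str, factor: float) -> str:
    """Scale a memory quantity such as '512Mi' or '1Gi'; result is in Mi."""
    match = re.fullmatch(r"(\d+)(Mi|Gi)", quantity)
    if match is None:
        raise ValueError(f"unsupported memory quantity: {quantity!r}")
    mebibytes = int(match.group(1)) * _MEMORY_IN_MI[match.group(2)]
    # Never scale below 1Mi, so the limit stays a valid positive quantity.
    return f"{max(1, round(mebibytes * factor))}Mi"


def scale_cpu(quantity: str, factor: float) -> str:
    """Scale a CPU quantity such as '400m' or '1'; result is in millicores."""
    if quantity.endswith("m"):
        millicores = int(quantity[:-1])
    else:
        millicores = round(float(quantity) * 1000)
    return f"{max(1, round(millicores * factor))}m"


def derive_dev_limits(prod_limits: dict, factor: float = 0.5) -> dict:
    """Return dev limits with the same structure as prod, scaled by factor."""
    return {
        service: {
            "memory": scale_memory(limits["memory"], factor),
            "cpu": scale_cpu(limits["cpu"], factor),
        }
        for service, limits in prod_limits.items()
    }


# Hypothetical prod limits; the real values live in the base manifests.
PROD_LIMITS = {
    "gateway": {"memory": "512Mi", "cpu": "400m"},
    "auth-service": {"memory": "256Mi", "cpu": "200m"},
}

if __name__ == "__main__":
    for service, limits in derive_dev_limits(PROD_LIMITS).items():
        print(service, limits)
```

Passing `factor=0.6` would give the Option 3 (Hybrid) variant. Keeping the output keyed the same way as the input mirrors the "same structure, lower values" recommendation.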