Add dev-prod parity analysis and recommendations

Analyze current differences between development and production environments
and provide three options for improving parity:

1. Conservative: Minimal changes, maximum benefit
   - 2 replicas for critical services
   - Resource limits at 50% of prod
   - Specific CORS origins
   - Resource impact: +30% RAM

2. High Parity: Maximum similarity
   - Match all prod replica counts
   - Production resource limits
   - Enable SSL and monitoring
   - Resource impact: +200% RAM

3. Hybrid: Balanced approach
   - 2 replicas for stateful services
   - Resources at 60% of prod
   - Production configs with dev features
   - Resource impact: +100% RAM

Recommendation: Start with Option 1 for best cost/benefit ratio.
Claude
2026-01-02 19:04:49 +00:00
parent 23b8523b36
commit 50c1eb3469

@@ -0,0 +1,227 @@
# Dev-Prod Parity Analysis
## Current Differences Between Dev and Prod
### 1. **Replicas**
- **Dev**: 1 replica per service
- **Prod**: 2-3 replicas per service
- **Impact**: Multi-replica issues (race conditions, session handling, etc.) won't be caught in dev
### 2. **Resource Limits**
- **Dev**: Minimal (64Mi-256Mi RAM, 25m-200m CPU)
- **Prod**: Not set explicitly in the prod overlay (inherits the defaults from the base manifests)
- **Impact**: Resource exhaustion issues may appear only in prod
### 3. **Environment Variables**
- **Dev**: DEBUG=true, LOG_LEVEL=DEBUG, PROFILING_ENABLED=true
- **Prod**: DEBUG=false, LOG_LEVEL=INFO, PROFILING_ENABLED=false
- **Impact**: Dev and prod exercise different code paths and have different performance characteristics
### 4. **CORS Configuration**
- **Dev**: `*` (wildcard, accepts all origins)
- **Prod**: Specific domains only
- **Impact**: CORS issues won't be caught in dev
### 5. **SSL/TLS**
- **Dev**: HTTP only (ssl-redirect: false)
- **Prod**: HTTPS required (Let's Encrypt)
- **Impact**: SSL-related issues not tested in dev
### 6. **Image Pull Policy**
- **Dev**: `Never` (uses local images)
- **Prod**: Default policy (`IfNotPresent`, or `Always` for `:latest` or untagged images), so images are pulled from the registry
- **Impact**: Image versioning issues not caught in dev
### 7. **Storage Class**
- **Dev**: Uses default Kind storage
- **Prod**: Uses `microk8s-hostpath`
- **Impact**: Storage behavior may differ between the two provisioners, so storage-related issues may only show up in prod
### 8. **Rate Limiting**
- **Dev**: RATE_LIMIT_ENABLED=false
- **Prod**: RATE_LIMIT_ENABLED=true
- **Impact**: Rate limit logic not tested in dev
## Recommendations for Dev-Prod Parity
### ✅ What SHOULD Be Aligned
1. **Resource Limits Structure**
   - Keep dev limits lower, but use the same structure as prod
   - Use 50% of prod limits in dev
   - This catches resource issues early
2. **Critical Environment Variables**
   - Same security settings (password requirements, JWT config)
   - Same timeout values
   - Same business rules
   - Allowed to differ: DEBUG, LOG_LEVEL (dev needs verbosity)
3. **Multiple Replicas for Critical Services**
   - Run 2 replicas of gateway and auth in dev
   - Catches load balancing and state management issues
   - Still saves resources vs prod
4. **CORS Configuration**
   - Use specific origins in dev (localhost, 127.0.0.1)
   - Catches CORS issues early
5. **Rate Limiting**
   - Enable in dev with higher limits
   - Tests the code path without being restrictive
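
To illustrate the rate limiting point, here is a minimal sketch of a limiter whose ceiling comes from configuration, so dev and prod run the same code path with different limits. The class name, the fixed-window algorithm, and both limit values are assumptions made for illustration; this analysis does not describe how the services actually implement rate limiting.

```python
import time
from typing import Callable, Dict, Tuple


class FixedWindowRateLimiter:
    """Allow at most `limit` requests per client in each time window."""

    def __init__(
        self,
        limit: int,
        window_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so tests do not have to sleep
        # client id -> (window start time, requests counted in that window)
        self._windows: Dict[str, Tuple[float, int]] = {}

    def allow(self, client_id: str) -> bool:
        now = self.clock()
        start, count = self._windows.get(client_id, (now, 0))
        if now - start >= self.window_seconds:
            start, count = now, 0  # window expired, start a new one
        if count >= self.limit:
            self._windows[client_id] = (start, count)
            return False
        self._windows[client_id] = (start, count + 1)
        return True


# Hypothetical values: only the ceiling differs between environments.
PROD_LIMIT_PER_MINUTE = 100
DEV_LIMIT_PER_MINUTE = 10_000

dev_limiter = FixedWindowRateLimiter(limit=DEV_LIMIT_PER_MINUTE)
```

With a high dev ceiling, normal development traffic is never rejected, but the rejection branch still exists and can be exercised in a test by constructing the limiter with a small `limit`.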
### ⚠️ What SHOULD Stay Different
1. **Debug Settings**
   - Keep DEBUG=true in dev (needed for development)
   - Keep verbose logging (LOG_LEVEL=DEBUG)
   - Keep profiling enabled
2. **SSL/TLS**
   - Optional: Can enable self-signed certs in dev
   - But HTTP is simpler for local development
3. **Image Pull Policy**
   - Keep `Never` in dev (faster iteration)
   - Local builds are essential for dev workflow
4. **Replica Counts**
   - 1-2 in dev vs 2-3 in prod (balance between parity and resources)
5. **Monitoring**
   - Optional in dev to save resources
   - Essential in prod
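
The split between settings that should match and settings that may differ can be checked mechanically. The sketch below compares two environment-variable maps and reports every key that differs without being on an explicit allowlist. The function name and the allowlist constant are hypothetical; the variable names and values are the ones listed earlier in this analysis.

```python
from typing import Dict, Optional, Set, Tuple

# Settings this analysis says may legitimately differ between environments.
ALLOWED_TO_DIFFER: Set[str] = {"DEBUG", "LOG_LEVEL", "PROFILING_ENABLED"}


def parity_violations(
    dev: Dict[str, str],
    prod: Dict[str, str],
    allowed: Set[str] = ALLOWED_TO_DIFFER,
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {key: (dev_value, prod_value)} for keys that differ but should not."""
    violations = {}
    for key in sorted(set(dev) | set(prod)):
        if key in allowed:
            continue
        dev_value, prod_value = dev.get(key), prod.get(key)
        if dev_value != prod_value:
            violations[key] = (dev_value, prod_value)
    return violations


dev_env = {
    "DEBUG": "true",
    "LOG_LEVEL": "DEBUG",
    "PROFILING_ENABLED": "true",
    "RATE_LIMIT_ENABLED": "false",
}
prod_env = {
    "DEBUG": "false",
    "LOG_LEVEL": "INFO",
    "PROFILING_ENABLED": "false",
    "RATE_LIMIT_ENABLED": "true",
}

if __name__ == "__main__":
    for key, (dev_value, prod_value) in parity_violations(dev_env, prod_env).items():
        print(f"{key}: dev={dev_value!r} prod={prod_value!r}")
```

With the current values, only `RATE_LIMIT_ENABLED` is reported, which matches the recommendation to enable rate limiting in dev.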
## Proposed Changes for Better Dev-Prod Parity
### Option 1: Conservative (Recommended)
Minimal changes, maximum benefit:
1. **Increase critical service replicas to 2**
   - gateway: 1 → 2
   - auth-service: 1 → 2
   - Tests load balancing, keeps other services at 1
2. **Align resource limits structure**
   - Use same resource structure as prod
   - Set to 50% of prod values
3. **Fix CORS in dev**
   - Use specific origins instead of wildcard
   - Better matches prod behavior
4. **Enable rate limiting with high limits**
   - Tests the code path
   - Won't interfere with development
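
As a rough sketch of the "50% of prod values" rule, the helper below scales simple Kubernetes resource quantities by a factor. It handles only whole numbers with an `m`, `Mi`, or `Gi` suffix, which is a deliberate simplification of the full quantity syntax. The function names and the prod values in the example are placeholders, because this analysis does not list the actual prod limits.

```python
import re
from typing import Dict

# Simplified quantity pattern: integer followed by m (millicores), Mi, or Gi.
_QUANTITY = re.compile(r"^(\d+)(m|Mi|Gi)$")


def scale_quantity(quantity: str, factor: float) -> str:
    """Scale a quantity such as '512Mi' or '500m' by `factor`."""
    match = _QUANTITY.match(quantity)
    if match is None:
        raise ValueError(f"unsupported quantity: {quantity!r}")
    value, suffix = int(match.group(1)), match.group(2)
    if suffix == "Gi":
        # Express Gi as Mi so that halving 1Gi does not round down to nothing.
        value, suffix = value * 1024, "Mi"
    scaled = max(1, round(value * factor))
    return f"{scaled}{suffix}"


def scale_limits(prod_limits: Dict[str, str], factor: float = 0.5) -> Dict[str, str]:
    """Apply the same factor to every entry of a limits mapping."""
    return {name: scale_quantity(q, factor) for name, q in prod_limits.items()}


# Placeholder prod limits, used only to show the calculation.
example_prod_limits = {"memory": "512Mi", "cpu": "500m"}
example_dev_limits = scale_limits(example_prod_limits)
```

Keeping the factor as a parameter means the same helper covers the 60% variant in Option 3 by passing `factor=0.6`.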
### Option 2: High Parity (More Resources Needed)
Maximum similarity, higher resource usage:
1. **Match prod replica counts**
   - Run at least 2 replicas of every service (prod runs 2-3)
   - Requires a development machine with more RAM (12-16GB)
2. **Use production resource limits**
   - Helps catch OOM issues early
   - Requires a powerful development machine
3. **Enable SSL in dev**
   - Use self-signed certs
   - Matches prod HTTPS behavior
4. **Enable all production features**
   - Monitoring, tracing, etc.
### Option 3: Hybrid (Best Balance)
Balance between parity and development speed:
1. **2 replicas for stateful/critical services**
   - gateway, auth, tenant, orders: 2 replicas
   - Others: 1 replica
2. **Resource limits at 60% of prod**
   - Catches issues without being restrictive
3. **Production-like configuration**
   - Same CORS policy (with dev domains)
   - Rate limiting enabled (higher limits)
   - Same security settings
4. **Keep dev-friendly features**
   - DEBUG=true
   - Verbose logging
   - Hot reload
   - HTTP (no SSL)
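
To show why "same CORS policy with dev domains" catches more than a wildcard, here is a minimal sketch of an origin check. The function name and the port numbers in the dev origin list are assumptions; real services would normally delegate this to their web framework's CORS middleware.

```python
from typing import Iterable, Optional


def cors_allow_origin(
    request_origin: Optional[str], allowed_origins: Iterable[str]
) -> Optional[str]:
    """Return the Access-Control-Allow-Origin value, or None to omit the header."""
    if request_origin is None:
        return None  # no Origin header on the request, nothing to echo back
    allowed = set(allowed_origins)
    if "*" in allowed:
        return "*"  # wildcard accepts every origin, so mistakes stay hidden
    return request_origin if request_origin in allowed else None


# Hypothetical dev allowlist; the ports are illustrative only.
DEV_ORIGINS = ["http://localhost:3000", "http://127.0.0.1:3000"]
WILDCARD = ["*"]
```

With `WILDCARD`, a request from an unlisted origin is accepted and the misconfiguration goes unnoticed. With `DEV_ORIGINS`, the same request gets no header and fails in the browser during development, as it would in prod.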
## Impact Analysis
### Resource Usage Comparison
**Current Dev Setup:**
- ~20 pods running
- ~2-3GB RAM
- ~1-2 CPU cores
**Option 1 (Conservative):**
- ~22 pods (2 extra replicas)
- ~3-4GB RAM (+30%)
- ~1.5-2.5 CPU cores
**Option 2 (High Parity):**
- ~40 pods (double)
- ~8-10GB RAM (+200%)
- ~4-5 CPU cores
**Option 3 (Hybrid):**
- ~28 pods
- ~5-6GB RAM (+100%)
- ~2-3 CPU cores
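
The percentage figures above are rounded estimates. As a sanity check, the sketch below computes the relative RAM increase from the upper end of each range. Comparing upper bounds is an assumption about how the percentages were derived, and the helper names are hypothetical.

```python
from typing import Dict, Tuple

# RAM ranges in GB, taken from the comparison above.
CURRENT_RAM_GB: Tuple[float, float] = (2, 3)
OPTION_RAM_GB: Dict[str, Tuple[float, float]] = {
    "Option 1": (3, 4),
    "Option 2": (8, 10),
    "Option 3": (5, 6),
}


def percent_increase(baseline: float, new: float) -> float:
    """Relative increase of `new` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (new - baseline) / baseline * 100.0


def upper_bound_increase(option: str) -> float:
    """Compare the upper end of an option's range with the current upper end."""
    return percent_increase(CURRENT_RAM_GB[1], OPTION_RAM_GB[option][1])


if __name__ == "__main__":
    for option in OPTION_RAM_GB:
        print(f"{option}: +{upper_bound_increase(option):.0f}% RAM")
```

Under this assumption the increases come out at roughly +33%, +233%, and +100%, close to the rounded figures quoted above. Comparing lower bounds or midpoints gives larger numbers, so the percentages are best read as indicative.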
### Benefits of Increased Parity
1. **Catch Multi-Instance Issues**
   - Race conditions
   - Distributed locks
   - Session management
   - Load balancing problems
2. **Resource Issues Found Early**
   - Memory leaks
   - OOM errors
   - CPU bottlenecks
3. **Configuration Validation**
   - CORS issues
   - Rate limiting bugs
   - Security misconfigurations
4. **Deployment Confidence**
   - Fewer surprises in production
   - Better testing
   - Reduced rollbacks
### Tradeoffs
**Pros:**
- ✅ Catches more issues before production
- ✅ More realistic testing environment
- ✅ Better confidence in deployments
- ✅ Team learns production behavior
**Cons:**
- ❌ Higher resource requirements
- ❌ Slower startup times
- ❌ More complex troubleshooting
- ❌ Longer rebuild cycles
## Implementation Guide
If you want to proceed with **Option 1 (Conservative)**, I can:
1. Update dev kustomization to run 2 replicas of critical services
2. Add resource limits that mirror prod structure (at 50%)
3. Fix CORS to use specific origins
4. Enable rate limiting with dev-friendly limits
5. Create a "dev-high-parity" profile for those who want closer matching
Would you like me to implement these changes?