# Bakery-IA Production Operations Guide

**Complete guide for operating, monitoring, and maintaining the production environment**

**Last Updated:** 2026-01-07
**Target Audience:** DevOps, SRE, System Administrators
**Security Grade:** A-

---

## Table of Contents

1. [Overview](#overview)
2. [Monitoring & Observability](#monitoring--observability)
3. [Security Operations](#security-operations)
4. [Database Management](#database-management)
5. [Backup & Recovery](#backup--recovery)
6. [Performance Optimization](#performance-optimization)
7. [Scaling Operations](#scaling-operations)
8. [Incident Response](#incident-response)
9. [Maintenance Tasks](#maintenance-tasks)
10. [Compliance & Audit](#compliance--audit)

---

## Overview

### Production Environment

**Infrastructure:**
- **Platform:** MicroK8s on Ubuntu 22.04 LTS
- **Services:** 18 microservices, 14 databases, monitoring stack
- **Capacity:** 10-tenant pilot (scalable to 100+)
- **Security:** TLS encryption, RBAC, audit logging
- **Monitoring:** Prometheus, Grafana, AlertManager, Jaeger

**Key Metrics (10-tenant baseline):**
- **Uptime Target:** 99.5% (3.65 hours downtime/month)
- **Response Time:** <2s average API response
- **Error Rate:** <1% of requests
- **Database Connections:** ~200 concurrent
- **Memory Usage:** 12-15 GB / 20 GB capacity
- **CPU Usage:** 40-60% under normal load

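
The downtime figure attached to the uptime target can be reproduced with a little arithmetic. The sketch below is illustrative only: the function name and the 730-hour average month are assumptions, not part of the platform.

```python
HOURS_PER_MONTH = 730  # assumption: average month, 365 * 24 / 12


def downtime_budget_hours(uptime_target_pct: float,
                          period_hours: float = HOURS_PER_MONTH) -> float:
    """Return the hours of downtime a given uptime target allows per period."""
    if not 0 <= uptime_target_pct <= 100:
        raise ValueError("uptime target must be a percentage between 0 and 100")
    return (100 - uptime_target_pct) / 100 * period_hours


# A 99.5% target over an average month leaves roughly 3.65 hours of downtime.
print(f"{downtime_budget_hours(99.5):.2f} hours/month")
```

Tightening the target to 99.9% with the same helper shrinks the monthly budget to well under one hour, which is worth keeping in mind before promising a stricter SLA.
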
### Team Responsibilities

| Role | Responsibilities |
|------|------------------|
| **DevOps Engineer** | Deployment, infrastructure, scaling |
| **SRE** | Monitoring, incident response, performance |
| **Security Admin** | Access control, security patches, compliance |
| **Database Admin** | Backups, optimization, migrations |
| **On-Call Engineer** | 24/7 incident response (if applicable) |

---

## Monitoring & Observability

### Access Monitoring Dashboards

**Production URLs:**
```
https://monitoring.yourdomain.com/grafana        # Dashboards & visualization
https://monitoring.yourdomain.com/prometheus     # Metrics & alerts
https://monitoring.yourdomain.com/alertmanager   # Alert management
https://monitoring.yourdomain.com/jaeger         # Distributed tracing
```

**Port Forwarding (if ingress not available):**
```bash
# Grafana
kubectl port-forward -n monitoring svc/grafana 3000:3000

# Prometheus
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090

# AlertManager
kubectl port-forward -n monitoring svc/alertmanager-external 9093:9093

# Jaeger
kubectl port-forward -n monitoring svc/jaeger-query 16686:16686
```

### Key Dashboards

#### 1. Services Overview Dashboard
**What to Monitor:**
- Request rate per service
- Error rate (aim: <1%)
- P95/P99 latency (aim: <2s)
- Active connections
- Pod health status

**Red Flags:**
- ❌ Error rate >5%
- ❌ P95 latency >3s
- ❌ Any service showing 0 requests (might be down)
- ❌ Pod restarts >3 in last hour

#### 2. Database Dashboard (PostgreSQL)
**What to Monitor:**
- Active connections per database
- Cache hit ratio (aim: >90%)
- Query duration (P95)
- Transaction rate
- Replication lag (if applicable)

**Red Flags:**
- ❌ Connection count >80% of max
- ❌ Cache hit ratio <80%
- ❌ Slow queries >1s frequently
- ❌ Locks increasing

#### 3. Node Exporter (Infrastructure)
**What to Monitor:**
- CPU usage per node
- Memory usage and swap
- Disk I/O and latency
- Network throughput
- Disk space remaining

**Red Flags:**
- ❌ CPU usage >85% sustained
- ❌ Memory usage >90%
- ❌ Swap usage >0 (indicates memory pressure)
- ❌ Disk space <20% remaining
- ❌ Disk I/O latency >100ms

#### 4. Business Metrics Dashboard
**What to Monitor:**
- Active tenants
- ML training jobs (success/failure rate)
- Forecast requests per hour
- Alert volume
- API health score

**Red Flags:**
- ❌ Training failure rate >10%
- ❌ No forecast requests (might indicate issue)
- ❌ Alert volume spike (investigate cause)

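
The Node Exporter red flags listed above translate directly into threshold checks. The following is a minimal sketch rather than part of the monitoring stack: the metric key names (`cpu_pct`, `disk_free_pct`, and so on) are hypothetical, and it evaluates a single sample, so the "sustained" qualifier on CPU usage still has to be judged from the Grafana trend.

```python
# Thresholds copied from the "Node Exporter (Infrastructure)" red flags.
# Metric key names are hypothetical; map them to your own Prometheus queries.
NODE_CHECKS = [
    ("cpu_pct", lambda v: v > 85, "CPU usage >85%"),
    ("memory_pct", lambda v: v > 90, "Memory usage >90%"),
    ("swap_bytes", lambda v: v > 0, "Swap usage >0"),
    ("disk_free_pct", lambda v: v < 20, "Disk space <20% remaining"),
    ("disk_io_latency_ms", lambda v: v > 100, "Disk I/O latency >100ms"),
]


def node_red_flags(metrics: dict) -> list:
    """Return the red-flag messages breached by one sample of node metrics."""
    return [
        message
        for key, breached, message in NODE_CHECKS
        if key in metrics and breached(metrics[key])
    ]


sample = {"cpu_pct": 50, "memory_pct": 95, "swap_bytes": 0, "disk_free_pct": 10}
print(node_red_flags(sample))
```

Metrics missing from the sample are skipped rather than treated as healthy or unhealthy, so a partial scrape does not raise false alarms.
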
### Alert Severity Levels

| Severity | Response Time | Escalation | Examples |
|----------|---------------|------------|----------|
| **Critical** | Immediate | Page on-call | Service down, database unavailable |
| **Warning** | 30 minutes | Email team | High memory, slow queries |
| **Info** | Best effort | Email | Backup completed, cert renewal |

### Common Alerts & Responses

#### Alert: ServiceDown
```
Severity: Critical
Meaning: A service has been down for >2 minutes
Response:
1. Check pod status: kubectl get pods -n bakery-ia
2. View logs: kubectl logs POD_NAME -n bakery-ia
3. Check recent deployments: kubectl rollout history deployment/SERVICE_NAME -n bakery-ia
4. Restart if safe: kubectl rollout restart deployment/SERVICE_NAME -n bakery-ia
5. Rollback if needed: kubectl rollout undo deployment/SERVICE_NAME -n bakery-ia
```

#### Alert: HighMemoryUsage
```
Severity: Warning
Meaning: Service using >80% of memory limit
Response:
1. Check which pods: kubectl top pods -n bakery-ia --sort-by=memory
2. Review memory trends in Grafana
3. Check for memory leaks in application logs
4. Consider increasing memory limits if sustained
5. Restart pod if memory leak suspected
```

#### Alert: DatabaseConnectionsHigh
```
Severity: Warning
Meaning: Database connections >80% of max
Response:
1. Identify which service: Check Grafana database dashboard
2. Look for connection leaks in application
3. Check for long-running transactions
4. Consider increasing max_connections
5. Restart service if connections not releasing
```

#### Alert: CertificateExpiringSoon
```
Severity: Warning
Meaning: TLS certificate expires in <30 days
Response:
1. For Let's Encrypt: Auto-renewal should handle (verify cert-manager)
2. For internal certs: Regenerate and apply new certificates
3. See "Certificate Rotation" section below
```

### Metrics to Track Daily

```bash
# Quick health check command
cat > ~/health-check.sh <<'EOF'
#!/bin/bash
echo "=== Bakery-IA Health Check ==="
echo "Date: $(date)"
echo ""

echo "1. Pod Status:"
# --no-headers keeps the column header from counting as an unhealthy pod
kubectl get pods -n bakery-ia --no-headers | grep -vE "Running|Completed" || echo "✅ All pods healthy"
echo ""

echo "2. Resource Usage:"
kubectl top nodes
echo ""

echo "3. Database Connections:"
kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c \
  "SELECT count(*) as connections FROM pg_stat_activity;"
echo ""

echo "4. Recent Alerts:"
# Requires the Prometheus port-forward on localhost:9090 to be running
curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts[] | {alert: .labels.alertname, state: .state}' | head -10
echo ""

echo "5. Disk Usage:"
kubectl exec -n bakery-ia deployment/auth-db -- df -h /var/lib/postgresql/data
echo ""

echo "=== End Health Check ==="
EOF

chmod +x ~/health-check.sh
~/health-check.sh
```

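
Step 4 of the health check reads the Prometheus `/api/v1/alerts` endpoint. Where `jq` is not installed, the same filtering can be sketched in Python. The payload below is an illustrative sample in the shape that endpoint returns, not real output from this cluster, and `firing_alerts` is a hypothetical helper name.

```python
def firing_alerts(payload: dict) -> list:
    """Return the alertname of every alert whose state is 'firing'."""
    alerts = payload.get("data", {}).get("alerts", [])
    return [
        alert.get("labels", {}).get("alertname", "<unnamed>")
        for alert in alerts
        if alert.get("state") == "firing"
    ]


# Illustrative payload only; alert names are taken from this guide.
SAMPLE = {
    "status": "success",
    "data": {
        "alerts": [
            {"labels": {"alertname": "ServiceDown"}, "state": "firing"},
            {"labels": {"alertname": "HighMemoryUsage"}, "state": "pending"},
        ]
    },
}

print(firing_alerts(SAMPLE))
```
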

---

## Security Operations

### Security Posture Overview

**Current Security Grade: A-**

**Implemented:**
- ✅ TLS 1.2+ encryption for all database connections
- ✅ Let's Encrypt SSL for public endpoints
- ✅ 32-character cryptographic passwords
- ✅ JWT-based authentication
- ✅ Tenant isolation at database and application level
- ✅ Kubernetes secrets encryption at rest
- ✅ PostgreSQL audit logging
- ✅ RBAC (Role-Based Access Control)
- ✅ Regular security updates

### Access Control Management

#### User Roles

| Role | Permissions | Use Case |
|------|-------------|----------|
| **Viewer** | Read-only access | Dashboard viewing, reports |
| **Member** | Read + create/update | Day-to-day operations |
| **Admin** | Full operational access | Manage users, configure settings |
| **Owner** | Full control | Billing, tenant deletion |

#### Managing User Access

```bash
# View current users for a tenant (via API)
curl -H "Authorization: Bearer $ADMIN_TOKEN" \
  https://api.yourdomain.com/api/v1/tenants/TENANT_ID/users

# Promote user to admin
curl -X PATCH -H "Authorization: Bearer $OWNER_TOKEN" \
  -H "Content-Type: application/json" \
  https://api.yourdomain.com/api/v1/tenants/TENANT_ID/users/USER_ID \
  -d '{"role": "admin"}'
```

### Security Checklist (Monthly)

- [ ] **Review audit logs for suspicious activity**
  ```bash
  # Check failed login attempts
  kubectl logs -n bakery-ia deployment/auth-service | grep "authentication failed" | tail -50

  # Check unusual API calls
  kubectl logs -n bakery-ia deployment/gateway | grep -E "DELETE|admin" | tail -50
  ```

- [ ] **Verify all services using TLS**
  ```bash
  # Check PostgreSQL SSL
  for db in $(kubectl get deploy -n bakery-ia -l app.kubernetes.io/component=database -o name); do
    echo "Checking $db"
    kubectl exec -n bakery-ia $db -- psql -U postgres -c "SHOW ssl;"
  done
  ```

- [ ] **Review and rotate passwords (every 90 days)**
  ```bash
  # Generate new passwords
  openssl rand -base64 32  # For each service

  # Update secrets
  kubectl edit secret bakery-ia-secrets -n bakery-ia

  # Restart services to pick up new passwords
  kubectl rollout restart deployment -n bakery-ia
  ```

- [ ] **Check certificate expiry dates**
  ```bash
  # Check Let's Encrypt certs
  kubectl get certificate -n bakery-ia

  # Check internal TLS certs (expire Oct 2028)
  kubectl exec -n bakery-ia deployment/auth-db -- \
    openssl x509 -in /tls/server-cert.pem -noout -dates
  ```

- [ ] **Review RBAC policies**
  - Ensure least privilege principle
  - Remove access for departed team members
  - Audit admin/owner role assignments

- [ ] **Apply security updates**
  ```bash
  # Update system packages on VPS
  ssh root@$VPS_IP "apt update && apt upgrade -y"

  # Update container images (rebuild with latest base images)
  docker-compose build --pull
  ```

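
Note that `openssl rand -base64 32` encodes 32 random bytes, which yields a 44-character string that may contain `+`, `/` and `=`. If a password of exactly 32 URL-safe characters is wanted, one option is Python's standard `secrets` module. This is a sketch under that assumption; `generate_password` is a hypothetical helper, not an existing project script.

```python
import secrets


def generate_password(length: int = 32) -> str:
    """Return a random URL-safe password of exactly `length` characters."""
    if length <= 0:
        raise ValueError("length must be positive")
    # token_urlsafe(n) encodes n random bytes into more than n characters,
    # so generating `length` bytes and trimming always leaves enough output.
    return secrets.token_urlsafe(length)[:length]


print(generate_password())
```

The output uses only letters, digits, `-` and `_`, which avoids quoting problems when the value is pasted into connection strings or YAML.
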
### Certificate Rotation

#### Let's Encrypt (Auto-Renewal)

Let's Encrypt certificates auto-renew via cert-manager. Verify:

```bash
# Check cert-manager is running
kubectl get pods -n cert-manager

# Check certificate status
kubectl describe certificate bakery-ia-prod-tls-cert -n bakery-ia

# Force renewal if needed (>30 days before expiry)
kubectl delete secret bakery-ia-prod-tls-cert -n bakery-ia
# cert-manager will automatically recreate
```

#### Internal TLS Certificates (Manual Rotation)

**When:** 90 days before October 2028 expiry

```bash
# 1. Generate new certificates (on local machine)
cd infrastructure/tls
./generate-certificates.sh

# 2. Update Kubernetes secrets
kubectl delete secret postgres-tls redis-tls -n bakery-ia

kubectl create secret generic postgres-tls \
  --from-file=server-cert.pem=postgres/server-cert.pem \
  --from-file=server-key.pem=postgres/server-key.pem \
  --from-file=ca-cert.pem=postgres/ca-cert.pem \
  -n bakery-ia

kubectl create secret generic redis-tls \
  --from-file=redis-cert.pem=redis/redis-cert.pem \
  --from-file=redis-key.pem=redis/redis-key.pem \
  --from-file=ca-cert.pem=redis/ca-cert.pem \
  -n bakery-ia

# 3. Restart database pods to pick up new certs
kubectl rollout restart deployment -n bakery-ia -l app.kubernetes.io/component=database
kubectl rollout restart deployment -n bakery-ia -l app.kubernetes.io/component=cache

# 4. Verify new certificates
kubectl exec -n bakery-ia deployment/auth-db -- \
  openssl x509 -in /tls/server-cert.pem -noout -dates
```

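
The "90 days before expiry" rule can be checked mechanically from the `notAfter=` line that `openssl x509 -noout -dates` prints. The sketch below assumes that output format; the function names and the dates in the example are illustrative, not the real certificate values.

```python
from datetime import datetime, timedelta, timezone


def parse_not_after(line: str) -> datetime:
    """Parse a 'notAfter=Oct  5 12:00:00 2028 GMT' line into an aware datetime."""
    value = line.strip().split("=", 1)[1]
    parsed = datetime.strptime(value, "%b %d %H:%M:%S %Y %Z")
    return parsed.replace(tzinfo=timezone.utc)


def rotation_due(not_after: datetime, now: datetime, lead_days: int = 90) -> bool:
    """True once `now` is within `lead_days` of the certificate expiry."""
    return now >= not_after - timedelta(days=lead_days)


# Illustrative expiry date only; read the real one from the certificate.
expiry = parse_not_after("notAfter=Oct  5 12:00:00 2028 GMT")
print(rotation_due(expiry, datetime(2028, 7, 8, tzinfo=timezone.utc)))
```
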

---

## Database Management

### Database Architecture

**14 PostgreSQL Instances:**
- auth-db, tenant-db, training-db, forecasting-db, sales-db
- external-db, notification-db, inventory-db, recipes-db
- suppliers-db, pos-db, orders-db, production-db, alert-processor-db

**1 Redis Instance:** Shared caching and session storage

### Database Health Monitoring

```bash
# Check all database pods
kubectl get pods -n bakery-ia -l app.kubernetes.io/component=database

# Check database resource usage
kubectl top pods -n bakery-ia -l app.kubernetes.io/component=database

# Check database connections
for db in $(kubectl get pods -n bakery-ia -l app.kubernetes.io/component=database -o name); do
  echo "=== $db ==="
  kubectl exec -n bakery-ia $db -- psql -U postgres -c \
    "SELECT datname, count(*) FROM pg_stat_activity GROUP BY datname;"
done
```

### Common Database Operations

#### Connect to Database

```bash
# Connect to specific database
kubectl exec -n bakery-ia deployment/auth-db -it -- \
  psql -U auth_user -d auth_db

# Inside psql:
\dt              # List tables
\d+ table_name   # Describe table with details
\du              # List users
\l               # List databases
\q               # Quit
```

#### Check Database Size

```bash
kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c \
  "SELECT pg_database.datname,
          pg_size_pretty(pg_database_size(pg_database.datname)) AS size
   FROM pg_database;"
```

#### Analyze Slow Queries

```bash
# List the 10 slowest statements by mean execution time
# (uses the pg_stat_statements extension, already configured)
kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c \
  "SELECT query, mean_exec_time, calls
   FROM pg_stat_statements
   ORDER BY mean_exec_time DESC
   LIMIT 10;"
```

#### Check Database Locks

```bash
# Show sessions waiting on a lock together with the session holding it
kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c \
  "SELECT blocked_locks.pid AS blocked_pid,
          blocking_locks.pid AS blocking_pid,
          blocked_activity.usename AS blocked_user,
          blocking_activity.usename AS blocking_user,
          blocked_activity.query AS blocked_statement
   FROM pg_catalog.pg_locks blocked_locks
   JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
   JOIN pg_catalog.pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype
     AND blocking_locks.relation = blocked_locks.relation
     AND blocking_locks.pid != blocked_locks.pid
     AND blocking_locks.granted
   JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
   WHERE NOT blocked_locks.granted;"
```

### Database Optimization

#### Vacuum and Analyze

```bash
# Run on each database at least monthly
kubectl exec -n bakery-ia deployment/auth-db -- \
  psql -U auth_user -d auth_db -c "VACUUM ANALYZE;"

# For all databases (run as cron job)
cat > ~/vacuum-databases.sh <<'EOF'
#!/bin/bash
for db in $(kubectl get deploy -n bakery-ia -l app.kubernetes.io/component=database -o name); do
  echo "Vacuuming $db"
  # Without -d, psql connects to the default "postgres" database;
  # add -d <service_db> to vacuum the service's own database
  kubectl exec -n bakery-ia $db -- psql -U postgres -c "VACUUM ANALYZE;"
done
EOF

chmod +x ~/vacuum-databases.sh
# Add to cron: 0 3 * * 0 (weekly, Sundays at 3 AM)
```

#### Reindex (if performance degrades)

```bash
# Reindex specific database
kubectl exec -n bakery-ia deployment/auth-db -- \
  psql -U auth_user -d auth_db -c "REINDEX DATABASE auth_db;"
```

---

## Backup & Recovery

### Backup Strategy

**Automated Daily Backups:**
- Frequency: Daily at 2 AM
- Retention: 30 days rolling
- Encryption: GPG encrypted
- Storage: Local VPS (configure off-site for production)

### Backup Script (Already Configured)

```bash
# Script location: ~/backup-databases.sh
# Configured in: pilot launch guide

# Manual backup
~/backup-databases.sh

# Verify backup
ls -lh /backups/
```

### Backup Best Practices

1. **Test Restores Monthly**
   ```bash
   # Restore one database dump into a test database
   # (decrypt and decompress the backup first if it is stored as .gpg or .gz)
   cat /backups/2026-01-07/auth-db.sql | \
     kubectl exec -i -n bakery-ia deployment/test-db -- \
     psql -U postgres test_db
   ```

2. **Off-Site Storage (Recommended)**
   ```bash
   # Sync backups to S3 / Cloud Storage
   aws s3 sync /backups/ s3://bakery-ia-backups/ --delete

   # Or use rclone for any cloud provider
   rclone sync /backups/ remote:bakery-ia-backups
   ```

3. **Monitor Backup Success**
   ```bash
   # Show the most recent backup (plain -t avoids the "total" line that -l prints first)
   ls -t /backups/ | head -1

   # Set up alert if no backup in 25 hours
   ```

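
The "no backup in 25 hours" alert mentioned above could be driven by a small freshness check. This is a sketch, assuming backups land as files or directories directly under one folder such as `/backups/`; the function names are hypothetical and the alert delivery itself is left out.

```python
import time
from pathlib import Path


def newest_backup_age_hours(backup_dir, now=None):
    """Return the age in hours of the newest entry, or None if the folder is empty."""
    entries = list(Path(backup_dir).iterdir())
    if not entries:
        return None
    newest_mtime = max(entry.stat().st_mtime for entry in entries)
    reference = time.time() if now is None else now
    return (reference - newest_mtime) / 3600


def backup_is_stale(backup_dir, max_age_hours=25, now=None):
    """True when there is no backup at all or the newest one is too old."""
    age = newest_backup_age_hours(backup_dir, now)
    return age is None or age > max_age_hours
```

Run from cron shortly after the backup window, a call such as `backup_is_stale("/backups")` can gate whatever notification channel the team already uses. An empty folder is deliberately reported as stale, since a missing backup is the failure this check exists to catch.
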
### Recovery Procedures

#### Restore Single Database

```bash
# 1. Stop the service using the database
kubectl scale deployment auth-service -n bakery-ia --replicas=0

# 2. Drop and recreate database
kubectl exec -n bakery-ia deployment/auth-db -it -- \
  psql -U postgres -c "DROP DATABASE auth_db;"
kubectl exec -n bakery-ia deployment/auth-db -it -- \
  psql -U postgres -c "CREATE DATABASE auth_db OWNER auth_user;"

# 3. Restore from backup
# (the dump is plain SQL; use gunzip only if the file is actually compressed)
cat /backups/2026-01-07/auth-db.sql | \
  kubectl exec -i -n bakery-ia deployment/auth-db -- \
  psql -U auth_user -d auth_db

# 4. Restart service
kubectl scale deployment auth-service -n bakery-ia --replicas=2
```

#### Disaster Recovery (Full System)

```bash
# 1. Provision new VPS (same specs)
# 2. Install MicroK8s (follow pilot launch guide)
# 3. Copy latest backup to new VPS
# 4. Deploy infrastructure and databases
kubectl apply -k infrastructure/kubernetes/overlays/prod

# 5. Wait for databases to be ready (kubectl wait gives up after 30s by default)
kubectl wait --for=condition=ready pod -l app.kubernetes.io/component=database \
  -n bakery-ia --timeout=300s

# 6. Restore all databases
for backup in /backups/latest/*.sql; do
  db_name=$(basename "$backup" .sql)
  echo "Restoring $db_name"
  cat "$backup" | kubectl exec -i -n bakery-ia "deployment/${db_name}" -- \
    psql -U postgres
done

# 7. Deploy services
kubectl apply -k infrastructure/kubernetes/overlays/prod

# 8. Update DNS to point to new VPS
# 9. Verify all services healthy
```

**Recovery Time Objective (RTO):** 2-4 hours
**Recovery Point Objective (RPO):** 24 hours (last daily backup)

---

## Performance Optimization

### Identifying Performance Issues

```bash
# 1. Check overall resource usage
kubectl top nodes
kubectl top pods -n bakery-ia --sort-by=cpu
kubectl top pods -n bakery-ia --sort-by=memory

# 2. Check API response times in Grafana
# Go to "Services Overview" dashboard
# Look for P95/P99 latency spikes

# 3. Check database query performance
kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c \
  "SELECT query, calls, mean_exec_time, max_exec_time
   FROM pg_stat_statements
   ORDER BY mean_exec_time DESC
   LIMIT 20;"

# 4. Check for N+1 queries in application logs
kubectl logs -n bakery-ia deployment/orders-service | grep "SELECT"
```

### Common Optimizations

#### 1. Database Indexing

```sql
-- Review column statistics (n_distinct, correlation) to identify index candidates
SELECT schemaname, tablename, attname, n_distinct, correlation
FROM pg_stats
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY abs(correlation) DESC;

-- Add index on frequently queried columns
CREATE INDEX CONCURRENTLY idx_orders_tenant_created
ON orders(tenant_id, created_at DESC);
```

#### 2. Connection Pooling

Already configured in services using SQLAlchemy. Verify settings:
```python
# In shared/database/base.py
pool_size=5          # Persistent connections per process; adjust based on load
max_overflow=10      # Max additional connections beyond pool_size
pool_timeout=30      # Seconds to wait for a free connection before erroring
pool_recycle=3600    # Recycle connections after 1 hour
```

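
These pool settings put an upper bound on how many connections a service can open against its database, which is worth comparing with the 80% connection red flag. A rough sketch follows; it assumes one pool per replica and a PostgreSQL `max_connections` of 100 (the PostgreSQL default, which may differ in this deployment), and the helper names are hypothetical.

```python
def max_pool_connections(replicas: int, pool_size: int = 5, max_overflow: int = 10) -> int:
    """Worst-case connections if every replica fills its pool and overflow."""
    return replicas * (pool_size + max_overflow)


def pool_headroom_ok(replicas: int, max_connections: int = 100,
                     pool_size: int = 5, max_overflow: int = 10,
                     alert_ratio: float = 0.8) -> bool:
    """True if the worst case stays at or below the alerting threshold."""
    worst_case = max_pool_connections(replicas, pool_size, max_overflow)
    return worst_case <= alert_ratio * max_connections


print(max_pool_connections(3))   # 3 replicas * (5 + 10) connections each
print(pool_headroom_ok(3))
print(pool_headroom_ok(6))
```

With the settings shown, three replicas can open at most 45 connections, while six replicas could reach 90 and cross an 80-connection alert threshold. Running this check before adding replicas helps avoid trading a CPU bottleneck for connection exhaustion.
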
#### 3. Redis Caching

Extend caching to frequently accessed data:
```python
# Cache user permissions (example; assumes a Flask-Caching style `cache` object)
# memoize() keys the cache on user_id; cached() with a fixed key_prefix
# would return the same entry for every user
@cache.memoize(timeout=300)
def get_user_permissions(user_id):
    ...  # fetch from database
```

#### 4. Query Optimization

```sql
-- Add EXPLAIN ANALYZE to slow queries
EXPLAIN ANALYZE SELECT * FROM orders WHERE tenant_id = '...';

-- Look for:
-- - Seq Scan (should use index scan)
-- - High execution time
-- - Missing indexes
```

### Scaling Triggers

**When to scale UP:**
- ❌ CPU usage >75% sustained for >1 hour
- ❌ Memory usage >85% sustained
- ❌ P95 API latency >3s
- ❌ Database connection pool exhausted frequently
- ❌ Error rate increasing

**When to scale OUT (add replicas):**
- ❌ Request rate increasing significantly
- ❌ Single service bottleneck identified
- ❌ Need zero-downtime deployments
- ❌ Geographic distribution needed

---

## Scaling Operations

### Vertical Scaling (Upgrade VPS)

```bash
# 1. Create backup
~/backup-databases.sh

# 2. Plan upgrade window (requires brief downtime)
# Notify users: "Scheduled maintenance 2 AM - 3 AM"

# 3. At clouding.io, upgrade VPS
# RAM: 20 GB → 32 GB
# CPU: 8 cores → 12 cores
# (Usually instant, may require restart)

# 4. Verify after upgrade
kubectl top nodes
free -h
nproc
```

### Horizontal Scaling (Add Replicas)

```bash
# Scale specific service
# Note: an HPA on the same deployment overrides a manual replica count;
# for HPA-managed services, raise the HPA maximum instead
kubectl scale deployment orders-service -n bakery-ia --replicas=5

# Or update in kustomization for persistence
# Edit: infrastructure/kubernetes/overlays/prod/kustomization.yaml
#   replicas:
#     - name: orders-service
#       count: 5

kubectl apply -k infrastructure/kubernetes/overlays/prod
```

### Auto-Scaling (HPA)

Already configured for:
- orders-service (1-3 replicas)
- forecasting-service (1-3 replicas)
- notification-service (1-3 replicas)

```bash
# Check HPA status
kubectl get hpa -n bakery-ia

# Adjust thresholds if needed
kubectl edit hpa orders-service-hpa -n bakery-ia
```

### Growth Path

| Tenants | Recommended Action |
|---------|-------------------|
| **10** | Current configuration (20GB RAM, 8 CPU) |
| **20** | Add replicas for critical services |
| **30** | Upgrade to 32GB RAM, 12 CPU |
| **50** | Consider database read replicas |
| **75** | Upgrade to 48GB RAM, 16 CPU |
| **100** | Plan multi-node cluster or managed K8s |
| **200+** | Migrate to managed services (EKS, GKE, AKS) |

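
For capacity planning scripts, the growth path table can be encoded as a simple threshold lookup. The sketch below assumes each row applies from its tenant count up to the next row, which is one reading of the table rather than a stated rule; `recommended_action` is a hypothetical helper.

```python
from bisect import bisect_right

# (tenant threshold, recommended action), copied from the Growth Path table.
GROWTH_PATH = [
    (10, "Current configuration (20GB RAM, 8 CPU)"),
    (20, "Add replicas for critical services"),
    (30, "Upgrade to 32GB RAM, 12 CPU"),
    (50, "Consider database read replicas"),
    (75, "Upgrade to 48GB RAM, 16 CPU"),
    (100, "Plan multi-node cluster or managed K8s"),
    (200, "Migrate to managed services (EKS, GKE, AKS)"),
]


def recommended_action(tenants: int) -> str:
    """Return the action for the highest threshold not exceeding `tenants`."""
    thresholds = [threshold for threshold, _ in GROWTH_PATH]
    # Tenant counts below the first threshold fall back to the first row.
    index = max(bisect_right(thresholds, tenants) - 1, 0)
    return GROWTH_PATH[index][1]


print(recommended_action(35))
```
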

---

## Incident Response

### Incident Severity Levels

| Level | Description | Response Time | Example |
|-------|-------------|---------------|---------|
| **P0** | Complete outage | Immediate | All services down |
| **P1** | Major degradation | 15 minutes | Database unavailable |
| **P2** | Partial degradation | 1 hour | One service slow |
| **P3** | Minor issue | 4 hours | Non-critical alert |

### Incident Response Process

#### 1. Detect & Alert
```
- Monitoring alerts trigger
- User reports issue
- Automated health checks fail
```

#### 2. Assess & Communicate
```bash
# Quick assessment
~/health-check.sh

# Determine severity
# P0/P1: Notify all stakeholders immediately
# P2/P3: Regular communication channels
```

#### 3. Investigate
```bash
# Check pods
kubectl get pods -n bakery-ia

# Check recent events
kubectl get events -n bakery-ia --sort-by='.lastTimestamp' | tail -20

# Check logs
kubectl logs -n bakery-ia deployment/SERVICE_NAME --tail=100

# Check metrics
# View Grafana dashboards
```

#### 4. Mitigate
```bash
# Common mitigations:

# Restart service
kubectl rollout restart deployment/SERVICE_NAME -n bakery-ia

# Rollback deployment
kubectl rollout undo deployment/SERVICE_NAME -n bakery-ia

# Scale up
kubectl scale deployment SERVICE_NAME -n bakery-ia --replicas=5

# Restart database
kubectl delete pod DB_POD_NAME -n bakery-ia
```

#### 5. Resolve & Document
```
1. Verify issue resolved
2. Update incident log
3. Create post-mortem (for P0/P1)
4. Implement preventive measures
```

### Common Incidents & Fixes

#### Incident: Database Connection Exhaustion

**Symptoms:** Services showing "connection pool exhausted" errors

**Fix:**
```bash
# 1. Identify leaking service
kubectl logs -n bakery-ia deployment/orders-service | grep "pool"

# 2. Restart leaking service
kubectl rollout restart deployment/orders-service -n bakery-ia

# 3. Increase max_connections if needed
kubectl exec -n bakery-ia deployment/orders-db -- \
  psql -U postgres -c "ALTER SYSTEM SET max_connections = 200;"
kubectl rollout restart deployment/orders-db -n bakery-ia
```

#### Incident: Out of Memory (OOMKilled)

**Symptoms:** Pods restarting with "OOMKilled" status

**Fix:**
```bash
# 1. Identify which pod
kubectl get pods -n bakery-ia | grep OOMKilled

# 2. Check resource limits
kubectl describe pod POD_NAME -n bakery-ia | grep -A 5 Limits

# 3. Increase memory limit
# Edit deployment: infrastructure/kubernetes/base/components/services/SERVICE.yaml
#   resources:
#     limits:
#       memory: "1Gi"  # Increased from 512Mi

# 4. Redeploy
kubectl apply -k infrastructure/kubernetes/overlays/prod
```

#### Incident: Certificate Expired

**Symptoms:** SSL errors, services can't connect

**Fix:**
```bash
# For Let's Encrypt (should auto-renew):
kubectl delete secret bakery-ia-prod-tls-cert -n bakery-ia
# Wait for cert-manager to recreate

# For internal certs:
# Follow "Certificate Rotation" section above
```

---

## Maintenance Tasks

### Daily Tasks

```bash
# Run health check
~/health-check.sh

# Check monitoring alerts (requires the Prometheus port-forward on localhost:9090)
curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts[] | select(.state=="firing")'

# Verify backups ran
ls -lh /backups/ | head -5
```

### Weekly Tasks

```bash
# Review resource trends
# Open Grafana, check 7-day trends

# Review error logs (--since takes s/m/h units, so 7 days is 168h)
kubectl logs -n bakery-ia deployment/gateway --since=168h | grep ERROR | wc -l

# Check disk usage
kubectl exec -n bakery-ia deployment/auth-db -- df -h

# Review security logs
kubectl logs -n bakery-ia deployment/auth-service --since=168h | grep "failed"
```

### Monthly Tasks

- [ ] **Review and rotate passwords**
- [ ] **Update security patches**
- [ ] **Test backup restore**
- [ ] **Review RBAC policies**
- [ ] **Vacuum and analyze databases**
- [ ] **Review and optimize slow queries**
- [ ] **Check certificate expiry dates**
- [ ] **Review resource allocation**
- [ ] **Plan capacity for next quarter**
- [ ] **Update documentation**

### Quarterly Tasks (Every 90 Days)

- [ ] **Full security audit**
- [ ] **Disaster recovery drill**
- [ ] **Performance testing**
- [ ] **Cost optimization review**
- [ ] **Update runbooks**
- [ ] **Team training session**
- [ ] **Review SLAs and metrics**
- [ ] **Plan infrastructure upgrades**

### Annual Tasks

- [ ] **Penetration testing**
- [ ] **Compliance audit (GDPR, PCI-DSS, SOC 2)**
- [ ] **Full infrastructure review**
- [ ] **Update security roadmap**
- [ ] **Budget planning for next year**
- [ ] **Technology stack review**

---

## Compliance & Audit

### GDPR Compliance

**Requirements Met:**
- ✅ Article 32: Encryption of personal data (TLS + pgcrypto)
- ✅ Article 5(1)(f): Security of processing
- ✅ Article 33: Breach detection (audit logs)
- ✅ Article 17: Right to erasure (deletion endpoints)
- ✅ Article 20: Right to data portability (export functionality)

**Audit Tasks:**
```bash
# Review audit logs for data access
kubectl logs -n bakery-ia deployment/tenant-service | grep "user_data_access"

# Verify encryption in use
kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c "SHOW ssl;"

# Check data retention policies
# Review automated cleanup jobs
```

### PCI-DSS Compliance

**Requirements Met:**
- ✅ Requirement 4: Transmission encryption (TLS 1.2+)
- ✅ Requirement 3.5: Stored data protection (pgcrypto)
- ✅ Requirement 10: Access tracking (audit logs)
- ✅ Requirement 8: User authentication (JWT + MFA ready)

**Audit Tasks:**
```bash
# Verify no plaintext passwords
kubectl get secret bakery-ia-secrets -n bakery-ia -o jsonpath='{.data}' | grep -i "pass"

# Check encryption in transit
kubectl describe ingress -n bakery-ia | grep TLS

# Review access logs
kubectl logs -n bakery-ia deployment/auth-service | grep "login"
```

### SOC 2 Compliance

**Controls Met:**
- ✅ CC6.1: Access controls (RBAC)
- ✅ CC6.6: Encryption in transit (TLS)
- ✅ CC6.7: Encryption at rest (K8s secrets + pgcrypto)
- ✅ CC7.2: Monitoring (Prometheus + Grafana)

### Audit Log Retention

**Current Policy:**
- Application logs: 30 days (stdout)
- Database audit logs: 90 days
- Security logs: 1 year
- Backups: 30 days rolling

**Extending Retention:**
```bash
# Ship logs to external storage
# Example: Ship to S3 / CloudWatch / ELK

# For PostgreSQL audit logs, adjust CSV log rotation.
# Note: log_rotation_age sets how long one log file is written before a new
# one starts; it does not delete old files, so pair it with shipping or cleanup.
kubectl exec -n bakery-ia deployment/auth-db -- \
  psql -U postgres -c "ALTER SYSTEM SET log_rotation_age = '90d';"
# Reload the configuration so the new value takes effect
kubectl exec -n bakery-ia deployment/auth-db -- \
  psql -U postgres -c "SELECT pg_reload_conf();"
```

---

## Quick Reference Commands

### Emergency Commands

```bash
# Restart all services (minimal downtime with rolling update)
kubectl rollout restart deployment -n bakery-ia

# Restart specific service
kubectl rollout restart deployment/orders-service -n bakery-ia

# Rollback last deployment
kubectl rollout undo deployment/orders-service -n bakery-ia

# Scale up quickly
kubectl scale deployment orders-service -n bakery-ia --replicas=5

# Get pod status
kubectl get pods -n bakery-ia

# Get recent events
kubectl get events -n bakery-ia --sort-by='.lastTimestamp' | tail -20

# Get logs
kubectl logs -n bakery-ia deployment/SERVICE_NAME --tail=100 -f
```

### Monitoring Commands

```bash
# Resource usage
kubectl top nodes
kubectl top pods -n bakery-ia --sort-by=cpu
kubectl top pods -n bakery-ia --sort-by=memory

# Check HPA
kubectl get hpa -n bakery-ia

# Check all resources
kubectl get all -n bakery-ia

# Check ingress
kubectl get ingress -n bakery-ia

# Check certificates
kubectl get certificate -n bakery-ia
```

### Database Commands

```bash
# Connect to database
kubectl exec -n bakery-ia deployment/auth-db -it -- psql -U auth_user -d auth_db

# Check connections
kubectl exec -n bakery-ia deployment/auth-db -- \
  psql -U postgres -c "SELECT count(*) FROM pg_stat_activity;"

# Check database size
kubectl exec -n bakery-ia deployment/auth-db -- \
  psql -U postgres -c "SELECT pg_size_pretty(pg_database_size('auth_db'));"

# Vacuum database
kubectl exec -n bakery-ia deployment/auth-db -- \
  psql -U auth_user -d auth_db -c "VACUUM ANALYZE;"
```

---

## Support Resources

**Documentation:**
- [Pilot Launch Guide](./PILOT_LAUNCH_GUIDE.md) - Initial deployment
- [Monitoring Summary](./MONITORING_DEPLOYMENT_SUMMARY.md) - Monitoring details
- [Quick Start Monitoring](./QUICK_START_MONITORING.md) - Monitoring setup
- [Security Checklist](./security-checklist.md) - Security procedures
- [Database Security](./database-security.md) - Database operations
- [TLS Configuration](./tls-configuration.md) - Certificate management
- [RBAC Implementation](./rbac-implementation.md) - Access control

**External Resources:**
- Kubernetes: https://kubernetes.io/docs
- MicroK8s: https://microk8s.io/docs
- Prometheus: https://prometheus.io/docs
- Grafana: https://grafana.com/docs
- PostgreSQL: https://www.postgresql.org/docs

**Emergency Contacts:**
- DevOps Team: devops@yourdomain.com
- On-Call: oncall@yourdomain.com
- Security Team: security@yourdomain.com

---

## Summary

This guide covers all aspects of operating the Bakery-IA platform in production:

✅ **Monitoring:** Dashboards, alerts, metrics
✅ **Security:** Access control, certificates, compliance
✅ **Databases:** Management, optimization, backups
✅ **Recovery:** Backup strategy, disaster recovery
✅ **Performance:** Optimization techniques, scaling
✅ **Incidents:** Response procedures, common fixes
✅ **Maintenance:** Daily, weekly, monthly tasks
✅ **Compliance:** GDPR, PCI-DSS, SOC 2

**Remember:**
- Monitor daily
- Back up daily
- Test restores monthly
- Rotate secrets quarterly
- Plan for growth continuously

---

**Document Version:** 1.0
**Last Updated:** 2026-01-07
**Maintained By:** DevOps Team
**Next Review:** 2026-04-07