bakery-ia/docs/PRODUCTION_OPERATIONS_GUIDE.md

# Bakery-IA Production Operations Guide

**Complete guide for operating, monitoring, and maintaining the production environment**

**Last Updated:** 2026-01-07
**Target Audience:** DevOps, SRE, System Administrators
**Security Grade:** A-

---

## Table of Contents

1. [Overview](#overview)
2. [Monitoring & Observability](#monitoring--observability)
3. [Security Operations](#security-operations)
4. [Database Management](#database-management)
5. [Backup & Recovery](#backup--recovery)
6. [Performance Optimization](#performance-optimization)
7. [Scaling Operations](#scaling-operations)
8. [Incident Response](#incident-response)
9. [Maintenance Tasks](#maintenance-tasks)
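10. [Compliance & Audit](#compliance--audit)
11. [Quick Reference Commands](#quick-reference-commands)
12. [Support Resources](#support-resources)
13. [Summary](#summary)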

---

## Overview

### Production Environment

**Infrastructure:**
- **Platform:** MicroK8s on Ubuntu 22.04 LTS
- **Services:** 18 microservices, 14 databases, monitoring stack
- **Capacity:** 10-tenant pilot (scalable to 100+)
- **Security:** TLS encryption, RBAC, audit logging
- **Monitoring:** Prometheus, Grafana, AlertManager, Jaeger

**Key Metrics (10-tenant baseline):**
- **Uptime Target:** 99.5% (3.65 hours downtime/month)
- **Response Time:** <2s average API response
- **Error Rate:** <1% of requests
- **Database Connections:** ~200 concurrent
- **Memory Usage:** 12-15 GB / 20 GB capacity
- **CPU Usage:** 40-60% under normal load
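
The downtime figure next to the uptime target can be checked with a short calculation. This is a minimal sketch that assumes an average month of 730 hours; the guide does not state which period length was used, and `downtime_budget_hours` is a hypothetical helper name.

```python
def downtime_budget_hours(uptime_pct, period_hours=730.0):
    """Return the allowed downtime in hours for a given uptime percentage.

    period_hours defaults to 730 (an assumed average month: 8760 / 12).
    """
    if not 0.0 <= uptime_pct <= 100.0:
        raise ValueError("uptime_pct must be between 0 and 100")
    return (100.0 - uptime_pct) / 100.0 * period_hours


if __name__ == "__main__":
    # 0.5% of 730 hours
    print(f"{downtime_budget_hours(99.5):.2f} hours/month")  # prints 3.65 hours/month
```

The same function can be used to see how the budget shrinks if the target is later raised, for example to 99.9%.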

### Team Responsibilities

| Role | Responsibilities |
|------|------------------|
| **DevOps Engineer** | Deployment, infrastructure, scaling |
| **SRE** | Monitoring, incident response, performance |
| **Security Admin** | Access control, security patches, compliance |
| **Database Admin** | Backups, optimization, migrations |
| **On-Call Engineer** | 24/7 incident response (if applicable) |

---

## Monitoring & Observability

### Access Monitoring Dashboards

**Production URLs:**
```
https://monitoring.yourdomain.com/grafana       # Dashboards & visualization
https://monitoring.yourdomain.com/prometheus    # Metrics & alerts
https://monitoring.yourdomain.com/alertmanager  # Alert management
https://monitoring.yourdomain.com/jaeger        # Distributed tracing
```

**Port Forwarding (if ingress not available):**
```bash
# Grafana
kubectl port-forward -n monitoring svc/grafana 3000:3000

# Prometheus
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090

# AlertManager
kubectl port-forward -n monitoring svc/alertmanager-external 9093:9093

# Jaeger
kubectl port-forward -n monitoring svc/jaeger-query 16686:16686
```

### Key Dashboards

#### 1. Services Overview Dashboard
**What to Monitor:**
- Request rate per service
- Error rate (aim: <1%)
- P95/P99 latency (aim: <2s)
- Active connections
- Pod health status

**Red Flags:**
- ❌ Error rate >5%
- ❌ P95 latency >3s
- ❌ Any service showing 0 requests (might be down)
- ❌ Pod restarts >3 in last hour
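
The red flags above can be expressed as a small check, for example when scripting a report from exported metrics. This is a sketch only: the function name and its parameters are hypothetical, and the thresholds are copied from the list above.

```python
def service_red_flags(error_rate_pct, p95_latency_s, request_rate, restarts_last_hour):
    """Return the red flags from the Services Overview list that apply."""
    flags = []
    if error_rate_pct > 5:
        flags.append("error rate >5%")
    if p95_latency_s > 3:
        flags.append("P95 latency >3s")
    if request_rate == 0:
        flags.append("0 requests (service might be down)")
    if restarts_last_hour > 3:
        flags.append("pod restarts >3 in last hour")
    return flags


if __name__ == "__main__":
    # Hypothetical readings for one service
    print(service_red_flags(error_rate_pct=6.2, p95_latency_s=1.4,
                            request_rate=120, restarts_last_hour=0))
```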

#### 2. Database Dashboard (PostgreSQL)
**What to Monitor:**
- Active connections per database
- Cache hit ratio (aim: >90%)
- Query duration (P95)
- Transaction rate
- Replication lag (if applicable)

**Red Flags:**
- ❌ Connection count >80% of max
- ❌ Cache hit ratio <80%
- ❌ Slow queries >1s frequently
- ❌ Locks increasing

#### 3. Node Exporter (Infrastructure)
**What to Monitor:**
- CPU usage per node
- Memory usage and swap
- Disk I/O and latency
- Network throughput
- Disk space remaining

**Red Flags:**
- ❌ CPU usage >85% sustained
- ❌ Memory usage >90%
- ❌ Swap usage >0 (indicates memory pressure)
- ❌ Disk space <20% remaining
- ❌ Disk I/O latency >100ms

#### 4. Business Metrics Dashboard
**What to Monitor:**
- Active tenants
- ML training jobs (success/failure rate)
- Forecast requests per hour
- Alert volume
- API health score

**Red Flags:**
- ❌ Training failure rate >10%
- ❌ No forecast requests (might indicate issue)
- ❌ Alert volume spike (investigate cause)

### Alert Severity Levels

| Severity | Response Time | Escalation | Examples |
|----------|---------------|------------|----------|
| **Critical** | Immediate | Page on-call | Service down, database unavailable |
| **Warning** | 30 minutes | Email team | High memory, slow queries |
| **Info** | Best effort | Email | Backup completed, cert renewal |

### Common Alerts & Responses

#### Alert: ServiceDown
```
Severity: Critical
Meaning: A service has been down for >2 minutes
Response:
1. Check pod status: kubectl get pods -n bakery-ia
2. View logs: kubectl logs POD_NAME -n bakery-ia
3. Check recent deployments: kubectl rollout history deployment/SERVICE_NAME -n bakery-ia
4. Restart if safe: kubectl rollout restart deployment/SERVICE_NAME -n bakery-ia
5. Rollback if needed: kubectl rollout undo deployment/SERVICE_NAME -n bakery-ia
```

#### Alert: HighMemoryUsage
```
Severity: Warning
Meaning: Service using >80% of memory limit
Response:
1. Check which pods: kubectl top pods -n bakery-ia --sort-by=memory
2. Review memory trends in Grafana
3. Check for memory leaks in application logs
4. Consider increasing memory limits if sustained
5. Restart pod if memory leak suspected
```

#### Alert: DatabaseConnectionsHigh
```
Severity: Warning
Meaning: Database connections >80% of max
Response:
1. Identify which service: Check Grafana database dashboard
2. Look for connection leaks in application
3. Check for long-running transactions
4. Consider increasing max_connections
5. Restart service if connections not releasing
```

#### Alert: CertificateExpiringSoon
```
Severity: Warning
Meaning: TLS certificate expires in <30 days
Response:
1. For Let's Encrypt: Auto-renewal should handle this (verify cert-manager is running)
2. For internal certs: Regenerate and apply new certificates
3. See the "Certificate Rotation" section below
```

### Metrics to Track Daily

```bash
# Quick health check command
cat > ~/health-check.sh <<'EOF'
#!/bin/bash
echo "=== Bakery-IA Health Check ==="
echo "Date: $(date)"
echo ""

echo "1. Pod Status:"
kubectl get pods -n bakery-ia --no-headers | grep -vE "Running|Completed" || echo "✅ All pods healthy"
echo ""

echo "2. Resource Usage:"
kubectl top nodes
echo ""

echo "3. Database Connections:"
kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c \
  "SELECT count(*) as connections FROM pg_stat_activity;"
echo ""

echo "4. Recent Alerts:"
# Requires the Prometheus port-forward on localhost:9090 (see "Port Forwarding" above)
curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts[] | {alert: .labels.alertname, state: .state}' | head -10
echo ""

echo "5. Disk Usage:"
kubectl exec -n bakery-ia deployment/auth-db -- df -h /var/lib/postgresql/data
echo ""

echo "=== End Health Check ==="
EOF

chmod +x ~/health-check.sh
./health-check.sh
```
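
The alert step in the script filters the Prometheus `/api/v1/alerts` response with `jq`. The same filtering can be sketched in Python, which may be easier to extend into a report. The sample payload is hypothetical and only mirrors the fields the `jq` expression reads (`labels.alertname` and `state`); `alerts_in_state` is an illustrative helper, not part of the platform.

```python
import json


def alerts_in_state(payload, state="firing"):
    """Return sorted alert names from a Prometheus /api/v1/alerts response."""
    alerts = payload.get("data", {}).get("alerts", [])
    return sorted(
        alert.get("labels", {}).get("alertname", "unknown")
        for alert in alerts
        if alert.get("state") == state
    )


# Hypothetical sample response, shaped like the output the script pipes to jq
SAMPLE = json.loads("""
{"status": "success", "data": {"alerts": [
  {"labels": {"alertname": "HighMemoryUsage"}, "state": "firing"},
  {"labels": {"alertname": "ServiceDown"}, "state": "pending"}
]}}
""")

if __name__ == "__main__":
    print(alerts_in_state(SAMPLE))
```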

---

## Security Operations

### Security Posture Overview

**Current Security Grade: A-**

**Implemented:**
- ✅ TLS 1.2+ encryption for all database connections
- ✅ Let's Encrypt SSL for public endpoints
- ✅ 32-character cryptographic passwords
- ✅ JWT-based authentication
- ✅ Tenant isolation at database and application level
- ✅ Kubernetes secrets encryption at rest
- ✅ PostgreSQL audit logging
- ✅ RBAC (Role-Based Access Control)
- ✅ Regular security updates

### Access Control Management

#### User Roles

| Role | Permissions | Use Case |
|------|-------------|----------|
| **Viewer** | Read-only access | Dashboard viewing, reports |
| **Member** | Read + create/update | Day-to-day operations |
| **Admin** | Full operational access | Manage users, configure settings |
| **Owner** | Full control | Billing, tenant deletion |
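
The table reads as a ladder in which each role includes the permissions of the one before it. Under that assumption (the guide does not state it explicitly), a permission check reduces to comparing ranks. `ROLE_RANK` and `role_allows` are hypothetical names used for illustration only.

```python
# Assumed ordering: each role includes everything the previous role can do
ROLE_RANK = {"viewer": 0, "member": 1, "admin": 2, "owner": 3}


def role_allows(user_role, required_role):
    """Return True if user_role is at least as privileged as required_role."""
    try:
        return ROLE_RANK[user_role.lower()] >= ROLE_RANK[required_role.lower()]
    except KeyError as exc:
        raise ValueError(f"unknown role: {exc.args[0]}") from None


if __name__ == "__main__":
    print(role_allows("Admin", "Member"))   # an admin can do member-level work
    print(role_allows("Member", "Owner"))   # a member cannot delete a tenant
```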

#### Managing User Access

```bash
# View current users for a tenant (via API)
curl -H "Authorization: Bearer $ADMIN_TOKEN" \
  https://api.yourdomain.com/api/v1/tenants/TENANT_ID/users

# Promote user to admin
curl -X PATCH -H "Authorization: Bearer $OWNER_TOKEN" \
  -H "Content-Type: application/json" \
  https://api.yourdomain.com/api/v1/tenants/TENANT_ID/users/USER_ID \
  -d '{"role": "admin"}'
```

### Security Checklist (Monthly)

- [ ] **Review audit logs for suspicious activity**
  ```bash
  # Check failed login attempts
  kubectl logs -n bakery-ia deployment/auth-service | grep "authentication failed" | tail -50

  # Check unusual API calls
  kubectl logs -n bakery-ia deployment/gateway | grep -E "DELETE|admin" | tail -50
  ```

- [ ] **Verify all services using TLS**
  ```bash
  # Check PostgreSQL SSL
  for db in $(kubectl get deploy -n bakery-ia -l app.kubernetes.io/component=database -o name); do
    echo "Checking $db"
    kubectl exec -n bakery-ia $db -- psql -U postgres -c "SHOW ssl;"
  done
  ```

- [ ] **Review and rotate passwords (every 90 days)**
  ```bash
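  # Generate new passwords (32 random bytes, base64-encoded to 44 characters)
  openssl rand -base64 32  # For each service

  # Update secrets (values under `data` must be base64-encoded)
  kubectl edit secret bakery-ia-secrets -n bakery-ia

  # If the database roles are not re-created from the secret on startup,
  # also run ALTER ROLE <user> PASSWORD '<new>' in each database first

  # Restart services to pick up new passwords
  kubectl rollout restart deployment -n bakery-ia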
  ```

- [ ] **Check certificate expiry dates**
  ```bash
  # Check Let's Encrypt certs
  kubectl get certificate -n bakery-ia

  # Check internal TLS certs (expire Oct 2028)
  kubectl exec -n bakery-ia deployment/auth-db -- \
    openssl x509 -in /tls/server-cert.pem -noout -dates
  ```

- [ ] **Review RBAC policies**
  - Ensure least privilege principle
  - Remove access for departed team members
  - Audit admin/owner role assignments

- [ ] **Apply security updates**
  ```bash
  # Update system packages on VPS
  ssh root@$VPS_IP "apt update && apt upgrade -y"

  # Update container images (rebuild with latest base images)
  docker-compose build --pull
  ```
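
For the password rotation item in the checklist above, note that `openssl rand -base64 32` encodes 32 random bytes, which yields a 44-character string. If exactly 32 characters are wanted, a sketch like the following could be used instead. The alphabet (letters and digits only) and the helper names are assumptions; the second function produces the base64 form that a Kubernetes Secret expects under `data`.

```python
import base64
import secrets
import string

# Assumed alphabet: letters and digits only, to avoid shell and URL quoting issues
ALPHABET = string.ascii_letters + string.digits


def generate_password(length=32):
    """Return a random password drawn from a cryptographically secure source."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def to_secret_value(password):
    """Base64-encode a password for use under `data` in a Kubernetes Secret."""
    return base64.b64encode(password.encode("utf-8")).decode("ascii")


if __name__ == "__main__":
    password = generate_password()
    print(len(password), to_secret_value(password))
```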

### Certificate Rotation

#### Let's Encrypt (Auto-Renewal)

Let's Encrypt certificates auto-renew via cert-manager. Verify:

```bash
# Check cert-manager is running
kubectl get pods -n cert-manager

# Check certificate status
kubectl describe certificate bakery-ia-prod-tls-cert -n bakery-ia

# Force renewal if auto-renewal has not happened (<30 days before expiry)
kubectl delete secret bakery-ia-prod-tls-cert -n bakery-ia
# cert-manager will automatically recreate
```

#### Internal TLS Certificates (Manual Rotation)

**When:** 90 days before October 2028 expiry

```bash
# 1. Generate new certificates (on local machine)
cd infrastructure/tls
./generate-certificates.sh

# 2. Update Kubernetes secrets
kubectl delete secret postgres-tls redis-tls -n bakery-ia

kubectl create secret generic postgres-tls \
  --from-file=server-cert.pem=postgres/server-cert.pem \
  --from-file=server-key.pem=postgres/server-key.pem \
  --from-file=ca-cert.pem=postgres/ca-cert.pem \
  -n bakery-ia

kubectl create secret generic redis-tls \
  --from-file=redis-cert.pem=redis/redis-cert.pem \
  --from-file=redis-key.pem=redis/redis-key.pem \
  --from-file=ca-cert.pem=redis/ca-cert.pem \
  -n bakery-ia

# 3. Restart database pods to pick up new certs
kubectl rollout restart deployment -n bakery-ia -l app.kubernetes.io/component=database
kubectl rollout restart deployment -n bakery-ia -l app.kubernetes.io/component=cache

# 4. Verify new certificates
kubectl exec -n bakery-ia deployment/auth-db -- \
  openssl x509 -in /tls/server-cert.pem -noout -dates
```
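
To decide whether the 90-day rotation window has opened, the `notAfter=` line printed by `openssl x509 -noout -dates` can be parsed. This is a minimal sketch assuming the usual OpenSSL date layout (for example `notAfter=Oct 15 12:00:00 2028 GMT`) and an English locale for month names; the function names and the sample date are hypothetical.

```python
from datetime import datetime


def parse_not_after(line):
    """Parse the 'notAfter=...' line from `openssl x509 -noout -dates`."""
    value = line.strip()
    if value.startswith("notAfter="):
        value = value[len("notAfter="):]
    if value.endswith(" GMT"):
        value = value[:-4]  # drop the zone name and treat the result as UTC
    return datetime.strptime(value, "%b %d %H:%M:%S %Y")


def rotation_due(line, now, window_days=90):
    """Return True once the certificate is within window_days of expiry."""
    return (parse_not_after(line) - now).days <= window_days


if __name__ == "__main__":
    # Hypothetical expiry date, checked against the current UTC time
    sample = "notAfter=Oct 15 12:00:00 2028 GMT"
    print(rotation_due(sample, datetime.utcnow()))
```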

---

## Database Management

### Database Architecture

**14 PostgreSQL Instances:**
- auth-db, tenant-db, training-db, forecasting-db, sales-db
- external-db, notification-db, inventory-db, recipes-db
- suppliers-db, pos-db, orders-db, production-db, alert-processor-db

**1 Redis Instance:** Shared caching and session storage

### Database Health Monitoring

```bash
# Check all database pods
kubectl get pods -n bakery-ia -l app.kubernetes.io/component=database

# Check database resource usage
kubectl top pods -n bakery-ia -l app.kubernetes.io/component=database

# Check database connections
for db in $(kubectl get pods -n bakery-ia -l app.kubernetes.io/component=database -o name); do
  echo "=== $db ==="
  kubectl exec -n bakery-ia $db -- psql -U postgres -c \
    "SELECT datname, count(*) FROM pg_stat_activity GROUP BY datname;"
done
```

### Common Database Operations

#### Connect to Database

```bash
# Connect to specific database
kubectl exec -n bakery-ia deployment/auth-db -it -- \
  psql -U auth_user -d auth_db

# Inside psql:
\dt              # List tables
\d+ table_name   # Describe table with details
\du              # List users
\l               # List databases
\q               # Quit
```

#### Check Database Size

```bash
kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c \
  "SELECT pg_database.datname,
   pg_size_pretty(pg_database_size(pg_database.datname)) AS size
   FROM pg_database;"
```

#### Analyze Slow Queries

```bash
# Top 10 queries by mean execution time (requires pg_stat_statements; already configured)
kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c \
  "SELECT query, mean_exec_time, calls
   FROM pg_stat_statements
   ORDER BY mean_exec_time DESC
   LIMIT 10;"
```

#### Check Database Locks

```bash
kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c \
  "SELECT blocked_locks.pid AS blocked_pid,
   blocking_locks.pid AS blocking_pid,
   blocked_activity.usename AS blocked_user,
   blocking_activity.usename AS blocking_user,
   blocked_activity.query AS blocked_statement
   FROM pg_catalog.pg_locks blocked_locks
   JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
   JOIN pg_catalog.pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype
   AND blocking_locks.relation = blocked_locks.relation
   AND blocking_locks.pid <> blocked_locks.pid
   JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
   WHERE NOT blocked_locks.granted;"
```

### Database Optimization

#### Vacuum and Analyze

```bash
# Run on each database monthly
kubectl exec -n bakery-ia deployment/auth-db -- \
  psql -U auth_user -d auth_db -c "VACUUM ANALYZE;"

# For all databases (run as cron job)
cat > ~/vacuum-databases.sh <<'EOF'
#!/bin/bash
for db in $(kubectl get deploy -n bakery-ia -l app.kubernetes.io/component=database -o name); do
  echo "Vacuuming $db"
  # --all vacuums every database in the instance, not just the default one
  kubectl exec -n bakery-ia "$db" -- vacuumdb -U postgres --all --analyze
done
EOF

chmod +x ~/vacuum-databases.sh
# Add to cron: 0 3 * * 0 (weekly, Sundays at 3 AM)
```

#### Reindex (if performance degrades)

```bash
# Reindex specific database
kubectl exec -n bakery-ia deployment/auth-db -- \
  psql -U auth_user -d auth_db -c "REINDEX DATABASE auth_db;"
```

---

## Backup & Recovery

### Backup Strategy

**Automated Daily Backups:**
- Frequency: Daily at 2 AM
- Retention: 30 days rolling
- Encryption: GPG encrypted
- Storage: Local VPS (configure off-site for production)
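
The 30-day rolling retention can be sketched as a date comparison. This example assumes backups are named with an ISO date prefix, as in the restore examples in this guide (for example `2026-01-07`); `expired_backups` is a hypothetical helper and the actual backup script may prune differently.

```python
from datetime import date, timedelta


def expired_backups(names, today, keep_days=30):
    """Return backup names whose ISO date prefix is older than keep_days."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in names:
        try:
            stamp = date.fromisoformat(name[:10])
        except ValueError:
            continue  # skip entries without a date prefix, e.g. "latest"
        if stamp < cutoff:
            expired.append(name)
    return sorted(expired)


if __name__ == "__main__":
    # Hypothetical directory listing
    listing = ["2026-01-07", "2025-12-01", "latest", "2025-12-08.tar.gz"]
    print(expired_backups(listing, today=date(2026, 1, 7)))
```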

### Backup Script (Already Configured)

```bash
# Script location: ~/backup-databases.sh
# Configured in: pilot launch guide

# Manual backup
./backup-databases.sh

# Verify backup
ls -lh /backups/
```

### Backup Best Practices

1. **Test Restores Monthly**
   ```bash
   # Restore to a test database (decrypt first if the backup is GPG-encrypted)
   kubectl exec -i -n bakery-ia deployment/test-db -- \
     psql -U postgres test_db < /backups/2026-01-07/auth-db.sql
   ```

2. **Off-Site Storage (Recommended)**
   ```bash
   # Sync backups to S3 / Cloud Storage
   aws s3 sync /backups/ s3://bakery-ia-backups/ --delete

   # Or use rclone for any cloud provider
   rclone sync /backups/ remote:bakery-ia-backups
   ```

3. **Monitor Backup Success**
   ```bash
   # Check last backup date
   ls -t /backups/ | head -1

   # Set up alert if no backup in 25 hours
   ```
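
The 25-hour freshness check mentioned in the last item could be sketched as follows. It assumes the newest entry directly under the backup directory reflects the last successful run; the function names are hypothetical, and wiring the result into AlertManager or cron is left out.

```python
import os
import time


def newest_backup_age_hours(backup_dir, now=None):
    """Return the age in hours of the newest entry in backup_dir, or None if empty."""
    now = time.time() if now is None else now
    mtimes = [entry.stat().st_mtime for entry in os.scandir(backup_dir)]
    if not mtimes:
        return None
    return (now - max(mtimes)) / 3600.0


def backup_is_stale(backup_dir, max_age_hours=25.0, now=None):
    """Return True if there is no backup or the newest one is too old."""
    age = newest_backup_age_hours(backup_dir, now)
    return age is None or age > max_age_hours


if __name__ == "__main__":
    # /backups is the directory used elsewhere in this guide
    if os.path.isdir("/backups"):
        print("stale" if backup_is_stale("/backups") else "fresh")
```

A non-zero exit code on a stale result would let a cron job or monitoring probe raise the alert.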

### Recovery Procedures

#### Restore Single Database

```bash
# 1. Stop the service using the database
kubectl scale deployment auth-service -n bakery-ia --replicas=0

# 2. Drop and recreate database
kubectl exec -n bakery-ia deployment/auth-db -it -- \
  psql -U postgres -c "DROP DATABASE auth_db;"
kubectl exec -n bakery-ia deployment/auth-db -it -- \
  psql -U postgres -c "CREATE DATABASE auth_db OWNER auth_user;"

# 3. Restore from backup
kubectl exec -i -n bakery-ia deployment/auth-db -- \
  psql -U auth_user -d auth_db < /backups/2026-01-07/auth-db.sql

# 4. Restart service
kubectl scale deployment auth-service -n bakery-ia --replicas=2
```

#### Disaster Recovery (Full System)

```bash
# 1. Provision new VPS (same specs)
# 2. Install MicroK8s (follow pilot launch guide)
# 3. Copy latest backup to new VPS
# 4. Deploy infrastructure and databases
kubectl apply -k infrastructure/kubernetes/overlays/prod

# 5. Wait for databases to be ready
kubectl wait --for=condition=ready pod -l app.kubernetes.io/component=database -n bakery-ia --timeout=300s

# 6. Restore all databases
for backup in /backups/latest/*.sql; do
  db_name=$(basename "$backup" .sql)
  echo "Restoring $db_name"
  kubectl exec -i -n bakery-ia "deployment/${db_name}" -- \
    psql -U postgres < "$backup"
done

# 7. Deploy services
kubectl apply -k infrastructure/kubernetes/overlays/prod

# 8. Update DNS to point to new VPS
# 9. Verify all services healthy
```

**Recovery Time Objective (RTO):** 2-4 hours
**Recovery Point Objective (RPO):** 24 hours (last daily backup)

---

## Performance Optimization

### Identifying Performance Issues

```bash
# 1. Check overall resource usage
kubectl top nodes
kubectl top pods -n bakery-ia --sort-by=cpu
kubectl top pods -n bakery-ia --sort-by=memory

# 2. Check API response times in Grafana
# Go to "Services Overview" dashboard
# Look for P95/P99 latency spikes

# 3. Check database query performance
kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c \
  "SELECT query, calls, mean_exec_time, max_exec_time
   FROM pg_stat_statements
   ORDER BY mean_exec_time DESC
   LIMIT 20;"

# 4. Check for N+1 queries in application logs
kubectl logs -n bakery-ia deployment/orders-service | grep "SELECT"
```

### Common Optimizations

#### 1. Database Indexing

```sql
-- Find tables with heavy sequential scans (candidates for missing indexes)
SELECT schemaname, relname, seq_scan, seq_tup_read, idx_scan
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_tup_read DESC
LIMIT 20;

-- Add index on frequently queried columns
CREATE INDEX CONCURRENTLY idx_orders_tenant_created
  ON orders(tenant_id, created_at DESC);
```

#### 2. Connection Pooling

Already configured in services using SQLAlchemy. Verify settings:
```python
# In shared/database/base.py
pool_size=5           # Adjust based on load
max_overflow=10       # Max additional connections
pool_timeout=30       # Connection timeout
pool_recycle=3600     # Recycle connections after 1 hour
```
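
When adjusting these settings, it helps to compare the worst-case number of connections against the database limit. The sketch below assumes each service replica holds its own pool against a single database, that `max_connections` is the PostgreSQL default of 100, and that the 80% alert threshold from this guide is the ceiling to stay under; the function names are hypothetical.

```python
def worst_case_connections(replicas, pool_size=5, max_overflow=10):
    """Upper bound on connections opened by all replicas of one service."""
    return replicas * (pool_size + max_overflow)


def pool_fits(replicas, max_connections=100, headroom=0.8,
              pool_size=5, max_overflow=10):
    """Return True if the worst case stays under headroom * max_connections."""
    worst = worst_case_connections(replicas, pool_size, max_overflow)
    return worst <= max_connections * headroom


if __name__ == "__main__":
    for replicas in (1, 3, 5, 6):
        print(replicas, worst_case_connections(replicas), pool_fits(replicas))
```

With the values shown above, each replica can open up to 15 connections, so replica count matters as much as pool size when connection alerts fire.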

#### 3. Redis Caching

Cache frequently accessed data:
```python
# Cache user permissions (example, assuming Flask-Caching)
# memoize() includes the arguments in the cache key, so each user_id gets its own entry
@cache.memoize(timeout=300)
def get_user_permissions(user_id):
    ...  # fetch from database
```

#### 4. Query Optimization

```sql
-- Add EXPLAIN ANALYZE to slow queries
EXPLAIN ANALYZE SELECT * FROM orders WHERE tenant_id = '...';

-- Look for:
-- - Seq Scan (should use index scan)
-- - High execution time
-- - Missing indexes
```

### Scaling Triggers

**When to scale UP:**
- ❌ CPU usage >75% sustained for >1 hour
- ❌ Memory usage >85% sustained
- ❌ P95 API latency >3s
- ❌ Database connection pool exhausted frequently
- ❌ Error rate increasing

**When to scale OUT (add replicas):**
- ❌ Request rate increasing significantly
- ❌ Single service bottleneck identified
- ❌ Need zero-downtime deployments
- ❌ Geographic distribution needed

---

## Scaling Operations

### Vertical Scaling (Upgrade VPS)

```bash
# 1. Create backup
./backup-databases.sh

# 2. Plan upgrade window (requires brief downtime)
# Notify users: "Scheduled maintenance 2 AM - 3 AM"

# 3. At clouding.io, upgrade VPS
# RAM: 20 GB → 32 GB
# CPU: 8 cores → 12 cores
# (Usually instant, may require restart)

# 4. Verify after upgrade
kubectl top nodes
free -h
nproc
```

### Horizontal Scaling (Add Replicas)

```bash
# Scale specific service (an HPA on the same deployment will override this; see Auto-Scaling below)
kubectl scale deployment orders-service -n bakery-ia --replicas=5

# Or persist the change in the kustomization
# Edit infrastructure/kubernetes/overlays/prod/kustomization.yaml:
#   replicas:
#     - name: orders-service
#       count: 5

kubectl apply -k infrastructure/kubernetes/overlays/prod
```

### Auto-Scaling (HPA)

Already configured for:
- orders-service (1-3 replicas)
- forecasting-service (1-3 replicas)
- notification-service (1-3 replicas)

```bash
# Check HPA status
kubectl get hpa -n bakery-ia

# Adjust thresholds if needed
kubectl edit hpa orders-service-hpa -n bakery-ia
```
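
When adjusting thresholds, it helps to know how the HPA picks a replica count. The Kubernetes documentation gives the core rule as `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`, clamped to the configured minimum and maximum. The sketch below reproduces only that rule; it ignores the tolerance band and stabilization windows, and the metric values in the example are hypothetical.

```python
import math


def hpa_desired_replicas(current_replicas, current_metric, target_metric,
                         min_replicas=1, max_replicas=3):
    """Core HPA scaling rule, clamped to the 1-3 replica range used above."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 2 replicas at 90% average CPU against a 60% target
    print(hpa_desired_replicas(2, 90, 60))
```

Because the result is clamped, a service that already sits at 3 replicas will not grow further until `maxReplicas` is raised.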

### Growth Path

| Tenants | Recommended Action |
|---------|-------------------|
| **10** | Current configuration (20GB RAM, 8 CPU) |
| **20** | Add replicas for critical services |
| **30** | Upgrade to 32GB RAM, 12 CPU |
| **50** | Consider database read replicas |
| **75** | Upgrade to 48GB RAM, 16 CPU |
| **100** | Plan multi-node cluster or managed K8s |
| **200+** | Migrate to managed services (EKS, GKE, AKS) |
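
The growth path can be read as a step function: each row applies from its tenant count until the next row starts. That reading is an assumption, as is the helper name `recommended_action`; the thresholds and actions are copied from the table.

```python
from bisect import bisect_right

# (minimum tenants, action) pairs copied from the Growth Path table
GROWTH_PATH = [
    (10, "Current configuration (20GB RAM, 8 CPU)"),
    (20, "Add replicas for critical services"),
    (30, "Upgrade to 32GB RAM, 12 CPU"),
    (50, "Consider database read replicas"),
    (75, "Upgrade to 48GB RAM, 16 CPU"),
    (100, "Plan multi-node cluster or managed K8s"),
    (200, "Migrate to managed services (EKS, GKE, AKS)"),
]


def recommended_action(tenants):
    """Return the action for the highest threshold not exceeding tenants."""
    thresholds = [minimum for minimum, _ in GROWTH_PATH]
    index = bisect_right(thresholds, tenants) - 1
    return GROWTH_PATH[max(index, 0)][1]


if __name__ == "__main__":
    for count in (10, 35, 120):
        print(count, "->", recommended_action(count))
```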

---

## Incident Response

### Incident Severity Levels

| Level | Description | Response Time | Example |
|-------|-------------|---------------|---------|
| **P0** | Complete outage | Immediate | All services down |
| **P1** | Major degradation | 15 minutes | Database unavailable |
| **P2** | Partial degradation | 1 hour | One service slow |
| **P3** | Minor issue | 4 hours | Non-critical alert |

### Incident Response Process

#### 1. Detect & Alert
```
- Monitoring alerts trigger
- User reports issue
- Automated health checks fail
```

#### 2. Assess & Communicate
```bash
# Quick assessment
./health-check.sh

# Determine severity
# P0/P1: Notify all stakeholders immediately
# P2/P3: Regular communication channels
```

#### 3. Investigate
```bash
# Check pods
kubectl get pods -n bakery-ia

# Check recent events
kubectl get events -n bakery-ia --sort-by='.lastTimestamp' | tail -20

# Check logs
kubectl logs -n bakery-ia deployment/SERVICE_NAME --tail=100

# Check metrics
# View Grafana dashboards
```

#### 4. Mitigate
```bash
# Common mitigations:

# Restart service
kubectl rollout restart deployment/SERVICE_NAME -n bakery-ia

# Rollback deployment
kubectl rollout undo deployment/SERVICE_NAME -n bakery-ia

# Scale up
kubectl scale deployment SERVICE_NAME -n bakery-ia --replicas=5

# Restart database
kubectl delete pod DB_POD_NAME -n bakery-ia
```

#### 5. Resolve & Document
```
1. Verify issue resolved
2. Update incident log
3. Create post-mortem (for P0/P1)
4. Implement preventive measures
```

### Common Incidents & Fixes

#### Incident: Database Connection Exhaustion

**Symptoms:** Services showing "connection pool exhausted" errors

**Fix:**
```bash
# 1. Identify leaking service
kubectl logs -n bakery-ia deployment/orders-service | grep "pool"

# 2. Restart leaking service
kubectl rollout restart deployment/orders-service -n bakery-ia

# 3. Increase max_connections if needed
kubectl exec -n bakery-ia deployment/orders-db -- \
  psql -U postgres -c "ALTER SYSTEM SET max_connections = 200;"
kubectl rollout restart deployment/orders-db -n bakery-ia
```

#### Incident: Out of Memory (OOMKilled)

**Symptoms:** Pods restarting with "OOMKilled" status

**Fix:**
```bash
# 1. Identify which pod
kubectl get pods -n bakery-ia | grep OOMKilled

# 2. Check resource limits
kubectl describe pod POD_NAME -n bakery-ia | grep -A 5 Limits

# 3. Increase memory limit
# Edit deployment: infrastructure/kubernetes/base/components/services/SERVICE.yaml
#   resources:
#     limits:
#       memory: "1Gi"  # Increased from 512Mi

# 4. Redeploy
kubectl apply -k infrastructure/kubernetes/overlays/prod
```

#### Incident: Certificate Expired

**Symptoms:** SSL errors, services can't connect

**Fix:**
```bash
# For Let's Encrypt (should auto-renew):
kubectl delete secret bakery-ia-prod-tls-cert -n bakery-ia
# Wait for cert-manager to recreate

# For internal certs:
# Follow "Certificate Rotation" section above
```

---

## Maintenance Tasks

### Daily Tasks

```bash
# Run health check
./health-check.sh

# Check monitoring alerts (requires the Prometheus port-forward on localhost:9090)
curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts[] | select(.state=="firing")'

# Verify backups ran (newest first)
ls -lht /backups/ | head -5
```

### Weekly Tasks

```bash
# Review resource trends
# Open Grafana, check 7-day trends

# Review error logs
kubectl logs -n bakery-ia deployment/gateway --since=168h | grep ERROR | wc -l

# Check disk usage
kubectl exec -n bakery-ia deployment/auth-db -- df -h

# Review security logs
kubectl logs -n bakery-ia deployment/auth-service --since=168h | grep "failed"
```

### Monthly Tasks

- [ ] **Review and rotate passwords**
- [ ] **Update security patches**
- [ ] **Test backup restore**
- [ ] **Review RBAC policies**
- [ ] **Vacuum and analyze databases**
- [ ] **Review and optimize slow queries**
- [ ] **Check certificate expiry dates**
- [ ] **Review resource allocation**
- [ ] **Plan capacity for next quarter**
- [ ] **Update documentation**

### Quarterly Tasks (Every 90 Days)

- [ ] **Full security audit**
- [ ] **Disaster recovery drill**
- [ ] **Performance testing**
- [ ] **Cost optimization review**
- [ ] **Update runbooks**
- [ ] **Team training session**
- [ ] **Review SLAs and metrics**
- [ ] **Plan infrastructure upgrades**

### Annual Tasks

- [ ] **Penetration testing**
- [ ] **Compliance audit (GDPR, PCI-DSS, SOC 2)**
- [ ] **Full infrastructure review**
- [ ] **Update security roadmap**
- [ ] **Budget planning for next year**
- [ ] **Technology stack review**

---

## Compliance & Audit

### GDPR Compliance

**Requirements Met:**
- ✅ Article 32: Encryption of personal data (TLS + pgcrypto)
- ✅ Article 5(1)(f): Security of processing
- ✅ Article 33: Breach detection (audit logs)
- ✅ Article 17: Right to erasure (deletion endpoints)
- ✅ Article 20: Right to data portability (export functionality)

**Audit Tasks:**
```bash
# Review audit logs for data access
kubectl logs -n bakery-ia deployment/tenant-service | grep "user_data_access"

# Verify encryption in use
kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c "SHOW ssl;"

# Check data retention policies
# Review automated cleanup jobs
```

### PCI-DSS Compliance

**Requirements Met:**
- ✅ Requirement 3: Stored data protection (pgcrypto)
- ✅ Requirement 4: Encryption of cardholder data in transit (TLS 1.2+)
- ✅ Requirement 10: Access tracking (audit logs)
- ✅ Requirement 8: User authentication (JWT + MFA ready)

**Audit Tasks:**
```bash
# Verify no plaintext passwords
kubectl get secret bakery-ia-secrets -n bakery-ia -o jsonpath='{.data}' | grep -i "pass"

# Check encryption in transit
kubectl describe ingress -n bakery-ia | grep TLS

# Review access logs
kubectl logs -n bakery-ia deployment/auth-service | grep "login"
```

### SOC 2 Compliance

**Controls Met:**
- ✅ CC6.1: Access controls (RBAC)
- ✅ CC6.6: Encryption in transit (TLS)
- ✅ CC6.7: Encryption at rest (K8s secrets + pgcrypto)
- ✅ CC7.2: Monitoring (Prometheus + Grafana)

### Audit Log Retention

**Current Policy:**
- Application logs: 30 days (stdout)
- Database audit logs: 90 days
- Security logs: 1 year
- Backups: 30 days rolling

**Extending Retention:**
```bash
# Ship logs to external storage
# Example: Ship to S3 / CloudWatch / ELK

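# For PostgreSQL audit logs: log_rotation_age sets how long each log file is
# written before a new one is started; how long old files are kept depends on
# log_filename and any external cleanup
kubectl exec -n bakery-ia deployment/auth-db -- \
  psql -U postgres -c "ALTER SYSTEM SET log_rotation_age = '90d';"
kubectl exec -n bakery-ia deployment/auth-db -- \
  psql -U postgres -c "SELECT pg_reload_conf();"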
```

---

## Quick Reference Commands

### Emergency Commands

```bash
# Restart all deployments in the namespace, databases included (rolling update)
kubectl rollout restart deployment -n bakery-ia

# Restart specific service
kubectl rollout restart deployment/orders-service -n bakery-ia

# Rollback last deployment
kubectl rollout undo deployment/orders-service -n bakery-ia

# Scale up quickly
kubectl scale deployment orders-service -n bakery-ia --replicas=5

# Get pod status
kubectl get pods -n bakery-ia

# Get recent events
kubectl get events -n bakery-ia --sort-by='.lastTimestamp' | tail -20

# Get logs
kubectl logs -n bakery-ia deployment/SERVICE_NAME --tail=100 -f
```

### Monitoring Commands

```bash
# Resource usage
kubectl top nodes
kubectl top pods -n bakery-ia --sort-by=cpu
kubectl top pods -n bakery-ia --sort-by=memory

# Check HPA
kubectl get hpa -n bakery-ia

# Check all resources
kubectl get all -n bakery-ia

# Check ingress
kubectl get ingress -n bakery-ia

# Check certificates
kubectl get certificate -n bakery-ia
```

### Database Commands

```bash
# Connect to database
kubectl exec -n bakery-ia deployment/auth-db -it -- psql -U auth_user -d auth_db

# Check connections
kubectl exec -n bakery-ia deployment/auth-db -- \
  psql -U postgres -c "SELECT count(*) FROM pg_stat_activity;"

# Check database size
kubectl exec -n bakery-ia deployment/auth-db -- \
  psql -U postgres -c "SELECT pg_size_pretty(pg_database_size('auth_db'));"

# Vacuum database
kubectl exec -n bakery-ia deployment/auth-db -- \
  psql -U auth_user -d auth_db -c "VACUUM ANALYZE;"
```

---

## Support Resources

**Documentation:**
- [Pilot Launch Guide](./PILOT_LAUNCH_GUIDE.md) - Initial deployment
- [Monitoring Summary](./MONITORING_DEPLOYMENT_SUMMARY.md) - Monitoring details
- [Quick Start Monitoring](./QUICK_START_MONITORING.md) - Monitoring setup
- [Security Checklist](./security-checklist.md) - Security procedures
- [Database Security](./database-security.md) - Database operations
- [TLS Configuration](./tls-configuration.md) - Certificate management
- [RBAC Implementation](./rbac-implementation.md) - Access control

**External Resources:**
- Kubernetes: https://kubernetes.io/docs
- MicroK8s: https://microk8s.io/docs
- Prometheus: https://prometheus.io/docs
- Grafana: https://grafana.com/docs
- PostgreSQL: https://www.postgresql.org/docs

**Emergency Contacts:**
- DevOps Team: devops@yourdomain.com
- On-Call: oncall@yourdomain.com
- Security Team: security@yourdomain.com

---

## Summary

This guide covers all aspects of operating the Bakery-IA platform in production:

✅ **Monitoring:** Dashboards, alerts, metrics
✅ **Security:** Access control, certificates, compliance
✅ **Databases:** Management, optimization, backups
✅ **Recovery:** Backup strategy, disaster recovery
✅ **Performance:** Optimization techniques, scaling
✅ **Incidents:** Response procedures, common fixes
✅ **Maintenance:** Daily, weekly, monthly tasks
✅ **Compliance:** GDPR, PCI-DSS, SOC 2

**Remember:**
- Monitor daily
- Back up daily
- Test restores monthly
- Rotate secrets quarterly
- Plan for growth continuously

---

**Document Version:** 1.0
**Last Updated:** 2026-01-07
**Maintained By:** DevOps Team
**Next Review:** 2026-04-07