bakery-ia/docs/PRODUCTION_OPERATIONS_GUIDE.md

# Bakery-IA Production Operations Guide

**Complete guide for operating, monitoring, and maintaining the production environment**

**Last Updated:** 2026-01-07
**Target Audience:** DevOps, SRE, System Administrators
**Security Grade:** A-

---

## Table of Contents

1. [Overview](#overview)
2. [Monitoring & Observability](#monitoring--observability)
3. [CI/CD Operations](#cicd-operations)
4. [Security Operations](#security-operations)
5. [Database Management](#database-management)
6. [Backup & Recovery](#backup--recovery)
7. [Performance Optimization](#performance-optimization)
8. [Scaling Operations](#scaling-operations)
9. [Incident Response](#incident-response)
10. [Maintenance Tasks](#maintenance-tasks)
11. [Compliance & Audit](#compliance--audit)

---

## Overview

### Production Environment

**Infrastructure:**
- **Platform:** MicroK8s on Ubuntu 22.04 LTS
- **Services:** 18 microservices, 14 databases, monitoring stack
- **Capacity:** 10-tenant pilot (scalable to 100+)
- **Security:** TLS encryption, RBAC, audit logging
- **Monitoring:** Prometheus, Grafana, AlertManager, SigNoz
- **CI/CD:** Tekton Pipelines, Gitea, Flux CD (GitOps)
- **Email:** Mailu (integrated email server)

**Key Metrics (10-tenant baseline):**
- **Uptime Target:** 99.5% (3.65 hours downtime/month)
- **Response Time:** <2s average API response
- **Error Rate:** <1% of requests
- **Database Connections:** ~200 concurrent
- **Memory Usage:** 12-15 GB / 20 GB capacity
- **CPU Usage:** 40-60% under normal load
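
The downtime figure attached to the uptime target follows from simple arithmetic. The sketch below reproduces it, assuming an average month of 730 hours (8,760 hours per year divided by 12); the function name and the constant are illustrative, not part of the platform.

```python
HOURS_PER_MONTH = 730  # assumption: average month = 8,760 hours per year / 12


def downtime_budget_hours(uptime_target_pct: float,
                          period_hours: float = HOURS_PER_MONTH) -> float:
    """Return the downtime (in hours) allowed in a period for an uptime target."""
    if not 0 <= uptime_target_pct <= 100:
        raise ValueError("uptime target must be a percentage between 0 and 100")
    return period_hours * (100 - uptime_target_pct) / 100


# The 99.5% target leaves roughly 3.65 hours of downtime per average month.
print(f"{downtime_budget_hours(99.5):.2f} hours/month")
```

Passing a different `period_hours` (for example `24 * 30`) gives the budget for a fixed 30-day window instead.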

### Team Responsibilities

| Role | Responsibilities |
|------|------------------|
| **DevOps Engineer** | Deployment, infrastructure, scaling, CI/CD |
| **SRE** | Monitoring, incident response, performance |
| **Security Admin** | Access control, security patches, compliance |
| **Database Admin** | Backups, optimization, migrations |
| **On-Call Engineer** | 24/7 incident response (if applicable) |
| **CI/CD Admin** | Pipeline management, GitOps workflows |

---

## Monitoring & Observability

### Access Monitoring Dashboards

**Production URLs:**
```
https://monitoring.bakewise.ai/signoz        # SigNoz - Unified observability (PRIMARY)
https://monitoring.bakewise.ai/alertmanager  # AlertManager - Alert management
```

**What is SigNoz?**
SigNoz is a comprehensive, open-source observability platform that provides:
- **Distributed Tracing** - End-to-end request tracking across all microservices
- **Metrics Monitoring** - Application and infrastructure metrics
- **Log Management** - Centralized log aggregation with trace correlation
- **Service Performance Monitoring (SPM)** - RED metrics (Rate, Error, Duration) from traces
- **Database Monitoring** - All 14 PostgreSQL databases + Redis + RabbitMQ
- **Kubernetes Monitoring** - Cluster, node, pod, and container metrics


### Key SigNoz Dashboards and Features

#### 1. Services Tab - APM Overview
**What to Monitor:**
- **Service List** - All 18 microservices with health status
- **Request Rate** - Requests per second per service
- **Error Rate** - Percentage of failed requests (aim: <1%)
- **P50/P90/P99 Latency** - Response time percentiles (aim: P99 <2s)
- **Operations** - Breakdown by endpoint/operation

**Red Flags:**
- ❌ Error rate >5% sustained
- ❌ P99 latency >3s
- ❌ Sudden drop in request rate (service might be down)
- ❌ High latency on specific endpoints
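
The targets and red flags above can be read as a simple three-way classification. The following is a minimal sketch of that reading; the `ServiceStats` type, the `classify` function, and the intermediate "warn" band (between the target and the red-flag threshold) are assumptions made for illustration and are not SigNoz features.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceStats:
    error_rate_pct: float  # percentage of failed requests
    p99_latency_s: float   # P99 response time in seconds


def classify(stats: ServiceStats) -> str:
    """Map service stats onto the thresholds listed in this guide."""
    # Red flags: error rate >5% or P99 latency >3s.
    if stats.error_rate_pct > 5 or stats.p99_latency_s > 3:
        return "red"
    # Targets missed (aim: error rate <1%, P99 <2s) but no red flag yet.
    if stats.error_rate_pct >= 1 or stats.p99_latency_s >= 2:
        return "warn"
    return "ok"


print(classify(ServiceStats(error_rate_pct=0.2, p99_latency_s=0.8)))
```

The sketch ignores the "sustained" qualifier; in practice the same check would be applied to values averaged over a window rather than to a single sample.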

**How to Access:**
- Navigate to `Services` tab in SigNoz
- Click on any service for detailed metrics
- Use "Traces" tab to see sample requests

#### 2. Traces Tab - Distributed Tracing
**What to Monitor:**
- **End-to-end request flows** across microservices
- **Span duration** - Time spent in each service
- **Database query performance** - Auto-captured from SQLAlchemy
- **External API calls** - Auto-captured from HTTPX
- **Error traces** - Requests that failed with stack traces

**Features:**
- Filter by service, operation, status code, duration
- Search by trace ID or span ID
- Correlate traces with logs
- Identify slow database queries and N+1 problems

**Red Flags:**
- ❌ Traces showing >10 database queries per request (N+1 issue)
- ❌ External API calls taking >1s
- ❌ Services with >500ms internal processing time
- ❌ Error spans with exceptions

#### 3. Dashboards Tab - Infrastructure Metrics
**Pre-built Dashboards:**
- **PostgreSQL Monitoring** - All 14 databases
  - Active connections, transactions/sec, cache hit ratio
  - Slow queries, lock waits, replication lag
  - Database size, disk I/O
- **Redis Monitoring** - Cache performance
  - Memory usage, hit rate, evictions
  - Commands/sec, latency
- **RabbitMQ Monitoring** - Message queue health
  - Queue depth, message rates
  - Consumer status, connections
- **Kubernetes Cluster** - Node and pod metrics
  - CPU, memory, disk, network per node
  - Pod resource utilization
  - Container restarts and OOM kills

**Red Flags:**
- ❌ PostgreSQL: Cache hit ratio <80%, active connections >80% of max
- ❌ Redis: Memory >90%, evictions increasing
- ❌ RabbitMQ: Queue depth growing, no consumers
- ❌ Kubernetes: CPU >85%, memory >90%, disk <20% free

#### 4. Logs Tab - Centralized Logging
**Features:**
- **Unified logs** from all 18 microservices + databases
- **Trace correlation** - Click on trace ID to see related logs
- **Kubernetes metadata** - Auto-tagged with pod, namespace, container
- **Search and filter** - By service, severity, time range, content
- **Log patterns** - Automatically detect common patterns

**What to Monitor:**
- Error and warning logs across all services
- Database connection errors
- Authentication failures
- API request/response logs

**Red Flags:**
- ❌ Increasing error logs
- ❌ Repeated "connection refused" or "timeout" messages
- ❌ Authentication failures (potential security issue)
- ❌ Out of memory errors

#### 5. Alerts Tab - Alert Management
**Features:**
- Create alerts based on metrics, traces, or logs
- Configure notification channels (email, Slack, webhook)
- View firing alerts and alert history
- Alert silencing and acknowledgment

**Pre-configured Alerts (see SigNoz):**
- High error rate (>5% for 5 minutes)
- High latency (P99 >3s for 5 minutes)
- Service down (no requests for 2 minutes)
- Database connection errors
- High memory/CPU usage

### Alert Severity Levels

| Severity | Response Time | Escalation | Examples |
|----------|---------------|------------|----------|
| **Critical** | Immediate | Page on-call | Service down, database unavailable |
| **Warning** | 30 minutes | Email team | High memory, slow queries |
| **Info** | Best effort | Email | Backup completed, cert renewal |

### Common Alerts & Responses

#### Alert: ServiceDown
```
Severity: Critical
Meaning: A service has been down for >2 minutes
Response:
1. Check pod status: kubectl get pods -n bakery-ia
2. View logs: kubectl logs POD_NAME -n bakery-ia
3. Check recent deployments: kubectl rollout history deployment/SERVICE_NAME -n bakery-ia
4. Restart if safe: kubectl rollout restart deployment/SERVICE_NAME -n bakery-ia
5. Rollback if needed: kubectl rollout undo deployment/SERVICE_NAME -n bakery-ia
```

#### Alert: HighMemoryUsage
```
Severity: Warning
Meaning: Service using >80% of memory limit
Response:
1. Check which pods: kubectl top pods -n bakery-ia --sort-by=memory
2. Review memory trends in Grafana
3. Check for memory leaks in application logs
4. Consider increasing memory limits if sustained
5. Restart pod if memory leak suspected
```

#### Alert: DatabaseConnectionsHigh
```
Severity: Warning
Meaning: Database connections >80% of max
Response:
1. Identify which service: Check Grafana database dashboard
2. Look for connection leaks in application
3. Check for long-running transactions
4. Consider increasing max_connections
5. Restart service if connections not releasing
```

#### Alert: CertificateExpiringSoon
```
Severity: Warning
Meaning: TLS certificate expires in <30 days
Response:
1. For Let's Encrypt: Auto-renewal should handle this (verify cert-manager is running)
2. For internal certs: Regenerate and apply new certificates
3. See "Certificate Rotation" section below
```

### Daily Monitoring Workflow with SigNoz

#### Morning Health Check (5 minutes)

1. **Open SigNoz Dashboard**
   ```
   https://monitoring.bakewise.ai/signoz
   ```

2. **Check Services Tab:**
   - Verify all 18 services are reporting metrics
   - Check error rate <1% for all services
   - Check P99 latency <2s for critical services

3. **Check Alerts Tab:**
   - Review any firing alerts
   - Check for patterns (repeated alerts on same service)
   - Acknowledge or resolve as needed

4. **Quick Infrastructure Check:**
   - Navigate to Dashboards → PostgreSQL
     - Verify all 14 databases are up
     - Check connection counts are healthy
   - Navigate to Dashboards → Redis
     - Check memory usage <80%
   - Navigate to Dashboards → Kubernetes
     - Verify node health, no OOM kills

#### Command-Line Health Check (Alternative)

```bash
# Quick health check command
cat > ~/health-check.sh <<'EOF'
#!/bin/bash
echo "=== Bakery-IA Health Check ==="
echo "Date: $(date)"
echo ""

echo "1. Pod Status:"
kubectl get pods -n bakery-ia --no-headers | grep -vE "Running|Completed" || echo "✅ All pods healthy"
echo ""

echo "2. Resource Usage:"
kubectl top nodes
kubectl top pods -n bakery-ia --sort-by=memory | head -10
echo ""

echo "3. SigNoz Components:"
kubectl get pods -n bakery-ia -l app.kubernetes.io/instance=signoz
echo ""

echo "4. Recent Alerts (from SigNoz AlertManager):"
curl -s http://localhost:9093/api/v1/alerts 2>/dev/null | jq '.data[] | select(.status.state=="active") | {alert: .labels.alertname, severity: .labels.severity}' | head -10
echo ""

echo "5. OTel Collector Health:"
kubectl exec -n bakery-ia deployment/signoz-otel-collector -- wget -qO- http://localhost:13133 >/dev/null 2>&1 && echo "✅ Health check endpoint responding" || echo "❌ Health check endpoint not responding"
echo ""

echo "=== End Health Check ==="
EOF

chmod +x ~/health-check.sh
~/health-check.sh
```

#### Troubleshooting Common Issues

**Issue: Service not showing in SigNoz**
```bash
# Check if service is sending telemetry
kubectl logs -n bakery-ia deployment/SERVICE_NAME | grep -i "telemetry\|otel\|signoz"

# Check OTel Collector is receiving data
kubectl logs -n bakery-ia deployment/signoz-otel-collector | grep SERVICE_NAME

# Verify service has proper OTEL endpoints configured
kubectl exec -n bakery-ia deployment/SERVICE_NAME -- env | grep OTEL
```

**Issue: No traces appearing**
```bash
# Check tracing is enabled in service
kubectl exec -n bakery-ia deployment/SERVICE_NAME -- env | grep ENABLE_TRACING

# Verify OTel Collector gRPC endpoint is reachable
kubectl exec -n bakery-ia deployment/SERVICE_NAME -- nc -zv signoz-otel-collector 4317
```

**Issue: Logs not appearing**
```bash
# Check filelog receiver is working
kubectl logs -n bakery-ia deployment/signoz-otel-collector | grep filelog

# Check k8sattributes processor
kubectl logs -n bakery-ia deployment/signoz-otel-collector | grep k8sattributes
```

---

## CI/CD Operations

### CI/CD Infrastructure Overview

The platform includes a complete CI/CD pipeline using:
- **Gitea** - Git server and container registry
- **Tekton** - Pipeline automation
- **Flux CD** - GitOps deployment

### Access CI/CD Systems

**SSH Access to Production VPS:**
- **IP Address:** `200.234.233.87`
- **Access Method:** SSH with key authentication
- **Command:** `ssh -i ~/.ssh/your_private_key.pem root@200.234.233.87`

**Gitea (Git Server):**
- URL: http://gitea.bakery-ia.local (development) or http://gitea.bakewise.ai (production)
- Admin panel: http://gitea.bakery-ia.local/admin

**Tekton Dashboard:**
```bash
# Port forward to access Tekton dashboard
kubectl port-forward -n tekton-pipelines svc/tekton-dashboard 9097:9097
# Access at: http://localhost:9097
```

**Flux Status:**
```bash
# Check Flux status
flux check
kubectl get gitrepository -n flux-system
kubectl get kustomization -n flux-system
```

### CI/CD Monitoring

**Check pipeline status:**
```bash
# List all PipelineRuns
kubectl get pipelineruns -n tekton-pipelines

# Check Tekton controller logs
kubectl logs -n tekton-pipelines -l app=tekton-pipelines-controller

# Check Tekton dashboard logs
kubectl logs -n tekton-pipelines -l app=tekton-dashboard
```

**Monitor GitOps synchronization:**
```bash
# Check GitRepository status
kubectl get gitrepository -n flux-system -o wide

# Check Kustomization status
kubectl get kustomization -n flux-system -o wide

# Get reconciliation history
kubectl get events -n flux-system --sort-by='.lastTimestamp'
```

### CI/CD Troubleshooting

**Pipeline not triggering:**
```bash
# Check Tekton Triggers controller logs (processes incoming Gitea webhook events)
kubectl logs -n tekton-pipelines -l app=tekton-triggers-controller

# Verify EventListener pods are running
kubectl get pods -n tekton-pipelines -l app=tekton-triggers-eventlistener

# Check TriggerBinding configuration
kubectl get triggerbinding -n tekton-pipelines
```

**Build failures:**
```bash
# Check Kaniko logs for build errors
kubectl logs -n tekton-pipelines -l tekton.dev/task=kaniko-build

# Verify Dockerfile paths are correct
kubectl describe taskrun -n tekton-pipelines
```

**Flux not applying changes:**
```bash
# Check GitRepository status
kubectl describe gitrepository -n flux-system

# Check Kustomization reconciliation
kubectl describe kustomization -n flux-system

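# Check Flux controller logs (source-controller fetches Git; kustomize-controller applies manifests)
kubectl logs -n flux-system deployment/source-controller
kubectl logs -n flux-system deployment/kustomize-controller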
```

### CI/CD Maintenance Tasks

**Daily Tasks:**
- [ ] Check for failed pipeline runs
- [ ] Verify GitOps synchronization status
- [ ] Clean up old PipelineRun resources

**Weekly Tasks:**
- [ ] Review pipeline performance metrics
- [ ] Update pipeline definitions if needed
- [ ] Rotate CI/CD secrets

**Monthly Tasks:**
- [ ] Update Tekton and Flux versions
- [ ] Review and optimize pipeline performance
- [ ] Audit CI/CD access permissions

---

## Security Operations

### Security Posture Overview

**Current Security Grade: A-**

**Implemented:**
- ✅ TLS 1.2+ encryption for all database connections
- ✅ Let's Encrypt SSL for public endpoints
- ✅ 32-character cryptographic passwords
- ✅ JWT-based authentication
- ✅ Tenant isolation at database and application level
- ✅ Kubernetes secrets encryption at rest
- ✅ PostgreSQL audit logging
- ✅ RBAC (Role-Based Access Control)
- ✅ Regular security updates

### Access Control Management

#### User Roles

| Role | Permissions | Use Case |
|------|-------------|----------|
| **Viewer** | Read-only access | Dashboard viewing, reports |
| **Member** | Read + create/update | Day-to-day operations |
| **Admin** | Full operational access | Manage users, configure settings |
| **Owner** | Full control | Billing, tenant deletion |

#### Managing User Access

```bash
# View current users for a tenant (via API)
curl -H "Authorization: Bearer $ADMIN_TOKEN" \
  https://api.yourdomain.com/api/v1/tenants/TENANT_ID/users

# Promote user to admin
curl -X PATCH -H "Authorization: Bearer $OWNER_TOKEN" \
  -H "Content-Type: application/json" \
  https://api.yourdomain.com/api/v1/tenants/TENANT_ID/users/USER_ID \
  -d '{"role": "admin"}'
```

### Security Checklist (Monthly)

- [ ] **Review audit logs for suspicious activity**
  ```bash
  # Check failed login attempts
  kubectl logs -n bakery-ia deployment/auth-service | grep "authentication failed" | tail -50

  # Check unusual API calls
  kubectl logs -n bakery-ia deployment/gateway | grep -E "DELETE|admin" | tail -50
  ```

- [ ] **Verify all services using TLS**
  ```bash
  # Check PostgreSQL SSL
  for db in $(kubectl get deploy -n bakery-ia -l app.kubernetes.io/component=database -o name); do
    echo "Checking $db"
    kubectl exec -n bakery-ia $db -- psql -U postgres -c "SHOW ssl;"
  done
  ```

- [ ] **Review and rotate passwords (every 90 days)**
  ```bash
  # Generate new passwords
  openssl rand -base64 32  # Run once per service; emits 32 random bytes as 44 base64 characters

  # Update secrets (values under .data must be base64-encoded)
  kubectl edit secret bakery-ia-secrets -n bakery-ia

  # Restart services to pick up new passwords
  kubectl rollout restart deployment -n bakery-ia
  ```

- [ ] **Check certificate expiry dates**
  ```bash
  # Check Let's Encrypt certs
  kubectl get certificate -n bakery-ia

  # Check internal TLS certs (expire Oct 2028)
  kubectl exec -n bakery-ia deployment/auth-db -- \
    openssl x509 -in /tls/server-cert.pem -noout -dates
  ```

- [ ] **Review RBAC policies**
  - Ensure least privilege principle
  - Remove access for departed team members
  - Audit admin/owner role assignments

- [ ] **Apply security updates**
  ```bash
  # Update system packages on VPS
  ssh root@$VPS_IP "apt update && apt upgrade -y"

  # Update container images (rebuild with latest base images)
  docker-compose build --pull
  ```

### Certificate Rotation

#### Let's Encrypt (Auto-Renewal)

Let's Encrypt certificates auto-renew via cert-manager. Verify:

```bash
# Check cert-manager is running
kubectl get pods -n cert-manager

# Check certificate status
kubectl describe certificate bakery-ia-prod-tls-cert -n bakery-ia

# Force early re-issuance if needed (automatic renewal normally starts about 30 days before expiry)
kubectl delete secret bakery-ia-prod-tls-cert -n bakery-ia
# cert-manager will automatically recreate the secret
```

#### Internal TLS Certificates (Manual Rotation)

**When:** 90 days before October 2028 expiry

```bash
# 1. Generate new certificates (on local machine)
cd infrastructure/tls
./generate-certificates.sh

# 2. Update Kubernetes secrets
kubectl delete secret postgres-tls redis-tls -n bakery-ia

kubectl create secret generic postgres-tls \
  --from-file=server-cert.pem=postgres/server-cert.pem \
  --from-file=server-key.pem=postgres/server-key.pem \
  --from-file=ca-cert.pem=postgres/ca-cert.pem \
  -n bakery-ia

kubectl create secret generic redis-tls \
  --from-file=redis-cert.pem=redis/redis-cert.pem \
  --from-file=redis-key.pem=redis/redis-key.pem \
  --from-file=ca-cert.pem=redis/ca-cert.pem \
  -n bakery-ia

# 3. Restart database pods to pick up new certs
kubectl rollout restart deployment -n bakery-ia -l app.kubernetes.io/component=database
kubectl rollout restart deployment -n bakery-ia -l app.kubernetes.io/component=cache

# 4. Verify new certificates
kubectl exec -n bakery-ia deployment/auth-db -- \
  openssl x509 -in /tls/server-cert.pem -noout -dates
```
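
To decide whether the 90-day rotation window has opened, the `notAfter` line printed by `openssl x509 -noout -dates` can be parsed and compared with the current date. This is a minimal sketch: the function names are illustrative, the sample date is made up (only "October 2028" comes from this guide), and month-name parsing assumes an English/C locale.

```python
from datetime import datetime, timedelta, timezone


def parse_not_after(line: str) -> datetime:
    """Parse a line such as 'notAfter=Oct  5 12:00:00 2028 GMT' into a UTC datetime."""
    value = line.strip().split("=", 1)[1]
    value = value.replace(" GMT", "")  # openssl reports certificate dates in GMT
    parsed = datetime.strptime(value, "%b %d %H:%M:%S %Y")
    return parsed.replace(tzinfo=timezone.utc)


def rotation_due(line: str, now: datetime, lead_days: int = 90) -> bool:
    """Return True once the certificate is within lead_days of expiry."""
    return parse_not_after(line) - now <= timedelta(days=lead_days)


# Hypothetical expiry date, for illustration only.
sample = "notAfter=Oct  5 12:00:00 2028 GMT"
print(rotation_due(sample, now=datetime.now(timezone.utc)))
```

Feeding it the real output, for example from the verification command in step 4, gives a simple yes/no answer that can be wired into a scheduled check.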

---

## Database Management

### Database Architecture

**14 PostgreSQL Instances:**
- auth-db, tenant-db, training-db, forecasting-db, sales-db
- external-db, notification-db, inventory-db, recipes-db
- suppliers-db, pos-db, orders-db, production-db, alert-processor-db

**1 Redis Instance:** Shared caching and session storage

### Database Health Monitoring

```bash
# Check all database pods
kubectl get pods -n bakery-ia -l app.kubernetes.io/component=database

# Check database resource usage
kubectl top pods -n bakery-ia -l app.kubernetes.io/component=database

# Check database connections
for db in $(kubectl get pods -n bakery-ia -l app.kubernetes.io/component=database -o name); do
  echo "=== $db ==="
  kubectl exec -n bakery-ia $db -- psql -U postgres -c \
    "SELECT datname, count(*) FROM pg_stat_activity GROUP BY datname;"
done
```

### Common Database Operations

#### Connect to Database

```bash
# Connect to specific database
kubectl exec -n bakery-ia deployment/auth-db -it -- \
  psql -U auth_user -d auth_db

# Inside psql:
\dt              # List tables
\d+ table_name   # Describe table with details
\du              # List users
\l               # List databases
\q               # Quit
```

#### Check Database Size

```bash
kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c \
  "SELECT pg_database.datname,
   pg_size_pretty(pg_database_size(pg_database.datname)) AS size
   FROM pg_database;"
```

#### Analyze Slow Queries

```bash
# List the 10 slowest queries by mean execution time (requires the pg_stat_statements extension)
kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c \
  "SELECT query, mean_exec_time, calls
   FROM pg_stat_statements
   ORDER BY mean_exec_time DESC
   LIMIT 10;"
```

#### Check Database Locks

```bash
kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c \
  "SELECT blocked_locks.pid AS blocked_pid,
   blocking_locks.pid AS blocking_pid,
   blocked_activity.usename AS blocked_user,
   blocking_activity.usename AS blocking_user,
   blocked_activity.query AS blocked_statement
   FROM pg_catalog.pg_locks blocked_locks
   JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
   JOIN pg_catalog.pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype
   AND blocking_locks.relation = blocked_locks.relation
   AND blocking_locks.pid <> blocked_locks.pid
   JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
   WHERE NOT blocked_locks.granted AND blocking_locks.granted;"
```

### Database Optimization

#### Vacuum and Analyze

```bash
# Run on each database monthly
kubectl exec -n bakery-ia deployment/auth-db -- \
  psql -U auth_user -d auth_db -c "VACUUM ANALYZE;"

# For all databases (run as cron job)
cat > ~/vacuum-databases.sh <<'EOF'
#!/bin/bash
for db in $(kubectl get deploy -n bakery-ia -l app.kubernetes.io/component=database -o name); do
  echo "Vacuuming $db"
  kubectl exec -n bakery-ia $db -- vacuumdb -U postgres --all --analyze
done
EOF

chmod +x ~/vacuum-databases.sh
# Add to cron: 0 3 * * 0 (weekly at 3 AM)
```

#### Reindex (if performance degrades)

```bash
# Reindex specific database
kubectl exec -n bakery-ia deployment/auth-db -- \
  psql -U auth_user -d auth_db -c "REINDEX DATABASE auth_db;"
```

---

## Backup & Recovery

### Backup Strategy

**Automated Daily Backups:**
- Frequency: Daily at 2 AM
- Retention: 30 days rolling
- Encryption: GPG encrypted
- Storage: Local VPS (configure off-site for production)

### Backup Script (Already Configured)

```bash
# Script location: ~/backup-databases.sh
# Configured in: pilot launch guide

# Manual backup
./backup-databases.sh

# Verify backup
ls -lh /backups/
```

### Backup Best Practices

1. **Test Restores Monthly**
   ```bash
   # Restore to test database
   gunzip < /backups/2026-01-07.tar.gz | \
     kubectl exec -i -n bakery-ia deployment/test-db -- \
     psql -U postgres test_db
   ```

2. **Off-Site Storage (Recommended)**
   ```bash
   # Sync backups to S3 / Cloud Storage
   aws s3 sync /backups/ s3://bakery-ia-backups/ --delete

   # Or use rclone for any cloud provider
   rclone sync /backups/ remote:bakery-ia-backups
   ```

3. **Monitor Backup Success**
   ```bash
   # Check last backup date
   ls -t /backups/ | head -1

   # Set up alert if no backup in 25 hours
   ```
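
The "no backup in 25 hours" alert can be approximated with a small script that looks at the newest entry under the backup directory. This is a sketch under assumptions: it treats the modification time of the newest file or folder directly inside `/backups/` as the time of the last backup, and the function names are illustrative.

```python
import time
from pathlib import Path
from typing import Optional

MAX_AGE_HOURS = 25  # daily backup plus one hour of slack, as suggested above


def newest_backup_age_hours(backup_dir: str,
                            now: Optional[float] = None) -> Optional[float]:
    """Return the age in hours of the newest entry, or None if there is none."""
    path = Path(backup_dir)
    if not path.is_dir():
        return None
    mtimes = [entry.stat().st_mtime for entry in path.iterdir()]
    if not mtimes:
        return None
    current = time.time() if now is None else now
    return (current - max(mtimes)) / 3600


def backup_is_stale(backup_dir: str, now: Optional[float] = None,
                    max_age_hours: float = MAX_AGE_HOURS) -> bool:
    """Return True when no backup exists or the newest one is too old."""
    age = newest_backup_age_hours(backup_dir, now)
    return age is None or age > max_age_hours


print("stale" if backup_is_stale("/backups") else "fresh")
```

Run from cron shortly after the backup window, a "stale" result can be turned into a non-zero exit code or a notification.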

### Recovery Procedures

#### Restore Single Database

```bash
# 1. Stop the service using the database
kubectl scale deployment auth-service -n bakery-ia --replicas=0

# 2. Drop and recreate database
kubectl exec -n bakery-ia deployment/auth-db -it -- \
  psql -U postgres -c "DROP DATABASE auth_db;"
kubectl exec -n bakery-ia deployment/auth-db -it -- \
  psql -U postgres -c "CREATE DATABASE auth_db OWNER auth_user;"

# 3. Restore from backup (plain SQL dump; decompress or decrypt it first if needed)
cat /backups/2026-01-07/auth-db.sql | \
  kubectl exec -i -n bakery-ia deployment/auth-db -- \
  psql -U auth_user -d auth_db

# 4. Restart service
kubectl scale deployment auth-service -n bakery-ia --replicas=2
```

#### Disaster Recovery (Full System)

```bash
# 1. Provision new VPS (same specs)
# 2. Install MicroK8s (follow pilot launch guide)
# 3. Copy latest backup to new VPS
# 4. Deploy infrastructure and databases
kubectl apply -k infrastructure/environments/prod/k8s-manifests

# 5. Wait for databases to be ready
kubectl wait --for=condition=ready pod -l app.kubernetes.io/component=database -n bakery-ia --timeout=300s

# 6. Restore all databases
for backup in /backups/latest/*.sql; do
  db_name=$(basename $backup .sql)
  echo "Restoring $db_name"
  cat $backup | kubectl exec -i -n bakery-ia deployment/${db_name} -- \
    psql -U postgres
done

# 7. Deploy services
kubectl apply -k infrastructure/environments/prod/k8s-manifests

# 8. Update DNS to point to new VPS
# 9. Verify all services healthy
```

**Recovery Time Objective (RTO):** 2-4 hours
**Recovery Point Objective (RPO):** 24 hours (last daily backup)

---

## Performance Optimization

### Identifying Performance Issues

```bash
# 1. Check overall resource usage
kubectl top nodes
kubectl top pods -n bakery-ia --sort-by=cpu
kubectl top pods -n bakery-ia --sort-by=memory

# 2. Check API response times in Grafana
# Go to "Services Overview" dashboard
# Look for P95/P99 latency spikes

# 3. Check database query performance
kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c \
  "SELECT query, calls, mean_exec_time, max_exec_time
   FROM pg_stat_statements
   ORDER BY mean_exec_time DESC
   LIMIT 20;"

# 4. Check for N+1 queries in application logs
kubectl logs -n bakery-ia deployment/orders-service | grep "SELECT"
```

### Common Optimizations

#### 1. Database Indexing

```sql
-- Find tables read mostly by sequential scans (candidates for a missing index)
SELECT schemaname, relname, seq_scan, seq_tup_read, idx_scan
FROM pg_stat_user_tables
WHERE seq_scan > COALESCE(idx_scan, 0)
ORDER BY seq_tup_read DESC;

-- Add index on frequently queried columns
CREATE INDEX CONCURRENTLY idx_orders_tenant_created
  ON orders(tenant_id, created_at DESC);
```

#### 2. Connection Pooling

Already configured in services using SQLAlchemy. Verify settings:
```python
# In shared/database/base.py
pool_size=5           # Adjust based on load
max_overflow=10       # Max additional connections
pool_timeout=30       # Connection timeout
pool_recycle=3600     # Recycle connections after 1 hour
```
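
These pool settings bound how many connections each service replica can open, which is worth checking against the database's `max_connections` before scaling out. The sketch below does that arithmetic, assuming one service per database and PostgreSQL's defaults of `max_connections = 100` and `superuser_reserved_connections = 3`; the function names are illustrative.

```python
def connections_needed(replicas: int, pool_size: int = 5,
                       max_overflow: int = 10) -> int:
    """Worst-case connections opened by all replicas of one service."""
    return replicas * (pool_size + max_overflow)


def max_safe_replicas(db_max_connections: int = 100, reserved: int = 3,
                      pool_size: int = 5, max_overflow: int = 10) -> int:
    """Largest replica count whose worst case still fits the database limit."""
    usable = db_max_connections - reserved
    return usable // (pool_size + max_overflow)


# With the settings above, each replica can open up to 15 connections.
print(connections_needed(replicas=3))             # worst case for 3 replicas
print(max_safe_replicas())                        # limit at max_connections=100
print(max_safe_replicas(db_max_connections=200))  # limit after raising to 200
```

If the replica count from the scaling sections exceeds this limit, either lower `pool_size`/`max_overflow` or raise `max_connections` first.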

#### 3. Redis Caching

Increase cache for frequently accessed data:
```python
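# Cache user permissions (example, assuming `cache` is a Flask-Caching instance)
# memoize() includes the arguments in the cache key, so each user_id gets its own entry
@cache.memoize(timeout=300)
def get_user_permissions(user_id):
    ...  # fetch from database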
```

#### 4. Query Optimization

```sql
-- Add EXPLAIN ANALYZE to slow queries
EXPLAIN ANALYZE SELECT * FROM orders WHERE tenant_id = '...';

-- Look for:
-- - Seq Scan (should use index scan)
-- - High execution time
-- - Missing indexes
```

### Scaling Triggers

**When to scale UP:**
- ❌ CPU usage >75% sustained for >1 hour
- ❌ Memory usage >85% sustained
- ❌ P95 API latency >3s
- ❌ Database connection pool exhausted frequently
- ❌ Error rate increasing

**When to scale OUT (add replicas):**
- ❌ Request rate increasing significantly
- ❌ Single service bottleneck identified
- ❌ Need zero-downtime deployments
- ❌ Geographic distribution needed

---

## Scaling Operations

### Vertical Scaling (Upgrade VPS)

```bash
# 1. Create backup
./backup-databases.sh

# 2. Plan upgrade window (requires brief downtime)
# Notify users: "Scheduled maintenance 2 AM - 3 AM"

# 3. At clouding.io, upgrade VPS
# RAM: 20 GB → 32 GB
# CPU: 8 cores → 12 cores
# (Usually instant, may require restart)

# 4. Verify after upgrade
kubectl top nodes
free -h
nproc
```

### Horizontal Scaling (Add Replicas)

```bash
# Scale specific service
kubectl scale deployment orders-service -n bakery-ia --replicas=5

# Or update in kustomization for persistence
# Edit: infrastructure/environments/prod/k8s-manifests/kustomization.yaml
#   replicas:
#     - name: orders-service
#       count: 5

kubectl apply -k infrastructure/environments/prod/k8s-manifests
```

### Auto-Scaling (HPA)

Already configured for:
- orders-service (1-3 replicas)
- forecasting-service (1-3 replicas)
- notification-service (1-3 replicas)

```bash
# Check HPA status
kubectl get hpa -n bakery-ia

# Adjust thresholds if needed
kubectl edit hpa orders-service-hpa -n bakery-ia
```
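
When adjusting thresholds, it helps to predict how many replicas the HPA will request. The sketch below applies the scaling rule from the Kubernetes documentation, `desired = ceil(current_replicas * current_metric / target_metric)`, clamped to the 1-3 replica range configured above. The 60% target used in the example is hypothetical, and the sketch omits the controller's tolerance band and stabilization window.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 3) -> int:
    """Replica count the HPA rule yields, clamped to the configured bounds."""
    if target_metric <= 0:
        raise ValueError("target metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# Two replicas at 90% CPU against a hypothetical 60% target.
print(desired_replicas(current_replicas=2, current_metric=90, target_metric=60))
```

A result pinned at `max_replicas` under normal load is a hint that the upper bound, not the threshold, is what needs changing.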

### Growth Path

| Tenants | Recommended Action |
|---------|-------------------|
| **10** | Current configuration (20GB RAM, 8 CPU) |
| **20** | Add replicas for critical services |
| **30** | Upgrade to 32GB RAM, 12 CPU |
| **50** | Consider database read replicas |
| **75** | Upgrade to 48GB RAM, 16 CPU |
| **100** | Plan multi-node cluster or managed K8s |
| **200+** | Migrate to managed services (EKS, GKE, AKS) |
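
The growth path can also be expressed as a lookup, which is convenient for capacity reports. This sketch assumes each row applies from its tenant count up to the next row; that interpretation, and the names used, are illustrative.

```python
from bisect import bisect_right

# (tenant threshold, recommended action), copied from the table above.
GROWTH_PATH = [
    (10, "Current configuration (20GB RAM, 8 CPU)"),
    (20, "Add replicas for critical services"),
    (30, "Upgrade to 32GB RAM, 12 CPU"),
    (50, "Consider database read replicas"),
    (75, "Upgrade to 48GB RAM, 16 CPU"),
    (100, "Plan multi-node cluster or managed K8s"),
    (200, "Migrate to managed services (EKS, GKE, AKS)"),
]


def recommended_action(tenants: int) -> str:
    """Return the action for the highest threshold not exceeding the tenant count."""
    thresholds = [threshold for threshold, _ in GROWTH_PATH]
    index = bisect_right(thresholds, tenants) - 1
    return GROWTH_PATH[max(index, 0)][1]


print(recommended_action(25))
```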

---

## Incident Response

### Incident Severity Levels

| Level | Description | Response Time | Example |
|-------|-------------|---------------|---------|
| **P0** | Complete outage | Immediate | All services down |
| **P1** | Major degradation | 15 minutes | Database unavailable |
| **P2** | Partial degradation | 1 hour | One service slow |
| **P3** | Minor issue | 4 hours | Non-critical alert |

### Incident Response Process

#### 1. Detect & Alert
```
- Monitoring alerts trigger
- User reports issue
- Automated health checks fail
```

#### 2. Assess & Communicate
```bash
# Quick assessment
./health-check.sh

# Determine severity
# P0/P1: Notify all stakeholders immediately
# P2/P3: Regular communication channels
```

#### 3. Investigate
```bash
# Check pods
kubectl get pods -n bakery-ia

# Check recent events
kubectl get events -n bakery-ia --sort-by='.lastTimestamp' | tail -20

# Check logs
kubectl logs -n bakery-ia deployment/SERVICE_NAME --tail=100

# Check metrics
# View Grafana dashboards
```

#### 4. Mitigate
```bash
# Common mitigations:

# Restart service
kubectl rollout restart deployment/SERVICE_NAME -n bakery-ia

# Rollback deployment
kubectl rollout undo deployment/SERVICE_NAME -n bakery-ia

# Scale up
kubectl scale deployment SERVICE_NAME -n bakery-ia --replicas=5

# Restart database
kubectl delete pod DB_POD_NAME -n bakery-ia
```

#### 5. Resolve & Document
```
1. Verify issue resolved
2. Update incident log
3. Create post-mortem (for P0/P1)
4. Implement preventive measures
```

### Common Incidents & Fixes

#### Incident: Database Connection Exhaustion

**Symptoms:** Services showing "connection pool exhausted" errors

**Fix:**
```bash
# 1. Identify leaking service
kubectl logs -n bakery-ia deployment/orders-service | grep "pool"

# 2. Restart leaking service
kubectl rollout restart deployment/orders-service -n bakery-ia

# 3. Increase max_connections if needed
kubectl exec -n bakery-ia deployment/orders-db -- \
  psql -U postgres -c "ALTER SYSTEM SET max_connections = 200;"
kubectl rollout restart deployment/orders-db -n bakery-ia
```

#### Incident: Out of Memory (OOMKilled)

**Symptoms:** Pods restarting with "OOMKilled" status

**Fix:**
```bash
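# 1. Identify which pod (the last terminated state persists after the container restarts)
kubectl get pods -n bakery-ia -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' | grep OOMKilled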

# 2. Check resource limits
kubectl describe pod POD_NAME -n bakery-ia | grep -A 5 Limits

# 3. Increase memory limit
# Edit deployment: infrastructure/kubernetes/base/components/services/SERVICE.yaml
#   resources:
#     limits:
#       memory: "1Gi"  # Increased from 512Mi

# 4. Redeploy
kubectl apply -k infrastructure/environments/prod/k8s-manifests
```

#### Incident: Certificate Expired

**Symptoms:** SSL errors, services can't connect

**Fix:**
```bash
# For Let's Encrypt (should auto-renew):
kubectl delete secret bakery-ia-prod-tls-cert -n bakery-ia
# Wait for cert-manager to recreate

# For internal certs:
# Follow "Certificate Rotation" section above
```

---

## Maintenance Tasks

### Daily Tasks

```bash
# Run health check
./health-check.sh

# Check monitoring alerts
curl http://localhost:9090/api/v1/alerts | jq '.data.alerts[] | select(.state=="firing")'

# Verify backups ran
ls -lht /backups/ | head -5
```

### Weekly Tasks

```bash
# Review resource trends
# Open Grafana, check 7-day trends

# Review error logs
kubectl logs -n bakery-ia deployment/gateway --since=168h | grep ERROR | wc -l

# Check disk usage
kubectl exec -n bakery-ia deployment/auth-db -- df -h

# Review security logs
kubectl logs -n bakery-ia deployment/auth-service --since=168h | grep "failed"
```

### Monthly Tasks

- [ ] **Review and rotate passwords**
- [ ] **Update security patches**
- [ ] **Test backup restore**
- [ ] **Review RBAC policies**
- [ ] **Vacuum and analyze databases**
- [ ] **Review and optimize slow queries**
- [ ] **Check certificate expiry dates**
- [ ] **Review resource allocation**
- [ ] **Plan capacity for next quarter**
- [ ] **Update documentation**

### Quarterly Tasks (Every 90 Days)

- [ ] **Full security audit**
- [ ] **Disaster recovery drill**
- [ ] **Performance testing**
- [ ] **Cost optimization review**
- [ ] **Update runbooks**
- [ ] **Team training session**
- [ ] **Review SLAs and metrics**
- [ ] **Plan infrastructure upgrades**

### Annual Tasks

- [ ] **Penetration testing**
- [ ] **Compliance audit (GDPR, PCI-DSS, SOC 2)**
- [ ] **Full infrastructure review**
- [ ] **Update security roadmap**
- [ ] **Budget planning for next year**
- [ ] **Technology stack review**

---

## Compliance & Audit

### GDPR Compliance

**Requirements Met:**
- ✅ Article 32: Security of processing, including encryption of personal data (TLS + pgcrypto)
- ✅ Article 5(1)(f): Integrity and confidentiality
- ✅ Article 33: Breach notification support (detection via audit logs)
- ✅ Article 17: Right to erasure (deletion endpoints)
- ✅ Article 20: Right to data portability (export functionality)

**Audit Tasks:**
```bash
# Review audit logs for data access
kubectl logs -n bakery-ia deployment/tenant-service | grep "user_data_access"

# Verify encryption in use
kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c "SHOW ssl;"

# Check data retention policies
# Review automated cleanup jobs
```

### PCI-DSS Compliance

**Requirements Met:**
- ✅ Requirement 4: Encryption of cardholder data in transit (TLS 1.2+)
- ✅ Requirement 3: Protection of stored account data (pgcrypto)
- ✅ Requirement 10: Access tracking (audit logs)
- ✅ Requirement 8: User authentication (JWT + MFA ready)

**Audit Tasks:**
```bash
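# Verify passwords are held in the Secret rather than in plaintext config
# (prints key names only; Secret values are base64-encoded, not encrypted)
kubectl get secret bakery-ia-secrets -n bakery-ia -o json | jq -r '.data | keys[]' | grep -i "pass"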

# Check encryption in transit
kubectl describe ingress -n bakery-ia | grep TLS

# Review access logs
kubectl logs -n bakery-ia deployment/auth-service | grep "login"
```

### SOC 2 Compliance

**Controls Met:**
- ✅ CC6.1: Access controls (RBAC)
- ✅ CC6.6: Encryption in transit (TLS)
- ✅ CC6.7: Encryption at rest (K8s secrets + pgcrypto)
- ✅ CC7.2: Monitoring (Prometheus + Grafana)

### Audit Log Retention

**Current Policy:**
- Application logs: 30 days (stdout)
- Database audit logs: 90 days
- Security logs: 1 year
- Backups: 30 days rolling

**Extending Retention:**
```bash
# Ship logs to external storage
# Example: Ship to S3 / CloudWatch / ELK

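# For PostgreSQL logs: log_rotation_age sets how long each log file is written
# before a new one is started. It does not delete old files, so retention also
# depends on log_filename / log_truncate_on_rotation or an external cleanup job.
kubectl exec -n bakery-ia deployment/auth-db -- \
  psql -U postgres -c "ALTER SYSTEM SET log_rotation_age = '90d';" -c "SELECT pg_reload_conf();"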
```
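Before logs can be shipped to external storage they have to be captured somewhere, because container stdout is lost once the pod's log files rotate. The following is a sketch of that local staging step only; the upload to S3, CloudWatch, or ELK is left out because this guide does not specify a destination. The helper name `archive_stream` and the `/backups/logs` path in the usage note are hypothetical.

```bash
# archive_stream NAME DIR: gzip everything read from stdin into
# DIR/NAME-YYYYMMDD.log.gz (UTC date) and print the resulting path.
archive_stream() {
  local name="$1" dir="$2"
  local out
  out="${dir}/${name}-$(date -u +%Y%m%d).log.gz"
  mkdir -p "$dir" && gzip -c > "$out" && echo "$out"
}
```

A daily job could run `kubectl logs -n bakery-ia deployment/auth-service --since=24h | archive_stream auth-service /backups/logs` and hand the printed path to whichever uploader is chosen.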

---

## Quick Reference Commands

### Emergency Commands

```bash
# Restart every Deployment in the namespace with a rolling update
# (minimal downtime; this also restarts database Deployments such as auth-db)
kubectl rollout restart deployment -n bakery-ia

# Restart specific service
kubectl rollout restart deployment/orders-service -n bakery-ia

# Rollback last deployment
kubectl rollout undo deployment/orders-service -n bakery-ia

# Scale up quickly
kubectl scale deployment orders-service -n bakery-ia --replicas=5

# Get pod status
kubectl get pods -n bakery-ia

# Get recent events
kubectl get events -n bakery-ia --sort-by='.lastTimestamp' | tail -20

# Get logs
kubectl logs -n bakery-ia deployment/SERVICE_NAME --tail=100 -f
```
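The restart and rollback commands above can be combined so that a restart which never becomes healthy is undone automatically. This is a sketch rather than an established procedure for this platform: the function name `restart_with_rollback`, the 120-second default timeout, and the overridable `KUBECTL` variable are all assumptions. Whether rolling back is the right reaction depends on why the rollout stalled.

```bash
# KUBECTL can be overridden, e.g. to dry-run the function against a stub.
KUBECTL="${KUBECTL:-kubectl}"

# restart_with_rollback DEPLOYMENT [NAMESPACE] [TIMEOUT]
# Triggers a rolling restart, waits for it to complete, and runs
# `rollout undo` if the rollout does not finish within TIMEOUT.
restart_with_rollback() {
  local deploy="$1" ns="${2:-bakery-ia}" timeout="${3:-120s}"
  "$KUBECTL" rollout restart "deployment/$deploy" -n "$ns" || return 1
  if ! "$KUBECTL" rollout status "deployment/$deploy" -n "$ns" --timeout="$timeout"; then
    echo "Rollout of $deploy did not complete within $timeout; rolling back" >&2
    "$KUBECTL" rollout undo "deployment/$deploy" -n "$ns"
    return 1
  fi
}
```

For example, `restart_with_rollback orders-service bakery-ia 180s` returns non-zero when a rollback was needed, which makes it usable from a larger script.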

### Monitoring Commands

```bash
# Resource usage
kubectl top nodes
kubectl top pods -n bakery-ia --sort-by=cpu
kubectl top pods -n bakery-ia --sort-by=memory

# Check HPA
kubectl get hpa -n bakery-ia

# Check all resources
kubectl get all -n bakery-ia

# Check ingress
kubectl get ingress -n bakery-ia

# Check certificates
kubectl get certificate -n bakery-ia
```

### Database Commands

```bash
# Connect to database
kubectl exec -n bakery-ia deployment/auth-db -it -- psql -U auth_user -d auth_db

# Check connections
kubectl exec -n bakery-ia deployment/auth-db -- \
  psql -U postgres -c "SELECT count(*) FROM pg_stat_activity;"

# Check database size
kubectl exec -n bakery-ia deployment/auth-db -- \
  psql -U postgres -c "SELECT pg_size_pretty(pg_database_size('auth_db'));"

# Vacuum database
kubectl exec -n bakery-ia deployment/auth-db -- \
  psql -U auth_user -d auth_db -c "VACUUM ANALYZE;"
```

---

## Support Resources

**Documentation:**
- [Pilot Launch Guide](./PILOT_LAUNCH_GUIDE.md) - Initial deployment and setup
- [Security Checklist](./security-checklist.md) - Security procedures and compliance
- [Database Security](./database-security.md) - Database operations and best practices
- [TLS Configuration](./tls-configuration.md) - Certificate management
- [RBAC Implementation](./rbac-implementation.md) - Access control configuration
- [Monitoring Stack README](../infrastructure/kubernetes/base/components/monitoring/README.md) - Detailed monitoring documentation
- [CI/CD Infrastructure README](../infrastructure/cicd/README.md) - Gitea, Tekton, and Flux CD setup and operations
- [SigNoz Monitoring README](../infrastructure/monitoring/signoz/README.md) - SigNoz deployment and configuration

**External Resources:**
- Kubernetes: https://kubernetes.io/docs
- MicroK8s: https://microk8s.io/docs
- Prometheus: https://prometheus.io/docs
- Grafana: https://grafana.com/docs
- PostgreSQL: https://www.postgresql.org/docs

**Emergency Contacts:**
- DevOps Team: devops@bakewise.ai
- On-Call: oncall@bakewise.ai
- Security Team: security@bakewise.ai

---

## Summary

This guide covers all aspects of operating the Bakery-IA platform in production:

- ✅ **Monitoring:** Dashboards, alerts, metrics
- ✅ **Security:** Access control, certificates, compliance
- ✅ **Databases:** Management, optimization, backups
- ✅ **Recovery:** Backup strategy, disaster recovery
- ✅ **Performance:** Optimization techniques, scaling
- ✅ **Incidents:** Response procedures, common fixes
- ✅ **Maintenance:** Daily, weekly, monthly tasks
- ✅ **Compliance:** GDPR, PCI-DSS, SOC 2

**Remember:**
- Monitor daily
- Back up daily
- Test restores monthly
- Rotate secrets quarterly
- Plan for growth continuously

---

**Document Version:** 1.0
**Last Updated:** 2026-01-07
**Maintained By:** DevOps Team
**Next Review:** 2026-04-07