# Bakery-IA Production Operations Guide
**Complete guide for operating, monitoring, and maintaining production environment**
**Last Updated:** 2026-01-07
**Target Audience:** DevOps, SRE, System Administrators
**Security Grade:** A-
---
## Table of Contents
1. [Overview](#overview)
2. [Monitoring & Observability](#monitoring--observability)
3. [CI/CD Operations](#ci-cd-operations)
4. [Security Operations](#security-operations)
5. [Database Management](#database-management)
6. [Backup & Recovery](#backup--recovery)
7. [Performance Optimization](#performance-optimization)
8. [Scaling Operations](#scaling-operations)
9. [Incident Response](#incident-response)
10. [Maintenance Tasks](#maintenance-tasks)
11. [Compliance & Audit](#compliance--audit)
---
## Overview
### Production Environment
**Infrastructure:**
- **Platform:** MicroK8s on Ubuntu 22.04 LTS
- **Services:** 18 microservices, 14 databases, monitoring stack
- **Capacity:** 10-tenant pilot (scalable to 100+)
- **Security:** TLS encryption, RBAC, audit logging
- **Monitoring:** Prometheus, Grafana, AlertManager, SigNoz
- **CI/CD:** Tekton Pipelines, Gitea, Flux CD (GitOps)
- **Email:** Mailu (integrated email server)
**Key Metrics (10-tenant baseline):**
- **Uptime Target:** 99.5% (3.65 hours downtime/month)
- **Response Time:** <2s average API response
- **Error Rate:** <1% of requests
- **Database Connections:** ~200 concurrent
- **Memory Usage:** 12-15 GB / 20 GB capacity
- **CPU Usage:** 40-60% under normal load
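
The downtime figure next to the uptime target follows from simple arithmetic. A minimal sketch, assuming an average month of 730 hours (8,760 hours / 12); the function name is illustrative and not part of the platform:

```python
def downtime_budget_hours(uptime_target_pct: float, period_hours: float = 730.0) -> float:
    """Return the allowed downtime (in hours) for a given uptime target.

    period_hours defaults to an average month (assumption: 8760 h / 12).
    """
    if not 0.0 <= uptime_target_pct <= 100.0:
        raise ValueError("uptime_target_pct must be between 0 and 100")
    return period_hours * (100.0 - uptime_target_pct) / 100.0


if __name__ == "__main__":
    # Uptime target from the key metrics list above
    print(f"{downtime_budget_hours(99.5):.2f} hours/month")
```

The same function can be used to see how a tighter target shrinks the budget, for example by passing `99.9` instead of `99.5`.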
### Team Responsibilities
| Role | Responsibilities |
|------|------------------|
| **DevOps Engineer** | Deployment, infrastructure, scaling, CI/CD |
| **SRE** | Monitoring, incident response, performance |
| **Security Admin** | Access control, security patches, compliance |
| **Database Admin** | Backups, optimization, migrations |
| **On-Call Engineer** | 24/7 incident response (if applicable) |
| **CI/CD Admin** | Pipeline management, GitOps workflows |
---
## Monitoring & Observability
### Access Monitoring Dashboards
**Production URLs:**
```
https://monitoring.bakewise.ai/signoz # SigNoz - Unified observability (PRIMARY)
https://monitoring.bakewise.ai/alertmanager # AlertManager - Alert management
```
**What is SigNoz?**
SigNoz is a comprehensive, open-source observability platform that provides:
- **Distributed Tracing** - End-to-end request tracking across all microservices
- **Metrics Monitoring** - Application and infrastructure metrics
- **Log Management** - Centralized log aggregation with trace correlation
- **Service Performance Monitoring (SPM)** - RED metrics (Rate, Error, Duration) from traces
- **Database Monitoring** - All 14 PostgreSQL databases + Redis + RabbitMQ
- **Kubernetes Monitoring** - Cluster, node, pod, and container metrics
### Key SigNoz Dashboards and Features
#### 1. Services Tab - APM Overview
**What to Monitor:**
- **Service List** - All 18 microservices with health status
- **Request Rate** - Requests per second per service
- **Error Rate** - Percentage of failed requests (aim: <1%)
- **P50/P90/P99 Latency** - Response time percentiles (aim: P99 <2s)
- **Operations** - Breakdown by endpoint/operation
**Red Flags:**
- ❌ Error rate >5% sustained
- ❌ P99 latency >3s
- ❌ Sudden drop in request rate (service might be down)
- ❌ High latency on specific endpoints
**How to Access:**
- Navigate to `Services` tab in SigNoz
- Click on any service for detailed metrics
- Use "Traces" tab to see sample requests
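
The targets and red-flag thresholds for the Services tab can be expressed as a small check. This is a sketch only: the `ServiceStats` type and the intermediate `"watch"` band (between the target and the red-flag threshold) are assumptions added for illustration, not something SigNoz provides.

```python
from dataclasses import dataclass


@dataclass
class ServiceStats:
    name: str
    error_rate_pct: float  # percentage of failed requests
    p99_latency_s: float   # P99 latency in seconds


def classify(stats: ServiceStats) -> str:
    """Return 'red', 'watch', or 'ok' using the thresholds listed above."""
    # Red flags: error rate >5% or P99 latency >3s
    if stats.error_rate_pct > 5.0 or stats.p99_latency_s > 3.0:
        return "red"
    # Missing the targets (error rate <1%, P99 <2s) but not yet a red flag
    if stats.error_rate_pct >= 1.0 or stats.p99_latency_s >= 2.0:
        return "watch"
    return "ok"


if __name__ == "__main__":
    print(classify(ServiceStats("orders-service", error_rate_pct=0.4, p99_latency_s=1.2)))
```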
#### 2. Traces Tab - Distributed Tracing
**What to Monitor:**
- **End-to-end request flows** across microservices
- **Span duration** - Time spent in each service
- **Database query performance** - Auto-captured from SQLAlchemy
- **External API calls** - Auto-captured from HTTPX
- **Error traces** - Requests that failed with stack traces
**Features:**
- Filter by service, operation, status code, duration
- Search by trace ID or span ID
- Correlate traces with logs
- Identify slow database queries and N+1 problems
**Red Flags:**
- ❌ Traces showing >10 database queries per request (N+1 issue)
- ❌ External API calls taking >1s
- ❌ Services with >500ms internal processing time
- ❌ Error spans with exceptions
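
The N+1 red flag above (more than 10 database queries in one request) can be checked mechanically once spans are exported. A minimal sketch, assuming each span is a mapping with hypothetical `trace_id` and `kind` keys; a real SigNoz export uses its own field names:

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def traces_with_many_db_spans(
    spans: Iterable[Mapping[str, str]], limit: int = 10
) -> Dict[str, int]:
    """Return {trace_id: db_span_count} for traces with more than `limit` DB spans."""
    # Count only database spans, grouped by the trace they belong to
    counts = Counter(s["trace_id"] for s in spans if s.get("kind") == "db")
    return {trace_id: n for trace_id, n in counts.items() if n > limit}


if __name__ == "__main__":
    sample = [{"trace_id": "t1", "kind": "db"}] * 12 + [{"trace_id": "t2", "kind": "db"}] * 2
    print(traces_with_many_db_spans(sample))
```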
#### 3. Dashboards Tab - Infrastructure Metrics
**Pre-built Dashboards:**
- **PostgreSQL Monitoring** - All 14 databases
  - Active connections, transactions/sec, cache hit ratio
  - Slow queries, lock waits, replication lag
  - Database size, disk I/O
- **Redis Monitoring** - Cache performance
  - Memory usage, hit rate, evictions
  - Commands/sec, latency
- **RabbitMQ Monitoring** - Message queue health
  - Queue depth, message rates
  - Consumer status, connections
- **Kubernetes Cluster** - Node and pod metrics
  - CPU, memory, disk, network per node
  - Pod resource utilization
  - Container restarts and OOM kills
**Red Flags:**
- ❌ PostgreSQL: Cache hit ratio <80%, active connections >80% of max
- ❌ Redis: Memory >90%, evictions increasing
- ❌ RabbitMQ: Queue depth growing, no consumers
- ❌ Kubernetes: CPU >85%, memory >90%, disk <20% free
#### 4. Logs Tab - Centralized Logging
**Features:**
- **Unified logs** from all 18 microservices + databases
- **Trace correlation** - Click on trace ID to see related logs
- **Kubernetes metadata** - Auto-tagged with pod, namespace, container
- **Search and filter** - By service, severity, time range, content
- **Log patterns** - Automatically detect common patterns
**What to Monitor:**
- Error and warning logs across all services
- Database connection errors
- Authentication failures
- API request/response logs
**Red Flags:**
- ❌ Increasing error logs
- ❌ Repeated "connection refused" or "timeout" messages
- ❌ Authentication failures (potential security issue)
- ❌ Out of memory errors
#### 5. Alerts Tab - Alert Management
**Features:**
- Create alerts based on metrics, traces, or logs
- Configure notification channels (email, Slack, webhook)
- View firing alerts and alert history
- Alert silencing and acknowledgment
**Pre-configured Alerts (see SigNoz):**
- High error rate (>5% for 5 minutes)
- High latency (P99 >3s for 5 minutes)
- Service down (no requests for 2 minutes)
- Database connection errors
- High memory/CPU usage
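
Alerts such as "error rate >5% for 5 minutes" fire only when the condition holds for the whole window, not on a single spike. The sketch below illustrates that "for" semantics on plain `(timestamp, value)` samples; it is an illustration of the idea, not how SigNoz evaluates rules internally.

```python
from typing import Sequence, Tuple


def breached_for(
    samples: Sequence[Tuple[float, float]], threshold: float, duration_s: float
) -> bool:
    """Return True if the most recent samples stayed above `threshold`
    continuously for at least `duration_s` seconds.

    samples: (unix_timestamp, value) pairs sorted by ascending timestamp.
    """
    breach_start = None
    # Walk backwards from the newest sample until the value drops to or below the threshold
    for ts, value in reversed(samples):
        if value <= threshold:
            break
        breach_start = ts
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= duration_s


if __name__ == "__main__":
    # One sample per minute, error rate in percent
    error_rate = [(0, 6.1), (60, 7.0), (120, 6.4), (180, 5.9), (240, 6.8), (300, 6.2)]
    print(breached_for(error_rate, threshold=5.0, duration_s=300))
```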
### Alert Severity Levels
| Severity | Response Time | Escalation | Examples |
|----------|---------------|------------|----------|
| **Critical** | Immediate | Page on-call | Service down, database unavailable |
| **Warning** | 30 minutes | Email team | High memory, slow queries |
| **Info** | Best effort | Email | Backup completed, cert renewal |
### Common Alerts & Responses
#### Alert: ServiceDown
```
Severity: Critical
Meaning: A service has been down for >2 minutes
Response:
1. Check pod status: kubectl get pods -n bakery-ia
2. View logs: kubectl logs POD_NAME -n bakery-ia
3. Check recent deployments: kubectl rollout history deployment/SERVICE_NAME -n bakery-ia
4. Restart if safe: kubectl rollout restart deployment/SERVICE_NAME -n bakery-ia
5. Rollback if needed: kubectl rollout undo deployment/SERVICE_NAME -n bakery-ia
```
#### Alert: HighMemoryUsage
```
Severity: Warning
Meaning: Service using >80% of memory limit
Response:
1. Check which pods: kubectl top pods -n bakery-ia --sort-by=memory
2. Review memory trends in Grafana
3. Check for memory leaks in application logs
4. Consider increasing memory limits if sustained
5. Restart pod if memory leak suspected
```
#### Alert: DatabaseConnectionsHigh
```
Severity: Warning
Meaning: Database connections >80% of max
Response:
1. Identify which service: Check Grafana database dashboard
2. Look for connection leaks in application
3. Check for long-running transactions
4. Consider increasing max_connections
5. Restart service if connections not releasing
```
#### Alert: CertificateExpiringSoon
```
Severity: Warning
Meaning: TLS certificate expires in <30 days
Response:
1. For Let's Encrypt: Auto-renewal should handle (verify cert-manager)
2. For internal certs: Regenerate and apply new certificates
3. See "Certificate Rotation" section below
```
### Daily Monitoring Workflow with SigNoz
#### Morning Health Check (5 minutes)
1. **Open SigNoz Dashboard**
```
https://monitoring.bakewise.ai/signoz
```
2. **Check Services Tab:**
   - Verify all 18 services are reporting metrics
   - Check error rate <1% for all services
   - Check P99 latency <2s for critical services
3. **Check Alerts Tab:**
   - Review any firing alerts
   - Check for patterns (repeated alerts on same service)
   - Acknowledge or resolve as needed
4. **Quick Infrastructure Check:**
   - Navigate to Dashboards → PostgreSQL
     - Verify all 14 databases are up
     - Check connection counts are healthy
   - Navigate to Dashboards → Redis
     - Check memory usage <80%
   - Navigate to Dashboards → Kubernetes
     - Verify node health, no OOM kills
#### Command-Line Health Check (Alternative)
```bash
# Quick health check command
cat > ~/health-check.sh <<'EOF'
#!/bin/bash
echo "=== Bakery-IA Health Check ==="
echo "Date: $(date)"
echo ""
echo "1. Pod Status:"
# --no-headers keeps the column header line from always passing the grep -v filter
kubectl get pods -n bakery-ia --no-headers | grep -vE "Running|Completed" || echo "✅ All pods healthy"
echo ""
echo "2. Resource Usage:"
kubectl top nodes
kubectl top pods -n bakery-ia --sort-by=memory | head -10
echo ""
echo "3. SigNoz Components:"
kubectl get pods -n bakery-ia -l app.kubernetes.io/instance=signoz
echo ""
echo "4. Recent Alerts (from SigNoz AlertManager):"
curl -s http://localhost:9093/api/v1/alerts 2>/dev/null | jq '.data[] | select(.status.state=="firing") | {alert: .labels.alertname, severity: .labels.severity}' | head -10
echo ""
echo "5. OTel Collector Health:"
# Report success only when the health endpoint actually answers
kubectl exec -n bakery-ia deployment/signoz-otel-collector -- wget -qO- http://localhost:13133 2>/dev/null \
  && echo "✅ Health check endpoint responding" \
  || echo "❌ Health check endpoint not responding"
echo ""
echo "=== End Health Check ==="
EOF
chmod +x ~/health-check.sh
~/health-check.sh
```
#### Troubleshooting Common Issues
**Issue: Service not showing in SigNoz**
```bash
# Check if service is sending telemetry
kubectl logs -n bakery-ia deployment/SERVICE_NAME | grep -i "telemetry\|otel\|signoz"
# Check OTel Collector is receiving data
kubectl logs -n bakery-ia deployment/signoz-otel-collector | grep SERVICE_NAME
# Verify service has proper OTEL endpoints configured
kubectl exec -n bakery-ia deployment/SERVICE_NAME -- env | grep OTEL
```
**Issue: No traces appearing**
```bash
# Check tracing is enabled in service
kubectl exec -n bakery-ia deployment/SERVICE_NAME -- env | grep ENABLE_TRACING
# Verify OTel Collector gRPC endpoint is reachable
kubectl exec -n bakery-ia deployment/SERVICE_NAME -- nc -zv signoz-otel-collector 4317
```
**Issue: Logs not appearing**
```bash
# Check filelog receiver is working
kubectl logs -n bakery-ia deployment/signoz-otel-collector | grep filelog
# Check k8sattributes processor
kubectl logs -n bakery-ia deployment/signoz-otel-collector | grep k8sattributes
```
---
## CI/CD Operations
### CI/CD Infrastructure Overview
The platform includes a complete CI/CD pipeline using:
- **Gitea** - Git server and container registry
- **Tekton** - Pipeline automation
- **Flux CD** - GitOps deployment
### Access CI/CD Systems
**SSH Access to Production VPS:**
- **IP Address:** `200.234.233.87`
- **Access Method:** SSH with key authentication
- **Command:** `ssh -i ~/.ssh/your_private_key.pem root@200.234.233.87`
**Gitea (Git Server):**
- URL: http://gitea.bakery-ia.local (development) or http://gitea.bakewise.ai (production)
- Admin panel: http://gitea.bakery-ia.local/admin
**Tekton Dashboard:**
```bash
# Port forward to access Tekton dashboard
kubectl port-forward -n tekton-pipelines svc/tekton-dashboard 9097:9097
# Access at: http://localhost:9097
```
**Flux Status:**
```bash
# Check Flux status
flux check
kubectl get gitrepository -n flux-system
kubectl get kustomization -n flux-system
```
### CI/CD Monitoring
**Check pipeline status:**
```bash
# List all PipelineRuns
kubectl get pipelineruns -n tekton-pipelines
# Check Tekton controller logs
kubectl logs -n tekton-pipelines -l app=tekton-pipelines-controller
# Check Tekton dashboard logs
kubectl logs -n tekton-pipelines -l app=tekton-dashboard
```
**Monitor GitOps synchronization:**
```bash
# Check GitRepository status
kubectl get gitrepository -n flux-system -o wide
# Check Kustomization status
kubectl get kustomization -n flux-system -o wide
# Get reconciliation history
kubectl get events -n flux-system --sort-by='.lastTimestamp'
```
### CI/CD Troubleshooting
**Pipeline not triggering:**
```bash
# Check Tekton Triggers controller logs (it processes incoming Gitea webhooks)
kubectl logs -n tekton-pipelines -l app=tekton-triggers-controller
# Verify EventListener pods are running
kubectl get pods -n tekton-pipelines -l app=tekton-triggers-eventlistener
# Check TriggerBinding configuration
kubectl get triggerbinding -n tekton-pipelines
```
**Build failures:**
```bash
# Check Kaniko logs for build errors
kubectl logs -n tekton-pipelines -l tekton.dev/task=kaniko-build
# Verify Dockerfile paths are correct
kubectl describe taskrun -n tekton-pipelines
```
**Flux not applying changes:**
```bash
# Check GitRepository status
kubectl describe gitrepository -n flux-system
# Check Kustomization reconciliation
kubectl describe kustomization -n flux-system
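# Check Flux controller logs (source-controller fetches Git, kustomize-controller applies manifests)
kubectl logs -n flux-system deployment/source-controller
kubectl logs -n flux-system deployment/kustomize-controller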
```
### CI/CD Maintenance Tasks
**Daily Tasks:**
- [ ] Check for failed pipeline runs
- [ ] Verify GitOps synchronization status
- [ ] Clean up old PipelineRun resources
**Weekly Tasks:**
- [ ] Review pipeline performance metrics
- [ ] Update pipeline definitions if needed
- [ ] Rotate CI/CD secrets
**Monthly Tasks:**
- [ ] Update Tekton and Flux versions
- [ ] Review and optimize pipeline performance
- [ ] Audit CI/CD access permissions
---
## Security Operations
### Security Posture Overview
**Current Security Grade: A-**
**Implemented:**
- ✅ TLS 1.2+ encryption for all database connections
- ✅ Let's Encrypt SSL for public endpoints
- ✅ 32-character cryptographic passwords
- ✅ JWT-based authentication
- ✅ Tenant isolation at database and application level
- ✅ Kubernetes secrets encryption at rest
- ✅ PostgreSQL audit logging
- ✅ RBAC (Role-Based Access Control)
- ✅ Regular security updates
### Access Control Management
#### User Roles
| Role | Permissions | Use Case |
|------|-------------|----------|
| **Viewer** | Read-only access | Dashboard viewing, reports |
| **Member** | Read + create/update | Day-to-day operations |
| **Admin** | Full operational access | Manage users, configure settings |
| **Owner** | Full control | Billing, tenant deletion |
#### Managing User Access
```bash
# View current users for a tenant (via API)
curl -H "Authorization: Bearer $ADMIN_TOKEN" \
https://api.yourdomain.com/api/v1/tenants/TENANT_ID/users
# Promote user to admin
curl -X PATCH -H "Authorization: Bearer $OWNER_TOKEN" \
-H "Content-Type: application/json" \
https://api.yourdomain.com/api/v1/tenants/TENANT_ID/users/USER_ID \
-d '{"role": "admin"}'
```
### Security Checklist (Monthly)
- [ ] **Review audit logs for suspicious activity**
```bash
# Check failed login attempts
kubectl logs -n bakery-ia deployment/auth-service | grep "authentication failed" | tail -50
# Check unusual API calls
kubectl logs -n bakery-ia deployment/gateway | grep -E "DELETE|admin" | tail -50
```
- [ ] **Verify all services using TLS**
```bash
# Check PostgreSQL SSL
for db in $(kubectl get deploy -n bakery-ia -l app.kubernetes.io/component=database -o name); do
echo "Checking $db"
kubectl exec -n bakery-ia $db -- psql -U postgres -c "SHOW ssl;"
done
```
- [ ] **Review and rotate passwords (every 90 days)**
```bash
# Generate new passwords
openssl rand -base64 32 # For each service
# Update secrets
kubectl edit secret bakery-ia-secrets -n bakery-ia
# Restart services to pick up new passwords
kubectl rollout restart deployment -n bakery-ia
```
- [ ] **Check certificate expiry dates**
```bash
# Check Let's Encrypt certs
kubectl get certificate -n bakery-ia
# Check internal TLS certs (expire Oct 2028)
kubectl exec -n bakery-ia deployment/auth-db -- \
openssl x509 -in /tls/server-cert.pem -noout -dates
```
- [ ] **Review RBAC policies**
- Ensure least privilege principle
- Remove access for departed team members
- Audit admin/owner role assignments
- [ ] **Apply security updates**
```bash
# Update system packages on VPS
ssh root@$VPS_IP "apt update && apt upgrade -y"
# Update container images (rebuild with latest base images)
docker-compose build --pull
```
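
For the password rotation item in the checklist above, the sketch below generates a 32-character password and the base64 form that a Kubernetes Secret expects under `data`. It uses only the Python standard library; the function names are illustrative assumptions, not part of the platform tooling.

```python
import base64
import secrets


def generate_password(length: int = 32) -> str:
    """Return a URL-safe random password of exactly `length` characters."""
    # token_urlsafe(n) yields more than n characters, so trimming is always safe
    return secrets.token_urlsafe(length)[:length]


def to_secret_data(value: str) -> str:
    """Base64-encode a value for the `data` field of a Kubernetes Secret."""
    return base64.b64encode(value.encode("utf-8")).decode("ascii")


if __name__ == "__main__":
    password = generate_password()
    print(len(password), to_secret_data(password))
```

Note that `openssl rand -base64 32` encodes 32 random bytes and therefore prints 44 characters; either approach works as long as the services accept the resulting length.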
### Certificate Rotation
#### Let's Encrypt (Auto-Renewal)
Let's Encrypt certificates auto-renew via cert-manager. Verify:
```bash
# Check cert-manager is running
kubectl get pods -n cert-manager
# Check certificate status
kubectl describe certificate bakery-ia-prod-tls-cert -n bakery-ia
# Force renewal if needed (>30 days before expiry)
kubectl delete secret bakery-ia-prod-tls-cert -n bakery-ia
# cert-manager will automatically recreate
```
#### Internal TLS Certificates (Manual Rotation)
**When:** 90 days before October 2028 expiry
```bash
# 1. Generate new certificates (on local machine)
cd infrastructure/tls
./generate-certificates.sh
# 2. Update Kubernetes secrets
kubectl delete secret postgres-tls redis-tls -n bakery-ia
kubectl create secret generic postgres-tls \
--from-file=server-cert.pem=postgres/server-cert.pem \
--from-file=server-key.pem=postgres/server-key.pem \
--from-file=ca-cert.pem=postgres/ca-cert.pem \
-n bakery-ia
kubectl create secret generic redis-tls \
--from-file=redis-cert.pem=redis/redis-cert.pem \
--from-file=redis-key.pem=redis/redis-key.pem \
--from-file=ca-cert.pem=redis/ca-cert.pem \
-n bakery-ia
# 3. Restart database pods to pick up new certs
kubectl rollout restart deployment -n bakery-ia -l app.kubernetes.io/component=database
kubectl rollout restart deployment -n bakery-ia -l app.kubernetes.io/component=cache
# 4. Verify new certificates
kubectl exec -n bakery-ia deployment/auth-db -- \
openssl x509 -in /tls/server-cert.pem -noout -dates
```
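
To decide when the rotation window opens, the `notAfter=` line printed by `openssl x509 -noout -dates` can be turned into a number of remaining days. A minimal sketch; the date in the usage example is made up for illustration, and the 90-day lead time is taken from the rotation note above.

```python
from datetime import datetime, timezone


def days_until_expiry(not_after_line: str, now: datetime) -> int:
    """Parse a 'notAfter=Oct 15 12:00:00 2028 GMT' line and return whole days left.

    `now` must be timezone-aware (UTC).
    """
    value = not_after_line.strip().split("=", 1)[1]
    # openssl prints the expiry in GMT; attach UTC explicitly after parsing
    expiry = datetime.strptime(value, "%b %d %H:%M:%S %Y %Z").replace(tzinfo=timezone.utc)
    return (expiry - now).days


def needs_rotation(days_left: int, lead_days: int = 90) -> bool:
    """True once the certificate is inside the rotation window."""
    return days_left <= lead_days


if __name__ == "__main__":
    line = "notAfter=Oct 15 12:00:00 2028 GMT"  # illustrative date only
    left = days_until_expiry(line, datetime.now(timezone.utc))
    print(left, needs_rotation(left))
```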
---
## Database Management
### Database Architecture
**14 PostgreSQL Instances:**
- auth-db, tenant-db, training-db, forecasting-db, sales-db
- external-db, notification-db, inventory-db, recipes-db
- suppliers-db, pos-db, orders-db, production-db, alert-processor-db
**1 Redis Instance:** Shared caching and session storage
### Database Health Monitoring
```bash
# Check all database pods
kubectl get pods -n bakery-ia -l app.kubernetes.io/component=database
# Check database resource usage
kubectl top pods -n bakery-ia -l app.kubernetes.io/component=database
# Check database connections
for db in $(kubectl get pods -n bakery-ia -l app.kubernetes.io/component=database -o name); do
echo "=== $db ==="
kubectl exec -n bakery-ia $db -- psql -U postgres -c \
"SELECT datname, count(*) FROM pg_stat_activity GROUP BY datname;"
done
```
### Common Database Operations
#### Connect to Database
```bash
# Connect to specific database
kubectl exec -n bakery-ia deployment/auth-db -it -- \
psql -U auth_user -d auth_db
# Inside psql:
\dt # List tables
\d+ table_name # Describe table with details
\du # List users
\l # List databases
\q # Quit
```
#### Check Database Size
```bash
kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c \
"SELECT pg_database.datname,
pg_size_pretty(pg_database_size(pg_database.datname)) AS size
FROM pg_database;"
```
#### Analyze Slow Queries
```bash
# List the 10 slowest queries by mean execution time
# (reads pg_stat_statements; slow query logging is already configured)
kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c \
"SELECT query, mean_exec_time, calls
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;"
```
#### Check Database Locks
```bash
kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c \
"SELECT blocked_locks.pid AS blocked_pid,
blocking_locks.pid AS blocking_pid,
blocked_activity.usename AS blocked_user,
blocking_activity.usename AS blocking_user,
blocked_activity.query AS blocked_statement
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype
AND blocking_locks.relation = blocked_locks.relation
AND blocking_locks.pid <> blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted AND blocking_locks.granted;"
```
### Database Optimization
#### Vacuum and Analyze
```bash
# Run on each database monthly
kubectl exec -n bakery-ia deployment/auth-db -- \
psql -U auth_user -d auth_db -c "VACUUM ANALYZE;"
# For all databases (run as cron job)
cat > ~/vacuum-databases.sh <<'EOF'
#!/bin/bash
for db in $(kubectl get deploy -n bakery-ia -l app.kubernetes.io/component=database -o name); do
echo "Vacuuming $db"
# vacuumdb --all covers every database in the instance, not only the default "postgres" database
kubectl exec -n bakery-ia $db -- vacuumdb -U postgres --all --analyze
done
EOF
chmod +x ~/vacuum-databases.sh
# Add to cron: 0 3 * * 0 (weekly at 3 AM)
```
#### Reindex (if performance degrades)
```bash
# Reindex specific database
kubectl exec -n bakery-ia deployment/auth-db -- \
psql -U auth_user -d auth_db -c "REINDEX DATABASE auth_db;"
```
---
## Backup & Recovery
### Backup Strategy
**Automated Daily Backups:**
- Frequency: Daily at 2 AM
- Retention: 30 days rolling
- Encryption: GPG encrypted
- Storage: Local VPS (configure off-site for production)
### Backup Script (Already Configured)
```bash
# Script location: ~/backup-databases.sh
# Configured in: pilot launch guide
# Manual backup
./backup-databases.sh
# Verify backup
ls -lh /backups/
```
### Backup Best Practices
1. **Test Restores Monthly**
```bash
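# Restore to test database
# (psql expects a plain SQL stream: extract a .tar.gz archive and decrypt GPG-encrypted files first)
gunzip < /backups/2026-01-07.tar.gz | \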
kubectl exec -i -n bakery-ia deployment/test-db -- \
psql -U postgres test_db
```
2. **Off-Site Storage (Recommended)**
```bash
# Sync backups to S3 / Cloud Storage
aws s3 sync /backups/ s3://bakery-ia-backups/ --delete
# Or use rclone for any cloud provider
rclone sync /backups/ remote:bakery-ia-backups
```
3. **Monitor Backup Success**
```bash
# Check last backup date
# (the first line of `ls -l` output is the "total" summary, so show two lines)
ls -lt /backups/ | head -2
# Set up alert if no backup in 25 hours
```
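
The "no backup in 25 hours" alert mentioned above can be approximated with a small freshness check on the backup directory. This is a sketch under the assumption that each backup run creates or updates at least one entry directly inside the directory; the function names are illustrative.

```python
import os
import time
from typing import Optional


def newest_backup_age_hours(backup_dir: str, now: Optional[float] = None) -> Optional[float]:
    """Return the age in hours of the most recently modified entry, or None if empty."""
    now = time.time() if now is None else now
    with os.scandir(backup_dir) as entries:
        mtimes = [entry.stat().st_mtime for entry in entries]
    if not mtimes:
        return None
    return (now - max(mtimes)) / 3600.0


def backup_is_stale(backup_dir: str, max_age_hours: float = 25.0,
                    now: Optional[float] = None) -> bool:
    """True if there is no backup at all or the newest one is older than the limit."""
    age = newest_backup_age_hours(backup_dir, now)
    return age is None or age > max_age_hours


if __name__ == "__main__":
    # /backups/ is the directory used throughout this guide
    if os.path.isdir("/backups"):
        print("stale" if backup_is_stale("/backups") else "fresh")
```

A cron job could run this shortly after the daily backup window and send a notification when it reports a stale state.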
### Recovery Procedures
#### Restore Single Database
```bash
# 1. Stop the service using the database
kubectl scale deployment auth-service -n bakery-ia --replicas=0
# 2. Drop and recreate database
kubectl exec -n bakery-ia deployment/auth-db -it -- \
psql -U postgres -c "DROP DATABASE auth_db;"
kubectl exec -n bakery-ia deployment/auth-db -it -- \
psql -U postgres -c "CREATE DATABASE auth_db OWNER auth_user;"
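# 3. Restore from backup
# (gunzip expects a gzip-compressed dump; use `cat` for plain SQL, and decrypt GPG-encrypted backups first)
gunzip < /backups/2026-01-07/auth-db.sql | \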
kubectl exec -i -n bakery-ia deployment/auth-db -- \
psql -U auth_user -d auth_db
# 4. Restart service
kubectl scale deployment auth-service -n bakery-ia --replicas=2
```
#### Disaster Recovery (Full System)
```bash
# 1. Provision new VPS (same specs)
# 2. Install MicroK8s (follow pilot launch guide)
# 3. Copy latest backup to new VPS
# 4. Deploy infrastructure and databases
kubectl apply -k infrastructure/environments/prod/k8s-manifests
# 5. Wait for databases to be ready
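# The default kubectl wait timeout is 30s, which is usually too short for database startup
kubectl wait --for=condition=ready pod -l app.kubernetes.io/component=database -n bakery-ia --timeout=300s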
# 6. Restore all databases
for backup in /backups/latest/*.sql; do
db_name=$(basename $backup .sql)
echo "Restoring $db_name"
cat $backup | kubectl exec -i -n bakery-ia deployment/${db_name} -- \
psql -U postgres
done
# 7. Deploy services
kubectl apply -k infrastructure/environments/prod/k8s-manifests
# 8. Update DNS to point to new VPS
# 9. Verify all services healthy
```
**Recovery Time Objective (RTO):** 2-4 hours
**Recovery Point Objective (RPO):** 24 hours (last daily backup)
---
## Performance Optimization
### Identifying Performance Issues
```bash
# 1. Check overall resource usage
kubectl top nodes
kubectl top pods -n bakery-ia --sort-by=cpu
kubectl top pods -n bakery-ia --sort-by=memory
# 2. Check API response times in Grafana
# Go to "Services Overview" dashboard
# Look for P95/P99 latency spikes
# 3. Check database query performance
kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c \
"SELECT query, calls, mean_exec_time, max_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 20;"
# 4. Check for N+1 queries in application logs
kubectl logs -n bakery-ia deployment/orders-service | grep "SELECT"
```
### Common Optimizations
#### 1. Database Indexing
```sql
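-- Inspect column statistics to pick index candidates
-- (columns with high n_distinct that appear in filters benefit most)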
SELECT schemaname, tablename, attname, n_distinct, correlation
FROM pg_stats
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY abs(correlation) DESC;
-- Add index on frequently queried columns
CREATE INDEX CONCURRENTLY idx_orders_tenant_created
ON orders(tenant_id, created_at DESC);
```
#### 2. Connection Pooling
Already configured in services using SQLAlchemy. Verify settings:
```python
# In shared/database/base.py
pool_size=5 # Adjust based on load
max_overflow=10 # Max additional connections
pool_timeout=30 # Connection timeout
pool_recycle=3600 # Recycle connections after 1 hour
```
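
With these settings, each SQLAlchemy engine can open at most `pool_size + max_overflow` connections, so the worst case per database grows with the number of replicas and worker processes. A rough sizing sketch; the `workers_per_replica` parameter and the `max_connections` value of 200 are assumptions for illustration.

```python
def worst_case_connections(
    replicas: int,
    workers_per_replica: int = 1,
    pool_size: int = 5,
    max_overflow: int = 10,
) -> int:
    """Upper bound on connections a service can open against its database."""
    # Each worker process owns its own pool of pool_size + max_overflow connections
    return replicas * workers_per_replica * (pool_size + max_overflow)


def utilization_pct(connections: int, max_connections: int) -> float:
    """Share of the PostgreSQL max_connections limit, in percent."""
    return 100.0 * connections / max_connections


if __name__ == "__main__":
    needed = worst_case_connections(replicas=3)
    print(needed, f"{utilization_pct(needed, 200):.1f}%")
```

If the worst case approaches the 80% warning threshold used by the DatabaseConnectionsHigh alert, lower the pool settings or raise `max_connections` before adding replicas.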
#### 3. Redis Caching
Increase cache for frequently accessed data:
```python
# Cache user permissions (example)
# Note: the cache key must vary with user_id, otherwise all users share one cached entry
@cache.cached(timeout=300, key_prefix='user_perms')
def get_user_permissions(user_id):
    ...  # fetch from database
```
#### 4. Query Optimization
```sql
-- Add EXPLAIN ANALYZE to slow queries
EXPLAIN ANALYZE SELECT * FROM orders WHERE tenant_id = '...';
-- Look for:
-- - Seq Scan (should use index scan)
-- - High execution time
-- - Missing indexes
```
### Scaling Triggers
**When to scale UP:**
- ❌ CPU usage >75% sustained for >1 hour
- ❌ Memory usage >85% sustained
- ❌ P95 API latency >3s
- ❌ Database connection pool exhausted frequently
- ❌ Error rate increasing
**When to scale OUT (add replicas):**
- ❌ Request rate increasing significantly
- ❌ Single service bottleneck identified
- ❌ Need zero-downtime deployments
- ❌ Geographic distribution needed
---
## Scaling Operations
### Vertical Scaling (Upgrade VPS)
```bash
# 1. Create backup
./backup-databases.sh
# 2. Plan upgrade window (requires brief downtime)
# Notify users: "Scheduled maintenance 2 AM - 3 AM"
# 3. At clouding.io, upgrade VPS
# RAM: 20 GB → 32 GB
# CPU: 8 cores → 12 cores
# (Usually instant, may require restart)
# 4. Verify after upgrade
kubectl top nodes
free -h
nproc
```
### Horizontal Scaling (Add Replicas)
```bash
# Scale specific service
kubectl scale deployment orders-service -n bakery-ia --replicas=5
# Or update in kustomization for persistence
# Edit: infrastructure/environments/prod/k8s-manifests/kustomization.yaml
#   replicas:
#     - name: orders-service
#       count: 5
kubectl apply -k infrastructure/environments/prod/k8s-manifests
```
### Auto-Scaling (HPA)
Already configured for:
- orders-service (1-3 replicas)
- forecasting-service (1-3 replicas)
- notification-service (1-3 replicas)
```bash
# Check HPA status
kubectl get hpa -n bakery-ia
# Adjust thresholds if needed
kubectl edit hpa orders-service-hpa -n bakery-ia
```
### Growth Path
| Tenants | Recommended Action |
|---------|-------------------|
| **10** | Current configuration (20GB RAM, 8 CPU) |
| **20** | Add replicas for critical services |
| **30** | Upgrade to 32GB RAM, 12 CPU |
| **50** | Consider database read replicas |
| **75** | Upgrade to 48GB RAM, 16 CPU |
| **100** | Plan multi-node cluster or managed K8s |
| **200+** | Migrate to managed services (EKS, GKE, AKS) |
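
The growth path can also be read as a lookup from tenant count to the next recommended action. The sketch below encodes the table as data; it assumes each row applies once the tenant count reaches that row's value, which is one reasonable reading of the table rather than a stated rule.

```python
from bisect import bisect_right

# (minimum tenant count, recommended action) copied from the table above
GROWTH_PATH = [
    (10, "Current configuration (20GB RAM, 8 CPU)"),
    (20, "Add replicas for critical services"),
    (30, "Upgrade to 32GB RAM, 12 CPU"),
    (50, "Consider database read replicas"),
    (75, "Upgrade to 48GB RAM, 16 CPU"),
    (100, "Plan multi-node cluster or managed K8s"),
    (200, "Migrate to managed services (EKS, GKE, AKS)"),
]


def recommended_action(tenants: int) -> str:
    """Return the action for the highest threshold that `tenants` has reached."""
    thresholds = [minimum for minimum, _ in GROWTH_PATH]
    index = bisect_right(thresholds, tenants) - 1
    # Below the first threshold, the current configuration still applies
    return GROWTH_PATH[max(index, 0)][1]


if __name__ == "__main__":
    for count in (8, 25, 120):
        print(count, "->", recommended_action(count))
```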
---
## Incident Response
### Incident Severity Levels
| Level | Description | Response Time | Example |
|-------|-------------|---------------|---------|
| **P0** | Complete outage | Immediate | All services down |
| **P1** | Major degradation | 15 minutes | Database unavailable |
| **P2** | Partial degradation | 1 hour | One service slow |
| **P3** | Minor issue | 4 hours | Non-critical alert |
### Incident Response Process
#### 1. Detect & Alert
```
- Monitoring alerts trigger
- User reports issue
- Automated health checks fail
```
#### 2. Assess & Communicate
```bash
# Quick assessment
./health-check.sh
# Determine severity
# P0/P1: Notify all stakeholders immediately
# P2/P3: Regular communication channels
```
#### 3. Investigate
```bash
# Check pods
kubectl get pods -n bakery-ia
# Check recent events
kubectl get events -n bakery-ia --sort-by='.lastTimestamp' | tail -20
# Check logs
kubectl logs -n bakery-ia deployment/SERVICE_NAME --tail=100
# Check metrics
# View Grafana dashboards
```
#### 4. Mitigate
```bash
# Common mitigations:
# Restart service
kubectl rollout restart deployment/SERVICE_NAME -n bakery-ia
# Rollback deployment
kubectl rollout undo deployment/SERVICE_NAME -n bakery-ia
# Scale up
kubectl scale deployment SERVICE_NAME -n bakery-ia --replicas=5
# Restart database
kubectl delete pod DB_POD_NAME -n bakery-ia
```
#### 5. Resolve & Document
```
1. Verify issue resolved
2. Update incident log
3. Create post-mortem (for P0/P1)
4. Implement preventive measures
```
### Common Incidents & Fixes
#### Incident: Database Connection Exhaustion
**Symptoms:** Services showing "connection pool exhausted" errors
**Fix:**
```bash
# 1. Identify leaking service
kubectl logs -n bakery-ia deployment/orders-service | grep "pool"
# 2. Restart leaking service
kubectl rollout restart deployment/orders-service -n bakery-ia
# 3. Increase max_connections if needed
kubectl exec -n bakery-ia deployment/orders-db -- \
psql -U postgres -c "ALTER SYSTEM SET max_connections = 200;"
kubectl rollout restart deployment/orders-db -n bakery-ia
```
#### Incident: Out of Memory (OOMKilled)
**Symptoms:** Pods restarting with "OOMKilled" status
**Fix:**
```bash
# 1. Identify which pod
kubectl get pods -n bakery-ia | grep OOMKilled
# 2. Check resource limits
kubectl describe pod POD_NAME -n bakery-ia | grep -A 5 Limits
# 3. Increase memory limit
# Edit deployment: infrastructure/kubernetes/base/components/services/SERVICE.yaml
#   resources:
#     limits:
#       memory: "1Gi"  # Increased from 512Mi
# 4. Redeploy
kubectl apply -k infrastructure/environments/prod/k8s-manifests
```
#### Incident: Certificate Expired
**Symptoms:** SSL errors, services can't connect
**Fix:**
```bash
# For Let's Encrypt (should auto-renew):
kubectl delete secret bakery-ia-prod-tls-cert -n bakery-ia
# Wait for cert-manager to recreate
# For internal certs:
# Follow "Certificate Rotation" section above
```
---
## Maintenance Tasks
### Daily Tasks
```bash
# Run health check
./health-check.sh
# Check monitoring alerts
curl http://localhost:9090/api/v1/alerts | jq '.data.alerts[] | select(.state=="firing")'
# Verify backups ran
ls -lh /backups/ | head -5
```
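Listing `/backups/` shows what is there, but does not fail when the latest backup is stale. The sketch below turns the check into a pass/fail test that could be run from cron or a monitoring probe. `backup_is_fresh` is a hypothetical helper, and the 24-hour default is an assumption based on the daily backup schedule; `-mmin` and `-quit` are available in GNU find.

```bash
# Sketch: succeed if DIR contains at least one file modified within the last
# MAX_AGE_HOURS hours (default 24), fail otherwise.
# backup_is_fresh is a hypothetical helper introduced for illustration.
backup_is_fresh() {
  local dir="$1" max_age_hours="${2:-24}"
  local recent
  # -mmin -N matches files modified less than N minutes ago; stop at the first hit
  recent=$(find "$dir" -type f -mmin "-$(( max_age_hours * 60 ))" -print -quit 2>/dev/null)
  [ -n "$recent" ]
}

# Example wiring:
#   backup_is_fresh /backups 24 || echo "WARNING: no backup in the last 24h" >&2
```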
### Weekly Tasks
```bash
# Review resource trends
# Open Grafana, check 7-day trends
# Review error logs (kubectl durations accept s/m/h only, so 7 days = 168h)
kubectl logs -n bakery-ia deployment/gateway --since=168h | grep ERROR | wc -l
# Check disk usage
kubectl exec -n bakery-ia deployment/auth-db -- df -h
# Review security logs
kubectl logs -n bakery-ia deployment/auth-service --since=168h | grep "failed"
```
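The weekly error review can be extended from the gateway to several services at once. This is a rough sketch: `count_errors` is a hypothetical helper, and the service list in the commented wiring contains only deployment names that appear elsewhere in this guide, so it should be adjusted to the actual set of deployments.

```bash
# Sketch: count log lines on stdin that contain "ERROR".
# grep -c exits non-zero when the count is 0, so the exit status is normalized.
# count_errors is a hypothetical helper introduced for illustration.
count_errors() {
  grep -c 'ERROR' || true
}

# Example wiring (168h = 7 days; kubectl does not accept a "d" suffix):
#   for svc in gateway auth-service orders-service tenant-service; do
#     n=$(kubectl logs -n bakery-ia "deployment/$svc" --since=168h | count_errors)
#     printf '%-20s %s\n' "$svc" "$n"
#   done
```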
### Monthly Tasks
- [ ] **Review and rotate passwords**
- [ ] **Update security patches**
- [ ] **Test backup restore**
- [ ] **Review RBAC policies**
- [ ] **Vacuum and analyze databases**
- [ ] **Review and optimize slow queries**
- [ ] **Check certificate expiry dates**
- [ ] **Review resource allocation**
- [ ] **Plan capacity for next quarter**
- [ ] **Update documentation**
### Quarterly Tasks (Every 90 Days)
- [ ] **Full security audit**
- [ ] **Disaster recovery drill**
- [ ] **Performance testing**
- [ ] **Cost optimization review**
- [ ] **Update runbooks**
- [ ] **Team training session**
- [ ] **Review SLAs and metrics**
- [ ] **Plan infrastructure upgrades**
### Annual Tasks
- [ ] **Penetration testing**
- [ ] **Compliance audit (GDPR, PCI-DSS, SOC 2)**
- [ ] **Full infrastructure review**
- [ ] **Update security roadmap**
- [ ] **Budget planning for next year**
- [ ] **Technology stack review**
---
## Compliance & Audit
### GDPR Compliance
**Requirements Met:**
- ✅ Article 32: Security of processing (TLS + pgcrypto encryption)
- ✅ Article 5(1)(f): Integrity and confidentiality of personal data
- ✅ Article 33: Breach notification (detection supported by audit logs)
- ✅ Article 17: Right to erasure (deletion endpoints)
- ✅ Article 20: Right to data portability (export functionality)
**Audit Tasks:**
```bash
# Review audit logs for data access
kubectl logs -n bakery-ia deployment/tenant-service | grep "user_data_access"
# Verify encryption in use
kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c "SHOW ssl;"
# Check data retention policies
# Review automated cleanup jobs
```
### PCI-DSS Compliance
**Requirements Met:**
- ✅ Requirement 4: Encryption of cardholder data in transit (TLS 1.2+)
- ✅ Requirement 3: Protection of stored account data (pgcrypto)
- ✅ Requirement 10: Access tracking (audit logs)
- ✅ Requirement 8: User authentication (JWT + MFA ready)
**Audit Tasks:**
```bash
# List password-related entries in the secret (values are base64-encoded, not encrypted; decode to inspect)
kubectl get secret bakery-ia-secrets -n bakery-ia -o jsonpath='{.data}' | grep -i "pass"
# Check encryption in transit
kubectl describe ingress -n bakery-ia | grep TLS
# Review access logs
kubectl logs -n bakery-ia deployment/auth-service | grep "login"
```
### SOC 2 Compliance
**Controls Met:**
- ✅ CC6.1: Access controls (RBAC)
- ✅ CC6.6: Encryption in transit (TLS)
- ✅ CC6.7: Encryption at rest (K8s secrets + pgcrypto)
- ✅ CC7.2: Monitoring (Prometheus + Grafana)
### Audit Log Retention
**Current Policy:**
- Application logs: 30 days (stdout)
- Database audit logs: 90 days
- Security logs: 1 year
- Backups: 30 days rolling
**Extending Retention:**
```bash
# Ship logs to external storage
# Example: Ship to S3 / CloudWatch / ELK
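# For PostgreSQL audit logs: log_rotation_age sets how long each log file is
# written before rotating; prune or archive rotated files separately
kubectl exec -n bakery-ia deployment/auth-db -- \
psql -U postgres -c "ALTER SYSTEM SET log_rotation_age = '90d';" \
-c "SELECT pg_reload_conf();"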
```
---
## Quick Reference Commands
### Emergency Commands
```bash
# Restart every deployment in the namespace, including database deployments (rolling update, minimal downtime)
kubectl rollout restart deployment -n bakery-ia
# Restart specific service
kubectl rollout restart deployment/orders-service -n bakery-ia
# Rollback last deployment
kubectl rollout undo deployment/orders-service -n bakery-ia
# Scale up quickly
kubectl scale deployment orders-service -n bakery-ia --replicas=5
# Get pod status
kubectl get pods -n bakery-ia
# Get recent events
kubectl get events -n bakery-ia --sort-by='.lastTimestamp' | tail -20
# Get logs
kubectl logs -n bakery-ia deployment/SERVICE_NAME --tail=100 -f
```
### Monitoring Commands
```bash
# Resource usage
kubectl top nodes
kubectl top pods -n bakery-ia --sort-by=cpu
kubectl top pods -n bakery-ia --sort-by=memory
# Check HPA
kubectl get hpa -n bakery-ia
# Check all resources
kubectl get all -n bakery-ia
# Check ingress
kubectl get ingress -n bakery-ia
# Check certificates
kubectl get certificate -n bakery-ia
```
### Database Commands
```bash
# Connect to database
kubectl exec -n bakery-ia deployment/auth-db -it -- psql -U auth_user -d auth_db
# Check connections
kubectl exec -n bakery-ia deployment/auth-db -- \
psql -U postgres -c "SELECT count(*) FROM pg_stat_activity;"
# Check database size
kubectl exec -n bakery-ia deployment/auth-db -- \
psql -U postgres -c "SELECT pg_size_pretty(pg_database_size('auth_db'));"
# Vacuum database
kubectl exec -n bakery-ia deployment/auth-db -- \
psql -U auth_user -d auth_db -c "VACUUM ANALYZE;"
```
---
## Support Resources
**Documentation:**
- [Pilot Launch Guide](./PILOT_LAUNCH_GUIDE.md) - Initial deployment and setup
- [Security Checklist](./security-checklist.md) - Security procedures and compliance
- [Database Security](./database-security.md) - Database operations and best practices
- [TLS Configuration](./tls-configuration.md) - Certificate management
- [RBAC Implementation](./rbac-implementation.md) - Access control configuration
- [Monitoring Stack README](../infrastructure/kubernetes/base/components/monitoring/README.md) - Detailed monitoring documentation
- [CI/CD Infrastructure README](../infrastructure/cicd/README.md) - Gitea, Tekton, and Flux CD setup and operations
- [SigNoz Monitoring README](../infrastructure/monitoring/signoz/README.md) - SigNoz deployment and configuration
**External Resources:**
- Kubernetes: https://kubernetes.io/docs
- MicroK8s: https://microk8s.io/docs
- Prometheus: https://prometheus.io/docs
- Grafana: https://grafana.com/docs
- PostgreSQL: https://www.postgresql.org/docs
**Emergency Contacts:**
- DevOps Team: devops@bakewise.ai
- On-Call: oncall@bakewise.ai
- Security Team: security@bakewise.ai
---
## Summary
This guide covers all aspects of operating the Bakery-IA platform in production:
- **Monitoring:** Dashboards, alerts, metrics
- **Security:** Access control, certificates, compliance
- **Databases:** Management, optimization, backups
- **Recovery:** Backup strategy, disaster recovery
- **Performance:** Optimization techniques, scaling
- **Incidents:** Response procedures, common fixes
- **Maintenance:** Daily, weekly, monthly tasks
- **Compliance:** GDPR, PCI-DSS, SOC 2
**Remember:**
- Monitor daily
- Back up daily
- Test restores monthly
- Rotate secrets quarterly
- Plan for growth continuously
---
**Document Version:** 1.0
**Last Updated:** 2026-01-07
**Maintained By:** DevOps Team
**Next Review:** 2026-04-07