# Bakery-IA Production Operations Guide

**Complete guide for operating, monitoring, and maintaining the production environment**

**Last Updated:** 2026-01-07
**Target Audience:** DevOps, SRE, System Administrators
**Security Grade:** A-

---

## Table of Contents

1. [Overview](#overview)
2. [Monitoring & Observability](#monitoring--observability)
3. [Security Operations](#security-operations)
4. [Database Management](#database-management)
5. [Backup & Recovery](#backup--recovery)
6. [Performance Optimization](#performance-optimization)
7. [Scaling Operations](#scaling-operations)
8. [Incident Response](#incident-response)
9. [Maintenance Tasks](#maintenance-tasks)
10. [Compliance & Audit](#compliance--audit)

---

## Overview

### Production Environment

**Infrastructure:**
- **Platform:** MicroK8s on Ubuntu 22.04 LTS
- **Services:** 18 microservices, 14 databases, monitoring stack
- **Capacity:** 10-tenant pilot (scalable to 100+)
- **Security:** TLS encryption, RBAC, audit logging
- **Monitoring:** Prometheus, Grafana, AlertManager, Jaeger

**Key Metrics (10-tenant baseline):**
- **Uptime Target:** 99.5% (3.65 hours downtime/month)
- **Response Time:** <2s average API response
- **Error Rate:** <1% of requests
- **Database Connections:** ~200 concurrent
- **Memory Usage:** 12-15 GB / 20 GB capacity
- **CPU Usage:** 40-60% under normal load

### Team Responsibilities

| Role | Responsibilities |
|------|------------------|
| **DevOps Engineer** | Deployment, infrastructure, scaling |
| **SRE** | Monitoring, incident response, performance |
| **Security Admin** | Access control, security patches, compliance |
| **Database Admin** | Backups, optimization, migrations |
| **On-Call Engineer** | 24/7 incident response (if applicable) |

---

## Monitoring & Observability

### Access Monitoring Dashboards

**Production URLs:**
```
https://monitoring.yourdomain.com/grafana       # Dashboards & visualization
https://monitoring.yourdomain.com/prometheus    # Metrics & alerts
https://monitoring.yourdomain.com/alertmanager  # Alert management
https://monitoring.yourdomain.com/jaeger        # Distributed tracing
```

**Port Forwarding (if ingress not available):**
```bash
# Grafana
kubectl port-forward -n monitoring svc/grafana 3000:3000

# Prometheus
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090

# AlertManager
kubectl port-forward -n monitoring svc/alertmanager-external 9093:9093

# Jaeger
kubectl port-forward -n monitoring svc/jaeger-query 16686:16686
```

### Key Dashboards

#### 1. Services Overview Dashboard

**What to Monitor:**
- Request rate per service
- Error rate (aim: <1%)
- P95/P99 latency (aim: <2s)
- Active connections
- Pod health status

**Red Flags:**
- ❌ Error rate >5%
- ❌ P95 latency >3s
- ❌ Any service showing 0 requests (might be down)
- ❌ Pod restarts >3 in last hour

#### 2. Database Dashboard (PostgreSQL)

**What to Monitor:**
- Active connections per database
- Cache hit ratio (aim: >90%)
- Query duration (P95)
- Transaction rate
- Replication lag (if applicable)

**Red Flags:**
- ❌ Connection count >80% of max
- ❌ Cache hit ratio <80%
- ❌ Slow queries >1s frequently
- ❌ Locks increasing

#### 3. Node Exporter (Infrastructure)

**What to Monitor:**
- CPU usage per node
- Memory usage and swap
- Disk I/O and latency
- Network throughput
- Disk space remaining

**Red Flags:**
- ❌ CPU usage >85% sustained
- ❌ Memory usage >90%
- ❌ Swap usage >0 (indicates memory pressure)
- ❌ Disk space <20% remaining
- ❌ Disk I/O latency >100ms

#### 4. Business Metrics Dashboard

**What to Monitor:**
- Active tenants
- ML training jobs (success/failure rate)
- Forecast requests per hour
- Alert volume
- API health score

**Red Flags:**
- ❌ Training failure rate >10%
- ❌ No forecast requests (might indicate issue)
- ❌ Alert volume spike (investigate cause)

### Alert Severity Levels

| Severity | Response Time | Escalation | Examples |
|----------|---------------|------------|----------|
| **Critical** | Immediate | Page on-call | Service down, database unavailable |
| **Warning** | 30 minutes | Email team | High memory, slow queries |
| **Info** | Best effort | Email | Backup completed, cert renewal |

### Common Alerts & Responses

#### Alert: ServiceDown
```
Severity: Critical
Meaning: A service has been down for >2 minutes
Response:
1. Check pod status: kubectl get pods -n bakery-ia
2. View logs: kubectl logs POD_NAME -n bakery-ia
3. Check recent deployments: kubectl rollout history deployment/SERVICE_NAME -n bakery-ia
4. Restart if safe: kubectl rollout restart deployment/SERVICE_NAME -n bakery-ia
5. Rollback if needed: kubectl rollout undo deployment/SERVICE_NAME -n bakery-ia
```

#### Alert: HighMemoryUsage
```
Severity: Warning
Meaning: Service using >80% of memory limit
Response:
1. Check which pods: kubectl top pods -n bakery-ia --sort-by=memory
2. Review memory trends in Grafana
3. Check for memory leaks in application logs
4. Consider increasing memory limits if sustained
5. Restart pod if memory leak suspected
```

#### Alert: DatabaseConnectionsHigh
```
Severity: Warning
Meaning: Database connections >80% of max
Response:
1. Identify which service: Check Grafana database dashboard
2. Look for connection leaks in application
3. Check for long-running transactions
4. Consider increasing max_connections
5. Restart service if connections not releasing
```

#### Alert: CertificateExpiringSoon
```
Severity: Warning
Meaning: TLS certificate expires in <30 days
Response:
1. For Let's Encrypt: Auto-renewal should handle (verify cert-manager)
2. For internal certs: Regenerate and apply new certificates
3. See "Certificate Rotation" section below
```

### Metrics to Track Daily

```bash
# Quick health check command
cat > ~/health-check.sh <<'EOF'
#!/bin/bash

echo "=== Bakery-IA Health Check ==="
echo "Date: $(date)"
echo ""

echo "1. Pod Status:"
# --no-headers keeps the header row from counting as an unhealthy pod
kubectl get pods -n bakery-ia --no-headers | grep -vE "Running|Completed" || echo "✅ All pods healthy"
echo ""

echo "2. Resource Usage:"
kubectl top nodes
echo ""

echo "3. Database Connections:"
kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c \
  "SELECT count(*) as connections FROM pg_stat_activity;"
echo ""

echo "4. Recent Alerts:"
# Requires the Prometheus port-forward (localhost:9090) to be active
curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts[] | {alert: .labels.alertname, state: .state}' | head -10
echo ""

echo "5. Disk Usage:"
kubectl exec -n bakery-ia deployment/auth-db -- df -h /var/lib/postgresql/data
echo ""

echo "=== End Health Check ==="
EOF

chmod +x ~/health-check.sh
~/health-check.sh
```

---

## Security Operations

### Security Posture Overview

**Current Security Grade: A-**

**Implemented:**
- ✅ TLS 1.2+ encryption for all database connections
- ✅ Let's Encrypt SSL for public endpoints
- ✅ 32-character cryptographic passwords
- ✅ JWT-based authentication
- ✅ Tenant isolation at database and application level
- ✅ Kubernetes secrets encryption at rest
- ✅ PostgreSQL audit logging
- ✅ RBAC (Role-Based Access Control)
- ✅ Regular security updates

### Access Control Management

#### User Roles

| Role | Permissions | Use Case |
|------|-------------|----------|
| **Viewer** | Read-only access | Dashboard viewing, reports |
| **Member** | Read + create/update | Day-to-day operations |
| **Admin** | Full operational access | Manage users, configure settings |
| **Owner** | Full control | Billing, tenant deletion |

#### Managing User Access

```bash
# View current users for a tenant (via API)
curl -H "Authorization: Bearer $ADMIN_TOKEN" \
  https://api.yourdomain.com/api/v1/tenants/TENANT_ID/users

# Promote user to admin
curl -X PATCH -H "Authorization: Bearer $OWNER_TOKEN" \
  -H "Content-Type: application/json" \
  https://api.yourdomain.com/api/v1/tenants/TENANT_ID/users/USER_ID \
  -d '{"role": "admin"}'
```

### Security Checklist (Monthly)

- [ ] **Review audit logs for suspicious activity**
  ```bash
  # Check failed login attempts
  kubectl logs -n bakery-ia deployment/auth-service | grep "authentication failed" | tail -50

  # Check unusual API calls
  kubectl logs -n bakery-ia deployment/gateway | grep -E "DELETE|admin" | tail -50
  ```

- [ ] **Verify all services using TLS**
  ```bash
  # Check PostgreSQL SSL
  for db in $(kubectl get deploy -n bakery-ia -l app.kubernetes.io/component=database -o name); do
    echo "Checking $db"
    kubectl exec -n bakery-ia "$db" -- psql -U postgres -c "SHOW ssl;"
  done
  ```

- [ ] **Review and rotate passwords (every 90 days)**
  ```bash
  # Generate new passwords
  openssl rand -base64 32  # For each service

  # Update secrets
  kubectl edit secret bakery-ia-secrets -n bakery-ia

  # Restart services to pick up new passwords
  kubectl rollout restart deployment -n bakery-ia
  ```

- [ ] **Check certificate expiry dates**
  ```bash
  # Check Let's Encrypt certs
  kubectl get certificate -n bakery-ia

  # Check internal TLS certs (expire Oct 2028)
  kubectl exec -n bakery-ia deployment/auth-db -- \
    openssl x509 -in /tls/server-cert.pem -noout -dates
  ```

- [ ] **Review RBAC policies**
  - Ensure least privilege principle
  - Remove access for departed team members
  - Audit admin/owner role assignments

- [ ] **Apply security updates**
  ```bash
  # Update system packages on VPS
  ssh root@$VPS_IP "apt update && apt upgrade -y"

  # Update container images (rebuild with latest base images)
  docker-compose build --pull
  ```

### Certificate Rotation

#### Let's Encrypt (Auto-Renewal)

Let's Encrypt certificates auto-renew via cert-manager.
Verify:

```bash
# Check cert-manager is running
kubectl get pods -n cert-manager

# Check certificate status
kubectl describe certificate bakery-ia-prod-tls-cert -n bakery-ia

# Force renewal if needed (>30 days before expiry)
kubectl delete secret bakery-ia-prod-tls-cert -n bakery-ia
# cert-manager will automatically recreate
```

#### Internal TLS Certificates (Manual Rotation)

**When:** 90 days before October 2028 expiry

```bash
# 1. Generate new certificates (on local machine)
cd infrastructure/tls
./generate-certificates.sh

# 2. Update Kubernetes secrets
kubectl delete secret postgres-tls redis-tls -n bakery-ia

kubectl create secret generic postgres-tls \
  --from-file=server-cert.pem=postgres/server-cert.pem \
  --from-file=server-key.pem=postgres/server-key.pem \
  --from-file=ca-cert.pem=postgres/ca-cert.pem \
  -n bakery-ia

kubectl create secret generic redis-tls \
  --from-file=redis-cert.pem=redis/redis-cert.pem \
  --from-file=redis-key.pem=redis/redis-key.pem \
  --from-file=ca-cert.pem=redis/ca-cert.pem \
  -n bakery-ia

# 3. Restart database pods to pick up new certs
kubectl rollout restart deployment -n bakery-ia -l app.kubernetes.io/component=database
kubectl rollout restart deployment -n bakery-ia -l app.kubernetes.io/component=cache

# 4. Verify new certificates
kubectl exec -n bakery-ia deployment/auth-db -- \
  openssl x509 -in /tls/server-cert.pem -noout -dates
```

---

## Database Management

### Database Architecture

**14 PostgreSQL Instances:**
- auth-db, tenant-db, training-db, forecasting-db, sales-db
- external-db, notification-db, inventory-db, recipes-db
- suppliers-db, pos-db, orders-db, production-db, alert-processor-db

**1 Redis Instance:** Shared caching and session storage

### Database Health Monitoring

```bash
# Check all database pods
kubectl get pods -n bakery-ia -l app.kubernetes.io/component=database

# Check database resource usage
kubectl top pods -n bakery-ia -l app.kubernetes.io/component=database

# Check database connections
for db in $(kubectl get pods -n bakery-ia -l app.kubernetes.io/component=database -o name); do
  echo "=== $db ==="
  kubectl exec -n bakery-ia "$db" -- psql -U postgres -c \
    "SELECT datname, count(*) FROM pg_stat_activity GROUP BY datname;"
done
```

### Common Database Operations

#### Connect to Database

```bash
# Connect to specific database
kubectl exec -n bakery-ia deployment/auth-db -it -- \
  psql -U auth_user -d auth_db

# Inside psql:
\dt              # List tables
\d+ table_name   # Describe table with details
\du              # List users
\l               # List databases
\q               # Quit
```

#### Check Database Size

```bash
kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c \
  "SELECT pg_database.datname,
          pg_size_pretty(pg_database_size(pg_database.datname)) AS size
   FROM pg_database;"
```

#### Analyze Slow Queries

```bash
# List the slowest queries by mean execution time
# (reads pg_stat_statements; slow query logging is already configured)
kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c \
  "SELECT query, mean_exec_time, calls
   FROM pg_stat_statements
   ORDER BY mean_exec_time DESC
   LIMIT 10;"
```

#### Check Database Locks

```bash
kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c \
  "SELECT blocked_locks.pid AS blocked_pid,
          blocking_locks.pid AS blocking_pid,
          blocked_activity.usename AS blocked_user,
          blocking_activity.usename
          AS blocking_user,
          blocked_activity.query AS blocked_statement
   FROM pg_catalog.pg_locks blocked_locks
   JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
   JOIN pg_catalog.pg_locks blocking_locks
     ON blocking_locks.locktype = blocked_locks.locktype
    AND blocking_locks.relation = blocked_locks.relation
    AND blocking_locks.pid != blocked_locks.pid
   JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
   WHERE NOT blocked_locks.granted
     AND blocking_locks.granted;"
```

### Database Optimization

#### Vacuum and Analyze

```bash
# Run on each database monthly
kubectl exec -n bakery-ia deployment/auth-db -- \
  psql -U auth_user -d auth_db -c "VACUUM ANALYZE;"

# For all databases (run as cron job)
cat > ~/vacuum-databases.sh <<'EOF'
#!/bin/bash
for db in $(kubectl get deploy -n bakery-ia -l app.kubernetes.io/component=database -o name); do
  echo "Vacuuming $db"
  # vacuumdb --all covers every database in the instance; psql without -d
  # would only vacuum the default "postgres" database
  kubectl exec -n bakery-ia "$db" -- vacuumdb -U postgres --all --analyze
done
EOF

chmod +x ~/vacuum-databases.sh
# Add to cron: 0 3 * * 0 (Sundays at 3 AM)
```

#### Reindex (if performance degrades)

```bash
# Reindex specific database
kubectl exec -n bakery-ia deployment/auth-db -- \
  psql -U auth_user -d auth_db -c "REINDEX DATABASE auth_db;"
```

---

## Backup & Recovery

### Backup Strategy

**Automated Daily Backups:**
- Frequency: Daily at 2 AM
- Retention: 30 days rolling
- Encryption: GPG encrypted
- Storage: Local VPS (configure off-site for production)

### Backup Script (Already Configured)

```bash
# Script location: ~/backup-databases.sh
# Configured in: pilot launch guide

# Manual backup
~/backup-databases.sh

# Verify backup
ls -lh /backups/
```

### Backup Best Practices

1. **Test Restores Monthly**
   ```bash
   # Restore to test database
   # (assumes plain SQL dumps as used in the recovery procedures below;
   # decompress or decrypt first if your backups are stored that way)
   cat /backups/2026-01-07/auth-db.sql | \
     kubectl exec -i -n bakery-ia deployment/test-db -- \
     psql -U postgres test_db
   ```

2. **Off-Site Storage (Recommended)**
   ```bash
   # Sync backups to S3 / Cloud Storage
   aws s3 sync /backups/ s3://bakery-ia-backups/ --delete

   # Or use rclone for any cloud provider
   rclone sync /backups/ remote:bakery-ia-backups
   ```

3. **Monitor Backup Success**
   ```bash
   # Check last backup date (newest entry first)
   ls -t /backups/ | head -1

   # Set up alert if no backup in 25 hours
   ```

### Recovery Procedures

#### Restore Single Database

```bash
# 1. Stop the service using the database
kubectl scale deployment auth-service -n bakery-ia --replicas=0

# 2. Drop and recreate database
kubectl exec -n bakery-ia deployment/auth-db -- \
  psql -U postgres -c "DROP DATABASE auth_db;"
kubectl exec -n bakery-ia deployment/auth-db -- \
  psql -U postgres -c "CREATE DATABASE auth_db OWNER auth_user;"

# 3. Restore from backup
# (decompress or decrypt the dump first if it is stored that way)
cat /backups/2026-01-07/auth-db.sql | \
  kubectl exec -i -n bakery-ia deployment/auth-db -- \
  psql -U auth_user -d auth_db

# 4. Restart service
kubectl scale deployment auth-service -n bakery-ia --replicas=2
```

#### Disaster Recovery (Full System)

```bash
# 1. Provision new VPS (same specs)
# 2. Install MicroK8s (follow pilot launch guide)
# 3. Copy latest backup to new VPS

# 4. Deploy infrastructure and databases
kubectl apply -k infrastructure/kubernetes/overlays/prod

# 5. Wait for databases to be ready (the default wait timeout is only 30s)
kubectl wait --for=condition=ready pod -l app.kubernetes.io/component=database -n bakery-ia --timeout=300s

# 6. Restore all databases
for backup in /backups/latest/*.sql; do
  db_name=$(basename "$backup" .sql)
  echo "Restoring $db_name"
  cat "$backup" | kubectl exec -i -n bakery-ia "deployment/${db_name}" -- \
    psql -U postgres
done

# 7. Deploy services
kubectl apply -k infrastructure/kubernetes/overlays/prod

# 8. Update DNS to point to new VPS
# 9. Verify all services healthy
```

**Recovery Time Objective (RTO):** 2-4 hours
**Recovery Point Objective (RPO):** 24 hours (last daily backup)

---

## Performance Optimization

### Identifying Performance Issues

```bash
# 1. Check overall resource usage
kubectl top nodes
kubectl top pods -n bakery-ia --sort-by=cpu
kubectl top pods -n bakery-ia --sort-by=memory

# 2. Check API response times in Grafana
# Go to "Services Overview" dashboard
# Look for P95/P99 latency spikes

# 3. Check database query performance
kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c \
  "SELECT query, calls, mean_exec_time, max_exec_time
   FROM pg_stat_statements
   ORDER BY mean_exec_time DESC
   LIMIT 20;"

# 4. Check for N+1 queries in application logs
kubectl logs -n bakery-ia deployment/orders-service | grep "SELECT"
```

### Common Optimizations

#### 1. Database Indexing

```sql
-- Review column statistics to pick index candidates
-- (pg_stats does not itself report missing indexes)
SELECT schemaname, tablename, attname, n_distinct, correlation
FROM pg_stats
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY abs(correlation) DESC;

-- Add index on frequently queried columns
CREATE INDEX CONCURRENTLY idx_orders_tenant_created
ON orders(tenant_id, created_at DESC);
```

#### 2. Connection Pooling

Already configured in services using SQLAlchemy. Verify settings:

```python
# In shared/database/base.py
pool_size=5           # Adjust based on load
max_overflow=10       # Max additional connections
pool_timeout=30       # Connection timeout
pool_recycle=3600     # Recycle connections after 1 hour
```

#### 3. Redis Caching

Increase cache for frequently accessed data:

```python
# Cache user permissions (example)
# Assuming `cache` is a Flask-Caching instance: memoize() keys the entry on
# user_id, whereas cached() with a fixed key_prefix would share one entry
# across all users.
@cache.memoize(timeout=300)
def get_user_permissions(user_id):
    ...  # fetch from database
```

#### 4. Query Optimization

```sql
-- Add EXPLAIN ANALYZE to slow queries
EXPLAIN ANALYZE
SELECT * FROM orders WHERE tenant_id = '...';

-- Look for:
-- - Seq Scan (should use index scan)
-- - High execution time
-- - Missing indexes
```

### Scaling Triggers

**When to scale UP:**
- ❌ CPU usage >75% sustained for >1 hour
- ❌ Memory usage >85% sustained
- ❌ P95 API latency >3s
- ❌ Database connection pool exhausted frequently
- ❌ Error rate increasing

**When to scale OUT (add replicas):**
- ❌ Request rate increasing significantly
- ❌ Single service bottleneck identified
- ❌ Need zero-downtime deployments
- ❌ Geographic distribution needed

---

## Scaling Operations

### Vertical Scaling (Upgrade VPS)

```bash
# 1. Create backup
~/backup-databases.sh

# 2. Plan upgrade window (requires brief downtime)
# Notify users: "Scheduled maintenance 2 AM - 3 AM"

# 3. At clouding.io, upgrade VPS
# RAM: 20 GB → 32 GB
# CPU: 8 cores → 12 cores
# (Usually instant, may require restart)

# 4. Verify after upgrade
kubectl top nodes
free -h
nproc
```

### Horizontal Scaling (Add Replicas)

```bash
# Scale specific service
kubectl scale deployment orders-service -n bakery-ia --replicas=5

# Or update in kustomization for persistence
# Edit: infrastructure/kubernetes/overlays/prod/kustomization.yaml
#   replicas:
#   - name: orders-service
#     count: 5

kubectl apply -k infrastructure/kubernetes/overlays/prod
```

### Auto-Scaling (HPA)

Already configured for:
- orders-service (1-3 replicas)
- forecasting-service (1-3 replicas)
- notification-service (1-3 replicas)

```bash
# Check HPA status
kubectl get hpa -n bakery-ia

# Adjust thresholds if needed
kubectl edit hpa orders-service-hpa -n bakery-ia
```

### Growth Path

| Tenants | Recommended Action |
|---------|-------------------|
| **10** | Current configuration (20GB RAM, 8 CPU) |
| **20** | Add replicas for critical services |
| **30** | Upgrade to 32GB RAM, 12 CPU |
| **50** | Consider database read replicas |
| **75** | Upgrade to 48GB RAM, 16 CPU |
| **100** | Plan multi-node cluster or managed K8s |
| **200+** | Migrate to managed services (EKS, GKE, AKS) |

---

## Incident Response

### Incident Severity Levels

| Level | Description | Response Time | Example |
|-------|-------------|---------------|---------|
| **P0** | Complete outage | Immediate | All services down |
| **P1** | Major degradation | 15 minutes | Database unavailable |
| **P2** | Partial degradation | 1 hour | One service slow |
| **P3** | Minor issue | 4 hours | Non-critical alert |

### Incident Response Process

#### 1. Detect & Alert
```
- Monitoring alerts trigger
- User reports issue
- Automated health checks fail
```

#### 2. Assess & Communicate
```bash
# Quick assessment
~/health-check.sh

# Determine severity
# P0/P1: Notify all stakeholders immediately
# P2/P3: Regular communication channels
```

#### 3. Investigate
```bash
# Check pods
kubectl get pods -n bakery-ia

# Check recent events
kubectl get events -n bakery-ia --sort-by='.lastTimestamp' | tail -20

# Check logs
kubectl logs -n bakery-ia deployment/SERVICE_NAME --tail=100

# Check metrics
# View Grafana dashboards
```

#### 4. Mitigate
```bash
# Common mitigations:

# Restart service
kubectl rollout restart deployment/SERVICE_NAME -n bakery-ia

# Rollback deployment
kubectl rollout undo deployment/SERVICE_NAME -n bakery-ia

# Scale up
kubectl scale deployment SERVICE_NAME -n bakery-ia --replicas=5

# Restart database
kubectl delete pod DB_POD_NAME -n bakery-ia
```

#### 5. Resolve & Document
```
1. Verify issue resolved
2. Update incident log
3. Create post-mortem (for P0/P1)
4. Implement preventive measures
```

### Common Incidents & Fixes

#### Incident: Database Connection Exhaustion

**Symptoms:** Services showing "connection pool exhausted" errors

**Fix:**
```bash
# 1. Identify leaking service
kubectl logs -n bakery-ia deployment/orders-service | grep "pool"

# 2. Restart leaking service
kubectl rollout restart deployment/orders-service -n bakery-ia

# 3. Increase max_connections if needed (takes effect after the restart)
kubectl exec -n bakery-ia deployment/orders-db -- \
  psql -U postgres -c "ALTER SYSTEM SET max_connections = 200;"
kubectl rollout restart deployment/orders-db -n bakery-ia
```

#### Incident: Out of Memory (OOMKilled)

**Symptoms:** Pods restarting with "OOMKilled" status

**Fix:**
```bash
# 1. Identify which pod
kubectl get pods -n bakery-ia | grep OOMKilled

# 2. Check resource limits
kubectl describe pod POD_NAME -n bakery-ia | grep -A 5 Limits

# 3. Increase memory limit
# Edit deployment: infrastructure/kubernetes/base/components/services/SERVICE.yaml
#   resources:
#     limits:
#       memory: "1Gi"  # Increased from 512Mi

# 4. Redeploy
kubectl apply -k infrastructure/kubernetes/overlays/prod
```

#### Incident: Certificate Expired

**Symptoms:** SSL errors, services can't connect

**Fix:**
```bash
# For Let's Encrypt (should auto-renew):
kubectl delete secret bakery-ia-prod-tls-cert -n bakery-ia
# Wait for cert-manager to recreate

# For internal certs:
# Follow "Certificate Rotation" section above
```

---

## Maintenance Tasks

### Daily Tasks

```bash
# Run health check
~/health-check.sh

# Check monitoring alerts
curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts[] | select(.state=="firing")'

# Verify backups ran (newest first)
ls -lht /backups/ | head -5
```

### Weekly Tasks

```bash
# Review resource trends
# Open Grafana, check 7-day trends

# Review error logs (kubectl accepts s/m/h durations, so 7 days = 168h)
kubectl logs -n bakery-ia deployment/gateway --since=168h | grep ERROR | wc -l

# Check disk usage
kubectl exec -n bakery-ia deployment/auth-db -- df -h

# Review security logs
kubectl logs -n bakery-ia deployment/auth-service --since=168h | grep "failed"
```

### Monthly Tasks

- [ ] **Review and rotate passwords**
- [ ] **Update security patches**
- [ ] **Test backup restore**
- [ ] **Review RBAC policies**
- [ ] **Vacuum and analyze databases**
- [ ] **Review and optimize slow queries**
- [ ] **Check certificate expiry dates**
- [ ] **Review resource allocation**
- [ ] **Plan capacity for next quarter**
- [ ] **Update documentation**

### Quarterly Tasks (Every 90 Days)

- [ ] **Full security audit**
- [ ] **Disaster recovery drill**
- [ ] **Performance testing**
- [ ] **Cost optimization review**
- [ ] **Update runbooks**
- [ ] **Team training session**
- [ ] **Review SLAs and metrics**
- [ ] **Plan infrastructure upgrades**

### Annual Tasks

- [ ] **Penetration testing**
- [ ] **Compliance audit (GDPR, PCI-DSS, SOC 2)**
- [ ] **Full infrastructure review**
- [ ] **Update security roadmap**
- [ ] **Budget planning for next year**
- [ ] **Technology stack review**

---

## Compliance & Audit

### GDPR Compliance

**Requirements Met:**
- ✅ Article 32: Encryption of personal data (TLS + pgcrypto)
- ✅ Article 5(1)(f): Security of processing
- ✅ Article 33: Breach detection (audit logs)
- ✅ Article 17: Right to erasure (deletion endpoints)
- ✅ Article 20: Right to data portability (export functionality)

**Audit Tasks:**
```bash
# Review audit logs for data access
kubectl logs -n bakery-ia deployment/tenant-service | grep "user_data_access"

# Verify encryption in use
kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c "SHOW ssl;"

# Check data retention policies
# Review automated cleanup jobs
```

### PCI-DSS Compliance

**Requirements Met:**
- ✅ Requirement 4: Transmission encryption (TLS 1.2+)
- ✅ Requirement 3.5: Stored data protection (pgcrypto)
- ✅ Requirement 10: Access tracking (audit logs)
- ✅ Requirement 8: User authentication (JWT + MFA ready)

**Audit Tasks:**
```bash
# Verify no plaintext passwords
kubectl get secret bakery-ia-secrets -n bakery-ia -o jsonpath='{.data}' | grep -i "pass"

# Check encryption in transit
kubectl describe ingress -n bakery-ia | grep TLS

# Review access logs
kubectl logs -n bakery-ia deployment/auth-service | grep "login"
```

### SOC 2 Compliance

**Controls Met:**
- ✅ CC6.1: Access controls (RBAC)
- ✅ CC6.6: Encryption in transit (TLS)
- ✅ CC6.7: Encryption at rest (K8s secrets + pgcrypto)
- ✅ CC7.2: Monitoring (Prometheus +
Grafana)

### Audit Log Retention

**Current Policy:**
- Application logs: 30 days (stdout)
- Database audit logs: 90 days
- Security logs: 1 year
- Backups: 30 days rolling

**Extending Retention:**
```bash
# Ship logs to external storage
# Example: Ship to S3 / CloudWatch / ELK

# For PostgreSQL audit logs, adjust CSV log rotation.
# Note: log_rotation_age sets how long each log file is written before a new
# one starts; it does not delete old files, so pair it with log shipping.
kubectl exec -n bakery-ia deployment/auth-db -- \
  psql -U postgres -c "ALTER SYSTEM SET log_rotation_age = '90d';"
# Reload the configuration so the change takes effect
kubectl exec -n bakery-ia deployment/auth-db -- \
  psql -U postgres -c "SELECT pg_reload_conf();"
```

---

## Quick Reference Commands

### Emergency Commands

```bash
# Restart all services (minimal downtime with rolling update)
kubectl rollout restart deployment -n bakery-ia

# Restart specific service
kubectl rollout restart deployment/orders-service -n bakery-ia

# Rollback last deployment
kubectl rollout undo deployment/orders-service -n bakery-ia

# Scale up quickly
kubectl scale deployment orders-service -n bakery-ia --replicas=5

# Get pod status
kubectl get pods -n bakery-ia

# Get recent events
kubectl get events -n bakery-ia --sort-by='.lastTimestamp' | tail -20

# Get logs
kubectl logs -n bakery-ia deployment/SERVICE_NAME --tail=100 -f
```

### Monitoring Commands

```bash
# Resource usage
kubectl top nodes
kubectl top pods -n bakery-ia --sort-by=cpu
kubectl top pods -n bakery-ia --sort-by=memory

# Check HPA
kubectl get hpa -n bakery-ia

# Check all resources
kubectl get all -n bakery-ia

# Check ingress
kubectl get ingress -n bakery-ia

# Check certificates
kubectl get certificate -n bakery-ia
```

### Database Commands

```bash
# Connect to database
kubectl exec -n bakery-ia deployment/auth-db -it -- psql -U auth_user -d auth_db

# Check connections
kubectl exec -n bakery-ia deployment/auth-db -- \
  psql -U postgres -c "SELECT count(*) FROM pg_stat_activity;"

# Check database size
kubectl exec -n bakery-ia deployment/auth-db -- \
  psql -U postgres -c "SELECT pg_size_pretty(pg_database_size('auth_db'));"

# Vacuum database
kubectl exec -n bakery-ia deployment/auth-db -- \
  psql -U auth_user -d auth_db -c "VACUUM ANALYZE;"
```
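
The connection check above returns a raw count, which is only meaningful relative to `max_connections`. As a rough sketch, the helpers below compare the two against the 80% red-flag threshold used in this guide. The function names (`conn_usage_pct`, `conn_status`, `check_db_connections`) are illustrative and not part of the existing tooling, and the `postgres` user and `bakery-ia` namespace are assumed to match the commands above.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of the Bakery-IA tooling.

# Print the integer percentage of connections in use.
conn_usage_pct() {
  local current="$1" max="$2"
  echo $(( current * 100 / max ))
}

# Print "WARN" when usage exceeds the threshold (default 80), otherwise "OK".
conn_status() {
  local current="$1" max="$2" threshold="${3:-80}"
  if [ "$(conn_usage_pct "$current" "$max")" -gt "$threshold" ]; then
    echo "WARN"
  else
    echo "OK"
  fi
}

# Query one database deployment and report its connection usage.
# count(*) on pg_stat_activity also counts background processes, so the
# figure slightly overstates client connections.
check_db_connections() {
  local deploy="$1" current max
  current=$(kubectl exec -n bakery-ia "deployment/${deploy}" -- \
    psql -U postgres -Atc "SELECT count(*) FROM pg_stat_activity;")
  max=$(kubectl exec -n bakery-ia "deployment/${deploy}" -- \
    psql -U postgres -Atc "SHOW max_connections;")
  echo "${deploy}: ${current}/${max} ($(conn_usage_pct "$current" "$max")%) $(conn_status "$current" "$max")"
}
```

After sourcing the file, `check_db_connections auth-db` prints one summary line for that database; the arithmetic helpers can also be used on their own, for example `conn_status 170 200`.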

---

## Support Resources

**Documentation:**
- [Pilot Launch Guide](./PILOT_LAUNCH_GUIDE.md) - Initial deployment
- [Monitoring Summary](./MONITORING_DEPLOYMENT_SUMMARY.md) - Monitoring details
- [Quick Start Monitoring](./QUICK_START_MONITORING.md) - Monitoring setup
- [Security Checklist](./security-checklist.md) - Security procedures
- [Database Security](./database-security.md) - Database operations
- [TLS Configuration](./tls-configuration.md) - Certificate management
- [RBAC Implementation](./rbac-implementation.md) - Access control

**External Resources:**
- Kubernetes: https://kubernetes.io/docs
- MicroK8s: https://microk8s.io/docs
- Prometheus: https://prometheus.io/docs
- Grafana: https://grafana.com/docs
- PostgreSQL: https://www.postgresql.org/docs

**Emergency Contacts:**
- DevOps Team: devops@yourdomain.com
- On-Call: oncall@yourdomain.com
- Security Team: security@yourdomain.com

---

## Summary

This guide covers all aspects of operating the Bakery-IA platform in production:

- ✅ **Monitoring:** Dashboards, alerts, metrics
- ✅ **Security:** Access control, certificates, compliance
- ✅ **Databases:** Management, optimization, backups
- ✅ **Recovery:** Backup strategy, disaster recovery
- ✅ **Performance:** Optimization techniques, scaling
- ✅ **Incidents:** Response procedures, common fixes
- ✅ **Maintenance:** Daily, weekly, monthly tasks
- ✅ **Compliance:** GDPR, PCI-DSS, SOC 2

**Remember:**
- Monitor daily
- Back up daily
- Test restores monthly
- Rotate secrets quarterly
- Plan for growth continuously

---

**Document Version:** 1.0
**Last Updated:** 2026-01-07
**Maintained By:** DevOps Team
**Next Review:** 2026-04-07