# Bakery-IA Production Operations Guide

**Complete guide for operating, monitoring, and maintaining the production environment**

**Last Updated:** 2026-01-07
**Target Audience:** DevOps, SRE, System Administrators
**Security Grade:** A-

---

## Table of Contents

1. [Overview](#overview)
2. [Monitoring & Observability](#monitoring--observability)
3. [CI/CD Operations](#cicd-operations)
4. [Security Operations](#security-operations)
5. [Database Management](#database-management)
6. [Backup & Recovery](#backup--recovery)
7. [Performance Optimization](#performance-optimization)
8. [Scaling Operations](#scaling-operations)
9. [Incident Response](#incident-response)
10. [Maintenance Tasks](#maintenance-tasks)
11. [Compliance & Audit](#compliance--audit)

---

## Overview

### Production Environment

**Infrastructure:**
- **Platform:** MicroK8s on Ubuntu 22.04 LTS
- **Services:** 18 microservices, 14 databases, monitoring stack
- **Capacity:** 10-tenant pilot (scalable to 100+)
- **Security:** TLS encryption, RBAC, audit logging
- **Monitoring:** Prometheus, Grafana, AlertManager, SigNoz
- **CI/CD:** Tekton Pipelines, Gitea, Flux CD (GitOps)
- **Email:** Mailu (integrated email server)

**Key Metrics (10-tenant baseline):**
- **Uptime Target:** 99.5% (3.65 hours downtime/month)
- **Response Time:** <2s average API response
- **Error Rate:** <1% of requests
- **Database Connections:** ~200 concurrent
- **Memory Usage:** 12-15 GB / 20 GB capacity
- **CPU Usage:** 40-60% under normal load

### Team Responsibilities

| Role | Responsibilities |
|------|------------------|
| **DevOps Engineer** | Deployment, infrastructure, scaling, CI/CD |
| **SRE** | Monitoring, incident response, performance |
| **Security Admin** | Access control, security patches, compliance |
| **Database Admin** | Backups, optimization, migrations |
| **On-Call Engineer** | 24/7 incident response (if applicable) |
| **CI/CD Admin** | Pipeline management, GitOps workflows |

---

## Monitoring & Observability

### Access Monitoring Dashboards

**Production URLs:**
```
https://monitoring.bakewise.ai/signoz        # SigNoz - Unified observability (PRIMARY)
https://monitoring.bakewise.ai/alertmanager  # AlertManager - Alert management
```

**What is SigNoz?**

SigNoz is a comprehensive, open-source observability platform that provides:
- **Distributed Tracing** - End-to-end request tracking across all microservices
- **Metrics Monitoring** - Application and infrastructure metrics
- **Log Management** - Centralized log aggregation with trace correlation
- **Service Performance Monitoring (SPM)** - RED metrics (Rate, Error, Duration) from traces
- **Database Monitoring** - All 14 PostgreSQL databases + Redis + RabbitMQ
- **Kubernetes Monitoring** - Cluster, node, pod, and container metrics

### Key SigNoz Dashboards and Features

#### 1. Services Tab - APM Overview

**What to Monitor:**
- **Service List** - All 18 microservices with health status
- **Request Rate** - Requests per second per service
- **Error Rate** - Percentage of failed requests (aim: <1%)
- **P50/P90/P99 Latency** - Response time percentiles (aim: P99 <2s)
- **Operations** - Breakdown by endpoint/operation

**Red Flags:**
- ❌ Error rate >5% sustained
- ❌ P99 latency >3s
- ❌ Sudden drop in request rate (service might be down)
- ❌ High latency on specific endpoints

**How to Access:**
- Navigate to `Services` tab in SigNoz
- Click on any service for detailed metrics
- Use "Traces" tab to see sample requests

#### 2. Traces Tab - Distributed Tracing

**What to Monitor:**
- **End-to-end request flows** across microservices
- **Span duration** - Time spent in each service
- **Database query performance** - Auto-captured from SQLAlchemy
- **External API calls** - Auto-captured from HTTPX
- **Error traces** - Requests that failed with stack traces

**Features:**
- Filter by service, operation, status code, duration
- Search by trace ID or span ID
- Correlate traces with logs
- Identify slow database queries and N+1 problems

**Red Flags:**
- ❌ Traces showing >10 database queries per request (N+1 issue)
- ❌ External API calls taking >1s
- ❌ Services with >500ms internal processing time
- ❌ Error spans with exceptions

#### 3. Dashboards Tab - Infrastructure Metrics

**Pre-built Dashboards:**
- **PostgreSQL Monitoring** - All 14 databases
  - Active connections, transactions/sec, cache hit ratio
  - Slow queries, lock waits, replication lag
  - Database size, disk I/O
- **Redis Monitoring** - Cache performance
  - Memory usage, hit rate, evictions
  - Commands/sec, latency
- **RabbitMQ Monitoring** - Message queue health
  - Queue depth, message rates
  - Consumer status, connections
- **Kubernetes Cluster** - Node and pod metrics
  - CPU, memory, disk, network per node
  - Pod resource utilization
  - Container restarts and OOM kills

**Red Flags:**
- ❌ PostgreSQL: Cache hit ratio <80%, active connections >80% of max
- ❌ Redis: Memory >90%, evictions increasing
- ❌ RabbitMQ: Queue depth growing, no consumers
- ❌ Kubernetes: CPU >85%, memory >90%, disk <20% free

#### 4. Logs Tab - Centralized Logging

**Features:**
- **Unified logs** from all 18 microservices + databases
- **Trace correlation** - Click on trace ID to see related logs
- **Kubernetes metadata** - Auto-tagged with pod, namespace, container
- **Search and filter** - By service, severity, time range, content
- **Log patterns** - Automatically detect common patterns

**What to Monitor:**
- Error and warning logs across all services
- Database connection errors
- Authentication failures
- API request/response logs

**Red Flags:**
- ❌ Increasing error logs
- ❌ Repeated "connection refused" or "timeout" messages
- ❌ Authentication failures (potential security issue)
- ❌ Out of memory errors

#### 5. Alerts Tab - Alert Management

**Features:**
- Create alerts based on metrics, traces, or logs
- Configure notification channels (email, Slack, webhook)
- View firing alerts and alert history
- Alert silencing and acknowledgment

**Pre-configured Alerts (see SigNoz):**
- High error rate (>5% for 5 minutes)
- High latency (P99 >3s for 5 minutes)
- Service down (no requests for 2 minutes)
- Database connection errors
- High memory/CPU usage

### Alert Severity Levels

| Severity | Response Time | Escalation | Examples |
|----------|---------------|------------|----------|
| **Critical** | Immediate | Page on-call | Service down, database unavailable |
| **Warning** | 30 minutes | Email team | High memory, slow queries |
| **Info** | Best effort | Email | Backup completed, cert renewal |

### Common Alerts & Responses

#### Alert: ServiceDown

```
Severity: Critical
Meaning: A service has been down for >2 minutes
Response:
1. Check pod status: kubectl get pods -n bakery-ia
2. View logs: kubectl logs POD_NAME -n bakery-ia
3. Check recent deployments: kubectl rollout history deployment/SERVICE_NAME -n bakery-ia
4. Restart if safe: kubectl rollout restart deployment/SERVICE_NAME -n bakery-ia
5. Rollback if needed: kubectl rollout undo deployment/SERVICE_NAME -n bakery-ia
```

#### Alert: HighMemoryUsage

```
Severity: Warning
Meaning: Service using >80% of memory limit
Response:
1. Check which pods: kubectl top pods -n bakery-ia --sort-by=memory
2. Review memory trends in Grafana
3. Check for memory leaks in application logs
4. Consider increasing memory limits if sustained
5. Restart pod if memory leak suspected
```

#### Alert: DatabaseConnectionsHigh

```
Severity: Warning
Meaning: Database connections >80% of max
Response:
1. Identify which service: Check Grafana database dashboard
2. Look for connection leaks in application
3. Check for long-running transactions
4. Consider increasing max_connections
5. Restart service if connections not releasing
```

#### Alert: CertificateExpiringSoon

```
Severity: Warning
Meaning: TLS certificate expires in <30 days
Response:
1. For Let's Encrypt: Auto-renewal should handle (verify cert-manager)
2. For internal certs: Regenerate and apply new certificates
3. See "Certificate Rotation" section below
```

### Daily Monitoring Workflow with SigNoz

#### Morning Health Check (5 minutes)

1. **Open SigNoz Dashboard**
   ```
   https://monitoring.bakewise.ai/signoz
   ```

2. **Check Services Tab:**
   - Verify all 18 services are reporting metrics
   - Check error rate <1% for all services
   - Check P99 latency <2s for critical services

3. **Check Alerts Tab:**
   - Review any firing alerts
   - Check for patterns (repeated alerts on same service)
   - Acknowledge or resolve as needed

4. **Quick Infrastructure Check:**
   - Navigate to Dashboards → PostgreSQL
     - Verify all 14 databases are up
     - Check connection counts are healthy
   - Navigate to Dashboards → Redis
     - Check memory usage <80%
   - Navigate to Dashboards → Kubernetes
     - Verify node health, no OOM kills

#### Command-Line Health Check (Alternative)

```bash
# Quick health check command
cat > ~/health-check.sh <<'EOF'
#!/bin/bash
echo "=== Bakery-IA Health Check ==="
echo "Date: $(date)"
echo ""

echo "1. Pod Status:"
kubectl get pods -n bakery-ia | grep -vE "Running|Completed" || echo "✅ All pods healthy"
echo ""

echo "2. Resource Usage:"
kubectl top nodes
kubectl top pods -n bakery-ia --sort-by=memory | head -10
echo ""

echo "3. SigNoz Components:"
kubectl get pods -n bakery-ia -l app.kubernetes.io/instance=signoz
echo ""

echo "4. Recent Alerts (from SigNoz AlertManager):"
curl -s http://localhost:9093/api/v1/alerts 2>/dev/null | jq '.data[] | select(.status.state=="firing") | {alert: .labels.alertname, severity: .labels.severity}' | head -10
echo ""

echo "5. OTel Collector Health:"
# Report success only when the health endpoint actually answers
kubectl exec -n bakery-ia deployment/signoz-otel-collector -- wget -qO- http://localhost:13133 >/dev/null 2>&1 \
  && echo "✅ Health check endpoint responding" \
  || echo "❌ Health check endpoint not responding"
echo ""

echo "=== End Health Check ==="
EOF

chmod +x ~/health-check.sh
~/health-check.sh
```

#### Troubleshooting Common Issues

**Issue: Service not showing in SigNoz**

```bash
# Check if service is sending telemetry
kubectl logs -n bakery-ia deployment/SERVICE_NAME | grep -i "telemetry\|otel\|signoz"

# Check OTel Collector is receiving data
kubectl logs -n bakery-ia deployment/signoz-otel-collector | grep SERVICE_NAME

# Verify service has proper OTEL endpoints configured
kubectl exec -n bakery-ia deployment/SERVICE_NAME -- env | grep OTEL
```

**Issue: No traces appearing**

```bash
# Check tracing is enabled in service
kubectl exec -n bakery-ia deployment/SERVICE_NAME -- env | grep ENABLE_TRACING

# Verify OTel Collector gRPC endpoint is reachable
kubectl exec -n bakery-ia deployment/SERVICE_NAME -- nc -zv signoz-otel-collector 4317
```

**Issue: Logs not appearing**

```bash
# Check filelog receiver is working
kubectl logs -n bakery-ia deployment/signoz-otel-collector | grep filelog

# Check k8sattributes processor
kubectl logs -n bakery-ia deployment/signoz-otel-collector | grep k8sattributes
```

---

## CI/CD Operations

### CI/CD Infrastructure Overview

The platform includes a complete CI/CD pipeline using:
- **Gitea** - Git server and container registry
- **Tekton** - Pipeline automation
- **Flux CD** - GitOps deployment

### Access CI/CD Systems

**SSH Access to Production VPS:**
- **IP Address:** `200.234.233.87`
- **Access Method:** SSH with key authentication
- **Command:** `ssh -i ~/.ssh/your_private_key.pem root@200.234.233.87`

**Gitea (Git Server):**
- URL: http://gitea.bakery-ia.local (development) or http://gitea.bakewise.ai (production)
- Admin panel: http://gitea.bakery-ia.local/admin

**Tekton Dashboard:**
```bash
# Port forward to access Tekton dashboard
kubectl port-forward -n tekton-pipelines svc/tekton-dashboard 9097:9097

# Access at: http://localhost:9097
```

**Flux Status:**
```bash
# Check Flux status
flux check
kubectl get gitrepository -n flux-system
kubectl get kustomization -n flux-system
```

### CI/CD Monitoring

**Check pipeline status:**
```bash
# List all PipelineRuns
kubectl get pipelineruns -n tekton-pipelines

# Check Tekton controller logs
kubectl logs -n tekton-pipelines -l app=tekton-pipelines-controller

# Check Tekton dashboard logs
kubectl logs -n tekton-pipelines -l app=tekton-dashboard
```

**Monitor GitOps synchronization:**
```bash
# Check GitRepository status
kubectl get gitrepository -n flux-system -o wide

# Check Kustomization status
kubectl get kustomization -n flux-system -o wide

# Get reconciliation history
kubectl get events -n flux-system --sort-by='.lastTimestamp'
```

### CI/CD Troubleshooting

**Pipeline not triggering:**
```bash
# Check Tekton Triggers controller logs (handles incoming Gitea webhooks)
kubectl logs -n tekton-pipelines -l app=tekton-triggers-controller

# Verify EventListener pods are running
kubectl get pods -n tekton-pipelines -l app=tekton-triggers-eventlistener

# Check TriggerBinding configuration
kubectl get triggerbinding -n tekton-pipelines
```

**Build failures:**
```bash
# Check Kaniko logs for build errors
kubectl logs -n tekton-pipelines -l tekton.dev/task=kaniko-build

# Verify Dockerfile paths are correct
kubectl describe taskrun -n tekton-pipelines
```

**Flux not applying changes:**
```bash
# Check GitRepository status
kubectl describe gitrepository -n flux-system

# Check Kustomization reconciliation
kubectl describe kustomization -n flux-system

# Check Flux logs (kustomize-controller applies Kustomizations,
# source-controller fetches the Git repository)
kubectl logs -n flux-system deployment/kustomize-controller
kubectl logs -n flux-system deployment/source-controller
```

### CI/CD Maintenance Tasks

**Daily Tasks:**
- [ ] Check for failed pipeline runs
- [ ] Verify GitOps synchronization status
- [ ] Clean up old PipelineRun resources

**Weekly Tasks:**
- [ ] Review pipeline performance metrics
- [ ] Update pipeline definitions if needed
- [ ] Rotate CI/CD secrets

**Monthly Tasks:**
- [ ] Update Tekton and Flux versions
- [ ] Review and optimize pipeline performance
- [ ] Audit CI/CD access permissions

---

## Security Operations

### Security Posture Overview

**Current Security Grade: A-**

**Implemented:**
- ✅ TLS 1.2+ encryption for all database connections
- ✅ Let's Encrypt SSL for public endpoints
- ✅ 32-character cryptographic passwords
- ✅ JWT-based authentication
- ✅ Tenant isolation at database and application level
- ✅ Kubernetes secrets encryption at rest
- ✅ PostgreSQL audit logging
- ✅ RBAC (Role-Based Access Control)
- ✅ Regular security updates

### Access Control Management

#### User Roles

| Role | Permissions | Use Case |
|------|-------------|----------|
| **Viewer** | Read-only access | Dashboard viewing, reports |
| **Member** | Read + create/update | Day-to-day operations |
| **Admin** | Full operational access | Manage users, configure settings |
| **Owner** | Full control | Billing, tenant deletion |

#### Managing User Access

```bash
# View current users for a tenant (via API)
curl -H "Authorization: Bearer $ADMIN_TOKEN" \
  https://api.yourdomain.com/api/v1/tenants/TENANT_ID/users

# Promote user to admin
curl -X PATCH -H "Authorization: Bearer $OWNER_TOKEN" \
  -H "Content-Type: application/json" \
  https://api.yourdomain.com/api/v1/tenants/TENANT_ID/users/USER_ID \
  -d '{"role": "admin"}'
```

### Security Checklist (Monthly)

- [ ] **Review audit logs for suspicious activity**
  ```bash
  # Check failed login attempts
  kubectl logs -n bakery-ia deployment/auth-service | grep "authentication failed" | tail -50

  # Check unusual API calls
  kubectl logs -n bakery-ia deployment/gateway | grep -E "DELETE|admin" | tail -50
  ```

- [ ] **Verify all services using TLS**
  ```bash
  # Check PostgreSQL SSL
  for db in $(kubectl get deploy -n bakery-ia -l app.kubernetes.io/component=database -o name); do
    echo "Checking $db"
    kubectl exec -n bakery-ia "$db" -- psql -U postgres -c "SHOW ssl;"
  done
  ```

- [ ] **Review and rotate passwords (every 90 days)**
  ```bash
  # Generate new passwords
  openssl rand -base64 32  # For each service

  # Update secrets
  kubectl edit secret bakery-ia-secrets -n bakery-ia

  # Restart services to pick up new passwords
  kubectl rollout restart deployment -n bakery-ia
  ```

- [ ] **Check certificate expiry dates**
  ```bash
  # Check Let's Encrypt certs
  kubectl get certificate -n bakery-ia

  # Check internal TLS certs (expire Oct 2028)
  kubectl exec -n bakery-ia deployment/auth-db -- \
    openssl x509 -in /tls/server-cert.pem -noout -dates
  ```

- [ ] **Review RBAC policies**
  - Ensure least privilege principle
  - Remove access for departed team members
  - Audit admin/owner role assignments

- [ ] **Apply security updates**
  ```bash
  # Update system packages on VPS
  ssh root@$VPS_IP "apt update && apt upgrade -y"

  # Update container images (rebuild with latest base images)
  docker-compose build --pull
  ```

### Certificate Rotation

#### Let's Encrypt (Auto-Renewal)

Let's Encrypt certificates auto-renew via cert-manager. Verify:

```bash
# Check cert-manager is running
kubectl get pods -n cert-manager

# Check certificate status
kubectl describe certificate bakery-ia-prod-tls-cert -n bakery-ia

# Force renewal if needed (>30 days before expiry)
kubectl delete secret bakery-ia-prod-tls-cert -n bakery-ia
# cert-manager will automatically recreate
```

#### Internal TLS Certificates (Manual Rotation)

**When:** 90 days before October 2028 expiry

```bash
# 1. Generate new certificates (on local machine)
cd infrastructure/tls
./generate-certificates.sh

# 2. Update Kubernetes secrets
kubectl delete secret postgres-tls redis-tls -n bakery-ia

kubectl create secret generic postgres-tls \
  --from-file=server-cert.pem=postgres/server-cert.pem \
  --from-file=server-key.pem=postgres/server-key.pem \
  --from-file=ca-cert.pem=postgres/ca-cert.pem \
  -n bakery-ia

kubectl create secret generic redis-tls \
  --from-file=redis-cert.pem=redis/redis-cert.pem \
  --from-file=redis-key.pem=redis/redis-key.pem \
  --from-file=ca-cert.pem=redis/ca-cert.pem \
  -n bakery-ia

# 3. Restart database pods to pick up new certs
kubectl rollout restart deployment -n bakery-ia -l app.kubernetes.io/component=database
kubectl rollout restart deployment -n bakery-ia -l app.kubernetes.io/component=cache

# 4. Verify new certificates
kubectl exec -n bakery-ia deployment/auth-db -- \
  openssl x509 -in /tls/server-cert.pem -noout -dates
```

---

## Database Management

### Database Architecture

**14 PostgreSQL Instances:**
- auth-db, tenant-db, training-db, forecasting-db, sales-db
- external-db, notification-db, inventory-db, recipes-db
- suppliers-db, pos-db, orders-db, production-db, alert-processor-db

**1 Redis Instance:** Shared caching and session storage

### Database Health Monitoring

```bash
# Check all database pods
kubectl get pods -n bakery-ia -l app.kubernetes.io/component=database

# Check database resource usage
kubectl top pods -n bakery-ia -l app.kubernetes.io/component=database

# Check database connections
for db in $(kubectl get pods -n bakery-ia -l app.kubernetes.io/component=database -o name); do
  echo "=== $db ==="
  kubectl exec -n bakery-ia "$db" -- psql -U postgres -c \
    "SELECT datname, count(*) FROM pg_stat_activity GROUP BY datname;"
done
```

### Common Database Operations

#### Connect to Database

```bash
# Connect to specific database
kubectl exec -n bakery-ia deployment/auth-db -it -- \
  psql -U auth_user -d auth_db

# Inside psql:
\dt             # List tables
\d+ table_name  # Describe table with details
\du             # List users
\l              # List databases
\q              # Quit
```

#### Check Database Size

```bash
kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c \
  "SELECT pg_database.datname,
          pg_size_pretty(pg_database_size(pg_database.datname)) AS size
   FROM pg_database;"
```

#### Analyze Slow Queries

```bash
# Top 10 queries by mean execution time (uses pg_stat_statements, already configured)
kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c \
  "SELECT query, mean_exec_time, calls
   FROM pg_stat_statements
   ORDER BY mean_exec_time DESC
   LIMIT 10;"
```

#### Check Database Locks

```bash
kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c \
  "SELECT blocked_locks.pid AS blocked_pid,
          blocking_locks.pid AS blocking_pid,
          blocked_activity.usename AS blocked_user,
          blocking_activity.usename AS blocking_user,
          blocked_activity.query AS blocked_statement
   FROM pg_catalog.pg_locks blocked_locks
   JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
   JOIN pg_catalog.pg_locks blocking_locks
     ON blocking_locks.locktype = blocked_locks.locktype
    AND blocking_locks.relation = blocked_locks.relation
   JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
   WHERE NOT blocked_locks.granted;"
```

### Database Optimization

#### Vacuum and Analyze

```bash
# Run on a single database
kubectl exec -n bakery-ia deployment/auth-db -- \
  psql -U auth_user -d auth_db -c "VACUUM ANALYZE;"

# For all databases (run as cron job)
cat > ~/vacuum-databases.sh <<'EOF'
#!/bin/bash
for db in $(kubectl get deploy -n bakery-ia -l app.kubernetes.io/component=database -o name); do
  echo "Vacuuming $db"
  kubectl exec -n bakery-ia "$db" -- psql -U postgres -c "VACUUM ANALYZE;"
done
EOF

chmod +x ~/vacuum-databases.sh
# Add to cron: 0 3 * * 0 (weekly, Sundays at 3 AM)
```

#### Reindex (if performance degrades)

```bash
# Reindex specific database
kubectl exec -n bakery-ia deployment/auth-db -- \
  psql -U auth_user -d auth_db -c "REINDEX DATABASE auth_db;"
```

---

## Backup & Recovery

### Backup Strategy

**Automated Daily Backups:**
- Frequency: Daily at 2 AM
- Retention: 30 days rolling
- Encryption: GPG encrypted
- Storage: Local VPS (configure off-site for production)

### Backup Script (Already Configured)

```bash
# Script location: ~/backup-databases.sh
# Configured in: pilot launch guide

# Manual backup
./backup-databases.sh

# Verify backup
ls -lh /backups/
```

### Backup Best Practices

1. **Test Restores Monthly**
   ```bash
   # Restore to test database
   # (example path; decrypt GPG-encrypted backups first and adapt to the
   # archive layout produced by backup-databases.sh)
   tar -xzOf /backups/2026-01-07.tar.gz | \
     kubectl exec -i -n bakery-ia deployment/test-db -- \
     psql -U postgres test_db
   ```

2. **Off-Site Storage (Recommended)**
   ```bash
   # Sync backups to S3 / Cloud Storage
   aws s3 sync /backups/ s3://bakery-ia-backups/ --delete

   # Or use rclone for any cloud provider
   rclone sync /backups/ remote:bakery-ia-backups
   ```

3. **Monitor Backup Success**
   ```bash
   # Check last backup date (skip the "total" header line printed by ls -l)
   ls -lt /backups/ | head -2

   # Set up alert if no backup in 25 hours
   ```

### Recovery Procedures

#### Restore Single Database

```bash
# 1. Stop the service using the database
kubectl scale deployment auth-service -n bakery-ia --replicas=0

# 2. Drop and recreate database
kubectl exec -n bakery-ia deployment/auth-db -- \
  psql -U postgres -c "DROP DATABASE auth_db;"
kubectl exec -n bakery-ia deployment/auth-db -- \
  psql -U postgres -c "CREATE DATABASE auth_db OWNER auth_user;"

# 3. Restore from backup (plain .sql dump; decrypt/decompress first if needed)
kubectl exec -i -n bakery-ia deployment/auth-db -- \
  psql -U auth_user -d auth_db < /backups/2026-01-07/auth-db.sql

# 4. Restart service
kubectl scale deployment auth-service -n bakery-ia --replicas=2
```

#### Disaster Recovery (Full System)

```bash
# 1. Provision new VPS (same specs)
# 2. Install MicroK8s (follow pilot launch guide)
# 3. Copy latest backup to new VPS

# 4. Deploy infrastructure and databases
kubectl apply -k infrastructure/environments/prod/k8s-manifests

# 5. Wait for databases to be ready (default timeout is only 30s)
kubectl wait --for=condition=ready pod \
  -l app.kubernetes.io/component=database -n bakery-ia --timeout=300s

# 6. Restore all databases
for backup in /backups/latest/*.sql; do
  db_name=$(basename "$backup" .sql)
  echo "Restoring $db_name"
  kubectl exec -i -n bakery-ia "deployment/${db_name}" -- \
    psql -U postgres < "$backup"
done

# 7. Deploy services
kubectl apply -k infrastructure/environments/prod/k8s-manifests

# 8. Update DNS to point to new VPS
# 9. Verify all services healthy
```

**Recovery Time Objective (RTO):** 2-4 hours
**Recovery Point Objective (RPO):** 24 hours (last daily backup)

---

## Performance Optimization

### Identifying Performance Issues

```bash
# 1. Check overall resource usage
kubectl top nodes
kubectl top pods -n bakery-ia --sort-by=cpu
kubectl top pods -n bakery-ia --sort-by=memory

# 2. Check API response times in Grafana
# Go to "Services Overview" dashboard
# Look for P95/P99 latency spikes

# 3. Check database query performance
kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c \
  "SELECT query, calls, mean_exec_time, max_exec_time
   FROM pg_stat_statements
   ORDER BY mean_exec_time DESC
   LIMIT 20;"

# 4. Check for N+1 queries in application logs
kubectl logs -n bakery-ia deployment/orders-service | grep "SELECT"
```

### Common Optimizations

#### 1. Database Indexing

```sql
-- Find tables with heavy sequential scans (candidates for a missing index)
SELECT schemaname, relname, seq_scan, seq_tup_read, idx_scan
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_tup_read DESC
LIMIT 20;

-- Add index on frequently queried columns
CREATE INDEX CONCURRENTLY idx_orders_tenant_created
  ON orders(tenant_id, created_at DESC);
```

#### 2. Connection Pooling

Already configured in services using SQLAlchemy. Verify settings:

```python
# In shared/database/base.py
pool_size=5        # Adjust based on load
max_overflow=10    # Max additional connections
pool_timeout=30    # Connection timeout (seconds)
pool_recycle=3600  # Recycle connections after 1 hour
```

#### 3. Redis Caching

Increase caching for frequently accessed data:

```python
# Cache user permissions (example, assuming a Flask-Caching style `cache` object).
# memoize() includes user_id in the cache key, so each user gets its own entry;
# cached() with a fixed key_prefix would share one entry across all users.
@cache.memoize(timeout=300)
def get_user_permissions(user_id):
    ...  # fetch from database
```

#### 4. Query Optimization

```sql
-- Add EXPLAIN ANALYZE to slow queries
EXPLAIN ANALYZE
SELECT * FROM orders WHERE tenant_id = '...';

-- Look for:
-- - Seq Scan (should use index scan)
-- - High execution time
-- - Missing indexes
```

### Scaling Triggers

**When to scale UP:**
- ❌ CPU usage >75% sustained for >1 hour
- ❌ Memory usage >85% sustained
- ❌ P95 API latency >3s
- ❌ Database connection pool exhausted frequently
- ❌ Error rate increasing

**When to scale OUT (add replicas):**
- ❌ Request rate increasing significantly
- ❌ Single service bottleneck identified
- ❌ Need zero-downtime deployments
- ❌ Geographic distribution needed

---

## Scaling Operations

### Vertical Scaling (Upgrade VPS)

```bash
# 1. Create backup
./backup-databases.sh

# 2. Plan upgrade window (requires brief downtime)
# Notify users: "Scheduled maintenance 2 AM - 3 AM"

# 3. At clouding.io, upgrade VPS
# RAM: 20 GB → 32 GB
# CPU: 8 cores → 12 cores
# (Usually instant, may require restart)

# 4. Verify after upgrade
kubectl top nodes
free -h
nproc
```

### Horizontal Scaling (Add Replicas)

```bash
# Scale specific service
kubectl scale deployment orders-service -n bakery-ia --replicas=5

# Or update in kustomization for persistence
# Edit: infrastructure/environments/prod/k8s-manifests/kustomization.yaml
#   replicas:
#     - name: orders-service
#       count: 5

kubectl apply -k infrastructure/environments/prod/k8s-manifests
```

### Auto-Scaling (HPA)

Already configured for:
- orders-service (1-3 replicas)
- forecasting-service (1-3 replicas)
- notification-service (1-3 replicas)

```bash
# Check HPA status
kubectl get hpa -n bakery-ia

# Adjust thresholds if needed
kubectl edit hpa orders-service-hpa -n bakery-ia
```

### Growth Path

| Tenants | Recommended Action |
|---------|-------------------|
| **10** | Current configuration (20GB RAM, 8 CPU) |
| **20** | Add replicas for critical services |
| **30** | Upgrade to 32GB RAM, 12 CPU |
| **50** | Consider database read replicas |
| **75** | Upgrade to 48GB RAM, 16 CPU |
| **100** | Plan multi-node cluster or managed K8s |
| **200+** | Migrate to managed services (EKS, GKE, AKS) |

---

## Incident Response

### Incident Severity Levels

| Level | Description | Response Time | Example |
|-------|-------------|---------------|---------|
| **P0** | Complete outage | Immediate | All services down |
| **P1** | Major degradation | 15 minutes | Database unavailable |
| **P2** | Partial degradation | 1 hour | One service slow |
| **P3** | Minor issue | 4 hours | Non-critical alert |

### Incident Response Process

#### 1. Detect & Alert

```
- Monitoring alerts trigger
- User reports issue
- Automated health checks fail
```

#### 2. Assess & Communicate

```bash
# Quick assessment
./health-check.sh

# Determine severity
# P0/P1: Notify all stakeholders immediately
# P2/P3: Regular communication channels
```

#### 3. Investigate

```bash
# Check pods
kubectl get pods -n bakery-ia

# Check recent events
kubectl get events -n bakery-ia --sort-by='.lastTimestamp' | tail -20

# Check logs
kubectl logs -n bakery-ia deployment/SERVICE_NAME --tail=100

# Check metrics
# View Grafana dashboards
```

#### 4. Mitigate

```bash
# Common mitigations:

# Restart service
kubectl rollout restart deployment/SERVICE_NAME -n bakery-ia

# Rollback deployment
kubectl rollout undo deployment/SERVICE_NAME -n bakery-ia

# Scale up
kubectl scale deployment SERVICE_NAME -n bakery-ia --replicas=5

# Restart database
kubectl delete pod DB_POD_NAME -n bakery-ia
```

#### 5. Resolve & Document

```
1. Verify issue resolved
2. Update incident log
3. Create post-mortem (for P0/P1)
4. Implement preventive measures
```

### Common Incidents & Fixes

#### Incident: Database Connection Exhaustion

**Symptoms:** Services showing "connection pool exhausted" errors

**Fix:**
```bash
# 1. Identify leaking service
kubectl logs -n bakery-ia deployment/orders-service | grep "pool"

# 2. Restart leaking service
kubectl rollout restart deployment/orders-service -n bakery-ia

# 3. Increase max_connections if needed (takes effect after the database restarts)
kubectl exec -n bakery-ia deployment/orders-db -- \
  psql -U postgres -c "ALTER SYSTEM SET max_connections = 200;"
kubectl rollout restart deployment/orders-db -n bakery-ia
```

#### Incident: Out of Memory (OOMKilled)

**Symptoms:** Pods restarting with "OOMKilled" status

**Fix:**
```bash
# 1. Identify which pod
kubectl get pods -n bakery-ia | grep OOMKilled

# 2. Check resource limits
kubectl describe pod POD_NAME -n bakery-ia | grep -A 5 Limits

# 3. Increase memory limit
# Edit deployment: infrastructure/kubernetes/base/components/services/SERVICE.yaml
#   resources:
#     limits:
#       memory: "1Gi"  # Increased from 512Mi

# 4. Redeploy
kubectl apply -k infrastructure/environments/prod/k8s-manifests
```

#### Incident: Certificate Expired

**Symptoms:** SSL errors, services can't connect

**Fix:**
```bash
# For Let's Encrypt (should auto-renew):
kubectl delete secret bakery-ia-prod-tls-cert -n bakery-ia
# Wait for cert-manager to recreate

# For internal certs:
# Follow "Certificate Rotation" section above
```

---

## Maintenance Tasks

### Daily Tasks

```bash
# Run health check
./health-check.sh

# Check monitoring alerts
curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts[] | select(.state=="firing")'

# Verify backups ran
ls -lh /backups/ | head -5
```

### Weekly Tasks

```bash
# Review resource trends
# Open Grafana, check 7-day trends

# Review error logs (kubectl --since accepts s/m/h units, so 7 days = 168h)
kubectl logs -n bakery-ia deployment/gateway --since=168h | grep ERROR | wc -l

# Check disk usage
kubectl exec -n bakery-ia deployment/auth-db -- df -h

# Review security logs
kubectl logs -n bakery-ia deployment/auth-service --since=168h | grep "failed"
```

### Monthly Tasks

- [ ] **Review and rotate passwords**
- [ ] **Update security patches**
- [ ] **Test backup restore**
- [ ] **Review RBAC policies**
- [ ] **Vacuum and analyze databases**
- [ ] **Review and optimize slow queries**
- [ ] **Check certificate expiry dates**
- [ ] **Review resource allocation**
- [ ] **Plan capacity for next quarter**
- [ ] **Update documentation**

### Quarterly Tasks (Every 90 Days)

- [ ] **Full security audit**
- [ ] **Disaster recovery drill**
- [ ] **Performance testing**
- [ ] **Cost optimization review**
- [ ] **Update runbooks**
- [ ] **Team training session**
- [ ] **Review SLAs and metrics**
- [ ] **Plan infrastructure upgrades**

### Annual Tasks

- [ ] **Penetration testing**
- [ ] **Compliance audit (GDPR, PCI-DSS, SOC 2)**
- [ ] **Full infrastructure review**
- [ ] **Update security roadmap**
- [ ] **Budget planning for next year**
- [ ] **Technology stack review**

---

## Compliance & Audit

### GDPR Compliance

**Requirements Met:**
- ✅ Article 32: Security of processing, including encryption of personal data (TLS + pgcrypto)
- ✅ Article 5(1)(f): Integrity and confidentiality
- ✅ Article 33: Breach notification, supported by breach detection (audit logs)
- ✅ Article 17: Right to erasure (deletion endpoints)
- ✅ Article 20: Right to data portability (export functionality)

**Audit Tasks:**
```bash
# Review audit logs for data access
kubectl logs -n bakery-ia deployment/tenant-service | grep "user_data_access"

# Verify encryption in use
kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c "SHOW ssl;"

# Check data retention policies
# Review automated cleanup jobs
```

### PCI-DSS Compliance

**Requirements Met:**
- ✅ Requirement 4: Transmission encryption (TLS 1.2+)
- ✅ Requirement 3.5: Stored data protection (pgcrypto)
- ✅ Requirement 10: Access tracking (audit logs)
- ✅ Requirement 8: User authentication (JWT + MFA ready)

**Audit Tasks:**
```bash
# Verify no plaintext passwords
kubectl get secret bakery-ia-secrets -n bakery-ia -o jsonpath='{.data}' | grep -i "pass"

# Check encryption in transit
kubectl describe ingress -n bakery-ia | grep TLS

# Review access logs
kubectl logs -n bakery-ia deployment/auth-service | grep "login"
```

### SOC 2 Compliance

**Controls Met:**
- ✅ CC6.1: Access controls (RBAC)
- ✅ CC6.6: Encryption in transit (TLS)
- ✅ CC6.7: Encryption at rest (K8s secrets + pgcrypto)
- ✅ CC7.2: Monitoring (Prometheus + Grafana)

### Audit Log Retention

**Current Policy:**
- Application logs: 30 days (stdout)
- Database audit logs: 90 days
- Security logs: 1 year
- Backups: 30 days rolling

**Extending Retention:**
```bash
# Ship logs to external storage
# Example: Ship to S3 / CloudWatch / ELK

# For PostgreSQL audit logs, adjust log rotation.
# Note: log_rotation_age sets how long a single log file is written before a new
# one is started; how long old files are kept also depends on log_filename and cleanup.
kubectl exec -n bakery-ia deployment/auth-db -- \
  psql -U postgres -c "ALTER SYSTEM SET log_rotation_age = '90d';"
```

---

## Quick Reference Commands

### Emergency Commands

```bash
# Restart all services (minimal downtime with rolling update)
kubectl rollout restart deployment -n bakery-ia

# Restart specific service
kubectl rollout restart deployment/orders-service -n bakery-ia

# Rollback last deployment
kubectl rollout undo deployment/orders-service -n bakery-ia

# Scale up quickly
kubectl scale deployment orders-service -n bakery-ia --replicas=5

# Get pod status
kubectl get pods -n bakery-ia

# Get recent events
kubectl get events -n bakery-ia --sort-by='.lastTimestamp' | tail -20

# Get logs
kubectl logs -n bakery-ia deployment/SERVICE_NAME --tail=100 -f
```

### Monitoring Commands

```bash
# Resource usage
kubectl top nodes
kubectl top pods -n bakery-ia --sort-by=cpu
kubectl top pods -n bakery-ia --sort-by=memory

# Check HPA
kubectl get hpa -n bakery-ia

# Check all resources
kubectl get all -n bakery-ia

# Check ingress
kubectl get ingress -n bakery-ia

# Check certificates
kubectl get certificate -n bakery-ia
```

### Database Commands

```bash
# Connect to database
kubectl exec -n bakery-ia deployment/auth-db -it -- psql -U auth_user -d auth_db

# Check connections
kubectl exec -n bakery-ia deployment/auth-db -- \
  psql -U postgres -c "SELECT count(*) FROM pg_stat_activity;"

# Check database size
kubectl exec -n bakery-ia deployment/auth-db -- \
  psql -U postgres -c "SELECT pg_size_pretty(pg_database_size('auth_db'));"

# Vacuum database
kubectl exec -n bakery-ia deployment/auth-db -- \
  psql -U auth_user -d auth_db -c "VACUUM ANALYZE;"
```

---

## Support Resources

**Documentation:**
- [Pilot Launch Guide](./PILOT_LAUNCH_GUIDE.md) - Initial deployment and setup
- [Security Checklist](./security-checklist.md) - Security procedures and compliance
- [Database Security](./database-security.md) - Database operations and best practices
- [TLS Configuration](./tls-configuration.md) - Certificate management
- [RBAC Implementation](./rbac-implementation.md) - Access control configuration
- [Monitoring Stack README](../infrastructure/kubernetes/base/components/monitoring/README.md) - Detailed monitoring documentation
- [CI/CD Infrastructure README](../infrastructure/cicd/README.md) - Gitea, Tekton, and Flux CD setup and operations
- [SigNoz Monitoring README](../infrastructure/monitoring/signoz/README.md) - SigNoz deployment and configuration

**External Resources:**
- Kubernetes: https://kubernetes.io/docs
- MicroK8s: https://microk8s.io/docs
- Prometheus: https://prometheus.io/docs
- Grafana: https://grafana.com/docs
- PostgreSQL: https://www.postgresql.org/docs

**Emergency Contacts:**
- DevOps Team: devops@bakewise.ai
- On-Call: oncall@bakewise.ai
- Security Team: security@bakewise.ai

---

## Summary

This guide covers all aspects of operating the Bakery-IA platform in production:

- ✅ **Monitoring:** Dashboards, alerts, metrics
- ✅ **Security:** Access control, certificates, compliance
- ✅ **Databases:** Management, optimization, backups
- ✅ **Recovery:** Backup strategy, disaster recovery
- ✅ **Performance:** Optimization techniques, scaling
- ✅ **Incidents:** Response procedures, common fixes
- ✅ **Maintenance:** Daily, weekly, monthly tasks
- ✅ **Compliance:** GDPR, PCI-DSS, SOC 2

**Remember:**
- Monitor daily
- Back up daily
- Test restores monthly
- Rotate secrets quarterly
- Plan for growth continuously

---

**Document Version:** 1.0
**Last Updated:** 2026-01-07
**Maintained By:** DevOps Team
**Next Review:** 2026-04-07