Bakery-IA Production Operations Guide
Complete guide for operating, monitoring, and maintaining production environment
Last Updated: 2026-01-07
Target Audience: DevOps, SRE, System Administrators
Security Grade: A-
Table of Contents
- Overview
- Monitoring & Observability
- Security Operations
- Database Management
- Backup & Recovery
- Performance Optimization
- Scaling Operations
- Incident Response
- Maintenance Tasks
- Compliance & Audit
Overview
Production Environment
Infrastructure:
- Platform: MicroK8s on Ubuntu 22.04 LTS
- Services: 18 microservices, 14 databases, monitoring stack
- Capacity: 10-tenant pilot (scalable to 100+)
- Security: TLS encryption, RBAC, audit logging
- Monitoring: Prometheus, Grafana, AlertManager, SigNoz
Key Metrics (10-tenant baseline):
- Uptime Target: 99.5% (3.65 hours downtime/month)
- Response Time: <2s average API response
- Error Rate: <1% of requests
- Database Connections: ~200 concurrent
- Memory Usage: 12-15 GB / 20 GB capacity
- CPU Usage: 40-60% under normal load
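As a sanity check on the uptime figure above: a 99.5% target against an average month of 730 hours (8,766 hours/year ÷ 12) yields the stated downtime budget:

```shell
# Downtime budget for a 99.5% uptime target over an average 730-hour month
awk 'BEGIN { printf "%.2f hours/month\n", 730 * (1 - 0.995) }'
# prints "3.65 hours/month"
```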
Team Responsibilities
| Role | Responsibilities |
|---|---|
| DevOps Engineer | Deployment, infrastructure, scaling |
| SRE | Monitoring, incident response, performance |
| Security Admin | Access control, security patches, compliance |
| Database Admin | Backups, optimization, migrations |
| On-Call Engineer | 24/7 incident response (if applicable) |
Monitoring & Observability
Access Monitoring Dashboards
Production URLs:
https://monitoring.bakewise.ai/signoz # SigNoz - Unified observability (PRIMARY)
https://monitoring.bakewise.ai/alertmanager # AlertManager - Alert management
What is SigNoz? SigNoz is a comprehensive, open-source observability platform that provides:
- Distributed Tracing - End-to-end request tracking across all microservices
- Metrics Monitoring - Application and infrastructure metrics
- Log Management - Centralized log aggregation with trace correlation
- Service Performance Monitoring (SPM) - RED metrics (Rate, Error, Duration) from traces
- Database Monitoring - All 14 PostgreSQL databases + Redis + RabbitMQ
- Kubernetes Monitoring - Cluster, node, pod, and container metrics
Port Forwarding (if ingress not available):
# SigNoz Frontend (Main UI)
kubectl port-forward -n bakery-ia svc/signoz 8080:8080
# SigNoz AlertManager
kubectl port-forward -n bakery-ia svc/signoz-alertmanager 9093:9093
# OTel Collector (for debugging)
kubectl port-forward -n bakery-ia svc/signoz-otel-collector 4317:4317 # gRPC
kubectl port-forward -n bakery-ia svc/signoz-otel-collector 4318:4318 # HTTP
Key SigNoz Dashboards and Features
1. Services Tab - APM Overview
What to Monitor:
- Service List - All 18 microservices with health status
- Request Rate - Requests per second per service
- Error Rate - Percentage of failed requests (aim: <1%)
- P50/P90/P99 Latency - Response time percentiles (aim: P99 <2s)
- Operations - Breakdown by endpoint/operation
Red Flags:
- ❌ Error rate >5% sustained
- ❌ P99 latency >3s
- ❌ Sudden drop in request rate (service might be down)
- ❌ High latency on specific endpoints
How to Access:
- Navigate to the Services tab in SigNoz
- Click on any service for detailed metrics
- Use the Traces tab to see sample requests
2. Traces Tab - Distributed Tracing
What to Monitor:
- End-to-end request flows across microservices
- Span duration - Time spent in each service
- Database query performance - Auto-captured from SQLAlchemy
- External API calls - Auto-captured from HTTPX
- Error traces - Requests that failed with stack traces
Features:
- Filter by service, operation, status code, duration
- Search by trace ID or span ID
- Correlate traces with logs
- Identify slow database queries and N+1 problems
Red Flags:
- ❌ Traces showing >10 database queries per request (N+1 issue)
- ❌ External API calls taking >1s
- ❌ Services with >500ms internal processing time
- ❌ Error spans with exceptions
3. Dashboards Tab - Infrastructure Metrics
Pre-built Dashboards:
- PostgreSQL Monitoring - All 14 databases
- Active connections, transactions/sec, cache hit ratio
- Slow queries, lock waits, replication lag
- Database size, disk I/O
- Redis Monitoring - Cache performance
- Memory usage, hit rate, evictions
- Commands/sec, latency
- RabbitMQ Monitoring - Message queue health
- Queue depth, message rates
- Consumer status, connections
- Kubernetes Cluster - Node and pod metrics
- CPU, memory, disk, network per node
- Pod resource utilization
- Container restarts and OOM kills
Red Flags:
- ❌ PostgreSQL: Cache hit ratio <80%, active connections >80% of max
- ❌ Redis: Memory >90%, evictions increasing
- ❌ RabbitMQ: Queue depth growing, no consumers
- ❌ Kubernetes: CPU >85%, memory >90%, disk <20% free
4. Logs Tab - Centralized Logging
Features:
- Unified logs from all 18 microservices + databases
- Trace correlation - Click on trace ID to see related logs
- Kubernetes metadata - Auto-tagged with pod, namespace, container
- Search and filter - By service, severity, time range, content
- Log patterns - Automatically detect common patterns
What to Monitor:
- Error and warning logs across all services
- Database connection errors
- Authentication failures
- API request/response logs
Red Flags:
- ❌ Increasing error logs
- ❌ Repeated "connection refused" or "timeout" messages
- ❌ Authentication failures (potential security issue)
- ❌ Out of memory errors
5. Alerts Tab - Alert Management
Features:
- Create alerts based on metrics, traces, or logs
- Configure notification channels (email, Slack, webhook)
- View firing alerts and alert history
- Alert silencing and acknowledgment
Pre-configured Alerts (see SigNoz):
- High error rate (>5% for 5 minutes)
- High latency (P99 >3s for 5 minutes)
- Service down (no requests for 2 minutes)
- Database connection errors
- High memory/CPU usage
Alert Severity Levels
| Severity | Response Time | Escalation | Examples |
|---|---|---|---|
| Critical | Immediate | Page on-call | Service down, database unavailable |
| Warning | 30 minutes | Email team | High memory, slow queries |
| Info | Best effort | None | Backup completed, cert renewal |
Common Alerts & Responses
Alert: ServiceDown
Severity: Critical
Meaning: A service has been down for >2 minutes
Response:
1. Check pod status: kubectl get pods -n bakery-ia
2. View logs: kubectl logs POD_NAME -n bakery-ia
3. Check recent deployments: kubectl rollout history
4. Restart if safe: kubectl rollout restart deployment/SERVICE_NAME
5. Rollback if needed: kubectl rollout undo deployment/SERVICE_NAME
Alert: HighMemoryUsage
Severity: Warning
Meaning: Service using >80% of memory limit
Response:
1. Check which pods: kubectl top pods -n bakery-ia --sort-by=memory
2. Review memory trends in Grafana
3. Check for memory leaks in application logs
4. Consider increasing memory limits if sustained
5. Restart pod if memory leak suspected
Alert: DatabaseConnectionsHigh
Severity: Warning
Meaning: Database connections >80% of max
Response:
1. Identify which service: Check Grafana database dashboard
2. Look for connection leaks in application
3. Check for long-running transactions
4. Consider increasing max_connections
5. Restart service if connections not releasing
Alert: CertificateExpiringSoon
Severity: Warning
Meaning: TLS certificate expires in <30 days
Response:
1. For Let's Encrypt: Auto-renewal should handle (verify cert-manager)
2. For internal certs: Regenerate and apply new certificates
3. See "Certificate Rotation" section below
Daily Monitoring Workflow with SigNoz
Morning Health Check (5 minutes)
1. Open the SigNoz dashboard: https://monitoring.bakewise.ai/signoz
2. Check the Services tab:
- Verify all 18 services are reporting metrics
- Check error rate <1% for all services
- Check P99 latency <2s for critical services
3. Check the Alerts tab:
- Review any firing alerts
- Check for patterns (repeated alerts on the same service)
- Acknowledge or resolve as needed
4. Quick infrastructure check:
- Navigate to Dashboards → PostgreSQL: verify all 14 databases are up and connection counts are healthy
- Navigate to Dashboards → Redis: check memory usage <80%
- Navigate to Dashboards → Kubernetes: verify node health, no OOM kills
Command-Line Health Check (Alternative)
# Quick health check command
cat > ~/health-check.sh <<'EOF'
#!/bin/bash
echo "=== Bakery-IA Health Check ==="
echo "Date: $(date)"
echo ""
echo "1. Pod Status:"
kubectl get pods -n bakery-ia | grep -vE "Running|Completed" || echo "✅ All pods healthy"
echo ""
echo "2. Resource Usage:"
kubectl top nodes
kubectl top pods -n bakery-ia --sort-by=memory | head -10
echo ""
echo "3. SigNoz Components:"
kubectl get pods -n bakery-ia -l app.kubernetes.io/instance=signoz
echo ""
echo "4. Recent Alerts (from SigNoz AlertManager):"
curl -s http://localhost:9093/api/v1/alerts 2>/dev/null | jq '.data[] | select(.status.state=="firing") | {alert: .labels.alertname, severity: .labels.severity}' | head -10
echo ""
echo "5. OTel Collector Health:"
kubectl exec -n bakery-ia deployment/signoz-otel-collector -- wget -qO- http://localhost:13133 2>/dev/null && echo "✅ Health check endpoint responding" || echo "❌ Health check endpoint not responding"
echo ""
echo "=== End Health Check ==="
EOF
chmod +x ~/health-check.sh
./health-check.sh
Troubleshooting Common Issues
Issue: Service not showing in SigNoz
# Check if service is sending telemetry
kubectl logs -n bakery-ia deployment/SERVICE_NAME | grep -i "telemetry\|otel\|signoz"
# Check OTel Collector is receiving data
kubectl logs -n bakery-ia deployment/signoz-otel-collector | grep SERVICE_NAME
# Verify service has proper OTEL endpoints configured
kubectl exec -n bakery-ia deployment/SERVICE_NAME -- env | grep OTEL
Issue: No traces appearing
# Check tracing is enabled in service
kubectl exec -n bakery-ia deployment/SERVICE_NAME -- env | grep ENABLE_TRACING
# Verify OTel Collector gRPC endpoint is reachable
kubectl exec -n bakery-ia deployment/SERVICE_NAME -- nc -zv signoz-otel-collector 4317
Issue: Logs not appearing
# Check filelog receiver is working
kubectl logs -n bakery-ia deployment/signoz-otel-collector | grep filelog
# Check k8sattributes processor
kubectl logs -n bakery-ia deployment/signoz-otel-collector | grep k8sattributes
Security Operations
Security Posture Overview
Current Security Grade: A-
Implemented:
- ✅ TLS 1.2+ encryption for all database connections
- ✅ Let's Encrypt SSL for public endpoints
- ✅ 32-character cryptographic passwords
- ✅ JWT-based authentication
- ✅ Tenant isolation at database and application level
- ✅ Kubernetes secrets encryption at rest
- ✅ PostgreSQL audit logging
- ✅ RBAC (Role-Based Access Control)
- ✅ Regular security updates
Access Control Management
User Roles
| Role | Permissions | Use Case |
|---|---|---|
| Viewer | Read-only access | Dashboard viewing, reports |
| Member | Read + create/update | Day-to-day operations |
| Admin | Full operational access | Manage users, configure settings |
| Owner | Full control | Billing, tenant deletion |
Managing User Access
# View current users for a tenant (via API)
curl -H "Authorization: Bearer $ADMIN_TOKEN" \
https://api.yourdomain.com/api/v1/tenants/TENANT_ID/users
# Promote user to admin
curl -X PATCH -H "Authorization: Bearer $OWNER_TOKEN" \
-H "Content-Type: application/json" \
https://api.yourdomain.com/api/v1/tenants/TENANT_ID/users/USER_ID \
-d '{"role": "admin"}'
Security Checklist (Monthly)
- Review audit logs for suspicious activity
# Check failed login attempts
kubectl logs -n bakery-ia deployment/auth-service | grep "authentication failed" | tail -50
# Check unusual API calls
kubectl logs -n bakery-ia deployment/gateway | grep -E "DELETE|admin" | tail -50
- Verify all services are using TLS
# Check PostgreSQL SSL
for db in $(kubectl get deploy -n bakery-ia -l app.kubernetes.io/component=database -o name); do
  echo "Checking $db"
  kubectl exec -n bakery-ia $db -- psql -U postgres -c "SHOW ssl;"
done
- Review and rotate passwords (every 90 days)
# Generate new passwords
openssl rand -base64 32  # For each service
# Update secrets
kubectl edit secret bakery-ia-secrets -n bakery-ia
# Restart services to pick up new passwords
kubectl rollout restart deployment -n bakery-ia
- Check certificate expiry dates
# Check Let's Encrypt certs
kubectl get certificate -n bakery-ia
# Check internal TLS certs (expire Oct 2028)
kubectl exec -n bakery-ia deployment/auth-db -- \
  openssl x509 -in /tls/server-cert.pem -noout -dates
- Review RBAC policies
  - Ensure the least-privilege principle
  - Remove access for departed team members
  - Audit admin/owner role assignments
- Apply security updates
# Update system packages on the VPS
ssh root@$VPS_IP "apt update && apt upgrade -y"
# Update container images (rebuild with latest base images)
docker-compose build --pull
Certificate Rotation
Let's Encrypt (Auto-Renewal)
Let's Encrypt certificates auto-renew via cert-manager. Verify:
# Check cert-manager is running
kubectl get pods -n cert-manager
# Check certificate status
kubectl describe certificate bakery-ia-prod-tls-cert -n bakery-ia
# Force renewal if needed (>30 days before expiry)
kubectl delete secret bakery-ia-prod-tls-cert -n bakery-ia
# cert-manager will automatically recreate
Internal TLS Certificates (Manual Rotation)
When: 90 days before October 2028 expiry
# 1. Generate new certificates (on local machine)
cd infrastructure/tls
./generate-certificates.sh
# 2. Update Kubernetes secrets
kubectl delete secret postgres-tls redis-tls -n bakery-ia
kubectl create secret generic postgres-tls \
--from-file=server-cert.pem=postgres/server-cert.pem \
--from-file=server-key.pem=postgres/server-key.pem \
--from-file=ca-cert.pem=postgres/ca-cert.pem \
-n bakery-ia
kubectl create secret generic redis-tls \
--from-file=redis-cert.pem=redis/redis-cert.pem \
--from-file=redis-key.pem=redis/redis-key.pem \
--from-file=ca-cert.pem=redis/ca-cert.pem \
-n bakery-ia
# 3. Restart database pods to pick up new certs
kubectl rollout restart deployment -n bakery-ia -l app.kubernetes.io/component=database
kubectl rollout restart deployment -n bakery-ia -l app.kubernetes.io/component=cache
# 4. Verify new certificates
kubectl exec -n bakery-ia deployment/auth-db -- \
openssl x509 -in /tls/server-cert.pem -noout -dates
Database Management
Database Architecture
14 PostgreSQL Instances:
- auth-db, tenant-db, training-db, forecasting-db, sales-db
- external-db, notification-db, inventory-db, recipes-db
- suppliers-db, pos-db, orders-db, production-db, alert-processor-db
1 Redis Instance: Shared caching and session storage
Database Health Monitoring
# Check all database pods
kubectl get pods -n bakery-ia -l app.kubernetes.io/component=database
# Check database resource usage
kubectl top pods -n bakery-ia -l app.kubernetes.io/component=database
# Check database connections
for db in $(kubectl get pods -n bakery-ia -l app.kubernetes.io/component=database -o name); do
echo "=== $db ==="
kubectl exec -n bakery-ia $db -- psql -U postgres -c \
"SELECT datname, count(*) FROM pg_stat_activity GROUP BY datname;"
done
Common Database Operations
Connect to Database
# Connect to specific database
kubectl exec -n bakery-ia deployment/auth-db -it -- \
psql -U auth_user -d auth_db
# Inside psql:
\dt # List tables
\d+ table_name # Describe table with details
\du # List users
\l # List databases
\q # Quit
Check Database Size
kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c \
"SELECT pg_database.datname,
pg_size_pretty(pg_database_size(pg_database.datname)) AS size
FROM pg_database;"
Analyze Slow Queries
# Enable slow query logging (already configured)
kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c \
"SELECT query, mean_exec_time, calls
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;"
Check Database Locks
kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c \
"SELECT blocked_locks.pid AS blocked_pid,
blocking_locks.pid AS blocking_pid,
blocked_activity.usename AS blocked_user,
blocking_activity.usename AS blocking_user,
blocked_activity.query AS blocked_statement
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype
AND blocking_locks.relation = blocked_locks.relation
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;"
Database Optimization
Vacuum and Analyze
# Run on each database monthly
kubectl exec -n bakery-ia deployment/auth-db -- \
psql -U auth_user -d auth_db -c "VACUUM ANALYZE;"
# For all databases (run as cron job)
cat > ~/vacuum-databases.sh <<'EOF'
#!/bin/bash
for db in $(kubectl get deploy -n bakery-ia -l app.kubernetes.io/component=database -o name); do
echo "Vacuuming $db"
kubectl exec -n bakery-ia $db -- psql -U postgres -c "VACUUM ANALYZE;"
done
EOF
chmod +x ~/vacuum-databases.sh
# Add to cron: 0 3 * * 0 (weekly at 3 AM)
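The cron comment above can be turned into an actual crontab entry; the log path shown here is an assumption, not part of the configured setup:

```shell
# Print the weekly cron entry (Sundays at 3 AM). To install it alongside
# existing entries, pipe into `crontab -`:
#   (crontab -l 2>/dev/null; echo "0 3 * * 0 ...") | crontab -
echo "0 3 * * 0 $HOME/vacuum-databases.sh >> $HOME/vacuum.log 2>&1"
```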
Reindex (if performance degrades)
# Reindex specific database
kubectl exec -n bakery-ia deployment/auth-db -- \
psql -U auth_user -d auth_db -c "REINDEX DATABASE auth_db;"
Backup & Recovery
Backup Strategy
Automated Daily Backups:
- Frequency: Daily at 2 AM
- Retention: 30 days rolling
- Encryption: GPG encrypted
- Storage: Local VPS (configure off-site for production)
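The "GPG encrypted" step can be sketched as a small helper; this is a minimal illustration, not the configured backup script, and the passphrase-file path and output filenames are assumptions:

```shell
#!/bin/bash
# Minimal sketch of a gzip + GPG (symmetric AES-256) encrypt step.
# Reads plaintext on stdin, writes an encrypted .gz.gpg file.
encrypt_backup() {
  local out=$1 passfile=$2
  gzip | gpg --batch --yes --pinentry-mode loopback --symmetric \
       --cipher-algo AES256 --passphrase-file "$passfile" --output "$out"
}
# Hypothetical usage:
# kubectl exec -n bakery-ia deployment/auth-db -- pg_dump -U postgres auth_db \
#   | encrypt_backup /backups/$(date +%F)/auth-db.sql.gz.gpg /root/.backup-pass
```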
Backup Script (Already Configured)
# Script location: ~/backup-databases.sh
# Configured in: pilot launch guide
# Manual backup
./backup-databases.sh
# Verify backup
ls -lh /backups/
Backup Best Practices
- Test restores monthly
# Restore to a test database
gunzip < /backups/2026-01-07.tar.gz | \
  kubectl exec -i -n bakery-ia deployment/test-db -- \
  psql -U postgres test_db
- Off-site storage (recommended)
# Sync backups to S3 / cloud storage
aws s3 sync /backups/ s3://bakery-ia-backups/ --delete
# Or use rclone for any cloud provider
rclone sync /backups/ remote:bakery-ia-backups
- Monitor backup success
# Check last backup date
ls -lt /backups/ | head -1
# Set up an alert if no backup has run in 25 hours
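The "no backup in 25 hours" check can be sketched as a cron-able script; the backup directory is assumed from the backup section, and GNU find is required:

```shell
#!/bin/bash
# Print an alert and exit non-zero if the newest file under the backup
# directory is older than 25 hours (or the directory is empty).
check_backup_freshness() {
  local dir=${1:-/backups} max_age=$(( 25 * 3600 ))
  local latest now
  latest=$(find "$dir" -type f -printf '%T@\n' 2>/dev/null | sort -rn | head -1)
  now=$(date +%s)
  if [ -z "$latest" ] || [ $(( now - ${latest%.*} )) -gt "$max_age" ]; then
    echo "ALERT: no backup in the last 25 hours"
    return 1
  fi
  echo "OK: latest backup is fresh"
}
```

Wired into cron, a non-zero exit can trigger whatever notification channel the team already uses.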
Recovery Procedures
Restore Single Database
# 1. Stop the service using the database
kubectl scale deployment auth-service -n bakery-ia --replicas=0
# 2. Drop and recreate database
kubectl exec -n bakery-ia deployment/auth-db -it -- \
psql -U postgres -c "DROP DATABASE auth_db;"
kubectl exec -n bakery-ia deployment/auth-db -it -- \
psql -U postgres -c "CREATE DATABASE auth_db OWNER auth_user;"
# 3. Restore from backup
gunzip < /backups/2026-01-07/auth-db.sql | \
kubectl exec -i -n bakery-ia deployment/auth-db -- \
psql -U auth_user -d auth_db
# 4. Restart service
kubectl scale deployment auth-service -n bakery-ia --replicas=2
Disaster Recovery (Full System)
# 1. Provision new VPS (same specs)
# 2. Install MicroK8s (follow pilot launch guide)
# 3. Copy latest backup to new VPS
# 4. Deploy infrastructure and databases
kubectl apply -k infrastructure/kubernetes/overlays/prod
# 5. Wait for databases to be ready
kubectl wait --for=condition=ready pod -l app.kubernetes.io/component=database -n bakery-ia
# 6. Restore all databases
for backup in /backups/latest/*.sql; do
db_name=$(basename $backup .sql)
echo "Restoring $db_name"
cat $backup | kubectl exec -i -n bakery-ia deployment/${db_name} -- \
psql -U postgres
done
# 7. Deploy services
kubectl apply -k infrastructure/kubernetes/overlays/prod
# 8. Update DNS to point to new VPS
# 9. Verify all services healthy
Recovery Time Objective (RTO): 2-4 hours
Recovery Point Objective (RPO): 24 hours (last daily backup)
Performance Optimization
Identifying Performance Issues
# 1. Check overall resource usage
kubectl top nodes
kubectl top pods -n bakery-ia --sort-by=cpu
kubectl top pods -n bakery-ia --sort-by=memory
# 2. Check API response times in Grafana
# Go to "Services Overview" dashboard
# Look for P95/P99 latency spikes
# 3. Check database query performance
kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c \
"SELECT query, calls, mean_exec_time, max_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 20;"
# 4. Check for N+1 queries in application logs
kubectl logs -n bakery-ia deployment/orders-service | grep "SELECT"
Common Optimizations
1. Database Indexing
-- Find missing indexes
SELECT schemaname, tablename, attname, n_distinct, correlation
FROM pg_stats
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY abs(correlation) DESC;
-- Add index on frequently queried columns
CREATE INDEX CONCURRENTLY idx_orders_tenant_created
ON orders(tenant_id, created_at DESC);
2. Connection Pooling
Already configured in services using SQLAlchemy. Verify settings:
# In shared/database/base.py
pool_size=5 # Adjust based on load
max_overflow=10 # Max additional connections
pool_timeout=30 # Connection timeout
pool_recycle=3600 # Recycle connections after 1 hour
3. Redis Caching
Increase cache for frequently accessed data:
# Cache user permissions (example)
@cache.cached(timeout=300, key_prefix='user_perms')
def get_user_permissions(user_id):
    ...  # fetch permissions from the database
4. Query Optimization
-- Add EXPLAIN ANALYZE to slow queries
EXPLAIN ANALYZE SELECT * FROM orders WHERE tenant_id = '...';
-- Look for:
-- - Seq Scan (should use index scan)
-- - High execution time
-- - Missing indexes
Scaling Triggers
When to scale UP:
- ❌ CPU usage >75% sustained for >1 hour
- ❌ Memory usage >85% sustained
- ❌ P95 API latency >3s
- ❌ Database connection pool exhausted frequently
- ❌ Error rate increasing
When to scale OUT (add replicas):
- ❌ Request rate increasing significantly
- ❌ Single service bottleneck identified
- ❌ Need zero-downtime deployments
- ❌ Geographic distribution needed
Scaling Operations
Vertical Scaling (Upgrade VPS)
# 1. Create backup
./backup-databases.sh
# 2. Plan upgrade window (requires brief downtime)
# Notify users: "Scheduled maintenance 2 AM - 3 AM"
# 3. At clouding.io, upgrade VPS
# RAM: 20 GB → 32 GB
# CPU: 8 cores → 12 cores
# (Usually instant, may require restart)
# 4. Verify after upgrade
kubectl top nodes
free -h
nproc
Horizontal Scaling (Add Replicas)
# Scale specific service
kubectl scale deployment orders-service -n bakery-ia --replicas=5
# Or update in kustomization for persistence
# Edit: infrastructure/kubernetes/overlays/prod/kustomization.yaml
replicas:
- name: orders-service
count: 5
kubectl apply -k infrastructure/kubernetes/overlays/prod
Auto-Scaling (HPA)
Already configured for:
- orders-service (1-3 replicas)
- forecasting-service (1-3 replicas)
- notification-service (1-3 replicas)
# Check HPA status
kubectl get hpa -n bakery-ia
# Adjust thresholds if needed
kubectl edit hpa orders-service-hpa -n bakery-ia
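For reference, a pre-configured HPA like the ones above might look as follows; the actual manifests live in the repo, so the name and thresholds here are illustrative, not the deployed values:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: orders-service-hpa
  namespace: bakery-ia
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: orders-service
  minReplicas: 1
  maxReplicas: 3
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # illustrative threshold
```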
Growth Path
| Tenants | Recommended Action |
|---|---|
| 10 | Current configuration (20GB RAM, 8 CPU) |
| 20 | Add replicas for critical services |
| 30 | Upgrade to 32GB RAM, 12 CPU |
| 50 | Consider database read replicas |
| 75 | Upgrade to 48GB RAM, 16 CPU |
| 100 | Plan multi-node cluster or managed K8s |
| 200+ | Migrate to managed services (EKS, GKE, AKS) |
Incident Response
Incident Severity Levels
| Level | Description | Response Time | Example |
|---|---|---|---|
| P0 | Complete outage | Immediate | All services down |
| P1 | Major degradation | 15 minutes | Database unavailable |
| P2 | Partial degradation | 1 hour | One service slow |
| P3 | Minor issue | 4 hours | Non-critical alert |
Incident Response Process
1. Detect & Alert
- Monitoring alerts trigger
- User reports issue
- Automated health checks fail
2. Assess & Communicate
# Quick assessment
./health-check.sh
# Determine severity
# P0/P1: Notify all stakeholders immediately
# P2/P3: Regular communication channels
3. Investigate
# Check pods
kubectl get pods -n bakery-ia
# Check recent events
kubectl get events -n bakery-ia --sort-by='.lastTimestamp' | tail -20
# Check logs
kubectl logs -n bakery-ia deployment/SERVICE_NAME --tail=100
# Check metrics
# View Grafana dashboards
4. Mitigate
# Common mitigations:
# Restart service
kubectl rollout restart deployment/SERVICE_NAME -n bakery-ia
# Rollback deployment
kubectl rollout undo deployment/SERVICE_NAME -n bakery-ia
# Scale up
kubectl scale deployment SERVICE_NAME -n bakery-ia --replicas=5
# Restart database
kubectl delete pod DB_POD_NAME -n bakery-ia
5. Resolve & Document
1. Verify issue resolved
2. Update incident log
3. Create post-mortem (for P0/P1)
4. Implement preventive measures
Common Incidents & Fixes
Incident: Database Connection Exhaustion
Symptoms: Services showing "connection pool exhausted" errors
Fix:
# 1. Identify leaking service
kubectl logs -n bakery-ia deployment/orders-service | grep "pool"
# 2. Restart leaking service
kubectl rollout restart deployment/orders-service -n bakery-ia
# 3. Increase max_connections if needed
kubectl exec -n bakery-ia deployment/orders-db -- \
psql -U postgres -c "ALTER SYSTEM SET max_connections = 200;"
kubectl rollout restart deployment/orders-db -n bakery-ia
Incident: Out of Memory (OOMKilled)
Symptoms: Pods restarting with "OOMKilled" status
Fix:
# 1. Identify which pod
kubectl get pods -n bakery-ia | grep OOMKilled
# 2. Check resource limits
kubectl describe pod POD_NAME -n bakery-ia | grep -A 5 Limits
# 3. Increase memory limit
# Edit deployment: infrastructure/kubernetes/base/components/services/SERVICE.yaml
resources:
limits:
memory: "1Gi" # Increased from 512Mi
# 4. Redeploy
kubectl apply -k infrastructure/kubernetes/overlays/prod
Incident: Certificate Expired
Symptoms: SSL errors, services can't connect
Fix:
# For Let's Encrypt (should auto-renew):
kubectl delete secret bakery-ia-prod-tls-cert -n bakery-ia
# Wait for cert-manager to recreate
# For internal certs:
# Follow "Certificate Rotation" section above
Maintenance Tasks
Daily Tasks
# Run health check
./health-check.sh
# Check monitoring alerts (requires SigNoz AlertManager port-forward on 9093)
curl -s http://localhost:9093/api/v1/alerts | jq '.data[] | select(.status.state=="firing")'
# Verify backups ran
ls -lh /backups/ | head -5
Weekly Tasks
# Review resource trends
# Open Grafana, check 7-day trends
# Review error logs
kubectl logs -n bakery-ia deployment/gateway --since=168h | grep ERROR | wc -l
# Check disk usage
kubectl exec -n bakery-ia deployment/auth-db -- df -h
# Review security logs
kubectl logs -n bakery-ia deployment/auth-service --since=168h | grep "failed"
Monthly Tasks
- Review and rotate passwords
- Update security patches
- Test backup restore
- Review RBAC policies
- Vacuum and analyze databases
- Review and optimize slow queries
- Check certificate expiry dates
- Review resource allocation
- Plan capacity for next quarter
- Update documentation
Quarterly Tasks (Every 90 Days)
- Full security audit
- Disaster recovery drill
- Performance testing
- Cost optimization review
- Update runbooks
- Team training session
- Review SLAs and metrics
- Plan infrastructure upgrades
Annual Tasks
- Penetration testing
- Compliance audit (GDPR, PCI-DSS, SOC 2)
- Full infrastructure review
- Update security roadmap
- Budget planning for next year
- Technology stack review
Compliance & Audit
GDPR Compliance
Requirements Met:
- ✅ Article 32: Encryption of personal data (TLS + pgcrypto)
- ✅ Article 5(1)(f): Security of processing
- ✅ Article 33: Breach detection (audit logs)
- ✅ Article 17: Right to erasure (deletion endpoints)
- ✅ Article 20: Right to data portability (export functionality)
Audit Tasks:
# Review audit logs for data access
kubectl logs -n bakery-ia deployment/tenant-service | grep "user_data_access"
# Verify encryption in use
kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c "SHOW ssl;"
# Check data retention policies
# Review automated cleanup jobs
PCI-DSS Compliance
Requirements Met:
- ✅ Requirement 3.4: Transmission encryption (TLS 1.2+)
- ✅ Requirement 3.5: Stored data protection (pgcrypto)
- ✅ Requirement 10: Access tracking (audit logs)
- ✅ Requirement 8: User authentication (JWT + MFA ready)
Audit Tasks:
# Verify no plaintext passwords
kubectl get secret bakery-ia-secrets -n bakery-ia -o jsonpath='{.data}' | grep -i "pass"
# Check encryption in transit
kubectl describe ingress -n bakery-ia | grep TLS
# Review access logs
kubectl logs -n bakery-ia deployment/auth-service | grep "login"
SOC 2 Compliance
Controls Met:
- ✅ CC6.1: Access controls (RBAC)
- ✅ CC6.6: Encryption in transit (TLS)
- ✅ CC6.7: Encryption at rest (K8s secrets + pgcrypto)
- ✅ CC7.2: Monitoring (Prometheus + Grafana)
Audit Log Retention
Current Policy:
- Application logs: 30 days (stdout)
- Database audit logs: 90 days
- Security logs: 1 year
- Backups: 30 days rolling
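The 30-day rolling backup retention can be enforced with a small prune step (backup path assumed from the backup section; GNU find):

```shell
#!/bin/bash
# Delete backup files older than N days (default 30) under a directory.
prune_old_backups() {
  local dir=${1:-/backups} days=${2:-30}
  find "$dir" -type f -mtime +"$days" -delete
}
```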
Extending Retention:
# Ship logs to external storage
# Example: Ship to S3 / CloudWatch / ELK
# For PostgreSQL audit logs, rotate the CSV logs less frequently
# (actual retention also depends on log_filename and any cleanup jobs)
kubectl exec -n bakery-ia deployment/auth-db -- \
  psql -U postgres -c "ALTER SYSTEM SET log_rotation_age = '90d';"
kubectl exec -n bakery-ia deployment/auth-db -- \
  psql -U postgres -c "SELECT pg_reload_conf();"
Quick Reference Commands
Emergency Commands
# Restart all services (minimal downtime with rolling update)
kubectl rollout restart deployment -n bakery-ia
# Restart specific service
kubectl rollout restart deployment/orders-service -n bakery-ia
# Rollback last deployment
kubectl rollout undo deployment/orders-service -n bakery-ia
# Scale up quickly
kubectl scale deployment orders-service -n bakery-ia --replicas=5
# Get pod status
kubectl get pods -n bakery-ia
# Get recent events
kubectl get events -n bakery-ia --sort-by='.lastTimestamp' | tail -20
# Get logs
kubectl logs -n bakery-ia deployment/SERVICE_NAME --tail=100 -f
Monitoring Commands
# Resource usage
kubectl top nodes
kubectl top pods -n bakery-ia --sort-by=cpu
kubectl top pods -n bakery-ia --sort-by=memory
# Check HPA
kubectl get hpa -n bakery-ia
# Check all resources
kubectl get all -n bakery-ia
# Check ingress
kubectl get ingress -n bakery-ia
# Check certificates
kubectl get certificate -n bakery-ia
Database Commands
# Connect to database
kubectl exec -n bakery-ia deployment/auth-db -it -- psql -U auth_user -d auth_db
# Check connections
kubectl exec -n bakery-ia deployment/auth-db -- \
psql -U postgres -c "SELECT count(*) FROM pg_stat_activity;"
# Check database size
kubectl exec -n bakery-ia deployment/auth-db -- \
psql -U postgres -c "SELECT pg_size_pretty(pg_database_size('auth_db'));"
# Vacuum database
kubectl exec -n bakery-ia deployment/auth-db -- \
psql -U auth_user -d auth_db -c "VACUUM ANALYZE;"
Support Resources
Documentation:
- Pilot Launch Guide - Initial deployment and setup
- Security Checklist - Security procedures and compliance
- Database Security - Database operations and best practices
- TLS Configuration - Certificate management
- RBAC Implementation - Access control configuration
- Monitoring Stack README - Detailed monitoring documentation
External Resources:
- Kubernetes: https://kubernetes.io/docs
- MicroK8s: https://microk8s.io/docs
- Prometheus: https://prometheus.io/docs
- Grafana: https://grafana.com/docs
- PostgreSQL: https://www.postgresql.org/docs
Emergency Contacts:
- DevOps Team: devops@bakewise.ai
- On-Call: oncall@bakewise.ai
- Security Team: security@bakewise.ai
Summary
This guide covers all aspects of operating the Bakery-IA platform in production:
✅ Monitoring: Dashboards, alerts, metrics
✅ Security: Access control, certificates, compliance
✅ Databases: Management, optimization, backups
✅ Recovery: Backup strategy, disaster recovery
✅ Performance: Optimization techniques, scaling
✅ Incidents: Response procedures, common fixes
✅ Maintenance: Daily, weekly, monthly tasks
✅ Compliance: GDPR, PCI-DSS, SOC 2
Remember:
- Monitor daily
- Back up daily
- Test restores monthly
- Rotate secrets quarterly
- Plan for growth continuously
Document Version: 1.0
Last Updated: 2026-01-07
Maintained By: DevOps Team
Next Review: 2026-04-07