Bakery-IA Production Operations Guide

Complete guide for operating, monitoring, and maintaining the production environment

Last Updated: 2026-01-07
Target Audience: DevOps, SRE, System Administrators
Security Grade: A-


Table of Contents

  1. Overview
  2. Monitoring & Observability
  3. CI/CD Operations
  4. Security Operations
  5. Database Management
  6. Backup & Recovery
  7. Performance Optimization
  8. Scaling Operations
  9. Incident Response
  10. Maintenance Tasks
  11. Compliance & Audit

Overview

Production Environment

Infrastructure:

  • Platform: MicroK8s on Ubuntu 22.04 LTS
  • Services: 18 microservices, 14 databases, monitoring stack
  • Capacity: 10-tenant pilot (scalable to 100+)
  • Security: TLS encryption, RBAC, audit logging
  • Monitoring: Prometheus, Grafana, AlertManager, SigNoz
  • CI/CD: Tekton Pipelines, Gitea, Flux CD (GitOps)
  • Email: Mailu (integrated email server)

Key Metrics (10-tenant baseline):

  • Uptime Target: 99.5% (3.65 hours downtime/month)
  • Response Time: <2s average API response
  • Error Rate: <1% of requests
  • Database Connections: ~200 concurrent
  • Memory Usage: 12-15 GB / 20 GB capacity
  • CPU Usage: 40-60% under normal load
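
The uptime target above translates directly into a monthly downtime budget; a quick sanity check (assuming a 730-hour month):

```shell
# Monthly downtime budget for a 99.5% uptime target (730-hour month)
awk 'BEGIN { target = 99.5; hours = 730 * (100 - target) / 100;
             printf "Allowed downtime: %.2f hours/month\n", hours }'
# → Allowed downtime: 3.65 hours/month
```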

Team Responsibilities

| Role | Responsibilities |
|------|------------------|
| DevOps Engineer | Deployment, infrastructure, scaling, CI/CD |
| SRE | Monitoring, incident response, performance |
| Security Admin | Access control, security patches, compliance |
| Database Admin | Backups, optimization, migrations |
| On-Call Engineer | 24/7 incident response (if applicable) |
| CI/CD Admin | Pipeline management, GitOps workflows |

Monitoring & Observability

Access Monitoring Dashboards

Production URLs:

https://monitoring.bakewise.ai/signoz        # SigNoz - Unified observability (PRIMARY)
https://monitoring.bakewise.ai/alertmanager  # AlertManager - Alert management

What is SigNoz? SigNoz is a comprehensive, open-source observability platform that provides:

  • Distributed Tracing - End-to-end request tracking across all microservices
  • Metrics Monitoring - Application and infrastructure metrics
  • Log Management - Centralized log aggregation with trace correlation
  • Service Performance Monitoring (SPM) - RED metrics (Rate, Error, Duration) from traces
  • Database Monitoring - All 14 PostgreSQL databases + Redis + RabbitMQ
  • Kubernetes Monitoring - Cluster, node, pod, and container metrics

Key SigNoz Dashboards and Features

1. Services Tab - APM Overview

What to Monitor:

  • Service List - All 18 microservices with health status
  • Request Rate - Requests per second per service
  • Error Rate - Percentage of failed requests (aim: <1%)
  • P50/P90/P99 Latency - Response time percentiles (aim: P99 <2s)
  • Operations - Breakdown by endpoint/operation

Red Flags:

  • Error rate >5% sustained
  • P99 latency >3s
  • Sudden drop in request rate (service might be down)
  • High latency on specific endpoints

How to Access:

  • Navigate to Services tab in SigNoz
  • Click on any service for detailed metrics
  • Use "Traces" tab to see sample requests

2. Traces Tab - Distributed Tracing

What to Monitor:

  • End-to-end request flows across microservices
  • Span duration - Time spent in each service
  • Database query performance - Auto-captured from SQLAlchemy
  • External API calls - Auto-captured from HTTPX
  • Error traces - Requests that failed with stack traces

Features:

  • Filter by service, operation, status code, duration
  • Search by trace ID or span ID
  • Correlate traces with logs
  • Identify slow database queries and N+1 problems

Red Flags:

  • Traces showing >10 database queries per request (N+1 issue)
  • External API calls taking >1s
  • Services with >500ms internal processing time
  • Error spans with exceptions

3. Dashboards Tab - Infrastructure Metrics

Pre-built Dashboards:

  • PostgreSQL Monitoring - All 14 databases
    • Active connections, transactions/sec, cache hit ratio
    • Slow queries, lock waits, replication lag
    • Database size, disk I/O
  • Redis Monitoring - Cache performance
    • Memory usage, hit rate, evictions
    • Commands/sec, latency
  • RabbitMQ Monitoring - Message queue health
    • Queue depth, message rates
    • Consumer status, connections
  • Kubernetes Cluster - Node and pod metrics
    • CPU, memory, disk, network per node
    • Pod resource utilization
    • Container restarts and OOM kills

Red Flags:

  • PostgreSQL: Cache hit ratio <80%, active connections >80% of max
  • Redis: Memory >90%, evictions increasing
  • RabbitMQ: Queue depth growing, no consumers
  • Kubernetes: CPU >85%, memory >90%, disk <20% free
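
Container restarts (including OOM kills) can also be spotted from the CLI; a sketch that sums restart counts per pod, with the noisiest pods first (the jq filter assumes the standard `kubectl get pods -o json` shape):

```shell
# Pods with the most container restarts first
kubectl get pods -n bakery-ia -o json | \
  jq -r '.items[] | "\(.metadata.name) \([.status.containerStatuses[]?.restartCount] | add // 0)"' | \
  sort -k2 -nr | head
```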

4. Logs Tab - Centralized Logging

Features:

  • Unified logs from all 18 microservices + databases
  • Trace correlation - Click on trace ID to see related logs
  • Kubernetes metadata - Auto-tagged with pod, namespace, container
  • Search and filter - By service, severity, time range, content
  • Log patterns - Automatically detect common patterns

What to Monitor:

  • Error and warning logs across all services
  • Database connection errors
  • Authentication failures
  • API request/response logs

Red Flags:

  • Increasing error logs
  • Repeated "connection refused" or "timeout" messages
  • Authentication failures (potential security issue)
  • Out of memory errors

5. Alerts Tab - Alert Management

Features:

  • Create alerts based on metrics, traces, or logs
  • Configure notification channels (email, Slack, webhook)
  • View firing alerts and alert history
  • Alert silencing and acknowledgment

Pre-configured Alerts (see SigNoz):

  • High error rate (>5% for 5 minutes)
  • High latency (P99 >3s for 5 minutes)
  • Service down (no requests for 2 minutes)
  • Database connection errors
  • High memory/CPU usage

Alert Severity Levels

| Severity | Response Time | Escalation | Examples |
|----------|---------------|------------|----------|
| Critical | Immediate | Page on-call | Service down, database unavailable |
| Warning | 30 minutes | Email team | High memory, slow queries |
| Info | Best effort | Email | Backup completed, cert renewal |

Common Alerts & Responses

Alert: ServiceDown

Severity: Critical
Meaning: A service has been down for >2 minutes
Response:
1. Check pod status: kubectl get pods -n bakery-ia
2. View logs: kubectl logs POD_NAME -n bakery-ia
3. Check recent deployments: kubectl rollout history
4. Restart if safe: kubectl rollout restart deployment/SERVICE_NAME
5. Rollback if needed: kubectl rollout undo deployment/SERVICE_NAME
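
The response steps above can be collected into a small triage script in the same style as `health-check.sh` (the `app=` label selector is an assumption; adjust to match the actual pod labels):

```shell
cat > ~/triage-service.sh <<'EOF'
#!/bin/bash
# Usage: ./triage-service.sh SERVICE_NAME
SVC="$1"
echo "=== Pods ==="
kubectl get pods -n bakery-ia | grep "$SVC"
echo "=== Recent logs ==="
kubectl logs -n bakery-ia deployment/"$SVC" --tail=50
echo "=== Rollout history ==="
kubectl rollout history deployment/"$SVC" -n bakery-ia
EOF
chmod +x ~/triage-service.sh
```

Run it first, then decide between `rollout restart` and `rollout undo` based on what the history shows.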

Alert: HighMemoryUsage

Severity: Warning
Meaning: Service using >80% of memory limit
Response:
1. Check which pods: kubectl top pods -n bakery-ia --sort-by=memory
2. Review memory trends in Grafana
3. Check for memory leaks in application logs
4. Consider increasing memory limits if sustained
5. Restart pod if memory leak suspected

Alert: DatabaseConnectionsHigh

Severity: Warning
Meaning: Database connections >80% of max
Response:
1. Identify which service: Check Grafana database dashboard
2. Look for connection leaks in application
3. Check for long-running transactions
4. Consider increasing max_connections
5. Restart service if connections not releasing
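
For step 3, long-running transactions can be listed straight from `pg_stat_activity`; a small helper in the style of the other scripts in this guide (defaults to `auth-db`; pass the database deployment that fired the alert):

```shell
cat > ~/check-long-tx.sh <<'EOF'
#!/bin/bash
# List the longest-open transactions on a database pod (default: auth-db)
DB="${1:-auth-db}"
kubectl exec -n bakery-ia deployment/"$DB" -- psql -U postgres -c \
  "SELECT pid, now() - xact_start AS xact_age, state, left(query, 60) AS query
   FROM pg_stat_activity
   WHERE xact_start IS NOT NULL
   ORDER BY xact_age DESC
   LIMIT 10;"
EOF
chmod +x ~/check-long-tx.sh
```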

Alert: CertificateExpiringSoon

Severity: Warning
Meaning: TLS certificate expires in <30 days
Response:
1. For Let's Encrypt: auto-renewal should handle it (verify cert-manager is healthy)
2. For internal certs: Regenerate and apply new certificates
3. See "Certificate Rotation" section below

Daily Monitoring Workflow with SigNoz

Morning Health Check (5 minutes)

  1. Open SigNoz Dashboard

    https://monitoring.bakewise.ai/signoz
    
  2. Check Services Tab:

    • Verify all 18 services are reporting metrics
    • Check error rate <1% for all services
    • Check P99 latency <2s for critical services
  3. Check Alerts Tab:

    • Review any firing alerts
    • Check for patterns (repeated alerts on same service)
    • Acknowledge or resolve as needed
  4. Quick Infrastructure Check:

    • Navigate to Dashboards → PostgreSQL
      • Verify all 14 databases are up
      • Check connection counts are healthy
    • Navigate to Dashboards → Redis
      • Check memory usage <80%
    • Navigate to Dashboards → Kubernetes
      • Verify node health, no OOM kills

Command-Line Health Check (Alternative)

# Quick health check command
cat > ~/health-check.sh <<'EOF'
#!/bin/bash
echo "=== Bakery-IA Health Check ==="
echo "Date: $(date)"
echo ""

echo "1. Pod Status:"
kubectl get pods -n bakery-ia | grep -vE "Running|Completed" || echo "✅ All pods healthy"
echo ""

echo "2. Resource Usage:"
kubectl top nodes
kubectl top pods -n bakery-ia --sort-by=memory | head -10
echo ""

echo "3. SigNoz Components:"
kubectl get pods -n bakery-ia -l app.kubernetes.io/instance=signoz
echo ""

echo "4. Recent Alerts (from SigNoz AlertManager):"
curl -s http://localhost:9093/api/v1/alerts 2>/dev/null | jq '.data[] | select(.status.state=="firing") | {alert: .labels.alertname, severity: .labels.severity}' | head -10
echo ""

echo "5. OTel Collector Health:"
kubectl exec -n bakery-ia deployment/signoz-otel-collector -- wget -qO- http://localhost:13133 >/dev/null 2>&1 && echo "✅ Health check endpoint responding" || echo "⚠️ Health check endpoint not responding"
echo ""

echo "=== End Health Check ==="
EOF

chmod +x ~/health-check.sh
./health-check.sh

Troubleshooting Common Issues

Issue: Service not showing in SigNoz

# Check if service is sending telemetry
kubectl logs -n bakery-ia deployment/SERVICE_NAME | grep -i "telemetry\|otel\|signoz"

# Check OTel Collector is receiving data
kubectl logs -n bakery-ia deployment/signoz-otel-collector | grep SERVICE_NAME

# Verify service has proper OTEL endpoints configured
kubectl exec -n bakery-ia deployment/SERVICE_NAME -- env | grep OTEL

Issue: No traces appearing

# Check tracing is enabled in service
kubectl exec -n bakery-ia deployment/SERVICE_NAME -- env | grep ENABLE_TRACING

# Verify OTel Collector gRPC endpoint is reachable
kubectl exec -n bakery-ia deployment/SERVICE_NAME -- nc -zv signoz-otel-collector 4317

Issue: Logs not appearing

# Check filelog receiver is working
kubectl logs -n bakery-ia deployment/signoz-otel-collector | grep filelog

# Check k8sattributes processor
kubectl logs -n bakery-ia deployment/signoz-otel-collector | grep k8sattributes

CI/CD Operations

CI/CD Infrastructure Overview

The platform includes a complete CI/CD pipeline using:

  • Gitea - Git server and container registry
  • Tekton - Pipeline automation
  • Flux CD - GitOps deployment

Access CI/CD Systems

SSH Access to Production VPS:

  • IP Address: 200.234.233.87
  • Access Method: SSH with key authentication
  • Command: ssh -i ~/.ssh/your_private_key.pem root@200.234.233.87

Gitea (Git Server):

Tekton Dashboard:

# Port forward to access Tekton dashboard
kubectl port-forward -n tekton-pipelines svc/tekton-dashboard 9097:9097
# Access at: http://localhost:9097

Flux Status:

# Check Flux status
flux check
kubectl get gitrepository -n flux-system
kubectl get kustomization -n flux-system

CI/CD Monitoring

Check pipeline status:

# List all PipelineRuns
kubectl get pipelineruns -n tekton-pipelines

# Check Tekton controller logs
kubectl logs -n tekton-pipelines -l app=tekton-pipelines-controller

# Check Tekton dashboard logs
kubectl logs -n tekton-pipelines -l app=tekton-dashboard

Monitor GitOps synchronization:

# Check GitRepository status
kubectl get gitrepository -n flux-system -o wide

# Check Kustomization status
kubectl get kustomization -n flux-system -o wide

# Get reconciliation history
kubectl get events -n flux-system --sort-by='.lastTimestamp'

CI/CD Troubleshooting

Pipeline not triggering:

# Check Gitea webhook logs
kubectl logs -n tekton-pipelines -l app=tekton-triggers-controller

# Verify EventListener pods are running
kubectl get pods -n tekton-pipelines -l app=tekton-triggers-eventlistener

# Check TriggerBinding configuration
kubectl get triggerbinding -n tekton-pipelines

Build failures:

# Check Kaniko logs for build errors
kubectl logs -n tekton-pipelines -l tekton.dev/task=kaniko-build

# Verify Dockerfile paths are correct
kubectl describe taskrun -n tekton-pipelines

Flux not applying changes:

# Check GitRepository status
kubectl describe gitrepository -n flux-system

# Check Kustomization reconciliation
kubectl describe kustomization -n flux-system

# Check Flux logs
kubectl logs -n flux-system -l app.kubernetes.io/name=helm-controller

CI/CD Maintenance Tasks

Daily Tasks:

  • Check for failed pipeline runs
  • Verify GitOps synchronization status
  • Clean up old PipelineRun resources
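
For the cleanup task, one way to prune old `PipelineRun` resources while keeping the most recent ones (the retention count of 20 is a suggestion; `tkn pipelinerun delete --keep N` does the same if the Tekton CLI is installed):

```shell
cat > ~/prune-pipelineruns.sh <<'EOF'
#!/bin/bash
# Delete all but the newest $KEEP PipelineRuns (sort is oldest-first)
KEEP=20
kubectl get pipelineruns -n tekton-pipelines \
  --sort-by=.metadata.creationTimestamp -o name | head -n -"$KEEP" | \
  xargs -r kubectl delete -n tekton-pipelines
EOF
chmod +x ~/prune-pipelineruns.sh
```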

Weekly Tasks:

  • Review pipeline performance metrics
  • Update pipeline definitions if needed
  • Rotate CI/CD secrets

Monthly Tasks:

  • Update Tekton and Flux versions
  • Review and optimize pipeline performance
  • Audit CI/CD access permissions

Security Operations

Security Posture Overview

Current Security Grade: A-

Implemented:

  • TLS 1.2+ encryption for all database connections
  • Let's Encrypt SSL for public endpoints
  • 32-character cryptographic passwords
  • JWT-based authentication
  • Tenant isolation at database and application level
  • Kubernetes secrets encryption at rest
  • PostgreSQL audit logging
  • RBAC (Role-Based Access Control)
  • Regular security updates

Access Control Management

User Roles

| Role | Permissions | Use Case |
|------|-------------|----------|
| Viewer | Read-only access | Dashboard viewing, reports |
| Member | Read + create/update | Day-to-day operations |
| Admin | Full operational access | Manage users, configure settings |
| Owner | Full control | Billing, tenant deletion |

Managing User Access

# View current users for a tenant (via API)
curl -H "Authorization: Bearer $ADMIN_TOKEN" \
  https://api.yourdomain.com/api/v1/tenants/TENANT_ID/users

# Promote user to admin
curl -X PATCH -H "Authorization: Bearer $OWNER_TOKEN" \
  -H "Content-Type: application/json" \
  https://api.yourdomain.com/api/v1/tenants/TENANT_ID/users/USER_ID \
  -d '{"role": "admin"}'

Security Checklist (Monthly)

  • Review audit logs for suspicious activity

    # Check failed login attempts
    kubectl logs -n bakery-ia deployment/auth-service | grep "authentication failed" | tail -50
    
    # Check unusual API calls
    kubectl logs -n bakery-ia deployment/gateway | grep -E "DELETE|admin" | tail -50
    
  • Verify all services using TLS

    # Check PostgreSQL SSL
    for db in $(kubectl get deploy -n bakery-ia -l app.kubernetes.io/component=database -o name); do
      echo "Checking $db"
      kubectl exec -n bakery-ia $db -- psql -U postgres -c "SHOW ssl;"
    done
    
  • Review and rotate passwords (every 90 days)

    # Generate new passwords
    openssl rand -base64 32  # For each service
    
    # Update secrets
    kubectl edit secret bakery-ia-secrets -n bakery-ia
    
    # Restart services to pick up new passwords
    kubectl rollout restart deployment -n bakery-ia
    
  • Check certificate expiry dates

    # Check Let's Encrypt certs
    kubectl get certificate -n bakery-ia
    
    # Check internal TLS certs (expire Oct 2028)
    kubectl exec -n bakery-ia deployment/auth-db -- \
      openssl x509 -in /tls/server-cert.pem -noout -dates
    
  • Review RBAC policies

    • Ensure least privilege principle
    • Remove access for departed team members
    • Audit admin/owner role assignments
  • Apply security updates

    # Update system packages on VPS
    ssh root@$VPS_IP "apt update && apt upgrade -y"
    
    # Update container images (rebuild with latest base images)
    docker-compose build --pull
    

Certificate Rotation

Let's Encrypt (Auto-Renewal)

Let's Encrypt certificates auto-renew via cert-manager. Verify:

# Check cert-manager is running
kubectl get pods -n cert-manager

# Check certificate status
kubectl describe certificate bakery-ia-prod-tls-cert -n bakery-ia

# Force renewal if needed (>30 days before expiry)
kubectl delete secret bakery-ia-prod-tls-cert -n bakery-ia
# cert-manager will automatically recreate

Internal TLS Certificates (Manual Rotation)

When: 90 days before October 2028 expiry

# 1. Generate new certificates (on local machine)
cd infrastructure/tls
./generate-certificates.sh

# 2. Update Kubernetes secrets
kubectl delete secret postgres-tls redis-tls -n bakery-ia

kubectl create secret generic postgres-tls \
  --from-file=server-cert.pem=postgres/server-cert.pem \
  --from-file=server-key.pem=postgres/server-key.pem \
  --from-file=ca-cert.pem=postgres/ca-cert.pem \
  -n bakery-ia

kubectl create secret generic redis-tls \
  --from-file=redis-cert.pem=redis/redis-cert.pem \
  --from-file=redis-key.pem=redis/redis-key.pem \
  --from-file=ca-cert.pem=redis/ca-cert.pem \
  -n bakery-ia

# 3. Restart database pods to pick up new certs
kubectl rollout restart deployment -n bakery-ia -l app.kubernetes.io/component=database
kubectl rollout restart deployment -n bakery-ia -l app.kubernetes.io/component=cache

# 4. Verify new certificates
kubectl exec -n bakery-ia deployment/auth-db -- \
  openssl x509 -in /tls/server-cert.pem -noout -dates

Database Management

Database Architecture

14 PostgreSQL Instances:

  • auth-db, tenant-db, training-db, forecasting-db, sales-db
  • external-db, notification-db, inventory-db, recipes-db
  • suppliers-db, pos-db, orders-db, production-db, alert-processor-db

1 Redis Instance: Shared caching and session storage

Database Health Monitoring

# Check all database pods
kubectl get pods -n bakery-ia -l app.kubernetes.io/component=database

# Check database resource usage
kubectl top pods -n bakery-ia -l app.kubernetes.io/component=database

# Check database connections
for db in $(kubectl get pods -n bakery-ia -l app.kubernetes.io/component=database -o name); do
  echo "=== $db ==="
  kubectl exec -n bakery-ia $db -- psql -U postgres -c \
    "SELECT datname, count(*) FROM pg_stat_activity GROUP BY datname;"
done

Common Database Operations

Connect to Database

# Connect to specific database
kubectl exec -n bakery-ia deployment/auth-db -it -- \
  psql -U auth_user -d auth_db

# Inside psql:
\dt              # List tables
\d+ table_name   # Describe table with details
\du              # List users
\l               # List databases
\q               # Quit

Check Database Size

kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c \
  "SELECT pg_database.datname,
   pg_size_pretty(pg_database_size(pg_database.datname)) AS size
   FROM pg_database;"

Analyze Slow Queries

# Enable slow query logging (already configured)
kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c \
  "SELECT query, mean_exec_time, calls
   FROM pg_stat_statements
   ORDER BY mean_exec_time DESC
   LIMIT 10;"

Check Database Locks

kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c \
  "SELECT blocked_locks.pid AS blocked_pid,
   blocking_locks.pid AS blocking_pid,
   blocked_activity.usename AS blocked_user,
   blocking_activity.usename AS blocking_user,
   blocked_activity.query AS blocked_statement
   FROM pg_catalog.pg_locks blocked_locks
   JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
   JOIN pg_catalog.pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype
   AND blocking_locks.relation = blocked_locks.relation
   JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
   WHERE NOT blocked_locks.granted;"

Database Optimization

Vacuum and Analyze

# Run on each database monthly
kubectl exec -n bakery-ia deployment/auth-db -- \
  psql -U auth_user -d auth_db -c "VACUUM ANALYZE;"

# For all databases (run as cron job)
cat > ~/vacuum-databases.sh <<'EOF'
#!/bin/bash
for db in $(kubectl get deploy -n bakery-ia -l app.kubernetes.io/component=database -o name); do
  echo "Vacuuming $db"
  kubectl exec -n bakery-ia $db -- psql -U postgres -c "VACUUM ANALYZE;"
done
EOF

chmod +x ~/vacuum-databases.sh
# Add to cron: 0 3 * * 0 (weekly at 3 AM)

Reindex (if performance degrades)

# Reindex specific database
kubectl exec -n bakery-ia deployment/auth-db -- \
  psql -U auth_user -d auth_db -c "REINDEX DATABASE auth_db;"

Backup & Recovery

Backup Strategy

Automated Daily Backups:

  • Frequency: Daily at 2 AM
  • Retention: 30 days rolling
  • Encryption: GPG encrypted
  • Storage: Local VPS (configure off-site for production)
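
For orientation, a minimal sketch matching the policy above — daily GPG-encrypted dumps with a 30-day rolling window. The GPG recipient and `/backups` path are placeholders; match them to the real `backup-databases.sh`:

```shell
cat > ~/backup-sketch.sh <<'EOF'
#!/bin/bash
# Sketch only: daily encrypted dumps with 30-day retention
# (GPG recipient and /backups path are placeholders)
DATE=$(date +%F)
mkdir -p "/backups/$DATE"
for db in $(kubectl get deploy -n bakery-ia -l app.kubernetes.io/component=database -o name); do
  name=$(basename "$db")
  kubectl exec -n bakery-ia "$db" -- pg_dumpall -U postgres | \
    gpg --encrypt --recipient ops@example.com > "/backups/$DATE/$name.sql.gpg"
done
# Drop backup directories older than 30 days
find /backups -maxdepth 1 -mtime +30 -type d -exec rm -rf {} +
EOF
chmod +x ~/backup-sketch.sh
```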

Backup Script (Already Configured)

# Script location: ~/backup-databases.sh
# Configured in: pilot launch guide

# Manual backup
./backup-databases.sh

# Verify backup
ls -lh /backups/

Backup Best Practices

  1. Test Restores Monthly

    # Restore to test database
    gunzip < /backups/2026-01-07/auth-db.sql.gz | \
      kubectl exec -i -n bakery-ia deployment/test-db -- \
      psql -U postgres test_db
    
  2. Off-Site Storage (Recommended)

    # Sync backups to S3 / Cloud Storage
    aws s3 sync /backups/ s3://bakery-ia-backups/ --delete
    
    # Or use rclone for any cloud provider
    rclone sync /backups/ remote:bakery-ia-backups
    
  3. Monitor Backup Success

    # Check last backup date
    ls -lt /backups/ | head -1
    
    # Set up alert if no backup in 25 hours
    
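
A cron-friendly check for item 3 (directory layout assumed to be one dated folder per day under `/backups`; 25 hours = 1500 minutes):

```shell
# Print an alert line if no backup directory was created in the last 25 hours
# (wire the output to mail/Slack as preferred)
find /backups -maxdepth 1 -type d -mmin -1500 2>/dev/null | grep -q . \
  || echo "ALERT: no backup in the last 25 hours"
```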

Recovery Procedures

Restore Single Database

# 1. Stop the service using the database
kubectl scale deployment auth-service -n bakery-ia --replicas=0

# 2. Drop and recreate database
kubectl exec -n bakery-ia deployment/auth-db -it -- \
  psql -U postgres -c "DROP DATABASE auth_db;"
kubectl exec -n bakery-ia deployment/auth-db -it -- \
  psql -U postgres -c "CREATE DATABASE auth_db OWNER auth_user;"

# 3. Restore from backup
gunzip < /backups/2026-01-07/auth-db.sql.gz | \
  kubectl exec -i -n bakery-ia deployment/auth-db -- \
  psql -U auth_user -d auth_db

# 4. Restart service
kubectl scale deployment auth-service -n bakery-ia --replicas=2

Disaster Recovery (Full System)

# 1. Provision new VPS (same specs)
# 2. Install MicroK8s (follow pilot launch guide)
# 3. Copy latest backup to new VPS
# 4. Deploy infrastructure and databases
kubectl apply -k infrastructure/environments/prod/k8s-manifests

# 5. Wait for databases to be ready
kubectl wait --for=condition=ready pod -l app.kubernetes.io/component=database -n bakery-ia

# 6. Restore all databases
for backup in /backups/latest/*.sql; do
  db_name=$(basename "$backup" .sql)
  echo "Restoring $db_name"
  # target database assumed to follow the <name>_db naming convention
  cat "$backup" | kubectl exec -i -n bakery-ia deployment/${db_name} -- \
    psql -U postgres -d "${db_name//-/_}"
done

# 7. Deploy services
kubectl apply -k infrastructure/environments/prod/k8s-manifests

# 8. Update DNS to point to new VPS
# 9. Verify all services healthy

Recovery Time Objective (RTO): 2-4 hours
Recovery Point Objective (RPO): 24 hours (last daily backup)


Performance Optimization

Identifying Performance Issues

# 1. Check overall resource usage
kubectl top nodes
kubectl top pods -n bakery-ia --sort-by=cpu
kubectl top pods -n bakery-ia --sort-by=memory

# 2. Check API response times in Grafana
# Go to "Services Overview" dashboard
# Look for P95/P99 latency spikes

# 3. Check database query performance
kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c \
  "SELECT query, calls, mean_exec_time, max_exec_time
   FROM pg_stat_statements
   ORDER BY mean_exec_time DESC
   LIMIT 20;"

# 4. Check for N+1 queries in application logs
kubectl logs -n bakery-ia deployment/orders-service | grep "SELECT"

Common Optimizations

1. Database Indexing

-- Find missing indexes
SELECT schemaname, tablename, attname, n_distinct, correlation
FROM pg_stats
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY abs(correlation) DESC;

-- Add index on frequently queried columns
CREATE INDEX CONCURRENTLY idx_orders_tenant_created
  ON orders(tenant_id, created_at DESC);

2. Connection Pooling

Already configured in services using SQLAlchemy. Verify settings:

# In shared/database/base.py
pool_size=5           # Adjust based on load
max_overflow=10       # Max additional connections
pool_timeout=30       # Connection timeout
pool_recycle=3600     # Recycle connections after 1 hour

3. Redis Caching

Increase cache for frequently accessed data:

# Cache user permissions (example, Flask-Caching style)
@cache.cached(timeout=300, key_prefix='user_perms')
def get_user_permissions(user_id):
    ...  # fetch from database and return the permission set

4. Query Optimization

-- Add EXPLAIN ANALYZE to slow queries
EXPLAIN ANALYZE SELECT * FROM orders WHERE tenant_id = '...';

-- Look for:
-- - Seq Scan (should use index scan)
-- - High execution time
-- - Missing indexes

Scaling Triggers

When to scale UP:

  • CPU usage >75% sustained for >1 hour
  • Memory usage >85% sustained
  • P95 API latency >3s
  • Database connection pool exhausted frequently
  • Error rate increasing

When to scale OUT (add replicas):

  • Request rate increasing significantly
  • Single service bottleneck identified
  • Need zero-downtime deployments
  • Geographic distribution needed

Scaling Operations

Vertical Scaling (Upgrade VPS)

# 1. Create backup
./backup-databases.sh

# 2. Plan upgrade window (requires brief downtime)
# Notify users: "Scheduled maintenance 2 AM - 3 AM"

# 3. At clouding.io, upgrade VPS
# RAM: 20 GB → 32 GB
# CPU: 8 cores → 12 cores
# (Usually instant, may require restart)

# 4. Verify after upgrade
kubectl top nodes
free -h
nproc

Horizontal Scaling (Add Replicas)

# Scale specific service
kubectl scale deployment orders-service -n bakery-ia --replicas=5

# Or update in kustomization for persistence
# Edit: infrastructure/environments/prod/k8s-manifests/kustomization.yaml
replicas:
  - name: orders-service
    count: 5

kubectl apply -k infrastructure/environments/prod/k8s-manifests

Auto-Scaling (HPA)

Already configured for:

  • orders-service (1-3 replicas)
  • forecasting-service (1-3 replicas)
  • notification-service (1-3 replicas)

# Check HPA status
kubectl get hpa -n bakery-ia

# Adjust thresholds if needed
kubectl edit hpa orders-service-hpa -n bakery-ia

Growth Path

| Tenants | Recommended Action |
|---------|--------------------|
| 10 | Current configuration (20GB RAM, 8 CPU) |
| 20 | Add replicas for critical services |
| 30 | Upgrade to 32GB RAM, 12 CPU |
| 50 | Consider database read replicas |
| 75 | Upgrade to 48GB RAM, 16 CPU |
| 100 | Plan multi-node cluster or managed K8s |
| 200+ | Migrate to managed services (EKS, GKE, AKS) |

Incident Response

Incident Severity Levels

| Level | Description | Response Time | Example |
|-------|-------------|---------------|---------|
| P0 | Complete outage | Immediate | All services down |
| P1 | Major degradation | 15 minutes | Database unavailable |
| P2 | Partial degradation | 1 hour | One service slow |
| P3 | Minor issue | 4 hours | Non-critical alert |

Incident Response Process

1. Detect & Alert

- Monitoring alerts trigger
- User reports issue
- Automated health checks fail
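
The automated check can be as simple as a cron'd probe against the public health endpoint (URL follows the `api.yourdomain.com` placeholder used earlier in this guide; the `/health` path is an assumption):

```shell
# Probe the gateway health endpoint; print an alert line on failure or non-2xx
curl -fsS -m 5 https://api.yourdomain.com/health > /dev/null \
  || echo "ALERT: health endpoint unreachable or returned an error"
```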

2. Assess & Communicate

# Quick assessment
./health-check.sh

# Determine severity
# P0/P1: Notify all stakeholders immediately
# P2/P3: Regular communication channels

3. Investigate

# Check pods
kubectl get pods -n bakery-ia

# Check recent events
kubectl get events -n bakery-ia --sort-by='.lastTimestamp' | tail -20

# Check logs
kubectl logs -n bakery-ia deployment/SERVICE_NAME --tail=100

# Check metrics
# View Grafana dashboards

4. Mitigate

# Common mitigations:

# Restart service
kubectl rollout restart deployment/SERVICE_NAME -n bakery-ia

# Rollback deployment
kubectl rollout undo deployment/SERVICE_NAME -n bakery-ia

# Scale up
kubectl scale deployment SERVICE_NAME -n bakery-ia --replicas=5

# Restart database
kubectl delete pod DB_POD_NAME -n bakery-ia

5. Resolve & Document

1. Verify issue resolved
2. Update incident log
3. Create post-mortem (for P0/P1)
4. Implement preventive measures

Common Incidents & Fixes

Incident: Database Connection Exhaustion

Symptoms: Services showing "connection pool exhausted" errors

Fix:

# 1. Identify leaking service
kubectl logs -n bakery-ia deployment/orders-service | grep "pool"

# 2. Restart leaking service
kubectl rollout restart deployment/orders-service -n bakery-ia

# 3. Increase max_connections if needed
kubectl exec -n bakery-ia deployment/orders-db -- \
  psql -U postgres -c "ALTER SYSTEM SET max_connections = 200;"
kubectl rollout restart deployment/orders-db -n bakery-ia

Incident: Out of Memory (OOMKilled)

Symptoms: Pods restarting with "OOMKilled" status

Fix:

# 1. Identify which pod
kubectl get pods -n bakery-ia | grep OOMKilled

# 2. Check resource limits
kubectl describe pod POD_NAME -n bakery-ia | grep -A 5 Limits

# 3. Increase memory limit
# Edit deployment: infrastructure/kubernetes/base/components/services/SERVICE.yaml
resources:
  limits:
    memory: "1Gi"  # Increased from 512Mi

# 4. Redeploy
kubectl apply -k infrastructure/environments/prod/k8s-manifests

Incident: Certificate Expired

Symptoms: SSL errors, services can't connect

Fix:

# For Let's Encrypt (should auto-renew):
kubectl delete secret bakery-ia-prod-tls-cert -n bakery-ia
# Wait for cert-manager to recreate

# For internal certs:
# Follow "Certificate Rotation" section above

Maintenance Tasks

Daily Tasks

# Run health check
./health-check.sh

# Check monitoring alerts
curl http://localhost:9090/api/v1/alerts | jq '.data.alerts[] | select(.state=="firing")'

# Verify backups ran
ls -lh /backups/ | head -5

Weekly Tasks

# Review resource trends
# Open Grafana, check 7-day trends

# Review error logs
kubectl logs -n bakery-ia deployment/gateway --since=168h | grep ERROR | wc -l

# Check disk usage
kubectl exec -n bakery-ia deployment/auth-db -- df -h

# Review security logs
kubectl logs -n bakery-ia deployment/auth-service --since=168h | grep "failed"

Monthly Tasks

  • Review and rotate passwords
  • Update security patches
  • Test backup restore
  • Review RBAC policies
  • Vacuum and analyze databases
  • Review and optimize slow queries
  • Check certificate expiry dates
  • Review resource allocation
  • Plan capacity for next quarter
  • Update documentation

Quarterly Tasks (Every 90 Days)

  • Full security audit
  • Disaster recovery drill
  • Performance testing
  • Cost optimization review
  • Update runbooks
  • Team training session
  • Review SLAs and metrics
  • Plan infrastructure upgrades

Annual Tasks

  • Penetration testing
  • Compliance audit (GDPR, PCI-DSS, SOC 2)
  • Full infrastructure review
  • Update security roadmap
  • Budget planning for next year
  • Technology stack review

Compliance & Audit

GDPR Compliance

Requirements Met:

  • Article 32: Encryption of personal data (TLS + pgcrypto)
  • Article 5(1)(f): Security of processing
  • Article 33: Breach detection (audit logs)
  • Article 17: Right to erasure (deletion endpoints)
  • Article 20: Right to data portability (export functionality)

Audit Tasks:

# Review audit logs for data access
kubectl logs -n bakery-ia deployment/tenant-service | grep "user_data_access"

# Verify encryption in use
kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c "SHOW ssl;"

# Check data retention policies
# Review automated cleanup jobs
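
Retention-related cleanup typically runs as Kubernetes CronJobs; listing them (and their recent runs) is a quick way to review what is actually scheduled:

```shell
# Scheduled cleanup/retention jobs and their most recent executions
kubectl get cronjobs -n bakery-ia
kubectl get jobs -n bakery-ia --sort-by=.status.startTime | tail -5
```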

PCI-DSS Compliance

Requirements Met:

  • Requirement 3.4: Transmission encryption (TLS 1.2+)
  • Requirement 3.5: Stored data protection (pgcrypto)
  • Requirement 10: Access tracking (audit logs)
  • Requirement 8: User authentication (JWT + MFA ready)

Audit Tasks:

# Verify no plaintext passwords
kubectl get secret bakery-ia-secrets -n bakery-ia -o jsonpath='{.data}' | grep -i "pass"

# Check encryption in transit
kubectl describe ingress -n bakery-ia | grep TLS

# Review access logs
kubectl logs -n bakery-ia deployment/auth-service | grep "login"

SOC 2 Compliance

Controls Met:

  • CC6.1: Access controls (RBAC)
  • CC6.6: Encryption in transit (TLS)
  • CC6.7: Encryption at rest (K8s secrets + pgcrypto)
  • CC7.2: Monitoring (Prometheus + Grafana)

Audit Log Retention

Current Policy:

  • Application logs: 30 days (stdout)
  • Database audit logs: 90 days
  • Security logs: 1 year
  • Backups: 30 days rolling

Extending Retention:

# Ship logs to external storage
# Example: Ship to S3 / CloudWatch / ELK

# For PostgreSQL audit logs, increase CSV log retention
kubectl exec -n bakery-ia deployment/auth-db -- \
  psql -U postgres -c "ALTER SYSTEM SET log_rotation_age = '90d';"

Quick Reference Commands

Emergency Commands

# Restart all services (minimal downtime with rolling update)
kubectl rollout restart deployment -n bakery-ia

# Restart specific service
kubectl rollout restart deployment/orders-service -n bakery-ia

# Rollback last deployment
kubectl rollout undo deployment/orders-service -n bakery-ia

# Scale up quickly
kubectl scale deployment orders-service -n bakery-ia --replicas=5

# Get pod status
kubectl get pods -n bakery-ia

# Get recent events
kubectl get events -n bakery-ia --sort-by='.lastTimestamp' | tail -20

# Get logs
kubectl logs -n bakery-ia deployment/SERVICE_NAME --tail=100 -f

Monitoring Commands

# Resource usage
kubectl top nodes
kubectl top pods -n bakery-ia --sort-by=cpu
kubectl top pods -n bakery-ia --sort-by=memory

# Check HPA
kubectl get hpa -n bakery-ia

# Check all resources
kubectl get all -n bakery-ia

# Check ingress
kubectl get ingress -n bakery-ia

# Check certificates
kubectl get certificate -n bakery-ia

Database Commands

# Connect to database
kubectl exec -n bakery-ia deployment/auth-db -it -- psql -U auth_user -d auth_db

# Check connections
kubectl exec -n bakery-ia deployment/auth-db -- \
  psql -U postgres -c "SELECT count(*) FROM pg_stat_activity;"

# Check database size
kubectl exec -n bakery-ia deployment/auth-db -- \
  psql -U postgres -c "SELECT pg_size_pretty(pg_database_size('auth_db'));"

# Vacuum database
kubectl exec -n bakery-ia deployment/auth-db -- \
  psql -U auth_user -d auth_db -c "VACUUM ANALYZE;"

Support Resources

Documentation:

External Resources:

Emergency Contacts:


Summary

This guide covers all aspects of operating the Bakery-IA platform in production:

  • Monitoring: Dashboards, alerts, metrics
  • Security: Access control, certificates, compliance
  • Databases: Management, optimization, backups
  • Recovery: Backup strategy, disaster recovery
  • Performance: Optimization techniques, scaling
  • Incidents: Response procedures, common fixes
  • Maintenance: Daily, weekly, monthly tasks
  • Compliance: GDPR, PCI-DSS, SOC 2

Remember:

  • Monitor daily
  • Back up daily
  • Test restores monthly
  • Rotate secrets quarterly
  • Plan for growth continuously

Document Version: 1.0
Last Updated: 2026-01-07
Maintained By: DevOps Team
Next Review: 2026-04-07