Bakery-IA Production Operations Guide
Complete guide for operating, monitoring, and maintaining production environment
Last Updated: 2026-01-07
Target Audience: DevOps, SRE, System Administrators
Security Grade: A-
Table of Contents
- Overview
- Monitoring & Observability
- Security Operations
- Database Management
- Backup & Recovery
- Performance Optimization
- Scaling Operations
- Incident Response
- Maintenance Tasks
- Compliance & Audit
Overview
Production Environment
Infrastructure:
- Platform: MicroK8s on Ubuntu 22.04 LTS
- Services: 18 microservices, 14 databases, monitoring stack
- Capacity: 10-tenant pilot (scalable to 100+)
- Security: TLS encryption, RBAC, audit logging
- Monitoring: Prometheus, Grafana, AlertManager, SigNoz
Key Metrics (10-tenant baseline):
- Uptime Target: 99.5% (3.65 hours downtime/month)
- Response Time: <2s average API response
- Error Rate: <1% of requests
- Database Connections: ~200 concurrent
- Memory Usage: 12-15 GB / 20 GB capacity
- CPU Usage: 40-60% under normal load
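As a sanity check on the uptime figure above: a 99.5% target against an average month of 730 hours (8,766 hours/year ÷ 12) yields the stated downtime budget:

```shell
# Downtime budget for a 99.5% uptime target over an average 730-hour month
awk 'BEGIN { printf "%.2f hours/month\n", 730 * (1 - 0.995) }'
# prints "3.65 hours/month"
```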
Team Responsibilities
| Role | Responsibilities |
|---|---|
| DevOps Engineer | Deployment, infrastructure, scaling |
| SRE | Monitoring, incident response, performance |
| Security Admin | Access control, security patches, compliance |
| Database Admin | Backups, optimization, migrations |
| On-Call Engineer | 24/7 incident response (if applicable) |
Monitoring & Observability
Access Monitoring Dashboards
Production URLs:
https://monitoring.bakewise.ai/signoz # SigNoz - Unified observability (PRIMARY)
https://monitoring.bakewise.ai/alertmanager # AlertManager - Alert management
What is SigNoz? SigNoz is a comprehensive, open-source observability platform that provides:
- Distributed Tracing - End-to-end request tracking across all microservices
- Metrics Monitoring - Application and infrastructure metrics
- Log Management - Centralized log aggregation with trace correlation
- Service Performance Monitoring (SPM) - RED metrics (Rate, Error, Duration) from traces
- Database Monitoring - All 14 PostgreSQL databases + Redis + RabbitMQ
- Kubernetes Monitoring - Cluster, node, pod, and container metrics
Port Forwarding (if ingress not available):
# SigNoz Frontend (Main UI)
kubectl port-forward -n bakery-ia svc/signoz 8080:8080
# SigNoz AlertManager
kubectl port-forward -n bakery-ia svc/signoz-alertmanager 9093:9093
# OTel Collector (for debugging)
kubectl port-forward -n bakery-ia svc/signoz-otel-collector 4317:4317 # gRPC
kubectl port-forward -n bakery-ia svc/signoz-otel-collector 4318:4318 # HTTP
Key SigNoz Dashboards and Features
1. Services Tab - APM Overview
What to Monitor:
- Service List - All 18 microservices with health status
- Request Rate - Requests per second per service
- Error Rate - Percentage of failed requests (aim: <1%)
- P50/P90/P99 Latency - Response time percentiles (aim: P99 <2s)
- Operations - Breakdown by endpoint/operation
Red Flags:
- ❌ Error rate >5% sustained
- ❌ P99 latency >3s
- ❌ Sudden drop in request rate (service might be down)
- ❌ High latency on specific endpoints
How to Access:
- Navigate to the Services tab in SigNoz
- Click on any service for detailed metrics
- Use the Traces tab to see sample requests
2. Traces Tab - Distributed Tracing
What to Monitor:
- End-to-end request flows across microservices
- Span duration - Time spent in each service
- Database query performance - Auto-captured from SQLAlchemy
- External API calls - Auto-captured from HTTPX
- Error traces - Requests that failed with stack traces
Features:
- Filter by service, operation, status code, duration
- Search by trace ID or span ID
- Correlate traces with logs
- Identify slow database queries and N+1 problems
Red Flags:
- ❌ Traces showing >10 database queries per request (N+1 issue)
- ❌ External API calls taking >1s
- ❌ Services with >500ms internal processing time
- ❌ Error spans with exceptions
3. Dashboards Tab - Infrastructure Metrics
Pre-built Dashboards:
- PostgreSQL Monitoring - All 14 databases
- Active connections, transactions/sec, cache hit ratio
- Slow queries, lock waits, replication lag
- Database size, disk I/O
- Redis Monitoring - Cache performance
- Memory usage, hit rate, evictions
- Commands/sec, latency
- RabbitMQ Monitoring - Message queue health
- Queue depth, message rates
- Consumer status, connections
- Kubernetes Cluster - Node and pod metrics
- CPU, memory, disk, network per node
- Pod resource utilization
- Container restarts and OOM kills
Red Flags:
- ❌ PostgreSQL: Cache hit ratio <80%, active connections >80% of max
- ❌ Redis: Memory >90%, evictions increasing
- ❌ RabbitMQ: Queue depth growing, no consumers
- ❌ Kubernetes: CPU >85%, memory >90%, disk <20% free
4. Logs Tab - Centralized Logging
Features:
- Unified logs from all 18 microservices + databases
- Trace correlation - Click on trace ID to see related logs
- Kubernetes metadata - Auto-tagged with pod, namespace, container
- Search and filter - By service, severity, time range, content
- Log patterns - Automatically detect common patterns
What to Monitor:
- Error and warning logs across all services
- Database connection errors
- Authentication failures
- API request/response logs
Red Flags:
- ❌ Increasing error logs
- ❌ Repeated "connection refused" or "timeout" messages
- ❌ Authentication failures (potential security issue)
- ❌ Out of memory errors
5. Alerts Tab - Alert Management
Features:
- Create alerts based on metrics, traces, or logs
- Configure notification channels (email, Slack, webhook)
- View firing alerts and alert history
- Alert silencing and acknowledgment
Pre-configured Alerts (see SigNoz):
- High error rate (>5% for 5 minutes)
- High latency (P99 >3s for 5 minutes)
- Service down (no requests for 2 minutes)
- Database connection errors
- High memory/CPU usage
Alert Severity Levels
| Severity | Response Time | Escalation | Examples |
|---|---|---|---|
| Critical | Immediate | Page on-call | Service down, database unavailable |
| Warning | 30 minutes | Email team | High memory, slow queries |
| Info | Best effort | None | Backup completed, cert renewal |
Common Alerts & Responses
Alert: ServiceDown
Severity: Critical
Meaning: A service has been down for >2 minutes
Response:
1. Check pod status: kubectl get pods -n bakery-ia
2. View logs: kubectl logs POD_NAME -n bakery-ia
3. Check recent deployments: kubectl rollout history
4. Restart if safe: kubectl rollout restart deployment/SERVICE_NAME
5. Rollback if needed: kubectl rollout undo deployment/SERVICE_NAME
Alert: HighMemoryUsage
Severity: Warning
Meaning: Service using >80% of memory limit
Response:
1. Check which pods: kubectl top pods -n bakery-ia --sort-by=memory
2. Review memory trends in Grafana
3. Check for memory leaks in application logs
4. Consider increasing memory limits if sustained
5. Restart pod if memory leak suspected
Alert: DatabaseConnectionsHigh
Severity: Warning
Meaning: Database connections >80% of max
Response:
1. Identify which service: Check Grafana database dashboard
2. Look for connection leaks in application
3. Check for long-running transactions
4. Consider increasing max_connections
5. Restart service if connections not releasing
Alert: CertificateExpiringSoon
Severity: Warning
Meaning: TLS certificate expires in <30 days
Response:
1. For Let's Encrypt: Auto-renewal should handle (verify cert-manager)
2. For internal certs: Regenerate and apply new certificates
3. See "Certificate Rotation" section below
Daily Monitoring Workflow with SigNoz
Morning Health Check (5 minutes)
1. Open the SigNoz dashboard: https://monitoring.bakewise.ai/signoz
2. Check the Services tab:
- Verify all 18 services are reporting metrics
- Check error rate <1% for all services
- Check P99 latency <2s for critical services
3. Check the Alerts tab:
- Review any firing alerts
- Check for patterns (repeated alerts on the same service)
- Acknowledge or resolve as needed
4. Quick infrastructure check:
- Navigate to Dashboards → PostgreSQL: verify all 14 databases are up and connection counts are healthy
- Navigate to Dashboards → Redis: check memory usage <80%
- Navigate to Dashboards → Kubernetes: verify node health, no OOM kills
Command-Line Health Check (Alternative)
# Quick health check command
cat > ~/health-check.sh <<'EOF'
#!/bin/bash
echo "=== Bakery-IA Health Check ==="
echo "Date: $(date)"
echo ""
echo "1. Pod Status:"
kubectl get pods -n bakery-ia | grep -vE "Running|Completed" || echo "✅ All pods healthy"
echo ""
echo "2. Resource Usage:"
kubectl top nodes
kubectl top pods -n bakery-ia --sort-by=memory | head -10
echo ""
echo "3. SigNoz Components:"
kubectl get pods -n bakery-ia -l app.kubernetes.io/instance=signoz
echo ""
echo "4. Recent Alerts (from SigNoz AlertManager):"
curl -s http://localhost:9093/api/v1/alerts 2>/dev/null | jq '.data[] | select(.status.state=="firing") | {alert: .labels.alertname, severity: .labels.severity}' | head -10
echo ""
echo "5. OTel Collector Health:"
kubectl exec -n bakery-ia deployment/signoz-otel-collector -- wget -qO- http://localhost:13133 2>/dev/null && echo "✅ Health check endpoint responding" || echo "❌ Health check endpoint not responding"
echo ""
echo "=== End Health Check ==="
EOF
chmod +x ~/health-check.sh
./health-check.sh
Troubleshooting Common Issues
Issue: Service not showing in SigNoz
# Check if service is sending telemetry
kubectl logs -n bakery-ia deployment/SERVICE_NAME | grep -i "telemetry\|otel\|signoz"
# Check OTel Collector is receiving data
kubectl logs -n bakery-ia deployment/signoz-otel-collector | grep SERVICE_NAME
# Verify service has proper OTEL endpoints configured
kubectl exec -n bakery-ia deployment/SERVICE_NAME -- env | grep OTEL
Issue: No traces appearing
# Check tracing is enabled in service
kubectl exec -n bakery-ia deployment/SERVICE_NAME -- env | grep ENABLE_TRACING
# Verify OTel Collector gRPC endpoint is reachable
kubectl exec -n bakery-ia deployment/SERVICE_NAME -- nc -zv signoz-otel-collector 4317
Issue: Logs not appearing
# Check filelog receiver is working
kubectl logs -n bakery-ia deployment/signoz-otel-collector | grep filelog
# Check k8sattributes processor
kubectl logs -n bakery-ia deployment/signoz-otel-collector | grep k8sattributes
Security Operations
Security Posture Overview
Current Security Grade: A-
Implemented:
- ✅ TLS 1.2+ encryption for all database connections
- ✅ Let's Encrypt SSL for public endpoints
- ✅ 32-character cryptographic passwords
- ✅ JWT-based authentication
- ✅ Tenant isolation at database and application level
- ✅ Kubernetes secrets encryption at rest
- ✅ PostgreSQL audit logging
- ✅ RBAC (Role-Based Access Control)
- ✅ Regular security updates
Access Control Management
User Roles
| Role | Permissions | Use Case |
|---|---|---|
| Viewer | Read-only access | Dashboard viewing, reports |
| Member | Read + create/update | Day-to-day operations |
| Admin | Full operational access | Manage users, configure settings |
| Owner | Full control | Billing, tenant deletion |
Managing User Access
# View current users for a tenant (via API)
curl -H "Authorization: Bearer $ADMIN_TOKEN" \
https://api.yourdomain.com/api/v1/tenants/TENANT_ID/users
# Promote user to admin
curl -X PATCH -H "Authorization: Bearer $OWNER_TOKEN" \
-H "Content-Type: application/json" \
https://api.yourdomain.com/api/v1/tenants/TENANT_ID/users/USER_ID \
-d '{"role": "admin"}'
Security Checklist (Monthly)
- Review audit logs for suspicious activity
# Check failed login attempts
kubectl logs -n bakery-ia deployment/auth-service | grep "authentication failed" | tail -50
# Check unusual API calls
kubectl logs -n bakery-ia deployment/gateway | grep -E "DELETE|admin" | tail -50
- Verify all services are using TLS
# Check PostgreSQL SSL
for db in $(kubectl get deploy -n bakery-ia -l app.kubernetes.io/component=database -o name); do
  echo "Checking $db"
  kubectl exec -n bakery-ia $db -- psql -U postgres -c "SHOW ssl;"
done
- Review and rotate passwords (every 90 days)
# Generate new passwords
openssl rand -base64 32  # For each service
# Update secrets
kubectl edit secret bakery-ia-secrets -n bakery-ia
# Restart services to pick up new passwords
kubectl rollout restart deployment -n bakery-ia
- Check certificate expiry dates
# Check Let's Encrypt certs
kubectl get certificate -n bakery-ia
# Check internal TLS certs (expire Oct 2028)
kubectl exec -n bakery-ia deployment/auth-db -- \
  openssl x509 -in /tls/server-cert.pem -noout -dates
- Review RBAC policies
  - Ensure the least-privilege principle
  - Remove access for departed team members
  - Audit admin/owner role assignments
- Apply security updates
# Update system packages on the VPS
ssh root@$VPS_IP "apt update && apt upgrade -y"
# Update container images (rebuild with latest base images)
docker-compose build --pull
Certificate Rotation
Let's Encrypt (Auto-Renewal)
Let's Encrypt certificates auto-renew via cert-manager. Verify:
# Check cert-manager is running
kubectl get pods -n cert-manager
# Check certificate status
kubectl describe certificate bakery-ia-prod-tls-cert -n bakery-ia
# Force renewal if needed (>30 days before expiry)
kubectl delete secret bakery-ia-prod-tls-cert -n bakery-ia
# cert-manager will automatically recreate
Internal TLS Certificates (Manual Rotation)
When: 90 days before October 2028 expiry
# 1. Generate new certificates (on local machine)
cd infrastructure/tls
./generate-certificates.sh
# 2. Update Kubernetes secrets
kubectl delete secret postgres-tls redis-tls -n bakery-ia
kubectl create secret generic postgres-tls \
--from-file=server-cert.pem=postgres/server-cert.pem \
--from-file=server-key.pem=postgres/server-key.pem \
--from-file=ca-cert.pem=postgres/ca-cert.pem \
-n bakery-ia
kubectl create secret generic redis-tls \
--from-file=redis-cert.pem=redis/redis-cert.pem \
--from-file=redis-key.pem=redis/redis-key.pem \
--from-file=ca-cert.pem=redis/ca-cert.pem \
-n bakery-ia
# 3. Restart database pods to pick up new certs
kubectl rollout restart deployment -n bakery-ia -l app.kubernetes.io/component=database
kubectl rollout restart deployment -n bakery-ia -l app.kubernetes.io/component=cache
# 4. Verify new certificates
kubectl exec -n bakery-ia deployment/auth-db -- \
openssl x509 -in /tls/server-cert.pem -noout -dates
Database Management
Database Architecture
14 PostgreSQL Instances:
- auth-db, tenant-db, training-db, forecasting-db, sales-db
- external-db, notification-db, inventory-db, recipes-db
- suppliers-db, pos-db, orders-db, production-db, alert-processor-db
1 Redis Instance: Shared caching and session storage
Database Health Monitoring
# Check all database pods
kubectl get pods -n bakery-ia -l app.kubernetes.io/component=database
# Check database resource usage
kubectl top pods -n bakery-ia -l app.kubernetes.io/component=database
# Check database connections
for db in $(kubectl get pods -n bakery-ia -l app.kubernetes.io/component=database -o name); do
echo "=== $db ==="
kubectl exec -n bakery-ia $db -- psql -U postgres -c \
"SELECT datname, count(*) FROM pg_stat_activity GROUP BY datname;"
done
Common Database Operations
Connect to Database
# Connect to specific database
kubectl exec -n bakery-ia deployment/auth-db -it -- \
psql -U auth_user -d auth_db
# Inside psql:
\dt # List tables
\d+ table_name # Describe table with details
\du # List users
\l # List databases
\q # Quit
Check Database Size
kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c \
"SELECT pg_database.datname,
pg_size_pretty(pg_database_size(pg_database.datname)) AS size
FROM pg_database;"
Analyze Slow Queries
# Enable slow query logging (already configured)
kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c \
"SELECT query, mean_exec_time, calls
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;"
Check Database Locks
kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c \
"SELECT blocked_locks.pid AS blocked_pid,
blocking_locks.pid AS blocking_pid,
blocked_activity.usename AS blocked_user,
blocking_activity.usename AS blocking_user,
blocked_activity.query AS blocked_statement
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype
AND blocking_locks.relation = blocked_locks.relation
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;"
Database Optimization
Vacuum and Analyze
# Run on each database monthly
kubectl exec -n bakery-ia deployment/auth-db -- \
psql -U auth_user -d auth_db -c "VACUUM ANALYZE;"
# For all databases (run as cron job)
cat > ~/vacuum-databases.sh <<'EOF'
#!/bin/bash
for db in $(kubectl get deploy -n bakery-ia -l app.kubernetes.io/component=database -o name); do
echo "Vacuuming $db"
kubectl exec -n bakery-ia $db -- psql -U postgres -c "VACUUM ANALYZE;"
done
EOF
chmod +x ~/vacuum-databases.sh
# Add to cron: 0 3 * * 0 (weekly at 3 AM)
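The cron comment above can be turned into an actual crontab entry; the log path shown here is an assumption, not part of the configured setup:

```shell
# Print the weekly cron entry (Sundays at 3 AM). To install it alongside
# existing entries, pipe into `crontab -`:
#   (crontab -l 2>/dev/null; echo "0 3 * * 0 ...") | crontab -
echo "0 3 * * 0 $HOME/vacuum-databases.sh >> $HOME/vacuum.log 2>&1"
```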
Reindex (if performance degrades)
# Reindex specific database
kubectl exec -n bakery-ia deployment/auth-db -- \
psql -U auth_user -d auth_db -c "REINDEX DATABASE auth_db;"
Backup & Recovery
Backup Strategy
Automated Daily Backups:
- Frequency: Daily at 2 AM
- Retention: 30 days rolling
- Encryption: GPG encrypted
- Storage: Local VPS (configure off-site for production)
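The "GPG encrypted" step can be sketched as a small helper; this is a minimal illustration, not the configured backup script, and the passphrase-file path and output filenames are assumptions:

```shell
#!/bin/bash
# Minimal sketch of a gzip + GPG (symmetric AES-256) encrypt step.
# Reads plaintext on stdin, writes an encrypted .gz.gpg file.
encrypt_backup() {
  local out=$1 passfile=$2
  gzip | gpg --batch --yes --pinentry-mode loopback --symmetric \
       --cipher-algo AES256 --passphrase-file "$passfile" --output "$out"
}
# Hypothetical usage:
# kubectl exec -n bakery-ia deployment/auth-db -- pg_dump -U postgres auth_db \
#   | encrypt_backup /backups/$(date +%F)/auth-db.sql.gz.gpg /root/.backup-pass
```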
Backup Script (Already Configured)
# Script location: ~/backup-databases.sh
# Configured in: pilot launch guide
# Manual backup
./backup-databases.sh
# Verify backup
ls -lh /backups/
Backup Best Practices
- Test restores monthly
# Restore to a test database
gunzip < /backups/2026-01-07.tar.gz | \
  kubectl exec -i -n bakery-ia deployment/test-db -- \
  psql -U postgres test_db
- Off-site storage (recommended)
# Sync backups to S3 / cloud storage
aws s3 sync /backups/ s3://bakery-ia-backups/ --delete
# Or use rclone for any cloud provider
rclone sync /backups/ remote:bakery-ia-backups
- Monitor backup success
# Check last backup date
ls -lt /backups/ | head -1
# Set up an alert if no backup has run in 25 hours
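The "no backup in 25 hours" check can be sketched as a cron-able script; the backup directory is assumed from the backup section, and GNU find is required:

```shell
#!/bin/bash
# Print an alert and exit non-zero if the newest file under the backup
# directory is older than 25 hours (or the directory is empty).
check_backup_freshness() {
  local dir=${1:-/backups} max_age=$(( 25 * 3600 ))
  local latest now
  latest=$(find "$dir" -type f -printf '%T@\n' 2>/dev/null | sort -rn | head -1)
  now=$(date +%s)
  if [ -z "$latest" ] || [ $(( now - ${latest%.*} )) -gt "$max_age" ]; then
    echo "ALERT: no backup in the last 25 hours"
    return 1
  fi
  echo "OK: latest backup is fresh"
}
```

Wired into cron, a non-zero exit can trigger whatever notification channel the team already uses.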
Recovery Procedures
Restore Single Database
# 1. Stop the service using the database
kubectl scale deployment auth-service -n bakery-ia --replicas=0
# 2. Drop and recreate database
kubectl exec -n bakery-ia deployment/auth-db -it -- \
psql -U postgres -c "DROP DATABASE auth_db;"
kubectl exec -n bakery-ia deployment/auth-db -it -- \
psql -U postgres -c "CREATE DATABASE auth_db OWNER auth_user;"
# 3. Restore from backup
gunzip < /backups/2026-01-07/auth-db.sql | \
kubectl exec -i -n bakery-ia deployment/auth-db -- \
psql -U auth_user -d auth_db
# 4. Restart service
kubectl scale deployment auth-service -n bakery-ia --replicas=2
Disaster Recovery (Full System)
# 1. Provision new VPS (same specs)
# 2. Install MicroK8s (follow pilot launch guide)
# 3. Copy latest backup to new VPS
# 4. Deploy infrastructure and databases
kubectl apply -k infrastructure/kubernetes/overlays/prod
# 5. Wait for databases to be ready
kubectl wait --for=condition=ready pod -l app.kubernetes.io/component=database -n bakery-ia
# 6. Restore all databases
for backup in /backups/latest/*.sql; do
db_name=$(basename $backup .sql)
echo "Restoring $db_name"
cat $backup | kubectl exec -i -n bakery-ia deployment/${db_name} -- \
psql -U postgres
done
# 7. Deploy services
kubectl apply -k infrastructure/kubernetes/overlays/prod
# 8. Update DNS to point to new VPS
# 9. Verify all services healthy
Recovery Time Objective (RTO): 2-4 hours
Recovery Point Objective (RPO): 24 hours (last daily backup)
Performance Optimization
Identifying Performance Issues
# 1. Check overall resource usage
kubectl top nodes
kubectl top pods -n bakery-ia --sort-by=cpu
kubectl top pods -n bakery-ia --sort-by=memory
# 2. Check API response times in Grafana
# Go to "Services Overview" dashboard
# Look for P95/P99 latency spikes
# 3. Check database query performance
kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c \
"SELECT query, calls, mean_exec_time, max_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 20;"
# 4. Check for N+1 queries in application logs
kubectl logs -n bakery-ia deployment/orders-service | grep "SELECT"
Common Optimizations
1. Database Indexing
-- Find missing indexes
SELECT schemaname, tablename, attname, n_distinct, correlation
FROM pg_stats
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY abs(correlation) DESC;
-- Add index on frequently queried columns
CREATE INDEX CONCURRENTLY idx_orders_tenant_created
ON orders(tenant_id, created_at DESC);
2. Connection Pooling
Already configured in services using SQLAlchemy. Verify settings:
# In shared/database/base.py
pool_size=5 # Adjust based on load
max_overflow=10 # Max additional connections
pool_timeout=30 # Connection timeout
pool_recycle=3600 # Recycle connections after 1 hour
3. Redis Caching
Increase cache for frequently accessed data:
# Cache user permissions (example)
@cache.cached(timeout=300, key_prefix='user_perms')
def get_user_permissions(user_id):
    ...  # fetch permissions from the database
4. Query Optimization
-- Add EXPLAIN ANALYZE to slow queries
EXPLAIN ANALYZE SELECT * FROM orders WHERE tenant_id = '...';
-- Look for:
-- - Seq Scan (should use index scan)
-- - High execution time
-- - Missing indexes
Scaling Triggers
When to scale UP:
- ❌ CPU usage >75% sustained for >1 hour
- ❌ Memory usage >85% sustained
- ❌ P95 API latency >3s
- ❌ Database connection pool exhausted frequently
- ❌ Error rate increasing
When to scale OUT (add replicas):
- ❌ Request rate increasing significantly
- ❌ Single service bottleneck identified
- ❌ Need zero-downtime deployments
- ❌ Geographic distribution needed
Scaling Operations
Vertical Scaling (Upgrade VPS)
# 1. Create backup
./backup-databases.sh
# 2. Plan upgrade window (requires brief downtime)
# Notify users: "Scheduled maintenance 2 AM - 3 AM"
# 3. At clouding.io, upgrade VPS
# RAM: 20 GB → 32 GB
# CPU: 8 cores → 12 cores
# (Usually instant, may require restart)
# 4. Verify after upgrade
kubectl top nodes
free -h
nproc
Horizontal Scaling (Add Replicas)
# Scale specific service
kubectl scale deployment orders-service -n bakery-ia --replicas=5
# Or update in kustomization for persistence
# Edit: infrastructure/kubernetes/overlays/prod/kustomization.yaml
replicas:
- name: orders-service
count: 5
kubectl apply -k infrastructure/kubernetes/overlays/prod
Auto-Scaling (HPA)
Already configured for:
- orders-service (1-3 replicas)
- forecasting-service (1-3 replicas)
- notification-service (1-3 replicas)
# Check HPA status
kubectl get hpa -n bakery-ia
# Adjust thresholds if needed
kubectl edit hpa orders-service-hpa -n bakery-ia
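For reference, a pre-configured HPA like the ones above might look as follows; the actual manifests live in the repo, so the name and thresholds here are illustrative, not the deployed values:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: orders-service-hpa
  namespace: bakery-ia
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: orders-service
  minReplicas: 1
  maxReplicas: 3
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # illustrative threshold
```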
Growth Path
| Tenants | Recommended Action |
|---|---|
| 10 | Current configuration (20GB RAM, 8 CPU) |
| 20 | Add replicas for critical services |
| 30 | Upgrade to 32GB RAM, 12 CPU |
| 50 | Consider database read replicas |
| 75 | Upgrade to 48GB RAM, 16 CPU |
| 100 | Plan multi-node cluster or managed K8s |
| 200+ | Migrate to managed services (EKS, GKE, AKS) |
Incident Response
Incident Severity Levels
| Level | Description | Response Time | Example |
|---|---|---|---|
| P0 | Complete outage | Immediate | All services down |
| P1 | Major degradation | 15 minutes | Database unavailable |
| P2 | Partial degradation | 1 hour | One service slow |
| P3 | Minor issue | 4 hours | Non-critical alert |
Incident Response Process
1. Detect & Alert
- Monitoring alerts trigger
- User reports issue
- Automated health checks fail
2. Assess & Communicate
# Quick assessment
./health-check.sh
# Determine severity
# P0/P1: Notify all stakeholders immediately
# P2/P3: Regular communication channels
3. Investigate
# Check pods
kubectl get pods -n bakery-ia
# Check recent events
kubectl get events -n bakery-ia --sort-by='.lastTimestamp' | tail -20
# Check logs
kubectl logs -n bakery-ia deployment/SERVICE_NAME --tail=100
# Check metrics
# View Grafana dashboards
4. Mitigate
# Common mitigations:
# Restart service
kubectl rollout restart deployment/SERVICE_NAME -n bakery-ia
# Rollback deployment
kubectl rollout undo deployment/SERVICE_NAME -n bakery-ia
# Scale up
kubectl scale deployment SERVICE_NAME -n bakery-ia --replicas=5
# Restart database
kubectl delete pod DB_POD_NAME -n bakery-ia
5. Resolve & Document
1. Verify issue resolved
2. Update incident log
3. Create post-mortem (for P0/P1)
4. Implement preventive measures
Common Incidents & Fixes
Incident: Database Connection Exhaustion
Symptoms: Services showing "connection pool exhausted" errors
Fix:
# 1. Identify leaking service
kubectl logs -n bakery-ia deployment/orders-service | grep "pool"
# 2. Restart leaking service
kubectl rollout restart deployment/orders-service -n bakery-ia
# 3. Increase max_connections if needed
kubectl exec -n bakery-ia deployment/orders-db -- \
psql -U postgres -c "ALTER SYSTEM SET max_connections = 200;"
kubectl rollout restart deployment/orders-db -n bakery-ia
Incident: Out of Memory (OOMKilled)
Symptoms: Pods restarting with "OOMKilled" status
Fix:
# 1. Identify which pod
kubectl get pods -n bakery-ia | grep OOMKilled
# 2. Check resource limits
kubectl describe pod POD_NAME -n bakery-ia | grep -A 5 Limits
# 3. Increase memory limit
# Edit deployment: infrastructure/kubernetes/base/components/services/SERVICE.yaml
resources:
limits:
memory: "1Gi" # Increased from 512Mi
# 4. Redeploy
kubectl apply -k infrastructure/kubernetes/overlays/prod
Incident: Certificate Expired
Symptoms: SSL errors, services can't connect
Fix:
# For Let's Encrypt (should auto-renew):
kubectl delete secret bakery-ia-prod-tls-cert -n bakery-ia
# Wait for cert-manager to recreate
# For internal certs:
# Follow "Certificate Rotation" section above
Maintenance Tasks
Daily Tasks
# Run health check
./health-check.sh
# Check monitoring alerts (requires SigNoz AlertManager port-forward on 9093)
curl -s http://localhost:9093/api/v1/alerts | jq '.data[] | select(.status.state=="firing")'
# Verify backups ran
ls -lh /backups/ | head -5
Weekly Tasks
# Review resource trends
# Open Grafana, check 7-day trends
# Review error logs
kubectl logs -n bakery-ia deployment/gateway --since=168h | grep ERROR | wc -l
# Check disk usage
kubectl exec -n bakery-ia deployment/auth-db -- df -h
# Review security logs
kubectl logs -n bakery-ia deployment/auth-service --since=168h | grep "failed"
Monthly Tasks
- Review and rotate passwords
- Update security patches
- Test backup restore
- Review RBAC policies
- Vacuum and analyze databases
- Review and optimize slow queries
- Check certificate expiry dates
- Review resource allocation
- Plan capacity for next quarter
- Update documentation
Quarterly Tasks (Every 90 Days)
- Full security audit
- Disaster recovery drill
- Performance testing
- Cost optimization review
- Update runbooks
- Team training session
- Review SLAs and metrics
- Plan infrastructure upgrades
Annual Tasks
- Penetration testing
- Compliance audit (GDPR, PCI-DSS, SOC 2)
- Full infrastructure review
- Update security roadmap
- Budget planning for next year
- Technology stack review
Compliance & Audit
GDPR Compliance
Requirements Met:
- ✅ Article 32: Encryption of personal data (TLS + pgcrypto)
- ✅ Article 5(1)(f): Security of processing
- ✅ Article 33: Breach detection (audit logs)
- ✅ Article 17: Right to erasure (deletion endpoints)
- ✅ Article 20: Right to data portability (export functionality)
Audit Tasks:
# Review audit logs for data access
kubectl logs -n bakery-ia deployment/tenant-service | grep "user_data_access"
# Verify encryption in use
kubectl exec -n bakery-ia deployment/auth-db -- psql -U postgres -c "SHOW ssl;"
# Check data retention policies
# Review automated cleanup jobs
PCI-DSS Compliance
Requirements Met:
- ✅ Requirement 3.4: Transmission encryption (TLS 1.2+)
- ✅ Requirement 3.5: Stored data protection (pgcrypto)
- ✅ Requirement 10: Access tracking (audit logs)
- ✅ Requirement 8: User authentication (JWT + MFA ready)
Audit Tasks:
# Verify no plaintext passwords
kubectl get secret bakery-ia-secrets -n bakery-ia -o jsonpath='{.data}' | grep -i "pass"
# Check encryption in transit
kubectl describe ingress -n bakery-ia | grep TLS
# Review access logs
kubectl logs -n bakery-ia deployment/auth-service | grep "login"
SOC 2 Compliance
Controls Met:
- ✅ CC6.1: Access controls (RBAC)
- ✅ CC6.6: Encryption in transit (TLS)
- ✅ CC6.7: Encryption at rest (K8s secrets + pgcrypto)
- ✅ CC7.2: Monitoring (Prometheus + Grafana)
Audit Log Retention
Current Policy:
- Application logs: 30 days (stdout)
- Database audit logs: 90 days
- Security logs: 1 year
- Backups: 30 days rolling
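The 30-day rolling backup retention can be enforced with a small prune step (backup path assumed from the backup section; GNU find):

```shell
#!/bin/bash
# Delete backup files older than N days (default 30) under a directory.
prune_old_backups() {
  local dir=${1:-/backups} days=${2:-30}
  find "$dir" -type f -mtime +"$days" -delete
}
```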
Extending Retention:
# Ship logs to external storage
# Example: Ship to S3 / CloudWatch / ELK
# For PostgreSQL audit logs, rotate the CSV logs less frequently
# (actual retention also depends on log_filename and any cleanup jobs)
kubectl exec -n bakery-ia deployment/auth-db -- \
  psql -U postgres -c "ALTER SYSTEM SET log_rotation_age = '90d';"
kubectl exec -n bakery-ia deployment/auth-db -- \
  psql -U postgres -c "SELECT pg_reload_conf();"
Quick Reference Commands
Emergency Commands
# Restart all services (minimal downtime with rolling update)
kubectl rollout restart deployment -n bakery-ia
# Restart specific service
kubectl rollout restart deployment/orders-service -n bakery-ia
# Rollback last deployment
kubectl rollout undo deployment/orders-service -n bakery-ia
# Scale up quickly
kubectl scale deployment orders-service -n bakery-ia --replicas=5
# Get pod status
kubectl get pods -n bakery-ia
# Get recent events
kubectl get events -n bakery-ia --sort-by='.lastTimestamp' | tail -20
# Get logs
kubectl logs -n bakery-ia deployment/SERVICE_NAME --tail=100 -f
Monitoring Commands
# Resource usage
kubectl top nodes
kubectl top pods -n bakery-ia --sort-by=cpu
kubectl top pods -n bakery-ia --sort-by=memory
# Check HPA
kubectl get hpa -n bakery-ia
# Check all resources
kubectl get all -n bakery-ia
# Check ingress
kubectl get ingress -n bakery-ia
# Check certificates
kubectl get certificate -n bakery-ia
Database Commands
# Connect to database
kubectl exec -n bakery-ia deployment/auth-db -it -- psql -U auth_user -d auth_db
# Check connections
kubectl exec -n bakery-ia deployment/auth-db -- \
psql -U postgres -c "SELECT count(*) FROM pg_stat_activity;"
# Check database size
kubectl exec -n bakery-ia deployment/auth-db -- \
psql -U postgres -c "SELECT pg_size_pretty(pg_database_size('auth_db'));"
# Vacuum database
kubectl exec -n bakery-ia deployment/auth-db -- \
psql -U auth_user -d auth_db -c "VACUUM ANALYZE;"
Support Resources
Documentation:
- Pilot Launch Guide - Initial deployment and setup
- Security Checklist - Security procedures and compliance
- Database Security - Database operations and best practices
- TLS Configuration - Certificate management
- RBAC Implementation - Access control configuration
- Monitoring Stack README - Detailed monitoring documentation
External Resources:
- Kubernetes: https://kubernetes.io/docs
- MicroK8s: https://microk8s.io/docs
- Prometheus: https://prometheus.io/docs
- Grafana: https://grafana.com/docs
- PostgreSQL: https://www.postgresql.org/docs
Emergency Contacts:
- DevOps Team: devops@bakewise.ai
- On-Call: oncall@bakewise.ai
- Security Team: security@bakewise.ai
Summary
This guide covers all aspects of operating the Bakery-IA platform in production:
✅ Monitoring: Dashboards, alerts, metrics
✅ Security: Access control, certificates, compliance
✅ Databases: Management, optimization, backups
✅ Recovery: Backup strategy, disaster recovery
✅ Performance: Optimization techniques, scaling
✅ Incidents: Response procedures, common fixes
✅ Maintenance: Daily, weekly, monthly tasks
✅ Compliance: GDPR, PCI-DSS, SOC 2
Remember:
- Monitor daily
- Back up daily
- Test restores monthly
- Rotate secrets quarterly
- Plan for growth continuously
Document Version: 1.0
Last Updated: 2026-01-07
Maintained By: DevOps Team
Next Review: 2026-04-07