Add signoz

This commit is contained in:
Urtzi Alfaro
2026-01-08 12:58:00 +01:00
parent 07178f8972
commit dfb7e4b237
40 changed files with 2049 additions and 3935 deletions

View File

@@ -1,201 +0,0 @@
# Infrastructure Cleanup Summary
**Date:** 2026-01-07
**Action:** Removed legacy Docker Compose infrastructure files
---
## Deleted Directories and Files
The following legacy infrastructure files have been removed as they were specific to Docker Compose deployment and are **not used** in the Kubernetes deployment:
### ❌ Removed:
- `infrastructure/pgadmin/` - pgAdmin configuration for Docker Compose
- `pgpass` - Password file
- `servers.json` - Server definitions
- `infrastructure/postgres/` - PostgreSQL configuration for Docker Compose
- `init-scripts/init.sql` - Database initialization
- `infrastructure/rabbitmq/` - RabbitMQ configuration for Docker Compose
- `definitions.json` - Queue/exchange definitions
- `rabbitmq.conf` - RabbitMQ settings
- `infrastructure/redis/` - Redis configuration for Docker Compose
- `redis.conf` - Redis settings
- `infrastructure/terraform/` - Terraform infrastructure-as-code (unused)
- `base/`, `dev/`, `staging/`, `production/` directories
- `modules/` directory
- `infrastructure/rabbitmq.conf` - Standalone RabbitMQ config file
### ✅ Retained:
#### `infrastructure/kubernetes/`
**Purpose:** Complete Kubernetes deployment manifests
**Status:** Active and required
**Contents:**
- `base/` - Base Kubernetes resources
- `components/` - All service deployments
- `databases/` - Database deployments (uses embedded configs)
- `monitoring/` - Prometheus, Grafana, AlertManager
- `migrations/` - Database migration jobs
- `secrets/` - TLS secrets and application secrets
- `configmaps/` - PostgreSQL logging config
- `overlays/` - Environment-specific configurations
- `dev/` - Development overlay
- `prod/` - Production overlay
- `encryption/` - Kubernetes secrets encryption config
#### `infrastructure/tls/`
**Purpose:** TLS/SSL certificates for database encryption
**Status:** Active and required
**Contents:**
- `ca/` - Certificate Authority (10-year validity)
- `ca-cert.pem` - CA certificate
- `ca-key.pem` - CA private key (KEEP SECURE!)
- `postgres/` - PostgreSQL server certificates (3-year validity)
- `server-cert.pem`, `server-key.pem`, `ca-cert.pem`
- `redis/` - Redis server certificates (3-year validity)
- `redis-cert.pem`, `redis-key.pem`, `ca-cert.pem`
- `generate-certificates.sh` - Certificate generation script
---
## Why These Were Removed
### Docker Compose vs Kubernetes
The removed files were configuration files for **Docker Compose** deployments:
- pgAdmin was used for local database management (not needed in prod)
- Standalone config files (rabbitmq.conf, redis.conf, postgres init scripts) were mounted as volumes in Docker Compose
- Terraform was an unused infrastructure-as-code attempt
### Kubernetes Uses Different Approach
Kubernetes deployment uses:
- **ConfigMaps** instead of config files
- **Secrets** instead of environment files
- **Kubernetes manifests** instead of docker-compose.yml
- **Built-in orchestration** instead of Terraform
**Example:**
```yaml
# OLD (Docker Compose):
volumes:
- ./infrastructure/rabbitmq/rabbitmq.conf:/etc/rabbitmq/rabbitmq.conf
# NEW (Kubernetes):
env:
- name: RABBITMQ_DEFAULT_USER
valueFrom:
secretKeyRef:
name: rabbitmq-secrets
key: RABBITMQ_USER
```
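The same substitution applies to non-secret settings: instead of mounting a file from the repo, the configuration lives in a ConfigMap mounted into the pod. A minimal sketch with illustrative names and values:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: rabbitmq-config
data:
  rabbitmq.conf: |
    vm_memory_high_watermark.relative = 0.6
---
# Referenced from the Deployment spec (illustrative):
# volumeMounts:
#   - name: rabbitmq-config
#     mountPath: /etc/rabbitmq/rabbitmq.conf
#     subPath: rabbitmq.conf
# volumes:
#   - name: rabbitmq-config
#     configMap:
#       name: rabbitmq-config
```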
---
## Verification
### No References Found
Searched the entire codebase and confirmed **zero references** to the removed folders:
```bash
grep -r "infrastructure/pgadmin" --include="*.yaml" --include="*.sh"
# No results
grep -r "infrastructure/terraform" --include="*.yaml" --include="*.sh"
# No results
```
### Kubernetes Deployment Unaffected
- All services use Kubernetes ConfigMaps and Secrets
- Database configs embedded in deployment YAML files
- TLS certificates managed via Kubernetes Secrets (from `infrastructure/tls/`; see the example below)
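For example, the certificate files under `infrastructure/tls/` can be loaded into a Secret with `kubectl` (a sketch; the secret name and namespace are illustrative and may differ from the manifests in `base/secrets/`):

```bash
# Create (or refresh) the PostgreSQL TLS secret from the generated certificate files
kubectl create secret generic postgres-tls \
  --from-file=server-cert.pem=infrastructure/tls/postgres/server-cert.pem \
  --from-file=server-key.pem=infrastructure/tls/postgres/server-key.pem \
  --from-file=ca-cert.pem=infrastructure/tls/postgres/ca-cert.pem \
  --namespace bakery-ia \
  --dry-run=client -o yaml | kubectl apply -f -
```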
---
## Current Infrastructure Structure
```
infrastructure/
├── kubernetes/ # ✅ ACTIVE - All K8s manifests
│ ├── base/ # Base resources
│ │ ├── components/ # Service deployments
│ │ ├── secrets/ # TLS secrets
│ │ ├── configmaps/ # Configuration
│ │ └── kustomization.yaml # Base kustomization
│ ├── overlays/ # Environment overlays
│ │ ├── dev/ # Development
│ │ └── prod/ # Production
│ └── encryption/ # K8s secrets encryption
└── tls/ # ✅ ACTIVE - TLS certificates
├── ca/ # Certificate Authority
├── postgres/ # PostgreSQL certs
├── redis/ # Redis certs
└── generate-certificates.sh
REMOVED (Docker Compose legacy):
├── pgadmin/ # ❌ DELETED
├── postgres/ # ❌ DELETED
├── rabbitmq/ # ❌ DELETED
├── redis/ # ❌ DELETED
├── terraform/ # ❌ DELETED
└── rabbitmq.conf # ❌ DELETED
```
---
## Impact Assessment
### ✅ No Breaking Changes
- Kubernetes deployment unchanged
- All services continue to work
- TLS certificates still available
- Production readiness maintained
### ✅ Benefits
- Cleaner repository structure
- Less confusion about which configs are used
- Faster repository cloning (smaller size)
- Clear separation: Kubernetes-only deployment
### ✅ Documentation Updated
- [PILOT_LAUNCH_GUIDE.md](../docs/PILOT_LAUNCH_GUIDE.md) - Uses only Kubernetes
- [PRODUCTION_OPERATIONS_GUIDE.md](../docs/PRODUCTION_OPERATIONS_GUIDE.md) - References only K8s resources
- [infrastructure/kubernetes/README.md](kubernetes/README.md) - K8s-specific documentation
---
## Rollback (If Needed)
If for any reason you need these files back, they can be restored from git:
```bash
# View deleted files
git log --diff-filter=D --summary | grep infrastructure
# Restore specific folder (example)
git checkout HEAD~1 -- infrastructure/pgadmin/
# Or restore all deleted infrastructure
git checkout HEAD~1 -- infrastructure/
```
**Note:** These files are not needed for the Kubernetes deployment; they were Docker Compose-specific.
---
## Related Documentation
- [Kubernetes README](kubernetes/README.md) - K8s deployment guide
- [TLS Configuration](../docs/tls-configuration.md) - Certificate management
- [Database Security](../docs/database-security.md) - Database encryption
- [Pilot Launch Guide](../docs/PILOT_LAUNCH_GUIDE.md) - Production deployment
---
**Cleanup Performed By:** Claude Code
**Verified By:** Infrastructure analysis and grep searches
**Status:** ✅ Complete - No issues found

View File

@@ -0,0 +1,316 @@
# SigNoz Helm Chart Values - Development Environment
# Optimized for local development with minimal resource usage
#
# Official Chart: https://github.com/SigNoz/charts
# Install Command: helm install signoz signoz/signoz -n signoz --create-namespace -f signoz-values-dev.yaml
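#
# (Assumption) To add the chart repository and apply changes after editing this file:
#   helm repo add signoz https://charts.signoz.io && helm repo update
#   helm upgrade --install signoz signoz/signoz -n signoz -f signoz-values-dev.yaml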
global:
storageClass: "standard"
domain: "localhost"
# Frontend Configuration
frontend:
replicaCount: 1
image:
repository: signoz/frontend
tag: 0.52.3
pullPolicy: IfNotPresent
service:
type: ClusterIP
port: 3301
ingress:
enabled: true
className: nginx
annotations:
nginx.ingress.kubernetes.io/rewrite-target: /$2
nginx.ingress.kubernetes.io/use-regex: "true"
hosts:
- host: localhost
paths:
- path: /signoz(/|$)(.*)
pathType: ImplementationSpecific
tls: []
resources:
requests:
cpu: 50m
memory: 128Mi
limits:
cpu: 200m
memory: 256Mi
env:
- name: FRONTEND_REFRESH_INTERVAL
value: "30000"
# Query Service Configuration
queryService:
replicaCount: 1
image:
repository: signoz/query-service
tag: 0.52.3
pullPolicy: IfNotPresent
service:
type: ClusterIP
port: 8080
resources:
requests:
cpu: 100m
memory: 256Mi
limits:
cpu: 500m
memory: 512Mi
env:
- name: DEPLOYMENT_TYPE
value: "kubernetes-helm"
- name: SIGNOZ_LOCAL_DB_PATH
value: "/var/lib/signoz"
persistence:
enabled: true
size: 5Gi
storageClass: "standard"
# AlertManager Configuration
alertmanager:
replicaCount: 1
image:
repository: signoz/alertmanager
tag: 0.23.5
pullPolicy: IfNotPresent
service:
type: ClusterIP
port: 9093
resources:
requests:
cpu: 50m
memory: 128Mi
limits:
cpu: 200m
memory: 256Mi
persistence:
enabled: true
size: 2Gi
storageClass: "standard"
config:
global:
resolve_timeout: 5m
route:
group_by: ['alertname', 'cluster', 'service']
group_wait: 10s
group_interval: 10s
repeat_interval: 12h
receiver: 'default'
receivers:
- name: 'default'
# Add email, slack, webhook configs here
# ClickHouse Configuration - Time Series Database
clickhouse:
replicaCount: 1
image:
repository: clickhouse/clickhouse-server
tag: 24.1.2-alpine
pullPolicy: IfNotPresent
service:
type: ClusterIP
httpPort: 8123
tcpPort: 9000
resources:
requests:
cpu: 500m
memory: 512Mi
limits:
cpu: 1000m
memory: 1Gi
persistence:
enabled: true
size: 10Gi
storageClass: "standard"
# ClickHouse configuration
config:
logger:
level: information
max_connections: 1024
max_concurrent_queries: 100
# Data retention (7 days for dev)
merge_tree:
parts_to_delay_insert: 150
parts_to_throw_insert: 300
# OpenTelemetry Collector - Integrated with SigNoz
otelCollector:
enabled: true
replicaCount: 1
image:
repository: signoz/signoz-otel-collector
tag: 0.102.8
pullPolicy: IfNotPresent
service:
type: ClusterIP
ports:
otlpGrpc: 4317
otlpHttp: 4318
metrics: 8888
healthCheck: 13133
resources:
requests:
cpu: 100m
memory: 256Mi
limits:
cpu: 500m
memory: 512Mi
# Full OTEL Collector Configuration
config:
extensions:
health_check:
endpoint: 0.0.0.0:13133
zpages:
endpoint: 0.0.0.0:55679
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
cors:
allowed_origins:
- "http://localhost"
- "https://localhost"
# Prometheus receiver for scraping metrics
prometheus:
config:
scrape_configs:
- job_name: 'otel-collector'
scrape_interval: 30s
static_configs:
- targets: ['localhost:8888']
processors:
batch:
timeout: 10s
send_batch_size: 1024
memory_limiter:
check_interval: 1s
limit_mib: 400
spike_limit_mib: 100
# Resource detection for K8s
resourcedetection:
detectors: [env, system, docker]
timeout: 5s
# Add resource attributes
resource:
attributes:
- key: deployment.environment
value: development
action: upsert
exporters:
# Export to SigNoz ClickHouse
clickhousetraces:
datasource: tcp://clickhouse:9000/?database=signoz_traces
timeout: 10s
clickhousemetricswrite:
endpoint: tcp://clickhouse:9000/?database=signoz_metrics
timeout: 10s
clickhouselogsexporter:
dsn: tcp://clickhouse:9000/?database=signoz_logs
timeout: 10s
# Debug logging
logging:
loglevel: info
sampling_initial: 5
sampling_thereafter: 200
service:
extensions: [health_check, zpages]
pipelines:
traces:
receivers: [otlp]
processors: [memory_limiter, batch, resourcedetection, resource]
exporters: [clickhousetraces, logging]
metrics:
receivers: [otlp, prometheus]
processors: [memory_limiter, batch, resourcedetection, resource]
exporters: [clickhousemetricswrite]
logs:
receivers: [otlp]
processors: [memory_limiter, batch, resourcedetection, resource]
exporters: [clickhouselogsexporter, logging]
# OpenTelemetry Collector Deployment Mode
otelCollectorDeployment:
enabled: true
mode: deployment
# Node Exporter for infrastructure metrics (optional)
nodeExporter:
enabled: true
service:
type: ClusterIP
port: 9100
resources:
requests:
cpu: 50m
memory: 64Mi
limits:
cpu: 100m
memory: 128Mi
# Schemamanager - Manages ClickHouse schema
schemamanager:
enabled: true
image:
repository: signoz/signoz-schema-migrator
tag: 0.52.3
pullPolicy: IfNotPresent
# Additional Configuration
serviceAccount:
create: true
annotations: {}
name: ""
# Security Context
securityContext:
runAsNonRoot: true
runAsUser: 1000
fsGroup: 1000
# Network Policies (disabled for dev)
networkPolicy:
enabled: false
# Monitoring SigNoz itself
selfMonitoring:
enabled: true
serviceMonitor:
enabled: false

View File

@@ -0,0 +1,471 @@
# SigNoz Helm Chart Values - Production Environment
# High-availability configuration with resource optimization
#
# Official Chart: https://github.com/SigNoz/charts
# Install Command: helm install signoz signoz/signoz -n signoz --create-namespace -f signoz-values-prod.yaml
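#
# (Assumption) To add the chart repository and apply changes after editing this file:
#   helm repo add signoz https://charts.signoz.io && helm repo update
#   helm upgrade --install signoz signoz/signoz -n signoz -f signoz-values-prod.yaml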
global:
storageClass: "standard"
domain: "monitoring.bakewise.ai"
# Frontend Configuration
frontend:
replicaCount: 2
image:
repository: signoz/frontend
tag: 0.52.3
pullPolicy: IfNotPresent
service:
type: ClusterIP
port: 3301
ingress:
enabled: true
className: nginx
annotations:
nginx.ingress.kubernetes.io/rewrite-target: /$2
nginx.ingress.kubernetes.io/use-regex: "true"
cert-manager.io/cluster-issuer: "letsencrypt-prod"
nginx.ingress.kubernetes.io/ssl-redirect: "true"
hosts:
- host: monitoring.bakewise.ai
paths:
- path: /signoz(/|$)(.*)
pathType: ImplementationSpecific
tls:
- secretName: signoz-tls
hosts:
- monitoring.bakewise.ai
resources:
requests:
cpu: 250m
memory: 512Mi
limits:
cpu: 500m
memory: 1Gi
# Pod Anti-affinity for HA
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- signoz-frontend
topologyKey: kubernetes.io/hostname
env:
- name: FRONTEND_REFRESH_INTERVAL
value: "30000"
# Query Service Configuration
queryService:
replicaCount: 2
image:
repository: signoz/query-service
tag: 0.52.3
pullPolicy: IfNotPresent
service:
type: ClusterIP
port: 8080
resources:
requests:
cpu: 500m
memory: 1Gi
limits:
cpu: 1000m
memory: 2Gi
# Pod Anti-affinity for HA
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- signoz-query-service
topologyKey: kubernetes.io/hostname
env:
- name: DEPLOYMENT_TYPE
value: "kubernetes-helm"
- name: SIGNOZ_LOCAL_DB_PATH
value: "/var/lib/signoz"
- name: RETENTION_DAYS
value: "30"
persistence:
enabled: true
size: 20Gi
storageClass: "standard"
# Horizontal Pod Autoscaler
autoscaling:
enabled: true
minReplicas: 2
maxReplicas: 5
targetCPUUtilizationPercentage: 70
targetMemoryUtilizationPercentage: 80
# AlertManager Configuration
alertmanager:
replicaCount: 2
image:
repository: signoz/alertmanager
tag: 0.23.5
pullPolicy: IfNotPresent
service:
type: ClusterIP
port: 9093
resources:
requests:
cpu: 250m
memory: 512Mi
limits:
cpu: 500m
memory: 1Gi
# Pod Anti-affinity for HA
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- signoz-alertmanager
topologyKey: kubernetes.io/hostname
persistence:
enabled: true
size: 5Gi
storageClass: "standard"
config:
global:
resolve_timeout: 5m
smtp_smarthost: 'smtp.gmail.com:587'
smtp_from: 'alerts@bakewise.ai'
smtp_auth_username: 'alerts@bakewise.ai'
smtp_auth_password: '${SMTP_PASSWORD}'
smtp_require_tls: true
route:
group_by: ['alertname', 'cluster', 'service', 'severity']
group_wait: 10s
group_interval: 10s
repeat_interval: 12h
receiver: 'critical-alerts'
routes:
- match:
severity: critical
receiver: 'critical-alerts'
continue: true
- match:
severity: warning
receiver: 'warning-alerts'
receivers:
- name: 'critical-alerts'
email_configs:
- to: 'critical-alerts@bakewise.ai'
headers:
Subject: '[CRITICAL] {{ .GroupLabels.alertname }} - Bakery IA'
# Slack webhook for critical alerts
slack_configs:
- api_url: '${SLACK_WEBHOOK_URL}'
channel: '#alerts-critical'
title: '[CRITICAL] {{ .GroupLabels.alertname }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
- name: 'warning-alerts'
email_configs:
- to: 'oncall@bakewise.ai'
headers:
Subject: '[WARNING] {{ .GroupLabels.alertname }} - Bakery IA'
# ClickHouse Configuration - Time Series Database
clickhouse:
replicaCount: 2
image:
repository: clickhouse/clickhouse-server
tag: 24.1.2-alpine
pullPolicy: IfNotPresent
service:
type: ClusterIP
httpPort: 8123
tcpPort: 9000
resources:
requests:
cpu: 1000m
memory: 2Gi
limits:
cpu: 2000m
memory: 4Gi
# Pod Anti-affinity for HA
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- signoz-clickhouse
topologyKey: kubernetes.io/hostname
persistence:
enabled: true
size: 100Gi
storageClass: "standard"
# ClickHouse configuration
config:
logger:
level: information
max_connections: 4096
max_concurrent_queries: 500
# Data retention (30 days for prod)
merge_tree:
parts_to_delay_insert: 150
parts_to_throw_insert: 300
# Performance tuning
max_memory_usage: 10000000000
max_bytes_before_external_group_by: 20000000000
# Backup configuration
backup:
enabled: true
schedule: "0 2 * * *"
retention: 7
# OpenTelemetry Collector - Integrated with SigNoz
otelCollector:
enabled: true
replicaCount: 2
image:
repository: signoz/signoz-otel-collector
tag: 0.102.8
pullPolicy: IfNotPresent
service:
type: ClusterIP
ports:
otlpGrpc: 4317
otlpHttp: 4318
metrics: 8888
healthCheck: 13133
resources:
requests:
cpu: 500m
memory: 512Mi
limits:
cpu: 1000m
memory: 1Gi
# Full OTEL Collector Configuration
config:
extensions:
health_check:
endpoint: 0.0.0.0:13133
zpages:
endpoint: 0.0.0.0:55679
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
max_recv_msg_size_mib: 16
http:
endpoint: 0.0.0.0:4318
cors:
allowed_origins:
- "https://monitoring.bakewise.ai"
- "https://*.bakewise.ai"
# Prometheus receiver for scraping metrics
prometheus:
config:
scrape_configs:
- job_name: 'otel-collector'
scrape_interval: 30s
static_configs:
- targets: ['localhost:8888']
processors:
batch:
timeout: 10s
send_batch_size: 2048
send_batch_max_size: 4096
memory_limiter:
check_interval: 1s
limit_mib: 800
spike_limit_mib: 200
# Resource detection for K8s
resourcedetection:
detectors: [env, system, docker]
timeout: 5s
# Add resource attributes
resource:
attributes:
- key: deployment.environment
value: production
action: upsert
- key: cluster.name
value: bakery-ia-prod
action: upsert
exporters:
# Export to SigNoz ClickHouse
clickhousetraces:
datasource: tcp://clickhouse:9000/?database=signoz_traces
timeout: 10s
retry_on_failure:
enabled: true
initial_interval: 5s
max_interval: 30s
max_elapsed_time: 300s
clickhousemetricswrite:
endpoint: tcp://clickhouse:9000/?database=signoz_metrics
timeout: 10s
retry_on_failure:
enabled: true
initial_interval: 5s
max_interval: 30s
max_elapsed_time: 300s
clickhouselogsexporter:
dsn: tcp://clickhouse:9000/?database=signoz_logs
timeout: 10s
retry_on_failure:
enabled: true
initial_interval: 5s
max_interval: 30s
max_elapsed_time: 300s
# Minimal logging for prod
logging:
loglevel: warn
sampling_initial: 2
sampling_thereafter: 500
service:
extensions: [health_check, zpages]
pipelines:
traces:
receivers: [otlp]
processors: [memory_limiter, batch, resourcedetection, resource]
exporters: [clickhousetraces, logging]
metrics:
receivers: [otlp, prometheus]
processors: [memory_limiter, batch, resourcedetection, resource]
exporters: [clickhousemetricswrite]
logs:
receivers: [otlp]
processors: [memory_limiter, batch, resourcedetection, resource]
exporters: [clickhouselogsexporter, logging]
# OpenTelemetry Collector Deployment Mode
otelCollectorDeployment:
enabled: true
mode: deployment
# HPA for OTEL Collector
autoscaling:
enabled: true
minReplicas: 2
maxReplicas: 10
targetCPUUtilizationPercentage: 70
targetMemoryUtilizationPercentage: 80
# Node Exporter for infrastructure metrics
nodeExporter:
enabled: true
service:
type: ClusterIP
port: 9100
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 200m
memory: 256Mi
# Schemamanager - Manages ClickHouse schema
schemamanager:
enabled: true
image:
repository: signoz/signoz-schema-migrator
tag: 0.52.3
pullPolicy: IfNotPresent
# Additional Configuration
serviceAccount:
create: true
annotations: {}
name: "signoz"
# Security Context
securityContext:
runAsNonRoot: true
runAsUser: 1000
fsGroup: 1000
# Pod Disruption Budgets for HA
podDisruptionBudget:
frontend:
enabled: true
minAvailable: 1
queryService:
enabled: true
minAvailable: 1
alertmanager:
enabled: true
minAvailable: 1
clickhouse:
enabled: true
minAvailable: 1
# Network Policies for security
networkPolicy:
enabled: true
policyTypes:
- Ingress
- Egress
# Monitoring SigNoz itself
selfMonitoring:
enabled: true
serviceMonitor:
enabled: true
interval: 30s

View File

@@ -4,7 +4,7 @@ This directory contains Kubernetes manifests for deploying the Bakery IA platfor
## Quick Start
Deploy the entire platform with these 5 commands:
Deploy the entire platform with these 4 commands:
```bash
# 1. Start Colima with adequate resources
@@ -17,15 +17,14 @@ kind create cluster --config kind-config.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/main/deploy/static/provider/kind/deploy.yaml
kubectl wait --namespace ingress-nginx --for=condition=ready pod --selector=app.kubernetes.io/component=controller --timeout=300s
# 4. Configure permanent localhost access
kubectl patch svc ingress-nginx-controller -n ingress-nginx -p '{"spec":{"type":"NodePort","ports":[{"name":"http","port":80,"targetPort":"http","nodePort":30080},{"name":"https","port":443,"targetPort":"https","nodePort":30443}]}}'
# 4. Deploy with Tilt
tilt up
# 5. Deploy with Skaffold
skaffold dev --profile=dev
# 🎉 Access at: https://localhost
# 🎉 Access at: http://localhost (or see Tilt for individual service ports)
```
> **Note**: The kind-config.yaml already configures port mappings (30080→80, 30443→443) for localhost access, so no additional service patching is needed. The NGINX Ingress for Kind uses NodePort by default on those exact ports.
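For reference, the relevant part of `kind-config.yaml` looks roughly like this (a sketch; the actual file in this repo may contain additional settings):

```yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    extraPortMappings:
      - containerPort: 30080   # NodePort exposed by ingress-nginx (HTTP)
        hostPort: 80
        protocol: TCP
      - containerPort: 30443   # NodePort exposed by ingress-nginx (HTTPS)
        hostPort: 443
        protocol: TCP
```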
## Prerequisites
Install the following tools on macOS:
@@ -100,11 +99,11 @@ Then access via:
### Start Development Environment
```bash
# Start development mode with hot-reload
skaffold dev --profile=dev
# Start development mode with hot-reload using Tilt
tilt up
# Or one-time deployment
skaffold run --profile=dev
# Or stream logs in the current terminal instead of opening the Tilt UI
tilt up --stream
```
### Key Features
@@ -246,13 +245,39 @@ colima stop --profile k8s-local
### Restart Sequence
```bash
# Post-restart startup
# Post-restart startup (or use kubernetes_restart.sh script)
colima start --cpu 6 --memory 12 --disk 120 --runtime docker --profile k8s-local
kind create cluster --config kind-config.yaml
skaffold dev --profile=dev
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/main/deploy/static/provider/kind/deploy.yaml
kubectl wait --namespace ingress-nginx --for=condition=ready pod --selector=app.kubernetes.io/component=controller --timeout=300s
tilt up
```
## Production Considerations
## Production Deployment
### Production URLs
The production environment uses the following domains:
- **Main Application**: https://bakewise.ai
- Frontend application and all public pages
- API endpoints: https://bakewise.ai/api/v1/...
- **Monitoring Stack**: https://monitoring.bakewise.ai
- Grafana: https://monitoring.bakewise.ai/grafana
- Prometheus: https://monitoring.bakewise.ai/prometheus
- Jaeger: https://monitoring.bakewise.ai/jaeger
- AlertManager: https://monitoring.bakewise.ai/alertmanager
### Production Configuration
The production overlay (`overlays/prod/`) includes the following (a kustomization sketch follows the list):
- **Domain Configuration**: bakewise.ai with Let's Encrypt certificates
- **High Availability**: Multi-replica deployments (2-3 replicas per service)
- **Enhanced Security**: Rate limiting, CORS restrictions, security headers
- **Monitoring**: Full observability stack with Prometheus, Grafana, Jaeger
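A minimal sketch of what a production kustomization could look like, given this layout (service names, replica counts, and patch files are illustrative, not the repository's actual contents):

```yaml
# overlays/prod/kustomization.yaml (illustrative)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: bakery-ia
resources:
  - ../../base
patches:
  - path: ingress-bakewise-ai.yaml   # hypothetical patch adding the bakewise.ai host and TLS
replicas:
  - name: gateway                    # hypothetical service name
    count: 2
```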
### Production Considerations
For production deployment:
@@ -263,6 +288,7 @@ For production deployment:
- **External Secrets**: Use managed secret services
- **TLS**: Production Let's Encrypt certificates (see the ClusterIssuer sketch below)
- **CI/CD**: Automated deployment pipelines
- **DNS**: Configure DNS A/CNAME records pointing to your cluster's load balancer
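The production values in this commit assume a cert-manager `ClusterIssuer` named `letsencrypt-prod`. A minimal sketch of such an issuer (the contact email and account-key secret name are placeholders):

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@bakewise.ai                  # placeholder contact address
    privateKeySecretRef:
      name: letsencrypt-prod-account-key    # placeholder ACME account-key secret
    solvers:
      - http01:
          ingress:
            class: nginx
```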
## Next Steps

View File

@@ -48,6 +48,9 @@ spec:
name: pos-integration-secrets
- secretRef:
name: whatsapp-secrets
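# Export telemetry to the in-cluster OpenTelemetry Collector over OTLP/gRPC (port 4317)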
env:
- name: OTEL_EXPORTER_OTLP_ENDPOINT
value: "http://otel-collector.monitoring.svc.cluster.local:4317"
resources:
requests:
memory: "256Mi"

View File

@@ -1,429 +0,0 @@
---
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-alert-rules
namespace: monitoring
data:
alert-rules.yml: |
groups:
# Basic Infrastructure Alerts
- name: bakery_services
interval: 30s
rules:
- alert: ServiceDown
expr: up{job="bakery-services"} == 0
for: 2m
labels:
severity: critical
component: infrastructure
annotations:
summary: "Service {{ $labels.service }} is down"
description: "Service {{ $labels.service }} in namespace {{ $labels.namespace }} has been down for more than 2 minutes."
runbook_url: "https://runbooks.bakery-ia.local/ServiceDown"
- alert: HighErrorRate
expr: |
(
sum(rate(http_requests_total{status_code=~"5..", job="bakery-services"}[5m])) by (service)
/
sum(rate(http_requests_total{job="bakery-services"}[5m])) by (service)
) > 0.10
for: 5m
labels:
severity: critical
component: application
annotations:
summary: "High error rate on {{ $labels.service }}"
description: "Service {{ $labels.service }} has error rate above 10% (current: {{ $value | humanizePercentage }})."
runbook_url: "https://runbooks.bakery-ia.local/HighErrorRate"
- alert: HighResponseTime
expr: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket{job="bakery-services"}[5m])) by (service, le)
) > 1
for: 5m
labels:
severity: warning
component: performance
annotations:
summary: "High response time on {{ $labels.service }}"
description: "Service {{ $labels.service }} P95 latency is above 1 second (current: {{ $value }}s)."
runbook_url: "https://runbooks.bakery-ia.local/HighResponseTime"
- alert: HighMemoryUsage
expr: |
container_memory_usage_bytes{namespace="bakery-ia", container!=""} > 500000000
for: 5m
labels:
severity: warning
component: infrastructure
annotations:
summary: "High memory usage in {{ $labels.pod }}"
description: "Container {{ $labels.container }} in pod {{ $labels.pod }} is using more than 500MB of memory (current: {{ $value | humanize }}B)."
runbook_url: "https://runbooks.bakery-ia.local/HighMemoryUsage"
- alert: DatabaseConnectionHigh
expr: |
pg_stat_database_numbackends{datname="bakery"} > 80
for: 5m
labels:
severity: warning
component: database
annotations:
summary: "High database connection count"
description: "Database has more than 80 active connections (current: {{ $value }})."
runbook_url: "https://runbooks.bakery-ia.local/DatabaseConnectionHigh"
# Business Logic Alerts
- name: bakery_business
interval: 30s
rules:
- alert: TrainingJobFailed
expr: |
increase(training_job_failures_total[1h]) > 0
for: 5m
labels:
severity: warning
component: ml-training
annotations:
summary: "Training job failures detected"
description: "{{ $value }} training job(s) failed in the last hour."
runbook_url: "https://runbooks.bakery-ia.local/TrainingJobFailed"
- alert: LowPredictionAccuracy
expr: |
prediction_model_accuracy < 0.70
for: 15m
labels:
severity: warning
component: ml-inference
annotations:
summary: "Model prediction accuracy is low"
description: "Model {{ $labels.model_name }} accuracy is below 70% (current: {{ $value | humanizePercentage }})."
runbook_url: "https://runbooks.bakery-ia.local/LowPredictionAccuracy"
- alert: APIRateLimitHit
expr: |
increase(rate_limit_hits_total[5m]) > 10
for: 5m
labels:
severity: info
component: api-gateway
annotations:
summary: "API rate limits being hit frequently"
description: "Rate limits hit {{ $value }} times in the last 5 minutes."
runbook_url: "https://runbooks.bakery-ia.local/APIRateLimitHit"
# Alert System Health
- name: alert_system_health
interval: 30s
rules:
- alert: AlertSystemComponentDown
expr: |
alert_system_component_health{component=~"processor|notifier|scheduler"} == 0
for: 2m
labels:
severity: critical
component: alert-system
annotations:
summary: "Alert system component {{ $labels.component }} is unhealthy"
description: "Component {{ $labels.component }} has been unhealthy for more than 2 minutes."
runbook_url: "https://runbooks.bakery-ia.local/AlertSystemComponentDown"
- alert: RabbitMQConnectionDown
expr: |
rabbitmq_up == 0
for: 1m
labels:
severity: critical
component: alert-system
annotations:
summary: "RabbitMQ connection is down"
description: "Alert system has lost connection to RabbitMQ message queue."
runbook_url: "https://runbooks.bakery-ia.local/RabbitMQConnectionDown"
- alert: RedisConnectionDown
expr: |
redis_up == 0
for: 1m
labels:
severity: critical
component: alert-system
annotations:
summary: "Redis connection is down"
description: "Alert system has lost connection to Redis cache."
runbook_url: "https://runbooks.bakery-ia.local/RedisConnectionDown"
- alert: NoSchedulerLeader
expr: |
sum(alert_system_scheduler_leader) == 0
for: 5m
labels:
severity: warning
component: alert-system
annotations:
summary: "No alert scheduler leader elected"
description: "No scheduler instance has been elected as leader for 5 minutes."
runbook_url: "https://runbooks.bakery-ia.local/NoSchedulerLeader"
# Alert System Performance
- name: alert_system_performance
interval: 30s
rules:
- alert: HighAlertProcessingErrorRate
expr: |
(
sum(rate(alert_processing_errors_total[2m]))
/
sum(rate(alerts_processed_total[2m]))
) > 0.10
for: 2m
labels:
severity: critical
component: alert-system
annotations:
summary: "High alert processing error rate"
description: "Alert processing error rate is above 10% (current: {{ $value | humanizePercentage }})."
runbook_url: "https://runbooks.bakery-ia.local/HighAlertProcessingErrorRate"
- alert: HighNotificationDeliveryFailureRate
expr: |
(
sum(rate(notification_delivery_failures_total[3m]))
/
sum(rate(notifications_sent_total[3m]))
) > 0.05
for: 3m
labels:
severity: warning
component: alert-system
annotations:
summary: "High notification delivery failure rate"
description: "Notification delivery failure rate is above 5% (current: {{ $value | humanizePercentage }})."
runbook_url: "https://runbooks.bakery-ia.local/HighNotificationDeliveryFailureRate"
- alert: HighAlertProcessingLatency
expr: |
histogram_quantile(0.95,
sum(rate(alert_processing_duration_seconds_bucket[5m])) by (le)
) > 5
for: 5m
labels:
severity: warning
component: alert-system
annotations:
summary: "High alert processing latency"
description: "P95 alert processing latency is above 5 seconds (current: {{ $value }}s)."
runbook_url: "https://runbooks.bakery-ia.local/HighAlertProcessingLatency"
- alert: TooManySSEConnections
expr: |
sse_active_connections > 1000
for: 2m
labels:
severity: warning
component: alert-system
annotations:
summary: "Too many active SSE connections"
description: "More than 1000 active SSE connections (current: {{ $value }})."
runbook_url: "https://runbooks.bakery-ia.local/TooManySSEConnections"
- alert: SSEConnectionErrors
expr: |
rate(sse_connection_errors_total[3m]) > 0.5
for: 3m
labels:
severity: warning
component: alert-system
annotations:
summary: "High rate of SSE connection errors"
description: "SSE connection error rate is {{ $value }} errors/sec."
runbook_url: "https://runbooks.bakery-ia.local/SSEConnectionErrors"
# Alert System Business Logic
- name: alert_system_business
interval: 30s
rules:
- alert: UnusuallyHighAlertVolume
expr: |
rate(alerts_generated_total[5m]) > 2
for: 5m
labels:
severity: warning
component: alert-system
annotations:
summary: "Unusually high alert generation volume"
description: "More than 2 alerts per second being generated (current: {{ $value }}/sec)."
runbook_url: "https://runbooks.bakery-ia.local/UnusuallyHighAlertVolume"
- alert: NoAlertsGenerated
expr: |
rate(alerts_generated_total[30m]) == 0
for: 15m
labels:
severity: info
component: alert-system
annotations:
summary: "No alerts generated recently"
description: "No alerts have been generated in the last 30 minutes. This might indicate a problem with alert detection."
runbook_url: "https://runbooks.bakery-ia.local/NoAlertsGenerated"
- alert: SlowAlertResponseTime
expr: |
histogram_quantile(0.95,
sum(rate(alert_response_time_seconds_bucket[10m])) by (le)
) > 3600
for: 10m
labels:
severity: warning
component: alert-system
annotations:
summary: "Slow alert response times"
description: "P95 alert response time is above 1 hour (current: {{ $value | humanizeDuration }})."
runbook_url: "https://runbooks.bakery-ia.local/SlowAlertResponseTime"
- alert: CriticalAlertsUnacknowledged
expr: |
sum(alerts_unacknowledged{severity="critical"}) > 5
for: 10m
labels:
severity: warning
component: alert-system
annotations:
summary: "Multiple critical alerts unacknowledged"
description: "{{ $value }} critical alerts have not been acknowledged for 10+ minutes."
runbook_url: "https://runbooks.bakery-ia.local/CriticalAlertsUnacknowledged"
# Alert System Capacity
- name: alert_system_capacity
interval: 30s
rules:
- alert: LargeSSEMessageQueues
expr: |
sse_message_queue_size > 100
for: 5m
labels:
severity: warning
component: alert-system
annotations:
summary: "Large SSE message queues detected"
description: "SSE message queue for tenant {{ $labels.tenant_id }} has {{ $value }} messages queued."
runbook_url: "https://runbooks.bakery-ia.local/LargeSSEMessageQueues"
- alert: SlowDatabaseStorage
expr: |
histogram_quantile(0.95,
sum(rate(alert_storage_duration_seconds_bucket[5m])) by (le)
) > 1
for: 5m
labels:
severity: warning
component: alert-system
annotations:
summary: "Slow alert database storage"
description: "P95 alert storage latency is above 1 second (current: {{ $value }}s)."
runbook_url: "https://runbooks.bakery-ia.local/SlowDatabaseStorage"
# Alert System Critical Scenarios
- name: alert_system_critical
interval: 15s
rules:
- alert: AlertSystemDown
expr: |
up{service=~"alert-processor|notification-service"} == 0
for: 1m
labels:
severity: critical
component: alert-system
annotations:
summary: "Alert system is completely down"
description: "Core alert system service {{ $labels.service }} is down."
runbook_url: "https://runbooks.bakery-ia.local/AlertSystemDown"
- alert: AlertDataNotPersisted
expr: |
(
sum(rate(alerts_processed_total[2m]))
-
sum(rate(alerts_stored_total[2m]))
) > 0
for: 2m
labels:
severity: critical
component: alert-system
annotations:
summary: "Alerts not being persisted to database"
description: "Alerts are being processed but not stored in the database."
runbook_url: "https://runbooks.bakery-ia.local/AlertDataNotPersisted"
- alert: NotificationsNotDelivered
expr: |
(
sum(rate(alerts_processed_total[3m]))
-
sum(rate(notifications_sent_total[3m]))
) > 0
for: 3m
labels:
severity: critical
component: alert-system
annotations:
summary: "Notifications not being delivered"
description: "Alerts are being processed but notifications are not being sent."
runbook_url: "https://runbooks.bakery-ia.local/NotificationsNotDelivered"
# Monitoring System Self-Monitoring
- name: monitoring_health
interval: 30s
rules:
- alert: PrometheusDown
expr: up{job="prometheus"} == 0
for: 5m
labels:
severity: critical
component: monitoring
annotations:
summary: "Prometheus is down"
description: "Prometheus monitoring system is not responding."
runbook_url: "https://runbooks.bakery-ia.local/PrometheusDown"
- alert: AlertManagerDown
expr: up{job="alertmanager"} == 0
for: 2m
labels:
severity: critical
component: monitoring
annotations:
summary: "AlertManager is down"
description: "AlertManager is not responding. Alerts will not be routed."
runbook_url: "https://runbooks.bakery-ia.local/AlertManagerDown"
- alert: PrometheusStorageFull
expr: |
(
prometheus_tsdb_storage_blocks_bytes
/
(prometheus_tsdb_storage_blocks_bytes + prometheus_tsdb_wal_size_bytes)
) > 0.90
for: 10m
labels:
severity: warning
component: monitoring
annotations:
summary: "Prometheus storage almost full"
description: "Prometheus storage is {{ $value | humanizePercentage }} full."
runbook_url: "https://runbooks.bakery-ia.local/PrometheusStorageFull"
- alert: PrometheusScrapeErrors
expr: |
rate(prometheus_target_scrapes_exceeded_sample_limit_total[5m]) > 0
for: 5m
labels:
severity: warning
component: monitoring
annotations:
summary: "Prometheus scrape errors detected"
description: "Prometheus is experiencing scrape errors for target {{ $labels.job }}."
runbook_url: "https://runbooks.bakery-ia.local/PrometheusScrapeErrors"

View File

@@ -1,27 +0,0 @@
---
# InitContainer to substitute secrets into AlertManager config
# This allows us to use environment variables from secrets in the config file
apiVersion: v1
kind: ConfigMap
metadata:
name: alertmanager-init-script
namespace: monitoring
data:
init-config.sh: |
#!/bin/sh
set -e
# Read the template config
TEMPLATE=$(cat /etc/alertmanager-template/alertmanager.yml)
# Substitute environment variables
echo "$TEMPLATE" | \
sed "s|{{ .smtp_host }}|${SMTP_HOST}|g" | \
sed "s|{{ .smtp_from }}|${SMTP_FROM}|g" | \
sed "s|{{ .smtp_username }}|${SMTP_USERNAME}|g" | \
sed "s|{{ .smtp_password }}|${SMTP_PASSWORD}|g" | \
sed "s|{{ .slack_webhook_url }}|${SLACK_WEBHOOK_URL}|g" \
> /etc/alertmanager-final/alertmanager.yml
echo "AlertManager config initialized successfully"
cat /etc/alertmanager-final/alertmanager.yml

View File

@@ -1,391 +0,0 @@
---
apiVersion: v1
kind: ConfigMap
metadata:
name: alertmanager-config
namespace: monitoring
data:
alertmanager.yml: |
global:
resolve_timeout: 5m
smtp_smarthost: '{{ .smtp_host }}'
smtp_from: '{{ .smtp_from }}'
smtp_auth_username: '{{ .smtp_username }}'
smtp_auth_password: '{{ .smtp_password }}'
smtp_require_tls: true
# Define notification templates
templates:
- '/etc/alertmanager/templates/*.tmpl'
# Route alerts to appropriate receivers
route:
# Default receiver
receiver: 'default-email'
# Group alerts by these labels
group_by: ['alertname', 'cluster', 'service']
# Wait time before sending initial notification
group_wait: 10s
# Wait time before sending notifications about new alerts in the group
group_interval: 10s
# Wait time before re-sending a notification
repeat_interval: 12h
# Child routes for specific alert routing
routes:
# Critical alerts - send immediately to all channels
- match:
severity: critical
receiver: 'critical-alerts'
group_wait: 0s
group_interval: 5m
repeat_interval: 4h
continue: true
# Warning alerts - less urgent
- match:
severity: warning
receiver: 'warning-alerts'
group_wait: 30s
group_interval: 5m
repeat_interval: 12h
# Alert system specific alerts
- match:
component: alert-system
receiver: 'alert-system-team'
group_wait: 10s
repeat_interval: 6h
# Database alerts
- match_re:
alertname: ^(DatabaseConnectionHigh|SlowDatabaseStorage)$
receiver: 'database-team'
group_wait: 30s
repeat_interval: 8h
# Infrastructure alerts
- match_re:
alertname: ^(HighMemoryUsage|ServiceDown)$
receiver: 'infra-team'
group_wait: 30s
repeat_interval: 6h
# Inhibition rules - prevent alert spam
inhibit_rules:
# If service is down, inhibit all other alerts for that service
- source_match:
alertname: 'ServiceDown'
target_match_re:
alertname: '(HighErrorRate|HighResponseTime|HighMemoryUsage)'
equal: ['service']
# If AlertSystem is completely down, inhibit component alerts
- source_match:
alertname: 'AlertSystemDown'
target_match_re:
alertname: 'AlertSystemComponent.*'
equal: ['namespace']
# If RabbitMQ is down, inhibit alert processing errors
- source_match:
alertname: 'RabbitMQConnectionDown'
target_match:
alertname: 'HighAlertProcessingErrorRate'
equal: ['namespace']
# Receivers - notification destinations
receivers:
# Default email receiver
- name: 'default-email'
email_configs:
- to: 'alerts@yourdomain.com'
headers:
Subject: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }} - {{ .GroupLabels.service }}'
html: |
{{ range .Alerts }}
<h2>{{ .Labels.alertname }}</h2>
<p><strong>Status:</strong> {{ .Status }}</p>
<p><strong>Severity:</strong> {{ .Labels.severity }}</p>
<p><strong>Service:</strong> {{ .Labels.service }}</p>
<p><strong>Summary:</strong> {{ .Annotations.summary }}</p>
<p><strong>Description:</strong> {{ .Annotations.description }}</p>
<p><strong>Started:</strong> {{ .StartsAt }}</p>
{{ if .EndsAt }}<p><strong>Ended:</strong> {{ .EndsAt }}</p>{{ end }}
{{ end }}
# Critical alerts - multiple channels
- name: 'critical-alerts'
email_configs:
- to: 'critical-alerts@yourdomain.com,oncall@yourdomain.com'
headers:
Subject: '🚨 [CRITICAL] {{ .GroupLabels.alertname }} - {{ .GroupLabels.service }}'
send_resolved: true
# Uncomment to enable Slack notifications
# slack_configs:
# - api_url: '{{ .slack_webhook_url }}'
# channel: '#alerts-critical'
# title: '🚨 Critical Alert'
# text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
# send_resolved: true
# Warning alerts
- name: 'warning-alerts'
email_configs:
- to: 'alerts@yourdomain.com'
headers:
Subject: '⚠️ [WARNING] {{ .GroupLabels.alertname }} - {{ .GroupLabels.service }}'
send_resolved: true
# Alert system team
- name: 'alert-system-team'
email_configs:
- to: 'alert-system-team@yourdomain.com'
headers:
Subject: '[Alert System] {{ .GroupLabels.alertname }}'
send_resolved: true
# Database team
- name: 'database-team'
email_configs:
- to: 'database-team@yourdomain.com'
headers:
Subject: '[Database] {{ .GroupLabels.alertname }}'
send_resolved: true
# Infrastructure team
- name: 'infra-team'
email_configs:
- to: 'infra-team@yourdomain.com'
headers:
Subject: '[Infrastructure] {{ .GroupLabels.alertname }}'
send_resolved: true
---
apiVersion: v1
kind: ConfigMap
metadata:
name: alertmanager-templates
namespace: monitoring
data:
default.tmpl: |
{{ define "cluster" }}{{ .ExternalURL | reReplaceAll ".*alertmanager\\.(.*)" "$1" }}{{ end }}
{{ define "slack.default.title" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.alertname }}
{{ end }}
{{ define "slack.default.text" }}
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
*Severity:* `{{ .Labels.severity }}`
*Service:* `{{ .Labels.service }}`
{{ end }}
{{ end }}
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: alertmanager
namespace: monitoring
labels:
app: alertmanager
spec:
serviceName: alertmanager
replicas: 3
selector:
matchLabels:
app: alertmanager
template:
metadata:
labels:
app: alertmanager
spec:
serviceAccountName: prometheus
initContainers:
- name: init-config
image: busybox:1.36
command: ['/bin/sh', '/scripts/init-config.sh']
env:
- name: SMTP_HOST
valueFrom:
secretKeyRef:
name: alertmanager-secrets
key: smtp-host
- name: SMTP_USERNAME
valueFrom:
secretKeyRef:
name: alertmanager-secrets
key: smtp-username
- name: SMTP_PASSWORD
valueFrom:
secretKeyRef:
name: alertmanager-secrets
key: smtp-password
- name: SMTP_FROM
valueFrom:
secretKeyRef:
name: alertmanager-secrets
key: smtp-from
- name: SLACK_WEBHOOK_URL
valueFrom:
secretKeyRef:
name: alertmanager-secrets
key: slack-webhook-url
optional: true
volumeMounts:
- name: init-script
mountPath: /scripts
- name: config-template
mountPath: /etc/alertmanager-template
- name: config-final
mountPath: /etc/alertmanager-final
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- alertmanager
topologyKey: kubernetes.io/hostname
containers:
- name: alertmanager
image: prom/alertmanager:v0.27.0
args:
- '--config.file=/etc/alertmanager/alertmanager.yml'
- '--storage.path=/alertmanager'
- '--cluster.listen-address=0.0.0.0:9094'
- '--cluster.peer=alertmanager-0.alertmanager.monitoring.svc.cluster.local:9094'
- '--cluster.peer=alertmanager-1.alertmanager.monitoring.svc.cluster.local:9094'
- '--cluster.peer=alertmanager-2.alertmanager.monitoring.svc.cluster.local:9094'
- '--cluster.reconnect-timeout=5m'
- '--web.external-url=http://monitoring.bakery-ia.local/alertmanager'
- '--web.route-prefix=/'
ports:
- name: web
containerPort: 9093
- name: mesh-tcp
containerPort: 9094
- name: mesh-udp
containerPort: 9094
protocol: UDP
env:
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
volumeMounts:
- name: config-final
mountPath: /etc/alertmanager
- name: templates
mountPath: /etc/alertmanager/templates
- name: storage
mountPath: /alertmanager
resources:
requests:
memory: "128Mi"
cpu: "100m"
limits:
memory: "256Mi"
cpu: "500m"
livenessProbe:
httpGet:
path: /-/healthy
port: 9093
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /-/ready
port: 9093
initialDelaySeconds: 5
periodSeconds: 5
# Config reloader sidecar
- name: configmap-reload
image: jimmidyson/configmap-reload:v0.12.0
args:
- '--webhook-url=http://localhost:9093/-/reload'
- '--volume-dir=/etc/alertmanager'
volumeMounts:
- name: config-final
mountPath: /etc/alertmanager
readOnly: true
resources:
requests:
memory: "16Mi"
cpu: "10m"
limits:
memory: "32Mi"
cpu: "50m"
volumes:
- name: init-script
configMap:
name: alertmanager-init-script
defaultMode: 0755
- name: config-template
configMap:
name: alertmanager-config
- name: config-final
emptyDir: {}
- name: templates
configMap:
name: alertmanager-templates
volumeClaimTemplates:
- metadata:
name: storage
spec:
accessModes: [ "ReadWriteOnce" ]
resources:
requests:
storage: 2Gi
---
apiVersion: v1
kind: Service
metadata:
name: alertmanager
namespace: monitoring
labels:
app: alertmanager
spec:
type: ClusterIP
clusterIP: None
ports:
- name: web
port: 9093
targetPort: 9093
- name: mesh-tcp
port: 9094
targetPort: 9094
- name: mesh-udp
port: 9094
targetPort: 9094
protocol: UDP
selector:
app: alertmanager
---
apiVersion: v1
kind: Service
metadata:
name: alertmanager-external
namespace: monitoring
labels:
app: alertmanager
spec:
type: ClusterIP
ports:
- name: web
port: 9093
targetPort: 9093
selector:
app: alertmanager

View File

@@ -1,949 +0,0 @@
apiVersion: v1
kind: ConfigMap
metadata:
name: grafana-dashboards-extended
namespace: monitoring
data:
postgresql-dashboard.json: |
{
"dashboard": {
"title": "Bakery IA - PostgreSQL Database",
"tags": ["bakery-ia", "postgresql", "database"],
"timezone": "browser",
"refresh": "30s",
"schemaVersion": 16,
"version": 1,
"panels": [
{
"id": 1,
"title": "Active Connections by Database",
"type": "graph",
"gridPos": {"x": 0, "y": 0, "w": 12, "h": 8},
"targets": [
{
"expr": "pg_stat_activity_count{state=\"active\"}",
"legendFormat": "{{datname}} - active"
},
{
"expr": "pg_stat_activity_count{state=\"idle\"}",
"legendFormat": "{{datname}} - idle"
},
{
"expr": "pg_stat_activity_count{state=\"idle in transaction\"}",
"legendFormat": "{{datname}} - idle tx"
}
]
},
{
"id": 2,
"title": "Total Connections",
"type": "stat",
"gridPos": {"x": 12, "y": 0, "w": 6, "h": 4},
"targets": [
{
"expr": "sum(pg_stat_activity_count)",
"legendFormat": "Total connections"
}
]
},
{
"id": 3,
"title": "Max Connections",
"type": "stat",
"gridPos": {"x": 18, "y": 0, "w": 6, "h": 4},
"targets": [
{
"expr": "pg_settings_max_connections",
"legendFormat": "Max connections"
}
]
},
{
"id": 4,
"title": "Transaction Rate (Commits vs Rollbacks)",
"type": "graph",
"gridPos": {"x": 0, "y": 8, "w": 12, "h": 8},
"targets": [
{
"expr": "rate(pg_stat_database_xact_commit[5m])",
"legendFormat": "{{datname}} - commits"
},
{
"expr": "rate(pg_stat_database_xact_rollback[5m])",
"legendFormat": "{{datname}} - rollbacks"
}
]
},
{
"id": 5,
"title": "Cache Hit Ratio",
"type": "graph",
"gridPos": {"x": 12, "y": 8, "w": 12, "h": 8},
"targets": [
{
"expr": "100 * (1 - (sum(rate(pg_stat_io_blocks_read_total[5m])) / (sum(rate(pg_stat_io_blocks_read_total[5m])) + sum(rate(pg_stat_io_blocks_hit_total[5m])))))",
"legendFormat": "Cache hit ratio %"
}
]
},
{
"id": 6,
"title": "Slow Queries (> 30s)",
"type": "table",
"gridPos": {"x": 0, "y": 16, "w": 12, "h": 8},
"targets": [
{
"expr": "pg_slow_queries{duration_ms > 30000}",
"format": "table",
"instant": true
}
],
"transformations": [
{
"id": "organize",
"options": {
"excludeByName": {},
"indexByName": {},
"renameByName": {
"query": "Query",
"duration_ms": "Duration (ms)",
"datname": "Database"
}
}
}
]
},
{
"id": 7,
"title": "Dead Tuples by Table",
"type": "graph",
"gridPos": {"x": 12, "y": 16, "w": 12, "h": 8},
"targets": [
{
"expr": "pg_stat_user_tables_n_dead_tup",
"legendFormat": "{{schemaname}}.{{relname}}"
}
]
},
{
"id": 8,
"title": "Table Bloat Estimate",
"type": "graph",
"gridPos": {"x": 0, "y": 24, "w": 12, "h": 8},
"targets": [
{
"expr": "100 * (pg_stat_user_tables_n_dead_tup * avg_tuple_size) / (pg_total_relation_size * 8192)",
"legendFormat": "{{schemaname}}.{{relname}} bloat %"
}
]
},
{
"id": 9,
"title": "Replication Lag (bytes)",
"type": "graph",
"gridPos": {"x": 12, "y": 24, "w": 12, "h": 8},
"targets": [
{
"expr": "pg_replication_lag_bytes",
"legendFormat": "{{slot_name}} - {{application_name}}"
}
]
},
{
"id": 10,
"title": "Database Size (GB)",
"type": "graph",
"gridPos": {"x": 0, "y": 32, "w": 12, "h": 8},
"targets": [
{
"expr": "pg_database_size_bytes / 1024 / 1024 / 1024",
"legendFormat": "{{datname}}"
}
]
},
{
"id": 11,
"title": "Database Size Growth (per hour)",
"type": "graph",
"gridPos": {"x": 12, "y": 32, "w": 12, "h": 8},
"targets": [
{
"expr": "rate(pg_database_size_bytes[1h])",
"legendFormat": "{{datname}} - bytes/hour"
}
]
},
{
"id": 12,
"title": "Lock Counts by Type",
"type": "graph",
"gridPos": {"x": 0, "y": 40, "w": 12, "h": 8},
"targets": [
{
"expr": "pg_locks_count",
"legendFormat": "{{datname}} - {{locktype}} - {{mode}}"
}
]
},
{
"id": 13,
"title": "Query Duration (p95)",
"type": "graph",
"gridPos": {"x": 12, "y": 40, "w": 12, "h": 8},
"targets": [
{
"expr": "histogram_quantile(0.95, rate(pg_query_duration_seconds_bucket[5m]))",
"legendFormat": "p95"
}
]
}
]
}
}
node-exporter-dashboard.json: |
{
"dashboard": {
"title": "Bakery IA - Node Exporter Infrastructure",
"tags": ["bakery-ia", "node-exporter", "infrastructure"],
"timezone": "browser",
"refresh": "15s",
"schemaVersion": 16,
"version": 1,
"panels": [
{
"id": 1,
"title": "CPU Usage by Node",
"type": "graph",
"gridPos": {"x": 0, "y": 0, "w": 12, "h": 8},
"targets": [
{
"expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
"legendFormat": "{{instance}} - {{cpu}}"
}
]
},
{
"id": 2,
"title": "Average CPU Usage",
"type": "stat",
"gridPos": {"x": 12, "y": 0, "w": 6, "h": 4},
"targets": [
{
"expr": "100 - (avg(rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
"legendFormat": "Average CPU %"
}
]
},
{
"id": 3,
"title": "CPU Load (1m, 5m, 15m)",
"type": "stat",
"gridPos": {"x": 18, "y": 0, "w": 6, "h": 4},
"targets": [
{
"expr": "avg(node_load1)",
"legendFormat": "1m"
},
{
"expr": "avg(node_load5)",
"legendFormat": "5m"
},
{
"expr": "avg(node_load15)",
"legendFormat": "15m"
}
]
},
{
"id": 4,
"title": "Memory Usage by Node",
"type": "graph",
"gridPos": {"x": 0, "y": 8, "w": 12, "h": 8},
"targets": [
{
"expr": "100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))",
"legendFormat": "{{instance}}"
}
]
},
{
"id": 5,
"title": "Memory Used (GB)",
"type": "stat",
"gridPos": {"x": 12, "y": 8, "w": 6, "h": 4},
"targets": [
{
"expr": "(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / 1024 / 1024 / 1024",
"legendFormat": "{{instance}}"
}
]
},
{
"id": 6,
"title": "Memory Available (GB)",
"type": "stat",
"gridPos": {"x": 18, "y": 8, "w": 6, "h": 4},
"targets": [
{
"expr": "node_memory_MemAvailable_bytes / 1024 / 1024 / 1024",
"legendFormat": "{{instance}}"
}
]
},
{
"id": 7,
"title": "Disk I/O Read Rate (MB/s)",
"type": "graph",
"gridPos": {"x": 0, "y": 16, "w": 12, "h": 8},
"targets": [
{
"expr": "rate(node_disk_read_bytes_total[5m]) / 1024 / 1024",
"legendFormat": "{{instance}} - {{device}}"
}
]
},
{
"id": 8,
"title": "Disk I/O Write Rate (MB/s)",
"type": "graph",
"gridPos": {"x": 12, "y": 16, "w": 12, "h": 8},
"targets": [
{
"expr": "rate(node_disk_written_bytes_total[5m]) / 1024 / 1024",
"legendFormat": "{{instance}} - {{device}}"
}
]
},
{
"id": 9,
"title": "Disk I/O Operations (IOPS)",
"type": "graph",
"gridPos": {"x": 0, "y": 24, "w": 12, "h": 8},
"targets": [
{
"expr": "rate(node_disk_reads_completed_total[5m]) + rate(node_disk_writes_completed_total[5m])",
"legendFormat": "{{instance}} - {{device}}"
}
]
},
{
"id": 10,
"title": "Network Receive Rate (Mbps)",
"type": "graph",
"gridPos": {"x": 12, "y": 24, "w": 12, "h": 8},
"targets": [
{
"expr": "rate(node_network_receive_bytes_total{device!=\"lo\"}[5m]) * 8 / 1024 / 1024",
"legendFormat": "{{instance}} - {{device}}"
}
]
},
{
"id": 11,
"title": "Network Transmit Rate (Mbps)",
"type": "graph",
"gridPos": {"x": 0, "y": 32, "w": 12, "h": 8},
"targets": [
{
"expr": "rate(node_network_transmit_bytes_total{device!=\"lo\"}[5m]) * 8 / 1024 / 1024",
"legendFormat": "{{instance}} - {{device}}"
}
]
},
{
"id": 12,
"title": "Network Errors",
"type": "graph",
"gridPos": {"x": 12, "y": 32, "w": 12, "h": 8},
"targets": [
{
"expr": "rate(node_network_receive_errs_total[5m]) + rate(node_network_transmit_errs_total[5m])",
"legendFormat": "{{instance}} - {{device}}"
}
]
},
{
"id": 13,
"title": "Filesystem Usage by Mount",
"type": "graph",
"gridPos": {"x": 0, "y": 40, "w": 12, "h": 8},
"targets": [
{
"expr": "100 * (1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes))",
"legendFormat": "{{instance}} - {{mountpoint}}"
}
]
},
{
"id": 14,
"title": "Filesystem Available (GB)",
"type": "stat",
"gridPos": {"x": 12, "y": 40, "w": 6, "h": 4},
"targets": [
{
"expr": "node_filesystem_avail_bytes / 1024 / 1024 / 1024",
"legendFormat": "{{instance}} - {{mountpoint}}"
}
]
},
{
"id": 15,
"title": "Filesystem Size (GB)",
"type": "stat",
"gridPos": {"x": 18, "y": 40, "w": 6, "h": 4},
"targets": [
{
"expr": "node_filesystem_size_bytes / 1024 / 1024 / 1024",
"legendFormat": "{{instance}} - {{mountpoint}}"
}
]
},
{
"id": 16,
"title": "Load Average (1m, 5m, 15m)",
"type": "graph",
"gridPos": {"x": 0, "y": 48, "w": 12, "h": 8},
"targets": [
{
"expr": "node_load1",
"legendFormat": "{{instance}} - 1m"
},
{
"expr": "node_load5",
"legendFormat": "{{instance}} - 5m"
},
{
"expr": "node_load15",
"legendFormat": "{{instance}} - 15m"
}
]
},
{
"id": 17,
"title": "System Up Time",
"type": "stat",
"gridPos": {"x": 12, "y": 48, "w": 12, "h": 8},
"targets": [
{
"expr": "node_boot_time_seconds",
"legendFormat": "{{instance}} - uptime"
}
]
},
{
"id": 18,
"title": "Context Switches",
"type": "graph",
"gridPos": {"x": 0, "y": 56, "w": 12, "h": 8},
"targets": [
{
"expr": "rate(node_context_switches_total[5m])",
"legendFormat": "{{instance}}"
}
]
},
{
"id": 19,
"title": "Interrupts",
"type": "graph",
"gridPos": {"x": 12, "y": 56, "w": 12, "h": 8},
"targets": [
{
"expr": "rate(node_intr_total[5m])",
"legendFormat": "{{instance}}"
}
]
}
]
}
}
alertmanager-dashboard.json: |
{
"dashboard": {
"title": "Bakery IA - AlertManager Monitoring",
"tags": ["bakery-ia", "alertmanager", "alerting"],
"timezone": "browser",
"refresh": "10s",
"schemaVersion": 16,
"version": 1,
"panels": [
{
"id": 1,
"title": "Active Alerts by Severity",
"type": "graph",
"gridPos": {"x": 0, "y": 0, "w": 12, "h": 8},
"targets": [
{
"expr": "count by (severity) (ALERTS{alertstate=\"firing\"})",
"legendFormat": "{{severity}}"
}
]
},
{
"id": 2,
"title": "Total Active Alerts",
"type": "stat",
"gridPos": {"x": 12, "y": 0, "w": 6, "h": 4},
"targets": [
{
"expr": "count(ALERTS{alertstate=\"firing\"})",
"legendFormat": "Active alerts"
}
]
},
{
"id": 3,
"title": "Critical Alerts",
"type": "stat",
"gridPos": {"x": 18, "y": 0, "w": 6, "h": 4},
"targets": [
{
"expr": "count(ALERTS{alertstate=\"firing\", severity=\"critical\"})",
"legendFormat": "Critical"
}
]
},
{
"id": 4,
"title": "Alert Firing Rate (per minute)",
"type": "graph",
"gridPos": {"x": 0, "y": 8, "w": 12, "h": 8},
"targets": [
{
"expr": "rate(alertmanager_alerts_fired_total[1m])",
"legendFormat": "Alerts fired/min"
}
]
},
{
"id": 5,
"title": "Alert Resolution Rate (per minute)",
"type": "graph",
"gridPos": {"x": 12, "y": 8, "w": 12, "h": 8},
"targets": [
{
"expr": "rate(alertmanager_alerts_resolved_total[1m])",
"legendFormat": "Alerts resolved/min"
}
]
},
{
"id": 6,
"title": "Notification Success Rate",
"type": "graph",
"gridPos": {"x": 0, "y": 16, "w": 12, "h": 8},
"targets": [
{
"expr": "100 * (rate(alertmanager_notifications_total{status=\"success\"}[5m]) / rate(alertmanager_notifications_total[5m]))",
"legendFormat": "Success rate %"
}
]
},
{
"id": 7,
"title": "Notification Failures",
"type": "graph",
"gridPos": {"x": 12, "y": 16, "w": 12, "h": 8},
"targets": [
{
"expr": "rate(alertmanager_notifications_total{status=\"failed\"}[5m])",
"legendFormat": "{{integration}}"
}
]
},
{
"id": 8,
"title": "Silenced Alerts",
"type": "stat",
"gridPos": {"x": 0, "y": 24, "w": 6, "h": 4},
"targets": [
{
"expr": "count(ALERTS{alertstate=\"silenced\"})",
"legendFormat": "Silenced"
}
]
},
{
"id": 9,
"title": "AlertManager Cluster Size",
"type": "stat",
"gridPos": {"x": 6, "y": 24, "w": 6, "h": 4},
"targets": [
{
"expr": "count(alertmanager_cluster_peers)",
"legendFormat": "Cluster peers"
}
]
},
{
"id": 10,
"title": "AlertManager Peers",
"type": "stat",
"gridPos": {"x": 12, "y": 24, "w": 6, "h": 4},
"targets": [
{
"expr": "alertmanager_cluster_peers",
"legendFormat": "{{instance}}"
}
]
},
{
"id": 11,
"title": "Cluster Status",
"type": "stat",
"gridPos": {"x": 18, "y": 24, "w": 6, "h": 4},
"targets": [
{
"expr": "up{job=\"alertmanager\"}",
"legendFormat": "{{instance}}"
}
]
},
{
"id": 12,
"title": "Alerts by Group",
"type": "table",
"gridPos": {"x": 0, "y": 28, "w": 12, "h": 8},
"targets": [
{
"expr": "count by (alertname) (ALERTS{alertstate=\"firing\"})",
"format": "table",
"instant": true
}
],
"transformations": [
{
"id": "organize",
"options": {
"excludeByName": {},
"indexByName": {},
"renameByName": {
"alertname": "Alert Name",
"Value": "Count"
}
}
}
]
},
{
"id": 13,
"title": "Alert Duration (p99)",
"type": "graph",
"gridPos": {"x": 12, "y": 28, "w": 12, "h": 8},
"targets": [
{
"expr": "histogram_quantile(0.99, rate(alertmanager_alert_duration_seconds_bucket[5m]))",
"legendFormat": "p99 duration"
}
]
},
{
"id": 14,
"title": "Processing Time",
"type": "graph",
"gridPos": {"x": 0, "y": 36, "w": 12, "h": 8},
"targets": [
{
"expr": "rate(alertmanager_receiver_processing_duration_seconds_sum[5m]) / rate(alertmanager_receiver_processing_duration_seconds_count[5m])",
"legendFormat": "{{receiver}}"
}
]
},
{
"id": 15,
"title": "Memory Usage",
"type": "stat",
"gridPos": {"x": 12, "y": 36, "w": 12, "h": 8},
"targets": [
{
"expr": "process_resident_memory_bytes{job=\"alertmanager\"} / 1024 / 1024",
"legendFormat": "{{instance}} - MB"
}
]
}
]
}
}
business-metrics-dashboard.json: |
{
"dashboard": {
"title": "Bakery IA - Business Metrics & KPIs",
"tags": ["bakery-ia", "business-metrics", "kpis"],
"timezone": "browser",
"refresh": "30s",
"schemaVersion": 16,
"version": 1,
"panels": [
{
"id": 1,
"title": "Requests per Service (Rate)",
"type": "graph",
"gridPos": {"x": 0, "y": 0, "w": 12, "h": 8},
"targets": [
{
"expr": "sum by (service) (rate(http_requests_total[5m]))",
"legendFormat": "{{service}}"
}
]
},
{
"id": 2,
"title": "Total Request Rate",
"type": "stat",
"gridPos": {"x": 12, "y": 0, "w": 6, "h": 4},
"targets": [
{
"expr": "sum(rate(http_requests_total[5m]))",
"legendFormat": "requests/sec"
}
]
},
{
"id": 3,
"title": "Peak Request Rate (5m)",
"type": "stat",
"gridPos": {"x": 18, "y": 0, "w": 6, "h": 4},
"targets": [
{
"expr": "max(sum(rate(http_requests_total[5m])))",
"legendFormat": "Peak requests/sec"
}
]
},
{
"id": 4,
"title": "Error Rates by Service",
"type": "graph",
"gridPos": {"x": 0, "y": 8, "w": 12, "h": 8},
"targets": [
{
"expr": "sum by (service) (rate(http_requests_total{status_code=~\"5..\"}[5m]))",
"legendFormat": "{{service}}"
}
]
},
{
"id": 5,
"title": "Overall Error Rate",
"type": "stat",
"gridPos": {"x": 12, "y": 8, "w": 6, "h": 4},
"targets": [
{
"expr": "100 * (sum(rate(http_requests_total{status_code=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])))",
"legendFormat": "Error %"
}
]
},
{
"id": 6,
"title": "4xx Error Rate",
"type": "stat",
"gridPos": {"x": 18, "y": 8, "w": 6, "h": 4},
"targets": [
{
"expr": "100 * (sum(rate(http_requests_total{status_code=~\"4..\"}[5m])) / sum(rate(http_requests_total[5m])))",
"legendFormat": "4xx %"
}
]
},
{
"id": 7,
"title": "P95 Latency by Service (ms)",
"type": "graph",
"gridPos": {"x": 0, "y": 16, "w": 12, "h": 8},
"targets": [
{
"expr": "histogram_quantile(0.95, sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))) * 1000",
"legendFormat": "{{service}} p95"
}
]
},
{
"id": 8,
"title": "P99 Latency by Service (ms)",
"type": "graph",
"gridPos": {"x": 12, "y": 16, "w": 12, "h": 8},
"targets": [
{
"expr": "histogram_quantile(0.99, sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))) * 1000",
"legendFormat": "{{service}} p99"
}
]
},
{
"id": 9,
"title": "Average Latency (ms)",
"type": "stat",
"gridPos": {"x": 0, "y": 24, "w": 6, "h": 4},
"targets": [
{
"expr": "(sum(rate(http_request_duration_seconds_sum[5m])) / sum(rate(http_request_duration_seconds_count[5m]))) * 1000",
"legendFormat": "Avg latency ms"
}
]
},
{
"id": 10,
"title": "Active Tenants",
"type": "stat",
"gridPos": {"x": 6, "y": 24, "w": 6, "h": 4},
"targets": [
{
"expr": "count(count by (tenant_id) (rate(http_requests_total[5m])))",
"legendFormat": "Active tenants"
}
]
},
{
"id": 11,
"title": "Requests per Tenant",
"type": "stat",
"gridPos": {"x": 12, "y": 24, "w": 12, "h": 4},
"targets": [
{
"expr": "sum by (tenant_id) (rate(http_requests_total[5m]))",
"legendFormat": "Tenant {{tenant_id}}"
}
]
},
{
"id": 12,
"title": "Alert Generation Rate (per minute)",
"type": "graph",
"gridPos": {"x": 0, "y": 32, "w": 12, "h": 8},
"targets": [
{
"expr": "rate(ALERTS_FOR_STATE[1m])",
"legendFormat": "{{alertname}}"
}
]
},
{
"id": 13,
"title": "Training Job Success Rate",
"type": "stat",
"gridPos": {"x": 12, "y": 32, "w": 12, "h": 8},
"targets": [
{
"expr": "100 * (sum(training_job_completed_total{status=\"success\"}) / sum(training_job_completed_total))",
"legendFormat": "Success rate %"
}
]
},
{
"id": 14,
"title": "Training Jobs in Progress",
"type": "stat",
"gridPos": {"x": 0, "y": 40, "w": 6, "h": 4},
"targets": [
{
"expr": "count(training_job_in_progress)",
"legendFormat": "Jobs running"
}
]
},
{
"id": 15,
"title": "Training Job Completion Time (p95, minutes)",
"type": "stat",
"gridPos": {"x": 6, "y": 40, "w": 6, "h": 4},
"targets": [
{
"expr": "histogram_quantile(0.95, training_job_duration_seconds) / 60",
"legendFormat": "p95 minutes"
}
]
},
{
"id": 16,
"title": "Failed Training Jobs",
"type": "stat",
"gridPos": {"x": 12, "y": 40, "w": 6, "h": 4},
"targets": [
{
"expr": "sum(training_job_completed_total{status=\"failed\"})",
"legendFormat": "Failed jobs"
}
]
},
{
"id": 17,
"title": "Total Training Jobs Completed",
"type": "stat",
"gridPos": {"x": 18, "y": 40, "w": 6, "h": 4},
"targets": [
{
"expr": "sum(training_job_completed_total)",
"legendFormat": "Total completed"
}
]
},
{
"id": 18,
"title": "API Health Status",
"type": "table",
"gridPos": {"x": 0, "y": 48, "w": 12, "h": 8},
"targets": [
{
"expr": "up{job=\"bakery-services\"}",
"format": "table",
"instant": true
}
],
"transformations": [
{
"id": "organize",
"options": {
"excludeByName": {},
"indexByName": {},
"renameByName": {
"service": "Service",
"Value": "Status",
"instance": "Instance"
}
}
}
]
},
{
"id": 19,
"title": "Service Success Rate (%)",
"type": "graph",
"gridPos": {"x": 12, "y": 48, "w": 12, "h": 8},
"targets": [
{
"expr": "100 * (1 - (sum by (service) (rate(http_requests_total{status_code=~\"5..\"}[5m])) / sum by (service) (rate(http_requests_total[5m]))))",
"legendFormat": "{{service}}"
}
]
},
{
"id": 20,
"title": "Requests Processed Today",
"type": "stat",
"gridPos": {"x": 0, "y": 56, "w": 12, "h": 4},
"targets": [
{
"expr": "sum(increase(http_requests_total[24h]))",
"legendFormat": "Requests (24h)"
}
]
},
{
"id": 21,
"title": "Distinct Users Today",
"type": "stat",
"gridPos": {"x": 12, "y": 56, "w": 12, "h": 4},
"targets": [
{
"expr": "count(count by (user_id) (increase(http_requests_total{user_id!=\"\"}[24h])))",
"legendFormat": "Users (24h)"
}
]
}
]
}
}

View File

@@ -1,177 +0,0 @@
apiVersion: v1
kind: ConfigMap
metadata:
name: grafana-dashboards
namespace: monitoring
data:
gateway-metrics.json: |
{
"dashboard": {
"title": "Bakery IA - Gateway Metrics",
"tags": ["bakery-ia", "gateway"],
"timezone": "browser",
"panels": [
{
"id": 1,
"title": "Request Rate by Endpoint",
"type": "graph",
"gridPos": {"x": 0, "y": 0, "w": 12, "h": 8},
"targets": [{
"expr": "rate(http_requests_total{service=\"gateway\"}[5m])",
"legendFormat": "{{method}} {{endpoint}}"
}]
},
{
"id": 2,
"title": "P95 Request Latency",
"type": "graph",
"gridPos": {"x": 12, "y": 0, "w": 12, "h": 8},
"targets": [{
"expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{service=\"gateway\"}[5m]))",
"legendFormat": "{{endpoint}} p95"
}]
},
{
"id": 3,
"title": "Error Rate (5xx)",
"type": "graph",
"gridPos": {"x": 0, "y": 8, "w": 12, "h": 8},
"targets": [{
"expr": "rate(http_requests_total{service=\"gateway\",status_code=~\"5..\"}[5m])",
"legendFormat": "{{endpoint}} errors"
}]
},
{
"id": 4,
"title": "Active Requests",
"type": "stat",
"gridPos": {"x": 12, "y": 8, "w": 6, "h": 4},
"targets": [{
"expr": "sum(rate(http_requests_total{service=\"gateway\"}[1m]))"
}]
},
{
"id": 5,
"title": "Authentication Success Rate",
"type": "stat",
"gridPos": {"x": 18, "y": 8, "w": 6, "h": 4},
"targets": [{
"expr": "rate(gateway_auth_responses_total[5m]) / rate(gateway_auth_requests_total[5m]) * 100"
}]
}
],
"refresh": "10s",
"schemaVersion": 16,
"version": 1
}
}
services-overview.json: |
{
"dashboard": {
"title": "Bakery IA - Services Overview",
"tags": ["bakery-ia", "services"],
"timezone": "browser",
"panels": [
{
"id": 1,
"title": "Request Rate by Service",
"type": "graph",
"gridPos": {"x": 0, "y": 0, "w": 12, "h": 8},
"targets": [{
"expr": "sum by (service) (rate(http_requests_total[5m]))",
"legendFormat": "{{service}}"
}]
},
{
"id": 2,
"title": "P99 Latency by Service",
"type": "graph",
"gridPos": {"x": 12, "y": 0, "w": 12, "h": 8},
"targets": [{
"expr": "histogram_quantile(0.99, sum by (service, le) (rate(http_request_duration_seconds_bucket[5m])))",
"legendFormat": "{{service}} p99"
}]
},
{
"id": 3,
"title": "Error Rate by Service",
"type": "graph",
"gridPos": {"x": 0, "y": 8, "w": 24, "h": 8},
"targets": [{
"expr": "sum by (service) (rate(http_requests_total{status_code=~\"5..\"}[5m]))",
"legendFormat": "{{service}}"
}]
},
{
"id": 4,
"title": "Service Health Status",
"type": "table",
"gridPos": {"x": 0, "y": 16, "w": 24, "h": 8},
"targets": [{
"expr": "up{job=\"bakery-services\"}",
"format": "table",
"instant": true
}],
"transformations": [{
"id": "organize",
"options": {
"excludeByName": {},
"indexByName": {},
"renameByName": {
"service": "Service Name",
"Value": "Status"
}
}
}]
}
],
"refresh": "30s",
"schemaVersion": 16,
"version": 1
}
}
circuit-breakers.json: |
{
"dashboard": {
"title": "Bakery IA - Circuit Breakers",
"tags": ["bakery-ia", "reliability"],
"timezone": "browser",
"panels": [
{
"id": 1,
"title": "Circuit Breaker States",
"type": "stat",
"gridPos": {"x": 0, "y": 0, "w": 24, "h": 4},
"targets": [{
"expr": "circuit_breaker_state",
"legendFormat": "{{service}} - {{state}}"
}]
},
{
"id": 2,
"title": "Circuit Breaker Trips",
"type": "graph",
"gridPos": {"x": 0, "y": 4, "w": 12, "h": 8},
"targets": [{
"expr": "rate(circuit_breaker_opened_total[5m])",
"legendFormat": "{{service}}"
}]
},
{
"id": 3,
"title": "Rejected Requests",
"type": "graph",
"gridPos": {"x": 12, "y": 4, "w": 12, "h": 8},
"targets": [{
"expr": "rate(circuit_breaker_rejected_total[5m])",
"legendFormat": "{{service}}"
}]
}
],
"refresh": "10s",
"schemaVersion": 16,
"version": 1
}
}

View File

@@ -1,166 +0,0 @@
---
apiVersion: v1
kind: ConfigMap
metadata:
name: grafana-datasources
namespace: monitoring
data:
prometheus.yaml: |
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
access: proxy
url: http://prometheus:9090
isDefault: true
editable: false
---
apiVersion: v1
kind: ConfigMap
metadata:
name: grafana-dashboards-config
namespace: monitoring
data:
dashboards.yaml: |
apiVersion: 1
providers:
- name: 'default'
orgId: 1
folder: 'Bakery IA'
type: file
disableDeletion: false
updateIntervalSeconds: 10
allowUiUpdates: true
options:
path: /var/lib/grafana/dashboards
- name: 'extended'
orgId: 1
folder: 'Bakery IA - Extended'
type: file
disableDeletion: false
updateIntervalSeconds: 10
allowUiUpdates: true
options:
path: /var/lib/grafana/dashboards-extended
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: grafana
namespace: monitoring
labels:
app: grafana
spec:
replicas: 1
selector:
matchLabels:
app: grafana
template:
metadata:
labels:
app: grafana
spec:
containers:
- name: grafana
image: grafana/grafana:12.3.0
ports:
- containerPort: 3000
name: http
env:
- name: GF_SECURITY_ADMIN_USER
valueFrom:
secretKeyRef:
name: grafana-admin
key: admin-user
- name: GF_SECURITY_ADMIN_PASSWORD
valueFrom:
secretKeyRef:
name: grafana-admin
key: admin-password
- name: GF_SERVER_ROOT_URL
value: "http://monitoring.bakery-ia.local/grafana"
- name: GF_SERVER_SERVE_FROM_SUB_PATH
value: "true"
- name: GF_AUTH_ANONYMOUS_ENABLED
value: "false"
- name: GF_INSTALL_PLUGINS
value: ""
volumeMounts:
- name: grafana-storage
mountPath: /var/lib/grafana
- name: grafana-datasources
mountPath: /etc/grafana/provisioning/datasources
- name: grafana-dashboards-config
mountPath: /etc/grafana/provisioning/dashboards
- name: grafana-dashboards
mountPath: /var/lib/grafana/dashboards
- name: grafana-dashboards-extended
mountPath: /var/lib/grafana/dashboards-extended
resources:
requests:
memory: "256Mi"
cpu: "100m"
limits:
memory: "512Mi"
cpu: "500m"
livenessProbe:
httpGet:
path: /api/health
port: 3000
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /api/health
port: 3000
initialDelaySeconds: 5
periodSeconds: 5
volumes:
- name: grafana-storage
persistentVolumeClaim:
claimName: grafana-storage
- name: grafana-datasources
configMap:
name: grafana-datasources
- name: grafana-dashboards-config
configMap:
name: grafana-dashboards-config
- name: grafana-dashboards
configMap:
name: grafana-dashboards
- name: grafana-dashboards-extended
configMap:
name: grafana-dashboards-extended
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: grafana-storage
namespace: monitoring
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 5Gi
---
apiVersion: v1
kind: Service
metadata:
name: grafana
namespace: monitoring
labels:
app: grafana
spec:
type: ClusterIP
ports:
- port: 3000
targetPort: 3000
protocol: TCP
name: http
selector:
app: grafana

View File

@@ -1,100 +0,0 @@
---
# PodDisruptionBudgets ensure minimum availability during voluntary disruptions
# (node drains, rolling updates, etc.)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: prometheus-pdb
namespace: monitoring
spec:
minAvailable: 1
selector:
matchLabels:
app: prometheus
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: alertmanager-pdb
namespace: monitoring
spec:
minAvailable: 2
selector:
matchLabels:
app: alertmanager
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: grafana-pdb
namespace: monitoring
spec:
minAvailable: 1
selector:
matchLabels:
app: grafana
---
# ResourceQuota limits total resources in monitoring namespace
apiVersion: v1
kind: ResourceQuota
metadata:
name: monitoring-quota
namespace: monitoring
spec:
hard:
# Compute resources
requests.cpu: "10"
requests.memory: "16Gi"
limits.cpu: "20"
limits.memory: "32Gi"
# Storage
persistentvolumeclaims: "10"
requests.storage: "100Gi"
# Object counts
pods: "50"
services: "20"
configmaps: "30"
secrets: "20"
---
# LimitRange sets default resource limits for pods in monitoring namespace
apiVersion: v1
kind: LimitRange
metadata:
name: monitoring-limits
namespace: monitoring
spec:
limits:
# Default container limits
- max:
cpu: "2"
memory: "4Gi"
min:
cpu: "10m"
memory: "16Mi"
default:
cpu: "500m"
memory: "512Mi"
defaultRequest:
cpu: "100m"
memory: "128Mi"
type: Container
# Pod limits
- max:
cpu: "4"
memory: "8Gi"
type: Pod
# PVC limits
- max:
storage: "50Gi"
min:
storage: "1Gi"
type: PersistentVolumeClaim

View File

@@ -1,42 +0,0 @@
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: monitoring-ingress
namespace: monitoring
annotations:
nginx.ingress.kubernetes.io/rewrite-target: /$2
nginx.ingress.kubernetes.io/ssl-redirect: "false"
spec:
rules:
- host: monitoring.bakery-ia.local
http:
paths:
- path: /grafana(/|$)(.*)
pathType: ImplementationSpecific
backend:
service:
name: grafana
port:
number: 3000
- path: /prometheus(/|$)(.*)
pathType: ImplementationSpecific
backend:
service:
name: prometheus-external
port:
number: 9090
- path: /jaeger(/|$)(.*)
pathType: ImplementationSpecific
backend:
service:
name: jaeger-query
port:
number: 16686
- path: /alertmanager(/|$)(.*)
pathType: ImplementationSpecific
backend:
service:
name: alertmanager-external
port:
number: 9093

View File

@@ -1,190 +0,0 @@
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: jaeger
namespace: monitoring
labels:
app: jaeger
spec:
replicas: 1
selector:
matchLabels:
app: jaeger
template:
metadata:
labels:
app: jaeger
spec:
containers:
- name: jaeger
image: jaegertracing/all-in-one:1.51
env:
- name: COLLECTOR_ZIPKIN_HOST_PORT
value: ":9411"
- name: COLLECTOR_OTLP_ENABLED
value: "true"
- name: SPAN_STORAGE_TYPE
value: "badger"
- name: BADGER_EPHEMERAL
value: "false"
- name: BADGER_DIRECTORY_VALUE
value: "/badger/data"
- name: BADGER_DIRECTORY_KEY
value: "/badger/key"
ports:
- containerPort: 5775
protocol: UDP
name: zipkin-compact
- containerPort: 6831
protocol: UDP
name: jaeger-compact
- containerPort: 6832
protocol: UDP
name: jaeger-binary
- containerPort: 5778
protocol: TCP
name: config-rest
- containerPort: 16686
protocol: TCP
name: query
- containerPort: 14250
protocol: TCP
name: grpc
- containerPort: 14268
protocol: TCP
name: c-tchan-trft
- containerPort: 14269
protocol: TCP
name: admin-http
- containerPort: 9411
protocol: TCP
name: zipkin
- containerPort: 4317
protocol: TCP
name: otlp-grpc
- containerPort: 4318
protocol: TCP
name: otlp-http
volumeMounts:
- name: jaeger-storage
mountPath: /badger
resources:
requests:
memory: "512Mi"
cpu: "250m"
limits:
memory: "1Gi"
cpu: "500m"
livenessProbe:
httpGet:
path: /
port: 14269
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /
port: 14269
initialDelaySeconds: 5
periodSeconds: 5
volumes:
- name: jaeger-storage
persistentVolumeClaim:
claimName: jaeger-storage
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: jaeger-storage
namespace: monitoring
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 10Gi
---
apiVersion: v1
kind: Service
metadata:
name: jaeger-query
namespace: monitoring
labels:
app: jaeger
spec:
type: ClusterIP
ports:
- port: 16686
targetPort: 16686
protocol: TCP
name: query
selector:
app: jaeger
---
apiVersion: v1
kind: Service
metadata:
name: jaeger-collector
namespace: monitoring
labels:
app: jaeger
spec:
type: ClusterIP
ports:
- port: 14268
targetPort: 14268
protocol: TCP
name: c-tchan-trft
- port: 14250
targetPort: 14250
protocol: TCP
name: grpc
- port: 9411
targetPort: 9411
protocol: TCP
name: zipkin
- port: 4317
targetPort: 4317
protocol: TCP
name: otlp-grpc
- port: 4318
targetPort: 4318
protocol: TCP
name: otlp-http
selector:
app: jaeger
---
apiVersion: v1
kind: Service
metadata:
name: jaeger-agent
namespace: monitoring
labels:
app: jaeger
spec:
type: ClusterIP
clusterIP: None
ports:
- port: 5775
targetPort: 5775
protocol: UDP
name: zipkin-compact
- port: 6831
targetPort: 6831
protocol: UDP
name: jaeger-compact
- port: 6832
targetPort: 6832
protocol: UDP
name: jaeger-binary
- port: 5778
targetPort: 5778
protocol: TCP
name: config-rest
selector:
app: jaeger

View File

@@ -1,18 +1,20 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
# Minimal Monitoring Infrastructure
# SigNoz is now managed via Helm in the 'signoz' namespace
# This kustomization only maintains:
# - Namespace for legacy resources (if needed)
# - Node exporter for infrastructure metrics
# - PostgreSQL exporter for database metrics
# - Optional OTEL collector (can be disabled if using SigNoz's built-in collector)
resources:
- namespace.yaml
- secrets.yaml
- prometheus.yaml
- alert-rules.yaml
- alertmanager.yaml
- alertmanager-init.yaml
- grafana.yaml
- grafana-dashboards.yaml
- grafana-dashboards-extended.yaml
- postgres-exporter.yaml
# Exporters for metrics collection
- node-exporter.yaml
- jaeger.yaml
- ha-policies.yaml
- ingress.yaml
- postgres-exporter.yaml
# Optional: Keep OTEL collector or use SigNoz's built-in one
# Uncomment if you want a dedicated OTEL collector in monitoring namespace
# - otel-collector.yaml
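
The kustomization above assumes SigNoz is already installed with Helm in the `signoz` namespace. A minimal sketch of that install, assuming the official SigNoz chart repository and a values file named `signoz-values.yaml` (the file name is illustrative):

```bash
# Add the SigNoz chart repository and install into its own namespace
helm repo add signoz https://charts.signoz.io
helm repo update
helm upgrade --install signoz signoz/signoz \
  --namespace signoz --create-namespace \
  -f signoz-values.yaml

# Confirm the core components are running before applying this kustomization
kubectl -n signoz get pods
```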

View File

@@ -0,0 +1,167 @@
---
apiVersion: v1
kind: ConfigMap
metadata:
name: otel-collector-config
namespace: monitoring
data:
otel-collector-config.yaml: |
extensions:
health_check:
endpoint: 0.0.0.0:13133
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
processors:
batch:
timeout: 10s
send_batch_size: 1024
# Memory limiter to prevent OOM
memory_limiter:
check_interval: 1s
limit_mib: 512
spike_limit_mib: 128
exporters:
# Export metrics to Prometheus
prometheus:
endpoint: "0.0.0.0:8889"
namespace: otelcol
const_labels:
source: otel-collector
# Export to SigNoz
otlp/signoz:
endpoint: "signoz-query-service.monitoring.svc.cluster.local:8080"
tls:
insecure: true
# Logging exporter for debugging traces and logs
logging:
loglevel: info
sampling_initial: 5
sampling_thereafter: 200
service:
extensions: [health_check]
pipelines:
# Traces pipeline: receive -> process -> export to SigNoz
traces:
receivers: [otlp]
processors: [memory_limiter, batch]
exporters: [otlp/signoz, logging]
# Metrics pipeline: receive -> process -> export to both Prometheus and SigNoz
metrics:
receivers: [otlp]
processors: [memory_limiter, batch]
exporters: [prometheus, otlp/signoz]
# Logs pipeline: receive -> process -> export to SigNoz
logs:
receivers: [otlp]
processors: [memory_limiter, batch]
exporters: [otlp/signoz, logging]
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: otel-collector
namespace: monitoring
labels:
app: otel-collector
spec:
replicas: 1
selector:
matchLabels:
app: otel-collector
template:
metadata:
labels:
app: otel-collector
spec:
containers:
- name: otel-collector
image: otel/opentelemetry-collector-contrib:0.91.0
args:
- --config=/conf/otel-collector-config.yaml
ports:
- containerPort: 4317
protocol: TCP
name: otlp-grpc
- containerPort: 4318
protocol: TCP
name: otlp-http
- containerPort: 8889
protocol: TCP
name: prometheus
- containerPort: 13133
protocol: TCP
name: health-check
volumeMounts:
- name: otel-collector-config
mountPath: /conf
resources:
requests:
memory: "256Mi"
cpu: "100m"
limits:
memory: "512Mi"
cpu: "500m"
livenessProbe:
httpGet:
path: /
port: 13133
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /
port: 13133
initialDelaySeconds: 5
periodSeconds: 5
volumes:
- name: otel-collector-config
configMap:
name: otel-collector-config
items:
- key: otel-collector-config.yaml
path: otel-collector-config.yaml
---
apiVersion: v1
kind: Service
metadata:
name: otel-collector
namespace: monitoring
labels:
app: otel-collector
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8889"
prometheus.io/path: "/metrics"
spec:
type: ClusterIP
ports:
- port: 4317
targetPort: 4317
protocol: TCP
name: otlp-grpc
- port: 4318
targetPort: 4318
protocol: TCP
name: otlp-http
- port: 8889
targetPort: 8889
protocol: TCP
name: prometheus
selector:
app: otel-collector
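
To confirm the optional collector is healthy and reachable, the health-check port can be probed and applications pointed at it through the standard OpenTelemetry environment variables; a minimal sketch (the application service name below is illustrative):

```bash
# Probe the health_check extension (port 13133) via a temporary port-forward
kubectl -n monitoring port-forward deploy/otel-collector 13133:13133 &
sleep 2 && curl -s http://localhost:13133/ && kill %1

# Point an application at the collector with standard OTel SDK env vars
export OTEL_EXPORTER_OTLP_ENDPOINT="http://otel-collector.monitoring.svc.cluster.local:4317"
export OTEL_EXPORTER_OTLP_PROTOCOL="grpc"
export OTEL_SERVICE_NAME="example-service"   # illustrative name
```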

View File

@@ -1,278 +0,0 @@
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: prometheus
namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: prometheus
rules:
- apiGroups: [""]
resources:
- nodes
- nodes/proxy
- services
- endpoints
- pods
verbs: ["get", "list", "watch"]
- apiGroups:
- extensions
resources:
- ingresses
verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: prometheus
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: prometheus
subjects:
- kind: ServiceAccount
name: prometheus
namespace: monitoring
---
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-config
namespace: monitoring
data:
prometheus.yml: |
global:
scrape_interval: 30s
evaluation_interval: 30s
external_labels:
cluster: 'bakery-ia'
environment: 'production'
# AlertManager configuration
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager-0.alertmanager.monitoring.svc.cluster.local:9093
- alertmanager-1.alertmanager.monitoring.svc.cluster.local:9093
- alertmanager-2.alertmanager.monitoring.svc.cluster.local:9093
# Load alert rules
rule_files:
- '/etc/prometheus/rules/*.yml'
scrape_configs:
# Scrape Prometheus itself
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
# Scrape all bakery-ia services
- job_name: 'bakery-services'
kubernetes_sd_configs:
- role: pod
namespaces:
names:
- bakery-ia
relabel_configs:
# Only scrape pods with metrics port
- source_labels: [__meta_kubernetes_pod_container_port_name]
action: keep
regex: http
# Add service name label
- source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
target_label: service
# Add component label
- source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_component]
target_label: component
# Add pod name
- source_labels: [__meta_kubernetes_pod_name]
target_label: pod
# Set metrics path
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
# Set scrape port
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
# Scrape Kubernetes nodes
- job_name: 'kubernetes-nodes'
kubernetes_sd_configs:
- role: node
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- target_label: __address__
replacement: kubernetes.default.svc:443
- source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __metrics_path__
replacement: /api/v1/nodes/${1}/proxy/metrics
# Scrape AlertManager
- job_name: 'alertmanager'
static_configs:
- targets:
- alertmanager-0.alertmanager.monitoring.svc.cluster.local:9093
- alertmanager-1.alertmanager.monitoring.svc.cluster.local:9093
- alertmanager-2.alertmanager.monitoring.svc.cluster.local:9093
# Scrape PostgreSQL exporter
- job_name: 'postgres-exporter'
static_configs:
- targets: ['postgres-exporter.monitoring.svc.cluster.local:9187']
# Scrape Node Exporter
- job_name: 'node-exporter'
kubernetes_sd_configs:
- role: node
relabel_configs:
- source_labels: [__address__]
regex: '(.*):10250'
replacement: '${1}:9100'
target_label: __address__
- source_labels: [__meta_kubernetes_node_name]
target_label: node
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: prometheus
namespace: monitoring
labels:
app: prometheus
spec:
serviceName: prometheus
replicas: 2
selector:
matchLabels:
app: prometheus
template:
metadata:
labels:
app: prometheus
spec:
serviceAccountName: prometheus
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- prometheus
topologyKey: kubernetes.io/hostname
containers:
- name: prometheus
image: prom/prometheus:v3.0.1
args:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--storage.tsdb.retention.time=30d'
- '--web.console.libraries=/usr/share/prometheus/console_libraries'
- '--web.console.templates=/usr/share/prometheus/consoles'
- '--web.enable-lifecycle'
ports:
- containerPort: 9090
name: web
volumeMounts:
- name: prometheus-config
mountPath: /etc/prometheus
- name: prometheus-rules
mountPath: /etc/prometheus/rules
- name: prometheus-storage
mountPath: /prometheus
resources:
requests:
memory: "1Gi"
cpu: "500m"
limits:
memory: "2Gi"
cpu: "1"
livenessProbe:
httpGet:
path: /-/healthy
port: 9090
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /-/ready
port: 9090
initialDelaySeconds: 5
periodSeconds: 5
volumes:
- name: prometheus-config
configMap:
name: prometheus-config
- name: prometheus-rules
configMap:
name: prometheus-alert-rules
volumeClaimTemplates:
- metadata:
name: prometheus-storage
spec:
accessModes: [ "ReadWriteOnce" ]
resources:
requests:
storage: 20Gi
---
apiVersion: v1
kind: Service
metadata:
name: prometheus
namespace: monitoring
labels:
app: prometheus
spec:
type: ClusterIP
clusterIP: None
ports:
- port: 9090
targetPort: 9090
protocol: TCP
name: web
selector:
app: prometheus
---
apiVersion: v1
kind: Service
metadata:
name: prometheus-external
namespace: monitoring
labels:
app: prometheus
spec:
type: ClusterIP
ports:
- port: 9090
targetPort: 9090
protocol: TCP
name: web
selector:
app: prometheus

View File

@@ -14,9 +14,10 @@ data:
DEBUG: "false"
LOG_LEVEL: "INFO"
# Observability Settings
# Set to "true" when Jaeger/monitoring stack is deployed
ENABLE_TRACING: "false"
# Observability Settings - SigNoz enabled
ENABLE_TRACING: "true"
ENABLE_METRICS: "true"
ENABLE_LOGS: "true"
# Database initialization settings
# IMPORTANT: Services NEVER run migrations - they only verify DB is ready
@@ -286,12 +287,11 @@ data:
LOG_FILE_PATH: "/app/logs"
LOG_ROTATION_SIZE: "100MB"
LOG_RETENTION_DAYS: "30"
PROMETHEUS_ENABLED: "true"
PROMETHEUS_RETENTION: "200h"
HEALTH_CHECK_TIMEOUT: "30"
HEALTH_CHECK_INTERVAL: "30"
PROMETHEUS_RETENTION_DAYS: "30"
GRAFANA_ROOT_URL: "http://monitoring.bakery-ia.local/grafana"
# Monitoring Configuration - SigNoz
SIGNOZ_ROOT_URL: "http://localhost/signoz"
# ================================================================
# DATA COLLECTION SETTINGS
@@ -382,16 +382,20 @@ data:
NOMINATIM_CPU_LIMIT: "4"
# ================================================================
# DISTRIBUTED TRACING (Jaeger/OpenTelemetry)
# OBSERVABILITY - SigNoz (Unified Monitoring)
# ================================================================
JAEGER_COLLECTOR_ENDPOINT: "http://jaeger-collector.monitoring:4317"
JAEGER_AGENT_HOST: "jaeger-agent.monitoring"
JAEGER_AGENT_PORT: "6831"
OTEL_EXPORTER_OTLP_ENDPOINT: "http://jaeger-collector.monitoring:4317"
# OpenTelemetry Configuration - Direct to SigNoz
OTEL_EXPORTER_OTLP_ENDPOINT: "http://signoz-otel-collector.signoz.svc.cluster.local:4317"
OTEL_EXPORTER_OTLP_PROTOCOL: "grpc"
OTEL_SERVICE_NAME: "bakery-ia"
OTEL_RESOURCE_ATTRIBUTES: "deployment.environment=development"
# SigNoz Endpoints
SIGNOZ_ENDPOINT: "http://signoz-query-service.signoz.svc.cluster.local:8080"
SIGNOZ_FRONTEND_URL: "http://signoz-frontend.signoz.svc.cluster.local:3301"
# ================================================================
  # REPLENISHMENT PLANNING SETTINGS
# ================================================================
REPLENISHMENT_PROJECTION_HORIZON_DAYS: "7"
REPLENISHMENT_SERVICE_LEVEL: "0.95"
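
Since services now export OTLP directly to the SigNoz collector, a quick sanity check is to resolve the new endpoints from inside the cluster; a sketch using a throwaway busybox pod:

```bash
# Resolve the SigNoz OTLP collector referenced by OTEL_EXPORTER_OTLP_ENDPOINT
kubectl run otlp-dns-check --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup signoz-otel-collector.signoz.svc.cluster.local

# Resolve the query-service referenced by SIGNOZ_ENDPOINT
kubectl run signoz-dns-check --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup signoz-query-service.signoz.svc.cluster.local
```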

View File

@@ -9,11 +9,14 @@ metadata:
resources:
- ../../base
# Monitoring disabled for dev to save resources
# - ../../base/components/monitoring
# Monitoring enabled for dev environment
- ../../base/components/monitoring
- dev-ingress.yaml
# SigNoz ingress is applied by Tilt (see Tiltfile)
# - signoz-ingress.yaml
# Dev-Prod Parity: Enable HTTPS with self-signed certificates
- dev-certificate.yaml
- monitoring-certificate.yaml
- cluster-issuer-staging.yaml
# Exclude nominatim from dev to save resources
@@ -608,6 +611,39 @@ patches:
limits:
memory: "512Mi"
cpu: "300m"
# Optional exporters resource patches for dev
- target:
group: apps
version: v1
kind: DaemonSet
name: node-exporter
namespace: monitoring
patch: |-
- op: replace
path: /spec/template/spec/containers/0/resources
value:
requests:
memory: "32Mi"
cpu: "25m"
limits:
memory: "64Mi"
cpu: "100m"
- target:
group: apps
version: v1
kind: Deployment
name: postgres-exporter
namespace: monitoring
patch: |-
- op: replace
path: /spec/template/spec/containers/0/resources
value:
requests:
memory: "32Mi"
cpu: "25m"
limits:
memory: "64Mi"
cpu: "100m"
secretGenerator:
- name: dev-secrets
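
Because the dev overlay now re-enables the monitoring component and patches the exporter resources, it can be rendered locally before applying; a sketch assuming the overlay lives at `infrastructure/kubernetes/overlays/dev`:

```bash
# Render the dev overlay without applying it
kubectl kustomize infrastructure/kubernetes/overlays/dev > /tmp/dev-render.yaml

# Check that the exporter resource patches landed on the intended workloads
grep -A8 'name: node-exporter' /tmp/dev-render.yaml | head -n 20
grep -A8 'name: postgres-exporter' /tmp/dev-render.yaml | head -n 20
```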

View File

@@ -0,0 +1,49 @@
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: bakery-dev-monitoring-tls-cert
namespace: monitoring
spec:
# Self-signed certificate for local development
secretName: bakery-ia-tls-cert
# Certificate duration
duration: 2160h # 90 days
renewBefore: 360h # 15 days
# Subject configuration
subject:
organizations:
- Bakery IA Development
# Common name
commonName: localhost
# DNS names this certificate is valid for
dnsNames:
- localhost
- monitoring.bakery-ia.local
# IP addresses (for localhost)
ipAddresses:
- 127.0.0.1
- ::1
# Use self-signed issuer for development
issuerRef:
name: selfsigned-issuer
kind: ClusterIssuer
group: cert-manager.io
# Private key configuration
privateKey:
algorithm: RSA
encoding: PKCS1
size: 2048
# Usages
usages:
- server auth
- client auth
- digital signature
- key encipherment
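
Once cert-manager reconciles this Certificate, the issued secret can be inspected to confirm the SANs and validity window; a minimal sketch:

```bash
# Wait for the Certificate to report Ready
kubectl -n monitoring get certificate bakery-dev-monitoring-tls-cert

# Inspect the issued certificate stored in the referenced secret
kubectl -n monitoring get secret bakery-ia-tls-cert -o jsonpath='{.data.tls\.crt}' \
  | base64 -d | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'
```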

View File

@@ -0,0 +1,39 @@
---
# SigNoz Ingress for Development (localhost)
# SigNoz is deployed via Helm in the 'signoz' namespace
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: signoz-ingress-localhost
namespace: signoz
annotations:
nginx.ingress.kubernetes.io/ssl-redirect: "true"
nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
nginx.ingress.kubernetes.io/rewrite-target: /$2
nginx.ingress.kubernetes.io/use-regex: "true"
spec:
ingressClassName: nginx
tls:
- hosts:
- localhost
secretName: bakery-ia-tls-cert
rules:
- host: localhost
http:
paths:
# SigNoz Frontend UI
- path: /signoz(/|$)(.*)
pathType: ImplementationSpecific
backend:
service:
name: signoz-frontend
port:
number: 3301
# SigNoz Query Service API
- path: /signoz-api(/|$)(.*)
pathType: ImplementationSpecific
backend:
service:
name: signoz-query-service
port:
number: 8080
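
With the self-signed dev certificate in place, the rewrite rules can be exercised from the host; a sketch (`-k` skips verification of the self-signed certificate, and the exact API paths depend on the SigNoz version):

```bash
# SigNoz UI through the ingress; /signoz/... is rewritten to /... for the frontend
curl -k -I https://localhost/signoz/

# Query-service API through the /signoz-api prefix
curl -k -I https://localhost/signoz-api/
```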

View File

@@ -14,6 +14,7 @@ resources:
patchesStrategicMerge:
- storage-patch.yaml
- monitoring-ingress-patch.yaml
labels:
- includeSelectors: true
@@ -21,6 +22,89 @@ labels:
environment: production
tier: production
# SigNoz resource patches for production
patches:
# SigNoz ClickHouse production configuration
- target:
group: apps
version: v1
kind: StatefulSet
name: signoz-clickhouse
namespace: signoz
patch: |-
- op: replace
path: /spec/replicas
value: 2
- op: replace
path: /spec/template/spec/containers/0/resources
value:
requests:
memory: "2Gi"
cpu: "500m"
limits:
memory: "4Gi"
cpu: "1000m"
# SigNoz Query Service production configuration
- target:
group: apps
version: v1
kind: Deployment
name: signoz-query-service
namespace: signoz
patch: |-
- op: replace
path: /spec/replicas
value: 2
- op: replace
path: /spec/template/spec/containers/0/resources
value:
requests:
memory: "1Gi"
cpu: "500m"
limits:
memory: "2Gi"
cpu: "1000m"
# SigNoz AlertManager production configuration
- target:
group: apps
version: v1
kind: Deployment
name: signoz-alertmanager
namespace: signoz
patch: |-
- op: replace
path: /spec/replicas
value: 2
- op: replace
path: /spec/template/spec/containers/0/resources
value:
requests:
memory: "512Mi"
cpu: "250m"
limits:
memory: "1Gi"
cpu: "500m"
# SigNoz Frontend production configuration
- target:
group: apps
version: v1
kind: Deployment
name: signoz-frontend
namespace: signoz
patch: |-
- op: replace
path: /spec/replicas
value: 2
- op: replace
path: /spec/template/spec/containers/0/resources
value:
requests:
memory: "512Mi"
cpu: "250m"
limits:
memory: "1Gi"
cpu: "500m"
images:
- name: bakery/auth-service
newTag: latest
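
The production patches assume the SigNoz workloads created by the Helm release carry exactly these names; it is worth confirming they exist before applying the overlay, roughly:

```bash
# Patch targets must match the names the Helm release actually created
kubectl -n signoz get statefulset signoz-clickhouse
kubectl -n signoz get deployment signoz-query-service signoz-alertmanager signoz-frontend
```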

View File

@@ -17,14 +17,30 @@ data:
REQUEST_TIMEOUT: "30"
MAX_CONNECTIONS: "100"
# Monitoring
PROMETHEUS_ENABLED: "true"
# Monitoring - SigNoz (Unified Observability)
ENABLE_TRACING: "true"
ENABLE_METRICS: "true"
JAEGER_ENABLED: "true"
JAEGER_AGENT_HOST: "jaeger-agent.monitoring.svc.cluster.local"
JAEGER_AGENT_PORT: "6831"
ENABLE_LOGS: "true"
# OpenTelemetry Configuration - Direct to SigNoz
OTEL_EXPORTER_OTLP_ENDPOINT: "http://signoz-otel-collector.signoz.svc.cluster.local:4317"
OTEL_EXPORTER_OTLP_PROTOCOL: "grpc"
OTEL_SERVICE_NAME: "bakery-ia"
OTEL_RESOURCE_ATTRIBUTES: "deployment.environment=production,cluster.name=bakery-ia-prod"
# SigNoz Endpoints
SIGNOZ_ENDPOINT: "http://signoz-query-service.signoz.svc.cluster.local:8080"
SIGNOZ_FRONTEND_URL: "https://monitoring.bakewise.ai/signoz"
SIGNOZ_ROOT_URL: "https://monitoring.bakewise.ai/signoz"
# Rate Limiting (stricter in production)
RATE_LIMIT_ENABLED: "true"
RATE_LIMIT_PER_MINUTE: "60"
# CORS Configuration for Production
CORS_ORIGINS: "https://bakewise.ai"
CORS_ALLOW_CREDENTIALS: "true"
# Frontend Configuration
VITE_API_URL: "/api"
VITE_ENVIRONMENT: "production"
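
After the updated ConfigMap is applied, the new OTel settings should be visible inside the running pods; a sketch, assuming a deployment named `gateway-service` in the `bakery-ia` namespace:

```bash
# Restart so pods pick up the new ConfigMap values, then inspect the environment
kubectl -n bakery-ia rollout restart deployment/gateway-service
kubectl -n bakery-ia rollout status deployment/gateway-service
kubectl -n bakery-ia exec deployment/gateway-service -- env | grep -E 'OTEL_|SIGNOZ_'
```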

View File

@@ -16,7 +16,7 @@ metadata:
# CORS configuration for production
nginx.ingress.kubernetes.io/enable-cors: "true"
nginx.ingress.kubernetes.io/cors-allow-origin: "https://bakery.yourdomain.com,https://api.yourdomain.com"
nginx.ingress.kubernetes.io/cors-allow-origin: "https://bakewise.ai"
nginx.ingress.kubernetes.io/cors-allow-methods: "GET, POST, PUT, DELETE, OPTIONS, PATCH"
nginx.ingress.kubernetes.io/cors-allow-headers: "Content-Type, Authorization, X-Requested-With, Accept, Origin"
nginx.ingress.kubernetes.io/cors-allow-credentials: "true"
@@ -40,12 +40,10 @@ spec:
ingressClassName: nginx
tls:
- hosts:
- bakery.yourdomain.com
- api.yourdomain.com
- monitoring.yourdomain.com
- bakewise.ai
secretName: bakery-ia-prod-tls-cert
rules:
- host: bakery.yourdomain.com
- host: bakewise.ai
http:
paths:
- path: /
@@ -55,7 +53,7 @@ spec:
name: frontend-service
port:
number: 3000
- path: /api
- path: /api/v1
pathType: Prefix
backend:
service:
@@ -63,31 +61,4 @@ spec:
port:
number: 8000
- host: api.yourdomain.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: gateway-service
port:
number: 8000
- host: monitoring.yourdomain.com
http:
paths:
- path: /grafana
pathType: Prefix
backend:
service:
name: grafana-service
port:
number: 3000
- path: /prometheus
pathType: Prefix
backend:
service:
name: prometheus-service
port:
number: 9090
# Monitoring (monitoring.bakewise.ai) is now handled by signoz-ingress.yaml in the signoz namespace

View File

@@ -0,0 +1,78 @@
---
# SigNoz Ingress for Production
# SigNoz is deployed via Helm in the 'signoz' namespace
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: signoz-ingress-prod
namespace: signoz
labels:
app.kubernetes.io/name: signoz
app.kubernetes.io/component: ingress
annotations:
# Nginx ingress controller annotations
nginx.ingress.kubernetes.io/ssl-redirect: "true"
nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
nginx.ingress.kubernetes.io/proxy-body-size: "50m"
nginx.ingress.kubernetes.io/proxy-connect-timeout: "600"
nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
nginx.ingress.kubernetes.io/rewrite-target: /$2
nginx.ingress.kubernetes.io/use-regex: "true"
# CORS configuration
nginx.ingress.kubernetes.io/enable-cors: "true"
nginx.ingress.kubernetes.io/cors-allow-origin: "https://bakewise.ai,https://monitoring.bakewise.ai"
nginx.ingress.kubernetes.io/cors-allow-methods: "GET, POST, PUT, DELETE, OPTIONS, PATCH"
nginx.ingress.kubernetes.io/cors-allow-headers: "Content-Type, Authorization, X-Requested-With, Accept, Origin"
nginx.ingress.kubernetes.io/cors-allow-credentials: "true"
# Security headers
nginx.ingress.kubernetes.io/configuration-snippet: |
more_set_headers "X-Frame-Options: SAMEORIGIN";
more_set_headers "X-Content-Type-Options: nosniff";
more_set_headers "X-XSS-Protection: 1; mode=block";
more_set_headers "Referrer-Policy: strict-origin-when-cross-origin";
# Rate limiting
nginx.ingress.kubernetes.io/limit-rps: "100"
nginx.ingress.kubernetes.io/limit-connections: "50"
# Cert-manager annotations for automatic certificate issuance
cert-manager.io/cluster-issuer: "letsencrypt-production"
cert-manager.io/acme-challenge-type: http01
spec:
ingressClassName: nginx
tls:
- hosts:
- monitoring.bakewise.ai
secretName: signoz-prod-tls-cert
rules:
- host: monitoring.bakewise.ai
http:
paths:
# SigNoz Frontend UI
- path: /signoz(/|$)(.*)
pathType: ImplementationSpecific
backend:
service:
name: signoz-frontend
port:
number: 3301
# SigNoz Query Service API
- path: /signoz-api(/|$)(.*)
pathType: ImplementationSpecific
backend:
service:
name: signoz-query-service
port:
number: 8080
# SigNoz AlertManager
- path: /signoz-alerts(/|$)(.*)
pathType: ImplementationSpecific
backend:
service:
name: signoz-alertmanager
port:
number: 9093
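
Once cert-manager issues the Let's Encrypt certificate for monitoring.bakewise.ai, the redirect, security headers, and certificate status can be spot-checked; a sketch:

```bash
# HTTP should redirect to HTTPS (force-ssl-redirect)
curl -sI http://monitoring.bakewise.ai/signoz/ | head -n 1

# HTTPS responses should carry the headers set via configuration-snippet
curl -sI https://monitoring.bakewise.ai/signoz/ | grep -iE 'x-frame-options|x-content-type-options'

# Certificate and ingress status on the cluster side
kubectl -n signoz get certificate
kubectl -n signoz describe ingress signoz-ingress-prod
```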

View File

@@ -0,0 +1,79 @@
# SigNoz Helm Chart Values - Customized for Bakery IA
# https://github.com/SigNoz/charts
# Global settings
global:
storageClass: "standard"
# Frontend configuration
frontend:
service:
type: ClusterIP
port: 3301
ingress:
enabled: true
hosts:
- host: localhost
paths:
- path: /signoz
pathType: Prefix
annotations:
nginx.ingress.kubernetes.io/rewrite-target: /$2
# Query Service configuration
queryService:
replicaCount: 1
resources:
requests:
cpu: 100m
memory: 256Mi
limits:
cpu: 200m
memory: 512Mi
# AlertManager configuration
alertmanager:
replicaCount: 1
resources:
requests:
cpu: 50m
memory: 128Mi
limits:
cpu: 100m
memory: 256Mi
# ClickHouse configuration
clickhouse:
persistence:
enabled: true
size: 10Gi
resources:
requests:
cpu: 500m
memory: 1Gi
limits:
cpu: 1000m
memory: 2Gi
# OpenTelemetry Collector configuration
otelCollector:
enabled: true
config:
exporters:
otlp:
endpoint: "signoz-query-service:8080"
service:
pipelines:
traces:
receivers: [otlp]
exporters: [otlp]
metrics:
receivers: [otlp]
exporters: [otlp]
logs:
receivers: [otlp]
exporters: [otlp]
# Resource optimization for development
# These can be increased for production
development: true
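
These values can be rendered locally before installation to see what the chart will create from them; a minimal sketch, assuming the values are saved as `signoz-values.yaml`:

```bash
# Render the SigNoz chart with these values without touching the cluster
helm repo add signoz https://charts.signoz.io
helm template signoz signoz/signoz --namespace signoz \
  -f signoz-values.yaml > /tmp/signoz-render.yaml

# Spot-check a few of the overrides, e.g. ClickHouse persistence size
grep -n 'storage: 10Gi' /tmp/signoz-render.yaml
```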