Add signoz
@@ -584,23 +584,39 @@ docker push YOUR_VPS_IP:32000/bakery/auth-service

### Step 2: Update Production Configuration

```bash
# On local machine, edit these files:
The production configuration is already set up for the **bakewise.ai** domain:

**Production URLs:**
- **Main Application:** https://bakewise.ai
- **API Endpoints:** https://bakewise.ai/api/v1/...
- **Monitoring Dashboard:** https://monitoring.bakewise.ai/grafana
- **Prometheus:** https://monitoring.bakewise.ai/prometheus
- **SigNoz (Traces/Metrics/Logs):** https://monitoring.bakewise.ai/signoz
- **AlertManager:** https://monitoring.bakewise.ai/alertmanager

```bash
# Verify the configuration is correct:
grep -A 3 "host:" infrastructure/kubernetes/overlays/prod/prod-ingress.yaml

# Expected output should show:
# - host: bakewise.ai
# - host: monitoring.bakewise.ai

# Verify CORS configuration
grep CORS infrastructure/kubernetes/overlays/prod/prod-configmap.yaml

# Expected: CORS_ORIGINS: "https://bakewise.ai"
```
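
The host check above can also be scripted so that it fails loudly when an expected host is missing. The following is a minimal sketch, assuming the ingress manifest declares hosts on lines of the form `- host: <name>`; `list_ingress_hosts` and `check_ingress_hosts` are hypothetical helper names, not part of the repository.

```bash
# Hypothetical helpers (not part of the repository).

# Print every "host:" value found in a manifest, one per line.
list_ingress_hosts() {
  grep -E '^[[:space:]]*(-[[:space:]]*)?host:' "$1" \
    | sed -E 's/^[[:space:]]*(-[[:space:]]*)?host:[[:space:]]*//; s/[[:space:]]*$//' \
    | tr -d "\"'"
}

# Return 0 only if every expected host is present in the manifest.
check_ingress_hosts() {
  manifest="$1"
  shift
  found="$(list_ingress_hosts "$manifest")"
  for expected in "$@"; do
    if ! printf '%s\n' "$found" | grep -qxF "$expected"; then
      echo "missing host: $expected" >&2
      return 1
    fi
  done
  echo "all expected hosts present"
}
```

A possible invocation would be `check_ingress_hosts infrastructure/kubernetes/overlays/prod/prod-ingress.yaml bakewise.ai monitoring.bakewise.ai`.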

**If using a different domain**, update these files:
```bash
# 1. Update domain names
nano infrastructure/kubernetes/overlays/prod/prod-ingress.yaml
# Replace:
# - bakery.yourdomain.com → bakery.your-actual-domain.com
# - api.yourdomain.com → api.your-actual-domain.com
# - monitoring.yourdomain.com → monitoring.your-actual-domain.com
# - Update CORS origins
# - Update cert-manager email
# Replace bakewise.ai with your domain

# 2. Update ConfigMap
nano infrastructure/kubernetes/overlays/prod/prod-configmap.yaml
# Set:
# - DOMAIN: "your-actual-domain.com"
# - CORS_ORIGINS: "https://bakery.your-actual-domain.com,https://www.your-actual-domain.com"
# Update CORS_ORIGINS

# 3. Verify image names (if using custom registry)
nano infrastructure/kubernetes/overlays/prod/kustomization.yaml
@@ -840,22 +856,96 @@ kubectl logs -n bakery-ia deployment/auth-service | grep -i "email\|smtp"

## Post-Deployment

### Step 1: Enable Monitoring
### Step 1: Access Monitoring Stack

```bash
# Monitoring is already configured, verify it's running
kubectl get pods -n monitoring
Your production monitoring stack provides complete observability with multiple tools:

# Access Grafana
kubectl port-forward -n monitoring svc/grafana 3000:3000
#### Production Monitoring URLs

# Visit http://localhost:3000
# Login: admin / (password from monitoring secrets)

# Check dashboards are working
Access via domain (recommended):
```
https://monitoring.bakewise.ai/grafana # Dashboards & visualization
https://monitoring.bakewise.ai/prometheus # Metrics & queries
https://monitoring.bakewise.ai/signoz # Unified observability platform (traces, metrics, logs)
https://monitoring.bakewise.ai/alertmanager # Alert management
```

### Step 2: Configure Backups
Or via port forwarding (if needed):
```bash
# Grafana
kubectl port-forward -n monitoring svc/grafana 3000:3000 &

# Prometheus
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090 &

# SigNoz
kubectl port-forward -n monitoring svc/signoz-frontend 3301:3301 &

# AlertManager
kubectl port-forward -n monitoring svc/alertmanager-external 9093:9093 &
```
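
Background port-forwards started with `&` keep running until they are killed by hand. One way to manage them together is sketched below; the service names and ports are copied from the commands above, while `KUBECTL`, `FORWARDS`, `forward_args`, and `start_forwards` are illustrative names introduced here, not part of the repository.

```bash
# Sketch: start all four port-forwards and stop them together on exit.
# KUBECTL and FORWARDS are illustrative variables (assumptions).
KUBECTL="${KUBECTL:-kubectl}"
# Entries are service:local_port:remote_port, taken from the commands above.
FORWARDS="grafana:3000:3000 prometheus-external:9090:9090 signoz-frontend:3301:3301 alertmanager-external:9093:9093"

# Print the kubectl arguments for one service:local:remote entry.
forward_args() {
  echo "port-forward -n monitoring svc/${1%%:*} ${1#*:}"
}

# Launch every forward in the background, then wait; Ctrl-C stops them all.
start_forwards() {
  pids=""
  for entry in $FORWARDS; do
    # Intentionally unquoted so the arguments split into separate words.
    $KUBECTL $(forward_args "$entry") &
    pids="$pids $!"
  done
  trap 'kill $pids 2>/dev/null' EXIT
  trap 'exit 130' INT TERM
  wait
}
```

Calling `start_forwards` at the end of a script would keep the forwards open until the script is interrupted, at which point the traps stop the background processes.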

#### Available Dashboards

Log in to Grafana (admin / your-password) and explore:

**Main Dashboards:**
1. **Gateway Metrics** - HTTP request rates, latencies, error rates
2. **Services Overview** - Multi-service health and performance
3. **Circuit Breakers** - Reliability metrics

**Extended Dashboards:**
4. **Service Performance Monitoring (SPM)** - RED metrics from distributed traces
5. **PostgreSQL Database** - Database health, connections, query performance
6. **Node Exporter Infrastructure** - CPU, memory, disk, network per node
7. **AlertManager Monitoring** - Alert tracking and notification status
8. **Business Metrics & KPIs** - Tenant activity, ML jobs, forecasts

#### Quick Health Check

```bash
# Verify all monitoring pods are running
kubectl get pods -n monitoring

# Check Prometheus targets (all should be UP)
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090 &
# Open: http://localhost:9090/targets

# View active alerts (reuses the port-forward started above)
# Open: http://localhost:9090/alerts
```
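
The targets page can also be checked from the command line through the Prometheus HTTP API (`/api/v1/targets`). This is a rough sketch, assuming the port-forward above is running and that Prometheus serves its API at the root path; `PROM_URL`, `count_unhealthy_targets`, and `check_targets` are names introduced here for illustration.

```bash
# Sketch: count scrape targets whose health is not "up".
# PROM_URL is an assumed variable; adjust it if Prometheus uses a route prefix.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Read /api/v1/targets JSON on stdin and print how many targets are not "up".
count_unhealthy_targets() {
  grep -oE '"health":[[:space:]]*"[a-z]+"' \
    | grep -v '"up"' \
    | wc -l \
    | tr -d ' '
}

# Fetch the targets and report; fails if Prometheus is unreachable.
check_targets() {
  body="$(curl -fsS "$PROM_URL/api/v1/targets")" || {
    echo "cannot reach Prometheus at $PROM_URL" >&2
    return 1
  }
  unhealthy="$(printf '%s' "$body" | count_unhealthy_targets)"
  if [ "$unhealthy" -eq 0 ]; then
    echo "all targets UP"
  else
    echo "$unhealthy target(s) not UP" >&2
    return 1
  fi
}
```

The text-based parsing avoids a dependency on a JSON tool; if `jq` is available on the machine, it would be the more robust choice.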

### Step 2: Configure Alerting

Update AlertManager with your notification email addresses:

```bash
# Edit alertmanager configuration
kubectl edit configmap -n monitoring alertmanager-config

# Update recipient emails in the routes section:
# - alerts@bakewise.ai (general alerts)
# - critical-alerts@bakewise.ai (critical issues)
# - oncall@bakewise.ai (on-call rotation)
```
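
After editing, it can help to confirm that every intended recipient actually appears in the ConfigMap. The sketch below is one way to do that; `list_recipients` and `missing_recipients` are hypothetical helpers, and the check only looks for email-shaped strings, so it says nothing about whether the routes themselves are correct.

```bash
# Hypothetical helpers (not part of the repository).

# Read a config on stdin and print each distinct email-like token.
list_recipients() {
  grep -oE '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+' | sort -u
}

# Read a config on stdin and print every expected address that is absent.
# Matching is on whole addresses, so alerts@... does not match critical-alerts@...
missing_recipients() {
  found="$(list_recipients)"
  for address in "$@"; do
    printf '%s\n' "$found" | grep -qxF "$address" || echo "$address"
  done
}
```

For example, `kubectl get configmap -n monitoring alertmanager-config -o yaml | missing_recipients alerts@bakewise.ai critical-alerts@bakewise.ai oncall@bakewise.ai` prints nothing when all three addresses are present.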

Test alert delivery:
```bash
# Trigger a test alert with a short-lived memory stress pod (600M for 300s)
kubectl run memory-test --image=polinux/stress --restart=Never \
  --namespace=bakery-ia -- stress --vm 1 --vm-bytes 600M --timeout 300s

# Check that the alert appears in AlertManager
# https://monitoring.bakewise.ai/alertmanager

# Verify that the email notification was received

# Clean up test
kubectl delete pod memory-test -n bakery-ia
```
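
Instead of refreshing the AlertManager page by hand, the test can poll the AlertManager API (`/api/v2/alerts`) until an alert shows up. This is a sketch under assumptions: it presumes a port-forward to `svc/alertmanager-external` on port 9093 with the API at the root path, and `AM_URL`, `alert_names`, and `wait_for_alert` are names made up for this example.

```bash
# Sketch: wait until AlertManager lists at least one alert.
# AM_URL is an assumed variable pointing at a local port-forward.
AM_URL="${AM_URL:-http://localhost:9093}"

# Read /api/v2/alerts JSON on stdin and print the distinct alert names.
alert_names() {
  grep -oE '"alertname":[[:space:]]*"[^"]+"' \
    | sed -E 's/^"alertname":[[:space:]]*"//; s/"$//' \
    | sort -u
}

# Poll every 10 seconds; the first argument is the number of attempts (default 30).
wait_for_alert() {
  attempts="${1:-30}"
  while [ "$attempts" -gt 0 ]; do
    names="$(curl -fsS "$AM_URL/api/v2/alerts" | alert_names)"
    if [ -n "$names" ]; then
      printf '%s\n' "$names"
      return 0
    fi
    attempts=$((attempts - 1))
    sleep 10
  done
  echo "no alerts seen" >&2
  return 1
}
```

Which alert name appears depends on the alert rules deployed in the cluster, so the sketch reports whatever is listed rather than waiting for a specific name.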

### Step 3: Configure Backups

```bash
# Create backup script on VPS
@@ -902,26 +992,82 @@ kubectl edit configmap -n monitoring alertmanager-config
# Update recipient emails in the routes section
```

### Step 4: Document Everything
### Step 4: Verify Monitoring is Working

Create a runbook with:
- [ ] VPS login credentials (stored securely)
Before proceeding, ensure all monitoring components are operational:

```bash
# 1. Check Prometheus targets
# Open: https://monitoring.bakewise.ai/prometheus/targets
# All targets should show "UP" status

# 2. Verify Grafana dashboards load data
# Open: https://monitoring.bakewise.ai/grafana
# Navigate to any dashboard and verify metrics are displaying

# 3. Check SigNoz is receiving traces
# Open: https://monitoring.bakewise.ai/signoz
# Search for traces from "gateway" service

# 4. Verify AlertManager cluster
# Open: https://monitoring.bakewise.ai/alertmanager
# Check that all 3 AlertManager instances are connected
```
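
Before opening each page in a browser, a quick reachability pass over the four URLs can catch DNS, TLS, or ingress problems. The following sketch only confirms that each endpoint answers with a 2xx or 3xx status; it does not verify that targets are UP or that traces are arriving. `MONITORING_URLS`, `is_reachable_status`, and `check_monitoring_urls` are illustrative names.

```bash
# Sketch: reachability check for the monitoring URLs listed in this guide.
MONITORING_URLS="https://monitoring.bakewise.ai/grafana https://monitoring.bakewise.ai/prometheus https://monitoring.bakewise.ai/signoz https://monitoring.bakewise.ai/alertmanager"

# Treat any 2xx or 3xx status as reachable (a login redirect still counts).
is_reachable_status() {
  case "$1" in
    2??|3??) return 0 ;;
    *) return 1 ;;
  esac
}

# Print one OK/FAIL line per URL; return non-zero if any URL failed.
check_monitoring_urls() {
  failed=0
  for url in $MONITORING_URLS; do
    # curl prints 000 when the connection itself fails.
    code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url")"
    if is_reachable_status "$code"; then
      echo "OK   $code $url"
    else
      echo "FAIL $code $url"
      failed=1
    fi
  done
  return "$failed"
}
```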

### Step 5: Document Everything

Create a secure runbook with all credentials and procedures:

**Essential Information to Document:**
- [ ] VPS login credentials (stored securely in password manager)
- [ ] Database passwords (in password manager)
- [ ] Domain registrar access
- [ ] Grafana admin password
- [ ] Domain registrar access (for bakewise.ai)
- [ ] Cloudflare access
- [ ] Email service credentials
- [ ] Email service credentials (SMTP)
- [ ] WhatsApp API credentials
- [ ] Docker Hub / Registry credentials
- [ ] Emergency contact information
- [ ] Rollback procedures
- [ ] Monitoring URLs and access procedures

### Step 5: Train Your Team
### Step 6: Train Your Team

- [ ] Show team how to access Grafana dashboards
- [ ] Demonstrate how to check logs: `kubectl logs`
- [ ] Explain how to restart services if needed
- [ ] Share this documentation with the team
- [ ] Setup on-call rotation (if applicable)
Conduct a training session covering:

- [ ] **Access monitoring dashboards**
  - Show how to log in to https://monitoring.bakewise.ai/grafana
  - Walk through key dashboards (Services Overview, Database, Infrastructure)
  - Explain how to interpret metrics and identify issues

- [ ] **Check application logs**
  ```bash
  # View logs for a service
  kubectl logs -n bakery-ia deployment/orders-service --tail=100 -f

  # Search for errors
  kubectl logs -n bakery-ia deployment/gateway | grep ERROR
  ```

- [ ] **Restart services when needed**
  ```bash
  # Restart a service (rolling update, no downtime)
  kubectl rollout restart deployment/orders-service -n bakery-ia
  ```

- [ ] **Respond to alerts**
  - Show how to access AlertManager at https://monitoring.bakewise.ai/alertmanager
  - Review common alerts and their resolution steps
  - Reference the [Production Operations Guide](./PRODUCTION_OPERATIONS_GUIDE.md)

- [ ] **Share documentation**
  - [PILOT_LAUNCH_GUIDE.md](./PILOT_LAUNCH_GUIDE.md) - This guide
  - [PRODUCTION_OPERATIONS_GUIDE.md](./PRODUCTION_OPERATIONS_GUIDE.md) - Daily operations
  - [security-checklist.md](./security-checklist.md) - Security procedures

- [ ] **Set up on-call rotation** (if applicable)
  - Configure in AlertManager
  - Document escalation procedures
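
For the restart step in the training list, it is worth showing the team how to confirm that a rollout actually finished. A minimal sketch follows, assuming the `bakery-ia` namespace used throughout this guide; `restart_and_wait`, `KUBECTL`, and `NAMESPACE` are names introduced for this example only.

```bash
# Sketch: restart a deployment and block until the rollout completes.
# KUBECTL and NAMESPACE are illustrative variables (assumptions).
KUBECTL="${KUBECTL:-kubectl}"
NAMESPACE="${NAMESPACE:-bakery-ia}"

# Usage: restart_and_wait <deployment> [timeout], e.g. restart_and_wait orders-service 120s
restart_and_wait() {
  deployment="$1"
  timeout="${2:-120s}"
  "$KUBECTL" rollout restart "deployment/$deployment" -n "$NAMESPACE" || return 1
  # rollout status exits non-zero if the rollout does not finish within the timeout.
  "$KUBECTL" rollout status "deployment/$deployment" -n "$NAMESPACE" --timeout="$timeout"
}
```

Setting `KUBECTL=echo` prints the commands without running them, which is convenient for a dry run during the training session.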

---

@@ -1050,16 +1196,25 @@ kubectl scale deployment monitoring -n bakery-ia --replicas=0

## Support Resources

- **Full Monitoring Guide:** [MONITORING_DEPLOYMENT_SUMMARY.md](./MONITORING_DEPLOYMENT_SUMMARY.md)
- **Operations Guide:** [PRODUCTION_OPERATIONS_GUIDE.md](./PRODUCTION_OPERATIONS_GUIDE.md)
- **Security Guide:** [security-checklist.md](./security-checklist.md)
- **Database Security:** [database-security.md](./database-security.md)
- **TLS Configuration:** [tls-configuration.md](./tls-configuration.md)
**Documentation:**
- **Operations Guide:** [PRODUCTION_OPERATIONS_GUIDE.md](./PRODUCTION_OPERATIONS_GUIDE.md) - Daily operations, monitoring, incident response
- **Security Guide:** [security-checklist.md](./security-checklist.md) - Security procedures and compliance
- **Database Security:** [database-security.md](./database-security.md) - Database operations and TLS configuration
- **TLS Configuration:** [tls-configuration.md](./tls-configuration.md) - Certificate management
- **RBAC Implementation:** [rbac-implementation.md](./rbac-implementation.md) - Access control

**Monitoring Access:**
- **Grafana:** https://monitoring.bakewise.ai/grafana (admin / your-password)
- **Prometheus:** https://monitoring.bakewise.ai/prometheus
- **SigNoz:** https://monitoring.bakewise.ai/signoz
- **AlertManager:** https://monitoring.bakewise.ai/alertmanager

**External Resources:**
- **MicroK8s Docs:** https://microk8s.io/docs
- **Kubernetes Docs:** https://kubernetes.io/docs
- **Let's Encrypt:** https://letsencrypt.org/docs
- **Cloudflare DNS:** https://developers.cloudflare.com/dns
- **Monitoring Stack README:** infrastructure/kubernetes/base/components/monitoring/README.md

---