Add CI/CD and fix multiple-pod issues
1206  CI_CD_IMPLEMENTATION_PLAN.md  Normal file
File diff suppressed because it is too large
413  INFRASTRUCTURE_REORGANIZATION_PROPOSAL.md  Normal file
@@ -0,0 +1,413 @@
# Infrastructure Reorganization Proposal for Bakery-IA

## Executive Summary

This document presents a comprehensive analysis of the current infrastructure organization and proposes a restructured layout that improves maintainability, scalability, and operational efficiency. The proposal is based on a detailed examination of the existing 177 files across 31 directories in the infrastructure folder.

## Current Infrastructure Analysis

### Current Structure Overview

```
infrastructure/
├── ci-cd/        # 18 files  - CI/CD pipeline components
├── helm/         # 8 files   - Helm charts and scripts
├── kubernetes/   # 103 files - Kubernetes manifests and configs
├── signoz/       # 11 files  - Monitoring dashboards and scripts
└── tls/          # 37 files  - TLS certificates and generation scripts
```

### Key Findings

1. **Kubernetes Base Components (103 files)**: The most complex area, with:
   - 20+ service deployments across 15+ microservices
   - 20+ database configurations (PostgreSQL, RabbitMQ, MinIO)
   - 19 migration jobs for different services
   - Infrastructure components (gateway, monitoring, etc.)

2. **CI/CD Pipeline (18 files)**:
   - Tekton tasks and pipelines for the GitOps workflow
   - Flux CD configuration for continuous delivery
   - Gitea configuration for Git repository management

3. **Monitoring (11 files)**:
   - SigNoz dashboards for comprehensive observability
   - Import scripts for dashboard management

4. **TLS Certificates (37 files)**:
   - CA certificates and generation scripts
   - Service-specific certificates (PostgreSQL, Redis, MinIO)
   - Certificate signing requests and configurations

### Strengths of Current Organization

1. **Logical Grouping**: Components are generally well grouped by function
2. **Base/Overlay Pattern**: Kubernetes uses a proper base/overlay structure
3. **Comprehensive Monitoring**: SigNoz dashboards cover all major aspects
4. **Security Focus**: Dedicated TLS certificate management

### Challenges Identified

1. **Complexity in Kubernetes Base**: 103 files make navigation difficult
2. **Mixed Component Types**: Services, databases, and infrastructure are mixed together
3. **Limited Environment Separation**: Only dev/prod overlays, no staging
4. **Script Scattering**: Automation scripts are spread across directories
5. **Documentation Gaps**: Some components lack clear documentation

## Proposed Infrastructure Organization

### High-Level Structure

```
infrastructure/
├── environments/   # Environment-specific configurations
├── platform/       # Platform-level infrastructure
├── services/       # Application services and microservices
├── monitoring/     # Observability and monitoring
├── cicd/           # CI/CD pipeline components
├── security/       # Security configurations and certificates
├── scripts/        # Automation and utility scripts
├── docs/           # Infrastructure documentation
└── README.md       # Top-level infrastructure guide
```

### Detailed Structure Proposal

```
infrastructure/
├── environments/                    # Environment-specific configurations
│   ├── dev/
│   │   ├── k8s-manifests/
│   │   │   ├── base/
│   │   │   │   ├── namespace.yaml
│   │   │   │   ├── configmap.yaml
│   │   │   │   ├── secrets.yaml
│   │   │   │   └── ingress-https.yaml
│   │   │   ├── components/
│   │   │   │   ├── databases/
│   │   │   │   ├── infrastructure/
│   │   │   │   ├── microservices/
│   │   │   │   └── cert-manager/
│   │   │   ├── configs/
│   │   │   ├── cronjobs/
│   │   │   ├── jobs/
│   │   │   └── migrations/
│   │   ├── kustomization.yaml
│   │   └── values/
│   ├── staging/                     # New staging environment
│   │   ├── k8s-manifests/
│   │   └── values/
│   └── prod/
│       ├── k8s-manifests/
│       ├── terraform/               # Production-specific IaC
│       └── values/
├── platform/                        # Platform-level infrastructure
│   ├── cluster/
│   │   ├── eks/                     # AWS EKS configuration
│   │   │   ├── terraform/
│   │   │   └── manifests/
│   │   └── kind/                    # Local development cluster
│   │       ├── config.yaml
│   │       └── manifests/
│   ├── networking/
│   │   ├── dns/
│   │   ├── load-balancers/
│   │   └── ingress/
│   │       ├── nginx/
│   │       └── cert-manager/
│   ├── security/
│   │   ├── rbac/
│   │   ├── network-policies/
│   │   └── tls/
│   │       ├── ca/
│   │       ├── postgres/
│   │       ├── redis/
│   │       └── minio/
│   └── storage/
│       ├── postgres/
│       ├── redis/
│       └── minio/
├── services/                        # Application services
│   ├── databases/
│   │   ├── postgres/
│   │   │   ├── k8s-manifests/
│   │   │   ├── backups/
│   │   │   ├── monitoring/
│   │   │   └── maintenance/
│   │   ├── redis/
│   │   │   ├── configs/
│   │   │   └── monitoring/
│   │   └── minio/
│   │       ├── buckets/
│   │       └── policies/
│   ├── api-gateway/
│   │   ├── k8s-manifests/
│   │   └── configs/
│   └── microservices/
│       ├── auth/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── tenant/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── training/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── forecasting/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── sales/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── external/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── notification/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── inventory/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── recipes/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── suppliers/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── pos/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── orders/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── production/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── procurement/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── orchestrator/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── alert-processor/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── ai-insights/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── demo-session/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       └── frontend/
│           ├── k8s-manifests/
│           └── configs/
├── monitoring/                      # Observability stack
│   ├── signoz/
│   │   ├── manifests/
│   │   ├── dashboards/
│   │   │   ├── alert-management.json
│   │   │   ├── api-performance.json
│   │   │   ├── application-performance.json
│   │   │   ├── database-performance.json
│   │   │   ├── error-tracking.json
│   │   │   ├── index.json
│   │   │   ├── infrastructure-monitoring.json
│   │   │   ├── log-analysis.json
│   │   │   ├── system-health.json
│   │   │   └── user-activity.json
│   │   ├── values-dev.yaml
│   │   ├── values-prod.yaml
│   │   ├── deploy-signoz.sh
│   │   ├── verify-signoz.sh
│   │   └── generate-test-traffic.sh
│   └── opentelemetry/
│       ├── collector/
│       └── agent/
├── cicd/                            # CI/CD pipeline
│   ├── gitea/
│   │   ├── values.yaml
│   │   └── ingress.yaml
│   ├── tekton/
│   │   ├── tasks/
│   │   │   ├── git-clone.yaml
│   │   │   ├── detect-changes.yaml
│   │   │   ├── kaniko-build.yaml
│   │   │   └── update-gitops.yaml
│   │   ├── pipelines/
│   │   └── triggers/
│   └── flux/
│       ├── git-repository.yaml
│       └── kustomization.yaml
├── security/                        # Security configurations
│   ├── policies/
│   │   ├── network-policies.yaml
│   │   ├── pod-security.yaml
│   │   └── rbac.yaml
│   ├── certificates/
│   │   ├── ca/
│   │   ├── services/
│   │   └── rotation-scripts/
│   ├── scanning/
│   │   ├── trivy/
│   │   └── policies/
│   └── compliance/
│       ├── cis-benchmarks/
│       └── audit-scripts/
├── scripts/                         # Automation scripts
│   ├── setup/
│   │   ├── generate-certificates.sh
│   │   ├── generate-minio-certificates.sh
│   │   └── setup-dockerhub-secrets.sh
│   ├── deployment/
│   │   ├── deploy-signoz.sh
│   │   └── verify-signoz.sh
│   ├── maintenance/
│   │   ├── regenerate_migrations_k8s.sh
│   │   └── kubernetes_restart.sh
│   └── verification/
│       └── verify-registry.sh
├── docs/                            # Infrastructure documentation
│   ├── architecture/
│   │   ├── diagrams/
│   │   └── decisions/
│   ├── operations/
│   │   ├── runbooks/
│   │   └── troubleshooting/
│   ├── onboarding/
│   └── reference/
│       ├── api/
│       └── configurations/
└── README.md
```

## Migration Strategy

### Phase 1: Preparation and Planning

1. **Inventory Analysis**: Complete a detailed inventory of all current files
2. **Dependency Mapping**: Identify dependencies between components
3. **Impact Assessment**: Determine which components can be moved safely
4. **Backup Strategy**: Ensure all files are backed up before migration
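
A lightweight way to satisfy the backup requirement before anything moves (a sketch; the tag, branch, and archive names are illustrative, not an agreed convention):

```bash
# Record the pre-reorganization state so any move can be reverted
git tag pre-infra-reorg
git checkout -b infra/reorganization

# Optionally keep an out-of-tree archive of the current infrastructure folder
tar czf ../infrastructure-backup-$(date +%Y%m%d).tar.gz infrastructure/
```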

### Phase 2: Non-Critical Components

1. **Documentation**: Move and update all documentation files
2. **Scripts**: Organize automation scripts into the new structure
3. **Monitoring**: Migrate SigNoz dashboards and configurations
4. **CI/CD**: Reorganize pipeline components

### Phase 3: Environment-Specific Components

1. **Create Environment Structure**: Set up dev/staging/prod directories
2. **Migrate Kubernetes Manifests**: Move base components to appropriate locations
3. **Update References**: Ensure all cross-references are corrected
4. **Environment Validation**: Test each environment separately

### Phase 4: Service Components

1. **Database Migration**: Move database configurations to services/databases
2. **Microservice Organization**: Group microservices by domain
3. **Infrastructure Components**: Move gateway and other infrastructure
4. **Service Validation**: Test each service in isolation

### Phase 5: Finalization

1. **Integration Testing**: Test the complete infrastructure workflow
2. **Documentation Update**: Finalize all documentation
3. **Team Training**: Conduct training on the new structure
4. **Cleanup**: Remove the old structure and temporary files

## Benefits of Proposed Structure

### 1. Improved Navigation
- **Clear Hierarchy**: Logical grouping by function and environment
- **Consistent Patterns**: Standardized structure across all components
- **Reduced Cognitive Load**: Easier to find specific components

### 2. Enhanced Maintainability
- **Environment Isolation**: Clear separation of dev/staging/prod
- **Component Grouping**: Related components grouped together
- **Standardized Structure**: Consistent patterns across services

### 3. Better Scalability
- **Modular Design**: Easy to add new services or environments
- **Domain Separation**: Services organized by business domain
- **Infrastructure Independence**: Platform components separate from services

### 4. Improved Security
- **Centralized Security**: All security configurations in one place
- **Environment-Specific Policies**: Tailored security for each environment
- **Better Secret Management**: Clear structure for sensitive data

### 5. Enhanced Observability
- **Comprehensive Monitoring**: All observability tools grouped together
- **Standardized Dashboards**: Consistent monitoring across services
- **Centralized Logging**: Better log management structure

## Implementation Considerations

### Tools and Technologies
- **Terraform**: For infrastructure as code (IaC)
- **Kustomize**: For Kubernetes manifest management
- **Helm**: For complex application deployments
- **SOPS/Sealed Secrets**: For secret management
- **Trivy**: For vulnerability scanning
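
As a sketch of how the proposed `environments/` layout would be consumed with Kustomize (the paths below follow the proposal, not the current repository layout):

```bash
# Render the dev overlay locally to validate it before applying
microk8s kubectl kustomize infrastructure/environments/dev/k8s-manifests > /tmp/dev-rendered.yaml

# Apply the overlay; -k points at the directory containing kustomization.yaml
microk8s kubectl apply -k infrastructure/environments/dev/k8s-manifests
```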

### Team Adaptation
- **Training Plan**: Develop comprehensive training materials
- **Migration Guide**: Create step-by-step migration documentation
- **Support Period**: Provide dedicated support during the transition
- **Feedback Mechanism**: Establish channels for team feedback

### Risk Mitigation
- **Phased Approach**: Implement changes incrementally
- **Rollback Plan**: Develop comprehensive rollback procedures
- **Testing Strategy**: Implement thorough testing at each phase
- **Monitoring**: Enhanced monitoring during the migration period

## Expected Outcomes

1. **Reduced Time-to-Find**: An estimated 40-60% reduction in time spent locating files
2. **Improved Deployment Speed**: An estimated 25-35% faster deployment cycles
3. **Enhanced Collaboration**: Better team coordination and understanding
4. **Reduced Errors**: An estimated 30-50% reduction in configuration errors
5. **Better Scalability**: Easier to add new services and features

## Conclusion

The proposed infrastructure reorganization represents a significant improvement over the current structure. By implementing a clear, logical hierarchy with proper separation of concerns, the new organization will:

- **Improve operational efficiency** through better navigation and maintainability
- **Enhance security** with centralized security management
- **Support growth** with a scalable, modular design
- **Reduce errors** through standardized patterns and structures
- **Facilitate collaboration** with intuitive organization

The key to successful implementation is a phased approach with thorough testing and team involvement at each stage. With proper planning and execution, this reorganization will provide long-term benefits for the Bakery-IA project's infrastructure management.

## Appendix: File Migration Mapping

### Current → Proposed Mapping

**Kubernetes Components:**
- `infrastructure/kubernetes/base/components/*` → `infrastructure/services/microservices/*/`
- `infrastructure/kubernetes/base/components/databases/*` → `infrastructure/services/databases/*/`
- `infrastructure/kubernetes/base/migrations/*` → `infrastructure/services/microservices/*/migrations/`
- `infrastructure/kubernetes/base/configs/*` → `infrastructure/environments/*/values/`

**CI/CD Components:**
- `infrastructure/ci-cd/*` → `infrastructure/cicd/*/`

**Monitoring Components:**
- `infrastructure/signoz/*` → `infrastructure/monitoring/signoz/*/`
- `infrastructure/helm/*` → `infrastructure/monitoring/signoz/*/` (SigNoz-related files only)

**Security Components:**
- `infrastructure/tls/*` → `infrastructure/security/certificates/*/`

**Scripts:**
- `infrastructure/kubernetes/*.sh` → `infrastructure/scripts/*/`
- `infrastructure/helm/*.sh` → `infrastructure/scripts/deployment/*/`
- `infrastructure/tls/*.sh` → `infrastructure/scripts/setup/*/`

This mapping provides a clear path for migrating each component to its new location while maintaining functionality and the relationships between components.
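
As one hedged illustration of executing the mapping while preserving Git history (the `auth` paths and filename below are examples only; each move should be followed by a search for stale path references):

```bash
# Move an auth service manifest into the proposed layout, keeping history
# (the source filename is illustrative)
mkdir -p infrastructure/services/microservices/auth/k8s-manifests
git mv infrastructure/kubernetes/base/components/auth-service-deployment.yaml \
       infrastructure/services/microservices/auth/k8s-manifests/

# Find references to the old path that still need updating
grep -rn "kubernetes/base/components/auth" infrastructure/ || true
```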

294  infrastructure/ci-cd/README.md  Normal file
@@ -0,0 +1,294 @@
# Bakery-IA CI/CD Implementation

This directory contains the configuration for the production-grade CI/CD system for Bakery-IA using Gitea, Tekton, and Flux CD.

## Architecture Overview

```mermaid
graph TD
    A[Developer] -->|Push Code| B[Gitea]
    B -->|Webhook| C[Tekton Pipelines]
    C -->|Build/Test| D[Gitea Registry]
    D -->|New Image| E[Flux CD]
    E -->|kubectl apply| F[MicroK8s Cluster]
    F -->|Metrics| G[SigNoz]
```

## Directory Structure

```
infrastructure/ci-cd/
├── gitea/                      # Gitea configuration (Git server + registry)
│   ├── values.yaml             # Helm values for Gitea
│   └── ingress.yaml            # Ingress configuration
├── tekton/                     # Tekton CI/CD pipeline configuration
│   ├── tasks/                  # Individual pipeline tasks
│   │   ├── git-clone.yaml
│   │   ├── detect-changes.yaml
│   │   ├── kaniko-build.yaml
│   │   └── update-gitops.yaml
│   ├── pipelines/              # Pipeline definitions
│   │   └── ci-pipeline.yaml
│   └── triggers/               # Webhook trigger configuration
│       ├── trigger-template.yaml
│       ├── trigger-binding.yaml
│       ├── event-listener.yaml
│       └── gitlab-interceptor.yaml
├── flux/                       # Flux CD GitOps configuration
│   ├── git-repository.yaml     # Git repository source
│   └── kustomization.yaml      # Deployment kustomization
├── monitoring/                 # Monitoring configuration
│   └── otel-collector.yaml     # OpenTelemetry collector
└── README.md                   # This file
```

## Deployment Instructions

### Phase 1: Infrastructure Setup

1. **Deploy Gitea**:
   ```bash
   # Add Helm repo
   microk8s helm repo add gitea https://dl.gitea.io/charts

   # Create namespace
   microk8s kubectl create namespace gitea

   # Install Gitea
   microk8s helm install gitea gitea/gitea \
     -n gitea \
     -f infrastructure/ci-cd/gitea/values.yaml

   # Apply ingress
   microk8s kubectl apply -f infrastructure/ci-cd/gitea/ingress.yaml
   ```

2. **Deploy Tekton**:
   ```bash
   # Create namespace
   microk8s kubectl create namespace tekton-pipelines

   # Install Tekton Pipelines
   microk8s kubectl apply -f https://storage.googleapis.com/tekton-releases/pipeline/latest/release.yaml

   # Install Tekton Triggers
   microk8s kubectl apply -f https://storage.googleapis.com/tekton-releases/triggers/latest/release.yaml

   # Apply Tekton configurations
   microk8s kubectl apply -f infrastructure/ci-cd/tekton/tasks/
   microk8s kubectl apply -f infrastructure/ci-cd/tekton/pipelines/
   microk8s kubectl apply -f infrastructure/ci-cd/tekton/triggers/
   ```
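
   The triggers manifests applied above also reference a few objects that these steps do not create. A hedged sketch of the missing pieces (the interceptors manifest and ClusterRole names follow the upstream Tekton Triggers examples and may need adjusting to the installed version):

   ```bash
   # Core interceptors service used by gitlab-interceptor.yaml (shipped
   # separately from release.yaml in recent Tekton Triggers versions)
   microk8s kubectl apply -f https://storage.googleapis.com/tekton-releases/triggers/latest/interceptors.yaml

   # ServiceAccount referenced by the EventListener, plus the role bindings
   # the upstream examples grant it
   microk8s kubectl create serviceaccount tekton-triggers-sa -n tekton-pipelines
   microk8s kubectl create rolebinding tekton-triggers-sa-roles \
     -n tekton-pipelines \
     --clusterrole=tekton-triggers-eventlistener-roles \
     --serviceaccount=tekton-pipelines:tekton-triggers-sa
   microk8s kubectl create clusterrolebinding tekton-triggers-sa-clusterroles \
     --clusterrole=tekton-triggers-eventlistener-clusterroles \
     --serviceaccount=tekton-pipelines:tekton-triggers-sa
   ```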

3. **Deploy Flux CD** (already enabled in MicroK8s):
   ```bash
   # Verify Flux installation
   microk8s kubectl get pods -n flux-system

   # Apply Flux configurations
   microk8s kubectl apply -f infrastructure/ci-cd/flux/
   ```

### Phase 2: Configuration

1. **Set up Gitea webhook**:
   - Go to your Gitea repository settings
   - Add a webhook with URL: `http://tekton-triggers.tekton-pipelines.svc.cluster.local:8080`
   - Use the secret from `gitea-webhook-secret`
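
   The webhook secret itself is not created anywhere above; a minimal sketch (the secret name and `secretToken` key match the EventListener interceptor configuration added in this commit, and the token value is a placeholder):

   ```bash
   # Secret consumed by the EventListener's gitlab interceptor
   microk8s kubectl create secret generic gitea-webhook-secret \
     -n tekton-pipelines \
     --from-literal=secretToken=your-webhook-token
   ```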

2. **Configure registry credentials**:
   ```bash
   # Create registry credentials secret
   microk8s kubectl create secret docker-registry gitea-registry-credentials \
     -n tekton-pipelines \
     --docker-server=gitea.bakery-ia.local:5000 \
     --docker-username=your-username \
     --docker-password=your-password
   ```

3. **Configure Git credentials for Flux**:
   ```bash
   # Create Git credentials secret
   microk8s kubectl create secret generic gitea-credentials \
     -n flux-system \
     --from-literal=username=your-username \
     --from-literal=password=your-password
   ```

### Phase 3: Monitoring

```bash
# Apply OpenTelemetry configuration
microk8s kubectl apply -f infrastructure/ci-cd/monitoring/otel-collector.yaml
```
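
Note that `otel-collector.yaml` defines an `OpenTelemetryCollector` resource, so the corresponding CRD must already exist in the cluster. A hedged sketch of installing the OpenTelemetry Operator that provides it (the operator in turn expects cert-manager to be available):

```bash
# Install the OpenTelemetry Operator, which provides the OpenTelemetryCollector CRD
microk8s kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/latest/download/opentelemetry-operator.yaml
```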

## Usage

### Triggering a Pipeline

1. **Manual trigger**:
   ```bash
   # Create a PipelineRun manually
   microk8s kubectl create -f - <<EOF
   apiVersion: tekton.dev/v1beta1
   kind: PipelineRun
   metadata:
     name: manual-ci-run
     namespace: tekton-pipelines
   spec:
     pipelineRef:
       name: bakery-ia-ci
     workspaces:
       - name: shared-workspace
         volumeClaimTemplate:
           spec:
             accessModes: ["ReadWriteOnce"]
             resources:
               requests:
                 storage: 5Gi
       - name: docker-credentials
         secret:
           secretName: gitea-registry-credentials
     params:
       - name: git-url
         value: "http://gitea.bakery-ia.local/bakery/bakery-ia.git"
       - name: git-revision
         value: "main"
   EOF
   ```

2. **Automatic trigger**: Push code to the repository and the webhook will trigger the pipeline automatically.

### Monitoring Pipeline Runs

```bash
# List all PipelineRuns
microk8s kubectl get pipelineruns -n tekton-pipelines

# View logs for a specific PipelineRun
microk8s kubectl logs -n tekton-pipelines <pipelinerun-pod> -c <step-name>

# View Tekton dashboard
microk8s kubectl port-forward -n tekton-pipelines svc/tekton-dashboard 9097:9097
```
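
The Tekton dashboard targeted by the port-forward above is not installed in Phase 1; if it is missing, it can be added from the upstream release manifest (a sketch, not part of the committed setup):

```bash
# Install the Tekton dashboard so the port-forward above has a target service
microk8s kubectl apply -f https://storage.googleapis.com/tekton-releases/dashboard/latest/release.yaml
```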

## Troubleshooting

### Common Issues

1. **Pipeline not triggering**:
   - Check Gitea webhook logs
   - Verify EventListener pods are running
   - Check TriggerBinding configuration

2. **Build failures**:
   - Check Kaniko logs for build errors
   - Verify Dockerfile paths are correct
   - Ensure registry credentials are valid

3. **Flux not applying changes**:
   - Check GitRepository status
   - Verify Kustomization reconciliation
   - Check Flux logs for errors

### Debugging Commands

```bash
# Check Tekton controller logs
microk8s kubectl logs -n tekton-pipelines -l app=tekton-pipelines-controller

# Check Flux reconciliation
microk8s kubectl get kustomizations -n flux-system -o yaml

# Check Gitea webhook delivery
microk8s kubectl logs -n tekton-pipelines -l app=tekton-triggers-controller
```

## Security Considerations

1. **Secrets Management**:
   - Use Kubernetes secrets for sensitive data
   - Rotate credentials regularly
   - Use RBAC for namespace isolation

2. **Network Security**:
   - Configure network policies
   - Use internal DNS names
   - Restrict ingress access

3. **Registry Security**:
   - Enable image scanning
   - Use image signing
   - Implement cleanup policies

## Maintenance

### Upgrading Components

```bash
# Upgrade Tekton
microk8s kubectl apply -f https://storage.googleapis.com/tekton-releases/pipeline/latest/release.yaml

# Upgrade Flux
microk8s helm upgrade fluxcd fluxcd/flux2 -n flux-system

# Upgrade Gitea
microk8s helm upgrade gitea gitea/gitea -n gitea -f infrastructure/ci-cd/gitea/values.yaml
```

### Backup Procedures

```bash
# Backup Gitea
microk8s kubectl exec -n gitea gitea-0 -- gitea dump -c /data/gitea/conf/app.ini

# Backup Flux configurations
microk8s kubectl get all -n flux-system -o yaml > flux-backup.yaml

# Backup Tekton configurations
microk8s kubectl get all -n tekton-pipelines -o yaml > tekton-backup.yaml
```

## Performance Optimization

1. **Resource Management**:
   - Set appropriate resource limits
   - Limit concurrent builds
   - Use node selectors for build pods

2. **Caching**:
   - Configure Kaniko cache
   - Use persistent volumes for dependencies
   - Cache Docker layers

3. **Parallelization**:
   - Build independent services in parallel
   - Use matrix builds for different architectures
   - Optimize task dependencies
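
For the Kaniko cache mentioned above, the executor's built-in caching flags could be enabled in the build task; a hedged sketch (the cache repository is an example path in the Gitea registry, not an existing repository):

```bash
# Enable Kaniko layer caching by appending these executor args in
# infrastructure/ci-cd/tekton/tasks/kaniko-build.yaml:
#
#   - --cache=true
#   - --cache-repo=gitea.bakery-ia.local:5000/bakery/kaniko-cache
#
# Then re-apply the task definition:
microk8s kubectl apply -f infrastructure/ci-cd/tekton/tasks/kaniko-build.yaml
```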

## Integration with Existing System

The CI/CD system integrates with:
- **SigNoz**: For monitoring and observability
- **MicroK8s**: For cluster management
- **Existing Kubernetes manifests**: In `infrastructure/kubernetes/`
- **Current services**: All 19 microservices in `services/`

## Migration Plan

1. **Phase 1**: Set up infrastructure (Gitea, Tekton, Flux)
2. **Phase 2**: Configure pipelines and triggers
3. **Phase 3**: Test with non-critical services
4. **Phase 4**: Gradual rollout to all services
5. **Phase 5**: Decommission old deployment methods

## Support

For issues with the CI/CD system:
- Check logs and monitoring first
- Review the troubleshooting section
- Consult the original implementation plan
- Refer to component documentation:
  - [Tekton Documentation](https://tekton.dev/docs/)
  - [Flux CD Documentation](https://fluxcd.io/docs/)
  - [Gitea Documentation](https://docs.gitea.io/)

16  infrastructure/ci-cd/flux/git-repository.yaml  Normal file
@@ -0,0 +1,16 @@
# Flux GitRepository for Bakery-IA
# This resource tells Flux where to find the Git repository

apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: bakery-ia
  namespace: flux-system
spec:
  interval: 1m
  url: http://gitea.bakery-ia.local/bakery/bakery-ia.git
  ref:
    branch: main
  secretRef:
    name: gitea-credentials
  timeout: 60s

27  infrastructure/ci-cd/flux/kustomization.yaml  Normal file
@@ -0,0 +1,27 @@
# Flux Kustomization for Bakery-IA Production Deployment
# This resource tells Flux how to deploy the application

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: bakery-ia-prod
  namespace: flux-system
spec:
  interval: 5m
  path: ./infrastructure/kubernetes/overlays/prod
  prune: true
  sourceRef:
    kind: GitRepository
    name: bakery-ia
  targetNamespace: bakery-ia
  timeout: 5m
  retryInterval: 1m
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: auth-service
      namespace: bakery-ia
    - apiVersion: apps/v1
      kind: Deployment
      name: gateway
      namespace: bakery-ia

25  infrastructure/ci-cd/gitea/ingress.yaml  Normal file
@@ -0,0 +1,25 @@
# Gitea Ingress configuration for Bakery-IA CI/CD
# This provides external access to Gitea within the cluster

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: gitea-ingress
  namespace: gitea
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
    nginx.ingress.kubernetes.io/proxy-body-size: "0"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
spec:
  rules:
    - host: gitea.bakery-ia.local
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: gitea-http
                port:
                  number: 3000
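
Since `gitea.bakery-ia.local` is not a public DNS name, clients need local name resolution before this ingress is reachable; a hedged sketch for a single-node MicroK8s host (adjust the address if the ingress is exposed elsewhere):

```bash
# Point the Gitea hostname at the local ingress for development access
echo "127.0.0.1 gitea.bakery-ia.local" | sudo tee -a /etc/hosts
```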

38  infrastructure/ci-cd/gitea/values.yaml  Normal file
@@ -0,0 +1,38 @@
# Gitea Helm values configuration for Bakery-IA CI/CD
# This configuration sets up Gitea with registry support and appropriate storage

service:
  type: ClusterIP
  httpPort: 3000
  sshPort: 2222

persistence:
  enabled: true
  size: 50Gi
  storageClass: "microk8s-hostpath"

gitea:
  config:
    server:
      DOMAIN: gitea.bakery-ia.local
      SSH_DOMAIN: gitea.bakery-ia.local
      ROOT_URL: http://gitea.bakery-ia.local
    repository:
      ENABLE_PUSH_CREATE_USER: true
      ENABLE_PUSH_CREATE_ORG: true
    registry:
      ENABLED: true

postgresql:
  enabled: true
  persistence:
    size: 20Gi

# Resource configuration for production environment
resources:
  limits:
    cpu: 1000m
    memory: 1Gi
  requests:
    cpu: 500m
    memory: 512Mi

70  infrastructure/ci-cd/monitoring/otel-collector.yaml  Normal file
@@ -0,0 +1,70 @@
# OpenTelemetry Collector for Bakery-IA CI/CD Monitoring
# This collects metrics and traces from Tekton pipelines

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: tekton-otel
  namespace: tekton-pipelines
spec:
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
      prometheus:
        config:
          scrape_configs:
            - job_name: 'tekton-pipelines'
              scrape_interval: 30s
              static_configs:
                - targets: ['tekton-pipelines-controller.tekton-pipelines.svc.cluster.local:9090']

    processors:
      batch:
        timeout: 5s
        send_batch_size: 1000
      memory_limiter:
        check_interval: 2s
        limit_percentage: 75
        spike_limit_percentage: 20

    exporters:
      otlp:
        endpoint: "signoz-otel-collector.monitoring.svc.cluster.local:4317"
        tls:
          insecure: true
        retry_on_failure:
          enabled: true
          initial_interval: 5s
          max_interval: 30s
          max_elapsed_time: 300s
      logging:
        logLevel: debug

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [otlp, logging]
        metrics:
          receivers: [otlp, prometheus]
          processors: [memory_limiter, batch]
          exporters: [otlp, logging]
      telemetry:
        logs:
          level: "info"
          encoding: "json"

  mode: deployment
  resources:
    limits:
      cpu: 500m
      memory: 512Mi
    requests:
      cpu: 200m
      memory: 256Mi

83  infrastructure/ci-cd/tekton/pipelines/ci-pipeline.yaml  Normal file
@@ -0,0 +1,83 @@
# Main CI Pipeline for Bakery-IA
# This pipeline orchestrates the build, test, and deploy process

apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: bakery-ia-ci
  namespace: tekton-pipelines
spec:
  workspaces:
    - name: shared-workspace
    - name: docker-credentials
  params:
    - name: git-url
      type: string
      description: Repository URL
    - name: git-revision
      type: string
      description: Git revision/commit hash
    - name: registry
      type: string
      description: Container registry URL
      default: "gitea.bakery-ia.local:5000"
  tasks:
    - name: fetch-source
      taskRef:
        name: git-clone
      workspaces:
        - name: output
          workspace: shared-workspace
      params:
        - name: url
          value: $(params.git-url)
        - name: revision
          value: $(params.git-revision)

    - name: detect-changes
      runAfter: [fetch-source]
      taskRef:
        name: detect-changed-services
      workspaces:
        - name: source
          workspace: shared-workspace

    - name: build-and-push
      runAfter: [detect-changes]
      taskRef:
        name: kaniko-build
      when:
        - input: "$(tasks.detect-changes.results.changed-services)"
          operator: notin
          values: ["none"]
      workspaces:
        - name: source
          workspace: shared-workspace
        - name: docker-credentials
          workspace: docker-credentials
      params:
        - name: services
          value: $(tasks.detect-changes.results.changed-services)
        - name: registry
          value: $(params.registry)
        - name: git-revision
          value: $(params.git-revision)

    - name: update-gitops-manifests
      runAfter: [build-and-push]
      taskRef:
        name: update-gitops
      when:
        - input: "$(tasks.detect-changes.results.changed-services)"
          operator: notin
          values: ["none"]
      workspaces:
        - name: source
          workspace: shared-workspace
      params:
        - name: services
          value: $(tasks.detect-changes.results.changed-services)
        - name: registry
          value: $(params.registry)
        - name: git-revision
          value: $(params.git-revision)

64  infrastructure/ci-cd/tekton/tasks/detect-changes.yaml  Normal file
@@ -0,0 +1,64 @@
# Tekton Detect Changed Services Task for Bakery-IA CI/CD
# This task identifies which services have changed in the repository

apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: detect-changed-services
  namespace: tekton-pipelines
spec:
  workspaces:
    - name: source
  results:
    - name: changed-services
      description: Comma-separated list of changed services
  steps:
    - name: detect
      image: alpine/git
      script: |
        #!/bin/sh
        # POSIX sh (busybox ash) - avoid bash-only arrays and [[ ]]
        set -e
        cd $(workspaces.source.path)

        echo "Detecting changed files..."
        # Get list of changed files compared to previous commit
        CHANGED_FILES=$(git diff --name-only HEAD~1 HEAD 2>/dev/null || git diff --name-only HEAD)

        echo "Changed files: $CHANGED_FILES"

        # Map files to services (space-separated list instead of a bash array)
        CHANGED_SERVICES=""
        for file in $CHANGED_FILES; do
          case "$file" in
            services/*)
              SERVICE=$(echo "$file" | cut -d'/' -f2)
              # Only add unique service names
              case " $CHANGED_SERVICES " in
                *" $SERVICE "*) ;;
                *) CHANGED_SERVICES="$CHANGED_SERVICES $SERVICE" ;;
              esac
              ;;
            frontend/*)
              CHANGED_SERVICES="$CHANGED_SERVICES frontend"
              break
              ;;
            gateway/*)
              CHANGED_SERVICES="$CHANGED_SERVICES gateway"
              break
              ;;
          esac
        done

        # If no specific services changed, check for infrastructure changes
        if [ -z "$CHANGED_SERVICES" ]; then
          for file in $CHANGED_FILES; do
            case "$file" in
              infrastructure/*)
                CHANGED_SERVICES="infrastructure"
                break
                ;;
            esac
          done
        fi

        # Output result as a comma-separated list
        if [ -z "$CHANGED_SERVICES" ]; then
          echo "No service changes detected"
          echo "none" | tee $(results.changed-services.path)
        else
          echo "Detected changes in services:$CHANGED_SERVICES"
          echo "$CHANGED_SERVICES" | sed 's/^ *//' | tr ' ' ',' | tee $(results.changed-services.path)
        fi

31  infrastructure/ci-cd/tekton/tasks/git-clone.yaml  Normal file
@@ -0,0 +1,31 @@
# Tekton Git Clone Task for Bakery-IA CI/CD
# This task clones the source code repository

apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: git-clone
  namespace: tekton-pipelines
spec:
  workspaces:
    - name: output
  params:
    - name: url
      type: string
      description: Repository URL to clone
    - name: revision
      type: string
      description: Git revision to checkout
      default: "main"
  steps:
    - name: clone
      image: alpine/git
      script: |
        #!/bin/sh
        set -e
        echo "Cloning repository: $(params.url)"
        git clone $(params.url) $(workspaces.output.path)
        cd $(workspaces.output.path)
        echo "Checking out revision: $(params.revision)"
        git checkout $(params.revision)
        echo "Repository cloned successfully"

40  infrastructure/ci-cd/tekton/tasks/kaniko-build.yaml  Normal file
@@ -0,0 +1,40 @@
# Tekton Kaniko Build Task for Bakery-IA CI/CD
# This task builds and pushes container images using Kaniko

apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: kaniko-build
  namespace: tekton-pipelines
spec:
  workspaces:
    - name: source
    - name: docker-credentials
  params:
    - name: services
      type: string
      description: Comma-separated list of services to build
    - name: registry
      type: string
      description: Container registry URL
      default: "gitea.bakery-ia.local:5000"
    - name: git-revision
      type: string
      description: Git revision for image tag
      default: "latest"
  steps:
    - name: build-and-push
      image: gcr.io/kaniko-project/executor:v1.9.0
      args:
        - --dockerfile=$(workspaces.source.path)/services/$(params.services)/Dockerfile
        - --context=$(workspaces.source.path)
        - --destination=$(params.registry)/bakery/$(params.services):$(params.git-revision)
        - --verbosity=info
      volumeMounts:
        - name: docker-config
          mountPath: /kaniko/.docker
      securityContext:
        runAsUser: 0
  volumes:
    - name: docker-config
      emptyDir: {}

66  infrastructure/ci-cd/tekton/tasks/update-gitops.yaml  Normal file
@@ -0,0 +1,66 @@
# Tekton Update GitOps Manifests Task for Bakery-IA CI/CD
# This task updates Kubernetes manifests with new image tags

apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: update-gitops
  namespace: tekton-pipelines
spec:
  workspaces:
    - name: source
  params:
    - name: services
      type: string
      description: Comma-separated list of services to update
    - name: registry
      type: string
      description: Container registry URL
    - name: git-revision
      type: string
      description: Git revision for image tag
  steps:
    - name: update-manifests
      image: bitnami/kubectl
      script: |
        #!/bin/sh
        set -e
        cd $(workspaces.source.path)

        echo "Updating GitOps manifests for services: $(params.services)"

        # Split the comma-separated service list (POSIX sh, no bash arrays)
        for service in $(echo "$(params.services)" | tr ',' ' '); do
          echo "Processing service: $service"

          # Find and update Kubernetes manifests
          if [ "$service" = "frontend" ]; then
            # Update frontend deployment
            if [ -f "infrastructure/kubernetes/overlays/prod/frontend-deployment.yaml" ]; then
              sed -i "s|image:.*|image: $(params.registry)/bakery/frontend:$(params.git-revision)|g" \
                "infrastructure/kubernetes/overlays/prod/frontend-deployment.yaml"
            fi
          elif [ "$service" = "gateway" ]; then
            # Update gateway deployment
            if [ -f "infrastructure/kubernetes/overlays/prod/gateway-deployment.yaml" ]; then
              sed -i "s|image:.*|image: $(params.registry)/bakery/gateway:$(params.git-revision)|g" \
                "infrastructure/kubernetes/overlays/prod/gateway-deployment.yaml"
            fi
          else
            # Update service deployment
            DEPLOYMENT_FILE="infrastructure/kubernetes/overlays/prod/${service}-deployment.yaml"
            if [ -f "$DEPLOYMENT_FILE" ]; then
              sed -i "s|image:.*|image: $(params.registry)/bakery/${service}:$(params.git-revision)|g" \
                "$DEPLOYMENT_FILE"
            fi
          fi
        done

        # Commit changes
        git config --global user.name "bakery-ia-ci"
        git config --global user.email "ci@bakery-ia.local"
        git add .
        git commit -m "CI: Update image tags for $(params.services) to $(params.git-revision)"
        git push origin HEAD

26  infrastructure/ci-cd/tekton/triggers/event-listener.yaml  Normal file
@@ -0,0 +1,26 @@
# Tekton EventListener for Bakery-IA CI/CD
# This listener receives webhook events and triggers pipelines

apiVersion: triggers.tekton.dev/v1alpha1
kind: EventListener
metadata:
  name: bakery-ia-listener
  namespace: tekton-pipelines
spec:
  serviceAccountName: tekton-triggers-sa
  triggers:
    - name: bakery-ia-gitea-trigger
      bindings:
        - ref: bakery-ia-trigger-binding
      template:
        ref: bakery-ia-trigger-template
      interceptors:
        - ref:
            name: "gitlab"
          params:
            - name: "secretRef"
              value:
                secretName: gitea-webhook-secret
                secretKey: secretToken
            - name: "eventTypes"
              value: ["push"]

14  infrastructure/ci-cd/tekton/triggers/gitlab-interceptor.yaml  Normal file
@@ -0,0 +1,14 @@
# GitLab/Gitea Webhook Interceptor for Tekton Triggers
# This interceptor validates and processes Gitea webhook events

apiVersion: triggers.tekton.dev/v1alpha1
kind: ClusterInterceptor
metadata:
  name: gitlab
spec:
  clientConfig:
    service:
      name: tekton-triggers-core-interceptors
      namespace: tekton-pipelines
      path: "/v1/webhook/gitlab"
      port: 8443

16  infrastructure/ci-cd/tekton/triggers/trigger-binding.yaml  Normal file
@@ -0,0 +1,16 @@
# Tekton TriggerBinding for Bakery-IA CI/CD
# This binding extracts parameters from Gitea webhook events

apiVersion: triggers.tekton.dev/v1alpha1
kind: TriggerBinding
metadata:
  name: bakery-ia-trigger-binding
  namespace: tekton-pipelines
spec:
  params:
    - name: git-repo-url
      value: $(body.repository.clone_url)
    - name: git-revision
      value: $(body.head_commit.id)
    - name: git-repo-name
      value: $(body.repository.name)

43  infrastructure/ci-cd/tekton/triggers/trigger-template.yaml  Normal file
@@ -0,0 +1,43 @@
# Tekton TriggerTemplate for Bakery-IA CI/CD
# This template defines how PipelineRuns are created when triggers fire
# Note: current Tekton Triggers versions require the $(tt.params.*) syntax
# inside resourcetemplates

apiVersion: triggers.tekton.dev/v1alpha1
kind: TriggerTemplate
metadata:
  name: bakery-ia-trigger-template
  namespace: tekton-pipelines
spec:
  params:
    - name: git-repo-url
      description: The git repository URL
    - name: git-revision
      description: The git revision/commit hash
    - name: git-repo-name
      description: The git repository name
      default: "bakery-ia"
  resourcetemplates:
    - apiVersion: tekton.dev/v1beta1
      kind: PipelineRun
      metadata:
        generateName: bakery-ia-ci-run-$(tt.params.git-repo-name)-
      spec:
        pipelineRef:
          name: bakery-ia-ci
        workspaces:
          - name: shared-workspace
            volumeClaimTemplate:
              spec:
                accessModes: ["ReadWriteOnce"]
                resources:
                  requests:
                    storage: 5Gi
          - name: docker-credentials
            secret:
              secretName: gitea-registry-credentials
        params:
          - name: git-url
            value: $(tt.params.git-repo-url)
          - name: git-revision
            value: $(tt.params.git-revision)
          - name: registry
            value: "gitea.bakery-ia.local:5000"

@@ -18,6 +18,28 @@ class OrchestratorService(StandardFastAPIService):

    expected_migration_version = "001_initial_schema"

    def __init__(self):
        # Define expected database tables for health checks
        orchestrator_expected_tables = [
            'orchestration_runs'
        ]

        self.rabbitmq_client = None
        self.event_publisher = None
        self.leader_election = None
        self.scheduler_service = None

        super().__init__(
            service_name="orchestrator-service",
            app_name=settings.APP_NAME,
            description=settings.DESCRIPTION,
            version=settings.VERSION,
            api_prefix="",  # Empty because RouteBuilder already includes /api/v1
            database_manager=database_manager,
            expected_tables=orchestrator_expected_tables,
            enable_messaging=True  # Enable RabbitMQ for event publishing
        )

    async def verify_migrations(self):
        """Verify database schema matches the latest migrations"""
        try:

@@ -32,26 +54,6 @@ class OrchestratorService(StandardFastAPIService):
            self.logger.error(f"Migration verification failed: {e}")
            raise

    def __init__(self):
        # Define expected database tables for health checks
        orchestrator_expected_tables = [
            'orchestration_runs'
        ]

        self.rabbitmq_client = None
        self.event_publisher = None

        super().__init__(
            service_name="orchestrator-service",
            app_name=settings.APP_NAME,
            description=settings.DESCRIPTION,
            version=settings.VERSION,
            api_prefix="",  # Empty because RouteBuilder already includes /api/v1
            database_manager=database_manager,
            expected_tables=orchestrator_expected_tables,
            enable_messaging=True  # Enable RabbitMQ for event publishing
        )

    async def _setup_messaging(self):
        """Setup messaging for orchestrator service"""
        from shared.messaging import UnifiedEventPublisher, RabbitMQClient

@@ -84,22 +86,91 @@ class OrchestratorService(StandardFastAPIService):

        self.logger.info("Orchestrator Service starting up...")

        # Initialize orchestrator scheduler service with EventPublisher
        from app.services.orchestrator_service import OrchestratorSchedulerService
        scheduler_service = OrchestratorSchedulerService(self.event_publisher, settings)
        await scheduler_service.start()
        app.state.scheduler_service = scheduler_service
        self.logger.info("Orchestrator scheduler service started")
        # Initialize leader election for horizontal scaling
        # Only the leader pod will run the scheduler
        await self._setup_leader_election(app)

        # REMOVED: Delivery tracking service - moved to procurement service (domain ownership)

    async def _setup_leader_election(self, app: FastAPI):
        """
        Setup leader election for scheduler.

        CRITICAL FOR HORIZONTAL SCALING:
        Without leader election, each pod would run the same scheduled jobs,
        causing duplicate forecasts, production schedules, and database contention.
        """
        from shared.leader_election import LeaderElectionService
        import redis.asyncio as redis

        try:
            # Create Redis connection for leader election
            redis_url = f"redis://:{settings.REDIS_PASSWORD}@{settings.REDIS_HOST}:{settings.REDIS_PORT}/{settings.REDIS_DB}"
            if settings.REDIS_TLS_ENABLED.lower() == "true":
                redis_url = redis_url.replace("redis://", "rediss://")

            redis_client = redis.from_url(redis_url, decode_responses=False)
            await redis_client.ping()

            # Use shared leader election service
            self.leader_election = LeaderElectionService(
                redis_client,
                service_name="orchestrator"
            )

            # Define callbacks for leader state changes
            async def on_become_leader():
                self.logger.info("This pod became the leader - starting scheduler")
                from app.services.orchestrator_service import OrchestratorSchedulerService
                self.scheduler_service = OrchestratorSchedulerService(self.event_publisher, settings)
                await self.scheduler_service.start()
                app.state.scheduler_service = self.scheduler_service
                self.logger.info("Orchestrator scheduler service started (leader only)")

            async def on_lose_leader():
                self.logger.warning("This pod lost leadership - stopping scheduler")
                if self.scheduler_service:
                    await self.scheduler_service.stop()
                    self.scheduler_service = None
                if hasattr(app.state, 'scheduler_service'):
                    app.state.scheduler_service = None
                self.logger.info("Orchestrator scheduler service stopped (no longer leader)")

            # Start leader election
            await self.leader_election.start(
                on_become_leader=on_become_leader,
                on_lose_leader=on_lose_leader
            )

            # Store leader election in app state for health checks
            app.state.leader_election = self.leader_election

            self.logger.info("Leader election initialized",
                             is_leader=self.leader_election.is_leader,
                             instance_id=self.leader_election.instance_id)

        except Exception as e:
            self.logger.error("Failed to setup leader election, falling back to standalone mode",
                              error=str(e))
            # Fallback: start scheduler anyway (for single-pod deployments)
            from app.services.orchestrator_service import OrchestratorSchedulerService
            self.scheduler_service = OrchestratorSchedulerService(self.event_publisher, settings)
            await self.scheduler_service.start()
            app.state.scheduler_service = self.scheduler_service
            self.logger.warning("Scheduler started in standalone mode (no leader election)")

    async def on_shutdown(self, app: FastAPI):
        """Custom shutdown logic for orchestrator service"""
        self.logger.info("Orchestrator Service shutting down...")

        # Stop scheduler service
        if hasattr(app.state, 'scheduler_service'):
            await app.state.scheduler_service.stop()
        # Stop leader election (this will also stop the scheduler if we're the leader)
        if self.leader_election:
            await self.leader_election.stop()
            self.logger.info("Leader election stopped")

        # Stop scheduler service if still running
        if self.scheduler_service:
            await self.scheduler_service.stop()
            self.logger.info("Orchestrator scheduler service stopped")
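
A quick way to confirm the leader election behaves as intended after deploying this change is to scale the orchestrator out and check which pod claims leadership; a sketch (the deployment name and label are assumptions about the existing manifests):

```bash
# Run several orchestrator pods, then confirm exactly one logs as leader
microk8s kubectl scale deployment orchestrator-service -n bakery-ia --replicas=3
microk8s kubectl logs -n bakery-ia -l app=orchestrator-service --prefix | grep "became the leader"
```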
@@ -1,12 +1,12 @@
|
||||
"""
|
||||
Delivery Tracking Service - Simplified
|
||||
Delivery Tracking Service - With Leader Election
|
||||
|
||||
Tracks purchase order deliveries and generates appropriate alerts using EventPublisher:
|
||||
- DELIVERY_ARRIVING_SOON: 2 hours before delivery window
|
||||
- DELIVERY_OVERDUE: 30 minutes after expected delivery time
|
||||
- STOCK_RECEIPT_INCOMPLETE: If delivery not marked as received
|
||||
|
||||
Runs as internal scheduler with leader election.
|
||||
Runs as internal scheduler with leader election for horizontal scaling.
|
||||
Domain ownership: Procurement service owns all PO and delivery tracking.
|
||||
"""
|
||||
|
||||
@@ -30,7 +30,7 @@ class DeliveryTrackingService:
|
||||
Monitors PO deliveries and generates time-based alerts using EventPublisher.
|
||||
|
||||
Uses APScheduler with leader election to run hourly checks.
|
||||
Only one pod executes checks (others skip if not leader).
|
||||
Only one pod executes checks - leader election ensures no duplicate alerts.
|
||||
"""
|
||||
|
||||
def __init__(self, event_publisher: UnifiedEventPublisher, config, database_manager=None):
|
||||
@@ -38,46 +38,121 @@ class DeliveryTrackingService:
|
||||
self.config = config
|
||||
self.database_manager = database_manager
|
||||
self.scheduler = AsyncIOScheduler()
|
||||
self.is_leader = False
|
||||
self._leader_election = None
|
||||
self._redis_client = None
|
||||
self._scheduler_started = False
|
||||
self.instance_id = str(uuid4())[:8] # Short instance ID for logging
|
||||
|
||||
async def start(self):
|
||||
"""Start the delivery tracking scheduler"""
|
||||
# Initialize and start scheduler if not already running
|
||||
"""Start the delivery tracking scheduler with leader election"""
|
||||
try:
|
||||
# Initialize leader election
|
||||
await self._setup_leader_election()
|
||||
except Exception as e:
|
||||
logger.error("Failed to setup leader election, starting in standalone mode",
|
||||
error=str(e))
|
||||
# Fallback: start scheduler without leader election
|
||||
await self._start_scheduler()
|
||||
|
||||
async def _setup_leader_election(self):
|
||||
"""Setup Redis-based leader election for horizontal scaling"""
|
||||
from shared.leader_election import LeaderElectionService
|
||||
import redis.asyncio as redis
|
||||
|
||||
# Build Redis URL from config
|
||||
redis_url = getattr(self.config, 'REDIS_URL', None)
|
||||
if not redis_url:
|
||||
redis_password = getattr(self.config, 'REDIS_PASSWORD', '')
|
||||
redis_host = getattr(self.config, 'REDIS_HOST', 'localhost')
|
||||
redis_port = getattr(self.config, 'REDIS_PORT', 6379)
|
||||
redis_db = getattr(self.config, 'REDIS_DB', 0)
|
||||
redis_url = f"redis://:{redis_password}@{redis_host}:{redis_port}/{redis_db}"
|
||||
|
||||
self._redis_client = redis.from_url(redis_url, decode_responses=False)
|
||||
await self._redis_client.ping()
|
||||
|
||||
# Create leader election service
|
||||
self._leader_election = LeaderElectionService(
|
||||
self._redis_client,
|
||||
service_name="procurement-delivery-tracking"
|
||||
)
|
||||
|
||||
# Start leader election with callbacks
|
||||
await self._leader_election.start(
|
||||
on_become_leader=self._on_become_leader,
|
||||
on_lose_leader=self._on_lose_leader
|
||||
)
|
||||
|
||||
logger.info("Leader election initialized for delivery tracking",
|
||||
is_leader=self._leader_election.is_leader,
|
||||
instance_id=self.instance_id)
|
||||
|
||||
async def _on_become_leader(self):
|
||||
"""Called when this instance becomes the leader"""
|
||||
logger.info("Became leader for delivery tracking - starting scheduler",
|
||||
instance_id=self.instance_id)
|
||||
await self._start_scheduler()
|
||||
|
||||
async def _on_lose_leader(self):
|
||||
"""Called when this instance loses leadership"""
|
||||
logger.warning("Lost leadership for delivery tracking - stopping scheduler",
|
||||
instance_id=self.instance_id)
|
||||
await self._stop_scheduler()
|
||||
|
||||
    async def _start_scheduler(self):
        """Start the APScheduler with delivery tracking jobs"""
        if self._scheduler_started:
            logger.debug("Scheduler already started", instance_id=self.instance_id)
            return

        if not self.scheduler.running:
            # Add hourly job to check deliveries
            self.scheduler.add_job(
                self._check_all_tenants,
                trigger=CronTrigger(minute=30),  # Run every hour at :30 (00:30, 01:30, 02:30, etc.)
                id='hourly_delivery_check',
                name='Hourly Delivery Tracking',
                replace_existing=True,
                max_instances=1,  # Ensure no overlapping runs
                coalesce=True  # Combine missed runs
            )

            self.scheduler.start()
            self._scheduler_started = True

            # Log next run time
            next_run = self.scheduler.get_job('hourly_delivery_check').next_run_time
            logger.info(
                "Delivery tracking scheduler started with hourly checks",
                instance_id=self.instance_id,
                next_run=next_run.isoformat() if next_run else None
            )
        else:
            logger.info(
                "Delivery tracking scheduler already running",
                instance_id=self.instance_id
            )
    async def _stop_scheduler(self):
        """Stop the APScheduler"""
        if not self._scheduler_started:
            return

        if self.scheduler.running:
            self.scheduler.shutdown(wait=False)
            self._scheduler_started = False
            logger.info("Delivery tracking scheduler stopped", instance_id=self.instance_id)
    async def stop(self):
        """Stop the scheduler and leader election"""
        # Stop leader election first
        if self._leader_election:
            await self._leader_election.stop()
            logger.info("Leader election stopped", instance_id=self.instance_id)

        # Stop scheduler
        await self._stop_scheduler()

        # Close Redis
        if self._redis_client:
            await self._redis_client.close()

    @property
    def is_leader(self) -> bool:
        """Check if this instance is the leader"""
        return self._leader_election.is_leader if self._leader_election else True

    async def _check_all_tenants(self):
        """
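For a quick operational check of which replica currently owns the delivery-tracking scheduler, the leader lock introduced above can be read straight from Redis. A minimal sketch (not part of the commit), assuming a reachable Redis instance and the `leader:<service>:lock` key layout used by the shared `LeaderElectionService` added later in this commit; the URL is a placeholder:

```
import asyncio
import redis.asyncio as redis

async def show_current_leader(redis_url: str = "redis://localhost:6379/0") -> None:
    """Print the instance id currently holding the delivery-tracking leader lock."""
    client = redis.from_url(redis_url, decode_responses=True)
    try:
        holder = await client.get("leader:procurement-delivery-tracking:lock")
        ttl = await client.ttl("leader:procurement-delivery-tracking:lock")
        if holder:
            print(f"leader: {holder} (lock expires in ~{ttl}s)")
        else:
            print("no leader currently elected")
    finally:
        await client.close()

if __name__ == "__main__":
    asyncio.run(show_current_leader())
```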
@@ -46,6 +46,9 @@ class TrainingService(StandardFastAPIService):
        await setup_messaging()
        self.logger.info("Messaging setup completed")

        # Initialize Redis pub/sub for cross-pod WebSocket broadcasting
        await self._setup_websocket_redis()

        # Set up WebSocket event consumer (listens to RabbitMQ and broadcasts to WebSockets)
        success = await setup_websocket_event_consumer()
        if success:
@@ -53,8 +56,44 @@
        else:
            self.logger.warning("WebSocket event consumer setup failed")
    async def _setup_websocket_redis(self):
        """
        Initialize Redis pub/sub for WebSocket cross-pod broadcasting.

        CRITICAL FOR HORIZONTAL SCALING:
        Without this, WebSocket clients on Pod A won't receive events
        from training jobs running on Pod B.
        """
        try:
            from app.websocket.manager import websocket_manager
            from app.core.config import settings

            redis_url = settings.REDIS_URL
            success = await websocket_manager.initialize_redis(redis_url)

            if success:
                self.logger.info("WebSocket Redis pub/sub initialized for horizontal scaling")
            else:
                self.logger.warning(
                    "WebSocket Redis pub/sub failed to initialize. "
                    "WebSocket events will only be delivered to local connections."
                )

        except Exception as e:
            self.logger.error("Failed to setup WebSocket Redis pub/sub",
                              error=str(e))
            # Don't fail startup - WebSockets will work locally without Redis

    async def _cleanup_messaging(self):
        """Cleanup messaging for training service"""
        # Shutdown WebSocket Redis pub/sub
        try:
            from app.websocket.manager import websocket_manager
            await websocket_manager.shutdown()
            self.logger.info("WebSocket Redis pub/sub shutdown completed")
        except Exception as e:
            self.logger.warning("Error shutting down WebSocket Redis", error=str(e))

        await cleanup_websocket_consumers()
        await cleanup_messaging()

@@ -83,8 +122,44 @@ class TrainingService(StandardFastAPIService):
        system_metrics = SystemMetricsCollector("training")
        self.logger.info("System metrics collection started")

        # Recover stale jobs from previous pod crashes
        # This is important for horizontal scaling - jobs may be left in 'running'
        # state if a pod crashes. We mark them as failed so they can be retried.
        await self._recover_stale_jobs()

        self.logger.info("Training service startup completed")

    async def _recover_stale_jobs(self):
        """
        Recover stale training jobs on startup.

        When a pod crashes mid-training, jobs are left in 'running' or 'pending' state.
        This method finds jobs that haven't been updated in a while and marks them
        as failed so users can retry them.
        """
        try:
            from app.repositories.training_log_repository import TrainingLogRepository

            async with self.database_manager.get_session() as session:
                log_repo = TrainingLogRepository(session)

                # Recover jobs that haven't been updated in 60 minutes
                # This is conservative - most training jobs complete within 30 minutes
                recovered = await log_repo.recover_stale_jobs(stale_threshold_minutes=60)

                if recovered:
                    self.logger.warning(
                        "Recovered stale training jobs on startup",
                        recovered_count=len(recovered),
                        job_ids=[j.job_id for j in recovered]
                    )
                else:
                    self.logger.info("No stale training jobs to recover")

        except Exception as e:
            # Don't fail startup if recovery fails - just log the error
            self.logger.error("Failed to recover stale jobs on startup", error=str(e))

    async def on_shutdown(self, app: FastAPI):
        """Custom shutdown logic for training service"""
        await cleanup_training_database()
@@ -343,3 +343,165 @@ class TrainingLogRepository(TrainingBaseRepository):
                         job_id=job_id,
                         error=str(e))
            return None

    async def create_job_atomic(
        self,
        job_id: str,
        tenant_id: str,
        config: Dict[str, Any] = None
    ) -> tuple[Optional[ModelTrainingLog], bool]:
        """
        Atomically create a training job, respecting the unique constraint.

        This method uses INSERT ... ON CONFLICT to handle race conditions
        when multiple pods try to create a job for the same tenant simultaneously.
        The database constraint (idx_unique_active_training_per_tenant) ensures
        only one active job per tenant can exist.

        Args:
            job_id: Unique job identifier
            tenant_id: Tenant identifier
            config: Optional job configuration

        Returns:
            Tuple of (job, created):
            - If created: (new_job, True)
            - If conflict (existing active job): (existing_job, False)
            - If error: raises DatabaseError
        """
        try:
            # First, try to find an existing active job
            existing = await self.get_active_jobs(tenant_id=tenant_id)
            pending = await self.get_logs_by_tenant(tenant_id=tenant_id, status="pending", limit=1)

            if existing or pending:
                # Return existing job
                active_job = existing[0] if existing else pending[0]
                logger.info("Found existing active job, skipping creation",
                            existing_job_id=active_job.job_id,
                            tenant_id=tenant_id,
                            requested_job_id=job_id)
                return (active_job, False)

            # Try to create the new job
            # If another pod created one in the meantime, the unique constraint will prevent this
            log_data = {
                "job_id": job_id,
                "tenant_id": tenant_id,
                "status": "pending",
                "progress": 0,
                "current_step": "initializing",
                "config": config or {}
            }

            try:
                new_job = await self.create_training_log(log_data)
                await self.session.commit()
                logger.info("Created new training job atomically",
                            job_id=job_id,
                            tenant_id=tenant_id)
                return (new_job, True)
            except Exception as create_error:
                error_str = str(create_error).lower()
                # Check if this is a unique constraint violation
                if "unique" in error_str or "duplicate" in error_str or "constraint" in error_str:
                    await self.session.rollback()
                    # Another pod created a job, fetch it
                    logger.info("Unique constraint hit, fetching existing job",
                                tenant_id=tenant_id,
                                requested_job_id=job_id)
                    existing = await self.get_active_jobs(tenant_id=tenant_id)
                    pending = await self.get_logs_by_tenant(tenant_id=tenant_id, status="pending", limit=1)
                    if existing or pending:
                        active_job = existing[0] if existing else pending[0]
                        return (active_job, False)
                    # If still no job found, something went wrong
                    raise DatabaseError(f"Constraint violation but no active job found: {create_error}")
                else:
                    raise

        except DatabaseError:
            raise
        except Exception as e:
            logger.error("Failed to create job atomically",
                         job_id=job_id,
                         tenant_id=tenant_id,
                         error=str(e))
            raise DatabaseError(f"Failed to create training job atomically: {str(e)}")
    async def recover_stale_jobs(self, stale_threshold_minutes: int = 60) -> List[ModelTrainingLog]:
        """
        Find and mark stale running jobs as failed.

        This is used during service startup to clean up jobs that were
        running when a pod crashed. With multiple replicas, only stale
        jobs (not updated recently) should be marked as failed.

        Args:
            stale_threshold_minutes: Jobs not updated for this long are considered stale

        Returns:
            List of jobs that were marked as failed
        """
        try:
            stale_cutoff = datetime.now() - timedelta(minutes=stale_threshold_minutes)

            # Find running jobs that haven't been updated recently
            query = text("""
                SELECT id, job_id, tenant_id, status, updated_at
                FROM model_training_logs
                WHERE status IN ('running', 'pending')
                AND updated_at < :stale_cutoff
            """)

            result = await self.session.execute(query, {"stale_cutoff": stale_cutoff})
            stale_jobs = result.fetchall()

            recovered_jobs = []
            for row in stale_jobs:
                try:
                    # Mark as failed
                    update_query = text("""
                        UPDATE model_training_logs
                        SET status = 'failed',
                            error_message = :error_msg,
                            end_time = :end_time,
                            updated_at = :updated_at
                        WHERE id = :id AND status IN ('running', 'pending')
                    """)

                    await self.session.execute(update_query, {
                        "id": row.id,
                        "error_msg": f"Job recovered as failed - not updated since {row.updated_at.isoformat()}. Pod may have crashed.",
                        "end_time": datetime.now(),
                        "updated_at": datetime.now()
                    })

                    logger.warning("Recovered stale training job",
                                   job_id=row.job_id,
                                   tenant_id=str(row.tenant_id),
                                   last_updated=row.updated_at.isoformat() if row.updated_at else "unknown")

                    # Fetch the updated job to return
                    job = await self.get_by_job_id(row.job_id)
                    if job:
                        recovered_jobs.append(job)

                except Exception as job_error:
                    logger.error("Failed to recover individual stale job",
                                 job_id=row.job_id,
                                 error=str(job_error))

            if recovered_jobs:
                await self.session.commit()
                logger.info("Stale job recovery completed",
                            recovered_count=len(recovered_jobs),
                            stale_threshold_minutes=stale_threshold_minutes)

            return recovered_jobs

        except Exception as e:
            logger.error("Failed to recover stale jobs",
                         error=str(e))
            await self.session.rollback()
            return []
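The `(job, created)` tuple returned by `create_job_atomic` above lets callers treat a concurrent duplicate request as a benign no-op rather than an error. A minimal caller sketch, assuming a hypothetical request handler that already holds an async session; everything except `create_job_atomic` itself and the import path is illustrative:

```
import uuid

from app.repositories.training_log_repository import TrainingLogRepository


async def request_training(session, tenant_id: str) -> dict:
    """Hypothetical handler: start training for a tenant, tolerating concurrent requests."""
    repo = TrainingLogRepository(session)
    job, created = await repo.create_job_atomic(
        job_id=f"train-{uuid.uuid4().hex[:12]}",
        tenant_id=tenant_id,
    )
    if created:
        # This pod won the race; enqueue the actual training work here.
        return {"job_id": job.job_id, "status": job.status, "created": True}
    # Another pod (or an earlier request) already has an active job for this tenant.
    return {"job_id": job.job_id, "status": job.status, "created": False}
```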
@@ -1,10 +1,16 @@
"""
Distributed Locking Mechanisms
Prevents concurrent training jobs for the same product

HORIZONTAL SCALING FIX:
- Uses SHA256 for stable hash across all Python processes/pods
- Python's built-in hash() varies between processes due to hash randomization (Python 3.3+)
- This ensures all pods compute the same lock ID for the same lock name
"""

import asyncio
import time
import hashlib
from typing import Optional
import logging
from contextlib import asynccontextmanager
@@ -39,9 +45,20 @@ class DatabaseLock:
        self.lock_id = self._hash_lock_name(lock_name)

    def _hash_lock_name(self, name: str) -> int:
        """
        Convert lock name to integer ID for PostgreSQL advisory lock.

        CRITICAL: Uses SHA256 for stable hash across all Python processes/pods.
        Python's built-in hash() varies between processes due to hash randomization
        (PYTHONHASHSEED, enabled by default since Python 3.3), which would cause
        different pods to compute different lock IDs for the same lock name,
        defeating the purpose of distributed locking.
        """
        # Use SHA256 for stable, cross-process hash
        hash_bytes = hashlib.sha256(name.encode('utf-8')).digest()
        # Take first 4 bytes and convert to positive 31-bit integer
        # (PostgreSQL advisory locks use bigint, but we use 31-bit for safety)
        return int.from_bytes(hash_bytes[:4], 'big') % (2**31)

    @asynccontextmanager
    async def acquire(self, session: AsyncSession):
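The point of the SHA256 change can be checked in isolation: the derived lock ID is identical in every interpreter, whereas `hash()` is salted per process. A small standalone sketch (not part of the commit) reproducing the `_hash_lock_name` computation with an illustrative lock name:

```
import hashlib

def stable_lock_id(name: str) -> int:
    """Same derivation as DatabaseLock._hash_lock_name: first 4 SHA256 bytes, positive 31-bit int."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % (2**31)

if __name__ == "__main__":
    # Prints the same number in every process/pod for the same lock name.
    print(stable_lock_id("training:product:1234"))
    # By contrast, abs(hash(...)) % 2**31 changes whenever PYTHONHASHSEED changes
    # (it is randomized by default), so different pods would compute different IDs.
    print(abs(hash("training:product:1234")) % (2**31))
```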
@@ -1,21 +1,39 @@
"""
WebSocket Connection Manager for Training Service
Manages WebSocket connections and broadcasts RabbitMQ events to connected clients

HORIZONTAL SCALING:
- Uses Redis pub/sub for cross-pod WebSocket broadcasting
- Each pod subscribes to a Redis channel and broadcasts to its local connections
- Events published to Redis are received by all pods, ensuring clients on any
  pod receive events from training jobs running on any other pod
"""

import asyncio
import json
import os
from typing import Dict, Optional
from fastapi import WebSocket
import structlog

logger = structlog.get_logger()

# Redis pub/sub channel for WebSocket events
REDIS_WEBSOCKET_CHANNEL = "training:websocket:events"


class WebSocketConnectionManager:
    """
    WebSocket connection manager with Redis pub/sub for horizontal scaling.

    In a multi-pod deployment:
    1. Events are published to Redis pub/sub (not just local broadcast)
    2. Each pod subscribes to Redis and broadcasts to its local WebSocket connections
    3. This ensures clients connected to any pod receive events from any pod

    Flow:
    - RabbitMQ event → Pod A receives → Pod A publishes to Redis
    - Redis pub/sub → All pods receive → Each pod broadcasts to local WebSockets
    """
    def __init__(self):
@@ -24,6 +42,121 @@ class WebSocketConnectionManager:
        self._lock = asyncio.Lock()
        # Store latest event for each job to provide initial state
        self._latest_events: Dict[str, dict] = {}
        # Redis client for pub/sub
        self._redis: Optional[object] = None
        self._pubsub: Optional[object] = None
        self._subscriber_task: Optional[asyncio.Task] = None
        self._running = False
        self._instance_id = f"{os.environ.get('HOSTNAME', 'unknown')}:{os.getpid()}"

    async def initialize_redis(self, redis_url: str) -> bool:
        """
        Initialize Redis connection for cross-pod pub/sub.

        Args:
            redis_url: Redis connection URL

        Returns:
            True if successful, False otherwise
        """
        try:
            import redis.asyncio as redis_async

            self._redis = redis_async.from_url(redis_url, decode_responses=True)
            await self._redis.ping()

            # Create pub/sub subscriber
            self._pubsub = self._redis.pubsub()
            await self._pubsub.subscribe(REDIS_WEBSOCKET_CHANNEL)

            # Start subscriber task
            self._running = True
            self._subscriber_task = asyncio.create_task(self._redis_subscriber_loop())

            logger.info("Redis pub/sub initialized for WebSocket broadcasting",
                        instance_id=self._instance_id,
                        channel=REDIS_WEBSOCKET_CHANNEL)
            return True

        except Exception as e:
            logger.error("Failed to initialize Redis pub/sub",
                         error=str(e),
                         instance_id=self._instance_id)
            return False

    async def shutdown(self):
        """Shutdown Redis pub/sub connection"""
        self._running = False

        if self._subscriber_task:
            self._subscriber_task.cancel()
            try:
                await self._subscriber_task
            except asyncio.CancelledError:
                pass

        if self._pubsub:
            await self._pubsub.unsubscribe(REDIS_WEBSOCKET_CHANNEL)
            await self._pubsub.close()

        if self._redis:
            await self._redis.close()

        logger.info("Redis pub/sub shutdown complete",
                    instance_id=self._instance_id)

    async def _redis_subscriber_loop(self):
        """Background task to receive Redis pub/sub messages and broadcast locally"""
        try:
            while self._running:
                try:
                    message = await self._pubsub.get_message(
                        ignore_subscribe_messages=True,
                        timeout=1.0
                    )

                    if message and message['type'] == 'message':
                        await self._handle_redis_message(message['data'])

                except asyncio.CancelledError:
                    break
                except Exception as e:
                    logger.error("Error in Redis subscriber loop",
                                 error=str(e),
                                 instance_id=self._instance_id)
                    await asyncio.sleep(1)  # Backoff on error

        except asyncio.CancelledError:
            pass

        logger.info("Redis subscriber loop stopped",
                    instance_id=self._instance_id)

    async def _handle_redis_message(self, data: str):
        """Handle a message received from Redis pub/sub"""
        try:
            payload = json.loads(data)
            job_id = payload.get('job_id')
            message = payload.get('message')
            source_instance = payload.get('source_instance')

            if not job_id or not message:
                return

            # Log cross-pod message
            if source_instance != self._instance_id:
                logger.debug("Received cross-pod WebSocket event",
                             job_id=job_id,
                             source_instance=source_instance,
                             local_instance=self._instance_id)

            # Broadcast to local WebSocket connections
            await self._broadcast_local(job_id, message)

        except json.JSONDecodeError as e:
            logger.warning("Invalid JSON in Redis message", error=str(e))
        except Exception as e:
            logger.error("Error handling Redis message", error=str(e))
    async def connect(self, job_id: str, websocket: WebSocket) -> None:
        """Register a new WebSocket connection for a job"""
@@ -50,7 +183,8 @@ class WebSocketConnectionManager:
        logger.info("WebSocket connected",
                    job_id=job_id,
                    websocket_id=ws_id,
                    total_connections=len(self._connections[job_id]),
                    instance_id=self._instance_id)

    async def disconnect(self, job_id: str, websocket: WebSocket) -> None:
        """Remove a WebSocket connection"""
@@ -66,19 +200,56 @@ class WebSocketConnectionManager:
        logger.info("WebSocket disconnected",
                    job_id=job_id,
                    websocket_id=ws_id,
                    remaining_connections=len(self._connections.get(job_id, {})),
                    instance_id=self._instance_id)
    async def broadcast(self, job_id: str, message: dict) -> int:
        """
        Broadcast a message to all connections for a specific job across ALL pods.

        If Redis is configured, publishes to Redis pub/sub which then broadcasts
        to all pods. Otherwise, falls back to local-only broadcast.

        Returns the number of successful local broadcasts.
        """
        # Store the latest event for this job to provide initial state to new connections
        if message.get('type') != 'initial_state':  # Don't store initial_state messages
            self._latest_events[job_id] = message

        # If Redis is available, publish to Redis for cross-pod broadcast
        if self._redis:
            try:
                payload = json.dumps({
                    'job_id': job_id,
                    'message': message,
                    'source_instance': self._instance_id
                })
                await self._redis.publish(REDIS_WEBSOCKET_CHANNEL, payload)
                logger.debug("Published WebSocket event to Redis",
                             job_id=job_id,
                             message_type=message.get('type'),
                             instance_id=self._instance_id)
                # Return 0 here because the actual broadcast happens via subscriber
                # The count will be from _broadcast_local when the message is received
                return 0
            except Exception as e:
                logger.warning("Failed to publish to Redis, falling back to local broadcast",
                               error=str(e),
                               job_id=job_id)
                # Fall through to local broadcast

        # Local-only broadcast (when Redis is not available)
        return await self._broadcast_local(job_id, message)
    async def _broadcast_local(self, job_id: str, message: dict) -> int:
        """
        Broadcast a message to local WebSocket connections only.
        This is called either directly (no Redis) or from Redis subscriber.
        """
        if job_id not in self._connections:
            logger.debug("No active local connections for job",
                         job_id=job_id,
                         instance_id=self._instance_id)
            return 0

        connections = list(self._connections[job_id].values())
@@ -103,18 +274,27 @@ class WebSocketConnectionManager:
                self._connections[job_id].pop(ws_id, None)

        if successful_sends > 0:
            logger.info("Broadcasted message to local WebSocket clients",
                        job_id=job_id,
                        message_type=message.get('type'),
                        successful_sends=successful_sends,
                        failed_sends=len(failed_websockets),
                        instance_id=self._instance_id)

        return successful_sends

    def get_connection_count(self, job_id: str) -> int:
        """Get the number of active local connections for a job"""
        return len(self._connections.get(job_id, {}))

    def get_total_connection_count(self) -> int:
        """Get total number of active connections across all jobs"""
        return sum(len(conns) for conns in self._connections.values())

    def is_redis_enabled(self) -> bool:
        """Check if Redis pub/sub is enabled"""
        return self._redis is not None and self._running


# Global singleton instance
websocket_manager = WebSocketConnectionManager()
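Because every pod subscribes to `training:websocket:events`, any process that can reach Redis can fan an event out to all connected browsers, which is also a convenient way to smoke-test the cross-pod path. A minimal sketch (not part of the commit), assuming a reachable Redis; the job id and source instance are illustrative:

```
import asyncio
import json

import redis.asyncio as redis

async def publish_test_event(redis_url: str = "redis://localhost:6379/0") -> None:
    """Publish a fake progress event; every training pod should relay it to its local WebSockets."""
    client = redis.from_url(redis_url, decode_responses=True)
    try:
        payload = json.dumps({
            "job_id": "job-demo-123",          # illustrative job id
            "message": {"type": "progress", "progress": 42},
            "source_instance": "manual-test",  # not a real pod instance id
        })
        receivers = await client.publish("training:websocket:events", payload)
        print(f"delivered to {receivers} subscribed pod(s)")
    finally:
        await client.close()

if __name__ == "__main__":
    asyncio.run(publish_test_event())
```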
@@ -0,0 +1,60 @@
"""Add horizontal scaling constraints for multi-pod deployment

Revision ID: add_horizontal_scaling
Revises: 26a665cd5348
Create Date: 2025-01-18

This migration adds database-level constraints to prevent race conditions
when running multiple training service pods:

1. Partial unique index on model_training_logs to prevent duplicate active jobs per tenant
2. Index to speed up active job lookups
"""
from typing import Sequence, Union

from alembic import op
import sqlalchemy as sa


# revision identifiers, used by Alembic.
revision: str = 'add_horizontal_scaling'
down_revision: Union[str, None] = '26a665cd5348'
branch_labels: Union[str, Sequence[str], None] = None
depends_on: Union[str, Sequence[str], None] = None


def upgrade() -> None:
    # Add partial unique index to prevent duplicate active training jobs per tenant
    # This ensures only ONE job can be in 'pending' or 'running' status per tenant at a time
    # The constraint is enforced at the database level, preventing race conditions
    # between multiple pods checking and creating jobs simultaneously
    op.execute("""
        CREATE UNIQUE INDEX IF NOT EXISTS idx_unique_active_training_per_tenant
        ON model_training_logs (tenant_id)
        WHERE status IN ('pending', 'running')
    """)

    # Add index to speed up active job lookups (used by deduplication check)
    op.create_index(
        'idx_training_logs_tenant_status',
        'model_training_logs',
        ['tenant_id', 'status'],
        unique=False,
        if_not_exists=True
    )

    # Add index for job recovery queries (find stale running jobs)
    op.create_index(
        'idx_training_logs_status_updated',
        'model_training_logs',
        ['status', 'updated_at'],
        unique=False,
        if_not_exists=True
    )


def downgrade() -> None:
    # Remove the indexes in reverse order
    op.execute("DROP INDEX IF EXISTS idx_training_logs_status_updated")
    op.execute("DROP INDEX IF EXISTS idx_training_logs_tenant_status")
    op.execute("DROP INDEX IF EXISTS idx_unique_active_training_per_tenant")
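The partial index above is what makes `create_job_atomic` safe: two concurrent inserts of active jobs for the same tenant cannot both succeed, while any number of 'completed' or 'failed' rows can coexist. A throwaway sketch of that behaviour (not part of the commit), assuming a reachable Postgres DSN and that columns omitted here have defaults; it is meant to illustrate the constraint, not to run against a production schema:

```
import asyncio
import asyncpg

async def demo(dsn: str = "postgresql://user:pass@localhost:5432/training") -> None:
    conn = await asyncpg.connect(dsn)
    try:
        await conn.execute(
            "INSERT INTO model_training_logs (job_id, tenant_id, status) VALUES ($1, $2, 'pending')",
            "job-a", "tenant-1",
        )
        try:
            # Second active job for the same tenant is rejected by the partial unique index.
            await conn.execute(
                "INSERT INTO model_training_logs (job_id, tenant_id, status) VALUES ($1, $2, 'pending')",
                "job-b", "tenant-1",
            )
        except asyncpg.exceptions.UniqueViolationError:
            print("blocked by idx_unique_active_training_per_tenant, as intended")
    finally:
        await conn.close()

if __name__ == "__main__":
    asyncio.run(demo())
```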
33
shared/leader_election/__init__.py
Normal file
@@ -0,0 +1,33 @@
"""
Shared Leader Election for Bakery-IA platform

Provides Redis-based leader election for services that need to run
singleton scheduled tasks (APScheduler, background jobs, etc.)

Usage:
    from shared.leader_election import LeaderElectionService, SchedulerLeaderMixin

    # Option 1: Direct usage
    leader_election = LeaderElectionService(redis_client, "my-service")
    await leader_election.start(
        on_become_leader=start_scheduler,
        on_lose_leader=stop_scheduler
    )

    # Option 2: Mixin for services with APScheduler
    class MySchedulerService(SchedulerLeaderMixin):
        async def _create_scheduler_jobs(self):
            self.scheduler.add_job(...)
"""

from shared.leader_election.service import (
    LeaderElectionService,
    LeaderElectionConfig,
)
from shared.leader_election.mixin import SchedulerLeaderMixin

__all__ = [
    "LeaderElectionService",
    "LeaderElectionConfig",
    "SchedulerLeaderMixin",
]
209
shared/leader_election/mixin.py
Normal file
@@ -0,0 +1,209 @@
"""
Scheduler Leader Mixin

Provides a mixin class for services that use APScheduler and need
leader election for horizontal scaling.

Usage:
    class MySchedulerService(SchedulerLeaderMixin):
        def __init__(self, redis_url: str, service_name: str):
            super().__init__(redis_url, service_name)
            # Your initialization here

        async def _create_scheduler_jobs(self):
            '''Override to define your scheduled jobs'''
            self.scheduler.add_job(
                self.my_job,
                trigger=CronTrigger(hour=0),
                id='my_job'
            )

        async def my_job(self):
            # Your job logic here
            pass
"""

import asyncio
from typing import Optional
from abc import abstractmethod
import structlog

logger = structlog.get_logger()


class SchedulerLeaderMixin:
    """
    Mixin for services that use APScheduler with leader election.

    Provides automatic leader election and scheduler management.
    Only the leader pod will run scheduled jobs.
    """

    def __init__(self, redis_url: str, service_name: str, **kwargs):
        """
        Initialize the scheduler with leader election.

        Args:
            redis_url: Redis connection URL for leader election
            service_name: Unique service name for leader election lock
            **kwargs: Additional arguments passed to parent class
        """
        super().__init__(**kwargs)

        self._redis_url = redis_url
        self._service_name = service_name
        self._leader_election = None
        self._redis_client = None
        self.scheduler = None
        self._scheduler_started = False

    async def start_with_leader_election(self):
        """
        Start the service with leader election.

        Only the leader will start the scheduler.
        """
        from apscheduler.schedulers.asyncio import AsyncIOScheduler
        from shared.leader_election.service import LeaderElectionService
        import redis.asyncio as redis

        try:
            # Create Redis connection
            self._redis_client = redis.from_url(self._redis_url, decode_responses=False)
            await self._redis_client.ping()

            # Create scheduler (but don't start it yet)
            self.scheduler = AsyncIOScheduler()

            # Create leader election
            self._leader_election = LeaderElectionService(
                self._redis_client,
                self._service_name
            )

            # Start leader election with callbacks
            await self._leader_election.start(
                on_become_leader=self._on_become_leader,
                on_lose_leader=self._on_lose_leader
            )

            logger.info("Scheduler service started with leader election",
                        service=self._service_name,
                        is_leader=self._leader_election.is_leader,
                        instance_id=self._leader_election.instance_id)

        except Exception as e:
            logger.error("Failed to start with leader election, falling back to standalone",
                         service=self._service_name,
                         error=str(e))
            # Fallback: start scheduler anyway (for single-pod deployments)
            await self._start_scheduler_standalone()

    async def _on_become_leader(self):
        """Called when this instance becomes the leader"""
        logger.info("Became leader, starting scheduler",
                    service=self._service_name)
        await self._start_scheduler()

    async def _on_lose_leader(self):
        """Called when this instance loses leadership"""
        logger.warning("Lost leadership, stopping scheduler",
                       service=self._service_name)
        await self._stop_scheduler()

    async def _start_scheduler(self):
        """Start the scheduler with defined jobs"""
        if self._scheduler_started:
            logger.warning("Scheduler already started",
                           service=self._service_name)
            return

        try:
            # Let subclass define jobs
            await self._create_scheduler_jobs()

            # Start scheduler
            if not self.scheduler.running:
                self.scheduler.start()
                self._scheduler_started = True
                logger.info("Scheduler started",
                            service=self._service_name,
                            job_count=len(self.scheduler.get_jobs()))

        except Exception as e:
            logger.error("Failed to start scheduler",
                         service=self._service_name,
                         error=str(e))

    async def _stop_scheduler(self):
        """Stop the scheduler"""
        if not self._scheduler_started:
            return

        try:
            if self.scheduler and self.scheduler.running:
                self.scheduler.shutdown(wait=False)
                self._scheduler_started = False
                logger.info("Scheduler stopped",
                            service=self._service_name)

        except Exception as e:
            logger.error("Failed to stop scheduler",
                         service=self._service_name,
                         error=str(e))

    async def _start_scheduler_standalone(self):
        """Start scheduler without leader election (fallback mode)"""
        from apscheduler.schedulers.asyncio import AsyncIOScheduler

        logger.warning("Starting scheduler in standalone mode (no leader election)",
                       service=self._service_name)

        self.scheduler = AsyncIOScheduler()
        await self._create_scheduler_jobs()

        if not self.scheduler.running:
            self.scheduler.start()
            self._scheduler_started = True

    @abstractmethod
    async def _create_scheduler_jobs(self):
        """
        Override to define scheduled jobs.

        Example:
            self.scheduler.add_job(
                self.my_task,
                trigger=CronTrigger(hour=0, minute=30),
                id='my_task',
                max_instances=1
            )
        """
        pass

    async def stop(self):
        """Stop the scheduler and leader election"""
        # Stop leader election
        if self._leader_election:
            await self._leader_election.stop()

        # Stop scheduler
        await self._stop_scheduler()

        # Close Redis
        if self._redis_client:
            await self._redis_client.close()

        logger.info("Scheduler service stopped",
                    service=self._service_name)

    @property
    def is_leader(self) -> bool:
        """Check if this instance is the leader"""
        return self._leader_election.is_leader if self._leader_election else False

    def get_leader_status(self) -> dict:
        """Get leader election status"""
        if self._leader_election:
            return self._leader_election.get_status()
        return {"is_leader": True, "mode": "standalone"}
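A concrete subclass only has to implement `_create_scheduler_jobs`; election, standalone fallback, and shutdown come from the mixin. A minimal sketch of such a service (not part of the commit; the nightly-report job, class name, and Redis URL are illustrative):

```
import asyncio
from apscheduler.triggers.cron import CronTrigger

from shared.leader_election import SchedulerLeaderMixin


class NightlyReportService(SchedulerLeaderMixin):
    """Illustrative service: only the elected leader pod generates the nightly report."""

    async def _create_scheduler_jobs(self):
        self.scheduler.add_job(
            self.generate_report,
            trigger=CronTrigger(hour=2, minute=0),  # 02:00 every night
            id="nightly_report",
            replace_existing=True,
            max_instances=1,
        )

    async def generate_report(self):
        print("generating report on the leader pod only")


async def main():
    service = NightlyReportService(
        redis_url="redis://localhost:6379/0",  # placeholder
        service_name="nightly-report",
    )
    await service.start_with_leader_election()
    try:
        await asyncio.sleep(3600)  # keep the pod alive; normally the app's own loop does this
    finally:
        await service.stop()


if __name__ == "__main__":
    asyncio.run(main())
```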
352
shared/leader_election/service.py
Normal file
@@ -0,0 +1,352 @@
"""
Leader Election Service

Implements Redis-based leader election to ensure only ONE pod runs
singleton tasks like APScheduler jobs.

This is CRITICAL for horizontal scaling - without leader election,
each pod would run the same scheduled jobs, causing:
- Duplicate operations (forecasts, alerts, syncs)
- Database contention
- Inconsistent state
- Duplicate notifications

Implementation:
- Uses Redis SET NX (set if not exists) for atomic leadership acquisition
- Leader maintains leadership with periodic heartbeats
- If leader fails to heartbeat, another pod can take over
- Non-leader pods check periodically if they should become leader
"""

import asyncio
import os
import socket
from dataclasses import dataclass
from typing import Optional, Callable, Awaitable
import structlog

logger = structlog.get_logger()


@dataclass
class LeaderElectionConfig:
    """Configuration for leader election"""
    # Redis key prefix for the lock
    lock_key_prefix: str = "leader"
    # Lock expires after this many seconds without refresh
    lock_ttl_seconds: int = 30
    # Refresh lock every N seconds (should be < lock_ttl_seconds / 2)
    heartbeat_interval_seconds: int = 10
    # Non-leaders check for leadership every N seconds
    election_check_interval_seconds: int = 15


class LeaderElectionService:
    """
    Redis-based leader election service.

    Ensures only one pod runs scheduled tasks at a time across all replicas.
    """

    def __init__(
        self,
        redis_client,
        service_name: str,
        config: Optional[LeaderElectionConfig] = None
    ):
        """
        Initialize leader election service.

        Args:
            redis_client: Async Redis client instance
            service_name: Unique name for this service (used in Redis key)
            config: Optional configuration override
        """
        self.redis = redis_client
        self.service_name = service_name
        self.config = config or LeaderElectionConfig()
        self.lock_key = f"{self.config.lock_key_prefix}:{service_name}:lock"
        self.instance_id = self._generate_instance_id()
        self.is_leader = False
        self._heartbeat_task: Optional[asyncio.Task] = None
        self._election_task: Optional[asyncio.Task] = None
        self._running = False
        self._on_become_leader_callback: Optional[Callable[[], Awaitable[None]]] = None
        self._on_lose_leader_callback: Optional[Callable[[], Awaitable[None]]] = None

    def _generate_instance_id(self) -> str:
        """Generate unique instance identifier for this pod"""
        hostname = os.environ.get('HOSTNAME', socket.gethostname())
        pod_ip = os.environ.get('POD_IP', 'unknown')
        return f"{hostname}:{pod_ip}:{os.getpid()}"

    async def start(
        self,
        on_become_leader: Optional[Callable[[], Awaitable[None]]] = None,
        on_lose_leader: Optional[Callable[[], Awaitable[None]]] = None
    ):
        """
        Start leader election process.

        Args:
            on_become_leader: Async callback when this instance becomes leader
            on_lose_leader: Async callback when this instance loses leadership
        """
        self._on_become_leader_callback = on_become_leader
        self._on_lose_leader_callback = on_lose_leader
        self._running = True

        logger.info("Starting leader election",
                    service=self.service_name,
                    instance_id=self.instance_id,
                    lock_key=self.lock_key)

        # Try to become leader immediately
        await self._try_become_leader()

        # Start background tasks
        if self.is_leader:
            self._heartbeat_task = asyncio.create_task(self._heartbeat_loop())
        else:
            self._election_task = asyncio.create_task(self._election_loop())

    async def stop(self):
        """Stop leader election and release leadership if held"""
        self._running = False

        # Cancel background tasks
        if self._heartbeat_task:
            self._heartbeat_task.cancel()
            try:
                await self._heartbeat_task
            except asyncio.CancelledError:
                pass
            self._heartbeat_task = None

        if self._election_task:
            self._election_task.cancel()
            try:
                await self._election_task
            except asyncio.CancelledError:
                pass
            self._election_task = None

        # Release leadership
        if self.is_leader:
            await self._release_leadership()

        logger.info("Leader election stopped",
                    service=self.service_name,
                    instance_id=self.instance_id,
                    was_leader=self.is_leader)
    async def _try_become_leader(self) -> bool:
        """
        Attempt to become the leader.

        Returns:
            True if this instance is now the leader
        """
        try:
            # Try to set the lock with NX (only if not exists) and EX (expiry)
            acquired = await self.redis.set(
                self.lock_key,
                self.instance_id,
                nx=True,  # Only set if not exists
                ex=self.config.lock_ttl_seconds
            )

            if acquired:
                self.is_leader = True
                logger.info("Became leader",
                            service=self.service_name,
                            instance_id=self.instance_id)

                # Call callback
                if self._on_become_leader_callback:
                    try:
                        await self._on_become_leader_callback()
                    except Exception as e:
                        logger.error("Error in on_become_leader callback",
                                     service=self.service_name,
                                     error=str(e))

                return True

            # Check if we're already the leader (reconnection scenario)
            current_leader = await self.redis.get(self.lock_key)
            if current_leader:
                current_leader_str = current_leader.decode() if isinstance(current_leader, bytes) else current_leader
                if current_leader_str == self.instance_id:
                    self.is_leader = True
                    logger.info("Confirmed as existing leader",
                                service=self.service_name,
                                instance_id=self.instance_id)
                    return True
                else:
                    logger.debug("Another instance is leader",
                                 service=self.service_name,
                                 current_leader=current_leader_str,
                                 this_instance=self.instance_id)

            return False

        except Exception as e:
            logger.error("Failed to acquire leadership",
                         service=self.service_name,
                         instance_id=self.instance_id,
                         error=str(e))
            return False

    async def _release_leadership(self):
        """Release leadership lock"""
        try:
            # Only delete if we're the current leader
            current_leader = await self.redis.get(self.lock_key)
            if current_leader:
                current_leader_str = current_leader.decode() if isinstance(current_leader, bytes) else current_leader
                if current_leader_str == self.instance_id:
                    await self.redis.delete(self.lock_key)
                    logger.info("Released leadership",
                                service=self.service_name,
                                instance_id=self.instance_id)

            was_leader = self.is_leader
            self.is_leader = False

            # Call callback only if we were the leader
            if was_leader and self._on_lose_leader_callback:
                try:
                    await self._on_lose_leader_callback()
                except Exception as e:
                    logger.error("Error in on_lose_leader callback",
                                 service=self.service_name,
                                 error=str(e))

        except Exception as e:
            logger.error("Failed to release leadership",
                         service=self.service_name,
                         instance_id=self.instance_id,
                         error=str(e))

    async def _refresh_leadership(self) -> bool:
        """
        Refresh leadership lock TTL.

        Returns:
            True if leadership was maintained
        """
        try:
            # Verify we're still the leader
            current_leader = await self.redis.get(self.lock_key)
            if not current_leader:
                logger.warning("Lost leadership (lock expired)",
                               service=self.service_name,
                               instance_id=self.instance_id)
                return False

            current_leader_str = current_leader.decode() if isinstance(current_leader, bytes) else current_leader
            if current_leader_str != self.instance_id:
                logger.warning("Lost leadership (lock held by another instance)",
                               service=self.service_name,
                               instance_id=self.instance_id,
                               current_leader=current_leader_str)
                return False

            # Refresh the TTL
            await self.redis.expire(self.lock_key, self.config.lock_ttl_seconds)
            return True

        except Exception as e:
            logger.error("Failed to refresh leadership",
                         service=self.service_name,
                         instance_id=self.instance_id,
                         error=str(e))
            return False

    async def _heartbeat_loop(self):
        """Background loop to maintain leadership"""
        while self._running and self.is_leader:
            try:
                await asyncio.sleep(self.config.heartbeat_interval_seconds)

                if not self._running:
                    break

                maintained = await self._refresh_leadership()

                if not maintained:
                    self.is_leader = False

                    # Call callback
                    if self._on_lose_leader_callback:
                        try:
                            await self._on_lose_leader_callback()
                        except Exception as e:
                            logger.error("Error in on_lose_leader callback",
                                         service=self.service_name,
                                         error=str(e))

                    # Switch to election loop
                    self._election_task = asyncio.create_task(self._election_loop())
                    break

            except asyncio.CancelledError:
                break
            except Exception as e:
                logger.error("Error in heartbeat loop",
                             service=self.service_name,
                             instance_id=self.instance_id,
                             error=str(e))

    async def _election_loop(self):
        """Background loop to attempt leadership acquisition"""
        while self._running and not self.is_leader:
            try:
                await asyncio.sleep(self.config.election_check_interval_seconds)

                if not self._running:
                    break

                acquired = await self._try_become_leader()

                if acquired:
                    # Switch to heartbeat loop
                    self._heartbeat_task = asyncio.create_task(self._heartbeat_loop())
                    break

            except asyncio.CancelledError:
                break
            except Exception as e:
                logger.error("Error in election loop",
                             service=self.service_name,
                             instance_id=self.instance_id,
                             error=str(e))

    def get_status(self) -> dict:
        """Get current leader election status"""
        return {
            "service": self.service_name,
            "instance_id": self.instance_id,
            "is_leader": self.is_leader,
            "running": self._running,
            "lock_key": self.lock_key,
            "config": {
                "lock_ttl_seconds": self.config.lock_ttl_seconds,
                "heartbeat_interval_seconds": self.config.heartbeat_interval_seconds,
                "election_check_interval_seconds": self.config.election_check_interval_seconds
            }
        }

    async def get_current_leader(self) -> Optional[str]:
        """Get the current leader instance ID (if any)"""
        try:
            current_leader = await self.redis.get(self.lock_key)
            if current_leader:
                return current_leader.decode() if isinstance(current_leader, bytes) else current_leader
            return None
        except Exception as e:
            logger.error("Failed to get current leader",
                         service=self.service_name,
                         error=str(e))
            return None
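To see the election behave end to end, two candidates can be pointed at the same Redis and the same service name; only one acquires `leader:demo-service:lock`. A minimal sketch (not part of the commit), assuming a reachable local Redis; in a real deployment each pod derives a distinct instance id from HOSTNAME/POD_IP/PID automatically, so the explicit overrides below exist only because both candidates run in one process:

```
import asyncio
import redis.asyncio as redis

from shared.leader_election import LeaderElectionService

async def demo(redis_url: str = "redis://localhost:6379/0") -> None:
    """Run two election candidates against one Redis; exactly one should win."""
    client_a = redis.from_url(redis_url, decode_responses=False)
    client_b = redis.from_url(redis_url, decode_responses=False)

    a = LeaderElectionService(client_a, "demo-service")
    b = LeaderElectionService(client_b, "demo-service")
    # Force distinct ids because both candidates share this process's hostname and PID.
    a.instance_id = "candidate-a"
    b.instance_id = "candidate-b"

    await a.start()
    await b.start()
    await asyncio.sleep(1)

    print("A leader:", a.is_leader, "| B leader:", b.is_leader)
    print("current holder:", await a.get_current_leader())

    await a.stop()
    await b.stop()
    await client_a.close()
    await client_b.close()

if __name__ == "__main__":
    asyncio.run(demo())
```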