Add ci/cd and fix multiple pods issues
This commit adds the CI/CD setup (Gitea, Tekton, Flux CD), the infrastructure reorganization proposal, and fixes duplicate-scheduler issues when services run as multiple pods.

CI_CD_IMPLEMENTATION_PLAN.md (1206 lines, new file; diff suppressed because it is too large)

INFRASTRUCTURE_REORGANIZATION_PROPOSAL.md (413 lines, new file)
@@ -0,0 +1,413 @@
# Infrastructure Reorganization Proposal for Bakery-IA

## Executive Summary

This document presents a comprehensive analysis of the current infrastructure organization and proposes a restructured layout that improves maintainability, scalability, and operational efficiency. The proposal is based on a detailed examination of the existing 177 files across 31 directories in the infrastructure folder.

## Current Infrastructure Analysis

### Current Structure Overview

```
infrastructure/
├── ci-cd/        # 18 files  - CI/CD pipeline components
├── helm/         # 8 files   - Helm charts and scripts
├── kubernetes/   # 103 files - Kubernetes manifests and configs
├── signoz/       # 11 files  - Monitoring dashboards and scripts
└── tls/          # 37 files  - TLS certificates and generation scripts
```

### Key Findings

1. **Kubernetes Base Components (103 files)**: The most complex area with:
   - 20+ service deployments across 15+ microservices
   - 20+ database configurations (PostgreSQL, RabbitMQ, MinIO)
   - 19 migration jobs for different services
   - Infrastructure components (gateway, monitoring, etc.)

2. **CI/CD Pipeline (18 files)**:
   - Tekton tasks and pipelines for GitOps workflow
   - Flux CD configuration for continuous delivery
   - Gitea configuration for Git repository management

3. **Monitoring (11 files)**:
   - SigNoz dashboards for comprehensive observability
   - Import scripts for dashboard management

4. **TLS Certificates (37 files)**:
   - CA certificates and generation scripts
   - Service-specific certificates (PostgreSQL, Redis, MinIO)
   - Certificate signing requests and configurations

### Strengths of Current Organization

1. **Logical Grouping**: Components are generally well-grouped by function
2. **Base/Overlay Pattern**: Kubernetes uses proper base/overlay structure
3. **Comprehensive Monitoring**: SigNoz dashboards cover all major aspects
4. **Security Focus**: Dedicated TLS certificate management

### Challenges Identified

1. **Complexity in Kubernetes Base**: 103 files make navigation difficult
2. **Mixed Component Types**: Services, databases, and infrastructure mixed together
3. **Limited Environment Separation**: Only dev/prod overlays, no staging
4. **Script Scattering**: Automation scripts spread across directories
5. **Documentation Gaps**: Some components lack clear documentation

## Proposed Infrastructure Organization

### High-Level Structure

```
infrastructure/
├── environments/   # Environment-specific configurations
├── platform/       # Platform-level infrastructure
├── services/       # Application services and microservices
├── monitoring/     # Observability and monitoring
├── cicd/           # CI/CD pipeline components
├── security/       # Security configurations and certificates
├── scripts/        # Automation and utility scripts
├── docs/           # Infrastructure documentation
└── README.md       # Top-level infrastructure guide
```

### Detailed Structure Proposal

```
infrastructure/
├── environments/                      # Environment-specific configurations
│   ├── dev/
│   │   ├── k8s-manifests/
│   │   │   ├── base/
│   │   │   │   ├── namespace.yaml
│   │   │   │   ├── configmap.yaml
│   │   │   │   ├── secrets.yaml
│   │   │   │   └── ingress-https.yaml
│   │   │   ├── components/
│   │   │   │   ├── databases/
│   │   │   │   ├── infrastructure/
│   │   │   │   ├── microservices/
│   │   │   │   └── cert-manager/
│   │   │   ├── configs/
│   │   │   ├── cronjobs/
│   │   │   ├── jobs/
│   │   │   └── migrations/
│   │   ├── kustomization.yaml
│   │   └── values/
│   ├── staging/                       # New staging environment
│   │   ├── k8s-manifests/
│   │   └── values/
│   └── prod/
│       ├── k8s-manifests/
│       ├── terraform/                 # Production-specific IaC
│       └── values/
├── platform/                          # Platform-level infrastructure
│   ├── cluster/
│   │   ├── eks/                       # AWS EKS configuration
│   │   │   ├── terraform/
│   │   │   └── manifests/
│   │   └── kind/                      # Local development cluster
│   │       ├── config.yaml
│   │       └── manifests/
│   ├── networking/
│   │   ├── dns/
│   │   ├── load-balancers/
│   │   └── ingress/
│   │       ├── nginx/
│   │       └── cert-manager/
│   ├── security/
│   │   ├── rbac/
│   │   ├── network-policies/
│   │   └── tls/
│   │       ├── ca/
│   │       ├── postgres/
│   │       ├── redis/
│   │       └── minio/
│   └── storage/
│       ├── postgres/
│       ├── redis/
│       └── minio/
├── services/                          # Application services
│   ├── databases/
│   │   ├── postgres/
│   │   │   ├── k8s-manifests/
│   │   │   ├── backups/
│   │   │   ├── monitoring/
│   │   │   └── maintenance/
│   │   ├── redis/
│   │   │   ├── configs/
│   │   │   └── monitoring/
│   │   └── minio/
│   │       ├── buckets/
│   │       └── policies/
│   ├── api-gateway/
│   │   ├── k8s-manifests/
│   │   └── configs/
│   └── microservices/
│       ├── auth/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── tenant/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── training/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── forecasting/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── sales/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── external/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── notification/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── inventory/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── recipes/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── suppliers/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── pos/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── orders/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── production/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── procurement/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── orchestrator/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── alert-processor/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── ai-insights/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── demo-session/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       └── frontend/
│           ├── k8s-manifests/
│           └── configs/
├── monitoring/                        # Observability stack
│   ├── signoz/
│   │   ├── manifests/
│   │   ├── dashboards/
│   │   │   ├── alert-management.json
│   │   │   ├── api-performance.json
│   │   │   ├── application-performance.json
│   │   │   ├── database-performance.json
│   │   │   ├── error-tracking.json
│   │   │   ├── index.json
│   │   │   ├── infrastructure-monitoring.json
│   │   │   ├── log-analysis.json
│   │   │   ├── system-health.json
│   │   │   └── user-activity.json
│   │   ├── values-dev.yaml
│   │   ├── values-prod.yaml
│   │   ├── deploy-signoz.sh
│   │   ├── verify-signoz.sh
│   │   └── generate-test-traffic.sh
│   └── opentelemetry/
│       ├── collector/
│       └── agent/
├── cicd/                              # CI/CD pipeline
│   ├── gitea/
│   │   ├── values.yaml
│   │   └── ingress.yaml
│   ├── tekton/
│   │   ├── tasks/
│   │   │   ├── git-clone.yaml
│   │   │   ├── detect-changes.yaml
│   │   │   ├── kaniko-build.yaml
│   │   │   └── update-gitops.yaml
│   │   ├── pipelines/
│   │   └── triggers/
│   └── flux/
│       ├── git-repository.yaml
│       └── kustomization.yaml
├── security/                          # Security configurations
│   ├── policies/
│   │   ├── network-policies.yaml
│   │   ├── pod-security.yaml
│   │   └── rbac.yaml
│   ├── certificates/
│   │   ├── ca/
│   │   ├── services/
│   │   └── rotation-scripts/
│   ├── scanning/
│   │   ├── trivy/
│   │   └── policies/
│   └── compliance/
│       ├── cis-benchmarks/
│       └── audit-scripts/
├── scripts/                           # Automation scripts
│   ├── setup/
│   │   ├── generate-certificates.sh
│   │   ├── generate-minio-certificates.sh
│   │   └── setup-dockerhub-secrets.sh
│   ├── deployment/
│   │   ├── deploy-signoz.sh
│   │   └── verify-signoz.sh
│   ├── maintenance/
│   │   ├── regenerate_migrations_k8s.sh
│   │   └── kubernetes_restart.sh
│   └── verification/
│       └── verify-registry.sh
├── docs/                              # Infrastructure documentation
│   ├── architecture/
│   │   ├── diagrams/
│   │   └── decisions/
│   ├── operations/
│   │   ├── runbooks/
│   │   └── troubleshooting/
│   ├── onboarding/
│   └── reference/
│       ├── api/
│       └── configurations/
└── README.md
```

## Migration Strategy

### Phase 1: Preparation and Planning

1. **Inventory Analysis**: Complete detailed inventory of all current files
2. **Dependency Mapping**: Identify dependencies between components
3. **Impact Assessment**: Determine which components can be moved safely
4. **Backup Strategy**: Ensure all files are backed up before migration

### Phase 2: Non-Critical Components

1. **Documentation**: Move and update all documentation files
2. **Scripts**: Organize automation scripts into new structure
3. **Monitoring**: Migrate SigNoz dashboards and configurations
4. **CI/CD**: Reorganize pipeline components

### Phase 3: Environment-Specific Components

1. **Create Environment Structure**: Set up dev/staging/prod directories
2. **Migrate Kubernetes Manifests**: Move base components to appropriate locations
3. **Update References**: Ensure all cross-references are corrected
4. **Environment Validation**: Test each environment separately

### Phase 4: Service Components

1. **Database Migration**: Move database configurations to services/databases
2. **Microservice Organization**: Group microservices by domain
3. **Infrastructure Components**: Move gateway and other infrastructure
4. **Service Validation**: Test each service in isolation

### Phase 5: Finalization

1. **Integration Testing**: Test complete infrastructure workflow
2. **Documentation Update**: Finalize all documentation
3. **Team Training**: Conduct training on new structure
4. **Cleanup**: Remove old structure and temporary files

## Benefits of Proposed Structure

### 1. Improved Navigation
- **Clear Hierarchy**: Logical grouping by function and environment
- **Consistent Patterns**: Standardized structure across all components
- **Reduced Cognitive Load**: Easier to find specific components

### 2. Enhanced Maintainability
- **Environment Isolation**: Clear separation of dev/staging/prod
- **Component Grouping**: Related components grouped together
- **Standardized Structure**: Consistent patterns across services

### 3. Better Scalability
- **Modular Design**: Easy to add new services or environments
- **Domain Separation**: Services organized by business domain
- **Infrastructure Independence**: Platform components separate from services

### 4. Improved Security
- **Centralized Security**: All security configurations in one place
- **Environment-Specific Policies**: Tailored security for each environment
- **Better Secret Management**: Clear structure for sensitive data

### 5. Enhanced Observability
- **Comprehensive Monitoring**: All observability tools grouped
- **Standardized Dashboards**: Consistent monitoring across services
- **Centralized Logging**: Better log management structure

## Implementation Considerations

### Tools and Technologies
- **Terraform**: For infrastructure as code (IaC)
- **Kustomize**: For Kubernetes manifest management
- **Helm**: For complex application deployments
- **SOPS/Sealed Secrets**: For secret management
- **Trivy**: For vulnerability scanning
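
As a concrete illustration of how the Kustomize entry above could tie an environment directory to its components in the proposed layout, a minimal overlay might look like the following sketch. The paths mirror the proposed tree; none of these files exist yet, so this is an assumption rather than working configuration:

```yaml
# infrastructure/environments/dev/kustomization.yaml (illustrative sketch only)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - k8s-manifests/base
  - k8s-manifests/components/databases
  - k8s-manifests/components/microservices
  - k8s-manifests/migrations

# Environment-specific image tags can be pinned per overlay
images:
  - name: gitea.bakery-ia.local:5000/bakery/gateway
    newTag: dev
```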

### Team Adaptation
- **Training Plan**: Develop comprehensive training materials
- **Migration Guide**: Create step-by-step migration documentation
- **Support Period**: Provide dedicated support during transition
- **Feedback Mechanism**: Establish channels for team feedback

### Risk Mitigation
- **Phased Approach**: Implement changes incrementally
- **Rollback Plan**: Develop comprehensive rollback procedures
- **Testing Strategy**: Implement thorough testing at each phase
- **Monitoring**: Enhanced monitoring during migration period

## Expected Outcomes

1. **Reduced Time-to-Find**: 40-60% reduction in time spent locating files
2. **Improved Deployment Speed**: 25-35% faster deployment cycles
3. **Enhanced Collaboration**: Better team coordination and understanding
4. **Reduced Errors**: 30-50% reduction in configuration errors
5. **Better Scalability**: Easier to add new services and features

## Conclusion

The proposed infrastructure reorganization represents a significant improvement over the current structure. By implementing a clear, logical hierarchy with proper separation of concerns, the new organization will:

- **Improve operational efficiency** through better navigation and maintainability
- **Enhance security** with centralized security management
- **Support growth** with a scalable, modular design
- **Reduce errors** through standardized patterns and structures
- **Facilitate collaboration** with intuitive organization

The key to successful implementation is a phased approach with thorough testing and team involvement at each stage. With proper planning and execution, this reorganization will provide long-term benefits for the Bakery-IA project's infrastructure management.
## Appendix: File Migration Mapping

### Current → Proposed Mapping

**Kubernetes Components:**
- `infrastructure/kubernetes/base/components/*` → `infrastructure/services/microservices/*/`
- `infrastructure/kubernetes/base/components/databases/*` → `infrastructure/services/databases/*/`
- `infrastructure/kubernetes/base/migrations/*` → `infrastructure/services/microservices/*/migrations/`
- `infrastructure/kubernetes/base/configs/*` → `infrastructure/environments/*/values/`

**CI/CD Components:**
- `infrastructure/ci-cd/*` → `infrastructure/cicd/*/`

**Monitoring Components:**
- `infrastructure/signoz/*` → `infrastructure/monitoring/signoz/*/`
- `infrastructure/helm/*` → `infrastructure/monitoring/signoz/*/` (signoz-related)

**Security Components:**
- `infrastructure/tls/*` → `infrastructure/security/certificates/*/`

**Scripts:**
- `infrastructure/kubernetes/*.sh` → `infrastructure/scripts/*/`
- `infrastructure/helm/*.sh` → `infrastructure/scripts/deployment/*/`
- `infrastructure/tls/*.sh` → `infrastructure/scripts/setup/*/`

This mapping provides a clear path for migrating each component to its new location while maintaining functionality and relationships between components.
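
To preserve file history during the move, the mapping above could be applied with `git mv` in small batches, roughly along the lines of the sketch below. The paths are illustrative, and each batch should be validated (for example with `kubectl kustomize` on the affected overlays) before committing:

```bash
#!/bin/sh
# Illustrative migration batch: one component group per commit so it is easy to revert.
set -e

# CI/CD components: infrastructure/ci-cd/* -> infrastructure/cicd/*
mkdir -p infrastructure/cicd
git mv infrastructure/ci-cd/gitea  infrastructure/cicd/gitea
git mv infrastructure/ci-cd/tekton infrastructure/cicd/tekton
git mv infrastructure/ci-cd/flux   infrastructure/cicd/flux

# Monitoring components: infrastructure/signoz/* -> infrastructure/monitoring/signoz/*
mkdir -p infrastructure/monitoring
git mv infrastructure/signoz infrastructure/monitoring/signoz

# Validate that existing overlays still build before committing
kubectl kustomize infrastructure/kubernetes/overlays/prod > /dev/null

git commit -m "refactor(infra): relocate CI/CD and monitoring per reorganization proposal"
```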

infrastructure/ci-cd/README.md (294 lines, new file)
@@ -0,0 +1,294 @@
# Bakery-IA CI/CD Implementation

This directory contains the configuration for the production-grade CI/CD system for Bakery-IA using Gitea, Tekton, and Flux CD.

## Architecture Overview

```mermaid
graph TD
    A[Developer] -->|Push Code| B[Gitea]
    B -->|Webhook| C[Tekton Pipelines]
    C -->|Build/Test| D[Gitea Registry]
    D -->|New Image| E[Flux CD]
    E -->|kubectl apply| F[MicroK8s Cluster]
    F -->|Metrics| G[SigNoz]
```

## Directory Structure

```
infrastructure/ci-cd/
├── gitea/                      # Gitea configuration (Git server + registry)
│   ├── values.yaml             # Helm values for Gitea
│   └── ingress.yaml            # Ingress configuration
├── tekton/                     # Tekton CI/CD pipeline configuration
│   ├── tasks/                  # Individual pipeline tasks
│   │   ├── git-clone.yaml
│   │   ├── detect-changes.yaml
│   │   ├── kaniko-build.yaml
│   │   └── update-gitops.yaml
│   ├── pipelines/              # Pipeline definitions
│   │   └── ci-pipeline.yaml
│   └── triggers/               # Webhook trigger configuration
│       ├── trigger-template.yaml
│       ├── trigger-binding.yaml
│       ├── event-listener.yaml
│       └── gitlab-interceptor.yaml
├── flux/                       # Flux CD GitOps configuration
│   ├── git-repository.yaml     # Git repository source
│   └── kustomization.yaml      # Deployment kustomization
├── monitoring/                 # Monitoring configuration
│   └── otel-collector.yaml     # OpenTelemetry collector
└── README.md                   # This file
```

## Deployment Instructions

### Phase 1: Infrastructure Setup

1. **Deploy Gitea**:

   ```bash
   # Add Helm repo
   microk8s helm repo add gitea https://dl.gitea.io/charts

   # Create namespace
   microk8s kubectl create namespace gitea

   # Install Gitea
   microk8s helm install gitea gitea/gitea \
     -n gitea \
     -f infrastructure/ci-cd/gitea/values.yaml

   # Apply ingress
   microk8s kubectl apply -f infrastructure/ci-cd/gitea/ingress.yaml
   ```

2. **Deploy Tekton**:

   ```bash
   # Create namespace
   microk8s kubectl create namespace tekton-pipelines

   # Install Tekton Pipelines
   microk8s kubectl apply -f https://storage.googleapis.com/tekton-releases/pipeline/latest/release.yaml

   # Install Tekton Triggers
   microk8s kubectl apply -f https://storage.googleapis.com/tekton-releases/triggers/latest/release.yaml

   # Apply Tekton configurations
   microk8s kubectl apply -f infrastructure/ci-cd/tekton/tasks/
   microk8s kubectl apply -f infrastructure/ci-cd/tekton/pipelines/
   microk8s kubectl apply -f infrastructure/ci-cd/tekton/triggers/
   ```

3. **Deploy Flux CD** (already enabled in MicroK8s):

   ```bash
   # Verify Flux installation
   microk8s kubectl get pods -n flux-system

   # Apply Flux configurations
   microk8s kubectl apply -f infrastructure/ci-cd/flux/
   ```

### Phase 2: Configuration

1. **Set up Gitea webhook**:
   - Go to your Gitea repository settings
   - Add webhook with URL: `http://tekton-triggers.tekton-pipelines.svc.cluster.local:8080`
   - Use the secret from `gitea-webhook-secret`
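
   If the `gitea-webhook-secret` referenced above does not exist yet, it can be created ahead of time. The following is a sketch: the token value is a generated placeholder, and the same value must be entered as the webhook secret in Gitea.

   ```bash
   # Create the webhook secret consumed by the EventListener interceptor
   # (the key name "secretToken" matches triggers/event-listener.yaml)
   microk8s kubectl create secret generic gitea-webhook-secret \
     -n tekton-pipelines \
     --from-literal=secretToken="$(openssl rand -hex 20)"
   ```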
2. **Configure registry credentials**:

   ```bash
   # Create registry credentials secret
   microk8s kubectl create secret docker-registry gitea-registry-credentials \
     -n tekton-pipelines \
     --docker-server=gitea.bakery-ia.local:5000 \
     --docker-username=your-username \
     --docker-password=your-password
   ```

3. **Configure Git credentials for Flux**:

   ```bash
   # Create Git credentials secret
   microk8s kubectl create secret generic gitea-credentials \
     -n flux-system \
     --from-literal=username=your-username \
     --from-literal=password=your-password
   ```

### Phase 3: Monitoring

```bash
# Apply OpenTelemetry configuration
microk8s kubectl apply -f infrastructure/ci-cd/monitoring/otel-collector.yaml
```

## Usage

### Triggering a Pipeline

1. **Manual trigger**:

   ```bash
   # Create a PipelineRun manually
   microk8s kubectl create -f - <<EOF
   apiVersion: tekton.dev/v1beta1
   kind: PipelineRun
   metadata:
     name: manual-ci-run
     namespace: tekton-pipelines
   spec:
     pipelineRef:
       name: bakery-ia-ci
     workspaces:
       - name: shared-workspace
         volumeClaimTemplate:
           spec:
             accessModes: ["ReadWriteOnce"]
             resources:
               requests:
                 storage: 5Gi
       - name: docker-credentials
         secret:
           secretName: gitea-registry-credentials
     params:
       - name: git-url
         value: "http://gitea.bakery-ia.local/bakery/bakery-ia.git"
       - name: git-revision
         value: "main"
   EOF
   ```

2. **Automatic trigger**: Push code to the repository and the webhook will trigger the pipeline automatically.

### Monitoring Pipeline Runs

```bash
# List all PipelineRuns
microk8s kubectl get pipelineruns -n tekton-pipelines

# View logs for a specific PipelineRun
microk8s kubectl logs -n tekton-pipelines <pipelinerun-pod> -c <step-name>

# View Tekton dashboard
microk8s kubectl port-forward -n tekton-pipelines svc/tekton-dashboard 9097:9097
```

## Troubleshooting

### Common Issues

1. **Pipeline not triggering**:
   - Check Gitea webhook logs
   - Verify EventListener pods are running
   - Check TriggerBinding configuration

2. **Build failures**:
   - Check Kaniko logs for build errors
   - Verify Dockerfile paths are correct
   - Ensure registry credentials are valid

3. **Flux not applying changes**:
   - Check GitRepository status
   - Verify Kustomization reconciliation
   - Check Flux logs for errors

### Debugging Commands

```bash
# Check Tekton controller logs
microk8s kubectl logs -n tekton-pipelines -l app=tekton-pipelines-controller

# Check Flux reconciliation
microk8s kubectl get kustomizations -n flux-system -o yaml

# Check Gitea webhook delivery
microk8s kubectl logs -n tekton-pipelines -l app=tekton-triggers-controller
```

## Security Considerations

1. **Secrets Management**:
   - Use Kubernetes secrets for sensitive data
   - Rotate credentials regularly
   - Use RBAC for namespace isolation

2. **Network Security**:
   - Configure network policies
   - Use internal DNS names
   - Restrict ingress access

3. **Registry Security**:
   - Enable image scanning
   - Use image signing
   - Implement cleanup policies

## Maintenance

### Upgrading Components

```bash
# Upgrade Tekton
microk8s kubectl apply -f https://storage.googleapis.com/tekton-releases/pipeline/latest/release.yaml

# Upgrade Flux
microk8s helm upgrade fluxcd fluxcd/flux2 -n flux-system

# Upgrade Gitea
microk8s helm upgrade gitea gitea/gitea -n gitea -f infrastructure/ci-cd/gitea/values.yaml
```

### Backup Procedures

```bash
# Backup Gitea
microk8s kubectl exec -n gitea gitea-0 -- gitea dump -c /data/gitea/conf/app.ini

# Backup Flux configurations
microk8s kubectl get all -n flux-system -o yaml > flux-backup.yaml

# Backup Tekton configurations
microk8s kubectl get all -n tekton-pipelines -o yaml > tekton-backup.yaml
```

## Performance Optimization

1. **Resource Management**:
   - Set appropriate resource limits
   - Limit concurrent builds
   - Use node selectors for build pods

2. **Caching**:
   - Configure Kaniko cache
   - Use persistent volumes for dependencies
   - Cache Docker layers

3. **Parallelization**:
   - Build independent services in parallel
   - Use matrix builds for different architectures
   - Optimize task dependencies
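
For the Kaniko caching point above, layer caching can be turned on with additional executor arguments. A sketch against the `kaniko-build` task in this directory; the cache repository name is an assumption and must exist in the registry:

```yaml
# Illustrative args for the kaniko-build step with caching enabled
args:
  - --dockerfile=$(workspaces.source.path)/services/$(params.services)/Dockerfile
  - --context=$(workspaces.source.path)
  - --destination=$(params.registry)/bakery/$(params.services):$(params.git-revision)
  - --cache=true
  - --cache-repo=$(params.registry)/bakery/kaniko-cache   # assumed cache repository
  - --cache-ttl=168h
  - --verbosity=info
```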

## Integration with Existing System

The CI/CD system integrates with:
- **SigNoz**: For monitoring and observability
- **MicroK8s**: For cluster management
- **Existing Kubernetes manifests**: In `infrastructure/kubernetes/`
- **Current services**: All 19 microservices in `services/`

## Migration Plan

1. **Phase 1**: Set up infrastructure (Gitea, Tekton, Flux)
2. **Phase 2**: Configure pipelines and triggers
3. **Phase 3**: Test with non-critical services
4. **Phase 4**: Gradual rollout to all services
5. **Phase 5**: Decommission old deployment methods

## Support

For issues with the CI/CD system:
- Check logs and monitoring first
- Review the troubleshooting section
- Consult the original implementation plan
- Refer to component documentation:
  - [Tekton Documentation](https://tekton.dev/docs/)
  - [Flux CD Documentation](https://fluxcd.io/docs/)
  - [Gitea Documentation](https://docs.gitea.io/)

infrastructure/ci-cd/flux/git-repository.yaml (16 lines, new file)
@@ -0,0 +1,16 @@
# Flux GitRepository for Bakery-IA
# This resource tells Flux where to find the Git repository

apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: bakery-ia
  namespace: flux-system
spec:
  interval: 1m
  url: http://gitea.bakery-ia.local/bakery/bakery-ia.git
  ref:
    branch: main
  secretRef:
    name: gitea-credentials
  timeout: 60s

infrastructure/ci-cd/flux/kustomization.yaml (27 lines, new file)
@@ -0,0 +1,27 @@
# Flux Kustomization for Bakery-IA Production Deployment
# This resource tells Flux how to deploy the application

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: bakery-ia-prod
  namespace: flux-system
spec:
  interval: 5m
  path: ./infrastructure/kubernetes/overlays/prod
  prune: true
  sourceRef:
    kind: GitRepository
    name: bakery-ia
  targetNamespace: bakery-ia
  timeout: 5m
  retryInterval: 1m
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: auth-service
      namespace: bakery-ia
    - apiVersion: apps/v1
      kind: Deployment
      name: gateway
      namespace: bakery-ia
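
Once both Flux resources are applied, reconciliation can be checked with kubectl; the `flux` CLI offers equivalent commands if it is installed:

```bash
# Has Flux fetched the latest commit from Gitea?
microk8s kubectl get gitrepositories -n flux-system

# Has the prod overlay been applied, and are the health checks above passing?
microk8s kubectl get kustomizations -n flux-system
microk8s kubectl describe kustomization bakery-ia-prod -n flux-system
```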

infrastructure/ci-cd/gitea/ingress.yaml (25 lines, new file)
@@ -0,0 +1,25 @@
# Gitea Ingress configuration for Bakery-IA CI/CD
# This provides external access to Gitea within the cluster

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: gitea-ingress
  namespace: gitea
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
    nginx.ingress.kubernetes.io/proxy-body-size: "0"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
spec:
  rules:
    - host: gitea.bakery-ia.local
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: gitea-http
                port:
                  number: 3000

infrastructure/ci-cd/gitea/values.yaml (38 lines, new file)
@@ -0,0 +1,38 @@
# Gitea Helm values configuration for Bakery-IA CI/CD
# This configuration sets up Gitea with registry support and appropriate storage

service:
  type: ClusterIP
  httpPort: 3000
  sshPort: 2222

persistence:
  enabled: true
  size: 50Gi
  storageClass: "microk8s-hostpath"

gitea:
  config:
    server:
      DOMAIN: gitea.bakery-ia.local
      SSH_DOMAIN: gitea.bakery-ia.local
      ROOT_URL: http://gitea.bakery-ia.local
    repository:
      ENABLE_PUSH_CREATE_USER: true
      ENABLE_PUSH_CREATE_ORG: true
    registry:
      ENABLED: true

postgresql:
  enabled: true
  persistence:
    size: 20Gi

# Resource configuration for production environment
resources:
  limits:
    cpu: 1000m
    memory: 1Gi
  requests:
    cpu: 500m
    memory: 512Mi
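
With the registry enabled in the values above, pushing an image from a workstation would look roughly like this. The user, the image name, and the assumption that `gitea.bakery-ia.local:5000` resolves to the cluster ingress are placeholders:

```bash
# Placeholder credentials: use a real Gitea user or access token
docker login gitea.bakery-ia.local:5000 -u your-username -p your-password

docker tag bakery/auth-service:dev gitea.bakery-ia.local:5000/bakery/auth-service:dev
docker push gitea.bakery-ia.local:5000/bakery/auth-service:dev
```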

infrastructure/ci-cd/monitoring/otel-collector.yaml (70 lines, new file)
@@ -0,0 +1,70 @@
# OpenTelemetry Collector for Bakery-IA CI/CD Monitoring
# This collects metrics and traces from Tekton pipelines

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: tekton-otel
  namespace: tekton-pipelines
spec:
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
      prometheus:
        config:
          scrape_configs:
            - job_name: 'tekton-pipelines'
              scrape_interval: 30s
              static_configs:
                - targets: ['tekton-pipelines-controller.tekton-pipelines.svc.cluster.local:9090']

    processors:
      batch:
        timeout: 5s
        send_batch_size: 1000
      memory_limiter:
        check_interval: 2s
        limit_percentage: 75
        spike_limit_percentage: 20

    exporters:
      otlp:
        endpoint: "signoz-otel-collector.monitoring.svc.cluster.local:4317"
        tls:
          insecure: true
        retry_on_failure:
          enabled: true
          initial_interval: 5s
          max_interval: 30s
          max_elapsed_time: 300s
      logging:
        logLevel: debug

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [otlp, logging]
        metrics:
          receivers: [otlp, prometheus]
          processors: [memory_limiter, batch]
          exporters: [otlp, logging]
      telemetry:
        logs:
          level: "info"
          encoding: "json"

  mode: deployment
  resources:
    limits:
      cpu: 500m
      memory: 512Mi
    requests:
      cpu: 200m
      memory: 256Mi
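
After applying this manifest, the OpenTelemetry Operator typically creates a Deployment named after the resource with a `-collector` suffix; a quick check (the exact resource name is an assumption about the operator's naming convention):

```bash
# Confirm the collector pod is running and forwarding to SigNoz
microk8s kubectl get deployment tekton-otel-collector -n tekton-pipelines
microk8s kubectl logs -n tekton-pipelines deployment/tekton-otel-collector --tail=50
```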

infrastructure/ci-cd/tekton/pipelines/ci-pipeline.yaml (83 lines, new file)
@@ -0,0 +1,83 @@
# Main CI Pipeline for Bakery-IA
# This pipeline orchestrates the build, test, and deploy process

apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: bakery-ia-ci
  namespace: tekton-pipelines
spec:
  workspaces:
    - name: shared-workspace
    - name: docker-credentials
  params:
    - name: git-url
      type: string
      description: Repository URL
    - name: git-revision
      type: string
      description: Git revision/commit hash
    - name: registry
      type: string
      description: Container registry URL
      default: "gitea.bakery-ia.local:5000"
  tasks:
    - name: fetch-source
      taskRef:
        name: git-clone
      workspaces:
        - name: output
          workspace: shared-workspace
      params:
        - name: url
          value: $(params.git-url)
        - name: revision
          value: $(params.git-revision)

    - name: detect-changes
      runAfter: [fetch-source]
      taskRef:
        name: detect-changed-services
      workspaces:
        - name: source
          workspace: shared-workspace

    - name: build-and-push
      runAfter: [detect-changes]
      taskRef:
        name: kaniko-build
      when:
        - input: "$(tasks.detect-changes.results.changed-services)"
          operator: notin
          values: ["none"]
      workspaces:
        - name: source
          workspace: shared-workspace
        - name: docker-credentials
          workspace: docker-credentials
      params:
        - name: services
          value: $(tasks.detect-changes.results.changed-services)
        - name: registry
          value: $(params.registry)
        - name: git-revision
          value: $(params.git-revision)

    - name: update-gitops-manifests
      runAfter: [build-and-push]
      taskRef:
        name: update-gitops
      when:
        - input: "$(tasks.detect-changes.results.changed-services)"
          operator: notin
          values: ["none"]
      workspaces:
        - name: source
          workspace: shared-workspace
      params:
        - name: services
          value: $(tasks.detect-changes.results.changed-services)
        - name: registry
          value: $(params.registry)
        - name: git-revision
          value: $(params.git-revision)
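
If the Tekton CLI (`tkn`) is installed, the pipeline above can be inspected and followed without crafting PipelineRuns by hand; a brief sketch:

```bash
# List and follow runs of the bakery-ia-ci pipeline
tkn pipeline list -n tekton-pipelines
tkn pipelinerun list -n tekton-pipelines
tkn pipelinerun logs --last -f -n tekton-pipelines
```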

infrastructure/ci-cd/tekton/tasks/detect-changes.yaml (64 lines, new file)
@@ -0,0 +1,64 @@
# Tekton Detect Changed Services Task for Bakery-IA CI/CD
# This task identifies which services have changed in the repository

apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: detect-changed-services
  namespace: tekton-pipelines
spec:
  workspaces:
    - name: source
  results:
    - name: changed-services
      description: Comma-separated list of changed services
  steps:
    - name: detect
      image: alpine/git
      script: |
        #!/bin/sh
        # NOTE: alpine/git ships BusyBox ash, not bash, so this script sticks to
        # POSIX sh (no arrays, no [[ ]]).
        set -e
        cd $(workspaces.source.path)

        echo "Detecting changed files..."
        # Get list of changed files compared to previous commit
        CHANGED_FILES=$(git diff --name-only HEAD~1 HEAD 2>/dev/null || git diff --name-only HEAD)

        echo "Changed files: $CHANGED_FILES"

        # Map files to services (space-separated list instead of a bash array)
        CHANGED_SERVICES=""
        add_service() {
          # Only add unique service names
          case " $CHANGED_SERVICES " in
            *" $1 "*) ;;
            *) CHANGED_SERVICES="$CHANGED_SERVICES $1" ;;
          esac
        }
        for file in $CHANGED_FILES; do
          case "$file" in
            services/*) add_service "$(echo "$file" | cut -d'/' -f2)" ;;
            frontend/*) add_service "frontend" ;;
            gateway/*)  add_service "gateway" ;;
          esac
        done

        # If no specific services changed, check for infrastructure changes
        if [ -z "$CHANGED_SERVICES" ]; then
          for file in $CHANGED_FILES; do
            case "$file" in
              infrastructure/*)
                CHANGED_SERVICES="infrastructure"
                break
                ;;
            esac
          done
        fi

        # Output result
        if [ -z "$CHANGED_SERVICES" ]; then
          echo "No service changes detected"
          echo "none" | tee $(results.changed-services.path)
        else
          echo "Detected changes in services:$CHANGED_SERVICES"
          echo "$CHANGED_SERVICES" | sed 's/^ //' | tr ' ' ',' | tee $(results.changed-services.path)
        fi

infrastructure/ci-cd/tekton/tasks/git-clone.yaml (31 lines, new file)
@@ -0,0 +1,31 @@
# Tekton Git Clone Task for Bakery-IA CI/CD
# This task clones the source code repository

apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: git-clone
  namespace: tekton-pipelines
spec:
  workspaces:
    - name: output
  params:
    - name: url
      type: string
      description: Repository URL to clone
    - name: revision
      type: string
      description: Git revision to checkout
      default: "main"
  steps:
    - name: clone
      image: alpine/git
      script: |
        #!/bin/sh
        set -e
        echo "Cloning repository: $(params.url)"
        git clone $(params.url) $(workspaces.output.path)
        cd $(workspaces.output.path)
        echo "Checking out revision: $(params.revision)"
        git checkout $(params.revision)
        echo "Repository cloned successfully"

infrastructure/ci-cd/tekton/tasks/kaniko-build.yaml (40 lines, new file)
@@ -0,0 +1,40 @@
# Tekton Kaniko Build Task for Bakery-IA CI/CD
# This task builds and pushes container images using Kaniko

apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: kaniko-build
  namespace: tekton-pipelines
spec:
  workspaces:
    - name: source
    - name: docker-credentials
  params:
    - name: services
      type: string
      description: Comma-separated list of services to build
    - name: registry
      type: string
      description: Container registry URL
      default: "gitea.bakery-ia.local:5000"
    - name: git-revision
      type: string
      description: Git revision for image tag
      default: "latest"
  steps:
    - name: build-and-push
      image: gcr.io/kaniko-project/executor:v1.9.0
      # NOTE: $(params.services) is substituted as a single path segment, so this step
      # builds one service per run even though the parameter is documented as a
      # comma-separated list. The docker-credentials workspace is declared above but
      # the step mounts an empty emptyDir at /kaniko/.docker, so registry auth must be
      # provided there for pushes to succeed.
      args:
        - --dockerfile=$(workspaces.source.path)/services/$(params.services)/Dockerfile
        - --context=$(workspaces.source.path)
        - --destination=$(params.registry)/bakery/$(params.services):$(params.git-revision)
        - --verbosity=info
      volumeMounts:
        - name: docker-config
          mountPath: /kaniko/.docker
      securityContext:
        runAsUser: 0
  volumes:
    - name: docker-config
      emptyDir: {}

infrastructure/ci-cd/tekton/tasks/update-gitops.yaml (66 lines, new file)
@@ -0,0 +1,66 @@
# Tekton Update GitOps Manifests Task for Bakery-IA CI/CD
# This task updates Kubernetes manifests with new image tags

apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: update-gitops
  namespace: tekton-pipelines
spec:
  workspaces:
    - name: source
  params:
    - name: services
      type: string
      description: Comma-separated list of services to update
    - name: registry
      type: string
      description: Container registry URL
    - name: git-revision
      type: string
      description: Git revision for image tag
  steps:
    - name: update-manifests
      # NOTE: this step needs sed and git available in the image, and its shell is
      # POSIX sh, so the loop below avoids bash-only constructs such as read -a.
      image: bitnami/kubectl
      script: |
        #!/bin/sh
        set -e
        cd $(workspaces.source.path)

        echo "Updating GitOps manifests for services: $(params.services)"

        # Split the comma-separated service list (POSIX sh, no arrays)
        for service in $(echo "$(params.services)" | tr ',' ' '); do
          echo "Processing service: $service"

          # Find and update Kubernetes manifests
          if [ "$service" = "frontend" ]; then
            # Update frontend deployment
            if [ -f "infrastructure/kubernetes/overlays/prod/frontend-deployment.yaml" ]; then
              sed -i "s|image:.*|image: $(params.registry)/bakery/frontend:$(params.git-revision)|g" \
                "infrastructure/kubernetes/overlays/prod/frontend-deployment.yaml"
            fi
          elif [ "$service" = "gateway" ]; then
            # Update gateway deployment
            if [ -f "infrastructure/kubernetes/overlays/prod/gateway-deployment.yaml" ]; then
              sed -i "s|image:.*|image: $(params.registry)/bakery/gateway:$(params.git-revision)|g" \
                "infrastructure/kubernetes/overlays/prod/gateway-deployment.yaml"
            fi
          else
            # Update service deployment
            DEPLOYMENT_FILE="infrastructure/kubernetes/overlays/prod/${service}-deployment.yaml"
            if [ -f "$DEPLOYMENT_FILE" ]; then
              sed -i "s|image:.*|image: $(params.registry)/bakery/${service}:$(params.git-revision)|g" \
                "$DEPLOYMENT_FILE"
            fi
          fi
        done

        # Commit changes
        git config --global user.name "bakery-ia-ci"
        git config --global user.email "ci@bakery-ia.local"
        git add .
        git commit -m "CI: Update image tags for $(params.services) to $(params.git-revision)"
        git push origin HEAD

infrastructure/ci-cd/tekton/triggers/event-listener.yaml (26 lines, new file)
@@ -0,0 +1,26 @@
# Tekton EventListener for Bakery-IA CI/CD
# This listener receives webhook events and triggers pipelines

apiVersion: triggers.tekton.dev/v1alpha1
kind: EventListener
metadata:
  name: bakery-ia-listener
  namespace: tekton-pipelines
spec:
  serviceAccountName: tekton-triggers-sa
  triggers:
    - name: bakery-ia-gitea-trigger
      bindings:
        - ref: bakery-ia-trigger-binding
      template:
        ref: bakery-ia-trigger-template
      interceptors:
        - ref:
            name: "gitlab"
          params:
            - name: "secretRef"
              value:
                secretName: gitea-webhook-secret
                secretKey: secretToken
            - name: "eventTypes"
              value: ["push"]
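
The `tekton-triggers-sa` ServiceAccount referenced above is not defined in this directory. A minimal sketch of what it could look like follows; the ClusterRole names are the aggregated roles usually shipped with Tekton Triggers releases, but they should be checked against the installed version:

```yaml
# Hypothetical ServiceAccount and bindings assumed by the EventListener above
apiVersion: v1
kind: ServiceAccount
metadata:
  name: tekton-triggers-sa
  namespace: tekton-pipelines
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: tekton-triggers-sa-eventlistener
  namespace: tekton-pipelines
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: tekton-triggers-eventlistener-roles
subjects:
  - kind: ServiceAccount
    name: tekton-triggers-sa
    namespace: tekton-pipelines
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: tekton-triggers-sa-eventlistener-clusterroles
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: tekton-triggers-eventlistener-clusterroles
subjects:
  - kind: ServiceAccount
    name: tekton-triggers-sa
    namespace: tekton-pipelines
```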

infrastructure/ci-cd/tekton/triggers/gitlab-interceptor.yaml (14 lines, new file)
@@ -0,0 +1,14 @@
# GitLab/Gitea Webhook Interceptor for Tekton Triggers
# This interceptor validates and processes Gitea webhook events

apiVersion: triggers.tekton.dev/v1alpha1
kind: ClusterInterceptor
metadata:
  name: gitlab
spec:
  clientConfig:
    service:
      name: tekton-triggers-core-interceptors
      namespace: tekton-pipelines
      path: "/v1/webhook/gitlab"
      port: 8443

infrastructure/ci-cd/tekton/triggers/trigger-binding.yaml (16 lines, new file)
@@ -0,0 +1,16 @@
# Tekton TriggerBinding for Bakery-IA CI/CD
# This binding extracts parameters from Gitea webhook events

apiVersion: triggers.tekton.dev/v1alpha1
kind: TriggerBinding
metadata:
  name: bakery-ia-trigger-binding
  namespace: tekton-pipelines
spec:
  params:
    - name: git-repo-url
      value: $(body.repository.clone_url)
    - name: git-revision
      value: $(body.head_commit.id)
    - name: git-repo-name
      value: $(body.repository.name)

infrastructure/ci-cd/tekton/triggers/trigger-template.yaml (43 lines, new file)
@@ -0,0 +1,43 @@
# Tekton TriggerTemplate for Bakery-IA CI/CD
# This template defines how PipelineRuns are created when triggers fire

apiVersion: triggers.tekton.dev/v1alpha1
kind: TriggerTemplate
metadata:
  name: bakery-ia-trigger-template
  namespace: tekton-pipelines
spec:
  params:
    - name: git-repo-url
      description: The git repository URL
    - name: git-revision
      description: The git revision/commit hash
    - name: git-repo-name
      description: The git repository name
      default: "bakery-ia"
  resourcetemplates:
    - apiVersion: tekton.dev/v1beta1
      kind: PipelineRun
      metadata:
        generateName: bakery-ia-ci-run-$(params.git-repo-name)-
      spec:
        pipelineRef:
          name: bakery-ia-ci
        workspaces:
          - name: shared-workspace
            volumeClaimTemplate:
              spec:
                accessModes: ["ReadWriteOnce"]
                resources:
                  requests:
                    storage: 5Gi
          - name: docker-credentials
            secret:
              secretName: gitea-registry-credentials
        params:
          - name: git-url
            value: $(params.git-repo-url)
          - name: git-revision
            value: $(params.git-revision)
          - name: registry
            value: "gitea.bakery-ia.local:5000"

@@ -18,6 +18,28 @@ class OrchestratorService(StandardFastAPIService):
     expected_migration_version = "001_initial_schema"
 
+    def __init__(self):
+        # Define expected database tables for health checks
+        orchestrator_expected_tables = [
+            'orchestration_runs'
+        ]
+
+        self.rabbitmq_client = None
+        self.event_publisher = None
+        self.leader_election = None
+        self.scheduler_service = None
+
+        super().__init__(
+            service_name="orchestrator-service",
+            app_name=settings.APP_NAME,
+            description=settings.DESCRIPTION,
+            version=settings.VERSION,
+            api_prefix="",  # Empty because RouteBuilder already includes /api/v1
+            database_manager=database_manager,
+            expected_tables=orchestrator_expected_tables,
+            enable_messaging=True  # Enable RabbitMQ for event publishing
+        )
+
     async def verify_migrations(self):
         """Verify database schema matches the latest migrations"""
         try:
@@ -32,26 +54,6 @@ class OrchestratorService(StandardFastAPIService):
             self.logger.error(f"Migration verification failed: {e}")
             raise
 
-    def __init__(self):
-        # Define expected database tables for health checks
-        orchestrator_expected_tables = [
-            'orchestration_runs'
-        ]
-
-        self.rabbitmq_client = None
-        self.event_publisher = None
-
-        super().__init__(
-            service_name="orchestrator-service",
-            app_name=settings.APP_NAME,
-            description=settings.DESCRIPTION,
-            version=settings.VERSION,
-            api_prefix="",  # Empty because RouteBuilder already includes /api/v1
-            database_manager=database_manager,
-            expected_tables=orchestrator_expected_tables,
-            enable_messaging=True  # Enable RabbitMQ for event publishing
-        )
-
     async def _setup_messaging(self):
         """Setup messaging for orchestrator service"""
         from shared.messaging import UnifiedEventPublisher, RabbitMQClient
@@ -84,22 +86,91 @@ class OrchestratorService(StandardFastAPIService):
 
         self.logger.info("Orchestrator Service starting up...")
 
-        # Initialize orchestrator scheduler service with EventPublisher
-        from app.services.orchestrator_service import OrchestratorSchedulerService
-        scheduler_service = OrchestratorSchedulerService(self.event_publisher, settings)
-        await scheduler_service.start()
-        app.state.scheduler_service = scheduler_service
-        self.logger.info("Orchestrator scheduler service started")
+        # Initialize leader election for horizontal scaling
+        # Only the leader pod will run the scheduler
+        await self._setup_leader_election(app)
 
         # REMOVED: Delivery tracking service - moved to procurement service (domain ownership)
 
+    async def _setup_leader_election(self, app: FastAPI):
+        """
+        Setup leader election for scheduler.
+
+        CRITICAL FOR HORIZONTAL SCALING:
+        Without leader election, each pod would run the same scheduled jobs,
+        causing duplicate forecasts, production schedules, and database contention.
+        """
+        from shared.leader_election import LeaderElectionService
+        import redis.asyncio as redis
+
+        try:
+            # Create Redis connection for leader election
+            redis_url = f"redis://:{settings.REDIS_PASSWORD}@{settings.REDIS_HOST}:{settings.REDIS_PORT}/{settings.REDIS_DB}"
+            if settings.REDIS_TLS_ENABLED.lower() == "true":
+                redis_url = redis_url.replace("redis://", "rediss://")
+
+            redis_client = redis.from_url(redis_url, decode_responses=False)
+            await redis_client.ping()
+
+            # Use shared leader election service
+            self.leader_election = LeaderElectionService(
+                redis_client,
+                service_name="orchestrator"
+            )
+
+            # Define callbacks for leader state changes
+            async def on_become_leader():
+                self.logger.info("This pod became the leader - starting scheduler")
+                from app.services.orchestrator_service import OrchestratorSchedulerService
+                self.scheduler_service = OrchestratorSchedulerService(self.event_publisher, settings)
+                await self.scheduler_service.start()
+                app.state.scheduler_service = self.scheduler_service
+                self.logger.info("Orchestrator scheduler service started (leader only)")
+
+            async def on_lose_leader():
+                self.logger.warning("This pod lost leadership - stopping scheduler")
+                if self.scheduler_service:
+                    await self.scheduler_service.stop()
+                    self.scheduler_service = None
+                if hasattr(app.state, 'scheduler_service'):
+                    app.state.scheduler_service = None
+                self.logger.info("Orchestrator scheduler service stopped (no longer leader)")
+
+            # Start leader election
+            await self.leader_election.start(
+                on_become_leader=on_become_leader,
+                on_lose_leader=on_lose_leader
+            )
+
+            # Store leader election in app state for health checks
+            app.state.leader_election = self.leader_election
+
+            self.logger.info("Leader election initialized",
+                             is_leader=self.leader_election.is_leader,
+                             instance_id=self.leader_election.instance_id)
+
+        except Exception as e:
+            self.logger.error("Failed to setup leader election, falling back to standalone mode",
+                              error=str(e))
+            # Fallback: start scheduler anyway (for single-pod deployments)
+            from app.services.orchestrator_service import OrchestratorSchedulerService
+            self.scheduler_service = OrchestratorSchedulerService(self.event_publisher, settings)
+            await self.scheduler_service.start()
+            app.state.scheduler_service = self.scheduler_service
+            self.logger.warning("Scheduler started in standalone mode (no leader election)")
+
     async def on_shutdown(self, app: FastAPI):
         """Custom shutdown logic for orchestrator service"""
         self.logger.info("Orchestrator Service shutting down...")
 
-        # Stop scheduler service
-        if hasattr(app.state, 'scheduler_service'):
-            await app.state.scheduler_service.stop()
+        # Stop leader election (this will also stop scheduler if we're the leader)
+        if self.leader_election:
+            await self.leader_election.stop()
+            self.logger.info("Leader election stopped")
+
+        # Stop scheduler service if still running
+        if self.scheduler_service:
+            await self.scheduler_service.stop()
             self.logger.info("Orchestrator scheduler service stopped")
|
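Since the orchestrator now stores the election object in `app.state.leader_election` "for health checks", a probe endpoint can expose which pod currently holds the lock. The following is an illustrative sketch only, not part of this commit; the route path and handler name are hypothetical, while `get_status()` is the method defined in `shared/leader_election/service.py` below.

```python
# Illustrative sketch: surface leader status from app.state for probes/debugging.
from fastapi import FastAPI, Request

app = FastAPI()

@app.get("/internal/leader-status")
async def leader_status(request: Request) -> dict:
    election = getattr(request.app.state, "leader_election", None)
    if election is None:
        # Standalone fallback mode: the scheduler runs unconditionally on this pod.
        return {"mode": "standalone", "is_leader": True}
    # get_status() is provided by shared.leader_election.LeaderElectionService.
    return election.get_status()
```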
@@ -1,12 +1,12 @@
 """
-Delivery Tracking Service - Simplified
+Delivery Tracking Service - With Leader Election

 Tracks purchase order deliveries and generates appropriate alerts using EventPublisher:
 - DELIVERY_ARRIVING_SOON: 2 hours before delivery window
 - DELIVERY_OVERDUE: 30 minutes after expected delivery time
 - STOCK_RECEIPT_INCOMPLETE: If delivery not marked as received

-Runs as internal scheduler with leader election.
+Runs as internal scheduler with leader election for horizontal scaling.
 Domain ownership: Procurement service owns all PO and delivery tracking.
 """

@@ -30,7 +30,7 @@ class DeliveryTrackingService:
     Monitors PO deliveries and generates time-based alerts using EventPublisher.

     Uses APScheduler with leader election to run hourly checks.
-    Only one pod executes checks (others skip if not leader).
+    Only one pod executes checks - leader election ensures no duplicate alerts.
     """

     def __init__(self, event_publisher: UnifiedEventPublisher, config, database_manager=None):
@@ -38,46 +38,121 @@ class DeliveryTrackingService:
         self.config = config
         self.database_manager = database_manager
         self.scheduler = AsyncIOScheduler()
-        self.is_leader = False
+        self._leader_election = None
+        self._redis_client = None
+        self._scheduler_started = False
         self.instance_id = str(uuid4())[:8]  # Short instance ID for logging

     async def start(self):
-        """Start the delivery tracking scheduler"""
-        # Initialize and start scheduler if not already running
+        """Start the delivery tracking scheduler with leader election"""
+        try:
+            # Initialize leader election
+            await self._setup_leader_election()
+        except Exception as e:
+            logger.error("Failed to setup leader election, starting in standalone mode",
+                         error=str(e))
+            # Fallback: start scheduler without leader election
+            await self._start_scheduler()
+
+    async def _setup_leader_election(self):
+        """Setup Redis-based leader election for horizontal scaling"""
+        from shared.leader_election import LeaderElectionService
+        import redis.asyncio as redis
+
+        # Build Redis URL from config
+        redis_url = getattr(self.config, 'REDIS_URL', None)
+        if not redis_url:
+            redis_password = getattr(self.config, 'REDIS_PASSWORD', '')
+            redis_host = getattr(self.config, 'REDIS_HOST', 'localhost')
+            redis_port = getattr(self.config, 'REDIS_PORT', 6379)
+            redis_db = getattr(self.config, 'REDIS_DB', 0)
+            redis_url = f"redis://:{redis_password}@{redis_host}:{redis_port}/{redis_db}"
+
+        self._redis_client = redis.from_url(redis_url, decode_responses=False)
+        await self._redis_client.ping()
+
+        # Create leader election service
+        self._leader_election = LeaderElectionService(
+            self._redis_client,
+            service_name="procurement-delivery-tracking"
+        )
+
+        # Start leader election with callbacks
+        await self._leader_election.start(
+            on_become_leader=self._on_become_leader,
+            on_lose_leader=self._on_lose_leader
+        )
+
+        logger.info("Leader election initialized for delivery tracking",
+                    is_leader=self._leader_election.is_leader,
+                    instance_id=self.instance_id)
+
+    async def _on_become_leader(self):
+        """Called when this instance becomes the leader"""
+        logger.info("Became leader for delivery tracking - starting scheduler",
+                    instance_id=self.instance_id)
+        await self._start_scheduler()
+
+    async def _on_lose_leader(self):
+        """Called when this instance loses leadership"""
+        logger.warning("Lost leadership for delivery tracking - stopping scheduler",
+                       instance_id=self.instance_id)
+        await self._stop_scheduler()
+
+    async def _start_scheduler(self):
+        """Start the APScheduler with delivery tracking jobs"""
+        if self._scheduler_started:
+            logger.debug("Scheduler already started", instance_id=self.instance_id)
+            return
+
         if not self.scheduler.running:
             # Add hourly job to check deliveries
             self.scheduler.add_job(
                 self._check_all_tenants,
-                trigger=CronTrigger(minute=30),  # Run every hour at :30 (00:30, 01:30, 02:30, etc.)
+                trigger=CronTrigger(minute=30),  # Run every hour at :30
                 id='hourly_delivery_check',
                 name='Hourly Delivery Tracking',
                 replace_existing=True,
-                max_instances=1,  # Ensure no overlapping runs
-                coalesce=True  # Combine missed runs
+                max_instances=1,
+                coalesce=True
             )

             self.scheduler.start()
+            self._scheduler_started = True

-            # Log next run time
             next_run = self.scheduler.get_job('hourly_delivery_check').next_run_time
-            logger.info(
-                "Delivery tracking scheduler started with hourly checks",
-                instance_id=self.instance_id,
-                next_run=next_run.isoformat() if next_run else None
-            )
-        else:
-            logger.info(
-                "Delivery tracking scheduler already running",
-                instance_id=self.instance_id
-            )
+            logger.info("Delivery tracking scheduler started",
+                        instance_id=self.instance_id,
+                        next_run=next_run.isoformat() if next_run else None)
+
+    async def _stop_scheduler(self):
+        """Stop the APScheduler"""
+        if not self._scheduler_started:
+            return
+
+        if self.scheduler.running:
+            self.scheduler.shutdown(wait=False)
+            self._scheduler_started = False
+            logger.info("Delivery tracking scheduler stopped", instance_id=self.instance_id)

     async def stop(self):
-        """Stop the scheduler and release leader lock"""
-        if self.scheduler.running:
-            self.scheduler.shutdown(wait=True)  # Graceful shutdown
-            logger.info("Delivery tracking scheduler stopped", instance_id=self.instance_id)
-        else:
-            logger.info("Delivery tracking scheduler already stopped", instance_id=self.instance_id)
+        """Stop the scheduler and leader election"""
+        # Stop leader election first
+        if self._leader_election:
+            await self._leader_election.stop()
+            logger.info("Leader election stopped", instance_id=self.instance_id)
+
+        # Stop scheduler
+        await self._stop_scheduler()
+
+        # Close Redis
+        if self._redis_client:
+            await self._redis_client.close()
+
+    @property
+    def is_leader(self) -> bool:
+        """Check if this instance is the leader"""
+        return self._leader_election.is_leader if self._leader_election else True

     async def _check_all_tenants(self):
         """
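A minimal wiring sketch for the service above, assuming the collaborators (`event_publisher`, `settings`, `db_manager`) that the procurement service already constructs; the function name is a placeholder, not part of the commit.

```python
# Hypothetical wiring sketch for DeliveryTrackingService with leader election.
async def wire_delivery_tracking(event_publisher, settings, db_manager):
    tracker = DeliveryTrackingService(event_publisher, settings, database_manager=db_manager)

    await tracker.start()      # joins the election; only the winning pod schedules the hourly check
    print(tracker.is_leader)   # True on exactly one pod (or True everywhere in standalone fallback)

    await tracker.stop()       # releases the Redis lock, stops APScheduler, closes Redis
```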
@@ -46,6 +46,9 @@ class TrainingService(StandardFastAPIService):
         await setup_messaging()
         self.logger.info("Messaging setup completed")

+        # Initialize Redis pub/sub for cross-pod WebSocket broadcasting
+        await self._setup_websocket_redis()
+
         # Set up WebSocket event consumer (listens to RabbitMQ and broadcasts to WebSockets)
         success = await setup_websocket_event_consumer()
         if success:
@@ -53,8 +56,44 @@ class TrainingService(StandardFastAPIService):
         else:
             self.logger.warning("WebSocket event consumer setup failed")

+    async def _setup_websocket_redis(self):
+        """
+        Initialize Redis pub/sub for WebSocket cross-pod broadcasting.
+
+        CRITICAL FOR HORIZONTAL SCALING:
+        Without this, WebSocket clients on Pod A won't receive events
+        from training jobs running on Pod B.
+        """
+        try:
+            from app.websocket.manager import websocket_manager
+            from app.core.config import settings
+
+            redis_url = settings.REDIS_URL
+            success = await websocket_manager.initialize_redis(redis_url)
+
+            if success:
+                self.logger.info("WebSocket Redis pub/sub initialized for horizontal scaling")
+            else:
+                self.logger.warning(
+                    "WebSocket Redis pub/sub failed to initialize. "
+                    "WebSocket events will only be delivered to local connections."
+                )
+
+        except Exception as e:
+            self.logger.error("Failed to setup WebSocket Redis pub/sub",
+                              error=str(e))
+            # Don't fail startup - WebSockets will work locally without Redis
+
     async def _cleanup_messaging(self):
         """Cleanup messaging for training service"""
+        # Shutdown WebSocket Redis pub/sub
+        try:
+            from app.websocket.manager import websocket_manager
+            await websocket_manager.shutdown()
+            self.logger.info("WebSocket Redis pub/sub shutdown completed")
+        except Exception as e:
+            self.logger.warning("Error shutting down WebSocket Redis", error=str(e))
+
         await cleanup_websocket_consumers()
         await cleanup_messaging()

@@ -83,8 +122,44 @@ class TrainingService(StandardFastAPIService):
         system_metrics = SystemMetricsCollector("training")
         self.logger.info("System metrics collection started")

+        # Recover stale jobs from previous pod crashes
+        # This is important for horizontal scaling - jobs may be left in 'running'
+        # state if a pod crashes. We mark them as failed so they can be retried.
+        await self._recover_stale_jobs()
+
         self.logger.info("Training service startup completed")

+    async def _recover_stale_jobs(self):
+        """
+        Recover stale training jobs on startup.
+
+        When a pod crashes mid-training, jobs are left in 'running' or 'pending' state.
+        This method finds jobs that haven't been updated in a while and marks them
+        as failed so users can retry them.
+        """
+        try:
+            from app.repositories.training_log_repository import TrainingLogRepository
+
+            async with self.database_manager.get_session() as session:
+                log_repo = TrainingLogRepository(session)
+
+                # Recover jobs that haven't been updated in 60 minutes
+                # This is conservative - most training jobs complete within 30 minutes
+                recovered = await log_repo.recover_stale_jobs(stale_threshold_minutes=60)
+
+                if recovered:
+                    self.logger.warning(
+                        "Recovered stale training jobs on startup",
+                        recovered_count=len(recovered),
+                        job_ids=[j.job_id for j in recovered]
+                    )
+                else:
+                    self.logger.info("No stale training jobs to recover")
+
+        except Exception as e:
+            # Don't fail startup if recovery fails - just log the error
+            self.logger.error("Failed to recover stale jobs on startup", error=str(e))
+
     async def on_shutdown(self, app: FastAPI):
         """Custom shutdown logic for training service"""
         await cleanup_training_database()
|
|||||||
@@ -343,3 +343,165 @@ class TrainingLogRepository(TrainingBaseRepository):
|
|||||||
job_id=job_id,
|
job_id=job_id,
|
||||||
error=str(e))
|
error=str(e))
|
||||||
return None
|
return None
|
||||||
|
|
||||||
|
async def create_job_atomic(
|
||||||
|
self,
|
||||||
|
job_id: str,
|
||||||
|
tenant_id: str,
|
||||||
|
config: Dict[str, Any] = None
|
||||||
|
) -> tuple[Optional[ModelTrainingLog], bool]:
|
||||||
|
"""
|
||||||
|
Atomically create a training job, respecting the unique constraint.
|
||||||
|
|
||||||
|
This method uses INSERT ... ON CONFLICT to handle race conditions
|
||||||
|
when multiple pods try to create a job for the same tenant simultaneously.
|
||||||
|
The database constraint (idx_unique_active_training_per_tenant) ensures
|
||||||
|
only one active job per tenant can exist.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
job_id: Unique job identifier
|
||||||
|
tenant_id: Tenant identifier
|
||||||
|
config: Optional job configuration
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Tuple of (job, created):
|
||||||
|
- If created: (new_job, True)
|
||||||
|
- If conflict (existing active job): (existing_job, False)
|
||||||
|
- If error: raises DatabaseError
|
||||||
|
"""
|
||||||
|
try:
|
||||||
|
# First, try to find an existing active job
|
||||||
|
existing = await self.get_active_jobs(tenant_id=tenant_id)
|
||||||
|
pending = await self.get_logs_by_tenant(tenant_id=tenant_id, status="pending", limit=1)
|
||||||
|
|
||||||
|
if existing or pending:
|
||||||
|
# Return existing job
|
||||||
|
active_job = existing[0] if existing else pending[0]
|
||||||
|
logger.info("Found existing active job, skipping creation",
|
||||||
|
existing_job_id=active_job.job_id,
|
||||||
|
tenant_id=tenant_id,
|
||||||
|
requested_job_id=job_id)
|
||||||
|
return (active_job, False)
|
||||||
|
|
||||||
|
# Try to create the new job
|
||||||
|
# If another pod created one in the meantime, the unique constraint will prevent this
|
||||||
|
log_data = {
|
||||||
|
"job_id": job_id,
|
||||||
|
"tenant_id": tenant_id,
|
||||||
|
"status": "pending",
|
||||||
|
"progress": 0,
|
||||||
|
"current_step": "initializing",
|
||||||
|
"config": config or {}
|
||||||
|
}
|
||||||
|
|
||||||
|
try:
|
||||||
|
new_job = await self.create_training_log(log_data)
|
||||||
|
await self.session.commit()
|
||||||
|
logger.info("Created new training job atomically",
|
||||||
|
job_id=job_id,
|
||||||
|
tenant_id=tenant_id)
|
||||||
|
return (new_job, True)
|
||||||
|
except Exception as create_error:
|
||||||
|
error_str = str(create_error).lower()
|
||||||
|
# Check if this is a unique constraint violation
|
||||||
|
if "unique" in error_str or "duplicate" in error_str or "constraint" in error_str:
|
||||||
|
await self.session.rollback()
|
||||||
|
# Another pod created a job, fetch it
|
||||||
|
logger.info("Unique constraint hit, fetching existing job",
|
||||||
|
tenant_id=tenant_id,
|
||||||
|
requested_job_id=job_id)
|
||||||
|
existing = await self.get_active_jobs(tenant_id=tenant_id)
|
||||||
|
pending = await self.get_logs_by_tenant(tenant_id=tenant_id, status="pending", limit=1)
|
||||||
|
if existing or pending:
|
||||||
|
active_job = existing[0] if existing else pending[0]
|
||||||
|
return (active_job, False)
|
||||||
|
# If still no job found, something went wrong
|
||||||
|
raise DatabaseError(f"Constraint violation but no active job found: {create_error}")
|
||||||
|
else:
|
||||||
|
raise
|
||||||
|
|
||||||
|
except DatabaseError:
|
||||||
|
raise
|
||||||
|
except Exception as e:
|
||||||
|
logger.error("Failed to create job atomically",
|
||||||
|
job_id=job_id,
|
||||||
|
tenant_id=tenant_id,
|
||||||
|
error=str(e))
|
||||||
|
raise DatabaseError(f"Failed to create training job atomically: {str(e)}")
|
||||||
|
|
||||||
|
async def recover_stale_jobs(self, stale_threshold_minutes: int = 60) -> List[ModelTrainingLog]:
|
||||||
|
"""
|
||||||
|
Find and mark stale running jobs as failed.
|
||||||
|
|
||||||
|
This is used during service startup to clean up jobs that were
|
||||||
|
running when a pod crashed. With multiple replicas, only stale
|
||||||
|
jobs (not updated recently) should be marked as failed.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
stale_threshold_minutes: Jobs not updated for this long are considered stale
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
List of jobs that were marked as failed
|
||||||
|
"""
|
||||||
|
try:
|
||||||
|
stale_cutoff = datetime.now() - timedelta(minutes=stale_threshold_minutes)
|
||||||
|
|
||||||
|
# Find running jobs that haven't been updated recently
|
||||||
|
query = text("""
|
||||||
|
SELECT id, job_id, tenant_id, status, updated_at
|
||||||
|
FROM model_training_logs
|
||||||
|
WHERE status IN ('running', 'pending')
|
||||||
|
AND updated_at < :stale_cutoff
|
||||||
|
""")
|
||||||
|
|
||||||
|
result = await self.session.execute(query, {"stale_cutoff": stale_cutoff})
|
||||||
|
stale_jobs = result.fetchall()
|
||||||
|
|
||||||
|
recovered_jobs = []
|
||||||
|
for row in stale_jobs:
|
||||||
|
try:
|
||||||
|
# Mark as failed
|
||||||
|
update_query = text("""
|
||||||
|
UPDATE model_training_logs
|
||||||
|
SET status = 'failed',
|
||||||
|
error_message = :error_msg,
|
||||||
|
end_time = :end_time,
|
||||||
|
updated_at = :updated_at
|
||||||
|
WHERE id = :id AND status IN ('running', 'pending')
|
||||||
|
""")
|
||||||
|
|
||||||
|
await self.session.execute(update_query, {
|
||||||
|
"id": row.id,
|
||||||
|
"error_msg": f"Job recovered as failed - not updated since {row.updated_at.isoformat()}. Pod may have crashed.",
|
||||||
|
"end_time": datetime.now(),
|
||||||
|
"updated_at": datetime.now()
|
||||||
|
})
|
||||||
|
|
||||||
|
logger.warning("Recovered stale training job",
|
||||||
|
job_id=row.job_id,
|
||||||
|
tenant_id=str(row.tenant_id),
|
||||||
|
last_updated=row.updated_at.isoformat() if row.updated_at else "unknown")
|
||||||
|
|
||||||
|
# Fetch the updated job to return
|
||||||
|
job = await self.get_by_job_id(row.job_id)
|
||||||
|
if job:
|
||||||
|
recovered_jobs.append(job)
|
||||||
|
|
||||||
|
except Exception as job_error:
|
||||||
|
logger.error("Failed to recover individual stale job",
|
||||||
|
job_id=row.job_id,
|
||||||
|
error=str(job_error))
|
||||||
|
|
||||||
|
if recovered_jobs:
|
||||||
|
await self.session.commit()
|
||||||
|
logger.info("Stale job recovery completed",
|
||||||
|
recovered_count=len(recovered_jobs),
|
||||||
|
stale_threshold_minutes=stale_threshold_minutes)
|
||||||
|
|
||||||
|
return recovered_jobs
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error("Failed to recover stale jobs",
|
||||||
|
error=str(e))
|
||||||
|
await self.session.rollback()
|
||||||
|
return []
|
||||||
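A hedged caller sketch for `create_job_atomic`: the `database_manager` and function name are placeholders, while the `(job, created)` return shape is the one defined above. The reason the second pod loses the race is the partial unique index added by the Alembic migration later in this diff.

```python
# Hypothetical caller sketch: submit a training job safely from any pod.
import uuid

async def submit_training_job(database_manager, tenant_id: str):
    async with database_manager.get_session() as session:
        repo = TrainingLogRepository(session)
        job, created = await repo.create_job_atomic(
            job_id=str(uuid.uuid4()),
            tenant_id=tenant_id,
        )
        if not created:
            # Another pod already holds the active job for this tenant; reuse it.
            return job
        return job
```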
@@ -1,10 +1,16 @@
 """
 Distributed Locking Mechanisms
 Prevents concurrent training jobs for the same product
+
+HORIZONTAL SCALING FIX:
+- Uses SHA256 for stable hash across all Python processes/pods
+- Python's built-in hash() varies between processes due to hash randomization (Python 3.3+)
+- This ensures all pods compute the same lock ID for the same lock name
 """

 import asyncio
 import time
+import hashlib
 from typing import Optional
 import logging
 from contextlib import asynccontextmanager
@@ -39,9 +45,20 @@ class DatabaseLock:
         self.lock_id = self._hash_lock_name(lock_name)

     def _hash_lock_name(self, name: str) -> int:
-        """Convert lock name to integer ID for PostgreSQL advisory lock"""
-        # Use hash and modulo to get a positive 32-bit integer
-        return abs(hash(name)) % (2**31)
+        """
+        Convert lock name to integer ID for PostgreSQL advisory lock.
+
+        CRITICAL: Uses SHA256 for stable hash across all Python processes/pods.
+        Python's built-in hash() varies between processes due to hash randomization
+        (PYTHONHASHSEED, enabled by default since Python 3.3), which would cause
+        different pods to compute different lock IDs for the same lock name,
+        defeating the purpose of distributed locking.
+        """
+        # Use SHA256 for stable, cross-process hash
+        hash_bytes = hashlib.sha256(name.encode('utf-8')).digest()
+        # Take first 4 bytes and convert to positive 31-bit integer
+        # (PostgreSQL advisory locks use bigint, but we use 31-bit for safety)
+        return int.from_bytes(hash_bytes[:4], 'big') % (2**31)

     @asynccontextmanager
     async def acquire(self, session: AsyncSession):
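A minimal sketch of the stable-hash property the fix relies on; `stable_lock_id` simply reimplements `DatabaseLock._hash_lock_name` outside the class for illustration, and the lock name used is arbitrary.

```python
import hashlib

def stable_lock_id(name: str) -> int:
    # First 4 bytes of SHA256, masked to a positive 31-bit integer.
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % (2**31)

# Identical on every pod and every run of the interpreter:
assert stable_lock_id("training:product-42") == stable_lock_id("training:product-42")
# By contrast, abs(hash(name)) % (2**31) changes between processes whenever hash
# randomization (PYTHONHASHSEED) is active, so two pods could end up taking
# different PostgreSQL advisory locks for the same product.
```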
@@ -1,21 +1,39 @@
 """
 WebSocket Connection Manager for Training Service
 Manages WebSocket connections and broadcasts RabbitMQ events to connected clients
+
+HORIZONTAL SCALING:
+- Uses Redis pub/sub for cross-pod WebSocket broadcasting
+- Each pod subscribes to a Redis channel and broadcasts to its local connections
+- Events published to Redis are received by all pods, ensuring clients on any
+  pod receive events from training jobs running on any other pod
 """

 import asyncio
 import json
-from typing import Dict, Set
+import os
+from typing import Dict, Optional
 from fastapi import WebSocket
 import structlog

 logger = structlog.get_logger()

+# Redis pub/sub channel for WebSocket events
+REDIS_WEBSOCKET_CHANNEL = "training:websocket:events"
+

 class WebSocketConnectionManager:
     """
-    Simple WebSocket connection manager.
-    Manages connections per job_id and broadcasts messages to all connected clients.
+    WebSocket connection manager with Redis pub/sub for horizontal scaling.
+
+    In a multi-pod deployment:
+    1. Events are published to Redis pub/sub (not just local broadcast)
+    2. Each pod subscribes to Redis and broadcasts to its local WebSocket connections
+    3. This ensures clients connected to any pod receive events from any pod
+
+    Flow:
+    - RabbitMQ event → Pod A receives → Pod A publishes to Redis
+    - Redis pub/sub → All pods receive → Each pod broadcasts to local WebSockets
     """

     def __init__(self):
@@ -24,6 +42,121 @@ class WebSocketConnectionManager:
         self._lock = asyncio.Lock()
         # Store latest event for each job to provide initial state
         self._latest_events: Dict[str, dict] = {}
+        # Redis client for pub/sub
+        self._redis: Optional[object] = None
+        self._pubsub: Optional[object] = None
+        self._subscriber_task: Optional[asyncio.Task] = None
+        self._running = False
+        self._instance_id = f"{os.environ.get('HOSTNAME', 'unknown')}:{os.getpid()}"
+
+    async def initialize_redis(self, redis_url: str) -> bool:
+        """
+        Initialize Redis connection for cross-pod pub/sub.
+
+        Args:
+            redis_url: Redis connection URL
+
+        Returns:
+            True if successful, False otherwise
+        """
+        try:
+            import redis.asyncio as redis_async
+
+            self._redis = redis_async.from_url(redis_url, decode_responses=True)
+            await self._redis.ping()
+
+            # Create pub/sub subscriber
+            self._pubsub = self._redis.pubsub()
+            await self._pubsub.subscribe(REDIS_WEBSOCKET_CHANNEL)
+
+            # Start subscriber task
+            self._running = True
+            self._subscriber_task = asyncio.create_task(self._redis_subscriber_loop())
+
+            logger.info("Redis pub/sub initialized for WebSocket broadcasting",
+                        instance_id=self._instance_id,
+                        channel=REDIS_WEBSOCKET_CHANNEL)
+            return True
+
+        except Exception as e:
+            logger.error("Failed to initialize Redis pub/sub",
+                         error=str(e),
+                         instance_id=self._instance_id)
+            return False
+
+    async def shutdown(self):
+        """Shutdown Redis pub/sub connection"""
+        self._running = False
+
+        if self._subscriber_task:
+            self._subscriber_task.cancel()
+            try:
+                await self._subscriber_task
+            except asyncio.CancelledError:
+                pass
+
+        if self._pubsub:
+            await self._pubsub.unsubscribe(REDIS_WEBSOCKET_CHANNEL)
+            await self._pubsub.close()
+
+        if self._redis:
+            await self._redis.close()
+
+        logger.info("Redis pub/sub shutdown complete",
+                    instance_id=self._instance_id)
+
+    async def _redis_subscriber_loop(self):
+        """Background task to receive Redis pub/sub messages and broadcast locally"""
+        try:
+            while self._running:
+                try:
+                    message = await self._pubsub.get_message(
+                        ignore_subscribe_messages=True,
+                        timeout=1.0
+                    )
+
+                    if message and message['type'] == 'message':
+                        await self._handle_redis_message(message['data'])
+
+                except asyncio.CancelledError:
+                    break
+                except Exception as e:
+                    logger.error("Error in Redis subscriber loop",
+                                 error=str(e),
+                                 instance_id=self._instance_id)
+                    await asyncio.sleep(1)  # Backoff on error
+
+        except asyncio.CancelledError:
+            pass
+
+        logger.info("Redis subscriber loop stopped",
+                    instance_id=self._instance_id)
+
+    async def _handle_redis_message(self, data: str):
+        """Handle a message received from Redis pub/sub"""
+        try:
+            payload = json.loads(data)
+            job_id = payload.get('job_id')
+            message = payload.get('message')
+            source_instance = payload.get('source_instance')
+
+            if not job_id or not message:
+                return
+
+            # Log cross-pod message
+            if source_instance != self._instance_id:
+                logger.debug("Received cross-pod WebSocket event",
+                             job_id=job_id,
+                             source_instance=source_instance,
+                             local_instance=self._instance_id)
+
+            # Broadcast to local WebSocket connections
+            await self._broadcast_local(job_id, message)
+
+        except json.JSONDecodeError as e:
+            logger.warning("Invalid JSON in Redis message", error=str(e))
+        except Exception as e:
+            logger.error("Error handling Redis message", error=str(e))
+
     async def connect(self, job_id: str, websocket: WebSocket) -> None:
         """Register a new WebSocket connection for a job"""
@@ -50,7 +183,8 @@ class WebSocketConnectionManager:
         logger.info("WebSocket connected",
                     job_id=job_id,
                     websocket_id=ws_id,
-                    total_connections=len(self._connections[job_id]))
+                    total_connections=len(self._connections[job_id]),
+                    instance_id=self._instance_id)

     async def disconnect(self, job_id: str, websocket: WebSocket) -> None:
         """Remove a WebSocket connection"""
@@ -66,19 +200,56 @@ class WebSocketConnectionManager:
         logger.info("WebSocket disconnected",
                     job_id=job_id,
                     websocket_id=ws_id,
-                    remaining_connections=len(self._connections.get(job_id, {})))
+                    remaining_connections=len(self._connections.get(job_id, {})),
+                    instance_id=self._instance_id)

     async def broadcast(self, job_id: str, message: dict) -> int:
         """
-        Broadcast a message to all connections for a specific job.
-        Returns the number of successful broadcasts.
+        Broadcast a message to all connections for a specific job across ALL pods.
+
+        If Redis is configured, publishes to Redis pub/sub which then broadcasts
+        to all pods. Otherwise, falls back to local-only broadcast.
+
+        Returns the number of successful local broadcasts.
         """
         # Store the latest event for this job to provide initial state to new connections
-        if message.get('type') != 'initial_state':  # Don't store initial_state messages
+        if message.get('type') != 'initial_state':
             self._latest_events[job_id] = message

+        # If Redis is available, publish to Redis for cross-pod broadcast
+        if self._redis:
+            try:
+                payload = json.dumps({
+                    'job_id': job_id,
+                    'message': message,
+                    'source_instance': self._instance_id
+                })
+                await self._redis.publish(REDIS_WEBSOCKET_CHANNEL, payload)
+                logger.debug("Published WebSocket event to Redis",
+                             job_id=job_id,
+                             message_type=message.get('type'),
+                             instance_id=self._instance_id)
+                # Return 0 here because the actual broadcast happens via subscriber
+                # The count will be from _broadcast_local when the message is received
+                return 0
+            except Exception as e:
+                logger.warning("Failed to publish to Redis, falling back to local broadcast",
+                               error=str(e),
+                               job_id=job_id)
+                # Fall through to local broadcast
+
+        # Local-only broadcast (when Redis is not available)
+        return await self._broadcast_local(job_id, message)
+
+    async def _broadcast_local(self, job_id: str, message: dict) -> int:
+        """
+        Broadcast a message to local WebSocket connections only.
+        This is called either directly (no Redis) or from Redis subscriber.
+        """
         if job_id not in self._connections:
-            logger.debug("No active connections for job", job_id=job_id)
+            logger.debug("No active local connections for job",
+                         job_id=job_id,
+                         instance_id=self._instance_id)
             return 0

         connections = list(self._connections[job_id].values())
@@ -103,18 +274,27 @@ class WebSocketConnectionManager:
                 self._connections[job_id].pop(ws_id, None)

         if successful_sends > 0:
-            logger.info("Broadcasted message to WebSocket clients",
+            logger.info("Broadcasted message to local WebSocket clients",
                         job_id=job_id,
                         message_type=message.get('type'),
                         successful_sends=successful_sends,
-                        failed_sends=len(failed_websockets))
+                        failed_sends=len(failed_websockets),
+                        instance_id=self._instance_id)

         return successful_sends

     def get_connection_count(self, job_id: str) -> int:
-        """Get the number of active connections for a job"""
+        """Get the number of active local connections for a job"""
         return len(self._connections.get(job_id, {}))

+    def get_total_connection_count(self) -> int:
+        """Get total number of active connections across all jobs"""
+        return sum(len(conns) for conns in self._connections.values())
+
+    def is_redis_enabled(self) -> bool:
+        """Check if Redis pub/sub is enabled"""
+        return self._redis is not None and self._running
+

 # Global singleton instance
 websocket_manager = WebSocketConnectionManager()
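A hypothetical usage sketch of the singleton above, assuming a placeholder `redis_url` value; only the methods defined in this file are used.

```python
# Hypothetical usage sketch of the WebSocket manager with cross-pod broadcasting.
from app.websocket.manager import websocket_manager

async def demo(redis_url: str = "redis://localhost:6379/0"):
    enabled = await websocket_manager.initialize_redis(redis_url)
    # With Redis enabled, broadcast() publishes to REDIS_WEBSOCKET_CHANNEL and the
    # subscriber loop on every pod relays the event to its local connections.
    # Without Redis it falls back to _broadcast_local() on this pod only.
    await websocket_manager.broadcast("job-123", {"type": "progress", "progress": 42})
    print("cross-pod broadcasting:", enabled, websocket_manager.is_redis_enabled())
    await websocket_manager.shutdown()
```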
@@ -0,0 +1,60 @@
"""Add horizontal scaling constraints for multi-pod deployment

Revision ID: add_horizontal_scaling
Revises: 26a665cd5348
Create Date: 2025-01-18

This migration adds database-level constraints to prevent race conditions
when running multiple training service pods:

1. Partial unique index on model_training_logs to prevent duplicate active jobs per tenant
2. Index to speed up active job lookups
"""
from typing import Sequence, Union

from alembic import op
import sqlalchemy as sa


# revision identifiers, used by Alembic.
revision: str = 'add_horizontal_scaling'
down_revision: Union[str, None] = '26a665cd5348'
branch_labels: Union[str, Sequence[str], None] = None
depends_on: Union[str, Sequence[str], None] = None


def upgrade() -> None:
    # Add partial unique index to prevent duplicate active training jobs per tenant
    # This ensures only ONE job can be in 'pending' or 'running' status per tenant at a time
    # The constraint is enforced at the database level, preventing race conditions
    # between multiple pods checking and creating jobs simultaneously
    op.execute("""
        CREATE UNIQUE INDEX IF NOT EXISTS idx_unique_active_training_per_tenant
        ON model_training_logs (tenant_id)
        WHERE status IN ('pending', 'running')
    """)

    # Add index to speed up active job lookups (used by deduplication check)
    op.create_index(
        'idx_training_logs_tenant_status',
        'model_training_logs',
        ['tenant_id', 'status'],
        unique=False,
        if_not_exists=True
    )

    # Add index for job recovery queries (find stale running jobs)
    op.create_index(
        'idx_training_logs_status_updated',
        'model_training_logs',
        ['status', 'updated_at'],
        unique=False,
        if_not_exists=True
    )


def downgrade() -> None:
    # Remove the indexes in reverse order
    op.execute("DROP INDEX IF EXISTS idx_training_logs_status_updated")
    op.execute("DROP INDEX IF EXISTS idx_training_logs_tenant_status")
    op.execute("DROP INDEX IF EXISTS idx_unique_active_training_per_tenant")
33 shared/leader_election/__init__.py Normal file
@@ -0,0 +1,33 @@
"""
Shared Leader Election for Bakery-IA platform

Provides Redis-based leader election for services that need to run
singleton scheduled tasks (APScheduler, background jobs, etc.)

Usage:
    from shared.leader_election import LeaderElectionService, SchedulerLeaderMixin

    # Option 1: Direct usage
    leader_election = LeaderElectionService(redis_client, "my-service")
    await leader_election.start(
        on_become_leader=start_scheduler,
        on_lose_leader=stop_scheduler
    )

    # Option 2: Mixin for services with APScheduler
    class MySchedulerService(SchedulerLeaderMixin):
        async def _create_scheduler_jobs(self):
            self.scheduler.add_job(...)
"""

from shared.leader_election.service import (
    LeaderElectionService,
    LeaderElectionConfig,
)
from shared.leader_election.mixin import SchedulerLeaderMixin

__all__ = [
    "LeaderElectionService",
    "LeaderElectionConfig",
    "SchedulerLeaderMixin",
]
209 shared/leader_election/mixin.py Normal file
@@ -0,0 +1,209 @@
"""
Scheduler Leader Mixin

Provides a mixin class for services that use APScheduler and need
leader election for horizontal scaling.

Usage:
    class MySchedulerService(SchedulerLeaderMixin):
        def __init__(self, redis_url: str, service_name: str):
            super().__init__(redis_url, service_name)
            # Your initialization here

        async def _create_scheduler_jobs(self):
            '''Override to define your scheduled jobs'''
            self.scheduler.add_job(
                self.my_job,
                trigger=CronTrigger(hour=0),
                id='my_job'
            )

        async def my_job(self):
            # Your job logic here
            pass
"""

import asyncio
from typing import Optional
from abc import abstractmethod
import structlog

logger = structlog.get_logger()


class SchedulerLeaderMixin:
    """
    Mixin for services that use APScheduler with leader election.

    Provides automatic leader election and scheduler management.
    Only the leader pod will run scheduled jobs.
    """

    def __init__(self, redis_url: str, service_name: str, **kwargs):
        """
        Initialize the scheduler with leader election.

        Args:
            redis_url: Redis connection URL for leader election
            service_name: Unique service name for leader election lock
            **kwargs: Additional arguments passed to parent class
        """
        super().__init__(**kwargs)

        self._redis_url = redis_url
        self._service_name = service_name
        self._leader_election = None
        self._redis_client = None
        self.scheduler = None
        self._scheduler_started = False

    async def start_with_leader_election(self):
        """
        Start the service with leader election.

        Only the leader will start the scheduler.
        """
        from apscheduler.schedulers.asyncio import AsyncIOScheduler
        from shared.leader_election.service import LeaderElectionService
        import redis.asyncio as redis

        try:
            # Create Redis connection
            self._redis_client = redis.from_url(self._redis_url, decode_responses=False)
            await self._redis_client.ping()

            # Create scheduler (but don't start it yet)
            self.scheduler = AsyncIOScheduler()

            # Create leader election
            self._leader_election = LeaderElectionService(
                self._redis_client,
                self._service_name
            )

            # Start leader election with callbacks
            await self._leader_election.start(
                on_become_leader=self._on_become_leader,
                on_lose_leader=self._on_lose_leader
            )

            logger.info("Scheduler service started with leader election",
                        service=self._service_name,
                        is_leader=self._leader_election.is_leader,
                        instance_id=self._leader_election.instance_id)

        except Exception as e:
            logger.error("Failed to start with leader election, falling back to standalone",
                         service=self._service_name,
                         error=str(e))
            # Fallback: start scheduler anyway (for single-pod deployments)
            await self._start_scheduler_standalone()

    async def _on_become_leader(self):
        """Called when this instance becomes the leader"""
        logger.info("Became leader, starting scheduler",
                    service=self._service_name)
        await self._start_scheduler()

    async def _on_lose_leader(self):
        """Called when this instance loses leadership"""
        logger.warning("Lost leadership, stopping scheduler",
                       service=self._service_name)
        await self._stop_scheduler()

    async def _start_scheduler(self):
        """Start the scheduler with defined jobs"""
        if self._scheduler_started:
            logger.warning("Scheduler already started",
                           service=self._service_name)
            return

        try:
            # Let subclass define jobs
            await self._create_scheduler_jobs()

            # Start scheduler
            if not self.scheduler.running:
                self.scheduler.start()
                self._scheduler_started = True
                logger.info("Scheduler started",
                            service=self._service_name,
                            job_count=len(self.scheduler.get_jobs()))

        except Exception as e:
            logger.error("Failed to start scheduler",
                         service=self._service_name,
                         error=str(e))

    async def _stop_scheduler(self):
        """Stop the scheduler"""
        if not self._scheduler_started:
            return

        try:
            if self.scheduler and self.scheduler.running:
                self.scheduler.shutdown(wait=False)
                self._scheduler_started = False
                logger.info("Scheduler stopped",
                            service=self._service_name)

        except Exception as e:
            logger.error("Failed to stop scheduler",
                         service=self._service_name,
                         error=str(e))

    async def _start_scheduler_standalone(self):
        """Start scheduler without leader election (fallback mode)"""
        from apscheduler.schedulers.asyncio import AsyncIOScheduler

        logger.warning("Starting scheduler in standalone mode (no leader election)",
                       service=self._service_name)

        self.scheduler = AsyncIOScheduler()
        await self._create_scheduler_jobs()

        if not self.scheduler.running:
            self.scheduler.start()
            self._scheduler_started = True

    @abstractmethod
    async def _create_scheduler_jobs(self):
        """
        Override to define scheduled jobs.

        Example:
            self.scheduler.add_job(
                self.my_task,
                trigger=CronTrigger(hour=0, minute=30),
                id='my_task',
                max_instances=1
            )
        """
        pass

    async def stop(self):
        """Stop the scheduler and leader election"""
        # Stop leader election
        if self._leader_election:
            await self._leader_election.stop()

        # Stop scheduler
        await self._stop_scheduler()

        # Close Redis
        if self._redis_client:
            await self._redis_client.close()

        logger.info("Scheduler service stopped",
                    service=self._service_name)

    @property
    def is_leader(self) -> bool:
        """Check if this instance is the leader"""
        return self._leader_election.is_leader if self._leader_election else False

    def get_leader_status(self) -> dict:
        """Get leader election status"""
        if self._leader_election:
            return self._leader_election.get_status()
        return {"is_leader": True, "mode": "standalone"}
352 shared/leader_election/service.py Normal file
@@ -0,0 +1,352 @@
"""
Leader Election Service

Implements Redis-based leader election to ensure only ONE pod runs
singleton tasks like APScheduler jobs.

This is CRITICAL for horizontal scaling - without leader election,
each pod would run the same scheduled jobs, causing:
- Duplicate operations (forecasts, alerts, syncs)
- Database contention
- Inconsistent state
- Duplicate notifications

Implementation:
- Uses Redis SET NX (set if not exists) for atomic leadership acquisition
- Leader maintains leadership with periodic heartbeats
- If leader fails to heartbeat, another pod can take over
- Non-leader pods check periodically if they should become leader
"""

import asyncio
import os
import socket
from dataclasses import dataclass
from typing import Optional, Callable, Awaitable
import structlog

logger = structlog.get_logger()


@dataclass
class LeaderElectionConfig:
    """Configuration for leader election"""
    # Redis key prefix for the lock
    lock_key_prefix: str = "leader"
    # Lock expires after this many seconds without refresh
    lock_ttl_seconds: int = 30
    # Refresh lock every N seconds (should be < lock_ttl_seconds / 2)
    heartbeat_interval_seconds: int = 10
    # Non-leaders check for leadership every N seconds
    election_check_interval_seconds: int = 15


class LeaderElectionService:
    """
    Redis-based leader election service.

    Ensures only one pod runs scheduled tasks at a time across all replicas.
    """

    def __init__(
        self,
        redis_client,
        service_name: str,
        config: Optional[LeaderElectionConfig] = None
    ):
        """
        Initialize leader election service.

        Args:
            redis_client: Async Redis client instance
            service_name: Unique name for this service (used in Redis key)
            config: Optional configuration override
        """
        self.redis = redis_client
        self.service_name = service_name
        self.config = config or LeaderElectionConfig()
        self.lock_key = f"{self.config.lock_key_prefix}:{service_name}:lock"
        self.instance_id = self._generate_instance_id()
        self.is_leader = False
        self._heartbeat_task: Optional[asyncio.Task] = None
        self._election_task: Optional[asyncio.Task] = None
        self._running = False
        self._on_become_leader_callback: Optional[Callable[[], Awaitable[None]]] = None
        self._on_lose_leader_callback: Optional[Callable[[], Awaitable[None]]] = None

    def _generate_instance_id(self) -> str:
        """Generate unique instance identifier for this pod"""
        hostname = os.environ.get('HOSTNAME', socket.gethostname())
        pod_ip = os.environ.get('POD_IP', 'unknown')
        return f"{hostname}:{pod_ip}:{os.getpid()}"

    async def start(
        self,
        on_become_leader: Optional[Callable[[], Awaitable[None]]] = None,
        on_lose_leader: Optional[Callable[[], Awaitable[None]]] = None
    ):
        """
        Start leader election process.

        Args:
            on_become_leader: Async callback when this instance becomes leader
            on_lose_leader: Async callback when this instance loses leadership
        """
        self._on_become_leader_callback = on_become_leader
        self._on_lose_leader_callback = on_lose_leader
        self._running = True

        logger.info("Starting leader election",
                    service=self.service_name,
                    instance_id=self.instance_id,
                    lock_key=self.lock_key)

        # Try to become leader immediately
        await self._try_become_leader()

        # Start background tasks
        if self.is_leader:
            self._heartbeat_task = asyncio.create_task(self._heartbeat_loop())
        else:
            self._election_task = asyncio.create_task(self._election_loop())

    async def stop(self):
        """Stop leader election and release leadership if held"""
        self._running = False

        # Cancel background tasks
        if self._heartbeat_task:
            self._heartbeat_task.cancel()
            try:
                await self._heartbeat_task
            except asyncio.CancelledError:
                pass
            self._heartbeat_task = None

        if self._election_task:
            self._election_task.cancel()
            try:
                await self._election_task
            except asyncio.CancelledError:
                pass
            self._election_task = None

        # Release leadership
        if self.is_leader:
            await self._release_leadership()

        logger.info("Leader election stopped",
                    service=self.service_name,
                    instance_id=self.instance_id,
                    was_leader=self.is_leader)

    async def _try_become_leader(self) -> bool:
        """
        Attempt to become the leader.

        Returns:
            True if this instance is now the leader
        """
        try:
            # Try to set the lock with NX (only if not exists) and EX (expiry)
            acquired = await self.redis.set(
                self.lock_key,
                self.instance_id,
                nx=True,  # Only set if not exists
                ex=self.config.lock_ttl_seconds
            )

            if acquired:
                self.is_leader = True
                logger.info("Became leader",
                            service=self.service_name,
                            instance_id=self.instance_id)

                # Call callback
                if self._on_become_leader_callback:
                    try:
                        await self._on_become_leader_callback()
                    except Exception as e:
                        logger.error("Error in on_become_leader callback",
                                     service=self.service_name,
                                     error=str(e))

                return True

            # Check if we're already the leader (reconnection scenario)
            current_leader = await self.redis.get(self.lock_key)
            if current_leader:
                current_leader_str = current_leader.decode() if isinstance(current_leader, bytes) else current_leader
                if current_leader_str == self.instance_id:
                    self.is_leader = True
                    logger.info("Confirmed as existing leader",
                                service=self.service_name,
                                instance_id=self.instance_id)
                    return True
                else:
                    logger.debug("Another instance is leader",
                                 service=self.service_name,
                                 current_leader=current_leader_str,
                                 this_instance=self.instance_id)

            return False

        except Exception as e:
            logger.error("Failed to acquire leadership",
                         service=self.service_name,
                         instance_id=self.instance_id,
                         error=str(e))
            return False

    async def _release_leadership(self):
        """Release leadership lock"""
        try:
            # Only delete if we're the current leader
            current_leader = await self.redis.get(self.lock_key)
            if current_leader:
                current_leader_str = current_leader.decode() if isinstance(current_leader, bytes) else current_leader
                if current_leader_str == self.instance_id:
                    await self.redis.delete(self.lock_key)
                    logger.info("Released leadership",
                                service=self.service_name,
                                instance_id=self.instance_id)

            was_leader = self.is_leader
            self.is_leader = False

            # Call callback only if we were the leader
            if was_leader and self._on_lose_leader_callback:
                try:
                    await self._on_lose_leader_callback()
                except Exception as e:
                    logger.error("Error in on_lose_leader callback",
                                 service=self.service_name,
                                 error=str(e))

        except Exception as e:
            logger.error("Failed to release leadership",
                         service=self.service_name,
                         instance_id=self.instance_id,
                         error=str(e))

    async def _refresh_leadership(self) -> bool:
        """
        Refresh leadership lock TTL.

        Returns:
            True if leadership was maintained
        """
        try:
            # Verify we're still the leader
            current_leader = await self.redis.get(self.lock_key)
            if not current_leader:
                logger.warning("Lost leadership (lock expired)",
                               service=self.service_name,
                               instance_id=self.instance_id)
                return False

            current_leader_str = current_leader.decode() if isinstance(current_leader, bytes) else current_leader
            if current_leader_str != self.instance_id:
                logger.warning("Lost leadership (lock held by another instance)",
                               service=self.service_name,
                               instance_id=self.instance_id,
                               current_leader=current_leader_str)
                return False

            # Refresh the TTL
            await self.redis.expire(self.lock_key, self.config.lock_ttl_seconds)
            return True

        except Exception as e:
            logger.error("Failed to refresh leadership",
                         service=self.service_name,
                         instance_id=self.instance_id,
                         error=str(e))
            return False

    async def _heartbeat_loop(self):
        """Background loop to maintain leadership"""
        while self._running and self.is_leader:
            try:
                await asyncio.sleep(self.config.heartbeat_interval_seconds)

                if not self._running:
                    break

                maintained = await self._refresh_leadership()

                if not maintained:
                    self.is_leader = False

                    # Call callback
                    if self._on_lose_leader_callback:
                        try:
                            await self._on_lose_leader_callback()
                        except Exception as e:
                            logger.error("Error in on_lose_leader callback",
                                         service=self.service_name,
                                         error=str(e))

                    # Switch to election loop
                    self._election_task = asyncio.create_task(self._election_loop())
                    break

            except asyncio.CancelledError:
                break
            except Exception as e:
                logger.error("Error in heartbeat loop",
                             service=self.service_name,
                             instance_id=self.instance_id,
                             error=str(e))

    async def _election_loop(self):
        """Background loop to attempt leadership acquisition"""
        while self._running and not self.is_leader:
            try:
                await asyncio.sleep(self.config.election_check_interval_seconds)

                if not self._running:
                    break

                acquired = await self._try_become_leader()

                if acquired:
                    # Switch to heartbeat loop
                    self._heartbeat_task = asyncio.create_task(self._heartbeat_loop())
                    break

            except asyncio.CancelledError:
                break
            except Exception as e:
                logger.error("Error in election loop",
                             service=self.service_name,
                             instance_id=self.instance_id,
                             error=str(e))

    def get_status(self) -> dict:
        """Get current leader election status"""
        return {
            "service": self.service_name,
            "instance_id": self.instance_id,
            "is_leader": self.is_leader,
            "running": self._running,
            "lock_key": self.lock_key,
            "config": {
                "lock_ttl_seconds": self.config.lock_ttl_seconds,
                "heartbeat_interval_seconds": self.config.heartbeat_interval_seconds,
                "election_check_interval_seconds": self.config.election_check_interval_seconds
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
async def get_current_leader(self) -> Optional[str]:
|
||||||
|
"""Get the current leader instance ID (if any)"""
|
||||||
|
try:
|
||||||
|
current_leader = await self.redis.get(self.lock_key)
|
||||||
|
if current_leader:
|
||||||
|
return current_leader.decode() if isinstance(current_leader, bytes) else current_leader
|
||||||
|
return None
|
||||||
|
except Exception as e:
|
||||||
|
logger.error("Failed to get current leader",
|
||||||
|
service=self.service_name,
|
||||||
|
error=str(e))
|
||||||
|
return None
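
One caveat worth noting: `_release_leadership` checks ownership with a GET and then deletes the key in a separate call, and `_refresh_leadership` likewise does GET followed by EXPIRE. If the lock expires and is re-acquired by another instance between those two calls, the second call can act on the other instance's lock. A common hardening is to perform the compare-and-delete server-side in one Lua script. The following is a minimal sketch, assuming the redis-py asyncio client used above; `release_if_owner` is a hypothetical helper, not part of the committed code:

```python
# Hardening sketch (not part of the committed code): atomically delete the
# lock only if it still holds this instance's id, via a server-side Lua script.
RELEASE_LOCK_LUA = """
if redis.call("GET", KEYS[1]) == ARGV[1] then
    return redis.call("DEL", KEYS[1])
else
    return 0
end
"""


async def release_if_owner(redis_client, lock_key: str, instance_id: str) -> bool:
    """Return True if the lock was held by instance_id and has been deleted."""
    deleted = await redis_client.eval(RELEASE_LOCK_LUA, 1, lock_key, instance_id)
    return deleted == 1
```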
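
For context, this is roughly how a service pod could wire the helper above so that only one replica runs singleton work (schedulers, consumers, migrations). The class name, constructor signature, callback registration, and `start()`/`stop()` methods are not visible in this excerpt, so everything outside the status/leader semantics shown above is an assumption for illustration:

```python
# Illustrative only: class name, constructor arguments, callback registration,
# and start()/stop() are assumptions; only the internal loops appear in this diff.
import asyncio

import redis.asyncio as aioredis

# from shared.leader_election import LeaderElection  # assumed module path


async def main():
    client = aioredis.Redis(host="redis", port=6379)

    election = LeaderElection(                 # assumed constructor
        redis=client,
        service_name="forecast-service",
        instance_id="forecast-7f6d9-abcde",    # e.g. the pod name
    )

    async def on_become_leader():
        print("this replica now runs the singleton scheduler")

    async def on_lose_leader():
        print("stopping singleton work")

    election.on_become_leader(on_become_leader)  # assumed registration API
    election.on_lose_leader(on_lose_leader)

    await election.start()                       # assumed entry point
    try:
        await asyncio.Event().wait()             # keep the pod running
    finally:
        await election.stop()


if __name__ == "__main__":
    asyncio.run(main())
```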