Add ci/cd and fix multiple pods issues
This commit adds the CI/CD setup (Gitea, Tekton, Flux CD), the infrastructure reorganization proposal, and fixes duplicate-scheduler issues when services run as multiple pods.

CI_CD_IMPLEMENTATION_PLAN.md (1206 lines, new file; diff suppressed because it is too large)

INFRASTRUCTURE_REORGANIZATION_PROPOSAL.md (413 lines, new file)
@@ -0,0 +1,413 @@
# Infrastructure Reorganization Proposal for Bakery-IA

## Executive Summary

This document presents a comprehensive analysis of the current infrastructure organization and proposes a restructured layout that improves maintainability, scalability, and operational efficiency. The proposal is based on a detailed examination of the existing 177 files across 31 directories in the infrastructure folder.

## Current Infrastructure Analysis

### Current Structure Overview

```
infrastructure/
├── ci-cd/        # 18 files  - CI/CD pipeline components
├── helm/         # 8 files   - Helm charts and scripts
├── kubernetes/   # 103 files - Kubernetes manifests and configs
├── signoz/       # 11 files  - Monitoring dashboards and scripts
└── tls/          # 37 files  - TLS certificates and generation scripts
```

### Key Findings

1. **Kubernetes Base Components (103 files)**: The most complex area with:
   - 20+ service deployments across 15+ microservices
   - 20+ database configurations (PostgreSQL, RabbitMQ, MinIO)
   - 19 migration jobs for different services
   - Infrastructure components (gateway, monitoring, etc.)

2. **CI/CD Pipeline (18 files)**:
   - Tekton tasks and pipelines for GitOps workflow
   - Flux CD configuration for continuous delivery
   - Gitea configuration for Git repository management

3. **Monitoring (11 files)**:
   - SigNoz dashboards for comprehensive observability
   - Import scripts for dashboard management

4. **TLS Certificates (37 files)**:
   - CA certificates and generation scripts
   - Service-specific certificates (PostgreSQL, Redis, MinIO)
   - Certificate signing requests and configurations

### Strengths of Current Organization

1. **Logical Grouping**: Components are generally well-grouped by function
2. **Base/Overlay Pattern**: Kubernetes uses proper base/overlay structure
3. **Comprehensive Monitoring**: SigNoz dashboards cover all major aspects
4. **Security Focus**: Dedicated TLS certificate management

### Challenges Identified

1. **Complexity in Kubernetes Base**: 103 files make navigation difficult
2. **Mixed Component Types**: Services, databases, and infrastructure mixed together
3. **Limited Environment Separation**: Only dev/prod overlays, no staging
4. **Script Scattering**: Automation scripts spread across directories
5. **Documentation Gaps**: Some components lack clear documentation

## Proposed Infrastructure Organization

### High-Level Structure

```
infrastructure/
├── environments/   # Environment-specific configurations
├── platform/       # Platform-level infrastructure
├── services/       # Application services and microservices
├── monitoring/     # Observability and monitoring
├── cicd/           # CI/CD pipeline components
├── security/       # Security configurations and certificates
├── scripts/        # Automation and utility scripts
├── docs/           # Infrastructure documentation
└── README.md       # Top-level infrastructure guide
```

### Detailed Structure Proposal

```
infrastructure/
├── environments/                      # Environment-specific configurations
│   ├── dev/
│   │   ├── k8s-manifests/
│   │   │   ├── base/
│   │   │   │   ├── namespace.yaml
│   │   │   │   ├── configmap.yaml
│   │   │   │   ├── secrets.yaml
│   │   │   │   └── ingress-https.yaml
│   │   │   ├── components/
│   │   │   │   ├── databases/
│   │   │   │   ├── infrastructure/
│   │   │   │   ├── microservices/
│   │   │   │   └── cert-manager/
│   │   │   ├── configs/
│   │   │   ├── cronjobs/
│   │   │   ├── jobs/
│   │   │   └── migrations/
│   │   ├── kustomization.yaml
│   │   └── values/
│   ├── staging/                       # New staging environment
│   │   ├── k8s-manifests/
│   │   └── values/
│   └── prod/
│       ├── k8s-manifests/
│       ├── terraform/                 # Production-specific IaC
│       └── values/
├── platform/                          # Platform-level infrastructure
│   ├── cluster/
│   │   ├── eks/                       # AWS EKS configuration
│   │   │   ├── terraform/
│   │   │   └── manifests/
│   │   └── kind/                      # Local development cluster
│   │       ├── config.yaml
│   │       └── manifests/
│   ├── networking/
│   │   ├── dns/
│   │   ├── load-balancers/
│   │   └── ingress/
│   │       ├── nginx/
│   │       └── cert-manager/
│   ├── security/
│   │   ├── rbac/
│   │   ├── network-policies/
│   │   └── tls/
│   │       ├── ca/
│   │       ├── postgres/
│   │       ├── redis/
│   │       └── minio/
│   └── storage/
│       ├── postgres/
│       ├── redis/
│       └── minio/
├── services/                          # Application services
│   ├── databases/
│   │   ├── postgres/
│   │   │   ├── k8s-manifests/
│   │   │   ├── backups/
│   │   │   ├── monitoring/
│   │   │   └── maintenance/
│   │   ├── redis/
│   │   │   ├── configs/
│   │   │   └── monitoring/
│   │   └── minio/
│   │       ├── buckets/
│   │       └── policies/
│   ├── api-gateway/
│   │   ├── k8s-manifests/
│   │   └── configs/
│   └── microservices/
│       ├── auth/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── tenant/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── training/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── forecasting/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── sales/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── external/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── notification/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── inventory/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── recipes/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── suppliers/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── pos/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── orders/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── production/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── procurement/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── orchestrator/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── alert-processor/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── ai-insights/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── demo-session/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       └── frontend/
│           ├── k8s-manifests/
│           └── configs/
├── monitoring/                        # Observability stack
│   ├── signoz/
│   │   ├── manifests/
│   │   ├── dashboards/
│   │   │   ├── alert-management.json
│   │   │   ├── api-performance.json
│   │   │   ├── application-performance.json
│   │   │   ├── database-performance.json
│   │   │   ├── error-tracking.json
│   │   │   ├── index.json
│   │   │   ├── infrastructure-monitoring.json
│   │   │   ├── log-analysis.json
│   │   │   ├── system-health.json
│   │   │   └── user-activity.json
│   │   ├── values-dev.yaml
│   │   ├── values-prod.yaml
│   │   ├── deploy-signoz.sh
│   │   ├── verify-signoz.sh
│   │   └── generate-test-traffic.sh
│   └── opentelemetry/
│       ├── collector/
│       └── agent/
├── cicd/                              # CI/CD pipeline
│   ├── gitea/
│   │   ├── values.yaml
│   │   └── ingress.yaml
│   ├── tekton/
│   │   ├── tasks/
│   │   │   ├── git-clone.yaml
│   │   │   ├── detect-changes.yaml
│   │   │   ├── kaniko-build.yaml
│   │   │   └── update-gitops.yaml
│   │   ├── pipelines/
│   │   └── triggers/
│   └── flux/
│       ├── git-repository.yaml
│       └── kustomization.yaml
├── security/                          # Security configurations
│   ├── policies/
│   │   ├── network-policies.yaml
│   │   ├── pod-security.yaml
│   │   └── rbac.yaml
│   ├── certificates/
│   │   ├── ca/
│   │   ├── services/
│   │   └── rotation-scripts/
│   ├── scanning/
│   │   ├── trivy/
│   │   └── policies/
│   └── compliance/
│       ├── cis-benchmarks/
│       └── audit-scripts/
├── scripts/                           # Automation scripts
│   ├── setup/
│   │   ├── generate-certificates.sh
│   │   ├── generate-minio-certificates.sh
│   │   └── setup-dockerhub-secrets.sh
│   ├── deployment/
│   │   ├── deploy-signoz.sh
│   │   └── verify-signoz.sh
│   ├── maintenance/
│   │   ├── regenerate_migrations_k8s.sh
│   │   └── kubernetes_restart.sh
│   └── verification/
│       └── verify-registry.sh
├── docs/                              # Infrastructure documentation
│   ├── architecture/
│   │   ├── diagrams/
│   │   └── decisions/
│   ├── operations/
│   │   ├── runbooks/
│   │   └── troubleshooting/
│   ├── onboarding/
│   └── reference/
│       ├── api/
│       └── configurations/
└── README.md
```

## Migration Strategy

### Phase 1: Preparation and Planning

1. **Inventory Analysis**: Complete detailed inventory of all current files
2. **Dependency Mapping**: Identify dependencies between components
3. **Impact Assessment**: Determine which components can be moved safely
4. **Backup Strategy**: Ensure all files are backed up before migration

### Phase 2: Non-Critical Components

1. **Documentation**: Move and update all documentation files
2. **Scripts**: Organize automation scripts into new structure
3. **Monitoring**: Migrate SigNoz dashboards and configurations
4. **CI/CD**: Reorganize pipeline components

### Phase 3: Environment-Specific Components

1. **Create Environment Structure**: Set up dev/staging/prod directories
2. **Migrate Kubernetes Manifests**: Move base components to appropriate locations
3. **Update References**: Ensure all cross-references are corrected
4. **Environment Validation**: Test each environment separately

### Phase 4: Service Components

1. **Database Migration**: Move database configurations to services/databases
2. **Microservice Organization**: Group microservices by domain
3. **Infrastructure Components**: Move gateway and other infrastructure
4. **Service Validation**: Test each service in isolation

### Phase 5: Finalization

1. **Integration Testing**: Test complete infrastructure workflow
2. **Documentation Update**: Finalize all documentation
3. **Team Training**: Conduct training on new structure
4. **Cleanup**: Remove old structure and temporary files

## Benefits of Proposed Structure

### 1. Improved Navigation
- **Clear Hierarchy**: Logical grouping by function and environment
- **Consistent Patterns**: Standardized structure across all components
- **Reduced Cognitive Load**: Easier to find specific components

### 2. Enhanced Maintainability
- **Environment Isolation**: Clear separation of dev/staging/prod
- **Component Grouping**: Related components grouped together
- **Standardized Structure**: Consistent patterns across services

### 3. Better Scalability
- **Modular Design**: Easy to add new services or environments
- **Domain Separation**: Services organized by business domain
- **Infrastructure Independence**: Platform components separate from services

### 4. Improved Security
- **Centralized Security**: All security configurations in one place
- **Environment-Specific Policies**: Tailored security for each environment
- **Better Secret Management**: Clear structure for sensitive data

### 5. Enhanced Observability
- **Comprehensive Monitoring**: All observability tools grouped
- **Standardized Dashboards**: Consistent monitoring across services
- **Centralized Logging**: Better log management structure

## Implementation Considerations

### Tools and Technologies
- **Terraform**: For infrastructure as code (IaC)
- **Kustomize**: For Kubernetes manifest management
- **Helm**: For complex application deployments
- **SOPS/Sealed Secrets**: For secret management
- **Trivy**: For vulnerability scanning
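
As a concrete illustration of how the Kustomize entry above could tie an environment directory to its components in the proposed layout, a minimal overlay might look like the following sketch. The paths mirror the proposed tree; none of these files exist yet, so this is an assumption rather than working configuration:

```yaml
# infrastructure/environments/dev/kustomization.yaml (illustrative sketch only)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - k8s-manifests/base
  - k8s-manifests/components/databases
  - k8s-manifests/components/microservices
  - k8s-manifests/migrations

# Environment-specific image tags can be pinned per overlay
images:
  - name: gitea.bakery-ia.local:5000/bakery/gateway
    newTag: dev
```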

### Team Adaptation
- **Training Plan**: Develop comprehensive training materials
- **Migration Guide**: Create step-by-step migration documentation
- **Support Period**: Provide dedicated support during transition
- **Feedback Mechanism**: Establish channels for team feedback

### Risk Mitigation
- **Phased Approach**: Implement changes incrementally
- **Rollback Plan**: Develop comprehensive rollback procedures
- **Testing Strategy**: Implement thorough testing at each phase
- **Monitoring**: Enhanced monitoring during migration period

## Expected Outcomes

1. **Reduced Time-to-Find**: 40-60% reduction in time spent locating files
2. **Improved Deployment Speed**: 25-35% faster deployment cycles
3. **Enhanced Collaboration**: Better team coordination and understanding
4. **Reduced Errors**: 30-50% reduction in configuration errors
5. **Better Scalability**: Easier to add new services and features

## Conclusion

The proposed infrastructure reorganization represents a significant improvement over the current structure. By implementing a clear, logical hierarchy with proper separation of concerns, the new organization will:

- **Improve operational efficiency** through better navigation and maintainability
- **Enhance security** with centralized security management
- **Support growth** with a scalable, modular design
- **Reduce errors** through standardized patterns and structures
- **Facilitate collaboration** with intuitive organization

The key to successful implementation is a phased approach with thorough testing and team involvement at each stage. With proper planning and execution, this reorganization will provide long-term benefits for the Bakery-IA project's infrastructure management.
## Appendix: File Migration Mapping

### Current → Proposed Mapping

**Kubernetes Components:**
- `infrastructure/kubernetes/base/components/*` → `infrastructure/services/microservices/*/`
- `infrastructure/kubernetes/base/components/databases/*` → `infrastructure/services/databases/*/`
- `infrastructure/kubernetes/base/migrations/*` → `infrastructure/services/microservices/*/migrations/`
- `infrastructure/kubernetes/base/configs/*` → `infrastructure/environments/*/values/`

**CI/CD Components:**
- `infrastructure/ci-cd/*` → `infrastructure/cicd/*/`

**Monitoring Components:**
- `infrastructure/signoz/*` → `infrastructure/monitoring/signoz/*/`
- `infrastructure/helm/*` → `infrastructure/monitoring/signoz/*/` (signoz-related)

**Security Components:**
- `infrastructure/tls/*` → `infrastructure/security/certificates/*/`

**Scripts:**
- `infrastructure/kubernetes/*.sh` → `infrastructure/scripts/*/`
- `infrastructure/helm/*.sh` → `infrastructure/scripts/deployment/*/`
- `infrastructure/tls/*.sh` → `infrastructure/scripts/setup/*/`

This mapping provides a clear path for migrating each component to its new location while maintaining functionality and relationships between components.
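
To preserve file history during the move, the mapping above could be applied with `git mv` in small batches, roughly along the lines of the sketch below. The paths are illustrative, and each batch should be validated (for example with `kubectl kustomize` on the affected overlays) before committing:

```bash
#!/bin/sh
# Illustrative migration batch: one component group per commit so it is easy to revert.
set -e

# CI/CD components: infrastructure/ci-cd/* -> infrastructure/cicd/*
mkdir -p infrastructure/cicd
git mv infrastructure/ci-cd/gitea  infrastructure/cicd/gitea
git mv infrastructure/ci-cd/tekton infrastructure/cicd/tekton
git mv infrastructure/ci-cd/flux   infrastructure/cicd/flux

# Monitoring components: infrastructure/signoz/* -> infrastructure/monitoring/signoz/*
mkdir -p infrastructure/monitoring
git mv infrastructure/signoz infrastructure/monitoring/signoz

# Validate that existing overlays still build before committing
kubectl kustomize infrastructure/kubernetes/overlays/prod > /dev/null

git commit -m "refactor(infra): relocate CI/CD and monitoring per reorganization proposal"
```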

infrastructure/ci-cd/README.md (294 lines, new file)
@@ -0,0 +1,294 @@
# Bakery-IA CI/CD Implementation

This directory contains the configuration for the production-grade CI/CD system for Bakery-IA using Gitea, Tekton, and Flux CD.

## Architecture Overview

```mermaid
graph TD
    A[Developer] -->|Push Code| B[Gitea]
    B -->|Webhook| C[Tekton Pipelines]
    C -->|Build/Test| D[Gitea Registry]
    D -->|New Image| E[Flux CD]
    E -->|kubectl apply| F[MicroK8s Cluster]
    F -->|Metrics| G[SigNoz]
```

## Directory Structure

```
infrastructure/ci-cd/
├── gitea/                      # Gitea configuration (Git server + registry)
│   ├── values.yaml             # Helm values for Gitea
│   └── ingress.yaml            # Ingress configuration
├── tekton/                     # Tekton CI/CD pipeline configuration
│   ├── tasks/                  # Individual pipeline tasks
│   │   ├── git-clone.yaml
│   │   ├── detect-changes.yaml
│   │   ├── kaniko-build.yaml
│   │   └── update-gitops.yaml
│   ├── pipelines/              # Pipeline definitions
│   │   └── ci-pipeline.yaml
│   └── triggers/               # Webhook trigger configuration
│       ├── trigger-template.yaml
│       ├── trigger-binding.yaml
│       ├── event-listener.yaml
│       └── gitlab-interceptor.yaml
├── flux/                       # Flux CD GitOps configuration
│   ├── git-repository.yaml     # Git repository source
│   └── kustomization.yaml      # Deployment kustomization
├── monitoring/                 # Monitoring configuration
│   └── otel-collector.yaml     # OpenTelemetry collector
└── README.md                   # This file
```

## Deployment Instructions

### Phase 1: Infrastructure Setup

1. **Deploy Gitea**:

   ```bash
   # Add Helm repo
   microk8s helm repo add gitea https://dl.gitea.io/charts

   # Create namespace
   microk8s kubectl create namespace gitea

   # Install Gitea
   microk8s helm install gitea gitea/gitea \
     -n gitea \
     -f infrastructure/ci-cd/gitea/values.yaml

   # Apply ingress
   microk8s kubectl apply -f infrastructure/ci-cd/gitea/ingress.yaml
   ```

2. **Deploy Tekton**:

   ```bash
   # Create namespace
   microk8s kubectl create namespace tekton-pipelines

   # Install Tekton Pipelines
   microk8s kubectl apply -f https://storage.googleapis.com/tekton-releases/pipeline/latest/release.yaml

   # Install Tekton Triggers
   microk8s kubectl apply -f https://storage.googleapis.com/tekton-releases/triggers/latest/release.yaml

   # Apply Tekton configurations
   microk8s kubectl apply -f infrastructure/ci-cd/tekton/tasks/
   microk8s kubectl apply -f infrastructure/ci-cd/tekton/pipelines/
   microk8s kubectl apply -f infrastructure/ci-cd/tekton/triggers/
   ```

3. **Deploy Flux CD** (already enabled in MicroK8s):

   ```bash
   # Verify Flux installation
   microk8s kubectl get pods -n flux-system

   # Apply Flux configurations
   microk8s kubectl apply -f infrastructure/ci-cd/flux/
   ```

### Phase 2: Configuration

1. **Set up Gitea webhook**:
   - Go to your Gitea repository settings
   - Add webhook with URL: `http://tekton-triggers.tekton-pipelines.svc.cluster.local:8080`
   - Use the secret from `gitea-webhook-secret`
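
   If the `gitea-webhook-secret` referenced above does not exist yet, it can be created ahead of time. The following is a sketch: the token value is a generated placeholder, and the same value must be entered as the webhook secret in Gitea.

   ```bash
   # Create the webhook secret consumed by the EventListener interceptor
   # (the key name "secretToken" matches triggers/event-listener.yaml)
   microk8s kubectl create secret generic gitea-webhook-secret \
     -n tekton-pipelines \
     --from-literal=secretToken="$(openssl rand -hex 20)"
   ```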
2. **Configure registry credentials**:

   ```bash
   # Create registry credentials secret
   microk8s kubectl create secret docker-registry gitea-registry-credentials \
     -n tekton-pipelines \
     --docker-server=gitea.bakery-ia.local:5000 \
     --docker-username=your-username \
     --docker-password=your-password
   ```

3. **Configure Git credentials for Flux**:

   ```bash
   # Create Git credentials secret
   microk8s kubectl create secret generic gitea-credentials \
     -n flux-system \
     --from-literal=username=your-username \
     --from-literal=password=your-password
   ```

### Phase 3: Monitoring

```bash
# Apply OpenTelemetry configuration
microk8s kubectl apply -f infrastructure/ci-cd/monitoring/otel-collector.yaml
```

## Usage

### Triggering a Pipeline

1. **Manual trigger**:

   ```bash
   # Create a PipelineRun manually
   microk8s kubectl create -f - <<EOF
   apiVersion: tekton.dev/v1beta1
   kind: PipelineRun
   metadata:
     name: manual-ci-run
     namespace: tekton-pipelines
   spec:
     pipelineRef:
       name: bakery-ia-ci
     workspaces:
       - name: shared-workspace
         volumeClaimTemplate:
           spec:
             accessModes: ["ReadWriteOnce"]
             resources:
               requests:
                 storage: 5Gi
       - name: docker-credentials
         secret:
           secretName: gitea-registry-credentials
     params:
       - name: git-url
         value: "http://gitea.bakery-ia.local/bakery/bakery-ia.git"
       - name: git-revision
         value: "main"
   EOF
   ```

2. **Automatic trigger**: Push code to the repository and the webhook will trigger the pipeline automatically.

### Monitoring Pipeline Runs

```bash
# List all PipelineRuns
microk8s kubectl get pipelineruns -n tekton-pipelines

# View logs for a specific PipelineRun
microk8s kubectl logs -n tekton-pipelines <pipelinerun-pod> -c <step-name>

# View Tekton dashboard
microk8s kubectl port-forward -n tekton-pipelines svc/tekton-dashboard 9097:9097
```

## Troubleshooting

### Common Issues

1. **Pipeline not triggering**:
   - Check Gitea webhook logs
   - Verify EventListener pods are running
   - Check TriggerBinding configuration

2. **Build failures**:
   - Check Kaniko logs for build errors
   - Verify Dockerfile paths are correct
   - Ensure registry credentials are valid

3. **Flux not applying changes**:
   - Check GitRepository status
   - Verify Kustomization reconciliation
   - Check Flux logs for errors

### Debugging Commands

```bash
# Check Tekton controller logs
microk8s kubectl logs -n tekton-pipelines -l app=tekton-pipelines-controller

# Check Flux reconciliation
microk8s kubectl get kustomizations -n flux-system -o yaml

# Check Gitea webhook delivery
microk8s kubectl logs -n tekton-pipelines -l app=tekton-triggers-controller
```

## Security Considerations

1. **Secrets Management**:
   - Use Kubernetes secrets for sensitive data
   - Rotate credentials regularly
   - Use RBAC for namespace isolation

2. **Network Security**:
   - Configure network policies
   - Use internal DNS names
   - Restrict ingress access

3. **Registry Security**:
   - Enable image scanning
   - Use image signing
   - Implement cleanup policies

## Maintenance

### Upgrading Components

```bash
# Upgrade Tekton
microk8s kubectl apply -f https://storage.googleapis.com/tekton-releases/pipeline/latest/release.yaml

# Upgrade Flux
microk8s helm upgrade fluxcd fluxcd/flux2 -n flux-system

# Upgrade Gitea
microk8s helm upgrade gitea gitea/gitea -n gitea -f infrastructure/ci-cd/gitea/values.yaml
```

### Backup Procedures

```bash
# Backup Gitea
microk8s kubectl exec -n gitea gitea-0 -- gitea dump -c /data/gitea/conf/app.ini

# Backup Flux configurations
microk8s kubectl get all -n flux-system -o yaml > flux-backup.yaml

# Backup Tekton configurations
microk8s kubectl get all -n tekton-pipelines -o yaml > tekton-backup.yaml
```

## Performance Optimization

1. **Resource Management**:
   - Set appropriate resource limits
   - Limit concurrent builds
   - Use node selectors for build pods

2. **Caching**:
   - Configure Kaniko cache
   - Use persistent volumes for dependencies
   - Cache Docker layers

3. **Parallelization**:
   - Build independent services in parallel
   - Use matrix builds for different architectures
   - Optimize task dependencies
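
For the Kaniko caching point above, layer caching can be turned on with additional executor arguments. A sketch against the `kaniko-build` task in this directory; the cache repository name is an assumption and must exist in the registry:

```yaml
# Illustrative args for the kaniko-build step with caching enabled
args:
  - --dockerfile=$(workspaces.source.path)/services/$(params.services)/Dockerfile
  - --context=$(workspaces.source.path)
  - --destination=$(params.registry)/bakery/$(params.services):$(params.git-revision)
  - --cache=true
  - --cache-repo=$(params.registry)/bakery/kaniko-cache   # assumed cache repository
  - --cache-ttl=168h
  - --verbosity=info
```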

## Integration with Existing System

The CI/CD system integrates with:
- **SigNoz**: For monitoring and observability
- **MicroK8s**: For cluster management
- **Existing Kubernetes manifests**: In `infrastructure/kubernetes/`
- **Current services**: All 19 microservices in `services/`

## Migration Plan

1. **Phase 1**: Set up infrastructure (Gitea, Tekton, Flux)
2. **Phase 2**: Configure pipelines and triggers
3. **Phase 3**: Test with non-critical services
4. **Phase 4**: Gradual rollout to all services
5. **Phase 5**: Decommission old deployment methods

## Support

For issues with the CI/CD system:
- Check logs and monitoring first
- Review the troubleshooting section
- Consult the original implementation plan
- Refer to component documentation:
  - [Tekton Documentation](https://tekton.dev/docs/)
  - [Flux CD Documentation](https://fluxcd.io/docs/)
  - [Gitea Documentation](https://docs.gitea.io/)

infrastructure/ci-cd/flux/git-repository.yaml (16 lines, new file)
@@ -0,0 +1,16 @@
# Flux GitRepository for Bakery-IA
# This resource tells Flux where to find the Git repository

apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: bakery-ia
  namespace: flux-system
spec:
  interval: 1m
  url: http://gitea.bakery-ia.local/bakery/bakery-ia.git
  ref:
    branch: main
  secretRef:
    name: gitea-credentials
  timeout: 60s

infrastructure/ci-cd/flux/kustomization.yaml (27 lines, new file)
@@ -0,0 +1,27 @@
# Flux Kustomization for Bakery-IA Production Deployment
# This resource tells Flux how to deploy the application

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: bakery-ia-prod
  namespace: flux-system
spec:
  interval: 5m
  path: ./infrastructure/kubernetes/overlays/prod
  prune: true
  sourceRef:
    kind: GitRepository
    name: bakery-ia
  targetNamespace: bakery-ia
  timeout: 5m
  retryInterval: 1m
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: auth-service
      namespace: bakery-ia
    - apiVersion: apps/v1
      kind: Deployment
      name: gateway
      namespace: bakery-ia
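
Once both Flux resources are applied, reconciliation can be checked with kubectl; the `flux` CLI offers equivalent commands if it is installed:

```bash
# Has Flux fetched the latest commit from Gitea?
microk8s kubectl get gitrepositories -n flux-system

# Has the prod overlay been applied, and are the health checks above passing?
microk8s kubectl get kustomizations -n flux-system
microk8s kubectl describe kustomization bakery-ia-prod -n flux-system
```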

infrastructure/ci-cd/gitea/ingress.yaml (25 lines, new file)
@@ -0,0 +1,25 @@
# Gitea Ingress configuration for Bakery-IA CI/CD
# This provides external access to Gitea within the cluster

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: gitea-ingress
  namespace: gitea
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
    nginx.ingress.kubernetes.io/proxy-body-size: "0"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
spec:
  rules:
    - host: gitea.bakery-ia.local
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: gitea-http
                port:
                  number: 3000

infrastructure/ci-cd/gitea/values.yaml (38 lines, new file)
@@ -0,0 +1,38 @@
# Gitea Helm values configuration for Bakery-IA CI/CD
# This configuration sets up Gitea with registry support and appropriate storage

service:
  type: ClusterIP
  httpPort: 3000
  sshPort: 2222

persistence:
  enabled: true
  size: 50Gi
  storageClass: "microk8s-hostpath"

gitea:
  config:
    server:
      DOMAIN: gitea.bakery-ia.local
      SSH_DOMAIN: gitea.bakery-ia.local
      ROOT_URL: http://gitea.bakery-ia.local
    repository:
      ENABLE_PUSH_CREATE_USER: true
      ENABLE_PUSH_CREATE_ORG: true
    registry:
      ENABLED: true

postgresql:
  enabled: true
  persistence:
    size: 20Gi

# Resource configuration for production environment
resources:
  limits:
    cpu: 1000m
    memory: 1Gi
  requests:
    cpu: 500m
    memory: 512Mi
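
With the registry enabled in the values above, pushing an image from a workstation would look roughly like this. The user, the image name, and the assumption that `gitea.bakery-ia.local:5000` resolves to the cluster ingress are placeholders:

```bash
# Placeholder credentials: use a real Gitea user or access token
docker login gitea.bakery-ia.local:5000 -u your-username -p your-password

docker tag bakery/auth-service:dev gitea.bakery-ia.local:5000/bakery/auth-service:dev
docker push gitea.bakery-ia.local:5000/bakery/auth-service:dev
```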

infrastructure/ci-cd/monitoring/otel-collector.yaml (70 lines, new file)
@@ -0,0 +1,70 @@
# OpenTelemetry Collector for Bakery-IA CI/CD Monitoring
# This collects metrics and traces from Tekton pipelines

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: tekton-otel
  namespace: tekton-pipelines
spec:
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
      prometheus:
        config:
          scrape_configs:
            - job_name: 'tekton-pipelines'
              scrape_interval: 30s
              static_configs:
                - targets: ['tekton-pipelines-controller.tekton-pipelines.svc.cluster.local:9090']

    processors:
      batch:
        timeout: 5s
        send_batch_size: 1000
      memory_limiter:
        check_interval: 2s
        limit_percentage: 75
        spike_limit_percentage: 20

    exporters:
      otlp:
        endpoint: "signoz-otel-collector.monitoring.svc.cluster.local:4317"
        tls:
          insecure: true
        retry_on_failure:
          enabled: true
          initial_interval: 5s
          max_interval: 30s
          max_elapsed_time: 300s
      logging:
        logLevel: debug

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [otlp, logging]
        metrics:
          receivers: [otlp, prometheus]
          processors: [memory_limiter, batch]
          exporters: [otlp, logging]
      telemetry:
        logs:
          level: "info"
          encoding: "json"

  mode: deployment
  resources:
    limits:
      cpu: 500m
      memory: 512Mi
    requests:
      cpu: 200m
      memory: 256Mi
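
After applying this manifest, the OpenTelemetry Operator typically creates a Deployment named after the resource with a `-collector` suffix; a quick check (the exact resource name is an assumption about the operator's naming convention):

```bash
# Confirm the collector pod is running and forwarding to SigNoz
microk8s kubectl get deployment tekton-otel-collector -n tekton-pipelines
microk8s kubectl logs -n tekton-pipelines deployment/tekton-otel-collector --tail=50
```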

infrastructure/ci-cd/tekton/pipelines/ci-pipeline.yaml (83 lines, new file)
@@ -0,0 +1,83 @@
# Main CI Pipeline for Bakery-IA
# This pipeline orchestrates the build, test, and deploy process

apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: bakery-ia-ci
  namespace: tekton-pipelines
spec:
  workspaces:
    - name: shared-workspace
    - name: docker-credentials
  params:
    - name: git-url
      type: string
      description: Repository URL
    - name: git-revision
      type: string
      description: Git revision/commit hash
    - name: registry
      type: string
      description: Container registry URL
      default: "gitea.bakery-ia.local:5000"
  tasks:
    - name: fetch-source
      taskRef:
        name: git-clone
      workspaces:
        - name: output
          workspace: shared-workspace
      params:
        - name: url
          value: $(params.git-url)
        - name: revision
          value: $(params.git-revision)

    - name: detect-changes
      runAfter: [fetch-source]
      taskRef:
        name: detect-changed-services
      workspaces:
        - name: source
          workspace: shared-workspace

    - name: build-and-push
      runAfter: [detect-changes]
      taskRef:
        name: kaniko-build
      when:
        - input: "$(tasks.detect-changes.results.changed-services)"
          operator: notin
          values: ["none"]
      workspaces:
        - name: source
          workspace: shared-workspace
        - name: docker-credentials
          workspace: docker-credentials
      params:
        - name: services
          value: $(tasks.detect-changes.results.changed-services)
        - name: registry
          value: $(params.registry)
        - name: git-revision
          value: $(params.git-revision)

    - name: update-gitops-manifests
      runAfter: [build-and-push]
      taskRef:
        name: update-gitops
      when:
        - input: "$(tasks.detect-changes.results.changed-services)"
          operator: notin
          values: ["none"]
      workspaces:
        - name: source
          workspace: shared-workspace
      params:
        - name: services
          value: $(tasks.detect-changes.results.changed-services)
        - name: registry
          value: $(params.registry)
        - name: git-revision
          value: $(params.git-revision)
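
If the Tekton CLI (`tkn`) is installed, the pipeline above can be inspected and followed without crafting PipelineRuns by hand; a brief sketch:

```bash
# List and follow runs of the bakery-ia-ci pipeline
tkn pipeline list -n tekton-pipelines
tkn pipelinerun list -n tekton-pipelines
tkn pipelinerun logs --last -f -n tekton-pipelines
```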

infrastructure/ci-cd/tekton/tasks/detect-changes.yaml (64 lines, new file)
@@ -0,0 +1,64 @@
# Tekton Detect Changed Services Task for Bakery-IA CI/CD
# This task identifies which services have changed in the repository

apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: detect-changed-services
  namespace: tekton-pipelines
spec:
  workspaces:
    - name: source
  results:
    - name: changed-services
      description: Comma-separated list of changed services
  steps:
    - name: detect
      image: alpine/git
      script: |
        #!/bin/sh
        # NOTE: alpine/git ships BusyBox ash, not bash, so this script sticks to
        # POSIX sh (no arrays, no [[ ]]).
        set -e
        cd $(workspaces.source.path)

        echo "Detecting changed files..."
        # Get list of changed files compared to previous commit
        CHANGED_FILES=$(git diff --name-only HEAD~1 HEAD 2>/dev/null || git diff --name-only HEAD)

        echo "Changed files: $CHANGED_FILES"

        # Map files to services (space-separated list instead of a bash array)
        CHANGED_SERVICES=""
        add_service() {
          # Only add unique service names
          case " $CHANGED_SERVICES " in
            *" $1 "*) ;;
            *) CHANGED_SERVICES="$CHANGED_SERVICES $1" ;;
          esac
        }
        for file in $CHANGED_FILES; do
          case "$file" in
            services/*) add_service "$(echo "$file" | cut -d'/' -f2)" ;;
            frontend/*) add_service "frontend" ;;
            gateway/*)  add_service "gateway" ;;
          esac
        done

        # If no specific services changed, check for infrastructure changes
        if [ -z "$CHANGED_SERVICES" ]; then
          for file in $CHANGED_FILES; do
            case "$file" in
              infrastructure/*)
                CHANGED_SERVICES="infrastructure"
                break
                ;;
            esac
          done
        fi

        # Output result
        if [ -z "$CHANGED_SERVICES" ]; then
          echo "No service changes detected"
          echo "none" | tee $(results.changed-services.path)
        else
          echo "Detected changes in services:$CHANGED_SERVICES"
          echo "$CHANGED_SERVICES" | sed 's/^ //' | tr ' ' ',' | tee $(results.changed-services.path)
        fi

infrastructure/ci-cd/tekton/tasks/git-clone.yaml (31 lines, new file)
@@ -0,0 +1,31 @@
# Tekton Git Clone Task for Bakery-IA CI/CD
# This task clones the source code repository

apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: git-clone
  namespace: tekton-pipelines
spec:
  workspaces:
    - name: output
  params:
    - name: url
      type: string
      description: Repository URL to clone
    - name: revision
      type: string
      description: Git revision to checkout
      default: "main"
  steps:
    - name: clone
      image: alpine/git
      script: |
        #!/bin/sh
        set -e
        echo "Cloning repository: $(params.url)"
        git clone $(params.url) $(workspaces.output.path)
        cd $(workspaces.output.path)
        echo "Checking out revision: $(params.revision)"
        git checkout $(params.revision)
        echo "Repository cloned successfully"

infrastructure/ci-cd/tekton/tasks/kaniko-build.yaml (40 lines, new file)
@@ -0,0 +1,40 @@
# Tekton Kaniko Build Task for Bakery-IA CI/CD
# This task builds and pushes container images using Kaniko

apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: kaniko-build
  namespace: tekton-pipelines
spec:
  workspaces:
    - name: source
    - name: docker-credentials
  params:
    - name: services
      type: string
      description: Comma-separated list of services to build
    - name: registry
      type: string
      description: Container registry URL
      default: "gitea.bakery-ia.local:5000"
    - name: git-revision
      type: string
      description: Git revision for image tag
      default: "latest"
  steps:
    - name: build-and-push
      image: gcr.io/kaniko-project/executor:v1.9.0
      # NOTE: $(params.services) is substituted as a single path segment, so this step
      # builds one service per run even though the parameter is documented as a
      # comma-separated list. The docker-credentials workspace is declared above but
      # the step mounts an empty emptyDir at /kaniko/.docker, so registry auth must be
      # provided there for pushes to succeed.
      args:
        - --dockerfile=$(workspaces.source.path)/services/$(params.services)/Dockerfile
        - --context=$(workspaces.source.path)
        - --destination=$(params.registry)/bakery/$(params.services):$(params.git-revision)
        - --verbosity=info
      volumeMounts:
        - name: docker-config
          mountPath: /kaniko/.docker
      securityContext:
        runAsUser: 0
  volumes:
    - name: docker-config
      emptyDir: {}

infrastructure/ci-cd/tekton/tasks/update-gitops.yaml (66 lines, new file)
@@ -0,0 +1,66 @@
# Tekton Update GitOps Manifests Task for Bakery-IA CI/CD
# This task updates Kubernetes manifests with new image tags

apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: update-gitops
  namespace: tekton-pipelines
spec:
  workspaces:
    - name: source
  params:
    - name: services
      type: string
      description: Comma-separated list of services to update
    - name: registry
      type: string
      description: Container registry URL
    - name: git-revision
      type: string
      description: Git revision for image tag
  steps:
    - name: update-manifests
      # NOTE: this step needs sed and git available in the image, and its shell is
      # POSIX sh, so the loop below avoids bash-only constructs such as read -a.
      image: bitnami/kubectl
      script: |
        #!/bin/sh
        set -e
        cd $(workspaces.source.path)

        echo "Updating GitOps manifests for services: $(params.services)"

        # Split the comma-separated service list (POSIX sh, no arrays)
        for service in $(echo "$(params.services)" | tr ',' ' '); do
          echo "Processing service: $service"

          # Find and update Kubernetes manifests
          if [ "$service" = "frontend" ]; then
            # Update frontend deployment
            if [ -f "infrastructure/kubernetes/overlays/prod/frontend-deployment.yaml" ]; then
              sed -i "s|image:.*|image: $(params.registry)/bakery/frontend:$(params.git-revision)|g" \
                "infrastructure/kubernetes/overlays/prod/frontend-deployment.yaml"
            fi
          elif [ "$service" = "gateway" ]; then
            # Update gateway deployment
            if [ -f "infrastructure/kubernetes/overlays/prod/gateway-deployment.yaml" ]; then
              sed -i "s|image:.*|image: $(params.registry)/bakery/gateway:$(params.git-revision)|g" \
                "infrastructure/kubernetes/overlays/prod/gateway-deployment.yaml"
            fi
          else
            # Update service deployment
            DEPLOYMENT_FILE="infrastructure/kubernetes/overlays/prod/${service}-deployment.yaml"
            if [ -f "$DEPLOYMENT_FILE" ]; then
              sed -i "s|image:.*|image: $(params.registry)/bakery/${service}:$(params.git-revision)|g" \
                "$DEPLOYMENT_FILE"
            fi
          fi
        done

        # Commit changes
        git config --global user.name "bakery-ia-ci"
        git config --global user.email "ci@bakery-ia.local"
        git add .
        git commit -m "CI: Update image tags for $(params.services) to $(params.git-revision)"
        git push origin HEAD

infrastructure/ci-cd/tekton/triggers/event-listener.yaml (26 lines, new file)
@@ -0,0 +1,26 @@
# Tekton EventListener for Bakery-IA CI/CD
# This listener receives webhook events and triggers pipelines

apiVersion: triggers.tekton.dev/v1alpha1
kind: EventListener
metadata:
  name: bakery-ia-listener
  namespace: tekton-pipelines
spec:
  serviceAccountName: tekton-triggers-sa
  triggers:
    - name: bakery-ia-gitea-trigger
      bindings:
        - ref: bakery-ia-trigger-binding
      template:
        ref: bakery-ia-trigger-template
      interceptors:
        - ref:
            name: "gitlab"
          params:
            - name: "secretRef"
              value:
                secretName: gitea-webhook-secret
                secretKey: secretToken
            - name: "eventTypes"
              value: ["push"]
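
The `tekton-triggers-sa` ServiceAccount referenced above is not defined in this directory. A minimal sketch of what it could look like follows; the ClusterRole names are the aggregated roles usually shipped with Tekton Triggers releases, but they should be checked against the installed version:

```yaml
# Hypothetical ServiceAccount and bindings assumed by the EventListener above
apiVersion: v1
kind: ServiceAccount
metadata:
  name: tekton-triggers-sa
  namespace: tekton-pipelines
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: tekton-triggers-sa-eventlistener
  namespace: tekton-pipelines
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: tekton-triggers-eventlistener-roles
subjects:
  - kind: ServiceAccount
    name: tekton-triggers-sa
    namespace: tekton-pipelines
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: tekton-triggers-sa-eventlistener-clusterroles
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: tekton-triggers-eventlistener-clusterroles
subjects:
  - kind: ServiceAccount
    name: tekton-triggers-sa
    namespace: tekton-pipelines
```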

infrastructure/ci-cd/tekton/triggers/gitlab-interceptor.yaml (14 lines, new file)
@@ -0,0 +1,14 @@
# GitLab/Gitea Webhook Interceptor for Tekton Triggers
# This interceptor validates and processes Gitea webhook events

apiVersion: triggers.tekton.dev/v1alpha1
kind: ClusterInterceptor
metadata:
  name: gitlab
spec:
  clientConfig:
    service:
      name: tekton-triggers-core-interceptors
      namespace: tekton-pipelines
      path: "/v1/webhook/gitlab"
      port: 8443

infrastructure/ci-cd/tekton/triggers/trigger-binding.yaml (16 lines, new file)
@@ -0,0 +1,16 @@
# Tekton TriggerBinding for Bakery-IA CI/CD
# This binding extracts parameters from Gitea webhook events

apiVersion: triggers.tekton.dev/v1alpha1
kind: TriggerBinding
metadata:
  name: bakery-ia-trigger-binding
  namespace: tekton-pipelines
spec:
  params:
    - name: git-repo-url
      value: $(body.repository.clone_url)
    - name: git-revision
      value: $(body.head_commit.id)
    - name: git-repo-name
      value: $(body.repository.name)

infrastructure/ci-cd/tekton/triggers/trigger-template.yaml (43 lines, new file)
@@ -0,0 +1,43 @@
# Tekton TriggerTemplate for Bakery-IA CI/CD
# This template defines how PipelineRuns are created when triggers fire

apiVersion: triggers.tekton.dev/v1alpha1
kind: TriggerTemplate
metadata:
  name: bakery-ia-trigger-template
  namespace: tekton-pipelines
spec:
  params:
    - name: git-repo-url
      description: The git repository URL
    - name: git-revision
      description: The git revision/commit hash
    - name: git-repo-name
      description: The git repository name
      default: "bakery-ia"
  resourcetemplates:
    - apiVersion: tekton.dev/v1beta1
      kind: PipelineRun
      metadata:
        generateName: bakery-ia-ci-run-$(params.git-repo-name)-
      spec:
        pipelineRef:
          name: bakery-ia-ci
        workspaces:
          - name: shared-workspace
            volumeClaimTemplate:
              spec:
                accessModes: ["ReadWriteOnce"]
                resources:
                  requests:
                    storage: 5Gi
          - name: docker-credentials
            secret:
              secretName: gitea-registry-credentials
        params:
          - name: git-url
            value: $(params.git-repo-url)
          - name: git-revision
            value: $(params.git-revision)
          - name: registry
            value: "gitea.bakery-ia.local:5000"

@@ -18,6 +18,28 @@ class OrchestratorService(StandardFastAPIService):
     expected_migration_version = "001_initial_schema"
 
+    def __init__(self):
+        # Define expected database tables for health checks
+        orchestrator_expected_tables = [
+            'orchestration_runs'
+        ]
+
+        self.rabbitmq_client = None
+        self.event_publisher = None
+        self.leader_election = None
+        self.scheduler_service = None
+
+        super().__init__(
+            service_name="orchestrator-service",
+            app_name=settings.APP_NAME,
+            description=settings.DESCRIPTION,
+            version=settings.VERSION,
+            api_prefix="",  # Empty because RouteBuilder already includes /api/v1
+            database_manager=database_manager,
+            expected_tables=orchestrator_expected_tables,
+            enable_messaging=True  # Enable RabbitMQ for event publishing
+        )
+
     async def verify_migrations(self):
         """Verify database schema matches the latest migrations"""
         try:
@@ -32,26 +54,6 @@ class OrchestratorService(StandardFastAPIService):
             self.logger.error(f"Migration verification failed: {e}")
             raise
 
-    def __init__(self):
-        # Define expected database tables for health checks
-        orchestrator_expected_tables = [
-            'orchestration_runs'
-        ]
-
-        self.rabbitmq_client = None
-        self.event_publisher = None
-
-        super().__init__(
-            service_name="orchestrator-service",
-            app_name=settings.APP_NAME,
-            description=settings.DESCRIPTION,
-            version=settings.VERSION,
-            api_prefix="",  # Empty because RouteBuilder already includes /api/v1
-            database_manager=database_manager,
-            expected_tables=orchestrator_expected_tables,
-            enable_messaging=True  # Enable RabbitMQ for event publishing
-        )
-
     async def _setup_messaging(self):
         """Setup messaging for orchestrator service"""
         from shared.messaging import UnifiedEventPublisher, RabbitMQClient
@@ -84,22 +86,91 @@ class OrchestratorService(StandardFastAPIService):
 
         self.logger.info("Orchestrator Service starting up...")
 
-        # Initialize orchestrator scheduler service with EventPublisher
-        from app.services.orchestrator_service import OrchestratorSchedulerService
-        scheduler_service = OrchestratorSchedulerService(self.event_publisher, settings)
-        await scheduler_service.start()
-        app.state.scheduler_service = scheduler_service
-        self.logger.info("Orchestrator scheduler service started")
+        # Initialize leader election for horizontal scaling
+        # Only the leader pod will run the scheduler
+        await self._setup_leader_election(app)
 
         # REMOVED: Delivery tracking service - moved to procurement service (domain ownership)
 
+    async def _setup_leader_election(self, app: FastAPI):
+        """
+        Setup leader election for scheduler.
+
+        CRITICAL FOR HORIZONTAL SCALING:
+        Without leader election, each pod would run the same scheduled jobs,
+        causing duplicate forecasts, production schedules, and database contention.
+        """
+        from shared.leader_election import LeaderElectionService
+        import redis.asyncio as redis
+
+        try:
+            # Create Redis connection for leader election
+            redis_url = f"redis://:{settings.REDIS_PASSWORD}@{settings.REDIS_HOST}:{settings.REDIS_PORT}/{settings.REDIS_DB}"
+            if settings.REDIS_TLS_ENABLED.lower() == "true":
+                redis_url = redis_url.replace("redis://", "rediss://")
+
+            redis_client = redis.from_url(redis_url, decode_responses=False)
+            await redis_client.ping()
+
+            # Use shared leader election service
+            self.leader_election = LeaderElectionService(
+                redis_client,
+                service_name="orchestrator"
+            )
+
+            # Define callbacks for leader state changes
+            async def on_become_leader():
+                self.logger.info("This pod became the leader - starting scheduler")
+                from app.services.orchestrator_service import OrchestratorSchedulerService
+                self.scheduler_service = OrchestratorSchedulerService(self.event_publisher, settings)
+                await self.scheduler_service.start()
+                app.state.scheduler_service = self.scheduler_service
+                self.logger.info("Orchestrator scheduler service started (leader only)")
+
+            async def on_lose_leader():
+                self.logger.warning("This pod lost leadership - stopping scheduler")
+                if self.scheduler_service:
+                    await self.scheduler_service.stop()
+                    self.scheduler_service = None
+                if hasattr(app.state, 'scheduler_service'):
+                    app.state.scheduler_service = None
+                self.logger.info("Orchestrator scheduler service stopped (no longer leader)")
+
+            # Start leader election
+            await self.leader_election.start(
+                on_become_leader=on_become_leader,
+                on_lose_leader=on_lose_leader
+            )
+
+            # Store leader election in app state for health checks
+            app.state.leader_election = self.leader_election
+
+            self.logger.info("Leader election initialized",
+                             is_leader=self.leader_election.is_leader,
+                             instance_id=self.leader_election.instance_id)
+
+        except Exception as e:
+            self.logger.error("Failed to setup leader election, falling back to standalone mode",
+                              error=str(e))
+            # Fallback: start scheduler anyway (for single-pod deployments)
+            from app.services.orchestrator_service import OrchestratorSchedulerService
+            self.scheduler_service = OrchestratorSchedulerService(self.event_publisher, settings)
+            await self.scheduler_service.start()
+            app.state.scheduler_service = self.scheduler_service
+            self.logger.warning("Scheduler started in standalone mode (no leader election)")
+
     async def on_shutdown(self, app: FastAPI):
         """Custom shutdown logic for orchestrator service"""
         self.logger.info("Orchestrator Service shutting down...")
 
-        # Stop scheduler service
-        if hasattr(app.state, 'scheduler_service'):
-            await app.state.scheduler_service.stop()
+        # Stop leader election (this will also stop scheduler if we're the leader)
+        if self.leader_election:
+            await self.leader_election.stop()
+            self.logger.info("Leader election stopped")
+
+        # Stop scheduler service if still running
+        if self.scheduler_service:
+            await self.scheduler_service.stop()
             self.logger.info("Orchestrator scheduler service stopped")
|
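Since the orchestrator now stores the election object in `app.state.leader_election` "for health checks", a probe endpoint can expose which pod currently holds the lock. The following is an illustrative sketch only, not part of this commit; the route path and handler name are hypothetical, while `get_status()` is the method defined in `shared/leader_election/service.py` below.

```python
# Illustrative sketch: surface leader status from app.state for probes/debugging.
from fastapi import FastAPI, Request

app = FastAPI()

@app.get("/internal/leader-status")
async def leader_status(request: Request) -> dict:
    election = getattr(request.app.state, "leader_election", None)
    if election is None:
        # Standalone fallback mode: the scheduler runs unconditionally on this pod.
        return {"mode": "standalone", "is_leader": True}
    # get_status() is provided by shared.leader_election.LeaderElectionService.
    return election.get_status()
```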
@@ -1,12 +1,12 @@
 """
-Delivery Tracking Service - Simplified
+Delivery Tracking Service - With Leader Election

 Tracks purchase order deliveries and generates appropriate alerts using EventPublisher:
 - DELIVERY_ARRIVING_SOON: 2 hours before delivery window
 - DELIVERY_OVERDUE: 30 minutes after expected delivery time
 - STOCK_RECEIPT_INCOMPLETE: If delivery not marked as received

-Runs as internal scheduler with leader election.
+Runs as internal scheduler with leader election for horizontal scaling.
 Domain ownership: Procurement service owns all PO and delivery tracking.
 """

@@ -30,7 +30,7 @@ class DeliveryTrackingService:
     Monitors PO deliveries and generates time-based alerts using EventPublisher.

     Uses APScheduler with leader election to run hourly checks.
-    Only one pod executes checks (others skip if not leader).
+    Only one pod executes checks - leader election ensures no duplicate alerts.
     """

     def __init__(self, event_publisher: UnifiedEventPublisher, config, database_manager=None):
@@ -38,46 +38,121 @@ class DeliveryTrackingService:
         self.config = config
         self.database_manager = database_manager
         self.scheduler = AsyncIOScheduler()
-        self.is_leader = False
+        self._leader_election = None
+        self._redis_client = None
+        self._scheduler_started = False
         self.instance_id = str(uuid4())[:8]  # Short instance ID for logging

     async def start(self):
-        """Start the delivery tracking scheduler"""
-        # Initialize and start scheduler if not already running
+        """Start the delivery tracking scheduler with leader election"""
+        try:
+            # Initialize leader election
+            await self._setup_leader_election()
+        except Exception as e:
+            logger.error("Failed to setup leader election, starting in standalone mode",
+                         error=str(e))
+            # Fallback: start scheduler without leader election
+            await self._start_scheduler()
+
+    async def _setup_leader_election(self):
+        """Setup Redis-based leader election for horizontal scaling"""
+        from shared.leader_election import LeaderElectionService
+        import redis.asyncio as redis
+
+        # Build Redis URL from config
+        redis_url = getattr(self.config, 'REDIS_URL', None)
+        if not redis_url:
+            redis_password = getattr(self.config, 'REDIS_PASSWORD', '')
+            redis_host = getattr(self.config, 'REDIS_HOST', 'localhost')
+            redis_port = getattr(self.config, 'REDIS_PORT', 6379)
+            redis_db = getattr(self.config, 'REDIS_DB', 0)
+            redis_url = f"redis://:{redis_password}@{redis_host}:{redis_port}/{redis_db}"
+
+        self._redis_client = redis.from_url(redis_url, decode_responses=False)
+        await self._redis_client.ping()
+
+        # Create leader election service
+        self._leader_election = LeaderElectionService(
+            self._redis_client,
+            service_name="procurement-delivery-tracking"
+        )
+
+        # Start leader election with callbacks
+        await self._leader_election.start(
+            on_become_leader=self._on_become_leader,
+            on_lose_leader=self._on_lose_leader
+        )
+
+        logger.info("Leader election initialized for delivery tracking",
+                    is_leader=self._leader_election.is_leader,
+                    instance_id=self.instance_id)
+
+    async def _on_become_leader(self):
+        """Called when this instance becomes the leader"""
+        logger.info("Became leader for delivery tracking - starting scheduler",
+                    instance_id=self.instance_id)
+        await self._start_scheduler()
+
+    async def _on_lose_leader(self):
+        """Called when this instance loses leadership"""
+        logger.warning("Lost leadership for delivery tracking - stopping scheduler",
+                       instance_id=self.instance_id)
+        await self._stop_scheduler()
+
+    async def _start_scheduler(self):
+        """Start the APScheduler with delivery tracking jobs"""
+        if self._scheduler_started:
+            logger.debug("Scheduler already started", instance_id=self.instance_id)
+            return
+
         if not self.scheduler.running:
             # Add hourly job to check deliveries
             self.scheduler.add_job(
                 self._check_all_tenants,
-                trigger=CronTrigger(minute=30),  # Run every hour at :30 (00:30, 01:30, 02:30, etc.)
+                trigger=CronTrigger(minute=30),  # Run every hour at :30
                 id='hourly_delivery_check',
                 name='Hourly Delivery Tracking',
                 replace_existing=True,
-                max_instances=1,  # Ensure no overlapping runs
-                coalesce=True  # Combine missed runs
+                max_instances=1,
+                coalesce=True
             )

             self.scheduler.start()
+            self._scheduler_started = True

-            # Log next run time
             next_run = self.scheduler.get_job('hourly_delivery_check').next_run_time
-            logger.info(
-                "Delivery tracking scheduler started with hourly checks",
-                instance_id=self.instance_id,
-                next_run=next_run.isoformat() if next_run else None
-            )
-        else:
-            logger.info(
-                "Delivery tracking scheduler already running",
-                instance_id=self.instance_id
-            )
+            logger.info("Delivery tracking scheduler started",
+                        instance_id=self.instance_id,
+                        next_run=next_run.isoformat() if next_run else None)
+
+    async def _stop_scheduler(self):
+        """Stop the APScheduler"""
+        if not self._scheduler_started:
+            return
+
+        if self.scheduler.running:
+            self.scheduler.shutdown(wait=False)
+            self._scheduler_started = False
+            logger.info("Delivery tracking scheduler stopped", instance_id=self.instance_id)

     async def stop(self):
-        """Stop the scheduler and release leader lock"""
-        if self.scheduler.running:
-            self.scheduler.shutdown(wait=True)  # Graceful shutdown
-            logger.info("Delivery tracking scheduler stopped", instance_id=self.instance_id)
-        else:
-            logger.info("Delivery tracking scheduler already stopped", instance_id=self.instance_id)
+        """Stop the scheduler and leader election"""
+        # Stop leader election first
+        if self._leader_election:
+            await self._leader_election.stop()
+            logger.info("Leader election stopped", instance_id=self.instance_id)
+
+        # Stop scheduler
+        await self._stop_scheduler()
+
+        # Close Redis
+        if self._redis_client:
+            await self._redis_client.close()
+
+    @property
+    def is_leader(self) -> bool:
+        """Check if this instance is the leader"""
+        return self._leader_election.is_leader if self._leader_election else True

     async def _check_all_tenants(self):
         """
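A minimal wiring sketch for the service above, assuming the collaborators (`event_publisher`, `settings`, `db_manager`) that the procurement service already constructs; the function name is a placeholder, not part of the commit.

```python
# Hypothetical wiring sketch for DeliveryTrackingService with leader election.
async def wire_delivery_tracking(event_publisher, settings, db_manager):
    tracker = DeliveryTrackingService(event_publisher, settings, database_manager=db_manager)

    await tracker.start()      # joins the election; only the winning pod schedules the hourly check
    print(tracker.is_leader)   # True on exactly one pod (or True everywhere in standalone fallback)

    await tracker.stop()       # releases the Redis lock, stops APScheduler, closes Redis
```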
@@ -46,6 +46,9 @@ class TrainingService(StandardFastAPIService):
         await setup_messaging()
         self.logger.info("Messaging setup completed")

+        # Initialize Redis pub/sub for cross-pod WebSocket broadcasting
+        await self._setup_websocket_redis()
+
         # Set up WebSocket event consumer (listens to RabbitMQ and broadcasts to WebSockets)
         success = await setup_websocket_event_consumer()
         if success:
@@ -53,8 +56,44 @@ class TrainingService(StandardFastAPIService):
         else:
             self.logger.warning("WebSocket event consumer setup failed")

+    async def _setup_websocket_redis(self):
+        """
+        Initialize Redis pub/sub for WebSocket cross-pod broadcasting.
+
+        CRITICAL FOR HORIZONTAL SCALING:
+        Without this, WebSocket clients on Pod A won't receive events
+        from training jobs running on Pod B.
+        """
+        try:
+            from app.websocket.manager import websocket_manager
+            from app.core.config import settings
+
+            redis_url = settings.REDIS_URL
+            success = await websocket_manager.initialize_redis(redis_url)
+
+            if success:
+                self.logger.info("WebSocket Redis pub/sub initialized for horizontal scaling")
+            else:
+                self.logger.warning(
+                    "WebSocket Redis pub/sub failed to initialize. "
+                    "WebSocket events will only be delivered to local connections."
+                )
+
+        except Exception as e:
+            self.logger.error("Failed to setup WebSocket Redis pub/sub",
+                              error=str(e))
+            # Don't fail startup - WebSockets will work locally without Redis
+
     async def _cleanup_messaging(self):
         """Cleanup messaging for training service"""
+        # Shutdown WebSocket Redis pub/sub
+        try:
+            from app.websocket.manager import websocket_manager
+            await websocket_manager.shutdown()
+            self.logger.info("WebSocket Redis pub/sub shutdown completed")
+        except Exception as e:
+            self.logger.warning("Error shutting down WebSocket Redis", error=str(e))
+
         await cleanup_websocket_consumers()
         await cleanup_messaging()

@@ -83,8 +122,44 @@ class TrainingService(StandardFastAPIService):
         system_metrics = SystemMetricsCollector("training")
         self.logger.info("System metrics collection started")

+        # Recover stale jobs from previous pod crashes
+        # This is important for horizontal scaling - jobs may be left in 'running'
+        # state if a pod crashes. We mark them as failed so they can be retried.
+        await self._recover_stale_jobs()
+
         self.logger.info("Training service startup completed")

+    async def _recover_stale_jobs(self):
+        """
+        Recover stale training jobs on startup.
+
+        When a pod crashes mid-training, jobs are left in 'running' or 'pending' state.
+        This method finds jobs that haven't been updated in a while and marks them
+        as failed so users can retry them.
+        """
+        try:
+            from app.repositories.training_log_repository import TrainingLogRepository
+
+            async with self.database_manager.get_session() as session:
+                log_repo = TrainingLogRepository(session)
+
+                # Recover jobs that haven't been updated in 60 minutes
+                # This is conservative - most training jobs complete within 30 minutes
+                recovered = await log_repo.recover_stale_jobs(stale_threshold_minutes=60)
+
+                if recovered:
+                    self.logger.warning(
+                        "Recovered stale training jobs on startup",
+                        recovered_count=len(recovered),
+                        job_ids=[j.job_id for j in recovered]
+                    )
+                else:
+                    self.logger.info("No stale training jobs to recover")
+
+        except Exception as e:
+            # Don't fail startup if recovery fails - just log the error
+            self.logger.error("Failed to recover stale jobs on startup", error=str(e))
+
     async def on_shutdown(self, app: FastAPI):
         """Custom shutdown logic for training service"""
         await cleanup_training_database()
|
|||||||
@@ -343,3 +343,165 @@ class TrainingLogRepository(TrainingBaseRepository):
|
|||||||
job_id=job_id,
|
job_id=job_id,
|
||||||
error=str(e))
|
error=str(e))
|
||||||
return None
|
return None
|
||||||
|
|
||||||
|
async def create_job_atomic(
|
||||||
|
self,
|
||||||
|
job_id: str,
|
||||||
|
tenant_id: str,
|
||||||
|
config: Dict[str, Any] = None
|
||||||
|
) -> tuple[Optional[ModelTrainingLog], bool]:
|
||||||
|
"""
|
||||||
|
Atomically create a training job, respecting the unique constraint.
|
||||||
|
|
||||||
|
This method uses INSERT ... ON CONFLICT to handle race conditions
|
||||||
|
when multiple pods try to create a job for the same tenant simultaneously.
|
||||||
|
The database constraint (idx_unique_active_training_per_tenant) ensures
|
||||||
|
only one active job per tenant can exist.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
job_id: Unique job identifier
|
||||||
|
tenant_id: Tenant identifier
|
||||||
|
config: Optional job configuration
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Tuple of (job, created):
|
||||||
|
- If created: (new_job, True)
|
||||||
|
- If conflict (existing active job): (existing_job, False)
|
||||||
|
- If error: raises DatabaseError
|
||||||
|
"""
|
||||||
|
try:
|
||||||
|
# First, try to find an existing active job
|
||||||
|
existing = await self.get_active_jobs(tenant_id=tenant_id)
|
||||||
|
pending = await self.get_logs_by_tenant(tenant_id=tenant_id, status="pending", limit=1)
|
||||||
|
|
||||||
|
if existing or pending:
|
||||||
|
# Return existing job
|
||||||
|
active_job = existing[0] if existing else pending[0]
|
||||||
|
logger.info("Found existing active job, skipping creation",
|
||||||
|
existing_job_id=active_job.job_id,
|
||||||
|
tenant_id=tenant_id,
|
||||||
|
requested_job_id=job_id)
|
||||||
|
return (active_job, False)
|
||||||
|
|
||||||
|
# Try to create the new job
|
||||||
|
# If another pod created one in the meantime, the unique constraint will prevent this
|
||||||
|
log_data = {
|
||||||
|
"job_id": job_id,
|
||||||
|
"tenant_id": tenant_id,
|
||||||
|
"status": "pending",
|
||||||
|
"progress": 0,
|
||||||
|
"current_step": "initializing",
|
||||||
|
"config": config or {}
|
||||||
|
}
|
||||||
|
|
||||||
|
try:
|
||||||
|
new_job = await self.create_training_log(log_data)
|
||||||
|
await self.session.commit()
|
||||||
|
logger.info("Created new training job atomically",
|
||||||
|
job_id=job_id,
|
||||||
|
tenant_id=tenant_id)
|
||||||
|
return (new_job, True)
|
||||||
|
except Exception as create_error:
|
||||||
|
error_str = str(create_error).lower()
|
||||||
|
# Check if this is a unique constraint violation
|
||||||
|
if "unique" in error_str or "duplicate" in error_str or "constraint" in error_str:
|
||||||
|
await self.session.rollback()
|
||||||
|
# Another pod created a job, fetch it
|
||||||
|
logger.info("Unique constraint hit, fetching existing job",
|
||||||
|
tenant_id=tenant_id,
|
||||||
|
requested_job_id=job_id)
|
||||||
|
existing = await self.get_active_jobs(tenant_id=tenant_id)
|
||||||
|
pending = await self.get_logs_by_tenant(tenant_id=tenant_id, status="pending", limit=1)
|
||||||
|
if existing or pending:
|
||||||
|
active_job = existing[0] if existing else pending[0]
|
||||||
|
return (active_job, False)
|
||||||
|
# If still no job found, something went wrong
|
||||||
|
raise DatabaseError(f"Constraint violation but no active job found: {create_error}")
|
||||||
|
else:
|
||||||
|
raise
|
||||||
|
|
||||||
|
except DatabaseError:
|
||||||
|
raise
|
||||||
|
except Exception as e:
|
||||||
|
logger.error("Failed to create job atomically",
|
||||||
|
job_id=job_id,
|
||||||
|
tenant_id=tenant_id,
|
||||||
|
error=str(e))
|
||||||
|
raise DatabaseError(f"Failed to create training job atomically: {str(e)}")
|
||||||
|
|
||||||
|
async def recover_stale_jobs(self, stale_threshold_minutes: int = 60) -> List[ModelTrainingLog]:
|
||||||
|
"""
|
||||||
|
Find and mark stale running jobs as failed.
|
||||||
|
|
||||||
|
This is used during service startup to clean up jobs that were
|
||||||
|
running when a pod crashed. With multiple replicas, only stale
|
||||||
|
jobs (not updated recently) should be marked as failed.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
stale_threshold_minutes: Jobs not updated for this long are considered stale
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
List of jobs that were marked as failed
|
||||||
|
"""
|
||||||
|
try:
|
||||||
|
stale_cutoff = datetime.now() - timedelta(minutes=stale_threshold_minutes)
|
||||||
|
|
||||||
|
# Find running jobs that haven't been updated recently
|
||||||
|
query = text("""
|
||||||
|
SELECT id, job_id, tenant_id, status, updated_at
|
||||||
|
FROM model_training_logs
|
||||||
|
WHERE status IN ('running', 'pending')
|
||||||
|
AND updated_at < :stale_cutoff
|
||||||
|
""")
|
||||||
|
|
||||||
|
result = await self.session.execute(query, {"stale_cutoff": stale_cutoff})
|
||||||
|
stale_jobs = result.fetchall()
|
||||||
|
|
||||||
|
recovered_jobs = []
|
||||||
|
for row in stale_jobs:
|
||||||
|
try:
|
||||||
|
# Mark as failed
|
||||||
|
update_query = text("""
|
||||||
|
UPDATE model_training_logs
|
||||||
|
SET status = 'failed',
|
||||||
|
error_message = :error_msg,
|
||||||
|
end_time = :end_time,
|
||||||
|
updated_at = :updated_at
|
||||||
|
WHERE id = :id AND status IN ('running', 'pending')
|
||||||
|
""")
|
||||||
|
|
||||||
|
await self.session.execute(update_query, {
|
||||||
|
"id": row.id,
|
||||||
|
"error_msg": f"Job recovered as failed - not updated since {row.updated_at.isoformat()}. Pod may have crashed.",
|
||||||
|
"end_time": datetime.now(),
|
||||||
|
"updated_at": datetime.now()
|
||||||
|
})
|
||||||
|
|
||||||
|
logger.warning("Recovered stale training job",
|
||||||
|
job_id=row.job_id,
|
||||||
|
tenant_id=str(row.tenant_id),
|
||||||
|
last_updated=row.updated_at.isoformat() if row.updated_at else "unknown")
|
||||||
|
|
||||||
|
# Fetch the updated job to return
|
||||||
|
job = await self.get_by_job_id(row.job_id)
|
||||||
|
if job:
|
||||||
|
recovered_jobs.append(job)
|
||||||
|
|
||||||
|
except Exception as job_error:
|
||||||
|
logger.error("Failed to recover individual stale job",
|
||||||
|
job_id=row.job_id,
|
||||||
|
error=str(job_error))
|
||||||
|
|
||||||
|
if recovered_jobs:
|
||||||
|
await self.session.commit()
|
||||||
|
logger.info("Stale job recovery completed",
|
||||||
|
recovered_count=len(recovered_jobs),
|
||||||
|
stale_threshold_minutes=stale_threshold_minutes)
|
||||||
|
|
||||||
|
return recovered_jobs
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error("Failed to recover stale jobs",
|
||||||
|
error=str(e))
|
||||||
|
await self.session.rollback()
|
||||||
|
return []
|
||||||
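A hedged caller sketch for `create_job_atomic`: the `database_manager` and function name are placeholders, while the `(job, created)` return shape is the one defined above. The reason the second pod loses the race is the partial unique index added by the Alembic migration later in this diff.

```python
# Hypothetical caller sketch: submit a training job safely from any pod.
import uuid

async def submit_training_job(database_manager, tenant_id: str):
    async with database_manager.get_session() as session:
        repo = TrainingLogRepository(session)
        job, created = await repo.create_job_atomic(
            job_id=str(uuid.uuid4()),
            tenant_id=tenant_id,
        )
        if not created:
            # Another pod already holds the active job for this tenant; reuse it.
            return job
        return job
```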
@@ -1,10 +1,16 @@
 """
 Distributed Locking Mechanisms
 Prevents concurrent training jobs for the same product
+
+HORIZONTAL SCALING FIX:
+- Uses SHA256 for stable hash across all Python processes/pods
+- Python's built-in hash() varies between processes due to hash randomization (Python 3.3+)
+- This ensures all pods compute the same lock ID for the same lock name
 """

 import asyncio
 import time
+import hashlib
 from typing import Optional
 import logging
 from contextlib import asynccontextmanager
@@ -39,9 +45,20 @@ class DatabaseLock:
         self.lock_id = self._hash_lock_name(lock_name)

     def _hash_lock_name(self, name: str) -> int:
-        """Convert lock name to integer ID for PostgreSQL advisory lock"""
-        # Use hash and modulo to get a positive 32-bit integer
-        return abs(hash(name)) % (2**31)
+        """
+        Convert lock name to integer ID for PostgreSQL advisory lock.
+
+        CRITICAL: Uses SHA256 for stable hash across all Python processes/pods.
+        Python's built-in hash() varies between processes due to hash randomization
+        (PYTHONHASHSEED, enabled by default since Python 3.3), which would cause
+        different pods to compute different lock IDs for the same lock name,
+        defeating the purpose of distributed locking.
+        """
+        # Use SHA256 for stable, cross-process hash
+        hash_bytes = hashlib.sha256(name.encode('utf-8')).digest()
+        # Take first 4 bytes and convert to positive 31-bit integer
+        # (PostgreSQL advisory locks use bigint, but we use 31-bit for safety)
+        return int.from_bytes(hash_bytes[:4], 'big') % (2**31)

     @asynccontextmanager
     async def acquire(self, session: AsyncSession):
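A minimal sketch of the stable-hash property the fix relies on; `stable_lock_id` simply reimplements `DatabaseLock._hash_lock_name` outside the class for illustration, and the lock name used is arbitrary.

```python
import hashlib

def stable_lock_id(name: str) -> int:
    # First 4 bytes of SHA256, masked to a positive 31-bit integer.
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % (2**31)

# Identical on every pod and every run of the interpreter:
assert stable_lock_id("training:product-42") == stable_lock_id("training:product-42")
# By contrast, abs(hash(name)) % (2**31) changes between processes whenever hash
# randomization (PYTHONHASHSEED) is active, so two pods could end up taking
# different PostgreSQL advisory locks for the same product.
```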
@@ -1,21 +1,39 @@
 """
 WebSocket Connection Manager for Training Service
 Manages WebSocket connections and broadcasts RabbitMQ events to connected clients
+
+HORIZONTAL SCALING:
+- Uses Redis pub/sub for cross-pod WebSocket broadcasting
+- Each pod subscribes to a Redis channel and broadcasts to its local connections
+- Events published to Redis are received by all pods, ensuring clients on any
+  pod receive events from training jobs running on any other pod
 """

 import asyncio
 import json
-from typing import Dict, Set
+import os
+from typing import Dict, Optional
 from fastapi import WebSocket
 import structlog

 logger = structlog.get_logger()

+# Redis pub/sub channel for WebSocket events
+REDIS_WEBSOCKET_CHANNEL = "training:websocket:events"
+

 class WebSocketConnectionManager:
     """
-    Simple WebSocket connection manager.
-    Manages connections per job_id and broadcasts messages to all connected clients.
+    WebSocket connection manager with Redis pub/sub for horizontal scaling.
+
+    In a multi-pod deployment:
+    1. Events are published to Redis pub/sub (not just local broadcast)
+    2. Each pod subscribes to Redis and broadcasts to its local WebSocket connections
+    3. This ensures clients connected to any pod receive events from any pod
+
+    Flow:
+    - RabbitMQ event → Pod A receives → Pod A publishes to Redis
+    - Redis pub/sub → All pods receive → Each pod broadcasts to local WebSockets
     """

     def __init__(self):
@@ -24,6 +42,121 @@ class WebSocketConnectionManager:
         self._lock = asyncio.Lock()
         # Store latest event for each job to provide initial state
         self._latest_events: Dict[str, dict] = {}
+        # Redis client for pub/sub
+        self._redis: Optional[object] = None
+        self._pubsub: Optional[object] = None
+        self._subscriber_task: Optional[asyncio.Task] = None
+        self._running = False
+        self._instance_id = f"{os.environ.get('HOSTNAME', 'unknown')}:{os.getpid()}"
+
+    async def initialize_redis(self, redis_url: str) -> bool:
+        """
+        Initialize Redis connection for cross-pod pub/sub.
+
+        Args:
+            redis_url: Redis connection URL
+
+        Returns:
+            True if successful, False otherwise
+        """
+        try:
+            import redis.asyncio as redis_async
+
+            self._redis = redis_async.from_url(redis_url, decode_responses=True)
+            await self._redis.ping()
+
+            # Create pub/sub subscriber
+            self._pubsub = self._redis.pubsub()
+            await self._pubsub.subscribe(REDIS_WEBSOCKET_CHANNEL)
+
+            # Start subscriber task
+            self._running = True
+            self._subscriber_task = asyncio.create_task(self._redis_subscriber_loop())
+
+            logger.info("Redis pub/sub initialized for WebSocket broadcasting",
+                        instance_id=self._instance_id,
+                        channel=REDIS_WEBSOCKET_CHANNEL)
+            return True
+
+        except Exception as e:
+            logger.error("Failed to initialize Redis pub/sub",
+                         error=str(e),
+                         instance_id=self._instance_id)
+            return False
+
+    async def shutdown(self):
+        """Shutdown Redis pub/sub connection"""
+        self._running = False
+
+        if self._subscriber_task:
+            self._subscriber_task.cancel()
+            try:
+                await self._subscriber_task
+            except asyncio.CancelledError:
+                pass
+
+        if self._pubsub:
+            await self._pubsub.unsubscribe(REDIS_WEBSOCKET_CHANNEL)
+            await self._pubsub.close()
+
+        if self._redis:
+            await self._redis.close()
+
+        logger.info("Redis pub/sub shutdown complete",
+                    instance_id=self._instance_id)
+
+    async def _redis_subscriber_loop(self):
+        """Background task to receive Redis pub/sub messages and broadcast locally"""
+        try:
+            while self._running:
+                try:
+                    message = await self._pubsub.get_message(
+                        ignore_subscribe_messages=True,
+                        timeout=1.0
+                    )
+
+                    if message and message['type'] == 'message':
+                        await self._handle_redis_message(message['data'])
+
+                except asyncio.CancelledError:
+                    break
+                except Exception as e:
+                    logger.error("Error in Redis subscriber loop",
+                                 error=str(e),
+                                 instance_id=self._instance_id)
+                    await asyncio.sleep(1)  # Backoff on error
+
+        except asyncio.CancelledError:
+            pass
+
+        logger.info("Redis subscriber loop stopped",
+                    instance_id=self._instance_id)
+
+    async def _handle_redis_message(self, data: str):
+        """Handle a message received from Redis pub/sub"""
+        try:
+            payload = json.loads(data)
+            job_id = payload.get('job_id')
+            message = payload.get('message')
+            source_instance = payload.get('source_instance')
+
+            if not job_id or not message:
+                return
+
+            # Log cross-pod message
+            if source_instance != self._instance_id:
+                logger.debug("Received cross-pod WebSocket event",
+                             job_id=job_id,
+                             source_instance=source_instance,
+                             local_instance=self._instance_id)
+
+            # Broadcast to local WebSocket connections
+            await self._broadcast_local(job_id, message)
+
+        except json.JSONDecodeError as e:
+            logger.warning("Invalid JSON in Redis message", error=str(e))
+        except Exception as e:
+            logger.error("Error handling Redis message", error=str(e))
+
     async def connect(self, job_id: str, websocket: WebSocket) -> None:
         """Register a new WebSocket connection for a job"""
@@ -50,7 +183,8 @@ class WebSocketConnectionManager:
         logger.info("WebSocket connected",
                     job_id=job_id,
                     websocket_id=ws_id,
-                    total_connections=len(self._connections[job_id]))
+                    total_connections=len(self._connections[job_id]),
+                    instance_id=self._instance_id)

     async def disconnect(self, job_id: str, websocket: WebSocket) -> None:
         """Remove a WebSocket connection"""
@@ -66,19 +200,56 @@ class WebSocketConnectionManager:
         logger.info("WebSocket disconnected",
                     job_id=job_id,
                     websocket_id=ws_id,
-                    remaining_connections=len(self._connections.get(job_id, {})))
+                    remaining_connections=len(self._connections.get(job_id, {})),
+                    instance_id=self._instance_id)

     async def broadcast(self, job_id: str, message: dict) -> int:
         """
-        Broadcast a message to all connections for a specific job.
-        Returns the number of successful broadcasts.
+        Broadcast a message to all connections for a specific job across ALL pods.
+
+        If Redis is configured, publishes to Redis pub/sub which then broadcasts
+        to all pods. Otherwise, falls back to local-only broadcast.
+
+        Returns the number of successful local broadcasts.
         """
         # Store the latest event for this job to provide initial state to new connections
-        if message.get('type') != 'initial_state':  # Don't store initial_state messages
+        if message.get('type') != 'initial_state':
             self._latest_events[job_id] = message

+        # If Redis is available, publish to Redis for cross-pod broadcast
+        if self._redis:
+            try:
+                payload = json.dumps({
+                    'job_id': job_id,
+                    'message': message,
+                    'source_instance': self._instance_id
+                })
+                await self._redis.publish(REDIS_WEBSOCKET_CHANNEL, payload)
+                logger.debug("Published WebSocket event to Redis",
+                             job_id=job_id,
+                             message_type=message.get('type'),
+                             instance_id=self._instance_id)
+                # Return 0 here because the actual broadcast happens via subscriber
+                # The count will be from _broadcast_local when the message is received
+                return 0
+            except Exception as e:
+                logger.warning("Failed to publish to Redis, falling back to local broadcast",
+                               error=str(e),
+                               job_id=job_id)
+                # Fall through to local broadcast
+
+        # Local-only broadcast (when Redis is not available)
+        return await self._broadcast_local(job_id, message)
+
+    async def _broadcast_local(self, job_id: str, message: dict) -> int:
+        """
+        Broadcast a message to local WebSocket connections only.
+        This is called either directly (no Redis) or from Redis subscriber.
+        """
         if job_id not in self._connections:
-            logger.debug("No active connections for job", job_id=job_id)
+            logger.debug("No active local connections for job",
+                         job_id=job_id,
+                         instance_id=self._instance_id)
             return 0

         connections = list(self._connections[job_id].values())
@@ -103,18 +274,27 @@ class WebSocketConnectionManager:
                 self._connections[job_id].pop(ws_id, None)

         if successful_sends > 0:
-            logger.info("Broadcasted message to WebSocket clients",
+            logger.info("Broadcasted message to local WebSocket clients",
                         job_id=job_id,
                         message_type=message.get('type'),
                         successful_sends=successful_sends,
-                        failed_sends=len(failed_websockets))
+                        failed_sends=len(failed_websockets),
+                        instance_id=self._instance_id)

         return successful_sends

     def get_connection_count(self, job_id: str) -> int:
-        """Get the number of active connections for a job"""
+        """Get the number of active local connections for a job"""
         return len(self._connections.get(job_id, {}))

+    def get_total_connection_count(self) -> int:
+        """Get total number of active connections across all jobs"""
+        return sum(len(conns) for conns in self._connections.values())
+
+    def is_redis_enabled(self) -> bool:
+        """Check if Redis pub/sub is enabled"""
+        return self._redis is not None and self._running
+

 # Global singleton instance
 websocket_manager = WebSocketConnectionManager()
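A hypothetical usage sketch of the singleton above, assuming a placeholder `redis_url` value; only the methods defined in this file are used.

```python
# Hypothetical usage sketch of the WebSocket manager with cross-pod broadcasting.
from app.websocket.manager import websocket_manager

async def demo(redis_url: str = "redis://localhost:6379/0"):
    enabled = await websocket_manager.initialize_redis(redis_url)
    # With Redis enabled, broadcast() publishes to REDIS_WEBSOCKET_CHANNEL and the
    # subscriber loop on every pod relays the event to its local connections.
    # Without Redis it falls back to _broadcast_local() on this pod only.
    await websocket_manager.broadcast("job-123", {"type": "progress", "progress": 42})
    print("cross-pod broadcasting:", enabled, websocket_manager.is_redis_enabled())
    await websocket_manager.shutdown()
```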
@@ -0,0 +1,60 @@
"""Add horizontal scaling constraints for multi-pod deployment

Revision ID: add_horizontal_scaling
Revises: 26a665cd5348
Create Date: 2025-01-18

This migration adds database-level constraints to prevent race conditions
when running multiple training service pods:

1. Partial unique index on model_training_logs to prevent duplicate active jobs per tenant
2. Index to speed up active job lookups
"""
from typing import Sequence, Union

from alembic import op
import sqlalchemy as sa


# revision identifiers, used by Alembic.
revision: str = 'add_horizontal_scaling'
down_revision: Union[str, None] = '26a665cd5348'
branch_labels: Union[str, Sequence[str], None] = None
depends_on: Union[str, Sequence[str], None] = None


def upgrade() -> None:
    # Add partial unique index to prevent duplicate active training jobs per tenant
    # This ensures only ONE job can be in 'pending' or 'running' status per tenant at a time
    # The constraint is enforced at the database level, preventing race conditions
    # between multiple pods checking and creating jobs simultaneously
    op.execute("""
        CREATE UNIQUE INDEX IF NOT EXISTS idx_unique_active_training_per_tenant
        ON model_training_logs (tenant_id)
        WHERE status IN ('pending', 'running')
    """)

    # Add index to speed up active job lookups (used by deduplication check)
    op.create_index(
        'idx_training_logs_tenant_status',
        'model_training_logs',
        ['tenant_id', 'status'],
        unique=False,
        if_not_exists=True
    )

    # Add index for job recovery queries (find stale running jobs)
    op.create_index(
        'idx_training_logs_status_updated',
        'model_training_logs',
        ['status', 'updated_at'],
        unique=False,
        if_not_exists=True
    )


def downgrade() -> None:
    # Remove the indexes in reverse order
    op.execute("DROP INDEX IF EXISTS idx_training_logs_status_updated")
    op.execute("DROP INDEX IF EXISTS idx_training_logs_tenant_status")
    op.execute("DROP INDEX IF EXISTS idx_unique_active_training_per_tenant")
33 shared/leader_election/__init__.py Normal file
@@ -0,0 +1,33 @@
"""
Shared Leader Election for Bakery-IA platform

Provides Redis-based leader election for services that need to run
singleton scheduled tasks (APScheduler, background jobs, etc.)

Usage:
    from shared.leader_election import LeaderElectionService, SchedulerLeaderMixin

    # Option 1: Direct usage
    leader_election = LeaderElectionService(redis_client, "my-service")
    await leader_election.start(
        on_become_leader=start_scheduler,
        on_lose_leader=stop_scheduler
    )

    # Option 2: Mixin for services with APScheduler
    class MySchedulerService(SchedulerLeaderMixin):
        async def _create_scheduler_jobs(self):
            self.scheduler.add_job(...)
"""

from shared.leader_election.service import (
    LeaderElectionService,
    LeaderElectionConfig,
)
from shared.leader_election.mixin import SchedulerLeaderMixin

__all__ = [
    "LeaderElectionService",
    "LeaderElectionConfig",
    "SchedulerLeaderMixin",
]
209 shared/leader_election/mixin.py Normal file
@@ -0,0 +1,209 @@
"""
Scheduler Leader Mixin

Provides a mixin class for services that use APScheduler and need
leader election for horizontal scaling.

Usage:
    class MySchedulerService(SchedulerLeaderMixin):
        def __init__(self, redis_url: str, service_name: str):
            super().__init__(redis_url, service_name)
            # Your initialization here

        async def _create_scheduler_jobs(self):
            '''Override to define your scheduled jobs'''
            self.scheduler.add_job(
                self.my_job,
                trigger=CronTrigger(hour=0),
                id='my_job'
            )

        async def my_job(self):
            # Your job logic here
            pass
"""

import asyncio
from typing import Optional
from abc import abstractmethod
import structlog

logger = structlog.get_logger()


class SchedulerLeaderMixin:
    """
    Mixin for services that use APScheduler with leader election.

    Provides automatic leader election and scheduler management.
    Only the leader pod will run scheduled jobs.
    """

    def __init__(self, redis_url: str, service_name: str, **kwargs):
        """
        Initialize the scheduler with leader election.

        Args:
            redis_url: Redis connection URL for leader election
            service_name: Unique service name for leader election lock
            **kwargs: Additional arguments passed to parent class
        """
        super().__init__(**kwargs)

        self._redis_url = redis_url
        self._service_name = service_name
        self._leader_election = None
        self._redis_client = None
        self.scheduler = None
        self._scheduler_started = False

    async def start_with_leader_election(self):
        """
        Start the service with leader election.

        Only the leader will start the scheduler.
        """
        from apscheduler.schedulers.asyncio import AsyncIOScheduler
        from shared.leader_election.service import LeaderElectionService
        import redis.asyncio as redis

        try:
            # Create Redis connection
            self._redis_client = redis.from_url(self._redis_url, decode_responses=False)
            await self._redis_client.ping()

            # Create scheduler (but don't start it yet)
            self.scheduler = AsyncIOScheduler()

            # Create leader election
            self._leader_election = LeaderElectionService(
                self._redis_client,
                self._service_name
            )

            # Start leader election with callbacks
            await self._leader_election.start(
                on_become_leader=self._on_become_leader,
                on_lose_leader=self._on_lose_leader
            )

            logger.info("Scheduler service started with leader election",
                        service=self._service_name,
                        is_leader=self._leader_election.is_leader,
                        instance_id=self._leader_election.instance_id)

        except Exception as e:
            logger.error("Failed to start with leader election, falling back to standalone",
                         service=self._service_name,
                         error=str(e))
            # Fallback: start scheduler anyway (for single-pod deployments)
            await self._start_scheduler_standalone()

    async def _on_become_leader(self):
        """Called when this instance becomes the leader"""
        logger.info("Became leader, starting scheduler",
                    service=self._service_name)
        await self._start_scheduler()

    async def _on_lose_leader(self):
        """Called when this instance loses leadership"""
        logger.warning("Lost leadership, stopping scheduler",
                       service=self._service_name)
        await self._stop_scheduler()

    async def _start_scheduler(self):
        """Start the scheduler with defined jobs"""
        if self._scheduler_started:
            logger.warning("Scheduler already started",
                           service=self._service_name)
            return

        try:
            # Let subclass define jobs
            await self._create_scheduler_jobs()

            # Start scheduler
            if not self.scheduler.running:
                self.scheduler.start()
                self._scheduler_started = True
                logger.info("Scheduler started",
                            service=self._service_name,
                            job_count=len(self.scheduler.get_jobs()))

        except Exception as e:
            logger.error("Failed to start scheduler",
                         service=self._service_name,
                         error=str(e))

    async def _stop_scheduler(self):
        """Stop the scheduler"""
        if not self._scheduler_started:
            return

        try:
            if self.scheduler and self.scheduler.running:
                self.scheduler.shutdown(wait=False)
                self._scheduler_started = False
                logger.info("Scheduler stopped",
                            service=self._service_name)

        except Exception as e:
            logger.error("Failed to stop scheduler",
                         service=self._service_name,
                         error=str(e))

    async def _start_scheduler_standalone(self):
        """Start scheduler without leader election (fallback mode)"""
        from apscheduler.schedulers.asyncio import AsyncIOScheduler

        logger.warning("Starting scheduler in standalone mode (no leader election)",
                       service=self._service_name)

        self.scheduler = AsyncIOScheduler()
        await self._create_scheduler_jobs()

        if not self.scheduler.running:
            self.scheduler.start()
            self._scheduler_started = True

    @abstractmethod
    async def _create_scheduler_jobs(self):
        """
        Override to define scheduled jobs.

        Example:
            self.scheduler.add_job(
                self.my_task,
                trigger=CronTrigger(hour=0, minute=30),
                id='my_task',
                max_instances=1
            )
        """
        pass

    async def stop(self):
        """Stop the scheduler and leader election"""
        # Stop leader election
        if self._leader_election:
            await self._leader_election.stop()

        # Stop scheduler
        await self._stop_scheduler()

        # Close Redis
        if self._redis_client:
            await self._redis_client.close()

        logger.info("Scheduler service stopped",
                    service=self._service_name)

    @property
    def is_leader(self) -> bool:
        """Check if this instance is the leader"""
        return self._leader_election.is_leader if self._leader_election else False

    def get_leader_status(self) -> dict:
        """Get leader election status"""
        if self._leader_election:
            return self._leader_election.get_status()
        return {"is_leader": True, "mode": "standalone"}
352 shared/leader_election/service.py Normal file
@@ -0,0 +1,352 @@
"""
Leader Election Service

Implements Redis-based leader election to ensure only ONE pod runs
singleton tasks like APScheduler jobs.

This is CRITICAL for horizontal scaling - without leader election,
each pod would run the same scheduled jobs, causing:
- Duplicate operations (forecasts, alerts, syncs)
- Database contention
- Inconsistent state
- Duplicate notifications

Implementation:
- Uses Redis SET NX (set if not exists) for atomic leadership acquisition
- Leader maintains leadership with periodic heartbeats
- If leader fails to heartbeat, another pod can take over
- Non-leader pods check periodically if they should become leader
"""

import asyncio
import os
import socket
from dataclasses import dataclass
from typing import Optional, Callable, Awaitable
import structlog

logger = structlog.get_logger()


@dataclass
class LeaderElectionConfig:
    """Configuration for leader election"""
    # Redis key prefix for the lock
    lock_key_prefix: str = "leader"
    # Lock expires after this many seconds without refresh
    lock_ttl_seconds: int = 30
    # Refresh lock every N seconds (should be < lock_ttl_seconds / 2)
    heartbeat_interval_seconds: int = 10
    # Non-leaders check for leadership every N seconds
    election_check_interval_seconds: int = 15


class LeaderElectionService:
    """
    Redis-based leader election service.

    Ensures only one pod runs scheduled tasks at a time across all replicas.
    """

    def __init__(
        self,
        redis_client,
        service_name: str,
        config: Optional[LeaderElectionConfig] = None
    ):
        """
        Initialize leader election service.

        Args:
            redis_client: Async Redis client instance
            service_name: Unique name for this service (used in Redis key)
            config: Optional configuration override
        """
        self.redis = redis_client
        self.service_name = service_name
        self.config = config or LeaderElectionConfig()
        self.lock_key = f"{self.config.lock_key_prefix}:{service_name}:lock"
        self.instance_id = self._generate_instance_id()
        self.is_leader = False
        self._heartbeat_task: Optional[asyncio.Task] = None
        self._election_task: Optional[asyncio.Task] = None
        self._running = False
        self._on_become_leader_callback: Optional[Callable[[], Awaitable[None]]] = None
        self._on_lose_leader_callback: Optional[Callable[[], Awaitable[None]]] = None

    def _generate_instance_id(self) -> str:
        """Generate unique instance identifier for this pod"""
        hostname = os.environ.get('HOSTNAME', socket.gethostname())
        pod_ip = os.environ.get('POD_IP', 'unknown')
        return f"{hostname}:{pod_ip}:{os.getpid()}"

    async def start(
        self,
        on_become_leader: Optional[Callable[[], Awaitable[None]]] = None,
        on_lose_leader: Optional[Callable[[], Awaitable[None]]] = None
    ):
        """
        Start leader election process.

        Args:
            on_become_leader: Async callback when this instance becomes leader
            on_lose_leader: Async callback when this instance loses leadership
        """
        self._on_become_leader_callback = on_become_leader
        self._on_lose_leader_callback = on_lose_leader
        self._running = True

        logger.info("Starting leader election",
                    service=self.service_name,
                    instance_id=self.instance_id,
                    lock_key=self.lock_key)

        # Try to become leader immediately
        await self._try_become_leader()

        # Start background tasks
        if self.is_leader:
            self._heartbeat_task = asyncio.create_task(self._heartbeat_loop())
        else:
            self._election_task = asyncio.create_task(self._election_loop())

    async def stop(self):
        """Stop leader election and release leadership if held"""
        self._running = False

        # Cancel background tasks
        if self._heartbeat_task:
            self._heartbeat_task.cancel()
            try:
                await self._heartbeat_task
            except asyncio.CancelledError:
                pass
            self._heartbeat_task = None

        if self._election_task:
            self._election_task.cancel()
            try:
                await self._election_task
            except asyncio.CancelledError:
                pass
            self._election_task = None

        # Release leadership
        if self.is_leader:
            await self._release_leadership()

        logger.info("Leader election stopped",
                    service=self.service_name,
                    instance_id=self.instance_id,
                    was_leader=self.is_leader)

    async def _try_become_leader(self) -> bool:
        """
        Attempt to become the leader.

        Returns:
            True if this instance is now the leader
        """
        try:
            # Try to set the lock with NX (only if not exists) and EX (expiry)
            acquired = await self.redis.set(
                self.lock_key,
                self.instance_id,
                nx=True,  # Only set if not exists
                ex=self.config.lock_ttl_seconds
            )

            if acquired:
                self.is_leader = True
                logger.info("Became leader",
                            service=self.service_name,
                            instance_id=self.instance_id)

                # Call callback
                if self._on_become_leader_callback:
                    try:
                        await self._on_become_leader_callback()
                    except Exception as e:
                        logger.error("Error in on_become_leader callback",
                                     service=self.service_name,
                                     error=str(e))

                return True

            # Check if we're already the leader (reconnection scenario)
            current_leader = await self.redis.get(self.lock_key)
            if current_leader:
                current_leader_str = current_leader.decode() if isinstance(current_leader, bytes) else current_leader
                if current_leader_str == self.instance_id:
                    self.is_leader = True
                    logger.info("Confirmed as existing leader",
                                service=self.service_name,
                                instance_id=self.instance_id)
                    return True
                else:
                    logger.debug("Another instance is leader",
                                 service=self.service_name,
                                 current_leader=current_leader_str,
                                 this_instance=self.instance_id)

            return False

        except Exception as e:
            logger.error("Failed to acquire leadership",
                         service=self.service_name,
                         instance_id=self.instance_id,
                         error=str(e))
            return False

    async def _release_leadership(self):
        """Release leadership lock"""
        try:
            # Only delete if we're the current leader
            current_leader = await self.redis.get(self.lock_key)
            if current_leader:
                current_leader_str = current_leader.decode() if isinstance(current_leader, bytes) else current_leader
                if current_leader_str == self.instance_id:
                    await self.redis.delete(self.lock_key)
                    logger.info("Released leadership",
                                service=self.service_name,
                                instance_id=self.instance_id)

            was_leader = self.is_leader
            self.is_leader = False

            # Call callback only if we were the leader
            if was_leader and self._on_lose_leader_callback:
                try:
                    await self._on_lose_leader_callback()
                except Exception as e:
                    logger.error("Error in on_lose_leader callback",
                                 service=self.service_name,
                                 error=str(e))

        except Exception as e:
            logger.error("Failed to release leadership",
                         service=self.service_name,
                         instance_id=self.instance_id,
                         error=str(e))

    async def _refresh_leadership(self) -> bool:
        """
        Refresh leadership lock TTL.

        Returns:
            True if leadership was maintained
        """
        try:
            # Verify we're still the leader
            current_leader = await self.redis.get(self.lock_key)
            if not current_leader:
                logger.warning("Lost leadership (lock expired)",
                               service=self.service_name,
                               instance_id=self.instance_id)
                return False

            current_leader_str = current_leader.decode() if isinstance(current_leader, bytes) else current_leader
            if current_leader_str != self.instance_id:
                logger.warning("Lost leadership (lock held by another instance)",
                               service=self.service_name,
                               instance_id=self.instance_id,
                               current_leader=current_leader_str)
                return False

            # Refresh the TTL
            await self.redis.expire(self.lock_key, self.config.lock_ttl_seconds)
            return True

        except Exception as e:
            logger.error("Failed to refresh leadership",
                         service=self.service_name,
                         instance_id=self.instance_id,
                         error=str(e))
            return False

    async def _heartbeat_loop(self):
        """Background loop to maintain leadership"""
        while self._running and self.is_leader:
            try:
                await asyncio.sleep(self.config.heartbeat_interval_seconds)

                if not self._running:
                    break

                maintained = await self._refresh_leadership()

                if not maintained:
                    self.is_leader = False

                    # Call callback
                    if self._on_lose_leader_callback:
                        try:
                            await self._on_lose_leader_callback()
                        except Exception as e:
                            logger.error("Error in on_lose_leader callback",
                                         service=self.service_name,
                                         error=str(e))

                    # Switch to election loop
                    self._election_task = asyncio.create_task(self._election_loop())
                    break

            except asyncio.CancelledError:
                break
            except Exception as e:
                logger.error("Error in heartbeat loop",
                             service=self.service_name,
                             instance_id=self.instance_id,
                             error=str(e))

    async def _election_loop(self):
        """Background loop to attempt leadership acquisition"""
        while self._running and not self.is_leader:
            try:
                await asyncio.sleep(self.config.election_check_interval_seconds)

                if not self._running:
                    break

                acquired = await self._try_become_leader()

                if acquired:
                    # Switch to heartbeat loop
                    self._heartbeat_task = asyncio.create_task(self._heartbeat_loop())
                    break

            except asyncio.CancelledError:
                break
            except Exception as e:
                logger.error("Error in election loop",
                             service=self.service_name,
                             instance_id=self.instance_id,
                             error=str(e))

    def get_status(self) -> dict:
        """Get current leader election status"""
        return {
            "service": self.service_name,
            "instance_id": self.instance_id,
            "is_leader": self.is_leader,
            "running": self._running,
            "lock_key": self.lock_key,
            "config": {
                "lock_ttl_seconds": self.config.lock_ttl_seconds,
                "heartbeat_interval_seconds": self.config.heartbeat_interval_seconds,
                "election_check_interval_seconds": self.config.election_check_interval_seconds
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
async def get_current_leader(self) -> Optional[str]:
|
||||||
|
"""Get the current leader instance ID (if any)"""
|
||||||
|
try:
|
||||||
|
current_leader = await self.redis.get(self.lock_key)
|
||||||
|
if current_leader:
|
||||||
|
return current_leader.decode() if isinstance(current_leader, bytes) else current_leader
|
||||||
|
return None
|
||||||
|
except Exception as e:
|
||||||
|
logger.error("Failed to get current leader",
|
||||||
|
service=self.service_name,
|
||||||
|
error=str(e))
|
||||||
|
return None
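
One caveat worth noting: `_release_leadership` checks ownership with a GET and then deletes the key in a separate call, and `_refresh_leadership` likewise does GET followed by EXPIRE. If the lock expires and is re-acquired by another instance between those two calls, the second call can act on the other instance's lock. A common hardening is to perform the compare-and-delete server-side in one Lua script. The following is a minimal sketch, assuming the redis-py asyncio client used above; `release_if_owner` is a hypothetical helper, not part of the committed code:

```python
# Hardening sketch (not part of the committed code): atomically delete the
# lock only if it still holds this instance's id, via a server-side Lua script.
RELEASE_LOCK_LUA = """
if redis.call("GET", KEYS[1]) == ARGV[1] then
    return redis.call("DEL", KEYS[1])
else
    return 0
end
"""


async def release_if_owner(redis_client, lock_key: str, instance_id: str) -> bool:
    """Return True if the lock was held by instance_id and has been deleted."""
    deleted = await redis_client.eval(RELEASE_LOCK_LUA, 1, lock_key, instance_id)
    return deleted == 1
```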
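
For context, this is roughly how a service pod could wire the helper above so that only one replica runs singleton work (schedulers, consumers, migrations). The class name, constructor signature, callback registration, and `start()`/`stop()` methods are not visible in this excerpt, so everything outside the status/leader semantics shown above is an assumption for illustration:

```python
# Illustrative only: class name, constructor arguments, callback registration,
# and start()/stop() are assumptions; only the internal loops appear in this diff.
import asyncio

import redis.asyncio as aioredis

# from shared.leader_election import LeaderElection  # assumed module path


async def main():
    client = aioredis.Redis(host="redis", port=6379)

    election = LeaderElection(                 # assumed constructor
        redis=client,
        service_name="forecast-service",
        instance_id="forecast-7f6d9-abcde",    # e.g. the pod name
    )

    async def on_become_leader():
        print("this replica now runs the singleton scheduler")

    async def on_lose_leader():
        print("stopping singleton work")

    election.on_become_leader(on_become_leader)  # assumed registration API
    election.on_lose_leader(on_lose_leader)

    await election.start()                       # assumed entry point
    try:
        await asyncio.Event().wait()             # keep the pod running
    finally:
        await election.stop()


if __name__ == "__main__":
    asyncio.run(main())
```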