Add CI/CD and fix multiple-pod issues
1206  CI_CD_IMPLEMENTATION_PLAN.md  Normal file
File diff suppressed because it is too large
413  INFRASTRUCTURE_REORGANIZATION_PROPOSAL.md  Normal file
@@ -0,0 +1,413 @@
# Infrastructure Reorganization Proposal for Bakery-IA

## Executive Summary

This document presents a comprehensive analysis of the current infrastructure organization and proposes a restructured layout that improves maintainability, scalability, and operational efficiency. The proposal is based on a detailed examination of the existing 177 files across 31 directories in the infrastructure folder.

## Current Infrastructure Analysis

### Current Structure Overview

```
infrastructure/
├── ci-cd/        # 18 files  - CI/CD pipeline components
├── helm/         # 8 files   - Helm charts and scripts
├── kubernetes/   # 103 files - Kubernetes manifests and configs
├── signoz/       # 11 files  - Monitoring dashboards and scripts
└── tls/          # 37 files  - TLS certificates and generation scripts
```

### Key Findings

1. **Kubernetes Base Components (103 files)**: The most complex area, with:
   - 20+ service deployments across 15+ microservices
   - 20+ database configurations (PostgreSQL, RabbitMQ, MinIO)
   - 19 migration jobs for different services
   - Infrastructure components (gateway, monitoring, etc.)

2. **CI/CD Pipeline (18 files)**:
   - Tekton tasks and pipelines for the GitOps workflow
   - Flux CD configuration for continuous delivery
   - Gitea configuration for Git repository management

3. **Monitoring (11 files)**:
   - SigNoz dashboards for comprehensive observability
   - Import scripts for dashboard management

4. **TLS Certificates (37 files)**:
   - CA certificates and generation scripts
   - Service-specific certificates (PostgreSQL, Redis, MinIO)
   - Certificate signing requests and configurations

### Strengths of Current Organization

1. **Logical Grouping**: Components are generally well grouped by function
2. **Base/Overlay Pattern**: Kubernetes uses a proper base/overlay structure
3. **Comprehensive Monitoring**: SigNoz dashboards cover all major aspects
4. **Security Focus**: Dedicated TLS certificate management

### Challenges Identified

1. **Complexity in Kubernetes Base**: 103 files make navigation difficult
2. **Mixed Component Types**: Services, databases, and infrastructure are mixed together
3. **Limited Environment Separation**: Only dev/prod overlays, no staging
4. **Script Scattering**: Automation scripts are spread across directories
5. **Documentation Gaps**: Some components lack clear documentation

## Proposed Infrastructure Organization

### High-Level Structure

```
infrastructure/
├── environments/   # Environment-specific configurations
├── platform/       # Platform-level infrastructure
├── services/       # Application services and microservices
├── monitoring/     # Observability and monitoring
├── cicd/           # CI/CD pipeline components
├── security/       # Security configurations and certificates
├── scripts/        # Automation and utility scripts
├── docs/           # Infrastructure documentation
└── README.md       # Top-level infrastructure guide
```

### Detailed Structure Proposal

```
infrastructure/
├── environments/                    # Environment-specific configurations
│   ├── dev/
│   │   ├── k8s-manifests/
│   │   │   ├── base/
│   │   │   │   ├── namespace.yaml
│   │   │   │   ├── configmap.yaml
│   │   │   │   ├── secrets.yaml
│   │   │   │   └── ingress-https.yaml
│   │   │   ├── components/
│   │   │   │   ├── databases/
│   │   │   │   ├── infrastructure/
│   │   │   │   ├── microservices/
│   │   │   │   └── cert-manager/
│   │   │   ├── configs/
│   │   │   ├── cronjobs/
│   │   │   ├── jobs/
│   │   │   └── migrations/
│   │   ├── kustomization.yaml
│   │   └── values/
│   ├── staging/                     # New staging environment
│   │   ├── k8s-manifests/
│   │   └── values/
│   └── prod/
│       ├── k8s-manifests/
│       ├── terraform/               # Production-specific IaC
│       └── values/
├── platform/                        # Platform-level infrastructure
│   ├── cluster/
│   │   ├── eks/                     # AWS EKS configuration
│   │   │   ├── terraform/
│   │   │   └── manifests/
│   │   └── kind/                    # Local development cluster
│   │       ├── config.yaml
│   │       └── manifests/
│   ├── networking/
│   │   ├── dns/
│   │   ├── load-balancers/
│   │   └── ingress/
│   │       ├── nginx/
│   │       └── cert-manager/
│   ├── security/
│   │   ├── rbac/
│   │   ├── network-policies/
│   │   └── tls/
│   │       ├── ca/
│   │       ├── postgres/
│   │       ├── redis/
│   │       └── minio/
│   └── storage/
│       ├── postgres/
│       ├── redis/
│       └── minio/
├── services/                        # Application services
│   ├── databases/
│   │   ├── postgres/
│   │   │   ├── k8s-manifests/
│   │   │   ├── backups/
│   │   │   ├── monitoring/
│   │   │   └── maintenance/
│   │   ├── redis/
│   │   │   ├── configs/
│   │   │   └── monitoring/
│   │   └── minio/
│   │       ├── buckets/
│   │       └── policies/
│   ├── api-gateway/
│   │   ├── k8s-manifests/
│   │   └── configs/
│   └── microservices/
│       ├── auth/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── tenant/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── training/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── forecasting/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── sales/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── external/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── notification/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── inventory/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── recipes/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── suppliers/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── pos/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── orders/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── production/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── procurement/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── orchestrator/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── alert-processor/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── ai-insights/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       ├── demo-session/
│       │   ├── k8s-manifests/
│       │   └── configs/
│       └── frontend/
│           ├── k8s-manifests/
│           └── configs/
├── monitoring/                      # Observability stack
│   ├── signoz/
│   │   ├── manifests/
│   │   ├── dashboards/
│   │   │   ├── alert-management.json
│   │   │   ├── api-performance.json
│   │   │   ├── application-performance.json
│   │   │   ├── database-performance.json
│   │   │   ├── error-tracking.json
│   │   │   ├── index.json
│   │   │   ├── infrastructure-monitoring.json
│   │   │   ├── log-analysis.json
│   │   │   ├── system-health.json
│   │   │   └── user-activity.json
│   │   ├── values-dev.yaml
│   │   ├── values-prod.yaml
│   │   ├── deploy-signoz.sh
│   │   ├── verify-signoz.sh
│   │   └── generate-test-traffic.sh
│   └── opentelemetry/
│       ├── collector/
│       └── agent/
├── cicd/                            # CI/CD pipeline
│   ├── gitea/
│   │   ├── values.yaml
│   │   └── ingress.yaml
│   ├── tekton/
│   │   ├── tasks/
│   │   │   ├── git-clone.yaml
│   │   │   ├── detect-changes.yaml
│   │   │   ├── kaniko-build.yaml
│   │   │   └── update-gitops.yaml
│   │   ├── pipelines/
│   │   └── triggers/
│   └── flux/
│       ├── git-repository.yaml
│       └── kustomization.yaml
├── security/                        # Security configurations
│   ├── policies/
│   │   ├── network-policies.yaml
│   │   ├── pod-security.yaml
│   │   └── rbac.yaml
│   ├── certificates/
│   │   ├── ca/
│   │   ├── services/
│   │   └── rotation-scripts/
│   ├── scanning/
│   │   ├── trivy/
│   │   └── policies/
│   └── compliance/
│       ├── cis-benchmarks/
│       └── audit-scripts/
├── scripts/                         # Automation scripts
│   ├── setup/
│   │   ├── generate-certificates.sh
│   │   ├── generate-minio-certificates.sh
│   │   └── setup-dockerhub-secrets.sh
│   ├── deployment/
│   │   ├── deploy-signoz.sh
│   │   └── verify-signoz.sh
│   ├── maintenance/
│   │   ├── regenerate_migrations_k8s.sh
│   │   └── kubernetes_restart.sh
│   └── verification/
│       └── verify-registry.sh
├── docs/                            # Infrastructure documentation
│   ├── architecture/
│   │   ├── diagrams/
│   │   └── decisions/
│   ├── operations/
│   │   ├── runbooks/
│   │   └── troubleshooting/
│   ├── onboarding/
│   └── reference/
│       ├── api/
│       └── configurations/
└── README.md
```

## Migration Strategy

### Phase 1: Preparation and Planning

1. **Inventory Analysis**: Complete a detailed inventory of all current files
2. **Dependency Mapping**: Identify dependencies between components
3. **Impact Assessment**: Determine which components can be moved safely
4. **Backup Strategy**: Ensure all files are backed up before migration
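
A lightweight way to satisfy the backup requirement before anything moves (a sketch; the tag, branch, and archive names are illustrative, not an agreed convention):

```bash
# Record the pre-reorganization state so any move can be reverted
git tag pre-infra-reorg
git checkout -b infra/reorganization

# Optionally keep an out-of-tree archive of the current infrastructure folder
tar czf ../infrastructure-backup-$(date +%Y%m%d).tar.gz infrastructure/
```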

### Phase 2: Non-Critical Components

1. **Documentation**: Move and update all documentation files
2. **Scripts**: Organize automation scripts into the new structure
3. **Monitoring**: Migrate SigNoz dashboards and configurations
4. **CI/CD**: Reorganize pipeline components

### Phase 3: Environment-Specific Components

1. **Create Environment Structure**: Set up dev/staging/prod directories
2. **Migrate Kubernetes Manifests**: Move base components to appropriate locations
3. **Update References**: Ensure all cross-references are corrected
4. **Environment Validation**: Test each environment separately

### Phase 4: Service Components

1. **Database Migration**: Move database configurations to services/databases
2. **Microservice Organization**: Group microservices by domain
3. **Infrastructure Components**: Move gateway and other infrastructure
4. **Service Validation**: Test each service in isolation

### Phase 5: Finalization

1. **Integration Testing**: Test the complete infrastructure workflow
2. **Documentation Update**: Finalize all documentation
3. **Team Training**: Conduct training on the new structure
4. **Cleanup**: Remove the old structure and temporary files

## Benefits of Proposed Structure

### 1. Improved Navigation
- **Clear Hierarchy**: Logical grouping by function and environment
- **Consistent Patterns**: Standardized structure across all components
- **Reduced Cognitive Load**: Easier to find specific components

### 2. Enhanced Maintainability
- **Environment Isolation**: Clear separation of dev/staging/prod
- **Component Grouping**: Related components grouped together
- **Standardized Structure**: Consistent patterns across services

### 3. Better Scalability
- **Modular Design**: Easy to add new services or environments
- **Domain Separation**: Services organized by business domain
- **Infrastructure Independence**: Platform components separate from services

### 4. Improved Security
- **Centralized Security**: All security configurations in one place
- **Environment-Specific Policies**: Tailored security for each environment
- **Better Secret Management**: Clear structure for sensitive data

### 5. Enhanced Observability
- **Comprehensive Monitoring**: All observability tools grouped together
- **Standardized Dashboards**: Consistent monitoring across services
- **Centralized Logging**: Better log management structure

## Implementation Considerations

### Tools and Technologies
- **Terraform**: For infrastructure as code (IaC)
- **Kustomize**: For Kubernetes manifest management
- **Helm**: For complex application deployments
- **SOPS/Sealed Secrets**: For secret management
- **Trivy**: For vulnerability scanning
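
As a sketch of how the proposed `environments/` layout would be consumed with Kustomize (the paths below follow the proposal, not the current repository layout):

```bash
# Render the dev overlay locally to validate it before applying
microk8s kubectl kustomize infrastructure/environments/dev/k8s-manifests > /tmp/dev-rendered.yaml

# Apply the overlay; -k points at the directory containing kustomization.yaml
microk8s kubectl apply -k infrastructure/environments/dev/k8s-manifests
```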

### Team Adaptation
- **Training Plan**: Develop comprehensive training materials
- **Migration Guide**: Create step-by-step migration documentation
- **Support Period**: Provide dedicated support during the transition
- **Feedback Mechanism**: Establish channels for team feedback

### Risk Mitigation
- **Phased Approach**: Implement changes incrementally
- **Rollback Plan**: Develop comprehensive rollback procedures
- **Testing Strategy**: Implement thorough testing at each phase
- **Monitoring**: Enhanced monitoring during the migration period

## Expected Outcomes

1. **Reduced Time-to-Find**: An estimated 40-60% reduction in time spent locating files
2. **Improved Deployment Speed**: An estimated 25-35% faster deployment cycles
3. **Enhanced Collaboration**: Better team coordination and understanding
4. **Reduced Errors**: An estimated 30-50% reduction in configuration errors
5. **Better Scalability**: Easier to add new services and features

## Conclusion

The proposed infrastructure reorganization represents a significant improvement over the current structure. By implementing a clear, logical hierarchy with proper separation of concerns, the new organization will:

- **Improve operational efficiency** through better navigation and maintainability
- **Enhance security** with centralized security management
- **Support growth** with a scalable, modular design
- **Reduce errors** through standardized patterns and structures
- **Facilitate collaboration** with intuitive organization

The key to successful implementation is a phased approach with thorough testing and team involvement at each stage. With proper planning and execution, this reorganization will provide long-term benefits for the Bakery-IA project's infrastructure management.

## Appendix: File Migration Mapping

### Current → Proposed Mapping

**Kubernetes Components:**
- `infrastructure/kubernetes/base/components/*` → `infrastructure/services/microservices/*/`
- `infrastructure/kubernetes/base/components/databases/*` → `infrastructure/services/databases/*/`
- `infrastructure/kubernetes/base/migrations/*` → `infrastructure/services/microservices/*/migrations/`
- `infrastructure/kubernetes/base/configs/*` → `infrastructure/environments/*/values/`

**CI/CD Components:**
- `infrastructure/ci-cd/*` → `infrastructure/cicd/*/`

**Monitoring Components:**
- `infrastructure/signoz/*` → `infrastructure/monitoring/signoz/*/`
- `infrastructure/helm/*` → `infrastructure/monitoring/signoz/*/` (SigNoz-related files only)

**Security Components:**
- `infrastructure/tls/*` → `infrastructure/security/certificates/*/`

**Scripts:**
- `infrastructure/kubernetes/*.sh` → `infrastructure/scripts/*/`
- `infrastructure/helm/*.sh` → `infrastructure/scripts/deployment/*/`
- `infrastructure/tls/*.sh` → `infrastructure/scripts/setup/*/`

This mapping provides a clear path for migrating each component to its new location while maintaining functionality and the relationships between components.
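
As one hedged illustration of executing the mapping while preserving Git history (the `auth` paths and filename below are examples only; each move should be followed by a search for stale path references):

```bash
# Move an auth service manifest into the proposed layout, keeping history
# (the source filename is illustrative)
mkdir -p infrastructure/services/microservices/auth/k8s-manifests
git mv infrastructure/kubernetes/base/components/auth-service-deployment.yaml \
       infrastructure/services/microservices/auth/k8s-manifests/

# Find references to the old path that still need updating
grep -rn "kubernetes/base/components/auth" infrastructure/ || true
```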

294  infrastructure/ci-cd/README.md  Normal file
@@ -0,0 +1,294 @@
# Bakery-IA CI/CD Implementation

This directory contains the configuration for the production-grade CI/CD system for Bakery-IA using Gitea, Tekton, and Flux CD.

## Architecture Overview

```mermaid
graph TD
    A[Developer] -->|Push Code| B[Gitea]
    B -->|Webhook| C[Tekton Pipelines]
    C -->|Build/Test| D[Gitea Registry]
    D -->|New Image| E[Flux CD]
    E -->|kubectl apply| F[MicroK8s Cluster]
    F -->|Metrics| G[SigNoz]
```

## Directory Structure

```
infrastructure/ci-cd/
├── gitea/                      # Gitea configuration (Git server + registry)
│   ├── values.yaml             # Helm values for Gitea
│   └── ingress.yaml            # Ingress configuration
├── tekton/                     # Tekton CI/CD pipeline configuration
│   ├── tasks/                  # Individual pipeline tasks
│   │   ├── git-clone.yaml
│   │   ├── detect-changes.yaml
│   │   ├── kaniko-build.yaml
│   │   └── update-gitops.yaml
│   ├── pipelines/              # Pipeline definitions
│   │   └── ci-pipeline.yaml
│   └── triggers/               # Webhook trigger configuration
│       ├── trigger-template.yaml
│       ├── trigger-binding.yaml
│       ├── event-listener.yaml
│       └── gitlab-interceptor.yaml
├── flux/                       # Flux CD GitOps configuration
│   ├── git-repository.yaml     # Git repository source
│   └── kustomization.yaml      # Deployment kustomization
├── monitoring/                 # Monitoring configuration
│   └── otel-collector.yaml     # OpenTelemetry collector
└── README.md                   # This file
```

## Deployment Instructions

### Phase 1: Infrastructure Setup

1. **Deploy Gitea**:
   ```bash
   # Add Helm repo
   microk8s helm repo add gitea https://dl.gitea.io/charts

   # Create namespace
   microk8s kubectl create namespace gitea

   # Install Gitea
   microk8s helm install gitea gitea/gitea \
     -n gitea \
     -f infrastructure/ci-cd/gitea/values.yaml

   # Apply ingress
   microk8s kubectl apply -f infrastructure/ci-cd/gitea/ingress.yaml
   ```

2. **Deploy Tekton**:
   ```bash
   # Create namespace
   microk8s kubectl create namespace tekton-pipelines

   # Install Tekton Pipelines
   microk8s kubectl apply -f https://storage.googleapis.com/tekton-releases/pipeline/latest/release.yaml

   # Install Tekton Triggers
   microk8s kubectl apply -f https://storage.googleapis.com/tekton-releases/triggers/latest/release.yaml

   # Apply Tekton configurations
   microk8s kubectl apply -f infrastructure/ci-cd/tekton/tasks/
   microk8s kubectl apply -f infrastructure/ci-cd/tekton/pipelines/
   microk8s kubectl apply -f infrastructure/ci-cd/tekton/triggers/
   ```
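
   The triggers manifests applied above also reference a few objects that these steps do not create. A hedged sketch of the missing pieces (the interceptors manifest and ClusterRole names follow the upstream Tekton Triggers examples and may need adjusting to the installed version):

   ```bash
   # Core interceptors service used by gitlab-interceptor.yaml (shipped
   # separately from release.yaml in recent Tekton Triggers versions)
   microk8s kubectl apply -f https://storage.googleapis.com/tekton-releases/triggers/latest/interceptors.yaml

   # ServiceAccount referenced by the EventListener, plus the role bindings
   # the upstream examples grant it
   microk8s kubectl create serviceaccount tekton-triggers-sa -n tekton-pipelines
   microk8s kubectl create rolebinding tekton-triggers-sa-roles \
     -n tekton-pipelines \
     --clusterrole=tekton-triggers-eventlistener-roles \
     --serviceaccount=tekton-pipelines:tekton-triggers-sa
   microk8s kubectl create clusterrolebinding tekton-triggers-sa-clusterroles \
     --clusterrole=tekton-triggers-eventlistener-clusterroles \
     --serviceaccount=tekton-pipelines:tekton-triggers-sa
   ```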

3. **Deploy Flux CD** (already enabled in MicroK8s):
   ```bash
   # Verify Flux installation
   microk8s kubectl get pods -n flux-system

   # Apply Flux configurations
   microk8s kubectl apply -f infrastructure/ci-cd/flux/
   ```

### Phase 2: Configuration

1. **Set up Gitea webhook**:
   - Go to your Gitea repository settings
   - Add a webhook with URL: `http://tekton-triggers.tekton-pipelines.svc.cluster.local:8080`
   - Use the secret from `gitea-webhook-secret`
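
   The webhook secret itself is not created anywhere above; a minimal sketch (the secret name and `secretToken` key match the EventListener interceptor configuration added in this commit, and the token value is a placeholder):

   ```bash
   # Secret consumed by the EventListener's gitlab interceptor
   microk8s kubectl create secret generic gitea-webhook-secret \
     -n tekton-pipelines \
     --from-literal=secretToken=your-webhook-token
   ```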

2. **Configure registry credentials**:
   ```bash
   # Create registry credentials secret
   microk8s kubectl create secret docker-registry gitea-registry-credentials \
     -n tekton-pipelines \
     --docker-server=gitea.bakery-ia.local:5000 \
     --docker-username=your-username \
     --docker-password=your-password
   ```

3. **Configure Git credentials for Flux**:
   ```bash
   # Create Git credentials secret
   microk8s kubectl create secret generic gitea-credentials \
     -n flux-system \
     --from-literal=username=your-username \
     --from-literal=password=your-password
   ```

### Phase 3: Monitoring

```bash
# Apply OpenTelemetry configuration
microk8s kubectl apply -f infrastructure/ci-cd/monitoring/otel-collector.yaml
```
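
Note that `otel-collector.yaml` defines an `OpenTelemetryCollector` resource, so the corresponding CRD must already exist in the cluster. A hedged sketch of installing the OpenTelemetry Operator that provides it (the operator in turn expects cert-manager to be available):

```bash
# Install the OpenTelemetry Operator, which provides the OpenTelemetryCollector CRD
microk8s kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/latest/download/opentelemetry-operator.yaml
```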

## Usage

### Triggering a Pipeline

1. **Manual trigger**:
   ```bash
   # Create a PipelineRun manually
   microk8s kubectl create -f - <<EOF
   apiVersion: tekton.dev/v1beta1
   kind: PipelineRun
   metadata:
     name: manual-ci-run
     namespace: tekton-pipelines
   spec:
     pipelineRef:
       name: bakery-ia-ci
     workspaces:
       - name: shared-workspace
         volumeClaimTemplate:
           spec:
             accessModes: ["ReadWriteOnce"]
             resources:
               requests:
                 storage: 5Gi
       - name: docker-credentials
         secret:
           secretName: gitea-registry-credentials
     params:
       - name: git-url
         value: "http://gitea.bakery-ia.local/bakery/bakery-ia.git"
       - name: git-revision
         value: "main"
   EOF
   ```

2. **Automatic trigger**: Push code to the repository and the webhook will trigger the pipeline automatically.

### Monitoring Pipeline Runs

```bash
# List all PipelineRuns
microk8s kubectl get pipelineruns -n tekton-pipelines

# View logs for a specific PipelineRun
microk8s kubectl logs -n tekton-pipelines <pipelinerun-pod> -c <step-name>

# View Tekton dashboard
microk8s kubectl port-forward -n tekton-pipelines svc/tekton-dashboard 9097:9097
```
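
The Tekton dashboard targeted by the port-forward above is not installed in Phase 1; if it is missing, it can be added from the upstream release manifest (a sketch, not part of the committed setup):

```bash
# Install the Tekton dashboard so the port-forward above has a target service
microk8s kubectl apply -f https://storage.googleapis.com/tekton-releases/dashboard/latest/release.yaml
```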

## Troubleshooting

### Common Issues

1. **Pipeline not triggering**:
   - Check Gitea webhook logs
   - Verify EventListener pods are running
   - Check TriggerBinding configuration

2. **Build failures**:
   - Check Kaniko logs for build errors
   - Verify Dockerfile paths are correct
   - Ensure registry credentials are valid

3. **Flux not applying changes**:
   - Check GitRepository status
   - Verify Kustomization reconciliation
   - Check Flux logs for errors

### Debugging Commands

```bash
# Check Tekton controller logs
microk8s kubectl logs -n tekton-pipelines -l app=tekton-pipelines-controller

# Check Flux reconciliation
microk8s kubectl get kustomizations -n flux-system -o yaml

# Check Gitea webhook delivery
microk8s kubectl logs -n tekton-pipelines -l app=tekton-triggers-controller
```

## Security Considerations

1. **Secrets Management**:
   - Use Kubernetes secrets for sensitive data
   - Rotate credentials regularly
   - Use RBAC for namespace isolation

2. **Network Security**:
   - Configure network policies
   - Use internal DNS names
   - Restrict ingress access

3. **Registry Security**:
   - Enable image scanning
   - Use image signing
   - Implement cleanup policies

## Maintenance

### Upgrading Components

```bash
# Upgrade Tekton
microk8s kubectl apply -f https://storage.googleapis.com/tekton-releases/pipeline/latest/release.yaml

# Upgrade Flux
microk8s helm upgrade fluxcd fluxcd/flux2 -n flux-system

# Upgrade Gitea
microk8s helm upgrade gitea gitea/gitea -n gitea -f infrastructure/ci-cd/gitea/values.yaml
```

### Backup Procedures

```bash
# Backup Gitea
microk8s kubectl exec -n gitea gitea-0 -- gitea dump -c /data/gitea/conf/app.ini

# Backup Flux configurations
microk8s kubectl get all -n flux-system -o yaml > flux-backup.yaml

# Backup Tekton configurations
microk8s kubectl get all -n tekton-pipelines -o yaml > tekton-backup.yaml
```

## Performance Optimization

1. **Resource Management**:
   - Set appropriate resource limits
   - Limit concurrent builds
   - Use node selectors for build pods

2. **Caching**:
   - Configure Kaniko cache
   - Use persistent volumes for dependencies
   - Cache Docker layers

3. **Parallelization**:
   - Build independent services in parallel
   - Use matrix builds for different architectures
   - Optimize task dependencies
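
For the Kaniko cache mentioned above, the executor's built-in caching flags could be enabled in the build task; a hedged sketch (the cache repository is an example path in the Gitea registry, not an existing repository):

```bash
# Enable Kaniko layer caching by appending these executor args in
# infrastructure/ci-cd/tekton/tasks/kaniko-build.yaml:
#
#   - --cache=true
#   - --cache-repo=gitea.bakery-ia.local:5000/bakery/kaniko-cache
#
# Then re-apply the task definition:
microk8s kubectl apply -f infrastructure/ci-cd/tekton/tasks/kaniko-build.yaml
```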

## Integration with Existing System

The CI/CD system integrates with:
- **SigNoz**: For monitoring and observability
- **MicroK8s**: For cluster management
- **Existing Kubernetes manifests**: In `infrastructure/kubernetes/`
- **Current services**: All 19 microservices in `services/`

## Migration Plan

1. **Phase 1**: Set up infrastructure (Gitea, Tekton, Flux)
2. **Phase 2**: Configure pipelines and triggers
3. **Phase 3**: Test with non-critical services
4. **Phase 4**: Gradual rollout to all services
5. **Phase 5**: Decommission old deployment methods

## Support

For issues with the CI/CD system:
- Check logs and monitoring first
- Review the troubleshooting section
- Consult the original implementation plan
- Refer to component documentation:
  - [Tekton Documentation](https://tekton.dev/docs/)
  - [Flux CD Documentation](https://fluxcd.io/docs/)
  - [Gitea Documentation](https://docs.gitea.io/)

16  infrastructure/ci-cd/flux/git-repository.yaml  Normal file
@@ -0,0 +1,16 @@
# Flux GitRepository for Bakery-IA
# This resource tells Flux where to find the Git repository

apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: bakery-ia
  namespace: flux-system
spec:
  interval: 1m
  url: http://gitea.bakery-ia.local/bakery/bakery-ia.git
  ref:
    branch: main
  secretRef:
    name: gitea-credentials
  timeout: 60s

27  infrastructure/ci-cd/flux/kustomization.yaml  Normal file
@@ -0,0 +1,27 @@
# Flux Kustomization for Bakery-IA Production Deployment
# This resource tells Flux how to deploy the application

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: bakery-ia-prod
  namespace: flux-system
spec:
  interval: 5m
  path: ./infrastructure/kubernetes/overlays/prod
  prune: true
  sourceRef:
    kind: GitRepository
    name: bakery-ia
  targetNamespace: bakery-ia
  timeout: 5m
  retryInterval: 1m
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: auth-service
      namespace: bakery-ia
    - apiVersion: apps/v1
      kind: Deployment
      name: gateway
      namespace: bakery-ia

25  infrastructure/ci-cd/gitea/ingress.yaml  Normal file
@@ -0,0 +1,25 @@
# Gitea Ingress configuration for Bakery-IA CI/CD
# This provides external access to Gitea within the cluster

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: gitea-ingress
  namespace: gitea
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
    nginx.ingress.kubernetes.io/proxy-body-size: "0"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
spec:
  rules:
    - host: gitea.bakery-ia.local
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: gitea-http
                port:
                  number: 3000
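
Since `gitea.bakery-ia.local` is not a public DNS name, clients need local name resolution before this ingress is reachable; a hedged sketch for a single-node MicroK8s host (adjust the address if the ingress is exposed elsewhere):

```bash
# Point the Gitea hostname at the local ingress for development access
echo "127.0.0.1 gitea.bakery-ia.local" | sudo tee -a /etc/hosts
```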

38  infrastructure/ci-cd/gitea/values.yaml  Normal file
@@ -0,0 +1,38 @@
# Gitea Helm values configuration for Bakery-IA CI/CD
# This configuration sets up Gitea with registry support and appropriate storage

service:
  type: ClusterIP
  httpPort: 3000
  sshPort: 2222

persistence:
  enabled: true
  size: 50Gi
  storageClass: "microk8s-hostpath"

gitea:
  config:
    server:
      DOMAIN: gitea.bakery-ia.local
      SSH_DOMAIN: gitea.bakery-ia.local
      ROOT_URL: http://gitea.bakery-ia.local
    repository:
      ENABLE_PUSH_CREATE_USER: true
      ENABLE_PUSH_CREATE_ORG: true
    registry:
      ENABLED: true

postgresql:
  enabled: true
  persistence:
    size: 20Gi

# Resource configuration for production environment
resources:
  limits:
    cpu: 1000m
    memory: 1Gi
  requests:
    cpu: 500m
    memory: 512Mi

70  infrastructure/ci-cd/monitoring/otel-collector.yaml  Normal file
@@ -0,0 +1,70 @@
# OpenTelemetry Collector for Bakery-IA CI/CD Monitoring
# This collects metrics and traces from Tekton pipelines

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: tekton-otel
  namespace: tekton-pipelines
spec:
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
      prometheus:
        config:
          scrape_configs:
            - job_name: 'tekton-pipelines'
              scrape_interval: 30s
              static_configs:
                - targets: ['tekton-pipelines-controller.tekton-pipelines.svc.cluster.local:9090']

    processors:
      batch:
        timeout: 5s
        send_batch_size: 1000
      memory_limiter:
        check_interval: 2s
        limit_percentage: 75
        spike_limit_percentage: 20

    exporters:
      otlp:
        endpoint: "signoz-otel-collector.monitoring.svc.cluster.local:4317"
        tls:
          insecure: true
        retry_on_failure:
          enabled: true
          initial_interval: 5s
          max_interval: 30s
          max_elapsed_time: 300s
      logging:
        logLevel: debug

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [otlp, logging]
        metrics:
          receivers: [otlp, prometheus]
          processors: [memory_limiter, batch]
          exporters: [otlp, logging]
      telemetry:
        logs:
          level: "info"
          encoding: "json"

  mode: deployment
  resources:
    limits:
      cpu: 500m
      memory: 512Mi
    requests:
      cpu: 200m
      memory: 256Mi

83  infrastructure/ci-cd/tekton/pipelines/ci-pipeline.yaml  Normal file
@@ -0,0 +1,83 @@
# Main CI Pipeline for Bakery-IA
# This pipeline orchestrates the build, test, and deploy process

apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: bakery-ia-ci
  namespace: tekton-pipelines
spec:
  workspaces:
    - name: shared-workspace
    - name: docker-credentials
  params:
    - name: git-url
      type: string
      description: Repository URL
    - name: git-revision
      type: string
      description: Git revision/commit hash
    - name: registry
      type: string
      description: Container registry URL
      default: "gitea.bakery-ia.local:5000"
  tasks:
    - name: fetch-source
      taskRef:
        name: git-clone
      workspaces:
        - name: output
          workspace: shared-workspace
      params:
        - name: url
          value: $(params.git-url)
        - name: revision
          value: $(params.git-revision)

    - name: detect-changes
      runAfter: [fetch-source]
      taskRef:
        name: detect-changed-services
      workspaces:
        - name: source
          workspace: shared-workspace

    - name: build-and-push
      runAfter: [detect-changes]
      taskRef:
        name: kaniko-build
      when:
        - input: "$(tasks.detect-changes.results.changed-services)"
          operator: notin
          values: ["none"]
      workspaces:
        - name: source
          workspace: shared-workspace
        - name: docker-credentials
          workspace: docker-credentials
      params:
        - name: services
          value: $(tasks.detect-changes.results.changed-services)
        - name: registry
          value: $(params.registry)
        - name: git-revision
          value: $(params.git-revision)

    - name: update-gitops-manifests
      runAfter: [build-and-push]
      taskRef:
        name: update-gitops
      when:
        - input: "$(tasks.detect-changes.results.changed-services)"
          operator: notin
          values: ["none"]
      workspaces:
        - name: source
          workspace: shared-workspace
      params:
        - name: services
          value: $(tasks.detect-changes.results.changed-services)
        - name: registry
          value: $(params.registry)
        - name: git-revision
          value: $(params.git-revision)

64  infrastructure/ci-cd/tekton/tasks/detect-changes.yaml  Normal file
@@ -0,0 +1,64 @@
# Tekton Detect Changed Services Task for Bakery-IA CI/CD
# This task identifies which services have changed in the repository

apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: detect-changed-services
  namespace: tekton-pipelines
spec:
  workspaces:
    - name: source
  results:
    - name: changed-services
      description: Comma-separated list of changed services
  steps:
    - name: detect
      image: alpine/git
      script: |
        #!/bin/sh
        # POSIX sh (busybox ash) - avoid bash-only arrays and [[ ]]
        set -e
        cd $(workspaces.source.path)

        echo "Detecting changed files..."
        # Get list of changed files compared to previous commit
        CHANGED_FILES=$(git diff --name-only HEAD~1 HEAD 2>/dev/null || git diff --name-only HEAD)

        echo "Changed files: $CHANGED_FILES"

        # Map files to services (space-separated list instead of a bash array)
        CHANGED_SERVICES=""
        for file in $CHANGED_FILES; do
          case "$file" in
            services/*)
              SERVICE=$(echo "$file" | cut -d'/' -f2)
              # Only add unique service names
              case " $CHANGED_SERVICES " in
                *" $SERVICE "*) ;;
                *) CHANGED_SERVICES="$CHANGED_SERVICES $SERVICE" ;;
              esac
              ;;
            frontend/*)
              CHANGED_SERVICES="$CHANGED_SERVICES frontend"
              break
              ;;
            gateway/*)
              CHANGED_SERVICES="$CHANGED_SERVICES gateway"
              break
              ;;
          esac
        done

        # If no specific services changed, check for infrastructure changes
        if [ -z "$CHANGED_SERVICES" ]; then
          for file in $CHANGED_FILES; do
            case "$file" in
              infrastructure/*)
                CHANGED_SERVICES="infrastructure"
                break
                ;;
            esac
          done
        fi

        # Output result as a comma-separated list
        if [ -z "$CHANGED_SERVICES" ]; then
          echo "No service changes detected"
          echo "none" | tee $(results.changed-services.path)
        else
          echo "Detected changes in services:$CHANGED_SERVICES"
          echo "$CHANGED_SERVICES" | sed 's/^ *//' | tr ' ' ',' | tee $(results.changed-services.path)
        fi

31  infrastructure/ci-cd/tekton/tasks/git-clone.yaml  Normal file
@@ -0,0 +1,31 @@
# Tekton Git Clone Task for Bakery-IA CI/CD
# This task clones the source code repository

apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: git-clone
  namespace: tekton-pipelines
spec:
  workspaces:
    - name: output
  params:
    - name: url
      type: string
      description: Repository URL to clone
    - name: revision
      type: string
      description: Git revision to checkout
      default: "main"
  steps:
    - name: clone
      image: alpine/git
      script: |
        #!/bin/sh
        set -e
        echo "Cloning repository: $(params.url)"
        git clone $(params.url) $(workspaces.output.path)
        cd $(workspaces.output.path)
        echo "Checking out revision: $(params.revision)"
        git checkout $(params.revision)
        echo "Repository cloned successfully"

40  infrastructure/ci-cd/tekton/tasks/kaniko-build.yaml  Normal file
@@ -0,0 +1,40 @@
# Tekton Kaniko Build Task for Bakery-IA CI/CD
# This task builds and pushes container images using Kaniko

apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: kaniko-build
  namespace: tekton-pipelines
spec:
  workspaces:
    - name: source
    - name: docker-credentials
  params:
    - name: services
      type: string
      description: Comma-separated list of services to build
    - name: registry
      type: string
      description: Container registry URL
      default: "gitea.bakery-ia.local:5000"
    - name: git-revision
      type: string
      description: Git revision for image tag
      default: "latest"
  steps:
    - name: build-and-push
      image: gcr.io/kaniko-project/executor:v1.9.0
      args:
        - --dockerfile=$(workspaces.source.path)/services/$(params.services)/Dockerfile
        - --context=$(workspaces.source.path)
        - --destination=$(params.registry)/bakery/$(params.services):$(params.git-revision)
        - --verbosity=info
      volumeMounts:
        - name: docker-config
          mountPath: /kaniko/.docker
      securityContext:
        runAsUser: 0
  volumes:
    - name: docker-config
      emptyDir: {}

66  infrastructure/ci-cd/tekton/tasks/update-gitops.yaml  Normal file
@@ -0,0 +1,66 @@
# Tekton Update GitOps Manifests Task for Bakery-IA CI/CD
# This task updates Kubernetes manifests with new image tags

apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: update-gitops
  namespace: tekton-pipelines
spec:
  workspaces:
    - name: source
  params:
    - name: services
      type: string
      description: Comma-separated list of services to update
    - name: registry
      type: string
      description: Container registry URL
    - name: git-revision
      type: string
      description: Git revision for image tag
  steps:
    - name: update-manifests
      image: bitnami/kubectl
      script: |
        #!/bin/sh
        set -e
        cd $(workspaces.source.path)

        echo "Updating GitOps manifests for services: $(params.services)"

        # Split the comma-separated service list (POSIX sh, no bash arrays)
        for service in $(echo "$(params.services)" | tr ',' ' '); do
          echo "Processing service: $service"

          # Find and update Kubernetes manifests
          if [ "$service" = "frontend" ]; then
            # Update frontend deployment
            if [ -f "infrastructure/kubernetes/overlays/prod/frontend-deployment.yaml" ]; then
              sed -i "s|image:.*|image: $(params.registry)/bakery/frontend:$(params.git-revision)|g" \
                "infrastructure/kubernetes/overlays/prod/frontend-deployment.yaml"
            fi
          elif [ "$service" = "gateway" ]; then
            # Update gateway deployment
            if [ -f "infrastructure/kubernetes/overlays/prod/gateway-deployment.yaml" ]; then
              sed -i "s|image:.*|image: $(params.registry)/bakery/gateway:$(params.git-revision)|g" \
                "infrastructure/kubernetes/overlays/prod/gateway-deployment.yaml"
            fi
          else
            # Update service deployment
            DEPLOYMENT_FILE="infrastructure/kubernetes/overlays/prod/${service}-deployment.yaml"
            if [ -f "$DEPLOYMENT_FILE" ]; then
              sed -i "s|image:.*|image: $(params.registry)/bakery/${service}:$(params.git-revision)|g" \
                "$DEPLOYMENT_FILE"
            fi
          fi
        done

        # Commit changes
        git config --global user.name "bakery-ia-ci"
        git config --global user.email "ci@bakery-ia.local"
        git add .
        git commit -m "CI: Update image tags for $(params.services) to $(params.git-revision)"
        git push origin HEAD

26  infrastructure/ci-cd/tekton/triggers/event-listener.yaml  Normal file
@@ -0,0 +1,26 @@
# Tekton EventListener for Bakery-IA CI/CD
# This listener receives webhook events and triggers pipelines

apiVersion: triggers.tekton.dev/v1alpha1
kind: EventListener
metadata:
  name: bakery-ia-listener
  namespace: tekton-pipelines
spec:
  serviceAccountName: tekton-triggers-sa
  triggers:
    - name: bakery-ia-gitea-trigger
      bindings:
        - ref: bakery-ia-trigger-binding
      template:
        ref: bakery-ia-trigger-template
      interceptors:
        - ref:
            name: "gitlab"
          params:
            - name: "secretRef"
              value:
                secretName: gitea-webhook-secret
                secretKey: secretToken
            - name: "eventTypes"
              value: ["push"]

14  infrastructure/ci-cd/tekton/triggers/gitlab-interceptor.yaml  Normal file
@@ -0,0 +1,14 @@
# GitLab/Gitea Webhook Interceptor for Tekton Triggers
# This interceptor validates and processes Gitea webhook events

apiVersion: triggers.tekton.dev/v1alpha1
kind: ClusterInterceptor
metadata:
  name: gitlab
spec:
  clientConfig:
    service:
      name: tekton-triggers-core-interceptors
      namespace: tekton-pipelines
      path: "/v1/webhook/gitlab"
      port: 8443

16  infrastructure/ci-cd/tekton/triggers/trigger-binding.yaml  Normal file
@@ -0,0 +1,16 @@
# Tekton TriggerBinding for Bakery-IA CI/CD
# This binding extracts parameters from Gitea webhook events

apiVersion: triggers.tekton.dev/v1alpha1
kind: TriggerBinding
metadata:
  name: bakery-ia-trigger-binding
  namespace: tekton-pipelines
spec:
  params:
    - name: git-repo-url
      value: $(body.repository.clone_url)
    - name: git-revision
      value: $(body.head_commit.id)
    - name: git-repo-name
      value: $(body.repository.name)

43  infrastructure/ci-cd/tekton/triggers/trigger-template.yaml  Normal file
@@ -0,0 +1,43 @@
# Tekton TriggerTemplate for Bakery-IA CI/CD
# This template defines how PipelineRuns are created when triggers fire
# Note: current Tekton Triggers versions require the $(tt.params.*) syntax
# inside resourcetemplates

apiVersion: triggers.tekton.dev/v1alpha1
kind: TriggerTemplate
metadata:
  name: bakery-ia-trigger-template
  namespace: tekton-pipelines
spec:
  params:
    - name: git-repo-url
      description: The git repository URL
    - name: git-revision
      description: The git revision/commit hash
    - name: git-repo-name
      description: The git repository name
      default: "bakery-ia"
  resourcetemplates:
    - apiVersion: tekton.dev/v1beta1
      kind: PipelineRun
      metadata:
        generateName: bakery-ia-ci-run-$(tt.params.git-repo-name)-
      spec:
        pipelineRef:
          name: bakery-ia-ci
        workspaces:
          - name: shared-workspace
            volumeClaimTemplate:
              spec:
                accessModes: ["ReadWriteOnce"]
                resources:
                  requests:
                    storage: 5Gi
          - name: docker-credentials
            secret:
              secretName: gitea-registry-credentials
        params:
          - name: git-url
            value: $(tt.params.git-repo-url)
          - name: git-revision
            value: $(tt.params.git-revision)
          - name: registry
            value: "gitea.bakery-ia.local:5000"

@@ -18,6 +18,28 @@ class OrchestratorService(StandardFastAPIService):

    expected_migration_version = "001_initial_schema"

    def __init__(self):
        # Define expected database tables for health checks
        orchestrator_expected_tables = [
            'orchestration_runs'
        ]

        self.rabbitmq_client = None
        self.event_publisher = None
        self.leader_election = None
        self.scheduler_service = None

        super().__init__(
            service_name="orchestrator-service",
            app_name=settings.APP_NAME,
            description=settings.DESCRIPTION,
            version=settings.VERSION,
            api_prefix="",  # Empty because RouteBuilder already includes /api/v1
            database_manager=database_manager,
            expected_tables=orchestrator_expected_tables,
            enable_messaging=True  # Enable RabbitMQ for event publishing
        )

    async def verify_migrations(self):
        """Verify database schema matches the latest migrations"""
        try:

@@ -32,26 +54,6 @@ class OrchestratorService(StandardFastAPIService):
            self.logger.error(f"Migration verification failed: {e}")
            raise

    def __init__(self):
        # Define expected database tables for health checks
        orchestrator_expected_tables = [
            'orchestration_runs'
        ]

        self.rabbitmq_client = None
        self.event_publisher = None

        super().__init__(
            service_name="orchestrator-service",
            app_name=settings.APP_NAME,
            description=settings.DESCRIPTION,
            version=settings.VERSION,
            api_prefix="",  # Empty because RouteBuilder already includes /api/v1
            database_manager=database_manager,
            expected_tables=orchestrator_expected_tables,
            enable_messaging=True  # Enable RabbitMQ for event publishing
        )

    async def _setup_messaging(self):
        """Setup messaging for orchestrator service"""
        from shared.messaging import UnifiedEventPublisher, RabbitMQClient

@@ -84,22 +86,91 @@ class OrchestratorService(StandardFastAPIService):

        self.logger.info("Orchestrator Service starting up...")

        # Initialize orchestrator scheduler service with EventPublisher
        from app.services.orchestrator_service import OrchestratorSchedulerService
        scheduler_service = OrchestratorSchedulerService(self.event_publisher, settings)
        await scheduler_service.start()
        app.state.scheduler_service = scheduler_service
        self.logger.info("Orchestrator scheduler service started")
        # Initialize leader election for horizontal scaling
        # Only the leader pod will run the scheduler
        await self._setup_leader_election(app)

        # REMOVED: Delivery tracking service - moved to procurement service (domain ownership)

    async def _setup_leader_election(self, app: FastAPI):
        """
        Setup leader election for scheduler.

        CRITICAL FOR HORIZONTAL SCALING:
        Without leader election, each pod would run the same scheduled jobs,
        causing duplicate forecasts, production schedules, and database contention.
        """
        from shared.leader_election import LeaderElectionService
        import redis.asyncio as redis

        try:
            # Create Redis connection for leader election
            redis_url = f"redis://:{settings.REDIS_PASSWORD}@{settings.REDIS_HOST}:{settings.REDIS_PORT}/{settings.REDIS_DB}"
            if settings.REDIS_TLS_ENABLED.lower() == "true":
                redis_url = redis_url.replace("redis://", "rediss://")

            redis_client = redis.from_url(redis_url, decode_responses=False)
            await redis_client.ping()

            # Use shared leader election service
            self.leader_election = LeaderElectionService(
                redis_client,
                service_name="orchestrator"
            )

            # Define callbacks for leader state changes
            async def on_become_leader():
                self.logger.info("This pod became the leader - starting scheduler")
                from app.services.orchestrator_service import OrchestratorSchedulerService
                self.scheduler_service = OrchestratorSchedulerService(self.event_publisher, settings)
                await self.scheduler_service.start()
                app.state.scheduler_service = self.scheduler_service
                self.logger.info("Orchestrator scheduler service started (leader only)")

            async def on_lose_leader():
                self.logger.warning("This pod lost leadership - stopping scheduler")
                if self.scheduler_service:
                    await self.scheduler_service.stop()
                    self.scheduler_service = None
                if hasattr(app.state, 'scheduler_service'):
                    app.state.scheduler_service = None
                self.logger.info("Orchestrator scheduler service stopped (no longer leader)")

            # Start leader election
            await self.leader_election.start(
                on_become_leader=on_become_leader,
                on_lose_leader=on_lose_leader
            )

            # Store leader election in app state for health checks
            app.state.leader_election = self.leader_election

            self.logger.info("Leader election initialized",
                             is_leader=self.leader_election.is_leader,
                             instance_id=self.leader_election.instance_id)

        except Exception as e:
            self.logger.error("Failed to setup leader election, falling back to standalone mode",
                              error=str(e))
            # Fallback: start scheduler anyway (for single-pod deployments)
            from app.services.orchestrator_service import OrchestratorSchedulerService
            self.scheduler_service = OrchestratorSchedulerService(self.event_publisher, settings)
            await self.scheduler_service.start()
            app.state.scheduler_service = self.scheduler_service
            self.logger.warning("Scheduler started in standalone mode (no leader election)")

    async def on_shutdown(self, app: FastAPI):
        """Custom shutdown logic for orchestrator service"""
        self.logger.info("Orchestrator Service shutting down...")

        # Stop scheduler service
        if hasattr(app.state, 'scheduler_service'):
            await app.state.scheduler_service.stop()
        # Stop leader election (this will also stop the scheduler if we're the leader)
        if self.leader_election:
            await self.leader_election.stop()
            self.logger.info("Leader election stopped")

        # Stop scheduler service if still running
        if self.scheduler_service:
            await self.scheduler_service.stop()
            self.logger.info("Orchestrator scheduler service stopped")
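
A quick way to confirm the leader election behaves as intended after deploying this change is to scale the orchestrator out and check which pod claims leadership; a sketch (the deployment name and label are assumptions about the existing manifests):

```bash
# Run several orchestrator pods, then confirm exactly one logs as leader
microk8s kubectl scale deployment orchestrator-service -n bakery-ia --replicas=3
microk8s kubectl logs -n bakery-ia -l app=orchestrator-service --prefix | grep "became the leader"
```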
@@ -1,12 +1,12 @@
|
||||
"""
|
||||
Delivery Tracking Service - Simplified
|
||||
Delivery Tracking Service - With Leader Election
|
||||
|
||||
Tracks purchase order deliveries and generates appropriate alerts using EventPublisher:
|
||||
- DELIVERY_ARRIVING_SOON: 2 hours before delivery window
|
||||
- DELIVERY_OVERDUE: 30 minutes after expected delivery time
|
||||
- STOCK_RECEIPT_INCOMPLETE: If delivery not marked as received
|
||||
|
||||
Runs as internal scheduler with leader election.
|
||||
Runs as internal scheduler with leader election for horizontal scaling.
|
||||
Domain ownership: Procurement service owns all PO and delivery tracking.
|
||||
"""
|
||||
|
||||
@@ -30,7 +30,7 @@ class DeliveryTrackingService:
|
||||
Monitors PO deliveries and generates time-based alerts using EventPublisher.
|
||||
|
||||
Uses APScheduler with leader election to run hourly checks.
|
||||
Only one pod executes checks (others skip if not leader).
|
||||
Only one pod executes checks - leader election ensures no duplicate alerts.
|
||||
"""
|
||||
|
||||
def __init__(self, event_publisher: UnifiedEventPublisher, config, database_manager=None):
|
||||
@@ -38,46 +38,121 @@ class DeliveryTrackingService:
|
||||
self.config = config
|
||||
self.database_manager = database_manager
|
||||
self.scheduler = AsyncIOScheduler()
|
||||
self.is_leader = False
|
||||
self._leader_election = None
|
||||
self._redis_client = None
|
||||
self._scheduler_started = False
|
||||
self.instance_id = str(uuid4())[:8] # Short instance ID for logging
|
||||
|
||||
async def start(self):
|
||||
"""Start the delivery tracking scheduler"""
|
||||
# Initialize and start scheduler if not already running
|
||||
"""Start the delivery tracking scheduler with leader election"""
|
||||
try:
|
||||
# Initialize leader election
|
||||
await self._setup_leader_election()
|
||||
except Exception as e:
|
||||
logger.error("Failed to setup leader election, starting in standalone mode",
|
||||
error=str(e))
|
||||
# Fallback: start scheduler without leader election
|
||||
await self._start_scheduler()
|
||||
|
||||
async def _setup_leader_election(self):
|
||||
"""Setup Redis-based leader election for horizontal scaling"""
|
||||
from shared.leader_election import LeaderElectionService
|
||||
import redis.asyncio as redis
|
||||
|
||||
# Build Redis URL from config
|
||||
redis_url = getattr(self.config, 'REDIS_URL', None)
|
||||
if not redis_url:
|
||||
redis_password = getattr(self.config, 'REDIS_PASSWORD', '')
|
||||
redis_host = getattr(self.config, 'REDIS_HOST', 'localhost')
|
||||
redis_port = getattr(self.config, 'REDIS_PORT', 6379)
|
||||
redis_db = getattr(self.config, 'REDIS_DB', 0)
|
||||
redis_url = f"redis://:{redis_password}@{redis_host}:{redis_port}/{redis_db}"
|
||||
|
||||
self._redis_client = redis.from_url(redis_url, decode_responses=False)
|
||||
await self._redis_client.ping()
|
||||
|
||||
# Create leader election service
|
||||
self._leader_election = LeaderElectionService(
|
||||
self._redis_client,
|
||||
service_name="procurement-delivery-tracking"
|
||||
)
|
||||
|
||||
# Start leader election with callbacks
|
||||
await self._leader_election.start(
|
||||
on_become_leader=self._on_become_leader,
|
||||
on_lose_leader=self._on_lose_leader
|
||||
)
|
||||
|
||||
logger.info("Leader election initialized for delivery tracking",
|
||||
is_leader=self._leader_election.is_leader,
|
||||
instance_id=self.instance_id)
|
||||
|
||||
async def _on_become_leader(self):
|
||||
"""Called when this instance becomes the leader"""
|
||||
logger.info("Became leader for delivery tracking - starting scheduler",
|
||||
instance_id=self.instance_id)
|
||||
await self._start_scheduler()
|
||||
|
||||
async def _on_lose_leader(self):
|
||||
"""Called when this instance loses leadership"""
|
||||
logger.warning("Lost leadership for delivery tracking - stopping scheduler",
|
||||
instance_id=self.instance_id)
|
||||
await self._stop_scheduler()
|
||||
|
||||
    async def _start_scheduler(self):
        """Start the APScheduler with delivery tracking jobs"""
        if self._scheduler_started:
            logger.debug("Scheduler already started", instance_id=self.instance_id)
            return

        if not self.scheduler.running:
            # Add hourly job to check deliveries
            self.scheduler.add_job(
                self._check_all_tenants,
                trigger=CronTrigger(minute=30),  # Run every hour at :30 (00:30, 01:30, 02:30, etc.)
                id='hourly_delivery_check',
                name='Hourly Delivery Tracking',
                replace_existing=True,
                max_instances=1,  # Ensure no overlapping runs
                coalesce=True  # Combine missed runs
            )

            self.scheduler.start()
            self._scheduler_started = True

            # Log next run time
            next_run = self.scheduler.get_job('hourly_delivery_check').next_run_time
            logger.info(
                "Delivery tracking scheduler started with hourly checks",
                instance_id=self.instance_id,
                next_run=next_run.isoformat() if next_run else None
            )
        else:
            logger.info(
                "Delivery tracking scheduler already running",
                instance_id=self.instance_id
            )
    async def _stop_scheduler(self):
        """Stop the APScheduler"""
        if not self._scheduler_started:
            return

        if self.scheduler.running:
            self.scheduler.shutdown(wait=False)
            self._scheduler_started = False
            logger.info("Delivery tracking scheduler stopped", instance_id=self.instance_id)
    async def stop(self):
        """Stop the scheduler and leader election"""
        # Stop leader election first
        if self._leader_election:
            await self._leader_election.stop()
            logger.info("Leader election stopped", instance_id=self.instance_id)

        # Stop scheduler
        await self._stop_scheduler()

        # Close Redis
        if self._redis_client:
            await self._redis_client.close()

    @property
    def is_leader(self) -> bool:
        """Check if this instance is the leader"""
        return self._leader_election.is_leader if self._leader_election else True

    async def _check_all_tenants(self):
        """
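For a quick operational check of which replica currently owns the delivery-tracking scheduler, the leader lock introduced above can be read straight from Redis. A minimal sketch (not part of the commit), assuming a reachable Redis instance and the `leader:<service>:lock` key layout used by the shared `LeaderElectionService` added later in this commit; the URL is a placeholder:

```
import asyncio
import redis.asyncio as redis

async def show_current_leader(redis_url: str = "redis://localhost:6379/0") -> None:
    """Print the instance id currently holding the delivery-tracking leader lock."""
    client = redis.from_url(redis_url, decode_responses=True)
    try:
        holder = await client.get("leader:procurement-delivery-tracking:lock")
        ttl = await client.ttl("leader:procurement-delivery-tracking:lock")
        if holder:
            print(f"leader: {holder} (lock expires in ~{ttl}s)")
        else:
            print("no leader currently elected")
    finally:
        await client.close()

if __name__ == "__main__":
    asyncio.run(show_current_leader())
```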
@@ -46,6 +46,9 @@ class TrainingService(StandardFastAPIService):
        await setup_messaging()
        self.logger.info("Messaging setup completed")

        # Initialize Redis pub/sub for cross-pod WebSocket broadcasting
        await self._setup_websocket_redis()

        # Set up WebSocket event consumer (listens to RabbitMQ and broadcasts to WebSockets)
        success = await setup_websocket_event_consumer()
        if success:
@@ -53,8 +56,44 @@
        else:
            self.logger.warning("WebSocket event consumer setup failed")
    async def _setup_websocket_redis(self):
        """
        Initialize Redis pub/sub for WebSocket cross-pod broadcasting.

        CRITICAL FOR HORIZONTAL SCALING:
        Without this, WebSocket clients on Pod A won't receive events
        from training jobs running on Pod B.
        """
        try:
            from app.websocket.manager import websocket_manager
            from app.core.config import settings

            redis_url = settings.REDIS_URL
            success = await websocket_manager.initialize_redis(redis_url)

            if success:
                self.logger.info("WebSocket Redis pub/sub initialized for horizontal scaling")
            else:
                self.logger.warning(
                    "WebSocket Redis pub/sub failed to initialize. "
                    "WebSocket events will only be delivered to local connections."
                )

        except Exception as e:
            self.logger.error("Failed to setup WebSocket Redis pub/sub",
                              error=str(e))
            # Don't fail startup - WebSockets will work locally without Redis

    async def _cleanup_messaging(self):
        """Cleanup messaging for training service"""
        # Shutdown WebSocket Redis pub/sub
        try:
            from app.websocket.manager import websocket_manager
            await websocket_manager.shutdown()
            self.logger.info("WebSocket Redis pub/sub shutdown completed")
        except Exception as e:
            self.logger.warning("Error shutting down WebSocket Redis", error=str(e))

        await cleanup_websocket_consumers()
        await cleanup_messaging()

@@ -83,8 +122,44 @@ class TrainingService(StandardFastAPIService):
        system_metrics = SystemMetricsCollector("training")
        self.logger.info("System metrics collection started")

        # Recover stale jobs from previous pod crashes
        # This is important for horizontal scaling - jobs may be left in 'running'
        # state if a pod crashes. We mark them as failed so they can be retried.
        await self._recover_stale_jobs()

        self.logger.info("Training service startup completed")

    async def _recover_stale_jobs(self):
        """
        Recover stale training jobs on startup.

        When a pod crashes mid-training, jobs are left in 'running' or 'pending' state.
        This method finds jobs that haven't been updated in a while and marks them
        as failed so users can retry them.
        """
        try:
            from app.repositories.training_log_repository import TrainingLogRepository

            async with self.database_manager.get_session() as session:
                log_repo = TrainingLogRepository(session)

                # Recover jobs that haven't been updated in 60 minutes
                # This is conservative - most training jobs complete within 30 minutes
                recovered = await log_repo.recover_stale_jobs(stale_threshold_minutes=60)

                if recovered:
                    self.logger.warning(
                        "Recovered stale training jobs on startup",
                        recovered_count=len(recovered),
                        job_ids=[j.job_id for j in recovered]
                    )
                else:
                    self.logger.info("No stale training jobs to recover")

        except Exception as e:
            # Don't fail startup if recovery fails - just log the error
            self.logger.error("Failed to recover stale jobs on startup", error=str(e))

    async def on_shutdown(self, app: FastAPI):
        """Custom shutdown logic for training service"""
        await cleanup_training_database()
@@ -343,3 +343,165 @@ class TrainingLogRepository(TrainingBaseRepository):
                         job_id=job_id,
                         error=str(e))
            return None

    async def create_job_atomic(
        self,
        job_id: str,
        tenant_id: str,
        config: Dict[str, Any] = None
    ) -> tuple[Optional[ModelTrainingLog], bool]:
        """
        Atomically create a training job, respecting the unique constraint.

        This method uses INSERT ... ON CONFLICT to handle race conditions
        when multiple pods try to create a job for the same tenant simultaneously.
        The database constraint (idx_unique_active_training_per_tenant) ensures
        only one active job per tenant can exist.

        Args:
            job_id: Unique job identifier
            tenant_id: Tenant identifier
            config: Optional job configuration

        Returns:
            Tuple of (job, created):
            - If created: (new_job, True)
            - If conflict (existing active job): (existing_job, False)
            - If error: raises DatabaseError
        """
        try:
            # First, try to find an existing active job
            existing = await self.get_active_jobs(tenant_id=tenant_id)
            pending = await self.get_logs_by_tenant(tenant_id=tenant_id, status="pending", limit=1)

            if existing or pending:
                # Return existing job
                active_job = existing[0] if existing else pending[0]
                logger.info("Found existing active job, skipping creation",
                            existing_job_id=active_job.job_id,
                            tenant_id=tenant_id,
                            requested_job_id=job_id)
                return (active_job, False)

            # Try to create the new job
            # If another pod created one in the meantime, the unique constraint will prevent this
            log_data = {
                "job_id": job_id,
                "tenant_id": tenant_id,
                "status": "pending",
                "progress": 0,
                "current_step": "initializing",
                "config": config or {}
            }

            try:
                new_job = await self.create_training_log(log_data)
                await self.session.commit()
                logger.info("Created new training job atomically",
                            job_id=job_id,
                            tenant_id=tenant_id)
                return (new_job, True)
            except Exception as create_error:
                error_str = str(create_error).lower()
                # Check if this is a unique constraint violation
                if "unique" in error_str or "duplicate" in error_str or "constraint" in error_str:
                    await self.session.rollback()
                    # Another pod created a job, fetch it
                    logger.info("Unique constraint hit, fetching existing job",
                                tenant_id=tenant_id,
                                requested_job_id=job_id)
                    existing = await self.get_active_jobs(tenant_id=tenant_id)
                    pending = await self.get_logs_by_tenant(tenant_id=tenant_id, status="pending", limit=1)
                    if existing or pending:
                        active_job = existing[0] if existing else pending[0]
                        return (active_job, False)
                    # If still no job found, something went wrong
                    raise DatabaseError(f"Constraint violation but no active job found: {create_error}")
                else:
                    raise

        except DatabaseError:
            raise
        except Exception as e:
            logger.error("Failed to create job atomically",
                         job_id=job_id,
                         tenant_id=tenant_id,
                         error=str(e))
            raise DatabaseError(f"Failed to create training job atomically: {str(e)}")
    async def recover_stale_jobs(self, stale_threshold_minutes: int = 60) -> List[ModelTrainingLog]:
        """
        Find and mark stale running jobs as failed.

        This is used during service startup to clean up jobs that were
        running when a pod crashed. With multiple replicas, only stale
        jobs (not updated recently) should be marked as failed.

        Args:
            stale_threshold_minutes: Jobs not updated for this long are considered stale

        Returns:
            List of jobs that were marked as failed
        """
        try:
            stale_cutoff = datetime.now() - timedelta(minutes=stale_threshold_minutes)

            # Find running jobs that haven't been updated recently
            query = text("""
                SELECT id, job_id, tenant_id, status, updated_at
                FROM model_training_logs
                WHERE status IN ('running', 'pending')
                AND updated_at < :stale_cutoff
            """)

            result = await self.session.execute(query, {"stale_cutoff": stale_cutoff})
            stale_jobs = result.fetchall()

            recovered_jobs = []
            for row in stale_jobs:
                try:
                    # Mark as failed
                    update_query = text("""
                        UPDATE model_training_logs
                        SET status = 'failed',
                            error_message = :error_msg,
                            end_time = :end_time,
                            updated_at = :updated_at
                        WHERE id = :id AND status IN ('running', 'pending')
                    """)

                    await self.session.execute(update_query, {
                        "id": row.id,
                        "error_msg": f"Job recovered as failed - not updated since {row.updated_at.isoformat()}. Pod may have crashed.",
                        "end_time": datetime.now(),
                        "updated_at": datetime.now()
                    })

                    logger.warning("Recovered stale training job",
                                   job_id=row.job_id,
                                   tenant_id=str(row.tenant_id),
                                   last_updated=row.updated_at.isoformat() if row.updated_at else "unknown")

                    # Fetch the updated job to return
                    job = await self.get_by_job_id(row.job_id)
                    if job:
                        recovered_jobs.append(job)

                except Exception as job_error:
                    logger.error("Failed to recover individual stale job",
                                 job_id=row.job_id,
                                 error=str(job_error))

            if recovered_jobs:
                await self.session.commit()
                logger.info("Stale job recovery completed",
                            recovered_count=len(recovered_jobs),
                            stale_threshold_minutes=stale_threshold_minutes)

            return recovered_jobs

        except Exception as e:
            logger.error("Failed to recover stale jobs",
                         error=str(e))
            await self.session.rollback()
            return []
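The `(job, created)` tuple returned by `create_job_atomic` above lets callers treat a concurrent duplicate request as a benign no-op rather than an error. A minimal caller sketch, assuming a hypothetical request handler that already holds an async session; everything except `create_job_atomic` itself and the import path is illustrative:

```
import uuid

from app.repositories.training_log_repository import TrainingLogRepository


async def request_training(session, tenant_id: str) -> dict:
    """Hypothetical handler: start training for a tenant, tolerating concurrent requests."""
    repo = TrainingLogRepository(session)
    job, created = await repo.create_job_atomic(
        job_id=f"train-{uuid.uuid4().hex[:12]}",
        tenant_id=tenant_id,
    )
    if created:
        # This pod won the race; enqueue the actual training work here.
        return {"job_id": job.job_id, "status": job.status, "created": True}
    # Another pod (or an earlier request) already has an active job for this tenant.
    return {"job_id": job.job_id, "status": job.status, "created": False}
```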
@@ -1,10 +1,16 @@
"""
Distributed Locking Mechanisms
Prevents concurrent training jobs for the same product

HORIZONTAL SCALING FIX:
- Uses SHA256 for stable hash across all Python processes/pods
- Python's built-in hash() varies between processes due to hash randomization (Python 3.3+)
- This ensures all pods compute the same lock ID for the same lock name
"""

import asyncio
import time
import hashlib
from typing import Optional
import logging
from contextlib import asynccontextmanager
@@ -39,9 +45,20 @@ class DatabaseLock:
        self.lock_id = self._hash_lock_name(lock_name)

    def _hash_lock_name(self, name: str) -> int:
        """
        Convert lock name to integer ID for PostgreSQL advisory lock.

        CRITICAL: Uses SHA256 for stable hash across all Python processes/pods.
        Python's built-in hash() varies between processes due to hash randomization
        (PYTHONHASHSEED, enabled by default since Python 3.3), which would cause
        different pods to compute different lock IDs for the same lock name,
        defeating the purpose of distributed locking.
        """
        # Use SHA256 for stable, cross-process hash
        hash_bytes = hashlib.sha256(name.encode('utf-8')).digest()
        # Take first 4 bytes and convert to positive 31-bit integer
        # (PostgreSQL advisory locks use bigint, but we use 31-bit for safety)
        return int.from_bytes(hash_bytes[:4], 'big') % (2**31)

    @asynccontextmanager
    async def acquire(self, session: AsyncSession):
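The point of the SHA256 change can be checked in isolation: the derived lock ID is identical in every interpreter, whereas `hash()` is salted per process. A small standalone sketch (not part of the commit) reproducing the `_hash_lock_name` computation with an illustrative lock name:

```
import hashlib

def stable_lock_id(name: str) -> int:
    """Same derivation as DatabaseLock._hash_lock_name: first 4 SHA256 bytes, positive 31-bit int."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % (2**31)

if __name__ == "__main__":
    # Prints the same number in every process/pod for the same lock name.
    print(stable_lock_id("training:product:1234"))
    # By contrast, abs(hash(...)) % 2**31 changes whenever PYTHONHASHSEED changes
    # (it is randomized by default), so different pods would compute different IDs.
    print(abs(hash("training:product:1234")) % (2**31))
```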
@@ -1,21 +1,39 @@
"""
WebSocket Connection Manager for Training Service
Manages WebSocket connections and broadcasts RabbitMQ events to connected clients

HORIZONTAL SCALING:
- Uses Redis pub/sub for cross-pod WebSocket broadcasting
- Each pod subscribes to a Redis channel and broadcasts to its local connections
- Events published to Redis are received by all pods, ensuring clients on any
  pod receive events from training jobs running on any other pod
"""

import asyncio
import json
import os
from typing import Dict, Optional
from fastapi import WebSocket
import structlog

logger = structlog.get_logger()

# Redis pub/sub channel for WebSocket events
REDIS_WEBSOCKET_CHANNEL = "training:websocket:events"


class WebSocketConnectionManager:
    """
    WebSocket connection manager with Redis pub/sub for horizontal scaling.

    In a multi-pod deployment:
    1. Events are published to Redis pub/sub (not just local broadcast)
    2. Each pod subscribes to Redis and broadcasts to its local WebSocket connections
    3. This ensures clients connected to any pod receive events from any pod

    Flow:
    - RabbitMQ event → Pod A receives → Pod A publishes to Redis
    - Redis pub/sub → All pods receive → Each pod broadcasts to local WebSockets
    """
    def __init__(self):
@@ -24,6 +42,121 @@ class WebSocketConnectionManager:
        self._lock = asyncio.Lock()
        # Store latest event for each job to provide initial state
        self._latest_events: Dict[str, dict] = {}
        # Redis client for pub/sub
        self._redis: Optional[object] = None
        self._pubsub: Optional[object] = None
        self._subscriber_task: Optional[asyncio.Task] = None
        self._running = False
        self._instance_id = f"{os.environ.get('HOSTNAME', 'unknown')}:{os.getpid()}"

    async def initialize_redis(self, redis_url: str) -> bool:
        """
        Initialize Redis connection for cross-pod pub/sub.

        Args:
            redis_url: Redis connection URL

        Returns:
            True if successful, False otherwise
        """
        try:
            import redis.asyncio as redis_async

            self._redis = redis_async.from_url(redis_url, decode_responses=True)
            await self._redis.ping()

            # Create pub/sub subscriber
            self._pubsub = self._redis.pubsub()
            await self._pubsub.subscribe(REDIS_WEBSOCKET_CHANNEL)

            # Start subscriber task
            self._running = True
            self._subscriber_task = asyncio.create_task(self._redis_subscriber_loop())

            logger.info("Redis pub/sub initialized for WebSocket broadcasting",
                        instance_id=self._instance_id,
                        channel=REDIS_WEBSOCKET_CHANNEL)
            return True

        except Exception as e:
            logger.error("Failed to initialize Redis pub/sub",
                         error=str(e),
                         instance_id=self._instance_id)
            return False

    async def shutdown(self):
        """Shutdown Redis pub/sub connection"""
        self._running = False

        if self._subscriber_task:
            self._subscriber_task.cancel()
            try:
                await self._subscriber_task
            except asyncio.CancelledError:
                pass

        if self._pubsub:
            await self._pubsub.unsubscribe(REDIS_WEBSOCKET_CHANNEL)
            await self._pubsub.close()

        if self._redis:
            await self._redis.close()

        logger.info("Redis pub/sub shutdown complete",
                    instance_id=self._instance_id)

    async def _redis_subscriber_loop(self):
        """Background task to receive Redis pub/sub messages and broadcast locally"""
        try:
            while self._running:
                try:
                    message = await self._pubsub.get_message(
                        ignore_subscribe_messages=True,
                        timeout=1.0
                    )

                    if message and message['type'] == 'message':
                        await self._handle_redis_message(message['data'])

                except asyncio.CancelledError:
                    break
                except Exception as e:
                    logger.error("Error in Redis subscriber loop",
                                 error=str(e),
                                 instance_id=self._instance_id)
                    await asyncio.sleep(1)  # Backoff on error

        except asyncio.CancelledError:
            pass

        logger.info("Redis subscriber loop stopped",
                    instance_id=self._instance_id)

    async def _handle_redis_message(self, data: str):
        """Handle a message received from Redis pub/sub"""
        try:
            payload = json.loads(data)
            job_id = payload.get('job_id')
            message = payload.get('message')
            source_instance = payload.get('source_instance')

            if not job_id or not message:
                return

            # Log cross-pod message
            if source_instance != self._instance_id:
                logger.debug("Received cross-pod WebSocket event",
                             job_id=job_id,
                             source_instance=source_instance,
                             local_instance=self._instance_id)

            # Broadcast to local WebSocket connections
            await self._broadcast_local(job_id, message)

        except json.JSONDecodeError as e:
            logger.warning("Invalid JSON in Redis message", error=str(e))
        except Exception as e:
            logger.error("Error handling Redis message", error=str(e))
    async def connect(self, job_id: str, websocket: WebSocket) -> None:
        """Register a new WebSocket connection for a job"""
@@ -50,7 +183,8 @@ class WebSocketConnectionManager:
        logger.info("WebSocket connected",
                    job_id=job_id,
                    websocket_id=ws_id,
                    total_connections=len(self._connections[job_id]),
                    instance_id=self._instance_id)

    async def disconnect(self, job_id: str, websocket: WebSocket) -> None:
        """Remove a WebSocket connection"""
@@ -66,19 +200,56 @@ class WebSocketConnectionManager:
        logger.info("WebSocket disconnected",
                    job_id=job_id,
                    websocket_id=ws_id,
                    remaining_connections=len(self._connections.get(job_id, {})),
                    instance_id=self._instance_id)
    async def broadcast(self, job_id: str, message: dict) -> int:
        """
        Broadcast a message to all connections for a specific job across ALL pods.

        If Redis is configured, publishes to Redis pub/sub which then broadcasts
        to all pods. Otherwise, falls back to local-only broadcast.

        Returns the number of successful local broadcasts.
        """
        # Store the latest event for this job to provide initial state to new connections
        if message.get('type') != 'initial_state':  # Don't store initial_state messages
            self._latest_events[job_id] = message

        # If Redis is available, publish to Redis for cross-pod broadcast
        if self._redis:
            try:
                payload = json.dumps({
                    'job_id': job_id,
                    'message': message,
                    'source_instance': self._instance_id
                })
                await self._redis.publish(REDIS_WEBSOCKET_CHANNEL, payload)
                logger.debug("Published WebSocket event to Redis",
                             job_id=job_id,
                             message_type=message.get('type'),
                             instance_id=self._instance_id)
                # Return 0 here because the actual broadcast happens via subscriber
                # The count will be from _broadcast_local when the message is received
                return 0
            except Exception as e:
                logger.warning("Failed to publish to Redis, falling back to local broadcast",
                               error=str(e),
                               job_id=job_id)
                # Fall through to local broadcast

        # Local-only broadcast (when Redis is not available)
        return await self._broadcast_local(job_id, message)
    async def _broadcast_local(self, job_id: str, message: dict) -> int:
        """
        Broadcast a message to local WebSocket connections only.
        This is called either directly (no Redis) or from Redis subscriber.
        """
        if job_id not in self._connections:
            logger.debug("No active local connections for job",
                         job_id=job_id,
                         instance_id=self._instance_id)
            return 0

        connections = list(self._connections[job_id].values())
@@ -103,18 +274,27 @@ class WebSocketConnectionManager:
                self._connections[job_id].pop(ws_id, None)

        if successful_sends > 0:
            logger.info("Broadcasted message to local WebSocket clients",
                        job_id=job_id,
                        message_type=message.get('type'),
                        successful_sends=successful_sends,
                        failed_sends=len(failed_websockets),
                        instance_id=self._instance_id)

        return successful_sends

    def get_connection_count(self, job_id: str) -> int:
        """Get the number of active local connections for a job"""
        return len(self._connections.get(job_id, {}))

    def get_total_connection_count(self) -> int:
        """Get total number of active connections across all jobs"""
        return sum(len(conns) for conns in self._connections.values())

    def is_redis_enabled(self) -> bool:
        """Check if Redis pub/sub is enabled"""
        return self._redis is not None and self._running


# Global singleton instance
websocket_manager = WebSocketConnectionManager()
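Because every pod subscribes to `training:websocket:events`, any process that can reach Redis can fan an event out to all connected browsers, which is also a convenient way to smoke-test the cross-pod path. A minimal sketch (not part of the commit), assuming a reachable Redis; the job id and source instance are illustrative:

```
import asyncio
import json

import redis.asyncio as redis

async def publish_test_event(redis_url: str = "redis://localhost:6379/0") -> None:
    """Publish a fake progress event; every training pod should relay it to its local WebSockets."""
    client = redis.from_url(redis_url, decode_responses=True)
    try:
        payload = json.dumps({
            "job_id": "job-demo-123",          # illustrative job id
            "message": {"type": "progress", "progress": 42},
            "source_instance": "manual-test",  # not a real pod instance id
        })
        receivers = await client.publish("training:websocket:events", payload)
        print(f"delivered to {receivers} subscribed pod(s)")
    finally:
        await client.close()

if __name__ == "__main__":
    asyncio.run(publish_test_event())
```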
@@ -0,0 +1,60 @@
"""Add horizontal scaling constraints for multi-pod deployment

Revision ID: add_horizontal_scaling
Revises: 26a665cd5348
Create Date: 2025-01-18

This migration adds database-level constraints to prevent race conditions
when running multiple training service pods:

1. Partial unique index on model_training_logs to prevent duplicate active jobs per tenant
2. Index to speed up active job lookups
"""
from typing import Sequence, Union

from alembic import op
import sqlalchemy as sa


# revision identifiers, used by Alembic.
revision: str = 'add_horizontal_scaling'
down_revision: Union[str, None] = '26a665cd5348'
branch_labels: Union[str, Sequence[str], None] = None
depends_on: Union[str, Sequence[str], None] = None


def upgrade() -> None:
    # Add partial unique index to prevent duplicate active training jobs per tenant
    # This ensures only ONE job can be in 'pending' or 'running' status per tenant at a time
    # The constraint is enforced at the database level, preventing race conditions
    # between multiple pods checking and creating jobs simultaneously
    op.execute("""
        CREATE UNIQUE INDEX IF NOT EXISTS idx_unique_active_training_per_tenant
        ON model_training_logs (tenant_id)
        WHERE status IN ('pending', 'running')
    """)

    # Add index to speed up active job lookups (used by deduplication check)
    op.create_index(
        'idx_training_logs_tenant_status',
        'model_training_logs',
        ['tenant_id', 'status'],
        unique=False,
        if_not_exists=True
    )

    # Add index for job recovery queries (find stale running jobs)
    op.create_index(
        'idx_training_logs_status_updated',
        'model_training_logs',
        ['status', 'updated_at'],
        unique=False,
        if_not_exists=True
    )


def downgrade() -> None:
    # Remove the indexes in reverse order
    op.execute("DROP INDEX IF EXISTS idx_training_logs_status_updated")
    op.execute("DROP INDEX IF EXISTS idx_training_logs_tenant_status")
    op.execute("DROP INDEX IF EXISTS idx_unique_active_training_per_tenant")
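The partial index above is what makes `create_job_atomic` safe: two concurrent inserts of active jobs for the same tenant cannot both succeed, while any number of 'completed' or 'failed' rows can coexist. A throwaway sketch of that behaviour (not part of the commit), assuming a reachable Postgres DSN and that columns omitted here have defaults; it is meant to illustrate the constraint, not to run against a production schema:

```
import asyncio
import asyncpg

async def demo(dsn: str = "postgresql://user:pass@localhost:5432/training") -> None:
    conn = await asyncpg.connect(dsn)
    try:
        await conn.execute(
            "INSERT INTO model_training_logs (job_id, tenant_id, status) VALUES ($1, $2, 'pending')",
            "job-a", "tenant-1",
        )
        try:
            # Second active job for the same tenant is rejected by the partial unique index.
            await conn.execute(
                "INSERT INTO model_training_logs (job_id, tenant_id, status) VALUES ($1, $2, 'pending')",
                "job-b", "tenant-1",
            )
        except asyncpg.exceptions.UniqueViolationError:
            print("blocked by idx_unique_active_training_per_tenant, as intended")
    finally:
        await conn.close()

if __name__ == "__main__":
    asyncio.run(demo())
```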
33
shared/leader_election/__init__.py
Normal file
@@ -0,0 +1,33 @@
"""
Shared Leader Election for Bakery-IA platform

Provides Redis-based leader election for services that need to run
singleton scheduled tasks (APScheduler, background jobs, etc.)

Usage:
    from shared.leader_election import LeaderElectionService, SchedulerLeaderMixin

    # Option 1: Direct usage
    leader_election = LeaderElectionService(redis_client, "my-service")
    await leader_election.start(
        on_become_leader=start_scheduler,
        on_lose_leader=stop_scheduler
    )

    # Option 2: Mixin for services with APScheduler
    class MySchedulerService(SchedulerLeaderMixin):
        async def _create_scheduler_jobs(self):
            self.scheduler.add_job(...)
"""

from shared.leader_election.service import (
    LeaderElectionService,
    LeaderElectionConfig,
)
from shared.leader_election.mixin import SchedulerLeaderMixin

__all__ = [
    "LeaderElectionService",
    "LeaderElectionConfig",
    "SchedulerLeaderMixin",
]
209
shared/leader_election/mixin.py
Normal file
@@ -0,0 +1,209 @@
"""
Scheduler Leader Mixin

Provides a mixin class for services that use APScheduler and need
leader election for horizontal scaling.

Usage:
    class MySchedulerService(SchedulerLeaderMixin):
        def __init__(self, redis_url: str, service_name: str):
            super().__init__(redis_url, service_name)
            # Your initialization here

        async def _create_scheduler_jobs(self):
            '''Override to define your scheduled jobs'''
            self.scheduler.add_job(
                self.my_job,
                trigger=CronTrigger(hour=0),
                id='my_job'
            )

        async def my_job(self):
            # Your job logic here
            pass
"""

import asyncio
from typing import Optional
from abc import abstractmethod
import structlog

logger = structlog.get_logger()


class SchedulerLeaderMixin:
    """
    Mixin for services that use APScheduler with leader election.

    Provides automatic leader election and scheduler management.
    Only the leader pod will run scheduled jobs.
    """

    def __init__(self, redis_url: str, service_name: str, **kwargs):
        """
        Initialize the scheduler with leader election.

        Args:
            redis_url: Redis connection URL for leader election
            service_name: Unique service name for leader election lock
            **kwargs: Additional arguments passed to parent class
        """
        super().__init__(**kwargs)

        self._redis_url = redis_url
        self._service_name = service_name
        self._leader_election = None
        self._redis_client = None
        self.scheduler = None
        self._scheduler_started = False

    async def start_with_leader_election(self):
        """
        Start the service with leader election.

        Only the leader will start the scheduler.
        """
        from apscheduler.schedulers.asyncio import AsyncIOScheduler
        from shared.leader_election.service import LeaderElectionService
        import redis.asyncio as redis

        try:
            # Create Redis connection
            self._redis_client = redis.from_url(self._redis_url, decode_responses=False)
            await self._redis_client.ping()

            # Create scheduler (but don't start it yet)
            self.scheduler = AsyncIOScheduler()

            # Create leader election
            self._leader_election = LeaderElectionService(
                self._redis_client,
                self._service_name
            )

            # Start leader election with callbacks
            await self._leader_election.start(
                on_become_leader=self._on_become_leader,
                on_lose_leader=self._on_lose_leader
            )

            logger.info("Scheduler service started with leader election",
                        service=self._service_name,
                        is_leader=self._leader_election.is_leader,
                        instance_id=self._leader_election.instance_id)

        except Exception as e:
            logger.error("Failed to start with leader election, falling back to standalone",
                         service=self._service_name,
                         error=str(e))
            # Fallback: start scheduler anyway (for single-pod deployments)
            await self._start_scheduler_standalone()

    async def _on_become_leader(self):
        """Called when this instance becomes the leader"""
        logger.info("Became leader, starting scheduler",
                    service=self._service_name)
        await self._start_scheduler()

    async def _on_lose_leader(self):
        """Called when this instance loses leadership"""
        logger.warning("Lost leadership, stopping scheduler",
                       service=self._service_name)
        await self._stop_scheduler()

    async def _start_scheduler(self):
        """Start the scheduler with defined jobs"""
        if self._scheduler_started:
            logger.warning("Scheduler already started",
                           service=self._service_name)
            return

        try:
            # Let subclass define jobs
            await self._create_scheduler_jobs()

            # Start scheduler
            if not self.scheduler.running:
                self.scheduler.start()
                self._scheduler_started = True
                logger.info("Scheduler started",
                            service=self._service_name,
                            job_count=len(self.scheduler.get_jobs()))

        except Exception as e:
            logger.error("Failed to start scheduler",
                         service=self._service_name,
                         error=str(e))

    async def _stop_scheduler(self):
        """Stop the scheduler"""
        if not self._scheduler_started:
            return

        try:
            if self.scheduler and self.scheduler.running:
                self.scheduler.shutdown(wait=False)
                self._scheduler_started = False
                logger.info("Scheduler stopped",
                            service=self._service_name)

        except Exception as e:
            logger.error("Failed to stop scheduler",
                         service=self._service_name,
                         error=str(e))

    async def _start_scheduler_standalone(self):
        """Start scheduler without leader election (fallback mode)"""
        from apscheduler.schedulers.asyncio import AsyncIOScheduler

        logger.warning("Starting scheduler in standalone mode (no leader election)",
                       service=self._service_name)

        self.scheduler = AsyncIOScheduler()
        await self._create_scheduler_jobs()

        if not self.scheduler.running:
            self.scheduler.start()
            self._scheduler_started = True

    @abstractmethod
    async def _create_scheduler_jobs(self):
        """
        Override to define scheduled jobs.

        Example:
            self.scheduler.add_job(
                self.my_task,
                trigger=CronTrigger(hour=0, minute=30),
                id='my_task',
                max_instances=1
            )
        """
        pass

    async def stop(self):
        """Stop the scheduler and leader election"""
        # Stop leader election
        if self._leader_election:
            await self._leader_election.stop()

        # Stop scheduler
        await self._stop_scheduler()

        # Close Redis
        if self._redis_client:
            await self._redis_client.close()

        logger.info("Scheduler service stopped",
                    service=self._service_name)

    @property
    def is_leader(self) -> bool:
        """Check if this instance is the leader"""
        return self._leader_election.is_leader if self._leader_election else False

    def get_leader_status(self) -> dict:
        """Get leader election status"""
        if self._leader_election:
            return self._leader_election.get_status()
        return {"is_leader": True, "mode": "standalone"}
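A concrete subclass only has to implement `_create_scheduler_jobs`; election, standalone fallback, and shutdown come from the mixin. A minimal sketch of such a service (not part of the commit; the nightly-report job, class name, and Redis URL are illustrative):

```
import asyncio
from apscheduler.triggers.cron import CronTrigger

from shared.leader_election import SchedulerLeaderMixin


class NightlyReportService(SchedulerLeaderMixin):
    """Illustrative service: only the elected leader pod generates the nightly report."""

    async def _create_scheduler_jobs(self):
        self.scheduler.add_job(
            self.generate_report,
            trigger=CronTrigger(hour=2, minute=0),  # 02:00 every night
            id="nightly_report",
            replace_existing=True,
            max_instances=1,
        )

    async def generate_report(self):
        print("generating report on the leader pod only")


async def main():
    service = NightlyReportService(
        redis_url="redis://localhost:6379/0",  # placeholder
        service_name="nightly-report",
    )
    await service.start_with_leader_election()
    try:
        await asyncio.sleep(3600)  # keep the pod alive; normally the app's own loop does this
    finally:
        await service.stop()


if __name__ == "__main__":
    asyncio.run(main())
```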
352
shared/leader_election/service.py
Normal file
@@ -0,0 +1,352 @@
"""
Leader Election Service

Implements Redis-based leader election to ensure only ONE pod runs
singleton tasks like APScheduler jobs.

This is CRITICAL for horizontal scaling - without leader election,
each pod would run the same scheduled jobs, causing:
- Duplicate operations (forecasts, alerts, syncs)
- Database contention
- Inconsistent state
- Duplicate notifications

Implementation:
- Uses Redis SET NX (set if not exists) for atomic leadership acquisition
- Leader maintains leadership with periodic heartbeats
- If leader fails to heartbeat, another pod can take over
- Non-leader pods check periodically if they should become leader
"""

import asyncio
import os
import socket
from dataclasses import dataclass
from typing import Optional, Callable, Awaitable
import structlog

logger = structlog.get_logger()


@dataclass
class LeaderElectionConfig:
    """Configuration for leader election"""
    # Redis key prefix for the lock
    lock_key_prefix: str = "leader"
    # Lock expires after this many seconds without refresh
    lock_ttl_seconds: int = 30
    # Refresh lock every N seconds (should be < lock_ttl_seconds / 2)
    heartbeat_interval_seconds: int = 10
    # Non-leaders check for leadership every N seconds
    election_check_interval_seconds: int = 15


class LeaderElectionService:
    """
    Redis-based leader election service.

    Ensures only one pod runs scheduled tasks at a time across all replicas.
    """

    def __init__(
        self,
        redis_client,
        service_name: str,
        config: Optional[LeaderElectionConfig] = None
    ):
        """
        Initialize leader election service.

        Args:
            redis_client: Async Redis client instance
            service_name: Unique name for this service (used in Redis key)
            config: Optional configuration override
        """
        self.redis = redis_client
        self.service_name = service_name
        self.config = config or LeaderElectionConfig()
        self.lock_key = f"{self.config.lock_key_prefix}:{service_name}:lock"
        self.instance_id = self._generate_instance_id()
        self.is_leader = False
        self._heartbeat_task: Optional[asyncio.Task] = None
        self._election_task: Optional[asyncio.Task] = None
        self._running = False
        self._on_become_leader_callback: Optional[Callable[[], Awaitable[None]]] = None
        self._on_lose_leader_callback: Optional[Callable[[], Awaitable[None]]] = None

    def _generate_instance_id(self) -> str:
        """Generate unique instance identifier for this pod"""
        hostname = os.environ.get('HOSTNAME', socket.gethostname())
        pod_ip = os.environ.get('POD_IP', 'unknown')
        return f"{hostname}:{pod_ip}:{os.getpid()}"

    async def start(
        self,
        on_become_leader: Optional[Callable[[], Awaitable[None]]] = None,
        on_lose_leader: Optional[Callable[[], Awaitable[None]]] = None
    ):
        """
        Start leader election process.

        Args:
            on_become_leader: Async callback when this instance becomes leader
            on_lose_leader: Async callback when this instance loses leadership
        """
        self._on_become_leader_callback = on_become_leader
        self._on_lose_leader_callback = on_lose_leader
        self._running = True

        logger.info("Starting leader election",
                    service=self.service_name,
                    instance_id=self.instance_id,
                    lock_key=self.lock_key)

        # Try to become leader immediately
        await self._try_become_leader()

        # Start background tasks
        if self.is_leader:
            self._heartbeat_task = asyncio.create_task(self._heartbeat_loop())
        else:
            self._election_task = asyncio.create_task(self._election_loop())

    async def stop(self):
        """Stop leader election and release leadership if held"""
        self._running = False

        # Cancel background tasks
        if self._heartbeat_task:
            self._heartbeat_task.cancel()
            try:
                await self._heartbeat_task
            except asyncio.CancelledError:
                pass
            self._heartbeat_task = None

        if self._election_task:
            self._election_task.cancel()
            try:
                await self._election_task
            except asyncio.CancelledError:
                pass
            self._election_task = None

        # Release leadership
        if self.is_leader:
            await self._release_leadership()

        logger.info("Leader election stopped",
                    service=self.service_name,
                    instance_id=self.instance_id,
                    was_leader=self.is_leader)
    async def _try_become_leader(self) -> bool:
        """
        Attempt to become the leader.

        Returns:
            True if this instance is now the leader
        """
        try:
            # Try to set the lock with NX (only if not exists) and EX (expiry)
            acquired = await self.redis.set(
                self.lock_key,
                self.instance_id,
                nx=True,  # Only set if not exists
                ex=self.config.lock_ttl_seconds
            )

            if acquired:
                self.is_leader = True
                logger.info("Became leader",
                            service=self.service_name,
                            instance_id=self.instance_id)

                # Call callback
                if self._on_become_leader_callback:
                    try:
                        await self._on_become_leader_callback()
                    except Exception as e:
                        logger.error("Error in on_become_leader callback",
                                     service=self.service_name,
                                     error=str(e))

                return True

            # Check if we're already the leader (reconnection scenario)
            current_leader = await self.redis.get(self.lock_key)
            if current_leader:
                current_leader_str = current_leader.decode() if isinstance(current_leader, bytes) else current_leader
                if current_leader_str == self.instance_id:
                    self.is_leader = True
                    logger.info("Confirmed as existing leader",
                                service=self.service_name,
                                instance_id=self.instance_id)
                    return True
                else:
                    logger.debug("Another instance is leader",
                                 service=self.service_name,
                                 current_leader=current_leader_str,
                                 this_instance=self.instance_id)

            return False

        except Exception as e:
            logger.error("Failed to acquire leadership",
                         service=self.service_name,
                         instance_id=self.instance_id,
                         error=str(e))
            return False

    async def _release_leadership(self):
        """Release leadership lock"""
        try:
            # Only delete if we're the current leader
            current_leader = await self.redis.get(self.lock_key)
            if current_leader:
                current_leader_str = current_leader.decode() if isinstance(current_leader, bytes) else current_leader
                if current_leader_str == self.instance_id:
                    await self.redis.delete(self.lock_key)
                    logger.info("Released leadership",
                                service=self.service_name,
                                instance_id=self.instance_id)

            was_leader = self.is_leader
            self.is_leader = False

            # Call callback only if we were the leader
            if was_leader and self._on_lose_leader_callback:
                try:
                    await self._on_lose_leader_callback()
                except Exception as e:
                    logger.error("Error in on_lose_leader callback",
                                 service=self.service_name,
                                 error=str(e))

        except Exception as e:
            logger.error("Failed to release leadership",
                         service=self.service_name,
                         instance_id=self.instance_id,
                         error=str(e))

    async def _refresh_leadership(self) -> bool:
        """
        Refresh leadership lock TTL.

        Returns:
            True if leadership was maintained
        """
        try:
            # Verify we're still the leader
            current_leader = await self.redis.get(self.lock_key)
            if not current_leader:
                logger.warning("Lost leadership (lock expired)",
                               service=self.service_name,
                               instance_id=self.instance_id)
                return False

            current_leader_str = current_leader.decode() if isinstance(current_leader, bytes) else current_leader
            if current_leader_str != self.instance_id:
                logger.warning("Lost leadership (lock held by another instance)",
                               service=self.service_name,
                               instance_id=self.instance_id,
                               current_leader=current_leader_str)
                return False

            # Refresh the TTL
            await self.redis.expire(self.lock_key, self.config.lock_ttl_seconds)
            return True

        except Exception as e:
            logger.error("Failed to refresh leadership",
                         service=self.service_name,
                         instance_id=self.instance_id,
                         error=str(e))
            return False

    async def _heartbeat_loop(self):
        """Background loop to maintain leadership"""
        while self._running and self.is_leader:
            try:
                await asyncio.sleep(self.config.heartbeat_interval_seconds)

                if not self._running:
                    break

                maintained = await self._refresh_leadership()

                if not maintained:
                    self.is_leader = False

                    # Call callback
                    if self._on_lose_leader_callback:
                        try:
                            await self._on_lose_leader_callback()
                        except Exception as e:
                            logger.error("Error in on_lose_leader callback",
                                         service=self.service_name,
                                         error=str(e))

                    # Switch to election loop
                    self._election_task = asyncio.create_task(self._election_loop())
                    break

            except asyncio.CancelledError:
                break
            except Exception as e:
                logger.error("Error in heartbeat loop",
                             service=self.service_name,
                             instance_id=self.instance_id,
                             error=str(e))

    async def _election_loop(self):
        """Background loop to attempt leadership acquisition"""
        while self._running and not self.is_leader:
            try:
                await asyncio.sleep(self.config.election_check_interval_seconds)

                if not self._running:
                    break

                acquired = await self._try_become_leader()

                if acquired:
                    # Switch to heartbeat loop
                    self._heartbeat_task = asyncio.create_task(self._heartbeat_loop())
                    break

            except asyncio.CancelledError:
                break
            except Exception as e:
                logger.error("Error in election loop",
                             service=self.service_name,
                             instance_id=self.instance_id,
                             error=str(e))

    def get_status(self) -> dict:
        """Get current leader election status"""
        return {
            "service": self.service_name,
            "instance_id": self.instance_id,
            "is_leader": self.is_leader,
            "running": self._running,
            "lock_key": self.lock_key,
            "config": {
                "lock_ttl_seconds": self.config.lock_ttl_seconds,
                "heartbeat_interval_seconds": self.config.heartbeat_interval_seconds,
                "election_check_interval_seconds": self.config.election_check_interval_seconds
            }
        }

    async def get_current_leader(self) -> Optional[str]:
        """Get the current leader instance ID (if any)"""
        try:
            current_leader = await self.redis.get(self.lock_key)
            if current_leader:
                return current_leader.decode() if isinstance(current_leader, bytes) else current_leader
            return None
        except Exception as e:
            logger.error("Failed to get current leader",
                         service=self.service_name,
                         error=str(e))
            return None
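To see the election behave end to end, two candidates can be pointed at the same Redis and the same service name; only one acquires `leader:demo-service:lock`. A minimal sketch (not part of the commit), assuming a reachable local Redis; in a real deployment each pod derives a distinct instance id from HOSTNAME/POD_IP/PID automatically, so the explicit overrides below exist only because both candidates run in one process:

```
import asyncio
import redis.asyncio as redis

from shared.leader_election import LeaderElectionService

async def demo(redis_url: str = "redis://localhost:6379/0") -> None:
    """Run two election candidates against one Redis; exactly one should win."""
    client_a = redis.from_url(redis_url, decode_responses=False)
    client_b = redis.from_url(redis_url, decode_responses=False)

    a = LeaderElectionService(client_a, "demo-service")
    b = LeaderElectionService(client_b, "demo-service")
    # Force distinct ids because both candidates share this process's hostname and PID.
    a.instance_id = "candidate-a"
    b.instance_id = "candidate-b"

    await a.start()
    await b.start()
    await asyncio.sleep(1)

    print("A leader:", a.is_leader, "| B leader:", b.is_leader)
    print("current holder:", await a.get_current_leader())

    await a.stop()
    await b.stop()
    await client_a.close()
    await client_b.close()

if __name__ == "__main__":
    asyncio.run(demo())
```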