# EKB - AWS EKS (Services and Infrastructure) Guide

> Architecture overview, component descriptions, and data flow for the EKB services infrastructure on AWS EKS.

This guide describes the AWS architecture of the EKB services infrastructure: its components, how they fit together, and how data flows between them. The architecture follows cloud-native patterns with AWS managed services, Kubernetes orchestration, and automated scaling.

***

# AWS Architecture Overview

***

## Architecture Diagram

```mermaid
flowchart TD
    INTERNET["🌐 INTERNET"]
 
    DNS["DNS Provider\n(Route 53 / Cloudflare / etc.)\n──────────────────────────────\n• app.your-domain — Web Frontend\n• api.your-domain — FastAPI Backend\n• automations.your-domain — Automator\n• supabase.your-domain — Supabase Kong\n• signoz.your-domain — SigNoz"]
 
    ALB["APPLICATION LOAD BALANCER\n──────────────────────────────\n• SSL Termination via ACM\n• HTTP → HTTPS Redirect\n• Health Checks & Target Groups\n• Routing by hostname/path"]
 
    subgraph VPC["VPC"]
        subgraph PUB["Public Subnets (3 AZs)"]
            NAT["NAT Gateway × 3"]
        end
        subgraph PRIV["Private Subnets (3 AZs)"]
            NODES["EKS Worker Nodes"]
        end
    end
 
    subgraph EKS["EKS CLUSTER"]
        KARPENTER["KARPENTER — Dynamic Node Provisioning\n──────────────────────────────\n• Spot prioritisation & consolidation\n• Interruption handling via SQS\n• Node classes: general, compute-intensive, database"]
        KEDA["KEDA — Event-driven Pod Autoscaling\n──────────────────────────────\n• CPU / Memory threshold scaling\n• Fast scale-down stabilisation"]
        PODS["APPLICATION PODS\n──────────────────────────────\n• Web Frontend (port 3000)\n• FastAPI Backend (port 8001)\n• Celery Workers\n• Automator (port 80)\n• Supabase services (Kong, Auth, Storage…)\n• PostgreSQL Automator (port 5432)"]
        SIGNOZ["OBSERVABILITY — SigNoz (optional)\n──────────────────────────────\n• Distributed tracing, metrics, logs\n• k8s-infra agent for cluster metrics"]
    end
 
    subgraph DATA["DATA LAYER"]
        REDIS["ELASTICACHE\n(Redis)\n──────────\n• Encryption\n• Multi-AZ\n• Port 6379"]
        MQ["AMAZON MQ\n(RabbitMQ)\n──────────\n• Async queue\n• AMQP/SSL\n• Port 5671"]
        SUPABASE["SUPABASE\n(DB / Auth)\n──────────\nOption A: Cloud\nOption B: Self-hosted\nvia CNPG HA on EKS"]
    end
 
    INTERNET --> DNS
    DNS --> ALB
    ALB --> VPC
    PUB --> PRIV
    PRIV --> EKS
    EKS --> DATA
```

<Info>
  Self-hosted Supabase uses a CloudNativePG-managed HA PostgreSQL cluster (`ha-supabase-db`) with PgBouncer pooling, MinIO for object storage, and the full Supabase application stack deployed via `helm-deployment/supabase-kubernetes-ha`. Supabase Cloud is the alternative if self-hosting is not required.
</Info>

***

## Key Components

### 1. Networking Layer

#### DNS Provider

* **Purpose:** Domain name resolution; works with any provider (Route 53, Cloudflare, etc.)
* **Domains** (use your own domain, e.g. `example.com`):
  * `app.example.com` — Web Frontend
  * `api.example.com` — FastAPI Backend
  * `automations.example.com` — Automator Service
  * `supabase.example.com` — Supabase Kong (self-hosted only)
  * `signoz.example.com` — SigNoz observability (optional)
* **SSL Validation:** CNAME records required for ACM DNS validation

#### Application Load Balancer (ALB)

* **Purpose:** SSL termination, load balancing, and hostname-based routing
* **Features:**
  * SSL/TLS termination using ACM certificates (wildcard or per-service)
  * HTTP → HTTPS redirect
  * Health checks for all target groups
* **Managed by:** AWS Load Balancer Controller (Helm chart in `helm-deployment/infrastructure`)
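
As an illustration of the hostname-based routing described above, the AWS Load Balancer Controller can build the ALB from a Kubernetes Ingress. The manifest below is a minimal sketch and not the project's actual ingress definition: the Ingress name, the Service names (`web-frontend`, `fastapi-backend`), and the health check path are assumptions, while the hostnames and ports come from this guide.

```yaml
# Hypothetical sketch: resource names and Service names are assumptions.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-ingress                 # hypothetical name
  namespace: default
  annotations:
    # Provision an internet-facing ALB and register pod IPs as targets.
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    # Assumed health endpoint; each service may expose a different path.
    alb.ingress.kubernetes.io/healthcheck-path: /
spec:
  ingressClassName: alb
  rules:
    # One rule per hostname; the ALB routes on the Host header.
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web-frontend      # assumed Service name
                port:
                  number: 3000
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: fastapi-backend   # assumed Service name
                port:
                  number: 8001
```

Each `host` rule becomes a listener rule with its own target group, which is where the per-service health checks listed above apply.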

#### VPC

* **CIDR:** Environment-specific (e.g. `10.x.0.0/16`)
* **Availability Zones:** 3 AZs in the chosen region
* **Subnets:** 3 public (NAT Gateways) + 3 private (EKS nodes)
* **Outbound:** NAT Gateway per AZ for node egress

***

### 2. Compute Layer

#### EKS Cluster

* **Version:** Kubernetes 1.33
* **System Node Group:** Managed node group that hosts the Karpenter controller, so the controller never runs on nodes it provisions itself (per AWS best practice)
* **Add-ons:** EBS CSI Driver, AWS Load Balancer Controller, CoreDNS, kube-proxy

#### Karpenter — Dynamic Node Provisioning

* **Purpose:** Just-in-time node provisioning and cost optimisation
* **Node Classes:**
  * **General Purpose:** Spot instances for most workloads
  * **Compute Intensive:** High-CPU instances for CPU-bound tasks
  * **Memory Intensive:** Memory-optimised instances for large datasets
  * **Database:** On-demand instances for stateful/database workloads
  * **GPU:** GPU instances for AI/ML workloads (optional)
* **Features:** Spot prioritisation, automatic consolidation, SQS-based interruption handling
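
In Karpenter's own terms, each node class above would typically correspond to a `NodePool` that references an `EC2NodeClass`. The sketch below shows how a general-purpose pool with Spot prioritisation and consolidation might look. It assumes the Karpenter `v1` API (older releases use `karpenter.sh/v1beta1` with different policy names), and the pool name, the `EC2NodeClass` name, the consolidation delay, and the CPU limit are all illustrative values, not taken from this project.

```yaml
# Hypothetical sketch, assuming the Karpenter v1 API.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose            # hypothetical name
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default              # assumed EC2NodeClass, defined separately
      requirements:
        # With both capacity types allowed, Karpenter favours Spot
        # and falls back to On-Demand when Spot is unavailable.
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
  disruption:
    # Remove empty nodes and repack underutilised ones.
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m           # illustrative delay
  limits:
    cpu: "100"                     # illustrative cap on total provisioned vCPU
```

Interruption handling is not part of the `NodePool`; it is a controller-level setting that points Karpenter at the SQS queue.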

#### KEDA — Kubernetes Event-Driven Autoscaling

* **Purpose:** Horizontal pod autoscaling based on resource metrics
* **Targets:**

| Service         | Replicas | CPU Threshold | Memory Threshold |
| --------------- | -------- | ------------- | ---------------- |
| Web Frontend    | 2–8      | 60%           | 80%              |
| FastAPI Backend | 2–10     | 70%           | 80%              |
| Celery Workers  | 2–8      | 70%           | 80%              |
| Automator       | 2–8      | 70%           | 80%              |

* **Scale-down:** 30-second stabilisation window, so surplus replicas are removed shortly after load drops
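
The FastAPI Backend row of the table could be expressed as a KEDA `ScaledObject` along these lines. This is a sketch under assumptions rather than the chart's actual template: the object name and the target Deployment name (`fastapi-backend`) are hypothetical, while the replica range, thresholds, and 30-second scale-down window come from this guide.

```yaml
# Hypothetical sketch: object and Deployment names are assumptions.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: fastapi-backend            # hypothetical name
  namespace: default
spec:
  scaleTargetRef:
    name: fastapi-backend          # assumed Deployment name
  minReplicaCount: 2
  maxReplicaCount: 10
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          # Wait 30 seconds of sustained low load before removing replicas.
          stabilizationWindowSeconds: 30
  triggers:
    # Scale out when average CPU utilisation exceeds 70% of requests.
    - type: cpu
      metricType: Utilization
      metadata:
        value: "70"
    # Scale out when average memory utilisation exceeds 80% of requests.
    - type: memory
      metricType: Utilization
      metadata:
        value: "80"
```

Utilisation is measured relative to container resource requests, so the target pods need CPU and memory requests set for these triggers to work.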

***

### 3. Application Services

#### Web Frontend

* **Port:** 3000
* **Replicas:** 2–8 (KEDA-managed)
* **Purpose:** React application serving the user interface

#### FastAPI Backend

* **Port:** 8001
* **Replicas:** 2–10 (KEDA-managed)
* **Purpose:** REST API server handling business logic and data access

#### Celery Workers

* **Replicas:** 2–8 (KEDA-managed)
* **Purpose:** Background task processing (queued via RabbitMQ)

#### Automator Service

* **Port:** 80
* **Replicas:** 2–8 (KEDA-managed)
* **Purpose:** Workflow automation and orchestration

#### Supabase Kong

* **Port:** 8000 (internal cluster service)
* **Purpose:** API gateway for all Supabase services
* **Routing:** External traffic reaches Kong via the ALB ingress defined in `odin-services/main-ingress.yaml`

#### SigNoz (optional)

* **Namespace:** `monitoring`
* **Components:** SigNoz platform + k8s-infra DaemonSet agent
* **Purpose:** Distributed tracing, metrics aggregation, log management
* **Enabled by:** `ENABLE_SIGNOZ=true`

***

### 4. Data Layer

#### ElastiCache Redis

* **Purpose:** Caching, session storage, and Celery broker/result backend
* **Configuration:**
  * Node type: configurable (e.g. `cache.t3.micro`)
  * Port: `6379`
  * Encryption at-rest and in-transit
  * Multi-AZ for high availability
* **Enabled by:** `ENABLE_AWS_SERVICES=true`

#### Amazon MQ (RabbitMQ)

* **Purpose:** Message queuing for asynchronous task processing
* **Configuration:**
  * Engine: RabbitMQ
  * Ports: `5671` (AMQP/SSL), `15671` (Management/SSL)
  * Deployment mode: single-instance or cluster (Multi-AZ); Amazon MQ offers active/standby only for the ActiveMQ engine
* **Enabled by:** `ENABLE_AWS_SERVICES=true`

#### Supabase — Option A: Cloud (managed)

* **Purpose:** External managed PostgreSQL, Auth, Storage, and Realtime
* **Connection:** Supabase project URL and service role key configured in `values/odin-services.yaml`

#### Supabase — Option B: Self-hosted on EKS

* **Purpose:** Full Supabase stack running inside the cluster
* **Components:**
  * CloudNativePG operator (`cnpg-system`) — manages the Postgres cluster lifecycle
  * HA Supabase DB (`ha-supabase-db`) — CloudNativePG Cluster resource with PgBouncer pooler
  * Supabase application (`supabase` namespace) — Kong, Auth, Storage (MinIO), Meta, Rest, Realtime, Studio
* **Deployment order:** CloudNativePG → HA Supabase DB → Supabase app
* **Enabled by:** `ENABLE_CNPG=true`, `ENABLE_HA_SUPABASE_DB=true`, `ENABLE_SUPABASE=true`
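
To make the second component concrete, a CloudNativePG `Cluster` with a PgBouncer `Pooler` in front of it might be declared roughly as follows. This is an illustrative sketch, not the contents of the `ha-supabase-db` chart: the instance counts, storage size, pooler name, and pool mode are assumptions.

```yaml
# Hypothetical sketch: sizes, counts, and the pooler name are assumptions.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: ha-supabase-db
  namespace: ha-supabase-db
spec:
  instances: 3                     # one primary plus two replicas (assumed)
  storage:
    size: 20Gi                     # illustrative volume size
---
apiVersion: postgresql.cnpg.io/v1
kind: Pooler
metadata:
  name: ha-supabase-db-pooler-rw   # hypothetical name
  namespace: ha-supabase-db
spec:
  cluster:
    name: ha-supabase-db           # must match the Cluster above
  instances: 2                     # PgBouncer replicas (assumed)
  type: rw                         # route connections to the primary
  pgbouncer:
    poolMode: transaction          # assumed pooling mode
```

The operator must already be running before these resources are applied, which is why the deployment order starts with CloudNativePG.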

#### PostgreSQL Automator

* **Purpose:** Local PostgreSQL database for the Automator service
* **Port:** 5432
* **Storage:** EBS persistent volume
* **Node affinity:** Database-dedicated nodes
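
One way to combine an EBS-backed volume with database-dedicated nodes is a `StatefulSet` that pairs a volume claim template with a node selector and a matching toleration. The manifest below is a sketch under assumptions: the resource names, the `workload-type: database` label and taint, the `gp3` StorageClass, the image tag, and the Secret name are all hypothetical and may differ from the actual chart.

```yaml
# Hypothetical sketch: all names, labels, and sizes are assumptions.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres-automator
spec:
  serviceName: postgres-automator
  replicas: 1
  selector:
    matchLabels:
      app: postgres-automator
  template:
    metadata:
      labels:
        app: postgres-automator
    spec:
      # Schedule only onto nodes labelled for database workloads...
      nodeSelector:
        workload-type: database
      # ...and tolerate the taint that keeps other pods off those nodes.
      tolerations:
        - key: workload-type
          operator: Equal
          value: database
          effect: NoSchedule
      containers:
        - name: postgres
          image: postgres:16             # assumed image
          ports:
            - containerPort: 5432
          env:
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-automator-credentials   # hypothetical Secret
                  key: password
            # Use a subdirectory so the volume's lost+found does not block initdb.
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3            # assumed EBS-backed StorageClass
        resources:
          requests:
            storage: 20Gi                # illustrative size
```

Because an EBS volume lives in a single Availability Zone, the pod can only be rescheduled onto a database node in the same zone as its volume.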

***

### 5. Security & IAM

#### IAM Roles

| Role                              | Purpose                                     |
| --------------------------------- | ------------------------------------------- |
| EKS Cluster Role                  | Cluster-level API permissions               |
| Node Group Role                   | EC2 node permissions (ECR, SSM, networking) |
| Karpenter Controller Role         | EC2 provisioning, SQS interruption queue    |
| AWS Load Balancer Controller Role | ELBv2 and EC2 management                    |
| EBS CSI Driver Role               | EBS volume lifecycle management             |

Role names follow the pattern `<env-name>-<component>` and are created by the EKS Terraform module.

#### Security Groups

* **ALB:** Auto-created by AWS Load Balancer Controller (80/443 inbound)
* **EKS Cluster:** Node-to-node and pod communication
* **Redis:** Port 6379 from VPC CIDR only
* **RabbitMQ:** Ports 5671, 15671 from VPC CIDR only

#### SSL/TLS

* **Termination:** ALB level (pods see plain HTTP internally)
* **Certificates:** ACM certificates — either per-service or a single wildcard
* **Validation:** DNS CNAME validation via your DNS provider
* **Minimum protocol:** TLS 1.2
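
With the AWS Load Balancer Controller, these TLS settings are usually expressed as Ingress annotations. The fragment below is a sketch of the `metadata` section only: the certificate ARN is a placeholder, and the chosen security policy is one example of an ELB policy that enforces TLS 1.2 as the minimum, not necessarily the one this project uses.

```yaml
# Hypothetical Ingress metadata fragment: ARN and policy choice are assumptions.
metadata:
  annotations:
    # Listen on both ports so HTTP requests can be redirected.
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS": 443}]'
    alb.ingress.kubernetes.io/ssl-redirect: "443"
    # ACM certificate (wildcard or per-service); placeholder ARN.
    alb.ingress.kubernetes.io/certificate-arn: "arn:aws:acm:<region>:<account-id>:certificate/<certificate-id>"
    # Example policy allowing TLS 1.2 and TLS 1.3 only.
    alb.ingress.kubernetes.io/ssl-policy: ELBSecurityPolicy-TLS13-1-2-2021-06
```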

***

### 6. Infrastructure as Code

#### Terraform Modules

* **`modules/eks`:** EKS cluster, VPC, node groups, Karpenter, IAM, Helm releases
* **State:** S3 bucket with versioning, DynamoDB lock table

#### Terragrunt

* **Environment isolation:** One directory per environment under `terragrunt/environments/`
* **Template:** `env-template-folder` — copy and fill placeholders to create a new environment
* **DRY configuration:** Shared `root.hcl` with per-environment overrides
* **Enable/disable flags:** Services toggled via environment variables (`ENABLE_*`)

#### Helm Charts

| Chart                    | Namespace        | Description                           |
| ------------------------ | ---------------- | ------------------------------------- |
| `infrastructure`         | `infrastructure` | ALB Controller                        |
| `odin-services`          | `default`        | Web, API, Workers, Automator, Ingress |
| `aws-ebs-csi-driver`     | `kube-system`    | EBS volume provisioning               |
| `keda`                   | `keda`           | Pod autoscaling                       |
| `cloudnative-pg`         | `cnpg-system`    | PostgreSQL operator                   |
| `ha-supabase-db`         | `ha-supabase-db` | HA Postgres cluster + PgBouncer       |
| `supabase-kubernetes-ha` | `supabase`       | Full Supabase stack                   |
| `signoz`                 | `monitoring`     | Observability platform                |
| `k8s-infra`              | `monitoring`     | Cluster metrics agent                 |

***

## Data Flow

### 1. User Request Flow

```
User → DNS → ALB (SSL termination) → EKS Pod (Web Frontend)
                                   → EKS Pod (FastAPI Backend)
                                   → EKS Pod (Supabase Kong)
```

1. User accesses `app.example.com`
2. DNS resolves to the ALB
3. ALB terminates SSL and routes by hostname to the correct target group
4. Web Frontend serves the React app and makes API calls to `api.example.com`
5. FastAPI Backend processes requests and reads/writes to data services

### 2. API Request Flow

```
Client → ALB → FastAPI Backend → Redis (cache) / RabbitMQ (queue) / Supabase (DB)
```

1. Client calls `api.example.com`
2. ALB routes to FastAPI pod
3. Backend checks Redis cache; on miss, queries Supabase database
4. Async tasks are enqueued in RabbitMQ and processed by Celery Workers

### 3. Background Processing Flow

```
FastAPI → RabbitMQ → Celery Worker → Supabase DB
```

1. FastAPI enqueues a task in RabbitMQ
2. Celery Worker dequeues and processes the task
3. Results are written back to the Supabase database

### 4. Automator Workflow

```
Automator → PostgreSQL (local) → Redis → External APIs
```

1. Automator receives a workflow request
2. Workflow state is persisted in the local PostgreSQL instance
3. Redis caches intermediate results
4. External APIs are called as part of the automation

### 5. Scaling Flow

```
Metrics → KEDA → Pod scaling → Karpenter → Node provisioning
```

1. KEDA evaluates CPU/Memory metrics against configured thresholds
2. Pods are scaled horizontally within the configured replica range
3. If cluster capacity is insufficient, Karpenter provisions new EC2 nodes (preferring Spot)
4. When load drops, KEDA scales pods down; Karpenter consolidates and terminates idle nodes

### 6. Security Flow

```
Internet → ALB (TLS 1.2+, ACM) → Security Groups → Pods → IAM IRSA roles → AWS APIs
```

1. All external traffic terminates TLS at the ALB
2. Security groups enforce least-privilege network access
3. Pods communicate with AWS services via IRSA (IAM Roles for Service Accounts)
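
IRSA binds an IAM role to a Kubernetes ServiceAccount through an annotation, and pods that use the ServiceAccount receive temporary credentials for that role. The sketch below uses the EBS CSI driver's controller ServiceAccount as an example; the account ID is a placeholder and the role name simply follows the `<env-name>-<component>` pattern described earlier, so treat the exact ARN as an assumption.

```yaml
# Hypothetical sketch: the role ARN is a placeholder.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ebs-csi-controller-sa
  namespace: kube-system
  annotations:
    # Pods using this ServiceAccount assume the referenced IAM role.
    eks.amazonaws.com/role-arn: "arn:aws:iam::<account-id>:role/<env-name>-<component>"
```

The role's trust policy must allow the cluster's OIDC provider and this specific ServiceAccount, which keeps the permissions scoped to one workload.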

***

## High Availability Summary

| Feature             | Implementation                                                       |
| ------------------- | -------------------------------------------------------------------- |
| Multi-AZ deployment | 3 AZs for EKS nodes, Redis, subnets                                  |
| Load balancing      | ALB with multiple target groups                                      |
| Pod redundancy      | Minimum 2 replicas per service                                       |
| Database HA         | CloudNativePG cluster with PgBouncer (self-hosted) or Supabase Cloud |
| Cache redundancy    | ElastiCache Multi-AZ                                                 |
| Node autoscaling    | Karpenter with Spot + On-Demand mix                                  |
| Pod autoscaling     | KEDA CPU/Memory-based                                                |
| Observability       | SigNoz (optional)                                                    |
| State management    | S3 with versioning + DynamoDB lock                                   |

***

## Cost Optimisation

* **Spot Instances:** Karpenter prioritises Spot for all non-database workloads
* **Node consolidation:** Karpenter automatically reclaims underutilised nodes
* **Pod scale-down:** KEDA reduces replica counts (down to each service's minimum) during quiet periods
* **On-Demand only where needed:** Database node class uses On-Demand for stability
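
The On-Demand restriction for database workloads can be sketched as a dedicated Karpenter `NodePool`. As with any example here, this assumes the Karpenter `v1` API, and the pool name, the `workload-type: database` label and taint, the `EC2NodeClass` name, and the consolidation delay are hypothetical.

```yaml
# Hypothetical sketch, assuming the Karpenter v1 API.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: database                   # hypothetical name
spec:
  template:
    metadata:
      labels:
        workload-type: database    # hypothetical label for pod node selectors
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default              # assumed EC2NodeClass, defined separately
      # Keep general workloads off these nodes unless they tolerate the taint.
      taints:
        - key: workload-type
          value: database
          effect: NoSchedule
      requirements:
        # On-Demand only: no Spot interruptions for stateful pods.
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
  disruption:
    # Only reclaim nodes that are completely empty.
    consolidationPolicy: WhenEmpty
    consolidateAfter: 5m           # illustrative delay
```

Restricting consolidation to empty nodes trades some cost savings for fewer database pod evictions.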

***

## Maintenance & Operations

### Deployment Process

```bash
cd terragrunt/environments/<your-env-name>
# Set required ENABLE_* and domain/certificate environment variables
terragrunt apply
# Rolling updates: re-apply after updating image tags or values files
```

See [Terragrunt Deployment Guide](/on-premise/kubernetes-deployment/terragrunt-deployment.mdx) for the full deployment sequence.

### Backup Strategy

* **EBS Snapshots:** Automated snapshots for persistent volumes (Automator DB, Supabase MinIO)
* **CloudNativePG:** Continuous WAL archiving + scheduled base backups (if configured)
* **Supabase Cloud:** Managed daily backups (cloud option)
* **IaC state:** S3 versioned bucket
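
For the CloudNativePG item, WAL archiving and scheduled base backups are configured on the `Cluster` and through a `ScheduledBackup` resource. The sketch below assumes the in-tree Barman object store configuration and an S3 bucket reachable through the pod's IAM role; the bucket path, retention period, schedule, and sizes are placeholders. Newer CloudNativePG releases are moving object-store backups to a plugin, so check the operator version in use before relying on these fields.

```yaml
# Hypothetical sketch: bucket, retention, schedule, and sizes are assumptions.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: ha-supabase-db
spec:
  instances: 3                     # assumed
  storage:
    size: 20Gi                     # illustrative
  backup:
    retentionPolicy: "30d"         # illustrative retention window
    barmanObjectStore:
      destinationPath: "s3://<backup-bucket>/ha-supabase-db"
      s3Credentials:
        inheritFromIAMRole: true   # use the pod's IAM role instead of static keys
      wal:
        compression: gzip          # compress archived WAL segments
---
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: ha-supabase-db-daily       # hypothetical name
spec:
  # Six-field cron expression (seconds first): every day at 02:00.
  schedule: "0 0 2 * * *"
  backupOwnerReference: self
  cluster:
    name: ha-supabase-db
```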

### Disaster Recovery

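* **Multi-AZ:** The VPC, EKS nodes, and ElastiCache span 3 availability zones; Amazon MQ does so only in a Multi-AZ deployment mode, and EBS-backed volumes (such as the Automator database) are tied to a single zone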
* **CloudNativePG HA:** Automatic failover between Postgres primary and replicas
* **Supabase Cloud:** Cross-region redundancy (cloud option)
* **Terraform state:** S3 versioning allows rollback to any previous state

***

## Additional Resources

* [AWS Load Balancer Controller](https://kubernetes-sigs.github.io/aws-load-balancer-controller/)
* [Karpenter Documentation](https://karpenter.sh/docs/)
* [KEDA Documentation](https://keda.sh/docs/)
* [CloudNativePG Documentation](https://cloudnative-pg.io/documentation/)
* [Supabase Self-hosting](https://supabase.com/docs/guides/self-hosting)
* [SigNoz Documentation](https://signoz.io/docs/)
* [Terragrunt Deployment Guide](/on-premise/kubernetes-deployment/terragrunt-deployment.mdx)
