This document provides a comprehensive overview of the AWS architecture for EKB services infrastructure.
AWS Architecture Overview
The architecture follows cloud-native patterns with AWS managed services, Kubernetes orchestration, and automated scaling.
Architecture Diagram
Self-hosted Supabase uses a CloudNativePG-managed HA PostgreSQL cluster (ha-supabase-db) with PgBouncer pooling, MinIO for object storage, and the full Supabase application stack deployed via helm-deployment/supabase-kubernetes-ha. Supabase Cloud is the alternative if self-hosting is not required.
Key Components
1. Networking Layer
DNS Provider
- Purpose: Domain name resolution; works with any provider (Route 53, Cloudflare, etc.)
- Domains (use your own domain, e.g. `example.com`):
  - `app.example.com` — Web Frontend
  - `api.example.com` — FastAPI Backend
  - `automations.example.com` — Automator Service
  - `supabase.example.com` — Supabase Kong (self-hosted only)
  - `signoz.example.com` — SigNoz observability (optional)
- SSL Validation: CNAME records required for ACM DNS validation
Application Load Balancer (ALB)
- Purpose: SSL termination, load balancing, and hostname-based routing
- Features:
  - SSL/TLS termination using ACM certificates (wildcard or per-service)
  - HTTP → HTTPS redirect
  - Health checks for all target groups
- Managed by: AWS Load Balancer Controller (Helm chart in `helm-deployment/infrastructure`)
VPC
- CIDR: Environment-specific (e.g. `10.x.0.0/16`)
- Availability Zones: 3 AZs in the chosen region
- Subnets: 3 public (NAT Gateways) + 3 private (EKS nodes)
- Outbound: NAT Gateway per AZ for node egress
2. Compute Layer
EKS Cluster
- Version: Kubernetes 1.33
- System Node Group: Managed node group running Karpenter controller (not on Karpenter-managed nodes, per AWS best practice)
- Add-ons: EBS CSI Driver, AWS Load Balancer Controller, CoreDNS, kube-proxy
Karpenter — Dynamic Node Provisioning
- Purpose: Just-in-time node provisioning and cost optimisation
- Node Classes:
  - General Purpose: Spot instances for most workloads
  - Compute Intensive: High-CPU instances for CPU-bound tasks
  - Memory Intensive: Memory-optimised instances for large datasets
  - Database: On-demand instances for stateful/database workloads
  - GPU: GPU instances for AI/ML workloads (optional)
- Features: Spot prioritisation, automatic consolidation, SQS-based interruption handling
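The node classes above map naturally onto Karpenter `NodePool` resources. The sketch below illustrates what a "General Purpose" pool with Spot prioritisation and automatic consolidation could look like; the pool name, node class reference, and CPU limit are illustrative assumptions, not the repository's actual manifests.

```yaml
# Hypothetical NodePool for the "General Purpose" class (Spot-first).
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]   # Spot prioritised, On-Demand fallback
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized   # automatic consolidation
  limits:
    cpu: "100"   # cap total vCPUs this pool may provision
```

A Database pool would differ mainly in pinning `karpenter.sh/capacity-type` to `["on-demand"]` and adding a taint so only stateful workloads land there.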
KEDA — Kubernetes Event-Driven Autoscaling
- Purpose: Horizontal pod autoscaling based on resource metrics
- Targets:
| Service | Replicas | CPU Threshold | Memory Threshold |
|---|---|---|---|
| Web Frontend | 2–8 | 60% | 80% |
| FastAPI Backend | 2–10 | 70% | 80% |
| Celery Workers | 2–8 | 70% | 80% |
| Automator | 2–8 | 70% | 80% |
- Scale-down: 30s stabilisation window for fast response
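The table above translates directly into KEDA `ScaledObject` resources. A minimal sketch for the FastAPI Backend row (2–10 replicas, 70% CPU, 80% memory, 30s scale-down window) might look like the following; the Deployment name is an assumption.

```yaml
# Hypothetical ScaledObject matching the FastAPI Backend row above.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: fastapi-backend
spec:
  scaleTargetRef:
    name: fastapi-backend          # target Deployment (assumed name)
  minReplicaCount: 2
  maxReplicaCount: 10
  triggers:
    - type: cpu
      metricType: Utilization
      metadata:
        value: "70"                # scale out above 70% CPU
    - type: memory
      metricType: Utilization
      metadata:
        value: "80"                # scale out above 80% memory
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 30   # the 30s window noted above
```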
3. Application Services
Web Frontend
- Port: 3000
- Replicas: 2–8 (KEDA-managed)
- Purpose: React application serving the user interface
FastAPI Backend
- Port: 8001
- Replicas: 2–10 (KEDA-managed)
- Purpose: REST API server handling business logic and data access
Celery Workers
- Replicas: 2–8 (KEDA-managed)
- Purpose: Background task processing (queued via RabbitMQ)
Automator Service
- Port: 80
- Replicas: 2–8 (KEDA-managed)
- Purpose: Workflow automation and orchestration
Supabase Kong
- Port: 8000 (internal cluster service)
- Purpose: API gateway for all Supabase services
- Routing: External traffic reaches Kong via the ALB ingress defined in `odin-services/main-ingress.yaml`
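Hostname-based routing through the ALB is expressed as a standard Ingress handled by the AWS Load Balancer Controller. The sketch below shows the general shape of such an ingress (in the spirit of `odin-services/main-ingress.yaml`, whose actual contents may differ); hostnames and service names are illustrative.

```yaml
# Illustrative ALB ingress with hostname routing and HTTPS redirect.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: main-ingress
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS": 443}]'
    alb.ingress.kubernetes.io/ssl-redirect: "443"   # HTTP -> HTTPS redirect
    alb.ingress.kubernetes.io/target-type: ip       # route straight to pod IPs
spec:
  ingressClassName: alb
  rules:
    - host: supabase.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: supabase-kong   # internal Kong service on port 8000
                port:
                  number: 8000
```

Additional `host` rules for `app.example.com`, `api.example.com`, and so on give the ALB one target group per service.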
SigNoz (optional)
- Namespace: `monitoring`
- Components: SigNoz platform + k8s-infra DaemonSet agent
- Purpose: Distributed tracing, metrics aggregation, log management
- Enabled by: `ENABLE_SIGNOZ=true`
4. Data Layer
ElastiCache Redis
- Purpose: Caching, session storage, and Celery broker/result backend
- Configuration:
  - Node type: configurable (e.g. `cache.t3.micro`)
  - Port: 6379
  - Encryption at-rest and in-transit
  - Multi-AZ for high availability
- Enabled by: `ENABLE_AWS_SERVICES=true`
Amazon MQ (RabbitMQ)
- Purpose: Message queuing for asynchronous task processing
- Configuration:
  - Engine: RabbitMQ
  - Ports: 5671 (AMQP/SSL), 15671 (Management/SSL)
  - Deployment mode: single-instance or active/standby
- Enabled by: `ENABLE_AWS_SERVICES=true`
Supabase — Option A: Cloud (managed)
- Purpose: External managed PostgreSQL, Auth, Storage, and Realtime
- Connection: Supabase project URL and service role key configured in `values/odin-services.yaml`
Supabase — Option B: Self-hosted on EKS
- Purpose: Full Supabase stack running inside the cluster
- Components:
  - CloudNativePG operator (`cnpg-system`) — manages the Postgres cluster lifecycle
  - HA Supabase DB (`ha-supabase-db`) — CloudNativePG Cluster resource with PgBouncer pooler
  - Supabase application (`supabase` namespace) — Kong, Auth, Storage (MinIO), Meta, Rest, Realtime, Studio
- Deployment order: CloudNativePG → HA Supabase DB → Supabase app
- Enabled by: `ENABLE_CNPG=true`, `ENABLE_HA_SUPABASE_DB=true`, `ENABLE_SUPABASE=true`
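For orientation, the HA database layer boils down to two CloudNativePG resources: a `Cluster` for Postgres itself and a `Pooler` for PgBouncer. This is a minimal sketch using the names from this document; instance counts, storage size, and pool mode are assumptions, and the real manifests live in `helm-deployment/supabase-kubernetes-ha`.

```yaml
# Minimal CloudNativePG Cluster: one primary plus two replicas.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: ha-supabase-db
  namespace: ha-supabase-db
spec:
  instances: 3          # spread across AZs; automatic failover on primary loss
  storage:
    size: 20Gi
---
# PgBouncer pooler fronting the read-write endpoint of the cluster above.
apiVersion: postgresql.cnpg.io/v1
kind: Pooler
metadata:
  name: ha-supabase-db-pooler
  namespace: ha-supabase-db
spec:
  cluster:
    name: ha-supabase-db
  instances: 2
  type: rw
  pgbouncer:
    poolMode: transaction
```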
PostgreSQL Automator
- Purpose: Local PostgreSQL database for the Automator service
- Port: 5432
- Storage: EBS persistent volume
- Node affinity: Database-dedicated nodes
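Pinning the Automator database to the Database node class is done with standard node affinity plus a matching toleration. The label and taint key below (`workload-type: database`) are an assumption about how the database nodes are tagged in this setup.

```yaml
# Illustrative pod-spec fragment pinning a pod to database-dedicated nodes.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: workload-type        # hypothetical node label
              operator: In
              values: ["database"]
tolerations:
  - key: workload-type                  # hypothetical taint on database nodes
    value: database
    effect: NoSchedule
```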
5. Security & IAM
IAM Roles
| Role | Purpose |
|---|---|
| EKS Cluster Role | Cluster-level API permissions |
| Node Group Role | EC2 node permissions (ECR, SSM, networking) |
| Karpenter Controller Role | EC2 provisioning, SQS interruption queue |
| AWS Load Balancer Controller Role | ELBv2 and EC2 management |
| EBS CSI Driver Role | EBS volume lifecycle management |
Role names follow the pattern `<env-name>-<component>` and are created by the EKS Terraform module.
Security Groups
- ALB: Auto-created by AWS Load Balancer Controller (80/443 inbound)
- EKS Cluster: Node-to-node and pod communication
- Redis: Port 6379 from VPC CIDR only
- RabbitMQ: Ports 5671, 15671 from VPC CIDR only
SSL/TLS
- Termination: ALB level (pods see plain HTTP internally)
- Certificates: ACM certificates — either per-service or a single wildcard
- Validation: DNS CNAME validation via your DNS provider
- Minimum protocol: TLS 1.2
6. Infrastructure as Code
Terraform
- Module: `modules/eks` — EKS cluster, VPC, node groups, Karpenter, IAM, Helm releases
- State: S3 bucket with versioning, DynamoDB lock table
Terragrunt
- Environment isolation: One directory per environment under `terragrunt/environments/`
- Template: `env-template-folder` — copy and fill placeholders to create a new environment
- DRY configuration: Shared `root.hcl` with per-environment overrides
- Enable/disable flags: Services toggled via environment variables (`ENABLE_*`)
Helm Charts
| Chart | Namespace | Description |
|---|---|---|
| infrastructure | infrastructure | ALB Controller |
| odin-services | default | Web, API, Workers, Automator, Ingress |
| aws-ebs-csi-driver | kube-system | EBS volume provisioning |
| keda | keda | Pod autoscaling |
| cloudnative-pg | cnpg-system | PostgreSQL operator |
| ha-supabase-db | ha-supabase-db | HA Postgres cluster + PgBouncer |
| supabase-kubernetes-ha | supabase | Full Supabase stack |
| signoz | monitoring | Observability platform |
| k8s-infra | monitoring | Cluster metrics agent |
Data Flow
1. User Request Flow
User → DNS → ALB (SSL termination) → EKS Pod (Web Frontend)
→ EKS Pod (FastAPI Backend)
→ EKS Pod (Supabase Kong)
- User accesses `app.example.com`
- DNS resolves to the ALB
- ALB terminates SSL and routes by hostname to the correct target group
- Web Frontend serves the React app and makes API calls to `api.example.com`
- FastAPI Backend processes requests and reads/writes to data services
2. API Request Flow
Client → ALB → FastAPI Backend → Redis (cache) / RabbitMQ (queue) / Supabase (DB)
- Client calls `api.example.com`
- ALB routes to FastAPI pod
- Backend checks Redis cache; on miss, queries Supabase database
- Async tasks are enqueued in RabbitMQ and processed by Celery Workers
3. Background Processing Flow
FastAPI → RabbitMQ → Celery Worker → Supabase DB
- FastAPI enqueues a task in RabbitMQ
- Celery Worker dequeues and processes the task
- Results are written back to the Supabase database
4. Automator Workflow
Automator → PostgreSQL (local) → Redis → External APIs
- Automator receives a workflow request
- Workflow state is persisted in the local PostgreSQL instance
- Redis caches intermediate results
- External APIs are called as part of the automation
5. Scaling Flow
Metrics → KEDA → Pod scaling → Karpenter → Node provisioning
- KEDA evaluates CPU/Memory metrics against configured thresholds
- Pods are scaled horizontally within the configured replica range
- If cluster capacity is insufficient, Karpenter provisions new EC2 nodes (preferring Spot)
- When load drops, KEDA scales pods down; Karpenter consolidates and terminates idle nodes
6. Security Flow
Internet → ALB (TLS 1.2+, ACM) → Security Groups → Pods → IAM IRSA roles → AWS APIs
- All external traffic terminates TLS at the ALB
- Security groups enforce least-privilege network access
- Pods communicate with AWS services via IRSA (IAM Roles for Service Accounts)
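IRSA boils down to a Kubernetes ServiceAccount annotated with an IAM role ARN; pods using that account receive short-lived AWS credentials through the cluster's OIDC provider, so no long-lived keys are stored in the cluster. The account ID and role name below are placeholders.

```yaml
# Illustrative IRSA binding: pods using this ServiceAccount assume the IAM role.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fastapi-backend
  namespace: default
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/<env-name>-fastapi-backend
```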
High Availability Summary
| Feature | Implementation |
|---|---|
| Multi-AZ deployment | 3 AZs for EKS nodes, Redis, subnets |
| Load balancing | ALB with multiple target groups |
| Pod redundancy | Minimum 2 replicas per service |
| Database HA | CloudNativePG cluster with PgBouncer (self-hosted) or Supabase Cloud |
| Cache redundancy | ElastiCache Multi-AZ |
| Node autoscaling | Karpenter with Spot + On-Demand mix |
| Pod autoscaling | KEDA CPU/Memory-based |
| Observability | SigNoz (optional) |
| State management | S3 with versioning + DynamoDB lock |
Cost Optimisation
- Spot Instances: Karpenter prioritises Spot for all non-database workloads
- Node consolidation: Karpenter automatically reclaims underutilised nodes
- Pod right-sizing: KEDA scales pods down during quiet periods
- On-Demand only where needed: Database node class uses On-Demand for stability
Maintenance & Operations
Deployment Process
```shell
cd terragrunt/environments/<your-env-name>
# Set required ENABLE_* and domain/certificate environment variables
terragrunt apply
# Rolling updates: re-apply after updating image tags or values files
```
See the Terragrunt Deployment Guide for the full deployment sequence.
Backup Strategy
- EBS Snapshots: Automated snapshots for persistent volumes (Automator DB, Supabase MinIO)
- CloudNativePG: Continuous WAL archiving + scheduled base backups (if configured)
- Supabase Cloud: Managed daily backups (cloud option)
- IaC state: S3 versioned bucket
Disaster Recovery
- Multi-AZ: All stateful services span multiple availability zones
- CloudNativePG HA: Automatic failover between Postgres primary and replicas
- Supabase Cloud: Cross-region redundancy (cloud option)
- Terraform state: S3 versioning allows rollback to any previous state
Additional Resources