This document describes the disaster recovery (DR) strategy, Recovery Time Objective (RTO), and Recovery Point Objective (RPO) for the EKB AWS/EKS infrastructure. It covers built-in HA capabilities, active backup systems, recovery procedures, and known gaps.
Architecture & High Availability Summary
Infrastructure Components
| Component | Implementation | HA Model |
|---|---|---|
| Compute | EKS (Kubernetes 1.33) + Karpenter | Multi-AZ nodes, Spot + On-Demand |
| Pod autoscaling | KEDA | CPU/Memory-based, min 2 replicas |
| Ingress / SSL | ALB + ACM | Multi-AZ, health-check routing |
| Cache | ElastiCache Redis (conditional) | Multi-AZ with automatic failover |
| Message queue | Amazon MQ / RabbitMQ (conditional) | Single-instance or active/standby |
| Database (self-hosted) | CloudNativePG HA cluster + PgBouncer | Primary + replicas, auto-failover |
| Database (cloud) | Supabase Cloud | Managed, cross-region by provider |
| Observability | SigNoz + k8s-infra (conditional) | In-cluster, no single point of failure |
| IaC state | S3 (versioned) + DynamoDB lock | Encrypted, versioned |
| Persistent volumes | EBS (via EBS CSI Driver) | Snapshots via AWS Data Lifecycle Manager |
Current HA Capabilities
| Capability | Recovery Time | Data Loss | Status |
|---|---|---|---|
| AZ failure (pods) | 0–5 min | None | Active |
| Pod failure / crash | 0–2 min | None | Active |
| ALB health check re-routing | 0–1 min | None | Active |
| Redis Multi-AZ failover | < 1 min | None | Active (when enabled) |
| Karpenter node replacement | 0–10 min | None | Active |
| EBS snapshot restore | 15–30 min | Up to 24 h | Active |
| CloudNativePG PITR restore | 15–60 min | Up to WAL lag | Active (when enabled) |
| Full infrastructure recreation | 30–60 min | None (IaC) | Via Terragrunt |
RTO / RPO Targets
| Scenario | RTO Target | RPO Target | Current Capability |
|---|---|---|---|
| Single pod failure | < 2 min | 0 | Met (KEDA min replicas) |
| Single node failure | < 10 min | 0 | Met (Karpenter replacement) |
| AZ failure | < 5 min | 0 | Met (Multi-AZ spread) |
| Redis failure | < 1 min | 0 | Met (Multi-AZ failover) |
| Postgres primary failure | < 5 min | 0–seconds WAL lag | Met (CNPG auto-failover) |
| EBS volume loss | 15–30 min | < 24 h | Partial — DLM snapshots daily |
| Full region failure | > 60 min | Depends on backup frequency | Not currently automated |
Backup Systems
1. Terraform State — S3 + DynamoDB
What is backed up: All infrastructure state (EKS, networking, IAM, Helm releases).
Implementation:
- S3 bucket per environment:
ekb-terraform-state-<env-name> (bootstrapped via terragrunt/environments/<env-name>/state/)
- Versioning enabled — any previous state version can be restored
- Server-side encryption (AES-256)
- DynamoDB table for state locking
Recovery: Roll back to any previous state version in S3, then re-run terragrunt apply.
RPO: State is captured on every terragrunt apply, so recovery points are effectively continuous.
2. EBS Persistent Volumes — AWS Data Lifecycle Manager
What is backed up: EBS volumes attached to pods (Automator PostgreSQL, Supabase MinIO, any stateful workloads).
Implementation: AWS Data Lifecycle Manager (DLM) policy targeting environment tags.
# Tag EBS volumes with your environment name, then create a DLM policy:
# - Resource: EBS volumes
# - Target tag: Environment=<your-env-name>
# - Schedule: Daily at 02:00 UTC
# - Retention: 7 snapshots
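The steps above can be sketched with the AWS CLI. This is a minimal sketch, not the project's actual tooling: the environment name, file path, and execution role ARN are placeholders to substitute for your account.

```shell
# Sketch: generate the DLM policy document locally, then create the policy.
# ENV_NAME and the execution role ARN below are assumptions — substitute your own.
ENV_NAME="${ENV_NAME:-my-env}"
cat > /tmp/dlm-policy.json <<EOF
{
  "ResourceTypes": ["VOLUME"],
  "TargetTags": [{"Key": "Environment", "Value": "${ENV_NAME}"}],
  "Schedules": [{
    "Name": "DailySnapshots",
    "CreateRule": {"Interval": 24, "IntervalUnit": "HOURS", "Times": ["02:00"]},
    "RetainRule": {"Count": 7}
  }]
}
EOF
# Creating the policy requires an IAM role that DLM can assume:
# aws dlm create-lifecycle-policy \
#   --execution-role-arn arn:aws:iam::<account-id>:role/AWSDataLifecycleManagerDefaultRole \
#   --description "Daily EBS snapshots for ${ENV_NAME}" \
#   --state ENABLED \
#   --policy-details file:///tmp/dlm-policy.json
```

Raising `Interval` frequency (e.g. `1 HOURS`) is the lever for tightening the EBS RPO discussed below.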
Recovery:
# 1. Identify the snapshot to restore
aws ec2 describe-snapshots --filters "Name=tag:Environment,Values=<your-env-name>"
# 2. Create a volume from the snapshot
aws ec2 create-volume \
--snapshot-id snap-xxxxxxxxxxxxxxxxx \
--availability-zone <your-az> \
--volume-type gp3
# 3. Update the PersistentVolume in Kubernetes to reference the new volume ID
kubectl patch pv <pv-name> -p '{"spec":{"csi":{"volumeHandle":"<new-vol-id>"}}}'
RPO: Up to 24 hours (daily snapshots). Reduce by increasing DLM schedule frequency.
3. CloudNativePG (HA Supabase DB) — Barman Cloud Backups
Applies to: Environments with ENABLE_CNPG=true and ENABLE_HA_SUPABASE_DB=true.
What is backed up: The CloudNativePG Postgres cluster (ha-supabase-db), including continuous WAL (Write-Ahead Log) streaming to S3 or MinIO, and scheduled full base backups via CNPG ScheduledBackup CRD.
Implementation (configured in values/ha-supabase-db.yaml):
postgres:
  backup:
    enabled: true  # toggle barman-cloud backups
    # barmanObjectStore points to S3 bucket or MinIO endpoint
    # IRSA / service account must have s3:PutObject, s3:GetObject, s3:ListBucket
    retentionPolicy: "30d"  # keep backups for 30 days
    compression: gzip
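The scheduled full base backups mentioned above are driven by a CNPG ScheduledBackup resource. A sketch, assuming a nightly 02:00 UTC run (the resource name is illustrative; CNPG cron expressions have six fields, with seconds first):

```yaml
# Sketch: nightly base backup via the CNPG ScheduledBackup CRD.
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: ha-supabase-db-nightly   # assumption — name it to suit your conventions
  namespace: ha-supabase-db
spec:
  schedule: "0 0 2 * * *"        # 02:00 UTC daily (six-field cron: seconds first)
  backupOwnerReference: self
  cluster:
    name: ha-supabase-db
```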
Verify backup status:
# List all backups for the cluster
kubectl get backup -n ha-supabase-db
# Check a specific backup
kubectl describe backup <backup-name> -n ha-supabase-db
# Trigger an on-demand backup
kubectl cnpg backup ha-supabase-db -n ha-supabase-db
Point-in-time recovery (PITR):
# Create a new CNPG Cluster restoring from a backup
# (in a new namespace or after removing the original)
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: ha-supabase-db-restored
  namespace: ha-supabase-db-restore
spec:
  instances: 3
  bootstrap:
    recovery:
      source: ha-supabase-db
      recoveryTarget:
        targetTime: "2026-03-19T03:00:00.000000+00:00"  # target point in time
  externalClusters:
    - name: ha-supabase-db
      barmanObjectStore:
        # same S3/MinIO config as the source cluster
        serverName: ha-supabase-db
RPO: Near-zero for WAL-enabled clusters (seconds of lag, bounded by WAL upload interval).
4. ElastiCache Redis — Multi-AZ Replication
Applies to: Environments with ENABLE_AWS_SERVICES=true.
Redis is not a primary data store — it holds transient cache and session data. DR focus is on fast failover rather than backup/restore.
Implementation:
- Multi-AZ enabled with automatic failover
- Encryption at-rest and in-transit
- A primary node failure promotes a replica automatically (< 1 min)
RPO: Redis data is ephemeral by design. Cache misses after failover are expected; the application re-populates from the database.
5. Amazon MQ (RabbitMQ) — Message Queue
Applies to: Environments with ENABLE_AWS_SERVICES=true.
Implementation:
- Single-instance (default) or active/standby deployment
- In-flight messages may be lost during a broker restart; design consumers to be idempotent
- Management UI available on port 15671 (SSL)
RPO: Single-instance — messages in-flight at time of failure may be lost. Use active/standby deployment mode to reduce this risk.
Auto-Recovery Mechanisms
Karpenter — Node Provisioning & Recovery
- Consolidation: consolidationPolicy: WhenEmptyOrUnderutilized — empty or underutilized nodes are consolidated and terminated automatically
- Spot interruption handling: Listens to SQS interruption events; drains and replaces Spot nodes before termination
- Node drift: Nodes using outdated AMIs or configs are replaced automatically when enable_drift = true
- Recovery time: New node provisioned in 0–10 minutes
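The consolidation behaviour above maps to the NodePool disruption block. A sketch under assumed names (the NodePool name and consolidateAfter value are placeholders; required fields such as nodeClassRef are omitted):

```yaml
# Sketch: Karpenter NodePool disruption settings matching the bullets above.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default                  # assumption — your NodePool name
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s        # assumption — tune to your workload churn
  template:
    spec:
      # nodeClassRef and other required fields omitted for brevity
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
```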
KEDA — Pod Autoscaling
- Minimum replicas: 2 for all services (Web, API, Celery, Automator) — prevents a single point of failure
- CPU threshold: 60–70% triggers scale-out
- Memory threshold: 80% triggers scale-out
- Scale-down stabilisation: 30 seconds — avoids flapping
- Recovery time: Failed pods rescheduled within 0–2 minutes
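The bullets above can be expressed as a KEDA ScaledObject for one service. A sketch, assuming a Deployment named `web` and a max of 10 replicas (both assumptions; thresholds mirror the bullets):

```yaml
# Sketch: KEDA ScaledObject — min 2 replicas, scale out at 70% CPU / 80% memory.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: web-scaler
spec:
  scaleTargetRef:
    name: web              # assumption — your Deployment name
  minReplicaCount: 2       # no single point of failure
  maxReplicaCount: 10      # assumption
  triggers:
    - type: cpu
      metricType: Utilization
      metadata:
        value: "70"
    - type: memory
      metricType: Utilization
      metadata:
        value: "80"
```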
Kubernetes Pod Anti-Affinity
All stateless services use preferredDuringSchedulingIgnoredDuringExecution anti-affinity on kubernetes.io/hostname to spread pods across nodes; because Karpenter provisions nodes across multiple AZs, this also tends to spread replicas across AZs.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values: ["web", "fastapi-backend", "celery-worker", "automator"]
          topologyKey: kubernetes.io/hostname
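Node-level anti-affinity alone does not guarantee AZ-level spread. A second preferred term on the zone topology key can be appended to the same list; this is a sketch, and the weight is an assumption:

```yaml
# Sketch: additional preferred anti-affinity term for explicit AZ-level spread,
# appended under preferredDuringSchedulingIgnoredDuringExecution.
- weight: 50                     # assumption — lower priority than node spread
  podAffinityTerm:
    labelSelector:
      matchExpressions:
        - key: app
          operator: In
          values: ["web", "fastapi-backend", "celery-worker", "automator"]
    topologyKey: topology.kubernetes.io/zone
```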
Recovery Procedures
Scenario 1: AZ Failure
Expected behaviour: Karpenter provisions replacement nodes in the remaining AZs; KEDA reschedules pods; ALB stops routing to unhealthy targets.
Verification:
kubectl get nodes -o wide # confirm nodes are in healthy AZs
kubectl get pods -A -o wide # confirm pods are running
kubectl get ingress # confirm ALB is routing correctly
No manual intervention is required under normal circumstances.
Scenario 2: Postgres Primary Node Failure (CloudNativePG)
CloudNativePG automatically promotes a replica to primary.
# Monitor failover
kubectl get cluster ha-supabase-db -n ha-supabase-db -w
# Confirm new primary
kubectl cnpg status ha-supabase-db -n ha-supabase-db
# Check PgBouncer pooler is pointing to new primary
kubectl get svc -n ha-supabase-db | grep pooler
Applications reconnect via PgBouncer — no connection string change needed.
Scenario 3: Restore EBS Volume from Snapshot
# 1. Find snapshots for the environment
aws ec2 describe-snapshots \
--filters "Name=tag:Environment,Values=<your-env-name>" \
--query 'Snapshots[*].[SnapshotId,StartTime,VolumeSize]' \
--output table
# 2. Create a new volume from the chosen snapshot
aws ec2 create-volume \
--snapshot-id snap-xxxxxxxxxxxxxxxxx \
--availability-zone <target-az> \
--volume-type gp3 \
--tag-specifications 'ResourceType=volume,Tags=[{Key=Environment,Value=<your-env-name>}]'
# 3. Scale down the affected workload
kubectl scale deployment <workload> --replicas=0
# 4. Update the PV to reference the new EBS volume, then scale back up
kubectl patch pv <pv-name> -p '{"spec":{"csi":{"volumeHandle":"<new-volume-id>"}}}'
kubectl scale deployment <workload> --replicas=2
Scenario 4: Full Infrastructure Recreation
Used after catastrophic failure or when rebuilding a region.
# 1. Bootstrap state bucket (if it doesn't exist)
cd terragrunt/environments/<your-env-name>/state
terragrunt apply
# 2. Recreate the EKS cluster and core infrastructure
cd terragrunt/environments/<your-env-name>
terragrunt apply
# 3. Deploy services in order (see Terragrunt Deployment Guide § Phase 6)
ENABLE_CNPG=true terragrunt apply --target='helm_release.local["cloudnative-pg"]'
ENABLE_HA_SUPABASE_DB=true terragrunt apply --target='helm_release.local["ha-supabase-db"]'
ENABLE_SUPABASE=true terragrunt apply --target='helm_release.local["supabase"]'
Estimated RTO: 30–60 minutes for full cluster; 60–120 minutes if EBS/CNPG data must be restored.
Scenario 5: Terragrunt State Rollback
# List state versions in S3
aws s3api list-object-versions \
--bucket ekb-terraform-state-<your-env-name> \
--prefix <state-key>
# Restore a specific version
aws s3api get-object \
--bucket ekb-terraform-state-<your-env-name> \
--key <state-key> \
--version-id <version-id> \
terraform.tfstate
# Re-import or apply with the restored state
Observability & Alerting (SigNoz)
When ENABLE_SIGNOZ=true, SigNoz is deployed to the monitoring namespace and provides:
- Distributed tracing for all EKB services
- Cluster metrics via the k8s-infra DaemonSet agent (CPU, memory, pod status)
- Log aggregation from all pods
- Alerting — configure alert rules in SigNoz to notify on pod crash loops, high error rates, or node pressure
For DR purposes, SigNoz dashboards are the primary tool for diagnosing and confirming recovery after an incident. SigNoz data itself is stored on EBS PVs and is covered by the EBS snapshot policy.
Gaps & Recommendations
| Gap | Risk | Recommendation |
|---|---|---|
| No cross-region replication | Full region failure = extended RTO | Consider RDS cross-region read replica or S3 cross-region replication for CNPG backups |
| EBS snapshots are daily | Up to 24 h data loss for EBS-backed workloads | Increase DLM frequency to hourly for critical volumes |
| RabbitMQ single-instance (default) | In-flight messages lost on broker failure | Switch to ACTIVE_STANDBY_MULTI_AZ deployment mode for production |
| No automated DR drill | Recovery procedures untested until needed | Schedule quarterly recovery drills; test CNPG PITR restore in a staging environment |
| SigNoz on same cluster | Observability lost during cluster failure | Consider a separate lightweight monitoring endpoint or CloudWatch fallback alerts |