This document describes the disaster recovery (DR) strategy, Recovery Time Objective (RTO), and Recovery Point Objective (RPO) for the EKB AWS/EKS infrastructure. It covers built-in HA capabilities, active backup systems, recovery procedures, and known gaps.
For deployment instructions, see Terragrunt Deployment Guide.

Architecture & High Availability Summary

Infrastructure Components

| Component | Implementation | HA Model |
|---|---|---|
| Compute | EKS (Kubernetes 1.33) + Karpenter | Multi-AZ nodes, Spot + On-Demand |
| Pod autoscaling | KEDA | CPU/Memory-based, min 2 replicas |
| Ingress / SSL | ALB + ACM | Multi-AZ, health-check routing |
| Cache | ElastiCache Redis (conditional) | Multi-AZ with automatic failover |
| Message queue | Amazon MQ / RabbitMQ (conditional) | Single-instance or active/standby |
| Database (self-hosted) | CloudNativePG HA cluster + PgBouncer | Primary + replicas, auto-failover |
| Database (cloud) | Supabase Cloud | Managed, cross-region by provider |
| Observability | SigNoz + k8s-infra (conditional) | In-cluster, no single point of failure |
| IaC state | S3 (versioned) + DynamoDB lock | Encrypted, versioned |
| Persistent volumes | EBS (via EBS CSI Driver) | Snapshots via AWS Data Lifecycle Manager |

Current HA Capabilities

| Capability | Recovery Time | Data Loss | Status |
|---|---|---|---|
| AZ failure (pods) | 0–5 min | None | Active |
| Pod failure / crash | 0–2 min | None | Active |
| ALB health check re-routing | 0–1 min | None | Active |
| Redis Multi-AZ failover | < 1 min | None | Active (when enabled) |
| Karpenter node replacement | 0–10 min | None | Active |
| EBS snapshot restore | 15–30 min | Up to 24 h | Active |
| CloudNativePG PITR restore | 15–60 min | Up to WAL lag | Active (when enabled) |
| Full infrastructure recreation | 30–60 min | None (IaC) | Via Terragrunt |

RTO / RPO Targets

| Scenario | RTO Target | RPO Target | Current Capability |
|---|---|---|---|
| Single pod failure | < 2 min | 0 | Met (KEDA min replicas) |
| Single node failure | < 10 min | 0 | Met (Karpenter replacement) |
| AZ failure | < 5 min | 0 | Met (Multi-AZ spread) |
| Redis failure | < 1 min | 0 | Met (Multi-AZ failover) |
| Postgres primary failure | < 5 min | 0–seconds (WAL lag) | Met (CNPG auto-failover) |
| EBS volume loss | 15–30 min | < 24 h | Partial (DLM snapshots daily) |
| Full region failure | > 60 min | Depends on backup frequency | Not currently automated |

Backup Systems

1. Terraform / Terragrunt State — S3

What is backed up: all infrastructure state (EKS, networking, IAM, Helm releases).
Implementation:
  • S3 bucket per environment: ekb-terraform-state-<env-name> (bootstrapped via terragrunt/environments/<env-name>/state/)
  • Versioning enabled — any previous state version can be restored
  • Server-side encryption (AES-256)
  • DynamoDB table for state locking
Recovery: roll back to any previous state version in S3, then re-run terragrunt apply.
RPO: effectively continuous; every terragrunt apply writes a new state version.
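The versioning and encryption properties above can be spot-checked from the CLI. A quick sketch (substitute your environment name in the bucket name pattern):

```shell
# Confirm versioning is enabled on the state bucket (expect Status: Enabled)
aws s3api get-bucket-versioning --bucket ekb-terraform-state-<env-name>

# Confirm server-side encryption (expect SSEAlgorithm: AES256)
aws s3api get-bucket-encryption --bucket ekb-terraform-state-<env-name>
```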

2. EBS Persistent Volumes — AWS Data Lifecycle Manager

What is backed up: EBS volumes attached to pods (Automator PostgreSQL, Supabase MinIO, any stateful workloads).
Implementation: an AWS Data Lifecycle Manager (DLM) policy targeting environment tags.
# Tag EBS volumes with your environment name, then create a DLM policy:
# - Resource: EBS volumes
# - Target tag: Environment=<your-env-name>
# - Schedule: Daily at 02:00 UTC
# - Retention: 7 snapshots
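The tag-and-schedule sketch above could be created via the AWS CLI roughly as follows. The execution role name and account ID are placeholders, not values from this repo:

```shell
# Sketch: daily 02:00 UTC snapshots, 7 retained, targeting the Environment tag
aws dlm create-lifecycle-policy \
  --description "Daily EBS snapshots for <your-env-name>" \
  --state ENABLED \
  --execution-role-arn arn:aws:iam::<account-id>:role/AWSDataLifecycleManagerDefaultRole \
  --policy-details '{
    "ResourceTypes": ["VOLUME"],
    "TargetTags": [{"Key": "Environment", "Value": "<your-env-name>"}],
    "Schedules": [{
      "Name": "daily-0200-utc",
      "CreateRule": {"Interval": 24, "IntervalUnit": "HOURS", "Times": ["02:00"]},
      "RetainRule": {"Count": 7}
    }]
  }'
```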
Recovery:
# 1. Identify the snapshot to restore
aws ec2 describe-snapshots --filters "Name=tag:Environment,Values=<your-env-name>"
 
# 2. Create a volume from the snapshot
aws ec2 create-volume \
  --snapshot-id snap-xxxxxxxxxxxxxxxxx \
  --availability-zone <your-az> \
  --volume-type gp3
 
# 3. Update the PersistentVolume in Kubernetes to reference the new volume ID
#    (if the API server rejects the patch — the volume source of a bound PV is
#    immutable on most clusters — delete and recreate the PV instead)
kubectl patch pv <pv-name> -p '{"spec":{"csi":{"volumeHandle":"<new-vol-id>"}}}'
RPO: Up to 24 hours (daily snapshots). Reduce by increasing DLM schedule frequency.

3. CloudNativePG (HA Supabase DB) — Barman Cloud Backups

Applies to: environments with ENABLE_CNPG=true and ENABLE_HA_SUPABASE_DB=true.
What is backed up: the CloudNativePG Postgres cluster (ha-supabase-db): continuous WAL (Write-Ahead Log) archiving to S3 or MinIO, plus scheduled full base backups via the CNPG ScheduledBackup CRD.
Implementation (configured in values/ha-supabase-db.yaml):
postgres:
  backup:
    enabled: true                    # toggle barman-cloud backups
    # barmanObjectStore points to S3 bucket or MinIO endpoint
    # IRSA / service account must have s3:PutObject, s3:GetObject, s3:ListBucket
    retentionPolicy: "30d"           # keep backups for 30 days
    compression: gzip
Verify backup status:
# List all backups for the cluster
kubectl get backup -n ha-supabase-db
 
# Check a specific backup
kubectl describe backup <backup-name> -n ha-supabase-db
 
# Trigger an on-demand backup
kubectl cnpg backup ha-supabase-db -n ha-supabase-db
Point-in-time recovery (PITR):
# Create a new CNPG Cluster restoring from a backup
# (in a new namespace or after removing the original)
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: ha-supabase-db-restored
  namespace: ha-supabase-db-restore
spec:
  instances: 3
  bootstrap:
    recovery:
      source: ha-supabase-db
      recoveryTarget:
        targetTime: "2026-03-19T03:00:00.000000+00:00"  # target point in time
  externalClusters:
    - name: ha-supabase-db
      barmanObjectStore:
        # same S3/MinIO config as the source cluster
        serverName: ha-supabase-db
RPO: Near-zero for WAL-enabled clusters (seconds of lag, bounded by WAL upload interval).

4. ElastiCache Redis — Multi-AZ Replication

Applies to: environments with ENABLE_AWS_SERVICES=true.
Redis is not a primary data store; it holds transient cache and session data. The DR focus is therefore fast failover rather than backup/restore.
Implementation:
  • Multi-AZ enabled with automatic failover
  • Encryption at-rest and in-transit
  • A primary node failure promotes a replica automatically (< 1 min)
RPO: Redis data is ephemeral by design. Cache misses after failover are expected; the application re-populates from the database.
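The re-population behaviour can be sketched as a read-through cache. This is an illustrative Python sketch, not EKB application code; a dict stands in for the Redis client and a lambda for the database query:

```python
from typing import Callable

class ReadThroughCache:
    """On a miss (e.g. an empty cache just after failover), fall through to
    the database and write the result back, re-warming the cache."""

    def __init__(self, fetch_from_db: Callable[[str], str]) -> None:
        self._store: dict[str, str] = {}   # stand-in for the Redis client
        self._fetch = fetch_from_db

    def get(self, key: str) -> str:
        value = self._store.get(key)
        if value is None:                  # cache miss after failover
            value = self._fetch(key)       # fall through to the database
            self._store[key] = value       # re-populate the cache
        return value

# After "failover" (empty cache): first read hits the DB, second hits the cache.
db_reads: list[str] = []
cache = ReadThroughCache(lambda k: (db_reads.append(k), f"row:{k}")[1])
assert cache.get("user:42") == "row:user:42"
assert cache.get("user:42") == "row:user:42"
assert db_reads == ["user:42"]             # the DB was queried only once
```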

5. Amazon MQ (RabbitMQ) — Message Queue

Applies to: environments with ENABLE_AWS_SERVICES=true.
Implementation:
  • Single-instance (default) or active/standby deployment
  • In-flight messages may be lost during a broker restart; design consumers to be idempotent
  • Management UI available on port 15671 (SSL)
RPO: Single-instance — messages in-flight at time of failure may be lost. Use active/standby deployment mode to reduce this risk.
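"Design consumers to be idempotent" can be sketched as follows. Illustrative Python, not EKB code; an in-memory set stands in for a durable dedupe store (a DB table or Redis SET in production):

```python
class IdempotentConsumer:
    """Messages carry a unique ID; redeliveries after a broker restart are
    detected via the dedupe store and skipped, so side effects run once."""

    def __init__(self) -> None:
        self.processed_ids: set[str] = set()   # durable store in production
        self.side_effects: list[str] = []

    def handle(self, message_id: str, body: str) -> bool:
        if message_id in self.processed_ids:
            return False                       # duplicate delivery: skip
        self.side_effects.append(body)         # the actual work
        self.processed_ids.add(message_id)     # record only after success
        return True

consumer = IdempotentConsumer()
assert consumer.handle("m1", "charge card") is True
assert consumer.handle("m1", "charge card") is False  # redelivered after restart
assert consumer.side_effects == ["charge card"]       # side effect ran once
```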

Auto-Recovery Mechanisms

Karpenter — Node Provisioning & Recovery

  • Consolidation: WhenEmptyOrUnderutilized — idle nodes are terminated automatically
  • Spot interruption handling: Listens to SQS interruption events; drains and replaces Spot nodes before termination
  • Node drift: Nodes using outdated AMIs or configs are replaced automatically when enable_drift = true
  • Recovery time: New node provisioned in 0–10 minutes
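The consolidation and capacity-type settings above correspond roughly to this Karpenter v1 NodePool fragment. A hedged sketch: the NodePool name and consolidateAfter value are assumptions, not values from this repo.

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized  # terminate idle/underused nodes
    consolidateAfter: 30s
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]            # Spot + On-Demand mix
```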

KEDA — Pod Autoscaling

  • Minimum replicas: 2 for all services (Web, API, Celery, Automator) — prevents a single point of failure
  • CPU threshold: 60–70% triggers scale-out
  • Memory threshold: 80% triggers scale-out
  • Scale-down stabilisation: 30 seconds — avoids flapping
  • Recovery time: Failed pods rescheduled within 0–2 minutes
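These thresholds map onto a KEDA ScaledObject roughly as follows. A hedged sketch: the service name is taken from this document, the remaining values are illustrative.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: fastapi-backend
spec:
  scaleTargetRef:
    name: fastapi-backend
  minReplicaCount: 2                  # no single point of failure
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 30   # avoid flapping
  triggers:
    - type: cpu
      metricType: Utilization
      metadata:
        value: "70"                   # scale out above ~70% CPU
    - type: memory
      metricType: Utilization
      metadata:
        value: "80"                   # scale out above 80% memory
```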

Kubernetes Pod Anti-Affinity

All stateless services use preferredDuringSchedulingIgnoredDuringExecution anti-affinity on kubernetes.io/hostname to spread pods across nodes and AZs.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values: ["web", "fastapi-backend", "celery-worker", "automator"]
        topologyKey: kubernetes.io/hostname

Recovery Procedures

Scenario 1: AZ Failure

Expected behaviour: Karpenter provisions replacement nodes in the remaining AZs; KEDA reschedules pods; ALB stops routing to unhealthy targets.
Verification:
kubectl get nodes -o wide     # confirm nodes are in healthy AZs
kubectl get pods -A -o wide   # confirm pods are running
kubectl get ingress            # confirm ALB is routing correctly
No manual intervention is required under normal circumstances.
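If you want to confirm where replacement nodes landed, the standard zone label can be shown alongside each node:

```shell
# List nodes with their zone label to confirm spread across surviving AZs
kubectl get nodes -L topology.kubernetes.io/zone
```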

Scenario 2: Postgres Primary Node Failure (CloudNativePG)

CloudNativePG automatically promotes a replica to primary.
# Monitor failover
kubectl get cluster ha-supabase-db -n ha-supabase-db -w
 
# Confirm new primary
kubectl cnpg status ha-supabase-db -n ha-supabase-db
 
# Check PgBouncer pooler is pointing to new primary
kubectl get svc -n ha-supabase-db | grep pooler
Applications reconnect via PgBouncer — no connection string change needed.

Scenario 3: Restore EBS Volume from Snapshot

# 1. Find snapshots for the environment
aws ec2 describe-snapshots \
  --filters "Name=tag:Environment,Values=<your-env-name>" \
  --query 'Snapshots[*].[SnapshotId,StartTime,VolumeSize]' \
  --output table
 
# 2. Create a new volume from the chosen snapshot
aws ec2 create-volume \
  --snapshot-id snap-xxxxxxxxxxxxxxxxx \
  --availability-zone <target-az> \
  --volume-type gp3 \
  --tag-specifications 'ResourceType=volume,Tags=[{Key=Environment,Value=<your-env-name>}]'
 
# 3. Scale down the affected workload
kubectl scale deployment <workload> --replicas=0
 
# 4. Update the PV to reference the new EBS volume, then scale back up
#    (if the patch is rejected, recreate the PV with the new volume ID and the
#    same claimRef; volume sources on bound PVs are immutable on most clusters)
kubectl patch pv <pv-name> -p '{"spec":{"csi":{"volumeHandle":"<new-volume-id>"}}}'
kubectl scale deployment <workload> --replicas=2

Scenario 4: Full Infrastructure Recreation

Used after catastrophic failure or when rebuilding a region.
# 1. Bootstrap state bucket (if it doesn't exist)
cd terragrunt/environments/<your-env-name>/state
terragrunt apply
 
# 2. Recreate the EKS cluster and core infrastructure
cd terragrunt/environments/<your-env-name>
terragrunt apply
 
# 3. Deploy services in order (see Terragrunt Deployment Guide § Phase 6)
ENABLE_CNPG=true terragrunt apply --target='helm_release.local["cloudnative-pg"]'
ENABLE_HA_SUPABASE_DB=true terragrunt apply --target='helm_release.local["ha-supabase-db"]'
ENABLE_SUPABASE=true terragrunt apply --target='helm_release.local["supabase"]'
Estimated RTO: 30–60 minutes for full cluster; 60–120 minutes if EBS/CNPG data must be restored.

Scenario 5: Terragrunt State Rollback

# List state versions in S3
aws s3api list-object-versions \
  --bucket ekb-terraform-state-<your-env-name> \
  --prefix <state-key>
 
# Restore a specific version
aws s3api get-object \
  --bucket ekb-terraform-state-<your-env-name> \
  --key <state-key> \
  --version-id <version-id> \
  terraform.tfstate
 
# 3. Upload the restored file back as the current state object
aws s3 cp terraform.tfstate s3://ekb-terraform-state-<your-env-name>/<state-key>
 
# 4. If Terraform reports a state checksum mismatch, clear the stale digest
#    item for this state key from the DynamoDB lock table, then run
#    terragrunt plan to confirm the restored state matches reality

Observability & Alerting (SigNoz)

When ENABLE_SIGNOZ=true, SigNoz is deployed to the monitoring namespace and provides:
  • Distributed tracing for all EKB services
  • Cluster metrics via the k8s-infra DaemonSet agent (CPU, memory, pod status)
  • Log aggregation from all pods
  • Alerting — configure alert rules in SigNoz to notify on pod crash loops, high error rates, or node pressure
For DR purposes, SigNoz dashboards are the primary tool for diagnosing and confirming recovery after an incident. SigNoz data itself is stored on EBS PVs and is covered by the EBS snapshot policy.

Gaps & Recommendations

| Gap | Risk | Recommendation |
|---|---|---|
| No cross-region replication | Full region failure = extended RTO | Consider RDS cross-region read replica or S3 cross-region replication for CNPG backups |
| EBS snapshots are daily | Up to 24 h data loss for EBS-backed workloads | Increase DLM frequency to hourly for critical volumes |
| RabbitMQ single-instance (default) | In-flight messages lost on broker failure | Switch to ACTIVE_STANDBY_MULTI_AZ deployment mode for production |
| No automated DR drill | Recovery procedures untested until needed | Schedule quarterly recovery drills; test CNPG PITR restore in a staging environment |
| SigNoz on same cluster | Observability lost during cluster failure | Consider a separate lightweight monitoring endpoint or CloudWatch fallback alerts |
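For the first gap, cross-region replication of the CNPG backup bucket could look roughly like this. A sketch only: bucket names and the replication role are assumptions, and both buckets must already have versioning enabled.

```shell
# Replicate CNPG backups to a bucket in another region (placeholder names)
aws s3api put-bucket-replication \
  --bucket <cnpg-backup-bucket> \
  --replication-configuration '{
    "Role": "arn:aws:iam::<account-id>:role/<s3-replication-role>",
    "Rules": [{
      "ID": "cnpg-backups-crr",
      "Status": "Enabled",
      "Priority": 1,
      "Filter": {"Prefix": ""},
      "DeleteMarkerReplication": {"Status": "Disabled"},
      "Destination": {"Bucket": "arn:aws:s3:::<cnpg-backup-bucket-replica>"}
    }]
  }'
```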