This document describes the disaster recovery (DR) strategy, Recovery Time Objective (RTO), and Recovery Point Objective (RPO) for the EKB AWS/EKS infrastructure. It covers built-in HA capabilities, active backup systems, recovery procedures, and known gaps.
For deployment instructions, see Terragrunt Deployment Guide.

Architecture & High Availability Summary

Infrastructure Components

| Component | Implementation | HA Model |
|---|---|---|
| Compute | EKS (Kubernetes 1.33) + Karpenter | Multi-AZ nodes, Spot + On-Demand |
| Pod autoscaling | KEDA | CPU/Memory-based, min 2 replicas |
| Ingress / SSL | ALB + ACM | Multi-AZ, health-check routing |
| Cache | ElastiCache Redis (conditional) | Multi-AZ with automatic failover |
| Message queue | Amazon MQ / RabbitMQ (conditional) | Single-instance or active/standby |
| Database (self-hosted) | CloudNativePG HA cluster + PgBouncer | Primary + replicas, auto-failover |
| Database (cloud) | Supabase Cloud | Managed, cross-region by provider |
| Observability | SigNoz + k8s-infra (conditional) | In-cluster, no single point of failure |
| IaC state | S3 (versioned) + DynamoDB lock | Encrypted, versioned |
| Persistent volumes | EBS (via EBS CSI Driver) | Snapshots via AWS Data Lifecycle Manager |

Current HA Capabilities

| Capability | Recovery Time | Data Loss | Status |
|---|---|---|---|
| AZ failure (pods) | 0–5 min | None | Active |
| Pod failure / crash | 0–2 min | None | Active |
| ALB health check re-routing | 0–1 min | None | Active |
| Redis Multi-AZ failover | < 1 min | None | Active (when enabled) |
| Karpenter node replacement | 0–10 min | None | Active |
| EBS snapshot restore | 15–30 min | Up to 24 h | Active |
| CloudNativePG PITR restore | 15–60 min | Up to WAL lag | Active (when enabled) |
| Full infrastructure recreation | 30–60 min | None (IaC) | Via Terragrunt |

RTO / RPO Targets

| Scenario | RTO Target | RPO Target | Current Capability |
|---|---|---|---|
| Single pod failure | < 2 min | 0 | Met (KEDA min replicas) |
| Single node failure | < 10 min | 0 | Met (Karpenter replacement) |
| AZ failure | < 5 min | 0 | Met (Multi-AZ spread) |
| Redis failure | < 1 min | 0 | Met (Multi-AZ failover) |
| Postgres primary failure | < 5 min | 0–seconds (WAL lag) | Met (CNPG auto-failover) |
| EBS volume loss | 15–30 min | < 24 h | Partial (DLM snapshots daily) |
| Full region failure | > 60 min | Depends on backup frequency | Not currently automated |

Backup Systems

1. Terraform / Terragrunt State — S3

What is backed up: all infrastructure state (EKS, networking, IAM, Helm releases).
Implementation:
  • S3 bucket per environment: ekb-terraform-state-<env-name> (bootstrapped via terragrunt/environments/<env-name>/state/)
  • Versioning enabled — any previous state version can be restored
  • Server-side encryption (AES-256)
  • DynamoDB table for state locking
Recovery: roll back to any previous state version in S3, then re-run terragrunt apply.
RPO: effectively continuous; every terragrunt apply writes a new state version.
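The versioning and encryption properties above can be spot-checked from the CLI. A quick sketch (substitute your environment name in the bucket name pattern):

```shell
# Confirm versioning is enabled on the state bucket (expect Status: Enabled)
aws s3api get-bucket-versioning --bucket ekb-terraform-state-<env-name>

# Confirm server-side encryption (expect SSEAlgorithm: AES256)
aws s3api get-bucket-encryption --bucket ekb-terraform-state-<env-name>
```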

2. EBS Persistent Volumes — AWS Data Lifecycle Manager

What is backed up: EBS volumes attached to pods (Automator PostgreSQL, Supabase MinIO, any stateful workloads).
Implementation: an AWS Data Lifecycle Manager (DLM) policy targeting environment tags.
# Tag EBS volumes with your environment name, then create a DLM policy:
# - Resource: EBS volumes
# - Target tag: Environment=<your-env-name>
# - Schedule: Daily at 02:00 UTC
# - Retention: 7 snapshots
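The tag-and-schedule sketch above could be created via the AWS CLI roughly as follows. The execution role name and account ID are placeholders, not values from this repo:

```shell
# Sketch: daily 02:00 UTC snapshots, 7 retained, targeting the Environment tag
aws dlm create-lifecycle-policy \
  --description "Daily EBS snapshots for <your-env-name>" \
  --state ENABLED \
  --execution-role-arn arn:aws:iam::<account-id>:role/AWSDataLifecycleManagerDefaultRole \
  --policy-details '{
    "ResourceTypes": ["VOLUME"],
    "TargetTags": [{"Key": "Environment", "Value": "<your-env-name>"}],
    "Schedules": [{
      "Name": "daily-0200-utc",
      "CreateRule": {"Interval": 24, "IntervalUnit": "HOURS", "Times": ["02:00"]},
      "RetainRule": {"Count": 7}
    }]
  }'
```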
Recovery:
# 1. Identify the snapshot to restore
aws ec2 describe-snapshots --filters "Name=tag:Environment,Values=<your-env-name>"
 
# 2. Create a volume from the snapshot
aws ec2 create-volume \
  --snapshot-id snap-xxxxxxxxxxxxxxxxx \
  --availability-zone <your-az> \
  --volume-type gp3
 
# 3. Update the PersistentVolume in Kubernetes to reference the new volume ID
#    (if the API server rejects the patch — the volume source of a bound PV is
#    immutable on most clusters — delete and recreate the PV instead)
kubectl patch pv <pv-name> -p '{"spec":{"csi":{"volumeHandle":"<new-vol-id>"}}}'
RPO: Up to 24 hours (daily snapshots). Reduce by increasing DLM schedule frequency.

3. CloudNativePG (HA Supabase DB) — Barman Cloud Backups

Applies to: environments with ENABLE_CNPG=true and ENABLE_HA_SUPABASE_DB=true.
What is backed up: the CloudNativePG Postgres cluster (ha-supabase-db): continuous WAL (Write-Ahead Log) archiving to S3 or MinIO, plus scheduled full base backups via the CNPG ScheduledBackup CRD.
Implementation (configured in values/ha-supabase-db.yaml):
postgres:
  backup:
    enabled: true                    # toggle barman-cloud backups
    # barmanObjectStore points to S3 bucket or MinIO endpoint
    # IRSA / service account must have s3:PutObject, s3:GetObject, s3:ListBucket
    retentionPolicy: "30d"           # keep backups for 30 days
    compression: gzip
Verify backup status:
# List all backups for the cluster
kubectl get backup -n ha-supabase-db
 
# Check a specific backup
kubectl describe backup <backup-name> -n ha-supabase-db
 
# Trigger an on-demand backup
kubectl cnpg backup ha-supabase-db -n ha-supabase-db
Point-in-time recovery (PITR):
# Create a new CNPG Cluster restoring from a backup
# (in a new namespace or after removing the original)
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: ha-supabase-db-restored
  namespace: ha-supabase-db-restore
spec:
  instances: 3
  bootstrap:
    recovery:
      source: ha-supabase-db
      recoveryTarget:
        targetTime: "2026-03-19T03:00:00.000000+00:00"  # target point in time
  externalClusters:
    - name: ha-supabase-db
      barmanObjectStore:
        # same S3/MinIO config as the source cluster
        serverName: ha-supabase-db
RPO: Near-zero for WAL-enabled clusters (seconds of lag, bounded by WAL upload interval).

4. ElastiCache Redis — Multi-AZ Replication

Applies to: environments with ENABLE_AWS_SERVICES=true.
Redis is not a primary data store; it holds transient cache and session data. The DR focus is therefore fast failover rather than backup/restore.
Implementation:
  • Multi-AZ enabled with automatic failover
  • Encryption at-rest and in-transit
  • A primary node failure promotes a replica automatically (< 1 min)
RPO: Redis data is ephemeral by design. Cache misses after failover are expected; the application re-populates from the database.
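The re-population behaviour can be sketched as a read-through cache. This is an illustrative Python sketch, not EKB application code; a dict stands in for the Redis client and a lambda for the database query:

```python
from typing import Callable

class ReadThroughCache:
    """On a miss (e.g. an empty cache just after failover), fall through to
    the database and write the result back, re-warming the cache."""

    def __init__(self, fetch_from_db: Callable[[str], str]) -> None:
        self._store: dict[str, str] = {}   # stand-in for the Redis client
        self._fetch = fetch_from_db

    def get(self, key: str) -> str:
        value = self._store.get(key)
        if value is None:                  # cache miss after failover
            value = self._fetch(key)       # fall through to the database
            self._store[key] = value       # re-populate the cache
        return value

# After "failover" (empty cache): first read hits the DB, second hits the cache.
db_reads: list[str] = []
cache = ReadThroughCache(lambda k: (db_reads.append(k), f"row:{k}")[1])
assert cache.get("user:42") == "row:user:42"
assert cache.get("user:42") == "row:user:42"
assert db_reads == ["user:42"]             # the DB was queried only once
```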

5. Amazon MQ (RabbitMQ) — Message Queue

Applies to: environments with ENABLE_AWS_SERVICES=true.
Implementation:
  • Single-instance (default) or active/standby deployment
  • In-flight messages may be lost during a broker restart; design consumers to be idempotent
  • Management UI available on port 15671 (SSL)
RPO: Single-instance — messages in-flight at time of failure may be lost. Use active/standby deployment mode to reduce this risk.
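"Design consumers to be idempotent" can be sketched as follows. Illustrative Python, not EKB code; an in-memory set stands in for a durable dedupe store (a DB table or Redis SET in production):

```python
class IdempotentConsumer:
    """Messages carry a unique ID; redeliveries after a broker restart are
    detected via the dedupe store and skipped, so side effects run once."""

    def __init__(self) -> None:
        self.processed_ids: set[str] = set()   # durable store in production
        self.side_effects: list[str] = []

    def handle(self, message_id: str, body: str) -> bool:
        if message_id in self.processed_ids:
            return False                       # duplicate delivery: skip
        self.side_effects.append(body)         # the actual work
        self.processed_ids.add(message_id)     # record only after success
        return True

consumer = IdempotentConsumer()
assert consumer.handle("m1", "charge card") is True
assert consumer.handle("m1", "charge card") is False  # redelivered after restart
assert consumer.side_effects == ["charge card"]       # side effect ran once
```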

Auto-Recovery Mechanisms

Karpenter — Node Provisioning & Recovery

  • Consolidation: WhenEmptyOrUnderutilized — idle nodes are terminated automatically
  • Spot interruption handling: Listens to SQS interruption events; drains and replaces Spot nodes before termination
  • Node drift: Nodes using outdated AMIs or configs are replaced automatically when enable_drift = true
  • Recovery time: New node provisioned in 0–10 minutes
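The consolidation and capacity-type settings above correspond roughly to this Karpenter v1 NodePool fragment. A hedged sketch: the NodePool name and consolidateAfter value are assumptions, not values from this repo.

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized  # terminate idle/underused nodes
    consolidateAfter: 30s
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]            # Spot + On-Demand mix
```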

KEDA — Pod Autoscaling

  • Minimum replicas: 2 for all services (Web, API, Celery, Automator) — prevents a single point of failure
  • CPU threshold: 60–70% triggers scale-out
  • Memory threshold: 80% triggers scale-out
  • Scale-down stabilisation: 30 seconds — avoids flapping
  • Recovery time: Failed pods rescheduled within 0–2 minutes
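These thresholds map onto a KEDA ScaledObject roughly as follows. A hedged sketch: the service name is taken from this document, the remaining values are illustrative.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: fastapi-backend
spec:
  scaleTargetRef:
    name: fastapi-backend
  minReplicaCount: 2                  # no single point of failure
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 30   # avoid flapping
  triggers:
    - type: cpu
      metricType: Utilization
      metadata:
        value: "70"                   # scale out above ~70% CPU
    - type: memory
      metricType: Utilization
      metadata:
        value: "80"                   # scale out above 80% memory
```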

Kubernetes Pod Anti-Affinity

All stateless services use preferredDuringSchedulingIgnoredDuringExecution anti-affinity on kubernetes.io/hostname to spread pods across nodes and AZs.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values: ["web", "fastapi-backend", "celery-worker", "automator"]
        topologyKey: kubernetes.io/hostname

Recovery Procedures

Scenario 1: AZ Failure

Expected behaviour: Karpenter provisions replacement nodes in the remaining AZs; KEDA reschedules pods; ALB stops routing to unhealthy targets.
Verification:
kubectl get nodes -o wide     # confirm nodes are in healthy AZs
kubectl get pods -A -o wide   # confirm pods are running
kubectl get ingress            # confirm ALB is routing correctly
No manual intervention is required under normal circumstances.
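If you want to confirm where replacement nodes landed, the standard zone label can be shown alongside each node:

```shell
# List nodes with their zone label to confirm spread across surviving AZs
kubectl get nodes -L topology.kubernetes.io/zone
```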

Scenario 2: Postgres Primary Node Failure (CloudNativePG)

CloudNativePG automatically promotes a replica to primary.
# Monitor failover
kubectl get cluster ha-supabase-db -n ha-supabase-db -w
 
# Confirm new primary
kubectl cnpg status ha-supabase-db -n ha-supabase-db
 
# Check PgBouncer pooler is pointing to new primary
kubectl get svc -n ha-supabase-db | grep pooler
Applications reconnect via PgBouncer — no connection string change needed.

Scenario 3: Restore EBS Volume from Snapshot

# 1. Find snapshots for the environment
aws ec2 describe-snapshots \
  --filters "Name=tag:Environment,Values=<your-env-name>" \
  --query 'Snapshots[*].[SnapshotId,StartTime,VolumeSize]' \
  --output table
 
# 2. Create a new volume from the chosen snapshot
aws ec2 create-volume \
  --snapshot-id snap-xxxxxxxxxxxxxxxxx \
  --availability-zone <target-az> \
  --volume-type gp3 \
  --tag-specifications 'ResourceType=volume,Tags=[{Key=Environment,Value=<your-env-name>}]'
 
# 3. Scale down the affected workload
kubectl scale deployment <workload> --replicas=0
 
# 4. Update the PV to reference the new EBS volume, then scale back up
#    (if the patch is rejected, recreate the PV with the new volume ID and the
#    same claimRef; volume sources on bound PVs are immutable on most clusters)
kubectl patch pv <pv-name> -p '{"spec":{"csi":{"volumeHandle":"<new-volume-id>"}}}'
kubectl scale deployment <workload> --replicas=2

Scenario 4: Full Infrastructure Recreation

Used after catastrophic failure or when rebuilding a region.
# 1. Bootstrap state bucket (if it doesn't exist)
cd terragrunt/environments/<your-env-name>/state
terragrunt apply
 
# 2. Recreate the EKS cluster and core infrastructure
cd terragrunt/environments/<your-env-name>
terragrunt apply
 
# 3. Deploy services in order (see Terragrunt Deployment Guide § Phase 6)
ENABLE_CNPG=true terragrunt apply --target='helm_release.local["cloudnative-pg"]'
ENABLE_HA_SUPABASE_DB=true terragrunt apply --target='helm_release.local["ha-supabase-db"]'
ENABLE_SUPABASE=true terragrunt apply --target='helm_release.local["supabase"]'
Estimated RTO: 30–60 minutes for full cluster; 60–120 minutes if EBS/CNPG data must be restored.

Scenario 5: Terragrunt State Rollback

# List state versions in S3
aws s3api list-object-versions \
  --bucket ekb-terraform-state-<your-env-name> \
  --prefix <state-key>
 
# Restore a specific version
aws s3api get-object \
  --bucket ekb-terraform-state-<your-env-name> \
  --key <state-key> \
  --version-id <version-id> \
  terraform.tfstate
 
# 3. Upload the restored file back as the current state object
aws s3 cp terraform.tfstate s3://ekb-terraform-state-<your-env-name>/<state-key>
 
# 4. If Terraform reports a state checksum mismatch, clear the stale digest
#    item for this state key from the DynamoDB lock table, then run
#    terragrunt plan to confirm the restored state matches reality

Observability & Alerting (SigNoz)

When ENABLE_SIGNOZ=true, SigNoz is deployed to the monitoring namespace and provides:
  • Distributed tracing for all EKB services
  • Cluster metrics via the k8s-infra DaemonSet agent (CPU, memory, pod status)
  • Log aggregation from all pods
  • Alerting — configure alert rules in SigNoz to notify on pod crash loops, high error rates, or node pressure
For DR purposes, SigNoz dashboards are the primary tool for diagnosing and confirming recovery after an incident. SigNoz data itself is stored on EBS PVs and is covered by the EBS snapshot policy.

Gaps & Recommendations

| Gap | Risk | Recommendation |
|---|---|---|
| No cross-region replication | Full region failure = extended RTO | Consider RDS cross-region read replica or S3 cross-region replication for CNPG backups |
| EBS snapshots are daily | Up to 24 h data loss for EBS-backed workloads | Increase DLM frequency to hourly for critical volumes |
| RabbitMQ single-instance (default) | In-flight messages lost on broker failure | Switch to ACTIVE_STANDBY_MULTI_AZ deployment mode for production |
| No automated DR drill | Recovery procedures untested until needed | Schedule quarterly recovery drills; test CNPG PITR restore in a staging environment |
| SigNoz on same cluster | Observability lost during cluster failure | Consider a separate lightweight monitoring endpoint or CloudWatch fallback alerts |
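For the first gap, cross-region replication of the CNPG backup bucket could look roughly like this. A sketch only: bucket names and the replication role are assumptions, and both buckets must already have versioning enabled.

```shell
# Replicate CNPG backups to a bucket in another region (placeholder names)
aws s3api put-bucket-replication \
  --bucket <cnpg-backup-bucket> \
  --replication-configuration '{
    "Role": "arn:aws:iam::<account-id>:role/<s3-replication-role>",
    "Rules": [{
      "ID": "cnpg-backups-crr",
      "Status": "Enabled",
      "Priority": 1,
      "Filter": {"Prefix": ""},
      "DeleteMarkerReplication": {"Status": "Disabled"},
      "Destination": {"Bucket": "arn:aws:s3:::<cnpg-backup-bucket-replica>"}
    }]
  }'
```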