This document describes the disaster recovery (DR) strategy, Recovery Time Objective (RTO), and Recovery Point Objective (RPO) for the EKB AWS/EKS infrastructure. It covers built-in HA capabilities, active backup systems, recovery procedures, and known gaps.
For deployment instructions, see Terragrunt Deployment Guide.
Architecture & High Availability Summary
Infrastructure Components
| Component | Implementation | HA Model |
|---|---|---|
| Compute | EKS (Kubernetes 1.33) + Karpenter | Multi-AZ nodes, Spot + On-Demand |
| Pod autoscaling | KEDA | CPU/Memory-based, min 2 replicas |
| Ingress / SSL | ALB + ACM | Multi-AZ, health-check routing |
| Cache | ElastiCache Redis (conditional) | Multi-AZ with automatic failover |
| Message queue | Amazon MQ / RabbitMQ (conditional) | Single-instance or active/standby |
| Database (self-hosted) | CloudNativePG HA cluster + PgBouncer | Primary + replicas, auto-failover |
| Database (cloud) | Supabase Cloud | Managed, cross-region by provider |
| Observability | SigNoz + k8s-infra (conditional) | In-cluster, no single point of failure |
| IaC state | S3 (versioned) + DynamoDB lock | Encrypted, versioned |
| Persistent volumes | EBS (via EBS CSI Driver) | Snapshots via AWS Data Lifecycle Manager |
Current HA Capabilities
| Capability | Recovery Time | Data Loss | Status |
|---|---|---|---|
| AZ failure (pods) | 0–5 min | None | Active |
| Pod failure / crash | 0–2 min | None | Active |
| ALB health check re-routing | 0–1 min | None | Active |
| Redis Multi-AZ failover | < 1 min | None | Active (when enabled) |
| Karpenter node replacement | 0–10 min | None | Active |
| EBS snapshot restore | 15–30 min | Up to 24 h | Active |
| CloudNativePG PITR restore | 15–60 min | Up to WAL lag | Active (when enabled) |
| Full infrastructure recreation | 30–60 min | None (IaC) | Via Terragrunt |
RTO / RPO Targets
| Scenario | RTO Target | RPO Target | Current Capability |
|---|---|---|---|
| Single pod failure | < 2 min | 0 | Met (KEDA min replicas) |
| Single node failure | < 10 min | 0 | Met (Karpenter replacement) |
| AZ failure | < 5 min | 0 | Met (Multi-AZ spread) |
| Redis failure | < 1 min | 0 | Met (Multi-AZ failover) |
| Postgres primary failure | < 5 min | 0 to a few seconds (WAL lag) | Met (CNPG auto-failover) |
| EBS volume loss | 15–30 min | < 24 h | Partial — DLM snapshots daily |
| Full region failure | > 60 min | Depends on backup frequency | Not currently automated |
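To make these targets checkable during a recovery drill, the rows with unambiguous numbers can be encoded as data and compared against measured values. This is a minimal sketch and not part of the EKB tooling: the scenario keys, the `Target` type, and `evaluate_drill` are hypothetical names, and only the figures are taken from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    rto_min: float  # recovery time target in minutes (exclusive upper bound)
    rpo_min: float  # tolerated data loss in minutes (0 means none)


# Figures copied from the RTO / RPO table; the dictionary keys are made up here.
# The EBS row is a range (15-30 min), so its upper bound is used.
TARGETS = {
    "single_pod_failure": Target(rto_min=2, rpo_min=0),
    "single_node_failure": Target(rto_min=10, rpo_min=0),
    "az_failure": Target(rto_min=5, rpo_min=0),
    "redis_failure": Target(rto_min=1, rpo_min=0),
    "ebs_volume_loss": Target(rto_min=30, rpo_min=24 * 60),
}


def evaluate_drill(scenario: str, measured_rto_min: float, measured_rpo_min: float) -> dict:
    """Compare one drill measurement against the documented target."""
    target = TARGETS[scenario]
    rto_met = measured_rto_min < target.rto_min
    # An RPO target of 0 is met only when no data was lost at all.
    if target.rpo_min == 0:
        rpo_met = measured_rpo_min == 0
    else:
        rpo_met = measured_rpo_min < target.rpo_min
    return {"scenario": scenario, "rto_met": rto_met, "rpo_met": rpo_met}
```

For example, `evaluate_drill("az_failure", 3.5, 0)` reports both targets as met, while a pod that took 2.5 minutes to recover would fail the RTO check for `single_pod_failure`.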
Backup Systems
1. Terraform / Terragrunt State — S3
What is backed up: All infrastructure state (EKS, networking, IAM, Helm releases).
Implementation:
- S3 bucket per environment: ekb-terraform-state-<env-name> (bootstrapped via terragrunt/environments/<env-name>/state/)
- Versioning enabled: any previous state version can be restored
- Server-side encryption (AES-256)
- DynamoDB table for state locking
RPO: Continuous. A new state version is written on every terragrunt apply.
2. EBS Persistent Volumes — AWS Data Lifecycle Manager
What is backed up: EBS volumes attached to pods (Automator PostgreSQL, Supabase MinIO, any stateful workloads).
Implementation: AWS Data Lifecycle Manager (DLM) policy targeting environment tags.
3. CloudNativePG (HA Supabase DB) — Barman Cloud Backups
Applies to: Environments with ENABLE_CNPG=true and ENABLE_HA_SUPABASE_DB=true.
What is backed up: The CloudNativePG Postgres cluster (ha-supabase-db), including continuous WAL (Write-Ahead Log) streaming to S3 or MinIO, and scheduled full base backups via CNPG ScheduledBackup CRD.
Implementation (configured in values/ha-supabase-db.yaml):
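The contents of that values file are not reproduced here. As a rough illustration of what CNPG backup configuration generally looks like, the sketch below builds the `backup` section of a Cluster spec and a ScheduledBackup manifest as Python dictionaries. The bucket path, secret name, secret keys, retention period, and schedule are placeholders, not the EKB values, and the sketch assumes the in-tree `barmanObjectStore` stanza (newer CNPG releases favour the Barman Cloud plugin instead).

```python
import json

CLUSTER_NAME = "ha-supabase-db"  # cluster name used in this document


def backup_stanza(destination_path: str, secret_name: str, retention: str = "30d") -> dict:
    """Return the `backup` section of a CNPG Cluster spec (placeholder values)."""
    return {
        "retentionPolicy": retention,
        "barmanObjectStore": {
            "destinationPath": destination_path,
            "s3Credentials": {
                # Secret key names are assumptions for illustration.
                "accessKeyId": {"name": secret_name, "key": "ACCESS_KEY_ID"},
                "secretAccessKey": {"name": secret_name, "key": "ACCESS_SECRET_KEY"},
            },
            "wal": {"compression": "gzip"},
        },
    }


def scheduled_backup(name: str, schedule: str) -> dict:
    """Return a ScheduledBackup manifest for the cluster."""
    # CNPG schedules use a six-field cron expression with seconds first.
    if len(schedule.split()) != 6:
        raise ValueError("expected a six-field cron expression, seconds first")
    return {
        "apiVersion": "postgresql.cnpg.io/v1",
        "kind": "ScheduledBackup",
        "metadata": {"name": name},
        "spec": {
            "schedule": schedule,
            "backupOwnerReference": "self",
            "cluster": {"name": CLUSTER_NAME},
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON as well as YAML, so the output can be inspected or applied.
    print(json.dumps(scheduled_backup("ha-supabase-db-daily", "0 0 2 * * *"), indent=2))
```

The schedule `0 0 2 * * *` would request a base backup daily at 02:00; continuous WAL archiving runs independently of it once the object store is configured.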
4. ElastiCache Redis — Multi-AZ Replication
Applies to: Environments with ENABLE_AWS_SERVICES=true.
Redis is not a primary data store — it holds transient cache and session data. DR focus is on fast failover rather than backup/restore.
Implementation:
- Multi-AZ enabled with automatic failover
- Encryption at-rest and in-transit
- A primary node failure promotes a replica automatically (< 1 min)
5. Amazon MQ (RabbitMQ) — Message Queue
Applies to: Environments with ENABLE_AWS_SERVICES=true.
Implementation:
- Single-instance (default) or active/standby deployment
- In-flight messages may be lost during a broker restart; design consumers to be idempotent
- Management UI available on port 15671 (SSL)
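The advice to make consumers idempotent can be sketched as follows. This is an illustrative pattern, not the EKB consumer code: `IdempotentConsumer` and its in-memory ID set are hypothetical, and a real consumer would keep processed IDs in a durable store shared by all replicas.

```python
from typing import Callable, Set


class IdempotentConsumer:
    """Wraps a handler so that redelivered messages are processed at most once."""

    def __init__(self, handler: Callable[[dict], None]) -> None:
        self._handler = handler
        # In-memory for illustration only; use a durable store in practice
        # (for example a database table with a unique constraint on the ID).
        self._seen: Set[str] = set()

    def on_message(self, message_id: str, body: dict) -> bool:
        """Process the message unless its ID was already handled.

        Returns True if the handler ran, False for a duplicate delivery.
        """
        if message_id in self._seen:
            return False  # duplicate: acknowledge and skip
        self._handler(body)
        # Record the ID only after the handler succeeds, so a crash inside the
        # handler leads to a retry rather than a silently skipped message.
        self._seen.add(message_id)
        return True


processed = []
consumer = IdempotentConsumer(processed.append)
first = consumer.on_message("msg-1", {"job": "reindex"})
second = consumer.on_message("msg-1", {"job": "reindex"})  # redelivered copy
```

Idempotency makes redelivery after a broker restart safe. It does not recover messages that never reached the consumer, so producers still need their own retry or confirmation strategy.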
Auto-Recovery Mechanisms
Karpenter — Node Provisioning & Recovery
- Consolidation: WhenEmptyOrUnderutilized (idle nodes are terminated automatically)
- Spot interruption handling: Listens to SQS interruption events; drains and replaces Spot nodes before termination
- Node drift: Nodes using outdated AMIs or configs are replaced automatically when enable_drift = true
- Recovery time: New node provisioned in 0–10 minutes
KEDA — Pod Autoscaling
- Minimum replicas: 2 for all services (Web, API, Celery, Automator) — prevents single-point-of-failure
- CPU threshold: 60–70% triggers scale-out
- Memory threshold: 80% triggers scale-out
- Scale-down stabilisation: 30 seconds — avoids flapping
- Recovery time: Failed pods rescheduled within 0–2 minutes
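The settings listed above map onto a KEDA ScaledObject roughly as shown below. This is a sketch under assumptions: the object name, target Deployment name, and `maxReplicaCount` are hypothetical, and the actual EKB manifests may be structured differently.

```python
def scaled_object(name: str, deployment: str, cpu_pct: int = 70, memory_pct: int = 80) -> dict:
    """Build a KEDA ScaledObject with CPU and memory utilization triggers."""
    return {
        "apiVersion": "keda.sh/v1alpha1",
        "kind": "ScaledObject",
        "metadata": {"name": name},
        "spec": {
            "scaleTargetRef": {"name": deployment},
            "minReplicaCount": 2,   # minimum from this document
            "maxReplicaCount": 10,  # hypothetical ceiling, not from this document
            "advanced": {
                "horizontalPodAutoscalerConfig": {
                    # 30-second scale-down stabilisation window
                    "behavior": {"scaleDown": {"stabilizationWindowSeconds": 30}}
                }
            },
            "triggers": [
                {"type": "cpu", "metricType": "Utilization",
                 "metadata": {"value": str(cpu_pct)}},
                {"type": "memory", "metricType": "Utilization",
                 "metadata": {"value": str(memory_pct)}},
            ],
        },
    }


api_scaler = scaled_object("api-scaler", "api")  # "api" is a placeholder Deployment name
```

KEDA expects trigger thresholds as strings, which is why the percentages are converted with `str()`.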
Kubernetes Pod Anti-Affinity
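All stateless services use preferredDuringSchedulingIgnoredDuringExecution anti-affinity on kubernetes.io/hostname to spread pods across nodes. Because nodes are provisioned in multiple AZs, this usually spreads pods across AZs as well, although the hostname key alone does not guarantee it.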
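A soft anti-affinity rule of this kind has the following shape in a pod spec. The sketch is illustrative: the `app` label key and the weights are assumptions, and the optional zone term is an addition for comparison rather than something this document states is configured.

```python
def pod_anti_affinity(app_label: str, spread_across_zones: bool = False) -> dict:
    """Build a preferred (soft) pod anti-affinity stanza for a pod spec."""

    def term(topology_key: str, weight: int) -> dict:
        return {
            "weight": weight,  # valid range is 1-100
            "podAffinityTerm": {
                "labelSelector": {"matchLabels": {"app": app_label}},
                "topologyKey": topology_key,
            },
        }

    # Prefer not to co-locate two replicas of the same app on one node.
    terms = [term("kubernetes.io/hostname", 100)]
    if spread_across_zones:
        # Optional extra preference: avoid placing replicas in the same AZ.
        terms.append(term("topology.kubernetes.io/zone", 50))
    return {"podAntiAffinity": {"preferredDuringSchedulingIgnoredDuringExecution": terms}}


affinity = pod_anti_affinity("web")
```

Because the rule is preferred rather than required, the scheduler can still place two replicas on one node when no other capacity is available, which keeps pods schedulable during an AZ outage.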
Recovery Procedures
Scenario 1: AZ Failure
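Expected behaviour: Karpenter provisions replacement nodes in the remaining AZs; Kubernetes reschedules the affected pods, with KEDA holding each service at its minimum replica count; ALB stops routing to unhealthy targets.
Verification: Confirm that nodes in the remaining AZs are Ready, pods are Running, and ALB targets report healthy.
No manual intervention is required under normal circumstances.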
Scenario 2: Postgres Primary Node Failure (CloudNativePG)
CloudNativePG automatically promotes a replica to primary.
Scenario 3: Restore EBS Volume from Snapshot
Scenario 4: Full Infrastructure Recreation
Used after catastrophic failure or when rebuilding a region.
Scenario 5: Terragrunt State Rollback
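Because the state bucket is versioned, a rollback comes down to identifying the state version that preceded the current one and restoring it. The helper below is a hedged sketch of the selection step only: `previous_version_id` is a hypothetical name, and it operates on the `Versions` list shape returned by the S3 ListObjectVersions API for a single state object.

```python
from datetime import datetime, timezone
from typing import List, Optional


def previous_version_id(versions: List[dict]) -> Optional[str]:
    """Return the VersionId just before the current one, or None if there is none.

    Each entry is expected to carry "VersionId", "IsLatest", and "LastModified",
    as in the S3 ListObjectVersions response for one object key.
    """
    older = [v for v in versions if not v["IsLatest"]]
    if not older:
        return None
    # The most recently modified non-current version is the rollback candidate.
    return max(older, key=lambda v: v["LastModified"])["VersionId"]


# Sample data with made-up version IDs.
sample = [
    {"VersionId": "v3", "IsLatest": True,
     "LastModified": datetime(2025, 1, 3, tzinfo=timezone.utc)},
    {"VersionId": "v1", "IsLatest": False,
     "LastModified": datetime(2025, 1, 1, tzinfo=timezone.utc)},
    {"VersionId": "v2", "IsLatest": False,
     "LastModified": datetime(2025, 1, 2, tzinfo=timezone.utc)},
]
candidate = previous_version_id(sample)
```

Restoring the chosen version typically means copying it over the current object (S3 CopyObject with the VersionId in the copy source) while no other terragrunt run holds the DynamoDB lock, followed by a terragrunt plan to confirm the state matches reality.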
Observability & Alerting (SigNoz)
When ENABLE_SIGNOZ=true, SigNoz is deployed to the monitoring namespace and provides:
- Distributed tracing for all EKB services
- Cluster metrics via the k8s-infra DaemonSet agent (CPU, memory, pod status)
- Log aggregation from all pods
- Alerting — configure alert rules in SigNoz to notify on pod crash loops, high error rates, or node pressure
Gaps & Recommendations
| Gap | Risk | Recommendation |
|---|---|---|
| No cross-region replication | Full region failure = extended RTO | Consider RDS cross-region read replica or S3 cross-region replication for CNPG backups |
| EBS snapshots are daily | Up to 24 h data loss for EBS-backed workloads | Increase DLM frequency to hourly for critical volumes |
| RabbitMQ single-instance (default) | In-flight messages lost on broker failure | Switch to ACTIVE_STANDBY_MULTI_AZ deployment mode for production |
| No automated DR drill | Recovery procedures untested until needed | Schedule quarterly recovery drills; test CNPG PITR restore in a staging environment |
| SigNoz on same cluster | Observability lost during cluster failure | Consider a separate lightweight monitoring endpoint or CloudWatch fallback alerts |
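For the EBS snapshot gap, the recommended hourly schedule could be expressed as a DLM policy along these lines. This is a sketch under assumptions: the tag key and value, retention count, and function name are hypothetical, and the dictionary follows the `PolicyDetails` shape accepted by the DLM CreateLifecyclePolicy API.

```python
# Intervals that DLM accepts when the interval unit is HOURS.
ALLOWED_INTERVALS_HOURS = {1, 2, 3, 4, 6, 8, 12, 24}


def dlm_policy_details(interval_hours: int, retain_count: int,
                       tag_key: str, tag_value: str) -> dict:
    """Build PolicyDetails for an EBS volume snapshot policy."""
    if interval_hours not in ALLOWED_INTERVALS_HOURS:
        raise ValueError(f"interval must be one of {sorted(ALLOWED_INTERVALS_HOURS)}")
    if retain_count < 1:
        raise ValueError("retain_count must be at least 1")
    return {
        "ResourceTypes": ["VOLUME"],
        # Only volumes carrying this tag are snapshotted.
        "TargetTags": [{"Key": tag_key, "Value": tag_value}],
        "Schedules": [
            {
                "Name": f"every-{interval_hours}h",
                "CopyTags": True,
                "CreateRule": {"Interval": interval_hours, "IntervalUnit": "HOURS"},
                "RetainRule": {"Count": retain_count},
            }
        ],
    }


# Hourly snapshots kept for two days; the tag is a placeholder.
hourly = dlm_policy_details(1, 48, "Environment", "production")
```

With interval-based snapshots, the worst-case data loss is roughly one interval, so moving from 24 hours to 1 hour reduces the EBS RPO from up to 24 h to up to about 1 h at the cost of more stored snapshots.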