Friday, November 21, 2025

AWS Disaster Recovery (DR) Strategies | Deep Dive.



Scope:

  • Foundations of AWS Disaster Recovery,
  • Four AWS Disaster Recovery Strategies,
  • AWS Services That Enable DR,
  • Designing a Multi-Region DR Architecture,
  • DR Testing Patterns,
  • Architecture,
  • Cost Optimization Strategies,
  • Choosing the Right DR Strategy.

 Intro:

    • AWS provides multiple patterns for disaster recovery (DR), each with different:
      •  Recovery Point Objective (RPO) 
      •  Recovery Time Objective (RTO).
    •  Enterprises typically choose a strategy based on:
      • Business criticality, 
      • Cost tolerance, 
      • Geographic regulatory requirements, 
      • Operational maturity.

1. Foundations of AWS Disaster Recovery

Key Metrics

  • RPO (Recovery Point Objective)
    • How much data loss is acceptable?
    • Lower RPO = more frequent replication = higher cost.

  • RTO (Recovery Time Objective)
    • How quickly must service be restored?
    • Lower RTO = keeping more infrastructure pre-provisioned = higher cost.
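To make the RPO trade-off concrete, here is a minimal sketch that checks whether a backup schedule can meet an RPO target. It assumes the worst case is a failure just before the next backup, so the newest recoverable copy is up to one backup interval plus the cross-region copy lag old. The function names are illustrative and not part of any AWS API.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta, replication_lag: timedelta) -> timedelta:
    """Upper bound on lost data: one full backup interval plus the copy lag."""
    return backup_interval + replication_lag


def meets_rpo(backup_interval: timedelta, replication_lag: timedelta, rpo_target: timedelta) -> bool:
    """True when the worst-case data loss stays within the RPO target."""
    return worst_case_data_loss(backup_interval, replication_lag) <= rpo_target


# Hourly backups with a 10-minute copy lag fit a 4-hour RPO; 6-hourly backups do not.
print(meets_rpo(timedelta(hours=1), timedelta(minutes=10), timedelta(hours=4)))
print(meets_rpo(timedelta(hours=6), timedelta(minutes=10), timedelta(hours=4)))
```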

Zones of Failure

  • AWS DR spans several failure domains:
    • Availability Zone failures
    • Regional failures
    • Customer errors (data deletion, corruption)
    • Application/Logical failures

NB:

  • DR planning must consider all of the above.

 2. Four AWS Disaster Recovery Strategies

A. Backup & Restore

    • RPO: Hours
    • RTO: Hours to Days
    • Cost: Lowest

Use Case:

    • Non-critical workloads
    • Long-term retention
    • Compliance

Key AWS Components

    • AWS Backup
    • Amazon S3 + Cross-Region Replication (CRR)
    • Amazon RDS snapshots (manual + automated)
    • Amazon EBS snapshots + AMIs
    • Amazon S3 Glacier for archival

Deep Dive

    •  Backup & Restore is the simplest approach. 
    • Primary data is periodically backed up to S3 and optionally replicated to a secondary region.
    • Compute is spun up only after a disaster.
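As a rough illustration of the "optionally replicated to a secondary region" step, the sketch below builds the dictionary that boto3's `put_bucket_replication` accepts as `ReplicationConfiguration`. The bucket names, account ID, role name, and rule ID are all hypothetical, and the code only builds the dictionary; it does not call AWS. Both buckets need versioning enabled for replication to work.

```python
def build_crr_config(role_arn: str, dest_bucket_arn: str, prefix: str = "") -> dict:
    """Return a replication configuration that copies objects under `prefix`."""
    return {
        "Role": role_arn,  # IAM role that S3 assumes to replicate objects
        "Rules": [
            {
                "ID": "dr-backup-replication",  # hypothetical rule name
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {
                    "Bucket": dest_bucket_arn,
                    "StorageClass": "STANDARD_IA",  # cheaper class for the DR copy
                },
            }
        ],
    }


config = build_crr_config(
    "arn:aws:iam::111122223333:role/example-replication-role",
    "arn:aws:s3:::example-dr-backups",
    prefix="backups/",
)
# With boto3 installed and credentials configured, this would be applied as:
# boto3.client("s3").put_bucket_replication(
#     Bucket="example-primary-backups", ReplicationConfiguration=config)
```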

Pros:

    • Cheapest
    • Easy to manage
    • Strong compliance posture

Cons:

    • Slowest recovery
    • Operationally intensive during failover

B. Pilot Light

  • RPO: Minutes
  • RTO: Tens of Minutes
  • Cost: Low–Medium

Use Case

    • Critical systems where data loss must be minimized but cost must stay low.

Key AWS Components

    • Continuous database replication (RDS Read Replica, DMS, Aurora Global Database)
    • Pre-configured AWS Lambda functions or AMI-based EC2 launch templates
    •  Minimal core infrastructure always on (config, databases, routing)

Deep Dive

    • Only the critical components (databases, IAM, configuration store) are live in the secondary region. 
    • App servers, load balancers, and other supporting services are provisioned rapidly from templates during a disaster.

Pros:

    • Faster than backup/restore
    • Low cost
    • Infrastructure-as-code friendly

Cons:

    • Still some provisioning time
    • Configuration drift risk
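Configuration drift can be caught early by comparing the settings of the primary stack with those recorded for the pilot-light templates. The sketch below is one simple way to do that over plain dictionaries; the keys and values are made up for illustration, and a real check would load them from your infrastructure-as-code state or the AWS APIs.

```python
def find_drift(primary: dict, standby: dict) -> dict:
    """Map each differing key to a (primary_value, standby_value) pair.

    A key that is missing on one side is reported with None for that side.
    """
    drift = {}
    for key in sorted(primary.keys() | standby.keys()):
        p, s = primary.get(key), standby.get(key)
        if p != s:
            drift[key] = (p, s)
    return drift


primary = {"ami": "ami-aaa", "instance_type": "m5.large", "min_size": 2}
standby = {"ami": "ami-bbb", "instance_type": "m5.large"}
print(find_drift(primary, standby))
```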

C. Warm Standby

  • RPO: Seconds–Minutes
  • RTO: Minutes
  • Cost: Medium–High

Use Case

    • High-availability workloads that still want cost control.

Key AWS Components

    • Scaled-down version of the production stack always running in the DR region
    • Route 53 failover
    • Cross-Region replication for:
      •    RDS / Aurora
      •    DynamoDB Global Tables
      •    S3 CRR
      •    EKS/ ECS image replication

Deep Dive

    • A fully functional but scaled-down copy of the production environment continuously receives replicated data. 
    • During failover, the EC2/ECS/EKS workers, ALBs, and application tiers are scaled out to production capacity.

Pros:

    • Near-real-time failover
    • Less overhead during a DR event
    • Good balance of cost & performance

Cons:

    • More expensive than pilot light
    • Requires continuous synchronization and testing

D. Active-Active (Multi-Region)

  • RPO: Zero or near-zero
  • RTO: Zero or seconds
  • Cost: Highest

Use Case

Mission-critical systems (banking, global SaaS platforms, healthcare, trading platforms)

Key AWS Components

    •  Amazon Aurora Global Database or DynamoDB Global Tables
    •  Multi-region API endpoints via Route 53 latency-based routing
    •  S3 Multi-Region Access Points
    •  Multi-region EKS/ECS with service mesh (App Mesh, Istio)
    •  CloudFront + Global Accelerator

Deep Dive

    • Both regions serve traffic simultaneously. 
    • Data is replicated in near real time.
    • Failover is automatic, with little to no user impact.

Pros:

    •  Maximum availability
    •  Lowest RTO/RPO
    •  Architected for global performance

Cons:

    •  Highest cost
    •  Increased operational complexity
    •  Requires app design for multi-region state synchronization
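Multi-region state synchronization needs a rule for concurrent writes to the same item in two regions. DynamoDB Global Tables reconcile such conflicts with last-writer-wins; the sketch below shows that idea in plain Python. The `Version` type and the tie-break on region name are assumptions made for this example, not the service's internal logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: str
    timestamp_ms: int  # write time in milliseconds
    region: str        # region that accepted the write


def resolve(a: Version, b: Version) -> Version:
    """Last writer wins; the region name breaks exact timestamp ties so that
    both regions pick the same winner regardless of argument order."""
    return max(a, b, key=lambda v: (v.timestamp_ms, v.region))


east = Version("shipped", 1_700_000_000_200, "us-east-1")
west = Version("pending", 1_700_000_000_100, "eu-west-1")
print(resolve(east, west).value)
```

Last-writer-wins silently discards the older write, so it only suits data where that is acceptable; counters and similar values need a different merge strategy.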

 3. AWS Services That Enable DR

A. Database-Level DR

Relational

    • Aurora Global Database: typically <1 second RPO, <1 minute RTO
    • RDS Cross-Region Read Replicas

NoSQL

    • DynamoDB Global Tables – multi-active (multi-writer) replication across regions

Migration

    • AWS Database Migration Service (AWS DMS) – near-real-time CDC (Change Data Capture) replication

B. File/Object Storage

    • S3 Cross-Region Replication
    • S3 Multi-Region Access Points
    • EFS-to-EFS replication
    • FSx DR (FSx for ONTAP, Windows, Lustre)

C. Compute Layer

    • EC2 AMI replication
    • EC2 Launch Templates (preferred over legacy Launch Configurations)
    • ECS service replication across regions
    • EKS cluster replication (Cluster API/ GitOps/ Crossplane)

D. Networking & Routing

    • Route 53 Health Checks & Failover Routing
    • AWS Global Accelerator for highly available ingress
    • VPC Lattice multi-region architectures (emerging pattern)
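Route 53 failover routing boils down to a pair of record sets: a PRIMARY record tied to a health check and a SECONDARY record that is served when that check fails. The sketch below builds the `ChangeBatch` dictionary that boto3's `change_resource_record_sets` accepts. The domain, IP addresses (taken from documentation ranges), and health check ID are placeholders, and nothing here calls AWS.

```python
def failover_change_batch(name: str, primary_ip: str, secondary_ip: str,
                          health_check_id: str, ttl: int = 60) -> dict:
    """Build PRIMARY/SECONDARY failover A records for one DNS name."""

    def record(role: str, ip: str, extra: dict = None) -> dict:
        rrset = {
            "Name": name,
            "Type": "A",
            "SetIdentifier": f"{role.lower()}-region",
            "Failover": role,  # "PRIMARY" or "SECONDARY"
            "TTL": ttl,        # a short TTL speeds up client cutover
            "ResourceRecords": [{"Value": ip}],
        }
        rrset.update(extra or {})
        return {"Action": "UPSERT", "ResourceRecordSet": rrset}

    return {
        "Comment": "DR failover records (example)",
        "Changes": [
            record("PRIMARY", primary_ip, {"HealthCheckId": health_check_id}),
            record("SECONDARY", secondary_ip),
        ],
    }


batch = failover_change_batch(
    "app.example.com.", "192.0.2.10", "198.51.100.10", "example-health-check-id")
# With boto3: boto3.client("route53").change_resource_record_sets(
#     HostedZoneId="<your hosted zone ID>", ChangeBatch=batch)
```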

E. Infrastructure Orchestration

    • CloudFormation StackSets
    • Terraform / Pulumi multi-region configuration
    • AWS Systems Manager (SSM) for orchestration of recovery

 4. Designing a Multi-Region DR Architecture

1. Identify Critical Apps

Classify each workload:

    • Tier 0 (RTO seconds, RPO zero)
    • Tier 1 (RTO <15 min)
    • Tier 2 (RTO Hours)
    • Tier 3 (Can wait: archival, batch, BI jobs)
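The tiering above can be encoded as a small classifier so that every workload in an inventory is labeled consistently. The cut-offs below are assumptions layered on the list: "seconds" is read as at most one minute, and "hours" as under 24 hours.

```python
from datetime import timedelta


def classify_tier(rto: timedelta, rpo: timedelta) -> int:
    """Return the DR tier (0-3) for a workload's required RTO and RPO."""
    if rto <= timedelta(minutes=1) and rpo == timedelta(0):
        return 0  # RTO in seconds, zero data loss
    if rto < timedelta(minutes=15):
        return 1
    if rto < timedelta(hours=24):
        return 2  # assumption: "hours" means under a day
    return 3      # can wait: archival, batch, BI jobs


print(classify_tier(timedelta(seconds=30), timedelta(0)))
print(classify_tier(timedelta(hours=4), timedelta(hours=1)))
```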

2. Define RPO/RTO Requirements

This determines:

    • Amount of real-time replication
    • Infrastructure running in DR region
    • Cost model

3. Assess Data Gravity

Data-heavy workloads typically rely on managed replication such as:

    •  Aurora Global DB
    •  DynamoDB Global Tables
    •  S3 replication
    •  FSx replication

4. Automate DR Failover

Use:

    • Lambda
    • Step Functions
    • Systems Manager Automation
    • CloudFormation
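Whatever tool runs it, an automated failover is an ordered runbook that stops as soon as a step fails. The sketch below shows that control flow in plain Python. The step names and the lambdas standing in for real actions are hypothetical; in practice each step would invoke a Lambda function or a Systems Manager Automation document, and Step Functions can provide the sequencing.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_failover(steps: List[Step]) -> List[str]:
    """Run steps in order and stop at the first failure. Returns a log."""
    log = []
    for name, action in steps:
        ok = action()
        log.append(f"{name}: {'ok' if ok else 'FAILED'}")
        if not ok:
            break  # do not switch traffic if an earlier step failed
    return log


# Placeholder actions; real ones would call AWS and report success or failure.
runbook = [
    ("promote database replica", lambda: True),
    ("scale out application tier", lambda: True),
    ("switch DNS to DR region", lambda: True),
]
for line in run_failover(runbook):
    print(line)
```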

5. Simulate and Test DR

Run:

    • GameDays
    • Region failover simulations
    • Database failover tests
    • Route 53 failover exercises

Architecture

5. DR Testing Patterns

Test Type               Description
----------------------  ---------------------------------------------------------
Failover Test           Validate that traffic rerouting works
Failback Test           Validate recovery back to the primary region
Backup Integrity Test   Ensure backups restore correctly
Network Path Testing    Validate east–west and north–south flows
Chaos Engineering       Force AZ/region outages using tools like AWS Fault
                        Injection Service (formerly Fault Injection Simulator)
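A backup integrity test can be as simple as comparing a checksum recorded at backup time with the checksum of the restored file. The sketch below assumes the SHA-256 digest of the original was stored alongside the backup; the helper names are illustrative.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches(original_digest: str, restored_path: str) -> bool:
    """True when the restored file is byte-identical to the original."""
    return sha256_of(restored_path) == original_digest
```

A matching checksum only proves the bytes survived; for databases, the test should also start the engine on the restored data and run a sanity query.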

 6. Cost Optimization Strategies

    • Use S3 Standard-IA for cross-region backup copies to reduce storage cost
    • Scale out DR region compute (Auto Scaling) only during DR events
    • Use Spot Instances for DR test workloads
    • Use lifecycle policies to tier and expire backups
    • Offload static assets to CloudFront globally
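The lifecycle-policy idea can be sketched as the rule dictionary that boto3's `put_bucket_lifecycle_configuration` accepts inside `LifecycleConfiguration={"Rules": [...]}`. The prefix, rule ID, and day counts are example values rather than recommendations, and the code only builds the dictionary.

```python
def backup_lifecycle_rule(prefix: str, ia_after_days: int = 30,
                          glacier_after_days: int = 90,
                          expire_after_days: int = 365) -> dict:
    """Tier backups to cheaper storage classes over time, then expire them."""
    if not ia_after_days < glacier_after_days < expire_after_days:
        raise ValueError("day thresholds must be strictly increasing")
    return {
        "ID": "backup-tiering",  # hypothetical rule name
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},
        "Transitions": [
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_after_days, "StorageClass": "GLACIER"},
        ],
        "Expiration": {"Days": expire_after_days},
    }


rule = backup_lifecycle_rule("backups/")
# With boto3: boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="example-dr-backups", LifecycleConfiguration={"Rules": [rule]})
```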

 7. Choosing the Right DR Strategy

Strategy         RPO               RTO          Cost          Best For
---------------  ----------------  -----------  ------------  ---------------------
Backup/Restore   Hours             Hours–Days   Lowest        Non-critical apps
Pilot Light      Minutes           <1 hour      Low–Medium    Medium-critical apps
Warm Standby     Seconds–Minutes   Minutes      Medium–High   High-critical apps
Active-Active    Seconds           Seconds      Highest       Mission-critical apps
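The comparison above can be turned into a simple selector that returns the cheapest strategy whose typical RPO and RTO fit a workload's targets. The numeric bounds below are illustrative assumptions chosen to match the qualitative ranges (for example, "Hours–Days" is taken as up to 72 hours); adjust them to measured values from your own DR tests.

```python
from datetime import timedelta
from typing import Optional

# (name, assumed worst-case RPO, assumed worst-case RTO), ordered cheapest first.
STRATEGIES = [
    ("Backup/Restore", timedelta(hours=24), timedelta(hours=72)),
    ("Pilot Light", timedelta(minutes=15), timedelta(hours=1)),
    ("Warm Standby", timedelta(minutes=5), timedelta(minutes=15)),
    ("Active-Active", timedelta(seconds=5), timedelta(seconds=30)),
]


def choose_strategy(rpo_target: timedelta, rto_target: timedelta) -> Optional[str]:
    """Return the cheapest strategy that meets both targets, or None."""
    for name, rpo, rto in STRATEGIES:
        if rpo <= rpo_target and rto <= rto_target:
            return name
    return None


print(choose_strategy(timedelta(hours=1), timedelta(hours=4)))
```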



