Thursday, November 20, 2025

AWS Disaster Recovery (DR) | Overview.

Scope:

  • DR Core Concepts
  • Categories of Disaster
  • DR Tiers / Strategies
  • Multi-Region Active–Active
  • DR for Each Layer of the Stack
  • Observability & Orchestration
  • DR Testing
  • DR Maturity Model
  • Disaster Recovery Anti-Patterns
  • Reference Multi-Region DR Blueprint (AWS Sample)

Intro:

    • Disaster Recovery is focused on ensuring that critical business services can continue running smoothly or be rapidly restored after a disruptive event.
    • DR sits within the wider domain of Business Continuity (BC) and has strong dependencies on:
      • architecture, 
      • operations, 
      • observability, 
      • security.

1. DR Core Concepts

RPO (Recovery Point Objective)

  • How much data loss is acceptable?
    • Seconds: requires continuous replication
    • Minutes: frequent snapshots + async replication
    • Hours: nightly backups
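
One way to make an RPO target concrete is to measure the largest gap between recovery points (snapshots, backups, or replication checkpoints). The sketch below is illustrative only: the function names and the sample timestamps are assumptions, not part of any AWS API.

```python
from datetime import datetime, timedelta


def worst_case_data_loss(recovery_points: list[datetime], now: datetime) -> timedelta:
    """Largest window of writes that could be lost if disaster struck at the worst moment."""
    if not recovery_points:
        raise ValueError("no recovery points: potential data loss is unbounded")
    points = sorted(recovery_points)
    # Gaps between consecutive recovery points, plus the gap from the latest one to now.
    gaps = [later - earlier for earlier, later in zip(points, points[1:])]
    gaps.append(now - points[-1])
    return max(gaps)


def meets_rpo(recovery_points: list[datetime], now: datetime, rpo: timedelta) -> bool:
    """True when the worst-case loss window fits inside the RPO target."""
    return worst_case_data_loss(recovery_points, now) <= rpo


# Hypothetical nightly backups: fine for a 24-hour RPO, not for a 15-minute one.
nightly = [datetime(2025, 11, 17), datetime(2025, 11, 18), datetime(2025, 11, 19)]
now = datetime(2025, 11, 19, 12, 0)
print(meets_rpo(nightly, now, timedelta(hours=24)))    # True
print(meets_rpo(nightly, now, timedelta(minutes=15)))  # False
```

The same check can be run against real snapshot timestamps to see whether the schedule actually delivers the RPO the business expects.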

RTO (Recovery Time Objective)

  • How fast must twtech resume operations?
    • Seconds: warm standby / active–active
    • Minutes: warm standby
    • Hours: pilot light
    • Days: backup & restore
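
The RTO list above can be read as a lookup from target recovery time to the least expensive strategy that can plausibly meet it. The following is a rough sketch: the exact cut-off values (one minute, one hour, one day) and the choice of active-active for the seconds tier are assumptions made for illustration.

```python
from datetime import timedelta

# (exclusive upper bound on RTO, strategy) pairs, ordered from strictest to loosest.
# The bounds are illustrative assumptions, not official AWS figures.
RTO_TIERS = [
    (timedelta(minutes=1), "active-active"),
    (timedelta(hours=1), "warm standby"),
    (timedelta(days=1), "pilot light"),
]


def strategy_for_rto(rto: timedelta) -> str:
    """Return the first strategy whose RTO band contains the target."""
    for upper_bound, strategy in RTO_TIERS:
        if rto < upper_bound:
            return strategy
    return "backup & restore"


print(strategy_for_rto(timedelta(seconds=30)))  # active-active
print(strategy_for_rto(timedelta(minutes=10)))  # warm standby
print(strategy_for_rto(timedelta(hours=4)))     # pilot light
print(strategy_for_rto(timedelta(days=3)))      # backup & restore
```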

RLO (Recovery Level Objective)

Granularity of recovery:

    • entire region
    • VPC / subnet
    • application
    • database
    • file-level

MTO (Maximum Tolerable Outage)

  • How long the business can survive an outage.

2. Categories of Disaster

Physical / Environmental

    • region-wide outage
    • power loss
    • natural disasters

Logical / Operational

    • bad deployment
    • configuration drift
    • human error
    • insider misuse

Cyber / Security

    • ransomware
    • supply chain compromise
    • credential theft
    • mass data corruption

Upstream Dependencies

    • failed 3rd-party API
    • DNS provider outage
    • SaaS outages
    • PKI/KMS failures

3. DR Tiers / Strategies

  • Below are the four canonical DR topologies, ordered from lowest to highest cost.

3.1 Backup & Restore

  • Lowest cost, highest RTO/RPO
    • nightly backups
    • offsite copies (cross-region/object lock/WORM)
    • restore into secondary region only during disaster

Used for: non-critical systems, batch workloads.
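
As a sketch of the "offsite copies" step, the snippet below builds the arguments for the EC2 `copy_snapshot` call, which is issued against a client in the destination (DR) region. The helper names, snapshot ID, and regions are hypothetical; the boto3 import is deferred so the request builder can be inspected without AWS credentials.

```python
from typing import Optional


def build_copy_snapshot_request(snapshot_id: str, source_region: str,
                                kms_key_id: Optional[str] = None) -> dict:
    """Keyword arguments for ec2.copy_snapshot, called from the DR region."""
    request = {
        "SourceRegion": source_region,
        "SourceSnapshotId": snapshot_id,
        "Description": f"DR copy of {snapshot_id} from {source_region}",
    }
    if kms_key_id:
        # Re-encrypt the copy with a key that exists in the DR region.
        request["Encrypted"] = True
        request["KmsKeyId"] = kms_key_id
    return request


def copy_snapshot_to_dr_region(snapshot_id: str, source_region: str, dr_region: str,
                               kms_key_id: Optional[str] = None) -> str:
    """Start the cross-region copy and return the new snapshot ID."""
    import boto3  # assumption: boto3 is installed and credentials are configured

    ec2 = boto3.client("ec2", region_name=dr_region)
    response = ec2.copy_snapshot(
        **build_copy_snapshot_request(snapshot_id, source_region, kms_key_id)
    )
    return response["SnapshotId"]


# Inspect the request without touching AWS (IDs and regions are made up).
print(build_copy_snapshot_request("snap-0123456789abcdef0", "us-east-1"))
```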

3.2 Pilot Light

  • Minimal version of environment always running.
    • core infrastructure pre-provisioned
    • databases replicated or at least snapshots replicated
    • app servers start during failover

RTO: ~30–60 minutes
RPO: minutes–hours

3.3 Warm Standby

  • Scaled-down version of prod constantly running.
    • DB cross-region replication
    • reduced-size compute fleet always on
    • traffic routed to standby only when needed

RTO: 5–15 minutes
RPO: seconds–minutes
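
During a warm-standby failover, the reduced-size fleet has to be scaled to production capacity before it takes traffic. A minimal sketch, assuming the standby fleet is an EC2 Auto Scaling group: the group name, region, and capacity numbers are hypothetical, and the boto3 call is deferred so the request can be checked offline.

```python
def build_scale_up_request(asg_name: str, prod_min: int, prod_max: int,
                           prod_desired: int) -> dict:
    """Keyword arguments for autoscaling.update_auto_scaling_group."""
    if not prod_min <= prod_desired <= prod_max:
        raise ValueError("desired capacity must lie between min and max")
    return {
        "AutoScalingGroupName": asg_name,
        "MinSize": prod_min,
        "MaxSize": prod_max,
        "DesiredCapacity": prod_desired,
    }


def scale_standby_to_production(dr_region: str, request: dict) -> None:
    """Apply the scale-up in the standby region."""
    import boto3  # assumption: boto3 is installed and credentials are configured

    autoscaling = boto3.client("autoscaling", region_name=dr_region)
    autoscaling.update_auto_scaling_group(**request)


# Hypothetical standby group that normally runs 2 instances; production runs 12.
request = build_scale_up_request("web-standby", prod_min=6, prod_max=24, prod_desired=12)
print(request)
```

Validating the capacities before the call avoids discovering a bad runbook parameter in the middle of a failover.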

3.4 Multi-Region Active–Active

Most expensive but highest availability.

    • both regions handle real traffic
    • continuous replication
    • global load balancing (GSLB)
    • write conflict resolution system

RTO: near-zero

RPO: near-zero

Used for: 

    • payments, 
    • identity/auth, 
    • gaming, 
    • IoT ingestion, 
    • trading systems.
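
Active–active designs need a rule for what happens when both regions update the same item. One common rule is last-writer-wins (the approach DynamoDB global tables take); the sketch below shows the idea in isolation. The `Version` type and the region tie-breaker are illustrative assumptions, not a description of any service's internals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: str
    timestamp_ms: int  # write time as recorded by the region that accepted the write
    region: str


def resolve(a: Version, b: Version) -> Version:
    """Last-writer-wins: the newer timestamp wins; the region name breaks ties
    so that both regions pick the same winner regardless of argument order."""
    return max(a, b, key=lambda v: (v.timestamp_ms, v.region))


east = Version("shipped", 1_700_000_000_500, "us-east-1")
west = Version("cancelled", 1_700_000_000_900, "eu-west-1")
print(resolve(east, west).value)  # cancelled
print(resolve(west, east).value)  # cancelled
```

Last-writer-wins silently discards the losing write and depends on reasonably synchronized clocks, so workloads such as payments often need application-level merging or single-writer partitioning instead.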

4. DR for Each Layer of the Stack

4.1 Compute Layer

Stateless compute:

    • rehydrated from CI/CD
    • AMIs, images, container registries replicated
    • IaC (Terraform/CloudFormation/CDK) stored multi-region

Stateful compute:

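    • EBS snapshot / AMI cross-region replication (instance store volumes are ephemeral and cannot be snapshotted)
    • VM failover orchestration (AWS Elastic Disaster Recovery, formerly CloudEndure; VMware; Azure Site Recovery)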

4.2 Networking

Critical DR dependencies:

    • multi-region VPCs
    • CIDR planning without overlap
    • inter-region Transit Gateway peering / VPC peering
    • cross-region private connectivity
    • multi-region firewalls / Web Application Firewall (WAF)
    • DNS failover (Route53, Akamai, Cloudflare)
  • twtech designs two equal network islands that can run independently but share identity + security posture.
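
Overlapping CIDR ranges block peering and Transit Gateway routing between regions, so it is worth checking the address plan before any VPC is created. A small sketch using only the standard library; the VPC names and ranges are made up for illustration.

```python
from ipaddress import ip_network
from itertools import combinations


def find_overlaps(cidrs: dict[str, str]) -> list[tuple[str, str]]:
    """Return every pair of named networks whose address ranges overlap."""
    networks = {name: ip_network(cidr) for name, cidr in cidrs.items()}
    return [
        (a, b)
        for a, b in combinations(networks, 2)
        if networks[a].overlaps(networks[b])
    ]


# Hypothetical address plan: the shared-services range sits inside the primary VPC.
plan = {
    "primary-vpc": "10.0.0.0/16",
    "dr-vpc": "10.1.0.0/16",
    "shared-services": "10.0.128.0/20",
}
print(find_overlaps(plan))  # [('primary-vpc', 'shared-services')]
```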

4.3 Database DR

Relational:

    • Aurora global / cross-region replication
    • PostgreSQL logical replication
    • MySQL binlog shipping
    • Oracle Data Guard
    • SQL Server AlwaysOn Availability Groups (AGs)

NoSQL:

    • DynamoDB global tables
    • MongoDB Atlas multi-region
    • Cassandra multi-datacenter replication

Data Lake:

    • S3 cross-region replication (CRR)
    • object lock + versioning
    • parquet dataset replication
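
For the S3 cross-region replication item, the sketch below builds the configuration passed to the S3 `put_bucket_replication` call. Bucket names, the role ARN, and the rule ID are hypothetical, and both buckets are assumed to already have versioning enabled, which replication requires.

```python
def build_replication_request(source_bucket: str, dest_bucket: str, role_arn: str) -> dict:
    """Keyword arguments for s3.put_bucket_replication (replicate every object)."""
    return {
        "Bucket": source_bucket,
        "ReplicationConfiguration": {
            "Role": role_arn,  # IAM role that S3 assumes to copy objects
            "Rules": [
                {
                    "ID": "dr-replicate-all",
                    "Priority": 1,
                    "Status": "Enabled",
                    "Filter": {},  # empty filter matches all objects
                    "DeleteMarkerReplication": {"Status": "Disabled"},
                    "Destination": {"Bucket": f"arn:aws:s3:::{dest_bucket}"},
                }
            ],
        },
    }


def enable_replication(request: dict) -> None:
    """Apply the replication configuration to the source bucket."""
    import boto3  # assumption: boto3 is installed and credentials are configured

    boto3.client("s3").put_bucket_replication(**request)


request = build_replication_request(
    "datalake-us-east-1",
    "datalake-us-west-2",
    "arn:aws:iam::111122223333:role/s3-crr-role",
)
print(request["ReplicationConfiguration"]["Rules"][0]["Destination"])
```

Leaving delete-marker replication disabled means a delete in the source region does not propagate, which can help when the disaster is an accidental or malicious deletion.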

Caches:

    • Redis Multi-AZ within a region; for cross-region, use native replication (e.g., ElastiCache Global Datastore) or application-managed cold/warm caches.

4.4 Storage

    • immutable backups (WORM/object lock)
    • multi-region object replication
    • EBS/FSx snapshot replication
    • large dataset async replication pipelines

4.5 Identity & Secrets

  • Critical and often forgotten.

Identity:

    • IAM global vs regional entities
    • cross-region STS access
    • backup of IAM roles/policies via IaC
    • multi-region SSO/IdP failover

Secrets/KMS:

    • multi-region KMS keys (MRKs)
    • replication of key policies
    • envelope encryption for portability
    • verify that decrypt operations succeed in the DR region after failover
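
A multi-Region KMS key and its replicas share the same key ID (prefixed `mrk-`) and differ only in the region part of the ARN, so an application can derive the DR-region ARN instead of hard-coding it. A minimal sketch; the account number and key ID are invented, and the function does not confirm that a replica was actually created in the target region.

```python
def replica_key_arn(primary_arn: str, dr_region: str) -> str:
    """Swap the region in a multi-Region KMS key ARN."""
    parts = primary_arn.split(":", 5)
    if len(parts) != 6 or parts[2] != "kms":
        raise ValueError(f"not a KMS ARN: {primary_arn}")
    if not parts[5].startswith("key/mrk-"):
        # Single-Region keys cannot be replicated, so there is no DR ARN to derive.
        raise ValueError("not a multi-Region key")
    parts[3] = dr_region
    return ":".join(parts)


primary = "arn:aws:kms:us-east-1:111122223333:key/mrk-1234abcd12ab34cd56ef1234567890ab"
print(replica_key_arn(primary, "us-west-2"))
# arn:aws:kms:us-west-2:111122223333:key/mrk-1234abcd12ab34cd56ef1234567890ab
```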

5. Observability & Orchestration

Monitoring:

    • multi-region time-series store
    • central event bus
    • application-level heartbeat checks
    • synthetic canary tests
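
Heartbeat and canary results usually feed a simple rule that avoids failing over on a single blip. The sketch below declares a region unhealthy only after several consecutive failures; the threshold of three is an arbitrary assumption.

```python
def should_fail_over(results: list[bool], threshold: int = 3) -> bool:
    """True when the most recent `threshold` health checks all failed.

    `results` is ordered oldest to newest; True means the check passed.
    """
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    recent = results[-threshold:]
    return len(recent) == threshold and not any(recent)


print(should_fail_over([True, False, False, False]))  # True
print(should_fail_over([False, False, True]))         # False
print(should_fail_over([False, False]))               # False (not enough samples yet)
```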

DR Orchestration:

    • automated failover runbooks
    • event-driven workflows (AWS Step Functions, Azure Automation, GCP Workflows)
    • chaos engineering (fault injection, network blackholing)
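
An automated failover runbook is, at its core, an ordered list of steps that halts at the first failure so an operator can intervene. The following is a toy sketch of that control flow; the step names are hypothetical and the lambdas stand in for real actions such as promoting a database or updating DNS.

```python
from typing import Callable, Optional

Step = tuple[str, Callable[[], bool]]


def run_runbook(steps: list[Step]) -> tuple[list[str], Optional[str]]:
    """Run steps in order. Returns (completed step names, name of the failed step or None)."""
    completed: list[str] = []
    for name, action in steps:
        if not action():
            return completed, name  # stop immediately; later steps are not attempted
        completed.append(name)
    return completed, None


# Placeholder actions: the second step is made to fail for demonstration.
steps = [
    ("promote replica database", lambda: True),
    ("scale up standby compute", lambda: False),
    ("switch DNS to standby region", lambda: True),
]
print(run_runbook(steps))  # (['promote replica database'], 'scale up standby compute')
```

In practice a workflow service such as AWS Step Functions would supply the retries, timeouts, and audit trail that this sketch leaves out.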

Logging:

    • multi-region log sinks
    • long-term storage in immutable buckets
    • SIEM cross-region ingestion

6. DR Testing

    • Testing is more important than architecture.

Tests to run:

    • failover / failback
    • DNS cutover
    • region evacuation
    • database promotion
    • IAM/KMS boundary tests
    • network segmentation tests
    • ransomware restore tests
    • dependency isolation (verify that twtech keeps running when a 3rd-party service is down)

Frequency:

    • critical workloads: monthly
    • standard workloads: quarterly
    • regulated workloads: at least semi-annually (mandatory)
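
The test frequencies above are easy to enforce with a small check that flags workloads whose last DR test is too old. A sketch under stated assumptions: the day counts approximate monthly, quarterly, and semi-annual intervals, and the tier names are simply the labels used in the list.

```python
from datetime import date, timedelta

# Approximate intervals in days for the frequencies listed above (assumed values).
TEST_INTERVAL_DAYS = {"critical": 30, "standard": 91, "regulated": 182}


def is_test_overdue(tier: str, last_test: date, today: date) -> bool:
    """True when more time has passed since the last DR test than the tier allows."""
    return today - last_test > timedelta(days=TEST_INTERVAL_DAYS[tier])


today = date(2025, 11, 20)
print(is_test_overdue("critical", date(2025, 10, 1), today))  # True  (50 days > 30)
print(is_test_overdue("standard", date(2025, 10, 1), today))  # False (50 days <= 91)
```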

7. DR Maturity Model

Level   Description
0       Backups only, inconsistent, untested
1       Automated backups, occasional restore tests
2       Cross-region replication for critical DBs
3       Warm standby with automated failover
4       Active–active architecture with global load balancing
5       Chaos-engineered, highly automated, self-healing DR

8. Disaster Recovery Anti-Patterns

❌   Single-region architectures with multi-AZ only
❌   Backups stored in same region
❌   No restore validation
❌   Hard-coding region names in apps
❌   Stateful services without replication
❌   Transitive dependencies (one region relies on another's control plane)
❌   Single-region KMS keys with no matching key in the DR region
❌   Overreliance on synchronous replication everywhere
❌   DNS TTLs > 60 seconds
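
Several of these anti-patterns can be caught mechanically from a workload inventory. The sketch below checks four of them; the dictionary fields are a hypothetical inventory format invented for this example, not an AWS data structure.

```python
def find_anti_patterns(workload: dict) -> list[str]:
    """Return the anti-patterns from the list above that this workload exhibits."""
    findings = []
    if len(workload["regions"]) < 2:
        findings.append("single-region architecture")
    if workload["backup_region"] == workload["primary_region"]:
        findings.append("backups stored in same region")
    if not workload["restore_tested"]:
        findings.append("no restore validation")
    if workload["dns_ttl_seconds"] > 60:
        findings.append("DNS TTL above 60 seconds")
    return findings


# Hypothetical workload record.
orders_api = {
    "regions": ["us-east-1"],
    "primary_region": "us-east-1",
    "backup_region": "us-east-1",
    "restore_tested": True,
    "dns_ttl_seconds": 300,
}
print(find_anti_patterns(orders_api))
# ['single-region architecture', 'backups stored in same region', 'DNS TTL above 60 seconds']
```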

9. Reference Multi-Region DR Blueprint (AWS Sample)







