Wednesday, November 19, 2025

AWS Disaster Recovery (DR), Mitigation & Migration | Overview.



Scope:

    • Intro,
    • DR Fundamentals RPO/RTO & Tiers,
    • Data Layer DR – The Core Problem,
    • Database DR Patterns,
    • Application & Compute DR,
    • Network & Global Traffic Management,
    • Security Controls in DR,
    • Automation & Playbooks,
    • DR Simulation & Chaos Mitigation,
    • Cloud Migration,
    • Discovery & Migration Readiness,
    • Migration Patterns,
    • Data Migration Strategies,
    • Cutover Strategy,
    • Observability & Validation,
    • DR & Migration Anti-Patterns.

Intro:

    • Disaster Recovery (DR) is the process of restoring access and functionality to infrastructure after a disruptive event, such as:
      • A natural disaster, 
      • Hardware failure, 
      • Cyberattack. 
    • The core objective of Disaster Recovery (DR) is to minimize downtime and data loss, ensuring that critical business functions can resume as quickly as possible.

Disaster Recovery (DR) involves:

    1.  Business Continuity Requirements (RPO/RTO, regulatory, impact mapping)
    2.  Data durability + replication model
    3.  Compute / application reconstruction strategy
    4.  Operational orchestration + automation

The sections below walk through each layer.

1. DR Fundamentals → RPO/RTO & Tiers

  • RPO (Recovery Point Objective): how much data loss is acceptable?
      • 0 seconds: synchronous replication
      • Minutes: near-real-time log shipping
      • Hours: periodic backups/snapshots
  • RTO (Recovery Time Objective): how long can the service be down?
      • Seconds: active-active
      • Minutes: active-standby
      • Hours: warm standby
      • Days: cold DR

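DR Tiers (Industry Standard)

Strategy                   | RPO         | RTO           | Summary
---------------------------+-------------+---------------+------------------------------------------------
Backup & Restore           | Hours       | Day(s)        | Cheapest, slowest. Snapshots + IaC rebuild.
Pilot Light                | Min–Hours   | Hours         | Minimal core infra kept running with the DB replicated; the rest is started on failover.
Warm Standby               | Seconds–Min | Minutes–Hours | Partially scaled environment that can scale up.
Multi-Region Active-Active | 0–Seconds   | Seconds–Min   | Highest cost, highest resilience.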
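
To make the tier table concrete, here is a minimal Python sketch that picks the cheapest tier able to meet a given RPO/RTO target. The numeric thresholds are illustrative assumptions taken from the lower end of each range in the table, not official figures, and the names `Tier`, `TIERS`, and `cheapest_tier` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    rpo_seconds: float  # best RPO this tier is assumed to reach
    rto_seconds: float  # best RTO this tier is assumed to reach


# Ordered cheapest first. Thresholds are assumptions for illustration only.
TIERS = [
    Tier("Backup & Restore", rpo_seconds=3600, rto_seconds=86400),
    Tier("Pilot Light", rpo_seconds=60, rto_seconds=3600),
    Tier("Warm Standby", rpo_seconds=1, rto_seconds=60),
    Tier("Multi-Region Active-Active", rpo_seconds=0, rto_seconds=1),
]


def cheapest_tier(rpo_target_s: float, rto_target_s: float) -> str:
    """Return the first (cheapest) tier whose best case meets both targets."""
    for tier in TIERS:
        if tier.rpo_seconds <= rpo_target_s and tier.rto_seconds <= rto_target_s:
            return tier.name
    raise ValueError("no tier in this table meets the requested targets")


if __name__ == "__main__":
    # A workload tolerating 5 minutes of data loss and 2 hours of downtime.
    print(cheapest_tier(rpo_target_s=300, rto_target_s=7200))
```

In practice the thresholds would come from measured drill results rather than a static table.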

2. Data Layer DR – The Core Problem

Data Replication Models

A. Synchronous Replication

    • 0 data loss
    • Requires low-latency cross-region links
    • Not always supported (e.g., Aurora Global Database is asynchronous)
    • Might impact write latency

Use Cases: 

    • Transactions requiring strict consistency.

B. Asynchronous Replication

    • Small RPO (Recovery Point Objective) = seconds
    • No write latency added
    • Risk of data loss on region outage

Use Cases:

    • Distributed apps with tolerance to minimal data loss.
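
The trade-off between the two models can be put into rough numbers. The sketch below is a back-of-the-envelope model, not a measurement tool: it assumes a steady write rate and a constant replication lag, and both function names are hypothetical.

```python
def writes_at_risk(write_rate_per_s: float, replication_lag_s: float) -> float:
    """Asynchronous model: committed writes not yet replicated when the region fails."""
    if write_rate_per_s < 0 or replication_lag_s < 0:
        raise ValueError("rate and lag must be non-negative")
    return write_rate_per_s * replication_lag_s


def sync_commit_latency_ms(local_commit_ms: float, cross_region_rtt_ms: float) -> float:
    """Synchronous model: every commit also waits for one cross-region round trip."""
    return local_commit_ms + cross_region_rtt_ms


if __name__ == "__main__":
    # 500 writes/s with a 2 s lag leaves up to 1000 writes exposed to loss.
    print(writes_at_risk(500, 2))
    # A 5 ms local commit plus a 70 ms round trip gives a 75 ms synchronous commit.
    print(sync_commit_latency_ms(5, 70))
```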

Database DR Patterns

Amazon Aurora Global Database 

    • Writer in Region A
    • Read replicas in Region B, C
    • Sub-second replication
    • Failover controlled manually or with orchestrator
    • RPO (Recovery Point Objective) < 1s, RTO (Recovery Time Objective) ~1–2 minutes.
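
An orchestrator-driven failover could look roughly like the sketch below. It assumes the boto3 RDS client's `failover_global_cluster` operation with its `Switchover` and `AllowDataLoss` parameters; verify these against the current boto3 documentation before relying on them. The helper name `promote_secondary` and the identifiers in the usage comment are hypothetical.

```python
def promote_secondary(rds_client, global_cluster_id: str,
                      target_cluster_arn: str, planned: bool):
    """Promote a secondary cluster of an Aurora global database.

    planned=True  -> switchover (waits for replication to catch up).
    planned=False -> failover that accepts losing unreplicated writes.
    """
    kwargs = {
        "GlobalClusterIdentifier": global_cluster_id,
        "TargetDbClusterIdentifier": target_cluster_arn,
    }
    if planned:
        kwargs["Switchover"] = True
    else:
        kwargs["AllowDataLoss"] = True
    # Assumed boto3 call; rds_client is expected to be boto3.client("rds").
    return rds_client.failover_global_cluster(**kwargs)


# Usage (requires boto3 and AWS credentials; identifiers are placeholders):
#   import boto3
#   rds = boto3.client("rds", region_name="us-west-2")
#   promote_secondary(rds, "my-global-db", "arn:aws:rds:...", planned=False)
```

Passing the client in keeps the function testable with a stub and lets the runbook decide which region's endpoint to call.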

RDS Cross-Region Read Replicas / Logs

    • Higher latency than Aurora
    • Good for RPO of minutes
    • Works for MySQL, PostgreSQL, SQL Server, and MariaDB engines.

DynamoDB Global Tables

    • Multi-region active-active
    • True multi-master
    • Conflict resolution = last writer wins
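    • RPO (Recovery Point Objective) near zero (replication between Regions is asynchronous by default, typically completing within about a second), RTO (Recovery Time Objective) = seconds
    • Requires careful design for idempotency and conflict safety.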
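
Why last-writer-wins needs careful design is easiest to see in a small simulation. This is a simplified model of the idea, not DynamoDB's actual implementation; the `Version` type and the region-name tie-break are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: str
    timestamp_ms: int
    region: str


def last_writer_wins(a: Version, b: Version) -> Version:
    """Keep the version with the later timestamp.

    Ties are broken on region name (an assumption) so that every replica
    converges on the same winner regardless of arrival order.
    """
    return max(a, b, key=lambda v: (v.timestamp_ms, v.region))


if __name__ == "__main__":
    # Two regions update the same order within 5 ms of each other.
    east = Version("status=shipped", 1_000, "us-east-1")
    west = Version("status=cancelled", 1_005, "eu-west-1")
    winner = last_writer_wins(east, west)
    # The earlier write is silently discarded, which is why updates
    # should be idempotent and conflicts designed out where possible.
    print(winner.value)
```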

S3 Cross-Region Replication

    • Asynchronous
    • RPO = seconds-minutes
    • Versioning strongly recommended
    • Beware delete marker replication rules
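
The delete-marker caveat comes down to one field in the replication rule. The sketch below builds a replication configuration as a plain dictionary in the shape accepted by boto3's `put_bucket_replication`; the rule ID, the helper name, and the ARNs in the usage comment are placeholders, and the field names should be checked against the current S3 API reference.

```python
def replication_config(role_arn: str, dest_bucket_arn: str,
                       replicate_delete_markers: bool) -> dict:
    """Build a single-rule cross-region replication configuration."""
    return {
        "Role": role_arn,
        "Rules": [
            {
                "ID": "dr-replication",       # placeholder rule name
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": ""},     # empty prefix = whole bucket
                # If disabled, a delete in the source leaves the replica intact.
                "DeleteMarkerReplication": {
                    "Status": "Enabled" if replicate_delete_markers else "Disabled"
                },
                "Destination": {"Bucket": dest_bucket_arn},
            }
        ],
    }


# Usage (requires boto3, credentials, and versioning enabled on both buckets):
#   import boto3
#   boto3.client("s3").put_bucket_replication(
#       Bucket="source-bucket",
#       ReplicationConfiguration=replication_config(
#           "arn:aws:iam::111122223333:role/replication",
#           "arn:aws:s3:::dr-bucket",
#           replicate_delete_markers=False,
#       ),
#   )
```

Leaving delete-marker replication disabled protects the DR copy from accidental or malicious deletes, at the cost of the two buckets diverging over time.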

EBS + Snapshot DR

    •  Use for block-level data (VMs, stateful infra)
    •  Slowest (RTO hours)

3. Application & Compute DR

  • The design depends on architecture patterns:

Stateless Microservices

    • Docker images in ECR/GCR
    • Infra managed by IaC (Terraform, CloudFormation, CDK)
    • Auto-recreated in destination region
    • Load balanced using global routing (Route 53, CloudFlare, etc)

Stateful Services

    • Must pair with data replication strategy
    • Use persistent volume claims + cross-region storage replication

Orchestrators/Compute models

Workload Type                                                         | DR Strategy
----------------------------------------------------------------------+--------------------------------------------------------------
EC2                                                                   | AMIs + Launch Templates in destination region
EKS / AKS (Azure Kubernetes Service) / GKE (Google Kubernetes Engine) | Cluster recreated via IaC; EBS state replicated via snapshots
Lambda                                                                | Multi-region deployment package + alias failover
Serverless (API Gateway, SQS, SNS)                                    | Deploy in both regions; use global routing

4. Network & Global Traffic Management

DNS-Based Failover

    • Route 53 health checks
    • Weighted routing
    • Latency-based routing
    • Requires low TTL (30 sec typical)
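
The reason for the low TTL shows up when the failover delay is added up. This is a simplified worst-case model: it assumes detection takes one check interval per consecutive failure and that clients honor the record TTL exactly. The function name and the example values are assumptions.

```python
def worst_case_dns_failover_s(check_interval_s: int,
                              failure_threshold: int,
                              record_ttl_s: int) -> int:
    """Seconds until all clients reach the healthy endpoint, worst case.

    detection = interval x consecutive failures needed to mark unhealthy
    drain     = time for cached DNS answers to expire (the TTL)
    """
    detection = check_interval_s * failure_threshold
    return detection + record_ttl_s


if __name__ == "__main__":
    # 30 s checks, 3 failures to trip, 30 s TTL: about 2 minutes.
    print(worst_case_dns_failover_s(30, 3, 30))
    # Same checks with a 5-minute TTL: about 6.5 minutes.
    print(worst_case_dns_failover_s(30, 3, 300))
```

Resolvers that ignore TTLs can stretch the real figure further, so the result is better read as a floor on the worst case than as a guarantee.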

Global Load Balancing

    • CloudFront
    • GSLB in F5/Cloudflare/Akamai

VPC Connectivity

  • Cross-region traffic must be considered:
    • VPC Peering
    • Transit Gateway (TGW)
    • PrivateLink
    • Encrypted WAN links

5. Security Controls in DR

Encryption

    • KMS multi-region keys (MRKs) for customer data
    • Client-side encryption for cross-region replication
    • IAM (a global service) kept consistent via IaC, with AWS Organizations SCPs as guardrails

Identity Federation

    • Cross-region IAM roles
    • If using Okta/ADFS, ensure the IdP itself has a failover path

Secrets Management

    • Secrets Manager or Vault multi-region replication
    • Avoid environment-variable secrets (non-DR-friendly)

6. Automation & Playbooks

DR must be automated, not manual.

IaC Core

    • Terraform with workspaces per region
    • CDK/CloudFormation StackSets
    • Ensure DR region drift detection and validation pipelines

DR Orchestration Runbooks (Samples):

    •  Promote Aurora secondary to primary
    •  Re-point DNS to new ALB
    •  Scale up warm standby ASGs
    •  Switch CI/CD pipelines to DR region
    •  Rehydrate secrets and parameters
  • Use SSM Automation or Step Functions for full orchestration.
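
The runbook above is essentially an ordered list of steps that must stop at the first failure. The sketch below shows that control flow in plain Python as a stand-in for SSM Automation or Step Functions; the step bodies are placeholders and `run_runbook` is a hypothetical name.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


def run_runbook(steps: List[Step]) -> List[str]:
    """Run steps in order; stop at the first failure and report progress."""
    completed: List[str] = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            raise RuntimeError(
                f"step '{name}' failed; completed so far: {completed}"
            ) from exc
        completed.append(name)
    return completed


if __name__ == "__main__":
    # Placeholder actions; real ones would call AWS APIs.
    steps: List[Step] = [
        ("promote-aurora-secondary", lambda: None),
        ("repoint-dns", lambda: None),
        ("scale-up-asgs", lambda: None),
    ]
    print(run_runbook(steps))
```

Recording which steps completed makes a failed run resumable instead of forcing a restart from the top.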

7. DR Simulation & Chaos Mitigation

Run periodic controlled DR tests:

    • Simulate Region A loss
    • Validate failover
    • Validate RTO/RPO
    • Ensure application consistency
    • Perform rollback to primary region post-test
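
Validating RTO/RPO during a drill means comparing three timestamps against the targets. A minimal sketch, assuming the drill records when the outage was injected, the commit time of the last write that reached the DR region, and when the service answered again; all names are hypothetical.

```python
from datetime import datetime, timedelta


def evaluate_drill(outage_start: datetime,
                   last_replicated_write: datetime,
                   service_restored: datetime,
                   rpo_target: timedelta,
                   rto_target: timedelta) -> dict:
    """Compute achieved RPO/RTO for one drill and compare with targets."""
    achieved_rpo = outage_start - last_replicated_write
    achieved_rto = service_restored - outage_start
    return {
        "achieved_rpo": achieved_rpo,
        "achieved_rto": achieved_rto,
        "rpo_met": achieved_rpo <= rpo_target,
        "rto_met": achieved_rto <= rto_target,
    }


if __name__ == "__main__":
    t0 = datetime(2025, 1, 1, 12, 0, 0)
    print(evaluate_drill(
        outage_start=t0,
        last_replicated_write=t0 - timedelta(seconds=2),
        service_restored=t0 + timedelta(minutes=4),
        rpo_target=timedelta(seconds=5),
        rto_target=timedelta(minutes=15),
    ))
```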

Chaos Mitigation

Use tools like:

    • AWS Fault Injection Service (FIS, formerly Fault Injection Simulator)
    • Gremlin
    • LitmusChaos

Cloud Migration

Migration is a multi-dimensional project with four components:

    1.      Discovery & Inventory
    2.      Refactor/Modernize vs Lift-and-Shift
    3.      Data Migration Strategy
    4.      Cutover Plan + Rollback Strategy

1. Discovery & Migration Readiness

Inventory & Assessment

    • Configuration Management Database (CMDB)
    • Network maps
    • Data flows
    • Inter-service dependencies
    • Identity integrations
    • Licensing constraints
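
Once inter-service dependencies are inventoried, they can be turned into migration waves. The sketch below uses Python's standard `graphlib` and assumes that a service should move only after everything it depends on has moved, which is one possible ordering policy among several. The service names are made up.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def migration_waves(depends_on: Dict[str, Set[str]]) -> List[List[str]]:
    """Group services into waves; each wave depends only on earlier waves.

    Raises graphlib.CycleError if the dependency map contains a cycle.
    """
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    waves: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    inventory = {
        "web": {"api"},
        "api": {"db", "auth"},
        "db": set(),
        "auth": set(),
    }
    for number, wave in enumerate(migration_waves(inventory), start=1):
        print(number, wave)
```

A cycle error here is useful in itself: it flags services that have to move together or be decoupled first.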

Cloud Readiness

    • OS versions supported?
    • DB engines compatible?
    • Storage layer suitable?
    • Compliance / audit requirements?

2. Migration Patterns

Six Migration R’s

    1.      Rehost (Lift and Shift)
    2.      Replatform (Lift-Tinker-and-Shift)
    3.      Refactor (App modernization)
    4.      Repurchase (SaaS replacement)
    5.      Retire (decommission)
    6.      Retain (keep on-prem temporarily)

3. Data Migration Strategies

A. Online migration (near-zero downtime)

Useful for large operational DBs.

    • DMS (CDC - Change Data Capture)
    • Log shipping
    • GoldenGate
    • Debezium
    • Dual write + cutover
    • AWS Snowball Edge for the initial bulk transfer of data at rest

B. Offline migration

When downtime is allowed.

    • Snapshot + restore
    • Bulk dumps
    • Cold cutover

C. Hybrid multi-step

    • Bulk load historical data
    • CDC apply delta
    • Freeze writes
    • Final cutover
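
The hybrid sequence can be illustrated with an in-memory model: copy a snapshot, keep capturing changes on the source, then replay them on the target before cutover. This is a toy sketch of the idea only; real CDC tools work from database logs, and the `(op, key, value)` change format is an assumption.

```python
from typing import Dict, Iterable, Optional, Tuple

Change = Tuple[str, int, Optional[str]]  # (op, key, value)


def bulk_load(snapshot: Dict[int, str]) -> Dict[int, str]:
    """Step 1: copy the historical snapshot into the target."""
    return dict(snapshot)


def apply_cdc(target: Dict[int, str], changes: Iterable[Change]) -> Dict[int, str]:
    """Step 2: replay changes captured since the snapshot, in order."""
    for op, key, value in changes:
        if op == "delete":
            target.pop(key, None)  # tolerate replays of the same delete
        else:                      # insert and update are both upserts
            target[key] = value
    return target


if __name__ == "__main__":
    snapshot = {1: "alice", 2: "bob"}
    target = bulk_load(snapshot)
    # Writes that hit the source while the bulk load was running.
    delta = [("update", 1, "alice-v2"), ("insert", 3, "carol"), ("delete", 2, None)]
    apply_cdc(target, delta)
    # After freezing writes, the target should equal the source.
    print(target)
```

Treating inserts and updates as upserts keeps the replay idempotent, so a retried batch does not corrupt the target.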

4. Cutover Strategy

Blue/Green Migration

    • New environment (green) tested
    • Traffic switchover via DNS or ALB
    • Easy rollback

Canary Migration

    • Gradual percentage-based routing
    • Ideal for microservices
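
Percentage-based routing is usually done by the load balancer or DNS weights, but the underlying idea fits in a few lines. The sketch below assumes sticky routing by a stable key such as a user ID, hashed into 100 buckets; `routes_to_new` is a hypothetical name.

```python
import hashlib


def routes_to_new(request_key: str, canary_percent: int) -> bool:
    """Return True if this key should be served by the new environment.

    The same key always lands in the same bucket, so raising the
    percentage only ever adds keys to the canary, never removes them.
    """
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < canary_percent


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    for percent in (1, 10, 50, 100):
        share = sum(routes_to_new(u, percent) for u in users) / len(users)
        print(percent, round(share, 3))
```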

Big Bang

    • Swap everything at once
    •  High risk, sometimes necessary (legacy monoliths)

Rollback Planning

    • Must plan data reconciliation
    • DB downgrade path
    • S3 versioning to recover state
    • Rollback IaC stack

5. Observability & Validation

Pre-Cutover

    • Load tests
    • DB consistency checks
    • Side-by-side comparison of new & old systems
    • API contract validation
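
A DB consistency check before cutover can be as simple as comparing row counts and an order-independent checksum per table. The sketch below assumes rows are small enough to fetch as comparable tuples; for large tables a per-chunk or database-side checksum would be more realistic. The function names are hypothetical.

```python
import hashlib
from typing import Iterable, Tuple


def table_checksum(rows: Iterable[Tuple]) -> str:
    """Hash all rows after sorting, so fetch order does not matter."""
    digest = hashlib.sha256()
    for row in sorted(rows):
        digest.update(repr(row).encode("utf-8"))
        digest.update(b"\n")  # row separator
    return digest.hexdigest()


def tables_match(old_rows: Iterable[Tuple], new_rows: Iterable[Tuple]) -> bool:
    """Compare row counts first (cheap), then checksums."""
    old_list, new_list = list(old_rows), list(new_rows)
    if len(old_list) != len(new_list):
        return False
    return table_checksum(old_list) == table_checksum(new_list)


if __name__ == "__main__":
    old = [(1, "alice"), (2, "bob")]
    new = [(2, "bob"), (1, "alice")]  # same data, different order
    print(tables_match(old, new))
```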

Post-Cutover

    •  Dynamic log baselining
    •  Error budget alarms
    •  Latency deltas across regions
    •  Event-driven monitoring (Lambda-based anomaly detection)
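
An error budget alarm compares the observed error rate with the rate the SLO allows. A minimal sketch, assuming a request-based SLO and a single evaluation window; the burn-rate threshold of 2 and the function names are arbitrary choices for the example.

```python
def burn_rate(errors: int, requests: int, slo_target: float) -> float:
    """How many times faster than allowed the error budget is being spent.

    slo_target is the success-rate objective, e.g. 0.999 for 99.9%.
    A result of 1.0 means the budget is consumed exactly on schedule.
    """
    if requests <= 0:
        raise ValueError("requests must be positive")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1")
    allowed_error_rate = 1.0 - slo_target
    observed_error_rate = errors / requests
    return observed_error_rate / allowed_error_rate


def should_alarm(errors: int, requests: int, slo_target: float,
                 threshold: float = 2.0) -> bool:
    """Alarm when the budget burns faster than the chosen threshold."""
    return burn_rate(errors, requests, slo_target) >= threshold


if __name__ == "__main__":
    # 50 errors in 10,000 requests against a 99.9% SLO: roughly 5x burn.
    print(round(burn_rate(50, 10_000, 0.999), 2))
    print(should_alarm(50, 10_000, 0.999))
```

Running the same calculation on the old and new environments right after cutover gives a like-for-like signal for the rollback decision.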

DR & Migration Anti-Patterns

❌     Only testing DR during outages
❌     Keeping manually curated infra in DR region
❌     Replicating unencrypted data cross-region
❌     DNS TTL > 5 minutes
❌     Storing secrets in env vars or instance configs
❌     Relying on backups without restore testing
❌     Migrating first, discovering dependencies later
❌     Attempting active-active without solving data consistency first



