AWS Disaster Recovery (DR), Mitigation & Migration - Overview.
Scope:
- Intro,
- DR Fundamentals → RPO/RTO & Tiers,
- Data Layer DR – The Core Problem,
- Database DR Patterns,
- Application & Compute DR,
- Network & Global Traffic Management,
- Security Controls in DR,
- Automation & Playbooks,
- DR Simulation & Chaos Mitigation,
- Cloud Migration,
- Discovery & Migration Readiness,
- Migration Patterns,
- Data Migration Strategies,
- Cutover Strategy,
- Observability & Validation,
- DR & Migration Anti-Patterns.
Intro:
- Disaster Recovery (DR) is the process of restoring access and functionality to infrastructure after a disruptive event, such as:
- A natural disaster,
- Hardware failure,
- Cyberattack.
- The core objective of Disaster Recovery (DR) is to minimize downtime and data loss, ensuring that critical business functions can resume as quickly as possible.
Disaster Recovery (DR) involves:
- Business Continuity Requirements (RPO/RTO, regulatory, impact mapping)
- Data durability + replication model
- Compute / application reconstruction strategy
- Operational orchestration + automation
The sections below walk through each layer.
1. DR Fundamentals → RPO/RTO & Tiers
- RPO (Recovery Point Objective)
- How much data loss is acceptable?
- 0 seconds → synchronous replication
- Minutes → near-real-time log shipping
- Hours → periodic backups/snapshots
- RTO (Recovery Time Objective)
- How long can the service be down?
- Seconds → active-active
- Minutes → active-standby
- Hours → warm standby
- Days → cold DR
DR Tiers (Industry Standard)
| Strategy | RPO | RTO | Summary |
| Backup & Restore | Hours | Day(s) | Cheapest, slowest. Snapshots + IaC rebuild. |
| Pilot Light | Min–Hours | Hours | Minimal infra “flickered on”; DB replicated. |
| Warm Standby | Seconds–Min | Minutes–Hours | Partial scaled environment, can scale up. |
| Multi-Region Active-Active | 0–Seconds | Seconds–Min | Highest cost, highest resilience. |
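The tier table can be read as a lookup: choose the cheapest strategy whose worst-case RPO and RTO still meet the business targets. The sketch below illustrates that reading. The numeric bounds per tier are illustrative assumptions (the table gives only rough ranges), and `cheapest_tier` is a hypothetical helper, not an AWS API.

```python
from dataclasses import dataclass
from typing import Optional

MINUTE, HOUR, DAY = 60, 3600, 86400


@dataclass(frozen=True)
class Tier:
    name: str
    worst_rpo_s: int  # assumed worst-case data loss, in seconds
    worst_rto_s: int  # assumed worst-case downtime, in seconds


# Ordered cheapest first. The bounds are assumptions chosen to match the
# rough ranges in the table; replace them with measured values.
TIERS = [
    Tier("Backup & Restore", 24 * HOUR, 3 * DAY),
    Tier("Pilot Light", 4 * HOUR, 8 * HOUR),
    Tier("Warm Standby", 5 * MINUTE, 1 * HOUR),
    Tier("Multi-Region Active-Active", 5, 1 * MINUTE),
]


def cheapest_tier(rpo_target_s: int, rto_target_s: int) -> Optional[str]:
    """Return the cheapest tier that satisfies both targets, or None."""
    for tier in TIERS:
        if tier.worst_rpo_s <= rpo_target_s and tier.worst_rto_s <= rto_target_s:
            return tier.name
    return None


print(cheapest_tier(10 * MINUTE, 2 * HOUR))
```

A result of `None` means the targets are tighter than any tier can promise under these assumed bounds, which usually signals that the targets need renegotiating.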
2. Data Layer DR – The Core Problem
Data Replication Models
A. Synchronous Replication
- 0 data loss
- Requires low-latency cross-region links
- Not always supported (e.g., Aurora Global Database is asynchronous)
- Might impact write latency
Use Cases:
- Transactions requiring strict consistency.
B. Asynchronous Replication
- Small RPO (Recovery Point Objective) = seconds
- No write latency added
- Risk of data loss on region outage
Use Cases:
- Distributed apps with tolerance to minimal data loss.
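The difference between the two models can be made concrete with a toy timeline. The sketch below assumes a constant replication lag, which is a simplification since real lag varies with load. It lists the writes that exist on the primary but had not reached the replica when the primary failed. `unreplicated_writes` is a hypothetical helper.

```python
from typing import List


def unreplicated_writes(
    commit_times: List[float], replication_lag_s: float, failure_time: float
) -> List[float]:
    """Commit times of writes lost if the primary dies at failure_time.

    A write committed at time t reaches the replica at t + replication_lag_s.
    """
    return [
        t
        for t in commit_times
        if t <= failure_time and t + replication_lag_s > failure_time
    ]


commits = [0.0, 1.0, 2.0, 3.0, 4.0]

# Synchronous: a commit is acknowledged only once the replica has it (lag 0).
sync_lost = unreplicated_writes(commits, replication_lag_s=0.0, failure_time=4.0)

# Asynchronous: the replica trails the primary by an assumed 1.5 seconds.
async_lost = unreplicated_writes(commits, replication_lag_s=1.5, failure_time=4.0)

print(sync_lost)   # []
print(async_lost)  # [3.0, 4.0]
```

In the asynchronous case the effective RPO is bounded by the replication lag at the moment of failure, so lag is the metric to alarm on.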
Database DR Patterns
Amazon Aurora Global Database
- Writer in Region A
- Read replicas in Region B, C
- Sub-second replication
- Failover controlled manually or with orchestrator
- RPO (Recovery Point Objective) < 1s, RTO (Recovery Time Objective) ~1–2 minutes.
RDS Cross-Region Read Replicas / Logs
- Higher latency than Aurora
- Good for RPO of minutes
- Works for MySQL/PostgreSQL/SQL Server/MariaDB engines.
DynamoDB Global Tables
- Multi-region active-active
- True multi-master
- Conflict resolution = last writer wins
- RPO (Recovery Point Objective) ≈ seconds or less (replication between regions is asynchronous by default, so it is not strictly 0), RTO (Recovery Time Objective) = seconds
- Requires careful design for idempotency and conflict safety.
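The sketch below shows why last-writer-wins calls for conflict-safe design. It models two regions writing the same item concurrently. The `Write` type and the region-name tie-breaker are assumptions made for the example and do not describe DynamoDB's internal implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    value: dict
    timestamp: float  # time of the write, in seconds
    region: str


def last_writer_wins(a: Write, b: Write) -> Write:
    """Keep the later write; break exact ties by region name (assumption)."""
    return max(a, b, key=lambda w: (w.timestamp, w.region))


# Two regions update the same cart item within 200 ms of each other.
from_us = Write({"qty": 2}, timestamp=100.0, region="us-east-1")
from_eu = Write({"qty": 5}, timestamp=100.2, region="eu-west-1")

winner = last_writer_wins(from_us, from_eu)
print(winner.region, winner.value)  # eu-west-1 {'qty': 5}

# Resolution is idempotent and order-independent: replaying or reordering
# the same writes converges on the same item in every region.
assert last_writer_wins(from_eu, from_us) == winner
assert last_writer_wins(winner, winner) == winner
```

The `us-east-1` write is discarded without error. That is acceptable for "set to this value" updates, but it loses data for read-modify-write patterns such as counters, unless each item is written from a single region or the update is made idempotent.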
S3 Cross-Region Replication
- Asynchronous
- RPO = seconds-minutes
- Versioning must be enabled on both source and destination buckets
- Beware delete marker replication rules
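As an illustration of the delete-marker point, the sketch below builds a replication rule in which delete marker replication is an explicit choice. It assumes boto3 and the `put_bucket_replication` call. The role ARN, bucket names, and rule ID are placeholders, and `replication_config` and `apply_replication` are hypothetical helpers.

```python
def replication_config(
    role_arn: str, dest_bucket_arn: str, replicate_delete_markers: bool = False
) -> dict:
    """Build a single-rule replication configuration for a whole bucket."""
    return {
        "Role": role_arn,
        "Rules": [
            {
                "ID": "dr-replication",  # placeholder rule name
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix matches every object
                "DeleteMarkerReplication": {
                    "Status": "Enabled" if replicate_delete_markers else "Disabled"
                },
                "Destination": {"Bucket": dest_bucket_arn},
            }
        ],
    }


def apply_replication(source_bucket: str, config: dict) -> None:
    """Apply the configuration (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works offline

    boto3.client("s3").put_bucket_replication(
        Bucket=source_bucket, ReplicationConfiguration=config
    )


config = replication_config(
    "arn:aws:iam::111122223333:role/example-replication-role",  # placeholder
    "arn:aws:s3:::example-dr-bucket",  # placeholder
)
```

With delete markers left unreplicated, an accidental or malicious delete in the primary region does not hide the object in the DR copy. Replicating them keeps the two buckets closer to identical. Which behaviour is right depends on whether the DR bucket also serves as protection against deletion.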
EBS + Snapshot DR
- Use for block-level data (VMs, stateful infra)
- Slowest (RTO hours)
3. Application & Compute DR
- The design depends on architecture patterns:
Stateless Microservices
- Docker images in ECR/GCR
- Infra managed by IaC (Terraform, CloudFormation, CDK)
- Auto-recreated in destination region
- Load balanced using global routing (Route 53, Cloudflare, etc.)
Stateful Services
- Must pair with data replication strategy
- Use persistent volume claims + cross-region storage replication
Orchestrators/Compute models
| Workload Type | DR Strategy |
| EC2 | AMIs + Launch Templates in destination region |
| EKS | Cluster recreated via IaC; EBS state replicated via snapshots |
| Lambda | Multi-region deployment package + alias failover |
| Serverless (API Gateway, SQS, SNS) | Deploy in both regions; use global routing |
4. Network & Global Traffic Management
DNS-Based Failover
- Route 53 health checks
- Weighted routing
- Latency-based routing
- Requires low TTL (30 sec typical)
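The TTL matters because it adds directly to failover time. A rough estimate, assuming clients honour the TTL and the health check must fail a fixed number of consecutive times before the endpoint is marked unhealthy, is sketched below. The interval and threshold values are example inputs, and `worst_case_failover_s` is a hypothetical helper.

```python
def worst_case_failover_s(
    check_interval_s: int, failure_threshold: int, ttl_s: int
) -> int:
    """Upper-bound estimate of DNS failover time, in seconds.

    detection = consecutive failed health checks needed to mark unhealthy
    ttl       = how long resolvers may keep serving the old answer
    """
    detection_s = check_interval_s * failure_threshold
    return detection_s + ttl_s


# 30 s checks, 3 failures to trip, 30 s TTL
print(worst_case_failover_s(30, 3, 30))   # 120
# Same health check with a 5-minute TTL
print(worst_case_failover_s(30, 3, 300))  # 390
```

The estimate ignores resolvers and clients that cache beyond the TTL, so measured failover during a drill may be longer.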
Global Load Balancing
- CloudFront
- GSLB in F5/Cloudflare/Akamai
VPC Connectivity
- Cross-region traffic must be considered:
- VPC Peering
- Transit Gateway (TGW)
- PrivateLink
- Encrypted WAN links
5. Security Controls in DR
Encryption
- KMS multi-region keys (MRKs) for customer data
- Client-side encryption for cross-region replication
- IAM is a global service; keep policies and guardrails consistent via Organizations SCPs + IaC
Identity Federation
- Cross-region IAM roles
- If using Okta/ADFS → ensure failover IdP
Secrets Management
- Secrets Manager or Vault multi-region replication
- Avoid environment-variable secrets (non-DR-friendly)
6. Automation & Playbooks
DR must be automated, not manual.
IaC Core
- Terraform with workspaces per region
- CDK/CloudFormation StackSets
- Ensure DR region drift detection and validation pipelines
DR Orchestration Runbooks (Samples):
- Promote Aurora secondary → primary
- Re-point DNS to new ALB
- Scale up warm standby ASGs
- Switch CI/CD pipelines to DR region
- Rehydrate secrets and parameters
- Use SSM Automation or Step Functions for full orchestration.
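The control flow of such a runbook is simple regardless of the orchestrator: run the steps in order, record each outcome, and stop at the first failure so later steps never act on a half-failed state. The sketch below shows only that control flow. The step actions are stubs, and `run_runbook` and `step` are hypothetical names, not SSM or Step Functions APIs.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_runbook(steps: List[Step]) -> List[Tuple[str, str]]:
    """Execute steps in order; stop at the first failure or exception."""
    results: List[Tuple[str, str]] = []
    for name, action in steps:
        try:
            ok = action()
        except Exception as exc:  # record the error, then halt the runbook
            results.append((name, f"error: {exc}"))
            break
        results.append((name, "ok" if ok else "failed"))
        if not ok:
            break
    return results


executed: List[str] = []


def step(name: str, succeeds: bool = True) -> Step:
    """Build a stub step that records its execution (stand-in for a real call)."""
    def action() -> bool:
        executed.append(name)
        return succeeds
    return (name, action)


drill = run_runbook([
    step("promote Aurora secondary"),
    step("re-point DNS to new ALB", succeeds=False),  # simulated failure
    step("scale up warm standby ASGs"),
])
for name, status in drill:
    print(f"{status:>6}  {name}")
```

In the simulated run the ASG scale-up never executes because the DNS step failed. A real runbook would also need a compensating action or a manual decision point at that stage.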
7. DR Simulation & Chaos Mitigation
Run periodic controlled DR tests:
- Simulate Region A loss
- Validate failover
- Validate RTO/RPO
- Ensure application consistency
- Perform rollback to primary region post-test
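Validating RTO/RPO in a drill comes down to three timestamps: the last write known to have reached the DR region, the start of the simulated outage, and the moment the service answered from the DR region. The sketch below shows the arithmetic. The timestamps and targets are made-up example values, and `evaluate_drill` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def evaluate_drill(
    last_replicated_write: datetime,
    outage_start: datetime,
    service_restored: datetime,
    rpo_target: timedelta,
    rto_target: timedelta,
) -> dict:
    """Compare achieved RPO/RTO from a DR drill against the targets."""
    achieved_rpo = outage_start - last_replicated_write
    achieved_rto = service_restored - outage_start
    return {
        "achieved_rpo": achieved_rpo,
        "achieved_rto": achieved_rto,
        "rpo_met": achieved_rpo <= rpo_target,
        "rto_met": achieved_rto <= rto_target,
    }


outage = datetime(2024, 1, 1, 12, 0, 0)  # example drill start
report = evaluate_drill(
    last_replicated_write=outage - timedelta(seconds=2),
    outage_start=outage,
    service_restored=outage + timedelta(minutes=7),
    rpo_target=timedelta(seconds=5),
    rto_target=timedelta(minutes=5),
)
print(report["rpo_met"], report["rto_met"])  # True False
```

In this example the data loss target is met but the recovery took seven minutes against a five-minute target, so the drill fails on RTO.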
Chaos Mitigation
Use tools like:
- AWS Fault Injection Service (FIS, formerly Fault Injection Simulator)
- Gremlin
- LitmusChaos
Cloud Migration
Migration is a multi-dimensional project with four components:
- Discovery & Inventory
- Refactor/Modernize vs Lift-and-Shift
- Data Migration Strategy
- Cutover Plan + Rollback Strategy
1. Discovery & Migration Readiness
Inventory & Assessment
- Configuration Management Database (CMDB)
- Network maps
- Data flows
- Inter-service dependencies
- Identity integrations
- Licensing constraints
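Once inter-service dependencies are inventoried, they can be turned into an ordering: anything with no unmigrated dependencies can move in the next wave. The sketch below uses the standard-library `graphlib` module (Python 3.9+). The service names are invented for the example, and `migration_waves` is a hypothetical helper.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def migration_waves(depends_on: Dict[str, Set[str]]) -> List[List[str]]:
    """Group services into waves; each wave depends only on earlier waves.

    depends_on maps a service to the services it needs. Raises
    graphlib.CycleError if the dependencies are circular.
    """
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    waves: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


inventory = {
    "web": {"api"},
    "api": {"db", "auth"},
    "auth": {"db"},
    "reports": {"db"},
}
waves = migration_waves(inventory)
for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

A `CycleError` at this stage is useful discovery output: circular dependencies usually mean the services involved have to move together.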
Cloud Readiness
- OS versions supported?
- DB engines compatible?
- Storage layer suitable?
- Compliance / audit requirements?
2. Migration Patterns
Six Migration R’s
- Rehost (Lift and Shift)
- Replatform (Lift-Tinker-and-Shift)
- Refactor (App modernization)
- Repurchase (SaaS replacement)
- Retire (decommission)
- Retain (keep on-prem temporarily)
3. Data Migration Strategies
A. Online migration (near-zero downtime)
Useful for large operational DBs.
- DMS (CDC - Change Data Capture)
- Log shipping
- GoldenGate
- Debezium
- Dual write + cutover
- Snowball Edge → S3 for bulk data-at-rest
B. Offline migration
When downtime is allowed.
- Snapshot + restore
- Bulk dumps
- Cold cutover
C. Hybrid multi-step
- Bulk load historical data
- CDC apply delta
- Freeze writes
- Final cutover
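The four steps above can be sketched with plain dictionaries standing in for the source and target databases. This toy model assumes the change log captures every write made after the snapshot, which is the role CDC tooling plays in practice. `bulk_load` and `apply_changes` are hypothetical helpers.

```python
from typing import Dict, List, Optional, Tuple

Change = Tuple[str, str, Optional[str]]  # (operation, key, value)


def bulk_load(snapshot: Dict[str, str]) -> Dict[str, str]:
    """Step 1: copy the historical snapshot into the target."""
    return dict(snapshot)


def apply_changes(target: Dict[str, str], change_log: List[Change]) -> None:
    """Step 2: replay captured changes, in order, onto the target."""
    for operation, key, value in change_log:
        if operation == "delete":
            target.pop(key, None)
        else:  # "upsert"
            target[key] = value


source = {"u1": "alice", "u2": "bob"}
target = bulk_load(source)

# The source keeps taking writes while the bulk load runs; CDC records them.
change_log: List[Change] = []
source["u3"] = "carol"
change_log.append(("upsert", "u3", "carol"))
del source["u2"]
change_log.append(("delete", "u2", None))
source["u1"] = "alice-renamed"
change_log.append(("upsert", "u1", "alice-renamed"))

apply_changes(target, change_log)

# Step 3: writes are frozen here, so no further changes arrive.
# Step 4: cut over only if source and target agree.
ready_for_cutover = target == source
print(ready_for_cutover)  # True
```

The write freeze only needs to last as long as it takes to drain the remaining delta and run the final comparison, which is what keeps downtime short.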
4. Cutover Strategy
Blue/Green Migration
- New environment (green) tested
- Traffic switchover via DNS or ALB
- Easy rollback
Canary Migration
- Gradual percentage-based routing
- Ideal for microservices
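Percentage-based routing is usually made sticky, so a given user does not bounce between old and new systems as the percentage rises. One common approach, sketched here as an assumption rather than as any particular load balancer's behaviour, is to hash a stable identifier into one of 100 buckets. `bucket` and `route` are hypothetical helpers.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def route(user_id: str, canary_percent: int) -> str:
    """Send the user to the new system if their bucket is under the threshold."""
    return "new" if bucket(user_id) < canary_percent else "old"


users = [f"user-{i}" for i in range(1000)]
for percent in (0, 10, 50, 100):
    share = sum(route(u, percent) == "new" for u in users) / len(users)
    print(f"{percent:>3}% target -> {share:.1%} routed to new")
```

Because the bucket depends only on the user ID, everyone routed to the new system at 10% is still routed there at 50%, and rolling back to 0% returns every user to the old system.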
Big Bang
- Swap everything at once
- High risk, sometimes necessary (legacy monoliths)
Rollback Planning
- Must plan data reconciliation
- DB downgrade path
- S3 versioning to recover state
- Rollback IaC stack
5. Observability & Validation
Pre-Cutover
- Load tests
- DB consistency checks
- Side-by-side comparison of new & old systems
- API contract validation
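A DB consistency check can start as a per-table row count plus a checksum that ignores row order, since the old and new systems rarely return rows in the same order. The sketch below assumes rows are available as tuples of simple values with identical types on both sides. `table_fingerprint` is a hypothetical helper.

```python
import hashlib
from typing import Iterable, Tuple


def table_fingerprint(rows: Iterable[tuple]) -> Tuple[int, str]:
    """Return (row_count, checksum); the checksum ignores row order."""
    row_digests = sorted(
        hashlib.sha256(repr(row).encode("utf-8")).hexdigest() for row in rows
    )
    combined = hashlib.sha256("".join(row_digests).encode("ascii")).hexdigest()
    return len(row_digests), combined


old_rows = [(1, "alice"), (2, "bob"), (3, "carol")]
new_rows = [(3, "carol"), (1, "alice"), (2, "bob")]    # same data, new order
drifted = [(1, "alice"), (2, "bob"), (3, "caroline")]  # one value differs

print(table_fingerprint(old_rows) == table_fingerprint(new_rows))  # True
print(table_fingerprint(old_rows) == table_fingerprint(drifted))   # False
```

A mismatch shows that the tables differ but not where. Fingerprinting by key range is one way to narrow it down.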
Post-Cutover
- Dynamic log baselining
- Error budget alarms
- Latency deltas across regions
- Event-driven monitoring (Lambda-based anomaly detection)
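An error budget alarm reduces to simple arithmetic: the SLO fixes how many failed requests are tolerable over a window, and the alarm fires when too little of that allowance is left. The sketch below shows the calculation. The SLO, traffic numbers, and 25% threshold are example values, and both function names are hypothetical.

```python
def error_budget_remaining(
    slo: float, total_requests: int, failed_requests: int
) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_failures = (1.0 - slo) * total_requests
    if allowed_failures <= 0:
        raise ValueError("a 100% SLO or zero traffic leaves no error budget")
    return 1.0 - failed_requests / allowed_failures


def should_alarm(
    slo: float, total_requests: int, failed_requests: int,
    min_remaining: float = 0.25,
) -> bool:
    """Alarm when less than min_remaining of the budget is left."""
    return error_budget_remaining(slo, total_requests, failed_requests) < min_remaining


# 99.9% SLO over 100,000 requests tolerates about 100 failures.
print(round(error_budget_remaining(0.999, 100_000, 50), 3))  # 0.5
print(should_alarm(0.999, 100_000, 90))                      # True
```

Right after a cutover it can make sense to evaluate the budget over a short window, so that a regression introduced by the migration shows up quickly.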
DR & Migration Anti-Patterns
❌ Only testing DR during outages
❌ Keeping manually curated infra in DR region
❌ Replicating unencrypted data cross-region
❌ DNS TTL > 5 minutes
❌ Storing secrets in env vars or instance configs
❌ Relying on backups without restore testing
❌ Migrating first → discovering dependencies later
❌ Attempting active-active without solving data consistency first