AWS Disaster Recovery (DR) - Overview.
Scope:
- DR Core Concepts,
- Categories of Disaster,
- DR Tiers / Strategies,
- Multi-Region Active–Active,
- DR for Each Layer of the Stack,
- Observability & Orchestration,
- DR Testing,
- DR Maturity Model,
- Disaster Recovery Anti-Patterns,
- Reference Multi-Region DR Blueprint (AWS Sample).
Intro:
- Disaster Recovery ensures that critical business services keep running, or are rapidly restored, after a disruptive event.
- DR sits within the wider domain of Business Continuity (BC) and has strong dependencies on:
- architecture,
- operations,
- observability,
- security.
1. DR Core Concepts
RPO (Recovery Point Objective)
- How much data loss is acceptable?
- Seconds → requires continuous replication
- Minutes → frequent snapshots + async replication
- Hours → nightly backups
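The RPO mapping above can be checked with simple arithmetic. The sketch below is only an illustration: the function names `worst_case_data_loss` and `meets_rpo` are hypothetical, and it assumes the worst case is a failure just before the next copy completes, so a full interval plus any replication lag is lost.

```python
from datetime import timedelta


def worst_case_data_loss(copy_interval: timedelta,
                         replication_lag: timedelta = timedelta(0)) -> timedelta:
    """Upper bound on lost data: one full copy interval plus whatever
    had not yet reached the recovery site when the failure hit."""
    return copy_interval + replication_lag


def meets_rpo(rpo: timedelta,
              copy_interval: timedelta,
              replication_lag: timedelta = timedelta(0)) -> bool:
    """True when the worst-case loss stays within the agreed RPO."""
    return worst_case_data_loss(copy_interval, replication_lag) <= rpo


# Nightly backups cannot satisfy a 5-minute RPO.
print(meets_rpo(timedelta(minutes=5), copy_interval=timedelta(hours=24)))
```

With nightly backups the check fails for any RPO shorter than a day, which is why minute-level targets call for frequent snapshots plus asynchronous replication.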
RTO (Recovery Time Objective)
- How fast must twtech resume operations?
- Seconds → multi-region active–active
- Minutes → warm standby
- Hours → pilot light
- Days → backup & restore
RLO (Recovery Level Objective)
Granularity of recovery:
- entire region
- VPC / subnet
- application
- database
- file-level
MTO (Maximum Tolerable Outage)
- How long the business can survive an outage.
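The objectives above constrain each other: an RTO longer than the MTO means the recovery plan finishes after the business has already failed. A minimal sketch of that sanity check follows; the class `RecoveryObjectives` and its `problems` method are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    rpo: timedelta  # acceptable data loss
    rto: timedelta  # target time to resume operations
    mto: timedelta  # longest outage the business can survive

    def problems(self) -> list[str]:
        """Return human-readable inconsistencies, empty if none found."""
        issues = []
        if min(self.rpo, self.rto, self.mto) < timedelta(0):
            issues.append("objectives must be non-negative")
        if self.rto > self.mto:
            issues.append("RTO exceeds MTO: recovery would finish too late")
        return issues


targets = RecoveryObjectives(rpo=timedelta(minutes=5),
                             rto=timedelta(hours=8),
                             mto=timedelta(hours=4))
print(targets.problems())
```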
2. Categories of Disaster
Physical / Environmental
- region-wide outage
- power loss
- natural disasters
Logical / Operational
- bad deployment
- configuration drift
- human error
- insider misuse
Cyber / Security
- ransomware
- supply chain compromise
- credential theft
- mass data corruption
Upstream Dependencies
- failed 3rd-party API
- DNS provider outage
- SaaS outages
- PKI/KMS failures
3. DR Tiers / Strategies
- Below are the four standard DR strategies, ordered from lowest cost and slowest recovery to highest cost and fastest recovery.
3.1 Backup & Restore
- Lowest cost, highest RTO/RPO
- nightly backups
- offsite copies (cross-region/object lock/WORM)
- restore into secondary region only during disaster
Used for: non-critical systems, batch workloads.
3.2 Pilot Light
- Minimal version of environment always running.
- core infrastructure pre-provisioned
- databases replicated or at least snapshots replicated
- app servers start during failover
RTO: ~30–60 minutes
RPO: minutes–hours
3.3 Warm Standby
- Scaled-down version of prod constantly running.
- DB cross-region replication
- reduced-size compute fleet always on
- traffic routed to standby only when needed
RTO: 5–15 minutes
RPO: seconds–minutes
3.4 Multi-Region Active–Active
- Most expensive but highest availability.
- both regions handle real traffic
- continuous replication
- global load balancing (GSLB)
- write conflict resolution system
RTO: near-zero
RPO: near-zero
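Active–active needs a rule for what happens when both regions write the same record. The sketch below shows last-writer-wins, which is only one common policy and not necessarily the one twtech uses; the `Version` type and `resolve` function are hypothetical. It assumes reasonably synchronized clocks and breaks timestamp ties by region name so both regions converge on the same answer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: str
    timestamp_ms: int  # write time reported by the originating region
    region: str


def resolve(a: Version, b: Version) -> Version:
    """Last-writer-wins: keep the newer write. Ties are broken by region
    name so the result is the same regardless of argument order."""
    return max(a, b, key=lambda v: (v.timestamp_ms, v.region))


east = Version("address=Main St", 1_700_000_000_500, "us-east-1")
west = Version("address=Oak Ave", 1_700_000_000_900, "us-west-2")
print(resolve(east, west).value)
```

Last-writer-wins silently discards the losing write, so it suits data where the latest value is all that matters. Counters, balances, and similar data usually need merge logic or a single writer per record.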
Used for:
- payments,
- identity/auth,
- gaming,
- IoT ingestion,
- trading systems.
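The four strategies above can be turned into a simple selection rule: pick the cheapest tier whose RTO and RPO still meet the workload's targets. This is a sketch only. The upper bounds in `TIERS` are illustrative assumptions where the text gives ranges such as "days", "hours", or "near-zero", and the names `Tier` and `cheapest_tier` are hypothetical.

```python
from datetime import timedelta
from typing import NamedTuple, Optional


class Tier(NamedTuple):
    name: str
    rto: timedelta  # assumed worst-case recovery time for the tier
    rpo: timedelta  # assumed worst-case data loss for the tier


# Ordered cheapest first. Bounds are assumptions, not measured values.
TIERS = [
    Tier("backup & restore", timedelta(days=3), timedelta(hours=24)),
    Tier("pilot light", timedelta(minutes=60), timedelta(hours=4)),
    Tier("warm standby", timedelta(minutes=15), timedelta(minutes=5)),
    Tier("active-active", timedelta(seconds=30), timedelta(seconds=5)),
]


def cheapest_tier(rto: timedelta, rpo: timedelta) -> Optional[Tier]:
    """Return the first (cheapest) tier meeting both targets, or None."""
    for tier in TIERS:
        if tier.rto <= rto and tier.rpo <= rpo:
            return tier
    return None


choice = cheapest_tier(rto=timedelta(hours=1), rpo=timedelta(hours=6))
print(choice.name if choice else "no tier meets these targets")
```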
4. DR for Each Layer of the Stack
4.1 Compute Layer
Stateless compute:
- rehydrated from CI/CD
- AMIs, images, container registries replicated
- IaC (Terraform/CloudFormation/CDK) stored multi-region
Stateful compute:
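- EBS volumes → cross-region snapshot copy (instance store data is ephemeral and cannot be snapshotted, so it must be copied to durable storage)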
- VM failover orchestration (AWS Elastic Disaster Recovery, the successor to CloudEndure; VMware; Azure Site Recovery)
4.2 Networking
Critical DR dependencies:
- multi-region VPCs
- CIDR planning without overlap
- multi-region Transit Gateway / VNet Peering
- cross-region private connectivity
- multi-region firewalls / Web Application Firewall (WAF)
- DNS failover (Route53, Akamai, Cloudflare)
- twtech designs two equal network islands that can run independently but share identity + security posture.
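Non-overlapping CIDR planning, listed above, is easy to verify before any VPC is created. The sketch below uses Python's standard `ipaddress` module; the function name `overlapping_cidrs` and the example ranges are hypothetical.

```python
from ipaddress import ip_network
from itertools import combinations


def overlapping_cidrs(cidrs: dict[str, str]) -> list[tuple[str, str]]:
    """Return every pair of named networks whose address ranges overlap."""
    nets = {name: ip_network(cidr) for name, cidr in cidrs.items()}
    return [(a, b) for a, b in combinations(nets, 2)
            if nets[a].overlaps(nets[b])]


plan = {
    "primary-vpc": "10.0.0.0/16",
    "dr-vpc": "10.1.0.0/16",
    "shared-services": "10.0.128.0/17",  # sits inside primary-vpc
}
print(overlapping_cidrs(plan))
```

Overlapping ranges block peering and Transit Gateway routing between the two network islands, so a check like this belongs in the IaC pipeline rather than in a failover runbook.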
4.3 Database DR
Relational:
- Aurora global / cross-region replication
- PostgreSQL logical replication
- MySQL binlog shipping
- Oracle Data Guard
- SQL Server AlwaysOn Availability Groups (AGs)
NoSQL:
- DynamoDB global tables
- MongoDB Atlas multi-region
- Cassandra multi-datacenter replication
Data Lake:
- S3 cross-region replication (CRR)
- object lock + versioning
- parquet dataset replication
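For S3 cross-region replication, the rule set is a small document attached to the source bucket. The sketch below only builds that document. The helper name `replication_config`, the rule ID, and the ARNs are placeholders, and the result is assumed to be passed to boto3's `put_bucket_replication` as its `ReplicationConfiguration` argument. Versioning must already be enabled on both buckets.

```python
def replication_config(role_arn: str, dest_bucket_arn: str) -> dict:
    """Build a one-rule configuration that replicates every object."""
    return {
        "Role": role_arn,  # IAM role S3 assumes to copy objects
        "Rules": [{
            "ID": "dr-crr",                 # placeholder rule name
            "Priority": 1,
            "Status": "Enabled",
            "Filter": {"Prefix": ""},       # empty prefix: all objects
            "DeleteMarkerReplication": {"Status": "Disabled"},
            "Destination": {"Bucket": dest_bucket_arn},
        }],
    }


cfg = replication_config(
    "arn:aws:iam::111122223333:role/example-replication-role",
    "arn:aws:s3:::example-dr-bucket",
)
# Assumed usage, not executed here:
#   boto3.client("s3").put_bucket_replication(
#       Bucket="example-source-bucket", ReplicationConfiguration=cfg)
print(cfg["Rules"][0]["Destination"]["Bucket"])
```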
Caches:
- Redis Multi-AZ within a region; across regions, use native cross-region replication or application-managed cold/warm caches.
4.4 Storage
- immutable backups (WORM/object lock)
- multi-region object replication
- EBS/FSx snapshot replication
- large dataset async replication pipelines
4.5 Identity & Secrets
- Critical and often forgotten.
Identity:
- IAM global vs regional entities
- cross-region STS access
- backup of IAM roles/policies via IaC
- multi-region SSO/IdP failover
Secrets/KMS:
- multi-region KMS keys (MRKs)
- replication of key policies
- envelope encryption for portability
- verify that decrypt operations succeed in the recovery region after failover
5. Observability & Orchestration
Monitoring:
- multi-region time-series store
- central event bus
- application-level heartbeat checks
- synthetic canary tests
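Heartbeat checks should not trigger failover on a single missed probe. A minimal sketch, assuming a hypothetical `HeartbeatMonitor` class and a threshold of three consecutive failures (the threshold is an illustrative choice, not a recommendation from the text):

```python
class HeartbeatMonitor:
    """Declares a region unhealthy only after N consecutive failed probes."""

    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def record(self, healthy: bool) -> bool:
        """Record one probe result. Returns True once the region should
        be treated as down; any healthy probe resets the counter."""
        if healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.failure_threshold


monitor = HeartbeatMonitor(failure_threshold=3)
for probe in [False, False, True, False, False, False]:
    down = monitor.record(probe)
print(down)
```

Requiring consecutive failures trades a little detection time for fewer false failovers, which matters because an unnecessary region cutover is itself a disruptive event.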
DR Orchestration:
- automated failover runbooks
- event-driven workflows (AWS Step Functions, Azure Automation, GCP Workflows)
- chaos engineering (fault injection, network blackholing)
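An automated failover runbook is, at its core, an ordered list of steps that stops at the first failure. The sketch below shows that shape in plain Python. The function `run_runbook` and the step names are hypothetical, and each lambda stands in for a real action such as promoting a database or updating DNS.

```python
from typing import Callable

Step = tuple[str, Callable[[], bool]]


def run_runbook(steps: list[Step]) -> list[str]:
    """Run steps in order and return the names completed. Raises on the
    first failing step so later steps never run against a bad state."""
    completed: list[str] = []
    for name, action in steps:
        if not action():
            raise RuntimeError(
                f"failover step failed: {name} (completed: {completed})")
        completed.append(name)
    return completed


# Placeholder actions; real ones would call cloud APIs.
runbook = [
    ("promote database replica", lambda: True),
    ("scale up standby fleet", lambda: True),
    ("shift dns to standby", lambda: True),
]
print(run_runbook(runbook))
```

A managed workflow service adds retries, timeouts, and an audit trail on top of this, but the ordering and stop-on-failure behaviour are the parts worth rehearsing in every DR test.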
Logging:
- multi-region log sinks
- long-term storage in immutable buckets
- SIEM cross-region ingestion
6. DR Testing
- Testing matters more than architecture: a DR design that has never been exercised is unproven.
Tests to run:
- failover / failback
- DNS cutover
- region evacuation
- database promotion
- IAM/KMS boundary tests
- network segmentation tests
- ransomware restore tests
- dependency isolation (test whether twtech keeps operating during a third-party outage)
Frequency:
- critical workloads: monthly
- standard workloads: quarterly
- regulated workloads: semi-annual mandatory
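The frequencies above can be enforced with a small overdue check. In this sketch the day counts are approximations of monthly, quarterly, and semi-annual, and the names `MAX_TEST_AGE` and `is_overdue` are hypothetical.

```python
from datetime import date, timedelta

# Approximate day counts for the cadences listed above (assumptions).
MAX_TEST_AGE = {
    "critical": timedelta(days=31),    # monthly
    "standard": timedelta(days=92),    # quarterly
    "regulated": timedelta(days=183),  # semi-annual
}


def is_overdue(workload_class: str, last_tested: date, today: date) -> bool:
    """True when the last DR test is older than the class allows."""
    return today - last_tested > MAX_TEST_AGE[workload_class]


print(is_overdue("critical", last_tested=date(2024, 1, 1),
                 today=date(2024, 2, 15)))
```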
7. DR Maturity Model
[Table: DR Maturity Model]
8. Disaster Recovery Anti-Patterns
❌ Single-region architectures with multi-AZ only
❌ Backups stored in same region
❌ No restore validation
❌ Hard-coding region names in apps
❌ Stateful services without replication
❌ Transitive dependencies (one region relies on another's control plane)
❌ Shared KMS keys without multi-region symmetry
❌ Overreliance on synchronous replication everywhere
❌ DNS TTLs > 60 seconds
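Several of the anti-patterns above can be caught mechanically. The sketch below checks three of them against a workload description. The dictionary keys (`primary_region`, `backup_region`, `dns_ttl_seconds`, `restore_validated`) and the function name `find_anti_patterns` are hypothetical and would need to match however twtech actually records this information.

```python
def find_anti_patterns(cfg: dict) -> list[str]:
    """Return the anti-patterns detected in one workload description."""
    findings = []
    if cfg["backup_region"] == cfg["primary_region"]:
        findings.append("backups stored in same region")
    if cfg["dns_ttl_seconds"] > 60:
        findings.append("DNS TTL above 60 seconds")
    if not cfg.get("restore_validated", False):
        findings.append("no restore validation")
    return findings


workload = {
    "primary_region": "us-east-1",
    "backup_region": "us-east-1",
    "dns_ttl_seconds": 300,
    "restore_validated": False,
}
for finding in find_anti_patterns(workload):
    print(finding)
```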