A deep dive into Disaster Recovery (DR).
Scope:
- Concepts
- Patterns
- Architectures
- Best-practice frameworks
Breakdown:
- DR Core Concepts
- Categories of Disaster
- DR Tiers / Strategies
- Multi-Region Active–Active
- DR for Each Layer of the Stack
- Observability & Orchestration
- DR Testing
- DR Maturity Model
- Disaster Recovery Anti-Patterns
- Reference Multi-Region DR Blueprint (AWS Example)
Intro:
- Disaster Recovery ensures that critical business services keep running, or are rapidly restored, after a disruptive event.
- DR sits within the wider domain of Business Continuity (BC) and has strong dependencies on architecture, operations, observability, and security.
1. DR Core Concepts
RPO — Recovery Point Objective
How much data loss is acceptable?
- Seconds → requires continuous replication
- Minutes → frequent snapshots + async replication
- Hours → nightly backups
RTO — Recovery Time Objective
How fast must twtech resume operations?
- Seconds → warm standby / active–active
- Minutes → warm standby
- Hours → pilot light
- Days → backup & restore
RLO — Recovery Level Objective
Granularity of recovery:
- entire region
- VPC / subnet
- application
- database
- file-level
MTO — Maximum Tolerable Outage
How long the business can survive an outage.
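These objectives are, at bottom, simple time arithmetic. A minimal sketch of measuring achieved RPO and RTO after an incident (function and variable names are illustrative, not from any specific tool):

```python
from datetime import datetime, timedelta

def achieved_rpo(last_replicated_at: datetime, incident_at: datetime) -> timedelta:
    """Data-loss window: time between the last durable copy and the incident."""
    return incident_at - last_replicated_at

def achieved_rto(incident_at: datetime, service_restored_at: datetime) -> timedelta:
    """Outage duration: time from the incident until service is restored."""
    return service_restored_at - incident_at

incident = datetime(2024, 1, 10, 12, 0)
rpo = achieved_rpo(datetime(2024, 1, 10, 11, 45), incident)  # 15 minutes of data lost
rto = achieved_rto(incident, datetime(2024, 1, 10, 12, 40))  # 40 minutes of downtime
```

Comparing these measured values against the declared objectives after every DR test is what turns RPO/RTO from aspirations into verified numbers.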
2. Categories of Disaster
Physical / Environmental
- region-wide outage
- power loss
- natural disasters
Logical / Operational
- bad deployment
- configuration drift
- human error
- insider misuse
Cyber / Security
- ransomware
- supply chain compromise
- credential theft
- mass data corruption
Upstream Dependencies
- failed 3rd-party API
- DNS provider outage
- SaaS outages
- PKI/KMS failures
3. DR Tiers / Strategies
Below are the canonical DR topologies used globally.
3.1 Backup & Restore
Lowest cost, highest RTO/RPO
- nightly backups
- offsite copies (cross-region/object lock/WORM)
- restore into secondary region only during disaster
Used for: non-critical systems, batch workloads.
3.2 Pilot Light
Minimal version of the environment always running.
- core infrastructure pre-provisioned
- databases replicated or at least snapshots replicated
- app servers start during failover
RTO: ~30–60 minutes
RPO: minutes–hours
3.3 Warm Standby
Scaled-down version of prod constantly running.
- DB cross-region replication
- reduced-sized compute fleet always on
- traffic routed to standby only when needed
RTO: 5–15 minutes
RPO: seconds–minutes
3.4 Multi-Region Active–Active
Most expensive, but highest availability.
- both regions handle real traffic
- continuous replication
- global load balancing (GSLB)
- write conflict resolution system
RTO: near-zero
RPO: near-zero
Used for: payments, identity/auth, gaming, IoT ingestion, trading systems.
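Active–active writes need a deterministic conflict-resolution rule so that both regions converge on the same value for concurrently updated keys. A minimal last-writer-wins sketch (lossy by design, and all names here are illustrative; production systems often prefer CRDTs or per-key write ownership):

```python
from dataclasses import dataclass

@dataclass
class Write:
    key: str
    value: str
    ts: float     # wall-clock or hybrid-logical-clock timestamp
    region: str   # tie-breaker so both regions resolve ties identically

def last_writer_wins(a: Write, b: Write) -> Write:
    """Deterministic merge for concurrent writes to the same key:
    highest timestamp wins; region name breaks exact ties."""
    return max(a, b, key=lambda w: (w.ts, w.region))

us = Write("user:42", "blue",  ts=100.0, region="us-east-1")
eu = Write("user:42", "green", ts=100.5, region="eu-west-1")
winner = last_writer_wins(us, eu)  # the later write ("green") wins in both regions
```

Note the cost: the losing write is silently discarded, which is exactly why a "write conflict resolution system" must be a deliberate design decision, not an afterthought.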
4. DR for Each Layer of the Stack
4.1 Compute Layer
Stateless compute:
- rehydrated from CI/CD
- AMIs, images, container registries replicated
- IaC (Terraform/CloudFormation/CDK) stored multi-region
Stateful compute:
- instance store → snapshot replication
- VM failover orchestration (CloudEndure, VMware, ASR)
4.2 Networking
Critical DR dependencies:
- multi-region VPCs
- CIDR planning without overlap
- multi-region Transit Gateway / VNet Peering
- cross-region private connectivity
- multi-region firewalls / Web Application Firewall (WAF)
- DNS failover (Route53, Akamai, Cloudflare)
twtech designs two equal network islands that can run independently but share identity + security posture.
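DNS failover should flip only after several consecutive health-check failures, so a single flaky probe does not bounce traffic between regions. A toy model of that damping logic (the class and its names are hypothetical; the threshold of 3 mirrors common health-check defaults):

```python
class HealthGate:
    """Report an endpoint unhealthy only after `threshold` consecutive
    probe failures, damping flapping endpoints the way managed DNS
    health checks do. Any success resets the counter."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0

    def observe(self, probe_ok: bool) -> bool:
        """Record one probe result; return True while still considered healthy."""
        self.failures = 0 if probe_ok else self.failures + 1
        return self.failures < self.threshold
```

Pair this with a low DNS TTL (see the anti-patterns section) so that once the gate does flip, resolvers pick up the secondary record quickly.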
4.3 Database DR
Relational:
- Aurora global / cross-region replication
- PostgreSQL logical replication
- MySQL binlog shipping
- Oracle Data Guard
- SQL Server AlwaysOn Availability Groups (AGs)
NoSQL:
- DynamoDB global tables
- MongoDB Atlas multi-region
- Cassandra multi-datacenter replication
Data Lake:
- S3 cross-region replication (CRR)
- object lock + versioning
- parquet dataset replication
Caches:
- Redis Multi-AZ; for cross-region → native replication or application-managed cold/warm caches.
4.4 Storage
- immutable backups (WORM/object lock)
- multi-region object replication
- EBS/FSx snapshot replication
- large dataset async replication pipelines
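The point of WORM/object lock is that nobody, not even an administrator or ransomware holding admin credentials, can overwrite or delete a backup before its retention period expires. A toy in-memory model of those semantics (real systems such as S3 Object Lock enforce this server-side; this class exists only to illustrate the contract):

```python
import time

class WormStore:
    """Illustrative write-once-read-many store: objects written with a
    retention period reject overwrite and delete until it expires."""

    def __init__(self):
        self._objects: dict[str, tuple[bytes, float]] = {}  # key -> (data, retain_until)

    def put(self, key: str, data: bytes, retention_s: float) -> None:
        if key in self._objects and time.time() < self._objects[key][1]:
            raise PermissionError(f"{key} is under retention")
        self._objects[key] = (data, time.time() + retention_s)

    def get(self, key: str) -> bytes:
        return self._objects[key][0]

    def delete(self, key: str) -> None:
        if time.time() < self._objects[key][1]:
            raise PermissionError(f"{key} is under retention")
        del self._objects[key]
```

In a ransomware scenario this property is the difference between "restore from last night" and "negotiate".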
4.5 Identity & Secrets
Critical and often forgotten.
Identity:
- IAM global vs regional entities
- cross-region STS access
- backup of IAM roles/policies via IaC
- multi-region SSO/IdP failover
Secrets/KMS:
- multi-region KMS keys (MRKs)
- replication of key policies
- envelope encryption for portability
- ensure decrypt operations after failover
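Envelope encryption is what makes encrypted data portable across regions: each object is encrypted with its own data key, and only the *wrapped* data key travels with the object, so any region holding a replica of the master key can decrypt after failover. The sketch below uses XOR purely as a stand-in cipher so it runs anywhere; a real system would encrypt locally with AES-GCM and wrap the data key with a KMS multi-region key:

```python
import secrets

def xor(data: bytes, key: bytes) -> bytes:
    """Stand-in cipher for illustration ONLY; never use XOR in production."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

master_key = secrets.token_bytes(32)            # lives in KMS, replicated as an MRK
data_key   = secrets.token_bytes(32)            # generated fresh per object

ciphertext = xor(b"customer record", data_key)  # bulk data encrypted with the data key
wrapped    = xor(data_key, master_key)          # only the wrapped key is stored/shipped

# After failover: unwrap the data key with the regional master-key replica,
# then decrypt the object locally.
recovered = xor(ciphertext, xor(wrapped, master_key))
```

The "ensure decrypt operations after failover" bullet above is exactly a test of this path: can the secondary region unwrap and decrypt without calling home to the failed region?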
5. Observability & Orchestration
Monitoring:
- multi-region time-series store
- central event bus
- application-level heartbeat checks
- synthetic canary tests
DR Orchestration:
- automated failover runbooks
- event-driven workflows (AWS Step Functions, Azure Automation, GCP Workflows)
- chaos engineering (fault injection, network blackholing)
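An automated failover runbook is essentially an ordered workflow that halts, and pages a human, at the first failed step instead of continuing blind. A minimal sketch of that shape (step names and the context dict are illustrative; managed equivalents are the workflow services listed above):

```python
# Each step takes a shared context and returns True on success.
RUNBOOK = [
    ("verify_secondary_health", lambda ctx: ctx["secondary_healthy"]),
    ("promote_replica",         lambda ctx: ctx.setdefault("db", "promoted") == "promoted"),
    ("flip_dns",                lambda ctx: ctx.setdefault("dns", "us-west-2") == "us-west-2"),
    ("run_smoke_tests",         lambda ctx: ctx["smoke_ok"]),
]

def execute(ctx: dict) -> list[str]:
    """Run steps in order; stop hard at the first failure so operators
    inherit a known state rather than a half-failed-over system."""
    completed = []
    for name, step in RUNBOOK:
        if not step(ctx):
            raise RuntimeError(f"runbook halted at: {name}")
        completed.append(name)
    return completed
```

Keeping the runbook as executable code (rather than a wiki page) is what makes it testable in game days and auditable after real incidents.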
Logging:
- multi-region log sinks
- long-term storage in immutable buckets
- SIEM cross-region ingestion
6. DR Testing
Testing is more important than architecture: an unexercised failover path cannot be trusted.
Tests to run:
- failover / failback
- DNS cutover
- region evacuation
- database promotion
- IAM/KMS boundary tests
- network segmentation tests
- ransomware restore tests
- dependency isolation (verify a 3rd-party outage does not cascade into twtech)
Frequency:
- critical workloads: monthly
- standard workloads: quarterly
- regulated workloads: semi-annual mandatory
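The cheapest restore validation that is still meaningful: restore the backup somewhere disposable and compare a digest of the restored data against the source of truth, rather than trusting that the backup job exited 0. A sketch:

```python
import hashlib

def restore_is_valid(source: bytes, restored: bytes) -> bool:
    """A backup only counts if the restored bytes hash to the same
    digest as the data they claim to protect."""
    return hashlib.sha256(source).hexdigest() == hashlib.sha256(restored).hexdigest()
```

Real restore tests go further (row counts, application-level smoke tests, timing against RTO), but a digest check alone already catches truncated, corrupted, or silently empty backups.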
7. DR Maturity Model
| Level | Description |
|-------|-------------|
| 0 | Backups only, inconsistent, untested |
| 1 | Automated backups, occasional restore tests |
| 2 | Cross-region replication for critical DBs |
| 3 | Warm standby with automated failover |
| 4 | Active–active architecture with global load balancing |
| 5 | Chaos-engineered, highly automated, self-healing DR |
8. Disaster Recovery Anti-Patterns
❌ Single-region architectures with multi-AZ only
❌ Backups stored in same region
❌ No restore validation
❌ Hard-coding region names in apps
❌ Stateful services without replication
❌ Transitive dependencies (one region relies on another's control plane)
❌ Shared KMS keys without multi-region symmetry
❌ Overreliance on synchronous replication everywhere
❌ DNS TTLs > 60 seconds
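The hard-coded-region anti-pattern is usually fixed by resolving the region from deployment configuration, so the same artifact deploys unchanged into either region. A minimal sketch (the environment variable name and endpoints are hypothetical):

```python
import os

# Anti-pattern: endpoint = "https://api.us-east-1.internal"  (baked into the code)
# Instead, resolve the region once at startup from deployment config:
REGION = os.environ.get("DEPLOY_REGION", "us-east-1")

ENDPOINTS = {
    "us-east-1": "https://api.us-east-1.internal",
    "us-west-2": "https://api.us-west-2.internal",
}
# Fall back to the default region if an unknown value is injected.
endpoint = ENDPOINTS.get(REGION, ENDPOINTS["us-east-1"])
```

During a region evacuation, failover then becomes a configuration change rather than a rebuild-and-redeploy of every service.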