A deep dive into Disaster Recovery (DR).
Scope:
- Concepts
- Patterns
- Architectures
- Best-practice frameworks
Breakdown:
- DR Core Concepts
- Categories of Disaster
- DR Tiers / Strategies
- Multi-Region Active–Active
- DR for Each Layer of the Stack
- Observability & Orchestration
- DR Testing
- DR Maturity Model
- Disaster Recovery Anti-Patterns
- Reference Multi-Region DR Blueprint (AWS Example)
Intro:
- Disaster Recovery ensures that critical business services keep running, or are rapidly restored, after a disruptive event.
- DR sits within the wider domain of Business Continuity (BC) and has strong dependencies on architecture, operations, observability, and security.
1. DR Core Concepts
RPO — Recovery Point Objective
How much data loss is acceptable?
- Seconds → requires continuous replication
- Minutes → frequent snapshots + async replication
- Hours → nightly backups
RTO — Recovery Time Objective
How fast must twtech resume operations?
- Seconds → warm standby / active–active
- Minutes → warm standby
- Hours → pilot light
- Days → backup & restore
RLO — Recovery Level Objective
Granularity of recovery:
- entire region
- VPC / subnet
- application
- database
- file-level
MTO — Maximum Tolerable Outage
How long the business can survive an outage.
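These objectives are, at bottom, simple time arithmetic. A minimal sketch of measuring achieved RPO and RTO after an incident (function and variable names are illustrative, not from any specific tool):

```python
from datetime import datetime, timedelta

def achieved_rpo(last_replicated_at: datetime, incident_at: datetime) -> timedelta:
    """Data-loss window: time between the last durable copy and the incident."""
    return incident_at - last_replicated_at

def achieved_rto(incident_at: datetime, service_restored_at: datetime) -> timedelta:
    """Outage duration: time from the incident until service is restored."""
    return service_restored_at - incident_at

incident = datetime(2024, 1, 10, 12, 0)
rpo = achieved_rpo(datetime(2024, 1, 10, 11, 45), incident)  # 15 minutes of data lost
rto = achieved_rto(incident, datetime(2024, 1, 10, 12, 40))  # 40 minutes of downtime
```

Comparing these measured values against the declared objectives after every DR test is what turns RPO/RTO from aspirations into verified numbers.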
2. Categories of Disaster
Physical / Environmental
- region-wide outage
- power loss
- natural disasters
Logical / Operational
- bad deployment
- configuration drift
- human error
- insider misuse
Cyber / Security
- ransomware
- supply chain compromise
- credential theft
- mass data corruption
Upstream Dependencies
- failed 3rd-party API
- DNS provider outage
- SaaS outages
- PKI/KMS failures
3. DR Tiers / Strategies
Below are the canonical DR topologies used globally.
3.1 Backup & Restore
Lowest cost, highest RTO/RPO
- nightly backups
- offsite copies (cross-region/object lock/WORM)
- restore into secondary region only during disaster
Used for: non-critical systems, batch workloads.
3.2 Pilot Light
Minimal version of the environment always running.
- core infrastructure pre-provisioned
- databases replicated or at least snapshots replicated
- app servers start during failover
RTO: ~30–60 minutes
RPO: minutes–hours
3.3 Warm Standby
Scaled-down version of prod constantly running.
- DB cross-region replication
- reduced-sized compute fleet always on
- traffic routed to standby only when needed
RTO: 5–15 minutes
RPO: seconds–minutes
3.4 Multi-Region Active–Active
Most expensive, but highest availability.
- both regions handle real traffic
- continuous replication
- global load balancing (GSLB)
- write conflict resolution system
RTO: near-zero
RPO: near-zero
Used for: payments, identity/auth, gaming, IoT ingestion, trading systems.
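Active–active writes need a deterministic conflict-resolution rule so that both regions converge on the same value for concurrently updated keys. A minimal last-writer-wins sketch (lossy by design, and all names here are illustrative; production systems often prefer CRDTs or per-key write ownership):

```python
from dataclasses import dataclass

@dataclass
class Write:
    key: str
    value: str
    ts: float     # wall-clock or hybrid-logical-clock timestamp
    region: str   # tie-breaker so both regions resolve ties identically

def last_writer_wins(a: Write, b: Write) -> Write:
    """Deterministic merge for concurrent writes to the same key:
    highest timestamp wins; region name breaks exact ties."""
    return max(a, b, key=lambda w: (w.ts, w.region))

us = Write("user:42", "blue",  ts=100.0, region="us-east-1")
eu = Write("user:42", "green", ts=100.5, region="eu-west-1")
winner = last_writer_wins(us, eu)  # the later write ("green") wins in both regions
```

Note the cost: the losing write is silently discarded, which is exactly why a "write conflict resolution system" must be a deliberate design decision, not an afterthought.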
4. DR for Each Layer of the Stack
4.1 Compute Layer
Stateless compute:
- rehydrated from CI/CD
- AMIs, images, container registries replicated
- IaC (Terraform/CloudFormation/CDK) stored multi-region
Stateful compute:
- instance store → snapshot replication
- VM failover orchestration (CloudEndure, VMware, ASR)
4.2 Networking
Critical DR dependencies:
- multi-region VPCs
- CIDR planning without overlap
- multi-region Transit Gateway / VNet Peering
- cross-region private connectivity
- multi-region firewalls / Web Application Firewall (WAF)
- DNS failover (Route53, Akamai, Cloudflare)
twtech designs two equal network islands that can run independently but share identity + security posture.
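DNS failover should flip only after several consecutive health-check failures, so a single flaky probe does not bounce traffic between regions. A toy model of that damping logic (the class and its names are hypothetical; the threshold of 3 mirrors common health-check defaults):

```python
class HealthGate:
    """Report an endpoint unhealthy only after `threshold` consecutive
    probe failures, damping flapping endpoints the way managed DNS
    health checks do. Any success resets the counter."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0

    def observe(self, probe_ok: bool) -> bool:
        """Record one probe result; return True while still considered healthy."""
        self.failures = 0 if probe_ok else self.failures + 1
        return self.failures < self.threshold
```

Pair this with a low DNS TTL (see the anti-patterns section) so that once the gate does flip, resolvers pick up the secondary record quickly.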
4.3 Database DR
Relational:
- Aurora global / cross-region replication
- PostgreSQL logical replication
- MySQL binlog shipping
- Oracle Data Guard
- SQL Server AlwaysOn Availability Groups (AGs)
NoSQL:
- DynamoDB global tables
- MongoDB Atlas multi-region
- Cassandra multi-datacenter replication
Data Lake:
- S3 cross-region replication (CRR)
- object lock + versioning
- parquet dataset replication
Caches:
- Redis Multi-AZ; for cross-region → native replication or application-managed cold/warm caches.
4.4 Storage
- immutable backups (WORM/object lock)
- multi-region object replication
- EBS/FSx snapshot replication
- large dataset async replication pipelines
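The point of WORM/object lock is that nobody, not even an administrator or ransomware holding admin credentials, can overwrite or delete a backup before its retention period expires. A toy in-memory model of those semantics (real systems such as S3 Object Lock enforce this server-side; this class exists only to illustrate the contract):

```python
import time

class WormStore:
    """Illustrative write-once-read-many store: objects written with a
    retention period reject overwrite and delete until it expires."""

    def __init__(self):
        self._objects: dict[str, tuple[bytes, float]] = {}  # key -> (data, retain_until)

    def put(self, key: str, data: bytes, retention_s: float) -> None:
        if key in self._objects and time.time() < self._objects[key][1]:
            raise PermissionError(f"{key} is under retention")
        self._objects[key] = (data, time.time() + retention_s)

    def get(self, key: str) -> bytes:
        return self._objects[key][0]

    def delete(self, key: str) -> None:
        if time.time() < self._objects[key][1]:
            raise PermissionError(f"{key} is under retention")
        del self._objects[key]
```

In a ransomware scenario this property is the difference between "restore from last night" and "negotiate".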
4.5 Identity & Secrets
Critical and often forgotten.
Identity:
- IAM global vs regional entities
- cross-region STS access
- backup of IAM roles/policies via IaC
- multi-region SSO/IdP failover
Secrets/KMS:
- multi-region KMS keys (MRKs)
- replication of key policies
- envelope encryption for portability
- ensure decrypt operations after failover
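Envelope encryption is what makes encrypted data portable across regions: each object is encrypted with its own data key, and only the *wrapped* data key travels with the object, so any region holding a replica of the master key can decrypt after failover. The sketch below uses XOR purely as a stand-in cipher so it runs anywhere; a real system would encrypt locally with AES-GCM and wrap the data key with a KMS multi-region key:

```python
import secrets

def xor(data: bytes, key: bytes) -> bytes:
    """Stand-in cipher for illustration ONLY; never use XOR in production."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

master_key = secrets.token_bytes(32)            # lives in KMS, replicated as an MRK
data_key   = secrets.token_bytes(32)            # generated fresh per object

ciphertext = xor(b"customer record", data_key)  # bulk data encrypted with the data key
wrapped    = xor(data_key, master_key)          # only the wrapped key is stored/shipped

# After failover: unwrap the data key with the regional master-key replica,
# then decrypt the object locally.
recovered = xor(ciphertext, xor(wrapped, master_key))
```

The "ensure decrypt operations after failover" bullet above is exactly a test of this path: can the secondary region unwrap and decrypt without calling home to the failed region?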
5. Observability & Orchestration
Monitoring:
- multi-region time-series store
- central event bus
- application-level heartbeat checks
- synthetic canary tests
DR Orchestration:
- automated failover runbooks
- event-driven workflows (AWS Step Functions, Azure Automation, GCP Workflows)
- chaos engineering (fault injection, network blackholing)
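An automated failover runbook is essentially an ordered workflow that halts, and pages a human, at the first failed step instead of continuing blind. A minimal sketch of that shape (step names and the context dict are illustrative; managed equivalents are the workflow services listed above):

```python
# Each step takes a shared context and returns True on success.
RUNBOOK = [
    ("verify_secondary_health", lambda ctx: ctx["secondary_healthy"]),
    ("promote_replica",         lambda ctx: ctx.setdefault("db", "promoted") == "promoted"),
    ("flip_dns",                lambda ctx: ctx.setdefault("dns", "us-west-2") == "us-west-2"),
    ("run_smoke_tests",         lambda ctx: ctx["smoke_ok"]),
]

def execute(ctx: dict) -> list[str]:
    """Run steps in order; stop hard at the first failure so operators
    inherit a known state rather than a half-failed-over system."""
    completed = []
    for name, step in RUNBOOK:
        if not step(ctx):
            raise RuntimeError(f"runbook halted at: {name}")
        completed.append(name)
    return completed
```

Keeping the runbook as executable code (rather than a wiki page) is what makes it testable in game days and auditable after real incidents.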
Logging:
- multi-region log sinks
- long-term storage in immutable buckets
- SIEM cross-region ingestion
6. DR Testing
Testing is more important than architecture: an unexercised failover path cannot be trusted.
Tests to run:
- failover / failback
- DNS cutover
- region evacuation
- database promotion
- IAM/KMS boundary tests
- network segmentation tests
- ransomware restore tests
- dependency isolation (verify a 3rd-party outage does not cascade into twtech)
Frequency:
- critical workloads: monthly
- standard workloads: quarterly
- regulated workloads: semi-annual mandatory
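The cheapest restore validation that is still meaningful: restore the backup somewhere disposable and compare a digest of the restored data against the source of truth, rather than trusting that the backup job exited 0. A sketch:

```python
import hashlib

def restore_is_valid(source: bytes, restored: bytes) -> bool:
    """A backup only counts if the restored bytes hash to the same
    digest as the data they claim to protect."""
    return hashlib.sha256(source).hexdigest() == hashlib.sha256(restored).hexdigest()
```

Real restore tests go further (row counts, application-level smoke tests, timing against RTO), but a digest check alone already catches truncated, corrupted, or silently empty backups.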
7. DR Maturity Model
| Level | Description |
|-------|-------------|
| 0 | Backups only, inconsistent, untested |
| 1 | Automated backups, occasional restore tests |
| 2 | Cross-region replication for critical DBs |
| 3 | Warm standby with automated failover |
| 4 | Active–active architecture with global load balancing |
| 5 | Chaos-engineered, highly automated, self-healing DR |
8. Disaster Recovery Anti-Patterns
❌ Single-region architectures with multi-AZ only
❌ Backups stored in same region
❌ No restore validation
❌ Hard-coding region names in apps
❌ Stateful services without replication
❌ Transitive dependencies (one region relies on another's control plane)
❌ Shared KMS keys without multi-region symmetry
❌ Overreliance on synchronous replication everywhere
❌ DNS TTLs > 60 seconds
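The hard-coded-region anti-pattern is usually fixed by resolving the region from deployment configuration, so the same artifact deploys unchanged into either region. A minimal sketch (the environment variable name and endpoints are hypothetical):

```python
import os

# Anti-pattern: endpoint = "https://api.us-east-1.internal"  (baked into the code)
# Instead, resolve the region once at startup from deployment config:
REGION = os.environ.get("DEPLOY_REGION", "us-east-1")

ENDPOINTS = {
    "us-east-1": "https://api.us-east-1.internal",
    "us-west-2": "https://api.us-west-2.internal",
}
# Fall back to the default region if an unknown value is injected.
endpoint = ENDPOINTS.get(REGION, ENDPOINTS["us-east-1"])
```

During a region evacuation, failover then becomes a configuration change rather than a rebuild-and-redeploy of every service.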