Thursday, November 20, 2025

Disaster Recovery | Deep Dive.

A deep dive into Disaster Recovery (DR).

Scope:

  •        Concepts
  •        Patterns
  •        Architectures
  •        Best-practice frameworks

Breakdown:

  •        DR Core Concepts
  •        Categories of Disaster
  •        DR Tiers / Strategies
  •        Multi-Region Active–Active
  •        DR for Each Layer of the Stack
  •        Observability & Orchestration
  •        DR Testing
  •        DR Maturity Model
  •        Disaster Recovery Anti-Patterns
  •        Reference Multi-Region DR Blueprint (AWS Example)

Intro:

  • Disaster Recovery is focused on ensuring that critical business services can continue running smoothly or be rapidly restored after a disruptive event.
  • DR sits within the wider domain of Business Continuity (BC) and has strong dependencies on architecture, operations, observability, and security.

1. DR Core Concepts

RPO (Recovery Point Objective)

How much data loss is acceptable?

  •         Seconds: requires continuous replication
  •         Minutes: frequent snapshots + async replication
  •         Hours: nightly backups
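
To make the RPO idea concrete, here is a minimal sketch that compares the age of the most recent recovery point (backup, snapshot, or replicated write) against an RPO target. The function names and the sample timestamps are illustrative assumptions, not part of any particular tool.

```python
from datetime import datetime, timedelta, timezone


def worst_case_data_loss(last_recovery_point: datetime, now: datetime) -> timedelta:
    """Everything written after the last recovery point is lost if disaster strikes at `now`."""
    return now - last_recovery_point


def meets_rpo(last_recovery_point: datetime, now: datetime, rpo: timedelta) -> bool:
    """True when the potential data loss is within the RPO target."""
    return worst_case_data_loss(last_recovery_point, now) <= rpo


# Hypothetical example: a nightly backup taken at 02:00 UTC, checked at noon.
now = datetime(2025, 11, 20, 12, 0, tzinfo=timezone.utc)
nightly_backup = datetime(2025, 11, 20, 2, 0, tzinfo=timezone.utc)

print(meets_rpo(nightly_backup, now, timedelta(hours=24)))   # True
print(meets_rpo(nightly_backup, now, timedelta(minutes=5)))  # False
```

A nightly backup satisfies an RPO measured in hours but cannot satisfy one measured in minutes, which is why tighter targets call for snapshots or replication.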

RTO (Recovery Time Objective)

How fast must twtech resume operations?

  •         Seconds: active–active (or a fully scaled hot standby)
  •         Minutes: warm standby
  •         Hours: pilot light
  •         Days: backup & restore
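
The mapping from RTO to strategy can be sketched as a simple lookup. The cut-off values below (one minute, one hour, one day) are assumptions chosen only to mirror the seconds / minutes / hours / days bands above; real thresholds depend on the workload and on how quickly the standby can actually scale.

```python
from datetime import timedelta


def strategy_for_rto(rto: timedelta) -> str:
    """Return the cheapest DR strategy assumed to meet the given RTO."""
    if rto < timedelta(minutes=1):
        return "active-active"
    if rto < timedelta(hours=1):
        return "warm standby"
    if rto < timedelta(days=1):
        return "pilot light"
    return "backup & restore"


print(strategy_for_rto(timedelta(seconds=30)))  # active-active
print(strategy_for_rto(timedelta(minutes=10)))  # warm standby
print(strategy_for_rto(timedelta(hours=4)))     # pilot light
print(strategy_for_rto(timedelta(days=3)))      # backup & restore
```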

RLO (Recovery Level Objective)

Granularity of recovery:

  •         entire region
  •         VPC / subnet
  •         application
  •         database
  •         file-level

MTO (Maximum Tolerable Outage)

How long the business can survive an outage.
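
One relationship worth checking explicitly is that the RTO fits inside the MTO: a recovery target longer than the outage the business can survive is not a usable target. A minimal sketch, assuming a hypothetical `RecoveryObjectives` record:

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    """Hypothetical container for one workload's objectives."""
    rpo: timedelta
    rto: timedelta
    mto: timedelta

    def problems(self) -> list[str]:
        """Return human-readable inconsistencies; an empty list means none were found."""
        issues = []
        if min(self.rpo, self.rto, self.mto) < timedelta(0):
            issues.append("objectives must be non-negative")
        if self.rto > self.mto:
            issues.append("RTO exceeds MTO")
        return issues


good = RecoveryObjectives(rpo=timedelta(minutes=5), rto=timedelta(minutes=15), mto=timedelta(hours=4))
bad = RecoveryObjectives(rpo=timedelta(minutes=5), rto=timedelta(hours=8), mto=timedelta(hours=4))

print(good.problems())  # []
print(bad.problems())   # ['RTO exceeds MTO']
```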

2. Categories of Disaster

Physical / Environmental

  •         region-wide outage
  •         power loss
  •         natural disasters

Logical / Operational

  •         bad deployment
  •         configuration drift
  •         human error
  •         insider misuse

Cyber / Security

  •         ransomware
  •         supply chain compromise
  •         credential theft
  •         mass data corruption

Upstream Dependencies

  •         failed 3rd-party API
  •         DNS provider outage
  •         SaaS outages
  •         PKI/KMS failures

3. DR Tiers / Strategies

Below are the canonical DR topologies, ordered from lowest to highest cost (and from slowest to fastest recovery).

3.1 Backup & Restore

Lowest cost, highest RTO/RPO

  •         nightly backups
  •         offsite copies (cross-region/object lock/WORM)
  •         restore into secondary region only during disaster

Used for: non-critical systems, batch workloads.

3.2 Pilot Light

Minimal version of environment always running.

  •         core infrastructure pre-provisioned
  •         databases replicated or at least snapshots replicated
  •         app servers start during failover

RTO: ~30–60 minutes
RPO: minutes–hours

3.3 Warm Standby

Scaled-down version of prod constantly running.

  •         DB cross-region replication
  •         reduced-size compute fleet always on
  •         traffic routed to standby only when needed

RTO: 5–15 minutes
RPO: seconds–minutes

3.4 Multi-Region Active–Active

Most expensive but highest availability.

  •         both regions handle real traffic
  •         continuous replication
  •         global load balancing (GSLB)
  •         write conflict resolution system

RTO: near-zero
RPO: near-zero

Used for: payments, identity/auth, gaming, IoT ingestion, trading systems.
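
Active–active designs need a rule for what happens when both regions write the same record. One common and simple rule is last-writer-wins (LWW); the sketch below is an illustration of that rule only, not a claim about how any specific database resolves conflicts. The `VersionedValue` type and its fields are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VersionedValue:
    value: str
    timestamp_ms: int  # write time as recorded by the writing region
    region: str        # used only as a deterministic tie-breaker


def resolve_lww(a: VersionedValue, b: VersionedValue) -> VersionedValue:
    """Keep the later write; on equal timestamps pick by region name so every region converges."""
    return max(a, b, key=lambda v: (v.timestamp_ms, v.region))


east = VersionedValue("address=Old St", timestamp_ms=1000, region="us-east-1")
west = VersionedValue("address=New St", timestamp_ms=1005, region="eu-west-1")

print(resolve_lww(east, west).value)  # address=New St
```

LWW silently discards the losing write and depends on reasonably synchronized clocks, so it suits data where the latest value is all that matters. Counters, balances, and similar data usually need a merge strategy instead.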

4. DR for Each Layer of the Stack

4.1 Compute Layer

Stateless compute:

  •         rehydrated from CI/CD
  •         AMIs, images, container registries replicated
  •         IaC (Terraform/CloudFormation/CDK) stored multi-region

Stateful compute:

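  •         EBS volume snapshot replication (instance store volumes are ephemeral and cannot be snapshotted)
  •         VM failover orchestration (AWS Elastic Disaster Recovery, formerly CloudEndure; VMware; Azure Site Recovery)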

4.2 Networking

Critical DR dependencies:

  •         multi-region VPCs
  •         CIDR planning without overlap
  •         multi-region Transit Gateway / VNet Peering
  •         cross-region private connectivity
  •         multi-region firewalls / Web Application Firewall (WAF)
  •         DNS failover (Route53, Akamai, Cloudflare)

twtech designs two equal network islands that can run independently but share identity + security posture.
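
Non-overlapping CIDR planning is easy to verify mechanically. The following sketch uses Python's standard `ipaddress` module; the network names and address ranges are made up for illustration.

```python
from ipaddress import ip_network
from itertools import combinations


def overlapping_cidrs(cidrs: dict[str, str]) -> list[tuple[str, str]]:
    """Return every pair of named networks whose address ranges overlap."""
    nets = {name: ip_network(cidr) for name, cidr in cidrs.items()}
    return [(a, b) for a, b in combinations(nets, 2) if nets[a].overlaps(nets[b])]


# Hypothetical address plan: the shared-services range was carved out of the DR VPC by mistake.
plan = {
    "primary-vpc": "10.0.0.0/16",
    "dr-vpc": "10.1.0.0/16",
    "shared-services": "10.1.128.0/20",
}

print(overlapping_cidrs(plan))  # [('dr-vpc', 'shared-services')]
```

Running a check like this in CI against the IaC address plan catches overlaps before they block cross-region peering or routing.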

4.3 Database DR

Relational:

  •         Aurora global / cross-region replication
  •         PostgreSQL logical replication
  •         MySQL binlog shipping
  •         Oracle Data Guard
  •         SQL Server AlwaysOn Availability Groups (AGs)

NoSQL:

  •         DynamoDB global tables
  •         MongoDB Atlas multi-region
  •         Cassandra multi-datacenter replication

Data Lake:

  •         S3 cross-region replication (CRR)
  •         object lock + versioning
  •         parquet dataset replication

Caches:

  •         Redis Multi-AZ within a region; for cross-region DR, use the engine's native cross-region replication where available, or application-managed cold/warm caches.

4.4 Storage

  •         immutable backups (WORM/object lock)
  •         multi-region object replication
  •         EBS/FSx snapshot replication
  •         large dataset async replication pipelines

4.5 Identity & Secrets

Critical and often forgotten.

Identity:

  •         IAM global vs regional entities
  •         cross-region STS access
  •         backup of IAM roles/policies via IaC
  •         multi-region SSO/IdP failover

Secrets/KMS:

  •         multi-region KMS keys (MRKs)
  •         replication of key policies
  •         envelope encryption for portability
  •         verify that decrypt operations succeed after failover
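
The decrypt check in the last bullet can be automated with a known "canary" ciphertext that every region must be able to decrypt. The sketch below is deliberately provider-neutral: the `Decryptor` callables are hypothetical stand-ins for whatever KMS client call each region would really use, and the two fake decryptors exist only to make the example runnable.

```python
from typing import Callable

# Hypothetical signature: takes ciphertext bytes, returns plaintext bytes, raises on failure.
Decryptor = Callable[[bytes], bytes]


def verify_decrypt(ciphertext: bytes, expected: bytes,
                   decryptors: dict[str, Decryptor]) -> dict[str, bool]:
    """Try the canary in each region and record whether the expected plaintext came back."""
    results = {}
    for region, decrypt in decryptors.items():
        try:
            results[region] = decrypt(ciphertext) == expected
        except Exception:
            # Any failure (missing key, denied by policy, network error) counts as not ready.
            results[region] = False
    return results


def primary_decrypt(blob: bytes) -> bytes:
    return blob[::-1]  # stand-in for a real KMS decrypt call


def dr_decrypt(blob: bytes) -> bytes:
    raise PermissionError("key policy not replicated")  # simulated DR-region failure


print(verify_decrypt(b"yranac", b"canary", {"primary": primary_decrypt, "dr": dr_decrypt}))
# {'primary': True, 'dr': False}
```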

5. Observability & Orchestration

Monitoring:

  •         multi-region time-series store
  •         central event bus
  •         application-level heartbeat checks
  •         synthetic canary tests
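
Heartbeat checks usually feed a failover decision only after several consecutive failures, so that a single dropped probe does not trigger a region switch. A minimal sketch, assuming a hypothetical threshold of three missed heartbeats:

```python
def should_fail_over(heartbeats: list[bool], threshold: int = 3) -> bool:
    """True when the most recent `threshold` heartbeats all failed (False = missed heartbeat)."""
    if len(heartbeats) < threshold:
        return False  # not enough evidence yet
    return not any(heartbeats[-threshold:])


print(should_fail_over([True, False, False, False]))  # True
print(should_fail_over([False, False, True]))         # False
```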

DR Orchestration:

  •         automated failover runbooks
  •         event-driven workflows (AWS Step Functions, Azure Automation, GCP Workflows)
  •         chaos engineering (fault injection, network blackholing)
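
An automated failover runbook is, at its core, an ordered list of steps that halts as soon as one fails. The sketch below shows that control flow in plain Python. The step names are illustrative and the lambdas stand in for real actions; a production version would typically live in a workflow engine such as those listed above.

```python
from typing import Callable

Step = tuple[str, Callable[[], bool]]  # (description, action returning True on success)


def run_runbook(steps: list[Step]) -> list[str]:
    """Execute steps in order, stopping at the first failure. Returns a log of what happened."""
    log = []
    for name, action in steps:
        ok = action()
        log.append(f"{name}: {'ok' if ok else 'FAILED'}")
        if not ok:
            log.append("halted: manual intervention required")
            break
    return log


# Hypothetical failover sequence in which the DNS step fails.
steps = [
    ("freeze writes in primary", lambda: True),
    ("promote DR database", lambda: True),
    ("shift DNS to DR region", lambda: False),
    ("scale up DR compute", lambda: True),
]

for line in run_runbook(steps):
    print(line)
```

Stopping at the first failure keeps the system in a known state; later steps such as scaling up compute never run against a half-completed failover.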

Logging:

  •         multi-region log sinks
  •         long-term storage in immutable buckets
  •         SIEM cross-region ingestion

6. DR Testing

Testing is more important than architecture: a DR design that has never been exercised cannot be trusted.

Tests to run:

  •         failover / failback
  •         DNS cutover
  •         region evacuation
  •         database promotion
  •         IAM/KMS boundary tests
  •         network segmentation tests
  •         ransomware restore tests
  •         dependency isolation (verify that a 3rd-party outage does not take twtech down)

Frequency:

  •         critical workloads: monthly
  •         standard workloads: quarterly
  •         regulated workloads: at least semi-annually (mandatory)
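
The test cadence above can be tracked with a small overdue check. The day counts are approximations of "monthly", "quarterly", and "semi-annual" chosen for this sketch, and the tier names are assumptions.

```python
from datetime import date, timedelta

# Approximate intervals for the cadences listed above (assumed values).
TEST_INTERVAL_DAYS = {"critical": 30, "standard": 91, "regulated": 182}


def is_test_overdue(tier: str, last_tested: date, today: date) -> bool:
    """True when more time has passed since the last DR test than the tier allows."""
    return today - last_tested > timedelta(days=TEST_INTERVAL_DAYS[tier])


today = date(2025, 11, 20)
print(is_test_overdue("critical", date(2025, 10, 1), today))  # True  (50 days > 30)
print(is_test_overdue("standard", date(2025, 10, 1), today))  # False (50 days <= 91)
```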

7. DR Maturity Model

Level   Description
0       Backups only, inconsistent, untested
1       Automated backups, occasional restore tests
2       Cross-region replication for critical DBs
3       Warm standby with automated failover
4       Active–active architecture with global load balancing
5       Chaos-engineered, highly automated, self-healing DR
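
A self-assessment against this model can be sketched as follows. It assumes the levels are cumulative (a team cannot claim level 3 without also meeting levels 1 and 2), which the model implies but does not state, and the capability identifiers are invented for the example.

```python
# One assumed capability flag per level, in ascending order.
LEVEL_REQUIREMENTS = [
    (1, "automated_backups_with_restore_tests"),
    (2, "cross_region_db_replication"),
    (3, "warm_standby_automated_failover"),
    (4, "active_active_global_lb"),
    (5, "chaos_engineered_self_healing"),
]


def maturity_level(capabilities: set[str]) -> int:
    """Return the highest level whose requirement, and all lower ones, are met."""
    level = 0
    for lvl, requirement in LEVEL_REQUIREMENTS:
        if requirement not in capabilities:
            break
        level = lvl
    return level


print(maturity_level({"automated_backups_with_restore_tests", "cross_region_db_replication"}))  # 2
# Warm standby without database replication still scores 1 under the cumulative assumption.
print(maturity_level({"automated_backups_with_restore_tests", "warm_standby_automated_failover"}))  # 1
```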

8. Disaster Recovery Anti-Patterns

❌   Single-region architectures with multi-AZ only
❌   Backups stored in same region
❌   No restore validation
❌   Hard-coding region names in apps
❌   Stateful services without replication
❌   Transitive dependencies (one region relies on another's control plane)
❌   Shared KMS keys without multi-region symmetry
❌   Overreliance on synchronous replication everywhere
❌   DNS TTLs > 60 seconds
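
Several of these anti-patterns can be detected automatically from a workload's configuration. The sketch below checks three of them; the configuration keys (`primary_region`, `backup_region`, `restore_validated`, `dns_ttl_seconds`) are hypothetical and would need to be mapped to whatever inventory or IaC data is actually available.

```python
def find_anti_patterns(cfg: dict) -> list[str]:
    """Return the anti-patterns from the list above that this (hypothetical) config exhibits."""
    findings = []
    if cfg["backup_region"] == cfg["primary_region"]:
        findings.append("backups stored in same region")
    if not cfg["restore_validated"]:
        findings.append("no restore validation")
    if cfg["dns_ttl_seconds"] > 60:
        findings.append("DNS TTL > 60 seconds")
    return findings


workload = {
    "primary_region": "us-east-1",
    "backup_region": "us-east-1",
    "restore_validated": True,
    "dns_ttl_seconds": 300,
}

print(find_anti_patterns(workload))
# ['backups stored in same region', 'DNS TTL > 60 seconds']
```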

9. Reference Multi-Region DR Blueprint (AWS Example)



