Sunday, November 23, 2025

AWS Multi-Site & Hot Site DR Strategy | Deep Dive.

Scope:

  • The concept: Multi-Site / Hot Site DR,
  • When to Use Multi-Site / Hot Site,
  • Architectural Building Blocks,
  • DR Failover Workflow (Step-by-Step),
  • Recovery Time & Recovery Point Objectives,
  • Multi-Site / Hot Site AWS Architecture Diagram & Description,
  • Cost Considerations,
  • Best Practices,
  • Optional Failback Procedure.

1. The concept: Multi-Site & Hot Site DR

    • The Multi-Site / Hot Site strategy offers the fastest recovery and the highest resilience of the AWS disaster recovery approaches.
    • It keeps two (or more) environments fully active and capable of serving production traffic simultaneously.
    • These environments are deployed across multiple AWS Regions.

Key Properties

    •  Both Regions run a full production stack (active-active, or active-passive with the standby kept at production capacity)
    •  Real-time data replication between Regions
    •  Automatic or near-automatic failover
    •  Lowest RTO (typically seconds to under 5 minutes)
    •  Lowest RPO (near zero, depending on datastore type)
    •  Highest operational cost of all DR strategies

 2. When to Use Multi-Site / Hot Site

Best for:

    • Mission-critical applications
    • Zero-downtime requirements
    • Financial, healthcare, e-commerce, or services with strict SLAs
    • Global latency reduction (via active-active)
    • Applications requiring continuous reads/writes across Regions

 3. Architectural Building Blocks

 3.1 Global Traffic Management

NB:

    • AWS routing services decide which Region serves the traffic (based on health checks).

Options:

  •  Amazon Route 53
    •    Latency-based routing
    •    Geolocation-based routing
    •    Health-check failover
    •    Weighted routing for gradual migration
  •  AWS Global Accelerator
    •    Improves global performance
    •    Intelligent edge routing
    •    Much faster failover (seconds)
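To make the health-check failover option more concrete, here is a minimal sketch of a Route 53 failover record pair, built as a plain Python dictionary. The domain names and health check IDs are hypothetical placeholders, and the helper functions `failover_record` and `failover_change_batch` are illustrative names, not part of any AWS SDK.

```python
from typing import Any


def failover_record(name: str, role: str, target_dns: str,
                    health_check_id: str, ttl: int = 60) -> dict[str, Any]:
    """Build one UPSERT change for a Route 53 failover record."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "CNAME",
            "SetIdentifier": f"{name}-{role.lower()}",
            "Failover": role,                  # PRIMARY or SECONDARY
            "TTL": ttl,                        # short TTL so clients re-resolve quickly
            "ResourceRecords": [{"Value": target_dns}],
            "HealthCheckId": health_check_id,  # hypothetical ID
        },
    }


def failover_change_batch(name: str,
                          primary: tuple[str, str],
                          secondary: tuple[str, str]) -> dict[str, Any]:
    """Pair a PRIMARY and a SECONDARY record.

    Each tuple is (target DNS name, health check ID).
    """
    return {
        "Comment": "Failover routing between Region A and Region B",
        "Changes": [
            failover_record(name, "PRIMARY", *primary),
            failover_record(name, "SECONDARY", *secondary),
        ],
    }


# Hypothetical endpoints for Region A and Region B.
batch = failover_change_batch(
    "app.example.com",
    primary=("region-a.example.net", "hc-region-a"),
    secondary=("region-b.example.net", "hc-region-b"),
)
```

A dictionary of this shape could be passed as the `ChangeBatch` argument of boto3's `change_resource_record_sets` call, together with a hosted zone ID. The sketch only builds the structure and does not call AWS.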


3.2 Application Layer (Compute)

Common Compute Patterns

    •  Active-Active
      •    Both Regions serve traffic equally
      •    Requires stateless or state-replicated architecture
    •  Active-Passive Hot
      •    Secondary Region at full capacity but not receiving traffic
      •    Kept scaled to production-ready capacity

AWS Services

    • Amazon ECS / EKS with a cluster in each Region
    • AWS Lambda with versions & aliases deployed to each Region
    • Amazon EC2 Auto Scaling groups in each Region
    • AWS AppSync or API Gateway multi-Region configuration

 3.3 Data Layer – Multi-Region Synchronization

Ideal solutions for databases

Service                                    Cross-Region Mode                                 RPO                               Notes
Amazon DynamoDB Global Tables              Active-active (asynchronous)                      Near zero (typically under 1 s)   Best for global workloads
Amazon Aurora Global Database              One writer Region + read-only secondary Regions   <1 s (typical)                    Fastest cross-Region RPO for relational
RDS Cross-Region Read Replicas             Asynchronous                                      Seconds to minutes                Good for read-heavy workloads
Amazon S3 CRR (Cross-Region Replication)   Asynchronous                                      Seconds to minutes                Region-to-Region object sync
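Because these replication modes are asynchronous, the RPO translates into a number of writes that may not have reached the other Region at the moment of failure. A rough sketch of that arithmetic follows; the write rate and lag are illustrative numbers, not measurements, and `writes_at_risk` is a hypothetical helper.

```python
def writes_at_risk(write_rate_per_s: int, replication_lag_ms: int) -> int:
    """Upper-bound estimate of writes not yet replicated when the primary fails.

    Assumes a steady write rate and a constant replication lag.
    """
    if write_rate_per_s < 0 or replication_lag_ms < 0:
        raise ValueError("inputs must be non-negative")
    # Integer ceiling division: round partial writes up.
    return -(-write_rate_per_s * replication_lag_ms // 1000)


# Illustrative: 200 writes/s with 800 ms of replication lag.
print(writes_at_risk(200, 800))  # 160
```

The same write rate with a 5-second lag would put about 1,000 writes at risk, which is why the datastore choice drives the achievable RPO.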

 3.4 Shared Services

    • AWS IAM (Global service)
    • AWS Secrets Manager with secrets replicated across Regions
    • S3 for shared assets (CRR)
    • CloudFront (edge delivery)

 3.5 Automation & Infrastructure Management

    •  AWS CloudFormation StackSets
    •  AWS Control Tower multi-account governance
    •  AWS Config multi-region compliance
    •  GitOps (ArgoCD, Flux) for multi-region Kubernetes

 4. DR Failover Workflow (Step-by-Step)

If Region A fails:

    1. Route 53 / Global Accelerator health checks detect the failure
    2. Traffic is rerouted to Region B
    3. Region B’s stateless services scale automatically to absorb the extra load
    4. DynamoDB Global Tables keep accepting writes in Region B; with Aurora Global Database, the Region B cluster is promoted to writer
    5. CI/CD or automation promotes Region B as the dominant Region
    6. Alerts and dashboards notify administrators
    7. Optional: fail-back procedure after Region A returns

NB:

    • Failover is typically automatic and completes in seconds to minutes.
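The detection and rerouting steps can be sketched as a small simulation. This is a simplified model, not how Route 53 or Global Accelerator is implemented: the class name `FailoverRouter` and the threshold of three consecutive failed checks are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class FailoverRouter:
    primary: str
    secondary: str
    failure_threshold: int = 3     # consecutive failed checks before failing over
    consecutive_failures: int = 0
    active: str = ""

    def __post_init__(self) -> None:
        # Traffic starts on the primary Region.
        self.active = self.primary

    def observe(self, primary_healthy: bool) -> str:
        """Record one health check result and return the Region serving traffic."""
        if primary_healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.active = self.secondary
        # No automatic fail-back: once on the secondary, traffic stays there
        # until an operator runs the fail-back procedure.
        return self.active


router = FailoverRouter(primary="region-a", secondary="region-b")
checks = [True, False, False, True, False, False, False, True]
history = [router.observe(ok) for ok in checks]
print(history)
```

In this run, the two isolated failures do not trigger a failover because a healthy check resets the counter. Only the third consecutive failure moves traffic to `region-b`, where it stays even after Region A reports healthy again.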

 5. Recovery Time & Recovery Point Objectives

Strategy                RTO                        RPO            Cost
Multi-Site / Hot Site   Seconds–Minutes            Zero–Seconds   Very High
Warm Standby            Minutes                    Low            Moderate–High
Pilot Light             Tens of minutes to hours   Low            Low–Moderate
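For a DNS-based failover, the "Seconds–Minutes" RTO can be broken down into detection time, DNS cache expiry, and any database promotion. The sketch below adds these up under simplifying assumptions (the phases run one after another, and the input values are illustrative); `estimate_rto_seconds` is a hypothetical helper.

```python
def estimate_rto_seconds(check_interval_s: int, failure_threshold: int,
                         dns_ttl_s: int, promotion_s: int) -> int:
    """Rough RTO estimate for a DNS-based failover.

    detection = health check interval x consecutive failures required
    dns_ttl_s = time for cached DNS answers to expire
    promotion_s = time to promote the secondary database (0 if not needed)
    """
    detection = check_interval_s * failure_threshold
    return detection + dns_ttl_s + promotion_s


# Illustrative values: 10 s checks, 3 failures required, 60 s TTL, 60 s promotion.
print(estimate_rto_seconds(10, 3, 60, 60))  # 150
```

With these assumed inputs the estimate is 150 seconds, so shortening the DNS TTL or the health check interval has a direct effect on the achievable RTO.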

 6. Multi-Site / Hot Site AWS Architecture Diagram & Description


7. Cost Considerations

  • Multi-Site / Hot Site is the most expensive DR strategy because:
    • Two Regions run near-full production capacity
    • Databases often require multi-region replication (premium pricing)
    • Networking between regions incurs cost
    • More operational overhead

NB:

  • In return, this strategy delivers the lowest RTO/RPO of the AWS DR strategies.

 8. Best Practices

Compute

    •  Design for stateless architecture
    •  Keep session state out of the compute tier; use a replicated in-memory cache such as ElastiCache Global Datastore

Data

    • Avoid multi-region writes without conflict resolution
    • Prefer services with native global replication
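To illustrate what conflict resolution means for multi-Region writes, here is a minimal last-writer-wins sketch, the reconciliation approach DynamoDB Global Tables use for concurrent updates. The `Version` type and the tie-break on Region name are assumptions made for this example, not the service's internal logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: str
    timestamp_ms: int  # time of the write
    region: str        # Region where the write was accepted


def resolve_lww(a: Version, b: Version) -> Version:
    """Keep the write with the latest timestamp.

    Ties are broken by Region name (an assumption) so that every
    Region converges on the same winner.
    """
    return max(a, b, key=lambda v: (v.timestamp_ms, v.region))


east = Version("shipped", timestamp_ms=100, region="us-east-1")
west = Version("cancelled", timestamp_ms=105, region="eu-west-1")
print(resolve_lww(east, west).value)  # cancelled
```

Note that the earlier write ("shipped") is silently discarded. If losing either update is unacceptable, route writes for a given item to a single Region instead.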

Network

    • Use VPC Lattice for service-to-service connectivity within each Region
    • Use Transit Gateway inter-Region peering, or
    • Use inter-Region VPC Peering, to connect the Regions.

Security

    • Replicate secrets across Regions
    • Use multi-region KMS keys (supported for some services)

Observability

    •  Multi-region CloudWatch dashboards
    •  Centralized logs in S3 with CRR

 9. Optional Failback Procedure

    1. Restore Region A
    2. Re-establish database replication with Region A as secondary
    3. Shift a small percentage of traffic back to Region A via weighted routing
    4. Increase the share gradually, then return to active-active or primary/secondary routing
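The weighted-routing step can be planned as a schedule of weight pairs. The following is a minimal sketch; the ramp percentages are illustrative, and `failback_weights` is a hypothetical helper that only computes the pairs and does not update any routing records.

```python
def failback_weights(ramp_percent: list[int]) -> list[tuple[int, int]]:
    """Return (Region A weight, Region B weight) pairs for each ramp step.

    Percentages must be between 0 and 100 and must not decrease,
    so that traffic only ever moves back toward Region A.
    """
    schedule = []
    previous = 0
    for pct in ramp_percent:
        if not 0 <= pct <= 100:
            raise ValueError("each step must be between 0 and 100")
        if pct < previous:
            raise ValueError("ramp steps must not decrease")
        schedule.append((pct, 100 - pct))
        previous = pct
    return schedule


# Illustrative ramp: 5% first, then 25%, then an even split.
print(failback_weights([5, 25, 50]))  # [(5, 95), (25, 75), (50, 50)]
```

Each pair could be applied as the weights of two weighted records, with a pause between steps to watch error rates and replication lag before continuing.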


