Wednesday, August 13, 2025

DynamoDB Backups For Disaster Recovery (DR) | Overview.

Focus:

Tailored for DevOps/DevSecOps/Cloud/DR Engineers.

Scope:

  • Intro,
  • DynamoDB Backup Types,
  • Point-in-Time Recovery (PITR) Deep Dive,
  • On-Demand Backups,
  • Cross-Region & Cross-Account DR,
  • Options to protect against regional outages,
  • Disaster Recovery Patterns,
  • Operational Considerations,
  • Sample DR Flow,
  • Costs to Keep in Mind,
  • DynamoDB Disaster Recovery Strategy Matrix,
  • Automation Architecture Reference,
  • Sample DynamoDB DR Decision Tree,
  • Visual Diagram (Flow Style),
  • Insights PITR (Point-in-Time Recovery).

Intro:

  • DynamoDB provides two primary backup methods for disaster recovery (DR):
    • Point-in-Time Recovery (PITR),
    • On-demand backups, which can be managed and automated using AWS Backup.
  • Together they help meet different recovery time objectives (RTO) and recovery point objectives (RPO).

1. DynamoDB Backup Types

  • DynamoDB supports two main backup & restore mechanisms, each with different DR characteristics.

| Feature | Point-in-Time Recovery (PITR) | On-Demand Backup |
|---|---|---|
| Purpose | Continuous protection for accidental deletes/writes | Compliance, archival, cloning tables |
| Granularity | Any second in the last 35 days | Snapshot at request time |
| Retention | Rolling 35 days | Indefinite (until deleted) |
| RPO | Seconds to minutes (per-second granularity; the latest restorable time typically trails the current time by about 5 minutes) | As of backup creation |
| RTO | Minutes to hours (table size dependent) | Minutes to hours |
| Cost | Charged per GB-month, based on the size of the protected table | Charged per GB-month for full snapshot storage |
| Best For | Operational recovery from logical errors | Long-term DR, migrations, compliance retention |

2. Point-in-Time Recovery (PITR) Deep Dive

PITR uses DynamoDB Streams–like change logs behind the scenes to enable recovery to any second in the past 35 days.

  • Enablement: Per table; PITR is not turned on automatically for every table.
  • Use Cases:
    • Accidental DELETE or PUT overwrites.
    • Bad batch job/data corruption.
How the restore process works:

  1. Choose a timestamp within the last 35 days.
  2. AWS creates a new table with the data as of that point.
  3. Swap traffic over once the restored table is validated.

DR Characteristics:
  • RPO: Restores have per-second granularity, but the latest restorable time typically trails the current time by about five minutes.
  • RTO: Depends on table size & data transfer speed to the new table.
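
The restore steps above can be sketched with boto3. This is a minimal sketch, not a production runbook: the function names are illustrative, and `client` is assumed to be a `boto3.client("dynamodb")` object created by the caller.

```python
from datetime import datetime


def enable_pitr(client, table_name: str) -> None:
    """Turn on PITR for one table (it is a per-table setting)."""
    client.update_continuous_backups(
        TableName=table_name,
        PointInTimeRecoverySpecification={"PointInTimeRecoveryEnabled": True},
    )


def restore_to_point_in_time(client, source_table: str, target_table: str,
                             restore_time: datetime) -> dict:
    """Restore source_table into a NEW table as of restore_time.

    restore_time is assumed to be timezone-aware, like the datetimes
    boto3 returns in the restorable window.
    """
    desc = client.describe_continuous_backups(TableName=source_table)
    pitr = desc["ContinuousBackupsDescription"]["PointInTimeRecoveryDescription"]
    if pitr["PointInTimeRecoveryStatus"] != "ENABLED":
        raise RuntimeError(f"PITR is not enabled on {source_table}")

    earliest = pitr["EarliestRestorableDateTime"]
    latest = pitr["LatestRestorableDateTime"]
    if not earliest <= restore_time <= latest:
        raise ValueError(f"restore_time must be between {earliest} and {latest}")

    # The restore always creates a new table; traffic is switched separately.
    return client.restore_table_to_point_in_time(
        SourceTableName=source_table,
        TargetTableName=target_table,
        RestoreDateTime=restore_time,
    )
```

Checking the restorable window first gives a clearer error than letting the API call fail, which helps during an incident.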

3. On-Demand Backups

An on-demand backup is a complete snapshot of the table, stored in DynamoDB’s backend storage layer.

  • Use Cases:
    • Regulatory requirements.
    • Monthly/quarterly archival.
    • Pre-deployment “safety net.”
  • Behavior:
    • Backups run without impacting read/write performance.
    • Restore is always to a new table.
    • Can be copied cross-account & cross-region when managed through AWS Backup (more on this below).
  • Cost Considerations:
    • Backup storage is billed separately from table storage.
    • Restores are billed per GB of data restored.
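
As a rough illustration of the behavior above, the sketch below takes an on-demand backup and restores it to a new table with boto3. The helper names and the timestamp-based backup naming scheme are assumptions made for this example; `client` is assumed to be a `boto3.client("dynamodb")` object.

```python
from datetime import datetime, timezone


def create_on_demand_backup(client, table_name: str, now: datetime = None) -> str:
    """Create an on-demand backup and return its ARN."""
    now = now or datetime.now(timezone.utc)
    # Hypothetical naming scheme: <table>-<UTC timestamp>
    backup_name = f"{table_name}-{now:%Y%m%dT%H%M%SZ}"
    resp = client.create_backup(TableName=table_name, BackupName=backup_name)
    return resp["BackupDetails"]["BackupArn"]


def restore_from_backup(client, backup_arn: str, target_table: str) -> str:
    """Restore a backup into a NEW table and return the table's initial status."""
    resp = client.restore_table_from_backup(
        TargetTableName=target_table,
        BackupArn=backup_arn,
    )
    return resp["TableDescription"]["TableStatus"]
```

A pre-deployment "safety net" would call `create_on_demand_backup` just before the release and keep the returned ARN with the deployment record.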

4. Cross-Region & Cross-Account DR

  • If twtech's DR plan includes region failure scenarios, PITR alone isn’t enough: PITR data stays in-region.
Options to protect against regional outages:

Option A – Backup Copy

  • Process:
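    • Create on-demand backup in Region A (managed through AWS Backup so it can be copied).
    • Use AWS Backup's cross-region copy (the StartCopyJob API) to copy it to a backup vault in Region B.
  • Pros: Data loss from a region outage is limited to writes made since the last copied backup.
  • Cons: Increased RPO (depends on backup frequency).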
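
A cross-region copy can be sketched with the AWS Backup `StartCopyJob` API via boto3. This is a sketch under stated assumptions: the recovery point is managed by AWS Backup, `backup_client` is a `boto3.client("backup")` object in the source region, and the vault names, ARNs and IAM role are placeholders supplied by the caller.

```python
def vault_region(vault_arn: str) -> str:
    """Extract the region from a backup vault ARN (arn:aws:backup:<region>:...)."""
    return vault_arn.split(":")[3]


def copy_recovery_point(backup_client, recovery_point_arn: str, source_vault: str,
                        dest_vault_arn: str, role_arn: str, source_region: str) -> str:
    """Start a copy job to a vault in another region and return the copy job ID."""
    if vault_region(dest_vault_arn) == source_region:
        # A same-region copy does not protect against a regional outage.
        raise ValueError("destination vault must be in a different region")
    resp = backup_client.start_copy_job(
        RecoveryPointArn=recovery_point_arn,
        SourceBackupVaultName=source_vault,
        DestinationBackupVaultArn=dest_vault_arn,
        IamRoleArn=role_arn,
    )
    return resp["CopyJobId"]
```

The region check is a design choice for this example; it makes a misconfigured destination fail loudly rather than silently producing a copy that offers no DR value.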

Option B – Global Tables

  • Process:
    • Set up DynamoDB Global Tables for active-active replication.
  • Pros: RPO ~0; no restore needed for failover.
  • Cons: More expensive; not strictly “backup,” but a replication strategy.
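
For Option B, adding a replica region to an existing table can be sketched with the `UpdateTable` API (`ReplicaUpdates`), which applies to the current Global Tables version. The helper names are illustrative, `client` is assumed to be a `boto3.client("dynamodb")` object, and table prerequisites for replication are not handled here.

```python
def add_replica(client, table_name: str, replica_region: str) -> str:
    """Ask DynamoDB to create a replica of the table in replica_region."""
    resp = client.update_table(
        TableName=table_name,
        ReplicaUpdates=[{"Create": {"RegionName": replica_region}}],
    )
    return resp["TableDescription"]["TableStatus"]


def replica_statuses(client, table_name: str) -> dict:
    """Return {region: status} for the table's replicas (empty if none)."""
    table = client.describe_table(TableName=table_name)["Table"]
    return {r["RegionName"]: r.get("ReplicaStatus") for r in table.get("Replicas", [])}
```

Polling `replica_statuses` until the new region reports `ACTIVE` is one way to confirm the replica is ready before including it in failover routing.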

5. Disaster Recovery Patterns

| DR Strategy | RPO | RTO | Cost | Notes |
|---|---|---|---|---|
| In-Region PITR | Seconds | Hours | Low-Med | Covers logical corruption; not region outage |
| Scheduled On-Demand + Cross-Region Copy | Hours (based on schedule) | Hours | Med-High | Protects from region outage |
| Global Tables | ~0 | Minutes | High | Failover without restore |
| Hybrid | Seconds–Hours | Minutes–Hours | High | PITR + periodic region copy |

6. Operational Considerations

  • Automation:
    • Use EventBridge to trigger periodic on-demand backups.
    • Use Lambda to copy to DR region.
  • Monitoring:
    • Backup job completion events (CloudWatch Events/EventBridge), with alerts on failed or missing jobs.
    • PITR enabled status alarms.
  • Testing:
    • Periodically restore backups to a staging environment.
    • Validate data integrity and application compatibility.
  • Security:
    • Encrypt backups with KMS CMKs.
    • Ensure IAM least privilege for backup/restore APIs.
  • Large Table Restores:
    • Restores are parallelized across partitions; restore time is driven mainly by table size and the number of underlying partitions.
    • Restores are bulk load operations, not live streaming.
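
The "PITR enabled status" check under Monitoring can be sketched as a small audit that lists the tables in a region and reports those without PITR. The function name is illustrative and `client` is assumed to be a `boto3.client("dynamodb")` object; wiring the result to an alarm or notification is left out.

```python
def tables_without_pitr(client) -> list:
    """Return the names of tables in this region that do not have PITR enabled."""
    missing = []
    paginator = client.get_paginator("list_tables")
    for page in paginator.paginate():
        for name in page["TableNames"]:
            desc = client.describe_continuous_backups(TableName=name)
            pitr = desc["ContinuousBackupsDescription"]["PointInTimeRecoveryDescription"]
            if pitr["PointInTimeRecoveryStatus"] != "ENABLED":
                missing.append(name)
    return missing
```

Run on a schedule, a non-empty result could feed an alert so that newly created tables do not silently miss PITR.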

7. Sample DR Flow

Scenario: Regional outage in us-east-2
Goal: Recover in us-west-1 with < 4h RTO and < 1h RPO.

  1. Every hour, an on-demand backup is created in us-east-2.
  2. Each backup is immediately copied to us-west-1.
  3. During outage:
    • Trigger restore from the latest backup copy in us-west-1 to a new table.
    • Repoint app to new endpoint after functional tests.
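
The "latest backup" step in this flow implies a check that the newest copy available in the DR region actually satisfies the RPO target. A minimal, library-free sketch of that check follows; the record layout (`arn`, `completed_at`) is a hypothetical shape chosen for this example.

```python
from datetime import datetime, timedelta


def latest_usable_backup(backups: list, outage_start: datetime, rpo: timedelta):
    """Pick the newest backup completed before the outage and check it against the RPO.

    Returns (backup, meets_rpo); backup is None if nothing completed in time.
    """
    candidates = [b for b in backups if b["completed_at"] <= outage_start]
    if not candidates:
        return None, False
    latest = max(candidates, key=lambda b: b["completed_at"])
    data_loss_window = outage_start - latest["completed_at"]
    return latest, data_loss_window <= rpo
```

With hourly backups and a one-hour RPO, any delay in the backup or copy job can push the data loss window past the target, which is why the check compares completion time rather than scheduled time.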

8. Costs to Keep in Mind

  • PITR: ~$0.20 per GB-month, based on the size of the table (prices vary by region).
  • On-Demand: ~$0.10 per GB-month.
  • Cross-Region Copy: Additional storage + transfer.
  • Restores: ~$0.15 per GB restored.
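
The approximate rates above can be turned into a back-of-the-envelope estimate. The sketch below uses those figures as defaults and assumes every retained on-demand backup is a full-size copy of the table; cross-region storage and transfer are excluded because no rate is given for them here.

```python
def monthly_backup_cost(table_gb: float, retained_backups: int, pitr: bool = True,
                        pitr_rate: float = 0.20, backup_rate: float = 0.10) -> float:
    """Estimate monthly USD cost of PITR plus retained on-demand backups."""
    pitr_cost = table_gb * pitr_rate if pitr else 0.0
    # Assumption: each retained backup stores the full table size.
    backup_cost = table_gb * retained_backups * backup_rate
    return pitr_cost + backup_cost


def restore_cost(table_gb: float, restore_rate: float = 0.15) -> float:
    """Estimate the one-off USD cost of restoring a table of table_gb."""
    return table_gb * restore_rate
```

For example, a 500 GB table with PITR and 24 retained backups comes out at roughly 1,300 USD per month under these assumptions, and a single restore at roughly 75 USD.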

DynamoDB Disaster Recovery Strategy Matrix

  • Built so twtech can plug in its target RPO (Recovery Point Objective) and RTO (Recovery Time Objective) and immediately see the right AWS backup features, automation setup, and trade-offs.

DynamoDB DR Strategy Matrix

| Target RPO | Target RTO | Recommended AWS Features | Automation Setup | Cost Level | Pros | Cons | Best For |
|---|---|---|---|---|---|---|---|
| 1 second | 15 min | Global Tables | Create active-active Global Table across regions; Route 53 failover routing; health checks for auto-switch. | 🔴 High | Zero restore time; instant failover; protects from region outage. | High ongoing cost; complex conflict resolution logic. | Mission-critical, 24/7 low-latency apps. |
| 1 second | Hours | PITR (Point-in-Time Recovery) (same region) | Enable PITR per table; manual/automated restore to new table; use CloudFormation/Lambda to rebuild indexes/infra. | 🟢 Low-Med | Covers accidental deletes/writes; granular restore. | No region outage protection; restore time depends on table size. | In-region logical corruption recovery. |
| 1 hour | 4 hours | PITR + Hourly On-Demand Backup + Cross-Region Copy | PITR for in-region safety; EventBridge hourly backup trigger; Lambda copies backup to DR region. | 🟡 Medium | Combines near-real-time local restore + regional DR. | Higher storage costs; hourly RPO may not meet sub-hour needs. | Balanced cost & DR coverage. |
| 1 hour | > 4 hours | On-Demand Backups + Cross-Region Copy (every 1h) | EventBridge to schedule backups; Lambda to copy to DR region. | 🟡 Medium | Simple automation; meets compliance needs. | Slower restore; 1h data loss possible. | DR compliance & cost balance. |
| 24 hours | > 4 hours | Daily On-Demand Backups (Cross-Region if needed) | Daily scheduled backups; optional DR region copy. | 🟢 Low | Cheapest; meets regulatory archiving. | High potential data loss; slow recovery. | Non-critical, audit-focused data. |
| Custom | Custom | Hybrid (PITR + periodic backups + optional global tables) | Mix features based on business unit criticality. | Variable | Tailored to workload. | More complex ops. | Multi-tier DR planning. |



Automation Architecture Reference

For “PITR + Cross-Region On-Demand” setup:

  1. PITR Enabled → Continuous local protection.
  2. EventBridge Schedule (every X hours):
    • Trigger Lambda → create the on-demand backup (through AWS Backup, so it can be copied across regions).
    • Lambda → copy the backup to the DR region (AWS Backup StartCopyJob API).
  3. Backup Monitoring:
    • EventBridge (CloudWatch Events) rule on backup job completion → SNS alerts.
  4. Disaster Event:
    • Restore from latest DR region backup → New table.
    • Switch endpoints via Route 53 or config update.
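
As an alternative to hand-written EventBridge and Lambda glue, the schedule-plus-copy part of this architecture can also be expressed as an AWS Backup plan with a copy action. The sketch below only builds the plan document; the plan, rule and vault names are placeholders, and the retention default is an assumption for illustration.

```python
ALLOWED_INTERVALS = (1, 2, 3, 4, 6, 8, 12)  # hours; each divides 24 evenly


def build_backup_plan(plan_name: str, source_vault: str, dr_vault_arn: str,
                      every_hours: int = 1, retain_days: int = 7) -> dict:
    """Build a BackupPlan document: scheduled backups plus a copy to the DR vault."""
    if every_hours not in ALLOWED_INTERVALS:
        raise ValueError(f"every_hours must be one of {ALLOWED_INTERVALS}")
    return {
        "BackupPlanName": plan_name,
        "Rules": [{
            "RuleName": f"every-{every_hours}h",
            "TargetBackupVaultName": source_vault,
            # AWS cron fields: minutes hours day-of-month month day-of-week year
            "ScheduleExpression": f"cron(0 0/{every_hours} * * ? *)",
            "Lifecycle": {"DeleteAfterDays": retain_days},
            "CopyActions": [{
                "DestinationBackupVaultArn": dr_vault_arn,
                "Lifecycle": {"DeleteAfterDays": retain_days},
            }],
        }],
    }

# Usage (not executed here), assuming backup = boto3.client("backup"):
#   plan = build_backup_plan("ddb-dr", "primary-vault", dr_vault_arn)
#   backup.create_backup_plan(BackupPlan=plan)
# Tables are then attached to the plan with create_backup_selection.
```

Keeping the plan as plain data makes it easy to review in code and to validate before any API call is made.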

Quick Selection Guide

  • If twtech wants zero downtime, zero data loss → Global Tables.
  • If twtech wants minimal cost & protection from bad writes → PITR only.
  • If twtech wants a balance of cost, RPO ~1h, region protection → PITR + hourly cross-region backups.
  • If compliance is the primary goal → Daily On-Demand backups with cross-region copy.

Sample DynamoDB DR Decision Tree: start with the RPO (Recovery Point Objective) and RTO (Recovery Time Objective) targets, then route to the matching AWS backup/replication setup.

Color-coded Quick Legend

  • 🔴 Global Tables → Max availability, near-zero data loss, high cost.
  • 🟡 PITR + Cross-Region Backup → Balanced cost, ~1h cross-region RPO, DR ready.
  • 🟢 On-Demand Backups → Low cost, compliance-friendly, slower recovery.
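
The decision tree can be approximated in a few lines of code. This is a simplified sketch: the thresholds (1 hour, 4 hours, 24 hours) are taken from the rows of the strategy matrix, while the function name and the exact branching are assumptions for illustration, not a definitive policy.

```python
from datetime import timedelta

HOUR = timedelta(hours=1)


def choose_strategy(rpo: timedelta, rto: timedelta, region_outage_in_scope: bool) -> str:
    """Map RPO/RTO targets to one of the strategies from the matrix."""
    if region_outage_in_scope:
        if rpo < HOUR or rto < 4 * HOUR:
            # Backup-and-restore cannot meet sub-hour RPO or short RTO across regions.
            return "Global Tables"
        if rpo < 24 * HOUR:
            return "PITR + hourly on-demand backups + cross-region copy"
        return "Daily on-demand backups + cross-region copy"
    if rpo < 24 * HOUR:
        return "PITR (same region)"
    return "Daily on-demand backups"
```

For instance, a target of a 1 hour RPO and 4 hour RTO with region protection routes to the PITR plus hourly cross-region backup option, matching the 🟡 row of the matrix.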

Visual Diagram (Flow Style)

Insights PITR (Point-in-Time Recovery)

PITR is a recovery technique used primarily with databases.

  • Restoring data to a specific past state:
    • PITR allows administrators to restore or recover a database (or other system) to the exact state it was in at a particular point in time.
  • Protection against data loss or corruption:
    • This is particularly valuable for accidental deletion of data, unintended writes that corrupt the database, or application rollouts that cause issues.
  • Summary of how Point-in-Time Recovery (PITR) works:
    • PITR typically relies on a combination of regular full backups and continuous logging of all changes made to the database (often called transaction logs or write-ahead logs, WAL).
    • A full backup provides a snapshot of the database at a specific moment.
    • Transaction logs record every change made after the full backup.
    • To restore to a specific point in time, the system first restores the full backup and then applies the relevant changes from the transaction logs up to the desired point in time.
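
The "full backup plus log replay" idea can be shown with a toy in-memory model. This is a generic illustration of the technique, not a description of DynamoDB internals; the log entry layout `(timestamp, op, key, value)` is an assumption chosen for the example.

```python
def restore_to(snapshot: dict, snapshot_time: int, log: list, target_time: int) -> dict:
    """Rebuild state at target_time from a full snapshot plus a change log.

    log entries are (timestamp, op, key, value) with op in {"put", "delete"}.
    """
    if target_time < snapshot_time:
        raise ValueError("target_time is older than the snapshot; use an earlier snapshot")
    state = dict(snapshot)  # start from the full backup
    for ts, op, key, value in sorted(log, key=lambda entry: entry[0]):
        if ts <= snapshot_time:
            continue  # already reflected in the snapshot
        if ts > target_time:
            break     # stop replaying at the requested point in time
        if op == "put":
            state[key] = value
        elif op == "delete":
            state.pop(key, None)
    return state
```

Restoring to a time just before a bad write means the log replay stops before that entry, which is how PITR undoes an accidental delete or overwrite.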











