A deep dive on DynamoDB backups for disaster recovery (DR), covering:
- Architecture
- Recovery Point Objectives (RPO)
- Recovery Time Objectives (RTO)
- Costs
- Operational considerations for DevOps, DevSecOps, cloud, and DR engineers
1. DynamoDB Backup Types
DynamoDB supports two main backup and restore mechanisms, each with different DR characteristics.

| Feature | Point-in-Time Recovery (PITR) | On-Demand Backup |
|---|---|---|
| Purpose | Continuous protection against accidental deletes/writes | Compliance, archival, cloning tables |
| Granularity | Any second in the last 35 days | Snapshot at request time |
| Retention | Rolling 35 days | Indefinite (until deleted) |
| RPO | ~1 second granularity (the latest restorable time typically trails the current time by about 5 minutes) | As of backup creation |
| RTO | Minutes to hours (table size dependent) | Minutes to hours |
| Cost | Charged per GB-month, based on the size of the table | Charged for full snapshot storage |
| Best For | Operational recovery from logical errors | Long-term DR, migrations, compliance retention |
2. Point-in-Time Recovery (PITR) Deep Dive
PITR keeps continuous backups of table changes behind the scenes (conceptually similar to a DynamoDB Streams change log), which enables recovery to any second in the past 35 days.
- Enablement: per table (not enabled account-wide by default).
- Use Cases:
  - Accidental DELETE or PUT overwrites.
  - Bad batch job/data corruption.
- How the restore process works:
  1. Choose a timestamp within the last 35 days.
  2. AWS creates a new table with that point’s data.
  3. Swap traffic over once the new table is validated.
- DR Characteristics:
  - RPO: ~1 second restore granularity; in practice the latest restorable time typically trails the current time by about five minutes.
  - RTO: Depends on table size and on how quickly data loads into the new table.
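The enable and restore steps above can be sketched with the DynamoDB API calls `UpdateContinuousBackups` and `RestoreTableToPointInTime`. This is a minimal sketch, not a definitive implementation: the helper names, the table names, and the client-side window check are assumptions for illustration, and `client` is expected to be something like `boto3.client("dynamodb")`.

```python
from datetime import datetime, timedelta, timezone

# Maximum PITR window; the real earliest restorable time can be later
# (for example if PITR was enabled only recently).
PITR_WINDOW = timedelta(days=35)


def enable_pitr(client, table_name):
    """Turn on point-in-time recovery for a single table."""
    return client.update_continuous_backups(
        TableName=table_name,
        PointInTimeRecoverySpecification={"PointInTimeRecoveryEnabled": True},
    )


def restore_to_point(client, source_table, target_table, restore_time, now=None):
    """Restore source_table as of restore_time into a NEW table.

    restore_time must be a timezone-aware datetime. The check below is only
    a coarse client-side guard; the service enforces the real limits.
    """
    now = now or datetime.now(timezone.utc)
    if restore_time > now:
        raise ValueError("restore_time is in the future")
    if now - restore_time > PITR_WINDOW:
        raise ValueError("restore_time is older than the 35-day PITR window")
    return client.restore_table_to_point_in_time(
        SourceTableName=source_table,
        TargetTableName=target_table,
        RestoreDateTime=restore_time,
    )
```

A typical call would pass a timestamp just before the bad write, for example `restore_to_point(client, "orders", "orders-restored", bad_write_time - timedelta(seconds=1))`, and then validate the new table before swapping traffic.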
3. On-Demand Backups
An on-demand backup is a complete snapshot stored in DynamoDB’s backend storage layer.
- Use Cases:
  - Regulatory requirements.
  - Monthly/quarterly archival.
  - Pre-deployment “safety net.”
- Behavior:
  - Backups run without impacting read/write performance.
  - Restore is always to a new table.
  - Can be copied cross-account and cross-region when managed through AWS Backup (more on this below).
- Cost Considerations:
  - Backup storage is billed separately from table storage.
  - Restores are billed per GB of data restored.
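As an illustration of the on-demand flow described above, the sketch below wraps the DynamoDB `CreateBackup` and `RestoreTableFromBackup` calls. The helper names and the `dr-<table>-<timestamp>` naming scheme are assumptions made for this example; `client` is expected to be something like `boto3.client("dynamodb")`.

```python
import re
from datetime import datetime, timezone

# Backup names allow letters, digits, '_', '.', '-' and 3 to 255 characters.
_NAME_OK = re.compile(r"^[a-zA-Z0-9_.-]{3,255}$")


def backup_name(table_name, when=None, prefix="dr"):
    """Build a timestamped backup name (hypothetical naming scheme)."""
    when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    name = f"{prefix}-{table_name}-{when:%Y%m%dT%H%M%SZ}"
    if not _NAME_OK.match(name):
        raise ValueError(f"invalid backup name: {name!r}")
    return name


def create_backup(client, table_name, when=None):
    """Request an on-demand backup and return its ARN."""
    resp = client.create_backup(
        TableName=table_name,
        BackupName=backup_name(table_name, when),
    )
    return resp["BackupDetails"]["BackupArn"]


def restore_backup(client, backup_arn, target_table):
    """Restore a backup into a NEW table (restores never overwrite)."""
    return client.restore_table_from_backup(
        TargetTableName=target_table,
        BackupArn=backup_arn,
    )
```

Putting the timestamp in the name makes it easy to find the pre-deployment or monthly backup by eye; the backup's own creation time remains the authoritative value.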
4. Cross-Region & Cross-Account DR
If twtech’s DR plan includes region failure scenarios, PITR alone isn’t enough: PITR data stays in-region.
To protect against regional outages:
Option A – Backup Copy
- Process:
  - Create an on-demand backup in Region A (through AWS Backup, so that it can be copied).
  - Copy it to Region B with AWS Backup cross-region copy (the StartCopyJob API).
- Pros: Data survives a region outage, with loss limited to changes made since the last copied backup.
- Cons: Increased RPO (depends on backup frequency).
Option B – Global Tables
- Process:
  - Set up DynamoDB Global Tables for active-active replication.
- Pros: RPO near zero (replication is asynchronous, typically within about a second); no restore needed for failover.
- Cons: More expensive; not strictly “backup,” but a replication strategy, so bad writes replicate too.
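For Option A, a minimal sketch of the copy step, assuming the backups are managed by AWS Backup and copied with its `StartCopyJob` API. The helper names, vault names, account ID, and role ARN are placeholders for illustration; `backup_client` is expected to be something like `boto3.client("backup")` in the source region.

```python
def vault_arn(region, account_id, vault_name, partition="aws"):
    """Build the ARN of a backup vault in the destination region."""
    return f"arn:{partition}:backup:{region}:{account_id}:backup-vault:{vault_name}"


def copy_to_dr_region(backup_client, recovery_point_arn, source_vault,
                      dest_region, account_id, dest_vault, role_arn):
    """Start a cross-region copy of one recovery point; return the copy job ID."""
    resp = backup_client.start_copy_job(
        RecoveryPointArn=recovery_point_arn,
        SourceBackupVaultName=source_vault,
        DestinationBackupVaultArn=vault_arn(dest_region, account_id, dest_vault),
        IamRoleArn=role_arn,
    )
    return resp["CopyJobId"]
```

The copy runs asynchronously, so the returned job ID is what monitoring would track until the recovery point is usable in Region B.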
5. Disaster Recovery Patterns
| DR Strategy | RPO | RTO | Cost | Notes |
|---|---|---|---|---|
| In-Region PITR | Seconds | Hours | Low-Med | Covers logical corruption; not region outage |
| Scheduled On-Demand + Cross-Region Copy | Hours (based on schedule) | Hours | Med-High | Protects from region outage |
| Global Tables | ~0 | Minutes | High | Failover without restore |
| Hybrid | Seconds–Hours | Minutes–Hours | High | PITR + periodic region copy |
6. Operational Considerations
- Automation:
  - Use EventBridge to trigger periodic on-demand backups.
  - Use Lambda to copy to the DR region.
- Monitoring:
  - Backup completion events (for example, AWS Backup “Backup Job State Change” events delivered through EventBridge).
  - Alarms on PITR enabled status.
- Testing:
  - Periodically restore backups to a staging environment.
  - Validate data integrity and application compatibility.
- Security:
  - Encrypt backups with KMS customer managed keys (CMKs).
  - Enforce IAM least privilege for backup/restore APIs.
- Large Table Restores:
  - Restores run in parallel across partitions, so restore time depends on table size and the number of underlying partitions.
  - Restores are bulk load operations, not live streaming.
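The "PITR enabled status" check above could be driven by a small audit like the following. It is a sketch that reads the `DescribeContinuousBackups` response; the function names are hypothetical, and how the result is turned into an alarm (a CloudWatch metric, an SNS message, and so on) is left open.

```python
def pitr_enabled(client, table_name):
    """Return True if point-in-time recovery is ENABLED for the table."""
    resp = client.describe_continuous_backups(TableName=table_name)
    desc = resp.get("ContinuousBackupsDescription", {})
    pitr = desc.get("PointInTimeRecoveryDescription", {})
    return pitr.get("PointInTimeRecoveryStatus") == "ENABLED"


def tables_without_pitr(client, table_names):
    """List the tables that should raise an alert because PITR is off."""
    return [name for name in table_names if not pitr_enabled(client, name)]
```

Running this on a schedule and alerting on a non-empty result catches tables that were created without PITR or had it switched off.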
7. Example DR Flow
Scenario: Regional outage in us-east-2
Goal: Recover in us-west-2 with < 4h RTO and about 1h RPO (one backup interval plus copy time).
- Every hour, an on-demand backup is created in us-east-2.
- Each backup is immediately copied to us-west-2.
- During outage:
  - Trigger a restore from the latest backup in us-west-2 to a new table.
  - Repoint the app to the new table’s regional endpoint after functional tests.
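Whether an hourly schedule actually meets a given RPO depends on how long each backup takes to become usable in the DR region. The arithmetic can be sketched as follows; the backup and copy durations used in the example are made-up numbers, not measurements.

```python
from datetime import timedelta


def worst_case_rpo(backup_interval, backup_duration, copy_duration):
    """Worst-case data loss for scheduled backups copied to a DR region.

    If the outage hits just before the newest copy finishes, the latest
    usable backup is one full interval older than that copy's data.
    """
    return backup_interval + backup_duration + copy_duration


def max_backup_interval(target_rpo, backup_duration, copy_duration):
    """Longest backup interval that still meets target_rpo."""
    lag = backup_duration + copy_duration
    if lag >= target_rpo:
        raise ValueError("backup + copy time alone exceeds the target RPO")
    return target_rpo - lag


# Illustrative numbers only: hourly backups, 2 min to back up, 8 min to copy.
rpo = worst_case_rpo(timedelta(hours=1), timedelta(minutes=2), timedelta(minutes=8))
interval = max_backup_interval(timedelta(hours=1), timedelta(minutes=2), timedelta(minutes=8))
```

With these assumed durations the worst case is 70 minutes, so a strict one-hour RPO would need backups roughly every 50 minutes rather than every hour.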
8. Costs to Keep in Mind
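Approximate list prices; rates vary by region, so check current pricing.
- PITR: ~$0.20 per GB-month, billed on the size of the table.
- On-Demand: ~$0.10 per GB-month of backup storage.
- Cross-Region Copy: Additional storage in the destination region plus data transfer.
- Restores: ~$0.15 per GB restored.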
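A rough estimator built on the approximate rates above shows how retention drives the bill. It is a sketch under simplifying assumptions: every retained on-demand backup is billed at the full table size, and cross-region storage and transfer are left out.

```python
# Approximate rates quoted above (USD); real prices vary by region.
PITR_PER_GB_MONTH = 0.20
ON_DEMAND_PER_GB_MONTH = 0.10
RESTORE_PER_GB = 0.15


def monthly_backup_cost(table_gb, retained_backups, pitr=True):
    """Estimate monthly backup storage cost for one table.

    Assumes each retained on-demand backup is as large as the table and
    ignores cross-region copy storage and transfer.
    """
    pitr_cost = table_gb * PITR_PER_GB_MONTH if pitr else 0.0
    on_demand_cost = table_gb * retained_backups * ON_DEMAND_PER_GB_MONTH
    return pitr_cost + on_demand_cost


def restore_cost(table_gb):
    """Estimate the one-off cost of restoring a table of table_gb."""
    return table_gb * RESTORE_PER_GB
```

For a 100 GB table with PITR and 24 retained backups this comes to about 260 USD per month, most of it from the retained snapshots, which is why pruning old hourly backups matters.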
The DynamoDB disaster recovery strategy matrix below lets you plug in your target RPO (Recovery Point Objective) and RTO (Recovery Time Objective) and immediately see the matching AWS backup features, automation setup, and trade-offs.
DynamoDB DR Strategy Matrix
| Target RPO | Target RTO | Recommended AWS Features | Automation Setup | Cost Level | Pros | Cons | Best For |
|---|---|---|---|---|---|---|---|
| ≤ 1 second | ≤ 15 min | Global Tables | Create active-active Global Table across regions. | 🔴 High | Zero restore time; instant failover; protects from region outage. | High ongoing cost; complex conflict resolution logic. | Mission-critical, 24/7 low-latency apps. |
| ≤ 1 second | Hours | PITR (Point-in-Time Recovery) (same region) | Enable PITR per table. | 🟢 Low-Med | Covers accidental deletes/writes; granular restore. | No region outage protection; restore time depends on table size. | In-region logical corruption recovery. |
| ≤ 1 hour | ≤ 4 hours | PITR + Hourly On-Demand Backup + Cross-Region Copy | PITR for in-region safety. | 🟡 Medium | Combines near-real-time local restore + regional DR. | Higher storage costs; hourly RPO may not meet sub-hour needs. | Balanced cost & DR coverage. |
| ≤ 1 hour | > 4 hours | On-Demand Backups + Cross-Region Copy (every 1h) | EventBridge to schedule backups. | 🟡 Medium | Simple automation; meets compliance needs. | Slower restore; 1h data loss possible. | DR compliance & cost balance. |
| ≤ 24 hours | > 4 hours | Daily On-Demand Backups (Cross-Region if needed) | Daily scheduled backups. | 🟢 Low | Cheapest; meets regulatory archiving. | High potential data loss; slow recovery. | Non-critical, audit-focused data. |
| Custom | Custom | Hybrid (PITR + periodic backups + optional global tables) | Mix features based on business unit criticality. | Variable | Tailored to workload. | More complex ops. | Multi-tier DR planning. |
For the “PITR + Cross-Region On-Demand” setup:
- PITR Enabled → Continuous local protection.
- EventBridge Schedule (every X hours):
  - Trigger Lambda → create the on-demand backup (CreateBackup, or StartBackupJob when AWS Backup manages it).
  - Lambda → copy the backup to the DR region (AWS Backup StartCopyJob).
- Backup Monitoring:
  - EventBridge event on backup or copy job completion → SNS alerts.
- Disaster Event:
  - Restore from the latest DR region backup → New table.
  - Switch endpoints via Route 53 or a config update.
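The "restore from latest DR region backup" step needs a rule for choosing the backup. One possible rule, sketched below, takes entries shaped like the `RecoveryPoints` returned by AWS Backup's `ListRecoveryPointsByBackupVault` and keeps the newest completed one. The function name is hypothetical, and treating `CreationDate` as a proxy for how recent the data is remains an assumption worth verifying for copied recovery points.

```python
from datetime import datetime, timezone


def latest_completed(recovery_points, resource_arn=None):
    """Pick the newest COMPLETED recovery point, optionally for one table.

    recovery_points: dicts with at least 'Status' and 'CreationDate'
    (and 'ResourceArn' if resource_arn filtering is used).
    Returns None when nothing usable exists.
    """
    candidates = [
        rp for rp in recovery_points
        if rp.get("Status") == "COMPLETED"
        and (resource_arn is None or rp.get("ResourceArn") == resource_arn)
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda rp: rp["CreationDate"])


# Example input with made-up ARNs and timestamps.
points = [
    {"RecoveryPointArn": "rp-1", "Status": "COMPLETED",
     "CreationDate": datetime(2024, 6, 1, 10, tzinfo=timezone.utc)},
    {"RecoveryPointArn": "rp-2", "Status": "COMPLETED",
     "CreationDate": datetime(2024, 6, 1, 11, tzinfo=timezone.utc)},
    {"RecoveryPointArn": "rp-3", "Status": "PARTIAL",
     "CreationDate": datetime(2024, 6, 1, 12, tzinfo=timezone.utc)},
]
chosen = latest_completed(points)
```

Here `chosen` is the 11:00 recovery point: the 12:00 one is newer but never completed, so restoring from it would be unsafe.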
Quick Selection Guide
- If twtech wants near-zero downtime and near-zero data loss → Global Tables.
- If twtech wants minimal cost and protection from bad writes → PITR only.
- If twtech wants a balance of cost, RPO ~1h, and region protection → PITR + hourly cross-region backups.
- If compliance is the primary goal → Daily On-Demand backups with cross-region copy.
Here’s twtech’s sample DynamoDB DR decision tree. It starts with the target RPO (Recovery Point Objective) and RTO (Recovery Time Objective), then routes to the matching AWS backup/replication setup.
DynamoDB DR Decision Tree
START: twtech target RPO & RTO
│
├── RPO ≤ 1 second?
│   │
│   ├── RTO ≤ 15 min → Use GLOBAL TABLES
│   │     - Multi-region, active-active.
│   │     - Route 53 failover.
│   │     - Cost: High.
│   │
│   └── RTO > 15 min → Use PITR (Point-in-Time Recovery)
│         - In-region restore.
│         - No regional outage protection.
│         - Cost: Low-Med.
│
├── RPO ≤ 1 hour?
│   │
│   ├── RTO ≤ 4 hours → Use PITR + HOURLY ON-DEMAND BACKUPS + CROSS-REGION COPY
│   │     - Combines near-real-time local recovery + DR region protection.
│   │     - Cost: Medium.
│   │
│   └── RTO > 4 hours → Use HOURLY ON-DEMAND BACKUPS + CROSS-REGION COPY
│         - Cost: Medium.
│         - Simpler, but longer restore time.
│
└── RPO ≤ 24 hours?
    │
    ├── RTO ≤ 4 hours → Use DAILY ON-DEMAND BACKUPS + CROSS-REGION COPY
    │     - Meets compliance + regional DR.
    │     - Cost: Low.
    │
    └── RTO > 4 hours → DAILY ON-DEMAND BACKUPS (optional cross-region copy)
          - Lowest cost.
          - Longest RPO/RTO.
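The same routing can be written as a small function, which is handy for tagging tables with a DR tier in tooling. This sketch mirrors the tree's thresholds; the function name is hypothetical, and the fallback for an RPO above 24 hours is an assumption, since the tree itself stops there.

```python
from datetime import timedelta


def recommend_dr_setup(rpo, rto):
    """Map target RPO/RTO (timedeltas) to the setup named in the tree."""
    if rpo <= timedelta(seconds=1):
        if rto <= timedelta(minutes=15):
            return "Global Tables"
        return "PITR"
    if rpo <= timedelta(hours=1):
        if rto <= timedelta(hours=4):
            return "PITR + hourly on-demand backups + cross-region copy"
        return "Hourly on-demand backups + cross-region copy"
    if rpo <= timedelta(hours=24):
        if rto <= timedelta(hours=4):
            return "Daily on-demand backups + cross-region copy"
        return "Daily on-demand backups (optional cross-region copy)"
    # Not covered by the tree; assumed to fall under a custom/hybrid plan.
    return "Custom / hybrid"
```

For example, `recommend_dr_setup(timedelta(minutes=30), timedelta(hours=2))` lands on the PITR plus hourly cross-region backup branch.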
Color-coded Quick Legend
- 🔴 Global Tables → Max availability, near-zero data loss, high cost.
- 🟡 PITR + Cross-Region Backup → Balanced cost, RPO of about an hour for region loss, DR ready.
- 🟢 On-Demand Backups → Low cost, compliance-friendly, slower recovery.
Insights: PITR
- Summary of how Point-in-Time Recovery (PITR) works in general:
  - PITR typically relies on a combination of regular full backups and continuous logging of all changes made to the database (often called transaction logs or write-ahead logs, WAL).
  - A full backup provides a snapshot of the database at a specific moment.
  - Transaction logs record every change made after the full backup.
  - To restore to a specific point in time, the system first restores the full backup and then applies the changes from the transaction logs up to the desired point in time.
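The snapshot-plus-log idea can be shown with a toy in-memory model. This is only an illustration of the general technique described above, not DynamoDB's internal implementation; the data layout (a snapshot dict and a time-ordered list of log tuples) is invented for the example.

```python
def restore_to(snapshot, log, target_ts):
    """Rebuild state as of target_ts from a full snapshot plus a change log.

    snapshot: {"ts": <snapshot time>, "items": {key: value}}
    log: list of (ts, op, key, value) tuples sorted by ts; op is "put" or "delete".
    """
    state = dict(snapshot["items"])      # start from the full backup
    for ts, op, key, value in log:
        if ts <= snapshot["ts"]:
            continue                     # already captured by the snapshot
        if ts > target_ts:
            break                        # past the requested point in time
        if op == "put":
            state[key] = value
        elif op == "delete":
            state.pop(key, None)
    return state


snapshot = {"ts": 100, "items": {"a": 1, "b": 2}}
log = [
    (90, "put", "a", 0),       # older than the snapshot, ignored
    (110, "put", "a", 5),
    (120, "delete", "b", None),
    (130, "put", "c", 9),
]
before_delete = restore_to(snapshot, log, 115)   # {"a": 5, "b": 2}
```

Restoring to timestamp 115 replays the update to `a` but stops before `b` is deleted, which is exactly the "recover from an accidental delete" case.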