AWS Database Migration Service (DMS) Continuous Replication – Overview.
Focus:
Tailored for DevOps, Cloud, SRE, Platform, Infra, and DevSecOps engineering.
Scope:
- DMS Architecture
- Continuous Replication Internals
- DMS Continuous Replication – Failure Modes
- Latency Analysis & Performance Internals
- DMS Replication & Cutover Strategies (Pro-Level)
- Resiliency Patterns for Continuous Replication
- Operational Best Practices (Enterprise-Ready)
- Expert-Level Tuning Profiles
- Common Pitfalls You MUST Avoid
Intro:
- AWS Database Migration Service (AWS DMS) fully supports continuous data replication through a process called change data capture (CDC).
- The CDC feature allows twtech to keep source and target databases synchronized during a migration with minimal downtime.
- CDC also allows twtech to maintain ongoing synchronization for use cases such as disaster recovery and development/test environment refresh.
- AWS Database Migration Service (DMS) supports:
- One-time (full-load) migrations
- Full-load + CDC (Change Data Capture)
- Continuous replication (ongoing CDC stream)
NB:
Continuous replication turns DMS into a near-real-time database replication engine, enabling:
- Ongoing sync between on-prem → AWS
- Cloud-to-cloud migrations
- HA/DR architectures
- Blue/green cutovers
- Multi-region read replicas
- Zero-downtime migrations
1. DMS Architecture – Core Components
1. Source Endpoint
- RDS, Aurora, Oracle, SQL Server, MySQL, PostgreSQL
- On-prem DBs over VPN/DX
- Supports log readers (redo/transaction log/Write Ahead Log)
2. Replication Instance
- The DMS engine
- EC2-based
- Handles:
- CDC extraction
- Change buffering
- LOB chunking
- Transformation rules
- Batch commits to target
3. Target Endpoint
- RDS/Aurora
- DynamoDB
- S3 (Parquet/CSV)
- Redshift
- OpenSearch
- Kafka
4. Task Engine
- Orchestrates full-load, CDC, and validation phases
- Maintains checkpoint state in target schema
- Self-recovery logic on disconnects/restarts
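The four components come together when a replication task is created. A minimal boto3-style sketch of assembling the `create_replication_task` arguments; the ARNs and the `app` schema name are placeholders, not values from this article:

```python
import json

def build_task_args(source_arn: str, target_arn: str, instance_arn: str) -> dict:
    """Assemble arguments for boto3 dms.create_replication_task().
    The ARNs are placeholders; real ones come from your account."""
    table_mappings = {
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-app-schema",
            # replicate every table in the (hypothetical) "app" schema
            "object-locator": {"schema-name": "app", "table-name": "%"},
            "rule-action": "include",
        }]
    }
    return {
        "ReplicationTaskIdentifier": "app-full-load-and-cdc",
        "SourceEndpointArn": source_arn,
        "TargetEndpointArn": target_arn,
        "ReplicationInstanceArn": instance_arn,
        "MigrationType": "full-load-and-cdc",  # full load, then continuous CDC
        "TableMappings": json.dumps(table_mappings),
    }
```

Passing the result to `dms.create_replication_task(**args)` wires source endpoint, target endpoint, and replication instance into one task.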
2. Continuous Replication Internals
CDC Source-Level Behavior (depends on the engine):
| Engine | CDC Mechanism |
|---|---|
| Oracle | Redo logs; optionally LogMiner; Binary Reader for high throughput |
| SQL Server | Transaction log (log reader) |
| MySQL | Binlog |
| PostgreSQL | Logical decoding + replication slots |
| SAP ASE | Replication Server log APIs |
| MongoDB | Oplog |
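Each of these mechanisms has source-side prerequisites that must be in place before DMS can read the logs. A hypothetical checklist helper (the listed settings are the standard engine prerequisites for DMS CDC; the function and dict names are illustrative):

```python
# Source-side settings DMS CDC needs per engine (not exhaustive).
CDC_PREREQS = {
    "mysql": ["binlog_format = ROW",
              "binlog retention long enough for DMS to catch up"],
    "postgres": ["wal_level = logical",
                 "a free logical replication slot"],
    "oracle": ["ARCHIVELOG mode",
               "supplemental logging enabled"],
    "sqlserver": ["FULL (or BULK_LOGGED) recovery model",
                  "MS-Replication or MS-CDC enabled"],
}

def cdc_prerequisites(engine: str) -> list:
    """Return the CDC prerequisite checklist for a source engine."""
    try:
        return CDC_PREREQS[engine.lower()]
    except KeyError:
        raise ValueError(f"no prerequisite checklist for engine {engine!r}")
```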
NB:
- DMS reads only committed transactions, preserving transaction order at table level (not always globally across all tables).
CDC Pipeline Flow (Deep Internal Sequence)
1. Source Log → DMS Cache
- Reads native logs
- Maps operations (INSERT/UPDATE/DELETE)
- Extracts PKs, timestamps, SCNs, LSNs, or WAL positions
2. Cache → Mapping Rules Engine
- Filters tables
- Applies column transformations
- Applies type conversions
- Rewrites default values
- Can enforce uppercase/lowercase naming schemes
3. Commit Grouping / Transaction Assembly
- Groups related row changes
- Ensures partial transaction correctness
4. Throttle Control & Flow Control
- Auto-throttles to avoid target overload
- Buffers events in memory or to disk
- Handles spikes in source log volume
5. Target Apply Engine
- Applies changes in commit order
- Uses batched SQL apply
- Writes checkpoint positions to DMS schema
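Steps 3–5 above can be sketched as a toy transaction-assembly pass: group row changes by transaction, then emit whole transactions in commit order. This is a simplified model for intuition, not DMS's actual implementation:

```python
from collections import defaultdict

def assemble_transactions(events):
    """Group row-change events by transaction id and emit whole transactions
    in commit order, mimicking DMS's assembly before the target apply phase.
    Each event is a tuple: (txn_id, log_position, operation)."""
    txns = defaultdict(list)
    commit_pos = {}
    for txn_id, pos, op in events:
        txns[txn_id].append(op)
        # a transaction's commit position is its highest log position
        commit_pos[txn_id] = max(pos, commit_pos.get(txn_id, 0))
    # apply whole transactions ordered by commit position
    return [(t, txns[t]) for t in sorted(txns, key=commit_pos.get)]
```

Interleaved events from two transactions come out as two complete, commit-ordered units, which is what lets the apply engine preserve per-table transaction order.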
3. DMS Continuous Replication – Failure Modes
Category A: Source-Related Failures
- Log retention too short → DMS falls behind
- WAL/binary log disabled or rotated early
- Oracle archive logs purged prematurely
- Schema drift (ALTER TABLE) not replicated
- Heavy write burst causing backlog
Category B: Replication Instance Failures
- Instance class too small (CPU pegged, memory swap)
- Disk spills (cache → EBS) slowing down replication
- Network path saturation (on-prem → cloud)
- Security groups or routing changes blocking connectivity
Category C: Target Endpoint Failures
- Deadlocks or lock waits in target
- FK/PK mismatches
- Unsupported data types
- Large LOB size mismatch
- Write rate exceeding target IOPS
Category D: Task-Level Failures
- Metadata divergence
- Checkpoint corruption
- Inconsistent table mapping rules
- Transformation rule errors
- DMS version bugs (frequent root cause)
4. Latency Analysis & Performance Internals
Latency Contributors
- Log extraction delay
- Network latency
- Replication instance CPU
- Commit batching size
- Target apply speed
- Transformation overhead
- Large transactions (e.g., bulk DELETE)
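Several of these contributors reduce to one question: does the target apply changes faster than the source produces them? A back-of-the-envelope drain-time model (illustrative only; real backlogs are burstier than a constant rate):

```python
def cdc_drain_seconds(backlog_events: int,
                      source_rate: float,
                      apply_rate: float) -> float:
    """Estimate seconds needed to drain a CDC backlog.
    Rough model: the backlog only shrinks when the target apply rate
    exceeds the source change rate (all rates in events/second)."""
    if apply_rate <= source_rate:
        return float("inf")  # backlog grows; replication never catches up
    return backlog_events / (apply_rate - source_rate)
```

For example, a 10,000-event backlog with the source writing 500 events/s and the target applying 1,500 events/s drains in about 10 seconds; if the apply rate ever drops to the source rate, latency grows without bound.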
How to Optimize
- Use CDC-only tasks (avoid full-load overhead)
- Enable Multi-AZ for resiliency
- Move replication instance into same AZ as source when possible
- Use parallel load and batch apply where supported
- Use larger replication instance classes for high-WAL-throughput DBs
- Avoid LOB full-mode unless absolutely needed
- Keep PKs/indexes lean during migration
5. DMS Replication & Cutover Strategies (Pro-Level)
1. Classic Zero-Downtime Migration
1. Full load
2. Enable ongoing replication
3. Let CDC catch up
4. Freeze application writes
5. Wait 0–5 seconds for CDC to drain
6. Promote target
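Step 5 (waiting for CDC to drain) is worth automating rather than eyeballing. A sketch that polls a latency reading, e.g. the `CDCLatencyTarget` CloudWatch metric, before promotion; the metric getter and sleep function are injected so the sketch stays testable offline:

```python
import time

def wait_for_cdc_drain(get_latency, threshold_s=1.0, timeout_s=300,
                       poll_fn=time.sleep, interval_s=5):
    """Poll CDC latency (e.g. the CDCLatencyTarget metric for the task)
    until it drops below threshold_s; only then is it safe to promote
    the target. Returns False if the backlog never drains in time."""
    waited = 0
    while waited <= timeout_s:
        if get_latency() <= threshold_s:
            return True
        poll_fn(interval_s)
        waited += interval_s
    return False
```

In production, `get_latency` would wrap a `cloudwatch.get_metric_statistics` call for the task; here it is any zero-argument callable.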
2. Blue/Green Deployment
- Two DBs kept continuously in sync
- Flip application by RDS proxy or route update
- Reversible if failure occurs
3. DR / Multi-Region Replication
- Region A → Region B sync
- Latency typically 1–5 seconds depending on traffic
- Useful for compliance or low-RTO failover
4. Cloud-to-Cloud Incremental Sync
Sample: GCP Cloud SQL → AWS Aurora
- DMS extracts logs from GCP
- Applies changes continuously
- Final cutover is seconds
6. Resiliency Patterns for Continuous Replication
A. Multi-AZ DMS Replication Instance
- Automated failover
- Continuous replication resumes in seconds
- Helps when network or AZ degrades
B. Multi-Task Parallelization
Split tasks by:
- schema
- table
- write intensity
- table size
- PK distribution
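Splitting by schema can be expressed directly in table mappings, one mapping document per task. A minimal sketch (the schema names passed in are placeholders):

```python
import json

def mappings_per_task(schemas):
    """Build one DMS table-mapping document per schema, so each schema
    runs in its own task and can be sized and tuned independently."""
    docs = []
    for i, schema in enumerate(schemas, start=1):
        docs.append(json.dumps({
            "rules": [{
                "rule-type": "selection",
                "rule-id": str(i),
                "rule-name": f"include-{schema}",
                "object-locator": {"schema-name": schema, "table-name": "%"},
                "rule-action": "include",
            }]
        }))
    return docs
```

Splitting by table, size, or PK range follows the same pattern with narrower `object-locator` values.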
C. Auto-Recovery Mechanisms
- DMS checkpoints ensure resume from last committed LSN
- Retry loops with exponential backoff
- Automatic cache-to-disk spillover during bursts
D. Monitoring & Alerting
Watch:
- CDC latency
- Source log growth
- Replication instance CPU
- Memory swap
- Target apply throughput
- Disk queue depth
- Network throughput
- Error rates in CloudWatch metrics
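Watching CDC latency can be codified as a CloudWatch alarm. A sketch of `put_metric_alarm` arguments for the `CDCLatencyTarget` task metric; the task identifier and threshold are placeholders, and dimension names should be verified against the metrics your tasks actually emit:

```python
def latency_alarm_args(task_id: str, threshold_s: float = 60.0) -> dict:
    """Arguments for boto3 cloudwatch.put_metric_alarm() on DMS CDC latency.
    Fires when target apply latency stays above threshold_s for 5 minutes."""
    return {
        "AlarmName": f"dms-{task_id}-cdc-latency",
        "Namespace": "AWS/DMS",
        "MetricName": "CDCLatencyTarget",   # seconds behind on the target side
        "Dimensions": [{"Name": "ReplicationTaskIdentifier", "Value": task_id}],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 5,
        "Threshold": threshold_s,
        "ComparisonOperator": "GreaterThanThreshold",
    }
```

A sibling alarm on `CDCLatencySource` distinguishes "can't read the source fast enough" from "can't apply to the target fast enough".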
7. Operational Best Practices (Enterprise-Ready)
Before Starting Replication
- Ensure PKs exist on all tables
- Enable log retention (WAL/binary logs)
- Validate character set mappings
- Disable triggers & heavy constraints during migration
- Pre-size storage (target & DMS instance)
During Continuous Replication
- Ensure the replication instance has CPU/memory headroom to absorb write bursts
- Monitor backlog growth
- Avoid schema changes unless planned
- Keep DMS updated to latest engine version
- Use custom parameter groups for massive workloads
- Enable validation mode after CDC catches up
During Cutover
- Freeze write traffic
- Validate row counts
- Drain CDC backlog
- Validate referential integrity
- Validate sequences/identity columns
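Row-count validation can be a simple diff of per-table counts collected on both sides once the backlog has drained. A minimal helper, assuming the counts have already been gathered (e.g. via `SELECT COUNT(*)` per table):

```python
def row_count_diffs(source_counts: dict, target_counts: dict) -> dict:
    """Compare per-table row counts from source and target.
    Returns {table: (source, target)} for every mismatch; an empty
    result is a necessary (not sufficient) signal that cutover is safe."""
    tables = set(source_counts) | set(target_counts)
    return {
        t: (source_counts.get(t, 0), target_counts.get(t, 0))
        for t in tables
        if source_counts.get(t, 0) != target_counts.get(t, 0)
    }
```

Counts alone do not prove content equality; pair this with DMS validation mode or checksum-based comparison for referential integrity.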
Post-Migration
- Turn off DMS tasks
- Restore log retention to normal levels
- Re-enable triggers and constraints
- Run application-level load & integrity tests
- Archive DMS logs (CloudWatch/S3)
8. Expert-Level Tuning Profiles
High-Volume CDC (>50k TPS)
- Use r6i.8xlarge or larger replication instance
- Avoid transformations
- Use optimized WAL retention (Postgres)
- Increase commit batching on target
LOB-Heavy Workloads
- Use LOB “limited” mode
- Preload large objects
- Increase replication instance storage for spillover
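Both tuning profiles above map onto DMS task settings. A sketch producing a settings fragment that enables batch apply and limited-size LOB mode; the numeric values are starting points to tune, not recommendations for every workload:

```python
import json

def high_volume_task_settings() -> str:
    """DMS task-settings fragment for high-volume CDC with LOB columns:
    batched apply on the target plus limited-size LOB mode."""
    settings = {
        "TargetMetadata": {
            "BatchApplyEnabled": True,      # group changes into batched SQL
            "SupportLobs": True,
            "FullLobMode": False,           # avoid full-mode unless required
            "LimitedSizeLobMode": True,
            "LobMaxSize": 32,               # KB; LOBs above this are truncated
        },
        "ChangeProcessingTuning": {
            "BatchApplyTimeoutMin": 1,      # seconds between batch applies
            "BatchApplyTimeoutMax": 30,
            "BatchSplitSize": 0,            # 0 = no forced batch splitting
        },
    }
    return json.dumps(settings)
```

The JSON string is what gets passed as `ReplicationTaskSettings` when creating or modifying the task.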
On-Prem Connectivity
- Use Direct Connect
- Avoid NAT gateways
- Keep replication instance close to DX router
- Tune TCP window sizes
9. Common Pitfalls You MUST Avoid
❌ Not
enabling WAL/binlog retention
❌ Migrating
without PKs
❌ Using
small replication instances
❌ Leaving
transformations on for heavy workloads
❌ Allowing
schema drift during replication
❌ Applying DDL changes without updating DMS
tasks