AWS Database Migration Service (DMS) Continuous Replication – Overview.
Focus:
Tailored for DevOps, Cloud, SRE, Platform, Infra, and DevSecOps engineering.
Scope:
- DMS Architecture
- Continuous Replication Internals
- DMS Continuous Replication – Failure Modes
- Latency Analysis & Performance Internals
- DMS Replication & Cutover Strategies (Pro-Level)
- Resiliency Patterns for Continuous Replication
- Operational Best Practices (Enterprise-Ready)
- Expert-Level Tuning Profiles
- Common Pitfalls You MUST Avoid
Intro:
- AWS Database Migration Service (AWS DMS) fully supports continuous data replication through a process called change data capture (CDC).
- The CDC feature allows twtech to keep source and target databases synchronized during a migration with minimal downtime.
- CDC also allows twtech to maintain ongoing synchronization for use cases such as disaster recovery and development/test environment refresh.
- AWS Database Migration Service (DMS) supports:
- One-time (full-load) migrations
- Full-load + CDC (Change Data Capture)
- Continuous replication (ongoing CDC stream)
NB:
Continuous replication turns DMS into a near-real-time database replication engine, enabling:
- Ongoing sync between on-prem → AWS
- Cloud-to-cloud migrations
- HA/DR architectures
- Blue/green cutovers
- Multi-region read replicas
- Zero-downtime migrations
1. DMS Architecture – Core Components
1. Source Endpoint
- RDS, Aurora, Oracle, SQL Server, MySQL, PostgreSQL
- On-prem DBs over VPN/DX
- Supports log readers (redo/transaction log/Write Ahead Log)
2. Replication Instance
- The DMS engine
- EC2-based
- Handles:
- CDC extraction
- Change buffering
- LOB chunking
- Transformation rules
- Batch commits to target
3. Target Endpoint
- RDS/Aurora
- DynamoDB
- S3 (Parquet/CSV)
- Redshift
- OpenSearch
- Kafka
4. Task Engine
- Orchestrates full-load, CDC, and validation phases
- Maintains checkpoint state in target schema
- Self-recovery logic on disconnects/restarts
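The four components come together when a replication task is created. A minimal boto3-style sketch of assembling the `create_replication_task` arguments; the ARNs and the `app` schema name are placeholders, not values from this article:

```python
import json

def build_task_args(source_arn: str, target_arn: str, instance_arn: str) -> dict:
    """Assemble arguments for boto3 dms.create_replication_task().
    The ARNs are placeholders; real ones come from your account."""
    table_mappings = {
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-app-schema",
            # replicate every table in the (hypothetical) "app" schema
            "object-locator": {"schema-name": "app", "table-name": "%"},
            "rule-action": "include",
        }]
    }
    return {
        "ReplicationTaskIdentifier": "app-full-load-and-cdc",
        "SourceEndpointArn": source_arn,
        "TargetEndpointArn": target_arn,
        "ReplicationInstanceArn": instance_arn,
        "MigrationType": "full-load-and-cdc",  # full load, then continuous CDC
        "TableMappings": json.dumps(table_mappings),
    }
```

Passing the result to `dms.create_replication_task(**args)` wires source endpoint, target endpoint, and replication instance into one task.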
2. Continuous Replication Internals
CDC Source-Level Behavior (depends on the engine):
| Engine | CDC Mechanism |
|---|---|
| Oracle | Redo logs; optionally LogMiner; Binary Reader for high throughput |
| SQL Server | Transaction log (log reader) |
| MySQL | Binlog |
| PostgreSQL | Logical decoding + replication slots |
| SAP ASE | Replication Server log APIs |
| MongoDB | Oplog |
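Each of these mechanisms has source-side prerequisites that must be in place before DMS can read the logs. A hypothetical checklist helper (the listed settings are the standard engine prerequisites for DMS CDC; the function and dict names are illustrative):

```python
# Source-side settings DMS CDC needs per engine (not exhaustive).
CDC_PREREQS = {
    "mysql": ["binlog_format = ROW",
              "binlog retention long enough for DMS to catch up"],
    "postgres": ["wal_level = logical",
                 "a free logical replication slot"],
    "oracle": ["ARCHIVELOG mode",
               "supplemental logging enabled"],
    "sqlserver": ["FULL (or BULK_LOGGED) recovery model",
                  "MS-Replication or MS-CDC enabled"],
}

def cdc_prerequisites(engine: str) -> list:
    """Return the CDC prerequisite checklist for a source engine."""
    try:
        return CDC_PREREQS[engine.lower()]
    except KeyError:
        raise ValueError(f"no prerequisite checklist for engine {engine!r}")
```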
NB:
- DMS reads only committed transactions, preserving transaction order at table level (not always globally across all tables).
CDC Pipeline Flow (Deep Internal Sequence)
1. Source Log → DMS Cache
- Reads native logs
- Maps operations (INSERT/UPDATE/DELETE)
- Extracts PKs, timestamps, SCNs, LSNs, or WAL positions
2. Cache → Mapping Rules Engine
- Filters tables
- Applies column transformations
- Applies type conversions
- Rewrites default values
- Can enforce uppercase/lowercase naming schemes
3. Commit Grouping / Transaction Assembly
- Groups related row changes
- Ensures partial transaction correctness
4. Throttle Control & Flow Control
- Auto-throttles to avoid target overload
- Buffers events in memory or to disk
- Handles spikes in source log volume
5. Target Apply Engine
- Applies changes in commit order
- Uses batched SQL apply
- Writes checkpoint positions to DMS schema
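Steps 3–5 above can be sketched as a toy transaction-assembly pass: group row changes by transaction, then emit whole transactions in commit order. This is a simplified model for intuition, not DMS's actual implementation:

```python
from collections import defaultdict

def assemble_transactions(events):
    """Group row-change events by transaction id and emit whole transactions
    in commit order, mimicking DMS's assembly before the target apply phase.
    Each event is a tuple: (txn_id, log_position, operation)."""
    txns = defaultdict(list)
    commit_pos = {}
    for txn_id, pos, op in events:
        txns[txn_id].append(op)
        # a transaction's commit position is its highest log position
        commit_pos[txn_id] = max(pos, commit_pos.get(txn_id, 0))
    # apply whole transactions ordered by commit position
    return [(t, txns[t]) for t in sorted(txns, key=commit_pos.get)]
```

Interleaved events from two transactions come out as two complete, commit-ordered units, which is what lets the apply engine preserve per-table transaction order.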
3. DMS Continuous Replication – Failure Modes
Category A: Source-Related Failures
- Log retention too short → DMS falls behind
- WAL/binary log disabled or rotated early
- Oracle archive logs purged prematurely
- Schema drift (ALTER TABLE) not replicated
- Heavy write burst causing backlog
Category B: Replication Instance Failures
- Instance class too small (CPU pegged, memory swap)
- Disk spills (cache → EBS) slowing down replication
- Network path saturation (on-prem → cloud)
- Security groups or routing changes blocking connectivity
Category C: Target Endpoint Failures
- Deadlocks or lock waits in target
- FK/PK mismatches
- Unsupported data types
- Large LOB size mismatch
- Write rate exceeding target IOPS
Category D: Task-Level Failures
- Metadata divergence
- Checkpoint corruption
- Inconsistent table mapping rules
- Transformation rule errors
- DMS version bugs (frequent root cause)
4. Latency Analysis & Performance Internals
Latency Contributors
- Log extraction delay
- Network latency
- Replication instance CPU
- Commit batching size
- Target apply speed
- Transformation overhead
- Large transactions (e.g., bulk DELETE)
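Several of these contributors reduce to one question: does the target apply changes faster than the source produces them? A back-of-the-envelope drain-time model (illustrative only; real backlogs are burstier than a constant rate):

```python
def cdc_drain_seconds(backlog_events: int,
                      source_rate: float,
                      apply_rate: float) -> float:
    """Estimate seconds needed to drain a CDC backlog.
    Rough model: the backlog only shrinks when the target apply rate
    exceeds the source change rate (all rates in events/second)."""
    if apply_rate <= source_rate:
        return float("inf")  # backlog grows; replication never catches up
    return backlog_events / (apply_rate - source_rate)
```

For example, a 10,000-event backlog with the source writing 500 events/s and the target applying 1,500 events/s drains in about 10 seconds; if the apply rate ever drops to the source rate, latency grows without bound.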
How to Optimize
- Use CDC-only tasks (avoid full-load overhead)
- Enable Multi-AZ for resiliency
- Move replication instance into same AZ as source when possible
- Use parallel load and batch apply where supported
- Use larger replication instance classes for high-WAL-throughput DBs
- Avoid LOB full-mode unless absolutely needed
- Keep PKs/indexes lean during migration
5. DMS Replication & Cutover Strategies (Pro-Level)
1. Classic Zero-Downtime Migration
1. Full load
2. Enable ongoing replication
3. Let CDC catch up
4. Freeze application writes
5. Wait 0–5 seconds for CDC to drain
6. Promote target
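Step 5 (waiting for CDC to drain) is worth automating rather than eyeballing. A sketch that polls a latency reading, e.g. the `CDCLatencyTarget` CloudWatch metric, before promotion; the metric getter and sleep function are injected so the sketch stays testable offline:

```python
import time

def wait_for_cdc_drain(get_latency, threshold_s=1.0, timeout_s=300,
                       poll_fn=time.sleep, interval_s=5):
    """Poll CDC latency (e.g. the CDCLatencyTarget metric for the task)
    until it drops below threshold_s; only then is it safe to promote
    the target. Returns False if the backlog never drains in time."""
    waited = 0
    while waited <= timeout_s:
        if get_latency() <= threshold_s:
            return True
        poll_fn(interval_s)
        waited += interval_s
    return False
```

In production, `get_latency` would wrap a `cloudwatch.get_metric_statistics` call for the task; here it is any zero-argument callable.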
2. Blue/Green Deployment
- Two DBs kept continuously in sync
- Flip application by RDS proxy or route update
- Reversible if failure occurs
3. DR / Multi-Region Replication
- Region A → Region B sync
- Latency typically 1–5 seconds depending on traffic
- Useful for compliance or low-RTO failover
4. Cloud-to-Cloud Incremental Sync
Sample: GCP Cloud SQL → AWS Aurora
- DMS extracts logs from GCP
- Applies changes continuously
- Final cutover is seconds
6. Resiliency Patterns for Continuous Replication
A. Multi-AZ DMS Replication Instance
- Automated failover
- Continuous replication resumes in seconds
- Helps when network or AZ degrades
B. Multi-Task Parallelization
Split tasks by:
- schema
- table
- write intensity
- table size
- PK distribution
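Splitting by schema can be expressed directly in table mappings, one mapping document per task. A minimal sketch (the schema names passed in are placeholders):

```python
import json

def mappings_per_task(schemas):
    """Build one DMS table-mapping document per schema, so each schema
    runs in its own task and can be sized and tuned independently."""
    docs = []
    for i, schema in enumerate(schemas, start=1):
        docs.append(json.dumps({
            "rules": [{
                "rule-type": "selection",
                "rule-id": str(i),
                "rule-name": f"include-{schema}",
                "object-locator": {"schema-name": schema, "table-name": "%"},
                "rule-action": "include",
            }]
        }))
    return docs
```

Splitting by table, size, or PK range follows the same pattern with narrower `object-locator` values.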
C. Auto-Recovery Mechanisms
- DMS checkpoints ensure resume from last committed LSN
- Retry loops with exponential backoff
- Automatic cache-to-disk spillover during bursts
D. Monitoring & Alerting
Watch:
- CDC latency
- Source log growth
- Replication instance CPU
- Memory swap
- Target apply throughput
- Disk queue depth
- Network throughput
- Error rates in CloudWatch metrics
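Watching CDC latency can be codified as a CloudWatch alarm. A sketch of `put_metric_alarm` arguments for the `CDCLatencyTarget` task metric; the task identifier and threshold are placeholders, and dimension names should be verified against the metrics your tasks actually emit:

```python
def latency_alarm_args(task_id: str, threshold_s: float = 60.0) -> dict:
    """Arguments for boto3 cloudwatch.put_metric_alarm() on DMS CDC latency.
    Fires when target apply latency stays above threshold_s for 5 minutes."""
    return {
        "AlarmName": f"dms-{task_id}-cdc-latency",
        "Namespace": "AWS/DMS",
        "MetricName": "CDCLatencyTarget",   # seconds behind on the target side
        "Dimensions": [{"Name": "ReplicationTaskIdentifier", "Value": task_id}],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 5,
        "Threshold": threshold_s,
        "ComparisonOperator": "GreaterThanThreshold",
    }
```

A sibling alarm on `CDCLatencySource` distinguishes "can't read the source fast enough" from "can't apply to the target fast enough".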
7. Operational Best Practices (Enterprise-Ready)
Before Starting Replication
- Ensure PKs exist on all tables
- Enable log retention (WAL/binary logs)
- Validate character set mappings
- Disable triggers & heavy constraints during migration
- Pre-size storage (target & DMS instance)
During Continuous Replication
- Ensure the replication instance has CPU/memory headroom to absorb write bursts
- Monitor backlog growth
- Avoid schema changes unless planned
- Keep DMS updated to latest engine version
- Use custom parameter groups for massive workloads
- Enable validation mode after CDC catches up
During Cutover
- Freeze write traffic
- Validate row counts
- Drain CDC backlog
- Validate referential integrity
- Validate sequences/identity columns
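Row-count validation can be a simple diff of per-table counts collected on both sides once the backlog has drained. A minimal helper, assuming the counts have already been gathered (e.g. via `SELECT COUNT(*)` per table):

```python
def row_count_diffs(source_counts: dict, target_counts: dict) -> dict:
    """Compare per-table row counts from source and target.
    Returns {table: (source, target)} for every mismatch; an empty
    result is a necessary (not sufficient) signal that cutover is safe."""
    tables = set(source_counts) | set(target_counts)
    return {
        t: (source_counts.get(t, 0), target_counts.get(t, 0))
        for t in tables
        if source_counts.get(t, 0) != target_counts.get(t, 0)
    }
```

Counts alone do not prove content equality; pair this with DMS validation mode or checksum-based comparison for referential integrity.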
Post-Migration
- Turn off DMS tasks
- Restore log retention to normal levels
- Re-enable triggers and constraints
- Run application-level load & integrity tests
- Archive DMS logs (CloudWatch/S3)
8. Expert-Level Tuning Profiles
High-Volume CDC (>50k TPS)
- Use r6i.8xlarge or larger replication instance
- Avoid transformations
- Use optimized WAL retention (Postgres)
- Increase commit batching on target
LOB-Heavy Workloads
- Use LOB “limited” mode
- Preload large objects
- Increase replication instance storage for spillover
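Both tuning profiles above map onto DMS task settings. A sketch producing a settings fragment that enables batch apply and limited-size LOB mode; the numeric values are starting points to tune, not recommendations for every workload:

```python
import json

def high_volume_task_settings() -> str:
    """DMS task-settings fragment for high-volume CDC with LOB columns:
    batched apply on the target plus limited-size LOB mode."""
    settings = {
        "TargetMetadata": {
            "BatchApplyEnabled": True,      # group changes into batched SQL
            "SupportLobs": True,
            "FullLobMode": False,           # avoid full-mode unless required
            "LimitedSizeLobMode": True,
            "LobMaxSize": 32,               # KB; LOBs above this are truncated
        },
        "ChangeProcessingTuning": {
            "BatchApplyTimeoutMin": 1,      # seconds between batch applies
            "BatchApplyTimeoutMax": 30,
            "BatchSplitSize": 0,            # 0 = no forced batch splitting
        },
    }
    return json.dumps(settings)
```

The JSON string is what gets passed as `ReplicationTaskSettings` when creating or modifying the task.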
On-Prem Connectivity
- Use Direct Connect
- Avoid NAT gateways
- Keep replication instance close to DX router
- Tune TCP window sizes
9. Common Pitfalls You MUST Avoid
❌ Not
enabling WAL/binlog retention
❌ Migrating
without PKs
❌ Using
small replication instances
❌ Leaving
transformations on for heavy workloads
❌ Allowing
schema drift during replication
❌ Applying DDL changes without updating DMS
tasks