Sunday, November 23, 2025

AWS Data Migration Service (DMS) Continuous Replication | Overview.


Focus:

    • Tailored for:
      • DevOps, 
      • Cloud, 
      • SRE, 
      • Platform, 
      • Infra, 
      • DevSecOps Engineering.

Scope:

  • DMS Architecture,
  • Continuous Replication Internals,
  • DMS Continuous Replication – Failure Modes,
  • Latency Analysis & Performance Internals,
  • DMS Replication & Cutover Strategies (Pro-Level),
  • Resiliency Patterns for Continuous Replication,
  • Operational Best Practices (Enterprise-Ready),
  • Expert-Level Tuning Profiles,
  • Common Pitfalls You MUST Avoid.

Intro:

    •  AWS Database Migration Service (AWS DMS) fully supports continuous data replication through a process called Change Data Capture (CDC).
    •  The CDC feature allows twtech to keep source and target databases synchronized during a migration with minimal downtime.
    •  CDC also allows twtech to maintain ongoing synchronization for use cases such as disaster recovery and development/test environment refreshes.
    • AWS Database Migration Service (DMS) supports:
      •  One-time (full-load) migrations
      •  Full-load + CDC (Change Data Capture)
      •  Continuous replication (ongoing CDC stream)
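These three modes correspond to the `MigrationType` values the DMS API accepts. As a minimal sketch (the ARNs, identifiers, and the `app` schema name are placeholders), the parameters for a replication task could be assembled like this and then passed to `boto3.client("dms").create_replication_task(**params)`:

```python
import json

# Hypothetical ARNs -- replace with real resources.
SOURCE_ARN = "arn:aws:dms:us-east-1:123456789012:endpoint:SRC"
TARGET_ARN = "arn:aws:dms:us-east-1:123456789012:endpoint:TGT"
INSTANCE_ARN = "arn:aws:dms:us-east-1:123456789012:rep:RI"

def task_params(mode: str) -> dict:
    """Build create_replication_task parameters for one of the three
    MigrationType values: 'full-load', 'cdc', or 'full-load-and-cdc'."""
    assert mode in ("full-load", "cdc", "full-load-and-cdc")
    return {
        "ReplicationTaskIdentifier": f"orders-{mode}",
        "SourceEndpointArn": SOURCE_ARN,
        "TargetEndpointArn": TARGET_ARN,
        "ReplicationInstanceArn": INSTANCE_ARN,
        "MigrationType": mode,
        # Minimal mapping: replicate every table in the 'app' schema.
        "TableMappings": json.dumps({
            "rules": [{
                "rule-type": "selection",
                "rule-id": "1",
                "rule-name": "include-app",
                "object-locator": {"schema-name": "app", "table-name": "%"},
                "rule-action": "include",
            }]
        }),
    }

params = task_params("full-load-and-cdc")
```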

NB:

Continuous replication turns DMS into a near-real-time database replication engine, thus enabling:

    •  Ongoing sync between on-prem and AWS
    •  Cloud-to-cloud migrations
    •  HA/DR architectures
    •  Blue/green cutovers
    •  Multi-region read replicas
    •  Zero-downtime migrations

1. DMS Architecture

Core Components

1.     Source Endpoint

    •    RDS, Aurora, Oracle, SQL Server, MySQL, PostgreSQL
    •    On-prem DBs over VPN/DX
    •    Supports log readers (redo/transaction log/Write Ahead Log)

2.     Replication Instance

    •    The DMS engine
    •    EC2-based
    •    Handles:
      •   CDC extraction
      •   Change buffering
      •   LOB chunking
      •   Transformation rules
      •   Batch commits to target

3.     Target Endpoint

    •    RDS/Aurora
    •    DynamoDB
    •    S3 (Parquet/CSV)
    •    Redshift
    •    OpenSearch
    •    Kafka

4.     Task Engine

    •    Orchestrates full-load, CDC, and validation phases
    •    Maintains checkpoint state in target schema
    •    Self-recovery logic on disconnects/restarts

2. Continuous Replication Internals

    • CDC Source-Level Behavior

Depending on DB:

Engine          CDC Mechanism
Oracle          Redo logs; optionally LogMiner; binary reader for high throughput
SQL Server      Transaction log (log reader agent)
MySQL           Binlog
PostgreSQL      Logical decoding + replication slots
SAP ASE         Replication Server log APIs
MongoDB         Oplog

NB:

    • DMS reads only committed transactions, preserving transaction order at table level (not always globally across all tables).

CDC Pipeline Flow (Deep Internal Sequence)

1.     Source Log → DMS Cache

    •    Reads native logs
    •    Maps operations (INSERT/UPDATE/DELETE)
    •    Extracts PKs, timestamps, SCNs, LSNs, or WAL positions

2.     Cache → Mapping/Rules Engine

    •    Filters tables
    •    Applies column transformations
    •    Applies type conversions
    •    Rewrites default values
    •    Can enforce uppercase/lowercase naming schemes
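The filtering and renaming steps above are expressed in a task's table-mapping JSON. A minimal sketch (the `SALES` schema is illustrative) combining a selection rule with a `convert-lowercase` transformation:

```python
import json

# One selection rule plus one transformation rule that lowercases table
# names -- mirroring the filter/rename steps the rules engine performs.
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-sales",
            "object-locator": {"schema-name": "SALES", "table-name": "%"},
            "rule-action": "include",
        },
        {
            "rule-type": "transformation",
            "rule-id": "2",
            "rule-name": "lowercase-tables",
            "rule-target": "table",
            "object-locator": {"schema-name": "SALES", "table-name": "%"},
            "rule-action": "convert-lowercase",
        },
    ]
}

mappings_json = json.dumps(table_mappings)  # passed as the task's TableMappings
```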

3.     Commit Grouping / Transaction Assembly

    •    Groups related row changes
    •    Ensures partial transaction correctness

4.     Throttle Control & Flow Control

    •    Auto-throttles to avoid target overload
    •    Buffers events in memory or to disk
    •    Handles spikes in source log volume

5.     Target Apply Engine

    •    Applies changes in commit order
    •    Uses batched SQL apply
    •    Writes checkpoint positions to DMS schema
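The apply step can be pictured with a toy model (not DMS's actual implementation): changes are applied in commit order in batches, and the last LSN of each batch becomes the checkpoint a restarted task resumes from.

```python
from dataclasses import dataclass

@dataclass
class Change:
    lsn: int          # log sequence number (commit position)
    table: str
    op: str           # INSERT / UPDATE / DELETE

def apply_in_batches(changes, batch_size):
    """Toy apply engine: apply committed changes in LSN order, in
    batches, recording a checkpoint after each batch."""
    applied, checkpoints = [], []
    ordered = sorted(changes, key=lambda c: c.lsn)  # commit order
    for i in range(0, len(ordered), batch_size):
        batch = ordered[i:i + batch_size]
        applied.extend(batch)                # stand-in for batched SQL apply
        checkpoints.append(batch[-1].lsn)    # resume point on restart
    return applied, checkpoints

changes = [Change(3, "orders", "UPDATE"), Change(1, "orders", "INSERT"),
           Change(2, "items", "INSERT"), Change(4, "items", "DELETE")]
applied, cps = apply_in_batches(changes, batch_size=2)
# cps == [2, 4]: last LSN of each batch is the recovery checkpoint
```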

3. DMS Continuous Replication – Failure Modes

Category A: Source-Related Failures

    • Log retention too short → DMS falls behind
    • WAL/binary log disabled or rotated early
    • Oracle archive logs purged prematurely
    • Schema drift (ALTER TABLE) not replicated
    • Heavy write burst causing backlog

Category B: Replication Instance Failures

    •  Instance class too small (CPU pegged, memory swap)
    •  Disk spills (cache → EBS) slowing down replication
    •  Network path saturation (on-prem → cloud)
    •  Security groups or routing changes blocking connectivity

Category C: Target Endpoint Failures

    • Deadlocks or lock waits in target
    • FK/PK mismatches
    • Unsupported data types
    • Large LOB size mismatch
    • Write rate exceeding target IOPS

Category D: Task-Level Failures

    • Metadata divergence
    • Checkpoint corruption
    • Inconsistent table mapping rules
    • Transformation rule errors
    • DMS version bugs (frequent root cause)

4. Latency Analysis & Performance Internals

Latency Contributors

    • Log extraction delay
    • Network latency
    • Replication instance CPU
    • Commit batching size
    • Target apply speed
    • Transformation overhead
    • Large transactions (e.g., bulk DELETE)
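One simplified way to reason about where these contributors show up (the exact definitions DMS uses for its latency metrics differ slightly): extraction delay lands on the source side, while buffering plus apply delay lands on the target side.

```python
def cdc_latencies(event_commit_ts, read_ts, apply_ts):
    """Split end-to-end CDC lag into two components:
    source latency = read time - commit time (extraction delay),
    target latency = apply time - read time (buffering + apply delay)."""
    source = read_ts - event_commit_ts
    target = apply_ts - read_ts
    return source, target, source + target

# A change committed at t=100 s, read at t=102.5 s, applied at t=106 s.
src, tgt, total = cdc_latencies(event_commit_ts=100.0, read_ts=102.5, apply_ts=106.0)
```

Attributing lag this way tells you which knob to turn: high source latency points at log extraction or the network path, high target latency at apply speed or batching.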

How to Optimize

    •  Use CDC-only tasks (avoid full-load overhead)
    •  Enable Multi-AZ for resiliency
    •  Move replication instance into same AZ as source when possible
    •  Use parallel load and batch apply where supported
    •  Use larger replication instance classes for high-WAL-throughput DBs
    •  Avoid LOB full-mode unless absolutely needed
    •  Keep PKs/indexes lean during migration

5. DMS Replication & Cutover Strategies (Pro-Level)

1. Classic Zero-Downtime Migration

    1.     Full load
    2.     Enable ongoing replication
    3.     Let CDC catch up
    4.     Freeze application writes
    5.     Wait 0–5 seconds for CDC to drain
    6.     Promote target
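Steps 4–6 hinge on confirming the CDC backlog has actually drained before promoting. A toy gate (not a DMS feature; the threshold and poll counts are illustrative) that promotes only after latency stays low for several consecutive polls:

```python
def wait_for_drain(latency_samples, threshold=1.0, stable_polls=3):
    """Toy cutover gate: after writes are frozen, poll CDC latency and
    allow promotion only once it stays under `threshold` seconds for
    `stable_polls` consecutive polls."""
    streak = 0
    for polls, latency in enumerate(latency_samples, start=1):
        streak = streak + 1 if latency < threshold else 0
        if streak >= stable_polls:
            return polls  # safe to promote the target
    raise TimeoutError("CDC backlog never drained")

# Latency (seconds) observed after the write freeze, oldest first.
polls_needed = wait_for_drain([4.2, 2.1, 0.8, 0.3, 0.1])
```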

2. Blue/Green Deployment

    • Two DBs kept continuously in sync
    • Flip application by RDS proxy or route update
    • Reversible if failure occurs

3. DR / Multi-Region Replication

    • Region A → Region B sync
    • Latency typically 1–5 seconds depending on traffic
    • Useful for compliance or low-RTO failover

4. Cloud-to-Cloud Incremental Sync

Sample: GCP Cloud SQL → AWS Aurora

    • DMS extracts logs from GCP
    • Applies changes continuously
    • Final cutover is seconds

6. Resiliency Patterns for Continuous Replication

A. Multi-AZ DMS Replication Instance

    •  Automated failover
    •  Continuous replication resumes in seconds
    •  Helps when network or AZ degrades

B. Multi-Task Parallelization

Split tasks by:

    • schema
    • table
    • write intensity
    • table size
    • PK distribution
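Splitting by write intensity can be done with a simple greedy heuristic: always assign the next-heaviest table to the currently lightest task. A sketch (the table names and writes/sec figures are made up):

```python
import heapq

def split_tasks(tables, n_tasks):
    """Greedy bin-packing: assign each (name, writes/sec) table to the
    currently lightest task so per-task write load stays balanced."""
    heap = [(0, i, []) for i in range(n_tasks)]  # (load, task-id, tables)
    heapq.heapify(heap)
    for name, wps in sorted(tables, key=lambda t: -t[1]):  # heaviest first
        load, i, members = heapq.heappop(heap)
        members.append(name)
        heapq.heappush(heap, (load + wps, i, members))
    return {i: (load, members) for load, i, members in heap}

tables = [("orders", 900), ("items", 700), ("audit", 650), ("users", 100)]
tasks = split_tasks(tables, n_tasks=2)
```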

C. Auto-Recovery Mechanisms

    • DMS checkpoints ensure resume from last committed LSN
    • Retry loops with exponential backoff
    •  Automatic cache-to-disk spillover during bursts
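The retry behavior can be sketched with the standard exponential-backoff-with-jitter pattern (this is an illustration of the pattern, not DMS's internal code; the `sleep` parameter is injectable so the sketch is testable):

```python
import random

def retry_with_backoff(op, max_attempts=5, base=0.5, cap=30.0, sleep=None):
    """Retry a transient operation with exponential backoff + full jitter."""
    sleep = sleep or (lambda s: None)  # real code would use time.sleep
    for attempt in range(max_attempts):
        try:
            return op()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            delay = random.uniform(0, min(cap, base * 2 ** attempt))
            sleep(delay)

# Fails twice, then succeeds -- mimics a flaky source endpoint.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("source unreachable")
    return "resumed from checkpoint"

result = retry_with_backoff(flaky)
```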

D. Monitoring & Alerting

Watch:

    •  CDC latency
    •  Source log growth
    •  Replication instance CPU
    •  Memory swap
    •  Target apply throughput
    •  Disk queue depth
    •  Network throughput
    •  Error rates in CloudWatch metrics
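CDC latency is surfaced in the `AWS/DMS` CloudWatch namespace as `CDCLatencySource` and `CDCLatencyTarget`. A sketch of alarm parameters for the target-side metric, to be passed to `boto3.client("cloudwatch").put_metric_alarm(**alarm)` (the task/instance identifiers and threshold are placeholders):

```python
# Alarm on sustained apply-side CDC lag; identifiers are illustrative.
alarm = {
    "AlarmName": "dms-cdc-latency-target-high",
    "Namespace": "AWS/DMS",
    "MetricName": "CDCLatencyTarget",   # seconds of apply-side lag
    "Dimensions": [
        {"Name": "ReplicationInstanceIdentifier", "Value": "prod-dms-ri"},
        {"Name": "ReplicationTaskIdentifier", "Value": "orders-cdc"},
    ],
    "Statistic": "Maximum",
    "Period": 60,                        # evaluate every minute
    "EvaluationPeriods": 5,              # must breach for 5 minutes
    "Threshold": 60.0,                   # alert past 60 s of lag
    "ComparisonOperator": "GreaterThanThreshold",
}
```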

7. Operational Best Practices (Enterprise-Ready)

Before Starting Replication

    •  Ensure PKs exist on all tables
    •  Enable log retention (WAL/binary logs)
    •  Validate character set mappings
    •  Disable triggers & heavy constraints during migration
    •  Pre-size storage (target & DMS instance)

During Continuous Replication

    •  Ensure the replication instance has performance headroom for bursts
    •  Monitor backlog growth
    •  Avoid schema changes unless planned
    •  Keep DMS updated to latest engine version
    •  Use custom parameter groups for massive workloads
    •  Enable validation mode after CDC catches up

During Cutover

    •  Freeze write traffic
    • Validate row counts
    • Drain CDC backlog
    • Validate referential integrity
    • Validate sequences/identity columns
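Row-count validation can be as simple as comparing per-table counts from both sides once the backlog has drained. A sketch (in practice the counts come from `SELECT COUNT(*)` on source and target; the sample data is made up):

```python
def diff_row_counts(source_counts, target_counts):
    """Compare per-table row counts; return only the tables that disagree."""
    mismatches = {}
    for table in source_counts.keys() | target_counts.keys():
        s, t = source_counts.get(table, 0), target_counts.get(table, 0)
        if s != t:
            mismatches[table] = (s, t)
    return mismatches

src = {"orders": 10_000, "items": 52_310, "users": 840}
tgt = {"orders": 10_000, "items": 52_309, "users": 840}
bad = diff_row_counts(src, tgt)  # {'items': (52310, 52309)}
```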

Post-Migration

    •  Turn off DMS tasks
    •  Return log retention to normal levels
    •  Re-enable triggers and constraints
    •  Run application-level load & integrity tests
    •  Archive DMS logs (CloudWatch/S3)

8. Expert-Level Tuning Profiles

High-Volume CDC (>50k TPS)

    •  Use r6i.8xlarge or larger replication instance
    •  Avoid transformations
    •  Use optimized WAL retention (Postgres)
    •  Increase commit batching on target
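Commit batching is controlled through the task settings JSON. A fragment enabling batch-optimized apply (the numeric values below are illustrative starting points, not recommendations):

```python
import json

# Task-settings fragment turning on batch-optimized apply.
task_settings = {
    "TargetMetadata": {
        "BatchApplyEnabled": True        # group changes into bulk applies
    },
    "ChangeProcessingTuning": {
        "BatchApplyTimeoutMin": 1,       # seconds to wait before flushing
        "BatchApplyTimeoutMax": 30,
        "BatchApplyMemoryLimit": 500,    # MB of buffered changes per batch
        "BatchSplitSize": 0,             # 0 = no forced split
    },
}

settings_json = json.dumps(task_settings)  # passed as ReplicationTaskSettings
```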

LOB-Heavy Workloads

    •  Use LOB “limited” mode
    •  Preload large objects
    •  Increase replication instance storage for spillover

On-Prem Connectivity

    •  Use Direct Connect
    •  Avoid NAT gateways
    •  Keep replication instance close to DX router
    • Tune TCP window sizes

9. Common Pitfalls You MUST Avoid

❌    Not enabling WAL/binlog retention
❌    Migrating without PKs
❌    Using small replication instances
❌    Leaving transformations on for heavy workloads
❌    Allowing schema drift during replication
❌    Applying DDL changes without updating DMS tasks




