Monday, November 24, 2025

AWS Database Migration Service (DMS) Multi-AZ Deployment | Deep Dive.

Scope:

  • Intro,
  • Key Features & Benefits,
  • Management & Configuration,
  • Key Considerations,
  • What “Multi-AZ” for DMS means,
  • Failover behaviour, RPO / RTO expectations,
  • Recommended configuration & sizing,
  • Creating a Multi-AZ replication instance (sample AWS CLI command that creates a replication instance),
  • Task configuration (full load + CDC common pattern),
  • Monitoring & alerting (what to watch),
  • Testing & runbook (practical tests to prove availability),
  • Common failure modes & troubleshooting,
  • Security & compliance,
  • Limits & gotchas,
  • Operational checklist (quick),
  • Sample troubleshooting commands & quick checks,
  • Addendum.

Intro:

    • AWS DMS Multi-AZ deployment provides high availability and failover support for a replication instance.
    • AWS DMS Multi-AZ deployment automatically provisions and maintains a synchronous standby replica in a different Availability Zone.

Key Features & Benefits

  •  High Availability: 
      •  Multi-AZ protects twtech replication instance against failures such as an Availability Zone outage, internal hardware or network issues, or software failure.
  • Automatic Failover: 
      •  If the primary instance fails, AWS DMS automatically switches to the standby replica with minimal interruption to twtech migration tasks, ensuring operational continuity.
  • Synchronous Replication: 
      • Data is synchronously replicated from the primary to the standby instance, ensuring data redundancy and minimal data loss in case of a failure.
  • Reduced Downtime: 
      •  Planned maintenance sessions (e.g., OS patching) are applied to the standby first, followed by an automatic failover, which minimizes overall downtime.
  • Improved Resilience: 
      •  This configuration is ideal for production environments and long-running replications where fault tolerance is essential

Management & Configuration

  •   Creation: 
      •  When creating a new replication instance, twtech can simply select the "Multi-AZ" option in the AWS Management Console.
  •   Modification: 
      •  twtech can convert an existing Single-AZ replication instance to a Multi-AZ deployment by modifying the instance settings via the console, AWS CLI, or AWS DMS API. 
      • These changes can be applied immediately or during the next maintenance window.
  •  Monitoring: 
      • The Multi-AZ configuration status can be verified in the "Overview" tab of the replication instance details in the AWS DMS console. 
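
The conversion and the status check described above can also be scripted with the AWS CLI. The following is a minimal sketch, assuming the replication instance already exists; the function names enable_multi_az and multi_az_status are hypothetical helpers, and the ARN and identifier are supplied by the caller.

```shell
# Sketch: convert an existing Single-AZ replication instance to Multi-AZ.
# $1 = replication instance ARN (supplied by the caller).
enable_multi_az() {
  # --apply-immediately applies the change now; swap in
  # --no-apply-immediately to defer it to the next maintenance window.
  aws dms modify-replication-instance \
    --replication-instance-arn "$1" \
    --multi-az \
    --apply-immediately
}

# Print the MultiAZ attribute of the instance.
# $1 = replication instance identifier, e.g. twtech-dms-repl-1.
multi_az_status() {
  aws dms describe-replication-instances \
    --filters "Name=replication-instance-id,Values=$1" \
    --query 'ReplicationInstances[0].MultiAZ' \
    --output text
}
```

Usage: call enable_multi_az with the instance ARN, then multi_az_status twtech-dms-repl-1 once the modification has finished.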

Key Considerations

  •  Cost: 
      •  Multi-AZ deployment is more cost-intensive than Single-AZ due to the extra resources required for the standby instance.
  •  Read Replicas: 
      •  The standby replica of a Multi-AZ replication instance cannot be used for reads or to run tasks; it exists purely for failover support.
  •  Failover is not fully transparent: 
      •  While AWS DMS handles most failures automatically, running tasks are interrupted and must reconnect to the source and target endpoints after failover, so the network path from both AZs to each endpoint has to be in place.

1) What “Multi-AZ” for DMS means

  •   Primary idea:
      •  A DMS replication instance runs in a primary AZ with an automatic standby replica in another AZ within the same AWS Region. If the primary replication instance fails, DMS promotes the standby so the replication task can continue with minimal interruption.
  •   Components:
      •    Source database (on-prem, RDS, EC2, etc.) → DMS replication instance (Primary + Standby across AZs) → Target database (RDS/Aurora/EC2/Redshift/S3).
      •    Replication tasks run on the replication instance (full load, CDC, or both)
      • The standby contains the replication instance state to allow failover.
  •    Network layout:
      •  Ensure the replication instance has network access to source and target:
        • a replication subnet group with at least two private subnets in different AZs;
        • appropriate security groups, 
        • routing, 
        • a NAT gateway if the source is reached over the internet, or TGW/VPN/Direct Connect if it is on-prem.
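
As an illustration of the subnet requirement, the sketch below creates a replication subnet group from two subnets and lists their Availability Zones so the AZ spread can be checked first. The helper names create_dms_subnet_group and subnet_azs, the description text, and the subnet IDs in the usage note are assumptions for the example.

```shell
# Print the Availability Zone of each subnet so twtech can confirm
# the subnets really sit in different AZs. Args: one or more subnet IDs.
subnet_azs() {
  aws ec2 describe-subnets \
    --subnet-ids "$@" \
    --query 'Subnets[].AvailabilityZone' \
    --output text
}

# Create a DMS replication subnet group from two private subnets.
# $1 = subnet group identifier, $2 and $3 = subnet IDs in different AZs.
create_dms_subnet_group() {
  aws dms create-replication-subnet-group \
    --replication-subnet-group-identifier "$1" \
    --replication-subnet-group-description "Private subnets in two AZs for DMS" \
    --subnet-ids "$2" "$3"
}
```

Usage: subnet_azs subnet-0aaa1111 subnet-0bbb2222, then create_dms_subnet_group twtech-dms-subnet-group subnet-0aaa1111 subnet-0bbb2222 (placeholder IDs).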

2) Failover behaviour, RPO / RTO expectations

  • Failover mechanics:
      • AWS automatically fails over the DMS replication instance to the standby. Active replication tasks are restarted on the standby instance.
  • RTO (typical):
      •  Usually seconds to a few minutes, depending on task complexity, size of in-flight transactions, and how quickly tasks restart. 
      • Plan for a brief interruption while tasks restart and re-establish connections.
  • RPO:
      • Largely depends on the CDC pipeline: if CDC was streaming, recent committed transactions are generally applied. 
      • However, there can be a small window of unreplicated transactions during failover. 
      • Test to quantify for your workload.
  • NB:
      •  Multi-AZ for DMS improves availability of the replication instance, not of the source or target databases. 
      • If source/target are RDS Multi-AZ/Aurora clusters, those have their own failover semantics.

3) Recommended configuration & sizing

  • Replication instance class:
      •  Choose based on CPU, memory, and network needs. 
      •  For heavy CDC or large transactions, use dms.r5.xlarge or above.
      •  For small workloads, dms.t3.medium or dms.r5.large may suffice. 
      •  Benchmark with a realistic load before settling on a class.
  • Storage:
      •  Allocate enough storage for cached transactions and task logs. Provision for higher IOPS if write throughput is high.
  • Multi-AZ:
      •  Enable via the --multi-az flag (CLI) or the checkbox in the console.
  • Subnets:
      •  Place the replication instance in a subnet group that includes at least two private subnets in different AZs.
  • Security groups:
      •  Allow egress to source and target DB endpoints, and allow access from management hosts if needed.
  • Parameter choices:
      •  Auto minor version upgrade: enable to keep the DMS engine patched (test in dev first).
      •  MaxFileSize / task settings: tune based on target and transaction flow to avoid large memory or IOPS spikes.

4) Creating a Multi-AZ replication instance (sample AWS CLI command that creates a replication instance)

# bash 
aws dms create-replication-instance \
  --replication-instance-identifier twtech-dms-repl-1 \
  --replication-instance-class dms.r5.large \
  --allocated-storage 100 \
  --vpc-security-group-ids sg-0abc12345 \
  --replication-subnet-group-identifier twtech-dms-subnet-group \
  --multi-az \
  --no-publicly-accessible \
  --tags Key=Env,Value=prod

CloudFormation (snippet)

MyReplicationInstance:
  Type: AWS::DMS::ReplicationInstance
  Properties:
    ReplicationInstanceIdentifier: twtech-dms-repl-1
    ReplicationInstanceClass: dms.r5.large
    AllocatedStorage: 100
    VpcSecurityGroupIds:
      - sg-0abc12345
    ReplicationSubnetGroupIdentifier: !Ref twtechSubnetGroup
    MultiAZ: true
    PubliclyAccessible: false

5) Task configuration (full load + CDC common pattern)

  • Strategy:
    • perform an initial full load, then enable CDC to capture changes (this is standard for migrations with minimal downtime).
  • Task settings:
    •  tune MaxFullLoadSubTasks, ParallelLoadThreads (for some target engines), CommitRate, and the table-mapping rules.
  • Maintain consistent primary key presence on tables to ensure CDC ordering and idempotence.
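
The full load + CDC pattern corresponds to the full-load-and-cdc migration type. Below is a minimal sketch of creating such a task with the AWS CLI; the helper name create_full_load_cdc_task is hypothetical, the ARNs come from the caller, and the table mapping simply includes every table in every schema, which is an assumption to narrow for a real migration.

```shell
# Create a task that runs an initial full load and then switches to CDC.
# $1 = task identifier, $2 = replication instance ARN,
# $3 = source endpoint ARN, $4 = target endpoint ARN.
create_full_load_cdc_task() {
  # Selection rule: include all tables in all schemas (illustrative only).
  table_mappings='{"rules":[{"rule-type":"selection","rule-id":"1","rule-name":"1","object-locator":{"schema-name":"%","table-name":"%"},"rule-action":"include"}]}'
  aws dms create-replication-task \
    --replication-task-identifier "$1" \
    --replication-instance-arn "$2" \
    --source-endpoint-arn "$3" \
    --target-endpoint-arn "$4" \
    --migration-type full-load-and-cdc \
    --table-mappings "$table_mappings"
}
```

Because the task is bound to the replication instance ARN rather than to an AZ, the same task definition is what gets restarted on the standby after a failover.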

6) Monitoring & alerting (what to watch)

  • CloudWatch metrics to monitor (important ones):
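    •    CDCLatencySource / CDCLatencyTarget (replication latency: how far behind the source read and the target apply are)
    •    FullLoadThroughputRowsTarget / CDCThroughputRowsTarget (full-load and CDC throughput)
    •    CPUUtilization (replication instance)
    •    FreeableMemory
    •    DiskQueueDepth / FreeStorageSpace (storage pressure)
    •    Stopped or failed replication tasks (from task status and DMS events)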
  • DMS Events & logs:
    •  Subscribe to DMS events (task stopped, error, failover).
    •  Enable CloudWatch Logs for task logs for deeper troubleshooting.
  • Alarms to create (examples):
    •  CDCLatencyTarget (replication latency) > acceptable threshold for N minutes → Pager/Slack
    •  Any task stopped or failed → alert immediately
    •  CPUUtilization > 80% for a sustained period → scaling/upgrade plan
    •  FreeableMemory low → investigation
  • Dashboards: create a DMS dashboard showing tasks, latency, CPU, and replication throughput.
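
The latency alarm and the failover event subscription can be sketched with the AWS CLI as follows. This assumes DMS publishes task latency in the AWS/DMS namespace as CDCLatencyTarget, with ReplicationInstanceIdentifier and ReplicationTaskIdentifier dimensions (the task dimension takes the task's resource ID, which is worth confirming in the CloudWatch console). The helper names, the 60-second period, and the five evaluation periods are illustrative choices, and the SNS topic ARN is supplied by the caller.

```shell
# Alarm when target-side CDC latency stays above a threshold.
# $1 = replication instance identifier, $2 = task resource ID,
# $3 = threshold in seconds, $4 = SNS topic ARN to notify.
create_cdc_latency_alarm() {
  aws cloudwatch put-metric-alarm \
    --alarm-name "dms-cdc-latency-$2" \
    --namespace AWS/DMS \
    --metric-name CDCLatencyTarget \
    --dimensions "Name=ReplicationInstanceIdentifier,Value=$1" "Name=ReplicationTaskIdentifier,Value=$2" \
    --statistic Average \
    --period 60 \
    --evaluation-periods 5 \
    --threshold "$3" \
    --comparison-operator GreaterThanThreshold \
    --alarm-actions "$4"
}

# Send replication instance failover and failure events to an SNS topic.
# $1 = subscription name, $2 = SNS topic ARN.
subscribe_failover_events() {
  aws dms create-event-subscription \
    --subscription-name "$1" \
    --sns-topic-arn "$2" \
    --source-type replication-instance \
    --event-categories failover failure
}
```

With a 60-second period and five evaluation periods, the alarm fires after roughly five minutes above the threshold, which matches the "for N minutes" wording with N = 5.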

7) Testing & runbook (practical tests to prove availability)

  • Planned failover test:
    •  Run full load + CDC baseline and confirm low steady-state ReplicationLatency.
    •  Simulate primary failure (for example, reboot the replication instance with a forced failover, or introduce a network block) and observe failover to standby.
    •  Measure time to resume and check for data loss (RPO) by confirming source and target data consistency for recent transactions.
    •  Validate application behaviour during failover (reconnect logic, retries).
  • Unplanned failure test: run while injecting realistic transactional load.
  • Rehearse rollback/resync: if a task fails to resume cleanly, document steps to restart the task with saved checkpoints or to re-run full load for subsets.
  • Runbook snippet (on failover detection):
    •  Alert triggers → on-call acknowledges.
    •  Check DMS console: replication instance status and events.
    •  Check CloudWatch logs for task restart errors.
    •  If tasks are stopped: try restarting the task. If errors persist, collect logs and escalate.
    •  If replication instance not recoverable: create new replication instance from last saved task settings and reattach endpoints (keep same table mapping and saved positions where possible).
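
One way to rehearse the planned failover test and the "restart stopped tasks" runbook step is sketched below. It relies on the reboot operation's forced-failover option, which only applies to Multi-AZ instances. The function names and the polling interval are hypothetical, and the elapsed time is only an approximation of RTO for the instance itself, not for the tasks.

```shell
# Trigger a failover to the standby. $1 = replication instance ARN.
force_failover() {
  aws dms reboot-replication-instance \
    --replication-instance-arn "$1" \
    --force-failover
}

# Print the current instance status (for example: available, rebooting).
instance_status() {
  aws dms describe-replication-instances \
    --filters "Name=replication-instance-arn,Values=$1" \
    --query 'ReplicationInstances[0].ReplicationInstanceStatus' \
    --output text
}

# Force a failover and print the approximate seconds until the instance
# reports "available" again. $1 = instance ARN, $2 = poll interval (default 10).
time_failover() {
  poll_seconds="${2:-10}"
  start=$(date +%s)
  force_failover "$1" >/dev/null
  # Wait once before polling: the status can still read "available"
  # for a moment right after the reboot request is accepted.
  sleep "$poll_seconds"
  while [ "$(instance_status "$1")" != "available" ]; do
    sleep "$poll_seconds"
  done
  echo "$(( $(date +%s) - start ))"
}

# Resume every stopped task on the instance from its last checkpoint.
resume_stopped_tasks() {
  for task_arn in $(aws dms describe-replication-tasks \
      --filters "Name=replication-instance-arn,Values=$1" \
      --query "ReplicationTasks[?Status=='stopped'].ReplicationTaskArn" \
      --output text); do
    aws dms start-replication-task \
      --replication-task-arn "$task_arn" \
      --start-replication-task-type resume-processing
  done
}
```

Run time_failover against a non-production instance first, and compare source and target row counts for recent transactions afterwards to estimate RPO.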

8) Common failure modes & troubleshooting

  • Task restart failures after failover
    • Symptoms: tasks still stopped or error after failover.
    • Actions: review task logs in CloudWatch; check endpoint connectivity, credentials, or schema drift; ensure endpoints are reachable from the new AZ.
  • Replication lag spike
    •  Causes: bursty writes on source, insufficient replication instance CPU, network bandwidth limits, target apply bottleneck.
    •  Actions: increase instance class, scale target performance, tune apply parallelism.
  • Connection timeouts
    • Check SGs, NACLs, route tables, NAT/GWs. 
    • If source is on-prem, test VPN/TGW/Direct Connect reachability.
  • Large DDLs / schema changes
    • DDL might block or slow CDC depending on engine. 
    • Prefer to apply schema changes in maintenance windows or test DMS handling for expected DDLs.
  • Disk pressure / low storage
    •  Increase allocated storage; DMS will not always auto-expand. Monitor disk usage.
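
For the endpoint reachability checks mentioned above, a connection test can be started from the replication instance and its result read back afterwards. This is a minimal sketch; the helper names are hypothetical and both ARNs are supplied by the caller. The test runs asynchronously, so the status may read "testing" for a short while before it settles.

```shell
# Start a connection test from the replication instance to an endpoint.
# $1 = replication instance ARN, $2 = endpoint ARN.
start_connection_test() {
  aws dms test-connection \
    --replication-instance-arn "$1" \
    --endpoint-arn "$2"
}

# Print the latest test status and any failure message for that pair.
connection_status() {
  aws dms describe-connections \
    --filters "Name=replication-instance-arn,Values=$1" "Name=endpoint-arn,Values=$2" \
    --query 'Connections[0].[Status,LastFailureMessage]' \
    --output text
}
```

Running both helpers for the source and the target endpoint right after a failover is a quick way to confirm that the new AZ can reach each database.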

9) Security & compliance

  • Encryption: 
    • use encryption at rest for replication instance storage (KMS) and enable SSL/TLS for endpoint connections if supported.
  • IAM:
    • use least-privilege IAM roles for DMS tasks and for any integration with CloudWatch/S3.
  • Secrets:
    • store DB credentials in AWS Secrets Manager and reference them from DMS endpoints.
  •  Network isolation:
    • place replication instances in private subnets; avoid public access unless explicitly required.

10) Limits & gotchas

  • DMS Multi-AZ protects the replication instance, not the target/source DBs. twtech still needs Multi-AZ / clustering on the source/target for their availability.
  • Cost:
    • Multi-AZ adds a standby replication instance, so twtech pays for the extra HA capacity. Weigh the cost against Service Level Agreements (SLAs).
  • Replication instance upgrade: 
    • During major version upgrades some tasks may be interrupted; plan maintenance windows and test.
  • Large initial full loads: 
    • may require temporary scaling of replication instance and target resources to keep time reasonable.

11) Operational checklist (quick)

  • Subnet group spans 2 AZs
  • Multi-AZ checkbox/flag enabled on replication instance
  • Sufficient instance class and storage
  • Endpoints tested (source → replication instance, replication instance → target)
  • CloudWatch metrics & alarms configured
  • Task logging enabled (CloudWatch Logs)
  • Secrets Manager for credentials
  • Disaster/recovery playbook (tested)

12) Sample troubleshooting commands & quick checks

  •  Check replication instance status (CLI):

aws dms describe-replication-instances \
--filters Name=replication-instance-id,Values=twtech-dms-repl-1

  •  List tasks & task status:

aws dms describe-replication-tasks

  • Get task logs (CloudWatch via console or CLI):

aws logs filter-log-events \
  --log-group-name "dms-tasks-twtech-dms-repl-1" --limit 50


Addendum





