AWS Database Migration Service (DMS) Multi-AZ Deployment | Deep Dive
Scope:
- Intro
- Key Features & Benefits
- Management & Configuration
- Key Considerations
- What “Multi-AZ” for DMS means
- Failover behaviour, RPO / RTO expectations
- Recommended configuration & sizing
- Creating a Multi-AZ replication instance (sample AWS CLI that creates a replication instance)
- Task configuration (full load + CDC common pattern)
- Monitoring & alerting (what to watch)
- Testing & runbook (practical tests to prove availability)
- Common failure modes & troubleshooting
- Security & compliance
- Limits & gotchas
- Operational checklist (quick)
- Sample troubleshooting commands & quick checks
- Addendum
Intro:
- AWS DMS Multi-AZ deployment provides high availability and failover support for a replication instance.
- AWS DMS Multi-AZ deployment automatically provisions and maintains a synchronous standby replica in a different Availability Zone.
Key Features & Benefits
- High Availability:
- Multi-AZ protects the twtech replication instance against failures such as an Availability Zone outage, internal hardware or network issues, or software failure.
- Automatic Failover:
- If the primary instance fails, AWS DMS automatically switches to the standby replica with minimal interruption to twtech migration tasks, ensuring operational continuity.
- Synchronous Replication:
- Data is synchronously replicated from the primary to the standby instance, ensuring data redundancy and minimal data loss in case of a failure.
- Reduced Downtime:
- Planned maintenance sessions (e.g., OS patching) are applied to the standby first, followed by an automatic failover, which minimizes overall downtime.
- Improved Resilience:
- This configuration is ideal for production environments and long-running replications where fault tolerance is essential.
Management & Configuration
- Creation:
- When creating a new replication instance, twtech can simply select the "Multi-AZ" option in the AWS Management Console.
- Modification:
- twtech can convert an existing Single-AZ replication instance to a Multi-AZ deployment by modifying the instance settings via the console, AWS CLI, or AWS DMS API.
- These changes can be applied immediately or during the next maintenance window.
- Monitoring:
- The Multi-AZ configuration status can be verified in the "Overview" tab of the replication instance details in the AWS DMS console.
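The modification path described above can be exercised from the CLI; a minimal sketch, assuming a placeholder replication instance ARN (look the real ARN up first with `aws dms describe-replication-instances`):

```shell
# Convert an existing Single-AZ replication instance to Multi-AZ.
# The ARN below is a placeholder.
aws dms modify-replication-instance \
  --replication-instance-arn arn:aws:dms:us-east-1:123456789012:rep:EXAMPLE \
  --multi-az \
  --apply-immediately
```

Omitting `--apply-immediately` queues the change for the next maintenance window, matching the console behaviour described above.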
Key Considerations
- Cost:
- Multi-AZ deployment is more cost-intensive than Single-AZ due to the extra resources required for the standby instance.
- Read Replicas:
- The standby replica in a Multi-AZ DMS deployment cannot be used for read operations; it exists purely for failover support.
- Manual Failover:
- While AWS DMS automatically handles most failures, the failover process itself requires the application to reconnect, which is not entirely transparent to the application layer and may require some network configuration adjustments.
1) What “Multi-AZ” for DMS means
- Primary idea:
- A DMS replication instance runs in a primary AZ with an automatic standby replica in another AZ within the same AWS Region. If the primary replication instance fails, DMS promotes the standby so the replication task can continue with minimal interruption.
- Components:
- Source database (on-prem, RDS, EC2, etc.) → DMS replication instance (primary + standby across AZs) → Target database (RDS/Aurora/EC2/Redshift/S3).
- Replication tasks run on the replication instance (full load, CDC, or both).
- The standby contains the replication instance state to allow failover.
- Network layout:
- Ensure replication instance has network access to source and target:
- VPC subnets in at least two private subnets spanning AZs;
- appropriate security groups,
- routing,
- NAT gateway if the source is internet-reachable, or TGW/VPN/Direct Connect if the source is on-prem.
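The subnet requirement above can be satisfied up front with a replication subnet group; a hedged sketch with hypothetical subnet IDs (the group identifier matches the one used in the creation example later in this guide):

```shell
# Create a DMS replication subnet group spanning two AZs.
# The subnet IDs are placeholders for private subnets in
# two different Availability Zones.
aws dms create-replication-subnet-group \
  --replication-subnet-group-identifier twtech-dms-subnet-group \
  --replication-subnet-group-description "twtech private subnets across 2 AZs" \
  --subnet-ids subnet-0aaa1111 subnet-0bbb2222
```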
2) Failover behaviour, RPO / RTO expectations
- Failover mechanics:
- AWS automatically fails over the DMS replication instance to the standby. Active replication tasks are restarted on the standby instance.
- RTO (typical):
- usually seconds to a few minutes — depends on task complexity, size of in-flight transactions, and how quickly tasks restart.
- Plan for a brief interruption while tasks restart and re-establish connections.
- RPO:
- Largely depends on the CDC pipeline: if CDC was streaming, recent committed transactions are generally applied.
- However, there can be a small window of unreplicated transactions during failover.
- Test to quantify for your workload.
- NB:
- Multi-AZ for DMS improves availability of the replication instance, not of the source or target databases.
- If source/target are RDS Multi-AZ/Aurora clusters, those have their own failover semantics.
3) Recommended configuration & sizing
- Replication instance class:
- choose based on CPU, memory, and network needs.
- For heavy CDC / large transactions use dms.r5.xlarge or above; for small workloads dms.t3.medium or dms.r5.large may suffice.
- Benchmark with realistic load.
- Storage:
- set enough EBS storage for cache, cached transactions and task logs. Use higher IOPS if you have high write throughput.
- Multi-AZ:
- enable via the --multi-az flag (CLI) or the checkbox in the console.
- Subnet group:
- place the replication instance in a subnet group that includes at least two private subnets in different AZs.
- Security groups:
- allow egress to source and target DB endpoints and allow access from management hosts if needed.
- Auto minor version upgrade: enable to keep the agent patched (test in dev).
- MaxFileSize / task settings: tune based on target and transaction flow to avoid huge memory/input/output operations per second (IOPS) spikes.
4) Creating a Multi-AZ replication instance (sample AWS CLI that creates a replication instance)

aws dms create-replication-instance \
  --replication-instance-identifier twtech-dms-repl-1 \
  --replication-instance-class dms.r5.large \
  --allocated-storage 100 \
  --vpc-security-group-ids sg-0abc12345 \
  --replication-subnet-group-identifier twtech-dms-subnet-group \
  --multi-az \
  --no-publicly-accessible \
  --tags Key=Env,Value=prod

CloudFormation (snippet):

MyReplicationInstance:
  Type: AWS::DMS::ReplicationInstance
  Properties:
    ReplicationInstanceIdentifier: twtech-dms-repl-1
    ReplicationInstanceClass: dms.r5.large
    AllocatedStorage: 100
    VpcSecurityGroupIds:
      - sg-0abc12345
    ReplicationSubnetGroupIdentifier: !Ref twtechSubnetGroup
    MultiAZ: true
    PubliclyAccessible: false
5) Task configuration (full load + CDC common pattern)
- Strategy:
- perform an initial full load, then enable CDC to capture changes (this is standard for migrations with minimal downtime).
- Task settings:
- tune FullLoadParallelism, MaxFullLoadSubtasks (for some engines), commit frequency, and apply mapping rules.
- Maintain consistent primary key presence on tables to ensure CDC ordering and idempotence.
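The full-load-plus-CDC pattern above maps to a single replication task; a sketch with placeholder ARNs and assumed local files (table-mappings.json and task-settings.json are hypothetical):

```shell
# Create a task that does an initial full load, then streams CDC changes.
# All ARNs and file names are placeholders.
aws dms create-replication-task \
  --replication-task-identifier twtech-fullload-cdc \
  --migration-type full-load-and-cdc \
  --source-endpoint-arn arn:aws:dms:us-east-1:123456789012:endpoint:SRC \
  --target-endpoint-arn arn:aws:dms:us-east-1:123456789012:endpoint:TGT \
  --replication-instance-arn arn:aws:dms:us-east-1:123456789012:rep:EXAMPLE \
  --table-mappings file://table-mappings.json \
  --replication-task-settings file://task-settings.json
```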
6) Monitoring & alerting (what
to watch)
- CloudWatch metrics to monitor (important ones):
- ReplicationLatency (how far behind the target is)
- FullLoadThroughput / CDCThroughput
- CPUUtilization (replication instance)
- FreeableMemory
- DiskQueueDepth / storage usage (if available)
- ReplicationTasksStopped / StoppedReplicationTasks
- DMS Events & logs:
- Subscribe to DMS events (task stopped, error, failover).
- Enable CloudWatch Logs for task logs for deeper troubleshooting.
- Alarms to create (examples):
- ReplicationLatency > acceptable threshold for N minutes → Pager/Slack
- StoppedReplicationTasks > 0 → alert immediately
- CPUUtilization > 80% for a sustained period → scaling/upgrade plan
- FreeableMemory low → investigation
- Dashboards: create a DMS dashboard showing tasks, latency, CPU, and replication throughput.
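One of the alarms above can be sketched as follows; this assumes the per-task latency metric CDCLatencyTarget in the AWS/DMS namespace and a hypothetical SNS topic (confirm exact metric names and dimensions against the DMS monitoring documentation):

```shell
# Alarm when target-apply latency stays above 60 seconds for 5 minutes.
# Identifiers, topic ARN and thresholds are placeholders.
aws cloudwatch put-metric-alarm \
  --alarm-name twtech-dms-cdc-latency-high \
  --namespace AWS/DMS \
  --metric-name CDCLatencyTarget \
  --dimensions Name=ReplicationInstanceIdentifier,Value=twtech-dms-repl-1 \
               Name=ReplicationTaskIdentifier,Value=twtech-fullload-cdc \
  --statistic Average \
  --period 60 \
  --evaluation-periods 5 \
  --threshold 60 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:twtech-dms-alerts
```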
7) Testing & runbook (practical tests
to prove availability)
- Planned failover test:
- Run full load + CDC baseline and confirm low steady-state ReplicationLatency.
- Simulate primary failure (either stop the replication instance or introduce a networking block) and observe failover to the standby.
- Measure time to resume and check for data loss (RPO) by confirming source and target data consistency for recent transactions.
- Validate application behaviour during failover (reconnect logic, retries).
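One way to trigger the planned failover test above without touching networking is a forced-failover reboot; a sketch assuming a placeholder ARN (`--force-failover` applies to Multi-AZ replication instances):

```shell
# Reboot with failover to the standby instead of a simple restart.
# The ARN is a placeholder.
aws dms reboot-replication-instance \
  --replication-instance-arn arn:aws:dms:us-east-1:123456789012:rep:EXAMPLE \
  --force-failover
```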
- Unplanned failure test:
- run the same test while injecting realistic transactional load.
- Rehearse rollback/resync:
- if a task fails to resume cleanly, document steps to restart the task with saved checkpoints or to re-run full load for subsets.
- Runbook snippet (on failover detection):
- Alert triggers → on-call acknowledges.
- Check DMS console: replication instance status and events.
- Check CloudWatch logs for task restart errors.
- If tasks are stopped: try restarting the task. If errors persist, collect logs and escalate.
- If the replication instance is not recoverable: create a new replication instance from the last saved task settings and reattach endpoints (keep the same table mapping and saved positions where possible).
8) Common failure modes & troubleshooting
- Task restart failures after failover
- Symptoms: tasks still stopped or errored after failover.
- Actions: review task logs in CloudWatch; check endpoint connectivity, credentials, or schema drift; ensure endpoints are reachable from the new AZ.
- Replication lag spike
- Causes: bursty writes on source, insufficient replication instance CPU, network bandwidth limits, target apply bottleneck.
- Actions: increase instance class, scale target performance, tune apply parallelism.
- Connection timeouts
- Check SGs, NACLs, route tables, NAT/GWs.
- If the source is on-prem, test VPN/TGW/Direct Connect reachability.
- Large DDLs / schema changes
- DDL might block or slow CDC depending on engine.
- Prefer to apply schema changes in maintenance windows, or test DMS handling for expected DDLs.
- Disk pressure / low storage
- Increase allocated storage — DMS will not always auto-expand. Monitor disk usage.
9) Security & compliance
- Encryption:
- use encryption at rest for replication instance storage (KMS) and enable SSL/TLS for endpoint connections if supported.
- IAM:
- use least-privilege IAM roles for DMS tasks and for any integration with CloudWatch/S3.
- Secrets:
- store DB credentials in AWS Secrets Manager and reference them from DMS endpoints.
- Network isolation:
- place replication instances in private subnets; avoid public access unless explicitly required.
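The Secrets Manager recommendation above can be wired into an endpoint at creation time; a hedged sketch for a MySQL source with placeholder ARNs (DMS reads the credentials from the secret via the access role):

```shell
# Source endpoint that pulls credentials from Secrets Manager
# instead of embedding a username/password. ARNs are placeholders.
aws dms create-endpoint \
  --endpoint-identifier twtech-source-mysql \
  --endpoint-type source \
  --engine-name mysql \
  --my-sql-settings '{
    "SecretsManagerSecretId": "arn:aws:secretsmanager:us-east-1:123456789012:secret:twtech/db-creds",
    "SecretsManagerAccessRoleArn": "arn:aws:iam::123456789012:role/twtech-dms-secrets-role"
  }'
```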
10) Limits & gotchas
- DMS Multi-AZ protects the replication instance, not the source/target DBs. twtech still needs Multi-AZ / clustering on source/target for their availability.
- Cost:
- Multi-AZ roughly doubles replication instance cost because of the standby (twtech pays for the HA capability). Consider cost vs Service Level Agreements (SLAs).
- Replication instance upgrade:
- during major version upgrades some tasks may be interrupted — plan maintenance windows and test.
- Large initial full loads:
- may require temporary scaling of replication instance and target resources to keep time reasonable.
11) Operational checklist (quick)
- Subnet group spans ≥2 AZs
- Multi-AZ checkbox/flag enabled on replication instance
- Sufficient instance class and storage
- Endpoints tested (source → replication, replication → target)
- CloudWatch metrics & alarms configured
- Task logging enabled (CloudWatch Logs)
- Secrets Manager for credentials
- Disaster/recovery playbook (tested)
12) Sample troubleshooting commands & quick checks
- Check replication instance status (CLI):
aws dms describe-replication-instances \
  --filters Name=replication-instance-id,Values=twtech-dms-repl-1
- List tasks & task status:
aws dms describe-replication-tasks
- Get task logs (CloudWatch via console or CLI):
aws logs filter-log-events \
--log-group-name "/aws/dms/task/twtech-task-name" --limit 50
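Another quick check worth keeping alongside the commands above is endpoint connectivity from the replication instance itself (ARNs are placeholders):

```shell
# Kick off a connectivity test from the replication instance to an endpoint,
# then poll the result. ARNs are placeholders.
aws dms test-connection \
  --replication-instance-arn arn:aws:dms:us-east-1:123456789012:rep:EXAMPLE \
  --endpoint-arn arn:aws:dms:us-east-1:123456789012:endpoint:SRC

aws dms describe-connections \
  --filters Name=endpoint-arn,Values=arn:aws:dms:us-east-1:123456789012:endpoint:SRC
```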
Addendum