A deep dive into AWS Direct Connect Resiliency.
Scope:
- Architectures,
- Design patterns,
- Best practices,
- AWS recommendations.
Breakdown:
- Overview,
- Resiliency Dimensions,
- Resiliency Models (AWS-Defined),
- Resiliency Architectures,
- Key Components in Design,
- Best Practices,
- Failover and Convergence Behavior,
- Sample Architectures,
- Validation Tools,
- Summary.
Overview
- AWS Direct Connect (DX) provides a dedicated, private network connection from the twtech on-premises environment to AWS.
- AWS Direct Connect (DX) improves performance, predictability, security, and resiliency.
- DX is a physical service, and is therefore subject to fiber cuts, power failures, and router faults.
Resiliency Dimensions
| Layer | Description | Typical Failure |
| Physical | Fiber, optics, cross-connects | Fiber cut, equipment fault |
| Logical (BGP) | BGP session resilience | BGP flaps, misconfig, route withdraw |
| Regional | AWS POP or Region issue | Device/POP outage |
| Provider | Partner circuit or carrier fault | Provider-level disruption |
Resiliency Models (AWS-Defined)
NB:
- AWS offers four resiliency models in the Direct Connect Resiliency Toolkit.
| Model | Description | SLA | Use Case |
| 1. High Resiliency (Single Location) | Two dedicated connections in one location, on separate devices | 99.9% | Cost-effective single-site design |
| 2. Maximum Resiliency (Dual Location) | Two dedicated connections in two different DX locations | 99.99% | Mission-critical production |
| 3. Development / Test | Single connection, no redundancy | None | Non-prod workloads |
| 4. Combined Resiliency (Hybrid) | One DX + one VPN for backup | 99.9%+ | Low-cost redundancy |
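A quick way to see which model an existing deployment resembles is to inventory connections per DX location. The sketch below is a minimal boto3 example (the region is a placeholder) that groups the output of describe_connections by location.

    import boto3
    from collections import defaultdict

    # Assumes credentials are configured; the region is only an example.
    dx = boto3.client("directconnect", region_name="us-east-1")

    by_location = defaultdict(list)
    for conn in dx.describe_connections()["connections"]:
        by_location[conn["location"]].append(
            (conn["connectionName"], conn["connectionState"], conn["awsDeviceV2"])
        )

    # Two connections at one location ~ High Resiliency;
    # connections spread across two locations ~ Maximum Resiliency;
    # a single connection ~ Development / Test.
    for location, conns in by_location.items():
        print(location, conns)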
Resiliency Architectures
A. High Resiliency – Single Location
- Two physical connections to different DX routers in the same location
- BGP active/active or active/passive
- Same AWS Region (one DXGW, with redundant VIFs)
Pros: Simple, cost-efficient
Cons: Both links depend on the same site, so a site-level failure causes an outage
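As a rough illustration, ordering the two connections for this model can be scripted with boto3. The connection names, location code, and bandwidth below are placeholders; device diversity within the site is ensured by AWS when the request is made through the Resiliency Toolkit.

    import boto3

    dx = boto3.client("directconnect", region_name="us-east-1")

    # Two dedicated connections at the same DX location (location code is an example).
    for name in ("dx-primary", "dx-secondary"):
        conn = dx.create_connection(
            location="EqDC2",
            bandwidth="1Gbps",
            connectionName=name,
        )
        print(conn["connectionId"], conn["connectionState"])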
B. Maximum Resiliency – Dual Location
- Two DX locations, each with two routers and separate provider circuits
- Connected to different AWS edge routers and different facilities
- Typically uses DX Gateway (DXGW) to aggregate to VPCs across Regions
Pros: Full isolation (diverse paths, devices, and POPs)
Cons: Higher cost and complexity
Typical BGP Setup:
- Two AWS DXGWs
- Redundant virtual interfaces (VIFs)
- BGP active/active (with AS-PATH prepend or MED tuning)
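A minimal boto3 sketch of this wiring is shown below: one DX gateway plus a private VIF per physical connection (the same pattern can be repeated against a second DXGW if full gateway separation is desired). All IDs, VLANs, ASNs, and keys are placeholders. Note that AS-PATH prepending and MED tuning live in the customer router configuration, not in these API calls.

    import boto3

    dx = boto3.client("directconnect", region_name="us-east-1")

    dxgw = dx.create_direct_connect_gateway(
        directConnectGatewayName="prod-dxgw",
        amazonSideAsn=64512,                     # private ASN on the AWS side
    )["directConnectGateway"]

    # One private VIF per physical connection, both attached to the same DXGW.
    for conn_id, vlan in (("dxcon-aaaaaaaa", 101), ("dxcon-bbbbbbbb", 102)):
        vif = dx.create_private_virtual_interface(
            connectionId=conn_id,
            newPrivateVirtualInterface={
                "virtualInterfaceName": f"prod-vif-{vlan}",
                "vlan": vlan,
                "asn": 65001,                    # on-premises BGP ASN
                "authKey": "example-md5-key",    # BGP MD5 authentication
                "addressFamily": "ipv4",
                "directConnectGatewayId": dxgw["directConnectGatewayId"],
            },
        )
        print(vif["virtualInterfaceId"], vif["virtualInterfaceState"])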
C. VPN + Direct Connect (Hybrid)
- Use AWS Site-to-Site VPN (IPsec over the internet) as a backup for DX
- Use BGP local-preference to prefer the DX path
- On DX failure, routes automatically shift to the VPN
Pros: Low-cost backup
Cons: Latency and throughput degrade during failover
- Ideal for: Medium-criticality workloads that can tolerate a temporary latency increase
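A hedged sketch of the backup leg, assuming a Transit Gateway already exists and that the public IP, ASN, and TGW ID are placeholders: create a customer gateway and an IPsec Site-to-Site VPN attached to the same TGW the DX path reaches. AWS prefers Direct Connect over VPN for the same prefixes by default; on the customer side the DX path is preferred with local-preference.

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    cgw = ec2.create_customer_gateway(
        BgpAsn=65001,                     # on-premises ASN (placeholder)
        PublicIp="203.0.113.10",          # on-premises VPN endpoint (placeholder)
        Type="ipsec.1",
    )["CustomerGateway"]

    vpn = ec2.create_vpn_connection(
        CustomerGatewayId=cgw["CustomerGatewayId"],
        Type="ipsec.1",
        TransitGatewayId="tgw-0123456789abcdef0",   # placeholder TGW ID
        Options={"StaticRoutesOnly": False},        # dynamic (BGP) routing
    )["VpnConnection"]
    print(vpn["VpnConnectionId"])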
D. Partner Redundancy (Hosted Connection)
- Two Hosted Connections via different AWS Partner Networks
- Each hosted connection delivered under a separate LOA-CFA and terminating on separate partner edge routers
NB:
- Ensure partners have separate backhaul and upstream paths, not just logical redundancy.
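One way to sanity-check diversity from the AWS side (a sketch only; it cannot see the partners' backhaul) is to compare the partner, location, and AWS device behind each hosted connection:

    import boto3

    dx = boto3.client("directconnect", region_name="us-east-1")

    for c in dx.describe_connections()["connections"]:
        print(
            c["connectionId"],
            c.get("partnerName"),          # AWS Partner delivering the hosted connection
            c["location"],                 # DX location
            c.get("awsDeviceV2"),          # AWS endpoint device
            c.get("hasLogicalRedundancy"),
        )
    # If two hosted connections share the same awsDeviceV2 or location,
    # they do not provide device- or site-level redundancy.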
Key Components in Design
| Component | Description |
| LOA-CFA | Letter of Authorization – Connecting Facility Assignment; the document used for cross-connect setup |
| DXGW (Direct Connect Gateway) | Aggregates multiple VIFs and enables multi-Region VPC connectivity |
| VGW (Virtual Private Gateway) | Legacy attachment to a single VPC |
| Transit Gateway (TGW) | Enables multi-VPC, scalable routing |
| VIF (Virtual Interface) | Logical connection (Private, Public, or Transit) over DX |
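To illustrate how these components fit together, the sketch below associates a DXGW with a Transit Gateway and limits the prefixes advertised toward on-premises. The gateway IDs and CIDR are placeholders.

    import boto3

    dx = boto3.client("directconnect", region_name="us-east-1")

    assoc = dx.create_direct_connect_gateway_association(
        directConnectGatewayId="11111111-2222-3333-4444-555555555555",  # placeholder DXGW ID
        gatewayId="tgw-0123456789abcdef0",                              # TGW (or VGW) ID
        addAllowedPrefixesToDirectConnectGateway=[
            {"cidr": "10.0.0.0/16"},   # prefixes advertised to on-premises over the VIFs
        ],
    )["directConnectGatewayAssociation"]
    print(assoc["associationState"])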
Best Practices
Redundancy:
- Use two connections in different DX locations
- Each on separate routers and provider backhauls
Routing:
- Use BGP with distinct ASNs and MD5 authentication
- Set BGP timers for appropriate failover speed
- Use AS-PATH prepend or Local Pref to control primary vs backup paths
Monitoring:
- Use CloudWatch metrics and AWS Health Dashboard
- Enable VPC Flow Logs for path validation
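For example, a CloudWatch alarm on the AWS/DX ConnectionState metric (1 = up, 0 = down) can page on a connection failure; the connection ID and SNS topic below are placeholders.

    import boto3

    cw = boto3.client("cloudwatch", region_name="us-east-1")

    cw.put_metric_alarm(
        AlarmName="dx-primary-connection-down",
        Namespace="AWS/DX",
        MetricName="ConnectionState",
        Dimensions=[{"Name": "ConnectionId", "Value": "dxcon-aaaaaaaa"}],
        Statistic="Minimum",
        Period=60,
        EvaluationPeriods=3,
        Threshold=1,
        ComparisonOperator="LessThanThreshold",   # fires when the connection reports down
        TreatMissingData="breaching",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:network-alerts"],
    )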
Testing:
- Periodically simulate link failure and measure convergence time
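Direct Connect has a built-in failover test that takes the BGP peers on a virtual interface down for a fixed window, which is a safe way to exercise this; the VIF ID and duration below are placeholders.

    import boto3

    dx = boto3.client("directconnect", region_name="us-east-1")

    # Bring the BGP session(s) on this VIF down for 10 minutes; AWS restores them afterwards.
    test = dx.start_bgp_failover_test(
        virtualInterfaceId="dxvif-aaaaaaaa",   # placeholder
        testDurationInMinutes=10,
    )["virtualInterfaceTest"]
    print(test["testId"], test["status"])

    # dx.stop_bgp_failover_test(virtualInterfaceId="dxvif-aaaaaaaa") ends the test early.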
DNS:
- Use Route 53 health checks for endpoint-level failover (when applicable)
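Where DNS-level failover applies, a Route 53 health check can be driven by the CloudWatch alarm from the monitoring sketch above (useful because Route 53 cannot probe private DX endpoints directly). The caller reference and alarm name are assumptions.

    import boto3

    r53 = boto3.client("route53")

    hc = r53.create_health_check(
        CallerReference="dx-primary-path-check-001",   # must be unique per request
        HealthCheckConfig={
            "Type": "CLOUDWATCH_METRIC",
            "AlarmIdentifier": {
                "Region": "us-east-1",
                "Name": "dx-primary-connection-down",  # alarm from the monitoring sketch
            },
            "InsufficientDataHealthStatus": "LastKnownStatus",
        },
    )["HealthCheck"]
    print(hc["Id"])
    # Attach hc["Id"] to the PRIMARY record of a Route 53 failover record pair.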
Failover and Convergence Behavior
| Scenario | Failover Mechanism | Typical Time |
| Physical link loss | BGP session drops | 20–40 seconds |
| BGP route withdraw | Re-advertisement | 10–30 seconds |
| Location-level failure | Depends on architecture | <60 seconds (well-tuned) |
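The times above can be checked empirically by polling the BGP status on the affected virtual interface during a failover test, as in this rough sketch (the VIF ID is a placeholder, and the AWS-side status updates with some lag):

    import time
    import boto3

    dx = boto3.client("directconnect", region_name="us-east-1")

    def bgp_status(vif_id):
        vif = dx.describe_virtual_interfaces(virtualInterfaceId=vif_id)["virtualInterfaces"][0]
        return [peer["bgpStatus"] for peer in vif["bgpPeers"]]   # 'up' / 'down' / 'unknown'

    start = time.time()
    while "down" in bgp_status("dxvif-aaaaaaaa"):
        time.sleep(5)
    print(f"BGP re-converged after ~{time.time() - start:.0f} seconds")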
Sample Architectures
- Maximum Resiliency – Dual DX Locations (Recommended Production Design)
On-Premises Data Center
→ Two separate circuits via different providers
→ Two AWS DX Locations (e.g., Columbus + California)
→ Each connects to separate DX Routers and DXGWs
→ DXGW connects to Transit Gateway
→ TGW distributes to multiple VPCs across Regions.
Validation Tools
- AWS Direct Connect Resiliency Toolkit (console-based)
- Network Performance Dashboard
- AWS Reachability Analyzer
- Traceroute and BGP (Border Gateway Protocol) Path Checks
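Reachability Analyzer can also be driven from code. The sketch below checks a path between two VPC-side resources (for example, a Transit Gateway attachment and an instance); both IDs are placeholders, and the on-premises leg of the DX path is outside Reachability Analyzer's scope.

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    path = ec2.create_network_insights_path(
        Source="tgw-attach-0123456789abcdef0",   # placeholder source resource
        Destination="i-0123456789abcdef0",       # placeholder destination instance
        Protocol="tcp",
        DestinationPort=443,
    )["NetworkInsightsPath"]

    analysis = ec2.start_network_insights_analysis(
        NetworkInsightsPathId=path["NetworkInsightsPathId"]
    )["NetworkInsightsAnalysis"]
    print(analysis["NetworkInsightsAnalysisId"], analysis["Status"])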
Summary
| Requirement | Recommended Pattern |
| Non-critical / dev | Single DX |
| Prod (low criticality) | DX + VPN backup |
| Mission-critical | Dual DX (dual sites) |
| Compliance or financial workloads | Dual DX + dual providers + VPN tertiary path |