Wednesday, November 12, 2025

AWS Direct Connect Resiliency | Deep Dive.


A deep dive into AWS Direct Connect Resiliency.

Scope:

  •        Architectures,
  •        Design patterns,
  •        Best practices,
  •        AWS recommendations.

Breakdown:

  •        Overview,
  •        Resiliency Dimensions,
  •        Resiliency Models (AWS-Defined),
  •        Resiliency Architectures,
  •        Key Components in Design,
  •        Best Practices,
  •        Failover and Convergence Behavior,
  •        Sample Architectures,
  •        Validation Tools,
  •        Summary.

 Overview

  •        AWS Direct Connect (DX) provides a dedicated, private network connection from the twtech on-premises environment to AWS.
  •        DX improves performance, predictability, security, and resiliency over internet-based connectivity.
  •        DX is a physical service, so it remains exposed to fiber cuts, power failures, and router faults; resiliency has to be designed in.

 Resiliency Dimensions

Layer          | Description                       | Typical Failure
Physical       | Fiber, optics, cross-connects     | Fiber cut, equipment fault
Logical (BGP)  | BGP session resilience            | BGP flaps, misconfig, route withdraw
Regional       | AWS POP or Region issue           | Device/POP outage
Provider       | Partner circuit or carrier fault  | Provider-level disruption

 Resiliency Models (AWS-Defined)

NB:

AWS guidance, centered on the Direct Connect Resiliency Toolkit, groups designs into the four resiliency models below.

Model                                  | Description                                                   | SLA     | Use Case
1. High Resiliency (Single Location)   | Two dedicated connections in one location, separate devices  | 99.9%   | Cost-effective single-site design
2. Maximum Resiliency (Dual Location)  | Two dedicated connections in two different DX locations      | 99.99%  | Mission-critical production
3. Development / Test                  | Single connection, no redundancy                             | None    | Non-prod workloads
4. Combined Resiliency (Hybrid)        | One DX + one VPN for backup                                  | 99.9%+  | Low-cost redundancy
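A quick way to see which model an account actually matches is to group the existing connections by DX location and terminating AWS device. Below is a minimal boto3 sketch (read-only; it assumes default credentials and region, and no particular connection names):

import boto3
from collections import defaultdict

# Group Direct Connect connections by DX location and by the AWS device
# they terminate on; this maps roughly onto the resiliency models above.
dx = boto3.client("directconnect")

by_location = defaultdict(list)
for conn in dx.describe_connections()["connections"]:
    by_location[conn["location"]].append(conn)

for location, conns in by_location.items():
    devices = {c.get("awsDeviceV2", "unknown") for c in conns}
    print(f"{location}: {len(conns)} connection(s) on {len(devices)} device(s)")

# Rough interpretation:
#   one location, two devices   -> High Resiliency (single location)
#   two or more locations       -> Maximum Resiliency (dual location)
#   one location, one device    -> Development / Test (no redundancy)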

 Resiliency Architectures

 A. High Resiliency Single Location

  •         Two physical connections to different DX routers in the same location
  •         BGP active/active or active/passive
  •         Same AWS Region (one DXGW or VIF redundancy)

Pros: Simple, cost-efficient
Cons: Both links depend on the same site; a site-level failure means a full outage

 B. Maximum Resiliency Dual Location

  •         Two DX locations, each with two routers and separate provider circuits
  •         Connected to different AWS edge routers and different facilities
  •         Typically uses DX Gateway (DXGW) to aggregate to VPCs across Regions

Pros: Full isolation — diverse paths, devices, and POPs
Cons: Higher cost and complexity

Typical BGP Setup:

  •         Two AWS DXGWs
  •         Redundant virtual interfaces (VIFs)
  •         BGP active/active (with AS-PATH prepend or MED tuning)
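To confirm that both paths are actually usable, the redundant VIFs and their BGP sessions can be listed per Direct Connect gateway. A minimal boto3 sketch (field names as returned by describe_virtual_interfaces; no specific VIF or gateway IDs are assumed):

import boto3
from collections import defaultdict

dx = boto3.client("directconnect")

# Group virtual interfaces by the Direct Connect gateway they attach to,
# and show the BGP peer status for each one.
vifs_by_dxgw = defaultdict(list)
for vif in dx.describe_virtual_interfaces()["virtualInterfaces"]:
    vifs_by_dxgw[vif.get("directConnectGatewayId", "no-dxgw")].append(vif)

for dxgw_id, vifs in vifs_by_dxgw.items():
    print(f"DXGW {dxgw_id}:")
    for vif in vifs:
        peers = ", ".join(
            f"{p['addressFamily']}={p['bgpStatus']}" for p in vif["bgpPeers"]
        )
        print(f"  {vif['virtualInterfaceId']} on {vif['connectionId']}: "
              f"{vif['virtualInterfaceState']} (BGP: {peers})")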

 C. VPN + Direct Connect (Hybrid)

  •         Combine AWS Site-to-Site VPN (IPSec over internet) as backup for DX
  •         Use BGP local-preference to prefer DX path
  •         On DX failure, routes automatically shift to VPN

Pros: Low-cost backup
Cons: Latency and throughput degrade during failover

  • Ideal For: Medium-criticality workloads that can tolerate a temporary latency increase (a quick VPN status check is sketched below)
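Before relying on the VPN as a backup, its tunnels should be verifiably up. A minimal boto3 sketch of that check (it assumes a Site-to-Site VPN already exists; no specific VPN IDs are implied):

import boto3

ec2 = boto3.client("ec2")

# Report the state of every Site-to-Site VPN connection and its two tunnels.
for vpn in ec2.describe_vpn_connections()["VpnConnections"]:
    print(f"{vpn['VpnConnectionId']}: {vpn['State']}")
    for tunnel in vpn.get("VgwTelemetry", []):
        print(f"  tunnel {tunnel['OutsideIpAddress']}: {tunnel['Status']} "
              f"({tunnel.get('StatusMessage', '')})")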

 D. Partner Redundancy (Hosted Connection)

  •         Two Hosted Connections via different AWS Partner Networks
  •         Each hosted connection from separate LOA-CFA and partner edge routers

NB:

  • Important: Ensure partners have separate backhaul and upstream paths, not just logical redundancy.
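Partner diversity on the AWS side can at least be sanity-checked from connection metadata, although backhaul diversity still has to be confirmed with each provider directly. A short boto3 sketch (hosted connections expose the partner name on the connection object):

import boto3
from collections import Counter

dx = boto3.client("directconnect")

# Count connections per partner; two hosted connections behind the same
# partner often share backhaul risk that the API alone cannot reveal.
partners = Counter(
    conn.get("partnerName") or "dedicated/unknown"
    for conn in dx.describe_connections()["connections"]
)
for partner, count in partners.items():
    print(f"{partner}: {count} connection(s)")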

 Key Components in Design

Component                      | Description
LOA-CFA                        | Letter of Authorization and Connecting Facility Assignment; the document used for cross-connect setup
DXGW (Direct Connect Gateway)  | Aggregates multiple VIFs and enables multi-Region VPC connectivity
VGW (Virtual Private Gateway)  | Legacy attachment to a single VPC
Transit Gateway (TGW)          | Enables multi-VPC, scalable routing
VIF (Virtual Interface)        | Logical connection (Private, Public, or Transit) over DX
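These components can be enumerated programmatically to document how DXGWs, VIFs, and attached gateways fit together. A minimal boto3 sketch (read-only calls; no specific gateway IDs are assumed):

import boto3

dx = boto3.client("directconnect")

# List each Direct Connect gateway and the VGW/TGW associations behind it.
for gw in dx.describe_direct_connect_gateways()["directConnectGateways"]:
    gw_id = gw["directConnectGatewayId"]
    print(f"DXGW {gw_id} ({gw.get('directConnectGatewayName', 'unnamed')})")
    assocs = dx.describe_direct_connect_gateway_associations(
        directConnectGatewayId=gw_id
    )["directConnectGatewayAssociations"]
    for assoc in assocs:
        target = assoc.get("associatedGateway", {})
        print(f"  -> {target.get('type')} {target.get('id')} "
              f"in {target.get('region')} [{assoc['associationState']}]")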

 Best Practices

 Redundancy:

  •         Use two connections in different DX locations
  •         Each on separate routers and provider backhauls

Routing:

  •         Use BGP with distinct ASNs and MD5 authentication
  •         Set BGP timers for appropriate failover speed
  •         Use AS-PATH prepend or Local Pref to control primary vs backup paths

 Monitoring:

  •         Use CloudWatch metrics and AWS Health Dashboard
  •         Enable VPC Flow Logs for path validation
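For example, the AWS/DX ConnectionState metric (1 = up, 0 = down) can be polled per connection and alarmed on. A minimal boto3 sketch (the connection ID below is a placeholder):

import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")

# ConnectionState in the AWS/DX namespace is 1 while the link is up, 0 when down.
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/DX",
    MetricName="ConnectionState",
    Dimensions=[{"Name": "ConnectionId", "Value": "dxcon-EXAMPLE"}],  # placeholder ID
    StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    EndTime=datetime.now(timezone.utc),
    Period=300,
    Statistics=["Minimum"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    state = "up" if point["Minimum"] == 1 else "down at least once"
    print(f"{point['Timestamp']}: {state}")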

Testing:

  •         Periodically simulate link failure and measure convergence time
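The Resiliency Toolkit automates this as a BGP failover test: AWS takes down the BGP session on a chosen virtual interface for a set duration so convergence onto the surviving DX or VPN path can be measured. A sketch with boto3 (the VIF ID is a placeholder; run it only against non-critical paths or during a maintenance window):

import boto3

dx = boto3.client("directconnect")

# Ask AWS to bring the BGP session(s) on this VIF down for 10 minutes,
# then measure how long traffic takes to converge onto the backup path.
test = dx.start_bgp_failover_test(
    virtualInterfaceId="dxvif-EXAMPLE",  # placeholder VIF ID
    testDurationInMinutes=10,
)
print(test["virtualInterfaceTest"]["status"])

# Review running/past tests; a test can also be stopped early.
history = dx.list_virtual_interface_test_history(virtualInterfaceId="dxvif-EXAMPLE")
for t in history["virtualInterfaceTestHistory"]:
    print(t["testId"], t["status"], t["startTime"], t.get("endTime"))
# dx.stop_bgp_failover_test(virtualInterfaceId="dxvif-EXAMPLE")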

 DNS:

  •         Use Route 53 health checks for endpoint-level failover (when applicable)

 Failover and Convergence Behavior

Scenario                | Failover Mechanism        | Typical Time
Physical link loss      | BGP session drops         | 20–40 seconds
BGP route withdraw      | Re-advertisement          | 10–30 seconds
Location-level failure  | Depends on architecture   | <60 seconds (well-tuned)

 Sample Architectures

  • Maximum Resiliency – Dual DX Locations (Recommended Production Design)

On-Premises Data Center
Two separate circuits via different providers
Two AWS DX Locations (e.g., Columbus + California)
Each connects to separate DX Routers and DXGWs
DXGW connects to Transit Gateway
TGW distributes to multiple VPCs across regions.
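The DXGW-to-TGW hub in this design can be associated through the API as well as the console. A minimal boto3 sketch (all IDs and prefixes are placeholders; the transit VIFs are assumed to already exist):

import boto3

dx = boto3.client("directconnect")

# Associate a Direct Connect gateway with a Transit Gateway. The allowed
# prefixes are the AWS-side routes advertised over DX to on premises.
dx.create_direct_connect_gateway_association(
    directConnectGatewayId="dxgw-EXAMPLE",  # placeholder DXGW ID
    gatewayId="tgw-EXAMPLE",                # placeholder Transit Gateway ID
    addAllowedPrefixesToDirectConnectGateway=[
        {"cidr": "10.0.0.0/16"},            # placeholder VPC CIDR range
    ],
)
# Repeat for the second DXGW so each DX location has its own association.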

 Validation Tools

  •         AWS Direct Connect Resiliency Toolkit (console-based)
  •         Network Performance Dashboard
  •         AWS Reachability Analyzer
  •         Traceroute and BGP (Border Gateway Protocol) path checks
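Reachability Analyzer can confirm that a VPC-side path exists toward the Transit Gateway that fronts the DXGW (it analyzes VPC constructs, not the physical DX circuit). A minimal boto3 sketch with placeholder IDs:

import boto3

ec2 = boto3.client("ec2")

# Define a path from a source instance toward the Transit Gateway, then analyze it.
path = ec2.create_network_insights_path(
    Source="i-0123456789abcdef0",         # placeholder source instance
    Destination="tgw-0123456789abcdef0",  # placeholder Transit Gateway
    Protocol="tcp",
    DestinationPort=443,
)["NetworkInsightsPath"]

analysis = ec2.start_network_insights_analysis(
    NetworkInsightsPathId=path["NetworkInsightsPathId"]
)["NetworkInsightsAnalysis"]

print(analysis["NetworkInsightsAnalysisId"], analysis["Status"])
# Poll describe_network_insights_analyses() until the status is "succeeded",
# then inspect NetworkPathFound and the hop-by-hop explanations.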

 Summary

Requirement                        | Recommended Pattern
Non-critical / dev                 | Single DX
Prod (low criticality)             | DX + VPN backup
Mission-critical                   | Dual DX (dual sites)
Compliance or financial workloads  | Dual DX + dual providers + VPN tertiary path

 
