Wednesday, November 12, 2025

AWS Direct Connect Resiliency | Deep Dive.



Scope:

  • Overview,
  • Resiliency Dimensions,
  • Resiliency Architectures (Layers, Description & Typical Failure),
  • Resiliency Models (AWS-Defined four resiliency models),
  • Key Components in Design & Description,
  • Best Practices,
  • Failover and Convergence Behavior (Scenario, Failover Mechanism & Typical Time),
  • Sample Architectures,
  • Validation Tools,
  • Summary (Requirements & Recommended Patterns).

 Overview

    • AWS Direct Connect (DX) provides a dedicated, private network connection from twtech's on-premises environment to AWS.
    • Compared with internet-based connectivity, DX improves performance, predictability, and security.
    • DX is a physical service, so it is subject to fiber cuts, power failures, and router faults; resiliency has to be designed in.

 Resiliency Dimensions (Layers, Description & Typical Failure)

| Layer | Description | Typical Failure |
|---|---|---|
| Physical | Fiber, optics, cross-connects | Fiber cut, equipment fault |
| Logical (BGP) | BGP session resilience | BGP flaps, misconfiguration, route withdrawal |
| Regional | AWS POP or Region issue | Device or POP outage |
| Provider | Partner circuit or carrier fault | Provider-level disruption |

 Resiliency Models (AWS-Defined four resiliency models)

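NB:

AWS describes these models through the Direct Connect Resiliency Toolkit and related guidance. The Toolkit's own model names are Maximum Resiliency, High Resiliency, and Development and Test, so the labels below do not map one-to-one; verify SLA figures against the current AWS Direct Connect SLA.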

| Model | Description | SLA | Use Case |
|---|---|---|---|
| 1. High Resiliency (Single Location) | Two dedicated connections in one location, separate devices | 99.9% | Cost-effective single-site design |
| 2. Maximum Resiliency (Dual Location) | Two dedicated connections in two different DX locations | 99.99% | Mission-critical production |
| 3. Development / Test | Single connection, no redundancy | None | Non-prod workloads |
| 4. Combined Resiliency (Hybrid) | One DX + one VPN for backup | 99.9%+ | Low-cost redundancy |
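To put the SLA percentages listed above into concrete terms, the following minimal sketch converts an availability target into a yearly downtime budget. The function name and the 365-day year are assumptions made for illustration; this is plain arithmetic, not an AWS calculation.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def max_downtime_hours(availability_percent: float) -> float:
    """Return the yearly downtime budget implied by an availability target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for sla in (99.9, 99.99):
        hours = max_downtime_hours(sla)
        print(f"{sla}% -> {hours:.2f} h/year ({hours * 60:.1f} min/year)")
```

The gap between 99.9% and 99.99% is roughly a factor of ten in allowed downtime, which is why the dual-location design is the one suggested for mission-critical workloads.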

 Resiliency Architectures

 A. High Resiliency Single Location

    • Two physical connections to different DX routers in same location
    •  BGP active/active or active/passive
    •  Same AWS Region (one DXGW or VIF redundancy)

Pros: Simple, cost-efficient
Cons: Both links depend on the same site, so a site-level failure causes an outage

 B. Maximum Resiliency Dual Location

    • Two DX locations, each with two routers and separate provider circuits
    • Connected to different AWS edge routers and different facilities
    • Typically uses DX Gateway (DXGW) to aggregate to VPCs across Regions

Pros: Full isolation — diverse paths, devices, and POPs
Cons: Higher cost and complexity

Typical BGP Setup:

    •  Two AWS DXGWs
    •   Redundant virtual interfaces (VIFs)
    •   BGP active/active (with AS-PATH prepend or MED tuning)
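The effect of AS-PATH prepending on path choice can be sketched as follows. This is a deliberately simplified model that compares AS-path length only (real BGP evaluates other attributes, such as local preference, first). The ASN 65000 and the route names are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Route:
    name: str
    as_path: Tuple[int, ...]  # sequence of ASNs advertised with the prefix


def prepend(route: Route, asn: int, times: int) -> Route:
    """Return a copy of the route with `asn` prepended `times` times."""
    return Route(route.name, (asn,) * times + route.as_path)


def prefer_shortest(routes):
    """Pick the route with the shortest AS path (ties go to the first listed)."""
    return min(routes, key=lambda r: len(r.as_path))


LOCAL_ASN = 65000  # hypothetical private ASN for the on-premises router

primary = Route("dx-location-a", (LOCAL_ASN,))
# Make the second location less attractive by prepending the local ASN twice.
backup = prepend(Route("dx-location-b", (LOCAL_ASN,)), LOCAL_ASN, 2)

if __name__ == "__main__":
    print(prefer_shortest([backup, primary]).name)
```

With equal path lengths both links can carry traffic (active/active); prepending on one link turns it into a standby for the prefixes concerned.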

 C. VPN + Direct Connect (Hybrid)

    • Use AWS Site-to-Site VPN (IPsec over the internet) as a backup for DX
    • Use BGP local preference to prefer the DX path
    • On DX failure, the DX routes are withdrawn and traffic shifts to the VPN automatically

Pros: Low-cost backup
Cons: Latency and throughput degrade during failover

    • Ideal For: Medium criticality workloads that can tolerate temporary latency increase
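The DX-preferred, VPN-backup behaviour described in this section can be illustrated with a small selection function. It is a sketch under stated assumptions: the local-preference values (200 and 100) and the dictionary layout are hypothetical, and only two BGP tie-breakers are modelled.

```python
def best_path(routes):
    """Pick the usable route with the highest local preference,
    breaking ties by the shortest AS path. Returns None if nothing is up."""
    available = [r for r in routes if r["up"]]
    if not available:
        return None
    return max(available, key=lambda r: (r["local_pref"], -r["as_path_len"]))


routes = [
    {"name": "dx", "local_pref": 200, "as_path_len": 1, "up": True},
    {"name": "vpn", "local_pref": 100, "as_path_len": 1, "up": True},
]

if __name__ == "__main__":
    print(best_path(routes)["name"])  # DX is preferred while it is up
    routes[0]["up"] = False           # simulate the DX route being withdrawn
    print(best_path(routes)["name"])  # traffic falls back to the VPN
```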

 D. Partner Redundancy (Hosted Connection)

    • Two Hosted Connections via different AWS Direct Connect Partners
    • Each hosted connection uses a separate LOA-CFA and separate partner edge routers

NB:

    • Important: Ensure partners have separate backhaul and upstream paths, not just logical redundancy.

 Key Components in Design & Description

| Component | Description |
|---|---|
| LOA-CFA | Letter of Authorization and Connecting Facility Assignment; the document used to set up the cross-connect |
| DXGW (Direct Connect Gateway) | Aggregates multiple VIFs and enables multi-Region VPC connectivity |
| VGW (Virtual Private Gateway) | Legacy attachment to a single VPC |
| Transit Gateway (TGW) | Enables scalable routing across multiple VPCs |
| VIF (Virtual Interface) | Logical connection (Private, Public, or Transit) over DX |

 Best Practices

 Redundancy:

    • Use two connections in different DX locations
    • Each on separate routers and provider backhauls
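One way to check a design against these redundancy points is to look for failure domains that every connection shares. The sketch below is a hypothetical inventory check; the `Connection` fields and the sample values are illustrative and do not come from an AWS API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connection:
    connection_id: str
    location: str   # DX location
    device: str     # terminating router
    provider: str   # carrier or partner backhaul


def shared_failure_domains(connections):
    """Return the attributes for which all connections share a single value."""
    shared = []
    for attr in ("location", "device", "provider"):
        if len({getattr(c, attr) for c in connections}) < 2:
            shared.append(attr)
    return shared


if __name__ == "__main__":
    design = [
        Connection("conn-1", "location-a", "router-1", "carrier-x"),
        Connection("conn-2", "location-a", "router-2", "carrier-y"),
    ]
    # Both links terminate in the same location, so that domain is flagged.
    print(shared_failure_domains(design))
```

An empty result means no single location, device, or provider is common to every link in the inventory.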

Routing:

    • Use BGP with distinct ASNs and MD5 authentication
    • Tune BGP keepalive and hold timers, or enable BFD, to reach the failover speed you need
    • Use AS-PATH prepend or Local Pref to control primary vs backup paths
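How timer choices translate into failure-detection time is simple arithmetic, sketched below. The values used (30 s keepalive with a 3x hold multiplier, and a 300 ms BFD interval with a multiplier of 3) are illustrative assumptions; confirm the values supported on your routers and in the current Direct Connect documentation.

```python
def bgp_hold_time_s(keepalive_s: float, multiplier: int = 3) -> float:
    """Hold time is conventionally a multiple (often 3x) of the keepalive interval."""
    return keepalive_s * multiplier


def bfd_detection_time_s(interval_ms: float, multiplier: int) -> float:
    """BFD declares the peer down after `multiplier` consecutive missed packets."""
    return interval_ms * multiplier / 1000


if __name__ == "__main__":
    # Assumed example values, not measurements.
    print("BGP hold time:", bgp_hold_time_s(30), "s")
    print("BFD detection:", bfd_detection_time_s(300, 3), "s")
```

Detection time is only one part of convergence; route withdrawal and re-selection add to the total.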

 Monitoring:

    • Use CloudWatch metrics and AWS Health Dashboard
    • Enable VPC Flow Logs for path validation
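As a minimal sketch of the CloudWatch point, the function below builds the parameters for a query on the `ConnectionState` metric in the `AWS/DX` namespace. The connection ID `dxcon-example` is hypothetical, and the sketch only constructs the request; passing it to boto3's `get_metric_statistics` requires AWS credentials and is shown as a comment.

```python
from datetime import datetime, timedelta, timezone


def connection_state_query(connection_id, minutes=60, now=None):
    """Build CloudWatch query parameters for a DX connection's up/down state."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/DX",
        "MetricName": "ConnectionState",
        "Dimensions": [{"Name": "ConnectionId", "Value": connection_id}],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": 60,                # one datapoint per minute
        "Statistics": ["Minimum"],   # a minimum of 0 means the link was down in that period
    }


if __name__ == "__main__":
    params = connection_state_query("dxcon-example")  # hypothetical ID
    print(params["Namespace"], params["MetricName"])
    # With credentials configured, the request could be sent as:
    #   import boto3
    #   boto3.client("cloudwatch").get_metric_statistics(**params)
```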

Testing:

    • Periodically simulate link failure and measure convergence time
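Convergence time in such a test can be derived from a timestamped probe log (for example, one ping per second across the link being failed). The sketch below assumes a hypothetical log format of `(timestamp_seconds, success)` tuples in time order.

```python
def convergence_seconds(probes):
    """Return the time from the first failed probe to the next successful one.

    `probes` is a time-ordered list of (timestamp_seconds, success) tuples.
    Returns None if no failure occurred or if traffic never recovered.
    """
    first_failure = None
    for timestamp, ok in probes:
        if not ok and first_failure is None:
            first_failure = timestamp
        elif ok and first_failure is not None:
            return timestamp - first_failure
    return None


if __name__ == "__main__":
    # Hypothetical log: probes fail from t=10 and succeed again at t=32.
    log = [(t, not (10 <= t < 32)) for t in range(0, 40)]
    print(convergence_seconds(log), "seconds")
```

The resolution of the result is limited by the probe interval, so shorter intervals give a tighter measurement.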

 DNS:

    • Use Route 53 health checks for endpoint-level failover (when applicable)

 Failover and Convergence Behavior (Scenario, Failover Mechanism & Typical Time)

| Scenario | Failover Mechanism | Typical Time |
|---|---|---|
| Physical link loss | BGP session drops | 20–40 seconds |
| BGP route withdrawal | Re-advertisement | 10–30 seconds |
| Location-level failure | Depends on architecture | <60 seconds (well-tuned) |

 Sample Architectures

    • Maximum Resiliency – Dual DX Locations (Recommended Production Design)

On-Premises Data Center

Two separate circuits via different providers
Two AWS DX Locations (e.g., Columbus + California)
Each connects to separate DX Routers and DXGWs
DXGW connects to Transit Gateway
TGW distributes to multiple VPCs across regions.
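The dual-location design above can be modelled as a small graph to confirm that on-premises traffic still reaches a VPC when one DX location fails. This is an illustrative sketch: the node names are hypothetical, and the topology is a simplified reading of the layout described here.

```python
from collections import deque

# Directed adjacency list: on-premises -> DX locations -> DXGWs -> TGW -> VPCs.
TOPOLOGY = {
    "on-prem": ["dx-location-1", "dx-location-2"],
    "dx-location-1": ["dxgw-1"],
    "dx-location-2": ["dxgw-2"],
    "dxgw-1": ["tgw"],
    "dxgw-2": ["tgw"],
    "tgw": ["vpc-a", "vpc-b"],
}


def reachable(topology, source, target, failed=frozenset()):
    """Breadth-first search that skips any node listed in `failed`."""
    if source in failed:
        return False
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for neighbour in topology.get(node, []):
            if neighbour not in seen and neighbour not in failed:
                seen.add(neighbour)
                queue.append(neighbour)
    return False


if __name__ == "__main__":
    print(reachable(TOPOLOGY, "on-prem", "vpc-a", failed={"dx-location-1"}))
    print(reachable(TOPOLOGY, "on-prem", "vpc-a",
                    failed={"dx-location-1", "dx-location-2"}))
```

In this model the Transit Gateway is a single node on every path, so failing it makes the VPCs unreachable regardless of how many DX locations are in place.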

 Validation Tools

    • AWS Direct Connect Resiliency Toolkit (console-based)
    • Network Performance Dashboard
    • AWS Reachability Analyzer
    • Traceroute and BGP (Border Gateway Protocol) path checks

 Summary (Requirements & Recommended Patterns)

| Requirement | Recommended Pattern |
|---|---|
| Non-critical / dev | Single DX |
| Prod (low criticality) | DX + VPN backup |
| Mission-critical | Dual DX (dual sites) |
| Compliance or financial workloads | Dual DX + dual providers + VPN tertiary path |

 



