Thursday, November 13, 2025

AWS Site-to-Site VPN Connection as a Backup | Overview.


An Overview of AWS Site-to-Site VPN Connection as a Backup.

View:

  •        Architecture patterns,
  •         Routing mechanics,
  •        Failover behavior,
  •        Best practices,
  •        Monitoring,
  •        Common pitfalls.

Breakdown:

  •        Architecture Overview (Baseline Pattern)
  •        Deployment Models,
  •        Routing & Failover Mechanics,
  •        BGP Priority Hierarchy,
  •        Recommended BGP Tuning,
  •        Failover Behavior,
  •        AWS Recommended Best Practices,
  •        Transit Gateway Design Considerations,
  •        End-to-End Testing & Validation,
  •        Monitoring & Observability,
  •        Throughput Expectations,
  •        Common Pitfalls,
  •        Architecture.

Architecture Overview (Baseline Pattern)

  •        VPN provides inexpensive, encrypted connectivity, while DX or the primary ISP provides high-bandwidth, low-latency transport.

A typical twtech deployment has:

  •         Primary Path: AWS Direct Connect OR private MPLS/SD-WAN circuit
  •         Backup Path: AWS Site-to-Site IPsec VPN
  •         Routing Control: BGP-based failover (preferred) or static routing (less ideal)

AWS Components

  •         Virtual Private Gateway (VGW) or Transit Gateway (TGW)
  •         Direct Connect Gateway (DXGW) when mixing DX with multi-VPC
  •         Customer Router (CSR/ASR/Firewalls/SD-WAN appliance)
  •         AWS Managed VPN Endpoint (two tunnels per VPN connection)

Deployment Models

A. Direct Connect with VPN Backup (Classic Pattern)

  •         DX private VIF is primary
  •         S2S VPN via VGW is backup
  •         BGP multipath disabled (default)
  •         AS_PATH or BGP MED used to prioritize DX

B. SD-WAN/MPLS Primary (VPN Backup)

  •         SD-WAN fabric decides primary path
  •         S2S VPN advertises same prefixes with longer AS_PATH or higher BGP metric

C. Transit Gateway with DX + VPN Backup

  •         DXGW <-> TGW for primary
  •         TGW VPN attachment for backup
  •         Controlled via BGP preference on TGW appliance or AWS VPN

D. Hub-and-Spoke Multi-VPC

  •         VPCs connect via TGW
  •         DX to TGW is primary
  •         VPN to TGW is secondary
  •         Route tables maintain path priority

Routing & Failover Mechanics

  •        Failover design centers on BGP route preference.

BGP Priority Hierarchy

AWS follows standard BGP path selection:

  1.      Longest Prefix Match
  2.      Highest Local Preference
  3.      Shortest AS_PATH (most commonly used for DX/VPN failover)
  4.      MED (Multi-Exit Discriminator)
  5.      eBGP over iBGP
  6.      Lowest IGP cost
  7.      Router ID
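The ordering above can be sketched as a sort key. This is a minimal illustration, not how AWS or any router implements path selection: the `Route` class, its field names, and the sample values are hypothetical, only the first four steps of the list are modeled, and both routes are assumed to cover the destination being looked up.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Route:
    """Hypothetical, simplified view of one advertised path."""
    name: str                       # label for the path, e.g. "DX" or "VPN"
    prefix_len: int                 # length of the advertised prefix
    local_pref: int = 100           # higher is preferred
    as_path: Tuple[int, ...] = ()   # shorter is preferred
    med: int = 0                    # lower is preferred


def preference_key(route: Route) -> Tuple[int, int, int, int]:
    # Smaller tuple means more preferred, so "higher wins" fields are negated.
    return (-route.prefix_len, -route.local_pref, len(route.as_path), route.med)


def best_route(routes: List[Route]) -> Route:
    """Pick the most preferred route among candidates that cover the destination."""
    return min(routes, key=preference_key)


dx = Route("DX", prefix_len=16, as_path=(65000,))
vpn = Route("VPN", prefix_len=16, as_path=(65000, 65000, 65000))
print(best_route([dx, vpn]).name)           # DX: same prefix, shorter AS_PATH

# A more specific prefix on the backup path overrides AS_PATH length.
vpn_specific = Route("VPN", prefix_len=24, as_path=(65000, 65000, 65000))
print(best_route([dx, vpn_specific]).name)  # VPN: longest prefix match wins
```

The second case shows why both paths should advertise identical prefixes: a more specific prefix on the VPN side would pull traffic onto the backup path regardless of AS_PATH length.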

Recommended BGP Tuning

Primary Path (DX)

  •         Advertise prefixes with shorter AS_PATH
  • Example AS_PATH: 65000

Backup Path (VPN)

  •         Prepend to make path less preferred
  • Example AS_PATH: 65000 65000 65000

NB:

AWS itself does not prepend on the VPN side; twtech configures it on its customer router.
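To make the prepending step concrete, here is a small sketch of what the customer router does to the advertised path. The `prepend` helper is hypothetical and for illustration only; on real equipment this is done with a route-map or export policy on the VPN BGP session.

```python
from typing import List


def prepend(as_path: List[int], local_asn: int, count: int) -> List[int]:
    """Return a new AS_PATH with local_asn added `count` extra times at the front."""
    if count < 0:
        raise ValueError("count must be zero or positive")
    return [local_asn] * count + list(as_path)


dx_path = [65000]                      # advertised unchanged over DX
vpn_path = prepend(dx_path, 65000, 2)  # advertised over the VPN session
print(vpn_path)                        # [65000, 65000, 65000]
print(len(dx_path) < len(vpn_path))    # True: the DX path is shorter
```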

Failover Behavior

A. Direct Connect Failure

Failures that trigger VPN failover:

  •         DX link down
  •         BGP adjacency drop
  •         Fiber cut / LAG failure beyond thresholds
  •         DX router failure

B. How Fast Is Failover

        Typical BGP timers:

  •    Keepalive: 30s
  •    Hold timer: 90s

        Recommended enhanced timers:

  •    Keepalive: 10s
  •    Hold timer: 30s

Actual failover to VPN usually takes 20–40 seconds.
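The timer values above translate into a detection window with simple arithmetic. The sketch below assumes a silent failure (no link-down signal) and that the peer only notices when its hold timer expires. Because the hold timer restarts on each keepalive received, detection lands between hold minus keepalive and hold seconds after the failure. The function name is hypothetical, and route convergence time after detection is not included.

```python
from typing import Tuple


def detection_window(keepalive_s: float, hold_s: float) -> Tuple[float, float]:
    """Return (earliest, latest) seconds from a silent failure to hold-timer expiry."""
    if keepalive_s <= 0 or hold_s <= keepalive_s:
        raise ValueError("expected 0 < keepalive < hold")
    # The failure can occur anywhere within one keepalive interval
    # after the last keepalive that was received.
    return (hold_s - keepalive_s, hold_s)


for label, keepalive, hold in [("typical", 30, 90), ("enhanced", 10, 30)]:
    earliest, latest = detection_window(keepalive, hold)
    print(f"{label}: detected {earliest:.0f}-{latest:.0f}s after the failure")
# typical: detected 60-90s after the failure
# enhanced: detected 20-30s after the failure
```

With the enhanced timers, a 20 to 30 second detection window plus a few seconds of route convergence is consistent with the 20–40 second figure quoted above.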

C. Tunnel Redundancy

  •        Each AWS VPN connection includes two tunnels.
  •        If Tunnel A fails, Tunnel B is used automatically.

AWS Recommended Best Practices

For Direct Connect

  •         Use LAG or at least two DX connections at different locations. 
  •         Use VPN as tertiary backup.

For VPN Backup

  •         Deploy two VPN connections (one per AWS region edge router)
  •         Use ECMP only if you want multipath
  •         Prepend AS_PATH on VPN so traffic only shifts on DX failure
  •         Monitor both tunnels with CloudWatch

Security Best Practices

  •         Use IKEv2 over IKEv1
  •         Avoid rekey event misconfiguration
  •         Use 30–60 min rekey windows
  •         Do not use aggressive mode

Transit Gateway Design Considerations

  •        TGW adds advanced routing, but also extra design considerations.

Primary Path DXGW → TGW

  •         Advertises routes via BGP
  •         Shortest AS_PATH → primary

Backup Path VPN → TGW

  •         Advertises same prefixes
  •         TGW route preference chooses DX unless BGP withdraws

TGW Route Table Behavior

TGW doesn't support BGP directly; the BGP decision happens on:

  •         DX router (DXGW side)
  •         VPN Gateway (AWS side)

End-to-End Testing & Validation

Recommended Tests

  1.      Shut down DX interface on customer router
  2.      Ensure VPN routes become active
  3.      Run steady-state throughput validation
  4.      Test return path symmetry
  5.      Simulate partial failures:

    •    Only one DX link in LAG goes down
    •    One VPN tunnel fails
    •    Customer router restart
    •    IKE rekey rotation
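Step 2 of the tests above can be partly automated. The sketch below reads the `VgwTelemetry` entries returned by the EC2 `describe_vpn_connections` call and reports whether at least one tunnel is up and has accepted BGP routes. The helper names and the sample response are hypothetical, and accepted routes on a tunnel do not by themselves prove that traffic is using the VPN path, so treat this as one signal among several.

```python
from typing import Any, Dict, List


def summarize_tunnels(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten a describe_vpn_connections response into one row per tunnel."""
    rows = []
    for conn in response.get("VpnConnections", []):
        for tunnel in conn.get("VgwTelemetry", []):
            rows.append({
                "vpn_id": conn.get("VpnConnectionId"),
                "outside_ip": tunnel.get("OutsideIpAddress"),
                "up": tunnel.get("Status") == "UP",
                "accepted_routes": tunnel.get("AcceptedRouteCount", 0),
            })
    return rows


def backup_path_ready(response: Dict[str, Any]) -> bool:
    """True if at least one tunnel is UP and has accepted at least one route."""
    return any(r["up"] and r["accepted_routes"] > 0
               for r in summarize_tunnels(response))


def fetch_vpn_status(vpn_connection_id: str) -> Dict[str, Any]:
    """Query AWS for one VPN connection (needs boto3 and credentials)."""
    import boto3  # imported lazily so the helpers above work without it
    ec2 = boto3.client("ec2")
    return ec2.describe_vpn_connections(VpnConnectionIds=[vpn_connection_id])


# Hypothetical sample response, trimmed to the fields used above.
sample = {"VpnConnections": [{
    "VpnConnectionId": "vpn-EXAMPLE",
    "VgwTelemetry": [
        {"OutsideIpAddress": "203.0.113.10", "Status": "UP", "AcceptedRouteCount": 2},
        {"OutsideIpAddress": "203.0.113.20", "Status": "DOWN", "AcceptedRouteCount": 0},
    ],
}]}
print(backup_path_ready(sample))  # True
```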

Monitoring & Observability

CloudWatch

  •         TunnelState metrics
  •         VPN BGP status
  •         IPSec bytes in/out
  •         DX connection alarms
  •         DX virtual interface BGP state
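As an illustration of the TunnelState item, the sketch below builds the parameters for a CloudWatch alarm that fires when a tunnel reports down. It assumes the `AWS/VPN` namespace with a per-tunnel `TunnelIpAddress` dimension; the alarm name, period, and evaluation count are arbitrary choices, the helper names are hypothetical, and the dimension set should be checked against the metrics actually visible in the account.

```python
from typing import Any, Dict


def tunnel_state_alarm(tunnel_ip: str) -> Dict[str, Any]:
    """Build put_metric_alarm keyword arguments for one VPN tunnel."""
    return {
        "AlarmName": f"vpn-tunnel-down-{tunnel_ip}",
        "Namespace": "AWS/VPN",
        "MetricName": "TunnelState",          # 1 = up, 0 = down
        "Dimensions": [{"Name": "TunnelIpAddress", "Value": tunnel_ip}],
        "Statistic": "Minimum",
        "Period": 60,                         # seconds per datapoint
        "EvaluationPeriods": 3,               # three consecutive bad minutes
        "Threshold": 1,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "breaching",      # no data is treated as down
    }


def create_alarm(params: Dict[str, Any]) -> None:
    """Create the alarm in AWS (needs boto3 and credentials)."""
    import boto3  # imported lazily so the builder works without it
    boto3.client("cloudwatch").put_metric_alarm(**params)


params = tunnel_state_alarm("203.0.113.10")  # documentation-range IP, illustrative
print(params["AlarmName"])  # vpn-tunnel-down-203.0.113.10
```

One alarm per tunnel keeps the signal specific: a single tunnel going down is a loss of redundancy, while both going down is a loss of the backup path.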

AWS Health Alerts

  • Useful for DX location outages.

Customer Network Monitoring

  •         SNMP/BGP session state
  •         Router syslogs
  •         NQA/SLAs on primary + backup paths

Throughput Expectations

Direct Connect

  •         Predictable, line rate
  •         1, 2, 5, 10 Gbps (LAG can be 100 Gbps+)

VPN Backup

  •         With VGW: ~1.25 Gbps max per tunnel
  •         With AWS VPN on TGW: 5–7.5 Gbps aggregate (if ECMP + multiple tunnels)

NB:

  •        VPN is NOT a performance equivalent to DX.
  •        VPN is strictly for resiliency, not sustained heavy workloads.
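The aggregate figure follows from the per-tunnel limit multiplied by the number of tunnels in the ECMP group. A short worked sketch, assuming the ~1.25 Gbps per-tunnel ceiling quoted above and an even spread of flows (the function names are hypothetical, and a single flow still cannot exceed one tunnel's limit):

```python
import math

PER_TUNNEL_GBPS = 1.25  # per-tunnel ceiling quoted above


def ecmp_aggregate_gbps(tunnels: int, per_tunnel_gbps: float = PER_TUNNEL_GBPS) -> float:
    """Upper bound on aggregate throughput across `tunnels` ECMP tunnels."""
    return tunnels * per_tunnel_gbps


def tunnels_needed(target_gbps: float, per_tunnel_gbps: float = PER_TUNNEL_GBPS) -> int:
    """Smallest tunnel count whose upper bound meets the target."""
    return math.ceil(target_gbps / per_tunnel_gbps)


print(ecmp_aggregate_gbps(4))  # 5.0
print(ecmp_aggregate_gbps(6))  # 7.5
print(tunnels_needed(3.0))     # 3
```

So the 5–7.5 Gbps range corresponds to four to six tunnels, that is, two to three VPN connections with both tunnels active.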

Common Pitfalls

❌   Mis-tuned BGP timers → slow failover
❌   AS_PATH prepend → asymmetric routing
❌   Using static routes → manual failover
❌   Only 1 VPN connection → not HA
❌   DX + VPN terminating on different devices → policy mismatch
❌   Not validating return-path failover
❌   On-prem firewall policies blocking ESP after failover
❌   Misconfigured MTU/MSS clamping after the tunnel comes up

Architecture

Primary Path
→ Direct Connect (low latency, high throughput)

Backup Path
→ Site-to-Site VPN (encrypted, runs over the public internet by default)

Routing Control
→ BGP AS_PATH + timers for rapid failover

High Availability
→ DX LAG / two DX circuits
→ Two VPN tunnels (minimum)
→ Preferably two VPN connections across regions

[Diagram: AWS architecture showing DX primary, VPN backup (two tunnels), customer router, VGW/TGW, routing priority flow arrows, and failover decision logic.]

