Thursday, November 13, 2025

AWS Site-to-Site VPN Connection As Backup | Overview.

Scope:

  • Architecture Overview (Baseline Pattern)
  • Deployment Models
  • Routing & Failover Mechanics
  • BGP Priority Hierarchy (AWS standard BGP path selection)
  • Recommended BGP Tuning
  • Failover Behavior
  • AWS Recommended Best Practices
  • Transit Gateway Design Considerations
  • End-to-End Testing & Validation
  • Monitoring & Observability with CloudWatch
  • Throughput Expectations
  • Common Pitfalls
  • Architecture

Architecture Overview (Baseline Pattern)

    • VPN provides inexpensive, encrypted connectivity.
    • DX (or another primary circuit) provides high-bandwidth, low-latency transport.

twtech typically has:

    • Primary Path: AWS Direct Connect OR private MPLS/SD-WAN circuit
    • Backup Path: AWS Site-to-Site IPsec VPN
    • Routing Control: BGP-based failover (preferred) or static routing (less ideal)

AWS Components

    • Virtual Private Gateway (VGW) or Transit Gateway (TGW)
    • Direct Connect Gateway (DXGW) when mixing DX with multi-VPC
    • Customer Router (CSR/ASR/Firewalls/SD-WAN appliance)
    • AWS Managed VPN Endpoint (two tunnels per VPN connection)

Deployment Models

A. Direct Connect with VPN Backup (Classic Pattern)

    • DX private VIF is primary
    • S2S VPN via VGW is backup
    • BGP multipath disabled (default)
    • AS_PATH or BGP MED used to prioritize DX

B. SD-WAN/MPLS Primary (VPN Backup)

    • SD-WAN fabric decides primary path
    • S2S VPN advertises same prefixes with longer AS_PATH or higher BGP metric

C. Transit Gateway with DX + VPN Backup

    • DXGW <-> TGW for primary
    • TGW VPN attachment for backup
    • Controlled via BGP preference on TGW appliance or AWS VPN

D. Hub-and-Spoke Multi-VPC

    • VPCs connect via TGW
    • DX to TGW is primary
    • VPN to TGW is secondary
    • Route tables maintain path priority

Routing & Failover Mechanics

    • Failover design centers around BGP preference:

BGP Priority Hierarchy (AWS standard BGP path selection):

    1.      Longest Prefix Match
    2.      Highest Local Preference
    3.      Shortest AS_PATH (most commonly used for DX/VPN failover)
    4.      MED (Multi-Exit Discriminator)
    5.      eBGP over iBGP
    6.      Lowest IGP cost
    7.      Router ID
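The ordering above can be sketched as a sort key. This is a minimal illustration, not a router implementation: it assumes both routes cover the same prefix (so longest prefix match is already settled), models only local preference, AS_PATH length, and MED, and uses a hypothetical `Route` class and sample values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    """Hypothetical, simplified view of one BGP path for a single prefix."""
    name: str
    local_pref: int
    as_path: tuple
    med: int = 0


def preference_key(route):
    # Higher local preference wins, so it is negated for min().
    # Shorter AS_PATH wins next, then lower MED.
    return (-route.local_pref, len(route.as_path), route.med)


def best_path(routes):
    """Pick the preferred route using the simplified ordering above."""
    return min(routes, key=preference_key)


dx = Route("dx", local_pref=100, as_path=(65000,))
vpn = Route("vpn", local_pref=100, as_path=(65000, 65000, 65000))

print(best_path([dx, vpn]).name)  # dx (shorter AS_PATH)
print(best_path([vpn]).name)      # vpn (only path left after DX is withdrawn)
```

Because local preference is compared before AS_PATH length, a higher local preference on the backup path would override the prepend, which is worth checking on the customer router.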

Recommended BGP Tuning

Primary Path (DX)

    • Advertise prefixes with the shorter AS_PATH
    • Sample AS_PATH: 65000

Backup Path (VPN)

    • Prepend the local ASN to make the path less preferred
    • Sample AS_PATH: 65000 65000 65000

NB:

  • AWS does not prepend AS_PATH on the VPN side. 
  • twtech configures the prepend on its customer router.
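To make the two sample paths concrete, here is a small sketch of what the customer router ends up advertising. The function name and the `extra_prepends` parameter are hypothetical; the real setting is a route-map or export policy on the router, not Python.

```python
def advertised_as_path(local_asn, extra_prepends=0):
    """Return the AS_PATH a neighbor sees for a locally originated prefix.

    The local ASN always appears once; each extra prepend adds one more copy.
    """
    if extra_prepends < 0:
        raise ValueError("extra_prepends must be >= 0")
    return [local_asn] * (1 + extra_prepends)


dx_path = advertised_as_path(65000)                     # primary: no prepend
vpn_path = advertised_as_path(65000, extra_prepends=2)  # backup: two prepends

print(" ".join(map(str, dx_path)))   # 65000
print(" ".join(map(str, vpn_path)))  # 65000 65000 65000
```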

Failover Behavior

A. Direct Connect Failure

Failures that trigger VPN failover:

    • DX link down
    • BGP adjacency drop
    • Fiber cut / LAG failure beyond thresholds
    • DX router failure

B. How Fast Is Failover

        Typical BGP timers:

      •    Keepalive: 30s
      •    Hold timer: 90s

        Recommended enhanced timers:

      •    Keepalive: 10s
      •    Hold timer: 30s

With the enhanced timers, actual failover to VPN usually takes 20–40 seconds.
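A rough way to reason about these numbers: the hold timer restarts on every keepalive received, so when a peer goes silent without the link dropping, the session is declared down somewhere between (hold - keepalive) and hold seconds after the failure. The sketch below only does that arithmetic; route convergence adds time on top, and a hard link-down event is typically detected sooner.

```python
def detection_window(keepalive_s, hold_s):
    """Return (min, max) seconds from a silent failure until the hold timer expires."""
    if not 0 < keepalive_s < hold_s:
        raise ValueError("expected 0 < keepalive < hold")
    return (hold_s - keepalive_s, hold_s)


for label, keepalive, hold in [("typical", 30, 90), ("enhanced", 10, 30)]:
    low, high = detection_window(keepalive, hold)
    print(f"{label}: peer declared down after {low}-{high}s of silence")
# typical: peer declared down after 60-90s of silence
# enhanced: peer declared down after 20-30s of silence
```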

C. Tunnel Redundancy

    •  Each AWS VPN connection includes two tunnels.
    •  If Tunnel A fails, Tunnel B is used automatically.

AWS Recommended Best Practices

For Direct Connect

    • Use LAG or at least two DX connections at different locations. 
    • Use VPN as tertiary backup.

For VPN Backup

    • Deploy two VPN connections (one per AWS region edge router)
    • Enable ECMP (TGW VPN attachments only) only if you want traffic spread across multiple tunnels
    • Prepend AS_PATH on VPN so traffic only shifts on DX failure
    • Monitor both tunnels with CloudWatch

Security Best Practices

    • Use IKEv2 over IKEv1
    • Keep IKE/IPsec lifetime and rekey settings consistent on both ends to avoid rekey failures
    • Use 30–60 min rekey windows
    • Do not use aggressive mode

Transit Gateway Design Considerations

    • TGW adds advanced routing but also extra design considerations:

Primary Path DXGW → TGW

    • Advertises routes via BGP
    • Shortest AS_PATH is primary

Backup Path VPN → TGW

    • Advertises same prefixes
    • TGW route preference chooses DX unless the DX routes are withdrawn via BGP

TGW Route Table Behavior

TGW route tables do not run BGP best-path selection themselves; the BGP sessions terminate on:

    •  DX router (DXGW side)
    •  VPN Gateway (AWS side)

End-to-End Testing & Validation

Recommended Tests

    1.      Shut down DX interface on customer router
    2.      Ensure VPN routes become active
    3.      Run steady-state throughput validation
    4.      Test return path symmetry
    5.      Simulate partial failures:

      •    Only one DX link in LAG goes down
      •    One VPN tunnel fails
      •    Customer router restart
      •    IKE rekey rotation

Monitoring & Observability with CloudWatch

    • TunnelState metric (per-tunnel up/down)
    • VPN BGP status
    • TunnelDataIn / TunnelDataOut (IPsec bytes in/out)
    • DX connection alarms (ConnectionState)
    • DX virtual interface BGP state
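As one way to act on the tunnel metric, the sketch below builds the parameters for a CloudWatch alarm on `TunnelState` in the `AWS/VPN` namespace. The function name, alarm name, period, and the sample tunnel IP and SNS topic ARN are all assumptions for illustration; the dict is shaped for boto3's `put_metric_alarm`, which is shown only in a comment so the example runs without AWS credentials.

```python
def tunnel_down_alarm(tunnel_ip, sns_topic_arn):
    """Build alarm parameters that fire when a VPN tunnel reports down (0)."""
    return {
        "AlarmName": f"vpn-tunnel-down-{tunnel_ip}",
        "Namespace": "AWS/VPN",
        "MetricName": "TunnelState",  # 1 = up, 0 = down
        "Dimensions": [{"Name": "TunnelIpAddress", "Value": tunnel_ip}],
        "Statistic": "Maximum",
        "Period": 300,                # seconds; assumed evaluation period
        "EvaluationPeriods": 1,
        "Threshold": 1.0,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "breaching",
        "AlarmActions": [sns_topic_arn],
    }


# Placeholder values, not real resources.
params = tunnel_down_alarm(
    "203.0.113.10", "arn:aws:sns:us-east-1:111122223333:vpn-alerts"
)
print(params["AlarmName"])  # vpn-tunnel-down-203.0.113.10

# With credentials configured, this could be submitted as:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

Creating one alarm per tunnel keeps a single-tunnel failure visible even while the other tunnel is still carrying traffic.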

AWS Health Alerts

    • Useful for DX location outages.

Customer Network Monitoring

    • SNMP/BGP session state
    • Router syslogs
    • NQA/SLAs on primary + backup paths

Throughput Expectations

Direct Connect

    • Predictable, line rate
    • Hosted connections: commonly 1, 2, 5, or 10 Gbps; dedicated ports: 1, 10, or 100 Gbps (a LAG bundles multiple dedicated ports)

VPN Backup

    • With VGW: ~1.25 Gbps max per tunnel
    • With AWS VPN on TGW: 5–7.5 Gbps aggregate (if ECMP + multiple tunnels)
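The aggregate figure follows from multiplying the per-tunnel ceiling by the number of active tunnels. The sketch below is only that arithmetic, assuming ECMP spreads flows evenly; it is an upper bound, and a single flow is still limited to one tunnel.

```python
PER_TUNNEL_GBPS = 1.25  # approximate per-tunnel ceiling quoted above


def aggregate_gbps(active_tunnels):
    """Upper-bound aggregate throughput across ECMP tunnels, in Gbps."""
    if active_tunnels < 0:
        raise ValueError("active_tunnels must be >= 0")
    return active_tunnels * PER_TUNNEL_GBPS


for tunnels in (1, 2, 4, 6):
    print(f"{tunnels} tunnel(s): up to {aggregate_gbps(tunnels)} Gbps")
# 1 tunnel(s): up to 1.25 Gbps
# 2 tunnel(s): up to 2.5 Gbps
# 4 tunnel(s): up to 5.0 Gbps
# 6 tunnel(s): up to 7.5 Gbps
```

Since each VPN connection has two tunnels, the 5–7.5 Gbps range corresponds to two or three VPN connections with all tunnels active.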

NB:

    • VPN is NOT a performance equivalent to DX.
    • VPN is strictly for resiliency, not sustained heavy workloads.

Common Pitfalls

❌   Mis-tuned BGP timers → slow failover
❌   Missing or inconsistent AS_PATH prepend → asymmetric routing
❌   Using static routes → manual failover
❌   Only 1 VPN connection → not HA
❌   DX + VPN terminating on different devices → policy mismatch
❌   Not validating return-path failover
❌   On-prem firewall policies blocking ESP after failover
❌   Misconfigured MTU/MSS clamping after the tunnel comes up
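For the MTU/MSS item, the clamp value is usually derived from the tunnel's effective MTU by subtracting the IPv4 and TCP headers. The sketch below assumes 20-byte headers with no options, and the 1400-byte tunnel MTU is a hypothetical example, not an AWS-published value.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
TCP_HEADER_BYTES = 20   # TCP header without options


def clamp_mss(tunnel_mtu):
    """Return the TCP MSS that fits inside the given tunnel MTU."""
    mss = tunnel_mtu - IPV4_HEADER_BYTES - TCP_HEADER_BYTES
    if mss <= 0:
        raise ValueError("tunnel_mtu too small for IPv4 + TCP headers")
    return mss


print(clamp_mss(1400))  # 1360
```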

Architecture

    • Primary Path: Direct Connect (low latency, high throughput)
    • Backup Path: Site-to-Site VPN (encrypted, carried over the public internet)
    • Routing Control: BGP AS_PATH + timers for rapid failover
    • High Availability: DX LAG or two DX circuits; two VPN tunnels (minimum); preferably two VPN connections across regions

[Figure: AWS architecture diagram showing DX primary, VPN backup (two tunnels), customer router, VGW/TGW, routing priority flow arrows, and failover decision logic.]




