An Overview of AWS Site-to-Site VPN Connection as a Backup
View:
- Architecture patterns,
- Routing mechanics,
- Failover behavior,
- Best practices,
- Monitoring,
- Common pitfalls.
Breakdown:
- Architecture Overview (Baseline Pattern)
- Deployment Models,
- Routing & Failover Mechanics,
- BGP Priority Hierarchy,
- Recommended BGP Tuning,
- Failover Behavior,
- AWS Recommended Best Practices,
- Transit Gateway Design Considerations,
- End-to-End Testing & Validation,
- Monitoring & Observability,
- Throughput Expectations,
- Common Pitfalls,
- Architecture.
Architecture Overview (Baseline Pattern)
- VPN provides inexpensive, encrypted connectivity, while Direct Connect (DX) or the primary ISP circuit provides high-bandwidth, low-latency transport.
A typical twtech deployment has:
- Primary Path: AWS Direct Connect OR private MPLS/SD-WAN circuit
- Backup Path: AWS Site-to-Site IPSec VPN
- Routing Control: BGP-based failover (preferred) or static routing (less ideal)
AWS Components
- Virtual Private Gateway (VGW) or Transit Gateway (TGW)
- Direct Connect Gateway (DXGW) when mixing DX with multi-VPC
- Customer Router (CSR/ASR/Firewalls/SD-WAN appliance)
- AWS Managed VPN Endpoint (two tunnels per VPN connection)
Deployment Models
A. Direct Connect with VPN Backup (Classic Pattern)
- DX private VIF is primary
- S2S VPN via VGW is backup
- BGP multipath disabled (default)
- AS_PATH or BGP MED used to prioritize DX
B. SD-WAN/MPLS Primary (VPN Backup)
- SD-WAN fabric decides primary path
- S2S VPN advertises same prefixes with longer AS_PATH or higher BGP metric
C. Transit Gateway with DX + VPN Backup
- DXGW <-> TGW for primary
- TGW VPN attachment for backup
- Controlled via BGP preference on TGW appliance or AWS VPN
D. Hub-and-Spoke Multi-VPC
- VPCs connect via TGW
- DX to TGW is primary
- VPN to TGW is secondary
- Route tables maintain path priority
Routing & Failover Mechanics
- Failover design centers on BGP preference:
BGP Priority Hierarchy
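AWS follows standard BGP path selection, in this order (for an identical prefix, a VGW or TGW also prefers Direct Connect routes over Site-to-Site VPN routes):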
- Longest Prefix Match
- Highest Local Preference
- Shortest AS_PATH (most commonly used for DX/VPN failover)
- MED (Multi-Exit Discriminator)
- eBGP over iBGP
- Lowest IGP cost
- Router ID
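The ordering above can be sketched as a sort key. This is a simplified illustration rather than a real BGP implementation: it assumes all candidates are for the same prefix (longest prefix match has already been applied), it compares MED across all neighbors, and the `Route` class and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Route:
    """One candidate path for a single prefix (illustrative fields only)."""
    name: str
    local_pref: int            # higher wins
    as_path: Tuple[int, ...]   # shorter wins
    med: int                   # lower wins
    is_ebgp: bool              # eBGP beats iBGP
    igp_cost: int              # lower wins
    router_id: str             # lowest wins as the final tie-breaker


def _router_id_key(router_id: str) -> Tuple[int, ...]:
    # Compare router IDs numerically, octet by octet.
    return tuple(int(octet) for octet in router_id.split("."))


def best_path(candidates: List[Route]) -> Route:
    """Pick the preferred route among candidates for the same prefix."""
    return min(
        candidates,
        key=lambda r: (
            -r.local_pref,            # highest local preference first
            len(r.as_path),           # then shortest AS_PATH
            r.med,                    # then lowest MED
            0 if r.is_ebgp else 1,    # then eBGP over iBGP
            r.igp_cost,               # then lowest IGP cost
            _router_id_key(r.router_id),
        ),
    )


dx = Route("dx", 100, (65000,), 0, True, 10, "10.0.0.1")
vpn = Route("vpn", 100, (65000, 65000, 65000), 0, True, 10, "10.0.0.2")
print(best_path([dx, vpn]).name)  # dx
```

With equal local preference, the prepended VPN path loses on AS_PATH length, and it becomes the best path only once the DX route is withdrawn.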
Recommended BGP Tuning
Primary Path (DX)
- Advertise prefixes with shorter AS_PATH
- Example AS_PATH: 65000
Backup Path (VPN)
- Prepend to make path less preferred
- Example AS_PATH: 65000 65000 65000
NB: AWS itself does not prepend on the VPN side; twtech configures it on its customer router.
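As a rough sketch of what prepending does to the advertised path, the helper below builds the two AS_PATH examples from this section. The function name and parameters are hypothetical; on a real router this is done with a route-map or export policy, not Python.

```python
from typing import List


def prepend_as_path(as_path: List[int], local_asn: int, times: int) -> List[int]:
    """Return a copy of as_path with local_asn prepended `times` extra times."""
    if times < 0:
        raise ValueError("times must be non-negative")
    return [local_asn] * times + list(as_path)


# Primary (DX): advertise the plain path.
dx_path = prepend_as_path([65000], local_asn=65000, times=0)
# Backup (VPN): prepend the local ASN twice so the path is longer.
vpn_path = prepend_as_path([65000], local_asn=65000, times=2)

print(dx_path)   # [65000]
print(vpn_path)  # [65000, 65000, 65000]
```

The longer VPN path is only selected when the shorter DX advertisement disappears.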
Failover Behavior
A. Direct Connect Failure
Failures that trigger VPN failover:
- DX link down
- BGP adjacency drop
- Fiber cut / LAG failure beyond thresholds
- DX router failure
B. How Fast Is Failover
Typical BGP timers:
- Keepalive: 30s
- Hold timer: 90s
Recommended enhanced timers:
- Keepalive: 10s
- Hold timer: 30s
Actual failover to VPN usually takes 20–40 seconds.
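The timer arithmetic behind these numbers can be sketched as follows. It assumes a silent failure (no link-down event), where the hold timer was last reset by a keepalive received up to one keepalive interval before the failure. The function name is hypothetical, and route convergence time after detection is not included.

```python
from typing import Tuple


def detection_window(keepalive_s: int, hold_s: int) -> Tuple[int, int]:
    """Return (best_case, worst_case) seconds to detect a silent peer failure.

    The hold timer restarts on every received keepalive, so the peer is
    declared dead between (hold - keepalive) and hold seconds after failure.
    """
    if keepalive_s <= 0 or hold_s < keepalive_s:
        raise ValueError("expected 0 < keepalive_s <= hold_s")
    return (hold_s - keepalive_s, hold_s)


print(detection_window(30, 90))  # (60, 90)
print(detection_window(10, 30))  # (20, 30)
```

With the enhanced timers, detection alone takes 20 to 30 seconds, which is consistent with the 20–40 second failover figure once convergence is added.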
C. Tunnel Redundancy
- Each AWS VPN connection includes two tunnels.
- If Tunnel A fails, Tunnel B is used automatically.
AWS Recommended Best Practices
For Direct Connect
- Use LAG or at least two DX connections at different locations.
- Use VPN as tertiary backup.
For VPN Backup
- Deploy two VPN connections (one per AWS region edge router)
- Use ECMP only if you want multipath
- Prepend AS_PATH on VPN so traffic only shifts on DX failure
- Monitor both tunnels with CloudWatch
Security Best Practices
- Use IKEv2 over IKEv1
- Avoid rekey event misconfiguration
- Use 30–60 min rekey windows
- Do not use aggressive mode
Transit Gateway Design Considerations
- TGW adds advanced routing, but also extra design considerations:
Primary Path DXGW → TGW
- Advertises routes via BGP
- Shortest AS_PATH → primary
Backup Path VPN → TGW
- Advertises same prefixes
- TGW route preference chooses DX unless BGP withdraws
TGW Route Table Behavior
TGW route tables do not speak BGP themselves; they hold static and propagated routes. The BGP sessions are established on:
- DX router (DXGW side)
- VPN endpoint of the TGW VPN attachment (AWS side)
End-to-End Testing & Validation
Recommended Tests
- Shut down DX interface on customer router
- Ensure VPN routes become active
- Run steady-state throughput validation
- Test return path symmetry
- Simulate partial failures:
- Only one DX link in LAG goes down
- One VPN tunnel fails
- Customer router restart
- IKE rekey rotation
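One way to quantify a failover test is to run a steady probe (for example, one ping per second) during the test and compute the longest gap between successful replies. The sketch below assumes the probe results are already collected as `(timestamp_seconds, success)` pairs; the function name and data shape are hypothetical.

```python
from typing import List, Optional, Tuple


def longest_outage(probes: List[Tuple[float, bool]]) -> float:
    """Longest gap in seconds between two successes that bracket a failure run.

    An outage that never recovers before the end of the data is not counted.
    """
    longest = 0.0
    last_ok: Optional[float] = None
    in_outage = False
    for ts, ok in probes:
        if ok:
            if in_outage and last_ok is not None:
                longest = max(longest, ts - last_ok)
            last_ok = ts
            in_outage = False
        else:
            in_outage = True
    return longest


# Simulated test: probes every second, replies lost from t=5 through t=29.
probes = [(float(t), not (5 <= t < 30)) for t in range(40)]
print(longest_outage(probes))  # 26.0
```

Running the same measurement in both directions helps confirm that the return path fails over as well.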
Monitoring & Observability
CloudWatch
- TunnelState metrics
- VPN BGP status
- IPSec bytes in/out
- DX connection alarms
- DX virtual interface BGP state
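A tunnel-down alarm on the `TunnelState` metric might be defined roughly as below. The function only builds the parameter dictionary; the alarm name, period, evaluation count, and the example IDs and ARN are assumptions chosen for illustration. The resulting dictionary is meant to be passed to boto3's CloudWatch `put_metric_alarm` call.

```python
from typing import Any, Dict


def tunnel_down_alarm_params(
    vpn_id: str, tunnel_ip: str, sns_topic_arn: str
) -> Dict[str, Any]:
    """Build put_metric_alarm parameters for one VPN tunnel (illustrative values)."""
    return {
        "AlarmName": f"{vpn_id}-tunnel-{tunnel_ip}-down",
        "Namespace": "AWS/VPN",
        "MetricName": "TunnelState",          # 1 = up, 0 = down
        "Dimensions": [{"Name": "TunnelIpAddress", "Value": tunnel_ip}],
        "Statistic": "Maximum",
        "Period": 60,                          # assumed 1-minute periods
        "EvaluationPeriods": 3,                # assumed: down for 3 periods
        "Threshold": 1,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "breaching",       # assumed: missing data counts as down
        "AlarmActions": [sns_topic_arn],
    }


params = tunnel_down_alarm_params(
    "vpn-0123456789abcdef0",                              # placeholder VPN ID
    "203.0.113.10",                                       # placeholder tunnel outside IP
    "arn:aws:sns:us-east-1:123456789012:vpn-alerts",      # placeholder topic ARN
)
# With boto3 installed and credentials configured:
#   boto3.client("cloudwatch").put_metric_alarm(**params)
print(params["AlarmName"])
```

Creating one alarm per tunnel keeps a single-tunnel failure visible even while the other tunnel carries traffic.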
AWS Health Alerts
- Useful for DX location outages.
Customer Network Monitoring
- SNMP/BGP session state
- Router syslogs
- NQA/SLAs on primary + backup paths
Throughput Expectations
Direct Connect
- Predictable, line rate
- Dedicated ports at 1, 10, or 100 Gbps; hosted connections at sub-rates such as 1, 2, 5, and 10 Gbps (a LAG can reach 100 Gbps+)
VPN Backup
- With VGW: ~1.25 Gbps max per tunnel
- With AWS VPN on TGW: 5–7.5 Gbps aggregate (if ECMP + multiple tunnels)
NB:
- VPN is NOT a performance equivalent to DX.
- VPN is strictly for resiliency, not sustained heavy workloads.
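The aggregate figures above follow from multiplying the per-tunnel ceiling by the number of tunnels that ECMP spreads traffic across. The sketch below uses the ~1.25 Gbps per-tunnel figure from this section; the function names and the peak-load check are hypothetical, and a single flow is still bound by one tunnel because ECMP balances per flow.

```python
PER_TUNNEL_GBPS = 1.25  # approximate per-tunnel ceiling quoted above


def ecmp_aggregate_gbps(active_tunnels: int, per_tunnel: float = PER_TUNNEL_GBPS) -> float:
    """Upper bound on aggregate VPN throughput across ECMP tunnels."""
    if active_tunnels < 0:
        raise ValueError("active_tunnels must be non-negative")
    return active_tunnels * per_tunnel


def backup_covers_load(peak_gbps: float, active_tunnels: int) -> bool:
    """True if the VPN backup could, in the best case, carry the peak load."""
    return ecmp_aggregate_gbps(active_tunnels) >= peak_gbps


print(ecmp_aggregate_gbps(4))        # 5.0
print(ecmp_aggregate_gbps(6))        # 7.5
print(backup_covers_load(8.0, 4))    # False
```

A check like this makes it explicit which workloads must be throttled or deferred while traffic is on the backup path.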
Common Pitfalls
❌ Mis-tuned BGP timers → slow failover
❌ AS_PATH prepend → asymmetric routing
❌ Using static routes → manual failover
❌ Only 1 VPN connection → not HA
❌ DX + VPN terminating on different devices → policy mismatch
❌ Not validating return-path failover
❌ On-prem firewall policies blocking ESP after failover
❌ Misconfigured MTU/MSS clamping after the tunnel comes up
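For the MTU/MSS pitfall, the clamp value can be derived from the path MTU once the tunnel overhead is known. In the sketch below the IPv4 and TCP header sizes (20 bytes each, without options) are standard, but the 64-byte IPsec overhead is only an illustrative assumption; the real overhead depends on the cipher, NAT-T, and encapsulation in use.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
TCP_HEADER_BYTES = 20   # TCP header without options


def clamped_mss(path_mtu: int, ipsec_overhead: int) -> int:
    """TCP MSS that fits inside the tunnel without fragmentation."""
    tunnel_mtu = path_mtu - ipsec_overhead
    mss = tunnel_mtu - IPV4_HEADER_BYTES - TCP_HEADER_BYTES
    if mss <= 0:
        raise ValueError("overhead leaves no room for TCP payload")
    return mss


# Assumed values: 1500-byte path MTU, 64 bytes of IPsec overhead.
print(clamped_mss(1500, 64))  # 1396
```

Testing with large packets right after failover is a quick way to catch a clamp that was set for the DX path but not for the tunnel.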
Architecture
Primary Path
→ Direct Connect (low latency, high throughput)
Backup Path
→ Site-to-Site VPN (encrypted; runs over the public internet unless acceleration is enabled)
Routing Control
→ BGP AS_PATH + timers for rapid failover
High Availability
→ DX LAG / two DX circuits
→ Two VPN tunnels (minimum)
→ Preferably two VPN connections across regions
AWS architecture diagram (light-theme icons) showing:
- DX primary
- VPN backup (two tunnels)
- Customer router
- VGW/TGW
- Routing priority flow arrows
- Failover decision logic