AWS Network Topologies & Complications - Deep Dive.
- What “Network Topology” Means in AWS
- The Main AWS Network Topology Models
- Why Topology Matters
- The Four Major Components That Drive AWS Topology Design
- How AWS Topology Strategy Works in Practice
- Common Examples of AWS Network Topologies
- Putting It Simply
- AWS Network Topologies & Complications
- Hidden Complications twtech Doesn’t Notice Until Scale
- Putting It All Together: Common Real-World Patterns
- Multi-Account, Multi-VPC, Multi-Region with Zero Trust
- Insights
Intro:
- Network Topology refers to the structured ways twtech interconnects its:
- VPCs,
- on-premises networks,
- cloud regions,
- external clouds using AWS networking constructs.
- Network Topology is the architecture blueprint for:
- traffic flows,
- how routes propagate,
- how control points (inspection, segmentation, governance) are implemented.
1. What “Network Topology” Means in AWS
- In AWS, topology refers to how:
- VPCs,
- Transit Gateways,
- Direct Connect,
- VPNs,
- load balancers,
- global routing layers are arranged into a system.
Topology Concerns:
- How do VPCs connect?
(hub-and-spoke, mesh, segmented hubs, multi-tier hubs)
- How does traffic move between regions?
- How does traffic reach on-prem?
- Where does inspection happen (NGFW, IPS, proxy, egress control)?
- How is routing controlled and segmented?
- How does multi-cloud traffic flow?
AWS topologies & logical networking:
- VPC routing
- Route tables
- TGWs / Cloud WAN
- PrivateLink
- VPC Peering
- Direct Connect
- Multi-Region constructs
2. The Main AWS Network Topology Models
a. Flat / VPC-to-VPC Mesh
- VPCs peered directly
- Simple but does not scale
- No transitive routing
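The scaling problem is easy to quantify: a full mesh needs one peering connection per unordered VPC pair, so connections grow quadratically, while a hub needs only one attachment per VPC. A quick sketch (the VPC counts are illustrative):

```python
def full_mesh_peerings(vpc_count: int) -> int:
    """Peering connections for a full mesh: one per unordered pair of VPCs."""
    return vpc_count * (vpc_count - 1) // 2

# A hub-and-spoke topology (TGW / Cloud WAN) needs only one attachment
# per VPC, which is why peering meshes get abandoned at scale.
for n in (5, 20, 100):
    print(f"{n} VPCs: mesh={full_mesh_peerings(n)} peerings, hub={n} attachments")
```

At 100 VPCs the mesh needs 4,950 peerings versus 100 hub attachments, which is the practical reason the flat model "does not scale".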
b. Hub-and-Spoke (TGW Core or Cloud WAN Core)
- Central hub for all connectivity
- Spokes = app VPCs
- Best for segmentation, control, inspection
c. Regional Hubs Connected Globally
- Each region has a TGW
- TGWs connected with peering / Cloud WAN
- Strong isolation + scalable
d. Isolated VPC Domains (Zero Trust Segments)
- VPCs isolated; communication via PrivateLink only
- Maximum isolation
e. Inspection Hub
- Central VPC with firewalls
- Traffic forced through via route tables
- Often integrated with TGW
f. Multi-Region Active/Active
- App deployed in multiple regions
- Global load balancing (Route 53 / CloudFront)
- TGW/Cloud WAN to sync traffic or reach shared services
g. Multi-Cloud
- AWS ↔ Azure/GCP interconnected using:
- Direct Connect ↔ ExpressRoute / Interconnect
- Cloud WAN core with SD-WAN overlays
- TGW + third-party fabric
3. Why Topology Matters
Topology determines:
Performance
- Latency paths, cross-region data flow, DX routing.
Security
- Segmentation, inspection locations, blast radius.
Scalability
- How many VPCs, regions, or clouds you can add without redesign.
Availability
- Single-region vs multi-region routing failure behavior.
Cost
- TGW attachments, data processing, DX port capacity, firewall appliances.
4. The Four Major Components That Drive AWS Topology Design
a. Routing & Segmentation
- VPC route tables
- TGW route domains
- Cloud WAN segments
- Prefix lists and CIDR hierarchy
b. Connectivity Fabric
- Transit Gateway
- VPC peering
- PrivateLink
- Cloud WAN
- Direct Connect / VPN
c. Inspection & Control
- Firewall VPCs
- Route table “hairpin” patterns
- Egress filtering
- Traffic mirroring
d. Global Scaling
- Multi-region TGW
- Cloud WAN Global Core Network
- Shared services patterns
- Region-specific failover plans
5. How AWS Topology Strategy Works in Practice
twtech needs to define:
a) The Global View
- How many regions
- How on-prem connects
- How multi-cloud traffic flows
- Global CIDR strategy
- Cloud WAN or TGW?
- Inspection locations
b) The Regional View
- Each region’s transit hub (if needed)
- Shared services
- Segmentation tiers (prod / nonprod / security / shared)
c) The VPC View
- Subnets and routing
- NAT vs egress control
- Firewalls
- Endpoint strategy
Together these form the topology.
6. Common Examples of AWS Network Topologies
a) Traditional Enterprise
- Single global hub (DX)
- Regional TGWs
- Inspection VPC per region
b) Modern Cloud-Native
- PrivateLink everywhere
- Minimal TGW
- No peering mesh
- Zero-trust friendly
c) Multi-Cloud Enterprise
- AWS ↔ Azure vWAN ↔ GCP NCC
- SD-WAN overlays
- Global routing domain linking all clouds
d) High-Security / Regulated
- No internet egress from workloads
- All inspection centralized
- Strict segmentation and outbound proxies
7. Think of AWS Network Topology Simply as:
- AWS Network Topology is “The complete design of how twtech cloud networks route, connect, isolate, and inspect traffic across VPCs, regions, data centers, and clouds.”
8. AWS Network Topologies & Complications
The Core Topologies
a. Single-VPC Architecture
The simplest topology, used for:
- small workloads, isolated projects, single-team environments.
Key Traits
- Flat, minimal network segmentation.
- Simple route tables and security groups.
- Often merged into more complex topologies later (and that’s where problems start).
Complications
- Accidental IP overlap when merging.
- Route table sprawl as features like PrivateLink, VPC Endpoints, and NATs are added.
- Harder to isolate noisy or risky workloads.
b. Multi-VPC (Hub-and-Spoke)
- This is the enterprise standard.
- A central networking VPC acts as the hub (transit), and multiple spoke VPCs connect through Transit Gateway or VPC peering.
Why use it
- Strong workload isolation.
- Consistent controls (inspection, logging, egress filtering).
- Centralized connectivity to on-prem via DX/VPN.
Complications
i. Transit Gateway route domains
- Static routes and propagation rules are easily confused with VPC route tables.
- Wrong associations/propagations lead to asymmetric routing or packet drops.
ii. DNS fragmentation
- Each VPC has its own resolver.
- Central shared services (like Active Directory, automation tools) require careful Route53 Resolver rules.
iii. Bottlenecks in the inspection VPC
- Appliances (NGFWs) frequently oversubscribe bandwidth.
- East–west traffic can hairpin unnecessarily.
iv. Scaling of CIDR blocks
- Adding new VPCs requires careful IPAM planning.
- Late-stage IP exhaustion forces renumbering or overlays.
c. Multi-Region Architectures
At scale, twtech almost always ends up multi-region: DR, latency, regulatory boundaries.
Key Patterns
- Region-isolated VPCs with independent hub-and-spoke.
- Global architectures using:
- Cloud WAN
- TGW peering across regions
- Global Accelerator
- PrivateLink cross-region (rare, expensive)
Complications
a. No “true” global VPC
- Each region is a silo (an isolated system that operates independently).
- Routing state, endpoints, and load balancers do not replicate globally.
b. The TGW inter-region cost model is expensive
- Every byte crossing regions is billed twice: ingress + egress.
c. DNS consistency across regions
- Route 53 latency-based routing helps, but private DNS across regions requires custom replication via Resolver endpoints.
d. Failover semantics
- App-level failover is easy.
- Database and shared-services failover is hard; cross-region communication often becomes the choke point.
d. Hybrid Networks (On-Prem ↔ AWS)
This is where complexity skyrockets. Hybrid connectivity typically involves:
- Direct Connect (dedicated or hosted)
- VPNs (site-to-site, BGP)
- Transit Gateway or VRF separation
- On-prem firewalls and MPLS/SD-WAN
Complications
a. BGP route limits
- DX VGWs have a ~100-route limit.
- TGW supports thousands of routes but often receives too many from on-prem.
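One practical mitigation is summarizing on-prem prefixes before advertising them over DX. A minimal sketch with the stdlib `ipaddress` module (the ~100-route limit is from the text above; the prefixes themselves are made up):

```python
import ipaddress

VGW_ROUTE_LIMIT = 100  # approximate DX VGW route limit noted above

# Hypothetical on-prem prefixes that BGP would otherwise advertise one by one.
advertised = [ipaddress.ip_network(p) for p in (
    "10.10.0.0/24", "10.10.1.0/24", "10.10.2.0/24", "10.10.3.0/24",
    "10.20.0.0/16",
)]

# collapse_addresses merges adjacent/contained prefixes into supernets,
# shrinking the advertisement toward the route limit.
summarized = list(ipaddress.collapse_addresses(advertised))
print([str(n) for n in summarized])
# The four adjacent /24s collapse into a single 10.10.0.0/22.
assert len(summarized) <= VGW_ROUTE_LIMIT
```

The same check is worth running before any BGP filter change: if the summarized count is still near the limit, aggregation needs to happen on the on-prem side.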
b. Asymmetric routing
- Happens when north-south vs east-west paths use different network constructs.
- twtech can pass traffic into on-prem via DX and receive return traffic via VPN.
c. Failover unpredictability
- VPN failover is not deterministic.
- BGP metrics may behave differently across DX providers.
d. MTU mismatches
Classic pitfall:
EC2 → ENI (9001 bytes)
→ TGW (8500 bytes)
→ DX provider (variable)
Result: silent packet drops.
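The effective MTU of a path is the minimum across all its hops, which is exactly why the chain above fails silently. A sketch of the pitfall (the hop values for EC2 and TGW follow the text; the DX provider value of 1500 is an illustrative assumption):

```python
# MTU of each hop along the path described above. The DX provider value
# varies in practice; 1500 here is an assumption for illustration.
path_mtus = {"EC2 ENI": 9001, "TGW": 8500, "DX provider": 1500}

def effective_mtu(hops: dict) -> int:
    """A packet only survives the path if it fits the smallest hop."""
    return min(hops.values())

mtu = effective_mtu(path_mtus)
print(f"effective path MTU: {mtu}")

# A jumbo frame that fits the ENI and the TGW is still dropped at the
# provider hop unless PMTUD or TCP MSS clamping shrinks it first.
jumbo_packet = 8500
print("dropped" if jumbo_packet > mtu else "delivered")
```

This is why MSS clamping on VPN/DX attachments matters: it forces TCP to negotiate a segment size that fits the smallest hop before any data flows.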
e. Routing vs security vs compliance teams
- In hybrid setups, decisions are split across orgs. Connectivity changes often require 3+ team approvals.
e. Service-to-Service Connectivity Patterns
Even inside a single VPC, real architectures use patterns like:
- PrivateLink (Interface Endpoints)
- VPC Endpoint Services (custom PrivateLink)
- Load balancer-to-load balancer routing
- Mesh or service-discovery architectures
Complications
a. Interface endpoint explosion
- Each AZ requires an ENI.
- Costs scale rapidly.
- Route tables become unmanageable.
b. PrivateLink is not transitive
- twtech cannot transit traffic through PrivateLink to reach another VPC.
- Many teams discover this only after deployment.
c. Cross-VPC service meshes
If twtech tries to stretch a mesh across VPCs or regions, it often hits:
- DNS conflicts
- MTLS cert domain mismatches
- XDS control-plane bottlenecks
9. Hidden Complications twtech Doesn’t Notice Until Scale
a. IP Fragmentation & Overlapping CIDRs
Still the #1 scaling pain.
If twtech has hundreds of VPCs, someone will re-use 10.0.0.0/16.
Fixing this later involves:
- Renumbering environments
- Deploying NAT gateways as “IP translators”
- Creating overlays or using IPv6 (but many enterprise tools don’t support it well)
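Overlaps are cheap to catch before a VPC is ever created. A minimal IPAM-style pre-flight check using the stdlib `ipaddress` module (the allocation registry and CIDRs are illustrative):

```python
import ipaddress

# Hypothetical allocation registry: CIDRs already assigned to existing VPCs.
allocated = [ipaddress.ip_network(c) for c in ("10.0.0.0/16", "10.1.0.0/16")]

def find_conflicts(candidate: str) -> list:
    """Return every allocated CIDR the candidate network overlaps with."""
    net = ipaddress.ip_network(candidate)
    return [str(a) for a in allocated if a.overlaps(net)]

print(find_conflicts("10.0.128.0/20"))  # sits inside 10.0.0.0/16 → conflict
print(find_conflicts("10.2.0.0/16"))    # clean → no conflicts
```

Running a check like this in the pipeline that provisions VPCs is far cheaper than any of the three remediation options above.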
b. Interplay Between: Routing Tables, SGs, NACLs, Endpoint Policies
AWS has multiple layers of traffic controls. Enterprises often accidentally:
- Allow traffic in SGs, then drop it at the NACL
- Allow it in the NACL, then drop it at the endpoint policy
- Allow it at the endpoint policy, then drop it at the appliance firewall
This creates multi-layer debugging nightmares.
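The pain comes from the layers being evaluated serially: a flow is delivered only if every layer allows it, so one stray deny anywhere blackholes the flow. A toy model of that chain (the layer names follow the text; the verdicts are made up):

```python
# Each control layer's verdict for one flow, in evaluation order.
# A single 'deny' anywhere is enough to drop the traffic.
layers = [
    ("security group", "allow"),
    ("NACL", "allow"),
    ("endpoint policy", "deny"),   # the forgotten layer
    ("appliance firewall", "allow"),
]

def trace_flow(layers) -> str:
    """Walk the layers in order and report where the flow dies, if anywhere."""
    for name, verdict in layers:
        if verdict != "allow":
            return f"dropped at {name}"
    return "delivered"

print(trace_flow(layers))  # → dropped at endpoint policy
```

Real debugging works the same way: walk the layers in order and find the first one whose verdict is not "allow", rather than staring at the SG that clearly permits the traffic.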
c. Cloud WAN vs TGW vs VPC Peering
Navigating global topologies becomes a strategic decision.
Transit Gateway
- Great for 1–3 regions.
- Best for centralized security models.
Cloud WAN
- Best when spanning 5+ regions or countries.
- Automated route domains + segmentation.
VPC Peering
- Still the fastest and cheapest, but not transitive.
- Becomes spaghetti if overused.
Complication:
- Migration from TGW to Cloud WAN is not straightforward due to differences in the routing model.
d. Network Inspection Patterns
Enterprise security often demands packet inspection:
- Firewall sandwich
- Middlebox VPC
- Inline IDS/IPS
- L7 proxies
- Egress filtering
But inserting inspection into the path creates:
- Latency spikes
- Asymmetric routing
- Scaling issues (firewalls drop traffic under load)
- Routing loops from misaligned return paths
e. Overlay Networks on AWS
Common when IP overlaps or multi-cloud meshes exist:
Tools:
- Tailscale
- Aviatrix
- SD-WAN vendors
- Cilium mesh overlays
Complications:
- MTU headaches
- Encapsulation overhead
- Multi-path routing conflicts with AWS native routing
- Troubleshooting becomes hard because traffic disappears inside tunnels
10. Common Real-World Patterns
Enterprise Multi-Region + Central Security + On-Prem
- TGW in each region
- TGW peering between regions
- Centralized inspection VPC
- Direct Connect to on-prem
- Route 53 Resolver forwarding rules for shared services
- PrivateLink for internal APIs
- Isolated app VPCs for each business team
Failure Modes:
- Hairpin through security VPC during east-west traffic
- DX route limits forcing on-prem aggregation
- DNS propagation delays breaking service discovery
- Firewall clusters saturating under burst traffic
11. Multi-Account, Multi-VPC, Multi-Region with Zero Trust
Uses:
- AWS Verified Access
- App Mesh or Envoy service mesh
- PrivateLink everywhere
- No flat networking
Failure Modes:
- PrivateLink sprawl
- DNS complexities
- Mesh control-plane overhead
- Operational cost explosion
Insights:
Going Deeper On:
- Multi-region routing models,
- Troubleshooting packet paths,
- Route table design patterns,
- Transit Gateway advanced behaviors,
- How to plan global CIDR allocations,
- Building secure inspection VPCs,
- Multi-cloud network topologies (AWS ↔ Azure/GCP).
🌍 1. Multi-Region Routing Models
- Multi-region AWS networking is fundamentally constrained by the fact that each region is a hard boundary.
- There is no global VPC, no global subnet, and no global routing domain. Everything is connected by explicit constructs.
Core Routing Models
a. TGW-to-TGW Peering
- Most common for east–west inter-region traffic.
- Non-transitive: peering TGW A ↔ TGW B and TGW B ↔ TGW C does not give TGW A ↔ TGW C automatically.
- No appliance insertion between regions.
Complications
- Route tables don’t synchronize across regions; manual propagation is required.
- Traffic is encrypted and tunneled as an overlay; MTU drops cause silent packet loss.
- Bandwidth limits apply per peering attachment, per direction.
b. Cloud WAN (Global Segments)
- A globally managed wide-area fabric.
- twtech defines segments, and AWS handles inter-region connectivity.
Complications
- Routing domain behavior differs from TGW; migration requires re-architecting.
- Region-specific features vary (not all TGW features map 1:1).
- Troubleshooting becomes opaque because routing “happens inside AWS”.
c. Global Accelerator
- NOT a routed topology, but a global TCP/UDP entry system.
- Useful for multi-region failover or latency-based routing.
Complications
- Traffic enters the nearest POP, not necessarily the nearest AWS region.
- Not suitable for internal/private routing.
d. PrivateLink Cross-Region
Rarely used because:
- Expensive
- Non-transitive
- Per-AZ endpoints multiply cost
- Requires explicit service exposure per region
But it’s useful for:
- Tenant isolation
- Publishing internal APIs globally without routing networks together
e. Multi-Region Service Mesh (App Mesh, Istio, etc.)
A mesh stretches its control plane across regions.
Complications
- XDS control-plane latency hurts failover.
- Trust-domain issues (mTLS cert mismatches).
- Sidecar MTU reductions cause fragmentation.
2. Troubleshooting Packet Paths (AWS-Grade)
There is no “one place” to see packet paths, so tracing is a multi-layer effort.
Key Misrouting Sources
- VPC route tables (subnet-scoped)
- TGW route tables (attachment-scoped)
- NACLs (stateless, per subnet)
- Security Groups
- Interface Endpoint policies
- EC2 OS route tables
- Appliance/firewall policies
- DNS resolution path
- MTU fragmentation on long paths
Common Debugging Failures
a. Asymmetric Routing
Occurs when:
- Traffic enters via TGW but returns through IGW/NAT.
- Multi-AZ firewalls create AZ-skewed return paths.
- On-prem routes prefer DX inbound but VPN outbound.
Symptoms:
- SYN reaches the server, SYN-ACK goes out elsewhere → connection timeout.
b. Appliance Hairpinning
Traffic loops through the inspection VPC because:
- Default routes from multiple VPCs point to the same inspection ENI.
- Return traffic is forced through inspection again, creating loops.
c. DNS Path Issues
Very common in multi-region:
- Resolver rules are not symmetric.
- Split-horizon DNS is misconfigured.
- Cross-account Route 53 rules are not correctly shared.
d. MTU Black-Holing
Especially in:
- TGW → DX → on-prem
- Mesh sidecars
- VPNs with IPsec overhead
Diagnostics:
- “The path works for ICMP but not for large TCP packets”
- The application fails during the TLS handshake
e. Endpoint Policy Conflicts
Interface endpoints can override:
- Security groups
- NACLs
- Even TGW routing (in logical effect)
Often forgotten during debugging.
3. Route Table Design Patterns
- AWS has three main routing layers:
- VPC → TGW/Cloud WAN → On-prem.
Principles
a. Subnet-Specific Routing (Per-AZ Granularity)
Design subnets with intent:
- Public
- Private with NAT
- Private isolated
- Ingress
- Egress
- Inspection
Mixing these roles causes chaos.
b. Distributed Routing Table Pattern
Each workload tier gets its own route table:
- App-tier
- DB-tier
- Shared services
- Inspection-bound
Pros:
- Predictable debugging
- No “god route table” with 300 entries
Cons:
- More management overhead without automation
c. Centralized Routing via TGW
Use TGW route tables to isolate traffic domains:
- Core
- Shared services
- Partner/third-party
- On-prem
- Internet egress
Common Mistake:
- Putting everything in one TGW route table → transit chaos.
d. Blackhole Routes for Safety
Use blackhole entries to:
- Prevent accidental transitive routing
- Enforce tenant boundaries
- Stop route leaks from on-prem BGP
e. Avoiding the “Route Table Explosion”
- Using PrivateLink eliminates route entries.
- Using IPv6 reduces NAT and IGW path complexity.
4. Transit Gateway Advanced Behaviors
- TGW is incredibly powerful but has hidden mechanics.
a. Non-Transitive by Default
- VPC A → TGW → VPC B works
- But VPC A → TGW → VPC B → TGW → VPC C does NOT.
- twtech must explicitly configure routing domains.
b. Route Table Association vs Propagation
Common mistake:
- Association defines which table an attachment uses for its route lookups.
- Propagation defines which tables receive routes from the attachment.
Failure scenario:
- A route is propagated to a table that is not associated with the attachment → traffic blackholes.
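The blackhole scenario can be modeled directly: a packet from an attachment is looked up in the table that attachment is associated with, regardless of where its routes were propagated. A toy model (attachment names, table names, and CIDRs are all illustrative):

```python
import ipaddress

# TGW route tables: table name → {prefix: next-hop attachment}.
# VPC B's route was propagated into rtb-core only.
route_tables = {
    "rtb-core": {"10.1.0.0/16": "attach-vpc-b"},
    "rtb-empty": {},
}

# Which table each attachment is *associated* with (used for its lookups).
associations = {"attach-vpc-a": "rtb-empty"}  # the misconfiguration

def tgw_lookup(src_attachment: str, dst_ip: str) -> str:
    """Look up dst_ip in the table associated with the source attachment."""
    table = route_tables[associations[src_attachment]]
    dst = ipaddress.ip_address(dst_ip)
    for prefix, next_hop in table.items():
        if dst in ipaddress.ip_network(prefix):
            return next_hop
    return "blackhole"  # no matching route: traffic is silently dropped

# The route to VPC B exists in rtb-core, but VPC A's attachment consults
# rtb-empty, so the packet blackholes exactly as described above.
print(tgw_lookup("attach-vpc-a", "10.1.2.3"))  # → blackhole
```

Fixing it means changing the association of `attach-vpc-a` to `rtb-core` (or propagating the route into `rtb-empty`), not adding more routes to `rtb-core`.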
c. Appliance Mode
Required for middlebox architectures.
If OFF:
- Return traffic shortcuts around the firewall → asymmetric routing.
If ON:
- All traffic between subnets in the VPC uses the appliance path (which may overload firewalls).
d. TGW Peering Limits
- No appliance insertion
- No multicast
- Bandwidth caps per peering
- Propagation must be configured manually per peer
e. BGP Interactions via VPN/DX
With VPN/DX to TGW:
- Prefix advertisement filters must match
- Too many on-prem routes → TGW drops them silently
- Flapping BGP sessions cause intermittent network blackouts
5. How to Plan Global CIDR Allocations
This is usually the most painful long-term mistake.
a. Golden Rules
- Never reuse a CIDR anywhere globally. (Even if the region “won’t ever connect”—famous last words.)
- Reserve blocks per region, per environment, per account.
- Use power-of-two CIDRs so planning stays mathematically clean.
b. Recommended Structure
For example:

| Boundary | Sample Range |
| --- | --- |
| Global allocation | 10.0.0.0/8 |
| Per region | /12 chunks |
| Per environment (dev/stage/prod) | /16 chunks |
| Per VPC | /20 to /22 |

Everything is hierarchical.
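The hierarchy in the table carves cleanly with power-of-two math. A sketch using the stdlib `ipaddress` module (the region/environment/VPC counts fall out of the prefix lengths; the choice of which chunk goes to which region is illustrative):

```python
import ipaddress

# Global allocation from the table above.
global_block = ipaddress.ip_network("10.0.0.0/8")

# Carve per-region /12 chunks, then per-environment /16s, then VPC /20s.
regions = list(global_block.subnets(new_prefix=12))  # 16 region chunks
envs = list(regions[0].subnets(new_prefix=16))       # 16 envs per region
vpcs = list(envs[0].subnets(new_prefix=20))          # 16 VPCs per env

print(regions[0], envs[0], vpcs[0])
# Because every level is a power-of-two split of its parent,
# no two leaves anywhere in the tree can overlap.
```

Each /12-to-/16 and /16-to-/20 split yields 2^4 = 16 children, so the capacity at every level is known up front, which is what makes the planning "mathematically clean".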
c. Pitfalls
- On-prem IP overlaps requiring NAT → complexity explosion.
- Overly small VPC CIDRs → fragmentation leads to new VPCs.
- Allocating randomly per team causes collisions during mergers.
d. IPv6 Strategy
Use IPv6 for:
- Load balancers
- Internal APIs
- Mesh communication
But keep IPv4 for legacy workloads.
6. Building Secure Inspection VPCs
This is where most enterprises fail.
a. Core Components
An inspection VPC typically includes:
- Firewall fleet (NGFWs, IDS/IPS)
- Gateway Load Balancer and its endpoints (GWLBe)
- Central NAT or egress proxy
- TLS inspection
- East-west traffic inspection
- Logging and packet capture
b. Common Patterns
Pattern 1: Inbound → Inspection → Target VPC
- Via ALB/NLB → GWLBe → TGW → app VPC.
Pattern 2: Egress Filtering
- App VPC → TGW → Inspection → IGW/NAT.
Pattern 3: East-West Traffic
- Spoke VPC A → TGW → Inspection → Spoke VPC B.
c. Failure Modes
1. Asymmetric routing (most common)
- Return path bypasses the firewall.
2. Firewall scaling bottlenecks
- Stateful inspection becomes the throughput choke point.
3. Per-AZ routing conflicts
- GWLBe endpoints are AZ-specific.
4. Hairpinning
- Traffic loops inside the inspection VPC.
d. Best Practices
- Use GWLBe for scaling and AZ alignment.
- Keep north-south and east-west inspection paths separate.
- Use Auto Scaling for firewalls where possible.
- Use appliance mode on TGW.
7. Multi-Cloud Topologies (AWS ↔ Azure/GCP)
- This is increasingly common, especially for regulated industries or acquisitions.
Core Interconnect Models
a. IPsec VPN Mesh
- AWS ↔ Azure
- AWS ↔ GCP
- Azure ↔ GCP
Drawbacks:
- High latency
- MTU reduction
- Unpredictable HA failover
- Limited throughput
b. Direct AWS ↔ Azure ExpressRoute / GCP Interconnect
- Via partner providers offering cross-cloud circuits.
Pros:
- Stable
- High throughput
- Lower latency
Cons:
- Expensive
- Operationally complex
- Requires third-party coordination
- Limited geography
c. Cloud WAN + Azure Virtual WAN
Emerging pattern:
- AWS Cloud WAN manages the AWS side
- Azure Virtual WAN manages the Azure side
- Joined via a provider network
Complications:
- Multi-domain routing debugging becomes nearly impossible
- Tools/logs differ per cloud
d. Multi-Cloud Service Mesh
High sophistication, using:
- Istio multi-primary
- Consul mesh
- Zero-trust boundaries
Complications:
- Trust domains between clouds are fragile
- Multi-hop MTU issues
- Sidecar overhead doubles
- Hard to troubleshoot cross-mesh flows
e. Multi-Cloud API Connectivity via PrivateLink Equivalents
- AWS PrivateLink
- Azure Private Link
- GCP Private Service Connect
But these do not interoperate directly.
Pattern:
- Expose the AWS service via PrivateLink
- Pipe it through a proxy in Azure
- Connect with PSC or Azure Private Link
This introduces:
- Latency
- Double encapsulation
- Complex DNS forwarding