Saturday, November 15, 2025

AWS Network Topologies & Complications | Deep Dive.

Scope:
    • What “Network Topology” Means in AWS,
    • The Main AWS Network Topology Models,
    • Why Topology Matters,
    • The Four Major Components That Drive AWS Topology Design,
    • How AWS Topology Strategy Works in Practice,
    • Common Examples of AWS Network Topologies,
    • Putting It Simply,
    • AWS Network Topologies & Complications,
    • Hidden Complications twtech Doesn’t Notice Until Scale,
    • Putting It All Together: Common Real-World Patterns,
    • Multi-Account, Multi-VPC, Multi-Region with Zero Trust,
    • Insights.

Intro:

    • Network topology refers to the structured ways twtech interconnects its:
      • VPCs,
      • on-premises networks,
      • cloud regions,
      • external clouds, using AWS networking constructs.
    • Network topology is the architecture blueprint for:
      • traffic flows,
      • how routes propagate,
      • how control points (inspection, segmentation, governance) are implemented.

1. What “Network Topology” Means in AWS

    • In AWS, topology refers to how:
      • VPCs, 
      • Transit Gateways, 
      • Direct Connect, 
      • VPNs, 
      • load balancers, 
      • global routing layers are arranged into a system.

Topology Concerns:

    • How do VPCs connect? (hub-and-spoke, mesh, segmented hubs, multi-tier hubs)
    • How does traffic move between regions?
    • How does traffic reach on-prem?
    • Where does inspection happen (NGFW, IPS, proxy, egress control)?
    • How is routing controlled and segmented?
    • How does multi-cloud traffic flow?

AWS topologies are built from logical networking constructs:

    • VPC routing
    • Route tables
    • TGWs / Cloud WAN
    • PrivateLink
    • VPC Peering
    • Direct Connect
    • Multi-Region constructs

2. The Main AWS Network Topology Models

a. Flat / VPC-to-VPC Mesh

    •  VPCs peered directly
    •  Simple but does not scale
    •  No transitive routing

b. Hub-and-Spoke (TGW Core or Cloud WAN Core)

    • Central hub for all connectivity
    • Spokes = app VPCs
    • Best for segmentation, control, inspection
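
As a rough illustration of the wiring (not a definitive implementation), below is a minimal boto3 sketch: create the TGW hub with explicit route domains, attach one spoke VPC, and point the spoke’s route table at the hub. All resource IDs are hypothetical placeholders.

```python
# Minimal hub-and-spoke sketch; IDs are hypothetical placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Create the central hub; disable default association/propagation so
# route domains stay explicit (see "Association vs Propagation" later).
tgw = ec2.create_transit_gateway(
    Description="twtech hub",
    Options={
        "DefaultRouteTableAssociation": "disable",
        "DefaultRouteTablePropagation": "disable",
    },
)["TransitGateway"]

# Attach a spoke VPC (one subnet per AZ is the usual pattern).
ec2.create_transit_gateway_vpc_attachment(
    TransitGatewayId=tgw["TransitGatewayId"],
    VpcId="vpc-0spoke1example",                              # hypothetical
    SubnetIds=["subnet-0az1example", "subnet-0az2example"],  # hypothetical
)

# Send all non-local corporate traffic from the spoke to the hub
# (the attachment must be 'available' before this route resolves).
ec2.create_route(
    RouteTableId="rtb-0spoke1example",   # hypothetical spoke route table
    DestinationCidrBlock="10.0.0.0/8",   # corporate supernet
    TransitGatewayId=tgw["TransitGatewayId"],
)
```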

c. Regional Hubs Connected Globally

    • Each region has a TGW
    • TGWs connected with peering / Cloud WAN
    • Strong isolation + scalable

d. Isolated VPC Domains (Zero Trust Segments)

    • VPCs isolated; communication via PrivateLink only
    • Maximum isolation

e. Inspection Hub

    • Central VPC with firewalls
    • Traffic forced through via route tables
    • Often integrated with TGW

f. Multi-Region Active/Active

    • App deployed in multiple regions
    • Global load balancing (Route 53 / CloudFront)
    • TGW/Cloud WAN to sync traffic or reach shared services

g. Multi-Cloud

  • AWS ↔ Azure/GCP interconnected using:
    •    Direct Connect ↔ ExpressRoute / Interconnect
    •    Cloud WAN core with SD-WAN overlays
    •    TGW + third-party fabric

 3. Why Topology Matters

Topology determines:

Performance

    • Latency paths, cross-region data flow, DX routing.

Security

    • Segmentation, inspection locations, blast radius.

Scalability

    • How many VPCs, regions, or clouds you can add without redesign.

Availability

    • Single-region vs multi-region routing failure behavior.

Cost

    • TGW attachments, data processing, DX port capacity, firewall appliances.

 4. The Four Major Components That Drive AWS Topology Design

a. Routing & Segmentation

    • VPC route tables
    • TGW route domains
    • Cloud WAN segments
    • Prefix lists and CIDR hierarchy

b. Connectivity Fabric

    • Transit Gateway
    • VPC peering
    • PrivateLink
    • Cloud WAN
    • Direct Connect / VPN

c. Inspection & Control

    • Firewall VPCs
    • Route table “hairpin” patterns
    • Egress filtering
    • Traffic mirroring

d. Global Scaling

    • Multi-region TGW
    • Cloud WAN Global Core Network
    • Shared services patterns
    • Region-specific failover plans

 5. How AWS Topology Strategy Works in Practice

twtech  needs to define:

a) The Global View

    • How many regions
    • How on-prem connects
    • How multi-cloud traffic flows
    • Global CIDR strategy
    • Cloud WAN or TGW?
    • Inspection locations

b) The Regional View

    • Each region’s transit hub (if needed)
    • Shared services
    • Segmentation tiers (prod / nonprod / security / shared)

c) The VPC View

    • Subnets and routing
    • NAT vs egress control
    • Firewalls
    • Endpoint strategy

Together these form the topology.

 6. Common Examples of AWS Network Topologies

a) Traditional Enterprise

    • Single global hub (DX)
    • Regional TGWs
    • Inspection VPC per region

b) Modern Cloud-Native

    • PrivateLink everywhere
    • Minimal TGW
    • No peering mesh
    •  Zero-trust friendly

c) Multi-Cloud Enterprise

    • AWS ↔ Azure vWAN ↔ GCP NCC
    • SD-WAN overlays
    • Global routing domain linking all clouds

d) High-Security / Regulated

    • No internet egress from workloads
    • All inspection centralized
    • Strict segmentation and outbound proxies

 7. Think of AWS Network Topology Simply as:

    • AWS Network Topology is “The complete design of how twtech cloud networks route, connect, isolate, and inspect traffic across VPCs, regions, data centers, and clouds.”

8.  AWS Network Topologies & Complications

The Core Topologies

a. Single-VPC Architecture

The simplest topology. Used for:

    • small workloads, isolated projects, single-team environments.

Key Traits

    • Flat, minimal network segmentation.
    • Simplicity of route tables and security groups.
    • Often merged into more complex topologies later (and that’s where problems start).

Complications

    • Accidental IP overlap when merging.
    • Route table sprawl as features like PrivateLink, VPC Endpoints, and NATs are added.
    • Harder to isolate noisy or risky workloads.

b. Multi-VPC (Hub-and-Spoke)

    • This is the enterprise standard.
    • A central networking VPC acts as the hub (transit), and multiple spoke VPCs connect through Transit Gateway or VPC peering.

Why use it

    • Strong workload isolation.
    • Consistent controls (inspection, logging, egress filtering).
    • Centralized connectivity to on-prem via DX/VPN.

Complications

i.     Transit Gateway route domains

    •    Static route propagation rules are easily confused with VPC route tables.
    •    Wrong associations/propagations lead to asymmetric routing or packet drops.

ii.     DNS fragmentation

    •    Each VPC has its own resolver.
    •    Central shared services (like Active Directory, automation tools) require careful Route 53 Resolver rules, as in the sketch below.
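
A hedged sketch of the Route 53 Resolver plumbing this implies, assuming an existing outbound resolver endpoint; the domain, target IPs, and resource IDs are hypothetical:

```python
# Forward a shared-services zone (e.g., corporate AD) from every spoke VPC.
import uuid
import boto3

r53r = boto3.client("route53resolver", region_name="us-east-1")

rule = r53r.create_resolver_rule(
    CreatorRequestId=str(uuid.uuid4()),           # idempotency token
    Name="corp-ad-forwarding",
    RuleType="FORWARD",
    DomainName="corp.example.internal",           # hypothetical AD zone
    TargetIps=[{"Ip": "10.10.0.2", "Port": 53}],  # shared-services DNS
    ResolverEndpointId="rslvr-out-0example",      # existing OUTBOUND endpoint
)["ResolverRule"]

# Each VPC that needs the zone must be associated explicitly -- this is
# exactly the per-VPC bookkeeping that fragments DNS at scale.
for vpc_id in ["vpc-0spoke1example", "vpc-0spoke2example"]:  # hypothetical
    r53r.associate_resolver_rule(
        ResolverRuleId=rule["Id"],
        Name=f"assoc-{vpc_id}",
        VPCId=vpc_id,
    )
```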

iii.     Bottlenecks in inspection VPC

    •    Appliances (NGFWs) frequently oversubscribe bandwidth.
    •    East–west traffic can hairpin unnecessarily.

iv.     Scaling of CIDR blocks

    •    Adding new VPCs requires careful IPAM planning.
    •    Late-stage IP exhaustion forces renumbering or overlays.

c. Multi-Region Architectures

At scale, twtech almost always ends up multi-region: DR, latency, regulatory boundaries.

Key Patterns

    •         Region-isolated VPCs with independent hub-and-spoke.
    •         Global architectures using:

      •    Cloud WAN
      •    TGW peering across regions
      •    Global Accelerator
      •    PrivateLink cross-region (rare, expensive)

Complications

a.     No “true” global VPC

    • Each region is a silo (isolated system that operates independently)
    • Routing state, endpoints, and load balancers do not replicate globally.

b.     TGW inter-region cost model is high

    • Every byte crossing regions is billed twice: ingress + egress.

c.     DNS consistency across regions

    •    Route 53 latency-based routing helps, but private DNS across regions requires custom replication via Resolver endpoints.

d.     Failover semantics

    •    App-level failover is easy.
    •    Database and shared services failover is hard. Cross-region communication often becomes the choke point.

d. Hybrid Networks (On-Prem ↔ AWS)

This is where complexity skyrockets.
Typically involves:

    •  Direct Connect (dedicated or hosted)
    •  VPNs (site-to-site, BGP)
    •  Transit Gateway or VRF separation
    •  On-prem firewalls and MPLS/SD-WAN

Complications

a.     BGP route limits

    •    DX VGWs have a ~100-route limit.
    •    TGW supports thousands but often receives too many routes from on-prem.

b.     Asymmetric routing

    • Happens when north-south vs east-west paths use different network constructs.
    • twtech can pass traffic into on-prem via DX and receive return traffic via VPN.

c.     Failover unpredictability

    •    VPN failover is not deterministic.
    •    BGP metrics may behave differently across DX providers.

d.     MTU mismatches

Classic pitfall:

    • EC2 ENI: 9001 bytes (jumbo frames)
    • TGW: 8500 bytes
    • DX provider: variable
    • Result: silent packet drops.
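
One quick way to confirm the pitfall is a don’t-fragment ping sweep. A minimal sketch, assuming a Linux host (`ping -M do` sets the DF bit); the target host and sizes are illustrative:

```python
# Probe which MTUs survive the path; 28 bytes = 20 (IP) + 8 (ICMP) headers.
import subprocess

def path_fits(host: str, mtu: int) -> bool:
    """True if a DF-flagged packet of `mtu` bytes crosses the path."""
    payload = mtu - 28
    result = subprocess.run(
        ["ping", "-M", "do", "-c", "1", "-W", "2", "-s", str(payload), host],
        capture_output=True,
    )
    return result.returncode == 0

for mtu in (9001, 8500, 1500):
    ok = path_fits("10.20.30.40", mtu)   # hypothetical on-prem host
    print(f"MTU {mtu}: {'ok' if ok else 'DROPPED'}")
```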

e.     Routing vs security vs compliance teams

    •   In hybrid setups, decisions are split across orgs. Connectivity changes often require 3+ team approvals.

e. Service-to-Service Connectivity Patterns

Even inside a single VPC, real architectures use patterns like:

    • PrivateLink (Interface Endpoints)
    • VPC Endpoint Services (custom PrivateLink)
    • Load balancer-to-load balancer routing
    • Mesh or service-discovery architectures

Complications

a.     Interface endpoint explosion

    •    Each AZ requires an ENI.
    •    Costs scale rapidly.
    •    Route tables become unmanageable.

b.     PrivateLink is not transitive

    •    twtech cannot transit traffic through PrivateLink to reach another VPC.
    •    Many teams discover this only after deployment.

c.     Cross-VPC service meshes
If twtech tries to stretch a mesh across VPCs or regions, it often hits:

    •    DNS conflicts
    •    MTLS cert domain mismatches
    •    XDS control-plane bottlenecks

9.  Hidden Complications twtech Doesn’t Notice Until Scale

a. IP Fragmentation & Overlapping CIDRs

Still the #1 scaling pain.

If twtech has hundreds of VPCs, someone will reuse 10.0.0.0/16.

Fixing this later involves:

    •  Renumbering environments
    •  Deploying NAT gateways as “IP translators”
    •  Creating overlays or using IPv6 (but many enterprise tools don’t support it well)
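
Catching the overlap before VPCs are merged is far cheaper than fixing it after. A minimal detection sketch using boto3 (read-only describe calls; assumes default credentials/region are configured) and the stdlib ipaddress module:

```python
# Flag overlapping VPC CIDRs across all described regions.
from ipaddress import ip_network
from itertools import combinations
import boto3

def all_vpc_cidrs():
    """Yield (region, vpc_id, network) for every VPC CIDR association."""
    regions = [r["RegionName"]
               for r in boto3.client("ec2").describe_regions()["Regions"]]
    for region in regions:
        ec2 = boto3.client("ec2", region_name=region)
        for vpc in ec2.describe_vpcs()["Vpcs"]:
            for assoc in vpc["CidrBlockAssociationSet"]:
                yield region, vpc["VpcId"], ip_network(assoc["CidrBlock"])

pairs = combinations(list(all_vpc_cidrs()), 2)
for (r1, v1, n1), (r2, v2, n2) in pairs:
    if n1.overlaps(n2):
        print(f"OVERLAP: {v1} ({n1}, {r1}) <-> {v2} ({n2}, {r2})")
```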

b. Interplay Between Routing Tables, SGs, NACLs, and Endpoint Policies

AWS has multiple layers of traffic controls.
Enterprises often accidentally:

    • Allow traffic in SGs
    • Drop it at NACL
    • Allow it in NACL
    • Drop it at endpoint policy
    • Allow it at endpoint policy
    • Drop it at appliance firewall

This creates multi-layer debugging nightmares.

c. Cloud WAN vs TGW vs VPC Peering

Navigating global topologies becomes a strategic decision.

Transit Gateway

    • Great for 1–3 regions.
    • Best for centralized security models.

Cloud WAN

    • Best when spanning 5+ regions or countries.
    • Automated route domains + segmentation.

VPC Peering

    • Still fastest and cheapest, but not transitive.
    •  Becomes spaghetti if overused.

Complication:

    • Migration from TGW → Cloud WAN is not straightforward due to differences in routing model.

d. Network Inspection Patterns

Enterprise security often demands packet inspection:

    • Firewall sandwich
    • Middlebox VPC
    • Inline IDS/IPS
    • L7 proxies
    • Egress filtering

But inserting inspection into the path creates:

    • Latency spikes
    • Asymmetric routing
    • Scaling issues (firewalls drop traffic under load)
    • Routing loops from misaligned return paths

e. Overlay Networks on AWS

Common when IP overlaps or multi-cloud meshes exist:

Tools:

    • Tailscale
    •  Aviatrix
    • SD-WAN vendors
    • Cilium mesh overlays

Complications:

    •  MTU headaches
    •  Encapsulation overhead
    •  Multi-path routing conflicts with AWS native routing
    •  Troubleshooting becomes hard because traffic disappears inside tunnels

10.  Common Real-World Patterns

Enterprise Multi-Region + Central Security + On-Prem

    • TGW in each region
    • TGW peering between regions
    • Centralized inspection VPC
    • Direct Connect to on-prem
    • Route 53 Resolver forwarding rules for shared services
    • PrivateLink for internal APIs
    • Isolated app VPCs for each business team

Failure Modes:

    • Hairpin through security VPC during east-west traffic
    • DX route limits forcing on-prem aggregation
    • DNS propagation delays breaking service discovery
    • Firewall cluster saturates under burst traffic

11. Multi-Account, Multi-VPC, Multi-Region with Zero Trust

Uses:

    • AWS Verified Access
    • App Mesh or Envoy service mesh
    • PrivateLink everywhere
    • No flat networking

Failure Modes:

    • PrivateLink sprawl
    • DNS complexities
    • Mesh control-plane overhead
    • Operational cost explosion

Insights:

Going Deeper On:

    • Multi-region routing models,
    • Troubleshooting packet paths,
    • Route table design patterns,
    • Transit Gateway advanced behaviors,
    • How to plan global CIDR allocations,
    • Building secure inspection VPCs,
    • Multi-cloud network topologies (AWS ↔ Azure/GCP).

🌍 1. Multi-Region Routing Models

    • Multi-region AWS networking is fundamentally constrained by the fact that each region is a hard boundary.
    • There is no global VPC, no global subnet, and no global routing domain. Everything is connected by explicit constructs.

Core Routing Models

a. TGW-to-TGW Peering

    • Most common for east–west inter-region traffic.
    • Non-transitive: peering TGW A ↔ TGW B and TGW B ↔ TGW C does not give TGW A ↔ TGW C automatically.
    • No appliance insertion between regions.

Complications

    • Route tables don’t synchronize across regions; manual propagation is required.
    • Traffic is encrypted and tunneled; MTU drops cause silent packet loss.
    • Bandwidth limits apply per peering attachment, per direction.

b. Cloud WAN (Global Segments)

    • Globally managed wide-area fabric.
    • twtech defines segments, and AWS handles inter-region connectivity.

Complications

    • Routing domain behavior differs from TGW; migration requires re-architecting.
    • Region-specific features vary (not all TGW features map 1:1).
    • Troubleshooting becomes opaque because routing “happens inside AWS”.

c. Global Accelerator

    • NOT a routed topology, but a global TCP/UDP entry system.
    • Useful for multi-region failover or latency-based routing.

Complications

    • Traffic enters the nearest POP, not necessarily the nearest AWS region.
    •  Not suitable for internal/private routing.

d. PrivateLink Cross-Region

Rarely used because:

    • Expensive
    • Non-transitive
    • Per-AZ endpoints multiply cost
    • Requires explicit service exposure per region

But it’s useful for:

    • Tenant isolation
    • Publishing internal APIs globally without routing networks together

e. Multi-Region Service Mesh (App Mesh, Istio, etc.)

Mesh stretches control plane across regions.

Complications

    • XDS control-plane latency hurts failover.
    • Trust domain issues (mTLS cert mismatches).
    • Sidecar MTU reductions cause fragmentation.

 2. Troubleshooting Packet Paths (AWS-Grade)

There is no “one place” to see packet paths, so tracing is a multi-layer effort.

Key Misrouting Sources

    1.      VPC route tables (subnet-scoped)
    2.      TGW route tables (attachment-scoped)
    3.      NACLs (stateless, per subnet)
    4.      Security Groups
    5.      Interface Endpoint policies
    6.      EC2 OS route tables
    7.      Appliance/firewall policies
    8.      DNS resolution path
    9.      MTU fragmentation on long paths
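
Rather than walking all nine layers by hand, VPC Reachability Analyzer evaluates most of them in one shot (it does not see OS route tables or DNS). A hedged boto3 sketch; the ENI IDs are hypothetical:

```python
# Ask Reachability Analyzer which component blocks a TCP/443 path.
import time
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

path = ec2.create_network_insights_path(
    Source="eni-0clientexample",        # hypothetical source ENI
    Destination="eni-0serverexample",   # hypothetical destination ENI
    Protocol="tcp",
    DestinationPort=443,
)["NetworkInsightsPath"]

analysis_id = ec2.start_network_insights_analysis(
    NetworkInsightsPathId=path["NetworkInsightsPathId"]
)["NetworkInsightsAnalysis"]["NetworkInsightsAnalysisId"]

while True:  # poll until the analysis completes
    result = ec2.describe_network_insights_analyses(
        NetworkInsightsAnalysisIds=[analysis_id]
    )["NetworkInsightsAnalyses"][0]
    if result["Status"] != "running":
        break
    time.sleep(5)

if result.get("NetworkPathFound"):
    print("Reachable end to end.")
else:
    # Explanations name the blocking component: route table, SG, NACL, etc.
    print("Blocked:", result.get("Explanations"))
```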

Common Debugging Failures

a. Asymmetric Routing

Occurs when:

    • Traffic enters via TGW but returns through IGW/NAT.
    •  Multi-AZ firewalls create AZ-skewed return paths.
    • On-prem routes prefer DX inbound but VPN outbound.

Symptoms:

    • SYN reaches the server, SYN-ACK goes out elsewhere → connection timeout.

b. Appliance Hairpinning

Traffic loops through inspection VPC because:

    • Default routes from multiple VPCs point to the same inspection ENI.
    • Return traffic is forced through inspection again, creating loops.

c. DNS Path Issues

Very common in multi-region:

    • Resolver rules not symmetric.
    • Split-horizon DNS misconfigured.
    • Cross-account Route 53 rules not correctly shared.

d. MTU Black-Holing

Especially in:

    • TGW → DX → on-prem
    •  Mesh sidecars
    •  VPNs with IPSec overhead

Diagnostics:

    • “Path works for ICMP but not large TCP packets”
    •  Application fails during TLS handshake

e. Endpoint Policy Conflicts

Interface endpoint policies can silently block traffic already allowed by:

    • Security groups
    • NACLs
    • TGW routing (the endpoint changes the effective path)

Often forgotten during debugging.

 3. Route Table Design Patterns

    • AWS has three main routing layers: VPC → TGW/Cloud WAN → on-prem.

Principles

a. Subnet-Specific Routing (Per-AZ Granularity)

Design subnets with intent:

    • Public
    • Private with NAT
    • Private isolated
    • Ingress
    • Egress
    • Inspection

Mixing these roles causes chaos.

b. Distributed Routing Table Pattern

Each workload tier gets its own route table:

    • App-tier
    • DB-tier
    • Shared services
    •  Inspection-bound

Pros:

    •  Predictable debugging
    •  No “god route table” with 300 entries

Cons:

    • More management overhead without automation

c. Centralized Routing via TGW

Use TGW tables to isolate traffic domains:

    • Core
    • Shared services
    • Partner/third-party
    • On-prem
    • Internet egress

Common Mistake:

    •  Putting everything in one TGW route table → transit chaos.

d. Blackhole Routes for Safety

Use blackhole entries to:

    • Prevent accidental transitive routing
    • Enforce tenant boundaries
    • Stop route leaks from on-prem BGP
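
A minimal sketch of a blackhole entry via boto3; the route table ID and CIDR are hypothetical (static entries, including blackholes, win over propagated routes):

```python
# Ensure a partner-facing TGW route table can never reach the prod range.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

ec2.create_transit_gateway_route(
    TransitGatewayRouteTableId="tgw-rtb-0partnerexample",  # hypothetical
    DestinationCidrBlock="10.64.0.0/12",                   # prod tenant range
    Blackhole=True,  # drop matching traffic instead of forwarding it
)
```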

e. Avoiding the “Route Table Explosion”

    • Using PrivateLink eliminates route entries.
    • Using IPv6 reduces NAT and IGW path complexity.

 4. Transit Gateway Advanced Behaviors

    • TGW is incredibly powerful but has hidden mechanics.

a. Non-Transitive by Default

    • VPC A → TGW → VPC B works.
    • But VPC A → TGW → VPC B → TGW → VPC C does NOT.
    • twtech must explicitly configure routing domains.

b. Route Table Association vs Propagation

Common mistake:

    • Association defines which table an attachment is in.
    • Propagation defines which table receives routes from it.

Failure scenario:

    • Route is propagated to a table that’s not associated with the attachment → traffic blackholes. (See the sketch below.)
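
The distinction is easiest to see as two separate API calls. A hedged boto3 sketch with hypothetical IDs:

```python
# Association and propagation are independent operations on a TGW table.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

TABLE = "tgw-rtb-0prodexample"            # hypothetical
ATTACHMENT = "tgw-attach-0appvpcexample"  # hypothetical

# Association: the table this attachment CONSULTS for outbound lookups
# (exactly one association per attachment).
ec2.associate_transit_gateway_route_table(
    TransitGatewayRouteTableId=TABLE,
    TransitGatewayAttachmentId=ATTACHMENT,
)

# Propagation: a table that LEARNS this attachment's CIDRs
# (an attachment may propagate into many tables).
ec2.enable_transit_gateway_route_table_propagation(
    TransitGatewayRouteTableId=TABLE,
    TransitGatewayAttachmentId=ATTACHMENT,
)

# Doing only one of the two is the classic blackhole: routes sit in a
# table nothing consults, or an attachment consults an empty table.
```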

c. Appliance Mode

Required for middlebox architectures.

If OFF:

    • Return traffic shortcuts around the firewall → asymmetric routing.

If ON:

    • All flows are kept symmetric through the same appliance AZ (may overload firewalls).
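
Enabling it is a single attachment option. A minimal sketch (hypothetical attachment ID):

```python
# Turn on appliance mode for the inspection VPC's TGW attachment so both
# directions of a flow are hashed to the same appliance AZ.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

ec2.modify_transit_gateway_vpc_attachment(
    TransitGatewayAttachmentId="tgw-attach-0inspectionexample",  # hypothetical
    Options={"ApplianceModeSupport": "enable"},
)
```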

d. TGW Peering Limits

    • No appliance insertion
    • No multicast
    • Bandwidth caps per peering
    • Propagation must be configured manually per peer

e. BGP Interactions via VPN/DX

With VPN/DX to TGW:

    • Prefix advertisement filters must match
    • Too many on-prem routes → TGW drops them silently
    • Flapping BGP sessions cause intermittent network blackouts

 5. How to Plan Global CIDR Allocations

This is usually the most painful long-term mistake.

a. Golden Rules

    1. Never reuse a CIDR anywhere globally, even if the region “won’t ever connect” (famous last words).
    2. Reserve blocks per region, per environment, per account.
    3. Use power-of-two CIDRs so planning stays mathematically clean.

b. Recommended Structure

For example:

    Boundary                              Sample Range
    Global allocation                     10.0.0.0/8
    Per region                            /12 chunks
    Per environment (dev/stage/prod)      /16 chunks
    Per VPC                               /20 to /22

Everything is hierarchical.
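
The hierarchy above can be carved mechanically. A sketch using Python’s stdlib ipaddress module; region and environment names are illustrative:

```python
# Derive region -> environment -> VPC blocks from one /8, power-of-two all the way.
from ipaddress import ip_network

GLOBAL = ip_network("10.0.0.0/8")

# /12 per region (16 available; three shown here).
regions = dict(zip(["us-east-1", "eu-west-1", "ap-southeast-1"],
                   GLOBAL.subnets(new_prefix=12)))

# /16 per environment inside one region.
envs = dict(zip(["dev", "stage", "prod"],
                regions["us-east-1"].subnets(new_prefix=16)))

# /20 per VPC inside one environment.
vpcs = list(envs["prod"].subnets(new_prefix=20))

print("us-east-1 :", regions["us-east-1"])  # 10.0.0.0/12
print("prod env  :", envs["prod"])          # 10.2.0.0/16
print("first VPC :", vpcs[0])               # 10.2.0.0/20
```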

c. Pitfalls

    •  On-prem IP overlaps requiring NAT → complexity explosion.
    •  Overly small VPC CIDRs → fragmentation → new VPCs.
    •  Allocating randomly per team causes collisions during mergers.

d. IPv6 Strategy

Use IPv6 for:

    • Load balancers
    • Internal APIs
    • Mesh communication

But keep IPv4 for legacy workloads.

 6. Building Secure Inspection VPCs

This is where most enterprises fail.

a. Core Components

Inspection VPC typically includes:

    • Firewall fleet (NGFWs, IDS/IPS)
    • Gateway Load Balancer (GWLB) with endpoints (GWLBe)
    • Central NAT or egress proxy
    • TLS inspection
    • East-west traffic inspection
    • Logging and packet capture

b. Common Patterns

Pattern 1: Inbound Inspection → Target VPC

    • Via ALB/NLB → GWLBe → TGW → app VPC.

Pattern 2: Egress Filtering

    • App VPC → TGW → Inspection → IGW/NAT.

Pattern 3: East-West Traffic

    • Spoke VPC A → TGW → Inspection → TGW → Spoke VPC B.

c. Failure Modes

1.     Asymmetric routing (most common)

    •    Return path bypasses firewall.

2.     Firewall scaling bottlenecks

    •    Stateful inspection becomes throughput choke point.

3.     Per-AZ routing conflicts

    •    GWLBe endpoints are AZ-specific.

4.     Hairpinning

    •    Traffic loops inside the inspection VPC.

d. Best Practices

    • Use GWLBe for scaling and AZ alignment.
    • Keep north-south and east-west inspection paths separate.
    • Use Auto Scaling for firewalls where possible.
    • Use appliance mode on TGW.
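
Per-AZ alignment comes down to route targets. A hedged boto3 sketch that points each AZ’s route table at that same AZ’s GWLBe (all IDs hypothetical):

```python
# Keep inspection paths AZ-symmetric: each subnet route table sends its
# default route to the GWLB endpoint in its own AZ.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

az_aligned = {
    "rtb-0useast1aexample": "vpce-0gwlbe1aexample",  # AZ a (hypothetical)
    "rtb-0useast1bexample": "vpce-0gwlbe1bexample",  # AZ b (hypothetical)
}

for route_table_id, gwlbe_id in az_aligned.items():
    ec2.create_route(
        RouteTableId=route_table_id,
        DestinationCidrBlock="0.0.0.0/0",
        VpcEndpointId=gwlbe_id,  # GWLB endpoints are valid route targets
    )
```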

 7. Multi-Cloud Topologies (AWS ↔ Azure/GCP)

    • This is increasingly common, especially for regulated industries or acquisitions.

 Core Interconnect Models

a. IPsec VPN Mesh

    • AWS ↔ Azure
    • AWS ↔ GCP
    • Azure ↔ GCP

Reliable, but:

    • High latency
    • MTU reduction
    • Unpredictable HA failover
    • Limited throughput

b. Direct: AWS Direct Connect ↔ Azure ExpressRoute / GCP Interconnect

    • Via partner providers offering cross-cloud circuits.

Pros:

    • Stable
    • High throughput
    • Lower latency

Cons:

    • Expensive
    • Operationally complex
    • Requires third-party coordination
    • Limited geography

c. Cloud WAN + Azure Virtual WAN

Emerging pattern:

    • AWS Cloud WAN manages the AWS side
    • Azure Virtual WAN manages the Azure side
    • Joined via a provider network

Complications:

    •         Multi-domain routing debugging becomes nearly impossible
    •         Tools/logs differ per cloud

d. Multi-Cloud Service Mesh

High sophistication, using:

    • Istio multi-primary
    • Consul mesh
    • Zero-trust boundaries

Complications:

    • Trust domains between clouds are fragile
    • Multi-hop MTU issues
    • Sidecar overhead doubles
    • Hard to troubleshoot cross-mesh flows

e. Multi-Cloud API Connectivity via PrivateLink Equivalents

    • AWS PrivateLink
    • Azure Private Link
    • GCP Private Service Connect

But these do not interoperate directly.

Pattern:

    • Expose AWS service via PrivateLink
    • Pipe it through a proxy in Azure
    • Connect with PSC or Azure PL

This introduces:

    • Latency
    • Double encapsulation
    • Complex DNS forwarding





