Monday, November 17, 2025

AWS IPv4 Troubleshooting | Overview.


Scope:

  • IPv4 stack problems
  • Layer 1–3: Addressing & Subnetting
  • Layer 3: Routing Troubleshooting
  • NAT Troubleshooting
  • DNS Troubleshooting (often confused with IP issues)
  • Transport Layer (TCP/UDP)
  • Firewalls, ACLs, Security Groups, NACLs
  • MTU, Fragmentation, PMTUD
  • Asymmetric Routing (one of the hardest issues)
  • Packet Capture Workflow
  • Systematic IPv4 Troubleshooting Flow
  • AWS-Specific IPv4 Failure Patterns

IPv4 stack problems fall into one of seven domains:

  1. Addressing / Subnetting
  2. Routing
  3. NAT
  4. DNS
  5. Transport Layer (TCP/UDP)
  6. Firewalling (SG/NACL/ACL/iptables/etc.)
  7. Application-Layer quirks

1. Layer 1–3: Addressing & Subnetting

  • Incorrect IPv4 addressing is one of the most common causes of connectivity failures.

1.1 Checklist

    •  Does the host have a valid IPv4 address?
    •  Is the netmask correct?
    •  Is the default gateway in the same subnet?
    •  Any duplicate IPs?
    •  Any ARP poisoning or stale ARP caches?

1.2 Management Commands

Linux:

# bash
ip addr
ip route
ip neigh
arp -a

Windows:

# cmd
ipconfig /all
route print
arp -a

Common pitfalls

    •  Host mask mismatch (e.g., host thinks /24 but network is /23).
    •  Gateway configured outside the host's subnet → the host cannot ARP for it, so off-subnet traffic is silently dropped.
    •  ARP cache stale; clearing it fixes many “weird” issues:

# bash
sudo ip neigh flush all
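
The mask and gateway pitfalls above can be checked mechanically. The following is a minimal sketch using Python's standard `ipaddress` module; the function name `check_host_config` and its three string arguments are assumptions made for illustration, and the values would normally be copied by hand from `ip addr` and `ip route`.

```python
import ipaddress

def check_host_config(host_cidr, gateway, network_cidr):
    """Return a list of addressing problems (an empty list means none were found).

    host_cidr    -- address and mask as configured on the host, e.g. "10.0.1.50/24"
    gateway      -- default gateway configured on the host
    network_cidr -- what the subnet really is, e.g. "10.0.0.0/23"
    """
    problems = []
    iface = ipaddress.ip_interface(host_cidr)
    real_net = ipaddress.ip_network(network_cidr)
    gw = ipaddress.ip_address(gateway)

    # The host's idea of the mask must match the real subnet.
    if iface.network.prefixlen != real_net.prefixlen:
        problems.append(
            f"mask mismatch: host uses /{iface.network.prefixlen}, "
            f"network is /{real_net.prefixlen}"
        )
    # The host address itself must belong to the real subnet.
    if iface.ip not in real_net:
        problems.append(f"host {iface.ip} is outside {real_net}")
    # The gateway must be on-link from the host's point of view.
    if gw not in iface.network:
        problems.append(f"gateway {gw} is outside host subnet {iface.network}")
    return problems
```

For example, `check_host_config("10.0.1.50/24", "10.0.0.1", "10.0.0.0/23")` reports both a mask mismatch and an unreachable gateway, because the host believes its subnet is 10.0.1.0/24.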

2. Layer 3: Routing Troubleshooting

2.1 Understand the routing decision

  • Routing is done in this order:

     1.     Longest Prefix Match (LPM)
     2.     Administrative Distance (static vs BGP vs OSPF, etc.)
     3.     Metric / cost
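
The ordering above can be sketched as a small selection function. This is an illustration only: the route dictionaries with `prefix`, `ad`, `metric`, and `target` keys are a hypothetical structure, and the administrative distance values in the usage note follow common router conventions. AWS VPC route tables apply longest prefix match with their own priority rules rather than an explicit administrative distance.

```python
import ipaddress

def select_route(routes, destination):
    """Pick a route: longest prefix first, then lowest AD, then lowest metric.

    routes is a list of dicts with the (hypothetical) keys
    "prefix", "ad", "metric", and "target".
    Returns the winning dict, or None when no prefix covers the destination.
    """
    dst = ipaddress.ip_address(destination)
    candidates = [r for r in routes if dst in ipaddress.ip_network(r["prefix"])]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda r: (
            -ipaddress.ip_network(r["prefix"]).prefixlen,  # longer prefix wins
            r["ad"],                                       # then lower AD
            r["metric"],                                   # then lower metric
        ),
    )
```

With a default route, a 10.0.0.0/8 route, and two 10.1.0.0/16 routes (one static with AD 1, one BGP with AD 20), a lookup for 10.1.2.3 returns the static /16, while 10.9.9.9 falls back to the /8.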

2.2 Routing checks

# bash
ip route get <destination>
tracepath <destination>
traceroute <destination>
mtr <destination>

AWS specifics:

    • Route tables must include the correct local, IGW, NATGW, TGW, DX, or VPC peering routes.
    • Blackhole route entries appear when a route's target is gone, for example when an EC2 ENI is deleted or a peering connection is removed.
    • Subnet associations matter; make sure the correct route table is applied to each subnet.

3. NAT Troubleshooting

  • NAT is a major source of IPv4 issues.

3.1 SNAT vs DNAT

    • SNAT: private → public (outbound)
    • DNAT: public → private (inbound)

3.2 Logs / checks

Linux iptables NAT table:

sudo iptables -t nat -L -n -v

AWS NAT Gateway:

  • Check CloudWatch metrics:
      •    ErrorPortAllocation
      •    PacketsDropCount
      •    BytesOutToDestination / BytesOutToSource
  • NAT Gateway fails when:
      •    No route to destination
      •    No IGW in the VPC (a public NAT gateway needs an IGW route from its own subnet)
      •    SNAT port exhaustion (rare but real with high concurrency)

3.3 Double NAT

Occurs commonly in:

    •   On-prem → firewall NAT → AWS NATGW → internet

Symptoms:

    •  Broken return traffic
    •  Inconsistent path MTU
    •  Services failing only inbound or outbound

4. DNS Troubleshooting (Often confused with IP issues)

Many connectivity failures are DNS problems masquerading as networking issues.

4.1 Checklist

    •  Can the resolver be reached?
    •  Is the DNS server configured correctly?
    •  AAAA vs A confusion?
    • Split-horizon inconsistencies?

4.2 Tools

# bash
dig <hostname>
dig +trace <hostname>
dig @<dns-server> <hostname>
nslookup <hostname>

AWS specifics:

    • EC2 uses the VPC Resolver (AmazonProvidedDNS) at:
      •    169.254.169.253
      •    the base of the primary VPC CIDR plus two (for example 10.0.0.2 in 10.0.0.0/16)
    • Conditional forwarders for hybrid setups are often misconfigured.
    • Route 53 Resolver rules must be associated with the correct VPC.
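
When testing the VPC resolver directly it helps to know which addresses to query. The sketch below computes them from the primary VPC CIDR, assuming the usual convention that the resolver also answers at the network base address plus two; the function name `vpc_resolver_addresses` is made up for this example.

```python
import ipaddress

# Link-local address of the VPC resolver, as listed above.
LINK_LOCAL_RESOLVER = "169.254.169.253"

def vpc_resolver_addresses(primary_vpc_cidr):
    """Return the resolver addresses to test for a VPC.

    Assumes the resolver listens on the link-local address and on the
    base of the primary VPC CIDR plus two.
    """
    net = ipaddress.ip_network(primary_vpc_cidr)
    return [LINK_LOCAL_RESOLVER, str(net.network_address + 2)]
```

Each returned address can then be passed to `dig @<dns-server> <hostname>` from an instance inside the VPC.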

5. Transport Layer (TCP/UDP)

Symptoms

    • SYN sent but no SYN-ACK → blocked or blackholed
    • SYN-ACK received but ACK missing → asymmetric routing
    • UDP “works sometimes” → random firewall drops or NAT timeouts

5.1 Tools

# bash
tcpdump -n port <port>
ss -tnlp
nc -zv <host> <port>

Sample TCP handshake capture:

tcpdump -nn -i eth0 "tcp[tcpflags] & (tcp-syn|tcp-ack) != 0"
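
Where `nc` is not installed, the same open / refused / timed-out distinction can be made from Python. This is a rough sketch built on the standard `socket` module; the function name `probe_tcp` and the returned labels are choices made for this example, and the interpretation in the comments is a rule of thumb rather than a guarantee.

```python
import socket

def probe_tcp(host, port, timeout=3.0):
    """Classify a TCP connect attempt as 'open', 'refused', 'timeout', or an error."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"        # three-way handshake completed
    except ConnectionRefusedError:
        return "refused"         # RST came back: host reachable, nothing listening
    except socket.timeout:
        return "timeout"         # no reply: likely dropped by a firewall or blackholed
    except OSError as exc:
        return f"error: {exc}"   # e.g. no route to host, name resolution failure
```

A "timeout" usually points at a silent drop (SG, NACL, firewall, missing route), whereas "refused" means packets are making the round trip and the problem is on the destination host.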

AWS specifics:

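    • TGW asymmetric routing is a classic problem.
    • NLB preserves the client IP by default for instance targets (this can break firewall rules that expect the load balancer's address).
    • ALB does NOT preserve the client IP; check the X-Forwarded-For header.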

6. Firewalls, ACLs, Security Groups, NACLs

6.1 Host Firewall

# bash
sudo iptables -L -n -v
sudo ufw status
sudo firewall-cmd --list-all

6.2 AWS Security Groups

    • Stateful
    • Return traffic automatically allowed
    • If outbound rules are misconfigured → outbound connections fail silently (packets are dropped, no reject is sent)

6.3 NACLs

    • Stateless
    • Need both inbound + outbound rules
    • Common issues:
      •    Ephemeral ports not allowed
      •    Implicit deny blocks traffic
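
The ephemeral-port problem is easier to see with a toy evaluator. The sketch below mimics NACL-style processing (rules sorted by number, first match wins, implicit deny at the end); the rule dictionaries and the function name `nacl_allows` are hypothetical and far simpler than a real NACL entry, which also carries protocol and direction.

```python
import ipaddress

def nacl_allows(rules, remote_ip, port):
    """Evaluate simplified NACL rules for one packet.

    rules     -- list of dicts with "number", "cidr", "ports" (low, high), "action"
    remote_ip -- the address on the far side of the subnet boundary
    port      -- the destination port of the packet being evaluated
    """
    remote = ipaddress.ip_address(remote_ip)
    for rule in sorted(rules, key=lambda r: r["number"]):
        low, high = rule["ports"]
        if remote in ipaddress.ip_network(rule["cidr"]) and low <= port <= high:
            return rule["action"] == "allow"   # first matching rule decides
    return False                               # implicit deny (the "*" rule)

# Inbound: clients may reach the server on 443.
inbound = [{"number": 100, "cidr": "0.0.0.0/0", "ports": (443, 443), "action": "allow"}]
# Outbound, broken: only 443 is allowed, so replies to ephemeral ports are dropped.
outbound_broken = [{"number": 100, "cidr": "0.0.0.0/0", "ports": (443, 443), "action": "allow"}]
# Outbound, fixed: replies to the client's ephemeral port range are allowed.
outbound_fixed = [{"number": 100, "cidr": "0.0.0.0/0", "ports": (1024, 65535), "action": "allow"}]
```

A client at 203.0.113.10 using source port 50000 gets its SYN through `inbound`, but the reply (destination port 50000) is denied by `outbound_broken` and allowed by `outbound_fixed`. Because NACLs are stateless, nothing links the reply to the request that was let in.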

6.4 Middlebox issues

    • IDS/IPS dropping packets
    • DPI throttling or fragmentation issues
    • VPN/firewall tunnels dropping large packets

7. MTU, Fragmentation, PMTUD

MTU problems are a highly underrated cause of IPv4 issues.

Symptoms

    •  HTTP works but HTTPS breaks (large TLS handshake packets are dropped)
    •  Some sites load, some don’t
    •  DNS works but large downloads fail
    •  TCP stalls mid-transfer

Quick test

# bash
ping -M do -s 1472 8.8.8.8

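If it fails:

Reduce the payload size until the ping succeeds. The largest working payload plus 28 bytes (20-byte IPv4 header + 8-byte ICMP header) is the path MTU. If large packets vanish without an ICMP “fragmentation needed” reply, an MTU blackhole has been detected.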
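
The "reduce until success" step is a search for the largest payload that still passes. Below is a minimal sketch of that search; `probe` is a hypothetical callable that returns True when a don't-fragment ping of the given payload size gets a reply (for instance a thin wrapper around `ping -M do -s <size> -c 1 <host>`), and the lower bound of 548 corresponds to the 576-byte datagram every IPv4 host must accept.

```python
IPV4_HEADER = 20   # bytes, without IP options
ICMP_HEADER = 8    # bytes

def payload_for_mtu(mtu):
    """ICMP payload size that exactly fills a packet of the given MTU."""
    return mtu - IPV4_HEADER - ICMP_HEADER

def find_path_mtu(probe, low=548, high=1472):
    """Binary-search the largest payload for which probe(size) is True.

    Returns the implied path MTU in bytes, or None if even the smallest
    payload fails (which suggests a problem other than MTU).
    """
    if not probe(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2   # round up so the loop always shrinks
        if probe(mid):
            low = mid                 # mid fits, try larger
        else:
            high = mid - 1            # mid is too big, try smaller
    return low + IPV4_HEADER + ICMP_HEADER
```

`payload_for_mtu(1500)` gives 1472, which is where the value in the quick test comes from. Binary search needs about ten probes across this range instead of stepping down one size at a time.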

AWS MTU specifics:

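    • VPC ENIs: up to 9001 bytes (jumbo frames) inside the VPC; 1500 bytes through an internet gateway
    • VPN over internet: 1420 / 1399
    • DX: 1500 / 1522 (depending on encapsulation)
    • PMTUD breaks if ICMP type 3 code 4 (“fragmentation needed”) is blocked by firewalls.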

8. Asymmetric Routing (One of the hardest issues)

Asymmetry leads to:

    • SYN goes one way, SYN-ACK goes another
    • Packets accepted but return traffic dropped
    • Firewalls drop sessions because state is on the wrong boundary

AWS contexts where asymmetry is common:

    • TGW + on-prem with multiple DX links
    • Multi-AZ firewalls in HA pairs
    • VPC peering + TGW overlapping paths
    • Load balancer preservation of client IP

Tools:

# bash
mtr -4 <destination>
tracepath <destination>
# run tcpdump on both sides simultaneously
tcpdump -nn -i eth0 host <peer-ip>

9. Packet Capture Workflow

9.1 The “two-sided capture” rule

To diagnose anything non-trivial:

      •  Capture on the source
      •  Capture on the destination
      •  Compare flows
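
The "compare flows" step can be sketched as a set difference. This assumes both captures have already been parsed into tuples of `(src, sport, dst, dport, tcp_seq)`; that tuple layout and the function name `diff_captures` are assumptions for this example, and the comparison only works as written when no NAT rewrites addresses or ports between the two capture points.

```python
def diff_captures(source_packets, dest_packets):
    """Compare two pre-parsed captures of the same flow.

    Each argument is a list of (src, sport, dst, dport, tcp_seq) tuples.
    Returns (lost, unexpected):
      lost       -- packets seen leaving the source but never seen at the destination
      unexpected -- packets seen at the destination with no match at the source
    """
    seen_at_source = set(source_packets)
    seen_at_dest = set(dest_packets)
    lost = [p for p in source_packets if p not in seen_at_dest]
    unexpected = [p for p in dest_packets if p not in seen_at_source]
    return lost, unexpected
```

A non-empty `lost` list means the packet died somewhere between the two capture points, which narrows the search to the devices on that path.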

9.2 Tools

Linux:

tcpdump -i eth0 -w capture.pcap

Windows:

      • Wireshark
      • NetMon

AWS:

    • VPC Traffic Mirroring to Suricata, Zeek
    • GWLB insertion for deep packet inspection

10. Systematic IPv4 Troubleshooting Flow

Step 1: Local host checks

      • IP correct?
      • Gateway correct?
      • ARP table sane?

Step 2: Can the host reach the gateway?

# bash
ping <gateway>

Step 3: Routing table sanity

# bash
ip route get <destination>

Step 4: DNS resolution confirmed?

Step 5: Is NAT/SNAT working?

    • Check NAT allocations / flows.

Step 6: Firewall sanity

    • SG, NACL, on-prem firewalls.

Step 7: Use tcpdump on both ends

    • Find where the packet dies.

Step 8: MTU / PMTUD

Step 9: Asymmetry / hybrid path issues
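
The flow above is deliberately ordered so that each step rules out a lower layer before moving on. As a sketch, it can be expressed as a runner that stops at the first failing step; the check callables are hypothetical placeholders that would wrap the commands shown in the earlier sections.

```python
# Step names taken from the flow above.
STEPS = [
    "Step 1: Local host checks",
    "Step 2: Gateway reachable",
    "Step 3: Routing table sanity",
    "Step 4: DNS resolution",
    "Step 5: NAT/SNAT",
    "Step 6: Firewall sanity",
    "Step 7: tcpdump on both ends",
    "Step 8: MTU / PMTUD",
    "Step 9: Asymmetry / hybrid path",
]

def run_checks(checks):
    """Run (name, check) pairs in order and stop at the first failure.

    Each check is a zero-argument callable returning True on success.
    Returns the name of the first failing step, or None if all pass.
    """
    for name, check in checks:
        if not check():
            return name
    return None
```

Stopping at the first failure matters: a DNS or MTU test tells you little while the host still cannot reach its own gateway.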

11. AWS-Specific IPv4 Failure Patterns

Pattern A: “I can SSH out but not in”

Possible causes:

      •  No public IPv4
      •  SG inbound blocked
      •  NACL inbound blocked
      •  Route table missing 0.0.0.0/0 → IGW

Pattern B: “EC2 can’t reach internet”

Possible causes:

      • Using private subnet with no NATGW route
      • No IGW attached
      • Misconfigured DNS resolver
      • SG outbound blocked
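
For patterns A and B, the first thing to confirm is where the subnet's default route points. The sketch below classifies a route table given as a dictionary shaped like one entry of the EC2 `DescribeRouteTables` response (`Routes`, `DestinationCidrBlock`, `GatewayId`, `NatGatewayId`, `State`); the function name and the returned labels are made up for this example, and fetching the data is left out.

```python
def default_route_target(route_table):
    """Classify the 0.0.0.0/0 route of a route table.

    Returns "igw", "nat", "blackhole", "other", or None when there is
    no default route at all.
    """
    for route in route_table.get("Routes", []):
        if route.get("DestinationCidrBlock") != "0.0.0.0/0":
            continue
        if route.get("State") == "blackhole":
            return "blackhole"          # target was deleted
        if route.get("GatewayId", "").startswith("igw-"):
            return "igw"                # public subnet
        if route.get("NatGatewayId"):
            return "nat"                # private subnet with outbound access
        return "other"                  # e.g. TGW, instance, or peering target
    return None                         # isolated subnet: no internet path
```

A result of `None` or `"blackhole"` explains pattern B immediately; `"nat"` on a subnet that is expected to accept inbound SSH explains pattern A.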

Pattern C: On-prem → AWS via DX works, but AWS → on-prem fails

Possible causes:

      • Asymmetric routing through VPN fallback
      • BGP prefix advertisement mismatch
      • On-prem firewall drops AWS source ranges

Pattern D: VPC → TGW → DX → colocation firewalls

Possible causes:

      • Stateful devices drop return path
      • MTU mismatch on GRE/IPSec tunnels
      • Missing reverse route propagation





