An Overview of IPv4 Troubleshooting.
Scope:
- Cloud-architect level
- Production-focused
Breakdown:
- IPv4 stack problems
- Layer 1–3: Addressing & Subnetting
- Layer 3: Routing Troubleshooting
- NAT Troubleshooting
- DNS Troubleshooting (often confused with IP issues)
- Transport Layer (TCP/UDP)
- Firewalls, ACLs, Security Groups, NACLs
- MTU, Fragmentation, PMTUD
- Asymmetric Routing (one of the hardest issues)
- Packet Capture Workflow
- Systematic IPv4 Troubleshooting Flow
- AWS-Specific IPv4 Failure Patterns
IPv4 stack problems usually fall into one of seven domains:
- Addressing / Subnetting
- Routing
- NAT
- DNS
- Transport Layer (TCP/UDP)
- Firewalling (SG/NACL/ACL/iptables/etc.)
- Application-layer quirks
NB:
- This overview provides a tiered troubleshooting framework with commands, packet-flow logic, and AWS-specific pitfalls (for when twtech is troubleshooting inside VPCs or hybrid networks).
1. Layer 1–3: Addressing & Subnetting
Incorrect IPv4 addressing is one of the most common causes of connectivity failures.
1.1 Checklist
- Does the host have a valid IPv4 address?
- Is the netmask correct?
- Is the default gateway in the same subnet?
- Any duplicate IPs?
- Any ARP poisoning or stale ARP caches?
1.2 Diagnostic Commands
Linux:
ip addr
ip route
ip neigh
arp -a
Windows:
ipconfig /all
route print
arp -a
Common pitfalls
- Host mask mismatch (e.g., the host thinks it is on a /24 but the network is a /23), so some on-link neighbors are wrongly sent to the gateway.
- Gateway configured outside the host's subnet → the host cannot ARP for it, so off-subnet traffic never leaves the host.
- Stale ARP cache; flushing it fixes many "weird" issues:
sudo ip neigh flush all
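The gateway and mask pitfalls come down to one question: does the address fall inside the network implied by the host's own address and mask? A minimal sketch using Python's standard `ipaddress` module; `is_on_link` and `check_host` are hypothetical helper names and the addresses are examples.

```python
import ipaddress

# Minimal sketch using only the standard library; addresses below are examples.
def is_on_link(host_cidr: str, addr: str) -> bool:
    """True if a host configured with host_cidr would ARP for addr directly."""
    iface = ipaddress.ip_interface(host_cidr)  # address plus configured mask
    return ipaddress.ip_address(addr) in iface.network

def check_host(host_cidr: str, gateway: str) -> list:
    """Return a list of addressing problems for one interface config."""
    problems = []
    iface = ipaddress.ip_interface(host_cidr)
    net = iface.network
    # /31 and /32 have no separate network/broadcast addresses, so skip them.
    if net.prefixlen < 31 and iface.ip in (net.network_address, net.broadcast_address):
        problems.append("host address is the network or broadcast address")
    if not is_on_link(host_cidr, gateway):
        problems.append(f"gateway {gateway} is outside {net}")
    return problems

print(check_host("10.0.1.5/24", "10.0.1.1"))   # []
print(check_host("10.0.1.5/24", "10.0.0.1"))   # ['gateway 10.0.0.1 is outside 10.0.1.0/24']
# Mask mismatch: the real network is 10.0.0.0/23 but the host is set to /24,
# so it sends traffic for 10.0.0.20 to the gateway instead of ARPing for it.
print(is_on_link("10.0.1.5/24", "10.0.0.20"))  # False
print(is_on_link("10.0.1.5/23", "10.0.0.20"))  # True
```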
2. Layer 3: Routing Troubleshooting
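2.1 Understand the routing decision
Route selection follows this order of precedence:
1. Longest Prefix Match (LPM)
2. Administrative Distance (static vs BGP vs OSPF, etc.), which breaks ties between identical prefixes learned from different sources
3. Metric / cost, which breaks ties within the same routing protocol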
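The precedence above can be sketched as a single sort key. This is a simplified model, assuming a hypothetical route list with Cisco-style default AD values; real routers apply AD and metric when installing routes into the RIB and then do a pure longest-prefix lookup per packet, which this sketch collapses into one step.

```python
import ipaddress

# Hypothetical routing table entries: (prefix, admin_distance, metric, next_hop).
# AD values follow common Cisco defaults (static 1, eBGP 20, OSPF 110).
ROUTES = [
    ("0.0.0.0/0",   1,   0,  "igw"),
    ("10.0.0.0/8",  20,  10, "tgw-bgp"),
    ("10.1.0.0/16", 110, 20, "ospf-peer"),
    ("10.1.0.0/16", 1,   0,  "static-fw"),
]

def select_next_hop(dest: str, routes=ROUTES):
    """Pick a next hop by longest prefix, then lowest AD, then lowest metric."""
    addr = ipaddress.ip_address(dest)
    matches = [
        (ipaddress.ip_network(prefix), ad, metric, hop)
        for prefix, ad, metric, hop in routes
        if addr in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return None  # no route: the packet would be dropped
    # Sort key: longer prefix first (negated length), then AD, then metric.
    best = min(matches, key=lambda m: (-m[0].prefixlen, m[1], m[2]))
    return best[3]

print(select_next_hop("10.1.2.3"))  # static-fw (/16 beats /8; AD 1 beats 110)
print(select_next_hop("10.9.0.1"))  # tgw-bgp
print(select_next_hop("8.8.8.8"))   # igw
```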
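2.2 Routing checks
ip route get <destination>
tracepath <destination>
traceroute <destination>
mtr <destination>
AWS specifics:
- Route tables must include the correct local, IGW, NAT gateway, TGW, VGW (VPN or DX), or VPC peering routes.
- A route shows the state "blackhole" when its target (for example an ENI, a NAT gateway, or a peering connection) has been deleted.
- Subnet associations matter; make sure the correct route table is associated with each subnet (an unassociated subnet falls back to the VPC's main route table).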
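Blackhole routes can be found by scanning the JSON that `aws ec2 describe-route-tables --output json` prints, where each route carries a `State` of `active` or `blackhole`. A minimal sketch; the helper name `blackhole_routes` is hypothetical and the sample output is trimmed and made up (IDs are placeholders).

```python
import json

def blackhole_routes(describe_output: str) -> list:
    """List (route_table_id, destination) for every route in State 'blackhole'.

    describe_output is the JSON printed by
    `aws ec2 describe-route-tables --output json`.
    """
    data = json.loads(describe_output)
    found = []
    for table in data.get("RouteTables", []):
        for route in table.get("Routes", []):
            if route.get("State") == "blackhole":
                dest = (route.get("DestinationCidrBlock")
                        or route.get("DestinationPrefixListId"))
                found.append((table.get("RouteTableId"), dest))
    return found

# Trimmed, made-up example output (IDs are placeholders).
SAMPLE = """
{"RouteTables": [
  {"RouteTableId": "rtb-0example",
   "Routes": [
     {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local", "State": "active"},
     {"DestinationCidrBlock": "172.16.0.0/16",
      "VpcPeeringConnectionId": "pcx-0example", "State": "blackhole"}
   ]}
]}
"""

print(blackhole_routes(SAMPLE))  # [('rtb-0example', '172.16.0.0/16')]
```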
3. NAT Troubleshooting
NAT is a big source of IPv4 issues.
3.1 SNAT vs DNAT
- SNAT – private → public (outbound)
- DNAT – public → private (inbound)
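3.2 Logs / checks
Linux iptables NAT table:
sudo iptables -t nat -L -n -v
AWS NAT Gateway:
- Check CloudWatch metrics:
  - ErrorPortAllocation
  - PacketsDropCount
  - BytesOutToDestination / BytesOutToSource
- NAT Gateway fails when:
  - The private subnet's route table has no route to the destination via the NAT gateway
  - The NAT gateway's own subnet has no route to an IGW (or no IGW is attached to the VPC)
  - SNAT port exhaustion (rare but real with high concurrency)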
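Port exhaustion happens because a source-NAT device shares one public IP among many private flows and has only a finite pool of source ports toward each unique destination (AWS documents a NAT gateway limit of roughly 55,000 simultaneous connections per unique destination; the ErrorPortAllocation metric counts failures). The toy model below is a sketch of that mechanism only, not how any NAT gateway is implemented; `SnatTable` and its methods are hypothetical, and the pool is shrunk so exhaustion is visible.

```python
class SnatTable:
    """Toy PAT/SNAT model: many private (ip, port) flows share one public IP.

    Public source ports are tracked per unique destination (ip, port, proto),
    so a busy destination can run out of ports while others still work.
    """

    def __init__(self, public_ip: str, port_range=range(1024, 65536)):
        self.public_ip = public_ip
        self.port_range = port_range
        self.flows = {}   # full 5-tuple -> allocated public port
        self.in_use = {}  # (dst_ip, dst_port, proto) -> set of public ports

    def translate(self, src_ip, src_port, dst_ip, dst_port, proto="tcp"):
        key = (src_ip, src_port, dst_ip, dst_port, proto)
        if key in self.flows:                      # existing flow: reuse mapping
            return self.public_ip, self.flows[key]
        used = self.in_use.setdefault((dst_ip, dst_port, proto), set())
        for port in self.port_range:               # first free port for this dest
            if port not in used:
                used.add(port)
                self.flows[key] = port
                return self.public_ip, port
        raise RuntimeError(f"port allocation error toward {dst_ip}:{dst_port}/{proto}")

    def release(self, src_ip, src_port, dst_ip, dst_port, proto="tcp"):
        """Free a mapping when the flow closes or its idle timeout expires."""
        port = self.flows.pop((src_ip, src_port, dst_ip, dst_port, proto), None)
        if port is not None:
            self.in_use[(dst_ip, dst_port, proto)].discard(port)

# Tiny pool of 3 ports so exhaustion shows up quickly.
nat = SnatTable("203.0.113.10", port_range=range(50000, 50003))
for sport in (40000, 40001, 40002):
    print(nat.translate("10.0.1.5", sport, "198.51.100.7", 443))
try:
    nat.translate("10.0.1.5", 40003, "198.51.100.7", 443)
except RuntimeError as exc:
    print(exc)  # fourth flow to the same destination fails
print(nat.translate("10.0.1.5", 40003, "198.51.100.8", 443))  # other destination works
```

The per-destination bookkeeping is why exhaustion usually shows up against one hot endpoint (a single API or database) rather than across all traffic.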
3.3 Double NAT
Occurs commonly in:
- On-prem → firewall → NAT → AWS → NATGW → internet
Symptoms:
- Broken return traffic
- Inconsistent path MTU
- Services failing only inbound or outbound
4. DNS Troubleshooting (Often confused with IP issues)
Many apparent connectivity failures are really DNS problems masquerading as networking issues.
4.1 Checklist
- Can the resolver be reached?
- Is the DNS server configured correctly?
- AAAA vs A confusion?
- Split-horizon inconsistencies?
4.2 Tools
dig <hostname>
dig +trace <hostname>
dig @<dns-server> <hostname>
nslookup <hostname>
AWS specifics:
- EC2 uses the VPC Resolver (AmazonProvidedDNS), reachable at 169.254.169.253 or at the base of the VPC's primary IPv4 CIDR plus two (e.g., 10.0.0.2 for 10.0.0.0/16).
- Conditional forwarders for hybrid setups are often misconfigured.
- Route 53 Resolver rules must be associated with the correct VPC.
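A quick sanity check is to compare the nameservers a host actually uses against the VPC Resolver addresses. A minimal sketch, assuming you paste in the contents of /etc/resolv.conf and know the VPC's primary CIDR; the helper names are hypothetical. An "unexpected" server is not necessarily wrong (it may be an intentional on-prem or Route 53 Resolver endpoint), and on hosts using systemd-resolved the file may list 127.0.0.53 instead, in which case `resolvectl status` shows the real upstreams.

```python
import ipaddress

LINK_LOCAL_RESOLVER = "169.254.169.253"  # AmazonProvidedDNS link-local address

def vpc_resolver(vpc_cidr: str) -> str:
    """The VPC Resolver also answers at the primary VPC CIDR base + 2."""
    return str(ipaddress.ip_network(vpc_cidr).network_address + 2)

def unexpected_nameservers(resolv_conf: str, vpc_cidr: str) -> list:
    """Nameservers in resolv.conf text that are not the VPC Resolver."""
    expected = {vpc_resolver(vpc_cidr), LINK_LOCAL_RESOLVER}
    servers = []
    for line in resolv_conf.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return [s for s in servers if s not in expected]

conf = "search ec2.internal\nnameserver 10.0.0.2\nnameserver 8.8.8.8\n"
print(vpc_resolver("10.0.0.0/16"))                  # 10.0.0.2
print(unexpected_nameservers(conf, "10.0.0.0/16"))  # ['8.8.8.8']
```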
5. Transport Layer (TCP/UDP)
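Symptoms
- SYN sent but no SYN-ACK → blocked or blackholed (a RST instead means the host is reachable but nothing is listening, or a firewall is rejecting)
- SYN-ACK seen but the final ACK never arrives → often asymmetric routing
- UDP "works sometimes" → intermittent firewall drops or NAT mapping timeouts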
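The difference between "blocked" and "nothing listening" is visible from the client side: a drop shows up as a timeout, while a RST (or an ICMP port-unreachable from a REJECT rule, which Linux reports the same way) shows up as connection refused. A minimal sketch of a probe with the standard `socket` module; `tcp_probe` is a hypothetical helper, roughly what `nc -zv <host> <port>` tells you.

```python
import socket

def tcp_probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify one TCP connect attempt by how the handshake ended.

    open     -> SYN, SYN-ACK, ACK completed
    refused  -> RST (nothing listening) or a REJECT-style ICMP error
    timeout  -> no answer at all: dropped, filtered, or blackholed
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:  # e.g. no route to host, network unreachable
        return f"error: {exc}"
```

Usage would look like `tcp_probe("10.0.2.9", 443)`; a "timeout" points you toward SGs, NACLs, routes, or middleboxes, while "refused" points toward the service or host firewall.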
5.1 Tools
tcpdump -n port <port>
ss -tnlp
nc -zv <host> <port>
Example handshake capture (SYN, SYN-ACK, and RST packets only):
tcpdump -nn -i eth0 'tcp[tcpflags] & (tcp-syn|tcp-rst) != 0'
AWS specifics:
- TGW asymmetric routing is a classic problem.
- NLB can preserve the client IP (the default for instance targets), so target security groups and firewalls see client addresses rather than the NLB's.
- ALB does NOT preserve the client IP; check the X-Forwarded-For header.
6. Firewalls, ACLs, Security Groups, NACLs
6.1 Host Firewall
sudo iptables -L -n -v
sudo ufw status
sudo firewall-cmd --list-all
6.2 AWS Security Groups
- Stateful
- Return traffic automatically allowed
- If outbound rules are misconfigured → new outbound connections fail silently (packets are dropped, not rejected)
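6.3 NACLs
- Stateless
- Need both inbound and outbound rules (return traffic is evaluated separately)
- Rules are evaluated in ascending rule-number order; the first match wins
- Common issues:
  - Ephemeral ports (e.g., 1024–65535) not allowed for return traffic
  - The final implicit deny (*) rule blocks anything not explicitly allowed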
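Because NACLs are stateless and first-match, the classic failure is an outbound rule set that allows port 443 but not the client's ephemeral port. A simplified model of the evaluation order; `NaclRule` and `evaluate` are hypothetical names, "all" stands in for AWS's all-protocols value, and ports are checked for every protocol here, which real NACLs do not do for "all".

```python
import ipaddress
from dataclasses import dataclass

@dataclass
class NaclRule:
    number: int      # evaluated in ascending order
    action: str      # "allow" or "deny"
    cidr: str
    port_from: int
    port_to: int
    protocol: str = "tcp"  # "tcp", "udp", or "all" (simplified)

def evaluate(rules, peer_ip: str, port: int, protocol: str = "tcp") -> str:
    """Return the action of the first matching rule, else the implicit deny.

    peer_ip is the source for inbound rules and the destination for outbound.
    """
    addr = ipaddress.ip_address(peer_ip)
    for rule in sorted(rules, key=lambda r: r.number):
        if (rule.protocol in (protocol, "all")
                and addr in ipaddress.ip_network(rule.cidr)
                and rule.port_from <= port <= rule.port_to):
            return rule.action  # first match wins; later rules are ignored
    return "deny"               # the '*' rule at the end of every NACL

inbound = [NaclRule(100, "allow", "0.0.0.0/0", 443, 443)]
outbound = [NaclRule(100, "allow", "0.0.0.0/0", 443, 443)]

# Client 198.51.100.20 connects from ephemeral port 51515 to our port 443.
print(evaluate(inbound, "198.51.100.20", 443))     # allow: request gets in
print(evaluate(outbound, "198.51.100.20", 51515))  # deny: reply to 51515 is blocked

outbound.append(NaclRule(110, "allow", "0.0.0.0/0", 1024, 65535))
print(evaluate(outbound, "198.51.100.20", 51515))  # allow
```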
6.4 Middlebox issues
- IDS/IPS dropping packets
- DPI throttling or fragmentation issues
- VPN/firewall tunnels dropping large packets
7. MTU, Fragmentation, PMTUD
Highly underrated cause of IPv4 issues.
Symptoms
- HTTPS works but HTTP breaks
- Some sites load, some don’t
- DNS works but large downloads fail
- TCP stalls mid-transfer
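Quick test
ping -M do -s 1472 8.8.8.8
(1472 bytes of payload + 8-byte ICMP header + 20-byte IP header = a 1500-byte packet with the Don't Fragment bit set.)
If it fails, reduce -s until it succeeds; the path MTU is the largest working payload + 28. If large probes time out silently instead of returning an ICMP "fragmentation needed" error, suspect an MTU black hole.
AWS MTU specifics:
- Traffic within a VPC: jumbo frames, 9001 bytes on ENIs
- Traffic leaving the VPC through an IGW: 1500 bytes
- VPN over the internet: roughly 1400 or less after IPsec overhead (e.g., 1420 / 1399)
- DX: 1500 by default (1522-byte frames with 802.1Q tagging); jumbo frames (up to 9001 on private virtual interfaces) can be enabled
PMTUD breaks if ICMP type 3 code 4 (Destination Unreachable, Fragmentation Needed) is blocked by firewalls.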
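The "reduce until success" step is a search problem, and the header arithmetic is easy to get wrong. A minimal sketch of both, assuming a hypothetical `probe(size)` callable that you would back with a real `ping -M do` run; here it is simulated with a path whose MTU is 1420.

```python
IP_HEADER = 20    # IPv4 header without options
ICMP_HEADER = 8
TCP_HEADER = 20   # TCP header without options

def ping_payload(mtu: int) -> int:
    """Value to pass to `ping -M do -s` to send a packet of exactly `mtu` bytes."""
    return mtu - IP_HEADER - ICMP_HEADER

def tcp_mss(mtu: int) -> int:
    """Largest TCP payload per segment for a given MTU (no options)."""
    return mtu - IP_HEADER - TCP_HEADER

def find_path_mtu(probe, low: int = 576, high: int = 9001) -> int:
    """Binary-search the largest packet size for which probe(size) succeeds.

    `probe` is a hypothetical callable; in practice it could wrap
    `ping -M do -s <ping_payload(size)> <host>` and return True on a reply.
    """
    if not probe(low):
        raise ValueError("minimum-size probe failed; check basic reachability first")
    while low < high:
        mid = (low + high + 1) // 2   # round up so the loop always makes progress
        if probe(mid):
            low = mid                 # mid fits: answer is mid or larger
        else:
            high = mid - 1            # mid too big: answer is below mid
    return low

# Simulated path whose real MTU is 1420 (e.g., through an IPsec tunnel).
path_mtu = find_path_mtu(lambda size: size <= 1420)
print(path_mtu, ping_payload(path_mtu), tcp_mss(path_mtu))  # 1420 1392 1380
print(ping_payload(1500))  # 1472
```

The `tcp_mss` value is what an MSS-clamping rule on a tunnel edge would need to enforce so TCP never depends on PMTUD working.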
8. Asymmetric Routing (One of the hardest issues)
Asymmetry leads to:
- SYN goes one way, SYN-ACK goes another
- Packets accepted but return traffic dropped
- Firewalls drop sessions because state is on the wrong boundary
AWS contexts where asymmetry is common:
- TGW + on-prem with multiple DX links
- Multi-AZ firewalls in HA pairs
- VPC peering + TGW overlapping paths
- Load balancer preservation of client IP
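The reason asymmetry breaks stateful devices (for example, HA firewall pairs in different AZs) is that connection state lives only on the box that saw the SYN. A toy model of that effect; `StatefulFirewall` and `handshake_completes` are hypothetical and ignore real details like state sync between HA peers.

```python
class StatefulFirewall:
    """Toy connection-tracking firewall: only replies to flows it saw start."""

    def __init__(self, name: str):
        self.name = name
        self.sessions = set()  # (client, server) pairs whose SYN passed here

    def forward_syn(self, client: str, server: str) -> bool:
        self.sessions.add((client, server))
        return True

    def accept_reply(self, server: str, client: str) -> bool:
        # A SYN-ACK is accepted only if this box tracked the original SYN.
        return (client, server) in self.sessions

def handshake_completes(fw_forward, fw_return, client, server) -> bool:
    """Send the SYN through fw_forward and the SYN-ACK back through fw_return."""
    fw_forward.forward_syn(client, server)
    return fw_return.accept_reply(server, client)

fw_az_a = StatefulFirewall("fw-az-a")
fw_az_b = StatefulFirewall("fw-az-b")
client, server = "10.0.1.5:40000", "10.1.2.9:443"
print(handshake_completes(fw_az_a, fw_az_a, client, server))  # True: symmetric
print(handshake_completes(fw_az_a, fw_az_b, client, server))  # False: asymmetric
```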
Tools:
mtr -4 <destination>
tracepath <destination>
tcpdump on both sides simultaneously
9. Packet Capture Workflow
9.1 The “two-sided capture” rule
To diagnose anything non-trivial:
- Capture on the source
- Capture on the destination
- Compare flows
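Comparing the two captures amounts to a multiset difference per direction: what one side sent that the other side never received. A minimal sketch, assuming hypothetical flow records transcribed from `tcpdump -nn -r <file>` on each host and no NAT between the capture points (otherwise translate addresses first).

```python
from collections import Counter

# Hypothetical flow records: (src_ip, src_port, dst_ip, dst_port, tcp_flags).
def lost_in_transit(sent, received):
    """Packets in the sender's capture that never show up in the receiver's."""
    return list((Counter(sent) - Counter(received)).elements())

def from_host(packets, ip):
    """Keep only packets whose source address is ip."""
    return [p for p in packets if p[0] == ip]

client_cap = [
    ("10.0.1.5", 40000, "10.0.2.9", 443, "S"),
    ("10.0.1.5", 40000, "10.0.2.9", 443, "S"),    # client retransmits its SYN
]
server_cap = [
    ("10.0.1.5", 40000, "10.0.2.9", 443, "S"),
    ("10.0.2.9", 443, "10.0.1.5", 40000, "SA"),
    ("10.0.1.5", 40000, "10.0.2.9", 443, "S"),
    ("10.0.2.9", 443, "10.0.1.5", 40000, "SA"),
]

forward_lost = lost_in_transit(from_host(client_cap, "10.0.1.5"),
                               from_host(server_cap, "10.0.1.5"))
return_lost = lost_in_transit(from_host(server_cap, "10.0.2.9"),
                              from_host(client_cap, "10.0.2.9"))
print(forward_lost)  # []
print(return_lost)   # both SYN-ACKs: they die on the return path
```

In this example the SYNs arrive but the SYN-ACKs never reach the client, which points at the return path (asymmetric routing or a filter on the way back) rather than at the server.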
9.2 Tools
Linux:
tcpdump -i eth0 -w capture.pcap
Windows:
- Wireshark
- NetMon
AWS:
- VPC Traffic Mirroring to Suricata, Zeek
- GWLB insertion for deep packet inspection
10. Systematic IPv4 Troubleshooting Flow
Step 1: Local host checks
- IP correct?
- Gateway correct?
- ARP table sane?
Step 2: Can the host reach the gateway?
ping <gateway>
Step 3: Routing table sanity
ip route get <destination>
Step 4: DNS resolution confirmed?
Step 5: Is NAT/SNAT working?
- Check NAT allocations / flows.
Step 6: Firewall sanity
- SG, NACL, on-prem firewalls.
Step 7: Use tcpdump on both ends
- Find where the packet dies.
Step 8: MTU / PMTUD
Step 9: Asymmetry / hybrid path issues
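The flow is deliberately bottom-up: once a lower step fails, results from higher steps are not trustworthy. A minimal sketch of that "stop at the first failure" discipline; `first_failure` is a hypothetical helper and the lambdas are placeholders for real checks such as `ping <gateway>` or `ip route get <destination>`.

```python
from typing import Callable, List, Tuple

Check = Tuple[str, Callable[[], bool]]

def first_failure(checks: List[Check]) -> str:
    """Run checks in order and return the name of the first one that fails."""
    for name, check in checks:
        if not check():
            return name
    return "none"

# Placeholder results; each lambda would wrap a real command in practice.
steps = [
    ("1 local host (IP, gateway, ARP)", lambda: True),
    ("2 gateway reachable", lambda: True),
    ("3 routing table", lambda: True),
    ("4 DNS resolution", lambda: False),
    ("5 NAT/SNAT", lambda: True),
    ("6 firewalls (SG, NACL, on-prem)", lambda: True),
    ("7 two-sided tcpdump", lambda: True),
    ("8 MTU / PMTUD", lambda: True),
    ("9 asymmetry / hybrid path", lambda: True),
]
print(first_failure(steps))  # 4 DNS resolution
```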
11. AWS-Specific IPv4 Failure Patterns
Pattern A – “I can SSH out but not in”
Cause:
- No public IPv4
- SG inbound blocked
- NACL inbound blocked
- Route table missing 0.0.0.0/0 → IGW
Pattern B – “EC2 can’t reach the internet”
Cause:
- Private subnet with no route to a NAT gateway
- No IGW attached
- Misconfigured DNS resolver
- SG outbound blocked
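Patterns A and B usually start with where the subnet's 0.0.0.0/0 route points. A minimal sketch that classifies one route table from `aws ec2 describe-route-tables --output json` output (one element of its `RouteTables` list); `internet_path` is a hypothetical helper and the sample JSON is made up. Pattern A additionally needs a public IPv4 on the instance, and Pattern B still needs working DNS and SG egress.

```python
import json

def internet_path(route_table_json: str) -> str:
    """Classify one route table by where its IPv4 default route points."""
    table = json.loads(route_table_json)
    for route in table.get("Routes", []):
        if route.get("DestinationCidrBlock") != "0.0.0.0/0":
            continue
        if route.get("State") == "blackhole":
            return "blackhole default route"
        if route.get("GatewayId", "").startswith("igw-"):
            return "public subnet (IGW)"
        if route.get("NatGatewayId"):
            return "private subnet (NAT gateway)"
        if route.get("TransitGatewayId"):
            return "default route via TGW"
        return "default route to another target"
    return "no default route"

public = '{"Routes": [{"DestinationCidrBlock": "0.0.0.0/0", "GatewayId": "igw-0example", "State": "active"}]}'
isolated = '{"Routes": [{"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local", "State": "active"}]}'
print(internet_path(public))    # public subnet (IGW)
print(internet_path(isolated))  # no default route
```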
Pattern C – On-prem → AWS via DX works, but AWS → on-prem fails
Cause:
- Asymmetric routing through VPN fallback
- BGP prefix advertisement mismatch
- On-prem firewall drops AWS source ranges
Pattern D – VPC → TGW → DX colocation firewalls
Cause:
- Stateful devices drop return path
- MTU mismatch on GRE/IPSec tunnels
- Missing reverse route propagation