Monday, June 29, 2015

Path MTU, IP Fragmentation and MSS

For the last few weeks I have been troubleshooting high latency on SATCOM and 3G infrastructure. Long story short, I found that the "Don't Fragment" (DF) bit was set to 1 on the UDP packets involved. So in this post I would like to write about Path MTU discovery, IP fragmentation, and how the two relate.


In the example topology above, the host LINUX1 sends a packet to LINUX3. The packet has to travel along a path whose links have different MTU sizes.

Here is how Path MTU comes into play. Assume the packet leaving LINUX1 has a total length of 1450 bytes. Because the link between LINUX1 and LINUX2 has a 1500-byte limit, there is no problem. However, once LINUX2 receives the packet, it sees that the link it must use to forward it has a lower MTU than the packet's size. Under normal circumstances, LINUX2 sends an ICMP notification (Destination Unreachable, "fragmentation needed") back to LINUX1 that says, “Hey dude, I can’t forward this packet because the next link on the way has an 800-byte MTU. Do something and lower your packet size.”

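LINUX1 gets this ICMP message and limits its subsequent packets to 800 bytes, and then the packets flow through. But why isn’t the packet simply fragmented? After all, the documentation says that when the next link's MTU is lower than the size of the packet being forwarded, the packet is fragmented. The reason is the DF bit: when it is set, a router is not allowed to fragment the packet, so it drops it and reports back to the sender instead.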
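To make the forwarding decision concrete, here is a minimal Python sketch of what a router like LINUX2 does with an oversized packet. It is a simulation, not real networking code: the `Packet` class and `forward` function are hypothetical names, and a 20-byte IPv4 header without options is assumed.

```python
from dataclasses import dataclass

IP_HEADER = 20  # bytes; assumes an IPv4 header without options


@dataclass
class Packet:
    total_length: int  # IP header + payload, in bytes
    df: bool           # Don't Fragment bit


def forward(packet, next_hop_mtu):
    """Return what a router does with `packet` on a link with `next_hop_mtu`."""
    if packet.total_length <= next_hop_mtu:
        return ("forward", [packet.total_length])
    if packet.df:
        # ICMP Destination Unreachable, "fragmentation needed" (type 3, code 4),
        # reporting the next-hop MTU back to the sender.
        return ("icmp_frag_needed", next_hop_mtu)
    # DF clear: split the payload. Every fragment except the last carries a
    # multiple of 8 bytes, because the fragment offset counts 8-byte units.
    per_fragment = (next_hop_mtu - IP_HEADER) // 8 * 8
    payload = packet.total_length - IP_HEADER
    sizes = []
    while payload > 0:
        chunk = min(per_fragment, payload)
        sizes.append(chunk + IP_HEADER)
        payload -= chunk
    return ("fragment", sizes)


print(forward(Packet(1450, df=True), 1500))   # ('forward', [1450])
print(forward(Packet(1450, df=True), 800))    # ('icmp_frag_needed', 800)
print(forward(Packet(1450, df=False), 800))   # ('fragment', [796, 674])
```

With DF cleared, the 1450-byte packet from the example would cross the 800-byte link as two fragments of 796 and 674 bytes; with DF set, nothing crosses and the sender only hears about it through ICMP.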

This is where Path MTU discovery comes in. The MTU that LINUX1 has learned shows up in its route cache:

LINUX1# ip route show cache 192.168.111.2
192.168.111.2 from 172.30.73.219 via 172.30.73.85 dev eth0
    cache  expires 596sec mtu 800 advmss 1460 hoplimit 64


Can you see it? LINUX1 now knows that it shouldn’t send any packet bigger than 800 bytes to this destination. As the output shows, this cache entry expires in 596 seconds.
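If you want to check the learned MTU from a script, the attribute line can be pulled apart with a few lines of Python. This is only a sketch that assumes the exact layout shown above ("cache" followed by name/value pairs); the output format of `ip route` differs between iproute2 versions, and `parse_cache_line` is a hypothetical helper, not part of any tool.

```python
def parse_cache_line(line):
    """Parse a line such as 'cache expires 596sec mtu 800 advmss 1460 hoplimit 64'."""
    tokens = line.split()
    if not tokens or tokens[0] != "cache":
        raise ValueError("not a route cache attribute line")
    # After the leading 'cache', the tokens alternate between name and value.
    fields = dict(zip(tokens[1::2], tokens[2::2]))
    result = {}
    for key, value in fields.items():
        # 'expires' carries a 'sec' suffix in the output shown above.
        digits = value[:-3] if value.endswith("sec") else value
        result[key] = int(digits)
    return result


info = parse_cache_line("    cache  expires 596sec mtu 800 advmss 1460 hoplimit 64")
print(info["mtu"], info["expires"])  # 800 596
```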

During my troubleshooting, I asked myself what would happen if I blocked every ICMP packet sent from LINUX2. The answer: communication halts! LINUX2's feedback about the next link's MTU never arrives, so LINUX1 keeps sending packets of up to 1500 bytes. Since the DF bit is set, fragmentation can’t happen, and everything is stuck. This is a very bad thing indeed!
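The difference between working Path MTU discovery and this black hole can be sketched with a toy simulation in Python. The `send` function and its retry count are hypothetical, and the path is reduced to the two link MTUs mentioned in the example (1500 and 800 bytes); a real stack has timers and other recovery mechanisms that are not modelled here.

```python
def send(size, link_mtus, icmp_blocked, max_tries=5):
    """Simulate a sender with DF set.

    Returns (sizes_tried, delivered).
    """
    tried = []
    for _ in range(max_tries):
        tried.append(size)
        # First link on the path that is too small for the packet, if any.
        bottleneck = next((mtu for mtu in link_mtus if mtu < size), None)
        if bottleneck is None:
            return tried, True   # packet fits every link
        if icmp_blocked:
            continue             # no feedback: retransmit at the same size
        size = bottleneck        # ICMP "fragmentation needed" lowers the size
    return tried, False


print(send(1500, [1500, 800], icmp_blocked=False))  # ([1500, 800], True)
print(send(1500, [1500, 800], icmp_blocked=True))   # ([1500, 1500, 1500, 1500, 1500], False)
```

With ICMP allowed, the second attempt already gets through at 800 bytes. With ICMP blocked, the sender never learns anything and repeats the same oversized packet.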

So what can I do on the LINUX3 side to prevent this if I can’t reach the LINUX1 admin? This is where MSS (Maximum Segment Size) comes in. MSS is not really a negotiated value: each side announces its own MSS when the TCP connection is set up, so whatever LINUX3 announces, LINUX1 must obey and never send LINUX3 a TCP segment larger than that.

LINUX3# ip route change 0.0.0.0/0 dev eth0 advmss 700

After this command, all subsequent TCP SYN packets from LINUX3 advertise an MSS of 700. Because LINUX1 obeys this and sizes its packets accordingly, the packet flow is not disrupted.
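The arithmetic behind the value 700 is easy to check. The sketch below assumes 20-byte IPv4 and 20-byte TCP headers with no options (options such as timestamps would make the packets slightly larger); the function names are made up for illustration.

```python
IP_HEADER = 20   # bytes; IPv4 without options (assumption)
TCP_HEADER = 20  # bytes; TCP without options (assumption)


def packet_size_for_mss(mss):
    """Largest IP packet a peer sends when it honours the advertised MSS."""
    return mss + IP_HEADER + TCP_HEADER


def max_mss_for_mtu(mtu):
    """Largest MSS whose packets still fit into a link with this MTU."""
    return mtu - IP_HEADER - TCP_HEADER


print(packet_size_for_mss(700))  # 740
print(max_mss_for_mtu(800))      # 760
print(max_mss_for_mtu(1500))     # 1460
```

So an advertised MSS of 700 yields packets of at most 740 bytes, which fit through the 800-byte link with some room to spare, and the 1460 in the `advmss` field of the route cache output is simply what a 1500-byte MTU allows.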
