Checking for Blackhole Routing

Blackhole routing comes about when a device in an IP data path (a 'router', although it may be called something else) drops packets without telling the originator. Blackholing can be done deliberately in response to a flood of messages (a Denial of Service attack). But it can and does occur from time to time as a result of device misconfiguration. I have encountered the latter three times now -- twice at work and once at home. I suspect that it is a lot more common than most IT people think. Blackhole routing results in really bizarre communication problems that are very difficult to troubleshoot. They are often 'fixed' by some kludge, without anyone really understanding or correcting the real problem.

Blackhole routers typically come about when two pretty reasonable technology practices come into conflict. The first practice is using Path MTU Discovery (PMTUD), which was defined in RFC1191. The second practice is turning off as many non-critical internet related services as possible in order to avoid hacking attacks on unanticipated defects, such as buffer overflows, in the software providing the services. Specifically, some system administrators turn ICMP off. The problem is that Path MTU Discovery depends on ICMP: a router that cannot forward a packet without fragmenting it is supposed to send an ICMP message back to the originator saying so. If ICMP is turned off, that message never arrives and the originator keeps using an MTU that is too large for the path. When a wrongly set MTU is combined with the RFC1191 practice of setting the Do Not Fragment flag in IP packets, some packets may fall into a "black hole" -- simply vanish. Even worse, it is likely that the majority of IP packets -- especially those from older mainstream software -- will not blackhole. So the symptoms will be that certain programs -- and often only certain options in certain programs -- will fail. For example, TELNET, tracert, ping and Internet Explorer might work, but a Citrix client or VNC might fail over the same link.
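The conflict described above can be sketched as a toy model. This is only an illustration: the Hop class, the send_with_df function and the MTU numbers are made up for the example, and a real router is a good deal more complicated.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Hop:
    mtu: int          # largest IP packet this hop will forward, in bytes
    sends_icmp: bool  # False models a device with ICMP turned off


def send_with_df(packet_size: int, path: Sequence[Hop]) -> str:
    """Outcome of sending one packet with Do Not Fragment set along path."""
    for hop in path:
        if packet_size > hop.mtu:
            # The hop cannot forward the packet and is not allowed to
            # fragment it. It either reports that or silently drops it.
            return "frag_needed" if hop.sends_icmp else "blackholed"
    return "delivered"


# Hypothetical path: the middle hop has a smaller MTU and ICMP turned off.
path = [Hop(1500, True), Hop(1400, False), Hop(1500, True)]

for size in (1300, 1450, 1600):
    print(size, send_with_df(size, path))
```

With this made-up path, packets up to 1400 bytes are delivered, packets over 1500 bytes are reported as needing fragmentation by the first hop, and anything in between vanishes at the middle hop.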

Just a bit more detail.

So, the result of combining RFC1191 and administrators shutting off ICMP is that there can be a size range where packets simply vanish into a black hole. This usually is not difficult to check for (although it may be hard to fix). To check for black hole routing, use the ping tool provided with all unixes and with Windows For Workgroups, Windows 9x, and all NT based Windows. The syntax is different for Windows and Unix, and some older Unix pings can't be used because they do not allow the Do Not Fragment flag to be set. Anyway, for Windows (where ping is best run in an MSDOS window) the syntax is ping dest -f -l size, where dest is the IP address or hostname of the destination and size is the size of the ping data in bytes. For Unix, the syntax is ping dest -D -s size. In the version of ping shipped with Slackware 10.2, the '-D' flag seems to have morphed into '-M do'. '-M want' might be useful if you can figure out how to use it ... which I can't.
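Since the flags differ by platform, a small helper can pick the right ones. This is a minimal sketch: df_ping_command is a made-up name, it assumes that Linux ships the iputils ping (which takes '-M do') and that every other non-Windows system has a BSD-style ping that takes '-D', and the address is a documentation placeholder. It also adds '-n 1' or '-c 1' so that only one packet is sent per run.

```python
import platform


def df_ping_command(dest: str, size: int, system: str = "") -> list:
    """Build a one-packet ping command with Do Not Fragment set."""
    system = system or platform.system()
    if system == "Windows":
        # -n 1: one packet, -f: Do Not Fragment, -l: data size in bytes
        return ["ping", "-n", "1", "-f", "-l", str(size), dest]
    if system == "Linux":
        # Assumes iputils ping, where '-M do' sets Do Not Fragment.
        return ["ping", "-c", "1", "-M", "do", "-s", str(size), dest]
    # Assumption: anything else has a BSD-style ping that accepts -D.
    return ["ping", "-c", "1", "-D", "-s", str(size), dest]


# 192.0.2.1 is a placeholder address, not a real destination.
print(" ".join(df_ping_command("192.0.2.1", 1472, "Linux")))
```

Note that size here is the ping data only. For IPv4 without options the packet on the wire is 28 bytes larger (20 bytes of IP header plus 8 bytes of ICMP header), so 1472 bytes of data makes a 1500 byte packet.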

The process is simply to send a large packet -- bigger than can be handled without fragmenting. 20000 bytes should be way bigger than can be handled. That should produce a message like "Packet needs to be fragmented, but DF set", while sufficiently small packets should go through and get a reply. Iterate until the largest packet size that gets a response and the smallest that needs fragmentation are known. If there is a size range between these where messages simply time out, there is a blackhole problem on that route for packet sizes in that range. (MSDOS/Windows will tell you about timeouts. Linux won't, but when you hit ctrl-C, it will report a non-zero number of packets sent and 0 received.)
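The iteration can be done as two binary searches. The sketch below is an assumption about how one might automate it, not a tested tool: find_blackhole_range and the probe callback are made-up names, and probe(size) is expected to send one ping with Do Not Fragment set and return 'reply', 'frag_needed' or 'timeout'. It assumes the smallest size gets a reply, the largest gets the fragmentation message, and outcomes only change in that order as the size grows. A fake probe stands in for a real network here.

```python
from typing import Callable, Optional, Tuple


def find_blackhole_range(
    probe: Callable[[int], str], low: int = 0, high: int = 20000
) -> Optional[Tuple[int, int]]:
    """Return (first, last) size that times out, or None if there is no gap.

    Assumes probe(low) == 'reply' and probe(high) == 'frag_needed'.
    """
    # Search 1: the largest size that still gets a reply.
    ok, bad = low, high
    while bad - ok > 1:
        mid = (ok + bad) // 2
        if probe(mid) == "reply":
            ok = mid
        else:
            bad = mid
    largest_reply = ok

    # Search 2: the smallest size reported as needing fragmentation.
    no, yes = largest_reply, high
    while yes - no > 1:
        mid = (no + yes) // 2
        if probe(mid) == "frag_needed":
            yes = mid
        else:
            no = mid
    smallest_frag = yes

    if smallest_frag - largest_reply > 1:
        return (largest_reply + 1, smallest_frag - 1)
    return None


def fake_probe(size: int) -> str:
    """Made-up path: replies up to 1372, silence up to 1472, errors above."""
    if size <= 1372:
        return "reply"
    if size >= 1473:
        return "frag_needed"
    return "timeout"


print(find_blackhole_range(fake_probe))  # prints (1373, 1472)
```

On a real link a timeout can also be ordinary packet loss, so a real probe should probably repeat each size a few times before calling it a timeout.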

Detecting such a problem is easy. Fixing it however ... Good Luck. The best bet would be to fix the offending router(s). This is likely not to be easy even if you control the device. If someone else controls the router, your chances of finding a support person who understands the problem, acknowledges that it is their problem, and can fix it are not especially good. You can try. Maybe you'll be lucky.

The next best choice is to turn off Path MTU Discovery and to manually set the MTU to a size that is known to be safe. In Unix, this might be as simple as feeding a parameter to ifconfig (I haven't tried it). In Windows, it requires tinkering with MTU settings in the Registry, which is not especially fun in Windows 9x and is a lot less fun in NT. NT based Windows has, in my (thankfully) limited experience, a ludicrous number of MTU related settings in the Registry. One easy trap to fall into is that NT defaults to using hex numeric values for MTU. Entering decimal numbers without converting to hex will have VERY unsatisfactory results.
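The hex trap is easy to see with a little arithmetic. The sketch below uses 1400 as a made-up example of a "safe" MTU, and the two helper names are hypothetical. It only shows the number conversion and does not touch the Registry.

```python
def mtu_as_hex(mtu: int) -> str:
    """Hex digits to type when the entry field expects hexadecimal."""
    return format(mtu, "X")


def value_if_read_as_hex(typed: str) -> int:
    """What actually gets stored if decimal digits are read as hex."""
    return int(typed, 16)


print(mtu_as_hex(1400))              # 578
print(value_if_read_as_hex("1400"))  # 5120
```

So typing 1400 into a hex field asks for an MTU of 5120 bytes, far larger than intended, while the value that was actually wanted is 578 in hex.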

Here's a link to Cisco Tech Note 13709 -- Adjusting IP MTU, TCP MSS, and PMTUD on Windows and Sun Systems, which purports to address most of these issues, including which registry entries to edit. I have not validated the document (and do not intend to), but it looks consistent with my experience with this stuff.

Copyright 2006, Donald Kenney (Donald.Kenney@GMail.com). Permission is hereby granted to use any materials on this page under the V2.5 Creative Commons License. This page has been validated as correct HTML 4.01 Transitional.