Saturday, January 5, 2013

SNMP, DNS, DHCP and Ubuntu 12.10

I love rabbit trails... or maybe not.  I just spent a couple of days tracking down a problem that first showed up as a sporadic SNMP failure after I upgraded my 32-bit Ubuntu 12.04 (Precise Pangolin) computer to 64-bit 12.10 (Quantal Quetzal).  The root cause turned out to be the way the DNS resolver handled being given multiple DNS servers by DHCP.  The problem only surfaced because I was resolving the name of my firewall, which uses the same name but different IP addresses on the internal and external networks.  Read on for more details.

First, a little background.

My home network has one computer (named "yavin") with multiple NICs acting as the firewall, DNS server, and DHCP server, among other things.  The external NIC has a publicly routable IP address (209.50.21.246).  Yavin's other NIC faces my internal LAN, and like all other devices inside my house on that network, it has a non-routable address on the 192.168.225.0/24 network.  Both IP addresses reverse-resolve to "yavin."

Yavin's DNS server is configured so that queries coming from the internal LAN resolve "yavin" to its internal LAN IP, while queries from the Internet at large resolve "yavin" to its external IP.  Yavin's firewall is configured to allow very few services to connect to the external IP, but quite a few others are allowed from the internal LAN.  One such internal-only service is SNMP, which I use to monitor yavin from another internal server.
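
The split-horizon setup described above boils down to picking an answer based on where the query comes from.  Here's a minimal Python sketch of that decision.  It is not yavin's actual DNS configuration; the function name answer_for and the lookup table are made up for illustration, and only the addresses come from this post.

```python
import ipaddress

# Internal LAN from the post; queries from here should see the internal address.
INTERNAL_NET = ipaddress.ip_network("192.168.225.0/24")

# Hypothetical per-view answers for the single name "yavin".
ANSWERS = {
    "internal": "192.168.225.1",
    "external": "209.50.21.246",
}


def answer_for(client_ip: str) -> str:
    """Return the address a split-horizon server would hand to this client."""
    inside = ipaddress.ip_address(client_ip) in INTERNAL_NET
    return ANSWERS["internal" if inside else "external"]


if __name__ == "__main__":
    print(answer_for("192.168.225.2"))  # a LAN client such as tatooine
    print(answer_for("199.184.119.1"))  # any address outside the LAN
```

The point to notice is that the answer depends on who asks the server, so a client that asks a different server entirely never gets the internal view.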

Yavin is also my DHCP server, and all internal devices get their network information from it at boot time.  In the interest of redundancy, I've always configured DHCP to hand out multiple DNS servers to its clients.  If one DNS server is down, I want another one to transparently handle queries for all my internal machines.  Yavin itself is the first DNS server listed, and the two others I used belong to my upstream ISP.

I monitor my network using MRTG and SNMP from another machine on my internal network (named "tatooine").  It used to run Ubuntu 12.04 32-bit as its operating system.  Over the long New Year's weekend, I upgraded it to Ubuntu 12.10 64-bit.  Nothing changed on yavin or its services during this upgrade.

As soon as the upgrade finished, the SNMP queries that run every 5 minutes started intermittently failing when connecting to yavin.  They'd work for a few consecutive runs, and then fail for a few runs.  The error text emailed to me was:

SNMP Error: no response received
SNMPv1_Session (remote host: "yavin" [209.50.21.246].161)
                  community: "undisclosed"
                 request ID: 598122027
                PDU bufsize: 8000 bytes
                    timeout: 5s
                    retries: 5
                    backoff: 1)
 at /usr/share/perl5/SNMP_util.pm line 492
SNMPGET Problem for ifInOctets.2 ifOutOctets.2 sysUptime sysName on undisclosed@yavin::::::v4only at /usr/bin/mrtg line 2339
2013-01-05 11:35:02: WARNING: skipping because at least the query for ifInOctets.2 on  yavin did not succeed
2013-01-05 11:35:02: WARNING: no data for ifInOctets&ifOutOctets:undisclosed@yavin. Skipping further queries for Host yavin in this round.


With these coming to my email inbox every five minutes, I decided the problem really needed to be fixed ASAP.  Tcpdump(8) on tatooine showed me that the SNMP packets (on UDP port 161) were being sent to yavin's external IP address, and yavin's kernel log showed the firewall rejecting them for exactly that reason (that's what Rule 8 does):

tatooine# tcpdump -n -i eth1 -s 1600 port 161
11:25:06.508879 IP 192.168.225.2.52837 > 209.50.21.246.161:  C=undisclosed GetRequest(74)  .1.3.6.1.2.1.2.2.1.10.2 .1.3.6.1.2.1.2.2.1.16.2 .1.3.6.1.2.1.1.3.0 .1.3.6.1.2.1.1.5.0
11:25:11.513999 IP 192.168.225.2.52837 > 209.50.21.246.161:  C=undisclosed GetRequest(74)  .1.3.6.1.2.1.2.2.1.10.2 .1.3.6.1.2.1.2.2.1.16.2 .1.3.6.1.2.1.1.3.0 .1.3.6.1.2.1.1.5.0


yavin# tail -f /var/log/kern.log | grep DPT=161
Jan  3 11:25:03 yavin kernel: RULE 8 -- DENY IN=eth1 OUT= MAC=00:02:a5:e9:31:da:04:4b:80:80:80:03:08:00 SRC=192.168.225.2 DST=209.50.21.246 LEN=120 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=UDP SPT=58645 DPT=161 LEN=100
Jan  5 11:25:06 yavin kernel: RULE 8 -- DENY IN=eth1 OUT= MAC=00:02:a5:e9:31:da:04:4b:80:80:80:03:08:00 SRC=192.168.225.2 DST=209.50.21.246 LEN=120 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=UDP SPT=52837 DPT=161 LEN=100


The "-n" option to tcpdump is important, because it shows the actual IP address being used rather than the hostname.  This told me that the failed queries were going to yavin's external IP, while the successful queries were going to yavin's internal IP.
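
If you'd rather not eyeball the firewall log, the KEY=value fields in lines like the ones above are easy to pick apart.  The following is a rough Python sketch, assuming the log format shown in this post; the helper names parse_fields and hit_external are my own inventions, not part of any tool.

```python
import re

# Matches KEY=value tokens in a netfilter-style kernel log line.
FIELD_RE = re.compile(r"(\w+)=(\S*)")


def parse_fields(line: str) -> dict:
    """Collect KEY=value pairs; for repeated keys (LEN appears twice) the first wins."""
    fields = {}
    for key, value in FIELD_RE.findall(line):
        fields.setdefault(key, value)
    return fields


def hit_external(line: str, external_ip: str = "209.50.21.246") -> bool:
    """True if the logged packet was SNMP (UDP port 161) addressed to the external IP."""
    f = parse_fields(line)
    return (
        f.get("PROTO") == "UDP"
        and f.get("DPT") == "161"
        and f.get("DST") == external_ip
    )


SAMPLE = (
    "Jan  5 11:25:06 yavin kernel: RULE 8 -- DENY IN=eth1 OUT= "
    "MAC=00:02:a5:e9:31:da:04:4b:80:80:80:03:08:00 SRC=192.168.225.2 "
    "DST=209.50.21.246 LEN=120 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF "
    "PROTO=UDP SPT=52837 DPT=161 LEN=100"
)

if __name__ == "__main__":
    print(parse_fields(SAMPLE)["SRC"], hit_external(SAMPLE))
```

Counting the lines where hit_external is true, grouped by timestamp, would show the same on-again, off-again pattern as the MRTG emails.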

The next step was to trace yavin's internal NIC for all DNS requests coming from tatooine.  The successful requests queried their DNS information directly from yavin's DNS server:

yavin# tcpdump -n -i eth1 port 53 and host 192.168.225.2
11:50:06.201943 IP 192.168.225.2.45808 > 192.168.225.1.domain:  8108+ A? yavin.jedi.com. (32)
11:50:06.202806 IP 192.168.225.1.domain > 192.168.225.2.45808:  8108* 1/3/3 A 192.168.225.1 (160)

The failed requests queried their DNS information from one of the upstream DNS servers instead:

11:55:07.204384 IP 192.168.225.2.36361 > 199.184.119.1.domain:  48948+ A? yavin.jedi.com. (32)
11:55:07.204755 IP 199.184.119.1.domain > 192.168.225.2.36361:  48948* 1/3/3 A 209.50.21.246 (160)

As you'd expect, yavin replied with its internal IP, while the upstream servers only knew about its external IP.  That makes sense in hindsight, since yavin's DHCP server was handing out several DNS servers to all of my internal computers, and the upstream ones have no knowledge of my internal addresses.  However, this isn't the way it had been working before the 12.10 install.
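
To make the mismatch obvious at a glance, the tcpdump responses above can be grouped by which server sent them.  This is only a sketch: the regular expression is tuned to the exact line format shown in this post (a single A record per response), and the name answers_by_server is made up for the example.

```python
import re
from collections import defaultdict

# Matches a DNS response line: "<server>.domain > <client>: <id>* n/n/n A <address>"
RESPONSE_RE = re.compile(
    r"IP (\d+\.\d+\.\d+\.\d+)\.domain > \S+:\s+\d+\S*\s+\d+/\d+/\d+ A (\d+\.\d+\.\d+\.\d+)"
)

# The four capture lines from this post (two queries, two responses).
CAPTURE = [
    "11:50:06.201943 IP 192.168.225.2.45808 > 192.168.225.1.domain:  8108+ A? yavin.jedi.com. (32)",
    "11:50:06.202806 IP 192.168.225.1.domain > 192.168.225.2.45808:  8108* 1/3/3 A 192.168.225.1 (160)",
    "11:55:07.204384 IP 192.168.225.2.36361 > 199.184.119.1.domain:  48948+ A? yavin.jedi.com. (32)",
    "11:55:07.204755 IP 199.184.119.1.domain > 192.168.225.2.36361:  48948* 1/3/3 A 209.50.21.246 (160)",
]


def answers_by_server(lines):
    """Map each responding DNS server to the set of A records it returned."""
    result = defaultdict(set)
    for line in lines:
        match = RESPONSE_RE.search(line)
        if match:  # query lines do not match and are skipped
            server, address = match.groups()
            result[server].add(address)
    return dict(result)


if __name__ == "__main__":
    for server, addresses in answers_by_server(CAPTURE).items():
        print(server, "->", sorted(addresses))
```

Two servers giving two different answers for the same name is the whole problem in a nutshell.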

It appears as though previous versions of Ubuntu--or rather, the version of the resolver library used in previous versions of Ubuntu--would always query the first DNS server in the list, never bothering to ask the others unless the first one gave it problems.  Since yavin's DNS server always worked, the external DNS servers were never queried, and yavin's external IP address was never received.

With the 12.10 update, the resolver (libresolv?) now appears to rotate through all the known DNS servers to share the load.  This behavior has always been allowed by the DNS spec (glibc's resolver has long offered it as the opt-in "rotate" option in resolv.conf), but I had never seen it as the default.  This means that my situation could appear with pretty much any version of Linux, Unix, Windows, or MacOS whose resolver spreads its queries across servers.  The fact that I'm running Ubuntu is irrelevant.
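
The difference between the two behaviors is easy to see in a toy simulation.  The Python sketch below is a simplification under assumptions: I never confirmed the real selection policy on 12.10, so plain round-robin stands in for it, only the one upstream server that appears in my captures is modeled, and the function names strict_order and rotate are mine.

```python
from itertools import cycle

# What each server answers for "yavin", taken from the captures in this post.
SERVER_ANSWERS = {
    "192.168.225.1": "192.168.225.1",  # yavin itself: internal view
    "199.184.119.1": "209.50.21.246",  # upstream ISP server: external view
}

# Order as handed out by DHCP, with yavin first.
SERVERS = list(SERVER_ANSWERS)


def strict_order(n_queries):
    """Always ask the first server (assumes it never fails)."""
    return [SERVER_ANSWERS[SERVERS[0]] for _ in range(n_queries)]


def rotate(n_queries):
    """Ask the servers round-robin, one server per query."""
    picker = cycle(SERVERS)
    return [SERVER_ANSWERS[next(picker)] for _ in range(n_queries)]


if __name__ == "__main__":
    print("strict:", strict_order(4))
    print("rotate:", rotate(4))
```

With strict ordering every lookup gets the internal address; with rotation the external address shows up on some lookups, which matches the on-again, off-again SNMP failures.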

The solution for me was to remove the external DNS servers from the list handed out by yavin's DHCP server (the "domain-name-servers" option in /etc/dhcpd.conf).  Since yavin is my network firewall and router, if its DNS server is down, the chances of any traffic making it out to the rest of the world are also very slim, so losing name resolution at that point isn't a big concern.

With the dhcpd.conf file changed, I restarted yavin's DHCP server and then renewed tatooine's DHCP lease to get the new DNS server list:

yavin# service dhcpd restart
tatooine# dhclient -r
tatooine# dhclient

Of course, this is a very specialized case, since there are only two machines on my network which have different internal & external IP addresses.  For any other name being queried, the external DNS servers would have done the job just swimmingly.

So that's how I've spent most of my last two days.  I've left out all of the dead ends and wild geese that I chased in the process, just so I don't confuse anybody.  Hopefully, somebody else out there will benefit from my experience.  Throughout this ordeal, I carried on a nice little monologue on the UbuntuForums web site, in case you're interested.

What's been the weirdest DNS problem you've ever come across?  Please share it in the comments below.  After all, misery loves company.
