NFS stops responding

Michael ODonnell michael.odonnell at comcast.net
Thu Apr 1 19:43:34 EDT 2010


I've run out of clues (EBRAINTOOSMALL) trying to solve an NFS puzzle
and could use some help getting unstuck.  Analysis is awkward because
the customers in question are trying to make what use they can of the
machines even as these problems are occurring around them, so reboots
and other dramatic acts have to be scheduled well in advance.

Symptoms: after approx 1 hour of apparently normal behavior, operations
like 'df -k' or 'ls -l' hang for minutes at a time and then fail with
I/O errors on any of the three machines when such operations refer to
NFS mounted directories.  At that point, doing this on all 3 machines:

   umount -f -l -a -t nfs

...followed by this:

   mount -a -t nfs

...on all 3 gets things unstuck for another hour.  (?!?!)

The 3 machines have NFS relationships thus:

  A mounts approx 6 directories from B (A->B)
  B mounts approx 6 directories from A (B->A)
  C mounts approx 6 directories from A (C->A) (same dirs as in B->A)
  C mounts approx 6 directories from B (C->B) (same dirs as in A->B)

All systems are running x86_64 CentOS5.4 on HP xw8600 workstations
connected via a Dell 2608 PowerConnect switch that's believed to be
functioning properly.  No jumbo packets.  All MTUs are the standard
1500.  I've tried specifying both UDP and TCP in the fstab lines.
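
For concreteness, I mean fstab lines of roughly this shape (the host,
path, and option values here are made up, not the real ones):

   # NFS mount from B, TCP variant; timeo is in tenths of a second
   hostB:/export/work  /mnt/b/work  nfs  rw,hard,intr,tcp,timeo=600,retrans=3  0 0

...with the UDP variant just swapping "udp" for "tcp".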

I've disabled selinux.  The output of 'iptables -L' is:

   Chain INPUT (policy ACCEPT)
   target     prot opt source               destination

   Chain FORWARD (policy ACCEPT)
   target     prot opt source               destination

   Chain OUTPUT (policy ACCEPT)
   target     prot opt source               destination

These commands:

   service nfs status ; service portmap status

...indicate nominal conditions (all expected daemons reported running)
when things are working but also when things are b0rken.
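
A similar sanity check can be made against the portmapper directly;
"serverB" here is a stand-in for whichever server is involved:

   rpcinfo -p localhost     # list RPC programs registered locally
   rpcinfo -u serverB nfs   # poke the remote nfs service over UDP
   rpcinfo -t serverB nfs   # ...and over TCP

...which should squawk if the nfs service has fallen off the remote
portmapper.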

There wasn't anything very informative in /var/log/messages with the
default debug levels but messages are now accumulating there at firehose
rates because I enabled debug for everything, thus:

   for m in rpc nfs nfsd nlm; do rpcdebug -m $m -s all; done
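
The same loop with -c instead of -s should shut the firehose off again
by clearing those debug flags (at least per my reading of the rpcdebug
man page):

   for m in rpc nfs nfsd nlm; do rpcdebug -m $m -c all; done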

After machine A exhibited the problem I *think* I see evidence in
/var/log/messages that the NFS client code has decided it never got a
response from the server (B) to some NFS request, so it retransmits the
request and (I think) it then concludes that the retransmitted request
also went unanswered so the operation is errored out.
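
If the retransmission theory is right, the client-side RPC
retransmission counter ought to be climbing while the hang is in
progress, so comparing its output before and during a hang should be
telling:

   nfsstat -rc    # client-side RPC stats: calls, retrans, authrefrsh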

I gathered some Ethernet traffic for Wireshark analysis on both machines thus:

   dumpcap -i eth0 -w /tmp/`hostname`.pcap

...and viewed the client traffic with Wireshark, which (apparently)
confirms that the client did indeed wait a while and then (apparently)
retransmitted the NFS request.  The weird thing is that Wireshark analysis
of corresponding traffic on the server shows the first request coming in
and being turned around immediately, then we later see the retransmitted
request arrive and it, too, is promptly processed and the response goes
out immediately.  So, if I'm reading these tea leaves properly, it's as
if the client lost the ability to recognize the reply to that request.  [?!]
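
One way to pin that down might be to grab the XID of a stalled request
from the client-side capture and filter both captures on it, so the
same RPC transaction can be lined up on both wires (the file names and
the XID below are placeholders, not values from the real traces):

   tshark -r A.pcap -R "rpc.xid == 0x12345678"
   tshark -r B.pcap -R "rpc.xid == 0x12345678"

If B's capture shows the reply going out but A's never shows it
arriving, the switch moves back under suspicion; if A's capture shows
the reply arriving and the client still times out, then the client
kernel is somehow ignoring it.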

But, then, how could it be that all 3 machines seem to get into this state
at more or less the same time?  And why would unmounting and remounting
all NFS filesystems then "fix" it?   Aaaiiieeee!!!


 [ Unfortunately, this problem is only occurring at the one
   customer site and can't be reproduced in-house, so unless
   I can find a way to first sanitize the logs I may not be
   permitted to publish them here...       >-/               ]


