NFS stops responding

Thu Apr 1 23:17:04 EDT 2010

> After machine A exhibited the problem I *think* I see evidence in
> /var/log/messages that the NFS client code has decided it never got a
> response from the server (B) to some NFS request, so it retransmits the
> request and (I think) it then concludes that the retransmitted request
> also went unanswered so the operation is errored out.

So, you have a path client -> switch -> server (request) -> file system ->
server (reply) -> switch -> client.

And that's a simple setup.  Add automount and NIS maps and then you'll
have real fun.

> I gathered some Enet traffic for Wireshark analysis on both machines thus:
> 
>    dumpcap -i eth0 -w /tmp/`hostname`.pcap
> 
> ...and viewed the client traffic with Wireshark, which (apparently)
> confirms that the client did indeed wait a while and then (apparently)
> retransmitted the NFS request.  The weird thing is that Wireshark analysis
> of corresponding traffic on the server shows the first request coming in
> and being turned around immediately, then we later see the retransmitted
> request arrive and it, too, is promptly processed and the response goes
> out immediately.  So, if I'm reading these tea leaves properly it's as
> if the client lost the ability to recognize the reply to that request.  [?!]

Sounds that way.  I learned to hate network switches when I was doing
GbE performance work with NFS.  I also got a phone call at 2230 one Friday
night before a blizzard about a customer with an inscrutable NFS problem,
ultimately traced to a new router they hadn't thought to tell me
about.  (And a panic in code that choked on malformed NFS packets which
was actually helpful in fingering the router as the corrupter.)

Oh, this problem.  So you're seeing the replies leave the server.
That's good, it means you're talking about a networking problem, not
hung file systems or such stuff.  It also means your subject line
isn't quite right.  That's okay, happens all the time, but do note
the server _is_ responding.  The packets are getting lost.
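
If you want to pin down exactly which replies went missing, compare the
RPC transaction IDs (XIDs) in the two captures.  Here is a rough sketch,
assuming you have already pulled the reply XIDs out of each pcap into
plain lists (Wireshark exposes them as the rpc.xid field); the function
name and input format are mine, not anything from your setup.

```python
def lost_replies(server_sent_xids, client_received_xids):
    """Return XIDs the server replied to but the client never saw.

    A retransmitted request reuses its XID, so the server list may
    contain the same XID more than once; each is reported only once.
    """
    seen_by_client = set(client_received_xids)
    lost = []
    for xid in server_sent_xids:
        if xid not in seen_by_client and xid not in lost:
            lost.append(xid)
    return lost


# Server answered 0x1b twice (original + retransmit); client saw neither.
print(lost_replies([0x1A, 0x1B, 0x1B, 0x1C], [0x1A, 0x1C]))
```

Anything this returns left the server and never reached the client,
which is what points at the network in between.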

The client isn't seeing the replies?  Blame the router, blame the router!
Except - take a closer look at those replies.  In the IP header, is
the destination address the client's?  In the Ethernet header, is
the destination address the router's?  If both are true, then the router
must be seeing the replies, but isn't sending them on to the client.
Where are they going?  Check statistics in the router (if you can, another
reason to hate routers).  Check the ARP table in the router (if ... hate ...).

My guess, and I'm getting really rusty on this, is that perhaps some other
system is using the client's IP address.  (Fail - all 3 clients get lost?
That's unlikely.)  Check routing tables if you can (... hate ...).

Simplify, simplify, simplify.  One challenging thing about chasing
problems with NFS is that oftentimes the problem is not NFS.
However, NFS is often the dominant traffic source and people are surprised
to see that telnet/ftp/ssh don't work either.

You can't get much simpler than ping - start a ping from client to server.
If you stop getting ping replies at the same time you stop getting NFS replies,
then don't worry about NFS.  Fix ping.  Routing tables.  ARP tables.
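
To make "the same time" a little less hand-wavy, here is a sketch that
compares when the last ping reply and the last NFS reply arrived at the
client.  It assumes you have the arrival timestamps (seconds) from your
capture; the 5-second tolerance is an arbitrary assumption of mine, and
bursty NFS traffic can make the last NFS reply look early, so treat the
answer as a hint only.

```python
def stopped_together(ping_reply_times, nfs_reply_times, tolerance=5.0):
    """True if the last ping reply and the last NFS reply arrived within
    `tolerance` seconds of each other."""
    if not ping_reply_times or not nfs_reply_times:
        raise ValueError("need at least one timestamp of each kind")
    return abs(max(ping_reply_times) - max(nfs_reply_times)) <= tolerance


# Ping and NFS both go quiet around t=4: fix ping first.
print(stopped_together([1.0, 2.0, 3.0, 4.0], [1.5, 3.8]))
# Ping keeps answering for a minute after NFS goes quiet: look elsewhere.
print(stopped_together([float(t) for t in range(1, 61)], [1.5, 3.8]))
```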

> But, then, how could it be that all 3 machines seem to get into this state
> at more or less the same time?  And why would unmounting and remounting
> all NFS filesystems then "fix" it?   Aaaiiieeee!!!

If it is related to something like IP address reuse, it may be that by
doing unmounts you stop the NFS traffic long enough for the clients
to send out an ARP message stating they have the IP address and hence
reclaim it.  I'm skeptical that's the problem, but it might be related.
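
One way to test the address-reuse theory is to look at the ARP traffic in
your captures and see whether any IP address is claimed by more than one
MAC.  A minimal sketch, assuming you have extracted (IP, MAC) pairs from
the ARP replies and announcements yourself; the addresses in the example
are invented.

```python
from collections import defaultdict


def duplicate_ip_claims(arp_observations):
    """Map each IP claimed by more than one MAC to the sorted list of MACs.

    arp_observations: iterable of (ip, mac) pairs seen in ARP traffic.
    """
    macs_by_ip = defaultdict(set)
    for ip, mac in arp_observations:
        macs_by_ip[ip].add(mac.lower())
    return {ip: sorted(macs) for ip, macs in macs_by_ip.items() if len(macs) > 1}


seen = [
    ("10.0.1.5", "aa:bb:cc:00:00:01"),
    ("10.0.1.5", "AA:BB:CC:00:00:01"),  # same host, different letter case
    ("10.0.1.5", "aa:bb:cc:00:00:02"),  # a second host claiming the address
    ("10.0.1.6", "aa:bb:cc:00:00:03"),
]
print(duplicate_ip_claims(seen))
```

An empty result doesn't clear the router, but a non-empty one tells you
where the replies are probably going.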

Oh, it's still up - check out http://h30097.www3.hp.com/tipnfs.html
I wrote that ages ago to keep people from pestering me with NFS problems
that were not specific to NFS.

Oh - another thing that can do you in.  Verify that the IP
messages and NFS messages are lengths that the recipient can understand.
If one system uses jumbo frames and the other doesn't, things work just
fine for small files and small directories and then something wedges because
a message is too long.

Same for NFS messages - I saw one case where the client said it wanted
64KB I/O, the server reported it could offer 64KB I/O, but the caching
NFS system in the middle only handled up to 32 KB NFS messages.
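
Both size problems come down to the same rule: a message only gets through
if every box on the path can carry it.  A small sketch of that check,
assuming you have written down each hop's limit by hand (MTU in bytes for
the frame case, maximum NFS transfer size for the NFS case); the hop names
are just labels I made up.

```python
def hops_too_small(message_len, limits):
    """Return the names of hops whose limit is smaller than message_len.

    limits: dict mapping a hop name to the largest message it can carry.
    """
    return sorted(name for name, limit in limits.items() if limit < message_len)


def usable_size(limits):
    """Largest message every hop can carry: the minimum of the limits."""
    return min(limits.values())


# Jumbo frames on the client only.
print(hops_too_small(9000, {"client": 9000, "server": 1500}))
# 64KB NFS I/O through a cache that handles only 32KB.
nfs_limits = {"client": 65536, "cache": 32768, "server": 65536}
print(hops_too_small(65536, nfs_limits), usable_size(nfs_limits))
```

Small files and directories fit under the smallest limit, which is why
everything looks fine until the first big transfer.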

Bring bubble wrap to pop as needed.