NFS stops responding

Fri Apr 2 22:26:44 EDT 2010

>> The client isn't seeing the replies?  Blame the router, blame
>> the router!
>
>Heh.  I'd love to, and I just acquired a brand new switch to use as
>an experimental replacement for the one currently deployed.  I'll be
>ecstatic if that fixes things, though I'm not optimistic.

From your last note below, neither am I, though it's well worth
the experiment.

>I don't really trust my interpretation of what Wireshark is showing
>me but, if I'm correct, the problem is not that we stop seeing return
>traffic from the server, it's more that the client code stops making sane
>decisions in response when it arrives.  Maybe the packets aren't getting
>all the way back down the stack to be processed by the client code?

Up the client stack.  It certainly could be.  Unfortunately I don't
know much about Linux NFS.  On Tru64 the code paths are extremely
different for UDP and TCP, though ultimately a server thread could
handle either a UDP or TCP message.  Woah.  Not true - I had a set of
threads for TCP and a set for UDP.  Immaterial here - the different
code paths in Linux may or may not shed light on why UDP and TCP fail.

>Wireshark display of relevant traffic while observing 'ls -l mountPoint'
>on client hang and then return with 'I/O Error' :
>
>  On CLIENT A:
>  #     Time       SRC DST PROT INFO
>  1031  1.989127   A   B   NFS  V3   GETATTR Call, FH:0x70ab15aa
...
>  29395 61.989380  A   B   NFS  V3   GETATTR Call, FH:0x70ab15aa [retransmission of #1031]
...
>  97138 181.989898 A   B   NFS  V3   GETATTR Call, FH:0x70ab15aa

60 second retransmits.  That's pretty typical of NFS over TCP timeouts.
NFS over UDP retransmits much sooner since retransmission is its only way
to recover from lost packets.  (Aside - NFS over TCP "shouldn't" ever need
to retransmit; we do because implementors figured there could be some
critically low memory situations where the only thing the server could do
is abandon the NFS request.  I don't know of any that do.  There's also
some confusion with connections closed by the server after it receives a
request but before a reply goes out.  NFS clients will recreate the
connection and in this case must be able to retransmit.)
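
For what it's worth, the gaps between the three GETATTR calls shown above
can be worked out directly from the timestamps.  A small Python sketch (the
names "calls" and "gaps" are made up for illustration; only the frame
numbers and times come from the trace):

```python
# Frame numbers and timestamps (seconds) copied from the GETATTR lines above.
# Frames hidden behind the "..." lines are not accounted for.
calls = [
    (1031, 1.989127),
    (29395, 61.989380),
    (97138, 181.989898),
]

def gaps(frames):
    """Return (earlier frame, later frame, seconds between) for adjacent pairs."""
    return [
        (a_no, b_no, b_time - a_time)
        for (a_no, a_time), (b_no, b_time) in zip(frames, frames[1:])
    ]

for a_no, b_no, delta in gaps(calls):
    print(f"#{a_no} -> #{b_no}: {delta:.3f} s")
```

The first gap is about 60 seconds and the second about 120, which would fit
a timeout that doubles after each attempt, though the elided frames could
change that picture.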

>All other network plumbing appears to be in working order while the
>problem is occurring - I can connect from one system to another at will
>via SSH, rsync, HTTP, ping, etc.

Serious bummer.  It could still be the router, but it would take a
pretty weird problem.  Sometimes a system will screw up a particular
message, possibly generating a bogus checksum.  That's hard to do with
TCP, since the protocol has position counters (sequence numbers) that
monotonically increase with each data byte.
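
To make the checksum point concrete, here is a minimal Python sketch of the
16-bit ones' complement checksum used by IP, UDP and TCP, in the style of
RFC 1071.  It is only an illustration: the real UDP and TCP checksums also
cover a pseudo-header, and the sample bytes below are not from the trace.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum over data, in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"                            # pad odd-length input
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]      # big-endian 16-bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)   # fold carries back in
    return ~total & 0xFFFF

payload = bytes.fromhex("0001f203f4f5f6f7")        # sample bytes, not from the trace
good = internet_checksum(payload)

# Flip one bit, as flaky hardware might, and the stored checksum stops matching.
damaged = bytes([payload[0] ^ 0x01]) + payload[1:]
print(hex(good), internet_checksum(damaged) == good)
```

A receiver that sees a mismatch like this drops the packet and bumps an
error counter, so this kind of damage shows up as missing replies.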

That you're having problems with GETATTR is also a serious bummer
because those requests and replies are short and that makes moot
everything I said about long messages, fragmentation, etc.

A GETATTR is essentially a stat(2).  Some really weird things
can happen if the server reuses a file handle, e.g. once for a regular
file, and then for a directory.  I've seen some Windows NFS
servers do that, but I'm virtually certain Linux is okay.
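
As a rough illustration of what a GETATTR reply is about, here is a small
Python sketch that does a local stat(2) and pulls out a few of the same kind
of attributes (type, size, modification time, file id).  The function name
"describe" is made up for this example.

```python
import os
import stat

def describe(path):
    """Return a few stat(2) attributes, similar in spirit to a GETATTR reply."""
    st = os.stat(path)
    if stat.S_ISDIR(st.st_mode):
        kind = "directory"
    elif stat.S_ISREG(st.st_mode):
        kind = "regular file"
    else:
        kind = "other"
    return {"type": kind, "size": st.st_size,
            "mtime": st.st_mtime, "inode": st.st_ino}

print(describe(os.curdir))
```

A client that cached "regular file" for a handle and then gets "directory"
back for the same handle is the sort of surprise described above.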

I might have missed something really simple, but I fear you have a
really weird problem.  Good things to watch, though they may not shed much
light, are the statistics from nfsstat and "netstat -s".  Also,
"netstat -an | grep 2049" might show curious stuff, like data piling up on
the client's socket.  That would be very good information, but I don't know
what to do with it.
Always keep an eye out for IP/UDP/TCP checksum errors; non-zero
counts are often a sign of some weird hardware or software problem.
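
On a Linux client the socket queue sizes can also be read from
/proc/net/tcp.  A rough Python sketch, assuming the usual column layout of
that file (hex address:port pairs, then a tx_queue:rx_queue field) and a TCP
mount over IPv4; the function name is made up for this example:

```python
NFS_PORT = 2049

def nfs_socket_queues(proc_net_tcp_text, port=NFS_PORT):
    """Return (local, remote, tx_queue, rx_queue) for sockets to the given remote port."""
    rows = []
    for line in proc_net_tcp_text.splitlines()[1:]:    # skip the header line
        fields = line.split()
        if len(fields) < 5:
            continue
        local, remote = fields[1], fields[2]
        remote_port = int(remote.split(":")[1], 16)    # ports are hexadecimal
        if remote_port != port:
            continue
        tx_hex, rx_hex = fields[4].split(":")
        rows.append((local, remote, int(tx_hex, 16), int(rx_hex, 16)))
    return rows

if __name__ == "__main__":
    try:
        with open("/proc/net/tcp") as f:
            for row in nfs_socket_queues(f.read()):
                print(row)
    except OSError:
        print("could not read /proc/net/tcp; this sketch is Linux-only")
```

An rx_queue that stays non-zero while the mount is hung would suggest the
replies reach the client's socket but are never read.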

Check the client a second (and third!) time for a firewall configuration.
If there's some filter that suddenly gets triggered, that could easily
wedge NFS but allow everything else to work.

Good luck....