NFS stops responding

Fri Apr 2 10:17:21 EDT 2010

Admittedly there is WAY too much blood in my caffeine stream at the
moment but ...

According to your email describing the Wireshark trace, the client sends
a request, times out, retransmits, times out, and errors out; but the
server sees the request, replies, sees the retransmit, and replies as
well.  That sounds less like an NFS issue and more like a network issue.

Assuming the Wireshark traffic is as I understand it from your email, I
see only a couple of options.

1 - The reply packet is incorrectly addressed (bad ARP entry?)

2 - The reply is correctly addressed but 'lost' in network transit
(collisions, spanning tree issue, ...).
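
Option 1 can be checked by comparing what the server's ARP cache says
about the client with the MAC address the client really has.  Below is a
minimal sketch, assuming the server is a Linux box that exposes its
cache in /proc/net/arp.  The function names, the IP address, and the MAC
addresses are all made up for illustration.

```python
# Hypothetical sample of /proc/net/arp contents; the addresses are invented.
SAMPLE_ARP = """IP address       HW type     Flags       HW address            Mask     Device
192.168.1.20     0x1         0x2         00:16:3e:aa:bb:cc     *        eth0
192.168.1.1      0x1         0x2         00:16:3e:00:00:01     *        eth0
"""


def arp_lookup(arp_table_text, ip):
    """Return the MAC listed for ip in /proc/net/arp-style text, or None."""
    for line in arp_table_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        # Columns: IP address, HW type, Flags, HW address, Mask, Device
        if len(fields) >= 6 and fields[0] == ip:
            return fields[3].lower()
    return None


def arp_entry_matches(arp_table_text, ip, expected_mac):
    """True only if the cache holds an entry for ip and it equals expected_mac."""
    found = arp_lookup(arp_table_text, ip)
    return found is not None and found == expected_mac.lower()
```

On the server you would pass in the text of /proc/net/arp, the client's
IP, and the MAC address read off the client's own interface.  A mismatch
at the moment of a hang would point at option 1; a match pushes the
suspicion toward option 2.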

The obvious third choice is: this analysis is faulty because either I
have totally misread your email (likely, see caffeine above) or you have
misrepresented the traffic (more fairly put: you may have left out
details you thought nonessential, resulting in a potential
misrepresentation of the traffic, and I fell for it).

So first question is: do you agree with my statement of your traffic
analysis?  Second question (assuming 'yes', as 'no' shortens the
discussion drastically <grin>) is: can you confirm/repeat it (did it
happen that way, was it a one-time thing only in that instance, ...)?
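
One way to confirm it without eyeballing two Wireshark windows is to
line up the two captures by RPC transaction ID (XID).  The sketch below
assumes each capture has already been reduced to (xid, msg_type) pairs,
for instance by exporting Wireshark's rpc.xid and rpc.msgtyp fields, and
that a retransmitted call reuses the XID of the original, as RPC
retransmits normally do.  The function names are hypothetical.

```python
from collections import Counter

# RPC message types as defined by the ONC RPC protocol.
CALL, REPLY = 0, 1


def summarize(records):
    """Count calls and replies per XID in a list of (xid, msg_type) pairs."""
    calls, replies = Counter(), Counter()
    for xid, msg_type in records:
        if msg_type == CALL:
            calls[xid] += 1
        elif msg_type == REPLY:
            replies[xid] += 1
    return calls, replies


def retransmitted(records):
    """XIDs that appear as a call more than once in one capture."""
    calls, _ = summarize(records)
    return sorted(xid for xid, count in calls.items() if count > 1)


def lost_replies(client_records, server_records):
    """XIDs the server answered but whose reply never shows up at the client."""
    _, client_replies = summarize(client_records)
    _, server_replies = summarize(server_records)
    return sorted(xid for xid in server_replies if client_replies[xid] == 0)
```

If lost_replies() comes back non-empty for the XIDs around a hang, the
replies left the server and died somewhere between the two capture
points, which is the network-issue reading of the trace.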

Again assuming yes, I'd recommend you check network routing, then check
for periods of high collisions or similar difficulties.  Given that it
works for a while then fails, I'm less inclined to think basic
routing/connectivity and more inclined to think something more
'intermittent' like discarded packets, DHCP address changes, routing
changes (RIP?), or even a weird switch spanning tree issue due to some
cross-connected switches in the network (yes I've seen it, don't ask).
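
For the collisions/discards check, a cheap first look is the per
interface counters the kernel already keeps.  This is a rough sketch,
assuming Linux and the usual /proc/net/dev layout (eight receive columns
followed by eight transmit columns); the function names and the sample
numbers are invented.

```python
# Hypothetical /proc/net/dev snapshot; the numbers are invented.
SAMPLE_DEV = """Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo: 500 5 0 0 0 0 0 0 500 5 0 0 0 0 0 0
  eth0: 1000 10 0 2 0 0 0 0 2000 20 1 0 0 5 0 0
"""


def parse_net_dev(text):
    """Extract error, drop and collision counters per interface."""
    stats = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        nums = [int(x) for x in data.split()]
        if len(nums) < 16:
            continue  # not the layout this sketch assumes
        stats[name.strip()] = {
            "rx_errs": nums[2],
            "rx_drop": nums[3],
            "tx_errs": nums[10],
            "tx_drop": nums[11],
            "collisions": nums[13],
        }
    return stats


def counter_deltas(before, after):
    """How much each counter of one interface grew between two snapshots."""
    return {key: after[key] - before[key] for key in before}
```

Take one snapshot while NFS is healthy and another right after a hang,
then compare the eth0 entries on both machines.  Counters that jump only
around the failures would fit the 'intermittent' theory; flat counters
on both ends say to look at the switches in between instead.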

If I am incorrect in my assumptions, then can we have a clearer, more
detailed description of the wire traffic so we can understand which end
(presumably the client?) has the issue and when/where in the
conversation it occurs (connection setup, file attribute/access,
directory attribute/access, data transfer, ...)?

I'm NOT an NFS guy, but assuming it isn't a network issue, it seems
likely that knowing whether it is always the same NFS operation that
fails, and which operation that is, might help (assuming it is possible
to tell that from logs or a trace or ...).  Like I said, I'm not an NFS
guy and have no idea what NFS data it is possible to gather OR how much
of a pain in the butt it is to gather it.  I'm only asking based on the
general principle that narrowing the scope of inquiry to a single
operation/function/code section is usually a good thing: basically a
'the more detailed data available the better' request.  So please keep
the initial NFS-specific data gathering to a low pain level until
someone with more NFS experience jumps in with better-defined data
requests.
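
If the trace does let you name the operation behind each call, a
low-pain way to see whether it is always the same one is a simple tally
of the calls that never got an answer.  This is only a sketch: it
assumes you can pull (xid, operation name) pairs for the calls and the
set of XIDs that were answered out of the client capture, and the
function name is hypothetical.

```python
from collections import Counter


def unanswered_by_operation(calls, replied_xids):
    """Tally unanswered calls per operation, counting retransmits once.

    calls        -- iterable of (xid, operation_name) pairs seen as calls
    replied_xids -- set of XIDs for which a reply was captured
    """
    seen = set()
    tally = Counter()
    for xid, operation in calls:
        if xid in replied_xids or xid in seen:
            continue
        seen.add(xid)  # a retransmitted call keeps its XID, so skip repeats
        tally[operation] += 1
    return tally
```

A tally dominated by one operation narrows things nicely; failures
spread evenly over every operation look more like the network again.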

On 2010-04-01 19:43, Michael ODonnell wrote:
> I gathered some Enet traffic for Wireshark analysis on both machines thus:
> 
>    dumpcap -i eth0 -w /tmp/`hostname`.pcap
> 
> ...and viewed the client traffic with Wireshark, which (apparently)
> confirms that the client did indeed wait a while and then (apparently)
> retransmitted the NFS request.  The weird thing is that Wireshark analysis
> of corresponding traffic on the server shows the first request coming in
> and being turned around immediately, then we later see the retransmitted
> request arrive and it, too, is promptly processed and the response goes
> out immediately.