NFS stops responding

Thu Apr 15 21:08:10 EDT 2010

On Wed, Apr 14, 2010 at 1:33 PM, Ric Werme <ewerme at comcast.net> wrote:
>>  It doesn't help that, in early implementations at least, NFS's
>> default error recovery mechanism is apparently "hang the whole machine
>> until it starts working again".
>
> News to me, except on diskless clients with too little RAM.

>  Sigh. ... there's even some commentary about this failure mode in
>  /The Jargon File/; look under "nightmare file system".

The description sounds like it might be exhaustion of the fixed mbuf
pool in 4.2 BSD (and likely in SunOS).  I'd be very surprised if it were
a loop at a high SPL.

>  This was also 15 years ago, maybe things suck less now.

Likely.  I got laid off from DEC/HP a lot earlier than a lot of people
expected due in part to a low rate of NFS bug reports (and a lot of those were
network infrastructure issues, seen with NFS because it's the dominant
protocol at a lot of sites).

Now at Oracle, my desktop system mounts my home directory, which I think
is in Austin but cached in Burlington.  I build kernel code in Burlington
from other Austin exports.  Some things reach over to Redwood Shores, but
we try to avoid using systems over there: the latency is pretty high and
our home directories aren't in the automount NIS maps.

Occasionally my subgroup's main build system gets hung up in automount
somehow, and we had some problems when IT changed caching servers in
Burlington, to say nothing of the main transformer there failing something
like three times in the past year.  All in all it works a lot better than I'd
expect a WAN NFS to work.  There was one interesting case where my background
came in handy.  Some clients and servers could handle 64 KB I/O, but the
caching systems in the middle only handled 32-48 KB.  However, they passed the
64 KB offer from the server to the clients and things worked fine as long as
we stayed with small files.  Big files hung.  Once I figured that out, we
mounted everything with 32 KB I/O limits until IT and NetApp got things
patched.
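
A rough sketch of why small files worked and big ones hung, as a toy model
rather than real NFS code.  The function names, the file sizes, and the choice
of 32 KB for the cache limit (the low end of the 32-48 KB range) are all
assumptions for illustration.  The cap in the last step stands in for whatever
mount options set the limit; on many clients that would be rsize and wsize,
but which options were actually used is not stated above.

```python
KB = 1024

def negotiated_size(client_max, server_offer):
    """Transfer size the client settles on: the smaller of its own limit and the offer it sees."""
    return min(client_max, server_offer)

def transfer_ok(file_size, io_size, cache_max):
    """A transfer works only if no single request is larger than the cache can relay."""
    request = min(file_size, io_size)  # a small file never fills a full-size request
    return request <= cache_max

client_max = 64 * KB
server_offer = 64 * KB
cache_max = 32 * KB  # assumed value for the caching system in the middle

# The cache relays the server's 64 KB offer unchanged, so the client picks 64 KB.
io_size = negotiated_size(client_max, server_offer)
small_ok = transfer_ok(8 * KB, io_size, cache_max)    # request fits in the cache
big_ok = transfer_ok(1024 * KB, io_size, cache_max)   # 64 KB request is too big

# Workaround: cap the client at 32 KB so every request fits.
capped = negotiated_size(32 * KB, server_offer)
big_ok_capped = transfer_ok(1024 * KB, capped, cache_max)

print(io_size, small_ok, big_ok, capped, big_ok_capped)
```

The point of the model is that the effective limit has to be the minimum over
every hop, and a middle box that forwards the server's offer without lowering
it hides its own limit from the client.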

Things have improved a lot in the last 15 years.  After fixing the original
automount code early in my DEC NFS career in 1992, I became convinced that the
author (he's actually an okay guy) wrote it with a goal of making no kernel
changes and never tested it on anything beyond his workstation.  What I
started with had major bugs when a table filled up and it had to increase
things or start using a list, and lots of other things I've
forgotten/suppressed.  It took years before some coworkers were willing to try
it again, but it got a lot of brownie points with the sysadmins.  Sun started
producing much better code sometime in the 1990s, and that helped vendors
still relying on Sun source.