Timing file read/write over NFS

Ric Werme ewerme at comcast.net
Thu Dec 18 22:04:09 EST 2008


First, read http://h30097.www3.hp.com/tipnfs.html .  While I wrote it
for Tru64 UNIX systems, it has a lot of good information for any NFS
environment.

I don't have much experience with NFS on Linux, so my comments will
be more general than useful.

Re: Jumbo frames.
A 1500-byte GbE frame takes a mere 12 usec of wire time.  My all-time
favorite computer had an integer divide time of about 13 usec.  (At
least it had a divide instruction!)  A 9000-byte frame takes 72 usec.
Jumbo frames are good.
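
For the curious, the arithmetic is just bits divided by the GbE
signaling rate.  A back-of-the-envelope sketch in Python (it ignores
preamble and inter-frame gap, which add a little more):

    # Serialization ("wire") time for one Ethernet frame.
    def wire_time_usec(frame_bytes, link_bps=1e9):
        return frame_bytes * 8 / link_bps * 1e6

    print(wire_time_usec(1500))   # 12.0 usec: standard frame on GbE
    print(wire_time_usec(9000))   # 72.0 usec: jumbo frame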


Re: "Roughly I could get 100 MB/sec read over the network, but 
only 45-60 MB/sec write over the network."

One thing that may be happening is that the file you're reading is in the
server cache, so reading it requires no disk access.  If you're using
NFS V3, the designers' intent is that the client sends data to the
server, which can write it at its leisure until the client issues a
close.  At that point the server has to write any pending data to disk
and wait for it to complete before the client's close finishes.  (This
involves the COMMIT operation that /usr/sbin/nfsstat reports.)  At the
very least, the server will be doing a lot more work with repeated
writes to a file than with repeated reads.
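
If you want to see where that commit cost lands, something like this
sketch separates the write phase from the flush/close phase.  The path
is a hypothetical placeholder, and the explicit fsync stands in for the
commit that the close would otherwise trigger:

    import os, time

    path = "/mnt/nfs/testfile"          # hypothetical NFS-mounted file
    block = b"\0" * (64 * 1024)         # 64 KB per write

    t0 = time.time()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    for _ in range(16 * 1024):          # 1 GB total
        os.write(fd, block)
    t1 = time.time()
    os.fsync(fd)                        # server must commit data to disk here
    os.close(fd)
    t2 = time.time()
    print("write phase %.2f s, commit/close phase %.2f s" % (t1 - t0, t2 - t1))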

There are various other things that can lead to a read/write dichotomy,
but this is usually the biggest.  Several people have gotten themselves
flummoxed by reading file data from the client cache and getting more
throughput than the link allows.  There are ways of discarding data
cached on the client - what are you doing to ensure that?  (Nfsstat
is a big help in seeing if there's a problem.  No reads after the
first read of the file means the data is in the client cache.)
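
On Linux, one way to evict a file from the client cache between runs is
posix_fadvise(DONTNEED) - a sketch, assuming Python 3.3+ and a
hypothetical test file; run nfsstat -c before and after the re-read to
confirm the READ count actually climbs:

    import os

    path = "/mnt/nfs/testfile"          # hypothetical test file
    fd = os.open(path, os.O_RDONLY)
    # Ask the kernel to drop this file's cached pages so the next
    # read has to go over the wire.
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    os.close(fd)

(Writing to /proc/sys/vm/drop_caches as root is the blunter instrument.)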


Re: rsize, wsize
In general, as big as possible, especially for big files.  Some systems
default to 64 KB for NFS over TCP (Tru64, AIX).
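
On Linux the effective values show up in /proc/mounts, so you can check
what each mount actually negotiated - a quick sketch:

    # Print the negotiated rsize/wsize for each NFS mount on this client.
    with open("/proc/mounts") as f:
        for line in f:
            dev, mnt, fstype, opts = line.split()[:4]
            if fstype in ("nfs", "nfs4"):
                print(mnt, [o for o in opts.split(",")
                            if o.startswith(("rsize", "wsize"))])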


Re: TCP window size (not in others' comments)
I don't know what Linux uses for this on NFS over TCP.  On Tru64 I bumped
this up to a MB or so.  Insanely large by some reckoning, but still only
8 msec of wire time by my reckoning.  If I send data at 120 MB/sec, I
have to stop unless I get ACKs back in less than 8 msec.  I have no idea
how to adjust this on Linux or even if you can.
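
That window size is just the bandwidth-delay product - a quick sketch
of the arithmetic:

    # Bandwidth-delay product: the window needed to keep a link full.
    def bdp_bytes(link_bps, rtt_sec):
        return link_bps / 8.0 * rtt_sec

    # GbE with ACKs returning in 8 msec wants roughly a 1 MB window.
    print("%.0f KB" % (bdp_bytes(1e9, 0.008) / 1024))   # ~977 KB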

BTW, another reason I went to a large window size is that a client would
typically have eight writes outstanding, and the BSD network stack does
some astoundingly inefficient stuff when multiple threads are fighting
for access to a single TCP connection.  It gets pretty technical, and
it's just a PowerPoint presentation, but see
http://www.connectathon.org/talks02/werme.pdf for more.  I don't know if
Linux has similar problems, but I wouldn't be surprised.  I didn't
realize the BSD problem until years after I implemented NFS over TCP.

You might try NFS over UDP, but Linux (man nfs) has an especially good
section on the problems with that on a fast link.


Re: Piece of crap GbE switches.  (personal soapbox)
If I spent some time with your GbE switches, I'd probably declare them
crap and start ranting and raving about how things haven't improved
since the last time I looked.  Even if you have top-of-the-line hardware
I bet it can't buffer more than a few msec of data.  However, your
netperf data suggests the switches may be handling the test load, but
Linux CPUs (and NICs?) can't.  While TCP can recover from a few lost
packets with virtually no degradation in throughput, it can't handle
large dropouts well.  If the NFS over TCP window size exceeds the
buffering in your net, don't be surprised if performance goes down.
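
To put "a few msec of data" in numbers, here's a sketch; the per-port
buffer sizes are made-up illustrative figures, not measurements of any
particular switch:

    # How much GbE wire time a given switch buffer can absorb.
    def buffer_msec(buffer_bytes, link_bps=1e9):
        return buffer_bytes * 8 / link_bps * 1000

    for kb in (128, 512, 2048):         # hypothetical per-port buffers
        print("%4d KB holds %.1f msec of GbE traffic"
              % (kb, buffer_msec(kb * 1024)))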


Good luck - a modern NFS system can be difficult to analyze and tune.

