FreeNAS/ZFS woes (was Re: Is bcache ready for enterprise production?)
Alan Johnson
alan at datdec.com
Sat Aug 16 20:05:05 EDT 2014
On Fri, Aug 15, 2014 at 10:41 AM, Derek Atkins <warlord at mit.edu> wrote:
> Alan Johnson <alan at datdec.com> writes:
>
> > I'm in the process of replacing a FreeNAS install at $WORK with Linux. I
>
> I'm curious why you are replacing FreeNAS?
>
> I've heard nothing but good things about it, so I'd be interested in
> hearing about your negative experiences.
>
Hmm... yes... well, I was hoping to avoid this, but it is only fair after
making such a statement, and I'm sure this community will benefit from my
story, long though it may be.
Let me start by saying that this story is spread across more than a year,
so my recollection of the specifics may not be reliable, but I will do my
best to point out when I am particularly fuzzy.
*ZFS: Not as Simple as I Had Come to Understand*
So, when it all started, I, like you, had heard nothing but good things
about FreeNAS and ZFS. I had loosely followed both for years. I had
developed the impression that ZFS was supposed to be this really easy-to-use
system that takes a bunch of disks, or SSDs, and just does smart things
with them. You don't have to think about them. Leading up to my own use, I
did a lot of reading of both forums and manuals and found this to not be
the case at all. You have to understand and decide between different
redundancy schemes with their various trade-offs, and if you want to use
SSDs you have to work those in manually, being very careful about how you
slice them up for different kinds of caching. You could bring the system
to its knees if you have TOO MUCH cache! In any case, it is certainly no
less complex than working with LVM and software RAID in Linux.
*The Setup*
Moving on, after quite a bit of study, I spec'd a box specifically for the
task of running FreeNAS/ZFS for some tier-2 storage:
- Dell R515
- AMD Opteron(tm) Processor 4228 HE, 6 cores @ 2.8 GHz
- 64GB of RAM (ZFS is very RAM hungry, even hungrier than this, as I found
out very painfully)
- 12 x 4TB 7.2k HDDs
- 2 x 512GB Crucial M4 SSDs (We use these all over the place.)
I installed FreeNAS 8.3 (the latest at the time) with a 40TB RAIDZ2 (for the
uninitiated, that's similar to RAID6 but a bit smarter and supposedly
better performing) on the disks, and used the SSDs for L2ARC (read cache) and
ZIL (write cache, or it might be more accurate to think of it as a journal;
I have come to think of it as some mix of the 2). I created one big zpool
(kind of like an LVM volume group) with the default light-weight
compression turned on, but not deduplication. Never use dedupe, no matter
how well suited you think your data set is for it; compression will get you
almost all of the benefit without the huge performance hit and RAM needs of
dedupe... or so I have taken away from the FreeNAS forums and even the
manual.
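For anyone curious what that layout looks like outside the WUI, here is a
rough command-line sketch. The pool and device names are made up, and
FreeNAS really wants you to build the pool through its own volume manager
so it can label the disks itself, so treat this as illustration rather
than a recipe:

    # 12 spinning disks in a single RAIDZ2 vdev (~40TB usable from 12 x 4TB)
    zpool create tank raidz2 da0 da1 da2 da3 da4 da5 da6 da7 da8 da9 da10 da11

    # light-weight compression on (which meant lzjb back then); dedupe left
    # at its default of off
    zfs set compression=on tank
    zfs get compression,dedup tank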
My first, and very minor, nit to pick is that the ZIL is limited by the
amount of RAM you have. If you exceed a certain ratio between the 2, the
system has problems. I forget if it is strictly performance that suffers
or if the system becomes unstable, but I read enough ahead of time to avoid
it. I created 2 appropriately sized partitions on the SSDs and used them
as a mirrored ZIL. The rest of the SSDs I used as L2ARC without
redundancy; ZFS will just go to disk if some or all of the L2ARC goes
missing, as one would expect for a read-cache. Later on, I learned that
too much L2ARC can hurt performance too, but that was much less significant
and I was able to reduce it while the system was online.
Note that FreeNAS does not provide a WUI for managing L2ARC or ZIL, nor for
doing block device partitioning. I think this is reasonable for this
project, but it is worth mentioning. I did all of that stuff at the command
line. If you are not already familiar with BSD, like me, it can be very
tricky and take some time, but it all worked as expected with some fairly
simple commands from various guides that pop right up on Google.
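The gist of those commands, with hypothetical sizes and labels (my exact
numbers are long gone, thanks to the non-persistent root history I gripe
about below), was something like this:

    # carve each SSD into a small slice for the ZIL and the rest for L2ARC
    gpart create -s gpt da12
    gpart add -t freebsd-zfs -l zil0 -s 16G da12
    gpart add -t freebsd-zfs -l cache0 da12
    # (repeat for the second SSD as da13 / zil1 / cache1)

    # mirrored ZIL (SLOG) across the two SSDs, unmirrored L2ARC
    zpool add tank log mirror gpt/zil0 gpt/zil1
    zpool add tank cache gpt/cache0 gpt/cache1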
*FreeNAS Runs on USB Sticks, like it or not*
A bigger problem I have with FreeNAS is that it is nearly impossible to
boot it off anything other than a USB stick or SD card. Because it runs
from flash, it does a great job of blocking writes to those devices. This
has many very annoying side effects when trying to do anything under the
hood, and even above the hood it prevents the persistence of things like
the root command history and logs. I think some of this has been addressed
in newer versions, but I'm not sure to what extent, and we didn't reboot
that much. Also, to be fair, you can point it to a syslog server in the
WUI easily enough. We had
at least one system crash due to a bad USB drive despite having spent way
too much time trying to find very reliable drives. The cheap thumb drives
I bought at Staples for emergency repair of the box didn't give us any
problems. Go figure. But oh how I wanted so badly to slice off a bit of
my SSDs for a mirrored r/w install of the OS. I came to understand why
FreeNAS does not have this option, but it is just one more thing that says
"not for the enterprise".
*The Trouble Starts*
It ran pretty well like this as an NFS/SMB server and we first got into
trouble when I started using iSCSI to provide storage to an oVirt cloud to
run some less important VMs. The first round of trouble was on me for
diddling with the default iSCSI settings without knowing what I was doing.
I could not find good documentation on this anywhere; even the iSCSI RFC
was of little help, so I tried my best to make them safer for our
environment. After a week or 2 of no trouble, it started falling over a
lot. I reset the settings to defaults and it worked well enough for
another few months.
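For context, what oVirt was consuming were a few zvols exported as iSCSI
device extents. Created by hand, a zvol looks roughly like this (pool and
zvol names hypothetical; in FreeNAS 8.3 the zvol and the extent are
normally both made in the WUI):

    # a 2TB zvol to back one iSCSI LUN; -s makes it sparse (thin provisioned)
    zfs create -s -V 2T tank/ovirt-lun0
    zfs list -t volume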
*Things Get Worse...*
Over the following months, we started using it more and more. Then, while I
was out of the office for a week, the team decided it was necessary to put
a large chunk of files (installers for our software) on it that were heavily
used by our automated build and test system. Around a hundred VMs and
physical machines hit them over SMB and NFS fairly often. It got pretty slow
for all use, but the real nastiness came from the latency it created on the
iSCSI side. That choked our oVirt cloud in some very bad ways. Most of the
real pain came from the way this old version (3.1) of oVirt handled the
storage issues, but that is definitely off topic here. (In short, it didn't
handle it; critical VMs with no connection to this box were choked just as
much as the less important ones on it.)
*...but no indication as to why.*
Where FreeNAS failed was in giving any indication of where the trouble was.
It provides many pretty graphs, and it reports nicely via SNMP to our
OpenNMS box, but never in this or any other performance issue have I seen
anything that shows me where the bottleneck is. The 6 CPUs are all well
under half of max, RAM is full of cache, there is no swapping, the network
interfaces are nowhere near capacity, and so on. The primary metric FreeNAS
is missing (due to the underlying FreeBSD) is iowait. Linux has had this for
a very long time and it is a particularly important data point for a
storage box.
So, if it does not show IO wait, and the bottleneck is nowhere else, then
it must be in IO wait, right? But ZFS is supposed to scream in this
configuration. It certainly fell well short of the performance I have
come to expect from 2 of these SSDs alone. Something keeps tickling
me that compression might be the culprit, but that should show up as CPU
load, right? I don't know; all I know is that performance did not get
anywhere near expectations and I could not figure out why from the WUI,
SNMP, or the command line, something I do regularly and easily on Linux boxes.
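To be fair to FreeBSD, some raw numbers do exist at the command line even
if FreeNAS never surfaces them in the WUI. These are the sorts of things I
poked at without ever getting a clear answer (pool name hypothetical):

    # FreeBSD/FreeNAS: per-disk busy%/latency and per-vdev throughput
    gstat
    zpool iostat -v tank 5

    # Linux: the same question answered in one place, %iowait included
    iostat -x 5

    # and a sanity check on whether compression is earning its keep
    zfs get compression,compressratio tank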
Anyway, we managed to get enough VMs turned off that the cloud calmed down,
then moved those files to a system backed by our EqualLogic stack. (I love
Dell's EQ stuff, if you can afford it. It pretty much really does just "do
smart things" and you have to shoot rockets at it to take it down.)
Eventually, we got all of our VMs off of this box, and that's when ZFS had
its biggest fail.
*The Big ZFS Fail*
After months of poking and squeezing and nudging, we finally managed to
empty the storage oVirt was using on this box. I'm feeling good. oVirt,
no longer having any knowledge of this box, is feeling a bit better. So,
on a Friday afternoon (of course), I'm feeling cocky after all this and I
go do something crazy like delete the zvols (similar concept to an LVM
logical volume) that were the backing devices for the iSCSI LUNs. You
know, just to clean up a bit and free that space for other use. Was I too
bold?
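The WUI delete is, as far as I can tell, just a front end for the obvious
command, which is part of what made the fallout so surprising (names
hypothetical again):

    # the functional equivalent of an lvremove; this is all I asked it to do
    zfs destroy tank/ovirt-lun0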
The first one, 2TB, goes away within a few seconds: nominal for the WUI in
my experience. The second one, 5TB, takes a little longer than a few
seconds... then a few minutes... oh, and look at that, OpenNMS has alerted
me that the node is down. I ping and there is no answer. The console
gives me no love either. Time to reboot the box. No love. It hangs
trying to access the zpool with a complaint about the zvol I was trying to
delete.
I let it sit while I go looking for solutions. I find a FreeNAS forum thread
<http://forums.freenas.org/index.php?threads/raidz2-hung-after-zvol-delete.15759/>
describing exactly what is happening to me, only on smaller systems. They
say it slowly eats RAM until it runs out, then panics and reboots.
Ultimately, it came down to adding enough RAM and letting it sit long
enough. They went from 4GB to 8GB. OK, I'm thinking, 64GB should be enough,
I just have to wait long enough. 5-6 hours pass and it just reboots and
starts over. OK, next, get some more RAM. $2000 for 4 32GB DIMMs and we try
again. Same effect, only it takes longer to eat all the RAM. $400 for a
second processor so I can add the 16GB DIMMs back in, and now we are running
with 192GB. It grinds for half a day, eats up ~164GB of RAM, but finally
completes the "import" process it is trying to run. After a little bit
more hackery, the data is back online. However, that RAM is still "in use",
so I reboot. All is well.
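While it ground away, about the only visibility I had was watching kernel
and ARC memory climb from the console, along these lines (as with
everything else here, this is from memory, so treat it as approximate):

    # watch ZFS ARC size and the kernel memory ceiling while the import grinds
    sysctl kstat.zfs.misc.arcstats.size vm.kmem_size
    top -o res

    # and, when the box was responsive enough to answer, check the pool state
    zpool status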
Before I put it back in service, I go to delete another 5TB zvol and it
hangs for another half a day, but completes the delete, having locked up
~164GB of RAM again. After another reboot, I put it back in service long
enough to copy 14TB of data off to a temporary server, not daring to delete
the final zvol.
*Conclusion: Linux please*
So, the bottom line is, for some reason, something in my setup caused ZFS
to lock up the OS (no ping, even) while it eats up ~164GB of RAM, just to do
the functional equivalent of an lvremove. Some of the things I read while
trying to solve this suggested that ZFS has to process some huge chunk of
data under certain conditions, and that it needs 3-5GB of RAM for every TB
of data in the zpool from which the zvol is being deleted.
Now, to be fair, the recommended specs for ZFS from FreeNAS are 3-5GB of
RAM for every TB of storage, but they only talk about performance for this.
They don't say you might need that much just to delete a zvol, and that if
it is not enough, your data isn't coming back until you add more, so really
you should go with 5GB/TB. Even if they HAD said that, they would have to
add "your system might be offline for HOURS while it uses up all this RAM
when you try to delete a zvol."
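For scale, taking those numbers at face value: a 40TB pool at 5GB/TB works
out to roughly 200GB of RAM, and at 3GB/TB to roughly 120GB, a range that
neatly brackets both the ~164GB I watched it consume and the 192GB it
ultimately took to finish the delete.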
That same forum thread
<http://forums.freenas.org/index.php?threads/raidz2-hung-after-zvol-delete.15759/>
suggests that this behavior is EXPECTED with zpools that have deduplication
turned on. (Have I mentioned that you should NEVER USE DEDUPE!? Just don't
do it.) One post suggests that it can be triggered even if dedupe is turned
off but was on at some point. This makes some sense, because when you turn
it off, ZFS does not bother to re-duplicate the already-deduplicated data;
only new data being added is stored without dedupe. I can even make sense
of a long process that has to scan all the data looking for duplicates
before deleting a zvol, but why does it need to lock up that much memory
all at once? Why does it need to lock up the system until it finishes? Why
is that memory not released for use when it is done? Oh, and what does any
of this have to do with my data, which was NOT deduplicated?
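If you want to check whether a pool of yours has ever been down this road,
these are the commands I would try (pool name hypothetical):

    # current property settings
    zfs get dedup,compression tank

    # dedup table (DDT) statistics; no DDT entries means dedupe never
    # actually stored anything on this pool
    zpool status -D tank
    zdb -DD tank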
I have no respect for a design that does any of this. I am DONE with ZFS
and I am not impressed with FreeNAS. I'm not even very friendly toward
FreeBSD after this. Take away ZFS and I don't expect I'll have any desire
to use anything other than Linux for a DIY storage box again.