on hard drive failures

Bill McGonigle bill at bfccomputing.com
Mon Sep 3 16:43:15 EDT 2007


On Jan 6, 2006, at 11:47, Bill McGonigle wrote:

> So, then beyond hardware, I'm looking for suggestions as to what  
> are likely causes of software-based filesystem corruption.

I'm following up on a question I asked quite a while ago.  I may have  
just found the answer, or at least a theory that fits the data.

ext3 writes data to its journal but does not checksum the journal.  
So if a drive is doing cached writes and you lose power, the journal  
could be full of bad data, and on replay at boot time the filesystem  
will eat itself.  There's a simple fix, the mount option:

   barrier=1

which, according to kerneltrap.org, does the following:

   Request barriers, also known as write barriers, provide a mechanism
   for guaranteeing the order of disk I/O operations without actually
   waiting for the data to be written to disk. Specifically, a request
   barrier guarantees that any data queued up prior to the barrier will
   be written to disk before data queued up after the barrier. Without
   a request barrier, the block layer can reorder how data is written
   to disk for maximum performance. The problem with this being most
   notably with journaling filesystems which require that their
   metadata be updated prior to actually updating data, allowing true
   crash recovery. Without request barriers, a journaling filesystem
   has to wait for the metadata change to be written to disk before it
   can proceed with actually updating the filesystem. Hence, the
   addition of request barriers provides a performance boost for
   journaled filesystems.
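
To make the ordering guarantee in that quote concrete, here is a toy model in Python.  It is only a sketch of the idea, not how the kernel's block layer is implemented: the BARRIER marker, the dispatch() function and the request names are all made up for illustration, and "reordering" is simulated with a random shuffle.

```python
import random

# Hypothetical marker standing in for a barrier request in the queue.
BARRIER = object()


def dispatch(queue, rng):
    """Return one possible write order for the queued requests.

    Requests between two barriers may be reordered freely (simulated
    here with a shuffle), but no request ever crosses a barrier.
    """
    order, segment = [], []
    for request in queue:
        if request is BARRIER:
            # Flush everything queued before the barrier, in any order.
            rng.shuffle(segment)
            order.extend(segment)
            segment = []
        else:
            segment.append(request)
    # Requests queued after the last barrier.
    rng.shuffle(segment)
    order.extend(segment)
    return order


if __name__ == "__main__":
    queue = ["journal-a", "journal-b", "journal-c", BARRIER, "commit-record"]
    print(dispatch(queue, random.Random()))
```

However the three journal blocks get shuffled, the commit record always comes out last, which is the property a journaling filesystem is relying on.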

The conventional wisdom is to battery-back your drives so your write  
caches don't get lost, but that doesn't account for flaky hardware  
(from the BIOS to the drive controller, and everywhere in between, it  
seems), so I'm now mounting my ext3 filesystems with:

   data=journal,barrier=1
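
For anyone who wants to apply the same options across an /etc/fstab, a minimal sketch in Python follows.  The harden_fstab_line() helper is hypothetical, it assumes the usual whitespace-separated fstab layout (device, mount point, type, options, dump, pass), and it rewrites the field separators as tabs, so review the output before installing it.

```python
REQUIRED = ("data=journal", "barrier=1")


def harden_fstab_line(line):
    """Add data=journal and barrier=1 to an ext3 fstab entry.

    Comments, blank lines and non-ext3 entries are returned unchanged.
    """
    fields = line.split()
    if line.lstrip().startswith("#") or len(fields) < 4 or fields[2] != "ext3":
        return line
    # Drop any existing data=/barrier= setting so the result has no conflicts.
    options = [opt for opt in fields[3].split(",")
               if not opt.startswith(("data=", "barrier="))]
    options.extend(REQUIRED)
    fields[3] = ",".join(options)
    return "\t".join(fields)
```

For example, an entry with the options field "defaults" comes back with "defaults,data=journal,barrier=1".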

This seems like it'll probably be reasonably safe until ZFS lands.  
Some very rough benchmarking on my backup system (kernel  
2.6.22.1-41.fc7, with data=journal on) shows performance with  
barrier=1 on vs. off to be within the margin of error.  Ted Ts'o  
remarked around March of this year that they were going to turn  
barriers on by default sooner or later, but I haven't found a way to  
check from userspace whether that has happened.
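
One partial check is to look at what the kernel reports in /proc/mounts.  The Python sketch below only parses that file's text; ext3_mount_options() is a made-up helper name.  Whether the ext3 driver in a given kernel lists barrier there at all is an assumption I can't confirm, so a missing option in the output says nothing about the built-in default.

```python
def ext3_mount_options(mounts_text):
    """Map each ext3 mount point to the option list reported for it.

    Expects text in the /proc/mounts layout:
    device, mount point, filesystem type, options, dump, pass.
    """
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[2] == "ext3":
            result[fields[1]] = fields[3].split(",")
    return result
```

On a Linux box, ext3_mount_options(open("/proc/mounts").read()) gives a dictionary you can scan for "barrier=1" and "data=journal" on each mount point.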

-Bill

-----
Bill McGonigle, Owner           Work: 603.448.4440
BFC Computing, LLC              Home: 603.448.1668
bill at bfccomputing.com           Cell: 603.252.2606
http://www.bfccomputing.com/    Page: 603.442.1833
Blog: http://blog.bfccomputing.com/
VCard: http://bfccomputing.com/vcard/bill.vcf


More information about the gnhlug-discuss mailing list