on hard drive failures
Bill McGonigle
bill at bfccomputing.com
Fri Jan 6 15:38:01 EST 2006
This seems to be the week for hard drive failures for me and my
clients. Some things I've noticed have got me thinking:
9 out of 10 hard drives I've recovered have failed on the first few
sectors. This is especially problematic for boot loaders and
filesystems which lay out their superblocks and journals there. So,
questions that come to mind:
* Is that part of the hard drive especially weak due to geometry? That
would suggest placing superblocks elsewhere.
* Does having the essential filesystem bits there cause the drive to
'use up' that part of the disk first? That would suggest spreading
around filesystem information.
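For ext2/ext3, at least, the spreading-around is partly built in
already: mke2fs scatters backup superblocks across the block groups,
and you can ask it where they land without writing anything. A quick
sketch on a scratch image (the path and size here are just examples):

```shell
# Create a 64 MB scratch image (no root needed) and ask mke2fs where it
# WOULD place the superblock backups, without actually making a
# filesystem (-n).  -F forces it to operate on a regular file.
dd if=/dev/zero of=/tmp/scratch.img bs=1M count=64
mke2fs -n -F /tmp/scratch.img
```

Run with -n against the real device, this reports the same backup
locations that e2fsck -b will want later.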
Then, once I've dd_rescue'd as much of the drive as possible, it's time
to recover the filesystems. ext3 seems especially fragile when the
first block of the drive goes kaput. So:
* Are there filesystems with recovery designed in that are less
susceptible to bad-block damage? An ideal filesystem would let me lose
the files on those blocks but still recover the rest of the disk.
* Are there any maintenance routines that could be run to replicate
essential filesystem information? For instance, where the backup
superblocks, inode tables, and so on are stored. I can't think of any
server I'm running that doesn't have enough spare cycles to do
something like this in a nightly cron job.
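For the ext2/ext3 case, part of what I'm asking for does exist: backup
superblocks live at the start of certain later block groups, and
e2fsck -b can rebuild from one when block 0 is gone, so after imaging
the disk you can often fsck the image against a backup copy. A minimal
demonstration on a scratch image (the path, size, and block numbers
are illustrative; 8193 is the first backup for a 1k-block filesystem):

```shell
# Build a small ext2 filesystem in a file, deliberately destroy its
# primary superblock, then repair it from the first backup copy.
dd if=/dev/zero of=/tmp/fs.img bs=1024 count=16384       # 16 MB image
mke2fs -F -q -b 1024 /tmp/fs.img                         # 1k blocks -> first backup at 8193
dd if=/dev/zero of=/tmp/fs.img bs=1024 seek=1 count=1 conv=notrunc   # zero block 1: the primary superblock
e2fsck -y -b 8193 -B 1024 /tmp/fs.img                    # repair from the backup superblock
```

On a real failing disk you'd image it first (e.g. dd_rescue /dev/hda
/backup/hda.img) and run e2fsck against the image. And for the
nightly-cron idea, saving the output of dumpe2fs for each filesystem
somewhere safe would capture the backup superblock and group
descriptor locations while the disk is still healthy.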
Beyond hardware, then, I'm looking for suggestions as to the likely
causes of software-based filesystem corruption. My primary server lost
its disk last night to filesystem corruption. There are no bad blocks
on the disk (a badblocks r/w test and a SMART extended self-test both
check out OK) and it's running the latest 2.4 kernel. My only theories
are undetected memory errors or kernel bugs, neither of which is
logged. Short of Linux-HA, what's the right way to deal with this?
For the first set of problems, RAID is certainly an answer where one
has possession of the machine. For the no-bad-blocks problem, though,
the same corruption would simply have been propagated across both
disks, so short of rad-hardening the system I'm at a loss for what I
might have done better. Having consistent filesystems seems like an
essential foundation for reliable computing, but clearly I'm not there
yet.
-Bill
-----
Bill McGonigle, Owner Work: 603.448.4440
BFC Computing, LLC Home: 603.448.1668
bill at bfccomputing.com Cell: 603.252.2606
http://www.bfccomputing.com/ Page: 603.442.1833
Blog: http://blog.bfccomputing.com/
VCard: http://bfccomputing.com/vcard/bill.vcf