on hard drive failures
Bill McGonigle
bill at bfccomputing.com
Fri Jan 6 15:38:01 EST 2006
This seems to be the week for hard drive failures for me and my
clients. Some things I've noticed have got me thinking:
9 out of 10 hard drives I've recovered have failed on the first few
sectors. This is especially problematic for boot loaders and
filesystems which lay out their superblocks and journals there. So,
questions that come to mind:
* Is that part of the hard drive especially weak due to geometry? That
would suggest placing superblocks elsewhere.
* Does having the essential filesystem bits there cause the drive to
'use up' that part of the disk first? That would suggest spreading
around filesystem information.
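For ext2/ext3, at least, the spreading-around is partly built in
already: mke2fs scatters backup superblocks across the block groups,
and you can ask it where they land without writing anything. A quick
sketch on a scratch image (the path and size here are just examples):

```shell
# Create a 64 MB scratch image (no root needed) and ask mke2fs where it
# WOULD place the superblock backups, without actually making a
# filesystem (-n).  -F forces it to operate on a regular file.
dd if=/dev/zero of=/tmp/scratch.img bs=1M count=64
mke2fs -n -F /tmp/scratch.img
```

Run with -n against the real device, this reports the same backup
locations that e2fsck -b will want later.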
Then, once I've dd_rescue'd as much of the drive as possible, it's time
to recover the filesystems. ext3 seems especially fragile when the
first block of the drive goes kaput. So:
* Are there filesystems with recovery designed in that are less
susceptible to bad-block damage? An ideal filesystem would let me lose
the files on those blocks but still recover the rest of the disk.
* Are there any maintenance routines that could be run to replicate
essential filesystem information? For instance, where the backup
superblocks, inode tables, and so on are stored. I can't think of any
server I'm running that doesn't have enough spare cycles to do
something like this in a nightly cron job.
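For the ext2/ext3 case, part of what I'm asking for does exist: backup
superblocks live at the start of certain later block groups, and
e2fsck -b can rebuild from one when block 0 is gone, so after imaging
the disk you can often fsck the image against a backup copy. A minimal
demonstration on a scratch image (the path, size, and block numbers
are illustrative; 8193 is the first backup for a 1k-block filesystem):

```shell
# Build a small ext2 filesystem in a file, deliberately destroy its
# primary superblock, then repair it from the first backup copy.
dd if=/dev/zero of=/tmp/fs.img bs=1024 count=16384       # 16 MB image
mke2fs -F -q -b 1024 /tmp/fs.img                         # 1k blocks -> first backup at 8193
dd if=/dev/zero of=/tmp/fs.img bs=1024 seek=1 count=1 conv=notrunc   # zero block 1: the primary superblock
e2fsck -y -b 8193 -B 1024 /tmp/fs.img                    # repair from the backup superblock
```

On a real failing disk you'd image it first (e.g. dd_rescue /dev/hda
/backup/hda.img) and run e2fsck against the image. And for the
nightly-cron idea, saving the output of dumpe2fs for each filesystem
somewhere safe would capture the backup superblock and group
descriptor locations while the disk is still healthy.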
Beyond hardware, then, I'm looking for suggestions as to the likely
causes of software-based filesystem corruption. My primary server lost
its disk last night to filesystem corruption. There are no bad blocks
on the disk (a badblocks r/w test and a SMART extended self-test both
check out OK) and it's running the latest 2.4 kernel. My only theories
are undetected memory errors or kernel bugs, neither of which is
logged. Short of Linux-HA, what's the right way to deal with this?
For the first set of problems, RAID is certainly an answer where one
has possession of the machine. For the no-bad-blocks problem, though,
the same corruption would simply have been propagated across both
disks, so short of rad-hardening the system I'm at a loss for what I
might have done better. Having consistent filesystems seems like an
essential foundation for reliable computing, but clearly I'm not there
yet.
-Bill
-----
Bill McGonigle, Owner Work: 603.448.4440
BFC Computing, LLC Home: 603.448.1668
bill at bfccomputing.com Cell: 603.252.2606
http://www.bfccomputing.com/ Page: 603.442.1833
Blog: http://blog.bfccomputing.com/
VCard: http://bfccomputing.com/vcard/bill.vcf