on hard drive failures

Paul Lussier p.lussier at comcast.net
Fri Jan 6 21:20:00 EST 2006


Bill McGonigle <bill at bfccomputing.com> writes:

> So, then beyond hardware, I'm looking for suggestions as to what are
> likely causes of software-based filesystem corruption. My primary
> server lost its disk last night to filesystem corruption.  There are
> no bad blocks on the disk (badblocks r/w test and SMART extended self
> test check out OK) and it's running the latest 2.4 kernel.

Beware of SMART.  There are LOTS of things we've found that can go
wrong with a disk that SMART never detects, and then suddenly the disk
goes kaput.  By the same token, we've found that SMART will detect
lots of errors, yet running the disk through the manufacturer's disk
test software claims it's fine.

That being said, we regularly run SMART testing in the background on
all our systems and have it kick the machine out of production if it
detects errors.  We then re-install the OS, since re-formatting the
drive will vector around any bad blocks.  Obviously, this is a test
environment; a production system wouldn't have that luxury.  You
could, though, build the system with a 3-way software RAID mirror.
Upon a SMART error detection, you could remove the bad drive,
re-format it, and re-join it to the mirror.  The 3-way mirror is to
make sure you still have a working RAID mirror while repairing one
disk, so you're never completely at risk.
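
Something like the following (just a sketch -- it assumes Linux md
RAID and smartmontools, and the /dev/md0 array and /dev/sd* device
names are made up for illustration) shows the sort of check-and-eject
cycle I mean:

#!/usr/bin/env python3
"""Sketch: check SMART health on each mirror member and, on failure,
drop that member from a 3-way md mirror so it can be re-formatted and
re-added.  Device names here are placeholders, not a real setup."""

import subprocess
import sys

ARRAY = "/dev/md0"                                  # assumed 3-way RAID-1 array
MEMBERS = ["/dev/sda1", "/dev/sdb1", "/dev/sdc1"]   # assumed member partitions


def smart_healthy(disk: str) -> bool:
    """True if 'smartctl -H' reports the overall health check as PASSED."""
    out = subprocess.run(["smartctl", "-H", disk],
                         capture_output=True, text=True)
    return "PASSED" in out.stdout


def fail_and_remove(array: str, member: str) -> None:
    """Mark the member faulty and pull it; the array keeps running 2-way."""
    subprocess.run(["mdadm", array, "--fail", member], check=True)
    subprocess.run(["mdadm", array, "--remove", member], check=True)
    # After re-formatting/re-testing the drive, re-add it by hand with:
    #   mdadm /dev/md0 --add /dev/sdX1


if __name__ == "__main__":
    for part in MEMBERS:
        disk = part.rstrip("0123456789")   # crude partition -> whole-disk mapping
        if not smart_healthy(disk):
            print(f"SMART failure on {disk}; removing {part} from {ARRAY}")
            fail_and_remove(ARRAY, part)
            sys.exit(1)
    print("all members healthy")

You'd do the re-format and a badblocks pass on the pulled drive by
hand (or from cron) before adding it back into the mirror.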

Or... you could just not use IDE/[PS]ATA drives at all and invest in
SCSI (or Fibre Channel) drives, which are *still* higher quality than
IDE, and not worry about this nearly as much.

> My only theories are undetected memory errors or kernel bugs.
> Neither of which are logged.  Short of Linux-HA what's the right way
> to deal with this?

IMO, SCSI or FC drives if this is for a production server system.
-- 

Seeya,
Paul


