on hard drive failures

Tom Buskey tom at buskey.name
Fri Jan 6 16:15:01 EST 2006


On 1/6/06, Bill McGonigle <bill at bfccomputing.com> wrote:
> * are there filesystems where recovery has been designed in that are
> less susceptible to bad block damage?  An ideal filesystem would allow
> me to lose all the files on those blocks but be able to recover the
> rest of the disk.



The new ZFS that's coming in Solaris addresses this.  It checksums data
on writes and verifies the checksums on reads, so it'll even protect
against hardware issues.
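
The basic idea, sketched in Python (this is just the checksum-on-write,
verify-on-read concept, not ZFS's actual on-disk format):

    import hashlib

    def write_block(store, key, data):
        # Store a checksum alongside the data, ZFS-style end-to-end check.
        store[key] = (hashlib.sha256(data).hexdigest(), data)

    def read_block(store, key):
        # Recompute the checksum on every read; refuse to return bad data.
        checksum, data = store[key]
        if hashlib.sha256(data).hexdigest() != checksum:
            raise IOError("checksum mismatch on block %r" % (key,))
        return data

    blocks = {}
    write_block(blocks, 0, b"important data")
    print(read_block(blocks, 0))                 # verifies, returns the data
    blocks[0] = (blocks[0][0], b"bit rot here")  # simulate silent corruption
    try:
        read_block(blocks, 0)
    except IOError as err:
        print("caught:", err)                    # the bad block is detected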

<obLinux> OpenSolaris already has ZFS and the code is available.  Perhaps
it could make it into Linux.  And the Solaris mumble (formerly code-named
Janus) will run Linux code, so the need for ZFS in Linux is a bit
less.</obLinux>

> * are there any maintenance routines that could be run to replicate
> essential filesystem information?  For instance, where the backup
> superblocks are stored, inode  tables, etc.  I can't think of any
> server I'm running that doesn't have enough spare cycles to do
> something like this in a nightly cron job.


ZFS does some of this, I think.  Some SAN systems do it too.
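
For ext2/ext3, at least, a nightly cron job could stash the output of
dumpe2fs, which lists the backup superblock locations and the inode and
block-group layout.  A minimal sketch (the device and output paths are
just examples -- adjust for your system):

    import subprocess, time

    DEVICE = "/dev/hda1"   # example device
    OUTFILE = time.strftime("/var/backups/fs-meta-%Y%m%d.txt")

    # dumpe2fs prints the superblock and block-group summary, including
    # where the backup superblocks live, without reading any file data.
    # Needs to run as root.
    with open(OUTFILE, "w") as out:
        subprocess.check_call(["dumpe2fs", DEVICE], stdout=out)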

> So, then beyond hardware, I'm looking for suggestions as to what are
> likely causes of software-based filesystem corruption. My primary
> server lost its disk last night to filesystem corruption.  There are no
> bad blocks on the disk (badblocks r/w test and SMART extended self test
> check out OK) and it's running the latest 2.4 kernel.  My only theories
> are undetected memory errors or kernel bugs.  Neither of which are
> logged.  Short of Linux-HA what's the right way to deal with this?


If it's memory errors, ECC memory helps: it corrects single-bit errors,
and if it reports errors you can replace the failing module before
there's a problem.  This is a hardware issue, and most PC hardware
doesn't do ECC.  I know Sun did it in the SPARCstation 4, 5, 10, and 20,
and perhaps earlier (before 1990?).
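
On Linux, newer kernels expose those ECC error counts through the EDAC
driver under /sys, so you can watch for a failing module before it bites.
A quick sketch, assuming the usual EDAC sysfs layout:

    import glob, os

    # Each memory controller shows up as mc0, mc1, ... with running counts
    # of corrected (ce_count) and uncorrected (ue_count) errors.
    for mc in sorted(glob.glob("/sys/devices/system/edac/mc/mc*")):
        ce = open(os.path.join(mc, "ce_count")).read().strip()
        ue = open(os.path.join(mc, "ue_count")).read().strip()
        print("%s: corrected=%s uncorrected=%s"
              % (os.path.basename(mc), ce, ue))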


> RAID is certainly an answer where one has possession of the machine for
> the first set of problems.  For the no-bad-blocks problem the same
> thing would have occurred with the errors propagated across two disks


True.

> so short of RAD-hardening the system I'm at a loss for what I might
> have done better.  Having consistent filesystems seems like an
> essential foundation for reliable computing but clearly I'm not there
> yet.



I'm finding that for business purposes it's not so hard to get two
drives instead of one and mirror them with software RAID.  About 3
months ago I was able to get two 250GB SATA drives and a RAID card for
under $400.

Heck, on my home server, two 20GB drives for the OS, one IDE card (not
even RAID) and two large disks for data isn't too much to spend.
Software RAID1 on the OS disks and the same for the data disks.  In
three years I've lost an OS disk and a data disk without losing data
before the replacement disks arrived.
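
The catch with software RAID is noticing the degraded array before the
second disk dies, so something in cron to nag you is worth having.  A
minimal sketch that just parses /proc/mdstat (mdadm --monitor does this
job properly):

    import re, sys

    # In /proc/mdstat a healthy two-disk mirror shows "[UU]"; an
    # underscore, e.g. "[U_]", means a member has dropped out.
    with open("/proc/mdstat") as f:
        status = f.read()

    degraded = re.findall(r"\[U*_+U*\]", status)
    if degraded:
        sys.exit("md array degraded: %s" % ", ".join(degraded))
    print("all md arrays look healthy")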



--
A strong conviction that something must be done is the parent of many bad
measures.
  - Daniel Webster

