SMART diags (was: Re: mismatch_cnt != 0, member content mismatch, but md says the mirror is good)

Wed Feb 24 21:50:44 EST 2010

On Tue, Feb 23, 2010 at 9:32 PM, Michael Bilow
<mikebw at colossus.bilow.com> wrote:
>>  Of course, mdstat still calls the array "clean" even after
>> mismatches are detected, which isn't what I'd usually call "clean"...
>
> The term "clean" in this context just means that all of the RAID
> components (physical drives) are still present.

  Like I said, not what I'd usually call "clean".  :)
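
  For anyone who wants to see the two side by side, here is a minimal
sketch.  It assumes the md sysfs layout, where array_state and
mismatch_cnt sit in the same directory (e.g. /sys/block/md0/md; md0
is only an example device).  The function names are made up for
illustration.

```python
from pathlib import Path


def read_md_status(md_dir):
    """Return (array_state, mismatch_cnt) read from an md sysfs directory."""
    md_dir = Path(md_dir)
    state = (md_dir / "array_state").read_text().strip()
    mismatches = int((md_dir / "mismatch_cnt").read_text().strip())
    return state, mismatches


def describe(state, mismatches):
    """Summarize both values, since 'clean' alone says nothing about mismatches."""
    if mismatches:
        return f"{state}, but mismatch_cnt is {mismatches}"
    return f"{state}, no mismatches recorded"
```

  On a real system that would be something like
describe(*read_md_status("/sys/block/md0/md")), run after a "check"
pass has finished.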

> ... "md" device operates at a level of abstraction above
> block devices that isolates it ...

  Sure.  That doesn't mean the md driver can't follow an algorithm
that hopes the drive will do something intelligent, or at least hopes
that re-writing a block might improve the odds somehow.  What are the
alternatives?  We could fail the whole member out of the array.  That
could be an overreaction, and definitely reduces redundancy if it's
just one bad block out of several billion.  Or we could do nothing.  I
can't think of a situation where rewriting one block could cause
serious problems that weren't already about to break loose.  No?

> Unless a write occurs somehow, though, even
> with AWRE enabled the hardware should not reallocate a sector.

  Right, the drive should remain willing to keep retrying the read as
long as you do.

>> R3. OS requests write to same logical block.
>
> Again, exactly what happens is going to vary a lot with the
> particular hardware. Older drives, even parallel ATA drives,
> generally cannot reallocate a spare sector on the fly ...

  Sure, on-the-fly relocation is a *relatively* new thing.  But it's
been around in the IDE world, at least in theory, for what, ten years?
 Implementation may be inconsistent; that I would buy.  But I know
I've seen both parallel and serial ATA drives where the "relocated
blocks" statistic was non-zero and climbed over time.  I've seen the
"pending relocations" be high until a "badblocks -w" pass, and then it
dropped to zero and "relocated blocks" jumped up.  The smartmontools
FAQ says modern drives can relocate bad sectors on write; their "Bad
block HOWTO" goes into some detail on SCSI drives.  Either there's an
awful lot of misleading information out there, or this stuff actually
does work sometimes.  :-)
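
  To watch those two counters yourself, here is a minimal sketch that
pulls them out of the attribute table printed by "smartctl -A".  The
attribute names used (Reallocated_Sector_Ct, Current_Pending_Sector)
are the ones smartmontools commonly shows for ATA drives, but naming
can vary by drive, so treat them as an assumption.  The sample table
and the function names are invented for illustration and are not from
a real drive.

```python
def parse_smart_attributes(text):
    """Map attribute name -> integer raw value for rows of a `smartctl -A` table."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # rows whose raw value is not a plain integer are skipped.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            attrs[fields[1]] = int(fields[9])
    return attrs


def relocation_counters(text):
    """Return (reallocated, pending) raw counts, or None where an attribute is absent."""
    attrs = parse_smart_attributes(text)
    return (attrs.get("Reallocated_Sector_Ct"),
            attrs.get("Current_Pending_Sector"))


# Made-up sample in the usual column layout, for illustration only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       12
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       3
"""
```

  Comparing the pair before and after a write pass is one way to see
the "pending drops, relocated jumps" pattern described above.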

  I'm not so worried if that 120 MB IDE disk I still have in my
closet[1] doesn't do on-the-fly relocation.  ;-)

[1] = Hey, it might come in handy some day!

  Perhaps what we should all be worrying about, rather than ancient
drives, is the flood of USB flash stuff that's happening.  Anyone know
how *that* typically fares when it comes to self-monitoring and
-healing?  It'd be a shame if the migration to flash storage sets us
back years in that area.

>> It makes me wonder just what the
>> overall SMART health is supposed to indicate -- "Yes, the HDD is
>> physically present"?  :)
>
> SMART is just a communications protocol.

  So, basically, the SMART "overall health" (or whatever it's called)
is just reporting whatever the manufacturer programmed the drive to
report, and may be completely useless.  Good to know.  :)
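
  For what it's worth, that verdict is a single line in the output of
"smartctl -H".  Here is a minimal sketch that extracts it, assuming
the ATA-style wording "SMART overall-health self-assessment test
result: ..." (SCSI drives are worded differently and are not handled
here).  The function name is hypothetical.

```python
HEALTH_PREFIX = "SMART overall-health self-assessment test result:"


def parse_overall_health(text):
    """Return the drive's self-reported verdict from `smartctl -H` text, or None."""
    for line in text.splitlines():
        if line.startswith(HEALTH_PREFIX):
            # Everything after the prefix is whatever the drive chose to report.
            return line[len(HEALTH_PREFIX):].strip()
    return None
```

  Note that the result is only the drive's own pass/fail answer; it
says nothing about why the drive decided that.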

>>  I did once have the BIOS check start reporting a SMART health
>> warning, but all the OEM diagnostics, smartctl, "badblocks -w", etc.,
>> didn't actually report anything wrong.
>
> SMART is not designed to predict infant mortality and unusual
> failures ...

  Whatever.  :)  My point was that the drive seemed to be indicating
something was wrong, but nobody[2] could figure out why it was doing
that.  SMART overall health was reporting failure but everything else
seemed to be good.  Like I said, it could be the drive knew something
that couldn't be reported using other tools, and it actually averted a
real failure.

[2] = Well, for sufficiently small definitions of "nobody".  Me, one
tech support guy, and a handful of software tools.  :)

-- Ben