SMART diags (was: Re: mismatch_cnt != 0, member content mismatch, but md says the mirror is good)

Michael Bilow mikebw at colossus.bilow.com
Tue Feb 23 21:32:51 EST 2010


On 2010-02-23 at 17:43 -0500, Benjamin Scott wrote:

> On Tue, Feb 23, 2010 at 2:01 PM, Michael Bilow
> <mikebw at colossus.bilow.com> wrote:
>> During the md check operation, the array is "clean" (not degraded)
>> and you can see that explicitly with the "[UU]" status report ...
>
>  Of course, mdstat still calls the array "clean" even after
> mismatches are detected, which isn't what I'd usually call "clean"...
> :-)

The term "clean" in this context just means that all of the RAID 
components (physical drives) are still present.
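
If you want to see that state directly, something along these lines 
works on a typical system (assuming the array is /dev/md0):

  cat /proc/mdstat                        # look for "[UU]" on the md0 line
  mdadm --detail /dev/md0 | grep State    # should report "State : clean"
  cat /sys/block/md0/md/array_state       # "clean" or "active" when healthy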

>> It is not a "scrub" because it does not attempt to repair anything.
>
>  Comments in previously mentioned config file don't make it sound
> like that.  "A check operation will scan the drives looking for bad
> sectors and automatically repairing only bad sectors."  It doesn't
> explain how it would repair bad sectors.  Perhaps it means the bad
> sectors will be "repaired" by failing the entire member and having the
> sysadmin insert a new disk.  Perhaps the comments are just wrong.
>
>  Not arguing with you, just reporting what the file told me.  Would
> the file lie?  ;-)

That's sort of true and sort of not true, but generally outdated. It 
is important to appreciate that the "md" device operates at a level 
of abstraction above block devices that isolates it from low-level 
details that are handled by whatever driver manages the block 
devices. For something like a parallel IDE drive -- or, heaven 
forbid, an ST-506 drive -- there is not a lot of intelligence on 
board the drive that will mask error conditions: a read error is a 
read error.

When SCSI (meaning SCSI-2) was developed, it provided for a ton of 
settable parameters, some vendor-independent and some proprietary. 
Among these were mode page bits that controlled what the device 
would do by default on encountering errors during read or write, 
notably the "ARRE" (automatic read reallocation) and "AWRE" 
(automatic write reallocation) bits. Exactly what a device does when 
these bits are asserted is not too well specified, especially 
considering that a disk and a tape may have radically different 
ranges of options but use the same basic SCSI command set. In 
practice, I can't think of any reasonable way to implement ARRE: 
it's almost always worse to return bad data from a read operation 
with a success code than to just have the read operation report a 
failure code outright.

(ATAPI is essentially a protocol for wrapping SCSI commands and 
responses into packets for ATA devices, so the same logic carries 
over to the ATA world.)
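
You can inspect these bits on a SCSI device with sdparm; for example, 
something like this (assuming the disk shows up as /dev/sda) dumps the 
read-write error recovery mode page, which is where ARRE and AWRE live:

  sdparm --page=rw /dev/sda     # full error recovery mode page
  sdparm --get=AWRE /dev/sda    # just the AWRE bit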

>> Detecting and reporting "soft failure" incidents
>> such as reallocations of spare sectors ...
>
>  The relocation algorithm in modern disks generally works like this
> (or so I'm told):
>
> R1. OS requests read logical block from HDD.  HDD tries to read from
> block on disk, and can't, even with retries and ECC.  HDD returns
> failure to the OS, and marks that physical block as "bad" and as a
> candidate for relocation.

At this point, an unreadable block encountered on a block device is 
handled at a very high level, usually the file system, well above 
where things like AWRE on the hardware can occur. This is where the 
"md" driver will intervene, attempting to reconstruct the unreadable 
block from its reservoir of redundancy (the other copy if RAID-1, 
the other stripes if RAID-5). If the "md" driver can reconstruct the 
unreadable data, it will attempt to write the correct data back to 
the block device: at this point, the hardware may reallocate a spare 
sector for the new data. Unless a write occurs somehow, though, even 
with AWRE enabled the hardware should not reallocate a sector.

When a write succeeds and forces an AWRE event, the hardware 
test-reads the newly written data and returns an error if the data 
could not be verified. By this stage, the "md" device may have had 
cause to mark the whole block device as bad and degrade the array.
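
On a current kernel you can drive this by hand through sysfs; a rough 
sketch, assuming the array is /dev/md0:

  echo check > /sys/block/md0/md/sync_action    # scan the array, counting mismatches
  cat /sys/block/md0/md/mismatch_cnt            # how many mismatched sectors were seen
  echo repair > /sys/block/md0/md/sync_action   # same scan, but rewrite mismatched blocks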

> R2. Repeated attempts by OS to read from the same block cause the HDD
> to retry.  It won't throw away your data on its own.

Correct: in all practical cases the hardware will never reallocate a 
bad block on read operations. The SCSI protocol provides for ARRE, 
but as I noted this is never really implemented.

> R3. OS requests write to same logical block.  HDD relocate to
> different physical block, and throws away the bad block.  It can do
> that now, since you've told it you don't want the data that was there,
> by writing new data over it.

Again, exactly what happens is going to vary a lot with the 
particular hardware. Older drives, even parallel ATA drives, 
generally cannot reallocate a spare sector on the fly during normal 
operation, but can only do it during a low-level format operation of 
the whole drive. This is because the reserve of spare sectors on 
such drives is associated with physical zones, so that reallocation 
can only occur during a track-granular write operation.

In my experience, nearly all SCSI drives have AWRE disabled from the 
factory, and it is up to the operating system to enable it. Linux 
does not do this, as far as I know, unless the user manually sets 
mode page bits using a tool such as "scsiinfo" or "sdparm".

Like SCSI drives, SATA drives tend to have AWRE disabled from the 
factory, but it is often enabled by the machine BIOS. On some 
drives, like their PATA cousins, AWRE cannot be enabled.
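
If a particular drive does let you change the bit, sdparm can usually 
do it (again assuming /dev/sda, and only if the firmware accepts the 
change -- many SATA drives will simply refuse):

  sdparm --set=AWRE=1 /dev/sda           # enable for the current power cycle
  sdparm --save --set=AWRE=1 /dev/sda    # also record it in the drive's saved page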

>  It would be nice if hard disks were smart enough to detect a block
> that was getting marginal and preemptively relocate it.  Last I looked
> into this (admittedly, several years ago), they didn't do that.  Maybe
> they've gotten smarter about that.  If they haven't gotten smarter, if
> the "check" operation reads all the blocks on the the disk but never
> writes, that alone won't trigger relocation of a bad block.  The
> "check" operation would have to read the good block from the other
> disk, and attempt to rewrite it to the bad disk.  *That* might trigger
> a useful relocation by the HDD with the bad block.
>
>> smartmontools, which can and should be configured to look past the
>> md device and monitor the physical drives that are its components.
>
>  While I run smartd in monitor mode, I've never had it give me a
> useful pre-failure alert.  Likewise, I've never had the SMART health
> check in PC BIOSes give me a useful pre-failure alert.  More than once
> I've seen SMART report the overall health check as "PASS" when the
> whole damn disk is unreadable.  It makes me wonder just what the
> overall SMART health is supposed to indicate -- "Yes, the HDD is
> physically present"?  :)

SMART is just a communications protocol. Some drives return nearly 
useless information, while other drives are quite good about 
reporting genuinely useful information. Using "smartmontools" 
properly requires manually configuring both short and long 
self-tests. For example, on a server where I have two Western 
Digital 1TB SATA drives spinning as the two components of an "md" 
RAID-1 device, I have the following in "smartd.conf" --

/dev/sda -d sat -a -o on -S on -s (S/../.././22|L/../../6/04) -m root
/dev/sdb -d sat -a -o on -S on -s (S/../.././23|L/../../7/04) -m root

This runs a short self-test every night on sda at 2200 ET and on sdb 
at 2300 ET, and runs a long self-test on sda at 0400 ET every 
Saturday and on sdb at 0400 ET every Sunday. Just installing the 
monitor daemon without configuring it for a particular installation 
is not terribly useful.
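
Once the tests are running, the results are easy to check by hand; 
for instance, something like this shows the self-test log and the 
expected test durations for the first drive:

  smartctl -d sat -l selftest /dev/sda   # log of recent short/long self-tests
  smartctl -d sat -c /dev/sda            # capabilities, including test run times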

>  I did once have the BIOS check start reporting a SMART health
> warning, but all the OEM diagnostics, smartctl, "badblocks -w", etc.,
> didn't actually report anything wrong.  The reseller replaced the
> drive at my insistence.  Maybe the SMART health check knew something
> that none of the other SMART parameters were reporting.

SMART is not designed to predict infant mortality and unusual 
failures, but rather to monitor the factors which either indicate a 
problem developing ("prefailure") or are approaching the known 
design lifetime of the device ("usage").
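
The attribute table is where those two categories show up: the TYPE 
column marks each attribute as either "Pre-fail" or "Old_age". You 
can dump it for a drive like mine with something like:

  smartctl -d sat -A /dev/sda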

Here's an example of a very healthy drive:

Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
   1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
   3 Spin_Up_Time            0x0027   164   164   021    Pre-fail  Always       -       6758
   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       16
   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
   7 Seek_Error_Rate         0x002e   100   253   051    Old_age   Always       -       0
   9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       7411
  10 Spin_Retry_Count        0x0032   100   253   051    Old_age   Always       -       0
  11 Calibration_Retry_Count 0x0032   100   253   051    Old_age   Always       -       0
  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       15
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       3
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       16
194 Temperature_Celsius     0x0022   121   110   000    Old_age   Always       -       29
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   051    Old_age   Offline      -       0

In this case, the drive has never logged a raw read error (ID 1), a 
reallocated sector (ID 5), nor a seek error (ID 7), and has zero 
counts for reallocation events (ID 196), current pending sectors (ID 
197), and uncorrectable sectors (ID 198). Since it is in a server, 
it has only been power-cycled 16 times despite having logged over 
7411 hours (about 309 days) of operation. It is very likely that, if 
this drive were starting to fail, these counts would show problems.
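
If you just want to keep an eye on the handful of counters that 
matter most, a quick-and-dirty filter along these lines does the job; 
any non-zero raw values here are worth investigating:

  smartctl -d sat -A /dev/sda | egrep 'Reallocated|Current_Pending|Offline_Uncorrectable'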

-- Mike


