SMART diags (was: Re: mismatch_cnt != 0, member content mismatch, but md says the mirror is good)
Michael Bilow
mikebw at colossus.bilow.com
Tue Feb 23 21:32:51 EST 2010
On 2010-02-23 at 17:43 -0500, Benjamin Scott wrote:
> On Tue, Feb 23, 2010 at 2:01 PM, Michael Bilow
> <mikebw at colossus.bilow.com> wrote:
>> During the md check operation, the array is "clean" (not degraded)
>> and you can see that explicitly with the "[UU]" status report ...
>
> Of course, mdstat still calls the array "clean" even after
> mismatches are detected, which isn't what I'd usually call "clean"...
> :-)
The term "clean" in this context just means that all of the RAID
components (physical drives) are still present.
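For reference, a healthy two-member RAID-1 shows up in /proc/mdstat
something like this (the device names and block count here are just
placeholders):

md0 : active raid1 sdb1[1] sda1[0]
      976759936 blocks [2/2] [UU]

The "[2/2] [UU]" is what reports both members present; a degraded
array would show something like "[2/1] [U_]" instead.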
>> It is not a "scrub" because it does not attempt to repair anything.
>
> Comments in previously mentioned config file don't make it sound
> like that. "A check operation will scan the drives looking for bad
> sectors and automatically repairing only bad sectors." It doesn't
> explain how it would repair bad sectors. Perhaps it means the bad
> sectors will be "repaired" by failing the entire member and having the
> sysadmin insert a new disk. Perhaps the comments are just wrong.
>
> Not arguing with you, just reporting what the file told me. Would
> the file lie? ;-)
That's sort of true and sort of not true, but generally outdated. It
is important to appreciate that the "md" device operates at a level
of abstraction above block devices that isolates it from low-level
details that are handled by whatever driver manages the block
devices. For something like a parallel IDE drive -- or, heaven
forbid, an ST-506 drive -- there is not a lot of intelligence on
board the drive that will mask error conditions: a read error is a
read error.
When SCSI (meaning SCSI-2) was developed, it provided for a ton of
settable parameters, some vendor-independent and some proprietary.
Among these were mode page bits that controlled what the device
would do by default on encountering errors during read or write,
notably the "ARRE" (automatic read reallocation) and "AWRE"
(automatic write reallocation) bits. Exactly what a device does when
these bits are asserted is not too well specified, especially
considering that a disk and a tape may have radically different
ranges of options but use the same basic SCSI command set. In
practice, I can't think of any reasonable way to implement ARRE:
it's almost always worse to return bad data from a read operation
with a success code than to just have the read operation report a
failure code outright.
(ATAPI is essentially a protocol for wrapping SCSI commands and
responses into packets for ATA devices, so much of the same logic
carries over to ATA drives.)
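If you are curious what a given drive claims for these bits,
"sdparm" can report them by name from the read-write error recovery
mode page; /dev/sda is just an example device:

sdparm --get=AWRE /dev/sda
sdparm --get=ARRE /dev/sda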
>> Detecting and reporting "soft failure" incidents
>> such as reallocations of spare sectors ...
>
> The relocation algorithm in modern disks generally works like this
> (or so I'm told):
>
> R1. OS requests read logical block from HDD. HDD tries to read from
> block on disk, and can't, even with retries and ECC. HDD returns
> failure to the OS, and marks that physical block as "bad" and as a
> candidate for relocation.
At this point, an unreadable block encountered on a block device is
handled at a very high level, usually the file system, well above
where things like AWRE on the hardware can occur. This is where the
"md" driver will intervene, attempting to reconstruct the unreadable
block from its reservoir of redundancy (the other copy if RAID-1,
the other stripes if RAID-5). If the "md" driver can reconstruct the
unreadable data, it will attempt to write the correct data back to
the block device: at this point, the hardware may reallocate a spare
sector for the new data. Unless a write occurs somehow, though, even
with AWRE enabled the hardware should not reallocate a sector.
When a write succeeds and forces an AWRE event, the hardware
test-reads the newly written data and returns an error if the data
could not be verified. By this stage, the "md" device may have had
cause to mark the whole block device as bad and degrade the array.
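If you want to poke at this by hand, kernels new enough to expose
the md sysfs interface let you kick off the same scan and read the
mismatch counter; "md0" here is just an example device:

cat /sys/block/md0/md/mismatch_cnt            # mismatches found by the last check
echo check > /sys/block/md0/md/sync_action    # read-only scan, counts mismatches
echo repair > /sys/block/md0/md/sync_action   # rewrite blocks that can be reconstructed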
> R2. Repeated attempts by OS to read from the same block cause the HDD
> to retry. It won't throw away your data on its own.
Correct, in all practical cases the hardware will never reallocate a
bad block on read operations. The SCSI protocol provides for ARRE,
but as I noted this is never really implemented.
> R3. OS requests write to same logical block. HDD relocate to
> different physical block, and throws away the bad block. It can do
> that now, since you've told it you don't want the data that was there,
> by writing new data over it.
Again, exactly what happens is going to vary a lot with the
particular hardware. Older drives, even parallel ATA drives,
generally cannot reallocate a spare sector on the fly during normal
operation, but can only do it during a low-level format operation of
the whole drive. This is because the reserve of spare sectors on
such drives is associated with physical zones, so that reallocation
can only occur during a track-granular write operation.
In my experience, nearly all SCSI drives have AWRE disabled from the
factory, and it is up to the operating system to enable it. Linux
does not do this, as far as I know, unless the user manually sets
mode page bits using a tool such as "scsiinfo" or "sdparm".
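For example, something along these lines turns AWRE on and saves the
setting so it survives a power cycle; adjust the device name to suit:

sdparm --set=AWRE=1 --save /dev/sda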
Like SCSI drives, SATA drives tend to have AWRE disabled from the
factory, but it is often enabled by the machine BIOS. On some SATA
drives, as on their PATA cousins, AWRE cannot be enabled at all.
> It would be nice if hard disks were smart enough to detect a block
> that was getting marginal and preemptively relocate it. Last I looked
> into this (admittedly, several years ago), they didn't do that. Maybe
> they've gotten smarter about that. If they haven't gotten smarter, if
> the "check" operation reads all the blocks on the disk but never
> writes, that alone won't trigger relocation of a bad block. The
> "check" operation would have to read the good block from the other
> disk, and attempt to rewrite it to the bad disk. *That* might trigger
> a useful relocation by the HDD with the bad block.
>
>> smartmontools, which can and should be configured to look past the
>> md device and monitor the physical drives that are its components.
>
> While I run smartd in monitor mode, I've never had it give me a
> useful pre-failure alert. Likewise, I've never had the SMART health
> check in PC BIOSes give me a useful pre-failure alert. More than once
> I've seen SMART report the overall health check as "PASS" when the
> whole damn disk is unreadable. It makes me wonder just what the
> overall SMART health is supposed to indicate -- "Yes, the HDD is
> physically present"? :)
SMART is just a communications protocol. Some drives return nearly
useless information, while other drives are quite good about
reporting genuinely useful information. Using "smartmontools"
properly requires manually configuring both short and long
self-tests. For example, on a server where I have two Western
Digital 1TB SATA drives spinning as the two components of an "md"
RAID-1 device, I have the following in "smartd.conf" --
/dev/sda -d sat -a -o on -S on -s (S/../.././22|L/../../6/04) -m root
/dev/sdb -d sat -a -o on -S on -s (S/../.././23|L/../../7/04) -m root
This runs a short self-test every night on sda at 2200 ET and on sdb
at 2300 ET, and runs a long self-test on sda at 0400 ET every
Saturday and on sdb at 0400 ET every Sunday. Just installing the
monitor daemon without configuring it for a particular installation
is not terribly useful.
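It is also worth running the tests by hand once to make sure the
drive accepts them and to read back the results; the "-d sat" may or
may not be needed depending on the controller:

smartctl -d sat -t short /dev/sda      # kick off a short self-test now
smartctl -d sat -t long /dev/sda       # kick off an extended self-test
smartctl -d sat -l selftest /dev/sda   # show the self-test log afterward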
> I did once have the BIOS check start reporting a SMART health
> warning, but all the OEM diagnostics, smartctl, "badblocks -w", etc.,
> didn't actually report anything wrong. The reseller replaced the
> drive at my insistence. Maybe the SMART health check knew something
> that none of the other SMART parameters were reporting.
SMART is not designed to predict infant mortality and unusual
failures, but rather to monitor factors that either indicate a
developing problem ("prefailure") or show the device approaching its
known design lifetime ("usage").
Here's an example of a very healthy drive:
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 164 164 021 Pre-fail Always - 6758
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 16
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 100 253 051 Old_age Always - 0
9 Power_On_Hours 0x0032 090 090 000 Old_age Always - 7411
10 Spin_Retry_Count 0x0032 100 253 051 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 051 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 15
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 3
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 16
194 Temperature_Celsius 0x0022 121 110 000 Old_age Always - 29
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 051 Old_age Offline - 0
In this case, the drive has never logged a raw read error (ID 1), a
reallocated sector (ID 5), nor a seek error (ID 7), and has zero
counts for reallocation events (ID 196), current pending sectors (ID
197), and uncorrectable sectors (ID 198). Since it is in a server,
it has only been power-cycled 16 times despite having logged over
7411 hours (about 309 days) of operation. It is very likely that, if
this drive were starting to fail, these counts would show problems.
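If you just want to watch the attributes that matter most here, a
quick check by hand is something like (again assuming "-d sat" is
appropriate for the controller):

smartctl -A -d sat /dev/sda | egrep 'Reallocated|Pending|Uncorrectable'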
-- Mike