mismatch_cnt != 0, member content mismatch, but md says the mirror is good (fwd)

Mon Feb 22 15:00:47 EST 2010

On 2010-02-22 at 13:39 -0500, Michael ODonnell wrote:

> 
> 
> Ruh-rohhh....
> 
>> /var/log/messages:   Feb 21 04:22:02 sbgrid-dev-architect kernel: md: syncing RAID array md0
>> /var/log/messages:   Feb 21 04:22:02 sbgrid-dev-architect kernel: md: syncing RAID array md3
>> /var/log/messages.1: Feb 14 04:22:02 sbgrid-dev-architect kernel: md: syncing RAID array md2
>> /var/log/messages.1: Feb 14 04:22:02 sbgrid-dev-architect kernel: md: syncing RAID array md0
>> /var/log/messages.1: Feb 14 04:22:02 sbgrid-dev-architect kernel: md: syncing RAID array md3
>> /var/log/messages.2: Feb 7  04:22:01 sbgrid-dev-architect kernel: md: syncing RAID array md0
>> /var/log/messages.2: Feb 7  04:22:01 sbgrid-dev-architect kernel: md: syncing RAID array md3
>> /var/log/messages.3: Jan 31 04:22:02 sbgrid-dev-architect kernel: md: syncing RAID array md2
>> /var/log/messages.3: Jan 31 04:22:02 sbgrid-dev-architect kernel: md: syncing RAID array md0
>> /var/log/messages.3: Jan 31 04:22:02 sbgrid-dev-architect kernel: md: syncing RAID array md3
>> /var/log/messages.4: Jan 24 04:22:06 sbgrid-dev-architect kernel: md: syncing RAID array md0
>> /var/log/messages.4: Jan 24 04:22:06 sbgrid-dev-architect kernel: md: syncing RAID array md3
>> 
>> That's a CentOS 5.4 x86_64 box.
> 
> Ours are, too.
> 
> So far, then, it's looking like every Sunday at 4:22 all the RAIDs
> (all types or just RAID1?) in standard x86_64 CentOS5.4 (and RHAT?)
> boxes are broken and then resync'd.  This is presumably unnecessary
> and unintentional.  The harm is that until the resync operations
> complete (large devices can take hours) the filesystems on those
> RAIDs are essentially as vulnerable to HW faults as they'd be on any
> single disk.  (Interactive responsiveness is usually significantly
> reduced, as well - important in cases such as ours with customers
> active at all hours, but maybe less so in a 9-to-5 environment).
> 
> We'll probably disable that "helpful" weekly script on our machines
> until we have a better handle on this (or a fix).

Note that Debian has something similar, although monthly:

# cron.d/mdadm -- schedules periodic redundancy checks of MD devices
#
# Copyright © martin f. krafft <madduck at madduck.net>
# distributed under the terms of the Artistic Licence 2.0
#
# By default, run at 00:57 on every Sunday, but do nothing unless the day of
# the month is less than or equal to 7. Thus, only run on the first Sunday of
# each month. crontab(5) sucks, unfortunately, in this regard; therefore this
# hack (see #380425).
57 0 * * 0 root [ -x /usr/share/mdadm/checkarray ] && [ $(date +\%d) -le 7 ] && /usr/share/mdadm/checkarray --cron --all --quiet
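
The day-of-month guard in that entry can be tried outside cron. This is only a sketch, assuming a POSIX shell; the function name is_first_week is made up for illustration and is not part of Debian's package. Inside a crontab the percent sign has to be written as \% because cron otherwise treats it as a newline; in a plain script it is just %d.

```shell
#!/bin/sh
# Succeed (exit status 0) when the given day of the month is in the first week.
# Together with cron's "day of week = 0" field this selects the first Sunday.
is_first_week() {
    # $1 is a day of month as printed by `date +%d`, e.g. "07" or "21".
    [ "$1" -le 7 ]
}

if is_first_week "$(date +%d)"; then
    echo "first week of the month: the check would run"
else
    echo "later in the month: the check would be skipped"
fi
```

Cron still fires the entry every Sunday; the guard simply makes it a no-op on all but the first one.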

Note, however, that "checkarray" does not perform a real resynchronization of 
the kind that occurs when an array is brought out of degraded mode, so data are 
not at risk in the same way. On the other hand, if something interrupts 
"checkarray", the array can be left in degraded mode. That was the subject of a 
bug I filed against Debian's "mdadm" package a while ago:

 	http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=563602

I'm not entirely happy with the maintainer's call that my problem was "local," 
but I can't prove otherwise. In the meantime, I've changed the script so that 
"checkarray" runs on Monday morning at 07:57 instead of Sunday at 00:57, in 
order to avoid conflict with Debian's weekly log rotation.
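
For anyone who wants to notice an array left degraded after an interrupted check, something along these lines could be run afterwards. It is a rough sketch, not what Debian ships: the function name degraded_arrays is invented here, and it assumes the usual /proc/mdstat layout in which the member map reads "[UU]" for a healthy mirror and shows "_" for a missing device.

```shell
#!/bin/sh
# Print the name of every md array whose member map shows a missing device.
# Reads /proc/mdstat-style text on standard input.
degraded_arrays() {
    awk '
        # Remember which array the following status lines belong to.
        /^md[0-9]+ *:/ { name = $1 }
        # A member map such as [U_] contains "_" for each missing device.
        /\[[U_]*_[U_]*\]/ { if (name != "") print name }
    '
}

# Typical use once a check has finished or been interrupted:
if [ -r /proc/mdstat ]; then
    degraded_arrays < /proc/mdstat
fi
```

No output means no array reported a missing member at the time the file was read.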

It's a matter of opinion whether it is better to risk running "checkarray" once 
a month for a few hours or to risk never running "checkarray" and having data 
errors creep into an array. My view is that, while the md code in Linux is 
quite solid, running "checkarray" will often expose intermittent hardware 
problems, especially failing RAM, that might otherwise be missed until they 
grow into catastrophic failures. It is therefore better to run it than not.
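
What a check found can be read back from the mismatch_cnt file that the md driver exposes under sysfs for each array. The following is a minimal sketch under that assumption; the function name report_mismatches and its optional root argument are hypothetical, added only so the loop can be pointed at a test directory. A non-zero count is a prompt to investigate rather than proof of data loss.

```shell
#!/bin/sh
# Print "<array> <count>" for every md array under the given sysfs root
# whose mismatch_cnt is non-zero. The root defaults to /sys/block.
report_mismatches() {
    root=${1:-/sys/block}
    for f in "$root"/md*/md/mismatch_cnt; do
        [ -r "$f" ] || continue      # glob matched nothing, or file unreadable
        count=$(cat "$f")
        array=${f#"$root"/}          # strip the root prefix
        array=${array%%/*}           # keep only the "mdN" component
        if [ "$count" -ne 0 ]; then
            echo "$array $count"
        fi
    done
    return 0
}

report_mismatches
```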

-- Mike