mismatch_cnt != 0, member content mismatch, but md says the mirror is good

Sun Feb 21 16:26:47 EST 2010

I know I'm resurrecting an old thread here, but I just saw a post on Planet
CentOS that seems to have some info on fixing the "mismatch_cnt is not 0"
warning.  Take a look at this blog post, where the author suggests some md
actions that can be taken to clear these errors:
http://www.arrfab.net/blog/?p=199
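
For reference, here is a rough sketch of the kind of md actions involved.
I'm not quoting the blog post's exact steps; this only assumes the standard
md sysfs files (sync_action and mismatch_cnt under /sys/block/mdX/md/).  The
SYSFS_ROOT variable and the two function names are my own invention, there
so the sketch can be tried against a scratch directory before touching a
real array.

```shell
#!/bin/sh
# Sketch only: SYSFS_ROOT and the helper names are illustrative assumptions,
# not taken from the blog post or from any md tooling.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Print the current mismatch count for an md device name such as "md0".
md_mismatch_cnt() {
    cat "$SYSFS_ROOT/block/$1/md/mismatch_cnt"
}

# Ask md to scrub the array.  "check" only counts mismatches;
# "repair" also rewrites the blocks that do not match.
md_sync_action() {
    case "$2" in
        check|repair)
            printf '%s\n' "$2" > "$SYSFS_ROOT/block/$1/md/sync_action"
            ;;
        *)
            echo "usage: md_sync_action <mdX> check|repair" >&2
            return 2
            ;;
    esac
}
```

As root, the sequence would be something like "md_sync_action md0 repair",
wait for /proc/mdstat to show the pass has finished, then "md_sync_action
md0 check" followed by "md_mismatch_cnt md0".  Keep in mind that, as far as
I know, a repair on a mirror just copies one member over the other; md has
no way to tell which copy is the right one.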

-Shawn

On Sun, Nov 1, 2009 at 10:00 PM, Ben Scott <dragonhawk at gmail.com> wrote:

>  CentOS 5.4.  Running kernel is 2.6.18-92.1.22.el5.  The system has
> two disks, each with two partitions, making up two md mirror devices.
> md0 is ~ 509 MB and holds /boot; md1 is ~ 69 GB (the rest of the disk)
> and holds an LVM PV.  The following arrived in my mailbox today:
>
> On Sun, Nov 1, 2009 at 4:22 AM, Cron Daemon <root at liberty.gnhlug.org>
> wrote:
> > /etc/cron.weekly/99-raid-check:
> >
> > WARNING: mismatch_cnt is not 0 on /dev/md0
>
>  Investigation finds:
>
> /proc/mdstat reports everything is peachy for both mirrors.  "[2/2] [UU]"
>
> Under /sys/block/md0/md/ I find the following:
>
>        array_state: clean
>        mismatch_cnt: 256
>        rd{0,1}/errors: 0
>        rd{0,1}/state: in_sync
>
>  Google finds lots of people reporting similar, but nothing
> conclusive or particularly pertinent to this situation.  Lots of
> people saying that swap can cause this (because swap can commit a
> block to one member, then learn it won't ever re-read that block, and
> so won't bother committing the other member), but this is the /boot
> filesystem, not swap.  (swap is in an LV; the md device backing that
> LVM's sole PV reports a mismatch_cnt of zero.)
>
>  I did find some people saying this started happening after CentOS
> 5.3 -> 5.4.  I did do that recently.  One person said the "raid-check"
> was added in 5.4.  So I presume this mismatch_cnt might have been
> non-zero for ages, and I just never knew to look before now.
> mdmonitor has been running, but it mainly reports if a RAID member
> goes offline, and as noted, md is reporting all's quiet on the western
> front.
>
>  I tried dismounting the /boot filesystem and running some tests.
> (Since it's a separate partition and md device, and outside of LVM, I
> can poke at it without taking the system down.)
>
>  "e2fsck -f -n" says /dev/md0 is okay.
>
>  I tried stopping the RAID device with "mdadm --stop /dev/md0", then
> sync'ing disks.  Then I ran "cmp /dev/sda1 /dev/sdb1".  The result:
>
>        /dev/sda1 /dev/sdb1 differ: byte 331875867, line 215880
>
>  So the two mirror members are **NOT** identical.  That's usually bad.
>
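
Plain cmp stops at the first difference, so it doesn't say how much of the
two members differ.  A small sketch that uses "cmp -l" to count the
differing bytes and show where they start and end (diff_extent is just a
name I made up for this):

```shell
#!/bin/sh
# diff_extent A B: count the bytes that differ between two files or devices
# and print the first and last differing offsets (1-based, as cmp reports).
diff_extent() {
    cmp -l "$1" "$2" | awk '
        NR == 1 { first = $1 }   # offset of the first differing byte
        { last = $1 }            # remember the most recent offset seen
        END {
            if (NR == 0)
                print "identical"
            else
                printf "%d bytes differ, offsets %d..%d\n", NR, first, last
        }'
}
```

Run as "diff_extent /dev/sda1 /dev/sdb1" with the array stopped.  If these
are old-style 0.90 metadata arrays (an assumption on my part), the md
superblock sits near the end of each member and will show up as a
difference too, so the last offset may just be the superblock.
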
>  Running "e2fsck -f -n" on each member says no trouble found.  That
> implies whatever the mismatch is, it is not in filesystem metadata.
>
>  Running a "badblocks" read-only test on each member says no read errors.
>
>  mdadm says the MD superblocks are okay, and comparing the two finds
> most things are the same -- only the checksum and device relationships
> differ (expected).
>
>  One nice thing about simple mirrors is that you can mount the
> members read-only and examine the contents without breaking the mirror
> set.  So:
>
>        liberty$ sudo mount -o ro -t ext2 /dev/sda1 /mnt/sda1
>        liberty$ sudo mount -o ro -t ext2 /dev/sdb1 /mnt/sdb1
>        liberty$ sudo diff -r sda1 sdb1
>        Binary files sda1/grub/stage2 and sdb1/grub/stage2 differ
>        liberty$
>
>  (You have to mount as ext2 because ext3 will replay a journal even
> if you said "read-only".)
>
>  It may be normal for the GRUB stage2 to differ in this
> configuration.  There may be device numbers encoded into them.  GRUB
> was installed on each disk separately, by booting from floppy, so that
> would do it.  Or it could be one disk has an undetected bad block and
> the boot loader on that disk is shot.
>
>  No other differences detected in file data, though.  So between fsck
> and diff, it looks like most of the contents are intact.  Maybe all of
> them.
>
>  I'm unsure as to how to proceed.
>
>  The general procedure for repairing a broken mirror is to resync
> from the good member, assuming you can determine which is good.  My
> problem is, I'm not sure which is the good member, or even if there
> *is* a good member: If GRUB writes different device numbers into the
> boot stage files, the two disks necessarily won't match.  Which, come
> to think of it, is probably something to worry about, since a legit
> mirror resync will scrogg that.
>
>  "smartctl -a" reveals something that may be relevant.  sda reports
> several non-zero values in the "Error counter log" section.  No
> uncorrectable errors, but ECC has been used.  At the same time, sdb
> reports all zeros for those same values.  Further, the counts for sda
> have increased since the disks were installed.  (I saved the output of
> "smartctl -a" back then.  Now you see why.)  Now, ECC usage is not an
> automatic cause for alarm on a modern hard disk, but the fact that sda
> is non-zero and increasing while sdb is zero and flat suggests sdb is
> in better overall health.  However, this probably has nothing to do
> with the mirror mismatch, since both disks report zero *uncorrectable*
> errors.  Uncorrectable media defects would certainly cause a mirror
> mismatch, but the drives think they've been able to handle everything
> so far.
>
>  There are newer kernels available; the system hasn't been rebooted
> in 251 days.  But I'm somewhat loath to try rebooting with /boot in a
> suspect state.
>
>  The thing I find really confusing is why "mismatch_cnt" can be
> non-zero while the rest of the in-kernel md monitoring stuff reports
> everything is good.
>
>  Anyone here have suggestions, ideas, knowledge, or even wild schemes?
>
> -- Ben