mismatch_cnt != 0, member content mismatch, but md says the mirror is good
Shawn O'Shea
shawn at eth0.net
Sun Feb 21 16:26:47 EST 2010
I know I'm resurrecting an old thread here, but I just saw a post on Planet
CentOS that seems to have some info on fixing the "mismatch_cnt is not 0"
warning. Take a look at this blog post, where the author suggests some md
actions that can be taken to clear these errors:
http://www.arrfab.net/blog/?p=199
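
For reference, the md actions usually meant in this context are the "repair"
and "check" values written to sync_action in sysfs. A minimal sketch, assuming
the array is md0 and the kernel exposes /sys/block/md0/md/sync_action and
mismatch_cnt. MD_SYSFS, MD_POLL, md_scrub and md_mismatch_cnt are illustrative
names only and are not taken from the blog post:

```shell
#!/bin/sh
# Sketch only: ask md to rewrite mismatched blocks, then re-check.
# MD_SYSFS and MD_POLL are assumed variables; override them as needed.
MD_SYSFS="${MD_SYSFS:-/sys/block/md0/md}"
MD_POLL="${MD_POLL:-5}"

# Start a scrub pass ("repair" or "check") and wait until md reports idle.
md_scrub() {
    echo "$1" > "$MD_SYSFS/sync_action" || return 1
    sleep "$MD_POLL"    # give md a moment to start the pass
    while [ "$(cat "$MD_SYSFS/sync_action")" != "idle" ]; do
        sleep "$MD_POLL"
    done
}

# Print the mismatch count left by the most recent pass.
md_mismatch_cnt() {
    cat "$MD_SYSFS/mismatch_cnt"
}

# Typical use, as root:
#   md_scrub repair && md_scrub check && md_mismatch_cnt
```

Note that on a RAID1 a "repair" pass only makes the members agree by copying
one member's block over the others. It does not work out which copy was the
correct one, so it clears the warning without answering the "which member is
good" question raised below.
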
-Shawn
On Sun, Nov 1, 2009 at 10:00 PM, Ben Scott <dragonhawk at gmail.com> wrote:
> CentOS 5.4. Running kernel is 2.6.18-92.1.22.el5. The system has
> two disks, each with two partitions, making up two md mirror devices.
> md0 is ~ 509 MB and holds /boot; md1 is ~ 69 GB (the rest of the disk)
> and holds an LVM PE. The following arrived in my mailbox today:
>
> On Sun, Nov 1, 2009 at 4:22 AM, Cron Daemon <root at liberty.gnhlug.org>
> wrote:
> > /etc/cron.weekly/99-raid-check:
> >
> > WARNING: mismatch_cnt is not 0 on /dev/md0
>
> Investigation finds:
>
> /proc/mdstat reports everything is peachy for both mirrors. "[2/2] [UU]"
>
> Under /sys/block/md0/md/ I find the following:
>
> array_state: clean
> mismatch_cnt: 256
> rd{0,1}/errors: 0
> rd{0,1}/state: in_sync
>
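
The values listed above can be collected in one go. A minimal sketch, assuming
the usual md sysfs layout (array_state, mismatch_cnt, and rdN/ entries for the
members); md_report is a hypothetical helper name:

```shell
#!/bin/sh
# Sketch only: print the md sysfs attributes discussed above.
# The directory argument defaults to md0; pass another one to inspect md1.
md_report() {
    dir="${1:-/sys/block/md0/md}"
    for f in "$dir/array_state" "$dir/mismatch_cnt" \
             "$dir"/rd*/errors "$dir"/rd*/state; do
        # Skip anything missing, e.g. when the rd* glob matches nothing.
        [ -r "$f" ] && printf '%s: %s\n' "${f#"$dir"/}" "$(cat "$f")"
    done
    return 0
}

# Typical use:
#   md_report /sys/block/md0/md
```
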
> Google finds lots of people reporting similar, but nothing
> conclusive or particularly pertinent to this situation. Lots of
> people saying that swap can cause this (because swap can commit a
> block to one member, then learn it won't ever re-read that block, and
> so won't bother committing the other member), but this is the /boot
> filesystem, not swap. (swap is in an LV; the md device backing that
> LVM's sole PE reports a mismatch_cnt of zero.)
>
> I did find some people saying this started happening after CentOS
> 5.3 -> 5.4. I did do that recently. One person said the "raid-check"
> was added in 5.4. So I presume this mismatch_cnt might have been
> non-zero for ages, and I just never knew to look before now.
> mdmonitor has been running, but it mainly reports if a RAID member
> goes offline, and as noted, md is reporting all's quiet on the western
> front.
>
> I tried dismounting the /boot filesystem and running some tests.
> (Since it's a separate partition and md device, and outside of LVM, I
> can poke at it without taking the system down.)
>
> "e2fsck -f -n" says /dev/md0 is okay.
>
> I tried stopping the RAID device with "mdadm --stop /dev/md0", then
> sync'ing disks. Then I ran "cmp /dev/sda1 /dev/sdb1". The result:
>
> /dev/sda1 /dev/sdb1 differ: byte 331875867, line 215880
>
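
To see where that first difference sits on disk, the byte offset from cmp can
be turned into a sector or filesystem block number. cmp counts bytes from 1,
so subtract one before dividing. A minimal sketch; offset_to_unit is a
hypothetical helper, and the 1024-byte block size in the usage note is an
assumption about this /boot filesystem, not something stated in the thread:

```shell
#!/bin/sh
# Sketch only: convert a 1-based byte offset, as printed by cmp,
# into a 0-based index of units of the given size (sector, fs block, ...).
offset_to_unit() {
    echo $(( ($1 - 1) / $2 ))
}

# Typical use:
#   offset_to_unit 331875867 512     # 512-byte sector within the partition
#   offset_to_unit 331875867 1024    # fs block, if the block size is 1024
```

With the offset reported above, the first call prints 648195. If the block
size assumption holds, the block number from the second call could be fed to
debugfs ("icheck", then "ncheck") to find which file owns it.
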
> So the two mirror members are **NOT** identical. That's usually bad.
>
> Running "e2fsck -f -n" on each member says no trouble found. That
> implies whatever the mismatch is, it is not in filesystem metadata.
>
> Running a "badblocks" read-only test on each member says no read errors.
>
> mdadm says the MD superblocks are okay, and comparing the two finds
> most things are the same -- only the checksum and device relationships
> differ (expected).
>
> One nice thing about simple mirrors is that you can mount the
> members read-only and examine the contents without breaking the mirror
> set. So:
>
> liberty$ sudo mount -o ro -t ext2 /dev/sda1 /mnt/sda1
> liberty$ sudo mount -o ro -t ext2 /dev/sdb1 /mnt/sdb1
> liberty$ sudo diff -r sda1 sdb1
> Binary files sda1/grub/stage2 and sdb1/grub/stage2 differ
> liberty$
>
> (You have to mount as ext2 because ext3 will replay a journal even
> if you said "read-only".)
>
> It may be normal for the GRUB stage2 to differ in this
> configuration. There may be device numbers encoded into them. GRUB
> was installed on each disk separately, by booting from floppy, so that
> would do it. Or it could be one disk has an undetected bad block and
> the boot loader on that disk is shot.
>
> No other differences detected in file data, though. So between fsck
> and diff, it looks like most of the contents are intact. Maybe all of
> them.
>
> I'm unsure as to how to proceed.
>
> The general procedure for repairing a broken mirror is to resync
> from the good member, assuming you can determine which is good. My
> problem is, I'm not sure which is the good member, or even if there
> *is* a good member: If GRUB writes different device numbers into the
> boot stage files, the two disks necessarily won't match. Which, come
> to think of it, is probably something to worry about, since a legit
> mirror resync will scrogg that.
>
> "smartctl -a" reveals something that may be relevant. sda reports
> several non-zero values in the "Error counter log" section. No
> uncorrectable errors, but ECC has been used. At the same time, sdb
> reports all zeros for those same values. Further, the counts for sda
> have increased since the disks were installed. (I saved the output of
> "smartctl -a" back then. Now you see why.) Now, ECC usage is not an
> automatic cause for alarm on a modern hard disk, but the fact that sda
> is non-zero and increasing while sdb is zero and flat suggests sdb is
> in better overall health. However, this probably has nothing to do
> with the mirror mismatch, since both disks report zero *uncorrectable*
> errors. Uncorrectable media defects would certainly cause a mirror
> mismatch, but the drives think they've been able to handle everything
> so far.
>
> There are newer kernels available; the system hasn't been rebooted
> in 251 days. But I'm somewhat loath to try rebooting with /boot in a
> suspect state.
>
> The thing I find really confusing is why "mismatch_cnt" can be
> non-zero while the rest of the in-kernel md monitoring stuff reports
> everything is good.
>
> Anyone here have suggestions, ideas, knowledge, or even wild schemes?
>
> -- Ben