Help with broken raid5?

David Miller david3d at gmail.com
Fri Oct 28 09:26:23 EDT 2011


On Thu, Oct 27, 2011 at 2:28 PM, Steve Noel <NOEL at stevenoel.com> wrote:

>
> Hi all,
>
>
>
> I’m hoping to get some help with a raid5 container that went belly-up on
> me.  Keep in mind I’m a relative noob with Linux.
>
>
>
> Last year I bought an 8 TB NAS box (iomega ix4-200d) which came as a raid 5
> standalone unit, running a Debian Linux OS.
>
>
>
> Last week it declared drive 4 was in an error state so I replaced the drive
> and waited for it to rebuild.  Overnight I got 3 emails from the unit
> stating:
>
>
>
> 1)      That a “recoverable error” occurred on drive 1
>
> 2)      That the raid rebuild had completed
>
> 3)      That the StorCenter device had failed and some data loss may have
> occurred. Multiple drives may have either failed or been removed from your
> storage system.
>
>
>
> Of course, none of the drives had been removed.  And I don’t have
> confidence that the drive 4 rebuild completed successfully.  Now the unit
> declares that all 4 drives have been replaced and wants permission to format
> them.
>
>
>
> Luckily, the unit has SSH enabled and I can log in to the CLI.  But this is
> where I need some help.  I want to force the array to come back online to
> see what I can salvage from it.  The vendor’s only suggestion was to send
> the drives in and for $4000-$5000 they would attempt data recovery for me.
>
>
>
> I’ve been reading up on MDADM and have some info on the state of the
> container…
>
>
>
> root at NAS2:/# mdadm --detail /dev/md1
> /dev/md1:
>         Version : 01.00
>   Creation Time : Thu Mar 25 19:28:35 2010
>      Raid Level : raid5
>   Used Dev Size : 1951474176 (1861.07 GiB 1998.31 GB)
>    Raid Devices : 4
>   Total Devices : 4
> Preferred Minor : 1
>     Persistence : Superblock is persistent
>
>     Update Time : Tue Oct 25 02:24:28 2011
>           State : active, degraded, Not Started
>  Active Devices : 2
> Working Devices : 4
>  Failed Devices : 0
>   Spare Devices : 2
>
>          Layout : left-symmetric
>      Chunk Size : 64K
>
>            Name : storage:1
>            UUID : 53758a21:e972eb25:0c4ddf95:f4dd42b8
>          Events : 212414
>
>     Number   Major   Minor   RaidDevice State
>        0       0        0        0      removed
>        1       8       18        1      active sync   /dev/sdb2
>        2       8       34        2      active sync   /dev/sdc2
>        3       0        0        3      removed
>
>        0       8        2        -      spare   /dev/sda2
>        4       8       50        -      spare   /dev/sdd2
>
>
> Can anyone offer guidance on how to safely force these drives back online?
> I’m thinking that if I can get drive 1 back online the array should come
> back alive and I can copy data off of it.  Even if drive 4 never completed
> rebuilding the data should be there.
>
>
>
> Thanks in advance,
>
> Steve
>

Steve,
First, a disclaimer.  I don't mind helping people with mdadm issues, as I've
been where you are a few times, and getting the array back together is
normally possible.  But doing anything with the drives could result in data
loss and further damage whatever data is there, so proceed with caution;
I'm not responsible for any data loss.

OK, with that out of the way, here's my advice.  Before you do anything else,
go to the store and buy more drives: the same number, and at least as large
as your existing drives.  Then use dd to clone the old drives to the new
ones.  It sounds like you have four 2 TB drives in this array, so that's not
going to be cheap, but it will be cheaper than going through a data recovery
service, and it sounds like your drives are getting to the age where they may
need to be replaced anyway.  This step will protect you from any further data
loss, and it will at least give a data recovery service a shot if you have to
go there.
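
A cloning pass might look something like the following.  The device names are
just placeholders; triple-check which disk is the old source and which is the
blank target before running anything, because swapping if= and of= will wipe
the original.

    # Clone the old disk (source) onto the new disk (target),
    # continuing past read errors and padding bad blocks with zeros.
    dd if=/dev/sdX of=/dev/sdY bs=64K conv=noerror,sync

If any of the old drives have actual read errors, GNU ddrescue (if you can
get it onto the box, or move the drives to another machine) handles bad
sectors more gracefully than plain dd.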

Once you've got your drives cloned, you can more safely try to reassemble the
array.  It's your choice whether you do this on the old drives or the new
ones.  But pick one set to use for data recovery, and keep the other set only
as a master copy to re-clone your recovery set from if you get to a point
where you believe the data may have been compromised.

The first thing to do is to pull all the metadata off all the drives for
safekeeping.  This metadata may not be correct, but it's worth looking at to
see what's going on.  For your setup, "sudo mdadm -E /dev/sd[abcd][12] >
mdadm_dump.txt" should dump it all to a file named mdadm_dump.txt in your
current directory.
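
In particular, compare the Events counter across the four members; the one
with the lowest count is the one that has been out of the array the longest.
Assuming your mdadm prints the usual field names for 1.x metadata, something
like this pulls out the interesting bits of the dump:

    grep -E '/dev/|Events|Device Role|Array State|Update Time' mdadm_dump.txt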

If you want, post that to the list here so others can have a look at it.
From there you can try to just reassemble the array.  This probably won't
work, but it's worth trying in case it does: "sudo mdadm --assemble --force
/dev/md1".  Depending on what your config file says, you may also want to add
"--config=/etc/mdadm/mdadm.conf" to the options.  There's a good chance that
neither of these will work.  If that's the case, you'll need to re-create the
array while telling mdadm not to initialize it; to do that, you use the
--assume-clean option with the create command.  It would look something like
this: "sudo mdadm --create --level=5 --raid-devices=4 --assume-clean /dev/md1
/dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sdd2".  If this device changed any of the
other parameters from the defaults, you may have to append some additional
options to that command; hopefully the metadata printout will give you some
clues if that's the case.  But your --detail output above already shows the
default RAID5 layout (left-symmetric) and chunk size (64K).
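
If you do get to the --create step, one caution, and a sketch rather than
something to run blindly: your --detail output reports metadata version 1.0,
while mdadm may default to a different metadata version when creating, and
the devices have to be listed in their original RaidDevice order (use
"missing" for a slot you want to leave out, e.g. the half-rebuilt
replacement) or the data won't line up.  Guessing from your --detail output
that sda2 was slot 0 and the replaced drive sdd2 was slot 3, it might look
like:

    sudo mdadm --stop /dev/md1
    sudo mdadm --create /dev/md1 --metadata=1.0 --level=5 --raid-devices=4 \
        --chunk=64 --layout=left-symmetric --assume-clean \
        /dev/sda2 /dev/sdb2 /dev/sdc2 missing
    cat /proc/mdstat    # check that md1 came up as a degraded 3-of-4 raid5

Verify that slot order against the -E dump before trusting it, and once the
array is up, mount it read-only first and check the data before letting
anything write to it.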

Good luck, and if you have any questions, don't be afraid to ask.  It's
better to ask before you try something that may damage your data.
--
David