In deep trouble: corrupted RAID5 array, HELP!!!

I was rebuilding my backup server, and when I plugged a drive into my main server to copy over data I inadvertently bumped out one of the power cables to one of the drives in the array. The array stopped copying, and now that I've reset my machine, Windows says that most of the folders are corrupt and cannot be accessed. I'm totally lost as to what to do, as I was just starting to re-copy my backup, so pretty much all is lost. I don't want to touch anything without some solid advice first, and I'm freaking out, as I've got about 7TB of corrupted, unbacked-up data right about now. Please help, I'm very desperate!!

Areca ARC-1220
8x WD1000EACS
 
The drive that lost power should have been dropped from the RAID5, right? Is the RAID5 degraded? If not, you may want to mark it degraded, if you remember which drive lost power.
 
Yes, the controller says degraded; I checked that. I have not marked the drive I plugged back in as a spare yet; it says it's empty or unused or something similar. I figure it'll just rebuild a corrupted volume, so I'm leaving it as pristine as possible.

I'm trying Recuva for now, saving to an external drive to see how that goes. I don't want to write to the array just yet.


The volume seems to be reading as the first 2TB only... GetDataBack says the partition is only 2TB, while Windows still says it is 6.36TB. Egad, I'm freaking out.

All the folders that I have written to recently are corrupted, whether it was adding one file to an existing folder or creating a new folder altogether, but it's a little inconsistent.
 

This is the main reason I switched to ZFS. I had a few RAID5 issues... with hardware RAID it can be much more difficult to recover.

Have you tried emailing the manufacturer of your RAID card? Sometimes they have suggestions.
 
Yeah, the first thing I did was email Areca, but I suppose I'll be waiting a while for a reply, it being the weekend.
 
To go over 2TB you need 48-bit LBA and GPT partitions, and to boot from it you'd probably need EFI support.

Can you tell me about the contents? Does it have a single NTFS partition on it? Are you using GPT?
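For reference, one way to confirm from Windows whether the volume is GPT is a read-only look with diskpart; the disk number 1 below is just a placeholder for whichever disk the array shows up as:

diskpart
list disk        (disks with an asterisk in the "Gpt" column use a GPT partition table)
select disk 1    (substitute the array's disk number from the list above)
detail disk      (shows the partition style and volumes; all of this is read-only)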
 
The volume contains mostly videos, plus personal documents, pictures, games, music... generic home PC junk really, just a lot of data that took a long time in the making.

It's a single 6.36TiB partition, and yes, it is GPT (as it has to be, as far as I know). The volume has been flawless for the last few years and has successfully migrated from 6 -> 7 -> 8 drives. It's not the boot volume (as you can see from my specs below, I have my OS on 2x WD 500GB RE3 drives in RAID0).

Currently scanning with GetDataBack; that's a few hours in with about 8 hours to go. I also decided to rebuild the array by marking the dropped drive as a hot spare...

I'm too nervous to let Windows do a consistency check on startup (chkdsk etc.), as I've had bad experiences in the past with it corrupting my data.

A fair chunk of files are perfectly fine... but my browsing with Explorer has managed to make things worse, turning folders into files. I mean, what the heck. I stopped as soon as I noticed, and I'm obviously not writing to the volume at present.
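On the chkdsk worry: if you ever want to see what it thinks without letting it change anything, running it without the /F or /R switches only scans and reports (assuming the array is mounted as D: here):

chkdsk D:        (read-only: lists file system problems but does not repair anything)
chkdsk D: /F     (repair mode, which actually writes to the volume; avoid it for now)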
 
This does not make a lot of sense. A RAID5 being degraded by losing a single drive should not cause any corruption at all unless the RAID card / RAID software messed up. Or perhaps you have URE errors on your drives (I have never had one that I know of in many years of RAID). I have a dozen or so RAID5 arrays at work and most have been degraded at one point (1% to 5% drive failure rate per year), and I have never lost a single file from a degraded array. I am mostly on Linux software RAID, however, with a couple of Areca cards on Windows workstations.
 
Simply accessing the NTFS volume would do the journal commit/re-commit/purge process, so you've already written to the volume. With conventional RAID5/RAID6 it's somewhat easy to corrupt your data: let the RAID engine use the wrong order of disks (i.e. disk 5 becomes disk 6), then drop one disk and rebuild the array, and boom, corruption all over the volume.

You've done expansion from 6 to 8 drives in two steps. Did you ever do a rebuild/resync/parity rebuild (all the same thing) since then?

I'm afraid I don't have too much advice; I would continue backing up anything that you can recover. Be as careful as possible, avoiding any additional writes. But it does look like at least some corruption is present, possibly in metadata, which is what can do funny things to your filesystem; except it's not so funny when your data is on there without a backup.

As for the future, I highly recommend looking at ZFS and how it can make your life simpler and offer more protection for your data, especially regarding corruption. When you're ready to discuss that, I would be happy to reply in your (newly?) created thread. For now, I assume you want to focus on getting as much data back as possible. Good luck with that!
 
This does not make a lot of sense. A RAID5 being degraded by losing a single drive should not cause any corruption at all unless the RAID card / RAID software messed up. Or perhaps you have URE errors on your drives (I have never had one that I know of in many years of RAID). I have a dozen or so RAID5 arrays at work and most have been degraded at one point (1% to 5% drive failure rate per year), and I have never lost a single file from a degraded array. I am mostly on Linux software RAID, however, with a couple of Areca cards on Windows workstations.

It doesn't make a whole lot of sense to me either...

Simply accessing the NTFS volume would do the journal commit/re-commit/purge process, so you've already written to the volume. With conventional RAID5/RAID6 it's somewhat easy to corrupt your data: let the RAID engine use the wrong order of disks (i.e. disk 5 becomes disk 6), then drop one disk and rebuild the array, and boom, corruption all over the volume.

You've done expansion from 6 to 8 drives in two steps. Did you ever do a rebuild/resync/parity rebuild (all the same thing) since then?
The only things I've needed to go into the RAID controller for were to set up the original array and then expand it onto the other two disks... that's all I know; that's how simple it was in the RAID BIOS: expand array -> choose disks.
I'm afraid I don't have too much advice; I would continue backing up anything that you can recover. Be as careful as possible, avoiding any additional writes. But it does look like at least some corruption is present, possibly in metadata, which is what can do funny things to your filesystem; except it's not so funny when your data is on there without a backup.

As for the future, I highly recommend looking at ZFS and how it can make your life simpler and offer more protection for your data, especially regarding corruption. When you're ready to discuss that, I would be happy to reply in your (newly?) created thread. For now, I assume you want to focus on getting as much data back as possible. Good luck with that!
Yeah, I'm doing an extensive scan with GetDataBack now. I tried just a quick scan on the affected volume and it listed most of everything, but when I selected some very recently added files (videos) and tried to copy them to an external drive, some start to play for 5 seconds but the timestamps are ridiculous values like 93:04:45 (a 93-hour video that was originally 25 minutes), or the file cannot be rendered at all.

This is all quite devastating... I'm at breaking point.

Thanks, people.
 
Sorry man, I can imagine your frustration. :(

You can try to fix some of the corruption in your videos with VLC (it prompts you to fix the file because it is damaged); that would get the timeline back to normal in most cases, but not the visible corruption at some parts of the movie.

I don't think there's much more you can do than software recovery; the parity data has been rebuilt, so it carries the same (extrapolated) data as the visible data, meaning you can't get anything more out of your array than you are seeing right now.
 
Man, I am sorry to read this. I hope you recover your data. I was about to buy an Areca card and run 8x 2TB drives in RAID5, but this is scary :(.
 
Well, I'm not sure if it's appropriate to discuss here, but using something like ZFS or Btrfs with checksums is a good enhancement for reliable data storage, and both have additional data-safety features that simpler filesystems like NTFS lack.

Using hardware RAID was the traditional way to run a proper RAID5/RAID6 on Windows. But I feel there is a better alternative today: a ZFS NAS, though the downside is that you're bound by gigabit speeds and will likely have to use the slower SMB/CIFS protocol, since most Windows versions do not support NFS. From a data security standpoint, though, ZFS offers you multiple safety nets to protect your data, and in the case that it fails, it shows you exactly which files are affected.

If you mainly want performance, then hardware RAID on Windows can do that with RAID5/6; another setup would be onboard RAID0 with a ZFS NAS as backup.
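To make the "shows you exactly which files are affected" point concrete, here is a minimal sketch on a ZFS box (the pool name tank is made up):

zpool scrub tank        (read every block in the pool and verify it against its checksum)
zpool status -v tank    (after the scrub, -v lists any files with permanent errors by path)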
 
I'm sorry about your difficulties. I'm not really sure what happened, and it is difficult to tell without a lot more detail about exactly what you were doing, and what you did after the drive lost power.

But I wanted to give a few tips for you and others for future reference:

1) Always use RAID 6 when using 8 or more HDDs, or any HDDs greater than 1 TB

2) When you first set up your RAID, test it. Simulate a drive failure and make sure that you can recover without data loss (put some test data on it).

3) Never mess around with your RAID system with the power on! If cables come loose while the RAID is live, you can easily lose your entire array. But if the system is powered down, you can just replace the cables.

4) With Areca controllers, always keep the "Auto Activate Incomplete Raid" setting DISABLED. That way, if something went wrong while the power was out, your RAID will not come up live when you power back on (otherwise it might be activated in degraded mode). When you notice that the RAID is not live, you can power down again and try to fix the problem.

5) After the RAID has been initialized, disable the individual HDD write caches (the caches on each HDD). These can be 32 MB or 64 MB per HDD, and if power is lost suddenly, you could lose that much data if it has not been written to disk yet. That is a disaster if some of the lost data is metadata, since your entire file system could become corrupted. (A sketch of one way to do this is below.)
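Behind a hardware RAID card, the per-drive write cache is normally toggled from the controller's own BIOS or web console rather than the OS. For directly attached SATA disks on a Linux box, hdparm is one way to do it (the device name below is a placeholder):

hdparm -W 0 /dev/sdb    (turn the drive's volatile write cache off)
hdparm -W /dev/sdb      (query the current write-caching setting)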
 
Well, I'm not sure if it's appropriate to discuss here, but using something like ZFS or Btrfs with checksums is a good enhancement for reliable data storage, and both have additional data-safety features that simpler filesystems like NTFS lack.

Using hardware RAID was the traditional way to run a proper RAID5/RAID6 on Windows. But I feel there is a better alternative today: a ZFS NAS, though the downside is that you're bound by gigabit speeds and will likely have to use the slower SMB/CIFS protocol, since most Windows versions do not support NFS. From a data security standpoint, though, ZFS offers you multiple safety nets to protect your data, and in the case that it fails, it shows you exactly which files are affected.

If you mainly want performance, then hardware RAID on Windows can do that with RAID5/6; another setup would be onboard RAID0 with a ZFS NAS as backup.

Or you can do what I did and set up link aggregation or IPMP.

It's REALLY easy to do link aggregation in OpenSolaris:
(disable nwam, plumb your links, then use dladm to set up your aggregation, then set up normal "static" IP stuff.)
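A rough sketch of those steps (the interface names e1000g0/e1000g1 and the address are placeholders, and the exact dladm syntax varies a bit between OpenSolaris builds):

svcadm disable network/physical:nwam        (turn off automatic network configuration)
svcadm enable network/physical:default
dladm create-aggr -d e1000g0 -d e1000g1 1   (bundle both NICs into aggregation key 1, i.e. aggr1)
ifconfig aggr1 plumb
ifconfig aggr1 192.168.1.50 netmask 255.255.255.0 up   (normal static addressing on the aggregate)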

I get amazing ZFS CIFS speeds
 
Really sorry for your troubles. I had similar issues in the past, which is why I stopped using RAID altogether. I just make a nightly backup to an external drive now. Of course, I only have about 1.5 TB of data, so that is easy for me to do.

Good luck on getting that data back and please let us know how it comes out.
 
I seem to be having better luck recovering the backup volumes (a deleted Windows dynamic span); the files seem to be more intact. But now I'm at a storage dilemma... where to put the files :S

Well, this is it, it's all decided: I'm done with RAID now. I'm going to use my expensive Areca card to make a big JBOD, since it's obviously easier to recover data from. I'll also reinstate my previous JBOD on my backup server, so I shall be blessed with more space, and I've learned to be more vigilant with backups. I'm thinking of a third backup server in case something like this happens again...
 
Third backup server? I was under the impression that you didn't have backups at all, which is why you lost data. So a new backup server would be your first; or did I miss something?
 
@OP

Don't be disheartened by your RAID experience. If you hadn't bumped that power cable, none of this would have happened, and you'd have gone along thinking RAID was perfectly OK. RAID is perfectly OK, so long as you are aware of its behaviours and limitations and operate within them.

A couple of suggestions I'd like to add to the mix:

1. If you do decide to build another backup server, there's no reason not to use RAID on it. In fact, I'd transfer the RAID card and drives (out of your workstation, I assume) into the new backup server. However, if you decide to use ZFS with OpenSolaris or FreeBSD, then putting the card in JBOD mode might be more beneficial for RAID-Z/RAID-Z2.

2. Use RAID 6 (RAID-Z2) instead of RAID 5 (RAID-Z); it will give you a further level of redundancy for when a drive fails in some way (see the sketch after this list).

3. Drive caddies. Use a server case with drive caddies. This will make life a whole lot easier when you need to identify a drive to pull for whatever reason, and it will reduce the likelihood of a drive cable getting bumped.
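If you do go the ZFS route, a minimal sketch of suggestion 2 (the pool name and FreeBSD-style device names are made up):

zpool create tank raidz2 da0 da1 da2 da3 da4 da5 da6 da7   (8-disk pool that survives any two drive failures)
zpool status tank                                          (confirm the raidz2 vdev and that every disk shows ONLINE)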

Like I said, don't run scared of RAID. These things can happen - I myself have had plenty of "experiences" with hard drives, RAID, LVM and corrupted filesystems. Just try not to make the same mistakes twice, and you'll do fine.
 
If you hadn't bumped that power cable, none of this would have happened, and you'd have gone along thinking RAID was perfectly OK. RAID is perfectly OK, so long as you are aware of its behaviours and limitations and operate within them.

The corruption should never have happened. With Linux software RAID I can pull out each drive one by one while writing and be assured that I will have absolutely no data loss except for the terminated file operation when the RAID went offline. I cannot believe that this popular hardware RAID is that fragile.
 
The corruption should never have happened. With Linux software RAID I can pull out each drive one by one while writing and be assured that I will have absolutely no data loss except for the terminated file operation when the RAID went offline. I cannot believe that this popular hardware RAID is that fragile.


This is why ZFS is the way to go. ZFS is much more resilient to stuff like this... even more so than Linux software RAID.
 
I'm confused. You had a RAID5, lost a disk, and the whole array is shot? Are you sure it was RAID5?? I thought the whole point of RAID5 was that an HD could drop out and you'd keep the data intact. What am I missing here?
 
You had a RAID5, lost a disk, and the whole array is shot?

The array was not shot, but it got corrupted.

I thought the whole point of RAID5 was that an HD could drop out and you'd keep the data intact. What am I missing here?

You are correct; that should not have happened. Even if the disk lost part of a write, the parity in the RAID should have fixed that, just like it should have fixed it if the drive had suddenly died during a write.
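For anyone wondering why that should hold: RAID5 parity is just an XOR across each stripe, so any single missing block can be recomputed from the remaining blocks. A toy stripe with three data blocks:

P  = D1 xor D2 xor D3    (parity written during normal operation)
D2 = P xor D1 xor D3     (if the disk holding D2 drops out, its data is rebuilt from the parity plus the survivors)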
 
The array was not shot, but it got corrupted.


Lol, you say tomAto, I say tOmato... corrupt, shot, fubared, it's all the same, isn't it? (Really not trying to be a jerk, as I respect your knowledge and you are a great poster, and while I understand that technically corruption may not be exactly the same as shot, it kind of is too.)


OK, I was thinking that if it was RAID5 it was built to guard against this kind of failure. Weird; maybe there was more going on with the array than the OP realizes. I know I have personally suffered file corruption due to a cheap SATA card, and man, it sucks to see a file and not be able to do anything but delete it.

The more I learn about large storage systems the more I appreciate a decent backup plan.
 
Lol, you say tomAto, I say tOmato... corrupt, shot, fubared, it's all the same, isn't it?

The distinction was that even in this case most of the data should still be fine, whereas I was taking "shot" to mean a total loss.
 
I was thinking that if it was RAID5 it was built to guard against this kind of failure.

It should. I suspect that the array was not truly clean before the drive got kicked out, meaning that the parity was not in sync.

you are a great poster
Thanks. :)

The more I learn about large storage systems the more I appreciate a decent backup plan.

I am in the camp that says a RAID is not a backup. At work I back up everything to tape, and if it is an archive I make two copies. I know a lot of users are moving to RAID as a backup, but I caution against a single RAID array being the only backup. With those cautions in mind, I have not lost any data to RAID5/6 failure in the 30TB+ of arrays that I have managed over the last decade.
 
OK, gotcha now. In theory it should have been recoverable, is what you meant. I agree, hence my confusion about the type of array.

I am with you 100% about RAID backups. My take on RAID is that it is great for uptime, but it's not a standalone solution. I try to dissuade people from using RAID in their servers, as I think it adds a lot of points of failure, especially when a person doesn't quite grasp the concept to begin with. I don't know; as fast and large as hard drives have gotten, it almost seems that all but the most advanced users don't really need it. But then again, from the little Linux/BSD experience I have, those OSes seem to have software RAID down pretty well.


Edit: to clarify, IMHO if you're using the command line etc. to set up a RAID, I would consider you advanced, so all the Linux/*nix/BSD guys are advanced to me. I should say that in Windows (especially with WHS), RAID is almost not necessary in a home environment. Again, all IMHO.
 
Hardware RAID5 with a hot spare, or hardware RAID6, are great solutions for shared folders on the network. They offer increased performance and safety. You can even do RAID10 if uptime is imperative.

I use HW RAID5 with a hot spare (6x 250GB drives = 1TB array), and I can hot-swap any of the 6 drives and, after the rebuild, hot-swap another.

But nothing replaces nightly backups.
 
The RAID engine (or layer) is something that can fail all by itself, even with all disks being perfectly fine. Or it may react to a 'simple' failure in a way that puts your data at risk; or the user may react in a way that cripples or permanently destroys data.

The only real safety net is a true backup, independent of the main array. If that's not possible, then at least snapshots using ZFS would help against some additional risks, but you still rely on ZFS to protect you from corruption and drive failure (redundancy); if a bug in ZFS ruins your data, you're still exposed to (partial) data loss.
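As a concrete example of the snapshot safety net (the dataset name is made up):

zfs snapshot tank/data@nightly    (cheap point-in-time copy of the dataset)
zfs list -t snapshot              (list the snapshots kept so far)
zfs rollback tank/data@nightly    (revert the dataset if corruption or a virus mangles the live files)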

Fortunately, I have more trust in ZFS than in most commercial RAID engines, many of which have proven to be unreliable. I used Areca HW RAID5 and RAID6 in the past as well, but I still got array splits just like I had when using onboard RAID on Windows. That was the motivation to switch to the Linux and FreeBSD platforms, where much more advanced software RAID drivers existed.
 
The corruption should never have happened. With Linux software RAID I can pull out each drive one by one while writing and be assured that I will have absolutely no data loss except for the terminated file operation when the RAID went offline. I cannot believe that this popular hardware RAID is that fragile.

I agree, and going by the OP's descriptions it looks like the controller is at fault. I've had a Promise on-board RAID controller go bad (thankfully it was only in default JBOD mode), and the software RAID0 I had running over it ended up with corrupted files.

That's not to say the OP's controller has something wrong with it, just that it doesn't seem to be very resilient.
 
Moral is, test your RAID after a good backup. I just pull a drive every once in a while.
 
OP: post this on the 2cpu forums and wait for a reply from Jus; there are a few people there who can help with Areca situations. It doesn't sound like the array is totally cooked, and you should be able to get it back up and running.
 
As for the future, I highly recommend looking at ZFS and how it can make your life simpler and offer more protection for your data, especially regarding corruption. When you're ready to discuss that, I would be happy to reply in your (newly?) created thread. For now, I assume you want to focus on getting as much data back as possible. Good luck with that!

Make life easier, lol. Tell that to this guy. No offense, but when end users need a degree just for simple troubleshooting, maybe it's not the one-size-fits-all you evangelize it to be in nearly every thread.

Maybe also mention the lack of OCE in RAID-Z/RAID-Z2 when playing up ZFS. I'd call that a pretty big gotcha if you're trying to sell ZFS as superior to a hardware array controller. Somehow ZFS fans always leave that part out.
 
Well, that's a bit unfair, isn't it? We don't even know what actually caused his problem.

I agree, it does seem likely to be a FreeNAS bug, but I wasn't particularly recommending FreeNAS. And after this incident, I think I would stop recommending it for use with ZFS, at least until a newer and more stable version is released based on newer ZFS code (v13/14). The ZFS version used in FreeNAS 0.7 is still considered 'experimental', and a warning message is displayed on the console indicating as much.

Related to expansion, I would say that OCE makes the array less resilient, since you're lowering the level of redundancy: 1 parity drive per 4 drives sounds nice, but 1 per 10 starts being quite unreliable. The expansion that ZFS does offer (add 4 at a time) would maintain a 25% parity overhead, and also maintain the redundancy level / RAID-level protection for your data.
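That "add 4 at a time" style of growth looks roughly like this: instead of widening the existing stripe (OCE), ZFS adds a whole new raidz vdev alongside it, so the parity-to-data ratio stays the same (pool and device names are made up):

zpool add tank raidz da8 da9 da10 da11   (new 4-disk raidz vdev striped next to the existing one)
zpool status tank                        (the pool now shows two raidz vdevs, each with its own parity)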

Life is all about trade-offs; I just suggested the OP have a look at the ZFS option. I still recommend that, if he's comfortable with running FreeBSD or OpenSolaris. The 'make your life easier' part would come from the fact that you wouldn't have to worry about corruption or viruses (snapshots) or RAID splits ever again. That does not mean you're invulnerable to bugs in experimental software, though.
 
Well, that's a bit unfair, isn't it?

Not necessarily, in light of the fact that the OP in this thread hasn't really given enough background information either (much like the other thread), and yet a few ZFS fans saw it as an opportunity to point a finger at hardware RAID. That said, props for trying to help that guy in the other thread out with his ZFS issue, but it's a case in point that the inexperienced can *quickly* end up in the weeds with Linux/BSD if trouble strikes. Which is another big factor when weighing a solution, say if you're used to Windows and don't necessarily care for another OS learning curve just to store files.

Don't get me wrong, we're on the same side. If and when ZFS evolves just a bit further, I'm all over it. I just think full disclosure is in order when bringing up ZFS as the be-all, given that many readers are lost in the wilderness that is multi-disk storage and may be led down a path that isn't best for their needs.
 
I seem to be having better luck recovering the backup volumes (a deleted Windows dynamic span); the files seem to be more intact. But now I'm at a storage dilemma... where to put the files :S

Well, this is it, it's all decided: I'm done with RAID now. I'm going to use my expensive Areca card to make a big JBOD, since it's obviously easier to recover data from.

Bit of a throwing-the-baby-out-with-the-bathwater scenario here. You're still in the same boat with JBOD in place of RAID5/6, and ever finding yourself in the position of needing to recover something off a single drive or array means your strategy is off, namely not duplicating your important files to another set of storage on at least a nightly basis. Buy more drives; there's really no other way around it. Striped RAID5/RAID6 hardware arrays are about reliability (uptime) and speed, and that's it.

The situation you described in the OP, based on the limited details given, would be unique, because I've test-failed Areca arrays every which way and never been able to corrupt one. I'm happy to try and help you out if you tell me where you're at right now with the array.
 
Found out what the problem was. Well, it turns out my array has been running degraded since September last year, and because the web console wasn't installed I wasn't aware of it. So when I bumped drive 7 out of the array, the controller saw it as failed and finally decided to recover driveX to the array, basically meaning anything I've written to the array since September is corrupted. Disappointed to say the least, but I suppose it was all my fault in the end. Luckily I'm having good luck recovering from my backup server, and I've also given huge data dumps to friends which I'm in the process of getting back. There are still going to be substantial casualties, and I really regret rebuilding the array and losing all my data in the process. Gonna keep a closer eye on it in the future; I just wish the controller had told me the array was degraded during POST. I mean, I've seen the controller initialize countless times since September with no such warning. Oh well, them's the breaks.
 
Gonna keep a closer eye on it in the future; I just wish the controller had told me the array was degraded during POST. I mean, I've seen the controller initialize countless times since September with no such warning. Oh well, them's the breaks.

Seems like a worthless controller to me. I know 3ware cards would pause at POST if they detected something wrong; I believe they even made you press a key to continue booting.

No monitoring software running, though?

Personally, seeing a drive in the array with no activity lights for eight whole months would have tipped me off that something couldn't be right, though...
 