What is the best way to recover from a double fault on a VNX array Pool?

Question

Hello,

I wanted to start a discussion here because I am a little confused.

We have a VNX5400 array that suffered a double fault on 2 EFD drives in one of the RAID groups of the pool.

It appears that recovery of that pool is limited to destruction and recreation.

Does that sound right to all of you?

It seems there should be some ways around this, e.g. forcing drives online like we did 15 years ago, or allowing the EFD drives to go read-only to reconstruct the RAID group. Or perhaps even allowing an RG to be removed from a pool (i.e. the opposite of expansion).

I feel like we are missing a technique or ability to recover from this event more cleanly.

What do you think?

GearoidG · Accepted Answer

Hi Jason,

So currently Unity uses a very similar (if not identical) RAID technology to VNX2 (called MC-X; I think there are some whitepapers on support.emc.com discussing MCR/MCC/MCF).

I have heard rumblings of change but am unclear on when, and obviously future development cannot be discussed publicly.

So to answer your question: at this stage, yes, Unity is vulnerable to a double fault in the same RG.

Obviously we in engineering have methods to attempt to bring a drive back online, but a lot depends on what has happened to the drive. And needless to say, every Unity case is being scrutinized closely at present.

Hope that helps

Gearoid

Rainer_EMC · Answer

The best way is to contact EMC support and let the recovery team figure it out.

JGroce213 · Answer

Rainer,

Thank you so much for your response.

The good news is I did. However, they stated that the array would have to be put in recovery mode to delete the entire pool and recreate it from scratch. They said the downtime would be 6 hours.

I feel like surely there is a better way. I was hoping someone in the community would have some experience with a more graceful recovery of a double fault in an EFD pool.

Rainer_EMC · Answer

Hi, I don't think there is anyone here in the community who can better judge and fix this than the recovery team from EMC support. Rainer

brettesinclair · Answer

I agree with Rainer, this is always best handled by Support.

If you are unhappy with their response, you also have the ability to escalate or engage with your TAM for further discussions.

I do however believe that they will also offer you the most pragmatic way forward to get your pool back online.

GearoidG · Answer

They are correct: downtime is required if the drives are unrecoverable. (This is due to internal linkage between the internal pool file system and the RAID layout; not doing this usually causes a further extended outage.)

Usually the recovery team escalates cases to engineering if they believe there is a chance of recovering the drives.

This is usually done because there is no backup, but even if the recovery succeeds there is always damaged data.

Regarding removing the damaged RG from the pool: no, this cannot be done, due to how the internal pool file system is laid across the drives upon the initial creation of the pool.

Trust me, this is a topic we discuss a lot internally, and we enhance whenever possible.

Best Regards

Gearoid

JGroce213 · Answer

Gearoid,

Thanks so much for your response. I guess the next question we have in mind is: does Unity have this feature as well?

The only reason I ask is that, as we move towards all-EFD arrays, it seems the probability of 2 drives in a RAID group failing really close to each other would increase as time goes on.

There seems to be some opportunity here to add additional features. E.g. why is an EFD completely unreadable when it goes offline? It makes sense that you can only write to the drive a certain number of times, but just because the last write has happened (if it did in this case), why can't the data that was on the drive be read? If this were added, then you should be able to recover the pool that has the double-faulted RG in it.
Or, as we talked about before, remove RGs from pools.

As the current feature set stands, losing 2 drives in one RG takes down the entire pool. That seems far worse than the old days, when one double-faulted RG only took down the LUNs that were in it. (I can see why you are having internal discussions.)

So back to the spin-off question: how does Unity handle double-faulted RAID groups?

Thank you again for your and everyone's assistance with this. I feel like there is a better way out there.

Jason Groce
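
To make the comparison in the post above concrete, here is a minimal Python sketch of the "blast radius" of one double-faulted RG under the two layouts. It is a toy model, not VNX or Unity internals: the function names, LUN names, and RG names are all hypothetical, and the pool case simply assumes (as the thread describes) that every LUN in a pool depends on every RG in that pool.

```python
# Toy model only: every name here is hypothetical, not a VNX/Unity API.

def affected_luns_traditional(lun_to_rg, failed_rg):
    """Classic RAID groups: each LUN lives in exactly one RG,
    so only the LUNs bound to the failed RG are lost."""
    return sorted(lun for lun, rg in lun_to_rg.items() if rg == failed_rg)


def affected_luns_pool(pool_luns, pool_rgs, failed_rg):
    """Pool: assume every LUN depends on every RG in the pool,
    so one double-faulted RG affects all LUNs in that pool."""
    return sorted(pool_luns) if failed_rg in pool_rgs else []


# Hypothetical layout: four LUNs spread over three RAID groups.
lun_to_rg = {"lun0": "rg0", "lun1": "rg0", "lun2": "rg1", "lun3": "rg2"}
pool_rgs = {"rg0", "rg1", "rg2"}

traditional = affected_luns_traditional(lun_to_rg, "rg1")
pooled = affected_luns_pool(lun_to_rg.keys(), pool_rgs, "rg1")

print("traditional:", traditional)  # only the LUN bound to rg1
print("pool:", pooled)              # every LUN in the pool
```

In this sketch the same double fault in rg1 costs one LUN in the traditional layout and all four in the pool, which is the trade-off being questioned here.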

Rainer_EMC · Answer

Keep in mind that both VNX and Unity also include proactive copy (PACO), which will kind of pre-spare when the system thinks a disk could be starting to fail.

That reduces the time for a rebuild.

Plus, for EFDs there is more magic: we monitor how many cells in an EFD have been remapped and will alert long before the SSD wears out.

Unity will also include balancing between RAID groups to spread the wear for all-flash pools.

For more info I would suggest talking to your local EMC technical expert.
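
As a rough illustration of why a shorter rebuild window matters for double faults, the Python sketch below estimates the chance that a second drive in the same RG fails while the first is still being rebuilt. It assumes independent failures at a constant rate (an exponential model), and the RG size, annual failure rate, and window lengths are made-up example values, not EMC figures. Drives that wear out together, the concern raised earlier in the thread, would violate the independence assumption, so treat this as a lower-bound style estimate at best.

```python
import math


def second_failure_probability(drives_in_rg, annual_failure_rate, window_hours):
    """Probability that at least one of the remaining drives in the RG
    fails within the rebuild window, assuming independent failures at a
    constant rate. All inputs are illustrative assumptions."""
    hours_per_year = 365 * 24
    # Convert an annual failure probability into a per-hour failure rate.
    rate_per_hour = -math.log(1.0 - annual_failure_rate) / hours_per_year
    remaining = drives_in_rg - 1
    return 1.0 - math.exp(-remaining * rate_per_hour * window_hours)


# Hypothetical 5-drive RG with a 2% annual failure rate per drive.
p_long = second_failure_probability(5, 0.02, window_hours=24)
p_short = second_failure_probability(5, 0.02, window_hours=6)

print(f"24 h rebuild window: {p_long:.6f}")
print(f" 6 h rebuild window: {p_short:.6f}")
```

Under these assumptions the exposure shrinks roughly in proportion to the window, which is the benefit of copying data off a suspect drive before it actually fails.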
