1 Rookie
•
16 Posts
3
8102
June 30th, 2016 07:00
What is the best way to recover from a double fault on a VNX array Pool?
Hello,
I wanted to start a discussion here because I am a little confused.
We have a VNX5400 array that suffered a double fault on 2 EFD drives in one of the RAID groups of the pool.
It appears that recovery of that pool is limited to destruction and recreation.
Does that sound right to all of you?
It seems there should be some ways around this, e.g. forcing drives online like we did 15 years ago, or allowing the EFD drives to go read-only so the RAID group can be reconstructed. Or perhaps even allowing an RG to be removed from a pool (the opposite of expansion).
I feel like we are missing a technique or ability to recover from this event more cleanly.
What do you think?
GearoidG
251 Posts
0
July 5th, 2016 07:00
Hi Jason,
Currently Unity uses a very similar (if not identical) RAID technology to VNX2, called MCx. I think there are some whitepapers on support.emc.com discussing MCR/MCC/MCF.
I have heard rumblings of change but am unclear on the timing, and obviously future development cannot be discussed publicly.
So to answer your question: at this stage, yes, Unity is vulnerable to a double fault in the same RG.
Obviously we in engineering have methods to attempt to bring a drive back online, but a lot depends on what has happened to the drive. Needless to say, every Unity case is being scrutinized closely at present.
Hope that helps
Gearoid
Rainer_EMC
4 Operator
•
8.6K Posts
0
June 30th, 2016 13:00
The best way is to contact EMC support and let the recovery team figure it out.
JGroce213
1 Rookie
•
16 Posts
0
June 30th, 2016 14:00
Rainer,
Thank you so much for your response.
The good news is I did. However, they stated that the array would have to be put in recovery mode to delete the entire pool and recreate it from scratch. They said the downtime would be 6 hours.
I feel like there surely is a better way. I was hoping someone in the community would have some experience with a more graceful recovery from a double fault in an EFD pool.
Rainer_EMC
4 Operator
•
8.6K Posts
1
July 1st, 2016 14:00
Hi
I don't think there is anyone here in the community who can judge and fix this better than the recovery team from EMC support.
Rainer
brettesinclair
2 Intern
•
715 Posts
0
July 3rd, 2016 20:00
I agree with Rainer; this is always best handled by Support.
If you are unhappy with their response, you can also escalate or engage your TAM for further discussion.
I do, however, believe that they will offer you the most pragmatic way forward to get your pool back online.
GearoidG
251 Posts
1
July 4th, 2016 01:00
They are correct: downtime is required if the drives are unrecoverable. (This is due to the internal linkage between the internal pool file system and the RAID layout; not doing this usually causes a further extended outage.)
Usually the recovery team escalates cases to engineering if they believe there is a chance of recovering the drives.
This is usually done because there is no backup, but even if the recovery succeeds there is always some damaged data.
Regarding removing the damaged RG from the pool: no, this cannot be done, because of how the internal pool file system is laid across the drives when the pool is first created.
Trust me, this is a topic we discuss a lot internally, and we make enhancements whenever possible.
Best Regards
Gearoid
JGroce213
1 Rookie
•
16 Posts
0
July 5th, 2016 06:00
Gearoid,
Thanks so much for your response. I guess the next question we have in mind is: does Unity have this feature as well?
The only reason I ask is that, as we move towards all-EFD arrays, it seems the probability of 2 drives in a RAID group failing close to each other will increase over time.
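To put rough numbers on the concern above, here is a back-of-the-envelope sketch. It assumes drive failures are independent with a constant failure rate, which is exactly the assumption that shared wear in an all-flash RAID group would weaken, so read the result as an optimistic floor rather than a prediction. The function name, the 2% annualized failure rate, the 5-drive group, and the 8-hour rebuild window are all illustrative assumptions, not VNX figures.

```python
import math

HOURS_PER_YEAR = 24 * 365


def second_failure_probability(drives_in_group, annual_failure_rate, rebuild_hours):
    """Probability that at least one surviving drive fails during the rebuild.

    Assumes independent drives with a constant (exponential) failure rate.
    Correlated wear-out, as with same-age EFDs, would push the real value higher.
    """
    if drives_in_group < 2:
        raise ValueError("a RAID group needs at least two drives")
    # Convert the annualized failure rate into a per-hour hazard rate.
    hourly_rate = -math.log(1.0 - annual_failure_rate) / HOURS_PER_YEAR
    survivors = drives_in_group - 1
    return 1.0 - math.exp(-survivors * hourly_rate * rebuild_hours)


if __name__ == "__main__":
    # Hypothetical inputs: 5 drives, 2% AFR, 8-hour rebuild.
    p = second_failure_probability(5, 0.02, 8.0)
    print(f"P(second failure during rebuild) = {p:.6f}")
```

Under this model the risk grows with both the rebuild window and the number of surviving drives, which is why shortening the rebuild matters so much.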
There seems to be some opportunity here to add additional features. For example, why is an EFD completely unreadable when it goes offline? It makes sense that you can only write to the drive a certain number of times, but just because the last write has happened (if it did in this case), why can't the data that was on the drive be read? If this were added, you should be able to recover the pool that has the double-faulted RG in it.
Or, as we talked about before, allow RGs to be removed from pools.
As the current feature set stands, losing 2 drives in one RG takes down the entire pool, which seems far worse than the old days when a double-faulted RG only took down the LUNs that were in it. (I can see why you are having internal discussions.)
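The difference in blast radius described above can be illustrated with a toy model. It is a deliberate simplification: it assumes every pool LUN has slices on every RAID group in the pool, whereas the thread attributes the real behaviour to the internal pool file system being laid across all drives. The LUN and RG names and the affected_luns helper are hypothetical.

```python
def affected_luns(lun_to_groups, failed_group):
    """Return the LUNs that have at least one extent on the failed RAID group."""
    return sorted(
        lun for lun, groups in lun_to_groups.items() if failed_group in groups
    )


# Traditional layout: each LUN is bound to exactly one RAID group.
traditional = {"lun0": {"rg0"}, "lun1": {"rg1"}, "lun2": {"rg2"}}

# Pool layout (simplified assumption): every LUN has slices on every RAID group.
pool_groups = {"rg0", "rg1", "rg2"}
pooled = {name: set(pool_groups) for name in ("lun0", "lun1", "lun2")}

print(affected_luns(traditional, "rg1"))  # ['lun1']
print(affected_luns(pooled, "rg1"))       # ['lun0', 'lun1', 'lun2']
```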
So back to the spin-off question: how does Unity handle double-faulted RAID groups?
Thank you again for your and everyone's assistance with this. I feel like there is a better way out there.
Jason Groce
Rainer_EMC
4 Operator
•
8.6K Posts
0
July 5th, 2016 08:00
Keep in mind that both VNX and Unity also include proactive copy (PACO), which effectively pre-spares a disk when the system thinks it could be starting to fail.
That reduces the rebuild time.
Plus, for EFDs there is more magic: we monitor how many cells in an EFD have been remapped and alert long before the SSD wears out.
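As a rough illustration of that kind of wear monitoring, the sketch below flags drives whose remapped-cell count has used up a chosen fraction of their spare cells. It is not how the array implements it: the DriveWear record, the slot names, the counts, and the 50% threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DriveWear:
    slot: str            # hypothetical slot label
    remapped_cells: int  # cells already remapped to spares
    spare_cells: int     # total spare cells available for remapping


def drives_needing_attention(drives, alert_fraction=0.5):
    """Return slots whose remapped cells have reached alert_fraction of the spares."""
    flagged = []
    for drive in drives:
        # A drive reporting no spares at all is flagged unconditionally.
        if drive.spare_cells <= 0:
            flagged.append(drive.slot)
        elif drive.remapped_cells / drive.spare_cells >= alert_fraction:
            flagged.append(drive.slot)
    return flagged


if __name__ == "__main__":
    sample = [
        DriveWear("0_0_4", remapped_cells=120, spare_cells=1000),
        DriveWear("0_0_5", remapped_cells=640, spare_cells=1000),
    ]
    print(drives_needing_attention(sample))  # ['0_0_5']
```

Alerting at a fraction well below 100% leaves time to copy the data off while the drive is still readable.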
Unity will also include balancing between RAID groups to spread the wear in all-flash pools.
For more info, I would suggest talking to your local EMC technical expert.