Unsolved
JasonCheung
3 Posts
0
April 15th, 2021 16:00
SAN Array member stuck in restart 0%
Hi,
I'm new to SAN. Yesterday I restarted a member SAN and it has been stuck at restart (stays at 0%) for more than a day. The member is configured as RAID 5 and has 3 failed drives, and it stays offline, which means all my volumes are offline as well. What can I do to bring this member back online?
Is replacing the failed member the only way? Can I wipe the member and reconstruct it with less storage space? I don't want to affect the other members.
Thanks,
Jason



DELL-Sam L
Moderator
•
7.7K Posts
0
April 16th, 2021 10:00
Hello Jason,
Which EqualLogic systems do you have? What firmware versions are currently on your systems? How many members are in the same group?
dwilliam62
4 Operator
•
1.5K Posts
0
April 17th, 2021 13:00
Hello,
If the volumes are offline then you must have multiple members in the same pool. If you reset that member array the data from that member will be lost and those volumes will never recover. So you would loose all the data they have in common. When you have multiple PS arrays in the same pool, the data is STRIPED, RAID0 across them. There is no redundancy. Unless you have a backup of all your data, you will need to get that member back online.
Because RAID 5 is more vulnerable to double faults, it is no longer recommended for production storage; it was also removed as an option from the GUI.
At the CLI of the member that has the problem, please run:
GrpName>support exec "raidtool"
GrpName>support exec "uname -a"
GrpName>show members
If the RAID set is faulted beyond recovery, then you will likely have to send the drives out to a third-party data-recovery company. They will try to clone them and hopefully allow a rebuild to finish.
Re: smaller capacity. The only supported configurations for EQL arrays are half-populated and fully populated drive configurations.
Note: Depending on the array model, you will likely have to purchase EQL-qualified drives. They are made differently from OEM or other Dell drives. If you don't, they will not be recognized by the array and will not be usable.
Regards,
Don
JasonCheung
3 Posts
0
May 6th, 2021 11:00
Hi,
I'm using one each of a PS5000E, PS5000X, and PS6000. The firmware is v7.1.8 (R414953). Each of these is a member of the group.
Thanks,
Jason
dwilliam62
4 Operator
•
1.5K Posts
0
May 6th, 2021 12:00
Hello Jason,
There's a lot to unpack here.
They will stay degraded for as long as all the remaining drives stay healthy. If either the RAID 6 or RAID 10 member with failed drives goes down, you will lose access to all volumes, as they are striped (RAID 0) across the members. In the case of the RAID 6 member you are out of parity, so the next drive that fails will take that member down. On the RAID 10 member you have two drives without a mirror pair, so if either of those two fails, that member goes offline. To put it bluntly, you are on borrowed time. Replacing those failed drives is a priority. Rebuilds are disk intensive, so you might have another drive fail during the process. Backing up your data now is priority one.
Rebuild time depends on the size and speed of the drives and on the I/O load; the lower the I/O load, the better. RAID 6, being dual parity, will take longer. RAID 10 is pretty quick, especially with 10K or 15K RPM drives.
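As a very rough back-of-envelope (the drive sizes and rebuild rates below are assumed examples for illustration, not your actual hardware), best-case rebuild time is roughly the drive capacity divided by the rebuild rate:

# Rough estimate only -- real rebuild times vary widely with controller load,
# drive health, and host I/O. The sizes and rates are assumed examples.
def rebuild_hours(drive_tb, rebuild_mb_per_s):
    """Naive best-case time to re-write one replacement drive."""
    return (drive_tb * 1_000_000) / rebuild_mb_per_s / 3600

print(f"1 TB SATA  @ ~60 MB/s : ~{rebuild_hours(1.0, 60):.1f} h")
print(f"450 GB 15K @ ~120 MB/s: ~{rebuild_hours(0.45, 120):.1f} h")

Under real host I/O, and with dual-parity computation on a RAID 6 set, expect it to take several times longer than that best case.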
Re: shrinking. That is not possible. You can't reduce the size or select the number of spares; spares are set by the RAID policy. RAID 6 has one spare; RAID 50 and RAID 10 have two.
Re: drives. They have to be EQL-specific drives. A standard OEM drive or a Dell non-EQL drive will NOT work in a PS Series array; the drives are built specifically for it, and non-EQL drives will be rejected by the firmware.
Regards,
Don
JasonCheung
3 Posts
0
May 6th, 2021 12:00
Hi,
Thanks Dell-Sam L and dwilliam62 for the help. Reading my question again now, it doesn't really make sense; I guess I was rushing to finish things off before my vacation.
I turned the arrays off during my vacation. Now (2 weeks later), after powering up, all 3 members are miraculously online and all my volumes are up and accessible. Now I have to deal with the degraded RAID issue caused by the failed HDDs.
Yes, there are 3 members and 5 volumes, and all volumes are spread over all 3 members. It was sold as-is by my company instead of going to recycling. They don't provide after-sales support.
I actually got it wrong: it's a mix of RAID 6 (2x) and RAID 10 (1x) setups, and there are 3 failed HDDs in one member (RAID 6) and 2 in another (RAID 10).
I have quite a lot of data on the HDDs, mostly deployment images and virtual machines that I play with personally, and I would rather keep it than have to transfer or lose it.
So my questions regarding the failed HDDs are:
I know some of my thoughts may be silly or may not exist in the SAN world, but I appreciate any comments / suggestions.
Thanks,
Jason