
May 26th, 2013 16:00

How to recover from a multiple-drive fault situation on a CX4

Hello,

I have an older CX4-240c single-DAE array which had sat dormant until recently, when we decided to put it to use again. It was found to have four bad drives, amber lights on each: drives 4, 5, 7 and 14 in the DAE (drive 14 was the hot spare). The single Linux host connected to the array cannot see or mount the special device /etc/emcpower1a, which has the LUN containing three of the bad drives. The other two LUNs presented to the host are mountable.

We located some replacement drives and replaced the four bad drives; three of them came up green, but disk 5 stayed amber. In Unisphere it had an alert saying that a disk had been removed and to please replace the original bad drive, serial number xxxxxx. I put the original bad drive number 5 back in and rebooted both SPB and SPA. Afterwards drive 5 is still faulted, along with the LUN. This time disk 5's alert in Unisphere indicates it is still faulted and that a disk has been removed, but the message about putting in the previous serial number doesn't appear. I then put the new replacement for drive 5 back in, but I get the same fault.

I'm not so concerned with the data that was on the array, but am trying to work out why the disk fault in this particular slot has not gone away. I could try yet another new drive in the slot, but I doubt the current new drive is bad.

Is swapping the new drive with another currently green-lit drive in the same LUN (or perhaps the hot spare) to see if the fault moves a good idea that won't cause further issues?

How about power cycling the array itself?

Perhaps unbinding the LUN and trying to recreate it? Or will that even work with an active fault?

Are there any particular logs in SPCollect that might point to a solution?

Any further debugging suggestions would be much appreciated. Thanks.


4 Operator • 4K Posts

May 27th, 2013 01:00

"The single Linux host connected to the array can not see or mount the special device /etc/emcpower1a, which has the LUN containing  three of the bad drives."  ——It's a "double-fault" case. What's the RAID type? You should have lost the data on this LUN.

If you want to recover as much of the data as possible, get the SPCollect logs and then call EMC support. If you just want to recover the storage from the faulted situation, replace all the faulted disks, destroy the affected RAID groups and LUNs (not all of them, just the affected ones), and then re-create them.
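If Navisphere Secure CLI is available, a sketch along these lines can drive either path. This is only a sketch: the SP address, LUN numbers, and RAID group ID are placeholders, and you should verify each switch against the CLI guide for your FLARE release before running anything destructive.

```python
# Sketch of the two recovery paths via naviseccli. Assumes Navisphere
# Secure CLI is installed and a security file / credentials are set up.
# The SP address, LUN numbers and RAID group ID are PLACEHOLDERS;
# check every switch against your FLARE release's CLI guide first.
import subprocess

SP = "10.0.0.1"  # hypothetical SP A management address

def navi(*args):
    cmd = ["naviseccli", "-h", SP, *args]
    print(">>>", " ".join(cmd))
    return subprocess.run(cmd, capture_output=True, text=True).stdout

# Path 1: preserve evidence for EMC support.
navi("spcollect")                    # start an SPCollect on this SP
print(navi("managefiles", "-list"))  # the resulting zip shows up here

# Path 2: accept the data loss and rebuild (DESTRUCTIVE).
for lun in (0, 1):                   # hypothetical LUN numbers in the RG
    navi("unbind", str(lun), "-o")   # -o suppresses the confirmation
navi("removerg", "1")                # hypothetical RAID group ID
# ...then re-create, for example:
# navi("createrg", "1", "0_0_4", "0_0_5", "0_0_6", "0_0_7")  # B_E_D disk IDs
# navi("bind", "r1_0", "0", "-rg", "1")
```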

7 Posts

May 27th, 2013 02:00

The RAID type is 1/0. Will the "double fault" case cause the LED for a particular DAE slot to stay amber regardless of whether a good or bad drive is inserted? Thanks.

4 Operator • 4K Posts

May 27th, 2013 02:00

"We located  some replacement drives" ——Are there all new disks? If not, please refer to EMC KB emc251613 "Swapping drives between CX3 and CX4 series arrays can result in data loss" to swap the disks. You can search the KB on support.emc.com

7 Posts

May 27th, 2013 06:00

Hello,

The replacement drives were all new. The only difference is that they are larger in capacity (600GB vs. 300GB) and slower (10K vs. 15K RPM) than the originals. My understanding is that while not optimal performance-wise, this alone should not cause basic functional issues.
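For what it's worth, the commonly cited CLARiiON replacement rule of thumb (same drive type, capacity at least that of the original; a slower spindle costs performance, not function) can be written down as a quick check. Treat this as the thread's working assumption rather than an official compatibility matrix:

```python
# Quick sanity check for the commonly cited CLARiiON replacement rule:
# same drive type, capacity at least that of the original; a slower
# spindle degrades performance but is not by itself a hard fault.
# (This encodes the thread's working assumption, not EMC gospel.)
from dataclasses import dataclass

@dataclass
class Drive:
    drive_type: str   # e.g. "FC", "SATA"
    capacity_gb: int
    rpm_k: int        # spindle speed in thousands of RPM

def acceptable_replacement(original: Drive, new: Drive) -> bool:
    return (new.drive_type == original.drive_type
            and new.capacity_gb >= original.capacity_gb)

original = Drive("FC", 300, 15)     # the failed 300GB 15K FC modules
replacement = Drive("FC", 600, 10)  # the new 600GB 10K drives

print(acceptable_replacement(original, replacement))  # True: functional,
# though the 10K spindles will drag down the RAID group's performance.
```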

4.5K Posts

May 27th, 2013 12:00

If you can still access the array, it would be helpful if you could describe the RAID type of the failed disks, which RAID group each disk is assigned to, and the type of each disk.

The fault light on the DAE for a specific disk after replacing it can be the result of a number of issues. There were issues in older FLARE versions where you had to power the DAE off and on to get some faults to clear once you installed a new disk. Also, multiple failed disks in the same RAID group can cause issues. The best way to resolve this is to replace the faulted disks with new disks, power the DAE off and on, then in Navisphere delete the LUNs and RAID group and start over. Do this ONLY if you do not need to recover the data.
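If it helps to gather that information in one pass, something like the following could dump it. The SP address is a placeholder; getdisk, getrg, and getcrus are standard naviseccli commands, though the exact fields printed vary by FLARE release.

```python
# Dump the per-disk, per-RAID-group and enclosure state asked about
# above. Assumes Navisphere Secure CLI is installed; the SP address
# is a placeholder. Output formats vary with the FLARE release.
import subprocess

SP = "10.0.0.1"  # hypothetical SP management address

for command in ("getdisk", "getrg", "getcrus"):
    out = subprocess.run(["naviseccli", "-h", SP, command],
                         capture_output=True, text=True)
    print(f"===== {command} =====")
    print(out.stdout)
```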

glen

7 Posts

May 27th, 2013 18:00

All the failed drives except one were in the same RAID group, and the RAID type was 1/0. The one in a different RAID group was the hot spare, which was also 1/0. All the failed drives were 300GB 15K 2/4Gb FC disk modules.

Thank you. I will try the DAE power off/on tomorrow.

7 Posts

May 28th, 2013 20:00


I replaced the bad drive with a good replacement and power cycled the DAE; when it came back up, the drive in slot 5 faulted again, asking for the drive with the original (bad) serial number to be put back in. In addition, another drive, in slot 10, which had previously shown no error, went bad. I put new drives in slots 5 and 10 and power cycled, and this time ANOTHER drive (slot 3) came up as faulted. At this point I have accepted that the data is gone and just want to get back to all green disks so I can recreate the faulted LUN. I know all these new drives aren't bad. Does anyone have any ideas as to what is going on? Thanks.

1.4K Posts

May 29th, 2013 05:00

If I understood this correctly, after power cycling the DAE a different disk slot shows as faulted, right? You may try a cold reseat: power off the DAE and, once the disks have spun down, remove one disk, let it cool for 2-3 minutes, and then re-insert it. This might sound silly, but for the disks which failed originally you could also try the refrigeration method, where you let a disk cool down to a certain temperature and then insert it back into the right slot to check whether it shows a green LED (in an orderly manner, of course: reseat/re-insert the disk which faulted last first). If every step fails, and since you mentioned you have accepted the data loss and just want to see everything green:

- Power off the DAE

- Cold reseat the disks which show an amber LED

- Power on the DAE; if you still see an amber LED, replace those disks with new ones (you might want to destroy the RAID group)

- Reboot both SPs if the new disks are still recognized as faulted.

7 Posts

May 30th, 2013 05:00

Thanks, I was able to get one of the drives to go green using this method (the last one that failed). The other two still go amber whether the original or new drives are inserted; at least they did the first time I tried this. Also, when the new drive is inserted, an alert is generated saying that the original drive has been removed and asking for the original serial number to be put back in, which then, as mentioned, also fails. I will try various permutations a few more times.

1.4K Posts

May 30th, 2013 06:00

You're welcome! That's great news! Let's hope for the best!
