Unsolved

October 8th, 2014 03:00

Dell EqualLogic PS6500e will not complete reconstruction

I am having BIG issues with a Dell EqualLogic PS6500e unit. It has 48 drives fitted in a RAID 50 configuration, with 2 hot spares. The problem started when 2 drives failed. Both were replaced by the hot spares, so the hot spare count went to 0. The following morning another disk (disk 23) failed and the entire array became unresponsive. The strange thing is that disk 23 is NOT showing as failed in the Dell Group Manager web UI, even though the drive itself had an orange light.

After I re-seated the drive a rebuild started, but 24 hours later the array became unresponsive again, with disk 23 once more showing an orange light (but not failed or failing in the web UI).

OK, so we now have 3 new replacement drives onsite: 2 to replace the first batch of faulty drives (whose place is currently taken by the hot spares) and 1 to replace disk 23.

I have successfully removed both of the failed drives (marked as failed in the web UI) and replaced them with new drives, which are now marked as spares in the web UI. So I now have 2 hot spares again and just have the suspect disk 23 to replace.

This is where the fun really starts. If I remove disk 23 and replace it with a new drive, the array will not start, and if I leave slot 23 empty it won't boot either. So the only way to bring the array online is to put the old disk 23 back in, which of course brings the entire array offline again 24 hours later.

I'm not 100% sure why this is happening, but my current theory is that the EqualLogic unit does not recognise that disk 23 is faulty (although the drive physically shows as faulty with an orange light), and therefore it is not failing over to a hot spare and is instead causing the entire array to hang. My gut feeling is that this may be fixed by a firmware upgrade, but I neither have a support contract nor feel brave enough to try this whilst the array is in a bad state.

Any ideas or suggestions on how I might move forward with this are VERY much welcome, as it's causing a real issue that is affecting a large team of developers.

October 8th, 2014 08:00

Hi Don, firstly thanks for responding to my post, appreciated! I don't quite follow what you mean by "double failure in the single raidset", as we are now at a stage where only 1 drive is in a bad state, that being disk 23. So I don't believe we have had a double failure in a single RAID 5 set (RAID 50 being constructed from multiple RAID 5 sets).
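To spell out how I understand RAID 50 fault tolerance, here is a minimal Python sketch. It assumes the array survives as long as no single RAID 5 set has more than one failed member. The disk-to-set layout is made up for illustration and is not the actual PS6500e layout, and `raid50_survives` is a hypothetical helper, not anything from the EqualLogic tooling.

```python
from collections import Counter


def raid50_survives(failed_disks, disk_to_set):
    """Return True if no RAID 5 set has more than one failed member.

    failed_disks: iterable of disk numbers considered failed.
    disk_to_set:  dict mapping disk number -> RAID 5 set index (hypothetical).
    """
    failures_per_set = Counter(disk_to_set[disk] for disk in failed_disks)
    return all(count <= 1 for count in failures_per_set.values())


# Hypothetical layout for illustration only: 8 disks in two RAID 5 sets.
# Disks 0-3 belong to set 0, disks 4-7 belong to set 1.
layout = {disk: disk // 4 for disk in range(8)}

print(raid50_survives({1}, layout))     # one failure in set 0
print(raid50_survives({1, 5}, layout))  # one failure in each set
print(raid50_survives({1, 2}, layout))  # two failures in the same set
```

The first two calls print True and the third prints False, which is what I take "double failure in the single raidset" to mean.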

As my post detailed, the odd behaviour is that the disk is NOT marked as "history of failures" in the web interface (Dell Group Manager) but DOES keep failing, with an orange light on the drive, during the reconstruction process (I don't know how far along in the process it fails, as it happens overnight).

So unless I'm missing something, surely disk 23 should be failing over to one of the spares? Instead, the behaviour I'm seeing is that the entire array stops responding. Removing the disk does not cause it to fail over to a spare either, and I cannot replace disk 23 with a new drive, as again the array won't start.

In answer to your question regarding firmware, we currently do not have a support contract with Dell and therefore cannot get hold of the firmware update.

Many thanks!

Paul.
