May 27th, 2014 07:00

Two of my CX4 systems get stuck in "hot spare replacing" status

I have two CX4 systems, NS-480s actually. One runs the latest code with Unisphere; the other is one rev back from that. They have 400GB SSD, 300GB FC and 450GB FC 15K drives. I imagine we've had them for 4+ years and they are out of warranty and contract. These systems were provided to us by EMC for development use. I'm trying to keep them running for a couple more projects before we replace them with new VNX II systems. In the meantime I've had a problem that not even support has ever successfully helped me with: just a huge question mark and hoping the issue would not pop up again. Well, now with aging drives the array is doing this all the time, mostly with the 450GB 15K drives, because they seem to be the most unreliable drives in our whole datacenter. I thought maybe it was related to their Seagate model or some inherent bug, but right now I have this problem occurring again on one of the 450GB 15K CLAR450 drives and also on one of the regular 300GB FC drives, which in the past have been very reliable.

SYMPTOM: A drive goes bad, fails, faults and shows a removed status. The hot spare disk shows ENABLED and a status of replacing or servicing the drive slot used by the original failed drive. They just hang there forever, as if the rebuild is stuck and cannot kick off. Instead of the pool or RG showing the "T" transitioning rebuild state, they show the "F" red fault icon forever. You can wait a week and the rebuild will not complete.
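For what it's worth, when this happens I now sanity-check the disk states from the CLI instead of trusting the GUI icons. Below is a minimal Python sketch that shells out to naviseccli (assuming it is installed and authenticated via a security file); the SP address is a placeholder and the "OK" state strings are an assumption from memory, so adjust them to whatever your own getdisk output shows.

```python
import subprocess

SP = "10.0.0.1"  # placeholder SP management address, replace with yours

# Disk states that normally need no attention; treat this set as an
# assumption and extend it to match your own getdisk output.
OK_STATES = {"Enabled", "Hot Spare Ready", "Unbound", "Empty", "Ready"}

def getdisk_raw(sp=SP):
    """Return the raw output of 'naviseccli getdisk -state'.

    Assumes naviseccli is on the PATH and credentials come from a
    security file; add -User/-Password/-Scope if yours does not.
    """
    result = subprocess.run(
        ["naviseccli", "-h", sp, "getdisk", "-state"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def unusual_disks(raw):
    """Yield (slot, state) for any disk whose state is not in OK_STATES."""
    slot = None
    for line in raw.splitlines():
        line = line.strip()
        if line.startswith("Bus") and "Disk" in line:
            slot = line
        elif line.startswith("State:") and slot is not None:
            state = line.split(":", 1)[1].strip()
            if state not in OK_STATES:
                yield slot, state

if __name__ == "__main__":
    for slot, state in unusual_disks(getdisk_raw()):
        print(f"{slot}: {state}")
```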

WORKAROUND (PAINFUL) - If I disable write cache for safety reasons and then do a hard reboot / power cycle of the SP enclosure, the disks instantly go into the TRANSITIONING rebuild status once the system reboots and the SPS power supplies stabilize again. After the disks rebuild, the system is healthy again and passes all diagnostics. Then weeks or a month later, when a drive fails again, the same problem repeats and you have to schedule yet another outage to reset the SP enclosure. It doesn't seem to require resetting the disk enclosures, but I've reset them before along with the SPE.
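In case anyone wants to script that maintenance window, this is roughly what I do before the SPE power cycle, wrapped in Python so it can sit in our runbook repo. The setcache -wc switch is how I remember disabling write cache on FLARE 30, so verify it against your own CLI help before trusting it, and the SP address is again a placeholder.

```python
import subprocess

SP = "10.0.0.1"  # placeholder SP management address

def navi(*args, sp=SP):
    """Echo and run a naviseccli command against one SP."""
    cmd = ["naviseccli", "-h", sp, *args]
    print("running:", " ".join(cmd))
    subprocess.run(cmd, check=True)

def disable_write_cache():
    # Disable SP write cache so dirty pages get flushed before the hard
    # reboot / power cycle.  '-wc 0' is an assumption from memory; check
    # your naviseccli setcache help text first.
    navi("setcache", "-wc", "0")

def enable_write_cache():
    # Re-enable write cache once both SPs are back and the SPS units
    # report a normal state again.
    navi("setcache", "-wc", "1")

if __name__ == "__main__":
    disable_write_cache()
    # ... power cycle the SPE, wait for the rebuilds to show
    # TRANSITIONING and complete, then call enable_write_cache().
```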

THINGS TRIED - I have tried to update the disk firmware, which only identifies upgrades for the 400GB SSD drives, and those are not having an issue. I cannot locate any advisories for this issue. I submitted SPcollects and other info to EMC support a year or more ago with no problem found, other than a disk failing and just staying in the state I described forever until you perform a HW reset. I'm not aware of any service mode settings that could cause this. I've run the backend bus reset wizard, but nothing is wrong back there.

Has anyone ever seen this issue before? I've been working on these since Data General days and never seen this issue. It's baffling.

CODE is 4.30.00.5.525.... I have tried to download the 5.26 but get download errors for some reason. I get no issues using this same USM to download code for any other system, so I gave up trying. I don't know if 5.26 addresses any issues with disk replacements.

Thanks

D.

4 Operator • 4K Posts

May 28th, 2014 02:00

Stuck in "Equalizing" or "Requested Bypass"? Did you check the SPCollects logs?

You mentioned you'd been working on this since DGC, so I think you might know how to check the SP logs.
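If pulling a full SPcollect is a pain, you can also dump the SP event log straight from the CLI and filter it. A rough Python sketch is below; the SP address is a placeholder, and the filter terms are just the phrases that usually show up around sparing events, so tweak as needed.

```python
import re
import subprocess

SP = "10.0.0.1"  # placeholder SP management address

# Phrases that typically show up around sparing/rebuild activity;
# adjust the list to taste.
PATTERN = re.compile(
    r"proactive|hot spare|rebuild|equaliz|faulted|drive removed",
    re.IGNORECASE,
)

def sp_log(sp=SP):
    """Dump the SP event log with 'naviseccli getlog' and return the text.

    Plain getlog can return a lot of lines; filtering afterwards avoids
    relying on switches that differ between FLARE revisions.
    """
    result = subprocess.run(
        ["naviseccli", "-h", sp, "getlog"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

if __name__ == "__main__":
    for line in sp_log().splitlines():
        if PATTERN.search(line):
            print(line.rstrip())
```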

6 Posts

May 28th, 2014 04:00

A few questions:

  1. Do you see this symptom on both boxes?
  2. Is this problem associated with any specific hot spare drive?

As Roger said, pull the SPcollects from that time and check the LUN status from when your drive failed and a hot spare got swapped to it. Does it show REB with some % value?
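If you would rather not dig through the whole SPcollect, something like the following (Python around naviseccli) pulls the rebuilt percentage for the disks involved. The slot IDs below are placeholders, and the -rb switch plus the "Prct Rebuilt" field name are how I remember them, so fall back to a plain getdisk listing if they are not accepted.

```python
import subprocess

SP = "10.0.0.1"                 # placeholder SP management address
SLOTS = ["0_2_14", "0_3_0"]     # placeholder bus_enclosure_disk slots

def rebuild_percent(slot, sp=SP):
    """Print the 'Prct Rebuilt' lines reported for one disk slot."""
    result = subprocess.run(
        ["naviseccli", "-h", sp, "getdisk", slot, "-rb"],
        capture_output=True, text=True, check=True,
    )
    for line in result.stdout.splitlines():
        if "Prct Rebuilt" in line:
            print(f"{slot}: {line.strip()}")

if __name__ == "__main__":
    for slot in SLOTS:
        rebuild_percent(slot)
```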


110 Posts

May 28th, 2014 06:00


Maybe silly, but have you ever tried removing those disks (spare and faulty) from their slots, keeping them out for 10 to 20 seconds, and putting them back?

Your troubleshooting covers the high-level points but not the low-level ones. Did you ever try that?

You need to consider the risk of data loss when doing this.

1 Rookie • 15 Posts

May 28th, 2014 06:00

Thanks, yes, I scoured the logs and I think I may be in better shape than I first thought. The symptoms I saw on the sister array may not be exactly what I'm seeing now on this NS-480 with Unisphere / FLARE 30. I did scour every entry of the SP events for the last several weeks. The piece I may have missed is the proactive copy; this is something I may not fully understand.

In general, the GUI is in a specific state right now with regard to the disk failure in Pool0 and the other disk failure in Pool1. Both appear to have identical status right now, which should be easy to confirm. Please let me know if this sounds correct and what actions we need to take. It would be appreciated.

Pool0 - status is Ready under Properties. It lists the failed disk B0E2D13 with 0.00 capacity and a removed status. It does not list the hot spare B0E2D14 that I believe has proactively replaced the bad one. Under System > Hardware we still see the removed status for B0E2D13 and the red faulted "F" symbol. I have to assume the array is waiting on a customer replacement of this failed drive to clear the fault and copy the data back to the original slot?

Pool1 - the exact same symptom for the failed 450GB 15K disk in B0E4D4, with hot spare B0E3D0.
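For the record, here is how I have been double-checking the pool state from the CLI rather than the GUI icons. The storagepool -list command is how I remember pools being listed on the FLARE 30 naviseccli, and the field names I grep for are an assumption, so adjust if your output differs (the SP address is a placeholder).

```python
import subprocess

SP = "10.0.0.1"  # placeholder SP management address

def pool_summary(sp=SP):
    """Print the name and state/status lines for each storage pool.

    'storagepool -list' is an assumption from memory for the FLARE 30
    CLI; the same information is in Unisphere under Storage > Storage
    Pools if the switch is not accepted.
    """
    result = subprocess.run(
        ["naviseccli", "-h", sp, "storagepool", "-list"],
        capture_output=True, text=True, check=True,
    )
    for line in result.stdout.splitlines():
        stripped = line.strip()
        if stripped.startswith(("Pool Name", "State", "Status")):
            print(stripped)

if __name__ == "__main__":
    pool_summary()
```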

In both cases the SP logs appear to leave this kind of paper trail:

2014-05-24 09:37 Proactive copy is completed on Proactive spare B0E4D4 (this is the failed disk)

2014-05-24 09:37 Internal information only. FLU RAID protection is degraded.

2014-05-24 09:37 Unit Shutdown B0E3D0 (this is the hot spare 450GB 15K)

2014-05-24 09:37 Internal information only. Drive removed, seen by either SP, ack by both SPs completes.B0E4D4

2014-05-24 09:37 LUN 60060160dbd02400:7c390c929e4be311 has detected a fault but will continue to service IO

2014-05-24 09:37 Internal information only. FLU RAID protection is degraded.

2014-05-24 09:37 RAID protection for Storage Pool Pool 1 is degraded. Please resolve any hardware problems.

2014-05-24 09:37 Internal information only. Peer Requested Drive Power Down.B0E4D4 (the failed drive)

2014-05-24 09:37 Internal information only. Drive removed, seen by peer SP. Removed either physically, via software

2014-05-24 09:37 CRU Ready B0E4D4 (the failed disk)

2014-05-24 09:37 Internal information only. CRU ready. Logging eight character CRU serial number. B0E3D0 (hot spare)

2014-05-24 09:37 A rebuild has been started for the database region of a FRU B0E4D4 (failed disk)

2014-05-24 09:37 A rebuild has completed for the database region of a FRU B0E4D4

2014-05-24 09:37 Internal information only. RAID protection upgraded for FLU.

2014-05-24 09:37 LUN 60060160dbd02400:14e3ab6a9e4be311 is ready to service IO.

2014-05-24 09:37 Internal information only. RAID protection upgraded for FLU.

2014-05-24 09:37 All rebuilds for a FRU have completed

2014-05-24 09:37 RAID protection has been upgraded for Storage Pool Pool 1.

2014-05-24 09:37 Disk (Bus 0 Enclosure 4 Disk 4) is faulted. See alerts for details.
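In case it helps anyone reading along, this is the little snippet I used to group a trail like the one above by disk slot, so the proactive copy sequence for each drive reads as one block. It is plain text munging, nothing array-specific, and only a few of the lines are pasted in as an example.

```python
import re

# A few of the entries quoted above, pasted verbatim; in practice paste
# in the whole trail.
LOG = """\
2014-05-24 09:37 Proactive copy is completed on Proactive spare B0E4D4
2014-05-24 09:37 Unit Shutdown B0E3D0
2014-05-24 09:37 A rebuild has been started for the database region of a FRU B0E4D4
2014-05-24 09:37 A rebuild has completed for the database region of a FRU B0E4D4
2014-05-24 09:37 All rebuilds for a FRU have completed
2014-05-24 09:37 RAID protection has been upgraded for Storage Pool Pool 1.
2014-05-24 09:37 Disk (Bus 0 Enclosure 4 Disk 4) is faulted. See alerts for details.
"""

# Matches both the short BxEyDz form and the long "Bus x Enclosure y Disk z" form.
SLOT = re.compile(r"B\d+E\d+D\d+|Bus \d+ Enclosure \d+ Disk \d+")

def events_by_slot(log):
    """Group log lines by the disk slot they mention."""
    grouped = {}
    for line in log.strip().splitlines():
        match = SLOT.search(line)
        key = match.group(0) if match else "(no slot mentioned)"
        grouped.setdefault(key, []).append(line)
    return grouped

if __name__ == "__main__":
    for slot, lines in events_by_slot(LOG).items():
        print(f"== {slot} ==")
        for line in lines:
            print("  " + line)
```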

4.5K Posts

May 28th, 2014 14:00

I believe the proactive hot sparing is working: the original disk fails or is marked for replacement, and the hot spare kicks in and replaces it, copying the data either from RAID parity (a rebuild) or, if the disk is still accessible, directly from the disk (an equalize). Once the hot spare is fully replacing the failed disk (right-click on the hot spare disk and look at Properties), you replace the failed disk.

glen

6 Posts

May 29th, 2014 05:00

I agree with Glen; the proactive copy is working fine.

It seems your drive was going bad and FLARE detected it and started a proactive copy (I am assuming this because I don't see logs from the time the proactive copy started; a user may also have started it manually). As part of this operation, the proactive copy starts and completes successfully on the hot spare.

2014-05-24 09:37 Proactive copy is completed on Proactive spare B0E4D4 (this is the failed disk)

Because the proactive copy has completed, your proactive candidate (the bad drive) is marked as faulted.

2014-05-24 09:37 Disk (Bus 0 Enclosure 4 Disk 4) is faulted. See alerts for details.

You may safely replace this drive now.

Once the faulty drive is replaced, an equalize of the data from the hot spare back to the new drive will start, and once the equalization completes your RAID group will be back to its normal state.
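If you want to watch that equalize from the CLI after swapping the drive, a small polling loop like the one below works. The SP address and slot are placeholders, and the State / Prct Rebuilt field names are assumptions based on what getdisk normally prints, so adjust them to your output.

```python
import subprocess
import time

SP = "10.0.0.1"          # placeholder SP management address
REPLACED_SLOT = "0_4_4"  # placeholder bus_enclosure_disk of the new drive

def disk_state(slot, sp=SP):
    """Return (state, prct_rebuilt) for one disk slot from getdisk output."""
    result = subprocess.run(
        ["naviseccli", "-h", sp, "getdisk", slot],
        capture_output=True, text=True, check=True,
    )
    state = pct = None
    for line in result.stdout.splitlines():
        stripped = line.strip()
        if stripped.startswith("State:"):
            state = stripped.split(":", 1)[-1].strip()
        elif "Prct Rebuilt" in stripped:
            pct = stripped.split(":", 1)[-1].strip()
    return state, pct

if __name__ == "__main__":
    # Poll every 5 minutes until the replaced slot reports Enabled,
    # i.e. the equalize from the hot spare has finished.
    while True:
        state, pct = disk_state(REPLACED_SLOT)
        print(f"{REPLACED_SLOT}: state={state}, rebuilt={pct}")
        if state == "Enabled":
            break
        time.sleep(300)
```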

You may want to read the EMC CLARiiON Global Hot Spares and Proactive Hot Sparing white paper.
