Unsolved
January 22nd, 2014 12:00
CX4-240C Disk fault error and SP-B unmanaged condition
We have a single Linux host connected to a single CX4-240 array (one DAE, plus the SPE and SPS). The array has 15 drives: 1 hot spare (drive 14), 4 drives forming LUN 0 (drives 0 through 3), and 10 forming LUN 2 (drives 4 through 13). LUN 2 is the more critical LUN for the running application. All LUNs are RAID 1/0. The SPE is running FLARE 4.30, and we are using Unisphere to connect to the SPE.
It was detected that two of the four drives comprising LUN 0 were faulted. The LUN was still usable, though, because the two faulted drives were mirrors of drives that were still functioning. I tried adding new drives to replace the faulted ones, but I kept getting errors that the original faulted drive was critical and needed to be reinserted. Since this LUN was non-critical to the production application, I decided to destroy and recreate LUN 0 to see whether that would allow use of the new drives. Once I removed the LUN from the storage group and the RAID group, and then deleted the applicable RAID group, the fault cleared on one of the two replaced drives. The second drive, though, continued to give the same error, and because it was in the faulted state it did not show up as usable for recreating the RAID group, so I cannot rebuild the LUN as it was before, because I need all four RAID 1/0 drives. I also tried the 'Replace Disk' wizard in Unisphere. It detected the faulted drive and indicated when to swap drives, but the replacement drive still faults with a status of 'Misplaced' and the error message 'Drives in wrong slot or missing. A critical disk having serial xxxxxxxx is expected...' (event code 0x7482).
Since LUN 0 was owned by SP-B, I tried rebooting that SP (after unmounting the affected partition on the Linux host), hoping to clear the previous LUN 0 configuration. After the reboot, though, SP-B developed various new errors and adverse effects. It was pingable on its Ethernet interface, but the browser gave 'Unable to Connect' errors when I attempted to log in to Unisphere via HTTPS. When logged into SP-A over Ethernet, I saw the status of SP-B described in several different status locations as:
- 'Unmanaged'
- 'SPE SPS-B is faulted' (event code 0x7404)
- 'Cabling status for SP-B is unknown'
- 'Events could not be retrieved due to a communication failure'
- 'Peer boot state: Invalid. Invalid boot fault information'
- 'Can no longer manage SP-B' (event code 0x743a)
Also, the array paths handled by SP-B, as seen from the Linux host with the powermt command, are all 'dead', and their HW path parameter changed from 'qla2xxx' to 'Unknown'. The production application is still working because it runs through SP-A and primarily uses LUN 2, but obviously we would like to get things back to normal for fault-tolerance purposes.
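For reference, here is a minimal sketch of how we check path state from the host. The powermt output lines below are a made-up sample (the exact column layout on a real host may differ); in practice you would pipe the output of `powermt display dev=all` through the same filter, and run `powermt restore` once SP-B is healthy again to re-test the dead paths.

```shell
#!/bin/sh
# Hypothetical sample of powermt path lines; on the real host, replace the
# printf with:  powermt display dev=all
sample_output='   0 qla2xxx  sdc  SP A0  active  alive  0  0
   1 qla2xxx  sdd  SP B0  active  dead   0  0
   2 Unknown  sde  SP B1  active  dead   0  0'

# Count paths reported as dead; a nonzero count means redundancy is degraded.
dead_count=$(printf '%s\n' "$sample_output" | grep -c 'dead')
echo "dead paths: $dead_count"
```

In our case all of the SP-B paths show as dead while the SP-A paths stay alive, which matches what Unisphere reports.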
Any suggestions on getting SP-B running again and rebuilding LUN 0? Ideally we would like to do so without affecting SP-A and LUN 2, as they are actively used in production, but any assistance is welcome. Thanks.
Roger_Wu
January 22nd, 2014 19:00
You'd better collect the SPCollects logs from SP-A and submit them to EMC for analysis.