audiomatron

August 19th, 2011 06:00

Weird MD3000i Behavior

I have a PowerVault MD3000i SAN that I am using as iSCSI storage for my vSphere cluster. I turned on email alerts a few nights ago, and I have been getting errors like these:

Summary
Node ID: DE-VMSTORE-SAN
Host IP Address:
Host ID: Out-of-Band
Event Error Code: 1707
Event occurred: Aug 19, 2011 4:33:19 AM
Event Message: Degraded wide port becomes failed Component type: Enclosure Component (EMM, GBIC/SFP, Power Supply, or Fan) Component location: Enclosure 0, Slot 0

Summary
Node ID: DE-VMSTORE-SAN
Host IP Address:
Host ID: Out-of-Band
Event Error Code: 1706
Event occurred: Aug 19, 2011 4:33:19 AM
Event Message: Optimal wide port becomes degraded Component type: Enclosure Component (EMM, GBIC/SFP, Power Supply, or Fan) Component location: Enclosure 0, Slot 0

The errors correspond with errors like these in vSphere:

Lost access to volume
4c811ecc-c0bdc6a7-fdc2-b8ac6f15f6b3 (SATA
Storage LUN 2) due to connectivity issues.
Recovery attempt is in progress and outcome will
be reported shortly.
info
8/19/2011 4:34:04 AM

and

Path redundancy to storage device
naa.6842b2b0004bb3580000050a4c80fe48
degraded. Path vmhba35:C3:T0:L0 is down.
Affected datastores: "SATA Storage LUN 0".
warning
8/19/2011 4:34:13 AM

This happens overnight, and the array seems fine during the day, with no alarms or anything. Does anyone have a clue what is going on here? Should I be worried?

Marcus

Responses(11)

JOHNADCO

2 Intern

August 22nd, 2011 13:00

I've seen these fixed with a power cycle. I have also seen them predict a failure that occurred after Dell had me ignore them for an extended period of time.

Hard to say.

Is it always the same path reported to have the issues?
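
One way to answer that without eyeballing every alert is to tally the path names in the vSphere event text. The snippet below is only a rough sketch: it assumes the events have been copied into a string and that they use the "Path ... is down" wording quoted in the first post. The function name `count_down_paths` and the regular expression are made up for illustration and are not part of any VMware tool.

```python
import re
from collections import Counter

# Matches the runtime path name in messages such as
# "Path vmhba35:C3:T0:L0 is down." (wording taken from the events quoted above).
PATH_DOWN = re.compile(r"Path\s+(vmhba\d+:C\d+:T\d+:L\d+)\s+is\s+down")


def count_down_paths(event_text):
    """Return a Counter mapping each path reported as down to its number of hits."""
    return Counter(PATH_DOWN.findall(event_text))


# Event text copied from the first post in this thread.
sample_events = """\
Path redundancy to storage device
naa.6842b2b0004bb3580000050a4c80fe48
degraded. Path vmhba35:C3:T0:L0 is down.
Affected datastores: "SATA Storage LUN 0".
"""

down_counts = count_down_paths(sample_events)
for path, hits in down_counts.most_common():
    print(path, hits)
```

If one path, or one channel (the `C3` part), dominates the tally over several nights, that points at a single controller or link rather than the array as a whole.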

audiomatron

August 23rd, 2011 07:00

It is always the same controller.

JOHNADCO

2 Intern

August 23rd, 2011 08:00

"It is always the same controller."

We have experienced controller failure after this error was left to run for an extended period of time. Dell won't replace the controller until the failure is catastrophic.

The first suggestion from Dell, if you call support, will be to pull the controller out for a few minutes and re-insert it. This has seemed to help us in the past, and not every controller experiencing these sorts of errors fails, but we have definitely had a couple fail completely at some point after reporting these sorts of errors.

We were able to work out that, while there was not much user load during the periods when these happened, there was usually a ton of backup and redundancy load going on.

audiomatron

August 24th, 2011 06:00

So, I'm assuming I can safely pull this controller while the system is running. Is that correct? Judging from the logs, it looks like when this happens, ESXi just uses the other controller as a path to the storage until this one comes back up. If I can safely pull this during work hours, I'll go do it now.

JOHNADCO

2 Intern

August 24th, 2011 09:00

PS: After reinserting the controller, wait for it to come up (these controllers boot a little slowly, so be patient). Then go to the support tab, manage RAID controllers, redistribute virtual disks. This puts ownership, and thus I/O, back on the controllers defined as the owners of the proper LUNs.

JOHNADCO

2 Intern

August 24th, 2011 09:00

"So, I'm assuming I can safely pull this controller while the system is running. Is that correct? Judging from the logs, it looks like when this happens, ESXi just uses the other controller as a path to the storage until this one comes back up. If I can safely pull this during work hours, I'll go do it now."

If your multipathing is set up correctly, you should have no issues with it failing over properly. We have done it a million times in the 5 years we have owned several MD3000i's with our ESX hosts. Don't freak out if the SAN goes into error. This is proper operation when the controllers switch ownership of the LUNs owned by the controller you pull out. ESX will just keep humming along as long as multipathing is correct.
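
A quick sanity check before pulling anything is to confirm that every device still has at least two usable paths. The sketch below assumes you have saved a path listing as text with `Device:` and `State:` lines for each path, which is the shape `esxcli storage core path list` prints on newer ESXi releases; older hosts use different commands and the output may differ. The helper name, the choice to treat `active` and `standby` as usable, and the second path in the sample are all assumptions for illustration.

```python
from collections import defaultdict

# Assumption: paths in these states can carry I/O after a failover.
USABLE_STATES = ("active", "standby")


def usable_paths_per_device(listing):
    """Count usable paths per device from 'Device:' / 'State:' lines."""
    counts = defaultdict(int)
    device = None
    for raw in listing.splitlines():
        line = raw.strip()
        if line.startswith("Device:"):
            device = line.split(":", 1)[1].strip()
            counts[device] += 0  # register the device even if no path is usable
        elif line.startswith("State:") and device is not None:
            if line.split(":", 1)[1].strip() in USABLE_STATES:
                counts[device] += 1
            device = None  # this path entry is finished
    return dict(counts)


# Hypothetical listing: the first path is the one from this thread,
# the second path name is invented for the example.
sample_listing = """\
   Runtime Name: vmhba35:C3:T0:L0
   Device: naa.6842b2b0004bb3580000050a4c80fe48
   State: dead
   Runtime Name: vmhba35:C2:T0:L0
   Device: naa.6842b2b0004bb3580000050a4c80fe48
   State: active
"""

path_counts = usable_paths_per_device(sample_listing)
at_risk = [dev for dev, n in path_counts.items() if n < 2]
print(at_risk)
```

Any device that shows up in `at_risk` has no spare path, so pulling a controller could take that LUN offline instead of failing over.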

audiomatron

August 24th, 2011 11:00

I pulled out the controller for a bit and put it back in. Now to wait and see if it still happens...

audiomatron

August 25th, 2011 08:00

I don't want to speak too soon, but that seems to have worked. I haven't gotten any errors since I reseated the controller. Thanks for all the help!

JOHNADCO

2 Intern

August 25th, 2011 08:00

Not sure why so many things on these darn arrays can only be solved with a power cycle, but it is so. In essence, you power cycled the problem controller.

PowellMD

July 22nd, 2014 13:00

When you removed and reseated this controller, did you use Storage Manager to take the controller offline first? I ask because I am having similar issues and will need to change a battery as well.

DELL-Sam L

Moderator

July 22nd, 2014 14:00

Hello PowellMD,

It is best practice to power down the controller before removing it and replacing the battery. You can remove the controller and replace the battery without powering the controller down, but you may run into issues and get some errors by doing so.

Please let us know if you have any other questions.
