sql-lover

7 Posts

51319

August 16th, 2013 09:00

Backup jobs (FULL or DIFF) disconnect LUNs and bring the SQL cluster down

Here are the typical OS errors logged before SQL fails over to the passive node:

-Connection to the target was lost. The initiator will attempt to retry the connection.
-The initiator could not send an iSCSI PDU. Error status is given in the dump data.
-\Device\MPIODisk25 is currently in a degraded state. One or more paths have failed, though the process is now complete.

Now, I am not a SAN expert; I am an MS SQL DBA, and I am just frustrated by the whole situation.

Has anyone seen this issue before?

I can't confirm the exact patch, but our SAN expert applied a firmware upgrade (suggested by DELL), and since then things have been even worse: I can't run backups at all anymore. Before the upgrade, the LUNs at least disconnected only sporadically, and the disk resources came back online on their own. Not anymore. He also removed duplicated MPIO entries or something like that, also at DELL's suggestion.

This is a two-node SQL Server 2012 SP1 cluster running on Windows Server 2008 R2 SP1.

Any suggestions are highly appreciated, and I may pass them on to our SAN expert. I am not ruling out a Windows or HP ProLiant driver problem, but everything seems to point to the SAN.

Responses(4)

sql-lover

7 Posts

0

August 16th, 2013 10:00

Don,

Thanks for the reply.

That's exactly what I mentioned to the SAN guy, but he says everything is fine. Honestly, though, I have not checked the logs myself.

I can ask the guy and check what type of switches we have.

I know for sure (I even told him in advance) that the SAN had performance issues. Brand new, it was giving me 40 MB/sec when, with the 1Gb adapter I think we have, it should give around 80 or 90. Long story short, the problem finally exploded a month ago, and he is upgrading to more and faster disks. He is also upgrading the iSCSI links to 10Gb, I believe.
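As a rough sanity check on those numbers, here is a minimal sketch of the arithmetic. It assumes decimal megabytes and ignores all TCP/iSCSI overhead, so it only shows the raw ceiling of the link; the function names are made up for illustration.

```python
def raw_link_mb_per_s(link_gbit_per_s: float) -> float:
    """Raw line rate in decimal MB/s (1 Gbit/s = 1000 Mbit/s, 8 bits per byte).

    Assumption: protocol overhead is ignored, so real throughput will be lower.
    """
    return link_gbit_per_s * 1000 / 8


def link_utilization(observed_mb_per_s: float, link_gbit_per_s: float) -> float:
    """Fraction of the raw line rate that an observed throughput represents."""
    return observed_mb_per_s / raw_link_mb_per_s(link_gbit_per_s)


# Compare the observed 40 MB/s with the 80-90 MB/s I expected on a 1Gb link.
for rate in (40, 80, 90):
    print(f"{rate} MB/s on 1Gb: {link_utilization(rate, 1.0):.0%} of raw line rate")
```

A 1Gb link tops out at 125 MB/s raw, so 40 MB/s is roughly a third of the line rate, while the 80 to 90 I expected would be about two thirds or a bit more.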

But my main problem is this error when running backups: the LUNs fail. I have not run backups in more than a week, and that's scary!

sql-lover

7 Posts

0

August 16th, 2013 12:00

Don,

Appreciate your replies! Even if this does not fix the problem, it is a good start.

It looks like the switch is a Cisco 3750X. The guy or guys in charge of our network infrastructure are going to check this, as per my request.

Yep! The NIC speeds are set to auto! :-( ... Gosh! It should be a fixed value! I remember I fixed a cluster issue back in 2004 that was caused by this. Not sure if it has anything to do with the current issue, though.

sql-lover

7 Posts

0

August 16th, 2013 13:00

Fair enough.

This is the first time I have not set the speed to a fixed value, but I guess this case is different.

Very good post about the switch, though! Lots of good info. I will re-read it until I understand it, then forward it to my management team and our IT/SAN resource.

If we do the flow control thing, I may come back and update my thread and let you and other forum members know about the results.

Again, really appreciate your feedback.

sql-lover

7 Posts

0

August 20th, 2013 08:00

Hello Don,

Sadly, we deployed the suggested flow control change (switch and iSCSI NICs) and the LUN got disconnected again while running backups on a big database.

Here are the errors:

-Connection to the target was lost. The initiator will attempt to retry the connection.

-The initiator could not send an iSCSI PDU. Error status is given in the dump data.
