drake2
44 Posts
September 8th, 2011 13:00
rescan failed
Hello, hope someone can help. I am adding new LUNs to my NS42G (5.6.51.320) from my CX4-480 (04.30.000.5.517).
I have done this procedure before, making certain of the HLU numbering and so on. I rescan via Celerra Manager to bring the new volumes in, and it replies:
Error
rescan failed
Storage system rescan failed.
Make sure the system is not in failover state. If the system is failed over, execute the restore action to bring it to OK state.
Msg ID 13692698625
All areas show the Data Movers with status "OK"; server_2, my primary, is the active one per the nas_server command, and both show state "0".
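For reference, the checks I ran were roughly these (a sketch; exact output varies by DART version):

nas_server -list
nas_server -info -all    # shows type (nas/standby), slot and state for each Data Mover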
Any ideas on what to check, or what command-line options might have better success rescanning the new storage, before I call support?
Thanks
Rainer_EMC
4 Operator · 8.6K Posts
September 8th, 2011 14:00
I don't think you'll find it in the Data Mover log.
server_devconfig will NOT create any new devices if the backend Clariion isn't healthy.
You need to check the Clariion health – most likely a failed-over SP, a trespassed LUN, ...
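For example, something like this (SP address hypothetical; classic navicli syntax assumed):

navicli -h <spa_address> getlun -owner -default    # compare current vs. default owner per LUN
navicli -h <spa_address> getcrus                   # SP and hardware status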
Rainer
dynamox
9 Legend · 20.4K Posts
September 8th, 2011 13:00
server_devconfig server_2 -create -scsi -all (for each Data Mover). If it still fails, what's in the server_log server_2 output around the time you try the rescan?
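Something like this from the Control Station (the tail count is arbitrary):

server_devconfig server_2 -create -scsi -all
server_devconfig server_3 -create -scsi -all
server_log server_2 | tail -100    # look for SCSI/backend errors around the rescan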
dynamox
9 Legend · 20.4K Posts
September 8th, 2011 14:00
What's in the server_3 log file?
drake2
44 Posts
September 8th, 2011 14:00
OK, the server_devconfig command got me a bit further. server_2 finished clean ("done"); running it against server_3 reported "warning", but it displayed the new disk IDs when finished.
Via the GUI I now see the new volumes, however when selecting them with "add volumes" it errors about "inconsistent server visibility":
"The volumes you selected do not have consistent server visibility with the other volumes in the pool."
nas_disk -list now shows the two new volumes as not in use, but only server "2" is listed and not "1,2" like the others (?)
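Roughly what I'm seeing (illustrative only; IDs, sizes and serial made up):

$ nas_disk -list
id   inuse  sizeMB   storageID-devID       type   name   servers
...
17   n      466747   CKxxxxxxxxxxxx-0020   CLSTD  d17    2
18   n      466747   CKxxxxxxxxxxxx-0021   CLSTD  d18    2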
Rainer_EMC
4 Operator · 8.6K Posts
September 8th, 2011 14:00
As the command says, there is likely a problem on the Clariion backend – check with nas_storage and navicli.
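For instance (SP address hypothetical):

nas_storage -check -all              # Celerra-side health check of the backend
navicli -h <spa_address> getagent    # basic SP connectivity check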
Rainer
Rainer_EMC
4 Operator · 8.6K Posts
September 8th, 2011 15:00
It means you have an error in the zoning or LUN masking – server_3 needs to see the very same devices as server_2.
The message indicates that this isn't the case.
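A quick way to compare is to probe both Data Movers and diff the results (a sketch; file paths arbitrary):

server_devconfig server_2 -probe -scsi -all > /tmp/s2_devs.txt
server_devconfig server_3 -probe -scsi -all > /tmp/s3_devs.txt
diff /tmp/s2_devs.txt /tmp/s3_devs.txt    # any difference points at zoning or masking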
drake2
44 Posts
September 9th, 2011 05:00
So this morning I started fresh with a new rescan, and the additional volumes did add to my pool.
OK, so some further background. Yesterday I was also working a case with support regarding about two dozen LUNs not trespassing back to their default owner. You could manually trespass them, but minutes later they would be back on the non-default-owner SP. These LUNs were actually ESX/UCS LUNs in completely separate storage groups, not Celerra-related in the SG sense. Last night I managed to fix my LUN trespass issue by manually adding connections for a few hosts in those SGs and then changing the array failover mode for those hosts so they were all uniform. Some new UCS hosts were set at "4" (ALUA) while the rest of my original cluster was at "1" (active/passive).
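For the record, the failover-mode change was along these lines (host name hypothetical; run against your own SP):

navicli -h <spa_address> storagegroup -list    # identify the storage groups and attached hosts
navicli -h <spa_address> storagegroup -sethost -host <esx_host> -failovermode 1 -arraycommpath 1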
Again this AM I rescanned, and my Celerra LUN additions added cleanly. So all appears to be good now.
The only possible remaining issue: via the nas_disk command, the two new volumes list their servers as "2,1" and not "1,2" like all the rest. Does this order matter?
So who gets the answer credit? Backend array issues? Seems like it... however, some necessary commands and guidance originally brought the LUNs in.
Rainer_EMC
4 Operator · 8.6K Posts
September 9th, 2011 06:00
The order doesn't make a difference.
christopher_ime
2K Posts
September 9th, 2011 15:00
drake,
With all of the troubleshooting you performed, I'd highly recommend that you find time to actually test failover of the Data Movers. This is disruptive, so schedule accordingly; the time required to come back online via the standby Data Mover depends on how active the Data Mover is, the number of filesystems it needs to remount, etc.
As noted above, if the databases (visibility to the exact same LUNs, for instance) are out of sync between the active and standby Data Movers, a hardware fault on the active Data Mover would result in an aborted failover.
Just a thought... it is probably a test case that should be performed regardless, and one you might not want to take for granted will work.
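A scheduled test might look like this (disruptive; a sketch using the standard standby commands):

server_standby server_2 -activate mover    # fail server_2 over to its standby
# ...verify filesystems and exports are served by the standby...
server_standby server_2 -restore mover     # fail back once satisfied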
christopher_ime
2K Posts
September 9th, 2011 15:00
All,
I simply wanted to chime in with the steps I have adopted over time, many of which are of course optional.
1) Instead of scanning each separately, use keyword "ALL"
When rescanning via the CLI, as already noted, all Data Movers must have visibility to the same storage. If not, one obvious consequence is that the Data Movers would be unable to fail over. So that I don't forget or miss a Data Mover I might have overlooked, instead of:
server_devconfig server_2 -create -scsi -all
server_devconfig server_3 -create -scsi -all
[...]
Issue:
server_devconfig ALL -create -scsi -all
2) Probe before you commit
Instead of going immediately into the commit, query the list of devices to see what is visible and will be absorbed from the back-end. Scan the list quickly to make sure everything is accounted for:
server_devconfig ALL -p -s -a (-probe -scsi -all)
Then:
server_devconfig ALL -c -s -a (-create -scsi -all)
3) Finish with a quick health check
nas_storage -c -a (-check -all)
Possible errors include "Missing path to SPA (or SPB)". If it simply returns "Done", then all is well.
drake2
44 Posts
September 12th, 2011 05:00
Thank you all for the input. Very helpful, and I will definitely take note of the commands here. Christopher, we don't generally test failover on a regular basis, however as of at least two weeks ago everything tested clean. That happened because, out of 15K end-user systems, one of the two Macs in our environment got updated to 10.7 (Lion). As you might know, when it mapped it knocked my primary DM offline (server_2). We ran on server_3 for about 24 hours without issue until I got patched and failed back. I assume you mean I should check failover again since recently adding the new Celerra storage and hitting this issue in particular. I just find it, not surprising, but interesting, that unrelated LUNs in my ESX environment appeared to have affected the Celerra's relationship to its storage (?). I guess I'll chalk this up to backend issues as a whole then; that's the only real coordination of events here. I have much to learn, and as always you guys are an invaluable resource. Thanks.
dynamox
9 Legend · 20.4K Posts
September 12th, 2011 11:00
How so?
drake2
44 Posts
September 12th, 2011 12:00
Meaning: I know the array's SPs are the governing units, so to speak, but the Celerra SG is masked and segregated from the other SGs. I figured that feature was there to keep things like this from happening. On the other hand, it is evidently the SPs that have to be happy first in order to service those SG requests (or so I'm learning from this particular issue). It looks like the Celerra was trying to tell me all this; I'm just not too skilled at troubleshooting it yet.
dynamox
9 Legend · 20.4K Posts
September 12th, 2011 13:00
I don't think your trespass issues with the LUNs presented to ESX had anything to do with discovering LUNs on the Celerra. True, many components are shared and could affect multiple resources using them, but trespassing is not that... at least not the issue you were seeing. One time I had a very old Sun Solaris 9 box running Veritas failover software (DMP). It was not configured correctly and caused a trespass storm where LUNs would ping-pong between SPA and SPB. It was so intense that it started affecting the performance of other applications on the same array.
Rainer_EMC
4 Operator · 8.6K Posts
September 12th, 2011 16:00
The SG for the NAS LUNs is now protected because there were too many people deleting LUNs by mistake while they were still in use, or at least still configured on the Data Movers.
We're just trying to make it less likely that you shoot yourself in the foot.
That a rescan fails when the backend storage isn't in a completely healthy state is also a good thing.
This doesn't affect traffic to already-recognized LUNs – but we don't want to add more LUNs if there might be an issue.
You don't want to import and use a LUN, and only later, in a failover situation, find out it's not working on the standby Data Mover.
Rainer