Unsolved

1 Rookie

 • 

7 Posts

August 2nd, 2021 10:00

Stuck in Service Mode, Cache Dirty, UnSafe to Remove; yet I need to replace SP/SD

I have a VNXe3150 that I suspect has bad hardware; however, I have reservations about replacing it ...

A few weeks ago I replaced a faulted SP; however, I reused the original SD (I wasn't sure where the config was stored).

... the VNXe came back up and ran fine for 2 weeks, then I had to reimage/reboot the subject SP. This reimage remedy got me another day or two of good runtime, but now both SPs are stuck in Service Mode, a CLI query tells me I have a cache dirty state, and both of the Unsafe to Remove LEDs are stuck ON. My goal is to execute a system shutdown (to clear the Unsafe LEDs) and replace the SD; however, I'm unable to clear the LEDs, so I have reservations. I did clear the c states via the CLI, but it didn't help.

So, should I just pull the VNXe power and replace the SD? (Where is the config stored?) Is there another procedure I should try?

   

1 Rookie

 • 

7 Posts

August 2nd, 2021 15:00

To further clarify and update ...

For the above dilemma, the GUI Shutdown System > Execute Service Action was not available (greyed out). However, I was able to run svc_shutdown --system-halt from the CLI on SP B, but this resulted only in a notice that the system is already in Service Mode.

I did pull the power, remove the subject SP and replace its SD, restore power, insert the subject SP, wait a couple of hours, and query status via its web client. Both SPs are in Service Mode, with the message "there is a problem with the system software on this SP." I tried to reimage/reboot both SPs; however, Normal Mode is still inaccessible.

Any clues out there?

  

1 Rookie

 • 

7 Posts

August 2nd, 2021 17:00

Thanks so much for the reply Josh!!!

I have been reimaging/rebooting both SPs individually all along the way. However, I'm just now doing so on the supposedly good SP, since I replaced the SD in the other SP. I will say it seems to be acting a bit differently ...

I've done svc_diag but not the other queries. I addressed the cache dirty state (somewhat); I just reset the flags. I'll try all the other things you prescribed now and post the results ...

Moderator

 • 

9.3K Posts

August 2nd, 2021 17:00

Hi,

Can you try the SPs one at a time and see if there is any change? What is the output of svc_diag --state=spinfo?

I know you have done some of this already but maybe this will help.

Power Cycle the System: 

Summary steps: power-cycling the system requires completing the following steps in this order.
Note: A more detailed list of steps can be found in the VNXe Online Help: go to Support > Online Documentation, select the Search tab, and enter the search criteria "Power-cycle the system manually".

  1. Place both SPs in Service Mode.
  2. Disconnect the power cables from the power supplies on the disk-processor enclosure (DPE) to power down the SPs.
  3. Disconnect the power cables from the power supplies on each disk-array enclosure (DAE) to power them down.
  4. Reconnect the power cables to the power supplies on each DAE to power them up.
  5. Reconnect the power cables to the power supplies on the DPE to power up the SPs.
  6. Reboot each SP to return them to Normal Mode (see steps below).

 

Reboot Storage Processor(s), if still in service mode: 

After completing the system power cycle, one or both SPs may be in Service Mode. Follow the steps below to reboot each SP.

  1. In Unisphere, go to the Service System page: System > Settings > Service System.
  2. Log in with the service account password.
  3. In the System Components section, select the affected SP (Note: the mode status on the right side of the page is set to Service Mode).
  4. In the Service Actions section, select Reboot.
  5. Select the Execute service action button.
  6. When the Reboot service confirmation page is displayed, select the OK button to begin the reboot process.

 

If you are still having issues you may need to call in to phone support.

1 Rookie

 • 

7 Posts

August 2nd, 2021 18:00

The spInfo diag reports that DIMM0 & DIMM1 are in an unknown state on the subject SP. There are other bus errors; however, I can't imagine any computer being able to do anything without good RAM. I'll address this in the AM tomorrow and post; I'm offsite now.  -Thanks Josh!

 

Moderator

 • 

7.6K Posts

August 3rd, 2021 08:00

Hello raGlenn,

One thing that you can also try is to reseat the DIMMs on the SP, as I have seen some cases where a reseat has resolved cache issues.

1 Rookie

 • 

7 Posts

August 3rd, 2021 20:00

... reseat/burnish has been a long-time solution for me in these scenarios.  I resorted to it here, and this is how it unfolded:

  1. I found SP-B to be in a DIMM: unknown state.  So, …
    1. I hot removed SP-B (the Unsafe to Remove LED was ok; not lit),
    2. burnished the DIMMs,
    3. vacuumed and blew out all,
    4. hot inserted the SP-B,
    5. waited for LED activity to stabilize/SP to reinitialize (~2 hours) and get an IP address assigned,

Note: Follow the manual for LED activity and getting an IP address.

... and then used SSH/CLI to run svc_diag --state=spinfo, which showed that the SP-B DIMM was now OK but the SP-A DIMM was now unknown.
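The long waits for an SP to pick up an address can be scripted instead of watched. Below is a minimal sketch, assuming a workstation with a POSIX shell and `ping`. The function name `wait_for_sp`, the `PROBE` override, and the address in the usage note are hypothetical and are not part of any VNXe tooling.

```shell
#!/bin/sh
# wait_for_sp HOST [TRIES] [DELAY_SECONDS]
# Polls HOST until the probe command succeeds or TRIES is exhausted.
# PROBE is a hypothetical override; by default a single ping is sent.
wait_for_sp() {
    host=$1
    tries=${2:-60}
    delay=${3:-60}
    i=1
    while [ "$i" -le "$tries" ]; do
        # Intentionally unquoted so "ping -c 1" splits into words.
        if ${PROBE:-ping -c 1} "$host" >/dev/null 2>&1; then
            echo "$host responded on attempt $i"
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    echo "$host did not respond after $tries attempts" >&2
    return 1
}
```

For example, `wait_for_sp 192.0.2.10 120 60` would poll once a minute for up to two hours. A ping reply only shows that the management address is up; it does not show that the SP has left Service Mode.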

  2. I removed SP-A to service the DIMM: unknown state as per above; however, the problem persisted after I did so, and the Unsafe to Remove LED was lit.  So, I repeated the process (without regard to the LED), except this time I replaced the DIMM modules.  While waiting in vain 6 hours for an address to be assigned, I happened to notice that my assessment of the last query was wrong: SP-A/DIMM was actually OK, and the problem had flipped back to SP-B/DIMM being Unknown.  8 hours and still no IP address assignment …

...
Something isn't right here; there must be another problem.  Also, maybe I should have powered down the DPE before removal (with regard to the Unsafe LED).  -Hoping you have a suggestion for me Josh ...  e.g. what do you do when the Unsafe to Remove LED remains ON while the subject needs to be removed???

 

 

 

Moderator

 • 

7.6K Posts

August 4th, 2021 09:00

Hello raGlenn,

What I would do is power the system all the way down, then bring it back up to see if you are still getting the unsafe removal light.

1 Rookie

 • 

7 Posts

August 4th, 2021 09:00

We did that about an hour ago, after waiting all night for it to initialize and do any cache cleanup; it never finished (no IP).  So, we powered down for 20 seconds; we're still waiting for power-up to complete (get an IP).  Do you have any comments on the scenario I mentioned above? That is, the SP-B DIMM is unknown, then this DIMM goes OK after reseating/burnishing but the other (SP-A) DIMM goes to unknown, then it flips back to the SP-B DIMM when I reseat/burnish the SP-A DIMM.  Should I expect to ever see spInfo report both DIMM sets as OK?

Moderator

 • 

9.3K Posts

August 4th, 2021 12:00

It seems odd that it keeps flipping back and forth with the dimm error, but I have not been able to find anything that should cause that. 

Moderator

 • 

7.6K Posts

August 4th, 2021 13:00

Hello raGlenn,

Booting the system using one SP is a good way to see if there is an issue with one of your SPs.  You just want to make sure that you're booting from slot 0.

1 Rookie

 • 

7 Posts

August 4th, 2021 13:00

I suppose it's possible; I really haven't proven yet that the problem is flipping in a cyclical fashion. It could be that the flipping I observed is just a coincidence at this point, but a very big one.  So then, is DIMM/OK for both SPs the normal/typical/expected state for a chassis with two SPs?  That is, I'm not chasing my tail here, am I?

On another note, I power reset the unit about 1.5 hours ago and it's still initializing (no IP).  If it doesn't finish initializing and then reboot to Normal Mode, I'm thinking about removing one SP and doing another power/reboot.  Is this a good test; should I expect it to work (is it really redundant)?  Do you have any other recommendations?
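On whether the DIMM fault really flips between SPs: one way to settle it is to write down each spinfo result as it is read and count the changes afterwards. Below is a minimal sketch, assuming the states are typed in by hand from the svc_diag output (nothing here parses that output). The function names, the tab-separated log layout, and the file name in the usage note are hypothetical.

```shell
#!/bin/sh
# record_dimm_state LOGFILE SP STATE NOTE
# Appends one tab-separated line: UTC timestamp, SP, state, free-text note.
record_dimm_state() {
    printf '%s\t%s\t%s\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$2" "$3" "$4" >> "$1"
}

# count_state_changes LOGFILE SP
# Prints how many times the recorded state changed for the given SP.
count_state_changes() {
    awk -F '\t' -v sp="$2" '
        $2 == sp { if (seen && $3 != last) n++; last = $3; seen = 1 }
        END { print n + 0 }
    ' "$1"
}
```

For example, after `record_dimm_state dimm.log spb unknown "first query"` and later entries for each query, `count_state_changes dimm.log spb` shows whether SP-B's state moved more than once, which is easier to trust than memory after several multi-hour waits.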

1 Message

June 5th, 2023 23:00

Just for all the other guys who get this error.

The following fixed my Unity 380F with the same errors (DIMM Unknown, Service Processor degraded or stuck in service mode):

ssh service@unity.ip
svc_rescue_state -l

svc_rescue_state -c

svc_shutdown -r spb 

After that, my Unity came back to a "green" state again.
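The same sequence can be wrapped so that it is previewed before anything is sent to the array. Below is a minimal sketch using only the three commands quoted above, as reported for a Unity 380F; they have not been verified here against a VNXe. The function name `clear_rescue_state`, the `DRY_RUN` switch, and the host name in the usage note are hypothetical.

```shell
#!/bin/sh
# clear_rescue_state HOST
# Runs the three commands from the post above, in order, over SSH.
# DRY_RUN defaults to 1, which only prints what would be run.
clear_rescue_state() {
    host=$1
    for cmd in "svc_rescue_state -l" "svc_rescue_state -c" "svc_shutdown -r spb"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "ssh service@$host $cmd"
        else
            # Stop at the first failure rather than rebooting blindly.
            ssh "service@$host" "$cmd" || return 1
        fi
    done
}
```

Calling `clear_rescue_state unity.example` prints the three ssh lines. Setting `DRY_RUN=0` first would actually run them, and the last one reboots SP B, so it is worth reading the preview before doing that.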

 

 
