March 25th, 2022 07:00

CMC 6.21 stops working when blades are under load

We have seven M1000e chassis, each with 16x M630 blades, 2x Force10 MXL switches, and dual CMCs on 6.21 firmware.  The chassis were set up between Aug and Dec 2021.  Everything worked fine until we started moving production traffic to the chassis.  Once all the blades reached an average of 20% CPU load, some CMC functions stopped working.

For instance, the initial CMC screen stays at "Initializing" forever, and no components are displayed.

Screenshot: initial screen, stays at "Initializing" forever

 

When trying to reset the CMCs, this happens after clicking on the Troubleshooting link:

Screenshot: trying to reset the CMCs

We can log in to the CMC through SSH.  Some commands work, some never respond, and some return partial information.

The whole time, the blades, the iDRACs on the blades, and the switches all work properly.  Only the CMC malfunctions.

We've tried switching over to the standby CMC and manually removing one CMC from the chassis, but nothing changes.  The racreset command does not work: it times out and never responds.

The CMC logs of the failing chassis contain low memory messages:

 

SeqNumber       = 40844
Message ID      = MEM8500
Category        = Audit
AgentID         = CMC
Severity        = Critical
Timestamp       = 2021-10-11 05:37:49
Message         = Low memory condition detected.
--------------------------------------------------------------------------------
SeqNumber       = 40843
Message ID      = MEM8501
Category        = Audit
AgentID         = CMC
Severity        = Warning
Timestamp       = 2021-10-11 02:50:53
Message Arg   1 = 253968
Message Arg   2 = 25390
Message         = Low memory warning, 253968KB, 25390KB.
--------------------------------------------------------------------------------

 

The date on the message is around the date we started sending production traffic to the blades. 
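For anyone who wants to scan a saved CMC log for these entries, here is a minimal Python sketch. It assumes the log has been saved as text in the "Key = Value" layout quoted above, with records separated by a line of dashes. The function names are made up for illustration and are not part of any Dell tool.

```python
# Sample text copied from the two log records quoted in this post.
SAMPLE = """\
SeqNumber       = 40844
Message ID      = MEM8500
Category        = Audit
AgentID         = CMC
Severity        = Critical
Timestamp       = 2021-10-11 05:37:49
Message         = Low memory condition detected.
--------------------------------------------------------------------------------
SeqNumber       = 40843
Message ID      = MEM8501
Category        = Audit
AgentID         = CMC
Severity        = Warning
Timestamp       = 2021-10-11 02:50:53
Message Arg   1 = 253968
Message Arg   2 = 25390
Message         = Low memory warning, 253968KB, 25390KB.
--------------------------------------------------------------------------------
"""

# Message IDs seen in this thread; treating only these as "low memory" is an assumption.
LOW_MEMORY_IDS = {"MEM8500", "MEM8501"}


def parse_records(text):
    """Split 'Key = Value' log text into one dict per record."""
    records, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if set(line) == {"-"}:  # a line of dashes closes the current record
            if current:
                records.append(current)
                current = {}
            continue
        key, sep, value = line.partition("=")
        if sep:
            # Collapse runs of spaces so "Message Arg   1" becomes "Message Arg 1".
            current[" ".join(key.split())] = value.strip()
    if current:  # last record without a trailing separator
        records.append(current)
    return records


def low_memory_events(records):
    """Return the records whose Message ID is one of the low memory IDs."""
    return [r for r in records if r.get("Message ID") in LOW_MEMORY_IDS]


for event in low_memory_events(parse_records(SAMPLE)):
    print(event["Timestamp"], event["Severity"], event["Message"])
```

Running the same functions against logs from a healthy chassis and a failing one gives a quick way to compare when the low memory entries first appear.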

This is the output of 'getversion':

                         

server-1    2.75.100.75 (02)       PowerEdge M630         iDRAC8       Y
server-2    2.75.100.75 (02)       PowerEdge M630         iDRAC8       Y
server-3    2.75.100.75 (02)       PowerEdge M630         iDRAC8       Y
server-4    2.75.100.75 (02)       PowerEdge M630         iDRAC8       Y
server-5    2.75.100.75 (02)       PowerEdge M630         iDRAC8       Y
server-6    2.75.100.75 (02)       PowerEdge M630         iDRAC8       Y
server-7    2.75.100.75 (02)       PowerEdge M630         iDRAC8       Y
server-8    2.75.100.75 (02)       PowerEdge M630         iDRAC8       Y
server-9    2.75.100.75 (02)       PowerEdge M630         iDRAC8       Y
server-10   2.75.100.75 (02)       PowerEdge M630         iDRAC8       Y
server-11   2.75.100.75 (02)       PowerEdge M630         iDRAC8       Y
server-12   2.75.100.75 (02)       PowerEdge M630         iDRAC8       Y
server-13   2.75.100.75 (02)       PowerEdge M630         iDRAC8       Y
server-14   2.75.100.75 (02)       PowerEdge M630         iDRAC8       Y
server-15   2.75.100.75 (02)       PowerEdge M630         iDRAC8       Y
server-16   2.75.100.75 (02)       PowerEdge M630         iDRAC8       Y

switch-1    MXL 10/40GbE                       A00              9.14(1.3)
switch-2    MXL 10/40GbE                       A00              9.14(1.3)

 

The output does not list either CMC.  On the chassis where the CMCs do not show this issue, both CMCs are listed.
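Since the firmware level of the blades comes up later in the thread, here is a small Python sketch for checking saved 'getversion' output against a target iDRAC version. It assumes the server rows look like the ones quoted above (name first, dotted version second). The function names are illustrative only.

```python
# Two server rows copied from the 'getversion' output quoted in this post.
SAMPLE = """\
server-1    2.75.100.75 (02)       PowerEdge M630         iDRAC8       Y
server-2    2.75.100.75 (02)       PowerEdge M630         iDRAC8       Y
"""


def parse_version(text):
    """Turn a dotted version such as '2.75.100.75' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def blades_below(output, target):
    """Return (name, version) for every server row older than target."""
    wanted = parse_version(target)
    outdated = []
    for line in output.splitlines():
        fields = line.split()
        # Assumption: server rows start with 'server-' and carry the version in column 2.
        if len(fields) >= 2 and fields[0].startswith("server-"):
            if parse_version(fields[1]) < wanted:
                outdated.append((fields[0], fields[1]))
    return outdated


# 2.82.82.82 is the iDRAC version suggested later in this thread.
for name, version in blades_below(SAMPLE, "2.82.82.82"):
    print(name, "is on", version)
```

Comparing tuples of integers rather than raw strings avoids ordering mistakes such as "2.100" sorting before "2.75".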

 

Any ideas?  

 

Thanks!

 

5 Posts

April 13th, 2022 15:00

After about a week, the issue has not happened again.  All the chassis' CMCs are working properly now, after upgrading all iDRACs to 2.82 and raising the power cap from 4000W to 6000W.

Moderator

 • 

4.4K Posts

March 25th, 2022 12:00

Hello nmoldav,

 

Is just this one chassis affected?

Are all chassis identical?

 

I have seen one previous case where they swapped the CMC and they did not get those errors again.

Can you get a maintenance window and do a complete power drain on the chassis?

Shut down all the blades, shut down the chassis, remove all power for 10 minutes, swap the CMCs between their slots, reconnect power, and check the results.

 

Try updating the iDRAC firmware and check the results.

iDRAC 2.82.82.82

https://dell.to/3wE6yev

 

This may take us more research.

5 Posts

March 25th, 2022 13:00

Hi Charles.  3 out of 7 chassis have this problem.   Of the 4 which work OK, 3 have very little load, and the 4th one has all blades with 2.81.  We'll try to upgrade all blades to 2.82 first and if it does not work, we'll try powering off the chassis as you suggested.  I'll report back in about a week.

5 Posts

March 31st, 2022 05:00

Hi Charles.  We did as you suggested - complete power drain and CMC swap in the 3 chassis that had the issue.   All the chassis came back up after this. Afterwards we upgraded all the blades to 2.82.  Today I checked again and the CMC in one of the three chassis is presenting the issue again.  

Now more screens in the GUI and more commands in the CLI work, but many still do not.  For instance, the "Update" screen works and so does "Power/Power Configuration", but "Troubleshooting/Reset Components" does not, and neither does "Power/Power Monitoring".  In the CLI, getmodinfo, getsensorinfo and racreset do not work (they never respond).  racdump shows "General System/RAC Information" but stops after that and times out.
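To record which CLI commands hang without waiting on each one by hand, something like the following Python sketch could help. It is only a sketch under assumptions: that the CMC accepts `racadm <subcommand>` as a remote SSH command, that login is non-interactive (for example key-based), and that the host name is a placeholder you replace with your own.

```python
import subprocess


def run_with_timeout(argv, timeout_s):
    """Run one command and return 'ok', 'failed' or 'timeout'."""
    try:
        result = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this.
        return "timeout"
    return "ok" if result.returncode == 0 else "failed"


def cmc_command(host, subcommand):
    """Build the ssh argv for one racadm subcommand.

    Assumption: the CMC runs 'racadm <subcommand>' passed as a remote command,
    and BatchMode works because login does not need a password prompt.
    """
    return ["ssh", "-o", "BatchMode=yes", host, "racadm", subcommand]


def probe(host, subcommands, timeout_s=60):
    """Map each subcommand to 'ok', 'failed' or 'timeout'."""
    return {
        sub: run_with_timeout(cmc_command(host, sub), timeout_s)
        for sub in subcommands
    }


# Example call (hypothetical host name, commands taken from this post):
# print(probe("root@cmc.example.net", ["getmodinfo", "getsensorinfo", "racdump"]))
```

The timeout is enforced on the workstation side, so a command that never responds on the CMC is reported as "timeout" instead of blocking the script.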

Screenshots: Initial screen; Update

 

Moderator

 • 

4.4K Posts

March 31st, 2022 05:00

Hello nmoldav,

 

Thank you for the update. I want to check a couple of things with you:

Have you cleared the browser cache?

Have you tried a different browser?

Have you tried from a different workstation?

Is management traffic segregated from production traffic?

 

It sounds like you already have the dump log, but I'm including instructions in case you haven't:

Include the service tag in the file name of the report.

How to create a RAC Dump / Dumplog from a VRTX or M1000e

https://www.dell.com/support/kbdoc/en-us/000144809/how-to-create-a-rac-dump-dumplog-from-a-vrtx-or-m1000e

 

Can you upload it here under the chassis service tag and send me the service tag by private message once it is uploaded?

https://upload.dell.com/

Moderator

 • 

4.4K Posts

April 1st, 2022 05:00

Hello nmoldav,

 

I received the private message with the service tag. Thank you.

 

Is this still occurring on multiple chassis? 

Are there any that do not have this issue? If so, could you pull a log from one of those for comparison?

 

Please include the service tag in the file name of the reports.

 

 

The first log was incomplete. Could you pull a new one? Run both of the following commands:

 

racdump

--then run next command--

dumplogs

 

(see step 3 - https://dell.to/3K5tiIn)

 

 

You can upload them both to the same service tag you have provided already.

5 Posts

April 7th, 2022 14:00

Hi Charles.   The issue was still occurring in only one chassis.  The other two were fixed after the power cycle and upgrade to 2.82.

 

The log was incomplete because the CMC timed out while executing dumplogs.  

Yesterday we power cycled that chassis and now it is working.  We enabled production traffic a few hours ago.  We'll check again in a few days.

A possible reason might have been this log message:

Chassis Management Controller is unable to send power allocation information to Server-1 at priority 1.

We had a power cap of 4000W on the chassis that was failing.  We raised the power cap to 6000W, which is the setting on the rest of the chassis.
