March 25th, 2022 07:00
CMC 6.21 stops working when blades are under load
We have seven M1000e chassis, each with 16x M630 blades, 2x Force10 MXL switches, and dual CMCs running 6.21 firmware. The chassis were set up between August and December 2021. Everything worked fine until we started moving production traffic to the chassis. Once all the blades had an average of 20% CPU load, some of the CMC functions stopped working.
For instance, the initial CMC screen stays at "Initializing" forever, and no components are displayed.
[Screenshot: initial screen, stays at "Initializing" forever]
When trying to reset the CMCs, this happens after clicking on the Troubleshooting link:
[Screenshot: trying to reset the CMCs]
We can log in to the CMC through ssh. Some commands work, some never respond, and some return partial information.
Throughout, the blades, the iDRACs on the blades, and the switches all work properly. Only the CMC malfunctions.
We've tried switching over to the standby CMC and manually removing one CMC from the chassis, but nothing changes. The racreset command does not work: it times out and never responds.
The CMC log on the failing CMCs contains low-memory messages:
SeqNumber = 40844
Message ID = MEM8500
Category = Audit
AgentID = CMC
Severity = Critical
Timestamp = 2021-10-11 05:37:49
Message = Low memory condition detected.
--------------------------------------------------------------------------------
SeqNumber = 40843
Message ID = MEM8501
Category = Audit
AgentID = CMC
Severity = Warning
Timestamp = 2021-10-11 02:50:53
Message Arg 1 = 253968
Message Arg 2 = 25390
Message = Low memory warning, 253968KB, 25390KB.
--------------------------------------------------------------------------------
The date on the message is around the date we started sending production traffic to the blades.
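To compare several chassis quickly, a log dump in the format above could be scanned for these two message IDs. The following is a minimal sketch, assuming the log is available as plain text with `key = value` lines and dashed separators exactly as pasted here; the function names `parse_cmc_log` and `low_memory_events` are made up for illustration and are not part of any Dell tool.

```python
import re

# A separator is a line made only of dashes, as in the pasted log.
SEPARATOR = re.compile(r"^-{5,}\s*$")

def parse_cmc_log(text):
    """Split a 'key = value' log dump into one dict per entry."""
    entries, current = [], {}
    for line in text.splitlines():
        if SEPARATOR.match(line):
            if current:
                entries.append(current)
                current = {}
            continue
        # Split on the first '=' only, so values may contain '='.
        key, sep, value = line.partition("=")
        if sep:
            current[key.strip()] = value.strip()
    if current:
        entries.append(current)
    return entries

def low_memory_events(entries):
    """Return the entries whose Message ID is MEM8500 or MEM8501."""
    return [e for e in entries if e.get("Message ID") in {"MEM8500", "MEM8501"}]

# The two entries quoted in this post, shortened to the fields used below.
LOG_SAMPLE = """\
SeqNumber = 40844
Message ID = MEM8500
Severity = Critical
Timestamp = 2021-10-11 05:37:49
Message = Low memory condition detected.
--------------------------------------------------------------------------------
SeqNumber = 40843
Message ID = MEM8501
Severity = Warning
Timestamp = 2021-10-11 02:50:53
Message = Low memory warning, 253968KB, 25390KB.
--------------------------------------------------------------------------------
"""

for event in low_memory_events(parse_cmc_log(LOG_SAMPLE)):
    print(event["Timestamp"], event["Message ID"], event["Severity"])
```

Running the same scan on a log from a healthy chassis would show whether the low-memory entries appear only on the failing CMCs.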
This is the output of 'getversion':
server-1 2.75.100.75 (02) PowerEdge M630 iDRAC8 Y
server-2 2.75.100.75 (02) PowerEdge M630 iDRAC8 Y
server-3 2.75.100.75 (02) PowerEdge M630 iDRAC8 Y
server-4 2.75.100.75 (02) PowerEdge M630 iDRAC8 Y
server-5 2.75.100.75 (02) PowerEdge M630 iDRAC8 Y
server-6 2.75.100.75 (02) PowerEdge M630 iDRAC8 Y
server-7 2.75.100.75 (02) PowerEdge M630 iDRAC8 Y
server-8 2.75.100.75 (02) PowerEdge M630 iDRAC8 Y
server-9 2.75.100.75 (02) PowerEdge M630 iDRAC8 Y
server-10 2.75.100.75 (02) PowerEdge M630 iDRAC8 Y
server-11 2.75.100.75 (02) PowerEdge M630 iDRAC8 Y
server-12 2.75.100.75 (02) PowerEdge M630 iDRAC8 Y
server-13 2.75.100.75 (02) PowerEdge M630 iDRAC8 Y
server-14 2.75.100.75 (02) PowerEdge M630 iDRAC8 Y
server-15 2.75.100.75 (02) PowerEdge M630 iDRAC8 Y
server-16 2.75.100.75 (02) PowerEdge M630 iDRAC8 Y
switch-1 MXL 10/40GbE A00 9.14(1.3)
switch-2 MXL 10/40GbE A00 9.14(1.3)
The output does not include either CMC. On the chassis that do not have this issue, both CMCs are listed.
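Since the iDRAC firmware level differs between chassis, it may help to list which blades report a version below a chosen minimum. This is a rough sketch, assuming the row layout seen in the pasted output (slot name, version, revision, model); the helper names are invented, and the `server-2` row in the sample is a hypothetical row added only to show a blade that passes the check.

```python
import re

# Assumed row layout, inferred only from the output pasted above:
# <slot> <version> (<revision>) <model> ...
ROW = re.compile(r"^(server-\d+)\s+(\d+(?:\.\d+)+)\s")

def version_tuple(version):
    """Turn '2.75.100.75' into (2, 75, 100, 75) for numeric comparison."""
    return tuple(int(part) for part in version.split("."))

def blades_below(getversion_output, minimum):
    """Return (slot, version) pairs for blades older than `minimum`."""
    floor = version_tuple(minimum)
    outdated = []
    for line in getversion_output.splitlines():
        match = ROW.match(line.strip())
        if match and version_tuple(match.group(2)) < floor:
            outdated.append((match.group(1), match.group(2)))
    return outdated

GETVERSION_SAMPLE = """\
server-1 2.75.100.75 (02) PowerEdge M630 iDRAC8 Y
server-2 2.82.82.82 (02) PowerEdge M630 iDRAC8 Y
switch-1 MXL 10/40GbE A00 9.14(1.3)
"""

for slot, version in blades_below(GETVERSION_SAMPLE, "2.82.82.82"):
    print(f"{slot} is on {version}")
```

Switch rows are skipped on purpose, because their version strings follow a different format.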
Any ideas?
Thanks!
nmoldav
April 13th, 2022 15:00
After about a week, the issue has not happened again. The CMCs in all the chassis are working properly now, after upgrading all iDRACs to 2.82 and setting the power cap to 6000W instead of 4000W.
DELL-Charles R
Moderator
March 25th, 2022 12:00
Hello nmoldav,
Is just this one chassis affected?
Are all chassis identical?
I have seen one previous case where they swapped the CMC and they did not get those errors again.
Can you get a maintenance window and do a complete power drain on the chassis?
Shut down all the blades, shut down the chassis, remove all power for 10 minutes, swap the CMCs' slots, reconnect power, and check the results.
Try updating the iDRAC firmware and check the results.
iDRAC 2.82.82.82
https://dell.to/3wE6yev
This may take us more research.
nmoldav
March 25th, 2022 13:00
Hi Charles. 3 out of 7 chassis have this problem. Of the 4 that work OK, 3 have very little load, and the 4th has all blades on 2.81. We'll try upgrading all blades to 2.82 first, and if that does not work, we'll try powering off the chassis as you suggested. I'll report back in about a week.
nmoldav
March 31st, 2022 05:00
Hi Charles. We did as you suggested - complete power drain and CMC swap in the 3 chassis that had the issue. All the chassis came back up after this. Afterwards we upgraded all the blades to 2.82. Today I checked again and the CMC in one of the three chassis is presenting the issue again.
Now more screens in the GUI and more commands in the CLI work, but many still do not. For instance, the "Update" screen works and so does "Power/Power Configuration", but "Troubleshooting/Reset Components" does not, and neither does "Power/Power Monitoring". In the CLI, getmodinfo, getsensorinfo and racreset do not work (they never respond). racdump shows "General System/RAC Information" but stops after that and times out.
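To record which commands hang on each chassis, the checks above could be run over ssh with a hard timeout. The following is only a sketch under several assumptions: the host name and user are placeholders, the commands are invoked as `racadm <command>` over ssh, and a non-zero exit status is treated as a failure, which may not match how the CMC actually reports errors. racreset is deliberately left out of the example list because it resets the CMC.

```python
import subprocess

def probe(argv, timeout_s=30.0):
    """Run argv and return 'ok', 'error' (non-zero exit), or 'timeout'."""
    try:
        result = subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    return "ok" if result.returncode == 0 else "error"

def probe_cmc(host, commands, user="root", timeout_s=30.0):
    """Probe each racadm command on one CMC; host and user are placeholders."""
    results = {}
    for command in commands:
        # BatchMode makes ssh fail instead of waiting for a password prompt.
        argv = ["ssh", "-o", "BatchMode=yes", f"{user}@{host}", "racadm", command]
        results[command] = probe(argv, timeout_s)
    return results

# Example call (hypothetical host name, read-only commands from this thread):
# print(probe_cmc("cmc-1.example.net", ["getversion", "getmodinfo", "getsensorinfo"]))
```

Comparing the resulting table between a failing and a healthy chassis would show which functions are affected without waiting on each command by hand.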
DELL-Charles R
Moderator
March 31st, 2022 05:00
Hello nmoldav,
Thank you for the update. I want to check a couple of things with you:
Have you cleared browser cache?
Try different browser?
Try from different workstation?
Is management segregated from production traffic?
It sounds like you already have the dump log, but I'm including instructions in case you haven't:
Include the service tag in the file name of the report.
How to create a RAC Dump / Dumplog from a VRTX or M1000e
https://www.dell.com/support/kbdoc/en-us/000144809/how-to-create-a-rac-dump-dumplog-from-a-vrtx-or-m1000e
Can you upload it here under the chassis service tag and send me the service tag by private message once it is uploaded?
https://upload.dell.com/
DELL-Charles R
Moderator
April 1st, 2022 05:00
Hello nmoldav,
I received the private message with the service tag. Thank you.
Is this still occurring on multiple chassis?
Are there any that do not have this issue? If so, could you pull a log from one of those for comparison?
Please include the service tag in the file name of the reports.
The first log was incomplete. Could you pull a new one? Run both of the following commands:
racdump
--then run next command--
dumplogs
(see step 3 - https://dell.to/3K5tiIn)
You can upload them both to the same service tag you have provided already.
nmoldav
April 7th, 2022 14:00
Hi Charles. The issue was occurring in only one chassis. Two were fixed after the power cycle and upgrade to 2.82.
The log was incomplete because the CMC timed out while executing dumplogs.
Yesterday we power cycled that chassis and now it is working. We enabled production traffic a few hours ago. We'll check again in a few days.
A possible reason might have been this log message:
We had a power cap of 4000W in the chassis that was failing. We raised the cap to 6000W, which is the setting on the rest of the chassis.