January 4th, 2023 07:00

M1000e suspected midplane failure: 10Gb "PXE-E61: Media test failure, check cable"

I have an M1000e v1.1 that has been running for a year or two with M620s. Suddenly all blades lost connectivity to both 10Gb switches (MXL 10/40GbE). After a cold boot of the whole chassis the blades came online for a few hours, then all went offline. If I restart any of the blades, the onboard 10Gb NIC complains "PXE-E61: Media test failure, check cable". This is a diskless setup; each blade boots over PXE.

If I log on to either switch, it claims all ports are up.

There are no related errors in any chassis or iDRAC logs. However, I do see this in the switch log:

16:39:41: %STKUNIT0-M:CP %CHMGR-2-MAJOR_TEMP: Major alarm: chassis temperature high (Unit 0 temperature reaches or exceeds threshold of 84C)

However, I think the temperature event was last summer (sadly the log does not have the date). I have not seen any abnormal temperatures in the chassis for months.
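For anyone sifting a saved switch log for these alarms, here is a minimal Python sketch that pulls the fields out of a line like the one quoted above. The pattern is inferred from that single sample rather than from a documented log format, and `parse_temp_alarm` is just an illustrative name. Note that the only timestamp in the line is a time of day, so the date of the event cannot be recovered from the line itself.

```python
import re

# Pattern inferred from the one alarm line quoted in this thread;
# it is an assumption, not a documented log format.
ALARM_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}): "
    r"%STKUNIT(?P<unit>\d+)-M:CP "
    r"%CHMGR-(?P<severity>\d)-MAJOR_TEMP: .*"
    r"threshold of (?P<threshold_c>\d+)C\)"
)

def parse_temp_alarm(line):
    """Return the alarm fields as a dict, or None if the line does not match."""
    m = ALARM_RE.match(line.strip())
    if m is None:
        return None
    return {
        "time": m.group("time"),  # time of day only; the line carries no date
        "unit": int(m.group("unit")),
        "severity": int(m.group("severity")),
        "threshold_c": int(m.group("threshold_c")),
    }

sample = ("16:39:41: %STKUNIT0-M:CP %CHMGR-2-MAJOR_TEMP: Major alarm: "
          "chassis temperature high (Unit 0 temperature reaches or exceeds "
          "threshold of 84C)")
print(parse_temp_alarm(sample))
```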

As a workaround I added a 1Gb switch in channel C, which is working, but I would really like to get the 10Gb links working.

After the last chassis reboot the A1 switch was marked "Unknown", so I removed it. However, none of the blades connected to the remaining switch either, and that switch continues to claim all ports are up. I continue to get "media test failure check cable"; to me it looks like the blade's NIC can't connect to the chassis switch.

Does it look like the midplane is fried? How can I debug what's happening?

Thanks for any help.

Richard. 

M1000e CMC firmware = 5.12

switch = MXL 10/40GbE   firmware 9.5(0.0)

 

 

 

 

Moderator

 • 

4.7K Posts

January 4th, 2023 12:00

Hello dr-murphy,

 

I see a Knowledge Base article that looks like what you are seeing.

It may require you to log in or create an account and log in to view it.

 

Article Number: 000122895 :  https://dell.to/3WJSafe

Highlights are:

 Affected OS :  9.4.0.0, 9.5.0.0

System may configure incorrect fan-speed settings in response to changing system temperatures.

01:41:10: %STKUNIT1-M:CP %CHMGR-2-MAJOR_TEMP: Major alarm: chassis temperature high (Unit 1 temperature reaches or exceeds threshold of 84C) 

Engineering team has identified the root cause and the issue has been fixed in software version 9.5.0.1 onwards.

 

I would recommend updating and checking your results:

Force10 MXL Blade OS0, v9.14.1.14

https://dell.to/3WLoCxL

And update the M1000e chassis CMC to 6.21

https://dell.to/3jK8gql

 

Do you have an active warranty contract? If so, you can contact Support directly and an engineer can do a remote session to work with you.

January 4th, 2023 14:00

It did coincide with some erratic fan speeds during normal temperatures and zero load, so perhaps this is it. Unfortunately I do not have a warranty contract. I have logged in to look at the article but I get "This article is permission based. Find another article".

Anyway, thanks for the information. I will upgrade the switch firmware as suggested.

January 5th, 2023 13:00

I upgraded the switch firmware to 9.14.1.14 and the CMC firmware to 6.21 as suggested. I've reseated the switch and unplugged and replugged power to the chassis. I still get the exact same error: the switch port state reports no pluggable media present for the internal ports (external ones are OK), and the blades report the same for all the 10Gb NICs.

Any ideas on the next step? Is it possible to tell whether the midplane or the switch is faulty? Both switches went offline at the same time, so I suspect the midplane.

 

Moderator

 • 

4.7K Posts

January 5th, 2023 13:00

That could be possible. Was there any power event or anything else that happened at the same time that both switches went offline that makes you suspect the midplane?

 

You don't happen to have another MXL to try?

 

Could you gather the chassis log and upload it for review?

How to generate enclosure logs for CMC/VRTX/FX2

https://www.dell.com/support/kbdoc/en-us/000063818/poweredge-server-how-to-generate-enclosure-logs-for-cmc-vrtx-fx2

racdump

dumplogs

getversion
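If it helps to capture all three outputs in one go, the following Python sketch shows one way it might be done. It assumes each item above is run as a `racadm` subcommand on the CMC over SSH; the host `cmc.example.net`, the `root` user, and the helper names are hypothetical placeholders. Treat it as a starting point and follow the linked article for the supported procedure.

```python
import subprocess

# Subcommands listed above; running them as "racadm <subcommand>" over SSH
# is an assumption of this sketch.
SUBCOMMANDS = ["racdump", "dumplogs", "getversion"]

def build_ssh_commands(host, user="root"):
    """Build one ssh argument list per racadm subcommand."""
    return [["ssh", f"{user}@{host}", "racadm", sub] for sub in SUBCOMMANDS]

def collect_logs(host, user="root"):
    """Run each command and save its output to <subcommand>.txt."""
    for cmd in build_ssh_commands(host, user):
        out_file = f"{cmd[-1]}.txt"
        with open(out_file, "w") as fh:
            subprocess.run(cmd, stdout=fh, check=True)

if __name__ == "__main__":
    # Print the commands only; call collect_logs() to actually run them.
    for cmd in build_ssh_commands("cmc.example.net"):
        print(" ".join(cmd))
```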

 

Upload result here under the service tag of the M1000e chassis:

https://upload.dell.com/

Then please Private Message me the service tag for me to retrieve the log.

 

January 5th, 2023 15:00

Charles,

Many thanks! It actually looks like I misunderstood the interface mapping between the blade NICs and the switches. I had pulled out the switch that went into the "Unknown" state according to the CMC, and the em1 interface must have been trying to connect to that one. Once I re-inserted that switch, the blades reconnected. I have now upgraded the firmware on both switches and appear to have no failed connections. If I log on to a booted blade I can now see that the second NIC interface (em2) is up and connected to the other switch, even though I'm not using it.

Hopefully it is good now, although I'll need to check again in a few hours.

A big thank you for your help!
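The mapping that caused the confusion can be written down as a small Python sketch. It assumes the usual M1000e layout for a dual-port fabric A NIC, where port 1 (em1 here) is wired to the I/O module in slot A1 and port 2 (em2) to slot A2. The interface names are the ones seen in this thread and depend on the OS, and `links_expected_up` is just an illustrative helper.

```python
# Assumed wiring for a dual-port fabric A NIC in an M1000e blade:
# em1 -> I/O module slot A1, em2 -> slot A2.
NIC_TO_IOM = {"em1": "A1", "em2": "A2"}

def links_expected_up(installed_ioms):
    """Return {interface: bool}: does each NIC port have a switch to link to?"""
    return {nic: iom in installed_ioms for nic, iom in NIC_TO_IOM.items()}

# With A1 pulled, em1 has no link partner, so a PXE boot from em1 fails
# with "media test failure" even though the A2 switch is healthy.
print(links_expected_up({"A2"}))        # {'em1': False, 'em2': True}
print(links_expected_up({"A1", "A2"}))  # {'em1': True, 'em2': True}
```

This is why removing the A1 switch could not move the blades over to the remaining switch: a blade PXE-booting from em1 has no path to A2 through that port.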
