1 Rookie
•
9 Posts
1
9339
December 21st, 2022 13:00
R740 Pfault fail-safe voltage is outside of range
Hi, I'm having an issue with one of our water-cooled R740 servers. After a BIOS (2.16.1) and Ethernet card (20.5.16) update, the node refused to turn on. Stripping it down to minimal configuration revealed nothing, so we assumed the problem was with the system board. After replacing the system board and loading the node with minimal config again, it turned out to be a problem with the Ethernet card: the server starts fine without the card but fails to start with it, showing the same pfault error. After realizing this, we replaced the card with another of the same model. This did not help either.
We've tried every configuration we could think of (1 CPU, both CPUs, 1 DIMM in A1, 2 DIMMs in A1 and B1, backplane connected/disconnected, both PSUs replaced with ones from a known working server, reseating every single cable on the board, etc.). Nothing solves the issue, and it always comes back to the Ethernet card. With it, the system just doesn't boot. When I look in the iDRAC web GUI, I can see that the iDRAC does not recognize the state of the CPU, DIMMs and PSUs when the Ethernet card is connected; I'm not sure whether that is related to the issue.
Other steps we took were updating the CPLD, OS Drivers Pack, iDRAC and the OS Collector, draining the node after every component configuration change, and rebooting the iDRAC after it reports the first pfault error. We have not tried the Ethernet card in a different server for fear that it might cause issues there.
We are just really confused as to why the problem seems to stay with the chassis, even though the system board and Ethernet card are now completely different ones.
If anyone has an answer to this, please let me know.
Thanks
fbaritk
1 Rookie
•
9 Posts
1
January 18th, 2023 08:00
I've figured out the problem. First off, if you have a voltage issue (or any other issue that keeps the server from booting) specifically with the Ethernet daughter card on an R740, DON'T keep plugging in replacements, because chances are that the Ethernet firmware on your system board is corrupted. This means that when you plug a new (or refurbished) card in, it will download the firmware that's present on your system board and basically ruin your card.
What you need to do (or at least the solution we've found) is to plug the new replacement card into a server that you know works, so it can download working firmware upon first connection. After that you can plug it into the misbehaving server and be on your way.
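One way to sanity-check the card before moving it back is to compare what it reports in the known-good server against another healthy node. This is only a minimal Python sketch: it assumes you have copied each server's firmware inventory into a plain dict by hand (for example from the iDRAC firmware inventory page). `diff_firmware` is a hypothetical helper, not a Dell tool, and the NDC version strings are placeholders; only the BIOS and iDRAC versions come from this thread.

```python
def diff_firmware(reference, candidate):
    """Return {component: (reference_version, candidate_version)} for every mismatch.

    A component missing from one side is reported with None for that side.
    """
    mismatches = {}
    for name in sorted(set(reference) | set(candidate)):
        ref_ver = reference.get(name)
        cand_ver = candidate.get(name)
        if ref_ver != cand_ver:
            mismatches[name] = (ref_ver, cand_ver)
    return mismatches


# Hypothetical inventories typed in by hand; the NDC entries are placeholders.
healthy_node = {"BIOS": "2.16.1", "iDRAC": "6.00.30.00", "NDC": "good-version"}
card_in_good_server = {"BIOS": "2.16.1", "iDRAC": "6.00.30.00", "NDC": "other-version"}

for component, (expected, found) in diff_firmware(healthy_node, card_in_good_server).items():
    print(f"{component}: expected {expected}, found {found}")
```

If the helper returns an empty dict, the card reports the same versions as the healthy node, which is at least consistent with it having picked up working firmware.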
DELL-Young E
Moderator
•
5.1K Posts
0
December 21st, 2022 19:00
Hello, as far as I know, water cooling is supported from the R750 onward, not on the R740.
https://dell.to/3WfqrTe
fbaritk
1 Rookie
•
9 Posts
0
December 22nd, 2022 05:00
We have 80 R740s that are water-cooled from factory. This is not the issue.
DiegoLopez
4 Operator
•
2.7K Posts
0
December 22nd, 2022 08:00
Hello @fbaritk,
Well, honestly, you have tried most things that can be done on the server: replacing the motherboard, minimal configuration, replacing the network card... Yes, there might be some more components that can be replaced, like the control panel or others. But I would make sure you have a case open with support and have them check the logs. Maybe there is something that is not visible in the iDRAC/Lifecycle Controller.
Do you already have a case open? Or did you perform all the changes by yourself? Also, is this happening in only one of the 80 servers you have? If so, it has to be a hardware error on a component you have not replaced yet.
Regards.
fbaritk
1 Rookie
•
9 Posts
0
December 22nd, 2022 08:00
The server is no longer under support from Dell so we are fixing it ourselves. None of the other servers are exhibiting this issue. We know the hardware error comes from the ethernet card, which is the only component that can reliably be pinpointed as the culprit, as the server POSTs fine without it, but doesn't when it's connected. However the card was replaced and the problem persists. Is there something else I'm missing?
DELL-Chris H
Moderator
•
9.5K Posts
0
December 22nd, 2022 09:00
Fbaritk,
Have you by chance tried clearing the hardware log? I ask because it may be seeing the original error when reinstalling.
Also, where are you connecting the card, in the expansion card riser?
I ask as there may be an issue with the riser.
If you happen to have another system with a known good riser, you may try swapping them and seeing whether the error is resolved or follows the riser to the other system.
Let us know.
fbaritk
1 Rookie
•
9 Posts
0
December 22nd, 2022 10:00
Do you mean the System Event Log? I've tried clearing it after every step of troubleshooting; it doesn't seem to have made a difference. The card is connected to the NDC slot on the R740 motherboard. As far as I can tell, there is no riser between the Ethernet card and the motherboard. This is what it looks like:
fbaritk
1 Rookie
•
9 Posts
0
December 22nd, 2022 11:00
I can try clearing the LCC log as well. All the firmware is up-to-date, BIOS is at 2.16.1 and the iDRAC at 6.00.30.00.
DELL-Chris H
Moderator
•
9.5K Posts
0
December 22nd, 2022 11:00
Fbaritk,
I was referring to the iDRAC/LCC hardware log, and thank you for the details.
Would you also confirm whether the server is up to date on BIOS, iDRAC, etc.?
Praveen.Singh
3 Apprentice
•
482 Posts
0
December 22nd, 2022 21:00
Please follow the action plan below and let us know if you see any changes.
1. Check the LC logs for any changes or issues.
2. Break the system down to minimum config.
3. Observe the server.
4. If the issue repeats, clear the NVRAM.
5. The latest TSR will help us get more info.
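Step 1 can be partly automated once the Lifecycle Controller log has been exported to text. This is a rough Python sketch, assuming one log entry per line. The sample entries are invented apart from the pfault message quoted in this thread, and `find_voltage_faults` is a hypothetical helper, so a real export will need its own parsing.

```python
import re

# Matches the error text quoted in this thread, case-insensitively.
PFAULT_PATTERN = re.compile(r"pfault fail-safe voltage is outside of range", re.IGNORECASE)


def find_voltage_faults(lines):
    """Return (line_number, text) pairs for entries that mention the pfault error."""
    return [
        (number, line.strip())
        for number, line in enumerate(lines, start=1)
        if PFAULT_PATTERN.search(line)
    ]


# Invented sample entries; a real export will be formatted differently.
sample_log = [
    "System is turning on.",
    "The system board Pfault fail-safe voltage is outside of range.",
    "System is turning off.",
]

for number, text in find_voltage_faults(sample_log):
    print(f"line {number}: {text}")
```

Running the same filter on logs taken with and without the Ethernet card installed shows whether the error appears only when the card is present.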
fbaritk
1 Rookie
•
9 Posts
0
January 3rd, 2023 12:00
Hi, I've performed the steps as you said, but I've not had any success. I cleared the NVRAM using the jumper method while the server was at minimal configuration. After that it booted fine. Then I added the Ethernet card and the server reported the same "pfault fail-safe voltage is outside of range" error. I'm sending two TSRs: one taken with the node reported healthy, the other taken with the Ethernet card connected and the system reporting the voltage issue.
https://we.tl/t-pLCw5DbSHM