Unsolved
May 26th, 2023 07:00
Multiple DIMM failures - PowerEdge R740xd
We have several PowerEdge R740xd servers in our environment. One of these servers has experienced three DIMM failures in the past six months, which IMO is highly irregular.
When the DIMMs fail, the server crashes and generates the following system log events:
> Multi-bit memory errors are detected on the memory device at location(s) DIMM_B5. Immediately replace the DIMM.
> The system memory has uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_B5. Immediately replace the DIMM.
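For anyone tracking this across an exported log, here is a rough Python sketch that tallies which DIMM slots show up in events like the two above. It assumes the log is available as plain text lines and that slot labels always look like `DIMM_B5`; the function names `tally_dimm_slots` and `slots_by_letter` are made up for illustration and are not part of any Dell tool.

```python
import re
from collections import Counter

# Assumption: slot labels look like DIMM_B5, as in the events quoted above.
DIMM_SLOT = re.compile(r"\bDIMM_([A-Z])(\d+)\b")


def tally_dimm_slots(log_lines):
    """Count how many log lines mention each DIMM slot."""
    counts = Counter()
    for line in log_lines:
        # A set ensures a slot named twice in one event is counted once.
        for slot in {m.group(0) for m in DIMM_SLOT.finditer(line)}:
            counts[slot] += 1
    return counts


def slots_by_letter(counts):
    """Group slot names by their letter (A or B in the slots named here)."""
    groups = {}
    for slot in counts:
        # Index 5 is the letter right after the "DIMM_" prefix.
        groups.setdefault(slot[5], []).append(slot)
    return {letter: sorted(slots) for letter, slots in groups.items()}


if __name__ == "__main__":
    events = [
        "Multi-bit memory errors are detected on the memory device at "
        "location(s) DIMM_B5. Immediately replace the DIMM.",
        "The system memory has uncorrectable multi-bit memory errors in the "
        "non-execution path of a memory device at the location DIMM_B5. "
        "Immediately replace the DIMM.",
    ]
    print(tally_dimm_slots(events))
```

Feeding it the events from all three failures would make it easy to see at a glance whether the affected slots cluster under one letter or spread across both.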
The failures have impacted DIMMs in sockets A2, A4, and B5. As such, the failures are not isolated to a CPU-specific set of sockets (this is a 2x CPU system). Failures have been with different DIMMs/sockets each time (no repeat failures on the same socket and/or DIMM).
Replacing the DIMMs resolves the issue, but we suspect there is a larger issue.
Curious if anyone has experienced a similar issue and/or has any thoughts on what (else) to replace. Maybe motherboard?
DELL-Chris H
Moderator
May 26th, 2023 12:00
eb5915,
I would start by verifying that the server is current on BIOS, iDRAC, RAID controller, and other firmware, as being up to date decreases the chance of spurious errors. After updating, if a DIMM error occurs again, I would suggest swapping that DIMM with another one in the server before replacing it, and then seeing whether the error follows the DIMM or stays with the slot.
Lastly, have either the DIMMs or the processors been replaced or upgraded since the server was ordered?
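To make the swap test concrete, here is a minimal Python sketch of how the outcome could be read. The function name `classify_swap_result`, its parameters, and the four labels are hypothetical, and the suggested next steps in the comments are general troubleshooting assumptions rather than official Dell guidance.

```python
def classify_swap_result(failed_slot, swap_slot, next_error_slot):
    """Interpret a DIMM swap test.

    failed_slot:     slot that originally reported the error, e.g. "DIMM_B5"
    swap_slot:       slot the suspect DIMM was moved into
    next_error_slot: slot named in the next error, or None if none occurred
    """
    if failed_slot == swap_slot:
        raise ValueError("the DIMM must be moved to a different slot")
    if next_error_slot is None:
        # No new error yet, so the test says nothing either way.
        return "inconclusive"
    if next_error_slot == swap_slot:
        # The error moved with the module (assumption: the module is suspect).
        return "follows_dimm"
    if next_error_slot == failed_slot:
        # The error stayed put (assumption: slot, board, or CPU is suspect).
        return "follows_slot"
    # A third slot is involved, which the swap test alone cannot explain.
    return "other"


if __name__ == "__main__":
    # Hypothetical example: the DIMM from B5 was moved to B6.
    print(classify_swap_result("DIMM_B5", "DIMM_B6", "DIMM_B6"))
```

With failures landing on a different slot each time, as described in the original post, the "other" outcome is a realistic possibility and would itself be worth reporting back.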
Let me know what you see.