Unsolved
July 29th, 2022 09:00
ECC error threshold
Hello,
I am watching for bursts of ECC errors on several models of Dell PowerEdge servers running VMware in our data center.
I am using the Dell racadm tool on the ESX hosts.
I have heard that a burst can sometimes run to thousands or even millions of errors. Is that right?
I came across this post
https://www.dell.com/support/kbdoc/en-us/000052877/vxrack-node-experiencing-correctable-ecc-errors
which says
" The thresholds set in the BIOS are easily exceeded by these bursts of ECC errors.
- With the current BIOS setting, an ECC error is reported in the SEL logs, once for every 10 occurrences of the correctable ECC error on a DIMM.
- The reporting to the SEL logs gets turned off after 10 such occurrences. So after a total of 100 such errors reported on a DIMM the reporting is turned off. "
So my question is: if reporting stops after just 10 SEL entries (which is 100 errors internally), how will I get the correct count of correctable ECC errors?
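The arithmetic in the quoted passage can be sketched roughly as follows. This is only a model of the behavior as the KB article describes it (one SEL entry per 10 correctable errors, reporting off after 10 entries), not a statement of how any particular BIOS version works. The function names are made up for illustration.

```python
# Model of the reporting behavior described in the quoted KB text.
# Both constants come from that quote; actual values may differ per BIOS.
ERRORS_PER_SEL_ENTRY = 10   # one SEL entry per 10 correctable errors on a DIMM
MAX_SEL_ENTRIES = 10        # reporting is turned off after 10 such entries


def sel_entries_logged(true_errors: int) -> int:
    """SEL entries a DIMM would produce for a given real error count."""
    return min(true_errors // ERRORS_PER_SEL_ENTRY, MAX_SEL_ENTRIES)


def errors_accounted_for(true_errors: int) -> int:
    """Errors that can be inferred from the SEL entries alone."""
    return sel_entries_logged(true_errors) * ERRORS_PER_SEL_ENTRY


if __name__ == "__main__":
    for n in (5, 42, 100, 1_000_000):
        print(n, sel_entries_logged(n), errors_accounted_for(n))
```

Under this model the SEL can never show more than 100 errors per DIMM, so a burst of a million errors and a burst of exactly 100 look the same in the log.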
Also, I don't want to lose any ECC error events.
The buffer for iDRAC SEL reporting is limited in size, right? If it fills up, how do I clear it immediately so that I can watch for future events without losing any? The errors come and go. I would like to keep track of them to understand the behavior and spot a possible failure.
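One way to keep track of the events outside the SEL is to poll it and tally the correctable-error entries per DIMM. Below is a minimal sketch, assuming `racadm getsel` prints records with a `Description:` line as in the sample. The sample text and its message wording are illustrative, not copied from a real server, so the patterns may need adjusting for your firmware.

```python
import re
import subprocess
from collections import Counter

# Illustrative sample only: layout and message wording are assumptions.
SAMPLE_SEL = """\
Record:      1
Date/Time:   07/29/2022 08:00:00
Source:      system
Severity:    Ok
Description: Log cleared.

Record:      2
Date/Time:   07/29/2022 08:15:02
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_A1.

Record:      3
Date/Time:   07/29/2022 08:20:41
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B3.

Record:      4
Date/Time:   07/29/2022 08:31:10
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B2.

Record:      5
Date/Time:   07/29/2022 09:02:17
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_A1.
"""

# \b keeps "Uncorrectable" from matching, since there is no word
# boundary between "Un" and "correctable".
CORRECTABLE = re.compile(r"\bcorrectable\b", re.IGNORECASE)
DIMM = re.compile(r"DIMM_?[A-Z]\d+")


def read_sel() -> str:
    """Return the SEL text from the local racadm tool (not called below)."""
    result = subprocess.run(
        ["racadm", "getsel"], capture_output=True, text=True, check=True
    )
    return result.stdout


def count_correctable_by_dimm(sel_text: str) -> Counter:
    """Count SEL entries that mention a correctable error, per DIMM."""
    counts = Counter()
    for line in sel_text.splitlines():
        if not line.startswith("Description:"):
            continue
        description = line.split(":", 1)[1].strip()
        if not CORRECTABLE.search(description):
            continue
        # Entries with no recognizable DIMM label are grouped as "unknown".
        for dimm in DIMM.findall(description) or ["unknown"]:
            counts[dimm] += 1
    return counts


if __name__ == "__main__":
    print(count_correctable_by_dimm(SAMPLE_SEL))
```

On a host, `count_correctable_by_dimm(read_sel())` could be run on a schedule and the result saved before the log is cleared with `racadm clrsel`. Each entry is a threshold event rather than a single error, so the tally is a lower bound, and an event that arrives between the read and the clear could still be missed.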
The threshold mentioned in the post is 500 per day or something like that. If reporting stops at 100 events, how will I know the exact count?
Thank you
DELL-Charles R
Moderator
July 29th, 2022 13:00
Hello upceo,
I hope this information is helpful. Dell PowerEdge BIOS updates will change the "Correctable Error Logging" BIOS setting to disabled by default. Customers who want to keep seeing correctable memory threshold events can re-enable this BIOS option: https://dell.to/3Q6JYly
These are a few more resources you may be interested in related to memory:
KB Article Number: 000194574: https://dell.to/3Q6JYly
14G whitepaper: https://dell.to/3Q4sG8p
15G whitepaper: https://dell.to/3Q57wXT
upceo
August 2nd, 2022 23:00
Hi Dell-Charles R,
Thanks for the reply. If the "Correctable Error Logging" BIOS setting is disabled by default, does that mean correctable errors can be ignored? I cannot reboot the servers often.
After how many correctable errors should I treat it as an alert to replace the DIMM, since it might indicate that the module is failing?
Thanks
Dell-Martin S
Moderator
August 3rd, 2022 02:00
Hi,
Correctable errors can be ignored. The only thing you can try is updating your firmware to reduce these failures.
Regards Martin
upceo
August 3rd, 2022 10:00
Hi Dell-Martin S,
I see. Please don't mind if I ask for clarification. My concern is this:
If a high number of correctable errors occur too quickly, don't they turn into uncorrectable errors at some point, which can cause a crash or data corruption?
DELL-Chris H
Moderator
August 3rd, 2022 12:00
Upceo,
That wouldn't be the case. If a large string of correctable ECC errors reaches the point where there is a response, that response is to suspend logging of the errors until cleared. As Martin stated, though, keeping the server updated will minimize the occurrence.