1 Rookie

 • 

19 Posts


July 29th, 2022 09:00

ECC error threshold

Hello,

I am watching for bursts of ECC errors on different models of Dell PowerEdge servers running VMware in our data center.

I am using the Dell racadm tool on the ESX hosts.

I have heard that a burst can sometimes run to thousands or even millions of errors. Is that right?

I came across this post

https://www.dell.com/support/kbdoc/en-us/000052877/vxrack-node-experiencing-correctable-ecc-errors

which says 

   " The thresholds set in the BIOS are easily exceeded by these bursts of ECC errors.

  1. With the current BIOS setting, an ECC error is reported in the SEL logs, once for every 10 occurrences of the  correctable ECC error on a DIMM.
  2. The reporting to the SEL logs gets turned off after 10 such occurrences. So after a total of 100 such errors reported on a DIMM the reporting is turned off.   "

So my question is: if reporting stops after just 10 SEL entries (which is 100 errors internally), how will I get the correct count of correctable ECC errors?

Also, I don't want to lose any ECC error events.

The buffer for iDRAC SEL reporting is limited in size, right? If it fills up, how do I clear it immediately so that I can watch for future events without losing any? The errors come and go. I would like to keep tabs on them to understand the behavior and spot a possible failure.

The threshold mentioned in the post is 500 per day or something like that. If reporting stops at 100 errors, how will I know the exact count?
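
To make the arithmetic behind my question concrete, here is a minimal Python sketch of the two rules quoted from the KB article (one SEL entry per 10 correctable errors on a DIMM, reporting turned off after 10 such entries). The function name, the constants, and the DIMM labels are hypothetical and only model the quoted text. This is not how the BIOS or iDRAC is actually implemented.

```python
# Rough model of the two rules quoted from the KB article (an assumption,
# not the real BIOS logic).
ERRORS_PER_SEL_ENTRY = 10      # one SEL entry per 10 correctable errors on a DIMM
MAX_SEL_ENTRIES_PER_DIMM = 10  # reporting is turned off after 10 such entries


def simulate_sel_reporting(errors_per_dimm):
    """Return {dimm: (sel_entries, errors_covered, errors_not_reported)}."""
    result = {}
    for dimm, errors in errors_per_dimm.items():
        # Each complete group of 10 errors yields one entry, capped at 10 entries.
        entries = min(errors // ERRORS_PER_SEL_ENTRY, MAX_SEL_ENTRIES_PER_DIMM)
        covered = entries * ERRORS_PER_SEL_ENTRY
        result[dimm] = (entries, covered, errors - covered)
    return result


if __name__ == "__main__":
    # Hypothetical DIMM labels and error counts, for illustration only.
    bursts = {"A1": 95, "B2": 1_000_000}
    for dimm, (entries, covered, missed) in simulate_sel_reporting(bursts).items():
        print(f"{dimm}: {entries} SEL entries covering {covered} errors, "
              f"{missed} errors not reflected in the SEL")
```

Under this model, a burst of a million errors on one DIMM and a burst of a few hundred look the same in the SEL, which is why I am asking how to get the exact count.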

Thank you

 

 

 

 

Moderator

 • 

4.5K Posts

July 29th, 2022 13:00

Hello upceo,

 

I hope this information will be helpful. Dell PowerEdge BIOS updates will change the "Correctable Error Logging" BIOS setting to disabled by default. Customers who want to continue seeing correctable memory threshold events can re-enable this BIOS option: https://dell.to/3Q6JYly

 

These are a few more resources you may be interested in related to memory:

KB Article Number: 000194574: https://dell.to/3Q6JYly

14G whitepaper: https://dell.to/3Q4sG8p

15G whitepaper: https://dell.to/3Q57wXT

 

1 Rookie

 • 

19 Posts

August 2nd, 2022 23:00

Hi Dell-Charles R,

Thanks for the reply. If the "Correctable Error Logging" BIOS setting is disabled by default, does that mean correctable errors can be ignored? I cannot reboot the servers often.

After how many correctable errors should I treat it as an alert to replace the DIMM, since it might indicate that the module is failing?

Thanks

Moderator

 • 

3.5K Posts

August 3rd, 2022 02:00

Hi,

Correctable errors can be ignored. The only thing you can try is updating your firmware to reduce these failures.

 

Regards, Martin

1 Rookie

 • 

19 Posts

August 3rd, 2022 10:00

Hi Dell-Martin S,

I see. Please don't mind if I ask for clarity. My concern is this:

If a high number of correctable errors occur too quickly, don't they turn into an uncorrectable error at some point, which can cause a crash or data corruption?

Moderator

 • 

9.5K Posts

August 3rd, 2022 12:00

Upceo,

 

That wouldn't be the case. If a long run of correctable ECC errors reaches the point where a response is triggered, that response is that logging of the errors is suspended until it is cleared. As Martin stated, though, keeping the server updated will minimize the occurrence.

 

 

 
