New study of DRAM errors and ECC memory

AlpineCarver
Posts: 109
Joined: Tue Sep 07, 2004 12:21 pm
Location: USA

New study of DRAM errors and ECC memory

Post by AlpineCarver » Thu Oct 22, 2009 9:37 pm

here's a study of memory error rates which was conducted recently:
http://www.cs.toronto.edu/~bianca/paper ... rics09.pdf

it claims to be "the first large-scale study of DRAM memory errors in the field," and it "covers the majority of machines in Google's fleet and spans nearly 2.5 yrs," which adds up to "many millions of DIMM days."

here are some excerpts from their conclusions:
  • the incidence of memory errors was much higher than was found by previous (smaller scale) studies.
  • about a third of machines have ECC-correctable memory errors in a given year.
  • a DIMM that has one correctable error is much more likely to have additional errors.
  • error rates increase as components age.
i'm not sure if this will make anyone more or less inclined to use ECC memory and ECC-compatible motherboards in their builds. i've been using them for most of mine lately. the difference in price typically has been a very small fraction of the overall system price. i figure it may save me some crashes or undetected data corruption, and it may extend the longevity of my systems.

funklizard
Posts: 40
Joined: Wed Aug 26, 2009 1:09 pm
Location: Fairfax, VA

Re: New study of DRAM errors and ECC memory

Post by funklizard » Thu Oct 22, 2009 10:01 pm

AlpineCarver wrote:i'm not sure if this will make anyone more or less inclined to use ECC memory and ECC-compatible motherboards in their builds.
I built a dual Opteron workstation over four years ago using ECC memory. I might never use ECC memory in a machine again.

This machine has 8 DIMMs. I have had to replace three of them over its lifespan to date. These were memory modules from a couple of well known major manufacturers (who replaced the problem sticks per their warranties without issue). And diagnosing the problem DIMM was always quite difficult. memtest86+ invariably proved useless. The only thing that worked was to swap them out and run the machine for a while to see if it would still fall over.

In general, I don't think ECC DIMMs are manufactured to any higher standard of quality than non-ECC DIMMs. And I am suspicious that problems in ECC DIMMs are harder to diagnose due to some of the resulting errors being corrected. I think some of the conclusions from this study lend credence to that suspicion.

So ECC isn't going to save you from bad DIMMs. And for my purposes, I'd like to be able to isolate a bad DIMM as quickly and as easily as possible.

Olle P
Posts: 711
Joined: Tue Nov 04, 2008 6:03 am
Location: Sweden

Re: New study of DRAM errors and ECC memory

Post by Olle P » Fri Oct 23, 2009 4:48 am

funklizard wrote:I built a dual Opteron workstation over four years ago using ECC memory.
This machine has 8 DIMMs. I have had to replace three of them over its lifespan to date.
In the servers studied, all DIMMs on which uncorrectable errors were detected were replaced promptly.
The average life span of a DIMM was two years, so your replacement of only three DIMMs in four years says you were lucky?
funklizard wrote:I am suspicious that problems in ECC DIMMs are harder to diagnose due to some of the resulting errors being corrected. I think some of the conclusions from this study lend credence to that suspicion.
The study shows that most initial errors are correctable. The uncorrectable errors follow later on.
This gives the user time to find and replace the troublesome DIMM before any uncorrectable errors occur. (Provided, of course, that there's some sort of supervision that logs and reports all errors. Highly unusual for home users.)
funklizard wrote:So ECC isn't going to save you from bad DIMMs. And for my purposes, I'd like to be able to isolate a bad DIMM as quickly and as easily as possible.
To me the question is: How do you want to find out that there's a bad DIMM? By having uncorrected errors, or by having a reporting tool tell you that some corrected errors occurred?

As I see it, from the less demanding home user's point of view, using ECC RAM will provide a stable run over a longer period of time, since most errors are corrected before it gets really bad. Non-ECC RAM will give you uncorrected errors from the moment errors start occurring.

Cheers
Olle

funklizard
Posts: 40
Joined: Wed Aug 26, 2009 1:09 pm
Location: Fairfax, VA

Re: New study of DRAM errors and ECC memory

Post by funklizard » Fri Oct 23, 2009 10:18 am

Olle P wrote:
funklizard wrote:I built a dual Opteron workstation over four years ago using ECC memory.
This machine has 8 DIMMs. I have had to replace three of them over its lifespan to date.
In the servers studied, all DIMMs on which uncorrectable errors were detected were replaced promptly.
The average life span of a DIMM was two years, so your replacement of only three DIMMs in four years says you were lucky?
Maybe. However, this failure incidence is higher than any I've experienced over the lifetime of any other machine I've owned. Granted, this sample set is minuscule compared to what the study looked at.
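
As a rough back-of-the-envelope check on the figures in this exchange, the replacement rate for this one machine can be put in per-DIMM-year terms. This is only a sketch: it assumes all 8 slots were populated the whole time and rounds "over four years" down to exactly 4.0 years, and a single workstation says little about a fleet.

```python
def replacements_per_dimm_year(replaced, dimm_count, years):
    """Replacements divided by total DIMM-years of service."""
    dimm_years = dimm_count * years
    return replaced / dimm_years

# Figures from the post above; years=4.0 is an assumed round number.
rate = replacements_per_dimm_year(replaced=3, dimm_count=8, years=4.0)
print(f"{rate:.3f} replacements per DIMM-year")
print(f"about one replacement per {1 / rate:.1f} DIMM-years")
```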

But I'm suspicious of special factors that may have led to an average 2 year lifespan for DIMMs. Consider this: if 2 years really was the average lifespan for a DIMM, would lifetime warranties on these parts really be so commonplace in the industry?
Olle P wrote:
funklizard wrote:I am suspicious that problems in ECC DIMMs are harder to diagnose due to some of the resulting errors being corrected. I think some of the conclusions from this study lend credence to that suspicion.
The study shows that most initial errors are correctable. The uncorrectable errors follow later on.
This gives the user time to find and replace the troublesome DIMM before any uncorrectable errors occur. (Provided, of course, that there's some sort of supervision that logs and reports all errors. Highly unusual for home users.)
My system has been running Fedora Linux. I don't know the extent of its normal capability to report this sort of thing; but I would have expected some sort of error report if the hardware were acting goofy. Unfortunately, in each case, my first indication of a problem was random system freezes.
Olle P wrote:
funklizard wrote:So ECC isn't going to save you from bad DIMMs. And for my purposes, I'd like to be able to isolate a bad DIMM as quickly and as easily as possible.
To me the question is: How do you want to find out that there's a bad DIMM? By having uncorrected errors, or by having a reporting tool tell you that some corrected errors occurred?
It's a nice idea; similar to how SMART can often tell you when a hard drive is nosing into a downward spiral. Unfortunately, this scenario just doesn't reflect my personal experience with ECC memory. Maybe I'll give it another shot when the reporting tools available to me have gotten better.

Olle P
Posts: 711
Joined: Tue Nov 04, 2008 6:03 am
Location: Sweden

Re: New study of DRAM errors and ECC memory

Post by Olle P » Tue Oct 27, 2009 2:11 am

funklizard wrote:Consider this: if 2 years really was the average lifespan for a DIMM, would lifetime warranties on these parts really be so commonplace in the industry?
(Somewhat contradicting my previous statement:)
1: The lifetime was shortened by the swift replacement after detected errors. I would expect most users to not detect most of these errors in the first place, and even if detected not count the DIMM as faulty.

2: "Lifetime" warranty makes me a bit suspicious. It all depends on interpretation. One way of interpreting it is to say: "So, the DIMM is dead? Then its life, and thus the lifetime warranty, is over!"
Being a bit less harsh one can still argue that it's only normal and expected behaviour from an older DIMM to cause data errors once in a while.
funklizard wrote:My system has been running Fedora Linux. I don't know the extent of its normal capability to report this sort of thing; but I would have expected some sort of error report if the hardware were acting goofy.
In the test there were very few hardware problems. Just corrupted data in the databases, which is much worse if the data is, for example, the amount of money in a bank account.
The OS can't tell if the provided input data is correct or not.
funklizard wrote:Unfortunately, in each case, my first indication of a problem was random system freezes.
Which suggests your DIMMs were in very bad shape.
Even in the test quite a few DIMMs were replaced early on, but those that "survived" the first couple of months without problems usually lasted the entire test period.

Cheers
Olle

funklizard
Posts: 40
Joined: Wed Aug 26, 2009 1:09 pm
Location: Fairfax, VA

Re: New study of DRAM errors and ECC memory

Post by funklizard » Tue Oct 27, 2009 8:45 am

Olle P wrote:2: "Lifetime" warranty makes me a bit suspicious. It all depends on interpretation. One way of interpreting it is to say: "So, the DIMM is dead? Then its life, and thus the lifetime warranty, is over!"
This is just disconnected from reality.

A manufacturer who operated under this interpretation would be subject to legal action. A reputable firm would not engage in such silliness.

The precise terms of a "lifetime" warranty do vary; but such deviation from the conventional meaning of the term is not a game that will be played by a manufacturer that cares a whit about its brand.
Olle P wrote:Being a bit less harsh one can still argue that it's only normal and expected behaviour from an older DIMM to cause data errors once in a while.
Have you known of any memory manufacturer to make this argument?

Jay_S
*Lifetime Patron*
Posts: 715
Joined: Fri Feb 10, 2006 2:50 pm
Location: Milwaukee, WI

Re: New study of DRAM errors and ECC memory

Post by Jay_S » Tue Oct 27, 2009 12:11 pm

funklizard wrote:My system has been running Fedora Linux. I don't know the extent of its normal capability to report this sort of thing; but I would have expected some sort of error report if the hardware were acting goofy.
If you're still using that Opteron w/ECC, you might check out edac and edac-util: http://bluesmoke.sourceforge.net/
It seems to read whatever "log" the memory controller has. I agree that this needs to get more user-friendly, à la SMART reporting tools.
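
For anyone who wants the raw counters without extra tooling, here is a minimal Python sketch that reads the corrected/uncorrected error counts the Linux EDAC driver exposes through sysfs. It assumes the usual `/sys/devices/system/edac/mc/mc*/ce_count` and `ue_count` layout; whether those files exist depends on the kernel having an EDAC driver loaded for your memory controller. The function name and the `root` parameter are made up for illustration.

```python
from pathlib import Path

def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from EDAC sysfs counters.

    `root` is a parameter only so the function can be pointed at a test
    directory; the default is the assumed standard sysfs location.
    """
    counts = {}
    base = Path(root)
    if not base.is_dir():
        return counts  # no EDAC driver loaded, or not a Linux system
    for mc in sorted(base.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters are unreadable
        counts[mc.name] = (ce, ue)
    return counts

for name, (ce, ue) in read_edac_counts().items():
    print(f"{name}: corrected={ce} uncorrected={ue}")
```

A rising corrected count on one controller is the kind of early warning discussed earlier in the thread; run it from cron and compare against the previous reading.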

Olle P
Posts: 711
Joined: Tue Nov 04, 2008 6:03 am
Location: Sweden

Re: New study of DRAM errors and ECC memory

Post by Olle P » Tue Oct 27, 2009 1:16 pm

funklizard wrote:Have you known of any memory manufacturer to make this argument?
No, I've never had any problems with RAM in the first place.

Cheers
Olle

MiKeLezZ
Posts: 110
Joined: Sun Feb 20, 2005 8:00 am
Location: ITALY

Re: New study of DRAM errors and ECC memory

Post by MiKeLezZ » Sat Nov 07, 2009 4:05 pm

AlpineCarver wrote:here's a study of memory error rates which was conducted recently:
http://www.cs.toronto.edu/~bianca/paper ... rics09.pdf

it claims to be "the first large-scale study of DRAM memory errors in the field," and it "covers the majority of machines in Google's fleet and spans nearly 2.5 yrs," which adds up to "many millions of DIMM days."

here are some excerpts from their conclusions:
  • the incidence of memory errors was much higher than was found by previous (smaller scale) studies.
  • about a third of machines have ECC-correctable memory errors in a given year.
  • a DIMM that has one correctable error is much more likely to have additional errors.
  • error rates increase as components age.
i'm not sure if this will make anyone more or less inclined to use ECC memory and ECC-compatible motherboards in their builds. i've been using them for most of mine lately. the difference in price typically has been a very small fraction of the overall system price. i figure it may save me some crashes or undetected data corruption, and it may extend the longevity of my systems.
I want to add just one thing:
- We have larger HD
- We have larger DRAM
- We move larger amount of DATA

The result is:
- We have more errors to deal with. Nowadays even an error rate of 1 every 1000GB of data is not acceptable (but it was more than ok in the '90s).

In my opinion ECC DRAM is not a choice, it should be a necessity for all of us. If everyone used it, it would cost just 12% more than regular DRAM (nothing special: for the unbuffered, they just add a chip...).
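
The "12%" figure presumably comes from the extra bits: a typical ECC DIMM is 72 bits wide instead of 64. A small Python sketch of where the 8 extra bits come from, assuming a Hamming-style single-error-correct, double-error-detect (SEC-DED) code; actual memory controllers may use other codes, and the function name is just for illustration.

```python
def secded_check_bits(data_bits):
    """Check bits needed for a Hamming SEC-DED code over `data_bits` bits."""
    r = 0
    # Hamming bound for single-error correction: 2^r >= data + r + 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1  # one extra overall-parity bit adds double-error detection

data_bits = 64
check_bits = secded_check_bits(data_bits)
print(check_bits, f"{check_bits / data_bits:.1%}")  # prints: 8 12.5%
```

That is raw storage overhead only; it says nothing about what the modules would actually sell for.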

jimj
Posts: 2
Joined: Fri Nov 27, 2009 7:00 pm
Location: Sioux Falls SD

Re: New study of DRAM errors and ECC memory

Post by jimj » Fri Nov 27, 2009 7:14 pm

Jay_S wrote: If you're still using that Opteron w/ECC, you might check out edac and edac-util
It seems to read whatever "log" the memory controller has. I agree that this needs to get more user-friendly, à la SMART reporting tools.
Won't 'cat /var/log/mcelog' show you ECC errors? It does on my Fedora 12 box.

zads
Posts: 12
Joined: Mon Nov 30, 2009 12:19 am
Location: San Jose, California

Re: New study of DRAM errors and ECC memory

Post by zads » Mon Nov 30, 2009 12:38 am

MiKeLezZ wrote:I want to add just one thing:
- We have larger HD
- We have larger DRAM
- We move larger amount of DATA

The result is:
- We have more errors to deal with. Nowadays even an error rate of 1 every 1000GB of data is not acceptable (but it was more than ok in the '90s).

In my opinion ECC DRAM is not a choice, it should be a necessity for all of us. If everyone used it, it would cost just 12% more than regular DRAM (nothing special: for the unbuffered, they just add a chip...).
There is talk in the industry of bringing ECC to the mainstream as a standard, but we will have a slightly different implementation..
4GB UDIMMs are starting to push some limits

colm
Posts: 409
Joined: Tue Jan 31, 2006 8:22 am
Location: maine

Post by colm » Mon Nov 30, 2009 11:43 pm

coincidence, I just read that pdf yesterday.

I am certain, ecc finding bit errors ahead is an indication of other things.
My next build is ecc enabled.
The unlucky amder with the dimm replacements..
there is one chip set aside on each ecc dimm correcting errors..if you have a lot of errors, said dimm dies eventually...correct? (pun?)

maybe you are quite lucky to be killing dimms rather than a northbridge, or a data stampede into a vrm array...

if you know what I am saying...
and lastly, the theory of more errors than ever must be true, we are using very fast threaded ram at similar densities as good ol pc66/100..

I go back to ECC. I could more than guess that the newer cpus, multicored and shutting themselves off to slow down, are quite ok without ECC for a regular user. I am simply trying to keep a prescott going (in which i am quite content mind you), and know there is a sleeping giant, I have awakened it. A lot of old pcs do this, apfc is a corrector in a huge way, and that also keeps me enthused about stuff getting older.

and I must mention yet again, I am here with a 633 celeron that buries its own ram to pagefile use just to browse...errorless. 66mhz FSB baby.



:wink:

RBBOT
Posts: 93
Joined: Thu May 17, 2007 9:02 am

Post by RBBOT » Wed Jan 20, 2010 3:52 pm

I've just built a new box with 12GB of ECC RAM, and I noticed that while finding the limits of the overclock and testing with Prime95, at no point did I ever see a calculation error in Prime; the limit always showed up as a blue screen or lockup, suggesting something else failed before the RAM started returning inaccurate data. On every other box I've ever overclocked, I've always got Prime failures at some stage while trying to find the optimum voltages.

I'm using Kingston 4GB 1066MHz 7-7-7-20 DIMMs and found that they run OK up to about 1200MHz. However, I found I could get a higher CPU speed by running the RAM at only 1020MHz. At this speed I could use 6-6-6-18 timings on the RAM.

speedboxx
Posts: 72
Joined: Wed Dec 30, 2009 9:13 am
Location: Canada

Post by speedboxx » Wed Jan 20, 2010 4:25 pm

Would underclocking RAM significantly reduce the errors? I have my 667MHz RAM underclocked to 400MHz in my server to save some power and because I figure it might be more stable that way. Also, would errors in the memory usually lead to a crash, or could they lead to corrupt data being stored on your drives that you may never know about (this is what I am afraid of)? Wouldn't such disk errors be detected by a disk scan tool like CHKDSK?

RBBOT
Posts: 93
Joined: Thu May 17, 2007 9:02 am

Post by RBBOT » Wed Jan 20, 2010 5:39 pm

Memory errors could lead to a crash or corrupt data; it depends on whether what is being read from memory is code or data. If it's code, it is likely to crash, or at least execute wrong instructions that change data in some way. If it's data, it may still crash, depending on what it does to the program logic.

If it does end up with the wrong data being written to disk, it certainly won't result in disk errors: as far as the disk is concerned, it correctly wrote the data it was asked to.
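
To make that concrete, here is a toy Python sketch of a single flipped bit in a buffer. The buffer contents and the `flip_bit` helper are made up for illustration. The corrupted bytes are perfectly valid data as far as any disk or filesystem check is concerned; only a checksum taken before the flip notices the difference.

```python
import hashlib

def flip_bit(buf, bit_index):
    """Return a copy of `buf` with one bit inverted (simulated memory error)."""
    out = bytearray(buf)
    out[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(out)

original = b"balance=1000.00"                 # hypothetical record
digest_before = hashlib.sha256(original).hexdigest()

corrupted = flip_bit(original, 8 * 8)          # lowest bit of byte 8: '1' -> '0'
digest_after = hashlib.sha256(corrupted).hexdigest()

print(corrupted)                               # still well-formed, just wrong
print("checksum matches:", digest_before == digest_after)
```

A disk scan would see nothing wrong with the second buffer once written; only comparing against a checksum made from the good copy catches it.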

Reducing the clock speed may help stability, although not if, like me, you also reduce the timings.

jimj
Posts: 2
Joined: Fri Nov 27, 2009 7:00 pm
Location: Sioux Falls SD

Post by jimj » Wed Jan 20, 2010 10:26 pm

RBBOT wrote: using Kingston 4GB 1066MHz 7-7-7-20 dimms
Would you mind posting the part # of your RAM?

RBBOT
Posts: 93
Joined: Thu May 17, 2007 9:02 am

Post by RBBOT » Thu Jan 21, 2010 2:04 pm

KVR1066D3E7SK3/12G
