Google study: effect of temperature on server hdds

Silencing hard drives, optical drives and other storage devices

Moderators: NeilBlanchard, Ralf Hutter, sthayashi, Lawrence Lee

jojo4u
Posts: 806
Joined: Sat Dec 14, 2002 7:00 am
Location: Germany

Google study: effect of temperature on server hdds

Post by jojo4u » Fri Feb 16, 2007 6:41 am

There is a Google study available titled "Failure Trends in a Large Disk Drive Population". They monitored failures and SMART data of over a hundred thousand disks for 5 years. One (surprising?) result is that temperature only affects discs that are 3+ years old. For younger drives, the failure rate at 40+ °C was lower than or comparable to that of drives averaging 15-40 °C. Drives at 15-30 °C are, confusingly, much worse.

There is one catch: Google tends to buy large batches of drives, so the 1-year-old mix is not the same as the 3-year-old mix. The younger ones could be models that are particularly resistant to temperature while the 3-year-old ones are sensitive.
In fact, they state that the failure rates of the 2+ year old drives are influenced by particular models.

http://labs.google.com/papers/disk_failures.pdf (temperature on page 6)

Le_Gritche
Posts: 140
Joined: Wed Jan 18, 2006 4:57 am
Location: France, Lyon

Re: Google study: effect of temperature on server hdds

Post by Le_Gritche » Fri Feb 16, 2007 11:31 am

jojo4u wrote:There is a Google study available titled "Failure Trends in a Large Disk Drive Population". They monitored failures and SMART data of over a hundred thousand disks for 5 years. One (surprising?) result is that temperature only affects discs that are 3+ years old. For younger drives, the failure rate at 40+ °C was lower than or comparable to that of drives averaging 15-40 °C. Drives at 15-30 °C are, confusingly, much worse.

There is one catch: Google tends to buy large batches of drives, so the 1-year-old mix is not the same as the 3-year-old mix. The younger ones could be models that are particularly resistant to temperature while the 3-year-old ones are sensitive.
In fact, they state that the failure rates of the 2+ year old drives are influenced by particular models.

http://labs.google.com/papers/disk_failures.pdf (temperature on page 6)
Very interesting read indeed.

It seems to suggest that lower temperatures (<30°C) cause the drives to fail more only early in their lifetime (2 or 3% failure rates instead of 1%), whereas hotter temperatures (>45°C) cause them to fail more only once the drives reach an age of 2 years or more (10-15% failure rates instead of 5%). The best temperatures for the lowest failure rate over the life of the drive would be between 35°C and 45°C, with an average temperature of 38°C associated with the lowest failure rate overall.

The SMART part of the study concludes that errors detected by SMART predict a higher failure probability in the following months (sometimes 10 times higher failure rates), but it also shows that around 1 failure in 2 had no heralding SMART errors.

AZBrandon
Friend of SPCR
Posts: 867
Joined: Sun Mar 21, 2004 5:47 pm
Location: Phoenix, AZ

Re: Google study: effect of temperature on server hdds

Post by AZBrandon » Fri Feb 16, 2007 12:02 pm

Le_Gritche wrote:The SMART part of the study concludes that errors detected by SMART predict a higher failure probability in the following months (sometimes 10 times higher failure rates), but it also shows that around 1 failure in 2 had no heralding SMART errors.
That sounds a lot like heart attacks. Something like 50% of all cases of heart disease are discovered by the symptom of sudden death with no prior warning.

whiic
Posts: 575
Joined: Wed Sep 06, 2006 11:48 pm
Location: Finland

Post by whiic » Sun Feb 18, 2007 3:49 am

I don't believe that running drives hotter will make them more reliable. Google didn't experiment on temperatures with the purpose of producing meaningful empirical results. They did not vary the temperature on purpose but instead integrated drives into systems whose cooling properties were in line with Google's in-house regulations - thus the temperature of a drive was mostly a function of HDD power usage rather than cooling fan rpm.

Consider a hot-running SCSI server disk and a cool-running IDE disk - which is more reliable if both are given enough cooling to keep them within the manufacturer's specs? The SCSI. But the SCSI is also hotter. Is running hotter better, then? Certainly cutting down the cooling of the IDE drive to match the operational temperature of the SCSI drive would make it less reliable, not more.

Also, if you have a 5400rpm ball-bearing drive (cool-running) and a 7200rpm FDB drive (hot-running), the 5400rpm unit might be less reliable. Increasing the operational temperature would not increase the reliability of a drive with ball-bearings.

The relation between temperature and reliability should be evaluated in a study with only one drive model - unlike Google's study, which mixed different manufacturers, models and vintages.

__


I prefer operational temperatures between 15 and 45 deg C, avoiding all temperatures outside that range. And I will continue to prefer 15...30 deg C over 30...45 deg C even though I've read that study performed by Google... but I don't think there's much difference between those two temperature ranges. It always depends on how much silence I have to sacrifice to obtain the lower temperature, so I would at least try it out with multiple fan speeds instead of being satisfied with the first "acceptable" result. Optimal is usually far better than the first satisfying solution. (But in certain hopeless cases even optimal will not be good enough.) :)

Bluefront
*Lifetime Patron*
Posts: 5316
Joined: Sat Jan 18, 2003 2:19 pm
Location: St Louis (county) Missouri USA

Post by Bluefront » Sun Feb 18, 2007 4:39 am

IMHO....the results of that "test" are virtually meaningless. Different brand drives, different models, different mfg dates, all play an important part in such testing. Mix up those factors in a test, and the results are suspect to say the least.

Cooler electronic/mechanical parts always last longer than the same unit run hotter. Of course there's a lower limit on the temp. A HD might not start at all in a super-cold situation. But I've never seen a test where "hotter is better for longer life"......using scientific test methods.

niels007
Posts: 451
Joined: Mon Nov 22, 2004 2:18 am

Post by niels007 » Sun Feb 18, 2007 10:27 am

It might be a reasonable indication that you shouldn't worry about HD temps around the 45°C mark, as the chances of such a drive dying compared to a cooler drive are about the same. So if it dies, it was probably not the temperature that caused it.

Good to read these things.. With cpu and gfx I will go until things get unstable.. With HD's I tend not to try that :)

nutball
*Lifetime Patron*
Posts: 1304
Joined: Thu Apr 10, 2003 7:16 am
Location: en.gb.uk

Post by nutball » Sun Feb 18, 2007 12:21 pm

Bluefront wrote:IMHO....the results of that "test" are virtually meaningless. Different brand drives, different models, different mfg dates, all play an important part in such testing. Mix up those factors in a test, and the results are suspect to say the least.
I don't agree. Did you look at their sample size? More than a hundred thousand drives.

I think that this paper is a great deal more reliable than a pile of hearsay gathered from Internet forums, for example, with all the inherent biases and nonsense propagation that that carries with it.

It's certainly true that the results need to be interpreted correctly -- they don't allow for differences between vendors, models, and all the other stuff -- but I think they do present a pretty decent characterisation of the behaviour of the average consumer-grade hard drive.

I'm sure it'll be an unsatisfactory study to many people. Folks buy lottery tickets for a reason.

AZBrandon
Friend of SPCR
Posts: 867
Joined: Sun Mar 21, 2004 5:47 pm
Location: Phoenix, AZ

Post by AZBrandon » Sun Feb 18, 2007 2:40 pm

The most meaningful data would likely come from a company like EMC or Hitachi that buys drives in bulk by the millions and puts them in hard drive arrays. I know in my company's datacenters we have so many disk arrays that they replace dead drives in bulk on a weekly basis, probably 10 drives a week or something between all our datacenters. Since the disk array vendors would have all that same data in their arrays leading up to each failure, they would likely know what conditions lead to the longest life.

CA_Steve
Moderator
Posts: 7650
Joined: Thu Oct 06, 2005 4:36 am
Location: St. Louis, MO

Post by CA_Steve » Sun Feb 18, 2007 7:03 pm

I thought it was a well presented paper. Over 100k drives is a nice population to draw from. The temp data was interesting, as was the look at the SMART parameters that seriously affect the likelihood of future failure.

agraham
Posts: 12
Joined: Sun Aug 11, 2002 3:26 pm

Post by agraham » Sun Feb 18, 2007 8:42 pm

It will be interesting to see if Google cuts down on the cooling based on their own study. You can see from the graph that 40% of their drives ran between 25-30C, whereas the best reliability is between 35-40C.
They could probably save a lot of money by turning their datacentre fridges up by ten degrees.

AZBrandon
Friend of SPCR
Posts: 867
Joined: Sun Mar 21, 2004 5:47 pm
Location: Phoenix, AZ

Post by AZBrandon » Sun Feb 18, 2007 10:05 pm

agraham wrote:It will be interesting to see if Google cuts down on the cooling based on their own study. You can see from the graph that 40% of their drives ran between 25-30C, whereas the best reliability is between 35-40C.
They could probably save a lot of money by turning their datacentre fridges up by ten degrees.
I imagine that would depend on whether they have the storage in disk arrays separate from the servers. For anything with locally attached storage you have to take into account the failure rates of everything in the server as a function of temperature, not just the drives. However, if it's fibre-attached SAN storage, for example, you could have all your disk arrays in a separate climate zone on the datacenter floor with fewer air conditioning units, like you said.

whiic
Posts: 575
Joined: Wed Sep 06, 2006 11:48 pm
Location: Finland

Post by whiic » Sun Feb 18, 2007 11:21 pm

"Did you look at their sample size? More than a hundred thousand drives."

I wouldn't care if there were a hundred million drives or even a hundred billion. There would still be drives of many manufacturers, models and vintages.

Temperature is a result of both power consumption and cooling. They did not intentionally vary the temperature while keeping power consumption constant (i.e. use only one model for the test), so they were not evaluating the effect of cooling on reliability, only the difference in reliability between drive models... and drive models that use more power seem to be more reliable (SCSI more reliable than 7200rpm ATA, and 7200rpm more reliable than 5400rpm drives).

"I think that this paper is a great deal more reliable than a pile of hearsay gathered from Internet forums, for example, with all the inherent biases and nonsense propagation that that carries with it."

What hearsay are you referring to?

"Its certainly true that the results need to be interpreted correctly"

Yes. And assuming constant power consumption across all the drives tested is a false assumption. Because of that, you can't assume that the drives running hotter in Google's study had less airflow cooling them, nor that cooling drives less increases reliability.

If you have a cool-running drive that runs cool because it's a 5400rpm ball-bearing drive, it's probably less reliable than most fluid-bearing drives with higher rpm and power consumption. If you intentionally make the 5400rpm drive run at the temperature a 7200rpm drive or even a 10/15krpm SCSI would reach in the same environment, you are going to make an unreliable drive even more unreliable.

"I'm sure it'll be an unsatisfactory study to many people. Folks buy lottery tickets for a reason."

I don't think Google's study is meaningless... it confirms previous studies finding that SMART's effectiveness in predicting HDD failure is approximately 50%. It also reveals that there are four meaningful attributes which should have a zero value for reliable operation, and it gives the prediction accuracy for all four. (Other attributes are either manufacturer-specific (and may tolerate higher values before they should be considered a problem) or only nice to know (like "power-on hours count").)
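
To illustrate that zero-tolerance idea, here is a minimal sketch that flags non-zero raw values via smartctl. The mapping of the paper's four signals (scan errors, reallocations, offline reallocations, probational counts) onto standard attribute IDs is my own approximation, since vendors differ, so treat the ID list as illustrative:

Code:
import subprocess

# Approximate mapping of the study's four critical signals onto common SMART
# attribute IDs; this mapping is an assumption, not from the paper.
CRITICAL = {
    5:   "Reallocated_Sector_Ct",
    196: "Reallocated_Event_Count",
    197: "Current_Pending_Sector",   # "probational" sectors
    198: "Offline_Uncorrectable",
}

def critical_nonzero(device="/dev/sda"):
    """Return the critical attributes whose raw value is non-zero."""
    # smartctl -A prints one line per attribute; typically needs root.
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True).stdout
    hits = {}
    for line in out.splitlines():
        fields = line.split()
        if fields and fields[0].isdigit():
            attr_id, raw = int(fields[0]), fields[-1]
            if attr_id in CRITICAL and raw.isdigit() and int(raw) > 0:
                hits[CRITICAL[attr_id]] = int(raw)
    return hits

print(critical_nonzero())  # per the study, anything non-zero here is a bad sign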

The temperature-reliability relation was not the main purpose of the study. Temperature is only one of many SMART attributes. It had to be included to make the study complete! But it should not be interpreted as "less cooling => more reliability", because that is not what the temperature attribute means. The temperature attribute doesn't measure airflow speed or the temperature of the air surrounding the drive.

Some manufacturers have accurate temperature sensors, others do not. Some Samsungs report temperatures between "10" and "15" in a room temperature of 25. Wow! They are air-conditioned. Some WDs report "60" even when they feel lukewarm to the touch.

whiic
Posts: 575
Joined: Wed Sep 06, 2006 11:48 pm
Location: Finland

Post by whiic » Mon Feb 19, 2007 12:07 am

Apparently they considered temperature values from Hitachi drives spurious and thus disregarded them: "For example, some drives have reported temperatures that were hotter than the surface of the sun."

I have commented several times on the unique way Hitachi drives report temperature. All others report the temperature in the last byte of the raw data field, and the rest of the bytes are just zeroes. Because of this, it's easy to convert the whole raw data field from binary to decimal and assume it is the current temperature.

Hitachi drives report temperature as [zero byte][max temp byte][zero byte][min temp byte][zero byte][current temp byte]. For example, 0x003200120025 is the raw data of the Hitachi Travelstar in the laptop I'm using: 0x32 is 50 deg C (max), 0x12 is 18 deg C (min) and 0x25 is 37 deg C (current).

SpeedFan knows how to convert Hitachi temperature raw data into a real temperature. Software like HDD Health does not. And whatever software Google uses evidently also falls into the latter category.
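
A minimal sketch of the two layouts described above, treating the 6-byte raw value as a single integer (the helper names are mine):

Code:
def temp_generic(raw: int) -> int:
    # Reading the whole raw field as a number works for most vendors
    # only because every byte above the lowest one is zero.
    return raw & 0xFF

def temp_hitachi(raw: int):
    # Hitachi layout per the post above: 00 | max | 00 | min | 00 | current.
    current = raw & 0xFF
    minimum = (raw >> 16) & 0xFF
    maximum = (raw >> 32) & 0xFF
    return maximum, minimum, current

print(temp_hitachi(0x003200120025))  # (50, 18, 37), matching the Travelstar example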

agraham
Posts: 12
Joined: Sun Aug 11, 2002 3:26 pm

Post by agraham » Mon Feb 19, 2007 12:59 am

AZBrandon wrote:
agraham wrote:It will be interesting to see if Google cuts down on the cooling based on their own study. You can see from the graph that 40% of their drives ran between 25-30C, whereas the best reliability is between 35-40C.
They could probably save a lot of money by turning their datacentre fridges up by ten degrees.
I imagine that would depend on whether they have the storage in disk arrays separate from the servers. For anything with locally attached storage you have to take into account the failure rates of everything in the server as a function of temperature, not just the drives. However, if it's fibre-attached SAN storage, for example, you could have all your disk arrays in a separate climate zone on the datacenter floor with fewer air conditioning units, like you said.
Most solid-state computer parts just don't fail unless you have a power spike or a cooling failure or something. Seeing that Google loses almost 5% of their hard drives every year, I bet they could quadruple the failure rates of every other component and still come out way ahead.

Cerb
Posts: 391
Joined: Tue Apr 13, 2004 6:36 pm
Location: GA (US)

Post by Cerb » Mon Feb 19, 2007 1:43 am

Somewhat in line with whiic's comments: they don't appear to show many hot drives.

That is, 45C (~110F) isn't hot. It won't burn, and won't be uncomfortable except on a cold day. I recently recabled a box with a couple of Maxtors that run hot enough to burn: easily into the 50s C. If drives were getting into the 50s or 60s all the time, or the temperature spread were fairly even, I'd consider the results for temperature much more useful. However, they show most drives sitting under a 35C average, by my reading of fig. 4. Half the year, that's my PC's case temp.

By that reading, their saying temperature doesn't matter is not a claim that it doesn't matter in general (they couldn't find evidence either way), but that across their fleet the drives were cooled beyond what they needed to be, so other cooling concerns could/should take precedence when designing new enclosures and cooling systems.

jojo4u
Posts: 806
Joined: Sat Dec 14, 2002 7:00 am
Location: Germany

Post by jojo4u » Mon Feb 19, 2007 6:27 am

whiic wrote:They were not evaluating the effect of cooling on reliability, only the difference in reliability between drive models... and drive models that use more power seem to be more reliable (SCSI more reliable than 7200rpm ATA, and 7200rpm more reliable than 5400rpm drives).
Section 2.2 states that only SATA/ATA 7200/5400 rpm consumer-grade drives were used. This rules out "hot and reliable" SCSI. But you are still right: the 7200 rpm drives draw more power and run hotter.

Erssa
Posts: 1421
Joined: Sat Mar 12, 2005 9:26 pm
Location: Finland

Post by Erssa » Mon Feb 19, 2007 8:55 am

This thread should be stickied. I wonder how many times I have given system advice and said that the Antec Solo doesn't need an extra fan for the hard drive.

whiic
Posts: 575
Joined: Wed Sep 06, 2006 11:48 pm
Location: Finland

Post by whiic » Mon Feb 19, 2007 10:25 am

I do not agree on the need to sticky this thread. That temperature-reliability relation really was not the main purpose of the Google study, and the methods Google used were not the right ones for producing accurate data on such a relation.

If you want to read about the relation between temperature and reliability, Seagate has published a quite detailed white paper on the subject. Certainly the sample size was much smaller, but at least the drives were all the same model.

You just can't compare apples to bananas. And you can't determine HDD reliability from HDD temperature if you keep cooling fan rpm constant and only vary the power consumption of the drive. It's complete nonsense.

What the Google study did do was evaluate the efficiency of SMART diagnostics. The results were in line with previous studies (performed with smaller numbers of drives), and thus Google's study is indeed a valuable addition to the field... but not on the temperature-reliability question.

I vote for: NOT sticky...

Longwalker
Posts: 53
Joined: Sun Aug 06, 2006 2:35 pm

Post by Longwalker » Mon Feb 19, 2007 12:23 pm

Statistically speaking, it's not strictly necessary to control for brand and model when performing reliability studies. If the population is sufficiently large and sufficiently diverse (i.e. no brand or model is highly overrepresented), differences between brand and model will average out.

Controlling for brand and model would only be necessary if the objective was to determine which brands and models have the worst reliability and/or failure predictability. Google very likely did this analysis for internal purposes but might not be all that keen on publishing the results for commercial reasons.

It would be very interesting to see how drive lifespans correlate with design life and warranty length. If, for example, the best predictive factor for failure at three years is a three-year design life, then there's not much need to worry about temperatures, utilization or SMART results.

whiic
Posts: 575
Joined: Wed Sep 06, 2006 11:48 pm
Location: Finland

Post by whiic » Mon Feb 19, 2007 1:40 pm

"Statistically speaking, it's not strictly necessary to control for brand and model when performing reliability studies. If the population is sufficiently large and sufficiently diverse (i.e. no brand or model is highly overrepresented), differences between brand and model will average out."

Assuming all drives had the same power consumption, you could calculate some average relation for the drives that participated. But obviously they don't. And because of that, operational temperature is not only a result of cooling but also of heat generation, which varies between drives.

To make things even worse, old drives are more likely to spin at 5400rpm while newer drives spin at 7200rpm. 5400rpm drives are made cheaply, with no design effort whatsoever to make them in any way better. They are a dying breed, so why invest money? Take the MaXLine II or DiamondMax 16, for example: their counterparts, the MaXLine +II and DiamondMax +9, had fluid dynamic bearings. Then came the MaXLine III and DiamondMax 10, which were both developed from the DM+II and DM+9. The MLII and DM16 5400rpm were not updated to FDBs even then... they trailed multiple drive generations behind, sticking to ball-bearings, 75GB/platter media, etc. But while they were primitive, they were noticeably cooler than 7200rpm drives of the same capacity (and especially cooler than 7200rpm Maxtors, because 7200rpm Maxtors are toasters).

Obviously, comparing the 5400rpm MLII to 7200rpm Maxtors doesn't necessarily show that old-tech 5400rpm drives are less reliable, since 7200rpm Maxtors were dying like flies as well. But compare a 5400rpm Maxtor to any 7200rpm non-Maxtor. This may be a valid comparison because I believe Maxtor was the manufacturer that kept 5400rpm drives in production longer than the others.

Also, these drives were "consumer-grade". That should rule out the MaXLine II, +II and III, but still leave the DM16. "Consumer-grade" also rules out the Seagate NL and ES series and WD's YS, YR, etc. But basically all these "enterprise SATAs" are just regular drives with longer testing. Usually they are also available at high capacities only. And to make more reliable high-capacity drives for the enterprise, the "consumer-grade" high-capacity drives have to be built to the same elevated quality (because building two separate lines just is not cost-effective). Small-capacity drives don't typically have enterprise editions, which means they are truly consumer-grade. They are more likely built in China, they have cheaper components to cut down the price, etc. Of course they are designed so that there won't be too many DOAs or failures within the warranty period... at least in a typical desktop environment. But high-capacity drives may well be better suited to high workloads.

I really can't tell which factors are most relevant, but I do see factors beneath the surface that make the sample not "sufficiently diverse". The sample is not diverse: it's divided into several groups with certain typical (i.e. non-random) characteristics.
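
To illustrate the confounding argument, here is a toy simulation; the failure rates, temperatures and the 35 deg C split are invented for the example, not taken from any study:

Code:
import random

random.seed(0)

# model: (typical temperature in deg C, annual failure probability) - invented
MODELS = {
    "5400rpm ball-bearing":  (28, 0.08),  # cool but less reliable
    "7200rpm fluid-bearing": (42, 0.03),  # hot but more reliable
}

cool_fail = warm_fail = cool_n = warm_n = 0
for model, (temp_c, afr) in MODELS.items():
    for _ in range(50_000):
        temp = temp_c + random.gauss(0, 3)  # temperature has NO causal effect here
        failed = random.random() < afr
        if temp < 35:
            cool_n += 1; cool_fail += failed
        else:
            warm_n += 1; warm_fail += failed

print(f"<35C drives:  {cool_fail / cool_n:.1%} AFR")   # ~8%: bucket is mostly BB drives
print(f">=35C drives: {warm_fail / warm_n:.1%} AFR")   # ~3%: bucket is mostly FDB drives
# The pooled data "show" cool drives failing more, even though temperature
# played no causal role: bearing type and temperature are confounded.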

"Google very likely did this analysis for internal purposes but might not be all that keen on publishing the results for commercial reasons."

Most likely. But they could have substituted pseudonyms for the drive models ("drive model 1", "drive model 2", etc.) instead of their real names. Then take, for example, three representative drive models from the study and present them without model names, manufacturer names or other revealing information...

"It would be very interesting to see how drive lifespans co-relate with design life and warranty length."

Warranty length = business & marketing decision. Nothing technical there.

"If, for example, the best predictive factor for failure at three years is a three year design life"

Design life for almost all HDDs is 5 years, regardless of warranty period (3 or 5 years).

"then there's not much need to worry about temperatures, utilization or SMART results."

Design life is specified at a certain temperature (usually quite a bit below the maximum allowed temperature) and a certain duty cycle. Don't expect a drive with a 5-year service life to last 5 years at 55 deg C while grinding constantly at highly random I/O. But also don't expect all drives kept at ~30 deg C and low I/O to die at the age of 5 years. The design life assumes that the majority of drives reaches (and exceeds) that age.

SMART alerts are no joking matter, as Google's study has shown. Even a single error in certain attributes may increase the risk of HDD failure by a factor of 20 to 60 over the next few months.

And failing an attribute (which means many errors in one of the attributes) means almost certain death. (This was not assessed in the Google study, but the SMART thresholds are set so that there is a minimum of false alarms. If you were warned at the first error, the number of false alarms would be high: this also applies to the four critical attributes found by Google. Even a 60-times-higher failure likelihood does not mean certain death.)

Zinj
Posts: 32
Joined: Mon Sep 25, 2006 11:06 am
Location: San Jose, Costa Rica

A related Study

Post by Zinj » Mon Apr 02, 2007 8:51 am

DailyTech posted a summary of a Carnegie Mellon study of MTBF for hard drives a few weeks back which is relevant (http://www.dailytech.com/Article.aspx?newsid=6404). It mentioned that age is generally the most important factor in hard drive failure and that "drive operating temperatures had little to no effect on failure rates -- a cool hard drive survived no longer than one running hot." This would seem to add some additional weight to Google's study.

Z

continuum
*Lifetime Patron*
Posts: 213
Joined: Tue Jul 19, 2005 10:23 pm
Location: California
Contact:

Post by continuum » Mon Apr 02, 2007 12:09 pm

Google's and CMU's papers do not address drives operating beyond the recommended operating temperature, which dramatically increases drive failures.

"cool" and "hot" mean relatively little when the window is so small; we have extensive production data here which shows when drives are hitting outside of specified thermal limits, failure rate increases dramatically.

Zinj
Posts: 32
Joined: Mon Sep 25, 2006 11:06 am
Location: San Jose, Costa Rica

Post by Zinj » Mon Apr 02, 2007 12:31 pm

Doesn't that really hit it on the head? As long as you're within the manufacturer's tolerances, you're generally OK. Whether you're 5 degrees cooler than the suggested maximum or 15 doesn't have much effect. The message I was taking home is that as long as you're within the manufacturer's specs, there's no need to get your undies in a bunch. Ditto giving up some heat for silence on your CPU, as long as you're not overclocking. If the issue was temps well above the specified operating temperature, then I misread. By the way, at what temps do you see the failure rate start to jump?

Z

croddie
Posts: 541
Joined: Wed Mar 03, 2004 8:52 pm

Post by croddie » Mon Apr 02, 2007 1:21 pm

Longwalker wrote:Statistically speaking, it's not strictly necessary to control for brand and model when performing reliability studies. If the population is sufficiently large and sufficiently diverse (i.e. no brand or model is highly overrepresented), differences between brand and model will average out.
Nonsense

mattthemuppet
*Lifetime Patron*
Posts: 618
Joined: Mon May 23, 2005 7:05 am
Location: State College, PA

Post by mattthemuppet » Mon Apr 02, 2007 2:46 pm

Jeez, have you guys never heard of meta-analysis? Sure, in an ideal world every variable would be controlled so we could do nice neat stats and get nice neat answers, but there are whole fields where scientists have no control over what they're measuring and still manage to glean information.

I mean, what about evolutionary biology, atmospheric science, geology, paleontology, climate science, etc.? Evolutionary biology would be particularly tricky if we had to wait a few million years for our experiments to pan out (getting funding would be especially hard). So saying something's nonsense purely because you don't understand the, admittedly complicated, statistical analyses behind it is rather puerile.

Sure, you can never say that x is the main cause of y from this type of study, but you can assign contributory factors and degrees of error. You can even assign probabilities of effect, as in the recent IPCC report on climate change - they assigned a 90% probability that CO2 emissions from human activities were the main contributory factor behind the observed increases in mean temperatures.

Admittedly, there'll always be people that'll say "that's crap", "they didn't control for such and such" and so on, but unless you can refute the analysis itself, you're on shaky ground.

grandpa_boris
Posts: 255
Joined: Thu Jun 05, 2003 9:45 am
Location: CA

Post by grandpa_boris » Mon Apr 02, 2007 3:43 pm

continuum wrote:Google's and CMU's papers do not address drives operating beyond the recommended operating temperature, which dramatically increases drive failures.

"Cool" and "hot" mean relatively little when the window is so small; we have extensive production data here which shows that when drives hit temperatures outside the specified thermal limits, the failure rate increases dramatically.
the google paper concludes that operating temperature doesn't matter, but their graphs show that the failure rate at the higher end of the operating temperature range does go up. the increase is not of catastrophic magnitude and it is not significant compared to other factors, but it's not negligible either.

does this mean that by exceeding the spec by 5-10°C you'll cause your drives to fail much faster? maybe. other factors that aren't temperature-related are much more likely to kill your disks before the heat gets them.

grandpa_boris
Posts: 255
Joined: Thu Jun 05, 2003 9:45 am
Location: CA

Re: Google study: effect of temperature on server hdds

Post by grandpa_boris » Mon Apr 02, 2007 3:46 pm

AZBrandon wrote:That sounds a lot like heart attacks. Something like 50% of all cases of heart disease are discovered by the symptom of sudden death with no prior warning.
it's worse than that. over 75% of disks fail without any prior warnings.

grandpa_boris
Posts: 255
Joined: Thu Jun 05, 2003 9:45 am
Location: CA

Post by grandpa_boris » Mon Apr 02, 2007 3:52 pm

AZBrandon wrote:The most meaningful data would likely come from a company like EMC or Hitachi that buys drives in bulk by the millions and puts them in hard drive arrays.
while this may be true, these companies have been very cagey about sharing data, if they in fact have any. these companies are also dealing with a fundamentally different class of disk drives than the "consumer" disks most of us are using in our quiet systems. enterprise disk drives cost a lot more, and they are made differently, for a different duty cycle and different performance characteristics.

for what it's worth, in response to the google and CMU papers, NTAP has been claiming that they have been tracking and researching disk mortality for many years. but their solution to it is to offer RAID6 and other multiple-redundancy schemes rather than somehow select for more reliable disk drives.

Rusty075
SPCR Reviewer
Posts: 4000
Joined: Sun Aug 11, 2002 3:26 pm
Location: Phoenix, AZ
Contact:

Post by Rusty075 » Tue Apr 03, 2007 12:14 am

I think some of you are overlooking the most likely explanation for the "cool drives fail faster" phenomenon in the Google report:

The temperature results list average temperature readings for the drives.

Take two identical drives. Place one under continuous Medium utilization, where its temperature stays near a constant 45°. Place the other under Low utilization, where it spends, say, 2/3rds of its time idling at 25° and 1/3rd of its time at 100% use, where its temp peaks at over 50°. The Low drive will have an average temp much lower than the Medium drive (33° in my hypothetical). But I could almost guarantee you that the repetitive thermal cycling that comes from alternating periods of high and low usage will be harder on that drive than the 12° hotter temp is on the Medium drive, thus making the Low drive statistically more likely to fail. For many high-precision mechanical parts, thermal cycling is more damaging than conventional wear.... it seems reasonable that HDDs would react similarly.
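
A quick check of that hypothetical average (the hours and temperatures are the assumed numbers from the example above, not measured data):

Code:
medium = [45] * 24             # near-constant 45 deg C all day
low    = [25] * 16 + [50] * 8  # 2/3 of the day idle at 25, 1/3 busy at 50

avg = lambda temps: sum(temps) / len(temps)
print(avg(medium))             # 45.0
print(avg(low))                # 33.3 -> looks "cooler" on average
print(max(low) - min(low))     # 25 deg daily swing that the average never shows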

The "low temperature" drives were more likely drives that were deployed in low demand servers, where they spent large portions of their time at idle, rather than just happening to be in extra-cold rooms or in racks that inexplicably had a bunch of extra fans in them.

A corroborating bit of data shows up in the utilization failure chart, where in more than half the time periods the drives with Low utilization are more likely to fail than the drives with Medium utilization. While their "survival of the fittest" idea is one theory, the effects of thermal cycling could also be playing a part.

whiic
Posts: 575
Joined: Wed Sep 06, 2006 11:48 pm
Location: Finland

Post by whiic » Tue Apr 03, 2007 1:05 am

mattthemuppet: "I mean, what about evolutionary biology, atmosphere science, geology, paleontology, climate science etc Evlutionary biology would be particularly tricky if we had to wait a few million years for our experiments to pan out (getting funding would especially hard). So saying something's nonsense purely because you don't understand the, admittedly complicated, statistical analyses behind it is rather purile. "

First, I have to admit croddie's "nonsense" response wasn't very creative.

Still, do you disagree with any of the following:
- 5400rpm drives usually run cooler than 7200rpm drives
- 5400rpm drives are a dying breed and usually use older technology and ball-bearings
- ball-bearings compromise HDD reliability over a longer period of use, as ball-bearings tend to wear out (increasing non-repeatable run-out (NRRO) and causing errors during I/O).

If you agree with those three propositions, and remember Google's study did have both 5400rpm and 7200rpm drives, don't you see the possibility of this affecting the statistics?

Some of you claim things like cooling an HDD "too much" will cause reliability to drop. I find it more likely that bearing type is the cause. Sure, it wouldn't be a statistical problem to have BB drives among the other drives IF there were no correlation between BBs and lower rpms. But there obviously is a correlation; thus it affects the outcome and causes the extra failures at low temperatures. Reducing the cooling of cool-running BB drives certainly won't do any good... except reduce the noise produced by fans, but in a server that doesn't matter.

My opinion is quite close to continuum's. HDD reliability versus temperature might even resemble a bathtub curve. The operational temperature range in the specs is relatively safe, with some reduction in reliability at both ends. But I don't see below 30 deg C as bad... the low end of the allowed range is either 0 or 5 deg C. At low temperatures condensation may occur and the bearing fluid may be too viscous. Some plastics may become (temporarily) brittle.

At high temperatures there are more causes of failure. Higher temperatures lower the fly height (the risk of a head crash increases). Higher temperatures warm the bearing fluid too much: evaporation. Higher temperatures increase chemical reactivity: compounds used in plastic parts react with each other and plastics become brittle, the media on the platters corrodes, and all electronic components wear out due to internal chemical degradation. Chemical reactions caused by heat cannot be undone by later cooling. Fly height does return to normal when the drive is cooled back within spec.

While the low-end limit might be quite ON/OFF, the upper limit isn't. Seagate's white paper shows some exponential growth in failures. Maybe it's too focused on chemical reactivity... that study was probably based mostly on theoretical research rather than a sample size as big as Google's. Also, the lower end of the temperature range doesn't fit Seagate's curve at all. Chemical reactivity continues to decrease even below the low limit of the allowed temperature range... it's condensation (plus a couple of other factors) that causes the big trouble there. Just maybe it's the same at the upper limit: head crashes and other non-chemical failures cause the failure rate to skyrocket. That would explain why Seagate's theory-based white paper doesn't correlate with reality.

Even if failures skyrocketed near the upper temperature limit, that does not nullify the fact that chemical reactivity increases across the whole operational range, from 5 to 55 (or 0 to 60) deg C.
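
For a rough feel for that chemical term, reliability work commonly uses the Arrhenius model. A minimal sketch with an assumed activation energy (the true value is drive- and mechanism-specific):

Code:
import math

BOLTZMANN_EV = 8.617e-5  # Boltzmann constant in eV/K

def arrhenius_factor(t_cool_c, t_hot_c, ea_ev=0.5):
    # How much faster a thermally driven chemical mechanism runs at t_hot
    # vs t_cool. ea_ev = 0.5 eV is an assumed activation energy; values
    # around 0.3-0.7 eV are commonly quoted for electronics.
    t1, t2 = t_cool_c + 273.15, t_hot_c + 273.15
    return math.exp((ea_ev / BOLTZMANN_EV) * (1 / t1 - 1 / t2))

print(f"{arrhenius_factor(30, 55):.1f}x")  # chemical wear at 55 vs 30 deg C: ~4x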

All this said, I'll try to keep my drives below 40 deg C. Below that, I don't much care whether it's 20 or 39. It's not that I don't think 20-30 deg C would be optimal, but that I don't think it's worth pursuing... increased fan noise and such. To be more precise, that 40 deg C is for normal room temperature. If I tweak my cooling to hit 40 deg C at an optimal room temperature, it will peak around 50...55 deg C during summer... barely within the specs.

Post Reply