For those of us who are not regular slashdot.org or digg.com readers: I came across this story and thought you'd be interested. The RAID 5 comments concern me a little, but I really don't want to move to RAID 6.
The paper crushes a number of (what we now know to be) myths about disks such as vendor MTBF validity, 'consumer' vs. 'enterprise' drive reliability (spoiler: no difference), and RAID 5 assumptions. StorageMojo has a good summary of the paper's key points.
Key points:
* Little difference in replacement rates between SCSI, FC, and SATA drives
* Failure rate is not constant with age
* The probability of seeing two drives in a RAID 5 cluster fail within one hour is four times higher than predicted by exponential-distribution estimates
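To put rough numbers on that last point, here is a back-of-the-envelope sketch of the textbook calculation the paper argues against. The MTBF and rebuild-time figures below are made up for illustration, not taken from the paper:

```python
import math

# Illustrative figures only -- not from the paper.
mtbf_hours = 500_000      # vendor-quoted MTBF per drive (assumed)
rebuild_hours = 24        # assumed time to rebuild the array after one failure
surviving_drives = 4      # drives left in a 5-drive RAID-5 after one failure

# Under the exponential (memoryless) assumption, each surviving drive
# fails within the rebuild window with probability 1 - exp(-t / MTBF).
p_one = 1.0 - math.exp(-rebuild_hours / mtbf_hours)

# Probability that at least one survivor dies before the rebuild finishes.
p_second_failure = 1.0 - (1.0 - p_one) ** surviving_drives
print(f"P(second failure during rebuild) ~ {p_second_failure:.2e}")
```

The paper's point is that real failures cluster in time, so the actual risk during a rebuild is several times what this memoryless model predicts.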
However, I am not reconfiguring my 5-drive RAID-5 into a 4-drive RAID-5 plus a hot spare.
I guess/hope I will have enough time to get a replacement disk. I also have the option of just shutting down the whole 5200 box when the first disk dies; then the second disk would have to die before the array has been rebuilt for it to be a catastrophe.
Typically, when such a non-standard correlation is seen, the root cause is a factor external to the failing unit. I suspect that some failures, or reported failures, are influenced by power blips, cooling loss, controller issues, or mechanical vibration, any of which can affect more than one drive, leading to a failure frequency that does not fit the expected distribution. I would expect this when comparing failures in arrays to isolated drive failures in the general drive population. (I.e., if you drop your N5200, don't expect to recover just because you have RAID 5.)
Figured this was the best thread to ask, so here goes. I have a Seagate 750 ST3750640AS showing up with a warning. The SMART info screen for the drive shows: Power On Hours: 883, Reallocated Sector Count: 1, Current Pending Sector: 0.
Omega's Info module shows:

Disk Info: /dev/sdd
Smart State: ENABLED
Smart Error Count: 0
Raw Read Error Rate: 0
Start/Stop Count: 14
Realloc Sector Count: 1
Power On Hours: 883 hours
Power Cycle Count: 29 cycles
Temperature: 41 Celsius
The other drives are not a concern, but here is their info:

Disk Info: /dev/sda
Smart State: ENABLED
Smart Error Count: 0
Raw Read Error Rate: 0
Start/Stop Count: 14
Realloc Sector Count: 0
Power On Hours: 212 hours
Power Cycle Count: 29 cycles
Temperature: 41 Celsius

Disk Info: /dev/sdb
Smart State: ENABLED
Smart Error Count: 0
Raw Read Error Rate: 0
Start/Stop Count: 14
Realloc Sector Count: 0
Power On Hours: 228 hours
Power Cycle Count: 29 cycles
Temperature: 40 Celsius

Disk Info: /dev/sdc
Smart State: ENABLED
Smart Error Count: 0
Raw Read Error Rate: 0
Start/Stop Count: 15
Realloc Sector Count: 0
Power On Hours: 249 hours
Power Cycle Count: 30 cycles
Temperature: 41 Celsius
Here is my question after all that long-winded info:
On the drive that is showing errors, there are two noticeable items. First, the power-on hours figure is about four times higher than on the others, even though I purchased the drives at the same time. (I hope they did not slip a refurb in on me.)
Second, should I be concerned about the one reallocated sector? And the temperature is 41 degrees; is that too high? I plan on going to get another 750 today, which will bring my box up to 5 drives and should somewhat protect the data, at least I hope.
From what I know after some research on the Internet, it is very difficult to interpret the "raw values" that SMART reports, as every manufacturer may measure the raw value differently. Of course, for some obvious fields, like cycle count or operating hours, the measurement method is always the same.
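If you want to compare the normalized values against the failure thresholds yourself (the normalized columns are at least meant to be comparable, unlike the raw values), a rough sketch like this works on a Linux box with smartmontools installed. The device name and the column layout are assumptions based on typical `smartctl -A` output:

```python
import subprocess

# Assumes smartmontools is installed and this runs with root privileges.
out = subprocess.run(["smartctl", "-A", "/dev/sdd"],
                     capture_output=True, text=True).stdout

for line in out.splitlines():
    fields = line.split()
    # Attribute rows begin with a numeric ID, e.g.:
    #   5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 1
    if fields and fields[0].isdigit():
        name, value, worst, thresh = fields[1], fields[3], fields[4], fields[5]
        raw = " ".join(fields[9:])  # the raw value can contain spaces
        # Arbitrary 10-point margin to flag attributes nearing their threshold.
        flag = "  <-- near threshold!" if int(value) <= int(thresh) + 10 else ""
        print(f"{name}: value={value} worst={worst} thresh={thresh} raw={raw}{flag}")
```

The drive only counts as failing, in SMART's own view, when a normalized value drops to or below its threshold, which is why a raw realloc count of 1 does not by itself trip anything.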
Regarding the much longer operating time of /dev/sdd: it is funny that this drive has one fewer power cycle than the other drives but many more hours. To me it seems that the drive was run for some time in the past. But I don't think this is grounds for a warranty claim, as the absolute time is not that high.
The temperature values in SMART, in particular, do not seem to be accurate for all models; this depends heavily on the drive manufacturer's implementation. So the 41 degrees might be true, but it might not be. If it is accurate, though, your drives run quite hot compared to mine. I have SAMSUNG HD400LJ drives and they report 27 degrees all the time.
As for the realloc count: drives have a smallish pool of spare sectors in case a sector fails at some point. The failed sector is marked as bad and one of the spare sectors is substituted for it. Normally this happens when a drive has grown old and starts to degrade, and in that case it is assumed that more bad sectors will follow soon. But in your case the drive isn't old at all, so I would consider this just a single sector failure and would ignore it. But keep an eye on that drive and its SMART values.
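To make that bookkeeping concrete, here is a toy model of the remapping that drive firmware does. This is a gross simplification I made up for illustration; real drives handle it transparently in hardware:

```python
class ToyDisk:
    """Grossly simplified model of sector reallocation (remapping)."""

    def __init__(self, spare_sectors=2048):
        self.remap = {}                        # logical sector -> spare slot
        self.free_spares = list(range(spare_sectors))

    def sector_failed(self, lba):
        # Firmware marks the sector bad and substitutes a spare for it.
        if not self.free_spares:
            raise RuntimeError("spare pool exhausted -- drive is dying")
        self.remap[lba] = self.free_spares.pop()

    def resolve(self, lba):
        # Reads/writes are redirected silently; the OS never sees the bad block.
        return ("spare", self.remap[lba]) if lba in self.remap else ("platter", lba)

    @property
    def reallocated_sector_count(self):
        return len(self.remap)

disk = ToyDisk()
disk.sector_failed(123456)
print(disk.resolve(123456))           # ('spare', 2047)
print(disk.reallocated_sector_count)  # 1 -- and it never goes back down
```

The count only ever goes up, which is also why pulling and reinserting the drive cannot "reset" it.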
And I checked the Internet about your drive model: the German computer magazine PC Professional tested the drive in issue 7/2006 (here is a link to a German-language excerpt of the article: www.vnunet.de/tests/storage/article20060608055.aspx). It says the drive's performance was extraordinary (at the time), but that the surface temperature of the drives reached 52 degrees during testing, and the magazine suggested extra cooling because of the high temperature. So your reported 41 degrees seems reasonable, and you simply have a "hot" model...
Thanks for the info. I checked the Seagate site for the specs, and 41 is within range (0-60 degrees), so I'm not as worried.
The extra cycle count on the other drives is because I had pulled the drive out to see if maybe the system would reset and rebuild the sector. Of course, it did not.
I'm going to purchase another drive just in case. But, thanks to my lovely daughter, that purchase is on hold for a little while. We just moved from Germany, and she likes to call her boyfriend on his cell phone. Normal landline calls are included in my VoIP plan, but cell phones are not, and you know how expensive those can get.
No, it did reallocate the sector; otherwise the SMART system would mark your drive as damaged. In my opinion, the realloc sector count is telling you that one sector had to be reallocated, and the count will always be at least one from now on; it will never go back to zero.
In other words, at the moment your drive is healthy.
Lemme drag up this thread... I decided to migrate my RAID-5 from 3 drives to 4 (plus one spare). While the migration was going on, my number 3 drive popped up a warning; I clicked the "Warning" tag and it read "Current Pending Sector: 4".
Well, after all the hemming and hawing of the migration, I discovered that I'd pretty much lost everything.
Repeat after me, everyone: RAID != backup.
Anyway, I scrapped the whole thing, took the "warning" drive out, stacked the 4 good ones next to each other in the case, and created a new RAID 5 with those 4 drives (I just finished copying all my data back over from backups... thank the gods).
now the "warning" drive says "reallocated sector count: 2" "Current Pending Sector: 16"
power on hours 172
(i know that's not right, i bought these 5 seagate 7200.10 500GB drives along with the N5200 in february)
So, my question (yes, there is one):
Did I just get a bad drive? Should I see about shipping it back under warranty? Or is the Thecus just REALLY paranoid when it comes to bad sectors? I have drives (knock wood) that I've been using for years in PCs with no issues. How worried do I need to be about this?
I checked the Internet about the SMART reallocated sector count attribute, and I found this (excerpt):
"Count of reallocated sectors. When the hard drive finds a read/write/verification error, it marks this sector as "reallocated" and transfers data to a special reserved area (spare area). This process is also known as remapping and "reallocated" sectors are called remaps. This is why, on a modern hard disks, you can not see "bad blocks" while testing the surface - all bad blocks are hidden in reallocated sectors. However, the more sectors that are reallocated, the more a sudden decrease (up to 10% and more) can be noticed in the disk read/write speed."