SMART not so smart in predicting disk drive failure

Continuing from the last blog post: the Google report [PDF] also analyzes disk self-monitoring (SMART) data and identifies the SMART parameters most closely related to failure.
  1. Drives with scan errors are 10x more likely to fail than drives with no scan errors. 30% of these drives fail within 8 months of the first scan error. In newer drives, the failure probability is highest in the first month after the first scan error and then plateaus. In older drives, it keeps rising with time.

  2. Drives with one or more reallocations fail 3 to 6x more often than those with none. 15% of these drives fail within 8 months of the first reallocation.

  3. There is no definite correlation between failure rates and seek errors.

  4. CRC errors say less about drive failures than about problems with cables and connectors.

  5. There is no significant correlation between failures and high power cycle counts for drives less than two years old. For drives 3 years and older, higher power cycle counts can increase the absolute failure rate by over 2%.
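
Figures such as "10x more likely to fail" are relative failure rates: the failure rate of the drives that show a signal divided by the failure rate of the drives that do not. The sketch below shows that arithmetic in Python. The function names and the drive counts are hypothetical, chosen only to keep the numbers easy to follow; they are not taken from the report.

```python
def failure_rate(failed: int, total: int) -> float:
    """Fraction of the drives in a group that failed."""
    if total <= 0:
        raise ValueError("a group must contain at least one drive")
    return failed / total


def relative_failure_rate(failed_with: int, total_with: int,
                          failed_without: int, total_without: int) -> float:
    """How many times more often drives with a signal fail than drives without it."""
    baseline = failure_rate(failed_without, total_without)
    if baseline == 0:
        raise ValueError("baseline group has no failures; the ratio is undefined")
    return failure_rate(failed_with, total_with) / baseline


# Hypothetical counts (not from the report).
with_scan_errors = (200, 1_000)        # 200 of 1,000 drives failed -> 20%
without_scan_errors = (1_000, 50_000)  # 1,000 of 50,000 drives failed -> 2%

ratio = relative_failure_rate(*with_scan_errors, *without_scan_errors)
print(f"{ratio:.1f}x")  # prints "10.0x"
```

Note that in these made-up numbers the group without scan errors still accounts for most of the failures (1,000 versus 200), even though its failure rate is ten times lower. A strong relative signal can coexist with many failures that show no signal at all.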


Predictive models based on scan errors, reallocation count, offline reallocation count, and probational count could not predict more than half of the drive failures.
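
One way to see why such models hit a ceiling: a drive that fails with zero counts on all four signals looks the same to the model as a healthy drive with zero counts, so it cannot be singled out. The Python sketch below computes that ceiling as the share of failed drives showing at least one non-zero signal. `DriveRecord`, `best_possible_recall`, and the sample records are hypothetical and made up for illustration; they are not the report's data or method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DriveRecord:
    scan_errors: int
    reallocations: int
    offline_reallocations: int
    probational: int
    failed: bool

    def has_any_signal(self) -> bool:
        """True if at least one of the four counts is non-zero."""
        return any((self.scan_errors, self.reallocations,
                    self.offline_reallocations, self.probational))


def best_possible_recall(drives: list) -> float:
    """Share of failed drives that showed at least one of the four signals.

    A model that uses only these signals cannot single out the remaining
    failed drives, short of flagging every all-zero drive as well.
    """
    failed = [d for d in drives if d.failed]
    if not failed:
        raise ValueError("no failed drives in the sample")
    return sum(1 for d in failed if d.has_any_signal()) / len(failed)


# Made-up sample: five failed drives, only two of which showed any signal.
sample = [
    DriveRecord(3, 0, 0, 0, failed=True),
    DriveRecord(0, 12, 2, 1, failed=True),
    DriveRecord(0, 0, 0, 0, failed=True),
    DriveRecord(0, 0, 0, 0, failed=True),
    DriveRecord(0, 0, 0, 0, failed=True),
    DriveRecord(0, 1, 0, 0, failed=False),
    DriveRecord(0, 0, 0, 0, failed=False),
]

recall_ceiling = best_possible_recall(sample)
print(f"{recall_ceiling:.0%}")  # prints "40%"
```

The sample also includes a healthy drive with a non-zero reallocation count, a reminder that flagging every drive with any signal produces false alarms as well.
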
In the report’s own words: “We conclude that it is unlikely that SMART data alone can be effectively used to build models that predict failures of individual drives. SMART parameters still appear to be useful in reasoning about the aggregate reliability of large disk populations, which is still very important for logistics and supply-chain planning.”

Glossary of Terms

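Scan Errors – Drives typically scan the disk surface in the background and report errors as they find them. Large scan error counts can be indicative of surface defects, and are therefore believed to be indicative of lower reliability.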

Reallocation Count – When the drive logic believes that a sector is damaged it can remap the faulty sector number to a new physical sector drawn from a pool of spares. Reallocation count reflects the number of times this has happened, and is seen as an indication of drive surface wear.

Offline Reallocation – Offline reallocations are defined as the subset of the reallocation count that includes only sectors found and reallocated during background scrubbing. In other words, the count should exclude sectors that are reallocated as a result of errors found during actual I/O operations.

Probational Count – Disk drives put suspect bad sectors “on probation” until they either fail permanently and are reallocated or continue to work without problems. Probational counts can be seen as a softer error indication.

Seek Errors – Seek errors occur when a disk drive fails to properly track a sector and has to wait for another revolution before it can read from or write to that sector.

CRC Errors – Cyclic redundancy check (CRC) errors are detected during data transmission between the physical media and the interface.
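
For readers who want to look at some of these counters on their own drives, here is a rough Python sketch that pulls raw values out of a SMART attribute table. It assumes the ten-column text layout that `smartctl -A` from smartmontools prints, and it assumes the conventional ATA attribute IDs (5, 7, 12, 197, 199) correspond to the glossary terms; neither the IDs nor this mapping comes from the report. Scan errors and offline reallocations are left out because no single conventional attribute ID is assumed for them here. The sample table is hand-written for illustration, not captured from a real drive.

```python
# Assumed mapping from conventional ATA SMART attribute IDs to glossary terms.
ATTRIBUTE_IDS = {
    5: "reallocation_count",   # Reallocated_Sector_Ct
    7: "seek_errors",          # Seek_Error_Rate (raw value is vendor-specific)
    12: "power_cycles",        # Power_Cycle_Count
    197: "probational_count",  # Current_Pending_Sector
    199: "crc_errors",         # UDMA_CRC_Error_Count
}


def parse_attribute_table(text: str) -> dict:
    """Extract raw values for the attributes above from a smartctl-style table."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least ten columns;
        # the header row starts with "ID#" and is skipped by the isdigit() check.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name = ATTRIBUTE_IDS.get(int(fields[0]))
        raw = fields[9]  # first token of the RAW_VALUE column
        if name is not None and raw.isdigit():
            counts[name] = int(raw)
    return counts


# Hand-written sample in the assumed layout (not real drive output).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       8
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       412
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       2
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

counts = parse_attribute_table(SAMPLE)
print(counts)
```

Given the findings above, a non-zero reallocation or probational count is a reason to watch a drive more closely, but all-zero counts are no guarantee that the drive is healthy.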

Copyright © 2011. Go Trading - All Rights Reserved