Addressing the Challenge of Multiple Disk Failures in RAID
In 2007, Robin Harris raised concerns about the reliability of RAID5 configurations,
attributing potential issues to the growth in disk capacity without a corresponding improvement in specified disk reliability.
Harris argued that, during a rebuild, the likelihood of a second failure on one of the surviving disks,
specifically an Unrecoverable Read Error (URE), would become unacceptably high.
However, the original article's assumptions were flawed for several reasons:
- It incorrectly conflated complete disk failures with the inability to read a single sector.
- The reliability specifications were based on bitstreams rather than block devices.
- Vendor-declared data about disk reliability tends to be conservative.
Vendors typically specify the probability of a URE as one bit in 10^14 to 10^15 bits read.
Since 10^14 bits is only about 12.5 TB, disks that truly performed at that specified limit would often fail a single full read-back after being written once, which is not what is observed in practice.
Therefore, as of 2015, URE is not a significant concern for RAID5 configurations.
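The arithmetic behind the vendor-specification argument can be sketched as follows. This is a minimal illustration, not a reliability model: it assumes every bit read fails independently at the specified rate, which is exactly the bitstream-style reading of the specification that the list above calls into question. The function name and the example capacities are hypothetical.

```python
import math


def p_at_least_one_ure(bytes_read: float, ure_rate_per_bit: float) -> float:
    """Probability of hitting at least one URE while reading `bytes_read` bytes.

    Assumption: each bit fails independently with probability
    `ure_rate_per_bit`, so the result is 1 - (1 - p)^bits.
    """
    bits = bytes_read * 8
    # log1p/expm1 keep precision when p is tiny and bits is huge.
    return -math.expm1(bits * math.log1p(-ure_rate_per_bit))


if __name__ == "__main__":
    # Hypothetical disk capacities in decimal terabytes.
    for capacity_tb in (1, 4, 12):
        for rate in (1e-14, 1e-15):
            p = p_at_least_one_ure(capacity_tb * 1e12, rate)
            print(f"{capacity_tb:>2} TB read at {rate:.0e} per bit: {p:.1%}")
```

Under this naive model, one full read of a 4 TB disk at 10^-14 per bit would hit a URE roughly 27 percent of the time. Real disks are read back in full far more reliably than that, which is why the specified figure is best treated as a conservative bound.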
Instead, the primary causes of data loss in RAID5 setups are common-mode disk failures and
delays in replacing a failed disk. Replacing the failed disk within a day or two is crucial;
postponing the replacement for an extended period leaves the array without redundancy. Statistical simulations of reliability
treat disk failures as independent events. In reality, failures caused
by environmental conditions (such as lightning strikes) or firmware bugs (as seen in the Seagate 7200.11)
are correlated across disks and may be more prevalent than independent single-disk failures.
With Seagate 7200.11 disks, entire disk packs have been observed failing within a few hours.
Due to the substantial probability of common-mode failures, which cannot be entirely eliminated from the system,
it is crucial to recognize that RAID is not a substitute for a comprehensive backup strategy.
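The effect of replacement delay can be illustrated with a small calculation. This is a sketch under stated assumptions: disk lifetimes are treated as independent with a constant failure rate derived from an annual failure rate (AFR), and the 3 percent AFR, the disk count, and the function name are hypothetical values chosen for illustration. Because it models only independent failures, it says nothing about the common-mode failures described above and therefore understates the real risk.

```python
import math

HOURS_PER_YEAR = 24 * 365


def p_second_failure(surviving_disks: int, window_hours: float, afr: float) -> float:
    """Probability that at least one surviving disk fails within the window.

    Assumptions: independent disks, constant hazard rate derived from the
    annual failure rate `afr` (a fraction, e.g. 0.03 for 3 percent).
    """
    # Convert the AFR into a per-hour hazard rate: afr = 1 - exp(-rate * year).
    rate_per_hour = -math.log1p(-afr) / HOURS_PER_YEAR
    # The array is lost if any one of the surviving disks fails in the window.
    return -math.expm1(-surviving_disks * rate_per_hour * window_hours)


if __name__ == "__main__":
    # Hypothetical 5-disk RAID5 array with one disk already failed.
    for label, hours in (("2 days", 48), ("30 days", 720)):
        p = p_second_failure(surviving_disks=4, window_hours=hours, afr=0.03)
        print(f"degraded for {label}: {p:.3%} chance of a second failure")
```

With these assumed numbers, leaving the array degraded for a month instead of two days raises the chance of a second independent failure by roughly a factor of fifteen.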