Multiple disk failures in RAID
Because disk capacity has grown without a corresponding increase in specified disk reliability, Robin Harris argued in 2007 that RAID5 would no longer provide adequate reliability: the probability of a second disk failing, or more precisely of hitting an Unrecoverable Read Error (URE), during a rebuild would be too high.
Overall, the original article was wrong because
- it confused full disk failures with the inability to read a single sector;
- it was based on reliability specifications given for bitstreams rather than for block devices;
- the disk reliability figures declared by vendors are conservative.
For example, vendors declare that the probability of encountering an unrecoverable read error (URE) is one bit in 10^14 to 10^15. If disks were really only as reliable as the vendors specify, it would often be impossible to read a disk back even after a single write. Thus, as of 2015, URE is not a problem for a RAID5.
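The arithmetic behind this claim can be sketched as follows. This is a minimal illustration, not a reliability model: it assumes that every bit fails independently at exactly the specified rate, which is the literal reading of the vendor figure that the text argues against. The function name `p_ure` and the 12 TB volume (for instance, three surviving 4 TB disks read in full during a rebuild) are hypothetical values chosen for the example.

```python
import math

def p_ure(bytes_read: float, bits_per_error: float = 1e14) -> float:
    """Probability of at least one URE while reading `bytes_read` bytes.

    Assumption: each bit is unreadable independently with probability
    1 / bits_per_error (the vendor specification taken literally).
    """
    bits = bytes_read * 8
    # 1 - (1 - 1/B)**bits, computed via log1p/expm1 to avoid rounding loss
    return -math.expm1(bits * math.log1p(-1.0 / bits_per_error))

if __name__ == "__main__":
    volume = 12e12  # hypothetical: 12 TB read during a rebuild
    for spec in (1e14, 1e15):
        print(f"1 error per {spec:.0e} bits: P(URE) = {p_ure(volume, spec):.3f}")
```

Under these assumptions, a single full read of 12 TB would fail more often than not at the 10^14 figure. Real disks are routinely read in full without errors, which is why the specified rate is best treated as a conservative bound rather than as an expected value.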
Instead, the main causes of data loss in a RAID5 are either disk failures due to a common cause (so-called common mode failures) or insufficiently quick disk replacement: a failed disk in a RAID5 has to be replaced within a day or two, not put off for a couple of months. Statistical simulations of reliability treat disk failures as independent events. In fact, disk failures caused by poor environmental conditions, such as a lightning strike, or by firmware bugs (as happened with the Seagate 7200.11) are dependent failures, and they are perhaps more common than independent disk failures. With Seagate 7200.11 disks, there were cases when the whole disk pack failed over the course of a few hours. Because of the high probability of common mode failure, which cannot really be engineered out of the system, RAID is not a replacement for backup.
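To see why the replacement delay matters, here is a rough sketch of the chance that a second disk fails while the array is degraded. It assumes independent failures with a constant failure rate, which is exactly the simplification that statistical simulations make; the 3% annual failure rate, the three surviving disks, and the function name `p_second_failure` are hypothetical values picked for illustration.

```python
import math

def p_second_failure(surviving_disks: int,
                     days_until_replaced: float,
                     annual_failure_rate: float = 0.03) -> float:
    """Probability that at least one surviving disk fails before the
    failed disk is replaced.

    Assumptions: failures are independent and the failure rate is constant
    (exponential lifetime); annual_failure_rate is a made-up figure.
    """
    # Convert the yearly failure probability into a constant hazard rate.
    hazard_per_year = -math.log1p(-annual_failure_rate)
    exposure_years = days_until_replaced / 365.0
    return -math.expm1(-surviving_disks * hazard_per_year * exposure_years)

if __name__ == "__main__":
    for days in (2, 60):
        p = p_second_failure(surviving_disks=3, days_until_replaced=days)
        print(f"replaced after {days:2d} days: P(second failure) = {p:.4%}")
```

In this model the risk grows roughly in proportion to the delay, so waiting two months instead of two days raises it about thirtyfold. The sketch covers only the independent case; common mode failures are not captured by it at all, which is the reason they cannot be offset by adding redundancy.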