"Write hole" phenomenon

The "write hole" effect can occur if power fails during a write. It affects all array types, including but not limited to RAID5, RAID6, and RAID1. After such a failure it is impossible to determine which data blocks or parity blocks were written to the disks and which were not. The parity then no longer matches the rest of the data in the stripe, and you cannot determine with confidence which is incorrect: the parity or one of the data blocks.

Write hole in RAID5

"Write hole" is widely recognized to affect a RAID5, and most of the discussions of the "write hole" effect refer to RAID5. It is important to know that other array types are affected as well.

If the user data is not written completely, the filesystem usually corrects the errors at the next boot by replaying its transaction log. If the filesystem does not support journaling, the errors are still corrected during the next consistency check (CHKDSK or fsck).

If the parity (in RAID5) or the mirror copy (in RAID1) is not written correctly, the error goes unnoticed until one of the array member disks fails. When a disk fails, you replace it and start a RAID rebuild, and one of the blocks is then recovered incorrectly. If RAID recovery is needed because of a controller failure rather than a disk failure, a parity mismatch does not matter, because the data can be read from the member disks without using the parity.
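
The incorrect rebuild can be illustrated with a minimal Python sketch. It assumes a three-disk RAID5 stripe with simple XOR parity; the helper xor_blocks and the four-byte block contents are hypothetical and chosen only for illustration.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, as RAID5 parity does."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# A consistent stripe: two data blocks plus their parity.
d0, d1 = b"AAAA", b"BBBB"
parity = xor_blocks([d0, d1])

# Power fails mid-write: the new d0 reaches its disk,
# but the matching parity update does not.
d0 = b"CCCC"
stale_parity = parity

# Nothing looks wrong while all disks work. Later the disk holding d1
# fails, and the rebuild derives d1 from d0 and the stale parity.
rebuilt_d1 = xor_blocks([d0, stale_parity])

print(rebuilt_d1 == b"BBBB")  # False: d1 is rebuilt incorrectly
```

Note that d1 was never being written when the power failed, yet it is the block that ends up damaged after the rebuild.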

A mismatch of parity or mirrored data can be repaired without user intervention if, at some later point, a full stripe is written on a RAID5 or the same data block is written again on a RAID1. In that case the old (incorrect) parity is not used; new (correct) parity is calculated and written. New parity is also written if you force a resynchronization of the array (this option is available on many RAID controllers and NAS devices).
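
As a rough sketch of what a resynchronization does to one stripe, assuming XOR parity as in RAID5: the parity is simply recomputed from whatever data blocks are currently on the disks. The function names is_consistent and resync are hypothetical, not the API of any controller.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, as RAID5 parity does."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def is_consistent(data_blocks, parity):
    """Check whether the stored parity matches the data blocks."""
    return xor_blocks(data_blocks) == parity

def resync(data_blocks):
    """Recompute parity from the data blocks as they are on disk now."""
    return xor_blocks(data_blocks)

# Stripe left behind by an interrupted write: new data, old parity.
data_blocks = [b"CCCC", b"BBBB"]
stale_parity = xor_blocks([b"AAAA", b"BBBB"])

before = is_consistent(data_blocks, stale_parity)
fresh_parity = resync(data_blocks)
after = is_consistent(data_blocks, fresh_parity)

print(before, after)  # False True
```

Resynchronization trusts the data blocks as they are. It makes the stripe consistent again, but it cannot bring back user data that never reached the disks.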

Generally, a power failure during a write is rare, an uninterruptible power supply is cheap, and a stripe block is not that big. Hence, the probability of encountering a "write hole" in practice is small.

Write hole in RAID1

Similarly to a RAID5, the write hole effect can happen in a RAID1. Even if one disk is designated as "first" or "authoritative", and the write operations are arranged so that data is always written to this disk first, ensuring that it contains the latest copy of data, two difficulties still remain:

  • A hard disk can cache data itself, and this caching may violate the write order arranged by the controller.
  • If the disk designated as first/authoritative fails, write holes may already be present on the second disk, and it would be impossible to find them without the data from the first disk.
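
A small Python sketch of the mirror case, under the assumption that each disk is modeled as a list of blocks; find_mismatches and the block contents are hypothetical. Comparing the two copies shows where they differ, but not which copy is correct, and with only one disk left there is nothing to compare against.

```python
def find_mismatches(disk_a, disk_b):
    """Return the indexes of blocks that differ between two mirror copies."""
    return [i for i, (a, b) in enumerate(zip(disk_a, disk_b)) if a != b]

# Power failed after block 1 was written to disk_a but before disk_b.
disk_a = [b"old0", b"new1", b"old2"]
disk_b = [b"old0", b"old1", b"old2"]

mismatches = find_mismatches(disk_a, disk_b)
print(mismatches)  # [1]

# The comparison locates the write hole, but the blocks themselves carry
# no information about which of b"new1" and b"old1" is the intended one.
# If disk_a fails, disk_b alone gives no sign that block 1 is stale.
```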

Write hole in RAID6

Theoretically, the write hole phenomenon can also occur in a RAID6 consisting of a large number of member disks. A write hole in a RAID5/RAID1 occurs when one of the member disks does not match the others, and by the nature of single-redundant RAID5/RAID1 it is impossible to tell which of the disks is bad. A write hole in a RAID 6 occurs when two disks fail to match the others simultaneously. Such a situation can happen, for example, if the power is turned off in the middle of a full stripe write.

Write hole in complex RAID types

Complex RAID types inherit a write hole vulnerability from those RAID types on which they are based.

  • RAID 10 inherits the write hole from RAID 1. If one of the mirrored copies has been written but the second one has not, it is impossible to know which of them is correct.
  • In a RAID 50, which can be represented as a set of RAID 5 arrays, a write hole can occur in each of these arrays.
  • RAID 100 is vulnerable in the same way, and so is RAID 60, albeit with lower probability.

How to avoid the "write hole"?

To avoid the write hole completely, you need to provide write atomicity. Operations that cannot be interrupted partway through are called "atomic". An atomic operation is either fully completed or not done at all. If an atomic operation is interrupted for external reasons (e.g. a power failure), the system is guaranteed to stay in either the original or the final state.

In a system that consists of several independent devices, natural atomicity does not exist. Variance in the characteristics of mechanical hard drives and the particularities of the data bus make it impossible to provide the required synchronization. In such cases transactions are typically used. A transaction is a group of operations for which atomicity is provided artificially. However, providing transaction atomicity carries significant overhead. Hence, transactions are not used in RAIDs.
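
To show what such a transaction would involve, here is a toy Python model of a journaled stripe write. It is a sketch under stated assumptions, not how any particular controller works: the class JournaledStripe is hypothetical, and the journal itself is assumed to be written atomically and to survive power loss.

```python
class JournaledStripe:
    """Toy model of a stripe (data blocks plus parity) with a write journal."""

    def __init__(self, blocks):
        self.blocks = list(blocks)  # blocks as stored on the member disks
        self.journal = None         # pending transaction (assumed durable)

    def write(self, new_blocks, fail_after=None):
        # Step 1: record the complete new stripe in the journal.
        self.journal = list(new_blocks)
        # Step 2: update the member disks one by one.
        for i, block in enumerate(new_blocks):
            if fail_after is not None and i >= fail_after:
                return  # simulated power loss, journal still pending
            self.blocks[i] = block
        # Step 3: the transaction is complete, discard the journal.
        self.journal = None

    def recover(self):
        # After power returns, replay any pending transaction.
        if self.journal is not None:
            self.blocks = list(self.journal)
            self.journal = None

old_stripe = [b"D0-old", b"D1-old", b"P-old"]
new_stripe = [b"D0-new", b"D1-new", b"P-new"]

stripe = JournaledStripe(old_stripe)
stripe.write(new_stripe, fail_after=1)  # only the first disk is updated
torn = list(stripe.blocks)              # mixed old and new blocks

stripe.recover()
print(stripe.blocks == new_stripe)  # True
```

Every block is written twice, once to the journal and once in place, which is the kind of overhead the paragraph above refers to.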

Another way to avoid a write hole is to use ZFS, which is a hybrid of a filesystem and a RAID. ZFS uses "copy-on-write" to provide write atomicity. However, this technology requires a special type of RAID (RAID-Z), which cannot be reduced to a combination of common RAID types (RAID 0, RAID 1, or RAID 5).
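
The general idea of copy-on-write can be sketched in a few lines of Python. This is a toy model only and does not reflect the actual ZFS on-disk format: the class CowStore and its slot numbering are hypothetical. New data is written to free space, and only a single pointer update makes it current, so an interruption leaves the old version intact.

```python
class CowStore:
    """Toy copy-on-write store: data is never overwritten in place."""

    def __init__(self, initial):
        self.slots = {0: initial}  # slot number -> stored version
        self.root = 0              # pointer to the current version
        self.next_slot = 1

    def write(self, new_data, crash_before_commit=False):
        # Write the new version to an unused slot; the old one is untouched.
        slot = self.next_slot
        self.next_slot += 1
        self.slots[slot] = new_data
        if crash_before_commit:
            return  # simulated power loss: the root still points to old data
        # Commit: a single pointer update switches to the new version.
        self.root = slot

    def read(self):
        return self.slots[self.root]

store = CowStore(b"old")

store.write(b"new", crash_before_commit=True)
after_crash = store.read()   # still the complete old version

store.write(b"new")
after_commit = store.read()  # the complete new version

print(after_crash, after_commit)  # b'old' b'new'
```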

How to reduce the negative effect of a "write hole"?

In practice, the risk of losing data to the write hole can be reduced to an acceptable level even for ordinary arrays such as RAID 1 and RAID 5.

  1. Supply uninterruptible power. You can simply use an uninterruptible power supply (UPS) for the entire RAID. The second option is a Battery Backup Unit (BBU) connected directly to the RAID controller. The battery preserves the contents of the controller's write cache if power fails. All write operations that were in the cache and were not completed because of the power failure are carried out once power returns. A BBU protects only the controller cache, not the hard disks' own write caches.
  2. Synchronize your array regularly. Synchronization is the process of recalculating parity values (for a RAID 5) or other redundancy data (for RAID 6, RAID 7, or RAID DP). In a RAID1, the data from one disk is copied to the other during synchronization. Synchronization eliminates all the write holes accumulated during operation. Once synchronization completes, the redundant data exactly matches the user data. At the same time, synchronization detects bad sectors in rarely used areas of the array, because all the array sectors are read from and written to. Modern hardware controllers usually allow an array to be synchronized on a schedule. RAIDs created using Windows cannot be synchronized on a schedule.
  3. If SSDs are used in the RAID, you can usually turn off the write cache and still get enough performance for your particular task. Turning off the write cache does not eliminate the write hole entirely, but it decreases both the probability of losing data and the amount of data that can be lost because of a power failure.
