Botched RAID 5 rebuild
A botched rebuild is one of the more widespread failure modes in RAID 5. It happens quite often, it is not readily apparent afterwards, and there is no practically feasible way to recover once the botched rebuild is complete.
The damage
The damage occurs if one removes several disks from the RAID 5 array, plugs them back in a different order, and then performs a RAID 5 rebuild. The rebuild, sometimes called synchronization or resync, recomputes and rewrites all the XOR parity blocks on the array. Normally, a rebuild starts automatically once a drive is removed and re-inserted, or after a power failure, in order to restore the redundancy. The damage goes like this:
Original | | | |
---|---|---|---|
1 | 2 | 3 | P |
4 | 5 | P | 6 |
7 | P | 8 | 9 |
P | 10 | 11 | 12 |
With drives swapped | | | |
---|---|---|---|
1 | 2 | P | 3 |
4 | 5 | 6 | P |
7 | P | 9 | 8 |
P | 10 | 12 | 11 |
After the rebuild | | | |
---|---|---|---|
1 | 2 | P | X |
4 | 5 | X | P |
7 | X | 9 | 8 |
X | 10 | 12 | 11 |
where P denotes original parity and X denotes new parity.
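The three tables above can be reproduced with a small symbolic model. This is only an illustrative sketch: it tracks which slot gets rewritten in each row, not the actual block contents, and the helper names `swap_slots` and `rebuild` are hypothetical, not part of any controller or tool.

```python
# Original layout from the first table: one list per row, one entry per slot.
original = [
    ["1", "2", "3", "P"],
    ["4", "5", "P", "6"],
    ["7", "P", "8", "9"],
    ["P", "10", "11", "12"],
]


def swap_slots(layout, a, b):
    """Return the layout as seen after the disks in slots a and b trade places."""
    swapped = [row[:] for row in layout]
    for row in swapped:
        row[a], row[b] = row[b], row[a]
    return swapped


def rebuild(layout, parity_slots):
    """Mark the slot treated as parity in each row as rewritten ("X")."""
    rebuilt = [row[:] for row in layout]
    for row, slot in zip(rebuilt, parity_slots):
        row[slot] = "X"
    return rebuilt


# Assumption: the controller still expects parity in the original slots.
parity_slots = [row.index("P") for row in original]

swapped = swap_slots(original, 2, 3)   # the last two disks are exchanged
after = rebuild(swapped, parity_slots)

# Data blocks (not old parity) that were overwritten by new parity.
lost = [
    old
    for row_before, row_after in zip(swapped, after)
    for old, new in zip(row_before, row_after)
    if new == "X" and old != "P"
]

for row in after:
    print(" | ".join(row))
print("overwritten data blocks:", lost)
```

In this model only blocks 3 and 6 are overwritten; in the other two rows the new parity lands on the old parity block, so no data block is touched there.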
Recovery
Recovery by automated software is not possible.
Manual recovery is theoretically possible. There is still enough data to reconstruct the contents of the array, because RAID 5 tolerates the loss of a single drive and less than one disk's worth of data is destroyed. In the example above you can compute, say, 3 = P xor 2 xor 1. However, manual recovery requires knowing both the original and the current configuration (block sizes and disk order), some custom-made software, and a very skilled operator. At the moment, we are not aware of any person or service routinely handling this sort of recovery.
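The identity 3 = P xor 2 xor 1 can be checked on toy data. The sketch below is a minimal illustration, assuming byte-wise XOR parity over equal-sized blocks; the block contents and the helper `xor_blocks` are made up for the example and say nothing about how a real recovery tool would locate the blocks on disk.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length byte blocks together, byte by byte."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical 4-byte blocks standing in for data blocks 1, 2, and 3 of the first row.
block1 = b"\x11\x22\x33\x44"
block2 = b"\xa0\xb0\xc0\xd0"
block3 = b"\x0f\x1e\x2d\x3c"

# Original parity P, as written before the disks were swapped.
parity = xor_blocks(block1, block2, block3)

# After the botched rebuild, block 3 is gone but P, 2, and 1 survive.
recovered3 = xor_blocks(parity, block2, block1)
assert recovered3 == block3
```

The arithmetic itself is trivial. The hard part, as noted above, is knowing which surviving blocks belong to which row, which is where the original and current configurations come in.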
Additional considerations
Under default settings, the controller will in most cases perform the rebuild silently, automatically, and in the background. This makes the problem worse, because the rebuild keeps destroying data while the operator is still trying to figure out what happened. By the time the operator understands the problem, it is typically too late and the damage is too extensive.
Putting disks back to RAID 5
When replacing a disk in RAID 5, it is quite easy to pull out the wrong disk. This sort of thing happens more often than you would expect. Much more often indeed. So, once the wrong disk is pulled out, the array is then missing two disks. The apparent failure of two disks causes the array to fail, because RAID 5 is only designed to work with one failed disk.
It is possible (on most controllers) to bring the array back online, but you have to insert the disks in a specific order. First and foremost, clearly label the disk you have just removed and its corresponding port, so that you don't make things worse by accidentally swapping two disks. Then identify, remove, and similarly label the faulty drive. You should now have three labeled drives and two labeled ports. The inventory should be as follows:
1. The blank, new, and presumably working replacement drive.
2. The working drive from the array.
3. The port corresponding to the working drive on the array.
4. The faulty drive.
5. The port corresponding to the faulty drive.
The drives should be placed back in the reverse order of failures (failures, not removals). Your actions should be along these lines:
- Connect the working drive (2) to its port (3). Reboot and make sure that the array is accessible and that the rebuild did not start.
- Connect the replacement drive (1) to the port where the faulty drive was (5). Reboot again, and make sure the rebuild has started.
If you mess up the order in which you insert the disks, you get a massive amount of zeros added and mixed into the data. This sort of damage would not be recoverable.