packetmischief.ca

OpenBSD Software RAID Administration: Failed Component


Dealing with a Failed Component

A failed component means that one of the drives in the RAID set has developed problems or failed completely. Once the drive is replaced, the new drive must be configured and the RAID set rebuilt onto it. The steps are very similar to those performed during installation. As long as the other drive is healthy, the system continues to operate normally on the one good drive, so it can be taken out of service in a controlled manner to replace the failed drive.

After replacing the failed drive and booting, the system will come up with only one drive in the RAID set. This is because the new drive doesn't yet have an OpenBSD partition, a disklabel, and so on. Assuming that wd1 is the drive that failed, the array status looks like this:

# raidctl -s raid0
raid0 Components:
    /dev/wd0e: optimal
   component1: failed
No spares.
Parity status: clean
Reconstruction is 100% complete.
Parity Re-write is 100% complete.
Copyback is 100% complete.

The placeholder component1 is shown in place of the failed drive.
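
The status check above can also be scripted. The following is a minimal sketch, not part of the original procedure: the function name failed_components is hypothetical, and it assumes the raidctl -s output has the same layout as the sample shown above.

```shell
# Print the name of every component that raidctl reports as failed.
# Reads `raidctl -s` output on stdin. Assumes the layout shown in this
# article: a "Components:" header, one "name: status" line per component,
# then a "Spares:" or "No spares." line.
failed_components() {
    awk '
        /^Spares:|^No spares/ { in_components = 0 }
        in_components && $NF == "failed" { sub(/:$/, "", $1); print $1 }
        /Components:$/ { in_components = 1 }
    '
}
```

Run as `raidctl -s raid0 | failed_components`. With the sample output above it would print component1; with a healthy array it prints nothing.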

The first recovery steps involve creating the OpenBSD partition, creating the disklabels, and mirroring the boot filesystem from the non-failed drive.

# fdisk -i wd1
# disklabel wd0 > /tmp/disklabel.wd0
# disklabel -Rr wd1 /tmp/disklabel.wd0
# newfs wd1a
# mkdir /mnt/wd0a /mnt/wd1a
# mount /dev/wd0a /mnt/wd0a
# mount /dev/wd1a /mnt/wd1a
# cd /mnt/wd0a
# pax -r -w -pe -v .profile .cshrc altroot bin boot bsd etc home \
    root sbin stand tmp usr var /mnt/wd1a
# cd /mnt
# mkdir wd1a/dev wd1a/mnt
# cp /dev/MAKEDEV wd1a/dev
# cd wd1a/dev && ./MAKEDEV all
# vi /mnt/wd1a/etc/fstab
# cp /usr/mdec/boot /mnt/wd1a/boot
# /usr/mdec/installboot -v /mnt/wd1a/boot /usr/mdec/biosboot wd1
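
Before moving on, it may be worth confirming that everything named in the pax command actually arrived on the new drive. This is a hedged sketch only: missing_entries is a hypothetical helper, and it checks only that each top-level name exists on the destination, not that the contents match.

```shell
# List the names that exist under SRC but are missing under DST.
# Usage: missing_entries SRC DST NAME...
# Only top-level existence is checked; file contents are not compared.
missing_entries() {
    src=$1
    dst=$2
    shift 2
    for name in "$@"; do
        if [ -e "$src/$name" ] && [ ! -e "$dst/$name" ]; then
            printf '%s\n' "$name"
        fi
    done
}
```

For the copy above, the call would be `missing_entries /mnt/wd0a /mnt/wd1a .profile .cshrc altroot bin boot bsd etc home root sbin stand tmp usr var`. No output means every entry present on wd0a also exists on wd1a.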

Now that the drive is configured, the RAID set can be reconstructed onto the drive. However, since the drive isn't part of the RAID set at this point, it's not possible to do an in-place reconstruction. The solution is to add wd1 as a spare drive and then reconstruct component1 onto it.

# raidctl -a /dev/wd1e raid0
# raidctl -vF component1 raid0
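
If the reconstruction is watched from a second terminal, or started without -v, progress can be polled from the status output. The sketch below is an assumption-laden illustration: recon_percent and wait_for_reconstruction are hypothetical names, and it assumes that raidctl -s reports progress on a line of the form "Reconstruction is N% complete." as in the listings in this article.

```shell
# Extract N from the "Reconstruction is N% complete." line.
# Reads `raidctl -s` output on stdin; prints nothing if the line is absent.
recon_percent() {
    sed -n 's/^Reconstruction is \([0-9][0-9]*\)% complete\.$/\1/p'
}

# Poll the given RAID device until reconstruction reports 100%.
# Usage: wait_for_reconstruction DEVICE [INTERVAL_SECONDS]
wait_for_reconstruction() {
    dev=$1
    interval=${2:-60}
    while :; do
        pct=$(raidctl -s "$dev" | recon_percent)
        [ "$pct" = 100 ] && break
        echo "reconstruction: ${pct:-unknown}%"
        sleep "$interval"
    done
}
```

For example, `wait_for_reconstruction raid0 30` checks every 30 seconds and returns once the status line reads 100%.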

Once the reconstruction is finished, the system will once again be running on redundant drives. The array status now looks like this:

# raidctl -s raid0
raid0 Components:
    /dev/wd0e: optimal
   component1: spared
Spares:
    /dev/wd1e: used_spare
Parity status: clean
Reconstruction is 100% complete.
Parity Re-write is 100% complete.
Copyback is 100% complete.

Once the system is rebooted (the reboot can be done at any time), wd1e will cease to be a spare and will once again be recognized as a component of the RAID set. The failed component is now repaired.

Disclaimer

I make no guarantees about the accuracy or fitness of information within this article. The information within is accurate and true to the best of my knowledge. I will not be held responsible for data loss or system downtime.