How to safely replace a hard drive in a Linux ZFS RAID array.
Let’s lay out an example scenario: I just got an email alert from smartmontools telling me that drive /dev/sdf is failing in my ZFS RAID 10 (striped mirrors) array.
Notes
- If you have an extra drive bay available, refrain from removing the old drive until after the resilver is complete.
- While ZFS is perfectly capable of replacing a disk with an unformatted one, some scenarios do require manual formatting (e.g. an array whose disks contain multiple partitions); otherwise you should be fine skipping steps 3 and 4. If your system does require manual formatting, there may be additional steps after the resilver is complete (e.g. re-installing the bootloader with grub-install /dev/sdX; a minimal sketch follows this list), and you should avoid rebooting the system while ZFS is resilvering in this case.
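For that bootloader case, a minimal sketch, assuming a legacy-BIOS system that boots with GRUB (the whole-disk array used in this guide does not need it):
# Hedged example - only applies if the pool's disks also carry boot/GRUB data.
# Run it only after the resilver has completed, substituting your new disk for /dev/sdX.
root@zfs-lab# grub-install /dev/sdX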
1) Gather Information
The first thing we need to do is collect some information we will want handy during the process. I highly recommend opening up a text editor on your workstation and dropping the information there while you work.
GUID, Pool Name, and a Similarly Partitioned Disk
Use the command zdb to dump some metadata for all pools. You can see from the output below that my failing drive has a GUID of 4024410420552873090, lives in the pool raid10, and has an adjacent drive, /dev/sde, which will have the same partition table. (A shorter way to get just the GUIDs is shown right after the output.)
root@zfs-lab# zdb
raid10:
    version: 5000
    name: 'raid10'
    state: 0
    txg: 1367134
    pool_guid: 13977946214682563558
    errata: 0
    hostid: 3182994292
    hostname: 'zfs-lab'
    com.delphix:has_per_vdev_zaps
    vdev_children: 2
    vdev_tree:
        type: 'root'
        id: 0
        guid: 13977946214682563558
        create_txg: 4
        children[0]:
            type: 'mirror'
            id: 0
            guid: 946224559609474074
            metaslab_array: 265
            metaslab_shift: 34
            ashift: 12
            asize: 2000384688128
            is_log: 0
            create_txg: 4
            com.delphix:vdev_zap_top: 129
            children[0]:
                type: 'disk'
                id: 0
                guid: 1247532717286800833
                path: '/dev/sde1'
                devid: 'ata-WDC_WD2003FYYS-02W0B1_WD-WMAY05058763-part1'
                phys_path: 'pci-0000:01:00.1-ata-5'
                whole_disk: 1
                DTL: 428
                create_txg: 4
                com.delphix:vdev_zap_leaf: 130
            children[1]:
                type: 'disk'
                id: 1
                guid: 4024410420552873090
                path: '/dev/sdf1'
                devid: 'ata-Hitachi_HDS722020ALA330_JK1101B9GKY6ET-part1'
                phys_path: 'pci-0000:01:00.1-ata-6'
                whole_disk: 1
                DTL: 427
                create_txg: 4
                com.delphix:vdev_zap_leaf: 131
        children[1]:
            type: 'mirror'
            id: 1
            guid: 8353599129995598725
            metaslab_array: 256
            metaslab_shift: 34
            ashift: 12
            asize: 2000384688128
            is_log: 0
            create_txg: 4
            com.delphix:vdev_zap_top: 132
            children[0]:
                type: 'disk'
                id: 0
                guid: 5161684360393329728
                path: '/dev/sdg1'
                devid: 'ata-WDC_WD2003FYYS-02W0B1_WD-WCAY00294631-part1'
                phys_path: 'pci-0000:01:00.1-ata-7'
                whole_disk: 1
                DTL: 426
                create_txg: 4
                com.delphix:vdev_zap_leaf: 133
            children[1]:
                type: 'disk'
                id: 1
                guid: 12714037787224569367
                path: '/dev/sdh1'
                devid: 'ata-Hitachi_HDS722020ALA330_JK11A1YAJGN9DV-part1'
                phys_path: 'pci-0000:01:00.1-ata-8'
                whole_disk: 1
                DTL: 425
                create_txg: 4
                com.delphix:vdev_zap_leaf: 134
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
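If all you need are the GUIDs, a shorter route on reasonably recent ZFS on Linux releases is zpool status -g, which prints vdev GUIDs in place of device names; you can match them against the path fields in the zdb output above.
root@zfs-lab# zpool status -g raid10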
Serial Number
If you don’t already have it, the serial number of the failing drive is easily obtained with the following command. The smartctl command is supplied by the smartmontools package. (An alternative that doesn’t need smartmontools follows the output.)
root@zfs-lab# smartctl -a /dev/sdf | grep Serial
Serial Number: JK1101B9GKY6ET
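If smartctl isn’t installed, the serial is usually also visible in the persistent device names under /dev/disk/by-id (you can see it embedded in the devid field of the zdb output above as well):
root@zfs-lab# ls -l /dev/disk/by-id/ | grep sdf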
Installed Disks
Just to make sure we don’t wipe out the wrong disk, let’s get a list of what we have installed. You can see from the output below that I have disks /dev/sda through /dev/sdh installed. (A serial-number cross-check follows the listing.)
root@zfs-lab# lsblk | grep sd
sda 8:0 0 1.8T 0 disk
├─sda1 8:1 0 1.8T 0 part
└─sda9 8:9 0 8M 0 part
sdb 8:16 0 465.8G 0 disk
├─sdb1 8:17 0 1007K 0 part
├─sdb2 8:18 0 512M 0 part
└─sdb3 8:19 0 465.3G 0 part
sdc 8:32 0 1.8T 0 disk
├─sdc1 8:33 0 1.8T 0 part
└─sdc9 8:41 0 8M 0 part
sdd 8:48 0 465.8G 0 disk
├─sdd1 8:49 0 1007K 0 part
├─sdd2 8:50 0 512M 0 part
└─sdd3 8:51 0 465.3G 0 part
sde 8:64 0 1.8T 0 disk
├─sde1 8:65 0 1.8T 0 part
└─sde9 8:73 0 8M 0 part
sdf 8:80 0 1.8T 0 disk
├─sdf1 8:81 0 1.8T 0 part
└─sdf9 8:89 0 8M 0 part
sdg 8:96 0 1.8T 0 disk
├─sdg1 8:97 0 1.8T 0 part
└─sdg9 8:105 0 8M 0 part
sdh 8:112 0 1.8T 0 disk
├─sdh1 8:113 0 1.8T 0 part
└─sdh9 8:121 0 8M 0 part
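As an extra cross-check, lsblk can print serial numbers next to the device names, making it easy to match the serial we just recorded to a device (the SERIAL column depends on your util-linux version):
root@zfs-lab# lsblk -o NAME,SIZE,SERIAL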
2) Remove the Failing Disk
Now that we have all the information we need, let’s get rid of the failing disk. First, we’ll remove it from the ZFS pool.
Note: If this command fails, which can happen if the drive has completely died, use the disk’s GUID instead: zpool offline raid10 4024410420552873090
root@zfs-lab# zpool offline raid10 /dev/sdf
We should check that it has been taken offline before moving on.
root@zfs-lab# zpool status raid10 -v
  pool: raid10
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 0 days 04:42:17 with 0 errors on Sun Nov 10 05:06:22 2019
config:
        NAME          STATE     READ WRITE CKSUM
        raid10        DEGRADED     0     0     0
          mirror-0    DEGRADED     0     0     0
            sde       ONLINE       0     0     0
            sdf       OFFLINE      2     0     0
          mirror-1    ONLINE       0     0     0
            sdg       ONLINE       0     0     0
            sdh       ONLINE       0     0     0
errors: No known data errors
Now let’s remove the disk from the SCSI subsystem to ensure it’s disconnected cleanly.
root@zfs-lab# echo 1 | sudo tee /sys/block/sdf/device/delete
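Once you physically swap in the new drive, most controllers detect it on their own. If yours doesn’t, you can usually prompt a rescan of the SCSI hosts; a hedged sketch, since host numbering varies per system:
# Ask every SCSI host to rescan; harmless if the new disk was already detected.
root@zfs-lab# for host in /sys/class/scsi_host/host*; do echo "- - -" > "$host/scan"; done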
3) Format the New Disk
After physically replacing the disk, we’ll need to copy the partition table from the similar disk we found earlier. To find the new disk, run lsblk | grep sd again; the new drive usually gets the same name as before, in my case /dev/sdf. (A quick serial check follows the listing.)
root@zfs-lab# lsblk | grep sd
sda 8:0 0 1.8T 0 disk
├─sda1 8:1 0 1.8T 0 part
└─sda9 8:9 0 8M 0 part
sdb 8:16 0 465.8G 0 disk
├─sdb1 8:17 0 1007K 0 part
├─sdb2 8:18 0 512M 0 part
└─sdb3 8:19 0 465.3G 0 part
sdc 8:32 0 1.8T 0 disk
├─sdc1 8:33 0 1.8T 0 part
└─sdc9 8:41 0 8M 0 part
sdd 8:48 0 465.8G 0 disk
├─sdd1 8:49 0 1007K 0 part
├─sdd2 8:50 0 512M 0 part
└─sdd3 8:51 0 465.3G 0 part
sde 8:64 0 1.8T 0 disk
├─sde1 8:65 0 1.8T 0 part
└─sde9 8:73 0 8M 0 part
sdf 8:80 0 2.7T 0 disk # <--- There it is!
sdg 8:96 0 1.8T 0 disk
├─sdg1 8:97 0 1.8T 0 part
└─sdg9 8:105 0 8M 0 part
sdh 8:112 0 1.8T 0 disk
├─sdh1 8:113 0 1.8T 0 part
└─sdh9 8:121 0 8M 0 part
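Before writing anything to /dev/sdf, it’s worth confirming the name now points at the new drive rather than the one we just pulled; its serial should no longer match the one we recorded in step 1.
root@zfs-lab# smartctl -a /dev/sdf | grep Serial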
IMPORTANT: The syntax of this command is counter-intuitive in my opinion. Read these steps carefully: getting the source and target backwards here may hose your data!
The command we are going to use is sgdisk --replicate=/dev/TARGET /dev/SOURCE, where TARGET is the new blank disk and SOURCE is the live disk whose partition table we want to copy.
# My new blank disk is /dev/sdf
# My live disk with data on it that I don't want to destroy is /dev/sde
root@zfs-lab# sgdisk --replicate=/dev/sdf /dev/sde
The operation has completed successfully.
4) Randomize GUID
To prevent some really bad potential mix-ups by ZFS, each disk should have a unique GUID. Since we cloned the partition table (GUIDs included) from another disk, we need to randomize them on the new disk.
root@zfs-lab# sgdisk --randomize-guids /dev/sdf
The operation has completed successfully.
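Before handing the disk to ZFS, you can sanity-check the last two steps with sgdisk’s print option: the partition layout on /dev/sdf should now match /dev/sde, while the 'Disk identifier (GUID)' lines should differ.
root@zfs-lab# sgdisk --print /dev/sde
root@zfs-lab# sgdisk --print /dev/sdf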
5) Add the New Disk to the ZFS Pool
Use the zpool replace command to add the new drive into the pool.
root@zfs-lab# zpool replace raid10 /dev/sdf
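If zpool replace refuses because the new device appears to be in use (it now carries a partition table), it can be forced with -f once you are absolutely sure you have the right disk; a hedged variant:
# Double-check the serial number first - this overwrites /dev/sdf.
root@zfs-lab# zpool replace -f raid10 /dev/sdf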
Check to make sure it has been added successfully.
root@zfs-lab# zpool status raid10 -v
  pool: raid10
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat Nov 16 14:28:55 2019
        373G scanned at 3.01G/s, 186G issued at 1.50G/s, 1.12T total
        1.19G resilvered, 16.29% done, 0 days 00:10:37 to go
config:
        NAME             STATE     READ WRITE CKSUM
        raid10           DEGRADED     0     0     0
          mirror-0       DEGRADED     0     0     0
            sde          ONLINE       0     0     0
            replacing-1  DEGRADED     0     0     0
              old        OFFLINE      2     0     0
              sdf        ONLINE       0     0     0  (resilvering)
          mirror-1       ONLINE       0     0     0
            sdg          ONLINE       0     0     0
            sdh          ONLINE       0     0     0
errors: No known data errors
Now we wait! You can keep an eye on the status with the command watch zpool status raid10 -v. My system took about three hours to finish resilvering 186 GiB of data.
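Once the resilver finishes, zpool status should show the pool and every vdev as ONLINE, with the temporary 'replacing' vdev gone. For extra peace of mind, you can follow up with a scrub, which re-reads every block and verifies its checksum.
root@zfs-lab# zpool status raid10 -v
root@zfs-lab# zpool scrub raid10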