5 Steps to Safely Replace a Drive in a Linux ZFS Array

How to safely replace a hard drive in a Linux ZFS RAID array.

Let’s lay out an example scenario: say we have a striped-mirror (RAID 10) ZFS array, and I just got an email alert from smartmontools telling me that the drive /dev/sdf is failing.

Notes

  1. If you have an extra drive bay available, refrain from removing the old drive until after the resilver is complete.
  2. ZFS is completely capable of replacing a disk with an unformatted one. Some scenarios do require manual formatting (e.g. an array whose disks contain multiple partitions), but otherwise you should be fine skipping steps 3 and 4. If your system does require manual formatting, other steps may be needed after the resilver is complete (e.g. re-installing the bootloader with grub-install /dev/sdX), and you should avoid rebooting the system while ZFS is resilvering.

1) Gather Information

The first thing we need to do is collect some information we will want handy during the process. I highly recommend opening up a text editor on your workstation and dropping the information there while you work.

GUID, Pool Name, and a Similarly Partitioned Disk

Use the command zdb to list out some data for all pools. You can see from the output below that my failing drive has a GUID of 4024410420552873090, lives in the pool raid10, and has an adjacent drive, /dev/sde, which will have the same partition table.

root@zfs-lab# zdb
raid10:
    version: 5000
    name: 'raid10'
    state: 0
    txg: 1367134
    pool_guid: 13977946214682563558
    errata: 0
    hostid: 3182994292
    hostname: 'zfs-lab'
    com.delphix:has_per_vdev_zaps
    vdev_children: 2
    vdev_tree:
        type: 'root'
        id: 0
        guid: 13977946214682563558
        create_txg: 4
        children[0]:
            type: 'mirror'
            id: 0
            guid: 946224559609474074
            metaslab_array: 265
            metaslab_shift: 34
            ashift: 12
            asize: 2000384688128
            is_log: 0
            create_txg: 4
            com.delphix:vdev_zap_top: 129
            children[0]:
                type: 'disk'
                id: 0
                guid: 1247532717286800833
                path: '/dev/sde1'
                devid: 'ata-WDC_WD2003FYYS-02W0B1_WD-WMAY05058763-part1'
                phys_path: 'pci-0000:01:00.1-ata-5'
                whole_disk: 1
                DTL: 428
                create_txg: 4
                com.delphix:vdev_zap_leaf: 130
            children[1]:
                type: 'disk'
                id: 1
                guid: 4024410420552873090
                path: '/dev/sdf1'
                devid: 'ata-Hitachi_HDS722020ALA330_JK1101B9GKY6ET-part1'
                phys_path: 'pci-0000:01:00.1-ata-6'
                whole_disk: 1
                DTL: 427
                create_txg: 4
                com.delphix:vdev_zap_leaf: 131
        children[1]:
            type: 'mirror'
            id: 1
            guid: 8353599129995598725
            metaslab_array: 256
            metaslab_shift: 34
            ashift: 12
            asize: 2000384688128
            is_log: 0
            create_txg: 4
            com.delphix:vdev_zap_top: 132
            children[0]:
                type: 'disk'
                id: 0
                guid: 5161684360393329728
                path: '/dev/sdg1'
                devid: 'ata-WDC_WD2003FYYS-02W0B1_WD-WCAY00294631-part1'
                phys_path: 'pci-0000:01:00.1-ata-7'
                whole_disk: 1
                DTL: 426
                create_txg: 4
                com.delphix:vdev_zap_leaf: 133
            children[1]:
                type: 'disk'
                id: 1
                guid: 12714037787224569367
                path: '/dev/sdh1'
                devid: 'ata-Hitachi_HDS722020ALA330_JK11A1YAJGN9DV-part1'
                phys_path: 'pci-0000:01:00.1-ata-8'
                whole_disk: 1
                DTL: 425
                create_txg: 4
                com.delphix:vdev_zap_leaf: 134
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
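
Picking the right GUID out of that much output by eye is error-prone, so here is a minimal sketch that pulls it out for a given device path. The function name guid_for_path is my own invention, not a ZFS tool, and it assumes each guid: line appears before its path: line within a disk entry, as it does in the output above.

```shell
# Hypothetical helper (not part of ZFS): read `zdb` output on stdin and print
# the guid of the vdev whose path matches the first argument.
# Assumes the "guid:" line of a disk precedes its "path:" line.
guid_for_path() {
    awk -v want="'$1'" '
        $1 == "guid:" { guid = $2 }                       # remember the latest guid seen
        $1 == "path:" && $2 == want { print guid; exit }  # matching path: emit that guid
    '
}

# Usage on the ZFS host:
#   zdb | guid_for_path /dev/sdf1
```

Note that zdb wraps the path in single quotes, which is why the helper adds them to the value it compares against.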

Serial Number

If you don’t already have it, the serial number of the failing drive is easily obtained by running the following command. The smartctl command is supplied by the smartmontools package.

root@zfs-lab# smartctl -a /dev/sdf | grep Serial
Serial Number:    JK1101B9GKY6ET
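
The serial number also shows up inside the devid field of the zdb output, which gives us a cheap cross-check that the GUID and the physical drive really belong together. A rough sketch of that check follows; both function names are hypothetical, and it assumes your devid strings embed the serial the way mine do.

```shell
# Hypothetical helpers; the names are mine, not part of smartmontools or ZFS.

# Read `smartctl -a` output on stdin and print only the serial number.
serial_from_smartctl() {
    awk -F': *' '/^Serial Number:/ { print $2; exit }'
}

# Succeed if the devid string ($1) contains the serial ($2).
devid_has_serial() {
    [ -n "$2" ] || return 1        # an empty serial would match anything
    case "$1" in
        *"$2"*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Usage on the ZFS host:
#   serial=$(smartctl -a /dev/sdf | serial_from_smartctl)
#   devid_has_serial 'ata-Hitachi_HDS722020ALA330_JK1101B9GKY6ET-part1' "$serial" && echo "match"
```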

Installed Disks

Just to make sure we don’t wipe out the wrong disk, let’s get a list of what we have installed. You can see from the output below that I have disks /dev/sda through /dev/sdh installed.

root@zfs-lab# lsblk | grep sd
sda         8:0    0   1.8T  0 disk
├─sda1      8:1    0   1.8T  0 part
└─sda9      8:9    0     8M  0 part
sdb         8:16   0 465.8G  0 disk
├─sdb1      8:17   0  1007K  0 part
├─sdb2      8:18   0   512M  0 part
└─sdb3      8:19   0 465.3G  0 part
sdc         8:32   0   1.8T  0 disk
├─sdc1      8:33   0   1.8T  0 part
└─sdc9      8:41   0     8M  0 part
sdd         8:48   0 465.8G  0 disk
├─sdd1      8:49   0  1007K  0 part
├─sdd2      8:50   0   512M  0 part
└─sdd3      8:51   0 465.3G  0 part
sde         8:64   0   1.8T  0 disk
├─sde1      8:65   0   1.8T  0 part
└─sde9      8:73   0     8M  0 part
sdf         8:80   0   1.8T  0 disk
├─sdf1      8:81   0   1.8T  0 part
└─sdf9      8:89   0     8M  0 part
sdg         8:96   0   1.8T  0 disk
├─sdg1      8:97   0   1.8T  0 part
└─sdg9      8:105  0     8M  0 part
sdh         8:112  0   1.8T  0 disk
├─sdh1      8:113  0   1.8T  0 part
└─sdh9      8:121  0     8M  0 part

2) Remove the Failing Disk

Now that we have all the information we need, let’s get rid of the failing disk. First we’ll take it offline in the ZFS pool.

Note: If this command fails, which may happen if the drive has completely died, use the disk’s GUID instead: zpool offline raid10 4024410420552873090

root@zfs-lab# zpool offline raid10 /dev/sdf
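
If you would rather not decide by hand which form to use, the path-then-GUID fallback from the note above can be sketched as a small wrapper. The name offline_vdev and its argument order are assumptions of mine; the only real command it calls is zpool offline.

```shell
# Hypothetical wrapper: try the device path first, then fall back to the GUID.
# $1 = pool name, $2 = device path, $3 = vdev GUID gathered in step 1
offline_vdev() {
    if zpool offline "$1" "$2"; then
        echo "offlined $2"
    elif zpool offline "$1" "$3"; then
        echo "offlined by guid $3"
    else
        echo "could not offline $2 or $3 in pool $1" >&2
        return 1
    fi
}

# Usage on the ZFS host:
#   offline_vdev raid10 /dev/sdf 4024410420552873090
```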

We should check that it’s been taken offline before moving on.

root@zfs-lab# zpool status raid10 -v
  pool: raid10
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 0 days 04:42:17 with 0 errors on Sun Nov 10 05:06:22 2019
config:

        NAME        STATE     READ WRITE CKSUM
        raid10      DEGRADED     0     0     0
          mirror-0  DEGRADED     0     0     0
            sde     ONLINE       0     0     0
            sdf     OFFLINE      2     0     0
          mirror-1  ONLINE       0     0     0
            sdg     ONLINE       0     0     0
            sdh     ONLINE       0     0     0

errors: No known data errors
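
For scripting, the same check can be reduced to a single word. This is only a sketch: vdev_state is a made-up name, and it assumes the device appears in the NAME column exactly as typed (sdf here, not /dev/sdf).

```shell
# Hypothetical helper: read `zpool status` output on stdin and print the STATE
# column for the device named in the first argument.
vdev_state() {
    awk -v dev="$1" '$1 == dev { print $2; exit }'
}

# Usage on the ZFS host:
#   [ "$(zpool status raid10 | vdev_state sdf)" = "OFFLINE" ] && echo "safe to pull sdf"
```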

Now let’s remove the disk from the SCSI subsystem to ensure it’s disconnected cleanly.

root@zfs-lab# echo 1 | sudo tee /sys/block/sdf/device/delete

3) Format the New Disk

After physically replacing the disk, we’ll need to copy the partition table from the similar disk we found earlier. To find the new disk, we’ll use the command lsblk | grep sd again. It usually gets the same name as before, in my case /dev/sdf.

root@zfs-lab# lsblk | grep sd
sda         8:0    0   1.8T  0 disk
├─sda1      8:1    0   1.8T  0 part
└─sda9      8:9    0     8M  0 part
sdb         8:16   0 465.8G  0 disk
├─sdb1      8:17   0  1007K  0 part
├─sdb2      8:18   0   512M  0 part
└─sdb3      8:19   0 465.3G  0 part
sdc         8:32   0   1.8T  0 disk
├─sdc1      8:33   0   1.8T  0 part
└─sdc9      8:41   0     8M  0 part
sdd         8:48   0 465.8G  0 disk
├─sdd1      8:49   0  1007K  0 part
├─sdd2      8:50   0   512M  0 part
└─sdd3      8:51   0 465.3G  0 part
sde         8:64   0   1.8T  0 disk
├─sde1      8:65   0   1.8T  0 part
└─sde9      8:73   0     8M  0 part
sdf         8:80   0   2.7T  0 disk # <--- There it is!
sdg         8:96   0   1.8T  0 disk
├─sdg1      8:97   0   1.8T  0 part
└─sdg9      8:105  0     8M  0 part
sdh         8:112  0   1.8T  0 disk
├─sdh1      8:113  0   1.8T  0 part
└─sdh9      8:121  0     8M  0 part
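
A blank disk stands out in that listing because no part lines follow it. As a rough sketch, the filter below prints only such disks; unpartitioned_disks is a hypothetical name, and it assumes the default lsblk column order shown above, where TYPE is the sixth field.

```shell
# Hypothetical helper: read default `lsblk` output on stdin and print the name
# of every whole disk that has no partition lines beneath it.
unpartitioned_disks() {
    awk '
        $6 == "disk" {
            if (prev != "" && !has_part) print prev   # previous disk had no partitions
            prev = $1; has_part = 0
            next
        }
        $6 == "part" { has_part = 1 }
        END { if (prev != "" && !has_part) print prev }
    '
}

# Usage on the ZFS host:
#   lsblk | grep sd | unpartitioned_disks
```

Treat the result as a hint only, and still compare the size and serial number before writing anything to the disk.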

IMPORTANT: The syntax of this command is counter-intuitive in my opinion. Read these steps carefully, because getting the source and target backwards here may hose your data!

The command we are going to use is sgdisk --replicate=/dev/TARGET /dev/SOURCE, where TARGET is the new blank disk, and SOURCE is the live disk with a similar partition table.

# My new blank disk is /dev/sdf
# My live disk with data on it that I don't want to destroy is /dev/sde
root@zfs-lab# sgdisk --replicate=/dev/sdf /dev/sde
The operation has completed successfully.
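
Because the argument order is so easy to get wrong, one option is to hide it behind a wrapper that takes the disks in "from, to" order and refuses to touch a target that already has partitions. This is a sketch under my own assumptions: the name copy_partition_table is made up, and the safety check relies on lsblk -n -o TYPE printing one part line per existing partition.

```shell
# Hypothetical wrapper with "from, to" argument order.
# $1 = SOURCE (live disk to copy from), $2 = TARGET (new blank disk)
copy_partition_table() {
    # lsblk prints "disk" for the device itself plus "part" for each partition.
    if lsblk -n -o TYPE "$2" | grep -q '^part$'; then
        echo "refusing: $2 already has partitions" >&2
        return 1
    fi
    # sgdisk takes the TARGET in --replicate= and the SOURCE as the positional argument.
    sgdisk --replicate="$2" "$1"
}

# Usage on the ZFS host:
#   copy_partition_table /dev/sde /dev/sdf
```

If the two disks are swapped by mistake, the live disk shows up as the target, the partition check trips, and nothing is written.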

4) Randomize GUID

To prevent some really bad potential mix-ups by ZFS, each disk should have a unique GUID. Since we cloned the partition table from another disk, the new disk currently shares that disk’s GUIDs, so we’ll need to randomize them.

root@zfs-lab# sgdisk --randomize-guids /dev/sdf
The operation has completed successfully.
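
To confirm the randomization took effect, you can compare the disk identifiers that sgdisk -p prints for the two disks. A minimal sketch, assuming the "Disk identifier (GUID):" line that sgdisk prints in its summary; the function names are hypothetical.

```shell
# Hypothetical helpers for comparing GPT disk identifiers.

# Read `sgdisk -p` output on stdin and print the disk GUID.
disk_guid() {
    awk -F': ' '/^Disk identifier \(GUID\)/ { print $2; exit }'
}

# Succeed only if both GUIDs are non-empty and different.
guids_differ() {
    [ -n "$1" ] && [ -n "$2" ] && [ "$1" != "$2" ]
}

# Usage on the ZFS host:
#   src=$(sgdisk -p /dev/sde | disk_guid)
#   new=$(sgdisk -p /dev/sdf | disk_guid)
#   guids_differ "$src" "$new" && echo "GUIDs are unique"
```

This only checks the disk-level identifier; the per-partition GUIDs are randomized by the same sgdisk call.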

5) Add the New Disk to the ZFS Pool

Use the zpool replace command to add the new drive into the pool.

root@zfs-lab# zpool replace raid10 /dev/sdf

Check to make sure it has been added successfully.

root@zfs-lab# zpool status raid10 -v
  pool: raid10
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat Nov 16 14:28:55 2019
        373G scanned at 3.01G/s, 186G issued at 1.50G/s, 1.12T total
        1.19G resilvered, 16.29% done, 0 days 00:10:37 to go
config:

        NAME             STATE     READ WRITE CKSUM
        raid10           DEGRADED     0     0     0
          mirror-0       DEGRADED     0     0     0
            sde          ONLINE       0     0     0
            replacing-1  DEGRADED     0     0     0
              old        OFFLINE      2     0     0
              sdf        ONLINE       0     0     0  (resilvering)
          mirror-1       ONLINE       0     0     0
            sdg          ONLINE       0     0     0
            sdh          ONLINE       0     0     0

errors: No known data errors

Now we wait! You can keep an eye on the status with the command watch zpool status raid10 -v. My system took about three hours to finish resilvering 186 GiB of data.
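
If you want a script to block until the resilver is done (to send a notification, say, or to re-install the bootloader afterwards), a simple polling loop is enough. The sketch below is my own; wait_for_resilver is a hypothetical name, and it assumes the "resilver in progress" wording seen in the status output above.

```shell
# Hypothetical helper: poll until `zpool status` no longer reports a resilver.
# $1 = pool name, $2 = seconds between polls (defaults to 60)
wait_for_resilver() {
    while zpool status "$1" | grep -q 'resilver in progress'; do
        # Show the progress line, e.g. "1.19G resilvered, 16.29% done, ..."
        zpool status "$1" | grep 'resilvered,'
        sleep "${2:-60}"
    done
    echo "resilver finished on $1"
}

# Usage on the ZFS host:
#   wait_for_resilver raid10 300
```

Newer OpenZFS releases also have a zpool wait subcommand that may do this for you; check your version’s man page before relying on it.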