Michael's Daemonic Doodles

...blogging bits of BSD

Replacing a failed drive in a raidz2 ZFS setup

This blog post details how to replace a broken drive in the mfsBSD/FreeBSD 9.0 raidz2 ZFS setup discussed earlier. The process is relatively straightforward, but it can be tricky if you have never done it before.

Symptoms

When the drive failed, the controller and the OS handled it gracefully:

backup kernel: (da0:mps0:0:0:0): SCSI command timeout on device handle 0x000c SMID 906
backup kernel: mps0: (0:0:0) terminated ioc 804b scsi 0 state c xfer 0
backup last message repeated 4 times
backup kernel: mps0: mpssas_abort_complete: abort request on handle 0x0c SMID 906 complete
backup kernel: mps0: (0:0:0) terminated ioc 804b scsi 0 state 0 xfer 0
backup kernel: mps0: mpssas_remove_complete on target 0x0000, IOCStatus= 0x8
backup kernel: (da0:mps0:0:0:0): lost device - 0 outstanding
backup kernel: GEOM_MIRROR: Request failed (error=22). da0p2[WRITE(offset=1180434432, length=131072)]
backup kernel: GEOM_MIRROR: Device primaryswap: provider da0p2 disconnected.
backup kernel: (da0:mps0:0:0:0): removing device entry

ZFS status shows the degraded pool:

[root@backup ~]# zpool status
  pool: tank
 state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
config:

        NAME                      STATE     READ WRITE CKSUM
        tank                      DEGRADED     0     0     0
          raidz2-0                DEGRADED     0     0     0
            15364271088212071398  REMOVED      0     0     0  was /dev/da0p3
            da1p3                 ONLINE       0     0     0
            da2p3                 ONLINE       0     0     0
            da3p3                 ONLINE       0     0     0
            da4p3                 ONLINE       0     0     0
            da5p3                 ONLINE       0     0     0
            da6p3                 ONLINE       0     0     0
            da7p3                 ONLINE       0     0     0

errors: No known data errors

Replacing the drive

Since the backplane supports hot swapping, connecting the new drive is very convenient; the kernel picks it up right away:

da0 at mps0 bus 0 scbus0 target 0 lun 0
da0: <SEAGATE ST31000424SS 0006> Fixed Direct Access SCSI-5 device
da0: 600.000MB/s transfers
da0: Command Queueing enabled
da0: 953869MB (1953525168 512 byte sectors: 255H 63S/T 121601C)
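If you want to be sure the controller really picked up the replacement before touching it, camcontrol can confirm what is attached (just a sanity check; the device names below are from my box and may differ on yours):

# list all attached disks and the controller/target they sit on
camcontrol devlist
# query the new disk directly to confirm vendor, model and firmware revision
camcontrol inquiry da0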

Rebuilding the system requires the following steps:

  1. Partition the new drive using gpart
  2. Rebuild the gmirror swap container
  3. Replace device in ZFS pool and resilver
  4. Reinstall boot code

Partition the new drive using gpart

The original setup was done using mfsBSD, with a swap partition size of 16GB. If you are not sure how to partition the drive, just check the layout of another drive in the pool, e.g.:

[root@backup ~]# gpart list da1
Geom name: da1
modified: false
state: OK
fwheads: 255
fwsectors: 63
last: 1953525134
first: 34
entries: 128
scheme: GPT
Providers:
1. Name: da1p1
   Mediasize: 65536 (64k)
   Sectorsize: 512
   Stripesize: 0
   Stripeoffset: 17408
   Mode: r0w0e0
   rawuuid: fa3ef576-83ed-11e1-bdd5-001517783d80
   rawtype: 83bd6b9d-7f41-11dc-be0b-001560b84f0f
   label: (null)
   length: 65536
   offset: 17408
   type: freebsd-boot
   index: 1
   end: 161
   start: 34
2. Name: da1p2
   Mediasize: 17179869184 (16G)
   Sectorsize: 512
   Stripesize: 0
   Stripeoffset: 82944
   Mode: r1w1e1
   rawuuid: fa45c1b1-83ed-11e1-bdd5-001517783d80
   rawtype: 516e7cb5-6ecf-11d6-8ff8-00022d09712b
   label: swap1
   length: 17179869184
   offset: 82944
   type: freebsd-swap
   index: 2
   end: 33554593
   start: 162
3. Name: da1p3
   Mediasize: 983024916992 (915G)
   Sectorsize: 512
   Stripesize: 0
   Stripeoffset: 82944
   Mode: r1w1e1
   rawuuid: fa4f7605-83ed-11e1-bdd5-001517783d80
   rawtype: 516e7cba-6ecf-11d6-8ff8-00022d09712b
   label: disk1
   length: 983024916992
   offset: 17179952128
   type: freebsd-zfs
   index: 3
   end: 1953525134
   start: 33554594
Consumers:
1. Name: da1
   Mediasize: 1000204886016 (931G)
   Sectorsize: 512
   Mode: r2w2e4
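If you only need the offsets, sizes and partition types rather than the full detail above, gpart show prints a much more compact view of the same layout:

# compact overview: start, size, index and type of each partition
gpart show da1
# add -l to see the GPT labels (swap1/disk1) instead of the types
gpart show -l da1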

Using this information we can partition the new drive:

gpart create -s GPT da0
gpart add -t freebsd-boot -s 128 da0
gpart add -t freebsd-swap -s 16G -l swap0 da0
gpart add -t freebsd-zfs -l disk0 da0
dd if=/dev/zero of=/dev/da0p2 bs=512 count=560
dd if=/dev/zero of=/dev/da0p3 bs=512 count=560

(The dd commands make sure any old gmirror/ZFS metadata is removed.)
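Before handing the partitions over to gmirror and ZFS, it doesn't hurt to verify that the new layout matches the reference drive (swap0 and disk0 are the labels chosen above):

# compare the new drive's layout against the reference drive
gpart show da0 da1
# confirm the GPT labels on the new drive
gpart show -l da0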

Rebuild the gmirror swap container

It's important to make gmirror forget the missing disk so that it can be replaced. Then the new partition is inserted, with -p 0 setting its priority to 0 (as it was before).

gmirror forget primaryswap
gmirror insert -p 0 primaryswap /dev/da0p2

You can check the status any time during the process:

[root@backup ~]# gmirror status
                Name    Status  Components
  mirror/primaryswap  DEGRADED  da2p2 (ACTIVE)
                                da4p2 (ACTIVE)
                                da6p2 (ACTIVE)
                                da0p2 (SYNCHRONIZING, 0%)
mirror/secondaryswap  COMPLETE  da1p2 (ACTIVE)
                                da3p2 (ACTIVE)
                                da5p2 (ACTIVE)
                                da7p2 (ACTIVE)

Once it's done, the mirror is back to its normal state:

[root@backup ~]# gmirror status
                Name    Status  Components
  mirror/primaryswap  COMPLETE  da2p2 (ACTIVE)
                                da4p2 (ACTIVE)
                                da6p2 (ACTIVE)
                                da0p2 (ACTIVE)
mirror/secondaryswap  COMPLETE  da1p2 (ACTIVE)
                                da3p2 (ACTIVE)
                                da5p2 (ACTIVE)
                                da7p2 (ACTIVE)
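Since the mirrors are the swap devices in this setup, a quick look at swapinfo confirms that swap is still fully available after the rebuild:

# list active swap devices and their utilization
swapinfo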

Replace device in ZFS pool and resilver

The final step is to replace the drive in the ZFS pool, which starts resilvering automatically. Keep in mind that resilvering is I/O intensive. In my case it took a long time and slowed down the disk subsystem's performance quite dramatically, primarily because of the large number of small files on the system. There are ways to throttle the resilvering process, which I might try next time, namely vfs.zfs.scrub_limit, which can be set in /boot/loader.conf.
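For reference, such a tunable goes into /boot/loader.conf like this; the value below is only a placeholder I haven't tested myself, and later FreeBSD/ZFS versions use different knobs:

# /boot/loader.conf
# limit the number of outstanding scrub/resilver I/Os (placeholder value, untested)
vfs.zfs.scrub_limit="4"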

The procedure for replacing the drive is straightforward:

zpool replace tank da0p3
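Since the replacement shows up under the same device name the failed disk had (da0p3), the single-argument form is all that's needed. Had it come up under a different name, the two-argument form would be used, referring to the old member by the GUID shown in zpool status (example only, using the GUID from above):

# replace the missing member (identified by GUID) with the new partition
zpool replace tank 15364271088212071398 da0p3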

You can check the status of the resilver operation anytime using zpool status:

[root@backup ~]# zpool status
  pool: tank
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon May  7 20:18:34 2012
    11.7M scanned out of 908G at 353K/s, (scan is slow, no estimated time)
    1.31M resilvered, 0.00% done
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        DEGRADED     0     0     0
          raidz2-0                  DEGRADED     0     0     0
            replacing-0             REMOVED      0     0     0
              15364271088212071398  REMOVED      0     0     0  was /dev/da0p3/old
              da0p3                 ONLINE       0     0     0  (resilvering)
            da1p3                   ONLINE       0     0     0
            da2p3                   ONLINE       0     0     0
            da3p3                   ONLINE       0     0     0
            da4p3                   ONLINE       0     0     0
            da5p3                   ONLINE       0     0     0
            da6p3                   ONLINE       0     0     0
            da7p3                   ONLINE       0     0     0

errors: No known data errors

Depending on your setup this can take a substantial amount of time.

Reinstall boot code

This is easy to miss, so make sure it's done (the -i 1 targets the freebsd-boot partition created at index 1 above):

gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 da0

Conclusion

Replacing a drive is not completely plug and play, but it's definitely not rocket science either.