ZFS Disk Failure - Understanding and Resolving
Modified: 08 Mar 2023 00:28 UTC
Introductory Concepts
This document is written for administrators and others familiar with computing hardware
platforms and storage concepts such as RAID. If you're already versed in the general failure
process, you can skip ahead to the sections on replacing a drive and repairing the pool.
Degrees of verbosity
When a drive fails or has errors, a great deal of logging data is available on SmartOS, and we can
drill down into it to find the underlying cause of disk failure. Listed in order of increasing
verbosity, these commands present the cause of a disk failure in progressively greater detail:
* `zpool status`
* `iostat -en`
* `iostat -En`
* `fmadm faulty`
* `fmdump -et {n}days`
* `fmdump -eVt {n}days`
The zpool status command will present us with a high-level view of pool health.
iostat will present us with high-level error counts and specifics as to the devices in question.
fmadm faulty will tell us more specifically which event led to the disk failure. ( fmadm can also be
used to clear transitory faults; this, however, is outside the scope of this document. Refer to
the fmadm man page for more information.) fmdump is much more specific still, presenting us with a
log of the last {n} days of fault events. This information is often extraneous to replacing faulted disks,
but if the problem is more complex than a simple single-disk failure, it is extremely useful in isolating
a root cause.
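For example, a typical triage session might run these in order, stopping as soon as the cause is
clear. This is only a sketch; the 7-day window passed to fmdump is an arbitrary choice:
# zpool status
# iostat -En
# fmadm faulty
# fmdump -et 7days
# fmdump -eVt 7days
Start at the top and only reach for the fmdump output when the simpler commands leave the cause
ambiguous.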
General failure process
ZFS is not the first component in the system to become aware of a disk failure. When a disk fails,
becomes unavailable, or develops a functional problem, the device driver and the fault management
framework (FMA) detect and diagnose the errors first; ZFS then places the affected device into one
of the following states:
UNAVAIL
The device (or VDEV) in question cannot be opened. If a top-level VDEV is UNAVAIL , the pool will not
be accessible or importable. UNAVAIL devices may also report as FAULTED in some
scenarios. Operationally, UNAVAIL disks are roughly equivalent to FAULTED disks.
DEGRADED
A fault in a device has occurred, impacting all VDEVs above it. The pool is still operable, but
redundancy may have been lost in a VDEV.
REMOVED
The device was physically removed while the system was running. Device removal detection is
hardware-dependent and might not be supported on all platforms.
FAULTED
All components of the pool (top-level and redundancy VDEVs, and drives) can be in a FAULTED state. A
FAULTED component is completely inaccessible. The severity of a component being FAULTED
depends largely on which component it is.
INUSE
This is a status reserved for spares which have been used to replace a faulted drive.
Degree of failure
Due to its integrated volume management characteristics, failures at different levels within ZFS
impact the system and overall pool health to different degrees.
Should the entire pool become FAULTED, it may still be possible to bring it back ONLINE
in rare scenarios, such as those brought on by a controller failure where a large swath of disks are
FAULTED as a secondary cause. Under most scenarios, however, a FAULTED pool is unrecoverable and its
data will need to be restored from backup.
Redundancy groups
If more disks are lost in a redundancy group than it has redundancy for (2 out of 2 disks in a
mirror, 3 disks from a RAIDZ-2, or 2 from a RAIDZ), the redundancy group will become FAULTED. A
FAULTED state at the VDEV level will result in a pool FAULTED state: each redundancy group
should be thought of as your top-level protection against data loss, with the pool itself serving to
stripe data across the redundancy groups.
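As a purely illustrative sketch (the pool name tank and the disk names are hypothetical), a pool
built from two RAIDZ-2 redundancy groups stripes data across both groups, and each group can lose
up to two disks before it becomes FAULTED:
# zpool create tank raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 \
    raidz2 c0t6d0 c0t7d0 c0t8d0 c0t9d0 c0t10d0 c0t11d0
Losing a third disk from either group would fault that VDEV and, with it, the entire pool.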
Individual drive failure
Individual drives becoming faulted is not problematic to the pool or redundancy group, as long as
fewer drives fail than the group has redundancy for. For instance, 2 disks in a RAIDZ-2 VDEV can
fail without the fault cascading upwards.
In the above example, there are two faulted devices and one that is unavailable. From an
administrative perspective, these two states are functionally identical: you want to replace them with
known working drives.
ZFS knows when a drive exceeds its error limits and will automatically take it out of the
pool. This can happen for any type of failure. As an operator, all that matters is that a drive has
faulted; the manufacturer can determine why it happened when you RMA it.
Replacing a drive in the same slot (with cfgadm)
First, identify the faulted disk and its physical slot, for example with diskinfo:
$ diskinfo -cH
=== Output from 00000000-0000-0000-0000-003590935999 (hostname):
<snip>
SCSI c1t4d0 HITACHI HUS723030ALS640 YHK16Z7G 2794.52 GiB ---- [0] Slot 02
SCSI c1t13d0 HITACHI HUS723030ALS640 YHJZMU7G 2794.52 GiB ---- [0] Slot 03
<snip>
SCSI c1t3d0 HITACHI HUS723030ALS640 YHK08JHG 2794.52 GiB ---- [1] Slot 05 <--- here
1. Offline c1t3d0 , the disk to be replaced. You cannot unconfigure a disk that is in use.
2. Use cfgadm to identify the disk to be unconfigured and unconfigure it (i.e. c1t3d0 ). The
pool will continue to be available, though it will be degraded with the now offline disk.
3. Physically replace the disk. The Ready to Remove LED must be illuminated before you
physically remove the faulted drive.
4. Reconfigure c1t3d0 .
5. Bring the new c1t3d0 online.
6. Run the zpool replace command to replace the disk.
The following example walks through the steps to replace a disk in a ZFS storage pool with a disk in
the same slot.
# zpool offline zones c1t3d0
# cfgadm | grep c1t3d0
sata1/3::dsk/c1t3d0 disk connected configured ok
# cfgadm -c unconfigure sata1/3
Unconfigure the device at: /devices/pci@0,0/pci1022,7458@2/pci11ab,11ab@1:3
This operation will suspend activity on the SATA device
Continue (yes/no)? yes
# cfgadm | grep sata1/3
sata1/3 disk connected unconfigured ok
<Physically replace the failed disk c1t3d0>
# cfgadm -c configure sata1/3
# cfgadm | grep sata1/3
sata1/3::dsk/c1t3d0 disk connected configured ok
# zpool online zones c1t3d0
# zpool replace zones c1t3d0
# zpool status zones
pool: zones
state: ONLINE
scrub: resilver completed after 0h0m with 0 errors on Tue Feb 2 13:17:32 2010
config:
Note that the preceding zpool output might show both the new and old disks under a replacing
heading. For example:
replacing      DEGRADED 0 0 0
  c1t3d0s0/o   FAULTED  0 0 0
  c1t3d0       ONLINE   0 0 0
This is normal and not a cause for concern. The replacing status will remain until the
replacement is complete.
Replacing a drive in the same slot (with devfsadm)
devfsadm -Cv can also be used instead of the above cfgadm commands to rebuild the device files.
This process is more straightforward than the cfgadm process above, and does not require the
new replacement drive to be in the same slot.
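A minimal sketch of the devfsadm path, assuming the faulted disk is c1t3d0 and the pool is named
zones as in the example above:
<Physically replace the failed disk>
# devfsadm -Cv
# zpool replace zones c1t3d0
# zpool status zones
If the replacement drive comes up under a different device name (say, a hypothetical c2t5d0 in
another slot), pass both names instead: zpool replace zones c1t3d0 c2t5d0.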
Spare replacement
If a hot spare was used, be certain to follow up: pull the failed drive, install a replacement, and
return the spare to its role as a spare in the pool.
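As a sketch, assume a hypothetical hot spare c1t14d0 was activated to cover the faulted c1t3d0.
Once the failed drive has been physically replaced, resilver onto the new drive and then detach the
spare, which returns it to the available spares list:
# zpool replace zones c1t3d0
# zpool detach zones c1t14d0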
Validation
Use the zpool status command to verify that:
* The pool status is ONLINE
* There are one or more available spare disks (if your environment uses them)
* The expected number of log devices is present
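For a quick check, zpool status -x reports only pools with problems (or that all pools are
healthy), while naming the pool shows the full configuration including spares and log devices; the
pool name zones matches the examples above:
# zpool status -x
# zpool status zones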
Other Considerations
Mirrors - Special Considerations
Mirror management is different from working with RAIDZ{1,2,3} members, as unlike with RAIDZ, there
is no parity to be concerned with. Because of this, mirror members can simply be detached, whereas
RAIDZ members must be replaced.
An in-progress spare replacement can be cancelled by detaching the hot spare. This can only be
done if the original faulted device has not yet been detached. If the disk it is replacing has been
removed, then the hot spare assumes its place in the configuration.
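A sketch of both operations, using hypothetical device names: detach a member of a mirror VDEV to
remove it, and detach the hot spare itself (here c1t14d0) to cancel an in-progress spare
replacement:
# zpool detach zones c1t5d0
# zpool detach zones c1t14d0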
ZIL log devices - Special Considerations
A dedicated log device (slog) holds the ZFS intent log: synchronous writes are written to the log
device and acknowledged before they are committed to the main pool. This in effect makes the ZIL a
dangerous single point of failure for the pool in certain situations. For instance, if a single log
device fails, it effectively cuts out the middle of the data pipeline and loses any transactions
in-flight: data which was acknowledged and present on the log, but not yet committed to the main
pool, can be lost.
Running mirrored ZIL log devices is highly recommended and mitigates this single point of failure.
Working with ZIL log mirrors is identical to working with other mirror VDEVs: you use zpool detach
to remove a mirror member.
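For instance, a mirrored log pair can be added to an existing pool with zpool add, and a member of
that mirror later removed with zpool detach; the device names below are hypothetical:
# zpool add zones log mirror c2t0d0 c2t1d0
# zpool detach zones c2t1d0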
L2ARC - Special Considerations
One significant caveat about L2ARC is that these devices are not free to the system: their metadata
is maintained within the primary ARC, which is in turn wired into system memory. Memory use must be
taken into account when adding L2ARC devices.
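Cache devices are managed with zpool add and zpool remove (the device name below is hypothetical).
Unlike log devices, a failed L2ARC device costs only read-cache performance: reads simply fall back
to the pool.
# zpool add zones cache c3t0d0
# zpool remove zones c3t0d0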
Repairing the pool
Checksum errors
Checksum errors can occur transiently on individual disks or across multiple disks. The most likely
culprits are bit rot or transient storage subsystem errors - oddities like signal loss due to solar flares
and so on.
With ZFS, they are not of much concern, but some degree of preventative maintenance is necessary
to keep accumulating errors from turning into a failure.
From time to time you may see zpool status output with a nonzero count in a device's CKSUM column.
If this number is significantly large or growing rapidly, the drive is likely in a "pre-failure" state and will
fail soon; in the meantime it is potentially compromising the redundancy of the VDEV.
Note that occasional checksum errors on individual drives are normal and expected behavior (if not
optimal), as are many errors on a single drive which is about to fail.
Many checksum failures across multiple drives can be indicative of a significant storage subsystem
problem: a damaged cable, a faulty HBA, or even power problems. If this is noticed, consider
contacting Support for assistance with identification.
Hint: You can audit pool health across the entire datacenter from the headnode with:
sdc-oneachnode -c 'zpool status -x'
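Once the underlying cause has been addressed (or judged transient), the per-device error counters
can be reset so that any new errors stand out. A sketch, assuming the pool is zones: zpool clear
zones resets the counters for every device, while appending a device name clears only that device:
# zpool clear zones
# zpool clear zones c1t3d0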
Resilver
A zpool resilver is an operation that rebuilds the data and parity on a device, due either to a
degraded device (for instance, a disk that temporarily disappeared and needs to 'catch up') or to a
newly replaced device. In other words, it copies data from the surviving devices (or the
degraded/old disk) onto the new device.
Multiple resilvers can occur at the same time within multiple VDEVs.
Please note that resilvers can degrade performance on a busy pool. Plan performance projections
accordingly.
Resilvers are automatic. They cannot (and should not) be interrupted short of physical removal or
failure of a device.
Scrub
Scrub examines all data in the specified pools to verify that it checksums correctly. For replicated
(mirror or raidz) devices, ZFS automatically repairs any damage discovered during the scrub.
The zpool status command reports the progress of the scrub and summarizes the results of the
scrub upon completion.
To start a scrub:
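For example, to scrub the zones pool, and to stop an in-progress scrub if it is hurting production
I/O:
# zpool scrub zones
# zpool scrub -s zones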
Autoreplace
By enabling ZFS autoreplace on a pool (a property that is disabled by default), you allow the system
to automatically use a spare drive to replace FAULTED/UNAVAIL drives.
It should be cautioned that there are potential drawbacks to this approach: in the event of
something like misbehaving firmware or an HBA failure, multiple drives may be replaced and then the
replacements may fault before the initial resilver completes, resulting in a more difficult scenario
from which to recover. Enabling autoreplace is highly inadvisable unless you've got a responsive
24/7 DC operations team.
To enable:
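Assuming the pool is named zones, autoreplace is set and verified like any other pool property:
# zpool set autoreplace=on zones
# zpool get autoreplace zones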