
Understanding and Resolving ZFS Disk Failure
Modified: 08 Mar 2023 00:28 UTC

Introductory Concepts
This document is written for administrators and those who have familiarity with computing hardware
platforms and storage concepts such as RAID. If you're already versed in the general failure
process, you can skip ahead to how to replace a drive and repairing the pool.
Degrees of verbosity
When a drive fails or has errors, a great deal of logging data is available on SmartOS. We can
drill down in more detail to help us find the underlying cause of disk failure. Listed from least to
most verbose, these commands will present the cause of a disk failure:

* `zpool status`
* `iostat -en`
* `iostat -En`
* `fmadm faulty`
* `fmdump -et {n}days`
* `fmdump -eVt {n}days`

The zpool status command will present us with a high-level view of pool health.
iostat will present us with high-level error counts and specifics as to the devices in question.
fmadm faulty will tell us more specifically which event led to the disk failure. (fmadm can also be
used to clear transitory faults; this, however, is outside the scope of this document. Refer to
the fmadm man page for more information.) fmdump is much more specific still, presenting us with a
log of the last {n} days of fault events. This information is often extraneous to replacing faulted disks,
but if the problem is more complex than a simple single-disk failure, it is extremely useful in isolating
a root cause.
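
For example, a reasonable first pass when triaging a suspect pool might look like this (the 7-day window is arbitrary; adjust as needed):

zpool status -x        # report only pools that have a problem
iostat -en             # summary error counts per device
fmadm faulty           # which FMA fault event took the device out
fmdump -et 7days       # fault event log for the last 7 days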
General failure process
ZFS is not the first component in the system to be aware of a disk failure. When a disk fails,
becomes unavailable, or develops a functional problem, this general order of events occurs:

1. A failed disk is detected and logged by FMA.
2. The disk is removed by the operating system.
3. ZFS sees the changed state and responds by faulting the device.

ZFS device (and virtual device) states


The overall health of a pool, as reported by zpool status , is determined by the aggregate state of
all devices within the pool. Here are some definitions to help with clarity throughout this document.
ONLINE
All devices can (and, when operating optimally, should) be in the ONLINE state. This includes the
pool, top-level VDEVs (parity groups of type mirror, raidz{1,2,3}), and the drives themselves.
Transitory errors may still occur without the drive changing state.
OFFLINE
Only bottom-level devices (drives) can be OFFLINE. This is a manual administrative state, and
healthy drives can be brought back online and active into the pool.

UNAVAIL
The device (or VDEV) in question cannot be opened. If a VDEV is UNAVAIL, the pool will not be
accessible or able to be imported. UNAVAIL devices may also report as FAULTED in some
scenarios. Operationally, UNAVAIL disks are roughly equivalent to FAULTED disks.
DEGRADED
A fault in a device has occurred, impacting all VDEVs above it. The pool is still operable, but
redundancy may have been lost in a VDEV.

REMOVED
The device was physically removed while the system was running. Device removal detection is
hardware-dependent and might not be supported on all platforms.

FAULTED
All components of the pool (top-level and redundancy VDEVs, and drives) can be in a FAULTED state. A
FAULTED component is completely inaccessible. The severity of a FAULTED device depends a great
deal on which device it is.

INUSE
This is a status reserved for spares which have been used to replace a faulted drive.

Degree of failure
Due to its integrated volume management characteristics, failures at different levels within ZFS
impact the system and overall pool health to different degrees.

The pool itself


This is the worst possible scenario, typically resulting from the loss of more drives from a redundancy
group than the group was designed to withstand. The pool itself has no concept of redundancy,
instead relying on integrity being maintained within the individual RAIDZ or mirror VDEVs. For
instance, losing 2 disks out of a RAIDZ would result in both the VDEV and the pool (top-level VDEV)
becoming FAULTED.

It should be noted that, should this scenario occur, it may still be possible to bring the pool ONLINE
in rare scenarios, such as those brought on by controller failure where a large swath of disks are
FAULTED as a secondary cause. Under most scenarios, a FAULTED pool is unrecoverable and its
data will need to be recreated from backup.

Redundancy groups
If more disks are lost in a redundancy group than it has redundancy for (2 out of 2 disks in a
mirror, 3 disks from RAIDZ-2, or 2 from RAIDZ), that redundancy group will become FAULTED. A
FAULTED state at the VDEV level will result in a pool FAULTED state: each redundancy group
should be thought of as your top-level protection against data loss, with the pool itself serving to
stripe data across the redundancy groups.
Individual drive failure
An individual drive becoming faulted is not problematic for the pool or redundancy group, as long as
fewer disks fail than the group has redundancy for. For instance, 2 disks in a RAIDZ-2 VDEV can fail
without the fault cascading upwards.

How to replace a drive


High-level overview of drive replacement
At a high level, replacing a specific faulted drive takes the following steps:

1. Identify the FAULTED or UNAVAILABLE drive
2. zpool replace the drive in question
3. Wait for the resilver to finish
4. zpool offline the replaced drive
5. zpool remove the offlined drive
6. Perform any necessary cleanup

These steps can vary somewhat depending on specific redundancy level and hardware
configuration.

In-detail steps for drive replacement


Let's start with an example scenario involving multiple faulted and degraded drives:

[root@headnode (dc-example-1) ~]# zpool status


pool: zones
state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
repaired.
scan: resilvered 7.64G in 0h6m with 0 errors on Fri May 26 10:45:56 2017
config:

        NAME           STATE     READ WRITE CKSUM
        zones          DEGRADED     0     0     0
          mirror-0     ONLINE       0     0     0
            c1t0d0     ONLINE       0     0     0
            c1t1d0     ONLINE       0     0     0
          mirror-1     DEGRADED     0     0     0
            c1t2d0     ONLINE       0     0     0
            c1t3d0     FAULTED      0     0     0  external device fault
          mirror-2     ONLINE       0     0     0
            c1t4d0     ONLINE       0     0     0
            c1t5d0     ONLINE       0     0     0
          mirror-3     DEGRADED     0     0     0
            1173487    UNAVAIL      0     0     0  was /dev/dsk/c1t16d0
            c1t6d0     ONLINE       0     0     0
          mirror-4     ONLINE       0     0     0
            c1t7d0     ONLINE       0     0     0
            c1t8d0     ONLINE       0     0     0
          mirror-5     DEGRADED     0     0     0
            spare-0    DEGRADED     0     0     0
              c1t10d0  REMOVED      0     0     0
              c1t11d0  ONLINE       0     0     0
            c1t9d0     FAULTED      0     0     0  external device fault
          mirror-6     ONLINE       0     0     0
            c1t12d0    ONLINE       0     0     0
            c1t13d0    ONLINE       0     0     0
        logs
          c1t14d0      ONLINE       0     0     0
        spares
          c1t15d0      INUSE     currently in use
          c1t16d0      ONLINE       0     0     0

errors: No known data errors

In the above example, there are two faulted devices and one that is unavailable. From an
administrative perspective, these two states are functionally identical: you want to replace them with
known working drives.

ZFS knows when a drive has hit its limit on the number of errors and will automatically take it out of the
pool. This can happen for any type of failure. As an operator, all that matters is that a drive has
faulted; the manufacturer can determine why it happened when you RMA it.

Identify the physical location of the FAULTED or UNAVAILABLE drive
Use diskinfo to get this information.
For instance, zpool status had shown c1t3d0 faulted:

        NAME           STATE     READ WRITE CKSUM
        zones          DEGRADED     0     0     0
        [...]
          mirror-1     DEGRADED     0     0     0
            c1t2d0     ONLINE       0     0     0
            c1t3d0     FAULTED      0     0     0  external device fault

diskinfo -cH will show where c1t3d0 is located. For instance:

$ diskinfo -cH
=== Output from 00000000-0000-0000-0000-003590935999 (hostname):
<snip>
SCSI   c1t4d0    HITACHI   HUS723030ALS640   YHK16Z7G   2794.52 GiB   ----   [0] Slot 02
SCSI   c1t13d0   HITACHI   HUS723030ALS640   YHJZMU7G   2794.52 GiB   ----   [0] Slot 03
<snip>
SCSI   c1t3d0    HITACHI   HUS723030ALS640   YHK08JHG   2794.52 GiB   ----   [1] Slot 05   <--- here

Blink the drive in question


Blinking the drive will require a third-party tool for your storage controller(s). In most cases where LSI
cards are in use, you will want sas2ircu for Solaris. Drive location on other platforms will not be
covered here.
Install sas2ircu somewhere convenient, and then run it similar to this:

p /opt/custom/bin/sas2ircu 0 locate 1:5 ON


The above command will light the LED on the 5th slot on the second ([1]) expander via the first (0)
HBA.
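
If you are unsure which controller index or enclosure:slot pair to pass, sas2ircu can enumerate them; the path below mirrors the example above and is not a fixed location:

/opt/custom/bin/sas2ircu LIST         # list HBAs and their controller indices
/opt/custom/bin/sas2ircu 0 DISPLAY    # show the enclosure/slot layout behind controller 0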

Replace the drive in question


Now that the FAULTED/UNAVAIL drive has been identified for replacement, there are several
different ways we can replace the drive.

Replace the drive with a spare


This is the preferred method, as it is less prone to human error. However, if the chassis does not
have room for spares, it will not be possible.

zpool replace zones <bad_drive> <spare_drive>


For instance, using the example above, we would do zpool replace zones c1t3d0 c1t16d0 , using
the other available spare.
Once the drive has been replaced and you have verified the drive is resilvering ( zpool status ),
offline the failed drive:
zpool offline zones c1t3d0
Even in a FAULTED state, drives must be 'offlined' prior to being removed.

To then remove the now-offline drive:

zpool remove zones c1t3d0


With this replacement approach, you also have the option to wait until the resilver completes before
removing the drive. However, be cautious that this does not result in forgotten dead disks.
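
Putting the spare-replacement path together for this example (device names are those from the sample pool above; substitute your own):

zpool replace zones c1t3d0 c1t16d0   # start resilvering onto the spare
zpool status zones                   # confirm the resilver is running (watch the 'scan:' line)
zpool offline zones c1t3d0           # offline the failed drive
zpool remove zones c1t3d0            # then remove it from the pool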

Continue and perform any necessary cleanup.


Replace the drive with one in the same slot (using cfgadm)
In order to replace a drive in the same slot as the faulted drive, it must be removed from the pool and
unconfigured from the OS before a new disk can be inserted.

The following steps outline the general procedure:

1. Offline c1t3d0 , the disk to be replaced. You cannot unconfigure a disk that is in use.
2. Use cfgadm to identify the disk to be unconfigured and unconfigure it (i.e. c1t3d0 ). The
pool will continue to be available, though it will be degraded with the now offline disk.
3. Physically replace the disk. The Ready to Remove LED must be illuminated before you
physically remove the faulted drive.
4. Reconfigure c1t3d0 .
5. Bring the new c1t3d0 online.
6. Run the zpool replace command to replace the disk.
The following example walks through the steps to replace a disk in a ZFS storage pool with a disk in
the same slot.
# zpool offline zones c1t3d0
# cfgadm | grep c1t3d0
sata1/3::dsk/c1t3d0 disk connected configured ok
# cfgadm -c unconfigure sata1/3
Unconfigure the device at: /devices/pci@0,0/pci1022,7458@2/pci11ab,11ab@1:3
This operation will suspend activity on the SATA device
Continue (yes/no)? yes
# cfgadm | grep sata1/3
sata1/3 disk connected unconfigured ok
<Physically replace the failed disk c1t3d0>
# cfgadm -c configure sata1/3
# cfgadm | grep sata1/3
sata1/3::dsk/c1t9d0 disk connected configured ok
# zpool online zones c1t3d0
# zpool replace zones c1t3d0
# zpool status zones
pool: zones
state: ONLINE
scrub: resilver completed after 0h0m with 0 errors on Tue Feb 2 13:17:32 2010
config:

        NAME          STATE     READ WRITE CKSUM
        zones         ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            c1t0d0    ONLINE       0     0     0
            c1t1d0    ONLINE       0     0     0
          mirror-1    ONLINE       0     0     0
            c1t2d0    ONLINE       0     0     0
            c1t3d0    ONLINE       0     0     0
          mirror-2    ONLINE       0     0     0
            c1t4d0    ONLINE       0     0     0
            c1t5d0    ONLINE       0     0     0

errors: No known data errors

Note that the preceding zpool output might show both the new and old disks under a replacing
heading. For example:

        replacing     DEGRADED     0     0     0
          c1t3d0s0/o  FAULTED      0     0     0
          c1t3d0      ONLINE       0     0     0

This is normal and not a cause for concern. The replacing status will remain until the
replacement is complete.
Replacing a drive in the same slot (with devfsadm)
devfsadm -Cv can also be used instead of the above cfgadm commands to rebuild the device files.
This process is more straightforward than the cfgadm process above, and does not require the
new replacement drive to be in the same slot. A consolidated sketch of the sequence follows the steps below.

1. Offline the faulted drive
2. Physically remove the faulted drive
3. Run devfsadm -Cv to unconfigure the old disk
4. Insert the new drive
5. Run devfsadm -Cv again to configure the new disk
6. Online the disk as above
7. Replace the disk (also as above)
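
As a rough sketch of that sequence, assuming the failed drive is c1t3d0 in the zones pool as in the earlier examples:

zpool offline zones c1t3d0
# physically remove the failed drive, then clean up the stale device links:
devfsadm -Cv
# insert the new drive, then rebuild the device links:
devfsadm -Cv
zpool online zones c1t3d0
zpool replace zones c1t3d0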

Perform any necessary cleanup


Turn the drive notification LED off
When turning the drive notification light off on the chassis, be sure to use the same slot and chassis
IDs as you did when enabling it.

$ p /opt/custom/bin/sas2ircu 0 locate 1:5 OFF

Spare replacement
Be certain to follow up and pull the failed drive and replace the spare in the pool, if one was used.

Validation
Use the zpool status command to verify that:

* The pool status is ONLINE
* There are the expected number of spare disks (if your environment uses them)
* There are the expected number of log devices
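
A quick way to spot-check all of the above is the -x flag, which only reports pools that have a problem:

zpool status -x zones

A healthy pool is reported as such in a single line; anything else prints the full status output to act on.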

Other Considerations
Mirrors - Special Considerations
Mirror management is different from working with RAIDZ{123} members: unlike with RAIDZ, there
is no parity to be concerned with. Because of this, mirror members can be 'detached' where you
would normally remove them on RAIDZ.

zpool detach zones c1t3d0


Detaching a device is only possible if there are valid replicas of the data.
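
To restore redundancy afterwards, a new member can be attached to the surviving side of the mirror; the replacement disk name below is purely illustrative:

zpool attach zones c1t2d0 c1t17d0   # mirror c1t17d0 onto the VDEV containing c1t2d0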

Working with spares


Hot spare disks can be added with the zpool add command (using the spare keyword) and removed
with the zpool remove command.

zpool add zones spare <disk>
zpool remove zones <disk>
Once a spare replacement is initiated, a new "spare" VDEV is created within the configuration that
will remain there until the original device is replaced. At this point, the hot spare becomes available
again if another device fails.

An in-progress spare replacement can be cancelled by detaching the hot spare. This can only be
done if the original faulted device has not yet been detached. If the disk it is replacing has been
removed, then the hot spare assumes its place in the configuration.

zpool detach zones <disk>


Spares cannot replace log devices.
Working with ZIL logs
ZIL log devices are a special case in ZFS. They 'front' synchronous writes to the pool: sync
writes are effectively cached to fast temporary storage so that storage consumers can continue
immediately, with the ZIL's mechanisms flushing transactions from the log to permanent storage
in bursts.

This in effect makes the ZIL a dangerous single point of failure for the pool in certain situations. For
instance, if a single log device fails, it effectively cuts out the middle of the data pipeline and loses
any transactions in flight: data which was acknowledged to the writer and present on the log, but not
yet committed to persistent storage in the pool, will be lost.

Running mirrored ZIL log devices is highly recommended and mitigates this single point of failure.
Working with ZIL log mirrors is contextually identical to other VDEV mirrors: you use detach to
remove a mirror member.
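
If the pool does not yet have a mirrored log, one can be added with the log keyword; an existing single log device can generally be converted by attaching a second device to it (the device names below are placeholders):

zpool add zones log mirror c2t0d0 c2t1d0   # add a new mirrored log
zpool attach zones c2t0d0 c2t1d0           # or mirror an existing log device c2t0d0 with c2t1d0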

If there is only a single ZIL log device, it is removed, not detached:

zpool remove zones c1t3d0


Please note that removal of a ZIL log is a potentially disruptive action and that it should only be done
during a low I/O maintenance window.

Working with L2 ARC


L2 ARC devices can be added to the pool to provide secondary ARC caching for data and
metadata. Whether one or multiple L2 ARC devices are added, they will be used in a roughly 'striped'
fashion. These are not mirrored devices, as the data they contain is transient.

To add a cache device:

zpool add zones cache <disk>


Replacement is much the same as with mirrors: single devices can be removed outright should they
fail and new ones added.
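
For instance, swapping out a failed cache device (device names hypothetical):

zpool remove zones c2t2d0        # drop the failed cache device
zpool add zones cache c2t3d0     # add its replacement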

One significant caveat about L2ARC is that these devices are not free to the system: their metadata
is maintained within the primary ARC, which is in turn wired in system memory. Memory use must
be taken into account when adding L2ARC devices.
Repairing the pool
Checksum errors
Checksum errors can occur transiently on individual disks or across multiple disks. The most likely
culprits are bit rot or transient storage subsystem errors - oddities like signal loss due to solar flares
and so on.

With ZFS, they are not of much concern, but some degree of preventative maintenance is necessary
to keep failures from accumulating.

From time to time you may see zpool status output similar to this:

        NAME          STATE     READ WRITE CKSUM
        zones         ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            c1t0d0    ONLINE       0     0    23
            c1t1d0    ONLINE       0     0     0

Note the "23" in the CKSUM column.

If this number is significantly large or growing rapidly, the drive is likely in a "pre-failure" state and will
fail soon; in the meantime it is (in this case) potentially compromising the redundancy of the VDEV.

One thing to note is that occasional checksum errors on individual drives are normal and expected
behavior (if not optimal), as are many errors on a single drive which is about to fail. Many checksum
failures across multiple drives, however, can be indicative of a significant storage subsystem
problem: a damaged cable, a faulty HBA, or even power problems. If this is noticed, consider
contacting Support for assistance with identification.
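
If the counts are small and a subsequent scrub completes without new errors, the counters can be reset so that any future growth is easy to spot (pool and device names as in the example above):

zpool scrub zones
zpool status zones        # wait for the scrub to finish cleanly
zpool clear zones c1t0d0  # reset the error counters on that device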

Hint: You can audit pool health across the entire datacenter from the headnode with: sdc-oneachnode -c 'zpool status -x'
Resilver
A zpool resilver is an operation that rebuilds data across a pool due to either a degraded device (for
instance, a disk may temporarily disappear and need to 'catch up') or a newly replaced device. In
other words, it reconstructs the data that belongs on the degraded or new device from the remaining replicas.
Multiple resilvers can occur at the same time within multiple VDEVs.

Please note that resilvers can degrade performance on a busy pool. Plan performance projections
accordingly.

Resilvers are automatic. They cannot (and should not) be interrupted, short of physical removal or
failure of a device.

Scrub
Scrub examines all data in the specified pools to verify that it checksums correctly. For replicated
(mirror or raidz) devices, ZFS automatically repairs any damage discovered during the scrub.
The zpool status command reports the progress of the scrub and summarizes the results of the
scrub upon completion.
To start a scrub:

zpool scrub zones


To stop a scrub:

zpool scrub -s zones


If a zpool resilver is in progress, a scrub cannot be started until the resilver completes.

Scrub and resilver concurrency


Scrubbing and resilvering are very similar operations. The difference is that resilvering only
examines data that ZFS knows to be out of date (for example, when attaching a new device to a
mirror or replacing an existing device), whereas scrubbing examines all data to discover silent errors
due to hardware faults or disk failure.
Because scrubbing and resilvering are I/O-intensive operations, ZFS only allows one at a time. If a
scrub is already in progress, the "zpool scrub" command returns an error.

Autoreplace
By enabling ZFS autoreplace on a pool (a property that is disabled by default), you enable your system
to automatically use a spare drive to replace FAULTED/UNAVAIL drives.
It should be cautioned that there are potential drawbacks to this approach: in the event of
something like misbehaving firmware or an HBA failure, multiple drives may be replaced and then the
replacements may fault prior to the initial resilver completing, resulting in a scenario that is more
difficult to recover from. Enabling autoreplace is highly inadvisable unless you have a responsive
24/7 DC operations team.

To enable:

zpool set autoreplace=on zones
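
To confirm the current setting on a pool:

zpool get autoreplace zones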

Further assistance needed


If this document is unclear, incorrect, or does not appear to cover your specific scenario,
please contact MNX Support.
Additional information
Please reference the associated man pages on your systems for further in-depth information:

zfs(1M), zpool(1M), cfgadm(1M), devfsadm(1M), fmadm(1M), fmd(1M), fmdump(1M)
