ZFS Disk Failure - Understanding and Resolving
Modified: 08 Mar 2023 00:28 UTC
Introductory Concepts
This document is written for administrators and others familiar with computing hardware
platforms and storage concepts such as RAID. If you're already versed in the general failure
process, you can skip ahead to the sections on replacing a drive and repairing the pool.
Degrees of verbosity
When a drive fails or has errors, a great deal of logging data is available on SmartOS, and we can
drill down into it to find the underlying cause of disk failure. Listed in order of increasing
verbosity, these commands present the cause of a disk failure in progressively greater detail:
* `zpool status`
* `iostat -en`
* `iostat -En`
* `fmadm faulty`
* `fmdump -et {n}days`
* `fmdump -eVt {n}days`
The zpool status command will present us with a high-level view of pool health.
iostat will present us with high-level error counts and specifics as to the devices in question.
fmadm faulty will tell us more specifically which event led to the disk failure. ( fmadm can also be
used to clear transitory faults; this, however, is outside the scope of this document. Refer to
the fmadm man page for more information.) fmdump is much more specific still, presenting us with a
log of the last {n} days of fault events. This information is often extraneous to replacing faulted disks,
but if the problem is more complex than a simple single-disk failure, it is extremely useful in isolating
a root cause.
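For example, a typical triage session might run these in order, stopping as soon as the cause is
clear. This is only a sketch; the 7-day window passed to fmdump is an arbitrary choice:
# zpool status
# iostat -En
# fmadm faulty
# fmdump -et 7days
# fmdump -eVt 7days
Start at the top and only reach for the fmdump output when the simpler commands leave the cause
ambiguous.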
General failure process
ZFS is not the first component in the system to become aware of a disk failure. When a disk fails,
becomes unavailable, or develops a functional problem, the device driver and the fault management
framework (FMA) detect and diagnose the errors first; ZFS then places the affected device into one
of the following states:
UNAVAIL
The device (or VDEV) in question cannot be opened. If a top-level VDEV is UNAVAIL , the pool will not
be accessible or importable. UNAVAIL devices may also report as FAULTED in some
scenarios. Operationally, UNAVAIL disks are roughly equivalent to FAULTED disks.
DEGRADED
A fault in a device has occurred, impacting all VDEVs above it. The pool is still operable, but
redundancy may have been lost in a VDEV.
REMOVED
The device was physically removed while the system was running. Device removal detection is
hardware-dependent and might not be supported on all platforms.
FAULTED
All components of the pool (top-level and redundancy VDEVs, and drives) can be in a FAULTED state. A
FAULTED component is completely inaccessible. The severity of a component being FAULTED
depends largely on which component it is.
INUSE
This is a status reserved for spares which have been used to replace a faulted drive.
Degree of failure
Due to its integrated volume management characteristics, failures at different levels within ZFS
impact the system and overall pool health to different degrees.
Should the entire pool become FAULTED, it may still be possible to bring it back ONLINE
in rare scenarios, such as those brought on by a controller failure where a large swath of disks are
FAULTED as a secondary cause. Under most scenarios, however, a FAULTED pool is unrecoverable and its
data will need to be restored from backup.
Redundancy groups
If more disks are lost in a redundancy group than it has redundancy for (2 out of 2 disks in a
mirror, 3 disks from a RAIDZ-2, or 2 from a RAIDZ), the redundancy group will become FAULTED. A
FAULTED state at the VDEV level will result in a pool FAULTED state: each redundancy group
should be thought of as your top-level protection against data loss, with the pool itself serving to
stripe data across the redundancy groups.
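As a purely illustrative sketch (the pool name tank and the disk names are hypothetical), a pool
built from two RAIDZ-2 redundancy groups stripes data across both groups, and each group can lose
up to two disks before it becomes FAULTED:
# zpool create tank raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 \
    raidz2 c0t6d0 c0t7d0 c0t8d0 c0t9d0 c0t10d0 c0t11d0
Losing a third disk from either group would fault that VDEV and, with it, the entire pool.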
Individual drive failure
Individual drives becoming faulted is not problematic to the pool or redundancy group, as long as
fewer drives fail than the group has redundancy for. For instance, 2 disks in a RAIDZ-2 VDEV can
fail without the fault cascading upwards.
In the above example, there are two faulted devices and one that is unavailable. From an
administrative perspective, these two states are functionally identical: you want to replace them with
known working drives.
ZFS knows when a drive exceeds its error limits and will automatically take it out of the
pool. This can happen for any type of failure. As an operator, all that matters is that a drive has
faulted; the manufacturer can determine why it happened when you RMA it.
Replacing a drive in the same slot (with cfgadm)
First, identify the faulted disk and its physical slot, for example with diskinfo:
$ diskinfo -cH
=== Output from 00000000-0000-0000-0000-003590935999 (hostname):
<snip>
SCSI c1t4d0 HITACHI HUS723030ALS640 YHK16Z7G 2794.52 GiB ---- [0] Slot 02
SCSI c1t13d0 HITACHI HUS723030ALS640 YHJZMU7G 2794.52 GiB ---- [0] Slot 03
<snip>
SCSI c1t3d0 HITACHI HUS723030ALS640 YHK08JHG 2794.52 GiB ---- [1] Slot 05 <--- here
1. Offline c1t3d0 , the disk to be replaced. You cannot unconfigure a disk that is in use.
2. Use cfgadm to identify the disk to be unconfigured and unconfigure it (i.e. c1t3d0 ). The
pool will continue to be available, though it will be degraded with the now offline disk.
3. Physically replace the disk. The Ready to Remove LED must be illuminated before you
physically remove the faulted drive.
4. Reconfigure c1t3d0 .
5. Bring the new c1t3d0 online.
6. Run the zpool replace command to replace the disk.
The following example walks through the steps to replace a disk in a ZFS storage pool with a disk in
the same slot.
# zpool offline zones c1t3d0
# cfgadm | grep c1t3d0
sata1/3::dsk/c1t3d0 disk connected configured ok
# cfgadm -c unconfigure sata1/3
Unconfigure the device at: /devices/pci@0,0/pci1022,7458@2/pci11ab,11ab@1:3
This operation will suspend activity on the SATA device
Continue (yes/no)? yes
# cfgadm | grep sata1/3
sata1/3 disk connected unconfigured ok
<Physically replace the failed disk c1t3d0>
# cfgadm -c configure sata1/3
# cfgadm | grep sata1/3
sata1/3::dsk/c1t3d0 disk connected configured ok
# zpool online zones c1t3d0
# zpool replace zones c1t3d0
# zpool status zones
pool: zones
state: ONLINE
scrub: resilver completed after 0h0m with 0 errors on Tue Feb 2 13:17:32 2010
config:
Note that the preceding zpool output might show both the new and old disks under a replacing
heading. For example:
replacing      DEGRADED 0 0 0
  c1t3d0s0/o   FAULTED  0 0 0
  c1t3d0       ONLINE   0 0 0
This is normal and not a cause for concern. The replacing status will remain until the
replacement is complete.
Replacing a drive in the same slot (with devfsadm)
devfsadm -Cv can also be used instead of the above cfgadm commands to rebuild the device files.
This process is more straightforward than the cfgadm process above, and does not require the
new replacement drive to be in the same slot.
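A minimal sketch of the devfsadm path, assuming the faulted disk is c1t3d0 and the pool is named
zones as in the example above:
<Physically replace the failed disk>
# devfsadm -Cv
# zpool replace zones c1t3d0
# zpool status zones
If the replacement drive comes up under a different device name (say, a hypothetical c2t5d0 in
another slot), pass both names instead: zpool replace zones c1t3d0 c2t5d0.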
Spare replacement
If a hot spare was used, be certain to follow up: pull the failed drive, install a replacement, and
return the spare to its role as a spare in the pool.
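As a sketch, assume a hypothetical hot spare c1t14d0 was activated to cover the faulted c1t3d0.
Once the failed drive has been physically replaced, resilver onto the new drive and then detach the
spare, which returns it to the available spares list:
# zpool replace zones c1t3d0
# zpool detach zones c1t14d0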
Validation
Use the zpool status command to verify that:
* The pool status is ONLINE
* There are one or more available spare disks (if your environment uses them)
* The expected number of log devices is present
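For a quick check, zpool status -x reports only pools with problems (or that all pools are
healthy), while naming the pool shows the full configuration including spares and log devices; the
pool name zones matches the examples above:
# zpool status -x
# zpool status zones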
Other Considerations
Mirrors - Special Considerations
Mirror management is different from working with RAIDZ{1,2,3} members, as unlike with RAIDZ, there
is no parity to be concerned with. Because of this, mirror members can simply be detached, whereas
RAIDZ members must be replaced.
An in-progress spare replacement can be cancelled by detaching the hot spare. This can only be
done if the original faulted device has not yet been detached. If the disk it is replacing has been
removed, then the hot spare assumes its place in the configuration.
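A sketch of both operations, using hypothetical device names: detach a member of a mirror VDEV to
remove it, and detach the hot spare itself (here c1t14d0) to cancel an in-progress spare
replacement:
# zpool detach zones c1t5d0
# zpool detach zones c1t14d0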
ZIL log devices - Special Considerations
A dedicated log device (slog) holds the ZFS intent log: synchronous writes are written to the log
device and acknowledged before they are committed to the main pool. This in effect makes the ZIL a
dangerous single point of failure for the pool in certain situations. For instance, if a single log
device fails, it effectively cuts out the middle of the data pipeline and loses any transactions
in-flight: data which was acknowledged and present on the log, but not yet committed to the main
pool, can be lost.
Running mirrored ZIL log devices is highly recommended and mitigates this single point of failure.
Working with ZIL log mirrors is identical to working with other mirror VDEVs: you use zpool detach
to remove a mirror member.
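For instance, a mirrored log pair can be added to an existing pool with zpool add, and a member of
that mirror later removed with zpool detach; the device names below are hypothetical:
# zpool add zones log mirror c2t0d0 c2t1d0
# zpool detach zones c2t1d0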
L2ARC - Special Considerations
One significant caveat about L2ARC is that these devices are not free to the system: their metadata
is maintained within the primary ARC, which is in turn wired into system memory. Memory use must be
taken into account when adding L2ARC devices.
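Cache devices are managed with zpool add and zpool remove (the device name below is hypothetical).
Unlike log devices, a failed L2ARC device costs only read-cache performance: reads simply fall back
to the pool.
# zpool add zones cache c3t0d0
# zpool remove zones c3t0d0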
Repairing the pool
Checksum errors
Checksum errors can occur transiently on individual disks or across multiple disks. The most likely
culprits are bit rot or transient storage subsystem errors - oddities like signal loss due to solar flares
and so on.
With ZFS, they are not of much concern, but some degree of preventative maintenance is necessary
to keep accumulating errors from turning into a failure.
From time to time you may see zpool status output with a nonzero count in a device's CKSUM column.
If this number is significantly large or growing rapidly, the drive is likely in a "pre-failure" state and will
fail soon; in the meantime it is potentially compromising the redundancy of the VDEV.
Note that occasional checksum errors on individual drives are normal and expected behavior (if not
optimal), as are many errors on a single drive which is about to fail.
Many checksum failures across multiple drives can be indicative of a significant storage subsystem
problem: a damaged cable, a faulty HBA, or even power problems. If this is noticed, consider
contacting Support for assistance with identification.
Hint: You can audit pool health across the entire datacenter from the headnode with:
sdc-oneachnode -c 'zpool status -x'
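Once the underlying cause has been addressed (or judged transient), the per-device error counters
can be reset so that any new errors stand out. A sketch, assuming the pool is zones: zpool clear
zones resets the counters for every device, while appending a device name clears only that device:
# zpool clear zones
# zpool clear zones c1t3d0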
Resilver
A zpool resilver is an operation that rebuilds the data and parity on a device, due either to a
degraded device (for instance, a disk that temporarily disappeared and needs to 'catch up') or to a
newly replaced device. In other words, it copies data from the surviving devices (or the
degraded/old disk) onto the new device.
Multiple resilvers can occur at the same time within multiple VDEVs.
Please note that resilvers can degrade performance on a busy pool. Plan performance projections
accordingly.
Resilvers are automatic. They cannot (and should not) be interrupted short of physical removal or
failure of a device.
Scrub
Scrub examines all data in the specified pools to verify that it checksums correctly. For replicated
(mirror or raidz) devices, ZFS automatically repairs any damage discovered during the scrub.
The zpool status command reports the progress of the scrub and summarizes the results of the
scrub upon completion.
To start a scrub:
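For example, to scrub the zones pool, and to stop an in-progress scrub if it is hurting production
I/O:
# zpool scrub zones
# zpool scrub -s zones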
Autoreplace
By enabling ZFS autoreplace on a pool (a property that is disabled by default), you allow the system
to automatically use a spare drive to replace FAULTED/UNAVAIL drives.
It should be cautioned that there are potential drawbacks to this approach: in the event of
something like misbehaving firmware or an HBA failure, multiple drives may be replaced and then the
replacements may fault before the initial resilver completes, resulting in a more difficult scenario
from which to recover. Enabling autoreplace is highly inadvisable unless you've got a responsive
24/7 DC operations team.
To enable:
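Assuming the pool is named zones, autoreplace is set and verified like any other pool property:
# zpool set autoreplace=on zones
# zpool get autoreplace zones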