Stubbornly Persistent ZFS Pools (and what to do about them)

19 02 2009

The first time I set up ZFS-Fuse on Hydra, I misinterpreted the admonition to use “whole disk vdevs” to mean that I should create pools from the entire disk device:

# zpool create olympic raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd

But it turns out that when you do that on Solaris, ZFS will actually create a regular partition for you. Since I need partitions anyway, this is good news. What I did find, though, was that after re-partitioning my disks it still looked to ZFS as though the old pool was there: its label information was hiding in areas of the disks that the new partitioning hadn’t touched. It wasn’t a big deal, but every time I started ZFS it would complain that it couldn’t find my old pool, “olympic” (get it, olympic pool? oh, never mind).
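
As an aside, running zpool import with no arguments makes ZFS scan the attached devices and list any pool it can find that isn’t currently imported, so it’s a quick way to confirm that a ghost pool really is lurking on the disks:

# zpool import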

Fortunately these posts by Ricardo Correia made it clear exactly what I needed to do, and because I had set up fully-redundant RAID, it was fairly easy, if a bit nerve-wracking, to zero those areas of the disks and re-sync them; it was my first real experience with RAID recovery. Here’s how it went.

First, find a disk on which zdb -l still reports labels from the broken pool:

# zdb -l /dev/sdf
--------------------------------------------
LABEL 0
--------------------------------------------
    version=13
    name='olympic'
    state=0
    txg=279903
    pool_guid=4681941973924109929
    hostid=8323329
    hostname='recovery'
    top_guid=8406764786620297180
    guid=8406764786620297180
    vdev_tree
        type='disk'
        id=1
        guid=8406764786620297180
        path='/dev/sdh'
        whole_disk=0
        metaslab_array=14
        metaslab_shift=32
        ashift=9
        asize=500103118848
        is_log=0
        DTL=89
--------------------------------------------
LABEL 1
--------------------------------------------
failed to unpack label 1
--------------------------------------------
LABEL 2
--------------------------------------------
    version=13
    name='olympic'
    state=0
    txg=279903
    pool_guid=4681941973924109929
    hostid=8323329
    hostname='recovery'
    top_guid=8406764786620297180
    guid=8406764786620297180
    vdev_tree
        type='disk'
        id=1
        guid=8406764786620297180
        path='/dev/sdh'
        whole_disk=0
        metaslab_array=14
        metaslab_shift=32
        ashift=9
        asize=500103118848
        is_log=0
        DTL=89
--------------------------------------------
LABEL 3
--------------------------------------------
    version=13
    name='olympic'
    state=0
    txg=279903
    pool_guid=4681941973924109929
    hostid=8323329
    hostname='recovery'
    top_guid=8406764786620297180
    guid=8406764786620297180
    vdev_tree
        type='disk'
        id=1
        guid=8406764786620297180
        path='/dev/sdh'
        whole_disk=0
        metaslab_array=14
        metaslab_shift=32
        ashift=9
        asize=500103118848
        is_log=0
        DTL=89
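
If you’re not sure which disks still carry the stale labels, a quick loop over the whole-disk devices will survey them all (a rough sketch; adjust the /dev/sd[a-i] glob and the pool name to match your system):

# for d in /dev/sd[a-i]; do zdb -l $d | grep -q olympic && echo stale labels on $d; done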

Now take the disk’s partitions out of all md arrays in which they participate:

# mdadm /dev/md0 -f /dev/sdf1 -r /dev/sdf1
mdadm: set /dev/sdf1 faulty in /dev/md0
mdadm: hot removed /dev/sdf1
# mdadm /dev/md1 -f /dev/sdf6 -r /dev/sdf6
mdadm: set /dev/sdf6 faulty in /dev/md1
mdadm: hot remove failed for /dev/sdf6: Device or resource busy

Occasionally the remove will fail, as in the second example above; if that happens, just repeat the command until it succeeds.

Note: in my case there were no active ZFS pools that I wanted to keep on the system, only mdRAID arrays. If you have active ZFS pools, you’ll want to do something similar with those: take the affected device offline from its pool before the dd, and let it resilver after you bring it back.
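
In rough strokes that would look like the following. This is only a sketch, with “tank” and /dev/sdf7 standing in for whatever your pool and its partition on the affected disk are actually called:

# zpool offline tank /dev/sdf7
  (... do the dd and sfdisk surgery on /dev/sdf ...)
# zpool online tank /dev/sdf7
# zpool status tank

Bringing the device back online starts the resilver automatically, and zpool status lets you watch it finish before moving on.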

ZFS keeps four copies of its vdev label, two at the very beginning of the device and two at the very end, so just to be safe I’m going to clear the first and last 2MB of each disk. The first 2MB is easy:

# dd if=/dev/zero of=/dev/sdf bs=1M count=2

To zero out the final 2MB, use fdisk to find the disk’s exact size in bytes:

# fdisk /dev/sdf

The number of cylinders for this disk is set to 60801.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
   (e.g., DOS FDISK, OS/2 FDISK)

Command (m for help): p

Disk /dev/sdf: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000000

   Device Boot      Start         End      Blocks   Id  System
/dev/sdf1   *           1           6       48163+  fd  Linux raid autodetect
/dev/sdf2               7         732     5831595    5  Extended
/dev/sdf5               7         124      947803+  82  Linux swap / Solaris
/dev/sdf6             125         732     4883728+  fd  Linux raid autodetect

Now calculate how many 1MB blocks into the disk we want to start zeroing, and issue another dd. The integer division rounds down, so zero three blocks instead of two to be sure of reaching the very end of the disk.

# python -c 'print 500107862016/(1024*1024) - 2'
476938
# dd if=/dev/zero of=/dev/sdf bs=1M count=3 seek=476938
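
If you’d rather not copy the byte count out of fdisk by hand, blockdev can report the device size directly and the shell can do the same arithmetic (a sketch, using the same round-down-and-overshoot logic as above):

# seek=$(( $(blockdev --getsize64 /dev/sdf) / 1048576 - 2 ))
# dd if=/dev/zero of=/dev/sdf bs=1M count=3 seek=$seek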

Now we’ve clobbered the partition table, so copy it from an identically-partitioned disk:

# sfdisk -d /dev/sde | sfdisk /dev/sdf
Checking that no-one is using this disk right now ...
OK

Disk /dev/sdf: 60801 cylinders, 255 heads, 63 sectors/track
Warning: extended partition does not start at a cylinder boundary.
DOS and Linux will interpret the contents differently.
Old situation:
Units = cylinders of 8225280 bytes, blocks of 1024 bytes, counting from 0

   Device Boot Start     End   #cyls    #blocks   Id  System
/dev/sdf1          0+    729     730-   5863693+   5  Extended
/dev/sdf2          0       -       0          0    0  Empty
/dev/sdf3          0       -       0          0    0  Empty
/dev/sdf4          0       -       0          0    0  Empty
/dev/sdf5          0+    121     122-    979902   82  Linux swap / Solaris
/dev/sdf6        122+    729     608-   4883728+  fd  Linux raid autodetect
New situation:
Units = sectors of 512 bytes, counting from 0

   Device Boot    Start       End   #sectors  Id  System
/dev/sdf1   *        63     96389      96327  fd  Linux raid autodetect
/dev/sdf2         96390  11759579   11663190   5  Extended
/dev/sdf3             0         -          0   0  Empty
/dev/sdf4             0         -          0   0  Empty
/dev/sdf5         96453   1992059    1895607  82  Linux swap / Solaris
/dev/sdf6       1992123  11759579    9767457  fd  Linux raid autodetect
Successfully wrote the new partition table

Re-reading the partition table ...

If you created or changed a DOS partition, /dev/foo7, say, then use dd(1)
to zero the first 512 bytes:  dd if=/dev/zero of=/dev/foo7 bs=512 count=1
(See fdisk(8).)
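
It doesn’t hurt to glance at /proc/partitions and confirm the kernel really picked up the new table before handing the partitions back to md:

# grep sdf /proc/partitions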

Add the partitions back, and wait for them to re-synchronize:

# mdadm -a /dev/md0 /dev/sdf1 && mdadm -a /dev/md1 /dev/sdf6
# watch cat /proc/mdstat
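
If your mdadm is recent enough to have it, the --wait option simply blocks until any resync or recovery on an array finishes, which is handier in scripts than polling:

# mdadm --wait /dev/md0 && mdadm --wait /dev/md1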

When synchronization is done, the only step that remains is to reinstall GRUB if the disk was bootable:

# grub
Probing devices to guess BIOS drives. This may take a long time.

       [ Minimal BASH-like line editing is supported.   For
         the   first   word,  TAB  lists  possible  command
         completions.  Anywhere else TAB lists the possible
         completions of a device/filename. ]
grub> device (hd1) /dev/sdf
device (hd1) /dev/sdf
grub> root (hd1,0)
root (hd1,0)
grub> setup (hd1)
setup (hd1)
 Checking if "/boot/grub/stage1" exists... no
 Checking if "/grub/stage1" exists... yes
 Checking if "/grub/stage2" exists... yes
 Checking if "/grub/xfs_stage1_5" exists... yes
 Running "embed /grub/xfs_stage1_5 (hd1)"...  19 sectors are embedded.
succeeded
 Running "install /grub/stage1 (hd0) (hd1)1+19 p (hd1,0)/grub/stage2 /grub/menu.lst"... succeeded
Done.

Did I remove all traces of the pool? Yup:

# zdb -l /dev/sdf
--------------------------------------------
LABEL 0
--------------------------------------------
failed to unpack label 0
--------------------------------------------
LABEL 1
--------------------------------------------
failed to unpack label 1
--------------------------------------------
LABEL 2
--------------------------------------------
failed to unpack label 2
--------------------------------------------
LABEL 3
--------------------------------------------
failed to unpack label 3

Just to be sure, I rebooted and selected /dev/sdf as my BIOS boot device. It worked! Lather, rinse, repeat (because this is going to take a while).


Update: After handling the three disks that contribute to more than one MD device, I wrote a little script to handle the rest of them. Voilà:

set -e    # abort immediately if anything unexpected fails
# set -x

# Don't start until the array reports healthy.
if ! mdadm --detail -t /dev/md1 > /dev/null; then
    echo waiting for healthy array
    until mdadm --detail -t /dev/md1 > /dev/null ; do echo -n . ; sleep 1 ; done
fi

# These disks belong to /dev/md1 only; the multi-array disks were handled by hand.
for x in /dev/sd[cdei]; do
    echo '**' taking $x offline:
    mdadm /dev/md1 -f ${x}6
    sleep 1    # give md a moment, or the hot remove can fail with "resource busy"
    mdadm /dev/md1 -r ${x}6
    echo clearing beginning
    dd if=/dev/zero of=$x bs=1M count=2 > /dev/null 2>&1
    echo clearing end
    # seek=476938 assumes disks the same 500GB size as the one calculated above
    dd if=/dev/zero of=$x bs=1M count=2 seek=476938 > /dev/null 2>&1
    echo checking for cleared pools
    if zdb -l $x | grep -q olympic; then
        echo "stale pool label still present on $x; aborting" >&2
        exit 1
    fi
    echo no pool remains on $x.  copying partition table
    sfdisk -d /dev/sdb | sfdisk $x > /dev/null 2>&1
    echo bringing disk back online
    mdadm /dev/md1 -a ${x}6
    echo waiting for resilver to complete
    until mdadm --detail -t /dev/md1 > /dev/null ; do echo -n . ; sleep 1 ; done
    echo resilvering complete.  Starting next disk in 5 seconds
    sleep 5
done