Iozone Author Speaks Out… Hilariously

27 02 2009

When my experiment in benchmarking ZFS-Fuse yielded more data than I knew what to do with, I googled around a bit and found at least one other person in a similar position who contacted the author (Don Capps) of my benchmarking tool (Iozone) to get his take on the results. I figured it was worth a shot, and found Don to be extremely generous with his time and expertise. He distilled my numbers down into this graph, which made things a lot easier for me to grasp:

Distilled Worst-Case Stats for ZFS-FUSE vs. XFS


Basically, Don only looked at the results where the file size exceeded the system’s RAM size, since any transfer that fits in RAM isn’t going to tell you much about the underlying filesystem technology.
He also told me that if I was getting those kinds of speeds on commodity hardware, I should feel pretty good about my results.
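For anyone repeating the experiment: Iozone's -s flag controls the test file size, so the way to defeat the page cache is to push -s past physical RAM. A hypothetical invocation for a machine with 4 GB of RAM might look like this (the path and sizes are examples, not my actual command line):

```shell
# Hypothetical Iozone run: an 8 GB file on a 4 GB machine, so transfers
# must actually touch the filesystem rather than the page cache.
iozone -i 0 -i 1 -r 128k -s 8g -f /pool/testfile
```

(-i 0 selects the write/rewrite test, -i 1 the read/re-read test.)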

So I posted a little about these results in a relevant mailing list, and someone else posted some numbers for a competing FUSE filesystem that were so much better that I had to ask Don for his opinion. Naturally, Don wanted details about the poster’s hardware and testing protocol, and suggested that we were very likely seeing pure cache effect in those numbers. Unfortunately, I’ve been unable to get any details—just one among several reasons I’m not giving those numbers much weight. In any case, Don followed up with a few messages about what sort of setup one would need to reproduce those claimed speeds. Those followups are the point of this post. They’re reposted here with Don’s permission, and IMO they speak for themselves:

Since you are interested in the science, I thought
I would describe some ways to get 1.2 Gbytes/sec
off the platter. ( It can be done, not easily, but
if one has ~infinite resources…..)

Note: Assume all values below are ballpark and not any
specific hw.

Assuming ~40 Mbytes/sec/disk ( Typical modern disk drive )

Then to get to 1.2 Gbytes/sec == 1200 Mbytes/sec
1200/40 == 120/4 == 30 disk drives.

Now we need some way to connect 30 disks. Assuming
we can get 10 disks in a JBOD, we’ll need 3 disk
enclosures… Well… Not exactly. We still need
to have an aggregate interconnect of 1200 Mbytes/sec.
Ok.. Fibre Channel (1 Gigabit FC) can do around 100 Mbytes/sec
so… 1200/100 = 12 fibre channel connections. That’s
a bit of a bummer as most PC’s don’t have 12 PCI slots.
So… We will need to go to 2 Gigabit fibre and use 6
slots,… Oh darn… Most PC’s don’t have 6 free
PCI slots, so we’ll probably need 3 dual ported 2 Gigabit
FC cards. Since each of these cards is going to be
sustaining 400 Mbytes/sec, it’s probably a good
idea to make these PCI express slots.
So far we now have:
30 disks
6 Disk enclosures with 5 disks in each.
3 Dual ported 2 Gigabit FC cards.

Now on to the next bottleneck… Be sure that one
starts with a motherboard that has a backplane that
can sustain 1200 Mbytes/sec.

Next up, integrity… I doubt that most folks are
going to want to be ripping through data at 1.2 Gbytes/sec
and not care about their data. So… Chances are good
that they’ll want some level of RAID. RAID 1 would be
a good choice for speed, but it does mean that we’ll
need 60 disks instead of 30. If we use RAID 5 then we
will still need more disks, but not as many more. The
bummer of RAID 5 is that it generally slows down the
writer. To make up for that issue, we’ll have to
either choose more disks (RAID1) or a smarter RAID
enclosure that can do the RAID5 XOR ops independently
of the system CPU, and hopefully double buffered, and
with multiple XOR engines and data paths. All doable,
but it does increase the cost of the system.

So.. Here we are. We can do 1.2 Gbytes/sec, but it
is not going to be cheap or easily achieved. If we
ballpark this we get something like:

* 3 dual ported 2Gbit FC controllers with multiple RAID5
XOR engines… ~ $3,000
* 40 to 60 disks .. ~$4,000 to $6,000.
* 6 JBOD enclosures.. ~$6,000
* 6 FC cables… ~$600
* PC with nice MB for that 1200 Mbytes/sec
backplane.. ~$2,000
~$15,600 to ~$17,600

( And that could go up higher if one wanted dual path
HA type connectivity as it would push one to dual
ported enclosures and quad ported FC cards )

Don Capps

P.S. The above may get one to 1.2 Gbytes/sec for sequential
workloads, but it will not be nearly so speedy if
that workload were to shift towards a random I/O
access pattern… 🙂

P.P.S. Once you have this beast built, then you can
start thinking about the environmental impact.
It’s pretty likely that these 6 disk enclosures,
60 disks, and the PC, are generating a fairly
significant quantity of heat, noise, and making
the electric meter spin at a rate you have never
seen before, and cannot afford to sustain 🙂
BUT, it will be beautiful and a work of both
science and art 🙂 … Make sure you install
plenty of blue LEDs, as the blinking lights with
this many disks is mesmerizing, and will satisfy your
wife that you have constructed something really
interesting and have not been simply wasting your
time… 🙂

Don, this is hilarious and educational. Do you mind if I post it on my blog
(with attribution, of course)?


I don’t mind. But I did leave off a few other thoughts…

Environmental impact continued:

It’s fairly likely you’ll need to hire an electrician to
come out and put in a special circuit and rewire the
bedroom (where you have the storage system) as the current
draw is probably going to exceed the typical breaker used
for a bedroom. + $500

You also may need to call the air-conditioner folks and
upgrade that 3 ton handler to a 5 ton handler, as the
thermal load is pretty high and without additional cooling
capacity, your house may become a sauna. + $6,000
It may also be possible to construct the beast inside of
a water cooled chamber and put a heat exchanger outside
your house. A small cooling tower should do the trick
but you may wish to check with the homeowners association
before you install the external tower.

Make sure the room is very dark and those blinking blue
LEDs look their best, otherwise you may need to explain
to your wife why you spent $23,500 on this project, instead
of a new car, a fur coat, a diamond ring, or a European
vacation … Trust me, you really want those LEDs to be

Don Capps

P.S. If you would like I could send photos of my home
bedroom lab. Yep… My wife really liked the blinking
blue lights, but then again, she is a computer scientist
too 🙂

I’m not going to post a picture of the inside of Don’s bedroom here, but I can tell you that while the rack is impressive, it’s not nearly as scary as I expected. My guess is he’s not trying to reproduce these claimed performance numbers 😉
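Incidentally, Don’s back-of-the-envelope arithmetic is easy to verify. Here’s a quick shell sanity check; all the inputs are his ballpark figures (40 Mbytes/sec per disk, ~200 Mbytes/sec per 2 Gigabit FC link), not measurements of any specific hardware:

```shell
# Recompute Don's sizing and cost ballparks (all inputs are his estimates)
target=1200; disk=40; fc2=200        # Mbytes/sec: goal, per disk, per 2Gb FC link
echo "disks needed:    $(( target / disk ))"       # 30
echo "2Gb FC links:    $(( target / fc2 ))"        # 6
echo "dual-port cards: $(( target / fc2 / 2 ))"    # 3

fc_cards=3000; enclosures=6000; cables=600; pc=2000   # dollars
disks_lo=4000; disks_hi=6000                          # 40 vs. 60 disks
echo "low estimate:  \$$(( fc_cards + disks_lo + enclosures + cables + pc ))"   # $15600
echo "high estimate: \$$(( fc_cards + disks_hi + enclosures + cables + pc ))"   # $17600
```

The 12-link figure for 1 Gigabit FC falls out the same way (1200/100).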


So Outdated Already?

25 02 2009

Today I finally got VMware up and running on Hydra, and simultaneously installed Windows XP Pro x64 edition on that and, under VMware Fusion, on my new MacBook Pro 17″ (more on that later!). The never-ending sequence of updates required every time you install Windows completed in—well, I’m not sure; I’m still waiting—let’s just say a lot less time on the MacBook than on my relatively muscular server.  Yes, the MacBook has a higher clock rate, but the server has twice as many cores, eight times as many disks, and dual I/O channels.

In fact, these updates seem to be deadly slow on that VM.  I don’t know how to account for it; top says vmware-vmx is using at most 36% of one core and zfs-fuse maybe 11%.  I’m looking at an “Installing Updates” window that has had an empty progress bar for the past ten minutes.  It’s almost as though there were some sort of deadlock in the filesystem.  Grrr…

Never Partition Part of an Active RAID Array

21 02 2009

Repeat after me: I will never change the partition table on a disk that’s part of an active RAID array. I will never change the partition table on a disk that’s part of an active RAID array. I will never change the partition table on a disk that’s part of an active RAID array. I will never change the partition table on a disk that’s part of an active RAID array…

I should’ve known this, but naturally all these lovely device aggregation technologies such as LVM, RAID, and ZFS have to store their meta-information somewhere, and that ends up being in the little holes between partitions and at the beginning and end of the disk. Changing the partition table while a RAID array is known by the system to exist on a device is likely to earn you all kinds of pain, and apparently partitioners are likely to stomp on these little areas; I found this out the hard way.

So if your root directory is on RAID and you ever need to partition some free space on one of those disks, reboot the system from a rescue disk and do your work with all the RAID arrays stopped. You might even need to destroy the arrays (i.e. lose the array metadata, but not the data on the RAID partitions) and re-construct them after partitioning. And if you are going to re-partition and format the disks: explicitly destroy all the RAID, ZFS, and LVM structures first—that information can hang around and come back to bite you if you don’t tell the system that your pools and volume groups are no more.
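For concreteness, here is the kind of teardown I mean. This is only a sketch—the pool, volume group, and device names below are made-up placeholders, and every one of these commands is destructive:

```shell
# DESTRUCTIVE sketch: remove aggregation metadata before repartitioning.
# Names here (olympic, myvg, /dev/sdx*) are placeholders for illustration.
zpool destroy olympic               # ZFS: drop the pool and its labels
mdadm --stop /dev/md0               # mdRAID: stop the array...
mdadm --zero-superblock /dev/sdx1   # ...then erase the superblock on each member
vgremove myvg                       # LVM: remove the volume group...
pvremove /dev/sdx2                  # ...and the physical-volume label
```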

Lesson learned.

Stubbornly Persistent ZFS Pools (and what to do about them)

19 02 2009

The first time I set up ZFS-Fuse on Hydra, I misinterpreted the admonition to use “whole disk vdevs” to mean that I should create pools from the entire disk device:

# zpool create olympic raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd

But it turns out that when you do that on Solaris, ZFS will actually create a regular partition for you. Since I need partitions anyway, this is good news. But I found that after re-partitioning my disks, it still looked to ZFS as though the pools were there: the information was hiding in some unused areas of the disks. It wasn’t a big deal, but every time I started ZFS, it would complain that it couldn’t find my old pool, “olympic” (get it, olympic pool? oh, never mind).

Fortunately these posts by Ricardo Correia made it clear exactly what I needed to do, and because I had set up fully-redundant RAID, it was fairly easy, if a bit nerve-wracking, to zero those areas of the disks and re-sync them. My first real experience with RAID redundancy. Here’s how it went.

First, find a disk for which zdb -l will report some of the broken pools present:

# zdb -l /dev/sdf
failed to unpack label 1

Now take the disk’s partitions out of all md arrays in which they participate:

# mdadm /dev/md0 -f /dev/sdf1 -r /dev/sdf1
mdadm: set /dev/sdf1 faulty in /dev/md0
mdadm: hot removed /dev/sdf1
# mdadm /dev/md1 -f /dev/sdf6 -r /dev/sdf6
mdadm: set /dev/sdf6 faulty in /dev/md1
mdadm: hot remove failed for /dev/sdf6: Device or resource busy

Occasionally the remove will fail, as in the second example above; just repeat the command until it succeeds.

Note: in my case, there were no active ZFS pools that I wanted to keep on the system, only mdRAID arrays. If you have active ZFS pools, you’ll want to do something similar with those: take them offline before the dd and allow them to resilver afterward.

So, just to be safe, I’m going to clear the first and last 2MB of each disk. The first 2M is easy:

# dd if=/dev/zero of=/dev/sdf bs=1M count=2

To zero out the final 2M, use fdisk to discover the actual size of the disk:

# fdisk /dev/sdf

The number of cylinders for this disk is set to 60801.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
   (e.g., DOS FDISK, OS/2 FDISK)

Command (m for help): p

Disk /dev/sdf: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000000

   Device Boot      Start         End      Blocks   Id  System
/dev/sdf1   *           1           6       48163+  fd  Linux raid autodetect
/dev/sdf2               7         732     5831595    5  Extended
/dev/sdf5               7         124      947803+  82  Linux swap / Solaris
/dev/sdf6             125         732     4883728+  fd  Linux raid autodetect

Now calculate the number of 1M blocks into the disk at which we want to start zeroing, and issue another dd. The division rounds down, so zero three blocks instead of two to be sure of reaching the end of the disk.

# python -c 'print 500107862016/(1024*1024) - 2'
476938
# dd if=/dev/zero of=/dev/sdf bs=1M count=3 seek=476938
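If you were scripting this, the seek offset could be computed from the device size directly instead of reading it off the fdisk banner. A sketch, assuming blockdev(8) is available (and note this dd is just as destructive as the one above):

```shell
# Compute the zeroing offset from the device size instead of fdisk output
dev=/dev/sdf                              # example device
bytes=$(blockdev --getsize64 "$dev")      # 500107862016 on this disk
seek=$(( bytes / (1024 * 1024) - 2 ))     # rounds down, hence count=3 below
dd if=/dev/zero of="$dev" bs=1M seek="$seek" count=3
```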

Now we’ve clobbered the partition table, so copy it from an identically-partitioned disk:

# sfdisk -d /dev/sde | sfdisk /dev/sdf
Checking that no-one is using this disk right now ...

Disk /dev/sdf: 60801 cylinders, 255 heads, 63 sectors/track
Warning: extended partition does not start at a cylinder boundary.
DOS and Linux will interpret the contents differently.
Old situation:
Units = cylinders of 8225280 bytes, blocks of 1024 bytes, counting from 0

   Device Boot Start     End   #cyls    #blocks   Id  System
/dev/sdf1          0+    729     730-   5863693+   5  Extended
/dev/sdf2          0       -       0          0    0  Empty
/dev/sdf3          0       -       0          0    0  Empty
/dev/sdf4          0       -       0          0    0  Empty
/dev/sdf5          0+    121     122-    979902   82  Linux swap / Solaris
/dev/sdf6        122+    729     608-   4883728+  fd  Linux raid autodetect
New situation:
Units = sectors of 512 bytes, counting from 0

   Device Boot    Start       End   #sectors  Id  System
/dev/sdf1   *        63     96389      96327  fd  Linux raid autodetect
/dev/sdf2         96390  11759579   11663190   5  Extended
/dev/sdf3             0         -          0   0  Empty
/dev/sdf4             0         -          0   0  Empty
/dev/sdf5         96453   1992059    1895607  82  Linux swap / Solaris
/dev/sdf6       1992123  11759579    9767457  fd  Linux raid autodetect
Successfully wrote the new partition table

Re-reading the partition table ...

If you created or changed a DOS partition, /dev/foo7, say, then use dd(1)
to zero the first 512 bytes:  dd if=/dev/zero of=/dev/foo7 bs=512 count=1
(See fdisk(8).)

Add the partitions back, and wait for them to re-synchronize:

# mdadm -a /dev/md0 /dev/sdf1 && mdadm -a /dev/md1 /dev/sdf6
# watch cat /proc/mdstat

When synchronization is done, the only step that remains is to reinstall GRUB if the disk was bootable:

# grub
Probing devices to guess BIOS drives. This may take a long time.

       [ Minimal BASH-like line editing is supported.   For
         the   first   word,  TAB  lists  possible  command
         completions.  Anywhere else TAB lists the possible
         completions of a device/filename. ]
grub> device (hd1) /dev/sdf
device (hd1) /dev/sdf
grub> root (hd1,0)
root (hd1,0)
grub> setup (hd1)
setup (hd1)
 Checking if "/boot/grub/stage1" exists... no
 Checking if "/grub/stage1" exists... yes
 Checking if "/grub/stage2" exists... yes
 Checking if "/grub/xfs_stage1_5" exists... yes
 Running "embed /grub/xfs_stage1_5 (hd1)"...  19 sectors are embedded.
 Running "install /grub/stage1 (hd0) (hd1)1+19 p (hd1,0)/grub/stage2 /grub/menu.lst"... succeeded

Did I remove all traces of the pool? Yup:

# zdb -l /dev/sdf
failed to unpack label 0
failed to unpack label 1
failed to unpack label 2
failed to unpack label 3

Just to be sure, I rebooted and selected /dev/sdf as my BIOS boot device. It worked! Lather, rinse, repeat (because this is going to take a while).

Update: After handling the three disks that contribute to more than one MD device, I wrote a little script to handle the rest of them. Voilà:

#!/bin/sh
set -e
# set -x
if ! mdadm --detail -t /dev/md1 > /dev/null; then
    echo waiting for healthy array
    until mdadm --detail -t /dev/md1 > /dev/null ; do echo -n . ; sleep 1 ; done
fi

for x in /dev/sd[cdei]; do
    echo '**' taking $x offline:
    mdadm /dev/md1 -f ${x}6
    sleep 1
    mdadm /dev/md1 -r ${x}6
    echo clearing beginning
    dd if=/dev/zero of=$x bs=1M count=2 > /dev/null 2>&1
    echo clearing end
    dd if=/dev/zero of=$x bs=1M count=2 seek=476938 > /dev/null 2>&1
    echo checking for cleared pools
    if zdb -l $x | grep -q olympic; then
        echo pool still present on $x ; exit 1
    fi
    echo no pool remains on $x.  copying partition table
    sfdisk -d /dev/sdb | sfdisk $x > /dev/null 2>&1
    echo bringing disk back online
    mdadm /dev/md1 -a ${x}6
    echo waiting for resilver to complete
    until mdadm --detail -t /dev/md1 > /dev/null ; do echo -n . ; sleep 1 ; done
    echo resilvering complete.  Starting next disk in 5 seconds
    sleep 5
done

GRUB2: Still NDG

18 02 2009

Today (yesterday?) I foolishly decided to try again to get /boot on RAID6 working with GRUB2 from svn, after being told by one of its devs that my distro’s version was way out of date. That didn’t go so well. I guess it’s still not ready for prime time. I did, however, manage to port most of Ubuntu’s Debian packaging over to the new version of the source, so if anyone wants to pick up where I left off, just ask for it.

Oh, did I mention that I hosed my OS installation and needed to start over? Well, I hosed my OS installation and needed to start over. What a pain.


16 02 2009

My server finally has a backup system.  Well, we’ll have to see whether the backup cron job fires off tonight, but aside from that, it seems to be working.  The code is available in our GitHub repo.  Now I’m off to document that setup.

50 Ways to Lose Your Data

16 02 2009

So I’m rebuilding my server…

I figured I’d better get backup going early.  So I went to look at my GitHub project, and it was clearly incomplete (Python syntax error).  Yet I *had* done a complete, successful backup already, so there must’ve been a good copy of the script on Hydra before the rebuild.

At the beginning of this, I archived a bunch of stuff from that server in a tarfile on an external disk.

To be sure the tar worked, I had re-run the same tar command piped into md5 and then checked that against the md5sum of the .tar file.

The first time around, it failed.  I presumed that was due to some bitrot on the target disk, so I did the whole procedure again.

This time the md5s matched, and I stored the checksum in a file alongside the tar file.

But somehow, now tar can’t read the .tar file!

tar: This does not look like a tar archive
tar: Skipping to next header
tar: Archive base-256 value is out of time_t range
tar: güg: implausibly old time stamp 1969-12-31 18:59:59
tar: Skipping to next header
tar: Error exit delayed from previous errors

And, yes, the md5s still match! I probably should have tried to read the tar file the moment I created it. Bah.
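For what it’s worth, the cheap sanity check is a table-of-contents read, which actually parses the archive structure instead of just checksumming the bytes. A sketch (the path here is hypothetical):

```shell
# Verify an archive is actually readable, not merely byte-identical
tar -tf /mnt/external/backup.tar > /dev/null && echo "archive is readable"
```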

So I figured I might grab the backed-up script from the Amazon bucket that stores our backups.  But that backup was apparently encrypted with a public key that I can’t identify, and don’t remember the passphrase for!  Probably that was a key I generated especially for root’s backup process on the server, and the only record is probably locked in that tarfile.  Even if I could read that, I still wouldn’t know the passphrase :(.  How do people keep track of all the unique and frequently-changing—yeah, right!—passwords they’re supposed to have anyway?

So I’m giving up on all that data in the tar file for now (what else can I do?) and I’m back to trying to reconstruct a working backup script.

Get a new plan, Stan!