ZFS@Home

Man of Honour | Soldato | Joined: 2 Aug 2005 | Posts: 8,721 | Location: Cleveland, Ohio, USA
Following yashiro's thread from late last week I've been playing with ZFS. I'm considering building a new dedicated server box to throw in the basement and store everything, so that I might retire the old server and use it as another MythTV frontend. For around the USD equivalent of £380 I can have a low-wattage, dual-core, 64-bit rig with 4 GiB of RAM and three 1 TB WD Green disks, in a case and power supply with room to grow. Having room to grow is important.

Back to ZFS...
I dug out 3 old flash drives and plugged them into a laptop running Nexenta, a bizarre hybrid of Ubuntu/Debian userspace and an OpenSolaris kernel and backend. I emulated the demo videos where the hip German man puts a video file into a raidz, and removes some of the disks just to observe that the video keeps right on playing. Very cool. I exported the pool, swapped all the disk positions, and imported it again to observe that the computer could quickly and easily figure out which bits went where. Also very cool.
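If I'm remembering the commands right, that export/swap/import dance was nothing more than this (testpool standing in for whatever I happened to call the flash-drive pool):
Code:
zpool export testpool
# ...physically shuffle the drives around...
zpool import testpool
ZFS reads the labels it keeps on each device at import time, which is presumably how it works out for itself which disk belongs where.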

A problem with home media servers is that nobody really wants to spend the necessary money to do proper backups. A few TB of disks wasted just on mirroring is too much to bear for the average Joe, so most folks, including myself, make only feeble attempts at backups. That's fine for recorded TV shows and junk like that, but I really don't want to lose my pictures and home movies. With raidz it seems like I could have good, free insurance against data loss so long as the house doesn't burn down or get flooded or whatever; losing one disk at a time is safe. As another plus, I get increased bandwidth, since it'll be maxing out each disk instead of just one.

So the basics of the process would be like so, if I'm correct:
Code:
rmformat
This'll tell me where the 3 disks are. I'll copy down the gibberish Solaris-y addresses.
Code:
zpool create -f mediablob raidz c2t0d0p0 c3t0d0p0 c4t0d0p0
This will create the pool and mount it at /mediablob. I can move the mount point elsewhere or even create other automounted filesystems within it that have separate mount points. For instance I can have one set of directories at /media/TV, another at /media/Photos, another at /home/bti, and another at /root and all will share the total storage space contained in the pool unless I set quotas. That's really cool to me!
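If I'm reading the docs right, carving out those separate filesystems would be something along these lines (the names and the quota size are just examples I've made up):
Code:
zfs create mediablob/tv
zfs set mountpoint=/media/TV mediablob/tv
zfs create mediablob/photos
zfs set mountpoint=/media/Photos mediablob/photos
zfs set quota=200G mediablob/photos
Each filesystem draws from the same pool of space but gets its own mount point, and a quota caps just that one filesystem.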

So, ZFS people, am I right in saying these things? I've gone through some of the documentation but I'm a bit too thick to understand it all at this point.

If I wanted to add another disk, let's say a 1.5 TB one a year from now, how do I do that?

If I later discover that I need to remove a disk, how do I separate it from the pool, assuming there's enough free space on the other disks to absorb the lost storage?

I'm trying to learn how to use this in a basic fashion since it's obvious it has tremendous power and flexibility. :)
 
From the Wikipedias:
  • Capacity expansion is normally achieved by adding groups of disks as a vdev (stripe, RAID-Z, RAID-Z2, or mirrored). Newly written data will dynamically start to use all available vdevs. It is also possible to expand the array by iteratively swapping each drive in the array with a bigger drive and waiting for ZFS to heal itself — the heal time will depend on amount of stored information, not the disk size. The new free space will not be available until all the disks have been swapped. If a snapshot is taken during this process, it will cause the heal to be restarted.
  • It is currently not possible to reduce the number of vdevs in a pool nor otherwise reduce pool capacity. However, it is currently being worked on by the ZFS team. Still not available as of Solaris 10 05/08 (AKA update 5).
  • It is not possible to add a disk to a RAID-Z or RAID-Z2 vdev. This feature appears very difficult to implement. You can however create a new RAID-Z vdev and add it to the zpool.
So what's a vdev? If I add another disk and make it its own vdev and add it to the pool it would not have the benefits of redundancy, right? Continuing on that logic, if I added another 2 disks they would constitute a vdev which would contain redundancy data for each other, but not for any of the existing vdevs in the pool.
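If that's right, then tacking a lone disk onto the pool would be something like this (device name made up), and anything written to it would have no parity anywhere:
Code:
# one bare disk becomes its own top-level vdev -- no redundancy for data that lands on it
# (-f needed because its redundancy level doesn't match the existing raidz vdev)
zpool add -f mediablob c5t0d0p0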
 
So the hierarchy is something like this:
clicky warm made-of-metal disk -> vdev -> pool

So I can do
Code:
zpool create -f mediablob raidz c2t0d0p0 c3t0d0p0 c4t0d0p0 raidz c5t0d0p0 c6t0d0p0 c7t0d0p0 raidz c8t0d0p0 c9t0d0p0 c10t0d0p0 raidz c11t0d0p0 c12t0d0p0 c13t0d0p0
and each group following a raidz keyword would constitute a vdev, all of which are lumped together into one pool. Each vdev contains parity data for the disks within its own vdev, but not for any others. If c2t0d0p0, c3t0d0p0, and c4t0d0p0 failed simultaneously I would experience data loss, right?
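I assume I could sanity-check that layout with zpool status, which (if I've understood the docs) lists each raidz group as its own vdev with its member disks indented underneath:
Code:
zpool status mediablob
# roughly:
#   mediablob
#     raidz1
#       c2t0d0p0
#       c3t0d0p0
#       c4t0d0p0
#     raidz1
#       c5t0d0p0
#       ...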
* You can remove a physical HD and replace it with a larger one.
This is what the blog post you linked to and the Wikipedia article were getting at when they said "It is also possible to expand the array by iteratively swapping each drive in the array with a bigger drive and waiting for ZFS to heal itself." Going back to my previous example, I could remove one of the three 1 TB disks and replace it with the 1.5 TB disk. That approach seems to fix the number of devices in the pool rather than the devices themselves, even though it's only a coincidence that it works at all.

Imagine that you had the previous three-disk setup. In addition, when you were setting up the array you also included four 256 MiB flash drives that were to stay connected for the life of the machine. Could you then, when you wanted to add capacity, remove a flash drive, add the 1.5 TB disk, and tell ZFS that the hardware address changed from, let's say, c8t0d0p0 to c9t0d0p0? It would then heal the array by copying data onto the new, much larger disk, blissfully unaware that you're fooling it. Your total storage, ignoring space lost to redundancy and lying HDD manufacturers, would go from roughly 3 TB to roughly 4.5 TB (3 TB of disk + 0.75 GB of remaining flash + the new 1.5 TB). You wouldn't need to worry about killing the flash media with write cycling because the FS is fault tolerant and they're easily replaced.

I suppose that all relies on being able to tell it that your device moved from one hardware address to another. I've been experimenting with putting one of my flash drives into another USB port but haven't discovered a command that'll tell it that the disk has moved.
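Digging through the man page, I think zpool replace might be the command I'm after, something along these lines (addresses made up), or failing that an export/import cycle to make it re-scan the devices:
Code:
# tell the pool the old device has been superseded by the new one, then let it resilver
zpool replace mediablob c8t0d0p0 c9t0d0p0

# or, for a plain "the drive moved to another port" case, detach and reattach the whole pool
zpool export mediablob
zpool import mediablob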
 
Meh, I dunno. The syntax is fairly straightforward in this instance. There are certainly more grievous examples though.

I'm in Oklahoma for work, so I don't have access to my usual stash of computer goodies, and I'm having to make space to free up a 40 GB USB micro HDD. When that's all backed up and ready to go I'll remove the current pool, make a new one, and then try the replace command. If for some reason it doesn't work, I don't want to be left wondering whether that's because ZFS isn't capable of it or because of my previous meddling.

Thanks, BTW. You've been very helpful so far. :)
 
Rats, it easily allowed me to remove one flash drive and replace it with the USB HDD, but the size of the pool did not grow when it resilvered. The pool is intact and no data was lost, of course, but it's still small.

When creating the pool I used -f to force it to accept different-sized disks. Might there be a way to get it to resilver using all of the new, larger disk?
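For reference, this is how I've been checking whether anything grew (mediablob again standing in for whatever the test pool is called); my guess at this point is that a raidz vdev stays pegged to the size of its smallest member until every device in it has been swapped out, but I'd love to be told otherwise:
Code:
zpool list mediablob      # total size and free space of the pool
zpool status mediablob    # layout and resilver progress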

Sorry if I'm being thick, but which part of what I said is the bit above about Solaris's gibberish disk labels referring to? :embarrassed:
 
Oh yeah, I'm well aware that it's overkill. My current fileserver/Myth backend, which is more than up to the task, is an 800 MHz PIII. Spec'ing out the rig made me realize that for about $100 more I can have a relatively high-spec machine with all the modern goodies. For instance, the difference between 1 GiB of DDR2 and 4 GiB is $7. The difference between a single-core 45 W CPU and a dual-core 45 W CPU that's 200 MHz faster is $4. I'll pay those. :D The biggest part of that cost by far is the disks.

BTW, the WD Green series of disks is supposed to be amazing for this. They only dissipate about 3 watts. You pay for it in transfer speed, but I can live with that seeing as there will never be more than about 4 users streaming from this server anyway.

Thanks for that, rick827. I get it now. I suppose I shouldn't have thought that I could be more clever than Sun's engineers. :p

I'm still struggling with the concept of vdevs as subunits in a raidz pool. If I had 12 disks, why would I want to use vdevs of, say, 3 disks each instead of one giant vdev? Are there performance, security, or capacity benefits to either arrangement? Is it related to the hardware controllers the disks are plugged into, i.e. do four 3-port SATA cards mean you should use four 3-disk vdevs?

Oh, last thing, what's the latest on btrfs? Is it anywhere near completion? I see it got merged into the mainline kernel in 2.6.29 but that's just experimental.
 
My frontend is actually a 35W "Energy Efficient" dual-core AMD. It's diskless, so other than one whisper-quiet undervolted 120 mm Yate Loon it's silent.

It makes a difference in the power bill every month and will quickly pay for itself compared to 65 or 95 watt CPUs. As for deciding between single- and dual-core procs at the same TDP: having another core to fall back on means that, for the cost of a sandwich at a restaurant, I get a rig much more capable of media encoding, and it'd be more future-proof anyway if I decide to re-purpose it in a pinch as a desktop or frontend.
 
If I'm following correctly I think you may have muddled the terminology 'vdev'. A vdev is the unit you add to a ZFS array; these are usually whole disks (or can be partitions or even reserved space in a file). The collection of vdevs creates a pool (zpool) of disks, in our case set up as a raidz pool. We then create volumes on the pool with the ZFS filesystem.
So in the aforementioned example:
Code:
zpool create -f mediablob raidz c2t0d0p0 c3t0d0p0 c4t0d0p0 raidz c5t0d0p0 c6t0d0p0 c7t0d0p0 raidz c8t0d0p0 c9t0d0p0 c10t0d0p0 raidz c11t0d0p0 c12t0d0p0 c13t0d0p0
The group
Code:
raidz c2t0d0p0 c3t0d0p0 c4t0d0p0
constitutes a vdev, right? Three physical devices are combined to form one virtual device.

How is redundancy treated for different vdevs within the same pool?

Might this be a way to add capacity to a machine without having to copy all the data in a pool to a separate machine?

For example, I might have a pool consisting of three 1 TB disks formed with the command
Code:
zpool create -f mediablob raidz c2t0d0p0 c3t0d0p0 c4t0d0p0
Let's say down the road I want to add three 1.5 TB disks to the mix. As above I can't expand the current vdev (assuming the question I asked first in this post deserves an affirmative answer), but I can add more vdevs to the same pool. I could add, let's say, c5t0d0p0 c6t0d0p0 c7t0d0p0, my new disks, to the pool mediablob and it would handle that.
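Something like this, I'd imagine:
Code:
# bolt a second raidz vdev, made of the three new disks, onto the existing pool
zpool add mediablob raidz c5t0d0p0 c6t0d0p0 c7t0d0p0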

The first, preexisting vdev would continue to do its thing, holding parity data for the data stored on it. The second vdev I just added to the pool would be allocated data as the filesystem deems necessary, and it would hold all the parity data for information in its group. To see the benefits of ZFS's RAID 5-like structure I would always have to expand my storage by at least 3 disks at a time, because a vdev of just 2 devices produces no additional performance and has an overhead cost associated with it.

Or not. :p Please tell me if^H^Hwhere I'm wrong.
 
Wow, thank you so much for that pair of fantastic posts, rick827. From reading those and the documentation I think this project is back on track. I'm A-Ok with adding storage in groups of 3. I'm ditching Nexenta right now in favor of SXCE to see if I like it better. When I'm all done I'll have a play in similar fashion.

My finger's hovering over the "Buy" button. I can't hold it much longer! :p
 
I reloaded the CD and am trying a complete installation rather than the Core one I used above. I also told it to use a few other English character sets; hopefully one will work with my keyboard with no other futzing.

I'm definitely using OpenSolaris when I deploy this for real, but since I'm this far already I'll have a go with SXCE for my testing. Amusingly, I've downloaded and burnt OpenSolaris before this full install is even 50% done. :p
 