ZFS@Home

BillytheImpaler · 26 Jan 2009 at 22:02

Following yashiro's thread from late last week I've been playing with ZFS. I'm considering building a new dedicated server box to throw in the basement and store everything so that I might retire the old server and use it as another MythTV frontend. For around the USD equivalent £380 I can have a low-wattage dual core 64-bit rig with 4 GiB RAM and three 1 TB WD Green disks in a case and a power supply with room to grow. Having room to grow is important.

Back to ZFS...
I dug out 3 old flash drives and plugged them into a laptop running Nexenta, a bizarre hybrid of Ubuntu/Debian userspace and an OpenSolaris kernel and backend. I emulated the demo videos where the hip German man puts a video file into a raidz, and removes some of the disks just to observe that the video keeps right on playing. Very cool. I exported the pool, swapped all the disk positions, and imported it again to observe that the computer could quickly and easily figure out which bits went where. Also very cool.

A problem with home media servers is that nobody really wants to spend the necessary money to do proper backups. A few TB of disks wasted just on mirroring is too much to bear for the average Joe so most folks including myself make only feeble attempts to perform backups. That's fine for recorded TV shows and junk like that, but I really don't want to lose my pictures and home movies. With raidz it seems like I could have good free insurance against data loss so long as the house doesn't burn down or get flooded or whatever; losing one disk at a time is safe. As another plus I get increased bandwidth as it'll be maxing out each disk instead of just one.

So the basics of the process would be like so, if I'm correct:

Code:

rmformat

This'll tell me where the 3 disks are. I'll copy down the gibberish Solaris-y addresses.

Code:

zpool create -f mediablob raidz c2t0d0p0 c3t0d0p0 c4t0d0p0

This will create the pool and mount it at /mediablob. I can move the mount point elsewhere or even create other automounted filesystems within it that have separate mount points. For instance I can have one set of directories at /media/TV, another at /media/Photos, another at /home/bti, and another at /root and all will share the total storage space contained in the pool unless I set quotas. That's really cool to me!

So, ZFS people, am I right in saying these things? I've gone through some of the documentation but I'm a bit too thick to understand it all at this point.

If I wanted to add another disk, let's say a 1.5 TB one a year from now, how do I do that?

If after that I discover that I need to remove a disk after that, how do I separate it from the pool, assuming that there's enough freespace on the other disks to handle the lost storage?

I'm trying to learn how to use this in a basic fashion since it's obvious it has tremendous power and flexibility.

yashiro · 26 Jan 2009 at 22:18

If I wanted to add another disk, let's say a 1.5 TB one a year from now, how do I do that?

If you used non raid ZFS. You just plug it in.
With raid, at the moment, it can't be done directly. Adding a new drive directly to the pool that is.
There are 'ways' around it. But as you've seen with ZFS configuring it looks a doddle but it's a bit annoying.

Not being able to expand the first raid pool is why I've not gone ahead with mine for the moment.
That and the prospect of spending hours dicking with the zfs command line with very little clue.

Looks like were both having the same geek thoughts though.

Someone else's tale: http://breden.org.uk/2008/09/01/home-fileserver-raidz-expansion/

The way to achieve RAIDZ vdev expansion is to destroy the pool and re-create it, using the additional drive(s).
Of course, this means you need to back up all the data first, so you’ll need to have an additional storage location available, which has large enough capacity.

Hoho. No. The blog is good though.

BillytheImpaler · 26 Jan 2009 at 22:32

From the Wikipedias:

Capacity expansion is normally achieved by adding groups of disks as a vdev (stripe, RAID-Z, RAID-Z2, or mirrored). Newly written data will dynamically start to use all available vdevs. It is also possible to expand the array by iteratively swapping each drive in the array with a bigger drive and waiting for ZFS to heal itself — the heal time will depend on amount of stored information, not the disk size. The new free space will not be available until all the disks have been swapped. If a snapshot is taken during this process, it will cause the heal to be restarted.

It is currently not possible to reduce the number of vdevs in a pool nor otherwise reduce pool capacity. However, it is currently being worked on by the ZFS team. Still not available as of Solaris 10 05/08 (AKA update 5).

It is not possible to add a disk to a RAID-Z or RAID-Z2 vdev. This feature appears very difficult to implement. You can however create a new RAID-Z vdev and add it to the zpool.

So what's a vdev? If I add another disk and make it its own vdev and add it to the pool it would not have the benefits of redundancy, right? Continuing on that logic, if I added another 2 disks they would constitute a vdev which would contain redundancy data for each other, but not for any of the existing vdevs in the pool.

yashiro · 26 Jan 2009 at 23:22

I assume it's tech speak for virtualdevice? It's essentially a volume in the pool.

Yeah

ZFS filesystems are built on top of virtual storage pools called zpools. A zpool is constructed of virtual devices (vdevs), which are themselves constructed of block devices: files, hard drive partitions, or entire drives, with the last being the recommended usage

Good link : http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide

Put simply you cannot increase the size of a RAIDZ pool after it's created.

* You can remove a physical HD and replace it with a larger one.
* You can back it all up (lol) and recreate it.
* You can make another pool from the new disks you add.

yashiro · 26 Jan 2009 at 23:30

Besides standard storage, devices can be designated as volatile read cache (ARC)

Something I was reading about yesterday on the a ZFS Dev blog was them making revisions to this cache system. Essentially you assign an SSD as an ARC and increase efficiency greatly, while still having a back end of normal mechanical disks. I cannot see this coming to the Linux kernel any time soon.

It's about the only 'good idea' Sun has had for a while. They probably need to milk it for all it's worth while re-selling the commodity hardware it runs on for ridiculous prices.

BillytheImpaler · 26 Jan 2009 at 23:50

So the hierarchy is something like this:
clicky warm made-of-metal disk -> vdev -> pool

So I can do
zpool create -f mediablob raidz c2t0d0p0 c3t0d0p0 c4t0d0p0 raidz c5t0d0p0 c6t0d0p0 c7t0d0p0 raidz c8t0d0p0 c9t0d0p0 c10t0d0p0 raidz c11t0d0p0 c12t0d0p0 c13t0d0p0
and each group following a raidz would constitute a vdev, all of which are lumped together into a pool. Each vdev contains parity data for other disks within its own vdev, but no others. If c2t0d0p0, c3t0d0p0, and c4t0d0p0 failed simultaneously I would experience data loss, right?

* You can remove a physical HD and replace it with a larger one.

This is what the blog post to which you linked and the Wikipedia article were talking about when it said "It is also possible to expand the array by iteratively swapping each drive in the array with a bigger drive and waiting for ZFS to heal itself." Going back to my previous example I could remove one of the three 1 TB disks and replace it with the 1.5 TB disk. That seems to more limit the number of devices in the pool rather than the devices themselves, even though it's only a coincidence that it works at all.

Imagine that you had the previous three disk setup. In addition to that when you were setting up the array you also included four 256 MiB flash drives that were to stay connected for the life of the machine. Could you then, when you wanted to add capacity, remove a flash drive, add the 1.5 TB disk, and tell zfs that the hardware address changed from, let's say, c8t0d0p0 to c9t0d0p0? It would then heal the array by copying data onto the new, much larger, disk blissfully unaware that you're fooling it. Your total storage, ignorant of data lost to redundancy and lying HDD manufacturers, would go from 3 TB to 4.5 TB (3 TB + .75 GB - .25 GB + 1.5 TB). You wouldn't need to worry about killing the flash media with write cycling because the FS is fault tolerant and they're easily replaced.

I suppose that all relies on being able to tell it that your device moved from one hardware address to another. I've been experimenting with putting one of my flash drives into another USB port but haven't discovered a command that'll tell it that the disk has moved.

yashiro · 27 Jan 2009 at 00:51

* Replacing an existing vdev with a larger vdev. For example:
# zpool replace tank c0t2d0 c2t2d0

The flash idea could work yeah. You're basically forcing a type of thin provisioning by declaring the total number of block devices in the raid. It all gets rather confusing. But, well, that's Sun all over. They do the heavy lifting but fail on the details.

They could totally OWN the home server market.
Turning out their own little servers with basically Drobo functionality.
All it really would take is compiling a few Linux apps and creating a basic gnome panel gui like say Ubuntu Net Remix.
It would put Sun (who?) back in the minds of the consumer and make a few quid.

But they won't. Apple will probably beat them to it, or they'll miss the boat and leave people to DIY it.

BillytheImpaler · 27 Jan 2009 at 00:59

Meh, I dunno. They syntax is fairly straight foreward in this instance. there are certainly more grievous examples though.

I'm in Oklahoma with work so I don't have access to my usual stash of computer goodies so I'm having to make space to free up a 40 GB USB micro HDD. When that's all backed up and ready to go I'll remove the current pool, make a new one, and then try the replace command. If for some reason it doesn't work I don't want to be uncertain if that was a result of it not being capable of that functionality, or my previous meddling.

Thanks, BTW. You've been very helpful so far.

BillytheImpaler · 27 Jan 2009 at 01:30

Another question, sorry.

Why would I want more than one vdev? It seems like if I lumped everything together into one big one, regardless of the number of volumes, I'd have a better change of resisting data loss through volume failure.

yashiro · 27 Jan 2009 at 02:00

Solaris Disk Labels.

An 8 character string typically represents the fully name of a slice (c# t# d# s#).

1) Controller - Identifies host bus adapter which controls communication between the system and disk unit.
2) Targernumber - Corresponding to a unique hardware address that is assigned to each disk ,tape or cdrom.
3) Disk number - Also know as logical unit number (LUN) this number reflects the number of disks as the target number.
4) Slice number: a slice number range from 0-9 (x86)

Code:

Slice 	File System 	Client/Server 	Description
0 	root 	Both 	Holds files and directories that make up the operating system.
1 	swap 	Both 	Provides virtual memory or swap space.
2 	-- 	Both 	By convention, refers to the entire disk. The entire disk is defined automatically by the format command and the Solaris installation programs. Do not change the size of this slice.
3 	/export 	Server 	Holds alternative versions of the operating system that are required by client systems whose architecture differs from that of the server. Clients with the same architecture type as the server obtain executables from the /usr file system, usually slice 6.
4 	/export/swap 	Server 	Provides virtual memory/swap space for client systems.
5 	/opt 	Both 	Holds application software added to a system. If a slice is not allocated for this file system during installation, the /opt directory is put in slice 0.
6 	/usr 	Both 	Holds operating system commands—also known as executables—designed to be run by users. This slice also holds documentation, system programs such as init and syslogd, and library routines.
7 	/home or /export/home 	Both 	Holds files created by user accounts.
8 	-- 	Both 	Contains the boot slice information at the beginning of the Solaris partition that enables Solaris to boot from the hard disk.
9 	-- 	Both 	Provides an area reserved for alternate disk blocks. Slice 9 is known as the alternate sector slice.

Come back Linux, all is forgiven. 8)

BillytheImpaler · 27 Jan 2009 at 02:55

Rats, it easily allowed me to remove one flash drive and replace it with the USB HDD, but the size of the pool did not grow when it was resilvered. The pool is intact and no data was lost, of course, but it's still little.

When creating the pool I used the -f to force it to work on different sized disks. Might there be a way to get it to resilver using all of the new, differently-sized disk?

Sorry if I'm being thick, but the above about Solaris's gibberish disk labels is in reference to what that I said? :embarrassed:

yashiro · 27 Jan 2009 at 03:23

Nah I just thought I'd post it for anyone reading this.
Mainly as:

zpool create -f mediablob raidz c2t0d0p0 c3t0d0p0 c4t0d0p0 raidz c5t0d0p0 c6t0d0p0 c7t0d0p0 raidz c8t0d0p0 c9t0d0p0 c10t0d0p0 raidz c11t0d0p0 c12t0d0p0 c13t0d0p0

Looks awesome.

Not sure about the resilvering issue. I've given up on testing it for the moment. The blog I linked earlier might have some details about it.

BillytheImpaler · 27 Jan 2009 at 03:28

I'll consider it a project for tomorrow. Nap time.

yashiro · 27 Jan 2009 at 03:42

http://www.solarisinternals.com/wiki/index.php/ZFS_Configuration_Guide

rick827 · 27 Jan 2009 at 08:56

BillytheImpaler said:
Rats, it easily allowed me to remove one flash drive and replace it with the USB HDD, but the size of the pool did not grow when it was resilvered. The pool is intact and no data was lost, of course, but it's still little.

When creating the pool I used the -f to force it to work on different sized disks. Might there be a way to get it to resilver using all of the new, differently-sized disk?

With raidZ the pool will be equal to the size of the smallest disk * (n disks-1)
So if you have a 4x 256mb flash drives, replace 1 with a 1.5TB disk, it will only use 256mb of it. This becomes clear when you think about it as how can you store redundency parity for 1.5TB of data on 3x 256mb

Your pools speed will also be limited to that of the 256mb drives.

However, you can replace each of the 4 flash drives with 1.5TB disks one at a time, once the final flash is replaced your volume will expand to the new minimum size of 1.5TB*(n-1) and performance will increase to that of the slowest drive

yashiro · 27 Jan 2009 at 09:25

Ah. yeah.

Looks like just going with 3 huge drives (+ seperate boot) is the best home solution to start with.

rick827 · 27 Jan 2009 at 09:57

If you have the space and dont mind the noise then the more drives the better, more spindles means more performance and the storage lost to parity becomes a smaller %

For instance:
7x 500gb drives gives 3TB writeable storage with the performance of 7 disks for about £350
3x 1.5TB disks also gives 3TB writeable storage for £330 but only the throughput of 3 disks

Obviously you dont get improved seek time but thats not important for filestorage of tv/film etc

yashiro · 27 Jan 2009 at 10:10

Indeed, if you are OK with the noise and power requirements.

rsatd · 27 Jan 2009 at 10:49

BillytheImpaler said:
For around the USD equivalent £380 I can have a low-wattage dual core 64-bit rig with 4 GiB RAM and three 1 TB WD Green disks in a case and a power supply with room to grow. Having room to grow is important.

Just to say, a streaming server does not need that power. It would be wasted, if I read your plan correctly, but if those specs are for the front end, then yeah looks nice, should easily play HD content. Which is the problem with my current setup, its not powerful enough for HD.

But seeing I got nothing to display 720p on, I am not too bothered

BillytheImpaler · 27 Jan 2009 at 17:18

Oh yeah, I'm well aware that it's overkill. My current fileserver/Myth backend that is more than up to the task is an 800 MHz PIII. Spec'ing out the rig made me realize that for about $100 more I can have a relatively high-spec machine with all the modern goodies. For instance, the difference between 1 GiB of DDR2 and 4 GiB is $7. The difference between a single core 45W CPU and a dual core 45W CPU 200 MHz faster is $4. I'll pay those.

The biggest part of that cost by far is the disks.

BTW, the WD Green series of disks is supposed to be amazing for this. They only dissipate about 3 watts. You pay for it in transfer speed, but I can live with that seeing as there will never be more than about 4 users streaming from this server anyway.

Thanks for that, rick827. I get it now. I suppose I shouldn't have thought that I could be more clever than Sun's engineers.

I'm still struggling with the concept of vdevs as subunits in a raidz. If I had 12 disks why would I want to use vdevs of, say, 3 disks each instead of one giant vdev. Are there performance, security, or capacity benefits to either situation? Is it related to the hardware controller that the disks are plugged into, i.e. four 3-port SATA cards means you should use 4 3-disk vdevs?

Oh, last thing, what's the latest on btrfs? Is it anywhere near completion? I see it got merged into the mainline kernel in 2.6.29 but that's just experimental.

ZFS@Home

Man of Honour

Man of Honour

Man of Honour

Man of Honour

Man of Honour

Man of Honour

Man of Honour

Man of Honour