Solaris ZFS File Server
- Introduction
- Storage Requirements
- Solutions
- ZFS It Is!
- Bonus Features
- Hardware and Operating System
- ZFS Configuration
- ZFS in Action
- Migrating from FreeBSD to Solaris
- Pictures
- References
Introduction
My existing file server had reached capacity and I needed something bigger and newer.
The server was running FreeBSD with a 3ware RAID card. It had three 200GB drives set up in a RAID-5 configuration, which yielded 400GB of usable storage. The array was at capacity, and because the drives were IDE, finding a replacement if one of them failed was getting more and more difficult. The whole thing needed to go.
Storage Requirements
These are the requirements I had (in order of importance) for my new storage server.
- Data integrity
- Data redundancy
- Expandability
- Performance
- Notification for failed/failing drives
- Useful management/configuration tools
The existing server fell down on a couple of these points.
- Data integrity: It wasn't able to determine if data on the disk had become corrupted (e.g., corrupted while being written or corrupted while being read back).
- Expandability: The 3x 200GB RAID-5 array was not expandable.
- Performance: First off, the drives were IDE/ATA100. Secondly, it was RAID-5 which has its own performance issues.
I wanted to make sure that I addressed each of these issues with the new setup.
Solutions
Around this time I was reading a lot about Sun's Solaris OS and it didn't take long to run across their Zettabyte File System (ZFS). In Sun's own words:
ZFS is a new kind of file system that provides simple administration,
transactional semantics, end-to-end data integrity, and immense
scalability. ZFS is not an incremental improvement to existing
technology; it is a fundamentally new approach to data management.
We've blown away 20 years of obsolete assumptions, eliminated complexity
at the source, and created a storage system that's actually a pleasure
to use.
http://www.opensolaris.org/os/community/zfs/whatis/
Up until this point I didn't have a good first-choice solution. I was contemplating another RAID card (this time SATA) but that still wouldn't address all of the requirements I had and would probably put me back in the same predicament 2-3 years from now. As I read more about ZFS I became more interested and realized it was a very good fit for my needs.
ZFS It Is!
Here's how ZFS addresses each of my requirements.
Data Integrity
ZFS uses block-level checksums to catch and repair data corruption as data is read from the disk. When the file system is configured for mirroring or RAID, the data is automatically repaired using one of the alternate copies of the data. ZFS also uses copy-on-write for all write transactions. This ensures that the on-disk state is always valid (no need for fsck). You can also force a "scrub" of the storage pool which causes ZFS to traverse the entire pool and verify each block against its checksum and then repair it if necessary.
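The checksum-and-repair idea can be illustrated with a toy model. This is only a sketch of the concept, not how ZFS is implemented: the `ToyMirror` class, its in-memory "disks" and the use of SHA-256 are all assumptions made for the example.

```python
import hashlib


def checksum(data: bytes) -> str:
    # SHA-256 stands in for whichever checksum a real pool would use.
    return hashlib.sha256(data).hexdigest()


class ToyMirror:
    """Two in-memory 'disks' holding a copy of each block, plus a stored checksum."""

    def __init__(self):
        self.disks = [{}, {}]   # one dict per disk: block id -> bytes
        self.checksums = {}     # block id -> expected checksum

    def write(self, block_id, data: bytes):
        for disk in self.disks:
            disk[block_id] = data
        self.checksums[block_id] = checksum(data)

    def read(self, block_id) -> bytes:
        expected = self.checksums[block_id]
        for disk in self.disks:
            data = disk[block_id]
            if checksum(data) == expected:
                # Found a good copy: repair any copy that no longer matches.
                for other in self.disks:
                    if checksum(other[block_id]) != expected:
                        other[block_id] = data
                return data
        raise IOError(f"block {block_id}: no copy matches its checksum")

    def scrub(self):
        # Read every block so that damaged copies are found and repaired.
        for block_id in self.checksums:
            self.read(block_id)


mirror = ToyMirror()
mirror.write(0, b"important data")
mirror.disks[0][0] = b"important dat\x00"   # simulate silent corruption on disk 0
print(mirror.read(0))                        # the good copy from disk 1 is returned
print(mirror.disks[0][0] == mirror.disks[1][0])  # disk 0 was repaired during the read
```

The point of the model is that the checksum is stored apart from the data it describes, so a damaged copy cannot vouch for itself, and a redundant copy is needed before anything can be repaired.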
Data redundancy
ZFS isn't just a file system; it's also a volume manager. This means you can configure stripes, mirrors and RAID sets using ZFS. Using mirrored sets or RAID-Z/RAID-Z2 (ZFS's own implementation of RAID-5 and RAID-6, respectively) provides data redundancy and protection against disk failures. ZFS can also store up to 3 copies of user data on the same physical disk. ZFS does its best to physically separate these copies so that if the drive becomes damaged, there is the best possible chance that at least one copy of the data survives. This feature is above and beyond any mirroring or RAID that has been set up.
Expandability
ZFS works on the concept of "storage pools". You basically throw a bunch of disks at ZFS and tell it "hey, use these disks, and while you're at it, create a RAID set out of them". ZFS then takes care of formatting, partitioning, and RAID setup. It then presents you with a "pool" of storage which you can begin using right away. If your pool becomes full, add more disks to the system and tell ZFS "hey, add these disks to my pool". One large caveat here is that as of this writing (Solaris Express Community Edition build 77) you cannot expand a RAID-Z or RAID-Z2 set in this way; you can only expand stripes. You can, however, replace disks in a RAID-Z set with bigger disks.
Performance
My primary concern when it came to performance was the hardware. All I really wanted was SATA 3.0Gbps disks and controllers. Of course ZFS has its own performance features such as I/O pipelining, I/O priorities and deadlines as well as I/O aggregation.
Notification for Failed/Failing Drives
Beyond the typical SMART monitoring supported by the disks, ZFS reports reads and writes that failed due to hardware issues as well as reads that failed checksum verification.
Useful Management/Configuration Tools
There are exactly two commands you need to know to use ZFS: zpool(1M) and zfs(1M). These two commands are used to create storage pools and create/modify file systems. The status, health and configuration of the pool and all file systems in the pool can also be viewed with these two commands.
Bonus Features
ZFS addresses all of my main requirements quite nicely but it has a lot of other features/benefits that I can take advantage of too.
Cheap (as in "free")
ZFS is open source software developed by Sun Microsystems. They distribute ZFS as part of their Solaris operating system. Solaris is available via free download from Sun and is free to use at home or commercially. ZFS is also part of FreeBSD (another free download) and Mac OS X (not available for free). Since ZFS takes care of mirroring and RAID, I can also save money by not buying a SATA RAID controller card.
Point-in-time snapshots
Snapshots are read-only copies of a file system that represent the state of that file system at a particular point in time. Even as files are added and deleted in the live file system, the files in the snapshot don't change. So for example, if you delete a file and then realize that was a mistake, you can go into the snapshot and pull the file back out. Snapshots only take up enough space on the disk to store the changes that have happened between the file system and the snapshot. Instead of digging individual files out of a snapshot, you can also do a "rollback" of a file system and take it back in time to the point when the snapshot was taken.
Compression
ZFS has built-in compression which can be enabled on a per file system basis. Instead of compressing your data using a compression utility, you can let ZFS take care of it for you. The compression ratio of a file system can be seen by using the 'zfs' command.
Active development/support community
Over at the OpenSolaris ZFS Community you'll find introductory information about ZFS, an overview of its features, links to mailing lists and discussion forums as well as information related to ZFS development, code and the on-disk data structure.
Hardware and Operating System
Since ZFS is developed by Sun, I decided to run Solaris. I specifically chose Solaris Express Community Edition (SXCE) which is Sun's bleeding-edge code that will one day become the next version of Solaris (Solaris 11). SXCE has all the latest ZFS features, some of which are not yet part of Sun's fully supported release of Solaris, Solaris 10.
This is the hardware I chose:
- ASUS M2N32-SLI Premium Vista Edition
- AMD Athlon(tm) 64 X2 Dual Core Processor 4000+ Socket AM2
- 2x 1GB Kingston DDR2-533 ECC
- Seagate 7200.10 160GB SATA 3Gb/s (system drive)
- 2x Seagate 7200.10 750GB SATA 3Gb/s (ZFS storage)
- A second-hand nVidia PCI graphics card
- Antec case with 350W PSU
I chose the M2N32-SLI board because it has 7 SATA 3.0Gb/s ports (plus 1 eSATA port). Most boards only have 4 ports. This allows me to install 6 drives for storage and 1 drive for the system/OS all within the case. The board also has 2x PCI, 1x PCIe x4, 1x PCIe x1 and 2x PCIe x16 slots for expansion. This board does not have on-board video, so I bought a used PCI graphics card and threw it in the mix. This is a server, so high-end graphics are of no concern. For storage I chose a 160GB drive for the operating system (which gives me enough room to do Live Upgrades) and two 750GB drives for storage (more on that down below).
The M2N32-SLI board is fully supported under SXCE b77, including the on-board NICs (nge). This board is not mentioned on the Solaris HCL but I can report everything that I use on it works just fine.
The output from "prtdiag" and "cfgadm" can be seen here: zfs_file_server_hw.txt
ZFS Configuration
In the current version of Solaris Express Community Edition (build 77), you cannot grow a RAID set by adding new drives to the system and expanding the storage pool onto them. This is in the works, but it is not here yet and there is no (public) timetable. Since expanding my storage pool is high on my list of requirements, I decided to use striped mirrors (also known as RAID-10). Since ZFS allows the expansion of striped pools, this configuration will allow me to grow the pool as large as I need.
The initial configuration is 2x 750GB drives set up in a mirror, which gives 750GB of storage space.
| Mirror (750GB storage) | |
| 750GB Drive | 750GB Drive |
As the pool becomes full I plan on adding 2x 1TB drives in a mirrored configuration and then striping across the mirrors giving the pool 1.75TB of storage.
| Stripe (1.75TB storage) | | | |
| Mirror (750GB storage) | | Mirror (1TB storage) | |
| 750GB Drive | 750GB Drive | 1TB Drive | 1TB Drive |
I can keep expanding the pool this way to grow it as large as I need.
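The capacity arithmetic behind this plan is simple enough to sketch. The function names below are made up for the example, and the figures are nominal drive sizes; the space a real pool reports will be somewhat lower.

```python
def mirror_capacity_gb(drive_sizes_gb):
    # A mirror can only hold as much as its smallest member.
    return min(drive_sizes_gb)


def pool_capacity_gb(mirrors):
    # Striping across mirrors adds their capacities together.
    return sum(mirror_capacity_gb(m) for m in mirrors)


initial = [[750, 750]]
expanded = [[750, 750], [1000, 1000]]

print(pool_capacity_gb(initial))    # 750
print(pool_capacity_gb(expanded))   # 1750
```

Each additional mirrored pair simply adds the size of its smaller drive to the pool.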
There are some drawbacks to this method.
- Storage costs are double what they would be if I were able to add a single drive to the pool. I have to purchase two drives every time I need to expand. However, since ZFS doesn't give me any other choice, the decision is easy. Also, we're only talking about 2 drives every 1 to 1.5 years (my best guess) so the cost really isn't that much.
- SATA ports get used up twice as fast. Well again, not much choice currently with ZFS. That's also why I was sure to have proper PCIe ports on the motherboard so I could add a SATA controller down the road.
- Physical drive bays get used twice as fast. There are options for this too: external enclosures or in-chassis drive enclosures.
All of these points would also be addressed if, at some point down the road, ZFS gains support for growing RAID sets. When that happens, I can simply build a RAID set on some new disks, copy the data from the mirrored set onto the RAID set, break the mirror and then add the disks from the mirror into the RAID set and grow onto them.
The other way of looking at this is that, for all its drawbacks, RAID-10 does have advantages over RAID-5, so there are reasons to stick with it.
- Can sustain more disk failures (1 in each mirror set; RAID-5 can only sustain 1 disk failure)
- Has better read performance
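The fault-tolerance point can be checked by brute force. The sketch below assumes a hypothetical four-disk layout (two mirrors of two disks, with made-up disk names) and compares it with a four-disk single-parity set. Note that striped mirrors survive only those two-disk failures that land in different mirrors.

```python
from itertools import combinations


def raid10_survives(failed, mirrors):
    # The pool survives as long as every mirror keeps at least one working disk.
    return all(any(d not in failed for d in m) for m in mirrors)


def raid5_survives(failed):
    # Single parity tolerates at most one failed disk.
    return len(failed) <= 1


mirrors = [("a1", "a2"), ("b1", "b2")]
disks = [d for m in mirrors for d in m]

# Every possible way for two of the four disks to fail at once.
pairs = list(combinations(disks, 2))
raid10_ok = [p for p in pairs if raid10_survives(set(p), mirrors)]
raid5_ok = [p for p in pairs if raid5_survives(set(p))]

print(len(pairs), len(raid10_ok), len(raid5_ok))   # 6 4 0
```

Of the six possible two-disk failures, the striped mirrors survive the four that hit one disk in each mirror, while the single-parity set survives none.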
ZFS in Action
Here are a few examples showing some of the features of ZFS.
Viewing the list of storage pools and then the status of the pool named "data".
jwk@sigma:~% zpool list
NAME   SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
data   696G   456G   240G    65%  ONLINE  -
jwk@sigma:~% zpool status -v data
  pool: data
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        data          ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c2t0d0p0  ONLINE       0     0     0
            c2t1d0p0  ONLINE       0     0     0
The c2t0d0p0 and c2t1d0p0 are the two 750GB SATA drives (Solaris has an interesting drive naming convention). You can see they are configured in a mirror. You can also see that there have been no read, write or checksum errors on either drive.
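For automated checks, the error counters can be pulled out of this output with a small script. The sketch below works on a pasted copy of the status text and assumes the column layout shown above, which is not a stable interface; the function name and the extra `c2t2d0p0` device in the usage lines are hypothetical.

```python
SAMPLE = """\
  pool: data
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        data          ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c2t0d0p0  ONLINE       0     0     0
            c2t1d0p0  ONLINE       0     0     0
"""


def devices_with_problems(status_text):
    """Return names of config entries that are not ONLINE or have non-zero counters."""
    problems = []
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_config = True          # the device table starts after this header
            continue
        if in_config and not fields:
            break                     # a blank line ends the device table
        if not in_config or len(fields) != 5:
            continue
        name, state, read, write, cksum = fields
        if state != "ONLINE" or (read, write, cksum) != ("0", "0", "0"):
            problems.append(name)
    return problems


print(devices_with_problems(SAMPLE))   # []

# A made-up degraded device with checksum errors, appended for illustration.
bad = SAMPLE + "            c2t2d0p0  DEGRADED     0     0     3\n"
print(devices_with_problems(bad))      # ['c2t2d0p0']
```

An empty list means every entry is ONLINE with zero read, write and checksum errors.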
This command shows how to retrieve the "compression" setting for a file system and the compression ratio.
jwk@sigma:~% zfs get compression,compressratio data/dump
NAME PROPERTY VALUE SOURCE
data/dump compression on local
data/dump compressratio 1.34x -
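To put the ratio in more familiar terms, it can be converted into a percentage of space saved. This sketch assumes compressratio is the uncompressed size divided by the compressed size; the function name is made up for the example.

```python
def space_saved_fraction(compressratio: float) -> float:
    # With ratio = uncompressed / compressed, the stored data occupies
    # 1 / ratio of its original size.
    return 1 - 1 / compressratio


print(f"{space_saved_fraction(1.34):.1%}")   # 25.4%
```

Under that assumption, a ratio of 1.34x means the data takes up roughly a quarter less space than it would uncompressed.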
An example of taking a snapshot and using it to restore a deleted file is shown below. The file system being worked on is called "data/dump" and the snapshot being created is called "data/dump@snapshot1".
jwk@sigma:/data/dump% zfs list data/dump
NAME        USED  AVAIL  REFER  MOUNTPOINT
data/dump   677M   229G   677M  /data/dump

# create the snapshot
jwk@sigma:/data/dump% zfs snapshot data/dump@snapshot1

# note the snapshot uses zero bytes when it's first created
jwk@sigma:/data/dump% zfs list data/dump@snapshot1
NAME                  USED  AVAIL  REFER  MOUNTPOINT
data/dump@snapshot1      0      -   677M  -

# snapshots are made available in the ".zfs" directory
jwk@sigma:/data/dump% ls -la .zfs/snapshot
total 5
dr-xr-xr-x   2 root     root            2 Dec  3 17:54 .
dr-xr-xr-x   3 root     root            3 Dec  3 17:54 ..
drwxrwxr-x   7 root     samba          38 Dec 20 21:31 snapshot1

jwk@sigma:/data/dump% ls -la jinstall*
-rw-r--r--   1 jwk      samba    61425279 Jun  4  2006 jinstall-7.1R1.3.tgz
-rw-r--r--   1 jwk      samba    66904843 Jun  4  2006 jinstall-7.2R4.2.tgz
-rw-rw----   1 jwk      samba    93860889 May 14  2007 jinstall-8.2R2.4.tgz
jwk@sigma:/data/dump% /bin/rm -f jinstall*
jwk@sigma:/data/dump% ls -la jinstall*
zsh: no matches found: jinstall*

# files were deleted, snapshot now takes up some space on disk
jwk@sigma:/data/dump% zfs list data/dump@snapshot1
NAME                  USED  AVAIL  REFER  MOUNTPOINT
data/dump@snapshot1   212M      -   677M  -

jwk@sigma:/data/dump% cd .zfs/snapshot/snapshot1
jwk@sigma:/data/dump/.zfs/snapshot/snapshot1% ls -la jinstall*
-rw-r--r--   1 jwk      samba    61425279 Jun  4  2006 jinstall-7.1R1.3.tgz
-rw-r--r--   1 jwk      samba    66904843 Jun  4  2006 jinstall-7.2R4.2.tgz
-rw-rw----   1 jwk      samba    93860889 May 14  2007 jinstall-8.2R2.4.tgz

# copy files out of the snapshot back to the file system
jwk@sigma:/data/dump/.zfs/snapshot/snapshot1% cp -p jinstall-7.1* /data/dump

Notice how the snapshot initially takes up zero bytes! It's only after the file system starts to change (i.e., files were deleted) that the snapshot starts to take up space, and even then it only takes up enough space to store the differences between the snapshot and the live file system.
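The space accounting can be reproduced with a toy model. This is a file-level simplification (ZFS tracks blocks, and real figures include metadata), and the `ToyFileSystem` class is made up for the example; only the file names and sizes come from the listing above.

```python
class ToyFileSystem:
    """File-level stand-in for a copy-on-write file system with snapshots."""

    def __init__(self, files):
        self.live = dict(files)   # file name -> size in bytes
        self.snapshots = {}       # snapshot name -> frozen {name: size}

    def snapshot(self, name):
        # Taking a snapshot only records what is referenced right now.
        self.snapshots[name] = dict(self.live)

    def snapshot_used(self, name):
        # Space charged to the snapshot: data it still references that the
        # live file system has since dropped.
        snap = self.snapshots[name]
        return sum(size for f, size in snap.items() if f not in self.live)

    def rollback(self, name):
        # Restore the live file system to the state recorded in the snapshot.
        self.live = dict(self.snapshots[name])


fs = ToyFileSystem({
    "jinstall-7.1R1.3.tgz": 61425279,
    "jinstall-7.2R4.2.tgz": 66904843,
    "jinstall-8.2R2.4.tgz": 93860889,
})
fs.snapshot("snapshot1")
print(fs.snapshot_used("snapshot1"))                  # 0

for name in list(fs.live):
    if name.startswith("jinstall"):
        del fs.live[name]
print(round(fs.snapshot_used("snapshot1") / 2**20))   # 212

fs.rollback("snapshot1")
print(sorted(fs.live))
print(fs.snapshot_used("snapshot1"))                  # 0
```

The three deleted files add up to 222,191,011 bytes, which agrees with the 212M charged to the snapshot in the listing if zfs reports sizes in units of 2^20 bytes.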
We can also do a rollback of the file system and take it back in time so that it's back to the state it was at when the snapshot was taken.
jwk@sigma:/data% zfs rollback data/dump@snapshot1
jwk@sigma:/data% zfs list data/dump data/dump@snapshot1
NAME                  USED  AVAIL  REFER  MOUNTPOINT
data/dump             677M   229G   677M  /data/dump
data/dump@snapshot1    46K      -   677M  -

# everything is back the way it was!
jwk@sigma:/data/dump% ls -la jinstall*
-rw-r--r--   1 jwk      samba    61425279 Jun  4  2006 jinstall-7.1R1.3.tgz
-rw-r--r--   1 jwk      samba    66904843 Jun  4  2006 jinstall-7.2R4.2.tgz
-rw-rw----   1 jwk      samba    93860889 May 14  2007 jinstall-8.2R2.4.tgz
Migrating from FreeBSD to Solaris
Because there are no Solaris drivers for 3ware RAID cards I had to do a quick and dirty install of FreeBSD 7 and use that to migrate my data. Since ZFS is architecture/endian agnostic, it's possible to create a ZFS pool on FreeBSD, export it, and then import it back on a Solaris system. It's even possible to export a pool on an Intel-based system and then import it on a SPARC-based system without any trouble.
Pictures
Asus M2N32-SLI board:
Inside the case once everything was installed:
References