packetmischief.ca

Solaris ZFS File Server


Introduction

My existing file server had reached capacity and I needed something bigger and newer.

The server was running FreeBSD with a 3Ware RAID card. It had three 200GB drives set up in a RAID-5 configuration, which yielded 400GB of usable storage. The array was at capacity and the drives were IDE, so finding a replacement if one of them failed was getting more and more difficult. The whole thing needed to go.

Storage Requirements

These are the requirements I had (in order of importance) for my new storage server.

  1. Data integrity
  2. Data redundancy
  3. Expandability
  4. Performance
  5. Notification for failed/failing drives
  6. Useful management/configuration tools

The existing server fell down on a couple of these points.

I wanted to make sure that I addressed each of these issues with the new setup.

Solutions

Around this time I was reading a lot about Sun's Solaris OS and it didn't take long to run across their Zettabyte File System (ZFS). In Sun's own words:

ZFS is a new kind of file system that provides simple administration, transactional semantics, end-to-end data integrity, and immense scalability. ZFS is not an incremental improvement to existing technology; it is a fundamentally new approach to data management. We've blown away 20 years of obsolete assumptions, eliminated complexity at the source, and created a storage system that's actually a pleasure to use.

http://www.opensolaris.org/os/community/zfs/whatis/

Up until this point I didn't have a good first-choice solution. I was contemplating another RAID card (this time SATA) but that still wouldn't address all of the requirements I had and would probably put me back in the same predicament 2-3 years from now. As I read more about ZFS I became more interested and realized it was a very good fit for my needs.

ZFS It Is!

Here's how ZFS addresses each of my requirements.

Data Integrity

ZFS uses block-level checksums to catch data corruption as data is read from the disk. When the storage pool is configured for mirroring or RAID, a corrupt block is automatically repaired using the redundant data. ZFS also uses copy-on-write for all write transactions. This ensures that the on-disk state is always valid (no need for fsck). You can also force a "scrub" of the storage pool, which causes ZFS to traverse the entire pool, verify each block against its checksum and repair it if necessary.
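As a rough sketch of what forcing a scrub looks like, the function below starts one and then prints the pool status. The function name scrub_pool is made up for illustration; "data" is the pool name used later in this article.

```shell
# Start a scrub on the given pool, then show its status.
# The scrub itself runs in the background; rerun "zpool status"
# later to see its progress and any repaired errors.
scrub_pool() {
    # $1 = pool name
    zpool scrub "$1" || return 1
    zpool status -v "$1"
}

# Usage (needs sufficient privileges on a system with ZFS):
#   scrub_pool data
```

Any checksum errors found along the way show up in the CKSUM column of the status output.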

Data redundancy

ZFS isn't just a file system; it's also a volume manager. This means you can configure stripes, mirrors and RAID sets using ZFS. Using mirrored sets or RAID-Z/RAID-Z2 (ZFS's own implementation of RAID-5 and RAID-6, respectively) provides data redundancy and protection against disk failures. ZFS also has the ability to store up to 3 copies of user data on the same physical disk. ZFS does its best to physically separate these copies so that if the drive becomes damaged there is a maximum chance that at least one copy of the data survives. This feature is above and beyond any mirroring or RAID that has been set up.
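The extra copies are controlled by the "copies" property of a file system. The sketch below is one way to set it with a basic sanity check; set_copies is a hypothetical helper name, and "data/dump" is the file system used in the examples later on.

```shell
# Set the number of copies ZFS keeps of each block in a file system.
# Only data written after the change gets the extra copies.
set_copies() {
    # $1 = file system, $2 = number of copies (1, 2 or 3)
    case "$2" in
        1|2|3) ;;
        *) echo "copies must be 1, 2 or 3" >&2; return 1 ;;
    esac
    zfs set copies="$2" "$1" && zfs get copies "$1"
}

# Usage:
#   set_copies data/dump 2
```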

Expandability

ZFS works on the concept of "storage pools". You basically throw a bunch of disks at ZFS and tell it "hey, use these disks, and while you're at it, create a RAID set out of them". ZFS then takes care of formatting, partitioning, and RAID setup. It then presents you with a "pool" of storage which you can begin using right away. If your pool becomes full, add more disks to the system and tell ZFS "hey, add these disks to my pool". One large caveat here is that as of this writing (Solaris Express Community Edition build 77) you cannot expand a RAID-Z or RAID-Z2 set in this way; you can only expand stripes. You can, however, replace disks in a RAID-Z set with bigger disks.
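In command form, those two "hey, ..." requests look roughly like the sketch below. The function names are invented for illustration, and the device names in the usage lines are placeholders rather than a record of the commands actually run on this server.

```shell
# "Use these disks and mirror them": create a pool from one two-way mirror.
create_mirror_pool() {
    # $1 = pool name, $2 and $3 = the two disks to mirror
    zpool create "$1" mirror "$2" "$3"
}

# "Add these disks to my pool": stripe a second mirror alongside the first.
add_mirror() {
    # $1 = pool name, $2 and $3 = the two new disks
    zpool add "$1" mirror "$2" "$3"
}

# Usage (device names are placeholders):
#   create_mirror_pool data c2t0d0 c2t1d0
#   add_mirror data c2t2d0 c2t3d0
```

Double-check the device names before running "zpool add", because it changes the layout of the pool.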

Performance

My primary concern when it came to performance was the hardware. All I really wanted was SATA 3.0Gbps disks and controllers. Of course ZFS has its own performance features such as I/O pipelining, I/O priorities and deadlines as well as I/O aggregation.

Notification for Failed/Failing Drives

Beyond the typical SMART monitoring supported by the disks, ZFS reports reads and writes that failed due to hardware issues as well as reads that failed checksum verification.
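One way to turn that reporting into an actual notification is a small check that could be run from cron. This is only a sketch under assumptions, not something set up in this article: check_pool_health is a made-up name, and how the warning gets delivered (mail, syslog, etc.) is left open.

```shell
# Return 0 if the pool reports ONLINE, 1 if it reports any other
# health state, and 2 if the pool could not be queried at all.
check_pool_health() {
    # $1 = pool name
    health="$(zpool list -H -o health "$1")" || return 2
    if [ "$health" = "ONLINE" ]; then
        return 0
    fi
    echo "pool $1 is $health" >&2
    return 1
}

# Usage:
#   check_pool_health data || echo "time to look at zpool status"
```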

Useful Management/Configuration Tools

There are exactly two commands you need to know to use ZFS: zpool(1M) and zfs(1M). These two commands are used to create storage pools and create/modify file systems. The status, health and configuration of the pool and all file systems in the pool can also be viewed with these two commands.

Bonus Features

ZFS addresses all of my main requirements quite nicely but it has a lot of other features/benefits that I can take advantage of too.

Cheap (as in "free")

ZFS is open source software developed by Sun Microsystems. They distribute ZFS as part of their Solaris operating system. Solaris is available via free download from Sun and is free to use at home or commercially. ZFS is also part of FreeBSD (another free download) and Mac OS X (not available for free). Since ZFS takes care of mirroring and RAID, I can also save money by not buying a SATA RAID controller card.

Point-in-time snapshots

Snapshots are read-only copies of a file system that represent the state of that file system at a particular point in time. Even as files are added and deleted in the live file system, the files in the snapshot don't change. So for example, if you delete a file and then realize that was a mistake, you can go into the snapshot and pull the file back out. Snapshots only take up enough space on the disk to store the changes that have happened between the file system and the snapshot. Instead of digging individual files out of a snapshot, you can also do a "rollback" of a file system and take it back in time to the point when the snapshot was taken.

Compression

ZFS has built-in compression which can be enabled on a per-file-system basis. Instead of compressing your data using a compression utility, you can let ZFS take care of it for you. The compression ratio of a file system can be seen by using the 'zfs' command.
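A minimal sketch of turning compression on and reading the ratio back in a script-friendly form follows. The helper names are hypothetical; the -H and -o options make "zfs get" print just the bare value.

```shell
# Enable compression on a file system. Only data written after
# this point is compressed; existing blocks are left as they are.
enable_compression() {
    # $1 = file system
    zfs set compression=on "$1"
}

# Print the compression ratio as a plain number (e.g. 1.34),
# stripping the trailing "x" that zfs appends.
compress_ratio() {
    # $1 = file system
    ratio="$(zfs get -H -o value compressratio "$1")" || return 1
    echo "${ratio%x}"
}

# Usage:
#   enable_compression data/dump
#   compress_ratio data/dump
```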

Active development/support community

Over at the OpenSolaris ZFS Community you'll find introductory information about ZFS, an overview of its features, links to mailing lists and discussion forums as well as information related to ZFS development, code and the on-disk data structure.

Hardware and Operating System

Since ZFS is developed by Sun, I decided to run Solaris. I specifically chose Solaris Express Community Edition (SXCE) which is Sun's bleeding-edge code that will one day become the next version of Solaris (Solaris 11). SXCE has all the latest ZFS features, some of which are not yet part of Sun's fully supported release of Solaris, Solaris 10.

This is the hardware I chose:

I chose the M2N32-SLI board because it has 7 SATA 3.0Gb/s ports (plus 1 eSATA port). Most boards only have 4 ports. This allows me to install 6 drives for storage and 1 drive for the system/OS all within the case. The board also has 2x PCI, 1x PCIe x4, 1x PCIe x1 and 2x PCIe x16 slots for expansion. This board does not have on-board video, so I bought a used PCI graphics card and threw it in the mix. This is a server, so high-end graphics are of no concern. For storage I chose a 160GB drive for the operating system (which gives me enough room to do Live Upgrades) and two 750GB drives for storage (more on that down below).

The M2N32-SLI board is fully supported under SXCE b77, including the on-board NICs (nge). This board is not mentioned on the Solaris HCL but I can report everything that I use on it works just fine.

The output from "prtdiag" and "cfgadm" can be seen here: zfs_file_server_hw.txt

ZFS Configuration

In the current version of Solaris Express Community Edition (build 77), you cannot grow a RAID set by adding new drives to the system and expanding the storage pool onto them. This is in the works, but it is not here yet and there is no (public) timetable. Since expanding my storage pool is high on my list of requirements, I decided to use striped mirrors (also known as RAID-10). Since ZFS allows the expansion of striped pools, this configuration will allow me to grow the pool as large as I need.

The initial configuration is 2x 750GB drives set up in a mirror, which gives 750GB of storage space.

Mirror (750GB storage)
750GB Drive | 750GB Drive

As the pool becomes full I plan on adding 2x 1TB drives in a mirrored configuration and then striping across the mirrors giving the pool 1.75TB of storage.

Stripe (1.75TB storage)
Mirror (750GB storage)    | Mirror (1TB storage)
750GB Drive | 750GB Drive | 1TB Drive | 1TB Drive

I can keep expanding the pool this way to grow it as large as I need.
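The capacity arithmetic behind these layouts can be sketched as a small shell function: each mirror holds as much as its smallest drive, and the stripe adds the mirrors together. The function name is made up, and sizes are nominal decimal gigabytes, so the result ignores file system overhead.

```shell
# Usable capacity (in GB) of a pool striped across mirrors.
# Each argument describes one mirror as comma-separated drive sizes in GB.
striped_mirror_capacity() {
    total=0
    for mirror in "$@"; do
        smallest=""
        old_ifs="$IFS"
        IFS=","
        for size in $mirror; do
            # A mirror can only store as much as its smallest member.
            if [ -z "$smallest" ] || [ "$size" -lt "$smallest" ]; then
                smallest="$size"
            fi
        done
        IFS="$old_ifs"
        [ -n "$smallest" ] || return 1
        total=$((total + smallest))
    done
    echo "$total"
}

# Usage:
#   striped_mirror_capacity 750,750             # initial mirror
#   striped_mirror_capacity 750,750 1000,1000   # after adding the 1TB mirror
```

For the two layouts above this gives 750 and 1750, matching the 750GB and 1.75TB figures.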

There are some drawbacks to this method.

Another option that addresses all these points is that, at some point down the road, ZFS will support growing RAID sets. When that happens, I can simply build a RAID set on some new disks, copy the data from the mirrored set onto the RAID set, break the mirror, and then add the disks from the mirror to the RAID set and grow onto them.

The other way of looking at this is that, for all its drawbacks, RAID-10 does have advantages over RAID-5, so there are reasons to stick with it.

ZFS in Action

Here are some examples showing a few of the features of ZFS.

First, viewing the list of storage pools and then the status of the pool named "data":

jwk@sigma:~% zpool list
NAME   SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
data   696G   456G   240G    65%  ONLINE  -

jwk@sigma:~% zpool status -v data
  pool: data
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        data          ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c2t0d0p0  ONLINE       0     0     0
            c2t1d0p0  ONLINE       0     0     0

The c2t0d0p0 and c2t1d0p0 are the two 750GB SATA drives (Solaris has an interesting drive naming convention). You can see they are configured in a mirror. You can also see that there have been no read, write or checksum errors on either drive.

This command shows how to retrieve the "compression" setting for a file system and the compression ratio.

jwk@sigma:~% zfs get compression,compressratio data/dump
NAME            PROPERTY       VALUE           SOURCE
data/dump       compression    on              local
data/dump       compressratio  1.34x           -

An example of taking a snapshot and using it to restore a deleted file is shown below. The file system being worked on is called "data/dump" and the snapshot being created is called "data/dump@snapshot1".

jwk@sigma:/data/dump% zfs list data/dump
NAME        USED  AVAIL  REFER  MOUNTPOINT
data/dump   677M   229G   677M  /data/dump

# create the snapshot
jwk@sigma:/data/dump% zfs snapshot data/dump@snapshot1

# note the snapshot uses zero bytes when it's first created
jwk@sigma:/data/dump% zfs list data/dump@snapshot1
NAME                  USED  AVAIL  REFER  MOUNTPOINT
data/dump@snapshot1      0      -   677M  -

# snapshots are made available in the ".zfs" directory
jwk@sigma:/data/dump% ls -la .zfs/snapshot
total 5
dr-xr-xr-x   2 root     root           2 Dec  3 17:54 .
dr-xr-xr-x   3 root     root           3 Dec  3 17:54 ..
drwxrwxr-x   7 root     samba         38 Dec 20 21:31 snapshot1

jwk@sigma:/data/dump% ls -la jinstall*
-rw-r--r--   1 jwk      samba    61425279 Jun  4  2006 jinstall-7.1R1.3.tgz
-rw-r--r--   1 jwk      samba    66904843 Jun  4  2006 jinstall-7.2R4.2.tgz
-rw-rw----   1 jwk      samba    93860889 May 14  2007 jinstall-8.2R2.4.tgz
jwk@sigma:/data/dump% /bin/rm -f jinstall*
jwk@sigma:/data/dump% ls -la jinstall*
zsh: no matches found: jinstall*

# files were deleted, snapshot now takes up some space on disk
jwk@sigma:/data/dump% zfs list data/dump@snapshot1
NAME                  USED  AVAIL  REFER  MOUNTPOINT
data/dump@snapshot1   212M      -   677M  -

jwk@sigma:/data/dump% cd .zfs/snapshot/snapshot1
jwk@sigma:/data/dump/.zfs/snapshot/snapshot1% ls -la jinstall*
-rw-r--r--   1 jwk      samba    61425279 Jun  4  2006 jinstall-7.1R1.3.tgz
-rw-r--r--   1 jwk      samba    66904843 Jun  4  2006 jinstall-7.2R4.2.tgz
-rw-rw----   1 jwk      samba    93860889 May 14  2007 jinstall-8.2R2.4.tgz

# copy files out of the snapshot back to the file system
jwk@sigma:/data/dump/.zfs/snapshot/snapshot1% cp -p jinstall-7.1* /data/dump

Notice how the snapshot initially takes up zero bytes! It's only after the file system starts to change (i.e., files were deleted) that the snapshot starts to take up space, and even then it only takes up enough space to store the differences between the snapshot and the live file system.

We can also do a rollback of the file system and take it back in time so that it's back to the state it was at when the snapshot was taken.

jwk@sigma:/data% zfs rollback data/dump@snapshot1

jwk@sigma:/data% zfs list data/dump data/dump@snapshot1
NAME                  USED  AVAIL  REFER  MOUNTPOINT
data/dump             677M   229G   677M  /data/dump
data/dump@snapshot1    46K      -   677M  -

# everything is back the way it was!
jwk@sigma:/data/dump% ls -la jinstall*
-rw-r--r--   1 jwk      samba    61425279 Jun  4  2006 jinstall-7.1R1.3.tgz
-rw-r--r--   1 jwk      samba    66904843 Jun  4  2006 jinstall-7.2R4.2.tgz
-rw-rw----   1 jwk      samba    93860889 May 14  2007 jinstall-8.2R2.4.tgz

Migrating from FreeBSD to Solaris

Because there are no Solaris drivers for 3ware RAID cards, I had to do a quick and dirty install of FreeBSD 7 and use that to migrate my data. Since ZFS is architecture/endian agnostic, it's possible to create a ZFS pool on FreeBSD, export it, and then import it on a Solaris system. It's even possible to export a pool on an Intel-based system and then import it on a SPARC-based system without any trouble.
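The export/import step can be sketched as two small functions, one for each side of the move. The function names are hypothetical, and this assumes the ZFS pool version created on the FreeBSD side is one the Solaris build can read.

```shell
# On the old system (FreeBSD here): cleanly detach the pool
# so that another host can pick it up.
export_pool() {
    # $1 = pool name
    zpool export "$1"
}

# On the new system (Solaris here): list the pools that are
# available for import, then import the one we want by name.
import_pool() {
    # $1 = pool name
    zpool import || return 1
    zpool import "$1"
}

# Usage:
#   export_pool data     # run on the FreeBSD install
#   import_pool data     # run on Solaris after booting into it
```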

Pictures

Asus M2N32-SLI board:

Inside the case once everything was installed:

References