====== ZFS/OpenZFS ======

Sources:
  * https://wiki.ubuntu.com/Kernel/Reference/ZFS
  * https://github.com/zfsonlinux/zfs/wiki/faq
  * https://wiki.archlinux.org/index.php/ZFS#Identify_disks

===== ZFS =====

ZFS support was added to Ubuntu Wily 15.10 as a technology preview and is fully supported in Ubuntu Xenial 16.04. Note that ZFS is only supported on 64-bit architectures. Also note that currently only [[https://blog.ubuntu.com/2018/10/15/deploying-ubuntu-root-on-zfs-with-maas|MAAS allows ZFS]] to be installed as a root filesystem.

A minimum of 2 GB of free memory is required to run ZFS, but a system with at least 8 GB of memory is recommended.

To install ZFS, use:

  sudo apt install zfsutils-linux

Below is a quick overview of ZFS, intended as a getting-started primer. For further information on ZFS, please refer to the [[https://pthree.org/2012/04/17/install-zfs-on-debian-gnulinux|excellent documentation]] written by Aaron Toponce.

==== NOTE: ====

>**For the sake of brevity, devices in this document are referred to as /dev/sda, /dev/sdb, etc. One should avoid this and instead use full device paths under /dev/disk/by-id to uniquely identify drives, which avoids boot-time failures if device name mappings change.**

===== Quick Setup =====

This quick setup guide results in the creation of a RAIDZ array. RAIDZ is a redundant array of three or more disks that tolerates a single disk failure. It works well for home setups because a minimal number of disks can be used while still providing redundancy, thus saving money. The performance of this setup is quite good and can easily saturate multiple gigabit connections.

Much like mounting disks by UUID in fstab, using the disk id is a much more reliable way to keep track of the disks. There are other ways, noted [[https://github.com/zfsonlinux/zfs/wiki/faq#selecting-dev-names-when-creating-a-pool|here]], but they are mainly useful for enterprise setups with many (>8) drives in one server.
  $ ls -lh /dev/disk/by-id
  total 0
  lrwxrwxrwx 1 root root 9 Oct 26 09:04 ata-HGST_HUS726060ALE610_######## -> ../../sdc
  lrwxrwxrwx 1 root root 9 Oct 26 09:04 ata-HGST_HUS726060ALE610_######## -> ../../sdb
  lrwxrwxrwx 1 root root 9 Oct 26 09:04 ata-HGST_HUS726060ALE614_######## -> ../../sde
  lrwxrwxrwx 1 root root 9 Oct 26 09:04 ata-HGST_HUS726060ALE614_######## -> ../../sda

==== Creating the pool ====

**-f** forces the pool to be created even if there are existing filesystems on the devices \\
**-m** specifies the mount point of the pool \\
**raidz** can instead be specified as mirror, raidz, raidz2, or raidz3

At pool creation, ashift=12 should always be used, except with SSDs that have 8k sectors, where ashift=13 is correct. A vdev of 512-byte disks using 4k sectors will not experience performance issues, but a 4k disk using 512-byte sectors will. Since ashift cannot be changed after pool creation, even a pool with only 512-byte disks should use ashift=12, because those disks may later need to be replaced with 4k disks, or the pool may be expanded by adding a vdev composed of 4k disks. Because correct detection of 4k disks is not reliable, ''-o ashift=12'' should always be specified during pool creation.

  sudo zpool create -f -o ashift=12 -m /mnt/bastion bastion raidz ata-HGST_HUS726060ALE610_######## ata-HGST_HUS726060ALE610_######## ata-HGST_HUS726060ALE614_######## ata-HGST_HUS726060ALE614_########

Check the status of the pool:

  # zpool status

Output:

  lucid@shiro:~$ sudo zpool status
    pool: bastion
   state: ONLINE
    scan: none requested
  config:

          NAME                                   STATE     READ WRITE CKSUM
          bastion                                ONLINE       0     0     0
            raidz1-0                             ONLINE       0     0     0
              ata-HGST_HUS726060ALE610_########  ONLINE       0     0     0
              ata-HGST_HUS726060ALE610_########  ONLINE       0     0     0
              ata-HGST_HUS726060ALE614_########  ONLINE       0     0     0
              ata-HGST_HUS726060ALE614_########  ONLINE       0     0     0

  errors: No known data errors

Check the configuration of the pool; this also shows the total available size of the pool.
  # zpool get all

==== Create filesystems ====

Filesystems are individual folders on the root of the pool. More information on pools, filesystems, and vdevs can be found in the source links at the top of the page.

Example:

  # zfs create bastion/documents

The created filesystem will be owned by root, so its ownership will need to be changed to the user of choice:

  # chown lucid:lucid /mnt/bastion/documents

==== Automatic scrubbing ====

Using a systemd timer/service it is possible to automatically scrub pools monthly:

/etc/systemd/system/zfs-scrub@.timer

  [Unit]
  Description=Monthly zpool scrub on %i

  [Timer]
  OnCalendar=monthly
  AccuracySec=1h
  Persistent=true

  [Install]
  WantedBy=multi-user.target

/etc/systemd/system/zfs-scrub@.service

  [Unit]
  Description=zpool scrub on %i

  [Service]
  Nice=19
  IOSchedulingClass=idle
  KillSignal=SIGINT
  ExecStart=/usr/bin/zpool scrub %i

Enable/start the zfs-scrub@pool-to-scrub.timer unit to scrub the specified zpool monthly.

To unmount (export) a pool:

  # zpool export bastion

===== ZFS Virtual Devices (ZFS VDEVs) =====

A VDEV is a meta-device that can represent one or more devices. ZFS supports seven different types of VDEV:

  * File - a pre-allocated file
  * Physical Drive (HDD, SSD, PCIe NVMe, etc.)
  * Mirror - a standard RAID1 mirror
  * ZFS software raidz1, raidz2, raidz3 'distributed' parity-based RAID
  * Hot Spare - hot spare for ZFS software RAID
  * Cache - a device for the level 2 adaptive read cache (ZFS L2ARC)
  * Log - ZFS Intent Log (ZFS ZIL)

VDEVs are dynamically striped by ZFS. A device can be added to a VDEV, but cannot be removed from it.

===== ZFS Pools =====

A zpool is a pool of storage made from a collection of VDEVs. One or more ZFS file systems can be created from a ZFS pool.
In the following example, a pool named "pool-test" is created from 3 physical drives:

  $ sudo zpool create pool-test /dev/sdb /dev/sdc /dev/sdd

Striping is performed dynamically, so this creates a zero-redundancy RAID-0 pool.

**Notice:** If you are managing many devices, it can be easy to confuse them, so you should probably prefer /dev/disk/by-id/ names, which often use serial numbers of drives. The examples here should not suggest that 'sd_' names are preferred; they merely make the examples easier to read.

One can see the status of the pool using the following command:

  $ sudo zpool status pool-test

...and destroy it using:

  $ sudo zpool destroy pool-test

==== A 2 x 2 mirrored zpool example ====

In the following example, we create a zpool containing a VDEV of 2 drives in a mirror:

  $ sudo zpool create mypool mirror /dev/sdc /dev/sdd

Next, we add another VDEV of 2 drives in a mirror to the pool:

  $ sudo zpool add mypool mirror /dev/sde /dev/sdf -f
  $ sudo zpool status
    pool: mypool
   state: ONLINE
    scan: none requested
  config:

          NAME        STATE     READ WRITE CKSUM
          mypool      ONLINE       0     0     0
            mirror-0  ONLINE       0     0     0
              sdc     ONLINE       0     0     0
              sdd     ONLINE       0     0     0
            mirror-1  ONLINE       0     0     0
              sde     ONLINE       0     0     0
              sdf     ONLINE       0     0     0

In this example:

  * /dev/sdc, /dev/sdd, /dev/sde, /dev/sdf are the physical devices
  * mirror-0, mirror-1 are the VDEVs
  * mypool is the pool

There are plenty of other ways to arrange VDEVs to create a zpool.
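Pool layouts like these can be prototyped safely with sparse files before committing real disks. The sketch below only creates the backing files; the /tmp/zfs-lab paths and the 'labpool' name are arbitrary, and the zpool commands are shown as comments because they require root and zfsutils-linux:

```shell
#!/bin/sh
# Create four 512 MB sparse files to stand in for disks.
mkdir -p /tmp/zfs-lab
for n in 1 2 3 4; do
    truncate -s 512M "/tmp/zfs-lab/disk$n"
done
ls -lh /tmp/zfs-lab

# With zfsutils-linux installed, a throwaway 2 x 2 mirror could then be
# built from these files (run as root) and destroyed again afterwards:
#   zpool create labpool \
#       mirror /tmp/zfs-lab/disk1 /tmp/zfs-lab/disk2 \
#       mirror /tmp/zfs-lab/disk3 /tmp/zfs-lab/disk4
#   zpool status labpool
#   zpool destroy labpool
```

Sparse files cost almost no real disk space until written to, which makes them convenient for trying out different VDEV arrangements.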
==== A single file based zpool example ====

In the following example, we use a single 2GB file as a VDEV and make a zpool from just this one VDEV:

  $ dd if=/dev/zero of=example.img bs=1M count=2048
  $ sudo zpool create pool-test /home/user/example.img
  $ sudo zpool status
    pool: pool-test
   state: ONLINE
    scan: none requested
  config:

          NAME                      STATE     READ WRITE CKSUM
          pool-test                 ONLINE       0     0     0
            /home/user/example.img  ONLINE       0     0     0

In this example:

  * /home/user/example.img is a file-based VDEV
  * pool-test is the pool

===== RAID =====

ZFS offers different RAID options:

==== Striped VDEVs ====

This is equivalent to RAID0: there is no parity and no mirroring from which to rebuild data. This is not recommended because of the risk of losing data if a drive fails.

Example, creating a striped pool using 4 VDEVs:

  $ sudo zpool create example /dev/sdb /dev/sdc /dev/sdd /dev/sde

==== Mirrored VDEVs ====

Much like RAID1, a mirror can use 2 or more devices. For a mirror of N devices, N-1 of them can fail before data is lost.

Example, creating a mirrored pool with 2 devices:

  $ sudo zpool create example mirror /dev/sdb /dev/sdc

==== Striped Mirrored VDEVs ====

Much like RAID10, and great for small random read I/O: create mirrored pairs and then stripe data over the mirrors.

Example, creating a striped 2 x 2 mirrored pool:

  $ sudo zpool create example mirror /dev/sdb /dev/sdc mirror /dev/sdd /dev/sde

or:

  $ sudo zpool create example mirror /dev/sdb /dev/sdc
  $ sudo zpool add example mirror /dev/sdd /dev/sde

==== RAIDZ ====

Like RAID5, this uses variable-width stripes for parity. It allows one to get the most capacity out of a set of disks, with parity checking, at the cost of some performance. It tolerates a single disk failure without losing data.

Example, creating a 4-disk RAIDZ:

  $ sudo zpool create example raidz /dev/sdb /dev/sdc /dev/sdd /dev/sde

==== RAIDZ2 ====

Like RAID6, with double the parity, tolerating 2 disk failures, with performance similar to RAIDZ.
Example, creating a 5-disk, double-parity pool:

  $ sudo zpool create example raidz2 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf

==== RAIDZ3 ====

Triple parity, allowing for 3 disk failures before losing data, with performance similar to RAIDZ2 and RAIDZ.

Example, creating a 6-disk, triple-parity pool:

  $ sudo zpool create example raidz3 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg

==== Nested RAIDZ ====

Like RAID50 or RAID60: striped RAIDZ volumes. This performs better than plain RAIDZ, at the cost of reduced capacity.

Example, 2 x RAIDZ:

  $ sudo zpool create example raidz /dev/sdb /dev/sdc /dev/sdd /dev/sde
  $ sudo zpool add example raidz /dev/sdf /dev/sdg /dev/sdh /dev/sdi

===== ZFS Intent Logs =====

ZIL (ZFS Intent Log) drives can be added to a ZFS pool to speed up the write capabilities of any level of ZFS RAID. One would normally use a fast SSD for the ZIL. Conceptually, the ZIL is a logging mechanism where data and metadata to be written are stored, then later flushed as a transactional write. In reality, the ZIL is more complex than this and is [[http://nex7.blogspot.co.uk/2013/04/zfs-intent-log.html|described in detail here]].

One or more drives can be used for the ZIL. For example, to add an SSD to the pool 'mypool', use:

  $ sudo zpool add mypool log /dev/sdg -f

===== ZFS Cache Drives =====

Cache devices provide an additional layer of caching between main memory and disk. They are especially useful for improving random-read performance of mainly static data.

For example, to add a cache drive /dev/sdh to the pool 'mypool', use:

  $ sudo zpool add mypool cache /dev/sdh -f

===== ZFS file systems =====

ZFS allows one to create a maximum of 2^64 file systems per pool.
In the following example, we create two file systems in the pool 'mypool':

  $ sudo zfs create mypool/tmp
  $ sudo zfs create mypool/projects

and to destroy a file system, use:

  $ sudo zfs destroy mypool/tmp

Each ZFS file system can have properties set, for example, setting a maximum quota of 10 gigabytes:

  $ sudo zfs set quota=10G mypool/projects

or enabling compression:

  $ sudo zfs set compression=on mypool/projects

===== ZFS Snapshots =====

A ZFS snapshot is a read-only copy of a ZFS file system or volume. It can be used to save the state of a file system at a point in time, and one can roll back to that state at a later date. One can even extract files from a snapshot without performing a complete rollback.

In the following example, we snapshot the mypool/projects file system:

  $ sudo zfs snapshot -r mypool/projects@snap1

..and we can see the collection of snapshots using:

  $ sudo zfs list -t snapshot
  NAME                    USED  AVAIL  REFER  MOUNTPOINT
  mypool/projects@snap1  8.80G      -  8.80G  -

Now let's 'accidentally' destroy all the files and then roll back:

  $ rm -rf /mypool/projects
  $ sudo zfs rollback mypool/projects@snap1

One can remove a snapshot using the following:

  $ sudo zfs destroy mypool/projects@snap1

===== ZFS Clones =====

A ZFS clone is a writeable copy of a file system, with the initial content of the clone being identical to the original file system. A ZFS clone can only be created from a ZFS snapshot, and the snapshot cannot be destroyed until the clones created from it are also destroyed.

For example, to clone mypool/projects, first make a snapshot and then clone it:

  $ sudo zfs snapshot -r mypool/projects@snap1
  $ sudo zfs clone mypool/projects@snap1 mypool/projects-clone

===== ZFS Send and Receive =====

ZFS send sends a snapshot of a file system that can be streamed to a file or to another machine. ZFS receive takes this stream and writes the copy of the snapshot back out as a ZFS file system. This is great for backups or for sending copies over the network (e.g.
using ssh) to copy a file system.

For example, make a snapshot and save it to a file:

  $ sudo zfs snapshot -r mypool/projects@snap2
  $ sudo zfs send mypool/projects@snap2 > ~/projects-snap.zfs

..and receive it back:

  $ sudo zfs receive -F mypool/projects-copy < ~/projects-snap.zfs

===== ZFS Ditto Blocks =====

Ditto blocks store extra redundant copies of data, for added redundancy. With a storage pool of just one device, ditto blocks are spread across the device, trying to place the blocks at least 1/8 of the disk apart. With multiple devices in a pool, ZFS tries to spread ditto blocks across separate VDEVs. 1 to 3 copies can be set.

For example, setting 3 copies on mypool/projects:

  $ sudo zfs set copies=3 mypool/projects

===== ZFS Deduplication =====

ZFS dedup will discard blocks that are identical to existing blocks and will instead use a reference to the existing block. This saves space on the device but comes at a large cost in memory. The dedup in-memory table uses ~320 bytes per block, and the larger the table grows, the slower write performance becomes.

For example, to enable dedup on mypool/projects, use:

  $ sudo zfs set dedup=on mypool/projects

For more pros/cons of deduplication, refer to [[http://constantin.glez.de/blog/2011/07/zfs-dedupe-or-not-dedupe]]. Deduplication is almost never worth the performance penalty.

===== ZFS Pool Scrubbing =====

To initiate an explicit data integrity check on a pool, use the zpool scrub command.
For example, to scrub the pool 'mypool':

  $ sudo zpool scrub mypool

One can check the status of the scrub using zpool status, for example:

  $ sudo zpool status -v mypool

===== Data recovery, a simple example =====

Let's assume we have a 2 x 2 mirrored zpool:

  $ sudo zpool create mypool mirror /dev/sdc /dev/sdd mirror /dev/sde /dev/sdf -f
  $ sudo zpool status
    pool: mypool
   state: ONLINE
    scan: none requested
  config:

          NAME        STATE     READ WRITE CKSUM
          mypool      ONLINE       0     0     0
            mirror-0  ONLINE       0     0     0
              sdc     ONLINE       0     0     0
              sdd     ONLINE       0     0     0
            mirror-1  ONLINE       0     0     0
              sde     ONLINE       0     0     0
              sdf     ONLINE       0     0     0

Now populate it with some data and checksum the data:

  $ dd if=/dev/urandom of=/mypool/random.dat bs=1M count=4096
  $ md5sum /mypool/random.dat
  f0ca5a6e2718b8c98c2e0fdabd83d943  /mypool/random.dat

Now we simulate catastrophic data loss by overwriting one of the VDEV devices with zeros:

  $ sudo dd if=/dev/zero of=/dev/sde bs=1M count=8192

And now initiate a scrub:

  $ sudo zpool scrub mypool

And check the status:

  $ sudo zpool status
    pool: mypool
   state: ONLINE
  status: One or more devices has experienced an unrecoverable error.  An
          attempt was made to correct the error.  Applications are unaffected.
  action: Determine if the device needs to be replaced, and clear the errors
          using 'zpool clear' or replace the device with 'zpool replace'.
     see: http://zfsonlinux.org/msg/ZFS-8000-9P
    scan: scrub in progress since Tue May 12 17:34:53 2015
      244M scanned out of 1.91G at 61.0M/s, 0h0m to go
      115M repaired, 12.46% done
  config:

          NAME        STATE     READ WRITE CKSUM
          mypool      ONLINE       0     0     0
            mirror-0  ONLINE       0     0     0
              sdc     ONLINE       0     0     0
              sdd     ONLINE       0     0     0
            mirror-1  ONLINE       0     0     0
              sde     ONLINE       0     0   948  (repairing)
              sdf     ONLINE       0     0     0

...now let us remove the drive from the pool:

  $ sudo zpool detach mypool /dev/sde

..hot swap it out and add a new one back:

  $ sudo zpool attach mypool /dev/sdf /dev/sde -f

..and initiate a scrub to repair the 2 x 2 mirror:

  $ sudo zpool scrub mypool

===== ZFS compression =====

As mentioned earlier, one can compress data automatically with ZFS. With the speed of modern CPUs this is a useful option, as reduced data size means less data to physically read and write, and hence faster I/O. ZFS offers a comprehensive range of compression methods. The default is lz4 (a high-performance replacement for lzjb), which offers faster compression/decompression than lzjb and a slightly higher compression ratio too.

One can change the compression level, e.g.

  $ sudo zfs set compression=gzip-9 mypool

or even the compression type:

  $ sudo zfs set compression=lz4 mypool

and check on the compression ratio using:

  $ sudo zfs get compressratio

lz4 is significantly faster than the other options while still compressing well; lz4 is the safest choice.

===== ZFS Management =====

You may want to list the zpools that you have, as well as see various statistics about the pools:

  $ sudo zpool list
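The compression trade-off described above can be felt outside ZFS too. As a rough, hypothetical illustration (this uses plain gzip on ordinary files, not ZFS's in-line compression, and the /tmp paths are arbitrary), higher compression levels trade CPU time for a smaller result:

```shell
#!/bin/sh
# Generate a few MB of compressible text.
seq 1 400000 > /tmp/sample.txt

# Fast, light compression vs. slow, heavy compression.
gzip -1 -c /tmp/sample.txt > /tmp/sample.gz1
gzip -9 -c /tmp/sample.txt > /tmp/sample.gz9

# Both outputs are much smaller than the input; -9 is smaller still,
# but takes noticeably more CPU time on large inputs.
ls -l /tmp/sample.txt /tmp/sample.gz1 /tmp/sample.gz9
```

The same logic drives the choice of lz4 as the ZFS default: a fast, light algorithm rarely becomes the bottleneck, while gzip-9 can, which is why lz4 is the safest choice for general use.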