Thursday, April 4, 2019

ZFS

  • There is no JBOD in ZFS (most of the time)

ZFS uses three-tier logic to manage physical disks. Disks are combined into virtual devices (vdevs). Vdevs are then combined into a pool (or multiple pools, but I'm talking about a single pool here). Vdevs can be of different types: simple (a single disk), mirrors (two or more identical disks), or RAIDZ/Z2/Z3 (similar to RAID-5, tolerating one, two, or three failed disks respectively). You can add vdevs to an existing pool, and the pool expands accordingly (this will be significant later).

It may seem that if we make several vdevs consisting of a single disk each and then combine them into a pool, the result will resemble a traditional JBOD. That is not what happens. A traditional JBOD allocates space for data from the start to the end of the array: when one of the disks fills up, the next disk is used, and so on (this is not exactly correct, but a good approximation nonetheless). A ZFS pool allocates data blocks on different vdevs in turn, so if a large file is written, its blocks are spread across different vdevs. However, if you add a new disk (and thus a new vdev) to a pool that is filled to near capacity, no automatic rebalancing takes place. Whatever files you then add to the pool will be written mostly to the newly added disk.

In the general case, no: a ZFS pool without redundancy is not like a JBOD. It behaves more like a RAID-0.
https://www.klennet.com/notes/2018-12-20-no-jbod-in-zfs-mostly.aspx
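
A quick way to see this behaviour for yourself (a minimal sketch; the pool name "tank" and the scratch disks sdb/sdc/sdd are placeholders):
zpool create tank sdb sdc       # two single-disk vdevs, striped RAID-0-style
zpool add tank sdd              # add a third vdev later, once the pool is nearly full
zpool list -v tank              # shows capacity and allocation per vdev
zpool iostat -v tank            # new writes land mostly on the freshly added vdev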


  • "The COW filesystem for Linux that won't eat your data".
Bcachefs is an advanced new filesystem for Linux, with an emphasis on reliability and robustness. It has a long list of features, completed or in progress:

    Copy on write (COW) - like zfs or btrfs
    Full data and metadata checksumming
    Multiple devices, including replication and other types of RAID
    Caching
    Compression
    Encryption
    Snapshots
    Scalable - has been tested to 50+ TB, will eventually scale far higher
    Already working and stable, with a small community of users
https://bcachefs.org/
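
Getting started looks roughly like this (a hedged sketch; assumes the bcachefs kernel module and bcachefs-tools are installed, and /dev/sdb is a scratch disk):
bcachefs format /dev/sdb
mount -t bcachefs /dev/sdb /mnt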

  • For most users the kABI-tracking kmod packages are recommended in order to avoid needing to rebuild ZFS for every kernel update. DKMS packages are recommended for users running a non-distribution kernel or for users who wish to apply local customizations to ZFS on Linux.
https://github.com/zfsonlinux/zfs/wiki/RHEL-and-CentOS
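
A rough sketch of the default DKMS-style install on RHEL/CentOS (assumes the project's zfs-release repo package is already installed; dkms itself comes from EPEL):
yum install epel-release kernel-devel
yum install zfs                  # builds the module via dkms for the running kernel
modprobe zfs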

  • ZFS was originally designed to work with Solaris and BSD systems. Because of legal and licensing issues, ZFS cannot be shipped with Linux.
    Since ZFS is open source, some developers have ported ZFS to Linux and made it run at the kernel level via DKMS. This works great as long as you don't update the kernel; otherwise ZFS may not be loaded with the new kernel.
    In a ZFS-on-Linux environment, it is a bad idea to update the system automatically (a workaround is sketched below).
    For some odd reason, ZFS on Linux works well on server-grade or gaming-grade computers. Do not run ZFS on Linux on entry-level computers.

https://icesquare.com/wordpress/how-to-install-zfs-on-rhel-centos-7/
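
If you do run DKMS-based ZFS, one common workaround (a sketch, not an endorsement) is to hold kernel packages back so the module and the kernel stay in sync:
yum update --exclude='kernel*'           # one-off update that skips kernel packages
echo "exclude=kernel*" >> /etc/yum.conf  # or pin kernels permanently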

  • In this post we discuss Linux disk I/O performance using either ZFS raidz or Linux mdadm software RAID-0. It is important to understand that RAID-0 is not reliable for data storage: a single disk loss can easily destroy the whole RAID. ZFS raidz, on the other hand, behaves similarly to RAID-5, while creating a ZFS pool without specifying raidz1 is effectively RAID-0.
http://supercomputing.caltech.edu/blog/index.php/2016/02/04/linux-zfs-vs-mdadm-performance-difference/
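
For context, the setups being compared look roughly like this (a hedged sketch; disk and pool names are placeholders, run one zpool variant or the other):
# mdadm software RAID-0 (striping, no redundancy), with ext4 on top
mdadm --create /dev/md0 --level=0 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde
mkfs.ext4 /dev/md0
# ZFS raidz (RAID-5-like, single parity)
zpool create tank raidz sdb sdc sdd sde
# ZFS pool without raidz - effectively RAID-0
zpool create tank sdb sdc sdd sde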


  • What is a ZFS Virtual Device (VDEV)?

A VDEV is nothing but a physical disk, a file image, a partition, a ZFS software RAID device, or a hot spare for ZFS RAID. Examples are:

    /dev/sdb – a physical disk
    /images/200G.img – a file image
    /dev/sdc1 – A partition
https://www.cyberciti.biz/faq/how-to-install-zfs-on-ubuntu-linux-16-04-lts/
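
A small sketch of pools built from those vdev types (the pool names "testpool"/"testpool2" are made up; the file image path is the one from the example above):
mkdir -p /images && truncate -s 200G /images/200G.img
zpool create testpool /images/200G.img      # pool backed by a file image
zpool create testpool2 /dev/sdb /dev/sdc1   # pool backed by a whole disk plus a partition
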
  • OpenZFS is the open source implementation of ZFS, which is an advanced and highly scalable storage platform. Although ZFS was originally designed for Sun Solaris, you can use ZFS on most major Linux distributions with the help of the ZFS on Linux project, a part of the OpenZFS project.
https://www.vultr.com/docs/how-to-setup-openzfs-on-centos-7
  • A Guide to Install and Use ZFS on CentOS 7
Once a pool is created, it is possible to add or remove hot spares and cache devices from the pool, attach or detach devices from mirrored pools, and replace devices. However, non-redundant and raidz devices cannot be removed from a pool. We will see how to perform some of these operations in this section (a few are sketched below).
https://linoxide.com/tools/guide-install-use-zfs-centos-7/
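
Roughly, those operations look like this (a hedged sketch; the pool "tank" and all device names are placeholders):
zpool add tank spare sdj      # add a hot spare
zpool add tank cache sdk      # add an L2ARC cache device
zpool remove tank sdk         # spares and cache devices can be removed again
zpool attach tank sdb sdl     # attach sdl to sdb, turning it into (or growing) a mirror
zpool detach tank sdl         # detach one side of a mirrored vdev
zpool replace tank sdc sdm    # replace a (possibly failed) device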

  • The Z file system is a free and open-source logical volume manager built by Sun Microsystems for use in their Solaris operating system.
It is a 128-bit file system that is capable of managing zettabytes (one billion terabytes) of data.
ZFS is capable of many different RAID levels, all while delivering performance comparable to that of hardware RAID controllers.
We're not going to install ZFS as a root file system. This section assumes that you're using ext4 or some other file system and would like to use ZFS for some secondary hard drives.

https://www.howtogeek.com/175159/an-introduction-to-the-z-file-system-zfs-for-linux/

  • ZFS on Linux - the official OpenZFS implementation for Linux
By default the zfs-release package is configured to install DKMS style packages so they will work with a wide range of kernels. In order to install the kABI-tracking kmods, the default repository in the /etc/yum.repos.d/zfs.repo file must be switched from zfs to zfs-kmod (sketched below). Keep in mind that the kABI-tracking kmods are only verified to work with the distribution-provided kernel.
https://github.com/zfsonlinux/zfs/wiki/RHEL-and-CentOS
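
One way to flip the repositories (a sketch based on the repo names mentioned above; yum-config-manager comes from yum-utils):
yum-config-manager --disable zfs
yum-config-manager --enable zfs-kmod
yum install zfs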


  • The -f option is to ignore disk partition labels, since these are new disks.
raidz is the RAID level. RAIDZ is nothing but a variation of RAID-5 that allows for better distribution of parity and eliminates the "RAID-5 write hole" (data and parity inconsistency after a power loss).
http://www.thegeekstuff.com/2015/07/zfs-on-linux-zpool/
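
For example (a sketch; the pool name "mypool" and the disks are placeholders):
zpool create -f mypool raidz sdb sdc sdd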

  • 2. RAID-Z pools
Now we can also have a pool similar to a RAID-5 configuration, called RAID-Z. RAID-Z comes in 3 types: raidz1 (single parity), raidz2 (double parity), and raidz3 (triple parity). Let us see how we can configure each type.

Minimum disks required for each type of RAID-Z:
1. raidz1 – 2 disks
2. raidz2 – 3 disks
3. raidz3 – 4 disks

https://www.thegeekdiary.com/zfs-tutorials-creating-zfs-pools-and-file-systems/
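
Sketches of each type at its stated minimum disk count (run one at a time; the pool name and disks are placeholders):
zpool create testpool raidz1 sdb sdc            # 2 disks, single parity
zpool create testpool raidz2 sdb sdc sdd        # 3 disks, double parity
zpool create testpool raidz3 sdb sdc sdd sde    # 4 disks, triple parity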

  • Multiple Disk (RAID 0)
This will create a pool of storage where data is striped across all of the devices specified. Loss of any of the drives will result in losing all of your data.

RAID 10

Creating a RAID1 pool of two drives, and then adding another pair of mirroring drives as shown above would actually create a RAID 10 pool whereby data is striped over two mirrors. This results in better performance without sacrificing redundancy.

RAIDz3
Exactly the same as RAIDz2 except a third drive holds parity and the minimum number of drives is 4. Your array can lose 3 drives without loss of data.

http://blog.programster.org/zfs-create-disk-pools
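
The RAID 10 construction described above would look roughly like this (device names are placeholders):
zpool create tank mirror sdb sdc   # a two-drive mirror (RAID 1)
zpool add tank mirror sdd sde      # add a second mirror; data is now striped over two mirrors (RAID 10)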


  • What is a ZVOL?
A ZVOL is a "ZFS volume" that has been exported to the system as a block device

Creating a ZVOL
To create a ZVOL, we use the "-V" switch with our "zfs create" command and give it a size. This creates entries under /dev/zvol/tank/ and /dev/tank/ which point to a new block device in /dev/.
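
For example (a sketch; "tank" and the volume name "disk1" are placeholders):
zfs create -V 1G tank/disk1
ls -l /dev/zvol/tank/disk1 /dev/tank/disk1   # both point at the new block device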


Virtual Device
A VDEV is a meta-device that represents one or more physical devices. In Linux software RAID, you might have a "/dev/md0" device that represents a RAID-5 array of 4 disks; in that case, "/dev/md0" would be your "VDEV".

There are seven types of VDEVs in ZFS:

    disk (default) - The physical hard drives in your system.
    file - The absolute path of pre-allocated files/images.
    mirror - Standard software RAID-1 mirror.
    raidz1/2/3 - Non-standard distributed parity-based software RAID levels.
    spare - Hard drives marked as a "hot spare" for ZFS software RAID.
    cache - Device used for a level 2 adaptive read cache (L2ARC).
    log - A separate log (SLOG) called the "ZFS Intent Log" or ZIL.

RAID-0 is faster than RAID-1, which is faster than RAIDZ-1, which is faster than RAIDZ-2, which is faster than RAIDZ-3.


Nested VDEVs
VDEVs can be nested. A perfect example is a standard RAID-1+0 (commonly referred to as "RAID-10"). This is a stripe of mirrors
zpool create tank mirror sde sdf mirror sdg sdh

Real life example
In production, the files would be physical disks, and the ZIL and cache would be fast SSDs.
Notice that the names of the SSDs are "ata-OCZ-REVODRIVE_OCZ-33W9WE11E9X73Y41-part1", etc. These are found in /dev/disk/by-id/. The reason I chose these instead of "sdb" and "sdc" is because the cache and log devices don't necessarily store the same ZFS metadata. Thus, when the pool is being created on boot, they may not come into the pool and could be missing. Or, the motherboard may assign the drive letters in a different order. This isn't a problem with the main pool, but it is a big problem on GNU/Linux with log and cache devices. Using the device names under /dev/disk/by-id/ ensures greater persistence and uniqueness.
https://pthree.org/2012/12/04/zfs-administration-part-i-vdevs/
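
A pool of that shape would be created roughly like this (a sketch; the disk names and the SSD partition IDs are hypothetical stand-ins for real entries under /dev/disk/by-id/):
zpool create tank raidz1 sdd sde sdf sdg \
  log /dev/disk/by-id/ata-EXAMPLE_SSD_SERIAL-part1 \
  cache /dev/disk/by-id/ata-EXAMPLE_SSD_SERIAL-part2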

  •  RAIDZ-1
RAIDZ-1 is similar to RAID-5 in that there is a single parity bit distributed across all the disks in the array.
This still allows for one disk failure while maintaining your data.
Two disk failures would result in data loss.
A minimum of 3 disks should be used in a RAIDZ-1.
The capacity of your storage will be the number of disks in your array times the storage of the smallest disk, minus one disk for parity storage.
zpool create tank raidz1 sde sdf sdg

RAIDZ-2
RAIDZ-2 is similar to RAID-6 in that there is a dual parity bit distributed across all the disks in the array.
This still allows for two disk failures while maintaining your data.
Three disk failures would result in data loss.
A minimum of 4 disks should be used in a RAIDZ-2.
The capacity of your storage will be the number of disks in your array times the storage of the smallest disk, minus two disks for parity storage.
zpool create tank raidz2 sde sdf sdg sdh

RAIDZ-3
RAIDZ-3 does not have a standardized RAID level to compare it to. However, it is the logical continuation of RAIDZ-1 and RAIDZ-2 in that there is a triple parity bit distributed across all the disks in the array.
This still allows for three disk failures while maintaining your data.
Four disk failures would result in data loss.
A minimum of 5 disks should be used in a RAIDZ-3.
The capacity of your storage will be the number of disks in your array times the storage of the smallest disk, minus three disks for parity storage.
zpool create tank raidz3 sde sdf sdg sdh sdi

Hybrid RAIDZ
This setup is essentially a RAIDZ-1+0. With four RAIDZ-1 VDEVs, each VDEV will receive 1/4 of the data sent to the pool, and each piece will then be further striped across the disks in its VDEV (sketched below).
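
A sketch of such a pool with four RAIDZ-1 VDEVs of three disks each (all names are placeholders):
zpool create tank \
  raidz1 sdb sdc sdd \
  raidz1 sde sdf sdg \
  raidz1 sdh sdi sdj \
  raidz1 sdk sdl sdm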

In terms of performance, mirrors will always outperform RAIDZ levels, on both reads and writes. Further, RAIDZ-1 will outperform RAIDZ-2, which in turn will outperform RAIDZ-3. The more parity bits you have to calculate, the longer it's going to take to both read and write the data.

In a nutshell, from fastest to slowest, the non-nested RAID levels are:

    RAID-0 (fastest)
    RAID-1
    RAIDZ-1
    RAIDZ-2
    RAIDZ-3 (slowest)


https://pthree.org/2012/12/05/zfs-administration-part-ii-raidz/

  • zpool create tank sde sdf sdg sdh
For the next examples, we will assume 4 drives: /dev/sde, /dev/sdf, /dev/sdg and /dev/sdh, all 8 GB USB thumb drives
In this case, I'm using four VDEVs of type disk.

Consider doing something similar with LVM, RAID and ext4. You would need to do something like the steps sketched below.
 https://pthree.org/2012/12/04/zfs-administration-part-i-vdevs/
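
The mdadm/LVM/ext4 equivalent alluded to would look something like this (a sketch under the same four-drive assumption; the volume group "tank" and volume "vol0" are made-up names):
mdadm --create /dev/md0 --level=0 --raid-devices=4 /dev/sde /dev/sdf /dev/sdg /dev/sdh
pvcreate /dev/md0
vgcreate tank /dev/md0
lvcreate -l 100%FREE -n vol0 tank
mkfs.ext4 /dev/tank/vol0
mkdir -p /tank && mount /dev/tank/vol0 /tank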

  •  In these examples, we will assume our ZFS shared storage is named "tank". Further, we will assume that the pool is created with 4 preallocated files of 1 GB in size each, in a RAIDZ-1 array. Let's create some datasets.   
    https://pthree.org/2012/12/17/zfs-administration-part-x-creating-filesystems/
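
For example (the dataset names are arbitrary):
zfs create tank/music
zfs create tank/documents
zfs list            # datasets are mounted automatically under /tank by default
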
  • A ZVOL is a "ZFS volume" that has been exported to the system as a block device.
So far, when dealing with the ZFS filesystem, other than creating our pool, we haven't dealt with block devices at all, even when mounting the datasets. It's almost like ZFS is behaving like a userspace application more than a filesystem.
A ZVOL is a ZFS block device that resides in your storage pool.

This means that the single block device gets to take advantage of your underlying RAID array, such as mirrors or RAID-Z. It gets to take advantage of the copy-on-write benefits, such as snapshots. It gets to take advantage of online scrubbing, compression and data deduplication. It gets to take advantage of the ZIL and ARC. Because it's a legitimate block device, you can do some very interesting things with your ZVOL. We'll look at three of them here- swap, ext4, and VM storage.


Creating a ZVOL
To create a ZVOL, we use the "-V" switch with our "zfs create" command, and give it a size.

Ext4 on a ZVOL
You could put another filesystem on top of a ZVOL and mount it. In other words, you could have an ext4-formatted ZVOL mounted at /mnt. You could even partition your ZVOL and put multiple filesystems on it.
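
A sketch of that (the pool "tank", volume name "ext4vol", and mountpoint are placeholders):
zfs create -V 10G tank/ext4vol
mkfs.ext4 /dev/zvol/tank/ext4vol
mkdir -p /mnt/ext4vol
mount /dev/zvol/tank/ext4vol /mnt/ext4vol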


ZVOL storage for VMs
You can use these block devices as the backend storage for VMs; it's not uncommon to create logical volume block devices as the backend for VM storage. After making the block device available to QEMU, you attach it to the virtual machine, and from the guest's perspective you have a "/dev/vda" or "/dev/sda", depending on the setup.

https://pthree.org/2012/12/21/zfs-administration-part-xiv-zvols/
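
For instance, handing a ZVOL to QEMU directly might look like this (a sketch; the ZVOL name and sizes are placeholders):
zfs create -V 20G tank/vm1
qemu-system-x86_64 -m 2048 -enable-kvm -drive file=/dev/zvol/tank/vm1,format=raw,if=virtio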


  • The ZFS file system provides data integrity features for storage drives using its Copy On Write (CoW) technology and improved RAID, but these features have been limited to storage drives previously. If you have a drive failure, utilizing RAID or mirroring will protect your volumes, but what happens if your boot drive fails? In the past, if you used FreeNAS, you had no option other than having your storage go offline and remain unusable until it was repaired, and the ability to mirror was only available in TrueNAS, which utilized the underlying FreeBSD code.
In older versions, the FreeNAS and TrueNAS boot drives used the UFS (Unix File System), an older file system that does not include the advanced data integrity features found in ZFS. This has recently changed on current versions of TrueNAS and FreeNAS, and now ZFS can be installed on boot drives using the menu-driven installer via a simple interface.
https://www.ixsystems.com/blog/root-on-zfs/

  • Red Hat deprecates BTRFS, is Stratis the new ZFS-like hope?
Btrfs was originally created in 2008 to be a Linux alternative to ZFS
Red Hat introduced Btrfs as a technology preview in RHEL 6 and has over the years been one of its major contributors.

Stratis vs BTRFS/ZFS
Just like the other two competitors, the newly born Stratis aims to fill this gap on Linux, but there are major differences. Aside from the brave decision to use the Rust programming language, Stratis aims to provide Btrfs/ZFS-esque features using an incremental approach: rather than rebuilding the entire stack, Stratis aims to extend existing projects to provide the user with a single point of interaction. Currently, the most appealing projects to be extended are DM, XFS, and LVM (much debated).
https://www.marksei.com/red-hat-deprecates-btrfs-stratis/

  • A daemon that manages a pool of block devices to create flexible filesystems.
https://github.com/stratis-storage/stratisd
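
Basic usage looks roughly like this (a sketch; the pool and filesystem names are placeholders, and the device path may vary with the Stratis version):
stratis pool create mypool /dev/sdb
stratis filesystem create mypool fs1
mount /dev/stratis/mypool/fs1 /mnt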

  • Open ZFS vs. Btrfs | and other file systems
“The only thing worse than competition is no competition.”
SUSE continues to support Btrfs in only RAID 10 equivalent configurations, and only time will tell if bcachefs proves to be a compelling alternative to OpenZFS.
It is an honor to work with the OpenZFS community and iXsystems in particular who, thanks to FreeNAS, TrueNAS, and TrueOS, have put OpenZFS in more hands than any other project or product on Earth.
https://www.ixsystems.com/blog/open-zfs-vs-btrfs/

  • Btrfs has been deprecated
The Btrfs file system has been in Technology Preview state since the initial release of Red Hat Enterprise Linux 6. Red Hat will not be moving Btrfs to a fully supported feature and it will be removed in a future major release of Red Hat Enterprise Linux.
The Btrfs file system did receive numerous updates from the upstream in Red Hat Enterprise Linux 7.4 and will remain available in the Red Hat Enterprise Linux 7 series. However, this is the last planned update to this feature.
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/7.4_release_notes/chap-red_hat_enterprise_linux-7.4_release_notes-deprecated_functionality


  • Copy-on-write (CoW or COW), sometimes referred to as implicit sharing or shadowing, is a resource-management technique used in computer programming to efficiently implement a "duplicate" or "copy" operation on modifiable resources. If a resource is duplicated but not modified, it is not necessary to create a new resource; the resource can be shared between the copy and the original. Modifications must still create a copy, hence the technique: the copy operation is deferred to the first write. By sharing resources in this way, it is possible to significantly reduce the resource consumption of unmodified copies, while adding a small overhead to resource-modifying operations.

https://en.wikipedia.org/wiki/Copy-on-write


  • The principle that you can efficiently share as many read-only copies of an object as you want until you need to modify it. Then you need to have your own copy. 

http://wiki.c2.com/?CopyOnWrite

  • Copy-on-write

All blocks within the ZFS filesystem contain a checksum of the target block, which is verified when the block is read. Blocks that contain active data are never overwritten in place; instead, a new block is allocated and the modified data is written to it. This means that existing data is never overwritten while new data is being stored, which helps ensure data integrity.

https://www.ixsystems.com/freenas-mini
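
One practical consequence: because old blocks stick around until they are freed, snapshots are cheap. A sketch (the dataset and snapshot names are placeholders):
zfs snapshot tank/documents@before-edit
# ...modify files; only changed blocks consume new space...
zfs rollback tank/documents@before-edit
zfs list -t snapshot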
