Bumpy start: OpenZFS 2.4.0 with fast encryption – and problems
OpenZFS 2.4.0 brings interesting new features as well as many bug fixes. Some areas still seem to have problems, notably block cloning and NVMe pools.
- Michael Plura
In December, the developers of OpenZFS released version 2.4.0 of their self-healing file system, which evolved from the ZFS that Sun Microsystems developed for Solaris over 20 years ago. Supported are Linux kernels 4.18 through 6.18 as well as FreeBSD 14, the current production release 15, and the development branch 16 ("-current"), whose release is expected in about two years.
A patch by Ameer Hamza introduces default quotas for users, groups, and projects in OpenZFS – including object quotas – thus ensuring consistent limits even when no individual limits are set. Kernel and userspace tools (zfs {user|group|project}space) are adapted under FreeBSD and Linux so that default quotas are displayed if no individual quotas have been configured.
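The fallback described above can be illustrated with a minimal sketch. It is not OpenZFS code: the function name `effective_quota` and the dictionary layout are hypothetical and only model the rule "individual limit if set, otherwise the default".

```python
from typing import Dict, Optional


def effective_quota(
    individual: Dict[str, int],
    default: Optional[int],
    name: str,
) -> Optional[int]:
    """Return the limit that applies to a user, group, or project.

    Hypothetical model: an individual quota wins; otherwise the default
    quota applies; None means no limit at all.
    """
    if name in individual:
        return individual[name]
    return default


# Example: only "alice" has her own limit, everyone else gets the default.
quotas = {"alice": 50 * 2**30}
default_quota = 10 * 2**30
```

In this model, `effective_quota(quotas, default_quota, "bob")` yields the 10 GiB default, while "alice" keeps her individual 50 GiB.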
A change by Alexander Motin extends ZIL allocation so that, when no SLOG is present, special vdevs (typically SSDs) are also used for ZIL blocks instead of placing them on HDDs with their higher latency. HDD pools with a fast special vdev can thus handle synchronous workloads better without an additional SLOG, and SSD wear can be kept down within certain limits.
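As a rough illustration of this preference order, the following sketch picks a target for ZIL blocks. The function `pick_zil_target` and the class names it returns are assumptions made for this example; the real allocator takes more factors into account.

```python
def pick_zil_target(has_slog: bool, has_special: bool) -> str:
    """Pick where ZIL blocks go in this simplified model.

    Assumed order: a dedicated SLOG first, then a special vdev,
    and only as a last resort the normal (e.g. HDD) vdevs.
    """
    if has_slog:
        return "slog"
    if has_special:
        return "special"
    return "normal"
```

For an HDD pool with a special vdev but no SLOG, the sketch returns "special", which corresponds to the new behavior described in the article.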
As early as February last year, Joel Low ported code from Google's BoringSSL to OpenZFS, which is said to make encryption up to 80 percent faster. To achieve this, the Google developers used an AES-GCM (Galois/Counter Mode) implementation optimized for vector AES instructions, which on AMD Zen 3 CPUs uses fast AVX2 instead of AVX-512/AVX10.
Three new commands
Anyone using OpenZFS 2.4.0 should familiarize themselves with three new commands: "zfs rewrite -P" attempts to preserve the birth time when rewriting blocks, saving time and resources since the actual data doesn't change. With "zpool scrub -S -E", scrubbing can be limited to specific time periods (based on Transaction Groups / TXG) – however, some problems are reported in the commit [https://github.com/openzfs/zfs/pull/16853]. Finally, "zpool prefetch -t brt" is intended to pre-read the metadata of the BRT (Block Reference Table) into the ARC (Adaptive Replacement Cache) to accelerate block cloning and block deallocation (more on this below).
The majority of commits come from Rob Norris (229 commits) and Alexander Motin (119 commits), both working for Klara, a company specializing in FreeBSD, ZFS, and ARM. Eight developers have double-digit commits, while the vast majority are limited to exactly one commit.
Faster writes without cache and trouble with "Gang Blocks"
While the innovations above each consist of a single commit, four areas of OpenZFS 2.4.0 bundle a whole series of detail improvements: Uncached I/O, Gang Blocks, Deduplication, and Block Cloning. Alexander Motin is working on optimizing Uncached I/O operations, whose performance lies between fast Direct I/O (with restrictions such as page alignment) and regular Cached I/O. If Direct I/O is not an option in a given scenario, OpenZFS should fall back to Uncached I/O rather than, as before, to the even slower Cached I/O.
Several fixes are intended to improve the use of "Gang Blocks". Gang Blocks are a kind of ZFS emergency mechanism that kicks in when there is no longer enough contiguous free space for a large data or metadata block. In this case, ZFS splits the block into several smaller physical blocks and additionally stores at least two redundant Gang Block Headers that point to these partial blocks, so that the block can still be treated logically as a single unit. One of the improvements is that Gang Block Headers are no longer fixed at 512 bytes but can have a dynamic size.
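The basic idea of the mechanism can be sketched in a few lines. This is a simplified model and not the OpenZFS implementation: `gang_allocate`, the list of free segment sizes, and the returned dictionary are hypothetical, and details such as nested gang blocks are left out.

```python
from typing import Dict, List


def gang_allocate(block_size: int, free_segments: List[int]) -> Dict:
    """Allocate a block, falling back to a gang of smaller pieces.

    free_segments holds the sizes of contiguous free regions.
    """
    # Normal case: one contiguous region is large enough.
    if any(seg >= block_size for seg in free_segments):
        return {"gang": False, "pieces": [block_size]}

    # Fallback: collect smaller regions until the block fits.
    pieces = []
    remaining = block_size
    for seg in sorted(free_segments, reverse=True):
        if remaining == 0:
            break
        take = min(seg, remaining)
        pieces.append(take)
        remaining -= take
    if remaining > 0:
        raise ValueError("not enough free space, even when fragmented")

    # The header that points to the pieces is stored redundantly
    # (at least two copies, as described in the text).
    return {"gang": True, "pieces": pieces, "header_copies": 2}
```

With free regions of 64, 32, 32, and 16 units, a 128-unit block cannot be placed contiguously, so the sketch splits it into pieces of 64, 32, and 32 units.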
The ever-growing data collection frenzy of corporations and governments makes optimizations in OpenZFS deduplication particularly necessary. A total of eight commits are intended to help OpenZFS 2.4.0 save storage space.
Problem child Block Cloning fixed – or not?
As Alexander Motin clarifies, a structural error was made in the original implementation of block cloning, affecting the "BRT ZAP Entries", but it has now been corrected. Block cloning allows files or parts of them to be copied by creating only references to existing blocks, rather than duplicating data. This saves space and time because the data does not need to be rewritten.
The Block Reference Table (BRT) is a new metadata object in OpenZFS (introduced in 2.2) that supports block cloning or "Reflinks". OpenZFS stores BRT entries in a ZAP object. ZAP (ZFS Attribute Processor) is a flexible on-disk structure for key/value data, such as directories, properties, or reference tables. Further fixes are intended to make OpenZFS block cloning more stable and less error-prone. Faulty block cloning already caused data loss in OpenZFS 2.2.0.
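How cloning by reference works in principle can be shown with a small in-memory model. All names here (`ClonePool`, `refs`, and so on) are hypothetical; the real BRT is an on-disk structure with considerably more bookkeeping. The sketch only illustrates that a clone adds references instead of copying data, and that a block is freed only once the last reference is gone.

```python
from typing import Dict, List


class ClonePool:
    """Toy model of block cloning with a reference table."""

    def __init__(self) -> None:
        self.blocks: Dict[int, bytes] = {}    # block id -> data
        self.refs: Dict[int, int] = {}        # block id -> additional references
        self.files: Dict[str, List[int]] = {}  # file name -> block ids
        self._next_id = 0

    def write(self, name: str, chunks: List[bytes]) -> None:
        """Store a new file; every chunk becomes its own block."""
        ids = []
        for chunk in chunks:
            self.blocks[self._next_id] = chunk
            ids.append(self._next_id)
            self._next_id += 1
        self.files[name] = ids

    def clone(self, src: str, dst: str) -> None:
        """Copy a file by referencing its blocks; no data is written."""
        ids = list(self.files[src])
        for block_id in ids:
            self.refs[block_id] = self.refs.get(block_id, 0) + 1
        self.files[dst] = ids

    def delete(self, name: str) -> None:
        """Drop a file; free a block only if no clone still uses it."""
        for block_id in self.files.pop(name):
            if self.refs.get(block_id, 0) > 0:
                self.refs[block_id] -= 1
                if self.refs[block_id] == 0:
                    del self.refs[block_id]
            else:
                del self.blocks[block_id]
```

If a file is written, cloned, and the original is then deleted, the data blocks remain in the pool because the clone still references them. The sketch also suggests why errors in this bookkeeping are critical: a wrong reference count can release blocks that are still in use.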
Is some NVMe hardware not OpenZFS compatible?
In addition to the problems caused by block cloning, OpenZFS also seems to be having some issues with NVMe pools. Besides comments on OpenZFS postings and forum entries, there are also long discussions, for example on GitHub. Almost all of these discussions point to problems with NVMe drives in particular; SATA and SAS HDDs/SSDs seem to be less affected. Whether the error lies with OpenZFS or with the hardware is not always clear, which leads to heated arguments.
Another possibility is that some NVMe hardware is simply overwhelmed by OpenZFS under high load. If this hardware works with other file systems or even older ZFS variants, and if the current OpenZFS also runs cleanly almost everywhere on other systems – is the special combination the problem? Perhaps a closer look at the NVMe controller hardware and its firmware used is also advisable? Consumer hardware in particular is often built at the edge of its specifications and could then tend to errors under harsher conditions with OpenZFS. The problem should be analyzed precisely, as powerful OpenZFS installations in particular often rely on NVMe hardware.
OpenZFS is available for GNU/Linux, FreeBSD, NetBSD, macOS, OpenSolaris, Illumos, and OpenIndiana. The source code for OpenZFS 2.4.0, along with a detailed list of all new features and changes, is available on GitHub.
(mki)