i can't believe it's not btrfs

Examining btrfs, Linux’s perpetually half-finished filesystem

This btrfs filesystem overview highlights some longstanding shortcomings.

We don't recommend allowing btrfs to directly manage a complex array of disks—floppy or otherwise.

Btrfs—short for "B-Tree File System" and frequently pronounced "butter" or "butter eff ess"—is the most advanced filesystem present in the mainline Linux kernel. In some ways, btrfs simply seeks to supplant ext4, the default filesystem for most Linux distributions. But btrfs also aims to provide next-gen features that break the simple "filesystem" mold, combining the functionality of a RAID array manager, a volume manager, and more.

We have good news and bad news about this. First, btrfs is a perfectly cromulent single-disk ext4 replacement. But if you're hoping to replace ZFS—or a more complex stack built on discrete RAID management, volume management, and a simple filesystem—the picture isn't quite so rosy. Although the btrfs project has fixed many of the glaring problems it launched with in 2009, other problems remain essentially unchanged 12 years later.

History

Chris Mason is the founding developer of btrfs, which he began working on in 2007 while working at Oracle. This leads many people to believe that btrfs is an Oracle project—it is not. The project belonged to Mason, not to his employer, and it remains a community project unencumbered by corporate ownership to this day. In 2009, btrfs 1.0 was accepted into the mainline Linux kernel 2.6.29.

Although btrfs entered mainline in 2009, it wasn't actually production-ready. For the next four years, any admin who dared mkfs a btrfs was greeted with the following deliberately scary warning, which required a non-default Y to proceed:

Btrfs is a new filesystem with extents, writable snapshotting,
support for multiple devices and many more features.

Btrfs is highly experimental, and THE DISK FORMAT IS NOT YET
FINALIZED. You should say N here unless you are interested in
testing Btrfs with non-critical data.

Linux users being Linux users, many chose to ignore this warning—and, unsurprisingly, many lost data as a result. This four-year-long wide beta may have had a lasting impact on the btrfs dev community, who in my experience tended to fall back on "well, it's all beta anyway" whenever user-reported problems cropped up. This was happening well after mkfs.btrfs lost its scare dialog in late 2013.

It has now been nearly eight years since the "experimental" tag was removed, but many of btrfs' age-old problems remain unaddressed and effectively unchanged. So, we'll repeat this once more: as a single-disk filesystem, btrfs has been stable and for the most part performant for years. But the deeper you get into the new features btrfs offers, the shakier the ground you walk on—that's what we're focusing on today.

Features

Btrfs only has one true competitor in the Linux and BSD filesystem space: OpenZFS. It's almost impossible to avoid comparing and contrasting btrfs with OpenZFS, since the Venn diagram of their respective feature sets is little more than a single, slightly lumpy circle. But we're going to try to avoid direct comparisons of the two as much as possible. If you're an OpenZFS admin, you already know how they stack up; and if you're not, detailed comparisons aren't really helpful.

In addition to being a simple single-disk filesystem, btrfs offers multiple disk topologies (RAID), volume-managed storage (cf. the Linux Logical Volume Manager), atomic copy-on-write snapshots, asynchronous incremental replication, automatic healing of corrupt data, and on-disk compression.
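Here's a rough sketch of what exercising a few of those features looks like from the command line; the device names, mount points, and snapshot names are placeholders, and your distribution's defaults may differ:

# Mount a btrfs filesystem with transparent zstd compression
mount -o compress=zstd /dev/sdb /mnt/tank

# Take an atomic, read-only snapshot of a subvolume
btrfs subvolume snapshot -r /mnt/tank/data /mnt/tank/data@today

# Replicate it incrementally to another btrfs filesystem, using an earlier snapshot as the parent
btrfs send -p /mnt/tank/data@yesterday /mnt/tank/data@today | btrfs receive /mnt/backup

# Scrub the filesystem: verify checksums and repair bad blocks from redundant copies
btrfs scrub start /mnt/tank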

Comparison with legacy storage

If you wanted to build a btrfs- and ZFS-free system with similar features, you'd need a stack of discrete layers—mdraid at the bottom for RAID, LVM next for snapshots, and then a filesystem such as ext4 or xfs on top of your storage sundae.
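Built by hand, that sundae looks roughly like the following sketch; the device names, volume group name, and sizes here are hypothetical:

# RAID layer: a three-disk RAID5 under mdraid
mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sdb /dev/sdc /dev/sdd

# Volume management layer: LVM on top of the array
pvcreate /dev/md0
vgcreate tank /dev/md0
lvcreate -n data -L 500G tank

# Filesystem layer: plain ext4 on the logical volume
mkfs.ext4 /dev/tank/data

# LVM snapshots require dedicated copy-on-write space reserved up front
lvcreate -s -n data-snap -L 10G /dev/tank/data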

An mdraid+LVM+ext4 storage stack still ends up missing some of btrfs' most theoretically compelling features, unfortunately. LVM offers atomic snapshots but no direct snapshot replication. Neither ext4 nor xfs offers inline compression. And although mdraid can offer data healing if you enable the dm-integrity target, the way it does so kinda sucks.

The dm-integrity target defaults to an extremely weak crc32 hash algorithm that's prone to collisions, requires completely overwriting target devices on initialization, and requires entirely rewriting every block of a replaced disk after a failure—above and beyond the full drive-write necessary during initialization.
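For reference, the usual way to get that healing is to format each member disk with dm-integrity (via integritysetup) and then assemble the mdraid array on top of the resulting mapper devices, roughly as sketched below with hypothetical device names. Note that the format step alone performs the full overwrite described above:

# Initialize dm-integrity on each member disk (this rewrites the entire device)
integritysetup format /dev/sdb
integritysetup open /dev/sdb integ-sdb
integritysetup format /dev/sdc
integritysetup open /dev/sdc integ-sdc

# Build the mdraid array on the integrity-protected mapper devices
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/mapper/integ-sdb /dev/mapper/integ-sdc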

In short, you really can't replicate btrfs' entire promised feature set on a legacy storage stack. To get the whole bundle, you need either btrfs or ZFS.

Btrfs multiple-disk topologies

Now that we've covered where things go wrong with a legacy storage stack, it's time to take a look at where btrfs itself falls down. For that, the first place we'll look is in btrfs' multiple disk topologies.

Btrfs offers five multiple disk topologies: btrfs-raid0, btrfs-raid1, btrfs-raid10, btrfs-raid5, and btrfs-raid6. Although the documentation tends to refer to these topologies more simply—e.g., just as raid1 rather than btrfs-raid1—we strongly recommend keeping the prefix in mind. These topologies can in some cases be extremely different from their conventional counterparts.

Topology | Conventional version | Btrfs version
RAID0 | Simple stripe—lose any disk, lose the array | Simple stripe—lose any disk, lose the array
RAID1 | Simple mirror—all data blocks on the two disks of a mirrored pair are identical | Guaranteed redundancy—copies of all blocks will be saved on two separate devices
RAID10 | Striped mirror sets—e.g., a stripe across three mirrored disk pairs | Striped mirror sets—e.g., a stripe across three mirrored disk pairs
RAID5 | Diagonal parity RAID—single parity (one parity block per stripe), fixed stripe width | Diagonal parity RAID—single parity (one parity block per stripe) with variable stripe width
RAID6 | Diagonal parity RAID—double parity (two parity blocks per stripe), fixed stripe width | Diagonal parity RAID—double parity (two parity blocks per stripe) with variable stripe width

As you can see in the chart above, btrfs-raid1 differs pretty drastically from its conventional analogue. To understand how, let's think about a hypothetical collection of "mutt" drives of mismatched sizes. If we have one 8T disk, three 4T disks, and a 2T disk, it's difficult to make a useful conventional RAID array from them—for example, a RAID5 or RAID6 would need to treat them all as 2T disks, yielding only 10T of raw capacity from 22T of actual disk (and just 8T usable after RAID5's single parity).

However, btrfs-raid1 offers a very interesting premise. Since it doesn't actually marry disks together in pairs, it can use the entire collection of disks without waste. Any time a block is written to a btrfs-raid1 array, it's written identically to two separate disks—any two separate disks. Since there are no fixed pairings, btrfs-raid1 is free to fill each disk at a rate roughly proportional to its free capacity.
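Creating such an array is a one-liner. A hypothetical example with our mutt drives (device names assumed) might look like the following, with btrfs filesystem usage showing afterward how the paired copies spread across the mismatched devices:

# Data and metadata both stored with two copies, spread across all five disks
mkfs.btrfs -d raid1 -m raid1 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde

# Mounting any member device brings up the whole array
mount /dev/sda /mnt/tank

# Shows how much space has been allocated on each device
btrfs filesystem usage /mnt/tank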

The btrfs-raid5 and btrfs-raid6 topologies are somewhat similar to btrfs-raid1 in that—unlike their conventional counterparts—they're capable of handling mismatched drive sizes by dynamically altering stripe width as smaller drives fill up. Neither btrfs-raid5 nor btrfs-raid6 should be used in production, however, for reasons we'll go into shortly.

The btrfs-raid10 and btrfs-raid0 topologies are much closer to their conventional counterparts, and for most purposes these can be thought of as direct replacements that share the same strengths and weaknesses.
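Whichever topology you pick, the profile is simply an argument to mkfs.btrfs, and an existing filesystem can be converted to a different profile later with a balance. A quick sketch, with assumed device and mount names:

# Create a four-disk btrfs-raid10
mkfs.btrfs -d raid10 -m raid10 /dev/sdb /dev/sdc /dev/sdd /dev/sde

# Or convert an existing filesystem's data and metadata to btrfs-raid1 after adding a disk
btrfs device add /dev/sdf /mnt/tank
btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/tank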
