Case study: recovery of a corrupted 12 TB multi-device pool

Why a Power Cycle Can Compromise Filesystem Integrity

The incident involved a 12 TB Btrfs pool spread across three devices, configured with `data=single` and `metadata=DUP`. The disks were DM-SMR (device-managed shingled magnetic recording) drives, which add complexity through internal write amplification and variable performance. The filesystem used the `MIXED_BACKREF`, `COMPRESS_ZSTD`, `BIG_METADATA`, `EXTENDED_IREF`, `SKINNY_METADATA`, and `NO_HOLES` features. This combination of advanced Btrfs features and DM-SMR drives is representative of a demanding production deployment, and it underscores the complexities inherent in such setups.

Btrfs, particularly with `metadata=DUP`, promises resilience. Metadata blocks are written twice, ideally to different physical locations, to protect against single-block corruption. However, a hard power cycle, specifically one that interrupted a commit between generations 18958 and 18959, revealed a critical vulnerability. The DUP copies of metadata blocks ended up with inconsistent parent/child generations. This represented a fundamental break in the filesystem's internal consistency model, specifically within the extent tree and free space tree. Corruption in these core B-trees compromises the filesystem's ability to locate data, manage free space, and interpret its own structure.
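Conceptually, the inconsistency looks like a parent node whose child pointer records one generation while the child block on disk carries another. The following C sketch models that check with simplified, hypothetical structures; the real on-disk Btrfs headers and key pointers are more involved:

```c
#include <stdint.h>

/* Illustrative, simplified view of a parent -> child pointer check.
   Field names are stand-ins, not the real btrfs on-disk structures. */

struct key_ptr {             /* entry in an internal (parent) node     */
    uint64_t blockptr;       /* where the child block lives            */
    uint64_t expected_gen;   /* generation the parent recorded for it  */
};

struct block_header {        /* header of the child block itself       */
    uint64_t bytenr;
    uint64_t generation;     /* generation when the child was written  */
};

/* A parent/child generation mismatch means the commit that updated the
   parent and the commit that wrote the child were torn apart, e.g. by
   a power cut between generations 18958 and 18959. */
int generation_consistent(const struct key_ptr *ptr,
                          const struct block_header *child)
{
    return ptr->blockptr == child->bytenr &&
           ptr->expected_gen == child->generation;
}
```

A checker walking the tree applies this predicate to every pointer it follows; any mismatch marks the subtree as torn.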

Native Tools Fail: A Critical Vulnerability

The system immediately became unmountable. The native `btrfs check --repair` tool, the primary recovery path, failed. First, it rejected the command with "read-only file system." Then, attempts with `--init-extent-tree` deadlocked. This illustrates a circular dependency: rebuilding the extent tree requires block allocation, which in turn relies on a functional extent tree.

The most critical failure involved `btrfs check --repair` entering an infinite loop. It advanced the pool generation by over 46,000 commits with zero net progress. This not only consumed valuable time but also irrevocably erased any possibility of a straightforward rollback. The four superblock `backup_roots` slots, often mistaken for historical backups, actually function as a four-commit sliding window. The infinite loop overwrote these slots approximately 11,000 times each, erasing all pre-crash rollback points. Misreading this design invites false confidence: `backup_roots` supports minor rollbacks, not disaster recovery.
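The sliding-window behavior is easy to model. The sketch below is illustrative only (not btrfs-progs code) and shows why ~46,000 commits cycling through four slots leave nothing older than the last four generations:

```c
#include <stdint.h>

/* Illustrative model of the four superblock backup_roots slots:
   a ring buffer indexed by generation, not a set of historical
   backups. Struct and field names are simplified stand-ins. */

#define BACKUP_SLOTS 4

struct backup_root { uint64_t tree_root_gen; };

/* Each commit overwrites exactly one slot: generation mod 4. */
void record_commit(struct backup_root slots[BACKUP_SLOTS], uint64_t gen)
{
    slots[gen % BACKUP_SLOTS].tree_root_gen = gen;
}

/* After four or more further commits, every pre-window generation
   is gone: the oldest recoverable root is at most four commits old. */
uint64_t oldest_recoverable(const struct backup_root slots[BACKUP_SLOTS])
{
    uint64_t oldest = slots[0].tree_root_gen;
    for (int i = 1; i < BACKUP_SLOTS; i++)
        if (slots[i].tree_root_gen < oldest)
            oldest = slots[i].tree_root_gen;
    return oldest;
}
```

Feeding this model 46,000 commits starting at generation 18959 leaves the oldest slot only four commits behind the head, which is exactly why the pre-crash roots were unrecoverable.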

Figure 1: Visual representation of corrupted metadata blocks (red) alongside intact data streams (green), illustrating the selective nature of the corruption.

Data Recovery: The Trade-off Between Consistency and Availability

Recovering this 12 TB pool required extensive architectural understanding and painstaking manual effort. The engineer developed 14 custom C tools, built against the internal `btrfs-progs` API, along with a single-line patch to `alloc_reserved_tree_block` to tolerate `EEXIST`. This was not a quick fix; it involved a staged recovery plan, empirically validated at each step.
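The spirit of that single-line patch can be shown in isolation. The sketch below uses a hypothetical in-memory stand-in for the extent insert; only the `EEXIST`-tolerance idea comes from the incident, everything else is illustrative:

```c
#include <errno.h>
#include <stdint.h>

#define MAX_EXTENTS 64

static uint64_t inserted[MAX_EXTENTS];
static int n_inserted;

/* Hypothetical stand-in for an extent-item insert: returns -EEXIST
   when the same bytenr was already recorded. The real function in
   btrfs-progs operates on the extent tree, not an array. */
static int insert_extent_item(uint64_t bytenr)
{
    for (int i = 0; i < n_inserted; i++)
        if (inserted[i] == bytenr)
            return -EEXIST;
    if (n_inserted >= MAX_EXTENTS)
        return -ENOSPC;
    inserted[n_inserted++] = bytenr;
    return 0;
}

/* The one-line idea behind the alloc_reserved_tree_block patch:
   treat EEXIST as success, so a repeated repair pass that re-inserts
   an item a previous pass already wrote is idempotent rather than
   fatal. */
int insert_tolerating_eexist(uint64_t bytenr)
{
    int ret = insert_extent_item(bytenr);
    return ret == -EEXIST ? 0 : ret;
}
```

Other errors (such as `-ENOSPC` here) still propagate; only the "already present" case is downgraded to success.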

The incident highlighted a fundamental trade-off: when native tools fail, manual recovery inherently prioritizes data consistency over immediate availability, and the pool experienced extended downtime. The goal was not 100% data recovery, which is often impractical in severe corruption scenarios, but maximal data recovery with a fully operational filesystem. The outcome was impressive: approximately 7.2 MB of data loss out of 4.59 TB (about 0.00016 percent). Although the pool returned to full operation, `btrfs check --readonly` still reported residual cosmetic errors, such as 712 incorrect link counts and 393,057 backpointer mismatches, which did not impede normal operation.

This situation mirrors the trade-offs of distributed system design, where a severe partition forces a sacrifice of availability until consistency is restored. The custom tools re-established consistency by rebuilding internal B-trees, fixing backreferences, and correcting accounting errors. For instance, the `rebuild_extent_tree_apply` tool injected 3,248,617 EXTENT_DATA_REFs in a single transaction, a process that took approximately 34 minutes on the DM-SMR disks, during which the pool remained offline.

The custom tools developed for this recovery embodied several principles crucial for robust distributed system design. They featured a mandatory read-only scan mode for forensic analysis prior to any write operations, complemented by an explicit, opt-in `--write` mode as a critical safety mechanism. Crucially, the staged recovery process inherently demanded idempotency or careful ordering for many operations, such as `scan_and_fix_all_backrefs` or `rebuild_extent_tree_apply`, ensuring that repeated execution would not worsen the filesystem state. This approach is fundamental for iterative and complex recovery scenarios.
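A minimal sketch of that safety convention, with an illustrative flag name and option handling (not the actual tools' code):

```c
#include <string.h>

/* Sketch of the safety convention the custom tools followed:
   forensic read-only scanning by default, destructive writes only
   behind an explicit --write flag. Names are illustrative. */

struct tool_opts {
    int write_enabled;   /* 0 = read-only scan (the safe default) */
};

void parse_opts(struct tool_opts *opts, int argc, char **argv)
{
    opts->write_enabled = 0;             /* default: never touch disk */
    for (int i = 1; i < argc; i++)
        if (strcmp(argv[i], "--write") == 0)
            opts->write_enabled = 1;
}

/* Every mutating code path checks this before writing anything. */
int may_write(const struct tool_opts *opts)
{
    return opts->write_enabled;
}
```

The point of the pattern is that forgetting a flag yields a harmless scan, never an accidental write.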

Proposed Upstream Improvements from this Incident

Analysis of the incident revealed several architectural flaws within the `btrfs-progs` implementation that directly informed concrete proposals for upstream development. These are not minor bugs; they represent critical vulnerabilities in the filesystem's self-healing mechanisms, particularly under severe, multi-point corruption.

The identified deficiencies and corresponding proposals include:

  • Progress Detection in `btrfs check --repair`: Implement mechanisms to abort after a defined number of non-decreasing error passes, preventing infinite loops and the destruction of `backup_roots` slots.
  • Symmetric `reinit_extent_tree` Handling: Extend `reinit_extent_tree` to tolerate `BTRFS_ADD_DELAYED_REF` failures for sharable blocks, mirroring the existing exemption for `BTRFS_DROP_DELAYED_REF`, to prevent crashes during rebalance.
  • Sibling Safety Precheck in `btrfs_del_items`: Introduce a precheck to skip rebalance operations if `last_snapshot < trans->transid` and the system is in recovery mode, preventing corruption of pre-crash leaves by unconditional copy-on-write on stale siblings.
  • Supervised EEXIST Handling in `alloc_reserved_tree_block`: Make the handling of `EEXIST` configurable with explicit modes (error, silent, update) to prevent infinite loops caused by inadequate error management.
  • Extent Tree Rebuild in Userspace: Develop a new `btrfs rescue rebuild-extent-tree` subcommand that operates from a pre-scanned reference list, offering a robust alternative to the currently deadlocking `--init-extent-tree`.
  • Orphan Inode Cleanup with Bulletproof Criterion: Introduce a `btrfs rescue clean-orphan-inodes` subcommand with a built-in dry-run mode. This tool would apply a stringent 5-condition check to identify safe candidates for deletion, producing a machine-readable plan. The "bulletproof subset criterion" for safe orphan inode deletion requires:
    • The leaf hosting the inode's `INODE_ITEM` has `gen > last_snapshot`.
    • The leaf's parent level-1 node has `gen > last_snapshot`.
    • The estimated used bytes of the leaf after removing the target items exceed `LEAF_DATA_SIZE/4` (to avoid a forced rebalance).
    • All immediate slot-adjacent siblings of the leaf in its parent node have `gen > last_snapshot`.
    • The `disk_bytenr` of every `EXTENT_DATA` referenced by the inode resolves in the current extent tree.
  • Surgical `BLOCK_GROUP_ITEM.used` Patch: Extend `btrfs rescue` with a `fix-bg-accounting` subcommand for precise adjustments to `BLOCK_GROUP_ITEM.used` after bulk extent tree rebuilds.
  • Clear Documentation of `backup_roots` Semantics: Explicitly document that `backup_roots[0..3]` functions as a four-commit sliding window, not a historical backup mechanism.
  • Documentation of DIR `i_size` Accounting Rule: Provide clear documentation for the `DIR i_size = sum(namelen * 2)` rule, which is crucial for directory integrity.
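The five-condition criterion translates naturally into a single predicate. The sketch below encodes it with simplified stand-in structures; the `LEAF_DATA_SIZE` value assumes a 16 KiB nodesize and, like the struct fields, is illustrative rather than real btrfs-progs code:

```c
#include <stdint.h>

/* Illustrative encoding of the five-part "bulletproof subset
   criterion" for safe orphan inode deletion. Assumes a 16 KiB
   nodesize; all structures are simplified stand-ins. */

#define LEAF_DATA_SIZE 16283   /* nodesize minus header (illustrative) */

struct leaf_info {
    uint64_t gen;                 /* generation of the leaf itself      */
    uint64_t parent_gen;          /* gen of its level-1 parent node     */
    uint64_t used_after_removal;  /* bytes left after dropping items    */
    int      siblings_fresh;      /* all slot-adjacent siblings fresh?  */
    int      extents_resolve;     /* every EXTENT_DATA bytenr resolves? */
};

int safe_to_delete_orphan(const struct leaf_info *l, uint64_t last_snapshot)
{
    return l->gen        > last_snapshot &&               /* cond 1 */
           l->parent_gen > last_snapshot &&               /* cond 2 */
           l->used_after_removal > LEAF_DATA_SIZE / 4 &&  /* cond 3 */
           l->siblings_fresh &&                           /* cond 4 */
           l->extents_resolve;                            /* cond 5 */
}
```

A dry-run mode would evaluate this predicate for every orphan candidate and emit the result as a machine-readable plan before any deletion happens.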

Architecting for Failure: Lessons from a 12 TB Btrfs Incident

This incident offers crucial architectural insights for anyone designing or operating distributed systems, especially those built upon complex filesystems.

Robust recovery mechanisms are essential. Repair tools must be as resilient as the data they protect. They require progress detection to abort infinite loops, graceful handling of edge cases such as `EEXIST`, and a clear understanding of internal dependencies to prevent deadlocks. The proposed `btrfs rescue rebuild-extent-tree` and `clean-orphan-inodes` subcommands, which operate from pre-scanned lists and include dry-run modes, exemplify the idempotent, observable recovery patterns required.

Clear semantics for internal state are critical. The `backup_roots` confusion exemplifies how ambiguous documentation or implicit assumptions can lead to disaster. Any recovery or rollback mechanism must have its guarantees and limitations explicitly defined. Historical backups require distinct architectural patterns, such as immutable logs or versioned snapshots, rather than a sliding window.

Layered data integrity is prudent. Sole reliance on the filesystem for data integrity introduces risk. Higher-level application logic, external backup systems, and replication strategies (e.g., across availability zones or regions) provide additional protection layers. This incident demonstrates that even a sophisticated filesystem can experience core failures.

This is not a narrative about Btrfs's inherent flaws; it illustrates the realities of distributed state management under adverse conditions. The recovery of this 12 TB pool, with minimal data loss, demonstrates the engineer's expertise in addressing complex problems. For any architecture review, it is imperative to understand a filesystem's true guarantees and to design for failure at every layer. Furthermore, this incident underscores that human ingenuity often proves to be the most effective recovery tool.

Dr. Elena Vosk specializes in large-scale distributed systems. Obsessed with the CAP theorem and data consistency.