Why a Power Cycle Can Compromise Your Filesystem Integrity
The incident involved a 12 TB Btrfs pool spread across three devices, configured with `data=single` and `metadata=DUP`. The disks were DM-SMR (shingled magnetic recording) drives, which add complexity through internal write amplification and variable performance. The filesystem used the `MIXED_BACKREF`, `COMPRESS_ZSTD`, `BIG_METADATA`, `EXTENDED_IREF`, `SKINNY_METADATA`, and `NO_HOLES` features. This combination of advanced Btrfs features and DM-SMR drives is representative of a demanding production environment.

Btrfs with `metadata=DUP` promises resilience: metadata blocks are written twice, ideally to different physical locations, to protect against single-block corruption. However, a hard power cycle that interrupted a commit between generations 18958 and 18959 revealed a critical vulnerability: the DUP copies of metadata blocks ended up with inconsistent parent/child generations. This was a fundamental break in the filesystem's internal consistency model, specifically within the extent tree and free space tree. Corruption in these core B-trees compromises the filesystem's ability to locate data, manage free space, and interpret its own structure.
Native Tools Fail: A Critical Vulnerability
The system immediately became unmountable, and `btrfs check --repair`, the primary recovery utility, failed. First, it rejected the command with "read-only file system." Then, attempts with `--init-extent-tree` deadlocked. This illustrates a circular dependency: rebuilding the extent tree requires block allocation, which in turn relies on a functional extent tree.

The most critical failure came when `btrfs check --repair` entered an infinite loop, advancing the pool generation by over 46,000 commits with zero net progress. This not only consumed valuable time but also irrevocably erased any possibility of a straightforward rollback. The four superblock `backup_roots` slots, often mistaken for historical backups, actually function as a four-commit sliding window. The infinite loop overwrote each slot approximately 11,000 times, erasing all pre-crash rollback points. Misreading this design breeds false confidence: `backup_roots` is intended for minor rollbacks, not disaster recovery.
Data Recovery: The Trade-off Between Consistency and Availability
Recovering this 12 TB pool required deep architectural understanding and painstaking manual effort. The engineer developed 14 custom C tools built against the internal `btrfs-progs` API, along with a single-line patch to `alloc_reserved_tree_block` to tolerate `EEXIST`. This was not a quick fix; it was a staged recovery plan, empirically validated at each step.

The incident highlighted a fundamental trade-off: when native tools fail, manual recovery inherently prioritizes data consistency over immediate availability, and the pool experienced extended downtime. The goal was not 100% data recovery, which is often impractical in severe corruption scenarios, but maximal data recovery with a fully operational filesystem. The outcome was impressive: approximately 7.2 MB of data lost out of 4.59 TB (0.00016 percent). Although the pool was fully operational, `btrfs check --readonly` still reported residual cosmetic errors, such as 712 incorrect link counts and 393,057 backpointer mismatches, none of which impeded normal operation.
This situation mirrors the trade-offs of distributed system design, where severe partitions force a sacrifice of availability until consistency is restored. The custom tools re-established consistency by rebuilding internal B-trees, fixing backreferences, and correcting accounting errors. For instance, the `rebuild_extent_tree_apply` tool injected 3,248,617 EXTENT_DATA_REFs in a single transaction, a process that took approximately 34 minutes on the DM-SMR disks, during which the pool remained unavailable.
The custom tools developed for this recovery embodied several principles crucial for robust system design. They featured a mandatory read-only scan mode for forensic analysis prior to any write operations, complemented by an explicit, opt-in `--write` mode as a safety mechanism. Crucially, the staged recovery process demanded idempotency or careful ordering for many operations, such as `scan_and_fix_all_backrefs` or `rebuild_extent_tree_apply`, ensuring that repeated execution would not worsen the filesystem state. This approach is fundamental for iterative, complex recovery scenarios.
Proposed Upstream Improvements from this Incident
Analysis of the incident revealed several architectural flaws in the `btrfs-progs` implementation that directly informed concrete proposals for upstream development. These are not minor bugs; they are critical vulnerabilities in the filesystem's self-healing mechanisms, particularly under severe, multi-point corruption. The identified deficiencies and corresponding proposals include:
- Progress Detection in `btrfs check --repair`: Implement mechanisms to abort after a defined number of non-decreasing error passes, preventing infinite loops and the destruction of `backup_roots` slots.
- Symmetric `reinit_extent_tree` Handling: Extend `reinit_extent_tree` to tolerate `BTRFS_ADD_DELAYED_REF` failures for sharable blocks, mirroring the existing exemption for `BTRFS_DROP_DELAYED_REF`, to prevent crashes during rebalance.
- Sibling Safety Precheck in `btrfs_del_items`: Introduce a precheck to skip rebalance operations if `last_snapshot < trans->transid` and the system is in recovery mode, preventing corruption of pre-crash leaves by unconditional copy-on-write on stale siblings.
- Supervised `EEXIST` Handling in `alloc_reserved_tree_block`: Make the handling of `EEXIST` configurable with explicit modes (error, silent, update) to prevent infinite loops caused by inadequate error management.
- Extent Tree Rebuild in Userspace: Develop a new `btrfs rescue rebuild-extent-tree` subcommand that operates from a pre-scanned reference list, offering a robust alternative to the currently deadlocking `--init-extent-tree`.
- Orphan Inode Cleanup with Bulletproof Criterion: Introduce a `btrfs rescue clean-orphan-inodes` subcommand with a built-in dry-run mode. This tool would apply a stringent 5-condition check to identify safe candidates for deletion, producing a machine-readable plan. The "bulletproof subset criterion" for safe orphan inode deletion requires:
  - The leaf hosting the inode's `INODE_ITEM` has `gen > last_snapshot`.
  - The leaf's parent level-1 node has `gen > last_snapshot`.
  - The estimated used bytes of the leaf after removing target items is `> LEAF_DATA_SIZE/4` (to avoid a forced rebalance).
  - All immediate slot-adjacent siblings of the leaf in its parent node have `gen > last_snapshot`.
  - The `disk_bytenr` of every `EXTENT_DATA` referenced by the inode resolves in the current extent tree.
- Surgical `BLOCK_GROUP_ITEM.used` Patch: Extend `btrfs rescue` with a `fix-bg-accounting` subcommand for precise adjustments to `BLOCK_GROUP_ITEM.used` after bulk extent tree rebuilds.
- Clear Documentation of `backup_roots` Semantics: Explicitly document that `backup_roots[0..3]` functions as a four-commit sliding window, not a historical backup mechanism.
- Documentation of DIR `i_size` Accounting Rule: Provide clear documentation for the `DIR i_size = sum(namelen * 2)` rule, which is crucial for directory integrity.
Architecting for Failure: Lessons from a 12 TB Btrfs Incident
This incident offers crucial architectural insights for anyone designing or operating distributed systems, especially those built upon complex filesystems.

Robust recovery mechanisms are essential. Repair tools must be as resilient as the data they protect. They require progress detection to abort infinite loops, graceful handling of edge cases such as `EEXIST`, and a clear understanding of internal dependencies to prevent deadlocks. The proposed `btrfs rescue rebuild-extent-tree` and `clean-orphan-inodes` subcommands, which operate from pre-scanned lists and include dry-run modes, exemplify the idempotent, observable recovery patterns required.
Clear semantics for internal state are critical. The `backup_roots` confusion exemplifies how ambiguous documentation or implicit assumptions can lead to disaster. Any recovery or rollback mechanism must have its guarantees and limitations explicitly defined. Historical backups require distinct architectural patterns, such as immutable logs or versioned snapshots, rather than a sliding window.
Layered data integrity is prudent. Sole reliance on the filesystem for data integrity introduces risk. Higher-level application logic, external backup systems, and replication strategies (e.g., across availability zones or regions) provide additional protection layers. This incident demonstrates that even a sophisticated filesystem can experience core failures.
This is not a narrative about Btrfs's inherent flaws; it illustrates the realities of distributed state management under adverse conditions. The recovery of this 12 TB pool, with minimal data loss, demonstrates the engineer's expertise in addressing complex problems. For any architecture review, it is imperative to understand a filesystem's true guarantees and to design for failure at every layer. Furthermore, this incident underscores that human ingenuity often proves to be the most effective recovery tool.