Merge tag 'bcachefs-2025-05-24' of git://evilpiepirate.org/bcachefs

Pull bcachefs updates from Kent Overstreet: - Poisoned extents can now be moved: this lets us handle bitrotted data without deleting it. For now, reading from poisoned extents only returns -EIO: in the future we'll have an API for specifying "read this data even if there were bitflips". - Incompatible features may now be enabled at runtime, via "opts/version_upgrade" in sysfs. Toggle it to incompatible, and then toggle it back - option changes via the sysfs interface are persistent. - Various changes to support deployable disk images: - RO mounts now use less memory - Images may be stripped of alloc info, particularly useful for slimming them down if they will primarily be mounted RO. Alloc info will be automatically regenerated on first RW mount, and this is quite fast - Filesystem images generated with 'bcachefs image' will be automatically resized the first time they're mounted on a larger device The images 'bcachefs image' generates with compression enabled have been comparable in size to those generated by squashfs and erofs - but you get a full RW capable filesystem - Major error message improvements for btree node reads, data reads, and elsewhere. We now build up a single error message that lists all the errors encountered, actions taken to repair, and success/failure of the IO. This extends to other error paths that may kick off other actions, e.g. scheduling recovery passes: actions we took because of an error are included in that error message, with grouping/indentation so we can see what caused what. - New option, 'rebalance_on_ac_only'. Does exactly what the name suggests, quite handy with background compression. - Repair/self healing: - We can now kick off recovery passes and run them in the background if we detect errors. Currently, this is just used by code that walks backpointers. We now also check for missing backpointers at runtime and run check_extents_to_backpointers if required. The messy 6.14 upgrade left missing backpointers for some users, and this will correct that automatically instead of requiring a manual fsck - some users noticed this as copygc spinning and not making progress. In the future, as more recovery passes come online, we'll be able to repair and recover from nearly anything - except for unreadable btree nodes, and that's why you're using replication, of course - without shutting down the filesystem. - There's a new recovery pass, for checking the rebalance_work btree, which tracks extents that rebalance will process later. - Hardening: - Close the last known hole in btree iterator/btree locking assertions: path->should_be_locked paths must stay locked until the end of the transaction. This shook out a few bugs, including a performance issue that was causing unnecessary path_upgrade transaction restarts. - Performance: - Faster snapshot deletion: this is an incompatible feature, as it requires new sentinal values, for safety. Snapshot deletion no longer has to do a full metadata scan, it now just scans the inodes btree: if an extent/dirent/xattr is present for a given snapshot ID, we already require that an inode be present with that same snapshot ID. If/when users hit scalability limits again (ridiculously huge filesystems with lots of inodes, and many sparse snapshots), let me know - the next step will be to add an index from snapshot ID -> inode number, which won't be too hard. - Faster device removal: the "scan for pointers to this device" no longer does a full metadata scan, instead it walks backpointers. Like fast snapshot deletion this is another incompat feature: it also requires a new sentinal value, because we don't want to reuse these device IDs until after a fsck. - We're now coalescing redundant accounting updates prior to transaction commit, taking some pressure off the journal. Shortly we'll also be doing multiple extent updates in a transaction in the main write path, which combined with the previous should drastically cut down on the amount of metadata updates we have to journal. - Stack usage improvements: All allocator state has been moved off the stack - Debug improvements: - enumerated refcounts: The debug code previously used for filesystem write refs is now a small library, and used for other heavily used refcounts. Different users of a refcount are enumerated, making it much easier to debug refcount issues. - Async object debugging: There's a new kconfig option that makes various async objects (different types of bios, data updates, write ops, etc.) visible in debugfs, and it should be fast enough to leave on in production. - Various sets of assertions no longer require CONFIG_BCACHEFS_DEBUG, instead they're controlled by module parameters and static keys, meaning users won't need to compile custom kernels as often to help debug issues. - bch2_trans_kmalloc() calls can be tracked (there's a new kconfig option). With it on you can check the btree_transaction_stats in debugfs to see the bch2_trans_kmalloc() calls a transaction did when it used the most memory. * tag 'bcachefs-2025-05-24' of git://evilpiepirate.org/bcachefs: (218 commits) bcachefs: Don't mount bs > ps without TRANSPARENT_HUGEPAGE bcachefs: Fix btree_iter_next_node() for new locking asserts bcachefs: Ensure we don't use a blacklisted journal seq bcachefs: Small check_fix_ptr fixes bcachefs: Fix opts.recovery_pass_last bcachefs: Fix allocate -> self healing path bcachefs: Fix endianness in casefold check/repair bcachefs: Path must be locked if trans->locked && should_be_locked bcachefs: Simplify bch2_path_put() bcachefs: Plumb btree_trans for more locking asserts bcachefs: Clear trans->locked before unlock bcachefs: Clear should_be_locked before unlock in key_cache_drop() bcachefs: bch2_path_get() reuses paths if upgrade_fails & !should_be_locked bcachefs: Give out new path if upgrade fails bcachefs: Fix btree_path_get_locks when not doing trans restart bcachefs: btree_node_locked_type_nowrite() bcachefs: Kill bch2_path_put_nokeep() bcachefs: bch2_journal_write_checksum() bcachefs: Reduce stack usage in data_update_index_update() bcachefs: bch2_trans_log_str() ...
author: Linus Torvalds <torvalds@linux-foundation.org> 2025-05-26 12:43:30 -0700
committer: Linus Torvalds <torvalds@linux-foundation.org> 2025-05-26 12:43:30 -0700
commit: 522544fc71c27b4b432386c7919f71ecc79a3bfb (patch)
tree: 55c6f337ca94ad02480e2fa7fbec8a937a028a73 /Documentation
parent: Merge tag 'gfs2-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 (diff)
parent: bcachefs: Don't mount bs > ps without TRANSPARENT_HUGEPAGE (diff)
download: wireguard-linux-522544fc71c27b4b432386c7919f71ecc79a3bfb.tar.xz
wireguard-linux-522544fc71c27b4b432386c7919f71ecc79a3bfb.zip
3 files changed, 103 insertions, 0 deletions
diff --git a/Documentation/filesystems/bcachefs/casefolding.rst b/Documentation/filesystems/bcachefs/casefolding.rst
index ba5de97d155f..871a38f557e8 100644
--- a/Documentation/filesystems/bcachefs/casefolding.rst
+++ b/Documentation/filesystems/bcachefs/casefolding.rst
@@ -88,3 +88,21 @@ This would fail if negative dentry's were cached.
 
 This is slightly suboptimal, but could be fixed in future with some vfs work.
 
+
+References
+----------
+
+(from Peter Anvin, on the list)
+
+It is worth noting that Microsoft has basically declared their
+"recommended" case folding (upcase) table to be permanently frozen (for
+new filesystem instances in the case where they use an on-disk
+translation table created at format time.)  As far as I know they have
+never supported anything other than 1:1 conversion of BMP code points,
+nor normalization.
+
+The exFAT specification enumerates the full recommended upcase table,
+although in a somewhat annoying format (basically a hex dump of
+compressed data):
+
+https://learn.microsoft.com/en-us/windows/win32/fileio/exfat-specification
diff --git a/Documentation/filesystems/bcachefs/future/idle_work.rst b/Documentation/filesystems/bcachefs/future/idle_work.rst
new file mode 100644
index 000000000000..59a332509dcd
--- /dev/null
+++ b/Documentation/filesystems/bcachefs/future/idle_work.rst
@@ -0,0 +1,78 @@
+Idle/background work classes design doc:
+
+Right now, our behaviour at idle isn't ideal, it was designed for servers that
+would be under sustained load, to keep pending work at a "medium" level, to
+let work build up so we can process it in more efficient batches, while also
+giving headroom for bursts in load.
+
+But for desktops or mobile - scenarios where work is less sustained and power
+usage is more important - we want to operate differently, with a "rush to
+idle" so the system can go to sleep. We don't want to be dribbling out
+background work while the system should be idle.
+
+The complicating factor is that there are a number of background tasks, which
+form a heirarchy (or a digraph, depending on how you divide it up) - one
+background task may generate work for another.
+
+Thus proper idle detection needs to model this heirarchy.
+
+- Foreground writes
+- Page cache writeback
+- Copygc, rebalance
+- Journal reclaim
+
+When we implement idle detection and rush to idle, we need to be careful not
+to disturb too much the existing behaviour that works reasonably well when the
+system is under sustained load (or perhaps improve it in the case of
+rebalance, which currently does not actively attempt to let work batch up).
+
+SUSTAINED LOAD REGIME
+---------------------
+
+When the system is under continuous load, we want these jobs to run
+continuously - this is perhaps best modelled with a P/D controller, where
+they'll be trying to keep a target value (i.e. fragmented disk space,
+available journal space) roughly in the middle of some range.
+
+The goal under sustained load is to balance our ability to handle load spikes
+without running out of x resource (free disk space, free space in the
+journal), while also letting some work accumululate to be batched (or become
+unnecessary).
+
+For example, we don't want to run copygc too aggressively, because then it
+will be evacuating buckets that would have become empty (been overwritten or
+deleted) anyways, and we don't want to wait until we're almost out of free
+space because then the system will behave unpredicably - suddenly we're doing
+a lot more work to service each write and the system becomes much slower.
+
+IDLE REGIME
+-----------
+
+When the system becomes idle, we should start flushing our pending work
+quicker so the system can go to sleep.
+
+Note that the definition of "idle" depends on where in the heirarchy a task
+is - a task should start flushing work more quickly when the task above it has
+stopped generating new work.
+
+e.g. rebalance should start flushing more quickly when page cache writeback is
+idle, and journal reclaim should only start flushing more quickly when both
+copygc and rebalance are idle.
+
+It's important to let work accumulate when more work is still incoming and we
+still have room, because flushing is always more efficient if we let it batch
+up. New writes may overwrite data before rebalance moves it, and tasks may be
+generating more updates for the btree nodes that journal reclaim needs to flush.
+
+On idle, how much work we do at each interval should be proportional to the
+length of time we have been idle for. If we're idle only for a short duration,
+we shouldn't flush everything right away; the system might wake up and start
+generating new work soon, and flushing immediately might end up doing a lot of
+work that would have been unnecessary if we'd allowed things to batch more.
+ 
+To summarize, we will need:
+
+ - A list of classes for background tasks that generate work, which will
+   include one "foreground" class.
+ - Tracking for each class - "Am I doing work, or have I gone to sleep?"
+ - And each class should check the class above it when deciding how much work to issue.
diff --git a/Documentation/filesystems/bcachefs/index.rst b/Documentation/filesystems/bcachefs/index.rst
index 3864d0ae89c1..e5c4c2120b93 100644
--- a/Documentation/filesystems/bcachefs/index.rst
+++ b/Documentation/filesystems/bcachefs/index.rst
@@ -29,3 +29,10 @@ At this moment, only a few of these are described here.
 
    casefolding
    errorcodes
+
+Future design
+-------------
+.. toctree::
+   :maxdepth: 1
+
+   future/idle_work
author	Linus Torvalds <torvalds@linux-foundation.org>	2025-05-26 12:43:30 -0700
committer	Linus Torvalds <torvalds@linux-foundation.org>	2025-05-26 12:43:30 -0700
commit	522544fc71c27b4b432386c7919f71ecc79a3bfb (patch)
tree	55c6f337ca94ad02480e2fa7fbec8a937a028a73 /Documentation
parent	Merge tag 'gfs2-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 (diff)
parent	bcachefs: Don't mount bs > ps without TRANSPARENT_HUGEPAGE (diff)
download	wireguard-linux-522544fc71c27b4b432386c7919f71ecc79a3bfb.tar.xz wireguard-linux-522544fc71c27b4b432386c7919f71ecc79a3bfb.zip