aboutsummaryrefslogtreecommitdiffstatshomepage
path: root/tools/perf/scripts/python/export-to-sqlite.py (unfollow)
AgeCommit message (Collapse)AuthorFilesLines
2025-04-17btrfs: avoid page_lockend underflow in btrfs_punch_hole_lock_range()Qu Wenruo1-2/+7
[BUG] When running btrfs/004 with 4K fs block size and 64K page size, sometimes fsstress workload can take 100% CPU for a while, but not long enough to trigger a 120s hang warning. [CAUSE] When such 100% CPU usage happens, btrfs_punch_hole_lock_range() is always in the call trace. One example when this problem happens, the function btrfs_punch_hole_lock_range() got the following parameters: lock_start = 4096, lockend = 20469 Then we calculate @page_lockstart by rounding up lock_start to page boundary, which is 64K (page size is 64K). For @page_lockend, we round down the value towards page boundary, which result 0. Then since we need to pass an inclusive end to filemap_range_has_page(), we subtract 1 from the rounded down value, resulting in (u64)-1. In the above case, the range is inside the same page, and we do not even need to call filemap_range_has_page(), not to mention to call it with (u64)-1 at the end. This behavior will cause btrfs_punch_hole_lock_range() to busy loop waiting for irrelevant range to have its pages dropped. [FIX] Calculate @page_lockend by just rounding down @lockend, without decreasing the value by one. So @page_lockend will no longer overflow. Then exit early if @page_lockend is no larger than @page_lockstart. As it means either the range is inside the same page, or the two pages are adjacent already. Finally only decrease @page_lockend when calling filemap_range_has_page(). Fixes: 0528476b6ac7 ("btrfs: fix the filemap_range_has_page() call in btrfs_punch_hole_lock_range()") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-04-17btrfs: subpage: access correct object when reading bitmap start in subpage_calc_start_bit()Qu Wenruo1-1/+1
Inside the macro, subpage_calc_start_bit(), we need to calculate the offset to the beginning of the folio. But we're using offset_in_page(), on systems with 4K page size and 4K fs block size, this means we will always return offset 0 for a large folio, causing all kinds of errors. Fix it by using offset_in_folio() instead. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-04-01btrfs: remove folio order ASSERT()s in super block writeback pathQu Wenruo1-2/+0
[BUG] There is a syzbot report that the ASSERT() inside write_dev_supers() got triggered: assertion failed: folio_order(folio) == 0, in fs/btrfs/disk-io.c:3858 ------------[ cut here ]------------ kernel BUG at fs/btrfs/disk-io.c:3858! Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI CPU: 0 UID: 0 PID: 6730 Comm: syz-executor378 Not tainted 6.14.0-syzkaller-03565-gf6e0150b2003 #0 PREEMPT(full) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 RIP: 0010:write_dev_supers fs/btrfs/disk-io.c:3858 [inline] RIP: 0010:write_all_supers+0x400f/0x4090 fs/btrfs/disk-io.c:4155 Call Trace: <TASK> btrfs_commit_transaction+0x1eda/0x3750 fs/btrfs/transaction.c:2528 btrfs_quota_enable+0xfcc/0x21a0 fs/btrfs/qgroup.c:1226 btrfs_ioctl_quota_ctl+0x144/0x1c0 fs/btrfs/ioctl.c:3677 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:906 [inline] __se_sys_ioctl+0xf1/0x160 fs/ioctl.c:892 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f5ad1f20289 </TASK> ---[ end trace 0000000000000000 ]--- [CAUSE] Since commit f93ee0df5139 ("btrfs: convert super block writes to folio in write_dev_supers()") and commit c94b7349b859 ("btrfs: convert super block writes to folio in wait_dev_supers()"), the super block writeback path is converted to use folio. Since the original code is using page based interfaces, we have an "ASSERT(folio_order(folio) == 0);" added to make sure everything is not changed. But the folio here is not from any btrfs inode, but from the block device, and we have no control on the folio order in bdev, the device can choose whatever folio size they want/need. E.g. the bdev may even have a block size of multiple pages. So the ASSERT() is triggered. [FIX] The super block writeback path has taken larger folios into consideration, so there is no need for the ASSERT(). And since commit bc00965dbff7 ("btrfs: count super block write errors in device instead of tracking folio error state"), the wait path no longer checks the folio status but only wait for the folio writeback to finish, there is nothing requiring the ASSERT() either. So we can remove both ASSERT()s safely now. Reported-by: syzbot+34122898a11ab689518a@syzkaller.appspotmail.com Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-04-01btrfs: correctly escape subvol in btrfs_show_options()Johannes Kimmel1-2/+1
Currently, displaying the btrfs subvol mount option doesn't escape ','. This makes parsing /proc/self/mounts and /proc/self/mountinfo ambiguous for subvolume names that contain commas. The text after the comma could be mistaken for another option (think "subvol=foo,ro", where ro is actually part of the subvolumes name). Replace the manual escape characters list with a call to seq_show_option(). Thanks to Calvin Walton for suggesting this approach. Fixes: c8d3fe028f64 ("Btrfs: show subvol= and subvolid= in /proc/mounts") CC: stable@vger.kernel.org # 5.4+ Suggested-by: Calvin Walton <calvin.walton@kepstin.ca> Signed-off-by: Johannes Kimmel <kernel@bareminimum.eu> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-04-01btrfs: ioctl: don't free iov when btrfs_encoded_read() returns -EAGAINSidong Yang1-0/+2
Fix a bug in encoded read that mistakenly frees the iov in case btrfs_encoded_read() returns -EAGAIN assuming the structure will be reused. This can happen when when receiving requests concurrently, the io_uring subsystem does not reset the data, and the last free will happen in btrfs_uring_read_finished(). Handle the -EAGAIN error and skip freeing iov. CC: stable@vger.kernel.org # 6.13+ Signed-off-by: Sidong Yang <sidong.yang@furiosa.ai> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: zoned: fix zone finishing with missing devicesJohannes Thumshirn1-0/+3
If do_zone_finish() is called with a filesystem that has missing devices (e.g. a RAID file system mounted in degraded mode) it is accessing the btrfs_device::zone_info pointer, which will not be set if the device in question is missing. Check if the device is present (by checking if it has a valid block device pointer associated) and if not, skip zone finishing for it. Fixes: 4dcbb8ab31c1 ("btrfs: zoned: make zone finishing multi stripe capable") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: zoned: fix zone activation with missing devicesJohannes Thumshirn1-0/+3
If btrfs_zone_activate() is called with a filesystem that has missing devices (e.g. a RAID file system mounted in degraded mode) it is accessing the btrfs_device::zone_info pointer, which will not be set if the device in question is missing. Check if the device is present (by checking if it has a valid block device pointer associated) and if not, skip zone activation for it. Fixes: f9a912a3c45f ("btrfs: zoned: make zone activation multi stripe capable") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: remove end_no_trans label from btrfs_log_inode_parent()Filipe Manana1-18/+10
It's a pointless label as we don't have to do anything under it other than return from the function. So remove it and directly return from the function where we used to goto. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: simplify condition for logging new dentries at btrfs_log_inode_parent()Filipe Manana1-3/+6
There's no point in checking if the inode is a directory as ctx->log_new_dentries is only set in case we are logging a directory down the call chain of btrfs_log_inode(). So remove that check making the logic more simple and while at it add a comment about why use a local variable to track if we later need to log new dentries. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: remove redundant else statement from btrfs_log_inode_parent()Filipe Manana1-2/+0
If we don't need to log new directory dentries, there's no point in having an else branch just to set 'ret' to zero, as it's already zero because every time it gets a non-zero value we jump into one of the exit labels. So remove it, which reduces source code size and the module text size. Before this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1813855 163737 16920 1994512 1e6f10 fs/btrfs/btrfs.ko After this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1813807 163737 16920 1994464 1e6ee0 fs/btrfs/btrfs.ko Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: use memcmp_extent_buffer() at replay_one_extent()Filipe Manana1-13/+6
Instead of using memcmp(), which requires copying both file extent items from each extent buffer into a local buffer, use memcmp_extent_buffer() so that we only need to copy one of the file extent items and directly use the extent buffer of the other file extent item for the comparison. This reduces code size, saves one memory copy and reduces stack usage. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: update outdated comment for overwrite_item()Filipe Manana1-5/+5
The function is exclusively used for log replay since commit 3eb423442483 ("btrfs: remove outdated logic from overwrite_item() and add assertion"), so update the comment so that it doesn't say it can be used for logging. Also some minor rewording for clarity and while at it reformat the affected text so that it fits closer to the 80 characters limit for comments. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: use variables to store extent buffer and slot at overwrite_item()Filipe Manana1-22/+25
Instead of referring to path->nodes[0] and path->slots[0] multiple times, which is verbose and confusing since we have an 'eb' and 'slot' variables as well, introduce local variables 'dst_eb' to point to path->nodes[0] and 'dst_slot' to have path->slots[0], reducing verbosity and making it more obvious about which extent buffer and slot we are referring to. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: avoid unnecessary memory allocation and copy at overwrite_item()Filipe Manana1-10/+2
There's no need to allocate memory and copy from both the destination and source extent buffers to compare if the items are equal, we can instead use memcmp_extent_buffer() which allows to do only one memory allocation and copy instead of two. So use memcmp_extent_buffer() instead of memcmp(), allowing us to avoid one memory allocation, which can fail or be slow while under memory heavy pressure, avoid the memory copying and reducing code. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: don't clobber ret in btrfs_validate_super()Mark Harmstone1-0/+3
Commit 2a9bb78cfd36 ("btrfs: validate system chunk array at btrfs_validate_super()") introduces a call to validate_sys_chunk_array() in btrfs_validate_super(), which clobbers the value of ret set earlier. This has the effect of negating the validity checks done earlier, making it so btrfs could potentially try to mount invalid filesystems. Fixes: 2a9bb78cfd36 ("btrfs: validate system chunk array at btrfs_validate_super()") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Mark Harmstone <maharmstone@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: prepare btrfs_page_mkwrite() for large foliosQu Wenruo1-9/+10
This changes the assumption that the folio is always page sized. (Although the ASSERT() for folio order is still kept as-is). Just replace the PAGE_SIZE with folio_size(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: prepare extent_io.c for future large folio supportQu Wenruo1-13/+18
When we're handling folios from filemap, we can no longer assume all folios are page sized. Thus for call sites assuming the folio is page sized, change the PAGE_SIZE usage to folio_size() instead. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: prepare btrfs_launcher_folio() for large folios supportQu Wenruo1-1/+1
That function is only calling btrfs_qgroup_free_data(), which doesn't care about the size of the folio. Just replace the fixed PAGE_SIZE with folio_size(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: replace PAGE_SIZE with folio_size for subpage.[ch]Qu Wenruo2-4/+4
Since we can no longer assume all data filemap folios are page sized, use proper folio_size() calls to determine the folio size, as a preparation for future large data filemap folios. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: add a size parameter to btrfs_alloc_subpage()Qu Wenruo3-6/+6
Since we can no longer assume page sized folio for data filemap folios, allow btrfs_alloc_subpage() to accept a new parameter, @fsize, indicating the folio size. This doesn't follow the regular behavior of passing a folio directly, because this function is shared by both data and metadata folios, and for metadata folios we have extra allocation policy to ensure no large folios whose sizes are larger than nodesize (unless it's page sized). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: subpage: make btrfs_is_subpage() check against a folioQu Wenruo4-26/+26
To support large data folios, we can no longer assume every filemap folio is page sized. So btrfs_is_subpage() check must be done against a folio. Thankfully for metadata folios, we have the full control and ensure a large folio will not be large than nodesize, so btrfs_meta_is_subpage() doesn't need this change. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: add extra warning if delayed iput is added when it's not allowedQu Wenruo3-0/+6
Since I have triggered the ASSERT() on the delayed iput too many times, now is the time to add some extra debug warnings for delayed iput. All delayed iputs should be queued after all ordered extents finish their IO and all involved workqueues are flushed. Thus after the btrfs_run_delayed_iputs() inside close_ctree(), there should be no more delayed puts added. So introduce a new BTRFS_FS_STATE_NO_DELAYED_IPUT, set after the above mentioned timing. And all btrfs_add_delayed_iput() will check that flag and give a WARN_ON_ONCE(). Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: avoid redundant path slot assignment in btrfs_search_forward()Sun YangKai1-2/+1
Move path slot assignment before the condition check to prevent duplicate assignment. Previously, the slot was set both inside and after the 'slot >= nritems' block with no change in its value, which is unnecessary. Signed-off-by: Sun YangKai <sunk67188@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: remove unnecessary btrfs_key local variable in btrfs_search_forward()Sun YangKai1-7/+5
The 'found_key' variable was only used to temporarily store the found key before copying it to 'min_key' at the end of the function when returning success. Eliminate the 'found_key' variable, and directly store the key into 'min_key' at the exact loop exit points where ret=0 is set, maintaining identical functionality. Signed-off-by: Sun YangKai <sunk67188@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: simplify the return value handling in search_ioctl()Sun YangKai1-9/+8
Move the assignment of -EFAULT to within the error condition check in fault_in_subpage_writeable(). The previous placement outside the condition could lead to the error value being overwritten by subsequent assignments, cause unnecessary assignments. Simplify loop exit logic by removing redundant goto. The original code used 'goto err' to bypass post-loop processing after handling errors from btrfs_search_forward(). However, the loop's termination naturally falls through to the post-loop section, which already handles 'ret' values. Replacing 'goto err' with 'break' eliminates redundant control flow, consolidates error handling, and makes the loop's exit conditions explicit. Signed-off-by: Sun YangKai <sunk67188@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: tests: fix chunk map leak after failure to add it to the treeFilipe Manana1-0/+1
If we fail to add the chunk map to the fs mapping tree we exit test_rmap_block() without freeing the chunk map. Fix this by adding a call to btrfs_free_chunk_map() before exiting the test function if the call to btrfs_add_chunk_map() failed. Fixes: 7dc66abb5a47 ("btrfs: use a dedicated data structure for chunk maps") CC: stable@vger.kernel.org # 6.12+ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: codify pattern for adding block_group to bg_listBoris Burkov1-24/+31
Similar to mark_bg_unused() and mark_bg_to_reclaim(), we have a few places that use bg_list with refcounting, mostly for retrying failures to reclaim/delete unused. These have custom logic for handling locking and refcounting the bg_list properly, but they actually all want to do the same thing, so pull that logic out into a helper. Unfortunately, mark_bg_unused() does still need the NEW flag to avoid prematurely marking stuff unused (even if refcount is fine, we don't want to mess with bg creation), so it cannot use the new helper. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: explicitly ref count block_group on new_bgs listBoris Burkov2-0/+3
All other users of the bg_list list_head increment the refcount when adding to a list and decrement it when deleting from the list. Just for the sake of uniformity and to try to avoid refcounting bugs, do it for this list as well. This does not fix any known ref-counting bug, as the reference belongs to a single task (trans_handle is not shared and this represents trans_handle->new_bgs linkage) and will not lose its original refcount while that thread is running. And BLOCK_GROUP_FLAG_NEW protects against ref-counting errors "moving" the block group to the unused list without taking a ref. With that said, I still believe it is simpler to just hold the extra ref count for this list user as well. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: make btrfs_discard_workfn() block_group ref explicitBoris Burkov1-18/+16
Currently, the async discard machinery owns a ref to the block_group when the block_group is queued on a discard list. However, to handle races with discard cancellation and the discard workfn, we have a specific logic to detect that the block_group is *currently* running in the workfn, to protect the workfn's usage amidst cancellation. As far as I can tell, this doesn't have any overt bugs (though finish_discard_pass() and remove_from_discard_list() racing can have a surprising outcome for the caller of remove_from_discard_list() in that it is again added at the end). But it is needlessly complicated to rely on locking and the nullity of discard_ctl->block_group. Simplify this significantly by just taking a refcount while we are in the workfn and unconditionally drop it in both the remove and workfn paths, regardless of if they race. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: harden block_group::bg_list against list_del() racesBoris Burkov2-0/+20
As far as I can tell, these calls of list_del_init() on bg_list cannot run concurrently with btrfs_mark_bg_unused() or btrfs_mark_bg_to_reclaim(), as they are in transaction error paths and situations where the block group is readonly. However, if there is any chance at all of racing with mark_bg_unused(), or a different future user of bg_list, better to be safe than sorry. Otherwise we risk the following interleaving (bg_list refcount in parens) T1 (some random op) T2 (btrfs_mark_bg_unused) !list_empty(&bg->bg_list); (1) list_del_init(&bg->bg_list); (1) list_move_tail (1) btrfs_put_block_group (0) btrfs_delete_unused_bgs bg = list_first_entry list_del_init(&bg->bg_list); btrfs_put_block_group(bg); (-1) Ultimately, this results in a broken ref count that hits zero one deref early and the real final deref underflows the refcount, resulting in a WARNING. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: fix block group refcount race in btrfs_create_pending_block_groups()Boris Burkov1-0/+3
Block group creation is done in two phases, which results in a slightly unintuitive property: a block group can be allocated/deallocated from after btrfs_make_block_group() adds it to the space_info with btrfs_add_bg_to_space_info(), but before creation is completely completed in btrfs_create_pending_block_groups(). As a result, it is possible for a block group to go unused and have 'btrfs_mark_bg_unused' called on it concurrently with 'btrfs_create_pending_block_groups'. This causes a number of issues, which were fixed with the block group flag 'BLOCK_GROUP_FLAG_NEW'. However, this fix is not quite complete. Since it does not use the unused_bg_lock, it is possible for the following race to occur: btrfs_create_pending_block_groups btrfs_mark_bg_unused if list_empty // false list_del_init clear_bit else if (test_bit) // true list_move_tail And we get into the exact same broken ref count and invalid new_bgs state for transaction cleanup that BLOCK_GROUP_FLAG_NEW was designed to prevent. The broken refcount aspect will result in a warning like: [1272.943527] refcount_t: underflow; use-after-free. [1272.943967] WARNING: CPU: 1 PID: 61 at lib/refcount.c:28 refcount_warn_saturate+0xba/0x110 [1272.944731] Modules linked in: btrfs virtio_net xor zstd_compress raid6_pq null_blk [last unloaded: btrfs] [1272.945550] CPU: 1 UID: 0 PID: 61 Comm: kworker/u32:1 Kdump: loaded Tainted: G W 6.14.0-rc5+ #108 [1272.946368] Tainted: [W]=WARN [1272.946585] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014 [1272.947273] Workqueue: btrfs_discard btrfs_discard_workfn [btrfs] [1272.947788] RIP: 0010:refcount_warn_saturate+0xba/0x110 [1272.949532] RSP: 0018:ffffbf1200247df0 EFLAGS: 00010282 [1272.949901] RAX: 0000000000000000 RBX: ffffa14b00e3f800 RCX: 0000000000000000 [1272.950437] RDX: 0000000000000000 RSI: ffffbf1200247c78 RDI: 00000000ffffdfff [1272.950986] RBP: ffffa14b00dc2860 R08: 00000000ffffdfff R09: ffffffff90526268 [1272.951512] R10: ffffffff904762c0 R11: 0000000063666572 R12: ffffa14b00dc28c0 [1272.952024] R13: 0000000000000000 R14: ffffa14b00dc2868 R15: 000001285dcd12c0 [1272.952850] FS: 0000000000000000(0000) GS:ffffa14d33c40000(0000) knlGS:0000000000000000 [1272.953458] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [1272.953931] CR2: 00007f838cbda000 CR3: 000000010104e000 CR4: 00000000000006f0 [1272.954474] Call Trace: [1272.954655] <TASK> [1272.954812] ? refcount_warn_saturate+0xba/0x110 [1272.955173] ? __warn.cold+0x93/0xd7 [1272.955487] ? refcount_warn_saturate+0xba/0x110 [1272.955816] ? report_bug+0xe7/0x120 [1272.956103] ? handle_bug+0x53/0x90 [1272.956424] ? exc_invalid_op+0x13/0x60 [1272.956700] ? asm_exc_invalid_op+0x16/0x20 [1272.957011] ? refcount_warn_saturate+0xba/0x110 [1272.957399] btrfs_discard_cancel_work.cold+0x26/0x2b [btrfs] [1272.957853] btrfs_put_block_group.cold+0x5d/0x8e [btrfs] [1272.958289] btrfs_discard_workfn+0x194/0x380 [btrfs] [1272.958729] process_one_work+0x130/0x290 [1272.959026] worker_thread+0x2ea/0x420 [1272.959335] ? __pfx_worker_thread+0x10/0x10 [1272.959644] kthread+0xd7/0x1c0 [1272.959872] ? __pfx_kthread+0x10/0x10 [1272.960172] ret_from_fork+0x30/0x50 [1272.960474] ? __pfx_kthread+0x10/0x10 [1272.960745] ret_from_fork_asm+0x1a/0x30 [1272.961035] </TASK> [1272.961238] ---[ end trace 0000000000000000 ]--- Though we have seen them in the async discard workfn as well. It is most likely to happen after a relocation finishes which cancels discard, tears down the block group, etc. Fix this fully by taking the lock around the list_del_init + clear_bit so that the two are done atomically. Fixes: 0657b20c5a76 ("btrfs: fix use-after-free of new block group that became unused") Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: remove unnecessary fs_info argument from btrfs_add_block_group_cache()Filipe Manana1-8/+8
The fs_info can be taken from the given block group, so there is no need to pass it as an argument. Also rename the local variable from 'info' to 'fs_info' which is more widely used, more clear and to be more consistent. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: remove unnecessary fs_info argument from delete_block_group_cache()Filipe Manana1-5/+4
The fs_info can be taken from the given block group, so there is no need to pass it as an argument. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: remove unnecessary fs_info argument from create_reloc_inode()Filipe Manana1-2/+2
The fs_info can be taken from the given block group, so there is no need to pass it as an argument. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: make btrfs_iget_path() return a btrfs inode insteadFilipe Manana3-11/+11
It's an internal function and btrfs_iget() is now returning a btrfs inode, so change btrfs_iget_path() to also return a btrfs inode instead of a VFS inode. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: make btrfs_iget() return a btrfs inode insteadFilipe Manana9-71/+76
It's an internal function and most of the time the callers are doing a lot of BTRFS_I() calls on the returned VFS inode to get the btrfs inode, so change the return type to struct btrfs_inode instead. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: pass a btrfs_inode to fixup_inode_link_count()Filipe Manana1-13/+13
fixup_inode_link_count() mostly wants to use a btrfs_inode, plus it's an internal function so it should take btrfs_inode instead of a VFS inode. Change the argument type to btrfs_inode, avoiding several BTRFS_I() calls too. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: return a btrfs_inode from read_one_inode()Filipe Manana1-79/+73
All callers of read_one_inode() are mostly interested in the btrfs_inode structure rather than the VFS inode, so make read_one_inode() return the btrfs_inode instead, avoiding lots of BTRFS_I() calls. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: return a btrfs_inode from btrfs_iget_logging()Filipe Manana1-49/+45
All callers of btrfs_iget_logging() are interested in the btrfs_inode structure rather than the VFS inode, so make btrfs_iget_logging() return the btrfs_inode instead, avoiding lots of BTRFS_I() calls. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: avoid linker error in btrfs_find_create_tree_block()Mark Harmstone1-2/+5
The inline function btrfs_is_testing() is hardcoded to return 0 if CONFIG_BTRFS_FS_RUN_SANITY_TESTS is not set. Currently we're relying on the compiler optimizing out the call to alloc_test_extent_buffer() in btrfs_find_create_tree_block(), as it's not been defined (it's behind an #ifdef). Add a stub version of alloc_test_extent_buffer() to avoid linker errors on non-standard optimization levels. This problem was seen on GCC 14 with -O0 and is helps to see symbols that would be otherwise optimized out. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Mark Harmstone <maharmstone@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: run btrfs_error_commit_super() earlyQu Wenruo1-7/+8
[BUG] Even after all the error fixes related the "ASSERT(list_empty(&fs_info->delayed_iputs));" in close_ctree(), I can still hit it reliably with my experimental 2K block size. [CAUSE] In my case, all the error is triggered after the fs is already in error status. I find the following call trace to be the cause of race: Main thread | endio_write_workers ---------------------------------------------+--------------------------- close_ctree() | |- btrfs_error_commit_super() | | |- btrfs_cleanup_transaction() | | | |- btrfs_destroy_all_ordered_extents() | | | |- btrfs_wait_ordered_roots() | | |- btrfs_run_delayed_iputs() | | | btrfs_finish_ordered_io() | | |- btrfs_put_ordered_extent() | | |- btrfs_add_delayed_iput() |- ASSERT(list_empty(delayed_iputs)) | !!! Triggered !!! The root cause is that, btrfs_wait_ordered_roots() only wait for ordered extents to finish their IOs, not to wait for them to finish and removed. [FIX] Since btrfs_error_commit_super() will flush and wait for all ordered extents, it should be executed early, before we start flushing the workqueues. And since btrfs_error_commit_super() now runs early, there is no need to run btrfs_run_delayed_iputs() inside it, so just remove the btrfs_run_delayed_iputs() call from btrfs_error_commit_super(). Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: defrag: extend ioctl to accept compression levelsDaniel Vacek7-12/+51
The zstd and zlib compression types support setting compression levels. Enhance the defrag interface to specify the levels as well. For zstd the negative (realtime) levels are also accepted. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Daniel Vacek <neelx@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: send: simplify return logic from send_encoded_extent()Filipe Manana1-8/+5
The 'out' label is pointless as we don't have anything to cleanup anymore (we used to have an inode to iput), so remove it and make error paths directly return an error. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: send: remove unnecessary inode lookup at send_encoded_inline_extent()Filipe Manana1-8/+1
We are doing a lookup of the inode but we don't use it at all. So just remove this pointless lookup. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: avoid unnecessary bio dereference at run_one_async_done()Filipe Manana1-1/+1
We have dereferenced the async_submit_bio structure and extracted the bio pointer into a local variable, so there's no need to dereference it again when calling btrfs_bio_end_io(). Just use "bio->bi_status" instead of the longer expression "async->bbio->bio.bi_status". Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: fix non-empty delayed iputs list on unmount due to async workersFilipe Manana1-0/+13
At close_ctree() after we have ran delayed iputs either explicitly through calling btrfs_run_delayed_iputs() or later during the call to btrfs_commit_super() or btrfs_error_commit_super(), we assert that the delayed iputs list is empty. We have (another) race where this assertion might fail because we have queued an async write into the fs_info->workers workqueue. Here's how it happens: 1) We are submitting a data bio for an inode that is not the data relocation inode, so we call btrfs_wq_submit_bio(); 2) btrfs_wq_submit_bio() submits a work for the fs_info->workers queue that will run run_one_async_done(); 3) We enter close_ctree(), flush several work queues except fs_info->workers, explicitly run delayed iputs with a call to btrfs_run_delayed_iputs() and then again shortly after by calling btrfs_commit_super() or btrfs_error_commit_super(), which also run delayed iputs; 4) run_one_async_done() is executed in the work queue, and because there was an IO error (bio->bi_status is not 0) it calls btrfs_bio_end_io(), which drops the final reference on the associated ordered extent by calling btrfs_put_ordered_extent() - and that adds a delayed iput for the inode; 5) At close_ctree() we find that after stopping the cleaner and transaction kthreads the delayed iputs list is not empty, failing the following assertion: ASSERT(list_empty(&fs_info->delayed_iputs)); Fix this by flushing the fs_info->workers workqueue before running delayed iputs at close_ctree(). David reported this when running generic/648, which exercises IO error paths by using the DM error table. Reported-by: David Sterba <dsterba@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: reject out-of-band dirty folios during writebackQu Wenruo2-1/+42
[OUT-OF-BAND DIRTY FOLIOS] An out-of-band folio means the folio is marked dirty but without notifying the filesystem. This can lead to various problems, not limited to: - No folio::private to track per block status - No proper space reserved for such a dirty folio [HISTORY IN BTRFS] This used to be a problem related to get_user_page(), but with the introduction of pin_user_pages*(), we should no longer hit such case anymore. In btrfs, we have a long history of catching such out-of-band dirty folios by: - Mark the folio ordered during delayed allocation - Check the folio ordered flag during writeback If the folio has no ordered flag, it means it doesn't go through delayed allocation, thus it's definitely an out-of-band one. If we got one, we go through COW fixup, which will re-dirty the folio with proper handling in another workqueue. [PROBLEMS OF COW-FIXUP] Such workaround is a blockage for us to migrate to iomap (it requires extra flags to trace if a folio is dirtied by the fs or not) and I'd argue it's not data checksum safe, since if a folio can be marked dirty without informing the fs, the content can also change at any time. But with the introduction of pin_user_pages*() during v5.8 merge window, such out-of-band dirty folio such be treated as a bug. Ext4 has treated such case by warning and erroring out even before pin_user_pages*(). Furthermore, there are already proofs that such folio ordered flag tracking can be screwed up by incorrect error handling, check the commit messages of the following commits: 06f364284794 ("btrfs: do proper folio cleanup when cow_file_range() failed") c2b47df81c8e ("btrfs: do proper folio cleanup when run_delalloc_nocow() failed") [FIXES] Unlike btrfs, ext4 and xfs (iomap) never bother handling such out-of-band dirty folios. - Ext4 just warns and errors out - Iomap always follows the folio/block dirty flags And there is nothing really COW specific, xfs also supports COW too. Here we take one step towards ext4 by doing warning and erroring out. But since the cow fixup thing is introduced from the beginning, we keep the old behavior for non-experimental builds, and only do the new warning for experimental builds before we're 100% sure and remove cow fixup. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: return a literal instead of a variable in btrfs_init_dev_replace()Dan Carpenter1-2/+1
This is just a small clean up, it doesn't change how the code works. Originally this code had a goto so we needed to set "ret = 0;" but now it returns directly and so we can simplify it a bit by doing a "return 0;" and removing the assignment. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: move btrfs_cleanup_bio() code into its single callerFilipe Manana1-9/+5
The btrfs_cleanup_bio() helper is trivial and has a single caller, there's no point in having a dedicated helper function. So get rid of it and move its code into the caller (btrfs_bio_end_io()). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: move __btrfs_bio_end_io() code into its single callerFilipe Manana1-13/+9
The __btrfs_bio_end_io() helper is trivial and has a single caller, so there's no point in having a dedicated helper function. Further the double underscore prefix in the name is discouraged. So get rid of it and move its code into the caller (btrfs_bio_end_io()). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>