Currently if block size < page size, btrfs only supports one single
config, 4K.
This is mostly to reduce the test configurations, as 4K is going to be
the default block size for all architectures.
However all other major filesystems have no artificial limits on the
supported block size, and some already support block sizes > page
size.
Since btrfs subpage block support has been there for a long time, it's
time to enable support for all block sizes <= page size.
So enable all block sizes, as long as they are no larger than the page
size, for experimental builds.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Replace max_t() followed by min_t() with a single clamp().
As pointed out by David Laight in
https://lore.kernel.org/linux-btrfs/20250906122458.75dfc8f0@pumpkin/
the calculation may overflow u32 when the input value is too large, so
clamp_t() is not used. In practice the expected values are in the range
of megabytes to gigabytes (throughput limit) so the bug would not
happen.
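For illustration, the pattern being replaced looks roughly like this
(variable and limit names hypothetical):

  /* before: two separate bounds, each with a forced type */
  limit = max_t(u64, value, BTRFS_MIN_LIMIT);
  limit = min_t(u64, limit, BTRFS_MAX_LIMIT);

  /* after: a single clamp(), letting the compiler check the types */
  limit = clamp(value, BTRFS_MIN_LIMIT, BTRFS_MAX_LIMIT);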
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: David Sterba <dsterba@suse.com>
[ Use clamp() and add explanation. ]
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Annual typo fixing pass. Strangely, codespell found only about 30% of
what is in this patch; the rest was done manually using a text
spellchecker with a custom dictionary of acceptable terms.
Reviewed-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently the compression workspace buffer size is always based on
PAGE_SIZE, but btrfs has supported subpage sized block sizes for some
time.
This means that for a one-shot compression algorithm like LZO we're
wasting quite some memory if the block size is smaller than the page
size, as LZO only works on one block at a time (thus one-shot).
On 64K page sized systems with 4K block size, we only need at most an
8K buffer for LZO, but in reality we're allocating a 64K buffer.
So to reduce the memory usage, change all workspace buffers to base
their size on the block size rather than the page size.
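As a rough sketch of the sizing change (exact call sites vary per
algorithm):

  /* before: worst-case LZO output for a whole page */
  workspace->cbuf = kvmalloc(lzo1x_worst_compress(PAGE_SIZE), GFP_KERNEL);

  /* after: worst-case LZO output for a single block */
  workspace->cbuf = kvmalloc(lzo1x_worst_compress(fs_info->sectorsize),
                             GFP_KERNEL);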
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Since all workspace managers are per-fs, there is no need nor way to
store them inside btrfs_compress_op::wsm anymore.
With that said, we can do the following modifications:
- Remove zstd_workspace_manager::ops
Zstd always grabs the global btrfs_compress_op[].
- Remove the btrfs_compress_op::wsm member
- Rename btrfs_compress_op to btrfs_compress_levels
This should make it clearer that the btrfs_compress_levels structures
only indicate the levels of each compression algorithm.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Since all workspaces are handled by the per-fs workspace managers, we
can safely remove the old per-module managers.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There are several interfaces involved for each algorithm:
- alloc workspace
All algorithms allocate a workspace without the need for workspace
manager.
So no change needs to be done.
- get workspace
This involves checking the workspace manager to find a free one, and
if not, allocate a new one.
For none and lzo, which share the same generic btrfs_get_workspace()
helper, we only need to update that function to use the per-fs manager.
For zlib, which uses a wrapper around btrfs_get_workspace(), no special
work is needed.
For zstd, update zstd_find_workspace() and zstd_get_workspace() to
utilize the per-fs manager.
- put workspace
None/zlib/lzo share the same btrfs_put_workspace(), so update that
function to use the per-fs manager.
For zstd it's zstd_put_workspace(), which gets the same update.
- zstd specific timer
This is the timer to reclaim unused workspaces; change it to grab the
per-fs workspace manager instead.
Now all workspaces are managed by the per-fs manager.
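Conceptually, each of these helpers now starts from the per-fs manager
(a sketch, using the compr_wsm array introduced earlier in the series):

  struct workspace_manager *wsm = fs_info->compr_wsm[type];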
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This involves:
- Add (alloc|free)_workspace_manager helpers.
These are the helpers to alloc/free a workspace_manager structure.
The alloc helper will allocate a workspace_manager structure,
initialize it, and pre-allocate one workspace for it.
The free helper will do the cleanup and set the manager pointer to
NULL.
- Call alloc_workspace_manager() inside btrfs_alloc_compress_wsm()
- Call free_workspace_manager() inside btrfs_free_compress_wsm()
For the none, zlib and lzo compression algorithms.
For now the generic per-fs workspace managers won't really have any
effect, and all compression still goes through the global workspace
manager.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This involves:
- Add zstd_alloc_workspace_manager() and zstd_free_workspace_manager()
Those two functions will accept an fs_info pointer, and alloc/free the
fs_info->compr_wsm[BTRFS_COMPRESS_ZSTD] pointer.
- Add btrfs_alloc_compress_wsm() and btrfs_free_compress_wsm()
Those are helpers allocating the workspace managers for all
algorithms.
For now only zstd is supported, and the timing is a little unusual:
btrfs_alloc_compress_wsm() should only be called after the
sectorsize has been initialized.
Meanwhile btrfs_free_fs_info_compress() is called in
btrfs_free_fs_info().
- Move the definition of btrfs_compression_type to "fs.h"
The reason is that "compression.h" already includes "fs.h", thus "fs.h"
cannot include "compression.h" to get the definition of
BTRFS_NR_COMPRESS_TYPES needed to define fs_info::compr_wsm[].
For now the per-fs zstd workspace manager won't really have any effect,
and all compression is still going through the global workspace manager.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BACKGROUND]
Currently btrfs shares workspaces and their managers across all
filesystems. This is mostly fine as all those workspaces use page size
based buffers, and btrfs only supports block size (bs) <= page size
(ps).
This means even if bs < ps, we at most waste some buffer space in the
workspace, but everything will still work fine.
The problem is that this limits our support for bs > ps cases.
A workspace may now need a larger buffer to handle bs > ps, but since
the pool has no way to distinguish different workspaces, a regular
workspace (still using a buffer sized based on ps) can be handed to a
btrfs whose bs > ps.
In that case the buffer is not large enough, and will cause various
problems.
[ENHANCEMENT]
To prepare for the per-fs workspace migration, add an fs_info parameter
to all workspace related functions.
For now this new fs_info parameter is not yet utilized.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
There is a very low chance that DEBUG_WARN() inside
btrfs_writepage_cow_fixup() can be triggered when
CONFIG_BTRFS_EXPERIMENTAL is enabled.
This only happens after run_delalloc_nocow() failed.
Unfortunately I haven't hit it for a while thus no real world dmesg for
now.
[CAUSE]
There is a race window where after run_delalloc_nocow() failed, error
handling can race with writeback thread.
Before we hit run_delalloc_nocow(), there is an inode with the following
dirty pages: (4K page size, 4K block size, no large folio)
0         4K          8K          12K          16K
|/////////|///////////|///////////|////////////|
The inode also has the NODATACOW flag, and the above dirty range will
go through different extents during run_delalloc_range():
0         4K          8K          12K          16K
|  NOCOW  |    COW    |    COW    |   NOCOW    |
The race happens like this:
writeback thread A                | writeback thread B
----------------------------------+--------------------------------------
Writeback for folio 0             |
run_delalloc_nocow()              |
|- nocow_one_range()              |
|  For range [0, 4K), ret = 0     |
|                                 |
|- fallback_to_cow()              |
|  For range [4K, 8K), ret = 0    |
|  Folio 4K *UNLOCKED*            |
|                                 | Writeback for folio 4K
|- fallback_to_cow()              | extent_writepage()
|  For range [8K, 12K), failure   | |- writepage_delalloc()
|                                 | |
|- btrfs_cleanup_ordered_extents()| |
   |- btrfs_folio_clear_ordered() | |
   |  Folio 0 still locked, safe  | |
   |                              | | Ordered extent already allocated.
   |                              | | Nothing to do.
   |                              | |- extent_writepage_io()
   |                              |    |- btrfs_writepage_cow_fixup()
   |- btrfs_folio_clear_ordered() |      |
      Folio 4K held by thread B,  |      |
      UNSAFE!                     |      |- btrfs_test_ordered()
                                  |      |  Cleared by thread A,
                                  |      |
                                  |      |- DEBUG_WARN();
This is only possible after a run_delalloc_nocow() failure, as
cow_file_range() will keep all folios and io tree ranges locked until
everything is finished or error handling has run.
The root cause is that we allow fallback_to_cow() and nocow_one_range()
to unlock the folios after a successful run, so during error handling
it is no longer safe to use btrfs_cleanup_ordered_extents() as the
folios are already unlocked.
[FIX]
- Make fallback_to_cow() and nocow_one_range() keep folios locked after
a successful run
For fallback_to_cow() we can pass COW_FILE_RANGE_KEEP_LOCKED flag
into cow_file_range().
For nocow_one_range() we have to remove the PAGE_UNLOCK flag from
extent_clear_unlock_delalloc().
- Unlock folios if everything is fine in run_delalloc_nocow()
- Use extent_clear_unlock_delalloc() to handle range [@start,
@cur_offset) inside run_delalloc_nocow()
Since folios are still locked, we do not need
cleanup_dirty_folios() to do the cleanup.
extent_clear_unlock_delalloc() with "PAGE_START_WRITEBACK |
PAGE_END_WRITEBACK" will clear the dirty flags.
- Remove cleanup_dirty_folios()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently if we hit an error inside nocow_one_range(), we do not clear
the folio dirty flag, and let the caller handle it.
This is very different from fallback_to_cow(), where on failure
everything is cleaned up by cow_file_range().
Enhance the situation by:
- Use a common error handling for nocow_one_range()
If anything failed, use the same btrfs_cleanup_ordered_extents() and
extent_clear_unlock_delalloc() calls.
btrfs_cleanup_ordered_extents() is safe even if we haven't created a
new ordered extent; in that case there is no OE and the function will
do nothing.
The same applies to extent_clear_unlock_delalloc(), and since we're
passing PAGE_UNLOCK | PAGE_START_WRITEBACK | PAGE_END_WRITEBACK, it
will also clear folio dirty flag during error handling.
- Avoid touching the failed range of nocow_one_range()
As the failed range will be cleaned up and unlocked by that function.
Here we introduce a new variable @nocow_end to record the failed range,
so that we can skip it during the error handling of run_delalloc_nocow().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When running emulated write error tests like generic/475, we can hit
error messages like this:
BTRFS error (device dm-12 state EA): run_delalloc_nocow failed, root=596 inode=264 start=1605632 len=73728: -5
BTRFS error (device dm-12 state EA): failed to run delalloc range, root=596 ino=264 folio=1605632 submit_bitmap=0-7 start=1605632 len=73728: -5
These are normally buried among direct IO error messages.
However the above error messages are not enough to determine the real
range that caused the error.
Considering we can have multiple different extents in one delalloc
range (e.g. some COW extents along with some NOCOW extents), just
outputting the error at the end of run_delalloc_nocow() is not enough.
To enhance the error messages:
- Remove the rate limit on the existing error messages
In the generic/475 example, most error messages are from direct IO,
not really from the delalloc range.
Considering how useful the delalloc range error messages are, we don't
want them to be rate limited.
- Add extra @cur_offset output for cow_file_range()
- Add extra variable output for run_delalloc_nocow()
This is especially important for run_delalloc_nocow(), as there are
extra error paths where we can hit an error without entering
nocow_one_range() or fallback_to_cow().
- Add an error message for nocow_one_range()
That's the missing part.
For fallback_to_cow(), we have error message from cow_file_range()
already.
- Constify the @len and @end local variables for nocow_one_range()
This makes it much easier to make sure @len and @end are not modified
at runtime.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Commit 79525b51acc1 ("io_uring: fix nvme's 32b cqes on mixed cq") split
out a separate io_uring_cmd_done32() helper for ->uring_cmd()
implementations that return 32-byte CQEs. The res2 value passed to
io_uring_cmd_done() is now unused because __io_uring_cmd_done() ignores
it when is_cqe32 is passed as false. So drop the parameter from
io_uring_cmd_done() to simplify the callers and clarify that it's not
possible to return an extra value beyond the 32-bit CQE result.
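The call site change is then simply (illustrative):

  /* before: res2 was accepted but ignored for 16-byte CQEs */
  io_uring_cmd_done(cmd, ret, 0, issue_flags);

  /* after */
  io_uring_cmd_done(cmd, ret, issue_flags);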
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Since 'ocfs2_sprintf_system_inode_name()' uses 'snprintf()' and returns
the number of characters emitted, callers of the former are better off
using that return value instead of explicit calls to 'strlen()'.
Link: https://lkml.kernel.org/r/20250917060229.1854335-1-dmantipov@yandex.ru
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The hfsplus_strcasecmp() logic can trigger the issue:
[ 117.317703][ T9855] ==================================================================
[ 117.318353][ T9855] BUG: KASAN: slab-out-of-bounds in hfsplus_strcasecmp+0x1bc/0x490
[ 117.318991][ T9855] Read of size 2 at addr ffff88802160f40c by task repro/9855
[ 117.319577][ T9855]
[ 117.319773][ T9855] CPU: 0 UID: 0 PID: 9855 Comm: repro Not tainted 6.17.0-rc6 #33 PREEMPT(full)
[ 117.319780][ T9855] Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 117.319783][ T9855] Call Trace:
[ 117.319785][ T9855] <TASK>
[ 117.319788][ T9855] dump_stack_lvl+0x1c1/0x2a0
[ 117.319795][ T9855] ? __virt_addr_valid+0x1c8/0x5c0
[ 117.319803][ T9855] ? __pfx_dump_stack_lvl+0x10/0x10
[ 117.319808][ T9855] ? rcu_is_watching+0x15/0xb0
[ 117.319816][ T9855] ? lock_release+0x4b/0x3e0
[ 117.319821][ T9855] ? __kasan_check_byte+0x12/0x40
[ 117.319828][ T9855] ? __virt_addr_valid+0x1c8/0x5c0
[ 117.319835][ T9855] ? __virt_addr_valid+0x4a5/0x5c0
[ 117.319842][ T9855] print_report+0x17e/0x7e0
[ 117.319848][ T9855] ? __virt_addr_valid+0x1c8/0x5c0
[ 117.319855][ T9855] ? __virt_addr_valid+0x4a5/0x5c0
[ 117.319862][ T9855] ? __phys_addr+0xd3/0x180
[ 117.319869][ T9855] ? hfsplus_strcasecmp+0x1bc/0x490
[ 117.319876][ T9855] kasan_report+0x147/0x180
[ 117.319882][ T9855] ? hfsplus_strcasecmp+0x1bc/0x490
[ 117.319891][ T9855] hfsplus_strcasecmp+0x1bc/0x490
[ 117.319900][ T9855] ? __pfx_hfsplus_cat_case_cmp_key+0x10/0x10
[ 117.319906][ T9855] hfs_find_rec_by_key+0xa9/0x1e0
[ 117.319913][ T9855] __hfsplus_brec_find+0x18e/0x470
[ 117.319920][ T9855] ? __pfx_hfsplus_bnode_find+0x10/0x10
[ 117.319926][ T9855] ? __pfx_hfs_find_rec_by_key+0x10/0x10
[ 117.319933][ T9855] ? __pfx___hfsplus_brec_find+0x10/0x10
[ 117.319942][ T9855] hfsplus_brec_find+0x28f/0x510
[ 117.319949][ T9855] ? __pfx_hfs_find_rec_by_key+0x10/0x10
[ 117.319956][ T9855] ? __pfx_hfsplus_brec_find+0x10/0x10
[ 117.319963][ T9855] ? __kmalloc_noprof+0x2a9/0x510
[ 117.319969][ T9855] ? hfsplus_find_init+0x8c/0x1d0
[ 117.319976][ T9855] hfsplus_brec_read+0x2b/0x120
[ 117.319983][ T9855] hfsplus_lookup+0x2aa/0x890
[ 117.319990][ T9855] ? __pfx_hfsplus_lookup+0x10/0x10
[ 117.320003][ T9855] ? d_alloc_parallel+0x2f0/0x15e0
[ 117.320008][ T9855] ? __lock_acquire+0xaec/0xd80
[ 117.320013][ T9855] ? __pfx_d_alloc_parallel+0x10/0x10
[ 117.320019][ T9855] ? __raw_spin_lock_init+0x45/0x100
[ 117.320026][ T9855] ? __init_waitqueue_head+0xa9/0x150
[ 117.320034][ T9855] __lookup_slow+0x297/0x3d0
[ 117.320039][ T9855] ? __pfx___lookup_slow+0x10/0x10
[ 117.320045][ T9855] ? down_read+0x1ad/0x2e0
[ 117.320055][ T9855] lookup_slow+0x53/0x70
[ 117.320065][ T9855] walk_component+0x2f0/0x430
[ 117.320073][ T9855] path_lookupat+0x169/0x440
[ 117.320081][ T9855] filename_lookup+0x212/0x590
[ 117.320089][ T9855] ? __pfx_filename_lookup+0x10/0x10
[ 117.320098][ T9855] ? strncpy_from_user+0x150/0x290
[ 117.320105][ T9855] ? getname_flags+0x1e5/0x540
[ 117.320112][ T9855] user_path_at+0x3a/0x60
[ 117.320117][ T9855] __x64_sys_umount+0xee/0x160
[ 117.320123][ T9855] ? __pfx___x64_sys_umount+0x10/0x10
[ 117.320129][ T9855] ? do_syscall_64+0xb7/0x3a0
[ 117.320135][ T9855] ? entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 117.320141][ T9855] ? entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 117.320145][ T9855] do_syscall_64+0xf3/0x3a0
[ 117.320150][ T9855] ? exc_page_fault+0x9f/0xf0
[ 117.320154][ T9855] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 117.320158][ T9855] RIP: 0033:0x7f7dd7908b07
[ 117.320163][ T9855] Code: 23 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 31 f6 e9 09 00 00 00 66 0f 1f 84 00 00 08
[ 117.320167][ T9855] RSP: 002b:00007ffd5ebd9698 EFLAGS: 00000202 ORIG_RAX: 00000000000000a6
[ 117.320172][ T9855] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7dd7908b07
[ 117.320176][ T9855] RDX: 0000000000000009 RSI: 0000000000000009 RDI: 00007ffd5ebd9740
[ 117.320179][ T9855] RBP: 00007ffd5ebda780 R08: 0000000000000005 R09: 00007ffd5ebd9530
[ 117.320181][ T9855] R10: 00007f7dd799bfc0 R11: 0000000000000202 R12: 000055e2008b32d0
[ 117.320184][ T9855] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 117.320189][ T9855] </TASK>
[ 117.320190][ T9855]
[ 117.351311][ T9855] Allocated by task 9855:
[ 117.351683][ T9855] kasan_save_track+0x3e/0x80
[ 117.352093][ T9855] __kasan_kmalloc+0x8d/0xa0
[ 117.352490][ T9855] __kmalloc_noprof+0x288/0x510
[ 117.352914][ T9855] hfsplus_find_init+0x8c/0x1d0
[ 117.353342][ T9855] hfsplus_lookup+0x19c/0x890
[ 117.353747][ T9855] __lookup_slow+0x297/0x3d0
[ 117.354148][ T9855] lookup_slow+0x53/0x70
[ 117.354514][ T9855] walk_component+0x2f0/0x430
[ 117.354921][ T9855] path_lookupat+0x169/0x440
[ 117.355325][ T9855] filename_lookup+0x212/0x590
[ 117.355740][ T9855] user_path_at+0x3a/0x60
[ 117.356115][ T9855] __x64_sys_umount+0xee/0x160
[ 117.356529][ T9855] do_syscall_64+0xf3/0x3a0
[ 117.356920][ T9855] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 117.357429][ T9855]
[ 117.357636][ T9855] The buggy address belongs to the object at ffff88802160f000
[ 117.357636][ T9855] which belongs to the cache kmalloc-2k of size 2048
[ 117.358827][ T9855] The buggy address is located 0 bytes to the right of
[ 117.358827][ T9855] allocated 1036-byte region [ffff88802160f000, ffff88802160f40c)
[ 117.360061][ T9855]
[ 117.360266][ T9855] The buggy address belongs to the physical page:
[ 117.360813][ T9855] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x21608
[ 117.361562][ T9855] head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[ 117.362285][ T9855] flags: 0xfff00000000040(head|node=0|zone=1|lastcpupid=0x7ff)
[ 117.362929][ T9855] page_type: f5(slab)
[ 117.363282][ T9855] raw: 00fff00000000040 ffff88801a842f00 ffffea0000932000 dead000000000002
[ 117.364015][ T9855] raw: 0000000000000000 0000000080080008 00000000f5000000 0000000000000000
[ 117.364750][ T9855] head: 00fff00000000040 ffff88801a842f00 ffffea0000932000 dead000000000002
[ 117.365491][ T9855] head: 0000000000000000 0000000080080008 00000000f5000000 0000000000000000
[ 117.366232][ T9855] head: 00fff00000000003 ffffea0000858201 00000000ffffffff 00000000ffffffff
[ 117.366968][ T9855] head: ffffffffffffffff 0000000000000000 00000000ffffffff 0000000000000008
[ 117.367711][ T9855] page dumped because: kasan: bad access detected
[ 117.368259][ T9855] page_owner tracks the page as allocated
[ 117.368745][ T9855] page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN1
[ 117.370541][ T9855] post_alloc_hook+0x240/0x2a0
[ 117.370954][ T9855] get_page_from_freelist+0x2101/0x21e0
[ 117.371435][ T9855] __alloc_frozen_pages_noprof+0x274/0x380
[ 117.371935][ T9855] alloc_pages_mpol+0x241/0x4b0
[ 117.372360][ T9855] allocate_slab+0x8d/0x380
[ 117.372752][ T9855] ___slab_alloc+0xbe3/0x1400
[ 117.373159][ T9855] __kmalloc_cache_noprof+0x296/0x3d0
[ 117.373621][ T9855] nexthop_net_init+0x75/0x100
[ 117.374038][ T9855] ops_init+0x35c/0x5c0
[ 117.374400][ T9855] setup_net+0x10c/0x320
[ 117.374768][ T9855] copy_net_ns+0x31b/0x4d0
[ 117.375156][ T9855] create_new_namespaces+0x3f3/0x720
[ 117.375613][ T9855] unshare_nsproxy_namespaces+0x11c/0x170
[ 117.376094][ T9855] ksys_unshare+0x4ca/0x8d0
[ 117.376477][ T9855] __x64_sys_unshare+0x38/0x50
[ 117.376879][ T9855] do_syscall_64+0xf3/0x3a0
[ 117.377265][ T9855] page last free pid 9110 tgid 9110 stack trace:
[ 117.377795][ T9855] __free_frozen_pages+0xbeb/0xd50
[ 117.378229][ T9855] __put_partials+0x152/0x1a0
[ 117.378625][ T9855] put_cpu_partial+0x17c/0x250
[ 117.379026][ T9855] __slab_free+0x2d4/0x3c0
[ 117.379404][ T9855] qlist_free_all+0x97/0x140
[ 117.379790][ T9855] kasan_quarantine_reduce+0x148/0x160
[ 117.380250][ T9855] __kasan_slab_alloc+0x22/0x80
[ 117.380662][ T9855] __kmalloc_noprof+0x232/0x510
[ 117.381074][ T9855] tomoyo_supervisor+0xc0a/0x1360
[ 117.381498][ T9855] tomoyo_env_perm+0x149/0x1e0
[ 117.381903][ T9855] tomoyo_find_next_domain+0x15ad/0x1b90
[ 117.382378][ T9855] tomoyo_bprm_check_security+0x11c/0x180
[ 117.382859][ T9855] security_bprm_check+0x89/0x280
[ 117.383289][ T9855] bprm_execve+0x8f1/0x14a0
[ 117.383673][ T9855] do_execveat_common+0x528/0x6b0
[ 117.384103][ T9855] __x64_sys_execve+0x94/0xb0
[ 117.384500][ T9855]
[ 117.384706][ T9855] Memory state around the buggy address:
[ 117.385179][ T9855] ffff88802160f300: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 117.385854][ T9855] ffff88802160f380: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 117.386534][ T9855] >ffff88802160f400: 00 04 fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 117.387204][ T9855] ^
[ 117.387566][ T9855] ffff88802160f480: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 117.388243][ T9855] ffff88802160f500: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 117.388918][ T9855] ==================================================================
The issue takes place if the length field of struct hfsplus_unistr is
bigger than HFSPLUS_MAX_STRLEN. The patch simply checks the length of
the compared strings, and if a string's length is bigger than
HFSPLUS_MAX_STRLEN, it is corrected to this value.
v2
The string length correction has also been added for hfsplus_strcmp().
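The correction is conceptually the following (a sketch; the actual
patch applies it in both comparison helpers):

  len1 = be16_to_cpu(s1->length);
  if (len1 > HFSPLUS_MAX_STRLEN)
          len1 = HFSPLUS_MAX_STRLEN;

  len2 = be16_to_cpu(s2->length);
  if (len2 > HFSPLUS_MAX_STRLEN)
          len2 = HFSPLUS_MAX_STRLEN;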
Reported-by: Jiaming Zhang <r772577952@gmail.com>
Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com>
cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
cc: Yangtao Li <frank.li@vivo.com>
cc: linux-fsdevel@vger.kernel.org
cc: syzkaller@googlegroups.com
Link: https://lore.kernel.org/r/20250919191243.1370388-1-slava@dubeyko.com
Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com>
|
|
When parsing Allocation Extent Descriptor, lengthAllocDescs comes from
on-disk data and must be validated against the block size. Crafted or
corrupted images may set lengthAllocDescs so that the total descriptor
length (sizeof(allocExtDesc) + lengthAllocDescs) exceeds the buffer,
leading udf_update_tag() to call crc_itu_t() on out-of-bounds memory and
trigger a KASAN use-after-free read.
BUG: KASAN: use-after-free in crc_itu_t+0x1d5/0x2b0 lib/crc-itu-t.c:60
Read of size 1 at addr ffff888041e7d000 by task syz-executor317/5309
CPU: 0 UID: 0 PID: 5309 Comm: syz-executor317 Not tainted 6.12.0-rc4-syzkaller-00261-g850925a8133c #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:94 [inline]
dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
print_address_description mm/kasan/report.c:377 [inline]
print_report+0x169/0x550 mm/kasan/report.c:488
kasan_report+0x143/0x180 mm/kasan/report.c:601
crc_itu_t+0x1d5/0x2b0 lib/crc-itu-t.c:60
udf_update_tag+0x70/0x6a0 fs/udf/misc.c:261
udf_write_aext+0x4d8/0x7b0 fs/udf/inode.c:2179
extent_trunc+0x2f7/0x4a0 fs/udf/truncate.c:46
udf_truncate_tail_extent+0x527/0x7e0 fs/udf/truncate.c:106
udf_release_file+0xc1/0x120 fs/udf/file.c:185
__fput+0x23f/0x880 fs/file_table.c:431
task_work_run+0x24f/0x310 kernel/task_work.c:239
exit_task_work include/linux/task_work.h:43 [inline]
do_exit+0xa2f/0x28e0 kernel/exit.c:939
do_group_exit+0x207/0x2c0 kernel/exit.c:1088
__do_sys_exit_group kernel/exit.c:1099 [inline]
__se_sys_exit_group kernel/exit.c:1097 [inline]
__x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1097
x64_sys_call+0x2634/0x2640 arch/x86/include/generated/asm/syscalls_64.h:232
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
</TASK>
Validate the computed total length against epos->bh->b_size.
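The added check is along these lines (a sketch, exact placement per the
patch):

  struct allocExtDesc *aed = (struct allocExtDesc *)epos->bh->b_data;
  u32 len = le32_to_cpu(aed->lengthAllocDescs);

  if (sizeof(struct allocExtDesc) + len > epos->bh->b_size)
          return;   /* corrupted descriptor, skip the CRC update */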
Found by Linux Verification Center (linuxtesting.org) with Syzkaller.
Reported-by: syzbot+8743fca924afed42f93e@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=8743fca924afed42f93e
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
Signed-off-by: Larshin Sergey <Sergey.Larshin@kaspersky.com>
Link: https://patch.msgid.link/20250922131358.745579-1-Sergey.Larshin@kaspersky.com
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Simply derive the ns operations from the namespace type.
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Except for the 'xchk_setup_metapath_rtginode()' case, the 'path'
argument of 'xchk_setup_metapath_scan()' is a compile-time constant. So
it may be reasonable to use 'kstrdup_const()' / 'kfree_const()' to
manage the 'path' field of 'struct xchk_metapath' in an attempt to
reuse a .rodata instance rather than making a copy. Compile tested
only.
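That is, roughly:

  mpath->path = kstrdup_const(path, GFP_KERNEL);
  /* ... and later, on teardown ... */
  kfree_const(mpath->path);

kstrdup_const() returns the string itself when it lives in .rodata, and
kfree_const() only frees when the string was actually duplicated.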
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
Only ranges inside the file system can be translated, and the file system
can be smaller than the containing device.
Fixes: f4ed93037966 ("xfs: don't shut down the filesystem for media failures beyond end of log")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
Add a bt_nr_sectors field to track the number of sectors in each
buftarg, and replace the check that hard codes sb_dblocks in
xfs_buf_map_verify() with this new value so that it is correct for
non-ddev buftargs. The RT buftarg only has a superblock in the first
block, so it is unlikely to trigger this, nor are we likely to ever
have enough blocks in the in-memory buftargs, but we might as well get
the check right.
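A sketch of the adjusted check (the old code computed the limit from
sb_dblocks):

  /* bound the buffer by the buftarg itself, not the data device */
  if (map->bm_bn < 0 || map->bm_bn >= btp->bt_nr_sectors)
          return -EFSCORRUPTED;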
Fixes: 10616b806d1d ("xfs: fix _xfs_buf_find oops on blocks beyond the filesystem end")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
Currently the error handling of run_delalloc_nocow() modifies
@cur_offset to handle different parts of the delalloc range.
However the error handling can always be split into 3 parts:
1) The range with ordered extents allocated (OE cleanup)
We have to cleanup the ordered extents and unlock the folios.
2) The range that have been cleaned up (skip)
For now it's only for the range of fallback_to_cow().
We should not touch it at all.
3) The range that is not yet touched (untouched)
We have to unlock the folios and clear any reserved space.
This 3-range split follows the same principle as cow_file_range(),
however the NOCOW/COW handling makes the split much more complex:
a) Failure immediately after a successful OE allocation
Thus no @cow_start nor @cow_end set.
start           cur_offset              end
|  OE cleanup   |      untouched       |
b) Failure after hitting a COW range but before calling
fallback_to_cow()
start     cow_start     cur_offset      end
|       OE Cleanup     |   untouched   |
c) Failure to call fallback_to_cow()
start            cow_start    cow_end       end
|   OE Cleanup   |    skip    |  untouched  |
Instead of modifying @cur_offset, do proper range calculations for the
OE-cleanup and untouched ranges using the above 3 cases with proper
range charts.
This avoids updating @cur_offset, as it will be an extra output for
debugging purposes later, and makes the behavior easier to explain.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The ref_tracker infrastructure aids debugging but is not enabled by
default as it has a performance impact. Add a mount option
'ref_tracker' so it can be selectively enabled on a filesystem.
Currently it tracks references of 'delayed inodes'.
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We are seeing soft lockups in kill_all_delayed_nodes due to a
delayed_node with a lingering reference count of 1. Printing at this
point will reveal the guilty stack trace. If the delayed_node has no
references there should be no output.
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Add ref_tracker infrastructure for struct btrfs_delayed_node.
It is a response to the largest btrfs related crash in our fleet. We're
seeing soft lockups in btrfs_kill_all_delayed_nodes() that seem to be a
result of delayed_nodes not being released properly.
A ref_tracker object is allocated on reference count increases and freed
on reference count decreases. The ref_tracker object stores a stack
trace of where it is allocated. The ref_tracker_dir object is embedded
in btrfs_delayed_node and keeps track of all current and some old/freed
ref_tracker objects. When a leak is detected we can print the stack
traces for all ref_trackers that have not yet been freed.
Here is a common example of taking a reference to a delayed_node and
freeing it with ref_tracker.
struct btrfs_ref_tracker tracker;
struct btrfs_delayed_node *node;
node = btrfs_get_delayed_node(inode, &tracker);
// use delayed_node...
btrfs_release_delayed_node(node, &tracker);
There are two special cases where the delayed_node reference is "long
lived", meaning that the thread that takes the reference and the thread
that releases the reference are different. The 'inode_cache_tracker'
tracks the delayed_node stored in btrfs_inode. The 'node_list_tracker'
tracks the delayed_node stored in the btrfs_delayed_root
node_list/prepare_list. These trackers are embedded in the
btrfs_delayed_node.
btrfs_ref_tracker and btrfs_ref_tracker_dir are wrappers that either
compile to the corresponding ref_tracker structs or empty structs
depending on CONFIG_BTRFS_DEBUG. There are also btrfs wrappers for
the ref_tracker API.
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We're almost done cleaning misused int/bool parameters. Convert a bunch
of them, found by manual grepping. Note that btrfs_sync_fs() needs an
int as it's mandated by the struct super_operations prototype.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Remove the CONFIG_BTRFS_FS_REF_VERIFY Kconfig option and make it part
of CONFIG_BTRFS_DEBUG. This should not impact the performance of debug
builds. The struct btrfs_ref takes an additional u64, btrfs_fs_info
takes an additional spinlock_t and rb_root. All of the ref_verify logic
is still protected by a mount option.
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Use the standard error pointer macro to simplify the code.
Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: Xichao Zhao <zhao.xichao@vivo.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently we manually check the block size against 3 different values:
- 4K
- PAGE_SIZE
- MIN_BLOCKSIZE
Those 3 values can match or differ from each other, which makes it
pretty complex to output the supported block sizes.
Considering we're going to add block size > page size support soon,
this can make the supported block sizes sysfs attribute much harder to
implement.
To make it easier, factor out a helper, btrfs_supported_blocksize() to
do a simple check for the block size.
Then utilize it in the two locations:
- btrfs_validate_super()
This is very straightforward
- supported_sectorsizes_show()
Iterate through all valid block sizes, and only output supported ones.
This is to make future full range block sizes support much easier.
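A sketch of what such a helper can look like, encoding the three checks
mentioned above (the exact accepted set depends on the build options):

  static inline bool btrfs_supported_blocksize(u32 blocksize)
  {
          return blocksize == SZ_4K || blocksize == PAGE_SIZE ||
                 blocksize == BTRFS_MIN_BLOCKSIZE;
  }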
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BEHAVIOR DIFFERENCE BETWEEN COMPRESSION ALGOS]
Currently the LZO compression algorithm checks if we're making the
compressed data larger after compressing more than 2 blocks.
But zlib and zstd do the same check after compressing more than 8192
bytes.
This is not a big deal, but since we already support larger block
sizes (e.g. 64K block size if the page size is also 64K), this check
is not suitable for all block sizes.
For example, if our page and block size are both 16KiB, and after the
first block is compressed using zlib the resulting compressed data is
slightly larger than 16KiB, we will immediately abort the compression.
This makes the zstd and zlib compression algorithms behave slightly
differently from LZO, which only aborts after compressing two blocks.
[ENHANCEMENT]
To unify the behavior, only abort the compression after compressing at
least two blocks.
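Conceptually the abort condition becomes block-size based rather than a
fixed byte count (names illustrative):

  /* give up only once two blocks went in and the output still grew */
  if (total_in > 2 * blocksize && total_out > total_in)
          return -E2BIG;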
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
For the 3 supported compression algorithms, two of them (zstd and zlib)
already grab the btrfs inode for error messages.
It's more common to pass a btrfs_inode and grab the address space from
it.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The hint block group selection in the extent allocator is wrong in the
first place, as it can select the dedicated data relocation block group for
the normal data allocation.
Since we separated the normal data space_info and the data relocation
space_info, we can easily identify whether a block group is for data
relocation or not. Do not choose it for normal data allocations.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
If you run a workload with:
- a cgroup that does tons of parallel data reading, with a working set
much larger than its memory limit
- a second cgroup that writes relatively fewer files, with overwrites,
with no memory limit
(see full code listing at the bottom for a reproducer)
Then what quickly occurs is:
- we have a large number of threads trying to read the csum tree
- we have a decent number of threads deleting csums running delayed refs
- we have a large number of threads in direct reclaim and thus high
memory pressure
The result of this is that we writeback the csum tree repeatedly mid
transaction, to get back the extent_buffer folios for reclaim. As a
result, we repeatedly COW the csum tree for the delayed refs that are
deleting csums. This means repeatedly write locking the higher levels of
the tree.
As a result of this, we achieve an unpleasant priority inversion. We
have:
- a high degree of contention on the csum root node (and other upper
nodes) eb rwsem
- a memory starved cgroup doing tons of reclaim on CPU.
- many reader threads in the memory starved cgroup "holding" the sem
as readers, but not scheduling promptly. i.e., task __state == 0, but
not running on a cpu.
- btrfs_commit_transaction stuck trying to acquire the sem as a writer.
(running delayed_refs, deleting csums for unreferenced data extents)
This results in arbitrarily long transactions. This then results in
seriously degraded performance for any cgroup using the filesystem (the
victim cgroup in the script).
It isn't an academic problem, as we see this exact problem in production
at Meta with one cgroup over its memory limit ruining btrfs performance
for the whole system, stalling critical system services that depend on
btrfs syncs.
The underlying scheduling "problem" with global rwsems is sort of thorny
and apparently well known and was discussed at LPC 2024, for example.
As a result, our main lever in the short term is just trying to reduce
contention on our various rwsems with an eye to reducing the frequency
of write locking, to avoid disabling the read lock fast acquisition path.
Luckily, it seems likely that many reads are for old extents written
many transactions ago, and that for those we *can* in fact search the
commit root. The commit_root_sem only gets taken write once, near the
end of transaction commit, no matter how much memory pressure there is,
so we have much less contention between readers and writers.
This change detects when we are trying to read an old extent (according
to extent map generation) and then wires that through bio_ctrl to the
btrfs_bio, which unfortunately isn't allocated yet when we have this
information. When we go to lookup the csums in lookup_bio_sums we can
check this condition on the btrfs_bio and do the commit root lookup
accordingly.
Note that a single bio_ctrl might collect a few extent_maps into a single
bio, so it is important to track a maximum generation across all the
extent_maps used for each bio to make an accurate decision on whether it
is valid to look in the commit root. If any extent_map is updated in the
current generation, we can't use the commit root.
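A rough sketch of the tracking (field names illustrative):

  /* while assembling the bio, remember the newest extent map */
  bio_ctrl->max_em_generation = max(bio_ctrl->max_em_generation,
                                    em->generation);

  /* at submit time, only an entirely old bio may search the commit root */
  if (bbio->max_em_generation < btrfs_get_fs_generation(fs_info))
          use_commit_root = true;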
To test and reproduce this issue, I used the following script and
accompanying C program (to avoid bottlenecks in constantly forking
thousands of dd processes):
====== big-read.c ======
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <errno.h>
#define BUF_SZ (128 * (1 << 10UL))
int read_once(int fd, size_t sz) {
        char buf[BUF_SZ];
        size_t rd = 0;
        int ret = 0;

        while (rd < sz) {
                ret = read(fd, buf, BUF_SZ);
                if (ret < 0) {
                        if (errno == EINTR)
                                continue;
                        fprintf(stderr, "read failed: %d\n", errno);
                        return -errno;
                } else if (ret == 0) {
                        break;
                } else {
                        rd += ret;
                }
        }
        return rd;
}

int read_loop(char *fname) {
        int fd;
        struct stat st;
        size_t sz = 0;
        int ret;

        while (1) {
                fd = open(fname, O_RDONLY);
                if (fd == -1) {
                        perror("open");
                        return 1;
                }
                if (!sz) {
                        if (!fstat(fd, &st)) {
                                sz = st.st_size;
                        } else {
                                perror("stat");
                                return 1;
                        }
                }
                ret = read_once(fd, sz);
                close(fd);
        }
}

int main(int argc, char *argv[]) {
        int fd;
        struct stat st;
        off_t sz;
        char *buf;
        int ret;

        if (argc != 2) {
                fprintf(stderr, "Usage: %s <filename>\n", argv[0]);
                return 1;
        }
        return read_loop(argv[1]);
}
====== repro.sh ======
#!/usr/bin/env bash
SCRIPT=$(readlink -f "$0")
DIR=$(dirname "$SCRIPT")
dev=$1
mnt=$2
shift
shift
CG_ROOT=/sys/fs/cgroup
BAD_CG=$CG_ROOT/bad-nbr
GOOD_CG=$CG_ROOT/good-nbr
NR_BIGGOS=1
NR_LITTLE=10
NR_VICTIMS=32
NR_VILLAINS=512
START_SEC=$(date +%s)
_elapsed() {
        echo "elapsed: $(($(date +%s) - $START_SEC))"
}

_stats() {
        local sysfs=/sys/fs/btrfs/$(findmnt -no UUID $dev)

        echo "================"
        date
        _elapsed
        cat $sysfs/commit_stats
        cat $BAD_CG/memory.pressure
}

_setup_cgs() {
        echo "+memory +cpuset" > $CG_ROOT/cgroup.subtree_control
        mkdir -p $GOOD_CG
        mkdir -p $BAD_CG
        echo max > $BAD_CG/memory.max
        # memory.high much less than the working set will cause heavy reclaim
        echo $((1 << 30)) > $BAD_CG/memory.high
        # victims get a subset of villain CPUs
        echo 0 > $GOOD_CG/cpuset.cpus
        echo 0,1,2,3 > $BAD_CG/cpuset.cpus
}

_kill_cg() {
        local cg=$1
        local attempts=0

        echo "kill cgroup $cg"
        [ -f $cg/cgroup.procs ] || return
        while true; do
                attempts=$((attempts + 1))
                echo 1 > $cg/cgroup.kill
                sleep 1
                procs=$(wc -l $cg/cgroup.procs | cut -d' ' -f1)
                [ $procs -eq 0 ] && break
        done
        rmdir $cg
        echo "killed cgroup $cg in $attempts attempts"
}

_biggo_vol() {
        echo $mnt/biggo_vol.$1
}

_biggo_file() {
        echo $(_biggo_vol $1)/biggo
}

_subvoled_biggos() {
        total_sz=$((10 << 30))
        per_sz=$((total_sz / $NR_VILLAINS))
        dd_count=$((per_sz >> 20))
        echo "create $NR_VILLAINS subvols with a file of size $per_sz bytes for a total of $total_sz bytes."
        for i in $(seq $NR_VILLAINS)
        do
                btrfs subvol create $(_biggo_vol $i) &>/dev/null
                dd if=/dev/zero of=$(_biggo_file $i) bs=1M count=$dd_count &>/dev/null
        done
        echo "done creating subvols."
}

_setup() {
        [ -f .done ] && rm .done
        findmnt -n $dev && exit 1
        if [ -f .re-mkfs ]; then
                mkfs.btrfs -f -m single -d single $dev >/dev/null || exit 2
        else
                echo "touch .re-mkfs to populate the test fs"
        fi
        mount -o noatime $dev $mnt || exit 3
        [ -f .re-mkfs ] && _subvoled_biggos
        _setup_cgs
}

_my_cleanup() {
        echo "CLEANUP!"
        _kill_cg $BAD_CG
        _kill_cg $GOOD_CG
        sleep 1
        umount $mnt
}

_bad_exit() {
        _err "Unexpected Exit! $?"
        _stats
        exit $?
}

trap _my_cleanup EXIT
trap _bad_exit INT TERM

_setup

# Use a lot of page cache reading the big file
_villain() {
        local i=$1

        echo $BASHPID > $BAD_CG/cgroup.procs
        $DIR/big-read $(_biggo_file $i)
}

# Hit del_csum a lot by overwriting lots of small new files
_victim() {
        echo $BASHPID > $GOOD_CG/cgroup.procs
        i=0;
        while (true)
        do
                local tmp=$mnt/tmp.$i
                dd if=/dev/zero of=$tmp bs=4k count=2 >/dev/null 2>&1
                i=$((i+1))
                [ $i -eq $NR_LITTLE ] && i=0
        done
}

_one_sync() {
        echo "sync..."
        before=$(date +%s)
        sync
        after=$(date +%s)
        echo "sync done in $((after - before))s"
        _stats
}

# sync in a loop
_sync() {
        echo "start sync loop"
        syncs=0
        echo $BASHPID > $GOOD_CG/cgroup.procs
        while true
        do
                [ -f .done ] && break
                _one_sync
                syncs=$((syncs + 1))
                [ -f .done ] && break
                sleep 10
        done
        if [ $syncs -eq 0 ]; then
                echo "do at least one sync!"
                _one_sync
        fi
        echo "sync loop done."
}

_sleep() {
        local time=${1-60}
        local now=$(date +%s)
        local end=$((now + time))

        while [ $now -lt $end ];
        do
                echo "SLEEP: $((end - now))s left. Sleep 10."
                sleep 10
                now=$(date +%s)
        done
}

echo "start $NR_VILLAINS villains"
for i in $(seq $NR_VILLAINS)
do
        _villain $i &
        disown # get rid of annoying log on kill (done via cgroup anyway)
done

echo "start $NR_VICTIMS victims"
for i in $(seq $NR_VICTIMS)
do
        _victim &
        disown
done

_sync &
SYNC_PID=$!

_sleep $1
_elapsed
touch .done
wait $SYNC_PID

echo "OK"
exit 0
Without this patch, that reproducer:
- Ran for 6+ minutes instead of 60s
- Hung hundreds of threads in D state on the csum reader lock
- Got a commit stuck for 3 minutes
sync done in 388s
================
Wed Jul 9 09:52:31 PM UTC 2025
elapsed: 420
commits 2
cur_commit_ms 0
last_commit_ms 159446
max_commit_ms 159446
total_commit_ms 160058
some avg10=99.03 avg60=98.97 avg300=75.43 total=418033386
full avg10=82.79 avg60=80.52 avg300=59.45 total=324995274
419 hits state R, D comms big-read
btrfs_tree_read_lock_nested
btrfs_read_lock_root_node
btrfs_search_slot
btrfs_lookup_csum
btrfs_lookup_bio_sums
btrfs_submit_bbio
1 hits state D comms btrfs-transacti
btrfs_tree_lock_nested
btrfs_lock_root_node
btrfs_search_slot
btrfs_del_csums
__btrfs_run_delayed_refs
btrfs_run_delayed_refs
With the patch, the reproducer exits naturally, in 65s, completing a
pretty decent 4 commits, despite heavy memory pressure. Occasionally you
can still trigger a rather long commit (couple seconds) but never one
that is minutes long.
sync done in 3s
================
elapsed: 65
commits 4
cur_commit_ms 0
last_commit_ms 485
max_commit_ms 689
total_commit_ms 2453
some avg10=98.28 avg60=64.54 avg300=19.39 total=64849893
full avg10=74.43 avg60=48.50 avg300=14.53 total=48665168
Some random rwalker samples showed the most common stack in reclaim,
rather than the csum tree:
145 hits state R comms bash, sleep, dd, shuf
shrink_folio_list
shrink_lruvec
shrink_node
do_try_to_free_pages
try_to_free_mem_cgroup_pages
reclaim_high
Link: https://lpc.events/event/18/contributions/1883/
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
In messages.h, linux/types.h is included more than once.
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=22939
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Now that btrfs_zone_finish_endio_workfn() is directly calling
do_zone_finish() the only caller of btrfs_zone_finish_endio() is
btrfs_finish_one_ordered().
btrfs_finish_one_ordered() already has error handling in-place so
btrfs_zone_finish_endio() can return an error if the block group lookup
fails.
Also as btrfs_zone_finish_endio() already checks for zoned filesystems and
returns early, there's no need to do this in the caller.
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When btrfs_zone_finish_endio_workfn() is calling btrfs_zone_finish_endio()
it already has a pointer to the block group. Furthermore
btrfs_zone_finish_endio() does additional checks if the block group can be
finished or not.
But in the context of btrfs_zone_finish_endio_workfn() only the actual
call to do_zone_finish() is of interest, as the skipping condition when
there is still room to allocate from the block group cannot be checked.
Directly call do_zone_finish() on the block group.
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There's only one caller of unaccount_log_buffer() and both this
function and the caller are short, so move its code into the caller.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Instead of extracting again the disk_bytenr and disk_num_bytes values from
the file extent item to pass to btrfs_qgroup_trace_extent(), use the key
local variable 'ins' which already has those values, reducing the size of
the source code.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Instead of having an if statement that checks for regular and prealloc
extents first and processes them in a block, followed by an else
statement that checks for an inline extent, check for an inline extent
first, process it and jump to the 'update_inode' label. This allows us
to avoid having the code for processing regular and prealloc extents
inside a block, reducing the high indentation level by one and making
the code easier to read, avoiding some line splits too.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
At replay_one_extent(), we can jump to the code that updates the file
extent range and updates the inode when processing a file extent item that
represents a hole and we don't have the NO_HOLES feature enabled. This
helps reduce the high indentation level by one in replay_one_extent() and
avoid splitting some lines to make the code easier to read.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
In the replay_one_buffer() log tree walk callback we return errors to the
log tree walk caller and then the caller aborts the transaction, if we
have one, or turns the fs into an error state if we don't have one.
While this reduces code, it makes it harder to figure out where exactly
an error came from. So add the transaction aborts after every failure
inside the replay_one_buffer() callback and the functions it calls,
making it as fine-grained as possible, so that it helps figure out why
failures happen.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
If read_alloc_one_name() fails, we explicitly return -ENOMEM and
currently that is fine since it's the only error read_alloc_one_name()
can return for now. However this is fragile and not future proof, so
instead return what read_alloc_one_name() returned.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Instead of keep dereferencing the walk_control structure to extract the
transaction handle whenever is needed, do it once by storing it in a local
variable and then use the variable everywhere. This reduces code verbosity
and eliminates the need for some split lines.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
In the process_one_buffer() log tree walk callback we return errors to the
log tree walk caller and then the caller aborts the transaction, if we
have one, or turns the fs into an error state if we don't have one.
While this reduces code, it makes it harder to figure out where exactly
an error came from. So add the transaction aborts after every failure
inside the process_one_buffer() callback, so that it helps figure out
why failures happen.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We do several things while walking a log tree (for replaying and for
freeing a log tree) like reading extent buffers and cleaning them up,
but we don't immediately abort the transaction, or turn the fs into an
error state, when one of these things fails. Instead we do the
transaction abort or turn the fs into an error state in the caller of
the entry point function that walks a log tree - walk_log_tree() -
which means we don't get to know exactly where an error came from.
Improve on this by doing a transaction abort / turn fs into error state
after each such failure so that when it happens we have a better
understanding where the failure comes from. This deliberately leaves
the transaction abort / turn fs into error state in the callers of
walk_log_tree() as to ensure we don't get into an inconsistent state in
case we forget to do it deeper in the call chain. It also deliberately does
not do it after errors from the calls to the callback defined in
struct walk_control::process_func(), as we will do it later on another
patch.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The function cow_file_range() has two boolean parameters. Replace them
with a single @flags parameter, with two flags:
- COW_FILE_RANGE_NO_INLINE
- COW_FILE_RANGE_KEEP_LOCKED
And since we're here, also update the comments of cow_file_range() to
replace the old "page" usage with "folio".
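A sketch of the new flags (values illustrative):

  /* replaces the two bool parameters of cow_file_range() */
  #define COW_FILE_RANGE_NO_INLINE      (1UL << 0)
  #define COW_FILE_RANGE_KEEP_LOCKED    (1UL << 1)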
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
In order to identify whether a certain file is open by a different
client, start the unlink process by sending a compound request of
CREATE(DELETE_ON_CLOSE) + CLOSE with only FILE_SHARE_DELETE bit set in
smb2_create_req::ShareAccess. If the file is currently open, then the
server will fail the request with STATUS_SHARING_VIOLATION, in which
case we'll map it to -EBUSY, so __cifs_unlink() will fall back to
silly-rename the file.
This fixes the following case where open(O_CREAT) fails with
-ENOENT (STATUS_DELETE_PENDING) due to file still open by a different
client.
* Before patch
$ mount.cifs //srv/share /mnt/1 -o ...,nosharesock
$ mount.cifs //srv/share /mnt/2 -o ...,nosharesock
$ cd /mnt/1
$ touch foo
$ exec 3<>foo
$ cd /mnt/2
$ rm foo
$ touch foo
touch: cannot touch 'foo': No such file or directory
$ exec 3>&-
* After patch
$ mount.cifs //srv/share /mnt/1 -o ...,nosharesock
$ mount.cifs //srv/share /mnt/2 -o ...,nosharesock
$ cd /mnt/1
$ touch foo
$ exec 3<>foo
$ cd /mnt/2
$ rm foo
$ touch foo
$ exec 3>&-
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Cc: Frank Sorenson <sorenson@redhat.com>
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
This makes it safer during the disconnect and avoids
requeueing.
It's ok to call disable_work[_sync]() more than once.
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Fixes: 0626e6641f6b ("cifsd: add server handler for central processing and tranport layers")
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
If we are using a hardcoded delay of 0 there's no point in using
delayed_work, it only adds confusion.
The client also uses a normal work_struct, and now it is easier to
move it to the common smbdirect_socket.
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Fixes: 0626e6641f6b ("cifsd: add server handler for central processing and tranport layers")
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Now that nfsd is accessing SHA-256 via the library API instead of via
crypto_shash, there is a direct symbol dependency on the SHA-256 code
and there is no benefit to be gained from forcing it to be built-in.
Therefore, select CRYPTO_LIB_SHA256 from NFSD (conditional on NFSD_V4)
instead of from NFSD_V4, so that it can be 'm' if NFSD is 'm'.
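The Kconfig change is essentially (surrounding entries elided):

  config NFSD
          tristate "NFS server support"
          select CRYPTO_LIB_SHA256 if NFSD_V4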
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|