Age | Commit message | Author | Files | Lines
2025-07-25 | ext4: factor out __ext4_mb_scan_group() | Baokun Li | 2 | -18/+29
Extract __ext4_mb_scan_group() to make the code clearer and to prepare for the later conversion of 'choose group' to 'scan groups'. No functional changes. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-13-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-25 | ext4: fix largest free orders lists corruption on mb_optimize_scan switch | Baokun Li | 1 | -19/+14
The grp->bb_largest_free_order is updated regardless of whether mb_optimize_scan is enabled. This can lead to inconsistencies between grp->bb_largest_free_order and the actual s_mb_largest_free_orders list index when mb_optimize_scan is repeatedly enabled and disabled via remount. For example, suppose mb_optimize_scan is initially enabled, the largest free order is 3, and the group is in s_mb_largest_free_orders[3]. Then mb_optimize_scan is disabled via remount, and block allocations occur, updating the largest free order to 2. Finally, mb_optimize_scan is re-enabled via remount, and more block allocations update the largest free order to 1. At this point, the group would be removed from s_mb_largest_free_orders[3] under the protection of s_mb_largest_free_orders_locks[2]. This lock mismatch can lead to list corruption. To fix this, whenever grp->bb_largest_free_order changes, we now always attempt to remove the group from its old order list. However, we only insert the group into the new order list if `mb_optimize_scan` is enabled. This approach helps prevent lock inconsistencies and ensures the data in the order lists remains reliable. Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning") CC: stable@vger.kernel.org Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-12-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
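The fix described above maps to an update helper roughly like the following sketch (field and lock names are taken from the commit text and existing mballoc code; the helper shape and the test_opt2() check are illustrative assumptions, not the exact upstream diff):

    /* Always unlink from the old order list, but only relink when
     * mb_optimize_scan is enabled, so list membership and
     * bb_largest_free_order can never disagree on which lock to take. */
    static void mb_set_largest_free_order(struct super_block *sb,
                                          struct ext4_group_info *grp,
                                          int new_order)
    {
            struct ext4_sb_info *sbi = EXT4_SB(sb);
            int old_order = grp->bb_largest_free_order;

            if (old_order == new_order)
                    return;

            /* removal always happens, under the *old* order's lock */
            if (old_order >= 0 && !list_empty(&grp->bb_largest_free_order_node)) {
                    write_lock(&sbi->s_mb_largest_free_orders_locks[old_order]);
                    list_del_init(&grp->bb_largest_free_order_node);
                    write_unlock(&sbi->s_mb_largest_free_orders_locks[old_order]);
            }

            grp->bb_largest_free_order = new_order;

            /* insertion only while the optimized scan lists are in use */
            if (test_opt2(sb, MB_OPTIMIZE_SCAN) && new_order >= 0 && grp->bb_free) {
                    write_lock(&sbi->s_mb_largest_free_orders_locks[new_order]);
                    list_add_tail(&grp->bb_largest_free_order_node,
                                  &sbi->s_mb_largest_free_orders[new_order]);
                    write_unlock(&sbi->s_mb_largest_free_orders_locks[new_order]);
            }
    }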
2025-07-25 | ext4: fix zombie groups in average fragment size lists | Baokun Li | 1 | -18/+18
Groups with no free blocks shouldn't be in any average fragment size list. However, when all blocks in a group are allocated (i.e., bb_fragments or bb_free is 0), we currently skip updating the average fragment size, which means the group isn't removed from its previous s_mb_avg_fragment_size[old] list. This created "zombie" groups that were always skipped during traversal as they couldn't satisfy any block allocation requests, negatively impacting traversal efficiency. Therefore, when a group becomes completely full, bb_avg_fragment_size_order is now set to -1. If the old order was not -1, a removal operation is performed; if the new order is not -1, an insertion is performed. Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning") CC: stable@vger.kernel.org Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-11-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
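In code terms, the update described above looks roughly like this sketch (names follow the existing mballoc fields; the exact upstream diff may differ):

    /* A fully allocated group gets order -1 and sits on no list;
     * remove from the old list if old != -1, insert if new != -1. */
    static void mb_update_avg_fragment_size(struct super_block *sb,
                                            struct ext4_group_info *grp)
    {
            struct ext4_sb_info *sbi = EXT4_SB(sb);
            int old_order = grp->bb_avg_fragment_size_order;
            int new_order = -1;

            if (grp->bb_free && grp->bb_fragments)
                    new_order = mb_avg_fragment_size_order(sb,
                                    grp->bb_free / grp->bb_fragments);
            if (new_order == old_order)
                    return;

            if (old_order >= 0) {
                    write_lock(&sbi->s_mb_avg_fragment_size_locks[old_order]);
                    list_del(&grp->bb_avg_fragment_size_node);
                    write_unlock(&sbi->s_mb_avg_fragment_size_locks[old_order]);
            }
            grp->bb_avg_fragment_size_order = new_order;
            if (new_order >= 0) {
                    write_lock(&sbi->s_mb_avg_fragment_size_locks[new_order]);
                    list_add_tail(&grp->bb_avg_fragment_size_node,
                                  &sbi->s_mb_avg_fragment_size[new_order]);
                    write_unlock(&sbi->s_mb_avg_fragment_size_locks[new_order]);
            }
    }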
2025-07-25 | ext4: merge freed extent with existing extents before insertion | Baokun Li | 1 | -37/+76
Attempt to merge ext4_free_data with already inserted free extents prior to adding new ones. This strategy drastically cuts down the number of times locks are held.

For example, if prev, new, and next extents are all mergeable, the existing code (before this patch) requires acquiring the s_md_lock three times:

    prev merge into new and free prev   // hold lock
    next merge into new and free next   // hold lock
    insert new                          // hold lock

After the patch, it only needs to be acquired once:

    new merge into next and free new    // no lock
    next merge into prev and free next  // hold lock

Performance test data follows:

Test: Running will-it-scale/fallocate2 on CPU-bound containers.
Observation: Average fallocate operations per container per second.

|CPU: Kunpeng 920   |          P80           |           P1            |
|Memory: 512GB      |------------------------|-------------------------|
|960GB SSD (0.5GB/s)| base  |    patched     |  base  |    patched     |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 20043 | 20097 (+0.2%)  | 314331 | 316141 (+0.5%) |
|mb_optimize_scan=1 | 7290  | 13318 (+87.4%) | 324226 | 325273 (+0.3%) |

|CPU: AMD 9654 * 2  |          P96           |           P1            |
|Memory: 1536GB     |------------------------|-------------------------|
|960GB SSD (1GB/s)  | base  |    patched     |  base  |    patched     |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 54999 | 53603 (-2.5%)  | 214380 | 214243 (-0.06%)|
|mb_optimize_scan=1 | 13497 | 20887 (+54.6%) | 216276 | 213632 (-1.2%) |

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-10-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-25 | ext4: convert sbi->s_mb_free_pending to atomic_t | Baokun Li | 3 | -8/+5
Previously, s_md_lock was used to protect s_mb_free_pending during modifications, while smp_mb() ensured fresh reads, so s_md_lock just guarantees the atomicity of s_mb_free_pending. Thus we optimized it by converting s_mb_free_pending into an atomic variable, thereby eliminating s_md_lock and minimizing lock contention. This also prepares for future lockless merging of free extents. Following this modification, s_md_lock is exclusively responsible for managing insertions and deletions within s_freed_data_list, along with operations involving list_splice.

Performance test data follows:

Test: Running will-it-scale/fallocate2 on CPU-bound containers.
Observation: Average fallocate operations per container per second.

|CPU: Kunpeng 920   |          P80           |           P1            |
|Memory: 512GB      |------------------------|-------------------------|
|960GB SSD (0.5GB/s)| base  |    patched     |  base  |    patched     |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 19628 | 20043 (+2.1%)  | 320885 | 314331 (-2.0%) |
|mb_optimize_scan=1 | 7129  | 7290 (+2.2%)   | 321275 | 324226 (+0.9%) |

|CPU: AMD 9654 * 2  |          P96           |           P1            |
|Memory: 1536GB     |------------------------|-------------------------|
|960GB SSD (1GB/s)  | base  |    patched     |  base  |    patched     |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 53760 | 54999 (+2.3%)  | 213145 | 214380 (+0.5%) |
|mb_optimize_scan=1 | 12716 | 13497 (+6.1%)  | 215262 | 216276 (+0.4%) |

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-9-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
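Conceptually the conversion replaces a lock-protected counter with an atomic one; a minimal before/after sketch (not the literal diff):

    /* before: s_md_lock serialized updates, smp_mb() ordered reads */
    spin_lock(&sbi->s_md_lock);
    sbi->s_mb_free_pending += clusters;
    spin_unlock(&sbi->s_md_lock);

    /* after: the atomic carries both the updates and the fresh reads */
    atomic_add(clusters, &sbi->s_mb_free_pending);  /* blocks freed      */
    atomic_sub(clusters, &sbi->s_mb_free_pending);  /* commit completed  */
    free_pending = atomic_read(&sbi->s_mb_free_pending); /* lockless read */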
2025-07-25 | ext4: fix typo in CR_GOAL_LEN_SLOW comment | Baokun Li | 1 | -1/+1
Remove the superfluous "find_". Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-8-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-25 | ext4: get rid of some obsolete EXT4_MB_HINT flags | Baokun Li | 2 | -9/+0
Since nobody has used these EXT4_MB_HINT flags for ages, let's remove them. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-7-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-25 | ext4: utilize multiple global goals to reduce contention | Baokun Li | 2 | -6/+27
When allocating data blocks, if the first try (goal allocation) fails and stream allocation is on, it tries a global goal starting from the last group we used (s_mb_last_group). This helps cluster large files together to reduce free space fragmentation, and the data block contiguity also accelerates write-back to disk.

However, when multiple processes allocate blocks, having just one global goal means they all fight over the same group. This drastically lowers the chances of extents merging and leads to much worse file fragmentation.

To mitigate this multi-process contention, we now employ multiple global goals, with the number of goals being the minimum between the number of possible CPUs and one-quarter of the filesystem's total block group count. To ensure a consistent goal for each inode, we select the corresponding goal by taking the inode number modulo the total number of goals.

Performance test data follows:

Test: Running will-it-scale/fallocate2 on CPU-bound containers.
Observation: Average fallocate operations per container per second.

|CPU: Kunpeng 920   |          P80           |           P1            |
|Memory: 512GB      |------------------------|-------------------------|
|960GB SSD (0.5GB/s)| base  |    patched     |  base  |    patched     |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 9636  | 19628 (+103%)  | 337597 | 320885 (-4.9%) |
|mb_optimize_scan=1 | 4834  | 7129 (+47.4%)  | 341440 | 321275 (-5.9%) |

|CPU: AMD 9654 * 2  |          P96           |           P1            |
|Memory: 1536GB     |------------------------|-------------------------|
|960GB SSD (1GB/s)  | base  |    patched     |  base  |    patched     |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 22341 | 53760 (+140%)  | 219707 | 213145 (-2.9%) |
|mb_optimize_scan=1 | 9177  | 12716 (+38.5%) | 215732 | 215262 (+0.2%) |

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-6-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
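The goal selection reduces to a small amount of arithmetic; a sketch under the assumption that the goals live in an array called s_mb_last_groups (the array name and helper are illustrative):

    /* nr_goals = min(possible CPUs, ngroups / 4); an inode always
     * hashes to the same slot, so its allocations stay clustered. */
    static ext4_group_t mb_pick_stream_goal(struct ext4_sb_info *sbi,
                                            struct inode *inode)
    {
            unsigned int nr_goals = min_t(unsigned int, num_possible_cpus(),
                                          ext4_get_groups_count(inode->i_sb) / 4);

            if (!nr_goals)          /* tiny filesystems: fall back to one goal */
                    nr_goals = 1;
            return READ_ONCE(sbi->s_mb_last_groups[inode->i_ino % nr_goals]);
    }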
2025-07-25 | ext4: remove unnecessary s_md_lock on update s_mb_last_group | Baokun Li | 2 | -10/+4
After we optimized the block group lock, we found another lock contention issue when running will-it-scale/fallocate2 with multiple processes. The fallocate's block allocation and the truncate's block release were fighting over the s_md_lock. The problem is, this lock protects totally different things in those two processes: the list of freed data blocks (s_freed_data_list) when releasing, and where to start looking for new blocks (mb_last_group) when allocating.

Now we only need to track s_mb_last_group and no longer need to track s_mb_last_start, so we no longer need s_md_lock to keep the two consistent. Since s_mb_last_group is merely a hint and doesn't require strong synchronization, READ_ONCE/WRITE_ONCE is sufficient. Besides, the s_mb_last_group data type only requires ext4_group_t (i.e., unsigned int), rendering unsigned long superfluous.

Performance test data follows:

Test: Running will-it-scale/fallocate2 on CPU-bound containers.
Observation: Average fallocate operations per container per second.

|CPU: Kunpeng 920   |          P80           |           P1            |
|Memory: 512GB      |------------------------|-------------------------|
|960GB SSD (0.5GB/s)| base  |    patched     |  base  |    patched     |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 4821  | 9636 (+99.8%)  | 314065 | 337597 (+7.4%) |
|mb_optimize_scan=1 | 4784  | 4834 (+1.04%)  | 316344 | 341440 (+7.9%) |

|CPU: AMD 9654 * 2  |          P96           |           P1            |
|Memory: 1536GB     |------------------------|-------------------------|
|960GB SSD (1GB/s)  | base  |    patched     |  base  |    patched     |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 15371 | 22341 (+45.3%) | 205851 | 219707 (+6.7%) |
|mb_optimize_scan=1 | 6101  | 9177 (+50.4%)  | 207373 | 215732 (+4.0%) |

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-5-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
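Since the value is only a search hint, annotated plain accesses are enough; conceptually:

    /* reader: where the last stream allocation ended up */
    ac->ac_g_ex.fe_group = READ_ONCE(sbi->s_mb_last_group);

    /* writer: remember the group we just allocated from */
    WRITE_ONCE(sbi->s_mb_last_group, ac->ac_f_ex.fe_group);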
2025-07-25 | ext4: remove unnecessary s_mb_last_start | Baokun Li | 2 | -2/+0
Since stream allocation does not use ac->ac_f_ex.fe_start, it is set to -1 by default, and the now-unneeded sbi->s_mb_last_start field is removed. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-4-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-25 | ext4: separate stream goal hits from s_bal_goals for better tracking | Baokun Li | 2 | -2/+10
In ext4_mb_regular_allocator(), after the call to ext4_mb_find_by_goal() fails to achieve the inode goal, allocation continues with the stream allocation global goal. Currently, hits for both are combined in sbi->s_bal_goals, hindering accurate optimization. This commit separates global goal hits into sbi->s_bal_stream_goals. Since stream allocation doesn't use ac->ac_g_ex.fe_start, set fe_start to -1. This prevents stream allocations from being counted in s_bal_goals. Also clear EXT4_MB_HINT_TRY_GOAL to avoid calling ext4_mb_find_by_goal again.

After adding `stream_goal_hits`, `/proc/fs/ext4/sdx/mb_stats` will show:

mballoc:
	reqs: 840347
	success: 750992
	groups_scanned: 1230506
	cr_p2_aligned_stats:
		hits: 21531
		groups_considered: 411664
		extents_scanned: 21531
		useless_loops: 0
		bad_suggestions: 6
	cr_goal_fast_stats:
		hits: 111222
		groups_considered: 1806728
		extents_scanned: 467908
		useless_loops: 0
		bad_suggestions: 13
	cr_best_avail_stats:
		hits: 36267
		groups_considered: 1817631
		extents_scanned: 156143
		useless_loops: 0
		bad_suggestions: 204
	cr_goal_slow_stats:
		hits: 106396
		groups_considered: 5671710
		extents_scanned: 22540056
		useless_loops: 123747
	cr_any_free_stats:
		hits: 138071
		groups_considered: 724692
		extents_scanned: 23615593
		useless_loops: 585
	extents_scanned: 46804261
		goal_hits: 1307
		stream_goal_hits: 236317
		len_goal_hits: 155549
		2^n_hits: 21531
		breaks: 225096
		lost: 35062
	buddies_generated: 40/40
	buddies_time_used: 48004
	preallocated: 5962467
	discarded: 4847560

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-3-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
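A sketch of the accounting change described above (using the hypothetical mb_pick_stream_goal() helper from the earlier sketch; the counter name s_bal_stream_goals is from the commit text):

    if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
            ac->ac_g_ex.fe_group = mb_pick_stream_goal(sbi, ac->ac_inode);
            ac->ac_g_ex.fe_start = -1;              /* not counted in s_bal_goals */
            ac->ac_flags &= ~EXT4_MB_HINT_TRY_GOAL; /* skip ext4_mb_find_by_goal() */
    }

    /* later, when the allocation lands in the goal group: */
    if (ac->ac_g_ex.fe_start == -1 &&
        ac->ac_f_ex.fe_group == ac->ac_g_ex.fe_group)
            atomic_inc(&sbi->s_bal_stream_goals);   /* stream goal hit */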
2025-07-25 | ext4: add ext4_try_lock_group() to skip busy groups | Baokun Li | 2 | -12/+30
When ext4 allocates blocks, we used to just go through the block groups one by one to find a good one. But when there are tons of block groups (like hundreds of thousands or even millions) and not many have free space (meaning they're mostly full), it takes a really long time to check them all, and performance gets bad. So, we added the "mb_optimize_scan" mount option (which is on by default now). It keeps track of some group lists, so when we need a free block, we can just grab a likely group from the right list. This saves time and makes block allocation much faster.

But when multiple processes or containers are doing similar things, like constantly allocating 8k blocks, they all try to use the same block group in the same list. Even just two processes doing this can cut the IOPS in half. For example, one container might do 300,000 IOPS, but if you run two at the same time, the total is only 150,000.

Since we can already look at block groups in a non-linear way, the first and last groups in the same list are basically the same for finding a block right now. Therefore, add an ext4_try_lock_group() helper function to skip the current group when it is locked by another process, thereby avoiding contention with other processes. This helps ext4 make better use of having multiple block groups.

Also, to make sure we don't skip all the groups that have free space when allocating blocks, we won't try to skip busy groups anymore when ac_criteria is CR_ANY_FREE.

Performance test data follows:

Test: Running will-it-scale/fallocate2 on CPU-bound containers.
Observation: Average fallocate operations per container per second.

|CPU: Kunpeng 920   |          P80            |
|Memory: 512GB      |-------------------------|
|960GB SSD (0.5GB/s)| base  |     patched     |
|-------------------|-------|-----------------|
|mb_optimize_scan=0 | 2667  | 4821 (+80.7%)   |
|mb_optimize_scan=1 | 2643  | 4784 (+81.0%)   |

|CPU: AMD 9654 * 2  |          P96            |
|Memory: 1536GB     |-------------------------|
|960GB SSD (1GB/s)  | base  |     patched     |
|-------------------|-------|-----------------|
|mb_optimize_scan=0 | 3450  | 15371 (+345%)   |
|mb_optimize_scan=1 | 3209  | 6101 (+90.0%)   |

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-2-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
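A sketch of the trylock helper and its use in the scan loop (per the commit text, busy groups are only skipped while ac_criteria is below CR_ANY_FREE; the loop shape is illustrative):

    static inline bool ext4_try_lock_group(struct super_block *sb,
                                           ext4_group_t group)
    {
            return spin_trylock(ext4_group_lock_ptr(sb, group));
    }

    /* in the group scan loop: */
    if (ac->ac_criteria < CR_ANY_FREE) {
            if (!ext4_try_lock_group(sb, group))
                    continue;       /* busy: leave it to the other process */
    } else {
            ext4_lock_group(sb, group); /* last resort: never skip groups */
    }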
2025-07-25 | ext4: initialize superblock fields in the kballoc-test.c kunit tests | Zhang Yi | 1 | -0/+9
Various changes in the "ext4: better scalability for ext4 block allocation" patch series have resulted in kunit test failures, most notably in the test_new_blocks_simple and the test_mb_mark_used tests. The root cause of these failures is that various in-memory ext4 data structures were not getting initialized, and while previous versions of the functions exercised by the unit tests didn't use these structure members, this was arguably a test bug. Since one of the patches in the block allocation scalability patches is a fix which has a cc:stable tag, this commit also has a cc:stable tag. CC: stable@vger.kernel.org Link: https://lore.kernel.org/r/20250714130327.1830534-1-libaokun1@huawei.com Link: https://patch.msgid.link/20250725021550.3177573-1-yi.zhang@huaweicloud.com Link: https://patch.msgid.link/20250725021654.3188798-1-yi.zhang@huaweicloud.com Reported-by: Guenter Roeck <linux@roeck-us.net> Closes: https://lore.kernel.org/linux-ext4/b0635ad0-7ebf-4152-a69b-58e7e87d5085@roeck-us.net/ Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-17 | ext4: refactor the inline directory conversion and new directory codepaths | Theodore Ts'o | 3 | -77/+48
There was a lot of common code in the codepaths used to convert an inline directory and to create a new directory. To address this, rename ext4_init_dot_dotdot() to ext4_init_dirblock() and then move common code into that function. This reduces the lines of code count in fs/ext4/inline.c and fs/ext4/namei.c, as well as reducing the size of their object files. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://patch.msgid.link/20250712181249.434530-3-tytso@mit.edu Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-17 | ext4: use memcpy() instead of strcpy() | Theodore Ts'o | 2 | -4/+4
The strcpy() function is considered dangerous and eeeevil by people who are using sophisticated code analysis tools such as "grep". This is true even when a quick inspection would show that the source is a constant string ("." or "..") and the destination is a fixed array which is guaranteed to have enough space. Make the "grep" code analysis tool happy by using memcpy() instead of strcpy(). :-) Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://patch.msgid.link/20250712181249.434530-2-tytso@mit.edu Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-17 | ext4: replace strcmp with direct comparison for '.' and '..' | Theodore Ts'o | 1 | -4/+7
In a discussion over a proposed patch, "ext4: replace strcpy() with '.' assignment"[1], I had asserted that directory entries in ext4 were not NUL terminated, and hence it was safe to replace strcpy() with a direct assignment. As it turns out, this was incorrect. It's true for all directory entries *except* for '.' and '..' where the kernel was using strcmp() and where e2fsck actually checks and offers to fix things if '.' and '..' are not NUL terminated. [1] https://lore.kernel.org/r/202505191316.JJMnPobO-lkp@intel.com We can't change this without breaking old kernel versions, but in the spirit of "be liberal in what you receive", use direct comparison of de->name_len and de->name[0,1] instead of strcmp(). This has the side benefit of reducing the compiled text size by 96 bytes on x86_64. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://patch.msgid.link/20250712181249.434530-1-tytso@mit.edu Signed-off-by: Theodore Ts'o <tytso@mit.edu>
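The direct comparison amounts to checking the length byte and at most two characters; a sketch (the helper names are illustrative, not from the patch):

    /* de->name need not be NUL-terminated, so compare explicitly. */
    static inline bool ext4_is_dot(const struct ext4_dir_entry_2 *de)
    {
            return de->name_len == 1 && de->name[0] == '.';
    }

    static inline bool ext4_is_dotdot(const struct ext4_dir_entry_2 *de)
    {
            return de->name_len == 2 &&
                   de->name[0] == '.' && de->name[1] == '.';
    }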
2025-07-17 | ext4: Make sure BH_New bit is cleared in ->write_end handler | Jan Kara | 2 | -1/+4
Currently we clear BH_New bit in case of error and also in the standard ext4_write_end() handler (in block_commit_write()). However ext4_journalled_write_end() misses this clearing and thus we are leaving stale BH_New bits behind. Generally ext4_block_write_begin() clears these bits before any harm can be done but in case blocksize < pagesize and we hit some error when processing a page with these stale bits, we'll try to zero buffers with these stale BH_New bits and jbd2 will complain (as buffers were not prepared for writing in this transaction). Fix the problem by clearing BH_New bits in ext4_journalled_write_end() and WARN if ext4_block_write_begin() sees stale BH_New bits. Reported-by: Baolin Liu <liubaolin12138@163.com> Reported-by: Zhi Long <longzhi@sangfor.com.cn> Fixes: 3910b513fcdf ("ext4: persist the new uptodate buffers in ext4_journalled_zero_new_buffers") Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250709084831.23876-2-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-17 | ext4: fix inode use after free in ext4_end_io_rsv_work() | Baokun Li | 1 | -8/+8
In ext4_io_end_defer_completion(), check if io_end->list_vec is empty to avoid adding an io_end that requires no conversion to the i_rsv_conversion_list, which in turn prevents starting an unnecessary worker. An ext4_emergency_state() check is also added to avoid attempting to abort the journal in an emergency state. Additionally, ext4_put_io_end_defer() is refactored to call ext4_io_end_defer_completion() directly instead of being open-coded. This also prevents starting an unnecessary worker when EXT4_IO_END_FAILED is set but data_err=abort is not enabled. This ensures that the check in ext4_put_io_end_defer() is consistent with the check in ext4_end_bio(). Otherwise, we might add an io_end to the i_rsv_conversion_list and then call ext4_finish_bio(), after which the inode could be freed before ext4_end_io_rsv_work() is called, triggering a use-after-free issue. Fixes: ce51afb8cc5e ("ext4: abort journal on data writeback failure if in data_err=abort mode") Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250708111504.3208660-1-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-17 | ext4: Refactor breaking condition for xattr_find_entry() | I Hsin Cheng | 1 | -1/+1
Refactor the condition for breaking the loop within xattr_find_entry(). Eliminate the usage of "<=" and take the condition shortcut when "!cmp" is true. Originally, the condition was "(cmp <= 0 && (sorted || cmp == 0))", which means after it knows "cmp <= 0" is true, it has to check the value of "sorted" and "cmp". The check of "cmp" here is redundant since it has already been tested. Observing the logic, when "cmp == 0" the branch is taken, so there is no need to check "cmp == 0" again; we only need to take the shortcut when "cmp == 0", and on the other hand check "sorted" when "cmp < 0".

The refactor shrinks the generated code size by 44 bytes. Numerous instructions are saved, which should also benefit execution efficiency:

$ ./scripts/bloat-o-meter vmlinux_old vmlinux_new
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-44 (-44)
Function                 old     new   delta
xattr_find_entry         300     256     -44
Total: Before=22989434, After=22989390, chg -0.00%

The test was done on kernel version 6.16 with x86_64 defconfig and gcc 13.3.0.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
Link: https://patch.msgid.link/20250708020013.175728-1-richard120310@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
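The refactor in question, as a before/after sketch of the loop's break condition:

    /* before: cmp is re-tested inside the short-circuit */
    if (cmp <= 0 && (sorted || cmp == 0))
            break;

    /* after: an exact match breaks immediately; 'sorted' is only
     * consulted when the probe has gone past the target (cmp < 0) */
    if (!cmp || (cmp < 0 && sorted))
            break;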
2025-07-14 | ext4: limit the maximum folio order | Zhang Yi | 3 | -6/+21
In environments with a page size of 64KB, the maximum size of a folio can reach up to 128MB. Consequently, during the write-back of folios, the 'rsv_blocks' will be overestimated to 1,577, which can put pressure on the journal space when the journal is small. This can easily exceed the limit of a single transaction. Besides, an excessively large folio is meaningless and will instead increase the overhead of traversing the bhs within the folio. Therefore, limit the maximum folio size to 2048 filesystem blocks. Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Reported-by: Joseph Qi <jiangqi903@gmail.com> Closes: https://lore.kernel.org/linux-ext4/CA+G9fYsyYQ3ZL4xaSg1-Tt5Evto7Zd+hgNWZEa9cQLbahA1+xg@mail.gmail.com/ Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250707140814.542883-12-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
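The cap works out to a simple order computation; a sketch assuming the generic mapping_set_folio_order_range() pagecache helper (the function ext4 actually calls may differ):

    /* Allow folios of at most 2048 (= 2^11) filesystem blocks:
     * max order = 11 + blkbits - PAGE_SHIFT. With 4K blocks and 4K
     * pages that is order 11 (8MB folios); with 1K blocks, order 9. */
    static void ext4_set_inode_mapping_order(struct inode *inode)
    {
            unsigned int order = 11 + inode->i_blkbits - PAGE_SHIFT;

            mapping_set_folio_order_range(inode->i_mapping, 0,
                                          min(order, MAX_PAGECACHE_ORDER));
    }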
2025-07-13 | ext4: fix insufficient credits calculation in ext4_meta_trans_blocks() | Zhang Yi | 1 | -2/+2
The calculation of journal credits in ext4_meta_trans_blocks() should include pextents, as each extent may be allocated from a different group and thus needs to update a different bitmap and group descriptor block. Fixes: 0e32d8617012 ("ext4: correct the journal credits calculations of allocating blocks") Reported-by: Jan Kara <jack@suse.cz> Closes: https://lore.kernel.org/linux-ext4/nhxfuu53wyacsrq7xqgxvgzcggyscu2tbabginahcygvmc45hy@t4fvmyeky33e/ Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Baokun Li <libaokun1@huawei.com> Link: https://patch.msgid.link/20250707140814.542883-11-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-13 | ext4: replace ext4_writepage_trans_blocks() | Zhang Yi | 6 | -27/+25
After ext4 supports large folios, the semantics of reserving credits in pages is no longer applicable. In most scenarios, reserving credits in extents is sufficient. Therefore, introduce ext4_chunk_trans_extent() to replace ext4_writepage_trans_blocks(). move_extent_per_page() is the only remaining location where we are still processing extents in pages. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250707140814.542883-10-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-13 | ext4: reserved credits for one extent during the folio writeback | Zhang Yi | 1 | -17/+8
After ext4 supports large folios, reserving journal credits for one maximum-ordered folio based on the worst case scenario during the writeback process can easily exceed the maximum transaction credits. Additionally, reserving journal credits for one page is also no longer appropriate. Currently, the folio writeback process can either extend the journal credits or initiate a new transaction if the currently reserved journal credits are insufficient. Therefore, it can be modified to reserve credits for only one extent at the outset. In most cases involving continuous mapping, these credits are generally adequate, and we may only need to perform some basic credit expansion. However, in extreme cases where the block size and folio size differ significantly, or when the folios are sufficiently discontinuous, it may be necessary to restart a new transaction and resubmit the folios. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250707140814.542883-9-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-13 | ext4: correct the reserved credits for extent conversion | Zhang Yi | 1 | -3/+3
Now, we reserve journal credits for converting extents in only one page to written state when the I/O operation is complete. This is insufficient when large folios are enabled. Fix this by reserving credits for converting up to one extent per block in the largest 2MB folio; this calculation only involves extent index and leaf blocks, so it should not estimate too many credits. Fixes: 7ac67301e82f ("ext4: enable large folio for regular file") Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Baokun Li <libaokun1@huawei.com> Link: https://patch.msgid.link/20250707140814.542883-8-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-13 | ext4: enhance tracepoints during the folios writeback | Zhang Yi | 2 | -6/+41
After mpage_map_and_submit_extent() supports restarting the handle if credits are insufficient during allocating blocks, it is more likely to exit the current mapping iteration and then continue processing the same partially mapped folio again. The existing tracepoints are not sufficient to track this situation, so enhance the tracepoints to track the writeback position and the return value before and after submitting the folios. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250707140814.542883-7-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-13 | ext4: restart handle if credits are insufficient during allocating blocks | Zhang Yi | 1 | -5/+36
After large folios are supported on ext4, writing back a sufficiently large and discontinuous folio may consume a significant number of journal credits, placing considerable strain on the journal. For example, in a 20GB filesystem with 1K block size and 1MB journal size, writing back a 2MB folio could require thousands of credits in the worst-case scenario (when each block is discontinuous and distributed across different block groups), potentially exceeding the journal size. This issue can also occur in ext4_write_begin() and ext4_page_mkwrite() when delalloc is not enabled. Fix this by ensuring that there are sufficient journal credits before allocating an extent in mpage_map_one_extent() and ext4_block_write_begin(). If there are not enough credits, return -EAGAIN, exit the current mapping loop, restart a new handle and a new transaction, and allocate blocks for this folio again in the next iteration. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250707140814.542883-6-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
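The retry pattern is roughly the following sketch (illustrative only; the exact helpers and credit math follow the upstream diff):

    /* before allocating: make sure the handle has room, or give up */
    if (jbd2_handle_buffer_credits(handle) < needed_credits &&
        ext4_journal_extend(handle, needed_credits, 0) != 0)
            return -EAGAIN;

    /* in the writeback loop: */
    err = mpage_map_one_extent(handle, mpd);
    if (err == -EAGAIN) {
            /* stop this transaction; the caller starts a new handle
             * and maps the remaining blocks of this folio again */
            break;
    }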
2025-07-13 | ext4: refactor the block allocation process of ext4_page_mkwrite() | Zhang Yi | 1 | -45/+50
The block allocation process and error handling in ext4_page_mkwrite() is complex now. Refactor it by introducing a new helper function, ext4_block_page_mkwrite(). It will call ext4_block_write_begin() to allocate blocks instead of directly calling block_page_mkwrite(). Preparing to implement retry logic in a subsequent patch to address situations where the reserved journal credits are insufficient. Additionally, this modification will help prevent potential deadlocks that may occur when waiting for folio writeback while holding the transaction handle. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250707140814.542883-5-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-13 | ext4: fix stale data if it bails out of the extents mapping loop | Zhang Yi | 1 | -1/+50
During the process of writing back folios, if mpage_map_and_submit_extent() exits the extent mapping loop due to an ENOSPC or ENOMEM error, it may result in stale data or filesystem inconsistency in environments where the block size is smaller than the folio size. When mapping a discontinuous folio in mpage_map_and_submit_extent(), some buffers may have already been mapped. If we exit the mapping loop prematurely, the folio data within the mapped range will not be written back, and the file's disk size will not be updated. Once the transaction that includes this range of extents is committed, this can lead to stale data or filesystem inconsistency. Fix this by submitting the partially mapped folio that is currently being processed. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250707140814.542883-4-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-13 | ext4: move the calculation of wbc->nr_to_write to mpage_folio_done() | Zhang Yi | 1 | -2/+1
mpage_folio_done() is a more appropriate place than mpage_submit_folio() for updating wbc->nr_to_write after we have submitted a fully mapped folio. This prepares for allowing mpage_submit_folio() to submit a partially mapped folio that is still being processed. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Baokun Li <libaokun1@huawei.com> Link: https://patch.msgid.link/20250707140814.542883-3-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-13 | ext4: process folios writeback in bytes | Zhang Yi | 2 | -41/+42
Since ext4 supports large folios, processing writeback in pages is no longer appropriate; it can be modified to process writeback in bytes instead. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250707140814.542883-2-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-12 | ext4: remove unused EXT_STATS macro from ext4_extents.h | Baolin Liu | 1 | -7/+0
The EXT_STATS macro in fs/ext4/ext4_extents.h has been defined but never used in the codebase since its introduction. This patch removes it.

Analysis:
1. No references found in fs/ext4/ or other kernel code.
2. No impact on compilation or functionality.
3. Git history shows it was never utilized.

Signed-off-by: Baolin Liu <liubaolin@kylinos.cn>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Link: https://patch.msgid.link/20250527053805.1550912-1-liubaolin12138@163.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-11 | ext4: remove unnecessary duplicate check in ext4_map_blocks() | Dan Carpenter | 1 | -2/+1
The previous lines ensure that EXT4_GET_BLOCKS_QUERY_LAST_IN_LEAF is set, so remove this duplicate check. Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://patch.msgid.link/aDCdjUhpzxB64vkD@stanley.mountain Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-10 | ext4: remove duplicate check for EXT4_FC_REPLAY | Jinliang Zheng | 1 | -2/+1
EXT4_FC_REPLAY will be checked in ext4_es_lookup_extent(). If it is set, ext4_es_lookup_extent() will return 0. Remove the repeated check for EXT4_FC_REPLAY in ext4_map_blocks() to simplify the code. Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250429111722.294975-1-alexjlzheng@tencent.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-06-29 | Linux 6.16-rc4 | Linus Torvalds | 1 | -1/+1
2025-06-27 | tracing: Fix filter logic error | Edward Adam Davis | 1 | -7/+7
If the processing of the tr->events loop fails, the filter that has been added to filter_head will be released twice, in free_filter_list(&head->rcu) and __free_filter(filter). Add the filter to filter_head only after the filters of tr->events have been added, to avoid triggering the use-after-free. Link: https://lore.kernel.org/tencent_4EF87A626D702F816CD0951CE956EC32CD0A@qq.com Fixes: a9d0aab5eb33 ("tracing: Fix regression of filter waiting a long time on RCU synchronization") Reported-by: syzbot+daba72c4af9915e9c894@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=daba72c4af9915e9c894 Tested-by: syzbot+daba72c4af9915e9c894@syzkaller.appspotmail.com Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Edward Adam Davis <eadavis@qq.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-06-27 | i2c: scx200_acb: depends on HAS_IOPORT | Johannes Berg | 1 | -1/+1
It already depends on X86_32, but that's also set for ARCH=um. Recent changes made UML no longer have IO port access since it's not needed, but this driver uses it. Build it only for HAS_IOPORT. This is pretty much the same as depending on X86, but on the off-chance that HAS_IOPORT will ever be optional on x86, HAS_IOPORT is the real prerequisite. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
2025-06-27 | LoongArch: KVM: Disable updating of "num_cpu" and "feature" | Bibo Mao | 1 | -0/+6
Properties "num_cpu" and "feature" are read-only once the eiointc is created; they are set with the KVM_DEV_LOONGARCH_EXTIOI_GRP_CTRL attr group before device creation. Attr group KVM_DEV_LOONGARCH_EXTIOI_GRP_SW_STATUS is for updating register and software state for migration and reset usage; properties "num_cpu" and "feature" cannot be updated again once the device has been created. Here, discard write operations on properties "num_cpu" and "feature" in attr group KVM_DEV_LOONGARCH_EXTIOI_GRP_CTRL. Cc: stable@vger.kernel.org Fixes: 1ad7efa552fd ("LoongArch: KVM: Add EIOINTC user mode read and write functions") Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2025-06-27 | LoongArch: KVM: Check validity of "num_cpu" from user space | Bibo Mao | 1 | -5/+14
For the EIOINTC irqchip, the maximum supported CPU number is EIOINTC_ROUTE_MAX_VCPUS; add validation of the CPU number supplied from user space to avoid an array pointer overflow. Cc: stable@vger.kernel.org Fixes: 1ad7efa552fd ("LoongArch: KVM: Add EIOINTC user mode read and write functions") Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2025-06-27 | LoongArch: KVM: Check interrupt route from physical CPU | Bibo Mao | 1 | -6/+18
With the EIOINTC interrupt controller, the physical CPU ID is set for the irq route. However, the function kvm_get_vcpu(), which looks vCPUs up by logical CPU ID, was being used to get the destination vCPU when delivering an irq. With kvm_get_vcpu_by_cpuid(), the vCPU can be searched for by physical CPU ID instead. Cc: stable@vger.kernel.org Fixes: 3956a52bc05b ("LoongArch: KVM: Add EIOINTC read and write functions") Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2025-06-27 | LoongArch: KVM: Fix interrupt route update with EIOINTC | Bibo Mao | 1 | -7/+6
In eiointc_update_sw_coremap() there is a forced assignment like val = *(u64 *)pvalue. Since pvalue may point to a char or another type, the forced u64 assignment is problematic. Pass the actual value rather than an address pointer. Cc: stable@vger.kernel.org Fixes: 3956a52bc05b ("LoongArch: KVM: Add EIOINTC read and write functions") Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2025-06-27 | LoongArch: KVM: Add address alignment check for IOCSR emulation | Bibo Mao | 1 | -0/+10
IOCSR instructions support 1/2/4/8-byte accesses, and the address should be naturally aligned with its access size. Add an address alignment check to the EIOINTC kernel emulation. Cc: stable@vger.kernel.org Fixes: 3956a52bc05b ("LoongArch: KVM: Add EIOINTC read and write functions") Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
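Natural alignment for a 1/2/4/8-byte access just means the low bits of the address are clear; a minimal sketch of such a check (the helper name is illustrative):

    static bool iocsr_addr_aligned(u64 addr, unsigned int len)
    {
            /* len is one of 1, 2, 4, 8; naturally aligned means the
             * low log2(len) bits of the address are all zero. */
            if (len != 1 && len != 2 && len != 4 && len != 8)
                    return false;
            return !(addr & (len - 1));
    }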
2025-06-26 | io_uring/kbuf: flag partial buffer mappings | Jens Axboe | 3 | -9/+18
A previous commit aborted mapping more for a non-incremental ring for bundle peeking, but depending on where in the process this peeking happened, it would not necessarily prevent a retry by the user. That can create gaps in the received/read data. Add struct buf_sel_arg->partial_map, which can pass this information back. The networking side can then map that to internal state and use it to gate retry as well. Since this necessitates a new flag, change io_sr_msg->retry to a retry_flags member, and store both the retry and partial map condition in there. Cc: stable@vger.kernel.org Fixes: 26ec15e4b0c1 ("io_uring/kbuf: don't truncate end buffer for multiple buffer peeks") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-26 | cifs: Fix reading into an ITER_FOLIOQ from the smbdirect code | David Howells | 1 | -95/+17
When performing a file read from RDMA, smbd_recv() prints an "Invalid msg type 4" error and fails the I/O. This is due to the switch-statement there not handling the ITER_FOLIOQ handed down from netfslib. Fix this by collapsing smbd_recv_buf() and smbd_recv_page() into smbd_recv() and just using copy_to_iter() instead of memcpy(). This future-proofs the function too, in case more ITER_* types are added. Fixes: ee4cdf7ba857 ("netfs: Speed up buffered reading") Reported-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: David Howells <dhowells@redhat.com> cc: Tom Talpey <tom@talpey.com> cc: Paulo Alcantara (Red Hat) <pc@manguebit.com> cc: Matthew Wilcox <willy@infradead.org> cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>
2025-06-26 | cifs: Fix the smbd_response slab to allow usercopy | David Howells | 1 | -5/+13
The handling of received data in the smbdirect client code involves using copy_to_iter() to copy data from the smbd_response struct's packet trailer to a folioq buffer provided by netfslib that encapsulates a chunk of pagecache. If, however, CONFIG_HARDENED_USERCOPY=y, this will result in the checks then performed in copy_to_iter() oopsing with something like the following:

CIFS: Attempting to mount //172.31.9.1/test
CIFS: VFS: RDMA transport established
usercopy: Kernel memory exposure attempt detected from SLUB object 'smbd_response_0000000091e24ea1' (offset 81, size 63)!
------------[ cut here ]------------
kernel BUG at mm/usercopy.c:102!
...
RIP: 0010:usercopy_abort+0x6c/0x80
...
Call Trace:
 <TASK>
 __check_heap_object+0xe3/0x120
 __check_object_size+0x4dc/0x6d0
 smbd_recv+0x77f/0xfe0 [cifs]
 cifs_readv_from_socket+0x276/0x8f0 [cifs]
 cifs_read_from_socket+0xcd/0x120 [cifs]
 cifs_demultiplex_thread+0x7e9/0x2d50 [cifs]
 kthread+0x396/0x830
 ret_from_fork+0x2b8/0x3b0
 ret_from_fork_asm+0x1a/0x30

The problem is that the smbd_response slab's packet field isn't marked as being permitted for usercopy. Fix this by passing parameters to kmem_cache_create() to indicate that copy_to_iter() is permitted from the packet region of the smbd_response slab objects, less the header space.

Fixes: ee4cdf7ba857 ("netfs: Speed up buffered reading")
Reported-by: Stefan Metzmacher <metze@samba.org>
Link: https://lore.kernel.org/r/acb7f612-df26-4e2a-a35d-7cd040f513e1@samba.org/
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Stefan Metzmacher <metze@samba.org>
Tested-by: Stefan Metzmacher <metze@samba.org>
cc: Paulo Alcantara <pc@manguebit.com>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-06-26 | smb: client: fix potential deadlock when reconnecting channels | Paulo Alcantara | 2 | -22/+37
Fix cifs_signal_cifsd_for_reconnect() to take the correct lock order and prevent the following deadlock from happening

======================================================
WARNING: possible circular locking dependency detected
6.16.0-rc3-build2+ #1301 Tainted: G S      W
------------------------------------------------------
cifsd/6055 is trying to acquire lock:
ffff88810ad56038 (&tcp_ses->srv_lock){+.+.}-{3:3}, at: cifs_signal_cifsd_for_reconnect+0x134/0x200

but task is already holding lock:
ffff888119c64330 (&ret_buf->chan_lock){+.+.}-{3:3}, at: cifs_signal_cifsd_for_reconnect+0xcf/0x200

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (&ret_buf->chan_lock){+.+.}-{3:3}:
       validate_chain+0x1cf/0x270
       __lock_acquire+0x60e/0x780
       lock_acquire.part.0+0xb4/0x1f0
       _raw_spin_lock+0x2f/0x40
       cifs_setup_session+0x81/0x4b0
       cifs_get_smb_ses+0x771/0x900
       cifs_mount_get_session+0x7e/0x170
       cifs_mount+0x92/0x2d0
       cifs_smb3_do_mount+0x161/0x460
       smb3_get_tree+0x55/0x90
       vfs_get_tree+0x46/0x180
       do_new_mount+0x1b0/0x2e0
       path_mount+0x6ee/0x740
       do_mount+0x98/0xe0
       __do_sys_mount+0x148/0x180
       do_syscall_64+0xa4/0x260
       entry_SYSCALL_64_after_hwframe+0x76/0x7e

-> #1 (&ret_buf->ses_lock){+.+.}-{3:3}:
       validate_chain+0x1cf/0x270
       __lock_acquire+0x60e/0x780
       lock_acquire.part.0+0xb4/0x1f0
       _raw_spin_lock+0x2f/0x40
       cifs_match_super+0x101/0x320
       sget+0xab/0x270
       cifs_smb3_do_mount+0x1e0/0x460
       smb3_get_tree+0x55/0x90
       vfs_get_tree+0x46/0x180
       do_new_mount+0x1b0/0x2e0
       path_mount+0x6ee/0x740
       do_mount+0x98/0xe0
       __do_sys_mount+0x148/0x180
       do_syscall_64+0xa4/0x260
       entry_SYSCALL_64_after_hwframe+0x76/0x7e

-> #0 (&tcp_ses->srv_lock){+.+.}-{3:3}:
       check_noncircular+0x95/0xc0
       check_prev_add+0x115/0x2f0
       validate_chain+0x1cf/0x270
       __lock_acquire+0x60e/0x780
       lock_acquire.part.0+0xb4/0x1f0
       _raw_spin_lock+0x2f/0x40
       cifs_signal_cifsd_for_reconnect+0x134/0x200
       __cifs_reconnect+0x8f/0x500
       cifs_handle_standard+0x112/0x280
       cifs_demultiplex_thread+0x64d/0xbc0
       kthread+0x2f7/0x310
       ret_from_fork+0x2a/0x230
       ret_from_fork_asm+0x1a/0x30

other info that might help us debug this:

Chain exists of:
  &tcp_ses->srv_lock --> &ret_buf->ses_lock --> &ret_buf->chan_lock

Possible unsafe locking scenario:

       CPU0                     CPU1
       ----                     ----
  lock(&ret_buf->chan_lock);
                                lock(&ret_buf->ses_lock);
                                lock(&ret_buf->chan_lock);
  lock(&tcp_ses->srv_lock);

*** DEADLOCK ***

3 locks held by cifsd/6055:
 #0: ffffffff857de398 (&cifs_tcp_ses_lock){+.+.}-{3:3}, at: cifs_signal_cifsd_for_reconnect+0x7b/0x200
 #1: ffff888119c64060 (&ret_buf->ses_lock){+.+.}-{3:3}, at: cifs_signal_cifsd_for_reconnect+0x9c/0x200
 #2: ffff888119c64330 (&ret_buf->chan_lock){+.+.}-{3:3}, at: cifs_signal_cifsd_for_reconnect+0xcf/0x200

Cc: linux-cifs@vger.kernel.org
Reported-by: David Howells <dhowells@redhat.com>
Fixes: d7d7a66aacd6 ("cifs: avoid use of global locks for high contention data")
Reviewed-by: David Howells <dhowells@redhat.com>
Tested-by: David Howells <dhowells@redhat.com>
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-06-26 | block: fix false warning in bdev_count_inflight_rw() | Yu Kuai | 1 | -11/+15
While bdev_count_inflight() is iterating over all CPUs, if some IOs are issued from a traversed CPU and then completed from a CPU that has not been traversed yet:

cpu0                    cpu1                            cpu2
                        bdev_count_inflight
                         //for_each_possible_cpu
                         // cpu0 is 0
                         inflight += 0
// issue a io
blk_account_io_start
// cpu0 inflight ++
                                                        // the io is done
                                                        blk_account_io_done
                                                        // cpu2 inflight --
                         // cpu1 is 0
                         inflight += 0
                         // cpu2 is -1
                         inflight += -1
...

In this case, the total inflight will be -1, causing lots of false warnings. Fix the problem by removing the warning. Note that there is still a valid warning for nvme-mpath (from Yi) that is not fixed yet.

Fixes: f5482ee5edb9 ("block: WARN if bdev inflight counter is negative")
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Closes: https://lore.kernel.org/linux-block/aFtUXy-lct0WxY2w@mozart.vkv.me/T/#mae89155a5006463d0a21a4a2c35ae0034b26a339
Reported-and-tested-by: Calvin Owens <calvin@wbinvd.org>
Closes: https://lore.kernel.org/linux-block/aFtUXy-lct0WxY2w@mozart.vkv.me/T/#m1d935a00070bf95055d0ac84e6075158b08acaef
Reported-by: Dave Chinner <david@fromorbit.com>
Closes: https://lore.kernel.org/linux-block/aFuypjqCXo9-5_En@dread.disaster.area/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250626115743.1641443-1-yukuai3@huawei.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-26 | ublk: sanity check add_dev input for underflow | Ronnie Sahlberg | 1 | -1/+2
Add additional checks that queue depth and number of queues are non-zero. Signed-off-by: Ronnie Sahlberg <rsahlberg@whamcloud.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250626022046.235018-1-ronniesahlberg@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-26 | drm/xe: Process deferred GGTT node removals on device unwind | Michal Wajdeczko | 1 | -0/+11
While we are indirectly draining our dedicated workqueue ggtt->wq that we use to complete asynchronous removal of some GGTT nodes, this happens as part of the managed-drm unwinding (ggtt_fini_early), which can run later than the managed-device unwinding, where we could already unmap our MMIO/GSM mapping (mmio_fini). This was recently observed during unsuccessful VF initialization:

[ ] xe 0000:00:02.1: probe with driver xe failed with error -62
[ ] xe 0000:00:02.1: DEVRES REL ffff88811e747340 __xe_bo_unpin_map_no_vm (16 bytes)
[ ] xe 0000:00:02.1: DEVRES REL ffff88811e747540 __xe_bo_unpin_map_no_vm (16 bytes)
[ ] xe 0000:00:02.1: DEVRES REL ffff88811e747240 __xe_bo_unpin_map_no_vm (16 bytes)
[ ] xe 0000:00:02.1: DEVRES REL ffff88811e747040 tiles_fini (16 bytes)
[ ] xe 0000:00:02.1: DEVRES REL ffff88811e746840 mmio_fini (16 bytes)
[ ] xe 0000:00:02.1: DEVRES REL ffff88811e747f40 xe_bo_pinned_fini (16 bytes)
[ ] xe 0000:00:02.1: DEVRES REL ffff88811e746b40 devm_drm_dev_init_release (16 bytes)
[ ] xe 0000:00:02.1: [drm:drm_managed_release] drmres release begin
[ ] xe 0000:00:02.1: [drm:drm_managed_release] REL ffff88810ef81640 __fini_relay (8 bytes)
[ ] xe 0000:00:02.1: [drm:drm_managed_release] REL ffff88810ef80d40 guc_ct_fini (8 bytes)
[ ] xe 0000:00:02.1: [drm:drm_managed_release] REL ffff88810ef80040 __drmm_mutex_release (8 bytes)
[ ] xe 0000:00:02.1: [drm:drm_managed_release] REL ffff88810ef80140 ggtt_fini_early (8 bytes)

and this was leading to:

[ ] BUG: unable to handle page fault for address: ffffc900058162a0
[ ] #PF: supervisor write access in kernel mode
[ ] #PF: error_code(0x0002) - not-present page
[ ] Oops: Oops: 0002 [#1] SMP NOPTI
[ ] Tainted: [W]=WARN
[ ] Workqueue: xe-ggtt-wq ggtt_node_remove_work_func [xe]
[ ] RIP: 0010:xe_ggtt_set_pte+0x6d/0x350 [xe]
[ ] Call Trace:
[ ]  <TASK>
[ ]  xe_ggtt_clear+0xb0/0x270 [xe]
[ ]  ggtt_node_remove+0xbb/0x120 [xe]
[ ]  ggtt_node_remove_work_func+0x30/0x50 [xe]
[ ]  process_one_work+0x22b/0x6f0
[ ]  worker_thread+0x1e8/0x3d

Add a managed-device action that will explicitly drain the workqueue with all pending node removals prior to releasing the MMIO/GSM mapping.

Fixes: 919bb54e989c ("drm/xe: Fix missing runtime outer protection for ggtt_remove_node")
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://lore.kernel.org/r/20250612220937.857-2-michal.wajdeczko@intel.com
(cherry picked from commit 89d2835c3680ab1938e22ad81b1c9f8c686bd391)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
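The shape of such a managed-device action, sketched with generic devres helpers (the xe structure and workqueue names are taken from the commit text; the registration site is an assumption):

    static void ggtt_drain_workqueue(void *arg)
    {
            struct xe_ggtt *ggtt = arg;

            /* flush all pending async node removals before the
             * MMIO/GSM mapping is released by later devres actions */
            drain_workqueue(ggtt->wq);
    }

    /* at init time, right after ggtt->wq is allocated: */
    err = devm_add_action_or_reset(dev, ggtt_drain_workqueue, ggtt);
    if (err)
            return err;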
2025-06-26 | drm/xe/guc: Explicitly exit CT safe mode on unwind | Michal Wajdeczko | 1 | -4/+6
During driver probe we might be briefly using CT safe mode, which is based on a delayed work, but usually we are able to stop this once we have IRQ fully operational. However, if we abort the probe quite early then during unwind we might try to destroy the workqueue while there is still a pending delayed work that attempts to restart itself, which triggers a WARN. This was recently observed during unsuccessful VF initialization:

[ ] xe 0000:00:02.1: probe with driver xe failed with error -62
[ ] ------------[ cut here ]------------
[ ] workqueue: cannot queue safe_mode_worker_func [xe] on wq xe-g2h-wq
[ ] WARNING: CPU: 9 PID: 0 at kernel/workqueue.c:2257 __queue_work+0x287/0x710
[ ] RIP: 0010:__queue_work+0x287/0x710
[ ] Call Trace:
[ ]  delayed_work_timer_fn+0x19/0x30
[ ]  call_timer_fn+0xa1/0x2a0

Exit the CT safe mode on unwind to avoid that warning.

Fixes: 09b286950f29 ("drm/xe/guc: Allow CTB G2H processing without G2H IRQ")
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250612220937.857-3-michal.wajdeczko@intel.com
(cherry picked from commit 2ddbb73ec20b98e70a5200cb85deade22ccea2ec)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
2025-06-26 | drm/xe: move DPT l2 flush to a more sensible place | Matthew Auld | 1 | -2/+3
Only need the flush for DPT host updates here. Normal GGTT updates don't need special flush. Fixes: 01570b446939 ("drm/xe/bmg: implement Wa_16023588340") Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: stable@vger.kernel.org # v6.12+ Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://lore.kernel.org/r/20250606104546.1996818-4-matthew.auld@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> (cherry picked from commit 35db1da40c8cfd7511dc42f342a133601eb45449) Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>