aboutsummaryrefslogtreecommitdiffstats
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2025-10-14perf/core: Fix MMAP event path names with backing filesAdrian Hunter1-1/+1
Some file systems like FUSE-based ones or overlayfs may record the backing file in struct vm_area_struct vm_file, instead of the user file that the user mmapped. Since commit def3ae83da02f ("fs: store real path instead of fake path in backing file f_path"), file_path() no longer returns the user file path when applied to a backing file. There is an existing helper file_user_path() for that situation. Use file_user_path() instead of file_path() to get the path for MMAP and MMAP2 events. Example: Setup: # cd /root # mkdir test ; cd test ; mkdir lower upper work merged # cp `which cat` lower # mount -t overlay overlay -olowerdir=lower,upperdir=upper,workdir=work merged # perf record -e intel_pt//u -- /root/test/merged/cat /proc/self/maps ... 55b0ba399000-55b0ba434000 r-xp 00018000 00:1a 3419 /root/test/merged/cat ... [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.060 MB perf.data ] # Before: File name is wrong (/cat), so decoding fails: # perf script --no-itrace --show-mmap-events cat 367 [016] 100.491492: PERF_RECORD_MMAP2 367/367: [0x55b0ba399000(0x9b000) @ 0x18000 00:02 3419 489959280]: r-xp /cat ... # perf script --itrace=e | wc -l Warning: 19 instruction trace errors 19 # After: File name is correct (/root/test/merged/cat), so decoding is ok: # perf script --no-itrace --show-mmap-events cat 364 [016] 72.153006: PERF_RECORD_MMAP2 364/364: [0x55ce4003d000(0x9b000) @ 0x18000 00:02 3419 3132534314]: r-xp /root/test/merged/cat # perf script --itrace=e # perf script --itrace=e | wc -l 0 # Fixes: def3ae83da02f ("fs: store real path instead of fake path in backing file f_path") Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Amir Goldstein <amir73il@gmail.com> Cc: stable@vger.kernel.org
2025-10-14perf/core: Fix address filter match with backing filesAdrian Hunter1-1/+1
It was reported that Intel PT address filters do not work in Docker containers. That relates to the use of overlayfs. overlayfs records the backing file in struct vm_area_struct vm_file, instead of the user file that the user mmapped. In order for an address filter to match, it must compare to the user file inode. There is an existing helper file_user_inode() for that situation. Use file_user_inode() instead of file_inode() to get the inode for address filter matching. Example: Setup: # cd /root # mkdir test ; cd test ; mkdir lower upper work merged # cp `which cat` lower # mount -t overlay overlay -olowerdir=lower,upperdir=upper,workdir=work merged # perf record --buildid-mmap -e intel_pt//u --filter 'filter * @ /root/test/merged/cat' -- /root/test/merged/cat /proc/self/maps ... 55d61d246000-55d61d2e1000 r-xp 00018000 00:1a 3418 /root/test/merged/cat ... [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.015 MB perf.data ] # perf buildid-cache --add /root/test/merged/cat Before: Address filter does not match so there are no control flow packets # perf script --itrace=e # perf script --itrace=b | wc -l 0 # perf script -D | grep 'TIP.PGE' | wc -l 0 # After: Address filter does match so there are control flow packets # perf script --itrace=e # perf script --itrace=b | wc -l 235 # perf script -D | grep 'TIP.PGE' | wc -l 57 # With respect to stable kernels, overlayfs mmap function ovl_mmap() was added in v4.19 but file_user_inode() was not added until v6.8 and never back-ported to stable kernels. FMODE_BACKING that it depends on was added in v6.5. This issue has gone largely unnoticed, so back-porting before v6.8 is probably not worth it, so put 6.8 as the stable kernel prerequisite version, although in practice the next long term kernel is 6.12. Closes: https://lore.kernel.org/linux-perf-users/aBCwoq7w8ohBRQCh@fremen.lan Reported-by: Edd Barrett <edd@theunixzoo.co.uk> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Amir Goldstein <amir73il@gmail.com> Cc: stable@vger.kernel.org # 6.8
2025-10-14uprobe: Move arch_uprobe_optimize right after handlers executionJiri Olsa1-3/+3
It's less confusing to optimize uprobe right after handlers execution and before we do the check for changed ip register to avoid situations where changed ip register would skip uprobe optimization. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com>
2025-10-13PM: WQ_UNBOUND added to pm_wq workqueueMarco Crivellari1-1/+1
Currently if a user enqueue a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This lack of consistentcy cannot be addressed without refactoring the API. alloc_workqueue() treats all queues as per-CPU by default, while unbound workqueues must opt-in via WQ_UNBOUND. This default is suboptimal: most workloads benefit from unbound queues, allowing the scheduler to place worker threads where they’re needed and reducing noise when CPUs are isolated. This change add the WQ_UNBOUND flag to pm_wq, to make explicit this workqueue can be unbound and that it does not benefit from per-cpu work. Once migration is complete, WQ_UNBOUND can be removed and unbound will become the implicit default. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-10-13sched_ext: Make scx_bpf_dsq_insert*() return boolTejun Heo1-11/+34
In preparation for hierarchical schedulers, change scx_bpf_dsq_insert() and scx_bpf_dsq_insert_vtime() to return bool instead of void. With sub-schedulers, there will be no reliable way to guarantee a task is still owned by the sub-scheduler at insertion time (e.g., the task may have been migrated to another scheduler). The bool return value will enable sub-schedulers to detect and gracefully handle insertion failures. For the root scheduler, insertion failures will continue to trigger scheduler abort via scx_error(), so existing code doesn't need to check the return value. Backward compatibility is maintained through compat wrappers. Also update scx_bpf_dsq_move() documentation to clarify that it can return false for sub-schedulers when @dsq_id points to a disallowed local DSQ. Reviewed-by: Changwoo Min <changwoo@igalia.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-13sched_ext: Wrap kfunc args in struct to prepare for aux__progTejun Heo2-27/+98
scx_bpf_dsq_insert_vtime() and scx_bpf_select_cpu_and() currently have 5 parameters. An upcoming change will add aux__prog parameter which will exceed BPF's 5 argument limit. Prepare by adding new kfuncs __scx_bpf_dsq_insert_vtime() and __scx_bpf_select_cpu_and() that take args structs. The existing kfuncs are kept as compatibility wrappers. BPF programs use inline wrappers that detect kernel API version via bpf_core_type_exists() and use the new struct-based kfuncs when available, falling back to compat kfuncs otherwise. This allows BPF programs to work with both old and new kernels. Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-13sched_ext: Add scx_bpf_task_set_slice() and scx_bpf_task_set_dsq_vtime()Tejun Heo1-0/+30
With the planned hierarchical scheduler support, sub-schedulers will need to be verified for authority before being allowed to modify task->scx.slice and task->scx.dsq_vtime. Add scx_bpf_task_set_slice() and scx_bpf_task_set_dsq_vtime() which will perform the necessary permission checks. Root schedulers can still directly write to these fields, so this doesn't affect existing schedulers. Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-13sched_ext: Exit early on hotplug events during attachAndrea Righi1-2/+9
There is no need to complete the entire scx initialization if a scheduler is failing to be attached due to a hotplug event. Exit early to avoid unnecessary work and simplify the attach flow. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-13sched_ext: Allocate scx_kick_cpus_pnt_seqs lazily using kvzalloc()Tejun Heo1-10/+79
On systems with >4096 CPUs, scx_kick_cpus_pnt_seqs allocation fails during boot because it exceeds the 32,768 byte percpu allocator limit. Restructure to use DEFINE_PER_CPU() for the per-CPU pointers, with each CPU pointing to its own kvzalloc'd array. Move allocation from boot time to scx_enable() and free in scx_disable(), so the O(nr_cpu_ids^2) memory is only consumed when sched_ext is active. Use RCU to guard against racing with free. Arrays are freed via call_rcu() and kick_cpus_irq_workfn() uses rcu_dereference_bh() with a NULL check. While at it, rename to scx_kick_pseqs for brevity and update comments to clarify these are pick_task sequence numbers. v2: RCU protect scx_kick_seqs to manage kick_cpus_irq_workfn() racing against disable as per Andrea. v3: Fix bugs notcied by Andrea. Reported-by: Phil Auld <pauld@redhat.com> Link: http://lkml.kernel.org/r/20251007133523.GA93086@pauld.westford.csb Cc: Andrea Righi <arighi@nvidia.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Reviewed-by: Phil Auld <pauld@redhat.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-13sched_ext: defer queue_balance_callback() until after ops.dispatchEmil Tsalapatis2-2/+28
The sched_ext code calls queue_balance_callback() during enqueue_task() to defer operations that drop multiple locks until we can unpin them. The call assumes that the rq lock is held until the callbacks are invoked, and the pending callbacks will not be visible to any other threads. This is enforced by a WARN_ON_ONCE() in rq_pin_lock(). However, balance_one() may actually drop the lock during a BPF dispatch call. Another thread may win the race to get the rq lock and see the pending callback. To avoid this, sched_ext must only queue the callback after the dispatch calls have completed. CPU 0 CPU 1 CPU 2 scx_balance() rq_unpin_lock() scx_balance_one() |= IN_BALANCE scx_enqueue() ops.dispatch() rq_unlock() rq_lock() queue_balance_callback() rq_unlock() [WARN] rq_pin_lock() rq_lock() &= ~IN_BALANCE rq_repin_lock() Changelog v2-> v1 (https://lore.kernel.org/sched-ext/aOgOxtHCeyRT_7jn@gpd4) - Fixed explanation in patch description (Andrea) - Fixed scx_rq mask state updates (Andrea) - Added Reviewed-by tag from Andrea Reported-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Emil Tsalapatis (Meta) <emil@etsalapatis.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-13sched_ext: Sync error_irq_work before freeing scx_schedTejun Heo1-0/+2
By the time scx_sched_free_rcu_work() runs, the scx_sched is no longer reachable. However, a previously queued error_irq_work may still be pending or running. Ensure it completes before proceeding with teardown. Fixes: bff3b5aec1b7 ("sched_ext: Move disable machinery into scx_sched") Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-13sched_ext: Mark scx_bpf_dsq_move_set_[slice|vtime]() with KF_RCUTejun Heo1-4/+4
scx_bpf_dsq_move_set_slice() and scx_bpf_dsq_move_set_vtime() take a DSQ iterator argument which has to be valid. Mark them with KF_RCU. Fixes: 4c30f5ce4f7a ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()") Cc: stable@vger.kernel.org # v6.12+ Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf before 6.18-rc1Alexei Starovoitov40-251/+1117
Cross-merge BPF and other fixes after downstream PR. No conflicts. Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-10-11Merge tag 'trace-v6.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-traceLinus Torvalds1-4/+8
Pull tracing fixes from Steven Rostedt: "The previous fix to trace_marker required updating trace_marker_raw as well. The difference between trace_marker_raw from trace_marker is that the raw version is for applications to write binary structures directly into the ring buffer instead of writing ASCII strings. This is for applications that will read the raw data from the ring buffer and get the data structures directly. It's a bit quicker than using the ASCII version. Unfortunately, it appears that our test suite has several tests that test writes to the trace_marker file, but lacks any tests to the trace_marker_raw file (this needs to be remedied). Two issues came about the update to the trace_marker_raw file that syzbot found: - Fix tracing_mark_raw_write() to use per CPU buffer The fix to use the per CPU buffer to copy from user space was needed for both the trace_maker and trace_maker_raw file. The fix for reading from user space into per CPU buffers properly fixed the trace_marker write function, but the trace_marker_raw file wasn't fixed properly. The user space data was correctly written into the per CPU buffer, but the code that wrote into the ring buffer still used the user space pointer and not the per CPU buffer that had the user space data already written. - Stop the fortify string warning from writing into trace_marker_raw After converting the copy_from_user_nofault() into a memcpy(), another issue appeared. As writes to the trace_marker_raw expects binary data, the first entry is a 4 byte identifier. The entry structure is defined as: struct { struct trace_entry ent; int id; char buf[]; }; The size of this structure is reserved on the ring buffer with: size = sizeof(*entry) + cnt; Then it is copied from the buffer into the ring buffer with: memcpy(&entry->id, buf, cnt); This use to be a copy_from_user_nofault(), but now converting it to a memcpy() triggers the fortify-string code, and causes a warning. The allocated space is actually more than what is copied, as the cnt used also includes the entry->id portion. Allocating sizeof(*entry) plus cnt is actually allocating 4 bytes more than what is needed. Change the size function to: size = struct_size(entry, buf, cnt - sizeof(entry->id)); And update the memcpy() to unsafe_memcpy()" * tag 'trace-v6.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Stop fortify-string from warning in tracing_mark_raw_write() tracing: Fix tracing_mark_raw_write() to use buf and not ubuf
2025-10-11Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfLinus Torvalds1-2/+2
Pull bpf fixes from Alexei Starovoitov: - Finish constification of 1st parameter of bpf_d_path() (Rong Tao) - Harden userspace-supplied xdp_desc validation (Alexander Lobakin) - Fix metadata_dst leak in __bpf_redirect_neigh_v{4,6}() (Daniel Borkmann) - Fix undefined behavior in {get,put}_unaligned_be32() (Eric Biggers) - Use correct context to unpin bpf hash map with special types (KaFai Wan) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests/bpf: Add test for unpinning htab with internal timer struct bpf: Avoid RCU context warning when unpinning htab with internal structs xsk: Harden userspace-supplied xdp_desc validation bpf: Fix metadata_dst leak __bpf_redirect_neigh_v{4,6} libbpf: Fix undefined behavior in {get,put}_unaligned_be32() bpf: Finish constification of 1st parameter of bpf_d_path()
2025-10-11Merge tag 'mm-nonmm-stable-2025-10-10-15-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mmLinus Torvalds1-43/+318
Pull more updates from Andrew Morton: "Just one series here - Mike Rappoport has taught KEXEC handover to preserve vmalloc allocations across handover" * tag 'mm-nonmm-stable-2025-10-10-15-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: lib/test_kho: use kho_preserve_vmalloc instead of storing addresses in fdt kho: add support for preserving vmalloc allocations kho: replace kho_preserve_phys() with kho_preserve_pages() kho: check if kho is finalized in __kho_preserve_order() MAINTAINERS, .mailmap: update Umang's email address
2025-10-11tracing: Stop fortify-string from warning in tracing_mark_raw_write()Steven Rostedt1-2/+6
The way tracing_mark_raw_write() records its data is that it has the following structure: struct { struct trace_entry; int id; char buf[]; }; But memcpy(&entry->id, buf, size) triggers the following warning when the size is greater than the id: ------------[ cut here ]------------ memcpy: detected field-spanning write (size 6) of single field "&entry->id" at kernel/trace/trace.c:7458 (size 4) WARNING: CPU: 7 PID: 995 at kernel/trace/trace.c:7458 write_raw_marker_to_buffer.isra.0+0x1f9/0x2e0 Modules linked in: CPU: 7 UID: 0 PID: 995 Comm: bash Not tainted 6.17.0-test-00007-g60b82183e78a-dirty #211 PREEMPT(voluntary) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-debian-1.17.0-1 04/01/2014 RIP: 0010:write_raw_marker_to_buffer.isra.0+0x1f9/0x2e0 Code: 04 00 75 a7 b9 04 00 00 00 48 89 de 48 89 04 24 48 c7 c2 e0 b1 d1 b2 48 c7 c7 40 b2 d1 b2 c6 05 2d 88 6a 04 01 e8 f7 e8 bd ff <0f> 0b 48 8b 04 24 e9 76 ff ff ff 49 8d 7c 24 04 49 8d 5c 24 08 48 RSP: 0018:ffff888104c3fc78 EFLAGS: 00010292 RAX: 0000000000000000 RBX: 0000000000000006 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 1ffffffff6b363b4 RDI: 0000000000000001 RBP: ffff888100058a00 R08: ffffffffb041d459 R09: ffffed1020987f40 R10: 0000000000000007 R11: 0000000000000001 R12: ffff888100bb9010 R13: 0000000000000000 R14: 00000000000003e3 R15: ffff888134800000 FS: 00007fa61d286740(0000) GS:ffff888286cad000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000560d28d509f1 CR3: 00000001047a4006 CR4: 0000000000172ef0 Call Trace: <TASK> tracing_mark_raw_write+0x1fe/0x290 ? __pfx_tracing_mark_raw_write+0x10/0x10 ? security_file_permission+0x50/0xf0 ? rw_verify_area+0x6f/0x4b0 vfs_write+0x1d8/0xdd0 ? __pfx_vfs_write+0x10/0x10 ? __pfx_css_rstat_updated+0x10/0x10 ? count_memcg_events+0xd9/0x410 ? fdget_pos+0x53/0x5e0 ksys_write+0x182/0x200 ? __pfx_ksys_write+0x10/0x10 ? do_user_addr_fault+0x4af/0xa30 do_syscall_64+0x63/0x350 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7fa61d318687 Code: 48 89 fa 4c 89 df e8 58 b3 00 00 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 1a 5b c3 0f 1f 84 00 00 00 00 00 48 8b 44 24 10 0f 05 <5b> c3 0f 1f 80 00 00 00 00 83 e2 39 83 fa 08 75 de e8 23 ff ff ff RSP: 002b:00007ffd87fe0120 EFLAGS: 00000202 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 00007fa61d286740 RCX: 00007fa61d318687 RDX: 0000000000000006 RSI: 0000560d28d509f0 RDI: 0000000000000001 RBP: 0000560d28d509f0 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000006 R13: 00007fa61d4715c0 R14: 00007fa61d46ee80 R15: 0000000000000000 </TASK> ---[ end trace 0000000000000000 ]--- This is because fortify string sees that the size of entry->id is only 4 bytes, but it is writing more than that. But this is OK as the dynamic_array is allocated to handle that copy. The size allocated on the ring buffer was actually a bit too big: size = sizeof(*entry) + cnt; But cnt includes the 'id' and the buffer data, so adding cnt to the size of *entry actually allocates too much on the ring buffer. Change the allocation to: size = struct_size(entry, buf, cnt - sizeof(entry->id)); and the memcpy() to unsafe_memcpy() with an added justification. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://lore.kernel.org/20251011112032.77be18e4@gandalf.local.home Fixes: 64cf7d058a00 ("tracing: Have trace_marker use per-cpu data to read user space") Reported-by: syzbot+9a2ede1643175f350105@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/68e973f5.050a0220.1186a4.0010.GAE@google.com/ Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-10-10tracing: Fix tracing_mark_raw_write() to use buf and not ubufSteven Rostedt1-2/+2
The fix to use a per CPU buffer to read user space tested only the writes to trace_marker. But it appears that the selftests are missing tests to the trace_maker_raw file. The trace_maker_raw file is used by applications that writes data structures and not strings into the file, and the tools read the raw ring buffer to process the structures it writes. The fix that reads the per CPU buffers passes the new per CPU buffer to the trace_marker file writes, but the update to the trace_marker_raw write read the data from user space into the per CPU buffer, but then still used then passed the user space address to the function that records the data. Pass in the per CPU buffer and not the user space address. TODO: Add a test to better test trace_marker_raw. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://lore.kernel.org/20251011035243.386098147@kernel.org Fixes: 64cf7d058a00 ("tracing: Have trace_marker use per-cpu data to read user space") Reported-by: syzbot+9a2ede1643175f350105@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/68e973f5.050a0220.1186a4.0010.GAE@google.com/ Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-10-10bpf: Extract internal structs validation logic into helpersMykyta Yatsenko3-36/+29
The arraymap and hashtab duplicate the logic that checks for and frees internal structs (timer, workqueue, task_work) based on BTF record flags. Centralize this by introducing two helpers: * bpf_map_has_internal_structs(map) Returns true if the map value contains any of internal structs: BPF_TIMER | BPF_WORKQUEUE | BPF_TASK_WORK. * bpf_map_free_internal_structs(map, obj) Frees the internal structs for a single value object. Convert arraymap and both the prealloc/malloc hashtab paths to use the new generic functions. This keeps the functionality for when/how to free these special fields in one place and makes it easier to add support for new internal structs in the future without touching every map implementation. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251010164606.147298-3-mykyta.yatsenko5@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-10-10bpf: Fix handling maps with no BTF and non-constant offsets for the bpf_wqMykyta Yatsenko1-5/+12
Fix handling maps with no BTF and non-constant offsets for the bpf_wq. This de-duplicates logic with other internal structs (task_work, timer), keeps error reporting consistent, and makes future changes to the layout handling centralized. Fixes: d940c9b94d7e ("bpf: add support for KF_ARG_PTR_TO_WORKQUEUE") Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251010164606.147298-1-mykyta.yatsenko5@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-10-10bpf: Avoid RCU context warning when unpinning htab with internal structsKaFai Wan1-2/+2
When unpinning a BPF hash table (htab or htab_lru) that contains internal structures (timer, workqueue, or task_work) in its values, a BUG warning is triggered: BUG: sleeping function called from invalid context at kernel/bpf/hashtab.c:244 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 14, name: ksoftirqd/0 ... The issue arises from the interaction between BPF object unpinning and RCU callback mechanisms: 1. BPF object unpinning uses ->free_inode() which schedules cleanup via call_rcu(), deferring the actual freeing to an RCU callback that executes within the RCU_SOFTIRQ context. 2. During cleanup of hash tables containing internal structures, htab_map_free_internal_structs() is invoked, which includes cond_resched() or cond_resched_rcu() calls to yield the CPU during potentially long operations. However, cond_resched() or cond_resched_rcu() cannot be safely called from atomic RCU softirq context, leading to the BUG warning when attempting to reschedule. Fix this by changing from ->free_inode() to ->destroy_inode() and rename bpf_free_inode() to bpf_destroy_inode() for BPF objects (prog, map, link). This allows direct inode freeing without RCU callback scheduling, avoiding the invalid context warning. Reported-by: Le Chen <tom2cat@sjtu.edu.cn> Closes: https://lore.kernel.org/all/1444123482.1827743.1750996347470.JavaMail.zimbra@sjtu.edu.cn/ Fixes: 68134668c17f ("bpf: Add map side support for bpf timers.") Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: KaFai Wan <kafai.wan@linux.dev> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20251008102628.808045-2-kafai.wan@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-10-10bpf: add bpf_strcasestr,bpf_strncasestr kfuncsRong Tao1-21/+77
bpf_strcasestr() and bpf_strncasestr() functions perform same like bpf_strstr() and bpf_strnstr() except ignoring the case of the characters. Signed-off-by: Rong Tao <rongtao@cestc.cn> Link: https://lore.kernel.org/r/tencent_B01165355D42A8B8BF5E8D0A21EE1A88090A@qq.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-10-10bpf: Refactor storage_get_func_atomic to generic non_sleepable flagKumar Kartikeya Dwivedi1-16/+17
Rename the storage_get_func_atomic flag to a more generic non_sleepable flag that tracks whether a helper or kfunc may be called from a non-sleepable context. This makes the flag more broadly applicable beyond just storage_get helpers. See [0] for more context. The flag is now set unconditionally for all helpers and kfuncs when: - RCU critical section is active. - Preemption is disabled. - IRQs are disabled. - In a non-sleepable context within a sleepable program (e.g., timer callbacks), which is indicated by !in_sleepable(). Previously, the flag was only set for storage_get helpers in these contexts. With this change, it can be used by any code that needs to differentiate between sleepable and non-sleepable contexts at the per-instruction level. The existing usage in do_misc_fixups() for storage_get helpers is preserved by checking is_storage_get_function() before using the flag. [0]: https://lore.kernel.org/bpf/CAP01T76cbaNi4p-y8E0sjE2NXSra2S=Uja8G4hSQDu_SbXxREQ@mail.gmail.com Cc: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Mykyta Yatsenko <mykyta.yatsenko5@gmail.com> Link: https://lore.kernel.org/r/20251007220349.3852807-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-10-10bpf: Fix sleepable context for async callbacksKumar Kartikeya Dwivedi1-11/+30
Fix the BPF verifier to correctly determine the sleepable context of async callbacks based on the async primitive type rather than the arming program's context. The bug is in in_sleepable() which uses OR logic to check if the current execution context is sleepable. When a sleepable program arms a timer callback, the callback's state correctly has in_sleepable=false, but in_sleepable() would still return true due to env->prog->sleepable being true. This incorrectly allows sleepable helpers like bpf_copy_from_user() inside timer callbacks when armed from sleepable programs, even though timer callbacks always execute in non-sleepable context. Fix in_sleepable() to rely solely on env->cur_state->in_sleepable, and initialize state->in_sleepable to env->prog->sleepable in do_check_common() for the main program entry. This ensures the sleepable context is properly tracked per verification state rather than being overridden by the program's sleepability. The env->cur_state NULL check in in_sleepable() was only needed for do_misc_fixups() which runs after verification when env->cur_state is set to NULL. Update do_misc_fixups() to use env->prog->sleepable directly for the storage_get_function check, and remove the redundant NULL check from in_sleepable(). Introduce is_async_cb_sleepable() helper to explicitly determine async callback sleepability based on the primitive type: - bpf_timer callbacks are never sleepable - bpf_wq and bpf_task_work callbacks are always sleepable Add verifier_bug() check to catch unhandled async callback types, ensuring future additions cannot be silently mishandled. Move the is_task_work_add_kfunc() forward declaration to the top alongside other callback-related helpers. We update push_async_cb() to adjust to the new changes. At the same time, while simplifying in_sleepable(), we notice a problem in do_misc_fixups. Fix storage_get helpers to use GFP_ATOMIC when called from non-sleepable contexts within sleepable programs, such as bpf_timer callbacks. Currently, the check in do_misc_fixups assumes that env->prog->sleepable, previously in_sleepable(env) which only resolved to this check before last commit, holds across the program's execution, but that is not true. Instead, the func_atomic bit must be set whenever we see the function being called in an atomic context. Previously, this is being done when the helper is invoked in atomic contexts in sleepable programs, we can simply just set the value to true without doing an in_sleepable() check. We must also do a standalone in_sleepable() check to handle cases where the async callback itself is armed from a sleepable program, but is itself non-sleepable (e.g., timer callback) and invokes such a helper, thus needing the func_atomic bit to be true for the said call. Adjust do_misc_fixups() to drop any checks regarding sleepable nature of the program, and just depend on the func_atomic bit to decide which GFP flag to pass. Fixes: 81f1d7a583fa ("bpf: wq: add bpf_wq_set_callback_impl") Fixes: b00fa38a9c1c ("bpf: Enable non-atomic allocations in local storage") Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20251007220349.3852807-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-10-09Merge tag 'trace-v6.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-traceLinus Torvalds5-79/+241
Pull tracing clean up and fixes from Steven Rostedt: - Have osnoise tracer use memdup_user_nul() The function osnoise_cpus_write() open codes a kmalloc() and then a copy_from_user() and then adds a nul byte at the end which is the same as simply using memdup_user_nul(). - Fix wakeup and irq tracers when failing to acquire calltime When the wakeup and irq tracers use the function graph tracer for tracing function times, it saves a timestamp into the fgraph shadow stack. It is possible that this could fail to be stored. If that happens, it exits the routine early. These functions also disable nesting of the operations by incremeting the data "disable" counter. But if the calltime exits out early, it never increments the counter back to what it needs to be. Since there's only a couple of lines of code that does work after acquiring the calltime, instead of exiting out early, reverse the if statement to be true if calltime is acquired, and place the code that is to be done within that if block. The clean up will always be done after that. - Fix ring_buffer_map() return value on failure of __rb_map_vma() If __rb_map_vma() fails in ring_buffer_map(), it does not return an error. This means the caller will be working against a bad vma mapping. Have ring_buffer_map() return an error when __rb_map_vma() fails. - Fix regression of writing to the trace_marker file A bug fix was made to change __copy_from_user_inatomic() to copy_from_user_nofault() in the trace_marker write function. The trace_marker file is used by applications to write into it (usually with a file descriptor opened at the start of the program) to record into the tracing system. It's usually used in critical sections so the write to trace_marker is highly optimized. The reason for copying in an atomic section is that the write reserves space on the ring buffer and then writes directly into it. After it writes, it commits the event. The time between reserve and commit must have preemption disabled. The trace marker write does not have any locking nor can it allocate due to the nature of it being a critical path. Unfortunately, converting __copy_from_user_inatomic() to copy_from_user_nofault() caused a regression in Android. Now all the writes from its applications trigger the fault that is rejected by the _nofault() version that wasn't rejected by the _inatomic() version. Instead of getting data, it now just gets a trace buffer filled with: tracing_mark_write: <faulted> To fix this, on opening of the trace_marker file, allocate per CPU buffers that can be used by the write call. Then when entering the write call, do the following: preempt_disable(); cpu = smp_processor_id(); buffer = per_cpu_ptr(cpu_buffers, cpu); do { cnt = nr_context_switches_cpu(cpu); migrate_disable(); preempt_enable(); ret = copy_from_user(buffer, ptr, size); preempt_disable(); migrate_enable(); } while (!ret && cnt != nr_context_switches_cpu(cpu)); if (!ret) ring_buffer_write(buffer); preempt_enable(); This works similarly to seqcount. As it must enabled preemption to do a copy_from_user() into a per CPU buffer, if it gets preempted, the buffer could be corrupted by another task. To handle this, read the number of context switches of the current CPU, disable migration, enable preemption, copy the data from user space, then immediately disable preemption again. If the number of context switches is the same, the buffer is still valid. Otherwise it must be assumed that the buffer may have been corrupted and it needs to try again. Now the trace_marker write can get the user data even if it has to fault it in, and still not grab any locks of its own. * tag 'trace-v6.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Have trace_marker use per-cpu data to read user space ring buffer: Propagate __rb_map_vma return value to caller tracing: Fix irqoff tracers on failure of acquiring calltime tracing: Fix wakeup tracers on failure of acquiring calltime tracing/osnoise: Replace kmalloc + copy_from_user with memdup_user_nul
2025-10-08tracing: Have trace_marker use per-cpu data to read user spaceSteven Rostedt1-48/+220
It was reported that using __copy_from_user_inatomic() can actually schedule. Which is bad when preemption is disabled. Even though there's logic to check in_atomic() is set, but this is a nop when the kernel is configured with PREEMPT_NONE. This is due to page faulting and the code could schedule with preemption disabled. Link: https://lore.kernel.org/all/20250819105152.2766363-1-luogengkun@huaweicloud.com/ The solution was to change the __copy_from_user_inatomic() to copy_from_user_nofault(). But then it was reported that this caused a regression in Android. There's several applications writing into trace_marker() in Android, but now instead of showing the expected data, it is showing: tracing_mark_write: <faulted> After reverting the conversion to copy_from_user_nofault(), Android was able to get the data again. Writes to the trace_marker is a way to efficiently and quickly enter data into the Linux tracing buffer. It takes no locks and was designed to be as non-intrusive as possible. This means it cannot allocate memory, and must use pre-allocated data. A method that is actively being worked on to have faultable system call tracepoints read user space data is to allocate per CPU buffers, and use them in the callback. The method uses a technique similar to seqcount. That is something like this: preempt_disable(); cpu = smp_processor_id(); buffer = this_cpu_ptr(&pre_allocated_cpu_buffers, cpu); do { cnt = nr_context_switches_cpu(cpu); migrate_disable(); preempt_enable(); ret = copy_from_user(buffer, ptr, size); preempt_disable(); migrate_enable(); } while (!ret && cnt != nr_context_switches_cpu(cpu)); if (!ret) ring_buffer_write(buffer); preempt_enable(); It's a little more involved than that, but the above is the basic logic. The idea is to acquire the current CPU buffer, disable migration, and then enable preemption. At this moment, it can safely use copy_from_user(). After reading the data from user space, it disables preemption again. It then checks to see if there was any new scheduling on this CPU. If there was, it must assume that the buffer was corrupted by another task. If there wasn't, then the buffer is still valid as only tasks in preemptable context can write to this buffer and only those that are running on the CPU. By using this method, where trace_marker open allocates the per CPU buffers, trace_marker writes can access user space and even fault it in, without having to allocate or take any locks of its own. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Luo Gengkun <luogengkun@huaweicloud.com> Cc: Wattson CI <wattson-external@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/20251008124510.6dba541a@gandalf.local.home Fixes: 3d62ab32df065 ("tracing: Fix tracing_marker may trigger page fault during preempt_disable") Reported-by: Runping Lai <runpinglai@google.com> Tested-by: Runping Lai <runpinglai@google.com> Closes: https://lore.kernel.org/linux-trace-kernel/20251007003417.3470979-2-runpinglai@google.com/ Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-10-08ring buffer: Propagate __rb_map_vma return value to callerAnkit Khushwaha1-1/+1
The return value from `__rb_map_vma()`, which rejects writable or executable mappings (VM_WRITE, VM_EXEC, or !VM_MAYSHARE), was being ignored. As a result the caller of `__rb_map_vma` always returned 0 even when the mapping had actually failed, allowing it to proceed with an invalid VMA. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20251008172516.20697-1-ankitkhushwaha.linux@gmail.com Fixes: 117c39200d9d7 ("ring-buffer: Introducing ring-buffer mapping functions") Reported-by: syzbot+ddc001b92c083dbf2b97@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?id=194151be8eaebd826005329b2e123aecae714bdb Signed-off-by: Ankit Khushwaha <ankitkhushwaha.linux@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-10-08tracing: Fix irqoff tracers on failure of acquiring calltimeSteven Rostedt1-13/+10
The functions irqsoff_graph_entry() and irqsoff_graph_return() both call func_prolog_dec() that will test if the data->disable is already set and if not, increment it and return. If it was set, it returns false and the caller exits. The caller of this function must decrement the disable counter, but misses doing so if the calltime fails to be acquired. Instead of exiting out when calltime is NULL, change the logic to do the work if it is not NULL and still do the clean up at the end of the function if it is NULL. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20251008114943.6f60f30f@gandalf.local.home Fixes: a485ea9e3ef3 ("tracing: Fix irqsoff and wakeup latency tracers when using function graph") Reported-by: Sasha Levin <sashal@kernel.org> Closes: https://lore.kernel.org/linux-trace-kernel/20251006175848.1906912-2-sashal@kernel.org/ Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-10-08tracing: Fix wakeup tracers on failure of acquiring calltimeSteven Rostedt1-10/+6
The functions wakeup_graph_entry() and wakeup_graph_return() both call func_prolog_preempt_disable() that will test if the data->disable is already set and if not, increment it and disable preemption. If it was set, it returns false and the caller exits. The caller of this function must decrement the disable counter, but misses doing so if the calltime fails to be acquired. Instead of exiting out when calltime is NULL, change the logic to do the work if it is not NULL and still do the clean up at the end of the function if it is NULL. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20251008114835.027b878a@gandalf.local.home Fixes: a485ea9e3ef3 ("tracing: Fix irqsoff and wakeup latency tracers when using function graph") Reported-by: Sasha Levin <sashal@kernel.org> Closes: https://lore.kernel.org/linux-trace-kernel/20251006175848.1906912-1-sashal@kernel.org/ Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-10-08tracing/osnoise: Replace kmalloc + copy_from_user with memdup_user_nulThorsten Blum1-7/+4
Replace kmalloc() followed by copy_from_user() with memdup_user_nul() to simplify and improve osnoise_cpus_write(). Remove the manual NUL-termination. No functional changes intended. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20251001130907.364673-2-thorsten.blum@linux.dev Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-10-07bpf: Cleanup unused func args in rqspinlock implementationSiddharth Chintamaneni1-11/+8
cleanup unused function args in check_deadlock* functions. Fixes: 31158ad02ddb ("rqspinlock: Add deadlock detection and recovery") Signed-off-by: Siddharth Chintamaneni <sidchintamaneni@gmail.com> Reviewed-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20251001172702.122838-1-sidchintamaneni@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-10-07kho: add support for preserving vmalloc allocationsMike Rapoport (Microsoft)1-0/+281
A vmalloc allocation is preserved using binary structure similar to global KHO memory tracker. It's a linked list of pages where each page is an array of physical address of pages in vmalloc area. kho_preserve_vmalloc() hands out the physical address of the head page to the caller. This address is used as the argument to kho_vmalloc_restore() to restore the mapping in the vmalloc address space and populate it with the preserved pages. [pasha.tatashin@soleen.com: free chunks using free_page() not kfree()] Link: https://lkml.kernel.org/r/mafs0a52idbeg.fsf@kernel.org [akpm@linux-foundation.org: coding-style cleanups] Link: https://lkml.kernel.org/r/20250921054458.4043761-4-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Baoquan He <bhe@redhat.com> Cc: Changyuan Lyu <changyuanl@google.com> Cc: Chris Li <chrisl@kernel.org> Cc: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-10-07kho: replace kho_preserve_phys() with kho_preserve_pages()Mike Rapoport (Microsoft)1-14/+11
to make it clear that KHO operates on pages rather than on a random physical address. The kho_preserve_pages() will be also used in upcoming support for vmalloc preservation. Link: https://lkml.kernel.org/r/20250921054458.4043761-3-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Alexander Graf <graf@amazon.com> Cc: Baoquan He <bhe@redhat.com> Cc: Changyuan Lyu <changyuanl@google.com> Cc: Chris Li <chrisl@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-10-07kho: check if kho is finalized in __kho_preserve_order()Mike Rapoport (Microsoft)1-29/+26
Patch series "kho: add support for preserving vmalloc allocations", v5. Following the discussion about preservation of memfd with LUO [1] these patches add support for preserving vmalloc allocations. Any KHO uses case presumes that there's a data structure that lists physical addresses of preserved folios (and potentially some additional metadata). Allowing vmalloc preservations with KHO allows scalable preservation of such data structures. For instance, instead of allocating array describing preserved folios in the fdt, memfd preservation can use vmalloc: preserved_folios = vmalloc_array(nr_folios, sizeof(*preserved_folios)); memfd_luo_preserve_folios(preserved_folios, folios, nr_folios); kho_preserve_vmalloc(preserved_folios, &folios_info); This patch (of 4): Instead of checking if kho is finalized in each caller of __kho_preserve_order(), do it in the core function itself. Link: https://lkml.kernel.org/r/20250921054458.4043761-1-rppt@kernel.org Link: https://lkml.kernel.org/r/20250921054458.4043761-2-rppt@kernel.org Link: https://lore.kernel.org/all/20250807014442.3829950-30-pasha.tatashin@soleen.com [1] Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Baoquan He <bhe@redhat.com> Cc: Changyuan Lyu <changyuanl@google.com> Cc: Chris Li <chrisl@kernel.org> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-10-07Merge tag 'hyperv-next-signed-20251006' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linuxLinus Torvalds3-13/+10
Pull hyperv updates from Wei Liu: - Unify guest entry code for KVM and MSHV (Sean Christopherson) - Switch Hyper-V MSI domain to use msi_create_parent_irq_domain() (Nam Cao) - Add CONFIG_HYPERV_VMBUS and limit the semantics of CONFIG_HYPERV (Mukesh Rathor) - Add kexec/kdump support on Azure CVMs (Vitaly Kuznetsov) - Deprecate hyperv_fb in favor of Hyper-V DRM driver (Prasanna Kumar T S M) - Miscellaneous enhancements, fixes and cleanups (Abhishek Tiwari, Alok Tiwari, Nuno Das Neves, Wei Liu, Roman Kisel, Michael Kelley) * tag 'hyperv-next-signed-20251006' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: hyperv: Remove the spurious null directive line MAINTAINERS: Mark hyperv_fb driver Obsolete fbdev/hyperv_fb: deprecate this in favor of Hyper-V DRM driver Drivers: hv: Make CONFIG_HYPERV bool Drivers: hv: Add CONFIG_HYPERV_VMBUS option Drivers: hv: vmbus: Fix typos in vmbus_drv.c Drivers: hv: vmbus: Fix sysfs output format for ring buffer index Drivers: hv: vmbus: Clean up sscanf format specifier in target_cpu_store() x86/hyperv: Switch to msi_create_parent_irq_domain() mshv: Use common "entry virt" APIs to do work in root before running guest entry: Rename "kvm" entry code assets to "virt" to genericize APIs entry/kvm: KVM: Move KVM details related to signal/-EINTR into KVM proper mshv: Handle NEED_RESCHED_LAZY before transferring to guest x86/hyperv: Add kexec/kdump support on Azure CVMs Drivers: hv: Simplify data structures for VMBus channel close message Drivers: hv: util: Cosmetic changes for hv_utils_transport.c mshv: Add support for a new parent partition configuration clocksource: hyper-v: Skip unnecessary checks for the root partition hyperv: Add missing field to hv_output_map_device_interrupt
2025-10-05Merge tag 'trace-v6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-traceLinus Torvalds8-20/+26
Pull tracing updates from Steven Rostedt: - Use READ_ONCE() and WRITE_ONCE() instead of RCU for syscall tracepoints Individual system call trace events are pseudo events attached to the raw_syscall trace events that just trace the entry and exit of all system calls. When any of these individual system call trace events get enabled, an element in an array indexed by the system call number is assigned to the trace file that defines how to trace it. When the trace event triggers, it reads this array and if the array has an element, it uses that trace file to know what to write it (the trace file defines the output format of the corresponding system call). The issue is that it uses rcu_dereference_ptr() and marks the elements of the array as using RCU. This is incorrect. There is no RCU synchronization here. The event file that is pointed to has a completely different way to make sure its freed properly. The reading of the array during the system call trace event is only to know if there is a value or not. If not, it does nothing (it means this system call isn't being traced). If it does, it uses the information to store the system call data. The RCU usage here can simply be replaced by READ_ONCE() and WRITE_ONCE() macros. - Have the system call trace events use "0x" for hex values Some system call trace events display hex values but do not have "0x" in front of it. Seeing "count: 44" can be assumed that it is 44 decimal when in actuality it is 44 hex (68 decimal). Display "0x44" instead. - Use vmalloc_array() in tracing_map_sort_entries() The function tracing_map_sort_entries() used array_size() and vmalloc() when it could have simply used vmalloc_array(). - Use for_each_online_cpu() in trace_osnoise.c() Instead of open coding for_each_cpu(cpu, cpu_online_mask), use for_each_online_cpu(). - Move the buffer field in struct trace_seq to the end The buffer field in struct trace_seq is architecture dependent in size, and caused padding for the fields after it. By moving the buffer to the end of the structure, it compacts the trace_seq structure better. - Remove redundant zeroing of cmdline_idx field in saved_cmdlines_buffer() The structure that contains cmdline_idx is zeroed by memset(), no need to explicitly zero any of its fields after that. - Use system_percpu_wq instead of system_wq in user_event_mm_remove() As system_wq is being deprecated, use the new wq. - Add cond_resched() is ftrace_module_enable() Some modules have a lot of functions (thousands of them), and the enabling of those functions can take some time. On non preemtable kernels, it was triggering a watchdog timeout. Add a cond_resched() to prevent that. - Add a BUILD_BUG_ON() to make sure PID_MAX_DEFAULT is always a power of 2 There's code that depends on PID_MAX_DEFAULT being a power of 2 or it will break. If in the future that changes, make sure the build fails to ensure that the code is fixed that depends on this. - Grab mutex_lock() before ever exiting s_start() The s_start() function is a seq_file start routine. As s_stop() is always called even if s_start() fails, and s_stop() expects the event_mutex to be held as it will always release it. That mutex must always be taken in s_start() even if that function fails. * tag 'trace-v6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Fix lock imbalance in s_start() memory allocation failure path tracing: Ensure optimized hashing works ftrace: Fix softlockup in ftrace_module_enable tracing: replace use of system_wq with system_percpu_wq tracing: Remove redundant 0 value initialization tracing: Move buffer in trace_seq to end of struct tracing/osnoise: Use for_each_online_cpu() instead of for_each_cpu() tracing: Use vmalloc_array() to improve code tracing: Have syscall trace events show "0x" for values greater than 10 tracing: Replace syscall RCU pointer assignment with READ/WRITE_ONCE()
2025-10-05Merge tag 'probes-fixes-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-traceLinus Torvalds4-14/+28
Pull probe fix from Masami Hiramatsu: - Fix race condition in kprobe initialization causing NULL pointer dereference. This happens on weak memory model, which does not correctly manage the flags access with appropriate memory barriers. Use RELEASE-ACQUIRE to fix it. * tag 'probes-fixes-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Fix race condition in kprobe initialization causing NULL pointer dereference
2025-10-04Merge tag 'v6.18-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6Linus Torvalds1-5/+10
Pull crypto updates from Herbert Xu: "Drivers: - Add ciphertext hiding support to ccp - Add hashjoin, gather and UDMA data move features to hisilicon - Add lz4 and lz77_only to hisilicon - Add xilinx hwrng driver - Add ti driver with ecb/cbc aes support - Add ring buffer idle and command queue telemetry for GEN6 in qat Others: - Use rcu_dereference_all to stop false alarms in rhashtable - Fix CPU number wraparound in padata" * tag 'v6.18-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (78 commits) dt-bindings: rng: hisi-rng: convert to DT schema crypto: doc - Add explicit title heading to API docs hwrng: ks-sa - fix division by zero in ks_sa_rng_init KEYS: X.509: Fix Basic Constraints CA flag parsing crypto: anubis - simplify return statement in anubis_mod_init crypto: hisilicon/qm - set NULL to qm->debug.qm_diff_regs crypto: hisilicon/qm - clear all VF configurations in the hardware crypto: hisilicon - enable error reporting again crypto: hisilicon/qm - mask axi error before memory init crypto: hisilicon/qm - invalidate queues in use crypto: qat - Return pointer directly in adf_ctl_alloc_resources crypto: aspeed - Fix dma_unmap_sg() direction rhashtable: Use rcu_dereference_all and rcu_dereference_all_check crypto: comp - Use same definition of context alloc and free ops crypto: omap - convert from tasklet to BH workqueue crypto: qat - Replace kzalloc() + copy_from_user() with memdup_user() crypto: caam - double the entropy delay interval for retry padata: WQ_PERCPU added to alloc_workqueue users padata: replace use of system_unbound_wq with system_dfl_wq crypto: cryptd - WQ_PERCPU added to alloc_workqueue users ...
2025-10-04Merge tag 'rcu.2025.09.26a' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linuxLinus Torvalds9-22/+49
Pull RCU updates from Paul McKenney: "Documentation updates: - Update whatisRCU.rst and checklist.rst for recent RCU API additions - Fix RCU documentation formatting and typos - Replace dead Ottawa Linux Symposium links in RTFP.txt Miscellaneous RCU updates: - Document that rcu_barrier() hurries RCU_LAZY callbacks - Remove redundant interrupt disabling from rcu_preempt_deferred_qs_handler() - Move list_for_each_rcu from list.h to rculist.h, and adjust the include directive in kernel/cgroup/dmem.c accordingly - Make initial set of changes to accommodate upcoming system_percpu_wq changes SRCU updates: - Create an srcu_read_lock_fast_notrace() for eventual use in tracing, including adding guards - Document the reliance on per-CPU operations as implicit RCU readers in __srcu_read_{,un}lock_fast() - Document the srcu_flip() function's memory-barrier D's relationship to SRCU-fast readers - Remove a redundant preempt_disable() and preempt_enable() pair from srcu_gp_start_if_needed() Torture-test updates: - Fix jitter.sh spin time so that it actually varies as advertised. It is still quite coarse-grained, but at least it does now vary - Update torture.sh help text to include the not-so-new --do-normal parameter, which permits (for example) testing KCSAN kernels without doing non-debug kernels - Fix a number of false-positive diagnostics that were being triggered by rcutorture starting before boot completed. Running multiple near-CPU-bound rcutorture processes when there is only the boot CPU is after all a bit excessive - Substitute kcalloc() for kzalloc() - Remove a redundant kfree() and NULL out kfree()ed objects" * tag 'rcu.2025.09.26a' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux: (31 commits) rcu: WQ_UNBOUND added to sync_wq workqueue rcu: WQ_PERCPU added to alloc_workqueue users rcu: replace use of system_wq with system_percpu_wq refperf: Set reader_tasks to NULL after kfree() refperf: Remove redundant kfree() after torture_stop_kthread() srcu/tiny: Remove preempt_disable/enable() in srcu_gp_start_if_needed() srcu: Document srcu_flip() memory-barrier D relation to SRCU-fast srcu: Document __srcu_read_{,un}lock_fast() implicit RCU readers rculist: move list_for_each_rcu() to where it belongs refscale: Use kcalloc() instead of kzalloc() rcutorture: Use kcalloc() instead of kzalloc() docs: rcu: Replace multiple dead OLS links in RTFP.txt doc: Fix typo in RCU's torture.rst documentation Documentation: RCU: Retitle toctree index Documentation: RCU: Reduce toctree depth Documentation: RCU: Wrap kvm-remote.sh rerun snippet in literal code block rcu: docs: Requirements.rst: Abide by conventions of kernel documentation doc: Add RCU guards to checklist.rst doc: Update whatisRCU.rst for recent RCU API additions rcutorture: Delay forward-progress testing until boot completes ...
2025-10-04Merge tag 'printk-for-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linuxLinus Torvalds4-14/+366
Pull printk updates from Petr Mladek: - Add KUnit test for the printk ring buffer - Fix the check of the maximal record size which is allowed to be stored into the printk ring buffer. It prevents corruptions of the ring buffer. Note that printk() is on the safe side. The messages are limited by 1kB buffer and are always small enough for the minimal log buffer size 4kB, see CONFIG_LOG_BUF_SHIFT definition. * tag 'printk-for-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: printk: ringbuffer: Fix data block max size check printk: kunit: support offstack cpumask printk: kunit: Fix __counted_by() in struct prbtest_rbdata printk: ringbuffer: Explain why the KUnit test ignores failed writes printk: ringbuffer: Add KUnit test
2025-10-04Merge tag 'kgdb-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linuxLinus Torvalds6-36/+60
Pull kgdb updates from Daniel Thompson: "A collection of small cleanups this cycle. Thorsten Blum has replaced a number strcpy() calls with safer alternatives (fixing a pointer aliasing bug in the process). Colin Ian King has simplified things by removing some unreachable code" * tag 'kgdb-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux: kdb: remove redundant check for scancode 0xe0 kdb: Replace deprecated strcpy() with helper function in kdb_defcmd() kdb: Replace deprecated strcpy() with memcpy() in parse_grep() kdb: Replace deprecated strcpy() with memmove() in vkdb_printf() kdb: Replace deprecated strcpy() with memcpy() in kdb_strdup() kernel: debug: gdbstub: Replace deprecated strcpy() with strscpy()
2025-10-03Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfLinus Torvalds1-3/+4
Pull bpf fixes from Alexei Starovoitov: - Fix selftests/bpf (typo, conflicts) and unbreak BPF CI (Jiri Olsa) - Remove linux/unaligned.h dependency for libbpf_sha256 (Andrii Nakryiko) and add a test (Eric Biggers) - Reject negative offsets for ALU operations in the verifier (Yazhou Tang) and add a test (Eduard Zingerman) - Skip scalar adjustment for BPF_NEG operation if destination register is a pointer (Brahmajit Das) and add a test (KaFai Wan) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: libbpf: Fix missing #pragma in libbpf_utils.c selftests/bpf: Add tests for rejection of ALU ops with negative offsets selftests/bpf: Add test for libbpf_sha256() bpf: Reject negative offsets for ALU ops libbpf: remove linux/unaligned.h dependency for libbpf_sha256() libbpf: move libbpf_sha256() implementation into libbpf_utils.c libbpf: move libbpf_errstr() into libbpf_utils.c libbpf: remove unused libbpf_strerror_r and STRERR_BUFSIZE libbpf: make libbpf_errno.c into more generic libbpf_utils.c selftests/bpf: Add test for BPF_NEG alu on CONST_PTR_TO_MAP bpf: Skip scalar adjustment for BPF_NEG if dst is a pointer selftests/bpf: Fix realloc size in bpf_get_addrs selftests/bpf: Fix typo in subtest_basic_usdt after merge conflict selftests/bpf: Fix open-coded gettid syscall in uprobe syscall tests
2025-10-03Merge tag 'dma-mapping-6.18-2025-09-30' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linuxLinus Torvalds7-198/+149
Pull dma-mapping updates from Marek Szyprowski: - Refactoring of DMA mapping API to physical addresses as the primary interface instead of page+offset parameters This gets much closer to Matthew Wilcox's long term wish for struct-pageless IO to cacheable DRAM and is supporting memdesc project which seeks to substantially transform how struct page works. An advantage of this approach is the possibility of introducing DMA_ATTR_MMIO, which covers existing 'dma_map_resource' flow in the common paths, what in turn lets to use recently introduced dma_iova_link() API to map PCI P2P MMIO without creating struct page Developped by Leon Romanovsky and Jason Gunthorpe - Minor clean-up by Petr Tesarik and Qianfeng Rong * tag 'dma-mapping-6.18-2025-09-30' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux: kmsan: fix missed kmsan_handle_dma() signature conversion mm/hmm: properly take MMIO path mm/hmm: migrate to physical address-based DMA mapping API dma-mapping: export new dma_*map_phys() interface xen: swiotlb: Open code map_resource callback dma-mapping: implement DMA_ATTR_MMIO for dma_(un)map_page_attrs() kmsan: convert kmsan_handle_dma to use physical addresses dma-mapping: convert dma_direct_*map_page to be phys_addr_t based iommu/dma: implement DMA_ATTR_MMIO for iommu_dma_(un)map_phys() iommu/dma: rename iommu_dma_*map_page to iommu_dma_*map_phys dma-mapping: rename trace_dma_*map_page to trace_dma_*map_phys dma-debug: refactor to use physical addresses for page mapping iommu/dma: implement DMA_ATTR_MMIO for dma_iova_link(). dma-mapping: introduce new DMA attribute to indicate MMIO memory swiotlb: Remove redundant __GFP_NOWARN dma-direct: clean up the logic in __dma_direct_alloc_pages()
2025-10-03Merge tag 'pull-f_path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds2-61/+37
Pull file->f_path constification from Al Viro: "Only one thing was modifying ->f_path of an opened file - acct(2). Massaging that away and constifying a bunch of struct path * arguments in functions that might be given &file->f_path ends up with the situation where we can turn ->f_path into an anon union of const struct path f_path and struct path __f_path, the latter modified only in a few places in fs/{file_table,open,namei}.c, all for struct file instances that are yet to be opened" * tag 'pull-f_path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (23 commits) Have cc(1) catch attempts to modify ->f_path kernel/acct.c: saner struct file treatment configfs:get_target() - release path as soon as we grab configfs_item reference apparmor/af_unix: constify struct path * arguments ovl_is_real_file: constify realpath argument ovl_sync_file(): constify path argument ovl_lower_dir(): constify path argument ovl_get_verity_digest(): constify path argument ovl_validate_verity(): constify {meta,data}path arguments ovl_ensure_verity_loaded(): constify datapath argument ksmbd_vfs_set_init_posix_acl(): constify path argument ksmbd_vfs_inherit_posix_acl(): constify path argument ksmbd_vfs_kern_path_unlock(): constify path argument ksmbd_vfs_path_lookup_locked(): root_share_path can be const struct path * check_export(): constify path argument export_operations->open(): constify path argument rqst_exp_get_by_name(): constify path argument nfs: constify path argument of __vfs_getattr() bpf...d_path(): constify path argument done_path_create(): constify path argument ...
2025-10-03Merge tag 'pull-fs_context' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds1-2/+1
Pull fs_context updates from Al Viro: "Change vfs_parse_fs_string() calling conventions Get rid of the length argument (almost all callers pass strlen() of the string argument there), add vfs_parse_fs_qstr() for the cases that do want separate length" * tag 'pull-fs_context' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: do_nfs4_mount(): switch to vfs_parse_fs_string() change the calling conventions for vfs_parse_fs_string()
2025-10-03Merge tag 'pull-mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds1-6/+6
Pull vfs mount updates from Al Viro: "Several piles this cycle, this mount-related one being the largest and trickiest: - saner handling of guards in fs/namespace.c, getting rid of needlessly strong locking in some of the users - lock_mount() calling conventions change - have it set the environment for attaching to given location, storing the results in caller-supplied object, without altering the passed struct path. Make unlock_mount() called as __cleanup for those objects. It's not exactly guard(), but similar to it - MNT_WRITE_HOLD done right. mnt_hold_writers() does *not* mess with ->mnt_flags anymore, so insertion of a new mount into ->s_mounts of underlying superblock does not, in itself, expose ->mnt_flags of that mount to concurrent modifications - getting rid of pathological cases when umount() spends quadratic time removing the victims from propagation graph - part of that had been dealt with last cycle, this should finish it - a bunch of stuff constified - assorted cleanups * tag 'pull-mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (64 commits) constify {__,}mnt_is_readonly() WRITE_HOLD machinery: no need for to bump mount_lock seqcount struct mount: relocate MNT_WRITE_HOLD bit preparations to taking MNT_WRITE_HOLD out of ->mnt_flags setup_mnt(): primitive for connecting a mount to filesystem simplify the callers of mnt_unhold_writers() copy_mnt_ns(): use guards copy_mnt_ns(): use the regular mechanism for freeing empty mnt_ns on failure open_detached_copy(): separate creation of namespace into helper open_detached_copy(): don't bother with mount_lock_hash() path_has_submounts(): use guard(mount_locked_reader) fs/namespace.c: sanitize descriptions for {__,}lookup_mnt() ecryptfs: get rid of pointless mount references in ecryptfs dentries umount_tree(): take all victims out of propagation graph at once do_mount(): use __free(path_put) do_move_mount_old(): use __free(path_put) constify can_move_mount_beneath() arguments path_umount(): constify struct path argument may_copy_tree(), __do_loopback(): constify struct path argument path_mount(): constify struct path argument ...
2025-10-03tracing: Fix lock imbalance in s_start() memory allocation failure pathSasha Levin1-2/+1
When s_start() fails to allocate memory for set_event_iter, it returns NULL before acquiring event_mutex. However, the corresponding s_stop() function always tries to unlock the mutex, causing a lock imbalance warning: WARNING: bad unlock balance detected! 6.17.0-rc7-00175-g2b2e0c04f78c #7 Not tainted ------------------------------------- syz.0.85611/376514 is trying to release lock (event_mutex) at: [<ffffffff8dafc7a4>] traverse.part.0.constprop.0+0x2c4/0x650 fs/seq_file.c:131 but there are no more locks to release! The issue was introduced by commit b355247df104 ("tracing: Cache ':mod:' events for modules not loaded yet") which added the kzalloc() allocation before the mutex lock, creating a path where s_start() could return without locking the mutex while s_stop() would still try to unlock it. Fix this by unconditionally acquiring the mutex immediately after allocation, regardless of whether the allocation succeeded. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/20250929113238.3722055-1-sashal@kernel.org Fixes: b355247df104 ("tracing: Cache ":mod:" events for modules not loaded yet") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-10-03cgroup: Fix seqcount lockdep assertion in cgroup freezerNirbhay Sharma1-1/+1
The commit afa3701c0e45 ("cgroup: cgroup.stat.local time accounting") introduced a seqcount to track freeze timing but initialized it as a plain seqcount_t using seqcount_init(). However, the write-side critical section in cgroup_do_freeze() holds the css_set_lock spinlock while calling write_seqcount_begin(). On PREEMPT_RT kernels, spinlocks do not disable preemption, causing the lockdep assertion for a plain seqcount_t, which checks for preemption being disabled, to fail. This triggers the following warning: WARNING: CPU: 0 PID: 9692 at include/linux/seqlock.h:221 Fix this by changing the type to seqcount_spinlock_t and initializing it with seqcount_spinlock_init() to associate css_set_lock with the seqcount. This allows lockdep to correctly validate that the spinlock is held during write operations, resolving the assertion failure on all kernel configurations. Reported-by: syzbot+27a2519eb4dad86d0156@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=27a2519eb4dad86d0156 Fixes: afa3701c0e45 ("cgroup: cgroup.stat.local time accounting") Signed-off-by: Nirbhay Sharma <nirbhay.lkd@gmail.com> Link: https://lore.kernel.org/r/20251002165510.KtY3IT--@linutronix.de/ Acked-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-02Merge tag 'mm-nonmm-stable-2025-10-02-15-29' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mmLinus Torvalds19-138/+625
Pull non-MM updates from Andrew Morton: - "ida: Remove the ida_simple_xxx() API" from Christophe Jaillet completes the removal of this legacy IDR API - "panic: introduce panic status function family" from Jinchao Wang provides a number of cleanups to the panic code and its various helpers, which were rather ad-hoc and scattered all over the place - "tools/delaytop: implement real-time keyboard interaction support" from Fan Yu adds a few nice user-facing usability changes to the delaytop monitoring tool - "efi: Fix EFI boot with kexec handover (KHO)" from Evangelos Petrongonas fixes a panic which was happening with the combination of EFI and KHO - "Squashfs: performance improvement and a sanity check" from Phillip Lougher teaches squashfs's lseek() about SEEK_DATA/SEEK_HOLE. A mere 150x speedup was measured for a well-chosen microbenchmark - plus another 50-odd singleton patches all over the place * tag 'mm-nonmm-stable-2025-10-02-15-29' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (75 commits) Squashfs: reject negative file sizes in squashfs_read_inode() kallsyms: use kmalloc_array() instead of kmalloc() MAINTAINERS: update Sibi Sankar's email address Squashfs: add SEEK_DATA/SEEK_HOLE support Squashfs: add additional inode sanity checking lib/genalloc: fix device leak in of_gen_pool_get() panic: remove CONFIG_PANIC_ON_OOPS_VALUE ocfs2: fix double free in user_cluster_connect() checkpatch: suppress strscpy warnings for userspace tools cramfs: fix incorrect physical page address calculation kernel: prevent prctl(PR_SET_PDEATHSIG) from racing with parent process exit Squashfs: fix uninit-value in squashfs_get_parent kho: only fill kimage if KHO is finalized ocfs2: avoid extra calls to strlen() after ocfs2_sprintf_system_inode_name() kernel/sys.c: fix the racy usage of task_lock(tsk->group_leader) in sys_prlimit64() paths sched/task.h: fix the wrong comment on task_lock() nesting with tasklist_lock coccinelle: platform_no_drv_owner: handle also built-in drivers coccinelle: of_table: handle SPI device ID tables lib/decompress: use designated initializers for struct compress_format efi: support booting with kexec handover (KHO) ...
2025-10-02Merge tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mmLinus Torvalds10-81/+218
Pull MM updates from Andrew Morton: - "mm, swap: improve cluster scan strategy" from Kairui Song improves performance and reduces the failure rate of swap cluster allocation - "support large align and nid in Rust allocators" from Vitaly Wool permits Rust allocators to set NUMA node and large alignment when perforning slub and vmalloc reallocs - "mm/damon/vaddr: support stat-purpose DAMOS" from Yueyang Pan extend DAMOS_STAT's handling of the DAMON operations sets for virtual address spaces for ops-level DAMOS filters - "execute PROCMAP_QUERY ioctl under per-vma lock" from Suren Baghdasaryan reduces mmap_lock contention during reads of /proc/pid/maps - "mm/mincore: minor clean up for swap cache checking" from Kairui Song performs some cleanup in the swap code - "mm: vm_normal_page*() improvements" from David Hildenbrand provides code cleanup in the pagemap code - "add persistent huge zero folio support" from Pankaj Raghav provides a block layer speedup by optionalls making the huge_zero_pagepersistent, instead of releasing it when its refcount falls to zero - "kho: fixes and cleanups" from Mike Rapoport adds a few touchups to the recently added Kexec Handover feature - "mm: make mm->flags a bitmap and 64-bit on all arches" from Lorenzo Stoakes turns mm_struct.flags into a bitmap. To end the constant struggle with space shortage on 32-bit conflicting with 64-bit's needs - "mm/swapfile.c and swap.h cleanup" from Chris Li cleans up some swap code - "selftests/mm: Fix false positives and skip unsupported tests" from Donet Tom fixes a few things in our selftests code - "prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised" from David Hildenbrand "allows individual processes to opt-out of THP=always into THP=madvise, without affecting other workloads on the system". It's a long story - the [1/N] changelog spells out the considerations - "Add and use memdesc_flags_t" from Matthew Wilcox gets us started on the memdesc project. Please see https://kernelnewbies.org/MatthewWilcox/Memdescs and https://blogs.oracle.com/linux/post/introducing-memdesc - "Tiny optimization for large read operations" from Chi Zhiling improves the efficiency of the pagecache read path - "Better split_huge_page_test result check" from Zi Yan improves our folio splitting selftest code - "test that rmap behaves as expected" from Wei Yang adds some rmap selftests - "remove write_cache_pages()" from Christoph Hellwig removes that function and converts its two remaining callers - "selftests/mm: uffd-stress fixes" from Dev Jain fixes some UFFD selftests issues - "introduce kernel file mapped folios" from Boris Burkov introduces the concept of "kernel file pages". Using these permits btrfs to account its metadata pages to the root cgroup, rather than to the cgroups of random inappropriate tasks - "mm/pageblock: improve readability of some pageblock handling" from Wei Yang provides some readability improvements to the page allocator code - "mm/damon: support ARM32 with LPAE" from SeongJae Park teaches DAMON to understand arm32 highmem - "tools: testing: Use existing atomic.h for vma/maple tests" from Brendan Jackman performs some code cleanups and deduplication under tools/testing/ - "maple_tree: Fix testing for 32bit compiles" from Liam Howlett fixes a couple of 32-bit issues in tools/testing/radix-tree.c - "kasan: unify kasan_enabled() and remove arch-specific implementations" from Sabyrzhan Tasbolatov moves KASAN arch-specific initialization code into a common arch-neutral implementation - "mm: remove zpool" from Johannes Weiner removes zspool - an indirection layer which now only redirects to a single thing (zsmalloc) - "mm: task_stack: Stack handling cleanups" from Pasha Tatashin makes a couple of cleanups in the fork code - "mm: remove nth_page()" from David Hildenbrand makes rather a lot of adjustments at various nth_page() callsites, eventually permitting the removal of that undesirable helper function - "introduce kasan.write_only option in hw-tags" from Yeoreum Yun creates a KASAN read-only mode for ARM, using that architecture's memory tagging feature. It is felt that a read-only mode KASAN is suitable for use in production systems rather than debug-only - "mm: hugetlb: cleanup hugetlb folio allocation" from Kefeng Wang does some tidying in the hugetlb folio allocation code - "mm: establish const-correctness for pointer parameters" from Max Kellermann makes quite a number of the MM API functions more accurate about the constness of their arguments. This was getting in the way of subsystems (in this case CEPH) when they attempt to improving their own const/non-const accuracy - "Cleanup free_pages() misuse" from Vishal Moola fixes a number of code sites which were confused over when to use free_pages() vs __free_pages() - "Add Rust abstraction for Maple Trees" from Alice Ryhl makes the mapletree code accessible to Rust. Required by nouveau and by its forthcoming successor: the new Rust Nova driver - "selftests/mm: split_huge_page_test: split_pte_mapped_thp improvements" from David Hildenbrand adds a fix and some cleanups to the thp selftesting code - "mm, swap: introduce swap table as swap cache (phase I)" from Chris Li and Kairui Song is the first step along the path to implementing "swap tables" - a new approach to swap allocation and state tracking which is expected to yield speed and space improvements. This patchset itself yields a 5-20% performance benefit in some situations - "Some ptdesc cleanups" from Matthew Wilcox utilizes the new memdesc layer to clean up the ptdesc code a little - "Fix va_high_addr_switch.sh test failure" from Chunyu Hu fixes some issues in our 5-level pagetable selftesting code - "Minor fixes for memory allocation profiling" from Suren Baghdasaryan addresses a couple of minor issues in relatively new memory allocation profiling feature - "Small cleanups" from Matthew Wilcox has a few cleanups in preparation for more memdesc work - "mm/damon: add addr_unit for DAMON_LRU_SORT and DAMON_RECLAIM" from Quanmin Yan makes some changes to DAMON in furtherance of supporting arm highmem - "selftests/mm: Add -Wunreachable-code and fix warnings" from Muhammad Anjum adds that compiler check to selftests code and fixes the fallout, by removing dead code - "Improvements to Victim Process Thawing and OOM Reaper Traversal Order" from zhongjinji makes a number of improvements in the OOM killer: mainly thawing a more appropriate group of victim threads so they can release resources - "mm/damon: misc fixups and improvements for 6.18" from SeongJae Park is a bunch of small and unrelated fixups for DAMON - "mm/damon: define and use DAMON initialization check function" from SeongJae Park implement reliability and maintainability improvements to a recently-added bug fix - "mm/damon/stat: expose auto-tuned intervals and non-idle ages" from SeongJae Park provides additional transparency to userspace clients of the DAMON_STAT information - "Expand scope of khugepaged anonymous collapse" from Dev Jain removes some constraints on khubepaged's collapsing of anon VMAs. It also increases the success rate of MADV_COLLAPSE against an anon vma - "mm: do not assume file == vma->vm_file in compat_vma_mmap_prepare()" from Lorenzo Stoakes moves us further towards removal of file_operations.mmap(). This patchset concentrates upon clearing up the treatment of stacked filesystems - "mm: Improve mlock tracking for large folios" from Kiryl Shutsemau provides some fixes and improvements to mlock's tracking of large folios. /proc/meminfo's "Mlocked" field became more accurate - "mm/ksm: Fix incorrect accounting of KSM counters during fork" from Donet Tom fixes several user-visible KSM stats inaccuracies across forks and adds selftest code to verify these counters - "mm_slot: fix the usage of mm_slot_entry" from Wei Yang addresses some potential but presently benign issues in KSM's mm_slot handling * tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (372 commits) mm: swap: check for stable address space before operating on the VMA mm: convert folio_page() back to a macro mm/khugepaged: use start_addr/addr for improved readability hugetlbfs: skip VMAs without shareable locks in hugetlb_vmdelete_list alloc_tag: fix boot failure due to NULL pointer dereference mm: silence data-race in update_hiwater_rss mm/memory-failure: don't select MEMORY_ISOLATION mm/khugepaged: remove definition of struct khugepaged_mm_slot mm/ksm: get mm_slot by mm_slot_entry() when slot is !NULL hugetlb: increase number of reserving hugepages via cmdline selftests/mm: add fork inheritance test for ksm_merging_pages counter mm/ksm: fix incorrect KSM counter handling in mm_struct during fork drivers/base/node: fix double free in register_one_node() mm: remove PMD alignment constraint in execmem_vmalloc() mm/memory_hotplug: fix typo 'esecially' -> 'especially' mm/rmap: improve mlock tracking for large folios mm/filemap: map entire large folio faultaround mm/fault: try to map the entire file folio in finish_fault() mm/rmap: mlock large folios in try_to_unmap_one() mm/rmap: fix a mlock race condition in folio_referenced_one() ...