Age | Commit message (Collapse) | Author | Files | Lines |
|
Pull bpf updates from Alexei Starovoitov:
- Fix and improve BTF deduplication of identical BTF types (Alan
Maguire and Andrii Nakryiko)
- Support up to 12 arguments in BPF trampoline on arm64 (Xu Kuohai and
Alexis Lothoré)
- Support load-acquire and store-release instructions in BPF JIT on
riscv64 (Andrea Parri)
- Fix uninitialized values in BPF_{CORE,PROBE}_READ macros (Anton
Protopopov)
- Streamline allowed helpers across program types (Feng Yang)
- Support atomic update for hashtab of BPF maps (Hou Tao)
- Implement json output for BPF helpers (Ihor Solodrai)
- Several s390 JIT fixes (Ilya Leoshkevich)
- Various sockmap fixes (Jiayuan Chen)
- Support mmap of vmlinux BTF data (Lorenz Bauer)
- Support BPF rbtree traversal and list peeking (Martin KaFai Lau)
- Tests for sockmap/sockhash redirection (Michal Luczaj)
- Introduce kfuncs for memory reads into dynptrs (Mykyta Yatsenko)
- Add support for dma-buf iterators in BPF (T.J. Mercier)
- The verifier support for __bpf_trap() (Yonghong Song)
* tag 'bpf-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (135 commits)
bpf, arm64: Remove unused-but-set function and variable.
selftests/bpf: Add tests with stack ptr register in conditional jmp
bpf: Do not include stack ptr register in precision backtracking bookkeeping
selftests/bpf: enable many-args tests for arm64
bpf, arm64: Support up to 12 function arguments
bpf: Check rcu_read_lock_trace_held() in bpf_map_lookup_percpu_elem()
bpf: Avoid __bpf_prog_ret0_warn when jit fails
bpftool: Add support for custom BTF path in prog load/loadall
selftests/bpf: Add unit tests with __bpf_trap() kfunc
bpf: Warn with __bpf_trap() kfunc maybe due to uninitialized variable
bpf: Remove special_kfunc_set from verifier
selftests/bpf: Add test for open coded dmabuf_iter
selftests/bpf: Add test for dmabuf_iter
bpf: Add open coded dmabuf iterator
bpf: Add dmabuf iterator
dma-buf: Rename debugfs symbols
bpf: Fix error return value in bpf_copy_from_user_dynptr
libbpf: Use mmap to parse vmlinux BTF from sysfs
selftests: bpf: Add a test for mmapable vmlinux BTF
btf: Allow mmap of vmlinux btf
...
|
|
Pull sysctl updates from Joel Granados:
- Move kern_table members out of kernel/sysctl.c
Moved a subset (tracing, panic, signal, stack_tracer and sparc) out
of the kern_table array. The goal is for kern_table to only have
sysctl elements. All this increases modularity by placing the
ctl_tables closer to where they are used while reducing the chances
of merge conflicts in kernel/sysctl.c.
- Fixed sysctl unit test panic by relocating it to selftests
* tag 'sysctl-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl:
sysctl: Close test ctl_headers with a for loop
sysctl: call sysctl tests with a for loop
sysctl: Add 0012 to test the u8 range check
sysctl: move u8 register test to lib/test_sysctl.c
sparc: mv sparc sysctls into their own file under arch/sparc/kernel
stack_tracer: move sysctl registration to kernel/trace/trace_stack.c
tracing: Move trace sysctls into trace.c
signal: Move signal ctl tables into signal.c
panic: Move panic ctl tables into panic.c
|
|
Pull block updates from Jens Axboe:
- ublk updates:
- Add support for updating the size of a ublk instance
- Zero-copy improvements
- Auto-registering of buffers for zero-copy
- Series simplifying and improving GET_DATA and request lookup
- Series adding quiesce support
- Lots of selftests additions
- Various cleanups
- NVMe updates via Christoph:
- add per-node DMA pools and use them for PRP/SGL allocations
(Caleb Sander Mateos, Keith Busch)
- nvme-fcloop refcounting fixes (Daniel Wagner)
- support delayed removal of the multipath node and optionally
support the multipath node for private namespaces (Nilay Shroff)
- support shared CQs in the PCI endpoint target code (Wilfred
Mallawa)
- support admin-queue only authentication (Hannes Reinecke)
- use the crc32c library instead of the crypto API (Eric Biggers)
- misc cleanups (Christoph Hellwig, Marcelo Moreira, Hannes
Reinecke, Leon Romanovsky, Gustavo A. R. Silva)
- MD updates via Yu:
- Fix that normal IO can be starved by sync IO, found by mkfs on
newly created large raid5, with some clean up patches for bdev
inflight counters
- Clean up brd, getting rid of atomic kmaps and bvec poking
- Add loop driver specifically for zoned IO testing
- Eliminate blk-rq-qos calls with a static key, if not enabled
- Improve hctx locking for when a plug has IO for multiple queues
pending
- Remove block layer bouncing support, which in turn means we can
remove the per-node bounce stat as well
- Improve blk-throttle support
- Improve delay support for blk-throttle
- Improve brd discard support
- Unify IO scheduler switching. This should also fix a bunch of lockdep
warnings we've been seeing, after enabling lockdep support for queue
freezing/unfreezeing
- Add support for block write streams via FDP (flexible data placement)
on NVMe
- Add a bunch of block helpers, facilitating the removal of a bunch of
duplicated boilerplate code
- Remove obsolete BLK_MQ pci and virtio Kconfig options
- Add atomic/untorn write support to blktrace
- Various little cleanups and fixes
* tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux: (186 commits)
selftests: ublk: add test for UBLK_F_QUIESCE
ublk: add feature UBLK_F_QUIESCE
selftests: ublk: add test case for UBLK_U_CMD_UPDATE_SIZE
traceevent/block: Add REQ_ATOMIC flag to block trace events
ublk: run auto buf unregisgering in same io_ring_ctx with registering
io_uring: add helper io_uring_cmd_ctx_handle()
ublk: remove io argument from ublk_auto_buf_reg_fallback()
ublk: handle ublk_set_auto_buf_reg() failure correctly in ublk_fetch()
selftests: ublk: add test for covering UBLK_AUTO_BUF_REG_FALLBACK
selftests: ublk: support UBLK_F_AUTO_BUF_REG
ublk: support UBLK_AUTO_BUF_REG_FALLBACK
ublk: register buffer to local io_uring with provided buf index via UBLK_F_AUTO_BUF_REG
ublk: prepare for supporting to register request buffer automatically
ublk: convert to refcount_t
selftests: ublk: make IO & device removal test more stressful
nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk
nvme: introduce multipath_always_on module param
nvme-multipath: introduce delayed removal of the multipath head node
nvme-pci: derive and better document max segments limits
nvme-pci: use struct_size for allocation struct nvme_dev
...
|
|
On error, copy_from_user returns number of bytes not copied to
destination, but current implementation of copy_user_data_sleepable does
not handle that correctly and returns it as error value, which may
confuse user, expecting meaningful negative error value.
Fixes: a498ee7576de ("bpf: Implement dynptr copy kfuncs")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250523181705.261585-1-mykyta.yatsenko5@gmail.com
|
|
Filesystems like XFS can implement atomic write I/O using either
REQ_ATOMIC flag set in the bio or via CoW operation. It will be useful
if we have a flag in trace events to distinguish between the two. This
patch adds char 'U' (Untorn writes) to rwbs field of the trace events
if REQ_ATOMIC flag is set in the bio.
<W/ REQ_ATOMIC>
=================
xfs_io-4238 [009] ..... 4148.126843: block_rq_issue: 259,0 WFSU 16384 () 768 + 32 none,0,0 [xfs_io]
<idle>-0 [009] d.h1. 4148.129864: block_rq_complete: 259,0 WFSU () 768 + 32 none,0,0 [0]
<W/O REQ_ATOMIC>
===============
xfs_io-4237 [010] ..... 4143.325616: block_rq_issue: 259,0 WS 16384 () 768 + 32 none,0,0 [xfs_io]
<idle>-0 [010] d.H1. 4143.329138: block_rq_complete: 259,0 WS () 768 + 32 none,0,0 [0]
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/44317cb2ec4588f6a2c1501a96684e6a1196e8ba.1747921498.git.ritesh.list@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This reverts commit 4a8f635a6054.
Althought get_pid_task() internally already calls rcu_read_lock() and
rcu_read_unlock(), the find_vpid() was not.
The documentation for find_vpid() clearly states:
"Must be called with the tasklist_lock or rcu_read_lock() held."
Add proper rcu_read_lock/unlock() to protect the find_vpid().
Fixes: 4a8f635a6054 ("bpf: remove unnecessary rcu_read_{lock,unlock}() in multi-uprobe attach logic")
Reported-by: Xuewen Yan <xuewen.yan@unisoc.com>
Signed-off-by: Di Shen <di.shen@unisoc.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20250520054943.5002-1-xuewen.yan@unisoc.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Pull tracing fixes from Steven Rostedt:
- Fix sample code that uses trace_array_printk()
The sample code for in kernel use of trace_array (that creates an
instance for use within the kernel) and shows how to use
trace_array_printk() that writes into the created instance, used
trace_printk_init_buffers(). But that function is used to initialize
normal trace_printk() and produces the NOTICE banner which is not
needed for use of trace_array_printk(). The function to initialize
that is trace_array_init_printk() that takes the created trace array
instance as a parameter.
Update the sample code to reflect the proper usage.
- Fix preemption count output for stacktrace event
The tracing buffer shows the preempt count level when an event
executes. Because writing the event itself disables preemption, this
needs to be accounted for when recording. The stacktrace event did
not account for this so the output of the stacktrace event showed
preemption was disabled while the event that triggered the stacktrace
shows preemption is enabled and this leads to confusion. Account for
preemption being disabled for the stacktrace event.
The same happened for stack traces triggered by function tracer.
- Fix persistent ring buffer when trace_pipe is used
The ring buffer swaps the reader page with the next page to read from
the write buffer when trace_pipe is used. If there's only a page of
data in the ring buffer, this swap will cause the "commit" pointer
(last data written) to be on the reader page. If more data is written
to the buffer, it is added to the reader page until it falls off back
into the write buffer.
If the system reboots and the commit pointer is still on the reader
page, even if new data was written, the persistent buffer validator
will miss finding the commit pointer because it only checks the write
buffer and does not check the reader page. This causes the validator
to fail the validation and clear the buffer, where the new data is
lost.
There was a check for this, but it checked the "head pointer", which
was incorrect, because the "head pointer" always stays on the write
buffer and is the next page to swap out for the reader page. Fix the
logic to catch this case and allow the user to still read the data
after reboot.
* tag 'trace-v6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
ring-buffer: Fix persistent buffer when commit page is the reader page
ftrace: Fix preemption accounting for stacktrace filter command
ftrace: Fix preemption accounting for stacktrace trigger command
tracing: samples: Initialize trace_array_printk() with the correct function
|
|
The ring buffer is made up of sub buffers (sometimes called pages as they
are by default PAGE_SIZE). It has the following "pages":
"tail page" - this is the page that the next write will write to
"head page" - this is the page that the reader will swap the reader page with.
"reader page" - This belongs to the reader, where it will swap the head
page from the ring buffer so that the reader does not
race with the writer.
The writer may end up on the "reader page" if the ring buffer hasn't
written more than one page, where the "tail page" and the "head page" are
the same.
The persistent ring buffer has meta data that points to where these pages
exist so on reboot it can re-create the pointers to the cpu_buffer
descriptor. But when the commit page is on the reader page, the logic is
incorrect.
The check to see if the commit page is on the reader page checked if the
head page was the reader page, which would never happen, as the head page
is always in the ring buffer. The correct check would be to test if the
commit page is on the reader page. If that's the case, then it can exit
out early as the commit page is only on the reader page when there's only
one page of data in the buffer. There's no reason to iterate the ring
buffer pages to find the "commit page" as it is already found.
To trigger this bug:
# echo 1 > /sys/kernel/tracing/instances/boot_mapped/events/syscalls/sys_enter_fchownat/enable
# touch /tmp/x
# chown sshd /tmp/x
# reboot
On boot up, the dmesg will have:
Ring buffer meta [0] is from previous boot!
Ring buffer meta [1] is from previous boot!
Ring buffer meta [2] is from previous boot!
Ring buffer meta [3] is from previous boot!
Ring buffer meta [4] commit page not found
Ring buffer meta [5] is from previous boot!
Ring buffer meta [6] is from previous boot!
Ring buffer meta [7] is from previous boot!
Where the buffer on CPU 4 had a "commit page not found" error and that
buffer is cleared and reset causing the output to be empty and the data lost.
When it works correctly, it has:
# cat /sys/kernel/tracing/instances/boot_mapped/trace_pipe
<...>-1137 [004] ..... 998.205323: sys_enter_fchownat: __syscall_nr=0x104 (260) dfd=0xffffff9c (4294967196) filename=(0xffffc90000a0002c) user=0x3e8 (1000) group=0xffffffff (4294967295) flag=0x0 (0
Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250513115032.3e0b97f7@gandalf.local.home
Fixes: 5f3b6e839f3ce ("ring-buffer: Validate boot range memory events")
Reported-by: Tasos Sahanidis <tasos@tasossah.com>
Tested-by: Tasos Sahanidis <tasos@tasossah.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
The preemption count of the stacktrace filter command to trace ksys_read
is consistently incorrect:
$ echo ksys_read:stacktrace > set_ftrace_filter
<...>-453 [004] ...1. 38.308956: <stack trace>
=> ksys_read
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe
The root cause is that the trace framework disables preemption when
invoking the filter command callback in function_trace_probe_call:
preempt_disable_notrace();
probe_ops->func(ip, parent_ip, probe_opsbe->tr, probe_ops, probe->data);
preempt_enable_notrace();
Use tracing_gen_ctx_dec() to account for the preempt_disable_notrace(),
which will output the correct preemption count:
$ echo ksys_read:stacktrace > set_ftrace_filter
<...>-410 [006] ..... 31.420396: <stack trace>
=> ksys_read
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe
Cc: stable@vger.kernel.org
Fixes: 36590c50b2d07 ("tracing: Merge irqflags + preempt counter.")
Link: https://lore.kernel.org/20250512094246.1167956-2-dolinux.peng@gmail.com
Signed-off-by: pengdonglin <dolinux.peng@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
When using the stacktrace trigger command to trace syscalls, the
preemption count was consistently reported as 1 when the system call
event itself had 0 (".").
For example:
root@ubuntu22-vm:/sys/kernel/tracing/events/syscalls/sys_enter_read
$ echo stacktrace > trigger
$ echo 1 > enable
sshd-416 [002] ..... 232.864910: sys_read(fd: a, buf: 556b1f3221d0, count: 8000)
sshd-416 [002] ...1. 232.864913: <stack trace>
=> ftrace_syscall_enter
=> syscall_trace_enter
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe
The root cause is that the trace framework disables preemption in __DO_TRACE before
invoking the trigger callback.
Use the tracing_gen_ctx_dec() that will accommodate for the increase of
the preemption count in __DO_TRACE when calling the callback. The result
is the accurate reporting of:
sshd-410 [004] ..... 210.117660: sys_read(fd: 4, buf: 559b725ba130, count: 40000)
sshd-410 [004] ..... 210.117662: <stack trace>
=> ftrace_syscall_enter
=> syscall_trace_enter
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe
Cc: stable@vger.kernel.org
Fixes: ce33c845b030c ("tracing: Dump stacktrace trigger to the corresponding instance")
Link: https://lore.kernel.org/20250512094246.1167956-1-dolinux.peng@gmail.com
Signed-off-by: pengdonglin <dolinux.peng@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
syzkaller reported an issue:
WARNING: CPU: 3 PID: 5971 at kernel/trace/bpf_trace.c:1861 get_bpf_raw_tp_regs+0xa4/0x100 kernel/trace/bpf_trace.c:1861
Modules linked in:
CPU: 3 UID: 0 PID: 5971 Comm: syz-executor205 Not tainted 6.15.0-rc5-syzkaller-00038-g707df3375124 #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
RIP: 0010:get_bpf_raw_tp_regs+0xa4/0x100 kernel/trace/bpf_trace.c:1861
RSP: 0018:ffffc90003636fa8 EFLAGS: 00010293
RAX: 0000000000000000 RBX: 0000000000000003 RCX: ffffffff81c6bc4c
RDX: ffff888032efc880 RSI: ffffffff81c6bc83 RDI: 0000000000000005
RBP: ffff88806a730860 R08: 0000000000000005 R09: 0000000000000003
R10: 0000000000000004 R11: 0000000000000000 R12: 0000000000000004
R13: 0000000000000001 R14: ffffc90003637008 R15: 0000000000000900
FS: 0000000000000000(0000) GS:ffff8880d6cdf000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f7baee09130 CR3: 0000000029f5a000 CR4: 0000000000352ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
____bpf_get_stack_raw_tp kernel/trace/bpf_trace.c:1934 [inline]
bpf_get_stack_raw_tp+0x24/0x160 kernel/trace/bpf_trace.c:1931
bpf_prog_ec3b2eefa702d8d3+0x43/0x47
bpf_dispatcher_nop_func include/linux/bpf.h:1316 [inline]
__bpf_prog_run include/linux/filter.h:718 [inline]
bpf_prog_run include/linux/filter.h:725 [inline]
__bpf_trace_run kernel/trace/bpf_trace.c:2363 [inline]
bpf_trace_run3+0x23f/0x5a0 kernel/trace/bpf_trace.c:2405
__bpf_trace_mmap_lock_acquire_returned+0xfc/0x140 include/trace/events/mmap_lock.h:47
__traceiter_mmap_lock_acquire_returned+0x79/0xc0 include/trace/events/mmap_lock.h:47
__do_trace_mmap_lock_acquire_returned include/trace/events/mmap_lock.h:47 [inline]
trace_mmap_lock_acquire_returned include/trace/events/mmap_lock.h:47 [inline]
__mmap_lock_do_trace_acquire_returned+0x138/0x1f0 mm/mmap_lock.c:35
__mmap_lock_trace_acquire_returned include/linux/mmap_lock.h:36 [inline]
mmap_read_trylock include/linux/mmap_lock.h:204 [inline]
stack_map_get_build_id_offset+0x535/0x6f0 kernel/bpf/stackmap.c:157
__bpf_get_stack+0x307/0xa10 kernel/bpf/stackmap.c:483
____bpf_get_stack kernel/bpf/stackmap.c:499 [inline]
bpf_get_stack+0x32/0x40 kernel/bpf/stackmap.c:496
____bpf_get_stack_raw_tp kernel/trace/bpf_trace.c:1941 [inline]
bpf_get_stack_raw_tp+0x124/0x160 kernel/trace/bpf_trace.c:1931
bpf_prog_ec3b2eefa702d8d3+0x43/0x47
Tracepoint like trace_mmap_lock_acquire_returned may cause nested call
as the corner case show above, which will be resolved with more general
method in the future. As a result, WARN_ON_ONCE will be triggered. As
Alexei suggested, remove the WARN_ON_ONCE first.
Fixes: 9594dc3c7e71 ("bpf: fix nested bpf tracepoints with per-cpu data")
Reported-by: syzbot+45b0c89a0fc7ae8dbadc@syzkaller.appspotmail.com
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Tao Chen <chen.dylane@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250513042747.757042-1-chen.dylane@linux.dev
Closes: https://lore.kernel.org/bpf/8bc2554d-1052-4922-8832-e0078a033e1d@gmail.com
|
|
Since the shared trace_probe_log variable can be accessed and
modified via probe event create operation of kprobe_events,
uprobe_events, and dynamic_events, it should be protected.
In the dynamic_events, all operations are serialized by
`dyn_event_ops_mutex`. But kprobe_events and uprobe_events
interfaces are not serialized.
To solve this issue, introduces dyn_event_create(), which runs
create() operation under the mutex, for kprobe_events and
uprobe_events. This also uses lockdep to check the mutex is
held when using trace_probe_log* APIs.
Link: https://lore.kernel.org/all/174684868120.551552.3068655787654268804.stgit@devnote2/
Reported-by: Paul Cacheux <paulcacheux@gmail.com>
Closes: https://lore.kernel.org/all/20250510074456.805a16872b591e2971a4d221@kernel.org/
Fixes: ab105a4fb894 ("tracing: Use tracing error_log with probe events")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
|
|
This patch introduces a new set of kfuncs for working with dynptrs in
BPF programs, enabling reading variable-length user or kernel data
into dynptr directly. To enable memory-safety, verifier allows only
constant-sized reads via existing bpf_probe_read_{user|kernel} etc.
kfuncs, dynptr-based kfuncs allow dynamically-sized reads without memory
safety shortcomings.
The following kfuncs are introduced:
* `bpf_probe_read_kernel_dynptr()`: probes kernel-space data into a dynptr
* `bpf_probe_read_user_dynptr()`: probes user-space data into a dynptr
* `bpf_probe_read_kernel_str_dynptr()`: probes kernel-space string into
a dynptr
* `bpf_probe_read_user_str_dynptr()`: probes user-space string into a
dynptr
* `bpf_copy_from_user_dynptr()`: sleepable, copies user-space data into
a dynptr for the current task
* `bpf_copy_from_user_str_dynptr()`: sleepable, copies user-space string
into a dynptr for the current task
* `bpf_copy_from_user_task_dynptr()`: sleepable, copies user-space data
of the task into a dynptr
* `bpf_copy_from_user_task_str_dynptr()`: sleepable, copies user-space
string of the task into a dynptr
The implementation is built on two generic functions:
* __bpf_dynptr_copy
* __bpf_dynptr_copy_str
These functions take function pointers as arguments, enabling the
copying of data from various sources, including both kernel and user
space.
Use __always_inline for generic functions and callbacks to make sure the
compiler doesn't generate indirect calls into callbacks, which is more
expensive, especially on some kernel configurations. Inlining allows
compiler to put direct calls into all the specific callback implementations
(copy_user_data_sleepable, copy_user_data_nofault, and so on).
Reviewed-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Link: https://lore.kernel.org/r/20250512205348.191079-3-mykyta.yatsenko5@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Make sure trace_probe_log_clear is called in the tracing
eprobe code path, matching the trace_probe_log_init call.
Link: https://lore.kernel.org/all/20250504-fix-trace-probe-log-race-v3-1-9e99fec7eddc@gmail.com/
Signed-off-by: Paul Cacheux <paulcacheux@gmail.com>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
|
|
When CONFIG_PROVE_RCU_LIST is enabled, fprobe triggers the following
warning:
WARNING: suspicious RCU usage
kernel/trace/fprobe.c:457 RCU-list traversed in non-reader section!!
other info that might help us debug this:
#1: ffffffff863c4e08 (fprobe_mutex){+.+.}-{4:4}, at: fprobe_module_callback+0x7b/0x8c0
Call Trace:
fprobe_module_callback
notifier_call_chain
blocking_notifier_call_chain
This warning occurs because fprobe_remove_node_in_module() traverses an
RCU list using RCU primitives without holding an RCU read lock. However,
the function is only called from fprobe_module_callback(), which holds
the fprobe_mutex lock that provides sufficient protection for safely
traversing the list.
Fix the warning by specifying the locking design to the
CONFIG_PROVE_RCU_LIST mechanism. Add the lockdep_is_held() argument to
hlist_for_each_entry_rcu() to inform the RCU checker that fprobe_mutex
provides the required protection.
Link: https://lore.kernel.org/all/20250410-fprobe-v1-1-068ef5f41436@debian.org/
Fixes: a3dc2983ca7b90 ("tracing: fprobe: Cleanup fprobe hash when module unloading")
Signed-off-by: Breno Leitao <leitao@debian.org>
Tested-by: Antonio Quartulli <antonio@mandelbit.com>
Tested-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
|
|
Adding support to retrieve ref_ctr_offset for uprobe perf link,
which got somehow omitted from the initial uprobe link info changes.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/bpf/20250509153539.779599-2-jolsa@kernel.org
|
|
if it works under NMI and doesn't use any context-dependent things,
should be fine for any program type. The detailed discussion is in [1].
[1] https://lore.kernel.org/all/CAEf4Bza6gK3dsrTosk6k3oZgtHesNDSrDd8sdeQ-GiS6oJixQg@mail.gmail.com/
Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Signed-off-by: Feng Yang <yangfeng@kylinos.cn>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/bpf/20250506061434.94277-2-yangfeng59949@163.com
|
|
The block layer bounce buffering support is unused now, remove it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250505081138.3435992-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
On some paths in print_event_fields() it takes the trace_event_sem for
read, even though it should always be held when the function is called.
Remove the taking of that mutex and add a lockdep_assert_held_read() to
make sure the trace_event_sem is held when print_event_fields() is called.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250501224128.0b1f0571@batman.local.home
Fixes: 80a76994b2d88 ("tracing: Add "fields" option to show raw trace event fields")
Reported-by: syzbot+441582c1592938fccf09@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/6813ff5e.050a0220.14dd7d.001b.GAE@google.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
The function trace_adjust_address() is used to map addresses of modules
stored in the persistent memory and are also loaded in the current boot to
return the current address for the module.
If there's only one module entry, it will simply use that, otherwise it
performs a bsearch of the entry array to find the modules to offset with.
The issue is if there are no modules in the array. The code does not
account for that and ends up referencing the first element in the array
which does not exist and causes a crash.
If nr_entries is zero, exit out early as if this was a core kernel
address.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250501151909.65910359@gandalf.local.home
Fixes: 35a380ddbc653 ("tracing: Show last module text symbols in the stacktrace")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
The check for a failed memory location is incorrectly checking
the wrong level of pointer indirection by checking !filter_hash
rather than !*filter_hash. Fix this.
Cc: asami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250422221335.89896-1-colin.i.king@gmail.com
Fixes: 0ae6b8ce200d ("ftrace: Fix accounting of subop hashes")
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
syzbot reported this bug:
==================================================================
BUG: KASAN: slab-out-of-bounds in trace_seq_to_buffer kernel/trace/trace.c:1830 [inline]
BUG: KASAN: slab-out-of-bounds in tracing_splice_read_pipe+0x6be/0xdd0 kernel/trace/trace.c:6822
Write of size 4507 at addr ffff888032b6b000 by task syz.2.320/7260
CPU: 1 UID: 0 PID: 7260 Comm: syz.2.320 Not tainted 6.15.0-rc1-syzkaller-00301-g3bde70a2c827 #0 PREEMPT(full)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2025
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:94 [inline]
dump_stack_lvl+0x116/0x1f0 lib/dump_stack.c:120
print_address_description mm/kasan/report.c:408 [inline]
print_report+0xc3/0x670 mm/kasan/report.c:521
kasan_report+0xe0/0x110 mm/kasan/report.c:634
check_region_inline mm/kasan/generic.c:183 [inline]
kasan_check_range+0xef/0x1a0 mm/kasan/generic.c:189
__asan_memcpy+0x3c/0x60 mm/kasan/shadow.c:106
trace_seq_to_buffer kernel/trace/trace.c:1830 [inline]
tracing_splice_read_pipe+0x6be/0xdd0 kernel/trace/trace.c:6822
....
==================================================================
It has been reported that trace_seq_to_buffer() tries to copy more data
than PAGE_SIZE to buf. Therefore, to prevent this, we should use the
smaller of trace_seq_used(&iter->seq) and PAGE_SIZE as an argument.
Link: https://lore.kernel.org/20250422113026.13308-1-aha310510@gmail.com
Reported-by: syzbot+c8cd2d2c412b868263fb@syzkaller.appspotmail.com
Fixes: 3c56819b14b0 ("tracing: splice support for tracing_pipe")
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Jeongjun Park <aha310510@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Many conditional checks in switch-case are redundant
with bpf_base_func_proto and should be removed.
Regarding the permission checks bpf_base_func_proto:
The permission checks in bpf_prog_load (as outlined below)
ensure that the trace has both CAP_BPF and CAP_PERFMON capabilities,
thus enabling the use of corresponding prototypes
in bpf_base_func_proto without adverse effects.
bpf_prog_load
......
bpf_cap = bpf_token_capable(token, CAP_BPF);
......
if (type != BPF_PROG_TYPE_SOCKET_FILTER &&
type != BPF_PROG_TYPE_CGROUP_SKB &&
!bpf_cap)
goto put_token;
......
if (is_perfmon_prog_type(type) && !bpf_token_capable(token, CAP_PERFMON))
goto put_token;
......
Signed-off-by: Feng Yang <yangfeng@kylinos.cn>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/bpf/20250423073151.297103-1-yangfeng59949@163.com
|
|
Cross-merge bpf and other fixes after downstream PRs.
No conflicts.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
The filter string testing uses strncpy_from_kernel/user_nofault() to
retrieve the string to test the filter against. The if() statement was
incorrect as it considered 0 as a fault, when it is only negative that it
faulted.
Running the following commands:
# cd /sys/kernel/tracing
# echo "filename.ustring ~ \"/proc*\"" > events/syscalls/sys_enter_openat/filter
# echo 1 > events/syscalls/sys_enter_openat/enable
# ls /proc/$$/maps
# cat trace
Would produce nothing, but with the fix it will produce something like:
ls-1192 [007] ..... 8169.828333: sys_openat(dfd: ffffffffffffff9c, filename: 7efc18359904, flags: 80000, mode: 0)
Link: https://lore.kernel.org/all/CAEf4BzbVPQ=BjWztmEwBPRKHUwNfKBkS3kce-Rzka6zvbQeVpg@mail.gmail.com/
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250417183003.505835fb@gandalf.local.home
Fixes: 77360f9bbc7e5 ("tracing: Add test for user space strings when filtering on string pointers")
Reported-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Reported-by: Mykyta Yatsenko <mykyta.yatsenko5@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
ftrace_graph_ent.depth is int, but ftrace_graph_ent_entry.depth is
unsigned long. This confuses trace-cmd on 64-bit big-endian systems and
makes it print a huge amount of spaces. Fix this by using unsigned int,
which has a matching size, instead.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Link: https://lore.kernel.org/20250412221847.17310-2-iii@linux.ibm.com
Fixes: ff5c9c576e75 ("ftrace: Add support for function argument to graph tracer")
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
The maximum of the ftrace hash bits is made fls(32) in
register_ftrace_direct(), which seems illogical. So, we fix it by making
the max hash bits FTRACE_HASH_MAX_BITS instead.
Link: https://lore.kernel.org/20250413014444.36724-1-dongml2@chinatelecom.cn
Fixes: d05cb470663a ("ftrace: Fix modification of direct_function hash while in use")
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
The subops processing creates new hashes when adding and removing subops.
There were some places that the old hashes that were replaced were not
freed and this caused some memory leaks.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250417135939.245b128d@gandalf.local.home
Fixes: 0ae6b8ce200d ("ftrace: Fix accounting of subop hashes")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
There's several locations that free a ftrace hash pointer but may be
referenced again. Reset them to EMPTY_HASH so that a u-a-f bug doesn't
happen.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250417110933.20ab718b@gandalf.local.home
Fixes: 0ae6b8ce200d ("ftrace: Fix accounting of subop hashes")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
The reworking to fix and simplify the ftrace_startup_subops() and the
ftrace_shutdown_subops() made it possible for the filter_hash and
notrace_hash variables to be used uninitialized in a way that the compiler
did not catch it.
Initialize both filter_hash and notrace_hash to the EMPTY_HASH as that is
what they should be if they never are used.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250417104017.3aea66c2@gandalf.local.home
Reported-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Fixes: 0ae6b8ce200d ("ftrace: Fix accounting of subop hashes")
Closes: https://lore.kernel.org/all/1db64a42-626d-4b3a-be08-c65e47333ce2@linux.ibm.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Pull tracing fixes from Steven Rostedt:
- Hide get_vm_area() from MMUless builds
The function get_vm_area() is not defined when CONFIG_MMU is not
defined. Hide that function within #ifdef CONFIG_MMU.
- Fix output of synthetic events when they have dynamic strings
The print fmt of the synthetic event's format file use to have "%.*s"
for dynamic size strings even though the user space exported
arguments had only __get_str() macro that provided just a nul
terminated string. This was fixed so that user space could parse this
properly.
But the reason that it had "%.*s" was because internally it provided
the maximum size of the string as one of the arguments. The fix that
replaced "%.*s" with "%s" caused the trace output (when the kernel
reads the event) to write "(efault)" as it would now read the length
of the string as "%s".
As the string provided is always nul terminated, there's no reason
for the internal code to use "%.*s" anyway. Just remove the length
argument to match the "%s" that is now in the format.
- Fix the ftrace subops hash logic of the manager ops hash
The function_graph uses the ftrace subops code. The subops code is a
way to have a single ftrace_ops registered with ftrace to determine
what functions will call the ftrace_ops callback. More than one user
of function graph can register a ftrace_ops with it. The function
graph infrastructure will then add this ftrace_ops as a subops with
the main ftrace_ops it registers with ftrace. This is because the
functions will always call the function graph callback which in turn
calls the subops ftrace_ops callbacks.
The main ftrace_ops must add a callback to all the functions that the
subops want a callback from. When a subops is registered, it will
update the main ftrace_ops hash to include the functions it wants.
This is the logic that was broken.
The ftrace_ops hash has a "filter_hash" and a "notrace_hash" where
all the functions in the filter_hash but not in the notrace_hash are
attached by ftrace. The original logic would have the main ftrace_ops
filter_hash be a union of all the subops filter_hashes and the main
notrace_hash would be a intersect of all the subops filter hashes.
But this was incorrect because the notrace hash depends on the
filter_hash it is associated to and not the union of all
filter_hashes.
Instead, when a subops is added, just include all the functions of
the subops hash that are in its filter_hash but not in its
notrace_hash. The main subops hash should not use its notrace hash,
unless all of its subops hashes have an empty filter_hash (which
means to attach to all functions), and then, and only then, the main
ftrace_ops notrace hash can be the intersect of all the subops
hashes.
This not only fixes the bug, but also simplifies the code.
- Add a selftest to better test the subops filtering
Add a selftest that would catch the bug fixed by the above change.
- Fix extra newline printed in function tracing with retval
The function parameter code changed the output logic slightly and
called print_graph_retval() and also printed a newline. The
print_graph_retval() also prints a newline which caused blank lines
to be printed in the function graph tracer when retval was added.
This caused one of the selftests to fail if retvals were enabled.
Instead remove the new line output from print_graph_retval() and have
the callers always print the new line so that it doesn't have to do
special logic if it calls print_graph_retval() or not.
- Fix out-of-bound memory access in the runtime verifier
When rv_is_container_monitor() is called on the last entry on the
link list it references the next entry, which is the list head and
causes an out-of-bound memory access.
* tag 'trace-v6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
rv: Fix out-of-bound memory access in rv_is_container_monitor()
ftrace: Do not have print_graph_retval() add a newline
tracing/selftest: Add test to better test subops filtering of function graph
ftrace: Fix accounting of subop hashes
ftrace: Properly merge notrace hashes
tracing: Do not add length to print format in synthetic events
tracing: Hide get_vm_area() from MMUless builds
|
|
When rv_is_container_monitor() is called on the last monitor in
rv_monitors_list, KASAN yells:
BUG: KASAN: global-out-of-bounds in rv_is_container_monitor+0x101/0x110
Read of size 8 at addr ffffffff97c7c798 by task setup/221
The buggy address belongs to the variable:
rv_monitors_list+0x18/0x40
This is due to list_next_entry() is called on the last entry in the list.
It wraps around to the first list_head, and the first list_head is not
embedded in struct rv_monitor_def.
Fix it by checking if the monitor is last in the list.
Cc: stable@vger.kernel.org
Cc: Gabriele Monaco <gmonaco@redhat.com>
Fixes: cb85c660fcd4 ("rv: Add option for nested monitors and include sched")
Link: https://lore.kernel.org/e85b5eeb7228bfc23b8d7d4ab5411472c54ae91b.1744355018.git.namcao@linutronix.de
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
The retval and retaddr options for function_graph tracer will add a
comment at the end of a function for both leaf and non leaf functions that
looks like:
__wake_up_common(); /* ret=0x1 */
} /* pick_next_task_fair ret=0x0 */
The function print_graph_retval() adds a newline after the "*/". But if
that's not called, the caller function needs to make sure there's a
newline added.
This is confusing and when the function parameters code was added, it
added a newline even when calling print_graph_retval() as the fact that
the print_graph_retval() function prints a newline isn't obvious.
This caused an extra newline to be printed and that made it fail the
selftests when the retval option was set, as the selftests were not
expecting blank lines being injected into the trace.
Instead of having print_graph_retval() print a newline, just have the
caller always print the newline regardless if it calls print_graph_retval()
or not. This not only fixes this bug, but it also simplifies the code.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250411133015.015ca393@gandalf.local.home
Reported-by: Mark Brown <broonie@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Closes: https://lore.kernel.org/all/ccc40f2b-4b9e-4abd-8daf-d22fce2a86f0@sirena.org.uk/
Fixes: ff5c9c576e754 ("ftrace: Add support for function argument to graph tracer")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
The function graph infrastructure uses ftrace to hook to functions. It has
a single ftrace_ops to manage all the users of function graph. Each
individual user (tracing, bpf, fprobes, etc) has its own ftrace_ops to
track the functions it will have its callback called from. These
ftrace_ops are "subops" to the main ftrace_ops of the function graph
infrastructure.
Each ftrace_ops has a filter_hash and a notrace_hash that is defined as:
Only trace functions that are in the filter_hash but not in the
notrace_hash.
If the filter_hash is empty, it means to trace all functions.
If the notrace_hash is empty, it means do not disable any function.
The function graph main ftrace_ops needs to be a superset containing all
the functions to be traced by all the subops it has. The algorithm to
perform this merge was incorrect.
When the first subops was added to the main ops, it simply made the main
ops a copy of the subops (same filter_hash and notrace_hash).
When a second ops was added, it joined the new subops filter_hash with the
main ops filter_hash as a union of the two sets. The intersect between the
new subops notrace_hash and the main ops notrace_hash was created as the
new notrace_hash of the main ops.
The issue here is that it would then start tracing functions than no
subops were tracing. For example if you had two subops that had:
subops 1:
filter_hash = '*sched*' # trace all functions with "sched" in it
notrace_hash = '*time*' # except do not trace functions with "time"
subops 2:
filter_hash = '*lock*' # trace all functions with "lock" in it
notrace_hash = '*clock*' # except do not trace functions with "clock"
The intersect of '*time*' functions with '*clock*' functions could be the
empty set. That means the main ops will be tracing all functions with
'*time*' and all "*clock*" in it!
Instead, modify the algorithm to be a bit simpler and correct.
First, when adding a new subops, even if it's the first one, do not add
the notrace_hash if the filter_hash is not empty. Instead, just add the
functions that are in the filter_hash of the subops but not in the
notrace_hash of the subops into the main ops filter_hash. There's no
reason to add anything to the main ops notrace_hash.
The notrace_hash of the main ops should only be non empty iff all subops
filter_hashes are empty (meaning to trace all functions) and all subops
notrace_hashes include the same functions.
That is, the main ops notrace_hash is empty if any subops filter_hash is
non empty.
The main ops notrace_hash only has content in it if all subops
filter_hashes are empty, and the content are only functions that intersect
all the subops notrace_hashes. If any subops notrace_hash is empty, then
so is the main ops notrace_hash.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Andy Chiu <andybnac@gmail.com>
Link: https://lore.kernel.org/20250409152720.216356767@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
The global notrace hash should be jointly decided by the intersection of
each subops's notrace hash, but not the filter hash.
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/20250408160258.48563-1-andybnac@gmail.com
Fixes: 5fccc7552ccb ("ftrace: Add subops logic to allow one ops to manage many")
Signed-off-by: Andy Chiu <andybnac@gmail.com>
[ fixed removing of freeing of filter_hash ]
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
The link_create.flags are currently not used for multi-uprobes, so return
-EINVAL if it is set, same as for other attach APIs.
We allow target_fd to have an arbitrary value for multi-uprobe, though,
as there are existing users (libbpf) relying on this.
Fixes: 89ae89f53d20 ("bpf: Add multi uprobe link")
Signed-off-by: Tao Chen <chen.dylane@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250407035752.1108927-2-chen.dylane@linux.dev
|
|
The link_create.flags are currently not used for multi-kprobes, so return
-EINVAL if it is set, same as for other attach APIs.
We allow target_fd, on the other hand, to have an arbitrary value for
multi-kprobe, as there are existing users (libbpf) relying on this.
Fixes: 0dcac2725406 ("bpf: Add multi kprobe link")
Signed-off-by: Tao Chen <chen.dylane@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250407035752.1108927-1-chen.dylane@linux.dev
|
|
The following causes a vsnprintf fault:
# echo 's:wake_lat char[] wakee; u64 delta;' >> /sys/kernel/tracing/dynamic_events
# echo 'hist:keys=pid:ts=common_timestamp.usecs if !(common_flags & 0x18)' > /sys/kernel/tracing/events/sched/sched_waking/trigger
# echo 'hist:keys=next_pid:delta=common_timestamp.usecs-$ts:onmatch(sched.sched_waking).trace(wake_lat,next_comm,$delta)' > /sys/kernel/tracing/events/sched/sched_switch/trigger
Because the synthetic event's "wakee" field is created as a dynamic string
(even though the string copied is not). The print format to print the
dynamic string changed from "%*s" to "%s" because another location
(__set_synth_event_print_fmt()) exported this to user space, and user
space did not need that. But it is still used in print_synth_event(), and
the output looks like:
<idle>-0 [001] d..5. 193.428167: wake_lat: wakee=(efault)sshd-sessiondelta=155
sshd-session-879 [001] d..5. 193.811080: wake_lat: wakee=(efault)kworker/u34:5delta=58
<idle>-0 [002] d..5. 193.811198: wake_lat: wakee=(efault)bashdelta=91
bash-880 [002] d..5. 193.811371: wake_lat: wakee=(efault)kworker/u35:2delta=21
<idle>-0 [001] d..5. 193.811516: wake_lat: wakee=(efault)sshd-sessiondelta=129
sshd-session-879 [001] d..5. 193.967576: wake_lat: wakee=(efault)kworker/u34:5delta=50
The length isn't needed as the string is always nul terminated. Just print
the string and not add the length (which was hard coded to the max string
length anyway).
Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Douglas Raillard <douglas.raillard@arm.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://lore.kernel.org/20250407154139.69955768@gandalf.local.home
Fixes: 4d38328eb442d ("tracing: Fix synth event printk format for str fields");
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Move stack_tracer_enabled into trace_stack_sysctl_table. This is part of
a greater effort to move ctl tables into their respective subsystems
which will reduce the merge conflicts in kernel/sysctl.c.
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
|
|
Move trace ctl tables into their own const array in
kernel/trace/trace.c. The sysctl table register is called with
subsys_initcall placing if after its original place in proc_root_init.
This is part of a greater effort to move ctl tables into their
respective subsystems which will reduce the merge conflicts in
kernel/sysctl.c.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Pull probes fixes from Masami Hiramatsu:
- fprobe: remove fprobe_hlist_node when module unloading
When a fprobe target module is removed, the fprobe_hlist_node should
be removed from the fprobe's hash table to prevent reusing
accidentally if another module is loaded at the same address.
- fprobe: lock module while registering fprobe
The module containing the function to be probeed is locked using a
reference counter until the fprobe registration is complete, which
prevents use after free.
- fprobe-events: fix possible UAF on modules
Basically as same as above, but in the fprobe-events layer we also
need to get module reference counter when we find the tracepoint in
the module.
* tag 'probes-fixes-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: fprobe: Cleanup fprobe hash when module unloading
tracing: fprobe events: Fix possible UAF on modules
tracing: fprobe: Fix to lock module while registering fprobe
|
|
Cleanup fprobe address hash table on module unloading because the
target symbols will be disappeared when unloading module and not
sure the same symbol is mapped on the same address.
Note that this is at least disables the fprobes if a part of target
symbols on the unloaded modules. Unlike kprobes, fprobe does not
re-enable the probe point by itself. To do that, the caller should
take care register/unregister fprobe when loading/unloading modules.
This simplifies the fprobe state managememt related to the module
loading/unloading.
Link: https://lore.kernel.org/all/174343534473.843280.13988101014957210732.stgit@devnote2/
Fixes: 4346ba160409 ("fprobe: Rewrite fprobe on function-graph tracer")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
|
|
The function get_vm_area() is not defined for non-MMU builds and causes a
build error if it is used. Hide the map_pages() function around a:
#ifdef CONFIG_MMU
to keep it from being compiled when CONFIG_MMU is not set.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250407120111.2ccc9319@gandalf.local.home
Reported-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Closes: https://lore.kernel.org/all/4f8ece8b-8862-4f7c-8ede-febd28f8a9fe@roeck-us.net/
Fixes: 394f3f02de531 ("tracing: Use vmap_page_range() to map memmap ring buffer")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Pull ring-buffer updates from Steven Rostedt:
"Persistent buffer cleanups and simplifications.
It was mistaken that the physical memory returned from "reserve_mem"
had to be vmap()'d to get to it from a virtual address. But
reserve_mem already maps the memory to the virtual address of the
kernel so a simple phys_to_virt() can be used to get to the virtual
address from the physical memory returned by "reserve_mem". With this
new found knowledge, the code can be cleaned up and simplified.
- Enforce that the persistent memory is page aligned
As the buffers using the persistent memory are all going to be
mapped via pages, make sure that the memory given to the tracing
infrastructure is page aligned. If it is not, it will print a
warning and fail to map the buffer.
- Use phys_to_virt() to get the virtual address from reserve_mem
Instead of calling vmap() on the physical memory returned from
"reserve_mem", use phys_to_virt() instead.
As the memory returned by "memmap" or any other means where a
physical address is given to the tracing infrastructure, it still
needs to be vmap(). Since this memory can never be returned back to
the buddy allocator nor should it ever be memmory mapped to user
space, flag this buffer and up the ref count. The ref count will
keep it from ever being freed, and the flag will prevent it from
ever being memory mapped to user space.
- Use vmap_page_range() for memmap virtual address mapping
For the memmap buffer, instead of allocating an array of struct
pages, assigning them to the contiguous phsycial memory and then
passing that to vmap(), use vmap_page_range() instead
- Replace flush_dcache_folio() with flush_kernel_vmap_range()
Instead of calling virt_to_folio() and passing that to
flush_dcache_folio(), just call flush_kernel_vmap_range() directly.
This also fixes a bug where if a subbuffer was bigger than
PAGE_SIZE only the PAGE_SIZE portion would be flushed"
* tag 'trace-ringbuffer-v6.15-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
ring-buffer: Use flush_kernel_vmap_range() over flush_dcache_folio()
tracing: Use vmap_page_range() to map memmap ring buffer
tracing: Have reserve_mem use phys_to_virt() and separate from memmap buffer
tracing: Enforce the persistent ring buffer to be page aligned
|
|
Pull tracing fixes from Steven Rostedt:
- Fix build error when CONFIG_PROBE_EVENTS_BTF_ARGS is not enabled
The tracing of arguments in the function tracer depends on some
functions that are only defined when PROBE_EVENTS_BTF_ARGS is
enabled. In fact, PROBE_EVENTS_BTF_ARGS also depends on all the same
configs as the function argument tracing requires. Just have the
function argument tracing depend on PROBE_EVENTS_BTF_ARGS.
- Free module_delta for persistent ring buffer instance
When an instance holds the persistent ring buffer, it allocates a
helper array to hold the deltas between where modules are loaded on
the last boot and the current boot. This array needs to be freed when
the instance is freed.
- Add cond_resched() to loop in ftrace_graph_set_hash()
The hash functions in ftrace loop over every function that can be
enabled by ftrace. This can be 50,000 functions or more. This loop is
known to trigger soft lockup warnings and requires a cond_resched().
The loop in ftrace_graph_set_hash() was missing it.
- Fix the event format verifier to include "%*p.." arguments
To prevent events from dereferencing stale pointers that can happen
if a trace event uses a dereferece pointer to something that was not
copied into the ring buffer and can be freed by the time the trace is
read, a verifier is called. At boot or module load, the verifier
scans the print format string for pointers that can be dereferenced
and it checks the arguments to make sure they do not contain
something that can be freed. The "%*p" was not handled, which would
add another argument and cause the verifier to not only not verify
this pointer, but it will look at the wrong argument for every
pointer after that.
- Fix mcount sorttable building for different endian type target
When modifying the ELF file to sort the mcount_loc table in the
sorttable.c code, the endianess of the file and the host is used to
determine if the bytes need to be swapped when calculations are done.
A change was made to the sorting of the mcount_loc that read the
values from the ELF file into an array and the swap happened on the
filling of the array. But one of the calculations of the array still
did the swap when it did not need to. This caused building on a
little endian machine for a big endian target to not find the mcount
function in the 'nm' table and it zeroed it out, causing there to be
no functions available to trace.
- Add goto out_unlock jump to rv_register_monitor() on error path
One of the error paths in rv_register_monitor() just returned the
error when it should have jumped to the out_unlock label to release
the mutex.
* tag 'trace-v6.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
rv: Fix missing unlock on double nested monitors return path
scripts/sorttable: Fix endianness handling in build-time mcount sort
tracing: Verify event formats that have "%*p.."
ftrace: Add cond_resched() to ftrace_graph_set_hash()
tracing: Free module_delta on freeing of persistent ring buffer
ftrace: Have tracing function args depend on PROBE_EVENTS_BTF_ARGS
|
|
Pull more printk updates from Petr Mladek:
- Silence warnings about candidates for ‘gnu_print’ format attribute
* tag 'printk-for-6.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
vsnprintf: Silence false positive GCC warning for va_format()
vsnprintf: Drop unused const char fmt * in va_format()
vsnprintf: Mark binary printing functions with __printf() attribute
tracing: Mark binary printing functions with __printf() attribute
seq_file: Mark binary printing functions with __printf() attribute
seq_buf: Mark binary printing functions with __printf() attribute
|
|
Some architectures do not have data cache coherency between user and
kernel space. For these architectures, the cache needs to be flushed on
both the kernel and user addresses so that user space can see the updates
the kernel has made.
Instead of using flush_dcache_folio() and playing with virt_to_folio()
within the call to that function, use flush_kernel_vmap_range() which
takes the virtual address and does the work for those architectures that
need it.
This also fixes a bug where the flush of the reader page only flushed one
page. If the sub-buffer order is 1 or more, where the sub-buffer size
would be greater than a page, it would miss the rest of the sub-buffer
content, as the "reader page" is not just a page, but the size of a
sub-buffer.
Link: https://lore.kernel.org/all/CAG48ez3w0my4Rwttbc5tEbNsme6tc0mrSN95thjXUFaJ3aQ6SA@mail.gmail.com/
Cc: stable@vger.kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Link: https://lore.kernel.org/20250402144953.920792197@goodmis.org
Fixes: 117c39200d9d7 ("ring-buffer: Introducing ring-buffer mapping functions");
Suggested-by: Jann Horn <jannh@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
The code to map the physical memory retrieved by memmap currently
allocates an array of pages to cover the physical memory and then calls
vmap() to map it to a virtual address. Instead of using this temporary
array of struct page descriptors, simply use vmap_page_range() that can
directly map the contiguous physical memory to a virtual address.
Link: https://lore.kernel.org/all/CAHk-=whUOfVucfJRt7E0AH+GV41ELmS4wJqxHDnui6Giddfkzw@mail.gmail.com/
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/20250402144953.754618481@goodmis.org
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
The reserve_mem kernel command line option may pass back a physical
address, but the memory is still part of the normal memory just like
using memblock_alloc() would be. This means that the physical memory
returned by the reserve_mem command line option can be converted directly
to virtual memory by simply using phys_to_virt().
When freeing the buffer there's no need to call vunmap() anymore as the
memory allocated by reserve_mem is freed by the call to
reserve_mem_release_by_name().
Because the persistent ring buffer can also be allocated via the memmap
option, which *is* different than normal memory as it cannot be added back
to the buddy system, it must be treated differently. It still needs to be
virtually mapped to have access to it. It also can not be freed nor can it
ever be memory mapped to user space.
Create a new trace_array flag called TRACE_ARRAY_FL_MEMMAP which gets set
if the buffer is created by the memmap option, and this will prevent the
buffer from being memory mapped by user space.
Also increment the ref count for memmap'ed buffers so that they can never
be freed.
Link: https://lore.kernel.org/all/Z-wFszhJ_9o4dc8O@kernel.org/
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/20250402144953.583750106@goodmis.org
Suggested-by: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Enforce that the address and the size of the memory used by the persistent
ring buffer is page aligned. Also update the documentation to reflect this
requirement.
Link: https://lore.kernel.org/all/CAHk-=whUOfVucfJRt7E0AH+GV41ELmS4wJqxHDnui6Giddfkzw@mail.gmail.com/
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/20250402144953.412882844@goodmis.org
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|