2025-03-20  io_uring/net: don't clear REQ_F_NEED_CLEANUP unconditionally  (Jens Axboe, 1 file, -2/+1)

io_req_msg_cleanup() relies on the fact that io_netmsg_recycle() will always fully recycle, but that may not be the case if the msg cache was already full. To ensure that normal cleanup always gets run, let io_netmsg_recycle() deal with clearing the relevant cleanup flags, as it knows exactly when that should be done.

Cc: stable@vger.kernel.org
Reported-by: David Wei <dw@davidwei.uk>
Fixes: 75191341785e ("io_uring/net: add iovec recycling")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

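A minimal sketch of the resulting shape, assuming the cache-put helper and flag names used elsewhere in io_uring; illustrative, not the exact patch:

    /* Sketch: recycle owns the flag clearing, since only it knows
     * whether the cache actually took the entry. */
    static void io_netmsg_recycle(struct io_kiocb *req, unsigned int issue_flags)
    {
    	struct io_async_msghdr *hdr = req->async_data;

    	/* Only drop the cleanup flags if the entry was actually cached;
    	 * otherwise normal cleanup must still run. */
    	if (io_alloc_cache_put(&req->ctx->netmsg_cache, hdr)) {
    		req->async_data = NULL;
    		req->flags &= ~(REQ_F_ASYNC_DATA | REQ_F_NEED_CLEANUP);
    	}
    }
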
2025-03-05  io_uring/rw: ensure reissue path is correctly handled for IOPOLL  (Jens Axboe, 1 file, -4/+3)

The IOPOLL path posts CQEs when the io_kiocb is marked as completed, so it cannot rely on the usual retry that non-IOPOLL requests do for read/write requests. If -EAGAIN is received and the request should be retried, go through the normal completion path and let the normal flush logic catch it and reissue it, like what is done for !IOPOLL reads or writes.

Fixes: d803d123948f ("io_uring/rw: handle -EAGAIN retry at IO completion time")
Reported-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/io-uring/2b43ccfa-644d-4a09-8f8f-39ad71810f41@oracle.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-02-25  io_uring/net: save msg_control for compat  (Pavel Begunkov, 1 file, -1/+3)

Match the compat part of io_sendmsg_copy_hdr() with its counterpart and save msg_control.

Fixes: c55978024d123 ("io_uring/net: move receive multishot out of the generic msghdr path")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2a8418821fe83d3b64350ad2b3c0303e9b732bbd.1740498502.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-02-19  io_uring/rw: clean up mshot forced sync mode  (Pavel Begunkov, 1 file, -9/+3)

Move the code forcing synchronous execution of multishot read requests out of the more generic __io_read().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4ad7b928c776d1ad59addb9fff64ef2d1fc474d5.1739919038.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-02-19  io_uring/rw: move ki_complete init into prep  (Pavel Begunkov, 1 file, -3/+8)

Initialise ki_complete during the request prep stage; the following patch will depend on it not being reset during issue.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/817624086bd5f0448b08c80623399919fda82f34.1739919038.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-02-19  io_uring/rw: don't directly use ki_complete  (Pavel Begunkov, 1 file, -5/+9)

We want to avoid checking ->ki_complete directly in the io_uring completion path. Fortunately we have only two callbacks, and the selection between them depends on the constant ring flags (i.e. IOPOLL), so use that to infer the function.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4eb4bdab8cbcf5bc87083f7047edc81e920ab83c.1739919038.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-02-19  io_uring/rw: forbid multishot async reads  (Pavel Begunkov, 1 file, -2/+11)

At the moment we can't sanely handle queuing an async request from a multishot context, so disable them. It shouldn't matter, as pollable files/sockets don't normally do async I/O. Patching it in __io_read() is not the cleanest way, but it's simpler than the other options, so let's fix it there and clean up on top.

Cc: stable@vger.kernel.org
Reported-by: chase xd <sl1589472800@gmail.com>
Fixes: fc68fcda04910 ("io_uring/rw: add support for IORING_OP_READ_MULTISHOT")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7d51732c125159d17db4fe16f51ec41b936973f8.1739919038.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-02-19  io_uring/rsrc: remove unused constants  (Caleb Sander Mateos, 1 file, -6/+0)

IO_NODE_ALLOC_CACHE_MAX has been unused since commit fbbb8e991d86 ("io_uring/rsrc: get rid of io_rsrc_node allocation cache") removed the rsrc_node_cache. IO_RSRC_TAG_TABLE_SHIFT and IO_RSRC_TAG_TABLE_MASK have been unused since commit 7029acd8a950 ("io_uring/rsrc: get rid of per-ring io_rsrc_node list") removed the separate tag table for registered nodes.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Li Zetao <lizetao1@huawei.com>
Link: https://lore.kernel.org/r/20250219033444.2020136-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-02-18  io_uring: fix spelling error in uapi io_uring.h  (Jens Axboe, 1 file, -1/+1)

This is obviously not that important, but when changes are synced back from the kernel to liburing, the codespell CI ends up erroring because of this misspelling. Let's just correct it and avoid this biting us again on an import.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-02-15  io_uring: prevent opcode speculation  (Pavel Begunkov, 1 file, -0/+2)

sqe->opcode is used to index several tables, so make sure we sanitise it against speculation.

Cc: stable@vger.kernel.org
Fixes: d3656344fea03 ("io_uring: add lookup table for various opcode needs")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Li Zetao <lizetao1@huawei.com>
Link: https://lore.kernel.org/r/7eddbf31c8ca0a3947f8ed98271acc2b4349c016.1739568408.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

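The standard tool for this is array_index_nospec() from <linux/nospec.h>, which clamps the index under speculation before it can be used to index the opcode tables; a sketch of the pattern:

    #include <linux/nospec.h>

    	opcode = READ_ONCE(sqe->opcode);
    	if (opcode >= IORING_OP_LAST)
    		return -EINVAL;
    	/* Keep the CPU from speculatively indexing the opcode tables
    	 * with an out-of-bounds value. */
    	req->opcode = array_index_nospec(opcode, IORING_OP_LAST);
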
2025-02-14  io-wq: backoff when retrying worker creation  (Uday Shankar, 1 file, -5/+18)

When io_uring submission goes async for the first time on a given task, we'll try to create a worker thread to handle the submission. Creating this worker thread can fail due to various transient conditions, such as an outstanding signal in the forking thread, so we have retry logic with a limit of 3 retries. However, this retry logic appears to be too aggressive/fast - we've observed a thread blowing through the retry limit while having the same outstanding signal the whole time. Here's an excerpt of some tracing that demonstrates the issue:

First, signal 26 is generated for the process. It ends up getting routed to thread 92942.

     0) cbd-92284    /* signal_generate: sig=26 errno=0 code=-2 comm=psblkdASD pid=92934 grp=1 res=0 */

This causes create_io_thread in the signalled thread to fail with ERESTARTNOINTR, and thus a retry is queued.

    13) task_th-92942    /* io_uring_queue_async_work: ring 000000007325c9ae, request 0000000080c96d8e, user_data 0x0, opcode URING_CMD, flags 0x8240001, normal queue, work 000000006e96dd3f */
    13) task_th-92942    io_wq_enqueue() {
    13) task_th-92942      _raw_spin_lock();
    13) task_th-92942      io_wq_activate_free_worker();
    13) task_th-92942      _raw_spin_lock();
    13) task_th-92942      create_io_worker() {
    13) task_th-92942        __kmalloc_cache_noprof();
    13) task_th-92942        __init_swait_queue_head();
    13) task_th-92942        kprobe_ftrace_handler() {
    13) task_th-92942          get_kprobe();
    13) task_th-92942          aggr_pre_handler() {
    13) task_th-92942            pre_handler_kretprobe();
    13) task_th-92942            /* create_enter: (create_io_thread+0x0/0x50) fn=0xffffffff8172c0e0 arg=0xffff888996bb69c0 node=-1 */
    13) task_th-92942          } /* aggr_pre_handler */
    ...
    13) task_th-92942      } /* copy_process */
    13) task_th-92942      } /* create_io_thread */
    13) task_th-92942      kretprobe_rethook_handler() {
    13) task_th-92942        /* create_exit: (create_io_worker+0x8a/0x1a0 <- create_io_thread) arg1=0xfffffffffffffdff */
    13) task_th-92942      } /* kretprobe_rethook_handler */
    13) task_th-92942      queue_work_on() {
    ...

The CPU is then handed to a kworker to process the queued retry:

    ------------------------------------------
    13) task_th-92942 => kworker-54154
    ------------------------------------------
    13) kworker-54154    io_workqueue_create() {
    13) kworker-54154      io_queue_worker_create() {
    13) kworker-54154        task_work_add() {
    13) kworker-54154          wake_up_state() {
    13) kworker-54154            try_to_wake_up() {
    13) kworker-54154              _raw_spin_lock_irqsave();
    13) kworker-54154              _raw_spin_unlock_irqrestore();
    13) kworker-54154            } /* try_to_wake_up */
    13) kworker-54154          } /* wake_up_state */
    13) kworker-54154          kick_process();
    13) kworker-54154        } /* task_work_add */
    13) kworker-54154      } /* io_queue_worker_create */
    13) kworker-54154    } /* io_workqueue_create */

And then we immediately switch back to the original task to try creating a worker again. This fails, because the original task still hasn't handled its signal.

    ------------------------------------------
    13) kworker-54154 => task_th-92942
    ------------------------------------------
    13) task_th-92942    create_worker_cont() {
    13) task_th-92942      kprobe_ftrace_handler() {
    13) task_th-92942        get_kprobe();
    13) task_th-92942        aggr_pre_handler() {
    13) task_th-92942          pre_handler_kretprobe();
    13) task_th-92942          /* create_enter: (create_io_thread+0x0/0x50) fn=0xffffffff8172c0e0 arg=0xffff888996bb69c0 node=-1 */
    13) task_th-92942        } /* aggr_pre_handler */
    13) task_th-92942      } /* kprobe_ftrace_handler */
    13) task_th-92942      create_io_thread() {
    13) task_th-92942        copy_process() {
    13) task_th-92942          task_active_pid_ns();
    13) task_th-92942          _raw_spin_lock_irq();
    13) task_th-92942          recalc_sigpending();
    13) task_th-92942          _raw_spin_lock_irq();
    13) task_th-92942        } /* copy_process */
    13) task_th-92942      } /* create_io_thread */
    13) task_th-92942      kretprobe_rethook_handler() {
    13) task_th-92942        /* create_exit: (create_worker_cont+0x35/0x1b0 <- create_io_thread) arg1=0xfffffffffffffdff */
    13) task_th-92942      } /* kretprobe_rethook_handler */
    13) task_th-92942      io_worker_release();
    13) task_th-92942      queue_work_on() {
    13) task_th-92942        clear_pending_if_disabled();
    13) task_th-92942        __queue_work() {
    13) task_th-92942        } /* __queue_work */
    13) task_th-92942      } /* queue_work_on */
    13) task_th-92942    } /* create_worker_cont */

The pattern repeats another couple times until we blow through the retry counter, at which point we give up. All outstanding work is canceled, and the io_uring command which triggered all this is failed with ECANCELED:

    13) task_th-92942    io_acct_cancel_pending_work() {
    ...
    13) task_th-92942    /* io_uring_complete: ring 000000007325c9ae, req 0000000080c96d8e, user_data 0x0, result -125, cflags 0x0 extra1 0 extra2 0 */

Finally, the task gets around to processing its outstanding signal 26, but it's too late.

    13) task_th-92942    /* signal_deliver: sig=26 errno=0 code=-2 sa_handler=59566a0 sa_flags=14000000 */

Try to address this issue by adding a small scaling delay when retrying worker creation. This should give the forking thread time to handle its signal in the above case. This isn't a particularly satisfying solution, as sufficiently paradoxical scheduling would still have us hitting the same issue, and I'm open to suggestions for something better. But this is likely to prevent this (already rare) issue from hitting in practice.

Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Link: https://lore.kernel.org/r/20250208-wq_retry-v2-1-4f6f5041d303@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

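A sketch of the scaling-delay idea, assuming the retry work item is a delayed_work and that init_retries counts failed attempts; the constant is illustrative, not the patch's exact value:

    static void queue_create_worker_retry(struct io_worker *worker)
    {
    	/*
    	 * The failure is most likely transient (e.g. a pending signal
    	 * in the forking task), so back off a little more each time:
    	 * 5ms after the first failure, 10ms after the second, ...
    	 */
    	schedule_delayed_work(&worker->work,
    			      msecs_to_jiffies(worker->init_retries * 5));
    }
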
2025-02-13  io_uring/uring_cmd: unconditionally copy SQEs at prep time  (Jens Axboe, 1 file, -23/+11)

This isn't generally necessary, but conditions have been observed where SQE data is accessed from the original SQE after prep has been done and outside of the initial issue. Opcode prep handlers must ensure that any SQE related data is stable beyond the prep phase, but uring_cmd is a bit special in how it handles the SQE which makes it susceptible to reading stale data. If the application has reused the SQE before the original completes, then that can lead to data corruption.

Down the line we can relax this again once uring_cmd has been sanitized a bit, and avoid unnecessarily copying the SQE.

Fixes: 5eff57fa9f3a ("io_uring/uring_cmd: defer SQE copying until it's needed")
Reported-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Li Zetao <lizetao1@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-02-12  io_uring/waitid: setup async data in the prep handler  (Jens Axboe, 1 file, -7/+7)

This is the idiomatic way that opcodes should set up their async data, so that it's always valid inside ->issue() without ->issue() needing to do that itself.

Fixes: f31ecf671ddc4 ("io_uring: add IORING_OP_WAITID support")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-02-12  io_uring/uring_cmd: remove dead req_has_async_data() check  (Jens Axboe, 1 file, -3/+0)

Any uring_cmd always has async data allocated now, there's no reason to check and clear a cached copy of the SQE.

Fixes: d10f19dff56e ("io_uring/uring_cmd: switch to always allocating async data")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-02-12  io_uring/uring_cmd: switch sqe to async_data on EAGAIN  (Caleb Sander Mateos, 1 file, -9/+14)

5eff57fa9f3a ("io_uring/uring_cmd: defer SQE copying until it's needed") moved the unconditional memcpy() of the uring_cmd SQE to async_data to 2 cases when the request goes async:
- If REQ_F_FORCE_ASYNC is set to force the initial issue to go async
- If ->uring_cmd() returns -EAGAIN in the initial non-blocking issue

Unlike the REQ_F_FORCE_ASYNC case, in the EAGAIN case, io_uring_cmd() copies the SQE to async_data but neglects to update the io_uring_cmd's sqe field to point to async_data. As a result, sqe still points to the slot in the userspace-mapped SQ. At the end of io_submit_sqes(), the kernel advances the SQ head index, allowing userspace to reuse the slot for a new SQE. If userspace reuses the slot before the io_uring worker reissues the original SQE, the io_uring_cmd's SQE will be corrupted.

Introduce a helper io_uring_cmd_cache_sqes() to copy the original SQE to the io_uring_cmd's async_data and point sqe there. Use it for both the REQ_F_FORCE_ASYNC and EAGAIN cases. This ensures the uring_cmd doesn't read from the SQ slot after it has been returned to userspace.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Fixes: 5eff57fa9f3a ("io_uring/uring_cmd: defer SQE copying until it's needed")
Link: https://lore.kernel.org/r/20250212204546.3751645-3-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

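A sketch of what such a helper looks like, following the commit text; uring_sqe_size() accounts for SQE128 rings, and the exact field names are an assumption:

    static void io_uring_cmd_cache_sqes(struct io_kiocb *req)
    {
    	struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
    	struct io_uring_cmd_data *cache = req->async_data;

    	memcpy(cache->sqes, ioucmd->sqe, uring_sqe_size(req->ctx));
    	/* Read from the stable copy, not the userspace-mapped SQ slot. */
    	ioucmd->sqe = cache->sqes;
    }
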
2025-02-12  io_uring/uring_cmd: don't assume io_uring_cmd_data layout  (Caleb Sander Mateos, 1 file, -1/+1)

eaf72f7b414f ("io_uring/uring_cmd: cleanup struct io_uring_cmd_data layout") removed most of the places assuming struct io_uring_cmd_data has sqes as its first field. However, the EAGAIN case in io_uring_cmd() still compares ioucmd->sqe to the struct io_uring_cmd_data pointer using a void * cast. Since fa3595523d72 ("io_uring: get rid of alloc cache init_once handling"), sqes is no longer io_uring_cmd_data's first field. As a result, the pointers will always compare unequal and memcpy() may be called with the same source and destination.

Replace the incorrect void * cast with the address of the sqes field.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Fixes: eaf72f7b414f ("io_uring/uring_cmd: cleanup struct io_uring_cmd_data layout")
Link: https://lore.kernel.org/r/20250212204546.3751645-2-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

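The before/after shape of the comparison, reconstructed from the description above (a sketch, not the verbatim diff):

    struct io_uring_cmd_data *cache = req->async_data;

    /* Broken once sqes is no longer the first field: this compares the
     * SQE pointer against the struct itself, so it never matches. */
    if (ioucmd->sqe != (void *)cache)
    	memcpy(cache->sqes, ioucmd->sqe, uring_sqe_size(req->ctx));

    /* Fixed: compare against the sqes field, independent of layout. */
    if (ioucmd->sqe != cache->sqes)
    	memcpy(cache->sqes, ioucmd->sqe, uring_sqe_size(req->ctx));
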
2025-02-12  io_uring/kbuf: reallocate buf lists on upgrade  (Pavel Begunkov, 1 file, -4/+12)

IORING_REGISTER_PBUF_RING can reuse an old struct io_buffer_list if it was created for a legacy selected buffer and has been emptied. It violates the requirement that most of the fields should stay stable after publish. Always reallocate it instead.

Cc: stable@vger.kernel.org
Reported-by: Pumpkin Chang <pumpkin@devco.re>
Fixes: 2fcabce2d7d34 ("io_uring: disallow mixed provided buffer group registrations")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-02-12  io_uring/waitid: don't abuse io_tw_state  (Pavel Begunkov, 1 file, -2/+2)

struct io_tw_state is managed by core io_uring, and opcode handling code must never try to cheat and create its own instances; that's plainly incorrect. io_waitid_complete() attempts exactly that outside of the task work context, and even though the ring is locked, there would be no one to reap the requests from the defer completion list. It only works now because luckily it's called before io_uring_try_cancel_uring_cmd(), which flushes completions.

Fixes: f31ecf671ddc4 ("io_uring: add IORING_OP_WAITID support")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-01-30  io_uring/net: don't retry connect operation on EPOLLERR  (Jens Axboe, 2 files, -0/+7)

If a socket is shutdown before the connection completes, POLLERR is set in the poll mask. However, connect ignores this as it doesn't know, and attempts the connection again. This may lead to a bogus -ETIMEDOUT result, where it should have noticed the POLLERR and just returned -ECONNRESET instead.

Have the poll logic check for whether or not POLLERR is set in the mask, and if so, mark the request as failed. Then connect can appropriately fail the request rather than retry it.

Reported-by: Sergey Galas <ssgalas@cloud.ru>
Cc: stable@vger.kernel.org
Link: https://github.com/axboe/liburing/discussions/1335
Fixes: 3fb1bd688172 ("io_uring/net: handle -EINPROGRESS correct for IORING_OP_CONNECT")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-01-28  io_uring/rw: simplify io_rw_recycle()  (Pavel Begunkov, 1 file, -13/+3)

Instead of freeing iovecs in case of IO_URING_F_UNLOCKED in io_rw_recycle(), leave it be and rely on the core io_uring code to call io_readv_writev_cleanup() later. This way the iovec will get recycled and we can clean up io_rw_recycle() and kill io_rw_iovec_free().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/14f83b112eb40078bea18e15d77a4f99fc981a44.1738087204.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-01-28  io_uring: remove !KASAN guards from cache free  (Pavel Begunkov, 2 files, -4/+0)

Test setups (with KASAN) will skip the !KASAN sections, so those paths never get exercised under test. That's bad: to be sure your code works, you now have to specifically test both KASAN and !KASAN configs. Remove the !CONFIG_KASAN guards from io_netmsg_cache_free() and io_rw_cache_free(). The free functions should always be getting valid entries, and even though for KASAN the iovecs should already be cleared, that's better than skipping the chunks completely.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/d6078a51c7137a243f9d00849bc3daa660873209.1738087204.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-01-28  io_uring/net: extract io_send_select_buffer()  (Pavel Begunkov, 1 file, -37/+50)

Extract a helper out of io_send() for provided buffer selection to improve readability, as it has grown to take too many lines.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/26a769cdabd61af7f40c5d88a22469c5ad071796.1738087204.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-01-28  io_uring/net: clean io_msg_copy_hdr()  (Pavel Begunkov, 1 file, -3/+4)

Put msg->msg_iov into a local variable in io_msg_copy_hdr(); it reads better and clearly shows the types in use.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/6a5d4f7a96b10e571d6128be010166b3aaf7afd5.1738087204.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-01-28  io_uring/net: make io_net_vec_assign() return void  (Pavel Begunkov, 1 file, -4/+5)

io_net_vec_assign() can only return 0, and it doesn't make sense for it to fail, so make it return void.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/7c1a2390c99e17d3ae4e8562063e572d3cdeb164.1738087204.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-01-28  io_uring: add alloc_cache.c  (Pavel Begunkov, 3 files, -36/+54)

Avoid inlining anything and everything from alloc_cache.h; move the cold bits into a new file.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/06984c6cd58e703f7cfae5ab3067912f9f635a06.1738087204.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-01-28  io_uring: dont ifdef io_alloc_cache_kasan()  (Pavel Begunkov, 1 file, -9/+5)

Use IS_ENABLED in io_alloc_cache_kasan() so at least it gets compile tested without KASAN.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/35e53e83f6e16478dca0028a64a6cc905dc764d3.1738087204.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-01-28  io_uring: include all deps for alloc_cache.h  (Pavel Begunkov, 1 file, -0/+2)

alloc_cache.h uses types it doesn't declare and thus depends on the order in which it's included. Make it self-contained by pulling in all the needed definitions.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/39569f3d5b250b4fe78bb609d57f67d3736ebcc4.1738087204.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-01-28  io_uring: fix multishots with selected buffers  (Pavel Begunkov, 1 file, -0/+2)

We do io_kbuf_recycle() when arming a poll, but every iteration of a multishot can grab more buffers, which is why we need to flush the kbuf ring state before continuing with waiting.

Cc: stable@vger.kernel.org
Fixes: b3fdea6ecb55c ("io_uring: multishot recv")
Reported-by: Muhammad Ramdhan <ramdhan@starlabs.sg>
Reported-by: Bing-Jhong Billy Jheng <billy@starlabs.sg>
Reported-by: Jacob Soo <jacob.soo@starlabs.sg>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1bfc9990fe435f1fc6152ca9efeba5eb3e68339c.1738025570.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-01-24  io_uring/register: use atomic_read/write for sq_flags migration  (Jens Axboe, 1 file, -1/+1)

A previous commit changed all of the migration from the old to the new ring for resizing to use READ/WRITE_ONCE. However, ->sq_flags is an atomic_t, and while most archs won't complain on this, some will indeed flag this:

    io_uring/register.c:554:9: sparse: sparse: cast to non-scalar
    io_uring/register.c:554:9: sparse: sparse: cast from non-scalar

Just use atomic_set/atomic_read for handling this case.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202501242000.A2sKqaCL-lkp@intel.com/
Fixes: 2c5aae129f42 ("io_uring/register: document io_register_resize_rings() shared mem usage")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

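The two forms side by side, with illustrative local names (o and n for the old and new rings):

    /* Plain ring fields can migrate with READ/WRITE_ONCE: */
    WRITE_ONCE(n.rings->sq_dropped, READ_ONCE(o.rings->sq_dropped));

    /* ->sq_flags is an atomic_t, so use the atomic accessors instead
     * of casting through READ/WRITE_ONCE: */
    atomic_set(&n.rings->sq_flags, atomic_read(&o.rings->sq_flags));
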
2025-01-23  io_uring/alloc_cache: get rid of _nocache() helper  (Jens Axboe, 3 files, -13/+9)

Just allow passing in NULL for the cache, if the type in question doesn't have a cache associated with it.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-01-23  io_uring: get rid of alloc cache init_once handling  (Jens Axboe, 12 files, -93/+91)

init_once is called when an object doesn't come from the cache, and hence needs initial clearing of certain members. While the whole struct could get cleared by memset() in that case, a few of the cache members are large enough that this may cause unnecessary overhead if the caches used aren't large enough to satisfy the workload. For those cases, some churn of kmalloc+kfree is to be expected.

Ensure that the 3 users that need clearing put the members they need cleared at the start of the struct, and wrap the rest of the struct in a struct group so the offset is known.

While at it, improve the interaction with KASAN such that when/if KASAN writes to members inside the struct that should be retained over caching, it won't trip over itself. For rw and net, the retaining of the iovec over caching is disabled if KASAN is enabled. A helper will free and clear those members in that case.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-01-23  io_uring/uring_cmd: cleanup struct io_uring_cmd_data layout  (Jens Axboe, 1 file, -3/+3)

A few spots in uring_cmd assume that the SQEs copied are always at the start of the structure, and hence mix req->async_data and the struct itself. Clean that up and use the proper indices.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-01-23  io_uring/uring_cmd: use cached cmd_op in io_uring_cmd_sock()  (Jens Axboe, 1 file, -1/+1)

io_uring_cmd_sock() does a normal read of cmd->sqe->cmd_op, where it really should be using a READ_ONCE() as ->sqe may still be pointing to the original SQE. Since the prep side already does this READ_ONCE() and stores it locally, use that value rather than re-read it.

Fixes: 8e9fad0e70b7b ("io_uring: Add io_uring command support for sockets")
Link: https://lore.kernel.org/r/20250121-uring-sockcmd-fix-v1-1-add742802a29@google.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

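The fix in miniature: prep reads the value once while the SQE is guaranteed stable, and the issue path uses the cached copy (a sketch of the pattern, not the verbatim diff):

    /* Prep side: read cmd_op once and cache it in the command. */
    ioucmd->cmd_op = READ_ONCE(sqe->cmd_op);

    /* io_uring_cmd_sock(): use the cached value rather than re-reading
     * cmd->sqe->cmd_op, since ->sqe may still point at the original SQE. */
    op = cmd->cmd_op;	/* was: cmd->sqe->cmd_op */
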
2025-01-22  io_uring/msg_ring: don't leave potentially dangling ->tctx pointer  (Jens Axboe, 1 file, -2/+2)

For remote posting of messages, req->tctx is assigned even though it is never used. Rather than leave a dangling pointer, just clear it to NULL and use the previous check for a valid submitter_task to gate on whether or not the request should be terminated.

Reported-by: Jann Horn <jannh@google.com>
Fixes: b6f58a3f4aa8 ("io_uring: move struct io_kiocb from task_struct to io_uring_task")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-01-21  io_uring/rsrc: Move lockdep assert from io_free_rsrc_node() to caller  (Jann Horn, 2 files, -2/+3)

Checking for lockdep_assert_held(&ctx->uring_lock) in io_free_rsrc_node() means that the assertion is only checked when the resource drops to zero references. Move the lockdep assertion up into the caller io_put_rsrc_node() so that it instead happens on every reference count decrement.

Signed-off-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/r/20250120-uring-lockdep-assert-earlier-v1-1-68d8e071a4bb@google.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

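The effect, sketched on the inline put helper (field names follow the commit message, not necessarily the exact source):

    static inline void io_put_rsrc_node(struct io_ring_ctx *ctx,
    				    struct io_rsrc_node *node)
    {
    	/* Now checked on every reference drop, not only when the
    	 * count hits zero inside io_free_rsrc_node(). */
    	lockdep_assert_held(&ctx->uring_lock);
    	if (!--node->refs)
    		io_free_rsrc_node(ctx, node);
    }
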
2025-01-21  io_uring/rsrc: remove unused parameter ctx for io_rsrc_node_alloc()  (Sidong Yang, 3 files, -7/+7)

The io_uring_ctx parameter of io_rsrc_node_alloc() is currently unused. This patch removes the parameter and fixes the callers accordingly.

Signed-off-by: Sidong Yang <sidong.yang@furiosa.ai>
Link: https://lore.kernel.org/r/20250115142033.658599-1-sidong.yang@furiosa.ai
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-01-21  io_uring: clean up io_uring_register_get_file()  (Pavel Begunkov, 2 files, -4/+5)

Make it always reference the returned file. It's safer, especially with unregistrations happening under it, and it makes the API cleaner, with no conditional clean-ups required from the caller.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0d0b13a63e8edd6b5d360fc821dcdb035cb6b7e0.1736995897.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-01-21  io_uring/rsrc: Simplify buffer cloning by locking both rings  (Jann Horn, 1 file, -33/+40)

The locking in the buffer cloning code is somewhat complex because it goes back and forth between locking the source ring and the destination ring. Make it easier to reason about by locking both rings at the same time. To avoid ABBA deadlocks, lock the rings in ascending kernel address order, just like in lock_two_nondirectories().

Signed-off-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/r/20250115-uring-clone-refactor-v2-1-7289ba50776d@google.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

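A sketch of the ordering trick, modeled on lock_two_nondirectories(); the helper name is illustrative:

    static void lock_two_rings(struct io_ring_ctx *ctx1, struct io_ring_ctx *ctx2)
    {
    	/* Always take the lower-addressed lock first, so two concurrent
    	 * clones (A->B and B->A) agree on the order and cannot ABBA. */
    	if (ctx1 > ctx2)
    		swap(ctx1, ctx2);
    	mutex_lock(&ctx1->uring_lock);
    	mutex_lock_nested(&ctx2->uring_lock, SINGLE_DEPTH_NESTING);
    }
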
2025-01-20  x86: use cmov for user address masking  (Linus Torvalds, 2 files, -9/+8)

This was a suggestion by David Laight, and while I was slightly worried that some micro-architecture would predict cmov like a conditional branch, there is little reason to actually believe any core would be that broken.

Intel documents that their existing cores treat CMOVcc as a data dependency that will constrain speculation in their "Speculative Execution Side Channel Mitigations" whitepaper:

    "Other instructions such as CMOVcc, AND, ADC, SBB and SETcc can also
     be used to prevent bounds check bypass by constraining speculative
     execution on current family 6 processors (Intel® Core™, Intel® Atom™,
     Intel® Xeon® and Intel® Xeon Phi™ processors)"

and while that leaves the future uarch issues open, that's certainly true of our traditional SBB usage too.

Any core that predicts CMOV will be unusable for various crypto algorithms that need data-independent timing stability, so let's just treat CMOV as the safe choice that simplifies the address masking by avoiding an extra instruction and doesn't need a temporary register.

Suggested-by: David Laight <David.Laight@aculab.com>
Link: https://www.intel.com/content/dam/develop/external/us/en/documents/336996-speculative-execution-side-channel-mitigations.pdf
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

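A simplified sketch of the shape of such a masking helper; USER_PTR_MAX stands in for however the real code materializes the limit, and the constraints are illustrative:

    static inline void __user *mask_user_address(const void __user *ptr)
    {
    	void __user *ret;

    	/* if (ptr > USER_PTR_MAX) ptr = USER_PTR_MAX: one compare plus a
    	 * conditional move, no borrow trick and no temporary register. */
    	asm("cmp %1,%0\n\t"
    	    "cmova %1,%0"
    	    : "=r" (ret)
    	    : "r" (USER_PTR_MAX), "0" (ptr));
    	return ret;
    }
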
2025-01-20  x86: use proper 'clac' and 'stac' opcode names  (Linus Torvalds, 1 file, -11/+7)

Back when we added SMAP support, not all versions of binutils understood the 'clac' and 'stac' instructions, so we implemented those instructions manually as ".byte" sequences. But we've since upgraded the minimum version of binutils to version 2.25, which includes proper support for the SMAP instructions, and there's no reason for us to use some line noise to express them any more.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

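The change is essentially this, sketched as asm-smap-style macros:

    /* Before: hand-assembled, for binutils that predate SMAP support. */
    #define __ASM_CLAC	".byte 0x0f,0x01,0xca"
    #define __ASM_STAC	".byte 0x0f,0x01,0xcb"

    /* After: binutils >= 2.25 understands the mnemonics directly. */
    #define __ASM_CLAC	"clac"
    #define __ASM_STAC	"stac"
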
2025-01-20  samples/vfs: fix build warnings  (Christian Brauner, 1 file, -11/+12)

Fix build warnings reported from linux-next.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Link: https://lore.kernel.org/r/20250120192504.4a1965a0@canb.auug.org.au
Signed-off-by: Christian Brauner <brauner@kernel.org>

2025-01-20  samples/vfs: use shared header  (Christian Brauner, 3 files, -93/+249)

Share some infrastructure between sample programs and fix a build failure that was reported.

Reported-by: Sasha Levin <sashal@kernel.org>
Link: https://lore.kernel.org/r/Z42UkSXx0MS9qZ9w@lappy
Link: https://qa-reports.linaro.org/lkft/sashal-linus-next/build/v6.13-rc7-511-g109a8e0fa9d6/testrun/26809210/suite/build/test/gcc-8-allyesconfig/log
Signed-off-by: Christian Brauner <brauner@kernel.org>

2025-01-19  Linux 6.13  (Linus Torvalds, 1 file, -1/+1)

2025-01-19  io_uring/fdinfo: fix io_uring_show_fdinfo() misuse of ->d_iname  (Al Viro, 1 file, -4/+5)

Output of io_uring_show_fdinfo() has several problems:

* racy use of ->d_iname
* junk if the name is long - in that case it's not stored in ->d_iname at all
* lack of quoting (names can contain newlines, etc. - or be equal to "<none>", for that matter)
* lines for empty slots are pointless noise - we already have the total amount, so having just the non-empty ones would carry the same information.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-01-17  tracing: gfp: Fix the GFP enum values shown for user space tracing tools  (Steven Rostedt, 1 file, -0/+63)

Tracing tools like perf and trace-cmd read the /sys/kernel/tracing/events/*/*/format files to know how to parse the data and also how to print it. For the "print fmt" portion of that file, if anything uses an enum that is not exported to the tracing system, user space will not be able to parse it.

The GFP flags used to be defines, and defines get translated in the print fmt sections. But now they have been converted to enums, which are not. The mm_page_alloc trace event format used to have:

    print fmt: "page=%p pfn=0x%lx order=%d migratetype=%d gfp_flags=%s", REC->pfn != -1UL ? (((struct page *)vmemmap_base) + (REC->pfn)) : ((void *)0), REC->pfn != -1UL ? REC->pfn : 0, REC->order, REC->migratetype, (REC->gfp_flags) ? __print_flags(REC->gfp_flags, "|", {( unsigned long)(((((((( gfp_t)(0x400u|0x800u)) | (( gfp_t)0x40u) | (( gfp_t)0x80u) | (( gfp_t)0x100000u)) | (( gfp_t)0x02u)) | (( gfp_t)0x08u) | (( gfp_t)0)) | (( gfp_t)0x40000u) | (( gfp_t)0x80000u) | (( gfp_t)0x2000u)) & ~(( gfp_t)(0x400u|0x800u))) | (( gfp_t)0x400u)), "GFP_TRANSHUGE"}, {( unsigned long)((((((( gfp_t)(0x400u|0x800u)) | (( gfp_t)0x40u) | (( gfp_t)0x80u) | (( gfp_t)0x100000u)) | (( gfp_t)0x02u)) | (( gfp_t)0x08u) | (( gfp_t)0)) ...

where the GFP values are shown, not their names. But after the GFP flags were converted to use enums, it has:

    print fmt: "page=%p pfn=0x%lx order=%d migratetype=%d gfp_flags=%s", REC->pfn != -1UL ? (vmemmap + (REC->pfn)) : ((void *)0), REC->pfn != -1UL ? REC->pfn : 0, REC->order, REC->migratetype, (REC->gfp_flags) ? __print_flags(REC->gfp_flags, "|", {( unsigned long)(((((((( gfp_t)(((((1UL))) << (___GFP_DIRECT_RECLAIM_BIT))|((((1UL))) << (___GFP_KSWAPD_RECLAIM_BIT)))) | (( gfp_t)((((1UL))) << (___GFP_IO_BIT))) | (( gfp_t)((((1UL))) << (___GFP_FS_BIT))) | (( gfp_t)((((1UL))) << (___GFP_HARDWALL_BIT)))) | (( gfp_t)((((1UL))) << (___GFP_HIGHMEM_BIT)))) | (( gfp_t)((((1UL))) << (___GFP_MOVABLE_BIT))) | (( gfp_t)0)) | (( gfp_t)((((1UL))) << (___GFP_COMP_BIT))) ...

where the enum names like ___GFP_KSWAPD_RECLAIM_BIT are shown, not their values. User space has no way to convert these names to their values and the output will fail to parse. What is shown is now:

    mm_page_alloc: page=0xffffffff981685f3 pfn=0x1d1ac1 order=0 migratetype=1 gfp_flags=0x140cca

The TRACE_DEFINE_ENUM() macro was created to handle enums in the print fmt files. This causes them to be replaced at boot up with the numbers, so that user space tooling can parse it. By using this macro, the output is back to the human-readable form:

    mm_page_alloc: page=0xffffffff981685f3 pfn=0x122233 order=0 migratetype=1 gfp_flags=GFP_HIGHUSER_MOVABLE|__GFP_COMP

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Veronika Molnarova <vmolnaro@redhat.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/20250116214438.749504792@goodmis.org
Reported-by: Michael Petlan <mpetlan@redhat.com>
Closes: https://lore.kernel.org/all/87be5f7c-1a0-dad-daa0-54e342efaea7@redhat.com/
Fixes: 772dd0342727c ("mm: enumerate all gfp flags")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>

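Usage is a one-liner per enum value in the trace header, before the event definitions; an illustrative subset for the gfp bits named above:

    /* Teach the tracing system the enum values so the "print fmt"
     * strings are rewritten with numbers at boot. */
    TRACE_DEFINE_ENUM(___GFP_DIRECT_RECLAIM_BIT);
    TRACE_DEFINE_ENUM(___GFP_KSWAPD_RECLAIM_BIT);
    TRACE_DEFINE_ENUM(___GFP_IO_BIT);
    TRACE_DEFINE_ENUM(___GFP_FS_BIT);
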
2025-01-17  block: Don't trim an atomic write  (John Garry, 1 file, -0/+4)

This is disallowed. This check will now be relevant, since the device mapper personalities will start to support atomic writes, and they use this function.

Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Link: https://lore.kernel.org/r/20250116170301.474130-3-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-01-17  block: Add common atomic writes enable flag  (John Garry, 7 files, -7/+11)

Currently only stacked devices need to explicitly enable atomic writes, by setting the BLK_FEAT_ATOMIC_WRITES_STACKED flag. This does not work well for device mapper stacking devices, as many sets of limits are stacked there, and which device is the 'bottom' and which the 'top' can be swapped. This means that BLK_FEAT_ATOMIC_WRITES_STACKED needs to be set for many queue limits, which is messy. Generalize atomic write enabling by ensuring that all devices must explicitly set a flag - that includes NVMe, SCSI sd, and md raid.

Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Link: https://lore.kernel.org/r/20250116170301.474130-2-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-01-16  drm/xe: Mark ComputeCS read mode as UC on iGPU  (Matthew Brost, 1 file, -1/+1)

RING_CMD_CCTL read index should be UC on iGPU parts due to the L3 caching structure. Having this as WB blocks ULLS from being enabled. Change it to UC to unblock ULLS on iGPU.

v2:
- Drop internal communications comment, bspec is updated

Cc: Balasubramani Vivekanandan <balasubramani.vivekanandan@intel.com>
Cc: Michal Mrozek <michal.mrozek@intel.com>
Cc: Paulo Zanoni <paulo.r.zanoni@intel.com>
Cc: José Roberto de Souza <jose.souza@intel.com>
Cc: stable@vger.kernel.org
Fixes: 328e089bfb37 ("drm/xe: Leverage ComputeCS read L3 caching")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Acked-by: Michal Mrozek <michal.mrozek@intel.com>
Reviewed-by: Stuart Summers <stuart.summers@intel.com>
Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250114002507.114087-1-matthew.brost@intel.com
(cherry picked from commit 758debf35b9cda5450e40996991a6e4b222899bd)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>

2025-01-16  md/md-linear: Fix a NULL vs IS_ERR() bug in linear_add()  (Dan Carpenter, 1 file, -2/+2)

linear_conf() returns error pointers; it doesn't return NULL. Update the error checking to match.

Fixes: 127186cfb184 ("md: reintroduce md-linear")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/add654be-759f-4b2d-93ba-a3726dae380c@stanley.mountain
Signed-off-by: Song Liu <song@kernel.org>

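The canonical pattern for callers of ERR_PTR()-returning functions, sketched with illustrative arguments:

    conf = linear_conf(mddev, mddev->raid_disks + 1);
    /* linear_conf() returns ERR_PTR(-E...) on failure, never NULL,
     * so a NULL check would silently ignore the error. */
    if (IS_ERR(conf))
    	return PTR_ERR(conf);
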
2025-01-16  x86/asm: Make serialize() always_inline  (Juergen Gross, 1 file, -1/+1)

In order to allow serialize() to be used from noinstr code, make it __always_inline.

Fixes: 0ef8047b737d ("x86/static-call: provide a way to do very early static-call updates")
Closes: https://lore.kernel.org/oe-kbuild-all/202412181756.aJvzih2K-lkp@intel.com/
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20241218100918.22167-1-jgross@suse.com

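The helper in question is a thin wrapper around the SERIALIZE instruction; marking it __always_inline means noinstr code never emits an out-of-line, instrumentable call. A sketch of its shape:

    static __always_inline void serialize(void)
    {
    	/* Opcode bytes for SERIALIZE; usable regardless of binutils. */
    	asm volatile(".byte 0xf, 0x1, 0xe8" ::: "memory");
    }
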