2025-02-19  io_uring/rsrc: remove unused constants  (Caleb Sander Mateos, 1 file, -6/+0)
IO_NODE_ALLOC_CACHE_MAX has been unused since commit fbbb8e991d86 ("io_uring/rsrc: get rid of io_rsrc_node allocation cache") removed the rsrc_node_cache. IO_RSRC_TAG_TABLE_SHIFT and IO_RSRC_TAG_TABLE_MASK have been unused since commit 7029acd8a950 ("io_uring/rsrc: get rid of per-ring io_rsrc_node list") removed the separate tag table for registered nodes. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Li Zetao <lizetao1@huawei.com> Link: https://lore.kernel.org/r/20250219033444.2020136-1-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-18  io_uring: fix spelling error in uapi io_uring.h  (Jens Axboe, 1 file, -1/+1)
This is obviously not that important, but when changes are synced back from the kernel to liburing, the codespell CI ends up erroring because of this misspelling. Let's just correct it and avoid this biting us again on an import. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-15  io_uring: prevent opcode speculation  (Pavel Begunkov, 1 file, -0/+2)
sqe->opcode is used for different tables, make sure we sanitise it against speculation. Cc: stable@vger.kernel.org Fixes: d3656344fea03 ("io_uring: add lookup table for various opcode needs") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Li Zetao <lizetao1@huawei.com> Link: https://lore.kernel.org/r/7eddbf31c8ca0a3947f8ed98271acc2b4349c016.1739568408.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
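For illustration, the sanitisation described here is the standard array_index_nospec() idiom from <linux/nospec.h>: bounds-check the user-controlled index, then clamp it so the CPU cannot use an out-of-range value even speculatively. A minimal sketch (the helper name is made up; the real change lives in io_uring's request setup path):

    #include <linux/nospec.h>
    #include <linux/errno.h>

    /* Hypothetical helper: validate and sanitise a user-controlled opcode
     * before it is used to index per-opcode tables. */
    static int check_opcode(u8 opcode, u8 *out)
    {
            if (opcode >= IORING_OP_LAST)
                    return -EINVAL;
            /* clamp to 0..IORING_OP_LAST-1 even under misspeculation */
            *out = array_index_nospec(opcode, IORING_OP_LAST);
            return 0;
    }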
2025-02-14  io-wq: backoff when retrying worker creation  (Uday Shankar, 1 file, -5/+18)
When io_uring submission goes async for the first time on a given task, we'll try to create a worker thread to handle the submission. Creating this worker thread can fail due to various transient conditions, such as an outstanding signal in the forking thread, so we have retry logic with a limit of 3 retries. However, this retry logic appears to be too aggressive/fast - we've observed a thread blowing through the retry limit while having the same outstanding signal the whole time. Here's an excerpt of some tracing that demonstrates the issue: First, signal 26 is generated for the process. It ends up getting routed to thread 92942. 0) cbd-92284 /* signal_generate: sig=26 errno=0 code=-2 comm=psblkdASD pid=92934 grp=1 res=0 */ This causes create_io_thread in the signalled thread to fail with ERESTARTNOINTR, and thus a retry is queued. 13) task_th-92942 /* io_uring_queue_async_work: ring 000000007325c9ae, request 0000000080c96d8e, user_data 0x0, opcode URING_CMD, flags 0x8240001, normal queue, work 000000006e96dd3f */ 13) task_th-92942 io_wq_enqueue() { 13) task_th-92942 _raw_spin_lock(); 13) task_th-92942 io_wq_activate_free_worker(); 13) task_th-92942 _raw_spin_lock(); 13) task_th-92942 create_io_worker() { 13) task_th-92942 __kmalloc_cache_noprof(); 13) task_th-92942 __init_swait_queue_head(); 13) task_th-92942 kprobe_ftrace_handler() { 13) task_th-92942 get_kprobe(); 13) task_th-92942 aggr_pre_handler() { 13) task_th-92942 pre_handler_kretprobe(); 13) task_th-92942 /* create_enter: (create_io_thread+0x0/0x50) fn=0xffffffff8172c0e0 arg=0xffff888996bb69c0 node=-1 */ 13) task_th-92942 } /* aggr_pre_handler */ ... 13) task_th-92942 } /* copy_process */ 13) task_th-92942 } /* create_io_thread */ 13) task_th-92942 kretprobe_rethook_handler() { 13) task_th-92942 /* create_exit: (create_io_worker+0x8a/0x1a0 <- create_io_thread) arg1=0xfffffffffffffdff */ 13) task_th-92942 } /* kretprobe_rethook_handler */ 13) task_th-92942 queue_work_on() { ... The CPU is then handed to a kworker to process the queued retry: ------------------------------------------ 13) task_th-92942 => kworker-54154 ------------------------------------------ 13) kworker-54154 io_workqueue_create() { 13) kworker-54154 io_queue_worker_create() { 13) kworker-54154 task_work_add() { 13) kworker-54154 wake_up_state() { 13) kworker-54154 try_to_wake_up() { 13) kworker-54154 _raw_spin_lock_irqsave(); 13) kworker-54154 _raw_spin_unlock_irqrestore(); 13) kworker-54154 } /* try_to_wake_up */ 13) kworker-54154 } /* wake_up_state */ 13) kworker-54154 kick_process(); 13) kworker-54154 } /* task_work_add */ 13) kworker-54154 } /* io_queue_worker_create */ 13) kworker-54154 } /* io_workqueue_create */ And then we immediately switch back to the original task to try creating a worker again. This fails, because the original task still hasn't handled its signal. 
----------------------------------------- 13) kworker-54154 => task_th-92942 ------------------------------------------ 13) task_th-92942 create_worker_cont() { 13) task_th-92942 kprobe_ftrace_handler() { 13) task_th-92942 get_kprobe(); 13) task_th-92942 aggr_pre_handler() { 13) task_th-92942 pre_handler_kretprobe(); 13) task_th-92942 /* create_enter: (create_io_thread+0x0/0x50) fn=0xffffffff8172c0e0 arg=0xffff888996bb69c0 node=-1 */ 13) task_th-92942 } /* aggr_pre_handler */ 13) task_th-92942 } /* kprobe_ftrace_handler */ 13) task_th-92942 create_io_thread() { 13) task_th-92942 copy_process() { 13) task_th-92942 task_active_pid_ns(); 13) task_th-92942 _raw_spin_lock_irq(); 13) task_th-92942 recalc_sigpending(); 13) task_th-92942 _raw_spin_lock_irq(); 13) task_th-92942 } /* copy_process */ 13) task_th-92942 } /* create_io_thread */ 13) task_th-92942 kretprobe_rethook_handler() { 13) task_th-92942 /* create_exit: (create_worker_cont+0x35/0x1b0 <- create_io_thread) arg1=0xfffffffffffffdff */ 13) task_th-92942 } /* kretprobe_rethook_handler */ 13) task_th-92942 io_worker_release(); 13) task_th-92942 queue_work_on() { 13) task_th-92942 clear_pending_if_disabled(); 13) task_th-92942 __queue_work() { 13) task_th-92942 } /* __queue_work */ 13) task_th-92942 } /* queue_work_on */ 13) task_th-92942 } /* create_worker_cont */ The pattern repeats another couple times until we blow through the retry counter, at which point we give up. All outstanding work is canceled, and the io_uring command which triggered all this is failed with ECANCELED: 13) task_th-92942 io_acct_cancel_pending_work() { ... 13) task_th-92942 /* io_uring_complete: ring 000000007325c9ae, req 0000000080c96d8e, user_data 0x0, result -125, cflags 0x0 extra1 0 extra2 0 */ Finally, the task gets around to processing its outstanding signal 26, but it's too late. 13) task_th-92942 /* signal_deliver: sig=26 errno=0 code=-2 sa_handler=59566a0 sa_flags=14000000 */ Try to address this issue by adding a small scaling delay when retrying worker creation. This should give the forking thread time to handle its signal in the above case. This isn't a particularly satisfying solution, as sufficiently paradoxical scheduling would still have us hitting the same issue, and I'm open to suggestions for something better. But this is likely to prevent this (already rare) issue from hitting in practice. Signed-off-by: Uday Shankar <ushankar@purestorage.com> Link: https://lore.kernel.org/r/20250208-wq_retry-v2-1-4f6f5041d303@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
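As a rough sketch of the scaling-delay idea (not the actual patch: the structure, field names and the 5 ms factor are assumptions), the retry can be queued as delayed work whose delay grows with the attempt count, giving the signalled task time to run first:

    #include <linux/workqueue.h>
    #include <linux/jiffies.h>

    struct worker_create_stub {             /* hypothetical stand-in */
            int init_retries;
            struct delayed_work create_work;
    };

    /* Queue the next creation attempt with a back-off proportional to the
     * number of failed attempts instead of re-queueing it immediately. */
    static void queue_create_retry(struct worker_create_stub *w)
    {
            unsigned long delay = msecs_to_jiffies(w->init_retries * 5);

            schedule_delayed_work(&w->create_work, delay);
    }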
2025-02-13  io_uring/uring_cmd: unconditionally copy SQEs at prep time  (Jens Axboe, 1 file, -23/+11)
This isn't generally necessary, but conditions have been observed where SQE data is accessed from the original SQE after prep has been done and outside of the initial issue. Opcode prep handlers must ensure that any SQE related data is stable beyond the prep phase, but uring_cmd is a bit special in how it handles the SQE which makes it susceptible to reading stale data. If the application has reused the SQE before the original completes, then that can lead to data corruption. Down the line we can relax this again once uring_cmd has been sanitized a bit, and avoid unnecessarily copying the SQE. Fixes: 5eff57fa9f3a ("io_uring/uring_cmd: defer SQE copying until it's needed") Reported-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Li Zetao <lizetao1@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-12  io_uring/waitid: setup async data in the prep handler  (Jens Axboe, 1 file, -7/+7)
This is the idiomatic way that opcodes should setup their async data, so that it's always valid inside ->issue() without issue needing to do that. Fixes: f31ecf671ddc4 ("io_uring: add IORING_OP_WAITID support") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-12  io_uring/uring_cmd: remove dead req_has_async_data() check  (Jens Axboe, 1 file, -3/+0)
Any uring_cmd always has async data allocated now, there's no reason to check and clear a cached copy of the SQE. Fixes: d10f19dff56e ("io_uring/uring_cmd: switch to always allocating async data") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-12  io_uring/uring_cmd: switch sqe to async_data on EAGAIN  (Caleb Sander Mateos, 1 file, -9/+14)
5eff57fa9f3a ("io_uring/uring_cmd: defer SQE copying until it's needed") moved the unconditional memcpy() of the uring_cmd SQE to async_data to 2 cases when the request goes async: - If REQ_F_FORCE_ASYNC is set to force the initial issue to go async - If ->uring_cmd() returns -EAGAIN in the initial non-blocking issue Unlike the REQ_F_FORCE_ASYNC case, in the EAGAIN case, io_uring_cmd() copies the SQE to async_data but neglects to update the io_uring_cmd's sqe field to point to async_data. As a result, sqe still points to the slot in the userspace-mapped SQ. At the end of io_submit_sqes(), the kernel advances the SQ head index, allowing userspace to reuse the slot for a new SQE. If userspace reuses the slot before the io_uring worker reissues the original SQE, the io_uring_cmd's SQE will be corrupted. Introduce a helper io_uring_cmd_cache_sqes() to copy the original SQE to the io_uring_cmd's async_data and point sqe there. Use it for both the REQ_F_FORCE_ASYNC and EAGAIN cases. This ensures the uring_cmd doesn't read from the SQ slot after it has been returned to userspace. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Fixes: 5eff57fa9f3a ("io_uring/uring_cmd: defer SQE copying until it's needed") Link: https://lore.kernel.org/r/20250212204546.3751645-3-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
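The helper boils down to a copy-then-repoint step; a simplified sketch (the size parameter and helper signature are assumptions, the field names follow the commit text):

    /* Copy the SQE into the kernel-owned async_data and repoint ->sqe at
     * it, so the request never reads the userspace SQ slot again. */
    static void cache_sqes(struct io_uring_cmd *ioucmd,
                           struct io_uring_cmd_data *cache,
                           size_t sqe_size)
    {
            memcpy(cache->sqes, ioucmd->sqe, sqe_size);
            ioucmd->sqe = cache->sqes;
    }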
2025-02-12  io_uring/uring_cmd: don't assume io_uring_cmd_data layout  (Caleb Sander Mateos, 1 file, -1/+1)
eaf72f7b414f ("io_uring/uring_cmd: cleanup struct io_uring_cmd_data layout") removed most of the places assuming struct io_uring_cmd_data has sqes as its first field. However, the EAGAIN case in io_uring_cmd() still compares ioucmd->sqe to the struct io_uring_cmd_data pointer using a void * cast. Since fa3595523d72 ("io_uring: get rid of alloc cache init_once handling"), sqes is no longer io_uring_cmd_data's first field. As a result, the pointers will always compare unequal and memcpy() may be called with the same source and destination. Replace the incorrect void * cast with the address of the sqes field. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Fixes: eaf72f7b414f ("io_uring/uring_cmd: cleanup struct io_uring_cmd_data layout") Link: https://lore.kernel.org/r/20250212204546.3751645-2-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-12  io_uring/kbuf: reallocate buf lists on upgrade  (Pavel Begunkov, 1 file, -4/+12)
IORING_REGISTER_PBUF_RING can reuse an old struct io_buffer_list if it was created for legacy selected buffer and has been emptied. It violates the requirement that most of the fields should stay stable after publish. Always reallocate it instead. Cc: stable@vger.kernel.org Reported-by: Pumpkin Chang <pumpkin@devco.re> Fixes: 2fcabce2d7d34 ("io_uring: disallow mixed provided buffer group registrations") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-12  io_uring/waitid: don't abuse io_tw_state  (Pavel Begunkov, 1 file, -2/+2)
struct io_tw_state is managed by core io_uring, and opcode handling code must never try to cheat and create their own instances, it's plain incorrect. io_waitid_complete() attempts exactly that outside of the task work context, and even though the ring is locked, there would be no one to reap the requests from the defer completion list. It only works now because luckily it's called before io_uring_try_cancel_uring_cmd(), which flushes completions. Fixes: f31ecf671ddc4 ("io_uring: add IORING_OP_WAITID support") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-30  io_uring/net: don't retry connect operation on EPOLLERR  (Jens Axboe, 2 files, -0/+7)
If a socket is shutdown before the connection completes, POLLERR is set in the poll mask. However, connect ignores this as it doesn't know, and attempts the connection again. This may lead to a bogus -ETIMEDOUT result, where it should have noticed the POLLERR and just returned -ECONNRESET instead. Have the poll logic check for whether or not POLLERR is set in the mask, and if so, mark the request as failed. Then connect can appropriately fail the request rather than retry it. Reported-by: Sergey Galas <ssgalas@cloud.ru> Cc: stable@vger.kernel.org Link: https://github.com/axboe/liburing/discussions/1335 Fixes: 3fb1bd688172 ("io_uring/net: handle -EINPROGRESS correct for IORING_OP_CONNECT") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-28  io_uring/rw: simplify io_rw_recycle()  (Pavel Begunkov, 1 file, -13/+3)
Instead of freeing iovecs in case of IO_URING_F_UNLOCKED in io_rw_recycle(), leave it be and rely on the core io_uring code to call io_readv_writev_cleanup() later. This way the iovec will get recycled and we can clean up io_rw_recycle() and kill io_rw_iovec_free(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/14f83b112eb40078bea18e15d77a4f99fc981a44.1738087204.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-28  io_uring: remove !KASAN guards from cache free  (Pavel Begunkov, 2 files, -4/+0)
Test setups (with KASAN) will avoid !KASAN sections, and so it's not testing paths that would be exercised otherwise. That's bad as to be sure that your code works you now have to specifically test both KASAN and !KASAN configs. Remove !CONFIG_KASAN guards from io_netmsg_cache_free() and io_rw_cache_free(). The free functions should always be getting valid entries, and even though for KASAN iovecs should already be cleared, that's better than skipping the chunks completely. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/d6078a51c7137a243f9d00849bc3daa660873209.1738087204.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-28  io_uring/net: extract io_send_select_buffer()  (Pavel Begunkov, 1 file, -37/+50)
Extract a helper out of io_send() for provided buffer selection to improve readability as it has grown to take too many lines. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/26a769cdabd61af7f40c5d88a22469c5ad071796.1738087204.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-28  io_uring/net: clean io_msg_copy_hdr()  (Pavel Begunkov, 1 file, -3/+4)
Put msg->msg_iov into a local variable in io_msg_copy_hdr(), it reads better and clearly shows the used types. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/6a5d4f7a96b10e571d6128be010166b3aaf7afd5.1738087204.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-28  io_uring/net: make io_net_vec_assign() return void  (Pavel Begunkov, 1 file, -4/+5)
io_net_vec_assign() can only return 0 and it doesn't make sense for it to fail, so make it return void. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/7c1a2390c99e17d3ae4e8562063e572d3cdeb164.1738087204.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-28  io_uring: add alloc_cache.c  (Pavel Begunkov, 3 files, -36/+54)
Avoid inlining all and everything from alloc_cache.h and move cold bits into a new file. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/06984c6cd58e703f7cfae5ab3067912f9f635a06.1738087204.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-28  io_uring: dont ifdef io_alloc_cache_kasan()  (Pavel Begunkov, 1 file, -9/+5)
Use IS_ENABLED in io_alloc_cache_kasan() so at least it gets compile tested without KASAN. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/35e53e83f6e16478dca0028a64a6cc905dc764d3.1738087204.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
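The underlying idiom, sketched generically below (the function is made up, not the io_uring one): with IS_ENABLED() the body is always compiled and type-checked and the compiler folds it away when CONFIG_KASAN is off, whereas an #ifdef'ed body is never even parsed in !KASAN builds.

    static inline void cache_kasan_fixup(void *entry)
    {
            if (!IS_ENABLED(CONFIG_KASAN))
                    return;
            /* KASAN-only bookkeeping would go here */
    }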
2025-01-28  io_uring: include all deps for alloc_cache.h  (Pavel Begunkov, 1 file, -0/+2)
alloc_cache.h uses types it doesn't declare and thus depends on the order in which it's included. Make it self contained and pull all needed definitions. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/39569f3d5b250b4fe78bb609d57f67d3736ebcc4.1738087204.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-28  io_uring: fix multishots with selected buffers  (Pavel Begunkov, 1 file, -0/+2)
We do io_kbuf_recycle() when arming a poll but every iteration of a multishot can grab more buffers, which is why we need to flush the kbuf ring state before continuing with waiting. Cc: stable@vger.kernel.org Fixes: b3fdea6ecb55c ("io_uring: multishot recv") Reported-by: Muhammad Ramdhan <ramdhan@starlabs.sg> Reported-by: Bing-Jhong Billy Jheng <billy@starlabs.sg> Reported-by: Jacob Soo <jacob.soo@starlabs.sg> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1bfc9990fe435f1fc6152ca9efeba5eb3e68339c.1738025570.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-24  io_uring/register: use atomic_read/write for sq_flags migration  (Jens Axboe, 1 file, -1/+1)
A previous commit changed all of the migration from the old to the new ring for resizing to use READ/WRITE_ONCE. However, ->sq_flags is an atomic_t, and while most archs won't complain on this, some will indeed flag this: io_uring/register.c:554:9: sparse: sparse: cast to non-scalar io_uring/register.c:554:9: sparse: sparse: cast from non-scalar Just use atomic_set/atomic_read for handling this case. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202501242000.A2sKqaCL-lkp@intel.com/ Fixes: 2c5aae129f42 ("io_uring/register: document io_register_resize_rings() shared mem usage") Signed-off-by: Jens Axboe <axboe@kernel.dk>
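A minimal sketch of the corrected access pattern for an atomic_t during the migration (the wrapper is illustrative):

    /* sq_flags is an atomic_t: copy it with atomic_set()/atomic_read()
     * rather than WRITE_ONCE()/READ_ONCE(), which sparse rejects for
     * non-scalar types. */
    static void migrate_sq_flags(atomic_t *dst, const atomic_t *src)
    {
            atomic_set(dst, atomic_read(src));
    }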
2025-01-23  io_uring/alloc_cache: get rid of _nocache() helper  (Jens Axboe, 3 files, -13/+9)
Just allow passing in NULL for the cache, if the type in question doesn't have a cache associated with it. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-23  io_uring: get rid of alloc cache init_once handling  (Jens Axboe, 12 files, -93/+91)
init_once is called when an object doesn't come from the cache, and hence needs initial clearing of certain members. While the whole struct could get cleared by memset() in that case, a few of the cache members are large enough that this may cause unnecessary overhead if the caches used aren't large enough to satisfy the workload. For those cases, some churn of kmalloc+kfree is to be expected. Ensure that the 3 users that need clearing put the members they need cleared at the start of the struct, and wrap the rest of the struct in a struct group so the offset is known. While at it, improve the interaction with KASAN such that when/if KASAN writes to members inside the struct that should be retained over caching, it won't trip over itself. For rw and net, the retaining of the iovec over caching is disabled if KASAN is enabled. A helper will free and clear those members in that case. Signed-off-by: Jens Axboe <axboe@kernel.dk>
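A sketch of the layout trick, with made-up struct and member names: the members a fresh allocation must see zeroed sit at the front, and the members retained across cache reuse are wrapped in a struct_group() so their starting offset is known to a partial memset().

    #include <linux/stddef.h>       /* struct_group(), offsetof() */
    #include <linux/string.h>       /* memset() */
    #include <linux/uio.h>          /* struct iovec */

    struct io_foo_async {                   /* hypothetical cached object */
            /* cleared whenever the object does not come from the cache */
            struct iovec    *free_iov;
            int             free_iov_nr;

            /* retained across cache reuse */
            struct_group(retained,
                    char    scratch[128];
                    size_t  scratch_len;
            );
    };

    static void io_foo_init_fresh(struct io_foo_async *p)
    {
            /* clear only the leading members, not the retained tail */
            memset(p, 0, offsetof(struct io_foo_async, retained));
    }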
2025-01-23  io_uring/uring_cmd: cleanup struct io_uring_cmd_data layout  (Jens Axboe, 1 file, -3/+3)
A few spots in uring_cmd assume that the SQEs copied are always at the start of the structure, and hence mix req->async_data and the struct itself. Clean that up and use the proper indices. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-23  io_uring/uring_cmd: use cached cmd_op in io_uring_cmd_sock()  (Jens Axboe, 1 file, -1/+1)
io_uring_cmd_sock() does a normal read of cmd->sqe->cmd_op, where it really should be using a READ_ONCE() as ->sqe may still be pointing to the original SQE. Since the prep side already does this READ_ONCE() and stores it locally, use that value rather than re-read it. Fixes: 8e9fad0e70b7b ("io_uring: Add io_uring command support for sockets") Link: https://lore.kernel.org/r/20250121-uring-sockcmd-fix-v1-1-add742802a29@google.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
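The general pattern is to snapshot userspace-visible SQE fields once at prep time and only ever use the cached value afterwards; a sketch (the helper is illustrative, cmd_op and the SOCKET_URING_OP_* constants are real uapi symbols):

    static int handle_sock_cmd(struct io_uring_cmd *ioucmd,
                               const struct io_uring_sqe *sqe)
    {
            /* prep side: read the shared field exactly once */
            u32 cmd_op = READ_ONCE(sqe->cmd_op);

            /* issue side: use only the snapshot; ->sqe may still point at
             * the userspace-mapped SQ ring */
            switch (cmd_op) {
            case SOCKET_URING_OP_SIOCINQ:
            case SOCKET_URING_OP_SIOCOUTQ:
                    return 0;       /* real handling elided */
            default:
                    return -EOPNOTSUPP;
            }
    }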
2025-01-22  io_uring/msg_ring: don't leave potentially dangling ->tctx pointer  (Jens Axboe, 1 file, -2/+2)
For remote posting of messages, req->tctx is assigned even though it is never used. Rather than leave a dangling pointer, just clear it to NULL and use the previous check for a valid submitter_task to gate on whether or not the request should be terminated. Reported-by: Jann Horn <jannh@google.com> Fixes: b6f58a3f4aa8 ("io_uring: move struct io_kiocb from task_struct to io_uring_task") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-21  io_uring/rsrc: Move lockdep assert from io_free_rsrc_node() to caller  (Jann Horn, 2 files, -2/+3)
Checking for lockdep_assert_held(&ctx->uring_lock) in io_free_rsrc_node() means that the assertion is only checked when the resource drops to zero references. Move the lockdep assertion up into the caller io_put_rsrc_node() so that it instead happens on every reference count decrement. Signed-off-by: Jann Horn <jannh@google.com> Link: https://lore.kernel.org/r/20250120-uring-lockdep-assert-earlier-v1-1-68d8e071a4bb@google.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
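The effect is that the check fires on every reference drop instead of only the final one; a simplified sketch (the refcount field name and exact signatures are assumptions):

    static inline void io_put_rsrc_node(struct io_ring_ctx *ctx,
                                        struct io_rsrc_node *node)
    {
            /* asserted on every put, not just when the node is freed */
            lockdep_assert_held(&ctx->uring_lock);
            if (!--node->refs)
                    io_free_rsrc_node(ctx, node);
    }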
2025-01-21  io_uring/rsrc: remove unused parameter ctx for io_rsrc_node_alloc()  (Sidong Yang, 3 files, -7/+7)
io_uring_ctx parameter for io_rsrc_node_alloc() is unused for now. This patch removes the parameter and fixes the callers accordingly. Signed-off-by: Sidong Yang <sidong.yang@furiosa.ai> Link: https://lore.kernel.org/r/20250115142033.658599-1-sidong.yang@furiosa.ai Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-21  io_uring: clean up io_uring_register_get_file()  (Pavel Begunkov, 2 files, -4/+5)
Make it always reference the returned file. It's safer, especially with unregistrations happening under it. And it makes the api cleaner with no conditional clean ups by the caller. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0d0b13a63e8edd6b5d360fc821dcdb035cb6b7e0.1736995897.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-21  io_uring/rsrc: Simplify buffer cloning by locking both rings  (Jann Horn, 1 file, -33/+40)
The locking in the buffer cloning code is somewhat complex because it goes back and forth between locking the source ring and the destination ring. Make it easier to reason about by locking both rings at the same time. To avoid ABBA deadlocks, lock the rings in ascending kernel address order, just like in lock_two_nondirectories(). Signed-off-by: Jann Horn <jannh@google.com> Link: https://lore.kernel.org/r/20250115-uring-clone-refactor-v2-1-7289ba50776d@google.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
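The deadlock avoidance is the classic "always lock in one global order" pattern; a sketch of taking both rings' locks by ascending address (the helper name is illustrative):

    /* Lock both rings' uring_lock in ascending address order so that two
     * concurrent clones between the same pair of rings cannot ABBA-deadlock. */
    static void lock_two_rings(struct io_ring_ctx *ctx1,
                               struct io_ring_ctx *ctx2)
    {
            if (ctx1 > ctx2)
                    swap(ctx1, ctx2);
            mutex_lock(&ctx1->uring_lock);
            mutex_lock_nested(&ctx2->uring_lock, SINGLE_DEPTH_NESTING);
    }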
2025-01-20  x86: use cmov for user address masking  (Linus Torvalds, 2 files, -9/+8)
This was a suggestion by David Laight, and while I was slightly worried that some micro-architecture would predict cmov like a conditional branch, there is little reason to actually believe any core would be that broken. Intel documents that their existing cores treat CMOVcc as a data dependency that will constrain speculation in their "Speculative Execution Side Channel Mitigations" whitepaper: "Other instructions such as CMOVcc, AND, ADC, SBB and SETcc can also be used to prevent bounds check bypass by constraining speculative execution on current family 6 processors (Intel® Core™, Intel® Atom™, Intel® Xeon® and Intel® Xeon Phi™ processors)" and while that leaves the future uarch issues open, that's certainly true of our traditional SBB usage too. Any core that predicts CMOV will be unusable for various crypto algorithms that need data-independent timing stability, so let's just treat CMOV as the safe choice that simplifies the address masking by avoiding an extra instruction and doesn't need a temporary register. Suggested-by: David Laight <David.Laight@aculab.com> Link: https://www.intel.com/content/dam/develop/external/us/en/documents/336996-speculative-execution-side-channel-mitigations.pdf Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-01-20  x86: use proper 'clac' and 'stac' opcode names  (Linus Torvalds, 1 file, -11/+7)
Back when we added SMAP support, all versions of binutils didn't necessarily understand the 'clac' and 'stac' instructions. So we implemented those instructions manually as ".byte" sequences. But we've since upgraded the minimum version of binutils to version 2.25, and that included proper support for the SMAP instructions, and there's no reason for us to use some line noise to express them any more. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-01-20  samples/vfs: fix build warnings  (Christian Brauner, 1 file, -11/+12)
Fix build warnings reported from linux-next. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Link: https://lore.kernel.org/r/20250120192504.4a1965a0@canb.auug.org.au Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-01-20  samples/vfs: use shared header  (Christian Brauner, 3 files, -93/+249)
Share some infrastructure between sample programs and fix a build failure that was reported. Reported-by: Sasha Levin <sashal@kernel.org> Link: https://lore.kernel.org/r/Z42UkSXx0MS9qZ9w@lappy Link: https://qa-reports.linaro.org/lkft/sashal-linus-next/build/v6.13-rc7-511-g109a8e0fa9d6/testrun/26809210/suite/build/test/gcc-8-allyesconfig/log Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-01-19  Linux 6.13  (Linus Torvalds, 1 file, -1/+1)
2025-01-19  io_uring/fdinfo: fix io_uring_show_fdinfo() misuse of ->d_iname  (Al Viro, 1 file, -4/+5)
Output of io_uring_show_fdinfo() has several problems:
* racy use of ->d_iname
* junk if the name is long - in that case it's not stored in ->d_iname at all
* lack of quoting (names can contain newlines, etc. - or be equal to "<none>", for that matter).
* lines for empty slots are pointless noise - we already have the total amount, so having just the non-empty ones would carry the same information.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-17  tracing: gfp: Fix the GFP enum values shown for user space tracing tools  (Steven Rostedt, 1 file, -0/+63)
Tracing tools like perf and trace-cmd read the /sys/kernel/tracing/events/*/*/format files to know how to parse the data and also how to print it. For the "print fmt" portion of that file, if anything uses an enum that is not exported to the tracing system, user space will not be able to parse it. The GFP flags use to be defines, and defines get translated in the print fmt sections. But now they are converted to use enums, which is not. The mm_page_alloc trace event format use to have: print fmt: "page=%p pfn=0x%lx order=%d migratetype=%d gfp_flags=%s", REC->pfn != -1UL ? (((struct page *)vmemmap_base) + (REC->pfn)) : ((void *)0), REC->pfn != -1UL ? REC->pfn : 0, REC->order, REC->migratetype, (REC->gfp_flags) ? __print_flags(REC->gfp_flags, "|", {( unsigned long)(((((((( gfp_t)(0x400u|0x800u)) | (( gfp_t)0x40u) | (( gfp_t)0x80u) | (( gfp_t)0x100000u)) | (( gfp_t)0x02u)) | (( gfp_t)0x08u) | (( gfp_t)0)) | (( gfp_t)0x40000u) | (( gfp_t)0x80000u) | (( gfp_t)0x2000u)) & ~(( gfp_t)(0x400u|0x800u))) | (( gfp_t)0x400u)), "GFP_TRANSHUGE"}, {( unsigned long)((((((( gfp_t)(0x400u|0x800u)) | (( gfp_t)0x40u) | (( gfp_t)0x80u) | (( gfp_t)0x100000u)) | (( gfp_t)0x02u)) | (( gfp_t)0x08u) | (( gfp_t)0)) ... Where the GFP values are shown and not their names. But after the GFP flags were converted to use enums, it has: print fmt: "page=%p pfn=0x%lx order=%d migratetype=%d gfp_flags=%s", REC->pfn != -1UL ? (vmemmap + (REC->pfn)) : ((void *)0), REC->pfn != -1UL ? REC->pfn : 0, REC->order, REC->migratetype, (REC->gfp_flags) ? __print_flags(REC->gfp_flags, "|", {( unsigned long)(((((((( gfp_t)(((((1UL))) << (___GFP_DIRECT_RECLAIM_BIT))|((((1UL))) << (___GFP_KSWAPD_RECLAIM_BIT)))) | (( gfp_t)((((1UL))) << (___GFP_IO_BIT))) | (( gfp_t)((((1UL))) << (___GFP_FS_BIT))) | (( gfp_t)((((1UL))) << (___GFP_HARDWALL_BIT)))) | (( gfp_t)((((1UL))) << (___GFP_HIGHMEM_BIT)))) | (( gfp_t)((((1UL))) << (___GFP_MOVABLE_BIT))) | (( gfp_t)0)) | (( gfp_t)((((1UL))) << (___GFP_COMP_BIT))) ... Where the enums names like ___GFP_KSWAPD_RECLAIM_BIT are shown and not their values. User space has no way to convert these names to their values and the output will fail to parse. What is shown is now: mm_page_alloc: page=0xffffffff981685f3 pfn=0x1d1ac1 order=0 migratetype=1 gfp_flags=0x140cca The TRACE_DEFINE_ENUM() macro was created to handle enums in the print fmt files. This causes them to be replaced at boot up with the numbers, so that user space tooling can parse it. By using this macro, the output is back to the human readable: mm_page_alloc: page=0xffffffff981685f3 pfn=0x122233 order=0 migratetype=1 gfp_flags=GFP_HIGHUSER_MOVABLE|__GFP_COMP Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Veronika Molnarova <vmolnaro@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/20250116214438.749504792@goodmis.org Reported-by: Michael Petlan <mpetlan@redhat.com> Closes: https://lore.kernel.org/all/87be5f7c-1a0-dad-daa0-54e342efaea7@redhat.com/ Fixes: 772dd0342727c ("mm: enumerate all gfp flags") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
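TRACE_DEFINE_ENUM() records an enum symbol's value so the tracing core can substitute the number into the event's print fmt at boot; a sketch of exporting the GFP bit enums quoted above (the exact list in the patch may differ):

    /* in the gfp trace header, next to the event definitions */
    TRACE_DEFINE_ENUM(___GFP_DIRECT_RECLAIM_BIT);
    TRACE_DEFINE_ENUM(___GFP_KSWAPD_RECLAIM_BIT);
    TRACE_DEFINE_ENUM(___GFP_IO_BIT);
    TRACE_DEFINE_ENUM(___GFP_FS_BIT);
    TRACE_DEFINE_ENUM(___GFP_HARDWALL_BIT);
    TRACE_DEFINE_ENUM(___GFP_HIGHMEM_BIT);
    TRACE_DEFINE_ENUM(___GFP_MOVABLE_BIT);
    TRACE_DEFINE_ENUM(___GFP_COMP_BIT);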
2025-01-17  block: Don't trim an atomic write  (John Garry, 1 file, -0/+4)
This is disallowed. This check will now be relevant since the device mapper personalities will start to support atomic writes, and they use this function. Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Link: https://lore.kernel.org/r/20250116170301.474130-3-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-17  block: Add common atomic writes enable flag  (John Garry, 7 files, -7/+11)
Currently only stacked devices need to explicitly enable atomic writes by setting the BLK_FEAT_ATOMIC_WRITES_STACKED flag. This does not work well for device mapper stacking devices, where many sets of limits are stacked and which device is the 'bottom' and which the 'top' can be swapped. This means that BLK_FEAT_ATOMIC_WRITES_STACKED needs to be set for many queue limits, which is messy. Generalize atomic writes enabling by requiring that all devices explicitly set a flag - that includes NVMe, SCSI sd, and md raid. Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Link: https://lore.kernel.org/r/20250116170301.474130-2-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-16  drm/xe: Mark ComputeCS read mode as UC on iGPU  (Matthew Brost, 1 file, -1/+1)
RING_CMD_CCTL read index should be UC on iGPU parts due to L3 caching structure. Having this as WB blocks ULLS from being enabled. Change to UC to unblock ULLS on iGPU. v2: - Drop internal communications comment, bspec is updated Cc: Balasubramani Vivekanandan <balasubramani.vivekanandan@intel.com> Cc: Michal Mrozek <michal.mrozek@intel.com> Cc: Paulo Zanoni <paulo.r.zanoni@intel.com> Cc: José Roberto de Souza <jose.souza@intel.com> Cc: stable@vger.kernel.org Fixes: 328e089bfb37 ("drm/xe: Leverage ComputeCS read L3 caching") Signed-off-by: Matthew Brost <matthew.brost@intel.com> Acked-by: Michal Mrozek <michal.mrozek@intel.com> Reviewed-by: Stuart Summers <stuart.summers@intel.com> Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250114002507.114087-1-matthew.brost@intel.com (cherry picked from commit 758debf35b9cda5450e40996991a6e4b222899bd) Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
2025-01-16  md/md-linear: Fix a NULL vs IS_ERR() bug in linear_add()  (Dan Carpenter, 1 file, -2/+2)
The linear_conf() returns error pointers, it doesn't return NULL. Update the error checking to match. Fixes: 127186cfb184 ("md: reintroduce md-linear") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/add654be-759f-4b2d-93ba-a3726dae380c@stanley.mountain Signed-off-by: Song Liu <song@kernel.org>
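The corrected check pattern, sketched (argument names are assumptions; per the commit, linear_conf() returns ERR_PTR() values and never NULL):

    conf = linear_conf(mddev, raid_disks);
    if (IS_ERR(conf))               /* not: if (!conf) */
            return PTR_ERR(conf);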
2025-01-16  x86/asm: Make serialize() always_inline  (Juergen Gross, 1 file, -1/+1)
In order to allow serialize() to be used from noinstr code, make it __always_inline. Fixes: 0ef8047b737d ("x86/static-call: provide a way to do very early static-call updates") Closes: https://lore.kernel.org/oe-kbuild-all/202412181756.aJvzih2K-lkp@intel.com/ Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20241218100918.22167-1-jgross@suse.com
2025-01-16  pmdomain: imx8mp-blk-ctrl: add missing loop break condition  (Xiaolei Wang, 1 file, -1/+1)
Currently imx8mp_blk_ctrl_remove() will continue the for loop until an out-of-bounds exception occurs. pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : dev_pm_domain_detach+0x8/0x48 lr : imx8mp_blk_ctrl_shutdown+0x58/0x90 sp : ffffffc084f8bbf0 x29: ffffffc084f8bbf0 x28: ffffff80daf32ac0 x27: 0000000000000000 x26: ffffffc081658d78 x25: 0000000000000001 x24: ffffffc08201b028 x23: ffffff80d0db9490 x22: ffffffc082340a78 x21: 00000000000005b0 x20: ffffff80d19bc180 x19: 000000000000000a x18: ffffffffffffffff x17: ffffffc080a39e08 x16: ffffffc080a39c98 x15: 4f435f464f006c72 x14: 0000000000000004 x13: ffffff80d0172110 x12: 0000000000000000 x11: ffffff80d0537740 x10: ffffff80d05376c0 x9 : ffffffc0808ed2d8 x8 : ffffffc084f8bab0 x7 : 0000000000000000 x6 : 0000000000000000 x5 : ffffff80d19b9420 x4 : fffffffe03466e60 x3 : 0000000080800077 x2 : 0000000000000000 x1 : 0000000000000001 x0 : 0000000000000000 Call trace: dev_pm_domain_detach+0x8/0x48 platform_shutdown+0x2c/0x48 device_shutdown+0x158/0x268 kernel_restart_prepare+0x40/0x58 kernel_kexec+0x58/0xe8 __do_sys_reboot+0x198/0x258 __arm64_sys_reboot+0x2c/0x40 invoke_syscall+0x5c/0x138 el0_svc_common.constprop.0+0x48/0xf0 do_el0_svc+0x24/0x38 el0_svc+0x38/0xc8 el0t_64_sync_handler+0x120/0x130 el0t_64_sync+0x190/0x198 Code: 8128c2d0 ffffffc0 aa1e03e9 d503201f Fixes: 556f5cf9568a ("soc: imx: add i.MX8MP HSIO blk-ctrl") Cc: stable@vger.kernel.org Signed-off-by: Xiaolei Wang <xiaolei.wang@windriver.com> Reviewed-by: Lucas Stach <l.stach@pengutronix.de> Reviewed-by: Fabio Estevam <festevam@gmail.com> Reviewed-by: Frank Li <Frank.Li@nxp.com> Link: https://lore.kernel.org/r/20250115014118.4086729-1-xiaolei.wang@windriver.com Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2025-01-16  netdev: avoid CFI problems with sock priv helpers  (Jakub Kicinski, 2 files, -5/+25)
Li Li reports that casting away callback type may cause issues for CFI. Let's generate a small wrapper for each callback, to make sure compiler sees the anticipated types. Reported-by: Li Li <dualli@chromium.org> Link: https://lore.kernel.org/CANBPYPjQVqmzZ4J=rVQX87a9iuwmaetULwbK_5_3YWk2eGzkaA@mail.gmail.com Fixes: 170aafe35cb9 ("netdev: support binding dma-buf to netdevice") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Mina Almasry <almasrymina@google.com> Link: https://patch.msgid.link/20250115161436.648646-1-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-16  hrtimers: Handle CPU state correctly on hotplug  (Koichiro Den, 3 files, -2/+12)
Consider a scenario where a CPU transitions from CPUHP_ONLINE to halfway through a CPU hotunplug down to CPUHP_HRTIMERS_PREPARE, and then back to CPUHP_ONLINE: Since hrtimers_prepare_cpu() does not run, cpu_base.hres_active remains set to 1 throughout. However, during a CPU unplug operation, the tick and the clockevents are shut down at CPUHP_AP_TICK_DYING. On return to the online state, for instance CFS incorrectly assumes that the hrtick is already active, and the chance of the clockevent device to transition to oneshot mode is also lost forever for the CPU, unless it goes back to a lower state than CPUHP_HRTIMERS_PREPARE once. This round-trip reveals another issue; cpu_base.online is not set to 1 after the transition, which appears as a WARN_ON_ONCE in enqueue_hrtimer(). Aside of that, the bulk of the per CPU state is not reset either, which means there are dangling pointers in the worst case. Address this by adding a corresponding startup() callback, which resets the stale per CPU state and sets the online flag. [ tglx: Make the new callback unconditionally available, remove the online modification in the prepare() callback and clear the remaining state in the starting callback instead of the prepare callback ] Fixes: 5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier") Signed-off-by: Koichiro Den <koichiro.den@canonical.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20241220134421.3809834-1-koichiro.den@canonical.com
2025-01-16  timers/migration: Annotate accesses to ignore flag  (Frederic Weisbecker, 1 file, -6/+15)
The group's ignore flag is: _ read under the group's lock (idle entry, remote expiry) _ turned on/off under the group's lock (idle entry, remote expiry) _ turned on locklessly on idle exit When idle entry or remote expiry clear the "ignore" flag of a group, the operation must be synchronized against other concurrent idle entry or remote expiry to make sure the related group timer is never missed. To enforce this synchronization, both "ignore" clear and read are performed under the group lock. On the contrary, whether idle entry or remote expiry manage to observe the "ignore" flag turned on by a CPU exiting idle is a matter of optimization. If that flag set is missed or cleared concurrently, the worst outcome is a migrator wasting time remotely handling a "ghost" timer. This is why the ignore flag can be set locklessly. Unfortunately, the related lockless accesses are bare and miss appropriate annotations. KCSAN rightfully complains: BUG: KCSAN: data-race in __tmigr_cpu_activate / print_report write to 0xffff88842fc28004 of 1 bytes by task 0 on cpu 0: __tmigr_cpu_activate tmigr_cpu_activate timer_clear_idle tick_nohz_restart_sched_tick tick_nohz_idle_exit do_idle cpu_startup_entry kernel_init do_initcalls clear_bss reserve_bios_regions common_startup_64 read to 0xffff88842fc28004 of 1 bytes by task 0 on cpu 1: print_report kcsan_report_known_origin kcsan_setup_watchpoint tmigr_next_groupevt tmigr_update_events tmigr_inactive_up __walk_groups+0x50/0x77 walk_groups __tmigr_cpu_deactivate tmigr_cpu_deactivate __get_next_timer_interrupt timer_base_try_to_set_idle tick_nohz_stop_tick tick_nohz_idle_stop_tick cpuidle_idle_call do_idle Although the relevant accesses could be marked as data_race(), the "ignore" flag being read several times within the same tmigr_update_events() function is confusing and error prone. Prefer reading it once in that function and make use of similar/paired accesses elsewhere with appropriate comments when necessary. Reported-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250114231507.21672-4-frederic@kernel.org Closes: https://lore.kernel.org/oe-lkp/202501031612.62e0c498-lkp@intel.com
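The fix follows the usual KCSAN annotation pattern: mark the lockless store, and read the flag once into a local so every decision in the function uses the same snapshot. A generic sketch with a stand-in structure (the real types live in the timer migration code):

    struct grp_stub {                       /* illustrative stand-in */
            bool ignore;
    };

    /* lockless writer (idle exit): annotate the store */
    static void grp_set_ignore(struct grp_stub *g)
    {
            WRITE_ONCE(g->ignore, true);
    }

    /* reader: fetch once, then use the local copy consistently */
    static bool grp_should_skip(struct grp_stub *g)
    {
            bool ignore = READ_ONCE(g->ignore);

            return ignore;
    }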
2025-01-16  timers/migration: Enforce group initialization visibility to tree walkers  (Frederic Weisbecker, 1 file, -2/+12)
Commit 2522c84db513 ("timers/migration: Fix another race between hotplug and idle entry/exit") fixed yet another race between idle exit and CPU hotplug up leading to a wrong "0" value migrator assigned to the top level. However there is yet another situation that remains unhandled: [GRP0:0] migrator = TMIGR_NONE active = NONE groupmask = 1 / \ \ 0 1 2..7 idle idle idle 0) The system is fully idle. [GRP0:0] migrator = CPU 0 active = CPU 0 groupmask = 1 / \ \ 0 1 2..7 active idle idle 1) CPU 0 is activating. It has done the cmpxchg on the top's ->migr_state but it hasn't yet returned to __walk_groups(). [GRP0:0] migrator = CPU 0 active = CPU 0, CPU 1 groupmask = 1 / \ \ 0 1 2..7 active active idle 2) CPU 1 is activating. CPU 0 stays the migrator (still stuck in __walk_groups(), delayed by #VMEXIT for example). [GRP1:0] migrator = TMIGR_NONE active = NONE groupmask = 1 / \ [GRP0:0] [GRP0:1] migrator = CPU 0 migrator = TMIGR_NONE active = CPU 0, CPU1 active = NONE groupmask = 1 groupmask = 2 / \ \ 0 1 2..7 8 active active idle !online 3) CPU 8 is preparing to boot. CPUHP_TMIGR_PREPARE is being ran by CPU 1 which has created the GRP0:1 and the new top GRP1:0 connected to GRP0:1 and GRP0:0. CPU 1 hasn't yet propagated its activation up to GRP1:0. [GRP1:0] migrator = GRP0:0 active = GRP0:0 groupmask = 1 / \ [GRP0:0] [GRP0:1] migrator = CPU 0 migrator = TMIGR_NONE active = CPU 0, CPU1 active = NONE groupmask = 1 groupmask = 2 / \ \ 0 1 2..7 8 active active idle !online 4) CPU 0 finally resumed after its #VMEXIT. It's in __walk_groups() returning from tmigr_cpu_active(). The new top GRP1:0 is visible and fetched and the pre-initialized groupmask of GRP0:0 is also visible. As a result tmigr_active_up() is called to GRP1:0 with GRP0:0 as active and migrator. CPU 0 is returning to __walk_groups() but suffers again a #VMEXIT. [GRP1:0] migrator = GRP0:0 active = GRP0:0 groupmask = 1 / \ [GRP0:0] [GRP0:1] migrator = CPU 0 migrator = TMIGR_NONE active = CPU 0, CPU1 active = NONE groupmask = 1 groupmask = 2 / \ \ 0 1 2..7 8 active active idle !online 5) CPU 1 propagates its activation of GRP0:0 to GRP1:0. This has no effect since CPU 0 did it already. [GRP1:0] migrator = GRP0:0 active = GRP0:0, GRP0:1 groupmask = 1 / \ [GRP0:0] [GRP0:1] migrator = CPU 0 migrator = CPU 8 active = CPU 0, CPU1 active = CPU 8 groupmask = 1 groupmask = 2 / \ \ \ 0 1 2..7 8 active active idle active 6) CPU 1 links CPU 8 to its group. CPU 8 boots and goes through CPUHP_AP_TMIGR_ONLINE which propagates activation. [GRP2:0] migrator = TMIGR_NONE active = NONE groupmask = 1 / \ [GRP1:0] [GRP1:1] migrator = GRP0:0 migrator = TMIGR_NONE active = GRP0:0, GRP0:1 active = NONE groupmask = 1 groupmask = 2 / \ [GRP0:0] [GRP0:1] [GRP0:2] migrator = CPU 0 migrator = CPU 8 migrator = TMIGR_NONE active = CPU 0, CPU1 active = CPU 8 active = NONE groupmask = 1 groupmask = 2 groupmask = 0 / \ \ \ 0 1 2..7 8 64 active active idle active !online 7) CPU 64 is booting. CPUHP_TMIGR_PREPARE is being ran by CPU 1 which has created the GRP1:1, GRP0:2 and the new top GRP2:0 connected to GRP1:1 and GRP1:0. CPU 1 hasn't yet propagated its activation up to GRP2:0. [GRP2:0] migrator = 0 (!!!) 
active = NONE groupmask = 1 / \ [GRP1:0] [GRP1:1] migrator = GRP0:0 migrator = TMIGR_NONE active = GRP0:0, GRP0:1 active = NONE groupmask = 1 groupmask = 2 / \ [GRP0:0] [GRP0:1] [GRP0:2] migrator = CPU 0 migrator = CPU 8 migrator = TMIGR_NONE active = CPU 0, CPU1 active = CPU 8 active = NONE groupmask = 1 groupmask = 2 groupmask = 0 / \ \ \ 0 1 2..7 8 64 active active idle active !online 8) CPU 0 finally resumed after its #VMEXIT. It's in __walk_groups() returning from tmigr_cpu_active(). The new top GRP2:0 is visible and fetched but the pre-initialized groupmask of GRP1:0 is not because no ordering made its initialization visible. As a result tmigr_active_up() may be called to GRP2:0 with a "0" child's groumask. Leaving the timers ignored for ever when the system is fully idle. The race is highly theoretical and perhaps impossible in practice but the groupmask of the child is not the only concern here as the whole initialization of the child is not guaranteed to be visible to any tree walker racing against hotplug (idle entry/exit, remote handling, etc...). Although the current code layout seem to be resilient to such hazards, this doesn't tell much about the future. Fix this with enforcing address dependency between group initialization and the write/read to the group's parent's pointer. Fortunately that doesn't involve any barrier addition in the fast paths. Fixes: 10a0e6f3d3db ("timers/migration: Move hierarchy setup into cpuhotplug prepare callback") Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20250114231507.21672-3-frederic@kernel.org
2025-01-16  timers/migration: Fix another race between hotplug and idle entry/exit  (Frederic Weisbecker, 1 file, -1/+28)
Commit 10a0e6f3d3db ("timers/migration: Move hierarchy setup into cpuhotplug prepare callback") fixed a race between idle exit and CPU hotplug up leading to a wrong "0" value migrator assigned to the top level. However there is still a situation that remains unhandled: [GRP0:0] migrator = TMIGR_NONE active = NONE groupmask = 0 / \ \ 0 1 2..7 idle idle idle 0) The system is fully idle. [GRP0:0] migrator = CPU 0 active = CPU 0 groupmask = 0 / \ \ 0 1 2..7 active idle idle 1) CPU 0 is activating. It has done the cmpxchg on the top's ->migr_state but it hasn't yet returned to __walk_groups(). [GRP0:0] migrator = CPU 0 active = CPU 0, CPU 1 groupmask = 0 / \ \ 0 1 2..7 active active idle 2) CPU 1 is activating. CPU 0 stays the migrator (still stuck in __walk_groups(), delayed by #VMEXIT for example). [GRP1:0] migrator = TMIGR_NONE active = NONE groupmask = 0 / \ [GRP0:0] [GRP0:1] migrator = CPU 0 migrator = TMIGR_NONE active = CPU 0, CPU1 active = NONE groupmask = 2 groupmask = 1 / \ \ 0 1 2..7 8 active active idle !online 3) CPU 8 is preparing to boot. CPUHP_TMIGR_PREPARE is being ran by CPU 1 which has created the GRP0:1 and the new top GRP1:0 connected to GRP0:1 and GRP0:0. The groupmask of GRP0:0 is now 2. CPU 1 hasn't yet propagated its activation up to GRP1:0. [GRP1:0] migrator = 0 (!!!) active = NONE groupmask = 0 / \ [GRP0:0] [GRP0:1] migrator = CPU 0 migrator = TMIGR_NONE active = CPU 0, CPU1 active = NONE groupmask = 2 groupmask = 1 / \ \ 0 1 2..7 8 active active idle !online 4) CPU 0 finally resumed after its #VMEXIT. It's in __walk_groups() returning from tmigr_cpu_active(). The new top GRP1:0 is visible and fetched but the freshly updated groupmask of GRP0:0 may not be visible due to lack of ordering! As a result tmigr_active_up() is called to GRP0:0 with a child's groupmask of "0". This buggy "0" groupmask then becomes the migrator for GRP1:0 forever. As a result, timers on a fully idle system get ignored. One possible fix would be to define TMIGR_NONE as "0" so that such a race would have no effect. And after all TMIGR_NONE doesn't need to be anything else. However this would leave an uncomfortable state machine where gears happen not to break by chance but are vulnerable to future modifications. Keep TMIGR_NONE as is instead and pre-initialize to "1" the groupmask of any newly created top level. This groupmask is guaranteed to be visible upon fetching the corresponding group for the 1st time: _ By the upcoming CPU thanks to CPU hotplug synchronization between the control CPU (BP) and the booting one (AP). _ By the control CPU since the groupmask and parent pointers are initialized locally. _ By all CPUs belonging to the same group than the control CPU because they must wait for it to ever become idle before needing to walk to the new top. The cmpcxhg() on ->migr_state then makes sure its groupmask is visible. With this pre-initialization, it is guaranteed that if a future top level is linked to an old one, it is walked through with a valid groupmask. Fixes: 10a0e6f3d3db ("timers/migration: Move hierarchy setup into cpuhotplug prepare callback") Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20250114231507.21672-2-frederic@kernel.org
2025-01-16  net/mlx5e: Always start IPsec sequence number from 1  (Leon Romanovsky, 2 files, -3/+14)
According to RFC4303, section "3.3.3. Sequence Number Generation", the first packet sent using a given SA will contain a sequence number of 1. This is applicable to both ESN and non-ESN mode, which was not covered in the commit mentioned in the Fixes line. Fixes: 3d42c8cc67a8 ("net/mlx5e: Ensure that IPsec sequence packet number starts from 1") Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>