Age | Commit message (Collapse) | Author | Files | Lines |
|
IO_NODE_ALLOC_CACHE_MAX has been unused since commit fbbb8e991d86
("io_uring/rsrc: get rid of io_rsrc_node allocation cache") removed the
rsrc_node_cache.
IO_RSRC_TAG_TABLE_SHIFT and IO_RSRC_TAG_TABLE_MASK have been unused
since commit 7029acd8a950 ("io_uring/rsrc: get rid of per-ring
io_rsrc_node list") removed the separate tag table for registered nodes.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Li Zetao <lizetao1@huawei.com>
Link: https://lore.kernel.org/r/20250219033444.2020136-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This is obviously not that important, but when changes are synced back
from the kernel to liburing, the codespell CI ends up erroring because
of this misspelling. Let's just correct it and avoid this biting us
again on an import.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
sqe->opcode is used for different tables, make sure we santitise it
against speculations.
Cc: stable@vger.kernel.org
Fixes: d3656344fea03 ("io_uring: add lookup table for various opcode needs")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Li Zetao <lizetao1@huawei.com>
Link: https://lore.kernel.org/r/7eddbf31c8ca0a3947f8ed98271acc2b4349c016.1739568408.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When io_uring submission goes async for the first time on a given task,
we'll try to create a worker thread to handle the submission. Creating
this worker thread can fail due to various transient conditions, such as
an outstanding signal in the forking thread, so we have retry logic with
a limit of 3 retries. However, this retry logic appears to be too
aggressive/fast - we've observed a thread blowing through the retry
limit while having the same outstanding signal the whole time. Here's an
excerpt of some tracing that demonstrates the issue:
First, signal 26 is generated for the process. It ends up getting routed
to thread 92942.
0) cbd-92284 /* signal_generate: sig=26 errno=0 code=-2 comm=psblkdASD pid=92934 grp=1 res=0 */
This causes create_io_thread in the signalled thread to fail with
ERESTARTNOINTR, and thus a retry is queued.
13) task_th-92942 /* io_uring_queue_async_work: ring 000000007325c9ae, request 0000000080c96d8e, user_data 0x0, opcode URING_CMD, flags 0x8240001, normal queue, work 000000006e96dd3f */
13) task_th-92942 io_wq_enqueue() {
13) task_th-92942 _raw_spin_lock();
13) task_th-92942 io_wq_activate_free_worker();
13) task_th-92942 _raw_spin_lock();
13) task_th-92942 create_io_worker() {
13) task_th-92942 __kmalloc_cache_noprof();
13) task_th-92942 __init_swait_queue_head();
13) task_th-92942 kprobe_ftrace_handler() {
13) task_th-92942 get_kprobe();
13) task_th-92942 aggr_pre_handler() {
13) task_th-92942 pre_handler_kretprobe();
13) task_th-92942 /* create_enter: (create_io_thread+0x0/0x50) fn=0xffffffff8172c0e0 arg=0xffff888996bb69c0 node=-1 */
13) task_th-92942 } /* aggr_pre_handler */
...
13) task_th-92942 } /* copy_process */
13) task_th-92942 } /* create_io_thread */
13) task_th-92942 kretprobe_rethook_handler() {
13) task_th-92942 /* create_exit: (create_io_worker+0x8a/0x1a0 <- create_io_thread) arg1=0xfffffffffffffdff */
13) task_th-92942 } /* kretprobe_rethook_handler */
13) task_th-92942 queue_work_on() {
...
The CPU is then handed to a kworker to process the queued retry:
------------------------------------------
13) task_th-92942 => kworker-54154
------------------------------------------
13) kworker-54154 io_workqueue_create() {
13) kworker-54154 io_queue_worker_create() {
13) kworker-54154 task_work_add() {
13) kworker-54154 wake_up_state() {
13) kworker-54154 try_to_wake_up() {
13) kworker-54154 _raw_spin_lock_irqsave();
13) kworker-54154 _raw_spin_unlock_irqrestore();
13) kworker-54154 } /* try_to_wake_up */
13) kworker-54154 } /* wake_up_state */
13) kworker-54154 kick_process();
13) kworker-54154 } /* task_work_add */
13) kworker-54154 } /* io_queue_worker_create */
13) kworker-54154 } /* io_workqueue_create */
And then we immediately switch back to the original task to try creating
a worker again. This fails, because the original task still hasn't
handled its signal.
-----------------------------------------
13) kworker-54154 => task_th-92942
------------------------------------------
13) task_th-92942 create_worker_cont() {
13) task_th-92942 kprobe_ftrace_handler() {
13) task_th-92942 get_kprobe();
13) task_th-92942 aggr_pre_handler() {
13) task_th-92942 pre_handler_kretprobe();
13) task_th-92942 /* create_enter: (create_io_thread+0x0/0x50) fn=0xffffffff8172c0e0 arg=0xffff888996bb69c0 node=-1 */
13) task_th-92942 } /* aggr_pre_handler */
13) task_th-92942 } /* kprobe_ftrace_handler */
13) task_th-92942 create_io_thread() {
13) task_th-92942 copy_process() {
13) task_th-92942 task_active_pid_ns();
13) task_th-92942 _raw_spin_lock_irq();
13) task_th-92942 recalc_sigpending();
13) task_th-92942 _raw_spin_lock_irq();
13) task_th-92942 } /* copy_process */
13) task_th-92942 } /* create_io_thread */
13) task_th-92942 kretprobe_rethook_handler() {
13) task_th-92942 /* create_exit: (create_worker_cont+0x35/0x1b0 <- create_io_thread) arg1=0xfffffffffffffdff */
13) task_th-92942 } /* kretprobe_rethook_handler */
13) task_th-92942 io_worker_release();
13) task_th-92942 queue_work_on() {
13) task_th-92942 clear_pending_if_disabled();
13) task_th-92942 __queue_work() {
13) task_th-92942 } /* __queue_work */
13) task_th-92942 } /* queue_work_on */
13) task_th-92942 } /* create_worker_cont */
The pattern repeats another couple times until we blow through the retry
counter, at which point we give up. All outstanding work is canceled,
and the io_uring command which triggered all this is failed with
ECANCELED:
13) task_th-92942 io_acct_cancel_pending_work() {
...
13) task_th-92942 /* io_uring_complete: ring 000000007325c9ae, req 0000000080c96d8e, user_data 0x0, result -125, cflags 0x0 extra1 0 extra2 0 */
Finally, the task gets around to processing its outstanding signal 26,
but it's too late.
13) task_th-92942 /* signal_deliver: sig=26 errno=0 code=-2 sa_handler=59566a0 sa_flags=14000000 */
Try to address this issue by adding a small scaling delay when retrying
worker creation. This should give the forking thread time to handle its
signal in the above case. This isn't a particularly satisfying solution,
as sufficiently paradoxical scheduling would still have us hitting the
same issue, and I'm open to suggestions for something better. But this
is likely to prevent this (already rare) issue from hitting in practice.
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Link: https://lore.kernel.org/r/20250208-wq_retry-v2-1-4f6f5041d303@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This isn't generally necessary, but conditions have been observed where
SQE data is accessed from the original SQE after prep has been done and
outside of the initial issue. Opcode prep handlers must ensure that any
SQE related data is stable beyond the prep phase, but uring_cmd is a bit
special in how it handles the SQE which makes it susceptible to reading
stale data. If the application has reused the SQE before the original
completes, then that can lead to data corruption.
Down the line we can relax this again once uring_cmd has been sanitized
a bit, and avoid unnecessarily copying the SQE.
Fixes: 5eff57fa9f3a ("io_uring/uring_cmd: defer SQE copying until it's needed")
Reported-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Li Zetao <lizetao1@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This is the idiomatic way that opcodes should setup their async data,
so that it's always valid inside ->issue() without issue needing to
do that.
Fixes: f31ecf671ddc4 ("io_uring: add IORING_OP_WAITID support")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Any uring_cmd always has async data allocated now, there's no reason to
check and clear a cached copy of the SQE.
Fixes: d10f19dff56e ("io_uring/uring_cmd: switch to always allocating async data")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
5eff57fa9f3a ("io_uring/uring_cmd: defer SQE copying until it's needed")
moved the unconditional memcpy() of the uring_cmd SQE to async_data
to 2 cases when the request goes async:
- If REQ_F_FORCE_ASYNC is set to force the initial issue to go async
- If ->uring_cmd() returns -EAGAIN in the initial non-blocking issue
Unlike the REQ_F_FORCE_ASYNC case, in the EAGAIN case, io_uring_cmd()
copies the SQE to async_data but neglects to update the io_uring_cmd's
sqe field to point to async_data. As a result, sqe still points to the
slot in the userspace-mapped SQ. At the end of io_submit_sqes(), the
kernel advances the SQ head index, allowing userspace to reuse the slot
for a new SQE. If userspace reuses the slot before the io_uring worker
reissues the original SQE, the io_uring_cmd's SQE will be corrupted.
Introduce a helper io_uring_cmd_cache_sqes() to copy the original SQE to
the io_uring_cmd's async_data and point sqe there. Use it for both the
REQ_F_FORCE_ASYNC and EAGAIN cases. This ensures the uring_cmd doesn't
read from the SQ slot after it has been returned to userspace.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Fixes: 5eff57fa9f3a ("io_uring/uring_cmd: defer SQE copying until it's needed")
Link: https://lore.kernel.org/r/20250212204546.3751645-3-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
eaf72f7b414f ("io_uring/uring_cmd: cleanup struct io_uring_cmd_data
layout") removed most of the places assuming struct io_uring_cmd_data
has sqes as its first field. However, the EAGAIN case in io_uring_cmd()
still compares ioucmd->sqe to the struct io_uring_cmd_data pointer using
a void * cast. Since fa3595523d72 ("io_uring: get rid of alloc cache
init_once handling"), sqes is no longer io_uring_cmd_data's first field.
As a result, the pointers will always compare unequal and memcpy() may
be called with the same source and destination.
Replace the incorrect void * cast with the address of the sqes field.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Fixes: eaf72f7b414f ("io_uring/uring_cmd: cleanup struct io_uring_cmd_data layout")
Link: https://lore.kernel.org/r/20250212204546.3751645-2-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
IORING_REGISTER_PBUF_RING can reuse an old struct io_buffer_list if it
was created for legacy selected buffer and has been emptied. It violates
the requirement that most of the field should stay stable after publish.
Always reallocate it instead.
Cc: stable@vger.kernel.org
Reported-by: Pumpkin Chang <pumpkin@devco.re>
Fixes: 2fcabce2d7d34 ("io_uring: disallow mixed provided buffer group registrations")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
struct io_tw_state is managed by core io_uring, and opcode handling code
must never try to cheat and create their own instances, it's plain
incorrect.
io_waitid_complete() attempts exactly that outside of the task work
context, and even though the ring is locked, there would be no one to
reap the requests from the defer completion list. It only works now
because luckily it's called before io_uring_try_cancel_uring_cmd(),
which flushes completions.
Fixes: f31ecf671ddc4 ("io_uring: add IORING_OP_WAITID support")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If a socket is shutdown before the connection completes, POLLERR is set
in the poll mask. However, connect ignores this as it doesn't know, and
attempts the connection again. This may lead to a bogus -ETIMEDOUT
result, where it should have noticed the POLLERR and just returned
-ECONNRESET instead.
Have the poll logic check for whether or not POLLERR is set in the mask,
and if so, mark the request as failed. Then connect can appropriately
fail the request rather than retry it.
Reported-by: Sergey Galas <ssgalas@cloud.ru>
Cc: stable@vger.kernel.org
Link: https://github.com/axboe/liburing/discussions/1335
Fixes: 3fb1bd688172 ("io_uring/net: handle -EINPROGRESS correct for IORING_OP_CONNECT")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Instead of freeing iovecs in case of IO_URING_F_UNLOCKED in
io_rw_recycle(), leave it be and rely on the core io_uring code to
call io_readv_writev_cleanup() later. This way the iovec will get
recycled and we can clean up io_rw_recycle() and kill
io_rw_iovec_free().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/14f83b112eb40078bea18e15d77a4f99fc981a44.1738087204.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Test setups (with KASAN) will avoid !KASAN sections, and so it's not
testing paths that would be exercised otherwise. That's bad as to be
sure that your code works you now have to specifically test both KASAN
and !KASAN configs.
Remove !CONFIG_KASAN guards from io_netmsg_cache_free() and
io_rw_cache_free(). The free functions should always be getting valid
entries, and even though for KASAN iovecs should already be cleared,
that's better than skipping the chunks completely.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/d6078a51c7137a243f9d00849bc3daa660873209.1738087204.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Extract a helper out of io_send() for provided buffer selection to
improve readability as it has grown to take too many lines.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/26a769cdabd61af7f40c5d88a22469c5ad071796.1738087204.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Put msg->msg_iov into a local variable in io_msg_copy_hdr(), it reads
better and clearly shows the used types.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/6a5d4f7a96b10e571d6128be010166b3aaf7afd5.1738087204.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
io_net_vec_assign() can only return 0 and it doesn't make sense for it
to fail, so make it return void.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/7c1a2390c99e17d3ae4e8562063e572d3cdeb164.1738087204.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Avoid inlining all and everything from alloc_cache.h and move cold bits
into a new file.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/06984c6cd58e703f7cfae5ab3067912f9f635a06.1738087204.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Use IS_ENABLED in io_alloc_cache_kasan() so at least it gets compile
tested without KASAN.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/35e53e83f6e16478dca0028a64a6cc905dc764d3.1738087204.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
alloc_cache.h uses types it doesn't declare and thus depends on the
order in which it's included. Make it self contained and pull all needed
definitions.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/39569f3d5b250b4fe78bb609d57f67d3736ebcc4.1738087204.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We do io_kbuf_recycle() when arming a poll but every iteration of a
multishot can grab more buffers, which is why we need to flush the kbuf
ring state before continuing with waiting.
Cc: stable@vger.kernel.org
Fixes: b3fdea6ecb55c ("io_uring: multishot recv")
Reported-by: Muhammad Ramdhan <ramdhan@starlabs.sg>
Reported-by: Bing-Jhong Billy Jheng <billy@starlabs.sg>
Reported-by: Jacob Soo <jacob.soo@starlabs.sg>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1bfc9990fe435f1fc6152ca9efeba5eb3e68339c.1738025570.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
A previous commit changed all of the migration from the old to the new
ring for resizing to use READ/WRITE_ONCE. However, ->sq_flags is an
atomic_t, and while most archs won't complain on this, some will indeed
flag this:
io_uring/register.c:554:9: sparse: sparse: cast to non-scalar
io_uring/register.c:554:9: sparse: sparse: cast from non-scalar
Just use atomic_set/atomic_read for handling this case.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202501242000.A2sKqaCL-lkp@intel.com/
Fixes: 2c5aae129f42 ("io_uring/register: document io_register_resize_rings() shared mem usage")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Just allow passing in NULL for the cache, if the type in question
doesn't have a cache associated with it.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
init_once is called when an object doesn't come from the cache, and
hence needs initial clearing of certain members. While the whole
struct could get cleared by memset() in that case, a few of the cache
members are large enough that this may cause unnecessary overhead if
the caches used aren't large enough to satisfy the workload. For those
cases, some churn of kmalloc+kfree is to be expected.
Ensure that the 3 users that need clearing put the members they need
cleared at the start of the struct, and wrap the rest of the struct in
a struct group so the offset is known.
While at it, improve the interaction with KASAN such that when/if
KASAN writes to members inside the struct that should be retained over
caching, it won't trip over itself. For rw and net, the retaining of
the iovec over caching is disabled if KASAN is enabled. A helper will
free and clear those members in that case.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
A few spots in uring_cmd assume that the SQEs copied are always at the
start of the structure, and hence mix req->async_data and the struct
itself.
Clean that up and use the proper indices.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
io_uring_cmd_sock() does a normal read of cmd->sqe->cmd_op, where it
really should be using a READ_ONCE() as ->sqe may still be pointing to
the original SQE. Since the prep side already does this READ_ONCE() and
stores it locally, use that value rather than re-read it.
Fixes: 8e9fad0e70b7b ("io_uring: Add io_uring command support for sockets")
Link: https://lore.kernel.org/r/20250121-uring-sockcmd-fix-v1-1-add742802a29@google.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
For remote posting of messages, req->tctx is assigned even though it
is never used. Rather than leave a dangling pointer, just clear it to
NULL and use the previous check for a valid submitter_task to gate on
whether or not the request should be terminated.
Reported-by: Jann Horn <jannh@google.com>
Fixes: b6f58a3f4aa8 ("io_uring: move struct io_kiocb from task_struct to io_uring_task")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Checking for lockdep_assert_held(&ctx->uring_lock) in io_free_rsrc_node()
means that the assertion is only checked when the resource drops to zero
references.
Move the lockdep assertion up into the caller io_put_rsrc_node() so that it
instead happens on every reference count decrement.
Signed-off-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/r/20250120-uring-lockdep-assert-earlier-v1-1-68d8e071a4bb@google.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
io_uring_ctx parameter for io_rsrc_node_alloc() is unused for now.
This patch removes the parameter and fixes the callers accordingly.
Signed-off-by: Sidong Yang <sidong.yang@furiosa.ai>
Link: https://lore.kernel.org/r/20250115142033.658599-1-sidong.yang@furiosa.ai
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Make it always reference the returned file. It's safer, especially with
unregistrations happening under it. And it makes the api cleaner with no
conditional clean ups by the caller.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0d0b13a63e8edd6b5d360fc821dcdb035cb6b7e0.1736995897.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The locking in the buffer cloning code is somewhat complex because it goes
back and forth between locking the source ring and the destination ring.
Make it easier to reason about by locking both rings at the same time.
To avoid ABBA deadlocks, lock the rings in ascending kernel address order,
just like in lock_two_nondirectories().
Signed-off-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/r/20250115-uring-clone-refactor-v2-1-7289ba50776d@google.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This was a suggestion by David Laight, and while I was slightly worried
that some micro-architecture would predict cmov like a conditional
branch, there is little reason to actually believe any core would be
that broken.
Intel documents that their existing cores treat CMOVcc as a data
dependency that will constrain speculation in their "Speculative
Execution Side Channel Mitigations" whitepaper:
"Other instructions such as CMOVcc, AND, ADC, SBB and SETcc can also
be used to prevent bounds check bypass by constraining speculative
execution on current family 6 processors (Intel® Core™, Intel® Atom™,
Intel® Xeon® and Intel® Xeon Phi™ processors)"
and while that leaves the future uarch issues open, that's certainly
true of our traditional SBB usage too.
Any core that predicts CMOV will be unusable for various crypto
algorithms that need data-independent timing stability, so let's just
treat CMOV as the safe choice that simplifies the address masking by
avoiding an extra instruction and doesn't need a temporary register.
Suggested-by: David Laight <David.Laight@aculab.com>
Link: https://www.intel.com/content/dam/develop/external/us/en/documents/336996-speculative-execution-side-channel-mitigations.pdf
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Back when we added SMAP support, all versions of binutils didn't
necessarily understand the 'clac' and 'stac' instructions. So we
implemented those instructions manually as ".byte" sequences.
But we've since upgraded the minimum version of binutils to version
2.25, and that included proper support for the SMAP instructions, and
there's no reason for us to use some line noise to express them any
more.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Fix build warnings reported from linux-next.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Link: https://lore.kernel.org/r/20250120192504.4a1965a0@canb.auug.org.au
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Share some infrastructure between sample programs and fix a build
failure that was reported.
Reported-by: Sasha Levin <sashal@kernel.org>
Link: https://lore.kernel.org/r/Z42UkSXx0MS9qZ9w@lappy
Link: https://qa-reports.linaro.org/lkft/sashal-linus-next/build/v6.13-rc7-511-g109a8e0fa9d6/testrun/26809210/suite/build/test/gcc-8-allyesconfig/log
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
|
|
Output of io_uring_show_fdinfo() has several problems:
* racy use of ->d_iname
* junk if the name is long - in that case it's not stored in ->d_iname
at all
* lack of quoting (names can contain newlines, etc. - or be equal to "<none>",
for that matter).
* lines for empty slots are pointless noise - we already have the total
amount, so having just the non-empty ones would carry the same information.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Tracing tools like perf and trace-cmd read the /sys/kernel/tracing/events/*/*/format
files to know how to parse the data and also how to print it. For the
"print fmt" portion of that file, if anything uses an enum that is not
exported to the tracing system, user space will not be able to parse it.
The GFP flags use to be defines, and defines get translated in the print
fmt sections. But now they are converted to use enums, which is not.
The mm_page_alloc trace event format use to have:
print fmt: "page=%p pfn=0x%lx order=%d migratetype=%d gfp_flags=%s",
REC->pfn != -1UL ? (((struct page *)vmemmap_base) + (REC->pfn)) : ((void
*)0), REC->pfn != -1UL ? REC->pfn : 0, REC->order, REC->migratetype,
(REC->gfp_flags) ? __print_flags(REC->gfp_flags, "|", {( unsigned
long)(((((((( gfp_t)(0x400u|0x800u)) | (( gfp_t)0x40u) | (( gfp_t)0x80u) |
(( gfp_t)0x100000u)) | (( gfp_t)0x02u)) | (( gfp_t)0x08u) | (( gfp_t)0)) |
(( gfp_t)0x40000u) | (( gfp_t)0x80000u) | (( gfp_t)0x2000u)) & ~((
gfp_t)(0x400u|0x800u))) | (( gfp_t)0x400u)), "GFP_TRANSHUGE"}, {( unsigned
long)((((((( gfp_t)(0x400u|0x800u)) | (( gfp_t)0x40u) | (( gfp_t)0x80u) |
(( gfp_t)0x100000u)) | (( gfp_t)0x02u)) | (( gfp_t)0x08u) | (( gfp_t)0)) ...
Where the GFP values are shown and not their names. But after the GFP
flags were converted to use enums, it has:
print fmt: "page=%p pfn=0x%lx order=%d migratetype=%d gfp_flags=%s",
REC->pfn != -1UL ? (vmemmap + (REC->pfn)) : ((void *)0), REC->pfn != -1UL
? REC->pfn : 0, REC->order, REC->migratetype, (REC->gfp_flags) ?
__print_flags(REC->gfp_flags, "|", {( unsigned long)((((((((
gfp_t)(((((1UL))) << (___GFP_DIRECT_RECLAIM_BIT))|((((1UL))) <<
(___GFP_KSWAPD_RECLAIM_BIT)))) | (( gfp_t)((((1UL))) << (___GFP_IO_BIT)))
| (( gfp_t)((((1UL))) << (___GFP_FS_BIT))) | (( gfp_t)((((1UL))) <<
(___GFP_HARDWALL_BIT)))) | (( gfp_t)((((1UL))) << (___GFP_HIGHMEM_BIT))))
| (( gfp_t)((((1UL))) << (___GFP_MOVABLE_BIT))) | (( gfp_t)0)) | ((
gfp_t)((((1UL))) << (___GFP_COMP_BIT))) ...
Where the enums names like ___GFP_KSWAPD_RECLAIM_BIT are shown and not their
values. User space has no way to convert these names to their values and
the output will fail to parse. What is shown is now:
mm_page_alloc: page=0xffffffff981685f3 pfn=0x1d1ac1 order=0 migratetype=1 gfp_flags=0x140cca
The TRACE_DEFINE_ENUM() macro was created to handle enums in the print fmt
files. This causes them to be replaced at boot up with the numbers, so
that user space tooling can parse it. By using this macro, the output is
back to the human readable:
mm_page_alloc: page=0xffffffff981685f3 pfn=0x122233 order=0 migratetype=1 gfp_flags=GFP_HIGHUSER_MOVABLE|__GFP_COMP
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Veronika Molnarova <vmolnaro@redhat.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/20250116214438.749504792@goodmis.org
Reported-by: Michael Petlan <mpetlan@redhat.com>
Closes: https://lore.kernel.org/all/87be5f7c-1a0-dad-daa0-54e342efaea7@redhat.com/
Fixes: 772dd0342727c ("mm: enumerate all gfp flags")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
This is disallowed.
This check will now be relevant since the device mapper personalities
will start to support atomic writes, and they use this function.
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Link: https://lore.kernel.org/r/20250116170301.474130-3-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Currently only stacked devices need to explicitly enable atomic writes by
setting BLK_FEAT_ATOMIC_WRITES_STACKED flag.
This does not work well for device mapper stacking devices, as there many
sets of limits are stacked and what is the 'bottom' and 'top' device can
swapped. This means that BLK_FEAT_ATOMIC_WRITES_STACKED needs to be set
for many queue limits, which is messy.
Generalize enabling atomic writes enabling by ensuring that all devices
must explicitly set a flag - that includes NVMe, SCSI sd, and md raid.
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Link: https://lore.kernel.org/r/20250116170301.474130-2-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
RING_CMD_CCTL read index should be UC on iGPU parts due to L3 caching
structure. Having this as WB blocks ULLS from being enabled. Change to
UC to unblock ULLS on iGPU.
v2:
- Drop internal communications commnet, bspec is updated
Cc: Balasubramani Vivekanandan <balasubramani.vivekanandan@intel.com>
Cc: Michal Mrozek <michal.mrozek@intel.com>
Cc: Paulo Zanoni <paulo.r.zanoni@intel.com>
Cc: José Roberto de Souza <jose.souza@intel.com>
Cc: stable@vger.kernel.org
Fixes: 328e089bfb37 ("drm/xe: Leverage ComputeCS read L3 caching")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Acked-by: Michal Mrozek <michal.mrozek@intel.com>
Reviewed-by: Stuart Summers <stuart.summers@intel.com>
Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250114002507.114087-1-matthew.brost@intel.com
(cherry picked from commit 758debf35b9cda5450e40996991a6e4b222899bd)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
|
|
The linear_conf() returns error pointers, it doesn't return NULL. Update
the error checking to match.
Fixes: 127186cfb184 ("md: reintroduce md-linear")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/add654be-759f-4b2d-93ba-a3726dae380c@stanley.mountain
Signed-off-by: Song Liu <song@kernel.org>
|
|
In order to allow serialize() to be used from noinstr code, make it
__always_inline.
Fixes: 0ef8047b737d ("x86/static-call: provide a way to do very early static-call updates")
Closes: https://lore.kernel.org/oe-kbuild-all/202412181756.aJvzih2K-lkp@intel.com/
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20241218100918.22167-1-jgross@suse.com
|
|
Currently imx8mp_blk_ctrl_remove() will continue the for loop
until an out-of-bounds exception occurs.
pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : dev_pm_domain_detach+0x8/0x48
lr : imx8mp_blk_ctrl_shutdown+0x58/0x90
sp : ffffffc084f8bbf0
x29: ffffffc084f8bbf0 x28: ffffff80daf32ac0 x27: 0000000000000000
x26: ffffffc081658d78 x25: 0000000000000001 x24: ffffffc08201b028
x23: ffffff80d0db9490 x22: ffffffc082340a78 x21: 00000000000005b0
x20: ffffff80d19bc180 x19: 000000000000000a x18: ffffffffffffffff
x17: ffffffc080a39e08 x16: ffffffc080a39c98 x15: 4f435f464f006c72
x14: 0000000000000004 x13: ffffff80d0172110 x12: 0000000000000000
x11: ffffff80d0537740 x10: ffffff80d05376c0 x9 : ffffffc0808ed2d8
x8 : ffffffc084f8bab0 x7 : 0000000000000000 x6 : 0000000000000000
x5 : ffffff80d19b9420 x4 : fffffffe03466e60 x3 : 0000000080800077
x2 : 0000000000000000 x1 : 0000000000000001 x0 : 0000000000000000
Call trace:
dev_pm_domain_detach+0x8/0x48
platform_shutdown+0x2c/0x48
device_shutdown+0x158/0x268
kernel_restart_prepare+0x40/0x58
kernel_kexec+0x58/0xe8
__do_sys_reboot+0x198/0x258
__arm64_sys_reboot+0x2c/0x40
invoke_syscall+0x5c/0x138
el0_svc_common.constprop.0+0x48/0xf0
do_el0_svc+0x24/0x38
el0_svc+0x38/0xc8
el0t_64_sync_handler+0x120/0x130
el0t_64_sync+0x190/0x198
Code: 8128c2d0 ffffffc0 aa1e03e9 d503201f
Fixes: 556f5cf9568a ("soc: imx: add i.MX8MP HSIO blk-ctrl")
Cc: stable@vger.kernel.org
Signed-off-by: Xiaolei Wang <xiaolei.wang@windriver.com>
Reviewed-by: Lucas Stach <l.stach@pengutronix.de>
Reviewed-by: Fabio Estevam <festevam@gmail.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Link: https://lore.kernel.org/r/20250115014118.4086729-1-xiaolei.wang@windriver.com
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
|
|
Li Li reports that casting away callback type may cause issues
for CFI. Let's generate a small wrapper for each callback,
to make sure compiler sees the anticipated types.
Reported-by: Li Li <dualli@chromium.org>
Link: https://lore.kernel.org/CANBPYPjQVqmzZ4J=rVQX87a9iuwmaetULwbK_5_3YWk2eGzkaA@mail.gmail.com
Fixes: 170aafe35cb9 ("netdev: support binding dma-buf to netdevice")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20250115161436.648646-1-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Consider a scenario where a CPU transitions from CPUHP_ONLINE to halfway
through a CPU hotunplug down to CPUHP_HRTIMERS_PREPARE, and then back to
CPUHP_ONLINE:
Since hrtimers_prepare_cpu() does not run, cpu_base.hres_active remains set
to 1 throughout. However, during a CPU unplug operation, the tick and the
clockevents are shut down at CPUHP_AP_TICK_DYING. On return to the online
state, for instance CFS incorrectly assumes that the hrtick is already
active, and the chance of the clockevent device to transition to oneshot
mode is also lost forever for the CPU, unless it goes back to a lower state
than CPUHP_HRTIMERS_PREPARE once.
This round-trip reveals another issue; cpu_base.online is not set to 1
after the transition, which appears as a WARN_ON_ONCE in enqueue_hrtimer().
Aside of that, the bulk of the per CPU state is not reset either, which
means there are dangling pointers in the worst case.
Address this by adding a corresponding startup() callback, which resets the
stale per CPU state and sets the online flag.
[ tglx: Make the new callback unconditionally available, remove the online
modification in the prepare() callback and clear the remaining
state in the starting callback instead of the prepare callback ]
Fixes: 5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
Signed-off-by: Koichiro Den <koichiro.den@canonical.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20241220134421.3809834-1-koichiro.den@canonical.com
|
|
The group's ignore flag is:
_ read under the group's lock (idle entry, remote expiry)
_ turned on/off under the group's lock (idle entry, remote expiry)
_ turned on locklessly on idle exit
When idle entry or remote expiry clear the "ignore" flag of a group, the
operation must be synchronized against other concurrent idle entry or
remote expiry to make sure the related group timer is never missed. To
enforce this synchronization, both "ignore" clear and read are
performed under the group lock.
On the contrary, whether idle entry or remote expiry manage to observe
the "ignore" flag turned on by a CPU exiting idle is a matter of
optimization. If that flag set is missed or cleared concurrently, the
worst outcome is a migrator wasting time remotely handling a "ghost"
timer. This is why the ignore flag can be set locklessly.
Unfortunately, the related lockless accesses are bare and miss
appropriate annotations. KCSAN rightfully complains:
BUG: KCSAN: data-race in __tmigr_cpu_activate / print_report
write to 0xffff88842fc28004 of 1 bytes by task 0 on cpu 0:
__tmigr_cpu_activate
tmigr_cpu_activate
timer_clear_idle
tick_nohz_restart_sched_tick
tick_nohz_idle_exit
do_idle
cpu_startup_entry
kernel_init
do_initcalls
clear_bss
reserve_bios_regions
common_startup_64
read to 0xffff88842fc28004 of 1 bytes by task 0 on cpu 1:
print_report
kcsan_report_known_origin
kcsan_setup_watchpoint
tmigr_next_groupevt
tmigr_update_events
tmigr_inactive_up
__walk_groups+0x50/0x77
walk_groups
__tmigr_cpu_deactivate
tmigr_cpu_deactivate
__get_next_timer_interrupt
timer_base_try_to_set_idle
tick_nohz_stop_tick
tick_nohz_idle_stop_tick
cpuidle_idle_call
do_idle
Although the relevant accesses could be marked as data_race(), the
"ignore" flag being read several times within the same
tmigr_update_events() function is confusing and error prone. Prefer
reading it once in that function and make use of similar/paired accesses
elsewhere with appropriate comments when necessary.
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250114231507.21672-4-frederic@kernel.org
Closes: https://lore.kernel.org/oe-lkp/202501031612.62e0c498-lkp@intel.com
|
|
Commit 2522c84db513 ("timers/migration: Fix another race between hotplug
and idle entry/exit") fixed yet another race between idle exit and CPU
hotplug up leading to a wrong "0" value migrator assigned to the top
level. However there is yet another situation that remains unhandled:
[GRP0:0]
migrator = TMIGR_NONE
active = NONE
groupmask = 1
/ \ \
0 1 2..7
idle idle idle
0) The system is fully idle.
[GRP0:0]
migrator = CPU 0
active = CPU 0
groupmask = 1
/ \ \
0 1 2..7
active idle idle
1) CPU 0 is activating. It has done the cmpxchg on the top's ->migr_state
but it hasn't yet returned to __walk_groups().
[GRP0:0]
migrator = CPU 0
active = CPU 0, CPU 1
groupmask = 1
/ \ \
0 1 2..7
active active idle
2) CPU 1 is activating. CPU 0 stays the migrator (still stuck in
__walk_groups(), delayed by #VMEXIT for example).
[GRP1:0]
migrator = TMIGR_NONE
active = NONE
groupmask = 1
/ \
[GRP0:0] [GRP0:1]
migrator = CPU 0 migrator = TMIGR_NONE
active = CPU 0, CPU1 active = NONE
groupmask = 1 groupmask = 2
/ \ \
0 1 2..7 8
active active idle !online
3) CPU 8 is preparing to boot. CPUHP_TMIGR_PREPARE is being ran by CPU 1
which has created the GRP0:1 and the new top GRP1:0 connected to GRP0:1
and GRP0:0. CPU 1 hasn't yet propagated its activation up to GRP1:0.
[GRP1:0]
migrator = GRP0:0
active = GRP0:0
groupmask = 1
/ \
[GRP0:0] [GRP0:1]
migrator = CPU 0 migrator = TMIGR_NONE
active = CPU 0, CPU1 active = NONE
groupmask = 1 groupmask = 2
/ \ \
0 1 2..7 8
active active idle !online
4) CPU 0 finally resumed after its #VMEXIT. It's in __walk_groups()
returning from tmigr_cpu_active(). The new top GRP1:0 is visible and
fetched and the pre-initialized groupmask of GRP0:0 is also visible.
As a result tmigr_active_up() is called to GRP1:0 with GRP0:0 as active
and migrator. CPU 0 is returning to __walk_groups() but suffers again
a #VMEXIT.
[GRP1:0]
migrator = GRP0:0
active = GRP0:0
groupmask = 1
/ \
[GRP0:0] [GRP0:1]
migrator = CPU 0 migrator = TMIGR_NONE
active = CPU 0, CPU1 active = NONE
groupmask = 1 groupmask = 2
/ \ \
0 1 2..7 8
active active idle !online
5) CPU 1 propagates its activation of GRP0:0 to GRP1:0. This has no
effect since CPU 0 did it already.
[GRP1:0]
migrator = GRP0:0
active = GRP0:0, GRP0:1
groupmask = 1
/ \
[GRP0:0] [GRP0:1]
migrator = CPU 0 migrator = CPU 8
active = CPU 0, CPU1 active = CPU 8
groupmask = 1 groupmask = 2
/ \ \ \
0 1 2..7 8
active active idle active
6) CPU 1 links CPU 8 to its group. CPU 8 boots and goes through
CPUHP_AP_TMIGR_ONLINE which propagates activation.
[GRP2:0]
migrator = TMIGR_NONE
active = NONE
groupmask = 1
/ \
[GRP1:0] [GRP1:1]
migrator = GRP0:0 migrator = TMIGR_NONE
active = GRP0:0, GRP0:1 active = NONE
groupmask = 1 groupmask = 2
/ \
[GRP0:0] [GRP0:1] [GRP0:2]
migrator = CPU 0 migrator = CPU 8 migrator = TMIGR_NONE
active = CPU 0, CPU1 active = CPU 8 active = NONE
groupmask = 1 groupmask = 2 groupmask = 0
/ \ \ \
0 1 2..7 8 64
active active idle active !online
7) CPU 64 is booting. CPUHP_TMIGR_PREPARE is being ran by CPU 1
which has created the GRP1:1, GRP0:2 and the new top GRP2:0 connected to
GRP1:1 and GRP1:0. CPU 1 hasn't yet propagated its activation up to
GRP2:0.
[GRP2:0]
migrator = 0 (!!!)
active = NONE
groupmask = 1
/ \
[GRP1:0] [GRP1:1]
migrator = GRP0:0 migrator = TMIGR_NONE
active = GRP0:0, GRP0:1 active = NONE
groupmask = 1 groupmask = 2
/ \
[GRP0:0] [GRP0:1] [GRP0:2]
migrator = CPU 0 migrator = CPU 8 migrator = TMIGR_NONE
active = CPU 0, CPU1 active = CPU 8 active = NONE
groupmask = 1 groupmask = 2 groupmask = 0
/ \ \ \
0 1 2..7 8 64
active active idle active !online
8) CPU 0 finally resumed after its #VMEXIT. It's in __walk_groups()
returning from tmigr_cpu_active(). The new top GRP2:0 is visible and
fetched but the pre-initialized groupmask of GRP1:0 is not because no
ordering made its initialization visible. As a result tmigr_active_up()
may be called to GRP2:0 with a "0" child's groumask. Leaving the timers
ignored for ever when the system is fully idle.
The race is highly theoretical and perhaps impossible in practice but
the groupmask of the child is not the only concern here as the whole
initialization of the child is not guaranteed to be visible to any
tree walker racing against hotplug (idle entry/exit, remote handling,
etc...). Although the current code layout seem to be resilient to such
hazards, this doesn't tell much about the future.
Fix this with enforcing address dependency between group initialization
and the write/read to the group's parent's pointer. Fortunately that
doesn't involve any barrier addition in the fast paths.
Fixes: 10a0e6f3d3db ("timers/migration: Move hierarchy setup into cpuhotplug prepare callback")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20250114231507.21672-3-frederic@kernel.org
|
|
Commit 10a0e6f3d3db ("timers/migration: Move hierarchy setup into
cpuhotplug prepare callback") fixed a race between idle exit and CPU
hotplug up leading to a wrong "0" value migrator assigned to the top
level. However there is still a situation that remains unhandled:
[GRP0:0]
migrator = TMIGR_NONE
active = NONE
groupmask = 0
/ \ \
0 1 2..7
idle idle idle
0) The system is fully idle.
[GRP0:0]
migrator = CPU 0
active = CPU 0
groupmask = 0
/ \ \
0 1 2..7
active idle idle
1) CPU 0 is activating. It has done the cmpxchg on the top's ->migr_state
but it hasn't yet returned to __walk_groups().
[GRP0:0]
migrator = CPU 0
active = CPU 0, CPU 1
groupmask = 0
/ \ \
0 1 2..7
active active idle
2) CPU 1 is activating. CPU 0 stays the migrator (still stuck in
__walk_groups(), delayed by #VMEXIT for example).
[GRP1:0]
migrator = TMIGR_NONE
active = NONE
groupmask = 0
/ \
[GRP0:0] [GRP0:1]
migrator = CPU 0 migrator = TMIGR_NONE
active = CPU 0, CPU1 active = NONE
groupmask = 2 groupmask = 1
/ \ \
0 1 2..7 8
active active idle !online
3) CPU 8 is preparing to boot. CPUHP_TMIGR_PREPARE is being ran by CPU 1
which has created the GRP0:1 and the new top GRP1:0 connected to GRP0:1
and GRP0:0. The groupmask of GRP0:0 is now 2. CPU 1 hasn't yet
propagated its activation up to GRP1:0.
[GRP1:0]
migrator = 0 (!!!)
active = NONE
groupmask = 0
/ \
[GRP0:0] [GRP0:1]
migrator = CPU 0 migrator = TMIGR_NONE
active = CPU 0, CPU1 active = NONE
groupmask = 2 groupmask = 1
/ \ \
0 1 2..7 8
active active idle !online
4) CPU 0 finally resumed after its #VMEXIT. It's in __walk_groups()
returning from tmigr_cpu_active(). The new top GRP1:0 is visible and
fetched but the freshly updated groupmask of GRP0:0 may not be visible
due to lack of ordering! As a result tmigr_active_up() is called to
GRP0:0 with a child's groupmask of "0". This buggy "0" groupmask then
becomes the migrator for GRP1:0 forever. As a result, timers on a fully
idle system get ignored.
One possible fix would be to define TMIGR_NONE as "0" so that such a
race would have no effect. And after all TMIGR_NONE doesn't need to be
anything else. However this would leave an uncomfortable state machine
where gears happen not to break by chance but are vulnerable to future
modifications.
Keep TMIGR_NONE as is instead and pre-initialize to "1" the groupmask of
any newly created top level. This groupmask is guaranteed to be visible
upon fetching the corresponding group for the 1st time:
_ By the upcoming CPU thanks to CPU hotplug synchronization between the
control CPU (BP) and the booting one (AP).
_ By the control CPU since the groupmask and parent pointers are
initialized locally.
_ By all CPUs belonging to the same group than the control CPU because
they must wait for it to ever become idle before needing to walk to
the new top. The cmpcxhg() on ->migr_state then makes sure its
groupmask is visible.
With this pre-initialization, it is guaranteed that if a future top level
is linked to an old one, it is walked through with a valid groupmask.
Fixes: 10a0e6f3d3db ("timers/migration: Move hierarchy setup into cpuhotplug prepare callback")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20250114231507.21672-2-frederic@kernel.org
|
|
According to RFC4303, section "3.3.3. Sequence Number Generation",
the first packet sent using a given SA will contain a sequence
number of 1.
This is applicable to both ESN and non-ESN mode, which was not covered
in commit mentioned in Fixes line.
Fixes: 3d42c8cc67a8 ("net/mlx5e: Ensure that IPsec sequence packet number starts from 1")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|