Age | Commit message (Collapse) | Author | Files | Lines |
|
Replace the dprintk in nfsd_lookup_dentry() with a trace point.
nfsd_lookup_dentry() is called frequently enough that enabling this
dprintk call site would result in log floods and performance issues.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Turn Sargun's internal kprobe based implementation of this into a normal
static tracepoint. Also, remove the dprintk's that got added recently
with the fix for zero-length ACLs.
Cc: Sargun Dillon <sargun@sargun.me>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Introduce tracing helpers that can be used before the procedure
status code is known. These macros are similar to the
SVC_RQST_ENDPOINT helpers, but they can be modified to include
NFS-specific fields if that is needed later.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Record and emit presentation addresses using tracing helpers
designed for the task.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
RFC 7862 states that if an NFS server implements a CLONE operation,
it MUST also implement FATTR4_CLONE_BLKSIZE. NFSD implements CLONE,
but does not implement FATTR4_CLONE_BLKSIZE.
Note that in Section 12.2, RFC 7862 claims that
FATTR4_CLONE_BLKSIZE is RECOMMENDED, not REQUIRED. Likely this is
because a minor version is not permitted to add a REQUIRED
attribute. Confusing.
We assume this attribute reports a block size as a count of bytes,
as RFC 7862 does not specify a unit.
Reported-by: Roland Mainz <roland.mainz@nrubsig.org>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Roland Mainz <roland.mainz@nrubsig.org>
Cc: stable@vger.kernel.org # v6.7+
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
This user of SHA-256 does not support any other algorithm, so the
crypto_shash abstraction provides no value. Just use the SHA-256
library API instead, which is much simpler and easier to use.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
To handle device removal, svc_rdma_accept() requests removal
notification for the underlying device when accepting a connection.
However svc_rdma_free() is not invoked if svc_rdma_accept() fails.
There needs to be a matching "unregister" in that case; otherwise
the device cannot be removed.
Fixes: c4de97f7c454 ("svcrdma: Handle device removal outside of the CM event handler")
Cc: stable@vger.kernel.org
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
The connection backlog passed to listen() denotes the number of
connections that are fully established, but that have not yet been
accept()ed. If the amount goes above that level, new connection requests
will be dropped on the floor until the value goes down. If all the knfsd
threads are bogged down in (e.g.) disk I/O, new connection attempts can
stall because of this.
For the same rationale that Trond points out in the userland patch [1],
ensure that svc_xprt sockets created by the kernel allow SOMAXCONN
(4096) backlogged connections instead of the 64 that they do today.
[1]: https://lore.kernel.org/linux-nfs/20240308180223.2965601-1-trond.myklebust@hammerspace.com/
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
In nfs4_state_start_net(), laundromat_work may access nfsd_ssc through
nfs4_laundromat -> nfsd4_ssc_expire_umount. If nfsd_ssc isn't initialized,
this can cause NULL pointer dereference.
Normally the delayed start of laundromat_work allows sufficient time for
nfsd_ssc initialization to complete. However, when the kernel waits too
long for userspace responses (e.g. in nfs4_state_start_net ->
nfsd4_end_grace -> nfsd4_record_grace_done -> nfsd4_cld_grace_done ->
cld_pipe_upcall -> __cld_pipe_upcall -> wait_for_completion path), the
delayed work may start before nfsd_ssc initialization finishes.
Fix this by moving nfsd_ssc initialization before starting laundromat_work.
Fixes: f4e44b393389 ("NFSD: delay unmount source's export after inter-server copy completed.")
Cc: stable@vger.kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Neil is planning retirement, and has asked me to replace his Suse
email address with his personal email address. Both addresses
currently route to the same mailbox.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
I've been looking at a problem where we see increased RPC timeouts in
clients when the nfs_layout_flexfiles dataserver_timeo value is tuned
very low (6s). This is necessary to ensure quick failover to a different
mirror if a server goes down, but it causes a lot more major RPC timeouts.
Ultimately, the problem is server-side however. It's sometimes doesn't
respond to connection attempts. My theory is that the interrupt handler
runs when a connection comes in, the xprt ends up being enqueued, but it
takes a significant amount of time for the nfsd thread to pick it up.
Currently, the svc_xprt_dequeue tracepoint displays "wakeup-us". This is
the time between the wake_up() call, and the thread dequeueing the xprt.
If no thread was woken, or the thread ended up picking up a different
xprt than intended, then this value won't tell us how long the xprt was
waiting.
Add a new xpt_qtime field to struct svc_xprt and set it in
svc_xprt_enqueue(). When the dequeue tracepoint fires, also store the
time that the xprt sat on the queue in total. Display it as "qtime-us".
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Very useful for gauging how long the vfs_fsync_range() takes.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
If the request being processed is not a v4 compound request, then
examining the cstate can have undefined results.
This patch adds a check that the rpc procedure being executed
(rq_procinfo) is the NFSPROC4_COMPOUND procedure.
Reported-by: Olga Kornievskaia <okorniev@redhat.com>
Cc: stable@vger.kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neil@brown.name>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
When an export policy with xprtsec policy is set with "tls"
and/or "mtls", but an NFS client is doing a v3 xprtsec=tls
mount, then NLM locking calls fail with an error because
there is currently no support for NLM with TLS.
Until such support is added, allow NLM calls under TLS-secured
policy.
Fixes: 4cc9b9f2bf4d ("nfsd: refine and rename NFSD_MAY_LOCK")
Cc: stable@vger.kernel.org
Signed-off-by: Olga Kornievskaia <okorniev@redhat.com>
Reviewed-by: NeilBrown <neil@brown.name>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
It can be removed since svc_fill_write_vector already has the
same WARN_ON_ONCE.
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
NFSD currently has two separate code paths for handling read
requests. One uses page splicing; the other is a traditional read
based on an iov iterator.
Because most Linux file systems support splice read, the latter
does not get nearly the same test experience as splice reads.
To force the use of vectored reads for testing and benchmarking,
introduce the ability to disable splice reads for all NFS READ
operations.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Create a small sandbox under /sys/kernel/debug for experimental NFS
server feature settings. There is no API/ABI compatibility guarantee
for these settings.
The only documentation for such settings, if any documentation exists,
is in the kernel source code.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
As of now nfsd calls create_proc_exports_entry() at start of init_nfsd
and cleanup by remove_proc_entry() at last of exit_nfsd.
Which causes kernel OOPs if there is race between below 2 operations:
(i) exportfs -r
(ii) mount -t nfsd none /proc/fs/nfsd
for 5.4 kernel ARM64:
CPU 1:
el1_irq+0xbc/0x180
arch_counter_get_cntvct+0x14/0x18
running_clock+0xc/0x18
preempt_count_add+0x88/0x110
prep_new_page+0xb0/0x220
get_page_from_freelist+0x2d8/0x1778
__alloc_pages_nodemask+0x15c/0xef0
__vmalloc_node_range+0x28c/0x478
__vmalloc_node_flags_caller+0x8c/0xb0
kvmalloc_node+0x88/0xe0
nfsd_init_net+0x6c/0x108 [nfsd]
ops_init+0x44/0x170
register_pernet_operations+0x114/0x270
register_pernet_subsys+0x34/0x50
init_nfsd+0xa8/0x718 [nfsd]
do_one_initcall+0x54/0x2e0
CPU 2 :
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000010
PC is at : exports_net_open+0x50/0x68 [nfsd]
Call trace:
exports_net_open+0x50/0x68 [nfsd]
exports_proc_open+0x2c/0x38 [nfsd]
proc_reg_open+0xb8/0x198
do_dentry_open+0x1c4/0x418
vfs_open+0x38/0x48
path_openat+0x28c/0xf18
do_filp_open+0x70/0xe8
do_sys_open+0x154/0x248
Sometimes it crashes at exports_net_open() and sometimes cache_seq_next_rcu().
and same is happening on latest 6.14 kernel as well:
[ 0.000000] Linux version 6.14.0-rc5-next-20250304-dirty
...
[ 285.455918] Unable to handle kernel paging request at virtual address 00001f4800001f48
...
[ 285.464902] pc : cache_seq_next_rcu+0x78/0xa4
...
[ 285.469695] Call trace:
[ 285.470083] cache_seq_next_rcu+0x78/0xa4 (P)
[ 285.470488] seq_read+0xe0/0x11c
[ 285.470675] proc_reg_read+0x9c/0xf0
[ 285.470874] vfs_read+0xc4/0x2fc
[ 285.471057] ksys_read+0x6c/0xf4
[ 285.471231] __arm64_sys_read+0x1c/0x28
[ 285.471428] invoke_syscall+0x44/0x100
[ 285.471633] el0_svc_common.constprop.0+0x40/0xe0
[ 285.471870] do_el0_svc_compat+0x1c/0x34
[ 285.472073] el0_svc_compat+0x2c/0x80
[ 285.472265] el0t_32_sync_handler+0x90/0x140
[ 285.472473] el0t_32_sync+0x19c/0x1a0
[ 285.472887] Code: f9400885 93407c23 937d7c27 11000421 (f86378a3)
[ 285.473422] ---[ end trace 0000000000000000 ]---
It reproduced simply with below script:
while [ 1 ]
do
/exportfs -r
done &
while [ 1 ]
do
insmod /nfsd.ko
mount -t nfsd none /proc/fs/nfsd
umount /proc/fs/nfsd
rmmod nfsd
done &
So exporting interfaces to user space shall be done at last and
cleanup at first place.
With change there is no Kernel OOPs.
Co-developed-by: Shubham Rana <s9.rana@samsung.com>
Signed-off-by: Shubham Rana <s9.rana@samsung.com>
Signed-off-by: Maninder Singh <maninder1.s@samsung.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
With rpc_status netlink support, unregister of register_filesystem()
was missed in case of genl_register_family() fails.
Correcting it by making new label.
Fixes: bd9d6a3efa97 ("NFSD: add rpc_status netlink support")
Cc: stable@vger.kernel.org
Signed-off-by: Maninder Singh <maninder1.s@samsung.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
When cache cleanup runs concurrently with cache entry removal, a race
condition can occur that leads to incorrect nextcheck times. This can
delay cache cleanup for the cache_detail by up to 1800 seconds:
1. cache_clean() sets nextcheck to current time plus 1800 seconds
2. While scanning a non-empty bucket, concurrent cache entry removal can
empty that bucket
3. cache_clean() finds no cache entries in the now-empty bucket to update
the nextcheck time
4. This maybe delays the next scan of the cache_detail by up to 1800
seconds even when it should be scanned earlier based on remaining
entries
Fix this by moving the hash_lock acquisition earlier in cache_clean().
This ensures bucket emptiness checks and nextcheck updates happen
atomically, preventing the race between cleanup and entry removal.
Signed-off-by: Long Li <leo.lilong@huawei.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
The cache_detail structure uses a "nextcheck" field to control hash table
scanning intervals. When a table scan begins, nextcheck is set to current
time plus 1800 seconds. During scanning, if cache_detail is not empty and
a cache entry's expiry time is earlier than the current nextcheck, the
nextcheck is updated to that expiry time.
This mechanism ensures that:
1) Empty cache_details are scanned every 1800 seconds to avoid unnecessary
scans
2) Non-empty cache_details are scanned based on the earliest expiry time
found
However, when adding a new cache entry to an empty cache_detail, the
nextcheck time was not being updated, remaining at 1800 seconds. This
could delay cache cleanup for up to 1800 seconds, potentially blocking
threads(such as nfsd) that are waiting for cache cleanup.
Fix this by updating the nextcheck time whenever a new cache entry is
added.
Signed-off-by: Long Li <leo.lilong@huawei.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Help the client resolve the race between the reply to an
asynchronous COPY reply and the associated CB_OFFLOAD callback by
planting the session, slot, and sequence number of the COPY in the
CB_SEQUENCE contained in the CB_OFFLOAD COMPOUND.
Suggested-by: Trond Myklebust <trondmy@hammerspace.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
The slot index number of the current COMPOUND has, until now, not
been needed outside of nfsd4_sequence(). But to record the tuple
that represents a referring call, the slot number will be needed
when processing subsequent operations in the COMPOUND.
Refactor the code that allocates a new struct nfsd4_slot to ensure
that the new sl_index field is always correctly initialized.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
We have yet to implement a mechanism in NFSD for resolving races
between a server's reply and a related callback operation. For
example, a CB_OFFLOAD callback can race with the matching COPY
response. The client will not recognize the copy state ID in the
CB_OFFLOAD callback until the COPY response arrives.
Trond adds:
> It is also needed for the same kind of race with delegation
> recalls, layout recalls, CB_NOTIFY_DEVICEID and would also be
> helpful (although not as strongly required) for CB_NOTIFY_LOCK.
RFC 8881 Section 20.9.3 describes referring call lists this way:
> The csa_referring_call_lists array is the list of COMPOUND
> requests, identified by session ID, slot ID, and sequence ID.
> These are requests that the client previously sent to the server.
> These previous requests created state that some operation(s) in
> the same CB_COMPOUND as the csa_referring_call_lists are
> identifying. A session ID is included because leased state is tied
> to a client ID, and a client ID can have multiple sessions. See
> Section 2.10.6.3.
Introduce the XDR infrastructure for populating the
csa_referring_call_lists argument of CB_SEQUENCE. Subsequent patches
will put the referring call list to use.
Note that cb_sequence_enc_sz estimates that only zero or one rcl is
included in each CB_SEQUENCE, but the new infrastructure can
manage any number of referring calls.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Try not to prolong the wait for completion of a COPY or COPY_NOTIFY
operation.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Update the status of an async COPY operation when it has been
stopped. OFFLOAD_STATUS needs to indicate that the COPY is no longer
running.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
|
|
A recent commit put one entry in the wrong place. This just moves it to the
right place.
Signed-off-by: Vicki Pfau <vi@endrift.com>
Link: https://lore.kernel.org/r/20250328234345.989761-5-vi@endrift.com
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
|
|
This adds support for several new controllers, all of which include
Share buttons:
- HORI Drum controller
- PowerA Fusion Pro 4
- 8BitDo Ultimate 3-mode Controller
- Hyperkin DuchesS Xbox One controller
- PowerA MOGA XP-Ultra controller
Signed-off-by: Vicki Pfau <vi@endrift.com>
Link: https://lore.kernel.org/r/20250328234345.989761-4-vi@endrift.com
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
|
|
The Share button, if present, is always one of two offsets from the end of the
file, depending on the presence of a specific interface. As we lack parsing for
the identify packet we can't automatically determine the presence of that
interface, but we can hardcode which of these offsets is correct for a given
controller.
More controllers are probably fixable by adding the MAP_SHARE_BUTTON in the
future, but for now I only added the ones that I have the ability to test
directly.
Signed-off-by: Vicki Pfau <vi@endrift.com>
Link: https://lore.kernel.org/r/20250328234345.989761-2-vi@endrift.com
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
|
|
Two controllers -- Mad Catz JOYTECH NEO SE Advanced and PDP Mirror's
Edge Official -- were missing the value of the mapping field, and thus
wouldn't detect properly.
Signed-off-by: Vicki Pfau <vi@endrift.com>
Link: https://lore.kernel.org/r/20250328234345.989761-1-vi@endrift.com
Fixes: 540602a43ae5 ("Input: xpad - add a few new VID/PID combinations")
Fixes: 3492321e2e60 ("Input: xpad - add multiple supported devices")
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
|
|
To wake up the system from s2idle when pressing the power-button, let's
convert from using pm_wakeup_event() to pm_wakeup_dev_event(), as it allows
us to specify the "hard" in-parameter, which needs to be set for s2idle.
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://lore.kernel.org/r/20250306115021.797426-1-ulf.hansson@linaro.org
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
|
|
propagate_mnt() does not attach anything to mounts created during
propagate_mnt() itself. What's more, anything on ->mnt_slave_list
of such new mount must also be new, so we don't need to even look
there.
When move_mount() had been introduced, we've got an additional
class of mounts to skip - if we are moving from anon namespace,
we do not want to propagate to mounts we are moving (i.e. all
mounts in that anon namespace).
Unfortunately, the part about "everything on their ->mnt_slave_list
will also be ignorable" is not true - if we have propagation graph
A -> B -> C
and do OPEN_TREE_CLONE open_tree() of B, we get
A -> [B <-> B'] -> C
as propagation graph, where B' is a clone of B in our detached tree.
Making B private will result in
A -> B' -> C
C still gets propagation from A, as it would after making B private
if we hadn't done that open_tree(), but now the propagation goes
through B'. Trying to move_mount() our detached tree on subdirectory
in A should have
* moved B' on that subdirectory in A
* skipped the corresponding subdirectory in B' itself
* copied B' on the corresponding subdirectory in C.
As it is, the logics in propagation_next() and friends ends up
skipping propagation into C, since it doesn't consider anything
downstream of B'.
IOW, walking the propagation graph should only skip the ->mnt_slave_list
of new mounts; the only places where the check for "in that one
anon namespace" are applicable are propagate_one() (where we should
treat that as the same kind of thing as "mountpoint we are looking
at is not visible in the mount we are looking at") and
propagation_would_overmount(). The latter is better dealt with
in the caller (can_move_mount_beneath()); on the first call of
propagation_would_overmount() the test is always false, on the
second it is always true in "move from anon namespace" case and
always false in "move within our namespace" one, so it's easier
to just use check_mnt() before bothering with the second call and
be done with that.
Fixes: 064fe6e233e8 ("mount: handle mount propagation for detached mount trees")
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
as it is, a failed move_mount(2) from anon namespace breaks
all further propagation into that namespace, including normal
mounts in non-anon namespaces that would otherwise propagate
there.
Fixes: 064fe6e233e8 ("mount: handle mount propagation for detached mount trees")
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
do_umount() analogue of the race fixed in 119e1ef80ecf "fix
__legitimize_mnt()/mntput() race". Here we want to make sure that
if __legitimize_mnt() doesn't notice our lock_mount_hash(), we will
notice their refcount increment. Harder to hit than mntput_no_expire()
one, fortunately, and consequences are milder (sync umount acting
like umount -l on a rare race with RCU pathwalk hitting at just the
wrong time instead of use-after-free galore mntput_no_expire()
counterpart used to be hit). Still a bug...
Fixes: 48a066e72d97 ("RCU'd vfsmounts")
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
... or we risk stealing final mntput from sync umount - raising mnt_count
after umount(2) has verified that victim is not busy, but before it
has set MNT_SYNC_UMOUNT; in that case __legitimize_mnt() doesn't see
that it's safe to quietly undo mnt_count increment and leaves dropping
the reference to caller, where it'll be a full-blown mntput().
Check under mount_lock is needed; leaving the current one done before
taking that makes no sense - it's nowhere near common enough to bother
with.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
tl;dr: There is a window in the mm switching code where the new CR3 is
set and the CPU should be getting TLB flushes for the new mm. But
should_flush_tlb() has a bug and suppresses the flush. Fix it by
widening the window where should_flush_tlb() sends an IPI.
Long Version:
=== History ===
There were a few things leading up to this.
First, updating mm_cpumask() was observed to be too expensive, so it was
made lazier. But being lazy caused too many unnecessary IPIs to CPUs
due to the now-lazy mm_cpumask(). So code was added to cull
mm_cpumask() periodically[2]. But that culling was a bit too aggressive
and skipped sending TLB flushes to CPUs that need them. So here we are
again.
=== Problem ===
The too-aggressive code in should_flush_tlb() strikes in this window:
// Turn on IPIs for this CPU/mm combination, but only
// if should_flush_tlb() agrees:
cpumask_set_cpu(cpu, mm_cpumask(next));
next_tlb_gen = atomic64_read(&next->context.tlb_gen);
choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
load_new_mm_cr3(need_flush);
// ^ After 'need_flush' is set to false, IPIs *MUST*
// be sent to this CPU and not be ignored.
this_cpu_write(cpu_tlbstate.loaded_mm, next);
// ^ Not until this point does should_flush_tlb()
// become true!
should_flush_tlb() will suppress TLB flushes between load_new_mm_cr3()
and writing to 'loaded_mm', which is a window where they should not be
suppressed. Whoops.
=== Solution ===
Thankfully, the fuzzy "just about to write CR3" window is already marked
with loaded_mm==LOADED_MM_SWITCHING. Simply checking for that state in
should_flush_tlb() is sufficient to ensure that the CPU is targeted with
an IPI.
This will cause more TLB flush IPIs. But the window is relatively small
and I do not expect this to cause any kind of measurable performance
impact.
Update the comment where LOADED_MM_SWITCHING is written since it grew
yet another user.
Peter Z also raised a concern that should_flush_tlb() might not observe
'loaded_mm' and 'is_lazy' in the same order that switch_mm_irqs_off()
writes them. Add a barrier to ensure that they are observed in the
order they are written.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Rik van Riel <riel@surriel.com>
Link: https://lore.kernel.org/oe-lkp/202411282207.6bd28eae-lkp@intel.com/ [1]
Fixes: 6db2526c1d69 ("x86/mm/tlb: Only trim the mm_cpumask once a second") [2]
Reported-by: Stephen Dolan <sdolan@janestreet.com>
Cc: stable@vger.kernel.org
Acked-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Our QA team reported a 10%-23%, throughput reduction on an io_uring
sqpoll testcase doing IO to a null_blk, that I traced back to a
reduction of the device submission queue depth utilization. It turns out
that, after commit af5d68f8892f ("io_uring/sqpoll: manage task_work
privately"), we capped the number of task_work entries that can be
completed from a single spin of sqpoll to only 8 entries, before the
sqpoll goes around to (potentially) sleep. While this cap doesn't drive
the submission side directly, it impacts the completion behavior, which
affects the number of IO queued by fio per sqpoll cycle on the
submission side, and io_uring ends up seeing less ios per sqpoll cycle.
As a result, block layer plugging is less effective, and we see more
time spent inside the block layer in profilings charts, and increased
submission latency measured by fio.
There are other places that have increased overhead once sqpoll sleeps
more often, such as the sqpoll utilization calculation. But, in this
microbenchmark, those were not representative enough in perf charts, and
their removal didn't yield measurable changes in throughput. The major
overhead comes from the fact we plug less, and less often, when submitting
to the block layer.
My benchmark is:
fio --ioengine=io_uring --direct=1 --iodepth=128 --runtime=300 --bs=4k \
--invalidate=1 --time_based --ramp_time=10 --group_reporting=1 \
--filename=/dev/nullb0 --name=RandomReads-direct-nullb-sqpoll-4k-1 \
--rw=randread --numjobs=1 --sqthread_poll
In one machine, tested on top of Linux 6.15-rc1, we have the following
baseline:
READ: bw=4994MiB/s (5236MB/s), 4994MiB/s-4994MiB/s (5236MB/s-5236MB/s), io=439GiB (471GB), run=90001-90001msec
With this patch:
READ: bw=5762MiB/s (6042MB/s), 5762MiB/s-5762MiB/s (6042MB/s-6042MB/s), io=506GiB (544GB), run=90001-90001msec
which is a 15% improvement in measured bandwidth. The average
submission latency is noticeably lowered too. As measured by
fio:
Baseline:
lat (usec): min=20, max=241, avg=99.81, stdev=3.38
Patched:
lat (usec): min=26, max=226, avg=86.48, stdev=4.82
If we look at blktrace, we can also see the plugging behavior is
improved. In the baseline, we end up limited to plugging 8 requests in
the block layer regardless of the device queue depth size, while after
patching we can drive more io, and we manage to utilize the full device
queue.
In the baseline, after a stabilization phase, an ordinary submission
looks like:
254,0 1 49942 0.016028795 5977 U N [iou-sqp-5976] 7
After patching, I see consistently more requests per unplug.
254,0 1 4996 0.001432872 3145 U N [iou-sqp-3144] 32
Ideally, the cap size would at least be the deep enough to fill the
device queue, but we can't predict that behavior, or assume all IO goes
to a single device, and thus can't guess the ideal batch size. We also
don't want to let the tw run unbounded, though I'm not sure it would
really be a problem. Instead, let's just give it a more sensible value
that will allow for more efficient batching. I've tested with different
cap values, and initially proposed to increase the cap to 1024. Jens
argued it is too big of a bump and I observed that, with 32, I'm no
longer able to observe this bottleneck in any of my machines.
Fixes: af5d68f8892f ("io_uring/sqpoll: manage task_work privately")
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20250508181203.3785544-1-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Determining the SST/MST mode during state computation must be done based
on the output type stored in the CRTC state, which in turn is set once
based on the modeset connector's SST vs. MST type and will not change as
long as the connector is using the CRTC. OTOH the MST mode indicated by
the given connector's intel_dp::is_mst flag can change independently of
the above output type, based on what sink is at any moment plugged to
the connector.
Fix the state computation accordingly.
Cc: Jani Nikula <jani.nikula@intel.com>
Fixes: f6971d7427c2 ("drm/i915/mst: adapt intel_dp_mtp_tu_compute_config() for 128b/132b SST")
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4607
Reviewed-by: Jani Nikula <jani.nikula@intel.com>
Signed-off-by: Imre Deak <imre.deak@intel.com>
Link: https://lore.kernel.org/r/20250507151953.251846-1-imre.deak@intel.com
(cherry picked from commit 0f45696ddb2b901fbf15cb8d2e89767be481d59f)
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
|
|
When increasing the array size in memblock_double_array() and the slab
is not yet available, a call to memblock_find_in_range() is used to
reserve/allocate memory. However, the range returned may not have been
accepted, which can result in a crash when booting an SNP guest:
RIP: 0010:memcpy_orig+0x68/0x130
Code: ...
RSP: 0000:ffffffff9cc03ce8 EFLAGS: 00010006
RAX: ff11001ff83e5000 RBX: 0000000000000000 RCX: fffffffffffff000
RDX: 0000000000000bc0 RSI: ffffffff9dba8860 RDI: ff11001ff83e5c00
RBP: 0000000000002000 R08: 0000000000000000 R09: 0000000000002000
R10: 000000207fffe000 R11: 0000040000000000 R12: ffffffff9d06ef78
R13: ff11001ff83e5000 R14: ffffffff9dba7c60 R15: 0000000000000c00
memblock_double_array+0xff/0x310
memblock_add_range+0x1fb/0x2f0
memblock_reserve+0x4f/0xa0
memblock_alloc_range_nid+0xac/0x130
memblock_alloc_internal+0x53/0xc0
memblock_alloc_try_nid+0x3d/0xa0
swiotlb_init_remap+0x149/0x2f0
mem_init+0xb/0xb0
mm_core_init+0x8f/0x350
start_kernel+0x17e/0x5d0
x86_64_start_reservations+0x14/0x30
x86_64_start_kernel+0x92/0xa0
secondary_startup_64_no_verify+0x194/0x19b
Mitigate this by calling accept_memory() on the memory range returned
before the slab is available.
Prior to v6.12, the accept_memory() interface used a 'start' and 'end'
parameter instead of 'start' and 'size', therefore the accept_memory()
call must be adjusted to specify 'start + size' for 'end' when applying
to kernels prior to v6.12.
Cc: stable@vger.kernel.org # see patch description, needs adjustments for <= 6.11
Fixes: dcdfdd40fa82 ("mm: Add support for unaccepted memory")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/da1ac73bf4ded761e21b4e4bb5178382a580cd73.1746725050.git.thomas.lendacky@amd.com
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
|
|
Without CONFIG_DRM_XE_GPUSVM set, GPU SVM is not initialized thus below
warning pops. Refine the flush work code to be controlled by the config
to avoid below warning:
"
[ 453.132028] ------------[ cut here ]------------
[ 453.132527] WARNING: CPU: 9 PID: 4491 at kernel/workqueue.c:4205 __flush_work+0x379/0x3a0
[ 453.133355] Modules linked in: xe drm_ttm_helper ttm gpu_sched drm_buddy drm_suballoc_helper drm_gpuvm drm_exec
[ 453.134352] CPU: 9 UID: 0 PID: 4491 Comm: xe_exec_mix_mod Tainted: G U W 6.15.0-rc3+ #7 PREEMPT(full)
[ 453.135405] Tainted: [U]=USER, [W]=WARN
...
[ 453.136921] RIP: 0010:__flush_work+0x379/0x3a0
[ 453.137417] Code: 8b 45 00 48 8b 55 08 89 c7 48 c1 e8 04 83 e7 08 83 e0 0f 83 cf 02 89 c6 48 0f ba 6d 00 03 e9 d5 fe ff ff 0f 0b e9 db fd ff ff <0f> 0b 45 31 e4 e9 d1 fd ff ff 0f 0b e9 03 ff ff ff 0f 0b e9 d6 fe
[ 453.139250] RSP: 0018:ffffc90000c67b18 EFLAGS: 00010246
[ 453.139782] RAX: 0000000000000000 RBX: ffff888108a24000 RCX: 0000000000002000
[ 453.140521] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8881016d61c8
[ 453.141253] RBP: ffff8881016d61c8 R08: 0000000000000000 R09: 0000000000000000
[ 453.141985] R10: 0000000000000000 R11: 0000000008a24000 R12: 0000000000000001
[ 453.142709] R13: 0000000000000002 R14: 0000000000000000 R15: ffff888107db8c00
[ 453.143450] FS: 00007f44853d4c80(0000) GS:ffff8882f469b000(0000) knlGS:0000000000000000
[ 453.144276] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 453.144853] CR2: 00007f4487629228 CR3: 00000001016aa000 CR4: 00000000000406f0
[ 453.145594] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 453.146320] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 453.147061] Call Trace:
[ 453.147336] <TASK>
[ 453.147579] ? tick_nohz_tick_stopped+0xd/0x30
[ 453.148067] ? xas_load+0x9/0xb0
[ 453.148435] ? xa_load+0x6f/0xb0
[ 453.148781] __xe_vm_bind_ioctl+0xbd5/0x1500 [xe]
[ 453.149338] ? dev_printk_emit+0x48/0x70
[ 453.149762] ? _dev_printk+0x57/0x80
[ 453.150148] ? drm_ioctl+0x17c/0x440
[ 453.150544] ? __drm_dev_vprintk+0x36/0x90
[ 453.150983] ? __pfx_xe_vm_bind_ioctl+0x10/0x10 [xe]
[ 453.151575] ? drm_ioctl_kernel+0x9f/0xf0
[ 453.151998] ? __pfx_xe_vm_bind_ioctl+0x10/0x10 [xe]
[ 453.152560] drm_ioctl_kernel+0x9f/0xf0
[ 453.152968] drm_ioctl+0x20f/0x440
[ 453.153332] ? __pfx_xe_vm_bind_ioctl+0x10/0x10 [xe]
[ 453.153893] ? ioctl_has_perm.constprop.0.isra.0+0xae/0x100
[ 453.154489] ? memory_bm_test_bit+0x5/0x60
[ 453.154935] xe_drm_ioctl+0x47/0x70 [xe]
[ 453.155419] __x64_sys_ioctl+0x8d/0xc0
[ 453.155824] do_syscall_64+0x47/0x110
[ 453.156228] entry_SYSCALL_64_after_hwframe+0x76/0x7e
"
v2 (Matt):
refine commit message to have more details
add Fixes tag
move the code to xe_svm.h which already have the config
remove a blank line per codestyle suggestion
Fixes: 63f6e480d115 ("drm/xe: Add SVM garbage collector")
Cc: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250502170052.1787973-1-shuicheng.lin@intel.com
(cherry picked from commit 9d80698bcd97a5ad1088bcbb055e73fd068895e2)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
|
|
xe_force_wake_get() is dependent on xe_pm_runtime_get(), so for
the release path, xe_force_wake_put() should be called first then
xe_pm_runtime_put().
Combine the error path and normal path together with goto.
Fixes: 85d547608ef5 ("drm/xe/xe_gt_debugfs: Update handling of xe_force_wake_get return")
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Link: https://lore.kernel.org/r/20250507022302.2187527-1-shuicheng.lin@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
(cherry picked from commit 432cd94efdca06296cc5e76d673546f58aa90ee1)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
|
|
The workqueue used for the reset worker is marked as WQ_MEM_RECLAIM,
while the GSC one isn't (and can't be as we need to do memory
allocations in the gsc worker). Therefore, we can't flush the latter
from the former.
The reason why we had such a flush was to avoid interrupting either
the GSC FW load or in progress GSC proxy operations. GSC proxy
operations fall into 2 categories:
1) GSC proxy init: this only happens once immediately after GSC FW load
and does not support being interrupted. The only way to recover from
an interruption of the proxy init is to do an FLR and re-load the GSC.
2) GSC proxy request: this can happen in response to a request that
the driver sends to the GSC. If this is interrupted, the GSC FW will
timeout and the driver request will be failed, but overall the GSC
will keep working fine.
Flushing the work allowed us to avoid interruption in both cases (unless
the hang came from the GSC engine itself, in which case we're toast
anyway). However, a failure on a proxy request is tolerable if we're in
a scenario where we're triggering a GT reset (i.e., something is already
gone pretty wrong), so what we really need to avoid is interrupting
the init flow, which we can do by polling on the register that reports
when the proxy init is complete (as that ensure us that all the load and
init operations have been completed).
Note that during suspend we still want to do a flush of the worker to
make sure it completes any operations involving the HW before the power
is cut.
v2: fix spelling in commit msg, rename waiter function (Julia)
Fixes: dd0e89e5edc2 ("drm/xe/gsc: GSC FW load")
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4830
Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: John Harrison <John.C.Harrison@Intel.com>
Cc: Alan Previn <alan.previn.teres.alexis@intel.com>
Cc: <stable@vger.kernel.org> # v6.8+
Reviewed-by: Julia Filipchuk <julia.filipchuk@intel.com>
Link: https://lore.kernel.org/r/20250502155104.2201469-1-daniele.ceraolospurio@intel.com
(cherry picked from commit 12370bfcc4f0bdf70279ec5b570eb298963422b5)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
|
|
LNCF registers report wrong values when XE_FORCEWAKE_GT
only is held. Holding XE_FORCEWAKE_ALL ensures correct
operations on LNCF regs.
V2(Himal):
- Use xe_force_wake_ref_has_domain
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/1999
Fixes: a6a4ea6d7d37 ("drm/xe: Add mocs kunit")
Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250428082357.1730068-1-tejas.upadhyay@intel.com
Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
(cherry picked from commit 70a2585e582058e94fe4381a337be42dec800337)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
|
|
For an unknown reason the math to determine the PF queue size does is
not correct - compute UMD applications are overflowing the PF queue
which is fatal. A multippier of 8 fixes the problem.
Fixes: 3338e4f90c14 ("drm/xe: Use topology to determine page fault queue size")
Cc: stable@vger.kernel.org
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Jagmeet Randhawa <jagmeet.randhawa@intel.com>
Link: https://lore.kernel.org/r/20250408155915.78770-1-matthew.brost@intel.com
(cherry picked from commit 29582e0ea75c95668d168b12406e3c56cf5a73c4)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
|
|
Reading back the remapped HDP flush register seems to cause
problems on some platforms. All we need is a read, so read back
the memcfg register.
Fixes: 689275140cb8 ("drm/amdgpu/hdp7.0: do a posting read when flushing HDP")
Reported-by: Alexey Klimov <alexey.klimov@linaro.org>
Link: https://lists.freedesktop.org/archives/amd-gfx/2025-April/123150.html
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4119
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3908
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit dbc064adfcf9095e7d895bea87b2f75c1ab23236)
Cc: stable@vger.kernel.org
|
|
Reading back the remapped HDP flush register seems to cause
problems on some platforms. All we need is a read, so read back
the memcfg register.
Fixes: abe1cbaec6cf ("drm/amdgpu/hdp6.0: do a posting read when flushing HDP")
Reported-by: Alexey Klimov <alexey.klimov@linaro.org>
Link: https://lists.freedesktop.org/archives/amd-gfx/2025-April/123150.html
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4119
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3908
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 84141ff615951359c9a99696fd79a36c465ed847)
Cc: stable@vger.kernel.org
|
|
Reading back the remapped HDP flush register seems to cause
problems on some platforms. All we need is a read, so read back
the memcfg register.
Fixes: f756dbac1ce1 ("drm/amdgpu/hdp5.2: do a posting read when flushing HDP")
Reported-by: Alexey Klimov <alexey.klimov@linaro.org>
Link: https://lists.freedesktop.org/archives/amd-gfx/2025-April/123150.html
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4119
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3908
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 4a89b7698e771914b4d5b571600c76e2fdcbe2a9)
Cc: stable@vger.kernel.org
|
|
Reading back the remapped HDP flush register seems to cause
problems on some platforms. All we need is a read, so read back
the memcfg register.
Fixes: cf424020e040 ("drm/amdgpu/hdp5.0: do a posting read when flushing HDP")
Reported-by: Alexey Klimov <alexey.klimov@linaro.org>
Link: https://lists.freedesktop.org/archives/amd-gfx/2025-April/123150.html
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4119
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3908
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit a5cb344033c7598762e89255e8ff52827abb57a4)
Cc: stable@vger.kernel.org
|
|
Ever since commit eca2040972b4("scsi: block: ioprio: Clean up interface
definition"), the macro IOPRIO_PRIO_LEVEL() will mask the level value to
something between 0 and 7 so necessarily, level will always be lower than
IOPRIO_NR_LEVELS(8).
Remove this obsolete check.
Reported-by: Kexin Wei <ys.weikexin@h3c.com>
Cc: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20250508083018.GA769554@bytedance
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|