aboutsummaryrefslogtreecommitdiffstatshomepage
path: root/include/linux
AgeCommit message (Collapse)AuthorFilesLines
2026-03-29SUNRPC: Allocate a separate Reply page arrayChuck Lever1-24/+23
struct svc_rqst uses a single dynamically-allocated page array (rq_pages) for both the incoming RPC Call message and the outgoing RPC Reply message. rq_respages is a sliding pointer into rq_pages that each transport receive path must compute based on how many pages the Call consumed. This boundary tracking is a source of confusion and bugs, and prevents an RPC transaction from having both a large Call and a large Reply simultaneously. Allocate rq_respages as its own page array, eliminating the boundary arithmetic. This decouples Call and Reply buffer lifetimes, following the precedent set by rq_bvec (a separate dynamically- allocated array for I/O vectors). Each svc_rqst now pins twice as many pages as before. For a server running 16 threads with a 1MB maximum payload, the additional cost is roughly 16MB of pinned memory. The new dynamic svc thread count facility keeps this overhead minimal on an idle server. A subsequent patch in this series limits per-request repopulation to only the pages released during the previous RPC, avoiding a full-array scan on each call to svc_alloc_arg(). Note: We've considered several alternatives to maintaining a full second array. Each alternative reintroduces either boundary logic complexity or I/O-path allocation pressure. rq_next_page is initialized in svc_alloc_arg() and svc_process() during Reply construction, and in svc_rdma_recvfrom() as a precaution on error paths. Transport receive paths no longer compute it from the Call size. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-03-29sunrpc: split cache_detail queue into request and reader listsJeff Layton1-1/+3
Replace the single interleaved queue (which mixed cache_request and cache_reader entries distinguished by a ->reader flag) with two dedicated lists: cd->requests for upcall requests and cd->readers for open file handles. Readers now track their position via a monotonically increasing sequence number (next_seqno) rather than by their position in the shared list. Each cache_request is assigned a seqno when enqueued, and a new cache_next_request() helper finds the next request at or after a given seqno. This eliminates the cache_queue wrapper struct entirely, simplifies the reader-skipping loops in cache_read/cache_poll/cache_ioctl/ cache_release, and makes the data flow easier to reason about. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-03-29sunrpc: convert queue_wait from global to per-cache-detail waitqueueJeff Layton1-0/+2
The queue_wait waitqueue is currently a file-scoped global, so a wake_up for one cache_detail wakes pollers on all caches. Convert it to a per-cache-detail field so that only pollers on the relevant cache are woken. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-03-29sunrpc: convert queue_lock from global spinlock to per-cache-detail lockJeff Layton1-0/+1
The global queue_lock serializes all upcall queue operations across every cache_detail instance. Convert it to a per-cache-detail spinlock so that different caches (e.g. auth.unix.ip vs nfsd.fh) no longer contend with each other on queue operations. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-03-29Documentation: Add the RPC language description of NLM version 4Chuck Lever1-0/+233
In order to generate source code to encode and decode NLMv4 protocol elements, include a copy of the RPC language description of NLMv4 for xdrgen to process. The language description is an amalgam of RFC 1813 and the Open Group's XNFS specification: https://pubs.opengroup.org/onlinepubs/9629799/chap10.htm The C code committed here was generated from the new nlm4.x file using tools/net/sunrpc/xdrgen/xdrgen. The goals of replacing hand-written XDR functions with ones that are tool-generated are to improve memory safety and make XDR encoding and decoding less brittle to maintain. The xdrgen utility derives both the type definitions and the encode/decode functions directly from protocol specifications, using names and symbols familiar to anyone who knows those specs. Unlike hand-written code that can inadvertently diverge from the specification, xdrgen guarantees that the generated code matches the specification exactly. We would eventually like xdrgen to generate Rust code as well, making the conversion of the kernel's NFS stacks to use Rust just a little easier for us. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-03-29NFSD: Enforce timeout on layout recall and integrate lease manager fencingDai Ngo1-0/+1
When a layout conflict triggers a recall, enforcing a timeout is necessary to prevent excessive nfsd threads from being blocked in __break_lease ensuring the server continues servicing incoming requests efficiently. This patch introduces a new function to lease_manager_operations: lm_breaker_timedout: Invoked when a lease recall times out and is about to be disposed of. This function enables the lease manager to inform the caller whether the file_lease should remain on the flc_list or be disposed of. For the NFSD lease manager, this function now handles layout recall timeouts. If the layout type supports fencing and the client has not been fenced, a fence operation is triggered to prevent the client from accessing the block device. While the fencing operation is in progress, the conflicting file_lease remains on the flc_list until fencing is complete. This guarantees that no other clients can access the file, and the client with exclusive access is properly blocked before disposal. Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-03-29sunrpc: Fix compilation error (`make W=1`) when dprintk() is no-opAndy Shevchenko2-5/+6
Clang compiler is not happy about set but unused variables: .../flexfilelayout/flexfilelayoutdev.c:56:9: error: variable 'ret' set but not used [-Werror,-Wunused-but-set-variable] .../flexfilelayout/flexfilelayout.c:1505:6: error: variable 'err' set but not used [-Werror,-Wunused-but-set-variable] .../nfs4proc.c:9244:12: error: variable 'ptr' set but not used [-Werror,-Wunused-but-set-variable] Fix these by forwarding parameters of dprintk() to no_printk(). The positive side-effect is a format-string checker enabled even for the cases when dprintk() is no-op. Fixes: d67ae825a59d ("pnfs/flexfiles: Add the FlexFile Layout Driver") Fixes: fc931582c260 ("nfs41: create_session operation") Acked-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-03-29sunrpc: Kill RPC_IFDEBUG()Andy Shevchenko1-2/+0
RPC_IFDEBUG() is used in only two places. In one the user of the definition is guarded by ifdeffery, in the second one it's implied due to dprintk() usage. Kill the macro and move the ifdeffery to the regular condition with the variable defined inside, while in the second case add the same conditional and move the respective code there. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-03-29lockd: Make linux/lockd/nlm.h an internal headerChuck Lever2-60/+0
The NLM protocol constants and status codes in nlm.h are needed only by lockd's internal implementation. NFS client code and NFSD interact with lockd through the stable API in bind.h and have no direct use for protocol-level definitions. Exposing these definitions globally via bind.h creates unnecessary coupling between lockd internals and its consumers. Moving nlm.h from include/linux/lockd/ to fs/lockd/ clarifies the API boundary: bind.h provides the lockd service interface, while nlm.h remains available only to code within fs/lockd/ that implements the protocol. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-03-29lockd: Move xdr.h from include/linux/lockd/ to fs/lockd/Chuck Lever2-116/+2
The lockd subsystem unnecessarily exposes internal NLM XDR type definitions through the global include path. These definitions are not used by any code outside fs/lockd/, making them inappropriate for include/linux/lockd/. Moving xdr.h to fs/lockd/ narrows the API surface and clarifies that these types are internal implementation details. The comment in linux/lockd/bind.h stating xdr.h was needed for "xdr-encoded error codes" is stale: no lockd API consumers use those codes. Forward declarations for struct nfs_fh and struct file_lock are added to bind.h because their definitions were previously pulled in transitively through xdr.h. Additionally, nfs3proc.c and proc.c need explicit includes of filelock.h for FL_CLOSE and for accessing struct file_lock members, respectively. Built and tested with lockd client/server operations. No functional change. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-03-29lockd: Remove lockd/debug.hChuck Lever1-40/+0
The lockd include structure has unnecessary indirection. The header include/linux/lockd/debug.h is consumed only by fs/lockd/lockd.h, creating an extra compilation dependency and making the code harder to navigate. Fold the debug.h definitions directly into lockd.h and remove the now-redundant header. This reduces the include tree depth and makes the debug-related definitions easier to find when working on lockd internals. Build-tested with lockd built as module and built-in. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-03-29lockd: Relocate include/linux/lockd/lockd.hChuck Lever1-401/+0
Headers placed in include/linux/ form part of the kernel's internal API and signal to subsystem maintainers that other parts of the kernel may depend on them. By moving lockd.h into fs/lockd/, lockd becomes a more self-contained module whose internal interfaces are clearly distinguished from its public contract with the rest of the kernel. This relocation addresses a long-standing XXX comment in the header itself that acknowledged the file's misplacement. Future changes to lockd internals can now proceed with confidence that external consumers are not inadvertently coupled to implementation details. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-03-29lockd: Move share.h from include/linux/lockd/ to fs/lockd/Chuck Lever2-32/+2
The share.h header defines struct nlm_share and declares the DOS share management functions used by the NLM server to implement NLM_SHARE and NLM_UNSHARE operations. These interfaces are used exclusively within the lockd subsystem. A git grep search confirms no external code references them. Relocating this header from include/linux/lockd/ to fs/lockd/ narrows the public API surface of the lockd module. Out-of-tree code cannot depend on these internal interfaces after this change. Future refactoring of the share management implementation thus requires no consideration of external consumers. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-03-29lockd: Move xdr4.h from include/linux/lockd/ to fs/lockd/Chuck Lever3-49/+4
The xdr4.h header declares NLMv4-specific XDR encoder/decoder functions and error codes that are used exclusively within the lockd subsystem. Moving it from include/linux/lockd/ to fs/lockd/ clarifies the intended scope of these declarations and prevents external code from depending on lockd-internal interfaces. This change reduces the public API surface of the lockd module and makes it easier to refactor NLMv4 internals without risk of breaking out-of-tree consumers. The header's contents are implementation details of the NLMv4 wire protocol handling, not a contract with other kernel subsystems. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-03-29NFS: Use nlmclnt_shutdown_rpc_clnt() to safely shut down NLMChuck Lever1-0/+1
A race condition exists in shutdown_store() when writing to the sysfs "shutdown" file concurrently with nlm_shutdown_hosts_net(). Without synchronization, the following sequence can occur: 1. shutdown_store() reads server->nlm_host (non-NULL) 2. nlm_shutdown_hosts_net() acquires nlm_host_mutex, calls rpc_shutdown_client(), sets h_rpcclnt to NULL, and potentially frees the host via nlm_gc_hosts() 3. shutdown_store() dereferences the now-stale or freed host Introduce nlmclnt_shutdown_rpc_clnt(), which acquires nlm_host_mutex before accessing h_rpcclnt. This synchronizes with nlm_shutdown_hosts_net() and ensures the rpc_clnt pointer remains valid during the shutdown operation. This change also improves API layering: NFS client code no longer needs to include the internal lockd header to access nlm_host fields. The new helper resides in bind.h alongside other public lockd interfaces. Reported-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-03-29lockd: Relocate nlmsvc_unlock API declarationsChuck Lever2-6/+7
The nlmsvc_unlock_all_by_sb() and nlmsvc_unlock_all_by_ip() functions are part of lockd's external API, consumed by other kernel subsystems. Their declarations currently reside in linux/lockd/lockd.h alongside internal implementation details, which blurs the boundary between lockd's public interface and its private internals. Moving these declarations to linux/lockd/bind.h groups them with other external API functions and makes the separation explicit. This clarifies which functions are intended for external use and reduces the risk of internal implementation details leaking into the public API surface. Build-tested with allyesconfig; no functional changes. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-03-29lockd: Have nlm_fopen() return errno valuesChuck Lever2-5/+5
The nlm_fopen() function is part of the API between nfsd and lockd. Currently its return value is an on-the-wire NLM status code. But that forces NFSD to include NLM wire protocol definitions despite having no other dependency on the NLM wire protocol. In addition, a CONFIG_LOCKD_V4 Kconfig symbol appears in the middle of NFSD source code. Refactor: Let's not use on-the-wire values as part of a high-level API between two Linux kernel modules. That's what we have errno for, right? And, instead of simply moving the CONFIG_LOCKD_V4 check, we can get rid of it entirely and let the decision of what actual NLM status code goes on the wire to be left up to NLM version-specific code. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-03-29lockd: Introduce nlm__int__deadlockChuck Lever1-0/+1
The use of CONFIG_LOCKD_V4 in combination with a later cast_status() in the NLMv3 code is difficult to reason about. Instead, replace the use of nlm_deadlock with an implementation-defined status value that version-specific code translates appropriately. The new approach establishes a translation boundary: generic lockd code returns nlm__int__deadlock when posix_lock_file() yields -EDEADLK. Version-specific handlers (svc4proc.c for NLMv4, svcproc.c for NLMv3) translate this internal status to the appropriate wire protocol value. NLMv4 maps to nlm4_deadlock; NLMv3 maps to nlm_lck_denied (since NLMv3 lacks a deadlock-specific status code). Later this modification will also remove the need to include NLMv4 headers in NLMv3 and generic code. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-03-29lockd: Relocate and rename nlm_drop_replyChuck Lever2-2/+6
The nlm_drop_reply status code is internal to the kernel's lockd implementation and must never appear on the wire. Its previous location in xdr.h grouped it with legitimate NLM protocol status codes, obscuring this critical distinction. Relocate the definition to lockd.h with a comment block for internal status codes, and rename to nlm__int__drop_reply to make its internal-only nature explicit. This prepares for adding additional internal status codes in subsequent patches. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-03-29nfsd/sunrpc: move rq_cachetype into struct nfsd_thread_local_infoJeff Layton1-1/+0
The svc_rqst->rq_cachetype field is only accessed by nfsd. Move it into the nfsd_thread_local_info instead. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Benjamin Coddington <bcodding@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-03-29nfsd/sunrpc: add svc_rqst->rq_private pointer and remove rq_lease_breakerJeff Layton1-1/+4
rq_lease_breaker has always been a NFSv4 specific layering violation in svc_rqst. The reason it's there though is that we need a place that is thread-local, and accessible from the svc_rqst pointer. Add a new rq_private pointer to struct svc_rqst. This is intended for use by the threads that are handling the service. sunrpc code doesn't touch it. In nfsd, define a new struct nfsd_thread_local_info. nfsd declares one of these on the stack and puts a pointer to it in rq_private. Add a new ntli_lease_breaker field to the new struct and convert all of the places that access rq_lease_breaker to use the new field instead. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Benjamin Coddington <bcodding@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-03-29Merge tag 'vfs-7.0-rc6.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfsLinus Torvalds3-12/+1
Pull vfs fixes from Christian Brauner: - Fix netfs_limit_iter() hitting BUG() when an ITER_KVEC iterator reaches it via core dump writes to 9P filesystems. Add ITER_KVEC handling following the same pattern as the existing ITER_BVEC code. - Fix a NULL pointer dereference in the netfs unbuffered write retry path when the filesystem (e.g., 9P) doesn't set the prepare_write operation. - Clear I_DIRTY_TIME in sync_lazytime for filesystems implementing ->sync_lazytime. Without this the flag stays set and may cause additional unnecessary calls during inode deactivation. - Increase tmpfs size in mount_setattr selftests. A recent commit bumped the ext4 image size to 2 GB but didn't adjust the tmpfs backing store, so mkfs.ext4 fails with ENOSPC writing metadata. - Fix an invalid folio access in iomap when i_blkbits matches the folio size but differs from the I/O granularity. The cur_folio pointer would not get invalidated and iomap_read_end() would still be called on it despite the IO helper owning it. - Fix hash_name() docstring. - Fix read abandonment during netfs retry where the subreq variable used for abandonment could be uninitialized on the first pass or point to a deleted subrequest on later passes. - Don't block sync for filesystems with no data integrity guarantees. Add a SB_I_NO_DATA_INTEGRITY superblock flag replacing the per-inode AS_NO_DATA_INTEGRITY mapping flag so sync kicks off writeback but doesn't wait for flusher threads. This fixes a suspend-to-RAM hang on fuse-overlayfs where the flusher thread blocks when the fuse daemon is frozen. - Fix a lockdep splat in iomap when reads fail. iomap_read_end_io() invokes fserror_report() which calls igrab() taking i_lock in hardirq context while i_lock is normally held with interrupts enabled. Kick failed read handling to a workqueue. - Remove the redundant netfs_io_stream::front member and use stream->subrequests.next instead, fixing a potential issue in the direct write code path. * tag 'vfs-7.0-rc6.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: netfs: Fix the handling of stream->front by removing it iomap: fix lockdep complaint when reads fail writeback: don't block sync for filesystems with no data integrity guarantees netfs: Fix read abandonment during retry vfs: fix docstring of hash_name() iomap: fix invalid folio access when i_blkbits differs from I/O granularity selftests/mount_setattr: increase tmpfs size for idmapped mount tests fs: clear I_DIRTY_TIME in sync_lazytime netfs: Fix NULL pointer dereference in netfs_unbuffered_write() on retry netfs: Fix kernel BUG in netfs_limit_iter() for ITER_KVEC iterators
2026-03-29netfilter: remove nf_ipv6_ops and use direct function callsFernando Fernandez Mancera1-96/+6
As IPv6 is built-in only, nf_ipv6_ops can be removed completely as it is not longer necessary. Convert all nf_ipv6_ops usage to direct function calls instead. In addition, remove the ipv6_netfilter_init/fini() functions as they are not necessary any longer. Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Tested-by: Ricardo B. Marlière <rbm@suse.com> Link: https://patch.msgid.link/20260325120928.15848-12-fmancera@suse.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-29ipv6: remove dynamic ICMPv6 sender registration infrastructureFernando Fernandez Mancera1-27/+2
As IPv6 is built-in only, there is no need to maintain the sender registration infrastructure used to allow built-in subsystems to send ICMPv6 messages when IPv6 was compiled as a module. Drop the registration mechanism and the __icmpv6_send() sender implementation. While icmpv6_send() users could be converted to icmp6_send() that doesn't seems necessary as none of them are using the force_saddr parameter. Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Tested-by: Ricardo B. Marlière <rbm@suse.com> Link: https://patch.msgid.link/20260325120928.15848-5-fmancera@suse.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-29ipv6: replace IS_BUILTIN(CONFIG_IPV6) with IS_ENABLED(CONFIG_IPV6)Fernando Fernandez Mancera1-1/+1
As IPv6 is built-in only, it does not make sense to continue using IS_BUILTIN(CONFIG_IPV6). Therefore, replace it with IS_ENABLED() when necessary and drop it if it isn't valid anymore. Notice that there is still one instance related to ICMPv6, as it requires more changes it will be handle separately. Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Tested-by: Ricardo B. Marlière <rbm@suse.com> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260325120928.15848-4-fmancera@suse.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-29Merge tag 'locking-urgent-2026-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-0/+1
Pull futex fixes from Ingo Molnar: - Tighten up the sys_futex_requeue() ABI a bit, to disallow dissimilar futex flags and potential UaF access (Peter Zijlstra) - Fix UaF between futex_key_to_node_opt() and vma_replace_policy() (Hao-Yu Yang) - Clear stale exiting pointer in futex_lock_pi() retry path, which triggered a warning (and potential misbehavior) in stress-testing (Davidlohr Bueso) * tag 'locking-urgent-2026-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: futex: Clear stale exiting pointer in futex_lock_pi() retry path futex: Fix UaF between futex_key_to_node_opt() and vma_replace_policy() futex: Require sys_futex_requeue() to have identical flags
2026-03-29bpf: Support struct btf_struct_meta via KF_IMPLICIT_ARGSIhor Solodrai1-1/+1
The following kfuncs currently accept void *meta__ign argument: * bpf_obj_new_impl * bpf_obj_drop_impl * bpf_percpu_obj_new_impl * bpf_percpu_obj_drop_impl * bpf_refcount_acquire_impl * bpf_list_push_back_impl * bpf_list_push_front_impl * bpf_rbtree_add_impl The __ign suffix is an indicator for the verifier to skip the argument in check_kfunc_args(). Then, in fixup_kfunc_call() the verifier may set the value of this argument to struct btf_struct_meta * kptr_struct_meta from insn_aux_data. BPF programs must pass a dummy NULL value when calling these kfuncs. Additionally, the list and rbtree _impl kfuncs also accept an implicit u64 argument, which doesn't require __ign suffix because it's a scalar, and BPF programs explicitly pass 0. Add new kfuncs with KF_IMPLICIT_ARGS [1], that correspond to each _impl kfunc accepting meta__ign. The existing _impl kfuncs remain unchanged for backwards compatibility. To support this, add "btf_struct_meta" to the list of recognized implicit argument types in resolve_btfids. Implement is_kfunc_arg_implicit() in the verifier, that determines implicit args by inspecting both a non-_impl BTF prototype of the kfunc. Update the special_kfunc_list in the verifier and relevant checks to support both the old _impl and the new KF_IMPLICIT_ARGS variants of btf_struct_meta users. [1] https://lore.kernel.org/bpf/20260120222638.3976562-1-ihor.solodrai@linux.dev/ Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260327203241.3365046-1-ihor.solodrai@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-29mailbox: Fix NULL message support in mbox_send_message()Jassi Brar1-0/+3
The active_req field serves double duty as both the "is a TX in flight" flag (NULL means idle) and the storage for the in-flight message pointer. When a client sends NULL via mbox_send_message(), active_req is set to NULL, which the framework misinterprets as "no active request". This breaks the TX state machine by: - tx_tick() short-circuits on (!mssg), skipping the tx_done callback and the tx_complete completion - txdone_hrtimer() skips the channel entirely since active_req is NULL, so poll-based TX-done detection never fires. Fix this by introducing a MBOX_NO_MSG sentinel value that means "no active request," freeing NULL to be valid message data. The sentinel is defined in the subsystem-internal mailbox.h so that controller drivers within drivers/mailbox/ can reference it, but it is not exposed to clients outside the subsystem. Fifteen in-tree callers send NULL (doorbell-style IPCs on Qualcomm, Tegra, TI, Xilinx, i.MX, SCMI, and PCC platforms). All were audited for regression: - Most already work around the bug via knows_txdone=true with a manual mbox_client_txdone() call, making the framework's tracking irrelevant. These are unaffected. - Poll-based callers (Xilinx zynqmp/r5) are strictly better off: the poll timer now correctly detects NULL-active channels instead of silently skipping them. - irq-qcom-mpm.c was a pre-existing bug -- the only Qualcomm caller that omitted the knows_txdone + mbox_client_txdone() pattern. Fixed in a companion commit ("irqchip/qcom-mpm: Fix missing mailbox TX done acknowledgment"). - No caller sets both a tx_done callback and sends NULL, nor combines tx_block=true with NULL sends, so the newly reachable callback/completion paths are never exercised. Also update tegra-hsp's flush callback, which directly inspects active_req to wait for the channel to drain: the old "!= NULL" check becomes "!= MBOX_NO_MSG", otherwise flush spins until timeout since the sentinel is non-NULL. The only tradeoff is that 'MBOX_NO_MSG' can not be used as a message by clients. Reported-by: Joonwon Kang <joonwonkang@google.com> Reviewed-by: Douglas Anderson <dianders@chromium.org> Signed-off-by: Jassi Brar <jassisinghbrar@gmail.com>
2026-03-29mailbox: remove superfluous internal headerWolfram Sang1-0/+5
Quite some controller drivers use the defines from the internal header already. This prevents controller drivers outside the mailbox directory. Move the defines to the public controller header to allow this again as the defines are not strictly internal anyhow. Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Reviewed-by: Sudeep Holla <sudeep.holla@kernel.org> Reviewed-by: Daniel Baluta <daniel.baluta@nxp.com> Signed-off-by: Jassi Brar <jassisinghbrar@gmail.com>
2026-03-29Merge tag 'iio-fixes-for-7.0b' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/jic23/iio into char-misc-linuxGreg Kroah-Hartman1-0/+12
Jonathan writes: IIO: 2nd set of fixes for the 7.0 cycle Usual mixed bag of fixes for recent code and much older issues that have surfaced. Biggest group are continued resolution of IRQF_ONE_SHOT being used incorrectly (which now triggers a warning) adi,ad4062 - Replace IRQF_ONESHOT (as no threaded handler) with IRQF_NO_THREAD as the caller makes use of iio_trigger_poll() which cannot run from a thread. adi,ade9000 - Move mutex_init() earlier to ensure it is available if spurious IRQ occurs. adi,adis16550 - Fix swapped gyro and accel filter functions. adi,adxl3380 - Fix some bit manipulation that was always resulting in 0. - Fix incorrect register map for calibbias on the active power channel. - Fix returning IRQF_HANDLED from a function that should return 0 or -ERRNO. aspeed,adc - Clear a reference voltage bit that might be set prior to driver load. bosch,bno055 - Off by one channel buffer sizing. Benine due to padding prior to the subsequent timestamp. hid-sensors - A more complex fix to IRQF_ONESHOT warning as this driver had a trigger that was never actually used but the ABI that exposed had to be maintained to avoid regressions. hid-sensors-rotation - An obscure buffer alignment case that applies to quaternions only was recently broken resulting in writes beyond the end of the channel buffer. Add a new core macro and apply it in this driver to make it very clear what was going on. honeywell,abp2030pa - Remove meaningless IRQF_ONESHOT from a non threaded IRQ handler. Warning fix only. invense,mpu3050 - Fix token passed to free_irq() to match the one used at setup. - Fix an irq resource leak in error path. - Reorder probe so that userspace interfaces are exposed only after everything else has finished. - Reorder remove slightly to cleanup the buffer only after irq removed ensuring reverse of probe sequence. microchip,mcp47feb02 - Fix use of mutex before it was initialized by not performing unnecessary lock that was early enough in probe that all code was serial. st,lsm6dsx - Ensure that FIFO ODR is only controllable for accel and gyro channels avoiding incorrect register accesses. - Restrict separation of buffer sampling from main sampling rate to accelerometer. It is only useful for running event detection faster than the fifo and the only events are on the accelerometer. ti,ads1018 - Fix overflow of u8 which wasn't big enough to store max data rate value. ti,ads1119: - Fix unbalanced pm in an error path. - IRQF_ONESHOT (as no threaded handler) replaced with IRQF_NO_THREAD (needed for iio_trigger_poll()). - Ensure complete reinitialized before reuse. Previously it would have completed immediate after the first time. ti,ads7950 - Fix return value of gpio_get() to be 0 or 1. - Avoid accidental overwrite of state resulting in gpio_get() only returning 0 or -ERRNO but never 1. * tag 'iio-fixes-for-7.0b' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/jic23/iio: (25 commits) iio: imu: adis16550: fix swapped gyro/accel filter functions iio: adc: aspeed: clear reference voltage bits before configuring vref iio: adc: ti-ads1119: Reinit completion before wait_for_completion_timeout() iio: adc: ti-ads1018: fix type overflow for data rate iio: adc: ti-ads7950: do not clobber gpio state in ti_ads7950_get() iio: adc: ti-ads7950: normalize return value of gpio_get iio: orientation: hid-sensor-rotation: fix quaternion alignment iio: add IIO_DECLARE_QUATERNION() macro iio: adc: ti-ads1119: Replace IRQF_ONESHOT with IRQF_NO_THREAD iio: imu: bno055: fix BNO055_SCAN_CH_COUNT off by one iio: hid-sensors: Use software trigger iio: adc: ad4062: Replace IRQF_ONESHOT with IRQF_NO_THREAD iio: gyro: mpu3050: Fix out-of-sequence free_irq() iio: gyro: mpu3050: Move iio_device_register() to correct location iio: gyro: mpu3050: Fix irq resource leak iio: gyro: mpu3050: Fix incorrect free_irq() variable iio: imu: st_lsm6dsx: Set buffer sampling frequency for accelerometer only iio: imu: st_lsm6dsx: Set FIFO ODR for accelerometer and gyroscope only iio: dac: mcp47feb02: Fix mutex used before initialization iio: adc: ade9000: fix wrong return type in streaming push ...
2026-03-28Merge tag 'mm-hotfixes-stable-2026-03-28-10-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mmLinus Torvalds1-11/+21
Pull misc fixes from Andrew Morton: "10 hotfixes. 8 are cc:stable. 9 are for MM. There's a 3-patch series of DAMON fixes from Josh Law and SeongJae Park. The rest are singletons - please see the changelogs for details" * tag 'mm-hotfixes-stable-2026-03-28-10-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm/mseal: update VMA end correctly on merge bug: avoid format attribute warning for clang as well mm/pagewalk: fix race between concurrent split and refault mm/memory: fix PMD/PUD checks in follow_pfnmap_start() mm/damon/sysfs: check contexts->nr in repeat_call_fn mm/damon/sysfs: check contexts->nr before accessing contexts_arr[0] mm/damon/sysfs: fix param_ctx leak on damon_sysfs_new_test_ctx() failure mm/swap: fix swap cache memcg accounting MAINTAINERS, mailmap: update email address for Harry Yoo mm/huge_memory: fix folio isn't locked in softleaf_to_folio()
2026-03-27watchdog/hardlockup: improve buddy system detection timelinessMayank Rungta1-0/+1
Currently, the buddy system only performs checks every 3rd sample. With a 4-second interval. If a check window is missed, the next check occurs 12 seconds later, potentially delaying hard lockup detection for up to 24 seconds. Modify the buddy system to perform checks at every interval (4s). Introduce a missed-interrupt threshold to maintain the existing grace period while reducing the detection window to 8-12 seconds. Best and worst case detection scenarios: Before (12s check window): - Best case: Lockup occurs after first check but just before heartbeat interval. Detected in ~8s (8s till next check). - Worst case: Lockup occurs just after a check. Detected in ~24s (missed check + 12s till next check + 12s logic). After (4s check window with threshold of 3): - Best case: Lockup occurs just before a check. Detected in ~8s (0s till 1st check + 4s till 2nd + 4s till 3rd). - Worst case: Lockup occurs just after a check. Detected in ~12s (4s till 1st check + 4s till 2nd + 4s till 3rd). Link: https://lkml.kernel.org/r/20260312-hardlockup-watchdog-fixes-v2-4-45bd8a0cc7ed@google.com Signed-off-by: Mayank Rungta <mrungta@google.com> Reviewed-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Cc: Ian Rogers <irogers@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Li Huafei <lihuafei1@huawei.com> Cc: Max Kellermann <max.kellermann@ionos.com> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Stephane Erainan <eranian@google.com> Cc: Wang Jinchao <wangjinchao600@gmail.com> Cc: Yunhui Cui <cuiyunhui@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-03-27net: stmmac: provide flag to disable EEERussell King (Oracle)1-6/+7
Some platforms have problems when EEE is enabled, and thus need a way to disable stmmac EEE support. Add a flag before the other LPI related flags which tells stmmac to avoid populating the phylink LPI capabilities, which causes phylink to call phy_disable_eee() for any PHY that is attached to the affected phylink instance. iMX8MP is an example - the lpi_intr_o signal is wired to an OR gate along with the main dwmac interrupts. Since lpi_intr_o is synchronous to the receive clock domain, and takes four clock cycles to clear, this leads to interrupt storms as the interrupt remains asserted for some time after the LPI control and status register is read. This problem becomes worse when the receive clock from the PHY stops when the receive path enters LPI state - which means that lpi_intr_o can not deassert until the clock restarts. Since the LPI state of the receive path depends on the link partner, this is out of our control. We could disable RX clock stop at the PHY, but that doesn't get around the slow-to-deassert lpi_intr_o mentioned in the above paragraph. Previously, iMX8MP worked around this by disabling gigabit EEE, but this is insufficient - the problem is also visible at 100M speeds, where the receive clock is slower. There is extensive discussion and investigation in the thread linked below, the result of which is summarised in this commit message. Reported-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com> Closes: https://lore.kernel.org/r/20251026122905.29028-1-laurent.pinchart@ideasonboard.com Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Tested-by: Ovidiu Panait <ovidiu.panait.rb@renesas.com> Signed-off-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com> Reviewed-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com> Reviewed-by: Kieran Bingham <kieran.bingham@ideasonboard.com> Link: https://patch.msgid.link/20260325210003.2752013-2-laurent.pinchart@ideasonboard.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-27mm/huge_memory: fix folio isn't locked in softleaf_to_folio()Jinjiang Tu1-11/+21
On arm64 server, we found folio that get from migration entry isn't locked in softleaf_to_folio(). This issue triggers when mTHP splitting and zap_nonpresent_ptes() races, and the root cause is lack of memory barrier in softleaf_to_folio(). The race is as follows: CPU0 CPU1 deferred_split_scan() zap_nonpresent_ptes() lock folio split_folio() unmap_folio() change ptes to migration entries __split_folio_to_order() softleaf_to_folio() set flags(including PG_locked) for tail pages folio = pfn_folio(softleaf_to_pfn(entry)) smp_wmb() VM_WARN_ON_ONCE(!folio_test_locked(folio)) prep_compound_page() for tail pages In __split_folio_to_order(), smp_wmb() guarantees page flags of tail pages are visible before the tail page becomes non-compound. smp_wmb() should be paired with smp_rmb() in softleaf_to_folio(), which is missed. As a result, if zap_nonpresent_ptes() accesses migration entry that stores tail pfn, softleaf_to_folio() may see the updated compound_head of tail page before page->flags. This issue will trigger VM_WARN_ON_ONCE() in pfn_swap_entry_folio() because of the race between folio split and zap_nonpresent_ptes() leading to a folio incorrectly undergoing modification without a folio lock being held. This is a BUG_ON() before commit 93976a20345b ("mm: eliminate further swapops predicates"), which in merged in v6.19-rc1. To fix it, add missing smp_rmb() if the softleaf entry is migration entry in softleaf_to_folio() and softleaf_to_page(). [tujinjiang@huawei.com: update function name and comments] Link: https://lkml.kernel.org/r/20260321075214.3305564-1-tujinjiang@huawei.com Link: https://lkml.kernel.org/r/20260319012541.4158561-1-tujinjiang@huawei.com Fixes: e9b61f19858a ("thp: reintroduce split_huge_page()") Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Barry Song <baohua@kernel.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-03-27ptr_ring: disable KCSAN warningsMichael S. Tsirkin1-4/+4
Eric Dumazet reported KCSAN warnings: BUG: KCSAN: data-race in pfifo_fast_dequeue / pfifo_fast_enqueue write to 0xffff88811d5ccc00 of 8 bytes by interrupt on cpu 0: __ptr_ring_zero_tail include/linux/ptr_ring.h:259 [inline] __ptr_ring_discard_one include/linux/ptr_ring.h:291 [inline] __ptr_ring_consume include/linux/ptr_ring.h:311 [inline] __skb_array_consume include/linux/skb_array.h:98 [inline] pfifo_fast_dequeue+0x770/0x8f0 net/sched/sch_generic.c:770 dequeue_skb net/sched/sch_generic.c:297 [inline] qdisc_restart net/sched/sch_generic.c:402 [inline] __qdisc_run+0x189/0xc80 net/sched/sch_generic.c:420 qdisc_run include/net/pkt_sched.h:120 [inline] net_tx_action+0x379/0x590 net/core/dev.c:5793 handle_softirqs+0xb9/0x280 kernel/softirq.c:622 do_softirq+0x45/0x60 kernel/softirq.c:523 __local_bh_enable_ip+0x70/0x80 kernel/softirq.c:450 local_bh_enable include/linux/bottom_half.h:33 [inline] bpf_test_run+0x2db/0x620 net/bpf/test_run.c:426 bpf_prog_test_run_skb+0x9a4/0xef0 net/bpf/test_run.c:1159 bpf_prog_test_run+0x204/0x340 kernel/bpf/syscall.c:4721 __sys_bpf+0x52e/0x7e0 kernel/bpf/syscall.c:6246 __do_sys_bpf kernel/bpf/syscall.c:6341 [inline] __se_sys_bpf kernel/bpf/syscall.c:6339 [inline] __x64_sys_bpf+0x41/0x50 kernel/bpf/syscall.c:6339 x64_sys_call+0x10cb/0x3020 arch/x86/include/generated/asm/syscalls_64.h:322 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0x12c/0x370 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f read to 0xffff88811d5ccc00 of 8 bytes by task 22632 on cpu 1: __ptr_ring_produce include/linux/ptr_ring.h:106 [inline] ptr_ring_produce include/linux/ptr_ring.h:129 [inline] skb_array_produce include/linux/skb_array.h:44 [inline] pfifo_fast_enqueue+0xd5/0x2c0 net/sched/sch_generic.c:741 dev_qdisc_enqueue net/core/dev.c:4144 [inline] __dev_xmit_skb net/core/dev.c:4188 [inline] __dev_queue_xmit+0x6a4/0x1f20 net/core/dev.c:4795 dev_queue_xmit include/linux/netdevice.h:3384 [inline] __bpf_tx_skb net/core/filter.c:2153 [inline] __bpf_redirect_common net/core/filter.c:2197 [inline] __bpf_redirect+0x862/0x990 net/core/filter.c:2204 ____bpf_clone_redirect net/core/filter.c:2487 [inline] bpf_clone_redirect+0x20c/0x290 net/core/filter.c:2450 bpf_prog_53f18857bc887b09+0x22/0x2a bpf_dispatcher_nop_func include/linux/bpf.h:1402 [inline] __bpf_prog_run include/linux/filter.h:723 [inline] bpf_prog_run include/linux/filter.h:730 [inline] bpf_test_run+0x29d/0x620 net/bpf/test_run.c:423 bpf_prog_test_run_skb+0x9a4/0xef0 net/bpf/test_run.c:1159 bpf_prog_test_run+0x204/0x340 kernel/bpf/syscall.c:4721 __sys_bpf+0x52e/0x7e0 kernel/bpf/syscall.c:6246 __do_sys_bpf kernel/bpf/syscall.c:6341 [inline] __se_sys_bpf kernel/bpf/syscall.c:6339 [inline] __x64_sys_bpf+0x41/0x50 kernel/bpf/syscall.c:6339 x64_sys_call+0x10cb/0x3020 arch/x86/include/generated/asm/syscalls_64.h:322 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0x12c/0x370 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f value changed: 0xffff888104a93a00 -> 0x0000000000000000 Reported by Kernel Concurrency Sanitizer on: CPU: 1 UID: 0 PID: 22632 Comm: syz.0.4135 Tainted: G W syzkaller #0 PREEMPT(full) Tainted: [W]=WARN Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/24/2026 There is no race on ring accesses: reading/writing a partial pointer would be fine, because the reading is done by the producer which merely cares about NULL/non NULL. Document and disable the warnings using data_race(). Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/dd3984b3bce9df3591927f927668cb31cc7ecf34.1774460059.git.mst@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-27Merge tag 'spi-fix-v7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spiLinus Torvalds1-5/+0
Pull spi fixes from Mark Brown: "There are two core fixes here. One is from Johan dealing with an issue introduced by a devm_ API usage update causing things to be freed earlier than they had earlier when we fail to register a device, another from Danilo avoids unlocked acccess to data by converting to use a driver core API. We also have a few relatively minor driver specific fixes" * tag 'spi-fix-v7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi: spi: spi-fsl-lpspi: fix teardown order issue (UAF) spi: fix use-after-free on managed registration failure spi: use generic driver_override infrastructure spi: meson-spicc: Fix double-put in remove path spi: sn-f-ospi: Use devm_mutex_init() to simplify code spi: sn-f-ospi: Fix resource leak in f_ospi_probe()
2026-03-27dm: provide helper to set stacked limitsKeith Busch1-0/+7
There are multiple device mappers that set up their stacking limits exactly the same for the logical, physical and minimum IO queue limits. Provide a helper for it. Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2026-03-27mpage: Provide variant of mpage_writepages() with own optional folio handlerJan Kara1-2/+9
Some filesystems need to treat some folios specially (for example for inodes with inline data). Doing the handling in their .writepages method in a race-free manner results in duplicating some of the writeback internals. So provide generalized version of mpage_writepages() that allows filesystem to provide a handler called for each folio which can handle the folio in a special way. Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260326140635.15895-3-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz>
2026-03-27NTB: core: Add .get_dma_dev() callback to ntb_dev_opsKoichiro Den1-0/+24
Some NTB implementations are backed by a PCI function that is not the right struct device to use with DMA API helpers (e.g. due to IOMMU topology, or because the NTB device is virtual). Add an optional .get_dma_dev() callback to struct ntb_dev_ops and provide a helper, ntb_get_dma_dev(), so NTB clients can use the appropriate struct device for DMA allocations and mappings. If the callback is not implemented, ntb_get_dma_dev() returns the current default (ntb->dev.parent). Drivers that implement .get_dma_dev() must return a non-NULL device. Suggested-by: Frank Li <Frank.Li@nxp.com> Signed-off-by: Koichiro Den <den@valinux.co.jp> Signed-off-by: Manivannan Sadhasivam <mani@kernel.org> [bhelgaas: format doc] Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com> Link: https://patch.msgid.link/20260306031443.1911860-2-den@valinux.co.jp
2026-03-27Merge tag 'nvme-7.1-2026-03-27' of git://git.infradead.org/nvme into for-7.1/blockJens Axboe2-16/+49
Pull NVMe updates from Keith: "- Fabrics authentication updates (Eric, Alistar) - Enanced block queue limits support (Caleb) - Workqueue usage updates (Marco) - A new write zeroes device quirk (Robert) - Tagset cleanup fix for loop device (Nilay)" * tag 'nvme-7.1-2026-03-27' of git://git.infradead.org/nvme: (41 commits) nvme-loop: do not cancel I/O and admin tagset during ctrl reset/shutdown nvme: add WQ_PERCPU to alloc_workqueue users nvmet-fc: add WQ_PERCPU to alloc_workqueue users nvmet: replace use of system_wq with system_percpu_wq nvme-auth: Don't propose NVME_AUTH_DHGROUP_NULL with SC_C nvme: Add the DHCHAP maximum HD IDs nvme-pci: add NVME_QUIRK_DISABLE_WRITE_ZEROES for Kingston OM3SGP4 nvme: respect NVME_QUIRK_DISABLE_WRITE_ZEROES when wzsl is set nvmet: report NPDGL and NPDAL nvmet: use NVME_NS_FEAT_OPTPERF_SHIFT nvme: set discard_granularity from NPDG/NPDA nvme: add from0based() helper nvme: always issue I/O Command Set specific Identify Namespace nvme: update nvme_id_ns OPTPERF constants nvme: fold nvme_config_discard() into nvme_update_disk_info() nvme: add preferred I/O size fields to struct nvme_id_ns_nvm nvme: Allow reauth from sysfs nvme: Expose the tls_configured sysfs for secure concat connections nvmet-tcp: Don't free SQ on authentication success nvmet-tcp: Don't error if TLS is enabed on a reset ...
2026-03-27arm_mpam: resctrl: Add empty definitions for assorted resctrl functionsJames Morse1-0/+9
A few resctrl features and hooks need to be provided, but aren't needed or supported on MPAM platforms. resctrl has individual hooks to separately enable and disable the closid/partid and rmid/pmg context switching code. For MPAM this is all the same thing, as the value in struct task_struct is used to cache the value that should be written to hardware. arm64's context switching code is enabled once MPAM is usable, but doesn't touch the hardware unless the value has changed. For now event configuration is not supported, and can be turned off by returning 'false' from resctrl_arch_is_evt_configurable(). The new io_alloc feature is not supported either, always return false from the enable helper to indicate and fail the enable. Add this, and empty definitions for the other hooks. Tested-by: Gavin Shan <gshan@redhat.com> Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Peter Newman <peternewman@google.com> Tested-by: Zeng Heng <zengheng4@huawei.com> Tested-by: Punit Agrawal <punit.agrawal@oss.qualcomm.com> Tested-by: Jesse Chick <jessechick@os.amperecomputing.com> Reviewed-by: Zeng Heng <zengheng4@huawei.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Co-developed-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: James Morse <james.morse@arm.com>
2026-03-27arm_mpam: resctrl: Add resctrl_arch_rmid_read()James Morse1-0/+5
resctrl uses resctrl_arch_rmid_read() to read counters. CDP emulation means the counter may need reading in three different ways. The helpers behind the resctrl_arch_ functions will be re-used for the ABMC equivalent functions. Add the rounding helper for checking monitor values while we're here. Tested-by: Gavin Shan <gshan@redhat.com> Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Peter Newman <peternewman@google.com> Tested-by: Zeng Heng <zengheng4@huawei.com> Tested-by: Jesse Chick <jessechick@os.amperecomputing.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Co-developed-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: James Morse <james.morse@arm.com>
2026-03-27arm_mpam: resctrl: Allow resctrl to allocate monitorsJames Morse1-0/+5
When resctrl wants to read a domain's 'QOS_L3_OCCUP', it needs to allocate a monitor on the corresponding resource. Monitors are allocated by class instead of component. Add helpers to allocate a CSU monitor. These helper return an out of range value for MBM counters. Allocating a montitor context is expected to block until hardware resources become available. This only makes sense for QOS_L3_OCCUP as unallocated MBM counters are losing data. Tested-by: Gavin Shan <gshan@redhat.com> Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Peter Newman <peternewman@google.com> Tested-by: Zeng Heng <zengheng4@huawei.com> Tested-by: Punit Agrawal <punit.agrawal@oss.qualcomm.com> Tested-by: Jesse Chick <jessechick@os.amperecomputing.com> Reviewed-by: Zeng Heng <zengheng4@huawei.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Co-developed-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: James Morse <james.morse@arm.com>
2026-03-27arm_mpam: resctrl: Add rmid index helpersBen Horgan1-0/+3
Because MPAM's pmg aren't identical to RDT's rmid, resctrl handles some data structures by index. This allows x86 to map indexes to RMID, and MPAM to map them to partid-and-pmg. Add the helpers to do this. Tested-by: Gavin Shan <gshan@redhat.com> Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Peter Newman <peternewman@google.com> Tested-by: Zeng Heng <zengheng4@huawei.com> Tested-by: Punit Agrawal <punit.agrawal@oss.qualcomm.com> Tested-by: Jesse Chick <jessechick@os.amperecomputing.com> Reviewed-by: Zeng Heng <zengheng4@huawei.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Suggested-by: James Morse <james.morse@arm.com> Signed-off-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: James Morse <james.morse@arm.com>
2026-03-27arm_mpam: resctrl: Add CDP emulationJames Morse1-0/+2
Intel RDT's CDP feature allows the cache to use a different control value depending on whether the accesses was for instruction fetch or a data access. MPAM's equivalent feature is the other way up: the CPU assigns a different partid label to traffic depending on whether it was instruction fetch or a data access, which causes the cache to use a different control value based solely on the partid. MPAM can emulate CDP, with the side effect that the alternative partid is seen by all MSC, it can't be enabled per-MSC. Add the resctrl hooks to turn this on or off. Add the helpers that match a closid against a task, which need to be aware that the value written to hardware is not the same as the one resctrl is using. Update the 'arm64_mpam_global_default' variable the arch code uses during context switch to know when the per-cpu value should be used instead. Also, update these per-cpu values and sync the resulting mpam partid/pmg configuration to hardware. resctrl can enable CDP for L2 caches, L3 caches or both. When it is enabled by one and not the other MPAM globally enabled CDP but hides the effect on the other cache resource. This hiding is possible as CPOR is the only supported cache control and that uses a resource bitmap; two partids with the same bitmap act as one. Awkwardly, the MB controls don't implement CDP and CDP can't be hidden as the memory bandwidth control is a maximum per partid which can't be modelled with more partids. If the total maximum is used for both the data and instruction partids then then the maximum may be exceeded and if it is split in two then the one using more bandwidth will hit a lower limit. Hence, hide the MB controls completely if CDP is enabled for any resource. Tested-by: Gavin Shan <gshan@redhat.com> Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Peter Newman <peternewman@google.com> Tested-by: Zeng Heng <zengheng4@huawei.com> Tested-by: Punit Agrawal <punit.agrawal@oss.qualcomm.com> Tested-by: Jesse Chick <jessechick@os.amperecomputing.com> Cc: Dave Martin <Dave.Martin@arm.com> Cc: Amit Singh Tomar <amitsinght@marvell.com> Reviewed-by: Zeng Heng <zengheng4@huawei.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Co-developed-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: James Morse <james.morse@arm.com>
2026-03-27arm_mpam: resctrl: Add plumbing against arm64 task and cpu hooksJames Morse1-0/+5
arm64 provides helpers for changing a task's and a cpu's mpam partid/pmg values. These are used to back a number of resctrl_arch_ functions. Connect them up. Tested-by: Gavin Shan <gshan@redhat.com> Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Peter Newman <peternewman@google.com> Tested-by: Zeng Heng <zengheng4@huawei.com> Tested-by: Punit Agrawal <punit.agrawal@oss.qualcomm.com> Tested-by: Jesse Chick <jessechick@os.amperecomputing.com> Reviewed-by: Zeng Heng <zengheng4@huawei.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Co-developed-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: James Morse <james.morse@arm.com>
2026-03-27arm_mpam: resctrl: Add boilerplate cpuhp and domain allocationJames Morse1-0/+3
resctrl has its own data structures to describe its resources. We can't use these directly as we play tricks with the 'MBA' resource, picking the MPAM controls or monitors that best apply. We may export the same component as both L3 and MBA. Add mpam_resctrl_res[] as the array of class->resctrl mappings we are exporting, and add the cpuhp hooks that allocated and free the resctrl domain structures. Only the mpam control feature are considered here and monitor support will be added later. While we're here, plumb in a few other obvious things. CONFIG_ARM_CPU_RESCTRL is used to allow this code to be built even though it can't yet be linked against resctrl. Tested-by: Gavin Shan <gshan@redhat.com> Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Peter Newman <peternewman@google.com> Tested-by: Zeng Heng <zengheng4@huawei.com> Tested-by: Punit Agrawal <punit.agrawal@oss.qualcomm.com> Tested-by: Jesse Chick <jessechick@os.amperecomputing.com> Reviewed-by: Zeng Heng <zengheng4@huawei.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Co-developed-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: James Morse <james.morse@arm.com>
2026-03-27PCI: Align head space betterIlpo Järvinen1-0/+5
When a bridge window contains big and small resource(s), the small resource(s) may not amount to the half of the size of the big resource which would allow calculate_head_align() to shrink the head alignment. This results in always placing the small resource(s) after the big resource. In general, it would be good to be able to place the small resource(s) before the big resource to achieve better utilization of the address space. In the cases where the large resource can only fit at the end of the window, it is even required. However, carrying the information over from pbus_size_mem() and calculate_head_align() to __pci_assign_resource() and pcibios_align_resource() is not easy with the current data structures. A somewhat hacky way to move the non-aligning tail part to the head is possible within pcibios_align_resource(). The free space between the start of the free space span and the aligned start address can be compared with the non-aligning remainder of the size. If the free space is larger than the remainder, placing the remainder before the start address is possible. This relocation should generally work, because PCI resources consist only power-of-2 atoms. Various arch requirements may still need to override the relocation, so the relocation is only applied selectively in such cases. Closes: https://bugzilla.kernel.org/show_bug.cgi?id=221205 Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Tested-by: Xifer <xiferdev@gmail.com> Link: https://patch.msgid.link/20260324165633.4583-10-ilpo.jarvinen@linux.intel.com
2026-03-27resource: Pass full extent of empty space to resource_alignf callbackIlpo Järvinen2-3/+6
__find_resource_space() calculates the full extent of empty space but only passes the aligned space to resource_alignf callback. In some situations, the callback may choose take advantage of the free space before the requested alignment. Pass the full extent of the calculated empty space to resource_alignf callback as an additional parameter. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Tested-by: Xifer <xiferdev@gmail.com> Link: https://patch.msgid.link/20260324165633.4583-3-ilpo.jarvinen@linux.intel.com
2026-03-27nvme: Add the DHCHAP maximum HD IDsAlistair Francis1-0/+4
In preperation for using DHCHAP length in upcoming host and target patches let's add the hash and diffie-hellman ID length macros. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Yunje Shin <ioerts@kookmin.ac.kr> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chris Leech <cleech@redhat.com> Signed-off-by: Alistair Francis <alistair.francis@wdc.com> Signed-off-by: Keith Busch <kbusch@kernel.org>