| Age | Commit message (Collapse) | Author | Files | Lines |
|
The vduse_iova_range_v2 and vduse_iotlb_entry_v2 structures are both
defined in a way that adds implicit padding and is incompatible between
i386 and x86_64 userspace because of the different structure alignment
requirements. Building the header with -Wpadded shows these new warnings:
vduse.h:305:1: error: padding struct size to alignment boundary with 4 bytes [-Werror=padded]
vduse.h:374:1: error: padding struct size to alignment boundary with 4 bytes [-Werror=padded]
Change the amount of padding in these two structures to align them to
64 bit words and avoid those problems. Since the v1 vduse_iotlb_entry
already has an inconsistent size, do not attempt to reuse the structure
but rather list the members indiviudally, with a fixed amount of
padding.
Fixes: 079212f6877e ("vduse: add vq group asid support")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Message-Id: <20260202224835.559538-1-arnd@kernel.org>
|
|
KVM/arm64 updates for 7.0
- Add support for FEAT_IDST, allowing ID registers that are not
implemented to be reported as a normal trap rather than as an UNDEF
exception.
- Add sanitisation of the VTCR_EL2 register, fixing a number of
UXN/PXN/XN bugs in the process.
- Full handling of RESx bits, instead of only RES0, and resulting in
SCTLR_EL2 being added to the list of sanitised registers.
- More pKVM fixes for features that are not supposed to be exposed to
guests.
- Make sure that MTE being disabled on the pKVM host doesn't give it
the ability to attack the hypervisor.
- Allow pKVM's host stage-2 mappings to use the Force Write Back
version of the memory attributes by using the "pass-through'
encoding.
- Fix trapping of ICC_DIR_EL1 on GICv5 hosts emulating GICv3 for the
guest.
- Preliminary work for guest GICv5 support.
- A bunch of debugfs fixes, removing pointless custom iterators stored
in guest data structures.
- A small set of FPSIMD cleanups.
- Selftest fixes addressing the incorrect alignment of page
allocation.
- Other assorted low-impact fixes and spelling fixes.
|
|
ASoC: Updates for v7.0
This release is almost all abut driers, there's very little core work
here, although some of that driver work is in more generic areas like
SDCA and SOF:
- Generic SDCA support for reporting jack events.
- Continuing platform support, cleanup and feature improements for the
AMD, Intel, Qualcomm and SOF code.
- Platform description improvements for the Cirrus drivers.
- Support for NXP i.MX952, Realtek RT1320 and RT5575, and Sophogo
CV1800B.
We also pulled in one small SPI API update and some more substantial
regmap work (cache description improvements) for use in drivers.
|
|
- Support for Rock band 4 PS4 and PS5 guitars (Rosalie Wanders)
|
|
__rcu annotations on the return types from dma_fence_driver_name() and
dma_fence_timeline_name() cause sparse to complain because both the
constant signaled strings, and the strings return by the dma_fence_ops are
not __rcu annotated.
For a simple fix it is easiest to cast them with __rcu added and undo the
smarts from the tracpoints side of things. There is no functional change
since the rest is left in place. Later we can consider changing the
dma_fence_ops return types too, and handle all the individual drivers
which define them.
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Fixes: 506aa8b02a8d ("dma-fence: Add safe access helpers and document the rules")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202506162214.1eA69hLe-lkp@intel.com/
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://lore.kernel.org/r/20250616155952.24259-1-tvrtko.ursulin@igalia.com
Signed-off-by: Christian König <christian.koenig@amd.com>
|
|
This is based on AES256-CTS-HMAC384-192 crypto algorithm per RFC 8009
(i.e. Kerberos 5, hence the name) with custom-defined key usage numbers.
The implementation allows a given key to have/be linked to between one
and three usage numbers.
The existing CEPH_CRYPTO_AES remains in place and unchanged. The
usage_slot parameter that needed to be added to ceph_crypt() and its
wrappers is simply ignored there.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
In commit 4d0b035fd6da ("drm/xe/uapi: loosen used tracking restriction")
we dropped the CAP_PERMON restriction but missed updating the
corresponding kernel-doc. Fix that.
v2 (Sanjay):
- Don't drop the note around the extra cpu_visible_used expectations.
Reported-by: Ulisses Furquim <ulisses.furquim@intel.com>
Fixes: 4d0b035fd6da ("drm/xe/uapi: loosen used tracking restriction")
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Sanjay Yadav <sanjay.kumar.yadav@intel.com>
Reviewed-by: Sanjay Yadav <sanjay.kumar.yadav@intel.com>
Link: https://patch.msgid.link/20260130125105.451229-2-matthew.auld@intel.com
|
|
Expose DMABUF functionality to userspace through the uverbs interface,
enabling InfiniBand/RDMA devices to export PCI based memory regions
(e.g. device memory) as DMABUF file descriptors. This allows
zero-copy sharing of RDMA memory with other subsystems that support the
dma-buf framework.
A new UVERBS_OBJECT_DMABUF object type and allocation method were
introduced.
During allocation, uverbs invokes the driver to supply the
rdma_user_mmap_entry associated with the given page offset (pgoff).
Based on the returned rdma_user_mmap_entry, uverbs requests the driver
to provide the corresponding physical-memory details as well as the
driver’s PCI provider information.
Using this information, dma_buf_export() is called; if it succeeds,
uobj->object is set to the underlying file pointer returned by the
dma-buf framework.
The file descriptor number follows the standard uverbs allocation flow,
but the file pointer comes from the dma-buf subsystem, including its own
fops and private data.
When an mmap entry is removed, uverbs iterates over its associated
DMABUFs, marks them as revoked, and calls dma_buf_move_notify() so that
their importers are notified.
The same procedure applies during the disassociate flow; final cleanup
occurs when the application closes the file.
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20260201-dmabuf-export-v3-2-da238b614fe3@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
* Reuse common phys_vec, phase out dma_buf_phys_vec
Signed-off-by: Alex Williamson <alex@shazbot.org>
* tag 'common_phys_vec_via_vfio':
types: reuse common phys_vec type instead of DMABUF open‑coded variant
types: move phys_vec definition to common header
nvme-pci: Use size_t for length fields to handle larger sizes
|
|
For smbdirect it required to use different ports depending
on the RDMA protocol. E.g. for iWarp 5445 is needed
(as tcp port 445 already used by the raw tcp transport for SMB),
while InfiniBand, RoCEv1 and RoCEv2 use port 445, as they
use an independent port range (even for RoCEv2, which uses udp
port 4791 itself).
Currently ksmbd is not able to function correctly at
all if the system has iWarp (RDMA_NODE_RNIC) interface(s)
and any InfiniBand, RoCEv1 and/or RoCEv2 interface(s)
at the same time.
And cifs.ko uses 5445 with a fallback to 445, which
means depending on the available interfaces, it tries
5445 in the RoCE range or may tries iWarp with 445
as a fallback. This leads to strange error messages
and strange network captures.
To avoid these problems they will be able to
use rdma_restrict_node_type(RDMA_NODE_RNIC) before
trying port 5445 and rdma_restrict_node_type(RDMA_NODE_IB_CA)
before trying port 445. It means we'll get early
-ENODEV early from rdma_resolve_addr() without any
network traffic and timeouts.
This is designed to be called before calling any
of rdma_bind_addr(), rdma_resolve_addr() or rdma_listen().
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: linux-rdma@vger.kernel.org
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Acked-by: Leon Romanovsky <leon@kernel.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Pull 6.19-devel branch.
Signed-off-by: Takashi Iwai <tiwai@suse.de>
|
|
Enabling KCSAN is causing a large number of duplicate types in BTF for
core kernel structs like task_struct [1]. This is due to the definition
in include/linux/compiler_types.h
`#ifdef __SANITIZE_THREAD__
...
`#define __data_racy volatile
..
`#else
...
`#define __data_racy
...
`#endif
Because some objects in the kernel are compiled without KCSAN flags
(KCSAN_SANITIZE) we sometimes get the empty __data_racy annotation for
objects; as a result we get multiple conflicting representations of the
associated structs in DWARF, and these lead to multiple instances of core
kernel types in BTF since they cannot be deduplicated due to the
additional modifier in some instances.
Moving the __data_racy definition under CONFIG_KCSAN avoids this problem,
since the volatile modifier will be present for both KCSAN and
KCSAN_SANITIZE objects in a CONFIG_KCSAN=y kernel.
Link: https://lkml.kernel.org/r/20260116091730.324322-1-alan.maguire@oracle.com
Fixes: 31f605a308e6 ("kcsan, compiler_types: Introduce __data_racy type qualifier")
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Reported-by: Nilay Shroff <nilay@linux.ibm.com>
Tested-by: Nilay Shroff <nilay@linux.ibm.com>
Suggested-by: Marco Elver <elver@google.com>
Reviewed-by: Marco Elver <elver@google.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: Bart van Assche <bvanassche@acm.org>
Cc: Daniel Borkman <daniel@iogearbox.net>
Cc: Eduard Zingerman <eddyz87@gmail.com>
Cc: Hao Luo <haoluo@google.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Jason A. Donenfeld <jason@zx2c4.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Kees Cook <kees@kernel.org>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Naman Jain <namjain@linux.microsoft.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stanislav Fomichev <sdf@fomichev.me>
Cc: Uros Bizjak <ubizjak@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Introduce an in-kernel test module to validate the core logic of the Live
Update Orchestrator's File-Lifecycle-Bound feature. This provides a
low-level, controlled environment to test FLB registration and callback
invocation without requiring userspace interaction or actual kexec
reboots.
The test is enabled by the CONFIG_LIVEUPDATE_TEST Kconfig option.
Link: https://lkml.kernel.org/r/20251218155752.3045808-6-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: David Gow <davidgow@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <kees@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Pratyush Yadav <pratyush@kernel.org>
Cc: Samiullah Khawaja <skhawaja@google.com>
Cc: Tamir Duberstein <tamird@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Introduce a mechanism for managing global kernel state whose lifecycle is
tied to the preservation of one or more files. This is necessary for
subsystems where multiple preserved file descriptors depend on a single,
shared underlying resource.
An example is HugeTLB, where multiple file descriptors such as memfd and
guest_memfd may rely on the state of a single HugeTLB subsystem.
Preserving this state for each individual file would be redundant and
incorrect. The state should be preserved only once when the first file is
preserved, and restored/finished only once the last file is handled.
This patch introduces File-Lifecycle-Bound (FLB) objects to solve this
problem. An FLB is a global, reference-counted object with a defined set
of operations:
- A file handler (struct liveupdate_file_handler) declares a dependency
on one or more FLBs via a new registration function,
liveupdate_register_flb().
- When the first file depending on an FLB is preserved, the FLB's
.preserve() callback is invoked to save the shared global state. The
reference count is then incremented for each subsequent file.
- Conversely, when the last file is unpreserved (before reboot) or
finished (after reboot), the FLB's .unpreserve() or .finish() callback
is invoked to clean up the global resource.
The implementation includes:
- A new set of ABI definitions (luo_flb_ser, luo_flb_head_ser) and a
corresponding FDT node (luo-flb) to serialize the state of all active
FLBs and pass them via Kexec Handover.
- Core logic in luo_flb.c to manage FLB registration, reference
counting, and the invocation of lifecycle callbacks.
- An API (liveupdate_flb_get/_incoming/_outgoing) for other kernel
subsystems to safely access the live object managed by an FLB, both
before and after the live update.
This framework provides the necessary infrastructure for more complex
subsystems like IOMMU, VFIO, and KVM to integrate with the Live Update
Orchestrator.
Link: https://lkml.kernel.org/r/20251218155752.3045808-5-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: David Gow <davidgow@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <kees@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Pratyush Yadav <pratyush@kernel.org>
Cc: Samiullah Khawaja <skhawaja@google.com>
Cc: Tamir Duberstein <tamird@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "list private v2 & luo flb", v9.
This series introduces two connected infrastructure improvements: a new
API for handling private linked lists, and the "File-Lifecycle-Bound"
(FLB) mechanism for the Live Update Orchestrator.
1. Private List Primitives (patches 1-3)
Recently, Linux introduced the ability to mark structure members as
__private and access them via ACCESS_PRIVATE(). This enforces better
encapsulation by ensuring internal details are only accessible by the
owning subsystem.
However, struct list_head is frequently used as an internal linkage
mechanism within these private sections. The standard macros in
<linux/list.h> do not support ACCESS_PRIVATE() natively. Consequently,
subsystems using private lists are forced to implement ad-hoc
workarounds or local iterator macros.
This series adds <linux/list_private.h>, providing a set of
primitives identical to those in <linux/list.h> but designed for
private list heads. It also includes a KUnit test suite to verify that
the macros correctly handle pointer offsets and qualifiers.
2. This series adds FLB (patches 4-5) support to Live Update that also
internally uses private lists.
FLB allows global kernel state (such as IOMMU domains or HugeTLB
state) to be preserved once, shared across multiple file descriptors,
and restored when needed. This is necessary for subsystems where
multiple preserved file descriptors depend on a single, shared
underlying resource. Preserving this state for each individual file
would be redundant and incorrect.
FLB uses reference counting tied to the lifecycle of preserved
files. The state is preserved when the first file depending on it is
preserved, and restored or cleaned up only when the last file is
handled.
This patch (of 5):
Linux recently added an ability to add private members to structs (i.e.
__private) and access them via ACCESS_PRIVATE(). This ensures that those
members are only accessible by the subsystem which owns the struct type,
and not to the object owner.
However, struct list_head often needs to be placed into the private
section to be manipulated privately by the subsystem.
Add macros to support private list manipulations in
<linux/list_private.h>.
[akpm@linux-foundation.org: fix kerneldoc]
Link: https://lkml.kernel.org/r/20251218155752.3045808-1-pasha.tatashin@soleen.com
Link: https://lkml.kernel.org/r/20251218155752.3045808-2-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: David Gow <davidgow@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <kees@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Pratyush Yadav <pratyush@kernel.org>
Cc: Samiullah Khawaja <skhawaja@google.com>
Cc: Tamir Duberstein <tamird@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The custom definition of 'struct timespec64' is incompatible with both the
kernel's internal definition and the glibc type, at least on big-endian
targets that have the tv_nsec field in a different place, and the
definition clashes with any userspace that also defines a timespec64
structure.
Running the header check with -Wpadding enabled produces this output that
warns about the incorrect padding:
usr/include/linux/taskstats.h:25:1: error: padding struct size to alignment boundary with 4 bytes [-Werror=padded]
Remove the hack and instead use the regular __kernel_timespec type that is
meant to be used in uapi definitions.
Link: https://lkml.kernel.org/r/20260202095906.1344100-1-arnd@kernel.org
Fixes: 29b63f6eff0e ("delayacct: add timestamp of delay max")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Fan Yu <fan.yu9@zte.com.cn>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: xu xin <xu.xin16@zte.com.cn>
Cc: Yang Yang <yang.yang29@zte.com.cn>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Jiang Kun <jiang.kun2@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Pull scheduler fixes from Ingo Molnar:
"Miscellaneous MMCID fixes to address bugs and performance regressions
in the recent rewrite of the SCHED_MM_CID management code:
- Fix livelock triggered by BPF CI testing
- Fix hard lockup on weakly ordered systems
- Simplify the dropping of CIDs in the exit path by removing an
unintended transition phase
- Fix performance/scalability regression on a thread-pool benchmark
by optimizing transitional CIDs when scheduling out"
* tag 'sched-urgent-2026-02-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/mmcid: Optimize transitional CIDs when scheduling out
sched/mmcid: Drop per CPU CID immediately when switching to per task mode
sched/mmcid: Protect transition on weakly ordered systems
sched/mmcid: Prevent live lock on task to CPU mode transition
|
|
Pull objtool fixes from Ingo Molnar::
- Bump up the Clang minimum version requirements for livepatch
builds, due to Clang assembler section handling bugs causing
silent miscompilations
- Strip livepatching symbol artifacts from non-livepatch modules
- Fix livepatch build warnings when certain Clang LTO options
are enabled
- Fix livepatch build error when CONFIG_MEM_ALLOC_PROFILING_DEBUG=y
* tag 'objtool-urgent-2026-02-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool/klp: Fix unexported static call key access for manually built livepatch modules
objtool/klp: Fix symbol correlation for orphaned local symbols
livepatch: Free klp_{object,func}_ext data after initialization
livepatch: Fix having __klp_objects relics in non-livepatch modules
livepatch/klp-build: Require Clang assembler >= 20
|
|
Now that the HBH jumbo helpers are not used by any driver or GSO, remove
them altogether.
Signed-off-by: Alice Mikityanska <alice@isovalent.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260205133925.526371-13-alice.kernel@fastmail.im
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
BIG TCP IPv6 inserts a hop-by-hop extension header to indicate the real
IPv6 payload length when it doesn't fit into the 16-bit field in the
IPv6 header itself. While it helps tools parse the packet, it also
requires every driver that supports TSO and BIG TCP to remove this
8-byte extension header. It might not sound that bad until we try to
apply it to tunneled traffic. Currently, the drivers don't attempt to
strip HBH if skb->encapsulation = 1. Moreover, trying to do so would
require dissecting different tunnel protocols and making corresponding
adjustments on case-by-case basis, which would slow down the fastpath
(potentially also requiring adjusting checksums in outer headers).
At the same time, BIG TCP IPv4 doesn't insert any extra headers and just
calculates the payload length from skb->len, significantly simplifying
implementing BIG TCP for tunnels.
Stop inserting HBH when building BIG TCP GSO SKBs.
Signed-off-by: Alice Mikityanska <alice@isovalent.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260205133925.526371-3-alice.kernel@fastmail.im
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The next commits will transition away from using the hop-by-hop
extension header to encode packet length for BIG TCP. Add wrappers
around ip6->payload_len that return the actual value if it's non-zero,
and calculate it from skb->len if payload_len is set to zero (and a
symmetrical setter).
The new helpers are used wherever the surrounding code supports the
hop-by-hop jumbo header for BIG TCP IPv6, or the corresponding IPv4 code
uses skb_ip_totlen (e.g., in include/net/netfilter/nf_tables_ipv6.h).
No behavioral change in this commit.
Signed-off-by: Alice Mikityanska <alice@isovalent.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260205133925.526371-2-alice.kernel@fastmail.im
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The following warnings were visible:
$ ./scripts/kernel-doc -Wall -none \
net/mptcp/ include/net/mptcp.h include/uapi/linux/mptcp*.h \
include/trace/events/mptcp.h
Warning: net/mptcp/token.c:108 No description found for return value of 'mptcp_token_new_request'
Warning: net/mptcp/token.c:151 No description found for return value of 'mptcp_token_new_connect'
Warning: net/mptcp/token.c:246 No description found for return value of 'mptcp_token_get_sock'
Warning: net/mptcp/token.c:298 No description found for return value of 'mptcp_token_iter_next'
Warning: net/mptcp/protocol.c:4431 No description found for return value of 'mptcp_splice_read'
Warning: include/uapi/linux/mptcp_pm.h:13 missing initial short description on line:
* enum mptcp_event_type
Address all of them: either by using the 'Return:' keyword, or by adding
a missing initial short description.
The MPTCP CI will soon report issues with kdoc to avoid introducing new
issues and being flagged by the Netdev CI.
Reviewed-by: Geliang Tang <geliang@kernel.org>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260205-net-mptcp-misc-fixes-6-19-rc8-v2-3-c2720ce75c34@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This helper is already (auto)inlined from IPv4 TCP stack.
Make it an inline function to benefit IPv6 as well.
$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/2 grow/shrink: 1/0 up/down: 30/-49 (-19)
Function old new delta
tcp_v6_rcv 3448 3478 +30
__pfx_tcp_filter 16 - -16
tcp_filter 33 - -33
Total: Before=24891904, After=24891885, chg -0.00%
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260205164329.3401481-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Currently, unhash_nsid() scans the entire system for each netns being
killed, leading to O(L_dying_net * M_alive_net * N_id) complexity, as
__peernet2id() also performs a linear search in the IDR.
Optimize this to O(M_alive_net * N_id) by batching unhash operations. Move
unhash_nsid() out of the per-netns loop in cleanup_net() to perform a
single-pass traversal over survivor namespaces.
Identify dying peers by an 'is_dying' flag, which is set under net_rwsem
write lock after the netns is removed from the global list. This batches
the unhashing work and eliminates the O(L_dying_net) multiplier.
To minimize the impact on struct net size, 'is_dying' is placed in an
existing hole after 'hash_mix' in struct net.
Use a restartable idr_get_next() loop for iteration. This avoids the
unsafe modification issue inherent to idr_for_each() callbacks and allows
dropping the nsid_lock to safely call sleepy rtnl_net_notifyid().
Clean up redundant nsid_lock and simplify the destruction loop now that
unhashing is centralized.
Signed-off-by: Qiliang Yuan <yuanql9@chinatelecom.cn>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260204074854.3506916-1-realwujing@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
For architectures that define __HAVE_ARCH_TLB_REMOVE_TABLE, the page
tables at the pmd/pud level are generally not of struct ptdesc type, and
do not have pt_rcu_head member, thus these architectures cannot support
PT_RECLAIM.
In preparation for enabling PT_RECLAIM on more architectures, convert
__HAVE_ARCH_TLB_REMOVE_TABLE to CONFIG_HAVE_ARCH_TLB_REMOVE_TABLE config,
so that we can make conditional judgments in Kconfig.
Link: https://lkml.kernel.org/r/5ebfa3d4b56e63c6906bda5eccaa9f7194d3a86b.1769515122.git.zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Tested-by: Andreas Larsson <andreas@gaisler.com> [sparc, UP&SMP]
Acked-by: Andreas Larsson <andreas@gaisler.com> [sparc]
Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Magnus Lindholm <linmag7@gmail.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
- Fix documentation typos (Shawn Lin)
- Add struct p2pdma_provider kernel doc (Leon Romanovsky)
- Remove useless devres WARN_ON() (Philipp Stanner)
* pci/misc:
PCI: Remove useless WARN_ON() from devres
PCI/P2PDMA: Add missing struct p2pdma_provider documentation
Documentation: PCI: Fix typos in msi-howto.rst
|
|
- Extend PCI_FIND_NEXT_CAP() and PCI_FIND_NEXT_EXT_CAP() to return a
pointer to the preceding Capability (Qiang Yu)
- Add dw_pcie_remove_capability() and dw_pcie_remove_ext_capability() to
remove Capabilities that are advertised but not fully implemented (Qiang
Yu)
- Remove MSI and MSI-X Capabilities for DWC controllers in platforms that
can't support them, so we automatically fall back to INTx (Qiang Yu)
- Remove MSI-X and DPC Capabilities for Qualcomm platforms that advertise
but don't support them (Qiang Yu)
- Remove duplicate dw_pcie_ep_hide_ext_capability() function and replace
with dw_pcie_remove_ext_capability() (Qiang Yu)
- Add ASPM L1.1 and L1.2 Substates context to debugfs ltssm_status for
drivers that support this (Shawn Lin)
- Skip PME_Turn_Off broadcast and L2/L3 transition during suspend if link
is not up to avoid an unnecessary timeout (Manivannan Sadhasivam)
- Revert dw-rockchip, qcom, and DWC core changes that used link-up IRQs to
trigger enumeration instead of waiting for link to be up because the PCI
core doesn't allocate bus number space for hierarchies that might be
attached (Niklas Cassel)
- Make endpoint iATU entry for MSI permanent instead of programming it
dynamically, which is slow and racy with respect to other concurrent
traffic, e.g., eDMA (Koichiro Den)
- Use iMSI-RX MSI target address when possible to fix endpoints using
32-bit MSI (Shawn Lin)
- Make dw_pcie_ltssm_status_string() available and use it for logging
errors in dw_pcie_wait_for_link() (Manivannan Sadhasivam)
- Return -ENODEV when dw_pcie_wait_for_link() finds no devices, -EIO for
device present but inactive, -ETIMEDOUT for other failures, so callers
can handle these cases differently (Manivannan Sadhasivam)
- Allow DWC host controller driver probe to continue if device is not found
or found but inactive; only fail when there's an error with the link
(Manivannan Sadhasivam)
- For controllers like NXP i.MX6QP and i.MX7D, where LTSSM registers are
not accessible after PME_Turn_Off, simply wait 10ms instead of polling
for L2/L3 Ready (Richard Zhu)
- Use multiple iATU entries to map large bridge windows and DMA ranges when
necessary instead of failing (Samuel Holland)
- Rename struct dw_pcie_rp.has_msi_ctrl to .use_imsi_rx for clarity (Qiang
Yu)
- Add EPC dynamic_inbound_mapping feature bit for Endpoint Controllers that
can update BAR inbound address translation without requiring EPF driver
to clear/reset the BAR first, and advertise it for DWC-based Endpoints
(Koichiro Den)
- Add EPC subrange_mapping feature bit for Endpoint Controllers that can
map multiple independent inbound regions in a single BAR, implement
subrange mapping, advertise it for DWC-based Endpoints, and add Endpoint
selftests for it (Koichiro Den)
- Allow overriding default BAR sizes for pci-epf-test (Niklas Cassel)
- Make resizable BARs work for Endpoint multi-PF configurations; previously
it only worked for PF 0 (Aksh Garg)
- Fix Endpoint non-PF 0 support for BAR configuration, ATU mappings, and
Address Match Mode (Aksh Garg)
- Fix issues with outbound iATU index assignment that caused iATU index to
be out of bounds (Niklas Cassel)
- Clean up iATU index tracking to be consistent (Niklas Cassel)
- Set up iATU when ECAM is enabled; previously IO and MEM outbound windows
weren't programmed, and ECAM-related iATU entries weren't restored after
suspend/resume, so config accesses failed (Krishna Chaitanya Chundru)
* pci/controller/dwc:
PCI: dwc: Fix missing iATU setup when ECAM is enabled
PCI: dwc: Clean up iATU index usage in dw_pcie_iatu_setup()
PCI: dwc: Fix msg_atu_index assignment
PCI: dwc: ep: Add comment explaining controller level PTM access in multi PF setup
PCI: dwc: ep: Add per-PF BAR and inbound ATU mapping support
PCI: dwc: ep: Fix resizable BAR support for multi-PF configurations
PCI: endpoint: pci-epf-test: Allow overriding default BAR sizes
selftests: pci_endpoint: Add BAR subrange mapping test case
misc: pci_endpoint_test: Add BAR subrange mapping test case
PCI: endpoint: pci-epf-test: Add BAR subrange mapping test support
Documentation: PCI: endpoint: Clarify pci_epc_set_bar() usage
PCI: dwc: ep: Support BAR subrange inbound mapping via Address Match Mode iATU
PCI: dwc: Advertise dynamic inbound mapping support
PCI: endpoint: Add BAR subrange mapping support
PCI: endpoint: Add dynamic_inbound_mapping EPC feature
PCI: dwc: Rename dw_pcie_rp::has_msi_ctrl to dw_pcie_rp::use_imsi_rx for clarity
PCI: dwc: Fix grammar and formatting for comment in dw_pcie_remove_ext_capability()
PCI: dwc: Use multiple iATU windows for mapping large bridge windows and DMA ranges
PCI: dwc: Remove duplicate dw_pcie_ep_hide_ext_capability() function
PCI: dwc: Skip waiting for L2/L3 Ready if dw_pcie_rp::skip_l23_wait is true
PCI: dwc: Fail dw_pcie_host_init() if dw_pcie_wait_for_link() returns -ETIMEDOUT
PCI: dwc: Rework the error print of dw_pcie_wait_for_link()
PCI: dwc: Rename and move ltssm_status_string() to pcie-designware.c
PCI: dwc: Return -EIO from dw_pcie_wait_for_link() if device is not active
PCI: dwc: Return -ENODEV from dw_pcie_wait_for_link() if device is not found
PCI: dwc: Use cfg0_base as iMSI-RX target address to support 32-bit MSI devices
PCI: dwc: ep: Cache MSI outbound iATU mapping
Revert "PCI: dwc: Don't wait for link up if driver can detect Link Up event"
Revert "PCI: qcom: Enumerate endpoints based on Link up event in 'global_irq' interrupt"
Revert "PCI: qcom: Enable MSI interrupts together with Link up if 'Global IRQ' is supported"
Revert "PCI: qcom: Don't wait for link if we can detect Link Up"
Revert "PCI: dw-rockchip: Enumerate endpoints based on dll_link_up IRQ"
Revert "PCI: dw-rockchip: Don't wait for link since we can detect Link Up"
PCI: dwc: Skip PME_Turn_Off broadcast and L2/L3 transition during suspend if link is not up
PCI: dw-rockchip: Change get_ltssm() to provide L1 Substates info
PCI: dwc: Add L1 Substates context to ltssm_status of debugfs
PCI: qcom: Remove DPC Extended Capability
PCI: qcom: Remove MSI-X Capability for Root Ports
PCI: dwc: Remove MSI/MSIX capability for Root Port if iMSI-RX is used as MSI controller
PCI: dwc: Add new APIs to remove standard and extended Capability
PCI: Add preceding capability position support in PCI_FIND_NEXT_*_CAP macros
|
|
- Mark ASM1164 SATA controller to avoid bus reset since it fails to train
the Link after reset (Alex Williamson)
- Mark Nvidia GB10 Root Ports to avoid bus reset since they may fail to
retrain the link after reset (Johnny-CC Chang)
- Add lockdep and other lock assertions (Ilpo Järvinen)
- Add ACS quirk for Qualcomm Hamoa & Glymur, which provides ACS-like
features but doesn't advertise an ACS Capability (Krishna Chaitanya
Chundru)
- Add ACS quirk for Pericom PI7C9X2G404 switches, which fail under load
when P2P Redirect Request is enabled (Nicolas Cavallari)
- Remove an incorrect unlock in pci_slot_trylock() error handling (Jinhui
Guo)
- Lock the bridge device for slot reset (Keith Busch)
- Enable ACS after IOMMU configuration on OF platforms so ACS is enabled an
all devices; previously the first device enumeration (typically a Root
Port) was omitted (Manivannan Sadhasivam)
- Disable ACS Source Validation for IDT 0x80b5 and 0x8090 switches to work
around hardware erratum; previously ACS SV was temporarily disabled,
which worked for enumeration but not after reset (Manivannan Sadhasivam)
* pci/virtualization:
PCI: Disable ACS SV for IDT 0x8090 switch
PCI: Disable ACS SV for IDT 0x80b5 switch
PCI: Cache ACS Capabilities register
PCI: Enable ACS after configuring IOMMU for OF platforms
PCI: Add ACS quirk for Pericom PI7C9X2G404 switches [12d8:b404]
PCI: Add ACS quirk for Qualcomm Hamoa & Glymur
PCI: Use device_lock_assert() to verify device lock is held
PCI: Use lockdep_assert_held(pci_bus_sem) to verify lock is held
PCI: Fix pci_slot_lock () device locking
PCI: Fix pci_slot_trylock() error handling
PCI: Mark Nvidia GB10 to avoid bus reset
PCI: Mark ASM1164 SATA controller to avoid bus reset
|
|
- Add generic RAS tracepoint for hotplug events (Shuai Xue)
- Add RAS tracepoint for link speed changes (Shuai Xue)
* pci/trace:
Documentation: tracing: Add PCI tracepoint documentation
PCI: trace: Add RAS tracepoint to monitor link speed changes
PCI: trace: Add generic RAS tracepoint for hotplug event
|
|
- Build zero-sized resources when a BAR is larger than 4G but
pci_bus_addr_t or resource_size_t can't represent 64-bit addresses (Ilpo
Järvinen)
- Fix bridge window alignment with optional resources, where we previously
lost the additional alignment requirement (Ilpo Järvinen)
- Stop over-estimating bridge window size since we now assign them without
any gaps between them (Ilpo Järvinen)
- Increase resource MAX_IORES_LEVEL to avoid /proc/iomem flattening for
nested bridges and endpoints (Ilpo Järvinen)
- Remove old_size limit from bridge window sizing (Ilpo Järvinen)
- Push realloc check into pbus_size_mem() to simplify callers (Ilpo
Järvinen)
- Pass bridge window resource to pbus_size_mem() to avoid looking it up
again (Ilpo Järvinen)
- Use res_to_dev_res() instead of open-coding the same search (Ilpo
Järvinen)
- Add pci_resource_is_bridge_win() helper (Ilpo Järvinen)
- Add more logging of resource assignment (Ilpo Järvinen)
- Add pbus_mem_size_optional() to handle sizes of optional resources
(SR-IOV VF BARs, expansion ROMs, bridge windows) (Ilpo Järvinen)
- Move CardBus code to setup-cardbus.c and only build it when
CONFIG_CARDBUS is set (Ilpo Järvinen)
- Use scnprintf() instead of sprintf() (Ilpo Järvinen)
- Add pbus_validate_busn() for Bus Number validation (Ilpo Järvinen)
- Don't claim disabled bridge windows to avoid spurious claim failures
(Ilpo Järvinen)
* pci/resource:
PCI: Don't claim disabled bridge windows
PCI: Move CardBus bridge scanning to setup-cardbus.c
PCI: Add pbus_validate_busn() for Bus Number validation
PCI: Add dword #defines for Bus Number + Secondary Latency Timer
PCI: Use scnprintf() instead of sprintf()
PCI: Handle CardBus-specific params in setup-cardbus.c
PCI: Separate CardBus setup & build it only with CONFIG_CARDBUS
PCI: Add 'pci' prefix to struct pci_dev_resource handling functions
PCI: Use resource_assigned() in setup-bus.c algorithm
resource: Mark res given to resource_assigned() as const
PCI: Add pbus_mem_size_optional() to handle optional sizes
PCI: Check invalid align earlier in pbus_size_mem()
PCI: Log reset and restore of resources
PCI: Add pci_resource_is_bridge_win()
PCI: Fetch dev_res to local var in __assign_resources_sorted()
PCI: Use res_to_dev_res() in reassign_resources_sorted()
PCI: Pass bridge window resource to pbus_size_mem()
PCI: Push realloc check into pbus_size_mem()
PCI: Remove old_size limit from bridge window sizing
resource: Increase MAX_IORES_LEVEL to 8
PCI: Stop over-estimating bridge window size
PCI: Rewrite bridge window head alignment function
PCI: Fix bridge window alignment with optional resources
PCI: Use resource_set_range() that correctly sets ->end
|
|
- Rename pwrseq, tc9563, and slot driver structs, variables, and functions
for consistency (Bjorn Helgaas)
- Add power_on/off callbacks with generic signature to pwrseq, tc9563, and
slot drivers so they can be used by pwrctrl core (Manivannan Sadhasivam)
- Add interfaces to create and destroy pwrctrl devices (Krishna Chaitanya
Chundru)
- Add interfaces to power devices on and off (Manivannan Sadhasivam)
- Switch to pwrctrl interfaces to create, destroy, and power on/off
devices, calling them from host controller drivers instead of the PCI
core (Manivannan Sadhasivam)
- Drop qcom .assert_perst() callbacks since this is now done by the
controller driver instead of the pwrctrl driver (Manivannan Sadhasivam)
- Add PCIe M.2 connector support to the slot pwrctrl driver (Manivannan
Sadhasivam)
- Create pwrctrl devices for devicetree PCIe M.2 connector nodes
(Manivannan Sadhasivam)
* pci/pwrctrl:
PCI/pwrctrl: Create pwrctrl device if graph port is found
PCI/pwrctrl: Add PCIe M.2 connector support
PCI: Drop the assert_perst() callback
PCI: qcom: Drop the assert_perst() callbacks
PCI/pwrctrl: Switch to pwrctrl create, power on/off, destroy APIs
PCI/pwrctrl: Add APIs to power on/off pwrctrl devices
PCI/pwrctrl: Add APIs to create, destroy pwrctrl devices
PCI/pwrctrl: Add 'struct pci_pwrctrl::power_{on/off}' callbacks
PCI/pwrctrl: pwrseq: Factor out power on/off code to helpers
PCI/pwrctrl: slot: Factor out power on/off code to helpers
PCI/pwrctrl: tc9563: Rename private struct and pointers for consistency
PCI/pwrctrl: tc9563: Add local variables to reduce repetition
PCI/pwrctrl: tc9563: Clean up whitespace
PCI/pwrctrl: tc9563: Use put_device() instead of i2c_put_adapter()
PCI/pwrctrl: slot: Rename private struct and pointers for consistency
PCI/pwrctrl: pwrseq: Rename private struct and pointers for consistency
# Conflicts:
# drivers/pci/bus.c
|
|
- Add PCI_BRIDGE_NO_ALIAS quirk for ASPEED AST1150, where VGA and USB are
behind a PCIe-to-PCI bridge and share the same StreamID (Nirmoy Das)
* pci/iommu:
PCI: Add PCI_BRIDGE_NO_ALIAS quirk for ASPEED AST1150
PCI: Add ASPEED vendor ID to pci_ids.h
|
|
The commit 665745f27487 ("PCI/bwctrl: Re-add BW notification portdrv as
PCIe BW controller") was found to lead to a boot hang on a Intel P45
system. Testing without setting Link Bandwidth Management Interrupt Enable
(LBMIE) and Link Autonomous Bandwidth Interrupt Enable (LABIE) (PCIe r7.0,
sec 7.5.3.7) in bwctrl allowed system to come up.
P45 is a very old chipset and supports only up to gen2 PCIe, so not having
bwctrl does not seem a huge deficiency.
Add no_bw_notif in struct pci_dev and quirk Intel P45 Root Port with it.
Reported-by: Adam Stylinski <kungfujesus06@gmail.com>
Link: https://lore.kernel.org/linux-pci/aUCt1tHhm_-XIVvi@eggsbenedict/
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Adam Stylinski <kungfujesus06@gmail.com>
Link: https://patch.msgid.link/20260116131513.2359-1-ilpo.jarvinen@linux.intel.com
|
|
The ACS Capability register is read-only. Cache it to allow quirks to
override it and to avoid re-reading it.
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@oss.qualcomm.com>
[bhelgaas: commit log]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Link: https://patch.msgid.link/20260102-pci_acs-v3-2-72280b94d288@oss.qualcomm.com
|
|
Take care of rqspinlock error in bpf_local_storage_{map_free, destroy}()
properly by switching to bpf_selem_unlink_nofail().
Both functions iterate their own RCU-protected list of selems and call
bpf_selem_unlink_nofail(). In map_free(), to prevent infinite loop when
both map_free() and destroy() fail to remove a selem from b->list
(extremely unlikely), switch to hlist_for_each_entry_rcu(). In destroy(),
also switch to hlist_for_each_entry_rcu() since we no longer iterate
local_storage->list under local_storage->lock.
bpf_selem_unlink() now becomes dedicated to helpers and syscalls paths
so reuse_now should always be false. Remove it from the argument and
hardcode it.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Co-developed-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-12-ameryhung@gmail.com
|
|
Introduce bpf_selem_unlink_nofail() to properly handle errors returned
from rqspinlock in bpf_local_storage_map_free() and
bpf_local_storage_destroy() where the operation must succeeds.
The idea of bpf_selem_unlink_nofail() is to allow an selem to be
partially linked and use atomic operation on a bit field, selem->state,
to determine when and who can free the selem if any unlink under lock
fails. An selem initially is fully linked to a map and a local storage.
Under normal circumstances, bpf_selem_unlink_nofail() will be able to
grab locks and unlink a selem from map and local storage in sequeunce,
just like bpf_selem_unlink(), and then free it after an RCU grace period.
However, if any of the lock attempts fails, it will only clear
SDATA(selem)->smap or selem->local_storage depending on the caller and
set SELEM_MAP_UNLINKED or SELEM_STORAGE_UNLINKED according to the
caller. Then, after both map_free() and destroy() see the selem and the
state becomes SELEM_UNLINKED, one of two racing caller can succeed in
cmpxchg the state from SELEM_UNLINKED to SELEM_TOFREE, ensuring no
double free or memory leak.
To make sure bpf_obj_free_fields() is done only once and when map is
still present, it is called when unlinking an selem from b->list under
b->lock.
To make sure uncharging memory is done only when the owner is still
present in map_free(), block destroy() from returning until there is no
pending map_free().
Since smap may not be valid in destroy(), bpf_selem_unlink_nofail()
skips bpf_selem_unlink_storage_nolock_misc() when called from destroy().
This is okay as bpf_local_storage_destroy() will return the remaining
amount of memory charge tracked by mem_charge to the owner to uncharge.
It is also safe to skip clearing local_storage->owner and owner_storage
as the owner is being freed and no users or bpf programs should be able
to reference the owner and using local_storage.
Finally, access of selem, SDATA(selem)->smap and selem->local_storage
are racy. Callers will protect these fields with RCU.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Co-developed-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-11-ameryhung@gmail.com
|
|
The next patch will introduce bpf_selem_unlink_nofail() to handle
rqspinlock errors. bpf_selem_unlink_nofail() will allow an selem to be
partially unlinked from map or local storage. Save memory allocation
method in selem so that later an selem can be correctly freed even when
SDATA(selem)->smap is init to NULL.
In addition, keep track of memory charge to the owner in local storage
so that later bpf_selem_unlink_nofail() can return the correct memory
charge to the owner. Updating local_storage->mem_charge is protected by
local_storage->lock.
Finally, extract miscellaneous tasks performed when unlinking an selem
from local_storage into bpf_selem_unlink_storage_nolock_misc(). It will
be reused by bpf_selem_unlink_nofail().
This patch also takes the chance to remove local_storage->smap, which
is no longer used since commit f484f4a3e058 ("bpf: Replace bpf memory
allocator with kmalloc_nolock() in local storage").
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-10-ameryhung@gmail.com
|
|
Percpu locks have been removed from cgroup and task local storage. Now
that all local storage no longer use percpu variables as locks preventing
recursion, there is no need to pass them to bpf_local_storage_map_free().
Remove the argument from the function.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-9-ameryhung@gmail.com
|
|
Change bpf_local_storage::lock and bpf_local_storage_map_bucket::lock
from raw_spin_lock to rqspinlock.
Finally, propagate errors from raw_res_spin_lock_irqsave() to syscall
return or BPF helper return.
In bpf_local_storage_destroy(), ignore return from
raw_res_spin_lock_irqsave() for now. A later patch will correctly
handle errors correctly in bpf_local_storage_destroy() so that it can
unlink selems even when failing to acquire locks.
For __bpf_local_storage_map_cache(), instead of handling the error,
skip updating the cache.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-6-ameryhung@gmail.com
|
|
To prepare changing both bpf_local_storage_map_bucket::lock and
bpf_local_storage::lock to rqspinlock, convert bpf_selem_unlink() to
failable. It still always succeeds and returns 0 until the change
happens. No functional change.
Open code bpf_selem_unlink_storage() in the only caller,
bpf_selem_unlink(), since unlink_map and unlink_storage must be done
together after all the necessary locks are acquired.
For bpf_local_storage_map_free(), ignore the return from
bpf_selem_unlink() for now. A later patch will allow it to unlink selems
even when failing to acquire locks.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-5-ameryhung@gmail.com
|
|
To prepare for changing bpf_local_storage_map_bucket::lock to rqspinlock,
convert bpf_selem_link_map() to failable. It still always succeeds and
returns 0 until the change happens. No functional change.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-4-ameryhung@gmail.com
|
|
A later bpf_local_storage refactor will acquire all locks before
performing any update. To simplified the number of locks needed to take
in bpf_local_storage_map_update(), determine the bucket based on the
local_storage an selem belongs to instead of the selem pointer.
Currently, when a new selem needs to be created to replace the old selem
in bpf_local_storage_map_update(), locks of both buckets need to be
acquired to prevent racing. This can be simplified if the two selem
belongs to the same bucket so that only one bucket needs to be locked.
Therefore, instead of hashing selem, hashing the local_storage pointer
the selem belongs.
Performance wise, this is slightly better as update now requires locking
one bucket. It should not change the level of contention on one bucket
as the pointers to local storages of selems in a map are just as unique
as pointers to selems.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-2-ameryhung@gmail.com
|
|
Pull hotfixes from Andrew Morton:
"A couple of late-breaking MM fixes. One against a new-in-this-cycle
patch and the other addresses a locking issue which has been there for
over a year"
* tag 'mm-hotfixes-stable-2026-02-06-12-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm/memory-failure: reject unsupported non-folio compound page
procfs: avoid fetching build ID while holding VMA lock
|
|
Pull ceph fixes from Ilya Dryomov:
"One RBD and two CephFS fixes which address potential oopses.
The RBD thing is more of a rare edge case that pops up in our CI,
while the two CephFS scenarios are regressions that were reported by
users and can be triggered trivially in normal operation. All marked
for stable"
* tag 'ceph-for-6.19-rc9' of https://github.com/ceph/ceph-client:
ceph: fix NULL pointer dereference in ceph_mds_auth_match()
ceph: fix oops due to invalid pointer for kfree() in parse_longname()
rbd: check for EOD after exclusive lock is ensured to be held
|
|
Pull dma-mapping fixes from Marek Szyprowski:
"Two minor fixes for the DMA-mapping subsystem:
- check for the rare case of the allocation failure of the global CMA
pool (Shanker Donthineni)
- avoid perf buffer overflow when tracing large scatter-gather lists
(Deepanshu Kartikey)"
* tag 'dma-mapping-6.19-2026-02-06' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
dma: contiguous: Check return value of dma_contiguous_reserve_area()
tracing/dma: Cap dma_map_sg tracepoint arrays to prevent buffer overflow
|
|
- Move ABI requirement next to each access right to prepare adding more
access rights;
- Mention the possibility to remove the random component of a socket's
ephemeral port choice within the netns-wide ephemeral port range,
since it allows choosing the "random" ephemeral port.
Signed-off-by: Matthieu Buffet <matthieu@buffet.re>
Link: https://lore.kernel.org/r/20251212163704.142301-2-matthieu@buffet.re
Signed-off-by: Mickaël Salaün <mic@digikod.net>
|
|
Introduce the LANDLOCK_RESTRICT_SELF_TSYNC flag. With this flag, a
given Landlock ruleset is applied to all threads of the calling
process, instead of only the current one.
Without this flag, multithreaded userspace programs currently resort
to using the nptl(7)/libpsx hack for multithreaded policy enforcement,
which is also used by libcap and for setuid(2). Using this
userspace-based scheme, the threads of a process enforce the same
Landlock policy, but the resulting Landlock domains are still
separate. The domains being separate causes multiple problems:
* When using Landlock's "scoped" access rights, the domain identity is
used to determine whether an operation is permitted. As a result,
when using LANLDOCK_SCOPE_SIGNAL, signaling between sibling threads
stops working. This is a problem for programming languages and
frameworks which are inherently multithreaded (e.g. Go).
* In audit logging, the domains of separate threads in a process will
get logged with different domain IDs, even when they are based on
the same ruleset FD, which might confuse users.
Cc: Andrew G. Morgan <morgan@kernel.org>
Cc: John Johansen <john.johansen@canonical.com>
Cc: Paul Moore <paul@paul-moore.com>
Suggested-by: Jann Horn <jannh@google.com>
Signed-off-by: Günther Noack <gnoack@google.com>
Link: https://lore.kernel.org/r/20251127115136.3064948-2-gnoack@google.com
[mic: Fix restrict_self_flags test, clean up Makefile, allign comments,
reduce local variable scope, add missing includes]
Closes: https://github.com/landlock-lsm/linux/issues/2
Signed-off-by: Mickaël Salaün <mic@digikod.net>
|
|
This reverts commit 62eb557580eb2177cf16c3fd2b6efadff297b29a.
The revocable implementation uses two separate abstractions, struct
revocable_provider and struct revocable, in order to store the SRCU read
lock index which must be passed unaltered to srcu_read_unlock() in the
same context when a resource is no longer needed.
With the merged revocable API, multiple threads could however share the
same struct revocable and therefore potentially overwrite the SRCU index
of another thread which can cause the SRCU synchronisation in
revocable_provider_revoke() to never complete. [1]
An example revocable conversion of the gpiolib code also turned out to
be fundamentally flawed and could lead to use-after-free. [2]
An attempt to address both issues was quickly put together and merged,
but revocable is still fundamentally broken. [3]
Specifically, the latest design relies on RCU for storing a pointer to
the revocable provider, but since the resource can be shared by value
(e.g. as in the now reverted selftests) this does not work at all and
can also lead to use-after-free:
static void revocable_provider_release(struct kref *kref)
{
struct revocable_provider *rp = container_of(kref,
struct revocable_provider, kref);
cleanup_srcu_struct(&rp->srcu);
kfree_rcu(rp, rcu);
}
void revocable_provider_revoke(struct revocable_provider __rcu **rp_ptr)
{
struct revocable_provider *rp;
rp = rcu_replace_pointer(*rp_ptr, NULL, 1);
...
kref_put(&rp->kref, revocable_provider_release);
}
int revocable_init(struct revocable_provider __rcu *_rp,
struct revocable *rev)
{
struct revocable_provider *rp;
...
scoped_guard(rcu) {
rp = rcu_dereference(_rp);
if (!rp)
return -ENODEV;
if (!kref_get_unless_zero(&rp->kref))
return -ENODEV;
}
...
}
producer:
priv->rp = revocable_provider_alloc(&priv->res);
// pass priv->rp by value to consumer
revocable_provider_revoke(&priv->rp);
consumer:
struct revocable_provider __rcu *rp = filp->private_data;
struct revocable *rev;
revocable_init(rp, &rev);
as _rp would still be non-NULL in revocable_init() regardless of whether
the producer has revoked the resource and set its pointer to NULL.
Essentially revocable still relies on having a pointer to reference
counted driver data which holds the revocable provider, which makes all
the RCU protection unnecessary along with most of the current revocable
design and implementation.
As the above shows, and as has been pointed out repeatedly elsewhere,
these kind of issues are not something that should be addressed
incrementally. [4]
Revert the revocable implementation until a redesign has been proposed
and evaluated properly.
Link: https://lore.kernel.org/all/20260124170535.11756-4-johan@kernel.org/ [1]
Link: https://lore.kernel.org/all/aXT45B6vLf9R3Pbf@hovoldconsulting.com/ [2]
Link: https://lore.kernel.org/all/20260129143733.45618-1-tzungbi@kernel.org/ [3]
Link: https://lore.kernel.org/all/aXobzoeooJqxMkEj@hovoldconsulting.com/ [4]
Signed-off-by: Johan Hovold <johan@kernel.org>
Link: https://patch.msgid.link/20260204142849.22055-4-johan@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Currently io_uring supports restricting operations on a per-ring basis.
To use those, the ring must be setup in a disabled state by setting
IORING_SETUP_R_DISABLED. Then restrictions can be set for the ring, and
the ring can then be enabled.
This commit adds support for IORING_REGISTER_RESTRICTIONS with ring_fd
== -1, like the other "blind" register opcodes which work on the task
rather than a specific ring. This allows registration of the same kind
of restrictions as can been done on a specific ring, but with the task
itself. Once done, any ring created will inherit these restrictions.
If a restriction filter is registered with a task, then it's inherited
on fork for its children. Children may only further restrict operations,
not extend them.
Inheriting restrictions include both the classic
IORING_REGISTER_RESTRICTIONS based restrictions, as well as the BPF
filters that have been registered with the task via
IORING_REGISTER_BPF_FILTER.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Called when copy_process() is called to copy state to a new child.
Right now this is just a stub, but will be used shortly to properly
handle fork'ing of task based io_uring restrictions.
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|