path: root/net
Age | Commit message | Author | Files | Lines
2025-05-21 | net: add skb_crc32c() | Eric Biggers | 1 | -0/+73
Add skb_crc32c(), which calculates the CRC32C of a sk_buff. It will replace __skb_checksum(), which unnecessarily supports arbitrary checksums. Compared to __skb_checksum(), skb_crc32c():

- Uses the correct type for CRC32C values (u32, not __wsum).
- Does not require the caller to provide a skb_checksum_ops struct.
- Is faster because it does not use indirect calls and does not use the very slow crc32c_combine().

According to commit 2817a336d4d5 ("net: skb_checksum: allow custom update/combine for walking skb") which added __skb_checksum(), the original motivation for the abstraction layer was to avoid code duplication for CRC32C and other checksums in the future. However:

- No additional checksums showed up after CRC32C. __skb_checksum() is only used with the "regular" net checksum and CRC32C.
- Indirect calls are expensive. Commit 2544af0344ba ("net: avoid indirect calls in L4 checksum calculation") worked around this using the INDIRECT_CALL_1 macro. But that only avoided the indirect call for the net checksum, and at the cost of an extra branch.
- The checksums use different types (__wsum and u32), causing casts to be needed.
- It made the checksums of fragments be combined (rather than chained) for both checksums, despite this being highly counterproductive for CRC32C due to how slow crc32c_combine() is. This can clearly be seen in commit 4c2f24549644 ("sctp: linearize early if it's not GSO") which tried to work around this performance bug. With a dedicated function for each checksum, we can instead just use the proper strategy for each checksum.

As shown by the following tables, the new function skb_crc32c() is faster than __skb_checksum(), with the improvement varying greatly from 5% to 2500% depending on the case. The largest improvements come from fragmented packets, mainly due to eliminating the inefficient crc32c_combine(). But linear packets are improved too, especially shorter ones, mainly due to eliminating indirect calls. These benchmarks were done on AMD Zen 5. On that CPU, Linux uses IBRS instead of retpoline; an even greater improvement might be seen with retpoline:

    Linear sk_buffs

    Length in bytes   __skb_checksum cycles   skb_crc32c cycles
    ===============   =====================   =================
                 64                      43                  18
                256                      94                  77
               1420                     204                 161
              16384                    1735                1642

    Nonlinear sk_buffs (even split between head and one fragment)

    Length in bytes   __skb_checksum cycles   skb_crc32c cycles
    ===============   =====================   =================
                 64                     579                  22
                256                     829                  77
               1420                    1506                 194
              16384                    4365                1682

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://patch.msgid.link/20250519175012.36581-3-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
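The chaining-versus-combining point is easiest to see in code. Below is a hedged, simplified sketch (not the actual net/core implementation; it ignores frag_list chaining and error handling) of how a dedicated CRC32C walker feeds each segment straight into crc32c(), avoiding both crc32c_combine() and an indirect ops->update() call:

```c
/*
 * Hedged sketch only: a simplified skb CRC32C walker. The real
 * skb_crc32c() also handles skb_shinfo(skb)->frag_list.
 */
#include <linux/crc32c.h>
#include <linux/highmem.h>
#include <linux/skbuff.h>

static u32 skb_crc32c_sketch(const struct sk_buff *skb, u32 crc)
{
	const struct skb_shared_info *shinfo = skb_shinfo(skb);
	int i;

	/* Linear head: one direct call, no indirect ops->update(). */
	crc = crc32c(crc, skb->data, skb_headlen(skb));

	/* Page fragments: chain the CRC state instead of computing
	 * per-fragment CRCs and merging them with crc32c_combine(). */
	for (i = 0; i < shinfo->nr_frags; i++) {
		const skb_frag_t *frag = &shinfo->frags[i];
		const u8 *vaddr = kmap_local_page(skb_frag_page(frag));

		crc = crc32c(crc, vaddr + skb_frag_off(frag),
			     skb_frag_size(frag));
		kunmap_local(vaddr);
	}
	return crc;
}
```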
2025-05-21 | net: introduce CONFIG_NET_CRC32C | Eric Biggers | 3 | -0/+7
Add a hidden kconfig symbol NET_CRC32C that will group together the functions that calculate CRC32C checksums of packets, so that these don't have to be built into NET-enabled kernels that don't need them. Make skb_crc32c_csum_help() (which is called only when IP_SCTP is enabled) conditional on this symbol, and make IP_SCTP select it. Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://patch.msgid.link/20250519175012.36581-2-ebiggers@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
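As an illustration of how such a hidden symbol is typically consumed in C (this stub shape is an assumption for illustration, not necessarily the patch's exact header change), a gated declaration might look like:

```c
#include <linux/errno.h>
#include <linux/skbuff.h>

/* Hedged sketch: a common kernel pattern pairing a real implementation
 * with a static inline fallback when the symbol is not selected. */
#if IS_ENABLED(CONFIG_NET_CRC32C)
int skb_crc32c_csum_help(struct sk_buff *skb);
#else
static inline int skb_crc32c_csum_help(struct sk_buff *skb)
{
	return -EOPNOTSUPP;	/* CRC32C helpers not built in */
}
#endif
```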
2025-05-21 | Bluetooth: separate CIS_LINK and BIS_LINK link types | Pauli Virtanen | 6 | -46/+66
Use separate link type ids for unicast and broadcast ISO connections. These connection types are handled with separate HCI commands, the socket API is different, and hci_conn has union fields that differ between the two cases, so they shall not be mixed up. Currently most places attempt to distinguish ucast from bcast by bacmp(&c->dst, BDADDR_ANY), but that is wrong, as dst is set for the bcast sink hci_conn in iso_conn_ready(). Additionally checking sync_handle might be OK, but depends on details of the bcast conn configuration flow. To avoid complicating it, use separate link types. Fixes: f764a6c2c1e4 ("Bluetooth: ISO: Add broadcast support") Signed-off-by: Pauli Virtanen <pav@iki.fi> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-05-21 | Bluetooth: add support for SIOCETHTOOL ETHTOOL_GET_TS_INFO | Pauli Virtanen | 2 | -0/+120
Bluetooth needs some way for the user to get the supported so_timestamping flags for the different socket types. Use the SIOCETHTOOL API for this purpose. As hci_dev is not associated with a struct net_device, the existing implementation can't be reused, so we add a small one here. Add support (only) for the ETHTOOL_GET_TS_INFO command. The API differs slightly from netdev in that the result depends also on the socket type. Signed-off-by: Pauli Virtanen <pav@iki.fi> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
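For reference, this is how the same command is issued against an ordinary netdev from user space; per the commit text, Bluetooth reuses ETHTOOL_GET_TS_INFO but the result additionally depends on the socket type, and the exact Bluetooth call shape is not shown here. The interface name is a placeholder:

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <net/if.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(void)
{
	struct ethtool_ts_info info = { .cmd = ETHTOOL_GET_TS_INFO };
	struct ifreq ifr;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);	/* placeholder */
	ifr.ifr_data = (char *)&info;

	if (ioctl(fd, SIOCETHTOOL, &ifr) == 0)
		printf("so_timestamping flags: 0x%x\n", info.so_timestamping);
	else
		perror("ETHTOOL_GET_TS_INFO");
	close(fd);
	return 0;
}
```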
2025-05-21 | Bluetooth: ISO: Fix getpeername not returning sockaddr_iso_bc fields | Luiz Augusto von Dentz | 1 | -3/+14
If the socket is a broadcast receiver, fields from sockaddr_iso_bc shall be part of the values returned by getpeername, since some of these fields are updated while doing the PA and BIG sync procedures. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-05-21 | Bluetooth: ISO: Fix not using SID from adv report | Luiz Augusto von Dentz | 5 | -14/+75
Up until now it has been assumed that the application would be able to enter the advertising SID in sockaddr_iso_bc.bc_sid, but userspace has no access to the SID since the likes of MGMT_EV_DEVICE_FOUND cannot carry it, so it was left unset (0x00), which means it would be unable to synchronize if the broadcast source is using a different SID, e.g. 0x04:

> HCI Event: LE Meta Event (0x3e) plen 57
        LE Extended Advertising Report (0x0d)
        Num reports: 1
        Entry 0
          Event type: 0x0000
            Props: 0x0000
            Data status: Complete
          Address type: Random (0x01)
          Address: 0B:82:E8:50:6D:C8 (Non-Resolvable)
          Primary PHY: LE 1M
          Secondary PHY: LE 2M
          SID: 0x04
          TX power: 127 dBm
          RSSI: -55 dBm (0xc9)
          Periodic advertising interval: 180.00 msec (0x0090)
          Direct address type: Public (0x00)
          Direct address: 00:00:00:00:00:00 (OUI 00-00-00)
          Data length: 0x1f
        06 16 52 18 5b 0b e1 05 16 56 18 04 00 11 30 4c  ..R.[....V....0L
        75 69 7a 27 73 20 53 32 33 20 55 6c 74 72 61     uiz's S23 Ultra
          Service Data: Broadcast Audio Announcement (0x1852)
          Broadcast ID: 14748507 (0xe10b5b)
          Service Data: Public Broadcast Announcement (0x1856)
          Data[2]: 0400
          Unknown EIR field 0x30[16]: 4c75697a27732053323320556c747261

< HCI Command: LE Periodic Advertising Create Sync (0x08|0x0044) plen 14
        Options: 0x0000
          Use advertising SID, Advertiser Address Type and address
          Reporting initially enabled
        SID: 0x00 (<- Invalid)
        Adv address type: Random (0x01)
        Adv address: 0B:82:E8:50:6D:C8 (Non-Resolvable)
        Skip: 0x0000
        Sync timeout: 20000 msec (0x07d0)
        Sync CTE type: 0x0000

So instead this change now allows the application to set HCI_SID_INVALID, which will make hci_le_pa_create_sync wait for a report, update conn->sid using the report SID, and only then issue the PA create sync command:

< HCI Command: LE Periodic Advertising Create Sync
        Options: 0x0000
          Use advertising SID, Advertiser Address Type and address
          Reporting initially enabled
        SID: 0x04
        Adv address type: Random (0x01)
        Adv address: 0B:82:E8:50:6D:C8 (Non-Resolvable)
        Skip: 0x0000
        Sync timeout: 20000 msec (0x07d0)
        Sync CTE type: 0x0000

> HCI Event: LE Meta Event (0x3e) plen 16
        LE Periodic Advertising Sync Established (0x0e)
        Status: Success (0x00)
        Sync handle: 64
        Advertising SID: 0x04
        Advertiser address type: Random (0x01)
        Advertiser address: 0B:82:E8:50:6D:C8 (Non-Resolvable)
        Advertiser PHY: LE 2M (0x02)
        Periodic advertising interval: 180.00 msec (0x0090)
        Advertiser clock accuracy: 0x05

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-05-21 | Bluetooth: Introduce HCI Driver protocol | Hsin-chen Chuang | 4 | -3/+128
Although commit 75ddcd5ad40e ("Bluetooth: btusb: Configure altsetting for HCI_USER_CHANNEL") has enabled the HCI_USER_CHANNEL user to send out SCO data through USB Bluetooth chips, it's observed that with the patch HFP is flaky on most of the existing USB Bluetooth controllers: Intel chips sometimes send out no packet for the Transparent codec; MTK chips may generate SCO data with a wrong handle for the CVSD codec; RTK could split the data with a wrong packet size for the Transparent codec; etc.

To address the issue above one needs to reset the altsetting back to zero when there is no active SCO connection, which is the same as the BlueZ behavior. Another benefit is that the bus doesn't need to reserve bandwidth when no SCO connection is active.

This patch adds the infrastructure that allows a user space program to talk to Bluetooth drivers directly:

- Define the new packet type HCI_DRV_PKT, which is specifically used for communication between the user space program and the Bluetooth drivers.
- hci_send_frame intercepts the packets and invokes drivers' HCI Drv callbacks (so far only defined for btusb).
- 2 kinds of events to user space: Command Status and Command Complete; the former simply returns the status while the latter may contain additional response data.

Cc: chromeos-bluetooth-upstreaming@chromium.org
Fixes: b16b327edb4d ("Bluetooth: btusb: add sysfs attribute to control USB alt setting")
Signed-off-by: Hsin-chen Chuang <chharry@chromium.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-05-21 | coredump: add coredump socket | Christian Brauner | 1 | -13/+41
Coredumping currently supports two modes:

(1) Dumping directly into a file somewhere on the filesystem.
(2) Dumping into a pipe connected to a usermode helper process spawned as a child of the system_unbound_wq or kthreadd.

For simplicity I'm mostly ignoring (1). There are probably still some users of (1) out there, but processing coredumps in this way can be considered adventurous, especially in the face of set*id binaries. The most common option should be (2) by now. It works by allowing userspace to put a string into /proc/sys/kernel/core_pattern like:

|/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h

The "|" at the beginning indicates to the kernel that a pipe must be used. The path following the pipe indicator is a path to a binary that will be spawned as a usermode helper process. Any additional parameters pass information about the task that is generating the coredump to the binary that processes the coredump. In the example core_pattern shown above systemd-coredump is spawned as a usermode helper. There are various conceptual consequences of this (non-exhaustive list):

- systemd-coredump is spawned with file descriptor number 0 (stdin) connected to the read-end of the pipe. All other file descriptors are closed. That specifically includes 1 (stdout) and 2 (stderr). This has already caused bugs because userspace assumed that this cannot happen (whether or not this is a sane assumption is irrelevant).
- systemd-coredump will be spawned as a child of system_unbound_wq. So it is not a child of any userspace process and specifically not a child of PID 1. It cannot be waited upon and is a weird hybrid upcall which is difficult for userspace to control correctly.
- systemd-coredump is spawned with full kernel privileges. This necessitates all kinds of weird privilege dropping exercises in userspace to make this safe.
- A new usermode helper has to be spawned for each crashing process.

This series adds a new mode:

(3) Dumping into an AF_UNIX socket.

Userspace can set /proc/sys/kernel/core_pattern to:

@/path/to/coredump.socket

The "@" at the beginning indicates to the kernel that an AF_UNIX coredump socket will be used to process coredumps. The coredump socket must be located in the initial mount namespace. When a task coredumps it opens a client socket in the initial network namespace and connects to the coredump socket.

- The coredump server uses SO_PEERPIDFD to get a stable handle on the connected crashing task. The retrieved pidfd will provide a stable reference even if the crashing task gets SIGKILLed while generating the coredump.
- By setting core_pipe_limit non-zero userspace can guarantee that the crashing task cannot be reaped behind its back and thus process all necessary information in /proc/<pid>. The SO_PEERPIDFD can be used to detect whether /proc/<pid> still refers to the same process. The core_pipe_limit isn't used to rate-limit connections to the socket. This can simply be done via AF_UNIX sockets directly.
- The pidfd for the crashing task will grow new information about how the task coredumps.
- The coredump server should mark itself as non-dumpable.
- A container coredump server in a separate network namespace can simply bind to another well-known address and systemd-coredump forwards coredumps to the container.
- Coredumps could in the future also be handled via per-user/session coredump servers that run only with that user's privileges. The coredump server listens on the coredump socket and accepts a new coredump connection. It then retrieves SO_PEERPIDFD for the client, inspects uid/gid, and hands the accepted client to the user's own coredump handler, which runs with the user's privileges only (it must of course pay close attention to not forward crashing suid binaries).

The new coredump socket will allow userspace to not have to rely on usermode helpers for processing coredumps and provides a safer way to handle them instead of relying on super privileged coredumping helpers that have and continue to cause significant CVEs. This will also be significantly more lightweight since no fork()+exec() for the usermode helper is required for each crashing process. The coredump server in userspace can e.g., just keep a worker pool.

Link: https://lore.kernel.org/20250516-work-coredump-socket-v8-4-664f3caf2516@kernel.org
Acked-by: Luca Boccassi <luca.boccassi@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Jann Horn <jannh@google.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
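A minimal user-space sketch of what such a coredump server could look like, assuming core_pattern is set to @/run/coredump.socket; the path, buffer size, and minimal error handling are illustrative only:

```c
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

#ifndef SO_PEERPIDFD
#define SO_PEERPIDFD 77		/* available since Linux 6.5 */
#endif

int main(void)
{
	struct sockaddr_un addr = {
		.sun_family = AF_UNIX,
		.sun_path   = "/run/coredump.socket",	/* assumed path */
	};
	int srv = socket(AF_UNIX, SOCK_STREAM, 0);

	unlink(addr.sun_path);
	if (bind(srv, (struct sockaddr *)&addr, sizeof(addr)) || listen(srv, 8))
		return 1;

	for (;;) {
		int client = accept(srv, NULL, NULL);
		int pidfd = -1;
		socklen_t len = sizeof(pidfd);
		char buf[65536];

		if (client < 0)
			continue;
		/* Stable handle on the crashing task: stays valid even if
		 * the task is SIGKILLed while the dump is streaming in. */
		getsockopt(client, SOL_SOCKET, SO_PEERPIDFD, &pidfd, &len);

		while (read(client, buf, sizeof(buf)) > 0)
			;	/* write the dump somewhere safe here */

		if (pidfd >= 0)
			close(pidfd);
		close(client);
	}
}
```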
2025-05-21 | xsk: Bring back busy polling support in XDP_COPY | Samiullah Khawaja | 1 | -1/+1
Commit 5ef44b3cb43b ("xsk: Bring back busy polling support") fixed the busy polling support in xsk for XDP_ZEROCOPY after it was broken in commit 86e25f40aa1e ("net: napi: Add napi_config"). The busy polling support with XDP_COPY remained broken since the napi_id setup in xsk_rcv_check was removed.

Bring back the setup of napi_id for XDP_COPY so socket level SO_BUSYPOLL can be used to poll the underlying napi. Do the setup of napi_id for XDP_COPY in xsk_bind, as it is done currently for XDP_ZEROCOPY. The setup of napi_id for XDP_COPY in xsk_bind is safe because xsk_rcv_check checks that the rx queue at which the packet arrives is equal to the queue_id that was supplied in bind. This is done for both XDP_COPY and XDP_ZEROCOPY mode.

Tested using AF_XDP support in virtio-net by running the xsk_rr AF_XDP benchmarking tool shared here: https://lore.kernel.org/all/20250320163523.3501305-1-skhawaja@google.com/T/

Enabled socket busy polling using the following commands in qemu:

```
sudo ethtool -L eth0 combined 1
echo 400 | sudo tee /proc/sys/net/core/busy_read
echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
```

Fixes: 5ef44b3cb43b ("xsk: Bring back busy polling support")
Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
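The busy_read sysctl above enables busy polling globally; the per-socket equivalent an application would use looks roughly like this hedged sketch (xsk_fd is an already-created AF_XDP socket, and the values are illustrative, mirroring the knobs above):

```c
#include <sys/socket.h>

#ifndef SO_BUSY_POLL
#define SO_BUSY_POLL 46
#endif
#ifndef SO_PREFER_BUSY_POLL
#define SO_PREFER_BUSY_POLL 69
#endif
#ifndef SO_BUSY_POLL_BUDGET
#define SO_BUSY_POLL_BUDGET 70
#endif

static int xsk_enable_busy_poll(int xsk_fd)
{
	int prefer = 1;
	int usecs = 400;	/* like busy_read=400 above */
	int budget = 64;	/* packets per busy-poll round */

	if (setsockopt(xsk_fd, SOL_SOCKET, SO_PREFER_BUSY_POLL,
		       &prefer, sizeof(prefer)))
		return -1;
	if (setsockopt(xsk_fd, SOL_SOCKET, SO_BUSY_POLL,
		       &usecs, sizeof(usecs)))
		return -1;
	return setsockopt(xsk_fd, SOL_SOCKET, SO_BUSY_POLL_BUDGET,
			  &budget, sizeof(budget));
}
```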
2025-05-21 | wifi: mac80211: accept probe response on link address as well | Aditya Kumar Singh | 1 | -1/+17
If a random MAC address is not requested during scan request, unicast probe response frames are only accepted if the destination address matches the interface address. This works fine for non-ML interfaces. However, with MLO, the same interface can have multiple links, and a scan on a link would be requested with the link address. In such cases, the probe response frame gets dropped which is incorrect. Therefore, add logic to check if any of the link addresses match the destination address if the interface address does not match. Signed-off-by: Aditya Kumar Singh <aditya.kumar.singh@oss.qualcomm.com> Link: https://patch.msgid.link/20250516-bug_fix_mlo_scan-v2-2-12e59d9110ac@oss.qualcomm.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
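A hedged sketch of the matching logic described above (not the literal mac80211 patch); it assumes RCU read-side protection when dereferencing the link configuration pointers:

```c
#include <linux/etherdevice.h>
#include <net/mac80211.h>

/* Return true if da matches the interface address or any active
 * link address of an ML interface. Caller holds rcu_read_lock(). */
static bool da_matches_vif_or_links(struct ieee80211_vif *vif, const u8 *da)
{
	unsigned int link_id;

	if (ether_addr_equal(vif->addr, da))
		return true;

	for (link_id = 0; link_id < IEEE80211_MLD_MAX_NUM_LINKS; link_id++) {
		const struct ieee80211_bss_conf *conf =
			rcu_dereference(vif->link_conf[link_id]);

		if (conf && ether_addr_equal(conf->addr, da))
			return true;
	}
	return false;
}
```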
2025-05-21 | wifi: mac80211: validate SCAN_FLAG_AP in scan request during MLO | Aditya Kumar Singh | 1 | -1/+1
When an AP interface is already beaconing, a subsequent scan is not allowed unless user space explicitly sets the flag NL80211_SCAN_FLAG_AP in the scan request. If this flag is not set, the scan request will be returned with the error code -EOPNOTSUPP. However, this restriction currently applies only to non-ML interfaces. For ML interfaces, scans are allowed without this flag being explicitly set by user space, which is wrong. This is because the beaconing check currently uses only the deflink, which does not get set during MLO. Hence, to fix this, during MLO use the existing helper ieee80211_num_beaconing_links() to know if any of the links are beaconing. Signed-off-by: Aditya Kumar Singh <aditya.kumar.singh@oss.qualcomm.com> Link: https://patch.msgid.link/20250516-bug_fix_mlo_scan-v2-1-12e59d9110ac@oss.qualcomm.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-05-21 | wifi: check if socket flags are valid | Bert Karwatzki | 2 | -4/+4
Checking the SOCK_WIFI_STATUS flag bit in sk_flags may give wrong results since sk_flags are part of a union and the union is used otherwise. Add sk_requests_wifi_status() which checks if sk is non-NULL, sk is a full socket (so flags are valid) and checks the flag bit. Fixes: 76a853f86c97 ("wifi: free SKBTX_WIFI_STATUS skb tx_flags flag") Suggested-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Bert Karwatzki <spasswolf@web.de> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Link: https://patch.msgid.link/20250520223430.6875-1-spasswolf@web.de [edit commit message, fix indentation] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
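The commit message fully specifies the helper's shape; a matching sketch:

```c
#include <net/sock.h>

/* Only trust sk_flags on full sockets: request/timewait minisockets
 * overlay the same memory with other fields. */
static inline bool sk_requests_wifi_status(struct sock *sk)
{
	return sk && sk_fullsock(sk) && sock_flag(sk, SOCK_WIFI_STATUS);
}
```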
2025-05-20 | ipv6: Revert two per-cpu var allocation for RTM_NEWROUTE. | Kuniyuki Iwashima | 2 | -37/+7
These two commits preallocated two per-cpu variables in ip6_route_info_create() as fib_nh_common_init() and fib6_nh_init() were expected to be called under RCU.

* commit d27b9c40dbd6 ("ipv6: Preallocate nhc_pcpu_rth_output in ip6_route_info_create().")
* commit 5720a328c3e9 ("ipv6: Preallocate rt->fib6_nh->rt6i_pcpu in ip6_route_info_create().")

Now these functions can be called without RCU and can use GFP_KERNEL. Let's revert the commits.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250516022759.44392-8-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-20 | ipv6: Pass gfp_flags down to ip6_route_info_create_nh(). | Kuniyuki Iwashima | 1 | -4/+5
Since commit c4837b9853e5 ("ipv6: Split ip6_route_info_create()."), ip6_route_info_create_nh() uses GFP_ATOMIC as it was expected to be called under RCU. Now, we can call it without RCU and use GFP_KERNEL. Let's pass gfp_flags to ip6_route_info_create_nh(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250516022759.44392-7-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-20Revert "ipv6: Factorise ip6_route_multipath_add()."Kuniyuki Iwashima1-123/+70
Commit 71c0efb6d12f ("ipv6: Factorise ip6_route_multipath_add().") split a loop in ip6_route_multipath_add() so that we can put rcu_read_lock() between ip6_route_info_create() and ip6_route_info_create_nh(). We no longer need to do so as ip6_route_info_create_nh() does not require RCU now. Let's revert the commit to simplify the code. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250516022759.44392-6-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-20Revert "ipv6: sr: switch to GFP_ATOMIC flag to allocate memory during seg6local LWT setup"Kuniyuki Iwashima1-3/+3
The previous patch fixed the same issue mentioned in commit 14a0087e7236 ("ipv6: sr: switch to GFP_ATOMIC flag to allocate memory during seg6local LWT setup"). Let's revert it. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Andrea Mayer <andrea.mayer@uniroma2.it> Link: https://patch.msgid.link/20250516022759.44392-5-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-20 | ipv6: Narrow down RCU critical section in inet6_rtm_newroute(). | Kuniyuki Iwashima | 2 | -15/+25
Commit 169fd62799e8 ("ipv6: Get rid of RTNL for SIOCADDRT and RTM_NEWROUTE.") added rcu_read_lock() covering ip6_route_info_create_nh() and __ip6_ins_rt() to guarantee that nexthop and netdev will not go away. However, as reported by syzkaller [0], ip_tun_build_state() calls dst_cache_init() with GFP_KERNEL during the RCU critical section.

ip6_route_info_create_nh() fetches nexthop or netdev depending on whether RTA_NH_ID is set, and struct fib6_info holds a refcount of either of them by nexthop_get() or netdev_get_by_index(). netdev_get_by_index() looks up a dev and calls dev_hold() under RCU. So, we need RCU only around nexthop_find_by_id() and nexthop_get() (and a few more nexthop calls). Let's add rcu_read_lock() there and remove rcu_read_lock() in ip6_route_add() and ip6_route_multipath_add().

Now these functions called from fib6_add() need RCU:

- inet6_rt_notify()
- fib6_drop_pcpu_from() (via fib6_purge_rt())
- rt6_flush_exceptions() (via fib6_purge_rt())
- ip6_ignore_linkdown() (via rt6_multipath_rebalance())

All callers of inet6_rt_notify() need RCU, so rcu_read_lock() is added there.

[0]:
[ BUG: Invalid wait context ]
6.15.0-rc4-syzkaller-00746-g836b313a14a3 #0 Tainted: G W
----------------------------
syz-executor234/5832 is trying to lock:
ffffffff8e021688 (pcpu_alloc_mutex){+.+.}-{4:4}, at: pcpu_alloc_noprof+0x284/0x16b0 mm/percpu.c:1782
other info that might help us debug this:
context-{5:5}
1 lock held by syz-executor234/5832:
 #0: ffffffff8df3b860 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire include/linux/rcupdate.h:331 [inline]
 #0: ffffffff8df3b860 (rcu_read_lock){....}-{1:3}, at: rcu_read_lock include/linux/rcupdate.h:841 [inline]
 #0: ffffffff8df3b860 (rcu_read_lock){....}-{1:3}, at: ip6_route_add+0x4d/0x2f0 net/ipv6/route.c:3913

stack backtrace:
CPU: 0 UID: 0 PID: 5832 Comm: syz-executor234 Tainted: G W 6.15.0-rc4-syzkaller-00746-g836b313a14a3 #0 PREEMPT(full)
Tainted: [W]=WARN
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/29/2025
Call Trace:
 <TASK>
 dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120
 print_lock_invalid_wait_context kernel/locking/lockdep.c:4831 [inline]
 check_wait_context kernel/locking/lockdep.c:4903 [inline]
 __lock_acquire+0xbcf/0xd20 kernel/locking/lockdep.c:5185
 lock_acquire+0x120/0x360 kernel/locking/lockdep.c:5866
 __mutex_lock_common kernel/locking/mutex.c:601 [inline]
 __mutex_lock+0x182/0xe80 kernel/locking/mutex.c:746
 pcpu_alloc_noprof+0x284/0x16b0 mm/percpu.c:1782
 dst_cache_init+0x37/0xc0 net/core/dst_cache.c:145
 ip_tun_build_state+0x193/0x6b0 net/ipv4/ip_tunnel_core.c:687
 lwtunnel_build_state+0x381/0x4c0 net/core/lwtunnel.c:137
 fib_nh_common_init+0x129/0x460 net/ipv4/fib_semantics.c:635
 fib6_nh_init+0x15e4/0x2030 net/ipv6/route.c:3669
 ip6_route_info_create_nh+0x139/0x870 net/ipv6/route.c:3866
 ip6_route_add+0xf6/0x2f0 net/ipv6/route.c:3915
 inet6_rtm_newroute+0x284/0x1c50 net/ipv6/route.c:5732
 rtnetlink_rcv_msg+0x7cc/0xb70 net/core/rtnetlink.c:6955
 netlink_rcv_skb+0x219/0x490 net/netlink/af_netlink.c:2534
 netlink_unicast_kernel net/netlink/af_netlink.c:1313 [inline]
 netlink_unicast+0x758/0x8d0 net/netlink/af_netlink.c:1339
 netlink_sendmsg+0x805/0xb30 net/netlink/af_netlink.c:1883
 sock_sendmsg_nosec net/socket.c:712 [inline]
 __sock_sendmsg+0x219/0x270 net/socket.c:727
 ____sys_sendmsg+0x505/0x830 net/socket.c:2566
 ___sys_sendmsg+0x21f/0x2a0 net/socket.c:2620
 __sys_sendmsg net/socket.c:2652 [inline]
 __do_sys_sendmsg net/socket.c:2657 [inline]
 __se_sys_sendmsg net/socket.c:2655 [inline]
 __x64_sys_sendmsg+0x19b/0x260 net/socket.c:2655
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xf6/0x210 arch/x86/entry/syscall_64.c:94

Fixes: 169fd62799e8 ("ipv6: Get rid of RTNL for SIOCADDRT and RTM_NEWROUTE.")
Reported-by: syzbot+bcc12d6799364500fbec@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=bcc12d6799364500fbec
Reported-by: Eric Dumazet <edumazet@google.com>
Closes: https://lore.kernel.org/netdev/CANn89i+r1cGacVC_6n3-A-WSkAa_Nr+pmxJ7Gt+oP-P9by2aGw@mail.gmail.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250516022759.44392-4-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-20 | inet: Remove rtnl_is_held arg of lwtunnel_valid_encap_type(_attr)?(). | Kuniyuki Iwashima | 4 | -20/+8
Commit f130a0cc1b4f ("inet: fix lwtunnel_valid_encap_type() lock imbalance") added the rtnl_is_held argument as a temporary fix while I'm converting nexthop and IPv6 routing table to per-netns RTNL or RCU. Now all callers of lwtunnel_valid_encap_type() do not hold RTNL. Let's remove the argument. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250516022759.44392-3-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-20 | ipv6: Remove rcu_read_lock() in fib6_get_table(). | Kuniyuki Iwashima | 1 | -12/+10
Once allocated, the IPv6 routing table is not freed until netns is dismantled. fib6_get_table() uses rcu_read_lock() while iterating net->ipv6.fib_table_hash[], but it's not needed and rather confusing, because some callers have this pattern:

    table = fib6_get_table();
    rcu_read_lock();
    /* ... use table here ... */
    rcu_read_unlock();

[ See: addrconf_get_prefix_route(), ip6_route_del(), rt6_get_route_info(), rt6_get_dflt_router() ]

and this looks illegal but is actually safe.

Let's remove rcu_read_lock() in fib6_get_table() and pass true to the last argument of hlist_for_each_entry_rcu() to bypass the RCU check. Note that protection is not needed but the RCU helper is used to avoid a data race.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250516022759.44392-2-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
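A hedged sketch of the lookup pattern after this change; the table hash is never shrunk while the netns is alive, so no rcu_read_lock() is taken, and the `true` cond argument documents that to the RCU list helper:

```c
#include <net/ip6_fib.h>

/* Sketch of the post-change lookup; mirrors the real hashing but is
 * simplified. Tables live until netns teardown, so no RCU lock here. */
static struct fib6_table *fib6_get_table_sketch(struct net *net, u32 id)
{
	struct fib6_table *tb;
	struct hlist_head *head;

	if (id == 0)
		id = RT6_TABLE_MAIN;

	head = &net->ipv6.fib_table_hash[id & (FIB6_TABLE_HASHSZ - 1)];
	/* 'true' silences the lockdep RCU check; the RCU iterator is
	 * still used to pair with concurrent rcu-protected insertions. */
	hlist_for_each_entry_rcu(tb, head, tb6_hlist, true) {
		if (tb->tb6_id == id)
			return tb;
	}
	return NULL;
}
```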
2025-05-20 | selftests: net: validate team flags propagation | Stanislav Fomichev | 1 | -1/+9
Cover three recent cases:

1. missing ops locking for the lowers during netdev_sync_lower_features
2. missing locking for dev_set_promiscuity (plus netdev_ops_assert_locked with a comment on why/when it's needed)
3. rcu lock during team_change_rx_flags

Verified that each one triggers when the respective fix is reverted. Not sure about the placement, but since it all relies on teaming, added to the teaming directory. One ugly bit is that I add NETIF_F_LRO to netdevsim; there is no way to trigger netdev_sync_lower_features without it.

Signed-off-by: Stanislav Fomichev <stfomichev@gmail.com>
Link: https://patch.msgid.link/20250516232205.539266-1-stfomichev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-20 | sctp: Do not wake readers in __sctp_write_space() | Petr Malat | 1 | -1/+2
Function __sctp_write_space() doesn't set a poll key, which leads to ep_poll_callback() waking up all waiters, not only those waiting for the socket to become writable. Set the key properly using wake_up_interruptible_poll(), which is preferred over the sync variant, as writers are not woken up before at least half of the queue is available. Also, TCP does the same. Signed-off-by: Petr Malat <oss@malat.biz> Acked-by: Xin Long <lucien.xin@gmail.com> Link: https://patch.msgid.link/20250516081727.1361451-1-oss@malat.biz Signed-off-by: Jakub Kicinski <kuba@kernel.org>
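A hedged sketch of the fixed wakeup, mirroring what sk_stream_write_space() does for TCP; the surrounding sctp_wspace()/sctp_writeable() checks are elided:

```c
#include <linux/poll.h>
#include <net/sock.h>

/* Wake only waiters whose poll mask includes writability, so
 * ep_poll_callback() can skip pure readers. */
static void write_space_wakeup_sketch(struct sock *sk)
{
	struct socket_wq *wq;

	rcu_read_lock();
	wq = rcu_dereference(sk->sk_wq);
	if (wq && waitqueue_active(&wq->wait))
		wake_up_interruptible_poll(&wq->wait,
					   EPOLLOUT | EPOLLWRNORM |
					   EPOLLWRBAND);
	rcu_read_unlock();
}
```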
2025-05-20 | xfrm: use kfree_sensitive() for SA secret zeroization | Zilin Guan | 1 | -3/+3
High-level copy_to_user_* APIs already redact SA secret fields when redaction is enabled, but the state teardown path still freed aead, aalg and ealg structs with plain kfree(), which does not clear memory before deallocation. This can leave SA keys and other confidential data in memory, risking exposure via post-free vulnerabilities. Since this path is outside the packet fast path, the cost of zeroization is acceptable and prevents any residual key material. This patch replaces those kfree() calls unconditionally with kfree_sensitive(), which zeroizes the entire buffer before freeing. Signed-off-by: Zilin Guan <zilin@seu.edu.cn> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
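A hedged sketch of the teardown-path change; the field names mirror struct xfrm_state:

```c
#include <linux/slab.h>
#include <net/xfrm.h>

/* kfree_sensitive() runs memzero_explicit() over the whole allocation
 * before freeing, so SA key material cannot be recovered post-free. */
static void xfrm_state_free_algs_sketch(struct xfrm_state *x)
{
	kfree_sensitive(x->aead);	/* was: kfree(x->aead) */
	kfree_sensitive(x->aalg);	/* was: kfree(x->aalg) */
	kfree_sensitive(x->ealg);	/* was: kfree(x->ealg) */
}
```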
2025-05-19 | can: bcm: add missing rcu read protection for procfs content | Oliver Hartkopp | 1 | -4/+9
When the procfs content is generated for a bcm_op which is in the process of being removed, the procfs output might show unreliable data (UAF). As the removal of bcm_op's is already implemented with RCU handling, this patch adds the missing rcu_read_lock() and makes sure the list entries are properly removed under RCU protection. Fixes: f1b4e32aca08 ("can: bcm: use call_rcu() instead of costly synchronize_rcu()") Reported-by: Anderson Nascimento <anderson@allelesecurity.com> Suggested-by: Anderson Nascimento <anderson@allelesecurity.com> Tested-by: Anderson Nascimento <anderson@allelesecurity.com> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://patch.msgid.link/20250519125027.11900-2-socketcan@hartkopp.net Cc: stable@vger.kernel.org # >= 5.4 Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2025-05-19 | can: bcm: add locking for bcm_op runtime updates | Oliver Hartkopp | 1 | -21/+45
The CAN broadcast manager (CAN BCM) can send a sequence of CAN frames via hrtimer. The content and also the length of the sequence can be changed or reduced at runtime, where the 'currframe' counter is then set to zero. Although this appeared to be a safe operation, the updates of 'currframe' can be triggered from user space and hrtimer context in bcm_can_tx(). Anderson Nascimento created a proof of concept that triggered a KASAN slab-out-of-bounds read access, which can be prevented with a spin_lock_bh. During the rework of bcm_can_tx(), the 'count' variable has been moved into the protected section as this variable can be modified from both contexts too. Fixes: ffd980f976e7 ("[CAN]: Add broadcast manager (bcm) protocol") Reported-by: Anderson Nascimento <anderson@allelesecurity.com> Tested-by: Anderson Nascimento <anderson@allelesecurity.com> Reviewed-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://patch.msgid.link/20250519125027.11900-1-socketcan@hartkopp.net Cc: stable@vger.kernel.org Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2025-05-19 | llc: fix data loss when reading from a socket in llc_ui_recvmsg() | Ilia Gavrilov | 1 | -4/+4
For SOCK_STREAM sockets, if user buffer size (len) is less than skb size (skb->len), the remaining data from skb will be lost after calling kfree_skb(). To fix this, move the statement for partial reading above skb deletion. Found by InfoTeCS on behalf of Linux Verification Center (linuxtesting.org) Fixes: 30a584d944fb ("[LLX]: SOCK_DGRAM interface fixes") Cc: stable@vger.kernel.org Signed-off-by: Ilia Gavrilov <Ilia.Gavrilov@infotecs.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-05-16 | mr: consolidate the ipmr_can_free_table() checks. | Paolo Abeni | 2 | -22/+2
Guoyu Yin reported a splat in the ipmr netns cleanup path:

WARNING: CPU: 2 PID: 14564 at net/ipv4/ipmr.c:440 ipmr_free_table net/ipv4/ipmr.c:440 [inline]
WARNING: CPU: 2 PID: 14564 at net/ipv4/ipmr.c:440 ipmr_rules_exit+0x135/0x1c0 net/ipv4/ipmr.c:361
Modules linked in:
CPU: 2 UID: 0 PID: 14564 Comm: syz.4.838 Not tainted 6.14.0 #1
Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
RIP: 0010:ipmr_free_table net/ipv4/ipmr.c:440 [inline]
RIP: 0010:ipmr_rules_exit+0x135/0x1c0 net/ipv4/ipmr.c:361
Code: ff df 48 c1 ea 03 80 3c 02 00 75 7d 48 c7 83 60 05 00 00 00 00 00 00 5b 5d 41 5c 41 5d 41 5e e9 71 67 7f 00 e8 4c 2d 8a fd 90 <0f> 0b 90 eb 93 e8 41 2d 8a fd 0f b6 2d 80 54 ea 01 31 ff 89 ee e8
RSP: 0018:ffff888109547c58 EFLAGS: 00010293
RAX: 0000000000000000 RBX: ffff888108c12dc0 RCX: ffffffff83e09868
RDX: ffff8881022b3300 RSI: ffffffff83e098d4 RDI: 0000000000000005
RBP: ffff888104288000 R08: 0000000000000000 R09: ffffed10211825c9
R10: 0000000000000001 R11: ffff88801816c4a0 R12: 0000000000000001
R13: ffff888108c13320 R14: ffff888108c12dc0 R15: fffffbfff0b74058
FS: 00007f84f39316c0(0000) GS:ffff88811b100000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f84f3930f98 CR3: 0000000113b56000 CR4: 0000000000350ef0
Call Trace:
 <TASK>
 ipmr_net_exit_batch+0x50/0x90 net/ipv4/ipmr.c:3160
 ops_exit_list+0x10c/0x160 net/core/net_namespace.c:177
 setup_net+0x47d/0x8e0 net/core/net_namespace.c:394
 copy_net_ns+0x25d/0x410 net/core/net_namespace.c:516
 create_new_namespaces+0x3f6/0xaf0 kernel/nsproxy.c:110
 unshare_nsproxy_namespaces+0xc3/0x180 kernel/nsproxy.c:228
 ksys_unshare+0x78d/0x9a0 kernel/fork.c:3342
 __do_sys_unshare kernel/fork.c:3413 [inline]
 __se_sys_unshare kernel/fork.c:3411 [inline]
 __x64_sys_unshare+0x31/0x40 kernel/fork.c:3411
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xa6/0x1a0 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f84f532cc29
Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f84f3931038 EFLAGS: 00000246 ORIG_RAX: 0000000000000110
RAX: ffffffffffffffda RBX: 00007f84f5615fa0 RCX: 00007f84f532cc29
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000040000400
RBP: 00007f84f53fba18 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 00007f84f5615fa0 R15: 00007fff51c5f328
 </TASK>

The running kernel has CONFIG_IP_MROUTE_MULTIPLE_TABLES disabled, and the sanity check for such a build is still too loose. Address the issue by consolidating the relevant sanity check in a single helper regardless of the kernel configuration. Also share it between the ipv4 and ipv6 code.

Reported-by: Guoyu Yin <y04609127@gmail.com>
Fixes: 50b94204446e ("ipmr: tune the ipmr_can_free_table() checks.")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/372dc261e1bf12742276e1b984fc5a071b7fc5a8.1747321903.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-16 | net: rfs: add sock_rps_delete_flow() helper | Eric Dumazet | 3 | -3/+7
RFS can exhibit lower performance for workloads using short-lived flows and a small set of 4-tuples. This is often the case for load-testers, using a pair of hosts, if the server has a single listener port.

Typical use case:

    Server: tcp_crr -T128 -F1000 -6 -U -l30 -R 14250
    Client: tcp_crr -T128 -F1000 -6 -U -l30 -c -H server | grep local_throughput

This is because the RFS global hash table contains stale information, when the same RSS key is recycled for another socket and another cpu. Make sure to undo the changes and go back to the initial state when a flow is disconnected. Performance of the above test is increased by 22%, going from 372604 transactions per second to 457773.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Octavian Purdila <tavip@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20250515100354.3339920-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
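A hedged sketch of what such a helper can do on disconnect; the net_hotdata/rps_sock_flow_table names follow current kernels, but the exact body is an assumption, not the literal patch:

```c
#include <net/hotdata.h>
#include <net/rps.h>
#include <net/sock.h>

/* On disconnect, invalidate the global flow-table slot this socket's
 * hash pointed at, so a recycled 4-tuple starts from a clean state. */
static void sock_rps_delete_flow_sketch(const struct sock *sk)
{
	struct rps_sock_flow_table *table;
	u32 hash = READ_ONCE(sk->sk_rxhash);

	if (!hash)
		return;

	rcu_read_lock();
	table = rcu_dereference(net_hotdata.rps_sock_flow_table);
	if (table) {
		u32 index = hash & table->mask;

		if (READ_ONCE(table->ents[index]) != RPS_NO_CPU)
			WRITE_ONCE(table->ents[index], RPS_NO_CPU);
	}
	rcu_read_unlock();
}
```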
2025-05-16 | bridge: netfilter: Fix forwarding of fragmented packets | Ido Schimmel | 2 | -5/+3
When netfilter defrag hooks are loaded (due to the presence of conntrack rules, for example), fragmented packets entering the bridge will be defragged by the bridge's pre-routing hook (br_nf_pre_routing() -> ipv4_conntrack_defrag()). Later on, in the bridge's post-routing hook, the defragged packet will be fragmented again. If the size of the largest fragment is larger than what the kernel has determined as the destination MTU (using ip_skb_dst_mtu()), the defragged packet will be dropped. Before commit ac6627a28dbf ("net: ipv4: Consolidate ipv4_mtu and ip_dst_mtu_maybe_forward"), ip_skb_dst_mtu() would return dst_mtu() as the destination MTU. Assuming the dst entry attached to the packet is the bridge's fake rtable one, this would simply be the bridge's MTU (see fake_mtu()). However, after above mentioned commit, ip_skb_dst_mtu() ends up returning the route's MTU stored in the dst entry's metrics. Ideally, in case the dst entry is the bridge's fake rtable one, this should be the bridge's MTU as the bridge takes care of updating this metric when its MTU changes (see br_change_mtu()). Unfortunately, the last operation is a no-op given the metrics attached to the fake rtable entry are marked as read-only. Therefore, ip_skb_dst_mtu() ends up returning 1500 (the initial MTU value) and defragged packets are dropped during fragmentation when dealing with large fragments and high MTU (e.g., 9k). Fix by moving the fake rtable entry's metrics to be per-bridge (in a similar fashion to the fake rtable entry itself) and marking them as writable, thereby allowing MTU changes to be reflected. Fixes: 62fa8a846d7d ("net: Implement read-only protection and COW'ing of metrics.") Fixes: 33eb9873a283 ("bridge: initialize fake_rtable metrics") Reported-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Closes: https://lore.kernel.org/netdev/PH0PR10MB4504888284FF4CBA648197D0ACB82@PH0PR10MB4504.namprd10.prod.outlook.com/ Tested-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20250515084848.727706-1-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-16 | net: dsa: microchip: linearize skb for tail-tagging switches | Jakob Unterwurzacher | 1 | -4/+15
The pointer arithmetic for accessing the tail tag only works for linear skbs. For nonlinear skbs, it reads uninitialized memory inside the skb headroom, essentially randomizing the tag. I have observed it gets set to 6 most of the time.

Example where ksz9477_rcv thinks that the packet from port 1 comes from port 6 (which does not exist for the ksz9896 that's in use), dropping the packet. Debug prints added by me (not included in this patch):

[ 256.645337] ksz9477_rcv:323 tag0=6
[ 256.645349] skb len=47 headroom=78 headlen=0 tailroom=0 mac=(64,14) mac_len=14 net=(78,0) trans=78 shinfo(txflags=0 nr_frags=1 gso(size=0 type=0 segs=0)) csum(0x0 start=0 offset=0 ip_summed=0 complete_sw=0 valid=0 level=0) hash(0x0 sw=0 l4=0) proto=0x00f8 pkttype=1 iif=3 priority=0x0 mark=0x0 alloc_cpu=0 vlan_all=0x0 encapsulation=0 inner(proto=0x0000, mac=0, net=0, trans=0)
[ 256.645377] dev name=end1 feat=0x0002e10200114bb3
[ 256.645386] skb headroom: 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 256.645395] skb headroom: 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 256.645403] skb headroom: 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 256.645411] skb headroom: 00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 256.645420] skb headroom: 00000040: ff ff ff ff ff ff 00 1c 19 f2 e2 db 08 06
[ 256.645428] skb frag: 00000000: 00 01 08 00 06 04 00 01 00 1c 19 f2 e2 db 0a 02
[ 256.645436] skb frag: 00000010: 00 83 00 00 00 00 00 00 0a 02 a0 2f 00 00 00 00
[ 256.645444] skb frag: 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01
[ 256.645452] ksz_common_rcv:92 dsa_conduit_find_user returned NULL

Call skb_linearize before trying to access the tag. This patch fixes ksz9477_rcv, which is used by the ksz9896 I have at hand, and also applies the same fix to ksz8795_rcv, which seems to have the same problem.

Signed-off-by: Jakob Unterwurzacher <jakob.unterwurzacher@cherry.de>
CC: stable@vger.kernel.org
Fixes: 016e43a26bab ("net: dsa: ksz: Add KSZ8795 tag code")
Fixes: 8b8010fb7876 ("dsa: add support for Microchip KSZ tail tagging")
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Link: https://patch.msgid.link/20250515072920.2313014-1-jakob.unterwurzacher@cherry.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
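A hedged sketch of the fixed receive path; the 1-byte tag length and port-field layout are assumptions mirroring the KSZ9477 receive tag, not the literal driver code:

```c
#include <linux/skbuff.h>

#define KSZ_EGRESS_TAG_LEN 1	/* assumed: single-byte receive tag */

static struct sk_buff *ksz_rcv_sketch(struct sk_buff *skb)
{
	u8 *tag;

	/* Pull all fragments into the linear head first; otherwise the
	 * tail-pointer arithmetic below reads skb headroom garbage. */
	if (skb_linearize(skb))
		return NULL;	/* allocation failure: drop the frame */

	tag = skb_tail_pointer(skb) - KSZ_EGRESS_TAG_LEN;
	pr_debug("source port %u\n", tag[0] & 0x7);	/* assumed layout */

	/* Trim the tag off before handing the frame up the stack. */
	if (pskb_trim_rcsum(skb, skb->len - KSZ_EGRESS_TAG_LEN))
		return NULL;
	return skb;
}
```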
2025-05-16 | wifi: mac80211: handle non-MLO mode as well in ieee80211_num_beaconing_links() | Aditya Kumar Singh | 1 | -3/+3
Currently, ieee80211_num_beaconing_links() returns 0 when the interface operates in non-ML mode. However, non-MLO mode is equivalent to having a single link. Therefore, the function can handle the non-MLO case as well. This adjustment will also eliminate the need for deflink usage in certain scenarios. Hence, implement changes to handle the non-MLO case as well. There is no change in functionality, and no existing user-visible bug is getting fixed. This update simply makes the function generic to handle all cases. Suggested-by: Johannes Berg <johannes@sipsolutions.net> Link: https://lore.kernel.org/linux-wireless/16499ad8e4b060ee04c8a8b3615fe8952aa7b07b.camel@sipsolutions.net/ Signed-off-by: Aditya Kumar Singh <aditya.kumar.singh@oss.qualcomm.com> Link: https://patch.msgid.link/20250515-fix_num_beaconing_links-v1-1-4a39e2704314@oss.qualcomm.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-05-15 | svcrdma: Adjust the number of entries in svc_rdma_send_ctxt::sc_pages | Chuck Lever | 1 | -3/+13
Allow allocation of more entries in the sc_pages[] array when the maximum size of an RPC message is increased. Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: NeilBrown <neil@brown.name> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-05-15 | svcrdma: Adjust the number of entries in svc_rdma_recv_ctxt::rc_pages | Chuck Lever | 1 | -2/+6
Allow allocation of more entries in the rc_pages[] array when the maximum size of an RPC message is increased. Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: NeilBrown <neil@brown.name> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-05-15 | sunrpc: Adjust size of socket's receive page array dynamically | Chuck Lever | 1 | -2/+6
As a step towards making NFSD's maximum rsize and wsize variable at run-time, make sk_pages a flexible array. Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: NeilBrown <neil@brown.name> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-05-15 | SUNRPC: Remove svc_fill_write_vector() | Chuck Lever | 1 | -40/+0
Clean up: This API is no longer used. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-05-15 | SUNRPC: Export xdr_buf_to_bvec() | Chuck Lever | 1 | -0/+1
Prepare xdr_buf_to_bvec() to be invoked from upper layer protocol code. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-05-15 | NFSD: De-duplicate the svc_fill_write_vector() call sites | Chuck Lever | 1 | -2/+2
All three call sites do the same thing. I'm struggling with this a bit, however. struct xdr_buf is an XDR layer object and unmarshaling a WRITE payload is clearly a task intended to be done by the proc and xdr functions, not by VFS. This feels vaguely like a layering violation. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-05-15 | sunrpc: Replace the rq_bvec array with dynamically-allocated memory | Chuck Lever | 2 | -4/+10
As a step towards making NFSD's maximum rsize and wsize variable at run-time, replace the fixed-size rq_bvec[] array in struct svc_rqst with a chunk of dynamically-allocated memory. The rq_bvec[] array contains enough bio_vecs to handle each page in a maximum size RPC message. On a system with 8-byte pointers and 4KB pages, pahole reports that the rq_bvec[] array is 4144 bytes. This patch replaces that array with a single 8-byte pointer field. Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: NeilBrown <neil@brown.name> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
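A hedged sketch of the conversion pattern (names are illustrative, not the exact patch):

```c
#include <linux/bvec.h>
#include <linux/slab.h>

/* Sketch: the fixed array sized for the compile-time maximum becomes
 * a pointer sized when the svc_rqst is allocated. */
struct svc_rqst_sketch {
	/* before: struct bio_vec rq_bvec[RPCSVC_MAXPAGES]; ~4144 bytes */
	struct bio_vec *rq_bvec;	/* after: one 8-byte pointer */
};

static int svc_rqst_alloc_bvec(struct svc_rqst_sketch *rqstp,
			       unsigned int maxpages)
{
	rqstp->rq_bvec = kcalloc(maxpages, sizeof(struct bio_vec),
				 GFP_KERNEL);
	return rqstp->rq_bvec ? 0 : -ENOMEM;
}
```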
2025-05-15 | sunrpc: Replace the rq_pages array with dynamically-allocated memory | Chuck Lever | 3 | -24/+19
As a step towards making NFSD's maximum rsize and wsize variable at run-time, replace the fixed-size rq_pages[] array in struct svc_rqst with a chunk of dynamically-allocated memory. On a system with 8-byte pointers and 4KB pages, pahole reports that the rq_pages[] array is 2080 bytes. This patch replaces that with a single 8-byte pointer field. Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: NeilBrown <neil@brown.name> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-05-15 | sunrpc: Remove backchannel check in svc_init_buffer() | Chuck Lever | 1 | -4/+0
The server's backchannel uses struct svc_rqst, but does not use the pages in svc_rqst::rq_pages. Its rq_arg::pages and rq_res::pages come from the RPC client's page allocator. Currently, svc_init_buffer() skips allocating pages in rq_pages for that reason. Except that svc_rqst::rq_pages is filled anyway when a backchannel svc_rqst is passed to svc_recv() and then to svc_alloc_arg(). This isn't really a problem at the moment, except that these pages are allocated but then never used, as far as I can tell. The problem is that later in this series, in addition to populating the entries of rq_pages[], svc_init_buffer() will also allocate the memory underlying the rq_pages[] array itself. If that allocation is skipped, then svc_alloc_args() chases a NULL pointer for ingress backchannel requests. This approach avoids introducing extra conditional logic in svc_alloc_args(), which is a hot path. Acked-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: NeilBrown <neil@brown.name> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-05-15 | svcrdma: Reduce the number of rdma_rw contexts per-QP | Chuck Lever | 1 | -6/+8
There is an upper bound on the number of rdma_rw contexts that can be created per QP. This invisible upper bound is because rdma_create_qp() adds one or more additional SQEs for each ctxt that the ULP requests via qp_attr.cap.max_rdma_ctxs. The QP's actual Send Queue length is on the order of the sum of qp_attr.cap.max_send_wr and a factor times qp_attr.cap.max_rdma_ctxs. The factor can be up to three, depending on whether MR operations are required before RDMA Reads. This limit is not visible to RDMA consumers via dev->attrs. When the limit is surpassed, QP creation fails with -ENOMEM. For example:

svcrdma's estimate of the number of rdma_rw contexts it needs is three times the number of pages in RPCSVC_MAXPAGES. When MAXPAGES is about 260, the internally-computed SQ length should be:

    64 credits + 10 backlog + 3 * (3 * 260) = 2414

Which is well below the advertised qp_max_wr of 32768.

If RPCSVC_MAXPAGES is increased to 4MB, that's 1040 pages:

    64 credits + 10 backlog + 3 * (3 * 1040) = 9434

However, QP creation fails. Dynamic printk for mlx5 shows:

    calc_sq_size:618:(pid 1514): send queue size (9326 * 256 / 64 -> 65536) exceeds limits(32768)

Although 9326 is still far below qp_max_wr, QP creation still fails. Because the total SQ length calculation is opaque to RDMA consumers, there doesn't seem to be much that can be done about this except for consumers to try to keep the requested rdma_rw ctxt count low.

Fixes: 2da0f610e733 ("svcrdma: Increase the per-transport rw_ctx count")
Reviewed-by: NeilBrown <neil@brown.name>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-05-15 | tcp: increase tcp_rmem[2] to 32 MB | Eric Dumazet | 1 | -1/+1
Last change to tcp_rmem[2] happened in 2012, in commit b49960a05e32 ("tcp: change tcp_adv_win_scale and tcp_rmem[2]") TCP performance on WAN is mostly limited by tcp_rmem[2] for receivers. After this series improvements, it is time to increase the default. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250513193919.1089692-12-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-15 | tcp: always use tcp_limit_output_bytes limitation | Eric Dumazet | 1 | -3/+2
This partially reverts commit c73e5807e4f6 ("tcp: tsq: no longer use limit_output_bytes for paced flows"). Overriding the tcp_limit_output_bytes sysctl value for FQ enabled flows has the following problem: it allows TCP to queue around 2 ms worth of data per flow, defeating tcp_rcv_rtt_update() accuracy on the receiver, forcing it to increase sk->sk_rcvbuf even if the real RTT is around 100 us. After this change, we keep enough packets in flight to fill the pipe, and keep receive queues small enough to get good cache behavior (cpu caches and/or NIC driver page pools). Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250513193919.1089692-11-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-15 | tcp: increase tcp_limit_output_bytes default value to 4MB | Eric Dumazet | 1 | -2/+2
Last change happened in 2018 with commit c73e5807e4f6 ("tcp: tsq: no longer use limit_output_bytes for paced flows") Modern NIC speeds got a 4x increase since then. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250513193919.1089692-10-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-15 | tcp: skip big rtt sample if receive queue is not empty | Eric Dumazet | 1 | -0/+3
tcp_rcv_rtt_update() role is to keep an estimation of RTT (tp->rcv_rtt_est.rtt_us) for receivers. If an application is too slow to drain the TCP receive queue, it is better to leave the RTT estimation small, so that tcp_rcv_space_adjust() does not inflate tp->rcvq_space.space and sk->sk_rcvbuf. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250513193919.1089692-9-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-15 | tcp: always seek for minimal rtt in tcp_rcv_rtt_update() | Eric Dumazet | 1 | -14/+8
tcp_rcv_rtt_update() goal is to maintain an estimation of the RTT in tp->rcv_rtt_est.rtt_us, used by tcp_rcv_space_adjust() When TCP TS are enabled, tcp_rcv_rtt_update() is using EWMA to smooth the samples. Change this to immediately latch the incoming value if it is lower than tp->rcv_rtt_est.rtt_us, so that tcp_rcv_space_adjust() does not overshoot tp->rcvq_space.space and sk->sk_rcvbuf. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250513193919.1089692-8-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
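A hedged sketch of the new latching rule (the real code keeps rtt_us left-shifted by 3 for EWMA precision; that detail is elided here):

```c
#include <linux/tcp.h>

/* Latch lower samples immediately; only smooth upward moves, so that
 * tcp_rcv_space_adjust() never works from an inflated RTT estimate. */
static void rcv_rtt_update_sketch(struct tcp_sock *tp, u32 sample_us)
{
	u32 cur = tp->rcv_rtt_est.rtt_us;

	if (!cur || sample_us < cur)
		tp->rcv_rtt_est.rtt_us = sample_us;
	else
		/* classic 1/8-gain EWMA for higher samples */
		tp->rcv_rtt_est.rtt_us = cur + ((sample_us - cur) >> 3);
}
```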
2025-05-15 | tcp: fix initial tp->rcvq_space.space value for passive TS enabled flows | Eric Dumazet | 1 | -3/+3
tcp_rcv_state_process() must tweak tp->advmss for TS enabled flows before the call to tcp_init_transfer() / tcp_init_buffer_space(). Otherwise tp->rcvq_space.space is off by 120 bytes (TCP_INIT_CWND * TCPOLEN_TSTAMP_ALIGNED). Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Wei Wang <weiwan@google.com> Link: https://patch.msgid.link/20250513193919.1089692-7-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-15 | tcp: remove zero TCP TS samples for autotuning | Eric Dumazet | 1 | -5/+5
For TCP flows using ms RFC 7323 timestamp granularity tcp_rcv_rtt_update() can be fed with 1 ms samples, breaking TCP autotuning for data center flows with sub ms RTT. Instead, rely on the window based samples, fed by tcp_rcv_rtt_measure().

tcp_rcvbuf_grow() for a 10 second TCP_STREAM session now looks saner. We can see rcvbuf is kept at a reasonable value.

222.234976: tcp:tcp_rcvbuf_grow: time=348 rtt_us=330 copied=110592 inq=0 space=40960 ooo=0 scaling_ratio=230 rcvbuf=131072 ...
222.235276: tcp:tcp_rcvbuf_grow: time=300 rtt_us=288 copied=126976 inq=0 space=110592 ooo=0 scaling_ratio=230 rcvbuf=246187 ...
222.235569: tcp:tcp_rcvbuf_grow: time=294 rtt_us=288 copied=184320 inq=0 space=126976 ooo=0 scaling_ratio=230 rcvbuf=282659 ...
222.235833: tcp:tcp_rcvbuf_grow: time=264 rtt_us=244 copied=373760 inq=0 space=184320 ooo=0 scaling_ratio=230 rcvbuf=410312 ...
222.236142: tcp:tcp_rcvbuf_grow: time=308 rtt_us=219 copied=424960 inq=20480 space=373760 ooo=0 scaling_ratio=230 rcvbuf=832022 ...
222.236378: tcp:tcp_rcvbuf_grow: time=236 rtt_us=219 copied=692224 inq=49152 space=404480 ooo=0 scaling_ratio=230 rcvbuf=900407 ...
222.236602: tcp:tcp_rcvbuf_grow: time=225 rtt_us=219 copied=730112 inq=49152 space=643072 ooo=0 scaling_ratio=230 rcvbuf=1431534 ...
222.237050: tcp:tcp_rcvbuf_grow: time=229 rtt_us=219 copied=1160192 inq=49152 space=680960 ooo=0 scaling_ratio=230 rcvbuf=1515876 ...
222.237618: tcp:tcp_rcvbuf_grow: time=305 rtt_us=218 copied=2228224 inq=49152 space=1111040 ooo=0 scaling_ratio=230 rcvbuf=2473271 ...
222.238591: tcp:tcp_rcvbuf_grow: time=224 rtt_us=218 copied=3063808 inq=360448 space=2179072 ooo=0 scaling_ratio=230 rcvbuf=4850803 ...
222.240647: tcp:tcp_rcvbuf_grow: time=260 rtt_us=218 copied=2752512 inq=0 space=2703360 ooo=0 scaling_ratio=230 rcvbuf=6017914 ...
222.243535: tcp:tcp_rcvbuf_grow: time=224 rtt_us=218 copied=2834432 inq=49152 space=2752512 ooo=0 scaling_ratio=230 rcvbuf=6127331 ...
222.245108: tcp:tcp_rcvbuf_grow: time=240 rtt_us=218 copied=2883584 inq=49152 space=2785280 ooo=0 scaling_ratio=230 rcvbuf=6200275 ...
222.245333: tcp:tcp_rcvbuf_grow: time=224 rtt_us=218 copied=2859008 inq=0 space=2834432 ooo=0 scaling_ratio=230 rcvbuf=6309692 ...
222.301021: tcp:tcp_rcvbuf_grow: time=222 rtt_us=218 copied=2883584 inq=0 space=2859008 ooo=0 scaling_ratio=230 rcvbuf=6364400 ...
222.989242: tcp:tcp_rcvbuf_grow: time=225 rtt_us=218 copied=2899968 inq=0 space=2883584 ooo=0 scaling_ratio=230 rcvbuf=6419108 ...
224.139553: tcp:tcp_rcvbuf_grow: time=224 rtt_us=218 copied=3014656 inq=65536 space=2899968 ooo=0 scaling_ratio=230 rcvbuf=6455580 ...
224.584608: tcp:tcp_rcvbuf_grow: time=232 rtt_us=218 copied=3014656 inq=49152 space=2949120 ooo=0 scaling_ratio=230 rcvbuf=6564997 ...
230.145560: tcp:tcp_rcvbuf_grow: time=223 rtt_us=218 copied=2981888 inq=0 space=2965504 ooo=0 scaling_ratio=230 rcvbuf=6601469 ...

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Wei Wang <weiwan@google.com>
Link: https://patch.msgid.link/20250513193919.1089692-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-15 | tcp: add receive queue awareness in tcp_rcv_space_adjust() | Eric Dumazet | 1 | -2/+4
If the application cannot drain a TCP socket queue fast enough, tcp_rcv_space_adjust() can overestimate tp->rcvq_space.space. Then sk->sk_rcvbuf can grow and hit tcp_rmem[2] for no good reason. Fix this by taking into account the number of available bytes. Keeping sk->sk_rcvbuf at the right size allows better cache efficiency. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Wei Wang <weiwan@google.com> Link: https://patch.msgid.link/20250513193919.1089692-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-15 | tcp: adjust rcvbuf in presence of reorders | Eric Dumazet | 1 | -0/+4
This patch takes care of the needed provisioning when incoming packets are stored in the out-of-order queue. This part was not implemented in the correct way; we need to decouple it from the tcp_rcv_space_adjust() logic. Without it, stalls in the pipe could happen. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250513193919.1089692-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-15 | tcp: fix sk_rcvbuf overshoot | Eric Dumazet | 1 | -34/+25
Current autosizing in tcp_rcv_space_adjust() is too aggressive. Instead of betting on possible losses and overestimating the BDP, it is better to only account for slow start. The following patch then adds more precise tuning in the event of packet losses. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250513193919.1089692-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>