path: root/net
Age | Commit message | Author | Files | Lines
2025-05-27 | xsk: add missing virtual address conversion for page | Bui Quang Minh | 1 | -2/+1
In commit 7ead4405e06f ("xsk: convert xdp_copy_frags_from_zc() to use page_pool_dev_alloc()"), when converting from netmem to page, I missed a call to page_address() around skb_frag_page(frag) to get the virtual address of the page. This commit uses skb_frag_address() helper to fix the issue. Fixes: 7ead4405e06f ("xsk: convert xdp_copy_frags_from_zc() to use page_pool_dev_alloc()") Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com> Link: https://patch.msgid.link/20250522040115.5057-1-minhquangbui99@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-27 | vsock: Move lingering logic to af_vsock core | Michal Luczaj | 2 | -21/+35
Lingering should be transport-independent in the long run. In preparation for supporting other transports, as well as the linger on shutdown(), move code to core. Generalize by querying vsock_transport::unsent_bytes(), guard against the callback being unimplemented. Do not pass sk_lingertime explicitly. Pull SOCK_LINGER check into vsock_linger(). Flatten the function. Remove the nested block by inverting the condition: return early on !timeout. Suggested-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> Link: https://patch.msgid.link/20250522-vsock-linger-v6-2-2ad00b0e447e@rbox.co Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-27 | vsock/virtio: Linger on unsent data | Michal Luczaj | 1 | -1/+3
Currently vsock's lingering effectively boils down to waiting (or timing out) until packets are consumed or dropped by the peer; be it by receiving the data, closing or shutting down the connection. To align with the semantics described in the SO_LINGER section of man socket(7) and to mimic AF_INET's behaviour more closely, change the logic of a lingering close(): instead of waiting for all data to be handled, block until data is considered sent from the vsock's transport point of view. That is, until the worker picks up the packets for processing and decrements virtio_vsock_sock::bytes_unsent down to 0. Note that (some interpretation of) lingering was always limited to transports that called virtio_transport_wait_close() on transport release. This does not change, i.e. under Hyper-V and VMCI no lingering would be observed. The implementation does not adhere strictly to the man page's interpretation of SO_LINGER: shutdown() will not trigger the lingering. This follows AF_INET. Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> Link: https://patch.msgid.link/20250522-vsock-linger-v6-1-2ad00b0e447e@rbox.co Signed-off-by: Paolo Abeni <pabeni@redhat.com>
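For readers who want to see what a lingering close() looks like from user space, here is a minimal sketch. SO_LINGER and struct linger are the standard socket API; the helper names enable_linger() and linger_seconds() are made up for this illustration, and nothing here is vsock-specific (a caller would pass an AF_VSOCK socket to exercise the path described above).

```c
#include <assert.h>
#include <sys/socket.h>

/* Hypothetical helper: turn SO_LINGER on with a timeout in seconds.
 * Returns 0 on success, -1 on error (errno is set by setsockopt()). */
int enable_linger(int fd, int seconds)
{
	struct linger lg = { .l_onoff = 1, .l_linger = seconds };

	assert(seconds >= 0);
	return setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg));
}

/* Hypothetical helper: read back the linger timeout.
 * Returns -1 if lingering is off or the query fails. */
int linger_seconds(int fd)
{
	struct linger lg;
	socklen_t len = sizeof(lg);

	if (getsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, &len) < 0 || !lg.l_onoff)
		return -1;
	return lg.l_linger;
}
```

With lingering enabled, close() on a transport that supports it may block for up to the configured timeout while unsent data drains.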
2025-05-27 | net: core: Convert dev_set_mac_address_user() to use struct sockaddr_storage | Kees Cook | 2 | -4/+7
Convert callers of dev_set_mac_address_user() to use struct sockaddr_storage. Add sanity checks on dev->addr_len usage. Signed-off-by: Kees Cook <kees@kernel.org> Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org> Link: https://patch.msgid.link/20250521204619.2301870-8-kees@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-27 | rtnetlink: do_setlink: Use struct sockaddr_storage | Kees Cook | 1 | -15/+4
Instead of heap-allocating a variably sized struct sockaddr and lying about the type in the call to netif_set_mac_address(), use a stack-allocated struct sockaddr_storage. This lets us drop the cast and avoid the allocation. Putting "ss" on the stack means it will get a reused stack slot since it is the same size (128B) as other existing single-scope stack variables, like the vfinfo array (128B), so no additional stack space is used by this function. Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20250521204619.2301870-7-kees@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
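The pattern described above, a fixed-size struct sockaddr_storage on the caller's stack instead of a heap buffer sized at run time, can be sketched in user space roughly as follows. This is an illustration only, not the kernel code: fill_hwaddr() is a hypothetical helper, and the bounds check stands in for the kind of sanity check on the address length that the series mentions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: pack a hardware address into a caller-provided
 * sockaddr_storage. Returns 0 on success, -1 if the address does not fit. */
int fill_hwaddr(struct sockaddr_storage *ss, unsigned short family,
		const unsigned char *addr, size_t addr_len)
{
	/* Address bytes start where sa_data starts in struct sockaddr. */
	const size_t data_off = offsetof(struct sockaddr, sa_data);

	assert(ss && addr);
	if (addr_len > sizeof(*ss) - data_off)
		return -1;

	memset(ss, 0, sizeof(*ss));
	ss->ss_family = family;
	memcpy((unsigned char *)ss + data_off, addr, addr_len);
	return 0;
}
```

Because the storage has a size the compiler knows, the length check is a comparison against a constant and no allocation or free path is needed.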
2025-05-27 | net: core: Convert dev_set_mac_address() to struct sockaddr_storage | Kees Cook | 4 | -5/+6
All users of dev_set_mac_address() are now using a struct sockaddr_storage. Convert the internal data type to struct sockaddr_storage, drop the casts, and update pointer types. Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20250521204619.2301870-6-kees@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-27 | ieee802154: Use struct sockaddr_storage with dev_set_mac_address() | Kees Cook | 1 | -4/+4
Switch to struct sockaddr_storage for calling dev_set_mac_address(). Add a temporary cast to struct sockaddr, which will be removed in a subsequent patch. Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20250521204619.2301870-4-kees@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-27 | net/ncsi: Use struct sockaddr_storage for pending_mac | Kees Cook | 3 | -11/+11
To avoid future casting with coming API type changes, switch struct ncsi_dev_priv::pending_mac to a full struct sockaddr_storage. Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20250521204619.2301870-3-kees@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-27 | net: core: Switch netif_set_mac_address() to struct sockaddr_storage | Kees Cook | 3 | -8/+8
In order to avoid passing around struct sockaddr that has a size the compiler cannot reason about (nor track at runtime), convert netif_set_mac_address() to take struct sockaddr_storage. This is just a cast conversion, so there are no binary changes. Following patches will make actual allocation changes. Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20250521204619.2301870-2-kees@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-27 | net: core: Convert inet_addr_is_any() to sockaddr_storage | Kees Cook | 1 | -4/+4
All the callers of inet_addr_is_any() have a sockaddr_storage-backed sockaddr. Avoid casts and switch prototype to the actual object being used. Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> # SCSI Signed-off-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20250521204619.2301870-1-kees@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
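To illustrate why taking the storage type directly is convenient, here is a userspace sketch of an "is this the wildcard address" check that accepts struct sockaddr_storage. The function name addr_is_any() is hypothetical and the body is an assumption about what such a check looks like, not a copy of the kernel helper.

```c
#include <assert.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <sys/socket.h>

/* Hypothetical userspace analogue: true if addr holds the IPv4 or IPv6
 * wildcard ("any") address, false for anything else. */
bool addr_is_any(const struct sockaddr_storage *addr)
{
	assert(addr);

	if (addr->ss_family == AF_INET6) {
		const struct sockaddr_in6 *in6 = (const struct sockaddr_in6 *)addr;

		return IN6_IS_ADDR_UNSPECIFIED(&in6->sin6_addr);
	}
	if (addr->ss_family == AF_INET) {
		const struct sockaddr_in *in = (const struct sockaddr_in *)addr;

		return in->sin_addr.s_addr == htonl(INADDR_ANY);
	}
	/* Unknown families are never treated as wildcard here. */
	return false;
}
```

Callers that already hold a sockaddr_storage can pass it without a cast; the family-specific views are confined to the helper.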
2025-05-26 | replace strncpy with strscpy_pad | Baris Can Goral | 1 | -4/+2
The strncpy() function is actively dangerous to use since it may not NUL-terminate the destination string, resulting in potential memory content exposures, unbounded reads, or crashes. Link: https://github.com/KSPP/linux/issues/90 In addition, strscpy_pad() is more appropriate because it also zero-fills any remaining space in the destination if the source is shorter than the provided buffer size. Signed-off-by: Baris Can Goral <goralbaris@gmail.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Link: https://patch.msgid.link/20250521161036.14489-1-goralbaris@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
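The two properties named above (guaranteed termination and zero-filling of the tail) can be sketched in user space as follows. sketch_strscpy_pad() is a hypothetical stand-in written for this note, not the kernel function; in particular it returns -1 on truncation, whereas the kernel helper reports truncation with a negative errno value.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical userspace sketch of strscpy_pad()-like semantics.
 * Always NUL-terminates dst (when size > 0) and zero-fills the unused tail.
 * Returns the number of characters copied, or -1 if src was truncated
 * or size is 0. */
ssize_t sketch_strscpy_pad(char *dst, const char *src, size_t size)
{
	size_t len;

	assert(dst && src);
	if (size == 0)
		return -1;

	len = strnlen(src, size);
	if (len == size) {
		/* src does not fit: copy what fits and terminate. */
		memcpy(dst, src, size - 1);
		dst[size - 1] = '\0';
		return -1;
	}

	memcpy(dst, src, len);
	memset(dst + len, 0, size - len);	/* terminator plus padding */
	return (ssize_t)len;
}
```

Unlike strncpy(), the truncating case still leaves a terminated string, and the caller can detect truncation from the return value.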
2025-05-26 | Merge tag 'nf-next-25-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next | Paolo Abeni | 19 | -170/+541
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next

The following batch contains Netfilter updates for net-next, specifically 26 patches: 5 patches adding/updating selftests, 4 fixes, 3 PREEMPT_RT fixes, and 14 patches to enhance nf_tables:

1) Improve selftest coverage for pipapo 4 bit group format, from Florian Westphal.
2) Fix incorrect dependencies when compiling a kernel without legacy ip{6}tables support, also from Florian.
3) Two patches to fix nft_fib vrf issues, including selftest updates to improve coverage, also from Florian Westphal.
4) Fix incorrect nesting in nft_tunnel's GENEVE support, from Fernando F. Mancera.
5) Three patches to fix PREEMPT_RT issues with nf_dup infrastructure and nft_inner to match in inner headers, from Sebastian Andrzej Siewior.
6) Integrate conntrack information into nft trace infrastructure, from Florian Westphal.
7) A series of 13 patches to allow specifying a wildcard netdevice in netdev basechain and flowtables, e.g.

    table netdev filter {
        chain ingress {
            type filter hook ingress devices = { eth0, eth1, vlan* } priority 0; policy accept;
        }
    }

   This also allows for runtime hook registration on NETDEV_{UN}REGISTER event, from Phil Sutter.

netfilter pull request 25-05-23

* tag 'nf-next-25-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: (26 commits)
  selftests: netfilter: Torture nftables netdev hooks
  netfilter: nf_tables: Add notifications for hook changes
  netfilter: nf_tables: Support wildcard netdev hook specs
  netfilter: nf_tables: Sort labels in nft_netdev_hook_alloc()
  netfilter: nf_tables: Handle NETDEV_CHANGENAME events
  netfilter: nf_tables: Wrap netdev notifiers
  netfilter: nf_tables: Respect NETDEV_REGISTER events
  netfilter: nf_tables: Prepare for handling NETDEV_REGISTER events
  netfilter: nf_tables: Have a list of nf_hook_ops in nft_hook
  netfilter: nf_tables: Pass nf_hook_ops to nft_unregister_flowtable_hook()
  netfilter: nf_tables: Introduce nft_register_flowtable_ops()
  netfilter: nf_tables: Introduce nft_hook_find_ops{,_rcu}()
  netfilter: nf_tables: Introduce functions freeing nft_hook objects
  netfilter: nf_tables: add packets conntrack state to debug trace info
  netfilter: conntrack: make nf_conntrack_id callable without a module dependency
  netfilter: nf_dup_netdev: Move the recursion counter struct netdev_xmit
  netfilter: nft_inner: Use nested-BH locking for nft_pcpu_tun_ctx
  netfilter: nf_dup{4, 6}: Move duplication check to task_struct
  netfilter: nft_tunnel: fix geneve_opt dump
  selftests: netfilter: nft_fib.sh: add type and oif tests with and without VRFs
  ...
====================
Link: https://patch.msgid.link/20250523132712.458507-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-26 | Merge tag 'ipsec-next-2025-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next | Paolo Abeni | 5 | -51/+96
Steffen Klassert says:
====================
1) Remove some unnecessary strscpy_pad() size arguments. From Thorsten Blum.
2) Correct use of xso.real_dev on bonding offloads. Patchset from Cosmin Ratiu.
3) Add hardware offload configuration to XFRM_MSG_MIGRATE. From Chiachang Wang.
4) Refactor migration setup during cloning. This was done after the clone was created. Now it is done in the cloning function itself. From Chiachang Wang.
5) Validate assignment of maximal possible SEQ number. Prevent setting the maximum sequence number, as this would cause traffic to be dropped. From Leon Romanovsky.
6) Prevent configuration of interface index when offload is used. Hardware can't handle this case. From Leon Romanovsky.
7) Always use kfree_sensitive() for SA secret zeroization. From Zilin Guan.

ipsec-next-2025-05-23

* tag 'ipsec-next-2025-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next:
  xfrm: use kfree_sensitive() for SA secret zeroization
  xfrm: prevent configuration of interface index when offload is used
  xfrm: validate assignment of maximal possible SEQ number
  xfrm: Refactor migration setup during the cloning process
  xfrm: Migrate offload configuration
  bonding: Fix multiple long standing offload races
  bonding: Mark active offloaded xfrm_states
  xfrm: Add explicit dev to .xdo_dev_state_{add,delete,free}
  xfrm: Remove unneeded device check from validate_xmit_xfrm
  xfrm: Use xdo.dev instead of xdo.real_dev
  net/mlx5: Avoid using xso.real_dev unnecessarily
  xfrm: Remove unnecessary strscpy_pad() size arguments
====================
Link: https://patch.msgid.link/20250523075611.3723340-1-steffen.klassert@secunet.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-26 | net: mctp: use nlmsg_payload() for netlink message data extraction | Jeremy Kerr | 2 | -3/+6
Jakub suggests: > I have a different request :) Matt, once this ends up in net-next > (end of this week) could you refactor it to use nlmsg_payload() ? > It doesn't exist in net but this is exactly why it was added. This refactors the additions to both mctp_dump_addrinfo() and mctp_rtm_getneigh(), two cases where we're calling nlmsg_data() on an incoming netlink message without a prior nlmsg_parse(). For the neigh.c case, we cannot hit the failure where the nlh does not contain a full ndmsg at present, as the core handler (net/core/neighbour.c, neigh_get()) has already validated the size through neigh_valid_req_get(), and would have failed the get operation before the MCTP handler is called. However, relying on that is a bit fragile, so apply the nlmsg_payload() refactor here too. Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au> Link: https://patch.msgid.link/20250521-mctp-nlmsg-payload-v2-1-e85df160c405@codeconstruct.com.au Signed-off-by: Paolo Abeni <pabeni@redhat.com>
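The idea behind the refactor, checking that the message is long enough for the expected payload before handing out a pointer to it, can be sketched in user space with the uapi netlink macros. sketch_nlmsg_payload() is a hypothetical name used for this note; NLMSG_LENGTH() and NLMSG_DATA() come from <linux/netlink.h>. This is an approximation of the pattern, not the kernel helper itself.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <linux/netlink.h>

/* Hypothetical userspace sketch: return the payload pointer only if the
 * message claims to carry at least len bytes of payload, else NULL. */
void *sketch_nlmsg_payload(struct nlmsghdr *nlh, size_t len)
{
	assert(nlh);

	/* NLMSG_LENGTH(len) is the header size plus len payload bytes. */
	if (nlh->nlmsg_len < NLMSG_LENGTH(len))
		return NULL;
	return NLMSG_DATA(nlh);
}
```

A caller would then test the result for NULL instead of dereferencing the data pointer and hoping an earlier layer validated the size.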
2025-05-26 | net: neigh: use kfree_skb_reason() in neigh_resolve_output() and neigh_connected_output() | Qiu Yutan | 1 | -2/+2
Replace kfree_skb() used in neigh_resolve_output() and neigh_connected_output() with kfree_skb_reason(). The following new skb drop reason is added: /* failed to fill the device hard header */ SKB_DROP_REASON_NEIGH_HH_FILLFAIL Signed-off-by: Qiu Yutan <qiu.yutan@zte.com.cn> Signed-off-by: Jiang Kun <jiang.kun2@zte.com.cn> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Xu Xin <xu.xin16@zte.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-05-26 | net: devmem: support single IOV with sendmsg | Stanislav Fomichev | 1 | -1/+2
sendmsg() with a single iov becomes ITER_UBUF, sendmsg() with multiple iovs becomes ITER_IOVEC. iter_iov_len() does not return the correct value for UBUF, so handle UBUF separately. Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: Mina Almasry <almasrymina@google.com> Fixes: bd61848900bf ("net: devmem: Implement TX path") Signed-off-by: Stanislav Fomichev <stfomichev@gmail.com> Acked-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-05-23 | netfilter: nf_tables: Add notifications for hook changes | Phil Sutter | 3 | -0/+62
Notify user space if netdev hooks are updated due to netdev add/remove events. Send minimal notification messages by introducing NFT_MSG_NEWDEV/DELDEV message types describing a single device only. Upon NETDEV_CHANGENAME, the callback has no information about the interface's old name. To provide a clear message to user space, include the hook's stored interface name in the notification. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23 | netfilter: nf_tables: Support wildcard netdev hook specs | Phil Sutter | 2 | -16/+15
User space may pass non-NUL-terminated NFTA_DEVICE_NAME attribute values to indicate a suffix wildcard. Expect multiple devices to match the given prefix in nft_netdev_hook_alloc() and populate 'ops_list' with them all. When checking for duplicate hooks, compare the shortest prefix so a device may never match more than a single hook spec. Finally, respect the stored prefix length when hooking into new devices from event handlers. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
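One way to read the convention above: the stored length decides whether a spec is an exact name or a prefix. If the length includes the terminating NUL, only the exact name matches; if it does not, any device name starting with those bytes matches. The sketch below illustrates that with a bounded comparison. hook_spec_matches() is a hypothetical name, and this is an assumption about the matching rule rather than the nf_tables code.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical sketch: spec points at the name bytes as received and
 * spec_len is their count, including the trailing NUL only if user space
 * sent one. Comparing exactly spec_len bytes gives an exact match when the
 * NUL is included and a prefix match when it is not. */
bool hook_spec_matches(const char *spec, size_t spec_len, const char *ifname)
{
	assert(spec && ifname);
	return strncmp(spec, ifname, spec_len) == 0;
}
```

For example, the 4-byte spec "vlan" matches "vlan10", while the 5-byte spec "eth0" plus NUL matches "eth0" but not "eth01".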
2025-05-23 | netfilter: nf_tables: Sort labels in nft_netdev_hook_alloc() | Phil Sutter | 1 | -9/+7
No point in having err_hook_alloc, just call return directly. Also rename err_hook_dev - it's not about the hook's device but freeing the hook itself. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23 | netfilter: nf_tables: Handle NETDEV_CHANGENAME events | Phil Sutter | 2 | -18/+48
For the sake of simplicity, treat them like consecutive NETDEV_REGISTER and NETDEV_UNREGISTER events. If the new name matches a hook spec and registration fails, escalate the error and keep things as they are. To avoid unregistering the newly registered hook again during the following fake NETDEV_UNREGISTER event, leave hooks alone if their interface spec matches the new name. Note how this patch also skips for NETDEV_REGISTER if the device is already registered. This is not yet possible as the new name would have to match the old one. This will change with wildcard interface specs, though. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23 | netfilter: nf_tables: Wrap netdev notifiers | Phil Sutter | 2 | -26/+46
Handling NETDEV_CHANGENAME events has to traverse all chains/flowtables twice, prepare for this. No functional change intended. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23 | netfilter: nf_tables: Respect NETDEV_REGISTER events | Phil Sutter | 2 | -9/+60
Hook into new devices if their name matches the hook spec. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23 | netfilter: nf_tables: Prepare for handling NETDEV_REGISTER events | Phil Sutter | 2 | -14/+24
Put NETDEV_UNREGISTER handling code into a switch, no functional change intended as the function is only called for that event yet. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23 | netfilter: nf_tables: Have a list of nf_hook_ops in nft_hook | Phil Sutter | 3 | -62/+133
Supporting a 1:n relationship between nft_hook and nf_hook_ops is convenient since a chain's or flowtable's nft_hooks may remain in place despite matching interfaces disappearing. This stabilizes ruleset dumps in that regard and opens the possibility to claim newly added interfaces which match the spec. Also it prepares for wildcard interface specs since these will potentially match multiple interfaces. All spots dealing with hook registration are updated to handle a list of multiple nf_hook_ops, but nft_netdev_hook_alloc() only adds a single item for now to retain the old behaviour. The only expected functional change here is how vanishing interfaces are handled: Instead of dropping the respective nft_hook, only the matching nf_hook_ops are dropped. To safely remove individual ops from the list in netdev handlers, an rcu_head is added to struct nf_hook_ops so kfree_rcu() may be used. There is at least nft_flowtable_find_dev() which may be iterating through the list at the same time. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23 | netfilter: nf_tables: Pass nf_hook_ops to nft_unregister_flowtable_hook() | Phil Sutter | 1 | -11/+9
The function accesses only the hook's ops field, pass it directly. This prepares for nft_hooks holding a list of nf_hook_ops in future. While at it, make use of the function in __nft_unregister_flowtable_net_hooks() as well. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23 | netfilter: nf_tables: Introduce nft_register_flowtable_ops() | Phil Sutter | 1 | -11/+21
Facilitate binding and registering of a flowtable hook via a single function call. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23 | netfilter: nf_tables: Introduce nft_hook_find_ops{,_rcu}() | Phil Sutter | 4 | -5/+26
Also a pretty dull wrapper around the hook->ops.dev comparison for now. Will search the embedded nf_hook_ops list in future. The ugly cast to eliminate the const qualifier will vanish then, too. Since this future list will be RCU-protected, also introduce an _rcu() variant here. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23 | netfilter: nf_tables: Introduce functions freeing nft_hook objects | Phil Sutter | 1 | -14/+24
Pointless wrappers around kfree() for now, prep work for an embedded list of nf_hook_ops. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23 | netfilter: nf_tables: add packets conntrack state to debug trace info | Florian Westphal | 1 | -1/+53
Add the minimal relevant info needed for userspace ("nftables monitor trace") to provide the conntrack view of the packet:
- state (new, related, established)
- direction (original, reply)
- status (e.g., if connection is subject to dnat)
- id (allows querying ctnetlink for the remaining conntrack state info)

Example:
    trace id a62 inet filter PRE_RAW packet: iif "enp0s3" ether [..]
      [..]
    trace id a62 inet filter PRE_MANGLE conntrack: ct direction original ct state new ct id 32
    trace id a62 inet filter PRE_MANGLE packet: [..]
      [..]
    trace id a62 inet filter IN conntrack: ct direction original ct state new ct status dnat-done ct id 32
      [..]

In this case one can see that while NAT is active, the new connection isn't subject to a translation.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23 | netfilter: conntrack: make nf_conntrack_id callable without a module dependency | Florian Westphal | 1 | -0/+6
While nf_conntrack_id() doesn't need any functionality from conntrack, it does reside in nf_conntrack_core.c -- callers add a module dependency on conntrack. Followup patch will need to compute the conntrack id from nf_tables_trace.c to include it in nf_trace messages emitted to userspace via netlink. I don't want to introduce a module dependency between nf_tables and conntrack for this. Since trace is slowpath, the added indirection is ok. One alternative is to move nf_conntrack_id to the netfilter/core.c, but I don't see a compelling reason so far. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23 | netfilter: nf_dup_netdev: Move the recursion counter struct netdev_xmit | Sebastian Andrzej Siewior | 1 | -4/+18
nf_dup_skb_recursion is a per-CPU variable and relies on disabled BH for its locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT this data structure requires explicit locking. Move nf_dup_skb_recursion to struct netdev_xmit, provide wrappers. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23 | netfilter: nft_inner: Use nested-BH locking for nft_pcpu_tun_ctx | Sebastian Andrzej Siewior | 1 | -3/+15
nft_pcpu_tun_ctx is a per-CPU variable and relies on disabled BH for its locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT this data structure requires explicit locking. Make a struct with a nft_inner_tun_ctx member (original nft_pcpu_tun_ctx) and a local_lock_t and use local_lock_nested_bh() for locking. This change adds only lockdep coverage and does not alter the functional behaviour for !PREEMPT_RT. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23 | netfilter: nf_dup{4, 6}: Move duplication check to task_struct | Sebastian Andrzej Siewior | 5 | -11/+8
nf_skb_duplicated is a per-CPU variable and relies on disabled BH for its locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT this data structure requires explicit locking. Due to the recursion involved, the simplest change is to make it a per-task variable. Move the per-CPU variable nf_skb_duplicated to task_struct and name it in_nf_duplicate. Add it to the existing bitfield so it doesn't use additional memory. Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Valentin Schneider <vschneid@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23 | netfilter: nft_tunnel: fix geneve_opt dump | Fernando Fernandez Mancera | 1 | -4/+4
When dumping a nft_tunnel with more than one geneve_opt configured, the netlink attribute hierarchy should be as follows:

    NFTA_TUNNEL_KEY_OPTS
    |
    |--NFTA_TUNNEL_KEY_OPTS_GENEVE
    |  |
    |  |--NFTA_TUNNEL_KEY_GENEVE_CLASS
    |  |--NFTA_TUNNEL_KEY_GENEVE_TYPE
    |  |--NFTA_TUNNEL_KEY_GENEVE_DATA
    |
    |--NFTA_TUNNEL_KEY_OPTS_GENEVE
    |  |
    |  |--NFTA_TUNNEL_KEY_GENEVE_CLASS
    |  |--NFTA_TUNNEL_KEY_GENEVE_TYPE
    |  |--NFTA_TUNNEL_KEY_GENEVE_DATA
    |
    |--NFTA_TUNNEL_KEY_OPTS_GENEVE
    ...

Otherwise, userspace tools won't be able to fetch the configured geneve options correctly.
Fixes: 925d844696d9 ("netfilter: nft_tunnel: add support for geneve opts")
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23 | netfilter: nf_tables: nft_fib: consistent l3mdev handling | Florian Westphal | 2 | -5/+10
fib has two modes:
1. Obtain output device according to source or destination address
2. Obtain the type of the address, e.g. local, unicast, multicast.

'fib daddr type' should return 'local' if the address is configured in this netns or unicast otherwise. 'fib daddr . iif type' should return 'local' if the address is configured on the input interface or unicast otherwise, i.e. more restrictive.

However, if the interface is part of a VRF, then 'fib daddr type' returns unicast even if the address is configured on the incoming interface. This is broken for both ipv4 and ipv6.

In the ipv4 case, inet_dev_addr_type must only be used if the 'iif' or 'oif' (strict mode) was requested. Else inet_addr_type_dev_table() needs to be used and the correct dev argument must be passed as well so the correct fib (vrf) table is used.

In the ipv6 case, the bug is similar: without strict mode, dev is NULL so .flowi6_l3mdev will be set to 0.

Add a new 'nft_fib_l3mdev_master_ifindex_rcu()' helper and use that to init the .l3mdev structure member. For ipv6, use it from nft_fib6_flowi_init() which gets called from both the 'type' and the 'route' mode eval functions.

This provides consistent behaviour for all modes for both ipv4 and ipv6: If strict matching is requested, the input respectively output device of the netfilter hooks is used. Otherwise, use skb->dev to obtain the l3mdev ifindex.

Without this, most type checks in updated nft_fib.sh selftest fail:
    FAIL: did not find veth0 . 10.9.9.1 . local in fibtype4
    FAIL: did not find veth0 . dead:1::1 . local in fibtype6
    FAIL: did not find veth0 . dead:9::1 . local in fibtype6
    FAIL: did not find tvrf . 10.0.1.1 . local in fibtype4
    FAIL: did not find tvrf . 10.9.9.1 . local in fibtype4
    FAIL: did not find tvrf . dead:1::1 . local in fibtype6
    FAIL: did not find tvrf . dead:9::1 . local in fibtype6
    FAIL: fib expression address types match (iif in vrf)

(fib erroneously returns 'unicast' for all of them, even though all of these addresses are local to the vrf).
Fixes: f6d0cbcf09c5 ("netfilter: nf_tables: add fib expression")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23 | af_unix: Introduce SO_PASSRIGHTS. | Kuniyuki Iwashima | 2 | -2/+34
As long as recvmsg() or recvmmsg() is used with cmsg, it is not possible to avoid receiving file descriptors via SCM_RIGHTS. This behaviour has occasionally been flagged as problematic, as it can be (ab)used to trigger DoS during close(), for example, by passing a FUSE-controlled fd or a hung NFS fd. For instance, as noted on the uAPI Group page [0], an untrusted peer could send a file descriptor pointing to a hung NFS mount and then close it. Once the receiver calls recvmsg() with msg_control, the descriptor is automatically installed, and then the responsibility for the final close() now falls on the receiver, which may result in blocking the process for a long time. Regarding this, systemd calls cmsg_close_all() [1] after each recvmsg() to close() unwanted file descriptors sent via SCM_RIGHTS. However, this cannot work around the issue at all, because the final fput() may still occur on the receiver's side once sendmsg() with SCM_RIGHTS succeeds. Also, even filtering by LSM at recvmsg() does not work for the same reason. Thus, we need a better way to refuse SCM_RIGHTS at sendmsg(). Let's introduce SO_PASSRIGHTS to disable SCM_RIGHTS. Note that this option is enabled by default for backward compatibility. Link: https://uapi-group.org/kernel-features/#disabling-reception-of-scm_rights-for-af_unix-sockets #[0] Link: https://github.com/systemd/systemd/blob/v257.5/src/basic/fd-util.c#L612-L628 #[1] Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
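From user space, opting out of SCM_RIGHTS would presumably look like any other boolean socket option. The sketch below is written under stated assumptions: refuse_scm_rights() is a hypothetical helper, and the fallback value 83 for SO_PASSRIGHTS is assumed to be the asm-generic number (some architectures use different numbering), so check your kernel's uapi headers before relying on it. On kernels without the option, setsockopt() is expected to fail with ENOPROTOOPT.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>

#ifndef SO_PASSRIGHTS
/* Assumption: asm-generic value; verify against your kernel headers. */
#define SO_PASSRIGHTS 83
#endif

/* Hypothetical helper: ask the kernel to refuse SCM_RIGHTS on this socket.
 * Returns 0 on success or a negative errno value on failure. */
int refuse_scm_rights(int fd)
{
	int off = 0;

	assert(fd >= 0);
	if (setsockopt(fd, SOL_SOCKET, SO_PASSRIGHTS, &off, sizeof(off)) < 0)
		return -errno;
	return 0;
}
```

Since the option is described as enabled by default, only receivers that want to refuse descriptors need to call something like this, typically right after creating the socket.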
2025-05-23 | af_unix: Inherit sk_flags at connect(). | Kuniyuki Iwashima | 1 | -6/+6
For SOCK_STREAM embryo sockets, the SO_PASS{CRED,PIDFD,SEC} options are inherited from the parent listen()ing socket. Currently, this inheritance happens at accept(), because these attributes were stored in sk->sk_socket->flags and the struct socket is not allocated until accept(). This leads to unintentional behaviour. When a peer sends data to an embryo socket in the accept() queue, unix_maybe_add_creds() embeds credentials into the skb, even if neither the peer nor the listener has enabled these options. If the option is enabled, the embryo socket receives the ancillary data after accept(). If not, the data is silently discarded. This conservative approach works for SO_PASS{CRED,PIDFD,SEC}, but would not for SO_PASSRIGHTS; once an SCM_RIGHTS with a hung file descriptor was sent, it'd be game over. To avoid this, we will need to preserve SOCK_PASSRIGHTS even on embryo sockets. Commit aed6ecef55d7 ("af_unix: Save listener for embryo socket.") made it possible to access the parent's flags in sendmsg() via unix_sk(other)->listener->sk->sk_socket->flags, but this introduces an unnecessary condition that is irrelevant for most sockets, accept()ed sockets and clients. Therefore, we moved SOCK_PASSXXX into struct sock. Let’s inherit sk->sk_scm_recv_flags at connect() to avoid receiving SCM_RIGHTS on embryo sockets created from a parent with SO_PASSRIGHTS=0. Note that the parent socket is locked in connect() so we don't need READ_ONCE() for sk_scm_recv_flags. Now, we can remove !other->sk_socket check in unix_maybe_add_creds() to avoid slow SOCK_PASS{CRED,PIDFD} handling for embryo sockets created from a parent with SO_PASS{CRED,PIDFD}=0. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-05-23 | af_unix: Move SOCK_PASS{CRED,PIDFD,SEC} to struct sock. | Kuniyuki Iwashima | 3 | -52/+39
As explained in the next patch, SO_PASSRIGHTS would have a problem if we assigned a corresponding bit to socket->flags, so it must be managed in struct sock. Mixing socket->flags and sk->sk_flags for similar options will look confusing, and sk->sk_flags does not have enough space on 32-bit systems. Also, as mentioned in commit 16e572626961 ("af_unix: dont send SCM_CREDENTIALS by default"), SOCK_PASSCRED and SOCK_PASSPID handling is known to be slow, and managing the flags in struct socket cannot avoid that for embryo sockets. Let's move SOCK_PASS{CRED,PIDFD,SEC} to struct sock. While at it, other SOCK_XXX flags in net.h are grouped as enum. Note that assign_bit() was atomic, so the writer side is moved down after lock_sock() in setsockopt(), but the bit is only read once in sendmsg() and recvmsg(), so lock_sock() is not needed there. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-05-23 | net: Restrict SO_PASS{CRED,PIDFD,SEC} to AF_{UNIX,NETLINK,BLUETOOTH}. | Kuniyuki Iwashima | 1 | -0/+18
SCM_CREDENTIALS and SCM_SECURITY can be recv()ed by calling scm_recv() or scm_recv_unix(), and SCM_PIDFD is only used by scm_recv_unix(). scm_recv() is called from AF_NETLINK and AF_BLUETOOTH. scm_recv_unix() is literally called from AF_UNIX. Let's restrict SO_PASSCRED and SO_PASSSEC to such sockets and SO_PASSPIDFD to AF_UNIX only. Later, SOCK_PASS{CRED,PIDFD,SEC} will be moved to struct sock and united with another field. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-05-23tcp: Restrict SO_TXREHASH to TCP socket.Kuniyuki Iwashima1-0/+5
sk->sk_txrehash is only used for TCP. Let's restrict SO_TXREHASH to TCP to reflect this. Later, we will make sk_txrehash a part of the union for other protocol families. Note that we need to modify the BPF selftest not to get/set SO_TXREHASH for non-TCP sockets. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-05-23scm: Move scm_recv() from scm.h to scm.c.Kuniyuki Iwashima1-0/+123
scm_recv() has been placed in scm.h since the pre-git era for no particular reason (I think), which makes the file really fragile. For example, when you move SOCK_PASSCRED from include/linux/net.h to enum sock_flags in include/net/sock.h, you will see weird build failures due to the tangled header dependencies. To avoid such build failures in the future, let's move scm_recv(), scm_recv_unix(), and their callees to scm.c. Note that only scm_recv() needs to be exported for Bluetooth. scm_send() should be moved to scm.c too, but I'll revisit later. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-05-23af_unix: Don't pass struct socket to maybe_add_creds().Kuniyuki Iwashima1-11/+12
We will move SOCK_PASS{CRED,PIDFD,SEC} from struct socket.flags to struct sock for better handling with SOCK_PASSRIGHTS. Then, we don't need to access struct socket in maybe_add_creds(). Let's pass struct sock to maybe_add_creds() and its caller queue_oob(). While at it, we append the unix_ prefix and fix double spaces around the pid assignment. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-05-23af_unix: Factorise test_bit() for SOCK_PASSCRED and SOCK_PASSPIDFD.Kuniyuki Iwashima1-22/+15
Currently, the same checks for SOCK_PASSCRED and SOCK_PASSPIDFD are scattered across many places. Let's centralise the bit tests to make the following changes cleaner. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
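Centralising the bit tests can be pictured as replacing repeated pairs of test_bit() calls with one helper. The sketch below is a userspace analogue only: toy_test_bit(), toy_may_passcred() and the TOY_BIT_* numbers are hypothetical, the bit positions are chosen arbitrarily, and the kernel's test_bit() operates on a pointer to a bitmap rather than on a value.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical, arbitrarily chosen bit numbers for the two options. */
enum { TOY_BIT_PASSCRED = 3, TOY_BIT_PASSPIDFD = 5 };

/* Non-atomic, by-value userspace analogue of a bit test. */
static bool toy_test_bit(unsigned int nr, unsigned long flags)
{
	assert(nr < sizeof(flags) * CHAR_BIT);

	return (flags >> nr) & 1UL;
}

/* Single place that answers "does this socket want credentials attached?" */
static bool toy_may_passcred(unsigned long flags)
{
	return toy_test_bit(TOY_BIT_PASSCRED, flags) ||
	       toy_test_bit(TOY_BIT_PASSPIDFD, flags);
}
```

Callers then ask toy_may_passcred(flags) once instead of repeating both tests, so a later change to where the flags live touches only the helper.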
2025-05-22Merge tag 'wireless-next-2025-05-22' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-nextJakub Kicinski5-11/+32
Johannes Berg says: ==================== Lots of new things, notably: * ath12k: monitor mode for WCN7850, better 6 GHz regulatory * brcmfmac: SAE for some Cypress devices * iwlwifi: rework device configuration * mac80211: scan improvements with MLO * mt76: EHT improvements, new device IDs * rtw88: throughput improvements * rtw89: MLO, STA/P2P concurrency improvements, SAR * tag 'wireless-next-2025-05-22' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (389 commits) wifi: mt76: mt7925: add rfkill_poll for hardware rfkill wifi: mt76: support power delta calculation for 5 TX paths wifi: mt76: fix available_antennas setting wifi: mt76: mt7996: fix RX buffer size of MCU event wifi: mt76: mt7996: change max beacon size wifi: mt76: mt7996: fix invalid NSS setting when TX path differs from NSS wifi: mt76: mt7996: drop fragments with multicast or broadcast RA wifi: mt76: mt7996: set EHT max ampdu length capability wifi: mt76: mt7996: fix beamformee SS field wifi: mt76: remove capability of partial bandwidth UL MU-MIMO wifi: mt76: mt7925: add test mode support wifi: mt76: mt7925: extend MCU support for testmode wifi: mt76: mt7925: ensure all MCU commands wait for response wifi: mt76: mt7925: refine the sniffer commnad wifi: mt76: mt7925: prevent multiple scan commands wifi: mt76: mt7915: Fix null-ptr-deref in mt7915_mmio_wed_init() wifi: mt76: mt7996: Fix null-ptr-deref in mt7996_mmio_wed_init() wifi: mt76: mt7925: add RNR scan support for 6GHz wifi: mt76: add mt76_connac_mcu_build_rnr_scan_param routine wifi: mt76: scan: Fix 'mlink' dereferenced before IS_ERR_OR_NULL check ... ==================== Link: https://patch.msgid.link/20250522165501.189958-50-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-22Merge tag 'for-net-next-2025-05-22' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-nextJakub Kicinski11-66/+403
Luiz Augusto von Dentz says: ==================== bluetooth-next pull request for net-next: core: - Add support for SIOCETHTOOL ETHTOOL_GET_TS_INFO - Separate CIS_LINK and BIS_LINK link types - Introduce HCI Driver protocol drivers: - btintel_pcie: Do not generate coredump for diagnostic events - btusb: Add HCI Drv commands for configuring altsetting - btusb: Add RTL8851BE device 0x0bda:0xb850 - btusb: Add new VID/PID 13d3/3584 for MT7922 - btusb: Add new VID/PID 13d3/3630 and 13d3/3613 for MT7925 - btnxpuart: Implement host-wakeup feature * tag 'for-net-next-2025-05-22' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next: (23 commits) Bluetooth: btintel: Check dsbr size from EFI variable Bluetooth: MGMT: iterate over mesh commands in mgmt_mesh_foreach() Bluetooth: btusb: Add new VID/PID 13d3/3584 for MT7922 Bluetooth: btusb: use skb_pull to avoid unsafe access in QCA dump handling Bluetooth: L2CAP: Fix not checking l2cap_chan security level Bluetooth: separate CIS_LINK and BIS_LINK link types Bluetooth: btusb: Add new VID/PID 13d3/3630 for MT7925 Bluetooth: add support for SIOCETHTOOL ETHTOOL_GET_TS_INFO Bluetooth: btintel_pcie: Dump debug registers on error Bluetooth: ISO: Fix getpeername not returning sockaddr_iso_bc fields Bluetooth: ISO: Fix not using SID from adv report Revert "Bluetooth: btusb: add sysfs attribute to control USB alt setting" Revert "Bluetooth: btusb: Configure altsetting for HCI_USER_CHANNEL" Bluetooth: btusb: Add HCI Drv commands for configuring altsetting Bluetooth: Introduce HCI Driver protocol Bluetooth: btnxpuart: Implement host-wakeup feature dt-bindings: net: bluetooth: nxp: Add support for host-wakeup Bluetooth: btusb: Add RTL8851BE device 0x0bda:0xb850 Bluetooth: hci_uart: Remove unnecessary NULL check before release_firmware() Bluetooth: btmtksdio: Fix wakeup source leaks on device unbind ... 
==================== Link: https://patch.msgid.link/20250522171048.3307873-1-luiz.dentz@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-22Bluetooth: MGMT: iterate over mesh commands in mgmt_mesh_foreach()Dmitry Antipov1-1/+1
In 'mgmt_mesh_foreach()', iterate over mesh commands rather than generic mgmt ones. Compile tested only. Fixes: b338d91703fa ("Bluetooth: Implement support for Mesh") Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-05-22Bluetooth: L2CAP: Fix not checking l2cap_chan security levelLuiz Augusto von Dentz1-7/+8
l2cap_check_enc_key_size shall check the security level of the l2cap_chan rather than the hci_conn, since for an incoming connection request the two may differ: the hci_conn may already have been encrypted using a different security level. Fixes: 522e9ed157e3 ("Bluetooth: l2cap: Check encryption key size on incoming connection") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-05-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski19-185/+139
Cross-merge networking fixes after downstream PR (net-6.15-rc8). Conflicts: 80f2ab46c2ee ("irdma: free iwdev->rf after removing MSI-X") 4bcc063939a5 ("ice, irdma: fix an off by one in error handling code") c24a65b6a27c ("iidc/ice/irdma: Update IDC to support multiple consumers") https://lore.kernel.org/20250513130630.280ee6c5@canb.auug.org.au No extra adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-22netfilter: nf_tables: nft_fib_ipv6: fix VRF ipv4/ipv6 result discrepancyFlorian Westphal1-4/+9
With a VRF, the ipv4 and ipv6 FIB expressions behave differently. "fib daddr . iif oif" will return the input interface name for ipv4, but the real device for ipv6. Example: the VRF device name is tvrf and the real (incoming) device is veth0. The first round is ok: both ipv4 and ipv6 yield 'veth0'. But in the second round (where the incoming device is set to "tvrf"), ipv4 yields "tvrf" whereas ipv6 still returns "veth0". This change makes ipv6 behave like ipv4. A followup patch will add a test case for this; without this change it will fail with: get element inet t fibif6iif { tvrf . dead:1::99 . tvrf } FAIL: did not find tvrf . dead:1::99 . tvrf in fibif6iif Alternatively we could either not do anything at all or change ipv4 to also return the lower/real device; however, the nft (userspace) doc says "iif: if fib lookup provides a route then check its output interface is identical to the packets input interface.", which is what the nft fib ipv4 behaviour is. Fixes: f6d0cbcf09c5 ("netfilter: nf_tables: add fib expression") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-22netfilter: xtables: support arpt_mark and ipv6 optstrip for iptables-nft only buildsFlorian Westphal2-3/+3
It's now possible to build a kernel that has no support for the classic xtables get/setsockopt interfaces and builtin tables. In this case, we have CONFIG_IP6_NF_MANGLE=n and CONFIG_IP_NF_ARPTABLES=n. For optstrip, the ipv6 code is so small that we can enable it if netfilter ipv6 support exists. For mark, check if either classic arptables or NFT_ARP_COMPAT is set. Fixes: a9525c7f6219 ("netfilter: xtables: allow xtables-nft only builds") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>