aboutsummaryrefslogtreecommitdiffstats
path: root/net/core/net_namespace.c (follow)
AgeCommit message (Collapse)AuthorFilesLines
2025-09-29Merge tag 'namespace-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfsLinus Torvalds1-32/+26
Pull namespace updates from Christian Brauner: "This contains a larger set of changes around the generic namespace infrastructure of the kernel. Each specific namespace type (net, cgroup, mnt, ...) embedds a struct ns_common which carries the reference count of the namespace and so on. We open-coded and cargo-culted so many quirks for each namespace type that it just wasn't scalable anymore. So given there's a bunch of new changes coming in that area I've started cleaning all of this up. The core change is to make it possible to correctly initialize every namespace uniformly and derive the correct initialization settings from the type of the namespace such as namespace operations, namespace type and so on. This leaves the new ns_common_init() function with a single parameter which is the specific namespace type which derives the correct parameters statically. This also means the compiler will yell as soon as someone does something remotely fishy. The ns_common_init() addition also allows us to remove ns_alloc_inum() and drops any special-casing of the initial network namespace in the network namespace initialization code that Linus complained about. Another part is reworking the reference counting. The reference counting was open-coded and copy-pasted for each namespace type even though they all followed the same rules. This also removes all open accesses to the reference count and makes it private and only uses a very small set of dedicated helpers to manipulate them just like we do for e.g., files. In addition this generalizes the mount namespace iteration infrastructure introduced a few cycles ago. As reminder, the vfs makes it possible to iterate sequentially and bidirectionally through all mount namespaces on the system or all mount namespaces that the caller holds privilege over. This allow userspace to iterate over all mounts in all mount namespaces using the listmount() and statmount() system call. Each mount namespace has a unique identifier for the lifetime of the systems that is exposed to userspace. The network namespace also has a unique identifier working exactly the same way. This extends the concept to all other namespace types. The new nstree type makes it possible to lookup namespaces purely by their identifier and to walk the namespace list sequentially and bidirectionally for all namespace types, allowing userspace to iterate through all namespaces. Looking up namespaces in the namespace tree works completely locklessly. This also means we can move the mount namespace onto the generic infrastructure and remove a bunch of code and members from struct mnt_namespace itself. There's a bunch of stuff coming on top of this in the future but for now this uses the generic namespace tree to extend a concept introduced first for pidfs a few cycles ago. For a while now we have supported pidfs file handles for pidfds. This has proven to be very useful. This extends the concept to cover namespaces as well. It is possible to encode and decode namespace file handles using the common name_to_handle_at() and open_by_handle_at() apis. As with pidfs file handles, namespace file handles are exhaustive, meaning it is not required to actually hold a reference to nsfs in able to decode aka open_by_handle_at() a namespace file handle. Instead the FD_NSFS_ROOT constant can be passed which will let the kernel grab a reference to the root of nsfs internally and thus decode the file handle. Namespaces file descriptors can already be derived from pidfds which means they aren't subject to overmount protection bugs. IOW, it's irrelevant if the caller would not have access to an appropriate /proc/<pid>/ns/ directory as they could always just derive the namespace based on a pidfd already. It has the same advantage as pidfds. It's possible to reliably and for the lifetime of the system refer to a namespace without pinning any resources and to compare them trivially. Permission checking is kept simple. If the caller is located in the namespace the file handle refers to they are able to open it otherwise they must hold privilege over the owning namespace of the relevant namespace. The namespace file handle layout is exposed as uapi and has a stable and extensible format. For now it simply contains the namespace identifier, the namespace type, and the inode number. The stable format means that userspace may construct its own namespace file handles without going through name_to_handle_at() as they are already allowed for pidfs and cgroup file handles" * tag 'namespace-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (65 commits) ns: drop assert ns: move ns type into struct ns_common nstree: make struct ns_tree private ns: add ns_debug() ns: simplify ns_common_init() further cgroup: add missing ns_common include ns: use inode initializer for initial namespaces selftests/namespaces: verify initial namespace inode numbers ns: rename to __ns_ref nsfs: port to ns_ref_*() helpers net: port to ns_ref_*() helpers uts: port to ns_ref_*() helpers ipv4: use check_net() net: use check_net() net-sysfs: use check_net() user: port to ns_ref_*() helpers time: port to ns_ref_*() helpers pid: port to ns_ref_*() helpers ipc: port to ns_ref_*() helpers cgroup: port to ns_ref_*() helpers ...
2025-09-25ns: move ns type into struct ns_commonChristian Brauner1-1/+0
It's misplaced in struct proc_ns_operations and ns->ops might be NULL if the namespace is compiled out but we still want to know the type of the namespace for the initial namespace struct. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-22ns: simplify ns_common_init() furtherChristian Brauner1-8/+1
Simply derive the ns operations from the namespace type. Acked-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-19net: use check_net()Christian Brauner1-1/+1
Don't directly acess the namespace count. There's even a dedicated helper for this. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-19ns: add ns_common_free()Christian Brauner1-2/+2
And drop ns_free_inum(). Anything common that can be wasted centrally should be wasted in the new common helper. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-19nscommon: simplify initializationChristian Brauner1-1/+1
There's a lot of information that namespace implementers don't need to know about at all. Encapsulate this all in the initialization helper. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-19net: centralize ns_common initializationChristian Brauner1-20/+3
Centralize ns_common initialization. Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-19ns: add to_<type>_ns() to respective headersChristian Brauner1-5/+0
Every namespace type has a container_of(ns, <ns_type>, ns) static inline function that is currently not exposed in the header. So we have a bunch of places that open-code it via container_of(). Move it to the headers so we can use it directly. Reviewed-by: Aleksa Sarai <cyphar@cyphar.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-19net: support ns lookupChristian Brauner1-2/+6
Support the generic ns lookup infrastructure to support file handles for namespaces. The network namespace has a separate list with different lifetime rules which we can just leave in tact. We have a similar concept for mount namespaces as well where it is on two differenet lists for different purposes. Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-19net: use ns_common_init()Christian Brauner1-13/+33
Don't cargo-cult the same thing over and over. Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-01copy_process: pass clone_flags as u64 across calltreeSimon Schuster1-1/+1
With the introduction of clone3 in commit 7f192e3cd316 ("fork: add clone3") the effective bit width of clone_flags on all architectures was increased from 32-bit to 64-bit, with a new type of u64 for the flags. However, for most consumers of clone_flags the interface was not changed from the previous type of unsigned long. While this works fine as long as none of the new 64-bit flag bits (CLONE_CLEAR_SIGHAND and CLONE_INTO_CGROUP) are evaluated, this is still undesirable in terms of the principle of least surprise. Thus, this commit fixes all relevant interfaces of callees to sys_clone3/copy_process (excluding the architecture-specific copy_thread) to consistently pass clone_flags as u64, so that no truncation to 32-bit integers occurs on 32-bit architectures. Signed-off-by: Simon Schuster <schuster.simon@siemens-energy.com> Link: https://lore.kernel.org/20250901-nios2-implement-clone3-v2-2-53fcf5577d57@siemens-energy.com Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-30Merge tag 'net-next-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-nextLinus Torvalds1-23/+45
Pull networking updates from Jakub Kicinski: "Core & protocols: - Wrap datapath globals into net_aligned_data, to avoid false sharing - Preserve MSG_ZEROCOPY in forwarding (e.g. out of a container) - Add SO_INQ and SCM_INQ support to AF_UNIX - Add SIOCINQ support to AF_VSOCK - Add TCP_MAXSEG sockopt to MPTCP - Add IPv6 force_forwarding sysctl to enable forwarding per interface - Make TCP validation of whether packet fully fits in the receive window and the rcv_buf more strict. With increased use of HW aggregation a single "packet" can be multiple 100s of kB - Add MSG_MORE flag to optimize large TCP transmissions via sockmap, improves latency up to 33% for sockmap users - Convert TCP send queue handling from tasklet to BH workque - Improve BPF iteration over TCP sockets to see each socket exactly once - Remove obsolete and unused TCP RFC3517/RFC6675 loss recovery code - Support enabling kernel threads for NAPI processing on per-NAPI instance basis rather than a whole device. Fully stop the kernel NAPI thread when threaded NAPI gets disabled. Previously thread would stick around until ifdown due to tricky synchronization - Allow multicast routing to take effect on locally-generated packets - Add output interface argument for End.X in segment routing - MCTP: add support for gateway routing, improve bind() handling - Don't require rtnl_lock when fetching an IPv6 neighbor over Netlink - Add a new neighbor flag ("extern_valid"), which cedes refresh responsibilities to userspace. This is needed for EVPN multi-homing where a neighbor entry for a multi-homed host needs to be synced across all the VTEPs among which the host is multi-homed - Support NUD_PERMANENT for proxy neighbor entries - Add a new queuing discipline for IETF RFC9332 DualQ Coupled AQM - Add sequence numbers to netconsole messages. Unregister netconsole's console when all net targets are removed. Code refactoring. Add a number of selftests - Align IPSec inbound SA lookup to RFC 4301. Only SPI and protocol should be used for an inbound SA lookup - Support inspecting ref_tracker state via DebugFS - Don't force bonding advertisement frames tx to ~333 ms boundaries. Add broadcast_neighbor option to send ARP/ND on all bonded links - Allow providing upcall pid for the 'execute' command in openvswitch - Remove DCCP support from Netfilter's conntrack - Disallow multiple packet duplications in the queuing layer - Prevent use of deprecated iptables code on PREEMPT_RT Driver API: - Support RSS and hashing configuration over ethtool Netlink - Add dedicated ethtool callbacks for getting and setting hashing fields - Add support for power budget evaluation strategy in PSE / Power-over-Ethernet. Generate Netlink events for overcurrent etc - Support DPLL phase offset monitoring across all device inputs. Support providing clock reference and SYNC over separate DPLL inputs - Support traffic classes in devlink rate API for bandwidth management - Remove rtnl_lock dependency from UDP tunnel port configuration Device drivers: - Add a new Broadcom driver for 800G Ethernet (bnge) - Add a standalone driver for Microchip ZL3073x DPLL - Remove IBM's NETIUCV device driver - Ethernet high-speed NICs: - Broadcom (bnxt): - support zero-copy Tx of DMABUF memory - take page size into account for page pool recycling rings - Intel (100G, ice, idpf): - idpf: XDP and AF_XDP support preparations - idpf: add flow steering - add link_down_events statistic - clean up the TSPLL code - preparations for live VM migration - nVidia/Mellanox: - support zero-copy Rx/Tx interfaces (DMABUF and io_uring) - optimize context memory usage for matchers - expose serial numbers in devlink info - support PCIe congestion metrics - Meta (fbnic): - add 25G, 50G, and 100G link modes to phylink - support dumping FW logs - Marvell/Cavium: - support for CN20K generation of the Octeon chips - Amazon: - add HW clock (without timestamping, just hypervisor time access) - Ethernet virtual: - VirtIO net: - support segmentation of UDP-tunnel-encapsulated packets - Google (gve): - support packet timestamping and clock synchronization - Microsoft vNIC: - add handler for device-originated servicing events - allow dynamic MSI-X vector allocation - support Tx bandwidth clamping - Ethernet NICs consumer, and embedded: - AMD: - amd-xgbe: hardware timestamping and PTP clock support - Broadcom integrated MACs (bcmgenet, bcmasp): - use napi_complete_done() return value to support NAPI polling - add support for re-starting auto-negotiation - Broadcom switches (b53): - support BCM5325 switches - add bcm63xx EPHY power control - Synopsys (stmmac): - lots of code refactoring and cleanups - TI: - icssg-prueth: read firmware-names from device tree - icssg: PRP offload support - Microchip: - lan78xx: convert to PHYLINK for improved PHY and MAC management - ksz: add KSZ8463 switch support - Intel: - support similar queue priority scheme in multi-queue and time-sensitive networking (taprio) - support packet pre-emption in both - RealTek (r8169): - enable EEE at 5Gbps on RTL8126 - Airoha: - add PPPoE offload support - MDIO bus controller for Airoha AN7583 - Ethernet PHYs: - support for the IPQ5018 internal GE PHY - micrel KSZ9477 switch-integrated PHYs: - add MDI/MDI-X control support - add RX error counters - add cable test support - add Signal Quality Indicator (SQI) reporting - dp83tg720: improve reset handling and reduce link recovery time - support bcm54811 (and its MII-Lite interface type) - air_en8811h: support resume/suspend - support PHY counters for QCA807x and QCA808x - support WoL for QCA807x - CAN drivers: - rcar_canfd: support for Transceiver Delay Compensation - kvaser: report FW versions via devlink dev info - WiFi: - extended regulatory info support (6 GHz) - add statistics and beacon monitor for Multi-Link Operation (MLO) - support S1G aggregation, improve S1G support - add Radio Measurement action fields - support per-radio RTS threshold - some work around how FIPS affects wifi, which was wrong (RC4 is used by TKIP, not only WEP) - improvements for unsolicited probe response handling - WiFi drivers: - RealTek (rtw88): - IBSS mode for SDIO devices - RealTek (rtw89): - BT coexistence for MLO/WiFi7 - concurrent station + P2P support - support for USB devices RTL8851BU/RTL8852BU - Intel (iwlwifi): - use embedded PNVM in (to be released) FW images to fix compatibility issues - many cleanups (unused FW APIs, PCIe code, WoWLAN) - some FIPS interoperability - MediaTek (mt76): - firmware recovery improvements - more MLO work - Qualcomm/Atheros (ath12k): - fix scan on multi-radio devices - more EHT/Wi-Fi 7 features - encapsulation/decapsulation offload - Broadcom (brcm80211): - support SDIO 43751 device - Bluetooth: - hci_event: add support for handling LE BIG Sync Lost event - ISO: add socket option to report packet seqnum via CMSG - ISO: support SCM_TIMESTAMPING for ISO TS - Bluetooth drivers: - intel_pcie: support Function Level Reset - nxpuart: add support for 4M baudrate - nxpuart: implement powerup sequence, reset, FW dump, and FW loading" * tag 'net-next-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1742 commits) dpll: zl3073x: Fix build failure selftests: bpf: fix legacy netfilter options ipv6: annotate data-races around rt->fib6_nsiblings ipv6: fix possible infinite loop in fib6_info_uses_dev() ipv6: prevent infinite loop in rt6_nlmsg_size() ipv6: add a retry logic in net6_rt_notify() vrf: Drop existing dst reference in vrf_ip6_input_dst net/sched: taprio: align entry index attr validation with mqprio net: fsl_pq_mdio: use dev_err_probe selftests: rtnetlink.sh: remove esp4_offload after test vsock: remove unnecessary null check in vsock_getname() igb: xsk: solve negative overflow of nb_pkts in zerocopy mode stmmac: xsk: fix negative overflow of budget in zerocopy mode dt-bindings: ieee802154: Convert at86rf230.txt yaml format net: dsa: microchip: Disable PTP function of KSZ8463 net: dsa: microchip: Setup fiber ports for KSZ8463 net: dsa: microchip: Write switch MAC address differently for KSZ8463 net: dsa: microchip: Use different registers for KSZ8463 net: dsa: microchip: Add KSZ8463 switch support to KSZ DSA driver dt-bindings: net: dsa: microchip: Add KSZ8463 switch support ...
2025-07-02net: move net_cookie into net_aligned_dataEric Dumazet1-6/+2
Using per-cpu data for net->net_cookie generation is overkill, because even busy hosts do not create hundreds of netns per second. Make sure to put net_cookie in a private cache line to avoid potential false sharing. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250630093540.3052835-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-30net: net->nsid_lock does not need BH safetyEric Dumazet1-11/+11
At the time of commit bc51dddf98c9 ("netns: avoid disabling irq for netns id") peernet2id() was not yet using RCU. Commit 2dce224f469f ("netns: protect netns ID lookups with RCU") changed peernet2id() to no longer acquire net->nsid_lock (potentially from BH context). We do not need to block soft interrupts when acquiring net->nsid_lock anymore. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Guillaume Nault <gnault@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250627163242.230866-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-19ref_tracker: eliminate the ref_tracker_dir name fieldJeff Layton1-2/+2
Now that we have dentries and the ability to create meaningful symlinks to them, don't keep a name string in each tracker. Switch the output format to print "class@address", and drop the name field. Also, add a kerneldoc header for ref_tracker_dir_init(). Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://patch.msgid.link/20250618-reftrack-dbgfs-v15-9-24fc37ead144@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-19net: add symlinks to ref_tracker_dir for netnsJeff Layton1-1/+29
After assigning the inode number to the namespace, use it to create a unique name for each netns refcount tracker with the ns.inum and net_cookie values in it, and register a symlink to the debugfs file for it. init_net is registered before the ref_tracker dir is created, so add a late_initcall() to register its files and symlinks. Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://patch.msgid.link/20250618-reftrack-dbgfs-v15-8-24fc37ead144@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-19ref_tracker: add a static classname string to each ref_tracker_dirJeff Layton1-2/+2
A later patch in the series will be adding debugfs files for each ref_tracker that get created in ref_tracker_dir_init(). The format will be "class@%px". The current "name" string can vary between ref_tracker_dir objects of the same type, so it's not suitable for this purpose. Add a new "class" string to the ref_tracker dir that describes the the type of object (sans any individual info for that object). Also, in the i915 driver, gate the creation of debugfs files on whether the dentry pointer is still set to NULL. CI has shown that the ref_tracker_dir can be initialized more than once. Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://patch.msgid.link/20250618-reftrack-dbgfs-v15-4-24fc37ead144@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-11netns: use stable inode number for initial mount nsChristian Brauner1-0/+8
Apart from the network and mount namespace all other namespaces expose a stable inode number and userspace has been relying on that for a very long time now. It's very much heavily used API. Align the network namespace and use a stable inode number from the reserved procfs inode number space so this is consistent across all namespaces. Link: https://lore.kernel.org/20250606-work-nsfs-v1-2-b8749c9a8844@kernel.org Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-05net: annotate data-races around cleanup_net_taskEric Dumazet1-2/+2
from_cleanup_net() reads cleanup_net_task locklessly. Add READ_ONCE()/WRITE_ONCE() annotations to avoid a potential KCSAN warning, even if the race is harmless. Fixes: 0734d7c3d93c ("net: expedite synchronize_net() for cleanup_net()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Link: https://patch.msgid.link/20250604093928.1323333-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-23net: Fix wild-memory-access in __register_pernet_operations() when CONFIG_NET_NS=n.Kuniyuki Iwashima1-6/+6
kernel test robot reported the splat below. [0] Before commit fed176bf3143 ("net: Add ops_undo_single for module load/unload."), if CONFIG_NET_NS=n, ops was linked to pernet_list only when init_net had not been initialised, and ops was unlinked from pernet_list only under the same condition. Let's say an ops is loaded before the init_net setup but unloaded after that. Then, the ops remains in pernet_list, which seems odd. The cited commit added ops_undo_single(), which calls list_add() for ops to link it to a temporary list, so a minor change was added to __register_pernet_operations() and __unregister_pernet_operations() under CONFIG_NET_NS=n to avoid the pernet_list corruption. However, the corruption must have been left as is. When CONFIG_NET_NS=n, pernet_list was used to keep ops registered before the init_net setup, and after that, pernet_list was not used at all. This was because some ops annotated with __net_initdata are cleared out of memory at some point during boot. Then, such ops is initialised by POISON_FREE_INITMEM (0xcc), resulting in that ops->list.{next,prev} suddenly switches from a valid pointer to a weird value, 0xcccccccccccccccc. To avoid such wild memory access, let's allow the pernet_list corruption for CONFIG_NET_NS=n. [0]: Oops: general protection fault, probably for non-canonical address 0xf999959999999999: 0000 [#1] SMP KASAN NOPTI KASAN: maybe wild-memory-access in range [0xccccccccccccccc8-0xcccccccccccccccf] CPU: 2 UID: 0 PID: 346 Comm: modprobe Not tainted 6.15.0-rc1-00294-ga4cba7e98e35 #85 PREEMPT(voluntary) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 RIP: 0010:__list_add_valid_or_report (lib/list_debug.c:32) Code: 48 c1 ea 03 80 3c 02 00 0f 85 5a 01 00 00 49 39 74 24 08 0f 85 83 00 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 1f 01 00 00 4c 39 26 0f 85 ab 00 00 00 4c 39 ee RSP: 0018:ff11000135b87830 EFLAGS: 00010a07 RAX: dffffc0000000000 RBX: ffffffffc02223c0 RCX: ffffffff8406fcc2 RDX: 1999999999999999 RSI: cccccccccccccccc RDI: ffffffffc02223c0 RBP: ffffffff86064e40 R08: 0000000000000001 R09: fffffbfff0a9f5b5 R10: ffffffff854fadaf R11: 676552203a54454e R12: ffffffff86064e40 R13: ffffffffc02223c0 R14: ffffffff86064e48 R15: 0000000000000021 FS: 00007f6fb0d9e1c0(0000) GS:ff11000858ea0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f6fb0eda580 CR3: 0000000122fec005 CR4: 0000000000771ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> register_pernet_operations (./include/linux/list.h:150 (discriminator 5) ./include/linux/list.h:183 (discriminator 5) net/core/net_namespace.c:1315 (discriminator 5) net/core/net_namespace.c:1359 (discriminator 5)) register_pernet_subsys (net/core/net_namespace.c:1401) inet6_init (net/ipv6/af_inet6.c:535) ipv6 do_one_initcall (init/main.c:1257) do_init_module (kernel/module/main.c:2942) load_module (kernel/module/main.c:3409) init_module_from_file (kernel/module/main.c:3599) idempotent_init_module (kernel/module/main.c:3611) __x64_sys_finit_module (./include/linux/file.h:62 ./include/linux/file.h:83 kernel/module/main.c:3634 kernel/module/main.c:3621 kernel/module/main.c:3621) do_syscall_64 (arch/x86/entry/syscall_64.c:63 arch/x86/entry/syscall_64.c:94) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) RIP: 0033:0x7f6fb0df7e5d Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 73 9f 1b 00 f7 d8 64 89 01 48 RSP: 002b:00007fffdc6a8968 EFLAGS: 00000246 ORIG_RAX: 0000000000000139 RAX: ffffffffffffffda RBX: 000055b535721b70 RCX: 00007f6fb0df7e5d RDX: 0000000000000000 RSI: 000055b51e44aa2a RDI: 0000000000000004 RBP: 0000000000040000 R08: 0000000000000000 R09: 000055b535721b30 R10: 0000000000000004 R11: 0000000000000246 R12: 000055b51e44aa2a R13: 000055b535721bf0 R14: 000055b5357220b0 R15: 0000000000000000 </TASK> Modules linked in: ipv6(+) crc_ccitt Fixes: fed176bf3143 ("net: Add ops_undo_single for module load/unload.") Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202504181511.1c3f23e4-lkp@intel.com Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250418215025.87871-1-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-22net: Drop hold_rtnl arg from ops_undo_list().Kuniyuki Iwashima1-6/+8
ops_undo_list() first iterates over ops_list for ->pre_exit(). Let's check if any of the ops has ->exit_rtnl() there and drop the hold_rtnl argument. Note that nexthop uses ->exit_rtnl() and is built-in, so hold_rtnl is always true for setup_net() and cleanup_net() for now. Suggested-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/netdev/20250414170148.21f3523c@kernel.org/ Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250418003259.48017-2-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14net: Remove ->exit_batch_rtnl().Kuniyuki Iwashima1-7/+1
There are no ->exit_batch_rtnl() users remaining. Let's remove the hook. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20250411205258.63164-15-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14net: Add ->exit_rtnl() hook to struct pernet_operations.Kuniyuki Iwashima1-14/+39
struct pernet_operations provides two batching hooks; ->exit_batch() and ->exit_batch_rtnl(). The batching variant is beneficial if ->exit() meets any of the following conditions: 1) ->exit() repeatedly acquires a global lock for each netns 2) ->exit() has a time-consuming operation that can be factored out (e.g. synchronize_rcu(), smp_mb(), etc) 3) ->exit() does not need to repeat the same iterations for each netns (e.g. inet_twsk_purge()) Currently, none of the ->exit_batch_rtnl() functions satisfy any of the above conditions because RTNL is factored out and held by the caller and all of these functions iterate over the dying netns list. Also, we want to hold per-netns RTNL there but avoid spreading __rtnl_net_lock() across multiple locations. Let's add ->exit_rtnl() hook and run it under __rtnl_net_lock(). The following patches will convert all ->exit_batch_rtnl() users to ->exit_rtnl(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20250411205258.63164-4-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14net: Add ops_undo_single for module load/unload.Kuniyuki Iwashima1-30/+24
If ops_init() fails while loading a module or we unload the module, free_exit_list() rolls back the changes. The rollback sequence is the same as ops_undo_list(). The ops is already removed from pernet_list before calling free_exit_list(). If we link the ops to a temporary list, we can reuse ops_undo_list(). Let's add a wrapper of ops_undo_list() and use it instead of free_exit_list(). Now, we have the central place to roll back ops_init(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20250411205258.63164-3-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14net: Factorise setup_net() and cleanup_net().Kuniyuki Iwashima1-55/+51
When we roll back the changes made by struct pernet_operations.init(), we execute mostly identical sequences in three places. * setup_net() * cleanup_net() * free_exit_list() The only difference between the first two is which list and RCU helpers to use. In setup_net(), an ops could fail on the way, so we need to perform a reverse walk from its previous ops in pernet_list. OTOH, in cleanup_net(), we iterate the full list from tail to head. The former passes the failed ops to list_for_each_entry_continue_reverse(). It's tricky, but we can reuse it for the latter if we pass list_entry() of the head node. Also, synchronize_rcu() and synchronize_rcu_expedited() can be easily switched by an argument. Let's factorise the rollback part in setup_net() and cleanup_net(). In the next patch, ops_undo_list() will be reused for free_exit_list(), and then two arguments (ops_list and hold_rtnl) will differ. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20250411205258.63164-2-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-24net: introduce per netns packet chainsPaolo Abeni1-0/+2
Currently network taps unbound to any interface are linked in the global ptype_all list, affecting the performance in all the network namespaces. Add per netns ptypes chains, so that in the mentioned case only the netns owning the packet socket(s) is affected. While at that drop the global ptype_all list: no in kernel user registers a tap on "any" type without specifying either the target device or the target namespace (and IMHO doing that would not make any sense). Note that this adds a conditional in the fast path (to check for per netns ptype_specific list) and increases the dataset size by a cacheline (owing the per netns lists). Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumaze@google.com> Link: https://patch.msgid.link/ae405f98875ee87f8150c460ad162de7e466f8a7.1742494826.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-18net: Add net_passive_inc() and net_passive_dec().Kuniyuki Iwashima1-4/+4
net_drop_ns() is NULL when CONFIG_NET_NS is disabled. The next patch introduces a function that increments and decrements net->passive. As a prep, let's rename and export net_free() to net_passive_dec() and add net_passive_inc(). Suggested-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/netdev/CANn89i+oUCt2VGvrbrweniTendZFEh+nwS=uonc004-aPkWy-Q@mail.gmail.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250217191129.19967-2-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-15net: expedite synchronize_net() for cleanup_net()Eric Dumazet1-0/+5
cleanup_net() is the single thread responsible for netns dismantles, and a serious bottleneck. Before we can get per-netns RTNL, make sure all synchronize_net() called from this thread are using rcu_synchronize_expedited(). v3: deal with CONFIG_NET_NS=n Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jesse Brandeburg <jbrandeburg@cloudflare.com> Link: https://patch.msgid.link/20250114205531.967841-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-06net: defer final 'struct net' free in netns dismantleEric Dumazet1-1/+19
Ilya reported a slab-use-after-free in dst_destroy [1] Issue is in xfrm6_net_init() and xfrm4_net_init() : They copy xfrm[46]_dst_ops_template into net->xfrm.xfrm[46]_dst_ops. But net structure might be freed before all the dst callbacks are called. So when dst_destroy() calls later : if (dst->ops->destroy) dst->ops->destroy(dst); dst->ops points to the old net->xfrm.xfrm[46]_dst_ops, which has been freed. See a relevant issue fixed in : ac888d58869b ("net: do not delay dst_entries_add() in dst_release()") A fix is to queue the 'struct net' to be freed after one another cleanup_net() round (and existing rcu_barrier()) [1] BUG: KASAN: slab-use-after-free in dst_destroy (net/core/dst.c:112) Read of size 8 at addr ffff8882137ccab0 by task swapper/37/0 Dec 03 05:46:18 kernel: CPU: 37 UID: 0 PID: 0 Comm: swapper/37 Kdump: loaded Not tainted 6.12.0 #67 Hardware name: Red Hat KVM/RHEL, BIOS 1.16.1-1.el9 04/01/2014 Call Trace: <IRQ> dump_stack_lvl (lib/dump_stack.c:124) print_address_description.constprop.0 (mm/kasan/report.c:378) ? dst_destroy (net/core/dst.c:112) print_report (mm/kasan/report.c:489) ? dst_destroy (net/core/dst.c:112) ? kasan_addr_to_slab (mm/kasan/common.c:37) kasan_report (mm/kasan/report.c:603) ? dst_destroy (net/core/dst.c:112) ? rcu_do_batch (kernel/rcu/tree.c:2567) dst_destroy (net/core/dst.c:112) rcu_do_batch (kernel/rcu/tree.c:2567) ? __pfx_rcu_do_batch (kernel/rcu/tree.c:2491) ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4339 kernel/locking/lockdep.c:4406) rcu_core (kernel/rcu/tree.c:2825) handle_softirqs (kernel/softirq.c:554) __irq_exit_rcu (kernel/softirq.c:589 kernel/softirq.c:428 kernel/softirq.c:637) irq_exit_rcu (kernel/softirq.c:651) sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1049 arch/x86/kernel/apic/apic.c:1049) </IRQ> <TASK> asm_sysvec_apic_timer_interrupt (./arch/x86/include/asm/idtentry.h:702) RIP: 0010:default_idle (./arch/x86/include/asm/irqflags.h:37 ./arch/x86/include/asm/irqflags.h:92 arch/x86/kernel/process.c:743) Code: 00 4d 29 c8 4c 01 c7 4c 29 c2 e9 6e ff ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 90 0f 00 2d c7 c9 27 00 fb f4 <fa> c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 90 RSP: 0018:ffff888100d2fe00 EFLAGS: 00000246 RAX: 00000000001870ed RBX: 1ffff110201a5fc2 RCX: ffffffffb61a3e46 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffb3d4d123 RBP: 0000000000000000 R08: 0000000000000001 R09: ffffed11c7e1835d R10: ffff888e3f0c1aeb R11: 0000000000000000 R12: 0000000000000000 R13: ffff888100d20000 R14: dffffc0000000000 R15: 0000000000000000 ? ct_kernel_exit.constprop.0 (kernel/context_tracking.c:148) ? cpuidle_idle_call (kernel/sched/idle.c:186) default_idle_call (./include/linux/cpuidle.h:143 kernel/sched/idle.c:118) cpuidle_idle_call (kernel/sched/idle.c:186) ? __pfx_cpuidle_idle_call (kernel/sched/idle.c:168) ? lock_release (kernel/locking/lockdep.c:467 kernel/locking/lockdep.c:5848) ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4347 kernel/locking/lockdep.c:4406) ? tsc_verify_tsc_adjust (arch/x86/kernel/tsc_sync.c:59) do_idle (kernel/sched/idle.c:326) cpu_startup_entry (kernel/sched/idle.c:423 (discriminator 1)) start_secondary (arch/x86/kernel/smpboot.c:202 arch/x86/kernel/smpboot.c:282) ? __pfx_start_secondary (arch/x86/kernel/smpboot.c:232) ? soft_restart_cpu (arch/x86/kernel/head_64.S:452) common_startup_64 (arch/x86/kernel/head_64.S:414) </TASK> Dec 03 05:46:18 kernel: Allocated by task 12184: kasan_save_stack (mm/kasan/common.c:48) kasan_save_track (./arch/x86/include/asm/current.h:49 mm/kasan/common.c:60 mm/kasan/common.c:69) __kasan_slab_alloc (mm/kasan/common.c:319 mm/kasan/common.c:345) kmem_cache_alloc_noprof (mm/slub.c:4085 mm/slub.c:4134 mm/slub.c:4141) copy_net_ns (net/core/net_namespace.c:421 net/core/net_namespace.c:480) create_new_namespaces (kernel/nsproxy.c:110) unshare_nsproxy_namespaces (kernel/nsproxy.c:228 (discriminator 4)) ksys_unshare (kernel/fork.c:3313) __x64_sys_unshare (kernel/fork.c:3382) do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) Dec 03 05:46:18 kernel: Freed by task 11: kasan_save_stack (mm/kasan/common.c:48) kasan_save_track (./arch/x86/include/asm/current.h:49 mm/kasan/common.c:60 mm/kasan/common.c:69) kasan_save_free_info (mm/kasan/generic.c:582) __kasan_slab_free (mm/kasan/common.c:271) kmem_cache_free (mm/slub.c:4579 mm/slub.c:4681) cleanup_net (net/core/net_namespace.c:456 net/core/net_namespace.c:446 net/core/net_namespace.c:647) process_one_work (kernel/workqueue.c:3229) worker_thread (kernel/workqueue.c:3304 kernel/workqueue.c:3391) kthread (kernel/kthread.c:389) ret_from_fork (arch/x86/kernel/process.c:147) ret_from_fork_asm (arch/x86/entry/entry_64.S:257) Dec 03 05:46:18 kernel: Last potentially related work creation: kasan_save_stack (mm/kasan/common.c:48) __kasan_record_aux_stack (mm/kasan/generic.c:541) insert_work (./include/linux/instrumented.h:68 ./include/asm-generic/bitops/instrumented-non-atomic.h:141 kernel/workqueue.c:788 kernel/workqueue.c:795 kernel/workqueue.c:2186) __queue_work (kernel/workqueue.c:2340) queue_work_on (kernel/workqueue.c:2391) xfrm_policy_insert (net/xfrm/xfrm_policy.c:1610) xfrm_add_policy (net/xfrm/xfrm_user.c:2116) xfrm_user_rcv_msg (net/xfrm/xfrm_user.c:3321) netlink_rcv_skb (net/netlink/af_netlink.c:2536) xfrm_netlink_rcv (net/xfrm/xfrm_user.c:3344) netlink_unicast (net/netlink/af_netlink.c:1316 net/netlink/af_netlink.c:1342) netlink_sendmsg (net/netlink/af_netlink.c:1886) sock_write_iter (net/socket.c:729 net/socket.c:744 net/socket.c:1165) vfs_write (fs/read_write.c:590 fs/read_write.c:683) ksys_write (fs/read_write.c:736) do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) Dec 03 05:46:18 kernel: Second to last potentially related work creation: kasan_save_stack (mm/kasan/common.c:48) __kasan_record_aux_stack (mm/kasan/generic.c:541) insert_work (./include/linux/instrumented.h:68 ./include/asm-generic/bitops/instrumented-non-atomic.h:141 kernel/workqueue.c:788 kernel/workqueue.c:795 kernel/workqueue.c:2186) __queue_work (kernel/workqueue.c:2340) queue_work_on (kernel/workqueue.c:2391) __xfrm_state_insert (./include/linux/workqueue.h:723 net/xfrm/xfrm_state.c:1150 net/xfrm/xfrm_state.c:1145 net/xfrm/xfrm_state.c:1513) xfrm_state_update (./include/linux/spinlock.h:396 net/xfrm/xfrm_state.c:1940) xfrm_add_sa (net/xfrm/xfrm_user.c:912) xfrm_user_rcv_msg (net/xfrm/xfrm_user.c:3321) netlink_rcv_skb (net/netlink/af_netlink.c:2536) xfrm_netlink_rcv (net/xfrm/xfrm_user.c:3344) netlink_unicast (net/netlink/af_netlink.c:1316 net/netlink/af_netlink.c:1342) netlink_sendmsg (net/netlink/af_netlink.c:1886) sock_write_iter (net/socket.c:729 net/socket.c:744 net/socket.c:1165) vfs_write (fs/read_write.c:590 fs/read_write.c:683) ksys_write (fs/read_write.c:736) do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) Fixes: a8a572a6b5f2 ("xfrm: dst_entries_init() per-net dst_ops") Reported-by: Ilya Maximets <i.maximets@ovn.org> Closes: https://lore.kernel.org/netdev/CANn89iKKYDVpB=MtmfH7nyv2p=rJWSLedO5k7wSZgtY_tO8WQg@mail.gmail.com/T/#m02c98c3009fe66382b73cfb4db9cf1df6fab3fbf Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20241204125455.3871859-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-21Merge tag 'net-next-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-nextLinus Torvalds1-7/+19
Pull networking updates from Paolo Abeni: "The most significant set of changes is the per netns RTNL. The new behavior is disabled by default, regression risk should be contained. Notably the new config knob PTP_1588_CLOCK_VMCLOCK will inherit its default value from PTP_1588_CLOCK_KVM, as the first is intended to be a more reliable replacement for the latter. Core: - Started a very large, in-progress, effort to make the RTNL lock scope per network-namespace, thus reducing the lock contention significantly in the containerized use-case, comprising: - RCU-ified some relevant slices of the FIB control path - introduce basic per netns locking helpers - namespacified the IPv4 address hash table - remove rtnl_register{,_module}() in favour of rtnl_register_many() - refactor rtnl_{new,del,set}link() moving as much validation as possible out of RTNL lock - convert all phonet doit() and dumpit() handlers to RCU - convert IPv4 addresses manipulation to per-netns RTNL - convert virtual interface creation to per-netns RTNL the per-netns lock infrastructure is guarded by the CONFIG_DEBUG_NET_SMALL_RTNL knob, disabled by default ad interim. - Introduce NAPI suspension, to efficiently switching between busy polling (NAPI processing suspended) and normal processing. - Migrate the IPv4 routing input, output and control path from direct ToS usage to DSCP macros. This is a work in progress to make ECN handling consistent and reliable. - Add drop reasons support to the IPv4 rotue input path, allowing better introspection in case of packets drop. - Make FIB seqnum lockless, dropping RTNL protection for read access. - Make inet{,v6} addresses hashing less predicable. - Allow providing timestamp OPT_ID via cmsg, to correlate TX packets and timestamps Things we sprinkled into general kernel code: - Add small file operations for debugfs, to reduce the struct ops size. - Refactoring and optimization for the implementation of page_frag API, This is a preparatory work to consolidate the page_frag implementation. Netfilter: - Optimize set element transactions to reduce memory consumption - Extended netlink error reporting for attribute parser failure. - Make legacy xtables configs user selectable, giving users the option to configure iptables without enabling any other config. - Address a lot of false-positive RCU issues, pointed by recent CI improvements. BPF: - Put xsk sockets on a struct diet and add various cleanups. Overall, this helps to bump performance by 12% for some workloads. - Extend BPF selftests to increase coverage of XDP features in combination with BPF cpumap. - Optimize and homogenize bpf_csum_diff helper for all archs and also add a batch of new BPF selftests for it. - Extend netkit with an option to delegate skb->{mark,priority} scrubbing to its BPF program. - Make the bpf_get_netns_cookie() helper available also to tc(x) BPF programs. Protocols: - Introduces 4-tuple hash for connected udp sockets, speeding-up significantly connected sockets lookup. - Add a fastpath for some TCP timers that usually expires after close, the socket lock contention. - Add inbound and outbound xfrm state caches to speed up state lookups. - Avoid sending MPTCP advertisements on stale subflows, reducing risks on loosing them. - Make neighbours table flushing more scalable, maintaining per device neigh lists. Driver API: - Introduce a unified interface to configure transmission H/W shaping, and expose it to user-space via generic-netlink. - Add support for per-NAPI config via netlink. This makes napi configuration persistent across queues removal and re-creation. Requires driver updates, currently supported drivers are: nVidia/Mellanox mlx4 and mlx5, Broadcom brcm and Intel ice. - Add ethtool support for writing SFP / PHY firmware blocks. - Track RSS context allocation from ethtool core. - Implement support for mirroring to DSA CPU port, via TC mirror offload. - Consolidate FDB updates notification, to avoid duplicates on device-specific entries. - Expose DPLL clock quality level to the user-space. - Support master-slave PHY config via device tree. Tests and tooling: - forwarding: introduce deferred commands, to simplify the cleanup phase Drivers: - Updated several drivers - Amazon vNic, Google vNic, Microsoft vNic, Intel e1000e and Broadcom Tigon3 - to use netdev-genl to link the IRQs and queues to NAPI IDs, allowing busy polling and better introspection. - Ethernet high-speed NICs: - nVidia/Mellanox: - mlx5: - a large refactor to implement support for cross E-Switch scheduling - refactor H/W conter management to let it scale better - H/W GRO cleanups - Intel (100G, ice):: - add support for ethtool reset - implement support for per TX queue H/W shaping - AMD/Solarflare: - implement per device queue stats support - Broadcom (bnxt): - improve wildcard l4proto on IPv4/IPv6 ntuple rules - Marvell Octeon: - Add representor support for each Resource Virtualization Unit (RVU) device. - Hisilicon: - add support for the BMC Gigabit Ethernet - IBM (EMAC): - driver cleanup and modernization - Cisco (VIC): - raise the queues number limit to 256 - Ethernet virtual: - Google vNIC: - implement page pool support - macsec: - inherit lower device's features and TSO limits when offloading - virtio_net: - enable premapped mode by default - support for XDP socket(AF_XDP) zerocopy TX - wireguard: - set the TSO max size to be GSO_MAX_SIZE, to aggregate larger packets. - Ethernet NICs embedded and virtual: - Broadcom ASP: - enable software timestamping - Freescale: - add enetc4 PF driver - MediaTek: Airoha SoC: - implement BQL support - RealTek r8169: - enable TSO by default on r8168/r8125 - implement extended ethtool stats - Renesas AVB: - enable TX checksum offload - Synopsys (stmmac): - support header splitting for vlan tagged packets - move common code for DWMAC4 and DWXGMAC into a separate FPE module. - add dwmac driver support for T-HEAD TH1520 SoC - Synopsys (xpcs): - driver refactor and cleanup - TI: - icssg_prueth: add VLAN offload support - Xilinx emaclite: - add clock support - Ethernet switches: - Microchip: - implement support for the lan969x Ethernet switch family - add LAN9646 switch support to KSZ DSA driver - Ethernet PHYs: - Marvel: 88q2x: enable auto negotiation - Microchip: add support for LAN865X Rev B1 and LAN867X Rev C1/C2 - PTP: - Add support for the Amazon virtual clock device - Add PtP driver for s390 clocks - WiFi: - mac80211 - EHT 1024 aggregation size for transmissions - new operation to indicate that a new interface is to be added - support radio separation of multi-band devices - move wireless extension spy implementation to libiw - Broadcom: - brcmfmac: optional LPO clock support - Microchip: - add support for Atmel WILC3000 - Qualcomm (ath12k): - firmware coredump collection support - add debugfs support for a multitude of statistics - Qualcomm (ath5k): - Arcadyan ARV45XX AR2417 & Gigaset SX76[23] AR241[34]A support - Realtek: - rtw88: 8821au and 8812au USB adapters support - rtw89: add thermal protection - rtw89: fine tune BT-coexsitence to improve user experience - rtw89: firmware secure boot for WiFi 6 chip - Bluetooth - add Qualcomm WCN785x support for ids Foxconn 0xe0fc/0xe0f3 and 0x13d3:0x3623 - add Realtek RTL8852BE support for id Foxconn 0xe123 - add MediaTek MT7920 support for wireless module ids - btintel_pcie: add handshake between driver and firmware - btintel_pcie: add recovery mechanism - btnxpuart: add GPIO support to power save feature" * tag 'net-next-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1475 commits) mm: page_frag: fix a compile error when kernel is not compiled Documentation: tipc: fix formatting issue in tipc.rst selftests: nic_performance: Add selftest for performance of NIC driver selftests: nic_link_layer: Add selftest case for speed and duplex states selftests: nic_link_layer: Add link layer selftest for NIC driver bnxt_en: Add FW trace coredump segments to the coredump bnxt_en: Add a new ethtool -W dump flag bnxt_en: Add 2 parameters to bnxt_fill_coredump_seg_hdr() bnxt_en: Add functions to copy host context memory bnxt_en: Do not free FW log context memory bnxt_en: Manage the FW trace context memory bnxt_en: Allocate backing store memory for FW trace logs bnxt_en: Add a 'force' parameter to bnxt_free_ctx_mem() bnxt_en: Refactor bnxt_free_ctx_mem() bnxt_en: Add mem_valid bit to struct bnxt_ctx_mem_type bnxt_en: Update firmware interface spec to 1.10.3.85 selftests/bpf: Add some tests with sockmap SK_PASS bpf: fix recursive lock when verdict program return SK_PASS wireguard: device: support big tcp GSO wireguard: selftests: load nf_conntrack if not present ...
2024-11-11rtnetlink: Remove __rtnl_link_register()Kuniyuki Iwashima1-1/+0
link_ops is protected by link_ops_mutex and no longer needs RTNL, so we have no reason to have __rtnl_link_register() separately. Let's remove it and call rtnl_link_register() from ifb.ko and dummy.ko. Note that both modules' init() work on init_net only, so we need not export pernet_ops_rwsem and can use rtnl_net_lock() there. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20241108004823.29419-4-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-03fdget(), trivial conversionsAl Viro1-6/+4
fdget() is the first thing done in scope, all matching fdput() are immediately followed by leaving the scope. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-10-15net: Use rtnl_register_many().Kuniyuki Iwashima1-5/+9
We will remove rtnl_register() in favour of rtnl_register_many(). When it succeeds, rtnl_register_many() guarantees all rtnetlink types in the passed array are supported, and there is no chance that a part of message types is not supported. Let's use rtnl_register_many() instead. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20241014201828.91221-6-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-11tcp: move sysctl_tcp_l3mdev_accept to netns_ipv4_read_rxEric Dumazet1-1/+3
sysctl_tcp_l3mdev_accept is read from TCP receive fast path from tcp_v6_early_demux(), __inet6_lookup_established, inet_request_bound_dev_if(). Move it to netns_ipv4_read_rx. Remove the '#ifdef CONFIG_NET_L3_MASTER_DEV' that was guarding its definition. Note this adds a hole of three bytes that could be filled later. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Cc: Wei Wang <weiwan@google.com> Cc: Coco Li <lixiaoyan@google.com> Link: https://patch.msgid.link/20241010034100.320832-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-08net-timestamp: namespacify the sysctl_tstamp_allow_dataJason Xing1-0/+1
Let it be tuned in per netns by admins. Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20241005222609.94980-1-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-08rtnetlink: Add per-netns RTNL.Kuniyuki Iwashima1-0/+6
The goal is to break RTNL down into per-netns mutex. This patch adds per-netns mutex and its helper functions, rtnl_net_lock() and rtnl_net_unlock(). rtnl_net_lock() acquires the global RTNL and per-netns RTNL mutex, and rtnl_net_unlock() releases them. We will replace 800+ rtnl_lock() with rtnl_net_lock() and finally removes rtnl_lock() in rtnl_net_lock(). When we need to nest per-netns RTNL mutex, we will use __rtnl_net_lock(), and its locking order is defined by rtnl_net_lock_cmp_fn() as follows: 1. init_net is first 2. netns address ascending order Note that the conversion will be done under CONFIG_DEBUG_NET_SMALL_RTNL with LOCKDEP so that we can carefully add the extra mutex without slowing down RTNL operations during conversion. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-09-23Merge tag 'pull-stable-struct_fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds1-3/+3
Pull 'struct fd' updates from Al Viro: "Just the 'struct fd' layout change, with conversion to accessor helpers" * tag 'pull-stable-struct_fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: add struct fd constructors, get rid of __to_fd() struct fd: representation change introduce fd_file(), convert all accessors to it.
2024-08-12introduce fd_file(), convert all accessors to it.Al Viro1-3/+3
For any changes of struct fd representation we need to turn existing accesses to fields into calls of wrappers. Accesses to struct fd::flags are very few (3 in linux/file.h, 1 in net/socket.c, 3 in fs/overlayfs/file.c and 3 more in explicit initializers). Those can be dealt with in the commit converting to new layout; accesses to struct fd::file are too many for that. This commit converts (almost) all of f.file to fd_file(f). It's not entirely mechanical ('file' is used as a member name more than just in struct fd) and it does not even attempt to distinguish the uses in pointer context from those in boolean context; the latter will be eventually turned into a separate helper (fd_empty()). NOTE: mass conversion to fd_empty(), tempting as it might be, is a bad idea; better do that piecewise in commit that convert from fdget...() to CLASS(...). [conflicts in fs/fhandle.c, kernel/bpf/syscall.c, mm/memcontrol.c caught by git; fs/stat.c one got caught by git grep] [fs/xattr.c conflict] Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-08-03net: Initialise net.core sysctl defaults in preinit_net().Kuniyuki Iwashima1-26/+11
Commit 7c3f1875c66f ("net: move somaxconn init from sysctl code") introduced net_defaults_ops to make sure that net.core sysctl knobs are always initialised even if CONFIG_SYSCTL is disabled. Such operations better fit preinit_net() added for a similar purpose by commit 6e77a5a4af05 ("net: initialize net->notrefcnt_tracker earlier"). Let's initialise the sysctl defaults in preinit_net(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-03net: Slim down setup_net().Kuniyuki Iwashima1-15/+16
Most initialisations in setup_net() do not require pernet_ops_rwsem and can be moved to preinit_net(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-03net: Call preinit_net() without pernet_ops_rwsem.Kuniyuki Iwashima1-1/+2
When initialising the root netns, we call preinit_net() under pernet_ops_rwsem. However, the operations in preinit_net() do not require pernet_ops_rwsem. Also, we don't hold it for preinit_net() when initialising non-root netns. To be consistent, let's call preinit_net() without pernet_ops_rwsem in net_ns_init(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-03net: Initialise net->passive once in preinit_net().Kuniyuki Iwashima1-2/+1
When initialising the root netns, we set net->passive in setup_net(). However, we do it twice for non-root netns in copy_net_ns() and setup_net(). This is because we could bypass setup_net() in copy_net_ns() if down_read_killable() fails. preinit_net() is a better place to put such an operation. Let's initialise net->passive in preinit_net(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-03net: Don't register pernet_operations if only one of id or size is specified.Kuniyuki Iwashima1-4/+8
We can allocate per-netns memory for struct pernet_operations by specifying id and size. register_pernet_operations() assigns an id to pernet_operations and later ops_init() allocates the specified size of memory as net->gen->ptr[id]. If id is missing, no memory is allocated. If size is not specified, pernet_operations just wastes an entry of net->gen->ptr[] for every netns. net_generic is available only when both id and size are specified, so let's ensure that. While we are at it, we add const to both fields. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-18netns: Make get_net_ns() handle zero refcount netYue Haibing1-2/+7
Syzkaller hit a warning: refcount_t: addition on 0; use-after-free. WARNING: CPU: 3 PID: 7890 at lib/refcount.c:25 refcount_warn_saturate+0xdf/0x1d0 Modules linked in: CPU: 3 PID: 7890 Comm: tun Not tainted 6.10.0-rc3-00100-gcaa4f9578aba-dirty #310 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 RIP: 0010:refcount_warn_saturate+0xdf/0x1d0 Code: 41 49 04 31 ff 89 de e8 9f 1e cd fe 84 db 75 9c e8 76 26 cd fe c6 05 b6 41 49 04 01 90 48 c7 c7 b8 8e 25 86 e8 d2 05 b5 fe 90 <0f> 0b 90 90 e9 79 ff ff ff e8 53 26 cd fe 0f b6 1 RSP: 0018:ffff8881067b7da0 EFLAGS: 00010286 RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff811c72ac RDX: ffff8881026a2140 RSI: ffffffff811c72b5 RDI: 0000000000000001 RBP: ffff8881067b7db0 R08: 0000000000000000 R09: 205b5d3730353139 R10: 0000000000000000 R11: 205d303938375420 R12: ffff8881086500c4 R13: ffff8881086500c4 R14: ffff8881086500b0 R15: ffff888108650040 FS: 00007f5b2961a4c0(0000) GS:ffff88823bd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055d7ed36fd18 CR3: 00000001482f6000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ? show_regs+0xa3/0xc0 ? __warn+0xa5/0x1c0 ? refcount_warn_saturate+0xdf/0x1d0 ? report_bug+0x1fc/0x2d0 ? refcount_warn_saturate+0xdf/0x1d0 ? handle_bug+0xa1/0x110 ? exc_invalid_op+0x3c/0xb0 ? asm_exc_invalid_op+0x1f/0x30 ? __warn_printk+0xcc/0x140 ? __warn_printk+0xd5/0x140 ? refcount_warn_saturate+0xdf/0x1d0 get_net_ns+0xa4/0xc0 ? __pfx_get_net_ns+0x10/0x10 open_related_ns+0x5a/0x130 __tun_chr_ioctl+0x1616/0x2370 ? __sanitizer_cov_trace_switch+0x58/0xa0 ? __sanitizer_cov_trace_const_cmp2+0x1c/0x30 ? __pfx_tun_chr_ioctl+0x10/0x10 tun_chr_ioctl+0x2f/0x40 __x64_sys_ioctl+0x11b/0x160 x64_sys_call+0x1211/0x20d0 do_syscall_64+0x9e/0x1d0 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f5b28f165d7 Code: b3 66 90 48 8b 05 b1 48 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 81 48 2d 00 8 RSP: 002b:00007ffc2b59c5e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f5b28f165d7 RDX: 0000000000000000 RSI: 00000000000054e3 RDI: 0000000000000003 RBP: 00007ffc2b59c650 R08: 00007f5b291ed8c0 R09: 00007f5b2961a4c0 R10: 0000000029690010 R11: 0000000000000246 R12: 0000000000400730 R13: 00007ffc2b59cf40 R14: 0000000000000000 R15: 0000000000000000 </TASK> Kernel panic - not syncing: kernel: panic_on_warn set ... This is trigger as below: ns0 ns1 tun_set_iff() //dev is tun0 tun->dev = dev //ip link set tun0 netns ns1 put_net() //ref is 0 __tun_chr_ioctl() //TUNGETDEVNETNS net = dev_net(tun->dev); open_related_ns(&net->ns, get_net_ns); //ns1 get_net_ns() get_net() //addition on 0 Use maybe_get_net() in get_net_ns in case net's ref is zero to fix this Fixes: 0c3e0e3bb623 ("tun: Add ioctl() TUNGETDEVNETNS cmd to allow obtaining real net ns of tun device") Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Link: https://lore.kernel.org/r/20240614131302.2698509-1-yuehaibing@huawei.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-3/+10
Cross-merge networking fixes after downstream PR. No conflicts. Adjacent changes: drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c 35d92abfbad8 ("net: hns3: fix kernel crash when devlink reload during initialization") 2a1a1a7b5fd7 ("net: hns3: add command queue trace for hns3") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-06net: fix out-of-bounds access in ops_initThadeu Lima de Souza Cascardo1-3/+10
net_alloc_generic is called by net_alloc, which is called without any locking. It reads max_gen_ptrs, which is changed under pernet_ops_rwsem. It is read twice, first to allocate an array, then to set s.len, which is later used to limit the bounds of the array access. It is possible that the array is allocated and another thread is registering a new pernet ops, increments max_gen_ptrs, which is then used to set s.len with a larger than allocated length for the variable array. Fix it by reading max_gen_ptrs only once in net_alloc_generic. If max_gen_ptrs is later incremented, it will be caught in net_assign_generic. Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com> Fixes: 073862ba5d24 ("netns: fix net_alloc_generic()") Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240502132006.3430840-1-cascardo@igalia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-17netns: no longer hold RTNL in rtnl_net_dumpid()Eric Dumazet1-2/+3
- rtnl_net_dumpid() is already fully RCU protected, RTNL is not needed there. - Fix return value at the end of a dump, so that NLMSG_DONE can be appended to current skb, saving one recvmsg() system call. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Link: https://lore.kernel.org/r/20240416140739.967941-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-12net: use synchronize_rcu_expedited in cleanup_net()Eric Dumazet1-1/+1
cleanup_net() is calling synchronize_rcu() right before acquiring RTNL. synchronize_rcu() is much slower than synchronize_rcu_expedited(), and cleanup_net() is currently single threaded. In many workloads we want cleanup_net() to be fast, in order to free memory and various sysfs and procfs entries as fast as possible. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-07net: add exit_batch_rtnl() methodEric Dumazet1-1/+30
Many (struct pernet_operations)->exit_batch() methods have to acquire rtnl. In presence of rtnl mutex pressure, this makes cleanup_net() very slow. This patch adds a new exit_batch_rtnl() method to reduce number of rtnl acquisitions from cleanup_net(). exit_batch_rtnl() handlers are called while rtnl is locked, and devices to be killed can be queued in a list provided as their second argument. A single unregister_netdevice_many() is called right before rtnl is released. exit_batch_rtnl() handlers are called before ->exit() and ->exit_batch() handlers. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Antoine Tenart <atenart@kernel.org> Link: https://lore.kernel.org/r/20240206144313.2050392-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-12-15net: Namespace-ify sysctl_optmem_maxEric Dumazet1-0/+4
optmem_max being used in tx zerocopy, we want to be able to control it on a netns basis. Following patch changes two tests. Tested: oqq130:~# cat /proc/sys/net/core/optmem_max 131072 oqq130:~# echo 1000000 >/proc/sys/net/core/optmem_max oqq130:~# cat /proc/sys/net/core/optmem_max 1000000 oqq130:~# unshare -n oqq130:~# cat /proc/sys/net/core/optmem_max 131072 oqq130:~# exit logout oqq130:~# cat /proc/sys/net/core/optmem_max 1000000 Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>