path: root/net/core/dev.c
Age | Commit message | Author | Files | Lines
2025-03-06 | net: hold netdev instance lock during ndo_open/ndo_stop | Stanislav Fomichev | 1 | -0/+12
For the drivers that use the shaper API, switch to the mode where the core stack holds the netdev lock. This affects two drivers:

* iavf - already grabs the netdev lock in ndo_open/ndo_stop, so mostly remove these
* netdevsim - switch to _locked APIs to avoid deadlock

The iavf_close diff is a bit confusing; the existing call looks like this:

  iavf_close() {
    netdev_lock()
    ..
    netdev_unlock()
    wait_event_timeout(down_waitqueue)
  }

I change it to the following:

  netdev_lock()
  iavf_close() {
    ..
    netdev_unlock()
    wait_event_timeout(down_waitqueue)
    netdev_lock() // reusing this lock call
  }
  netdev_unlock()

Since I'm reusing the existing netdev_lock() call, it looks like I only add netdev_unlock().

Cc: Saeed Mahameed <saeed@kernel.org>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250305163732.2766420-2-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-04 | net: Prevent use after free in netif_napi_set_irq_locked() | Dan Carpenter | 1 | -1/+1
The cpu_rmap_put() will call kfree() when the last reference is dropped, so it could result in a use after free when we dereference the same pointer on the next line. Move the cpu_rmap_put() after the dereference.

Fixes: bd7c00605ee0 ("net: move aRFS rmap management and CPU affinity to core")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Link: https://patch.msgid.link/5a9c53a4-5487-4b8c-9ffa-d8e5343aaaaf@stanley.mountain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
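A minimal sketch of the ordering rule behind this fix; the accessor and the field use below are illustrative, not the actual netif_napi_set_irq_locked() code:

	struct cpu_rmap *rmap = get_rmap(napi);	/* hypothetical accessor */

	/* Buggy order: cpu_rmap_put() may kfree(rmap) when it drops the
	 * last reference, so the dereference that follows can touch
	 * freed memory.
	 */
	cpu_rmap_put(rmap);
	use(rmap->obj[0]);			/* use-after-free */

	/* Fixed order: read what you need first, drop the reference last. */
	use(rmap->obj[0]);
	cpu_rmap_put(rmap);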
2025-03-04 | net: plumb extack in __dev_change_net_namespace() | Nicolas Dichtel | 1 | -9/+34
It could be hard to understand why the netlink command fails. For example, if dev->netns_immutable is set, the error is "Invalid argument". Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-04 | net: rename netns_local to netns_immutable | Nicolas Dichtel | 1 | -2/+2
The name 'netns_local' is confusing. A following commit will export it via netlink, so let's use a more explicit name. Reported-by: Eric Dumazet <edumazet@google.com> Suggested-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-27 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net | Jakub Kicinski | 1 | -10/+4
Cross-merge networking fixes after downstream PR (net-6.14-rc5).

Conflicts:

drivers/net/ethernet/cadence/macb_main.c
  fa52f15c745c ("net: cadence: macb: Synchronize stats calculations")
  75696dd0fd72 ("net: cadence: macb: Convert to get_stats64")
https://lore.kernel.org/20250224125848.68ee63e5@canb.auug.org.au

Adjacent changes:

drivers/net/ethernet/intel/ice/ice_sriov.c
  79990cf5e7ad ("ice: Fix deinitializing VF in error path")
  a203163274a4 ("ice: simplify VF MSI-X managing")

net/ipv4/tcp.c
  18912c520674 ("tcp: devmem: don't write truncated dmabuf CMSGs to userspace")
  297d389e9e5b ("net: prefix devmem specific helpers")

net/mptcp/subflow.c
  8668860b0ad3 ("mptcp: reset when MPTCP opts are dropped after join")
  c3349a22c200 ("mptcp: consolidate subflow cleanup")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-27 | net: gro: expose GRO init/cleanup to use outside of NAPI | Alexander Lobakin | 1 | -34/+3
Make GRO init and cleanup functions global to be able to use GRO without a NAPI instance. Taking into account already global gro_flush(), it's now fully usable standalone. New functions are not exported, since they're not supposed to be used outside of the kernel core code. Tested-by: Daniel Xu <dxu@dxuuu.xyz> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-27 | net: gro: decouple GRO from the NAPI layer | Alexander Lobakin | 1 | -35/+31
In fact, these two are not tied closely to each other. The only requirements for GRO are to use it in the BH context and to have some sane limits on the packet batches, e.g. NAPI has a limit of its budget (64/8/etc.).

Move the purely GRO fields into a new structure, &gro_node. Embed it into &napi_struct and adjust all the references. gro_node::cached_napi_id is effectively the same as napi_struct::napi_id, but is meant to be used on the GRO hotpath to mark skbs; napi_struct::napi_id is now purely a control path field.

Three Ethernet drivers use napi_gro_flush(), which is not really meant to be exported, so move it to <net/gro.h> and add that include there. napi_gro_receive() is used in more than 100 drivers, so keep it in <linux/netdevice.h>.

This does not make GRO ready to use outside of the NAPI context yet.

Tested-by: Daniel Xu <dxu@dxuuu.xyz>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-26 | net: move aRFS rmap management and CPU affinity to core | Ahmed Zaki | 1 | -0/+169
A common task for most drivers is to remember the user-set CPU affinity of its IRQs. On each netdev reset, the driver should re-assign the user's settings to the IRQs. Unify this task across all drivers by moving the CPU affinity to napi->config.

However, to move the CPU affinity to the core, we also need to move aRFS rmap management, since aRFS uses its own IRQ notifiers.

For aRFS, add a new netdev flag "rx_cpu_rmap_auto". Drivers supporting aRFS should set the flag via netif_enable_cpu_rmap() and the core will allocate and manage the aRFS rmaps. Freeing the rmap is also done by the core when the netdev is freed. For better IRQ affinity management, move the IRQ rmap notifier inside the napi_struct and add new notify.notify and notify.release functions: netif_irq_cpu_rmap_notify() and netif_napi_affinity_release().

Now that we have the aRFS rmap management in the core, add a CPU affinity mask to napi_config. To delegate CPU affinity management to the core, drivers must (see the sketch after this entry):

1 - set the new netdev flag "irq_affinity_auto": netif_enable_irq_affinity(netdev)
2 - create the napi with persistent config: netif_napi_add_config()
3 - bind an IRQ to the napi instance: netif_napi_set_irq()

The core will then re-assign affinity to the napi's IRQ. The default IRQ mask is set to one CPU, starting from the closest NUMA node.

Signed-off-by: Ahmed Zaki <ahmed.zaki@intel.com>
Link: https://patch.msgid.link/20250224232228.990783-2-ahmed.zaki@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
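A hedged sketch of the three driver-side steps listed above, roughly as a driver's setup path might call them; struct example_queue, example_poll and the exact placement of the opt-in call are illustrative, and error handling is omitted:

	static int example_setup_queue(struct net_device *netdev,
				       struct example_queue *q, int qid, int irq)
	{
		/* 1: opt in to core-managed IRQ affinity (irq_affinity_auto) */
		netif_enable_irq_affinity(netdev);

		/* 2: create the NAPI with persistent, index-keyed config */
		netif_napi_add_config(netdev, &q->napi, example_poll, qid);

		/* 3: bind the IRQ to the NAPI instance */
		netif_napi_set_irq(&q->napi, irq);
		return 0;
	}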
2025-02-26 | net: Handle napi_schedule() calls from non-interrupt | Frederic Weisbecker | 1 | -1/+1
napi_schedule() is expected to be called either:

* From an interrupt, where raised softirqs are handled on IRQ exit
* From a softirq disabled section, where raised softirqs are handled on the next call to local_bh_enable().
* From a softirq handler, where raised softirqs are handled on the next round in do_softirq(), or further deferred to a dedicated kthread.

Other bare task contexts may end up ignoring the raised NET_RX vector until the next random softirq handling opportunity, which may not happen for a while if the CPU goes idle afterwards with the tick stopped. Such "misuses" have been detected in several places thanks to messages of the kind:

  "NOHZ tick-stop error: local softirq work is pending, handler #08!!!"

For example:

  __raise_softirq_irqoff
  __napi_schedule
  rtl8152_runtime_resume.isra.0
  rtl8152_resume
  usb_resume_interface.isra.0
  usb_resume_both
  __rpm_callback
  rpm_callback
  rpm_resume
  __pm_runtime_resume
  usb_autoresume_device
  usb_remote_wakeup
  hub_event
  process_one_work
  worker_thread
  kthread
  ret_from_fork
  ret_from_fork_asm

And also:

* drivers/net/usb/r8152.c::rtl_work_func_t
* drivers/net/netdevsim/netdev.c::nsim_start_xmit

There is a long history of issues of this kind:

019edd01d174 ("ath10k: sdio: Add missing BH locking around napi_schdule()")
330068589389 ("idpf: disable local BH when scheduling napi for marker packets")
e3d5d70cb483 ("net: lan78xx: fix "softirq work is pending" error")
e55c27ed9ccf ("mt76: mt7615: add missing bh-disable around rx napi schedule")
c0182aa98570 ("mt76: mt7915: add missing bh-disable around tx napi enable/schedule")
970be1dff26d ("mt76: disable BH around napi_schedule() calls")
019edd01d174 ("ath10k: sdio: Add missing BH locking around napi_schdule()")
30bfec4fec59 ("can: rx-offload: can_rx_offload_threaded_irq_finish(): add new function to be called from threaded interrupt")
e63052a5dd3c ("mlx5e: add add missing BH locking around napi_schdule()")
83a0c6e58901 ("i40e: Invoke softirqs after napi_reschedule")
bd4ce941c8d5 ("mlx4: Invoke softirqs after napi_reschedule")
8cf699ec849f ("mlx4: do not call napi_schedule() without care")
ec13ee80145c ("virtio_net: invoke softirqs after __napi_schedule")

This shows that relying on the caller to arrange a proper context for the softirqs to be handled while calling napi_schedule() is very fragile and error prone. Fixing these callers can also prove challenging if the caller may be called from different kinds of contexts.

Therefore fix this from napi_schedule() itself, by waking up ksoftirqd when softirqs are raised from task contexts.

Reported-by: Paul Menzel <pmenzel@molgen.mpg.de>
Reported-by: Jakub Kicinski <kuba@kernel.org>
Reported-by: Francois Romieu <romieu@fr.zoreil.com>
Closes: https://lore.kernel.org/lkml/354a2690-9bbf-4ccb-8769-fa94707a9340@molgen.mpg.de/
Cc: Breno Leitao <leitao@debian.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250223221708.27130-1-frederic@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
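For reference, the caller-side workaround used by the driver fixes listed above (made unnecessary for correctness by this change) looks roughly like this:

	/* Raise NET_RX from task context: disabling BH guarantees the
	 * softirq runs on the matching local_bh_enable() instead of being
	 * left pending until some later, unrelated softirq opportunity.
	 */
	local_bh_disable();
	napi_schedule(&priv->napi);	/* priv->napi: the driver's NAPI instance */
	local_bh_enable();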
2025-02-26 | net: Use rtnl_net_dev_lock() in register_netdevice_notifier_dev_net(). | Kuniyuki Iwashima | 1 | -9/+3
Breno Leitao reported the splat below. [0]

Commit 65161fb544aa ("net: Fix dev_net(dev) race in unregister_netdevice_notifier_dev_net().") added the DEBUG_NET_WARN_ON_ONCE(), assuming that the netdev is not registered before register_netdevice_notifier_dev_net(). But the assumption was simply wrong.

Let's use rtnl_net_dev_lock() in register_netdevice_notifier_dev_net().

[0]:
WARNING: CPU: 25 PID: 849 at net/core/dev.c:2150 register_netdevice_notifier_dev_net (net/core/dev.c:2150)
<TASK>
? __warn (kernel/panic.c:242 kernel/panic.c:748)
? register_netdevice_notifier_dev_net (net/core/dev.c:2150)
? register_netdevice_notifier_dev_net (net/core/dev.c:2150)
? report_bug (lib/bug.c:? lib/bug.c:219)
? handle_bug (arch/x86/kernel/traps.c:285)
? exc_invalid_op (arch/x86/kernel/traps.c:309)
? asm_exc_invalid_op (./arch/x86/include/asm/idtentry.h:621)
? register_netdevice_notifier_dev_net (net/core/dev.c:2150)
? register_netdevice_notifier_dev_net (./include/net/net_namespace.h:406 ./include/linux/netdevice.h:2663 net/core/dev.c:2144)
mlx5e_mdev_notifier_event+0x9f/0xf0 mlx5_ib
notifier_call_chain.llvm.12241336988804114627 (kernel/notifier.c:85)
blocking_notifier_call_chain (kernel/notifier.c:380)
mlx5_core_uplink_netdev_event_replay (drivers/net/ethernet/mellanox/mlx5/core/main.c:352)
mlx5_ib_roce_init.llvm.12447516292400117075+0x1c6/0x550 mlx5_ib
mlx5r_probe+0x375/0x6a0 mlx5_ib
? kernfs_put (./include/linux/instrumented.h:96 ./include/linux/atomic/atomic-arch-fallback.h:2278 ./include/linux/atomic/atomic-instrumented.h:1384 fs/kernfs/dir.c:557)
? auxiliary_match_id (drivers/base/auxiliary.c:174)
? mlx5r_mp_remove+0x160/0x160 mlx5_ib
really_probe (drivers/base/dd.c:? drivers/base/dd.c:658)
driver_probe_device (drivers/base/dd.c:830)
__driver_attach (drivers/base/dd.c:1217)
bus_for_each_dev (drivers/base/bus.c:369)
? driver_attach (drivers/base/dd.c:1157)
bus_add_driver (drivers/base/bus.c:679)
driver_register (drivers/base/driver.c:249)

Fixes: 7fb1073300a2 ("net: Hold rtnl_net_lock() in (un)?register_netdevice_notifier_dev_net().")
Reported-by: Breno Leitao <leitao@debian.org>
Closes: https://lore.kernel.org/netdev/20250224-noisy-cordial-roadrunner-fad40c@leitao/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Tested-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20250225211023.96448-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-24 | net: Remove shadow variable in netdev_run_todo() | Breno Leitao | 1 | -3/+2
Fix a shadow variable warning in net/core/dev.c when compiled with CONFIG_LOCKDEP enabled. The warning occurs because 'dev' is redeclared inside the while loop, shadowing the outer scope declaration.

  net/core/dev.c:11211:22: warning: declaration shadows a local variable [-Wshadow]
          struct net_device *dev = list_first_entry(&unlink_list,
  net/core/dev.c:11202:21: note: previous declaration is here
          struct net_device *dev, *tmp;

Remove the redundant declaration since the variable is already defined in the outer scope and will be overwritten in the subsequent list_for_each_entry_safe() loop anyway.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250221-netcons_fix_shadow-v1-1-dee20c8658dd@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-21 | Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next | Jakub Kicinski | 1 | -1/+2
Martin KaFai Lau says:

====================
pull-request: bpf-next 2025-02-20

We've added 19 non-merge commits during the last 8 day(s) which contain a total of 35 files changed, 1126 insertions(+), 53 deletions(-).

The main changes are:

1) Add TCP_RTO_MAX_MS support to bpf_set/getsockopt, from Jason Xing

2) Add network TX timestamping support to BPF sock_ops, from Jason Xing

3) Add TX metadata Launch Time support, from Song Yoong Siang

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
  igc: Add launch time support to XDP ZC
  igc: Refactor empty frame insertion for launch time support
  net: stmmac: Add launch time support to XDP ZC
  selftests/bpf: Add launch time request to xdp_hw_metadata
  xsk: Add launch time hardware offload support to XDP Tx metadata
  selftests/bpf: Add simple bpf tests in the tx path for timestamping feature
  bpf: Support selective sampling for bpf timestamping
  bpf: Add BPF_SOCK_OPS_TSTAMP_SENDMSG_CB callback
  bpf: Add BPF_SOCK_OPS_TSTAMP_ACK_CB callback
  bpf: Add BPF_SOCK_OPS_TSTAMP_SND_HW_CB callback
  bpf: Add BPF_SOCK_OPS_TSTAMP_SND_SW_CB callback
  bpf: Add BPF_SOCK_OPS_TSTAMP_SCHED_CB callback
  net-timestamp: Prepare for isolating two modes of SO_TIMESTAMPING
  bpf: Disable unsafe helpers in TX timestamping callbacks
  bpf: Prevent unsafe access to the sock fields in the BPF timestamping callback
  bpf: Prepare the sock_ops ctx and call bpf prog for TX timestamping
  bpf: Add networking timestamping support to bpf_get/setsockopt()
  selftests/bpf: Add rto max for bpf_setsockopt test
  bpf: Support TCP_RTO_MAX_MS for bpf_setsockopt
====================

Link: https://patch.msgid.link/20250221022104.386462-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-20 | bpf: Add BPF_SOCK_OPS_TSTAMP_SCHED_CB callback | Jason Xing | 1 | -1/+2
Support SCM_TSTAMP_SCHED case for bpf timestamping. Add a new sock_ops callback, BPF_SOCK_OPS_TSTAMP_SCHED_CB. This callback will occur at the same timestamping point as the user space's SCM_TSTAMP_SCHED. The BPF program can use it to get the same SCM_TSTAMP_SCHED timestamp without modifying the user-space application. A new SKBTX_BPF flag is added to mark skb_shinfo(skb)->tx_flags, ensuring that the new BPF timestamping and the current user space's SO_TIMESTAMPING do not interfere with each other. Signed-off-by: Jason Xing <kerneljasonxing@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250220072940.99994-7-kerneljasonxing@gmail.com
2025-02-20 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net | Jakub Kicinski | 1 | -11/+97
Cross-merge networking fixes after downstream PR (net-6.14-rc4). No conflicts or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-20Revert "net: skb: introduce and use a single page frag cache"Paolo Abeni1-0/+17
After the previous commit is finally safe to revert commit dbae2b062824 ("net: skb: introduce and use a single page frag cache"): do it here. The intended goal of such change was to counter a performance regression introduced by commit 3226b158e67c ("net: avoid 32 x truesize under-estimation for tiny skbs"). Unfortunately, the blamed commit introduces another regression for the virtio_net driver. Such a driver calls napi_alloc_skb() with a tiny size, so that the whole head frag could fit a 512-byte block. The single page frag cache uses a 1K fragment for such allocation, and the additional overhead, under small UDP packets flood, makes the page allocator a bottleneck. Thanks to commit bf9f1baa279f ("net: add dedicated kmem_cache for typical/small skb->head"), this revert does not re-introduce the original regression. Actually, in the relevant test on top of this revert, I measure a small but noticeable positive delta, just above noise level. The revert itself required some additional mangling due to recent updates in the affected code. Suggested-by: Eric Dumazet <edumazet@google.com> Fixes: dbae2b062824 ("net: skb: introduce and use a single page frag cache") Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-19 | net: Add non-RCU dev_getbyhwaddr() helper | Breno Leitao | 1 | -3/+34
Add dedicated helper for finding devices by hardware address when holding rtnl_lock, similar to existing dev_getbyhwaddr_rcu(). This prevents PROVE_LOCKING warnings when rtnl_lock is held but RCU read lock is not. Extract common address comparison logic into dev_addr_cmp(). The context about this change could be found in the following discussion: Link: https://lore.kernel.org/all/20250206-scarlet-ermine-of-improvement-1fcac5@leitao/ Cc: kuniyu@amazon.com Cc: ushankar@purestorage.com Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250218-arm_fix_selftest-v5-1-d3d6892db9e1@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
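A hedged usage sketch of the new helper next to its RCU sibling; the surrounding variables (net, mac) are illustrative:

	/* Under RTNL: */
	ASSERT_RTNL();
	dev = dev_getbyhwaddr(net, ARPHRD_ETHER, mac);

	/* Under RCU (pre-existing helper): */
	rcu_read_lock();
	dev = dev_getbyhwaddr_rcu(net, ARPHRD_ETHER, mac);
	rcu_read_unlock();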
2025-02-18 | dev: Use rtnl_net_dev_lock() in unregister_netdev(). | Kuniyuki Iwashima | 1 | -4/+2
The following sequence is basically illegal when dev was fetched without lookup because dev_net(dev) might be different after holding rtnl_net_lock(): net = dev_net(dev); rtnl_net_lock(net); Let's use rtnl_net_dev_lock() in unregister_netdev(). Note that there is no real bug in unregister_netdev() for now because RTNL protects the scope even if dev_net(dev) is changed before/after RTNL. Fixes: 00fb9823939e ("dev: Hold per-netns RTNL in (un)?register_netdev().") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250217191129.19967-4-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-18 | net: Fix dev_net(dev) race in unregister_netdevice_notifier_dev_net(). | Kuniyuki Iwashima | 1 | -4/+44
After the cited commit, dev_net(dev) is fetched before holding RTNL and passed to __unregister_netdevice_notifier_net(). However, dev_net(dev) might be different after holding RTNL.

In the reported case [0], while removing a VF device, its netns was being dismantled and the VF was moved to init_net. So the following sequence is basically illegal when dev was fetched without lookup:

  net = dev_net(dev);
  rtnl_net_lock(net);

Let's use a new helper rtnl_net_dev_lock() to fix the race. It fetches dev_net_rcu(dev), bumps its net->passive, and checks if dev_net_rcu(dev) is changed after rtnl_net_lock().

[0]:
BUG: KASAN: slab-use-after-free in notifier_call_chain (kernel/notifier.c:75 (discriminator 2))
Read of size 8 at addr ffff88810cefb4c8 by task test-bridge-lag/21127
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl (lib/dump_stack.c:123)
print_report (mm/kasan/report.c:379 mm/kasan/report.c:489)
kasan_report (mm/kasan/report.c:604)
notifier_call_chain (kernel/notifier.c:75 (discriminator 2))
call_netdevice_notifiers_info (net/core/dev.c:2011)
unregister_netdevice_many_notify (net/core/dev.c:11551)
unregister_netdevice_queue (net/core/dev.c:11487)
unregister_netdev (net/core/dev.c:11635)
mlx5e_remove (drivers/net/ethernet/mellanox/mlx5/core/en_main.c:6552 drivers/net/ethernet/mellanox/mlx5/core/en_main.c:6579) mlx5_core
auxiliary_bus_remove (drivers/base/auxiliary.c:230)
device_release_driver_internal (drivers/base/dd.c:1275 drivers/base/dd.c:1296)
bus_remove_device (./include/linux/kobject.h:193 drivers/base/base.h:73 drivers/base/bus.c:583)
device_del (drivers/base/power/power.h:142 drivers/base/core.c:3855)
mlx5_rescan_drivers_locked (./include/linux/auxiliary_bus.h:241 drivers/net/ethernet/mellanox/mlx5/core/dev.c:333 drivers/net/ethernet/mellanox/mlx5/core/dev.c:535 drivers/net/ethernet/mellanox/mlx5/core/dev.c:549) mlx5_core
mlx5_unregister_device (drivers/net/ethernet/mellanox/mlx5/core/dev.c:468) mlx5_core
mlx5_uninit_one (./include/linux/instrumented.h:68 ./include/asm-generic/bitops/instrumented-non-atomic.h:141 drivers/net/ethernet/mellanox/mlx5/core/main.c:1563) mlx5_core
remove_one (drivers/net/ethernet/mellanox/mlx5/core/main.c:965 drivers/net/ethernet/mellanox/mlx5/core/main.c:2019) mlx5_core
pci_device_remove (./include/linux/pm_runtime.h:129 drivers/pci/pci-driver.c:475)
device_release_driver_internal (drivers/base/dd.c:1275 drivers/base/dd.c:1296)
unbind_store (drivers/base/bus.c:245)
kernfs_fop_write_iter (fs/kernfs/file.c:338)
vfs_write (fs/read_write.c:587 (discriminator 1) fs/read_write.c:679 (discriminator 1))
ksys_write (fs/read_write.c:732)
do_syscall_64 (arch/x86/entry/common.c:52 (discriminator 1) arch/x86/entry/common.c:83 (discriminator 1))
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
RIP: 0033:0x7f6a4d5018b7

Fixes: 7fb1073300a2 ("net: Hold rtnl_net_lock() in (un)?register_netdevice_notifier_dev_net().")
Reported-by: Yael Chemla <ychemla@nvidia.com>
Closes: https://lore.kernel.org/netdev/146eabfe-123c-4970-901e-e961b4c09bc3@nvidia.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250217191129.19967-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-18 | net: Remove redundant variable declaration in __dev_change_flags() | Breno Leitao | 1 | -1/+1
The old_flags variable is declared twice in __dev_change_flags(), causing a shadow variable warning. This patch fixes the issue by removing the redundant declaration, reusing the existing old_flags variable instead.

  net/core/dev.c:9225:16: warning: declaration shadows a local variable [-Wshadow]
   9225 |                 unsigned int old_flags = dev->flags;
        |                              ^
  net/core/dev.c:9185:15: note: previous declaration is here
   9185 |         unsigned int old_flags = dev->flags;
        |                      ^
  1 warning generated.

Remove the redundant inner declaration and reuse the existing old_flags variable since its value is not needed outside the if block, and it is safe to reuse the variable. This eliminates the warning while maintaining the same functionality.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250217-old_flags-v2-1-4cda3b43a35f@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-18 | netdev: Switch to use hrtimer_setup() | Nam Cao | 1 | -2/+1
hrtimer_setup() takes the callback function pointer as argument and initializes the timer completely. Replace hrtimer_init() and the open coded initialization of hrtimer::function with the new setup mechanism. Patch was created by using Coccinelle. Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/861d9b287f2b5206b853dbcc0f302127f2a7f8ac.1738746872.git.namcao@linutronix.de
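The shape of the Coccinelle conversion, roughly as it applies to the NAPI watchdog timer in net/core/dev.c (shown as a sketch, not a verbatim diff):

	/* Before: two-step init plus open-coded callback assignment. */
	hrtimer_init(&napi->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
	napi->timer.function = napi_watchdog;

	/* After: one call initializes the timer and its callback. */
	hrtimer_setup(&napi->timer, napi_watchdog, CLOCK_MONOTONIC,
		      HRTIMER_MODE_REL_PINNED);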
2025-02-17 | net: use napi_id_valid helper | Stefano Jordhani | 1 | -3/+3
In commit 6597e8d35851 ("netdev-genl: Elide napi_id when not present"), napi_id_valid function was added. Use the helper to refactor open-coded checks in the source. Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Stefano Jordhani <sjordhani@gmail.com> Reviewed-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> # for iouring Link: https://patch.msgid.link/20250214181801.931-1-sjordhani@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
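The helper amounts to a named comparison against MIN_NAPI_ID, so open-coded range checks become self-describing; roughly (paraphrased, not a verbatim copy of the header):

	static inline bool napi_id_valid(unsigned int napi_id)
	{
		return napi_id >= MIN_NAPI_ID;
	}

	/* An open-coded check ...              */
	if (napi->napi_id < MIN_NAPI_ID)
		return;
	/* ... becomes:                         */
	if (!napi_id_valid(napi->napi_id))
		return;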
2025-02-06 | Merge branch 'io_uring-zero-copy-rx' | Jakub Kicinski | 1 | -1/+15
David Wei says:

====================
io_uring zero copy rx

This patchset contains net/ patches needed by a new io_uring request implementing zero copy rx into userspace pages, eliminating a kernel to user copy.

We configure a page pool that a driver uses to fill a hw rx queue to hand out user pages instead of kernel pages. Any data that ends up hitting this hw rx queue will thus be dma'd into userspace memory directly, without needing to be bounced through kernel memory. 'Reading' data out of a socket instead becomes a _notification_ mechanism, where the kernel tells userspace where the data is. The overall approach is similar to the devmem TCP proposal.

This relies on hw header/data split, flow steering and RSS to ensure packet headers remain in kernel memory and only desired flows hit a hw rx queue configured for zero copy. Configuring this is outside of the scope of this patchset.

We share netdev core infra with devmem TCP. The main difference is that io_uring is used for the uAPI and the lifetime of all objects are bound to an io_uring instance. Data is 'read' using a new io_uring request type. When done, data is returned via a new shared refill queue. A zero copy page pool refills a hw rx queue from this refill queue directly. Of course, the lifetime of these data buffers are managed by io_uring rather than the networking stack, with different refcounting rules.

This patchset is the first step adding basic zero copy support. We will extend this iteratively with new features e.g. dynamically allocated zero copy areas, THP support, dmabuf support, improved copy fallback, general optimisations and more.

In terms of netdev support, we're first targeting Broadcom bnxt. Patches aren't included since Taehee Yoo has already sent a more comprehensive patchset adding support in [1]. Google gve should already support this, and Mellanox mlx5 support is WIP pending driver changes.

===========
Performance
===========

Note: Comparison with epoll + TCP_ZEROCOPY_RECEIVE isn't done yet.

Test setup:
* AMD EPYC 9454
* Broadcom BCM957508 200G
* Kernel v6.11 base [2]
* liburing fork [3]
* kperf fork [4]
* 4K MTU
* Single TCP flow

With application thread + net rx softirq pinned to _different_ cores:

  +-------------------------------+
  | epoll     | io_uring          |
  |-----------|-------------------|
  | 82.2 Gbps | 116.2 Gbps (+41%) |
  +-------------------------------+

Pinned to _same_ core:

  +-------------------------------+
  | epoll     | io_uring          |
  |-----------|-------------------|
  | 62.6 Gbps | 80.9 Gbps (+29%)  |
  +-------------------------------+

=====
Links
=====

Broadcom bnxt support:
[1]: https://lore.kernel.org/20241003160620.1521626-8-ap420073@gmail.com

Linux kernel branch including io_uring bits:
[2]: https://github.com/isilence/linux.git zcrx/v13

liburing for testing:
[3]: https://github.com/isilence/liburing.git zcrx/next

kperf for testing:
[4]: https://git.kernel.dk/kperf.git
====================

Link: https://patch.msgid.link/20250204215622.695511-1-dw@davidwei.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-06 | net: page_pool: add a mp hook to unregister_netdevice* | Pavel Begunkov | 1 | -1/+15
Devmem TCP needs a hook in unregister_netdevice_many_notify() to upkeep the set tracking queues it's bound to, i.e. ->bound_rxqs. Instead of devmem sticking directly out of the generic path, add a mp function.

Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: David Wei <dw@davidwei.uk>
Link: https://patch.msgid.link/20250204215622.695511-8-dw@davidwei.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-06 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net | Jakub Kicinski | 1 | -0/+14
Cross-merge networking fixes after downstream PR (net-6.14-rc2). No conflicts or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-05 | net: flush_backlog() small changes | Eric Dumazet | 1 | -4/+8
Add READ_ONCE() around reads of skb->dev->reg_state, because this field can be changed from other threads/cpus.

Instead of calling dev_kfree_skb_irq() and kfree_skb() while interrupts are masked and locks held, use a temporary list and use __skb_queue_purge_reason().

Use the SKB_DROP_REASON_DEV_READY drop reason to better describe why these skbs are dropped.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20250204144825.316785-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
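A hedged sketch of the pattern described above (variable names and the backlog locking helpers are illustrative of net/core/dev.c, not a verbatim excerpt): skbs are moved to a private list while the queue lock is held with IRQs off, and only freed afterwards.

	struct sk_buff_head purge_list;
	struct sk_buff *skb, *tmp;

	__skb_queue_head_init(&purge_list);

	backlog_lock_irq_disable(sd);		/* assumed per-cpu backlog locking */
	skb_queue_walk_safe(&sd->input_pkt_queue, skb, tmp) {
		if (READ_ONCE(skb->dev->reg_state) == NETREG_UNREGISTERING) {
			__skb_unlink(skb, &sd->input_pkt_queue);
			__skb_queue_tail(&purge_list, skb);
		}
	}
	backlog_unlock_irq_enable(sd);

	/* Free outside the IRQ-off section, with a precise drop reason. */
	__skb_queue_purge_reason(&purge_list, SKB_DROP_REASON_DEV_READY);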
2025-02-04 | net: warn if NAPI instance wasn't shut down | Jakub Kicinski | 1 | -0/+3
Drivers should always disable a NAPI instance before removing it. If they don't the instance may be queued for polling. Since commit 86e25f40aa1e ("net: napi: Add napi_config") we also remove the NAPI from the busy polling hash table in napi_disable(), so not disabling would leave a stale entry there. Use of busy polling is relatively uncommon so bugs may be lurking in the drivers. Add an explicit warning. Reviewed-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250203215816.1294081-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-03 | net: harmonize tstats and dstats | Paolo Abeni | 1 | -0/+14
After the blamed commits below, some UDP tunnels use dstats for accounting. On the xmit path, all the UDP-based tunnels end up using iptunnel_xmit_stats() for stats accounting, and the latter assumes the relevant (tunnel) network device uses tstats.

The end result is some 'funny' stat report for the mentioned UDP tunnels, e.g. when no packet is actually dropped and a bunch of packets are transmitted:

  gnv2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue \
        state UNKNOWN mode DEFAULT group default qlen 1000
      link/ether ee:7d:09:87:90:ea brd ff:ff:ff:ff:ff:ff
      RX:  bytes packets errors dropped missed   mcast
           14916      23      0      15      0       0
      TX:  bytes packets errors dropped carrier collsns
               0    1566      0       0       0       0

Address the issue by ensuring the same binary layout for the overlapping fields of dstats and tstats. While this solution is a bit hackish, it is smaller and has no performance pitfalls compared to the other alternatives, i.e. supporting both dstats and tstats in iptunnel_xmit_stats() or reverting the blamed commits. With time we should possibly move all the IP-based tunnels (and virtual devices) to dstats.

Fixes: c77200c07491 ("bareudp: Handle stats using NETDEV_PCPU_STAT_DSTATS.")
Fixes: 6fa6de302246 ("geneve: Handle stats using NETDEV_PCPU_STAT_DSTATS.")
Fixes: be226352e8dc ("vxlan: Handle stats using NETDEV_PCPU_STAT_DSTATS.")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/2e1c444cf0f63ae472baff29862c4c869be17031.1738432804.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-30 | net: revert RTNL changes in unregister_netdevice_many_notify() | Eric Dumazet | 1 | -30/+3
This patch reverts the following changes:

  83419b61d187 net: reduce RTNL hold duration in unregister_netdevice_many_notify() (part 2)
  ae646f1a0bb9 net: reduce RTNL hold duration in unregister_netdevice_many_notify() (part 1)
  cfa579f66656 net: no longer hold RTNL while calling flush_all_backlogs()

This caused issues in layers holding a private mutex:

  cleanup_net()
    rtnl_lock();
    mutex_lock(subsystem_mutex);
    unregister_netdevice();
      rtnl_unlock();  // LOCKDEP violation
      rtnl_lock();

I will revisit this in the next cycle, opting in to the new behavior from safe contexts only.

Fixes: cfa579f66656 ("net: no longer hold RTNL while calling flush_all_backlogs()")
Fixes: ae646f1a0bb9 ("net: reduce RTNL hold duration in unregister_netdevice_many_notify() (part 1)")
Fixes: 83419b61d187 ("net: reduce RTNL hold duration in unregister_netdevice_many_notify() (part 2)")
Reported-by: syzbot+5b9196ecf74447172a9a@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/6789d55f.050a0220.20d369.004e.GAE@google.com/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250129142726.747726-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-29 | net: xdp: Disallow attaching device-bound programs in generic mode | Toke Høiland-Jørgensen | 1 | -0/+4
Device-bound programs are used to support RX metadata kfuncs. These kfuncs are driver-specific and rely on the driver context to read the metadata. This means they can't work in generic XDP mode. However, there is no check to disallow such programs from being attached in generic mode, in which case the metadata kfuncs will be called in an invalid context, leading to crashes. Fix this by adding a check to disallow attaching device-bound programs in generic mode. Fixes: 2b3486bc2d23 ("bpf: Introduce device-bound XDP programs") Reported-by: Marcus Wichelmann <marcus.wichelmann@hetzner-cloud.de> Closes: https://lore.kernel.org/r/dae862ec-43b5-41a0-8edf-46c59071cdda@hetzner-cloud.de Tested-by: Marcus Wichelmann <marcus.wichelmann@hetzner-cloud.de> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20250127131344.238147-1-toke@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
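A hedged sketch of the kind of check this adds on the generic (skb-mode) XDP attach path; the exact placement and extack wording in net/core/dev.c may differ:

	if (prog && bpf_prog_is_dev_bound(prog->aux)) {
		NL_SET_ERR_MSG(extack,
			       "Device-bound programs can't be attached in generic XDP mode");
		return -EINVAL;
	}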
2025-01-27 | net: page_pool: don't try to stash the napi id | Jakub Kicinski | 1 | -1/+1
Page pool tried to cache the NAPI ID in page pool info to avoid having a dependency on the life cycle of the NAPI instance. Since the commit under Fixes, the NAPI ID is not populated until napi_enable(), and there's a good chance that the page pool is created before NAPI gets enabled.

Protect the NAPI pointer with the existing page pool mutex; the reading path already holds it. napi_id itself we need to READ_ONCE(); it's protected by netdev_lock(), which we are not holding in page pool.

Before this patch napi IDs were missing for mlx5:

  # ./cli.py --spec netlink/specs/netdev.yaml --dump page-pool-get
  [{'id': 144, 'ifindex': 2, 'inflight': 3072, 'inflight-mem': 12582912},
   {'id': 143, 'ifindex': 2, 'inflight': 5568, 'inflight-mem': 22806528},
   {'id': 142, 'ifindex': 2, 'inflight': 5120, 'inflight-mem': 20971520},
   {'id': 141, 'ifindex': 2, 'inflight': 4992, 'inflight-mem': 20447232},
  ...

After:

  [{'id': 144, 'ifindex': 2, 'inflight': 3072, 'inflight-mem': 12582912, 'napi-id': 565},
   {'id': 143, 'ifindex': 2, 'inflight': 4224, 'inflight-mem': 17301504, 'napi-id': 525},
   {'id': 142, 'ifindex': 2, 'inflight': 4288, 'inflight-mem': 17563648, 'napi-id': 524},
  ...

Fixes: 86e25f40aa1e ("net: napi: Add napi_config")
Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20250123231620.1086401-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-20 | net: provide pending ring configuration in net_device | Jakub Kicinski | 1 | -0/+2
Record the pending configuration in net_device struct. ethtool core duplicates the current config and the specific handlers (for now just ringparam) can modify it. Reviewed-by: Michael Chan <michael.chan@broadcom.com> Link: https://patch.msgid.link/20250119020518.1962249-4-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-20 | net: move HDS config from ethtool state | Jakub Kicinski | 1 | -2/+8
Separate the HDS config from the ethtool state struct. The HDS config contains just simple parameters, not state. Having it as a separate struct will make it easier to clone / copy and also long term potentially make it per-queue. Reviewed-by: Michael Chan <michael.chan@broadcom.com> Link: https://patch.msgid.link/20250119020518.1962249-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-20 | net: sched: refine software bypass handling in tc_run | Xin Long | 1 | -6/+9
This patch addresses issues with filter counting in block (tcf_block), particularly for software bypass scenarios, by introducing a more accurate mechanism using useswcnt.

Previously, filtercnt and skipswcnt were introduced by:

  Commit 2081fd3445fe ("net: sched: cls_api: add filter counter") and
  Commit f631ef39d819 ("net: sched: cls_api: add skip_sw counter")

filtercnt tracked all tp (tcf_proto) objects added to a block, and skipswcnt counted tp objects with the skipsw attribute set.

The problem is: a single tp can contain multiple filters, some with skipsw and others without. The current implementation fails in the case: when the first filter in a tp has skipsw, both skipswcnt and filtercnt are incremented, then adding a second filter without skipsw to the same tp does not modify these counters because tp->counted is already set. This results in bypass software behavior based solely on skipswcnt equaling filtercnt, even when the block includes filters without skipsw. Consequently, filters without skipsw are inadvertently bypassed.

To address this, the patch introduces useswcnt in block to explicitly count tp objects containing at least one filter without skipsw. Key changes include:

  Whenever a filter without skipsw is added, its tp is marked with usesw and counted in useswcnt.
  tc_run() now uses useswcnt to determine software bypass, eliminating reliance on filtercnt and skipswcnt.

This refined approach prevents software bypass for blocks containing mixed filters, ensuring correct behavior in tc_run().

Additionally, as atomic operations on useswcnt ensure thread safety and tp->lock guards access to tp->usesw and tp->counted, the broader lock down_write(&block->cb_lock) is no longer required in tc_new_tfilter(), and this resolves a performance regression caused by the filter counting mechanism during parallel filter insertions.

The improvement can be demonstrated using the following script:

  # cat insert_tc_rules.sh
  tc qdisc add dev ens1f0np0 ingress
  for i in $(seq 16); do
      taskset -c $i tc -b rules_$i.txt &
  done
  wait

Each of the rules_$i.txt files above includes 100000 tc filter rules to a mlx5 driver NIC ens1f0np0.

Without this patch:

  # time sh insert_tc_rules.sh
  real    0m50.780s
  user    0m23.556s
  sys     4m13.032s

With this patch:

  # time sh insert_tc_rules.sh
  real    0m17.718s
  user    0m7.807s
  sys     3m45.050s

Fixes: 047f340b36fc ("net: sched: make skip_sw actually skip software")
Reported-by: Shuang Li <shuali@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Reviewed-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Tested-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-01-18 | net: introduce netdev_napi_exit() | Eric Dumazet | 1 | -6/+17
After 1b23cdbd2bbc ("net: protect netdev->napi_list with netdev_lock()") it makes sense to iterate through dev->napi_list while holding the device lock. Also call synchronize_net() at most one time. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250117232113.1612899-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-18 | net: destroy dev->lock later in free_netdev() | Eric Dumazet | 1 | -2/+2
syzbot complained that free_netdev() was calling netif_napi_del() after the dev->lock mutex has been destroyed. This fires a warning for CONFIG_DEBUG_MUTEXES=y builds.

Move mutex_destroy(&dev->lock) near the end of free_netdev(). [1]

DEBUG_LOCKS_WARN_ON(lock->magic != lock)
WARNING: CPU: 0 PID: 5971 at kernel/locking/mutex.c:564 __mutex_lock_common kernel/locking/mutex.c:564 [inline]
WARNING: CPU: 0 PID: 5971 at kernel/locking/mutex.c:564 __mutex_lock+0xdac/0xee0 kernel/locking/mutex.c:735
Modules linked in:
CPU: 0 UID: 0 PID: 5971 Comm: syz-executor Not tainted 6.13.0-rc7-syzkaller-01131-g8d20dcda404d #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 12/27/2024
RIP: 0010:__mutex_lock_common kernel/locking/mutex.c:564 [inline]
RIP: 0010:__mutex_lock+0xdac/0xee0 kernel/locking/mutex.c:735
Code: 0f b6 04 38 84 c0 0f 85 1a 01 00 00 83 3d 6f 40 4c 04 00 75 19 90 48 c7 c7 60 84 0a 8c 48 c7 c6 00 85 0a 8c e8 f5 dc 91 f5 90 <0f> 0b 90 90 90 e9 c7 f3 ff ff 90 0f 0b 90 e9 29 f8 ff ff 90 0f 0b
RSP: 0018:ffffc90003317580 EFLAGS: 00010246
RAX: ee0f97edaf7b7d00 RBX: ffff8880299f8cb0 RCX: ffff8880323c9e00
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffffc90003317710 R08: ffffffff81602ac2 R09: 1ffff110170c519a
R10: dffffc0000000000 R11: ffffed10170c519b R12: 0000000000000000
R13: 0000000000000000 R14: 1ffff92000662ec4 R15: dffffc0000000000
FS:  000055557a046500(0000) GS:ffff8880b8600000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fd581d46ff8 CR3: 000000006f870000 CR4: 00000000003526f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
netdev_lock include/linux/netdevice.h:2691 [inline]
__netif_napi_del include/linux/netdevice.h:2829 [inline]
netif_napi_del include/linux/netdevice.h:2848 [inline]
free_netdev+0x2d9/0x610 net/core/dev.c:11621
netdev_run_todo+0xf21/0x10d0 net/core/dev.c:11189
nsim_destroy+0x3c3/0x620 drivers/net/netdevsim/netdev.c:1028
__nsim_dev_port_del+0x14b/0x1b0 drivers/net/netdevsim/dev.c:1428
nsim_dev_port_del_all drivers/net/netdevsim/dev.c:1440 [inline]
nsim_dev_reload_destroy+0x28a/0x490 drivers/net/netdevsim/dev.c:1661
nsim_drv_remove+0x58/0x160 drivers/net/netdevsim/dev.c:1676
device_remove drivers/base/dd.c:567 [inline]

Fixes: 1b23cdbd2bbc ("net: protect netdev->napi_list with netdev_lock()")
Reported-by: syzbot+85ff1051228a04613a32@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/678add43.050a0220.303755.0016.GAE@google.com/T/#u
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250117224626.1427577-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-16 | dev: Hold rtnl_net_lock() for dev_ifsioc(). | Kuniyuki Iwashima | 1 | -5/+2
Basically, dev_ifsioc() operates on the passed single netns (except for netdev notifier chains with lower/upper devices, for which we will need more changes).

Let's hold rtnl_net_lock() for dev_ifsioc(). Now NETDEV_CHANGENAME is always triggered under rtnl_net_lock() of the device's netns (in do_setlink() and dev_ifsioc()).

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250115095545.52709-4-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-16 | dev: Remove devnet_rename_sem. | Kuniyuki Iwashima | 1 | -14/+2
devnet_rename_sem is no longer used since commit 0840556e5a3a ("net: Protect dev->name by seqlock."). Also, RTNL serialises dev_change_name(). Let's remove devnet_rename_sem. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250115095545.52709-3-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-16 | dev: Acquire netdev_rename_lock before restoring dev->name in dev_change_name(). | Kuniyuki Iwashima | 1 | -0/+2
The cited commit forgot to add netdev_rename_lock in one of the error paths in dev_change_name(). Let's hold netdev_rename_lock before restoring the old dev->name. Fixes: 0840556e5a3a ("net: Protect dev->name by seqlock.") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250115095545.52709-2-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-15 | Merge branch 'net-reduce-rtnl-pressure-in-unregister_netdevice' | Jakub Kicinski | 1 | -22/+75
Eric Dumazet says: ==================== net: reduce RTNL pressure in unregister_netdevice() One major source of RTNL contention resides in unregister_netdevice() Due to RCU protection of various network structures, and unregister_netdevice() being a synchronous function, it is calling potentially slow functions while holding RTNL. I think we can release RTNL in two points, so that three slow functions are called while RTNL can be used by other threads. v1: https://lore.kernel.org/netdev/20250107130906.098fc8d6@kernel.org/T/#m398c95f5778e1ff70938e079d3c4c43c050ad2a6 ==================== Link: https://patch.msgid.link/20250114205531.967841-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-15 | net: reduce RTNL hold duration in unregister_netdevice_many_notify() (part 2) | Eric Dumazet | 1 | -2/+1
One synchronize_net() call is currently done while holding RTNL. This is source of RTNL contention in workloads adding and deleting many network namespaces per second, because synchronize_rcu() and synchronize_rcu_expedited() can use 60+ ms in some cases. For cleanup_net() use, temporarily release RTNL while calling the last synchronize_net(). This should be safe, because devices are no longer visible to other threads after unlist_netdevice() call and setting dev->reg_state to NETREG_UNREGISTERING. In any case, the new netdev_lock() / netdev_unlock() infrastructure that we are adding should allow to fix potential issues, with a combination of a per-device mutex and dev->reg_state awareness. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jesse Brandeburg <jbrandeburg@cloudflare.com> Link: https://patch.msgid.link/20250114205531.967841-6-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-15 | net: reduce RTNL hold duration in unregister_netdevice_many_notify() (part 1) | Eric Dumazet | 1 | -0/+3
Two synchronize_net() calls are currently done while holding RTNL. This is source of RTNL contention in workloads adding and deleting many network namespaces per second, because synchronize_rcu() and synchronize_rcu_expedited() can use 60+ ms in some cases. For cleanup_net() use, temporarily release RTNL while calling the last synchronize_net(). This should be safe, because devices are no longer visible to other threads at this point. In any case, the new netdev_lock() / netdev_unlock() infrastructure that we are adding should allow to fix potential issues, with a combination of a per-device mutex and dev->reg_state awareness. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jesse Brandeburg <jbrandeburg@cloudflare.com> Link: https://patch.msgid.link/20250114205531.967841-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-15 | net: no longer hold RTNL while calling flush_all_backlogs() | Eric Dumazet | 1 | -3/+28
flush_all_backlogs() is called from unregister_netdevice_many_notify() as part of netdevice dismantles. This is currently called under RTNL, and can last up to 50 ms on busy hosts. There is no reason to hold RTNL at this stage, if our caller is cleanup_net() : netns are no more visible, devices are in NETREG_UNREGISTERING state and no other thread could mess our state while RTNL is temporarily released. In order to provide isolation, this patch provides a separate 'net_todo_list' for cleanup_net(). Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jesse Brandeburg <jbrandeburg@cloudflare.com> Link: https://patch.msgid.link/20250114205531.967841-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-15 | net: no longer assume RTNL is held in flush_all_backlogs() | Eric Dumazet | 1 | -18/+35
flush_all_backlogs() uses per-cpu and static data to hold its temporary data, on the assumption it is called under RTNL protection. Following patch in the series will break this assumption. Use instead a dynamically allocated piece of memory. In the unlikely case the allocation fails, use a boot-time allocated memory. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jesse Brandeburg <jbrandeburg@cloudflare.com> Link: https://patch.msgid.link/20250114205531.967841-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-15 | net: expedite synchronize_net() for cleanup_net() | Eric Dumazet | 1 | -1/+10
cleanup_net() is the single thread responsible for netns dismantles, and a serious bottleneck. Before we can get per-netns RTNL, make sure all synchronize_net() called from this thread are using rcu_synchronize_expedited(). v3: deal with CONFIG_NET_NS=n Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jesse Brandeburg <jbrandeburg@cloudflare.com> Link: https://patch.msgid.link/20250114205531.967841-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
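The idea, sketched from the description above; synchronize_rcu_expedited(), synchronize_rcu() and the existing rtnl_is_locked() fast path are real, but in_cleanup_net() is a placeholder for however the cleanup_net() thread is actually detected:

	void synchronize_net(void)
	{
		might_sleep();
		/* Expedite the grace period for callers that can't tolerate
		 * slow RCU: under RTNL, or (placeholder predicate) when
		 * running as the single cleanup_net() worker thread.
		 */
		if (rtnl_is_locked() || in_cleanup_net())
			synchronize_rcu_expedited();
		else
			synchronize_rcu();
	}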
2025-01-15 | net: protect napi->irq with netdev_lock() | Jakub Kicinski | 1 | -1/+1
Take netdev_lock() in netif_napi_set_irq(). All NAPI "control fields" are now protected by that lock (most of the other ones are set during napi add/del). The napi_hash_node is fully protected by the hash spin lock, but close enough for the kdoc... Reviewed-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250115035319.559603-10-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-15 | net: protect threaded status of NAPI with netdev_lock() | Jakub Kicinski | 1 | -0/+2
Now that NAPI instances can't come and go without holding netdev->lock we can trivially switch from rtnl_lock() to netdev_lock() for setting netdev->threaded via sysfs. Note that since we do not lock netdev_lock around sysfs calls in the core we don't have to "trylock" like we do with rtnl_lock. Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250115035319.559603-9-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-15 | net: make netdev netlink ops hold netdev_lock() | Jakub Kicinski | 1 | -1/+2
In prep for dropping rtnl_lock, start locking netdev->lock in netlink genl ops. We need to be using netdev->up instead of flags & IFF_UP. We can remove the RCU lock protection for the NAPI since NAPI list is protected by netdev->lock already. Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250115035319.559603-8-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-15 | net: protect NAPI enablement with netdev_lock() | Jakub Kicinski | 1 | -7/+34
Wrap napi_enable() / napi_disable() with netdev_lock(). Provide the "already locked" flavor of the API. iavf needs the usual adjustment. A number of drivers call napi_enable() under a spin lock, so they have to be modified to take netdev_lock() first, then spin lock then call napi_enable_locked(). Protecting napi_enable() implies that napi->napi_id is protected by netdev_lock(). Acked-by: Francois Romieu <romieu@fr.zoreil.com> # via-velocity Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250115035319.559603-7-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
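A hedged sketch of the driver-side adjustment described above, for a driver that previously called napi_enable() under its own spin lock (priv->lock and priv->napi are illustrative):

	/* Take the instance lock first, then the driver lock, then use the
	 * "already locked" flavor of the NAPI API.
	 */
	netdev_lock(dev);
	spin_lock_bh(&priv->lock);
	napi_enable_locked(&priv->napi);
	spin_unlock_bh(&priv->lock);
	netdev_unlock(dev);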
2025-01-15 | net: protect netdev->napi_list with netdev_lock() | Jakub Kicinski | 1 | -5/+10
Hold netdev->lock when NAPIs are getting added or removed. This will allow safe access to NAPI instances of a net_device without rtnl_lock. Create a family of helpers which assume the lock is already taken. Switch iavf to them, as it makes extensive use of netdev->lock, already. Reviewed-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250115035319.559603-6-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
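A hedged sketch of the two flavors this family of helpers provides (example_poll and priv are illustrative):

	/* Plain helpers take netdev->lock internally ... */
	netif_napi_add(dev, &priv->napi, example_poll);
	netif_napi_del(&priv->napi);

	/* ... while the _locked variants assume netdev->lock is already
	 * held, e.g. when the caller batches several control-path
	 * operations under one critical section.
	 */
	netdev_lock(dev);
	netif_napi_add_locked(dev, &priv->napi, example_poll);
	netif_napi_del_locked(&priv->napi);
	netdev_unlock(dev);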
2025-01-15 | net: add netdev->up protected by netdev_lock() | Jakub Kicinski | 1 | -2/+2
Some uAPI (netdev netlink) hide net_device's sub-objects while the interface is down to ensure uniform behavior across drivers. To remove the rtnl_lock dependency from those uAPIs we need a way to safely tell if the device is down or up. Add an indication of whether device is open or closed, protected by netdev->lock. The semantics are the same as IFF_UP, but taking netdev_lock around every write to ->flags would be a lot of code churn. We don't want to blanket the entire open / close path by netdev_lock, because it will prevent us from applying it to specific structures - core helpers won't be able to take that lock from any function called by the drivers on open/close paths. So the state of the flag is "pessimistic", as in it may report false negatives, but never false positives. Reviewed-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250115035319.559603-5-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>