Age | Commit message (Collapse) | Author | Files | Lines |
|
There are cases where do_xdp_generic returns bpf_net_context without
clearing it. This causes various memory corruptions, so the missing
bpf_net_ctx_clear must be added.
Reported-by: syzbot+44623300f057a28baf1e@syzkaller.appspotmail.com
Fixes: fecef4cd42c6 ("tun: Assign missing bpf_net_context.")
Signed-off-by: Jeongjun Park <aha310510@gmail.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reported-by: syzbot+3c2b6d5d4bec3b904933@syzkaller.appspotmail.com
Reported-by: syzbot+707d98c8649695eaf329@syzkaller.appspotmail.com
Reported-by: syzbot+c226757eb784a9da3e8b@syzkaller.appspotmail.com
Reported-by: syzbot+61a1cfc2b6632363d319@syzkaller.appspotmail.com
Reported-by: syzbot+709e4c85c904bcd62735@syzkaller.appspotmail.com
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In fact, this structure contains a flexible array at the end, but
historically its size, alignment etc., is calculated manually.
There are several instances of the structure embedded into other
structures, but also there's ongoing effort to remove them and we
could in the meantime declare &net_device properly.
Declare the array explicitly, use struct_size() and store the array
size inside the structure, so that __counted_by() can be applied.
Don't use PTR_ALIGN(), as SLUB itself tries its best to ensure the
allocated buffer is aligned to what the user expects.
Also, change its alignment from %NETDEV_ALIGN to the cacheline size
as per several suggestions on the netdev ML.
bloat-o-meter for vmlinux:
free_netdev 445 440 -5
netdev_freemem 24 - -24
alloc_netdev_mqs 1481 1450 -31
On x86_64 with several NICs of different vendors, I was never able to
get a &net_device pointer not aligned to the cacheline size after the
change.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20240710113036.2125584-1-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
During the introduction of struct bpf_net_context handling for
XDP-redirect, the tun driver has been missed.
Jakub also pointed out that there is another call chain to
do_xdp_generic() originating from netif_receive_skb() and drivers may
use it outside from the NAPI context.
Set the bpf_net_context before invoking BPF XDP program within the TUN
driver. Set the bpf_net_context also in do_xdp_generic() if a xdp
program is available.
Reported-by: syzbot+0b5c75599f1d872bea6f@syzkaller.appspotmail.com
Reported-by: syzbot+5ae46b237278e2369cac@syzkaller.appspotmail.com
Reported-by: syzbot+c1e04a422bbc0f0f2921@syzkaller.appspotmail.com
Fixes: 401cb7dae8130 ("net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20240704144815.j8xQda5r@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
While this is not needed to serialise the ethtool entry points (which
are all under RTNL), drivers may have cause to asynchronously access
dev->ethtool->rss_ctx; taking dev->ethtool->rss_lock allows them to
do this safely without needing to take the RTNL.
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Link: https://patch.msgid.link/7f9c15eb7525bf87af62c275dde3a8570ee8bf0a.1719502240.git.ecree.xilinx@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Currently passed as NULL, but will allow drivers to report back errors
when ethnl support for these ops is added.
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Link: https://patch.msgid.link/6e0012347d175fdd1280363d7bfa76a2f2777e17.1719502240.git.ecree.xilinx@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add a new API to create/modify/remove RSS contexts, that passes in the
newly-chosen context ID (not as a pointer) rather than leaving the
driver to choose it on create. Also pass in the ctx, allowing drivers
to easily use its private data area to store their hardware-specific
state.
Keep the existing .set_rxfh API for now as a fallback, but deprecate it
for custom contexts (rss_context != 0).
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Link: https://patch.msgid.link/45f1fe61df2163c091ec394c9f52000c8b16cc3b.1719502240.git.ecree.xilinx@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Each context stores the RXFH settings (indir, key, and hfunc) as well
as optionally some driver private data.
Delete any still-existing contexts at netdev unregister time.
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Link: https://patch.msgid.link/cbd1c402cec38f2e03124f2ab65b4ae4e08bd90d.1719502240.git.ecree.xilinx@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
net_dev->ethtool is a pointer to new struct ethtool_netdev_state, which
currently contains only the wol_enabled field.
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Link: https://patch.msgid.link/293a562278371de7534ed1eb17531838ca090633.1719502239.git.ecree.xilinx@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Cross-merge networking fixes after downstream PR.
No conflicts.
Adjacent changes:
e3f02f32a050 ("ionic: fix kernel panic due to multi-buffer handling")
d9c04209990b ("ionic: Mark error paths in the data path as unlikely")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The XDP redirect process is two staged:
- bpf_prog_run_xdp() is invoked to run a eBPF program which inspects the
packet and makes decisions. While doing that, the per-CPU variable
bpf_redirect_info is used.
- Afterwards xdp_do_redirect() is invoked and accesses bpf_redirect_info
and it may also access other per-CPU variables like xskmap_flush_list.
At the very end of the NAPI callback, xdp_do_flush() is invoked which
does not access bpf_redirect_info but will touch the individual per-CPU
lists.
The per-CPU variables are only used in the NAPI callback hence disabling
bottom halves is the only protection mechanism. Users from preemptible
context (like cpu_map_kthread_run()) explicitly disable bottom halves
for protections reasons.
Without locking in local_bh_disable() on PREEMPT_RT this data structure
requires explicit locking.
PREEMPT_RT has forced-threaded interrupts enabled and every
NAPI-callback runs in a thread. If each thread has its own data
structure then locking can be avoided.
Create a struct bpf_net_context which contains struct bpf_redirect_info.
Define the variable on stack, use bpf_net_ctx_set() to save a pointer to
it, bpf_net_ctx_clear() removes it again.
The bpf_net_ctx_set() may nest. For instance a function can be used from
within NET_RX_SOFTIRQ/ net_rx_action which uses bpf_net_ctx_set() and
NET_TX_SOFTIRQ which does not. Therefore only the first invocations
updates the pointer.
Use bpf_net_ctx_get_ri() as a wrapper to retrieve the current struct
bpf_redirect_info. The returned data structure is zero initialized to
ensure nothing is leaked from stack. This is done on first usage of the
struct. bpf_net_ctx_set() sets bpf_redirect_info::kern_flags to 0 to
note that initialisation is required. First invocation of
bpf_net_ctx_get_ri() will memset() the data structure and update
bpf_redirect_info::kern_flags.
bpf_redirect_info::nh is excluded from memset because it is only used
once BPF_F_NEIGH is set which also sets the nh member. The kern_flags is
moved past nh to exclude it from memset.
The pointer to bpf_net_context is saved task's task_struct. Using
always the bpf_net_context approach has the advantage that there is
almost zero differences between PREEMPT_RT and non-PREEMPT_RT builds.
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Eduard Zingerman <eddyz87@gmail.com>
Cc: Hao Luo <haoluo@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Song Liu <song@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20240620132727.660738-15-bigeasy@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
softnet_data::process_queue is a per-CPU variable and relies on disabled
BH for its locking. Without per-CPU locking in local_bh_disable() on
PREEMPT_RT this data structure requires explicit locking.
softnet_data::input_queue_head can be updated lockless. This is fine
because this value is only update CPU local by the local backlog_napi
thread.
Add a local_lock_t to softnet_data and use local_lock_nested_bh() for locking
of process_queue. This change adds only lockdep coverage and does not
alter the functional behaviour for !PREEMPT_RT.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20240620132727.660738-11-bigeasy@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The backlog_napi locking (previously RPS) relies on explicit locking if
either RPS or backlog NAPI is enabled. If both are disabled then locking
was achieved by disabling interrupts except on PREEMPT_RT. PREEMPT_RT
was excluded because the needed synchronisation was already provided
local_bh_disable().
Since the introduction of backlog NAPI and making it mandatory for
PREEMPT_RT the ifdef within backlog_lock.*() is obsolete and can be
removed.
Remove the ifdefs in backlog_lock.*().
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20240620132727.660738-10-bigeasy@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Softirq is preemptible on PREEMPT_RT. Without a per-CPU lock in
local_bh_disable() there is no guarantee that only one device is
transmitting at a time.
With preemption and multiple senders it is possible that the per-CPU
`recursion' counter gets incremented by different threads and exceeds
XMIT_RECURSION_LIMIT leading to a false positive recursion alert.
The `more' member is subject to similar problems if set by one thread
for one driver and wrongly used by another driver within another thread.
Instead of adding a lock to protect the per-CPU variable it is simpler
to make xmit per-task. Sending and receiving skbs happens always
in thread context anyway.
Having a lock to protected the per-CPU counter would block/ serialize two
sending threads needlessly. It would also require a recursive lock to
ensure that the owner can increment the counter further.
Make the softnet_data.xmit a task_struct member on PREEMPT_RT. Add
needed wrapper.
Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20240620132727.660738-9-bigeasy@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
syzbot reported a lockdep violation involving bridge driver [1]
Make sure netdev_rename_lock is softirq safe to fix this issue.
[1]
WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
6.10.0-rc2-syzkaller-00249-gbe27b8965297 #0 Not tainted
-----------------------------------------------------
syz-executor.2/9449 [HC0[0]:SC0[2]:HE0:SE0] is trying to acquire:
ffffffff8f5de668 (netdev_rename_lock.seqcount){+.+.}-{0:0}, at: rtnl_fill_ifinfo+0x38e/0x2270 net/core/rtnetlink.c:1839
and this task is already holding:
ffff888060c64cb8 (&br->lock){+.-.}-{2:2}, at: spin_lock_bh include/linux/spinlock.h:356 [inline]
ffff888060c64cb8 (&br->lock){+.-.}-{2:2}, at: br_port_slave_changelink+0x3d/0x150 net/bridge/br_netlink.c:1212
which would create a new lock dependency:
(&br->lock){+.-.}-{2:2} -> (netdev_rename_lock.seqcount){+.+.}-{0:0}
but this new dependency connects a SOFTIRQ-irq-safe lock:
(&br->lock){+.-.}-{2:2}
... which became SOFTIRQ-irq-safe at:
lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5754
__raw_spin_lock include/linux/spinlock_api_smp.h:133 [inline]
_raw_spin_lock+0x2e/0x40 kernel/locking/spinlock.c:154
spin_lock include/linux/spinlock.h:351 [inline]
br_forward_delay_timer_expired+0x50/0x440 net/bridge/br_stp_timer.c:86
call_timer_fn+0x18e/0x650 kernel/time/timer.c:1792
expire_timers kernel/time/timer.c:1843 [inline]
__run_timers kernel/time/timer.c:2417 [inline]
__run_timer_base+0x66a/0x8e0 kernel/time/timer.c:2428
run_timer_base kernel/time/timer.c:2437 [inline]
run_timer_softirq+0xb7/0x170 kernel/time/timer.c:2447
handle_softirqs+0x2c4/0x970 kernel/softirq.c:554
__do_softirq kernel/softirq.c:588 [inline]
invoke_softirq kernel/softirq.c:428 [inline]
__irq_exit_rcu+0xf4/0x1c0 kernel/softirq.c:637
irq_exit_rcu+0x9/0x30 kernel/softirq.c:649
instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1043 [inline]
sysvec_apic_timer_interrupt+0xa6/0xc0 arch/x86/kernel/apic/apic.c:1043
asm_sysvec_apic_timer_interrupt+0x1a/0x20 arch/x86/include/asm/idtentry.h:702
lock_acquire+0x264/0x550 kernel/locking/lockdep.c:5758
fs_reclaim_acquire+0xaf/0x140 mm/page_alloc.c:3800
might_alloc include/linux/sched/mm.h:334 [inline]
slab_pre_alloc_hook mm/slub.c:3890 [inline]
slab_alloc_node mm/slub.c:3980 [inline]
kmalloc_trace_noprof+0x3d/0x2c0 mm/slub.c:4147
kmalloc_noprof include/linux/slab.h:660 [inline]
kzalloc_noprof include/linux/slab.h:778 [inline]
class_dir_create_and_add drivers/base/core.c:3255 [inline]
get_device_parent+0x2a7/0x410 drivers/base/core.c:3315
device_add+0x325/0xbf0 drivers/base/core.c:3645
netdev_register_kobject+0x17e/0x320 net/core/net-sysfs.c:2136
register_netdevice+0x11d5/0x19e0 net/core/dev.c:10375
nsim_init_netdevsim drivers/net/netdevsim/netdev.c:690 [inline]
nsim_create+0x647/0x890 drivers/net/netdevsim/netdev.c:750
__nsim_dev_port_add+0x6c0/0xae0 drivers/net/netdevsim/dev.c:1390
nsim_dev_port_add_all drivers/net/netdevsim/dev.c:1446 [inline]
nsim_dev_reload_create drivers/net/netdevsim/dev.c:1498 [inline]
nsim_dev_reload_up+0x69b/0x8e0 drivers/net/netdevsim/dev.c:985
devlink_reload+0x478/0x870 net/devlink/dev.c:474
devlink_nl_reload_doit+0xbd6/0xe50 net/devlink/dev.c:586
genl_family_rcv_msg_doit net/netlink/genetlink.c:1115 [inline]
genl_family_rcv_msg net/netlink/genetlink.c:1195 [inline]
genl_rcv_msg+0xb14/0xec0 net/netlink/genetlink.c:1210
netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2564
genl_rcv+0x28/0x40 net/netlink/genetlink.c:1219
netlink_unicast_kernel net/netlink/af_netlink.c:1335 [inline]
netlink_unicast+0x7ea/0x980 net/netlink/af_netlink.c:1361
netlink_sendmsg+0x8db/0xcb0 net/netlink/af_netlink.c:1905
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg+0x221/0x270 net/socket.c:745
____sys_sendmsg+0x525/0x7d0 net/socket.c:2585
___sys_sendmsg net/socket.c:2639 [inline]
__sys_sendmsg+0x2b0/0x3a0 net/socket.c:2668
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
to a SOFTIRQ-irq-unsafe lock:
(netdev_rename_lock.seqcount){+.+.}-{0:0}
... which became SOFTIRQ-irq-unsafe at:
...
lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5754
do_write_seqcount_begin_nested include/linux/seqlock.h:469 [inline]
do_write_seqcount_begin include/linux/seqlock.h:495 [inline]
write_seqlock include/linux/seqlock.h:823 [inline]
dev_change_name+0x184/0x920 net/core/dev.c:1229
do_setlink+0xa4b/0x41f0 net/core/rtnetlink.c:2880
__rtnl_newlink net/core/rtnetlink.c:3696 [inline]
rtnl_newlink+0x180b/0x20a0 net/core/rtnetlink.c:3743
rtnetlink_rcv_msg+0x89b/0x1180 net/core/rtnetlink.c:6635
netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2564
netlink_unicast_kernel net/netlink/af_netlink.c:1335 [inline]
netlink_unicast+0x7ea/0x980 net/netlink/af_netlink.c:1361
netlink_sendmsg+0x8db/0xcb0 net/netlink/af_netlink.c:1905
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg+0x221/0x270 net/socket.c:745
__sys_sendto+0x3a4/0x4f0 net/socket.c:2192
__do_sys_sendto net/socket.c:2204 [inline]
__se_sys_sendto net/socket.c:2200 [inline]
__x64_sys_sendto+0xde/0x100 net/socket.c:2200
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
other info that might help us debug this:
Possible interrupt unsafe locking scenario:
CPU0 CPU1
---- ----
lock(netdev_rename_lock.seqcount);
local_irq_disable();
lock(&br->lock);
lock(netdev_rename_lock.seqcount);
<Interrupt>
lock(&br->lock);
*** DEADLOCK ***
3 locks held by syz-executor.2/9449:
#0: ffffffff8f5e7448 (rtnl_mutex){+.+.}-{3:3}, at: rtnl_lock net/core/rtnetlink.c:79 [inline]
#0: ffffffff8f5e7448 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x842/0x1180 net/core/rtnetlink.c:6632
#1: ffff888060c64cb8 (&br->lock){+.-.}-{2:2}, at: spin_lock_bh include/linux/spinlock.h:356 [inline]
#1: ffff888060c64cb8 (&br->lock){+.-.}-{2:2}, at: br_port_slave_changelink+0x3d/0x150 net/bridge/br_netlink.c:1212
#2: ffffffff8e333fa0 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:329 [inline]
#2: ffffffff8e333fa0 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:781 [inline]
#2: ffffffff8e333fa0 (rcu_read_lock){....}-{1:2}, at: team_change_rx_flags+0x29/0x330 drivers/net/team/team_core.c:1767
the dependencies between SOFTIRQ-irq-safe lock and the holding lock:
-> (&br->lock){+.-.}-{2:2} {
HARDIRQ-ON-W at:
lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5754
__raw_spin_lock_bh include/linux/spinlock_api_smp.h:126 [inline]
_raw_spin_lock_bh+0x35/0x50 kernel/locking/spinlock.c:178
spin_lock_bh include/linux/spinlock.h:356 [inline]
br_add_if+0xb34/0xef0 net/bridge/br_if.c:682
do_set_master net/core/rtnetlink.c:2701 [inline]
do_setlink+0xe70/0x41f0 net/core/rtnetlink.c:2907
__rtnl_newlink net/core/rtnetlink.c:3696 [inline]
rtnl_newlink+0x180b/0x20a0 net/core/rtnetlink.c:3743
rtnetlink_rcv_msg+0x89b/0x1180 net/core/rtnetlink.c:6635
netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2564
netlink_unicast_kernel net/netlink/af_netlink.c:1335 [inline]
netlink_unicast+0x7ea/0x980 net/netlink/af_netlink.c:1361
netlink_sendmsg+0x8db/0xcb0 net/netlink/af_netlink.c:1905
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg+0x221/0x270 net/socket.c:745
__sys_sendto+0x3a4/0x4f0 net/socket.c:2192
__do_sys_sendto net/socket.c:2204 [inline]
__se_sys_sendto net/socket.c:2200 [inline]
__x64_sys_sendto+0xde/0x100 net/socket.c:2200
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
IN-SOFTIRQ-W at:
lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5754
__raw_spin_lock include/linux/spinlock_api_smp.h:133 [inline]
_raw_spin_lock+0x2e/0x40 kernel/locking/spinlock.c:154
spin_lock include/linux/spinlock.h:351 [inline]
br_forward_delay_timer_expired+0x50/0x440 net/bridge/br_stp_timer.c:86
call_timer_fn+0x18e/0x650 kernel/time/timer.c:1792
expire_timers kernel/time/timer.c:1843 [inline]
__run_timers kernel/time/timer.c:2417 [inline]
__run_timer_base+0x66a/0x8e0 kernel/time/timer.c:2428
run_timer_base kernel/time/timer.c:2437 [inline]
run_timer_softirq+0xb7/0x170 kernel/time/timer.c:2447
handle_softirqs+0x2c4/0x970 kernel/softirq.c:554
__do_softirq kernel/softirq.c:588 [inline]
invoke_softirq kernel/softirq.c:428 [inline]
__irq_exit_rcu+0xf4/0x1c0 kernel/softirq.c:637
irq_exit_rcu+0x9/0x30 kernel/softirq.c:649
instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1043 [inline]
sysvec_apic_timer_interrupt+0xa6/0xc0 arch/x86/kernel/apic/apic.c:1043
asm_sysvec_apic_timer_interrupt+0x1a/0x20 arch/x86/include/asm/idtentry.h:702
lock_acquire+0x264/0x550 kernel/locking/lockdep.c:5758
fs_reclaim_acquire+0xaf/0x140 mm/page_alloc.c:3800
might_alloc include/linux/sched/mm.h:334 [inline]
slab_pre_alloc_hook mm/slub.c:3890 [inline]
slab_alloc_node mm/slub.c:3980 [inline]
kmalloc_trace_noprof+0x3d/0x2c0 mm/slub.c:4147
kmalloc_noprof include/linux/slab.h:660 [inline]
kzalloc_noprof include/linux/slab.h:778 [inline]
class_dir_create_and_add drivers/base/core.c:3255 [inline]
get_device_parent+0x2a7/0x410 drivers/base/core.c:3315
device_add+0x325/0xbf0 drivers/base/core.c:3645
netdev_register_kobject+0x17e/0x320 net/core/net-sysfs.c:2136
register_netdevice+0x11d5/0x19e0 net/core/dev.c:10375
nsim_init_netdevsim drivers/net/netdevsim/netdev.c:690 [inline]
nsim_create+0x647/0x890 drivers/net/netdevsim/netdev.c:750
__nsim_dev_port_add+0x6c0/0xae0 drivers/net/netdevsim/dev.c:1390
nsim_dev_port_add_all drivers/net/netdevsim/dev.c:1446 [inline]
nsim_dev_reload_create drivers/net/netdevsim/dev.c:1498 [inline]
nsim_dev_reload_up+0x69b/0x8e0 drivers/net/netdevsim/dev.c:985
devlink_reload+0x478/0x870 net/devlink/dev.c:474
devlink_nl_reload_doit+0xbd6/0xe50 net/devlink/dev.c:586
genl_family_rcv_msg_doit net/netlink/genetlink.c:1115 [inline]
genl_family_rcv_msg net/netlink/genetlink.c:1195 [inline]
genl_rcv_msg+0xb14/0xec0 net/netlink/genetlink.c:1210
netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2564
genl_rcv+0x28/0x40 net/netlink/genetlink.c:1219
netlink_unicast_kernel net/netlink/af_netlink.c:1335 [inline]
netlink_unicast+0x7ea/0x980 net/netlink/af_netlink.c:1361
netlink_sendmsg+0x8db/0xcb0 net/netlink/af_netlink.c:1905
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg+0x221/0x270 net/socket.c:745
____sys_sendmsg+0x525/0x7d0 net/socket.c:2585
___sys_sendmsg net/socket.c:2639 [inline]
__sys_sendmsg+0x2b0/0x3a0 net/socket.c:2668
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
INITIAL USE at:
lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5754
__raw_spin_lock_bh include/linux/spinlock_api_smp.h:126 [inline]
_raw_spin_lock_bh+0x35/0x50 kernel/locking/spinlock.c:178
spin_lock_bh include/linux/spinlock.h:356 [inline]
br_add_if+0xb34/0xef0 net/bridge/br_if.c:682
do_set_master net/core/rtnetlink.c:2701 [inline]
do_setlink+0xe70/0x41f0 net/core/rtnetlink.c:2907
__rtnl_newlink net/core/rtnetlink.c:3696 [inline]
rtnl_newlink+0x180b/0x20a0 net/core/rtnetlink.c:3743
rtnetlink_rcv_msg+0x89b/0x1180 net/core/rtnetlink.c:6635
netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2564
netlink_unicast_kernel net/netlink/af_netlink.c:1335 [inline]
netlink_unicast+0x7ea/0x980 net/netlink/af_netlink.c:1361
netlink_sendmsg+0x8db/0xcb0 net/netlink/af_netlink.c:1905
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg+0x221/0x270 net/socket.c:745
__sys_sendto+0x3a4/0x4f0 net/socket.c:2192
__do_sys_sendto net/socket.c:2204 [inline]
__se_sys_sendto net/socket.c:2200 [inline]
__x64_sys_sendto+0xde/0x100 net/socket.c:2200
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
}
... key at: [<ffffffff94b9a1a0>] br_dev_setup.__key+0x0/0x20
the dependencies between the lock to be acquired
and SOFTIRQ-irq-unsafe lock:
-> (netdev_rename_lock.seqcount){+.+.}-{0:0} {
HARDIRQ-ON-W at:
lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5754
do_write_seqcount_begin_nested include/linux/seqlock.h:469 [inline]
do_write_seqcount_begin include/linux/seqlock.h:495 [inline]
write_seqlock include/linux/seqlock.h:823 [inline]
dev_change_name+0x184/0x920 net/core/dev.c:1229
do_setlink+0xa4b/0x41f0 net/core/rtnetlink.c:2880
__rtnl_newlink net/core/rtnetlink.c:3696 [inline]
rtnl_newlink+0x180b/0x20a0 net/core/rtnetlink.c:3743
rtnetlink_rcv_msg+0x89b/0x1180 net/core/rtnetlink.c:6635
netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2564
netlink_unicast_kernel net/netlink/af_netlink.c:1335 [inline]
netlink_unicast+0x7ea/0x980 net/netlink/af_netlink.c:1361
netlink_sendmsg+0x8db/0xcb0 net/netlink/af_netlink.c:1905
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg+0x221/0x270 net/socket.c:745
__sys_sendto+0x3a4/0x4f0 net/socket.c:2192
__do_sys_sendto net/socket.c:2204 [inline]
__se_sys_sendto net/socket.c:2200 [inline]
__x64_sys_sendto+0xde/0x100 net/socket.c:2200
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
SOFTIRQ-ON-W at:
lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5754
do_write_seqcount_begin_nested include/linux/seqlock.h:469 [inline]
do_write_seqcount_begin include/linux/seqlock.h:495 [inline]
write_seqlock include/linux/seqlock.h:823 [inline]
dev_change_name+0x184/0x920 net/core/dev.c:1229
do_setlink+0xa4b/0x41f0 net/core/rtnetlink.c:2880
__rtnl_newlink net/core/rtnetlink.c:3696 [inline]
rtnl_newlink+0x180b/0x20a0 net/core/rtnetlink.c:3743
rtnetlink_rcv_msg+0x89b/0x1180 net/core/rtnetlink.c:6635
netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2564
netlink_unicast_kernel net/netlink/af_netlink.c:1335 [inline]
netlink_unicast+0x7ea/0x980 net/netlink/af_netlink.c:1361
netlink_sendmsg+0x8db/0xcb0 net/netlink/af_netlink.c:1905
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg+0x221/0x270 net/socket.c:745
__sys_sendto+0x3a4/0x4f0 net/socket.c:2192
__do_sys_sendto net/socket.c:2204 [inline]
__se_sys_sendto net/socket.c:2200 [inline]
__x64_sys_sendto+0xde/0x100 net/socket.c:2200
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
INITIAL USE at:
lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5754
do_write_seqcount_begin_nested include/linux/seqlock.h:469 [inline]
do_write_seqcount_begin include/linux/seqlock.h:495 [inline]
write_seqlock include/linux/seqlock.h:823 [inline]
dev_change_name+0x184/0x920 net/core/dev.c:1229
do_setlink+0xa4b/0x41f0 net/core/rtnetlink.c:2880
__rtnl_newlink net/core/rtnetlink.c:3696 [inline]
rtnl_newlink+0x180b/0x20a0 net/core/rtnetlink.c:3743
rtnetlink_rcv_msg+0x89b/0x1180 net/core/rtnetlink.c:6635
netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2564
netlink_unicast_kernel net/netlink/af_netlink.c:1335 [inline]
netlink_unicast+0x7ea/0x980 net/netlink/af_netlink.c:1361
netlink_sendmsg+0x8db/0xcb0 net/netlink/af_netlink.c:1905
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg+0x221/0x270 net/socket.c:745
__sys_sendto+0x3a4/0x4f0 net/socket.c:2192
__do_sys_sendto net/socket.c:2204 [inline]
__se_sys_sendto net/socket.c:2200 [inline]
__x64_sys_sendto+0xde/0x100 net/socket.c:2200
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
INITIAL READ USE at:
lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5754
seqcount_lockdep_reader_access include/linux/seqlock.h:72 [inline]
read_seqbegin include/linux/seqlock.h:772 [inline]
netdev_copy_name+0x168/0x2c0 net/core/dev.c:949
rtnl_fill_ifinfo+0x38e/0x2270 net/core/rtnetlink.c:1839
rtmsg_ifinfo_build_skb+0x18a/0x260 net/core/rtnetlink.c:4073
rtmsg_ifinfo_event net/core/rtnetlink.c:4107 [inline]
rtmsg_ifinfo+0x91/0x1b0 net/core/rtnetlink.c:4116
register_netdevice+0x1665/0x19e0 net/core/dev.c:10422
register_netdev+0x3b/0x50 net/core/dev.c:10512
loopback_net_init+0x73/0x150 drivers/net/loopback.c:217
ops_init+0x359/0x610 net/core/net_namespace.c:139
__register_pernet_operations net/core/net_namespace.c:1247 [inline]
register_pernet_operations+0x2cb/0x660 net/core/net_namespace.c:1320
register_pernet_device+0x33/0x80 net/core/net_namespace.c:1407
net_dev_init+0xfcd/0x10d0 net/core/dev.c:11956
do_one_initcall+0x248/0x880 init/main.c:1267
do_initcall_level+0x157/0x210 init/main.c:1329
do_initcalls+0x3f/0x80 init/main.c:1345
kernel_init_freeable+0x435/0x5d0 init/main.c:1578
kernel_init+0x1d/0x2b0 init/main.c:1467
ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
}
... key at: [<ffffffff8f5de668>] netdev_rename_lock+0x8/0xa0
... acquired at:
lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5754
seqcount_lockdep_reader_access include/linux/seqlock.h:72 [inline]
read_seqbegin include/linux/seqlock.h:772 [inline]
netdev_copy_name+0x168/0x2c0 net/core/dev.c:949
rtnl_fill_ifinfo+0x38e/0x2270 net/core/rtnetlink.c:1839
rtmsg_ifinfo_build_skb+0x18a/0x260 net/core/rtnetlink.c:4073
rtmsg_ifinfo_event net/core/rtnetlink.c:4107 [inline]
rtmsg_ifinfo+0x91/0x1b0 net/core/rtnetlink.c:4116
__dev_notify_flags+0xf7/0x400 net/core/dev.c:8816
__dev_set_promiscuity+0x152/0x5a0 net/core/dev.c:8588
dev_set_promiscuity+0x51/0xe0 net/core/dev.c:8608
team_change_rx_flags+0x203/0x330 drivers/net/team/team_core.c:1771
dev_change_rx_flags net/core/dev.c:8541 [inline]
__dev_set_promiscuity+0x406/0x5a0 net/core/dev.c:8585
dev_set_promiscuity+0x51/0xe0 net/core/dev.c:8608
br_port_clear_promisc net/bridge/br_if.c:135 [inline]
br_manage_promisc+0x505/0x590 net/bridge/br_if.c:172
nbp_update_port_count net/bridge/br_if.c:242 [inline]
br_port_flags_change+0x161/0x1f0 net/bridge/br_if.c:761
br_setport+0xcb5/0x16d0 net/bridge/br_netlink.c:1000
br_port_slave_changelink+0x135/0x150 net/bridge/br_netlink.c:1213
__rtnl_newlink net/core/rtnetlink.c:3689 [inline]
rtnl_newlink+0x169f/0x20a0 net/core/rtnetlink.c:3743
rtnetlink_rcv_msg+0x89b/0x1180 net/core/rtnetlink.c:6635
netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2564
netlink_unicast_kernel net/netlink/af_netlink.c:1335 [inline]
netlink_unicast+0x7ea/0x980 net/netlink/af_netlink.c:1361
netlink_sendmsg+0x8db/0xcb0 net/netlink/af_netlink.c:1905
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg+0x221/0x270 net/socket.c:745
____sys_sendmsg+0x525/0x7d0 net/socket.c:2585
___sys_sendmsg net/socket.c:2639 [inline]
__sys_sendmsg+0x2b0/0x3a0 net/socket.c:2668
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
stack backtrace:
CPU: 0 PID: 9449 Comm: syz-executor.2 Not tainted 6.10.0-rc2-syzkaller-00249-gbe27b8965297 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 06/07/2024
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114
print_bad_irq_dependency kernel/locking/lockdep.c:2626 [inline]
check_irq_usage kernel/locking/lockdep.c:2865 [inline]
check_prev_add kernel/locking/lockdep.c:3138 [inline]
check_prevs_add kernel/locking/lockdep.c:3253 [inline]
validate_chain+0x4de0/0x5900 kernel/locking/lockdep.c:3869
__lock_acquire+0x1346/0x1fd0 kernel/locking/lockdep.c:5137
lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5754
seqcount_lockdep_reader_access include/linux/seqlock.h:72 [inline]
read_seqbegin include/linux/seqlock.h:772 [inline]
netdev_copy_name+0x168/0x2c0 net/core/dev.c:949
rtnl_fill_ifinfo+0x38e/0x2270 net/core/rtnetlink.c:1839
rtmsg_ifinfo_build_skb+0x18a/0x260 net/core/rtnetlink.c:4073
rtmsg_ifinfo_event net/core/rtnetlink.c:4107 [inline]
rtmsg_ifinfo+0x91/0x1b0 net/core/rtnetlink.c:4116
__dev_notify_flags+0xf7/0x400 net/core/dev.c:8816
__dev_set_promiscuity+0x152/0x5a0 net/core/dev.c:8588
dev_set_promiscuity+0x51/0xe0 net/core/dev.c:8608
team_change_rx_flags+0x203/0x330 drivers/net/team/team_core.c:1771
dev_change_rx_flags net/core/dev.c:8541 [inline]
__dev_set_promiscuity+0x406/0x5a0 net/core/dev.c:8585
dev_set_promiscuity+0x51/0xe0 net/core/dev.c:8608
br_port_clear_promisc net/bridge/br_if.c:135 [inline]
br_manage_promisc+0x505/0x590 net/bridge/br_if.c:172
nbp_update_port_count net/bridge/br_if.c:242 [inline]
br_port_flags_change+0x161/0x1f0 net/bridge/br_if.c:761
br_setport+0xcb5/0x16d0 net/bridge/br_netlink.c:1000
br_port_slave_changelink+0x135/0x150 net/bridge/br_netlink.c:1213
__rtnl_newlink net/core/rtnetlink.c:3689 [inline]
rtnl_newlink+0x169f/0x20a0 net/core/rtnetlink.c:3743
rtnetlink_rcv_msg+0x89b/0x1180 net/core/rtnetlink.c:6635
netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2564
netlink_unicast_kernel net/netlink/af_netlink.c:1335 [inline]
netlink_unicast+0x7ea/0x980 net/netlink/af_netlink.c:1361
netlink_sendmsg+0x8db/0xcb0 net/netlink/af_netlink.c:1905
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg+0x221/0x270 net/socket.c:745
____sys_sendmsg+0x525/0x7d0 net/socket.c:2585
___sys_sendmsg net/socket.c:2639 [inline]
__sys_sendmsg+0x2b0/0x3a0 net/socket.c:2668
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f3b3047cf29
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 e1 20 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f3b311740c8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007f3b305b4050 RCX: 00007f3b3047cf29
RDX: 0000000000000000 RSI: 0000000020000000 RDI: 0000000000000008
RBP: 00007f3b304ec074 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 000000000000006e R14: 00007f3b305b4050 R15: 00007ffca2f3dc68
</TASK>
Fixes: 0840556e5a3a ("net: Protect dev->name by seqlock.")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
skb does not include enough information to find out receiving
sockets/services and netns/containers on packet drops. In theory
skb->dev tells about netns, but it can get cleared/reused, e.g. by TCP
stack for OOO packet lookup. Similarly, skb->sk often identifies a local
sender, and tells nothing about a receiver.
Allow passing an extra receiving socket to the tracepoint to improve
the visibility on receiving drops.
Signed-off-by: Yan Zhai <yan@cloudflare.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
We currently have dev_get_tstats64() for collecting per-cpu stats of
type pcpu_sw_netstats ("tstats"). However, tstats doesn't allow for
accounting tx/rx drops. We do have a stats variant that does have stats
for dropped packets: struct pcpu_dstats, but there are no core helpers
for using those stats.
The VRF driver uses dstats, by providing its own collation/fetch
functions to do so.
This change adds a common implementation for dstats-type collection,
used when pcpu_stat_type == NETDEV_PCPU_STAT_DSTAT. This is based on the
VRF driver's existing stats collator (plus the unused tx_drops stat from
there). We will switch the VRF driver to use this in the next change.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240607-dstats-v3-2-cc781fe116f7@codeconstruct.com.au
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Cross-merge networking fixes after downstream PR.
No conflicts.
Adjacent changes:
drivers/net/ethernet/pensando/ionic/ionic_txrx.c
d9c04209990b ("ionic: Mark error paths in the data path as unlikely")
491aee894a08 ("ionic: fix kernel panic in XDP_TX action")
net/ipv6/ip6_fib.c
b4cb4a1391dc ("net: use unrcu_pointer() helper")
b01e1c030770 ("ipv6: fix possible race in __fib6_drop_pcpu_from()")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
John Sperbeck reported that if we turn off CONFIG_RFS_ACCEL, the 'head'
is not defined, which will trigger compile error. So I move the 'head'
out of the CONFIG_RFS_ACCEL scope.
Fixes: 84b6823cd96b ("net: rps: protect last_qtail with rps_input_queue_tail_save() helper")
Reported-by: John Sperbeck <jsperbeck@google.com>
Closes: https://lore.kernel.org/all/20240529203421.2432481-1-jsperbeck@google.com/
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240530032717.57787-1-kerneljasonxing@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
mono_delivery_time was added to check if skb->tstamp has delivery
time in mono clock base (i.e. EDT) otherwise skb->tstamp has
timestamp in ingress and delivery_time at egress.
Renaming the bitfield from mono_delivery_time to tstamp_type is for
extensibilty for other timestamps such as userspace timestamp
(i.e. SO_TXTIME) set via sock opts.
As we are renaming the mono_delivery_time to tstamp_type, it makes
sense to start assigning tstamp_type based on enum defined
in this commit.
Earlier we used bool arg flag to check if the tstamp is mono in
function skb_set_delivery_time, Now the signature of the functions
accepts tstamp_type to distinguish between mono and real time.
Also skb_set_delivery_type_by_clockid is a new function which accepts
clockid to determine the tstamp_type.
In future tstamp_type:1 can be extended to support userspace timestamp
by increasing the bitfield.
Signed-off-by: Abhishek Chauhan <quic_abchauha@quicinc.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20240509211834.3235191-2-quic_abchauha@quicinc.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
|
The series is causing issues with PHY drivers built as modules.
Since it was only partially applied and the merge window has
opened let's revert and try again for v6.11.
Revert 6916e461e793 ("net: phy: Introduce ethernet link topology representation")
Revert 0ec5ed6c130e ("net: sfp: pass the phy_device when disconnecting an sfp module's PHY")
Revert e75e4e074c44 ("net: phy: add helpers to handle sfp phy connect/disconnect")
Revert fdd353965b52 ("net: sfp: Add helper to return the SFP bus name")
Revert 841942bc6212 ("net: ethtool: Allow passing a phy index for some commands")
Link: https://lore.kernel.org/all/171242462917.4000.9759453824684907063.git-patchwork-notify@kernel.org/
Link: https://lore.kernel.org/all/20240507102822.2023826-1-maxime.chevallier@bootlin.com/
Link: https://lore.kernel.org/r/20240513154156.104281-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Change dev_change_proto_down() and dev_change_proto_down_reason()
to write once on dev->proto_down and dev->proto_down_reason.
Then rtnl_fill_proto_down() can use READ_ONCE() annotations
and run locklessly.
rtnl_proto_down_size() should assume worst case,
because readng dev->proto_down_reason multiple
times would be racy without RTNL in the future.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
In the following patch we want to read dev->allmulti
and dev->promiscuity locklessly from rtnl_fill_ifinfo()
In this patch I change __dev_set_promiscuity() and
__dev_set_allmulti() to write these fields (and dev->flags)
only if they succeed, with WRITE_ONCE() annotations.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
rtnl_fill_ifinfo() can read dev->tx_queue_len locklessly,
granted we add corresponding READ_ONCE()/WRITE_ONCE() annotations.
Add missing READ_ONCE(dev->tx_queue_len) in teql_enqueue()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
dev->threaded can be read locklessly, if we add
corresponding READ_ONCE()/WRITE_ONCE() annotations.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240502173926.2010646-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
We will convert ioctl(SIOCGARP) to RCU, and then we need to copy
dev->name which is currently protected by rtnl_lock().
This patch does the following:
1) Add seqlock netdev_rename_lock to protect dev->name
2) Add netdev_copy_name() that copies dev->name to buffer
under netdev_rename_lock
3) Use netdev_copy_name() in netdev_get_name() and drop
devnet_rename_sem
Suggested-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/netdev/CANn89iJEWs7AYSJqGCUABeVqOCTkErponfZdT5kV-iD=-SajnQ@mail.gmail.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20240430015813.71143-7-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
sysctl_skb_defer_max is used in TCP fast path,
move it to net_hodata.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240429134025.1233626-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This came while reviewing commit c4e86b4363ac ("net: add two more
call_rcu_hurry()").
Paolo asked if adding one synchronize_rcu() would help.
While synchronize_rcu() does not help, making sure to call
rcu_barrier() before msleep(wait) is definitely helping
to make sure lazy call_rcu() are completed.
Instead of waiting ~100 seconds in my tests, the ref_tracker
splats occurs one time only, and netdev_wait_allrefs_any()
latency is reduced to the strict minimum.
Ideally we should audit our call_rcu() users to make sure
no refcount (or cascading call_rcu()) is held too long,
because rcu_barrier() is quite expensive.
Fixes: 0e4be9e57e8c ("net: use exponential backoff in netdev_wait_allrefs")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/all/28bbf698-befb-42f6-b561-851c67f464aa@kernel.org/T/#m76d73ed6b03cd930778ac4d20a777f22a08d6824
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
It is impossible to use init_dummy_netdev together with alloc_netdev()
as the 'setup' argument.
This is because alloc_netdev() initializes some fields in the net_device
structure, and later init_dummy_netdev() memzero them all. This causes
some problems as reported here:
https://lore.kernel.org/all/20240322082336.49f110cc@kernel.org/
Split the init_dummy_netdev() function in two. Create a new function called
init_dummy_netdev_core() that does not memzero the net_device structure.
Then have init_dummy_netdev() memzero-ing and calling
init_dummy_netdev_core(), keeping the old behaviour.
init_dummy_netdev_core() is the new function that could be called as an
argument for alloc_netdev().
Also, create a helper to allocate and initialize dummy net devices,
leveraging init_dummy_netdev_core() as the setup argument. This function
basically simplify the allocation of dummy devices, by allocating and
initializing it. Freeing the device continue to be done through
free_netdev()
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
For dummy devices, exit earlier at free_netdev() instead of executing
the whole function. This is necessary, because dummy devices are
special, and shouldn't have the second part of the function executed.
Otherwise reg_state, which is NETREG_DUMMY, will be overwritten and
there will be no way to identify that this is a dummy device. Also, this
device do not need the final put_device(), since dummy devices are not
registered (through register_netdevice()), where the device reference is
increased (at netdev_register_kobject()/device_add()).
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Fix bad grammar in description of init_dummy_netdev() function. This
topic showed up in the review of the "allocate dummy device dynamically"
patch set.
Suggested-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This is the last member in struct rps_dev_flow which should be
protected locklessly. So finish it.
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
As we can see, rflow->filter can be written/read concurrently, so
lockless access is needed.
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Removing one unnecessary reader protection and add another writer
protection to finish the locklessly proctection job.
Note: the removed READ_ONCE() is not needed because we only have to protect
the locklessly reader in the different context (rps_may_expire_flow()).
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Cross-merge networking fixes after downstream PR.
Conflicts:
include/trace/events/rpcgss.h
386f4a737964 ("trace: events: cleanup deprecated strncpy uses")
a4833e3abae1 ("SUNRPC: Fix rpcgss_context trace event acceptor field")
Adjacent changes:
drivers/net/ethernet/intel/ice/ice_tc_lib.c
2cca35f5dd78 ("ice: Fix checking for unsupported keys on non-tunnel device")
784feaa65dfd ("ice: Add support for PFCP hardware offload in switchdev")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When the mirred action is used on a classful egress qdisc and a packet is
mirrored or redirected to self we hit a qdisc lock deadlock.
See trace below.
[..... other info removed for brevity....]
[ 82.890906]
[ 82.890906] ============================================
[ 82.890906] WARNING: possible recursive locking detected
[ 82.890906] 6.8.0-05205-g77fadd89fe2d-dirty #213 Tainted: G W
[ 82.890906] --------------------------------------------
[ 82.890906] ping/418 is trying to acquire lock:
[ 82.890906] ffff888006994110 (&sch->q.lock){+.-.}-{3:3}, at:
__dev_queue_xmit+0x1778/0x3550
[ 82.890906]
[ 82.890906] but task is already holding lock:
[ 82.890906] ffff888006994110 (&sch->q.lock){+.-.}-{3:3}, at:
__dev_queue_xmit+0x1778/0x3550
[ 82.890906]
[ 82.890906] other info that might help us debug this:
[ 82.890906] Possible unsafe locking scenario:
[ 82.890906]
[ 82.890906] CPU0
[ 82.890906] ----
[ 82.890906] lock(&sch->q.lock);
[ 82.890906] lock(&sch->q.lock);
[ 82.890906]
[ 82.890906] *** DEADLOCK ***
[ 82.890906]
[..... other info removed for brevity....]
Example setup (eth0->eth0) to recreate
tc qdisc add dev eth0 root handle 1: htb default 30
tc filter add dev eth0 handle 1: protocol ip prio 2 matchall \
action mirred egress redirect dev eth0
Another example(eth0->eth1->eth0) to recreate
tc qdisc add dev eth0 root handle 1: htb default 30
tc filter add dev eth0 handle 1: protocol ip prio 2 matchall \
action mirred egress redirect dev eth1
tc qdisc add dev eth1 root handle 1: htb default 30
tc filter add dev eth1 handle 1: protocol ip prio 2 matchall \
action mirred egress redirect dev eth0
We fix this by adding an owner field (CPU id) to struct Qdisc set after
root qdisc is entered. When the softirq enters it a second time, if the
qdisc owner is the same CPU, the packet is dropped to break the loop.
Reported-by: Mingshuai Ren <renmingshuai@huawei.com>
Closes: https://lore.kernel.org/netdev/20240314111713.5979-1-renmingshuai@huawei.com/
Fixes: 3bcb846ca4cf ("net: get rid of spin_trylock() in net_tx_action()")
Fixes: e578d9c02587 ("net: sched: use counter to break reclassify loops")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/20240415210728.36949-1-victor@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Link topologies containing multiple network PHYs attached to the same
net_device can be found when using a PHY as a media converter for use
with an SFP connector, on which an SFP transceiver containing a PHY can
be used.
With the current model, the transceiver's PHY can't be used for
operations such as cable testing, timestamping, macsec offload, etc.
The reason being that most of the logic for these configuration, coming
from either ethtool netlink or ioctls tend to use netdev->phydev, which
in multi-phy systems will reference the PHY closest to the MAC.
Introduce a numbering scheme allowing to enumerate PHY devices that
belong to any netdev, which can in turn allow userspace to take more
precise decisions with regard to each PHY's configuration.
The numbering is maintained per-netdev, in a phy_device_list.
The numbering works similarly to a netdevice's ifindex, with
identifiers that are only recycled once INT_MAX has been reached.
This prevents races that could occur between PHY listing and SFP
transceiver removal/insertion.
The identifiers are assigned at phy_attach time, as the numbering
depends on the netdevice the phy is attached to. The PHY index can be
re-used for PHYs that are persistent.
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Cross-merge networking fixes after downstream PR.
Conflicts:
net/ipv4/ip_gre.c
17af420545a7 ("erspan: make sure erspan_base_hdr is present in skb->head")
5832c4a77d69 ("ip_tunnel: convert __be16 tunnel flags to bitmaps")
https://lore.kernel.org/all/20240402103253.3b54a1cf@canb.auug.org.au/
Adjacent changes:
net/ipv6/ip6_fib.c
d21d40605bca ("ipv6: Fix infinite recursion in fib6_dump_done().")
5fc68320c1fb ("ipv6: remove RTNL protection from inet6_dump_fib()")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
For systems that use CPU isolation (via nohz_full), creating or destroying
a socket with SO_TIMESTAMP, SO_TIMESTAMPNS or SO_TIMESTAMPING with flag
SOF_TIMESTAMPING_RX_SOFTWARE will cause a static key to be enabled/disabled.
This in turn causes undesired IPIs to isolated CPUs.
So enable the static key unconditionally, if CPU isolation is enabled,
thus avoiding the IPIs.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/ZgrUiLLtbEUf9SFn@tpad
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
process_backlog() can batch increments of sd->input_queue_head,
saving some memory bandwidth.
Also add READ_ONCE()/WRITE_ONCE() annotations around
sd->input_queue_head accesses.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
input_queue_tail_incr_save() is incrementing the sd queue_tail
and save it in the flow last_qtail.
Two issues here :
- no lock protects the write on last_qtail, we should use appropriate
annotations.
- We can perform this write after releasing the per-cpu backlog lock,
to decrease this lock hold duration (move away the cache line miss)
Also move input_queue_head_incr() and rps helpers to include/net/rps.h,
while adding rps_ prefix to better reflect their role.
v2: Fixed a build issue (Jakub and kernel build bots)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
We can remove a goto and a label by reversing a condition.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
If under extreme cpu backlog pressure enqueue_to_backlog() has
to drop a packet, it could do this without dirtying a cache line
and potentially slowing down the target cpu.
Move sd->dropped into a separate cache line, and make it atomic.
In non pressure mode, this field is not touched, no need to consume
valuable space in a hot cache line.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
If the device attached to the packet given to enqueue_to_backlog()
is not running, we drop the packet.
But we accidentally increase sd->dropped, giving false signals
to admins: sd->dropped should be reserved to cpu backlog pressure,
not to temporary glitches at device dismantles.
While we are at it, perform the netif_running() test before
we get the rps lock, and use REASON_DEV_READY
drop reason instead of NOT_SPECIFIED.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
There is no reason to consume a full cacheline to store system_page_pool.
We can eventually move it to softnet_data later for full locality control.
Fixes: 2b0cfa6e4956 ("net: add generic percpu page_pool allocator")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Lorenzo Bianconi <lorenzo@kernel.org>
Cc: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://lore.kernel.org/r/20240328173448.2262593-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
TC filters come in 3 variants:
- no flag (try to process in hardware, but fallback to software))
- skip_hw (do not process filter by hardware)
- skip_sw (do not process filter by software)
However skip_sw is implemented so that the skip_sw
flag can first be checked, after it has been matched.
IMHO it's common when using skip_sw, to use it on all rules.
So if all filters in a block is skip_sw filters, then
we can bail early, we can thus avoid having to match
the filters, just to check for the skip_sw flag.
This patch adds a bypass, for when only TC skip_sw rules
are used. The bypass is guarded by a static key, to avoid
harming other workloads.
There are 3 ways that a packet from a skip_sw ruleset, can
end up in the kernel path. Although the send packets to a
non-existent chain way is only improved a few percents, then
I believe it's worth optimizing the trap and fall-though
use-cases.
+----------------------------+--------+--------+--------+
| Test description | Pre- | Post- | Rel. |
| | kpps | kpps | chg. |
+----------------------------+--------+--------+--------+
| basic forwarding + notrack | 3589.3 | 3587.9 | 1.00x |
| switch to eswitch mode | 3081.8 | 3094.7 | 1.00x |
| add ingress qdisc | 3042.9 | 3063.6 | 1.01x |
| tc forward in hw / skip_sw |37024.7 |37028.4 | 1.00x |
| tc forward in sw / skip_hw | 3245.0 | 3245.3 | 1.00x |
+----------------------------+--------+--------+--------+
| tests with only skip_sw rules below: |
+----------------------------+--------+--------+--------+
| 1 non-matching rule | 2694.7 | 3058.7 | 1.14x |
| 1 n-m rule, match trap | 2611.2 | 3323.1 | 1.27x |
| 1 n-m rule, goto non-chain | 2886.8 | 2945.9 | 1.02x |
| 5 non-matching rules | 1958.2 | 3061.3 | 1.56x |
| 5 n-m rules, match trap | 1911.9 | 3327.0 | 1.74x |
| 5 n-m rules, goto non-chain| 2883.1 | 2947.5 | 1.02x |
| 10 non-matching rules | 1466.3 | 3062.8 | 2.09x |
| 10 n-m rules, match trap | 1444.3 | 3317.9 | 2.30x |
| 10 n-m rules,goto non-chain| 2883.1 | 2939.5 | 1.02x |
| 25 non-matching rules | 838.5 | 3058.9 | 3.65x |
| 25 n-m rules, match trap | 824.5 | 3323.0 | 4.03x |
| 25 n-m rules,goto non-chain| 2875.8 | 2944.7 | 1.02x |
| 50 non-matching rules | 488.1 | 3054.7 | 6.26x |
| 50 n-m rules, match trap | 484.9 | 3318.5 | 6.84x |
| 50 n-m rules,goto non-chain| 2884.1 | 2939.7 | 1.02x |
+----------------------------+--------+--------+--------+
perf top (25 n-m skip_sw rules - pre patch):
20.39% [kernel] [k] __skb_flow_dissect
16.43% [kernel] [k] rhashtable_jhash2
10.58% [kernel] [k] fl_classify
10.23% [kernel] [k] fl_mask_lookup
4.79% [kernel] [k] memset_orig
2.58% [kernel] [k] tcf_classify
1.47% [kernel] [k] __x86_indirect_thunk_rax
1.42% [kernel] [k] __dev_queue_xmit
1.36% [kernel] [k] nft_do_chain
1.21% [kernel] [k] __rcu_read_lock
perf top (25 n-m skip_sw rules - post patch):
5.12% [kernel] [k] __dev_queue_xmit
4.77% [kernel] [k] nft_do_chain
3.65% [kernel] [k] dev_gro_receive
3.41% [kernel] [k] check_preemption_disabled
3.14% [kernel] [k] mlx5e_skb_from_cqe_mpwrq_nonlinear
2.88% [kernel] [k] __netif_receive_skb_core.constprop.0
2.49% [kernel] [k] mlx5e_xmit
2.15% [kernel] [k] ip_forward
1.95% [kernel] [k] mlx5e_tc_restore_tunnel
1.92% [kernel] [k] vlan_gro_receive
Test setup:
DUT: Intel Xeon D-1518 (2.20GHz) w/ Nvidia/Mellanox ConnectX-6 Dx 2x100G
Data rate measured on switch (Extreme X690), and DUT connected as
a router on a stick, with pktgen and pktsink as VLANs.
Pktgen-dpdk was in range 36.6-37.7 Mpps 64B packets across all tests.
Full test data at https://files.fiberby.net/ast/2024/tc_skip_sw/v2_tests/
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
System page_pools are percpu and one instance can be used only on
one CPU.
%NUMA_NO_NODE is fine for allocating pages, as the PP core always
allocates local pages in this case. But for the struct &page_pool
itself, this node ID means they are allocated on the boot CPU,
which may belong to a different node than the target CPU.
Pin system page_pools to the corresponding nodes when creating,
so that all the allocated data will always be local. Use
cpu_to_mem() to account memless nodes.
Nodes != 0 win some Kpps when testing with xdp-trafficgen.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://lore.kernel.org/r/20240325160635.3215855-1-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The rps_lock.*() functions use the inner lock of a sk_buff_head for
locking. This lock is used if RPS is enabled, otherwise the list is
accessed lockless and disabling interrupts is enough for the
synchronisation because it is only accessed CPU local. Not only the list
is protected but also the NAPI state protected.
With the addition of backlog threads, the lock is also needed because of
the cross CPU access even without RPS. The clean up of the defer_list
list is also done via backlog threads (if enabled).
It has been suggested to rename the locking function since it is no
longer just RPS.
Rename the rps_lock*() functions to backlog_lock*().
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
The defer_list is a per-CPU list which is used to free skbs outside of
the socket lock and on the CPU on which they have been allocated.
The list is processed during NAPI callbacks so ideally the list is
cleaned up.
Should the amount of skbs on the list exceed a certain water mark then
the softirq is triggered remotely on the target CPU by invoking a remote
function call. The raise of the softirqs via a remote function call
leads to waking the ksoftirqd on PREEMPT_RT which is undesired.
The backlog-NAPI threads already provide the infrastructure which can be
utilized to perform the cleanup of the defer_list.
The NAPI state is updated with the input_pkt_queue.lock acquired. It
order not to break the state, it is needed to also wake the backlog-NAPI
thread with the lock held. This requires to acquire the use the lock in
rps_lock_irq*() if the backlog-NAPI threads are used even with RPS
disabled.
Move the logic of remotely starting softirqs to clean up the defer_list
into kick_defer_list_purge(). Make sure a lock is held in
rps_lock_irq*() if backlog-NAPI threads are used. Schedule backlog-NAPI
for defer_list cleanup if backlog-NAPI is available.
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Backlog NAPI is a per-CPU NAPI struct only (with no device behind it)
used by drivers which don't do NAPI them self, RPS and parts of the
stack which need to avoid recursive deadlocks while processing a packet.
The non-NAPI driver use the CPU local backlog NAPI. If RPS is enabled
then a flow for the skb is computed and based on the flow the skb can be
enqueued on a remote CPU. Scheduling/ raising the softirq (for backlog's
NAPI) on the remote CPU isn't trivial because the softirq is only
scheduled on the local CPU and performed after the hardirq is done.
In order to schedule a softirq on the remote CPU, an IPI is sent to the
remote CPU which schedules the backlog-NAPI on the then local CPU.
On PREEMPT_RT interrupts are force-threaded. The soft interrupts are
raised within the interrupt thread and processed after the interrupt
handler completed still within the context of the interrupt thread. The
softirq is handled in the context where it originated.
With force-threaded interrupts enabled, ksoftirqd is woken up if a
softirq is raised from hardirq context. This is the case if it is raised
from an IPI. Additionally there is a warning on PREEMPT_RT if the
softirq is raised from the idle thread.
This was done for two reasons:
- With threaded interrupts the processing should happen in thread
context (where it originated) and ksoftirqd is the only thread for
this context if raised from hardirq. Using the currently running task
instead would "punish" a random task.
- Once ksoftirqd is active it consumes all further softirqs until it
stops running. This changed recently and is no longer the case.
Instead of keeping the backlog NAPI in ksoftirqd (in force-threaded/
PREEMPT_RT setups) I am proposing NAPI-threads for backlog.
The "proper" setup with threaded-NAPI is not doable because the threads
are not pinned to an individual CPU and can be modified by the user.
Additionally a dummy network device would have to be assigned. Also
CPU-hotplug has to be considered if additional CPUs show up.
All this can be probably done/ solved but the smpboot-threads already
provide this infrastructure.
Sending UDP packets over loopback expects that the packet is processed
within the call. Delaying it by handing it over to the thread hurts
performance. It is not beneficial to the outcome if the context switch
happens immediately after enqueue or after a while to process a few
packets in a batch.
There is no need to always use the thread if the backlog NAPI is
requested on the local CPU. This restores the loopback throuput. The
performance drops mostly to the same value after enabling RPS on the
loopback comparing the IPI and the tread result.
Create NAPI-threads for backlog if request during boot. The thread runs
the inner loop from napi_threaded_poll(), the wait part is different. It
checks for NAPI_STATE_SCHED (the backlog NAPI can not be disabled).
The NAPI threads for backlog are optional, it has to be enabled via the boot
argument "thread_backlog_napi". It is mandatory for PREEMPT_RT to avoid the
wakeup of ksoftirqd from the IPI.
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
A NAPI thread is scheduled by first setting NAPI_STATE_SCHED bit. If
successful (the bit was not yet set) then the NAPI_STATE_SCHED_THREADED
is set but only if thread's state is not TASK_INTERRUPTIBLE (is
TASK_RUNNING) followed by task wakeup.
If the task is idle (TASK_INTERRUPTIBLE) then the
NAPI_STATE_SCHED_THREADED bit is not set. The thread is no relying on
the bit but always leaving the wait-loop after returning from schedule()
because there must have been a wakeup.
The smpboot-threads implementation for per-CPU threads requires an
explicit condition and does not support "if we get out of schedule()
then there must be something to do".
Removing this optimisation simplifies the following integration.
Set NAPI_STATE_SCHED_THREADED unconditionally on wakeup and rely on it
in the wait path by removing the `woken' condition.
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|