aboutsummaryrefslogtreecommitdiffstats
path: root/net/ipv4 (follow)
AgeCommit message (Collapse)AuthorFilesLines
2025-10-23tcp: fix tcp_tso_should_defer() vs large RTTEric Dumazet1-4/+15
[ Upstream commit 295ce1eb36ae47dc862d6c8a1012618a25516208 ] Neal reported that using neper tcp_stream with TCP_TX_DELAY set to 50ms would often lead to flows stuck in a small cwnd mode, regardless of the congestion control. While tcp_stream sets TCP_TX_DELAY too late after the connect(), it highlighted two kernel bugs. The following heuristic in tcp_tso_should_defer() seems wrong for large RTT: delta = tp->tcp_clock_cache - head->tstamp; /* If next ACK is likely to come too late (half srtt), do not defer */ if ((s64)(delta - (u64)NSEC_PER_USEC * (tp->srtt_us >> 4)) < 0) goto send_now; If next ACK is expected to come in more than 1 ms, we should not defer because we prefer a smooth ACK clocking. While blamed commit was a step in the good direction, it was not generic enough. Another patch fixing TCP_TX_DELAY for established flows will be proposed when net-next reopens. Fixes: 50c8339e9299 ("tcp: tso: restore IW10 after TSO autosizing") Reported-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Tested-by: Neal Cardwell <ncardwell@google.com> Link: https://patch.msgid.link/20251011115742.1245771-1-edumazet@google.com [pabeni@redhat.com: fixed whitespace issue] Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-23net/ip6_tunnel: Prevent perpetual tunnel growthDmitry Safonov1-14/+0
[ Upstream commit 21f4d45eba0b2dcae5dbc9e5e0ad08735c993f16 ] Similarly to ipv4 tunnel, ipv6 version updates dev->needed_headroom, too. While ipv4 tunnel headroom adjustment growth was limited in commit 5ae1e9922bbd ("net: ip_tunnel: prevent perpetual headroom growth"), ipv6 tunnel yet increases the headroom without any ceiling. Reflect ipv4 tunnel headroom adjustment limit on ipv6 version. Credits to Francesco Ruggeri, who was originally debugging this issue and wrote local Arista-specific patch and a reproducer. Fixes: 8eb30be0352d ("ipv6: Create ip6_tnl_xmit") Cc: Florian Westphal <fw@strlen.de> Cc: Francesco Ruggeri <fruggeri05@gmail.com> Signed-off-by: Dmitry Safonov <dima@arista.com> Link: https://patch.msgid.link/20251009-ip6_tunnel-headroom-v2-1-8e4dbd8f7e35@arista.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-19tcp: take care of zero tp->window_clamp in tcp_set_rcvlowat()Eric Dumazet1-1/+4
[ Upstream commit 21b29e74ffe5a6c851c235bb80bf5ee26292c67b ] Some applications (like selftests/net/tcp_mmap.c) call SO_RCVLOWAT on their listener, before accept(). This has an unfortunate effect on wscale selection in tcp_select_initial_window() during 3WHS. For instance, tcp_mmap was negotiating wscale 4, regardless of tcp_rmem[2] and sysctl_rmem_max. Do not change tp->window_clamp if it is zero or bigger than our computed value. Zero value is special, it allows tcp_select_initial_window() to enable autotuning. Note that SO_RCVLOWAT use on listener is probably not wise, because tp->scaling_ratio has a default value, possibly wrong. Fixes: d1361840f8c5 ("tcp: fix SO_RCVLOWAT and RCVBUF autotuning") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Link: https://patch.msgid.link/20251003184119.2526655-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-19tcp: Don't call reqsk_fastopen_remove() in tcp_conn_request().Kuniyuki Iwashima1-1/+0
[ Upstream commit 2e7cbbbe3d61c63606994b7ff73c72537afe2e1c ] syzbot reported the splat below in tcp_conn_request(). [0] If a listener is close()d while a TFO socket is being processed in tcp_conn_request(), inet_csk_reqsk_queue_add() does not set reqsk->sk and calls inet_child_forget(), which calls tcp_disconnect() for the TFO socket. After the cited commit, tcp_disconnect() calls reqsk_fastopen_remove(), where reqsk_put() is called due to !reqsk->sk. Then, reqsk_fastopen_remove() in tcp_conn_request() decrements the last req->rsk_refcnt and frees reqsk, and __reqsk_free() at the drop_and_free label causes the refcount underflow for the listener and double-free of the reqsk. Let's remove reqsk_fastopen_remove() in tcp_conn_request(). Note that other callers make sure tp->fastopen_rsk is not NULL. [0]: refcount_t: underflow; use-after-free. WARNING: CPU: 12 PID: 5563 at lib/refcount.c:28 refcount_warn_saturate (lib/refcount.c:28) Modules linked in: CPU: 12 UID: 0 PID: 5563 Comm: syz-executor Not tainted syzkaller #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/12/2025 RIP: 0010:refcount_warn_saturate (lib/refcount.c:28) Code: ab e8 8e b4 98 ff 0f 0b c3 cc cc cc cc cc 80 3d a4 e4 d6 01 00 75 9c c6 05 9b e4 d6 01 01 48 c7 c7 e8 df fb ab e8 6a b4 98 ff <0f> 0b e9 03 5b 76 00 cc 80 3d 7d e4 d6 01 00 0f 85 74 ff ff ff c6 RSP: 0018:ffffa79fc0304a98 EFLAGS: 00010246 RAX: d83af4db1c6b3900 RBX: ffff9f65c7a69020 RCX: d83af4db1c6b3900 RDX: 0000000000000000 RSI: 00000000ffff7fff RDI: ffffffffac78a280 RBP: 000000009d781b60 R08: 0000000000007fff R09: ffffffffac6ca280 R10: 0000000000017ffd R11: 0000000000000004 R12: ffff9f65c7b4f100 R13: ffff9f65c7d23c00 R14: ffff9f65c7d26000 R15: ffff9f65c7a64ef8 FS: 00007f9f962176c0(0000) GS:ffff9f65fcf00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000200000000180 CR3: 000000000dbbe006 CR4: 0000000000372ef0 Call Trace: <IRQ> tcp_conn_request (./include/linux/refcount.h:400 ./include/linux/refcount.h:432 ./include/linux/refcount.h:450 ./include/net/sock.h:1965 ./include/net/request_sock.h:131 net/ipv4/tcp_input.c:7301) tcp_rcv_state_process (net/ipv4/tcp_input.c:6708) tcp_v6_do_rcv (net/ipv6/tcp_ipv6.c:1670) tcp_v6_rcv (net/ipv6/tcp_ipv6.c:1906) ip6_protocol_deliver_rcu (net/ipv6/ip6_input.c:438) ip6_input (net/ipv6/ip6_input.c:500) ipv6_rcv (net/ipv6/ip6_input.c:311) __netif_receive_skb (net/core/dev.c:6104) process_backlog (net/core/dev.c:6456) __napi_poll (net/core/dev.c:7506) net_rx_action (net/core/dev.c:7569 net/core/dev.c:7696) handle_softirqs (kernel/softirq.c:579) do_softirq (kernel/softirq.c:480) </IRQ> Fixes: 45c8a6cc2bcd ("tcp: Clear tcp_sk(sk)->fastopen_rsk in tcp_disconnect().") Reported-by: syzkaller <syzkaller@googlegroups.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251001233755.1340927-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-15tcp: use skb->len instead of skb->truesize in tcp_can_ingest()Eric Dumazet1-2/+13
[ Upstream commit f017c1f768b670bced4464476655b27dfb937e67 ] Some applications are stuck to the 20th century and still use small SO_RCVBUF values. After the blamed commit, we can drop packets especially when using LRO/hw-gro enabled NIC and small MSS (1500) values. LRO/hw-gro NIC pack multiple segments into pages, allowing tp->scaling_ratio to be set to a high value. Whenever the receive queue gets full, we can receive a small packet filling RWIN, but with a high skb->truesize, because most NIC use 4K page plus sk_buff metadata even when receiving less than 1500 bytes of payload. Even if we refine how tp->scaling_ratio is estimated, we could have an issue at the start of the flow, because the first round of packets (IW10) will be sent based on the initial tp->scaling_ratio (1/2) Relax tcp_can_ingest() to use skb->len instead of skb->truesize, allowing the peer to use final RWIN, assuming a 'perfect' scaling_ratio of 1. Fixes: 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250927092827.2707901-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-15tcp: fix __tcp_close() to only send RST when requiredEric Dumazet1-4/+5
[ Upstream commit 5f9238530970f2993b23dd67fdaffc552a2d2e98 ] If the receive queue contains payload that was already received, __tcp_close() can send an unexpected RST. Refine the code to take tp->copied_seq into account, as we already do in tcp recvmsg(). Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Link: https://patch.msgid.link/20250903084720.1168904-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-15inet: ping: check sock_net() in ping_get_port() and ping_lookup()Eric Dumazet1-4/+10
[ Upstream commit 59f26d86b2a16f1406f3b42025062b6d1fba5dd5 ] We need to check socket netns before considering them in ping_get_port(). Otherwise, one malicious netns could 'consume' all ports. Add corresponding check in ping_lookup(). Fixes: c319b4d76b9e ("net: ipv4: add IPPROTO_ICMP socket kind") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Yue Haibing <yuehaibing@huawei.com> Link: https://patch.msgid.link/20250829153054.474201-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-15ipv4: start using dst_dev_rcu()Eric Dumazet4-10/+12
[ Upstream commit 6ad8de3cefdb6ffa6708b21c567df0dbf82c43a8 ] Change icmpv4_xrlim_allow(), ip_defrag() to prevent possible UAF. Change ipmr_prepare_xmit(), ipmr_queue_fwd_xmit(), ip_mr_output(), ipv4_neigh_lookup() to use lockdep enabled dst_dev_rcu(). Fixes: 4a6ce2b6f2ec ("net: introduce a new function dst_dev_put()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250828195823.3958522-9-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-15tcp_metrics: use dst_dev_net_rcu()Eric Dumazet1-3/+3
[ Upstream commit 50c127a69cd6285300931853b352a1918cfa180f ] Replace three dst_dev() with a lockdep enabled helper. Fixes: 4a6ce2b6f2ec ("net: introduce a new function dst_dev_put()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250828195823.3958522-7-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-15net: dst: introduce dst->dev_rcuEric Dumazet1-2/+2
[ Upstream commit caedcc5b6df1b2e2b5f39079e3369c1d4d5c5f50 ] Followup of commit 88fe14253e18 ("net: dst: add four helpers to annotate data-races around dst->dev"). We want to gradually add explicit RCU protection to dst->dev, including lockdep support. Add an union to alias dst->dev_rcu and dst->dev. Add dst_dev_net_rcu() helper. Fixes: 4a6ce2b6f2ec ("net: introduce a new function dst_dev_put()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250828195823.3958522-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-09-23nexthop: Forbid FDB status change while nexthop is in a groupIdo Schimmel1-0/+7
The kernel forbids the creation of non-FDB nexthop groups with FDB nexthops: # ip nexthop add id 1 via 192.0.2.1 fdb # ip nexthop add id 2 group 1 Error: Non FDB nexthop group cannot have fdb nexthops. And vice versa: # ip nexthop add id 3 via 192.0.2.2 dev dummy1 # ip nexthop add id 4 group 3 fdb Error: FDB nexthop group can only have fdb nexthops. However, as long as no routes are pointing to a non-FDB nexthop group, the kernel allows changing the type of a nexthop from FDB to non-FDB and vice versa: # ip nexthop add id 5 via 192.0.2.2 dev dummy1 # ip nexthop add id 6 group 5 # ip nexthop replace id 5 via 192.0.2.2 fdb # echo $? 0 This configuration is invalid and can result in a NPD [1] since FDB nexthops are not associated with a nexthop device: # ip route add 198.51.100.1/32 nhid 6 # ping 198.51.100.1 Fix by preventing nexthop FDB status change while the nexthop is in a group: # ip nexthop add id 7 via 192.0.2.2 dev dummy1 # ip nexthop add id 8 group 7 # ip nexthop replace id 7 via 192.0.2.2 fdb Error: Cannot change nexthop FDB status while in a group. [1] BUG: kernel NULL pointer dereference, address: 00000000000003c0 [...] Oops: Oops: 0000 [#1] SMP CPU: 6 UID: 0 PID: 367 Comm: ping Not tainted 6.17.0-rc6-virtme-gb65678cacc03 #1 PREEMPT(voluntary) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-4.fc41 04/01/2014 RIP: 0010:fib_lookup_good_nhc+0x1e/0x80 [...] Call Trace: <TASK> fib_table_lookup+0x541/0x650 ip_route_output_key_hash_rcu+0x2ea/0x970 ip_route_output_key_hash+0x55/0x80 __ip4_datagram_connect+0x250/0x330 udp_connect+0x2b/0x60 __sys_connect+0x9c/0xd0 __x64_sys_connect+0x18/0x20 do_syscall_64+0xa4/0x2a0 entry_SYSCALL_64_after_hwframe+0x4b/0x53 Fixes: 38428d68719c ("nexthop: support for fdb ecmp nexthops") Reported-by: syzbot+6596516dd2b635ba2350@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/68c9a4d2.050a0220.3c6139.0e63.GAE@google.com/ Tested-by: syzbot+6596516dd2b635ba2350@syzkaller.appspotmail.com Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250921150824.149157-2-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-17tcp: Clear tcp_sk(sk)->fastopen_rsk in tcp_disconnect().Kuniyuki Iwashima1-0/+5
syzbot reported the splat below where a socket had tcp_sk(sk)->fastopen_rsk in the TCP_ESTABLISHED state. [0] syzbot reused the server-side TCP Fast Open socket as a new client before the TFO socket completes 3WHS: 1. accept() 2. connect(AF_UNSPEC) 3. connect() to another destination As of accept(), sk->sk_state is TCP_SYN_RECV, and tcp_disconnect() changes it to TCP_CLOSE and makes connect() possible, which restarts timers. Since tcp_disconnect() forgot to clear tcp_sk(sk)->fastopen_rsk, the retransmit timer triggered the warning and the intended packet was not retransmitted. Let's call reqsk_fastopen_remove() in tcp_disconnect(). [0]: WARNING: CPU: 2 PID: 0 at net/ipv4/tcp_timer.c:542 tcp_retransmit_timer (net/ipv4/tcp_timer.c:542 (discriminator 7)) Modules linked in: CPU: 2 UID: 0 PID: 0 Comm: swapper/2 Not tainted 6.17.0-rc5-g201825fb4278 #62 PREEMPT(voluntary) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 RIP: 0010:tcp_retransmit_timer (net/ipv4/tcp_timer.c:542 (discriminator 7)) Code: 41 55 41 54 55 53 48 8b af b8 08 00 00 48 89 fb 48 85 ed 0f 84 55 01 00 00 0f b6 47 12 3c 03 74 0c 0f b6 47 12 3c 04 74 04 90 <0f> 0b 90 48 8b 85 c0 00 00 00 48 89 ef 48 8b 40 30 e8 6a 4f 06 3e RSP: 0018:ffffc900002f8d40 EFLAGS: 00010293 RAX: 0000000000000002 RBX: ffff888106911400 RCX: 0000000000000017 RDX: 0000000002517619 RSI: ffffffff83764080 RDI: ffff888106911400 RBP: ffff888106d5c000 R08: 0000000000000001 R09: ffffc900002f8de8 R10: 00000000000000c2 R11: ffffc900002f8ff8 R12: ffff888106911540 R13: ffff888106911480 R14: ffff888106911840 R15: ffffc900002f8de0 FS: 0000000000000000(0000) GS:ffff88907b768000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f8044d69d90 CR3: 0000000002c30003 CR4: 0000000000370ef0 Call Trace: <IRQ> tcp_write_timer (net/ipv4/tcp_timer.c:738) call_timer_fn (kernel/time/timer.c:1747) __run_timers (kernel/time/timer.c:1799 kernel/time/timer.c:2372) timer_expire_remote (kernel/time/timer.c:2385 kernel/time/timer.c:2376 kernel/time/timer.c:2135) tmigr_handle_remote_up (kernel/time/timer_migration.c:944 kernel/time/timer_migration.c:1035) __walk_groups.isra.0 (kernel/time/timer_migration.c:533 (discriminator 1)) tmigr_handle_remote (kernel/time/timer_migration.c:1096) handle_softirqs (./arch/x86/include/asm/jump_label.h:36 ./include/trace/events/irq.h:142 kernel/softirq.c:580) irq_exit_rcu (kernel/softirq.c:614 kernel/softirq.c:453 kernel/softirq.c:680 kernel/softirq.c:696) sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1050 (discriminator 35) arch/x86/kernel/apic/apic.c:1050 (discriminator 35)) </IRQ> Fixes: 8336886f786f ("tcp: TCP Fast Open Server - support TFO listeners") Reported-by: syzkaller <syzkaller@googlegroups.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250915175800.118793-2-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-14net/tcp: Fix a NULL pointer dereference when using TCP-AO with TCP_REPAIRAnderson Nascimento1-1/+3
A NULL pointer dereference can occur in tcp_ao_finish_connect() during a connect() system call on a socket with a TCP-AO key added and TCP_REPAIR enabled. The function is called with skb being NULL and attempts to dereference it on tcp_hdr(skb)->seq without a prior skb validation. Fix this by checking if skb is NULL before dereferencing it. The commentary is taken from bpf_skops_established(), which is also called in the same flow. Unlike the function being patched, bpf_skops_established() validates the skb before dereferencing it. int main(void){ struct sockaddr_in sockaddr; struct tcp_ao_add tcp_ao; int sk; int one = 1; memset(&sockaddr,'\0',sizeof(sockaddr)); memset(&tcp_ao,'\0',sizeof(tcp_ao)); sk = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP); sockaddr.sin_family = AF_INET; memcpy(tcp_ao.alg_name,"cmac(aes128)",12); memcpy(tcp_ao.key,"ABCDEFGHABCDEFGH",16); tcp_ao.keylen = 16; memcpy(&tcp_ao.addr,&sockaddr,sizeof(sockaddr)); setsockopt(sk, IPPROTO_TCP, TCP_AO_ADD_KEY, &tcp_ao, sizeof(tcp_ao)); setsockopt(sk, IPPROTO_TCP, TCP_REPAIR, &one, sizeof(one)); sockaddr.sin_family = AF_INET; sockaddr.sin_port = htobe16(123); inet_aton("127.0.0.1", &sockaddr.sin_addr); connect(sk,(struct sockaddr *)&sockaddr,sizeof(sockaddr)); return 0; } $ gcc tcp-ao-nullptr.c -o tcp-ao-nullptr -Wall $ unshare -Urn BUG: kernel NULL pointer dereference, address: 00000000000000b6 PGD 1f648d067 P4D 1f648d067 PUD 1982e8067 PMD 0 Oops: Oops: 0000 [#1] SMP NOPTI Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020 RIP: 0010:tcp_ao_finish_connect (net/ipv4/tcp_ao.c:1182) Fixes: 7c2ffaf21bd6 ("net/tcp: Calculate TCP-AO traffic keys") Signed-off-by: Anderson Nascimento <anderson@allelesecurity.com> Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250911230743.2551-3-anderson@allelesecurity.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-11Merge tag 'net-6.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds1-0/+6
Pull networking fixes from Paolo Abeni: "Including fixes from CAN, netfilter and wireless. We have an IPv6 routing regression with the relevant fix still a WiP. This includes a last-minute revert to avoid more problems. Current release - new code bugs: - wifi: nl80211: completely disable per-link stats for now Previous releases - regressions: - dev_ioctl: take ops lock in hwtstamp lower paths - netfilter: - fix spurious set lookup failures - fix lockdep splat due to missing annotation - genetlink: fix genl_bind() invoking bind() after -EPERM - phy: transfer phy_config_inband() locking responsibility to phylink - can: xilinx_can: fix use-after-free of transmitted SKB - hsr: fix lock warnings - eth: - igb: fix NULL pointer dereference in ethtool loopback test - i40e: fix Jumbo Frame support after iPXE boot - macsec: sync features on RTM_NEWLINK Previous releases - always broken: - tunnels: reset the GSO metadata before reusing the skb - mptcp: make sync_socket_options propagate SOCK_KEEPOPEN - can: j1939: implement NETDEV_UNREGISTER notification hanidler - wifi: ath12k: fix WMI TLV header misalignment" * tag 'net-6.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (47 commits) Revert "net: usb: asix: ax88772: drop phylink use in PM to avoid MDIO runtime PM wakeups" hsr: hold rcu and dev lock for hsr_get_port_ndev hsr: use hsr_for_each_port_rtnl in hsr_port_get_hsr hsr: use rtnl lock when iterating over ports wifi: nl80211: completely disable per-link stats for now net: usb: asix: ax88772: drop phylink use in PM to avoid MDIO runtime PM wakeups net: ethtool: fix wrong type used in struct kernel_ethtool_ts_info MAINTAINERS: add Phil as netfilter reviewer netfilter: nf_tables: restart set lookup on base_seq change netfilter: nf_tables: make nft_set_do_lookup available unconditionally netfilter: nf_tables: place base_seq in struct net netfilter: nft_set_rbtree: continue traversal if element is inactive netfilter: nft_set_pipapo: don't check genbit from packetpath lookups netfilter: nft_set_bitmap: fix lockdep splat due to missing annotation can: rcar_can: rcar_can_resume(): fix s2ram with PSCI can: xilinx_can: xcan_write_frame(): fix use-after-free of transmitted SKB can: j1939: j1939_local_ecu_get(): undo increment when j1939_local_ecu_get() fails can: j1939: j1939_sk_bind(): call j1939_priv_put() immediately when j1939_local_ecu_get() failed can: j1939: implement NETDEV_UNREGISTER notification handler selftests: can: enable CONFIG_CAN_VCAN as a module ...
2025-09-11Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfLinus Torvalds1-1/+4
Pull bpf fixes from Alexei Starovoitov: "A number of fixes accumulated due to summer vacations - Fix out-of-bounds dynptr write in bpf_crypto_crypt() kfunc which was misidentified as a security issue (Daniel Borkmann) - Update the list of BPF selftests maintainers (Eduard Zingerman) - Fix selftests warnings with icecc compiler (Ilya Leoshkevich) - Disable XDP/cpumap direct return optimization (Jesper Dangaard Brouer) - Fix unexpected get_helper_proto() result in unusual configuration BPF_SYSCALL=y and BPF_EVENTS=n (Jiri Olsa) - Allow fallback to interpreter when JIT support is limited (KaFai Wan) - Fix rqspinlock and choose trylock fallback for NMI waiters. Pick the simplest fix. More involved fix is targeted bpf-next (Kumar Kartikeya Dwivedi) - Fix cleanup when tcp_bpf_send_verdict() fails to allocate psock->cork (Kuniyuki Iwashima) - Disallow bpf_timer in PREEMPT_RT for now. Proper solution is being discussed for bpf-next. (Leon Hwang) - Fix XSK cq descriptor production (Maciej Fijalkowski) - Tell memcg to use allow_spinning=false path in bpf_timer_init() to avoid lockup in cgroup_file_notify() (Peilin Ye) - Fix bpf_strnstr() to handle suffix match cases (Rong Tao)" * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests/bpf: Skip timer cases when bpf_timer is not supported bpf: Reject bpf_timer for PREEMPT_RT tcp_bpf: Call sk_msg_free() when tcp_bpf_send_verdict() fails to allocate psock->cork. bpf: Tell memcg to use allow_spinning=false path in bpf_timer_init() bpf: Allow fall back to interpreter for programs with stack size <= 512 rqspinlock: Choose trylock fallback for NMI waiters xsk: Fix immature cq descriptor production bpf: Update the list of BPF selftests maintainers selftests/bpf: Add tests for bpf_strnstr selftests/bpf: Fix "expression result unused" warnings with icecc bpf: Fix bpf_strnstr() to handle suffix match cases better selftests/bpf: Extend crypto_sanity selftest with invalid dst buffer bpf: Fix out-of-bounds dynptr write in bpf_crypto_crypt bpf: Check the helper function is valid in get_helper_proto bpf, cpumap: Disable page_pool direct xdp_return need larger scope
2025-09-10tcp_bpf: Call sk_msg_free() when tcp_bpf_send_verdict() fails to allocate psock->cork.Kuniyuki Iwashima1-1/+4
syzbot reported the splat below. [0] The repro does the following: 1. Load a sk_msg prog that calls bpf_msg_cork_bytes(msg, cork_bytes) 2. Attach the prog to a SOCKMAP 3. Add a socket to the SOCKMAP 4. Activate fault injection 5. Send data less than cork_bytes At 5., the data is carried over to the next sendmsg() as it is smaller than the cork_bytes specified by bpf_msg_cork_bytes(). Then, tcp_bpf_send_verdict() tries to allocate psock->cork to hold the data, but this fails silently due to fault injection + __GFP_NOWARN. If the allocation fails, we need to revert the sk->sk_forward_alloc change done by sk_msg_alloc(). Let's call sk_msg_free() when tcp_bpf_send_verdict fails to allocate psock->cork. The "*copied" also needs to be updated such that a proper error can be returned to the caller, sendmsg. It fails to allocate psock->cork. Nothing has been corked so far, so this patch simply sets "*copied" to 0. [0]: WARNING: net/ipv4/af_inet.c:156 at inet_sock_destruct+0x623/0x730 net/ipv4/af_inet.c:156, CPU#1: syz-executor/5983 Modules linked in: CPU: 1 UID: 0 PID: 5983 Comm: syz-executor Not tainted syzkaller #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/12/2025 RIP: 0010:inet_sock_destruct+0x623/0x730 net/ipv4/af_inet.c:156 Code: 0f 0b 90 e9 62 fe ff ff e8 7a db b5 f7 90 0f 0b 90 e9 95 fe ff ff e8 6c db b5 f7 90 0f 0b 90 e9 bb fe ff ff e8 5e db b5 f7 90 <0f> 0b 90 e9 e1 fe ff ff 89 f9 80 e1 07 80 c1 03 38 c1 0f 8c 9f fc RSP: 0018:ffffc90000a08b48 EFLAGS: 00010246 RAX: ffffffff8a09d0b2 RBX: dffffc0000000000 RCX: ffff888024a23c80 RDX: 0000000000000100 RSI: 0000000000000fff RDI: 0000000000000000 RBP: 0000000000000fff R08: ffff88807e07c627 R09: 1ffff1100fc0f8c4 R10: dffffc0000000000 R11: ffffed100fc0f8c5 R12: ffff88807e07c380 R13: dffffc0000000000 R14: ffff88807e07c60c R15: 1ffff1100fc0f872 FS: 00005555604c4500(0000) GS:ffff888125af1000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00005555604df5c8 CR3: 0000000032b06000 CR4: 00000000003526f0 Call Trace: <IRQ> __sk_destruct+0x86/0x660 net/core/sock.c:2339 rcu_do_batch kernel/rcu/tree.c:2605 [inline] rcu_core+0xca8/0x1770 kernel/rcu/tree.c:2861 handle_softirqs+0x286/0x870 kernel/softirq.c:579 __do_softirq kernel/softirq.c:613 [inline] invoke_softirq kernel/softirq.c:453 [inline] __irq_exit_rcu+0xca/0x1f0 kernel/softirq.c:680 irq_exit_rcu+0x9/0x30 kernel/softirq.c:696 instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1052 [inline] sysvec_apic_timer_interrupt+0xa6/0xc0 arch/x86/kernel/apic/apic.c:1052 </IRQ> Fixes: 4f738adba30a ("bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data") Reported-by: syzbot+4cabd1d2fa917a456db8@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/68c0b6b5.050a0220.3c6139.0013.GAE@google.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20250909232623.4151337-1-kuniyu@google.com
2025-09-09tunnels: reset the GSO metadata before reusing the skbAntoine Tenart1-0/+6
If a GSO skb is sent through a Geneve tunnel and if Geneve options are added, the split GSO skb might not fit in the MTU anymore and an ICMP frag needed packet can be generated. In such case the ICMP packet might go through the segmentation logic (and dropped) later if it reaches a path were the GSO status is checked and segmentation is required. This is especially true when an OvS bridge is used with a Geneve tunnel attached to it. The following set of actions could lead to the ICMP packet being wrongfully segmented: 1. An skb is constructed by the TCP layer (e.g. gso_type SKB_GSO_TCPV4, segs >= 2). 2. The skb hits the OvS bridge where Geneve options are added by an OvS action before being sent through the tunnel. 3. When the skb is xmited in the tunnel, the split skb does not fit anymore in the MTU and iptunnel_pmtud_build_icmp is called to generate an ICMP fragmentation needed packet. This is done by reusing the original (GSO!) skb. The GSO metadata is not cleared. 4. The ICMP packet being sent back hits the OvS bridge again and because skb_is_gso returns true, it goes through queue_gso_packets... 5. ...where __skb_gso_segment is called. The skb is then dropped. 6. Note that in the above example on re-transmission the skb won't be a GSO one as it would be segmented (len > MSS) and the ICMP packet should go through. Fix this by resetting the GSO information before reusing an skb in iptunnel_pmtud_build_icmp and iptunnel_pmtud_build_icmpv6. Fixes: 4cb47a8644cc ("tunnels: PMTU discovery support for directly bridged IP packets") Reported-by: Adrian Moreno <amorenoz@redhat.com> Signed-off-by: Antoine Tenart <atenart@kernel.org> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Link: https://patch.msgid.link/20250904125351.159740-1-atenart@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-03ipv4: Fix NULL vs error pointer check in inet_blackhole_dev_init()Dan Carpenter1-4/+3
The inetdev_init() function never returns NULL. Check for error pointers instead. Fixes: 22600596b675 ("ipv4: give an IPv4 dev to blackhole_netdev") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/aLaQWL9NguWmeM1i@stanley.mountain Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-01icmp: fix icmp_ndo_send address translation for reply directionFabian Bläse1-2/+4
The icmp_ndo_send function was originally introduced to ensure proper rate limiting when icmp_send is called by a network device driver, where the packet's source address may have already been transformed by SNAT. However, the original implementation only considers the IP_CT_DIR_ORIGINAL direction for SNAT and always replaced the packet's source address with that of the original-direction tuple. This causes two problems: 1. For SNAT: Reply-direction packets were incorrectly translated using the source address of the CT original direction, even though no translation is required. 2. For DNAT: Reply-direction packets were not handled at all. In DNAT, the original direction's destination is translated. Therefore, in the reply direction the source address must be set to the reply-direction source, so rate limiting works as intended. Fix this by using the connection direction to select the correct tuple for source address translation, and adjust the pre-checks to handle reply-direction packets in case of DNAT. Additionally, wrap the `ct->status` access in READ_ONCE(). This avoids possible KCSAN reports about concurrent updates to `ct->status`. Fixes: 0b41713b6066 ("icmp: introduce helper for nat'd source address in network device context") Signed-off-by: Fabian Bläse <fabian@blaese.de> Cc: Jason A. Donenfeld <Jason@zx2c4.com> Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-28net: ipv4: fix regression in local-broadcast routesOscar Maes1-3/+7
Commit 9e30ecf23b1b ("net: ipv4: fix incorrect MTU in broadcast routes") introduced a regression where local-broadcast packets would have their gateway set in __mkroute_output, which was caused by fi = NULL being removed. Fix this by resetting the fib_info for local-broadcast packets. This preserves the intended changes for directed-broadcast packets. Cc: stable@vger.kernel.org Fixes: 9e30ecf23b1b ("net: ipv4: fix incorrect MTU in broadcast routes") Reported-by: Brett A C Sheffield <bacs@librecast.net> Closes: https://lore.kernel.org/regressions/20250822165231.4353-4-bacs@librecast.net Signed-off-by: Oscar Maes <oscmaes92@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250827062322.4807-1-oscmaes92@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-08-21netfilter: nf_reject: don't leak dst refcount for loopback packetsFlorian Westphal1-4/+2
recent patches to add a WARN() when replacing skb dst entry found an old bug: WARNING: include/linux/skbuff.h:1165 skb_dst_check_unset include/linux/skbuff.h:1164 [inline] WARNING: include/linux/skbuff.h:1165 skb_dst_set include/linux/skbuff.h:1210 [inline] WARNING: include/linux/skbuff.h:1165 nf_reject_fill_skb_dst+0x2a4/0x330 net/ipv4/netfilter/nf_reject_ipv4.c:234 [..] Call Trace: nf_send_unreach+0x17b/0x6e0 net/ipv4/netfilter/nf_reject_ipv4.c:325 nft_reject_inet_eval+0x4bc/0x690 net/netfilter/nft_reject_inet.c:27 expr_call_ops_eval net/netfilter/nf_tables_core.c:237 [inline] .. This is because blamed commit forgot about loopback packets. Such packets already have a dst_entry attached, even at PRE_ROUTING stage. Instead of checking hook just check if the skb already has a route attached to it. Fixes: f53b9b0bdc59 ("netfilter: introduce support for reject at prerouting stage") Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://patch.msgid.link/20250820123707.10671-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-12Merge tag 'ipsec-2025-08-11' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsecPaolo Abeni1-1/+1
Steffen Klassert says: ==================== pull request (net): ipsec 2025-08-11 1) Fix flushing of all states in xfrm_state_fini. From Sabrina Dubroca. 2) Fix some IPsec software offload features. These got lost with some recent HW offload changes. From Sabrina Dubroca. Please pull or let me know if there are problems. * tag 'ipsec-2025-08-11' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec: udp: also consider secpath when evaluating ipsec use for checksumming xfrm: bring back device check in validate_xmit_xfrm xfrm: restore GSO for SW crypto xfrm: flush all states in xfrm_state_fini ==================== Link: https://patch.msgid.link/20250811092008.731573-1-steffen.klassert@secunet.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-08-07netfilter: add back NETFILTER_XTABLES dependenciesArnd Bergmann1-0/+3
Some Kconfig symbols were changed to depend on the 'bool' symbol NETFILTER_XTABLES_LEGACY, which means they can now be set to built-in when the xtables code itself is in a loadable module: x86_64-linux-ld: vmlinux.o: in function `arpt_unregister_table_pre_exit': (.text+0x1831987): undefined reference to `xt_find_table' x86_64-linux-ld: vmlinux.o: in function `get_info.constprop.0': arp_tables.c:(.text+0x1831aab): undefined reference to `xt_request_find_table_lock' x86_64-linux-ld: arp_tables.c:(.text+0x1831bea): undefined reference to `xt_table_unlock' x86_64-linux-ld: vmlinux.o: in function `do_arpt_get_ctl': arp_tables.c:(.text+0x183205d): undefined reference to `xt_find_table_lock' x86_64-linux-ld: arp_tables.c:(.text+0x18320c1): undefined reference to `xt_table_unlock' x86_64-linux-ld: arp_tables.c:(.text+0x183219a): undefined reference to `xt_recseq' Change these to depend on both NETFILTER_XTABLES and NETFILTER_XTABLES_LEGACY. Fixes: 9fce66583f06 ("netfilter: Exclude LEGACY TABLES on PREEMPT_RT.") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Florian Westphal <fw@strlen.de> Tested-by: Breno Leitao <leitao@debian.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-08-07udp: also consider secpath when evaluating ipsec use for checksummingSabrina Dubroca1-1/+1
Commit b40c5f4fde22 ("udp: disable inner UDP checksum offloads in IPsec case") tried to fix checksumming in UFO when the packets are going through IPsec, so that we can't rely on offloads because the UDP header and payload will be encrypted. But when doing a TCP test over VXLAN going through IPsec transport mode with GSO enabled (esp4_offload module loaded), I'm seeing broken UDP checksums on the encap after successful decryption. The skbs get to udp4_ufo_fragment/__skb_udp_tunnel_segment via __dev_queue_xmit -> validate_xmit_skb -> skb_gso_segment and at this point we've already dropped the dst (unless the device sets IFF_XMIT_DST_RELEASE, which is not common), so need_ipsec is false and we proceed with checksum offload. Make need_ipsec also check the secpath, which is not dropped on this callpath. Fixes: b40c5f4fde22 ("udp: disable inner UDP checksum offloads in IPsec case") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-08-01net: Add locking to protect skb->dev access in ip_outputSharath Chandra Vurukala1-5/+10
In ip_output() skb->dev is updated from the skb_dst(skb)->dev this can become invalid when the interface is unregistered and freed, Introduced new skb_dst_dev_rcu() function to be used instead of skb_dst_dev() within rcu_locks in ip_output.This will ensure that all the skb's associated with the dev being deregistered will be transnmitted out first, before freeing the dev. Given that ip_output() is called within an rcu_read_lock() critical section or from a bottom-half context, it is safe to introduce an RCU read-side critical section within it. Multiple panic call stacks were observed when UL traffic was run in concurrency with device deregistration from different functions, pasting one sample for reference. [496733.627565][T13385] Call trace: [496733.627570][T13385] bpf_prog_ce7c9180c3b128ea_cgroupskb_egres+0x24c/0x7f0 [496733.627581][T13385] __cgroup_bpf_run_filter_skb+0x128/0x498 [496733.627595][T13385] ip_finish_output+0xa4/0xf4 [496733.627605][T13385] ip_output+0x100/0x1a0 [496733.627613][T13385] ip_send_skb+0x68/0x100 [496733.627618][T13385] udp_send_skb+0x1c4/0x384 [496733.627625][T13385] udp_sendmsg+0x7b0/0x898 [496733.627631][T13385] inet_sendmsg+0x5c/0x7c [496733.627639][T13385] __sys_sendto+0x174/0x1e4 [496733.627647][T13385] __arm64_sys_sendto+0x28/0x3c [496733.627653][T13385] invoke_syscall+0x58/0x11c [496733.627662][T13385] el0_svc_common+0x88/0xf4 [496733.627669][T13385] do_el0_svc+0x2c/0xb0 [496733.627676][T13385] el0_svc+0x2c/0xa4 [496733.627683][T13385] el0t_64_sync_handler+0x68/0xb4 [496733.627689][T13385] el0t_64_sync+0x1a4/0x1a8 Changes in v3: - Replaced WARN_ON() with WARN_ON_ONCE(), as suggested by Willem de Bruijn. - Dropped legacy lines mistakenly pulled in from an outdated branch. Changes in v2: - Addressed review comments from Eric Dumazet - Used READ_ONCE() to prevent potential load/store tearing - Added skb_dst_dev_rcu() and used along with rcu_read_lock() in ip_output Signed-off-by: Sharath Chandra Vurukala <quic_sharathv@quicinc.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250730105118.GA26100@hu-sharathv-hyd.qualcomm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-25netfilter: Exclude LEGACY TABLES on PREEMPT_RT.Pablo Neira Ayuso1-12/+12
The seqcount xt_recseq is used to synchronize the replacement of xt_table::private in xt_replace_table() against all readers such as ipt_do_table() To ensure that there is only one writer, the writing side disables bottom halves. The sequence counter can be acquired recursively. Only the first invocation modifies the sequence counter (signaling that a writer is in progress) while the following (recursive) writer does not modify the counter. The lack of a proper locking mechanism for the sequence counter can lead to live lock on PREEMPT_RT if the high prior reader preempts the writer. Additionally if the per-CPU lock on PREEMPT_RT is removed from local_bh_disable() then there is no synchronisation for the per-CPU sequence counter. The affected code is "just" the legacy netfilter code which is replaced by "netfilter tables". That code can be disabled without sacrificing functionality because everything is provided by the newer implementation. This will only requires the usage of the "-nft" tools instead of the "-legacy" ones. The long term plan is to remove the legacy code so lets accelerate the progress. Relax dependencies on iptables legacy, replace select with depends on, this should cause no harm to existing kernel configs and users can still toggle IP{6}_NF_IPTABLES_LEGACY in any case. Make EBTABLES_LEGACY, IPTABLES_LEGACY and ARPTABLES depend on NETFILTER_XTABLES_LEGACY. Hide xt_recseq and its users, xt_register_table() and xt_percpu_counter_alloc() behind NETFILTER_XTABLES_LEGACY. Let NETFILTER_XTABLES_LEGACY depend on !PREEMPT_RT. This will break selftest expecing the legacy options enabled and will be addressed in a following patch. Co-developed-by: Florian Westphal <fw@strlen.de> Co-developed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-07-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2-0/+5
Cross-merge networking fixes after downstream PR (net-6.16-rc8). Conflicts: drivers/net/ethernet/microsoft/mana/gdma_main.c 9669ddda18fb ("net: mana: Fix warnings for missing export.h header inclusion") 755391121038 ("net: mana: Allocate MSI-X vectors dynamically") https://lore.kernel.org/20250711130752.23023d98@canb.auug.org.au Adjacent changes: drivers/net/ethernet/ti/icssg/icssg_prueth.h 6e86fb73de0f ("net: ti: icssg-prueth: Fix buffer allocation for ICSSG") ffe8a4909176 ("net: ti: icssg-prueth: Read firmware-names from device tree") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-24Merge tag 'ipsec-2025-07-23' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsecPaolo Abeni2-0/+5
Steffen Klassert says: ==================== pull request (net): ipsec 2025-07-23 1) Premption fixes for xfrm_state_find. From Sabrina Dubroca. 2) Initialize offload path also for SW IPsec GRO. This fixes a performance regression on SW IPsec offload. From Leon Romanovsky. 3) Fix IPsec UDP GRO for IKE packets. From Tobias Brunner, 4) Fix transport header setting for IPcomp after decompressing. From Fernando Fernandez Mancera. 5) Fix use-after-free when xfrmi_changelink tries to change collect_md for a xfrm interface. From Eyal Birger . 6) Delete the special IPcomp x->tunnel state along with the state x to avoid refcount problems. From Sabrina Dubroca. Please pull or let me know if there are problems. * tag 'ipsec-2025-07-23' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec: Revert "xfrm: destroy xfrm_state synchronously on net exit path" xfrm: delete x->tunnel as we delete x xfrm: interface: fix use-after-free after changing collect_md xfrm interface xfrm: ipcomp: adjust transport header after decompressing xfrm: Set transport header to fix UDP GRO handling xfrm: always initialize offload path xfrm: state: use a consistent pcpu_id in xfrm_state_find xfrm: state: initialize state_ptrs earlier in xfrm_state_find ==================== Link: https://patch.msgid.link/20250723075417.3432644-1-steffen.klassert@secunet.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-22tcp: do not increment BeyondWindow MIB for old seqPaolo Abeni1-1/+5
The mentioned MIB is currently incremented even when a packet with an old sequence number (i.e. a zero window probe) is received, which is IMHO misleading. Explicitly restrict such MIB increment at the relevant events. Fixes: 6c758062c64d ("tcp: add LINUX_MIB_BEYOND_WINDOW") Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Link: https://patch.msgid.link/20d147292eb4b13b6535e0ad6f56be64d9c330d3.1753118029.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-22tcp: do not set a zero size receive bufferPaolo Abeni1-0/+4
The nipa CI is reporting frequent failures in the mptcp_connect self-tests. In the failing scenarios (TCP -> MPTCP) the involved sockets are actually plain TCP ones, as fallback for passive socket at 2whs time cause the MPTCP listener to actually create a TCP socket. The transfer is stuck due to the receiver buffer being zero. With the stronger check in place, tcp_clamp_window() can be invoked while the TCP socket has sk_rmem_alloc == 0, and the receive buffer will be zeroed, too. Check for the critical condition in tcp_prune_queue() and just drop the packet without shrinking the receiver buffer. Fixes: 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks") Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20c18165d3f848e1c5c1b782d88c1a5ab38b3f70.1753118029.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-22tcp: trace retransmit failures in tcp_retransmit_skbFan Yu1-17/+29
Background ========== When TCP retransmits a packet due to missing ACKs, the retransmission may fail for various reasons (e.g., packets stuck in driver queues, receiver zero windows, or routing issues). The original tcp_retransmit_skb tracepoint: 'commit e086101b150a ("tcp: add a tracepoint for tcp retransmission")' lacks visibility into these failure causes, making production diagnostics difficult. Solution ======== Adds the retval("err") to the tcp_retransmit_skb tracepoint. Enables users to know why some tcp retransmission failed and users can filter retransmission failures by retval. Compatibility description ========================= This patch extends the tcp_retransmit_skb tracepoint by adding a new "err" field at the end of its existing structure (within TP_STRUCT__entry). The compatibility implications are detailed as follows: 1) Structural compatibility for legacy user-space tools Legacy tools/BPF programs accessing existing fields (by offset or name) can still work without modification or recompilation.The new field is appended to the end, preserving original memory layout. 2) Note: semantic changes The original tracepoint primarily only focused on successfully retransmitted packets. With this patch, the tracepoint now can figure out packets that may terminate early due to specific reasons. For accurate statistics, users should filter using "err" to distinguish outcomes. Before patched: field:const void * skbaddr; offset:8; size:8; signed:0; field:const void * skaddr; offset:16; size:8; signed:0; field:int state; offset:24; size:4; signed:1; field:__u16 sport; offset:28; size:2; signed:0; field:__u16 dport; offset:30; size:2; signed:0; field:__u16 family; offset:32; size:2; signed:0; field:__u8 saddr[4]; offset:34; size:4; signed:0; field:__u8 daddr[4]; offset:38; size:4; signed:0; field:__u8 saddr_v6[16]; offset:42; size:16; signed:0; field:__u8 daddr_v6[16]; offset:58; size:16; signed:0; print fmt: "skbaddr=%p skaddr=%p family=%s sport=%hu dport=%hu saddr=%pI4 daddr=%pI4 saddrv6=%pI6c daddrv6=%pI6c state=%s" After patched: field:const void * skbaddr; offset:8; size:8; signed:0; field:const void * skaddr; offset:16; size:8; signed:0; field:int state; offset:24; size:4; signed:1; field:__u16 sport; offset:28; size:2; signed:0; field:__u16 dport; offset:30; size:2; signed:0; field:__u16 family; offset:32; size:2; signed:0; field:__u8 saddr[4]; offset:34; size:4; signed:0; field:__u8 daddr[4]; offset:38; size:4; signed:0; field:__u8 saddr_v6[16]; offset:42; size:16; signed:0; field:__u8 daddr_v6[16]; offset:58; size:16; signed:0; field:int err; offset:76; size:4; signed:1; print fmt: "skbaddr=%p skaddr=%p family=%s sport=%hu dport=%hu saddr=%pI4 daddr=%pI4 saddrv6=%pI6c daddrv6=%pI6c state=%s err=%d" Co-developed-by: xu xin <xu.xin16@zte.com.cn> Signed-off-by: xu xin <xu.xin16@zte.com.cn> Signed-off-by: Fan Yu <fan.yu9@zte.com.cn> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250721111607626_BDnIJB0ywk6FghN63bor@zte.com.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-21tcp: add tcp_sock_set_maxsegGeliang Tang1-9/+14
Add a helper tcp_sock_set_maxseg() to directly set the TCP_MAXSEG sockopt from kernel space. This new helper will be used in the following patch from MPTCP. Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250719-net-next-mptcp-tcp_maxseg-v2-2-8c910fbc5307@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-18net: s/dev_get_flags/netif_get_flags/Stanislav Fomichev3-3/+3
Commit cc34acd577f1 ("docs: net: document new locking reality") introduced netif_ vs dev_ function semantics: the former expects locked netdev, the latter takes care of the locking. We don't strictly follow this semantics on either side, but there are more dev_xxx handlers now that don't fit. Rename them to netif_xxx where appropriate. Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250717172333.1288349-6-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-18net: s/dev_get_port_parent_id/netif_get_port_parent_id/Stanislav Fomichev1-1/+1
Commit cc34acd577f1 ("docs: net: document new locking reality") introduced netif_ vs dev_ function semantics: the former expects locked netdev, the latter takes care of the locking. We don't strictly follow this semantics on either side, but there are more dev_xxx handlers now that don't fit. Rename them to netif_xxx where appropriate. Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250717172333.1288349-2-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-18net: track pfmemalloc drops via SKB_DROP_REASON_PFMEMALLOCJesper Dangaard Brouer2-15/+17
Add a new SKB drop reason (SKB_DROP_REASON_PFMEMALLOC) to track packets dropped due to memory pressure. In production environments, we've observed memory exhaustion reported by memory layer stack traces, but these drops were not properly tracked in the SKB drop reason infrastructure. While most network code paths now properly report pfmemalloc drops, some protocol-specific socket implementations still use sk_filter() without drop reason tracking: - Bluetooth L2CAP sockets - CAIF sockets - IUCV sockets - Netlink sockets - SCTP sockets - Unix domain sockets These remaining cases represent less common paths and could be converted in a follow-up patch if needed. The current implementation provides significantly improved observability into memory pressure events in the network stack, especially for key protocols like TCP and UDP, helping to diagnose problems in production environments. Reported-by: Matt Fleming <mfleming@cloudflare.com> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org> Link: https://patch.msgid.link/175268316579.2407873.11634752355644843509.stgit@firesoul Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski1-69/+200
Martin KaFai Lau says: ==================== pull-request: bpf-next 2025-07-17 We've added 13 non-merge commits during the last 20 day(s) which contain a total of 4 files changed, 712 insertions(+), 84 deletions(-). The main changes are: 1) Avoid skipping or repeating a sk when using a TCP bpf_iter, from Jordan Rife. 2) Clarify the driver requirement on using the XDP metadata, from Song Yoong Siang * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: doc: xdp: Clarify driver implementation for XDP Rx metadata selftests/bpf: Add tests for bucket resume logic in established sockets selftests/bpf: Create iter_tcp_destroy test program selftests/bpf: Create established sockets in socket iterator tests selftests/bpf: Make ehash buckets configurable in socket iterator tests selftests/bpf: Allow for iteration over multiple states selftests/bpf: Allow for iteration over multiple ports selftests/bpf: Add tests for bucket resume logic in listening sockets bpf: tcp: Avoid socket skips and repeats during iteration bpf: tcp: Use bpf_tcp_iter_batch_item for bpf_tcp_iter_state batch items bpf: tcp: Get rid of st_bucket_done bpf: tcp: Make sure iter->batch always contains a full bucket snapshot bpf: tcp: Make mem flags configurable through bpf_iter_tcp_realloc_batch ==================== Link: https://patch.msgid.link/20250717191731.4142326-1-martin.lau@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17neighbour: Update pneigh_entry in pneigh_create().Kuniyuki Iwashima1-3/+1
neigh_add() updates pneigh_entry() found or created by pneigh_create(). This update is serialised by RTNL, but we will remove it. Let's move the update part to pneigh_create() and make it return errno instead of a pointer of pneigh_entry. Now, the pneigh code is RTNL free. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250716221221.442239-16-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17neighbour: Split pneigh_lookup().Kuniyuki Iwashima1-2/+2
pneigh_lookup() has ASSERT_RTNL() in the middle of the function, which is confusing. When called with the last argument, creat, 0, pneigh_lookup() literally looks up a proxy neighbour entry. This is the case of the reader path as the fast path and RTM_GETNEIGH. pneigh_lookup(), however, creates a pneigh_entry when called with creat 1 from RTM_NEWNEIGH and SIOCSARP, which require RTNL. Let's split pneigh_lookup() into two functions. We will convert all the reader paths to RCU, and read_lock_bh(&tbl->lock) in the new pneigh_lookup() will be dropped. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250716221221.442239-6-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2-0/+2
Cross-merge networking fixes after downstream PR (net-6.16-rc7). Conflicts: Documentation/netlink/specs/ovpn.yaml 880d43ca9aa4 ("netlink: specs: clean up spaces in brackets") af52020fc599 ("ovpn: reject unexpected netlink attributes") drivers/net/phy/phy_device.c a44312d58e78 ("net: phy: Don't register LEDs for genphy") f0f2b992d818 ("net: phy: Don't register LEDs for genphy") https://lore.kernel.org/20250710114926.7ec3a64f@kernel.org drivers/net/wireless/intel/iwlwifi/fw/regulatory.c drivers/net/wireless/intel/iwlwifi/mld/regulatory.c 5fde0fcbd760 ("wifi: iwlwifi: mask reserved bits in chan_state_active_bitmap") ea045a0de3b9 ("wifi: iwlwifi: add support for accepting raw DSM tables by firmware") net/ipv6/mcast.c ae3264a25a46 ("ipv6: mcast: Delay put pmc->idev in mld_del_delrec()") a8594c956cc9 ("ipv6: mcast: Avoid a duplicate pointer check in mld_del_delrec()") https://lore.kernel.org/8cc52891-3653-4b03-a45e-05464fe495cf@kernel.org No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17net: fix segmentation after TCP/UDP fraglist GROFelix Fietkau2-0/+2
Since "net: gro: use cb instead of skb->network_header", the skb network header is no longer set in the GRO path. This breaks fraglist segmentation, which relies on ip_hdr()/tcp_hdr() to check for address/port changes. Fix this regression by selectively setting the network header for merged segment skbs. Fixes: 186b1ea73ad8 ("net: gro: use cb instead of skb->network_header") Signed-off-by: Felix Fietkau <nbd@nbd.name> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250705150622.10699-1-nbd@nbd.name Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-16tcp: fix UaF in tcp_prune_ofo_queue()Paolo Abeni1-1/+1
The CI reported a UaF in tcp_prune_ofo_queue(): BUG: KASAN: slab-use-after-free in tcp_prune_ofo_queue+0x55d/0x660 Read of size 4 at addr ffff8880134729d8 by task socat/20348 CPU: 0 UID: 0 PID: 20348 Comm: socat Not tainted 6.16.0-rc5-virtme #1 PREEMPT(full) Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 Call Trace: <TASK> dump_stack_lvl+0x82/0xd0 print_address_description.constprop.0+0x2c/0x400 print_report+0xb4/0x270 kasan_report+0xca/0x100 tcp_prune_ofo_queue+0x55d/0x660 tcp_try_rmem_schedule+0x855/0x12e0 tcp_data_queue+0x4dd/0x2260 tcp_rcv_established+0x5e8/0x2370 tcp_v4_do_rcv+0x4ba/0x8c0 __release_sock+0x27a/0x390 release_sock+0x53/0x1d0 tcp_sendmsg+0x37/0x50 sock_write_iter+0x3c1/0x520 vfs_write+0xc09/0x1210 ksys_write+0x183/0x1d0 do_syscall_64+0xc1/0x380 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7fcf73ef2337 Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24 RSP: 002b:00007ffd4f924708 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fcf73ef2337 RDX: 0000000000002000 RSI: 0000555f11d1a000 RDI: 0000000000000008 RBP: 0000555f11d1a000 R08: 0000000000002000 R09: 0000000000000000 R10: 0000000000000040 R11: 0000000000000246 R12: 0000000000000008 R13: 0000000000002000 R14: 0000555ee1a44570 R15: 0000000000002000 </TASK> Allocated by task 20348: kasan_save_stack+0x24/0x50 kasan_save_track+0x14/0x30 __kasan_slab_alloc+0x59/0x70 kmem_cache_alloc_node_noprof+0x110/0x340 __alloc_skb+0x213/0x2e0 tcp_collapse+0x43f/0xff0 tcp_try_rmem_schedule+0x6b9/0x12e0 tcp_data_queue+0x4dd/0x2260 tcp_rcv_established+0x5e8/0x2370 tcp_v4_do_rcv+0x4ba/0x8c0 __release_sock+0x27a/0x390 release_sock+0x53/0x1d0 tcp_sendmsg+0x37/0x50 sock_write_iter+0x3c1/0x520 vfs_write+0xc09/0x1210 ksys_write+0x183/0x1d0 do_syscall_64+0xc1/0x380 entry_SYSCALL_64_after_hwframe+0x77/0x7f Freed by task 20348: kasan_save_stack+0x24/0x50 kasan_save_track+0x14/0x30 kasan_save_free_info+0x3b/0x60 __kasan_slab_free+0x38/0x50 kmem_cache_free+0x149/0x330 tcp_prune_ofo_queue+0x211/0x660 tcp_try_rmem_schedule+0x855/0x12e0 tcp_data_queue+0x4dd/0x2260 tcp_rcv_established+0x5e8/0x2370 tcp_v4_do_rcv+0x4ba/0x8c0 __release_sock+0x27a/0x390 release_sock+0x53/0x1d0 tcp_sendmsg+0x37/0x50 sock_write_iter+0x3c1/0x520 vfs_write+0xc09/0x1210 ksys_write+0x183/0x1d0 do_syscall_64+0xc1/0x380 entry_SYSCALL_64_after_hwframe+0x77/0x7f The buggy address belongs to the object at ffff888013472900 which belongs to the cache skbuff_head_cache of size 232 The buggy address is located 216 bytes inside of freed 232-byte region [ffff888013472900, ffff8880134729e8) The buggy address belongs to the physical page: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x13472 head: order:1 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 flags: 0x80000000000040(head|node=0|zone=1) page_type: f5(slab) raw: 0080000000000040 ffff88800198fb40 ffffea0000347b10 ffffea00004f5290 raw: 0000000000000000 0000000000120012 00000000f5000000 0000000000000000 head: 0080000000000040 ffff88800198fb40 ffffea0000347b10 ffffea00004f5290 head: 0000000000000000 0000000000120012 00000000f5000000 0000000000000000 head: 0080000000000001 ffffea00004d1c81 00000000ffffffff 00000000ffffffff head: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff888013472880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff888013472900: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >ffff888013472980: fb fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc ^ ffff888013472a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff888013472a80: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb Indeed tcp_prune_ofo_queue() is reusing the skb dropped a few lines above. The caller wants to enqueue 'in_skb', lets check space vs the latter. Fixes: 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Tested-by: syzbot+865aca08c0533171bf6a@syzkaller.appspotmail.com Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/b78d2d9bdccca29021eed9a0e7097dd8dc00f485.1752567053.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-14tcp: stronger sk_rcvbuf checksEric Dumazet1-6/+16
Currently, TCP stack accepts incoming packet if sizes of receive queues are below sk->sk_rcvbuf limit. This can cause memory overshoot if the packet is big, like an 1/2 MB BIG TCP one. Refine the check to take into account the incoming skb truesize. Note that we still accept the packet if the receive queue is empty, to not completely freeze TCP flows in pathological conditions. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250711114006.480026-8-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-14tcp: add const to tcp_try_rmem_schedule() and sk_rmem_schedule() skbEric Dumazet1-1/+1
These functions to not modify the skb, add a const qualifier. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250711114006.480026-7-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-14tcp: call tcp_measure_rcv_mss() for ooo packetsEric Dumazet1-0/+1
tcp_measure_rcv_mss() is used to update icsk->icsk_ack.rcv_mss (tcpi_rcv_mss in tcp_info) and tp->scaling_ratio. Calling it from tcp_data_queue_ofo() makes sure these fields are updated, and permits a better tuning of sk->sk_rcvbuf, in the case a new flow receives many ooo packets. Fixes: dfa2f0483360 ("tcp: get rid of sysctl_tcp_adv_win_scale") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250711114006.480026-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-14tcp: add LINUX_MIB_BEYOND_WINDOWEric Dumazet2-0/+2
Add a new SNMP MIB : LINUX_MIB_BEYOND_WINDOW Incremented when an incoming packet is received beyond the receiver window. nstat -az | grep TcpExtBeyondWindow Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250711114006.480026-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-14tcp: do not accept packets beyond windowEric Dumazet1-5/+17
Currently, TCP accepts incoming packets which might go beyond the offered RWIN. Add to tcp_sequence() the validation of packet end sequence. Add the corresponding check in the fast path. We relax this new constraint if the receive queue is empty, to not freeze flows from buggy peers. Add a new drop reason : SKB_DROP_REASON_TCP_INVALID_END_SEQUENCE. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250711114006.480026-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-14net: ipv4: fix incorrect MTU in broadcast routesOscar Maes1-1/+0
Currently, __mkroute_output overrules the MTU value configured for broadcast routes. This buggy behaviour can be reproduced with: ip link set dev eth1 mtu 9000 ip route del broadcast 192.168.0.255 dev eth1 proto kernel scope link src 192.168.0.2 ip route add broadcast 192.168.0.255 dev eth1 proto kernel scope link src 192.168.0.2 mtu 1500 The maximum packet size should be 1500, but it is actually 8000: ping -b 192.168.0.255 -s 8000 Fix __mkroute_output to allow MTU values to be configured for for broadcast routes (to support a mixed-MTU local-area-network). Signed-off-by: Oscar Maes <oscmaes92@gmail.com> Link: https://patch.msgid.link/20250710142714.12986-1-oscmaes92@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-14bpf: tcp: Avoid socket skips and repeats during iterationJordan Rife1-32/+115
Replace the offset-based approach for tracking progress through a bucket in the TCP table with one based on socket cookies. Remember the cookies of unprocessed sockets from the last batch and use this list to pick up where we left off or, in the case that the next socket disappears between reads, find the first socket after that point that still exists in the bucket and resume from there. This approach guarantees that all sockets that existed when iteration began and continue to exist throughout will be visited exactly once. Sockets that are added to the table during iteration may or may not be seen, but if they are they will be seen exactly once. Signed-off-by: Jordan Rife <jordan@jrife.io> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Stanislav Fomichev <sdf@fomichev.me>
2025-07-14bpf: tcp: Use bpf_tcp_iter_batch_item for bpf_tcp_iter_state batch itemsJordan Rife1-10/+14
Prepare for the next patch that tracks cookies between iterations by converting struct sock **batch to union bpf_tcp_iter_batch_item *batch inside struct bpf_tcp_iter_state. Signed-off-by: Jordan Rife <jordan@jrife.io> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me>
2025-07-14bpf: tcp: Get rid of st_bucket_doneJordan Rife1-8/+6
Get rid of the st_bucket_done field to simplify TCP iterator state and logic. Before, st_bucket_done could be false if bpf_iter_tcp_batch returned a partial batch; however, with the last patch ("bpf: tcp: Make sure iter->batch always contains a full bucket snapshot"), st_bucket_done == true is equivalent to iter->cur_sk == iter->end_sk. Signed-off-by: Jordan Rife <jordan@jrife.io> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me>