wireguard-linux/net/netfilter, branch stable

netfilter: nft_ct: add seqadj extension for natted connections

2025-10-29T13:47:59Z

Sequence adjustment may be required for FTP traffic with PASV/EPSV modes. due to need to re-write packet payload (IP, port) on the ftp control connection. This can require changes to the TCP length and expected seq / ack_seq. The easiest way to reproduce this issue is with PASV mode. Example ruleset: table inet ftp_nat { ct helper ftp_helper { type "ftp" protocol tcp l3proto inet } chain prerouting { type filter hook prerouting priority 0; policy accept; tcp dport 21 ct state new ct helper set "ftp_helper" } } table ip nat { chain prerouting { type nat hook prerouting priority -100; policy accept; tcp dport 21 dnat ip prefix to ip daddr map { 192.168.100.1 : 192.168.13.2/32 } } chain postrouting { type nat hook postrouting priority 100 ; policy accept; tcp sport 21 snat ip prefix to ip saddr map { 192.168.13.2 : 192.168.100.1/32 } } } Note that the ftp helper gets assigned *after* the dnat setup. The inverse (nat after helper assign) is handled by an existing check in nf_nat_setup_info() and will not show the problem. Topoloy: +-------------------+ +----------------------------------+ | FTP: 192.168.13.2 | <-> | NAT: 192.168.13.3, 192.168.100.1 | +-------------------+ +----------------------------------+ | +-----------------------+ | Client: 192.168.100.2 | +-----------------------+ ftp nat changes do not work as expected in this case: Connected to 192.168.100.1. [..] ftp> epsv EPSV/EPRT on IPv4 off. ftp> ls 227 Entering passive mode (192,168,100,1,209,129). 421 Service not available, remote server has closed connection. Kernel logs: Missing nfct_seqadj_ext_add() setup call WARNING: CPU: 1 PID: 0 at net/netfilter/nf_conntrack_seqadj.c:41 [..] __nf_nat_mangle_tcp_packet+0x100/0x160 [nf_nat] nf_nat_ftp+0x142/0x280 [nf_nat_ftp] help+0x4d1/0x880 [nf_conntrack_ftp] nf_confirm+0x122/0x2e0 [nf_conntrack] nf_hook_slow+0x3c/0xb0 .. Fix this by adding the required extension when a conntrack helper is assigned to a connection that has a nat binding. Fixes: 1a64edf54f55 ("netfilter: nft_ct: add helper set support") Signed-off-by: Andrii Melnychenko Signed-off-by: Florian Westphal

netfilter: nft_connlimit: fix possible data race on connection count

2025-10-29T13:47:59Z

nft_connlimit_eval() reads priv->list->count to check if the connection limit has been exceeded. This value is being read without a lock and can be modified by a different process. Use READ_ONCE() for correctness. Fixes: df4a90250976 ("netfilter: nf_conncount: merge lookup and add functions") Signed-off-by: Fernando Fernandez Mancera Signed-off-by: Florian Westphal

netfilter: nft_ct: enable labels for get case too

2025-10-29T13:47:59Z

conntrack labels can only be set when the conntrack has been created with the "ctlabel" extension. For older iptables (connlabel match), adding an "-m connlabel" rule turns on the ctlabel extension allocation for all future conntrack entries. For nftables, its only enabled for 'ct label set foo', but not for 'ct label foo' (i.e. check). But users could have a ruleset that only checks for presence, and rely on userspace to set a label bit via ctnetlink infrastructure. This doesn't work without adding a dummy 'ct label set' rule. We could also enable extension infra for the first (failing) ctnetlink request, but unlike ruleset we would not be able to disable the extension again. Therefore turn on ctlabel extension allocation if an nftables ruleset checks for a connlabel too. Fixes: 1ad8f48df6f6 ("netfilter: nftables: add connlabel set support") Reported-by: Antonio Ojea Closes: https://lore.kernel.org/netfilter-devel/aPi_VdZpVjWujZ29@strlen.de/ Signed-off-by: Florian Westphal

netfilter: nft_objref: validate objref and objrefmap expressions

2025-10-08T11:17:25Z

Referencing a synproxy stateful object from OUTPUT hook causes kernel crash due to infinite recursive calls: BUG: TASK stack guard page was hit at 000000008bda5b8c (stack is 000000003ab1c4a5..00000000494d8b12) [...] Call Trace: __find_rr_leaf+0x99/0x230 fib6_table_lookup+0x13b/0x2d0 ip6_pol_route+0xa4/0x400 fib6_rule_lookup+0x156/0x240 ip6_route_output_flags+0xc6/0x150 __nf_ip6_route+0x23/0x50 synproxy_send_tcp_ipv6+0x106/0x200 synproxy_send_client_synack_ipv6+0x1aa/0x1f0 nft_synproxy_do_eval+0x263/0x310 nft_do_chain+0x5a8/0x5f0 [nf_tables nft_do_chain_inet+0x98/0x110 nf_hook_slow+0x43/0xc0 __ip6_local_out+0xf0/0x170 ip6_local_out+0x17/0x70 synproxy_send_tcp_ipv6+0x1a2/0x200 synproxy_send_client_synack_ipv6+0x1aa/0x1f0 [...] Implement objref and objrefmap expression validate functions. Currently, only NFT_OBJECT_SYNPROXY object type requires validation. This will also handle a jump to a chain using a synproxy object from the OUTPUT hook. Now when trying to reference a synproxy object in the OUTPUT hook, nft will produce the following error: synproxy_crash.nft: Error: Could not process rule: Operation not supported synproxy name mysynproxy ^^^^^^^^^^^^^^^^^^^^^^^^ Fixes: ee394f96ad75 ("netfilter: nft_synproxy: add synproxy stateful object support") Reported-by: Georg Pfuetzenreuter Closes: https://bugzilla.suse.com/1250237 Signed-off-by: Fernando Fernandez Mancera Reviewed-by: Pablo Neira Ayuso Signed-off-by: Florian Westphal

netfilter: nf_conntrack: do not skip entries in /proc/net/nf_conntrack

2025-09-24T09:50:28Z

ct_seq_show() has an opportunistic garbage collector : if (nf_ct_should_gc(ct)) { nf_ct_kill(ct); goto release; } So if one nf_conn is killed there, next time ct_get_next() runs, we skip the following item in the bucket, even if it should have been displayed if gc did not take place. We can decrement st->skip_elems to tell ct_get_next() one of the items was removed from the chain. Fixes: 58e207e4983d ("netfilter: evict stale entries when user reads /proc/net/nf_conntrack") Signed-off-by: Eric Dumazet Signed-off-by: Florian Westphal

netfilter: nft_set_pipapo_avx2: fix skip of expired entries

2025-09-24T09:50:28Z

KASAN reports following splat: BUG: KASAN: slab-out-of-bounds in pipapo_get_avx2+0x941/0x25d0 Read of size 1 at addr ffff88814c561be0 by task nft/3944 Call Trace: pipapo_get_avx2+0x941/0x25d0 nft_pipapo_insert+0x440/0x11b0 nf_tables_newsetelem+0x220a/0x3a00 .. This bisects to commit 84c1da7b38d9 ("netfilter: nft_set_pipapo: use AVX2 algorithm for insertions too"). However, that change merely uncovers this bug. When we find a match but that match has expired or timed out, the AVX2 implementation restarts the full match loop. At that point, the pointer to the key data has already been changed and points to the keys last field. This will then result in out-of-bounds read once its incremented again for the next field. The restart logic in AVX2 is different compared to the plain C implementation, but both should follow the same logic. The C implementation just calls pipapo_refill() again do check the next entry. Do the same in the AVX2 implementation. Note that with this change, due to implementation differences of pipapo_refill vs. nft_pipapo_avx2_refill, the refill call will return the same element again. Then, on the next call, it will move to the next entry as expected. This is because avx2_refill doesn't clear the bitmap in the 'last' conditional. This is harmless. Expired/timed out elements are also not expected to be frequent. selftest is added in a followup commit. Fixes: 7400b063969b ("nft_set_pipapo: Introduce AVX2-based lookup implementation") Reviewed-by: Stefano Brivio Signed-off-by: Florian Westphal

netfilter: nft_set_pipapo: use 0 genmask for packetpath lookups

2025-09-24T09:50:28Z

In commit c4eaca2e1052 ("netfilter: nft_set_pipapo: don't check genbit from packetpath lookups") I replaced genmask_cur() with NFT_GENMASK_ANY, but this change has no effect in the pipapo set type. New entries are unreachable from the active copy, so NFT_GENMASK_ANY has same result as genmask_cur(): current-gen elements are disabled and the new-generation elements cannot be found. Tests did not catch this incomplete fix because the change also dropped the genmask test from the AVX2 version of the algorithm, so test only fails if host cpu lacks AVX2 support. Use genmask test only from the control plane (inserts, deletions, ..). Packet path has to skip the check, use of 0 is enough for this because ext->genmask has a the relevant bit set when the element is INACTIVE in that generation: using a 0 genmask thus makes nft_set_elem_active() always return true. Fix the comment and replace NFT_GENMASK_ANY with 0. Fixes: c4eaca2e1052 ("netfilter: nft_set_pipapo: don't check genbit from packetpath lookups") Signed-off-by: Florian Westphal

netfilter: nfnetlink: reset nlh pointer during batch replay

2025-09-24T09:50:28Z

During a batch replay, the nlh pointer is not reset until the parsing of the commands. Since commit bf2ac490d28c ("netfilter: nfnetlink: Handle ACK flags for batch messages") that is problematic as the condition to add an ACK for batch begin will evaluate to true even if NLM_F_ACK wasn't used for batch begin message. If there is an error during the command processing, netlink is sending an ACK despite that. This misleads userspace tools which think that the return code was 0. Reset the nlh pointer to the original one when a replay is triggered. Fixes: bf2ac490d28c ("netfilter: nfnetlink: Handle ACK flags for batch messages") Signed-off-by: Fernando Fernandez Mancera Signed-off-by: Florian Westphal

ipvs: Defer ip_vs_ftp unregister during netns cleanup

2025-09-24T09:50:28Z

On the netns cleanup path, __ip_vs_ftp_exit() may unregister ip_vs_ftp before connections with valid cp->app pointers are flushed, leading to a use-after-free. Fix this by introducing a global `exiting_module` flag, set to true in ip_vs_ftp_exit() before unregistering the pernet subsystem. In __ip_vs_ftp_exit(), skip ip_vs_ftp unregister if called during netns cleanup (when exiting_module is false) and defer it to __ip_vs_cleanup_batch(), which unregisters all apps after all connections are flushed. If called during module exit, unregister ip_vs_ftp immediately. Fixes: 61b1ab4583e2 ("IPVS: netns, add basic init per netns.") Suggested-by: Julian Anastasov Signed-off-by: Slavin Liu Signed-off-by: Julian Anastasov Signed-off-by: Florian Westphal

net: replace use of system_wq with system_percpu_wq

2025-09-23T00:40:30Z

Currently if a user enqueue a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This lack of consistentcy cannot be addressed without refactoring the API. system_unbound_wq should be the default workqueue so as not to enforce locality constraints for random work whenever it's not required. Adding system_dfl_wq to encourage its use when unbound work should be used. The old system_unbound_wq will be kept for a few release cycles. Suggested-by: Tejun Heo Signed-off-by: Marco Crivellari Link: https://patch.msgid.link/20250918142427.309519-3-marco.crivellari@suse.com Signed-off-by: Jakub Kicinski