Age | Commit message (Collapse) | Author | Files | Lines |
|
syzbot was able to trigger this warning [1], after injecting a
malicious packet through af_packet, setting skb->csum_start and thus
the transport header to an incorrect value.
We can at least make sure the transport header is after
the end of the network header (with a estimated minimal size).
[1]
[ 67.873027] skb len=4096 headroom=16 headlen=14 tailroom=0
mac=(-1,-1) mac_len=0 net=(16,-6) trans=10
shinfo(txflags=0 nr_frags=1 gso(size=0 type=0 segs=0))
csum(0xa start=10 offset=0 ip_summed=3 complete_sw=0 valid=0 level=0)
hash(0x0 sw=0 l4=0) proto=0x0800 pkttype=0 iif=0
priority=0x0 mark=0x0 alloc_cpu=10 vlan_all=0x0
encapsulation=0 inner(proto=0x0000, mac=0, net=0, trans=0)
[ 67.877172] dev name=veth0_vlan feat=0x000061164fdd09e9
[ 67.877764] sk family=17 type=3 proto=0
[ 67.878279] skb linear: 00000000: 00 00 10 00 00 00 00 00 0f 00 00 00 08 00
[ 67.879128] skb frag: 00000000: 0e 00 07 00 00 00 28 00 08 80 1c 00 04 00 00 02
[ 67.879877] skb frag: 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 67.880647] skb frag: 00000020: 00 00 02 00 00 00 08 00 1b 00 00 00 00 00 00 00
[ 67.881156] skb frag: 00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 67.881753] skb frag: 00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 67.882173] skb frag: 00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 67.882790] skb frag: 00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 67.883171] skb frag: 00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 67.883733] skb frag: 00000080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 67.884206] skb frag: 00000090: 00 00 00 00 00 00 00 00 00 00 69 70 76 6c 61 6e
[ 67.884704] skb frag: 000000a0: 31 00 00 00 00 00 00 00 00 00 2b 00 00 00 00 00
[ 67.885139] skb frag: 000000b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 67.885677] skb frag: 000000c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 67.886042] skb frag: 000000d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 67.886408] skb frag: 000000e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 67.887020] skb frag: 000000f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 67.887384] skb frag: 00000100: 00 00
[ 67.887878] ------------[ cut here ]------------
[ 67.887908] offset (-6) >= skb_headlen() (14)
[ 67.888445] WARNING: CPU: 10 PID: 2088 at net/core/dev.c:3332 skb_checksum_help (net/core/dev.c:3332 (discriminator 2))
[ 67.889353] Modules linked in: macsec macvtap macvlan hsr wireguard curve25519_x86_64 libcurve25519_generic libchacha20poly1305 chacha_x86_64 libchacha poly1305_x86_64 dummy bridge sr_mod cdrom evdev pcspkr i2c_piix4 9pnet_virtio 9p 9pnet netfs
[ 67.890111] CPU: 10 UID: 0 PID: 2088 Comm: b363492833 Not tainted 6.11.0-virtme #1011
[ 67.890183] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 67.890309] RIP: 0010:skb_checksum_help (net/core/dev.c:3332 (discriminator 2))
[ 67.891043] Call Trace:
[ 67.891173] <TASK>
[ 67.891274] ? __warn (kernel/panic.c:741)
[ 67.891320] ? skb_checksum_help (net/core/dev.c:3332 (discriminator 2))
[ 67.891333] ? report_bug (lib/bug.c:180 lib/bug.c:219)
[ 67.891348] ? handle_bug (arch/x86/kernel/traps.c:239)
[ 67.891363] ? exc_invalid_op (arch/x86/kernel/traps.c:260 (discriminator 1))
[ 67.891372] ? asm_exc_invalid_op (./arch/x86/include/asm/idtentry.h:621)
[ 67.891388] ? skb_checksum_help (net/core/dev.c:3332 (discriminator 2))
[ 67.891399] ? skb_checksum_help (net/core/dev.c:3332 (discriminator 2))
[ 67.891416] ip_do_fragment (net/ipv4/ip_output.c:777 (discriminator 1))
[ 67.891448] ? __ip_local_out (./include/linux/skbuff.h:1146 ./include/net/l3mdev.h:196 ./include/net/l3mdev.h:213 net/ipv4/ip_output.c:113)
[ 67.891459] ? __pfx_ip_finish_output2 (net/ipv4/ip_output.c:200)
[ 67.891470] ? ip_route_output_flow (./arch/x86/include/asm/preempt.h:84 (discriminator 13) ./include/linux/rcupdate.h:96 (discriminator 13) ./include/linux/rcupdate.h:871 (discriminator 13) net/ipv4/route.c:2625 (discriminator 13) ./include/net/route.h:141 (discriminator 13) net/ipv4/route.c:2852 (discriminator 13))
[ 67.891484] ipvlan_process_v4_outbound (drivers/net/ipvlan/ipvlan_core.c:445 (discriminator 1))
[ 67.891581] ipvlan_queue_xmit (drivers/net/ipvlan/ipvlan_core.c:542 drivers/net/ipvlan/ipvlan_core.c:604 drivers/net/ipvlan/ipvlan_core.c:670)
[ 67.891596] ipvlan_start_xmit (drivers/net/ipvlan/ipvlan_main.c:227)
[ 67.891607] dev_hard_start_xmit (./include/linux/netdevice.h:4916 ./include/linux/netdevice.h:4925 net/core/dev.c:3588 net/core/dev.c:3604)
[ 67.891620] __dev_queue_xmit (net/core/dev.h:168 (discriminator 25) net/core/dev.c:4425 (discriminator 25))
[ 67.891630] ? skb_copy_bits (./include/linux/uaccess.h:233 (discriminator 1) ./include/linux/uaccess.h:260 (discriminator 1) ./include/linux/highmem-internal.h:230 (discriminator 1) net/core/skbuff.c:3018 (discriminator 1))
[ 67.891645] ? __pskb_pull_tail (net/core/skbuff.c:2848 (discriminator 4))
[ 67.891655] ? skb_partial_csum_set (net/core/skbuff.c:5657)
[ 67.891666] ? virtio_net_hdr_to_skb.constprop.0 (./include/linux/skbuff.h:2791 (discriminator 3) ./include/linux/skbuff.h:2799 (discriminator 3) ./include/linux/virtio_net.h:109 (discriminator 3))
[ 67.891684] packet_sendmsg (net/packet/af_packet.c:3145 (discriminator 1) net/packet/af_packet.c:3177 (discriminator 1))
[ 67.891700] ? _raw_spin_lock_bh (./arch/x86/include/asm/atomic.h:107 (discriminator 4) ./include/linux/atomic/atomic-arch-fallback.h:2170 (discriminator 4) ./include/linux/atomic/atomic-instrumented.h:1302 (discriminator 4) ./include/asm-generic/qspinlock.h:111 (discriminator 4) ./include/linux/spinlock.h:187 (discriminator 4) ./include/linux/spinlock_api_smp.h:127 (discriminator 4) kernel/locking/spinlock.c:178 (discriminator 4))
[ 67.891716] __sys_sendto (net/socket.c:730 (discriminator 1) net/socket.c:745 (discriminator 1) net/socket.c:2210 (discriminator 1))
[ 67.891734] ? do_sock_setsockopt (net/socket.c:2335)
[ 67.891747] ? __sys_setsockopt (./include/linux/file.h:34 net/socket.c:2355)
[ 67.891761] __x64_sys_sendto (net/socket.c:2222 (discriminator 1) net/socket.c:2218 (discriminator 1) net/socket.c:2218 (discriminator 1))
[ 67.891772] do_syscall_64 (arch/x86/entry/common.c:52 (discriminator 1) arch/x86/entry/common.c:83 (discriminator 1))
[ 67.891785] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
Fixes: 9181d6f8a2bb ("net: add more sanity check in virtio_net_hdr_to_skb()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20240926165836.3797406-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Detect tcp gso fraglist skbs with corrupted geometry (see below) and
pass these to skb_segment instead of skb_segment_list, as the first
can segment them correctly.
Valid SKB_GSO_FRAGLIST skbs
- consist of two or more segments
- the head_skb holds the protocol headers plus first gso_size
- one or more frag_list skbs hold exactly one segment
- all but the last must be gso_size
Optional datapath hooks such as NAT and BPF (bpf_skb_pull_data) can
modify these skbs, breaking these invariants.
In extreme cases they pull all data into skb linear. For TCP, this
causes a NULL ptr deref in __tcpv4_gso_segment_list_csum at
tcp_hdr(seg->next).
Detect invalid geometry due to pull, by checking head_skb size.
Don't just drop, as this may blackhole a destination. Convert to be
able to pass to regular skb_segment.
Approach and description based on a patch by Willem de Bruijn.
Link: https://lore.kernel.org/netdev/20240428142913.18666-1-shiming.cheng@mediatek.com/
Link: https://lore.kernel.org/netdev/20240922150450.3873767-1-willemdebruijn.kernel@gmail.com/
Fixes: bee88cd5bd83 ("net: add support for segmenting TCP fraglist GSO packets")
Cc: stable@vger.kernel.org
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20240926085315.51524-1-nbd@nbd.name
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Regression Description:
Depending on the options specified for the GRE tunnel device, small
packets may be dropped. This occurs because the pskb_network_may_pull
function fails due to the packet's insufficient length.
For example, if only the okey option is specified for the tunnel device,
original (before encapsulation) packets smaller than 28 bytes (including
the IPv4 header) will be dropped. This happens because the required
length is calculated relative to the network header, not the skb->head.
Here is how the required length is computed and checked:
* The pull_len variable is set to 28 bytes, consisting of:
* IPv4 header: 20 bytes
* GRE header with Key field: 8 bytes
* The pskb_network_may_pull function adds the network offset, shifting
the checkable space further to the beginning of the network header and
extending it to the beginning of the packet. As a result, the end of
the checkable space occurs beyond the actual end of the packet.
Instead of ensuring that 28 bytes are present in skb->head, the function
is requesting these 28 bytes starting from the network header. For small
packets, this requested length exceeds the actual packet size, causing
the check to fail and the packets to be dropped.
This issue affects both locally originated and forwarded packets in
DMVPN-like setups.
How to reproduce (for local originated packets):
ip link add dev gre1 type gre ikey 1.9.8.4 okey 1.9.8.4 \
local <your-ip> remote 0.0.0.0
ip link set mtu 1400 dev gre1
ip link set up dev gre1
ip address add 192.168.13.1/24 dev gre1
ip neighbor add 192.168.13.2 lladdr <remote-ip> dev gre1
ping -s 1374 -c 10 192.168.13.2
tcpdump -vni gre1
tcpdump -vni <your-ext-iface> 'ip proto 47'
ip -s -s -d link show dev gre1
Solution:
Use the pskb_may_pull function instead the pskb_network_may_pull.
Fixes: 80d875cfc9d3 ("ipv4: ip_gre: Avoid skb_pull() failure in ipgre_xmit()")
Signed-off-by: Anton Danilov <littlesmilingcloud@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20240924235158.106062-1-littlesmilingcloud@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Increase the timeout for checking the busy bit of the VLAN Tag register
from 10µs to 500ms. This change is necessary to accommodate scenarios
where Energy Efficient Ethernet (EEE) is enabled.
Overnight testing revealed that when EEE is active, the busy bit can
remain set for up to approximately 300ms. The new 500ms timeout provides
a safety margin.
Fixes: ed64639bc1e0 ("net: stmmac: Add support for VLAN Rx filtering")
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Shenwei Wang <shenwei.wang@nxp.com>
Link: https://patch.msgid.link/20240924205424.573913-1-shenwei.wang@nxp.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
One path takes care of SKB_GSO_DODGY, assuming
skb->len is bigger than hdr_len.
virtio_net_hdr_to_skb() does not fully dissect TCP headers,
it only make sure it is at least 20 bytes.
It is possible for an user to provide a malicious 'GSO' packet,
total length of 80 bytes.
- 20 bytes of IPv4 header
- 60 bytes TCP header
- a small gso_size like 8
virtio_net_hdr_to_skb() would declare this packet as a normal
GSO packet, because it would see 40 bytes of payload,
bigger than gso_size.
We need to make detect this case to not underflow
qdisc_skb_cb(skb)->pkt_len.
Fixes: 1def9238d4aa ("net_sched: more precise pkt_len computation")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
After commit 7c6d2ecbda83 ("net: be more gentle about silly gso
requests coming from user") virtio_net_hdr_to_skb() had sanity check
to detect malicious attempts from user space to cook a bad GSO packet.
Then commit cf9acc90c80ec ("net: virtio_net_hdr_to_skb: count
transport header in UFO") while fixing one issue, allowed user space
to cook a GSO packet with the following characteristic :
IPv4 SKB_GSO_UDP, gso_size=3, skb->len = 28.
When this packet arrives in qdisc_pkt_len_init(), we end up
with hdr_len = 28 (IPv4 header + UDP header), matching skb->len
Then the following sets gso_segs to 0 :
gso_segs = DIV_ROUND_UP(skb->len - hdr_len,
shinfo->gso_size);
Then later we set qdisc_skb_cb(skb)->pkt_len to back to zero :/
qdisc_skb_cb(skb)->pkt_len += (gso_segs - 1) * hdr_len;
This leads to the following crash in fq_codel [1]
qdisc_pkt_len_init() is best effort, we only want an estimation
of the bytes sent on the wire, not crashing the kernel.
This patch is fixing this particular issue, a following one
adds more sanity checks for another potential bug.
[1]
[ 70.724101] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 70.724561] #PF: supervisor read access in kernel mode
[ 70.724561] #PF: error_code(0x0000) - not-present page
[ 70.724561] PGD 10ac61067 P4D 10ac61067 PUD 107ee2067 PMD 0
[ 70.724561] Oops: Oops: 0000 [#1] SMP NOPTI
[ 70.724561] CPU: 11 UID: 0 PID: 2163 Comm: b358537762 Not tainted 6.11.0-virtme #991
[ 70.724561] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 70.724561] RIP: 0010:fq_codel_enqueue (net/sched/sch_fq_codel.c:120 net/sched/sch_fq_codel.c:168 net/sched/sch_fq_codel.c:230) sch_fq_codel
[ 70.724561] Code: 24 08 49 c1 e1 06 44 89 7c 24 18 45 31 ed 45 31 c0 31 ff 89 44 24 14 4c 03 8b 90 01 00 00 eb 04 39 ca 73 37 4d 8b 39 83 c7 01 <49> 8b 17 49 89 11 41 8b 57 28 45 8b 5f 34 49 c7 07 00 00 00 00 49
All code
========
0: 24 08 and $0x8,%al
2: 49 c1 e1 06 shl $0x6,%r9
6: 44 89 7c 24 18 mov %r15d,0x18(%rsp)
b: 45 31 ed xor %r13d,%r13d
e: 45 31 c0 xor %r8d,%r8d
11: 31 ff xor %edi,%edi
13: 89 44 24 14 mov %eax,0x14(%rsp)
17: 4c 03 8b 90 01 00 00 add 0x190(%rbx),%r9
1e: eb 04 jmp 0x24
20: 39 ca cmp %ecx,%edx
22: 73 37 jae 0x5b
24: 4d 8b 39 mov (%r9),%r15
27: 83 c7 01 add $0x1,%edi
2a:* 49 8b 17 mov (%r15),%rdx <-- trapping instruction
2d: 49 89 11 mov %rdx,(%r9)
30: 41 8b 57 28 mov 0x28(%r15),%edx
34: 45 8b 5f 34 mov 0x34(%r15),%r11d
38: 49 c7 07 00 00 00 00 movq $0x0,(%r15)
3f: 49 rex.WB
Code starting with the faulting instruction
===========================================
0: 49 8b 17 mov (%r15),%rdx
3: 49 89 11 mov %rdx,(%r9)
6: 41 8b 57 28 mov 0x28(%r15),%edx
a: 45 8b 5f 34 mov 0x34(%r15),%r11d
e: 49 c7 07 00 00 00 00 movq $0x0,(%r15)
15: 49 rex.WB
[ 70.724561] RSP: 0018:ffff95ae85e6fb90 EFLAGS: 00000202
[ 70.724561] RAX: 0000000002000000 RBX: ffff95ae841de000 RCX: 0000000000000000
[ 70.724561] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000001
[ 70.724561] RBP: ffff95ae85e6fbf8 R08: 0000000000000000 R09: ffff95b710a30000
[ 70.724561] R10: 0000000000000000 R11: bdf289445ce31881 R12: ffff95ae85e6fc58
[ 70.724561] R13: 0000000000000000 R14: 0000000000000040 R15: 0000000000000000
[ 70.724561] FS: 000000002c5c1380(0000) GS:ffff95bd7fcc0000(0000) knlGS:0000000000000000
[ 70.724561] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 70.724561] CR2: 0000000000000000 CR3: 000000010c568000 CR4: 00000000000006f0
[ 70.724561] Call Trace:
[ 70.724561] <TASK>
[ 70.724561] ? __die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434)
[ 70.724561] ? page_fault_oops (arch/x86/mm/fault.c:715)
[ 70.724561] ? exc_page_fault (./arch/x86/include/asm/irqflags.h:26 ./arch/x86/include/asm/irqflags.h:87 ./arch/x86/include/asm/irqflags.h:147 arch/x86/mm/fault.c:1489 arch/x86/mm/fault.c:1539)
[ 70.724561] ? asm_exc_page_fault (./arch/x86/include/asm/idtentry.h:623)
[ 70.724561] ? fq_codel_enqueue (net/sched/sch_fq_codel.c:120 net/sched/sch_fq_codel.c:168 net/sched/sch_fq_codel.c:230) sch_fq_codel
[ 70.724561] dev_qdisc_enqueue (net/core/dev.c:3784)
[ 70.724561] __dev_queue_xmit (net/core/dev.c:3880 (discriminator 2) net/core/dev.c:4390 (discriminator 2))
[ 70.724561] ? irqentry_enter (kernel/entry/common.c:237)
[ 70.724561] ? sysvec_apic_timer_interrupt (./arch/x86/include/asm/hardirq.h:74 (discriminator 2) arch/x86/kernel/apic/apic.c:1043 (discriminator 2) arch/x86/kernel/apic/apic.c:1043 (discriminator 2))
[ 70.724561] ? trace_hardirqs_on (kernel/trace/trace_preemptirq.c:58 (discriminator 4))
[ 70.724561] ? asm_sysvec_apic_timer_interrupt (./arch/x86/include/asm/idtentry.h:702)
[ 70.724561] ? virtio_net_hdr_to_skb.constprop.0 (./include/linux/virtio_net.h:129 (discriminator 1))
[ 70.724561] packet_sendmsg (net/packet/af_packet.c:3145 (discriminator 1) net/packet/af_packet.c:3177 (discriminator 1))
[ 70.724561] ? _raw_spin_lock_bh (./arch/x86/include/asm/atomic.h:107 (discriminator 4) ./include/linux/atomic/atomic-arch-fallback.h:2170 (discriminator 4) ./include/linux/atomic/atomic-instrumented.h:1302 (discriminator 4) ./include/asm-generic/qspinlock.h:111 (discriminator 4) ./include/linux/spinlock.h:187 (discriminator 4) ./include/linux/spinlock_api_smp.h:127 (discriminator 4) kernel/locking/spinlock.c:178 (discriminator 4))
[ 70.724561] ? netdev_name_node_lookup_rcu (net/core/dev.c:325 (discriminator 1))
[ 70.724561] __sys_sendto (net/socket.c:730 (discriminator 1) net/socket.c:745 (discriminator 1) net/socket.c:2210 (discriminator 1))
[ 70.724561] ? __sys_setsockopt (./include/linux/file.h:34 net/socket.c:2355)
[ 70.724561] __x64_sys_sendto (net/socket.c:2222 (discriminator 1) net/socket.c:2218 (discriminator 1) net/socket.c:2218 (discriminator 1))
[ 70.724561] do_syscall_64 (arch/x86/entry/common.c:52 (discriminator 1) arch/x86/entry/common.c:83 (discriminator 1))
[ 70.724561] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
[ 70.724561] RIP: 0033:0x41ae09
Fixes: cf9acc90c80ec ("net: virtio_net_hdr_to_skb: count transport header in UFO")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jonathan Davies <jonathan.davies@nutanix.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Jonathan Davies <jonathan.davies@nutanix.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
The number of register fields cannot be assumed to be ALE_FIELDS_MAX
as some platforms can have lesser fields.
Solve this by embedding the actual number of fields available
in platform data and use that instead of ALE_FIELDS_MAX.
Gets rid of the below warning on BeagleBone Black
[ 1.007735] WARNING: CPU: 0 PID: 33 at drivers/base/regmap/regmap.c:1208 regmap_field_init+0x88/0x9c
[ 1.007802] invalid empty mask defined
[ 1.007812] Modules linked in:
[ 1.007842] CPU: 0 UID: 0 PID: 33 Comm: kworker/u4:3 Not tainted 6.11.0-01459-g508403ab7b74-dirty #840
[ 1.007867] Hardware name: Generic AM33XX (Flattened Device Tree)
[ 1.007890] Workqueue: events_unbound deferred_probe_work_func
[ 1.007935] Call trace:
[ 1.007957] unwind_backtrace from show_stack+0x10/0x14
[ 1.007999] show_stack from dump_stack_lvl+0x50/0x64
[ 1.008033] dump_stack_lvl from __warn+0x70/0x124
[ 1.008077] __warn from warn_slowpath_fmt+0x194/0x1a8
[ 1.008113] warn_slowpath_fmt from regmap_field_init+0x88/0x9c
[ 1.008154] regmap_field_init from devm_regmap_field_alloc+0x48/0x64
[ 1.008193] devm_regmap_field_alloc from cpsw_ale_create+0xfc/0x320
[ 1.008251] cpsw_ale_create from cpsw_init_common+0x214/0x354
[ 1.008286] cpsw_init_common from cpsw_probe+0x4ac/0xb88
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Closes: https://lore.kernel.org/netdev/CAMuHMdUf-tKRDzkz2_m8qdFTFutefddU0NTratVrEjRTzA3yQQ@mail.gmail.com/
Fixes: 11cbcfeaa79e ("net: ethernet: ti: cpsw_ale: use regfields for number of Entries and Policers")
Signed-off-by: Roger Quadros <rogerq@kernel.org>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20240924-am65-cpsw-multi-rx-fix-v1-1-0ca3fa9a1398@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
There is no need to ask the user about enabling Microchip FDMA
functionality, as all drivers that use it select the FDMA symbol.
Hence make the symbol invisible, unless when compile-testing.
Fixes: 30e48a75df9c6ead ("net: microchip: add FDMA library")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://patch.msgid.link/8e2bcd8899c417a962b7ee3f75b29f35b25d7933.1727171879.git.geert+renesas@glider.be
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
On link-state change, the controller gets reset,
which clears all PTP registers, including PHC time,
calibrated clock correction values etc. For correct
IEEE 1588 operation we need to restore these after
the reset.
Fixes: 6605b730c061 ("FEC: Add time stamping code and a PTP hardware clock")
Signed-off-by: Csókás, Bence <csokas.bence@prolan.hu>
Reviewed-by: Wei Fang <wei.fang@nxp.com>
Link: https://patch.msgid.link/20240924093705.2897329-2-csokas.bence@prolan.hu
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
On link state change, the controller gets reset,
causing PPS to drop out. Re-enable PPS if it was
enabled before the controller reset.
Fixes: 6605b730c061 ("FEC: Add time stamping code and a PTP hardware clock")
Signed-off-by: Csókás, Bence <csokas.bence@prolan.hu>
Link: https://patch.msgid.link/20240924093705.2897329-1-csokas.bence@prolan.hu
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
The value is read from the register TXGBE_RX_GEN_CTL3, and it should be
written back to TXGBE_RX_GEN_CTL3 when it changes some fields.
Cc: stable@vger.kernel.org
Fixes: f629acc6f210 ("net: pcs: xpcs: support to switch mode for Wangxun NICs")
Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Reported-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/20240924022857.865422-1-jiawenwu@trustnetic.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
When applying padding, the buffer is not zeroed, which results in memory
disclosure. The mentioned data is observed on the wire. This patch uses
skb_put_padto() to pad Ethernet frames properly. The mentioned function
zeroes the expanded buffer.
In case the packet cannot be padded it is silently dropped. Statistics
are also not incremented. This driver does not support statistics in the
old 32-bit format or the new 64-bit format. These will be added in the
future. In its current form, the patch should be easily backported to
stable versions.
Ethernet MACs on Amazon-SE and Danube cannot do padding of the packets
in hardware, so software padding must be applied.
Fixes: 504d4721ee8e ("MIPS: Lantiq: Add ethernet driver")
Signed-off-by: Aleksander Jan Bajkowski <olek2@wp.pl>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20240923214949.231511-2-olek2@wp.pl
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Commit 24ab059d2ebd ("net: check dev->gso_max_size in gso_features_check()")
added a dev->gso_max_size test to gso_features_check() in order to fall
back to GSO when needed.
This was added as it was noticed that some drivers could misbehave if TSO
packets get too big. However, the check doesn't respect dev->gso_ipv4_max_size
limit. For instance, a device could be configured with BIG TCP for IPv4,
but not IPv6.
Therefore, add a netif_get_gso_max_size() equivalent to netif_get_gro_max_size()
and use the helper to respect both limits before falling back to GSO engine.
Fixes: 24ab059d2ebd ("net: check dev->gso_max_size in gso_features_check()")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20240923212242.15669-2-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Add a small netif_get_gro_max_size() helper which returns the maximum IPv4
or IPv6 GRO size of the netdevice.
We later add a netif_get_gso_max_size() equivalent as well for GSO, so that
these helpers can be used consistently instead of open-coded checks.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20240923212242.15669-1-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Alexander Sverdlin presents 2 problems during shutdown with the
lan9303 driver. One is specific to lan9303 and the other just happens
to reproduce there.
The first problem is that lan9303 is unique among DSA drivers in that it
calls dev_get_drvdata() at "arbitrary runtime" (not probe, not shutdown,
not remove):
phy_state_machine()
-> ...
-> dsa_user_phy_read()
-> ds->ops->phy_read()
-> lan9303_phy_read()
-> chip->ops->phy_read()
-> lan9303_mdio_phy_read()
-> dev_get_drvdata()
But we never stop the phy_state_machine(), so it may continue to run
after dsa_switch_shutdown(). Our common pattern in all DSA drivers is
to set drvdata to NULL to suppress the remove() method that may come
afterwards. But in this case it will result in an NPD.
The second problem is that the way in which we set
dp->conduit->dsa_ptr = NULL; is concurrent with receive packet
processing. dsa_switch_rcv() checks once whether dev->dsa_ptr is NULL,
but afterwards, rather than continuing to use that non-NULL value,
dev->dsa_ptr is dereferenced again and again without NULL checks:
dsa_conduit_find_user() and many other places. In between dereferences,
there is no locking to ensure that what was valid once continues to be
valid.
Both problems have the common aspect that closing the conduit interface
solves them.
In the first case, dev_close(conduit) triggers the NETDEV_GOING_DOWN
event in dsa_user_netdevice_event() which closes user ports as well.
dsa_port_disable_rt() calls phylink_stop(), which synchronously stops
the phylink state machine, and ds->ops->phy_read() will thus no longer
call into the driver after this point.
In the second case, dev_close(conduit) should do this, as per
Documentation/networking/driver.rst:
| Quiescence
| ----------
|
| After the ndo_stop routine has been called, the hardware must
| not receive or transmit any data. All in flight packets must
| be aborted. If necessary, poll or wait for completion of
| any reset commands.
So it should be sufficient to ensure that later, when we zeroize
conduit->dsa_ptr, there will be no concurrent dsa_switch_rcv() call
on this conduit.
The addition of the netif_device_detach() function is to ensure that
ioctls, rtnetlinks and ethtool requests on the user ports no longer
propagate down to the driver - we're no longer prepared to handle them.
The race condition actually did not exist when commit 0650bf52b31f
("net: dsa: be compatible with masters which unregister on shutdown")
first introduced dsa_switch_shutdown(). It was created later, when we
stopped unregistering the user interfaces from a bad spot, and we just
replaced that sequence with a racy zeroization of conduit->dsa_ptr
(one which doesn't ensure that the interfaces aren't up).
Reported-by: Alexander Sverdlin <alexander.sverdlin@siemens.com>
Closes: https://lore.kernel.org/netdev/2d2e3bba17203c14a5ffdabc174e3b6bbb9ad438.camel@siemens.com/
Closes: https://lore.kernel.org/netdev/c1bf4de54e829111e0e4a70e7bd1cf523c9550ff.camel@siemens.com/
Fixes: ee534378f005 ("net: dsa: fix panic when DSA master device unbinds on shutdown")
Reviewed-by: Alexander Sverdlin <alexander.sverdlin@siemens.com>
Tested-by: Alexander Sverdlin <alexander.sverdlin@siemens.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20240913203549.3081071-1-vladimir.oltean@nxp.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
This aligned BR/EDR JUST_WORKS method with LE which since 92516cd97fd4
("Bluetooth: Always request for user confirmation for Just Works")
always request user confirmation with confirm_hint set since the
likes of bluetoothd have dedicated policy around JUST_WORKS method
(e.g. main.conf:JustWorksRepairing).
CVE: CVE-2024-8805
Cc: stable@vger.kernel.org
Fixes: ba15a58b179e ("Bluetooth: Fix SSP acceptor just-works confirmation without MITM")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Tested-by: Kiran K <kiran.k@intel.com>
|
|
disable_irq() after request_irq() still has a time gap in which
interrupts can come. request_irq() with IRQF_NO_AUTOEN flag will
disable IRQ auto-enable when request IRQ.
Fixes: bb7f4f0bcee6 ("btmrvl: add platform specific wakeup interrupt support")
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
|
[Syzbot reported]
BUG: KASAN: slab-use-after-free in l2cap_connect.constprop.0+0x10d8/0x1270 net/bluetooth/l2cap_core.c:3949
Read of size 8 at addr ffff8880241e9800 by task kworker/u9:0/54
CPU: 0 UID: 0 PID: 54 Comm: kworker/u9:0 Not tainted 6.11.0-rc6-syzkaller-00268-g788220eee30d #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/06/2024
Workqueue: hci2 hci_rx_work
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:93 [inline]
dump_stack_lvl+0x116/0x1f0 lib/dump_stack.c:119
print_address_description mm/kasan/report.c:377 [inline]
print_report+0xc3/0x620 mm/kasan/report.c:488
kasan_report+0xd9/0x110 mm/kasan/report.c:601
l2cap_connect.constprop.0+0x10d8/0x1270 net/bluetooth/l2cap_core.c:3949
l2cap_connect_req net/bluetooth/l2cap_core.c:4080 [inline]
l2cap_bredr_sig_cmd net/bluetooth/l2cap_core.c:4772 [inline]
l2cap_sig_channel net/bluetooth/l2cap_core.c:5543 [inline]
l2cap_recv_frame+0xf0b/0x8eb0 net/bluetooth/l2cap_core.c:6825
l2cap_recv_acldata+0x9b4/0xb70 net/bluetooth/l2cap_core.c:7514
hci_acldata_packet net/bluetooth/hci_core.c:3791 [inline]
hci_rx_work+0xaab/0x1610 net/bluetooth/hci_core.c:4028
process_one_work+0x9c5/0x1b40 kernel/workqueue.c:3231
process_scheduled_works kernel/workqueue.c:3312 [inline]
worker_thread+0x6c8/0xed0 kernel/workqueue.c:3389
kthread+0x2c1/0x3a0 kernel/kthread.c:389
ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
...
Freed by task 5245:
kasan_save_stack+0x33/0x60 mm/kasan/common.c:47
kasan_save_track+0x14/0x30 mm/kasan/common.c:68
kasan_save_free_info+0x3b/0x60 mm/kasan/generic.c:579
poison_slab_object+0xf7/0x160 mm/kasan/common.c:240
__kasan_slab_free+0x32/0x50 mm/kasan/common.c:256
kasan_slab_free include/linux/kasan.h:184 [inline]
slab_free_hook mm/slub.c:2256 [inline]
slab_free mm/slub.c:4477 [inline]
kfree+0x12a/0x3b0 mm/slub.c:4598
l2cap_conn_free net/bluetooth/l2cap_core.c:1810 [inline]
kref_put include/linux/kref.h:65 [inline]
l2cap_conn_put net/bluetooth/l2cap_core.c:1822 [inline]
l2cap_conn_del+0x59d/0x730 net/bluetooth/l2cap_core.c:1802
l2cap_connect_cfm+0x9e6/0xf80 net/bluetooth/l2cap_core.c:7241
hci_connect_cfm include/net/bluetooth/hci_core.h:1960 [inline]
hci_conn_failed+0x1c3/0x370 net/bluetooth/hci_conn.c:1265
hci_abort_conn_sync+0x75a/0xb50 net/bluetooth/hci_sync.c:5583
abort_conn_sync+0x197/0x360 net/bluetooth/hci_conn.c:2917
hci_cmd_sync_work+0x1a4/0x410 net/bluetooth/hci_sync.c:328
process_one_work+0x9c5/0x1b40 kernel/workqueue.c:3231
process_scheduled_works kernel/workqueue.c:3312 [inline]
worker_thread+0x6c8/0xed0 kernel/workqueue.c:3389
kthread+0x2c1/0x3a0 kernel/kthread.c:389
ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
Reported-by: syzbot+c12e2f941af1feb5632c@syzkaller.appspotmail.com
Tested-by: syzbot+c12e2f941af1feb5632c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=c12e2f941af1feb5632c
Fixes: 7b064edae38d ("Bluetooth: Fix authentication if acl data comes before remote feature evt")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
|
If mgmt_index_removed is called while there are commands queued on
cmd_sync it could lead to crashes like the bellow trace:
0x0000053D: __list_del_entry_valid_or_report+0x98/0xdc
0x0000053D: mgmt_pending_remove+0x18/0x58 [bluetooth]
0x0000053E: mgmt_remove_adv_monitor_complete+0x80/0x108 [bluetooth]
0x0000053E: hci_cmd_sync_work+0xbc/0x164 [bluetooth]
So while handling mgmt_index_removed this attempts to dequeue
commands passed as user_data to cmd_sync.
Fixes: 7cf5c2978f23 ("Bluetooth: hci_sync: Refactor remove Adv Monitor")
Reported-by: jiaymao <quic_jiaymao@quicinc.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
|
It's important to undo pm_runtime_use_autosuspend() with
pm_runtime_dont_use_autosuspend() at driver exit time.
But the pm_runtime_disable() and pm_runtime_dont_use_autosuspend()
is missing in the error path for bam_dmux_probe(). So add it.
Found by code review. Compile-tested only.
Fixes: 21a0ffd9b38c ("net: wwan: Add Qualcomm BAM-DMUX WWAN network driver")
Suggested-by: Stephan Gerhold <stephan.gerhold@linaro.org>
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Reviewed-by: Stephan Gerhold <stephan.gerhold@linaro.org>
Reviewed-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
disable_irq() after request_irq() still has a time gap in which
interrupts can come. request_irq() with IRQF_NO_AUTOEN flag will
disable IRQ auto-enable when request IRQ.
Fixes: 8c6ad9cc5157 ("ieee802154: Add NXP MCR20A IEEE 802.15.4 transceiver driver")
Reviewed-by: Miquel Raynal <miquel.raynal@bootlin.com>
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Link: https://lore.kernel.org/20240911094234.1922418-1-ruanjinjie@huawei.com
Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
|
|
If the client can't reach the server, the latter remains listening
forever. Kill it after 5s of waiting.
Fixes: 867d2190799a ("selftests: netfilter: add ipvs test script")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
The netfilter race happens when two packets with the same tuple are DNATed
and enqueued with nfqueue in the postrouting hook.
Once one of the packet is reinjected it may be DNATed again to a different
destination, but the conntrack entry remains the same and the return packet
was dropped.
Based on earlier patch from Antonio Ojea.
Link: https://bugzilla.netfilter.org/show_bug.cgi?id=1766
Co-developed-by: Antonio Ojea <aojea@google.com>
Signed-off-by: Antonio Ojea <aojea@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
For historical reasons there are two clash resolution spots in
netfilter, one in nfnetlink_queue and one in conntrack core.
nfnetlink_queue one was added first: If a colliding entry is found, NAT
NAT transformation is reversed by calling nat engine again with altered
tuple.
See commit 368982cd7d1b ("netfilter: nfnetlink_queue: resolve clash for
unconfirmed conntracks") for details.
One problem is that nf_reroute() won't take an action if the queueing
doesn't occur in the OUTPUT hook, i.e. when queueing in forward or
postrouting, packet will be sent via the wrong path.
Another problem is that the scenario addressed (2nd UDP packet sent with
identical addresses while first packet is still being processed) can also
occur without any nfqueue involvement due to threaded resolvers doing
A and AAAA requests back-to-back.
This lead us to add clash resolution logic to the conntrack core, see
commit 6a757c07e51f ("netfilter: conntrack: allow insertion of clashing
entries"). Instead of fixing the nfqueue based logic, lets remove it
and let conntrack core handle this instead.
Retain the ->update hook for sake of nfqueue based conntrack helpers.
We could axe this hook completely but we'd have to split confirm and
helper logic again, see commit ee04805ff54a ("netfilter: conntrack: make
conntrack userspace helpers work again").
This SHOULD NOT be backported to kernels earlier than v5.6; they lack
adequate clash resolution handling.
Patch was originally written by Pablo Neira Ayuso.
Reported-by: Antonio Ojea <aojea@google.com>
Closes: https://bugzilla.netfilter.org/show_bug.cgi?id=1766
Signed-off-by: Florian Westphal <fw@strlen.de>
Tested-by: Antonio Ojea <aojea@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Several ruleset objects are still not using GFP_KERNEL_ACCOUNT for
memory accounting, update them. This includes:
- catchall elements
- compat match large info area
- log prefix
- meta secctx
- numgen counters
- pipapo set backend datastructure
- tunnel private objects
Fixes: 33758c891479 ("memcg: enable accounting for nft objects")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Lockless iteration over hook list is possible from netlink dump path,
use rcu variant to iterate over the hook list as is done with flowtable
hooks.
Fixes: b9703ed44ffb ("netfilter: nf_tables: support for adding new devices to an existing netdev chain")
Reported-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Only provide ctnetlink_label_size when it is used,
which is when CONFIG_NF_CONNTRACK_EVENTS is configured.
Flagged by clang-18 W=1 builds as:
.../nf_conntrack_netlink.c:385:19: warning: unused function 'ctnetlink_label_size' [-Wunused-function]
385 | static inline int ctnetlink_label_size(const struct nf_conn *ct)
| ^~~~~~~~~~~~~~~~~~~~
The condition on CONFIG_NF_CONNTRACK_LABELS being removed by
this patch guards compilation of non-trivial implementations
of ctnetlink_dump_labels() and ctnetlink_label_size().
However, this is not necessary as each of these functions
will always return 0 if CONFIG_NF_CONNTRACK_LABELS is not defined
as each function starts with the equivalent of:
struct nf_conn_labels *labels = nf_ct_labels_find(ct);
if (!labels)
return 0;
And nf_ct_labels_find always returns NULL if CONFIG_NF_CONNTRACK_LABELS
is not enabled. So I believe that the compiler optimises the code away
in such cases anyway.
Found by inspection.
Compile tested only.
Originally splitted in two patches, Pablo Neira Ayuso collapsed them and
added Fixes: tag.
Fixes: 0ceabd83875b ("netfilter: ctnetlink: deliver labels to userspace")
Link: https://lore.kernel.org/netfilter-devel/20240909151712.GZ2097826@kernel.org/
Signed-off-by: Simon Horman <horms@kernel.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
If CONFIG_BRIDGE_NETFILTER is not enabled, which is the case for x86_64
defconfig, then building nf_reject_ipv4.c and nf_reject_ipv6.c with W=1
using gcc-14 results in the following warnings, which are treated as
errors:
net/ipv4/netfilter/nf_reject_ipv4.c: In function 'nf_send_reset':
net/ipv4/netfilter/nf_reject_ipv4.c:243:23: error: variable 'niph' set but not used [-Werror=unused-but-set-variable]
243 | struct iphdr *niph;
| ^~~~
cc1: all warnings being treated as errors
net/ipv6/netfilter/nf_reject_ipv6.c: In function 'nf_send_reset6':
net/ipv6/netfilter/nf_reject_ipv6.c:286:25: error: variable 'ip6h' set but not used [-Werror=unused-but-set-variable]
286 | struct ipv6hdr *ip6h;
| ^~~~
cc1: all warnings being treated as errors
Address this by reducing the scope of these local variables to where
they are used, which is code only compiled when CONFIG_BRIDGE_NETFILTER
enabled.
Compile tested and run through netfilter selftests.
Reported-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Closes: https://lore.kernel.org/netfilter-devel/20240906145513.567781-1-andriy.shevchenko@linux.intel.com/
Signed-off-by: Simon Horman <horms@kernel.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Documentation of list_del_rcu() warns callers to not immediately free
the deleted list item. While it seems not necessary to use the
RCU-variant of list_del() here in the first place, doing so seems to
require calling kfree_rcu() on the deleted item as well.
Fixes: 3f0465a9ef02 ("netfilter: nf_tables: dynamically allocate hooks per net_device in flowtables")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
The iptables example was added in commit d2f26037a38a (netfilter: Add
documentation for tproxy, 2008-10-08), but xt_socket 'transparent'
option was added in commit a31e1ffd2231 (netfilter: xt_socket: added new
revision of the 'socket' match supporting flags, 2009-06-09).
Now add the 'transparent' option to the iptables example to ignore
non-transparent sockets, which is also consistent with the nft example.
Signed-off-by: 谢致邦 (XIE Zhibang) <Yeking@Red54.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Some of the functions may be unused (CONFIG_NETFILTER_NETLINK_GLUE_CT=n
and CONFIG_NF_CONNTRACK_EVENTS=n), it prevents kernel builds with clang,
`make W=1` and CONFIG_WERROR=y:
net/netfilter/nf_conntrack_netlink.c:657:22: error: unused function 'ctnetlink_acct_size' [-Werror,-Wunused-function]
657 | static inline size_t ctnetlink_acct_size(const struct nf_conn *ct)
| ^~~~~~~~~~~~~~~~~~~
net/netfilter/nf_conntrack_netlink.c:667:19: error: unused function 'ctnetlink_secctx_size' [-Werror,-Wunused-function]
667 | static inline int ctnetlink_secctx_size(const struct nf_conn *ct)
| ^~~~~~~~~~~~~~~~~~~~~
net/netfilter/nf_conntrack_netlink.c:683:22: error: unused function 'ctnetlink_timestamp_size' [-Werror,-Wunused-function]
683 | static inline size_t ctnetlink_timestamp_size(const struct nf_conn *ct)
| ^~~~~~~~~~~~~~~~~~~~~~~~
Fix this by guarding possible unused functions with ifdeffery.
See also commit 6863f5643dd7 ("kbuild: allow Clang to find unused static
inline functions for W=1 build").
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
The TPROXY functionality is widely used, however, there are only mptcp
selftests covering this feature.
The selftests represent the most common scenarios and can also be used
as selfdocumentation of the feature.
UDP and TCP testcases are split in different files because of the
different nature of the protocols, specially due to the challenges that
present to reliable test UDP due to the connectionless nature of the
protocol. UDP only covers the scenarios involving the prerouting hook.
The UDP tests are signfinicantly slower than the TCP ones, hence they
use a larger timeout, it takes 20 seconds to run the full UDP suite
on a 48 vCPU Intel(R) Xeon(R) CPU @2.60GHz.
Signed-off-by: Antonio Ojea <aojea@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Add test program that is sending UDP packets in both directions
and check that packets arrive without source port modification.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Given existing entry:
ORIGIN: a:b -> c:d
REPLY: c:d -> a:b
And colliding entry:
ORIGIN: c:d -> a:b
REPLY: a:b -> c:d
The colliding ct (and the associated skb) get dropped on insert.
Permit this by checking if the colliding entry matches the reply
direction.
Happens when both ends send packets at same time, both requests are picked
up as NEW, rather than NEW for the 'first' and 'ESTABLISHED' for the
second packet.
This is an esoteric condition, as ruleset must permit NEW connections
in either direction and both peers must already have a bidirectional
traffic flow at the time conntrack gets enabled.
Allow the 'reverse' skb to pass and assign the existing (clashing)
entry.
While at it, also drop the extra 'dying' check, this is already
tested earlier by the calling function.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
A conntrack entry can be inserted to the connection tracking table if there
is no existing entry with an identical tuple in either direction.
Example:
INITIATOR -> NAT/PAT -> RESPONDER
Initiator passes through NAT/PAT ("us") and SNAT is done (saddr rewrite).
Then, later, NAT/PAT machine itself also wants to connect to RESPONDER.
This will not work if the SNAT done earlier has same IP:PORT source pair.
Conntrack table has:
ORIGINAL: $IP_INITATOR:$SPORT -> $IP_RESPONDER:$DPORT
REPLY: $IP_RESPONDER:$DPORT -> $IP_NAT:$SPORT
and new locally originating connection wants:
ORIGINAL: $IP_NAT:$SPORT -> $IP_RESPONDER:$DPORT
REPLY: $IP_RESPONDER:$DPORT -> $IP_NAT:$SPORT
This is handled by the NAT engine which will do a source port reallocation
for the locally originating connection that is colliding with an existing
tuple by attempting a source port rewrite.
This is done even if this new connection attempt did not go through a
masquerade/snat rule.
There is a rare race condition with connection-less protocols like UDP,
where we do the port reallocation even though its not needed.
This happens when new packets from the same, pre-existing flow are received
in both directions at the exact same time on different CPUs after the
conntrack table was flushed (or conntrack becomes active for first time).
With strict ordering/single cpu, the first packet creates new ct entry and
second packet is resolved as established reply packet.
With parallel processing, both packets are picked up as new and both get
their own ct entry.
In this case, the 'reply' packet (picked up as ORIGINAL) can be mangled by
NAT engine because a port collision is detected.
This change isn't enough to prevent a packet drop later during
nf_conntrack_confirm(), the existing clash resolution strategy will not
detect such reverse clash case. This is resolved by a followup patch.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Some packetdrill tests are flaky in debug mode. As discussed, increase
tolerance.
We have been doing this for debug builds outside ksft too.
Previous setting was 10000. A manual 50 runs in virtme-ng showed two
failures that needed 12000. To be on the safe side, Increase to 14000.
Link: https://lore.kernel.org/netdev/Zuhhe4-MQHd3EkfN@mini-arch/
Fixes: 1e42f73fd3c2 ("selftests/net: packetdrill: import tcp/zerocopy")
Reported-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20240919124412.3014326-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
The work can submit URBs and the URBs can schedule the work.
This cycle needs to be broken, when a device is to be stopped.
Use a flag to do so.
This is a design issue as old as the driver.
Signed-off-by: Oliver Neukum <oneukum@suse.com>
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
CC: stable@vger.kernel.org
Link: https://patch.msgid.link/20240919123525.688065-1-oneukum@suse.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Commit 5fabb01207a2 ("net: stmmac: Add initial XDP support") sets
PP_FLAG_DMA_SYNC_DEV flag for page_pool unconditionally,
page_pool_recycle_direct() will call page_pool_dma_sync_for_device()
on every page even the page is not going to be reused by XDP program.
When XDP is not enabled, the page which holds the received buffer
will be recycled once the buffer is copied into new SKB by
skb_copy_to_linear_data(), then the MAC core will never reuse this
page any longer. Always setting PP_FLAG_DMA_SYNC_DEV wastes CPU cycles
on unnecessary calling of page_pool_dma_sync_for_device().
After this patch, up to 9% noticeable performance improvement was observed
on certain platforms.
Fixes: 5fabb01207a2 ("net: stmmac: Add initial XDP support")
Signed-off-by: Furong Xu <0x1207@gmail.com>
Link: https://patch.msgid.link/20240919121028.1348023-1-0x1207@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Currently, the virtio-net driver will perform a pre-dma-mapping for
small or mergeable RX buffer. But for small packets, a mismatched address
without VIRTNET_RX_PAD and xdp_headroom is used for unmapping.
That will result in unsynchronized buffers when SWIOTLB is enabled, for
example, when running as a TDX guest.
This patch unifies the address passed to the virtio core as the address of
the virtnet header and fixes the mismatched buffer address.
Changes from v2: unify the buf that passed to the virtio core in small
and merge mode.
Changes from v1: Use ctx to get xdp_headroom.
Fixes: 295525e29a5b ("virtio_net: merge dma operations when filling mergeable buffers")
Signed-off-by: Wenbo Li <liwenbo.martin@bytedance.com>
Signed-off-by: Jiahui Cen <cenjiahui@bytedance.com>
Signed-off-by: Ying Fang <fangying.tommy@bytedance.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Link: https://patch.msgid.link/20240919081351.51772-1-liwenbo.martin@bytedance.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
This reverts commit 9184b17fbc23 ("dt-bindings: input: Goodix SPI HID
Touchscreen") because it duplicates existing binding leadings to errors:
goodix,gt7986u.example.dtb:
touchscreen@0: compatible: 'oneOf' conditional failed, one must be fixed:
['goodix,gt7986u'] is too short
'goodix,gt7375p' was expected
This was reported on mailing list on 6th of September, but no reaction
happened from contributor or maintainer to fix it.
Therefore let's drop binding which breaks and duplicates existing one.
Fixes: 9184b17fbc23 ("dt-bindings: input: Goodix SPI HID Touchscreen")
Reported-by: Rob Herring <robh@kernel.org>
Closes: https://lore.kernel.org/all/CAL_Jsq+QfTtRj_JCqXzktQ49H8VUnztVuaBjvvkg3fwEHniUHw@mail.gmail.com/
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Jiri Kosina <jkosina@suse.com>
|
|
Drop support for Devicetree from, because the binding is being reverted
(on basis of duplicating existing binding) and property was not added to
the original binding.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Jiri Kosina <jkosina@suse.com>
|
|
The km.state is not checked in driver's delayed work. When
xfrm_state_check_expire() is called, the state can be reset to
XFRM_STATE_EXPIRED, even if it is XFRM_STATE_DEAD already. This
happens when xfrm state is deleted, but not freed yet. As
__xfrm_state_delete() is called again in xfrm timer, the following
crash occurs.
To fix this issue, skip xfrm_state_check_expire() if km.state is not
XFRM_STATE_VALID.
Oops: general protection fault, probably for non-canonical address 0xdead000000000108: 0000 [#1] SMP
CPU: 5 UID: 0 PID: 7448 Comm: kworker/u102:2 Not tainted 6.11.0-rc2+ #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Workqueue: mlx5e_ipsec: eth%d mlx5e_ipsec_handle_sw_limits [mlx5_core]
RIP: 0010:__xfrm_state_delete+0x3d/0x1b0
Code: 0f 84 8b 01 00 00 48 89 fd c6 87 c8 00 00 00 05 48 8d bb 40 10 00 00 e8 11 04 1a 00 48 8b 95 b8 00 00 00 48 8b 85 c0 00 00 00 <48> 89 42 08 48 89 10 48 8b 55 10 48 b8 00 01 00 00 00 00 ad de 48
RSP: 0018:ffff88885f945ec8 EFLAGS: 00010246
RAX: dead000000000122 RBX: ffffffff82afa940 RCX: 0000000000000036
RDX: dead000000000100 RSI: 0000000000000000 RDI: ffffffff82afb980
RBP: ffff888109a20340 R08: ffff88885f945ea0 R09: 0000000000000000
R10: 0000000000000000 R11: ffff88885f945ff8 R12: 0000000000000246
R13: ffff888109a20340 R14: ffff88885f95f420 R15: ffff88885f95f400
FS: 0000000000000000(0000) GS:ffff88885f940000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f2163102430 CR3: 00000001128d6001 CR4: 0000000000370eb0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<IRQ>
? die_addr+0x33/0x90
? exc_general_protection+0x1a2/0x390
? asm_exc_general_protection+0x22/0x30
? __xfrm_state_delete+0x3d/0x1b0
? __xfrm_state_delete+0x2f/0x1b0
xfrm_timer_handler+0x174/0x350
? __xfrm_state_delete+0x1b0/0x1b0
__hrtimer_run_queues+0x121/0x270
hrtimer_run_softirq+0x88/0xd0
handle_softirqs+0xcc/0x270
do_softirq+0x3c/0x50
</IRQ>
<TASK>
__local_bh_enable_ip+0x47/0x50
mlx5e_ipsec_handle_sw_limits+0x7d/0x90 [mlx5_core]
process_one_work+0x137/0x2d0
worker_thread+0x28d/0x3a0
? rescuer_thread+0x480/0x480
kthread+0xb8/0xe0
? kthread_park+0x80/0x80
ret_from_fork+0x2d/0x50
? kthread_park+0x80/0x80
ret_from_fork_asm+0x11/0x20
</TASK>
Fixes: b2f7b01d36a9 ("net/mlx5e: Simulate missing IPsec TX limits hardware functionality")
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
When having larger RQ sizes and small MTUs sizes, the hd_per_wq variable
can overflow. Like in the following case:
$> ethtool --set-ring eth1 rx 8192
$> ip link set dev eth1 mtu 144
$> ethtool --features eth1 rx-gro-hw on
... yields in dmesg:
mlx5_core 0000:08:00.1: mlx5_cmd_out_err:808:(pid 194797): CREATE_MKEY(0x200) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x3bf6f), err(-22)
because hd_per_wq is 64K which overflows to 0 and makes the command
fail.
This patch increases the variable size to 32 bit.
Fixes: 99be56171fa9 ("net/mlx5e: SHAMPO, Re-enable HW-GRO")
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Fixed all the 'E2BIG' returns in error flow of functions to
the negative '-E2BIG' as we are using negative error codes
everywhere in HWS code.
This also fixes the following smatch warnings:
"warn: was negative '-E2BIG' intended?"
Fixes: 74a778b4a63f ("net/mlx5: HWS, added definers handling")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/all/f8c77688-7d83-4937-baba-ac844dfe2e0b@stanley.mountain/
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
When SQ creation fails, call the appropriate mlx5_core destroy function.
This fixes the following smatch warnings:
divers/net/ethernet/mellanox/mlx5/core/steering/hws/mlx5hws_send.c:739
hws_send_ring_open_sq() warn: 'sq->dep_wqe' double freed
hws_send_ring_open_sq() warn: 'sq->wq_ctrl.buf.frags' double freed
hws_send_ring_open_sq() warn: 'sq->wr_priv' double freed
Fixes: 2ca62599aa0b ("net/mlx5: HWS, added send engine and context handling")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/all/e4ebc227-4b25-49bf-9e4c-14b7ea5c6a07@stanley.mountain/
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Fixing the wrong size of a field in hca_cap_2.
The bug was introduced by adding new fields for HWS
and not fixing the reserved field size.
Fixes: 34c626c3004a ("net/mlx5: Added missing mlx5_ifc definition for HW Steering")
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
In mlx5e_tir_builder_alloc() kvzalloc() may return NULL
which is dereferenced on the next line in a reference
to the modify field.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes: a6696735d694 ("net/mlx5e: Convert TIR to a dedicated object")
Signed-off-by: Elena Salomatkina <esalomatkina@ispras.ru>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Collecting crdump involves reading vsc registers from pci config space
of mlx device, which can take long time to complete. This might result
in starving other threads waiting to run on the cpu.
Numbers I got from testing ConnectX-5 Ex MCX516A-CDAT in the lab:
- mlx5_vsc_gw_read_block_fast() was called with length = 1310716.
- mlx5_vsc_gw_read_fast() reads 4 bytes at a time. It was not used to
read the entire 1310716 bytes. It was called 53813 times because
there are jumps in read_addr.
- On average mlx5_vsc_gw_read_fast() took 35284.4ns.
- In total mlx5_vsc_wait_on_flag() called vsc_read() 54707 times.
The average time for each call was 17548.3ns. In some instances
vsc_read() was called more than one time when the flag was not set.
As expected the thread released the cpu after 16 iterations in
mlx5_vsc_wait_on_flag().
- Total time to read crdump was 35284.4ns * 53813 ~= 1.898s.
It was seen in the field that crdump can take more than 5 seconds to
complete. During that time mlx5_vsc_wait_on_flag() did not release the
cpu because it did not complete 16 iterations. It is believed that pci
config reads were slow. Adding cond_resched() every 128 register read
improves the situation. In the common case the, crdump takes ~1.8989s,
the thread yields the cpu every ~4.51ms. If crdump takes ~5s, the thread
yields the cpu every ~18.0ms.
Fixes: 8b9d8baae1de ("net/mlx5: Add Crdump support")
Reviewed-by: Yuanyuan Zhong <yzhong@purestorage.com>
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Remove the erroneous unmap in case no DMA mapping was established
The multi-packet WQE transmit code attempts to obtain a DMA mapping for
the skb. This could fail, e.g. under memory pressure, when the IOMMU
driver just can't allocate more memory for page tables. While the code
tries to handle this in the path below the err_unmap label it erroneously
unmaps one entry from the sq's FIFO list of active mappings. Since the
current map attempt failed this unmap is removing some random DMA mapping
that might still be required. If the PCI function now presents that IOVA,
the IOMMU may assumes a rogue DMA access and e.g. on s390 puts the PCI
function in error state.
The erroneous behavior was seen in a stress-test environment that created
memory pressure.
Fixes: 5af75c747e2a ("net/mlx5e: Enhanced TX MPWQE for SKBs")
Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Acked-by: Maxim Mikityanskiy <maxtram95@gmail.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Currently the Rust support is gated on not having MODVERSIONS enabled,
and as a result an "allmodconfig" build will disable Rust build tests.
While MODVERSIONS configurations are worth build testing, the feature is
not actually meaningful unless you run the result, and I'd rather get
build coverage of Rust than MODVERSIONS. So let's disable MODVERSIONS
for build testing until the Rust side clears up.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|