aboutsummaryrefslogtreecommitdiffstatshomepage
path: root/tools/perf/scripts/python/export-to-sqlite.py (unfollow)
AgeCommit message (Collapse)AuthorFilesLines
2025-05-15xfrm: validate assignment of maximal possible SEQ numberLeon Romanovsky1-10/+42
Users can set any seq/seq_hi/oseq/oseq_hi values. The XFRM core code doesn't prevent from them to set even 0xFFFFFFFF, however this value will cause for traffic drop. Is is happening because SEQ numbers here mean that packet with such number was processed and next number should be sent on the wire. In this case, the next number will be 0, and it means overflow which causes to (expected) packet drops. While it can be considered as misconfiguration and handled by XFRM datapath in the same manner as any other SEQ number, let's add validation to easy for packet offloads implementations which need to configure HW with next SEQ to send and not with current SEQ like it is done in core code. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-04-17xfrm: Refactor migration setup during the cloning processChiachang Wang1-8/+9
Previously, migration related setup, such as updating family, destination address, and source address, was performed after the clone was created in `xfrm_state_migrate`. This change moves this setup into the cloning function itself, improving code locality and reducing redundancy. The `xfrm_state_clone_and_setup` function now conditionally applies the migration parameters from struct xfrm_migrate if it is provided. This allows the function to be used both for simple cloning and for cloning with migration setup. Test: Tested with kernel test in the Android tree located in https://android.googlesource.com/kernel/tests/ The xfrm_tunnel_test.py under the tests folder in particular. Signed-off-by: Chiachang Wang <chiachangwang@google.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-04-17xfrm: Migrate offload configurationChiachang Wang5-9/+29
Add hardware offload configuration to XFRM_MSG_MIGRATE using an option netlink attribute XFRMA_OFFLOAD_DEV. In the existing xfrm_state_migrate(), the xfrm_init_state() is called assuming no hardware offload by default. Even the original xfrm_state is configured with offload, the setting will be reset. If the device is configured with hardware offload, it's reasonable to allow the device to maintain its hardware offload mode. But the device will end up with offload disabled after receiving a migration event when the device migrates the connection from one netdev to another one. The devices that support migration may work with different underlying networks, such as mobile devices. The hardware setting should be forwarded to the different netdev based on the migration configuration. This change provides the capability for user space to migrate from one netdev to another. Test: Tested with kernel test in the Android tree located in https://android.googlesource.com/kernel/tests/ The xfrm_tunnel_test.py under the tests folder in particular. Signed-off-by: Chiachang Wang <chiachangwang@google.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-04-16bonding: Fix multiple long standing offload racesCosmin Ratiu2-48/+41
Refactor the bonding ipsec offload operations to fix a number of long-standing control plane races between state migration and user deletion and a few other issues. xfrm state deletion can happen concurrently with bond_change_active_slave() operation. This manifests itself as a bond_ipsec_del_sa() call with x->lock held, followed by a bond_ipsec_free_sa() a bit later from a wq. The alternate path of these calls coming from xfrm_dev_state_flush() can't happen, as that needs the RTNL lock and bond_change_active_slave() already holds it. 1. bond_ipsec_del_sa_all() might call xdo_dev_state_delete() a second time on an xfrm state that was concurrently killed. This is bad. 2. bond_ipsec_add_sa_all() can add a state on the new device, but pending bond_ipsec_free_sa() calls from the old device will then hit the WARN_ON() and then, worse, call xdo_dev_state_free() on the new device without a corresponding xdo_dev_state_delete(). 3. Resolve a sleeping in atomic context introduced by the mentioned "Fixes" commit. bond_ipsec_del_sa_all() and bond_ipsec_add_sa_all() now acquire x->lock and check for x->km.state to help with problems 1 and 2. And since xso.real_dev is now a private pointer managed by the bonding driver in xfrm state, make better use of it to fully fix problems 1 and 2. In bond_ipsec_del_sa_all(), set xso.real_dev to NULL while holding both the mutex and x->lock, which makes sure that neither bond_ipsec_del_sa() nor bond_ipsec_free_sa() could run concurrently. Fix problem 3 by moving the list cleanup (which requires the mutex) from bond_ipsec_del_sa() (called from atomic context) to bond_ipsec_free_sa() Finally, simplify bond_ipsec_del_sa() and bond_ipsec_free_sa() by using xso->real_dev directly, since it's now protected by locks and can be trusted to always reflect the offload device. Fixes: 2aeeef906d5a ("bonding: change ipsec_lock from spin lock to mutex") Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Tested-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-04-16bonding: Mark active offloaded xfrm_statesCosmin Ratiu1-3/+5
When the active link is changed for a bond device, the existing xfrm states need to be migrated over to the new link. This is done with: - bond_ipsec_del_sa_all() goes through the offloaded states list and removes all of them from hw. - bond_ipsec_add_sa_all() re-offloads all states to the new device. But because the offload status of xfrm states isn't marked in any way, there can be bugs. When all bond links are down, bond_ipsec_del_sa_all() unoffloads everything from the previous active link. If the same link then comes back up, nothing gets reoffloaded by bond_ipsec_add_sa_all(). This results in a stack trace like this a bit later when user space removes the offloaded rules, because mlx5e_xfrm_del_state() is asked to remove a rule that's no longer offloaded: [] Call Trace: [] <TASK> [] ? __warn+0x7d/0x110 [] ? mlx5e_xfrm_del_state+0x90/0xa0 [mlx5_core] [] ? report_bug+0x16d/0x180 [] ? handle_bug+0x4f/0x90 [] ? exc_invalid_op+0x14/0x70 [] ? asm_exc_invalid_op+0x16/0x20 [] ? mlx5e_xfrm_del_state+0x73/0xa0 [mlx5_core] [] ? mlx5e_xfrm_del_state+0x90/0xa0 [mlx5_core] [] bond_ipsec_del_sa+0x1ab/0x200 [bonding] [] xfrm_dev_state_delete+0x1f/0x60 [] __xfrm_state_delete+0x196/0x200 [] xfrm_state_delete+0x21/0x40 [] xfrm_del_sa+0x69/0x110 [] xfrm_user_rcv_msg+0x11d/0x300 [] ? release_pages+0xca/0x140 [] ? copy_to_user_tmpl.part.0+0x110/0x110 [] netlink_rcv_skb+0x54/0x100 [] xfrm_netlink_rcv+0x31/0x40 [] netlink_unicast+0x1fc/0x2d0 [] netlink_sendmsg+0x1e4/0x410 [] __sock_sendmsg+0x38/0x60 [] sock_write_iter+0x94/0xf0 [] vfs_write+0x338/0x3f0 [] ksys_write+0xba/0xd0 [] do_syscall_64+0x4c/0x100 [] entry_SYSCALL_64_after_hwframe+0x4b/0x53 There's also another theoretical bug: Calling bond_ipsec_del_sa_all() multiple times can result in corruption in the driver implementation if the double-free isn't tolerated. This isn't nice. Before the "Fixes" commit, xs->xso.real_dev was set to NULL when an xfrm state was unoffloaded from a device, but a race with netdevsim's .xdo_dev_offload_ok() accessing real_dev was considered a sufficient reason to not set real_dev to NULL anymore. This unfortunately introduced the new bugs. Since .xdo_dev_offload_ok() was significantly refactored by [1] and there are no more users in the stack of xso.real_dev, that race is now gone and xs->xso.real_dev can now once again be used to represent which device (if any) currently holds the offloaded rule. Go one step further and set real_dev after add/before delete calls, to catch any future driver misuses of real_dev. [1] https://lore.kernel.org/netdev/cover.1739972570.git.leon@kernel.org/ Fixes: f8cde9805981 ("bonding: fix xfrm real_dev null pointer dereference") Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Tested-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-04-16xfrm: Add explicit dev to .xdo_dev_state_{add,delete,free}Cosmin Ratiu14-99/+136
Previously, device driver IPSec offload implementations would fall into two categories: 1. Those that used xso.dev to determine the offload device. 2. Those that used xso.real_dev to determine the offload device. The first category didn't work with bonding while the second did. In a non-bonding setup the two pointers are the same. This commit adds explicit pointers for the offload netdevice to .xdo_dev_state_add() / .xdo_dev_state_delete() / .xdo_dev_state_free() which eliminates the confusion and allows drivers from the first category to work with bonding. xso.real_dev now becomes a private pointer managed by the bonding driver. Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-04-16xfrm: Remove unneeded device check from validate_xmit_xfrmCosmin Ratiu1-6/+1
validate_xmit_xfrm checks whether a packet already passed through it on the master device (xso.dev) and skips processing the skb again on the slave device (xso.real_dev). This check was added in commit [1] to avoid tx packets on a bond device pass through xfrm twice and get two sets of headers, but the check was soon obsoleted by commit [2], which was added around the same time to fix a similar but unrelated problem. Commit [3] set XFRM_XMIT only when packets are hw offloaded. xso.dev is usually equal to xso.real_dev, unless bonding is used, in which case the bonding driver uses xso.real_dev to manage offloaded xfrm states. Since commit [3], the check added in commit [1] is unused on all cases, since packets going through validate_xmit_xfrm twice bail out on the check added in commit [2]. Here's a breakdown of relevant scenarios: 1. ESP offload off: validate_xmit_xfrm returns early on !xo. 2. ESP offload on, no bond: skb->dev == xso.real_dev == xso.dev. 3. ESP offload on, bond, xs on bond dev: 1st pass adds XFRM_XMIT, 2nd pass returns early on XFRM_XMIT. 3. ESP offload on, bond, xs on slave dev: 1st pass returns early on !xo, 2nd pass adds XFRM_XMIT. 4. ESP offload on, bond, xs on both bond AND slave dev: only 1 offload possible in secpath. Either 1st pass adds XFRM_XMIT and 2nd pass returns early on XFRM_XMIT, or 1st pass is sw and returns early on !xo. 6. ESP offload on, crypto fallback triggered in esp_xmit/esp6_xmit: 1st pass does sw crypto & secpath_reset, 2nd pass returns on !xo. This commit removes the unnecessary check, so xso.real_dev becomes what it is in practice: a private field managed by bonding driver. The check immediately below that can be simplified as well. [1] commit 272c2330adc9 ("xfrm: bail early on slave pass over skb") [2] commit 94579ac3f6d0 ("xfrm: Fix double ESP trailer insertion in IPsec crypto offload.") [3] commit c7dbf4c08868 ("xfrm: Provide private skb extensions for segmented and hw offloaded ESP packets") Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-04-16xfrm: Use xdo.dev instead of xdo.real_devCosmin Ratiu3-5/+1
The policy offload struct was reused from the state offload and real_dev was copied from dev, but it was never set to anything else. Simplify the code by always using xdo.dev for policies. Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-04-16net/mlx5: Avoid using xso.real_dev unnecessarilyCosmin Ratiu2-11/+6
xso.real_dev is the active device of an offloaded xfrm state and is managed by bonding. As such, it's subject to change when states are migrated to a new device. Using it in places other than offloading/unoffloading the states is risky. This commit saves the device into the driver-specific struct mlx5e_ipsec_sa_entry and switches mlx5e_ipsec_init_macs() and mlx5e_ipsec_netevent_event() to make use of it. Additionally, mlx5e_xfrm_update_stats() used xso.real_dev to validate that correct net locks are held. But in a bonding config, the net of the master device is the same as the underlying devices, and the net is already a local var, so use that instead. The only remaining references to xso.real_dev are now in the .xdo_dev_state_add() / .xdo_dev_state_delete() path. Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-04-11xfrm: Remove unnecessary strscpy_pad() size argumentsThorsten Blum1-5/+5
If the destination buffer has a fixed length, strscpy_pad() automatically determines its size using sizeof() when the argument is omitted. This makes the explicit sizeof() calls unnecessary - remove them. No functional changes intended. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-04-10r8169: add helper rtl8125_phy_paramHeiner Kallweit1-20/+16
The integrated PHY's of RTL8125/8126 have an own mechanism to access PHY parameters, similar to what r8168g_phy_param does on earlier PHY versions. Add helper rtl8125_phy_param to simplify the code. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Link: https://patch.msgid.link/847b7356-12d6-441b-ade9-4b6e1539b84a@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10r8169: add helper rtl_csi_mod for accessing extended config spaceHeiner Kallweit1-10/+16
Add a helper for the Realtek-specific mechanism for accessing extended config space if native access isn't possible. This avoids code duplication. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Link: https://patch.msgid.link/b368fd91-57d7-4cb5-9342-98b4d8fe9aea@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10ipv4: remove unnecessary judgment in ip_route_output_key_hash_rcuZhengchao Shao1-2/+1
In the ip_route_output_key_cash_rcu function, the input fl4 member saddr is first checked to be non-zero before entering multicast, broadcast and arbitrary IP address checks. However, the fact that the IP address is not 0 has already ruled out the possibility of any address, so remove unnecessary judgment. Signed-off-by: Zhengchao Shao <shaozhengchao@163.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250409033321.108244-1-shaozhengchao@163.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10tools: ynl: generate code for rt-route and add a sampleJakub Kicinski4-1/+83
YNL C can now generate code for simple classic netlink families. Include rt-route in the Makefile for generation and add a sample. $ ./tools/net/ynl/samples/rt-route oif: wlp0s20f3 gateway: 192.168.1.1 oif: wlp0s20f3 dst: 192.168.1.0/24 oif: vpn0 dst: fe80::/64 oif: wlp0s20f3 dst: fe80::/64 oif: wlp0s20f3 gateway: fe80::200:5eff:fe00:201 Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20250410014658.782120-14-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10tools: ynl: generate code for rt-addr and add a sampleJakub Kicinski4-2/+84
YNL C can now generate code for simple classic netlink families. Include rt-addr in the Makefile for generation and add a sample. $ ./tools/net/ynl/samples/rt-addr lo: 127.0.0.1 wlp0s20f3: 192.168.1.101 lo: :: wlp0s20f3: fe80::6385:be6:746e:8116 vpn0: fe80::3597:d353:b5a7:66dd Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20250410014658.782120-13-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10tools: ynl-gen: use family c-name in notificationsJakub Kicinski1-3/+3
Family names may include dashes. Fix notification handling code gen to the c-compatible name. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20250410014658.782120-12-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10tools: ynl-gen: consider dump ops without a do "type-consistent"Jakub Kicinski1-5/+9
If the type for the response to do and dump are the same we don't generate it twice. This is called "type_consistent" in the generator. Consider operations which only have dump to also be consistent. This removes unnecessary "_dump" from the names. There's a number of GET ops in classic Netlink which only have dump handlers. Make sure we output the "onesided" types, normally if the type is consistent we only output it when we render the do structures. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20250410014658.782120-11-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10tools: ynl: don't use genlmsghdr in classic netlinkJakub Kicinski3-8/+22
Make sure the codegen calls the right YNL lib helper to start the request based on family type. Classic netlink request must not include the genl header. Conversely don't expect genl headers in the responses. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20250410014658.782120-10-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10tools: ynl-gen: don't consider requests with fixed hdr emptyJakub Kicinski1-2/+5
C codegen skips generating the structs if request/reply has no attrs. In such cases the request op takes no argument and return int (rather than response struct). In case of classic netlink a lot of information gets passed using the fixed struct, however, so adjust the logic to consider a request empty only if it has no attrs _and_ no fixed struct. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20250410014658.782120-9-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10tools: ynl: support creating non-genl socketsJakub Kicinski3-20/+43
Classic netlink has static family IDs specified in YAML, there is no family name -> ID lookup. Support providing the ID info to the library via the generated struct and make library use it. Since NETLINK_ROUTE is ID 0 we need an extra boolean to indicate classic_id is to be used. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20250410014658.782120-8-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10netlink: specs: rt-route: add C naming infoJakub Kicinski1-0/+3
Add properties needed for C codegen to match names with uAPI headers. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20250410014658.782120-7-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10netlink: specs: rt-addr: add C naming infoJakub Kicinski1-0/+4
Add properties needed for C codegen to match names with uAPI headers. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20250410014658.782120-6-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10netlink: specs: rt-route: remove the fixed members from attrsJakub Kicinski1-14/+1
The purpose of the attribute list is to list the attributes which will be included in a given message to shrink the objects for families with huge attr spaces. Fixed headers are always present in their entirety (between netlink header and the attrs) so there's no point in listing their members. Current C codegen doesn't expect them and tries to look them up in the attribute space. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20250410014658.782120-5-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10netlink: specs: rt-addr: remove the fixed members from attrsJakub Kicinski1-17/+3
The purpose of the attribute list is to list the attributes which will be included in a given message to shrink the objects for families with huge attr spaces. Fixed headers are always present in their entirety (between netlink header and the attrs) so there's no point in listing their members. Current C codegen doesn't expect them and tries to look them up in the attribute space. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20250410014658.782120-4-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10netlink: specs: rt-route: specify fixed-header at operations levelJakub Kicinski1-3/+1
The C codegen currently stores the fixed-header as part of family info, so it only supports one fixed-header type per spec. Luckily all rtm route message have the same fixed header so just move it up to the higher level. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20250410014658.782120-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10netlink: specs: rename rtnetlink specs in accordance with family nameJakub Kicinski7-3/+3
The rtnetlink family names are set to rt-$name within the YAML but the files are called rt_$name. C codegen assumes that the generated file name will match the family. The use of dashes is in line with our general expectation that name properties in the spec use dashes not underscores (even tho, as Donald points out most genl families use underscores in the name). We have 3 un-ideal options to choose from: - accept the slight inconsistency with old families using _, or - accept the slight annoyance with all languages having to do s/-/_/ when looking up family ID, or - accept the inconsistency with all name properties in new YAML spec being separated with - and just the family name always using _. Pick option 1 and rename the rtnl spec files. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20250410014658.782120-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10usbnet: asix AX88772: leave the carrier control to phylinkKrzysztof Hałasa3-36/+4
ASIX AX88772B based USB 10/100 Ethernet adapter doesn't come up ("carrier off"), despite the built-in 100BASE-FX PHY positive link indication. The internal PHY is configured (using EEPROM) in fixed 100 Mbps full duplex mode. The primary problem appears to be using carrier_netif_{on,off}() while, at the same time, delegating carrier management to phylink. Use only the latter and remove "manual control" in the asix driver. I don't have any other AX88772 board here, but the problem doesn't seem specific to a particular board or settings - it's probably timing-dependent. Remove unused asix_adjust_link() as well. Signed-off-by: Krzysztof Hałasa <khalasa@piap.pl> Tested-by: Oleksij Rempel <o.rempel@pengutronix.de> Link: https://patch.msgid.link/m3plhmdfte.fsf_-_@t19.piap.pl Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10trace: tcp: Add tracepoint for tcp_sendmsg_locked()Breno Leitao3-0/+27
Add a tracepoint to monitor TCP send operations, enabling detailed visibility into TCP message transmission. Create a new tracepoint within the tcp_sendmsg_locked function, capturing traditional fields along with size_goal, which indicates the optimal data size for a single TCP segment. Additionally, a reference to the struct sock sk is passed, allowing direct access for BPF programs. The implementation is largely based on David's patch[1] and suggestions. Link: https://lore.kernel.org/all/70168c8f-bf52-4279-b4c4-be64527aa1ac@kernel.org/ [1] Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250408-tcpsendmsg-v3-2-208b87064c28@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10net: pass const to msg_data_left()Breno Leitao1-1/+1
The msg_data_left() function doesn't modify the struct msghdr parameter, so mark it as const. This allows the function to be used with const references, improving type safety and making the API more flexible. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250408-tcpsendmsg-v3-1-208b87064c28@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10net: stmmac: dwc-qos: use stmmac_pltfr_find_clk()Russell King (Oracle)1-14/+4
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/E1u2QO9-001Rp8-Ii@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10net: stmmac: provide stmmac_pltfr_find_clk()Russell King (Oracle)2-0/+14
Provide a generic way to find a clock in the bulk data. Reviewed-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com> Tested-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/E1u2QO4-001Rp2-Dy@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10tcp: add LINUX_MIB_PAWS_TW_REJECTED counterJiayuan Chen5-1/+6
When TCP is in TIME_WAIT state, PAWS verification uses LINUX_PAWSESTABREJECTED, which is ambiguous and cannot be distinguished from other PAWS verification processes. We added a new counter, like the existing PAWS_OLD_ACK one. Also we update the doc with previously missing PAWS_OLD_ACK. usage: ''' nstat -az | grep PAWSTimewait TcpExtPAWSTimewait 1 0.0 ''' Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250409112614.16153-3-jiayuan.chen@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10tcp: add TCP_RFC7323_TW_PAWS drop reasonJiayuan Chen5-5/+17
Devices in the networking path, such as firewalls, NATs, or routers, which can perform SNAT or DNAT, use addresses from their own limited address pools to masquerade the source address during forwarding, causing PAWS verification to fail more easily. Currently, packet loss statistics for PAWS can only be viewed through MIB, which is a global metric and cannot be precisely obtained through tracing to get the specific 4-tuple of the dropped packet. In the past, we had to use kprobe ret to retrieve relevant skb information from tcp_timewait_state_process(). We add a drop_reason pointer, similar to what previous commit does: commit e34100c2ecbb ("tcp: add a drop_reason pointer to tcp_check_req()") This commit addresses the PAWSESTABREJECTED case and also sets the corresponding drop reason. We use 'pwru' to test. Before this commit: '''' ./pwru 'port 9999' 2025/04/07 13:40:19 Listening for events.. TUPLE FUNC 172.31.75.115:12345->172.31.75.114:9999(tcp) sk_skb_reason_drop(SKB_DROP_REASON_NOT_SPECIFIED) ''' After this commit: ''' ./pwru 'port 9999' 2025/04/07 13:51:34 Listening for events.. TUPLE FUNC 172.31.75.115:12345->172.31.75.114:9999(tcp) sk_skb_reason_drop(SKB_DROP_REASON_TCP_RFC7323_TW_PAWS) ''' Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250409112614.16153-2-jiayuan.chen@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10af_unix: Remove unix_unhash()Michal Luczaj1-8/+0
Dummy unix_unhash() was introduced for sockmap in commit 94531cfcbe79 ("af_unix: Add unix_stream_proto for sockmap"), but there's no need to implement it anymore. ->unhash() is only called conditionally: in unix_shutdown() since commit d359902d5c35 ("af_unix: Fix NULL pointer bug in unix_shutdown"), and in BPF proto's sock_map_unhash() since commit 5b4a79ba65a1 ("bpf, sockmap: Don't let sock_map_{close,destroy,unhash} call itself"). Remove it. Signed-off-by: Michal Luczaj <mhal@rbox.co> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250409-cleanup-drop-unix-unhash-v1-1-1659e5b8ee84@rbox.co Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10ethtool: cmis_cdb: Fix incorrect read / write length extensionIdo Schimmel2-16/+3
The 'read_write_len_ext' field in 'struct ethtool_cmis_cdb_cmd_args' stores the maximum number of bytes that can be read from or written to the Local Payload (LPL) page in a single multi-byte access. Cited commit started overwriting this field with the maximum number of bytes that can be read from or written to the Extended Payload (LPL) pages in a single multi-byte access. Transceiver modules that support auto paging can advertise a number larger than 255 which is problematic as 'read_write_len_ext' is a 'u8', resulting in the number getting truncated and firmware flashing failing [1]. Fix by ignoring the maximum EPL access size as the kernel does not currently support auto paging (even if the transceiver module does) and will not try to read / write more than 128 bytes at once. [1] Transceiver module firmware flashing started for device enp177s0np0 Transceiver module firmware flashing in progress for device enp177s0np0 Progress: 0% Transceiver module firmware flashing encountered an error for device enp177s0np0 Status message: Write FW block EPL command failed, LPL length is longer than CDB read write length extension allows. Fixes: 9a3b0d078bd8 ("net: ethtool: Add support for writing firmware blocks using EPL payload") Reported-by: Damodharam Ammepalli <damodharam.ammepalli@broadcom.com> Closes: https://lore.kernel.org/netdev/20250402183123.321036-3-michael.chan@broadcom.com/ Tested-by: Damodharam Ammepalli <damodharam.ammepalli@broadcom.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Damodharam Ammepalli <damodharam.ammepalli@broadcom.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Link: https://patch.msgid.link/20250409112440.365672-1-idosch@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-10selftests: netfilter: add test case for recent mismatch bugFlorian Westphal1-1/+38
Without 'nft_set_pipapo: fix incorrect avx2 match of 5th field octet" this fails: TEST: reported issues Add two elements, flush, re-add 1s [ OK ] net,mac with reload 0s [ OK ] net,port,proto 3s [ OK ] avx2 false match 0s [FAIL] False match for fe80:dead:01fe:0a02:0b03:6007:8009:a001 Other tests do not detect the kernel bug as they only alter parts in the /64 netmask. Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-04-10nft_set_pipapo: fix incorrect avx2 match of 5th field octetFlorian Westphal1-1/+2
Given a set element like: icmpv6 . dead:beef:00ff::1 The value of 'ff' is irrelevant, any address will be matched as long as the other octets are the same. This is because of too-early register clobbering: ymm7 is reloaded with new packet data (pkt[9]) but it still holds data of an earlier load that wasn't processed yet. The existing tests in nft_concat_range.sh selftests do exercise this code path, but do not trigger incorrect matching due to the network prefix limitation. Fixes: 7400b063969b ("nft_set_pipapo: Introduce AVX2-based lookup implementation") Reported-by: sontu mazumdar <sontu21@gmail.com> Closes: https://lore.kernel.org/netfilter/CANgxkqwnMH7fXra+VUfODT-8+qFLgskq3set1cAzqqJaV4iEZg@mail.gmail.com/T/#t Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-04-10net: ppp: Add bound checking for skb data on ppp_sync_txmungArnaud Lecomte1-0/+5
Ensure we have enough data in linear buffer from skb before accessing initial bytes. This prevents potential out-of-bounds accesses when processing short packets. When ppp_sync_txmung receives an incoming package with an empty payload: (remote) gef➤ p *(struct pppoe_hdr *) (skb->head + skb->network_header) $18 = { type = 0x1, ver = 0x1, code = 0x0, sid = 0x2, length = 0x0, tag = 0xffff8880371cdb96 } from the skb struct (trimmed) tail = 0x16, end = 0x140, head = 0xffff88803346f400 "4", data = 0xffff88803346f416 ":\377", truesize = 0x380, len = 0x0, data_len = 0x0, mac_len = 0xe, hdr_len = 0x0, it is not safe to access data[2]. Reported-by: syzbot+29fc8991b0ecb186cf40@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=29fc8991b0ecb186cf40 Tested-by: syzbot+29fc8991b0ecb186cf40@syzkaller.appspotmail.com Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Arnaud Lecomte <contact@arnaud-lcm.com> Link: https://patch.msgid.link/20250408-bound-checking-ppp_txmung-v2-1-94bb6e1b92d0@arnaud-lcm.com [pabeni@redhat.com: fixed subj typo] Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-09net: txgbe: add sriov function supportMengyuan Lou7-4/+101
Add sriov_configure for driver ops. Add mailbox handler wx_msg_task for txgbe. Signed-off-by: Mengyuan Lou <mengyuanlou@net-swift.com> Link: https://patch.msgid.link/ECDC57CF4F2316B9+20250408091556.9640-7-mengyuanlou@net-swift.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-09net: ngbe: add sriov function supportMengyuan Lou6-9/+127
Add sriov_configure for driver ops. Add mailbox handler wx_msg_task for ngbe in the interrupt handler. Add the notification flow when the vfs exist. Signed-off-by: Mengyuan Lou <mengyuanlou@net-swift.com> Link: https://patch.msgid.link/C9A0A43732966022+20250408091556.9640-6-mengyuanlou@net-swift.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-09net: libwx: Add msg task funcMengyuan Lou6-4/+703
Implement wx_msg_task which is used to process mailbox messages sent by vf. Signed-off-by: Mengyuan Lou <mengyuanlou@net-swift.com> Link: https://patch.msgid.link/8257B39B95CDB469+20250408091556.9640-5-mengyuanlou@net-swift.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-09net: libwx: Redesign flow when sriov is enabledMengyuan Lou3-17/+441
Reallocate queue and int resources when sriov is enabled. Redefine macro VMDQ to make it work in VT mode. Signed-off-by: Mengyuan Lou <mengyuanlou@net-swift.com> Link: https://patch.msgid.link/64B616774ABE3C5A+20250408091556.9640-4-mengyuanlou@net-swift.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-09net: libwx: Add sriov api for wangxun nicsMengyuan Lou5-1/+250
Implement sriov_configure interface for wangxun nics in libwx. Enable VT mode and initialize vf control structure, when sriov is enabled. Do not be allowed to disable sriov when vfs are assigned. Signed-off-by: Mengyuan Lou <mengyuanlou@net-swift.com> Link: https://patch.msgid.link/81EA45C21B0A98B0+20250408091556.9640-3-mengyuanlou@net-swift.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-09net: libwx: Add mailbox api for wangxun pf driversMengyuan Lou4-1/+217
Implements the mailbox interfaces for wangxun pf drivers ngbe and txgbe. Signed-off-by: Mengyuan Lou <mengyuanlou@net-swift.com> Link: https://patch.msgid.link/70017BD4D67614A4+20250408091556.9640-2-mengyuanlou@net-swift.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-09net: ethernet: cortina: Use TOE/TSO on all TCPLinus Walleij1-8/+29
It is desireable to push the hardware accelerator to also process non-segmented TCP frames: we pass the skb->len to the "TOE/TSO" offloader and it will handle them. Without this quirk the driver becomes unstable and lock up and and crash. I do not know exactly why, but it is probably due to the TOE (TCP offload engine) feature that is coupled with the segmentation feature - it is not possible to turn one part off and not the other, either both TOE and TSO are active, or neither of them. Not having the TOE part active seems detrimental, as if that hardware feature is not really supposed to be turned off. The datasheet says: "Based on packet parsing and TCP connection/NAT table lookup results, the NetEngine puts the packets belonging to the same TCP connection to the same queue for the software to process. The NetEngine puts incoming packets to the buffer or series of buffers for a jumbo packet. With this hardware acceleration, IP/TCP header parsing, checksum validation and connection lookup are offloaded from the software processing." After numerous tests with the hardware locking up after something between minutes and hours depending on load using iperf3 I have concluded this is necessary to stabilize the hardware. Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Link: https://patch.msgid.link/20250408-gemini-ethernet-tso-always-v1-1-e669f932359c@linaro.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-09selftests: test_bridge_neigh_suppress: Test unicast ARP/NS with suppressionAmit Cohen1-0/+125
Add test cases to check that unicast ARP/NS packets are replied once, even if ARP/ND suppression is enabled. Without the previous patch: $ ./test_bridge_neigh_suppress.sh ... Unicast ARP, per-port ARP suppression - VLAN 10 ----------------------------------------------- TEST: "neigh_suppress" is on [ OK ] TEST: Unicast ARP, suppression on, h1 filter [FAIL] TEST: Unicast ARP, suppression on, h2 filter [ OK ] Unicast ARP, per-port ARP suppression - VLAN 20 ----------------------------------------------- TEST: "neigh_suppress" is on [ OK ] TEST: Unicast ARP, suppression on, h1 filter [FAIL] TEST: Unicast ARP, suppression on, h2 filter [ OK ] ... Unicast NS, per-port NS suppression - VLAN 10 --------------------------------------------- TEST: "neigh_suppress" is on [ OK ] TEST: Unicast NS, suppression on, h1 filter [FAIL] TEST: Unicast NS, suppression on, h2 filter [ OK ] Unicast NS, per-port NS suppression - VLAN 20 --------------------------------------------- TEST: "neigh_suppress" is on [ OK ] TEST: Unicast NS, suppression on, h1 filter [FAIL] TEST: Unicast NS, suppression on, h2 filter [ OK ] ... Tests passed: 156 Tests failed: 4 With the previous patch: $ ./test_bridge_neigh_suppress.sh ... Unicast ARP, per-port ARP suppression - VLAN 10 ----------------------------------------------- TEST: "neigh_suppress" is on [ OK ] TEST: Unicast ARP, suppression on, h1 filter [ OK ] TEST: Unicast ARP, suppression on, h2 filter [ OK ] Unicast ARP, per-port ARP suppression - VLAN 20 ----------------------------------------------- TEST: "neigh_suppress" is on [ OK ] TEST: Unicast ARP, suppression on, h1 filter [ OK ] TEST: Unicast ARP, suppression on, h2 filter [ OK ] ... Unicast NS, per-port NS suppression - VLAN 10 --------------------------------------------- TEST: "neigh_suppress" is on [ OK ] TEST: Unicast NS, suppression on, h1 filter [ OK ] TEST: Unicast NS, suppression on, h2 filter [ OK ] Unicast NS, per-port NS suppression - VLAN 20 --------------------------------------------- TEST: "neigh_suppress" is on [ OK ] TEST: Unicast NS, suppression on, h1 filter [ OK ] TEST: Unicast NS, suppression on, h2 filter [ OK ] ... Tests passed: 160 Tests failed: 0 Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/dc240b9649b31278295189f412223f320432c5f2.1744123493.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-09net: bridge: Prevent unicast ARP/NS packets from being suppressed by bridgeAmit Cohen1-0/+7
When Proxy ARP or ARP/ND suppression are enabled, ARP/NS packets can be handled by bridge in br_do_proxy_suppress_arp()/br_do_suppress_nd(). For broadcast packets, they are replied by bridge, but later they are not flooded. Currently, unicast packets are replied by bridge when suppression is enabled, and they are also forwarded, which results two replicas of ARP reply/NA - one from the bridge and second from the target. RFC 1122 describes use case for unicat ARP packets - "unicast poll" - actively poll the remote host by periodically sending a point-to-point ARP request to it, and delete the entry if no ARP reply is received from N successive polls. The purpose of ARP/ND suppression is to reduce flooding in the broadcast domain. If a host is sending a unicast ARP/NS, then it means it already knows the address and the switches probably know it as well and there will not be any flooding. In addition, the use case of unicast ARP/NS is to poll a specific host, so it does not make sense to have the switch answer on behalf of the host. According to RFC 9161: "A PE SHOULD reply to broadcast/multicast address resolution messages, i.e., ARP Requests, ARP probes, NS messages, as well as DAD NS messages. An ARP probe is an ARP Request constructed with an all-zero sender IP address that may be used by hosts for IPv4 Address Conflict Detection as specified in [RFC5227]. A PE SHOULD NOT reply to unicast address resolution requests (for instance, NUD NS messages)." Forward such requests and prevent the bridge from replying to them. Reported-by: Denis Yulevych <denisyu@nvidia.com> Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/6bf745a149ddfe5e6be8da684a63aa574a326f8d.1744123493.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-09net: Fix null-ptr-deref by sock_lock_init_class_and_name() and rmmod.Kuniyuki Iwashima2-2/+43
When I ran the repro [0] and waited a few seconds, I observed two LOCKDEP splats: a warning immediately followed by a null-ptr-deref. [1] Reproduction Steps: 1) Mount CIFS 2) Add an iptables rule to drop incoming FIN packets for CIFS 3) Unmount CIFS 4) Unload the CIFS module 5) Remove the iptables rule At step 3), the CIFS module calls sock_release() for the underlying TCP socket, and it returns quickly. However, the socket remains in FIN_WAIT_1 because incoming FIN packets are dropped. At this point, the module's refcnt is 0 while the socket is still alive, so the following rmmod command succeeds. # ss -tan State Recv-Q Send-Q Local Address:Port Peer Address:Port FIN-WAIT-1 0 477 10.0.2.15:51062 10.0.0.137:445 # lsmod | grep cifs cifs 1159168 0 This highlights a discrepancy between the lifetime of the CIFS module and the underlying TCP socket. Even after CIFS calls sock_release() and it returns, the TCP socket does not die immediately in order to close the connection gracefully. While this is generally fine, it causes an issue with LOCKDEP because CIFS assigns a different lock class to the TCP socket's sk->sk_lock using sock_lock_init_class_and_name(). Once an incoming packet is processed for the socket or a timer fires, sk->sk_lock is acquired. Then, LOCKDEP checks the lock context in check_wait_context(), where hlock_class() is called to retrieve the lock class. However, since the module has already been unloaded, hlock_class() logs a warning and returns NULL, triggering the null-ptr-deref. If LOCKDEP is enabled, we must ensure that a module calling sock_lock_init_class_and_name() (CIFS, NFS, etc) cannot be unloaded while such a socket is still alive to prevent this issue. Let's hold the module reference in sock_lock_init_class_and_name() and release it when the socket is freed in sk_prot_free(). Note that sock_lock_init() clears sk->sk_owner for svc_create_socket() that calls sock_lock_init_class_and_name() for a listening socket, which clones a socket by sk_clone_lock() without GFP_ZERO. [0]: CIFS_SERVER="10.0.0.137" CIFS_PATH="//${CIFS_SERVER}/Users/Administrator/Desktop/CIFS_TEST" DEV="enp0s3" CRED="/root/WindowsCredential.txt" MNT=$(mktemp -d /tmp/XXXXXX) mount -t cifs ${CIFS_PATH} ${MNT} -o vers=3.0,credentials=${CRED},cache=none,echo_interval=1 iptables -A INPUT -s ${CIFS_SERVER} -j DROP for i in $(seq 10); do umount ${MNT} rmmod cifs sleep 1 done rm -r ${MNT} iptables -D INPUT -s ${CIFS_SERVER} -j DROP [1]: DEBUG_LOCKS_WARN_ON(1) WARNING: CPU: 10 PID: 0 at kernel/locking/lockdep.c:234 hlock_class (kernel/locking/lockdep.c:234 kernel/locking/lockdep.c:223) Modules linked in: cifs_arc4 nls_ucs2_utils cifs_md4 [last unloaded: cifs] CPU: 10 UID: 0 PID: 0 Comm: swapper/10 Not tainted 6.14.0 #36 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 RIP: 0010:hlock_class (kernel/locking/lockdep.c:234 kernel/locking/lockdep.c:223) ... Call Trace: <IRQ> __lock_acquire (kernel/locking/lockdep.c:4853 kernel/locking/lockdep.c:5178) lock_acquire (kernel/locking/lockdep.c:469 kernel/locking/lockdep.c:5853 kernel/locking/lockdep.c:5816) _raw_spin_lock_nested (kernel/locking/spinlock.c:379) tcp_v4_rcv (./include/linux/skbuff.h:1678 ./include/net/tcp.h:2547 net/ipv4/tcp_ipv4.c:2350) ... BUG: kernel NULL pointer dereference, address: 00000000000000c4 PF: supervisor read access in kernel mode PF: error_code(0x0000) - not-present page PGD 0 Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI CPU: 10 UID: 0 PID: 0 Comm: swapper/10 Tainted: G W 6.14.0 #36 Tainted: [W]=WARN Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 RIP: 0010:__lock_acquire (kernel/locking/lockdep.c:4852 kernel/locking/lockdep.c:5178) Code: 15 41 09 c7 41 8b 44 24 20 25 ff 1f 00 00 41 09 c7 8b 84 24 a0 00 00 00 45 89 7c 24 20 41 89 44 24 24 e8 e1 bc ff ff 4c 89 e7 <44> 0f b6 b8 c4 00 00 00 e8 d1 bc ff ff 0f b6 80 c5 00 00 00 88 44 RSP: 0018:ffa0000000468a10 EFLAGS: 00010046 RAX: 0000000000000000 RBX: ff1100010091cc38 RCX: 0000000000000027 RDX: ff1100081f09ca48 RSI: 0000000000000001 RDI: ff1100010091cc88 RBP: ff1100010091c200 R08: ff1100083fe6e228 R09: 00000000ffffbfff R10: ff1100081eca0000 R11: ff1100083fe10dc0 R12: ff1100010091cc88 R13: 0000000000000001 R14: 0000000000000000 R15: 00000000000424b1 FS: 0000000000000000(0000) GS:ff1100081f080000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000000c4 CR3: 0000000002c4a003 CR4: 0000000000771ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <IRQ> lock_acquire (kernel/locking/lockdep.c:469 kernel/locking/lockdep.c:5853 kernel/locking/lockdep.c:5816) _raw_spin_lock_nested (kernel/locking/spinlock.c:379) tcp_v4_rcv (./include/linux/skbuff.h:1678 ./include/net/tcp.h:2547 net/ipv4/tcp_ipv4.c:2350) ip_protocol_deliver_rcu (net/ipv4/ip_input.c:205 (discriminator 1)) ip_local_deliver_finish (./include/linux/rcupdate.h:878 net/ipv4/ip_input.c:234) ip_sublist_rcv_finish (net/ipv4/ip_input.c:576) ip_list_rcv_finish (net/ipv4/ip_input.c:628) ip_list_rcv (net/ipv4/ip_input.c:670) __netif_receive_skb_list_core (net/core/dev.c:5939 net/core/dev.c:5986) netif_receive_skb_list_internal (net/core/dev.c:6040 net/core/dev.c:6129) napi_complete_done (./include/linux/list.h:37 ./include/net/gro.h:519 ./include/net/gro.h:514 net/core/dev.c:6496) e1000_clean (drivers/net/ethernet/intel/e1000/e1000_main.c:3815) __napi_poll.constprop.0 (net/core/dev.c:7191) net_rx_action (net/core/dev.c:7262 net/core/dev.c:7382) handle_softirqs (kernel/softirq.c:561) __irq_exit_rcu (kernel/softirq.c:596 kernel/softirq.c:435 kernel/softirq.c:662) irq_exit_rcu (kernel/softirq.c:680) common_interrupt (arch/x86/kernel/irq.c:280 (discriminator 14)) </IRQ> <TASK> asm_common_interrupt (./arch/x86/include/asm/idtentry.h:693) RIP: 0010:default_idle (./arch/x86/include/asm/irqflags.h:37 ./arch/x86/include/asm/irqflags.h:92 arch/x86/kernel/process.c:744) Code: 4c 01 c7 4c 29 c2 e9 72 ff ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d c3 2b 15 00 fb f4 <fa> c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 RSP: 0018:ffa00000000ffee8 EFLAGS: 00000202 RAX: 000000000000640b RBX: ff1100010091c200 RCX: 0000000000061aa4 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff812f30c5 RBP: 000000000000000a R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000002 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 ? do_idle (kernel/sched/idle.c:186 kernel/sched/idle.c:325) default_idle_call (./include/linux/cpuidle.h:143 kernel/sched/idle.c:118) do_idle (kernel/sched/idle.c:186 kernel/sched/idle.c:325) cpu_startup_entry (kernel/sched/idle.c:422 (discriminator 1)) start_secondary (arch/x86/kernel/smpboot.c:315) common_startup_64 (arch/x86/kernel/head_64.S:421) </TASK> Modules linked in: cifs_arc4 nls_ucs2_utils cifs_md4 [last unloaded: cifs] CR2: 00000000000000c4 Fixes: ed07536ed673 ("[PATCH] lockdep: annotate nfs/nfsd in-kernel sockets") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20250407163313.22682-1-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-09net: remove cpu stall in txq_trans_update()Eric Dumazet2-4/+5
txq_trans_update() currently uses txq->xmit_lock_owner to conditionally update txq->trans_start. For regular devices, txq->xmit_lock_owner is updated from HARD_TX_LOCK() and HARD_TX_UNLOCK(), and this apparently causes cpu stalls. Using dev->lltx, which sits in a read-mostly cache-line, and already used in HARD_TX_LOCK() and HARD_TX_UNLOCK() helps cpu prediction. On an AMD EPYC 7B12 dual socket server, tcp_rr with 128 threads and 30,000 flows gets a 5 % increase in throughput. As explained in commit 95ecba62e2fd ("net: fix races in netdev_tx_sent_queue()/dev_watchdog()") I am planning to no longer update txq->trans_start in the fast path in a followup patch. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250408202742.2145516-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-09ipv6: Align behavior across nexthops during path selectionIdo Schimmel1-4/+4
A nexthop is only chosen when the calculated multipath hash falls in the nexthop's hash region (i.e., the hash is smaller than the nexthop's hash threshold) and when the nexthop is assigned a non-negative score by rt6_score_route(). Commit 4d0ab3a6885e ("ipv6: Start path selection from the first nexthop") introduced an unintentional difference between the first nexthop and the rest when the score is negative. When the first nexthop matches, but has a negative score, the code will currently evaluate subsequent nexthops until one is found with a non-negative score. On the other hand, when a different nexthop matches, but has a negative score, the code will fallback to the nexthop with which the selection started ('match'). Align the behavior across all nexthops and fallback to 'match' when the first nexthop matches, but has a negative score. Fixes: 3d709f69a3e7 ("ipv6: Use hash-threshold instead of modulo-N") Fixes: 4d0ab3a6885e ("ipv6: Start path selection from the first nexthop") Reported-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Closes: https://lore.kernel.org/netdev/67efef607bc41_1ddca82948c@willemb.c.googlers.com.notmuch/ Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250408084316.243559-1-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>