wireguard-linux - WireGuard for the Linux kernel

Age	Commit message (Collapse)	Author	Files	Lines
2023-06-21	selftests: forwarding: dual_vxlan_bridge: Disable IPv6 autogen on bridges	Petr Machata	1	-0/+1
	In a future patch, mlxsw will start adding RIFs to uppers of front panel port netdevices, if they have an IP address. This will cause this selftest to fail spuriously. The swp enslavement to the 802.1ad bridge is not allowed, because RIFs are not allowed to be created for 802.1ad bridges, but the address indicates one needs to be created. Fix by disabling automatic IPv6 address generation for the HW-offloaded bridge in this selftest, thus exempting it from mlxsw router attention. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Amit Cohen <amcohen@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-21	selftests: forwarding: q_in_vni: Disable IPv6 autogen on bridges	Petr Machata	1	-0/+1
	In a future patch, mlxsw will start adding RIFs to uppers of front panel port netdevices, if they have an IP address. This will cause this selftest to fail spuriously. The swp enslavement to the 802.1ad bridge is not allowed, because RIFs are not allowed to be created for 802.1ad bridges, but the address indicates one needs to be created. Fix by disabling automatic IPv6 address generation for the HW-offloaded bridge in this selftest, thus exempting it from mlxsw router attention. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Amit Cohen <amcohen@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-21	net: micrel: Change to receive timestamp in the frame for lan8841	Horatiu Vultur	1	-96/+154
	Currently for each timestamp frame, the SW needs to go and read the received timestamp over the MDIO bus. But the HW has the capability to store the received nanoseconds part and the least significant two bits of the seconds in the reserved field of the PTP header. In this way we could save few MDIO transactions (actually a little more transactions because the access to the PTP registers are indirect) for each received frame. Instead of reading the rest of seconds part of the timestamp of the frame using MDIO transactions schedule PTP worker thread to read the seconds part every 500ms and then for each of the received frames use this information. Because if for example running with 512 frames per second, there is no point to read 512 times the second part. Doing all these changes will give a great CPU usage performance. Running ptp4l with logSyncInterval of -9 will give a ~60% CPU improvement. Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-20	net: stmmac: dwmac-qcom-ethqos: add support for emac4 on sa8775p platforms	Bartosz Golaszewski	1	-14/+51
	sa8775p uses EMAC version 4, add the relevant defines, rename the has_emac3 switch to has_emac_ge_3 (has emac greater-or-equal than 3) and add the new compatible. Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-20	dt-bindings: net: qcom,ethqos: add description for sa8775p	Bartosz Golaszewski	2	-1/+14
	Add the compatible for the MAC controller on sa8775p platforms. This MAC works with a single interrupt so add minItems to the interrupts property. The fourth clock's name is different here so change it. Enable relevant PHY properties. Add the relevant compatibles to the binding document for snps,dwmac as well. Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-20	net: stmmac: add new switch to struct plat_stmmacenet_data	Bartosz Golaszewski	2	-1/+2
	On some platforms, the PCS can be integrated in the MAC so the driver will not see any PCS link activity. Add a switch that allows the platform drivers to let the core code know. Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> Reviewed-by: Jose Abreu <Jose.Abreu@synopsys.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-20	net: stmmac: dwmac-qcom-ethqos: add support for SGMII	Bartosz Golaszewski	1	-0/+37
	On sa8775p the MAC is connected to the external PHY over SGMII so add support for it to the driver. Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-20	net: stmmac: dwmac-qcom-ethqos: prepare the driver for more PHY modes	Bartosz Golaszewski	1	-5/+26
	In preparation for supporting SGMII, let's make the code a bit more generic. Add a new callback for MAC configuration so that we can assign a different variant of it in the future. Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> Reviewed-by: Andrew Halaney <ahalaney@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-20	net: stmmac: dwmac-qcom-ethqos: add support for the phyaux clock	Bartosz Golaszewski	1	-15/+16
	On sa8775p, the EMAC revision is 4 and we use SGMII instead of RGMII. There's no "rgmii" clock but there's a fourth clock under a different name: "phyaux". Add a new field to the chip data struct that specifies the link clock name. Default to "rgmii" for backward compatibility. Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-20	net: stmmac: dwmac-qcom-ethqos: add support for the optional serdes phy	Bartosz Golaszewski	1	-0/+37
	On sa8775p platforms, there's a SGMII SerDes PHY between the MAC and external PHY that we need to enable and configure. Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-20	net: stmmac: dwmac-qcom-ethqos: remove stray space	Bartosz Golaszewski	1	-1/+1
	There's an unnecessary space in the rgmii_updatel() function, remove it. Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> Reviewed-by: Andrew Halaney <ahalaney@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-20	net: stmmac: dwmac-qcom-ethqos: add a newline between headers	Bartosz Golaszewski	1	-0/+1
	Typically we use a newline between global and local headers so add it here as well. Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> Reviewed-by: Andrew Halaney <ahalaney@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-20	net: stmmac: dwmac-qcom-ethqos: add missing include	Bartosz Golaszewski	1	-0/+1
	device_get_phy_mode() is declared in linux/property.h but this header is not included. Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> Reviewed-by: Andrew Halaney <ahalaney@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-20	net: stmmac: dwmac-qcom-ethqos: use a helper variable for &pdev->dev	Bartosz Golaszewski	1	-23/+26
	Shrink code and avoid line breaks by using a helper variable for &pdev->dev. Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> Reviewed-by: Andrew Halaney <ahalaney@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-20	net: stmmac: dwmac-qcom-ethqos: tweak the order of local variables	Bartosz Golaszewski	1	-1/+1
	Make sure we follow the reverse-xmas tree convention. Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> Reviewed-by: Andrew Halaney <ahalaney@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-20	net: stmmac: dwmac-qcom-ethqos: rename a label in probe()	Bartosz Golaszewski	1	-7/+7
	The err_mem label's name is unclear. It actually should be reached on any error after stmmac_probe_config_dt() succeeds. Name it after the cleanup action that needs to be called before exiting. Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> Reviewed-by: Andrew Halaney <ahalaney@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-20	net: stmmac: dwmac-qcom-ethqos: shrink clock code with devres	Bartosz Golaszewski	1	-13/+11
	We can use a devm action to completely drop the remove callback and use stmmac_pltfr_remove() directly for remove. We can also drop one of the goto labels. Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> Reviewed-by: Andrew Halaney <ahalaney@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-20	sfc: fix uninitialized variable use	Arnd Bergmann	1	-0/+1
	The new efx_bind_neigh() function contains a broken code path when IPV6 is disabled: drivers/net/ethernet/sfc/tc_encap_actions.c:144:7: error: variable 'n' is used uninitialized whenever 'if' condition is true [-Werror,-Wsometimes-uninitialized] if (encap->type & EFX_ENCAP_FLAG_IPV6) { ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ drivers/net/ethernet/sfc/tc_encap_actions.c:184:8: note: uninitialized use occurs here if (!n) { ^ drivers/net/ethernet/sfc/tc_encap_actions.c:144:3: note: remove the 'if' if its condition is always false if (encap->type & EFX_ENCAP_FLAG_IPV6) { ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ drivers/net/ethernet/sfc/tc_encap_actions.c:141:22: note: initialize the variable 'n' to silence this warning struct neighbour *n; ^ = NULL Change it to use the existing error handling path here. Fixes: 7e5e7d800011a ("sfc: neighbour lookup for TC encap action offload") Suggested-by: Edward Cree <ecree.xilinx@gmail.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Edward Cree <ecree.xilinx@gmail.com> Link: https://lore.kernel.org/r/20230619091215.2731541-2-arnd@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-20	sfc: add CONFIG_INET dependency for TC offload	Arnd Bergmann	1	-0/+1
	The driver now fails to link when CONFIG_INET is disabled, so add an explicit Kconfig dependency: ld.lld: error: undefined symbol: ip_route_output_flow >>> referenced by tc_encap_actions.c >>> drivers/net/ethernet/sfc/tc_encap_actions.o:(efx_tc_flower_create_encap_md) in archive vmlinux.a ld.lld: error: undefined symbol: ip_send_check >>> referenced by tc_encap_actions.c >>> drivers/net/ethernet/sfc/tc_encap_actions.o:(efx_gen_encap_header) in archive vmlinux.a >>> referenced by tc_encap_actions.c >>> drivers/net/ethernet/sfc/tc_encap_actions.o:(efx_gen_encap_header) in archive vmlinux.a ld.lld: error: undefined symbol: arp_tbl >>> referenced by tc_encap_actions.c >>> drivers/net/ethernet/sfc/tc_encap_actions.o:(efx_tc_netevent_event) in archive vmlinux.a >>> referenced by tc_encap_actions.c >>> drivers/net/ethernet/sfc/tc_encap_actions.o:(efx_tc_netevent_event) in archive vmlinux.a Fixes: a1e82162af0b8 ("sfc: generate encap headers for TC offload") Reviewed-by: Edward Cree <ecree.xilinx@gmail.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Closes: https://lore.kernel.org/oe-kbuild-all/202306151656.yttECVTP-lkp@intel.com/ Signed-off-by: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/r/20230619091215.2731541-1-arnd@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-20	net: phy-c45: Fix genphy_c45_ethtool_set_eee description	Andrew Lunn	1	-3/+6
	The text has been cut/paste from genphy_c45_ethtool_get_eee but not changed to reflect it performs set. Additionally, extend the comment. This function implements the logic that eee_enabled has global control over EEE. When eee_enabled is false, no link modes will be advertised, and as a result, the MAC should not transmit LPI. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20230619220332.4038924-1-andrew@lunn.ch Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-20	net: remove sk_is_ipmr() and sk_is_icmpv6() helpers	Eric Dumazet	3	-19/+2
	Blamed commit added these helpers for sake of detecting RAW sockets specific ioctl. syzbot complained about it [1]. Issue here is that RAW sockets could pretend there was no need to call ipmr_sk_ioctl() Regardless of inet_sk(sk)->inet_num, we must be prepared for ipmr_ioctl() being called later. This must happen from ipmr_sk_ioctl() context only. We could add a safety check in ipmr_ioctl() at the risk of breaking applications. Instead, remove sk_is_ipmr() and sk_is_icmpv6() because their name would be misleading, once we change their implementation. [1] BUG: KASAN: stack-out-of-bounds in ipmr_ioctl+0xb12/0xbd0 net/ipv4/ipmr.c:1654 Read of size 4 at addr ffffc90003aefae4 by task syz-executor105/5004 CPU: 0 PID: 5004 Comm: syz-executor105 Not tainted 6.4.0-rc6-syzkaller-01304-gc08afcdcf952 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/27/2023 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106 print_address_description.constprop.0+0x2c/0x3c0 mm/kasan/report.c:351 print_report mm/kasan/report.c:462 [inline] kasan_report+0x11c/0x130 mm/kasan/report.c:572 ipmr_ioctl+0xb12/0xbd0 net/ipv4/ipmr.c:1654 raw_ioctl+0x4e/0x1e0 net/ipv4/raw.c:881 sock_ioctl_out net/core/sock.c:4186 [inline] sk_ioctl+0x151/0x440 net/core/sock.c:4214 inet_ioctl+0x18c/0x380 net/ipv4/af_inet.c:1001 sock_do_ioctl+0xcc/0x230 net/socket.c:1189 sock_ioctl+0x1f8/0x680 net/socket.c:1306 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:870 [inline] __se_sys_ioctl fs/ioctl.c:856 [inline] __x64_sys_ioctl+0x197/0x210 fs/ioctl.c:856 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f2944bf6ad9 Code: 28 c3 e8 2a 14 00 00 66 2e 0f 1f 84 00 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007ffd8897a028 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f2944bf6ad9 RDX: 0000000000000000 RSI: 00000000000089e1 RDI: 0000000000000003 RBP: 00007f2944bbac80 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007f2944bbad10 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 </TASK> The buggy address belongs to stack of task syz-executor105/5004 and is located at offset 36 in frame: sk_ioctl+0x0/0x440 net/core/sock.c:4172 This frame has 2 objects: [32, 36) 'karg' [48, 88) 'buffer' Fixes: e1d001fa5b47 ("net: ioctl: Use kernel memory on protocol ioctl callbacks") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Breno Leitao <leitao@debian.org> Cc: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20230619124336.651528-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-20	ipv6: fix a typo in ip6mr_sk_ioctl()	Eric Dumazet	1	-3/+3
	SIOCGETSGCNT_IN6 uses a "struct sioc_sg_req6 buffer". Unfortunately the blamed commit made hard to ensure type safety. syzbot reported: BUG: KASAN: stack-out-of-bounds in ip6mr_ioctl+0xba3/0xcb0 net/ipv6/ip6mr.c:1917 Read of size 16 at addr ffffc900039afb68 by task syz-executor937/5008 CPU: 1 PID: 5008 Comm: syz-executor937 Not tainted 6.4.0-rc6-syzkaller-01304-gc08afcdcf952 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/27/2023 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106 print_address_description.constprop.0+0x2c/0x3c0 mm/kasan/report.c:351 print_report mm/kasan/report.c:462 [inline] kasan_report+0x11c/0x130 mm/kasan/report.c:572 ip6mr_ioctl+0xba3/0xcb0 net/ipv6/ip6mr.c:1917 rawv6_ioctl+0x4e/0x1e0 net/ipv6/raw.c:1143 sock_ioctl_out net/core/sock.c:4186 [inline] sk_ioctl+0x151/0x440 net/core/sock.c:4214 inet6_ioctl+0x1b8/0x290 net/ipv6/af_inet6.c:582 sock_do_ioctl+0xcc/0x230 net/socket.c:1189 sock_ioctl+0x1f8/0x680 net/socket.c:1306 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:870 [inline] __se_sys_ioctl fs/ioctl.c:856 [inline] __x64_sys_ioctl+0x197/0x210 fs/ioctl.c:856 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f255849bad9 Code: 28 c3 e8 2a 14 00 00 66 2e 0f 1f 84 00 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007ffd06792778 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f255849bad9 RDX: 0000000000000000 RSI: 00000000000089e1 RDI: 0000000000000003 RBP: 00007f255845fc80 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007f255845fd10 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 </TASK> The buggy address belongs to stack of task syz-executor937/5008 and is located at offset 40 in frame: sk_ioctl+0x0/0x440 net/core/sock.c:4172 This frame has 2 objects: [32, 36) 'karg' [48, 88) 'buffer' Fixes: e1d001fa5b47 ("net: ioctl: Use kernel memory on protocol ioctl callbacks") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Cc: David Ahern <dsahern@kernel.org> Cc: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Breno Leitao <leitao@debian.org> Link: https://lore.kernel.org/r/20230619072740.464528-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-20	octeontx2-pf: TC flower offload support for rxqueue mapping	Ratheesh Kannoth	1	-2/+12
	TC rule support to offload rx queue mapping rules. Eg: tc filter add dev eth2 ingress protocol ip flower \ dst_ip 192.168.8.100 \ action skbedit queue_mapping 4 skip_sw action mirred ingress redirect dev eth5 Packets destined to 192.168.8.100 will be forwarded to rx queue 4 of eth5 interface. tc filter add dev eth2 ingress protocol ip flower \ dst_ip 192.168.8.100 \ action skbedit queue_mapping 9 skip_sw Packets destined to 192.168.8.100 will be forwarded to rx queue 4 of eth2 interface. Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com> Link: https://lore.kernel.org/r/20230619060638.1032304-1-rkannoth@marvell.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-20	netlabel: Reorder fields in 'struct netlbl_domaddr6_map'	Christophe JAILLET	1	-1/+1
	Group some variables based on their sizes to reduce hole and avoid padding. On x86_64, this shrinks the size of 'struct netlbl_domaddr6_map' from 72 to 64 bytes. It saves a few bytes of memory and is more cache-line friendly. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Acked-by: Paul Moore <paul@paul-moore.com> Link: https://lore.kernel.org/r/aa109847260e51e174c823b6d1441f75be370f01.1687083361.git.christophe.jaillet@wanadoo.fr Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-20	mptcp: Reorder fields in 'struct mptcp_pm_add_entry'	Christophe JAILLET	1	-1/+1
	Group some variables based on their sizes to reduce hole and avoid padding. On x86_64, this shrinks the size of 'struct mptcp_pm_add_entry' from 136 to 128 bytes. It saves a few bytes of memory and is more cache-line friendly. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/e47b71de54fd3e580544be56fc1bb2985c77b0f4.1687081558.git.christophe.jaillet@wanadoo.fr Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-20	mctp: Reorder fields in 'struct mctp_route'	Christophe JAILLET	1	-2/+2
	Group some variables based on their sizes to reduce hole and avoid padding. On x86_64, this shrinks the size of 'struct mctp_route' from 72 to 64 bytes. It saves a few bytes of memory and is more cache-line friendly. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Simon Horman <simon.horman@corigine.com> Acked-by: Jeremy Kerr <jk@codeconstruct.com.au> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/393ad1a5aef0aa28d839eeb3d7477da0e0eeb0b0.1687080803.git.christophe.jaillet@wanadoo.fr Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-20	net: fec: allow to build without PAGE_POOL_STATS	Lucas Stach	2	-1/+3
	Commit 6970ef27ff7f ("net: fec: add xdp and page pool statistics") selected CONFIG_PAGE_POOL_STATS from the FEC driver symbol, making it impossible to build without the page pool statistics when this driver is enabled. The help text of those statistics mentions increased overhead. Allow the user to choose between usefulness of the statistics and the added overhead. Signed-off-by: Lucas Stach <l.stach@pengutronix.de> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20230616191832.2944130-1-l.stach@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-20	crypto: af_alg/hash: Fix recvmsg() after sendmsg(MSG_MORE)	David Howells	1	-13/+25
	If an AF_ALG socket bound to a hashing algorithm is sent a zero-length message with MSG_MORE set and then recvmsg() is called without first sending another message without MSG_MORE set to end the operation, an oops will occur because the crypto context and result doesn't now get set up in advance because hash_sendmsg() now defers that as long as possible in the hope that it can use crypto_ahash_digest() - and then because the message is zero-length, it the data wrangling loop is skipped. Fix this by handling zero-length sends at the top of the hash_sendmsg() function. If we're not continuing the previous sendmsg(), then just ignore the send (hash_recvmsg() will invent something when called); if we are continuing, then we finalise the request at this point if MSG_MORE is not set to get any error here, otherwise the send is of no effect and can be ignored. Whilst we're at it, remove the code to create a kvmalloc'd scatterlist if we get more than ALG_MAX_PAGES - this shouldn't happen. Fixes: c662b043cdca ("crypto: af_alg/hash: Support MSG_SPLICE_PAGES") Reported-by: syzbot+13a08c0bf4d212766c3c@syzkaller.appspotmail.com Link: https://lore.kernel.org/r/000000000000b928f705fdeb873a@google.com/ Reported-by: syzbot+14234ccf6d0ef629ec1a@syzkaller.appspotmail.com Link: https://lore.kernel.org/r/000000000000c047db05fdeb8790@google.com/ Reported-by: syzbot+4e2e47f32607d0f72d43@syzkaller.appspotmail.com Link: https://lore.kernel.org/r/000000000000bcca3205fdeb87fb@google.com/ Reported-by: syzbot+472626bb5e7c59fb768f@syzkaller.appspotmail.com Link: https://lore.kernel.org/r/000000000000b55d8805fdeb8385@google.com/ Signed-off-by: David Howells <dhowells@redhat.com> Reported-and-tested-by: syzbot+6efc50cc1f8d718d6cb7@syzkaller.appspotmail.com cc: Jens Axboe <axboe@kernel.dk> cc: Matthew Wilcox <willy@infradead.org> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Link: https://lore.kernel.org/r/427646.1686913832@warthog.procyon.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-20	net: phy: mediatek: fix compile-test dependencies	Arnd Bergmann	1	-1/+1
	The new phy driver attempts to select a driver from another subsystem, but that fails when the NVMEM subsystem is disabled: WARNING: unmet direct dependencies detected for NVMEM_MTK_EFUSE Depends on [n]: NVMEM [=n] && (ARCH_MEDIATEK [=n] \|\| COMPILE_TEST [=y]) && HAS_IOMEM [=y] Selected by [y]: - MEDIATEK_GE_SOC_PHY [=y] && NETDEVICES [=y] && PHYLIB [=y] && (ARM64 && ARCH_MEDIATEK [=n] \|\| COMPILE_TEST [=y]) I could not see an actual compile time dependency, so presumably this is only needed for for working correctly but not technically a dependency on that particular nvmem driver implementation, so it would likely be safe to remove the select for compile testing. To keep the spirit of the original 'select', just replace this with a 'depends on' that ensures that the driver will work but does not get in the way of build testing. Fixes: 98c485eaf509b ("net: phy: add driver for MediaTek SoC built-in GE PHYs") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Simon Horman <simon.horman@corigine.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> # build-tested Reviewed-by: Daniel Golle <daniel@makrotopia.org> Link: https://lore.kernel.org/r/20230616093009.3511692-1-arnd@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-20	ptp: ocp: Add .getmaxphase ptp_clock_info callback	Rahul Rameshbabu	1	-0/+7
	Add a function that advertises a maximum offset of zero supported by ptp_clock_info .adjphase in the OCP null ptp implementation. Cc: Richard Cochran <richardcochran@gmail.com> Cc: Jonathan Lemon <jonathan.lemon@gmail.com> Cc: Vadim Fedorenko <vadim.fedorenko@linux.dev> Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Acked-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-20	ptp: idt82p33: Add .getmaxphase ptp_clock_info callback	Rahul Rameshbabu	2	-11/+11
	Advertise the maximum offset the .adjphase callback is capable of supporting in nanoseconds for IDT devices. Refactor the negation of the offset stored in the register to be after the boundary check of the offset value rather than before. Boundary check based on the intended value rather than its device-specific representation. Depend on ptp_clock_adjtime for handling out-of-range offsets. ptp_clock_adjtime returns -ERANGE instead of clamping out-of-range offsets. Cc: Richard Cochran <richardcochran@gmail.com> Cc: Min Li <min.li.xe@renesas.com> Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-20	ptp: ptp_clockmatrix: Add .getmaxphase ptp_clock_info callback	Rahul Rameshbabu	2	-20/+18
	Advertise the maximum offset the .adjphase callback is capable of supporting in nanoseconds for IDT ClockMatrix devices. Depend on ptp_clock_adjtime for handling out-of-range offsets. ptp_clock_adjtime returns -ERANGE instead of clamping out-of-range offsets. Cc: Richard Cochran <richardcochran@gmail.com> Cc: Vincent Cheng <vincent.cheng.xh@renesas.com> Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-20	net/mlx5: Add .getmaxphase ptp_clock_info callback	Rahul Rameshbabu	1	-16/+15
	Implement .getmaxphase callback of ptp_clock_info in mlx5 driver. No longer do a range check in .adjphase callback implementation. Handled by the ptp stack. Cc: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-20	ptp: Add .getmaxphase callback to ptp_clock_info	Rahul Rameshbabu	6	-4/+31
	Enables advertisement of the maximum offset supported by the phase control functionality of PHCs. The callback is used to return an error if an offset not supported by the PHC is used in ADJ_OFFSET. The ioctls PTP_CLOCK_GETCAPS and PTP_CLOCK_GETCAPS2 now advertise the maximum offset a PHC's phase control functionality is capable of supporting. Introduce new sysfs node, max_phase_adjustment. Cc: Jakub Kicinski <kuba@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Maciek Machnikowski <maciek@machnikowski.net> Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-20	testptp: Add support for testing ptp_clock_info .adjphase callback	Rahul Rameshbabu	1	-1/+18
	Invoke clock_adjtime syscall with tx.modes set with ADJ_OFFSET when testptp is invoked with a phase adjustment offset value. Support seconds and nanoseconds for the offset value. Cc: Jakub Kicinski <kuba@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Maciek Machnikowski <maciek@machnikowski.net> Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-20	testptp: Remove magic numbers related to nanosecond to second conversion	Rahul Rameshbabu	1	-2/+2
	Use existing NSEC_PER_SEC declaration in place of hardcoded magic numbers. Cc: Jakub Kicinski <kuba@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Maciek Machnikowski <maciek@machnikowski.net> Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-20	docs: ptp.rst: Add information about NVIDIA Mellanox devices	Rahul Rameshbabu	1	-0/+13
	The mlx5_core driver has implemented ptp clock driver functionality but lacked documentation about the PTP devices. This patch adds information about the Mellanox device family. Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-20	ptp: Clarify ptp_clock_info .adjphase expects an internal servo to be used	Rahul Rameshbabu	2	-2/+20
	.adjphase expects a PHC to use an internal servo algorithm to correct the provided phase offset target in the callback. Implementation of the internal servo algorithm are defined by the individual devices. Cc: Jakub Kicinski <kuba@kernel.org> Cc: Richard Cochran <richardcochran@gmail.com> Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-19	ipv6: exthdrs: Remove redundant skb_headlen() check in ip6_parse_tlv().	Kuniyuki Iwashima	1	-3/+0
	ipv6_destopt_rcv() and ipv6_parse_hopopts() pulls these data - Hop-by-Hop/Destination Options Header : 8 - Hdr Ext Len : skb_transport_header(skb)[1] << 3 and calls ip6_parse_tlv(), so it need not check if skb_headlen() is less than skb_transport_offset(skb) + (skb_transport_header(skb)[1] << 3). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-19	ipv6: exthdrs: Reload hdr only when needed in ipv6_srh_rcv().	Kuniyuki Iwashima	1	-2/+2
	We need not reload hdr in ipv6_srh_rcv() unless we call pskb_expand_head(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-19	ipv6: exthdrs: Replace pskb_pull() with skb_pull() in ipv6_srh_rcv().	Kuniyuki Iwashima	1	-5/+1
	ipv6_rthdr_rcv() pulls these data - Segment Routing Header : 8 - Hdr Ext Len : skb_transport_header(skb)[1] << 3 needed by ipv6_srh_rcv(), so pskb_pull() in ipv6_srh_rcv() never fails and can be replaced with skb_pull(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-19	ipv6: rpl: Remove redundant multicast tests in ipv6_rpl_srh_rcv().	Kuniyuki Iwashima	1	-2/+1
	ipv6_rpl_srh_rcv() checks if ipv6_hdr(skb)->daddr or ohdr->rpl_segaddr[i] is the multicast address with ipv6_addr_type(). We have the same check for ipv6_hdr(skb)->daddr in ipv6_rthdr_rcv(), so we need not recheck it in ipv6_rpl_srh_rcv(). Also, we should use ipv6_addr_is_multicast() for ohdr->rpl_segaddr[i] instead of ipv6_addr_type(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-19	ipv6: rpl: Remove pskb(_may)?_pull() in ipv6_rpl_srh_rcv().	Kuniyuki Iwashima	3	-26/+1
	As Eric Dumazet pointed out [0], ipv6_rthdr_rcv() pulls these data - Segment Routing Header : 8 - Hdr Ext Len : skb_transport_header(skb)[1] << 3 needed by ipv6_rpl_srh_rcv(). We can remove pskb_may_pull() and replace pskb_pull() with skb_pull() in ipv6_rpl_srh_rcv(). Link: https://lore.kernel.org/netdev/CANn89iLboLwLrHXeHJucAqBkEL_S0rJFog68t7wwwXO-aNf5Mg@mail.gmail.com/ [0] Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-18	gro: move the tc_ext comparison to a helper	Jakub Kicinski	1	-13/+19
	The double ifdefs (one for the variable declaration and one around the code) are quite aesthetically displeasing. Factor this code out into a helper for easier wrapping. This will become even more ugly when another skb ext comparison is added in the future. The resulting machine code looks the same, the compiler seems to try to use %rax more and some blocks more around but I haven't spotted minor differences. Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-18	net: phy: at803x: Use devm_regulator_get_enable_optional()	Christophe JAILLET	1	-37/+7
	Use devm_regulator_get_enable_optional() instead of hand writing it. It saves some line of code. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-18	dt-bindings: net: phy: gpy2xx: more precise description	Michael Walle	1	-5/+6
	Mention that the interrupt line is just asserted for a random period of time, not the entire time. Suggested-by: Rob Herring <robh@kernel.org> Signed-off-by: Michael Walle <mwalle@kernel.org> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-18	ipv6: also use netdev_hold() in ip6_route_check_nh()	Eric Dumazet	1	-4/+9
	In blamed commit, we missed the fact that ip6_validate_gw() could change dev under us from ip6_route_check_nh() In this fix, I use GFP_ATOMIC in order to not pass too many additional arguments to ip6_validate_gw() and ip6_route_check_nh() only for a rarely used debug feature. syzbot reported: refcount_t: decrement hit 0; leaking memory. WARNING: CPU: 0 PID: 5006 at lib/refcount.c:31 refcount_warn_saturate+0x1d7/0x1f0 lib/refcount.c:31 Modules linked in: CPU: 0 PID: 5006 Comm: syz-executor403 Not tainted 6.4.0-rc5-syzkaller-01229-g97c5209b3d37 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/27/2023 RIP: 0010:refcount_warn_saturate+0x1d7/0x1f0 lib/refcount.c:31 Code: 05 fb 8e 51 0a 01 e8 98 95 38 fd 0f 0b e9 d3 fe ff ff e8 ac d9 70 fd 48 c7 c7 00 d3 a6 8a c6 05 d8 8e 51 0a 01 e8 79 95 38 fd <0f> 0b e9 b4 fe ff ff 48 89 ef e8 1a d7 c3 fd e9 5c fe ff ff 0f 1f RSP: 0018:ffffc900039df6b8 EFLAGS: 00010282 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 RDX: ffff888026d71dc0 RSI: ffffffff814c03b7 RDI: 0000000000000001 RBP: ffff888146a505fc R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000001 R12: 1ffff9200073bedc R13: 00000000ffffffef R14: ffff888146a505fc R15: ffff8880284eb5a8 FS: 0000555556c88300(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000004585c0 CR3: 000000002b1b1000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> __refcount_dec include/linux/refcount.h:344 [inline] refcount_dec include/linux/refcount.h:359 [inline] ref_tracker_free+0x539/0x820 lib/ref_tracker.c:236 netdev_tracker_free include/linux/netdevice.h:4097 [inline] netdev_put include/linux/netdevice.h:4114 [inline] netdev_put include/linux/netdevice.h:4110 [inline] fib6_nh_init+0xb96/0x1bd0 net/ipv6/route.c:3624 ip6_route_info_create+0x10f3/0x1980 net/ipv6/route.c:3791 ip6_route_add+0x28/0x150 net/ipv6/route.c:3835 ipv6_route_ioctl+0x3fc/0x570 net/ipv6/route.c:4459 inet6_ioctl+0x246/0x290 net/ipv6/af_inet6.c:569 sock_do_ioctl+0xcc/0x230 net/socket.c:1189 sock_ioctl+0x1f8/0x680 net/socket.c:1306 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:870 [inline] __se_sys_ioctl fs/ioctl.c:856 [inline] __x64_sys_ioctl+0x197/0x210 fs/ioctl.c:856 do_syscall_x64 arch/x86/entry/common.c:50 [inline] Fixes: 70f7457ad6d6 ("net: create device lookup API with reference tracking") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: David Ahern <dsahern@kernel.org> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-18	crypto: Fix af_alg_sendmsg(MSG_SPLICE_PAGES) sglist limit	David Howells	1	-1/+1
	When af_alg_sendmsg() calls extract_iter_to_sg(), it passes MAX_SGL_ENTS as the maximum number of elements that may be written to, but some of the elements may already have been used (as recorded in sgl->cur), so extract_iter_to_sg() may end up overrunning the scatterlist. Fix this to limit the number of elements to "MAX_SGL_ENTS - sgl->cur". Note: It probably makes sense in future to alter the behaviour of extract_iter_to_sg() to stop if "sgtable->nents >= sg_max" instead, but this is a smaller fix for now. The bug causes errors looking something like: BUG: KASAN: slab-out-of-bounds in sg_assign_page include/linux/scatterlist.h:109 [inline] BUG: KASAN: slab-out-of-bounds in sg_set_page include/linux/scatterlist.h:139 [inline] BUG: KASAN: slab-out-of-bounds in extract_bvec_to_sg lib/scatterlist.c:1183 [inline] BUG: KASAN: slab-out-of-bounds in extract_iter_to_sg lib/scatterlist.c:1352 [inline] BUG: KASAN: slab-out-of-bounds in extract_iter_to_sg+0x17a6/0x1960 lib/scatterlist.c:1339 Fixes: bf63e250c4b1 ("crypto: af_alg: Support MSG_SPLICE_PAGES") Reported-by: syzbot+6efc50cc1f8d718d6cb7@syzkaller.appspotmail.com Link: https://lore.kernel.org/r/000000000000b2585a05fdeb8379@google.com/ Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: syzbot+6efc50cc1f8d718d6cb7@syzkaller.appspotmail.com cc: Herbert Xu <herbert@gondor.apana.org.au> cc: "David S. Miller" <davem@davemloft.net> cc: Eric Dumazet <edumazet@google.com> cc: Jakub Kicinski <kuba@kernel.org> cc: Paolo Abeni <pabeni@redhat.com> cc: Jens Axboe <axboe@kernel.dk> cc: Matthew Wilcox <willy@infradead.org> cc: linux-crypto@vger.kernel.org cc: netdev@vger.kernel.org Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-18	tcp: Use per-vma locking for receive zerocopy	Arjun Roy	5	-11/+60
	Per-VMA locking allows us to lock a struct vm_area_struct without taking the process-wide mmap lock in read mode. Consider a process workload where the mmap lock is taken constantly in write mode. In this scenario, all zerocopy receives are periodically blocked during that period of time - though in principle, the memory ranges being used by TCP are not touched by the operations that need the mmap write lock. This results in performance degradation. Now consider another workload where the mmap lock is never taken in write mode, but there are many TCP connections using receive zerocopy that are concurrently receiving. These connections all take the mmap lock in read mode, but this does induce a lot of contention and atomic ops for this process-wide lock. This results in additional CPU overhead caused by contending on the cache line for this lock. However, with per-vma locking, both of these problems can be avoided. As a test, I ran an RPC-style request/response workload with 4KB payloads and receive zerocopy enabled, with 100 simultaneous TCP connections. I measured perf cycles within the find_tcp_vma/mmap_read_lock/mmap_read_unlock codepath, with and without per-vma locking enabled. When using process-wide mmap semaphore read locking, about 1% of measured perf cycles were within this path. With per-VMA locking, this value dropped to about 0.45%. Signed-off-by: Arjun Roy <arjunroy@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-17	tcp: enforce receive buffer memory limits by allowing the tcp window to shrink	mfreemon@cloudflare.com	5	-9/+78
	Under certain circumstances, the tcp receive buffer memory limit set by autotuning (sk_rcvbuf) is increased due to incoming data packets as a result of the window not closing when it should be. This can result in the receive buffer growing all the way up to tcp_rmem[2], even for tcp sessions with a low BDP. To reproduce: Connect a TCP session with the receiver doing nothing and the sender sending small packets (an infinite loop of socket send() with 4 bytes of payload with a sleep of 1 ms in between each send()). This will cause the tcp receive buffer to grow all the way up to tcp_rmem[2]. As a result, a host can have individual tcp sessions with receive buffers of size tcp_rmem[2], and the host itself can reach tcp_mem limits, causing the host to go into tcp memory pressure mode. The fundamental issue is the relationship between the granularity of the window scaling factor and the number of byte ACKed back to the sender. This problem has previously been identified in RFC 7323, appendix F [1]. The Linux kernel currently adheres to never shrinking the window. In addition to the overallocation of memory mentioned above, the current behavior is functionally incorrect, because once tcp_rmem[2] is reached when no remediations remain (i.e. tcp collapse fails to free up any more memory and there are no packets to prune from the out-of-order queue), the receiver will drop in-window packets resulting in retransmissions and an eventual timeout of the tcp session. A receive buffer full condition should instead result in a zero window and an indefinite wait. In practice, this problem is largely hidden for most flows. It is not applicable to mice flows. Elephant flows can send data fast enough to "overrun" the sk_rcvbuf limit (in a single ACK), triggering a zero window. But this problem does show up for other types of flows. Examples are websockets and other type of flows that send small amounts of data spaced apart slightly in time. In these cases, we directly encounter the problem described in [1]. RFC 7323, section 2.4 [2], says there are instances when a retracted window can be offered, and that TCP implementations MUST ensure that they handle a shrinking window, as specified in RFC 1122, section 4.2.2.16 [3]. All prior RFCs on the topic of tcp window management have made clear that sender must accept a shrunk window from the receiver, including RFC 793 [4] and RFC 1323 [5]. This patch implements the functionality to shrink the tcp window when necessary to keep the right edge within the memory limit by autotuning (sk_rcvbuf). This new functionality is enabled with the new sysctl: net.ipv4.tcp_shrink_window Additional information can be found at: https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/ [1] https://www.rfc-editor.org/rfc/rfc7323#appendix-F [2] https://www.rfc-editor.org/rfc/rfc7323#section-2.4 [3] https://www.rfc-editor.org/rfc/rfc1122#page-91 [4] https://www.rfc-editor.org/rfc/rfc793 [5] https://www.rfc-editor.org/rfc/rfc1323 Signed-off-by: Mike Freemon <mfreemon@cloudflare.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>