aboutsummaryrefslogtreecommitdiffstats
path: root/net (follow)
AgeCommit message (Collapse)AuthorFilesLines
2020-07-07net: ethtool: Remove PHYLIB direct dependencyFlorian Fainelli2-3/+16
Now that we have introduced ethtool_phy_ops and the PHY library dynamically registers its operations with that function pointer, we can remove the direct PHYLIB dependency in favor of using dynamic operations. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-07net: ethtool: Introduce ethtool_phy_opsFlorian Fainelli2-0/+13
In order to decouple ethtool from its PHY library dependency, define an ethtool_phy_ops singleton which can be overriden by the PHY library when it loads with an appropriate set of function pointers. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-07bridge: mcast: Fix MLD2 Report IPv6 payload length checkLinus Lüssing1-1/+1
Commit e57f61858b7c ("net: bridge: mcast: fix stale nsrcs pointer in igmp3/mld2 report handling") introduced a bug in the IPv6 header payload length check which would potentially lead to rejecting a valid MLD2 Report: The check needs to take into account the 2 bytes for the "Number of Sources" field in the "Multicast Address Record" before reading it. And not the size of a pointer to this field. Fixes: e57f61858b7c ("net: bridge: mcast: fix stale nsrcs pointer in igmp3/mld2 report handling") Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-07net/sched: act_ct: add miss tcf_lastuse_update.wenxu1-0/+2
When tcf_ct_act execute the tcf_lastuse_update should be update or the used stats never update filter protocol ip pref 3 flower chain 0 filter protocol ip pref 3 flower chain 0 handle 0x1 eth_type ipv4 dst_ip 1.1.1.1 ip_flags frag/firstfrag skip_hw not_in_hw action order 1: ct zone 1 nat pipe index 1 ref 1 bind 1 installed 103 sec used 103 sec Action statistics: Sent 151500 bytes 101 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 cookie 4519c04dc64a1a295787aab13b6a50fb Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-07mptcp: fix DSS map generation on fin retransmissionPaolo Abeni1-3/+3
The RFC 8684 mandates that no-data DATA FIN packets should carry a DSS with 0 sequence number and data len equal to 1. Currently, on FIN retransmission we re-use the existing mapping; if the previous fin transmission was part of a partially acked data packet, we could end-up writing in the egress packet a non-compliant DSS. The above will be detected by a "Bad mapping" warning on the receiver side. This change addresses the issue explicitly checking for 0 len packet when adding the DATA_FIN option. Fixes: 6d0060f600ad ("mptcp: Write MPTCP DSS headers to outgoing data packets") Reported-by: syzbot+42a07faa5923cfaeb9c9@syzkaller.appspotmail.com Tested-by: Christoph Paasch <cpaasch@apple.com> Reviewed-by: Christoph Paasch <cpaasch@apple.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-07ipv4: fill fl4_icmp_{type,code} in ping_v4_sendmsgSabrina Dubroca1-0/+3
IPv4 ping sockets don't set fl4.fl4_icmp_{type,code}, which leads to incomplete IPsec ACQUIRE messages being sent to userspace. Currently, both raw sockets and IPv6 ping sockets set those fields. Expected output of "ip xfrm monitor": acquire proto esp sel src 10.0.2.15/32 dst 8.8.8.8/32 proto icmp type 8 code 0 dev ens4 policy src 10.0.2.15/32 dst 8.8.8.8/32 <snip> Currently with ping sockets: acquire proto esp sel src 10.0.2.15/32 dst 8.8.8.8/32 proto icmp type 0 code 0 dev ens4 policy src 10.0.2.15/32 dst 8.8.8.8/32 <snip> The Libreswan test suite found this problem after Fedora changed the value for the sysctl net.ipv4.ping_group_range. Fixes: c319b4d76b9e ("net: ipv4: add IPPROTO_ICMP socket kind") Reported-by: Paul Wouters <pwouters@redhat.com> Tested-by: Paul Wouters <pwouters@redhat.com> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-07cgroup: fix cgroup_sk_alloc() for sk_clone_lock()Cong Wang1-1/+1
When we clone a socket in sk_clone_lock(), its sk_cgrp_data is copied, so the cgroup refcnt must be taken too. And, unlike the sk_alloc() path, sock_update_netprioidx() is not called here. Therefore, it is safe and necessary to grab the cgroup refcnt even when cgroup_sk_alloc is disabled. sk_clone_lock() is in BH context anyway, the in_interrupt() would terminate this function if called there. And for sk_alloc() skcd->val is always zero. So it's safe to factor out the code to make it more readable. The global variable 'cgroup_sk_alloc_disabled' is used to determine whether to take these reference counts. It is impossible to make the reference counting correct unless we save this bit of information in skcd->val. So, add a new bit there to record whether the socket has already taken the reference counts. This obviously relies on kmalloc() to align cgroup pointers to at least 4 bytes, ARCH_KMALLOC_MINALIGN is certainly larger than that. This bug seems to be introduced since the beginning, commit d979a39d7242 ("cgroup: duplicate cgroup reference when cloning sockets") tried to fix it but not compeletely. It seems not easy to trigger until the recent commit 090e28b229af ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups") was merged. Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup") Reported-by: Cameron Berkenpas <cam@neo-zeon.de> Reported-by: Peter Geis <pgwipeout@gmail.com> Reported-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reported-by: Daniël Sonck <dsonck92@gmail.com> Reported-by: Zhang Qiang <qiang.zhang@windriver.com> Tested-by: Cameron Berkenpas <cam@neo-zeon.de> Tested-by: Peter Geis <pgwipeout@gmail.com> Tested-by: Thomas Lamprecht <t.lamprecht@proxmox.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Zefan Li <lizefan@huawei.com> Cc: Tejun Heo <tj@kernel.org> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-07ipv6: Fix use of anycast address with loopbackDavid Ahern1-1/+1
Thomas reported a regression with IPv6 and anycast using the following reproducer: echo 1 > /proc/sys/net/ipv6/conf/all/forwarding ip -6 a add fc12::1/16 dev lo sleep 2 echo "pinging lo" ping6 -c 2 fc12:: The conversion of addrconf_f6i_alloc to use ip6_route_info_create missed the use of fib6_is_reject which checks addresses added to the loopback interface and sets the REJECT flag as needed. Update fib6_is_reject for loopback checks to handle RTF_ANYCAST addresses. Fixes: c7a1ce397ada ("ipv6: Change addrconf_f6i_alloc to use ip6_route_info_create") Reported-by: thomas.gambier@nexedi.com Signed-off-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-07mptcp: use mptcp worker for path managementFlorian Westphal3-47/+27
We can re-use the existing work queue to handle path management instead of a dedicated work queue. Just move pm_worker to protocol.c, call it from the mptcp worker and get rid of the msk lock (already held). Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-06mptcp: fix race in subflow_data_ready()Davide Caratti1-1/+1
syzkaller was able to make the kernel reach subflow_data_ready() for a server subflow that was closed before subflow_finish_connect() completed. In these cases we can avoid using the path for regular/fallback MPTCP data, and just wake the main socket, to avoid the following warning: WARNING: CPU: 0 PID: 9370 at net/mptcp/subflow.c:885 subflow_data_ready+0x1e6/0x290 net/mptcp/subflow.c:885 Kernel panic - not syncing: panic_on_warn set ... CPU: 0 PID: 9370 Comm: syz-executor.0 Not tainted 5.7.0 #106 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0xb7/0xfe lib/dump_stack.c:118 panic+0x29e/0x692 kernel/panic.c:221 __warn.cold+0x2f/0x3d kernel/panic.c:582 report_bug+0x28b/0x2f0 lib/bug.c:195 fixup_bug arch/x86/kernel/traps.c:105 [inline] fixup_bug arch/x86/kernel/traps.c:100 [inline] do_error_trap+0x10f/0x180 arch/x86/kernel/traps.c:197 do_invalid_op+0x32/0x40 arch/x86/kernel/traps.c:216 invalid_op+0x1e/0x30 arch/x86/entry/entry_64.S:1027 RIP: 0010:subflow_data_ready+0x1e6/0x290 net/mptcp/subflow.c:885 Code: 04 02 84 c0 74 06 0f 8e 91 00 00 00 41 0f b6 5e 48 31 ff 83 e3 18 89 de e8 37 ec 3d fe 84 db 0f 85 65 ff ff ff e8 fa ea 3d fe <0f> 0b e9 59 ff ff ff e8 ee ea 3d fe 48 89 ee 4c 89 ef e8 f3 77 ff RSP: 0018:ffff88811b2099b0 EFLAGS: 00010206 RAX: ffff888111197000 RBX: 0000000000000000 RCX: ffffffff82fbc609 RDX: 0000000000000100 RSI: ffffffff82fbc616 RDI: 0000000000000001 RBP: ffff8881111bc800 R08: ffff888111197000 R09: ffffed10222a82af R10: ffff888111541577 R11: ffffed10222a82ae R12: 1ffff11023641336 R13: ffff888111541000 R14: ffff88810fd4ca00 R15: ffff888111541570 tcp_child_process+0x754/0x920 net/ipv4/tcp_minisocks.c:841 tcp_v4_do_rcv+0x749/0x8b0 net/ipv4/tcp_ipv4.c:1642 tcp_v4_rcv+0x2666/0x2e60 net/ipv4/tcp_ipv4.c:1999 ip_protocol_deliver_rcu+0x29/0x1f0 net/ipv4/ip_input.c:204 ip_local_deliver_finish net/ipv4/ip_input.c:231 [inline] NF_HOOK include/linux/netfilter.h:421 [inline] ip_local_deliver+0x2da/0x390 net/ipv4/ip_input.c:252 dst_input include/net/dst.h:441 [inline] ip_rcv_finish net/ipv4/ip_input.c:428 [inline] ip_rcv_finish net/ipv4/ip_input.c:414 [inline] NF_HOOK include/linux/netfilter.h:421 [inline] ip_rcv+0xef/0x140 net/ipv4/ip_input.c:539 __netif_receive_skb_one_core+0x197/0x1e0 net/core/dev.c:5268 __netif_receive_skb+0x27/0x1c0 net/core/dev.c:5382 process_backlog+0x1e5/0x6d0 net/core/dev.c:6226 napi_poll net/core/dev.c:6671 [inline] net_rx_action+0x3e3/0xd70 net/core/dev.c:6739 __do_softirq+0x18c/0x634 kernel/softirq.c:292 do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1082 </IRQ> do_softirq.part.0+0x26/0x30 kernel/softirq.c:337 do_softirq arch/x86/include/asm/preempt.h:26 [inline] __local_bh_enable_ip+0x46/0x50 kernel/softirq.c:189 local_bh_enable include/linux/bottom_half.h:32 [inline] rcu_read_unlock_bh include/linux/rcupdate.h:723 [inline] ip_finish_output2+0x78a/0x19c0 net/ipv4/ip_output.c:229 __ip_finish_output+0x471/0x720 net/ipv4/ip_output.c:306 dst_output include/net/dst.h:435 [inline] ip_local_out+0x181/0x1e0 net/ipv4/ip_output.c:125 __ip_queue_xmit+0x7a1/0x14e0 net/ipv4/ip_output.c:530 __tcp_transmit_skb+0x19dc/0x35e0 net/ipv4/tcp_output.c:1238 __tcp_send_ack.part.0+0x3c2/0x5b0 net/ipv4/tcp_output.c:3785 __tcp_send_ack net/ipv4/tcp_output.c:3791 [inline] tcp_send_ack+0x7d/0xa0 net/ipv4/tcp_output.c:3791 tcp_rcv_synsent_state_process net/ipv4/tcp_input.c:6040 [inline] tcp_rcv_state_process+0x36a4/0x49c2 net/ipv4/tcp_input.c:6209 tcp_v4_do_rcv+0x343/0x8b0 net/ipv4/tcp_ipv4.c:1651 sk_backlog_rcv include/net/sock.h:996 [inline] __release_sock+0x1ad/0x310 net/core/sock.c:2548 release_sock+0x54/0x1a0 net/core/sock.c:3064 inet_wait_for_connect net/ipv4/af_inet.c:594 [inline] __inet_stream_connect+0x57e/0xd50 net/ipv4/af_inet.c:686 inet_stream_connect+0x53/0xa0 net/ipv4/af_inet.c:725 mptcp_stream_connect+0x171/0x5f0 net/mptcp/protocol.c:1920 __sys_connect_file net/socket.c:1854 [inline] __sys_connect+0x267/0x2f0 net/socket.c:1871 __do_sys_connect net/socket.c:1882 [inline] __se_sys_connect net/socket.c:1879 [inline] __x64_sys_connect+0x6f/0xb0 net/socket.c:1879 do_syscall_64+0xb7/0x3d0 arch/x86/entry/common.c:295 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7fb577d06469 Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ff 49 2b 00 f7 d8 64 89 01 48 RSP: 002b:00007fb5783d5dd8 EFLAGS: 00000246 ORIG_RAX: 000000000000002a RAX: ffffffffffffffda RBX: 000000000068bfa0 RCX: 00007fb577d06469 RDX: 000000000000004d RSI: 0000000020000040 RDI: 0000000000000003 RBP: 00000000ffffffff R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 000000000041427c R14: 00007fb5783d65c0 R15: 0000000000000003 Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/39 Reported-by: Christoph Paasch <cpaasch@apple.com> Fixes: e1ff9e82e2ea ("net: mptcp: improve fallback to TCP") Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-06ipv6: fib6_select_path can not use out path for nexthop objectsDavid Ahern1-1/+4
Brian reported a crash in IPv6 code when using rpfilter with a setup running FRR and external nexthop objects. The root cause of the crash is fib6_select_path setting fib6_nh in the result to NULL because of an improper check for nexthop objects. More specifically, rpfilter invokes ip6_route_lookup with flowi6_oif set causing fib6_select_path to be called with have_oif_match set. fib6_select_path has early check on have_oif_match and jumps to the out label which presumes a builtin fib6_nh. This path is invalid for nexthop objects; for external nexthops fib6_select_path needs to just return if the fib6_nh has already been set in the result otherwise it returns after the call to nexthop_path_fib6_result. Update the check on have_oif_match to not bail on external nexthops. Update selftests for this problem. Fixes: f88d8ea67fbd ("ipv6: Plumb support for nexthop object in a fib6_info") Reported-by: Brian Rak <brak@choopa.com> Signed-off-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-06Replace HTTP links with HTTPS ones: IPv*Alexander A. Klimov9-13/+13
Rationale: Reduces attack surface on kernel devs opening the links for MITM as HTTPS traffic is much harder to manipulate. Deterministic algorithm: For each file: If not .svg: For each line: If doesn't contain `\bxmlns\b`: For each link, `\bhttp://[^# \t\r\n]*(?:\w|/)`: If both the HTTP and HTTPS versions return 200 OK and serve the same content: Replace HTTP with HTTPS. Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-05net: dsa: tag_qca.c: Fix warning for __be16 vs u16Andrew Lunn1-3/+5
net/dsa/tag_qca.c:48:15: warning: incorrect type in assignment (different base types) net/dsa/tag_qca.c:48:15: expected unsigned short [usertype] net/dsa/tag_qca.c:48:15: got restricted __be16 [usertype] net/dsa/tag_qca.c:68:13: warning: incorrect type in assignment (different base types) net/dsa/tag_qca.c:68:13: expected restricted __be16 [usertype] hdr net/dsa/tag_qca.c:68:13: got int net/dsa/tag_qca.c:71:16: warning: restricted __be16 degrades to integer net/dsa/tag_qca.c:81:17: warning: restricted __be16 degrades to integer Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-05net: dsa: tag_mtk: Fix warnings for __be16Andrew Lunn1-1/+2
net/dsa/tag_mtk.c:84:13: warning: incorrect type in assignment (different base types) net/dsa/tag_mtk.c:84:13: expected restricted __be16 [usertype] hdr net/dsa/tag_mtk.c:84:13: got int net/dsa/tag_mtk.c:94:17: warning: restricted __be16 degrades to integer The result of a ntohs() is not __be16, but u16. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-05net: dsa: tag_lan9303: Fix __be16 warningsAndrew Lunn1-8/+9
net/dsa/tag_lan9303.c:76:24: warning: incorrect type in assignment (different base types) net/dsa/tag_lan9303.c:76:24: expected unsigned short [usertype] net/dsa/tag_lan9303.c:76:24: got restricted __be16 [usertype] net/dsa/tag_lan9303.c:80:24: warning: incorrect type in assignment (different base types) net/dsa/tag_lan9303.c:80:24: expected unsigned short [usertype] net/dsa/tag_lan9303.c:80:24: got restricted __be16 [usertype] net/dsa/tag_lan9303.c:106:31: warning: restricted __be16 degrades to integer net/dsa/tag_lan9303.c:111:24: warning: cast to restricted __be16 net/dsa/tag_lan9303.c:111:24: warning: cast to restricted __be16 net/dsa/tag_lan9303.c:111:24: warning: cast to restricted __be16 net/dsa/tag_lan9303.c:111:24: warning: cast to restricted __be16 Make use of __be16 where appropriate to fix these warnings. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-05net: dsa: tag_ksz: Fix __be16 warningsAndrew Lunn1-4/+5
cpu_to_be16 returns a __be16 value. So what it is assigned to needs to have the same type to avoid warnings. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-05net: dsa: Add __percpu property to prevent warningsAndrew Lunn1-1/+1
net/dsa/slave.c:505:13: warning: incorrect type in initializer (different address spaces) net/dsa/slave.c:505:13: expected void const [noderef] <asn:3> *__vpp_verify net/dsa/slave.c:505:13: got struct pcpu_sw_netstats * Add the needed _percpu property to prevent this warning. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-04hsr: fix interface leak in error path of hsr_dev_finalize()Taehee Yoo1-4/+7
To release hsr(upper) interface, it should release its own lower interfaces first. Then, hsr(upper) interface can be released safely. In the current code of error path of hsr_dev_finalize(), it releases hsr interface before releasing a lower interface. So, a warning occurs, which warns about the leak of lower interfaces. In order to fix this problem, changing the ordering of the error path of hsr_dev_finalize() is needed. Test commands: ip link add dummy0 type dummy ip link add dummy1 type dummy ip link add dummy2 type dummy ip link add hsr0 type hsr slave1 dummy0 slave2 dummy1 ip link add hsr1 type hsr slave1 dummy2 slave2 dummy0 Splat looks like: [ 214.923127][ C2] WARNING: CPU: 2 PID: 1093 at net/core/dev.c:8992 rollback_registered_many+0x986/0xcf0 [ 214.923129][ C2] Modules linked in: hsr dummy openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipx [ 214.923154][ C2] CPU: 2 PID: 1093 Comm: ip Not tainted 5.8.0-rc2+ #623 [ 214.923156][ C2] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [ 214.923157][ C2] RIP: 0010:rollback_registered_many+0x986/0xcf0 [ 214.923160][ C2] Code: 41 8b 4e cc 45 31 c0 31 d2 4c 89 ee 48 89 df e8 e0 47 ff ff 85 c0 0f 84 cd fc ff ff 5 [ 214.923162][ C2] RSP: 0018:ffff8880c5156f28 EFLAGS: 00010287 [ 214.923165][ C2] RAX: ffff8880d1dad458 RBX: ffff8880bd1b9000 RCX: ffffffffb929d243 [ 214.923167][ C2] RDX: 1ffffffff77e63f0 RSI: 0000000000000008 RDI: ffffffffbbf31f80 [ 214.923168][ C2] RBP: dffffc0000000000 R08: fffffbfff77e63f1 R09: fffffbfff77e63f1 [ 214.923170][ C2] R10: ffffffffbbf31f87 R11: 0000000000000001 R12: ffff8880c51570a0 [ 214.923172][ C2] R13: ffff8880bd1b90b8 R14: ffff8880c5157048 R15: ffff8880d1dacc40 [ 214.923174][ C2] FS: 00007fdd257a20c0(0000) GS:ffff8880da200000(0000) knlGS:0000000000000000 [ 214.923175][ C2] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 214.923177][ C2] CR2: 00007ffd78beb038 CR3: 00000000be544005 CR4: 00000000000606e0 [ 214.923179][ C2] Call Trace: [ 214.923180][ C2] ? netif_set_real_num_tx_queues+0x780/0x780 [ 214.923182][ C2] ? dev_validate_mtu+0x140/0x140 [ 214.923183][ C2] ? synchronize_rcu.part.79+0x85/0xd0 [ 214.923185][ C2] ? synchronize_rcu_expedited+0xbb0/0xbb0 [ 214.923187][ C2] rollback_registered+0xc8/0x170 [ 214.923188][ C2] ? rollback_registered_many+0xcf0/0xcf0 [ 214.923190][ C2] unregister_netdevice_queue+0x18b/0x240 [ 214.923191][ C2] hsr_dev_finalize+0x56e/0x6e0 [hsr] [ 214.923192][ C2] hsr_newlink+0x36b/0x450 [hsr] [ 214.923194][ C2] ? hsr_dellink+0x70/0x70 [hsr] [ 214.923195][ C2] ? rtnl_create_link+0x2e4/0xb00 [ 214.923197][ C2] ? __netlink_ns_capable+0xc3/0xf0 [ 214.923198][ C2] __rtnl_newlink+0xbdb/0x1270 [ ... ] Fixes: e0a4b99773d3 ("hsr: use upper/lower device infrastructure") Reported-by: syzbot+7f1c020f68dab95aab59@syzkaller.appspotmail.com Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-04mptcp: support IPV6_V6ONLY setsockoptFlorian Westphal1-0/+30
Without this, Opensshd fails to open an ipv6 socket listening socket: error: setsockopt IPV6_V6ONLY: Operation not supported error: Bind to port 22 on :: failed: Address already in use. Opensshd opens an ipv4 and and ipv6 listening socket, but because IPV6_V6ONLY setsockopt fails, the port number is already in use. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-04mptcp: add REUSEADDR/REUSEPORT supportFlorian Westphal1-1/+32
This will e.g. make 'sshd restart' work when MPTCP is used, as we will now set this option on the listener socket instead of only the mptcp socket (where it has no effect). We still need to copy the setting to the master socket so that a subsequent getsockopt() returns the expected value. Reported-by: Christoph Paasch <cpaasch@apple.com> Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-04net: use mptcp setsockopt function for SOL_SOCKET on mptcp socketsFlorian Westphal2-1/+15
setsockopt(mptcp_fd, SOL_SOCKET, ...)... appears to work (returns 0), but it has no effect -- this is because the MPTCP layer never has a chance to copy the settings to the subflow socket. Skip the generic handling for the mptcp case and instead call the mptcp specific handler instead for SOL_SOCKET too. Next patch adds more specific handling for SOL_SOCKET to mptcp. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller8-69/+584
Daniel Borkmann says: ==================== pull-request: bpf-next 2020-07-04 The following pull-request contains BPF updates for your *net-next* tree. We've added 73 non-merge commits during the last 17 day(s) which contain a total of 106 files changed, 5233 insertions(+), 1283 deletions(-). The main changes are: 1) bpftool ability to show PIDs of processes having open file descriptors for BPF map/program/link/BTF objects, relying on BPF iterator progs to extract this info efficiently, from Andrii Nakryiko. 2) Addition of BPF iterator progs for dumping TCP and UDP sockets to seq_files, from Yonghong Song. 3) Support access to BPF map fields in struct bpf_map from programs through BTF struct access, from Andrey Ignatov. 4) Add a bpf_get_task_stack() helper to be able to dump /proc/*/stack via seq_file from BPF iterator progs, from Song Liu. 5) Make SO_KEEPALIVE and related options available to bpf_setsockopt() helper, from Dmitry Yakunin. 6) Optimize BPF sk_storage selection of its caching index, from Martin KaFai Lau. 7) Removal of redundant synchronize_rcu()s from BPF map destruction which has been a historic leftover, from Alexei Starovoitov. 8) Several improvements to test_progs to make it easier to create a shell loop that invokes each test individually which is useful for some CIs, from Jesper Dangaard Brouer. 9) Fix bpftool prog dump segfault when compiled without skeleton code on older clang versions, from John Fastabend. 10) Bunch of cleanups and minor improvements, from various others. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller5-5/+7
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Use kvfree() to release vmalloc()'ed areas in ipset, from Eric Dumazet. 2) UAF in nfnetlink_queue from the nf_conntrack_update() path. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-04netfilter: nf_tables: reject unsupported chain flagsPablo Neira Ayuso1-0/+3
Bail out if userspace sends unsupported chain flags. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-07-04netfilter: nf_tables: add NFT_CHAIN_BINDINGPablo Neira Ayuso2-12/+125
This new chain flag specifies that: * the kernel dynamically allocates the chain name, if no chain name is specified. * If the immediate expression that refers to this chain is removed, then this bound chain (and its content) is destroyed. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-07-04netfilter: nf_tables: add nft_chain_add()Pablo Neira Ayuso1-8/+20
This patch adds a helper function to add the chain to the hashtable and the chain list. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-07-04netfilter: nf_tables: expose enum nft_chain_flags through UAPIPablo Neira Ayuso1-2/+2
This enum definition was never exposed through UAPI. Rename NFT_BASE_CHAIN to NFT_CHAIN_BASE for consistency. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-07-04netfilter: nf_tables: add NFTA_VERDICT_CHAIN_ID attributePablo Neira Ayuso1-3/+13
This netlink attribute allows you to identify the chain to jump/goto by means of the chain ID. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-07-04netfilter: nf_tables: add NFTA_RULE_CHAIN_ID attributePablo Neira Ayuso1-4/+32
This new netlink attribute allows you to add rules to chains by the chain ID. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-07-04netfilter: nf_tables: add NFTA_CHAIN_ID attributePablo Neira Ayuso1-3/+12
This netlink attribute allows you to refer to chains inside a transaction as an alternative to the name and the handle. The chain binding support requires this new chain ID approach. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-07-04ipvs: allow connection reuse for unconfirmed conntrackJulian Anastasov1-5/+7
YangYuxi is reporting that connection reuse is causing one-second delay when SYN hits existing connection in TIME_WAIT state. Such delay was added to give time to expire both the IPVS connection and the corresponding conntrack. This was considered a rare case at that time but it is causing problem for some environments such as Kubernetes. As nf_conntrack_tcp_packet() can decide to release the conntrack in TIME_WAIT state and to replace it with a fresh NEW conntrack, we can use this to allow rescheduling just by tuning our check: if the conntrack is confirmed we can not schedule it to different real server and the one-second delay still applies but if new conntrack was created, we are free to select new real server without any delays. YangYuxi lists some of the problem reports: - One second connection delay in masquerading mode: https://marc.info/?t=151683118100004&r=1&w=2 - IPVS low throughput #70747 https://github.com/kubernetes/kubernetes/issues/70747 - Apache Bench can fill up ipvs service proxy in seconds #544 https://github.com/cloudnativelabs/kube-router/issues/544 - Additional 1s latency in `host -> service IP -> pod` https://github.com/kubernetes/kubernetes/issues/90854 Fixes: f719e3754ee2 ("ipvs: drop first packet to redirect conntrack") Co-developed-by: YangYuxi <yx.atom1@gmail.com> Signed-off-by: YangYuxi <yx.atom1@gmail.com> Signed-off-by: Julian Anastasov <ja@ssi.bg> Reviewed-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-07-03ipv6/ping: set skb->mark on icmpv6 socketsWillem de Bruijn1-0/+1
IPv6 ping sockets route based on fwmark, but do not yet set skb->mark. Add this. IPv4 ping sockets also do both. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-03sched: consistently handle layer3 header accesses in the presence of VLANsToke Høiland-Jørgensen16-32/+41
There are a couple of places in net/sched/ that check skb->protocol and act on the value there. However, in the presence of VLAN tags, the value stored in skb->protocol can be inconsistent based on whether VLAN acceleration is enabled. The commit quoted in the Fixes tag below fixed the users of skb->protocol to use a helper that will always see the VLAN ethertype. However, most of the callers don't actually handle the VLAN ethertype, but expect to find the IP header type in the protocol field. This means that things like changing the ECN field, or parsing diffserv values, stops working if there's a VLAN tag, or if there are multiple nested VLAN tags (QinQ). To fix this, change the helper to take an argument that indicates whether the caller wants to skip the VLAN tags or not. When skipping VLAN tags, we make sure to skip all of them, so behaviour is consistent even in QinQ mode. To make the helper usable from the ECN code, move it to if_vlan.h instead of pkt_sched.h. v3: - Remove empty lines - Move vlan variable definitions inside loop in skb_protocol() - Also use skb_protocol() helper in IP{,6}_ECN_decapsulate() and bpf_skb_ecn_set_ce() v2: - Use eth_type_vlan() helper in skb_protocol() - Also fix code that reads skb->protocol directly - Change a couple of 'if/else if' statements to switch constructs to avoid calling the helper twice Reported-by: Ilya Ponetayev <i.ponetaev@ndmsystems.com> Fixes: d8b9605d2697 ("net: sched: fix skb->protocol use in case of accelerated vlan path") Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-03netfilter: conntrack: refetch conntrack after nf_conntrack_update()Pablo Neira Ayuso1-0/+2
__nf_conntrack_update() might refresh the conntrack object that is attached to the skbuff. Otherwise, this triggers UAF. [ 633.200434] ================================================================== [ 633.200472] BUG: KASAN: use-after-free in nf_conntrack_update+0x34e/0x770 [nf_conntrack] [ 633.200478] Read of size 1 at addr ffff888370804c00 by task nfqnl_test/6769 [ 633.200487] CPU: 1 PID: 6769 Comm: nfqnl_test Not tainted 5.8.0-rc2+ #388 [ 633.200490] Hardware name: LENOVO 23259H1/23259H1, BIOS G2ET32WW (1.12 ) 05/30/2012 [ 633.200491] Call Trace: [ 633.200499] dump_stack+0x7c/0xb0 [ 633.200526] ? nf_conntrack_update+0x34e/0x770 [nf_conntrack] [ 633.200532] print_address_description.constprop.6+0x1a/0x200 [ 633.200539] ? _raw_write_lock_irqsave+0xc0/0xc0 [ 633.200568] ? nf_conntrack_update+0x34e/0x770 [nf_conntrack] [ 633.200594] ? nf_conntrack_update+0x34e/0x770 [nf_conntrack] [ 633.200598] kasan_report.cold.9+0x1f/0x42 [ 633.200604] ? call_rcu+0x2c0/0x390 [ 633.200633] ? nf_conntrack_update+0x34e/0x770 [nf_conntrack] [ 633.200659] nf_conntrack_update+0x34e/0x770 [nf_conntrack] [ 633.200687] ? nf_conntrack_find_get+0x30/0x30 [nf_conntrack] Closes: https://bugzilla.netfilter.org/show_bug.cgi?id=1436 Fixes: ee04805ff54a ("netfilter: conntrack: make conntrack userspace helpers work again") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-07-02Merge tag 'nfsd-5.8-1' of git://linux-nfs.org/~bfields/linuxLinus Torvalds1-0/+1
Pull nfsd fixes from Bruce Fields: "Fixes for a umask bug on exported filesystems lacking ACL support, a leak and a module unloading bug in the /proc/fs/nfsd/clients/ code, and a compile warning" * tag 'nfsd-5.8-1' of git://linux-nfs.org/~bfields/linux: SUNRPC: Add missing definition of ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE nfsd: fix nfsdfs inode reference count leak nfsd4: fix nfsdfs reference count loop nfsd: apply umask on fs without ACL support
2020-07-02bridge: Extend br_fill_ifinfo to return MPR statusHoratiu Vultur1-1/+24
This patch extends the function br_fill_ifinfo to return also the MRP status for each instance on a bridge. It also adds a new filter RTEXT_FILTER_MRP to return the MRP status only when this is set, not to interfer with the vlans. The MRP status is return only on the bridge interfaces. Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-02bridge: mrp: Add br_mrp_fill_infoHoratiu Vultur2-0/+71
Add the function br_mrp_fill_info which populates the MRP attributes regarding the status of each MRP instance. Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-02tcp: md5: allow changing MD5 keys in all socket statesEric Dumazet1-4/+1
This essentially reverts commit 721230326891 ("tcp: md5: reject TCP_MD5SIG or TCP_MD5SIG_EXT on established sockets") Mathieu reported that many vendors BGP implementations can actually switch TCP MD5 on established flows. Quoting Mathieu : Here is a list of a few network vendors along with their behavior with respect to TCP MD5: - Cisco: Allows for password to be changed, but within the hold-down timer (~180 seconds). - Juniper: When password is initially set on active connection it will reset, but after that any subsequent password changes no network resets. - Nokia: No notes on if they flap the tcp connection or not. - Ericsson/RedBack: Allows for 2 password (old/new) to co-exist until both sides are ok with new passwords. - Meta-Switch: Expects the password to be set before a connection is attempted, but no further info on whether they reset the TCP connection on a change. - Avaya: Disable the neighbor, then set password, then re-enable. - Zebos: Would normally allow the change when socket connected. We can revert my prior change because commit 9424e2e7ad93 ("tcp: md5: fix potential overestimation of TCP option space") removed the leak of 4 kernel bytes to the wire that was the main reason for my patch. While doing my investigations, I found a bug when a MD5 key is changed, leading to these commits that stable teams want to consider before backporting this revert : Commit 6a2febec338d ("tcp: md5: add missing memory barriers in tcp_md5_do_add()/tcp_md5_hash_key()") Commit e6ced831ef11 ("tcp: md5: refine tcp_md5_do_add()/tcp_md5_hash_key() barriers") Fixes: 721230326891 "tcp: md5: reject TCP_MD5SIG or TCP_MD5SIG_EXT on established sockets" Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-01mptcp: add receive buffer auto-tuningFlorian Westphal3-8/+127
When mptcp is used, userspace doesn't read from the tcp (subflow) socket but from the parent (mptcp) socket receive queue. skbs are moved from the subflow socket to the mptcp rx queue either from 'data_ready' callback (if mptcp socket can be locked), a work queue, or the socket receive function. This means tcp_rcv_space_adjust() is never called and thus no receive buffer size auto-tuning is done. An earlier (not merged) patch added tcp_rcv_space_adjust() calls to the function that moves skbs from subflow to mptcp socket. While this enabled autotuning, it also meant tuning was done even if userspace was reading the mptcp socket very slowly. This adds mptcp_rcv_space_adjust() and calls it after userspace has read data from the mptcp socket rx queue. Its very similar to tcp_rcv_space_adjust, with two differences: 1. The rtt estimate is the largest one observed on a subflow 2. The rcvbuf size and window clamp of all subflows is adjusted to the mptcp-level rcvbuf. Otherwise, we get spurious drops at tcp (subflow) socket level if the skbs are not moved to the mptcp socket fast enough. Before: time mptcp_connect.sh -t -f $((4*1024*1024)) -d 300 -l 0.01% -r 0 -e "" -m mmap [..] ns4 MPTCP -> ns3 (10.0.3.2:10108 ) MPTCP (duration 40823ms) [ OK ] ns4 MPTCP -> ns3 (10.0.3.2:10109 ) TCP (duration 23119ms) [ OK ] ns4 TCP -> ns3 (10.0.3.2:10110 ) MPTCP (duration 5421ms) [ OK ] ns4 MPTCP -> ns3 (dead:beef:3::2:10111) MPTCP (duration 41446ms) [ OK ] ns4 MPTCP -> ns3 (dead:beef:3::2:10112) TCP (duration 23427ms) [ OK ] ns4 TCP -> ns3 (dead:beef:3::2:10113) MPTCP (duration 5426ms) [ OK ] Time: 1396 seconds After: ns4 MPTCP -> ns3 (10.0.3.2:10108 ) MPTCP (duration 5417ms) [ OK ] ns4 MPTCP -> ns3 (10.0.3.2:10109 ) TCP (duration 5427ms) [ OK ] ns4 TCP -> ns3 (10.0.3.2:10110 ) MPTCP (duration 5422ms) [ OK ] ns4 MPTCP -> ns3 (dead:beef:3::2:10111) MPTCP (duration 5415ms) [ OK ] ns4 MPTCP -> ns3 (dead:beef:3::2:10112) TCP (duration 5422ms) [ OK ] ns4 TCP -> ns3 (dead:beef:3::2:10113) MPTCP (duration 5423ms) [ OK ] Time: 296 seconds Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-01tcp: fix SO_RCVLOWAT possible hangs under high mem pressureEric Dumazet1-0/+2
Whenever tcp_try_rmem_schedule() returns an error, we are under trouble and should make sure to wakeup readers so that they can drain socket queues and eventually make room. Fixes: 03f45c883c6f ("tcp: avoid extra wakeups for SO_RCVLOWAT users") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-01net: sched: Allow changing default qdisc to FQ-PIEDanny Lin1-0/+4
Similar to fq_codel and the other qdiscs that can set as default, fq_pie is also suitable for general use without explicit configuration, which makes it a valid choice for this. This is useful in situations where a painless out-of-the-box solution for reducing bufferbloat is desired but fq_codel is not necessarily the best choice. For example, fq_pie can be better for DASH streaming, but there could be more cases where it's the better choice of the two simple AQMs available in the kernel. Signed-off-by: Danny Lin <danny@kdrag0n.dev> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-01ip: Fix SO_MARK in RST, ACK and ICMP packetsWillem de Bruijn3-5/+5
When no full socket is available, skbs are sent over a per-netns control socket. Its sk_mark is temporarily adjusted to match that of the real (request or timewait) socket or to reflect an incoming skb, so that the outgoing skb inherits this in __ip_make_skb. Introduction of the socket cookie mark field broke this. Now the skb is set through the cookie and cork: <caller> # init sockc.mark from sk_mark or cmsg ip_append_data ip_setup_cork # convert sockc.mark to cork mark ip_push_pending_frames ip_finish_skb __ip_make_skb # set skb->mark to cork mark But I missed these special control sockets. Update all callers of __ip(6)_make_skb that were originally missed. For IPv6, the same two icmp(v6) paths are affected. The third case is not, as commit 92e55f412cff ("tcp: don't annotate mark on control socket from tcp_v6_send_response()") replaced the ctl_sk->sk_mark with passing the mark field directly as a function argument. That commit predates the commit that introduced the bug. Fixes: c6af0c227a22 ("ip: support SO_MARK cmsg") Signed-off-by: Willem de Bruijn <willemb@google.com> Reported-by: Martin KaFai Lau <kafai@fb.com> Reviewed-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-01tcp: md5: do not send silly options in SYNCOOKIESEric Dumazet1-3/+5
Whenever cookie_init_timestamp() has been used to encode ECN,SACK,WSCALE options, we can not remove the TS option in the SYNACK. Otherwise, tcp_synack_options() will still advertize options like WSCALE that we can not deduce later when receiving the packet from the client to complete 3WHS. Note that modern linux TCP stacks wont use MD5+TS+SACK in a SYN packet, but we can not know for sure that all TCP stacks have the same logic. Before the fix a tcpdump would exhibit this wrong exchange : 10:12:15.464591 IP C > S: Flags [S], seq 4202415601, win 65535, options [nop,nop,md5 valid,mss 1400,sackOK,TS val 456965269 ecr 0,nop,wscale 8], length 0 10:12:15.464602 IP S > C: Flags [S.], seq 253516766, ack 4202415602, win 65535, options [nop,nop,md5 valid,mss 1400,nop,nop,sackOK,nop,wscale 8], length 0 10:12:15.464611 IP C > S: Flags [.], ack 1, win 256, options [nop,nop,md5 valid], length 0 10:12:15.464678 IP C > S: Flags [P.], seq 1:13, ack 1, win 256, options [nop,nop,md5 valid], length 12 10:12:15.464685 IP S > C: Flags [.], ack 13, win 65535, options [nop,nop,md5 valid], length 0 After this patch the exchange looks saner : 11:59:59.882990 IP C > S: Flags [S], seq 517075944, win 65535, options [nop,nop,md5 valid,mss 1400,sackOK,TS val 1751508483 ecr 0,nop,wscale 8], length 0 11:59:59.883002 IP S > C: Flags [S.], seq 1902939253, ack 517075945, win 65535, options [nop,nop,md5 valid,mss 1400,sackOK,TS val 1751508479 ecr 1751508483,nop,wscale 8], length 0 11:59:59.883012 IP C > S: Flags [.], ack 1, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508483 ecr 1751508479], length 0 11:59:59.883114 IP C > S: Flags [P.], seq 1:13, ack 1, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508483 ecr 1751508479], length 12 11:59:59.883122 IP S > C: Flags [.], ack 13, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508483 ecr 1751508483], length 0 11:59:59.883152 IP S > C: Flags [P.], seq 1:13, ack 13, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508484 ecr 1751508483], length 12 11:59:59.883170 IP C > S: Flags [.], ack 13, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508484 ecr 1751508484], length 0 Of course, no SACK block will ever be added later, but nothing should break. Technically, we could remove the 4 nops included in MD5+TS options, but again some stacks could break seeing not conventional alignment. Fixes: 4957faade11b ("TCPCT part 1g: Responder Cookie => Initiator") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Florian Westphal <fw@strlen.de> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-01rds: If one path needs re-connection, check all and re-connectRao Shoaib3-1/+20
In testing with mprds enabled, Oracle Cluster nodes after reboot were not able to communicate with others nodes and so failed to rejoin the cluster. Peers with lower IP address initiated connection but the node could not respond as it choose a different path and could not initiate a connection as it had a higher IP address. With this patch, when a node sends out a packet and the selected path is down, all other paths are also checked and any down paths are re-connected. Reviewed-by: Ka-cheong Poon <ka-cheong.poon@oracle.com> Reviewed-by: David Edmondson <david.edmondson@oracle.com> Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Signed-off-by: Rao Shoaib <rao.shoaib@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-01tcp: md5: refine tcp_md5_do_add()/tcp_md5_hash_key() barriersEric Dumazet2-9/+18
My prior fix went a bit too far, according to Herbert and Mathieu. Since we accept that concurrent TCP MD5 lookups might see inconsistent keys, we can use READ_ONCE()/WRITE_ONCE() instead of smp_rmb()/smp_wmb() Clearing all key->key[] is needed to avoid possible KMSAN reports, if key->keylen is increased. Since tcp_md5_do_add() is not fast path, using __GFP_ZERO to clear all struct tcp_md5sig_key is simpler. data_race() was added in linux-5.8 and will prevent KCSAN reports, this can safely be removed in stable backports, if data_race() is not yet backported. v2: use data_race() both in tcp_md5_hash_key() and tcp_md5_do_add() Fixes: 6a2febec338d ("tcp: md5: add missing memory barriers in tcp_md5_do_add()/tcp_md5_hash_key()") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Marco Elver <elver@google.com> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-01genetlink: remove genl_bindSean Tranchetti1-49/+0
A potential deadlock can occur during registering or unregistering a new generic netlink family between the main nl_table_lock and the cb_lock where each thread wants the lock held by the other, as demonstrated below. 1) Thread 1 is performing a netlink_bind() operation on a socket. As part of this call, it will call netlink_lock_table(), incrementing the nl_table_users count to 1. 2) Thread 2 is registering (or unregistering) a genl_family via the genl_(un)register_family() API. The cb_lock semaphore will be taken for writing. 3) Thread 1 will call genl_bind() as part of the bind operation to handle subscribing to GENL multicast groups at the request of the user. It will attempt to take the cb_lock semaphore for reading, but it will fail and be scheduled away, waiting for Thread 2 to finish the write. 4) Thread 2 will call netlink_table_grab() during the (un)registration call. However, as Thread 1 has incremented nl_table_users, it will not be able to proceed, and both threads will be stuck waiting for the other. genl_bind() is a noop, unless a genl_family implements the mcast_bind() function to handle setting up family-specific multicast operations. Since no one in-tree uses this functionality as Cong pointed out, simply removing the genl_bind() function will remove the possibility for deadlock, as there is no attempt by Thread 1 above to take the cb_lock semaphore. Fixes: c380d9a7afff ("genetlink: pass multicast bind/unbind to families") Suggested-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Johannes Berg <johannes.berg@intel.com> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Sean Tranchetti <stranche@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-01net/packet: remove redundant initialization of variable errColin Ian King1-1/+1
The variable err is being initialized with a value that is never read and it is being updated later with a new value. The initialization is redundant and can be removed. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-01bpf: Fix net/core/filter build errors when INET is not enabledRandy Dunlap1-0/+4
Fix build errors when CONFIG_INET is not set/enabled. (.text+0x2b1b): undefined reference to `tcp_prot' (.text+0x2b3b): undefined reference to `tcp_prot' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/b1a858ec-7e04-56bc-248a-62cb9bbee726@infradead.org
2020-07-01ipvs: avoid expiring many connections from timerJulian Anastasov2-17/+42
Add new functions ip_vs_conn_del() and ip_vs_conn_del_put() to release many IPVS connections in process context. They are suitable for connections found in table when we do not want to overload the timers. Currently, the change is useful for the dropentry delayed work but it will be used also in following patch when flushing connections to failed destinations. Signed-off-by: Julian Anastasov <ja@ssi.bg> Reviewed-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-06-30net: qrtr: Fix an out of bounds read qrtr_endpoint_post()Dan Carpenter1-1/+5
This code assumes that the user passed in enough data for a qrtr_hdr_v1 or qrtr_hdr_v2 struct, but it's not necessarily true. If the buffer is too small then it will read beyond the end. Reported-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Reported-by: syzbot+b8fe393f999a291a9ea6@syzkaller.appspotmail.com Fixes: 194ccc88297a ("net: qrtr: Support decoding incoming v2 packets") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>