linux-dev - Linux kernel development work

Age	Commit message (Collapse)	Author	Files	Lines
2019-06-09	mpls: fix af_mpls dependencies	Matteo Croce	1	-0/+1
	MPLS routing code relies on sysctl to work, so let it select PROC_SYSCTL. Reported-by: Randy Dunlap <rdunlap@infradead.org> Suggested-by: David Ahern <dsahern@gmail.com> Signed-off-by: Matteo Croce <mcroce@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-09	ibmvnic: Fix unchecked return codes of memory allocations	Thomas Falcon	1	-6/+7
	The return values for these memory allocations are unchecked, which may cause an oops if the driver does not handle them after a failure. Fix by checking the function's return code. Signed-off-by: Thomas Falcon <tlfalcon@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-09	ibmvnic: Refresh device multicast list after reset	Thomas Falcon	1	-0/+3
	It was observed that multicast packets were no longer received after a device reset. The fix is to resend the current multicast list to the backing device after recovery. Signed-off-by: Thomas Falcon <tlfalcon@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-09	ibmvnic: Do not close unopened driver during reset	Thomas Falcon	1	-1/+2
	Check driver state before halting it during a reset. If the driver is not running, do nothing. Otherwise, a request to deactivate a down link can cause an error and the reset will fail. Signed-off-by: Thomas Falcon <tlfalcon@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-09	mpls: fix warning with multi-label encap	George Wilkie	1	-1/+1
	If you configure a route with multiple labels, e.g. ip route add 10.10.3.0/24 encap mpls 16/100 via 10.10.2.2 dev ens4 A warning is logged: kernel: [ 130.561819] netlink: 'ip': attribute type 1 has an invalid length. This happens because mpls_iptunnel_policy has set the type of MPLS_IPTUNNEL_DST to fixed size NLA_U32. Change it to a minimum size. nla_get_labels() does the remaining validation. Fixes: e3e4712ec096 ("mpls: ip tunnel support") Signed-off-by: George Wilkie <gwilkie@vyatta.att-mail.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-09	net: phy: rename Asix Electronics PHY driver	Michael Schmitz	4	-3/+3
	[Resent to net instead of net-next - may clash with Anders Roxell's patch series addressing duplicate module names] Commit 31dd83b96641 ("net-next: phy: new Asix Electronics PHY driver") introduced a new PHY driver drivers/net/phy/asix.c that causes a module name conflict with a pre-existiting driver (drivers/net/usb/asix.c). The PHY driver is used by the X-Surf 100 ethernet card driver, and loaded by that driver via its PHY ID. A rename of the driver looks unproblematic. Rename PHY driver to ax88796b.c in order to resolve name conflict. Signed-off-by: Michael Schmitz <schmitzmic@gmail.com> Tested-by: Michael Schmitz <schmitzmic@gmail.com> Fixes: 31dd83b96641 ("net-next: phy: new Asix Electronics PHY driver") Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-09	ipv6: flowlabel: fl6_sock_lookup() must use atomic_inc_not_zero	Eric Dumazet	1	-3/+4
	Before taking a refcount, make sure the object is not already scheduled for deletion. Same fix is needed in ipv6_flowlabel_opt() Fixes: 18367681a10b ("ipv6 flowlabel: Convert np->ipv6_fl_list to RCU.") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-09	net: ipv4: fib_semantics: fix uninitialized variable	Enrico Weigelt	1	-1/+1
	fix an uninitialized variable: CC net/ipv4/fib_semantics.o net/ipv4/fib_semantics.c: In function 'fib_check_nh_v4_gw': net/ipv4/fib_semantics.c:1027:12: warning: 'err' may be used uninitialized in this function [-Wmaybe-uninitialized] if (!tbl \|\| err) { ^~ Signed-off-by: Enrico Weigelt <info@metux.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-07	net/mlx5e: Support tagged tunnel over bond	Eli Britstein	1	-5/+6
	Stacked devices like bond interface may have a VLAN device on top of them. Detect lag state correctly under this condition, and return the correct routed net device, according to it the encap header is built. Fixes: e32ee6c78efa ("net/mlx5e: Support tunnel encap over tagged Ethernet") Signed-off-by: Eli Britstein <elibr@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-06-07	net/mlx5e: Avoid detaching non-existing netdev under switchdev mode	Alaa Hleihel	1	-0/+5
	After introducing dedicated uplink representor, the netdev instance set over the esw manager vport (PF) became no longer in use, so it was removed in the cited commit once we're on switchdev mode. However, the mlx5e_detach function was not updated accordingly, and it still tries to detach a non-existing netdev, causing a kernel crash. This patch fixes this issue. Fixes: aec002f6f82c ("net/mlx5e: Uninstantiate esw manager vport netdev on switchdev mode") Signed-off-by: Alaa Hleihel <alaa@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-06-07	net/mlx5e: Fix source port matching in fdb peer flow rule	Raed Salem	1	-3/+0
	The cited commit changed the initialization placement of the eswitch attributes so it is done prior to parse tc actions function call, including among others the in_rep and in_mdev fields which are mistakenly reassigned inside the parse actions function. This breaks the source port matching criteria of the peer redirect rule. Fix by removing the now redundant reassignment of the already initialized fields. Fixes: 988ab9c7363a ("net/mlx5e: Introduce mlx5e_flow_esw_attr_init() helper") Signed-off-by: Raed Salem <raeds@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-06-07	net/mlx5e: Replace reciprocal_scale in TX select queue function	Shay Agroskin	3	-6/+8
	The TX queue index returned by the fallback function ranges between [0,NUM CHANNELS - 1] if QoS isn't set and [0, (NUM CHANNELS)(NUM TCs) -1] otherwise. Our HW uses different TC mapping than the fallback function (which is denoted as 'up', user priority) so we only need to extract a channel number out of the returned value. Since (NUM CHANNELS)(NUM TCs) is a relatively small number, using reciprocal scale almost always returns zero. We instead access the 'txq2sq' table to extract the sq (and with it the channel number) associated with the tx queue, thus getting a more evenly distributed channel number. Perf: Rx/Tx side with Intel(R) Xeon(R) Silver 4108 CPU @ 1.80GHz and ConnectX-5. Used 'iperf' UDP traffic, 10 threads, and priority 5. Before: 0.566Mpps After: 2.37Mpps As expected, releasing the existing bottleneck of steering all traffic to TX queue zero significantly improves transmission rates. Fixes: 7ccdd0841b30 ("net/mlx5e: Fix select queue callback") Signed-off-by: Shay Agroskin <shayag@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-06-07	net/mlx5e: Add ndo_set_feature for uplink representor	Chris Mi	3	-6/+8
	After we have a dedicated uplink representor, the new netdev ops doesn't support ndo_set_feature. Because of that, we can't change some features, eg. rxvlan. Now add it back. In this patch, I also do a cleanup for the features flag handling, eg. remove duplicate NETIF_F_HW_TC flag setting. Fixes: aec002f6f82c ("net/mlx5e: Uninstantiate esw manager vport netdev on switchdev mode") Signed-off-by: Chris Mi <chrism@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Reviewed-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-06-07	net/mlx5: Avoid reloading already removed devices	Alaa Hleihel	1	-2/+23
	Prior to reloading a device we must first verify that it was not already removed. Otherwise, the attempt to remove the device will do nothing, and in that case we will end up proceeding with adding an new device that no one was expecting to remove, leaving behind used resources such as EQs that causes a failure to destroy comp EQs and syndrome (0x30f433). Fix that by making sure that we try to remove and add a device (based on a protocol) only if the device is already added. Fixes: c5447c70594b ("net/mlx5: E-Switch, Reload IB interface when switching devlink modes") Signed-off-by: Alaa Hleihel <alaa@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-06-07	net/mlx5: Update pci error handler entries and command translation	Edward Srouji	1	-0/+8
	Add missing entries for create/destroy UCTX and UMEM commands. This could get us wrong "unknown FW command" error in flows where we unbind the device or reset the driver. Also the translation of these commands from opcodes to string was missing. Fixes: 6e3722baac04 ("IB/mlx5: Use the correct commands for UMEM and UCTX allocation") Signed-off-by: Edward Srouji <edwards@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-06-07	can: purge socket error queue on sock destruct	Willem de Bruijn	1	-0/+1
	CAN supports software tx timestamps as of the below commit. Purge any queued timestamp packets on socket destroy. Fixes: 51f31cabe3ce ("ip: support for TX timestamps on UDP and RAW sockets") Reported-by: syzbot+a90604060cb40f5bdd16@syzkaller.appspotmail.com Signed-off-by: Willem de Bruijn <willemb@google.com> Cc: linux-stable <stable@vger.kernel.org> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2019-06-07	can: flexcan: Remove unneeded registration message	Fabio Estevam	1	-3/+0
	Currently the following message is observed when the flexcan driver is probed: flexcan 2090000.flexcan: device registered (reg_base=(ptrval), irq=23) The reason for printing 'ptrval' is explained at Documentation/core-api/printk-formats.rst: "Pointers printed without a specifier extension (i.e unadorned %p) are hashed to prevent leaking information about the kernel memory layout. This has the added benefit of providing a unique identifier. On 64-bit machines the first 32 bits are zeroed. The kernel will print ``(ptrval)`` until it gathers enough entropy." Instead of passing %pK, which can print the correct address, simply remove the entire message as it is not really that useful. Signed-off-by: Fabio Estevam <festevam@gmail.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2019-06-07	can: af_can: Fix error path of can_init()	YueHaibing	1	-3/+21
	This patch add error path for can_init() to avoid possible crash if some error occurs. Fixes: 0d66548a10cb ("[CAN]: Add PF_CAN core module") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Acked-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2019-06-07	can: m_can: implement errata "Needless activation of MRAF irq"	Eugen Hristev	1	-0/+21
	During frame reception while the MCAN is in Error Passive state and the Receive Error Counter has thevalue MCAN_ECR.REC = 127, it may happen that MCAN_IR.MRAF is set although there was no Message RAM access failure. If MCAN_IR.MRAF is enabled, an interrupt to the Host CPU is generated. Work around: The Message RAM Access Failure interrupt routine needs to check whether MCAN_ECR.RP = '1' and MCAN_ECR.REC = '127'. In this case, reset MCAN_IR.MRAF. No further action is required. This affects versions older than 3.2.0 Errata explained on Sama5d2 SoC which includes this hardware block: http://ww1.microchip.com/downloads/en/DeviceDoc/SAMA5D2-Family-Silicon-Errata-and-Data-Sheet-Clarification-DS80000803B.pdf chapter 6.2 Reproducibility: If 2 devices with m_can are connected back to back, configuring different bitrate on them will lead to interrupt storm on the receiving side, with error "Message RAM access failure occurred". Another way is to have a bad hardware connection. Bad wire connection can lead to this issue as well. This patch fixes the issue according to provided workaround. Signed-off-by: Eugen Hristev <eugen.hristev@microchip.com> Reviewed-by: Ludovic Desroches <ludovic.desroches@microchip.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2019-06-07	can: mcp251x: add support for mcp25625	Sean Nyekjaer	2	-11/+19
	Fully compatible with mcp2515, the mcp25625 have integrated transceiver. This patch adds support for the mcp25625 to the existing mcp251x driver. Signed-off-by: Sean Nyekjaer <sean@geanix.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2019-06-07	dt-bindings: can: mcp251x: add mcp25625 support	Sean Nyekjaer	1	-0/+1
	Fully compatible with mcp2515, the mcp25625 have integrated transceiver. This patch add the mcp25625 to the device tree bindings documentation. Signed-off-by: Sean Nyekjaer <sean@geanix.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2019-06-07	can: xilinx_can: use correct bittiming_const for CAN FD core	Anssi Hannula	1	-1/+1
	Commit 9e5f1b273e6a ("can: xilinx_can: add support for Xilinx CAN FD core") added a new can_bittiming_const structure for CAN FD cores that support larger values for tseg1, tseg2, and sjw than previous Xilinx CAN cores, but the commit did not actually take that into use. Fix that. Tested with CAN FD core on a ZynqMP board. Fixes: 9e5f1b273e6a ("can: xilinx_can: add support for Xilinx CAN FD core") Reported-by: Shubhrajyoti Datta <shubhrajyoti.datta@gmail.com> Signed-off-by: Anssi Hannula <anssi.hannula@bitwise.fi> Cc: Michal Simek <michal.simek@xilinx.com> Reviewed-by: Shubhrajyoti Datta <shubhrajyoti.datta@gmail.com> Cc: linux-stable <stable@vger.kernel.org> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2019-06-07	can: flexcan: fix timeout when set small bitrate	Joakim Zhang	1	-1/+1
	Current we can meet timeout issue when setting a small bitrate like 10000 as follows on i.MX6UL EVK board (ipg clock = 66MHZ, per clock = 30MHZ): \| root@imx6ul7d:~# ip link set can0 up type can bitrate 10000 A link change request failed with some changes committed already. Interface can0 may have been left with an inconsistent configuration, please check. \| RTNETLINK answers: Connection timed out It is caused by calling of flexcan_chip_unfreeze() timeout. Originally the code is using usleep_range(10, 20) for unfreeze operation, but the patch (8badd65 can: flexcan: avoid calling usleep_range from interrupt context) changed it into udelay(10) which is only a half delay of before, there're also some other delay changes. After double to FLEXCAN_TIMEOUT_US to 100 can fix the issue. Meanwhile, Rasmus Villemoes reported that even with a timeout of 100, flexcan_probe() fails on the MPC8309, which requires a value of at least 140 to work reliably. 250 works for everyone. Signed-off-by: Joakim Zhang <qiangqing.zhang@nxp.com> Reviewed-by: Dong Aisheng <aisheng.dong@nxp.com> Cc: linux-stable <stable@vger.kernel.org> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2019-06-07	can: usb: Kconfig: Remove duplicate menu entry	Alexander Dahl	1	-6/+0
	This seems to have slipped in by accident when sorting the entries. Fixes: ffbdd9172ee2f53020f763574b4cdad8d9760a4f Signed-off-by: Alexander Dahl <ada@thorsis.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2019-06-06	bpf: expand section tests for test_section_names	Daniel Borkmann	1	-0/+10
	Add cgroup/recvmsg{4,6} to test_section_names as well. Test run output: # ./test_section_names libbpf: failed to guess program type based on ELF section name 'InvAliD' libbpf: supported section(type) names are: [...] libbpf: failed to guess attach type based on ELF section name 'InvAliD' libbpf: attachable section(type) names are: [...] libbpf: failed to guess program type based on ELF section name 'cgroup' libbpf: supported section(type) names are: [...] libbpf: failed to guess attach type based on ELF section name 'cgroup' libbpf: attachable section(type) names are: [...] Summary: 38 PASSED, 0 FAILED Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-06-06	bpf: more msg_name rewrite tests to test_sock_addr	Daniel Borkmann	1	-16/+197
	Extend test_sock_addr for recvmsg test cases, bigger parts of the sendmsg code can be reused for this. Below are the strace view of the recvmsg rewrites; the sendmsg side does not have a BPF prog connected to it for the context of this test: IPv4 test case: [pid 4846] bpf(BPF_PROG_ATTACH, {target_fd=3, attach_bpf_fd=4, attach_type=0x13 /* BPF_??? /, attach_flags=BPF_F_ALLOW_OVERRIDE}, 112) = 0 [pid 4846] socket(AF_INET, SOCK_DGRAM, IPPROTO_IP) = 5 [pid 4846] bind(5, {sa_family=AF_INET, sin_port=htons(4444), sin_addr=inet_addr("127.0.0.1")}, 128) = 0 [pid 4846] socket(AF_INET, SOCK_DGRAM, IPPROTO_IP) = 6 [pid 4846] sendmsg(6, {msg_name={sa_family=AF_INET, sin_port=htons(4444), sin_addr=inet_addr("127.0.0.1")}, msg_namelen=128, msg_iov=[{iov_base="a", iov_len=1}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 1 [pid 4846] select(6, [5], NULL, NULL, {tv_sec=2, tv_usec=0}) = 1 (in [5], left {tv_sec=1, tv_usec=999995}) [pid 4846] recvmsg(5, {msg_name={sa_family=AF_INET, sin_port=htons(4040), sin_addr=inet_addr("192.168.1.254")}, msg_namelen=128->16, msg_iov=[{iov_base="a", iov_len=64}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 1 [pid 4846] close(6) = 0 [pid 4846] close(5) = 0 [pid 4846] bpf(BPF_PROG_DETACH, {target_fd=3, attach_type=0x13 / BPF_??? /}, 112) = 0 IPv6 test case: [pid 4846] bpf(BPF_PROG_ATTACH, {target_fd=3, attach_bpf_fd=4, attach_type=0x14 / BPF_??? /, attach_flags=BPF_F_ALLOW_OVERRIDE}, 112) = 0 [pid 4846] socket(AF_INET6, SOCK_DGRAM, IPPROTO_IP) = 5 [pid 4846] bind(5, {sa_family=AF_INET6, sin6_port=htons(6666), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 128) = 0 [pid 4846] socket(AF_INET6, SOCK_DGRAM, IPPROTO_IP) = 6 [pid 4846] sendmsg(6, {msg_name={sa_family=AF_INET6, sin6_port=htons(6666), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, msg_namelen=128, msg_iov=[{iov_base="a", iov_len=1}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 1 [pid 4846] select(6, [5], NULL, NULL, {tv_sec=2, tv_usec=0}) = 1 (in [5], left {tv_sec=1, tv_usec=999996}) [pid 4846] recvmsg(5, {msg_name={sa_family=AF_INET6, sin6_port=htons(6060), inet_pton(AF_INET6, "face:b00c:1234:5678::abcd", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, msg_namelen=128->28, msg_iov=[{iov_base="a", iov_len=64}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 1 [pid 4846] close(6) = 0 [pid 4846] close(5) = 0 [pid 4846] bpf(BPF_PROG_DETACH, {target_fd=3, attach_type=0x14 / BPF_??? */}, 112) = 0 test_sock_addr run w/o strace view: # ./test_sock_addr.sh [...] Test case: recvmsg4: return code ok .. [PASS] Test case: recvmsg4: return code !ok .. [PASS] Test case: recvmsg6: return code ok .. [PASS] Test case: recvmsg6: return code !ok .. [PASS] Test case: recvmsg4: rewrite IP & port (asm) .. [PASS] Test case: recvmsg6: rewrite IP & port (asm) .. [PASS] [...] Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrey Ignatov <rdna@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-06-06	bpf, bpftool: enable recvmsg attach types	Daniel Borkmann	5	-6/+15
	Trivial patch to bpftool in order to complete enabling attaching programs to BPF_CGROUP_UDP{4,6}_RECVMSG. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrey Ignatov <rdna@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-06-06	bpf, libbpf: enable recvmsg attach types	Daniel Borkmann	1	-0/+4
	Another trivial patch to libbpf in order to enable identifying and attaching programs to BPF_CGROUP_UDP{4,6}_RECVMSG by section name. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-06-06	bpf: sync tooling uapi header	Daniel Borkmann	1	-0/+2
	Sync BPF uapi header in order to pull in BPF_CGROUP_UDP{4,6}_RECVMSG attach types. This is done and preferred as an extra patch in order to ease sync of libbpf. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrey Ignatov <rdna@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-06-06	bpf: fix unconnected udp hooks	Daniel Borkmann	7	-4/+36
	Intention of cgroup bind/connect/sendmsg BPF hooks is to act transparently to applications as also stated in original motivation in 7828f20e3779 ("Merge branch 'bpf-cgroup-bind-connect'"). When recently integrating the latter two hooks into Cilium to enable host based load-balancing with Kubernetes, I ran into the issue that pods couldn't start up as DNS got broken. Kubernetes typically sets up DNS as a service and is thus subject to load-balancing. Upon further debugging, it turns out that the cgroupv2 sendmsg BPF hooks API is currently insufficient and thus not usable as-is for standard applications shipped with most distros. To break down the issue we ran into with a simple example: # cat /etc/resolv.conf nameserver 147.75.207.207 nameserver 147.75.207.208 For the purpose of a simple test, we set up above IPs as service IPs and transparently redirect traffic to a different DNS backend server for that node: # cilium service list ID Frontend Backend 1 147.75.207.207:53 1 => 8.8.8.8:53 2 147.75.207.208:53 1 => 8.8.8.8:53 The attached BPF program is basically selecting one of the backends if the service IP/port matches on the cgroup hook. DNS breaks here, because the hooks are not transparent enough to applications which have built-in msg_name address checks: # nslookup 1.1.1.1 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53 [...] ;; connection timed out; no servers could be reached # dig 1.1.1.1 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53 [...] ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1 ;; global options: +cmd ;; connection timed out; no servers could be reached For comparison, if none of the service IPs is used, and we tell nslookup to use 8.8.8.8 directly it works just fine, of course: # nslookup 1.1.1.1 8.8.8.8 1.1.1.1.in-addr.arpa name = one.one.one.one. In order to fix this and thus act more transparent to the application, this needs reverse translation on recvmsg() side. A minimal fix for this API is to add similar recvmsg() hooks behind the BPF cgroups static key such that the program can track state and replace the current sockaddr_in{,6} with the original service IP. From BPF side, this basically tracks the service tuple plus socket cookie in an LRU map where the reverse NAT can then be retrieved via map value as one example. Side-note: the BPF cgroups static key should be converted to a per-hook static key in future. Same example after this fix: # cilium service list ID Frontend Backend 1 147.75.207.207:53 1 => 8.8.8.8:53 2 147.75.207.208:53 1 => 8.8.8.8:53 Lookups work fine now: # nslookup 1.1.1.1 1.1.1.1.in-addr.arpa name = one.one.one.one. Authoritative answers can be found from: # dig 1.1.1.1 ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1 ;; global options: +cmd ;; Got answer: ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 51550 ;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1 ;; OPT PSEUDOSECTION: ; EDNS: version: 0, flags:; udp: 512 ;; QUESTION SECTION: ;1.1.1.1. IN A ;; AUTHORITY SECTION: . 23426 IN SOA a.root-servers.net. nstld.verisign-grs.com. 2019052001 1800 900 604800 86400 ;; Query time: 17 msec ;; SERVER: 147.75.207.207#53(147.75.207.207) ;; WHEN: Tue May 21 12:59:38 UTC 2019 ;; MSG SIZE rcvd: 111 And from an actual packet level it shows that we're using the back end server when talking via 147.75.207.20{7,8} front end: # tcpdump -i any udp [...] 12:59:52.698732 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38) 12:59:52.698735 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38) 12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67) 12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67) [...] In order to be flexible and to have same semantics as in sendmsg BPF programs, we only allow return codes in [1,1] range. In the sendmsg case the program is called if msg->msg_name is present which can be the case in both, connected and unconnected UDP. The former only relies on the sockaddr_in{,6} passed via connect(2) if passed msg->msg_name was NULL. Therefore, on recvmsg side, we act in similar way to call into the BPF program whenever a non-NULL msg->msg_name was passed independent of sk->sk_state being TCP_ESTABLISHED or not. Note that for TCP case, the msg->msg_name is ignored in the regular recvmsg path and therefore not relevant. For the case of ip{,v6}_recv_error() paths, picked up via MSG_ERRQUEUE, the hook is not called. This is intentional as it aligns with the same semantics as in case of TCP cgroup BPF hooks right now. This might be better addressed in future through a different bpf_attach_type such that this case can be distinguished from the regular recvmsg paths, for example. Fixes: 1cedee13d25a ("bpf: Hooks for sys_sendmsg") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrey Ignatov <rdna@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Martynas Pumputis <m@lambda.lt> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-06-06	pktgen: do not sleep with the thread lock held.	Paolo Abeni	1	-0/+11
	Currently, the process issuing a "start" command on the pktgen procfs interface, acquires the pktgen thread lock and never release it, until all pktgen threads are completed. The above can blocks indefinitely any other pktgen command and any (even unrelated) netdevice removal - as the pktgen netdev notifier acquires the same lock. The issue is demonstrated by the following script, reported by Matteo: ip -b - <<'EOF' link add type dummy link add type veth link set dummy0 up EOF modprobe pktgen echo reset >/proc/net/pktgen/pgctrl { echo rem_device_all echo add_device dummy0 } >/proc/net/pktgen/kpktgend_0 echo count 0 >/proc/net/pktgen/dummy0 echo start >/proc/net/pktgen/pgctrl & sleep 1 rmmod veth Fix the above releasing the thread lock around the sleep call. Additionally we must prevent racing with forcefull rmmod - as the thread lock no more protects from them. Instead, acquire a self-reference before waiting for any thread. As a side effect, running rmmod pktgen while some thread is running now fails with "module in use" error, before this patch such command hanged indefinitely. Note: the issue predates the commit reported in the fixes tag, but this fix can't be applied before the mentioned commit. v1 -> v2: - no need to check for thread existence after flipping the lock, pktgen threads are freed only at net exit time - Fixes: 6146e6a43b35 ("[PKTGEN]: Removes thread_{un,}lock() macros.") Reported-and-tested-by: Matteo Croce <mcroce@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-06	net: mvpp2: Use strscpy to handle stat strings	Maxime Chevallier	1	-2/+2
	Use a safe strscpy call to copy the ethtool stat strings into the relevant buffers, instead of a memcpy that will be accessing out-of-bound data. Fixes: 118d6298f6f0 ("net: mvpp2: add ethtool GOP statistics") Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-06	net: rds: fix memory leak in rds_ib_flush_mr_pool	Zhu Yanjun	1	-4/+6
	When the following tests last for several hours, the problem will occur. Server: rds-stress -r 1.1.1.16 -D 1M Client: rds-stress -r 1.1.1.14 -s 1.1.1.16 -D 1M -T 30 The following will occur. " Starting up.... tsks tx/s rx/s tx+rx K/s mbi K/s mbo K/s tx us/c rtt us cpu % 1 0 0 0.00 0.00 0.00 0.00 0.00 -1.00 1 0 0 0.00 0.00 0.00 0.00 0.00 -1.00 1 0 0 0.00 0.00 0.00 0.00 0.00 -1.00 1 0 0 0.00 0.00 0.00 0.00 0.00 -1.00 " >From vmcore, we can find that clean_list is NULL. >From the source code, rds_mr_flushd calls rds_ib_mr_pool_flush_worker. Then rds_ib_mr_pool_flush_worker calls " rds_ib_flush_mr_pool(pool, 0, NULL); " Then in function " int rds_ib_flush_mr_pool(struct rds_ib_mr_pool pool, int free_all, struct rds_ib_mr ibmr_ret) " ibmr_ret is NULL. In the source code, " ... list_to_llist_nodes(pool, &unmap_list, &clean_nodes, &clean_tail); if (ibmr_ret) ibmr_ret = llist_entry(clean_nodes, struct rds_ib_mr, llnode); /* more than one entry in llist nodes */ if (clean_nodes->next) llist_add_batch(clean_nodes->next, clean_tail, &pool->clean_list); ... " When ibmr_ret is NULL, llist_entry is not executed. clean_nodes->next instead of clean_nodes is added in clean_list. So clean_nodes is discarded. It can not be used again. The workqueue is executed periodically. So more and more clean_nodes are discarded. Finally the clean_list is NULL. Then this problem will occur. Fixes: 1bc144b62524 ("net, rds, Replace xlist in net/rds/xlist.h with llist") Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-06	ipv6: fix EFAULT on sendto with icmpv6 and hdrincl	Olivier Matz	1	-5/+8
	The following code returns EFAULT (Bad address): s = socket(AF_INET6, SOCK_RAW, IPPROTO_ICMPV6); setsockopt(s, SOL_IPV6, IPV6_HDRINCL, 1); sendto(ipv6_icmp6_packet, addr); /* returns -1, errno = EFAULT */ The IPv4 equivalent code works. A workaround is to use IPPROTO_RAW instead of IPPROTO_ICMPV6. The failure happens because 2 bytes are eaten from the msghdr by rawv6_probe_proto_opt() starting from commit 19e3c66b52ca ("ipv6 equivalent of "ipv4: Avoid reading user iov twice after raw_probe_proto_opt""), but at that time it was not a problem because IPV6_HDRINCL was not yet introduced. Only eat these 2 bytes if hdrincl == 0. Fixes: 715f504b1189 ("ipv6: add IPV6_HDRINCL option for raw sockets") Signed-off-by: Olivier Matz <olivier.matz@6wind.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-06	ipv6: use READ_ONCE() for inet->hdrincl as in ipv4	Olivier Matz	1	-2/+10
	As it was done in commit 8f659a03a0ba ("net: ipv4: fix for a race condition in raw_sendmsg") and commit 20b50d79974e ("net: ipv4: emulate READ_ONCE() on ->hdrincl bit-field in raw_sendmsg()") for ipv4, copy the value of inet->hdrincl in a local variable, to avoid introducing a race condition in the next commit. Signed-off-by: Olivier Matz <olivier.matz@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-06	Revert "gfs2: Replace gl_revokes with a GLF flag"	Bob Peterson	6	-31/+15
	Commit 73118ca8baf7 introduced a glock reference counting bug in gfs2_trans_remove_revoke. Given that, replacing gl_revokes with a GLF flag is no longer useful, so revert that commit. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-06-06	arm64: Silence gcc warnings about arch ABI drift	Dave Martin	1	-0/+1
	Since GCC 9, the compiler warns about evolution of the platform-specific ABI, in particular relating for the marshaling of certain structures involving bitfields. The kernel is a standalone binary, and of course nobody would be so stupid as to expose structs containing bitfields as function arguments in ABI. (Passing a pointer to such a struct, however inadvisable, should be unaffected by this change. perf and various drivers rely on that.) So these warnings do more harm than good: turn them off. We may miss warnings about future ABI drift, but that's too bad. Future ABI breaks of this class will have to be debugged and fixed the traditional way unless the compiler evolves finer-grained diagnostics. Signed-off-by: Dave Martin <Dave.Martin@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com>
2019-06-06	parisc: Fix crash due alternative coding for NP iopdir_fdc bit	Helge Deller	1	-1/+2
	According to the found documentation, data cache flushes and sync instructions are needed on the PCX-U+ (PA8200, e.g. C200/C240) platforms, while PCX-W (PA8500, e.g. C360) platforms aparently don't need those flushes when changing the IO PDIR data structures. We have no documentation for PCX-W+ (PA8600) and PCX-W2 (PA8700) CPUs, but Carlo Pisani reported that his C3600 machine (PA8600, PCX-W+) fails when the fdc instructions were removed. His firmware didn't set the NIOP bit, so one may assume it's a firmware bug since other C3750 machines had the bit set. Even if documentation (as mentioned above) states that PCX-W (PA8500, e.g. J5000) does not need fdc flushes, Sven could show that an Adaptec 29320A PCI-X SCSI controller reliably failed on a dd command during the first five minutes in his J5000 when fdc flushes were missing. Going forward, we will now NOT replace the fdc and sync assembler instructions by NOPS if: a) the NP iopdir_fdc bit was set by firmware, or b) we find a CPU up to and including a PCX-W+ (PA8600). This fixes the HPMC crashes on a C240 and C36XX machines. For other machines we rely on the firmware to set the bit when needed. In case one finds HPMC issues, people could try to boot their machines with the "no-alternatives" kernel option to turn off any alternative patching. Reported-by: Sven Schnelle <svens@stackframe.org> Reported-by: Carlo Pisani <carlojpisani@gmail.com> Tested-by: Sven Schnelle <svens@stackframe.org> Fixes: 3847dab77421 ("parisc: Add alternative coding infrastructure") Signed-off-by: Helge Deller <deller@gmx.de> Cc: stable@vger.kernel.org # 5.0+
2019-06-06	parisc: Use lpa instruction to load physical addresses in driver code	John David Anglin	3	-2/+26
	Most I/O in the kernel is done using the kernel offset mapping. However, there is one API that uses aliased kernel address ranges: > The final category of APIs is for I/O to deliberately aliased address > ranges inside the kernel. Such aliases are set up by use of the > vmap/vmalloc API. Since kernel I/O goes via physical pages, the I/O > subsystem assumes that the user mapping and kernel offset mapping are > the only aliases. This isn't true for vmap aliases, so anything in > the kernel trying to do I/O to vmap areas must manually manage > coherency. It must do this by flushing the vmap range before doing > I/O and invalidating it after the I/O returns. For this reason, we should use the hardware lpa instruction to load the physical address of kernel virtual addresses in the driver code. I believe we only use the vmap/vmalloc API with old PA 1.x processors which don't have a sba, so we don't hit this problem. Tested on c3750, c8000 and rp3440. Signed-off-by: John David Anglin <dave.anglin@bell.net> Signed-off-by: Helge Deller <deller@gmx.de>
2019-06-06	parisc: configs: Remove useless UEVENT_HELPER_PATH	Krzysztof Kozlowski	7	-7/+0
	Remove the CONFIG_UEVENT_HELPER_PATH because: 1. It is disabled since commit 1be01d4a5714 ("driver: base: Disable CONFIG_UEVENT_HELPER by default") as its dependency (UEVENT_HELPER) was made default to 'n', 2. It is not recommended (help message: "This should not be used today [...] creates a high system load") and was kept only for ancient userland, 3. Certain userland specifically requests it to be disabled (systemd README: "Legacy hotplug slows down the system and confuses udev"). Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org> Acked-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Helge Deller <deller@gmx.de>
2019-06-06	parisc: Use implicit space register selection for loading the coherence index of I/O pdirs	John David Anglin	2	-5/+2
	We only support I/O to kernel space. Using %sr1 to load the coherence index may be racy unless interrupts are disabled. This patch changes the code used to load the coherence index to use implicit space register selection. This saves one instruction and eliminates the race. Tested on rp3440, c8000 and c3750. Signed-off-by: John David Anglin <dave.anglin@bell.net> Cc: stable@vger.kernel.org Signed-off-by: Helge Deller <deller@gmx.de>
2019-06-06	ARM64: trivial: s/TIF_SECOMP/TIF_SECCOMP/ comment typo fix	George G. Davis	1	-1/+1
	Fix a s/TIF_SECOMP/TIF_SECCOMP/ comment typo Cc: Jiri Kosina <trivial@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org Signed-off-by: George G. Davis <george_davis@mentor.com> Signed-off-by: Will Deacon <will.deacon@arm.com>
2019-06-05	Revert "fib_rules: return 0 directly if an exactly same rule exists when NLM_F_EXCL not supplied"	Hangbin Liu	1	-3/+3
	This reverts commit e9919a24d3022f72bcadc407e73a6ef17093a849. Nathan reported the new behaviour breaks Android, as Android just add new rules and delete old ones. If we return 0 without adding dup rules, Android will remove the new added rules and causing system to soft-reboot. Fixes: e9919a24d302 ("fib_rules: return 0 directly if an exactly same rule exists when NLM_F_EXCL not supplied") Reported-by: Nathan Chancellor <natechancellor@gmail.com> Reported-by: Yaro Slav <yaro330@gmail.com> Reported-by: Maciej Żenczykowski <zenczykowski@gmail.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: Nathan Chancellor <natechancellor@gmail.com> Tested-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-05	net: aquantia: fix wol configuration not applied sometimes	Nikita Danilov	2	-8/+10
	WoL magic packet configuration sometimes does not work due to couple of leakages found. Mainly there was a regression introduced during readx_poll refactoring. Next, fw request waiting time was too small. Sometimes that caused sleep proxy config function to return with an error and to skip WoL configuration. At last, WoL data were passed to FW from not clean buffer. That could cause FW to accept garbage as a random configuration data. Fixes: 6a7f2277313b ("net: aquantia: replace AQ_HW_WAIT_FOR with readx_poll_timeout_atomic") Signed-off-by: Nikita Danilov <nikita.danilov@aquantia.com> Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-05	ethtool: fix potential userspace buffer overflow	Vivien Didelot	1	-1/+4
	ethtool_get_regs() allocates a buffer of size ops->get_regs_len(), and pass it to the kernel driver via ops->get_regs() for filling. There is no restriction about what the kernel drivers can or cannot do with the open ethtool_regs structure. They usually set regs->version and ignore regs->len or set it to the same size as ops->get_regs_len(). But if userspace allocates a smaller buffer for the registers dump, we would cause a userspace buffer overflow in the final copy_to_user() call, which uses the regs.len value potentially reset by the driver. To fix this, make this case obvious and store regs.len before calling ops->get_regs(), to only copy as much data as requested by userspace, up to the value returned by ops->get_regs_len(). While at it, remove the redundant check for non-null regbuf. Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com> Reviewed-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-05	Fix memory leak in sctp_process_init	Neil Horman	2	-10/+8
	syzbot found the following leak in sctp_process_init BUG: memory leak unreferenced object 0xffff88810ef68400 (size 1024): comm "syz-executor273", pid 7046, jiffies 4294945598 (age 28.770s) hex dump (first 32 bytes): 1d de 28 8d de 0b 1b e3 b5 c2 f9 68 fd 1a 97 25 ..(........h...% 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<00000000a02cebbd>] kmemleak_alloc_recursive include/linux/kmemleak.h:55 [inline] [<00000000a02cebbd>] slab_post_alloc_hook mm/slab.h:439 [inline] [<00000000a02cebbd>] slab_alloc mm/slab.c:3326 [inline] [<00000000a02cebbd>] __do_kmalloc mm/slab.c:3658 [inline] [<00000000a02cebbd>] __kmalloc_track_caller+0x15d/0x2c0 mm/slab.c:3675 [<000000009e6245e6>] kmemdup+0x27/0x60 mm/util.c:119 [<00000000dfdc5d2d>] kmemdup include/linux/string.h:432 [inline] [<00000000dfdc5d2d>] sctp_process_init+0xa7e/0xc20 net/sctp/sm_make_chunk.c:2437 [<00000000b58b62f8>] sctp_cmd_process_init net/sctp/sm_sideeffect.c:682 [inline] [<00000000b58b62f8>] sctp_cmd_interpreter net/sctp/sm_sideeffect.c:1384 [inline] [<00000000b58b62f8>] sctp_side_effects net/sctp/sm_sideeffect.c:1194 [inline] [<00000000b58b62f8>] sctp_do_sm+0xbdc/0x1d60 net/sctp/sm_sideeffect.c:1165 [<0000000044e11f96>] sctp_assoc_bh_rcv+0x13c/0x200 net/sctp/associola.c:1074 [<00000000ec43804d>] sctp_inq_push+0x7f/0xb0 net/sctp/inqueue.c:95 [<00000000726aa954>] sctp_backlog_rcv+0x5e/0x2a0 net/sctp/input.c:354 [<00000000d9e249a8>] sk_backlog_rcv include/net/sock.h:950 [inline] [<00000000d9e249a8>] __release_sock+0xab/0x110 net/core/sock.c:2418 [<00000000acae44fa>] release_sock+0x37/0xd0 net/core/sock.c:2934 [<00000000963cc9ae>] sctp_sendmsg+0x2c0/0x990 net/sctp/socket.c:2122 [<00000000a7fc7565>] inet_sendmsg+0x64/0x120 net/ipv4/af_inet.c:802 [<00000000b732cbd3>] sock_sendmsg_nosec net/socket.c:652 [inline] [<00000000b732cbd3>] sock_sendmsg+0x54/0x70 net/socket.c:671 [<00000000274c57ab>] ___sys_sendmsg+0x393/0x3c0 net/socket.c:2292 [<000000008252aedb>] __sys_sendmsg+0x80/0xf0 net/socket.c:2330 [<00000000f7bf23d1>] __do_sys_sendmsg net/socket.c:2339 [inline] [<00000000f7bf23d1>] __se_sys_sendmsg net/socket.c:2337 [inline] [<00000000f7bf23d1>] __x64_sys_sendmsg+0x23/0x30 net/socket.c:2337 [<00000000a8b4131f>] do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:3 The problem was that the peer.cookie value points to an skb allocated area on the first pass through this function, at which point it is overwritten with a heap allocated value, but in certain cases, where a COOKIE_ECHO chunk is included in the packet, a second pass through sctp_process_init is made, where the cookie value is re-allocated, leaking the first allocation. Fix is to always allocate the cookie value, and free it when we are done using it. Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Reported-by: syzbot+f7e9153b037eac9b1df8@syzkaller.appspotmail.com CC: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> CC: "David S. Miller" <davem@davemloft.net> CC: netdev@vger.kernel.org Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-05	net: rds: fix memory leak when unload rds_rdma	Zhu Yanjun	2	-1/+4
	When KASAN is enabled, after several rds connections are created, then "rmmod rds_rdma" is run. The following will appear. " BUG rds_ib_incoming (Not tainted): Objects remaining in rds_ib_incoming on __kmem_cache_shutdown() Call Trace: dump_stack+0x71/0xab slab_err+0xad/0xd0 __kmem_cache_shutdown+0x17d/0x370 shutdown_cache+0x17/0x130 kmem_cache_destroy+0x1df/0x210 rds_ib_recv_exit+0x11/0x20 [rds_rdma] rds_ib_exit+0x7a/0x90 [rds_rdma] __x64_sys_delete_module+0x224/0x2c0 ? __ia32_sys_delete_module+0x2c0/0x2c0 do_syscall_64+0x73/0x190 entry_SYSCALL_64_after_hwframe+0x44/0xa9 " This is rds connection memory leak. The root cause is: When "rmmod rds_rdma" is run, rds_ib_remove_one will call rds_ib_dev_shutdown to drop the rds connections. rds_ib_dev_shutdown will call rds_conn_drop to drop rds connections as below. " rds_conn_path_drop(&conn->c_path[0], false); " In the above, destroy is set to false. void rds_conn_path_drop(struct rds_conn_path cp, bool destroy) { atomic_set(&cp->cp_state, RDS_CONN_ERROR); rcu_read_lock(); if (!destroy && rds_destroy_pending(cp->cp_conn)) { rcu_read_unlock(); return; } queue_work(rds_wq, &cp->cp_down_w); rcu_read_unlock(); } In the above function, destroy is set to false. rds_destroy_pending is called. This does not move rds connections to ib_nodev_conns. So destroy is set to true to move rds connections to ib_nodev_conns. In rds_ib_unregister_client, flush_workqueue is called to make rds_wq finsh shutdown rds connections. The function rds_ib_destroy_nodev_conns is called to shutdown rds connections finally. Then rds_ib_recv_exit is called to destroy slab. void rds_ib_recv_exit(void) { kmem_cache_destroy(rds_ib_incoming_slab); kmem_cache_destroy(rds_ib_frag_slab); } The above slab memory leak will not occur again. >From tests, 256 rds connections [root@ca-dev14 ~]# time rmmod rds_rdma real 0m16.522s user 0m0.000s sys 0m8.152s 512 rds connections [root@ca-dev14 ~]# time rmmod rds_rdma real 0m32.054s user 0m0.000s sys 0m15.568s To rmmod rds_rdma with 256 rds connections, about 16 seconds are needed. And with 512 rds connections, about 32 seconds are needed. >From ftrace, when one rds connection is destroyed, " 19) \| rds_conn_destroy [rds]() { 19) 7.782 us \| rds_conn_path_drop [rds](); 15) \| rds_shutdown_worker [rds]() { 15) \| rds_conn_shutdown [rds]() { 15) 1.651 us \| rds_send_path_reset [rds](); 15) 7.195 us \| } 15) + 11.434 us \| } 19) 2.285 us \| rds_cong_remove_conn [rds](); 19) 24062.76 us \| } " So if many rds connections will be destroyed, this function rds_ib_destroy_nodev_conns uses most of time. Suggested-by: Håkon Bugge <haakon.bugge@oracle.com> Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-05	ipv6: fix the check before getting the cookie in rt6_get_cookie	Xin Long	1	-2/+1
	In Jianlin's testing, netperf was broken with 'Connection reset by peer', as the cookie check failed in rt6_check() and ip6_dst_check() always returned NULL. It's caused by Commit 93531c674315 ("net/ipv6: separate handling of FIB entries from dst based routes"), where the cookie can be got only when 'c1'(see below) for setting dst_cookie whereas rt6_check() is called when !'c1' for checking dst_cookie, as we can see in ip6_dst_check(). Since in ip6_dst_check() both rt6_dst_from_check() (c1) and rt6_check() (!c1) will check the 'from' cookie, this patch is to remove the c1 check in rt6_get_cookie(), so that the dst_cookie can always be set properly. c1: (rt->rt6i_flags & RTF_PCPU \|\| unlikely(!list_empty(&rt->rt6i_uncached))) Fixes: 93531c674315 ("net/ipv6: separate handling of FIB entries from dst based routes") Reported-by: Jianlin Shi <jishi@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-05	ipv4: not do cache for local delivery if bc_forwarding is enabled	Xin Long	1	-12/+12
	With the topo: h1 ---\| rp1 \| \| route rp3 \|--- h3 (192.168.200.1) h2 ---\| rp2 \| If rp1 bc_forwarding is set while rp2 bc_forwarding is not, after doing "ping 192.168.200.255" on h1, then ping 192.168.200.255 on h2, and the packets can still be forwared. This issue was caused by the input route cache. It should only do the cache for either bc forwarding or local delivery. Otherwise, local delivery can use the route cache for bc forwarding of other interfaces. This patch is to fix it by not doing cache for local delivery if all.bc_forwarding is enabled. Note that we don't fix it by checking route cache local flag after rt_cache_valid() in "local_input:" and "ip_mkroute_input", as the common route code shouldn't be touched for bc_forwarding. Fixes: 5cbf777cfdf6 ("route: add support for directed broadcast forwarding") Reported-by: Jianlin Shi <jishi@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-05	tools: bpftool: Fix JSON output when lookup fails	Krzesimir Nowak	1	-0/+2
	In commit 9a5ab8bf1d6d ("tools: bpftool: turn err() and info() macros into functions") one case of error reporting was special cased, so it could report a lookup error for a specific key when dumping the map element. What the code forgot to do is to wrap the key and value keys into a JSON object, so an example output of pretty JSON dump of a sockhash map (which does not support looking up its values) is: [ "key": ["0x0a","0x41","0x00","0x02","0x1f","0x78","0x00","0x00" ], "value": { "error": "Operation not supported" }, "key": ["0x0a","0x41","0x00","0x02","0x1f","0x78","0x00","0x01" ], "value": { "error": "Operation not supported" } ] Note the key-value pairs inside the toplevel array. They should be wrapped inside a JSON object, otherwise it is an invalid JSON. This commit fixes this, so the output now is: [{ "key": ["0x0a","0x41","0x00","0x02","0x1f","0x78","0x00","0x00" ], "value": { "error": "Operation not supported" } },{ "key": ["0x0a","0x41","0x00","0x02","0x1f","0x78","0x00","0x01" ], "value": { "error": "Operation not supported" } } ] Fixes: 9a5ab8bf1d6d ("tools: bpftool: turn err() and info() macros into functions") Cc: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Krzesimir Nowak <krzesimir@kinvolk.io> Acked-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>