Age | Commit message (Collapse) | Author | Files | Lines |
|
smc_ism_register_dmb() returns error codes set by the ISM driver which
are not guaranteed to be negative or in the errno range. Such values
would not be handled by ERR_PTR() and finally the return code will be
used as a memory address.
Fix that by using a valid negative errno value with ERR_PTR().
Fixes: 72b7f6c48708 ("net/smc: unique reason code for exceeded max dmb count")
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The SMCD_DMBE_SIZES should include all valid DMBE buffer sizes, so the
correct value is 6 which means 1MB. With 7 the registration of an ISM
buffer would always fail because of the invalid size requested.
Fix that and set the value to 6.
Fixes: c6ba7c9ba43d ("net/smc: add base infrastructure for SMC-D and ISM")
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When a delayed event is enqueued then the event worker will send this
event the next time it is running and no other flow is currently
active. The event handler is called for the delayed event, and the
pointer to the event keeps set in lgr->delayed_event. This pointer is
cleared later in the processing by smc_llc_flow_start().
This can lead to a use-after-free condition when the processing does not
reach smc_llc_flow_start(), but frees the event because of an error
situation. Then the delayed_event pointer is still set but the event is
freed.
Fix this by always clearing the delayed event pointer when the event is
provided to the event handler for processing, and remove the code to
clear it in smc_llc_flow_start().
Fixes: 555da9af827d ("net/smc: add event-based llc_flow framework")
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
IF CONFIG_BPFILTER_UMH is set, building fails:
In file included from /usr/include/sys/socket.h:33:0,
from net/bpfilter/main.c:6:
/usr/include/bits/socket.h:390:10: fatal error: asm/socket.h: No such file or directory
#include <asm/socket.h>
^~~~~~~~~~~~~~
compilation terminated.
scripts/Makefile.userprogs:43: recipe for target 'net/bpfilter/main.o' failed
make[2]: *** [net/bpfilter/main.o] Error 1
Add missing include path to fix this.
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The access of tcf_tunnel_info() produces the following splat, so fix it
by dereferencing the tcf_tunnel_key_params pointer with marker that
internal tcfa_liock is held.
=============================
WARNING: suspicious RCU usage
5.9.0+ #1 Not tainted
-----------------------------
include/net/tc_act/tc_tunnel_key.h:59 suspicious rcu_dereference_protected() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
1 lock held by tc/34839:
#0: ffff88828572c2a0 (&p->tcfa_lock){+...}-{2:2}, at: tc_setup_flow_action+0xb3/0x48b5
stack backtrace:
CPU: 1 PID: 34839 Comm: tc Not tainted 5.9.0+ #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x9a/0xd0
tc_setup_flow_action+0x14cb/0x48b5
fl_hw_replace_filter+0x347/0x690 [cls_flower]
fl_change+0x2bad/0x4875 [cls_flower]
tc_new_tfilter+0xf6f/0x1ba0
rtnetlink_rcv_msg+0x5f2/0x870
netlink_rcv_skb+0x124/0x350
netlink_unicast+0x433/0x700
netlink_sendmsg+0x6f1/0xbd0
sock_sendmsg+0xb0/0xe0
____sys_sendmsg+0x4fa/0x6d0
___sys_sendmsg+0x12e/0x1b0
__sys_sendmsg+0xa4/0x120
do_syscall_64+0x2d/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f1f8cd4fe57
Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
RSP: 002b:00007ffdc1e193b8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1f8cd4fe57
RDX: 0000000000000000 RSI: 00007ffdc1e19420 RDI: 0000000000000003
RBP: 000000005f85aafa R08: 0000000000000001 R09: 00007ffdc1e1936c
R10: 000000000040522d R11: 0000000000000246 R12: 0000000000000001
R13: 0000000000000000 R14: 00007ffdc1e1d6f0 R15: 0000000000482420
Fixes: 3ebaf6da0716 ("net: sched: Do not assume RTNL is held in tunnel key action helpers")
Fixes: 7a47281439ba ("net: sched: lock action when translating it to flow_action infra")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Ingress large send packets are identified by either:
The IBMVETH_RXQ_LRG_PKT flag in the receive buffer
or with a -1 placed in the ip header checksum.
The method used depends on firmware version. Frame
geometry and sufficient header validation is performed by the
hypervisor eliminating the need for further header checks here.
Fixes: 7b5967389f5a ("ibmveth: set correct gso_size and gso_type")
Signed-off-by: David Wilder <dwilder@us.ibm.com>
Reviewed-by: Thomas Falcon <tlfalcon@linux.ibm.com>
Reviewed-by: Cristobal Forno <cris.forno@ibm.com>
Reviewed-by: Pradeep Satyanarayana <pradeeps@linux.vnet.ibm.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
ibmveth_rx_csum_helper() must be called after ibmveth_rx_mss_helper()
as ibmveth_rx_csum_helper() may alter ip and tcp checksum values.
Fixes: 66aa0678efc2 ("ibmveth: Support to enable LSO/CSO for Trunk VEA.")
Signed-off-by: David Wilder <dwilder@us.ibm.com>
Reviewed-by: Thomas Falcon <tlfalcon@linux.ibm.com>
Reviewed-by: Cristobal Forno <cris.forno@ibm.com>
Reviewed-by: Pradeep Satyanarayana <pradeeps@linux.vnet.ibm.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The 4-tuple NAT offload via PEDIT always overwrites all the 4-tuple
fields even if they had not been explicitly enabled. If any fields in
the 4-tuple are not enabled, then the hardware overwrites the
disabled fields with zeros, instead of ignoring them.
So, add a parser that can translate the enabled 4-tuple PEDIT fields
to one of the NAT mode combinations supported by the hardware and
hence avoid overwriting disabled fields to 0. Any rule with
unsupported NAT mode combination is rejected.
Signed-off-by: Herat Ramani <herat@chelsio.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The objective of the tests is to check that ICMP errors generated while
crossing between VRFs are properly routed back to the source host.
The first ttl test sends a ping with a ttl of 1 from h1 to h2 and parses the
output of the command to check that a ttl expired error is received.
The second ttl test runs traceroute from h1 to h2 and parses the output to
check for a hop on r1.
The mtu test sends a ping with a payload of 1450 from h1 to h2, through
r1 which has an interface with a mtu of 1400 and parses the output of the
command to check that a fragmentation needed error is received.
[ The IPv6 MTU test still fails with the symmetric routing setup. It
appears to be caused by source address selection picking ::1. Fixing
this is beyond the scope of this series. ]
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
As per RFC4443, the destination address field for ICMPv6 error messages
is copied from the source address field of the invoking packet.
In configurations with Virtual Routing and Forwarding tables, looking up
which routing table to use for sending ICMPv6 error messages is
currently done by using the destination net_device.
If the source and destination interfaces are within separate VRFs, or
one in the global routing table and the other in a VRF, looking up the
source address of the invoking packet in the destination interface's
routing table will fail if the destination interface's routing table
contains no route to the invoking packet's source address.
One observable effect of this issue is that traceroute6 does not work in
the following cases:
- Route leaking between global routing table and VRF
- Route leaking between VRFs
Use the source device routing table when sending ICMPv6 error
messages.
[ In the context of ipv4, it has been pointed out that a similar issue
may exist with ICMP errors triggered when forwarding between network
namespaces. It would be worthwhile to investigate whether ipv6 has
similar issues, but is outside of the scope of this investigation. ]
[ Testing shows that similar issues exist with ipv6 unreachable /
fragmentation needed messages. However, investigation of this
additional failure mode is beyond this investigation's scope. ]
Link: https://tools.ietf.org/html/rfc4443
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
As per RFC792, ICMP errors should be sent to the source host.
However, in configurations with Virtual Routing and Forwarding tables,
looking up which routing table to use is currently done by using the
destination net_device.
commit 9d1a6c4ea43e ("net: icmp_route_lookup should use rt dev to
determine L3 domain") changes the interface passed to
l3mdev_master_ifindex() and inet_addr_type_dev_table() from skb_in->dev
to skb_dst(skb_in)->dev. This effectively uses the destination device
rather than the source device for choosing which routing table should be
used to lookup where to send the ICMP error.
Therefore, if the source and destination interfaces are within separate
VRFs, or one in the global routing table and the other in a VRF, looking
up the source host in the destination interface's routing table will
fail if the destination interface's routing table contains no route to
the source host.
One observable effect of this issue is that traceroute does not work in
the following cases:
- Route leaking between global routing table and VRF
- Route leaking between VRFs
Preferably use the source device routing table when sending ICMP error
messages. If no source device is set, fall-back on the destination
device routing table. Else, use the main routing table (index 0).
[ It has been pointed out that a similar issue may exist with ICMP
errors triggered when forwarding between network namespaces. It would
be worthwhile to investigate, but is outside of the scope of this
investigation. ]
[ It has also been pointed out that a similar issue exists with
unreachable / fragmentation needed messages, which can be triggered by
changing the MTU of eth1 in r1 to 1400 and running:
ip netns exec h1 ping -s 1450 -Mdo -c1 172.16.2.2
Some investigation points to raw_icmp_error() and raw_err() as being
involved in this last scenario. The focus of this patch is TTL expired
ICMP messages, which go through icmp_route_lookup.
Investigation of failure modes related to raw_icmp_error() is beyond
this investigation's scope. ]
Fixes: 9d1a6c4ea43e ("net: icmp_route_lookup should use rt dev to determine L3 domain")
Link: https://tools.ietf.org/html/rfc792
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
GRE tunnel has its own header_ops, ipgre_header_ops, and sets it
conditionally. When it is set, it assumes the outer IP header is
already created before ipgre_xmit().
This is not true when we send packets through a raw packet socket,
where L2 headers are supposed to be constructed by user. Packet
socket calls dev_validate_header() to validate the header. But
GRE tunnel does not set dev->hard_header_len, so that check can
be simply bypassed, therefore uninit memory could be passed down
to ipgre_xmit(). Similar for dev->needed_headroom.
dev->hard_header_len is supposed to be the length of the header
created by dev->header_ops->create(), so it should be used whenever
header_ops is set, and dev->needed_headroom should be used when it
is not set.
Reported-and-tested-by: syzbot+4a2c52677a8a1aa283cb@syzkaller.appspotmail.com
Cc: William Tu <u9012063@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Xie He <xie.he.0141@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
SOCK_TSTAMP_NEW (timespec64 instead of timespec) is also used for
hardware time stamps (configured via SO_TIMESTAMPING_NEW).
User space (ptp4l) first configures hardware time stamping via
SO_TIMESTAMPING_NEW which sets SOCK_TSTAMP_NEW. In the next step, ptp4l
disables SO_TIMESTAMPNS(_NEW) (software time stamps), but this must not
switch hardware time stamps back to "32 bit mode".
This problem happens on 32 bit platforms were the libc has already
switched to struct timespec64 (from SO_TIMExxx_OLD to SO_TIMExxx_NEW
socket options). ptp4l complains with "missing timestamp on transmitted
peer delay request" because the wrong format is received (and
discarded).
Fixes: 887feae36aee ("socket: Add SO_TIMESTAMP[NS]_NEW")
Fixes: 783da70e8396 ("net: add sock_enable_timestamps")
Signed-off-by: Christian Eggers <ceggers@arri.de>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The comparison of optname with SO_TIMESTAMPING_NEW is wrong way around,
so SOCK_TSTAMP_NEW will first be set and than reset again. Additionally
move it out of the test for SOF_TIMESTAMPING_RX_SOFTWARE as this seems
unrelated.
This problem happens on 32 bit platforms were the libc has already
switched to struct timespec64 (from SO_TIMExxx_OLD to SO_TIMExxx_NEW
socket options). ptp4l complains with "missing timestamp on transmitted
peer delay request" because the wrong format is received (and
discarded).
Fixes: 9718475e6908 ("socket: Add SO_TIMESTAMPING_NEW")
Signed-off-by: Christian Eggers <ceggers@arri.de>
Reviewed-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Reviewed-by: Deepa Dinamani <deepa.kernel@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Dump vlan tag and proto for the usual vlan offload case if the
NF_LOG_MACDECODE flag is set on. Without this information the logging is
misleading as there is no reference to the VLAN header.
[12716.993704] test: IN=veth0 OUT= MACSRC=86:6c:92:ea:d6:73 MACDST=0e:3b:eb:86:73:76 VPROTO=8100 VID=10 MACPROTO=0800 SRC=192.168.10.2 DST=172.217.168.163 LEN=52 TOS=0x00 PREC=0x00 TTL=64 ID=2548 DF PROTO=TCP SPT=55848 DPT=80 WINDOW=501 RES=0x00 ACK FIN URGP=0
[12721.157643] test: IN=veth0 OUT= MACSRC=86:6c:92:ea:d6:73 MACDST=0e:3b:eb:86:73:76 VPROTO=8100 VID=10 MACPROTO=0806 ARP HTYPE=1 PTYPE=0x0800 OPCODE=2 MACSRC=86:6c:92:ea:d6:73 IPSRC=192.168.10.2 MACDST=0e:3b:eb:86:73:76 IPDST=192.168.10.1
Fixes: 83e96d443b37 ("netfilter: log: split family specific code to nf_log_{ip,ip6,common}.c files")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
With the introduction of netif_set_xps_queue, XPS can be enabled
by the driver at initialization.
Update the documentation to reflect this, as otherwise users
may incorrectly believe that the feature is off by default.
Fixes: 537c00de1c9b ("net: Add functions netif_reset_xps_queue and netif_set_xps_queue")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The phy_reset_after_clk_enable() is always called with ndev->phydev,
however that pointer may be NULL even though the PHY device instance
already exists and is sufficient to perform the PHY reset.
This condition happens in fec_open(), where the clock must be enabled
first, then the PHY must be reset, and then the PHY IDs can be read
out of the PHY.
If the PHY still is not bound to the MAC, but there is OF PHY node
and a matching PHY device instance already, use the OF PHY node to
obtain the PHY device instance, and then use that PHY device instance
when triggering the PHY reset.
Fixes: 1b0a83ac04e3 ("net: fec: add phy_reset_after_clk_enable() support")
Signed-off-by: Marek Vasut <marex@denx.de>
Cc: Christoph Niedermaier <cniedermaier@dh-electronics.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: NXP Linux Team <linux-imx@nxp.com>
Cc: Richard Leitner <richard.leitner@skidata.com>
Cc: Shawn Guo <shawnguo@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
netcons calls napi_poll with a budget of 0 to transmit packets.
Handle this by:
- skipping RX processing
- do not try to recycle TX packets to the RX cache
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
kmalloc returns KSEG0 addresses so convert back from KSEG1
in kfree. Also make sure array is freed when the driver is
unloaded from the kernel.
Fixes: ef11291bcd5f ("Add support the Korina (IDT RC32434) Ethernet MAC")
Signed-off-by: Valentin Vidic <vvidic@valentin-vidic.from.hr>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Between queuing the delayed work and finishing the setup of the dsa
ports, the process may sleep in request_module() (via
phy_device_create()) and the queued work may be executed prior to the
switch net devices being registered. In ksz_mib_read_work(), a NULL
dereference will happen within netof_carrier_ok(dp->slave).
Not queuing the delayed work in ksz_init_mib_timer() makes things even
worse because the work will now be queued for immediate execution
(instead of 2000 ms) in ksz_mac_link_down() via
dsa_port_link_register_of().
Call tree:
ksz9477_i2c_probe()
\--ksz9477_switch_register()
\--ksz_switch_register()
+--dsa_register_switch()
| \--dsa_switch_probe()
| \--dsa_tree_setup()
| \--dsa_tree_setup_switches()
| +--dsa_switch_setup()
| | +--ksz9477_setup()
| | | \--ksz_init_mib_timer()
| | | |--/* Start the timer 2 seconds later. */
| | | \--schedule_delayed_work(&dev->mib_read, msecs_to_jiffies(2000));
| | \--__mdiobus_register()
| | \--mdiobus_scan()
| | \--get_phy_device()
| | +--get_phy_id()
| | \--phy_device_create()
| | |--/* sleeping, ksz_mib_read_work() can be called meanwhile */
| | \--request_module()
| |
| \--dsa_port_setup()
| +--/* Called for non-CPU ports */
| +--dsa_slave_create()
| | +--/* Too late, ksz_mib_read_work() may be called beforehand */
| | \--port->slave = ...
| ...
| +--Called for CPU port */
| \--dsa_port_link_register_of()
| \--ksz_mac_link_down()
| +--/* mib_read must be initialized here */
| +--/* work is already scheduled, so it will be executed after 2000 ms */
| \--schedule_delayed_work(&dev->mib_read, 0);
\-- /* here port->slave is setup properly, scheduling the delayed work should be safe */
Solution:
1. Do not queue (only initialize) delayed work in ksz_init_mib_timer().
2. Only queue delayed work in ksz_mac_link_down() if init is completed.
3. Queue work once in ksz_switch_register(), after dsa_register_switch()
has completed.
Fixes: 7c6ff470aa86 ("net: dsa: microchip: add MIB counter reading support")
Signed-off-by: Christian Eggers <ceggers@arri.de>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The initial support for netlink extended ACK is missing the chain update
path, which results in misleading error reporting in case of EEXIST.
Fixes 36dd1bcc07e5 ("netfilter: nf_tables: initial support for extended ACK reporting")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
fq qdisc requires tstamp to be cleared in forwarding path
Reported-by: Evgeny B <abt-admin@mail.ru>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=209427
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Fixes: 8203e2d844d3 ("net: clear skb->tstamp in forwarding paths")
Fixes: fb420d5d91c1 ("tcp/fq: move back to CLOCK_MONOTONIC")
Fixes: 80b14dee2bea ("net: Add a new socket option for a future transmit time.")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Reviewed-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
add a test with re-queueing: usespace doesn't pass accept verdict,
but tells to re-queue to another nf_queue instance.
Also, make the second nf-queue program use non-gso mode, kernel will
have to perform software segmentation.
Lastly, do not queue every packet, just one per second, and add delay
when re-injecting the packet to the kernel.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Tobias reported regressions in IPsec tests following the patch
referenced by the Fixes tag below. The root cause is dropping the
reset of the flowi4_oif after the fib_lookup. Apparently it is
needed for xfrm cases, so restore the oif update to ip_route_output_flow
right before the call to xfrm_lookup_route.
Fixes: 2fbc6e89b2f1 ("ipv4: Update exception handling for multipath routes via same device")
Reported-by: Tobias Brunner <tobias@strongswan.org>
Signed-off-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The msk can close MP_JOIN subflows if the initial handshake
fails. Currently such subflows are kept alive in the
conn_list until the msk itself is closed.
Beyond the wasted memory, we could end-up sending the
DATA_FIN and the DATA_FIN ack on such socket, even after a
reset.
Fixes: 43b54c6ee382 ("mptcp: Use full MPTCP-level disconnect state machine")
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Additional/MP_JOIN subflows that do not pass some initial handshake
tests currently causes fallback to TCP. That is an RFC violation:
we should instead reset the subflow and leave the the msk untouched.
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/91
Fixes: f296234c98a8 ("mptcp: Add handling of incoming MP_JOIN requests")
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
For older versions of gcc, the array = {0}; will cause warnings:
net/smc/smc_llc.c: In function 'smc_llc_add_link_local':
net/smc/smc_llc.c:1212:9: warning: missing braces around initializer [-Wmissing-braces]
struct smc_llc_msg_add_link add_llc = {0};
^
net/smc/smc_llc.c:1212:9: warning: (near initialization for 'add_llc.hd') [-Wmissing-braces]
net/smc/smc_llc.c: In function 'smc_llc_srv_delete_link_local':
net/smc/smc_llc.c:1245:9: warning: missing braces around initializer [-Wmissing-braces]
struct smc_llc_msg_del_link del_llc = {0};
^
net/smc/smc_llc.c:1245:9: warning: (near initialization for 'del_llc.hd') [-Wmissing-braces]
2 warnings generated
Fixes: 4dadd151b265 ("net/smc: enqueue local LLC messages")
Signed-off-by: Pujin Shi <shipujin.t@gmail.com>
Acked-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
For older versions of gcc, the array = {0}; will cause warnings:
net/smc/smc_llc.c: In function 'smc_llc_send_link_delete_all':
net/smc/smc_llc.c:1317:9: warning: missing braces around initializer [-Wmissing-braces]
struct smc_llc_msg_del_link delllc = {0};
^
net/smc/smc_llc.c:1317:9: warning: (near initialization for 'delllc.hd') [-Wmissing-braces]
1 warnings generated
Fixes: f3811fd7bc97 ("net/smc: send DELETE_LINK, ALL message and wait for send to complete")
Signed-off-by: Pujin Shi <shipujin.t@gmail.com>
Acked-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In the function node_lost_contact(), we call __skb_queue_purge() without
grabbing the list->lock. This can cause to a race-condition why processing
the list 'namedq' in calling path tipc_named_rcv()->tipc_named_dequeue().
[] BUG: kernel NULL pointer dereference, address: 0000000000000000
[] #PF: supervisor read access in kernel mode
[] #PF: error_code(0x0000) - not-present page
[] PGD 7ca63067 P4D 7ca63067 PUD 6c553067 PMD 0
[] Oops: 0000 [#1] SMP NOPTI
[] CPU: 1 PID: 15 Comm: ksoftirqd/1 Tainted: G O 5.9.0-rc6+ #2
[] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS [...]
[] RIP: 0010:tipc_named_rcv+0x103/0x320 [tipc]
[] Code: 41 89 44 24 10 49 8b 16 49 8b 46 08 49 c7 06 00 00 00 [...]
[] RSP: 0018:ffffc900000a7c58 EFLAGS: 00000282
[] RAX: 00000000000012ec RBX: 0000000000000000 RCX: ffff88807bde1270
[] RDX: 0000000000002c7c RSI: 0000000000002c7c RDI: ffff88807b38f1a8
[] RBP: ffff88807b006288 R08: ffff88806a367800 R09: ffff88806a367900
[] R10: ffff88806a367a00 R11: ffff88806a367b00 R12: ffff88807b006258
[] R13: ffff88807b00628a R14: ffff888069334d00 R15: ffff88806a434600
[] FS: 0000000000000000(0000) GS:ffff888079480000(0000) knlGS:0[...]
[] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[] CR2: 0000000000000000 CR3: 0000000077320000 CR4: 00000000000006e0
[] Call Trace:
[] ? tipc_bcast_rcv+0x9a/0x1a0 [tipc]
[] tipc_rcv+0x40d/0x670 [tipc]
[] ? _raw_spin_unlock+0xa/0x20
[] tipc_l2_rcv_msg+0x55/0x80 [tipc]
[] __netif_receive_skb_one_core+0x8c/0xa0
[] process_backlog+0x98/0x140
[] net_rx_action+0x13a/0x420
[] __do_softirq+0xdb/0x316
[] ? smpboot_thread_fn+0x2f/0x1e0
[] ? smpboot_thread_fn+0x74/0x1e0
[] ? smpboot_thread_fn+0x14e/0x1e0
[] run_ksoftirqd+0x1a/0x40
[] smpboot_thread_fn+0x149/0x1e0
[] ? sort_range+0x20/0x20
[] kthread+0x131/0x150
[] ? kthread_unuse_mm+0xa0/0xa0
[] ret_from_fork+0x22/0x30
[] Modules linked in: veth tipc(O) ip6_udp_tunnel udp_tunnel [...]
[] CR2: 0000000000000000
[] ---[ end trace 65c276a8e2e2f310 ]---
To fix this, we need to grab the lock of the 'namedq' list on both
path calling.
Fixes: cad2929dc432 ("tipc: update a binding service via broadcast")
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Hoang Huu Le <hoang.h.le@dektech.com.au>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
skb_unshare() drops a reference count on the old skb unconditionally,
so in the failure case, we end up freeing the skb twice here.
And because the skb is allocated in fclone and cloned by caller
tipc_msg_reassemble(), the consequence is actually freeing the
original skb too, thus triggered the UAF by syzbot.
Fix this by replacing this skb_unshare() with skb_cloned()+skb_copy().
Fixes: ff48b6222e65 ("tipc: use skb_unshare() instead in tipc_buf_append()")
Reported-and-tested-by: syzbot+e96a7ba46281824cc46a@syzkaller.appspotmail.com
Cc: Jon Maloy <jmaloy@redhat.com>
Cc: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Remove one of the two instances of the function prototype for
tls_validate_xmit_skb().
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Boris Pismenny <borisp@nvidia.com>
Cc: Aviad Yehezkel <aviadye@nvidia.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
At first when sendpage gets called, if there is more data, 'more' in
tls_push_data() gets set which later sets pending_open_record_frags, but
when there is no more data in file left, and last time tls_push_data()
gets called, pending_open_record_frags doesn't get reset. And later when
2 bytes of encrypted alert comes as sendmsg, it first checks for
pending_open_record_frags, and since this is set, it creates a record with
0 data bytes to encrypt, meaning record length is prepend_size + tag_size
only, which causes problem.
We should set/reset pending_open_record_frags based on more bit.
Fixes: e8f69799810c ("net/tls: Add generic NIC offload infrastructure")
Signed-off-by: Rohit Maheshwari <rohitm@chelsio.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Fix comment typo.
s/abitrary/arbitrary/
Signed-off-by: Naoki Hayama <naoki.hayama@lineo.co.jp>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When a ICMPV6_PKT_TOOBIG report a next-hop MTU that is less than the IPv6
minimum link MTU, the estimated path MTU is reduced to the minimum link
MTU. This behaviour breaks TAHI IPv6 Core Conformance Test v6LC4.1.6:
Packet Too Big Less than IPv6 MTU.
Referring to RFC 8201 section 4: "If a node receives a Packet Too Big
message reporting a next-hop MTU that is less than the IPv6 minimum link
MTU, it must discard it. A node must not reduce its estimate of the Path
MTU below the IPv6 minimum link MTU on receipt of a Packet Too Big
message."
Drop the path MTU update if reported MTU is less than the minimum link MTU.
Signed-off-by: Georg Kohmann <geokohma@cisco.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When processing a system suspend request we suspend modem endpoints
if they are enabled, and call ipa_cmd_tag_process() (which issues
IPA commands) to ensure the IPA pipeline is cleared. It is an error
to attempt to issue an IPA command before setup is complete, so this
is clearly a bug. But we also shouldn't suspend or resume any
endpoints that have not been set up.
Have ipa_endpoint_suspend() and ipa_endpoint_resume() immediately
return if setup hasn't completed, to avoid any attempt to configure
endpoints or issue IPA commands in that case.
Fixes: 84f9bd12d46d ("soc: qcom: ipa: IPA endpoints")
Tested-by: Matthias Kaehlcke <mka@chromium.org>
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
using packetdrill it's possible to observe the same MPTCP DSN being acked
by different subflows with DACK4 and DACK8. This is in contrast with what
specified in RFC8684 §3.3.2: if an MPTCP endpoint transmits a 64-bit wide
DSN, it MUST be acknowledged with a 64-bit wide DACK. Fix 'use_64bit_ack'
variable to make it a property of MPTCP sockets, not TCP subflows.
Fixes: a0c1d0eafd1e ("mptcp: Use 32-bit DATA_ACK when possible")
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The phy_reset_after_clk_enable() does a PHY reset, which means the PHY
loses its register settings. The fec_enet_mii_probe() starts the PHY
and does the necessary calls to configure the PHY via PHY framework,
and loads the correct register settings into the PHY. Therefore,
fec_enet_mii_probe() should be called only after the PHY has been
reset, not before as it is now.
Fixes: 1b0a83ac04e3 ("net: fec: add phy_reset_after_clk_enable() support")
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Tested-by: Richard Leitner <richard.leitner@skidata.com>
Signed-off-by: Marek Vasut <marex@denx.de>
Cc: Christoph Niedermaier <cniedermaier@dh-electronics.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: NXP Linux Team <linux-imx@nxp.com>
Cc: Shawn Guo <shawnguo@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This patch add the initialization of skbcnt, similar to:
e009f95b1543 can: j1935: j1939_tp_tx_dat_new(): fix missing initialization of skbcnt
Let's play save and initialize this skbcnt as well.
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Fixes: 9d71dd0c7009 ("can: add support of SAE J1939 protocol")
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
|
|
This fixes an uninit-value warning:
BUG: KMSAN: uninit-value in can_receive+0x26b/0x630 net/can/af_can.c:650
Reported-and-tested-by: syzbot+3f3837e61a48d32b495f@syzkaller.appspotmail.com
Fixes: 9d71dd0c7009 ("can: add support of SAE J1939 protocol")
Cc: Robin van der Gracht <robin@protonic.nl>
Cc: Oleksij Rempel <linux@rempel-privat.de>
Cc: Pengutronix Kernel Team <kernel@pengutronix.de>
Cc: Oliver Hartkopp <socketcan@hartkopp.net>
Cc: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Link: https://lore.kernel.org/r/20201008061821.24663-1-xiyou.wangcong@gmail.com
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
|
|
Remove propmt for selecting MLX5_VDPA by the user and modify
MLX5_VDPA_NET to select MLX5_VDPA. Also modify MLX5_VDPA_NET to depend
on mlx5_core.
This fixes an issue where configuration sets 'y' for MLX5_VDPA_NET while
MLX5_CORE is compiled as a module causing link errors.
Reported-by: kernel test robot <lkp@intel.com>
Fixes: 1a86b377aa21 ("vdpa/mlx5: Add VDPA driver for supported mlx5 device")s
Signed-off-by: Eli Cohen <elic@nvidia.com>
Link: https://lore.kernel.org/r/20201007064011.GA50074@mtl-vdi-166.wap.labs.mlnx
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
A VM with mlx5 vDPA has below warnings while being reset:
vhost VQ 0 ring restore failed: -1: Resource temporarily unavailable (11)
vhost VQ 1 ring restore failed: -1: Resource temporarily unavailable (11)
We should allow userspace emulating the virtio device be
able to get to vq's avail_index, regardless of vDPA device
status. Save the index that was last seen when virtq was
stopped, so that userspace doesn't complain.
Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
Link: https://lore.kernel.org/r/1601583511-15138-1-git-send-email-si-wei.liu@oracle.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Eli Cohen <elic@nvidia.com>
|
|
Add usb ids of the Cellient MPL200 card.
Signed-off-by: Wilken Gottwalt <wilken.gottwalt@mailbox.org>
Acked-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
De-referencing skb after call to gro_cells_receive() is not allowed.
We need to fetch skb->len earlier.
Fixes: 5491e7c6b1a9 ("macsec: enable GRO and RPS on macsec devices")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Some firmware files trigger a PHY soft reset and don't wait for it to
be finished. PHY register writes directly after applying the firmware
may fail or provide unexpected results therefore. Fix this by waiting
for bit BMCR_RESET to be cleared after applying firmware.
There's nothing wrong with the referenced change, it's just that the
fix will apply cleanly only after this change.
Fixes: 89fbd26cca7e ("r8169: fix firmware not resetting tp->ocp_base")
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
With multiple DNAT rules it's possible that after destination
translation the resulting tuples collide.
For example, two openvswitch flows:
nw_dst=10.0.0.10,tp_dst=10, actions=ct(commit,table=2,nat(dst=20.0.0.1:20))
nw_dst=10.0.0.20,tp_dst=10, actions=ct(commit,table=2,nat(dst=20.0.0.1:20))
Assuming two TCP clients initiating the following connections:
10.0.0.10:5000->10.0.0.10:10
10.0.0.10:5000->10.0.0.20:10
Both tuples would translate to 10.0.0.10:5000->20.0.0.1:20 causing
nf_conntrack_confirm() to fail because of tuple collision.
Netfilter handles this case by allocating a null binding for SNAT at
egress by default. Perform the same operation in openvswitch for DNAT
if no explicit SNAT is requested by the user and allocate a null binding
for SNAT for packets in the "original" direction.
Reported-at: https://bugzilla.redhat.com/1877128
Suggested-by: Florian Westphal <fw@strlen.de>
Fixes: 05752523e565 ("openvswitch: Interface with NAT.")
Signed-off-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
After freeing ep->auth_hmacs we have to clear the pointer
or risk use-after-free as reported by syzbot:
BUG: KASAN: use-after-free in sctp_auth_destroy_hmacs net/sctp/auth.c:509 [inline]
BUG: KASAN: use-after-free in sctp_auth_destroy_hmacs net/sctp/auth.c:501 [inline]
BUG: KASAN: use-after-free in sctp_auth_free+0x17e/0x1d0 net/sctp/auth.c:1070
Read of size 8 at addr ffff8880a8ff52c0 by task syz-executor941/6874
CPU: 0 PID: 6874 Comm: syz-executor941 Not tainted 5.9.0-rc8-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x198/0x1fd lib/dump_stack.c:118
print_address_description.constprop.0.cold+0xae/0x497 mm/kasan/report.c:383
__kasan_report mm/kasan/report.c:513 [inline]
kasan_report.cold+0x1f/0x37 mm/kasan/report.c:530
sctp_auth_destroy_hmacs net/sctp/auth.c:509 [inline]
sctp_auth_destroy_hmacs net/sctp/auth.c:501 [inline]
sctp_auth_free+0x17e/0x1d0 net/sctp/auth.c:1070
sctp_endpoint_destroy+0x95/0x240 net/sctp/endpointola.c:203
sctp_endpoint_put net/sctp/endpointola.c:236 [inline]
sctp_endpoint_free+0xd6/0x110 net/sctp/endpointola.c:183
sctp_destroy_sock+0x9c/0x3c0 net/sctp/socket.c:4981
sctp_v6_destroy_sock+0x11/0x20 net/sctp/socket.c:9415
sk_common_release+0x64/0x390 net/core/sock.c:3254
sctp_close+0x4ce/0x8b0 net/sctp/socket.c:1533
inet_release+0x12e/0x280 net/ipv4/af_inet.c:431
inet6_release+0x4c/0x70 net/ipv6/af_inet6.c:475
__sock_release+0xcd/0x280 net/socket.c:596
sock_close+0x18/0x20 net/socket.c:1277
__fput+0x285/0x920 fs/file_table.c:281
task_work_run+0xdd/0x190 kernel/task_work.c:141
exit_task_work include/linux/task_work.h:25 [inline]
do_exit+0xb7d/0x29f0 kernel/exit.c:806
do_group_exit+0x125/0x310 kernel/exit.c:903
__do_sys_exit_group kernel/exit.c:914 [inline]
__se_sys_exit_group kernel/exit.c:912 [inline]
__x64_sys_exit_group+0x3a/0x50 kernel/exit.c:912
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x43f278
Code: Bad RIP value.
RSP: 002b:00007fffe0995c38 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000000000043f278
RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
RBP: 00000000004bf068 R08: 00000000000000e7 R09: ffffffffffffffd0
R10: 0000000020000000 R11: 0000000000000246 R12: 0000000000000001
R13: 00000000006d1180 R14: 0000000000000000 R15: 0000000000000000
Allocated by task 6874:
kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48
kasan_set_track mm/kasan/common.c:56 [inline]
__kasan_kmalloc.constprop.0+0xbf/0xd0 mm/kasan/common.c:461
kmem_cache_alloc_trace+0x174/0x300 mm/slab.c:3554
kmalloc include/linux/slab.h:554 [inline]
kmalloc_array include/linux/slab.h:593 [inline]
kcalloc include/linux/slab.h:605 [inline]
sctp_auth_init_hmacs+0xdb/0x3b0 net/sctp/auth.c:464
sctp_auth_init+0x8a/0x4a0 net/sctp/auth.c:1049
sctp_setsockopt_auth_supported net/sctp/socket.c:4354 [inline]
sctp_setsockopt+0x477e/0x97f0 net/sctp/socket.c:4631
__sys_setsockopt+0x2db/0x610 net/socket.c:2132
__do_sys_setsockopt net/socket.c:2143 [inline]
__se_sys_setsockopt net/socket.c:2140 [inline]
__x64_sys_setsockopt+0xba/0x150 net/socket.c:2140
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Freed by task 6874:
kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48
kasan_set_track+0x1c/0x30 mm/kasan/common.c:56
kasan_set_free_info+0x1b/0x30 mm/kasan/generic.c:355
__kasan_slab_free+0xd8/0x120 mm/kasan/common.c:422
__cache_free mm/slab.c:3422 [inline]
kfree+0x10e/0x2b0 mm/slab.c:3760
sctp_auth_destroy_hmacs net/sctp/auth.c:511 [inline]
sctp_auth_destroy_hmacs net/sctp/auth.c:501 [inline]
sctp_auth_init_hmacs net/sctp/auth.c:496 [inline]
sctp_auth_init_hmacs+0x2b7/0x3b0 net/sctp/auth.c:454
sctp_auth_init+0x8a/0x4a0 net/sctp/auth.c:1049
sctp_setsockopt_auth_supported net/sctp/socket.c:4354 [inline]
sctp_setsockopt+0x477e/0x97f0 net/sctp/socket.c:4631
__sys_setsockopt+0x2db/0x610 net/socket.c:2132
__do_sys_setsockopt net/socket.c:2143 [inline]
__se_sys_setsockopt net/socket.c:2140 [inline]
__x64_sys_setsockopt+0xba/0x150 net/socket.c:2140
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fixes: 1f485649f529 ("[SCTP]: Implement SCTP-AUTH internals")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This commit is correcting NETLINK br_fill_ifinfo() to be able to
handle 'filter_mask' with multiple flags asserted.
Fixes: 36a8e8e265420 ("bridge: Extend br_fill_ifinfo to return MPR status")
Signed-off-by: Henrik Bjoernlund <henrik.bjoernlund@microchip.com>
Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Suggested-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Tested-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The afs filesystem has a lock[*] that it uses to serialise I/O operations
going to the server (vnode->io_lock), as the server will only perform one
modification operation at a time on any given file or directory. This
prevents the the filesystem from filling up all the call slots to a server
with calls that aren't going to be executed in parallel anyway, thereby
allowing operations on other files to obtain slots.
[*] Note that is probably redundant for directories at least since
i_rwsem is used to serialise directory modifications and
lookup/reading vs modification. The server does allow parallel
non-modification ops, however.
When a file truncation op completes, we truncate the in-memory copy of the
file to match - but we do it whilst still holding the io_lock, the idea
being to prevent races with other operations.
However, if writeback starts in a worker thread simultaneously with
truncation (whilst notify_change() is called with i_rwsem locked, writeback
pays it no heed), it may manage to set PG_writeback bits on the pages that
will get truncated before afs_setattr_success() manages to call
truncate_pagecache(). Truncate will then wait for those pages - whilst
still inside io_lock:
# cat /proc/8837/stack
[<0>] wait_on_page_bit_common+0x184/0x1e7
[<0>] truncate_inode_pages_range+0x37f/0x3eb
[<0>] truncate_pagecache+0x3c/0x53
[<0>] afs_setattr_success+0x4d/0x6e
[<0>] afs_wait_for_operation+0xd8/0x169
[<0>] afs_do_sync_operation+0x16/0x1f
[<0>] afs_setattr+0x1fb/0x25d
[<0>] notify_change+0x2cf/0x3c4
[<0>] do_truncate+0x7f/0xb2
[<0>] do_sys_ftruncate+0xd1/0x104
[<0>] do_syscall_64+0x2d/0x3a
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
The writeback operation, however, stalls indefinitely because it needs to
get the io_lock to proceed:
# cat /proc/5940/stack
[<0>] afs_get_io_locks+0x58/0x1ae
[<0>] afs_begin_vnode_operation+0xc7/0xd1
[<0>] afs_store_data+0x1b2/0x2a3
[<0>] afs_write_back_from_locked_page+0x418/0x57c
[<0>] afs_writepages_region+0x196/0x224
[<0>] afs_writepages+0x74/0x156
[<0>] do_writepages+0x2d/0x56
[<0>] __writeback_single_inode+0x84/0x207
[<0>] writeback_sb_inodes+0x238/0x3cf
[<0>] __writeback_inodes_wb+0x68/0x9f
[<0>] wb_writeback+0x145/0x26c
[<0>] wb_do_writeback+0x16a/0x194
[<0>] wb_workfn+0x74/0x177
[<0>] process_one_work+0x174/0x264
[<0>] worker_thread+0x117/0x1b9
[<0>] kthread+0xec/0xf1
[<0>] ret_from_fork+0x1f/0x30
and thus deadlock has occurred.
Note that whilst afs_setattr() calls filemap_write_and_wait(), the fact
that the caller is holding i_rwsem doesn't preclude more pages being
dirtied through an mmap'd region.
Fix this by:
(1) Use the vnode validate_lock to mediate access between afs_setattr()
and afs_writepages():
(a) Exclusively lock validate_lock in afs_setattr() around the whole
RPC operation.
(b) If WB_SYNC_ALL isn't set on entry to afs_writepages(), trying to
shared-lock validate_lock and returning immediately if we couldn't
get it.
(c) If WB_SYNC_ALL is set, wait for the lock.
The validate_lock is also used to validate a file and to zap its cache
if the file was altered by a third party, so it's probably a good fit
for this.
(2) Move the truncation outside of the io_lock in setattr, using the same
hook as is used for local directory editing.
This requires the old i_size to be retained in the operation record as
we commit the revised status to the inode members inside the io_lock
still, but we still need to know if we reduced the file size.
Fixes: d2ddc776a458 ("afs: Overhaul volume and server record caching and fileserver rotation")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
In commit 70e806e4e645 ("mm: Do early cow for pinned pages during fork()
for ptes") we write-protected the PTE before doing the page pinning
check, in order to avoid a race with concurrent fast-GUP pinning (which
doesn't take the mm semaphore or the page table lock).
That trick doesn't actually work - it doesn't handle memory ordering
properly, and doing so would be prohibitively expensive.
It also isn't really needed. While we're moving in the direction of
allowing and supporting page pinning without marking the pinned area
with MADV_DONTFORK, the fact is that we've never really supported this
kind of odd "concurrent fork() and page pinning", and doing the
serialization on a pte level is just wrong.
We can add serialization with a per-mm sequence counter, so we know how
to solve that race properly, but we'll do that at a more appropriate
time. Right now this just removes the write protect games.
It also turns out that the write protect games actually break on Power,
as reported by Aneesh Kumar:
"Architecture like ppc64 expects set_pte_at to be not used for updating
a valid pte. This is further explained in commit 56eecdb912b5 ("mm:
Use ptep/pmdp_set_numa() for updating _PAGE_NUMA bit")"
and the code triggered a warning there:
WARNING: CPU: 0 PID: 30613 at arch/powerpc/mm/pgtable.c:185 set_pte_at+0x2a8/0x3a0 arch/powerpc/mm/pgtable.c:185
Call Trace:
copy_present_page mm/memory.c:857 [inline]
copy_present_pte mm/memory.c:899 [inline]
copy_pte_range mm/memory.c:1014 [inline]
copy_pmd_range mm/memory.c:1092 [inline]
copy_pud_range mm/memory.c:1127 [inline]
copy_p4d_range mm/memory.c:1150 [inline]
copy_page_range+0x1f6c/0x2cc0 mm/memory.c:1212
dup_mmap kernel/fork.c:592 [inline]
dup_mm+0x77c/0xab0 kernel/fork.c:1355
copy_mm kernel/fork.c:1411 [inline]
copy_process+0x1f00/0x2740 kernel/fork.c:2070
_do_fork+0xc4/0x10b0 kernel/fork.c:2429
Link: https://lore.kernel.org/lkml/CAHk-=wiWr+gO0Ro4LvnJBMs90OiePNyrE3E+pJvc9PzdBShdmw@mail.gmail.com/
Link: https://lore.kernel.org/linuxppc-dev/20201008092541.398079-1-aneesh.kumar@linux.ibm.com/
Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Tested-by: Leon Romanovsky <leonro@nvidia.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Kirill Shutemov <kirill@shutemov.name>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
In nl80211_parse_key(), key.idx is first initialized as -1.
If this value of key.idx remains unmodified and gets returned, and
nl80211_key_allowed() also returns 0, then rdev_del_key() gets called
with key.idx = -1.
This causes an out-of-bounds array access.
Handle this issue by checking if the value of key.idx after
nl80211_parse_key() is called and return -EINVAL if key.idx < 0.
Cc: stable@vger.kernel.org
Reported-by: syzbot+b1bb342d1d097516cbda@syzkaller.appspotmail.com
Tested-by: syzbot+b1bb342d1d097516cbda@syzkaller.appspotmail.com
Signed-off-by: Anant Thazhemadam <anant.thazhemadam@gmail.com>
Link: https://lore.kernel.org/r/20201007035401.9522-1-anant.thazhemadam@gmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|