|
It is possible that tc stats get checked before the packet we check for
has actually arrived at the interface and been accounted for.
Fix it by checking for the expected result in a loop until
timeout is reached (by default 1 second).
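A minimal userspace sketch of the "check in a loop until timeout" idea, not the selftest code itself. The names poll_until(), now_ms() and counter_reached_three() are hypothetical and exist only for illustration.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical predicate type: returns true once the expected result is seen. */
typedef bool (*check_fn)(void *ctx);

/* Monotonic time in milliseconds. */
static long long now_ms(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (long long)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/*
 * Re-run check() until it succeeds or timeout_ms has elapsed.
 * Returns true on success, false on timeout.
 */
static bool poll_until(check_fn check, void *ctx, long timeout_ms,
		       long interval_ms)
{
	long long deadline = now_ms() + timeout_ms;
	struct timespec pause = {
		.tv_sec = interval_ms / 1000,
		.tv_nsec = (interval_ms % 1000) * 1000000L,
	};

	for (;;) {
		if (check(ctx))
			return true;
		if (now_ms() >= deadline)
			return false;
		nanosleep(&pause, NULL);
	}
}

/* Example predicate: a counter standing in for "packets accounted so far". */
static bool counter_reached_three(void *ctx)
{
	int *seen = ctx;

	return ++(*seen) >= 3;
}
```

A single early check would see the counter at 1 and fail; polling with a 1000 ms timeout gives the counter time to reach the expected value.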
Fixes: 07e5c75184a1 ("selftests: forwarding: Introduce tc flower matching tests")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When a classful qdisc's child qdisc has set the flag
TCQ_F_CPUSTATS (pfifo_fast for example), the child qdisc's
cpu_bstats should be passed to gnet_stats_copy_basic(),
but many classful qdiscs didn't do that. As a result,
`tc -s class show dev DEV` always returns 0 for bytes and
packets in this case.
Pass the child qdisc's cpu_bstats to gnet_stats_copy_basic()
to fix this issue.
The qstats also has this problem, but it has been fixed
in 5dd431b6b9 ("net: sched: introduce and use qstats read...")
and bstats still remains buggy.
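A simplified userspace model of why a missing per-CPU pointer yields zeros. struct basic_stats and copy_basic() are hypothetical stand-ins, not the kernel's gnet_stats_copy_basic() implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the basic byte/packet counters. */
struct basic_stats {
	uint64_t bytes;
	uint64_t packets;
};

/*
 * Model of the copy step: when per-CPU counters are supplied
 * (cpu != NULL) the totals are the sum over all CPUs; otherwise the
 * single global counter is used. A qdisc that only updates per-CPU
 * counters therefore reports zeros if the caller passes cpu == NULL.
 */
static struct basic_stats copy_basic(const struct basic_stats *cpu,
				     size_t ncpus,
				     const struct basic_stats *global)
{
	struct basic_stats out = { 0, 0 };
	size_t i;

	if (!cpu)
		return *global;

	for (i = 0; i < ncpus; i++) {
		out.bytes += cpu[i].bytes;
		out.packets += cpu[i].packets;
	}
	return out;
}
```

In this model the buggy callers correspond to passing NULL for the per-CPU array while the global counter was never updated.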
Fixes: 22e0f8b9322c ("net: sched: make bstats per cpu and estimator RCU safe")
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The recently updated ALE APIs cpsw_ale_del_mcast() and
cpsw_ale_del_vlan_modify() have an issue and will not delete ALE entry even
if VLAN/mcast group has no more members. Hence fix it here and delete ALE
entry if !port_mask.
The issue affected only new cpsw switchdev driver.
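A rough sketch of the intended behaviour, assuming an entry that tracks its members as a port bitmask. struct ale_entry_model and entry_del_ports() are hypothetical names and do not mirror the driver's data layout.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical table entry: a VLAN or mcast group with member ports. */
struct ale_entry_model {
	bool in_use;
	uint32_t port_mask;
};

/*
 * Remove the ports in del_mask from the entry. If no members remain,
 * release the entry instead of leaving an empty one in the table.
 */
static void entry_del_ports(struct ale_entry_model *e, uint32_t del_mask)
{
	e->port_mask &= ~del_mask;
	if (!e->port_mask)
		e->in_use = false;
}
```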
Fixes: e85c14370783 ("net: ethernet: ti: ale: modify vlan/mdb api for switchdev")
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
If IPV6 is not set and CONFIG_MLX5_ESWITCH is y,
building fails:
drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun.c:322:5: error: redefinition of mlx5e_tc_tun_create_header_ipv6
int mlx5e_tc_tun_create_header_ipv6(struct mlx5e_priv *priv,
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun.c:7:0:
drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun.h:67:1: note: previous definition of mlx5e_tc_tun_create_header_ipv6 was here
mlx5e_tc_tun_create_header_ipv6(struct mlx5e_priv *priv,
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Use #ifdef to guard this, also move mlx5e_route_lookup_ipv6
to cleanup unused warning.
Reported-by: Hulk Robot <hulkci@huawei.com>
Fixes: e689e998e102 ("net/mlx5e: TC, Stub out ipv6 tun create header function")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Some versions of iproute2 will output more than one line per entry, which
will cause the test to fail, like:
TEST: ipv6: list and flush cached exceptions [FAIL]
can't list cached exceptions
That happens, for example, with iproute2 4.15.0. When using the -oneline
option, this will work just fine:
TEST: ipv6: list and flush cached exceptions [ OK ]
This also works just fine with a more recent version of iproute2, like
5.4.0.
For some reason, two lines are printed for the IPv4 test no matter what
version of iproute2 is used. Use the same -oneline parameter there instead
of counting the lines twice.
Fixes: b964641e9925 ("selftests: pmtu: Make list_flush_ipv6_exception test more demanding")
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Acked-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Scenario:
1. A client socket initiates a SYN message to a listening socket.
2. The send link is congested, the SYN message is put in the
send link and a wakeup message is put in wakeup queue.
3. The congestion situation is abated, the wakeup message is
pulled out of the wakeup queue. Function tipc_sk_push_backlog()
is called to send out delayed messages by Nagle. However,
the client socket is still in CONNECTING state. So, it sends
the SYN message in the socket write queue to the listening socket
again.
4. The listening socket receives the first SYN message and creates
first server socket. The client socket receives ACK- and establishes
a connection to the first server socket. The client socket closes
its connection with the first server socket.
5. The listening socket receives the second SYN message and creates
second server socket. The second server socket sends ACK- to the
client socket, but it has been closed. It results in connection
reset error when reading from the server socket in user space.
Solution: return from function tipc_sk_push_backlog() immediately
if there is pending SYN message in the socket write queue.
Fixes: c0bceb97db9e ("tipc: add smart nagle feature")
Signed-off-by: Tung Nguyen <tung.q.nguyen@dektech.com.au>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In function __tipc_shutdown(), the timeout value passed to
tipc_wait_for_cond() is not in jiffies.
This commit fixes it by converting that value from milliseconds
to jiffies.
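To illustrate why the unit matters, here is a simplified model of a milliseconds-to-ticks conversion. ms_to_ticks() is a hypothetical helper, the tick rates are illustrative, and overflow handling is left out on purpose.

```c
/*
 * Convert milliseconds to timer ticks for a tick rate of hz ticks per
 * second, rounding up so that a short non-zero timeout never becomes
 * zero ticks.
 */
static unsigned long ms_to_ticks(unsigned long ms, unsigned long hz)
{
	return (ms * hz + 999) / 1000;
}
```

With 250 ticks per second, 10000 ms is 2500 ticks; passing the raw value 10000 as a tick count would instead wait 40 seconds.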
Fixes: 365ad353c256 ("tipc: reduce risk of user starvation during link congestion")
Signed-off-by: Tung Nguyen <tung.q.nguyen@dektech.com.au>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When tipc_sk_timeout() is executed but user space is grabbing
ownership, this function rearms itself and returns. However, the
socket reference counter is not reduced. This causes potential
unexpected behavior.
This commit fixes it by calling sock_put() before tipc_sk_timeout()
returns in the above-mentioned case.
Fixes: afe8792fec69 ("tipc: refactor function tipc_sk_timeout()")
Signed-off-by: Tung Nguyen <tung.q.nguyen@dektech.com.au>
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When initiating a connection message to a server side, the connection
message is cloned and added to the socket write queue. However, if the
cloning fails, only the socket write queue is purged. This causes a
memory leak because the original connection message is not freed.
This commit fixes it by purging the list of connection message when
it cannot be cloned.
Fixes: 6787927475e5 ("tipc: buffer overflow handling in listener socket")
Reported-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: Tung Nguyen <tung.q.nguyen@dektech.com.au>
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This driver forgets to kill tasklet in remove.
Add the call to fix it.
Fixes: 032dc41ba6e2 ("net: macb: Handle HRESP error")
Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
perror(str) is basically equivalent to
fprintf(stderr, "%s: %s\n", str, strerror(errno)).
A new line or colon at the end of str is
a mistake and breaks formatting.
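A small sketch that builds the line perror() would print, so the effect of a trailing colon or newline can be inspected. format_like_perror() is a hypothetical helper written for this illustration.

```c
#include <stdio.h>
#include <string.h>

/*
 * Build the line that perror(str) would write to stderr for error
 * number err. Returns the snprintf() result.
 */
static int format_like_perror(char *buf, size_t len, const char *str, int err)
{
	return snprintf(buf, len, "%s: %s\n", str, strerror(err));
}
```

Calling it with "open" gives a single line starting with "open: ", while "open:\n" produces a doubled colon and a line break in the middle of the message.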
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
test_sockmap creates a temporary file to use for sendpage.
This may fail for various reasons. Handle the error rather
than segfault.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Partially sent record cleanup path increments an SG entry
directly instead of using sg_next(). This should not be a
problem today, as encrypted messages should be always
allocated as arrays. But given this is a cleanup path it's
easy to miss were this ever to change. Use sg_next(), and
simplify the code.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Looks like when BPF support was added by commit d3b18ad31f93
("tls: add bpf support to sk_msg handling") and
commit d829e9c4112b ("tls: convert to generic sk_msg interface")
it broke/removed the support for in-place crypto as added by
commit 4e6d47206c32 ("tls: Add support for inplace records
encryption").
The inplace_crypto member of struct tls_rec is dead, inited
to zero, and sometimes set to zero again. It used to be
set to 1 when record was allocated, but the skmsg code doesn't
seem to have been written with the idea of in-place crypto
in mind.
Since non-trivial effort is required to bring the feature back
and we don't really have the HW to measure the benefit, just
remove the leftover support for now to avoid confusing readers.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add a sendmsg test with very fragmented messages. This should
fill up sk_msg and test the boundary conditions.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
TLS 1.3 started using the entry at the end of the SG array
for chaining-in the single byte content type entry. This mostly
works:
[ E E E E E E . . ]
^ ^
start end
E < content type
/
[ E E E E E E C . ]
^ ^
start end
(Where E denotes a populated SG entry; C denotes a chaining entry.)
If the array is full, however, the end will point to the start:
[ E E E E E E E E ]
^
start
end
And we end up overwriting the start:
E < content type
/
[ C E E E E E E E ]
^
start
end
The sg array is supposed to be a circular buffer with start and
end markers pointing anywhere. In case where start > end
(i.e. the circular buffer has "wrapped") there is an extra entry
reserved at the end to chain the two halves together.
[ E E E E E E . . l ]
(Where l is the reserved entry for "looping" back to front.)
As suggested by John, let's reserve another entry for chaining
SG entries after the main circular buffer. Note that this entry
has to be pointed to by the end entry so its position is not fixed.
Examples of full messages:
[ E E E E E E E E . l ]
^ ^
start end
<---------------.
[ E E . E E E E E E l ]
^ ^
end start
Now the end will always point to an unused entry, so TLS 1.3
can always use it.
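A toy model of the "one extra slot" idea, assuming a ring of 8 usable entries. struct frag_ring and the ring_*() helpers are hypothetical and much simpler than the real sk_msg code (there is no chaining and no separate "l" entry here).

```c
#include <stdbool.h>

#define MAX_FRAGS 8                 /* usable entries (illustrative value) */
#define NR_FRAG_IDS (MAX_FRAGS + 1) /* one extra slot so 'end' is never in use */

struct frag_ring {
	int start;                  /* index of the first populated entry */
	int end;                    /* index one past the last populated entry */
	bool used[NR_FRAG_IDS];
};

/* Number of populated entries, taking wrap-around into account. */
static int ring_count(const struct frag_ring *r)
{
	return (r->end - r->start + NR_FRAG_IDS) % NR_FRAG_IDS;
}

static bool ring_full(const struct frag_ring *r)
{
	return ring_count(r) == MAX_FRAGS;
}

/* Append one entry; returns false when the ring is already full. */
static bool ring_push(struct frag_ring *r)
{
	if (ring_full(r))
		return false;
	r->used[r->end] = true;
	r->end = (r->end + 1) % NR_FRAG_IDS;
	return true;
}
```

Because the ring holds at most MAX_FRAGS entries in MAX_FRAGS + 1 slots, end never lands on start when the ring is full, so the slot at end stays free for an extra entry.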
Fixes: 130b392c6cd6 ("net: tls: Add tls 1.3 support")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When tls_do_encryption() fails the SG lists are left with the
SG_END and SG_CHAIN marks in place. One could hope that once
encryption fails we will never see the record again, but that
is in fact not true. Commit d3b18ad31f93 ("tls: add bpf support
to sk_msg handling") added special handling to ENOMEM and ENOSPC
errors which mean we may see the same record re-submitted.
As suggested by John, free the record; the BPF code is already
doing just that.
Reported-by: syzbot+df0d4ec12332661dd1f9@syzkaller.appspotmail.com
Fixes: d3b18ad31f93 ("tls: add bpf support to sk_msg handling")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
bpf_exec_tx_verdict() may free the record if tls_push_record()
fails, or if the entire record got consumed by BPF. Re-check
ctx->open_rec before touching the data.
Fixes: d3b18ad31f93 ("tls: add bpf support to sk_msg handling")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch corrects the SPDX License Identifier style in
header files related to drivers for USB Network devices.
This patch gives an explicit block comment to the
SPDX License Identifier.
Changes made by using a script provided by Joe Perches here:
https://lkml.org/lkml/2019/2/7/46.
Suggested-by: Joe Perches <joe@perches.com>
Signed-off-by: Nishad Kamdar <nishadkamdar@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch corrects the SPDX License Identifier style in
header files related to PHY Layer for Ethernet drivers.
For C header files Documentation/process/license-rules.rst
mandates C-like comments (opposed to C source files where
C++ style should be used). This patch also gives an explicit
block comment to the SPDX License Identifier.
Changes made by using a script provided by Joe Perches here:
https://lkml.org/lkml/2019/2/7/46.
Suggested-by: Joe Perches <joe@perches.com>
Signed-off-by: Nishad Kamdar <nishadkamdar@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Commit 6570bc79c0df ("net: core: use listified Rx for GRO_NORMAL in
napi_gro_receive()") has applied batched GRO_NORMAL packets processing
to all napi_gro_receive() users, including mac80211-based drivers.
However, this change has led to a regression in iwlwifi driver [1][2] as
it is required for NAPI users to call napi_complete_done() or
napi_complete() at the end of every polling iteration, whilst iwlwifi
doesn't use NAPI scheduling at all and just calls napi_gro_flush().
In that particular case, packets which have not been already flushed
from napi->rx_list stall in it until at least next Rx cycle.
Fix this by adding a manual flushing of the list to iwlwifi driver right
before napi_gro_flush() call to mimic napi_complete() logic.
I prefer to open-code gro_normal_list() rather than exporting it for 2
reasons:
* to prevent from using it and napi_gro_flush() in any new drivers,
as it is the *really* bad way to use NAPI that should be avoided;
* to keep gro_normal_list() static and not lose any CC optimizations.
I also don't add the "Fixes:" tag as the mentioned commit was only a
trigger that only exposed an improper usage of NAPI in this particular
driver.
[1] https://lore.kernel.org/netdev/PSXP216MB04388962C411CD0B17A86F47804A0@PSXP216MB0438.KORP216.PROD.OUTLOOK.COM
[2] https://bugzilla.kernel.org/show_bug.cgi?id=205647
Signed-off-by: Alexander Lobakin <alobakin@dlink.ru>
Acked-by: Luca Coelho <luciano.coelho@intel.com>
Reported-by: Nicholas Johnson <nicholas.johnson-opensource@outlook.com.au>
Tested-by: Nicholas Johnson <nicholas.johnson-opensource@outlook.com.au>
Reviewed-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Convert to use skb queue instead of the list of skbs.
The skb queue could provide protection with lock.
Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Break the matching loop once the matching skb for the TX timestamp is found.
This is to avoid consuming more skbs incorrectly. The timestamp ID
is from 0 to 3 while the FIFO could support 128 timestamps at most.
Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In gve_alloc_queue_page_list(), when a page allocation fails,
qpl->num_entries will be wrong. In this case priv->num_registered_pages
can underflow in gve_free_queue_page_list(), causing subsequent calls
to gve_alloc_queue_page_list() to fail.
Fixes: f5cedc84a30d ("gve: Add transmit and receive support")
Signed-off-by: Jeroen de Borst <jeroendb@google.com>
Reviewed-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Any argument outside of that range would result in an out of bound
memory access, since the accessed array is 65536 bits long.
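A userspace sketch of range-checking an index before touching a 65536-bit bitmap. The bitmap, bit_set() and bit_is_set() are hypothetical names; the actual kernel function being fixed is not named in this message.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define NBITS 65536
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

static unsigned long bitmap[NBITS / BITS_PER_WORD];

/* Set bit idx; returns false (and does nothing) when idx is out of range. */
static bool bit_set(long idx)
{
	if (idx < 0 || idx >= NBITS)
		return false;
	bitmap[(size_t)idx / BITS_PER_WORD] |= 1UL << ((size_t)idx % BITS_PER_WORD);
	return true;
}

/* Returns true only if idx is inside [0, NBITS) and the bit is set. */
static bool bit_is_set(long idx)
{
	if (idx < 0 || idx >= NBITS)
		return false;	/* reject instead of reading out of bounds */
	return (bitmap[(size_t)idx / BITS_PER_WORD] >>
		((size_t)idx % BITS_PER_WORD)) & 1UL;
}
```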
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When user-space sets the OVS_UFID_F_OMIT_* flags, and the relevant
flow has no UFID, we can exceed the computed size, as
ovs_nla_put_identifier() will always dump an OVS_FLOW_ATTR_KEY
attribute.
Take the above in account when computing the flow command message
size.
Fixes: 74ed7ab9264c ("openvswitch: Add support for unique flow IDs.")
Reported-by: Qi Jun Ding <qding@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Fix the return paths for all I/O operations to ensure
that the I/O completed successfully. Then pass the return
to the caller for further processing.
Fixes: 01db923e8377 ("net: phy: dp83869: Add TI dp83869 phy")
Reported-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Dan Murphy <dmurphy@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
We need to calculate the skb size correctly otherwise we risk triggering
skb_over_panic[1]. The issue is that data_len is added to the skb in a
nl attribute, but we don't account for its header size (nlattr 4 bytes)
and alignment. We account for it when calculating the total size in
the > PSAMPLE_MAX_PACKET_SIZE comparison correctly, but not when
allocating after that. The fix is simple - use nla_total_size() for
data_len when allocating.
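A userspace sketch of the size arithmetic, assuming the 4-byte attribute header mentioned above and 4-byte alignment. The MODEL_* macros and model_nla_total_size() are made-up names for illustration and are not the kernel definitions.

```c
#define MODEL_NLA_ALIGNTO 4
#define MODEL_NLA_ALIGN(len) \
	(((len) + MODEL_NLA_ALIGNTO - 1) & ~(MODEL_NLA_ALIGNTO - 1))
#define MODEL_NLA_HDRLEN 4	/* attribute header size, as stated above */

/* Space one attribute occupies: header plus payload, padded to 4 bytes. */
static int model_nla_total_size(int payload)
{
	return MODEL_NLA_ALIGN(MODEL_NLA_HDRLEN + payload);
}
```

For the 129-byte payload used in the reproducer, model_nla_total_size(129) is 136, which lines up with put:136 in the trace, whereas the raw data_len is only 129.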
To reproduce:
$ tc qdisc add dev eth1 clsact
$ tc filter add dev eth1 egress matchall action sample rate 1 group 1 trunc 129
$ mausezahn eth1 -b bcast -a rand -c 1 -p 129
< skb_over_panic BUG(), tail is 4 bytes past skb->end >
[1] Trace:
[ 50.459526][ T3480] skbuff: skb_over_panic: text:(____ptrval____) len:196 put:136 head:(____ptrval____) data:(____ptrval____) tail:0xc4 end:0xc0 dev:<NULL>
[ 50.474339][ T3480] ------------[ cut here ]------------
[ 50.481132][ T3480] kernel BUG at net/core/skbuff.c:108!
[ 50.486059][ T3480] invalid opcode: 0000 [#1] PREEMPT SMP
[ 50.489463][ T3480] CPU: 3 PID: 3480 Comm: mausezahn Not tainted 5.4.0-rc7 #108
[ 50.492844][ T3480] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014
[ 50.496551][ T3480] RIP: 0010:skb_panic+0x79/0x7b
[ 50.498261][ T3480] Code: bc 00 00 00 41 57 4c 89 e6 48 c7 c7 90 29 9a 83 4c 8b 8b c0 00 00 00 50 8b 83 b8 00 00 00 50 ff b3 c8 00 00 00 e8 ae ef c0 fe <0f> 0b e8 2f df c8 fe 48 8b 55 08 44 89 f6 4c 89 e7 48 c7 c1 a0 22
[ 50.504111][ T3480] RSP: 0018:ffffc90000447a10 EFLAGS: 00010282
[ 50.505835][ T3480] RAX: 0000000000000087 RBX: ffff888039317d00 RCX: 0000000000000000
[ 50.507900][ T3480] RDX: 0000000000000000 RSI: ffffffff812716e1 RDI: 00000000ffffffff
[ 50.509820][ T3480] RBP: ffffc90000447a60 R08: 0000000000000001 R09: 0000000000000000
[ 50.511735][ T3480] R10: ffffffff81d4f940 R11: 0000000000000000 R12: ffffffff834a22b0
[ 50.513494][ T3480] R13: ffffffff82c10433 R14: 0000000000000088 R15: ffffffff838a8084
[ 50.515222][ T3480] FS: 00007f3536462700(0000) GS:ffff88803eac0000(0000) knlGS:0000000000000000
[ 50.517135][ T3480] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 50.518583][ T3480] CR2: 0000000000442008 CR3: 000000003b222000 CR4: 00000000000006e0
[ 50.520723][ T3480] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 50.522709][ T3480] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 50.524450][ T3480] Call Trace:
[ 50.525214][ T3480] skb_put.cold+0x1b/0x1b
[ 50.526171][ T3480] psample_sample_packet+0x1d3/0x340
[ 50.527307][ T3480] tcf_sample_act+0x178/0x250
[ 50.528339][ T3480] tcf_action_exec+0xb1/0x190
[ 50.529354][ T3480] mall_classify+0x67/0x90
[ 50.530332][ T3480] tcf_classify+0x72/0x160
[ 50.531286][ T3480] __dev_queue_xmit+0x3db/0xd50
[ 50.532327][ T3480] dev_queue_xmit+0x18/0x20
[ 50.533299][ T3480] packet_sendmsg+0xee7/0x2090
[ 50.534331][ T3480] sock_sendmsg+0x54/0x70
[ 50.535271][ T3480] __sys_sendto+0x148/0x1f0
[ 50.536252][ T3480] ? tomoyo_file_ioctl+0x23/0x30
[ 50.537334][ T3480] ? ksys_ioctl+0x5e/0xb0
[ 50.540068][ T3480] __x64_sys_sendto+0x2a/0x30
[ 50.542810][ T3480] do_syscall_64+0x73/0x1f0
[ 50.545383][ T3480] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 50.548477][ T3480] RIP: 0033:0x7f35357d6fb3
[ 50.551020][ T3480] Code: 48 8b 0d 18 90 20 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d f9 d3 20 00 00 75 13 49 89 ca b8 2c 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 eb f6 ff ff 48 89 04 24
[ 50.558547][ T3480] RSP: 002b:00007ffe0c7212c8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 50.561870][ T3480] RAX: ffffffffffffffda RBX: 0000000001dac010 RCX: 00007f35357d6fb3
[ 50.565142][ T3480] RDX: 0000000000000082 RSI: 0000000001dac2a2 RDI: 0000000000000003
[ 50.568469][ T3480] RBP: 00007ffe0c7212f0 R08: 00007ffe0c7212d0 R09: 0000000000000014
[ 50.571731][ T3480] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000082
[ 50.574961][ T3480] R13: 0000000001dac2a2 R14: 0000000000000001 R15: 0000000000000003
[ 50.578170][ T3480] Modules linked in: sch_ingress virtio_net
[ 50.580976][ T3480] ---[ end trace 61a515626a595af6 ]---
CC: Yotam Gigi <yotamg@mellanox.com>
CC: Jiri Pirko <jiri@mellanox.com>
CC: Jamal Hadi Salim <jhs@mojatatu.com>
CC: Simon Horman <simon.horman@netronome.com>
CC: Roopa Prabhu <roopa@cumulusnetworks.com>
Fixes: 6ae0a6286171 ("net: Introduce psample, a new genetlink channel for packet sampling")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Correct the usage prototype of the callback in tasklet_init().
Reported at https://github.com/KSPP/linux/issues/20
Signed-off-by: Phong Tran <tranmanphong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Correct the usage prototype of the callback in tasklet_init().
Reported at https://github.com/KSPP/linux/issues/20
Signed-off-by: Phong Tran <tranmanphong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In the quest to bring io_kiocb down to 3 cachelines, this one does
the trick. Make the wait_queue_entry for the poll command come out
of kmalloc instead of embedding it in struct io_poll_iocb, as the
latter is the largest member of io_kiocb. Once we trim this down a
bit, we're back at a healthy 192 bytes for struct io_kiocb.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Currently we're using 40 bytes for the io_wq_work structure, and 16 of
those are the doubly linked list node. We don't need doubly linked lists,
we always add to tail to keep things ordered, and any other use case
is list traversal with deletion. For the deletion case, we can easily
support any node deletion by keeping track of the previous entry.
This shrinks io_wq_work to 32 bytes, and subsequently io_kiocb in
io_uring from 216 to 208 bytes.
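A minimal sketch of a singly linked list with a tail pointer, where deletion works by remembering the previous node during traversal. The wq_node, wq_list and wq_*() names are hypothetical and only illustrate the technique described above.

```c
#include <stddef.h>

/* One pointer per node instead of two. */
struct wq_node {
	struct wq_node *next;
};

struct wq_list {
	struct wq_node *first;
	struct wq_node *last;
};

/* Append at the tail so entries stay in submission order. */
static void wq_add_tail(struct wq_list *l, struct wq_node *n)
{
	n->next = NULL;
	if (!l->first)
		l->first = n;
	else
		l->last->next = n;
	l->last = n;
}

/* Unlink n, given its predecessor prev (NULL when n is the head). */
static void wq_del(struct wq_list *l, struct wq_node *n, struct wq_node *prev)
{
	if (prev)
		prev->next = n->next;
	else
		l->first = n->next;
	if (l->last == n)
		l->last = prev;
	n->next = NULL;
}

/* Walk the list, tracking the previous node, and remove target if found. */
static int wq_remove(struct wq_list *l, struct wq_node *target)
{
	struct wq_node *prev = NULL, *cur;

	for (cur = l->first; cur; prev = cur, cur = cur->next) {
		if (cur == target) {
			wq_del(l, cur, prev);
			return 1;
		}
	}
	return 0;
}
```

On a 64-bit build the node shrinks from two pointers to one, which matches the 8 bytes saved in the message above.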
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
There are several things that can go wrong in the current code on NUMA
systems, especially if not all nodes are online all the time:
- If the identifiers of the online nodes do not form a single contiguous
block starting at zero, wq->wqes will be too small, and OOB memory
accesses will occur e.g. in the loop in io_wq_create().
- If a node comes online between the call to num_online_nodes() and the
for_each_node() loop in io_wq_create(), an OOB write will occur.
- If a node comes online between io_wq_create() and io_wq_enqueue(), a
lookup is performed for an element that doesn't exist, and an OOB read
will probably occur.
Fix it by:
- using nr_node_ids instead of num_online_nodes() for the allocation size;
nr_node_ids is calculated by setup_nr_node_ids() to be bigger than the
highest node ID that could possibly come online at some point, even if
those nodes' identifiers are not a contiguous block
- creating workers for all possible CPUs, not just all online ones
This is basically what the normal workqueue code also does, as far as I can
tell.
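A small sketch of the sizing problem, assuming node IDs are plain integers that need not be contiguous. slots_needed() is a hypothetical helper; it models "highest possible ID plus one" rather than the kernel's nr_node_ids computation.

```c
#include <stddef.h>

/* Array length needed so that every id in ids[] is a valid index. */
static size_t slots_needed(const int *ids, size_t n)
{
	int max_id = -1;
	size_t i;

	for (i = 0; i < n; i++) {
		if (ids[i] > max_id)
			max_id = ids[i];
	}
	return (size_t)(max_id + 1);
}
```

With nodes 0 and 2 present, counting nodes gives 2, but indexing by node ID needs 3 slots; sizing the array by the count makes index 2 out of bounds.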
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
These allocations are single-element allocations, so don't use the array
allocation wrapper for them.
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Clean io_import_fixed() call site and make it return proper type.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
There is no point left in keeping struct sqe_submit. Inline it
into struct io_kiocb, so any req->submit.field is now just req->field
- moves initialisation of ring_file into io_get_req()
- removes duplicated req->sequence.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Timeouts' sequence offset (i.e. sqe->off) is stored in
req->submit.sequence under a false name. Keep it in timeout.data
instead. The unused space for sequence will be reclaimed in the
following patches.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Only io_uring uses (and added) these, and we want to disallow the
use of sendmsg/recvmsg for anything but regular data transfers.
Use the newly added prep helper to split the msghdr copy out from
the core function, to check for msg_control and msg_controllen
settings. If either is set, we return -EINVAL.
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This is in preparation for enabling the io_uring helpers for sendmsg
and recvmsg to first copy the header for validation before continuing
with the operation.
There should be no functional changes in this patch.
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Note that the sysctl write accessor functions guarantee that:
net->ipv4.sysctl_ip_prot_sock <= net->ipv4.ip_local_ports.range[0]
invariant is maintained, and as such the max() in selinux hooks is actually spurious.
ie. even though
if (snum < max(inet_prot_sock(sock_net(sk)), low) || snum > high) {
per logic is the same as
if ((snum < inet_prot_sock(sock_net(sk)) || snum < low) || snum > high) {
it is actually functionally equivalent to:
if (snum < low || snum > high) {
which is equivalent to:
if (snum < inet_prot_sock(sock_net(sk)) || snum < low || snum > high) {
even though the first clause is spurious.
But we want to hold on to it in case we ever want to change what
inet_port_requires_bind_service() means (for example by changing
it from a, by default, [0..1024) range to some sort of set).
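The equivalence can be checked by brute force over small values. This is a sketch with hypothetical helper names, where prot_sock stands in for the value returned by inet_prot_sock() and the invariant prot_sock <= low is assumed.

```c
#include <stdbool.h>

static int max_int(int a, int b)
{
	return a > b ? a : b;
}

/* Form with max(): compares against the larger of prot_sock and low. */
static bool out_of_range_max(int snum, int prot_sock, int low, int high)
{
	return snum < max_int(prot_sock, low) || snum > high;
}

/* Simplified form, valid only when prot_sock <= low. */
static bool out_of_range_simple(int snum, int low, int high)
{
	return snum < low || snum > high;
}

/* Compare both forms for all small values satisfying prot_sock <= low. */
static bool forms_agree(int limit)
{
	int prot, low, high, snum;

	for (prot = 0; prot <= limit; prot++)
		for (low = prot; low <= limit; low++)
			for (high = low; high <= limit; high++)
				for (snum = 0; snum <= limit; snum++)
					if (out_of_range_max(snum, prot, low, high) !=
					    out_of_range_simple(snum, low, high))
						return false;
	return true;
}
```

Without the invariant the two forms differ: for snum 5, prot_sock 10, low 3 and high 20, the max() form rejects the port while the simplified form accepts it.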
Test: builds, git 'grep inet_prot_sock' finds no other references
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Provide some serialization for device CRQ commands
and queries to ensure that the shared variable used for
storing return codes is properly synchronized.
Signed-off-by: Thomas Falcon <tlfalcon@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Create a wrapper for wait_for_completion calls with additional
driver checks to ensure that the driver does not wait on a
disabled device. In those cases or if the device does not respond
in an extended amount of time, this will allow the driver an
opportunity to recover.
Signed-off-by: Thomas Falcon <tlfalcon@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
If we receive a notification that the device has been deactivated
or removed, force a completion of all waiting threads.
Signed-off-by: Thomas Falcon <tlfalcon@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Fix multiple calls to init_completion for device completion
structures. Instead, initialize them during device probe and
reinitialize them later as needed.
Signed-off-by: Thomas Falcon <tlfalcon@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
It already existed in part of the function, but move it
to a higher level and use it consistently throughout.
Safe since sk is never written to.
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
It cannot overlap with the local port range - ie. with autobind selectable
ports - and not with reserved ports.
Indeed 'ip_local_reserved_ports' isn't even a range, it's a (by default
empty) set.
Fixes: 4548b683b781 ("Introduce a sysctl that modifies the value of PROT_SOCK.")
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
After the following commit:
05b042a19443: ("x86/pti/32: Calculate the various PTI cpu_entry_area sizes correctly, make the CPU_ENTRY_AREA_PAGES assert precise")
'struct cpu_entry_area' has to be Kconfig invariant, so that we always
have a matching CPU_ENTRY_AREA_PAGES size.
This commit added a CONFIG_X86_IOPL_IOPERM dependency to tss_struct:
111e7b15cf10: ("x86/ioperm: Extend IOPL config to control ioperm() as well")
Which, if CONFIG_X86_IOPL_IOPERM is turned off, reduces the size of
cpu_entry_area by two pages, triggering the assert:
./include/linux/compiler.h:391:38: error: call to ‘__compiletime_assert_202’ declared with attribute error: BUILD_BUG_ON failed: (CPU_ENTRY_AREA_PAGES+1)*PAGE_SIZE != CPU_ENTRY_AREA_MAP_SIZE
Simplify the Kconfig dependencies and make cpu_entry_area constant
size on 32-bit kernels again.
Fixes: 05b042a19443: ("x86/pti/32: Calculate the various PTI cpu_entry_area sizes correctly, make the CPU_ENTRY_AREA_PAGES assert precise")
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
This reverts commit 0be0ee71816b2b6725e2b4f32ad6726c9d729777.
I was hoping it would be benign to switch over entirely to FMODE_STREAM,
and we'd have just a couple of small fixups we'd need, but it looks like
we're not quite there yet.
While it worked fine on both my desktop and laptop, they are fairly
similar in other respects, and run mostly the same loads. Kenneth
Crudup reports that it seems to break both his vmware installation and
the KDE upower service. In both cases apparently leading to timeouts
due to waiting for the f_pos lock.
There are a number of character devices in particular that definitely
want stream-like behavior, but that currently don't get marked as
streams, and as a result get the exclusion between concurrent
read()/write() on the same file descriptor. Which doesn't work well for
them.
The most obvious example of this is /dev/console and /dev/tty, which use
console_fops and tty_fops respectively (and ptmx_fops for the pty master
side). It may be that it's just this that causes problems, but we
clearly weren't ready yet.
Because there's a number of other likely common cases that don't have
llseek implementations and would seem to act as stream devices:
/dev/fuse (fuse_dev_operations)
/dev/mcelog (mce_chrdev_ops)
/dev/mei0 (mei_fops)
/dev/net/tun (tun_fops)
/dev/nvme0 (nvme_dev_fops)
/dev/tpm0 (tpm_fops)
/proc/self/ns/mnt (ns_file_operations)
/dev/snd/pcm* (snd_pcm_f_ops[])
and while some of these could be trivially automatically detected by the
vfs layer when the character device is opened by just noticing that they
have no read or write operations either, it often isn't that obvious.
Some character devices most definitely do use the file position, even if
they don't allow seeking: the firmware update code, for example, uses
simple_read_from_buffer() that does use f_pos, but doesn't allow seeking
back and forth.
We'll revisit this when there's a better way to detect the problem and
fix it (possibly with a coccinelle script to do more of the FMODE_STREAM
annotations).
Reported-by: Kenneth R. Crudup <kenny@panix.com>
Cc: Kirill Smelkov <kirr@nexedi.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
In commit 4f07b80c9733 ("tipc: check msg->req data len in
tipc_nl_compat_bearer_disable") the same patch code was copied into
routines: tipc_nl_compat_bearer_disable(),
tipc_nl_compat_link_stat_dump() and tipc_nl_compat_link_reset_stats().
The two link routine occurrences should have been modified to check
the maximum link name length and not bearer name length.
Fixes: 4f07b80c9733 ("tipc: check msg->req data len in tipc_nl_compat_bearer_disable")
Signed-off-by: John Rutherford <john.rutherford@dektech.com.au>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This field contains a pointer to addrlen and checking to see if it's set
returns -EINVAL if the caller sets addr & addrlen pointers.
Fixes: 17f2fe35d080 ("io_uring: add support for IORING_OP_ACCEPT")
Signed-off-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|