aboutsummaryrefslogtreecommitdiffstatshomepage
path: root/tools/perf/scripts/python/export-to-postgresql.py (unfollow)
AgeCommit message (Collapse)AuthorFilesLines
2023-08-11io_uring: never overflow io_aux_cqePavel Begunkov5-14/+16
Now all callers of io_aux_cqe() set allow_overflow to false, remove the parameter and not allow overflowing auxilary multishot cqes. When CQ is full the function callers and all multishot requests in general are expected to complete the request. That prevents indefinite in-background grows of the overflow list and let's the userspace to handle the backlog at its own pace. Resubmitting a request should also be faster than accounting a bunch of overflows, so it should be better for perf when it happens, but a well behaving userspace should be trying to avoid overflows in any case. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/bb20d14d708ea174721e58bb53786b0521e4dd6d.1691757663.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-11io_uring: remove return from io_req_cqe_overflow()Pavel Begunkov2-5/+5
Nobody checks io_req_cqe_overflow()'s return, make it return void. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/8f2029ad0c22f73451664172d834372608ee0a77.1691757663.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-11io_uring: open code io_fill_cqe_req()Pavel Begunkov3-14/+7
io_fill_cqe_req() is only called from one place, open code it, and rename __io_fill_cqe_req(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f432ce75bb1c94cadf0bd2add4d6aa510bd1fb36.1691757663.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-11io_uring/net: don't overflow multishot recvPavel Begunkov1-1/+1
Don't allow overflowing multishot recv CQEs, it might get out of hand, hurt performance, and in the worst case scenario OOM the task. Cc: stable@vger.kernel.org Fixes: b3fdea6ecb55c ("io_uring: multishot recv") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0b295634e8f1b71aa764c984608c22d85f88f75c.1691757663.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-11io_uring/net: don't overflow multishot acceptPavel Begunkov1-1/+1
Don't allow overflowing multishot accept CQEs, we want to limit the grows of the overflow list. Cc: stable@vger.kernel.org Fixes: 4e86a2c980137 ("io_uring: implement multishot mode for accept") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/7d0d749649244873772623dd7747966f516fe6e2.1691757663.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-11io_uring/io-wq: don't gate worker wake up success on wake_up_process()Jens Axboe1-4/+7
All we really care about is finding a free worker. If said worker is already running, it's either starting new work already or it's just finishing up existing work. For the latter, we'll be finding this work item next anyway, and for the former, if the worker does go to sleep, it'll create a new worker anyway as we have pending items. This reduces try_to_wake_up() overhead considerably: 23.16% -10.46% [kernel.kallsyms] [k] try_to_wake_up Reviewed-by: Hao Xu <howeyxu@tencent.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-11io_uring/io-wq: reduce frequency of acct->lock acquisitionsJens Axboe1-13/+34
When we check if we have work to run, we grab the acct lock, check, drop it, and then return the result. If we do have work to run, then running the work will again grab acct->lock and get the work item. This causes us to grab acct->lock more frequently than we need to. If we have work to do, have io_acct_run_queue() return with the acct lock still acquired. io_worker_handle_work() is then always invoked with the acct lock already held. In a simple test cases that stats files (IORING_OP_STATX always hits io-wq), we see a nice reduction in locking overhead with this change: 19.32% -12.55% [kernel.kallsyms] [k] __cmpwait_case_32 20.90% -12.07% [kernel.kallsyms] [k] queued_spin_lock_slowpath Reviewed-by: Hao Xu <howeyxu@tencent.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-11io_uring/io-wq: don't grab wq->lock for worker activationJens Axboe1-3/+0
The worker free list is RCU protected, and checks for workers going away when iterating it. There's no need to hold the wq->lock around the lookup. Reviewed-by: Hao Xu <howeyxu@tencent.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-10io_uring: remove unnecessary forward declarationJens Axboe1-1/+0
We never use io_move_task_work_from_local() before it's defined in the file anyway, so kill the forward declaration. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-10io_uring: have io_file_put() take an io_kiocb rather than the fileJens Axboe2-7/+5
No functional changes in this patch, just a prep patch for needing the request in io_file_put(). Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-10io_uring/splice: use fput() directlyJens Axboe1-2/+2
No point in using io_file_put() here, as we need to check if it's a fixed file in the caller anyway. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-10io_uring/fdinfo: get rid of ref trygetJens Axboe1-12/+6
The caller holds a reference to the ring itself, so by definition the ring cannot go away. There's no need to play games with tryget for the reference, as we don't need an extra reference at all. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-09io_uring: cleanup 'ret' handling in io_iopoll_check()Jens Axboe1-7/+10
We return 0 for success, or -error when there's an error. Move the 'ret' variable into the loop where we are actually using it, to make it clearer that we don't carry this variable forward for return outside of the loop. While at it, also move the need_resched() break condition out of the while check itself, keeping it with the signal pending check. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-09io_uring: break iopolling on signalPavel Begunkov1-0/+3
Don't keep spinning iopoll with a signal set. It'll eventually return back, e.g. by virtue of need_resched(), but it's not a nice user experience. Cc: stable@vger.kernel.org Fixes: def596e9557c9 ("io_uring: support for IO polling") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/eeba551e82cad12af30c3220125eb6cb244cc94c.1691594339.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-09io_uring: kill io_uring userspace examplesPavel Begunkov10-1441/+0
There are tons of io_uring tests and examples in liburing and on the Internet. If you're looking for a benchmark, io_uring-bench.c is just an acutely outdated version of fio/io_uring. And for basic condensed init template for likes of selftests take a peek at io_uring_zerocopy_tx.c. Kill tools/io_uring/, it's a burden keeping it here. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/7c740701d3b475dcad8c92602a551044f72176b4.1691543666.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-09io_uring: fix false positive KASAN warningsPavel Begunkov2-2/+0
io_req_local_work_add() peeks into the work list, which can be executed in the meanwhile. It's completely fine without KASAN as we're in an RCU read section and it's SLAB_TYPESAFE_BY_RCU. With KASAN though it may trigger a false positive warning because internal io_uring caches are sanitised. Remove sanitisation from the io_uring request cache for now. Cc: stable@vger.kernel.org Fixes: 8751d15426a31 ("io_uring: reduce scheduling due to tw") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c6fbf7a82a341e66a0007c76eefd9d57f2d3ba51.1691541473.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-09io_uring: fix drain stalls by invalid SQEPavel Begunkov1-0/+2
cq_extra is protected by ->completion_lock, which io_get_sqe() misses. The bug is harmless as it doesn't happen in real life, requires invalid SQ index array and racing with submission, and only messes up the userspace, i.e. stall requests execution but will be cleaned up on ring destruction. Fixes: 15641e427070f ("io_uring: don't cache number of dropped SQEs") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/66096d54651b1a60534bb2023f2947f09f50ef73.1691538547.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-09io_uring/rsrc: Remove unused declaration io_rsrc_put_tw()Yue Haibing1-1/+0
Commit 36b9818a5a84 ("io_uring/rsrc: don't offload node free") removed the implementation but leave declaration. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Link: https://lore.kernel.org/r/20230808151058.4572-1-yuehaibing@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-09io_uring: annotate the struct io_kiocb slab for appropriate user copyJens Axboe1-2/+14
When compiling the kernel with clang and having HARDENED_USERCOPY enabled, the liburing openat2.t test case fails during request setup: usercopy: Kernel memory overwrite attempt detected to SLUB object 'io_kiocb' (offset 24, size 24)! ------------[ cut here ]------------ kernel BUG at mm/usercopy.c:102! invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC CPU: 3 PID: 413 Comm: openat2.t Tainted: G N 6.4.3-g6995e2de6891-dirty #19 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014 RIP: 0010:usercopy_abort+0x84/0x90 Code: ce 49 89 ce 48 c7 c3 68 48 98 82 48 0f 44 de 48 c7 c7 56 c6 94 82 4c 89 de 48 89 c1 41 52 41 56 53 e8 e0 51 c5 00 48 83 c4 18 <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 41 57 41 56 RSP: 0018:ffffc900016b3da0 EFLAGS: 00010296 RAX: 0000000000000062 RBX: ffffffff82984868 RCX: 4e9b661ac6275b00 RDX: ffff8881b90ec580 RSI: ffffffff82949a64 RDI: 00000000ffffffff RBP: 0000000000000018 R08: 0000000000000000 R09: 0000000000000000 R10: ffffc900016b3c88 R11: ffffc900016b3c30 R12: 00007ffe549659e0 R13: ffff888119014000 R14: 0000000000000018 R15: 0000000000000018 FS: 00007f862e3ca680(0000) GS:ffff8881b90c0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00005571483542a8 CR3: 0000000118c11000 CR4: 00000000003506e0 Call Trace: <TASK> ? __die_body+0x63/0xb0 ? die+0x9d/0xc0 ? do_trap+0xa7/0x180 ? usercopy_abort+0x84/0x90 ? do_error_trap+0xc6/0x110 ? usercopy_abort+0x84/0x90 ? handle_invalid_op+0x2c/0x40 ? usercopy_abort+0x84/0x90 ? exc_invalid_op+0x2f/0x40 ? asm_exc_invalid_op+0x16/0x20 ? usercopy_abort+0x84/0x90 __check_heap_object+0xe2/0x110 __check_object_size+0x142/0x3d0 io_openat2_prep+0x68/0x140 io_submit_sqes+0x28a/0x680 __se_sys_io_uring_enter+0x120/0x580 do_syscall_64+0x3d/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 RIP: 0033:0x55714834de26 Code: ca 01 0f b6 82 d0 00 00 00 8b ba cc 00 00 00 45 31 c0 31 d2 41 b9 08 00 00 00 83 e0 01 c1 e0 04 41 09 c2 b8 aa 01 00 00 0f 05 <c3> 66 0f 1f 84 00 00 00 00 00 89 30 eb 89 0f 1f 40 00 8b 00 a8 06 RSP: 002b:00007ffe549659c8 EFLAGS: 00000246 ORIG_RAX: 00000000000001aa RAX: ffffffffffffffda RBX: 00007ffe54965a50 RCX: 000055714834de26 RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000003 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000008 R10: 0000000000000000 R11: 0000000000000246 R12: 000055714834f057 R13: 00007ffe54965a50 R14: 0000000000000001 R15: 0000557148351dd8 </TASK> Modules linked in: ---[ end trace 0000000000000000 ]--- when it tries to copy struct open_how from userspace into the per-command space in the io_kiocb. There's nothing wrong with the copy, but we're missing the appropriate annotations for allowing user copies to/from the io_kiocb slab. Allow copies in the per-command area, which is from the 'file' pointer to when 'opcode' starts. We do have existing user copies there, but they are not all annotated like the one that openat2_prep() uses, copy_struct_from_user(). But in practice opcodes should be allowed to copy data into their per-command area in the io_kiocb. Reported-by: Breno Leitao <leitao@debian.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-09io_uring: Add io_uring command support for socketsBreno Leitao4-0/+44
Enable io_uring commands on network sockets. Create two new SOCKET_URING_OP commands that will operate on sockets. In order to call ioctl on sockets, use the file_operations->io_uring_cmd callbacks, and map it to a uring socket function, which handles the SOCKET_URING_OP accordingly, and calls socket ioctls. This patches was tested by creating a new test case in liburing. Link: https://github.com/leitao/liburing/tree/io_uring_cmd Signed-off-by: Breno Leitao <leitao@debian.org> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230627134424.2784797-1-leitao@debian.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-17io_uring/cancel: wire up IORING_ASYNC_CANCEL_OP for sync cancelJens Axboe2-4/+11
Allow usage of IORING_ASYNC_CANCEL_OP through the sync cancelation API as well. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-17io_uring/cancel: support opcode based lookup and cancelationJens Axboe4-5/+19
Add IORING_ASYNC_CANCEL_OP flag for cancelation, which allows the application to target cancelation based on the opcode of the original request. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-17io_uring/cancel: add IORING_ASYNC_CANCEL_USERDATAJens Axboe2-6/+14
Add a flag to explicitly match on user_data in the request for cancelation purposes. This is the default behavior if none of the other match flags are set, but if we ALSO want to match on user_data, then this flag can be set. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-17io_uring: use cancelation match helper for poll and timeout requestsJens Axboe2-17/+7
Get rid of the request vs io_cancel_data checking and just use the exported helper for this. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-17io_uring/cancel: fix sequence matching for IORING_ASYNC_CANCEL_ANYJens Axboe1-2/+3
We always need to check/update the cancel sequence if IORING_ASYNC_CANCEL_ALL is set. Also kill the redundant check for IORING_ASYNC_CANCEL_ANY at the end, if we get here we know it's not set as we would've matched it higher up. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-17io_uring/cancel: abstract out request match helperJens Axboe2-4/+14
We have different match code in a variety of spots. Start the cleanup of this by abstracting out a helper that can be used to check if a given request matches the cancelation criteria outlined in io_cancel_data. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-17io_uring/timeout: always set 'ctx' in io_cancel_dataJens Axboe1-2/+2
In preparation for using a generic handler to match requests for cancelation purposes, ensure that ctx is set in io_cancel_data. The timeout handlers don't check for this as it'll always match, but we'll need it set going forward. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-17io_uring/poll: always set 'ctx' in io_cancel_dataJens Axboe1-1/+1
This isn't strictly necessary for this callsite, as it uses it's internal lookup for this cancelation purpose. But let's be consistent with how it's used in general and set ctx as well. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-16Linux 6.5-rc2Linus Torvalds1-1/+1
2023-07-14cifs: fix mid leak during reconnection after timeout thresholdShyam Prasad N1-4/+15
When the number of responses with status of STATUS_IO_TIMEOUT exceeds a specified threshold (NUM_STATUS_IO_TIMEOUT), we reconnect the connection. But we do not return the mid, or the credits returned for the mid, or reduce the number of in-flight requests. This bug could result in the server->in_flight count to go bad, and also cause a leak in the mids. This change moves the check to a few lines below where the response is decrypted, even of the response is read from the transform header. This way, the code for returning the mids can be reused. Also, the cifs_reconnect was reconnecting just the transport connection before. In case of multi-channel, this may not be what we want to do after several timeouts. Changed that to reconnect the session and the tree too. Also renamed NUM_STATUS_IO_TIMEOUT to a more appropriate name MAX_STATUS_IO_TIMEOUT. Fixes: 8e670f77c4a5 ("Handle STATUS_IO_TIMEOUT gracefully") Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2023-07-14cifs: is_network_name_deleted should return a boolShyam Prasad N3-7/+14
Currently, is_network_name_deleted and it's implementations do not return anything if the network name did get deleted. So the function doesn't fully achieve what it advertizes. Changed the function to return a bool instead. It will now return true if the error returned is STATUS_NETWORK_NAME_DELETED and the share (tree id) was found to be connected. It returns false otherwise. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Acked-by: Paulo Alcantara (SUSE) <pc@manguebit.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2023-07-14block: queue data commands from the flush state machine at the headChristoph Hellwig1-1/+1
We used to insert the data commands following a pre-flush to the head of the queue until commit 1e82fadfc6b ("blk-mq: do not do head insertions post-pre-flush commands"). Not doing this seems to cause hangs of such commands on NFS workloads when exported from file systems with SATA SSDs. I have no idea why this would starve these workloads, but doing a semantic revert of this patch (which looks quite different due to various other changes) fixes the hangs. Fixes: 1e82fadfc6b ("blk-mq: do not do head insertions post-pre-flush commands") Reported-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Chuck Lever <chuck.lever@oracle.com> Link: https://lore.kernel.org/r/20230714143014.11879-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-14iommu/sva: Fix signedness bug in iommu_sva_alloc_pasid()Dan Carpenter1-1/+2
The ida_alloc_range() function returns negative error codes on error. On success it returns values in the min to max range (inclusive). It never returns more then INT_MAX even if "max" is higher. It never returns values in the 0 to (min - 1) range. The bug is that "min" is an unsigned int so negative error codes will be promoted to high positive values errors treated as success. Fixes: 1a14bf0fc7ed ("iommu/sva: Use GFP_KERNEL for pasid allocation") Signed-off-by: Dan Carpenter <error27@gmail.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Link: https://lore.kernel.org/r/6b32095d-7491-4ebb-a850-12e96209eaaf@kili.mountain Signed-off-by: Joerg Roedel <jroedel@suse.de>
2023-07-14iommu: Fix crash during syfs iommu_groups/N/typeJason Gunthorpe1-13/+14
The err_restore_domain flow was accidently inserted into the success path in commit 1000dccd5d13 ("iommu: Allow IOMMU_RESV_DIRECT to work on ARM"). It should only happen if iommu_create_device_direct_mappings() fails. This caused the domains the be wrongly changed and freed whenever the sysfs is used, resulting in an oops: BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP NOPTI CPU: 1 PID: 3417 Comm: avocado Not tainted 6.4.0-rc4-next-20230602 #3 Hardware name: Dell Inc. PowerEdge R6515/07PXPY, BIOS 2.3.6 07/06/2021 RIP: 0010:__iommu_attach_device+0xc/0xa0 Code: c0 c3 cc cc cc cc 48 89 f0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 41 54 55 48 8b 47 08 <48> 8b 00 48 85 c0 74 74 48 89 f5 e8 64 12 49 00 41 89 c4 85 c0 74 RSP: 0018:ffffabae0220bd48 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff9ac04f70e410 RCX: 0000000000000001 RDX: ffff9ac044db20c0 RSI: ffff9ac044fa50d0 RDI: ffff9ac04f70e410 RBP: ffff9ac044fa50d0 R08: 1000000100209001 R09: 00000000000002dc R10: 0000000000000000 R11: 0000000000000000 R12: ffff9ac043d54700 R13: ffff9ac043d54700 R14: 0000000000000001 R15: 0000000000000001 FS: 00007f02e30ae000(0000) GS:ffff9afeb2440000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 000000012afca006 CR4: 0000000000770ee0 PKRU: 55555554 Call Trace: <TASK> ? __die+0x24/0x70 ? page_fault_oops+0x82/0x150 ? __iommu_queue_command_sync+0x80/0xc0 ? exc_page_fault+0x69/0x150 ? asm_exc_page_fault+0x26/0x30 ? __iommu_attach_device+0xc/0xa0 ? __iommu_attach_device+0x1c/0xa0 __iommu_device_set_domain+0x42/0x80 __iommu_group_set_domain_internal+0x5d/0x160 iommu_setup_default_domain+0x318/0x400 iommu_group_store_type+0xb1/0x200 kernfs_fop_write_iter+0x12f/0x1c0 vfs_write+0x2a2/0x3b0 ksys_write+0x63/0xe0 do_syscall_64+0x3f/0x90 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 RIP: 0033:0x7f02e2f14a6f Reorganize the error flow so that the success branch and error branches are clearer. Fixes: 1000dccd5d13 ("iommu: Allow IOMMU_RESV_DIRECT to work on ARM") Reported-by: Dheeraj Kumar Srivastava <dheerajkumar.srivastava@amd.com> Tested-by: Vasant Hegde <vasant.hegde@amd.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/0-v1-5bd8cc969d9e+1f1-iommu_set_def_fix_jgg@nvidia.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2023-07-14tracing/probes: Fix to record 0-length data_loc in fetch_store_string*() if failsMasami Hiramatsu (Google)3-12/+14
Fix to record 0-length data to data_loc in fetch_store_string*() if it fails to get the string data. Currently those expect that the data_loc is updated by store_trace_args() if it returns the error code. However, that does not work correctly if the argument is an array of strings. In that case, store_trace_args() only clears the first entry of the array (which may have no error) and leaves other entries. So it should be cleared by fetch_store_string*() itself. Also, 'dyndata' and 'maxlen' in store_trace_args() should be updated only if it is used (ret > 0 and argument is a dynamic data.) Link: https://lore.kernel.org/all/168908496683.123124.4761206188794205601.stgit@devnote2/ Fixes: 40b53b771806 ("tracing: probeevent: Add array type support") Cc: stable@vger.kernel.org Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-07-13blk-mq: fix start_time_ns and alloc_time_ns for pre-allocated rqChengming Zhou1-17/+30
The iocost rely on rq start_time_ns and alloc_time_ns to tell saturation state of the block device. Most of the time request is allocated after rq_qos_throttle() and its alloc_time_ns or start_time_ns won't be affected. But for plug batched allocation introduced by the commit 47c122e35d7e ("block: pre-allocate requests if plug is started and is a batch"), we can rq_qos_throttle() after the allocation of the request. This is what the blk_mq_get_cached_request() does. In this case, the cached request alloc_time_ns or start_time_ns is much ahead if blocked in any qos ->throttle(). Fix it by setting alloc_time_ns and start_time_ns to now when the allocated request is actually used. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230710105516.2053478-1-chengming.zhou@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-13sparc: mark __arch_xchg() as __always_inlineArnd Bergmann2-2/+2
An otherwise correct change to the atomic operations uncovered an existing bug in the sparc __arch_xchg() function, which is calls __xchg_called_with_bad_pointer() when its arguments are unknown at compile time: ERROR: modpost: "__xchg_called_with_bad_pointer" [lib/atomic64_test.ko] undefined! This now happens because gcc determines that it's better to not inline the function. Avoid this by just marking the function as __always_inline to force the compiler to do the right thing here. Reported-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/all/c525adc9-6623-4660-8718-e0c9311563b8@roeck-us.net/ Fixes: d12157efc8e08 ("locking/atomic: make atomic*_{cmp,}xchg optional") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Palmer Dabbelt <palmer@rivosinc.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Sam Ravnborg <sam@ravnborg.org> Acked-by: Guenter Roeck <linux@roeck-us.net> Acked-by: Andi Shyti <andi.shyti@linux.intel.com> Link: https://lore.kernel.org/r/20230628094938.2318171-1-arnd@kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>
2023-07-13MAINTAINERS: Foolishly claim maintainership of string routinesKees Cook1-1/+4
Since the string API is tightly coupled with FORTIFY_SOURCE, I am offering myself up as maintainer for it. Thankfully Andy is already a reviewer and can keep me on the straight and narrow. Acked-by: Andy Shevchenko <andy.shevchenko@gmail.com> Link: https://lore.kernel.org/r/20230712194625.never.252-kees@kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>
2023-07-14Revert "tracing: Add "(fault)" name injection to kernel probes"Masami Hiramatsu (Google)3-26/+9
This reverts commit 2e9906f84fc7c99388bb7123ade167250d50f1c0. It was turned out that commit 2e9906f84fc7 ("tracing: Add "(fault)" name injection to kernel probes") did not work correctly and probe events still show just '(fault)' (instead of '"(fault)"'). Also, current '(fault)' is more explicit that it faulted. This also moves FAULT_STRING macro to trace.h so that synthetic event can keep using it, and uses it in trace_probe.c too. Link: https://lore.kernel.org/all/168908495772.123124.1250788051922100079.stgit@devnote2/ Link: https://lore.kernel.org/all/20230706230642.3793a593@rorschach.local.home/ Cc: stable@vger.kernel.org Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-14tracing/probes: Fix to update dynamic data counter if fetcharg uses itMasami Hiramatsu (Google)1-5/+7
Fix to update dynamic data counter ('dyndata') and max length ('maxlen') only if the fetcharg uses the dynamic data. Also get out arg->dynamic from unlikely(). This makes dynamic data address wrong if process_fetch_insn() returns error on !arg->dynamic case. Link: https://lore.kernel.org/all/168908494781.123124.8160245359962103684.stgit@devnote2/ Suggested-by: Steven Rostedt <rostedt@goodmis.org> Link: https://lore.kernel.org/all/20230710233400.5aaf024e@gandalf.local.home/ Fixes: 9178412ddf5a ("tracing: probeevent: Return consumed bytes of dynamic area") Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-14tracing/probes: Fix not to count error code to total lengthMasami Hiramatsu (Google)1-0/+2
Fix not to count the error code (which is minus value) to the total used length of array, because it can mess up the return code of process_fetch_insn_bottom(). Also clear the 'ret' value because it will be used for calculating next data_loc entry. Link: https://lore.kernel.org/all/168908493827.123124.2175257289106364229.stgit@devnote2/ Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/all/8819b154-2ba1-43c3-98a2-cbde20892023@moroto.mountain/ Fixes: 9b960a38835f ("tracing: probeevent: Unify fetch_insn processing common part") Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-14tracing/probes: Fix to avoid double count of the string length on the arrayMasami Hiramatsu (Google)1-2/+2
If an array is specified with the ustring or symstr, the length of the strings are accumlated on both of 'ret' and 'total', which means the length is double counted. Just set the length to the 'ret' value for avoiding double counting. Link: https://lore.kernel.org/all/168908492917.123124.15076463491122036025.stgit@devnote2/ Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/all/8819b154-2ba1-43c3-98a2-cbde20892023@moroto.mountain/ Fixes: 88903c464321 ("tracing/probe: Add ustring type for user-space string") Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-14fprobes: Add a comment why fprobe_kprobe_handler exits if kprobe is runningMasami Hiramatsu (Google)1-0/+6
Add a comment the reason why fprobe_kprobe_handler() exits if any other kprobe is running. Link: https://lore.kernel.org/all/168874788299.159442.2485957441413653858.stgit@devnote2/ Suggested-by: Steven Rostedt <rostedt@goodmis.org> Link: https://lore.kernel.org/all/20230706120916.3c6abf15@gandalf.local.home/ Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-13nvme-pci: fix DMA direction of unmapping integrity dataMing Lei1-1/+1
DMA direction should be taken in dma_unmap_page() for unmapping integrity data. Fix this DMA direction, and reported in Guangwu's test. Reported-by: Guangwu Zhang <guazhang@redhat.com> Fixes: 4aedb705437f ("nvme-pci: split metadata handling from nvme_map_data / nvme_unmap_data") Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-07-13nvme: don't reject probe due to duplicate IDs for single-ported PCIe devicesChristoph Hellwig1-3/+33
While duplicate IDs are still very harmful, including the potential to easily see changing devices in /dev/disk/by-id, it turn out they are extremely common for cheap end user NVMe devices. Relax our check for them for so that it doesn't reject the probe on single-ported PCIe devices, but prints a big warning instead. In doubt we'd still like to see quirk entries to disable the potential for changing supposed stable device identifier links, but this will at least allow users how have two (or more) of these devices to use them without having to manually add a new PCI ID entry with the quirk through sysfs or by patching the kernel. Fixes: 2079f41ec6ff ("nvme: check that EUI/GUID/UUID are globally unique") Cc: stable@vger.kernel.org # 6.0+ Co-developed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-07-13tracing: Fix memory leak of iter->temp when reading trace_pipeZheng Yejian1-0/+1
kmemleak reports: unreferenced object 0xffff88814d14e200 (size 256): comm "cat", pid 336, jiffies 4294871818 (age 779.490s) hex dump (first 32 bytes): 04 00 01 03 00 00 00 00 08 00 00 00 00 00 00 00 ................ 0c d8 c8 9b ff ff ff ff 04 5a ca 9b ff ff ff ff .........Z...... backtrace: [<ffffffff9bdff18f>] __kmalloc+0x4f/0x140 [<ffffffff9bc9238b>] trace_find_next_entry+0xbb/0x1d0 [<ffffffff9bc9caef>] trace_print_lat_context+0xaf/0x4e0 [<ffffffff9bc94490>] print_trace_line+0x3e0/0x950 [<ffffffff9bc95499>] tracing_read_pipe+0x2d9/0x5a0 [<ffffffff9bf03a43>] vfs_read+0x143/0x520 [<ffffffff9bf04c2d>] ksys_read+0xbd/0x160 [<ffffffff9d0f0edf>] do_syscall_64+0x3f/0x90 [<ffffffff9d2000aa>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 when reading file 'trace_pipe', 'iter->temp' is allocated or relocated in trace_find_next_entry() but not freed before 'trace_pipe' is closed. To fix it, free 'iter->temp' in tracing_release_pipe(). Link: https://lore.kernel.org/linux-trace-kernel/20230713141435.1133021-1-zhengyejian1@huawei.com Cc: stable@vger.kernel.org Fixes: ff895103a84ab ("tracing: Save off entry when peeking at next entry") Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-13libceph: harden msgr2.1 frame segment length checksIlya Dryomov1-15/+26
ceph_frame_desc::fd_lens is an int array. decode_preamble() thus effectively casts u32 -> int but the checks for segment lengths are written as if on unsigned values. While reading in HELLO or one of the AUTH frames (before authentication is completed), arithmetic in head_onwire_len() can get duped by negative ctrl_len and produce head_len which is less than CEPH_PREAMBLE_LEN but still positive. This would lead to a buffer overrun in prepare_read_control() as the preamble gets copied to the newly allocated buffer of size head_len. Cc: stable@vger.kernel.org Fixes: cd1a677cad99 ("libceph, ceph: implement msgr2.1 protocol (crc and secure modes)") Reported-by: Thelford Williams <thelford@google.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Xiubo Li <xiubli@redhat.com>
2023-07-13selftests: tc-testing: add test for qfq with stab overheadPedro Tammela1-0/+38
A packet with stab overhead greater than QFQ_MAX_LMAX should be dropped by the QFQ qdisc as it can't handle such lengths. Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Tested-by: Zhengchao Shao <shaozhengchao@huawei.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-07-13net/sched: sch_qfq: account for stab overhead in qfq_enqueuePedro Tammela1-1/+6
Lion says: ------- In the QFQ scheduler a similar issue to CVE-2023-31436 persists. Consider the following code in net/sched/sch_qfq.c: static int qfq_enqueue(struct sk_buff *skb, struct Qdisc *sch, struct sk_buff **to_free) { unsigned int len = qdisc_pkt_len(skb), gso_segs; // ... if (unlikely(cl->agg->lmax < len)) { pr_debug("qfq: increasing maxpkt from %u to %u for class %u", cl->agg->lmax, len, cl->common.classid); err = qfq_change_agg(sch, cl, cl->agg->class_weight, len); if (err) { cl->qstats.drops++; return qdisc_drop(skb, sch, to_free); } // ... } Similarly to CVE-2023-31436, "lmax" is increased without any bounds checks according to the packet length "len". Usually this would not impose a problem because packet sizes are naturally limited. This is however not the actual packet length, rather the "qdisc_pkt_len(skb)" which might apply size transformations according to "struct qdisc_size_table" as created by "qdisc_get_stab()" in net/sched/sch_api.c if the TCA_STAB option was set when modifying the qdisc. A user may choose virtually any size using such a table. As a result the same issue as in CVE-2023-31436 can occur, allowing heap out-of-bounds read / writes in the kmalloc-8192 cache. ------- We can create the issue with the following commands: tc qdisc add dev $DEV root handle 1: stab mtu 2048 tsize 512 mpu 0 \ overhead 999999999 linklayer ethernet qfq tc class add dev $DEV parent 1: classid 1:1 htb rate 6mbit burst 15k tc filter add dev $DEV parent 1: matchall classid 1:1 ping -I $DEV 1.1.1.2 This is caused by incorrectly assuming that qdisc_pkt_len() returns a length within the QFQ_MIN_LMAX < len < QFQ_MAX_LMAX. Fixes: 462dbc9101ac ("pkt_sched: QFQ Plus: fair-queueing service at DRR cost") Reported-by: Lion <nnamrec@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-07-13selftests: tc-testing: add tests for qfq mtu sanity checkPedro Tammela1-0/+48
QFQ only supports a certain bound of MTU size so make sure we check for this requirement in the tests. Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Tested-by: Zhengchao Shao <shaozhengchao@huawei.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>