aboutsummaryrefslogtreecommitdiffstatshomepage
path: root/include/uapi/linux/io_uring.h (follow)
AgeCommit message (Collapse)AuthorFilesLines
2023-12-12io_uring/openclose: add support for IORING_OP_FIXED_FD_INSTALLJens Axboe1-0/+9
io_uring can currently open/close regular files or fixed/direct descriptors. Or you can instantiate a fixed descriptor from a regular one, and then close the regular descriptor. But you currently can't turn a purely fixed/direct descriptor into a regular file descriptor. IORING_OP_FIXED_FD_INSTALL adds support for installing a direct descriptor into the normal file table, just like receiving a file descriptor or opening a new file would do. This is all nicely abstracted into receive_fd(), and hence adding support for this is truly trivial. Since direct descriptors are only usable within io_uring itself, it can be useful to turn them into real file descriptors if they ever need to be accessed via normal syscalls. This can either be a transitory thing, or just a permanent transition for a given direct descriptor. By default, new fds are installed with O_CLOEXEC set. The application can disable O_CLOEXEC by setting IORING_FIXED_FD_NO_CLOEXEC in the sqe->install_fd_flags member. Suggested-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-01Merge tag 'io_uring-futex-2023-10-30' of git://git.kernel.dk/linuxLinus Torvalds1-0/+4
Pull io_uring futex support from Jens Axboe: "This adds support for using futexes through io_uring - first futex wake and wait, and then the vectored variant of waiting, futex waitv. For both wait/wake/waitv, we support the bitset variant, as the 'normal' variants can be easily implemented on top of that. PI and requeue are not supported through io_uring, just the above mentioned parts. This may change in the future, but in the spirit of keeping this small (and based on what people have been asking for), this is what we currently have. Wake support is pretty straight forward, most of the thought has gone into the wait side to avoid needing to offload wait operations to a blocking context. Instead, we rely on the usual callbacks to retry and post a completion event, when appropriate. As far as I can recall, the first request for futex support with io_uring came from Andres Freund, working on postgres. His aio rework of postgres was one of the early adopters of io_uring, and futex support was a natural extension for that. This is relevant from both a usability point of view, as well as for effiency and performance. In Andres's words, for the former: Futex wait support in io_uring makes it a lot easier to avoid deadlocks in concurrent programs that have their own buffer pool: Obviously pages in the application buffer pool have to be locked during IO. If the initiator of IO A needs to wait for a held lock B, the holder of lock B might wait for the IO A to complete. The ability to wait for a lock and IO completions at the same time provides an efficient way to avoid such deadlocks and in terms of effiency, even without unlocking the full potential yet, Andres says: Futex wake support in io_uring is useful because it allows for more efficient directed wakeups. For some "locks" postgres has queues implemented in userspace, with wakeup logic that cannot easily be implemented with FUTEX_WAKE_BITSET on a single "futex word" (imagine waiting for journal flushes to have completed up to a certain point). Thus a "lock release" sometimes need to wake up many processes in a row. A quick-and-dirty conversion to doing these wakeups via io_uring lead to a 3% throughput increase, with 12% fewer context switches, albeit in a fairly extreme workload" * tag 'io_uring-futex-2023-10-30' of git://git.kernel.dk/linux: io_uring: add support for vectored futex waits futex: make the vectored futex operations available futex: make futex_parse_waitv() available as a helper futex: add wake_data to struct futex_q io_uring: add support for futex wake and wait futex: abstract out a __futex_wake_mark() helper futex: factor out the futex wake handling futex: move FUTEX2_VALID_MASK to futex.h
2023-10-19io_uring/cmd: Introduce SOCKET_URING_OP_SETSOCKOPTBreno Leitao1-0/+1
Add initial support for SOCKET_URING_OP_SETSOCKOPT. This new command is similar to setsockopt. This implementation leverages the function do_sock_setsockopt(), which is shared with the setsockopt() system call path. Important to say that userspace needs to keep the pointer's memory alive until the operation is completed. I.e, the memory could not be deallocated before the CQE is returned to userspace. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20231016134750.1381153-11-leitao@debian.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-19io_uring/cmd: Introduce SOCKET_URING_OP_GETSOCKOPTBreno Leitao1-0/+7
Add support for getsockopt command (SOCKET_URING_OP_GETSOCKOPT), where level is SOL_SOCKET. This is leveraging the sockptr_t infrastructure, where a sockptr_t is either userspace or kernel space, and handled as such. Differently from the getsockopt(2), the optlen field is not a userspace pointers. In getsockopt(2), userspace provides optlen pointer, which is overwritten by the kernel. In this implementation, userspace passes a u32, and the new value is returned in cqe->res. I.e., optlen is not a pointer. Important to say that userspace needs to keep the pointer alive until the CQE is completed. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20231016134750.1381153-10-leitao@debian.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-09-29io_uring: add support for vectored futex waitsJens Axboe1-0/+1
This adds support for IORING_OP_FUTEX_WAITV, which allows registering a notification for a number of futexes at once. If one of the futexes are woken, then the request will complete with the index of the futex that got woken as the result. This is identical to what the normal vectored futex waitv operation does. Use like IORING_OP_FUTEX_WAIT, except sqe->addr must now contain a pointer to a struct futex_waitv array, and sqe->off must now contain the number of elements in that array. As flags are passed in the futex_vector array, and likewise for the value and futex address(es), sqe->addr2 and sqe->addr3 are also reserved for IORING_OP_FUTEX_WAITV. For cancelations, FUTEX_WAITV does not rely on the futex_unqueue() return value as we're dealing with multiple futexes. Instead, a separate per io_uring request atomic is used to claim ownership of the request. Waiting on N futexes could be done with IORING_OP_FUTEX_WAIT as well, but that punts a lot of the work to the application: 1) Application would need to submit N IORING_OP_FUTEX_WAIT requests, rather than just a single IORING_OP_FUTEX_WAITV. 2) When one futex is woken, application would need to cancel the remaining N-1 requests that didn't trigger. While this is of course doable, having a single vectored futex wait makes for much simpler application code. Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-09-29io_uring: add support for futex wake and waitJens Axboe1-0/+3
Add support for FUTEX_WAKE/WAIT primitives. IORING_OP_FUTEX_WAKE is mix of FUTEX_WAKE and FUTEX_WAKE_BITSET, as it does support passing in a bitset. Similary, IORING_OP_FUTEX_WAIT is a mix of FUTEX_WAIT and FUTEX_WAIT_BITSET. For both of them, they are using the futex2 interface. FUTEX_WAKE is straight forward, as those can always be done directly from the io_uring submission without needing async handling. For FUTEX_WAIT, things are a bit more complicated. If the futex isn't ready, then we rely on a callback via futex_queue->wake() when someone wakes up the futex. From that calback, we queue up task_work with the original task, which will post a CQE and wake it, if necessary. Cancelations are supported, both from the application point-of-view, but also to be able to cancel pending waits if the ring exits before all events have occurred. The return value of futex_unqueue() is used to gate who wins the potential race between cancelation and futex wakeups. Whomever gets a 'ret == 1' return from that claims ownership of the io_uring futex request. This is just the barebones wait/wake support. PI or REQUEUE support is not added at this point, unclear if we might look into that later. Likewise, explicit timeouts are not supported either. It is expected that users that need timeouts would do so via the usual io_uring mechanism to do that using linked timeouts. The SQE format is as follows: `addr` Address of futex `fd` futex2(2) FUTEX2_* flags `futex_flags` io_uring specific command flags. None valid now. `addr2` Value of futex `addr3` Mask to wake/wait Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-09-28io_uring: retain top 8bits of uring_cmd flags for kernel internal useMing Lei1-3/+2
Retain top 8bits of uring_cmd flags for kernel internal use, so that we can move IORING_URING_CMD_POLLED out of uapi header. Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-09-21io_uring: add IORING_OP_WAITID supportJens Axboe1-0/+2
This adds support for an async version of waitid(2), in a fully async version. If an event isn't immediately available, wait for a callback to trigger a retry. The format of the sqe is as follows: sqe->len The 'which', the idtype being queried/waited for. sqe->fd The 'pid' (or id) being waited for. sqe->file_index The 'options' being set. sqe->addr2 A pointer to siginfo_t, if any, being filled in. buf_index, add3, and waitid_flags are reserved/unused for now. waitid_flags will be used for options for this request type. One interesting use case may be to add multi-shot support, so that the request stays armed and posts a notification every time a monitored process state change occurs. Note that this does not support rusage, on Arnd's recommendation. See the waitid(2) man page for details on the arguments. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-09-21io_uring/rw: add support for IORING_OP_READ_MULTISHOTJens Axboe1-0/+1
This behaves like IORING_OP_READ, except: 1) It only supports pollable files (eg pipes, sockets, etc). Note that for sockets, you probably want to use recv/recvmsg with multishot instead. 2) It supports multishot mode, meaning it will repeatedly trigger a read and fill a buffer when data is available. This allows similar use to recv/recvmsg but on non-sockets, where a single request will repeatedly post a CQE whenever data is read from it. 3) Because of #2, it must be used with provided buffers. This is uniformly true across any request type that supports multishot and transfers data, with the reason being that it's obviously not possible to pass in a single buffer for the data, as multiple reads may very well trigger before an application has a chance to process previous CQEs and the data passed from them. Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-24io_uring: add option to remove SQ indirectionPavel Begunkov1-0/+5
Not many aware, but io_uring submission queue has two levels. The first level usually appears as sq_array and stores indexes into the actual SQ. To my knowledge, no one has ever seriously used it, nor liburing exposes it to users. Add IORING_SETUP_NO_SQARRAY, when set we don't bother creating and using the sq_array and SQ heads/tails will be pointing directly into the SQ. Improves memory footprint, in term of both allocations as well as cache usage, and also should make io_get_sqe() less branchy in the end. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0ffa3268a5ef61d326201ff43a233315c96312e0.1692916914.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-09io_uring: Add io_uring command support for socketsBreno Leitao1-0/+8
Enable io_uring commands on network sockets. Create two new SOCKET_URING_OP commands that will operate on sockets. In order to call ioctl on sockets, use the file_operations->io_uring_cmd callbacks, and map it to a uring socket function, which handles the SOCKET_URING_OP accordingly, and calls socket ioctls. This patches was tested by creating a new test case in liburing. Link: https://github.com/leitao/liburing/tree/io_uring_cmd Signed-off-by: Breno Leitao <leitao@debian.org> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230627134424.2784797-1-leitao@debian.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-17io_uring/cancel: wire up IORING_ASYNC_CANCEL_OP for sync cancelJens Axboe1-1/+3
Allow usage of IORING_ASYNC_CANCEL_OP through the sync cancelation API as well. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-17io_uring/cancel: support opcode based lookup and cancelationJens Axboe1-0/+2
Add IORING_ASYNC_CANCEL_OP flag for cancelation, which allows the application to target cancelation based on the opcode of the original request. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-17io_uring/cancel: add IORING_ASYNC_CANCEL_USERDATAJens Axboe1-0/+2
Add a flag to explicitly match on user_data in the request for cancelation purposes. This is the default behavior if none of the other match flags are set, but if we ALSO want to match on user_data, then this flag can be set. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-28nvme: improved uring pollingKeith Busch1-0/+2
Drivers can poll requests directly, so use that. We just need to ensure the driver's request was allocated from a polled hctx, so a special driver flag is added to struct io_uring_cmd. The allows unshared and multipath namespaces to use the same polling callback, and multipath is guaranteed to get the same queue as the command was submitted on. Previously multipath polling might check a different path and poll the wrong info. The other bonus is we don't need a bio payload in order to poll, allowing commands like 'flush' and 'write zeroes' to be submitted on the same high priority queue as read and write commands. Finally, using the request based polling skips the unnecessary bio overhead. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230612190343.2087040-3-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-16io_uring: Add io_uring_setup flag to pre-register ring fd and never install itJosh Triplett1-0/+7
With IORING_REGISTER_USE_REGISTERED_RING, an application can register the ring fd and use it via registered index rather than installed fd. This allows using a registered ring for everything *except* the initial mmap. With IORING_SETUP_NO_MMAP, io_uring_setup uses buffers allocated by the user, rather than requiring a subsequent mmap. The combination of the two allows a user to operate *entirely* via a registered ring fd, making it unnecessary to ever install the fd in the first place. So, add a flag IORING_SETUP_REGISTERED_FD_ONLY to make io_uring_setup register the fd and return a registered index, without installing the fd. This allows an application to avoid touching the fd table at all, and allows a library to never even momentarily install a file descriptor. This splits out an io_ring_add_registered_file helper from io_ring_add_registered_fd, for use by io_uring_setup. Signed-off-by: Josh Triplett <josh@joshtriplett.org> Link: https://lore.kernel.org/r/bc8f431bada371c183b95a83399628b605e978a3.1682699803.git.josh@joshtriplett.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-16io_uring: support for user allocated memory for rings/sqesJens Axboe1-2/+7
Currently io_uring applications must call mmap(2) twice to map the rings themselves, and the sqes array. This works fine, but it does not support using huge pages to back the rings/sqes. Provide a way for the application to pass in pre-allocated memory for the rings/sqes, which can then suitably be allocated from shmfs or via mmap to get huge page support. Particularly for larger rings, this reduces the TLBs needed. If an application wishes to take advantage of that, it must pre-allocate the memory needed for the sq/cq ring, and the sqes. The former must be passed in via the io_uring_params->cq_off.user_data field, while the latter is passed in via the io_uring_params->sq_off.user_data field. Then it must set IORING_SETUP_NO_MMAP in the io_uring_params->flags field, and io_uring will then map the existing memory into the kernel for shared use. The application must not call mmap(2) to map rings as it otherwise would have, that will now fail with -EINVAL if this setup flag was used. The pages used for the rings and sqes must be contigious. The intent here is clearly that huge pages should be used, otherwise the normal setup procedure works fine as-is. The application may use one huge page for both the rings and sqes. Outside of those initialization changes, everything works like it did before. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-18io_uring: add support for multishot timeoutsDavid Wei1-0/+1
A multishot timeout submission will repeatedly generate completions with the IORING_CQE_F_MORE cflag set. Depending on the value of the `off' field in the submission, these timeouts can either repeat indefinitely until cancelled (`off' = 0) or for a fixed number of times (`off' > 0). Only noseq timeouts (i.e. not dependent on the number of I/O completions) are supported. An indefinite timer will be cancelled if the CQ ever overflows. Signed-off-by: David Wei <davidhwei@meta.com> Link: https://lore.kernel.org/r/20230418225817.1905027-1-davidhwei@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-03io_uring: kill unused notif declarationsPavel Begunkov1-13/+0
There are two leftover structures from the notification registration mechanism that has never been released, kill them. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f05f65aebaf8b1b5bf28519a8fdb350e3e7c9ad0.1679924536.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-03io_uring: add support for user mapped provided buffer ringJens Axboe1-0/+17
The ring mapped provided buffer rings rely on the application allocating the memory for the ring, and then the kernel will map it. This generally works fine, but runs into issues on some architectures where we need to be able to ensure that the kernel and application virtual address for the ring play nicely together. This at least impacts architectures that set SHM_COLOUR, but potentially also anyone setting SHMLBA. To use this variant of ring provided buffers, the application need not allocate any memory for the ring. Instead the kernel will do so, and the allocation must subsequently call mmap(2) on the ring with the offset set to: IORING_OFF_PBUF_RING | (bgid << IORING_OFF_PBUF_SHIFT) to get a virtual address for the buffer ring. Normally the application would allocate a suitable piece of memory (and correctly aligned) and simply pass that in via io_uring_buf_reg.ring_addr and the kernel would map it. Outside of the setup differences, the kernel allocate + user mapped provided buffer ring works exactly the same. Acked-by: Helge Deller <deller@gmx.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-03io_uring/kbuf: rename struct io_uring_buf_reg 'pad' to'flags'Jens Axboe1-1/+1
In preparation for allowing flags to be set for registration, rename the padding and use it for that. Acked-by: Helge Deller <deller@gmx.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-21Merge tag 'hardening-v6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linuxLinus Torvalds1-1/+1
Pull hardening updates from Kees Cook: "Beyond some specific LoadPin, UBSAN, and fortify features, there are other fixes scattered around in various subsystems where maintainers were okay with me carrying them in my tree or were non-responsive but the patches were reviewed by others: - Replace 0-length and 1-element arrays with flexible arrays in various subsystems (Paulo Miguel Almeida, Stephen Rothwell, Kees Cook) - randstruct: Disable Clang 15 support (Eric Biggers) - GCC plugins: Drop -std=gnu++11 flag (Sam James) - strpbrk(): Refactor to use strchr() (Andy Shevchenko) - LoadPin LSM: Allow root filesystem switching when non-enforcing - fortify: Use dynamic object size hints when available - ext4: Fix CFI function prototype mismatch - Nouveau: Fix DP buffer size arguments - hisilicon: Wipe entire crypto DMA pool on error - coda: Fully allocate sig_inputArgs - UBSAN: Improve arm64 trap code reporting - copy_struct_from_user(): Add minimum bounds check on kernel buffer size" * tag 'hardening-v6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: randstruct: disable Clang 15 support uaccess: Add minimum bounds check on kernel buffer size arm64: Support Clang UBSAN trap codes for better reporting coda: Avoid partial allocation of sig_inputArgs gcc-plugins: drop -std=gnu++11 to fix GCC 13 build lib/string: Use strchr() in strpbrk() crypto: hisilicon: Wipe entire pool on error net/i40e: Replace 0-length array with flexible array io_uring: Replace 0-length array with flexible array ext4: Fix function prototype mismatch for ext4_feat_ktype i915/gvt: Replace one-element array with flexible-array member drm/nouveau/disp: Fix nvif_outp_acquire_dp() argument size LoadPin: Allow filesystem switch when not enforcing LoadPin: Move pin reporting cleanly out of locking LoadPin: Refactor sysctl initialization LoadPin: Refactor read-only check into a helper ARM: ixp4xx: Replace 0-length arrays with flexible arrays fortify: Use __builtin_dynamic_object_size() when available rxrpc: replace zero-lenth array with DECLARE_FLEX_ARRAY() helper
2023-02-16io_uring: Support calling io_uring_register with a registered ring fdJosh Triplett1-1/+5
Add a new flag IORING_REGISTER_USE_REGISTERED_RING (set via the high bit of the opcode) to treat the fd as a registered index rather than a file descriptor. This makes it possible for a library to open an io_uring, register the ring fd, close the ring fd, and subsequently use the ring entirely via registered index. Signed-off-by: Josh Triplett <josh@joshtriplett.org> Link: https://lore.kernel.org/r/f2396369e638284586b069dbddffb8c992afba95.1676419314.git.josh@joshtriplett.org [axboe: remove extra high bit clear] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29io_uring/msg_ring: Pass custom flags to the cqeBreno Leitao1-0/+2
This patch adds a new flag (IORING_MSG_RING_FLAGS_PASS) in the message ring operations (IORING_OP_MSG_RING). This new flag enables the sender to specify custom flags, which will be copied over to cqe->flags in the receiving ring. These custom flags should be specified using the sqe->file_index field. This mechanism provides additional flexibility when sending messages between rings. Signed-off-by: Breno Leitao <leitao@debian.org> Link: https://lore.kernel.org/r/20230103160507.617416-1-leitao@debian.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-27io_uring: Replace 0-length array with flexible arrayKees Cook1-1/+1
Zero-length arrays are deprecated[1]. Replace struct io_uring_buf_ring's "bufs" with a flexible array member. (How is the size of this array verified?) Detected with GCC 13, using -fstrict-flex-arrays=3: In function 'io_ring_buffer_select', inlined from 'io_buffer_select' at io_uring/kbuf.c:183:10: io_uring/kbuf.c:141:23: warning: array subscript 255 is outside the bounds of an interior zero-length array 'struct io_uring_buf[0]' [-Wzero-length-bounds] 141 | buf = &br->bufs[head]; | ^~~~~~~~~~~~~~~ In file included from include/linux/io_uring.h:7, from io_uring/kbuf.c:10: include/uapi/linux/io_uring.h: In function 'io_buffer_select': include/uapi/linux/io_uring.h:628:41: note: while referencing 'bufs' 628 | struct io_uring_buf bufs[0]; | ^~~~ [1] https://www.kernel.org/doc/html/latest/process/deprecated.html#zero-length-and-one-element-arrays Fixes: c7fb19428d67 ("io_uring: add support for ring mapped supplied buffers") Cc: Jens Axboe <axboe@kernel.dk> Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org> Cc: stable@vger.kernel.org Cc: io-uring@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org> Link: https://lore.kernel.org/r/20230105190507.gonna.131-kees@kernel.org
2022-12-27uapi:io_uring.h: allow linux/time_types.h to be skippedStefan Metzmacher1-0/+8
include/uapi/linux/io_uring.h is synced 1:1 into liburing:src/include/liburing/io_uring.h. liburing has a configure check to detect the need for linux/time_types.h. It can opt-out by defining UAPI_LINUX_IO_URING_H_SKIP_LINUX_TIME_TYPES_H Fixes: 78a861b94959 ("io_uring: add sync cancelation API through io_uring_register()") Link: https://github.com/axboe/liburing/issues/708 Link: https://github.com/axboe/liburing/pull/709 Link: https://lore.kernel.org/io-uring/20221115212614.1308132-1-ammar.faizi@intel.com/T/#m9f5dd571cd4f6a5dee84452dbbca3b92ba7a4091 CC: Jens Axboe <axboe@kernel.dk> Cc: Ammar Faizi <ammarfaizi2@gnuweeb.org> Signed-off-by: Stefan Metzmacher <metze@samba.org> Reviewed-by: Ammar Faizi <ammarfaizi2@gnuweeb.org> Link: https://lore.kernel.org/r/7071a0a1d751221538b20b63f9160094fc7e06f4.1668630247.git.metze@samba.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-21io_uring/net: introduce IORING_SEND_ZC_REPORT_USAGE flagStefan Metzmacher1-0/+18
It might be useful for applications to detect if a zero copy transfer with SEND[MSG]_ZC was actually possible or not. The application can fallback to plain SEND[MSG] in order to avoid the overhead of two cqes per request. Or it can generate a log message that could indicate to an administrator that no zero copy was possible and could explain degraded performance. Cc: stable@vger.kernel.org # 6.1 Link: https://lore.kernel.org/io-uring/fb6a7599-8a9b-15e5-9b64-6cd9d01c6ff4@gmail.com/T/#m2b0d9df94ce43b0e69e6c089bdff0ce6babbdfaa Signed-off-by: Stefan Metzmacher <metze@samba.org> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/8945b01756d902f5d5b0667f20b957ad3f742e5e.1666895626.git.metze@samba.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-06io_uring: fix typo in io_uring.h commentJens Axboe1-1/+1
Just a basic s/thig/this swap, fixing up a typo introduced by a commit added in the 6.1 release. Fixes: 9cda70f622cd ("io_uring: introduce fixed buffer support for io_uring_cmd") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-30io_uring: introduce fixed buffer support for io_uring_cmdAnuj Gupta1-0/+9
Add IORING_URING_CMD_FIXED flag that is to be used for sending io_uring command with previously registered buffers. User-space passes the buffer index in sqe->buf_index, same as done in read/write variants that uses fixed buffers. Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20220930062749.152261-3-anuj20.g@samsung.com [axboe: shuffle valid flags check before acting on it] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21io_uring/net: zerocopy sendmsgPavel Begunkov1-0/+1
Add a zerocopy version of sendmsg. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/6aabc4bdfc0ec78df6ec9328137e394af9d4e7ef.1663668091.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21io_uring: add IORING_SETUP_DEFER_TASKRUNDylan Yudaken1-0/+7
Allow deferring async tasks until the user calls io_uring_enter(2) with the IORING_ENTER_GETEVENTS flag. Enable this mode with a flag at io_uring_setup time. This functionality requires that the later io_uring_enter will be called from the same submission task, and therefore restrict this flag to work only when IORING_SETUP_SINGLE_ISSUER is also set. Being able to hand pick when tasks are run prevents the problem where there is current work to be done, however task work runs anyway. For example, a common workload would obtain a batch of CQEs, and process each one. Interrupting this to additional taskwork would add latency but not gain anything. If instead task work is deferred to just before more CQEs are obtained then no additional latency is added. The way this is implemented is by trying to keep task work local to a io_ring_ctx, rather than to the submission task. This is required, as the application will want to wake up only a single io_ring_ctx at a time to process work, and so the lists of work have to be kept separate. This has some other benefits like not having to check the task continually in handle_tw_list (and potentially unlocking/locking those), and reducing locks in the submit & process completions path. There are networking cases where using this option can reduce request latency by 50%. For example a contrived example using [1] where the client sends 2k data and receives the same data back while doing some system calls (to trigger task work) shows this reduction. The reason ends up being that if sending responses is delayed by processing task work, then the client side sits idle. Whereas reordering the sends first means that the client runs it's workload in parallel with the local task work. [1]: Using https://github.com/DylanZA/netbench/tree/defer_run Client: ./netbench --client_only 1 --control_port 10000 --host <host> --tx "epoll --threads 16 --per_thread 1 --size 2048 --resp 2048 --workload 1000" Server: ./netbench --server_only 1 --control_port 10000 --rx "io_uring --defer_taskrun 0 --workload 100" --rx "io_uring --defer_taskrun 1 --workload 100" Signed-off-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220830125013.570060-5-dylany@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-01io_uring/net: simplify zerocopy send user APIPavel Begunkov1-2/+5
Following user feedback, this patch simplifies zerocopy send API. One of the main complaints is that the current API is difficult with the userspace managing notification slots, and then send retries with error handling make it even worse. Instead of keeping notification slots change it to the per-request notifications model, which posts both completion and notification CQEs for each request when any data has been sent, and only one CQE if it fails. All notification CQEs will have IORING_CQE_F_NOTIF set and IORING_CQE_F_MORE in completion CQEs indicates whether to wait a notification or not. IOSQE_CQE_SKIP_SUCCESS is disallowed with zerocopy sends for now. This is less flexible, but greatly simplifies the user API and also the kernel implementation. We reuse notif helpers in this patch, but in the future there won't be need for keeping two requests. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/95287640ab98fc9417370afb16e310677c63e6ce.1662027856.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-01io_uring/notif: remove notif registrationPavel Begunkov1-8/+0
We're going to remove the userspace exposed zerocopy notification API, remove notification registration. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/6ff00b97be99869c386958a990593c9c31cf105b.1662027856.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-01Revert "io_uring: rename IORING_OP_FILES_UPDATE"Pavel Begunkov1-11/+1
This reverts commit 4379d5f15b3fd4224c37841029178aa8082a242e. We removed notification flushing, also cleanup uapi preparation changes to not pollute it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/89edc3905350f91e1b6e26d9dbf42ee44fd451a2.1662027856.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-01Revert "io_uring: add zc notification flush requests"Pavel Begunkov1-1/+0
This reverts commit 492dddb4f6e3a5839c27d41ff1fecdbe6c3ab851. Soon we won't have the very notion of notification flushing, so remove notification flushing requests. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/8850334ca56e65b413cb34fd158db81d7b2865a3.1662027856.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-23io_uring: uapi: Add `extern "C"` in io_uring.h for liburingAmmar Faizi1-0/+8
Make it easy for liburing to integrate uapi header with the kernel. Previously, when this header changes, the liburing side can't directly copy this header file due to some small differences. Sync them. Link: https://lore.kernel.org/io-uring/f1feef16-6ea2-0653-238f-4aaee35060b6@kernel.dk Cc: Bart Van Assche <bvanassche@acm.org> Cc: Dylan Yudaken <dylany@fb.com> Cc: Facebook Kernel Team <kernel-team@fb.com> Signed-off-by: Ammar Faizi <ammarfaizi2@gnuweeb.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24io_uring: add zc notification flush requestsPavel Begunkov1-0/+1
Overlay notification control onto IORING_OP_RSRC_UPDATE (former IORING_OP_FILES_UPDATE). It allows to flush a range of zc notifications from slots with indexes [sqe->off, sqe->off+sqe->len). If sqe->arg is not zero, it also copies sqe->arg as a new tag for all flushed notifications. Note, it doesn't flush a notification of a slot if there was no requests attached to it (since last flush or registration). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/df13e2363400682a73dd9e71c3b990b8d1ff0333.1657643355.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24io_uring: rename IORING_OP_FILES_UPDATEPavel Begunkov1-1/+11
IORING_OP_FILES_UPDATE will be a more generic opcode serving different resource types, rename it into IORING_OP_RSRC_UPDATE and add subtype handling. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0a907133907d9af3415a8a7aa1802c6aa97c03c6.1657643355.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24io_uring: flush notifiers after sendzcPavel Begunkov1-0/+4
Allow to flush notifiers as a part of sendzc request by setting IORING_SENDZC_FLUSH flag. When the sendzc request succeedes it will flush the used [active] notifier. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e0b4d9a6797e2fd6092824fe42953db7a519bbc8.1657643355.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24io_uring: sendzc with fixed buffersPavel Begunkov1-1/+5
Allow zerocopy sends to use fixed buffers. There is an optimisation for this case, the network layer don't need to reference the pages, see SKBFL_MANAGED_FRAG_REFS, so io_uring have to ensure validity of fixed buffers until the notifier is released. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e1d8bd1b5934e541d90c1824eb4020ae3f5f43f3.1657643355.git.asml.silence@gmail.com [axboe: fold in 32-bit pointer cast warning fix] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24io_uring: allow to pass addr into sendzcPavel Begunkov1-1/+1
Allow to specify an address to zerocopy sends making it more like sendto(2). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/70417a8f7c5b51ab454690bae08adc0c187f89e8.1657643355.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24io_uring: wire send zc request typePavel Begunkov1-0/+5
Add a new io_uring opcode IORING_OP_SENDZC. The main distinction from IORING_OP_SEND is that the user should specify a notification slot index in sqe::notification_idx and the buffers are safe to reuse only when the used notification is flushed and completes. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/a80387c6a68ce9cf99b3b6ef6f71068468761fb7.1657643355.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24io_uring: add notification slot registrationPavel Begunkov1-0/+17
Let the userspace to register and unregister notification slots. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/a0aa8161fe3ebb2a4cc6e5dbd0cffb96e6881cf5.1657643355.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24io_uring: support multishot in recvmsgDylan Yudaken1-0/+7
Similar to multishot recv, this will require provided buffers to be used. However recvmsg is much more complex than recv as it has multiple outputs. Specifically flags, name, and control messages. Support this by introducing a new struct io_uring_recvmsg_out with 4 fields. namelen, controllen and flags match the similar out fields in msghdr from standard recvmsg(2), payloadlen is the length of the payload following the header. This struct is placed at the start of the returned buffer. Based on what the user specifies in struct msghdr, the next bytes of the buffer will be name (the next msg_namelen bytes), and then control (the next msg_controllen bytes). The payload will come at the end. The return value in the CQE is the total used size of the provided buffer. Signed-off-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220714110258.1336200-4-dylany@fb.com [axboe: style fixups, see link] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24io_uring: multishot recvDylan Yudaken1-0/+5
Support multishot receive for io_uring. Typical server applications will run a loop where for each recv CQE it requeues another recv/recvmsg. This can be simplified by using the existing multishot functionality combined with io_uring's provided buffers. The API is to add the IORING_RECV_MULTISHOT flag to the SQE. CQEs will then be posted (with IORING_CQE_F_MORE flag set) when data is available and is read. Once an error occurs or the socket ends, the multishot will be removed and a completion without IORING_CQE_F_MORE will be posted. The benefit to this is that the recv is much more performant. * Subsequent receives are queued up straight away without requiring the application to finish a processing loop. * If there are more data in the socket (sat the provided buffer size is smaller than the socket buffer) then the data is immediately returned, improving batching. * Poll is only armed once and reused, saving CPU cycles Signed-off-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220630091231.1456789-11-dylany@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24io_uring: let to set a range for file slot allocationPavel Begunkov1-0/+13
From recently io_uring provides an option to allocate a file index for operation registering fixed files. However, it's utterly unusable with mixed approaches when for a part of files the userspace knows better where to place it, as it may race and users don't have any sane way to pick a slot and hoping it will not be taken. Let the userspace to register a range of fixed file slots in which the auto-allocation happens. The use case is splittting the fixed table in two parts, where on of them is used for auto-allocation and another for slot-specified operations. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/66ab0394e436f38437cf7c44676e1920d09687ad.1656154403.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24io_uring: add support for passing fixed file descriptorsJens Axboe1-0/+17
With IORING_OP_MSG_RING, one ring can send a message to another ring. Extend that support to also allow sending a fixed file descriptor to that ring, enabling one ring to pass a registered descriptor to another one. Arguments are extended to pass in: sqe->addr3 fixed file slot in source ring sqe->file_index fixed file slot in destination ring IORING_OP_MSG_RING is extended to take a command argument in sqe->addr. If set to zero (or IORING_MSG_DATA), it sends just a message like before. If set to IORING_MSG_SEND_FD, a fixed file descriptor is sent according to the above arguments. Two common use cases for this are: 1) Server needs to be shutdown or restarted, pass file descriptors to another onei 2) Backend is split, and one accepts connections, while others then get the fd passed and handle the actual connection. Both of those are classic SCM_RIGHTS use cases, and it's not possible to support them with direct descriptors today. By default, this will post a CQE to the target ring, similarly to how IORING_MSG_DATA does it. If IORING_MSG_RING_CQE_SKIP is set, no message is posted to the target ring. The issuer is expected to notify the receiver side separately. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24io_uring: replace zero-length array with flexible-array memberGustavo A. R. Silva1-1/+1
There is a regular need in the kernel to provide a way to declare having a dynamically sized set of trailing elements in a structure. Kernel code should always use “flexible array members”[1] for these cases. The older style of one-element or zero-length arrays should no longer be used[2]. [1] https://en.wikipedia.org/wiki/Flexible_array_member [2] https://www.kernel.org/doc/html/v5.16/process/deprecated.html#zero-length-and-one-element-arrays Link: https://github.com/KSPP/linux/issues/78 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24io_uring: add sync cancelation API through io_uring_register()Jens Axboe1-0/+15
The io_uring cancelation API is async, like any other API that we expose there. For the case of finding a request to cancel, or not finding one, it is fully sync in that when submission returns, the CQE for both the cancelation request and the targeted request have been posted to the CQ ring. However, if the targeted work is being executed by io-wq, the API can only start the act of canceling it. This makes it difficult to use in some circumstances, as the caller then has to wait for the CQEs to come in and match on the same cancelation data there. Provide a IORING_REGISTER_SYNC_CANCEL command for io_uring_register() that does sync cancelations, always. For the io-wq case, it'll wait for the cancelation to come in before returning. The only expected returns from this API is: 0 Request found and canceled fine. > 0 Requests found and canceled. Only happens if asked to cancel multiple requests, and if the work wasn't in progress. -ENOENT Request not found. -ETIME A timeout on the operation was requested, but the timeout expired before we could cancel. and we won't get -EALREADY via this API. If the timeout value passed in is -1 (tv_sec and tv_nsec), then that means that no timeout is requested. Otherwise, the timespec passed in is the amount of time the sync cancel will wait for a successful cancelation. Link: https://github.com/axboe/liburing/discussions/608 Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24io_uring: add IORING_ASYNC_CANCEL_FD_FIXED cancel flagJens Axboe1-0/+2
In preparation for not having a request to pass in that carries this state, add a separate cancelation flag that allows the caller to ask for a fixed file for cancelation. Signed-off-by: Jens Axboe <axboe@kernel.dk>