aboutsummaryrefslogtreecommitdiffstats
path: root/fs/io_uring.c (follow)
AgeCommit message (Collapse)AuthorFilesLines
2021-01-15io_uring: ensure finish_wait() is always called in __io_uring_task_cancel()Jens Axboe1-0/+1
If we enter with requests pending and performm cancelations, we'll have a different inflight count before and after calling prepare_to_wait(). This causes the loop to restart. If we actually ended up canceling everything, or everything completed in-between, then we'll break out of the loop without calling finish_wait() on the waitqueue. This can trigger a warning on exit_signals(), as we leave the task state in TASK_UNINTERRUPTIBLE. Put a finish_wait() after the loop to catch that case. Cc: stable@vger.kernel.org # 5.9+ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-15io_uring: flush timeouts that should already have expiredMarcelo Diop-Gonzalez1-4/+30
Right now io_flush_timeouts() checks if the current number of events is equal to ->timeout.target_seq, but this will miss some timeouts if there have been more than 1 event added since the last time they were flushed (possible in io_submit_flush_completions(), for example). Fix it by recording the last sequence at which timeouts were flushed so that the number of events seen can be compared to the number of events needed without overflow. Signed-off-by: Marcelo Diop-Gonzalez <marcelo827@gmail.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-13io_uring: do sqo disable on install_fd errorPavel Begunkov1-0/+1
WARNING: CPU: 0 PID: 8494 at fs/io_uring.c:8717 io_ring_ctx_wait_and_kill+0x4f2/0x600 fs/io_uring.c:8717 Call Trace: io_uring_release+0x3e/0x50 fs/io_uring.c:8759 __fput+0x283/0x920 fs/file_table.c:280 task_work_run+0xdd/0x190 kernel/task_work.c:140 tracehook_notify_resume include/linux/tracehook.h:189 [inline] exit_to_user_mode_loop kernel/entry/common.c:174 [inline] exit_to_user_mode_prepare+0x249/0x250 kernel/entry/common.c:201 __syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline] syscall_exit_to_user_mode+0x19/0x50 kernel/entry/common.c:302 entry_SYSCALL_64_after_hwframe+0x44/0xa9 failed io_uring_install_fd() is a special case, we don't do io_ring_ctx_wait_and_kill() directly but defer it to fput, though still need to io_disable_sqo_submit() before. note: it doesn't fix any real problem, just a warning. That's because sqring won't be available to the userspace in this case and so SQPOLL won't submit anything. Reported-by: syzbot+9c9c35374c0ecac06516@syzkaller.appspotmail.com Fixes: d9d05217cb69 ("io_uring: stop SQPOLL submit on creator's death") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-13io_uring: fix null-deref in io_disable_sqo_submitPavel Begunkov1-1/+2
general protection fault, probably for non-canonical address 0xdffffc0000000022: 0000 [#1] KASAN: null-ptr-deref in range [0x0000000000000110-0x0000000000000117] RIP: 0010:io_ring_set_wakeup_flag fs/io_uring.c:6929 [inline] RIP: 0010:io_disable_sqo_submit+0xdb/0x130 fs/io_uring.c:8891 Call Trace: io_uring_create fs/io_uring.c:9711 [inline] io_uring_setup+0x12b1/0x38e0 fs/io_uring.c:9739 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 io_disable_sqo_submit() might be called before user rings were allocated, don't do io_ring_set_wakeup_flag() in those cases. Reported-by: syzbot+ab412638aeb652ded540@syzkaller.appspotmail.com Fixes: d9d05217cb69 ("io_uring: stop SQPOLL submit on creator's death") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-11io_uring: don't take files/mm for a dead taskPavel Begunkov1-0/+5
In rare cases a task may be exiting while io_ring_exit_work() trying to cancel/wait its requests. It's ok for __io_sq_thread_acquire_mm() because of SQPOLL check, but is not for __io_sq_thread_acquire_files(). Play safe and fail for both of them. Cc: stable@vger.kernel.org # 5.5+ Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-11io_uring: drop mm and files after task_work_runPavel Begunkov1-0/+2
__io_req_task_submit() run by task_work can set mm and files, but io_sq_thread() in some cases, and because __io_sq_thread_acquire_mm() and __io_sq_thread_acquire_files() do a simple current->mm/files check it may end up submitting IO with mm/files of another task. We also need to drop it after in the end to drop potentially grabbed references to them. Cc: stable@vger.kernel.org # 5.9+ Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-09io_uring: stop SQPOLL submit on creator's deathPavel Begunkov1-9/+53
When the creator of SQPOLL io_uring dies (i.e. sqo_task), we don't want its internals like ->files and ->mm to be poked by the SQPOLL task, it have never been nice and recently got racy. That can happen when the owner undergoes destruction and SQPOLL tasks tries to submit new requests in parallel, and so calls io_sq_thread_acquire*(). That patch halts SQPOLL submissions when sqo_task dies by introducing sqo_dead flag. Once set, the SQPOLL task must not do any submission, which is synchronised by uring_lock as well as the new flag. The tricky part is to make sure that disabling always happens, that means either the ring is discovered by creator's do_exit() -> cancel, or if the final close() happens before it's done by the creator. The last is guaranteed by the fact that for SQPOLL the creator task and only it holds exactly one file note, so either it pins up to do_exit() or removed by the creator on the final put in flush. (see comments in uring_flush() around file->f_count == 2). One more place that can trigger io_sq_thread_acquire_*() is __io_req_task_submit(). Shoot off requests on sqo_dead there, even though actually we don't need to. That's because cancellation of sqo_task should wait for the request before going any further. note 1: io_disable_sqo_submit() does io_ring_set_wakeup_flag() so the caller would enter the ring to get an error, but it still doesn't guarantee that the flag won't be cleared. note 2: if final __userspace__ close happens not from the creator task, the file note will pin the ring until the task dies. Fixed: b1b6b5a30dce8 ("kernel/io_uring: cancel io_uring before task works") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-09io_uring: add warn_once for io_uring_flush()Pavel Begunkov1-4/+10
files_cancel() should cancel all relevant requests and drop file notes, so we should never have file notes after that, including on-exit fput and flush. Add a WARN_ONCE to be sure. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-09io_uring: inline io_uring_attempt_task_drop()Pavel Begunkov1-18/+11
A simple preparation change inlining io_uring_attempt_task_drop() into io_uring_flush(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-09io_uring: io_rw_reissue lockdep annotationsPavel Begunkov1-0/+2
We expect io_rw_reissue() to take place only during submission with uring_lock held. Add a lockdep annotation to check that invariant. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-07io_uring: synchronise ev_posted() with waitqueuesPavel Begunkov1-2/+8
waitqueue_active() needs smp_mb() to be in sync with waitqueues modification, but we miss it in io_cqring_ev_posted*() apart from cq_wait() case. Take an smb_mb() out of wq_has_sleeper() making it waitqueue_active(), and place it a few lines before, so it can synchronise other waitqueue_active() as well. The patch doesn't add any additional overhead, so even if there are no problems currently, it's just safer to have it this way. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-07io_uring: dont kill fasync under completion_lockPavel Begunkov1-5/+8
CPU0 CPU1 ---- ---- lock(&new->fa_lock); local_irq_disable(); lock(&ctx->completion_lock); lock(&new->fa_lock); <Interrupt> lock(&ctx->completion_lock); *** DEADLOCK *** Move kill_fasync() out of io_commit_cqring() to io_cqring_ev_posted(), so it doesn't hold completion_lock while doing it. That saves from the reported deadlock, and it's just nice to shorten the locking time and untangle nested locks (compl_lock -> wq_head::lock). Reported-by: syzbot+91ca3f25bd7f795f019c@syzkaller.appspotmail.com Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-07io_uring: trigger eventfd for IOPOLLPavel Begunkov1-2/+11
Make sure io_iopoll_complete() tries to wake up eventfd, which currently is skipped together with io_cqring_ev_posted() for non-SQPOLL IOPOLL. Add an iopoll version of io_cqring_ev_posted(), duplicates a bit of code, but they actually use different sets of wait queues may be for better. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-06io_uring: Fix return value from alloc_fixed_file_ref_nodeMatthew Wilcox (Oracle)1-6/+6
alloc_fixed_file_ref_node() currently returns an ERR_PTR on failure. io_sqe_files_unregister() expects it to return NULL and since it can only return -ENOMEM, it makes more sense to change alloc_fixed_file_ref_node() to behave that way. Fixes: 1ffc54220c44 ("io_uring: fix io_sqe_files_unregister() hangs") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-05io_uring: Delete useless variable ‘id’ in io_prep_async_workYe Bin1-2/+0
Fix follow warning: fs/io_uring.c:1523:22: warning: variable ‘id’ set but not used [-Wunused-but-set-variable] struct io_identity *id; ^~ Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Ye Bin <yebin10@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-04io_uring: cancel more aggressively in exit_workPavel Begunkov1-4/+9
While io_ring_exit_work() is running new requests of all sorts may be issued, so it should do a bit more to cancel them, otherwise they may just get stuck. e.g. in io-wq, in poll lists, etc. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-04io_uring: drop file refs after task cancelPavel Begunkov1-9/+16
io_uring fds marked O_CLOEXEC and we explicitly cancel all requests before going through exec, so we don't want to leave task's file references to not our anymore io_uring instances. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-04io_uring: patch up IOPOLL overflow_flush syncPavel Begunkov1-37/+41
IOPOLL skips completion locking but keeps it under uring_lock, thus io_cqring_overflow_flush() and so io_cqring_events() need additional locking with uring_lock in some cases for IOPOLL. Remove __io_cqring_overflow_flush() from io_cqring_events(), introduce a wrapper around flush doing needed synchronisation and call it by hand. Cc: stable@vger.kernel.org # 5.5+ Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-04io_uring: synchronise IOPOLL on task_submit failPavel Begunkov1-6/+7
io_req_task_submit() might be called for IOPOLL, do the fail path under uring_lock to comply with IOPOLL synchronisation based solely on it. Cc: stable@vger.kernel.org # 5.5+ Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-30io_uring: fix io_sqe_files_unregister() hangsPavel Begunkov1-2/+22
io_sqe_files_unregister() uninterruptibly waits for enqueued ref nodes, however requests keeping them may never complete, e.g. because of some userspace dependency. Make sure it's interruptible otherwise it would hang forever. Cc: stable@vger.kernel.org # 5.6+ Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-30io_uring: add a helper for setting a ref nodePavel Begunkov1-10/+12
Setting a new reference node to a file data is not trivial, don't repeat it, add and use a helper. Cc: stable@vger.kernel.org # 5.6+ Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-29io_uring: don't assume mm is constant across submitsJens Axboe1-7/+7
If we COW the identity, we assume that ->mm never changes. But this isn't true of multiple processes end up sharing the ring. Hence treat id->mm like like any other process compontent when it comes to the identity mapping. This is pretty trivial, just moving the existing grab into io_grab_identity(), and including a check for the match. Cc: stable@vger.kernel.org # 5.10 Fixes: 1e6fa5216a0e ("io_uring: COW io_identity on mismatch") Reported-by: Christian Brauner <christian.brauner@ubuntu.com>: Tested-by: Christian Brauner <christian.brauner@ubuntu.com>: Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-22io_uring: hold uring_lock while completing failed polled io in io_wq_submit_work()Xiaoguang Wang1-10/+19
io_iopoll_complete() does not hold completion_lock to complete polled io, so in io_wq_submit_work(), we can not call io_req_complete() directly, to complete polled io, otherwise there maybe concurrent access to cqring, defer_list, etc, which is not safe. Commit dad1b1242fd5 ("io_uring: always let io_iopoll_complete() complete polled io") has fixed this issue, but Pavel reported that IOPOLL apart from rw can do buf reg/unreg requests( IORING_OP_PROVIDE_BUFFERS or IORING_OP_REMOVE_BUFFERS), so the fix is not good. Given that io_iopoll_complete() is always called under uring_lock, so here for polled io, we can also get uring_lock to fix this issue. Fixes: dad1b1242fd5 ("io_uring: always let io_iopoll_complete() complete polled io") Cc: <stable@vger.kernel.org> # 5.5+ Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> [axboe: don't deref 'req' after completing it'] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-22io_uring: fix double io_uring freePavel Begunkov1-32/+39
Once we created a file for current context during setup, we should not call io_ring_ctx_wait_and_kill() directly as it'll be done by fput(file) Cc: stable@vger.kernel.org # 5.10 Reported-by: syzbot+c9937dfb2303a5f18640@syzkaller.appspotmail.com Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> [axboe: fix unused 'ret' for !CONFIG_UNIX] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-21io_uring: fix ignoring xa_store errorsPavel Begunkov1-3/+7
xa_store() may fail, check the result. Cc: stable@vger.kernel.org # 5.10 Fixes: 0f2122045b946 ("io_uring: don't rely on weak ->files references") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-20io_uring: end waiting before task cancel attemptsPavel Begunkov1-1/+1
Get rid of TASK_UNINTERRUPTIBLE and waiting with finish_wait before going for next iteration in __io_uring_task_cancel(), because __io_uring_files_cancel() doesn't expect that sheduling is disallowed. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-20io_uring: always progress task_work on task cancelPavel Begunkov1-1/+1
Might happen that __io_uring_cancel_task_requests() cancels nothing but there are task_works pending. We need to always run them. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-20io_uring: make ctx cancel on exit targeted to actual ctxJens Axboe1-1/+8
Before IORING_SETUP_ATTACH_WQ, we could just cancel everything on the io-wq when exiting. But that's not the case if they are shared, so cancel for the specific ctx instead. Cc: stable@vger.kernel.org Fixes: 24369c2e3bb0 ("io_uring: add io-wq workqueue sharing") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-19io_uring: fix 0-iov read buffer selectPavel Begunkov1-3/+1
Doing vectored buf-select read with 0 iovec passed is meaningless and utterly broken, forbid it. Cc: <stable@vger.kernel.org> # 5.7+ Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-18io_uring: close a small race gap for files cancelPavel Begunkov1-4/+4
The purpose of io_uring_cancel_files() is to wait for all requests matching ->files to go/be cancelled. We should first drop files of a request in io_req_drop_files() and only then make it undiscoverable for io_uring_cancel_files. First drop, then delete from list. It's ok to leave req->id->files dangling, because it's not dereferenced by cancellation code, only compared against. It would potentially go to sleep and be awaken by following in io_req_drop_files() wake_up(). Fixes: 0f2122045b946 ("io_uring: don't rely on weak ->files references") Cc: <stable@vger.kernel.org> # 5.5+ Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-17io_uring: limit {io|sq}poll submit locking scopePavel Begunkov1-3/+6
We don't need to take uring_lock for SQPOLL|IOPOLL to do io_cqring_overflow_flush() when cq_overflow_list is empty, remove it from the hot path. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-17io_uring: inline io_cqring_mark_overflow()Pavel Begunkov1-13/+9
There is only one user of it and the name is misleading, get rid of it by inlining. By the way make overflow_flush's return value deduction simpler. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-17io_uring: consolidate CQ nr events calculationPavel Begunkov1-9/+8
Add a helper which calculates number of events in CQ. Handcoded version of it in io_cqring_overflow_flush() is not the clearest thing, so it makes it slightly more readable. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-17io_uring: remove racy overflow list fast checksPavel Begunkov1-4/+1
list_empty_careful() is not racy only if some conditions are met, i.e. no re-adds after del_init. io_cqring_overflow_flush() does list_move(), so it's actually racy. Remove those checks, we have ->cq_check_overflow for the fast path. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-17io_uring: cancel reqs shouldn't kill overflow listPavel Begunkov1-4/+2
io_uring_cancel_task_requests() doesn't imply that the ring is going away, it may continue to work well after that. The problem is that it sets ->cq_overflow_flushed effectively disabling the CQ overflow feature Split setting cq_overflow_flushed from flush, and do the first one only on exit. It's ok in terms of cancellations because there is a io_uring->in_idle check in __io_cqring_fill_event(). It also fixes a race with setting ->cq_overflow_flushed in io_uring_cancel_task_requests, whuch's is not atomic and a part of a bitmask with other flags. Though, the only other flag that's not set during init is drain_next, so it's not as bad for sane architectures. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Fixes: 0f2122045b946 ("io_uring: don't rely on weak ->files references") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-17io_uring: hold mmap_sem for mm->locked_vm manipulationJens Axboe1-4/+10
The kernel doesn't seem to have clear rules around this, but various spots are using the mmap_sem to serialize access to modifying the locked_vm count. Play it safe and lock the mm for write when accounting or unaccounting locked memory. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-16io_uring: break links on shutdown failureJens Axboe1-0/+2
Ensure that the return value of __sys_shutdown_sock() is used to potentially break links to the request, if we fail. Fixes: 36f4fa6886a8 ("io_uring: add support for shutdown(2)") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-16Merge tag 'for-5.11/block-2020-12-14' of git://git.kernel.dk/linux-blockLinus Torvalds1-6/+4
Pull block updates from Jens Axboe: "Another series of killing more code than what is being added, again thanks to Christoph's relentless cleanups and tech debt tackling. This contains: - blk-iocost improvements (Baolin Wang) - part0 iostat fix (Jeffle Xu) - Disable iopoll for split bios (Jeffle Xu) - block tracepoint cleanups (Christoph Hellwig) - Merging of struct block_device and hd_struct (Christoph Hellwig) - Rework/cleanup of how block device sizes are updated (Christoph Hellwig) - Simplification of gendisk lookup and removal of block device aliasing (Christoph Hellwig) - Block device ioctl cleanups (Christoph Hellwig) - Removal of bdget()/blkdev_get() as exported API (Christoph Hellwig) - Disk change rework, avoid ->revalidate_disk() (Christoph Hellwig) - sbitmap improvements (Pavel Begunkov) - Hybrid polling fix (Pavel Begunkov) - bvec iteration improvements (Pavel Begunkov) - Zone revalidation fixes (Damien Le Moal) - blk-throttle limit fix (Yu Kuai) - Various little fixes" * tag 'for-5.11/block-2020-12-14' of git://git.kernel.dk/linux-block: (126 commits) blk-mq: fix msec comment from micro to milli seconds blk-mq: update arg in comment of blk_mq_map_queue blk-mq: add helper allocating tagset->tags Revert "block: Fix a lockdep complaint triggered by request queue flushing" nvme-loop: use blk_mq_hctx_set_fq_lock_class to set loop's lock class blk-mq: add new API of blk_mq_hctx_set_fq_lock_class block: disable iopoll for split bio block: Improve blk_revalidate_disk_zones() checks sbitmap: simplify wrap check sbitmap: replace CAS with atomic and sbitmap: remove swap_lock sbitmap: optimise sbitmap_deferred_clear() blk-mq: skip hybrid polling if iopoll doesn't spin blk-iocost: Factor out the base vrate change into a separate function blk-iocost: Factor out the active iocgs' state check into a separate function blk-iocost: Move the usage ratio calculation to the correct place blk-iocost: Remove unnecessary advance declaration blk-iocost: Fix some typos in comments blktrace: fix up a kerneldoc comment block: remove the request_queue to argument request based tracepoints ...
2020-12-16Merge tag 'for-5.11/io_uring-2020-12-14' of git://git.kernel.dk/linux-blockLinus Torvalds1-557/+770
Pull io_uring updates from Jens Axboe: "Fairly light set of changes this time around, and mostly some bits that were pushed out to 5.11 instead of 5.10, fixes/cleanups, and a few features. In particular: - Cleanups around iovec import (David Laight, Pavel) - Add timeout support for io_uring_enter(2), which enables us to clean up liburing and avoid a timeout sqe submission in the completion path. The big win here is that it allows setups that split SQ and CQ handling into separate threads to avoid locking, as the CQ side will no longer submit when timeouts are needed when waiting for events (Hao Xu) - Add support for socket shutdown, and renameat/unlinkat. - SQPOLL cleanups and improvements (Xiaoguang Wang) - Allow SQPOLL setups for CAP_SYS_NICE, and enable regular (non-fixed) files to be used. - Cancelation improvements (Pavel) - Fixed file reference improvements (Pavel) - IOPOLL related race fixes (Pavel) - Lots of other little fixes and cleanups (mostly Pavel)" * tag 'for-5.11/io_uring-2020-12-14' of git://git.kernel.dk/linux-block: (43 commits) io_uring: fix io_cqring_events()'s noflush io_uring: fix racy IOPOLL flush overflow io_uring: fix racy IOPOLL completions io_uring: always let io_iopoll_complete() complete polled io io_uring: add timeout update io_uring: restructure io_timeout_cancel() io_uring: fix files cancellation io_uring: use bottom half safe lock for fixed file data io_uring: fix miscounting ios_left io_uring: change submit file state invariant io_uring: check kthread stopped flag when sq thread is unparked io_uring: share fixed_file_refs b/w multiple rsrcs io_uring: replace inflight_wait with tctx->wait io_uring: don't take fs for recvmsg/sendmsg io_uring: only wake up sq thread while current task is in io worker context io_uring: don't acquire uring_lock twice io_uring: initialize 'timeout' properly in io_sq_thread() io_uring: refactor io_sq_thread() handling io_uring: always batch cancel in *cancel_files() io_uring: pass files into kill timeouts/poll ...
2020-12-16Merge tag 'tif-task_work.arch-2020-12-14' of git://git.kernel.dk/linux-blockLinus Torvalds1-22/+8
Pull TIF_NOTIFY_SIGNAL updates from Jens Axboe: "This sits on top of of the core entry/exit and x86 entry branch from the tip tree, which contains the generic and x86 parts of this work. Here we convert the rest of the archs to support TIF_NOTIFY_SIGNAL. With that done, we can get rid of JOBCTL_TASK_WORK from task_work and signal.c, and also remove a deadlock work-around in io_uring around knowing that signal based task_work waking is invoked with the sighand wait queue head lock. The motivation for this work is to decouple signal notify based task_work, of which io_uring is a heavy user of, from sighand. The sighand lock becomes a huge contention point, particularly for threaded workloads where it's shared between threads. Even outside of threaded applications it's slower than it needs to be. Roman Gershman <romger@amazon.com> reported that his networked workload dropped from 1.6M QPS at 80% CPU to 1.0M QPS at 100% CPU after io_uring was changed to use TIF_NOTIFY_SIGNAL. The time was all spent hammering on the sighand lock, showing 57% of the CPU time there [1]. There are further cleanups possible on top of this. One example is TIF_PATCH_PENDING, where a patch already exists to use TIF_NOTIFY_SIGNAL instead. Hopefully this will also lead to more consolidation, but the work stands on its own as well" [1] https://github.com/axboe/liburing/issues/215 * tag 'tif-task_work.arch-2020-12-14' of git://git.kernel.dk/linux-block: (28 commits) io_uring: remove 'twa_signal_ok' deadlock work-around kernel: remove checking for TIF_NOTIFY_SIGNAL signal: kill JOBCTL_TASK_WORK io_uring: JOBCTL_TASK_WORK is no longer used by task_work task_work: remove legacy TWA_SIGNAL path sparc: add support for TIF_NOTIFY_SIGNAL riscv: add support for TIF_NOTIFY_SIGNAL nds32: add support for TIF_NOTIFY_SIGNAL ia64: add support for TIF_NOTIFY_SIGNAL h8300: add support for TIF_NOTIFY_SIGNAL c6x: add support for TIF_NOTIFY_SIGNAL alpha: add support for TIF_NOTIFY_SIGNAL xtensa: add support for TIF_NOTIFY_SIGNAL arm: add support for TIF_NOTIFY_SIGNAL microblaze: add support for TIF_NOTIFY_SIGNAL hexagon: add support for TIF_NOTIFY_SIGNAL csky: add support for TIF_NOTIFY_SIGNAL openrisc: add support for TIF_NOTIFY_SIGNAL sh: add support for TIF_NOTIFY_SIGNAL um: add support for TIF_NOTIFY_SIGNAL ...
2020-12-15Merge branch 'exec-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespaceLinus Torvalds1-1/+1
Pull execve updates from Eric Biederman: "This set of changes ultimately fixes the interaction of posix file lock and exec. Fundamentally most of the change is just moving where unshare_files is called during exec, and tweaking the users of files_struct so that the count of files_struct is not unnecessarily played with. Along the way fcheck and related helpers were renamed to more accurately reflect what they do. There were also many other small changes that fell out, as this is the first time in a long time much of this code has been touched. Benchmarks haven't turned up any practical issues but Al Viro has observed a possibility for a lot of pounding on task_lock. So I have some changes in progress to convert put_files_struct to always rcu free files_struct. That wasn't ready for the merge window so that will have to wait until next time" * 'exec-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (27 commits) exec: Move io_uring_task_cancel after the point of no return coredump: Document coredump code exclusively used by cell spufs file: Remove get_files_struct file: Rename __close_fd_get_file close_fd_get_file file: Replace ksys_close with close_fd file: Rename __close_fd to close_fd and remove the files parameter file: Merge __alloc_fd into alloc_fd file: In f_dupfd read RLIMIT_NOFILE once. file: Merge __fd_install into fd_install proc/fd: In fdinfo seq_show don't use get_files_struct bpf/task_iter: In task_file_seq_get_next use task_lookup_next_fd_rcu proc/fd: In proc_readfd_common use task_lookup_next_fd_rcu file: Implement task_lookup_next_fd_rcu kcmp: In get_file_raw_ptr use task_lookup_fd_rcu proc/fd: In tid_fd_mode use task_lookup_fd_rcu file: Implement task_lookup_fd_rcu file: Rename fcheck lookup_fd_rcu file: Replace fcheck_files with files_lookup_fd_rcu file: Factor files_lookup_fd_locked out of fcheck_files file: Rename __fcheck_files to files_lookup_fd_raw ...
2020-12-15Merge tag 'net-next-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-nextLinus Torvalds1-8/+8
Pull networking updates from Jakub Kicinski: "Core: - support "prefer busy polling" NAPI operation mode, where we defer softirq for some time expecting applications to periodically busy poll - AF_XDP: improve efficiency by more batching and hindering the adjacency cache prefetcher - af_packet: make packet_fanout.arr size configurable up to 64K - tcp: optimize TCP zero copy receive in presence of partial or unaligned reads making zero copy a performance win for much smaller messages - XDP: add bulk APIs for returning / freeing frames - sched: support fragmenting IP packets as they come out of conntrack - net: allow virtual netdevs to forward UDP L4 and fraglist GSO skbs BPF: - BPF switch from crude rlimit-based to memcg-based memory accounting - BPF type format information for kernel modules and related tracing enhancements - BPF implement task local storage for BPF LSM - allow the FENTRY/FEXIT/RAW_TP tracing programs to use bpf_sk_storage Protocols: - mptcp: improve multiple xmit streams support, memory accounting and many smaller improvements - TLS: support CHACHA20-POLY1305 cipher - seg6: add support for SRv6 End.DT4/DT6 behavior - sctp: Implement RFC 6951: UDP Encapsulation of SCTP - ppp_generic: add ability to bridge channels directly - bridge: Connectivity Fault Management (CFM) support as is defined in IEEE 802.1Q section 12.14. Drivers: - mlx5: make use of the new auxiliary bus to organize the driver internals - mlx5: more accurate port TX timestamping support - mlxsw: - improve the efficiency of offloaded next hop updates by using the new nexthop object API - support blackhole nexthops - support IEEE 802.1ad (Q-in-Q) bridging - rtw88: major bluetooth co-existance improvements - iwlwifi: support new 6 GHz frequency band - ath11k: Fast Initial Link Setup (FILS) - mt7915: dual band concurrent (DBDC) support - net: ipa: add basic support for IPA v4.5 Refactor: - a few pieces of in_interrupt() cleanup work from Sebastian Andrzej Siewior - phy: add support for shared interrupts; get rid of multiple driver APIs and have the drivers write a full IRQ handler, slight growth of driver code should be compensated by the simpler API which also allows shared IRQs - add common code for handling netdev per-cpu counters - move TX packet re-allocation from Ethernet switch tag drivers to a central place - improve efficiency and rename nla_strlcpy - number of W=1 warning cleanups as we now catch those in a patchwork build bot Old code removal: - wan: delete the DLCI / SDLA drivers - wimax: move to staging - wifi: remove old WDS wifi bridging support" * tag 'net-next-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1922 commits) net: hns3: fix expression that is currently always true net: fix proc_fs init handling in af_packet and tls nfc: pn533: convert comma to semicolon af_vsock: Assign the vsock transport considering the vsock address flags af_vsock: Set VMADDR_FLAG_TO_HOST flag on the receive path vsock_addr: Check for supported flag values vm_sockets: Add VMADDR_FLAG_TO_HOST vsock flag vm_sockets: Add flags field in the vsock address data structure net: Disable NETIF_F_HW_TLS_TX when HW_CSUM is disabled tcp: Add logic to check for SYN w/ data in tcp_simple_retransmit net: mscc: ocelot: install MAC addresses in .ndo_set_rx_mode from process context nfc: s3fwrn5: Release the nfc firmware net: vxget: clean up sparse warnings mlxsw: spectrum_router: Use eXtended mezzanine to offload IPv4 router mlxsw: spectrum: Set KVH XLT cache mode for Spectrum2/3 mlxsw: spectrum_router_xm: Introduce basic XM cache flushing mlxsw: reg: Add Router LPM Cache Enable Register mlxsw: reg: Add Router LPM Cache ML Delete Register mlxsw: spectrum_router_xm: Implement L-value tracking for M-index mlxsw: reg: Add XM Router M Table Register ...
2020-12-14Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski1-8/+8
Daniel Borkmann says: ==================== pull-request: bpf-next 2020-12-14 1) Expose bpf_sk_storage_*() helpers to iterator programs, from Florent Revest. 2) Add AF_XDP selftests based on veth devs to BPF selftests, from Weqaar Janjua. 3) Support for finding BTF based kernel attach targets through libbpf's bpf_program__set_attach_target() API, from Andrii Nakryiko. 4) Permit pointers on stack for helper calls in the verifier, from Yonghong Song. 5) Fix overflows in hash map elem size after rlimit removal, from Eric Dumazet. 6) Get rid of direct invocation of llc in BPF selftests, from Andrew Delgadillo. 7) Fix xsk_recvmsg() to reorder socket state check before access, from Björn Töpel. 8) Add new libbpf API helper to retrieve ring buffer epoll fd, from Brendan Jackman. 9) Batch of minor BPF selftest improvements all over the place, from Florian Lehner, KP Singh, Jiri Olsa and various others. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (31 commits) selftests/bpf: Add a test for ptr_to_map_value on stack for helper access bpf: Permits pointers on stack for helper calls libbpf: Expose libbpf ring_buffer epoll_fd selftests/bpf: Add set_attach_target() API selftest for module target libbpf: Support modules in bpf_program__set_attach_target() API selftests/bpf: Silence ima_setup.sh when not running in verbose mode. selftests/bpf: Drop the need for LLVM's llc selftests/bpf: fix bpf_testmod.ko recompilation logic samples/bpf: Fix possible hang in xdpsock with multiple threads selftests/bpf: Make selftest compilation work on clang 11 selftests/bpf: Xsk selftests - adding xdpxceiver to .gitignore selftests/bpf: Drop tcp-{client,server}.py from Makefile selftests/bpf: Xsk selftests - Bi-directional Sockets - SKB, DRV selftests/bpf: Xsk selftests - Socket Teardown - SKB, DRV selftests/bpf: Xsk selftests - DRV POLL, NOPOLL selftests/bpf: Xsk selftests - SKB POLL, NOPOLL selftests/bpf: Xsk selftests framework bpf: Only provide bpf_sock_from_file with CONFIG_NET bpf: Return -ENOTSUPP when attaching to non-kernel BTF xsk: Validate socket state in xsk_recvmsg, prior touching socket members ... ==================== Link: https://lore.kernel.org/r/20201214214316.20642-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-12io_uring: remove 'twa_signal_ok' deadlock work-aroundJens Axboe1-15/+6
The TIF_NOTIFY_SIGNAL based implementation of TWA_SIGNAL is always safe to use, regardless of context, as we won't be recursing into the signal lock. So now that all archs are using that, we can drop this deadlock work-around as it's always safe to use TWA_SIGNAL. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-12io_uring: JOBCTL_TASK_WORK is no longer used by task_workJens Axboe1-7/+2
Remove the dead code, TWA_SIGNAL will never set JOBCTL_TASK_WORK at this point. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-10file: Rename __close_fd_get_file close_fd_get_fileEric W. Biederman1-1/+1
The function close_fd_get_file is explicitly a variant of __close_fd[1]. Now that __close_fd has been renamed close_fd, rename close_fd_get_file to be consistent with close_fd. When __alloc_fd, __close_fd and __fd_install were introduced the double underscore indicated that the function took a struct files_struct parameter. The function __close_fd_get_file never has so the naming has always been inconsistent. This just cleans things up so there are not any lingering mentions or references __close_fd left in the code. [1] 80cd795630d6 ("binder: fix use-after-free due to ksys_close() during fdget()") Link: https://lkml.kernel.org/r/20201120231441.29911-23-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-09io_uring: fix io_cqring_events()'s noflushPavel Begunkov1-1/+1
Checking !list_empty(&ctx->cq_overflow_list) around noflush in io_cqring_events() is racy, because if it fails but a request overflowed just after that, io_cqring_overflow_flush() still will be called. Remove the second check, it shouldn't be a problem for performance, because there is cq_check_overflow bit check just above. Cc: <stable@vger.kernel.org> # 5.5+ Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09io_uring: fix racy IOPOLL flush overflowPavel Begunkov1-4/+6
It's not safe to call io_cqring_overflow_flush() for IOPOLL mode without hodling uring_lock, because it does synchronisation differently. Make sure we have it. As for io_ring_exit_work(), we don't even need it there because io_ring_ctx_wait_and_kill() already set force flag making all overflowed requests to be dropped. Cc: <stable@vger.kernel.org> # 5.5+ Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09io_uring: fix racy IOPOLL completionsPavel Begunkov1-5/+18
IOPOLL allows buffer remove/provide requests, but they doesn't synchronise by rules of IOPOLL, namely it have to hold uring_lock. Cc: <stable@vger.kernel.org> # 5.7+ Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09io_uring: always let io_iopoll_complete() complete polled ioXiaoguang Wang1-2/+13
Abaci Fuzz reported a double-free or invalid-free BUG in io_commit_cqring(): [ 95.504842] BUG: KASAN: double-free or invalid-free in io_commit_cqring+0x3ec/0x8e0 [ 95.505921] [ 95.506225] CPU: 0 PID: 4037 Comm: io_wqe_worker-0 Tainted: G B W 5.10.0-rc5+ #1 [ 95.507434] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [ 95.508248] Call Trace: [ 95.508683] dump_stack+0x107/0x163 [ 95.509323] ? io_commit_cqring+0x3ec/0x8e0 [ 95.509982] print_address_description.constprop.0+0x3e/0x60 [ 95.510814] ? vprintk_func+0x98/0x140 [ 95.511399] ? io_commit_cqring+0x3ec/0x8e0 [ 95.512036] ? io_commit_cqring+0x3ec/0x8e0 [ 95.512733] kasan_report_invalid_free+0x51/0x80 [ 95.513431] ? io_commit_cqring+0x3ec/0x8e0 [ 95.514047] __kasan_slab_free+0x141/0x160 [ 95.514699] kfree+0xd1/0x390 [ 95.515182] io_commit_cqring+0x3ec/0x8e0 [ 95.515799] __io_req_complete.part.0+0x64/0x90 [ 95.516483] io_wq_submit_work+0x1fa/0x260 [ 95.517117] io_worker_handle_work+0xeac/0x1c00 [ 95.517828] io_wqe_worker+0xc94/0x11a0 [ 95.518438] ? io_worker_handle_work+0x1c00/0x1c00 [ 95.519151] ? __kthread_parkme+0x11d/0x1d0 [ 95.519806] ? io_worker_handle_work+0x1c00/0x1c00 [ 95.520512] ? io_worker_handle_work+0x1c00/0x1c00 [ 95.521211] kthread+0x396/0x470 [ 95.521727] ? _raw_spin_unlock_irq+0x24/0x30 [ 95.522380] ? kthread_mod_delayed_work+0x180/0x180 [ 95.523108] ret_from_fork+0x22/0x30 [ 95.523684] [ 95.523985] Allocated by task 4035: [ 95.524543] kasan_save_stack+0x1b/0x40 [ 95.525136] __kasan_kmalloc.constprop.0+0xc2/0xd0 [ 95.525882] kmem_cache_alloc_trace+0x17b/0x310 [ 95.533930] io_queue_sqe+0x225/0xcb0 [ 95.534505] io_submit_sqes+0x1768/0x25f0 [ 95.535164] __x64_sys_io_uring_enter+0x89e/0xd10 [ 95.535900] do_syscall_64+0x33/0x40 [ 95.536465] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 95.537199] [ 95.537505] Freed by task 4035: [ 95.538003] kasan_save_stack+0x1b/0x40 [ 95.538599] kasan_set_track+0x1c/0x30 [ 95.539177] kasan_set_free_info+0x1b/0x30 [ 95.539798] __kasan_slab_free+0x112/0x160 [ 95.540427] kfree+0xd1/0x390 [ 95.540910] io_commit_cqring+0x3ec/0x8e0 [ 95.541516] io_iopoll_complete+0x914/0x1390 [ 95.542150] io_do_iopoll+0x580/0x700 [ 95.542724] io_iopoll_try_reap_events.part.0+0x108/0x200 [ 95.543512] io_ring_ctx_wait_and_kill+0x118/0x340 [ 95.544206] io_uring_release+0x43/0x50 [ 95.544791] __fput+0x28d/0x940 [ 95.545291] task_work_run+0xea/0x1b0 [ 95.545873] do_exit+0xb6a/0x2c60 [ 95.546400] do_group_exit+0x12a/0x320 [ 95.546967] __x64_sys_exit_group+0x3f/0x50 [ 95.547605] do_syscall_64+0x33/0x40 [ 95.548155] entry_SYSCALL_64_after_hwframe+0x44/0xa9 The reason is that once we got a non EAGAIN error in io_wq_submit_work(), we'll complete req by calling io_req_complete(), which will hold completion_lock to call io_commit_cqring(), but for polled io, io_iopoll_complete() won't hold completion_lock to call io_commit_cqring(), then there maybe concurrent access to ctx->defer_list, double free may happen. To fix this bug, we always let io_iopoll_complete() complete polled io. Cc: <stable@vger.kernel.org> # 5.5+ Reported-by: Abaci Fuzz <abaci@linux.alibaba.com> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>