aboutsummaryrefslogtreecommitdiffstatshomepage
path: root/fs/io-wq.c (follow)
AgeCommit message (Collapse)AuthorFilesLines
2021-08-23io-wq: remove GFP_ATOMIC allocation off schedule out pathJens Axboe1-32/+40
Daniel reports that the v5.14-rc4-rt4 kernel throws a BUG when running stress-ng: | [ 90.202543] BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:35 | [ 90.202549] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 2047, name: iou-wrk-2041 | [ 90.202555] CPU: 5 PID: 2047 Comm: iou-wrk-2041 Tainted: G W 5.14.0-rc4-rt4+ #89 | [ 90.202559] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014 | [ 90.202561] Call Trace: | [ 90.202577] dump_stack_lvl+0x34/0x44 | [ 90.202584] ___might_sleep.cold+0x87/0x94 | [ 90.202588] rt_spin_lock+0x19/0x70 | [ 90.202593] ___slab_alloc+0xcb/0x7d0 | [ 90.202598] ? newidle_balance.constprop.0+0xf5/0x3b0 | [ 90.202603] ? dequeue_entity+0xc3/0x290 | [ 90.202605] ? io_wqe_dec_running.isra.0+0x98/0xe0 | [ 90.202610] ? pick_next_task_fair+0xb9/0x330 | [ 90.202612] ? __schedule+0x670/0x1410 | [ 90.202615] ? io_wqe_dec_running.isra.0+0x98/0xe0 | [ 90.202618] kmem_cache_alloc_trace+0x79/0x1f0 | [ 90.202621] io_wqe_dec_running.isra.0+0x98/0xe0 | [ 90.202625] io_wq_worker_sleeping+0x37/0x50 | [ 90.202628] schedule+0x30/0xd0 | [ 90.202630] schedule_timeout+0x8f/0x1a0 | [ 90.202634] ? __bpf_trace_tick_stop+0x10/0x10 | [ 90.202637] io_wqe_worker+0xfd/0x320 | [ 90.202641] ? finish_task_switch.isra.0+0xd3/0x290 | [ 90.202644] ? io_worker_handle_work+0x670/0x670 | [ 90.202646] ? io_worker_handle_work+0x670/0x670 | [ 90.202649] ret_from_fork+0x22/0x30 which is due to the RT kernel not liking a GFP_ATOMIC allocation inside a raw spinlock. Besides that not working on RT, doing any kind of allocation from inside schedule() is kind of nasty and should be avoided if at all possible. This particular path happens when an io-wq worker goes to sleep, and we need a new worker to handle pending work. We currently allocate a small data item to hold the information we need to create a new worker, but we can instead include this data in the io_worker struct itself and just protect it with a single bit lock. We only really need one per worker anyway, as we will have run pending work between to sleep cycles. https://lore.kernel.org/lkml/20210804082418.fbibprcwtzyt5qax@beryllium.lan/ Reported-by: Daniel Wagner <dwagner@suse.de> Tested-by: Daniel Wagner <dwagner@suse.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-09io-wq: fix IO_WORKER_F_FIXED issue in create_io_worker()Hao Xu1-7/+11
There may be cases like: A B spin_lock(wqe->lock) nr_workers is 0 nr_workers++ spin_unlock(wqe->lock) spin_lock(wqe->lock) nr_wokers is 1 nr_workers++ spin_unlock(wqe->lock) create_io_worker() acct->worker is 1 create_io_worker() acct->worker is 1 There should be one worker marked IO_WORKER_F_FIXED, but no one is. Fix this by introduce a new agrument for create_io_worker() to indicate if it is the first worker. Fixes: 3d4e4face9c1 ("io-wq: fix no lock protection of acct->nr_worker") Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Link: https://lore.kernel.org/r/20210808135434.68667-3-haoxu@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-09io-wq: fix bug of creating io-wokers unconditionallyHao Xu1-2/+10
The former patch to add check between nr_workers and max_workers has a bug, which will cause unconditionally creating io-workers. That's because the result of the check doesn't affect the call of create_io_worker(), fix it by bringing in a boolean value for it. Fixes: 21698274da5b ("io-wq: fix lack of acct->nr_workers < acct->max_workers judgement") Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Link: https://lore.kernel.org/r/20210808135434.68667-2-haoxu@linux.alibaba.com [axboe: drop hunk that isn't strictly needed] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-06io-wq: fix lack of acct->nr_workers < acct->max_workers judgementHao Xu1-1/+9
There should be this judgement before we create an io-worker Fixes: 685fe7feedb9 ("io-wq: eliminate the need for a manager thread") Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-06io-wq: fix no lock protection of acct->nr_workerHao Xu1-6/+17
There is an acct->nr_worker visit without lock protection. Think about the case: two callers call io_wqe_wake_worker(), one is the original context and the other one is an io-worker(by calling io_wqe_enqueue(wqe, linked)), on two cpus paralelly, this may cause nr_worker to be larger than max_worker. Let's fix it by adding lock for it, and let's do nr_workers++ before create_io_worker. There may be a edge cause that the first caller fails to create an io-worker, but the second caller doesn't know it and then quit creating io-worker as well: say nr_worker = max_worker - 1 cpu 0 cpu 1 io_wqe_wake_worker() io_wqe_wake_worker() nr_worker < max_worker nr_worker++ create_io_worker() nr_worker == max_worker failed return return But the chance of this case is very slim. Fixes: 685fe7feedb9 ("io-wq: eliminate the need for a manager thread") Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> [axboe: fix unconditional create_io_worker() call] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-04io-wq: fix race between worker exiting and activating free workerJens Axboe1-19/+19
Nadav correctly reports that we have a race between a worker exiting, and new work being queued. This can lead to work being queued behind an existing worker that could be sleeping on an event before it can run to completion, and hence introducing potential big latency gaps if we hit this race condition: cpu0 cpu1 ---- ---- io_wqe_worker() schedule_timeout() // timed out io_wqe_enqueue() io_wqe_wake_worker() // work_flags & IO_WQ_WORK_CONCURRENT io_wqe_activate_free_worker() io_worker_exit() Fix this by having the exiting worker go through the normal decrement of a running worker, which will spawn a new one if needed. The free worker activation is modified to only return success if we were able to find a sleeping worker - if not, we keep looking through the list. If we fail, we create a new worker as per usual. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/io-uring/BFF746C0-FEDE-4646-A253-3021C57C26C9@gmail.com/ Reported-by: Nadav Amit <nadav.amit@gmail.com> Tested-by: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-07-23io_uring: explicitly catch any illegal async queue attemptJens Axboe1-1/+6
Catch an illegal case to queue async from an unrelated task that got the ring fd passed to it. This should not be possible to hit, but better be proactive and catch it explicitly. io-wq is extended to check for early IO_WQ_WORK_CANCEL being set on a work item as well, so it can run the request through the normal cancelation path. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-18io_uring: fix false WARN_ONCEPavel Begunkov1-1/+4
WARNING: CPU: 1 PID: 11749 at fs/io-wq.c:244 io_wqe_wake_worker fs/io-wq.c:244 [inline] WARNING: CPU: 1 PID: 11749 at fs/io-wq.c:244 io_wqe_enqueue+0x7f6/0x910 fs/io-wq.c:751 A WARN_ON_ONCE() in io_wqe_wake_worker() can be triggered by a valid userspace setup. Replace it with pr_warn. Reported-by: syzbot+ea2f1484cffe5109dc10@syzkaller.appspotmail.com Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f7ede342c3342c4c26668f5168e2993e38bbd99c.1623949695.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-17io_uring: allow user configurable IO thread CPU affinityJens Axboe1-0/+17
io-wq defaults to per-node masks for IO workers. This works fine by default, but isn't particularly handy for workloads that prefer more specific affinities, for either performance or isolation reasons. This adds IORING_REGISTER_IOWQ_AFF that allows the user to pass in a CPU mask that is then applied to IO thread workers, and an IORING_UNREGISTER_IOWQ_AFF that simply resets the masks back to the default of per-node. Note that no care is given to existing IO threads, they will need to go through a reschedule before the affinity is correct if they are already running or sleeping. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-17io-wq: use private CPU maskJens Axboe1-7/+43
In preparation for allowing user specific CPU masks for IO thread creation, switch to using a mask embedded in the per-node wqe structure. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-16io-wq: remove header files not needed anymoreOlivier Langlois1-2/+0
mm related header files are not needed for io-wq module. remove them for a small clean-up. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Olivier Langlois <olivier@trillion01.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-15io-wq: remove redundant initialization of variable retColin Ian King1-1/+1
The variable ret is being initialized with a value that is never read, the assignment is redundant and can be removed. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <colin.king@canonical.com> Link: https://lore.kernel.org/r/20210615143424.60449-1-colin.king@canonical.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14io-wq: simplify worker exitingPavel Begunkov1-4/+1
io_worker_handle_work() already takes care of the empty list case and releases spinlock, so get rid of ugly conditional unlocking and unconditionally call handle_work() Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/7521e485677f381036676943e876a0afecc23017.1623634181.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14io-wq: don't repeat IO_WQ_BIT_EXIT check by workerPavel Begunkov1-2/+1
io_wqe_worker()'s main loop does check IO_WQ_BIT_EXIT flag, so no need for a second test_bit at the end as it will immediately jump to the first check afterwards. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d6af4a51c86523a527fb5417c9fbc775c4b26497.1623634181.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14io-wq: remove unused io-wq refcountingPavel Begunkov1-5/+1
iowq->refs is initialised to one and killed on exit, so it's not used and we can kill it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/401007393528ea7c102360e69a29b64498e15db2.1623634181.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14io-wq: embed wqe ptr array into struct io_wqPavel Begunkov1-11/+4
io-wq keeps an array of pointers to struct io_wqe, allocate this array as a part of struct io-wq, it's easier to code and saves an extra indirection for nearly each io-wq call. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1482c6a001923bbed662dc38a8a580fb08b1ed8c.1623634181.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-26io-wq: Fix UAF when wakeup wqe in hash waitqueueZqiang1-3/+6
BUG: KASAN: use-after-free in __wake_up_common+0x637/0x650 Read of size 8 at addr ffff8880304250d8 by task iou-wrk-28796/28802 Call Trace: __dump_stack [inline] dump_stack+0x141/0x1d7 print_address_description.constprop.0.cold+0x5b/0x2c6 __kasan_report [inline] kasan_report.cold+0x7c/0xd8 __wake_up_common+0x637/0x650 __wake_up_common_lock+0xd0/0x130 io_worker_handle_work+0x9dd/0x1790 io_wqe_worker+0xb2a/0xd40 ret_from_fork+0x1f/0x30 Allocated by task 28798: kzalloc_node [inline] io_wq_create+0x3c4/0xdd0 io_init_wq_offload [inline] io_uring_alloc_task_context+0x1bf/0x6b0 __io_uring_add_task_file+0x29a/0x3c0 io_uring_add_task_file [inline] io_uring_install_fd [inline] io_uring_create [inline] io_uring_setup+0x209a/0x2bd0 do_syscall_64+0x3a/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xae Freed by task 28798: kfree+0x106/0x2c0 io_wq_destroy+0x182/0x380 io_wq_put [inline] io_wq_put_and_exit+0x7a/0xa0 io_uring_clean_tctx [inline] __io_uring_cancel+0x428/0x530 io_uring_files_cancel do_exit+0x299/0x2a60 do_group_exit+0x125/0x310 get_signal+0x47f/0x2150 arch_do_signal_or_restart+0x2a8/0x1eb0 handle_signal_work[inline] exit_to_user_mode_loop [inline] exit_to_user_mode_prepare+0x171/0x280 __syscall_exit_to_user_mode_work [inline] syscall_exit_to_user_mode+0x19/0x60 do_syscall_64+0x47/0xb0 entry_SYSCALL_64_after_hwframe There are the following scenarios, hash waitqueue is shared by io-wq1 and io-wq2. (note: wqe is worker) io-wq1:worker2 | locks bit1 io-wq2:worker1 | waits bit1 io-wq1:worker3 | waits bit1 io-wq1:worker2 | completes all wqe bit1 work items io-wq1:worker2 | drop bit1, exit io-wq2:worker1 | locks bit1 io-wq1:worker3 | can not locks bit1, waits bit1 and exit io-wq1 | exit and free io-wq1 io-wq2:worker1 | drops bit1 io-wq1:worker3 | be waked up, even though wqe is freed After all iou-wrk belonging to io-wq1 have exited, remove wqe form hash waitqueue, it is guaranteed that there will be no more wqe belonging to io-wq1 in the hash waitqueue. Reported-by: syzbot+6cb11ade52aa17095297@syzkaller.appspotmail.com Signed-off-by: Zqiang <qiang.zhang@windriver.com> Link: https://lore.kernel.org/r/20210526050826.30500-1-qiang.zhang@windriver.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-25io_uring/io-wq: close io-wq full-stop gapPavel Begunkov1-11/+9
There is an old problem with io-wq cancellation where requests should be killed and are in io-wq but are not discoverable, e.g. in @next_hashed or @linked vars of io_worker_handle_work(). It adds some unreliability to individual request canellation, but also may potentially get __io_uring_cancel() stuck. For instance: 1) An __io_uring_cancel()'s cancellation round have not found any request but there are some as desribed. 2) __io_uring_cancel() goes to sleep 3) Then workers wake up and try to execute those hidden requests that happen to be unbound. As we already cancel all requests of io-wq there, set IO_WQ_BIT_EXIT in advance, so preventing 3) from executing unbound requests. The workers will initially break looping because of getting a signal as they are threads of the dying/exec()'ing user task. Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/abfcf8c54cb9e8f7bfbad7e9a0cc5433cc70bdc2.1621781238.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-20io-wq: remove unused io_wqe_need_worker() functionJens Axboe1-13/+0
A previous commit removed the need for this, but overlooked that we no longer use it at all. Get rid of it. Fixes: 685fe7feedb9 ("io-wq: eliminate the need for a manager thread") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11io-wq: Fix io_wq_worker_affinity()Peter Zijlstra1-9/+2
Do not include private headers and do not frob in internals. On top of that, while the previous code restores the affinity, it doesn't ensure the task actually moves there if it was running, leading to the fun situation that it can be observed running outside of its allowed mask for potentially significant time. Use the proper API instead. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/YG7QkiUzlEbW85TU@hirez.programming.kicks-ass.net Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11io-wq: simplify code in __io_worker_busy()Hao Xu1-9/+6
Leverage XOR to simplify the code in __io_worker_busy. Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Link: https://lore.kernel.org/r/1617678525-3129-1-git-send-email-haoxu@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11io-wq: cancel task_work on exit only targeting the current 'wq'Jens Axboe1-1/+11
With using task_work_cancel(), we're potentially canceling task_work that isn't related to this specific io_wq. Use the newly added task_work_cancel_match() to ensure that we only remove and cancel work items that are specific to this io_wq. Fixes: 685fe7feedb9 ("io-wq: eliminate the need for a manager thread") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11io-wq: eliminate the need for a manager threadJens Axboe1-157/+115
io-wq relies on a manager thread to create/fork new workers, as needed. But there's really no strong need for it anymore. We have the following cases that fork a new worker: 1) Work queue. This is done from the task itself always, and it's trivial to create a worker off that path, if needed. 2) All workers have gone to sleep, and we have more work. This is called off the sched out path. For this case, use a task_work items to queue a fork-worker operation. 3) Hashed work completion. Don't think we need to do anything off this case. If need be, it could just use approach 2 as well. Part of this change is incrementing the running worker count before the fork, to avoid cases where we observe we need a worker and then queue creation of one. Then new work comes in, we fork a new one. That last queue operation should have waited for the previous worker to come up, it's quite possible we don't even need it. Hence move the worker running from before we fork it off to more efficiently handle that case. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11io-wq: refactor *_get_acct()Pavel Begunkov1-10/+7
Extract a helper for io_work_get_acct() and io_wqe_get_acct() to avoid duplication. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-08io-wq: cancel unbounded works on io-wq destroyPavel Begunkov1-0/+4
WARNING: CPU: 5 PID: 227 at fs/io_uring.c:8578 io_ring_exit_work+0xe6/0x470 RIP: 0010:io_ring_exit_work+0xe6/0x470 Call Trace: process_one_work+0x206/0x400 worker_thread+0x4a/0x3d0 kthread+0x129/0x170 ret_from_fork+0x22/0x30 INFO: task lfs-openat:2359 blocked for more than 245 seconds. task:lfs-openat state:D stack: 0 pid: 2359 ppid: 1 flags:0x00000004 Call Trace: ... wait_for_completion+0x8b/0xf0 io_wq_destroy_manager+0x24/0x60 io_wq_put_and_exit+0x18/0x30 io_uring_clean_tctx+0x76/0xa0 __io_uring_files_cancel+0x1b9/0x2e0 do_exit+0xc0/0xb40 ... Even after io-wq destroy has been issued io-wq worker threads will continue executing all left work items as usual, and may hang waiting for I/O that won't ever complete (aka unbounded). [<0>] pipe_read+0x306/0x450 [<0>] io_iter_do_read+0x1e/0x40 [<0>] io_read+0xd5/0x330 [<0>] io_issue_sqe+0xd21/0x18a0 [<0>] io_wq_submit_work+0x6c/0x140 [<0>] io_worker_handle_work+0x17d/0x400 [<0>] io_wqe_worker+0x2c0/0x330 [<0>] ret_from_fork+0x22/0x30 Cancel all unbounded I/O instead of executing them. This changes the user visible behaviour, but that's inevitable as io-wq is not per task. Suggested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/cd4b543154154cba055cf86f351441c2174d7f71.1617842918.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-01io_uring/io-wq: protect against sprintf overflowPavel Begunkov1-2/+2
task_pid may be large enough to not fit into the left space of TASK_COMM_LEN-sized buffers and overflow in sprintf. We not so care about uniqueness, so replace it with safer snprintf(). Reported-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1702c6145d7e1c46fbc382f28334c02e1a3d3994.1617267273.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-27io_uring: handle signals for IO threads like a normal threadJens Axboe1-6/+14
We go through various hoops to disallow signals for the IO threads, but there's really no reason why we cannot just allow them. The IO threads never return to userspace like a normal thread, and hence don't go through normal signal processing. Instead, just check for a pending signal as part of the work loop, and call get_signal() to handle it for us if anything is pending. With that, we can support receiving signals, including special ones like SIGSTOP. Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-25io-wq: fix race around pending work on teardownJens Axboe1-1/+5
syzbot reports that it's triggering the warning condition on having pending work on shutdown: WARNING: CPU: 1 PID: 12346 at fs/io-wq.c:1061 io_wq_destroy fs/io-wq.c:1061 [inline] WARNING: CPU: 1 PID: 12346 at fs/io-wq.c:1061 io_wq_put+0x153/0x260 fs/io-wq.c:1072 Modules linked in: CPU: 1 PID: 12346 Comm: syz-executor.5 Not tainted 5.12.0-rc2-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:io_wq_destroy fs/io-wq.c:1061 [inline] RIP: 0010:io_wq_put+0x153/0x260 fs/io-wq.c:1072 Code: 8d e8 71 90 ea 01 49 89 c4 41 83 fc 40 7d 4f e8 33 4d 97 ff 42 80 7c 2d 00 00 0f 85 77 ff ff ff e9 7a ff ff ff e8 1d 4d 97 ff <0f> 0b eb b9 8d 6b ff 89 ee 09 de bf ff ff ff ff e8 18 51 97 ff 09 RSP: 0018:ffffc90001ebfb08 EFLAGS: 00010293 RAX: ffffffff81e16083 RBX: ffff888019038040 RCX: ffff88801e86b780 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000040 RBP: 1ffff1100b2f8a80 R08: ffffffff81e15fce R09: ffffed100b2f8a82 R10: ffffed100b2f8a82 R11: 0000000000000000 R12: 0000000000000000 R13: dffffc0000000000 R14: ffff8880597c5400 R15: ffff888019038000 FS: 00007f8dcd89c700(0000) GS:ffff8880b9c00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055e9a054e160 CR3: 000000001dfb8000 CR4: 00000000001506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: io_uring_clean_tctx+0x1b7/0x210 fs/io_uring.c:8802 __io_uring_files_cancel+0x13c/0x170 fs/io_uring.c:8820 io_uring_files_cancel include/linux/io_uring.h:47 [inline] do_exit+0x258/0x2340 kernel/exit.c:780 do_group_exit+0x168/0x2d0 kernel/exit.c:922 get_signal+0x1734/0x1ef0 kernel/signal.c:2773 arch_do_signal_or_restart+0x3c/0x610 arch/x86/kernel/signal.c:811 handle_signal_work kernel/entry/common.c:147 [inline] exit_to_user_mode_loop kernel/entry/common.c:171 [inline] exit_to_user_mode_prepare+0xac/0x1e0 kernel/entry/common.c:208 __syscall_exit_to_user_mode_work kernel/entry/common.c:290 [inline] syscall_exit_to_user_mode+0x48/0x180 kernel/entry/common.c:301 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x465f69 which shouldn't happen, but seems to be possible due to a race on whether or not the io-wq manager sees a fatal signal first, or whether the io-wq workers do. If we race with queueing work and then send a fatal signal to the owning task, and the io-wq worker sees that before the manager sets IO_WQ_BIT_EXIT, then it's possible to have the worker exit and leave work behind. Just turn the WARN_ON_ONCE() into a cancelation condition instead. Reported-by: syzbot+77a738a6bc947bf639ca@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-21io_uring: don't use {test,clear}_tsk_thread_flag() for currentJens Axboe1-4/+2
Linus correctly points out that this is both unnecessary and generates much worse code on some archs as going from current to thread_info is actually backwards - and obviously just wasteful, since the thread_info is what we care about. Since io_uring only operates on current for these operations, just use test_thread_flag() instead. For io-wq, we can further simplify and use tracehook_notify_signal() to handle the TIF_NOTIFY_SIGNAL work and clear the flag. The latter isn't an actual bug right now, but it may very well be in the future if we place other work items under TIF_NOTIFY_SIGNAL. Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/io-uring/CAHk-=wgYhNck33YHKZ14mFB5MzTTk8gqXHcfj=RWTAXKwgQJgg@mail.gmail.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-21io-wq: ensure task is running before processing task_workJens Axboe1-2/+6
Mark the current task as running if we need to run task_work from the io-wq threads as part of work handling. If that is the case, then return as such so that the caller can appropriately loop back and reset if it was part of a going-to-sleep flush. Fixes: 3bfe6106693b ("io-wq: fork worker threads from original task") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-12io_uring: allow IO worker threads to be frozenJens Axboe1-1/+5
With the freezer using the proper signaling to notify us of when it's time to freeze a thread, we can re-enable normal freezer usage for the IO threads. Ensure that SQPOLL, io-wq, and the io-wq manager call try_to_freeze() appropriately, and remove the default setting of PF_NOFREEZE from create_io_thread(). Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-10kernel: make IO threads unfreezable by defaultJens Axboe1-2/+1
The io-wq threads were already marked as no-freeze, but the manager was not. On resume, we perpetually have signal_pending() being true, and hence the manager will loop and spin 100% of the time. Just mark the tasks created by create_io_thread() as PF_NOFREEZE by default, and remove any knowledge of it in io-wq and io_uring. Reported-by: Kevin Locke <kevin@kevinlocke.name> Tested-by: Kevin Locke <kevin@kevinlocke.name> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-10io-wq: fix ref leak for req in case of exit cancelationsyangerkun1-2/+1
do_work such as io_wq_submit_work that cancel the work may leave a ref of req as 1 if we have links. Fix it by call io_run_cancel. Fixes: 4fb6ac326204 ("io-wq: improve manager/worker handling over exec") Signed-off-by: yangerkun <yangerkun@huawei.com> Link: https://lore.kernel.org/r/20210309030410.3294078-1-yangerkun@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-10io-wq: remove unused 'user' member of io_wqJens Axboe1-1/+0
Previous patches killed the last user of this, now it's just a dead member in the struct. Get rid of it. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-07io-wq: warn on creating manager while exitingPavel Begunkov1-0/+2
Add a simple warning making sure that nobody tries to create a new manager while we're under IO_WQ_BIT_EXIT. That can potentially happen due to racy work submission after final put. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-06io-wq: fix race in freeing 'wq' and worker accessJens Axboe1-8/+8
Ran into a use-after-free on the main io-wq struct, wq. It has a worker ref and completion event, but the manager itself isn't holding a reference. This can lead to a race where the manager thinks there are no workers and exits, but a worker is being added. That leads to the following trace: BUG: KASAN: use-after-free in io_wqe_worker+0x4c0/0x5e0 Read of size 8 at addr ffff888108baa8a0 by task iou-wrk-3080422/3080425 CPU: 5 PID: 3080425 Comm: iou-wrk-3080422 Not tainted 5.12.0-rc1+ #110 Hardware name: Micro-Star International Co., Ltd. MS-7C60/TRX40 PRO 10G (MS-7C60), BIOS 1.60 05/13/2020 Call Trace: dump_stack+0x90/0xbe print_address_description.constprop.0+0x67/0x28d ? io_wqe_worker+0x4c0/0x5e0 kasan_report.cold+0x7b/0xd4 ? io_wqe_worker+0x4c0/0x5e0 __asan_load8+0x6d/0xa0 io_wqe_worker+0x4c0/0x5e0 ? io_worker_handle_work+0xc00/0xc00 ? recalc_sigpending+0xe5/0x120 ? io_worker_handle_work+0xc00/0xc00 ? io_worker_handle_work+0xc00/0xc00 ret_from_fork+0x1f/0x30 Allocated by task 3080422: kasan_save_stack+0x23/0x60 __kasan_kmalloc+0x80/0xa0 kmem_cache_alloc_node_trace+0xa0/0x480 io_wq_create+0x3b5/0x600 io_uring_alloc_task_context+0x13c/0x380 io_uring_add_task_file+0x109/0x140 __x64_sys_io_uring_enter+0x45f/0x660 do_syscall_64+0x32/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae Freed by task 3080422: kasan_save_stack+0x23/0x60 kasan_set_track+0x20/0x40 kasan_set_free_info+0x24/0x40 __kasan_slab_free+0xe8/0x120 kfree+0xa8/0x400 io_wq_put+0x14a/0x220 io_wq_put_and_exit+0x9a/0xc0 io_uring_clean_tctx+0x101/0x140 __io_uring_files_cancel+0x36e/0x3c0 do_exit+0x169/0x1340 __x64_sys_exit+0x34/0x40 do_syscall_64+0x32/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae Have the manager itself hold a reference, and now both drop points drop and complete if we hit zero, and the manager can unconditionally do a wait_for_completion() instead of having a race between reading the ref count and waiting if it was non-zero. Fixes: fb3a1f6c745c ("io-wq: have manager wait for all workers to exit") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-05io-wq: kill hashed waitqueue before manager exitsJens Axboe1-4/+5
If we race with shutting down the io-wq context and someone queueing a hashed entry, then we can exit the manager with it armed. If it then triggers after the manager has exited, we can have a use-after-free where io_wqe_hash_wake() attempts to wake a now gone manager process. Move the killing of the hashed write queue into the manager itself, so that we know we've killed it before the task exits. Fixes: e941894eae31 ("io-wq: make buffered file write hashed work map per-ctx") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-05io_uring: move to using create_io_thread()Jens Axboe1-88/+35
This allows us to do task creation and setup without needing to use completions to try and synchronize with the starting thread. Get rid of the old io_wq_fork_thread() wrapper, and the 'wq' and 'worker' startup completion events - we can now do setup before the task is running. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-04io-wq: ensure all pending work is canceled on exitJens Axboe1-9/+33
If we race on shutting down the io-wq, then we should ensure that any work that was queued after workers shutdown is canceled. Harden the add work check a bit too, checking for IO_WQ_BIT_EXIT and cancel if it's set. Add a WARN_ON() for having any work before we kill the io-wq context. Reported-by: syzbot+91b4b56ead187d35c9d3@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-04io_uring: ensure that threads freeze on suspendJens Axboe1-0/+3
Alex reports that his system fails to suspend using 5.12-rc1, with the following dump: [ 240.650300] PM: suspend entry (deep) [ 240.650748] Filesystems sync: 0.000 seconds [ 240.725605] Freezing user space processes ... [ 260.739483] Freezing of tasks failed after 20.013 seconds (3 tasks refusing to freeze, wq_busy=0): [ 260.739497] task:iou-mgr-446 state:S stack: 0 pid: 516 ppid: 439 flags:0x00004224 [ 260.739504] Call Trace: [ 260.739507] ? sysvec_apic_timer_interrupt+0xb/0x81 [ 260.739515] ? pick_next_task_fair+0x197/0x1cde [ 260.739519] ? sysvec_reschedule_ipi+0x2f/0x6a [ 260.739522] ? asm_sysvec_reschedule_ipi+0x12/0x20 [ 260.739525] ? __schedule+0x57/0x6d6 [ 260.739529] ? del_timer_sync+0xb9/0x115 [ 260.739533] ? schedule+0x63/0xd5 [ 260.739536] ? schedule_timeout+0x219/0x356 [ 260.739540] ? __next_timer_interrupt+0xf1/0xf1 [ 260.739544] ? io_wq_manager+0x73/0xb1 [ 260.739549] ? io_wq_create+0x262/0x262 [ 260.739553] ? ret_from_fork+0x22/0x30 [ 260.739557] task:iou-mgr-517 state:S stack: 0 pid: 522 ppid: 439 flags:0x00004224 [ 260.739561] Call Trace: [ 260.739563] ? sysvec_apic_timer_interrupt+0xb/0x81 [ 260.739566] ? pick_next_task_fair+0x16f/0x1cde [ 260.739569] ? sysvec_apic_timer_interrupt+0xb/0x81 [ 260.739571] ? asm_sysvec_apic_timer_interrupt+0x12/0x20 [ 260.739574] ? __schedule+0x5b7/0x6d6 [ 260.739578] ? del_timer_sync+0x70/0x115 [ 260.739581] ? schedule_timeout+0x211/0x356 [ 260.739585] ? __next_timer_interrupt+0xf1/0xf1 [ 260.739588] ? io_wq_check_workers+0x15/0x11f [ 260.739592] ? io_wq_manager+0x69/0xb1 [ 260.739596] ? io_wq_create+0x262/0x262 [ 260.739600] ? ret_from_fork+0x22/0x30 [ 260.739603] task:iou-wrk-517 state:S stack: 0 pid: 523 ppid: 439 flags:0x00004224 [ 260.739607] Call Trace: [ 260.739609] ? __schedule+0x5b7/0x6d6 [ 260.739614] ? schedule+0x63/0xd5 [ 260.739617] ? schedule_timeout+0x219/0x356 [ 260.739621] ? __next_timer_interrupt+0xf1/0xf1 [ 260.739624] ? task_thread.isra.0+0x148/0x3af [ 260.739628] ? task_thread_unbound+0xa/0xa [ 260.739632] ? task_thread_bound+0x7/0x7 [ 260.739636] ? ret_from_fork+0x22/0x30 [ 260.739647] OOM killer enabled. [ 260.739648] Restarting tasks ... done. [ 260.740077] PM: suspend exit Play nice and ensure that any thread we create will call try_to_freeze() at an opportune time so that memory suspend can proceed. For the io-wq worker threads, mark them as PF_NOFREEZE. They could potentially be blocked for a long time. Reported-by: Alex Xu (Hello71) <alex_y_xu@yahoo.ca> Tested-by: Alex Xu (Hello71) <alex_y_xu@yahoo.ca> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-04io-wq: fix error path leak of buffered write hash mapJens Axboe1-1/+1
The 'err' path should include the hash put, we already grabbed a reference once we get that far. Fixes: e941894eae31 ("io-wq: make buffered file write hashed work map per-ctx") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-04io_uring: move cred assignment into io_issue_sqe()Jens Axboe1-26/+0
If we move it in there, then we no longer have to care about it in io-wq. This means we can drop the cred handling in io-wq, and we can drop the REQ_F_WORK_INITIALIZED flag and async init functions as that was the last user of it since we moved to the new workers. Then we can also drop io_wq_work->creds, and just hold the personality u16 in there instead. Suggested-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-04io-wq: provide an io_wq_put_and_exit() helperJens Axboe1-10/+19
If we put the io-wq from io_uring, we really want it to exit. Provide a helper that does that for us. Couple that with not having the manager hold a reference to the 'wq' and the normal SQPOLL exit will tear down the io-wq context appropriate. On the io-wq side, our wq context is per task, so only the task itself is manipulating ->manager and hence it's safe to check and clear without any extra locking. We just need to ensure that the manager task stays around, in case it exits. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-04io-wq: fix double put of 'wq' in error pathJens Axboe1-2/+0
We are already freeing the wq struct in both spots, so don't put it and get it freed twice. Reported-by: syzbot+7bf785eedca35ca05501@syzkaller.appspotmail.com Fixes: 4fb6ac326204 ("io-wq: improve manager/worker handling over exec") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-04io-wq: wait for manager exit on wq destroyJens Axboe1-1/+6
The manager waits for the workers, hence the manager is always valid if workers are running. Now also have wq destroy wait for the manager on exit, so we now everything is gone. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-04io-wq: rename wq->done completion to wq->startedJens Axboe1-4/+4
This is a leftover from a different use cases, it's used to wait for the manager to startup. Rename it as such. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-04io-wq: don't ask for a new worker if we're exitingJens Axboe1-0/+2
If we're in the process of shutting down the async context, then don't create new workers if we already have at least the fixed one. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-04io-wq: have manager wait for all workers to exitJens Axboe1-8/+22
Instead of having to wait separately on workers and manager, just have the manager wait on the workers. We use an atomic_t for the reference here, as we need to start at 0 and allow increment from that. Since the number of workers is naturally capped by the allowed nr of processes, and that uses an int, there is no risk of overflow. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-01io-wq: wait for worker startup when forking a new oneJens Axboe1-0/+4
We need to have our worker count updated before continuing, to avoid cases where we repeatedly think we need a new worker, but a fork is already in progress. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-25io-wq: remove now unused IO_WQ_BIT_ERRORJens Axboe1-10/+0
This flag is now dead, remove it. Fixes: 1cbd9c2bcf02 ("io-wq: don't create any IO workers upfront") Signed-off-by: Jens Axboe <axboe@kernel.dk>