Age | Commit message (Collapse) | Author | Files | Lines |
|
With end_io handlers now being able to potentially pass ownership of
the request upon completion, we can allow requests with end_io handlers
in the batch completion handling.
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Co-developed-by: Stefan Roesch <shr@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Everything is just converted to returning RQ_END_IO_NONE, and there
should be no functional changes with this patch.
In preparation for allowing the end_io handler to pass ownership back
to the block layer, rather than retain ownership of the request.
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The filesystem IO path can take advantage of allocating batches of
requests, if the underlying submitter tells the block layer about it
through the blk_plug. For passthrough IO, the exported API is the
blk_mq_alloc_request() helper, and that one does not allow for
request caching.
Wire up request caching for blk_mq_alloc_request(), which is generally
done without having a bio available upfront.
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We've never had any useful reports from this BUG_ON(), and in fact a
number of the BUG_ON()'s in the flush handling need to be turned into
more graceful handling.
In preparation for allowing batched completions of the end_io handling,
where we can enter the flush completion with queuelist having been reused
for the batch, get rid of this BUG_ON().
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Commit 4acb83417cad ("sbitmap: fix batched wait_cnt accounting")
is a big improvement: without it, I had to revert to before commit
040b83fcecfb ("sbitmap: fix possible io hung due to lost wakeup")
to avoid the high system time and freezes which that had introduced.
Now okay on the NVME laptop, but 4acb83417cad is a disaster for heavy
swapping (kernel builds in low memory) on another: soon locking up in
sbitmap_queue_wake_up() (into which __sbq_wake_up() is inlined), cycling
around with waitqueue_active() but wait_cnt 0 . Here is a backtrace,
showing the common pattern of outer sbitmap_queue_wake_up() interrupted
before setting wait_cnt 0 back to wake_batch (in some cases other CPUs
are idle, in other cases they're spinning for a lock in dd_bio_merge()):
sbitmap_queue_wake_up < sbitmap_queue_clear < blk_mq_put_tag <
__blk_mq_free_request < blk_mq_free_request < __blk_mq_end_request <
scsi_end_request < scsi_io_completion < scsi_finish_command <
scsi_complete < blk_complete_reqs < blk_done_softirq < __do_softirq <
__irq_exit_rcu < irq_exit_rcu < common_interrupt < asm_common_interrupt <
_raw_spin_unlock_irqrestore < __wake_up_common_lock < __wake_up <
sbitmap_queue_wake_up < sbitmap_queue_clear < blk_mq_put_tag <
__blk_mq_free_request < blk_mq_free_request < dd_bio_merge <
blk_mq_sched_bio_merge < blk_mq_attempt_bio_merge < blk_mq_submit_bio <
__submit_bio < submit_bio_noacct_nocheck < submit_bio_noacct <
submit_bio < __swap_writepage < swap_writepage < pageout <
shrink_folio_list < evict_folios < lru_gen_shrink_lruvec <
shrink_lruvec < shrink_node < do_try_to_free_pages < try_to_free_pages <
__alloc_pages_slowpath < __alloc_pages < folio_alloc < vma_alloc_folio <
do_anonymous_page < __handle_mm_fault < handle_mm_fault <
do_user_addr_fault < exc_page_fault < asm_exc_page_fault
See how the process-context sbitmap_queue_wake_up() has been interrupted,
after bringing wait_cnt down to 0 (and in this example, after doing its
wakeups), before advancing wake_index and refilling wake_cnt: an
interrupt-context sbitmap_queue_wake_up() of the same sbq gets stuck.
I have almost no grasp of all the possible sbitmap races, and their
consequences: but __sbq_wake_up() can do nothing useful while wait_cnt 0,
so it is better if sbq_wake_ptr() skips on to the next ws in that case:
which fixes the lockup and shows no adverse consequence for me.
The check for wait_cnt being 0 is obviously racy, and ultimately can lead
to lost wakeups: for example, when there is only a single waitqueue with
waiters. However, lost wakeups are unlikely to matter in these cases,
and a proper fix requires redesign (and benchmarking) of the batched
wakeup code: so let's plug the hole with this bandaid for now.
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/9c2038a7-cdc5-5ee-854c-fbc6168bf16@google.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
send zc is not restricted to !IO_URING_F_UNLOCKED anymore and so
we can't use task-tw ordering trick to order notification cqes
with requests completions. In this case leave it alone and let
io_send_zc_cleanup() flush it.
Cc: stable@vger.kernel.org
Fixes: 53bdc88aac9a2 ("io_uring/notif: order notif vs send CQEs")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0031f3a00d492e814a4a0935a2029a46d9c9ba06.1664486545.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
io_sendmsg_copy_hdr() may clear msg->msg_name if the userspace didn't
provide it, we should retain NULL in this case.
Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/97d49f61b5ec76d0900df658cfde3aa59ff22121.1664486545.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This isn't a reliable mechanism to tell if we have task_work pending, we
really should be looking at whether we have any items queued. This is
problematic if forward progress is gated on running said task_work. One
such example is reading from a pipe, where the write side has been closed
right before the read is started. The fput() of the file queues TWA_RESUME
task_work, and we need that task_work to be run before ->release() is
called for the pipe. If ->release() isn't called, then the read will sit
forever waiting on data that will never arise.
Fix this by io_run_task_work() so it checks if we have task_work pending
rather than rely on TIF_NOTIFY_SIGNAL for that. The latter obviously
doesn't work for task_work that is queued without TWA_SIGNAL.
Reported-by: Christiano Haesbaert <haesbaert@haesbaert.org>
Cc: stable@vger.kernel.org
Link: https://github.com/axboe/liburing/issues/665
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We can't call these off the kiocb completion as that might be off
soft/hard irq context. Defer the calls to when we process the
task_work for this request. That avoids valid complaints like:
stack backtrace:
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 6.0.0-rc6-syzkaller-00321-g105a36f3694e #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/26/2022
Call Trace:
<IRQ>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
print_usage_bug kernel/locking/lockdep.c:3961 [inline]
valid_state kernel/locking/lockdep.c:3973 [inline]
mark_lock_irq kernel/locking/lockdep.c:4176 [inline]
mark_lock.part.0.cold+0x18/0xd8 kernel/locking/lockdep.c:4632
mark_lock kernel/locking/lockdep.c:4596 [inline]
mark_usage kernel/locking/lockdep.c:4527 [inline]
__lock_acquire+0x11d9/0x56d0 kernel/locking/lockdep.c:5007
lock_acquire kernel/locking/lockdep.c:5666 [inline]
lock_acquire+0x1ab/0x570 kernel/locking/lockdep.c:5631
__fs_reclaim_acquire mm/page_alloc.c:4674 [inline]
fs_reclaim_acquire+0x115/0x160 mm/page_alloc.c:4688
might_alloc include/linux/sched/mm.h:271 [inline]
slab_pre_alloc_hook mm/slab.h:700 [inline]
slab_alloc mm/slab.c:3278 [inline]
__kmem_cache_alloc_lru mm/slab.c:3471 [inline]
kmem_cache_alloc+0x39/0x520 mm/slab.c:3491
fanotify_alloc_fid_event fs/notify/fanotify/fanotify.c:580 [inline]
fanotify_alloc_event fs/notify/fanotify/fanotify.c:813 [inline]
fanotify_handle_event+0x1130/0x3f40 fs/notify/fanotify/fanotify.c:948
send_to_group fs/notify/fsnotify.c:360 [inline]
fsnotify+0xafb/0x1680 fs/notify/fsnotify.c:570
__fsnotify_parent+0x62f/0xa60 fs/notify/fsnotify.c:230
fsnotify_parent include/linux/fsnotify.h:77 [inline]
fsnotify_file include/linux/fsnotify.h:99 [inline]
fsnotify_access include/linux/fsnotify.h:309 [inline]
__io_complete_rw_common+0x485/0x720 io_uring/rw.c:195
io_complete_rw+0x1a/0x1f0 io_uring/rw.c:228
iomap_dio_complete_work fs/iomap/direct-io.c:144 [inline]
iomap_dio_bio_end_io+0x438/0x5e0 fs/iomap/direct-io.c:178
bio_endio+0x5f9/0x780 block/bio.c:1564
req_bio_endio block/blk-mq.c:695 [inline]
blk_update_request+0x3fc/0x1300 block/blk-mq.c:825
scsi_end_request+0x7a/0x9a0 drivers/scsi/scsi_lib.c:541
scsi_io_completion+0x173/0x1f70 drivers/scsi/scsi_lib.c:971
scsi_complete+0x122/0x3b0 drivers/scsi/scsi_lib.c:1438
blk_complete_reqs+0xad/0xe0 block/blk-mq.c:1022
__do_softirq+0x1d3/0x9c6 kernel/softirq.c:571
invoke_softirq kernel/softirq.c:445 [inline]
__irq_exit_rcu+0x123/0x180 kernel/softirq.c:650
irq_exit_rcu+0x5/0x20 kernel/softirq.c:662
common_interrupt+0xa9/0xc0 arch/x86/kernel/irq.c:240
Fixes: f63cf5192fe3 ("io_uring: ensure that fsnotify is always called")
Link: https://lore.kernel.org/all/20220929135627.ykivmdks2w5vzrwg@quack3/
Reported-by: syzbot+dfcc5f4da15868df7d4d@syzkaller.appspotmail.com
Reported-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
There are two places in the block layer at the moment where
blk_mq_plug() helper could be used instead of directly accessing the
plug from struct current. In both these cases, directly accessing the plug
should not have any consequences for zoned devices.
Make the intent explicit by adding comments instead of introducing unwanted
checks with blk_mq_plug() helper.[1]
[1] https://lore.kernel.org/linux-block/f6e54907-1035-2b2c-6387-ed178be05ccb@kernel.dk/
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Suggested-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20220929144141.140077-1-p.raghav@samsung.com
[axboe: fixup multi-line comment style]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The current implementation of blk_mq_plug() disables plugging for all
operations that involves a transfer to the device as we just check if
the last bit in op_is_write() function.
Modify blk_mq_plug() to disable plugging only for REQ_OP_WRITE and
REQ_OP_WRITE_ZEROS as they might require a zone lock.
Suggested-by: Christoph Hellwig <hch@lst.de>
Suggested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220929074745.103073-2-p.raghav@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
I hit a very bad problem during my tests of SENDMSG_ZC.
BUG(); in first_iovec_segment() triggered very easily.
The problem was io_setup_async_msg() in the partial retry case,
which seems to happen more often with _ZC.
iov_iter_iovec_advance() may change i->iov in order to have i->iov_offset
being only relative to the first element.
Which means kmsg->msg.msg_iter.iov is no longer the
same as kmsg->fast_iov.
But this would rewind the copy to be the start of
async_msg->fast_iov, which means the internal
state of sync_msg->msg.msg_iter is inconsitent.
I tested with 5 vectors with length like this 4, 0, 64, 20, 8388608
and got a short writes with:
- ret=2675244 min_ret=8388692 => remaining 5713448 sr->done_io=2675244
- ret=-EAGAIN => io_uring_poll_arm
- ret=4911225 min_ret=5713448 => remaining 802223 sr->done_io=7586469
- ret=-EAGAIN => io_uring_poll_arm
- ret=802223 min_ret=802223 => res=8388692
While this was easily triggered with SENDMSG_ZC (queued for 6.1),
it was a potential problem starting with 7ba89d2af17aa879dda30f5d5d3f152e587fc551
in 5.18 for IORING_OP_RECVMSG.
And also with 4c3c09439c08b03d9503df0ca4c7619c5842892e in 5.19
for IORING_OP_SENDMSG.
However 257e84a5377fbbc336ff563833a8712619acce56 introduced the critical
code into io_setup_async_msg() in 5.11.
Fixes: 7ba89d2af17aa ("io_uring: ensure recv and recvmsg handle MSG_WAITALL correctly")
Fixes: 257e84a5377fb ("io_uring: refactor sendmsg/recvmsg iov managing")
Cc: stable@vger.kernel.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b2e7be246e2fb173520862b0c7098e55767567a2.1664436949.git.metze@samba.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We're currently ignoring the dest address with non-zerocopy send because
even though we copy it from the userspace shortly after ->msg_name gets
zeroed. Move msghdr init earlier.
Fixes: 516e82f0e043a ("io_uring/net: support non-zerocopy sendto")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/176ced5e8568aa5d300ca899b7f05b303ebc49fd.1664409532.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
As far as I can tell there is no need for the staged setup in
dasd, so allocate the tagset and the disk with the queue in
dasd_gendisk_alloc.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Stefan Haberland <sth@linux.ibm.com>
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Link: https://lore.kernel.org/r/20220928143945.1687114-2-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We currently only add a notification CQE when the send succeded, i.e.
cqe.res >= 0. However, it'd be more robust to do buffer notifications
for failed requests as well in case drivers decide do something fanky.
Always return a buffer notification after initial prep, don't hide it.
This behaviour is better aligned with documentation and the patch also
helps the userspace to respect it.
Cc: stable@vger.kernel.org # 6.0
Suggested-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9c8bead87b2b980fcec441b8faef52188b4a6588.1664292100.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
blkg_conf_prep just creates a new blkg structure, there is no real
need to update the lookup hint which should only be done on a
successful lookup in the I/O path.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220927065425.257876-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
nvmet is a consumer of the block layer and should not directly look at
the request_queue. Use the bdev_ helpers to retrieve the device limits
instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
|
|
nvmet is a consumer of the block layer and should not directly look at
the request_queue. Just use the NUMA node ID from the gendisk instead of
the request_queue.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
|
|
The hctx's run_work may be racing with the elevator switch when
reinitializing hardware queues. The queue is merely frozen in this
context, but that only prevents requests from allocating and doesn't
stop the hctx work from running. The work may get an elevator pointer
that's being torn down, and can result in use-after-free errors and
kernel panics (example below). Use the quiesced elevator switch instead,
and make the previous one static since it is now only used locally.
nvme nvme0: resetting controller
nvme nvme0: 32/0/0 default/read/poll queues
BUG: kernel NULL pointer dereference, address: 0000000000000008
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 80000020c8861067 P4D 80000020c8861067 PUD 250f8c8067 PMD 0
Oops: 0000 [#1] SMP PTI
Workqueue: kblockd blk_mq_run_work_fn
RIP: 0010:kyber_has_work+0x29/0x70
...
Call Trace:
__blk_mq_do_dispatch_sched+0x83/0x2b0
__blk_mq_sched_dispatch_requests+0x12e/0x170
blk_mq_sched_dispatch_requests+0x30/0x60
__blk_mq_run_hw_queue+0x2b/0x50
process_one_work+0x1ef/0x380
worker_thread+0x2d/0x3e0
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220927155652.3260724-1-kbusch@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Replace blk_queue_nowait with a bdev_nowait helpers that takes the
block_device given that the I/O submission path should not have to
look into the request_queue.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
Link: https://lore.kernel.org/r/20220927075815.269694-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Unused now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
|
|
Use the common helpers to allocate and free the tagsets. To make this
work the generic nvme_ctrl now needs to be stored in the hctx private
data instead of the nvme_loop_ctrl.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
|
|
Point the private data to the generic controller structure in preparation
of using the common tagset init/exit code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
|
|
Defer initializing the sqsize field from the options until it has been
capped by MAXCMD.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
|
|
Use the common helpers to allocate and free the tagsets. To make this
work the generic nvme_ctrl now needs to be stored in the hctx private
data instead of the nvme_fc_ctrl.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: James Smart <jsmart2021@gmail.com>
|
|
Point the private data to the generic controller structure in preparation
of using the common tagset init/exit code and use the chance the cleanup
the init_hctx methods a bit.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: James Smart <jsmart2021@gmail.com>
|
|
Also update the sqsize field when capping the queue size, and remove the
check a queue size that is larger than sqsize given that sqsize is only
initialized from opts->queue_size.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: James Smart <jsmart2021@gmail.com>
|
|
Use the common helpers to allocate and free the tagsets. To make this
work the generic nvme_ctrl now needs to be stored in the hctx private
data instead of the nvme_rdma_ctrl.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
|
|
Point the private data to the generic controller structure in preparation
of using the common tagset init/exit code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
|
|
Use the common helpers to allocate and free the tagsets. To make this
work the generic nvme_ctrl now needs to be stored in the hctx private
data instead of the nvme_tcp_ctrl.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
|
|
Point the private data to the generic controller structure in preparation
of using the common tagset init/exit code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
|
|
->nvme_tcp_queue is not used anywhere, so remove it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
|
|
Add common helpers to allocate and tear down the admin and I/O tag sets,
including the special queues allocated with them.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
|
|
Add Hannes as the nvme-auth maintainer.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
|
|
The code to set the result field for the admin and I/O connect commands
is not only verbose and duplicated, but also violates the aliasing
rules as it accesses both the u16 and u32 members in the union.
Add a little helper to sort all that out.
Fixes: db1312dd9548 ("nvmet: implement basic In-Band Authentication")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
|
|
Mark them as unsigned so that we don't need extra casts, and define
them relative to cdword0 instead of requiring extra shifts.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
|
|
Currently blktests nvme/002 trips up debugobjects if CONFIG_NVME_AUTH is
enabled, but authentication is not on a queue. This is because
nvmet_auth_sq_free cancels sq->auth_expired_work unconditionaly, while
auth_expired_work is only ever initialized if authentication is enabled
for a given controller.
Fix this by calling most of what is nvmet_init_auth unconditionally
when initializing the SQ, and just do the setting of the result
field in the connect command handler.
Fixes: db1312dd9548 ("nvmet: implement basic In-Band Authentication")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
|
|
There is only a single call-site of nvmet_tcp_finish_cmd(), this
becomes redundant. Remove nvmet_tcp_finish_cmd() and use the original
function body instead.
Suggested-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
ttag is used as an index to get cmd in nvmet_tcp_handle_h2c_data_pdu(),
add a bounds check to avoid out-of-bounds access.
Signed-off-by: Varun Prakash <varun@chelsio.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
As per NVMe/TCP transport specification ICReq PDU is the first PDU received
by the controller and controller should receive only one ICReq PDU.
If controller receives more than one ICReq PDU then this can be considered
as fatal error.
nvmet-tcp driver does not check for ICReq PDU opcode if queue state is
NVMET_TCP_Q_LIVE. In LIVE state ICReq PDU is treated as CapsuleCmd PDU,
this can result in abnormal behavior.
Add a check for ICReq PDU in nvmet_tcp_done_recv_pdu() to fix this issue.
Signed-off-by: Varun Prakash <varun@chelsio.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
nvmet-tcp frees CMD buffers in nvmet_tcp_uninit_data_in_cmds(),
and waits the inflight IO requests in nvmet_sq_destroy(). During wait
the inflight IO requests, the callback nvmet_tcp_queue_response()
is called from backend after IO complete, this leads a typical
Use-After-Free issue like this:
BUG: kernel NULL pointer dereference, address: 0000000000000008
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 107f80067 P4D 107f80067 PUD 10789e067 PMD 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 1 PID: 123 Comm: kworker/1:1H Kdump: loaded Tainted: G E 6.0.0-rc2.bm.1-amd64 #15
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
Workqueue: nvmet_tcp_wq nvmet_tcp_io_work [nvmet_tcp]
RIP: 0010:shash_ahash_digest+0x2b/0x110
Code: 1f 44 00 00 41 57 41 56 41 55 41 54 55 48 89 fd 53 48 89 f3 48 83 ec 08 44 8b 67 30 45 85 e4 74 1c 48 8b 57 38 b8 00 10 00 00 <44> 8b 7a 08 44 29 f8 39 42 0c 0f 46 42 0c 41 39 c4 76 43 48 8b 03
RSP: 0018:ffffc9000051bdd8 EFLAGS: 00010206
RAX: 0000000000001000 RBX: ffff888100ab5470 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff888100ab5470 RDI: ffff888100ab5420
RBP: ffff888100ab5420 R08: ffff8881024d08c8 R09: ffff888103e1b4b8
R10: 8080808080808080 R11: 0000000000000000 R12: 0000000000001000
R13: 0000000000000000 R14: ffff88813412bd4c R15: ffff8881024d0800
FS: 0000000000000000(0000) GS:ffff88883fa40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000008 CR3: 0000000104b48000 CR4: 0000000000350ee0
Call Trace:
<TASK>
nvmet_tcp_io_work+0xa52/0xb52 [nvmet_tcp]
? __switch_to+0x106/0x420
process_one_work+0x1ae/0x380
? process_one_work+0x380/0x380
worker_thread+0x30/0x360
? process_one_work+0x380/0x380
kthread+0xe6/0x110
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x1f/0x30
Separate nvmet_tcp_uninit_data_in_cmds() into two steps:
uninit data in cmds <- new step 1
nvmet_sq_destroy();
cancel_work_sync(&queue->io_work);
free CMD buffers <- new step 2
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
We've been reporting 2 maps regardless of whether the module parameter
asked for anything beyond the default queues. A consequence of this
means that blk-mq will reinitialize the all the hardware contexts and io
schedulers on every controller reset when the mapping is exactly the
same as before. This unnecessary overhead is adding several milliseconds
on a reset for environments that don't need it. Report the actual number
of mappings in use.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
If swiotlb is force enabled dma_max_mapping_size ends up calling
swiotlb_max_mapping_size which takes into account the min align mask for
the device. Set the min align mask for nvme driver before calling
dma_max_mapping_size while calculating max hw sectors.
Signed-off-by: Rishabh Bhatnagar <risbhat@amazon.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
When a discovery controller is disconnected, no AENs will arrive to
notify the host about discovery log change events.
In order to solve this, send a uevent notification when a
persistent discovery controller reconnects. We add a new ctrl
flag NVME_CTRL_STARTED_ONCE that will be set on the first
start, and consecutive calls will find it set, and send the
event to userspace if the controller is a discovery controller.
Upon the event reception, userspace will re-read the discovery
log page and will act upon changes as it sees fit.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
We expect to grow a few of these flags for various purposes
so make them a proper enumeration.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
The subsystem reset writes to a register, so we have to ensure the
device state is capable of handling that otherwise the driver may access
unmapped registers. Use the state machine to ensure the subsystem reset
doesn't try to write registers on a device already undergoing this type
of reset.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=214771
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
The passthrough commands already have this restriction, but the other
operations do not. Require the same capabilities for all users as all of
these operations, which include resets and rescans, can be disruptive.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
The firmware revision can change on after a reset so copy the most
recent info each time instead of just the first time, otherwise the
sysfs firmware_rev entry may contain stale data.
Reported-by: Jeff Lien <jeff.lien@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Chao Leng <lengchao@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
If a reset occurs after the scan work attempts to issue a command, the
reset may quisce the admin queue, which blocks the scan work's command
from dispatching. The scan work will not be able to complete while the
queue is quiesced.
Meanwhile, the reset work will cancel all outstanding admin tags and
wait until all requests have transitioned to idle, which includes the
passthrough request. But the passthrough request won't be set to idle
until after the scan_work flushes, so we're deadlocked.
Fix this by handling the end effects after the request has been freed.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216354
Reported-by: Jonathan Derrick <Jonathan.Derrick@solidigm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chao Leng <lengchao@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Prepare for storing the blkcg information in the gendisk instead of
the request_queue.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-18-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|