| Age | Commit message (Collapse) | Author | Files | Lines |
|
Pull block updates from Jens Axboe:
- Improve the type checking of request flags (Bart)
- Ensure queue mapping for a single queues always picks the right queue
(Bart)
- Sanitize the io priority handling (Jan)
- rq-qos race fix (Jinke)
- Reserved tags handling improvements (John)
- Separate memory alignment from file/disk offset aligment for O_DIRECT
(Keith)
- Add new ublk driver, userspace block driver using io_uring for
communication with the userspace backend (Ming)
- Use try_cmpxchg() to cleanup the code in various spots (Uros)
- Finally remove bdevname() (Christoph)
- Clean up the zoned device handling (Christoph)
- Clean up independent access range support (Christoph)
- Clean up and improve block sysfs handling (Christoph)
- Clean up and improve teardown of block devices.
This turns the usual two step process into something that is simpler
to implement and handle in block drivers (Christoph)
- Clean up chunk size handling (Christoph)
- Misc cleanups and fixes (Bart, Bo, Dan, GuoYong, Jason, Keith, Liu,
Ming, Sebastian, Yang, Ying)
* tag 'for-5.20/block-2022-07-29' of git://git.kernel.dk/linux-block: (178 commits)
ublk_drv: fix double shift bug
ublk_drv: make sure that correct flags(features) returned to userspace
ublk_drv: fix error handling of ublk_add_dev
ublk_drv: fix lockdep warning
block: remove __blk_get_queue
block: call blk_mq_exit_queue from disk_release for never added disks
blk-mq: fix error handling in __blk_mq_alloc_disk
ublk: defer disk allocation
ublk: rewrite ublk_ctrl_get_queue_affinity to not rely on hctx->cpumask
ublk: fold __ublk_create_dev into ublk_ctrl_add_dev
ublk: cleanup ublk_ctrl_uring_cmd
ublk: simplify ublk_ch_open and ublk_ch_release
ublk: remove the empty open and release block device operations
ublk: remove UBLK_IO_F_PREFLUSH
ublk: add a MAINTAINERS entry
block: don't allow the same type rq_qos add more than once
mmc: fix disk/queue leak in case of adding disk failure
ublk_drv: fix an IS_ERR() vs NULL check
ublk: remove UBLK_IO_F_INTEGRITY
ublk_drv: remove unneeded semicolon
...
|
|
zs_malloc returns 0 if it fails. zs_zpool_malloc will return -1 when
zs_malloc return 0. But -1 makes the return value unclear.
For example, when zswap_frontswap_store calls zs_malloc through
zs_zpool_malloc, it will return -1 to its caller. The other return value
is -EINVAL, -ENODEV or something else.
This commit changes zs_malloc to return ERR_PTR on failure. It didn't
just let zs_zpool_malloc return -ENOMEM becaue zs_malloc has two types of
failure:
- size is not OK return -EINVAL
- memory alloc fail return -ENOMEM.
Link: https://lkml.kernel.org/r/20220714080757.12161-1-teawater@gmail.com
Signed-off-by: Hui Zhu <teawater@antgroup.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The test/clear_bit() functions take a bit number, but this code is
passing as shifted value. It's the equivalent of saying BIT(BIT(0))
instead of just BIT(0).
This doesn't affect runtime because numbers are small and it's done
consistently.
Fixes: fa362045564e ("ublk: simplify ublk_ch_open and ublk_ch_release")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/Yt/2R/+MJf/MSoyl@kili
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Userspace may support more features or new added flags, but the driver
side can be old, so make sure correct flags(features) returned to
userpsace, then userspace can work as expected.
Also mark the 2nd flags as reversed, just use the 1st one. When we run
out of flags, the reserved one can be handled at that time.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220722103817.631258-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
__ublk_destroy_dev() is called for handling error in ublk_add_dev(),
but either tagset isn't allocated or mutex isn't initialized.
So fix the issue by letting replacing ublk_add_dev with a
ublk_add_tag_set function that is much more limited in scope and
instead unwind every single step directly in ublk_ctrl_add_dev.
To allow for this refactor the device freeing so that there is
a helper for freeing the device number instead of coupling that
with freeing the mutex and the memory.
Note that this now copies the dev_info to userspace before adding
the character device. This not only simplifies the erro handling
in ublk_ctrl_add_dev, but also means that the character device
can only be seen by userspace if the device addition succeeded.
Based on a patch from Ming Lei.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220722103817.631258-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
ub->mutex is used to protecting reading and writing ub->mm, then the
following lockdep warning is triggered.
Fix it by using one dedicated spin lock for protecting ub->mm.
[1] lockdep warning
[ 25.046186] ======================================================
[ 25.048886] WARNING: possible circular locking dependency detected
[ 25.051610] 5.19.0-rc4_for-v5.20+ #149 Not tainted
[ 25.053665] ------------------------------------------------------
[ 25.056334] ublk/989 is trying to acquire lock:
[ 25.058296] ffff975d0329a918 (&disk->open_mutex){+.+.}-{3:3}, at: bd_register_pending_holders+0x2a/0x110
[ 25.063678]
[ 25.063678] but task is already holding lock:
[ 25.066246] ffff975d1df59708 (&ub->mutex){+.+.}-{3:3}, at: ublk_ctrl_uring_cmd+0x2df/0x730
[ 25.069423]
[ 25.069423] which lock already depends on the new lock.
[ 25.069423]
[ 25.072603]
[ 25.072603] the existing dependency chain (in reverse order) is:
[ 25.074908]
[ 25.074908] -> #3 (&ub->mutex){+.+.}-{3:3}:
[ 25.076386] __mutex_lock+0x93/0x870
[ 25.077470] ublk_ch_mmap+0x3a/0x140
[ 25.078494] mmap_region+0x375/0x5a0
[ 25.079386] do_mmap+0x33a/0x530
[ 25.080168] vm_mmap_pgoff+0xb9/0x150
[ 25.080979] ksys_mmap_pgoff+0x184/0x1f0
[ 25.081838] do_syscall_64+0x37/0x80
[ 25.082653] entry_SYSCALL_64_after_hwframe+0x46/0xb0
[ 25.083730]
[ 25.083730] -> #2 (&mm->mmap_lock#2){++++}-{3:3}:
[ 25.084707] __might_fault+0x55/0x80
[ 25.085344] _copy_from_user+0x1e/0xa0
[ 25.086020] get_sg_io_hdr+0x26/0xb0
[ 25.086651] scsi_ioctl+0x42f/0x960
[ 25.087267] sr_block_ioctl+0xe8/0x100
[ 25.087734] blkdev_ioctl+0x134/0x2b0
[ 25.088196] __x64_sys_ioctl+0x8a/0xc0
[ 25.088677] do_syscall_64+0x37/0x80
[ 25.089044] entry_SYSCALL_64_after_hwframe+0x46/0xb0
[ 25.089548]
[ 25.089548] -> #1 (&cd->lock){+.+.}-{3:3}:
[ 25.090072] __mutex_lock+0x93/0x870
[ 25.090452] sr_block_open+0x64/0xe0
[ 25.090837] blkdev_get_whole+0x26/0x90
[ 25.091445] blkdev_get_by_dev.part.0+0x1ce/0x2f0
[ 25.092203] blkdev_open+0x52/0x90
[ 25.092617] do_dentry_open+0x1ca/0x360
[ 25.093499] path_openat+0x78d/0xcb0
[ 25.094136] do_filp_open+0xa1/0x130
[ 25.094759] do_sys_openat2+0x76/0x130
[ 25.095454] __x64_sys_openat+0x5c/0x70
[ 25.096078] do_syscall_64+0x37/0x80
[ 25.096637] entry_SYSCALL_64_after_hwframe+0x46/0xb0
[ 25.097304]
[ 25.097304] -> #0 (&disk->open_mutex){+.+.}-{3:3}:
[ 25.098229] __lock_acquire+0x12e2/0x1f90
[ 25.098789] lock_acquire+0xbf/0x2c0
[ 25.099256] __mutex_lock+0x93/0x870
[ 25.099706] bd_register_pending_holders+0x2a/0x110
[ 25.100246] device_add_disk+0x209/0x370
[ 25.100712] ublk_ctrl_uring_cmd+0x405/0x730
[ 25.101205] io_issue_sqe+0xfe/0x2ac0
[ 25.101665] io_submit_sqes+0x352/0x1820
[ 25.102131] __do_sys_io_uring_enter+0x848/0xdc0
[ 25.102646] do_syscall_64+0x37/0x80
[ 25.103087] entry_SYSCALL_64_after_hwframe+0x46/0xb0
[ 25.103640]
[ 25.103640] other info that might help us debug this:
[ 25.103640]
[ 25.104549] Chain exists of:
[ 25.104549] &disk->open_mutex --> &mm->mmap_lock#2 --> &ub->mutex
[ 25.104549]
[ 25.105611] Possible unsafe locking scenario:
[ 25.105611]
[ 25.106258] CPU0 CPU1
[ 25.106677] ---- ----
[ 25.107100] lock(&ub->mutex);
[ 25.107446] lock(&mm->mmap_lock#2);
[ 25.108045] lock(&ub->mutex);
[ 25.108802] lock(&disk->open_mutex);
[ 25.109265]
[ 25.109265] *** DEADLOCK ***
[ 25.109265]
[ 25.110117] 2 locks held by ublk/989:
[ 25.110490] #0: ffff975d07bbf8a8 (&ctx->uring_lock){+.+.}-{3:3}, at: __do_sys_io_uring_enter+0x83e/0xdc0
[ 25.111249] #1: ffff975d1df59708 (&ub->mutex){+.+.}-{3:3}, at: ublk_ctrl_uring_cmd+0x2df/0x730
[ 25.111943]
[ 25.111943] stack backtrace:
[ 25.112557] CPU: 2 PID: 989 Comm: ublk Not tainted 5.19.0-rc4_for-v5.20+ #149
[ 25.113137] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-1.fc33 04/01/2014
[ 25.113792] Call Trace:
[ 25.114130] <TASK>
[ 25.114417] dump_stack_lvl+0x71/0xa0
[ 25.114771] check_noncircular+0xdf/0x100
[ 25.115137] ? register_lock_class+0x38/0x470
[ 25.115524] __lock_acquire+0x12e2/0x1f90
[ 25.115887] ? find_held_lock+0x2b/0x80
[ 25.116244] lock_acquire+0xbf/0x2c0
[ 25.116590] ? bd_register_pending_holders+0x2a/0x110
[ 25.117009] __mutex_lock+0x93/0x870
[ 25.117362] ? bd_register_pending_holders+0x2a/0x110
[ 25.117780] ? bd_register_pending_holders+0x2a/0x110
[ 25.118201] ? kobject_add+0x71/0x90
[ 25.118546] ? bd_register_pending_holders+0x2a/0x110
[ 25.118958] bd_register_pending_holders+0x2a/0x110
[ 25.119373] device_add_disk+0x209/0x370
[ 25.119732] ublk_ctrl_uring_cmd+0x405/0x730
[ 25.120109] ? rcu_read_lock_sched_held+0x3c/0x70
[ 25.120514] io_issue_sqe+0xfe/0x2ac0
[ 25.120863] io_submit_sqes+0x352/0x1820
[ 25.121228] ? rcu_read_lock_sched_held+0x3c/0x70
[ 25.121626] ? __do_sys_io_uring_enter+0x83e/0xdc0
[ 25.122028] ? find_held_lock+0x2b/0x80
[ 25.122390] ? __do_sys_io_uring_enter+0x848/0xdc0
[ 25.122791] __do_sys_io_uring_enter+0x848/0xdc0
[ 25.123190] ? syscall_enter_from_user_mode+0x20/0x70
[ 25.123606] ? syscall_enter_from_user_mode+0x20/0x70
[ 25.124024] do_syscall_64+0x37/0x80
[ 25.124383] entry_SYSCALL_64_after_hwframe+0x46/0xb0
[ 25.124829] RIP: 0033:0x7f120a762af6
[ 25.125223] Code: 45 c1 41 89 c2 41 b9 08 00 00 00 41 83 ca 10 f6 87 d0 00 00 00 01 8b bf cc 00 00 00 44 0f 44 d0 45 31 c0c
[ 25.126576] RSP: 002b:00007ffdcb3c5518 EFLAGS: 00000246 ORIG_RAX: 00000000000001aa
[ 25.127153] RAX: ffffffffffffffda RBX: 00000000013aef50 RCX: 00007f120a762af6
[ 25.127748] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000004
[ 25.128351] RBP: 000000000000000b R08: 0000000000000000 R09: 0000000000000008
[ 25.128956] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffdcb3c74a6
[ 25.129524] R13: 00000000013aef50 R14: 0000000000000000 R15: 00000000000003df
[ 25.130121] </TASK>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220721153117.591394-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Defer allocating the gendisk and request_queue until UBLK_CMD_START_DEV
is called. This avoids funky life times where a disk is allocated
and then can be added and removed multiple times, which has never been
supported by the block layer.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220721130916.1869719-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Looking at the hctxs and cpumap is not safe without at very last a RCU
reference. It also requires the queue to be set up before starting the
device, which leads to rather awkward life time rules.
Instead rewrite ublk_ctrl_get_queue_affinity to just build the cpumask
directly from the mq_map in the tag set, similar to hctx->cpumask is
built.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220721130916.1869719-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Fold __ublk_create_dev into its only caller to avoid the packing and
unpacking of the return value into an ERR_PTR.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220721130916.1869719-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Move all per-command work into the per-command ublk_ctrl_* helpers
instead of being split over those, ublk_ctrl_cmd_validate, and the main
ublk_ctrl_uring_cmd handler. To facilitate that, the old
ublk_ctrl_stop_dev function that just contained two function calls is
folded into both callers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220721130916.1869719-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
fops->open and fops->release are always paired. Use simple atomic bit
ops ot indicate if the device is opened instead of a count that can
only be 0 and 1 and a useless cmpxchg loop in ublk_ch_release.
Also don't bother clearing file->private_data is the file is about to
be freed anyway.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220721130916.1869719-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
No need to define empty versions, they can just be left out.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220721130916.1869719-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
REQ_PREFLUSH is turned into REQ_OP_FLUSH by the flush state machine
and thus never seen by a blk-mq based driver.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220721130916.1869719-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The blk_mq_alloc_disk_for_queue() doesn't return error pointers, it
returns NULL on error.
Fixes: cebbe577cb17 ("ublk_drv: fix request queue leak")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/YtVAgedTsQVK1oTM@kili
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The ublk protocol has no mechanism to actually transfer the integrity
metadata, so don't define this flag, which requires that an integrity
payload is attached to a bio.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220718063013.335531-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Eliminate the following coccicheck warnings:
./drivers/block/ublk_drv.c:1467:2-3: Unneeded semicolon
./drivers/block/ublk_drv.c:1528:2-3: Unneeded semicolon
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220718015431.40185-1-yang.lee@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If blk_mq_init_queue() fails, it should return error code in ublk_add_dev()
Fixes: cebbe577cb17 ("ublk_drv: fix request queue leak")
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220718042408.3132835-1-yangyingliang@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
drivers/block/zram/zram_drv.c:55:45: warning: 'zram_wb_devops' defined but not used [-Wunused-const-variable=]
Fix the above warning if CONFIG_ZRAM_WRITEBACK not enabled.
Link: https://lkml.kernel.org/r/20220608072534.68850-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
After applying -Wmaybe-uninitialized manually, two build warnings are
triggered:
drivers/block/ublk_drv.c:940:11: warning: ‘io’ may be used uninitialized [-Wmaybe-uninitialized]
940 | io->flags &= ~UBLK_IO_FLAG_ACTIVE;
drivers/block/ublk_drv.c: In function ‘ublk_ctrl_uring_cmd’:
drivers/block/ublk_drv.c:1531:9: warning: ‘ret’ may be used uninitialized [-Wmaybe-uninitialized]
Fix the 1st one by removing 'io->flags &= ~UBLK_IO_FLAG_ACTIVE;' which
isn't needed since the function always return successfully after setting
this flag.
Fix the 2nd one by always initializing 'ret'.
Also fix another sparse warning of 'sparse: sparse: incorrect type in return
expression' by changing return type of ublk_setup_iod().
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220716095344.222674-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Improve static type checking by using the enum req_op type where
appropriate.
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220714180729.1065367-19-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Improve static type checking by using the enum req_op type for request
operations and the new blk_opf_t type for request flags.
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220714180729.1065367-18-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Improve static type checking by using the new blk_opf_t type to represent
the combination of a request and request flags.
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Cc: Md. Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220714180729.1065367-17-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Since the type of request.cmd_flags has been changed from u32 into
blk_opf_t, use the __force keyword when casting to an integer type to
prevent that sparse warns about this cast.
Cc: Denis Efremov <efremov@linux.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220714180729.1065367-16-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Combine the drbd_submit_peer_request() 'op' and 'op_flags' arguments
into a single argument. This patch does not change any functionality.
Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Cc: Lars Ellenberg <lars.ellenberg@linbit.com>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220714180729.1065367-15-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Improve static type checking by using the enum req_op type for variables
that represent a request operation and the new blk_opf_t type for
variables that represent request flags.
Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Cc: Lars Ellenberg <lars.ellenberg@linbit.com>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220714180729.1065367-14-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Improve static type checking by using the enum req_op type for a
function argument that represents a request operation.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220714180729.1065367-13-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Improve static type checking by changing the type of the value returned by
req_op() and bio_op() from unsigned int into enum req_op. Insert
'default: break;' in switch statements on the enum req_op type to prevent
that the compiler warns about these switch statements.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Tim Waugh <tim@cyberelk.net>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220714180729.1065367-5-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
All .rw_page() callers pass an enum req_op value as last argument. Make
this explicit by changing the type of the last argument into enum req_op.
See also commit 3f289dcb4b26 ("block: make bdev_ops->rw_page() take a
REQ_OP instead of bool").
Cc: Tejun Heo <tj@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220714180729.1065367-4-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The type name enum req_opf is misleading since it suggests that values of
this type include both an operation type and flags. Since values of this
type represent an operation only, change the type name into enum req_op.
Convert the enum req_op documentation into kernel-doc format. Move a few
definitions such that the enum req_op documentation occurs just above
the enum req_op definition.
The name "req_opf" was introduced by commit ef295ecf090d ("block: better op
and flags encoding").
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220714180729.1065367-2-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Just print the block device name directly using the %pg format specifier.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220713055317.1888500-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Just use the %pg format specifier instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220713055317.1888500-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Just use the %pg format specifier instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220713055317.1888500-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Just use the %pg format specifier instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220713055317.1888500-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Call blk_cleanup_queue() in release code path for fixing request
queue leak.
Also for-5.20/block has cleaned up blk_cleanup_queue(), which is
basically merged to del_gendisk() if blk_mq_alloc_disk() is used
for allocating disk and queue.
However, ublk may not add disk in case of starting device failure, then
del_gendisk() won't be called when removing ublk device, so blk_mq_exit_queue
will not be callsed, and it can be bit hard to deal with this kind of
merge conflict.
Turns out ublk's queue/disk use model is very similar with scsi, so switch
to scsi's model by allocating disk and queue independently, then it can be
quite easy to handle v5.20 merge conflict by replacing blk_cleanup_queue
with blk_mq_destroy_queue.
Reported-by: Jens Axboe <axboe@kernel.dk>
Fixes: 71f28f3136af ("ublk_drv: add io_uring based userspace block driver")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220714103201.131648-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Use task_work_add if it is available, since task_work_add can bring
up better performance, especially batching signaling ->ubq_daemon can
be done.
It is observed that task_work_add() can boost iops by +4% on random
4k io test. Also except for completing io command, all other code
paths are same with completing io command via
io_uring_cmd_complete_in_task.
Meantime add one flag of UBLK_F_URING_CMD_COMP_IN_TASK for comparing
the mode easily.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220713140711.97356-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This is the driver part of userspace block driver(ublk driver), the other
part is userspace daemon part(ublksrv)[1].
The two parts communicate by io_uring's IORING_OP_URING_CMD with one
shared cmd buffer for storing io command, and the buffer is read only for
ublksrv, each io command is indexed by io request tag directly, and is
written by ublk driver.
For example, when one READ io request is submitted to ublk block driver,
ublk driver stores the io command into cmd buffer first, then completes
one IORING_OP_URING_CMD for notifying ublksrv, and the URING_CMD is issued
to ublk driver beforehand by ublksrv for getting notification of any new
io request, and each URING_CMD is associated with one io request by tag.
After ublksrv gets the io command, it translates and handles the ublk io
request, such as, for the ublk-loop target, ublksrv translates the request
into same request on another file or disk, like the kernel loop block
driver. In ublksrv's implementation, the io is still handled by io_uring,
and share same ring with IORING_OP_URING_CMD command. When the target io
request is done, the same IORING_OP_URING_CMD is issued to ublk driver for
both committing io request result and getting future notification of new
io request.
Another thing done by ublk driver is to copy data between kernel io
request and ublksrv's io buffer:
1) before ubsrv handles WRITE request, copy the request's data into
ublksrv's userspace io buffer, so that ublksrv can handle the write
request
2) after ubsrv handles READ request, copy ublksrv's userspace io buffer
into this READ request, then ublk driver can complete the READ request
Zero copy may be switched if mm is ready to support it.
ublk driver doesn't handle any logic of the specific user space driver,
so it is small/simple enough.
[1] ublksrv
https://github.com/ming1/ubdsrv
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220713140711.97356-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Move the zone related fields that are currently stored in
struct request_queue to struct gendisk as these are part of the highlevel
block layer API and are only used for non-passthrough I/O that requires
the gendisk.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220706070350.1703384-17-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Pass a block_device instead of a request_queue as that is what most
callers have at hand.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Link: https://lore.kernel.org/r/20220706070350.1703384-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Switch to a gendisk based API in preparation for moving all zone related
fields from the request_queue to the gendisk.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220706070350.1703384-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Prepare for storing the zone related field in struct gendisk instead
of struct request_queue.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220706070350.1703384-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We no longer use the 'reserved' arg in busy_tag_iter_fn for any iter
function so it may be dropped.
Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me> #nvme
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/1657109034-206040-6-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
With new API blk_mq_is_reserved_rq() we can tell if a request is from
the reserved pool, so stop passing 'reserved' arg. There is actually
only a single user of that arg for all the callback implementations, which
can use blk_mq_is_reserved_rq() instead.
This will also allow us to stop passing the same 'reserved' around the
blk-mq iter functions next.
Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # For MMC
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/1657109034-206040-4-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Always use crypto_has_comp() so that crypto can lookup module, call
usermodhelper to load the modules, wait for usermodhelper to finish and so
on. Otherwise crypto will do all of these steps under CPU hot-plug lock
and this looks like too much stuff to handle under the CPU hot-plug lock.
Besides this can end up in a deadlock when usermodhelper triggers a code
path that attempts to lock the CPU hot-plug lock, that zram already holds.
An example of such deadlock:
- path A. zram grabs CPU hot-plug lock, execs /sbin/modprobe from crypto
and waits for modprobe to finish
disksize_store
zcomp_create
__cpuhp_state_add_instance
__cpuhp_state_add_instance_cpuslocked
zcomp_cpu_up_prepare
crypto_alloc_base
crypto_alg_mod_lookup
call_usermodehelper_exec
wait_for_completion_killable
do_wait_for_common
schedule
- path B. async work kthread that brings in scsi device. It wants to
register CPUHP states at some point, and it needs the CPU hot-plug
lock for that, which is owned by zram.
async_run_entry_fn
scsi_probe_and_add_lun
scsi_mq_alloc_queue
blk_mq_init_queue
blk_mq_init_allocated_queue
blk_mq_realloc_hw_ctxs
__cpuhp_state_add_instance
__cpuhp_state_add_instance_cpuslocked
mutex_lock
schedule
- path C. modprobe sleeps, waiting for all aync works to finish.
load_module
do_init_module
async_synchronize_full
async_synchronize_cookie_domain
schedule
[senozhatsky@chromium.org: add comment]
Link: https://lkml.kernel.org/r/20220624060606.1014474-1-senozhatsky@chromium.org
Link: https://lkml.kernel.org/r/20220622023501.517125-1-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Split the current bounce buffering logic used with persistent grants
into it's own option, and allow enabling it independently of
persistent grants. This allows to reuse the same code paths to
perform the bounce buffering required to avoid leaking contiguous data
in shared pages not part of the request fragments.
Reporting whether the backend is to be trusted can be done using a
module parameter, or from the xenstore frontend path as set by the
toolstack when adding the device.
This is CVE-2022-33742, part of XSA-403.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
|
|
When allocating pages to be used for shared communication with the
backend always zero them, this avoids leaking unintended data present
on the pages.
This is CVE-2022-26365, part of XSA-403.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
|
|
blk_cleanup_disk is nothing but a trivial wrapper for put_disk now,
so remove it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20220619060552.1850436-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Set the queue dying flag and call blk_mq_exit_queue from del_gendisk for
all disks that do not have separately allocated queues, and thus remove
the need to call blk_cleanup_queue for them.
Rename blk_cleanup_disk to blk_mq_destroy_queue to make it clear that
this function is intended only for separately allocated blk-mq queues.
This saves an extra queue freeze for devices without a separately
allocated queue.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20220619060552.1850436-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Use the proper helper to mark a surpise removal, remove the gendisk as
soon as possible when removing the device and implement the ->free_disk
callback to ensure the private data is alive as long as the gendisk has
references.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20220619060552.1850436-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This file is a huge mess that iterates over all devices and is in the
way of fixing the device removal in this driver, so remove it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20220619060552.1850436-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When a VBD is not fully created and then closed, the kernel can have a
NULL pointer dereference:
The reproducer is trivial:
[user@dom0 ~]$ sudo xl block-attach work backend=sys-usb vdev=xvdi target=/dev/sdz
[user@dom0 ~]$ xl block-list work
Vdev BE handle state evt-ch ring-ref BE-path
51712 0 241 4 -1 -1 /local/domain/0/backend/vbd/241/51712
51728 0 241 4 -1 -1 /local/domain/0/backend/vbd/241/51728
51744 0 241 4 -1 -1 /local/domain/0/backend/vbd/241/51744
51760 0 241 4 -1 -1 /local/domain/0/backend/vbd/241/51760
51840 3 241 3 -1 -1 /local/domain/3/backend/vbd/241/51840
^ note state, the /dev/sdz doesn't exist in the backend
[user@dom0 ~]$ sudo xl block-detach work xvdi
[user@dom0 ~]$ xl block-list work
Vdev BE handle state evt-ch ring-ref BE-path
work is an invalid domain identifier
And its console has:
BUG: kernel NULL pointer dereference, address: 0000000000000050
PGD 80000000edebb067 P4D 80000000edebb067 PUD edec2067 PMD 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 1 PID: 52 Comm: xenwatch Not tainted 5.16.18-2.43.fc32.qubes.x86_64 #1
RIP: 0010:blk_mq_stop_hw_queues+0x5/0x40
Code: 00 48 83 e0 fd 83 c3 01 48 89 85 a8 00 00 00 41 39 5c 24 50 77 c0 5b 5d 41 5c 41 5d c3 c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 <8b> 47 50 85 c0 74 32 41 54 49 89 fc 55 53 31 db 49 8b 44 24 48 48
RSP: 0018:ffffc90000bcfe98 EFLAGS: 00010293
RAX: ffffffffc0008370 RBX: 0000000000000005 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000005 RDI: 0000000000000000
RBP: ffff88800775f000 R08: 0000000000000001 R09: ffff888006e620b8
R10: ffff888006e620b0 R11: f000000000000000 R12: ffff8880bff39000
R13: ffff8880bff39000 R14: 0000000000000000 R15: ffff88800604be00
FS: 0000000000000000(0000) GS:ffff8880f3300000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000050 CR3: 00000000e932e002 CR4: 00000000003706e0
Call Trace:
<TASK>
blkback_changed+0x95/0x137 [xen_blkfront]
? read_reply+0x160/0x160
xenwatch_thread+0xc0/0x1a0
? do_wait_intr_irq+0xa0/0xa0
kthread+0x16b/0x190
? set_kthread_struct+0x40/0x40
ret_from_fork+0x22/0x30
</TASK>
Modules linked in: snd_seq_dummy snd_hrtimer snd_seq snd_seq_device snd_timer snd soundcore ipt_REJECT nf_reject_ipv4 xt_state xt_conntrack nft_counter nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables nfnetlink intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel xen_netfront pcspkr xen_scsiback target_core_mod xen_netback xen_privcmd xen_gntdev xen_gntalloc xen_blkback xen_evtchn ipmi_devintf ipmi_msghandler fuse bpf_preload ip_tables overlay xen_blkfront
CR2: 0000000000000050
---[ end trace 7bc9597fd06ae89d ]---
RIP: 0010:blk_mq_stop_hw_queues+0x5/0x40
Code: 00 48 83 e0 fd 83 c3 01 48 89 85 a8 00 00 00 41 39 5c 24 50 77 c0 5b 5d 41 5c 41 5d c3 c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 <8b> 47 50 85 c0 74 32 41 54 49 89 fc 55 53 31 db 49 8b 44 24 48 48
RSP: 0018:ffffc90000bcfe98 EFLAGS: 00010293
RAX: ffffffffc0008370 RBX: 0000000000000005 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000005 RDI: 0000000000000000
RBP: ffff88800775f000 R08: 0000000000000001 R09: ffff888006e620b8
R10: ffff888006e620b0 R11: f000000000000000 R12: ffff8880bff39000
R13: ffff8880bff39000 R14: 0000000000000000 R15: ffff88800604be00
FS: 0000000000000000(0000) GS:ffff8880f3300000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000050 CR3: 00000000e932e002 CR4: 00000000003706e0
Kernel panic - not syncing: Fatal exception
Kernel Offset: disabled
info->rq and info->gd are only set in blkfront_connect(), which is
called for state 4 (XenbusStateConnected). Guard against using NULL
variables in blkfront_closing() to avoid the issue.
The rest of blkfront_closing looks okay. If info->nr_rings is 0, then
for_each_rinfo won't do anything.
blkfront_remove also needs to check for non-NULL pointers before
cleaning up the gendisk and request queue.
Fixes: 05d69d950d9d "xen-blkfront: sanitize the removal state machine"
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20220601195341.28581-1-jandryuk@gmail.com
Signed-off-by: Juergen Gross <jgross@suse.com>
|