aboutsummaryrefslogtreecommitdiffstats
path: root/block (follow)
AgeCommit message (Collapse)AuthorFilesLines
2021-02-06Merge tag 'block-5.11-2021-02-05' of git://git.kernel.dk/linux-blockLinus Torvalds1-4/+4
Pull block fixes from Jens Axboe: "A few small regression fixes: - NVMe pull request from Christoph: - more quirks for buggy devices (Thorsten Leemhuis, Claus Stovgaard) - update the email address for Keith (Keith Busch) - fix an out of bounds access in nvmet-tcp (Sagi Grimberg) - Regression fix for BFQ shallow depth calculations introduced in this merge window (Lin)" * tag 'block-5.11-2021-02-05' of git://git.kernel.dk/linux-block: nvmet-tcp: fix out-of-bounds access when receiving multiple h2cdata PDUs bfq-iosched: Revert "bfq: Fix computation of shallow depth" update the email address for Keith Bush nvme-pci: ignore the subsysem NQN on Phison E16 nvme-pci: avoid the deepest sleep state on Kingston A2000 SSDs
2021-02-02bfq-iosched: Revert "bfq: Fix computation of shallow depth"Lin Feng1-4/+4
This reverts commit 6d4d273588378c65915acaf7b2ee74e9dd9c130a. bfq.limit_depth passes word_depths[] as shallow_depth down to sbitmap core sbitmap_get_shallow, which uses just the number to limit the scan depth of each bitmap word, formula: scan_percentage_for_each_word = shallow_depth / (1 << sbimap->shift) * 100% That means the comments's percentiles 50%, 75%, 18%, 37% of bfq are correct. But after commit patch 'bfq: Fix computation of shallow depth', we use sbitmap.depth instead, as a example in following case: sbitmap.depth = 256, map_nr = 4, shift = 6; sbitmap_word.depth = 64. The resulsts of computed bfqd->word_depths[] are {128, 192, 48, 96}, and three of the numbers exceed core dirver's 'sbitmap_word.depth=64' limit nothing. Signed-off-by: Lin Feng <linf@wangsu.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-02block: fix memory leak of bvecMing Lei1-1/+1
bio_init() clears bio instance, so the bvec index has to be set after bio_init(), otherwise bio->bi_io_vec may be leaked. Fixes: 3175199ab0ac ("block: split bio_kmalloc from bio_alloc_bioset") Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Cc: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01block/keyslot-manager: introduce devm_blk_ksm_init()Eric Biggers1-0/+29
Add a resource-managed variant of blk_ksm_init() so that drivers don't have to worry about calling blk_ksm_destroy(). Note that the implementation uses a custom devres action to call blk_ksm_destroy() rather than switching the two allocations to be directly devres-managed, e.g. with devm_kmalloc(). This is because we need to keep zeroing the memory containing the keyslots when it is freed, and also because we want to continue using kvmalloc() (and there is no devm_kvmalloc()). Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Satya Tangirala <satyat@google.com> Acked-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20210121082155.111333-2-ebiggers@kernel.org Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2021-01-29Merge tag 'block-5.11-2021-01-29' of git://git.kernel.dk/linux-blockLinus Torvalds4-11/+22
Pull block fixes from Jens Axboe: "All over the place fixes for this release: - blk-cgroup iteration teardown resched fix (Baolin) - NVMe pull request from Christoph: - add another Write Zeroes quirk (Chaitanya Kulkarni) - handle a no path available corner case (Daniel Wagner) - use the proper RCU aware list_add helper (Chao Leng) - bcache regression fix (Coly) - bdev->bd_size_lock IRQ fix. This will be fixed in drivers for 5.12, but for now, we'll make it IRQ safe (Damien) - null_blk zoned init fix (Damien) - add_partition() error handling fix (Dinghao) - s390 dasd kobject fix (Jan) - nbd fix for freezing queue while adding connections (Josef) - tag queueing regression fix (Ming) - revert of a patch that inadvertently meant that we regressed write performance on raid (Maxim)" * tag 'block-5.11-2021-01-29' of git://git.kernel.dk/linux-block: null_blk: cleanup zoned mode initialization nvme-core: use list_add_tail_rcu instead of list_add_tail for nvme_init_ns_head nvme-multipath: Early exit if no path is available nvme-pci: add the DISABLE_WRITE_ZEROES quirk for a SPCC device bcache: only check feature sets when sb->version >= BCACHE_SB_VERSION_CDEV_WITH_FEATURES block: fix bd_size_lock use blk-cgroup: Use cond_resched() when destroy blkgs Revert "block: simplify set_init_blocksize" to regain lost performance nbd: freeze the queue while we're adding connections s390/dasd: Fix inconsistent kobject removal block: Fix an error handling in add_partition blk-mq: test QUEUE_FLAG_HCTX_ACTIVE for sbitmap_shared in hctx_may_queue
2021-01-29block: drop removed argument from kernel-doc of blk_execute_rq()Lukas Bulwahn1-1/+0
Commit 684da7628d93 ("block: remove unnecessary argument from blk_execute_rq") changes the signature of blk_execute_rq(), but misses to adjust its kernel-doc. Hence, make htmldocs warns on ./block/blk-exec.c:78: warning: Excess function parameter 'q' description in 'blk_execute_rq' Drop removed argument from kernel-doc of blk_execute_rq() as well. Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Acked-by: Guoqing Jiang <Guoqing.jiang@cloud.ionos.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-29block: remove typo in kernel-doc of set_disk_ro()Lukas Bulwahn1-1/+1
Commit 52f019d43c22 ("block: add a hard-readonly flag to struct gendisk") provides some kernel-doc for set_disk_ro(), but introduces a small typo. Hence, make htmldocs warns on ./block/genhd.c:1441: warning: Function parameter or member 'read_only' not described in 'set_disk_ro' warning: Excess function parameter 'ready_only' description in 'set_disk_ro' Remove that typo in the kernel-doc for set_disk_ro(). Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-28blk-cgroup: Remove obsolete macroBaolin Wang1-2/+0
Remove the obsolete 'MAX_KEY_LEN' macro. Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-28block: fix bd_size_lock useDamien Le Moal2-4/+7
Some block device drivers, e.g. the skd driver, call set_capacity() with IRQ disabled. This results in lockdep ito complain about inconsistent lock states ("inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage") because set_capacity takes a block device bd_size_lock using the functions spin_lock() and spin_unlock(). Ensure a consistent locking state by replacing these calls with spin_lock_irqsave() and spin_lock_irqrestore(). The same applies to bdev_set_nr_sectors(). With this fix, all lockdep complaints are resolved. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-28blk-cgroup: Use cond_resched() when destroy blkgsBaolin Wang1-5/+13
On !PREEMPT kernel, we can get below softlockup when doing stress testing with creating and destroying block cgroup repeatly. The reason is it may take a long time to acquire the queue's lock in the loop of blkcg_destroy_blkgs(), or the system can accumulate a huge number of blkgs in pathological cases. We can add a need_resched() check on each loop and release locks and do cond_resched() if true to avoid this issue, since the blkcg_destroy_blkgs() is not called from atomic contexts. [ 4757.010308] watchdog: BUG: soft lockup - CPU#11 stuck for 94s! [ 4757.010698] Call trace: [ 4757.010700]  blkcg_destroy_blkgs+0x68/0x150 [ 4757.010701]  cgwb_release_workfn+0x104/0x158 [ 4757.010702]  process_one_work+0x1bc/0x3f0 [ 4757.010704]  worker_thread+0x164/0x468 [ 4757.010705]  kthread+0x108/0x138 Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-27block: use an on-stack bio in blkdev_issue_flushChristoph Hellwig1-11/+6
There is no point in allocating memory for a synchronous flush. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Acked-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-27block: split bio_kmalloc from bio_alloc_biosetChristoph Hellwig1-82/+85
bio_kmalloc shares almost no logic with the bio_set based fast path in bio_alloc_bioset. Split it into an entirely separate implementation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Acked-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-27blk-crypto: use bio_kmalloc in blk_crypto_clone_bioChristoph Hellwig1-1/+1
Use bio_kmalloc instead of open coding it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Acked-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-27bfq: Use only idle IO periods for think time calculationsJan Kara1-1/+9
Currently whenever bfq queue has a request queued we add now - last_completion_time to the think time statistics. This is however misleading in case the process is able to submit several requests in parallel because e.g. if the queue has request completed at time T0 and then queues new requests at times T1, T2, then we will add T1-T0 and T2-T0 to think time statistics which just doesn't make any sence (the queue's think time is penalized by the queue being able to submit more IO). So add to think time statistics only time intervals when the queue had no IO pending. Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Paolo Valente <paolo.valente@linaro.org> [axboe: fix whitespace on empty line] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-27bfq: Use 'ttime' local variableJan Kara1-1/+1
Use local variable 'ttime' instead of dereferencing bfqq. Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-27bfq: Avoid false bfq queue mergingJan Kara1-0/+1
bfq_setup_cooperator() uses bfqd->in_serv_last_pos so detect whether it makes sense to merge current bfq queue with the in-service queue. However if the in-service queue is freshly scheduled and didn't dispatch any requests yet, bfqd->in_serv_last_pos is stale and contains value from the previously scheduled bfq queue which can thus result in a bogus decision that the two queues should be merged. This bug can be observed for example with the following fio jobfile: [global] direct=0 ioengine=sync invalidate=1 size=1g rw=read [reader] numjobs=4 directory=/mnt where the 4 processes will end up in the one shared bfq queue although they do IO to physically very distant files (for some reason I was able to observe this only with slice_idle=1ms setting). Fix the problem by invalidating bfqd->in_serv_last_pos when switching in-service queue. Fixes: 058fdecc6de7 ("block, bfq: fix in-service-queue check for queue merging") CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-26blkcg: delete redundant get/put operations for queueChunguang Xu1-5/+8
When calling blkcg_schedule_throttle(), for the same queue, redundant get/put operations can be removed. Signed-off-by: Chunguang Xu <brookxu@tencent.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-26blk: wbt: remove unused parameter from wbt_should_throttleLei Chen1-2/+2
The first parameter rwb is not used for this function. So just remove it. Signed-off-by: Lei Chen <lennychen@tencent.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-26block: inherit BIO_REMAPPED when cloning biosChristoph Hellwig3-0/+6
Cloned bios are can be used to on the same device, in which case we need to inherit the BIO_REMAPPED flag to avoid a double partition remap. When the cloned bios are used on another device, bio_set_dev will clear the flag. Fixes: 309dca309fc3 ("block: store a block_device pointer in struct bio") Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25bfq: bfq_check_waker() should be staticJens Axboe1-1/+2
It's only used in the same file, mark is appropriately static. Fixes: 71217df39dc6 ("block, bfq: make waker-queue detection more robust") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25block, bfq: make waker-queue detection more robustPaolo Valente2-110/+108
In the presence of many parallel I/O flows, the detection of waker bfq_queues suffers from false positives. This commits addresses this issue by making the filtering of actual wakers more selective. In more detail, a candidate waker must be found to meet waker requirements three times before being promoted to actual waker. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25block, bfq: save also injection state on queue mergingPaolo Valente2-0/+13
To prevent injection information from being lost on bfq_queue merging, also the amount of service that a bfq_queue receives must be saved and restored when the bfq_queue is merged and split, respectively. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25block, bfq: save also weight-raised service on queue mergingPaolo Valente2-0/+3
To prevent weight-raising information from being lost on bfq_queue merging, also the amount of service that a bfq_queue receives must be saved and restored when the bfq_queue is merged and split, respectively. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25block, bfq: fix switch back from soft-rt weitgh-raisingPaolo Valente1-2/+20
A bfq_queue may happen to be deemed as soft real-time while it is still enjoying interactive weight-raising. If this happens because of a false positive, then the bfq_queue is likely to loose its soft real-time status soon. Upon losing such a status, the bfq_queue must get back its interactive weight-raising, if its interactive period is not over yet. But this case is not handled. This commit corrects this error. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25block, bfq: re-evaluate convenience of I/O plugging on rq arrivalsPaolo Valente1-5/+19
Upon an I/O-dispatch attempt, BFQ may detect that it was better to plug I/O dispatch, and to wait for a new request to arrive for the currently in-service queue. But the arrival of a new request for an empty bfq_queue, and thus the switch from idle to busy of the bfq_queue, may cause the scenario to change, and make plugging no longer needed for service guarantees, or more convenient for throughput. In this case, keeping I/O-dispatch plugged would certainly lower throughput. To address this issue, this commit makes such a check, and stops plugging I/O if it is better to stop plugging I/O. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25block, bfq: replace mechanism for evaluating I/O intensityPaolo Valente2-27/+52
Some BFQ mechanisms make their decisions on a bfq_queue basing also on whether the bfq_queue is I/O bound. In this respect, the current logic for evaluating whether a bfq_queue is I/O bound is rather rough. This commits replaces this logic with a more effective one. The new logic measures the percentage of time during which a bfq_queue is active, and marks the bfq_queue as I/O bound if the latter if this percentage is above a fixed threshold. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25block: skip bio_check_eod for partition-remapped biosChristoph Hellwig1-5/+6
When an already remapped bio is resubmitted (e.g. by blk_queue_split), bio_check_eod will compare the remapped bi_sector against the size of the partition, leading to spurious I/O failures. Skip the EOD check in this case. Fixes: 309dca309fc3 ("block: store a block_device pointer in struct bio") Reported-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25bio: don't copy bvec for direct IOPavel Begunkov1-38/+29
The block layer spends quite a while in blkdev_direct_IO() to copy and initialise bio's bvec. However, if we've already got a bvec in the input iterator it might be reused in some cases, i.e. when new ITER_BVEC_FLAG_FIXED flag is set. Simple tests show considerable performance boost, and it also reduces memory footprint. Suggested-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25block/psi: remove PSI annotations from direct IOPavel Begunkov1-0/+6
Direct IO does not operate on the current working set of pages managed by the kernel, so it should not be accounted as memory stall to PSI infrastructure. The block layer and iomap direct IO use bio_iov_iter_get_pages() to build bios, and they are the only users of it, so to avoid PSI tracking for them clear out BIO_WORKINGSET flag. Do same for dio_bio_submit() because fs/direct_io constructs bios by hand directly calling bio_add_page(). Reported-by: Christoph Hellwig <hch@infradead.org> Suggested-by: Christoph Hellwig <hch@infradead.org> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24block: remove unnecessary argument from blk_execute_rqGuoqing Jiang3-6/+5
We can remove 'q' from blk_execute_rq as well after the previous change in blk_execute_rq_nowait. And more importantly it never really was needed to start with given that we can trivial derive it from struct request. Cc: linux-scsi@vger.kernel.org Cc: virtualization@lists.linux-foundation.org Cc: linux-ide@vger.kernel.org Cc: linux-mmc@vger.kernel.org Cc: linux-nvme@lists.infradead.org Cc: linux-nfs@vger.kernel.org Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24block: remove unnecessary argument from blk_execute_rq_nowaitGuoqing Jiang1-6/+4
The 'q' is not used since commit a1ce35fa4985 ("block: remove dead elevator code"), also update the comment of the function. And more importantly it never really was needed to start with given that we can trivial derive it from struct request. Cc: target-devel@vger.kernel.org Cc: linux-scsi@vger.kernel.org Cc: virtualization@lists.linux-foundation.org Cc: linux-ide@vger.kernel.org Cc: linux-mmc@vger.kernel.org Cc: linux-nvme@lists.infradead.org Cc: linux-nfs@vger.kernel.org Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24bsg: free the request before return error codePan Bian1-1/+3
Free the request rq before returning error code. Fixes: 972248e9111e ("scsi: bsg-lib: handle bidi requests without block layer help") Signed-off-by: Pan Bian <bianpan2016@163.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24block: Fix an error handling in add_partitionDinghao Liu1-1/+1
Once we have called device_initialize(), we should use put_device() to give up the reference on error, just like what we have done on failure of device_add(). Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24blk-mq: test QUEUE_FLAG_HCTX_ACTIVE for sbitmap_shared in hctx_may_queueMing Lei1-1/+1
In case of blk_mq_is_sbitmap_shared(), we should test QUEUE_FLAG_HCTX_ACTIVE against q->queue_flags instead of BLK_MQ_S_TAG_ACTIVE. So fix it. Cc: John Garry <john.garry@huawei.com> Cc: Kashyap Desai <kashyap.desai@broadcom.com> Fixes: f1b49fdc1c64 ("blk-mq: Record active_queues_shared_sbitmap per tag_set for when using shared sbitmap") Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: John Garry <john.garry@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24block: move three bvec helpers declaration into private helperMing Lei1-0/+4
bvec_alloc(), bvec_free() and bvec_nr_vecs() are only used inside block layer core functions, no need to declare them in public header. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24block: set .bi_max_vecs as actual allocated vector numberMing Lei1-1/+2
bvec_alloc() may allocate more bio vectors than requested, so set .bi_max_vecs as actual allocated vector number, instead of the requested number. This way can help fs build bigger bio because new bio often won't be allocated until the current one becomes full. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24block: don't allocate inline bvecs if this bioset needn't bvecsMing Lei1-2/+5
The inline bvecs won't be used if user needn't bvecs by not passing BIOSET_NEED_BVECS, so don't allocate bvecs in this situation. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Tested-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24block: don't pass BIOSET_NEED_BVECS for q->bio_splitMing Lei1-1/+1
q->bio_split is only used by bio_split() for fast cloning bio, and no need to allocate bvecs, so remove this flag. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Tested-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24block: manage bio slab cache by xarrayMing Lei1-67/+49
Managing bio slab cache via xarray by using slab cache size as xarray index, and storing 'struct bio_slab' instance into xarray. So code is simplified a lot, meantime it becomes more readable than before. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Tested-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24bfq: don't duplicate code for different pathshuhai1-3/+0
As we can see, returns parent_sched_may_change whether sd->next_in_service changes or not, so remove this judgment. Signed-off-by: huhai <huhai@tj.kylinos.cn> Acked-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24blk-mq: Improve performance of non-mq IO schedulers with multiple HW queuesJan Kara2-6/+61
Currently when non-mq aware IO scheduler (BFQ, mq-deadline) is used for a queue with multiple HW queues, the performance it rather bad. The problem is that these IO schedulers use queue-wide locking and their dispatch function does not respect the hctx it is passed in and returns any request it finds appropriate. Thus locality of request access is broken and dispatch from multiple CPUs just contends on IO scheduler locks. For these IO schedulers there's little point in dispatching from multiple CPUs. Instead dispatch always only from a single CPU to limit contention. Below is a comparison of dbench runs on XFS filesystem where the storage is a raid card with 64 HW queues and to it attached a single rotating disk. BFQ is used as IO scheduler: clients MQ SQ MQ-Patched Amean 1 39.12 (0.00%) 43.29 * -10.67%* 36.09 * 7.74%* Amean 2 128.58 (0.00%) 101.30 * 21.22%* 96.14 * 25.23%* Amean 4 577.42 (0.00%) 494.47 * 14.37%* 508.49 * 11.94%* Amean 8 610.95 (0.00%) 363.86 * 40.44%* 362.12 * 40.73%* Amean 16 391.78 (0.00%) 261.49 * 33.25%* 282.94 * 27.78%* Amean 32 324.64 (0.00%) 267.71 * 17.54%* 233.00 * 28.23%* Amean 64 295.04 (0.00%) 253.02 * 14.24%* 242.37 * 17.85%* Amean 512 10281.61 (0.00%) 10211.16 * 0.69%* 10447.53 * -1.61%* Numbers are times so lower is better. MQ is stock 5.10-rc6 kernel. SQ is the same kernel with megaraid_sas.host_tagset_enable=0 so that the card advertises just a single HW queue. MQ-Patched is a kernel with this patch applied. You can see multiple hardware queues heavily hurt performance in combination with BFQ. The patch restores the performance. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24Revert "blk-mq, elevator: Count requests per hctx to improve performance"Jan Kara3-12/+0
This reverts commit b445547ec1bbd3e7bf4b1c142550942f70527d95. Since both mq-deadline and BFQ completely ignore hctx they are passed to their dispatch function and dispatch whatever request they deem fit checking whether any request for a particular hctx is queued is just pointless since we'll very likely get a request from a different hctx anyway. In the following commit we'll deal with lock contention in these IO schedulers in presence of multiple HW queues in a different way. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24block, bfq: do not expire a queue when it is the only busy onePaolo Valente1-2/+20
This commits preserves I/O-dispatch plugging for a special symmetric case that may suddenly turn into asymmetric: the case where only one bfq_queue, say bfqq, is busy. In this case, not expiring bfqq does not cause any harm to any other queues in terms of service guarantees. In contrast, it avoids the following unlucky sequence of events: (1) bfqq is expired, (2) a new queue with a lower weight than bfqq becomes busy (or more queues), (3) the new queue is served until a new request arrives for bfqq, (4) when bfqq is finally served, there are so many requests of the new queue in the drive that the pending requests for bfqq take a lot of time to be served. In particular, event (2) may case even already dispatched requests of bfqq to be delayed, inside the drive. So, to avoid this series of events, the scenario is preventively declared as asymmetric also if bfqq is the only busy queues. By doing so, I/O-dispatch plugging is performed for bfqq. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24block, bfq: avoid spurious switches to soft_rt of interactive queuesPaolo Valente1-20/+37
BFQ tags some bfq_queues as interactive or soft_rt if it deems that these bfq_queues contain the I/O of, respectively, interactive or soft real-time applications. BFQ privileges both these special types of bfq_queues over normal bfq_queues. To privilege a bfq_queue, BFQ mainly raises the weight of the bfq_queue. In particular, soft_rt bfq_queues get a higher weight than interactive bfq_queues. A bfq_queue may turn from interactive to soft_rt. And this leads to a tricky issue. Soft real-time applications usually start with an I/O-bound, interactive phase, in which they load themselves into main memory. BFQ correctly detects this phase, and keeps the bfq_queues associated with the application in interactive mode for a while. Problems arise when the I/O pattern of the application finally switches to soft real-time. One of the conditions for a bfq_queue to be deemed as soft_rt is that the bfq_queue does not consume too much bandwidth. But the bfq_queues associated with a soft real-time application consume as much bandwidth as they can in the loading phase of the application. So, after the application becomes truly soft real-time, a lot of time should pass before the average bandwidth consumed by its bfq_queues finally drops to a value acceptable for soft_rt bfq_queues. As a consequence, there might be a time gap during which the application is not privileged at all, because its bfq_queues are not interactive any longer, but cannot be deemed as soft_rt yet. To avoid this problem, BFQ pretends that an interactive bfq_queue consumes zero bandwidth, and allows an interactive bfq_queue to switch to soft_rt. Yet, this fake zero-bandwidth consumption easily causes the bfq_queue to often switch to soft_rt deceptively, during its loading phase. As in soft_rt mode, the bfq_queue gets its bandwidth correctly computed, and therefore soon switches back to interactive. Then it switches again to soft_rt, and so on. These spurious fluctuations usually cause losses of throughput, because they deceive BFQ's mechanisms for boosting throughput (injection, I/O-plugging avoidance, ...). This commit addresses this issue as follows: 1) It does compute actual bandwidth consumption also for interactive bfq_queues. This avoids the above false positives. 2) When a bfq_queue switches from interactive to normal mode, the consumed bandwidth is reset (forgotten). This allows the bfq_queue to enjoy soft_rt very quickly. In particular, two alternatives are possible in this switch: - the bfq_queue still has backlog, and therefore there is a budget already scheduled to serve the bfq_queue; in this case, the scheduling of the current budget of the bfq_queue is not hindered, because only the scheduling of the next budget will be affected by the weight drop. After that, if the bfq_queue is actually in a soft_rt phase, and becomes empty during the service of its current budget, which is the natural behavior of a soft_rt bfq_queue, then the bfq_queue will be considered as soft_rt when its next I/O arrives. If, in contrast, the bfq_queue remains constantly non-empty, then its next budget will be scheduled with a low weight, which is the natural treatment for an I/O-bound (non soft_rt) bfq_queue. - the bfq_queue is empty; in this case, the bfq_queue may be considered unjustly soft_rt when its new I/O arrives. Yet the problem is now much smaller than before, because it is unlikely that more than one spurious fluctuation occurs. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24block, bfq: do not raise non-default weightsPaolo Valente1-3/+7
BFQ heuristics try to detect interactive I/O, and raise the weight of the queues containing such an I/O. Yet, if also the user changes the weight of a queue (i.e., the user changes the ioprio of the process associated with that queue), then it is most likely better to prevent BFQ heuristics from silently changing the same weight. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24block, bfq: increase time window for waker detectionPaolo Valente1-1/+1
Tests on slower machines showed current window to be way too small. This commit increases it. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24block, bfq: set next_rq to waker_bfqq->next_rq in waker injectionJia Cheng Hu1-1/+1
Since commit c5089591c3ba ("block, bfq: detect wakers and unconditionally inject their I/O"), when the in-service bfq_queue, say Q, is temporarily empty, BFQ checks whether there are I/O requests to inject (also) from the waker bfq_queue for Q. To this goal, the value pointed by bfqq->waker_bfqq->next_rq must be controlled. However, the current implementation mistakenly looks at bfqq->next_rq, which instead points to the next request of the currently served queue. This mistake evidently causes losses of throughput in scenarios with waker bfq_queues. This commit corrects this mistake. Fixes: c5089591c3ba ("block, bfq: detect wakers and unconditionally inject their I/O") Signed-off-by: Jia Cheng Hu <jia.jiachenghu@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24block, bfq: use half slice_idle as a threshold to check short ttimePaolo Valente1-3/+4
The value of the I/O plugging (idling) timeout is used also as the think-time threshold to decide whether a process has a short think time. In this respect, a good value of this timeout for rotational drives is un the order of several ms. Yet, this is often too long a time interval to be effective as a think-time threshold. This commit mitigates this problem (by a lot, according to tests), by halving the threshold. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24block: use an xarray for disk->part_tblChristoph Hellwig4-178/+19
Now that no fast path lookups in the partition table are left, there is no point in micro-optimizing the data structure for it. Just use a bog standard xarray. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24block: remove DISK_PITER_REVERSEChristoph Hellwig1-30/+7
There is good reason to iterate backwards when deleting all partitions in del_gendisk, just like we don't in blk_drop_partitions. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>