path: root/block
2025-10-23  block: Remove elevator_lock usage from blkg_conf frozen operations  [Ming Lei, 1 file, -9/+4]
[ Upstream commit 08823e89e3e269bf4c4a20b4c24a8119920cc7a4 ] Remove the acquisition and release of q->elevator_lock in the blkg_conf_open_bdev_frozen() and blkg_conf_exit_frozen() functions. The elevator lock is no longer needed in these code paths since commit 78c271344b6f ("block: move wbt_enable_default() out of queue freezing from sched ->exit()"), which introduced `disk->rqos_state_mutex` for protecting wbt state changes, so it is no longer necessary to abuse elevator_lock for this purpose. This change helps to solve the lockdep warning reported by Yu Kuai[1]. Passes blktests/throtl with lockdep enabled. Links: https://lore.kernel.org/linux-block/e5e7ac3f-2063-473a-aafb-4d8d43e5576e@yukuai.org.cn/ [1] Fixes: commit 78c271344b6f ("block: move wbt_enable_default() out of queue freezing from sched ->exit()") Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-23  blk-mq: fix stale tag depth for shared sched tags in blk_mq_update_nr_requests()  [Yu Kuai, 4 files, -5/+7]
[ Upstream commit dc96cefef0d3032c69e46a21b345c60e56b18934 ] Commit 7f2799c546db ("blk-mq: cleanup shared tags case in blk_mq_update_nr_requests()") moves blk_mq_tag_update_sched_shared_tags() before q->nr_requests is updated; however, it still uses the old q->nr_requests to resize the tag depth. Fix this problem by passing in the expected new tag depth. Fixes: 7f2799c546db ("blk-mq: cleanup shared tags case in blk_mq_update_nr_requests()") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reported-by: Chris Mason <clm@meta.com> Link: https://lore.kernel.org/linux-block/20251014130507.4187235-2-clm@meta.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-19  blk-crypto: fix missing blktrace bio split events  [Yu Kuai, 1 file, -0/+3]
commit 06d712d297649f48ebf1381d19bd24e942813b37 upstream. trace_block_split() is missing, leaving blktrace unable to catch BIO split events and making it harder to analyze the BIO sequence. Cc: stable@vger.kernel.org Fixes: 488f6682c832 ("block: blk-crypto-fallback for Inline Encryption") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-15  blk-throttle: fix throtl_data leak during disk release  [Yu Kuai, 1 file, -2/+5]
[ Upstream commit 336aec7b06be860477be80a4299263a2e9355789 ] Tightening the throttle activation check in blk_throtl_activated() to require both q->td being present and the policy bit being set introduced a memory leak during disk release: blkg_destroy_all() clears the policy bit first during queue deactivation, causing the subsequent blk_throtl_exit() to skip throtl_data cleanup when blk_throtl_activated() fails the policy check. Ideally we should avoid modifying the blk_throtl_exit() activation check, because it is intuitive that blk-throtl starts in blk_throtl_init() and ends in blk_throtl_exit(). However, calling blk_throtl_exit() before blkg_destroy_all() would make a long-term deadlock problem easier to trigger[1], hence fix this problem by checking whether q->td is NULL in blk_throtl_exit(), and remove the policy deactivation as well since it is useless. [1] https://lore.kernel.org/all/CAHj4cs9p9H5yx+ywsb3CMUdbqGPhM+8tuBvhW=9ADiCjAqza9w@mail.gmail.com/#t Fixes: bd9fd5be6bc0 ("blk-throttle: fix access race during throttle policy activation") Reported-by: Yi Zhang <yi.zhang@redhat.com> Closes: https://lore.kernel.org/all/CAHj4cs-p-ZwBEKigBj7T6hQCOo-H68-kVwCrV6ZvRovrr9Z+HA@mail.gmail.com/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-15  block: fix stacking of atomic writes when atomics are not supported  [John Garry, 1 file, -9/+10]
[ Upstream commit f2d8c5a2f79c28569edf4948b611052253b5e99a ] Atomic writes support may not always be possible when stacking devices which support atomic writes. One such case is a different atomic write boundary between stacked devices (which is not supported). In the case that atomic writes cannot be supported, the top device queue HW limits are set to 0. However, in blk_stack_atomic_writes_limits(), we detect that we are stacking the first bottom device by checking the top device atomic_write_hw_max value == 0. This gets confused with the case of atomic writes not being supported, above. Make the distinction between stacking the first bottom device and atomic writes not being supported by initializing the stacked device atomic_write_hw_max to UINT_MAX and checking for that value when stacking the first bottom device. Fixes: d7f36dc446e8 ("block: Support atomic writes limits for stacked devices") Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
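As a side note, here is a minimal userspace sketch of the sentinel idea described above; the structure and function names are illustrative stand-ins, not the kernel's blk_stack_atomic_writes_limits():

  /* Sketch only: 0 means "atomic writes unsupported", UINT_MAX is the
   * "no bottom device stacked yet" sentinel introduced by the fix. */
  #include <limits.h>
  #include <stdio.h>

  struct limits {
      unsigned int atomic_write_hw_max;   /* 0 == not supported */
  };

  static void stack_atomic_limits(struct limits *top, const struct limits *bottom,
                                  int first_device)
  {
      if (first_device) {         /* detected via the UINT_MAX sentinel */
          top->atomic_write_hw_max = bottom->atomic_write_hw_max;
          return;
      }
      /* A later bottom device: take the minimum, so 0 (unsupported) stays 0. */
      if (bottom->atomic_write_hw_max < top->atomic_write_hw_max)
          top->atomic_write_hw_max = bottom->atomic_write_hw_max;
  }

  int main(void)
  {
      struct limits top = { .atomic_write_hw_max = UINT_MAX };  /* sentinel */
      struct limits b1  = { .atomic_write_hw_max = 0 };         /* no atomics */
      struct limits b2  = { .atomic_write_hw_max = 65536 };

      stack_atomic_limits(&top, &b1, top.atomic_write_hw_max == UINT_MAX);
      stack_atomic_limits(&top, &b2, top.atomic_write_hw_max == UINT_MAX);
      /* Before the fix, top == 0 after b1 looked like "first device" again,
       * and b2 would wrongly re-enable atomic writes; with the sentinel the
       * combined limit stays 0. */
      printf("stacked atomic_write_hw_max = %u\n", top.atomic_write_hw_max);
      return 0;
  }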
2025-10-15  block: update validation of atomic writes boundary for stacked devices  [John Garry, 1 file, -8/+14]
[ Upstream commit bfd4037296bd7e1f95394a2e3daf8e3c1796c3b3 ] Commit 63d092d1c1b1 ("block: use chunk_sectors when evaluating stacked atomic write limits") missed adding a chunk_sectors limit check in blk_stack_atomic_writes_boundary_head(), so update that function to do the proper check. Fixes: 63d092d1c1b1 ("block: use chunk_sectors when evaluating stacked atomic write limits") Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-15  blk-mq: fix potential deadlock while nr_requests grown  [Yu Kuai, 3 files, -22/+34]
[ Upstream commit b86433721f46d934940528f28d49c1dedb690df1 ] Allocating and freeing sched_tags while the queue is frozen can deadlock[1]. This is a long-standing problem, hence allocate memory before freezing the queue and free memory after the queue is unfrozen. [1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/ Fixes: e3a2b3f931f5 ("blk-mq: allow changing of queue depth through sysfs") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
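The shape of this fix is the usual "prepare outside, swap inside, release outside" pattern; a hedged userspace sketch follows, with the queue freeze modeled as a plain mutex and all names purely illustrative:

  #include <pthread.h>
  #include <stdio.h>
  #include <stdlib.h>

  static pthread_mutex_t freeze_lock = PTHREAD_MUTEX_INITIALIZER;
  static int *sched_tags;
  static unsigned int nr_requests;

  static int resize_tags(unsigned int new_nr)
  {
      /* 1) allocate outside the "frozen" section: may block or reclaim */
      int *new_tags = calloc(new_nr, sizeof(*new_tags));
      int *old_tags;

      if (!new_tags)
          return -1;

      /* 2) swap while frozen: quick, no memory allocation in here */
      pthread_mutex_lock(&freeze_lock);
      old_tags = sched_tags;
      sched_tags = new_tags;
      nr_requests = new_nr;
      pthread_mutex_unlock(&freeze_lock);

      /* 3) free the old memory only after "unfreezing" */
      free(old_tags);
      return 0;
  }

  int main(void)
  {
      sched_tags = calloc(64, sizeof(*sched_tags));
      nr_requests = 64;
      if (resize_tags(256))
          return 1;
      printf("nr_requests grown to %u\n", nr_requests);
      free(sched_tags);
      return 0;
  }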
2025-10-15  blk-mq-sched: add new parameter nr_requests in blk_mq_alloc_sched_tags()  [Yu Kuai, 4 files, -11/+19]
[ Upstream commit 6293e336f6d7d3f3415346ce34993b3398846166 ] This helper only supports allocating the default number of requests; add a new parameter to support a specific number of requests, in preparation for fixing a potential deadlock in the case where nr_requests grows. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Stable-dep-of: b86433721f46 ("blk-mq: fix potential deadlock while nr_requests grown") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-15  blk-mq: split bitmap grow and resize case in blk_mq_update_nr_requests()  [Yu Kuai, 1 file, -12/+27]
[ Upstream commit e63200404477456ec60c62dd8b3b1092aba2e211 ] No functional changes are intended; make the code cleaner and prepare to fix the grow case in the following patches. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Stable-dep-of: b86433721f46 ("blk-mq: fix potential deadlock while nr_requests grown") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-15  blk-mq: cleanup shared tags case in blk_mq_update_nr_requests()  [Yu Kuai, 2 files, -28/+22]
[ Upstream commit 7f2799c546dba9e12f9ff4d07936601e416c640d ] For the shared tags case, all hctx->sched_tags/tags are the same; it doesn't make sense to call into blk_mq_tag_update_depth() multiple times for the same tags. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Stable-dep-of: b86433721f46 ("blk-mq: fix potential deadlock while nr_requests grown") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-15  blk-mq: convert to serialize updating nr_requests with update_nr_hwq_lock  [Yu Kuai, 1 file, -5/+20]
[ Upstream commit 626ff4f8ebcb7207f01e7810acb85812ccf06bd8 ] request_queue->nr_requests can be changed by: a) switching the elevator by updating nr_hw_queues; b) switching the elevator via the elevator sysfs attribute; c) configuring the queue sysfs attribute nr_requests. The current lock order is: 1) update_nr_hwq_lock, cases a and b; 2) freeze_queue; 3) elevator_lock, cases a, b and c. Updating nr_requests is already serialized by elevator_lock; however, in case c we have to allocate new sched_tags if nr_requests grows, and doing this with elevator_lock held and the queue frozen risks deadlock. Hence use update_nr_hwq_lock instead, making it possible to allocate memory when tags grow, while also preventing nr_requests from being changed concurrently. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Stable-dep-of: b86433721f46 ("blk-mq: fix potential deadlock while nr_requests grown") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-15  blk-mq: check invalid nr_requests in queue_requests_store()  [Yu Kuai, 4 files, -22/+17]
[ Upstream commit b46d4c447db76e36906ed59ebb9b3ef8f3383322 ] queue_requests_store() is the only caller of blk_mq_update_nr_requests(), and blk_mq_update_nr_requests() is the only caller of blk_mq_tag_update_depth(); however, they all have checks for the nr_requests value input by the user. Make the code cleaner by moving all the checks to the top function: 1) nr_requests > reserved tags; 2) if there is an elevator, 4 <= nr_requests <= 2048; 3) if the elevator is none, 4 <= nr_requests <= tag_set->queue_depth. Meanwhile, case 2 is the only case where tags can grow and -ENOMEM might be returned. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Stable-dep-of: b86433721f46 ("blk-mq: fix potential deadlock while nr_requests grown") Signed-off-by: Sasha Levin <sashal@kernel.org>
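A standalone sketch of the validation rules quoted above; the constants and argument names are assumptions for illustration, not the kernel's:

  #include <stdbool.h>
  #include <stdio.h>

  #define MIN_DEPTH        4     /* lower bound quoted in the message */
  #define SCHED_MAX_DEPTH  2048  /* upper bound with an elevator */

  static bool nr_requests_valid(unsigned int nr, unsigned int reserved_tags,
                                unsigned int queue_depth, bool has_elevator)
  {
      if (nr <= reserved_tags)              /* rule 1 */
          return false;
      if (nr < MIN_DEPTH)                   /* common lower bound */
          return false;
      if (has_elevator)
          return nr <= SCHED_MAX_DEPTH;     /* rule 2: tags may grow */
      return nr <= queue_depth;             /* rule 3: bounded by the tag_set */
  }

  int main(void)
  {
      printf("%d\n", nr_requests_valid(128, 1, 256, true));   /* 1: ok */
      printf("%d\n", nr_requests_valid(4096, 1, 256, true));  /* 0: above 2048 */
      printf("%d\n", nr_requests_valid(512, 1, 256, false));  /* 0: above queue_depth */
      return 0;
  }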
2025-10-15  blk-mq: remove useless checkings in blk_mq_update_nr_requests()  [Yu Kuai, 1 file, -8/+1]
[ Upstream commit 8bd7195fea6d9662aa3b32498a3828bfd9b63185 ] 1) queue_requests_store() is the only caller of blk_mq_update_nr_requests(), where the queue is already frozen, so there is no need to check mq_freeze_depth; 2) q->tag_set must be set for a request-based device, and queue_is_mq() is already checked in blk_mq_queue_attr_visible(), so there is no need to check q->tag_set. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Stable-dep-of: b86433721f46 ("blk-mq: fix potential deadlock while nr_requests grown") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-15  block: fix ordering of recursive split IO  [Yu Kuai, 4 files, -9/+13]
[ Upstream commit b2f5974079d82a4761f002e80601064d4e39a81f ] Currently, a split bio will be chained to the original bio, and the original bio will be resubmitted to the tail of current->bio_list, waiting for the split bio to be issued. However, if the split bio gets split again, the IO order will be messed up. This problem, on the one hand, will cause performance degradation, especially for mdraid with large IO sizes; on the other hand, it will cause write errors for zoned block devices[1].

For example, in raid456 IO will first be split by max_sectors from md_submit_bio(), and then later be split again by chunksize for internal handling. Assume max_sectors is 1M and chunksize is 512k:

1) issue a 2M IO:
   bio issuing: 0+2M
   current->bio_list: NULL
2) md_submit_bio() splits by max_sectors:
   bio issuing: 0+1M
   current->bio_list: 1M+1M
3) chunk_aligned_read() splits by chunksize:
   bio issuing: 0+512k
   current->bio_list: 1M+1M -> 512k+512k
4) after the first bio is issued, __submit_bio_noacct() will continue issuing the next bio:
   bio issuing: 1M+1M
   current->bio_list: 512k+512k
   bio issued: 0+512k
5) chunk_aligned_read() splits by chunksize:
   bio issuing: 1M+512k
   current->bio_list: 512k+512k -> 1536k+512k
   bio issued: 0+512k
6) no split afterwards, so finally the issue order is:
   0+512k -> 1M+512k -> 512k+512k -> 1536k+512k

This behaviour causes large IO reads on raid456 to end up as small discontinuous IO in the underlying disks. Fix this problem by placing the split bio at the head of current->bio_list.

Test script: test on an 8-disk raid5 with 64k chunksize
  dd if=/dev/md0 of=/dev/null bs=4480k iflag=direct

Test results:

Before this patch:
1) iostat results:
   Device  r/s       rMB/s    rrqm/s   %rrqm  r_await  rareq-sz  aqu-sz  %util
   md0     52430.00  3276.87  0.00     0.00   0.62     64.00     32.60   80.10
   sd*     4487.00   409.00   2054.00  31.40  0.82     93.34     3.68    71.20
2) blktrace G stage:
   8,0 0 486445 11.357392936 843 G R 14071424 + 128 [dd]
   8,0 0 486451 11.357466360 843 G R 14071168 + 128 [dd]
   8,0 0 486454 11.357515868 843 G R 14071296 + 128 [dd]
   8,0 0 486468 11.357968099 843 G R 14072192 + 128 [dd]
   8,0 0 486474 11.358031320 843 G R 14071936 + 128 [dd]
   8,0 0 486480 11.358096298 843 G R 14071552 + 128 [dd]
   8,0 0 486490 11.358303858 843 G R 14071808 + 128 [dd]
3) io seek for sdx:
   Note that io seek is the result from the blktrace D stage, a statistic of:
   ABS((offset of next IO) - (offset + len of previous IO))
   Read|Write seek
   cnt 55175, zero cnt 25079
   >=(KB) .. <(KB) : count  ratio |distribution                            |
        0 .. 1     : 25079  45.5% |########################################|
        1 .. 2     : 0       0.0% |                                        |
        2 .. 4     : 0       0.0% |                                        |
        4 .. 8     : 0       0.0% |                                        |
        8 .. 16    : 0       0.0% |                                        |
       16 .. 32    : 0       0.0% |                                        |
       32 .. 64    : 12540  22.7% |#####################                   |
       64 .. 128   : 2508    4.5% |#####                                   |
      128 .. 256   : 0       0.0% |                                        |
      256 .. 512   : 10032  18.2% |#################                       |
      512 .. 1024  : 5016    9.1% |#########                               |

After this patch:
1) iostat results:
   Device  r/s       rMB/s    rrqm/s   %rrqm  r_await  rareq-sz  aqu-sz  %util
   md0     87965.00  5271.88  0.00     0.00   0.16     61.37     14.03   90.60
   sd*     6020.00   658.44   5117.00  45.95  0.44     112.00    2.68    86.50
2) blktrace G stage:
   8,0 0 206296 5.354894072 664 G R 7156992 + 128 [dd]
   8,0 0 206305 5.355018179 664 G R 7157248 + 128 [dd]
   8,0 0 206316 5.355204438 664 G R 7157504 + 128 [dd]
   8,0 0 206319 5.355241048 664 G R 7157760 + 128 [dd]
   8,0 0 206333 5.355500923 664 G R 7158016 + 128 [dd]
   8,0 0 206344 5.355837806 664 G R 7158272 + 128 [dd]
   8,0 0 206353 5.355960395 664 G R 7158528 + 128 [dd]
   8,0 0 206357 5.356020772 664 G R 7158784 + 128 [dd]
3) io seek for sdx:
   Read|Write seek
   cnt 28644, zero cnt 21483
   >=(KB) .. <(KB) : count  ratio |distribution                            |
        0 .. 1     : 21483  75.0% |########################################|
        1 .. 2     : 0       0.0% |                                        |
        2 .. 4     : 0       0.0% |                                        |
        4 .. 8     : 0       0.0% |                                        |
        8 .. 16    : 0       0.0% |                                        |
       16 .. 32    : 0       0.0% |                                        |
       32 .. 64    : 7161   25.0% |##############                          |

BTW, this looks like a long-standing problem from day one, and large sequential IO reads are a pretty common case, like video playing. Even with this patch, in this test case IO is merged to at most 128k due to the block layer plug limit BLK_PLUG_FLUSH_SIZE; increasing that limit can give even better performance. However, we'll figure out how to do that properly later.

[1] https://lore.kernel.org/all/e40b076d-583d-406b-b223-005910a9f46f@acm.org/
Fixes: d89d87965dcb ("When stacked block devices are in-use (e.g. md or dm), the recursive calls")
Reported-by: Tie Ren <tieren@fnnas.com>
Closes: https://lore.kernel.org/all/7dro5o7u5t64d6bgiansesjavxcuvkq5p2pok7dtwkav7b7ape@3isfr44b6352/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
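The ordering effect in the example above can be reproduced with a small self-contained simulation. The bio and list types below are simplified stand-ins for the kernel's, and the two split stages model md_submit_bio() and chunk_aligned_read() under the assumptions of the example (1M max_sectors, 512k chunk):

  #include <stdio.h>
  #include <string.h>

  #define MAX_SECTORS 1024  /* in KiB: md max_sectors, 1M */
  #define CHUNK       512   /* raid chunk, 512k */
  #define NBIOS       16

  struct bio { unsigned int off, len; };                 /* in KiB */
  struct bio_list { struct bio b[NBIOS]; int n; };

  static void add_tail(struct bio_list *l, struct bio b) { l->b[l->n++] = b; }

  static void add_head(struct bio_list *l, struct bio b)
  {
      memmove(&l->b[1], &l->b[0], l->n * sizeof(struct bio));
      l->b[0] = b;
      l->n++;
  }

  static struct bio pop(struct bio_list *l)
  {
      struct bio b = l->b[0];
      memmove(&l->b[0], &l->b[1], --l->n * sizeof(struct bio));
      return b;
  }

  /* Split @b at @limit; the remainder goes back onto the pending list. */
  static struct bio split(struct bio b, unsigned int limit, struct bio_list *l, int head)
  {
      if (b.len > limit) {
          struct bio rest = { b.off + limit, b.len - limit };
          if (head)
              add_head(l, rest);
          else
              add_tail(l, rest);
          b.len = limit;
      }
      return b;
  }

  static void run(int head)
  {
      struct bio_list pending = { .n = 0 };
      struct bio b = { 0, 2048 };                        /* one 2M read */

      printf("%s insertion:", head ? "head" : "tail");
      for (;;) {
          b = split(b, MAX_SECTORS, &pending, head);     /* md_submit_bio() */
          b = split(b, CHUNK, &pending, head);           /* chunk_aligned_read() */
          printf(" %uk+%uk", b.off, b.len);              /* bio issued */
          if (!pending.n)
              break;
          b = pop(&pending);
      }
      printf("\n");
  }

  int main(void)
  {
      run(0);   /* old behaviour: 0k, 1024k, 512k, 1536k (out of order) */
      run(1);   /* new behaviour: 0k, 512k, 1024k, 1536k (sequential)   */
      return 0;
  }

Running it prints the tail-insertion order 0k, 1024k, 512k, 1536k and the head-insertion order 0k, 512k, 1024k, 1536k, matching the commit message's example.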
2025-10-15  block: skip unnecessary checks for split bio  [Yu Kuai, 3 files, -2/+7]
[ Upstream commit 0b64682e78f7a53ea863e368b1aa66f05767858d ] Lots of checks are already done while submitting this bio the first time, and there is no need to check them again when this bio is resubmitted after a split. Hence open-code the still-necessary should_fail_bio() and blk_throtl_bio() checks in submit_bio_split_bioset(). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Stable-dep-of: b2f5974079d8 ("block: fix ordering of recursive split IO") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-15  block: factor out a helper bio_submit_split_bioset()  [Yu Kuai, 1 file, -19/+40]
[ Upstream commit e37b5596a19be9a150cb194ec32e78f295a3574b ] No functional changes are intended; some drivers like mdraid split bios in their internal processing, so prepare to unify the bio split code. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Stable-dep-of: b2f5974079d8 ("block: fix ordering of recursive split IO") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-15  block: initialize bio issue time in blk_mq_submit_bio()  [Yu Kuai, 4 files, -8/+8]
[ Upstream commit 1f963bdd6420b6080bcfd0ee84a75c96f35545a6 ] bio->issue_time_ns is only used by blk-iolatency, which can only be enabled for rq-based disks, hence it is not necessary to initialize the time for bio-based disks. Meanwhile, if a bio is split by blk_crypto_fallback_split_bio_if_needed(), the issue time is not initialized for the new split bio; this can be fixed as well. Note that the next patch will optimize this further so that the bio issue time is only used when blk-iolatency is actually enabled for the disk. Fixes: 488f6682c832 ("block: blk-crypto-fallback for Inline Encryption") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-15  block: cleanup bio_issue  [Yu Kuai, 4 files, -55/+5]
[ Upstream commit 1733e88874838ddebf7774440c285700865e6b08 ] Now that bio->bi_issue is only used by blk-iolatency to get bio issue time, replace bio_issue with u64 time directly and remove bio_issue to make code cleaner. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Stable-dep-of: 1f963bdd6420 ("block: initialize bio issue time in blk_mq_submit_bio()") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-15  blk-throttle: fix access race during throttle policy activation  [Han Guangjiang, 4 files, -18/+18]
[ Upstream commit bd9fd5be6bc0836820500f68fff144609fbd85a9 ] On repeated cold boots we occasionally hit a NULL pointer crash in blk_should_throtl() when throttling is consulted before the throttle policy is fully enabled for the queue. Checking only q->td != NULL is insufficient during early initialization, so blkg_to_pd() for the throttle policy can still return NULL and blkg_to_tg() becomes NULL, which later gets dereferenced.

Unable to handle kernel NULL pointer dereference at virtual address 0000000000000156
...
pc : submit_bio_noacct+0x14c/0x4c8
lr : submit_bio_noacct+0x48/0x4c8
sp : ffff800087f0b690
x29: ffff800087f0b690 x28: 0000000000005f90 x27: ffff00068af393c0
x26: 0000000000080000 x25: 000000000002fbc0 x24: ffff000684ddcc70
x23: 0000000000000000 x22: 0000000000000000 x21: 0000000000000000
x20: 0000000000080000 x19: ffff000684ddcd08 x18: ffffffffffffffff
x17: 0000000000000000 x16: ffff80008132a550 x15: 0000ffff98020fff
x14: 0000000000000000 x13: 1fffe000d11d7021 x12: ffff000688eb810c
x11: ffff00077ec4bb80 x10: ffff000688dcb720 x9 : ffff80008068ef60
x8 : 00000a6fb8a86e85 x7 : 000000000000111e x6 : 0000000000000002
x5 : 0000000000000246 x4 : 0000000000015cff x3 : 0000000000394500
x2 : ffff000682e35e40 x1 : 0000000000364940 x0 : 000000000000001a
Call trace:
 submit_bio_noacct+0x14c/0x4c8
 verity_map+0x178/0x2c8
 __map_bio+0x228/0x250
 dm_submit_bio+0x1c4/0x678
 __submit_bio+0x170/0x230
 submit_bio_noacct_nocheck+0x16c/0x388
 submit_bio_noacct+0x16c/0x4c8
 submit_bio+0xb4/0x210
 f2fs_submit_read_bio+0x4c/0xf0
 f2fs_mpage_readpages+0x3b0/0x5f0
 f2fs_readahead+0x90/0xe8

Tighten blk_throtl_activated() to also require that the throttle policy bit is set on the queue:

  return q->td != NULL && test_bit(blkcg_policy_throtl.plid, q->blkcg_pols);

This prevents blk_should_throtl() from accessing throttle group state until policy data has been attached to blkgs.

Fixes: a3166c51702b ("blk-throttle: delay initialization until configuration")
Co-developed-by: Liang Jie <liangjie@lixiang.com>
Signed-off-by: Liang Jie <liangjie@lixiang.com>
Signed-off-by: Han Guangjiang <hanguangjiang@lixiang.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-15  blk-mq: fix elevator depth_updated method  [Yu Kuai, 6 files, -52/+41]
[ Upstream commit 7d337eef4affc5e26e0570513168c69ddbc40f92 ] The current depth_updated method has some problems:
1) depth_updated() is called for each hctx, while all elevators update async_depth at the disk level; this is not related to the hctx;
2) in blk_mq_update_nr_requests(), if a previous hctx update succeeded and this hctx update failed, q->nr_requests will not be updated, while async_depth has already been updated with the new nr_requests in the previous depth_updated();
3) all elevators now use q->nr_requests to calculate async_depth; however, q->nr_requests is still the old value when depth_updated() is called from blk_mq_update_nr_requests().
These problems affected first the error path, then mq-deadline, and recently bfq and kyber. Fix them by:
- passing in the request_queue instead of the hctx;
- moving depth_updated() after q->nr_requests is updated in blk_mq_update_nr_requests();
- adding a depth_updated() call inside the init_sched() method to initialize async_depth;
- removing the init_hctx() method for mq-deadline and bfq, which is now useless.
Fixes: 77f1e0a52d26 ("bfq: update internal depth state when queue depth changes") Fixes: 39823b47bbd4 ("block/mq-deadline: Fix the tag reservation code") Fixes: 42e6c6ce03fd ("lib/sbitmap: convert shallow_depth from one word to the whole sbitmap") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Li Nan <linan122@huawei.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250821060612.1729939-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-15  block: use int to store blk_stack_limits() return value  [Qianfeng Rong, 1 file, -1/+2]
[ Upstream commit b0b4518c992eb5f316c6e40ff186cbb7a5009518 ] Change the 'ret' variable in blk_stack_limits() from unsigned int to int, as it needs to store negative value -1. Storing the negative error codes in unsigned type, or performing equality comparisons (e.g., ret == -1), doesn't cause an issue at runtime [1] but can be confusing. Additionally, assigning negative error codes to unsigned type may trigger a GCC warning when the -Wsign-conversion flag is enabled. No effect on runtime. Link: https://lore.kernel.org/all/x3wogjf6vgpkisdhg3abzrx7v7zktmdnfmqeih5kosszmagqfs@oh3qxrgzkikf/ #1 Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com> Reviewed-by: John Garry <john.g.garry@oracle.com> Fixes: fe0b393f2c0a ("block: Correct handling of bottom device misaligment") Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20250902130930.68317-1-rongqianfeng@vivo.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
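A tiny demonstration of the point being made, in plain userspace C unrelated to blk_stack_limits() itself:

  #include <stdio.h>

  int main(void)
  {
      unsigned int ret = -1;   /* wraps to UINT_MAX; -Wsign-conversion warns here */

      if (ret == -1)           /* -1 is converted to UINT_MAX, so this is still true */
          printf("ret == -1 still detected (ret = %u)\n", ret);
      return 0;
  }

So the behaviour is unchanged either way; using int simply makes the type say what the value means.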
2025-10-15  blk-mq: check kobject state_in_sysfs before deleting in blk_mq_unregister_hctx  [Li Nan, 1 file, -2/+4]
[ Upstream commit 4c7ef92f6d4d08a27d676e4c348f4e2922cab3ed ] In __blk_mq_update_nr_hw_queues() the return value of blk_mq_sysfs_register_hctxs() is not checked. If sysfs creation for an hctx fails, later changing the number of hw_queues or removing the disk will trigger the following warning:

kernfs: can not remove 'nr_tags', no directory
WARNING: CPU: 2 PID: 637 at fs/kernfs/dir.c:1707 kernfs_remove_by_name_ns+0x13f/0x160
Call Trace:
 remove_files.isra.1+0x38/0xb0
 sysfs_remove_group+0x4d/0x100
 sysfs_remove_groups+0x31/0x60
 __kobject_del+0x23/0xf0
 kobject_del+0x17/0x40
 blk_mq_unregister_hctx+0x5d/0x80
 blk_mq_sysfs_unregister_hctxs+0x94/0xd0
 blk_mq_update_nr_hw_queues+0x124/0x760
 nullb_update_nr_hw_queues+0x71/0xf0 [null_blk]
 nullb_device_submit_queues_store+0x92/0x120 [null_blk]

kobject_del() was called unconditionally even if sysfs creation failed. Fix it by checking the kobject creation status before deleting it.

Fixes: 477e19dedc9d ("blk-mq: adjust debugfs and sysfs register when updating nr_hw_queues")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250826084854.1030545-1-linan666@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-06  blk-mq: fix blk_mq_tags double free while nr_requests grown  [Yu Kuai, 1 file, -0/+1]
commit ba28afbd9eff2a6370f23ef4e6a036ab0cfda409 upstream. In the case where the user triggers tag growth via the queue sysfs attribute nr_requests, hctx->sched_tags will be freed directly and replaced with newly allocated tags, see blk_mq_tag_update_depth(). The problem is that hctx->sched_tags comes from elevator->et->tags, while et->tags still points to the freed tags, hence a later elevator exit will try to free the tags again, causing a kernel panic. Fix this problem by replacing et->tags with the newly allocated tags as well. Note that there are still some long-term problems that will require some refactoring to be fixed thoroughly[1]. [1] https://lore.kernel.org/all/20250815080216.410665-1-yukuai1@huaweicloud.com/ Fixes: f5a6604f7a44 ("block: fix lockdep warning caused by lock dependency in elv_iosched_store") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Li Nan <linan122@huawei.com> Link: https://lore.kernel.org/r/20250821060612.1729939-3-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-09-26  Merge tag 'block-6.17-20250925' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux  [Linus Torvalds, 1 file, -1/+3]
Pull block fixes from Jens Axboe: "A regression fix for this series where an attempt to silence an EOD error got messed up a bit, and then a change of git trees for the block and io_uring trees. Switching the git trees to kernel.org now, as I've just about had it trying to battle AI bots that bring the box to its knees, continually. At least I don't have to maintain the kernel.org side"

* tag 'block-6.17-20250925' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  MAINTAINERS: update io_uring and block tree git trees
  block: fix EOD return for device with nr_sectors == 0
2025-09-22  block: fix EOD return for device with nr_sectors == 0  [Jens Axboe, 1 file, -1/+3]
A recent commit skipped dumping the usual "attempt to access beyond end of device" message if the device size is 0 sectors, as that's a common pattern for devices that have been hot removed. But while it stopped that message, it also prevented returning -EIO for that condition. Reinstate the -EIO return, while retaining the quiet operation for triggering EOD for a device with 0 sectors. Reported-by: syzbot+4b12286339fe4c2700c1@syzkaller.appspotmail.com Reported-by: Sahil Chandna <chandna.linuxkernel@gmail.com> Fixes: d0a2b527d8c3 ("block: tone down bio_check_eod") Tested-by: Sahil Chandna <chandna.linuxkernel@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-08  Merge tag 'vfs-6.17-rc6.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs  [Linus Torvalds, 1 file, -5/+8]
Pull vfs fixes from Christian Brauner: "fuse: - Prevent opening of non-regular backing files. Fuse doesn't support non-regular files anyway. - Check whether copy_file_range() returns a larger size than requested. - Prevent overflow in copy_file_range() as fuse currently only supports 32-bit sized copies. - Cache the blocksize value if the server returned a new value as inode->i_blkbits isn't modified directly anymore. - Fix i_blkbits handling for iomap partial writes. By default i_blkbits is set to PAGE_SIZE which causes iomap to mark the whole folio as uptodate even on a partial write. But fuseblk filesystems support choosing a blocksize smaller than PAGE_SIZE risking data corruption. Simply enforce PAGE_SIZE as blocksize for fuseblk's internal inode for now. - Prevent out-of-bounds acces in fuse_dev_write() when the number of bytes to be retrieved is truncated to the fc->max_pages limit. virtiofs: - Fix page faults for DAX page addresses. Misc: - Tighten file handle decoding from userns. Check that the decoded dentry itself has a valid idmapping in the user namespace. - Fix mount-notify selftests. - Fix some indentation errors. - Add an FMODE_ flag to indicate IOCB_HAS_METADATA availability. This will be moved to an FOP_* flag with a bit more rework needed for that to happen not suitable for a fix. - Don't silently ignore metadata for sync read/write. - Don't pointlessly log warning when reading coredump sysctls" * tag 'vfs-6.17-rc6.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fuse: virtio_fs: fix page fault for DAX page address selftests/fs/mount-notify: Fix compilation failure. fhandle: use more consistent rules for decoding file handle from userns fuse: Block access to folio overlimit fuse: fix fuseblk i_blkbits for iomap partial writes fuse: reflect cached blocksize if blocksize was changed fuse: prevent overflow in copy_file_range return value fuse: check if copy_file_range() returns larger than requested size fuse: do not allow mapping a non-regular backing file coredump: don't pointlessly check and spew warnings fs: fix indentation style block: don't silently ignore metadata for sync read/write fs: add a FMODE_ flag to indicate IOCB_HAS_METADATA availability Please enter a commit message to explain why this merge is necessary, especially if it merges an updated upstream into a topic branch.
2025-09-01  Merge tag 'fuse-fixes-6.17-rc5' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse into vfs.fixes  [Christian Brauner, 6 files, -32/+52]
fuse fixes for 6.17-rc5 * tag 'fuse-fixes-6.17-rc5' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (6 commits) fuse: Block access to folio overlimit fuse: fix fuseblk i_blkbits for iomap partial writes fuse: reflect cached blocksize if blocksize was changed fuse: prevent overflow in copy_file_range return value fuse: check if copy_file_range() returns larger than requested size fuse: do not allow mapping a non-regular backing file Link: https://lore.kernel.org/CAJfpeguEVMMyw_zCb+hbOuSxdE2Z3Raw=SJsq=Y56Ae6dn2W3g@mail.gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-08-26  block: validate QoS before calling __rq_qos_done_bio()  [Nilay Shroff, 1 file, -5/+8]
If a bio has BIO_QOS_xxx set, it doesn't guarantee that q->rq_qos is also present, at least for stacked block devices. For instance, in the case of NVMe when multipath is enabled, the bottom device may have QoS enabled but the top device doesn't. So always validate that QoS is enabled and q->rq_qos is present before calling __rq_qos_done_bio(). Fixes: 370ac285f23a ("block: avoid cpu_hotplug_lock depedency on freeze_lock") Reported-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Closes: https://lore.kernel.org/all/3a07b752-06a4-4eee-b302-f4669feb859d@linux.ibm.com/ Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250826163128.1952394-1-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-26  blk-zoned: Fix a lockdep complaint about recursive locking  [Bart Van Assche, 1 file, -5/+6]
If preparing a write bio fails then blk_zone_wplug_bio_work() calls bio_endio() with zwplug->lock held. If a device mapper driver is stacked on top of the zoned block device then this results in nested locking of zwplug->lock. The resulting lockdep complaint is a false positive because this is nested locking and not recursive locking. Suppress this false positive by calling blk_zone_wplug_bio_io_error() without holding zwplug->lock. This is safe because no code in blk_zone_wplug_bio_io_error() depends on zwplug->lock being held. This patch suppresses the following lockdep complaint:

WARNING: possible recursive locking detected
--------------------------------------------
kworker/3:0H/46 is trying to acquire lock:
ffffff882968b830 (&zwplug->lock){-...}-{2:2}, at: blk_zone_write_plug_bio_endio+0x64/0x1f0

but task is already holding lock:
ffffff88315bc230 (&zwplug->lock){-...}-{2:2}, at: blk_zone_wplug_bio_work+0x8c/0x48c

other info that might help us debug this:
 Possible unsafe locking scenario:
       CPU0
       ----
  lock(&zwplug->lock);
  lock(&zwplug->lock);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

3 locks held by kworker/3:0H/46:
 #0: ffffff8809486758 ((wq_completion)sdd_zwplugs){+.+.}-{0:0}, at: process_one_work+0x1bc/0x65c
 #1: ffffffc085de3d70 ((work_completion)(&zwplug->bio_work)){+.+.}-{0:0}, at: process_one_work+0x1e4/0x65c
 #2: ffffff88315bc230 (&zwplug->lock){-...}-{2:2}, at: blk_zone_wplug_bio_work+0x8c/0x48c

stack backtrace:
CPU: 3 UID: 0 PID: 46 Comm: kworker/3:0H Tainted: G W OE 6.12.38-android16-5-maybe-dirty-4k #1 8b362b6f76e3645a58cd27d86982bce10d150025
Tainted: [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Hardware name: Spacecraft board based on MALIBU (DT)
Workqueue: sdd_zwplugs blk_zone_wplug_bio_work
Call trace:
 dump_backtrace+0xfc/0x17c
 show_stack+0x18/0x28
 dump_stack_lvl+0x40/0xa0
 dump_stack+0x18/0x24
 print_deadlock_bug+0x38c/0x398
 __lock_acquire+0x13e8/0x2e1c
 lock_acquire+0x134/0x2b4
 _raw_spin_lock_irqsave+0x5c/0x80
 blk_zone_write_plug_bio_endio+0x64/0x1f0
 bio_endio+0x9c/0x240
 __dm_io_complete+0x214/0x260
 clone_endio+0xe8/0x214
 bio_endio+0x218/0x240
 blk_zone_wplug_bio_work+0x204/0x48c
 process_one_work+0x26c/0x65c
 worker_thread+0x33c/0x498
 kthread+0x110/0x134
 ret_from_fork+0x10/0x20

Cc: stable@vger.kernel.org
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20250825182720.1697203-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
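The shape of the fix, modeled in userspace with a mutex standing in for the zwplug spinlock; all names are illustrative, none of this is the kernel code:

  #include <pthread.h>
  #include <stdio.h>

  struct zone_plug {
      pthread_mutex_t lock;
      int pending;
  };

  static void complete_bio(struct zone_plug *zwplug)
  {
      /* In the real code this path may take another plug's lock when a
       * device-mapper target is stacked on top, so it must run unlocked. */
      printf("bio completed, pending now %d\n", zwplug->pending);
  }

  static void plug_work(struct zone_plug *zwplug, int prepare_failed)
  {
      int must_fail = 0;

      pthread_mutex_lock(&zwplug->lock);
      if (prepare_failed) {
          zwplug->pending--;     /* bookkeeping stays under the lock */
          must_fail = 1;         /* remember the error ...           */
      }
      pthread_mutex_unlock(&zwplug->lock);

      if (must_fail)             /* ... and complete it after unlocking */
          complete_bio(zwplug);
  }

  int main(void)
  {
      struct zone_plug zwplug = { .pending = 1 };

      pthread_mutex_init(&zwplug.lock, NULL);
      plug_work(&zwplug, 1);
      return 0;
  }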
2025-08-21  block: avoid cpu_hotplug_lock depedency on freeze_lock  [Nilay Shroff, 3 files, -28/+36]
A recent lockdep[1] splat observed while running blktest block/005 reveals a potential deadlock caused by the cpu_hotplug_lock dependency on ->freeze_lock. This dependency was introduced by commit 033b667a823e ("block: blk-rq-qos: guard rq-qos helpers by static key"). That change added a static key to avoid fetching q->rq_qos when neither blk-wbt nor blk-iolatency is configured. The static key dynamically patches kernel text to a NOP when disabled, eliminating overhead of fetching q->rq_qos in the I/O hot path. However, enabling a static key at runtime requires acquiring both cpu_hotplug_lock and jump_label_mutex. When this happens after the queue has already been frozen (i.e., while holding ->freeze_lock), it creates a locking dependency from cpu_hotplug_lock to ->freeze_lock, which leads to a potential deadlock reported by lockdep [1]. To resolve this, replace the static key mechanism with q->queue_flags: QUEUE_FLAG_QOS_ENABLED. This flag is evaluated in the fast path before accessing q->rq_qos. If the flag is set, we proceed to fetch q->rq_qos; otherwise, the access is skipped. Since q->queue_flags is commonly accessed in IO hotpath and resides in the first cacheline of struct request_queue, checking it imposes minimal overhead while eliminating the deadlock risk. This change avoids the lockdep splat without introducing performance regressions. [1] https://lore.kernel.org/linux-block/4fdm37so3o4xricdgfosgmohn63aa7wj3ua4e5vpihoamwg3ui@fq42f5q5t5ic/ Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Closes: https://lore.kernel.org/linux-block/4fdm37so3o4xricdgfosgmohn63aa7wj3ua4e5vpihoamwg3ui@fq42f5q5t5ic/ Fixes: 033b667a823e ("block: blk-rq-qos: guard rq-qos helpers by static key") Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250814082612.500845-4-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
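A minimal sketch of the flag-gated fast path this describes; the flag value, struct layout and helper below are simplified stand-ins for QUEUE_FLAG_QOS_ENABLED and q->rq_qos, not the kernel definitions:

  #include <stdio.h>

  #define QUEUE_FLAG_QOS_ENABLED (1u << 0)

  struct rq_qos { const char *name; };

  struct request_queue {
      unsigned long queue_flags;   /* first-cacheline state, cheap to test */
      struct rq_qos *rq_qos;
  };

  static void rq_qos_done(struct request_queue *q)
  {
      /* Fast path: a plain flag test instead of a static-key branch, so no
       * jump-label patching (and no cpu_hotplug_lock) is needed to enable it. */
      if (!(q->queue_flags & QUEUE_FLAG_QOS_ENABLED))
          return;
      if (q->rq_qos)
          printf("running QoS hook: %s\n", q->rq_qos->name);
  }

  int main(void)
  {
      struct rq_qos wbt = { "wbt" };
      struct request_queue q = { 0, NULL };

      rq_qos_done(&q);                          /* flag clear: skipped   */
      q.rq_qos = &wbt;
      q.queue_flags |= QUEUE_FLAG_QOS_ENABLED;
      rq_qos_done(&q);                          /* flag set: hook runs   */
      return 0;
  }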
2025-08-21  block: decrement block_rq_qos static key in rq_qos_del()  [Nilay Shroff, 1 file, -0/+1]
rq_qos_add() increments the block_rq_qos static key when a QoS policy is attached. When a QoS policy is removed via rq_qos_del(), we must symmetrically decrement the static key. If this removal drops the last QoS policy from the queue (q->rq_qos becomes NULL), the static branch can be disabled and the jump label patched to a NOP, avoiding overhead on the hot path. This change ensures rq_qos_add()/rq_qos_del() keep the block_rq_qos static key balanced and prevents leaving the branch permanently enabled after the last policy is removed. Fixes: 033b667a823e ("block: blk-rq-qos: guard rq-qos helpers by static key") Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250814082612.500845-3-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-21  block: skip q->rq_qos check in rq_qos_done_bio()  [Nilay Shroff, 1 file, -2/+8]
If a bio has BIO_QOS_THROTTLED or BIO_QOS_MERGED set, it implicitly guarantees that q->rq_qos is present. Avoid re-checking q->rq_qos in this case and call __rq_qos_done_bio() directly as a minor optimization. Suggested-by: Yu Kuai <yukuai1@huaweicloud.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250814082612.500845-2-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-21  blk-mq: fix lockdep warning in __blk_mq_update_nr_hw_queues  [Ming Lei, 1 file, -4/+9]
Commit 5989bfe6ac6b ("block: restore two stage elevator switch while running nr_hw_queue update") reintroduced a lockdep warning by calling blk_mq_freeze_queue_nomemsave() before switching the I/O scheduler. The function blk_mq_elv_switch_none() calls elevator_change_done(), and running this while the queue is frozen causes a lockdep warning. Fix this by reordering the operations: first, switch the I/O scheduler to 'none', and then freeze the queue. This ensures that elevator_change_done() is not called on an already frozen queue. This ordering is safe because elevator_set_none() freezes the queue itself before switching to none. We still have to rely on blk_mq_elv_switch_back() for switching back, and it has to cover the unfrozen queue case. Cc: Nilay Shroff <nilay@linux.ibm.com> Cc: Yu Kuai <yukuai3@huawei.com> Fixes: 5989bfe6ac6b ("block: restore two stage elevator switch while running nr_hw_queue update") Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250815131737.331692-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-20  block: don't silently ignore metadata for sync read/write  [Christoph Hellwig, 1 file, -5/+5]
The block fops don't try to handle metadata for synchronous requests, probably because the completion handler looks at dio->iocb which is not valid for synchronous requests. But silently ignoring metadata (or warning in case of __blkdev_direct_IO_simple) is a really bad idea as that can cause silent data corruption if a user ever shows up. Instead simply handle metadata for synchronous requests as the completion handler can simply check for bio_integrity() as the block layer default integrity will already be freed at this point, and thus bio_integrity() will only return true for user mapped integrity. Fixes: 3d8b5a22d404 ("block: add support to pass user meta buffer") Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/20250819082517.2038819-3-hch@lst.de Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-08-20  fs: add a FMODE_ flag to indicate IOCB_HAS_METADATA availability  [Christoph Hellwig, 1 file, -0/+3]
Currently the kernel will happily route io_uring requests with metadata to file operations that don't support it. Add a FMODE_ flag to guard that. Fixes: 4de2ce04c862 ("fs: introduce IOCB_HAS_METADATA for metadata") Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/20250819082517.2038819-2-hch@lst.de Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-08-18  block: tone down bio_check_eod  [Christoph Hellwig, 1 file, -1/+1]
bdev_nr_sectors() == 0 is a pattern used for block devices that have been hot removed, don't spam the log about them. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250818101102.1604551-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-18  block: remove newlines from the warnings in blk_validate_integrity_limits  [Christoph Hellwig, 1 file, -6/+3]
Otherwise they are very hard to read in the kernel log. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20250818045456.1482889-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-18  block: handle pi_tuple_size in queue_limits_stack_integrity  [Christoph Hellwig, 1 file, -0/+3]
queue_limits_stack_integrity needs to handle the new pi_tuple_size field, otherwise stacking PI-capable devices will always fail. Fixes: 76e45252a4ce ("block: introduce pi_tuple_size field in blk_integrity") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20250818045456.1482889-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-13  block: restore default wbt enablement  [Julian Sun, 1 file, -1/+1]
The commit 245618f8e45f ("block: protect wbt_lat_usec using q->elevator_lock") protected wbt_enable_default() with q->elevator_lock; however, it also placed wbt_enable_default() before blk_queue_flag_set(QUEUE_FLAG_REGISTERED, q), resulting in wbt failing to be enabled. Moreover, the protection of wbt_enable_default() by q->elevator_lock was removed in commit 78c271344b6f ("block: move wbt_enable_default() out of queue freezing from sched ->exit()"), so we can directly fix this issue by placing wbt_enable_default() after blk_queue_flag_set(QUEUE_FLAG_REGISTERED, q). Additionally, this issue also causes the inability to read the wbt_lat_usec file, and the scenario is as follows:

root@q:/sys/block/sda/queue# cat wbt_lat_usec
cat: wbt_lat_usec: Invalid argument
root@q:/data00/sjc/linux# ls /sys/kernel/debug/block/sda/rqos
cannot access '/sys/kernel/debug/block/sda/rqos': No such file or directory
root@q:/data00/sjc/linux# find /sys -name wbt
/sys/kernel/debug/tracing/events/wbt

After testing with this patch, wbt can be enabled normally.

Signed-off-by: Julian Sun <sunjunchao@bytedance.com>
Cc: stable@vger.kernel.org
Fixes: 245618f8e45f ("block: protect wbt_lat_usec using q->elevator_lock")
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250812154257.57540-1-sunjunchao@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-11  blk-wbt: Eliminate ambiguity in the comments of struct rq_wb  [Tang Yizhou, 1 file, -2/+2]
In the current implementation, the last_issue and last_comp members of struct rq_wb are used only by read requests and not by non-throttled write requests. Therefore, eliminate the ambiguity here. Signed-off-by: Tang Yizhou <yizhou.tang@shopee.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20250727173959.160835-3-yizhou.tang@shopee.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-11  blk-wbt: Optimize wbt_done() for non-throttled writes  [Tang Yizhou, 1 file, -5/+6]
In the current implementation, the sync_cookie and last_cookie members of struct rq_wb are used only by read requests and not by non-throttled write requests. Based on this, we can optimize wbt_done() by removing one if condition check for non-throttled write requests. Signed-off-by: Tang Yizhou <yizhou.tang@shopee.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250727173959.160835-2-yizhou.tang@shopee.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-11  block: fix kobject double initialization in add_disk  [Zheng Qixing, 3 files, -7/+8]
Device-mapper can call add_disk() multiple times for the same gendisk due to its two-phase creation process (dm create + dm load). This leads to kobject double initialization errors when the underlying iSCSI devices become temporarily unavailable and then reappear. However, if the first add_disk() call fails and is retried, the queue_kobj gets initialized twice, causing:

kobject: kobject (ffff88810c27bb90): tried to init an initialized object, something is seriously wrong.
Call Trace:
 <TASK>
 dump_stack_lvl+0x5b/0x80
 kobject_init.cold+0x43/0x51
 blk_register_queue+0x46/0x280
 add_disk_fwnode+0xb5/0x280
 dm_setup_md_queue+0x194/0x1c0
 table_load+0x297/0x2d0
 ctl_ioctl+0x2a2/0x480
 dm_ctl_ioctl+0xe/0x20
 __x64_sys_ioctl+0xc7/0x110
 do_syscall_64+0x72/0x390
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

Fix this by separating kobject initialization from sysfs registration (see the sketch after this entry):
 - Initialize queue_kobj early during gendisk allocation
 - add_disk() only adds the already-initialized kobject to sysfs
 - del_gendisk() removes from sysfs but doesn't destroy the kobject
 - Final cleanup happens when the disk is released

Fixes: 2bd85221a625 ("block: untangle request_queue refcounting from sysfs")
Reported-by: Li Lingfeng <lilingfeng3@huawei.com>
Closes: https://lore.kernel.org/all/83591d0b-2467-433c-bce0-5581298eb161@huawei.com/
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Link: https://lore.kernel.org/r/20250808053609.3237836-1-zhengqixing@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
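A userspace sketch of the lifecycle split described in the bullet list above; the object is a simplified stand-in for the queue kobject, not kernel code:

  #include <stdbool.h>
  #include <stdio.h>

  struct disk_obj {
      bool initialized;   /* done once, at allocation time        */
      bool registered;    /* sysfs-visible state, may toggle       */
  };

  static void disk_alloc(struct disk_obj *d)
  {
      d->initialized = true;   /* "kobject_init" equivalent: exactly once */
      d->registered = false;
  }

  static int disk_add(struct disk_obj *d, bool fail)
  {
      if (fail)
          return -1;           /* add_disk() failure: object stays initialized */
      d->registered = true;    /* "kobject_add" equivalent */
      return 0;
  }

  static void disk_del(struct disk_obj *d)
  {
      d->registered = false;   /* remove from sysfs, keep the object */
  }

  static void disk_release(struct disk_obj *d)
  {
      d->initialized = false;  /* final cleanup when the disk is released */
  }

  int main(void)
  {
      struct disk_obj d;

      disk_alloc(&d);
      if (disk_add(&d, true))          /* first attempt fails ...              */
          printf("add_disk failed, retrying\n");
      if (!disk_add(&d, false))        /* ... a retry no longer re-initializes */
          printf("add_disk ok (initialized=%d)\n", d.initialized);
      disk_del(&d);
      disk_release(&d);
      return 0;
  }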
2025-08-11  blk-cgroup: remove redundant __GFP_NOWARN  [Qianfeng Rong, 1 file, -3/+3]
Commit 16f5dfbc851b ("gfp: include __GFP_NOWARN in GFP_NOWAIT") made GFP_NOWAIT implicitly include __GFP_NOWARN. Therefore, explicit __GFP_NOWARN combined with GFP_NOWAIT (e.g., `GFP_NOWAIT | __GFP_NOWARN`) is now redundant. Let's clean up these redundant flags across subsystems. Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20250809141358.168781-1-rongqianfeng@vivo.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-11  block, bfq: remove redundant __GFP_NOWARN  [Qianfeng Rong, 1 file, -2/+1]
Commit 16f5dfbc851b ("gfp: include __GFP_NOWARN in GFP_NOWAIT") made GFP_NOWAIT implicitly include __GFP_NOWARN. Therefore, explicit __GFP_NOWARN combined with GFP_NOWAIT (e.g., `GFP_NOWAIT | __GFP_NOWARN`) is now redundant. Let's clean up these redundant flags across subsystems. Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com> Link: https://lore.kernel.org/r/20250811081135.374315-1-rongqianfeng@vivo.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-09  Merge tag 'block-6.17-20250808' of git://git.kernel.dk/linux  [Linus Torvalds, 12 files, -200/+287]
Pull more block updates from Jens Axboe:

 - MD pull request via Yu:
     - mddev null-ptr-dereference fix, by Erkun
     - md-cluster fail to remove the faulty disk regression fix, by Heming
     - minor cleanup, by Li Nan and Jinchao
     - mdadm lifetime regression fix reported by syzkaller, by Yu Kuai

 - MD pull request via Christoph:
     - add support for getting the FDP feature in fabrics passthru path (Nitesh Shetty)
     - add capability to connect to an administrative controller (Kamaljit Singh)
     - fix a leak on sgl setup error (Keith Busch)
     - initialize discovery subsys after debugfs is initialized (Mohamed Khalfella)
     - fix various comment typos (Bjorn Helgaas)
     - remove unneeded semicolons (Jiapeng Chong)

 - nvmet debugfs ordering issue fix

 - Fix UAF in the tag_set in zloop

 - Ensure sbitmap shallow depth covers entire set

 - Reduce lock roundtrips in io context lookup

 - Move scheduler tags alloc/free out of elevator and freeze lock, to fix some lockdep found issues

 - Improve robustness of queue limits checking

 - Fix a regression with IO priorities, if no io context exists

* tag 'block-6.17-20250808' of git://git.kernel.dk/linux: (26 commits)
  lib/sbitmap: make sbitmap_get_shallow() internal
  lib/sbitmap: convert shallow_depth from one word to the whole sbitmap
  nvmet: exit debugfs after discovery subsystem exits
  block, bfq: Reorder struct bfq_iocq_bfqq_data
  md: make rdev_addable usable for rcu mode
  md/raid1: remove struct pool_info and related code
  md/raid1: change r1conf->r1bio_pool to a pointer type
  block: ensure discard_granularity is zero when discard is not supported
  zloop: fix KASAN use-after-free of tag set
  block: Fix default IO priority if there is no IO context
  nvme: fix various comment typos
  nvme-auth: remove unneeded semicolon
  nvme-pci: fix leak on sgl setup error
  nvmet: initialize discovery subsys after debugfs is initialized
  nvme: add capability to connect to an administrative controller
  nvmet: add support for FDP in fabrics passthru path
  md: rename recovery_cp to resync_offset
  md/md-cluster: handle REMOVE message earlier
  md: fix create on open mddev lifetime regression
  block: fix potential deadlock while running nr_hw_queue update
  ...
2025-08-07  lib/sbitmap: convert shallow_depth from one word to the whole sbitmap  [Yu Kuai, 4 files, -43/+20]
Currently elevators will record internal 'async_depth' to throttle asynchronous requests, and they both calculate shallow_dpeth based on sb->shift, with the respect that sb->shift is the available tags in one word. However, sb->shift is not the availbale tags in the last word, see __map_depth: if (index == sb->map_nr - 1) return sb->depth - (index << sb->shift); For consequence, if the last word is used, more tags can be get than expected, for example, assume nr_requests=256 and there are four words, in the worst case if user set nr_requests=32, then the first word is the last word, and still use bits per word, which is 64, to calculate async_depth is wrong. One the ohter hand, due to cgroup qos, bfq can allow only one request to be allocated, and set shallow_dpeth=1 will still allow the number of words request to be allocated. Fix this problems by using shallow_depth to the whole sbitmap instead of per word, also change kyber, mq-deadline and bfq to follow this, a new helper __map_depth_with_shallow() is introduced to calculate available bits in each word. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250807032413.1469456-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-04  block, bfq: Reorder struct bfq_iocq_bfqq_data  [Christophe JAILLET, 1 file, -5/+5]
The size of struct bfq_iocq_bfqq_data can be reduced by moving a few fields around. On a x86_64, with allmodconfig, this shrinks the size from 144 to 128 bytes. The main benefit is to reduce the size of struct bfq_io_cq from 1360 to 1232. This structure is stored in a dedicated slab cache. So reducing its size improves cache usage. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/79394db1befaa658e8066b8e3348073ce27d9d26.1754119538.git.christophe.jaillet@wanadoo.fr Signed-off-by: Jens Axboe <axboe@kernel.dk>
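A toy demonstration of why grouping fields by alignment shrinks a structure; the members are invented and unrelated to bfq_iocq_bfqq_data, only the padding effect is the point:

  #include <stdio.h>

  struct padded {              /* small members interleaved with 8-byte members */
      char  a;                 /* 1 byte + 7 bytes padding */
      void *p;                 /* 8 bytes                  */
      char  b;                 /* 1 byte + 7 bytes padding */
      void *q;                 /* 8 bytes                  */
  };                           /* 32 bytes on x86_64       */

  struct packed_by_order {     /* same members, grouped by alignment */
      void *p;
      void *q;
      char  a;
      char  b;                 /* 2 bytes + 6 bytes tail padding */
  };                           /* 24 bytes on x86_64             */

  int main(void)
  {
      printf("padded:  %zu bytes\n", sizeof(struct padded));
      printf("ordered: %zu bytes\n", sizeof(struct packed_by_order));
      return 0;
  }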
2025-07-31  Merge tag 'mm-stable-2025-07-30-15-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm  [Linus Torvalds, 1 file, -2/+2]
Pull MM updates from Andrew Morton: "As usual, many cleanups. The below blurbiage describes 42 patchsets. 21 of those are partially or fully cleanup work. "cleans up", "cleanup", "maintainability", "rationalizes", etc. I never knew the MM code was so dirty. "mm: ksm: prevent KSM from breaking merging of new VMAs" (Lorenzo Stoakes) addresses an issue with KSM's PR_SET_MEMORY_MERGE mode: newly mapped VMAs were not eligible for merging with existing adjacent VMAs. "mm/damon: introduce DAMON_STAT for simple and practical access monitoring" (SeongJae Park) adds a new kernel module which simplifies the setup and usage of DAMON in production environments. "stop passing a writeback_control to swap/shmem writeout" (Christoph Hellwig) is a cleanup to the writeback code which removes a couple of pointers from struct writeback_control. "drivers/base/node.c: optimization and cleanups" (Donet Tom) contains largely uncorrelated cleanups to the NUMA node setup and management code. "mm: userfaultfd: assorted fixes and cleanups" (Tal Zussman) does some maintenance work on the userfaultfd code. "Readahead tweaks for larger folios" (Ryan Roberts) implements some tuneups for pagecache readahead when it is reading into order>0 folios. "selftests/mm: Tweaks to the cow test" (Mark Brown) provides some cleanups and consistency improvements to the selftests code. "Optimize mremap() for large folios" (Dev Jain) does that. A 37% reduction in execution time was measured in a memset+mremap+munmap microbenchmark. "Remove zero_user()" (Matthew Wilcox) expunges zero_user() in favor of the more modern memzero_page(). "mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes" (David Hildenbrand) addresses some warts which David noticed in the huge page code. These were not known to be causing any issues at this time. "mm/damon: use alloc_migrate_target() for DAMOS_MIGRATE_{HOT,COLD" (SeongJae Park) provides some cleanup and consolidation work in DAMON. "use vm_flags_t consistently" (Lorenzo Stoakes) uses vm_flags_t in places where we were inappropriately using other types. "mm/memfd: Reserve hugetlb folios before allocation" (Vivek Kasireddy) increases the reliability of large page allocation in the memfd code. "mm: Remove pXX_devmap page table bit and pfn_t type" (Alistair Popple) removes several now-unneeded PFN_* flags. "mm/damon: decouple sysfs from core" (SeongJae Park) implememnts some cleanup and maintainability work in the DAMON sysfs layer. "madvise cleanup" (Lorenzo Stoakes) does quite a lot of cleanup/maintenance work in the madvise() code. "madvise anon_name cleanups" (Vlastimil Babka) provides additional cleanups on top or Lorenzo's effort. "Implement numa node notifier" (Oscar Salvador) creates a standalone notifier for NUMA node memory state changes. Previously these were lumped under the more general memory on/offline notifier. "Make MIGRATE_ISOLATE a standalone bit" (Zi Yan) cleans up the pageblock isolation code and fixes a potential issue which doesn't seem to cause any problems in practice. "selftests/damon: add python and drgn based DAMON sysfs functionality tests" (SeongJae Park) adds additional drgn- and python-based DAMON selftests which are more comprehensive than the existing selftest suite. "Misc rework on hugetlb faulting path" (Oscar Salvador) fixes a rather obscure deadlock in the hugetlb fault code and follows that fix with a series of cleanups. 
"cma: factor out allocation logic from __cma_declare_contiguous_nid" (Mike Rapoport) rationalizes and cleans up the highmem-specific code in the CMA allocator. "mm/migration: rework movable_ops page migration (part 1)" (David Hildenbrand) provides cleanups and future-preparedness to the migration code. "mm/damon: add trace events for auto-tuned monitoring intervals and DAMOS quota" (SeongJae Park) adds some tracepoints to some DAMON auto-tuning code. "mm/damon: fix misc bugs in DAMON modules" (SeongJae Park) does that. "mm/damon: misc cleanups" (SeongJae Park) also does what it claims. "mm: folio_pte_batch() improvements" (David Hildenbrand) cleans up the large folio PTE batching code. "mm/damon/vaddr: Allow interleaving in migrate_{hot,cold} actions" (SeongJae Park) facilitates dynamic alteration of DAMON's inter-node allocation policy. "Remove unmap_and_put_page()" (Vishal Moola) provides a couple of page->folio conversions. "mm: per-node proactive reclaim" (Davidlohr Bueso) implements a per-node control of proactive reclaim - beyond the current memcg-based implementation. "mm/damon: remove damon_callback" (SeongJae Park) replaces the damon_callback interface with a more general and powerful damon_call()+damos_walk() interface. "mm/mremap: permit mremap() move of multiple VMAs" (Lorenzo Stoakes) implements a number of mremap cleanups (of course) in preparation for adding new mremap() functionality: newly permit the remapping of multiple VMAs when the user is specifying MREMAP_FIXED. It still excludes some specialized situations where this cannot be performed reliably. "drop hugetlb_free_pgd_range()" (Anthony Yznaga) switches some sparc hugetlb code over to the generic version and removes the thus-unneeded hugetlb_free_pgd_range(). "mm/damon/sysfs: support periodic and automated stats update" (SeongJae Park) augments the present userspace-requested update of DAMON sysfs monitoring files. Automatic update is now provided, along with a tunable to control the update interval. "Some randome fixes and cleanups to swapfile" (Kemeng Shi) does what is claims. "mm: introduce snapshot_page" (Luiz Capitulino and David Hildenbrand) provides (and uses) a means by which debug-style functions can grab a copy of a pageframe and inspect it locklessly without tripping over the races inherent in operating on the live pageframe directly. "use per-vma locks for /proc/pid/maps reads" (Suren Baghdasaryan) addresses the large contention issues which can be triggered by reads from that procfs file. Latencies are reduced by more than half in some situations. The series also introduces several new selftests for the /proc/pid/maps interface. "__folio_split() clean up" (Zi Yan) cleans up __folio_split()! "Optimize mprotect() for large folios" (Dev Jain) provides some quite large (>3x) speedups to mprotect() when dealing with large folios. "selftests/mm: reuse FORCE_READ to replace "asm volatile("" : "+r" (XXX));" and some cleanup" (wang lian) does some cleanup work in the selftests code. "tools/testing: expand mremap testing" (Lorenzo Stoakes) extends the mremap() selftest in several ways, including adding more checking of Lorenzo's recently added "permit mremap() move of multiple VMAs" feature. "selftests/damon/sysfs.py: test all parameters" (SeongJae Park) extends the DAMON sysfs interface selftest so that it tests all possible user-requested parameters. 
Rather than the present minimal subset" * tag 'mm-stable-2025-07-30-15-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (370 commits) MAINTAINERS: add missing headers to mempory policy & migration section MAINTAINERS: add missing file to cgroup section MAINTAINERS: add MM MISC section, add missing files to MISC and CORE MAINTAINERS: add missing zsmalloc file MAINTAINERS: add missing files to page alloc section MAINTAINERS: add missing shrinker files MAINTAINERS: move memremap.[ch] to hotplug section MAINTAINERS: add missing mm_slot.h file THP section MAINTAINERS: add missing interval_tree.c to memory mapping section MAINTAINERS: add missing percpu-internal.h file to per-cpu section mm/page_alloc: remove trace_mm_alloc_contig_migrate_range_info() selftests/damon: introduce _common.sh to host shared function selftests/damon/sysfs.py: test runtime reduction of DAMON parameters selftests/damon/sysfs.py: test non-default parameters runtime commit selftests/damon/sysfs.py: generalize DAMON context commit assertion selftests/damon/sysfs.py: generalize monitoring attributes commit assertion selftests/damon/sysfs.py: generalize DAMOS schemes commit assertion selftests/damon/sysfs.py: test DAMOS filters commitment selftests/damon/sysfs.py: generalize DAMOS scheme commit assertion selftests/damon/sysfs.py: test DAMOS destinations commitment ...
2025-07-31  block: ensure discard_granularity is zero when discard is not supported  [Christoph Hellwig, 1 file, -3/+10]
Documentation/ABI/stable/sysfs-block states:

  What: /sys/block/<disk>/queue/discard_granularity
  [...]
  A discard_granularity of 0 means that the device does not support discard functionality.

but this got broken when sorting out the block limits updates. Fix this by setting the discard_granularity limit to zero when the combined max_discard_sectors is zero. Fixes: 3c407dc723bb ("block: default the discard granularity to sector size") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20250731152228.873923-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-30  block: fix potential deadlock while running nr_hw_queue update  [Nilay Shroff, 5 files, -15/+89]
Move scheduler tags (sched_tags) allocation and deallocation outside both the ->elevator_lock and ->freeze_lock when updating nr_hw_queues. This change breaks the dependency chain from the percpu allocator lock to the elevator lock, helping to prevent potential deadlocks, as observed in the reported lockdep splat[1]. This commit introduces batch allocation and deallocation helpers for sched_tags, which are now used from within __blk_mq_update_nr_hw_queues routine while iterating through the tagset. With this change, all sched_tags memory management is handled entirely outside the ->elevator_lock and the ->freeze_lock context, thereby eliminating the lock dependency that could otherwise manifest during nr_hw_queues updates. [1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/ Reported-by: Stefan Haberland <sth@linux.ibm.com> Closes: https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/ Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250730074614.2537382-4-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>