aboutsummaryrefslogtreecommitdiffstats
path: root/block/blk-mq.c (follow)
AgeCommit message (Collapse)AuthorFilesLines
2017-02-02block: use same block debugfs directory for blk-mq and blktraceOmar Sandoval1-2/+0
When I added the blk-mq debugging information to debugfs, I didn't notice that blktrace also creates a "block" directory in debugfs. Make them use the same dentry, now created in the core block code. Based on a patch from Jens. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-27block: split scsi_request out of struct requestChristoph Hellwig1-10/+0
And require all drivers that want to support BLOCK_PC to allocate it as the first thing of their private data. To support this the legacy IDE and BSG code is switched to set cmd_size on their queues to let the block layer allocate the additional space. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-27blk-mq-sched: add flush insertion into blk_mq_sched_insert_request()Jens Axboe1-12/+13
Instead of letting the caller check this and handle the details of inserting a flush request, put the logic in the scheduler insertion function. This fixes direct flush insertion outside of the usual make_request_fn calls, like from dm via blk_insert_cloned_request(). Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-27block: add a op_is_flush helperChristoph Hellwig1-2/+2
This centralizes the checks for bios that needs to be go into the flush state machine. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-27blk-mq-sched: change ->dispatch_requests() to ->dispatch_request()Jens Axboe1-1/+1
When we invoke dispatch_requests(), the scheduler empties everything into the passed in list. This isn't always a good thing, since it means that we remove items that we could have potentially merged with. Change the function to dispatch single requests at the time. If we do that, we can backoff exactly at the point where the device can't consume more IO, and leave the rest with the scheduler for better merging and future dispatch decision making. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Tested-by: Hannes Reinecke <hare@suse.com>
2017-01-27blk-mq-sched: fix starvation for multiple hardware queues and shared tagsJens Axboe1-1/+2
If we have both multiple hardware queues and shared tag map between devices, we need to ensure that we propagate the hardware queue restart bit higher up. This is because we can get into a situation where we don't have any IO pending on a hardware queue, yet we fail getting a tag to start new IO. If that happens, it's not enough to mark the hardware queue as needing a restart, we need to bubble that up to the higher level queue as well. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Tested-by: Hannes Reinecke <hare@suse.com>
2017-01-27blk-mq: release driver tag on a requeue eventJens Axboe1-0/+16
We don't want to hold on to this resource when we have a scheduler attached. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Tested-by: Hannes Reinecke <hare@suse.com>
2017-01-27blk-mq: fix potential race in queue restart and driver tag allocationJens Axboe1-1/+9
Once we mark the queue as needing a restart, re-check if we can get a driver tag. This fixes a theoretical issue where the needed IO completes _after_ blk_mq_get_driver_tag() fails, but before we manage to set the restart bit. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Tested-by: Hannes Reinecke <hare@suse.com>
2017-01-27blk-mq: improve scheduler queue sync/async runningJens Axboe1-2/+4
We'll use the same criteria for whether we need to run the queue sync or async when we have a scheduler, as we do without one. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Tested-by: Hannes Reinecke <hare@suse.com>
2017-01-27blk-mq: create debugfs directory treeOmar Sandoval1-0/+2
In preparation for putting blk-mq debugging information in debugfs, create a directory tree mirroring the one in sysfs: # tree -d /sys/kernel/debug/block /sys/kernel/debug/block |-- nvme0n1 | `-- mq | |-- 0 | | `-- cpu0 | |-- 1 | | `-- cpu1 | |-- 2 | | `-- cpu2 | `-- 3 | `-- cpu3 `-- vda `-- mq `-- 0 |-- cpu0 |-- cpu1 |-- cpu2 `-- cpu3 Also add the scaffolding for the actual files that will go in here, either under the hardware queue or software queue directories. Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-26blk-mq: don't lose flags passed in to blk_mq_alloc_request()Jens Axboe1-3/+3
If we come in from blk_mq_alloc_requst() with NOWAIT set in flags, we must ensure that we don't later overwrite that in blk_mq_sched_get_request(). Initialize alloc_data->flags before passing it in. Reported-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-25blk-mq: only apply active queue tag throttling for driver tagsJens Axboe1-5/+8
If we have a scheduler attached, we have two sets of tags. We don't want to apply our active queue throttling for the scheduler side of tags, that only applies to driver tags since that's the resource we need to dispatch an IO. Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-20blk-mq: allow resize of scheduler requestsJens Axboe1-5/+14
Add support for growing the tags associated with a hardware queue, for the scheduler tags. Currently we only support resizing within the limits of the original depth, change that so we can grow it as well by allocating and replacing the existing scheduler tag set. This is similar to how we could increase the software queue depth with the legacy IO stack and schedulers. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Omar Sandoval <osandov@fb.com>
2017-01-19blk-mq: stop hardware queue in blk_mq_delay_queue()Jens Axboe1-0/+1
The run handler we register for the delayed work requires that the queue be stopped, yet we leave that up to the caller. Let's move it into blk_mq_delay_queue() itself, so that the API is sane. This fixes a stall with SCSI, where it calls blk_mq_delay_queue() without having stopped the queue. Hence the queue is never run. Reported-by: Hannes Reinecke <hare@suse.com> Fixes: 70f4db639c5b ("blk-mq: add blk_mq_delay_queue") Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-17blk-mq-sched: allow setting of default IO schedulerJens Axboe1-0/+8
Add Kconfig entries to manage what devices get assigned an MQ scheduler, and add a blk-mq flag for drivers to opt out of scheduling. The latter is useful for admin type queues that still allocate a blk-mq queue and tag set, but aren't use for normal IO. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Omar Sandoval <osandov@fb.com>
2017-01-17blk-mq-sched: add framework for MQ capable IO schedulersJens Axboe1-131/+187
This adds a set of hooks that intercepts the blk-mq path of allocating/inserting/issuing/completing requests, allowing us to develop a scheduler within that framework. We reuse the existing elevator scheduler API on the registration side, but augment that with the scheduler flagging support for the blk-mq interfce, and with a separate set of ops hooks for MQ devices. We split driver and scheduler tags, so we can run the scheduling independently of device queue depth. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Omar Sandoval <osandov@fb.com>
2017-01-17blk-mq: split tag ->rqs[] into twoJens Axboe1-7/+23
This is in preparation for having two sets of tags available. For that we need a static index, and a dynamically assignable one. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Omar Sandoval <osandov@fb.com>
2017-01-17blk-mq: add support for carrying internal tag information in blk_qc_tJens Axboe1-3/+8
No functional change in this patch, just in preparation for having two types of tags available to the block layer for a single request. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Omar Sandoval <osandov@fb.com>
2017-01-17blk-mq: abstract out helpers for allocating/freeing tag mapsJens Axboe1-43/+74
Prep patch for adding an extra tag map for scheduler requests. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Omar Sandoval <osandov@fb.com>
2017-01-17blk-mq-tag: cleanup the normal/reserved tag allocationJens Axboe1-1/+1
This is in preparation for having another tag set available. Cleanup the parameters, and allow passing in of tags for blk_mq_put_tag(). Signed-off-by: Jens Axboe <axboe@fb.com> [hch: even more cleanups] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Omar Sandoval <osandov@fb.com>
2017-01-17blk-mq: export some helpers we need to the scheduling frameworkJens Axboe1-18/+21
Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Omar Sandoval <osandov@fb.com>
2017-01-17blk-mq: un-export blk_mq_free_hctx_request()Jens Axboe1-3/+2
It's only used in blk-mq, kill it from the main exported header and kill the symbol export as well. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Omar Sandoval <osandov@fb.com>
2017-01-11blk-mq: make mq_ops a const pointerJens Axboe1-1/+1
We never change it, make that clear. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
2016-12-25ktime: Cleanup ktime_set() usageThomas Gleixner1-1/+1
ktime_set(S,N) was required for the timespec storage type and is still useful for situations where a Seconds and Nanoseconds part of a time value needs to be converted. For anything where the Seconds argument is 0, this is pointless and can be replaced with a simple assignment. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org>
2016-12-14blk-mq: Fix failed allocation path when mapping queuesGabriel Krisman Bertazi1-5/+21
In blk_mq_map_swqueue, there is a memory optimization that frees the tags of a queue that has gone unmapped. Later, if that hctx is remapped after another topology change, the tags need to be reallocated. If this allocation fails, a simple WARN_ON triggers, but the block layer ends up with an active hctx without any corresponding set of tags. Then, any income IO to that hctx can trigger an Oops. I can reproduce it consistently by running IO, flipping CPUs on and off and eventually injecting a memory allocation failure in that path. In the fix below, if the system experiences a failed allocation of any hctx's tags, we remap all the ctxs of that queue to the hctx_0, which should always keep it's tags. There is a minor performance hit, since our mapping just got worse after the error path, but this is the simplest solution to handle this error path. The performance hit will disappear after another successful remap. I considered dropping the memory optimization all together, but it seemed a bad trade-off to handle this very specific error case. This should apply cleanly on top of Jens' for-next branch. The Oops is the one below: SP (3fff935ce4d0) is in userspace 1:mon> e cpu 0x1: Vector: 300 (Data Access) at [c000000fe99eb110] pc: c0000000005e868c: __sbitmap_queue_get+0x2c/0x180 lr: c000000000575328: __bt_get+0x48/0xd0 sp: c000000fe99eb390 msr: 900000010280b033 dar: 28 dsisr: 40000000 current = 0xc000000fe9966800 paca = 0xc000000007e80300 softe: 0 irq_happened: 0x01 pid = 11035, comm = aio-stress Linux version 4.8.0-rc6+ (root@bean) (gcc version 5.4.0 20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.2) ) #3 SMP Mon Oct 10 20:16:53 CDT 2016 1:mon> s [c000000fe99eb3d0] c000000000575328 __bt_get+0x48/0xd0 [c000000fe99eb400] c000000000575838 bt_get.isra.1+0x78/0x2d0 [c000000fe99eb480] c000000000575cb4 blk_mq_get_tag+0x44/0x100 [c000000fe99eb4b0] c00000000056f6f4 __blk_mq_alloc_request+0x44/0x220 [c000000fe99eb500] c000000000570050 blk_mq_map_request+0x100/0x1f0 [c000000fe99eb580] c000000000574650 blk_mq_make_request+0xf0/0x540 [c000000fe99eb640] c000000000561c44 generic_make_request+0x144/0x230 [c000000fe99eb690] c000000000561e00 submit_bio+0xd0/0x200 [c000000fe99eb740] c0000000003ef740 ext4_io_submit+0x90/0xb0 [c000000fe99eb770] c0000000003e95d8 ext4_writepages+0x588/0xdd0 [c000000fe99eb910] c00000000025a9f0 do_writepages+0x60/0xc0 [c000000fe99eb940] c000000000246c88 __filemap_fdatawrite_range+0xf8/0x180 [c000000fe99eb9e0] c000000000246f90 filemap_write_and_wait_range+0x70/0xf0 [c000000fe99eba20] c0000000003dd844 ext4_sync_file+0x214/0x540 [c000000fe99eba80] c000000000364718 vfs_fsync_range+0x78/0x130 [c000000fe99ebad0] c0000000003dd46c ext4_file_write_iter+0x35c/0x430 [c000000fe99ebb90] c00000000038c280 aio_run_iocb+0x3b0/0x450 [c000000fe99ebce0] c00000000038dc28 do_io_submit+0x368/0x730 [c000000fe99ebe30] c000000000009404 system_call+0x38/0xec Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com> Cc: Brian King <brking@linux.vnet.ibm.com> Cc: Douglas Miller <dougmill@linux.vnet.ibm.com> Cc: linux-block@vger.kernel.org Cc: linux-scsi@vger.kernel.org Reviewed-by: Douglas Miller <dougmill@linux.vnet.ibm.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-12-14blk-mq: Avoid memory reclaim when remapping queuesGabriel Krisman Bertazi1-3/+3
While stressing memory and IO at the same time we changed SMT settings, we were able to consistently trigger deadlocks in the mm system, which froze the entire machine. I think that under memory stress conditions, the large allocations performed by blk_mq_init_rq_map may trigger a reclaim, which stalls waiting on the block layer remmaping completion, thus deadlocking the system. The trace below was collected after the machine stalled, waiting for the hotplug event completion. The simplest fix for this is to make allocations in this path non-reclaimable, with GFP_NOIO. With this patch, We couldn't hit the issue anymore. This should apply on top of Jens's for-next branch cleanly. Changes since v1: - Use GFP_NOIO instead of GFP_NOWAIT. Call Trace: [c000000f0160aaf0] [c000000f0160ab50] 0xc000000f0160ab50 (unreliable) [c000000f0160acc0] [c000000000016624] __switch_to+0x2e4/0x430 [c000000f0160ad20] [c000000000b1a880] __schedule+0x310/0x9b0 [c000000f0160ae00] [c000000000b1af68] schedule+0x48/0xc0 [c000000f0160ae30] [c000000000b1b4b0] schedule_preempt_disabled+0x20/0x30 [c000000f0160ae50] [c000000000b1d4fc] __mutex_lock_slowpath+0xec/0x1f0 [c000000f0160aed0] [c000000000b1d678] mutex_lock+0x78/0xa0 [c000000f0160af00] [d000000019413cac] xfs_reclaim_inodes_ag+0x33c/0x380 [xfs] [c000000f0160b0b0] [d000000019415164] xfs_reclaim_inodes_nr+0x54/0x70 [xfs] [c000000f0160b0f0] [d0000000194297f8] xfs_fs_free_cached_objects+0x38/0x60 [xfs] [c000000f0160b120] [c0000000003172c8] super_cache_scan+0x1f8/0x210 [c000000f0160b190] [c00000000026301c] shrink_slab.part.13+0x21c/0x4c0 [c000000f0160b2d0] [c000000000268088] shrink_zone+0x2d8/0x3c0 [c000000f0160b380] [c00000000026834c] do_try_to_free_pages+0x1dc/0x520 [c000000f0160b450] [c00000000026876c] try_to_free_pages+0xdc/0x250 [c000000f0160b4e0] [c000000000251978] __alloc_pages_nodemask+0x868/0x10d0 [c000000f0160b6f0] [c000000000567030] blk_mq_init_rq_map+0x160/0x380 [c000000f0160b7a0] [c00000000056758c] blk_mq_map_swqueue+0x33c/0x360 [c000000f0160b820] [c000000000567904] blk_mq_queue_reinit+0x64/0xb0 [c000000f0160b850] [c00000000056a16c] blk_mq_queue_reinit_notify+0x19c/0x250 [c000000f0160b8a0] [c0000000000f5d38] notifier_call_chain+0x98/0x100 [c000000f0160b8f0] [c0000000000c5fb0] __cpu_notify+0x70/0xe0 [c000000f0160b930] [c0000000000c63c4] notify_prepare+0x44/0xb0 [c000000f0160b9b0] [c0000000000c52f4] cpuhp_invoke_callback+0x84/0x250 [c000000f0160ba10] [c0000000000c570c] cpuhp_up_callbacks+0x5c/0x120 [c000000f0160ba60] [c0000000000c7cb8] _cpu_up+0xf8/0x1d0 [c000000f0160bac0] [c0000000000c7eb0] do_cpu_up+0x120/0x150 [c000000f0160bb40] [c0000000006fe024] cpu_subsys_online+0x64/0xe0 [c000000f0160bb90] [c0000000006f5124] device_online+0xb4/0x120 [c000000f0160bbd0] [c0000000006f5244] online_store+0xb4/0xc0 [c000000f0160bc20] [c0000000006f0a68] dev_attr_store+0x68/0xa0 [c000000f0160bc60] [c0000000003ccc30] sysfs_kf_write+0x80/0xb0 [c000000f0160bca0] [c0000000003cbabc] kernfs_fop_write+0x17c/0x250 [c000000f0160bcf0] [c00000000030fe6c] __vfs_write+0x6c/0x1e0 [c000000f0160bd90] [c000000000311490] vfs_write+0xd0/0x270 [c000000f0160bde0] [c0000000003131fc] SyS_write+0x6c/0x110 [c000000f0160be30] [c000000000009204] system_call+0x38/0xec Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com> Cc: Brian King <brking@linux.vnet.ibm.com> Cc: Douglas Miller <dougmill@linux.vnet.ibm.com> Cc: linux-block@vger.kernel.org Cc: linux-scsi@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>
2016-12-09blk-mq: abstract out blk_mq_dispatch_rq_list() helperJens Axboe1-38/+47
Takes a list of requests, and dispatches it. Moves any residual requests to the dispatch list. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Hannes Reinecke <hare@suse.com>
2016-12-09blk-mq: add blk_mq_start_stopped_hw_queue()Jens Axboe1-7/+12
We have a variant for all hardware queues, but not one for a single hardware queue. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Hannes Reinecke <hare@suse.com>
2016-12-05blk-mq: blk_account_io_start() takes a boolJens Axboe1-1/+1
Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
2016-11-29blk-mq: Drop explicit timeout sync in hotplugGabriel Krisman Bertazi1-8/+1
After commit 287922eb0b18 ("block: defer timeouts to a workqueue"), deleting the timeout work after freezing the queue shouldn't be necessary, since the synchronization is already enforced by the acquisition of a q_usage_counter reference in blk_mq_timeout_work. Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com> Reviewed-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-17blk-mq: make the polling code adaptiveJens Axboe1-3/+64
The previous commit introduced the hybrid sleep/poll mode. Take that one step further, and use the completion latencies to automatically sleep for half the mean completion time. This is a good approximation. This changes the 'io_poll_delay' sysfs file a bit to expose the various options. Depending on the value, the polling code will behave differently: -1 Never enter hybrid sleep mode 0 Use half of the completion mean for the sleep delay >0 Use this specific value as the sleep delay Signed-off-by: Jens Axboe <axboe@fb.com> Tested-By: Stephen Bates <sbates@raithlin.com> Reviewed-By: Stephen Bates <sbates@raithlin.com>
2016-11-17blk-mq: implement hybrid poll mode for sync O_DIRECTJens Axboe1-0/+50
This patch enables a hybrid polling mode. Instead of polling after IO submission, we can induce an artificial delay, and then poll after that. For example, if the IO is presumed to complete in 8 usecs from now, we can sleep for 4 usecs, wake up, and then do our polling. This still puts a sleep/wakeup cycle in the IO path, but instead of the wakeup happening after the IO has completed, it'll happen before. With this hybrid scheme, we can achieve big latency reductions while still using the same (or less) amount of CPU. Signed-off-by: Jens Axboe <axboe@fb.com> Tested-By: Stephen Bates <sbates@raithlin.com> Reviewed-By: Stephen Bates <sbates@raithlin.com>
2016-11-16block: deal with stale req count of plug listMing Lei1-0/+7
In both legacy and mq path, req count of plug list is computed before allocating request, so the number can be stale when falling back to slept allocation, also the new introduced wbt can sleep too. This patch deals with the case by checking if plug list becomes empty, and fixes the KASAN report of 'BUG: KASAN: stack-out-of-bounds' which is introduced by Shaohua's patches of dispatching big request. Fixes: 600271d900002(blk-mq: immediately dispatch big size request) Fixes: 50d24c34403c6(block: immediately dispatch big size request) Cc: Shaohua Li <shli@fb.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-11block: move poll code to blk-mqJens Axboe1-0/+54
The poll code is blk-mq specific, let's move it to blk-mq.c. This is a prep patch for improving the polling code. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-11-11blk-mq: blk_mq_try_issue_directly() should lookup hardware queueJens Axboe1-4/+4
A previous commit changed this to pass in the hardware queue, but it was using the wrong hardware queue. Hence a request that was allocated on one hardware queue ended up being issued on another one, and that caused IO timeouts and oopses on some drivers. Since the request holds hardware queue private resources, like a tag, we can't just issue it on a different hardware queue. Fixes: 2253efc850c4 ("blk-mq: Move more code into blk_mq_direct_issue_request()") Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-10block: hook up writeback throttlingJens Axboe1-2/+24
Enable throttling of buffered writeback to make it a lot more smooth, and has way less impact on other system activity. Background writeback should be, by definition, background activity. The fact that we flush huge bundles of it at the time means that it potentially has heavy impacts on foreground workloads, which isn't ideal. We can't easily limit the sizes of writes that we do, since that would impact file system layout in the presence of delayed allocation. So just throttle back buffered writeback, unless someone is waiting for it. The algorithm for when to throttle takes its inspiration in the CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors the minimum latencies of requests over a window of time. In that window of time, if the minimum latency of any request exceeds a given target, then a scale count is incremented and the queue depth is shrunk. The next monitoring window is shrunk accordingly. Unlike CoDel, if we hit a window that exhibits good behavior, then we simply increment the scale count and re-calculate the limits for that scale value. This prevents us from oscillating between a close-to-ideal value and max all the time, instead remaining in the windows where we get good behavior. Unlike CoDel, blk-wb allows the scale count to to negative. This happens if we primarily have writes going on. Unlike positive scale counts, this doesn't change the size of the monitoring window. When the heavy writers finish, blk-bw quickly snaps back to it's stable state of a zero scale count. The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency target to me met. It defaults to 2 msec for non-rotational storage, and 75 msec for rotational storage. Setting this value to '0' disables blk-wb. Generally, a user would not have to touch this setting. We don't enable WBT on devices that are managed with CFQ, and have a non-root block cgroup attached. If we have a proportional share setup on this particular disk, then the wbt throttling will interfere with that. We don't have a strong need for wbt for that case, since we will rely on CFQ doing that for us. Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-10block: add scalable completion tracking of requestsJens Axboe1-0/+25
For legacy block, we simply track them in the request queue. For blk-mq, we track them on a per-sw queue basis, which we can then sum up through the hardware queues and finally to a per device state. The stats are tracked in, roughly, 0.1s interval windows. Add sysfs files to display the stats. The feature is off by default, to avoid any extra overhead. In-kernel users of it can turn it on by setting QUEUE_FLAG_STATS in the queue flags. We currently don't turn it on if someone just reads any of the stats files, that is something we could add as well. Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-06blk-mq: Always schedule hctx->next_cpuGabriel Krisman Bertazi1-3/+1
Commit 0e87e58bf60e ("blk-mq: improve warning for running a queue on the wrong CPU") attempts to avoid triggering the WARN_ON in __blk_mq_run_hw_queue when the expected CPU is dead. Problem is, in the last batch execution before round robin, blk_mq_hctx_next_cpu can schedule a dead CPU and also update next_cpu to the next alive CPU in the mask, which will trigger the WARN_ON despite the previous workaround. The following patch fixes this scenario by always scheduling the value in hctx->next_cpu. This changes the moment when we round-robin the CPU running the hctx, but it really doesn't matter, since it still executes BLK_MQ_CPU_WORK_BATCH times in a row before switching to another CPU. Fixes: 0e87e58bf60e ("blk-mq: improve warning for running a queue on the wrong CPU") Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-03blk-mq: immediately dispatch big size requestShaohua Li1-1/+6
This is corresponding part for blk-mq. Disk with multiple hardware queues doesn't need this as we only hold 1 request at most. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-02blk-mq: Add a kick_requeue_list argument to blk_mq_requeue_request()Bart Van Assche1-3/+7
Most blk_mq_requeue_request() and blk_mq_add_to_requeue_list() calls are followed by kicking the requeue list. Hence add an argument to these two functions that allows to kick the requeue list. This was proposed by Christoph Hellwig. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-02blk-mq: Introduce blk_mq_quiesce_queue()Bart Van Assche1-7/+64
blk_mq_quiesce_queue() waits until ongoing .queue_rq() invocations have finished. This function does *not* wait until all outstanding requests have finished (this means invocation of request.end_io()). The algorithm used by blk_mq_quiesce_queue() is as follows: * Hold either an RCU read lock or an SRCU read lock around .queue_rq() calls. The former is used if .queue_rq() does not block and the latter if .queue_rq() may block. * blk_mq_quiesce_queue() first calls blk_mq_stop_hw_queues() followed by synchronize_srcu() or synchronize_rcu(). The latter call waits for .queue_rq() invocations that started before blk_mq_quiesce_queue() was called. * The blk_mq_hctx_stopped() calls that control whether or not .queue_rq() will be called are called with the (S)RCU read lock held. This is necessary to avoid race conditions against blk_mq_quiesce_queue(). Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Ming Lei <tom.leiming@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-02blk-mq: Remove blk_mq_cancel_requeue_work()Bart Van Assche1-6/+0
Since blk_mq_requeue_work() no longer restarts stopped queues canceling requeue work is no longer needed to prevent that a stopped queue would be restarted. Hence remove this function. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Keith Busch <keith.busch@intel.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-02blk-mq: Avoid that requeueing starts stopped queuesBart Van Assche1-5/+1
Since blk_mq_requeue_work() starts stopped queues and since execution of this function can be scheduled after a queue has been stopped it is not possible to stop queues without using an additional state variable to track whether or not the queue has been stopped. Hence modify blk_mq_requeue_work() such that it does not start stopped queues. My conclusion after a review of the blk_mq_stop_hw_queues() and blk_mq_{delay_,}kick_requeue_list() callers is as follows: * In the dm driver starting and stopping queues should only happen if __dm_suspend() or __dm_resume() is called and not if the requeue list is processed. * In the SCSI core queue stopping and starting should only be performed by the scsi_internal_device_block() and scsi_internal_device_unblock() functions but not by any other function. Although the blk_mq_stop_hw_queue() call in scsi_queue_rq() may help to reduce CPU load if a LLD queue is full, figuring out whether or not a queue should be restarted when requeueing a command would require to introduce additional locking in scsi_mq_requeue_cmd() to avoid a race with scsi_internal_device_block(). Avoid this complexity by removing the blk_mq_stop_hw_queue() call from scsi_queue_rq(). * In the NVMe core only the functions that call blk_mq_start_stopped_hw_queues() explicitly should start stopped queues. * A blk_mq_start_stopped_hwqueues() call must be added in the xen-blkfront driver in its blkif_recover() function. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Roger Pau Monné <roger.pau@citrix.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: James Bottomley <jejb@linux.vnet.ibm.com> Cc: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-02blk-mq: Move more code into blk_mq_direct_issue_request()Bart Van Assche1-8/+10
Move the "hctx stopped" test and the insert request calls into blk_mq_direct_issue_request(). Rename that function into blk_mq_try_issue_directly() to reflect its new semantics. Pass the hctx pointer to that function instead of looking it up a second time. These changes avoid that code has to be duplicated in the next patch. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-02blk-mq: Introduce blk_mq_queue_stopped()Bart Van Assche1-0/+20
The function blk_queue_stopped() allows to test whether or not a traditional request queue has been stopped. Introduce a helper function that allows block drivers to query easily whether or not one or more hardware contexts of a blk-mq queue have been stopped. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-02blk-mq: Introduce blk_mq_hctx_stopped()Bart Van Assche1-6/+6
Multiple functions test the BLK_MQ_S_STOPPED bit so introduce a helper function that performs this test. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Ming Lei <tom.leiming@gmail.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-02blk-mq: Do not invoke .queue_rq() for a stopped queueBart Van Assche1-3/+3
The meaning of the BLK_MQ_S_STOPPED flag is "do not call .queue_rq()". Hence modify blk_mq_make_request() such that requests are queued instead of issued if a queue has been stopped. Reported-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <tom.leiming@gmail.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Cc: <stable@vger.kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-28block: better op and flags encodingChristoph Hellwig1-17/+11
Now that we don't need the common flags to overflow outside the range of a 32-bit type we can encode them the same way for both the bio and request fields. This in addition allows us to place the operation first (and make some room for more ops while we're at it) and to stop having to shift around the operation values. In addition this allows passing around only one value in the block layer instead of two (and eventuall also in the file systems, but we can do that later) and thus clean up a lot of code. Last but not least this allows decreasing the size of the cmd_flags field in struct request to 32-bits. Various functions passing this value could also be updated, but I'd like to avoid the churn for now. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-28block: split out request-only flags into a new namespaceChristoph Hellwig1-10/+9
A lot of the REQ_* flags are only used on struct requests, and only of use to the block layer and a few drivers that dig into struct request internals. This patch adds a new req_flags_t rq_flags field to struct request for them, and thus dramatically shrinks the number of common requests. It also removes the unfortunate situation where we have to fit the fields from the same enum into 32 bits for struct bio and 64 bits for struct request. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaun Tancheff <shaun.tancheff@seagate.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-27blk-mq: get rid of confusing blk_map_ctx structureJens Axboe1-13/+5
We can just use struct blk_mq_alloc_data - it has a few more members, but we allocate it further down the stack anyway. So this cleans up the code, and reduces the stack overhead a bit. Signed-off-by: Jens Axboe <axboe@fb.com>