aboutsummaryrefslogtreecommitdiffstats
path: root/block (follow)
AgeCommit message (Collapse)AuthorFilesLines
2014-09-25blk-mq: handle failure path for initializing hctxMing Lei1-45/+69
Failure of initializing one hctx isn't handled, so this patch introduces blk_mq_init_hctx() and its pair to handle it explicitly. Also this patch makes code cleaner. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-09-24blk-mq, percpu_ref: start q->mq_usage_counter in atomic modeTejun Heo3-3/+20
blk-mq uses percpu_ref for its usage counter which tracks the number of in-flight commands and used to synchronously drain the queue on freeze. percpu_ref shutdown takes measureable wallclock time as it involves a sched RCU grace period. This means that draining a blk-mq takes measureable wallclock time. One would think that this shouldn't matter as queue shutdown should be a rare event which takes place asynchronously w.r.t. userland. Unfortunately, SCSI probing involves synchronously setting up and then tearing down a lot of request_queues back-to-back for non-existent LUNs. This means that SCSI probing may take above ten seconds when scsi-mq is used. [ 0.949892] scsi host0: Virtio SCSI HBA [ 1.007864] scsi 0:0:0:0: Direct-Access QEMU QEMU HARDDISK 1.1. PQ: 0 ANSI: 5 [ 1.021299] scsi 0:0:1:0: Direct-Access QEMU QEMU HARDDISK 1.1. PQ: 0 ANSI: 5 [ 1.520356] tsc: Refined TSC clocksource calibration: 2491.910 MHz <stall> [ 16.186549] sd 0:0:0:0: Attached scsi generic sg0 type 0 [ 16.190478] sd 0:0:1:0: Attached scsi generic sg1 type 0 [ 16.194099] osd: LOADED open-osd 0.2.1 [ 16.203202] sd 0:0:0:0: [sda] 31457280 512-byte logical blocks: (16.1 GB/15.0 GiB) [ 16.208478] sd 0:0:0:0: [sda] Write Protect is off [ 16.211439] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA [ 16.218771] sd 0:0:1:0: [sdb] 31457280 512-byte logical blocks: (16.1 GB/15.0 GiB) [ 16.223264] sd 0:0:1:0: [sdb] Write Protect is off [ 16.225682] sd 0:0:1:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA This is also the reason why request_queues start in bypass mode which is ended on blk_register_queue() as shutting down a fully functional queue also involves a RCU grace period and the queues for non-existent SCSI devices never reach registration. blk-mq basically needs to do the same thing - start the mq in a degraded mode which is faster to shut down and then make it fully functional only after the queue reaches registration. percpu_ref recently grew facilities to force atomic operation until explicitly switched to percpu mode, which can be used for this purpose. This patch makes blk-mq initialize q->mq_usage_counter in atomic mode and switch it to percpu mode only once blk_register_queue() is reached. Note that this issue was previously worked around by 0a30288da1ae ("blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe") for v3.17. The temp fix was reverted in preparation of adding persistent atomic mode to percpu_ref by 9eca80461a45 ("Revert "blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe""). This patch and the prerequisite percpu_ref changes will be merged during v3.18 devel cycle. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Christoph Hellwig <hch@infradead.org> Link: http://lkml.kernel.org/g/20140919113815.GA10791@lst.de Fixes: add703fda981 ("blk-mq: use percpu_ref for mq usage count") Reviewed-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org>
2014-09-24percpu_ref: add PERCPU_REF_INIT_* flagsTejun Heo1-1/+1
With the recent addition of percpu_ref_reinit(), percpu_ref now can be used as a persistent switch which can be turned on and off repeatedly where turning off maps to killing the ref and waiting for it to drain; however, there currently isn't a way to initialize a percpu_ref in its off (killed and drained) state, which can be inconvenient for certain persistent switch use cases. Similarly, percpu_ref_switch_to_atomic/percpu() allow dynamic selection of operation mode; however, currently a newly initialized percpu_ref is always in percpu mode making it impossible to avoid the latency overhead of switching to atomic mode. This patch adds @flags to percpu_ref_init() and implements the following flags. * PERCPU_REF_INIT_ATOMIC : start ref in atomic mode * PERCPU_REF_INIT_DEAD : start ref killed and drained These flags should be able to serve the above two use cases. v2: target_core_tpg.c conversion was missing. Fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org>
2014-09-24Revert "blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe"Tejun Heo1-10/+1
This reverts commit 0a30288da1aec914e158c2d7a3482a85f632750f, which was a temporary fix for SCSI blk-mq stall issue. The following patches will fix the issue properly by introducing atomic mode to percpu_ref. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@lst.de>
2014-09-24Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block into for-3.18Tejun Heo10-82/+194
This is to receive 0a30288da1ae ("blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe") which implements __percpu_ref_kill_expedited() to work around SCSI blk-mq stall. The commit reverted and patches to implement proper fix will be added. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@lst.de>
2014-09-24blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probeTejun Heo1-1/+10
blk-mq uses percpu_ref for its usage counter which tracks the number of in-flight commands and used to synchronously drain the queue on freeze. percpu_ref shutdown takes measureable wallclock time as it involves a sched RCU grace period. This means that draining a blk-mq takes measureable wallclock time. One would think that this shouldn't matter as queue shutdown should be a rare event which takes place asynchronously w.r.t. userland. Unfortunately, SCSI probing involves synchronously setting up and then tearing down a lot of request_queues back-to-back for non-existent LUNs. This means that SCSI probing may take more than ten seconds when scsi-mq is used. This will be properly fixed by implementing a mechanism to keep q->mq_usage_counter in atomic mode till genhd registration; however, that involves rather big updates to percpu_ref which is difficult to apply late in the devel cycle (v3.17-rc6 at the moment). As a stop-gap measure till the proper fix can be implemented in the next cycle, this patch introduces __percpu_ref_kill_expedited() and makes blk_mq_freeze_queue() use it. This is heavy-handed but should work for testing the experimental SCSI blk-mq implementation. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Christoph Hellwig <hch@infradead.org> Link: http://lkml.kernel.org/g/20140919113815.GA10791@lst.de Fixes: add703fda981 ("blk-mq: use percpu_ref for mq usage count") Cc: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Tested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-09-22genhd: fix leftover might_sleep() in blk_free_devt()Jens Axboe1-2/+0
Commit 2da78092 changed the locking from a mutex to a spinlock, so we now longer sleep in this context. But there was a leftover might_sleep() in there, which now triggers since we do the final free from an RCU callback. Get rid of it. Reported-by: Pontus Fuchs <pontus.fuchs@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-09-22block: fix blk_abort_request on blk-mqChristoph Hellwig3-2/+8
Signed-off-by: Christoph Hellwig <hch@lst.de> Moved blk_mq_rq_timed_out() definition to the private blk-mq.h header. Signed-off-by: Jens Axboe <axboe@fb.com>
2014-09-22blk-timeout: fix blk_add_timerMing Lei1-1/+2
Commit 8cb34819cdd5d(blk-mq: unshared timeout handler) introduces blk-mq's own timeout handler, and removes following line: blk_queue_rq_timed_out(q, blk_mq_rq_timed_out); which then causes blk_add_timer() to bypass adding the timer, since blk-mq no longer has q->rq_timed_out_fn defined. This patch fixes the problem by bypassing the check for blk-mq, so that both request deadlines are still set and the rolling timer updated. Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-09-22blk-mq: limit memory consumption if a crash dump is activeJens Axboe1-0/+11
It's not uncommon for crash dump kernels to be limited to 128MB or something low in that area. This is normally not a problem for devices as we don't use that much memory, but for some shared SCSI setups with huge queue depths, it can potentially fill most of memory with tons of request allocations. blk-mq does scale back when it fails to allocate memory, but it scales back just enough so that blk-mq succeeds. This could still leave the system with not enough memory to make any real progress. Check if we are in a kdump environment and limit the hardware queues and tag depth. Signed-off-by: Jens Axboe <axboe@fb.com>
2014-09-22blk-mq: remove unnecessary blk_clear_rq_complete()Ming Lei2-3/+0
This patch removes two unnecessary blk_clear_rq_complete(), the REQ_ATOM_COMPLETE flag is cleared inside blk_mq_start_request(), so: - The blk_clear_rq_complete() in blk_flush_restore_request() needn't because the request will be freed later, and clearing it here may open a small race window with timeout. - The blk_clear_rq_complete() in blk_mq_requeue_request() isn't necessary too, even though REQ_ATOM_STARTED is cleared in __blk_mq_requeue_request(), in theory it still may cause a small race window with timeout since the two clear_bit() may be reordered. Signed-off-by: Ming Lei <ming.lei@canoical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-09-22blk-mq: pass a reserved argument to the timeout handlerChristoph Hellwig1-3/+3
Allow blk-mq to pass an argument to the timeout handler to indicate if we're timing out a reserved or regular command. For many drivers those need to be handled different. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-09-22blk-mq: unshared timeout handlerChristoph Hellwig3-24/+39
Duplicate the (small) timeout handler in blk-mq so that we can pass arguments more easily to the driver timeout handler. This enables the next patch. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-09-22blk-mq: fix and simplify tag iteration for the timeout handlerChristoph Hellwig2-86/+43
Don't do a kmalloc from timer to handle timeouts, chances are we could be under heavy load or similar and thus just miss out on the timeouts. Fortunately it is very easy to just iterate over all in use tags, and doing this properly actually cleans up the blk_mq_busy_iter API as well, and prepares us for the next patch by passing a reserved argument to the iterator. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-09-22blk-mq: rename blk_mq_end_io to blk_mq_end_requestChristoph Hellwig2-10/+10
Now that we've changed the driver API on the submission side use the opportunity to fix up the name on the completion side to fit into the general scheme. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-09-22blk-mq: call blk_mq_start_request from ->queue_rqChristoph Hellwig1-7/+6
When we call blk_mq_start_request from the core blk-mq code before calling into ->queue_rq there is a racy window where the timeout handler can hit before we've fully set up the driver specific part of the command. Move the call to blk_mq_start_request into the driver so the driver can start the request only once it is fully set up. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-09-22blk-mq: remove REQ_ENDChristoph Hellwig1-17/+5
Pass an explicit parameter for the last request in a batch to ->queue_rq instead of using a request flag. Besides being a cleaner and non-stateful interface this is also required for the next patch, which fixes the blk-mq I/O submission code to not start a time too early. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-09-22Merge branch 'for-linus' into for-3.18/coreJens Axboe2-11/+18
Moving patches from for-linus to 3.18 instead, pull in this changes that will go to Linus today.
2014-09-22blk-mq: use blk_mq_start_hw_queues() when running requeue workJens Axboe1-1/+5
When requests are retried due to hw or sw resource shortages, we often stop the associated hardware queue. So ensure that we restart the queues when running the requeue work, otherwise the queue run will be a no-op. Signed-off-by: Jens Axboe <axboe@fb.com>
2014-09-22blk-mq: fix potential oops on out-of-memory in __blk_mq_alloc_rq_maps()Jens Axboe1-1/+0
__blk_mq_alloc_rq_maps() can be invoked multiple times, if we scale back the queue depth if we are low on memory. So don't clear set->tags when we fail, this is handled directly in the parent function, blk_mq_alloc_tag_set(). Reported-by: Robert Elliott <Elliott@hp.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-09-22blk-mq: avoid infinite recursion with the FUA flagChristoph Hellwig2-8/+4
We should not insert requests into the flush state machine from blk_mq_insert_request. All incoming flush requests come through blk_{m,s}q_make_request and are handled there, while blk_execute_rq_nowait should only be called for BLOCK_PC requests. All other callers deal with requests that already went through the flush statemchine and shouldn't be reinserted into it. Reported-by: Robert Elliott <Elliott@hp.com> Debugged-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-09-22blk-mq: Avoid race condition with uninitialized requestsDavid Hildenbrand1-1/+3
This patch should fix the bug reported in https://lkml.org/lkml/2014/9/11/249. We have to initialize at least the atomic_flags and the cmd_flags when allocating storage for the requests. Otherwise blk_mq_timeout_check() might dereference uninitialized pointers when racing with the creation of a request. Also move the reset of cmd_flags for the initializing code to the point where a request is freed. So we will never end up with pending flush request indicators that might trigger dereferences of invalid pointers in blk_mq_timeout_check(). Cc: stable@vger.kernel.org Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Reported-by: Paulo De Rezende Pinatti <ppinatti@linux.vnet.ibm.com> Tested-by: Paulo De Rezende Pinatti <ppinatti@linux.vnet.ibm.com> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-09-22blk-mq: request deadline must be visible before marking rq as startedJens Axboe1-0/+6
When we start the request, we set the deadline and flip the bits marking the request as started and non-complete. However, it's important that the deadline store is ordered before flipping the bits, otherwise we could have a small window where the request is marked started but with an invalid deadline. This can confuse the timeout handling. Suggested-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-09-11Merge branch 'for-linus' into for-3.18/coreJens Axboe9-71/+163
A bit of churn on the for-linus side that would be nice to have in the core bits for 3.18, so pull it in to catch us up and make forward progress easier. Signed-off-by: Jens Axboe <axboe@fb.com> Conflicts: block/scsi_ioctl.c
2014-09-10blk-mq: scale depth and rq map appropriate if low on memoryJens Axboe1-19/+69
If we are running in a kdump environment, resources are scarce. For some SCSI setups with a huge set of shared tags, we run out of memory allocating what the drivers is asking for. So implement a scale back logic to reduce the tag depth for those cases, allowing the driver to successfully load. We should extend this to detect low memory situations, and implement a sane fallback for those (1 queue, 64 tags, or something like that). Tested-by: Robert Elliott <elliott@hp.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-09-09Block: fix unbalanced bypass-disable in blk_register_queueAlan Stern1-2/+4
When a queue is registered, the block layer turns off the bypass setting (because bypass is enabled when the queue is created). This doesn't work well for queues that are unregistered and then registered again; we get a WARNING because of the unbalanced calls to blk_queue_bypass_end(). This patch fixes the problem by making blk_register_queue() call blk_queue_bypass_end() only the first time the queue is registered. Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Acked-by: Tejun Heo <tj@kernel.org> CC: James Bottomley <James.Bottomley@HansenPartnership.com> CC: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-09-09Documentation: Docbook: Fix generated DocBook/kernel-api.xmlMasanari Iida2-5/+5
This patch fix spelling typo found in DocBook/kernel-api.xml. It is because the file is generated from the source comments, I have to fix the comments in source codes. Signed-off-by: Masanari Iida <standby24x7@gmail.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2014-09-08block, bdi: an active gendisk always has a request_queue associated with itTejun Heo3-15/+3
bdev_get_queue() returns the request_queue associated with the specified block_device. blk_get_backing_dev_info() makes use of bdev_get_queue() to determine the associated bdi given a block_device. All the callers of bdev_get_queue() including blk_get_backing_dev_info() assume that bdev_get_queue() may return NULL and implement NULL handling; however, bdev_get_queue() requires the passed in block_device is opened and attached to its gendisk. Because an active gendisk always has a valid request_queue associated with it, bdev_get_queue() can never return NULL and neither can blk_get_backing_dev_info(). Make it clear that neither of the two functions can return NULL and remove NULL handling from all the callers. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-09-08blkcg: remove blkcg->idTejun Heo3-10/+5
blkcg->id is a unique id given to each blkcg; however, the cgroup_subsys_state which each blkcg embeds already has ->serial_nr which can be used for the same purpose. Drop blkcg->id and replace its uses with blkcg->css.serial_nr. Rename cfq_cgroup->blkcg_id to ->blkcg_serial_nr and @id in check_blkcg_changed() to @serial_nr for consistency. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-09-08percpu-refcount: add @gfp to percpu_ref_init()Tejun Heo1-1/+2
Percpu allocator now supports allocation mask. Add @gfp to percpu_ref_init() so that !GFP_KERNEL allocation masks can be used with percpu_refs too. This patch doesn't make any functional difference. v2: blk-mq conversion was missing. Updated. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Kent Overstreet <koverstreet@google.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Nicholas A. Bellinger <nab@linux-iscsi.org> Cc: Jens Axboe <axboe@kernel.dk>
2014-09-03block: Fix dev_t minor allocation lifetimeKeith Busch2-11/+15
Releases the dev_t minor when all references are closed to prevent another device from acquiring the same major/minor. Since the partition's release may be invoked from call_rcu's soft-irq context, the ext_dev_idr's mutex had to be replaced with a spinlock so as not so sleep. Signed-off-by: Keith Busch <keith.busch@intel.com> Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>
2014-09-03blk-mq: cleanup after blk_mq_init_rq_map failuresRobert Elliott1-0/+3
In blk-mq.c blk_mq_alloc_tag_set, if: set->tags = kmalloc_node() succeeds, but one of the blk_mq_init_rq_map() calls fails, goto out_unwind; needs to free set->tags so the caller is not obligated to do so. None of the current callers (null_blk, virtio_blk, virtio_blk, or the forthcoming scsi-mq) do so. set->tags needs to be set to NULL after doing so, so other tag cleanup logic doesn't try to free a stale pointer later. Also set it to NULL in blk_mq_free_tag_set. Tested with error injection on the forthcoming scsi-mq + hpsa combination. Signed-off-by: Robert Elliott <elliott@hp.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-09-02blk-merge: fix blk_recount_segmentsMing Lei1-6/+11
QUEUE_FLAG_NO_SG_MERGE is set at default for blk-mq devices, so bio->bi_phys_segment computed may be bigger than queue_max_segments(q) for blk-mq devices, then drivers will fail to handle the case, for example, BUG_ON() in virtio_queue_rq() can be triggerd for virtio-blk: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1359146 This patch fixes the issue by ignoring the QUEUE_FLAG_NO_SG_MERGE flag if the computed bio->bi_phys_segment is bigger than queue_max_segments(q), and the regression is caused by commit 05f1dd53152173(block: add queue flag for disabling SG merging). Reported-by: Kick In <pierre-andre.morey@canonical.com> Tested-by: Chris J Arges <chris.j.arges@canonical.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-08-29bsg: fix potential error pointer dereferenceJens Axboe1-0/+1
Dan writes: block/bsg.c:327 bsg_map_hdr() error: 'next_rq' dereferencing possible ERR_PTR(). Fix this by setting next_rq to NULL, for the case where it can be != NULL but an error pointer. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-08-28block,scsi: fixup blk_get_request dead queue scenariosJoe Lawrence4-29/+33
The blk_get_request function may fail in low-memory conditions or during device removal (even if __GFP_WAIT is set). To distinguish between these errors, modify the blk_get_request call stack to return the appropriate ERR_PTR. Verify that all callers check the return status and consider IS_ERR instead of a simple NULL pointer check. For consistency, make a similar change to the blk_mq_alloc_request leg of blk_get_request. It may fail if the queue is dead, or the caller was unwilling to wait. Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com> Acked-by: Jiri Kosina <jkosina@suse.cz> [for pktdvd] Acked-by: Boaz Harrosh <bharrosh@panasas.com> [for osd] Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-08-28cfq-iosched: Add comments on update timing of weightToshiaki Makita1-0/+8
Explain that weight has to be updated on activation. This complements previous fix e15693ef18e1 ("cfq-iosched: Fix wrong children_weight calculation"). Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-08-26block,scsi: verify return pointer from blk_get_requestJoe Lawrence1-1/+8
The blk-core dead queue checks introduce an error scenario to blk_get_request that returns NULL if the request queue has been shutdown. This affects the behavior for __GFP_WAIT callers, who should verify the return value before dereferencing. Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com> Acked-by: Jiri Kosina <jkosina@suse.cz> [for pktdvd] Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-08-26cfq-iosched: Fix wrong children_weight calculationToshiaki Makita1-3/+8
cfq_group_service_tree_add() is applying new_weight at the beginning of the function via cfq_update_group_weight(). This actually allows weight to change between adding it to and subtracting it from children_weight, and triggers WARN_ON_ONCE() in cfq_group_service_tree_del(), or even causes oops by divide error during vfr calculation in cfq_group_service_tree_add(). The detailed scenario is as follows: 1. Create blkio cgroups X and Y as a child of X. Set X's weight to 500 and perform some I/O to apply new_weight. This X's I/O completes before starting Y's I/O. 2. Y starts I/O and cfq_group_service_tree_add() is called with Y. 3. cfq_group_service_tree_add() walks up the tree during children_weight calculation and adds parent X's weight (500) to children_weight of root. children_weight becomes 500. 4. Set X's weight to 1000. 5. X starts I/O and cfq_group_service_tree_add() is called with X. 6. cfq_group_service_tree_add() applies its new_weight (1000). 7. I/O of Y completes and cfq_group_service_tree_del() is called with Y. 8. I/O of X completes and cfq_group_service_tree_del() is called with X. 9. cfq_group_service_tree_del() subtracts X's weight (1000) from children_weight of root. children_weight becomes -500. This triggers WARN_ON_ONCE(). 10. Set X's weight to 500. 11. X starts I/O and cfq_group_service_tree_add() is called with X. 12. cfq_group_service_tree_add() applies its new_weight (500) and adds it to children_weight of root. children_weight becomes 0. Calcularion of vfr triggers oops by divide error. weight should be updated right before adding it to children_weight. Reported-by: Ruki Sekiya <sekiya.ruki@lab.ntt.co.jp> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Acked-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>
2014-08-26block: fix error handling in sg_ioSabrina Dubroca1-0/+1
Before commit 2cada584b200 ("block: cleanup error handling in sg_io"), we had ret = 0 before entering the last big if block of sg_io. Since 2cada584b200, ret = -EFAULT, which breaks hdparm: /dev/sda: setting Advanced Power Management level to 0xc8 (200) HDIO_DRIVE_CMD failed: Bad address APM_level = 128 Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Fixes: 2cada584b200 ("block: cleanup error handling in sg_io") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-08-22fix regression in SCSI_IOCTL_SEND_COMMANDTony Battersby1-2/+7
blk_rq_set_block_pc() memsets rq->cmd to 0, so it should come immediately after blk_get_request() to avoid overwriting the user-supplied CDB. Also check for failure to allocate rq. Fixes: f27b087b81b7 ("block: add blk_rq_set_block_pc()") Cc: <stable@vger.kernel.org> # 3.16.x Signed-off-by: Tony Battersby <tonyb@cybernetics.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-08-22scsi-mq: fix requests that use a separate CDB bufferTony Battersby2-1/+2
This patch fixes code such as the following with scsi-mq enabled: rq = blk_get_request(...); blk_rq_set_block_pc(rq); rq->cmd = my_cmd_buffer; /* separate CDB buffer */ blk_execute_rq_nowait(...); Code like this appears in e.g. sg_start_req() in drivers/scsi/sg.c (for large CDBs only). Without this patch, scsi_mq_prep_fn() will set rq->cmd back to rq->__cmd, causing the wrong CDB to be sent to the device. Signed-off-by: Tony Battersby <tonyb@cybernetics.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-08-21block: support > 16 byte CDBs for SG_IOChristoph Hellwig1-7/+17
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Boaz Harrosh <boaz@plexistor.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-08-21block: cleanup error handling in sg_ioChristoph Hellwig1-6/+4
Make sure we always clean up through the out label and just have a single place to put the request. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-08-21blk-mq: blk_mq_freeze_queue() should allow nestingTejun Heo1-4/+8
While converting to percpu_ref for freezing, add703fda981 ("blk-mq: use percpu_ref for mq usage count") incorrectly made blk_mq_freeze_queue() misbehave when freezing is nested due to percpu_ref_kill() being invoked on an already killed ref. Fix it by making blk_mq_freeze_queue() kill and kick the queue only for the outermost freeze attempt. All the nested ones can simply wait for the ref to reach zero. While at it, remove unnecessary @wake initialization from blk_mq_unfreeze_queue(). Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-08-21blk-mq: correct a few wrong/bad commentsJens Axboe1-3/+3
Just grammar or spelling errors, nothing major. Signed-off-by: Jens Axboe <axboe@fb.com>
2014-08-21block: Fix BUG_ON when pi errors occurSagi Grimberg1-1/+1
When getting a pi error we get to bio_integrity_end_io with bi_remaining already decremented to 0 where we will eventually need to call bio_endio with restored original bio completion handler. Calling bio_endio invokes a BUG_ON(). We should call bio_endio_nodec instead, like what is done in bio_integrity_verify_fn. Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-08-21blk-mq: don't allow merges if turned off for the queueJens Axboe1-3/+9
blk-mq uses BLK_MQ_F_SHOULD_MERGE, as set by the driver at init time, to determine whether it should merge IO or not. However, this could also be disabled by the admin, if merging is switched off through sysfs. So check the general queue state as well before attempting to merge IO. Reported-by: Rob Elliott <Elliott@hp.com> Tested-by: Rob Elliott <Elliott@hp.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-08-15blk-mq: fix WARNING "percpu_ref_kill() called more than once!"Ming Lei1-4/+0
Before doing queue release, the queue has been freezed already by blk_cleanup_queue(), so needn't to freeze queue for deleting tag set. This patch fixes the WARNING of "percpu_ref_kill() called more than once!" which is triggered during unloading block driver. Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-08-14Merge tag 'dm-3.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dmLinus Torvalds1-1/+2
Pull device mapper changes from Mike Snitzer: - Allow the thin target to paired with any size external origin; also allow thin snapshots to be larger than the external origin. - Add support for quickly loading a repetitive pattern into the dm-switch target. - Use per-bio data in the dm-crypt target instead of always using a mempool for each allocation. Required switching to kmalloc alignment for the bio slab. - Fix DM core to properly stack the QUEUE_FLAG_NO_SG_MERGE flag - Fix the dm-cache and dm-thin targets' export of the minimum_io_size to match the data block size -- this fixes an issue where mkfs.xfs would improperly infer raid striping was in place on the underlying storage. - Small cleanups in dm-io, dm-mpath and dm-cache * tag 'dm-3.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm table: propagate QUEUE_FLAG_NO_SG_MERGE dm switch: efficiently support repetitive patterns dm switch: factor out switch_region_table_read dm cache: set minimum_io_size to cache's data block size dm thin: set minimum_io_size to pool's data block size dm crypt: use per-bio data block: use kmalloc alignment for bio slab dm table: make dm_table_supports_discards static dm cache metadata: use dm-space-map-metadata.h defined size limits dm cache: fail migrations in the do_worker error path dm cache: simplify deferred set reference count increments dm thin: relax external origin size constraints dm thin: switch to an atomic_t for tracking pending new block preparations dm mpath: eliminate pg_ready() wrapper dm io: simplify dec_count and sync_io
2014-08-14Merge branch 'for-3.17/drivers' of git://git.kernel.dk/linux-blockLinus Torvalds1-1/+4
Pull block driver changes from Jens Axboe: "Nothing out of the ordinary here, this pull request contains: - A big round of fixes for bcache from Kent Overstreet, Slava Pestov, and Surbhi Palande. No new features, just a lot of fixes. - The usual round of drbd updates from Andreas Gruenbacher, Lars Ellenberg, and Philipp Reisner. - virtio_blk was converted to blk-mq back in 3.13, but now Ming Lei has taken it one step further and added support for actually using more than one queue. - Addition of an explicit SG_FLAG_Q_AT_HEAD for block/bsg, to compliment the the default behavior of adding to the tail of the queue. From Douglas Gilbert" * 'for-3.17/drivers' of git://git.kernel.dk/linux-block: (86 commits) bcache: Drop unneeded blk_sync_queue() calls bcache: add mutex lock for bch_is_open bcache: Correct printing of btree_gc_max_duration_ms bcache: try to set b->parent properly bcache: fix memory corruption in init error path bcache: fix crash with incomplete cache set bcache: Fix more early shutdown bugs bcache: fix use-after-free in btree_gc_coalesce() bcache: Fix an infinite loop in journal replay bcache: fix crash in bcache_btree_node_alloc_fail tracepoint bcache: bcache_write tracepoint was crashing bcache: fix typo in bch_bkey_equal_header bcache: Allocate bounce buffers with GFP_NOWAIT bcache: Make sure to pass GFP_WAIT to mempool_alloc() bcache: fix uninterruptible sleep in writeback thread bcache: wait for buckets when allocating new btree root bcache: fix crash on shutdown in passthrough mode bcache: fix lockdep warnings on shutdown bcache allocator: send discards with correct size bcache: Fix to remove the rcu_sched stalls. ...