aboutsummaryrefslogtreecommitdiffstats
path: root/block (follow)
AgeCommit message (Collapse)AuthorFilesLines
2015-09-02Merge branch 'for-4.3/sg' of git://git.kernel.dk/linux-blockLinus Torvalds1-1/+1
Pull SG updates from Jens Axboe: "This contains a set of scatter-gather related changes/fixes for 4.3: - Add support for limited chaining of sg tables even for architectures that do not set ARCH_HAS_SG_CHAIN. From Christoph. - Add sg chain support to target_rd. From Christoph. - Fixup open coded sg->page_link in crypto/omap-sham. From Christoph. - Fixup open coded crypto ->page_link manipulation. From Dan. - Also from Dan, automated fixup of manual sg_unmark_end() manipulations. - Also from Dan, automated fixup of open coded sg_phys() implementations. - From Robert Jarzmik, addition of an sg table splitting helper that drivers can use" * 'for-4.3/sg' of git://git.kernel.dk/linux-block: lib: scatterlist: add sg splitting function scatterlist: use sg_phys() crypto/omap-sham: remove an open coded access to ->page_link scatterlist: remove open coded sg_unmark_end instances crypto: replace scatterwalk_sg_chain with sg_chain target/rd: always chain S/G list scatterlist: allow limited chaining without ARCH_HAS_SG_CHAIN
2015-09-02Merge branch 'for-4.3/core' of git://git.kernel.dk/linux-blockLinus Torvalds17-313/+421
Pull core block updates from Jens Axboe: "This first core part of the block IO changes contains: - Cleanup of the bio IO error signaling from Christoph. We used to rely on the uptodate bit and passing around of an error, now we store the error in the bio itself. - Improvement of the above from myself, by shrinking the bio size down again to fit in two cachelines on x86-64. - Revert of the max_hw_sectors cap removal from a revision again, from Jeff Moyer. This caused performance regressions in various tests. Reinstate the limit, bump it to a more reasonable size instead. - Make /sys/block/<dev>/queue/discard_max_bytes writeable, by me. Most devices have huge trim limits, which can cause nasty latencies when deleting files. Enable the admin to configure the size down. We will look into having a more sane default instead of UINT_MAX sectors. - Improvement of the SGP gaps logic from Keith Busch. - Enable the block core to handle arbitrarily sized bios, which enables a nice simplification of bio_add_page() (which is an IO hot path). From Kent. - Improvements to the partition io stats accounting, making it faster. From Ming Lei. - Also from Ming Lei, a basic fixup for overflow of the sysfs pending file in blk-mq, as well as a fix for a blk-mq timeout race condition. - Ming Lin has been carrying Kents above mentioned patches forward for a while, and testing them. Ming also did a few fixes around that. - Sasha Levin found and fixed a use-after-free problem introduced by the bio->bi_error changes from Christoph. - Small blk cgroup cleanup from Viresh Kumar" * 'for-4.3/core' of git://git.kernel.dk/linux-block: (26 commits) blk: Fix bio_io_vec index when checking bvec gaps block: Replace SG_GAPS with new queue limits mask block: bump BLK_DEF_MAX_SECTORS to 2560 Revert "block: remove artifical max_hw_sectors cap" blk-mq: fix race between timeout and freeing request blk-mq: fix buffer overflow when reading sysfs file of 'pending' Documentation: update notes in biovecs about arbitrarily sized bios block: remove bio_get_nr_vecs() fs: use helper bio_add_page() instead of open coding on bi_io_vec block: kill merge_bvec_fn() completely md/raid5: get rid of bio_fits_rdev() md/raid5: split bio for chunk_aligned_read block: remove split code in blkdev_issue_{discard,write_same} btrfs: remove bio splitting and merge_bvec_fn() calls bcache: remove driver private bio splitting code block: simplify bio_add_page() block: make generic_make_request handle arbitrarily sized bios blk-cgroup: Drop unlikely before IS_ERR(_OR_NULL) block: don't access bio->bi_error after bio_put() block: shrink struct bio down to 2 cache lines again ...
2015-09-01blk: Fix bio_io_vec index when checking bvec gapsKeith Busch1-1/+1
Corrects a coding error from earlier patch. Reported by: Sagi Grimberg <sagig@dev.mellanox.co.il> Signed-off-by: Keith Busch <keith.busch@intel.com> Fixes: 03100aada96f ("block: Replace SG_GAPS with new queue limits mask") Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-19block: Replace SG_GAPS with new queue limits maskKeith Busch3-18/+22
The SG_GAPS queue flag caused checks for bio vector alignment against PAGE_SIZE, but the device may have different constraints. This patch adds a queue limits so a driver with such constraints can set to allow requests that would have been unnecessarily split. The new gaps check takes the request_queue as a parameter to simplify the logic around invoking this function. This new limit makes the queue flag redundant, so removing it and all usage. Device-mappers will inherit the correct settings through blk_stack_limits(). Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-18Revert "block: remove artifical max_hw_sectors cap"Jeff Moyer1-1/+3
This reverts commit 34b48db66e08ca1c1bc07cf305d672ac940268dc. That commit caused performance regressions for streaming I/O workloads on a number of different storage devices, from SATA disks to external RAID arrays. It also managed to trip up some buggy firmware in at least one drive, causing data corruption. The next patch will bump the default max_sectors_kb value to 1280, which will accommodate a 10-data-disk stripe write with chunk size 128k. In the testing I've done using iozone, fio, and aio-stress, a value of 1280 does not show a big performance difference from 512. This will hopefully still help the software RAID setup that Christoph saw the original performance gains with while still not regressing other storage configurations. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-17scatterlist: remove open coded sg_unmark_end instancesDan Williams1-1/+1
Signed-off-by: Dan Williams <dan.j.williams@intel.com> [hch: split from a larger patch by Dan] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-15Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsiLinus Torvalds1-2/+2
Pull SCSI fixes from James Bottomley: "This has two libfc fixes for bugs causing rare crashes, one iscsi fix for a potential hang on shutdown, and a fix for an I/O blocksize issue which caused a regression" * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: sd: Fix maximum I/O size for BLOCK_PC requests libfc: Fix fc_fcp_cleanup_each_cmd() libfc: Fix fc_exch_recv_req() error path libiscsi: Fix host busy blocking during connection teardown
2015-08-15blk-mq: fix race between timeout and freeing requestMing Lei5-18/+35
Inside timeout handler, blk_mq_tag_to_rq() is called to retrieve the request from one tag. This way is obviously wrong because the request can be freed any time and some fiedds of the request can't be trusted, then kernel oops might be triggered[1]. Currently wrt. blk_mq_tag_to_rq(), the only special case is that the flush request can share same tag with the request cloned from, and the two requests can't be active at the same time, so this patch fixes the above issue by updating tags->rqs[tag] with the active request(either flush rq or the request cloned from) of the tag. Also blk_mq_tag_to_rq() gets much simplified with this patch. Given blk_mq_tag_to_rq() is mainly for drivers and the caller must make sure the request can't be freed, so in bt_for_each() this helper is replaced with tags->rqs[tag]. [1] kernel oops log [ 439.696220] BUG: unable to handle kernel NULL pointer dereference at 0000000000000158^M [ 439.697162] IP: [<ffffffff812d89ba>] blk_mq_tag_to_rq+0x21/0x6e^M [ 439.700653] PGD 7ef765067 PUD 7ef764067 PMD 0 ^M [ 439.700653] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC ^M [ 439.700653] Dumping ftrace buffer:^M [ 439.700653] (ftrace buffer empty)^M [ 439.700653] Modules linked in: nbd ipv6 kvm_intel kvm serio_raw^M [ 439.700653] CPU: 6 PID: 2779 Comm: stress-ng-sigfd Not tainted 4.2.0-rc5-next-20150805+ #265^M [ 439.730500] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011^M [ 439.730500] task: ffff880605308000 ti: ffff88060530c000 task.ti: ffff88060530c000^M [ 439.730500] RIP: 0010:[<ffffffff812d89ba>] [<ffffffff812d89ba>] blk_mq_tag_to_rq+0x21/0x6e^M [ 439.730500] RSP: 0018:ffff880819203da0 EFLAGS: 00010283^M [ 439.730500] RAX: ffff880811b0e000 RBX: ffff8800bb465f00 RCX: 0000000000000002^M [ 439.730500] RDX: 0000000000000000 RSI: 0000000000000202 RDI: 0000000000000000^M [ 439.730500] RBP: ffff880819203db0 R08: 0000000000000002 R09: 0000000000000000^M [ 439.730500] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000202^M [ 439.730500] R13: ffff880814104800 R14: 0000000000000002 R15: ffff880811a2ea00^M [ 439.730500] FS: 00007f165b3f5740(0000) GS:ffff880819200000(0000) knlGS:0000000000000000^M [ 439.730500] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b^M [ 439.730500] CR2: 0000000000000158 CR3: 00000007ef766000 CR4: 00000000000006e0^M [ 439.730500] Stack:^M [ 439.730500] 0000000000000008 ffff8808114eed90 ffff880819203e00 ffffffff812dc104^M [ 439.755663] ffff880819203e40 ffffffff812d9f5e 0000020000000000 ffff8808114eed80^M [ 439.755663] Call Trace:^M [ 439.755663] <IRQ> ^M [ 439.755663] [<ffffffff812dc104>] bt_for_each+0x6e/0xc8^M [ 439.755663] [<ffffffff812d9f5e>] ? blk_mq_rq_timed_out+0x6a/0x6a^M [ 439.755663] [<ffffffff812d9f5e>] ? blk_mq_rq_timed_out+0x6a/0x6a^M [ 439.755663] [<ffffffff812dc1b3>] blk_mq_tag_busy_iter+0x55/0x5e^M [ 439.755663] [<ffffffff812d88b4>] ? blk_mq_bio_to_request+0x38/0x38^M [ 439.755663] [<ffffffff812d8911>] blk_mq_rq_timer+0x5d/0xd4^M [ 439.755663] [<ffffffff810a3e10>] call_timer_fn+0xf7/0x284^M [ 439.755663] [<ffffffff810a3d1e>] ? call_timer_fn+0x5/0x284^M [ 439.755663] [<ffffffff812d88b4>] ? blk_mq_bio_to_request+0x38/0x38^M [ 439.755663] [<ffffffff810a46d6>] run_timer_softirq+0x1ce/0x1f8^M [ 439.755663] [<ffffffff8104c367>] __do_softirq+0x181/0x3a4^M [ 439.755663] [<ffffffff8104c76e>] irq_exit+0x40/0x94^M [ 439.755663] [<ffffffff81031482>] smp_apic_timer_interrupt+0x33/0x3e^M [ 439.755663] [<ffffffff815559a4>] apic_timer_interrupt+0x84/0x90^M [ 439.755663] <EOI> ^M [ 439.755663] [<ffffffff81554350>] ? _raw_spin_unlock_irq+0x32/0x4a^M [ 439.755663] [<ffffffff8106a98b>] finish_task_switch+0xe0/0x163^M [ 439.755663] [<ffffffff8106a94d>] ? finish_task_switch+0xa2/0x163^M [ 439.755663] [<ffffffff81550066>] __schedule+0x469/0x6cd^M [ 439.755663] [<ffffffff8155039b>] schedule+0x82/0x9a^M [ 439.789267] [<ffffffff8119b28b>] signalfd_read+0x186/0x49a^M [ 439.790911] [<ffffffff8106d86a>] ? wake_up_q+0x47/0x47^M [ 439.790911] [<ffffffff811618c2>] __vfs_read+0x28/0x9f^M [ 439.790911] [<ffffffff8117a289>] ? __fget_light+0x4d/0x74^M [ 439.790911] [<ffffffff811620a7>] vfs_read+0x7a/0xc6^M [ 439.790911] [<ffffffff8116292b>] SyS_read+0x49/0x7f^M [ 439.790911] [<ffffffff81554c17>] entry_SYSCALL_64_fastpath+0x12/0x6f^M [ 439.790911] Code: 48 89 e5 e8 a9 b8 e7 ff 5d c3 0f 1f 44 00 00 55 89 f2 48 89 e5 41 54 41 89 f4 53 48 8b 47 60 48 8b 1c d0 48 8b 7b 30 48 8b 53 38 <48> 8b 87 58 01 00 00 48 85 c0 75 09 48 8b 97 88 0c 00 00 eb 10 ^M [ 439.790911] RIP [<ffffffff812d89ba>] blk_mq_tag_to_rq+0x21/0x6e^M [ 439.790911] RSP <ffff880819203da0>^M [ 439.790911] CR2: 0000000000000158^M [ 439.790911] ---[ end trace d40af58949325661 ]---^M Cc: <stable@vger.kernel.org> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-15blk-mq: fix buffer overflow when reading sysfs file of 'pending'Ming Lei1-7/+18
There may be lots of pending requests so that the buffer of PAGE_SIZE can't hold them at all. One typical example is scsi-mq, the queue depth(.can_queue) of scsi_host and blk-mq is quite big but scsi_device's queue_depth is a bit small(.cmd_per_lun), then it is quite easy to have lots of pending requests in hw queue. This patch fixes the following warning and the related memory destruction. [ 359.025101] fill_read_buffer: blk_mq_hw_sysfs_show+0x0/0x7d returned bad count^M [ 359.055595] irq event stamp: 15537^M [ 359.055606] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC ^M [ 359.055614] Dumping ftrace buffer:^M [ 359.055660] (ftrace buffer empty)^M [ 359.055672] Modules linked in: nbd ipv6 kvm_intel kvm serio_raw^M [ 359.055678] CPU: 4 PID: 21631 Comm: stress-ng-sysfs Not tainted 4.2.0-rc5-next-20150805 #434^M [ 359.055679] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011^M [ 359.055682] task: ffff8802161cc000 ti: ffff88021b4a8000 task.ti: ffff88021b4a8000^M [ 359.055693] RIP: 0010:[<ffffffff811541c5>] [<ffffffff811541c5>] __kmalloc+0xe8/0x152^M Cc: <stable@vger.kernel.org> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-13block: remove bio_get_nr_vecs()Kent Overstreet1-23/+0
We can always fill up the bio now, no need to estimate the possible size based on queue parameters. Acked-by: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> [hch: rebased and wrote a changelog] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-13block: kill merge_bvec_fn() completelyKent Overstreet2-36/+3
As generic_make_request() is now able to handle arbitrarily sized bios, it's no longer necessary for each individual block driver to define its own ->merge_bvec_fn() callback. Remove every invocation completely. Cc: Jens Axboe <axboe@kernel.dk> Cc: Lars Ellenberg <drbd-dev@lists.linbit.com> Cc: drbd-user@lists.linbit.com Cc: Jiri Kosina <jkosina@suse.cz> Cc: Yehuda Sadeh <yehuda@inktank.com> Cc: Sage Weil <sage@inktank.com> Cc: Alex Elder <elder@kernel.org> Cc: ceph-devel@vger.kernel.org Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: dm-devel@redhat.com Cc: Neil Brown <neilb@suse.de> Cc: linux-raid@vger.kernel.org Cc: Christoph Hellwig <hch@infradead.org> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Acked-by: NeilBrown <neilb@suse.de> (for the 'md' bits) Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> [dpark: also remove ->merge_bvec_fn() in dm-thin as well as dm-era-target, and resolve merge conflicts] Signed-off-by: Dongsu Park <dpark@posteo.net> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-13block: remove split code in blkdev_issue_{discard,write_same}Ming Lin1-36/+11
The split code in blkdev_issue_{discard,write_same} can go away now that any driver that cares does the split. We have to make sure bio size doesn't overflow. For discard, we set max discard sectors to (1<<31)>>9 to ensure it doesn't overflow bi_size and hopefully it is of the proper granularity as long as the granularity is a power of two. Acked-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-13block: simplify bio_add_page()Kent Overstreet1-80/+55
Since generic_make_request() can now handle arbitrary size bios, all we have to do is make sure the bvec array doesn't overflow. __bio_add_page() doesn't need to call ->merge_bvec_fn(), where we can get rid of unnecessary code paths. Removing the call to ->merge_bvec_fn() is also fine, as no driver that implements support for BLOCK_PC commands even has a ->merge_bvec_fn() method. Cc: Christoph Hellwig <hch@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> [dpark: rebase and resolve merge conflicts, change a couple of comments, make bio_add_page() warn once upon a cloned bio.] Signed-off-by: Dongsu Park <dpark@posteo.net> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-13block: make generic_make_request handle arbitrarily sized biosKent Overstreet4-20/+165
The way the block layer is currently written, it goes to great lengths to avoid having to split bios; upper layer code (such as bio_add_page()) checks what the underlying device can handle and tries to always create bios that don't need to be split. But this approach becomes unwieldy and eventually breaks down with stacked devices and devices with dynamic limits, and it adds a lot of complexity. If the block layer could split bios as needed, we could eliminate a lot of complexity elsewhere - particularly in stacked drivers. Code that creates bios can then create whatever size bios are convenient, and more importantly stacked drivers don't have to deal with both their own bio size limitations and the limitations of the (potentially multiple) devices underneath them. In the future this will let us delete merge_bvec_fn and a bunch of other code. We do this by adding calls to blk_queue_split() to the various make_request functions that need it - a few can already handle arbitrary size bios. Note that we add the call _after_ any call to blk_queue_bounce(); this means that blk_queue_split() and blk_recalc_rq_segments() don't need to be concerned with bouncing affecting segment merging. Some make_request_fn() callbacks were simple enough to audit and verify they don't need blk_queue_split() calls. The skipped ones are: * nfhd_make_request (arch/m68k/emu/nfblock.c) * axon_ram_make_request (arch/powerpc/sysdev/axonram.c) * simdisk_make_request (arch/xtensa/platforms/iss/simdisk.c) * brd_make_request (ramdisk - drivers/block/brd.c) * mtip_submit_request (drivers/block/mtip32xx/mtip32xx.c) * loop_make_request * null_queue_bio * bcache's make_request fns Some others are almost certainly safe to remove now, but will be left for future patches. Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ming Lei <ming.lei@canonical.com> Cc: Neil Brown <neilb@suse.de> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: dm-devel@redhat.com Cc: Lars Ellenberg <drbd-dev@lists.linbit.com> Cc: drbd-user@lists.linbit.com Cc: Jiri Kosina <jkosina@suse.cz> Cc: Geoff Levand <geoff@infradead.org> Cc: Jim Paris <jim@jtan.com> Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Oleg Drokin <oleg.drokin@intel.com> Cc: Andreas Dilger <andreas.dilger@intel.com> Acked-by: NeilBrown <neilb@suse.de> (for the 'md/md.c' bits) Acked-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> [dpark: skip more mq-based drivers, resolve merge conflicts, etc.] Signed-off-by: Dongsu Park <dpark@posteo.net> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-12sd: Fix maximum I/O size for BLOCK_PC requestsMartin K. Petersen1-2/+2
Commit bcdb247c6b6a ("sd: Limit transfer length") clamped the maximum size of an I/O request to the MAXIMUM TRANSFER LENGTH field in the BLOCK LIMITS VPD. This had the unfortunate effect of also limiting the maximum size of non-filesystem requests sent to the device through sg/bsg. Avoid using blk_queue_max_hw_sectors() and set the max_sectors queue limit directly. Also update the comment in blk_limits_max_hw_sectors() to clarify that max_hw_sectors defines the limit for the I/O controller only. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reported-by: Brian King <brking@linux.vnet.ibm.com> Tested-by: Brian King <brking@linux.vnet.ibm.com> Cc: stable@vger.kernel.org # 3.17+ Signed-off-by: James Bottomley <JBottomley@Odin.com>
2015-07-29block: manipulate bio->bi_flags through helpersJens Axboe5-11/+11
Some places use helpers now, others don't. We only have the 'is set' helper, add helpers for setting and clearing flags too. It was a bit of a mess of atomic vs non-atomic access. With BIO_UPTODATE gone, we don't have any risk of concurrent access to the flags. So relax the restriction and don't make any of them atomic. The flags that do have serialization issues (reffed and chained), we already handle those separately. Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-29block: add a bi_error field to struct bioChristoph Hellwig7-72/+62
Currently we have two different ways to signal an I/O error on a BIO: (1) by clearing the BIO_UPTODATE flag (2) by returning a Linux errno value to the bi_end_io callback The first one has the drawback of only communicating a single possible error (-EIO), and the second one has the drawback of not beeing persistent when bios are queued up, and are not passed along from child to parent bio in the ever more popular chaining scenario. Having both mechanisms available has the additional drawback of utterly confusing driver authors and introducing bugs where various I/O submitters only deal with one of them, and the others have to add boilerplate code to deal with both kinds of error returns. So add a new bi_error field to store an errno value directly in struct bio and remove the existing mechanisms to clean all this up. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-23block: Do a full clone when splitting discard biosMartin K. Petersen1-3/+12
This fixes a data corruption bug when using discard on top of MD linear, raid0 and raid10 personalities. Commit 20d0189b1012 "block: Introduce new bio_split()" permits sharing the bio_vec between the two resulting bios. That is fine for read/write requests where the bio_vec is immutable. For discards, however, we need to be able to attach a payload and update the bio_vec so the page can get mapped to a scatterlist entry. Therefore the bio_vec can not be shared when splitting discards and we must do a full clone. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reported-by: Seunguk Shin <seunguk.shin@samsung.com> Tested-by: Seunguk Shin <seunguk.shin@samsung.com> Cc: Seunguk Shin <seunguk.shin@samsung.com> Cc: Jens Axboe <axboe@fb.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: <stable@vger.kernel.org> # v3.14+ Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-23block: export bio_associate_*() and wbc_account_io()Tejun Heo1-0/+2
bio_associate_blkcg(), bio_associate_current() and wbc_account_io() are used to implement cgroup writeback support for filesystems and thus need to be exported. Export them. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-22blkcg: fix gendisk reference leak in blkg_conf_prep()Tejun Heo1-1/+5
When a blkcg configuration is targeted to a partition rather than a whole device, blkg_conf_prep fails with -EINVAL; unfortunately, it forgets to put the gendisk ref in that case. Fix it. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-17block: make /sys/block/<dev>/queue/discard_max_bytes writeableJens Axboe2-1/+43
Lots of devices support huge discard sizes these days. Depending on how the device handles them internally, huge discards can introduce massive latencies (hundreds of msec) on the device side. We have a sysfs file, discard_max_bytes, that advertises the max hardware supported discard size. Make this writeable, and split the settings into a soft and hard limit. This can be set from 'discard_granularity' and up to the hardware limit. Add a new sysfs file, 'discard_max_hw_bytes', that shows the hw set limit. Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-17block: partition: convert percpu refMing Lei2-5/+10
Percpu refcount is the perfect match for partition's case, and the conversion is quite straight. With the convertion, one pair of atomic inc/dec can be saved for accounting block I/O, which is run in hot path of block I/O. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-17block: partition: introduce hd_free_part()Ming Lei2-4/+2
So the helper can be used in both generic partition case and part0 case. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-16blk-mq: set default timeout as 30 secondsMing Lei1-1/+1
It is reasonable to set default timeout of request as 30 seconds instead of 30000 ticks, which may be 300 seconds if HZ is 100, for example, some arm64 based systems may choose 100 HZ. Signed-off-by: Ming Lei <ming.lei@canonical.com> Fixes: c76cbbcf4044 ("blk-mq: put blk_queue_rq_timeout together in blk_mq_init_queue()" Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-09blkcg: fix blkcg_policy_data allocation bugTejun Heo1-36/+42
e48453c386f3 ("block, cgroup: implement policy-specific per-blkcg data") updated per-blkcg policy data to be dynamically allocated. When a policy is registered, its policy data aren't created. Instead, when the policy is activated on a queue, the policy data are allocated if there are blkg's (blkcg_gq's) which are attached to a given blkcg. This is buggy. Consider the following scenario. 1. A blkcg is created. No blkg's attached yet. 2. The policy is registered. No policy data is allocated. 3. The policy is activated on a queue. As the above blkcg doesn't have any blkg's, it won't allocate the matching blkcg_policy_data. 4. An IO is issued from the blkcg and blkg is created and the blkcg still doesn't have the matching policy data allocated. With cfq-iosched, this leads to an oops. It also doesn't free policy data on policy unregistration assuming that freeing of all policy data on blkcg destruction should take care of it; however, this also is incorrect. 1. A blkcg has policy data. 2. The policy gets unregistered but the policy data remains. 3. Another policy gets registered on the same slot. 4. Later, the new policy tries to allocate policy data on the previous blkcg but the slot is already occupied and gets skipped. The policy ends up operating on the policy data of the previous policy. There's no reason to manage blkcg_policy_data lazily. The reason we do lazy allocation of blkg's is that the number of all possible blkg's is the product of cgroups and block devices which can reach a surprising level. blkcg_policy_data is contrained by the number of cgroups and shouldn't be a problem. This patch makes blkcg_policy_data to be allocated for all existing blkcg's on policy registration and freed on unregistration and removes blkcg_policy_data handling from policy [de]activation paths. This makes that blkcg_policy_data are created and removed with the policy they belong to and fixes the above described problems. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: e48453c386f3 ("block, cgroup: implement policy-specific per-blkcg data") Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-09blkcg: implement all_blkcgs listTejun Heo1-5/+12
Add all_blkcgs list goes through blkcg->all_blkcgs_node and is protected by blkcg_pol_mutex. This will be used to fix blkcg_policy_data allocation bug. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-09blkcg: blkcg_css_alloc() should grab blkcg_pol_mutex while iterating blkcg_policy[]Tejun Heo1-0/+4
An entry in blkcg_policy[] is stable while there are non-bypassing in-flight IOs on a request_queue which has the policy activated. This is why most derefs of blkcg_policy[] don't need explicit locking; however, blkcg_css_alloc() isn't invoked from IO path and thus doesn't have this protection and may race policies being added and removed. Fix it by adding explicit blkcg_pol_mutex protection around blkcg_policy[] iteration in blkcg_css_alloc(). Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: e48453c386f3 ("block, cgroup: implement policy-specific per-blkcg data") Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-09blkcg: allow blkcg_pol_mutex to be grabbed from cgroup [file] methodsTejun Heo1-19/+21
blkcg_pol_mutex primarily protects the blkcg_policy array. It also protects cgroup file type [un]registration during policy addition / removal. This puts blkcg_pol_mutex outside cgroup internal synchronization and in turn makes it impossible to grab from blkcg's cgroup methods as that leads to cyclic dependency. Another problematic dependency arising from this is through cgroup interface file deactivation. Removing a cftype requires removing all files of the type which in turn involves draining all on-going invocations of the file methods. This means that an interface file implementation can't grab blkcg_pol_mutex as draining can lead to AA deadlock. blkcg_reset_stats() is already in this situation. It currently trylocks blkcg_pol_mutex and then unwinds and retries the whole operation on failure, which is cumbersome at best. It has a lengthy comment explaining how cgroup internal synchronization is involved and expected to be updated but as explained above this doesn't need cgroup internal locking to deadlock. It's a self-contained AA deadlock. The described circular dependencies can be easily broken by moving cftype [un]registration out of blkcg_pol_mutex and protect them with an outer mutex. This patch introduces blkcg_pol_register_mutex which wraps entire policy [un]registration including cftype operations and shrinks blkcg_pol_mutex critical section. This also makes the trylock dancing in blkcg_reset_stats() unnecessary. Removed. This patch is necessary for the following blkcg_policy_data allocation bug fixes. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-07block/blk-cgroup.c: free per-blkcg data when freeing the blkcgArianna Avanzini1-3/+6
Currently, per-blkcg data is freed each time a policy is deactivated, that is also upon scheduler switch. However, when switching from a scheduler implementing a policy which requires per-blkcg data to another one, that same policy might be active on other devices, and therefore those same per-blkcg data could be still in use. This commit lets per-blkcg data be freed when the blkcg is freed instead of on policy deactivation. Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com> Reported-and-tested-by: Michael Kaminsky <kaminsky@cs.cmu.edu> Fixes: e48453c3 ("block, cgroup: implement policy-specific per-blkcg data") Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-07block: use FIELD_SIZEOF to calculate size of a fieldManinder Singh1-1/+1
use FIELD_SIZEOF instead of open coding Signed-off-by: Maninder Singh <maninder1.s@samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-07bio integrity: do not assume bio_integrity_pool exists if bioset existsMike Snitzer1-2/+2
bio_integrity_alloc() and bio_integrity_free() assume that if a bio was allocated from a bioset that that bioset also had its bio_integrity_pool allocated using bioset_integrity_create(). This is a very bad assumption given that bioset_create() and bioset_integrity_create() are completely disjoint. Not all callers of bioset_create() have been trained to also call bioset_integrity_create() -- and they may not care to be. Fix this by falling back to kmalloc'ing 'struct bio_integrity_payload' rather than force all bioset consumers to (wastefully) preallocate a bio_integrity_pool that they very likely won't actually need (given the niche nature of the current block integrity support). Otherwise, a NULL pointer "Kernel BUG" with a trace like the following will be observed (as seen on s390x using zfcp storage) because dm-io doesn't use bioset_integrity_create() when creating its bioset: [ 791.643338] Call Trace: [ 791.643339] ([<00000003df98b848>] 0x3df98b848) [ 791.643341] [<00000000002c5de8>] bio_integrity_alloc+0x48/0xf8 [ 791.643348] [<00000000002c6486>] bio_integrity_prep+0xae/0x2f0 [ 791.643349] [<0000000000371e38>] blk_queue_bio+0x1c8/0x3d8 [ 791.643355] [<000000000036f8d0>] generic_make_request+0xc0/0x100 [ 791.643357] [<000000000036f9b2>] submit_bio+0xa2/0x198 [ 791.643406] [<000003ff801f9774>] dispatch_io+0x15c/0x3b0 [dm_mod] [ 791.643419] [<000003ff801f9b3e>] dm_io+0x176/0x2f0 [dm_mod] [ 791.643423] [<000003ff8074b28a>] do_reads+0x13a/0x1a8 [dm_mirror] [ 791.643425] [<000003ff8074b43a>] do_mirror+0x142/0x298 [dm_mirror] [ 791.643428] [<0000000000154fca>] process_one_work+0x18a/0x3f8 [ 791.643432] [<000000000015598a>] worker_thread+0x132/0x3b0 [ 791.643435] [<000000000015d49a>] kthread+0xd2/0xd8 [ 791.643438] [<00000000005bc0ca>] kernel_thread_starter+0x6/0xc [ 791.643446] [<00000000005bc0c4>] kernel_thread_starter+0x0/0xc Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-27block: fix bogus EFAULT error from SG_IO ioctlPaolo Bonzini1-2/+2
Whenever blk_fill_sghdr_rq fails, its errno code is ignored and changed to EFAULT. This can cause very confusing errors: $ sg_persist -k /dev/sda persistent reservation in: pass through os error: Bad address The fix is trivial, just propagate the return value from blk_fill_sghdr_rq. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-26Merge tag 'dm-4.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dmLinus Torvalds1-11/+83
Pull device mapper fixes from Mike Snitzer: "Apologies for not pressing this request-based DM partial completion issue further, it was an oversight on my part. We'll have to get it fixed up properly and revisit for a future release. - Revert block and DM core changes the removed request-based DM's ability to handle partial request completions -- otherwise with the current SCSI LLDs these changes could lead to silent data corruption. - Fix two DM version bumps that were missing from the initial 4.2 DM pull request (enabled userspace lvm2 to know certain changes have been made)" * tag 'dm-4.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm cache policy smq: fix "default" version to be 1.4.0 dm: bump the ioctl version to 4.32.0 Revert "block, dm: don't copy bios for request clones" Revert "dm: do not allocate any mempools for blk-mq request-based DM"
2015-06-26Revert "block, dm: don't copy bios for request clones"Mike Snitzer1-11/+83
This reverts commit 5f1b670d0bef508a5554d92525f5f6d00d640b38. Justification for revert as reported in this dm-devel post: https://www.redhat.com/archives/dm-devel/2015-June/msg00160.html this change should not be pushed to mainline yet. Firstly, Christoph has a newer version of the patch that fixes silent data corruption problem: https://www.redhat.com/archives/dm-devel/2015-May/msg00229.html And the new version still depends on LLDDs to always complete requests to the end when error happens, while block API doesn't enforce such a requirement. If the assumption is ever broken, the inconsistency between request and bio (e.g. rq->__sector and rq->bio) will cause silent data corruption: https://www.redhat.com/archives/dm-devel/2015-June/msg00022.html Reported-by: Junichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-25Merge branch 'for-4.2/writeback' of git://git.kernel.dk/linux-blockLinus Torvalds11-728/+139
Pull cgroup writeback support from Jens Axboe: "This is the big pull request for adding cgroup writeback support. This code has been in development for a long time, and it has been simmering in for-next for a good chunk of this cycle too. This is one of those problems that has been talked about for at least half a decade, finally there's a solution and code to go with it. Also see last weeks writeup on LWN: http://lwn.net/Articles/648292/" * 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits) writeback, blkio: add documentation for cgroup writeback support vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB writeback: do foreign inode detection iff cgroup writeback is enabled v9fs: fix error handling in v9fs_session_init() bdi: fix wrong error return value in cgwb_create() buffer: remove unusued 'ret' variable writeback: disassociate inodes from dying bdi_writebacks writeback: implement foreign cgroup inode bdi_writeback switching writeback: add lockdep annotation to inode_to_wb() writeback: use unlocked_inode_to_wb transaction in inode_congested() writeback: implement unlocked_inode_to_wb transaction and use it for stat updates writeback: implement [locked_]inode_to_wb_and_lock_list() writeback: implement foreign cgroup inode detection writeback: make writeback_control track the inode being written back writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb() mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use writeback: implement memcg writeback domain based throttling writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes writeback: implement memcg wb_domain writeback: update wb_over_bg_thresh() to use wb_domain aware operations ...
2015-06-25Merge branch 'for-4.2/core' of git://git.kernel.dk/linux-blockLinus Torvalds15-255/+478
Pull core block IO update from Jens Axboe: "Nothing really major in here, mostly a collection of smaller optimizations and cleanups, mixed with various fixes. In more detail, this contains: - Addition of policy specific data to blkcg for block cgroups. From Arianna Avanzini. - Various cleanups around command types from Christoph. - Cleanup of the suspend block I/O path from Christoph. - Plugging updates from Shaohua and Jeff Moyer, for blk-mq. - Eliminating atomic inc/dec of both remaining IO count and reference count in a bio. From me. - Fixes for SG gap and chunk size support for data-less (discards) IO, so we can merge these better. From me. - Small restructuring of blk-mq shared tag support, freeing drivers from iterating hardware queues. From Keith Busch. - A few cfq-iosched tweaks, from Tahsin Erdogan and me. Makes the IOPS mode the default for non-rotational storage" * 'for-4.2/core' of git://git.kernel.dk/linux-block: (35 commits) cfq-iosched: fix other locations where blkcg_to_cfqgd() can return NULL cfq-iosched: fix sysfs oops when attempting to read unconfigured weights cfq-iosched: move group scheduling functions under ifdef cfq-iosched: fix the setting of IOPS mode on SSDs blktrace: Add blktrace.c to BLOCK LAYER in MAINTAINERS file block, cgroup: implement policy-specific per-blkcg data block: Make CFQ default to IOPS mode on SSDs block: add blk_set_queue_dying() to blkdev.h blk-mq: Shared tag enhancements block: don't honor chunk sizes for data-less IO block: only honor SG gap prevention for merges that contain data block: fix returnvar.cocci warnings block, dm: don't copy bios for request clones block: remove management of bi_remaining when restoring original bi_end_io block: replace trylock with mutex_lock in blkdev_reread_part() block: export blkdev_reread_part() and __blkdev_reread_part() suspend: simplify block I/O handling block: collapse bio bit space block: remove unused BIO_RW_BLOCK and BIO_EOF flags block: remove BIO_EOPNOTSUPP ...
2015-06-22Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-1/+1
Pull scheduler updates from Ingo Molnar: "The main changes are: - lockless wakeup support for futexes and IPC message queues (Davidlohr Bueso, Peter Zijlstra) - Replace spinlocks with atomics in thread_group_cputimer(), to improve scalability (Jason Low) - NUMA balancing improvements (Rik van Riel) - SCHED_DEADLINE improvements (Wanpeng Li) - clean up and reorganize preemption helpers (Frederic Weisbecker) - decouple page fault disabling machinery from the preemption counter, to improve debuggability and robustness (David Hildenbrand) - SCHED_DEADLINE documentation updates (Luca Abeni) - topology CPU masks cleanups (Bartosz Golaszewski) - /proc/sched_debug improvements (Srikar Dronamraju)" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (79 commits) sched/deadline: Remove needless parameter in dl_runtime_exceeded() sched: Remove superfluous resetting of the p->dl_throttled flag sched/deadline: Drop duplicate init_sched_dl_class() declaration sched/deadline: Reduce rq lock contention by eliminating locking of non-feasible target sched/deadline: Make init_sched_dl_class() __init sched/deadline: Optimize pull_dl_task() sched/preempt: Add static_key() to preempt_notifiers sched/preempt: Fix preempt notifiers documentation about hlist_del() within unsafe iteration sched/stop_machine: Fix deadlock between multiple stop_two_cpus() sched/debug: Add sum_sleep_runtime to /proc/<pid>/sched sched/debug: Replace vruntime with wait_sum in /proc/sched_debug sched/debug: Properly format runnable tasks in /proc/sched_debug sched/numa: Only consider less busy nodes as numa balancing destinations Revert 095bebf61a46 ("sched/numa: Do not move past the balance point if unbalanced") sched/fair: Prevent throttling in early pick_next_task_fair() preempt: Reorganize the notrace definitions a bit preempt: Use preempt_schedule_context() as the official tracing preemption point sched: Make preempt_schedule_context() function-tracing safe x86: Remove cpu_sibling_mask() and cpu_core_mask() x86: Replace cpu_**_mask() with topology_**_cpumask() ...
2015-06-20cfq-iosched: fix other locations where blkcg_to_cfqgd() can return NULLJens Axboe1-1/+11
Commit 9470e4a693db only covered the initial bug report, there are other spots in CFQ where we need to check that we actually have a valid cfq_group_data structure. Fixes: e48453c3 ("block, cgroup: implement policy-specific per-blkcg data") Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-19cfq-iosched: fix sysfs oops when attempting to read unconfigured weightsJens Axboe1-2/+12
If none of the devices in the system are using CFQ, then attempting to read: /sys/fs/cgroup/blkio/blkio.leaf_weight will results in a NULL dereference. Check for a valid cfq_group_data struct before attempting to dereference it. Reported-by: Andrey Wagin <avagin@gmail.com> Fixes: e48453c3 ("block, cgroup: implement policy-specific per-blkcg data") Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-19cfq-iosched: move group scheduling functions under ifdefJens Axboe1-16/+16
If CFQ_GROUP_IOSCHED is not set, the compiler produces the following warning: CC block/cfq-iosched.o linux/block/cfq-iosched.c:469:2: warning: 'cpd_to_cfqgd' defined but not used [-Wunused-function] *cpd_to_cfqgd(struct blkcg_policy_data *cpd) ^ In reality, two other lookup functions aren't used either if CFQ_GROUP_IOSCHED isn't set. Move all three under one of the CFQ_GROUP_IOSCHED sections in the code. Reported-by: Vladimir Zapolskiy <vz@mleia.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-11block: fix ext_dev_lock lockdep reportDan Williams1-6/+6
================================= [ INFO: inconsistent lock state ] 4.1.0-rc7+ #217 Tainted: G O --------------------------------- inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage. swapper/6/0 [HC0[0]:SC1[1]:HE1:SE0] takes: (ext_devt_lock){+.?...}, at: [<ffffffff8143a60c>] blk_free_devt+0x3c/0x70 {SOFTIRQ-ON-W} state was registered at: [<ffffffff810bf6b1>] __lock_acquire+0x461/0x1e70 [<ffffffff810c1947>] lock_acquire+0xb7/0x290 [<ffffffff818ac3a8>] _raw_spin_lock+0x38/0x50 [<ffffffff8143a07d>] blk_alloc_devt+0x6d/0xd0 <-- take the lock in process context [..] [<ffffffff810bf64e>] __lock_acquire+0x3fe/0x1e70 [<ffffffff810c00ad>] ? __lock_acquire+0xe5d/0x1e70 [<ffffffff810c1947>] lock_acquire+0xb7/0x290 [<ffffffff8143a60c>] ? blk_free_devt+0x3c/0x70 [<ffffffff818ac3a8>] _raw_spin_lock+0x38/0x50 [<ffffffff8143a60c>] ? blk_free_devt+0x3c/0x70 [<ffffffff8143a60c>] blk_free_devt+0x3c/0x70 <-- take the lock in softirq [<ffffffff8143bfec>] part_release+0x1c/0x50 [<ffffffff8158edf6>] device_release+0x36/0xb0 [<ffffffff8145ac2b>] kobject_cleanup+0x7b/0x1a0 [<ffffffff8145aad0>] kobject_put+0x30/0x70 [<ffffffff8158f147>] put_device+0x17/0x20 [<ffffffff8143c29c>] delete_partition_rcu_cb+0x16c/0x180 [<ffffffff8143c130>] ? read_dev_sector+0xa0/0xa0 [<ffffffff810e0e0f>] rcu_process_callbacks+0x2ff/0xa90 [<ffffffff810e0dcf>] ? rcu_process_callbacks+0x2bf/0xa90 [<ffffffff81067e2e>] __do_softirq+0xde/0x600 Neil sees this in his tests and it also triggers on pmem driver unbind for the libnvdimm tests. This fix is on top of an initial fix by Keith for incorrect usage of mutex_lock() in this path: 2da78092dda1 "block: Fix dev_t minor allocation lifetime". Both this and 2da78092dda1 are candidates for -stable. Fixes: 2da78092dda1 ("block: Fix dev_t minor allocation lifetime") Cc: <stable@vger.kernel.org> Cc: Keith Busch <keith.busch@intel.com> Reported-by: NeilBrown <neilb@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-10cfq-iosched: fix the setting of IOPS mode on SSDsJens Axboe2-1/+16
A previous commit wanted to make CFQ default to IOPS mode on non-rotational storage, however it did so when the queue was initialized and the non-rotational flag is only set later on in the probe. Add an elevator hook that gets called off the add_disk() path, at that point we know that feature probing has finished, and we can reliably check for the various flags that drivers can set. Fixes: 41c0126b ("block: Make CFQ default to IOPS mode on SSDs") Tested-by: Romain Francoise <romain@orebokech.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-09blk-mq: free hctx->ctxs in queue's release handlerMing Lei1-2/+6
Now blk_cleanup_queue() can be called before calling del_gendisk()[1], inside which hctx->ctxs is touched from blk_mq_unregister_hctx(), but the variable has been freed by blk_cleanup_queue() at that time. So this patch moves freeing of hctx->ctxs into queue's release handler for fixing the oops reported by Stefan. [1], 6cd18e711dd8075 (block: destroy bdi before blockdev is unregistered) Reported-by: Stefan Seyfried <stefan.seyfried@googlemail.com> Cc: NeilBrown <neilb@suse.de> Cc: Christoph Hellwig <hch@lst.de> Cc: stable@vger.kernel.org (v4.0) Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-07block, cgroup: implement policy-specific per-blkcg dataArianna Avanzini3-29/+173
The block IO (blkio) controller enables the block layer to provide service guarantees in a hierarchical fashion. Specifically, service guarantees are provided by registered request-accounting policies. As of now, a proportional-share and a throttling policy are available. They are implemented, respectively, by the CFQ I/O scheduler and the blk-throttle subsystem. Unfortunately, as for adding new policies, the current implementation of the block IO controller is only halfway ready to allow new policies to be plugged in. This commit provides a solution to make the block IO controller fully ready to handle new policies. In what follows, we first describe briefly the current state, and then list the changes made by this commit. The throttling policy does not need any per-cgroup information to perform its task. In contrast, the proportional share policy uses, for each cgroup, both the weight assigned by the user to the cgroup, and a set of dynamically- computed weights, one for each device. The first, user-defined weight is stored in the blkcg data structure: the block IO controller allocates a private blkcg data structure for each cgroup in the blkio cgroups hierarchy (regardless of which policy is active). In other words, the block IO controller internally mirrors the blkio cgroups with private blkcg data structures. On the other hand, for each cgroup and device, the corresponding dynamically- computed weight is maintained in the following, different way. For each device, the block IO controller keeps a private blkcg_gq structure for each cgroup in blkio. In other words, block IO also keeps one private mirror copy of the blkio cgroups hierarchy for each device, made of blkcg_gq structures. Each blkcg_gq structure keeps per-policy information in a generic array of dynamically-allocated 'dedicated' data structures, one for each registered policy (so currently the array contains two elements). To be inserted into the generic array, each dedicated data structure embeds a generic blkg_policy_data structure. Consider now the array contained in the blkcg_gq structure corresponding to a given pair of cgroup and device: one of the elements of the array contains the dedicated data structure for the proportional-share policy, and this dedicated data structure contains the dynamically-computed weight for that pair of cgroup and device. The generic strategy adopted for storing per-policy data in blkcg_gq structures is already capable of handling new policies, whereas the one adopted with blkcg structures is not, because per-policy data are hard-coded in the blkcg structures themselves (currently only data related to the proportional- share policy). This commit addresses the above issues through the following changes: . It generalizes blkcg structures so that per-policy data are stored in the same way as in blkcg_gq structures. Specifically, it lets also the blkcg structure store per-policy data in a generic array of dynamically-allocated dedicated data structures. We will refer to these data structures as blkcg dedicated data structures, to distinguish them from the dedicated data structures inserted in the generic arrays kept by blkcg_gq structures. To allow blkcg dedicated data structures to be inserted in the generic array inside a blkcg structure, this commit also introduces a new blkcg_policy_data structure, which is the equivalent of blkg_policy_data for blkcg dedicated data structures. . It adds to the blkcg_policy structure, i.e., to the descriptor of a policy, a cpd_size field and a cpd_init field, to be initialized by the policy with, respectively, the size of the blkcg dedicated data structures, and the address of a constructor function for blkcg dedicated data structures. . It moves the CFQ-specific fields embedded in the blkcg data structure (i.e., the fields related to the proportional-share policy), into a new blkcg dedicated data structure called cfq_group_data. Signed-off-by: Paolo Valente <paolo.valente@unimore.it> Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-05block: Make CFQ default to IOPS mode on SSDsTahsin Erdogan1-1/+1
CFQ idling causes reduced IOPS throughput on non-rotational disks. Since disk head seeking is not applicable to SSDs, it doesn't really help performance by anticipating future near-by IO requests. By turning off idling (and switching to IOPS mode), we allow other processes to dispatch IO requests down to the driver and so increase IO throughput. Following FIO benchmark results were taken on a cloud SSD offering with idling on and off: Idling iops avg-lat(ms) stddev bw ------------------------------------------------------ On 7054 90.107 38.697 28217KB/s Off 29255 21.836 11.730 117022KB/s fio --name=temp --size=100G --time_based --ioengine=libaio \ --randrepeat=0 --direct=1 --invalidate=1 --verify=0 \ --verify_fatal=0 --rw=randread --blocksize=4k --group_reporting=1 \ --filename=/dev/sdb --runtime=10 --iodepth=64 --numjobs=10 And the following is from a local SSD run: Idling iops avg-lat(ms) stddev bw ------------------------------------------------------ On 19320 33.043 14.068 77281KB/s Off 21626 29.465 12.662 86507KB/s fio --name=temp --size=5G --time_based --ioengine=libaio \ --randrepeat=0 --direct=1 --invalidate=1 --verify=0 \ --verify_fatal=0 --rw=randread --blocksize=4k --group_reporting=1 \ --filename=/fio_data --runtime=10 --iodepth=64 --numjobs=10 Reviewed-by: Nauman Rafique <nauman@google.com> Signed-off-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02writeback, blkcg: propagate non-root blkcg congestion stateTejun Heo1-6/+9
Now that bdi layer can handle per-blkcg bdi_writeback_congested state, blk_{set|clear}_congested() can propagate non-root blkcg congestion state to them. This can be easily achieved by disabling the root_rl tests in blk_{set|clear}_congested(). Note that we still need those tests when !CONFIG_CGROUP_WRITEBACK as otherwise we'll end up flipping root blkcg wb's congestion state for events happening on other blkcgs. v2: Updated for bdi_writeback_congested. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02writeback, blkcg: restructure blk_{set|clear}_queue_congested()Tejun Heo1-25/+37
blk_{set|clear}_queue_congested() take @q and set or clear, respectively, the congestion state of its bdi's root wb. Because bdi used to be able to handle congestion state only on the root wb, the callers of those functions tested whether the congestion is on the root blkcg and skipped if not. This is cumbersome and makes implementation of per cgroup bdi_writeback congestion state propagation difficult. This patch renames blk_{set|clear}_queue_congested() to blk_{set|clear}_congested(), and makes them take request_list instead of request_queue and test whether the specified request_list is the root one before updating bdi_writeback congestion state. This makes the tests in the callers unnecessary and simplifies them. As there are no external users of these functions, the definitions are moved from include/linux/blkdev.h to block/blk-core.c. This patch doesn't introduce any noticeable behavior difference. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02writeback, blkcg: associate each blkcg_gq with the corresponding bdi_writeback_congestedTejun Heo1-2/+15
A blkg (blkcg_gq) can be congested and decongested independently from other blkgs on the same request_queue. Accordingly, for cgroup writeback support, the congestion status at bdi (backing_dev_info) should be split and updated separately from matching blkg's. This patch prepares by adding blkg->wb_congested and associating a blkg with its matching per-blkcg bdi_writeback_congested on creation. v2: Updated to associate bdi_writeback_congested instead of bdi_writeback. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02writeback: make backing_dev_info host cgroup-specific bdi_writebacksTejun Heo1-1/+6
For the planned cgroup writeback support, on each bdi (backing_dev_info), each memcg will be served by a separate wb (bdi_writeback). This patch updates bdi so that a bdi can host multiple wbs (bdi_writebacks). On the default hierarchy, blkcg implicitly enables memcg. This allows using memcg's page ownership for attributing writeback IOs, and every memcg - blkcg combination can be served by its own wb by assigning a dedicated wb to each memcg. This means that there may be multiple wb's of a bdi mapped to the same blkcg. As congested state is per blkcg - bdi combination, those wb's should share the same congested state. This is achieved by tracking congested state via bdi_writeback_congested structs which are keyed by blkcg. bdi->wb remains unchanged and will keep serving the root cgroup. cgwb's (cgroup wb's) for non-root cgroups are created on-demand or looked up while dirtying an inode according to the memcg of the page being dirtied or current task. Each cgwb is indexed on bdi->cgwb_tree by its memcg id. Once an inode is associated with its wb, it can be retrieved using inode_to_wb(). Currently, none of the filesystems has FS_CGROUP_WRITEBACK and all pages will keep being associated with bdi->wb. v3: inode_attach_wb() in account_page_dirtied() moved inside mapping_cap_account_dirty() block where it's known to be !NULL. Also, an unnecessary NULL check before kfree() removed. Both detected by the kbuild bot. v2: Updated so that wb association is per inode and wb is per memcg rather than blkcg. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: kbuild test robot <fengguang.wu@intel.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02writeback: add {CONFIG|BDI_CAP|FS}_CGROUP_WRITEBACKTejun Heo1-1/+1
cgroup writeback requires support from both bdi and filesystem sides. Add BDI_CAP_CGROUP_WRITEBACK and FS_CGROUP_WRITEBACK to indicate support and enable BDI_CAP_CGROUP_WRITEBACK on block based bdi's by default. Also, define CONFIG_CGROUP_WRITEBACK which is enabled if both MEMCG and BLK_CGROUP are enabled. inode_cgwb_enabled() which determines whether a given inode's both bdi and fs support cgroup writeback is added. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>