aboutsummaryrefslogtreecommitdiffstats
path: root/block (follow)
AgeCommit message (Collapse)AuthorFilesLines
2016-02-19block: Add blk_set_runtime_active()Mika Westerberg1-0/+24
If block device is left runtime suspended during system suspend, resume hook of the driver typically corrects runtime PM status of the device back to "active" after it is resumed. However, this is not enough as queue's runtime PM status is still "suspended". As long as it is in this state blk_pm_peek_request() returns NULL and thus prevents new requests to be processed. Add new function blk_set_runtime_active() that can be used to force the queue status back to "active" as needed. Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com> Acked-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Tejun Heo <tj@kernel.org>
2016-02-17Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds6-13/+23
Pull block fixes from Jens Axboe: "A collection of fixes from the past few weeks that should go into 4.5. This contains: - Overflow fix for sysfs discard show function from Alan. - A stacking limit init fix for max_dev_sectors, so we don't end up artificially capping some use cases. From Keith. - Have blk-mq proper end unstarted requests on a dying queue, instead of pushing that to the driver. From Keith. - NVMe: - Update to Kconfig description for NVME_SCSI, since it was vague and having it on is important for some SUSE distros. From Christoph. - Set of fixes from Keith, around surprise removal. Also kills the no-merge flag, so it supports merging. - Set of fixes for lightnvm from Matias, Javier, and Wenwei. - Fix null_blk oops when asked for lightnvm, but not available. From Matias. - Copy-to-user EINTR fix from Hannes, fixing a case where SG_IO fails if interrupted by a signal. - Two floppy fixes from Jiri, fixing signal handling and blocking open. - A use-after-free fix for O_DIRECT, from Mike Krinkin. - A block module ref count fix from Roman Pen. - An fs IO wait accounting fix for O_DSYNC from Stephane Gasparini. - Smaller reallo fix for xen-blkfront from Bob Liu. - Removal of an unused struct member in the deadline IO scheduler, from Tahsin. - Also from Tahsin, properly initialize inode struct members associated with cgroup writeback, if enabled. - From Tejun, ensure that we keep the superblock pinned during cgroup writeback" * 'for-linus' of git://git.kernel.dk/linux-block: (25 commits) blk: fix overflow in queue_discard_max_hw_show writeback: initialize inode members that track writeback history writeback: keep superblock pinned during cgroup writeback association switches bio: return EINTR if copying to user space got interrupted NVMe: Rate limit nvme IO warnings NVMe: Poll device while still active during remove NVMe: Requeue requests on suspended queues NVMe: Allow request merges NVMe: Fix io incapable return values blk-mq: End unstarted requests on dying queue block: Initialize max_dev_sectors to 0 null_blk: oops when initializing without lightnvm block: fix module reference leak on put_disk() call for cgroups throttle nvme: fix Kconfig description for BLK_DEV_NVME_SCSI kernel/fs: fix I/O wait not accounted for RW O_DSYNC floppy: refactor open() flags handling lightnvm: allow to force mm initialization lightnvm: check overflow and correct mlc pairs lightnvm: fix request intersection locking in rrpc lightnvm: warn if irqs are disabled in lock laddr ...
2016-02-17blk: fix overflow in queue_discard_max_hw_showAlan1-3/+2
We get this right for queue_discard_max_show but not max_hw_show. Follow the same pattern as queue_discard_max_show instead so that we don't truncate. Signed-off-by: Alan Cox <alan@linux.intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-02-14blk-mq: mark request queue as mq asapMing Lei1-1/+3
Currently q->mq_ops is used widely to decide if the queue is mq or not, so we should set the 'flag' asap so that both block core and drivers can get the correct mq info. For example, commit 868f2f0b720(blk-mq: dynamic h/w context count) moves the hctx's initialization before setting q->mq_ops in blk_mq_init_allocated_queue(), then cause blk_alloc_flush_queue() to think the queue is non-mq and don't allocate command size for the per-hctx flush rq. This patches should fix the problem reported by Sasha. Cc: Keith Busch <keith.busch@intel.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Fixes: 868f2f0b720 ("blk-mq: dynamic h/w context count") Signed-off-by: Jens Axboe <axboe@fb.com>
2016-02-12Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsiLinus Torvalds1-2/+4
Pull SCSI fixes from James Bottomley: "A set of seven fixes: Two regressions in the new hisi_sas arm driver, a blacklist entry for the marvell console which was causing a reset cascade without it, a race fix in the WRITE_SAME/DISCARD routines, a retry fix for the rdac driver, without which, it would prematurely return EIO and a couple of fixes for the hyper-v storvsc driver" * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: block/sd: Return -EREMOTEIO when WRITE SAME and DISCARD are disabled SCSI: Add Marvell Console to VPD blacklist scsi_dh_rdac: always retry MODE SELECT on command lock violation storvsc: Use the specified target ID in device lookup storvsc: Install the storvsc specific timeout handler for FC devices hisi_sas: fix v1 hw check for slot error hisi_sas: add dependency for HAS_IOMEM
2016-02-12bio: return EINTR if copying to user space got interruptedHannes Reinecke1-2/+5
Commit 35dc248383bbab0a7203fca4d722875bc81ef091 introduced a check for current->mm to see if we have a user space context and only copies data if we do. Now if an IO gets interrupted by a signal data isn't copied into user space any more (as we don't have a user space context) but user space isn't notified about it. This patch modifies the behaviour to return -EINTR from bio_uncopy_user() to notify userland that a signal has interrupted the syscall, otherwise it could lead to a situation where the caller may get a buffer with no data returned. This can be reproduced by issuing SG_IO ioctl()s in one thread while constantly sending signals to it. Fixes: 35dc248 [SCSI] sg: Fix user memory corruption when SG_IO is interrupted by a signal Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Hannes Reinecke <hare@suse.de> Cc: stable@vger.kernel.org # v.3.11+ Signed-off-by: Jens Axboe <axboe@fb.com>
2016-02-11blk-mq: End unstarted requests on dying queueKeith Busch1-2/+4
Go directly to ending a request if it wasn't started. Previously, completing a request may invoke a driver callback for a request it didn't initialize. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Johannes Thumshirn <jthumshirn at suse.de> Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-02-11block: Initialize max_dev_sectors to 0Keith Busch1-2/+2
The new queue limit is not used by the majority of block drivers, and should be initialized to 0 for the driver's requested settings to be used. Signed-off-by: Keith Busch <keith.busch@intel.com> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-02-11block: Initialize max_dev_sectors to 0Keith Busch1-2/+2
The new queue limit is not used by the majority of block drivers, and should be initialized to 0 for the driver's requested settings to be used. Signed-off-by: Keith Busch <keith.busch@intel.com> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-02-09blk-mq: dynamic h/w context countKeith Busch3-73/+110
The hardware's provided queue count may change at runtime with resource provisioning. This patch allows a block driver to alter the number of h/w queues available when its resource count changes. The main part is a new blk-mq API to request a new number of h/w queues for a given live tag set. The new API freezes all queues using that set, then adjusts the allocated count prior to remapping these to CPUs. The bulk of the rest just shifts where h/w contexts and all their artifacts are allocated and freed. The number of max h/w contexts is capped to the number of possible cpus since there is no use for more than that. As such, all pre-allocated memory for pointers need to account for the max possible rather than the initial number of queues. A side effect of this is that the blk-mq will proceed successfully as long as it can allocate at least one h/w context. Previously it would fail request queue initialization if less than the requested number was allocated. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-02-09block: fix module reference leak on put_disk() call for cgroups throttleRoman Pen1-0/+9
get_disk(),get_gendisk() calls have non explicit side effect: they increase the reference on the disk owner module. The following is the correct sequence how to get a disk reference and to put it: disk = get_gendisk(...); /* use disk */ owner = disk->fops->owner; put_disk(disk); module_put(owner); fs/block_dev.c is aware of this required module_put() call, but f.e. blkg_conf_finish(), which is located in block/blk-cgroup.c, does not put a module reference. To see a leakage in action cgroups throttle config can be used. In the following script I'm removing throttle for /dev/ram0 (actually this is NOP, because throttle was never set for this device): # lsmod | grep brd brd 5175 0 # i=100; while [ $i -gt 0 ]; do echo "1:0 0" > \ /sys/fs/cgroup/blkio/blkio.throttle.read_bps_device; i=$(($i - 1)); \ done # lsmod | grep brd brd 5175 100 Now brd module has 100 references. The issue is fixed by calling module_put() just right away put_disk(). Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com> Cc: Gi-Oh Kim <gi-oh.kim@profitbricks.com> Cc: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>
2016-02-09kernel/fs: fix I/O wait not accounted for RW O_DSYNCStephane Gasparini1-1/+1
When a process is doing Random Write with O_DSYNC flag the I/O wait are not accounted in the kernel (get_cpu_iowait_time_us). This is preventing the governor or the cpufreq driver to account for I/O wait and thus use the right pstate Signed-off-by: Stephane Gasparini <stephane.gasparini@linux.intel.com> Signed-off-by: Philippe Longepe <philippe.longepe@linux.intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-02-04Merge remote-tracking branch 'mkp-scsi/4.5/scsi-fixes' into fixesJames Bottomley1-2/+4
2016-02-04block/sd: Return -EREMOTEIO when WRITE SAME and DISCARD are disabledMartin K. Petersen1-2/+4
When a storage device rejects a WRITE SAME command we will disable write same functionality for the device and return -EREMOTEIO to the block layer. -EREMOTEIO will in turn prevent DM from retrying the I/O and/or failing the path. Yiwen Jiang discovered a small race where WRITE SAME requests issued simultaneously would cause -EIO to be returned. This happened because any requests being prepared after WRITE SAME had been disabled for the device caused us to return BLKPREP_KILL. The latter caused the block layer to return -EIO upon completion. To overcome this we introduce BLKPREP_INVALID which indicates that this is an invalid request for the device. blk_peek_request() is modified to return -EREMOTEIO in that case. Reported-by: Yiwen Jiang <jiangyiwen@huawei.com> Suggested-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Hannes Reinicke <hare@suse.de> Reviewed-by: Ewan Milne <emilne@redhat.com> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2016-02-04cfq-iosched: Allow parent cgroup to preempt its childJan Kara1-1/+18
Currently we don't allow sync workload of one cgroup to preempt sync workload of any other cgroup. This is because we want to achieve service separation between cgroups. However in cases where cgroup preempting is ancestor of the current cgroup, there is no need of separation and idling introduces unnecessary overhead. This hurts for example the case when workload is isolated within a cgroup but journalling threads are in root cgroup. Simple way to demostrate the issue is using: dbench4 -c /usr/share/dbench4/client.txt -t 10 -D /mnt 1 on ext4 filesystem on plain SATA drive (mounted with barrier=0 to make difference more visible). When all processes are in the root cgroup, reported throughput is 153.132 MB/sec. When dbench process gets its own blkio cgroup, reported throughput drops to 26.1006 MB/sec. Fix the problem by making check in cfq_should_preempt() more benevolent and allow preemption by ancestor cgroup. This improves the throughput reported by dbench4 to 48.9106 MB/sec. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jan Kara <jack@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-02-04cfq-iosched: Allow sync noidle workloads to preempt each otherJan Kara1-1/+0
The original idea with preemption of sync noidle queues (introduced in commit 718eee0579b8 "cfq-iosched: fairness for sync no-idle queues") was that we service all sync noidle queues together, we don't idle on any of the queues individually and we idle only if there is no sync noidle queue to be served. This intention also matches the original test: if (cfqd->serving_type == SYNC_NOIDLE_WORKLOAD && new_cfqq->service_tree == cfqq->service_tree) return true; However since at that time cfqq->service_tree was not set for idling queues, this test was unreliable and was replaced in commit e4a229196a7c "cfq-iosched: fix no-idle preemption logic" by: if (cfqd->serving_type == SYNC_NOIDLE_WORKLOAD && cfqq_type(new_cfqq) == SYNC_NOIDLE_WORKLOAD && new_cfqq->service_tree->count == 1) return true; That was a reliable test but was actually doing something different - now we preempt sync noidle queue only if the new queue is the only one busy in the service tree. These days cfq queue is kept in service tree even if it is idling and thus the original check would be safe again. But since we actually check that cfq queues are in the same cgroup, of the same priority class and workload type (sync noidle), we know that new_cfqq is fine to preempt cfqq. So just remove the service tree check. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jan Kara <jack@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-02-04cfq-iosched: Reorder checks in cfq_should_preempt()Jan Kara1-6/+7
Move check for preemption by rt class up. There is no functional change but it makes arguing about conditions simpler since we can be sure both cfq queues are from the same ioprio class. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jan Kara <jack@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-02-04cfq-iosched: Don't group_idle if cfqq has big thinktimeJan Kara1-2/+8
There is no point in idling on a cfq group if the only cfq queue that is there has too big thinktime. Signed-off-by: Jan Kara <jack@suse.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-02-01Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimmLinus Torvalds2-41/+15
Pull libnvdimm fixes from Dan Williams: "1/ Fixes to the libnvdimm 'pfn' device that establishes a reserved area for storing a struct page array. 2/ Fixes for dax operations on a raw block device to prevent pagecache collisions with dax mappings. 3/ A fix for pfn_t usage in vm_insert_mixed that lead to a null pointer de-reference. These have received build success notification from the kbuild robot across 153 configs and pass the latest ndctl tests" * 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: phys_to_pfn_t: use phys_addr_t mm: fix pfn_t to page conversion in vm_insert_mixed block: use DAX for partition table reads block: revert runtime dax control of the raw block device fs, block: force direct-I/O for dax-enabled block devices devm_memremap_pages: fix vmem_altmap lifetime + alignment handling libnvdimm, pfn: fix restoring memmap location libnvdimm: fix mode determination for e820 devices
2016-02-01deadline: remove unused struct memberTahsin Erdogan1-3/+0
commit 63de428b139d3d31d86ebe25ae97b33f6540fb7e ("deadline-iosched: allow non-sequential batching") removed last use of last_sector. Signed-off-by: Tahsin Erdogan <tahsin@google.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-01-30block: use DAX for partition table readsDan Williams1-3/+15
Avoid populating pagecache when the block device is in DAX mode. Otherwise these page cache entries collide with the fsync/msync implementation and break data durability guarantees. Cc: Jan Kara <jack@suse.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Andrew Morton <akpm@linux-foundation.org> Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com> Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Matthew Wilcox <willy@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-30block: revert runtime dax control of the raw block deviceDan Williams1-38/+0
Dynamically enabling DAX requires that the page cache first be flushed and invalidated. This must occur atomically with the change of DAX mode otherwise we confuse the fsync/msync tracking and violate data durability guarantees. Eliminate the possibilty of DAX-disabled to DAX-enabled transitions for now and revisit this for the next cycle. Cc: Jan Kara <jack@suse.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-29Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds1-7/+19
Pull block layer fix from Jens Axboe: "This just contains the fix for the split issue that we had in -rc1. It's been well tested at this point, so let's get it in mainline so we don't have the same split issue for -rc2" * 'for-linus' of git://git.kernel.dk/linux-block: block: fix bio splitting on max sectors
2016-01-23Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdmaLinus Torvalds2-225/+1
Pull rdma updates from Doug Ledford: "Initial roundup of 4.5 merge window patches - Remove usage of ib_query_device and instead store attributes in ib_device struct - Move iopoll out of block and into lib, rename to irqpoll, and use in several places in the rdma stack as our new completion queue polling library mechanism. Update the other block drivers that already used iopoll to use the new mechanism too. - Replace the per-entry GID table locks with a single GID table lock - IPoIB multicast cleanup - Cleanups to the IB MR facility - Add support for 64bit extended IB counters - Fix for netlink oops while parsing RDMA nl messages - RoCEv2 support for the core IB code - mlx4 RoCEv2 support - mlx5 RoCEv2 support - Cross Channel support for mlx5 - Timestamp support for mlx5 - Atomic support for mlx5 - Raw QP support for mlx5 - MAINTAINERS update for mlx4/mlx5 - Misc ocrdma, qib, nes, usNIC, cxgb3, cxgb4, mlx4, mlx5 updates - Add support for remote invalidate to the iSER driver (pushed through the RDMA tree due to dependencies, acknowledged by nab) - Update to NFSoRDMA (pushed through the RDMA tree due to dependencies, acknowledged by Bruce)" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (169 commits) IB/mlx5: Unify CQ create flags check IB/mlx5: Expose Raw Packet QP to user space consumers {IB, net}/mlx5: Move the modify QP operation table to mlx5_ib IB/mlx5: Support setting Ethernet priority for Raw Packet QPs IB/mlx5: Add Raw Packet QP query functionality IB/mlx5: Add create and destroy functionality for Raw Packet QP IB/mlx5: Refactor mlx5_ib_qp to accommodate other QP types IB/mlx5: Allocate a Transport Domain for each ucontext net/mlx5_core: Warn on unsupported events of QP/RQ/SQ net/mlx5_core: Add RQ and SQ event handling net/mlx5_core: Export transport objects IB/mlx5: Expose CQE version to user-space IB/mlx5: Add CQE version 1 support to user QPs and SRQs IB/mlx5: Fix data validation in mlx5_ib_alloc_ucontext IB/sa: Fix netlink local service GFP crash IB/srpt: Remove redundant wc array IB/qib: Improve ipoib UD performance IB/mlx4: Advertise RoCE v2 support IB/mlx4: Create and use another QP1 for RoCEv2 IB/mlx4: Enable send of RoCE QP1 packets with IP/UDP headers ...
2016-01-22block: fix bio splitting on max sectorsMing Lei1-7/+19
After commit e36f62042880(block: split bios to maxpossible length), bio can be splitted in the middle of a vector entry, then it is easy to split out one bio which size isn't aligned with block size, especially when the block size is bigger than 512. This patch fixes the issue by making the max io size aligned to logical block size. Fixes: e36f62042880(block: split bios to maxpossible length) Reported-by: Stefan Haberland <sth@linux.vnet.ibm.com> Cc: Keith Busch <keith.busch@intel.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-01-22wrappers for ->i_mutex accessAl Viro1-2/+2
parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested}, inode_foo(inode) being mutex_foo(&inode->i_mutex). Please, use those for access to ->i_mutex; over the coming cycle ->i_mutex will become rwsem, with ->lookup() done with it held only shared. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-01-21Merge branch 'for-4.5/nvme' of git://git.kernel.dk/linux-blockLinus Torvalds5-18/+29
Pull NVMe updates from Jens Axboe: "Last branch for this series is the nvme changes. It's in a separate branch to avoid splitting too much between core and NVMe changes, since NVMe is still helping drive some blk-mq changes. That said, not a huge amount of core changes in here. The grunt of the work is the continued split of the code" * 'for-4.5/nvme' of git://git.kernel.dk/linux-block: (67 commits) uapi: update install list after nvme.h rename NVMe: Export NVMe attributes to sysfs group NVMe: Shutdown controller only for power-off NVMe: IO queue deletion re-write NVMe: Remove queue freezing on resets NVMe: Use a retryable error code on reset NVMe: Fix admin queue ring wrap nvme: make SG_IO support optional nvme: fixes for NVME_IOCTL_IO_CMD on the char device nvme: synchronize access to ctrl->namespaces nvme: Move nvme_freeze/unfreeze_queues to nvme core PCI/AER: include header file NVMe: Export namespace attributes to sysfs NVMe: Add pci error handlers block: remove REQ_NO_TIMEOUT flag nvme: merge iod and cmd_info nvme: meta_sg doesn't have to be an array nvme: properly free resources for cancelled command nvme: simplify completion handling nvme: special case AEN requests ...
2016-01-19Merge branch 'for-4.5/core' of git://git.kernel.dk/linux-blockLinus Torvalds9-48/+56
Pull core block updates from Jens Axboe: "We don't have a lot of core changes this time around, it's mostly in drivers, which will come in a subsequent pull. The cores changes include: - blk-mq - Prep patch from Christoph, changing blk_mq_alloc_request() to take flags instead of just using gfp_t for sleep/nosleep. - Doc patch from me, clarifying the difference between legacy and blk-mq for timer usage. - Fixes from Raghavendra for memory-less numa nodes, and a reuse of CPU masks. - Cleanup from Geliang Tang, using offset_in_page() instead of open coding it. - From Ilya, rename request_queue slab to it reflects what it holds, and a fix for proper use of bdgrab/put. - A real fix for the split across stripe boundaries from Keith. We yanked a broken version of this from 4.4-rc final, this one works. - From Mike Krinkin, emit a trace message when we split. - From Wei Tang, two small cleanups, not explicitly clearing memory that is already cleared" * 'for-4.5/core' of git://git.kernel.dk/linux-block: block: use bd{grab,put}() instead of open-coding block: split bios to max possible length block: add call to split trace point blk-mq: Avoid memoryless numa node encoded in hctx numa_node blk-mq: Reuse hardware context cpumask for tags blk-mq: add a flags parameter to blk_mq_alloc_request Revert "blk-flush: Queue through IO scheduler when flush not required" block: clarify blk_add_timer() use case for blk-mq bio: use offset_in_page macro block: do not initialise statics to 0 or NULL block: do not initialise globals to 0 or NULL block: rename request_queue slab cache
2016-01-13Merge tag 'libnvdimm-for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimmLinus Torvalds4-2/+686
Pull libnvdimm updates from Dan Williams: "The bulk of this has appeared in -next and independently received a build success notification from the kbuild robot. The 'for-4.5/block- dax' topic branch was rebased over the weekend to drop the "block device end-of-life" rework that Al would like to see re-implemented with a notifier, and to address bug reports against the badblocks integration. There is pending feedback against "libnvdimm: Add a poison list and export badblocks" received last week. Linda identified some localized fixups that we will handle incrementally. Summary: - Media error handling: The 'badblocks' implementation that originated in md-raid is up-levelled to a generic capability of a block device. This initial implementation is limited to being consulted in the pmem block-i/o path. Later, 'badblocks' will be consulted when creating dax mappings. - Raw block device dax: For virtualization and other cases that want large contiguous mappings of persistent memory, add the capability to dax-mmap a block device directly. - Increased /dev/mem restrictions: Add an option to treat all io-memory as IORESOURCE_EXCLUSIVE, i.e. disable /dev/mem access while a driver is actively using an address range. This behavior is controlled via the new CONFIG_IO_STRICT_DEVMEM option and can be overridden by the existing "iomem=relaxed" kernel command line option. - Miscellaneous fixes include a 'pfn'-device huge page alignment fix, block device shutdown crash fix, and other small libnvdimm fixes" * tag 'libnvdimm-for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (32 commits) block: kill disk_{check|set|clear|alloc}_badblocks libnvdimm, pmem: nvdimm_read_bytes() badblocks support pmem, dax: disable dax in the presence of bad blocks pmem: fail io-requests to known bad blocks libnvdimm: convert to statically allocated badblocks libnvdimm: don't fail init for full badblocks list block, badblocks: introduce devm_init_badblocks block: clarify badblocks lifetime badblocks: rename badblocks_free to badblocks_exit libnvdimm, pmem: move definition of nvdimm_namespace_add_poison to nd.h libnvdimm: Add a poison list and export badblocks nfit_test: Enable DSMs for all test NFITs md: convert to use the generic badblocks code block: Add badblock management for gendisks badblocks: Add core badblock management code block: fix del_gendisk() vs blkdev_ioctl crash block: enable dax for raw block devices block: introduce bdev_file_inode() restrict /dev/mem to idle io memory ranges arch: consolidate CONFIG_STRICT_DEVM in lib/Kconfig.debug ...
2016-01-12block: split bios to max possible lengthKeith Busch1-3/+16
This splits bio in the middle of a vector to form the largest possible bio at the h/w's desired alignment, and guarantees the bio being split will have some data. The criteria for splitting is changed from the max sectors to the h/w's optimal sector alignment if it is provided. For h/w that advertise their block storage's underlying chunk size, it's a big performance win to not submit commands that cross them. If sector alignment is not provided, this patch uses the max sectors as before. This addresses the performance issue commit d380561113 attempted to fix, but was reverted due to splitting logic error. Signed-off-by: Keith Busch <keith.busch@intel.com> Cc: Jens Axboe <axboe@fb.com> Cc: Ming Lei <tom.leiming@gmail.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: <stable@vger.kernel.org> # 4.4.x- Signed-off-by: Jens Axboe <axboe@fb.com>
2016-01-09block: kill disk_{check|set|clear|alloc}_badblocksDan Williams1-42/+0
These actions are completely managed by a block driver or can use the badblocks api directly. Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-09pmem, dax: disable dax in the presence of bad blocksDan Williams1-0/+10
Longer term teach dax to punch "error" holes in mapping requests and deliver SIGBUS to applications that consume a bad pmem page. For now, simply disable the dax performance optimization in the presence of known errors. Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-09block, badblocks: introduce devm_init_badblocksDan Williams1-13/+35
Provide a devres interface for initializing a badblocks instance. The pmem driver has several scenarios where it will be beneficial to have this structure automatically freed when the device is disabled / fails probe. Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-09block: clarify badblocks lifetimeDan Williams2-5/+2
The badblocks list attached to a gendisk is allocated by the driver which equates to the driver owning the lifetime of the object. Do not automatically free it in del_gendisk(). This is in preparation for expanding the use of badblocks in libnvdimm drivers and introducing devm_init_badblocks(). Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-09badblocks: rename badblocks_free to badblocks_exitDan Williams2-4/+4
For symmetry with badblocks_init() make it clear that this path only destroys incremental allocations of a badblocks instance, and does not free the badblocks instance itself. Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-09block: Add badblock management for gendisksVishal Verma1-0/+76
NVDIMM devices, which can behave more like DRAM rather than block devices, may develop bad cache lines, or 'poison'. A block device exposed by the pmem driver can then consume poison via a read (or write), and cause a machine check. On platforms without machine check recovery features, this would mean a crash. The block device maintaining a runtime list of all known sectors that have poison can directly avoid this, and also provide a path forward to enable proper handling/recovery for DAX faults on such a device. Use the new badblock management interfaces to add a badblocks list to gendisks. Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-09badblocks: Add core badblock management codeVishal Verma2-1/+562
Take the core badblocks implementation from md, and make it generally available. This follows the same style as kernel implementations of linked lists, rb-trees etc, where you can have a structure that can be embedded anywhere, and accessor functions to manipulate the data. The only changes in this copy of the code are ones to generalize function/variable names from md-specific ones. Also add init and free functions. Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-09block: fix del_gendisk() vs blkdev_ioctl crashDan Williams1-1/+0
When tearing down a block device early in its lifetime, userspace may still be performing discovery actions like blkdev_ioctl() to re-read partitions. The nvdimm_revalidate_disk() implementation depends on disk->driverfs_dev to be valid at entry. However, it is set to NULL in del_gendisk() and fatally this is happening *before* the disk device is deleted from userspace view. There's no reason for del_gendisk() to clear ->driverfs_dev. That device is the parent of the disk. It is guaranteed to not be freed until the disk, as a child, drops its ->parent reference. We could also fix this issue locally in nvdimm_revalidate_disk() by using disk_to_dev(disk)->parent, but lets fix it globally since ->driverfs_dev follows the lifetime of the parent. Longer term we should probably just add a @parent parameter to add_disk(), and stop carrying this pointer in the gendisk. BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffffa00340a8>] nvdimm_revalidate_disk+0x18/0x90 [libnvdimm] CPU: 2 PID: 538 Comm: systemd-udevd Tainted: G O 4.4.0-rc5 #2257 [..] Call Trace: [<ffffffff8143e5c7>] rescan_partitions+0x87/0x2c0 [<ffffffff810f37f9>] ? __lock_is_held+0x49/0x70 [<ffffffff81438c62>] __blkdev_reread_part+0x72/0xb0 [<ffffffff81438cc5>] blkdev_reread_part+0x25/0x40 [<ffffffff8143982d>] blkdev_ioctl+0x4fd/0x9c0 [<ffffffff811246c9>] ? current_kernel_time64+0x69/0xd0 [<ffffffff812916dd>] block_ioctl+0x3d/0x50 [<ffffffff81264c38>] do_vfs_ioctl+0x308/0x560 [<ffffffff8115dbd1>] ? __audit_syscall_entry+0xb1/0x100 [<ffffffff810031d6>] ? do_audit_syscall_entry+0x66/0x70 [<ffffffff81264f09>] SyS_ioctl+0x79/0x90 [<ffffffff81902672>] entry_SYSCALL_64_fastpath+0x12/0x76 Reported-by: Robert Hu <robert.hu@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-09block: enable dax for raw block devicesDan Williams1-0/+61
If an application wants exclusive access to all of the persistent memory provided by an NVDIMM namespace it can use this raw-block-dax facility to forgo establishing a filesystem. This capability is targeted primarily to hypervisors wanting to provision persistent memory for guests. It can be disabled / enabled dynamically via the new BLKDAXSET ioctl. Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Reported-by: kbuild test robot <fengguang.wu@intel.com> Reviewed-by: Jan Kara <jack@suse.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-08Revert "block: Split bios on chunk boundaries"Jens Axboe1-1/+1
This reverts commit d3805611130af9b911e908af9f67a3f64f4f0914. If we end up splitting on the first segment, we don't adjust the sector count. That results in hitting a BUG() with attempting to split 0 sectors. As this is just a performance issue and not a regression since 4.3 release, let's just rever this change. That gives us more time to test a real fix for 4.5, which would be marked for stable anyway.
2015-12-28block: add blk_start_queue_async()Jens Axboe1-0/+16
We currently only have an inline/sync helper to restart a stopped queue. If drivers need an async version, they have to roll their own. Add a generic helper instead. Signed-off-by: Jens Axboe <axboe@fb.com>
2015-12-22block: Split bios on chunk boundariesKeith Busch1-1/+1
For h/w that advertise their block storage's underlying chunk size, it's a big performance win to not submit commands that cross them. This patch uses that criteria if it is provided. If it is not provided, this patch uses the max sectors as before. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-12-22Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds1-2/+2
Pull block layer fixes from Jens Axboe: "Three small fixes for 4.4 final. Specifically: - The segment issue fix from Junichi, where the old IO path does a bio limit split before potentially bouncing the pages. We need to do that in the right order, to ensure that limitations are met. - A NVMe surprise removal IO hang fix from Keith. - A use-after-free in null_blk, introduced by a previous patch in this series. From Mike Krinkin" * 'for-linus' of git://git.kernel.dk/linux-block: null_blk: fix use-after-free error block: ensure to split after potentially bouncing a bio NVMe: IO ending fixes on surprise removal
2015-12-22block: ensure to split after potentially bouncing a bioJunichi Nomura1-2/+2
blk_queue_bio() does split then bounce, which makes the segment counting based on pages before bouncing and could go wrong. Move the split to after bouncing, like we do for blk-mq, and the we fix the issue of having the bio count for segments be wrong. Fixes: 54efd50bfd87 ("block: make generic_make_request handle arbitrarily sized bios") Cc: stable@vger.kernel.org Tested-by: Artem S. Tashkinov <t.artem@lycos.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-12-22block: remove REQ_NO_TIMEOUT flagChristoph Hellwig2-5/+0
This was added for the 'magic' AEN requests in the NVMe driver that never return. We now handle them purely inside the driver and don't need this core hack any more. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-12-22block: defer timeouts to a workqueueChristoph Hellwig4-6/+23
Timer context is not very useful for drivers to perform any meaningful abort action from. So instead of calling the driver from this useless context defer it to a workqueue as soon as possible. Note that while a delayed_work item would seem the right thing here I didn't dare to use it due to the magic in blk_add_timer that pokes deep into timer internals. But maybe this encourages Tejun to add a sensible API for that to the workqueue API and we'll all be fine in the end :) Contains a major update from Keith Bush: "This patch removes synchronizing the timeout work so that the timer can start a freeze on its own queue. The timer enters the queue, so timer context can only start a freeze, but not wait for frozen." Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-12-15Merge branch 'rdma-cq.2' of git://git.infradead.org/users/hch/rdma into 4.5/rdma-cqDoug Ledford2-225/+1
Signed-off-by: Doug Ledford <dledford@redhat.com> Conflicts: drivers/infiniband/ulp/srp/ib_srp.c - Conflicts with changes in ib_srp.c introduced during 4.4-rc updates
2015-12-12Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds1-0/+12
Pull block layer fixes from Jens Axboe: "A set of fixes for the current series. This contains: - A bunch of fixes for lightnvm, should be the last round for this series. From Matias and Wenwei. - A writeback detach inode fix from Ilya, also marked for stable. - A block (though it says SCSI) fix for an OOPS in SCSI runtime power management. - Module init error path fixes for null_blk from Minfei" * 'for-linus' of git://git.kernel.dk/linux-block: null_blk: Fix error path in module initialization lightnvm: do not compile in debugging by default lightnvm: prevent gennvm module unload on use lightnvm: fix media mgr registration lightnvm: replace req queue with nvmdev for lld lightnvm: comments on constants lightnvm: check mm before use lightnvm: refactor spin_unlock in gennvm_get_blk lightnvm: put blks when luns configure failed lightnvm: use flags in rrpc_get_blk block: detach bdev inode from its wb in __blkdev_put() SCSI: Fix NULL pointer dereference in runtime PM
2015-12-11irq_poll: make blk-iopoll available outside the block layerChristoph Hellwig2-225/+1
The new name is irq_poll as iopoll is already taken. Better suggestions welcome. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
2015-12-09blk-integrity: checking for NULL instead of IS_ERRDan Carpenter1-5/+4
We recently changed bio_integrity_alloc() to return ERR_PTRs instead of NULL but these calls were missed. Fixes: 06c1e3902aa7 ('blk-integrity: empty implementation when disabled') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@fb.com>