aboutsummaryrefslogtreecommitdiffstats
AgeCommit message (Collapse)AuthorFilesLines
2017-04-19block, bfq: modify the peak-rate estimatorPaolo Valente1-125/+372
Unless the maximum budget B_max that BFQ can assign to a queue is set explicitly by the user, BFQ automatically updates B_max. In particular, BFQ dynamically sets B_max to the number of sectors that can be read, at the current estimated peak rate, during the maximum time, T_max, allowed before a budget timeout occurs. In formulas, if we denote as R_est the estimated peak rate, then B_max = T_max ∗ R_est. Hence, the higher R_est is with respect to the actual device peak rate, the higher the probability that processes incur budget timeouts unjustly is. Besides, a too high value of B_max unnecessarily increases the deviation from an ideal, smooth service. Unfortunately, it is not trivial to estimate the peak rate correctly: because of the presence of sw and hw queues between the scheduler and the device components that finally serve I/O requests, it is hard to say exactly when a given dispatched request is served inside the device, and for how long. As a consequence, it is hard to know precisely at what rate a given set of requests is actually served by the device. On the opposite end, the dispatch time of any request is trivially available, and, from this piece of information, the "dispatch rate" of requests can be immediately computed. So, the idea in the next function is to use what is known, namely request dispatch times (plus, when useful, request completion times), to estimate what is unknown, namely in-device request service rate. The main issue is that, because of the above facts, the rate at which a certain set of requests is dispatched over a certain time interval can vary greatly with respect to the rate at which the same requests are then served. But, since the size of any intermediate queue is limited, and the service scheme is lossless (no request is silently dropped), the following obvious convergence property holds: the number of requests dispatched MUST become closer and closer to the number of requests completed as the observation interval grows. This is the key property used in this new version of the peak-rate estimator. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-19block, bfq: improve throughput boostingPaolo Valente1-46/+41
The feedback-loop algorithm used by BFQ to compute queue (process) budgets is basically a set of three update rules, one for each of the main reasons why a queue may be expired. If many processes suddenly switch from sporadic I/O to greedy and sequential I/O, then these rules are quite slow to assign large budgets to these processes, and hence to achieve a high throughput. On the opposite side, BFQ assigns the maximum possible budget B_max to a just-created queue. This allows a high throughput to be achieved immediately if the associated process is I/O-bound and performs sequential I/O from the beginning. But it also increases the worst-case latency experienced by the first requests issued by the process, because the larger the budget of a queue waiting for service is, the later the queue will be served by B-WF2Q+ (Subsec 3.3 in [1]). This is detrimental for an interactive or soft real-time application. To tackle these throughput and latency problems, on one hand this patch changes the initial budget value to B_max/2. On the other hand, it re-tunes the three rules, adopting a more aggressive, multiplicative increase/linear decrease scheme. This scheme trades latency for throughput more than before, and tends to assign large budgets quickly to processes that are or become I/O-bound. For two of the expiration reasons, the new version of the rules also contains some more little improvements, briefly described below. *No more backlog.* In this case, the budget was larger than the number of sectors actually read/written by the process before it stopped doing I/O. Hence, to reduce latency for the possible future I/O requests of the process, the old rule simply set the next budget to the number of sectors actually consumed by the process. However, if there are still outstanding requests, then the process may have not yet issued its next request just because it is still waiting for the completion of some of the still outstanding ones. If this sub-case holds true, then the new rule, instead of decreasing the budget, doubles it, proactively, in the hope that: 1) a larger budget will fit the actual needs of the process, and 2) the process is sequential and hence a higher throughput will be achieved by serving the process longer after granting it access to the device. *Budget timeout*. The original rule set the new budget to the maximum value B_max, to maximize throughput and let all processes experiencing budget timeouts receive the same share of the device time. In our experiments we verified that this sudden jump to B_max did not provide sensible benefits; rather it increased the latency of processes performing sporadic and short I/O. The new rule only doubles the budget. [1] P. Valente and M. Andreolini, "Improving Application Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of the 5th Annual International Systems and Storage Conference (SYSTOR '12), June 2012. Slightly extended version: http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite- results.pdf Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-19block, bfq: add full hierarchical scheduling and cgroups supportArianna Avanzini4-312/+2141
Add complete support for full hierarchical scheduling, with a cgroups interface. Full hierarchical scheduling is implemented through the 'entity' abstraction: both bfq_queues, i.e., the internal BFQ queues associated with processes, and groups are represented in general by entities. Given the bfq_queues associated with the processes belonging to a given group, the entities representing these queues are sons of the entity representing the group. At higher levels, if a group, say G, contains other groups, then the entity representing G is the parent entity of the entities representing the groups in G. Hierarchical scheduling is performed as follows: if the timestamps of a leaf entity (i.e., of a bfq_queue) change, and such a change lets the entity become the next-to-serve entity for its parent entity, then the timestamps of the parent entity are recomputed as a function of the budget of its new next-to-serve leaf entity. If the parent entity belongs, in its turn, to a group, and its new timestamps let it become the next-to-serve for its parent entity, then the timestamps of the latter parent entity are recomputed as well, and so on. When a new bfq_queue must be set in service, the reverse path is followed: the next-to-serve highest-level entity is chosen, then its next-to-serve child entity, and so on, until the next-to-serve leaf entity is reached, and the bfq_queue that this entity represents is set in service. Writeback is accounted for on a per-group basis, i.e., for each group, the async I/O requests of the processes of the group are enqueued in a distinct bfq_queue, and the entity associated with this queue is a child of the entity associated with the group. Weights can be assigned explicitly to groups and processes through the cgroups interface, differently from what happens, for single processes, if the cgroups interface is not used (as explained in the description of the previous patch). In particular, since each node has a full scheduler, each group can be assigned its own weight. Signed-off-by: Fabio Checconi <fchecconi@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-19block, bfq: introduce the BFQ-v0 I/O scheduler as an extra schedulerPaolo Valente5-0/+4697
We tag as v0 the version of BFQ containing only BFQ's engine plus hierarchical support. BFQ's engine is introduced by this commit, while hierarchical support is added by next commit. We use the v0 tag to distinguish this minimal version of BFQ from the versions containing also the features and the improvements added by next commits. BFQ-v0 coincides with the version of BFQ submitted a few years ago [1], apart from the introduction of preemption, described below. BFQ is a proportional-share I/O scheduler, whose general structure, plus a lot of code, are borrowed from CFQ. - Each process doing I/O on a device is associated with a weight and a (bfq_)queue. - BFQ grants exclusive access to the device, for a while, to one queue (process) at a time, and implements this service model by associating every queue with a budget, measured in number of sectors. - After a queue is granted access to the device, the budget of the queue is decremented, on each request dispatch, by the size of the request. - The in-service queue is expired, i.e., its service is suspended, only if one of the following events occurs: 1) the queue finishes its budget, 2) the queue empties, 3) a "budget timeout" fires. - The budget timeout prevents processes doing random I/O from holding the device for too long and dramatically reducing throughput. - Actually, as in CFQ, a queue associated with a process issuing sync requests may not be expired immediately when it empties. In contrast, BFQ may idle the device for a short time interval, giving the process the chance to go on being served if it issues a new request in time. Device idling typically boosts the throughput on rotational devices, if processes do synchronous and sequential I/O. In addition, under BFQ, device idling is also instrumental in guaranteeing the desired throughput fraction to processes issuing sync requests (see [2] for details). - With respect to idling for service guarantees, if several processes are competing for the device at the same time, but all processes (and groups, after the following commit) have the same weight, then BFQ guarantees the expected throughput distribution without ever idling the device. Throughput is thus as high as possible in this common scenario. - Queues are scheduled according to a variant of WF2Q+, named B-WF2Q+, and implemented using an augmented rb-tree to preserve an O(log N) overall complexity. See [2] for more details. B-WF2Q+ is also ready for hierarchical scheduling. However, for a cleaner logical breakdown, the code that enables and completes hierarchical support is provided in the next commit, which focuses exactly on this feature. - B-WF2Q+ guarantees a tight deviation with respect to an ideal, perfectly fair, and smooth service. In particular, B-WF2Q+ guarantees that each queue receives a fraction of the device throughput proportional to its weight, even if the throughput fluctuates, and regardless of: the device parameters, the current workload and the budgets assigned to the queue. - The last, budget-independence, property (although probably counterintuitive in the first place) is definitely beneficial, for the following reasons: - First, with any proportional-share scheduler, the maximum deviation with respect to an ideal service is proportional to the maximum budget (slice) assigned to queues. As a consequence, BFQ can keep this deviation tight not only because of the accurate service of B-WF2Q+, but also because BFQ *does not* need to assign a larger budget to a queue to let the queue receive a higher fraction of the device throughput. - Second, BFQ is free to choose, for every process (queue), the budget that best fits the needs of the process, or best leverages the I/O pattern of the process. In particular, BFQ updates queue budgets with a simple feedback-loop algorithm that allows a high throughput to be achieved, while still providing tight latency guarantees to time-sensitive applications. When the in-service queue expires, this algorithm computes the next budget of the queue so as to: - Let large budgets be eventually assigned to the queues associated with I/O-bound applications performing sequential I/O: in fact, the longer these applications are served once got access to the device, the higher the throughput is. - Let small budgets be eventually assigned to the queues associated with time-sensitive applications (which typically perform sporadic and short I/O), because, the smaller the budget assigned to a queue waiting for service is, the sooner B-WF2Q+ will serve that queue (Subsec 3.3 in [2]). - Weights can be assigned to processes only indirectly, through I/O priorities, and according to the relation: weight = 10 * (IOPRIO_BE_NR - ioprio). The next patch provides, instead, a cgroups interface through which weights can be assigned explicitly. - If several processes are competing for the device at the same time, but all processes and groups have the same weight, then BFQ guarantees the expected throughput distribution without ever idling the device. It uses preemption instead. Throughput is then much higher in this common scenario. - ioprio classes are served in strict priority order, i.e., lower-priority queues are not served as long as there are higher-priority queues. Among queues in the same class, the bandwidth is distributed in proportion to the weight of each queue. A very thin extra bandwidth is however guaranteed to the Idle class, to prevent it from starving. - If the strict_guarantees parameter is set (default: unset), then BFQ - always performs idling when the in-service queue becomes empty; - forces the device to serve one I/O request at a time, by dispatching a new request only if there is no outstanding request. In the presence of differentiated weights or I/O-request sizes, both the above conditions are needed to guarantee that every queue receives its allotted share of the bandwidth (see Documentation/block/bfq-iosched.txt for more details). Setting strict_guarantees may evidently affect throughput. [1] https://lkml.org/lkml/2008/4/1/234 https://lkml.org/lkml/2008/11/11/148 [2] P. Valente and M. Andreolini, "Improving Application Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of the 5th Annual International Systems and Storage Conference (SYSTOR '12), June 2012. Slightly extended version: http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite- results.pdf Signed-off-by: Fabio Checconi <fchecconi@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-19nbd: set the max segment size to UINT_MAXJosef Bacik1-0/+1
NBD doesn't care about limiting the segment size, let the user push the largest bio's they want. This allows us to control the request size solely through max_sectors_kb. Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-19Merge branch 'stable/for-jens-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen into for-4.12/blockJens Axboe1-0/+3
Konrad writes: It has one fix - to emit an uevent whenever the size of the guest disk image changes.
2017-04-18blkfront: add uevent for size changeMarc Olson1-0/+3
When a blkfront device is resized from dom0, emit a KOBJ_CHANGE uevent to notify the guest about the change. This allows for custom udev rules, such as automatically resizing a filesystem, when an event occurs. With this patch you get these udev KERNEL[577.206230] change /devices/vbd-51728/block/xvdb (block) UDEV [577.226218] change /devices/vbd-51728/block/xvdb (block) Signed-off-by: Marc Olson <marcolso@amazon.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2017-04-17nbd: add a flag to destroy an nbd device on disconnectJosef Bacik2-1/+35
For ease of management it would be nice for users to specify that the device node for a nbd device is destroyed once it is disconnected and there are no more users. Add a client flag and enable this operation to happen. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-17nbd: add device refcountingJosef Bacik1-18/+86
In order to support deleting the device on disconnect we need to refcount the actual nbd_device struct. So add the refcounting framework and change how we free the normal devices at rmmod time so we can catch reference leaks. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-17nbd: add a status netlink commandJosef Bacik2-0/+133
Allow users to query the status of existing nbd devices. Right now this only returns whether or not the device is connected, but could be extended in the future to include more information. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-17nbd: handle dead connectionsJosef Bacik2-4/+60
Sometimes we like to upgrade our server without making all of our clients freak out and reconnect. This patch provides a way to specify a dead connection timeout to allow us to pause all requests and wait for new connections to be opened. With this in place I can take down the nbd server for less than the dead connection timeout time and bring it back up and everything resumes gracefully. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-17nbd: only clear the queue on device teardownJosef Bacik1-1/+3
When running a disconnect torture test I noticed that sometimes we would crash with a negative ref count on our queue. This was because we were ending the same request twice. Turns out we were racing with NBD_CLEAR_SOCK clearing the requests as well as the teardown of the device clearing the requests. So instead make the ioctl only shutdown the sockets and make it so that we only ever run nbd_clear_que from the device teardown. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-17nbd: multicast dead link notificationsJosef Bacik2-14/+81
Provide a mechanism to notify userspace that there's been a link problem on a NBD device. This will allow userspace to re-establish a connection and provide the new socket to the device without disrupting the device. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-17nbd: add a reconfigure netlink commandJosef Bacik2-0/+142
We want to be able to reconnect dead connections to existing block devices, so add a reconfigure netlink command. We will also allow users to change their timeout on the fly, but everything else will require a disconnect and reconnect. You won't be able to add more connections either, simply replace dead connections with new more lively connections. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-17nbd: add a basic netlink interfaceJosef Bacik2-26/+359
The existing ioctl interface for configuring NBD devices is a bit cumbersome and hard to extend. The other problem is we leave a userspace app sitting in it's syscall until the device disconnects, which is less than ideal. This patch introduces a netlink interface for adding and disconnecting nbd devices. This has the benefits of being easily extendable without breaking older userspace applications, and allows us to configure a nbd device without leaving a userspace app sitting waiting for the device to disconnect. With this interface we also gain the ability to configure more devices than are preallocated at insmod time. We also have gained the ability to not specify a particular device and be provided one for us so that userspace doesn't need to find a free device to configure. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-17nbd: stop using the bdev everywhereJosef Bacik1-55/+37
In preparation for the upcoming netlink interface we need to not rely on already having the bdev for the NBD device we are doing operations on. Instead of passing the bdev around, just use it in places where we know we already have the bdev. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-17nbd: separate out the config informationJosef Bacik1-174/+263
In order to properly refcount the various aspects of a NBD device we need to separate out the configuration elements of the nbd device. The configuration of a NBD device has a different lifetime from the actual device, so it doesn't make sense to bundle these two concepts. Add a config_refs to keep track of the configuration structure, that way we can be sure that we never access it when we've torn down the device. Add a new nbd_config structure to hold all of the transient configuration information. Finally create this when we open the device so that it is in place when we start to configure the device. This has a nice side-effect of fixing a long standing problem where you could end up with a half-configured nbd device that needed to be "disconnected" in order to be usable again. Now once we close our device the configuration will be discarded. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-17nbd: handle single path failures gracefullyJosef Bacik1-26/+125
Currently if we have multiple connections and one of them goes down we will tear down the whole device. However there's no reason we need to do this as we could have other connections that are working fine. Deal with this by keeping track of the state of the different connections, and if we lose one we mark it as dead and send all IO destined for that socket to one of the other healthy sockets. Any outstanding requests that were on the dead socket will timeout and be re-submitted properly. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-17nbd: put socket in error casesJosef Bacik1-2/+7
When adding a new socket we look it up and then try to add it to our configuration. If any of those steps fail we need to make sure we put the socket so we don't leak them. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-16lightnvm: fix some error code in pblk-init.cDan Carpenter1-6/+14
There were a bunch of places in pblk_lines_init() where we didn't set an error code. And in pblk_writer_init() we accidentally return 1 instead of a correct error code, which would result in a Oops later. Fixes: 11a5d6fdf919 ("lightnvm: physical block device (pblk) target") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-16lightnvm: fix some WARN() messagesDan Carpenter3-8/+8
WARN_ON() takes a condition, not an error message. I slightly tweaked some conditions so hopefully it's more clear. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-16lightnvm: pblk-gc: fix an error pointer dereference in initDan Carpenter1-2/+2
These labels are reversed so we could end up dereferencing an error pointer or leaking. Fixes: 7f347ba6bb3a ("lightnvm: physical block device (pblk) target") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-16lightnvm: physical block device (pblk) targetJavier González15-0/+8044
This patch introduces pblk, a host-side translation layer for Open-Channel SSDs to expose them like block devices. The translation layer allows data placement decisions, and I/O scheduling to be managed by the host, enabling users to optimize the SSD for their specific workloads. An open-channel SSD has a set of LUNs (parallel units) and a collection of blocks. Each block can be read in any order, but writes must be sequential. Writes may also fail, and if a block requires it, must also be reset before new writes can be applied. To manage the constraints, pblk maintains a logical to physical address (L2P) table, write cache, garbage collection logic, recovery scheme, and logic to rate-limit user I/Os versus garbage collection I/Os. The L2P table is fully-associative and manages sectors at a 4KB granularity. Pblk stores the L2P table in two places, in the out-of-band area of the media and on the last page of a line. In the cause of a power failure, pblk will perform a scan to recover the L2P table. The user data is organized into lines. A line is data striped across blocks and LUNs. The lines enable the host to reduce the amount of metadata to maintain besides the user data and makes it easier to implement RAID or erasure coding in the future. pblk implements multi-tenant support and can be instantiated multiple times on the same drive. Each instance owns a portion of the SSD - both regarding I/O bandwidth and capacity - providing I/O isolation for each case. Finally, pblk also exposes a sysfs interface that allows user-space to peek into the internals of pblk. The interface is available at /dev/block/*/pblk/ where * is the block device name exposed. This work also contains contributions from: Matias Bjørling <matias@cnexlabs.com> Simon A. F. Lund <slund@cnexlabs.com> Young Tack Jin <youngtack.jin@gmail.com> Huaicheng Li <huaicheng@cs.uchicago.edu> Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-16lightnvm: convert sprintf into strlcpyJavier González1-3/+3
Convert sprintf calls to strlcpy in order to make possible buffer overflow more obvious. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-16lightnvm: fix type checks on rrpcJavier González1-2/+2
sector_t is always unsigned, therefore avoid < 0 checks on it. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-16lightnvm: clean unused variableJavier González1-3/+0
Clean unused variable on lightnvm core. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-16lightnvm: make nvm_free staticJavier González1-1/+1
Prefix the nvm_free static function with a missing static keyword. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-16lightnvm: allow to init targets on factory modeJavier González4-5/+19
Target initialization has two responsibilities: creating the target partition and instantiating the target. This patch enables to create a factory partition (e.g., do not trigger recovery on the given target). This is useful for target development and for being able to restore the device state at any moment in time without requiring a full-device erase. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-16lightnvm: bad type conversion for nvme control bitsJavier González1-1/+1
The NVMe I/O command control bits are 16 bytes, but is interpreted as 32 bytes in the lightnvm user I/O data path. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-16lightnvm: fix cleanup order of disk on init errorJavier González1-7/+7
Reorder disk allocation such that the disk structure can be put safely. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-16lightnvm: double-clear of dev->lun_map on target init errorJavier González1-7/+10
The dev->lun_map bits are cleared twice if an target init error occurs. First in the target clean routine, and then next in the nvm_tgt_create error function. Make sure that it is only cleared once by extending nvm_remove_tgt_devi() with a clear bit, such that clearing of bits can ignored when cleaning up a successful initialized target. Signed-off-by: Javier González <javier@cnexlabs.com> Fix style. Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-16lightnvm: don't check for failure from mempool_alloc()NeilBrown1-9/+0
mempool_alloc() cannot fail if the gfp flags allow it to sleep, and both GFP_KERNEL and GFP_NOIO allows for sleeping. So rrpc_move_valid_pages() and rrpc_make_rq() don't need to test the return value. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-16lightnvm: enable nvme size compile assertsMatias Bjørling1-2/+4
The asserts in _nvme_nvm_check_size are not compiled due to the function not begin called. Make sure that it is called, and also fix the wrong sizes of asserts for nvme_nvm_addr_format, and nvme_nvm_bb_tbl, which checked for number of bits instead of bytes. Reported-by: Scott Bauer <scott.bauer@intel.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-16lightnvm: free reverse device mapJavier González1-1/+13
Free the reverse mapping table correctly on target tear down Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-16lightnvm: rename scrambler controller hintJavier González1-1/+1
According to the OCSSD 1.2 specification, the 0x200 hint enables the media scrambler for the read/write opcode, providing that the controller has been correctly configured by the firmware. Rename the macro to represent this meaning. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-16lightnvm: submit erases using the I/O pathJavier González4-50/+47
Until now erases have been submitted as synchronous commands through a dedicated erase function. In order to enable targets implementing asynchronous erases, refactor the erase path so that it uses the normal async I/O submission functions. If a target requires sync I/O, it can implement it internally. Also, adapt rrpc to use the new erase path. Signed-off-by: Javier González <javier@cnexlabs.com> Fixed spelling error. Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-16nvme/lightnvm: Prevent small buffer overflow in nvme_nvm_identifyScott Bauer1-1/+1
There are two closely named structs in lightnvm: struct nvme_nvm_addr_format and struct nvme_addr_format. The first struct has 4 reserved bytes at the end, the second does not. (gdb) p sizeof(struct nvme_nvm_addr_format) $1 = 16 (gdb) p sizeof(struct nvm_addr_format) $2 = 12 In the nvme_nvm_identify function we memcpy from the larger struct to the smaller struct. We incorrectly pass the length of the larger struct and overflow by 4 bytes, lets not do that. Signed-off-by: Scott Bauer <scott.bauer@intel.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-16lightnvm: Fix error handlingChristophe JAILLET1-2/+4
According to error handling in this function, it is likely that going to 'out' was expected here. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-14net: off by one in inet6_pton()Dan Carpenter1-1/+1
If "scope_len" is sizeof(scope_id) then we would put the NUL terminator one space beyond the end of the buffer. Fixes: b1a951fe469e ("net/utils: generic inet_pton_with_scope helper") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-14blk-mq: introduce Kyber multiqueue I/O schedulerOmar Sandoval4-0/+743
The Kyber I/O scheduler is an I/O scheduler for fast devices designed to scale to multiple queues. Users configure only two knobs, the target read and synchronous write latencies, and the scheduler tunes itself to achieve that latency goal. The implementation is based on "tokens", built on top of the scalable bitmap library. Tokens serve as a mechanism for limiting requests. There are two tiers of tokens: queueing tokens and dispatch tokens. A queueing token is required to allocate a request. In fact, these tokens are actually the blk-mq internal scheduler tags, but the scheduler manages the allocation directly in order to implement its policy. Dispatch tokens are device-wide and split up into two scheduling domains: reads vs. writes. Each hardware queue dispatches batches round-robin between the scheduling domains as long as tokens are available for that domain. These tokens can be used as the mechanism to enable various policies. The policy Kyber uses is inspired by active queue management techniques for network routing, similar to blk-wbt. The scheduler monitors latencies and scales the number of dispatch tokens accordingly. Queueing tokens are used to prevent starvation of synchronous requests by asynchronous requests. Various extensions are possible, including better heuristics and ionice support. The new scheduler isn't set as the default yet. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-14blk-mq-sched: make completed_request() callback more usefulOmar Sandoval3-10/+8
Currently, this callback is called right after put_request() and has no distinguishable purpose. Instead, let's call it before put_request() as soon as I/O has completed on the request, before we account it in blk-stat. With this, Kyber can enable stats when it sees a latency outlier and make sure the outlier gets accounted. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-14blk-mq: export helpersOmar Sandoval2-0/+3
blk_mq_finish_request() is required for schedulers that define their own put_request(). blk_mq_run_hw_queue() is required for schedulers that hold back requests to be run later. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-14blk-mq: add shallow depth option for blk_mq_get_tag()Omar Sandoval2-1/+5
Wire up the sbitmap_get_shallow() operation to the tag code so that a caller can limit the number of tags available to it. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-14sbitmap: add sbitmap_get_shallow() operationOmar Sandoval2-7/+123
This operation supports the use case of limiting the number of bits that can be allocated for a given operation. Rather than setting aside some bits at the end of the bitmap, we can set aside bits in each word of the bitmap. This means we can keep the allocation hints spread out and support sbitmap_resize() nicely at the cost of lower granularity for the allowed depth. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-14remove the mg_disk driverChristoph Hellwig5-1257/+0
This drivers was added in 2008, but as far as a I can tell we never had a single platform that actually registered resources for the platform driver. It's also been unmaintained for a long time and apparently has a ATA mode that can be driven using the IDE/libata subsystem. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-11block: Fix list corruption of blk stats callback listJan Kara1-8/+4
When CFQ calls wbt_disable_default(), it will call blk_stat_remove_callback() to stop gathering IO statistics for the purposes of writeback throttling. Later, when request_queue is unregistered, wbt_exit() will call blk_stat_remove_callback() again which will try to delete callback from the list again and possibly cause list corruption. Fix the problem by making wbt_disable_default() called wbt_exit() which is properly guarded against being called multiple times. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-10blk-mq: Show symbolic names for hctx state and flagsBart Van Assche1-3/+34
Instead of showing the hctx state and flags as numbers, show the names of the flags. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Omar Sandoval <osandov@fb.com> Cc: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-10blk-mq: Export queue state through /sys/kernel/debug/block/*/stateBart Van Assche1-0/+106
Make it possible to check whether or not a block layer queue has been stopped. Make it possible to start and to run a blk-mq queue from user space. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Omar Sandoval <osandov@fb.com> Cc: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-08scsi: sd: Remove LBPRZ dependency for discardsMartin K. Petersen1-19/+6
Separating discards and zeroout operations allows us to remove the LBPRZ block zeroing constraints from discards and honor the device preferences for UNMAP commands. If supported by the device, we'll also choose UNMAP over one of the WRITE SAME variants for discards. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-08scsi: sd: Separate zeroout and discard command choicesMartin K. Petersen2-3/+61
Now that zeroout and discards are distinct operations we need to separate the policy of choosing the appropriate command. Create a zeroing_mode which can be one of: write: Zeroout assist not present, use regular WRITE writesame: Allow WRITE SAME(10/16) with a zeroed payload writesame_16_unmap: Allow WRITE SAME(16) with UNMAP writesame_10_unmap: Allow WRITE SAME(10) with UNMAP The last two are conditional on the device being thin provisioned with LBPRZ=1 and LBPWS=1 or LBPWS10=1 respectively. Whether to set the UNMAP bit or not depends on the REQ_NOUNMAP flag. And if none of the _unmap variants are supported, regular WRITE SAME will be used if the device supports it. The zeroout_mode is exported in sysfs and the detected mode for a given device can be overridden using the string constants above. With this change in place we can now issue WRITE SAME(16) with UNMAP set for block zeroing applications that require hard guarantees and logical_block_size granularity. And at the same time use the UNMAP command with the device's preferred granulary and alignment for discard operations. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>