aboutsummaryrefslogtreecommitdiffstats
path: root/Documentation
diff options
context:
space:
mode:
authorLinus Torvalds <torvalds@linux-foundation.org>2014-01-30 11:19:05 -0800
committerLinus Torvalds <torvalds@linux-foundation.org>2014-01-30 11:19:05 -0800
commitf568849edac8611d603e00bd6cbbcfea09395ae6 (patch)
treeb9472d640fe5d87426d38c9d81d946cf197ad3fb /Documentation
parentMerge branch 'for-3.14' of git://linux-nfs.org/~bfields/linux (diff)
parentxtensa: fixup simdisk driver to work with immutable bio_vecs (diff)
downloadlinux-dev-f568849edac8611d603e00bd6cbbcfea09395ae6.tar.xz
linux-dev-f568849edac8611d603e00bd6cbbcfea09395ae6.zip
Merge branch 'for-3.14/core' of git://git.kernel.dk/linux-block
Pull core block IO changes from Jens Axboe: "The major piece in here is the immutable bio_ve series from Kent, the rest is fairly minor. It was supposed to go in last round, but various issues pushed it to this release instead. The pull request contains: - Various smaller blk-mq fixes from different folks. Nothing major here, just minor fixes and cleanups. - Fix for a memory leak in the error path in the block ioctl code from Christian Engelmayer. - Header export fix from CaiZhiyong. - Finally the immutable biovec changes from Kent Overstreet. This enables some nice future work on making arbitrarily sized bios possible, and splitting more efficient. Related fixes to immutable bio_vecs: - dm-cache immutable fixup from Mike Snitzer. - btrfs immutable fixup from Muthu Kumar. - bio-integrity fix from Nic Bellinger, which is also going to stable" * 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits) xtensa: fixup simdisk driver to work with immutable bio_vecs block/blk-mq-cpu.c: use hotcpu_notifier() blk-mq: for_each_* macro correctness block: Fix memory leak in rw_copy_check_uvector() handling bio-integrity: Fix bio_integrity_verify segment start bug block: remove unrelated header files and export symbol blk-mq: uses page->list incorrectly blk-mq: use __smp_call_function_single directly btrfs: fix missing increment of bi_remaining Revert "block: Warn and free bio if bi_end_io is not set" block: Warn and free bio if bi_end_io is not set blk-mq: fix initializing request's start time block: blk-mq: don't export blk_mq_free_queue() block: blk-mq: make blk_sync_queue support mq block: blk-mq: support draining mq queue dm cache: increment bi_remaining when bi_end_io is restored block: fixup for generic bio chaining block: Really silence spurious compiler warnings block: Silence spurious compiler warnings block: Kill bio_pair_split() ...
Diffstat (limited to 'Documentation')
-rw-r--r--Documentation/block/biodoc.txt7
-rw-r--r--Documentation/block/biovecs.txt111
2 files changed, 114 insertions, 4 deletions
diff --git a/Documentation/block/biodoc.txt b/Documentation/block/biodoc.txt
index 8df5e8e6dceb..2101e718670d 100644
--- a/Documentation/block/biodoc.txt
+++ b/Documentation/block/biodoc.txt
@@ -447,14 +447,13 @@ struct bio_vec {
* main unit of I/O for the block layer and lower layers (ie drivers)
*/
struct bio {
- sector_t bi_sector;
struct bio *bi_next; /* request queue link */
struct block_device *bi_bdev; /* target device */
unsigned long bi_flags; /* status, command, etc */
unsigned long bi_rw; /* low bits: r/w, high: priority */
unsigned int bi_vcnt; /* how may bio_vec's */
- unsigned int bi_idx; /* current index into bio_vec array */
+ struct bvec_iter bi_iter; /* current index into bio_vec array */
unsigned int bi_size; /* total size in bytes */
unsigned short bi_phys_segments; /* segments after physaddr coalesce*/
@@ -480,7 +479,7 @@ With this multipage bio design:
- Code that traverses the req list can find all the segments of a bio
by using rq_for_each_segment. This handles the fact that a request
has multiple bios, each of which can have multiple segments.
-- Drivers which can't process a large bio in one shot can use the bi_idx
+- Drivers which can't process a large bio in one shot can use the bi_iter
field to keep track of the next bio_vec entry to process.
(e.g a 1MB bio_vec needs to be handled in max 128kB chunks for IDE)
[TBD: Should preferably also have a bi_voffset and bi_vlen to avoid modifying
@@ -589,7 +588,7 @@ driver should not modify these values. The block layer sets up the
nr_sectors and current_nr_sectors fields (based on the corresponding
hard_xxx values and the number of bytes transferred) and updates it on
every transfer that invokes end_that_request_first. It does the same for the
-buffer, bio, bio->bi_idx fields too.
+buffer, bio, bio->bi_iter fields too.
The buffer field is just a virtual address mapping of the current segment
of the i/o buffer in cases where the buffer resides in low-memory. For high
diff --git a/Documentation/block/biovecs.txt b/Documentation/block/biovecs.txt
new file mode 100644
index 000000000000..74a32ad52f53
--- /dev/null
+++ b/Documentation/block/biovecs.txt
@@ -0,0 +1,111 @@
+
+Immutable biovecs and biovec iterators:
+=======================================
+
+Kent Overstreet <kmo@daterainc.com>
+
+As of 3.13, biovecs should never be modified after a bio has been submitted.
+Instead, we have a new struct bvec_iter which represents a range of a biovec -
+the iterator will be modified as the bio is completed, not the biovec.
+
+More specifically, old code that needed to partially complete a bio would
+update bi_sector and bi_size, and advance bi_idx to the next biovec. If it
+ended up partway through a biovec, it would increment bv_offset and decrement
+bv_len by the number of bytes completed in that biovec.
+
+In the new scheme of things, everything that must be mutated in order to
+partially complete a bio is segregated into struct bvec_iter: bi_sector,
+bi_size and bi_idx have been moved there; and instead of modifying bv_offset
+and bv_len, struct bvec_iter has bi_bvec_done, which represents the number of
+bytes completed in the current bvec.
+
+There are a bunch of new helper macros for hiding the gory details - in
+particular, presenting the illusion of partially completed biovecs so that
+normal code doesn't have to deal with bi_bvec_done.
+
+ * Driver code should no longer refer to biovecs directly; we now have
+ bio_iovec() and bio_iovec_iter() macros that return literal struct biovecs,
+ constructed from the raw biovecs but taking into account bi_bvec_done and
+ bi_size.
+
+ bio_for_each_segment() has been updated to take a bvec_iter argument
+ instead of an integer (that corresponded to bi_idx); for a lot of code the
+ conversion just required changing the types of the arguments to
+ bio_for_each_segment().
+
+ * Advancing a bvec_iter is done with bio_advance_iter(); bio_advance() is a
+ wrapper around bio_advance_iter() that operates on bio->bi_iter, and also
+ advances the bio integrity's iter if present.
+
+ There is a lower level advance function - bvec_iter_advance() - which takes
+ a pointer to a biovec, not a bio; this is used by the bio integrity code.
+
+What's all this get us?
+=======================
+
+Having a real iterator, and making biovecs immutable, has a number of
+advantages:
+
+ * Before, iterating over bios was very awkward when you weren't processing
+ exactly one bvec at a time - for example, bio_copy_data() in fs/bio.c,
+ which copies the contents of one bio into another. Because the biovecs
+ wouldn't necessarily be the same size, the old code was tricky convoluted -
+ it had to walk two different bios at the same time, keeping both bi_idx and
+ and offset into the current biovec for each.
+
+ The new code is much more straightforward - have a look. This sort of
+ pattern comes up in a lot of places; a lot of drivers were essentially open
+ coding bvec iterators before, and having common implementation considerably
+ simplifies a lot of code.
+
+ * Before, any code that might need to use the biovec after the bio had been
+ completed (perhaps to copy the data somewhere else, or perhaps to resubmit
+ it somewhere else if there was an error) had to save the entire bvec array
+ - again, this was being done in a fair number of places.
+
+ * Biovecs can be shared between multiple bios - a bvec iter can represent an
+ arbitrary range of an existing biovec, both starting and ending midway
+ through biovecs. This is what enables efficient splitting of arbitrary
+ bios. Note that this means we _only_ use bi_size to determine when we've
+ reached the end of a bio, not bi_vcnt - and the bio_iovec() macro takes
+ bi_size into account when constructing biovecs.
+
+ * Splitting bios is now much simpler. The old bio_split() didn't even work on
+ bios with more than a single bvec! Now, we can efficiently split arbitrary
+ size bios - because the new bio can share the old bio's biovec.
+
+ Care must be taken to ensure the biovec isn't freed while the split bio is
+ still using it, in case the original bio completes first, though. Using
+ bio_chain() when splitting bios helps with this.
+
+ * Submitting partially completed bios is now perfectly fine - this comes up
+ occasionally in stacking block drivers and various code (e.g. md and
+ bcache) had some ugly workarounds for this.
+
+ It used to be the case that submitting a partially completed bio would work
+ fine to _most_ devices, but since accessing the raw bvec array was the
+ norm, not all drivers would respect bi_idx and those would break. Now,
+ since all drivers _must_ go through the bvec iterator - and have been
+ audited to make sure they are - submitting partially completed bios is
+ perfectly fine.
+
+Other implications:
+===================
+
+ * Almost all usage of bi_idx is now incorrect and has been removed; instead,
+ where previously you would have used bi_idx you'd now use a bvec_iter,
+ probably passing it to one of the helper macros.
+
+ I.e. instead of using bio_iovec_idx() (or bio->bi_iovec[bio->bi_idx]), you
+ now use bio_iter_iovec(), which takes a bvec_iter and returns a
+ literal struct bio_vec - constructed on the fly from the raw biovec but
+ taking into account bi_bvec_done (and bi_size).
+
+ * bi_vcnt can't be trusted or relied upon by driver code - i.e. anything that
+ doesn't actually own the bio. The reason is twofold: firstly, it's not
+ actually needed for iterating over the bio anymore - we only use bi_size.
+ Secondly, when cloning a bio and reusing (a portion of) the original bio's
+ biovec, in order to calculate bi_vcnt for the new bio we'd have to iterate
+ over all the biovecs in the new bio - which is silly as it's not needed.
+
+ So, don't use bi_vcnt anymore.