aboutsummaryrefslogtreecommitdiffstats
path: root/fs/btrfs/disk-io.c (follow)
AgeCommit message (Collapse)AuthorFilesLines
2017-09-14Merge branch 'zstd-minimal' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfsLinus Torvalds1-0/+2
Pull zstd support from Chris Mason: "Nick Terrell's patch series to add zstd support to the kernel has been floating around for a while. After talking with Dave Sterba, Herbert and Phillip, we decided to send the whole thing in as one pull request. zstd is a big win in speed over zlib and in compression ratio over lzo, and the compression team here at FB has gotten great results using it in production. Nick will continue to update the kernel side with new improvements from the open source zstd userland code. Nick has a number of benchmarks for the main zstd code in his lib/zstd commit: I ran the benchmarks on a Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM. The VM is running on a MacBook Pro with a 3.1 GHz Intel Core i7 processor, 16 GB of RAM, and a SSD. I benchmarked using `silesia.tar` [3], which is 211,988,480 B large. Run the following commands for the benchmark: sudo modprobe zstd_compress_test sudo mknod zstd_compress_test c 245 0 sudo cp silesia.tar zstd_compress_test The time is reported by the time of the userland `cp`. The MB/s is computed with 1,536,217,008 B / time(buffer size, hash) which includes the time to copy from userland. The Adjusted MB/s is computed with 1,536,217,088 B / (time(buffer size, hash) - time(buffer size, none)). The memory reported is the amount of memory the compressor requests. | Method | Size (B) | Time (s) | Ratio | MB/s | Adj MB/s | Mem (MB) | |----------|----------|----------|-------|---------|----------|----------| | none | 11988480 | 0.100 | 1 | 2119.88 | - | - | | zstd -1 | 73645762 | 1.044 | 2.878 | 203.05 | 224.56 | 1.23 | | zstd -3 | 66988878 | 1.761 | 3.165 | 120.38 | 127.63 | 2.47 | | zstd -5 | 65001259 | 2.563 | 3.261 | 82.71 | 86.07 | 2.86 | | zstd -10 | 60165346 | 13.242 | 3.523 | 16.01 | 16.13 | 13.22 | | zstd -15 | 58009756 | 47.601 | 3.654 | 4.45 | 4.46 | 21.61 | | zstd -19 | 54014593 | 102.835 | 3.925 | 2.06 | 2.06 | 60.15 | | zlib -1 | 77260026 | 2.895 | 2.744 | 73.23 | 75.85 | 0.27 | | zlib -3 | 72972206 | 4.116 | 2.905 | 51.50 | 52.79 | 0.27 | | zlib -6 | 68190360 | 9.633 | 3.109 | 22.01 | 22.24 | 0.27 | | zlib -9 | 67613382 | 22.554 | 3.135 | 9.40 | 9.44 | 0.27 | I benchmarked zstd decompression using the same method on the same machine. The benchmark file is located in the upstream zstd repo under `contrib/linux-kernel/zstd_decompress_test.c` [4]. The memory reported is the amount of memory required to decompress data compressed with the given compression level. If you know the maximum size of your input, you can reduce the memory usage of decompression irrespective of the compression level. | Method | Time (s) | MB/s | Adjusted MB/s | Memory (MB) | |----------|----------|---------|---------------|-------------| | none | 0.025 | 8479.54 | - | - | | zstd -1 | 0.358 | 592.15 | 636.60 | 0.84 | | zstd -3 | 0.396 | 535.32 | 571.40 | 1.46 | | zstd -5 | 0.396 | 535.32 | 571.40 | 1.46 | | zstd -10 | 0.374 | 566.81 | 607.42 | 2.51 | | zstd -15 | 0.379 | 559.34 | 598.84 | 4.61 | | zstd -19 | 0.412 | 514.54 | 547.77 | 8.80 | | zlib -1 | 0.940 | 225.52 | 231.68 | 0.04 | | zlib -3 | 0.883 | 240.08 | 247.07 | 0.04 | | zlib -6 | 0.844 | 251.17 | 258.84 | 0.04 | | zlib -9 | 0.837 | 253.27 | 287.64 | 0.04 | I ran a long series of tests and benchmarks on the btrfs side and the gains are very similar to the core benchmarks Nick ran" * 'zstd-minimal' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: squashfs: Add zstd support btrfs: Add zstd support lib: Add zstd modules lib: Add xxhash module
2017-09-09Merge branch 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linuxLinus Torvalds1-148/+94
Pull btrfs updates from David Sterba: "The changes range through all types: cleanups, core chagnes, sanity checks, fixes, other user visible changes, detailed list below: - deprecated: user transaction ioctl - mount option ssd does not change allocation alignments - degraded read-write mount is allowed if all the raid profile constraints are met, now based on more accurate check - defrag: do not reset compression afterwards; the NOCOMPRESS flag can be now overriden by defrag - prep work for better extent reference tracking (related to the qgroup slowness with balance) - prep work for compression heuristics - memory allocation reductions (may help latencies on a loaded system) - better accounting for io waiting states - error handling improvements (removed BUGs) - added more sanity checks for shared refs - fix readdir vs pagefault deadlock under some circumstances - fix for 'no-hole' mode, certain combination of compressed and inline extents - send: fix emission of invalid clone operations - fixup file mode if setting acls fail - more fixes from fuzzing - oher cleanups" * 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (104 commits) btrfs: submit superblock io with REQ_META and REQ_PRIO btrfs: remove unnecessary memory barrier in btrfs_direct_IO btrfs: remove superfluous chunk_tree argument from btrfs_alloc_dev_extent btrfs: Remove chunk_objectid parameter of btrfs_alloc_dev_extent btrfs: pass fs_info to btrfs_del_root instead of tree_root Btrfs: add one more sanity check for shared ref type Btrfs: remove BUG_ON in __add_tree_block Btrfs: remove BUG() in add_data_reference Btrfs: remove BUG() in print_extent_item Btrfs: remove BUG() in btrfs_extent_inline_ref_size Btrfs: convert to use btrfs_get_extent_inline_ref_type Btrfs: add a helper to retrive extent inline ref type btrfs: scrub: simplify scrub worker initialization btrfs: scrub: clean up division in scrub_find_csum btrfs: scrub: clean up division in __scrub_mark_bitmap btrfs: scrub: use bool for flush_all_writes btrfs: preserve i_mode if __btrfs_set_acl() fails btrfs: Remove extraneous chunk_objectid variable btrfs: Remove chunk_objectid argument from btrfs_make_block_group btrfs: Remove extra parentheses from condition in copy_items() ...
2017-09-07Merge branch 'for-4.14/block' of git://git.kernel.dk/linux-blockLinus Torvalds1-1/+1
Pull block layer updates from Jens Axboe: "This is the first pull request for 4.14, containing most of the code changes. It's a quiet series this round, which I think we needed after the churn of the last few series. This contains: - Fix for a registration race in loop, from Anton Volkov. - Overflow complaint fix from Arnd for DAC960. - Series of drbd changes from the usual suspects. - Conversion of the stec/skd driver to blk-mq. From Bart. - A few BFQ improvements/fixes from Paolo. - CFQ improvement from Ritesh, allowing idling for group idle. - A few fixes found by Dan's smatch, courtesy of Dan. - A warning fixup for a race between changing the IO scheduler and device remova. From David Jeffery. - A few nbd fixes from Josef. - Support for cgroup info in blktrace, from Shaohua. - Also from Shaohua, new features in the null_blk driver to allow it to actually hold data, among other things. - Various corner cases and error handling fixes from Weiping Zhang. - Improvements to the IO stats tracking for blk-mq from me. Can drastically improve performance for fast devices and/or big machines. - Series from Christoph removing bi_bdev as being needed for IO submission, in preparation for nvme multipathing code. - Series from Bart, including various cleanups and fixes for switch fall through case complaints" * 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits) kernfs: checking for IS_ERR() instead of NULL drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set drbd: Fix allyesconfig build, fix recent commit drbd: switch from kmalloc() to kmalloc_array() drbd: abort drbd_start_resync if there is no connection drbd: move global variables to drbd namespace and make some static drbd: rename "usermode_helper" to "drbd_usermode_helper" drbd: fix race between handshake and admin disconnect/down drbd: fix potential deadlock when trying to detach during handshake drbd: A single dot should be put into a sequence. drbd: fix rmmod cleanup, remove _all_ debugfs entries drbd: Use setup_timer() instead of init_timer() to simplify the code. drbd: fix potential get_ldev/put_ldev refcount imbalance during attach drbd: new disk-option disable-write-same drbd: Fix resource role for newly created resources in events2 drbd: mark symbols static where possible drbd: Send P_NEG_ACK upon write error in protocol != C drbd: add explicit plugging when submitting batches drbd: change list_for_each_safe to while(list_first_entry_or_null) drbd: introduce drbd_recv_header_maybe_unplug ...
2017-08-24Btrfs: fix blk_status_t/errno confusionOmar Sandoval1-2/+2
This fixes several instances of blk_status_t and bare errno ints being mixed up, some of which are real bugs. In the normal case, 0 matches BLK_STS_OK, so we don't observe any effects of the missing conversion, but in case of errors or passes through the repair/retry paths, the errors get mixed up. The changes were identified using 'sparse', we don't have reports of the buggy behaviour. Fixes: 4e4cbee93d56 ("block: switch bios to blk_status_t") Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-23block: replace bi_bdev with a gendisk pointer and partitions indexChristoph Hellwig1-1/+1
This way we don't need a block_device structure to submit I/O. The block_device has different life time rules from the gendisk and request_queue and is usually only available when the block device node is open. Other callers need to explicitly create one (e.g. the lightnvm passthrough code, or the new nvme multipathing code). For the actual I/O path all that we need is the gendisk, which exists once per block device. But given that the block layer also does partition remapping we additionally need a partition index, which is used for said remapping in generic_make_request. Note that all the block drivers generally want request_queue or sometimes the gendisk, so this removes a layer of indirection all over the stack. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-22btrfs: submit superblock io with REQ_META and REQ_PRIODavid Sterba1-2/+3
The superblock is also metadata of the filesystem so the relevant IO should be tagged as such. We also tag it as high priority, as it's the last block committed for metadata from a given transaction. Any delays would effectively block the whole transaction, also blocking any other operation holding the device_list_mutex. Reviewed-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21btrfs: Do not use data_alloc_cluster in ssd modeHans van Kranenburg1-4/+2
This patch provides a band aid to improve the 'out of the box' behaviour of btrfs for disks that are detected as being an ssd. In a general purpose mixed workload scenario, the current ssd mode causes overallocation of available raw disk space for data, while leaving behind increasing amounts of unused fragmented free space. This situation leads to early ENOSPC problems which are harming user experience and adoption of btrfs as a general purpose filesystem. This patch modifies the data extent allocation behaviour of the ssd mode to make it behave identical to nossd mode. The metadata behaviour and additional ssd_spread option stay untouched so far. Recommendations for future development are to reconsider the current oversimplified nossd / ssd distinction and the broken detection mechanism based on the rotational attribute in sysfs and provide experienced users with a more flexible way to choose allocator behaviour for data and metadata, optimized for certain use cases, while keeping sane 'out of the box' default settings. The internals of the current btrfs code have more potential than what currently gets exposed to the user to choose from. The SSD story... In the first year of btrfs development, around early 2008, btrfs gained a mount option which enables specific functionality for filesystems on solid state devices. The first occurance of this functionality is in commit e18e4809, labeled "Add mount -o ssd, which includes optimizations for seek free storage". The effect on allocating free space for doing (data) writes is to 'cluster' writes together, writing them out in contiguous space, as opposed to a 'tetris' way of putting all separate writes into any free space fragment that fits (which is what the -o nossd behaviour does). A somewhat simplified explanation of what happens is that, when for example, the 'cluster' size is set to 2MiB, when we do some writes, the data allocator will search for a free space block that is 2MiB big, and put the writes in there. The ssd mode itself might allow a 2MiB cluster to be composed of multiple free space extents with some existing data in between, while the additional ssd_spread mount option kills off this option and requires fully free space. The idea behind this is (commit 536ac8ae): "The [...] clusters make it more likely a given IO will completely overwrite the ssd block, so it doesn't have to do an internal rwm cycle."; ssd block meaning nand erase block. So, effectively this means applying a "locality based algorithm" and trying to outsmart the actual ssd. Since then, various changes have been made to the involved code, but the basic idea is still present, and gets activated whenever the ssd mount option is active. This also happens by default, when the rotational flag as seen at /sys/block/<device>/queue/rotational is set to 0. However, there's a number of problems with this approach. First, what the optimization is trying to do is outsmart the ssd by assuming there is a relation between the physical address space of the block device as seen by btrfs and the actual physical storage of the ssd, and then adjusting data placement. However, since the introduction of the Flash Translation Layer (FTL) which is a part of the internal controller of an ssd, these attempts are futile. The use of good quality FTL in consumer ssd products might have been limited in 2008, but this situation has changed drastically soon after that time. Today, even the flash memory in your automatic cat feeding machine or your grandma's wheelchair has a full featured one. Second, the behaviour as described above results in the filesystem being filled up with badly fragmented free space extents because of relatively small pieces of space that are freed up by deletes, but not selected again as part of a 'cluster'. Since the algorithm prefers allocating a new chunk over going back to tetris mode, the end result is a filesystem in which all raw space is allocated, but which is composed of underutilized chunks with a 'shotgun blast' pattern of fragmented free space. Usually, the next problematic thing that happens is the filesystem wanting to allocate new space for metadata, which causes the filesystem to fail in spectacular ways. Third, the default mount options you get for an ssd ('ssd' mode enabled, 'discard' not enabled), in combination with spreading out writes over the full address space and ignoring freed up space leads to worst case behaviour in providing information to the ssd itself, since it will never learn that all the free space left behind is actually free. There are two ways to let an ssd know previously written data does not have to be preserved, which are sending explicit signals using discard or fstrim, or by simply overwriting the space with new data. The worst case behaviour is the btrfs ssd_spread mount option in combination with not having discard enabled. It has a side effect of minimizing the reuse of free space previously written in. Fourth, the rotational flag in /sys/ does not reliably indicate if the device is a locally attached ssd. For example, iSCSI or NBD displays as non-rotational, while a loop device on an ssd shows up as rotational. The combination of the second and third problem effectively means that despite all the good intentions, the btrfs ssd mode reliably causes the ssd hardware and the filesystem structures and performance to be choked to death. The clickbait version of the title of this story would have been "Btrfs ssd optimizations considered harmful for ssds". The current nossd 'tetris' mode (even still without discard) allows a pattern of overwriting much more previously used space, causing many more implicit discards to happen because of the overwrite information the ssd gets. The actual location in the physical address space, as seen from the point of view of btrfs is irrelevant, because the actual writes to the low level flash are reordered anyway thanks to the FTL. Changes made in the code 1. Make ssd mode data allocation identical to tetris mode, like nossd. 2. Adjust and clean up filesystem mount messages so that we can easily identify if a kernel has this patch applied or not, when providing support to end users. Also, make better use of the *_and_info helpers to only trigger messages on actual state changes. Backporting notes Notes for whoever wants to backport this patch to their 4.9 LTS kernel: * First apply commit 951e7966 "btrfs: drop the nossd flag when remounting with -o ssd", or fixup the differences manually. * The rest of the conflicts are because of the fs_info refactoring. So, for example, instead of using fs_info, it's root->fs_info in extent-tree.c Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21btrfs: use btrfsic_submit_bio instead of submit_bio in write_dev_flushLu Fengqi1-1/+1
Although this bio has no data attached, it will reach this condition (bio->bi_opf & REQ_PREFLUSH) and then update the flush_gen of dev_state in __btrfsic_submit_bio. So we should still submit it through integrity checker. Otherwise, the integrity checker will throw the following warning when I mount a newly created btrfs filesystem. [10264.755497] btrfs: attempt to write superblock which references block M @29523968 (sdb1/1111654400/0) which is not flushed out of disk's write cache (block flush_gen=1, dev->flush_gen=0)! [10264.755498] btrfs: attempt to write superblock which references block M @29523968 (sdb1/37912576/0) which is not flushed out of disk's write cache (block flush_gen=1, dev->flush_gen=0)! Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-18btrfs: use appropriate define for the fsidAnand Jain1-3/+3
Though BTRFS_FSID_SIZE and BTRFS_UUID_SIZE are of the same size, we should use the matching constant for the fsid buffer. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16btrfs: drop chunk locks at the end of close_ctreeDavid Sterba1-2/+0
The pinned chunks might be left over so we clean them but at this point of close_ctree, there's noone to race with, the locking can be removed. Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16btrfs: fix spelling of snapshottingDavid Sterba1-1/+1
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16btrfs: use named constant for bdev blocksizeDavid Sterba1-5/+6
Superblock is read and written using buffer heads, we need to set the bdev blocksize. The magic constant has been hardcoded in several places, so replace it with a named constant. Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16btrfs: split write_dev_supers to two functionsDavid Sterba1-56/+73
There are two independent parts, one that writes the superblocks and another that waits for completion. No functional changes, but cleanups, reformatting and comment updates. Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16btrfs: get fs_info from eb in btrfs_print_leaf, remove argumentDavid Sterba1-1/+1
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16btrfs: Cleanup num_tolerated_disk_barrier_failuresQu Wenruo1-54/+0
As we use per-chunk degradable check, the global num_tolerated_disk_barrier_failures is of no use. We can now remove it. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16btrfs: Allow barrier_all_devices to do chunk level device checkQu Wenruo1-13/+3
The last user of num_tolerated_disk_barrier_failures is barrier_all_devices(). But it can be easily changed to the new per-chunk degradable check framework. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16btrfs: Do chunk level check for degraded rw mountQu Wenruo1-8/+3
Now use the btrfs_check_rw_degradable() to check if we can mount in the degraded mode. With this patch, we can mount in the following case: # mkfs.btrfs -f -m raid1 -d single /dev/sdb /dev/sdc # wipefs -a /dev/sdc # mount /dev/sdb /mnt/btrfs -o degraded As the single data chunk is only on sdb, so it's OK to mount as degraded, as missing one device is OK for RAID1. But still fail in the following case as expected: # mkfs.btrfs -f -m raid1 -d single /dev/sdb /dev/sdc # wipefs -a /dev/sdb # mount /dev/sdc /mnt/btrfs -o degraded As the data chunk is only in sdb, so it's not OK to mount it as degraded. Reported-by: Zhao Lei <zhaolei@cn.fujitsu.com> Reported-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-15btrfs: Add zstd supportNick Terrell1-0/+2
Add zstd compression and decompression support to BtrFS. zstd at its fastest level compresses almost as well as zlib, while offering much faster compression and decompression, approaching lzo speeds. I benchmarked btrfs with zstd compression against no compression, lzo compression, and zlib compression. I benchmarked two scenarios. Copying a set of files to btrfs, and then reading the files. Copying a tarball to btrfs, extracting it to btrfs, and then reading the extracted files. After every operation, I call `sync` and include the sync time. Between every pair of operations I unmount and remount the filesystem to avoid caching. The benchmark files can be found in the upstream zstd source repository under `contrib/linux-kernel/{btrfs-benchmark.sh,btrfs-extract-benchmark.sh}` [1] [2]. I ran the benchmarks on a Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM. The VM is running on a MacBook Pro with a 3.1 GHz Intel Core i7 processor, 16 GB of RAM, and a SSD. The first compression benchmark is copying 10 copies of the unzipped Silesia corpus [3] into a BtrFS filesystem mounted with `-o compress-force=Method`. The decompression benchmark times how long it takes to `tar` all 10 copies into `/dev/null`. The compression ratio is measured by comparing the output of `df` and `du`. See the benchmark file [1] for details. I benchmarked multiple zstd compression levels, although the patch uses zstd level 1. | Method | Ratio | Compression MB/s | Decompression speed | |---------|-------|------------------|---------------------| | None | 0.99 | 504 | 686 | | lzo | 1.66 | 398 | 442 | | zlib | 2.58 | 65 | 241 | | zstd 1 | 2.57 | 260 | 383 | | zstd 3 | 2.71 | 174 | 408 | | zstd 6 | 2.87 | 70 | 398 | | zstd 9 | 2.92 | 43 | 406 | | zstd 12 | 2.93 | 21 | 408 | | zstd 15 | 3.01 | 11 | 354 | The next benchmark first copies `linux-4.11.6.tar` [4] to btrfs. Then it measures the compression ratio, extracts the tar, and deletes the tar. Then it measures the compression ratio again, and `tar`s the extracted files into `/dev/null`. See the benchmark file [2] for details. | Method | Tar Ratio | Extract Ratio | Copy (s) | Extract (s)| Read (s) | |--------|-----------|---------------|----------|------------|----------| | None | 0.97 | 0.78 | 0.981 | 5.501 | 8.807 | | lzo | 2.06 | 1.38 | 1.631 | 8.458 | 8.585 | | zlib | 3.40 | 1.86 | 7.750 | 21.544 | 11.744 | | zstd 1 | 3.57 | 1.85 | 2.579 | 11.479 | 9.389 | [1] https://github.com/facebook/zstd/blob/dev/contrib/linux-kernel/btrfs-benchmark.sh [2] https://github.com/facebook/zstd/blob/dev/contrib/linux-kernel/btrfs-extract-benchmark.sh [3] http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia [4] https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.11.6.tar.xz zstd source repository: https://github.com/facebook/zstd Signed-off-by: Nick Terrell <terrelln@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2017-07-14Merge branch 'for-4.13-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linuxLinus Torvalds1-0/+1
Pull btrfs fixes from David Sterba: "We've identified and fixed a silent corruption (introduced by code in the first pull), a fixup after the blk_status_t merge and two fixes to incremental send that Filipe has been hunting for some time" * 'for-4.13-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: Btrfs: fix unexpected return value of bio_readpage_error btrfs: btrfs_create_repair_bio never fails, skip error handling btrfs: cloned bios must not be iterated by bio_for_each_segment_all Btrfs: fix write corruption due to bio cloning on raid5/6 Btrfs: incremental send, fix invalid memory access Btrfs: incremental send, fix invalid path for link commands
2017-07-14btrfs: cloned bios must not be iterated by bio_for_each_segment_allDavid Sterba1-0/+1
We've started using cloned bios more in 4.13, there are some specifics regarding the iteration. Filipe found [1] that the raid56 iterated a cloned bio using bio_for_each_segment_all, which is incorrect. The cloned bios have wrong bi_vcnt and this could lead to silent corruptions. This patch adds assertions to all remaining bio_for_each_segment_all cases. [1] https://patchwork.kernel.org/patch/9838535/ Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-07-06Merge branch 'for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpuLinus Torvalds1-6/+6
Pull percpu updates from Tejun Heo: "These are the percpu changes for the v4.13-rc1 merge window. There are a couple visibility related changes - tracepoints and allocator stats through debugfs, along with __ro_after_init markings and a cosmetic rename in percpu_counter. Please note that the simple O(#elements_in_the_chunk) area allocator used by percpu allocator is again showing scalability issues, primarily with bpf allocating and freeing large number of counters. Dennis is working on the replacement allocator and the percpu allocator will be seeing increased churns in the coming cycles" * 'for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: percpu: fix static checker warnings in pcpu_destroy_chunk percpu: fix early calls for spinlock in pcpu_stats percpu: resolve err may not be initialized in pcpu_alloc percpu_counter: Rename __percpu_counter_add to percpu_counter_add_batch percpu: add tracepoint support for percpu memory percpu: expose statistics about percpu memory via debugfs percpu: migrate percpu data structures to internal header percpu: add missing lockdep_assert_held to func pcpu_free_area mark most percpu globals as __ro_after_init
2017-07-05Merge branch 'for-4.13-part1' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linuxLinus Torvalds1-96/+98
Pull btrfs updates from David Sterba: "The core updates improve error handling (mostly related to bios), with the usual incremental work on the GFP_NOFS (mis)use removal, refactoring or cleanups. Except the two top patches, all have been in for-next for an extensive amount of time. User visible changes: - statx support - quota override tunable - improved compression thresholds - obsoleted mount option alloc_start Core updates: - bio-related updates: - faster bio cloning - no allocation failures - preallocated flush bios - more kvzalloc use, memalloc_nofs protections, GFP_NOFS updates - prep work for btree_inode removal - dir-item validation - qgoup fixes and updates - cleanups: - removed unused struct members, unused code, refactoring - argument refactoring (fs_info/root, caller -> callee sink) - SEARCH_TREE ioctl docs" * 'for-4.13-part1' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (115 commits) btrfs: Remove false alert when fiemap range is smaller than on-disk extent btrfs: Don't clear SGID when inheriting ACLs btrfs: fix integer overflow in calc_reclaim_items_nr btrfs: scrub: fix target device intialization while setting up scrub context btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges btrfs: qgroup: Introduce extent changeset for qgroup reserve functions btrfs: qgroup: Fix qgroup reserved space underflow caused by buffered write and quotas being enabled btrfs: qgroup: Return actually freed bytes for qgroup release or free data btrfs: qgroup: Cleanup btrfs_qgroup_prepare_account_extents function btrfs: qgroup: Add quick exit for non-fs extents Btrfs: rework delayed ref total_bytes_pinned accounting Btrfs: return old and new total ref mods when adding delayed refs Btrfs: always account pinned bytes when dropping a tree block ref Btrfs: update total_bytes_pinned when pinning down extents Btrfs: make BUG_ON() in add_pinned_bytes() an ASSERT() Btrfs: make add_pinned_bytes() take an s64 num_bytes instead of u64 btrfs: fix validation of XATTR_ITEM dir items btrfs: Verify dir_item in iterate_object_props btrfs: Check name_len before in btrfs_del_root_ref btrfs: Check name_len before reading btrfs_get_name ...
2017-06-21btrfs: move dev stats accounting out of wait_dev_flushDavid Sterba1-8/+3
We should really just wait in wait_dev_flush and let the caller decide what to do with the error value. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-21btrfs: account as waiting for IO, while waiting fot the flush bio completionDavid Sterba1-1/+1
Similar to what submit_bio_wait does, we should account for IO while waiting for a bio completion. This has marginal visible effects, flush bio is short-lived. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-21btrfs: preallocate device flush bioDavid Sterba1-18/+7
For devices that support flushing, we allocate a bio, submit, wait for it and then free it. The bio allocation does not fail so ENOMEM is not a problem but we still may unnecessarily stress the allocation subsystem. Instead, we can allocate the bio at the same time we allocate the device and reuse it each time we need to flush the barriers. The bio is reset before each use. Reference counting is simplified to just device allocation (get) and freeing (put). The bio used to be submitted through the integrity checker which will find out that bio has no data attached and call submit_bio. Status of the bio in flight needs to be tracked separately in case the device caches get switched off between write and wait. Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-20percpu_counter: Rename __percpu_counter_add to percpu_counter_add_batchNikolay Borisov1-6/+6
Currently, percpu_counter_add is a wrapper around __percpu_counter_add which is preempt safe due to explicit calls to preempt_disable. Given how __ prefix is used in percpu related interfaces, the naming unfortunately creates the false sense that __percpu_counter_add is less safe than percpu_counter_add. In terms of context-safety, they're equivalent. The only difference is that the __ version takes a batch parameter. Make this a bit more explicit by just renaming __percpu_counter_add to percpu_counter_add_batch. This patch doesn't cause any functional changes. tj: Minor updates to patch description for clarity. Cosmetic indentation updates. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <jbacik@fb.com> Cc: David Sterba <dsterba@suse.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Jan Kara <jack@suse.com> Cc: Jens Axboe <axboe@fb.com> Cc: linux-mm@kvack.org Cc: "David S. Miller" <davem@davemloft.net>
2017-06-20btrfs: move fs_info::fs_frozen to the flagsDavid Sterba1-1/+0
We can keep the state among the other fs_info flags, there's no reason why fs_frozen would need to be separate. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: wait part of the write_dev_flush() can be separated outAnand Jain1-28/+31
Submit and wait parts of write_dev_flush() can be split into two separate functions for better readability. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: remove redundant null bdev counting during flush submissionAnand Jain1-5/+2
There is no extra benefit to count null bdev during the submit loop, as these null devices will be anyway checked during command completion device loop just after the submit loop. We are holding the device_list_mutex, the device->bdev status won't change in between. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: write_dev_flush does not return ENOMEM anymoreAnand Jain1-33/+5
Since commit "btrfs: btrfs_io_bio_alloc never fails, skip error handling" write_dev_flush will not return ENOMEM in the sending part. We do not need to check for it in the callers. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ updated changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: sink gfp parameter to btrfs_io_bio_allocDavid Sterba1-1/+1
We can hardcode GFP_NOFS to btrfs_io_bio_alloc, although it means we change it back from GFP_KERNEL in scrub. I'd rather save a few stack bytes from not passing the gfp flags in the remaining, more imporatant, contexts and the bio allocating API now looks more consistent. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: btrfs_io_bio_alloc never fails, skip error handlingDavid Sterba1-3/+0
Update direct callers of btrfs_io_bio_alloc that do error handling, that we can now remove. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: use generic slab for for btrfs_transactionDavid Sterba1-5/+0
Observing the number of slab objects of btrfs_transaction, there's just one active on an almost quiescent filesystem, and the number of objects goes to about ten when sync is in progress. Then the nubmer goes down to 1. This matches the expectations of the transaction lifetime. For such use the separate slab cache is not justified, as we do not reuse objects frequently. For the shortlived transaction, the generic slab (size 512) should be ok. We can optimistically expect that the 512 slabs are not all used (fragmentation) and there are free slots to take when we do the allocation, compared to potentially allocating a whole new page for the separate slab. We'll lose the stats about the object use, which could be added later if we really need them. Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: btrfs_wait_tree_block_writeback can be void returnJeff Layton1-3/+3
Nothing checks its return value. Is it safe to skip checking return value of btrfs_wait_tree_block_writeback? Liu Bo: I think yes, it's used in walk_log_tree which is called in two places, free_log_tree and log replay. For free_log_tree, it waits for any running writeback of the extent buffer under freeing to finish in case we need to access the eb pointer from page->private, and it's OK to not check the return value, while for log replay, it's doesn't wait because wc->wait is not set. So neither cares about the writeback error. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> [ added more explanation to changelog, from Liu Bo ] Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: remove unused member list from btrfs_end_io_wqDavid Sterba1-1/+0
The end io work queue items have been tracked by the work queues since "Btrfs: Add async worker threads for pre and post IO checksumming" (8b7128429235d9bd72cfd5e) (2008). Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: remove unused member list from async_submit_bioDavid Sterba1-1/+0
The list used to track checksums in the early version (2.6.29), but I was able not pinpoint the commit that stopped using it. Everything apparently works without it for a long time. Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19Btrfs: replace tree->mapping with tree->private_dataJosef Bacik1-21/+29
For extent_io tree's we have carried the address_mapping of the inode around in the io tree in order to pull the inode back out for calling into various tree ops hooks. This works fine when everything that has an extent_io_tree has an inode. But we are going to remove the btree_inode, so we need to change this. Instead just have a generic void * for private data that we can initialize with, and have all the tree ops use that instead. This had a lot of cascading changes but should be relatively straightforward. Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor reordering of the callback prototypes ] Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: Convert fs_info->free_chunk_space to atomic64_tNikolay Borisov1-2/+1
The ->free_chunk_space variable is used to track the unallocated space and access to it is protected by a spinlock, which is not used for anything else. Make the code a bit self-explanatory by switching the variable to an atomic64_t type and kill the spinlock. Signed-off-by: Nikolay Borisov <nborisov@suse.com> [ not a performance critical code, use of atomic type is ok ] Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: add framework to handle device flush error as a volumeAnand Jain1-4/+53
This adds comments to the flush error handling part of the code, and hopes to maintain the same logic with a framework which can be used to handle the errors at the volume level. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-12Merge tag 'v4.12-rc5' into for-4.13/blockJens Axboe1-4/+6
We've already got a few conflicts and upcoming work depends on some of the changes that have gone into mainline as regression fixes for this series. Pull in 4.12-rc5 to resolve these conflicts and make it easier on down stream trees to continue working on 4.13 changes. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-10Merge branch 'for-linus-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfsLinus Torvalds1-4/+6
Pull btrfs fixes from Chris Mason: "Some fixes that Dave Sterba collected. We've been hitting an early enospc problem on production machines that Omar tracked down to an old int->u64 mistake. I waited a bit on this pull to make sure it was really the problem from production, but it's on ~2100 hosts now and I think we're good. Omar also noticed a commit in the queue would make new early ENOSPC problems. I pulled that out for now, which is why the top three commits are younger than the rest. Otherwise these are all fixes, some explaining very old bugs that we've been poking at for a while" * 'for-linus-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: fix delalloc accounting leak caused by u32 overflow Btrfs: clear EXTENT_DEFRAG bits in finish_ordered_io btrfs: tree-log.c: Wrong printk information about namelen btrfs: fix race with relocation recovery and fs_root setup btrfs: fix memory leak in update_space_info failure path btrfs: use correct types for page indices in btrfs_page_exists_in_range btrfs: fix incorrect error return ret being passed to mapping_set_error btrfs: Make flush bios explicitely sync btrfs: fiemap: Cache and merge fiemap extent before submit it to user
2017-06-09block: switch bios to blk_status_tChristoph Hellwig1-38/+37
Replace bi_error with a new bi_status to allow for a clear conversion. Note that device mapper overloaded bi_error with a private value, which we'll have to keep arround at least for now and thus propagate to a proper blk_status_t value. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-16btrfs: Make flush bios explicitely syncJan Kara1-4/+6
Commit b685d3d65ac7 "block: treat REQ_FUA and REQ_PREFLUSH as synchronous" removed REQ_SYNC flag from WRITE_{FUA|PREFLUSH|...} definitions. generic_make_request_checks() however strips REQ_FUA and REQ_PREFLUSH flags from a bio when the storage doesn't report volatile write cache and thus write effectively becomes asynchronous which can lead to performance regressions Fix the problem by making sure all bios which are synchronous are properly marked with REQ_SYNC. CC: David Sterba <dsterba@suse.com> CC: linux-btrfs@vger.kernel.org Fixes: b685d3d65ac791406e0dfd8779cc9b3707fea5a3 Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-05-10Merge branch 'for-linus-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfsLinus Torvalds1-6/+7
Pull btrfs updates from Chris Mason: "This has fixes and cleanups Dave Sterba collected for the merge window. The biggest functional fixes are between btrfs raid5/6 and scrub, and raid5/6 and device replacement. Some of our pending qgroup fixes are included as well while I bash on the rest in testing. We also have the usual set of cleanups, including one that makes __btrfs_map_block() much more maintainable, and conversions from atomic_t to refcount_t" * 'for-linus-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (71 commits) btrfs: fix the gfp_mask for the reada_zones radix tree Btrfs: fix reported number of inode blocks Btrfs: send, fix file hole not being preserved due to inline extent Btrfs: fix extent map leak during fallocate error path Btrfs: fix incorrect space accounting after failure to insert inline extent Btrfs: fix invalid attempt to free reserved space on failure to cow range btrfs: Handle delalloc error correctly to avoid ordered extent hang btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error btrfs: check if the device is flush capable btrfs: delete unused member nobarriers btrfs: scrub: Fix RAID56 recovery race condition btrfs: scrub: Introduce full stripe lock for RAID56 btrfs: Use ktime_get_real_ts for root ctime Btrfs: handle only applicable errors returned by btrfs_get_extent btrfs: qgroup: Fix qgroup corruption caused by inode_cache mount option btrfs: use q which is already obtained from bdev_get_queue Btrfs: switch to div64_u64 if with a u64 divisor Btrfs: update scrub_parity to use u64 stripe_len Btrfs: enable repair during read for raid56 profile btrfs: use clear_page where appropriate ...
2017-05-04btrfs: fix the gfp_mask for the reada_zones radix treeChris Mason1-1/+1
Commits cc8385b59e17 and 7ef70b4d9987a7 added preallocation for the reada radix trees and also switched them over to GFP_KERNEL for the default gfp mask. Since we're doing radix tree insertions under spinlocks, we need to make sure the mask doesn't allow sleeping. This fix keeps the radix preallocation but switches back to the original gfp_mask. Reported-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2017-05-01Merge branch 'for-4.12/block' of git://git.kernel.dk/linux-blockLinus Torvalds1-29/+7
Pull block layer updates from Jens Axboe: - Add BFQ IO scheduler under the new blk-mq scheduling framework. BFQ was initially a fork of CFQ, but subsequently changed to implement fairness based on B-WF2Q+, a modified variant of WF2Q. BFQ is meant to be used on desktop type single drives, providing good fairness. From Paolo. - Add Kyber IO scheduler. This is a full multiqueue aware scheduler, using a scalable token based algorithm that throttles IO based on live completion IO stats, similary to blk-wbt. From Omar. - A series from Jan, moving users to separately allocated backing devices. This continues the work of separating backing device life times, solving various problems with hot removal. - A series of updates for lightnvm, mostly from Javier. Includes a 'pblk' target that exposes an open channel SSD as a physical block device. - A series of fixes and improvements for nbd from Josef. - A series from Omar, removing queue sharing between devices on mostly legacy drivers. This helps us clean up other bits, if we know that a queue only has a single device backing. This has been overdue for more than a decade. - Fixes for the blk-stats, and improvements to unify the stats and user windows. This both improves blk-wbt, and enables other users to register a need to receive IO stats for a device. From Omar. - blk-throttle improvements from Shaohua. This provides a scalable framework for implementing scalable priotization - particularly for blk-mq, but applicable to any type of block device. The interface is marked experimental for now. - Bucketized IO stats for IO polling from Stephen Bates. This improves efficiency of polled workloads in the presence of mixed block size IO. - A few fixes for opal, from Scott. - A few pulls for NVMe, including a lot of fixes for NVMe-over-fabrics. From a variety of folks, mostly Sagi and James Smart. - A series from Bart, improving our exposed info and capabilities from the blk-mq debugfs support. - A series from Christoph, cleaning up how handle WRITE_ZEROES. - A series from Christoph, cleaning up the block layer handling of how we track errors in a request. On top of being a nice cleanup, it also shrinks the size of struct request a bit. - Removal of mg_disk and hd (sorry Linus) by Christoph. The former was never used by platforms, and the latter has outlived it's usefulness. - Various little bug fixes and cleanups from a wide variety of folks. * 'for-4.12/block' of git://git.kernel.dk/linux-block: (329 commits) block: hide badblocks attribute by default blk-mq: unify hctx delay_work and run_work block: add kblock_mod_delayed_work_on() blk-mq: unify hctx delayed_run_work and run_work nbd: fix use after free on module unload MAINTAINERS: bfq: Add Paolo as maintainer for the BFQ I/O scheduler blk-mq-sched: alloate reserved tags out of normal pool mtip32xx: use runtime tag to initialize command header scsi: Implement blk_mq_ops.show_rq() blk-mq: Add blk_mq_ops.show_rq() blk-mq: Show operation, cmd_flags and rq_flags names blk-mq: Make blk_flags_show() callers append a newline character blk-mq: Move the "state" debugfs attribute one level down blk-mq: Unregister debugfs attributes earlier blk-mq: Only unregister hctxs for which registration succeeded blk-mq-debugfs: Rename functions for registering and unregistering the mq directory blk-mq: Let blk_mq_debugfs_register() look up the queue name blk-mq: Register <dev>/queue/mq after having registered <dev>/queue ide-pm: always pass 0 error to ide_complete_rq in ide_do_devset ide-pm: always pass 0 error to __blk_end_request_all ..
2017-04-20btrfs: Convert to separately allocated bdiJan Kara1-29/+7
Allocate struct backing_dev_info separately instead of embedding it inside superblock. This unifies handling of bdi among users. CC: Chris Mason <clm@fb.com> CC: Josef Bacik <jbacik@fb.com> CC: David Sterba <dsterba@suse.com> CC: linux-btrfs@vger.kernel.org Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-18btrfs: check if the device is flush capableAnand Jain1-0/+4
The block layer call chain from submit_bio will check if the write cache is enabled for the given queue before submitting the flush. This will add a code to fail fast if its not. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ updated changelog to reflect current code stat, blkdev_issue_flush is not used yet ] Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18btrfs: delete unused member nobarriersAnand Jain1-3/+0
The last consumer of nobarriers is removed by the commit [1] and sync won't fail with EOPNOTSUPP anymore. Thus, now when write cache is write through it just return success without actually transpiring such a request to the block device/lun. [1] commit b25de9d6da49b1a8760a89672283128aa8c78345 block: remove BIO_EOPNOTSUPP And, as the device/lun write cache state may change dynamically saving such as state won't help either. So deleting the member nobarriers. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18btrfs: remove redundant parameter from btree_readahead_hookDavid Sterba1-2/+2
We can read fs_info from eb. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>