aboutsummaryrefslogtreecommitdiffstats
path: root/fs/btrfs (follow)
AgeCommit message (Collapse)AuthorFilesLines
2018-07-21Merge tag 'for-4.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linuxLinus Torvalds1-2/+5
Pull btrfs fix from David Sterba: "A fix of a corruption regarding fsync and clone, under some very specific conditions explained in the patch. The fix is marked for stable 3.16+ so I'd like to get it merged now given the impact" * tag 'for-4.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: Btrfs: fix file data corruption after cloning a range and fsync
2018-07-19Btrfs: fix file data corruption after cloning a range and fsyncFilipe Manana1-2/+5
When we clone a range into a file we can end up dropping existing extent maps (or trimming them) and replacing them with new ones if the range to be cloned overlaps with a range in the destination inode. When that happens we add the new extent maps to the list of modified extents in the inode's extent map tree, so that a "fast" fsync (the flag BTRFS_INODE_NEEDS_FULL_SYNC not set in the inode) will see the extent maps and log corresponding extent items. However, at the end of range cloning operation we do truncate all the pages in the affected range (in order to ensure future reads will not get stale data). Sometimes this truncation will release the corresponding extent maps besides the pages from the page cache. If this happens, then a "fast" fsync operation will miss logging some extent items, because it relies exclusively on the extent maps being present in the inode's extent tree, leading to data loss/corruption if the fsync ends up using the same transaction used by the clone operation (that transaction was not committed in the meanwhile). An extent map is released through the callback btrfs_invalidatepage(), which gets called by truncate_inode_pages_range(), and it calls __btrfs_releasepage(). The later ends up calling try_release_extent_mapping() which will release the extent map if some conditions are met, like the file size being greater than 16Mb, gfp flags allow blocking and the range not being locked (which is the case during the clone operation) nor being the extent map flagged as pinned (also the case for cloning). The following example, turned into a test for fstests, reproduces the issue: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ xfs_io -f -c "pwrite -S 0x18 9000K 6908K" /mnt/foo $ xfs_io -f -c "pwrite -S 0x20 2572K 156K" /mnt/bar $ xfs_io -c "fsync" /mnt/bar # reflink destination offset corresponds to the size of file bar, # 2728Kb minus 4Kb. $ xfs_io -c ""reflink ${SCRATCH_MNT}/foo 0 2724K 15908K" /mnt/bar $ xfs_io -c "fsync" /mnt/bar $ md5sum /mnt/bar 95a95813a8c2abc9aa75a6c2914a077e /mnt/bar <power fail> $ mount /dev/sdb /mnt $ md5sum /mnt/bar 207fd8d0b161be8a84b945f0df8d5f8d /mnt/bar # digest should be 95a95813a8c2abc9aa75a6c2914a077e like before the # power failure In the above example, the destination offset of the clone operation corresponds to the size of the "bar" file minus 4Kb. So during the clone operation, the extent map covering the range from 2572Kb to 2728Kb gets trimmed so that it ends at offset 2724Kb, and a new extent map covering the range from 2724Kb to 11724Kb is created. So at the end of the clone operation when we ask to truncate the pages in the range from 2724Kb to 2724Kb + 15908Kb, the page invalidation callback ends up removing the new extent map (through try_release_extent_mapping()) when the page at offset 2724Kb is passed to that callback. Fix this by setting the bit BTRFS_INODE_NEEDS_FULL_SYNC whenever an extent map is removed at try_release_extent_mapping(), forcing the next fsync to search for modified extents in the fs/subvolume tree instead of relying on the presence of extent maps in memory. This way we can continue doing a "fast" fsync if the destination range of a clone operation does not overlap with an existing range or if any of the criteria necessary to remove an extent map at try_release_extent_mapping() is not met (file size not bigger then 16Mb or gfp flags do not allow blocking). CC: stable@vger.kernel.org # 3.16+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-07-18Merge tag 'for-4.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linuxLinus Torvalds3-8/+13
Pull btrfs fixes from David Sterba: "Three regression fixes. They're few-liners and fixing some corner cases missed in the origial patches" * tag 'for-4.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: scrub: Don't use inode page cache in scrub_handle_errored_block() btrfs: fix use-after-free of cmp workspace pages btrfs: restore uuid_mutex in btrfs_open_devices
2018-07-17btrfs: scrub: Don't use inode page cache in scrub_handle_errored_block()Qu Wenruo1-8/+9
In commit ac0b4145d662 ("btrfs: scrub: Don't use inode pages for device replace") we removed the branch of copy_nocow_pages() to avoid corruption for compressed nodatasum extents. However above commit only solves the problem in scrub_extent(), if during scrub_pages() we failed to read some pages, sctx->no_io_error_seen will be non-zero and we go to fixup function scrub_handle_errored_block(). In scrub_handle_errored_block(), for sctx without csum (no matter if we're doing replace or scrub) we go to scrub_fixup_nodatasum() routine, which does the similar thing with copy_nocow_pages(), but does it without the extra check in copy_nocow_pages() routine. So for test cases like btrfs/100, where we emulate read errors during replace/scrub, we could corrupt compressed extent data again. This patch will fix it just by avoiding any "optimization" for nodatasum, just falls back to the normal fixup routine by try read from any good copy. This also solves WARN_ON() or dead lock caused by lame backref iteration in scrub_fixup_nodatasum() routine. The deadlock or WARN_ON() won't be triggered before commit ac0b4145d662 ("btrfs: scrub: Don't use inode pages for device replace") since copy_nocow_pages() have better locking and extra check for data extent, and it's already doing the fixup work by try to read data from any good copy, so it won't go scrub_fixup_nodatasum() anyway. This patch disables the faulty code and will be removed completely in a followup patch. Fixes: ac0b4145d662 ("btrfs: scrub: Don't use inode pages for device replace") Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-07-13btrfs: fix use-after-free of cmp workspace pagesNaohiro Aota1-0/+2
btrfs_cmp_data_free() puts cmp's src_pages and dst_pages, but leaves their page address intact. Now, if you hit "goto again" in btrfs_extent_same_range() and hit some error in btrfs_cmp_data_prepare(), you'll try to unlock/put already put pages. This is simple fix to reset the address to avoid use-after-free. Fixes: 67b07bd4bec5 ("Btrfs: reuse cmp workspace in EXTENT_SAME ioctl") Signed-off-by: Naohiro Aota <naota@elisp.net> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-07-13btrfs: restore uuid_mutex in btrfs_open_devicesDavid Sterba1-0/+2
Commit 542c5908abfe84f7b4c1 ("btrfs: replace uuid_mutex by device_list_mutex in btrfs_open_devices") switched to device_list_mutex as we need that for the device list traversal, but we also need uuid_mutex to protect access to fs_devices::opened to be consistent with other users of that. Fixes: 542c5908abfe84f7b4c1 ("btrfs: replace uuid_mutex by device_list_mutex in btrfs_open_devices") Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-07-01Merge tag 'for-4.18-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linuxLinus Torvalds2-5/+15
Pull btrfs fixes from David Sterba: "We have a few regression fixes for qgroup rescan status tracking and the vm_fault_t conversion that mixed up the error values" * tag 'for-4.18-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: Btrfs: fix mount failure when qgroup rescan is in progress Btrfs: fix regression in btrfs_page_mkwrite() from vm_fault_t conversion btrfs: quota: Set rescan progress to (u64)-1 if we hit last leaf
2018-06-28Btrfs: fix mount failure when qgroup rescan is in progressFilipe Manana1-3/+10
If a power failure happens while the qgroup rescan kthread is running, the next mount operation will always fail. This is because of a recent regression that makes qgroup_rescan_init() incorrectly return -EINVAL when we are mounting the filesystem (through btrfs_read_qgroup_config()). This causes the -EINVAL error to be returned regardless of any qgroup flags being set instead of returning the error only when neither of the flags BTRFS_QGROUP_STATUS_FLAG_RESCAN nor BTRFS_QGROUP_STATUS_FLAG_ON are set. A test case for fstests follows up soon. Fixes: 9593bf49675e ("btrfs: qgroup: show more meaningful qgroup_rescan_init error message") Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-06-28Btrfs: fix regression in btrfs_page_mkwrite() from vm_fault_t conversionChris Mason1-1/+2
The vm_fault_t conversion commit introduced a ret2 variable for tracking the integer return values from internal btrfs functions. It was sometimes returning VM_FAULT_LOCKED for pages that were actually invalid and had been removed from the radix. Something like this: ret2 = btrfs_delalloc_reserve_space() // returns zero on success lock_page(page) if (page->mapping != inode->i_mapping) goto out_unlock; ... out_unlock: if (!ret2) { ... return VM_FAULT_LOCKED; } This ends up triggering this WARNING in btrfs_destroy_inode() WARN_ON(BTRFS_I(inode)->block_rsv.size); xfstests generic/095 was able to reliably reproduce the errors. Since out_unlock: is only used for errors, this fix moves it below the if (!ret2) check we use to return VM_FAULT_LOCKED for success. Fixes: a528a2415087 (btrfs: change return type of btrfs_page_mkwrite to vm_fault_t) Signed-off-by: Chris Mason <clm@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-06-28btrfs: quota: Set rescan progress to (u64)-1 if we hit last leafQu Wenruo1-1/+3
Commit ff3d27a048d9 ("btrfs: qgroup: Finish rescan when hit the last leaf of extent tree") added a new exit for rescan finish. However after finishing quota rescan, we set fs_info->qgroup_rescan_progress to (u64)-1 before we exit through the original exit path. While we missed that assignment of (u64)-1 in the new exit path. The end result is, the quota status item doesn't have the same value. (-1 vs the last bytenr + 1) Although it doesn't affect quota accounting, it's still better to keep the original behavior. Reported-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> Fixes: ff3d27a048d9 ("btrfs: qgroup: Finish rescan when hit the last leaf of extent tree") Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-06-26Merge tag 'for-4.18-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linuxLinus Torvalds3-7/+12
Pull btrfs fixes from David Sterba: "Two regression fixes and an incorrect error value propagation fix from 'rename exchange'" * tag 'for-4.18-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: Btrfs: fix return value on rename exchange failure btrfs: fix invalid-free in btrfs_extent_same Btrfs: fix physical offset reported by fiemap for inline extents
2018-06-22Btrfs: fix return value on rename exchange failureFilipe Manana1-1/+3
If we failed during a rename exchange operation after starting/joining a transaction, we would end up replacing the return value, stored in the local 'ret' variable, with the return value from btrfs_end_transaction(). So this could end up returning 0 (success) to user space despite the operation having failed and aborted the transaction, because if there are multiple tasks having a reference on the transaction at the time btrfs_end_transaction() is called by the rename exchange, that function returns 0 (otherwise it returns -EIO and not the original error value). So fix this by not overwriting the return value on error after getting a transaction handle. Fixes: cdd1fedf8261 ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT") CC: stable@vger.kernel.org # 4.9+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-06-21btrfs: fix invalid-free in btrfs_extent_sameLu Fengqi1-5/+5
If this condition ((BTRFS_I(src)->flags & BTRFS_INODE_NODATASUM) != (BTRFS_I(dst)->flags & BTRFS_INODE_NODATASUM)) is hit, we will go to free the uninitialized cmp.src_pages and cmp.dst_pages. Fixes: 67b07bd4bec5 ("Btrfs: reuse cmp workspace in EXTENT_SAME ioctl") Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-06-21Btrfs: fix physical offset reported by fiemap for inline extentsFilipe Manana1-1/+4
Commit 9d311e11fc1f ("Btrfs: fiemap: pass correct bytenr when fm_extent_count is zero") introduced a regression where we no longer report 0 as the physical offset for inline extents (and other extents with a special block_start value). This is because it always sets the variable used to report the physical offset ("disko") as em->block_start plus some offset, and em->block_start has the value 18446744073709551614 ((u64) -2) for inline extents. This made the btrfs test 004 (from fstests) often fail, for example, for a file with an inline extent we have the following items in the subvolume tree: item 101 key (418 INODE_ITEM 0) itemoff 11029 itemsize 160 generation 25 transid 38 size 1525 nbytes 1525 block group 0 mode 100666 links 1 uid 0 gid 0 rdev 0 sequence 0 flags 0x2(none) atime 1529342058.461891730 (2018-06-18 18:14:18) ctime 1529342058.461891730 (2018-06-18 18:14:18) mtime 1529342058.461891730 (2018-06-18 18:14:18) otime 1529342055.869892885 (2018-06-18 18:14:15) item 102 key (418 INODE_REF 264) itemoff 11016 itemsize 13 index 25 namelen 3 name: fc7 item 103 key (418 EXTENT_DATA 0) itemoff 9470 itemsize 1546 generation 38 type 0 (inline) inline extent data size 1525 ram_bytes 1525 compression 0 (none) Then when test 004 invoked fiemap against the file it got a non-zero physical offset: $ filefrag -v /mnt/p0/d4/d7/fc7 Filesystem type is: 9123683e File size of /mnt/p0/d4/d7/fc7 is 1525 (1 block of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 4095: 18446744073709551614.. 4093: 4096: last,not_aligned,inline,eof /mnt/p0/d4/d7/fc7: 1 extent found This resulted in the test failing like this: btrfs/004 49s ... [failed, exit status 1]- output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/004.out.bad) --- tests/btrfs/004.out 2016-08-23 10:17:35.027012095 +0100 +++ /home/fdmanana/git/hub/xfstests/results//btrfs/004.out.bad 2018-06-18 18:15:02.385872155 +0100 @@ -1,3 +1,10 @@ QA output created by 004 *** test backref walking -*** done +./tests/btrfs/004: line 227: [: 7.55578637259143e+22: integer expression expected +ERROR: 7.55578637259143e+22 is not a valid numeric value. +unexpected output from + /home/fdmanana/git/hub/btrfs-progs/btrfs inspect-internal logical-resolve -s 65536 -P 7.55578637259143e+22 /home/fdmanana/btrfs-tests/scratch_1 ... (Run 'diff -u tests/btrfs/004.out /home/fdmanana/git/hub/xfstests/results//btrfs/004.out.bad' to see the entire diff) Ran: btrfs/004 The large number in scientific notation reported as an invalid numeric value is the result from the filter passed to perl which multiplies the physical offset by the block size reported by fiemap. So fix this by ensuring the physical offset is always set to 0 when we are processing an extent with a special block_start value. Fixes: 9d311e11fc1f ("Btrfs: fiemap: pass correct bytenr when fm_extent_count is zero") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-06-15Merge tag 'vfs-timespec64' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playgroundLinus Torvalds5-12/+12
Pull inode timestamps conversion to timespec64 from Arnd Bergmann: "This is a late set of changes from Deepa Dinamani doing an automated treewide conversion of the inode and iattr structures from 'timespec' to 'timespec64', to push the conversion from the VFS layer into the individual file systems. As Deepa writes: 'The series aims to switch vfs timestamps to use struct timespec64. Currently vfs uses struct timespec, which is not y2038 safe. The series involves the following: 1. Add vfs helper functions for supporting struct timepec64 timestamps. 2. Cast prints of vfs timestamps to avoid warnings after the switch. 3. Simplify code using vfs timestamps so that the actual replacement becomes easy. 4. Convert vfs timestamps to use struct timespec64 using a script. This is a flag day patch. Next steps: 1. Convert APIs that can handle timespec64, instead of converting timestamps at the boundaries. 2. Update internal data structures to avoid timestamp conversions' Thomas Gleixner adds: 'I think there is no point to drag that out for the next merge window. The whole thing needs to be done in one go for the core changes which means that you're going to play that catchup game forever. Let's get over with it towards the end of the merge window'" * tag 'vfs-timespec64' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground: pstore: Remove bogus format string definition vfs: change inode times to use struct timespec64 pstore: Convert internal records to timespec64 udf: Simplify calls to udf_disk_stamp_to_time fs: nfs: get rid of memcpys for inode times ceph: make inode time prints to be long long lustre: Use long long type to print inode time fs: add timespec64_truncate()
2018-06-15Merge tag 'for-4.18-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linuxLinus Torvalds5-19/+19
Pull btrfs fixes from David Sterba: - error handling fixup for one of the new ioctls from 1st pull - fix for device-replace that incorrectly uses inode pages and can mess up compressed extents in some cases - fiemap fix for reporting incorrect number of extents - vm_fault_t type conversion * tag 'for-4.18-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: scrub: Don't use inode pages for device replace btrfs: change return type of btrfs_page_mkwrite to vm_fault_t Btrfs: fiemap: pass correct bytenr when fm_extent_count is zero btrfs: Check error of btrfs_iget in btrfs_search_path_in_tree_user
2018-06-14Merge branch 'vfs_timespec64' of https://github.com/deepa-hub/vfs into vfs-timespec64Arnd Bergmann5-12/+12
Pull the timespec64 conversion from Deepa Dinamani: "The series aims to switch vfs timestamps to use struct timespec64. Currently vfs uses struct timespec, which is not y2038 safe. The flag patch applies cleanly. I've not seen the timestamps update logic change often. The series applies cleanly on 4.17-rc6 and linux-next tip (top commit: next-20180517). I'm not sure how to merge this kind of a series with a flag patch. We are targeting 4.18 for this. Let me know if you have other suggestions. The series involves the following: 1. Add vfs helper functions for supporting struct timepec64 timestamps. 2. Cast prints of vfs timestamps to avoid warnings after the switch. 3. Simplify code using vfs timestamps so that the actual replacement becomes easy. 4. Convert vfs timestamps to use struct timespec64 using a script. This is a flag day patch. I've tried to keep the conversions with the script simple, to aid in the reviews. I've kept all the internal filesystem data structures and function signatures the same. Next steps: 1. Convert APIs that can handle timespec64, instead of converting timestamps at the boundaries. 2. Update internal data structures to avoid timestamp conversions." I've pulled it into a branch based on top of the NFS changes that are now in mainline, so I could resolve the non-obvious conflict between the two while merging. Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-06-12treewide: kzalloc() -> kcalloc()Kees Cook1-2/+2
The kzalloc() function has a 2-factor argument form, kcalloc(). This patch replaces cases of: kzalloc(a * b, gfp) with: kcalloc(a * b, gfp) as well as handling cases of: kzalloc(a * b * c, gfp) with: kzalloc(array3_size(a, b, c), gfp) as it's slightly less ugly than: kzalloc_array(array_size(a, b), c, gfp) This does, however, attempt to ignore constant size factors like: kzalloc(4 * 1024, gfp) though any constants defined via macros get caught up in the conversion. Any factors with a sizeof() of "unsigned char", "char", and "u8" were dropped, since they're redundant. The Coccinelle script used for this was: // Fix redundant parens around sizeof(). @@ type TYPE; expression THING, E; @@ ( kzalloc( - (sizeof(TYPE)) * E + sizeof(TYPE) * E , ...) | kzalloc( - (sizeof(THING)) * E + sizeof(THING) * E , ...) ) // Drop single-byte sizes and redundant parens. @@ expression COUNT; typedef u8; typedef __u8; @@ ( kzalloc( - sizeof(u8) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(__u8) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(char) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(unsigned char) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(u8) * COUNT + COUNT , ...) | kzalloc( - sizeof(__u8) * COUNT + COUNT , ...) | kzalloc( - sizeof(char) * COUNT + COUNT , ...) | kzalloc( - sizeof(unsigned char) * COUNT + COUNT , ...) ) // 2-factor product with sizeof(type/expression) and identifier or constant. @@ type TYPE; expression THING; identifier COUNT_ID; constant COUNT_CONST; @@ ( - kzalloc + kcalloc ( - sizeof(TYPE) * (COUNT_ID) + COUNT_ID, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * COUNT_ID + COUNT_ID, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * (COUNT_CONST) + COUNT_CONST, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * COUNT_CONST + COUNT_CONST, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * (COUNT_ID) + COUNT_ID, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * COUNT_ID + COUNT_ID, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * (COUNT_CONST) + COUNT_CONST, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * COUNT_CONST + COUNT_CONST, sizeof(THING) , ...) ) // 2-factor product, only identifiers. @@ identifier SIZE, COUNT; @@ - kzalloc + kcalloc ( - SIZE * COUNT + COUNT, SIZE , ...) // 3-factor product with 1 sizeof(type) or sizeof(expression), with // redundant parens removed. @@ expression THING; identifier STRIDE, COUNT; type TYPE; @@ ( kzalloc( - sizeof(TYPE) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(TYPE) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(TYPE) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(TYPE) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(THING) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kzalloc( - sizeof(THING) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kzalloc( - sizeof(THING) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kzalloc( - sizeof(THING) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) ) // 3-factor product with 2 sizeof(variable), with redundant parens removed. @@ expression THING1, THING2; identifier COUNT; type TYPE1, TYPE2; @@ ( kzalloc( - sizeof(TYPE1) * sizeof(TYPE2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kzalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kzalloc( - sizeof(THING1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kzalloc( - sizeof(THING1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kzalloc( - sizeof(TYPE1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) | kzalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) ) // 3-factor product, only identifiers, with redundant parens removed. @@ identifier STRIDE, SIZE, COUNT; @@ ( kzalloc( - (COUNT) * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - (COUNT) * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - (COUNT) * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - (COUNT) * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) ) // Any remaining multi-factor products, first at least 3-factor products, // when they're not all constants... @@ expression E1, E2, E3; constant C1, C2, C3; @@ ( kzalloc(C1 * C2 * C3, ...) | kzalloc( - (E1) * E2 * E3 + array3_size(E1, E2, E3) , ...) | kzalloc( - (E1) * (E2) * E3 + array3_size(E1, E2, E3) , ...) | kzalloc( - (E1) * (E2) * (E3) + array3_size(E1, E2, E3) , ...) | kzalloc( - E1 * E2 * E3 + array3_size(E1, E2, E3) , ...) ) // And then all remaining 2 factors products when they're not all constants, // keeping sizeof() as the second factor argument. @@ expression THING, E1, E2; type TYPE; constant C1, C2, C3; @@ ( kzalloc(sizeof(THING) * C2, ...) | kzalloc(sizeof(TYPE) * C2, ...) | kzalloc(C1 * C2 * C3, ...) | kzalloc(C1 * C2, ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * (E2) + E2, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * E2 + E2, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * (E2) + E2, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * E2 + E2, sizeof(THING) , ...) | - kzalloc + kcalloc ( - (E1) * E2 + E1, E2 , ...) | - kzalloc + kcalloc ( - (E1) * (E2) + E1, E2 , ...) | - kzalloc + kcalloc ( - E1 * E2 + E1, E2 , ...) ) Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-11btrfs: scrub: Don't use inode pages for device replaceQu Wenruo1-1/+1
[BUG] Btrfs can create compressed extent without checksum (even though it shouldn't), and if we then try to replace device containing such extent, the result device will contain all the uncompressed data instead of the compressed one. Test case already submitted to fstests: https://patchwork.kernel.org/patch/10442353/ [CAUSE] When handling compressed extent without checksum, device replace will goe into copy_nocow_pages() function. In that function, btrfs will get all inodes referring to this data extents and then use find_or_create_page() to get pages direct from that inode. The problem here is, pages directly from inode are always uncompressed. And for compressed data extent, they mismatch with on-disk data. Thus this leads to corrupted compressed data extent written to replace device. [FIX] In this attempt, we could just remove the "optimization" branch, and let unified scrub_pages() to handle it. Although scrub_pages() won't bother reusing page cache, it will be a little slower, but it does the correct csum checking and won't cause such data corruption caused by "optimization". Note about the fix: this is the minimal fix that can be backported to older stable trees without conflicts. The whole callchain from copy_nocow_pages() can be deleted, and will be in followup patches. Fixes: ff023aac3119 ("Btrfs: add code to scrub to copy read data to another disk") CC: stable@vger.kernel.org # 4.4+ Reported-by: James Harvey <jamespharvey20@gmail.com> Reviewed-by: James Harvey <jamespharvey20@gmail.com> Signed-off-by: Qu Wenruo <wqu@suse.com> [ remove code removal, add note why ] Signed-off-by: David Sterba <dsterba@suse.com>
2018-06-07btrfs: change return type of btrfs_page_mkwrite to vm_fault_tSouptick Joarder2-15/+13
Use the new return type vm_fault_t for fault handler. For now, this is just documenting that the function returns a VM_FAULT value rather than an errno. Once all instances are converted, vm_fault_t will become a distinct type. Reference commit 1c8f422059ae ("mm: change return type to vm_fault_t") vmf_error() is the newly introduced inline function in 4.17-rc6. Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-06-07Btrfs: fiemap: pass correct bytenr when fm_extent_count is zeroRobbie Ko1-3/+1
[BUG] fm_mapped_extents is not correct when fm_extent_count is 0 Like: # mount /dev/vdb5 /mnt/btrfs # dd if=/dev/zero bs=16K count=4 oflag=dsync of=/mnt/btrfs/file # xfs_io -c "fiemap -v" /mnt/btrfs/file /mnt/btrfs/file: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..127]: 25088..25215 128 0x1 When user space wants to get the number of file extents, set fm_extent_count to 0 to run fiemap and then read fm_mapped_extents. In the above example, fiemap will return with fm_mapped_extents set to 4, but it should be 1 since there's only one entry in the output. [REASON] The problem seems to be that disko is only set if fieinfo->fi_extents_max is set. And this member is initialized, in the generic ioctl_fiemap function, to the value of used-passed fm_extent_count. So when the user passes 0 then fi_extent_max is also set to zero and this causes btrfs to not initialize disko at all. Eventually this leads emit_fiemap_extent being called with a bogus 'phys' argument preventing proper fiemap entries merging. [FIX] Move the disko initialization earlier in extent_fiemap making it independent of user-passed arguments, allowing emit_fiemap_extent to properly handle consecutive extent entries. Signed-off-by: Robbie Ko <robbieko@synology.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-06-05vfs: change inode times to use struct timespec64Deepa Dinamani5-12/+12
struct timespec is not y2038 safe. Transition vfs to use y2038 safe struct timespec64 instead. The change was made with the help of the following cocinelle script. This catches about 80% of the changes. All the header file and logic changes are included in the first 5 rules. The rest are trivial substitutions. I avoid changing any of the function signatures or any other filesystem specific data structures to keep the patch simple for review. The script can be a little shorter by combining different cases. But, this version was sufficient for my usecase. virtual patch @ depends on patch @ identifier now; @@ - struct timespec + struct timespec64 current_time ( ... ) { - struct timespec now = current_kernel_time(); + struct timespec64 now = current_kernel_time64(); ... - return timespec_trunc( + return timespec64_trunc( ... ); } @ depends on patch @ identifier xtime; @@ struct \( iattr \| inode \| kstat \) { ... - struct timespec xtime; + struct timespec64 xtime; ... } @ depends on patch @ identifier t; @@ struct inode_operations { ... int (*update_time) (..., - struct timespec t, + struct timespec64 t, ...); ... } @ depends on patch @ identifier t; identifier fn_update_time =~ "update_time$"; @@ fn_update_time (..., - struct timespec *t, + struct timespec64 *t, ...) { ... } @ depends on patch @ identifier t; @@ lease_get_mtime( ... , - struct timespec *t + struct timespec64 *t ) { ... } @te depends on patch forall@ identifier ts; local idexpression struct inode *inode_node; identifier i_xtime =~ "^i_[acm]time$"; identifier ia_xtime =~ "^ia_[acm]time$"; identifier fn_update_time =~ "update_time$"; identifier fn; expression e, E3; local idexpression struct inode *node1; local idexpression struct inode *node2; local idexpression struct iattr *attr1; local idexpression struct iattr *attr2; local idexpression struct iattr attr; identifier i_xtime1 =~ "^i_[acm]time$"; identifier i_xtime2 =~ "^i_[acm]time$"; identifier ia_xtime1 =~ "^ia_[acm]time$"; identifier ia_xtime2 =~ "^ia_[acm]time$"; @@ ( ( - struct timespec ts; + struct timespec64 ts; | - struct timespec ts = current_time(inode_node); + struct timespec64 ts = current_time(inode_node); ) <+... when != ts ( - timespec_equal(&inode_node->i_xtime, &ts) + timespec64_equal(&inode_node->i_xtime, &ts) | - timespec_equal(&ts, &inode_node->i_xtime) + timespec64_equal(&ts, &inode_node->i_xtime) | - timespec_compare(&inode_node->i_xtime, &ts) + timespec64_compare(&inode_node->i_xtime, &ts) | - timespec_compare(&ts, &inode_node->i_xtime) + timespec64_compare(&ts, &inode_node->i_xtime) | ts = current_time(e) | fn_update_time(..., &ts,...) | inode_node->i_xtime = ts | node1->i_xtime = ts | ts = inode_node->i_xtime | <+... attr1->ia_xtime ...+> = ts | ts = attr1->ia_xtime | ts.tv_sec | ts.tv_nsec | btrfs_set_stack_timespec_sec(..., ts.tv_sec) | btrfs_set_stack_timespec_nsec(..., ts.tv_nsec) | - ts = timespec64_to_timespec( + ts = ... -) | - ts = ktime_to_timespec( + ts = ktime_to_timespec64( ...) | - ts = E3 + ts = timespec_to_timespec64(E3) | - ktime_get_real_ts(&ts) + ktime_get_real_ts64(&ts) | fn(..., - ts + timespec64_to_timespec(ts) ,...) ) ...+> ( <... when != ts - return ts; + return timespec64_to_timespec(ts); ...> ) | - timespec_equal(&node1->i_xtime1, &node2->i_xtime2) + timespec64_equal(&node1->i_xtime2, &node2->i_xtime2) | - timespec_equal(&node1->i_xtime1, &attr2->ia_xtime2) + timespec64_equal(&node1->i_xtime2, &attr2->ia_xtime2) | - timespec_compare(&node1->i_xtime1, &node2->i_xtime2) + timespec64_compare(&node1->i_xtime1, &node2->i_xtime2) | node1->i_xtime1 = - timespec_trunc(attr1->ia_xtime1, + timespec64_trunc(attr1->ia_xtime1, ...) | - attr1->ia_xtime1 = timespec_trunc(attr2->ia_xtime2, + attr1->ia_xtime1 = timespec64_trunc(attr2->ia_xtime2, ...) | - ktime_get_real_ts(&attr1->ia_xtime1) + ktime_get_real_ts64(&attr1->ia_xtime1) | - ktime_get_real_ts(&attr.ia_xtime1) + ktime_get_real_ts64(&attr.ia_xtime1) ) @ depends on patch @ struct inode *node; struct iattr *attr; identifier fn; identifier i_xtime =~ "^i_[acm]time$"; identifier ia_xtime =~ "^ia_[acm]time$"; expression e; @@ ( - fn(node->i_xtime); + fn(timespec64_to_timespec(node->i_xtime)); | fn(..., - node->i_xtime); + timespec64_to_timespec(node->i_xtime)); | - e = fn(attr->ia_xtime); + e = fn(timespec64_to_timespec(attr->ia_xtime)); ) @ depends on patch forall @ struct inode *node; struct iattr *attr; identifier i_xtime =~ "^i_[acm]time$"; identifier ia_xtime =~ "^ia_[acm]time$"; identifier fn; @@ { + struct timespec ts; <+... ( + ts = timespec64_to_timespec(node->i_xtime); fn (..., - &node->i_xtime, + &ts, ...); | + ts = timespec64_to_timespec(attr->ia_xtime); fn (..., - &attr->ia_xtime, + &ts, ...); ) ...+> } @ depends on patch forall @ struct inode *node; struct iattr *attr; struct kstat *stat; identifier ia_xtime =~ "^ia_[acm]time$"; identifier i_xtime =~ "^i_[acm]time$"; identifier xtime =~ "^[acm]time$"; identifier fn, ret; @@ { + struct timespec ts; <+... ( + ts = timespec64_to_timespec(node->i_xtime); ret = fn (..., - &node->i_xtime, + &ts, ...); | + ts = timespec64_to_timespec(node->i_xtime); ret = fn (..., - &node->i_xtime); + &ts); | + ts = timespec64_to_timespec(attr->ia_xtime); ret = fn (..., - &attr->ia_xtime, + &ts, ...); | + ts = timespec64_to_timespec(attr->ia_xtime); ret = fn (..., - &attr->ia_xtime); + &ts); | + ts = timespec64_to_timespec(stat->xtime); ret = fn (..., - &stat->xtime); + &ts); ) ...+> } @ depends on patch @ struct inode *node; struct inode *node2; identifier i_xtime1 =~ "^i_[acm]time$"; identifier i_xtime2 =~ "^i_[acm]time$"; identifier i_xtime3 =~ "^i_[acm]time$"; struct iattr *attrp; struct iattr *attrp2; struct iattr attr ; identifier ia_xtime1 =~ "^ia_[acm]time$"; identifier ia_xtime2 =~ "^ia_[acm]time$"; struct kstat *stat; struct kstat stat1; struct timespec64 ts; identifier xtime =~ "^[acmb]time$"; expression e; @@ ( ( node->i_xtime2 \| attrp->ia_xtime2 \| attr.ia_xtime2 \) = node->i_xtime1 ; | node->i_xtime2 = \( node2->i_xtime1 \| timespec64_trunc(...) \); | node->i_xtime2 = node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \); | node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \); | stat->xtime = node2->i_xtime1; | stat1.xtime = node2->i_xtime1; | ( node->i_xtime2 \| attrp->ia_xtime2 \) = attrp->ia_xtime1 ; | ( attrp->ia_xtime1 \| attr.ia_xtime1 \) = attrp2->ia_xtime2; | - e = node->i_xtime1; + e = timespec64_to_timespec( node->i_xtime1 ); | - e = attrp->ia_xtime1; + e = timespec64_to_timespec( attrp->ia_xtime1 ); | node->i_xtime1 = current_time(...); | node->i_xtime2 = node->i_xtime1 = node->i_xtime3 = - e; + timespec_to_timespec64(e); | node->i_xtime1 = node->i_xtime3 = - e; + timespec_to_timespec64(e); | - node->i_xtime1 = e; + node->i_xtime1 = timespec_to_timespec64(e); ) Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Cc: <anton@tuxera.com> Cc: <balbi@kernel.org> Cc: <bfields@fieldses.org> Cc: <darrick.wong@oracle.com> Cc: <dhowells@redhat.com> Cc: <dsterba@suse.com> Cc: <dwmw2@infradead.org> Cc: <hch@lst.de> Cc: <hirofumi@mail.parknet.co.jp> Cc: <hubcap@omnibond.com> Cc: <jack@suse.com> Cc: <jaegeuk@kernel.org> Cc: <jaharkes@cs.cmu.edu> Cc: <jslaby@suse.com> Cc: <keescook@chromium.org> Cc: <mark@fasheh.com> Cc: <miklos@szeredi.hu> Cc: <nico@linaro.org> Cc: <reiserfs-devel@vger.kernel.org> Cc: <richard@nod.at> Cc: <sage@redhat.com> Cc: <sfrench@samba.org> Cc: <swhiteho@redhat.com> Cc: <tj@kernel.org> Cc: <trond.myklebust@primarydata.com> Cc: <tytso@mit.edu> Cc: <viro@zeniv.linux.org.uk>
2018-06-05btrfs: Check error of btrfs_iget in btrfs_search_path_in_tree_userMisono Tomohiro1-0/+4
The patch introducing the ioctl was not the latest version at the time of merging to the mainline and needs a fixup from this patch. Fixes: ba637a252d30 ("btrfs: Check error of btrfs_iget() in btrfs_search_path_in_tree_user") Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-06-04Merge tag 'for-4.18-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linuxLinus Torvalds47-2746/+3248
Pull btrfs updates from David Sterba: "User visible features: - added support for the ioctl FS_IOC_FSGETXATTR, per-inode flags, successor of GET/SETFLAGS; now supports only existing flags: append, immutable, noatime, nodump, sync - 3 new unprivileged ioctls to allow users to enumerate subvolumes - dedupe syscall implementation does not restrict the range to 16MiB, though it still splits the whole range to 16MiB chunks - on user demand, rmdir() is able to delete an empty subvolume, export the capability in sysfs - fix inode number types in tracepoints, other cleanups - send: improved speed when dealing with a large removed directory, measurements show decrease from 2000 minutes to 2 minutes on a directory with 2 million entries - pre-commit check of superblock to detect a mysterious in-memory corruption - log message updates Other changes: - orphan inode cleanup improved, does no keep long-standing reservations that could lead up to early ENOSPC in some cases - slight improvement of handling snapshotted NOCOW files by avoiding some unnecessary tree searches - avoid OOM when dealing with many unmergeable small extents at flush time - speedup conversion of free space tree representations from/to bitmap/tree - code refactoring, deletion, cleanups: + delayed refs + delayed iput + redundant argument removals + memory barrier cleanups + remove a redundant mutex supposedly excluding several ioctls to run in parallel - new tracepoints for blockgroup manipulation - more sanity checks of compressed headers" * tag 'for-4.18-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (183 commits) btrfs: Add unprivileged version of ino_lookup ioctl btrfs: Add unprivileged ioctl which returns subvolume's ROOT_REF btrfs: Add unprivileged ioctl which returns subvolume information Btrfs: clean up error handling in btrfs_truncate() btrfs: Factor out write portion of btrfs_get_blocks_direct btrfs: Factor out read portion of btrfs_get_blocks_direct btrfs: return ENOMEM if path allocation fails in btrfs_cross_ref_exist btrfs: raid56: Remove VLA usage btrfs: return error value if create_io_em failed in cow_file_range btrfs: drop useless member qgroup_reserved of btrfs_pending_snapshot btrfs: drop unused parameter qgroup_reserved btrfs: balance dirty metadata pages in btrfs_finish_ordered_io btrfs: lift some btrfs_cross_ref_exist checks in nocow path btrfs: Remove fs_info argument from btrfs_uuid_tree_rem btrfs: Remove fs_info argument from btrfs_uuid_tree_add Btrfs: remove unused check of skip_locking Btrfs: remove always true check in unlock_up Btrfs: grab write lock directly if write_lock_level is the max level Btrfs: move get root out of btrfs_search_slot to a helper Btrfs: use more straightforward extent_buffer_uptodate check ...
2018-06-04Merge tag 'for-4.18/block-20180603' of git://git.kernel.dk/linux-blockLinus Torvalds1-14/+11
Pull block updates from Jens Axboe: - clean up how we pass around gfp_t and blk_mq_req_flags_t (Christoph) - prepare us to defer scheduler attach (Christoph) - clean up drivers handling of bounce buffers (Christoph) - fix timeout handling corner cases (Christoph/Bart/Keith) - bcache fixes (Coly) - prep work for bcachefs and some block layer optimizations (Kent). - convert users of bio_sets to using embedded structs (Kent). - fixes for the BFQ io scheduler (Paolo/Davide/Filippo) - lightnvm fixes and improvements (Matias, with contributions from Hans and Javier) - adding discard throttling to blk-wbt (me) - sbitmap blk-mq-tag handling (me/Omar/Ming). - remove the sparc jsflash block driver, acked by DaveM. - Kyber scheduler improvement from Jianchao, making it more friendly wrt merging. - conversion of symbolic proc permissions to octal, from Joe Perches. Previously the block parts were a mix of both. - nbd fixes (Josef and Kevin Vigor) - unify how we handle the various kinds of timestamps that the block core and utility code uses (Omar) - three NVMe pull requests from Keith and Christoph, bringing AEN to feature completeness, file backed namespaces, cq/sq lock split, and various fixes - various little fixes and improvements all over the map * tag 'for-4.18/block-20180603' of git://git.kernel.dk/linux-block: (196 commits) blk-mq: update nr_requests when switching to 'none' scheduler block: don't use blocking queue entered for recursive bio submits dm-crypt: fix warning in shutdown path lightnvm: pblk: take bitmap alloc. out of critical section lightnvm: pblk: kick writer on new flush points lightnvm: pblk: only try to recover lines with written smeta lightnvm: pblk: remove unnecessary bio_get/put lightnvm: pblk: add possibility to set write buffer size manually lightnvm: fix partial read error path lightnvm: proper error handling for pblk_bio_add_pages lightnvm: pblk: fix smeta write error path lightnvm: pblk: garbage collect lines with failed writes lightnvm: pblk: rework write error recovery path lightnvm: pblk: remove dead function lightnvm: pass flag on graceful teardown to targets lightnvm: pblk: check for chunk size before allocating it lightnvm: pblk: remove unnecessary argument lightnvm: pblk: remove unnecessary indirection lightnvm: pblk: return NVM_ error on failed submission lightnvm: pblk: warn in case of corrupted write buffer ...
2018-05-31btrfs: Add unprivileged version of ino_lookup ioctlTomohiro Misono1-0/+204
Add unprivileged version of ino_lookup ioctl BTRFS_IOC_INO_LOOKUP_USER to allow normal users to call "btrfs subvolume list/show" etc. in combination with BTRFS_IOC_GET_SUBVOL_INFO/BTRFS_IOC_GET_SUBVOL_ROOTREF. This can be used like BTRFS_IOC_INO_LOOKUP but the argument is different. This is because it always searches the fs/file tree correspoinding to the fd with which this ioctl is called and also returns the name of bottom subvolume. The main differences from original ino_lookup ioctl are: 1. Read + Exec permission will be checked using inode_permission() during path construction. -EACCES will be returned in case of failure. 2. Path construction will be stopped at the inode number which corresponds to the fd with which this ioctl is called. If constructed path does not exist under fd's inode, -EACCES will be returned. 3. The name of bottom subvolume is also searched and filled. Note that the maximum length of path is shorter 256 (BTRFS_VOL_NAME_MAX+1) bytes than ino_lookup ioctl because of space of subvolume's name. Reviewed-by: Gu Jinxiang <gujx@cn.fujitsu.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Tested-by: Gu Jinxiang <gujx@cn.fujitsu.com> Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com> [ style fixes ] Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-31btrfs: Add unprivileged ioctl which returns subvolume's ROOT_REFTomohiro Misono1-0/+99
Add unprivileged ioctl BTRFS_IOC_GET_SUBVOL_ROOTREF which returns ROOT_REF information of the subvolume containing this inode except the subvolume name (this is because to prevent potential name leak). The subvolume name will be gained by user version of ino_lookup ioctl (BTRFS_IOC_INO_LOOKUP_USER) which also performs permission check. The min id of root ref's subvolume to be searched is specified by @min_id in struct btrfs_ioctl_get_subvol_rootref_args. After the search ends, @min_id is set to the last searched root ref's subvolid + 1. Also, if there are more root refs than BTRFS_MAX_ROOTREF_BUFFER_NUM, -EOVERFLOW is returned. Therefore the caller can just call this ioctl again without changing the argument to continue search. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Gu Jinxiang <gujx@cn.fujitsu.com> Tested-by: Gu Jinxiang <gujx@cn.fujitsu.com> Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com> [ style fixes and struct item renames ] Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-31btrfs: Add unprivileged ioctl which returns subvolume informationTomohiro Misono1-0/+121
Add new unprivileged ioctl BTRFS_IOC_GET_SUBVOL_INFO which returns the information of subvolume containing this inode. (i.e. returns the information in ROOT_ITEM and ROOT_BACKREF.) Reviewed-by: Gu Jinxiang <gujx@cn.fujitsu.com> Tested-by: Gu Jinxiang <gujx@cn.fujitsu.com> Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com> [ minor style fixes, update struct comments ] Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30btrfs: convert to bioset_init()/mempool_init()Kent Overstreet1-14/+11
Convert btrfs to embedded bio sets. Acked-by: Chris Mason <clm@fb.com> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-30Btrfs: clean up error handling in btrfs_truncate()Omar Sandoval1-19/+14
btrfs_truncate() uses two variables for error handling, ret and err (if this sounds familiar, it's because btrfs_truncate_inode_items() did something similar). This is error prone, as was made evident by "Btrfs: fix error handling in btrfs_truncate()". We only have err because we don't want to mask an error if we call btrfs_update_inode() and btrfs_end_transaction(), so let's make that its own scoped return variable and use ret everywhere else. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30btrfs: Factor out write portion of btrfs_get_blocks_directNikolay Borisov1-99/+108
Now that the read side is extracted into its own function, do the same to the write side. This leaves btrfs_get_blocks_direct_write with the sole purpose of handling common locking required. Also flip the condition in btrfs_get_blocks_direct_write so that the write case comes first and we check for if (Create) rather than if (!create). This is purely subjective but I believe makes reading a bit more "linear". No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30btrfs: Factor out read portion of btrfs_get_blocks_directNikolay Borisov1-13/+43
Currently this function handles both the READ and WRITE dio cases. This is facilitated by a bunch of 'if' statements, a goto short-circuit statement and a very perverse aliasing of "!created"(READ) case by setting lockstart = lockend and checking for lockstart < lockend for detecting the write. Let's simplify this mess by extracting the READ-only code into a separate __btrfs_get_block_direct_read function. This is only the first step, the next one will be to factor out the write side as well. The end goal will be to have the common locking/ unlocking code in btrfs_get_blocks_direct and then it will call either the read|write subvariants. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30btrfs: return ENOMEM if path allocation fails in btrfs_cross_ref_existSu Yue1-1/+1
The error code does not match the reason of failure and may confuse the callers. Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30btrfs: raid56: Remove VLA usageKees Cook1-10/+28
In the quest to remove all stack VLA usage from the kernel[1], this allocates the working buffers during regular init, instead of using stack space. This refactors the allocation code a bit to make it easier to review. [1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30btrfs: return error value if create_io_em failed in cow_file_rangeSu Yue1-1/+3
In cow_file_range(), create_io_em() may fail, but its return value is not recorded. Then return value may be 0 even it failed which is a wrong behavior. Let cow_file_range() return PTR_ERR(em) if create_io_em() failed. Fixes: 6f9994dbabe5 ("Btrfs: create a helper to create em for IO") CC: stable@vger.kernel.org # 4.11+ Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30btrfs: drop useless member qgroup_reserved of btrfs_pending_snapshotGu JinXiang1-1/+0
Since there is no more use of qgroup_reserved member in struct btrfs_pending_snapshot, remove it. Signed-off-by: Gu JinXiang <gujx@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30btrfs: drop unused parameter qgroup_reservedGu JinXiang4-14/+5
Since commit 7775c8184ec0 ("btrfs: remove unused parameter from btrfs_subvolume_release_metadata") parameter qgroup_reserved is not used by caller of function btrfs_subvolume_reserve_metadata. So remove it. Signed-off-by: Gu JinXiang <gujx@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30btrfs: balance dirty metadata pages in btrfs_finish_ordered_ioEthan Lien1-0/+3
[Problem description and how we fix it] We should balance dirty metadata pages at the end of btrfs_finish_ordered_io, since a small, unmergeable random write can potentially produce dirty metadata which is multiple times larger than the data itself. For example, a small, unmergeable 4KiB write may produce: 16KiB dirty leaf (and possibly 16KiB dirty node) in subvolume tree 16KiB dirty leaf (and possibly 16KiB dirty node) in checksum tree 16KiB dirty leaf (and possibly 16KiB dirty node) in extent tree Although we do call balance dirty pages in write side, but in the buffered write path, most metadata are dirtied only after we reach the dirty background limit (which by far only counts dirty data pages) and wakeup the flusher thread. If there are many small, unmergeable random writes spread in a large btree, we'll find a burst of dirty pages exceeds the dirty_bytes limit after we wakeup the flusher thread - which is not what we expect. In our machine, it caused out-of-memory problem since a page cannot be dropped if it is marked dirty. Someone may worry about we may sleep in btrfs_btree_balance_dirty_nodelay, but since we do btrfs_finish_ordered_io in a separate worker, it will not stop the flusher consuming dirty pages. Also, we use different worker for metadata writeback endio, sleep in btrfs_finish_ordered_io help us throttle the size of dirty metadata pages. [Reproduce steps] To reproduce the problem, we need to do 4KiB write randomly spread in a large btree. In our 2GiB RAM machine: 1) Create 4 subvolumes. 2) Run fio on each subvolume: [global] direct=0 rw=randwrite ioengine=libaio bs=4k iodepth=16 numjobs=1 group_reporting size=128G runtime=1800 norandommap time_based randrepeat=0 3) Take snapshot on each subvolume and repeat fio on existing files. 4) Repeat step (3) until we get large btrees. In our case, by observing btrfs_root_item->bytes_used, we have 2GiB of metadata in each subvolume tree and 12GiB of metadata in extent tree. 5) Stop all fio, take snapshot again, and wait until all delayed work is completed. 6) Start all fio. Few seconds later we hit OOM when the flusher starts to work. It can be reproduced even when using nocow write. Signed-off-by: Ethan Lien <ethanlien@synology.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add comment ] Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30btrfs: lift some btrfs_cross_ref_exist checks in nocow pathEthan Lien1-0/+15
In nocow path, we check if the extent is snapshotted in btrfs_cross_ref_exist(). We can do the similar check earlier and avoid unnecessary search into extent tree. A fio test on a Intel D-1531, 16GB RAM, SSD RAID-5 machine as follows: [global] group_reporting time_based thread=1 ioengine=libaio bs=4k iodepth=32 size=64G runtime=180 numjobs=8 rw=randwrite [file1] filename=/mnt/nocow/testfile IOPS result: unpatched patched 1 fio round: 46670 46958 snapshot 2 fio round: 51826 54498 3 fio round: 59767 61289 After snapshot, the first fio get about 5% performance gain. As we continually write to the same file, all writes will resume to nocow mode and eventually we have no performance gain. Signed-off-by: Ethan Lien <ethanlien@synology.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update comments ] Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30btrfs: Remove fs_info argument from btrfs_uuid_tree_remLu Fengqi4-9/+7
This function always takes a transaction handle which contains a reference to the fs_info. Use that and remove the extra argument. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> [ rename the function ] Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30btrfs: Remove fs_info argument from btrfs_uuid_tree_addLu Fengqi5-13/+10
This function always takes a transaction handle which contains a reference to the fs_info. Use that and remove the extra argument. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30Btrfs: remove unused check of skip_lockingLiu Bo1-2/+5
The check is superfluous since all callers who set search_for_commit also have skip_locking set. ASSERT() is put in place to ensure skip_locking is set by new callers. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30Btrfs: remove always true check in unlock_upLiu Bo1-1/+1
As unlock_up() is written as for () { if (!path->locks[i]) break; ... if (... && path->locks[i]) { } } Apparently, @path->locks[i] is always true at this 'if'. Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: David Sterba <dsterba@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30Btrfs: grab write lock directly if write_lock_level is the max levelLiu Bo1-11/+16
Typically, when acquiring root node's lock, btrfs tries its best to get read lock and trade for write lock if @write_lock_level implies to do so. In case of (cow && (p->keep_locks || p->lowest_level)), write_lock_level is set to BTRFS_MAX_LEVEL, which means we need to acquire root node's write lock directly. In this particular case, the dance of acquiring read lock and then trading for write lock can be saved. Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30Btrfs: move get root out of btrfs_search_slot to a helperLiu Bo1-45/+65
It's good to have a helper instead of having all get-root details open-coded. The new helper locks (if necessary) and sets root node of the path. Also invert the checks to make the code flow easier to read. There is no functional change in this commit. Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30Btrfs: use more straightforward extent_buffer_uptodate checkLiu Bo1-1/+1
If parent_transid "0" is passed to btrfs_buffer_uptodate(), btrfs_buffer_uptodate() is equivalent to extent_buffer_uptodate(), but extent_buffer_uptodate() is preferred since we don't have to look into verify_parent_transid(). Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30Btrfs: remove superfluous free_extent_buffer in read_block_for_searchLiu Bo1-1/+0
read_block_for_search() can be simplified as: tmp = find_extent_buffer(); if (tmp) return; ... free_extent_buffer(); read_tree_block(); Apparently, @tmp must be NULL at this point, free_extent_buffer() is not needed. Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30btrfs: drop unused space_info parameter from create_space_infoLu Fengqi1-8/+5
Since commit dc2d3005d27d ("btrfs: remove dead create_space_info calls"), there is only one caller btrfs_init_space_info. However, it doesn't need create_space_info to return space_info at all. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30Btrfs: add parent_transid parameter to veirfy_level_keyLiu Bo1-6/+7
As verify_level_key() is checked after verify_parent_transid(), i.e. if (verify_parent_transid()) ret = -EIO; else if (verify_level_key()) ret = -EUCLEAN; if parent_transid is 0, verify_parent_transid() skips verifying parent_transid and considers eb as valid, and if verify_level_key() reports something wrong, we're not going to know if it's caused by corrupted metadata or non-checkecd eb (e.g. stale eb). The stale eb can be from an outdated raid1 mirror after a degraded mount, see eg "btrfs: fix reading stale metadata blocks after degraded raid1 mounts" (02a3307aa9c20b4f66262) for more details. @parent_transid is able to tell whether the eb's generation has been verified by the caller. Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30btrfs: qgroup: show more meaningful qgroup_rescan_init error messageQu Wenruo1-15/+18
Error message from qgroup_rescan_init() mostly looks like: BTRFS info (device nvme0n1p1): qgroup_rescan_init failed with -115 Which is far from meaningful, and sometimes confusing as for above -EINPROGRESS it's mostly (despite the init race) harmless, but sometimes it can also indicate problem if the return value is -EINVAL. Change it to some more meaningful messages like: BTRFS info (device nvme0n1p1): qgroup rescan is already in progress And BTRFS err(device nvme0n1p1): qgroup rescan init failed, qgroup is not enabled Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> [ update the messages and level ] Signed-off-by: David Sterba <dsterba@suse.com>