aboutsummaryrefslogtreecommitdiffstats
path: root/fs/xfs/xfs_iomap.c (follow)
AgeCommit message (Collapse)AuthorFilesLines
2022-08-05Merge tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mmLinus Torvalds1-1/+29
Pull MM updates from Andrew Morton: "Most of the MM queue. A few things are still pending. Liam's maple tree rework didn't make it. This has resulted in a few other minor patch series being held over for next time. Multi-gen LRU still isn't merged as we were waiting for mapletree to stabilize. The current plan is to merge MGLRU into -mm soon and to later reintroduce mapletree, with a view to hopefully getting both into 6.1-rc1. Summary: - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe Lin, Yang Shi, Anshuman Khandual and Mike Rapoport - Some kmemleak fixes from Patrick Wang and Waiman Long - DAMON updates from SeongJae Park - memcg debug/visibility work from Roman Gushchin - vmalloc speedup from Uladzislau Rezki - more folio conversion work from Matthew Wilcox - enhancements for coherent device memory mapping from Alex Sierra - addition of shared pages tracking and CoW support for fsdax, from Shiyang Ruan - hugetlb optimizations from Mike Kravetz - Mel Gorman has contributed some pagealloc changes to improve latency and realtime behaviour. - mprotect soft-dirty checking has been improved by Peter Xu - Many other singleton patches all over the place" [ XFS merge from hell as per Darrick Wong in https://lore.kernel.org/all/YshKnxb4VwXycPO8@magnolia/ ] * tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (282 commits) tools/testing/selftests/vm/hmm-tests.c: fix build mm: Kconfig: fix typo mm: memory-failure: convert to pr_fmt() mm: use is_zone_movable_page() helper hugetlbfs: fix inaccurate comment in hugetlbfs_statfs() hugetlbfs: cleanup some comments in inode.c hugetlbfs: remove unneeded header file hugetlbfs: remove unneeded hugetlbfs_ops forward declaration hugetlbfs: use helper macro SZ_1{K,M} mm: cleanup is_highmem() mm/hmm: add a test for cross device private faults selftests: add soft-dirty into run_vmtests.sh selftests: soft-dirty: add test for mprotect mm/mprotect: fix soft-dirty check in can_change_pte_writable() mm: memcontrol: fix potential oom_lock recursion deadlock mm/gup.c: fix formatting in check_and_migrate_movable_page() xfs: fail dax mount if reflink is enabled on a partition mm/memcontrol.c: remove the redundant updating of stats_flush_threshold userfaultfd: don't fail on unrecognized features hugetlb_cgroup: fix wrong hugetlb cgroup numa stat ...
2022-08-04Merge tag 'xfs-5.20-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds1-4/+4
Pull xfs updates from Darrick Wong: "The biggest changes for this release are the log scalability improvements, lockless lookups for the buffer cache, and making the attr fork a permanent part of the incore inode in preparation for directory parent pointers. There's also a bunch of bug fixes that have accumulated since -rc5. I might send you a second pull request with some more bug fixes that I'm still working on. Once the merge window ends, I will hand maintainership back to Dave Chinner until the 6.1-rc1 release so that I can conduct the design review for the online fsck feature, and try to get it merged. Summary: - Improve scalability of the XFS log by removing spinlocks and global synchronization points. - Add security labels to whiteout inodes to match the other filesystems. - Clean up per-ag pointer passing to simplify call sites. - Reduce verifier overhead by precalculating more AG geometry. - Implement fast-path lockless lookups in the buffer cache to reduce spinlock hammering. - Make attr forks a permanent part of the inode structure to fix a UAF bug and because most files these days tend to have security labels and soon will have parent pointers too. - Clean up XFS_IFORK_Q usage and give it a better name. - Fix more UAF bugs in the xattr code. - SOB my tags. - Fix some typos in the timestamp range documentation. - Fix a few more memory leaks. - Code cleanups and typo fixes. - Fix an unlocked inode fork pointer access in getbmap" * tag 'xfs-5.20-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (61 commits) xfs: delete extra space and tab in blank line xfs: fix NULL pointer dereference in xfs_getbmap() xfs: Fix typo 'the the' in comment xfs: Fix comment typo xfs: don't leak memory when attr fork loading fails xfs: fix for variable set but not used warning xfs: xfs_buf cache destroy isn't RCU safe xfs: delete unnecessary NULL checks xfs: fix comment for start time value of inode with bigtime enabled xfs: fix use-after-free in xattr node block inactivation xfs: lockless buffer lookup xfs: remove a superflous hash lookup when inserting new buffers xfs: reduce the number of atomic when locking a buffer after lookup xfs: merge xfs_buf_find() and xfs_buf_get_map() xfs: break up xfs_buf_find() into individual pieces xfs: add in-memory iunlink log item xfs: add log item precommit operation xfs: combine iunlink inode update functions xfs: clean up xfs_iunlink_update_inode() xfs: double link the unlinked inode list ...
2022-07-24xfs: Add async buffered write supportStefan Roesch1-1/+4
This adds the async buffered write support to XFS. For async buffered write requests, the request will return -EAGAIN if the ilock cannot be obtained immediately. Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20220623175157.1715274-15-shr@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24xfs: Specify lockmode when calling xfs_ilock_for_iomap()Stefan Roesch1-3/+3
This patch changes the helper function xfs_ilock_for_iomap such that the lock mode must be passed in. Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20220623175157.1715274-14-shr@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-17xfs: support CoW in fsdax modeShiyang Ruan1-1/+29
In fsdax mode, WRITE and ZERO on a shared extent need CoW performed. After that, new allocated extents needs to be remapped to the file. So, add a CoW identification in ->iomap_begin(), and implement ->iomap_end() to do the remapping work. [akpm@linux-foundation.org: make xfs_dax_fault() static] Link: https://lkml.kernel.org/r/20220603053738.1218681-14-ruansy.fnst@fujitsu.com Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Dan Williams <dan.j.wiliams@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.de> Cc: Jane Chu <jane.chu@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-12xfs: replace XFS_IFORK_Q with a proper predicate functionDarrick J. Wong1-1/+1
Replace this shouty macro with a real C function that has a more descriptive name. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-07-09xfs: make inode attribute forks a permanent part of struct xfs_inodeDarrick J. Wong1-2/+2
Syzkaller reported a UAF bug a while back: ================================================================== BUG: KASAN: use-after-free in xfs_ilock_attr_map_shared+0xe3/0xf6 fs/xfs/xfs_inode.c:127 Read of size 4 at addr ffff88802cec919c by task syz-executor262/2958 CPU: 2 PID: 2958 Comm: syz-executor262 Not tainted 5.15.0-0.30.3-20220406_1406 #3 Hardware name: Red Hat KVM, BIOS 1.13.0-2.module+el8.3.0+7860+a7792d29 04/01/2014 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x82/0xa9 lib/dump_stack.c:106 print_address_description.constprop.9+0x21/0x2d5 mm/kasan/report.c:256 __kasan_report mm/kasan/report.c:442 [inline] kasan_report.cold.14+0x7f/0x11b mm/kasan/report.c:459 xfs_ilock_attr_map_shared+0xe3/0xf6 fs/xfs/xfs_inode.c:127 xfs_attr_get+0x378/0x4c2 fs/xfs/libxfs/xfs_attr.c:159 xfs_xattr_get+0xe3/0x150 fs/xfs/xfs_xattr.c:36 __vfs_getxattr+0xdf/0x13d fs/xattr.c:399 cap_inode_need_killpriv+0x41/0x5d security/commoncap.c:300 security_inode_need_killpriv+0x4c/0x97 security/security.c:1408 dentry_needs_remove_privs.part.28+0x21/0x63 fs/inode.c:1912 dentry_needs_remove_privs+0x80/0x9e fs/inode.c:1908 do_truncate+0xc3/0x1e0 fs/open.c:56 handle_truncate fs/namei.c:3084 [inline] do_open fs/namei.c:3432 [inline] path_openat+0x30ab/0x396d fs/namei.c:3561 do_filp_open+0x1c4/0x290 fs/namei.c:3588 do_sys_openat2+0x60d/0x98c fs/open.c:1212 do_sys_open+0xcf/0x13c fs/open.c:1228 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0x0 RIP: 0033:0x7f7ef4bb753d Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 1b 79 2c 00 f7 d8 64 89 01 48 RSP: 002b:00007f7ef52c2ed8 EFLAGS: 00000246 ORIG_RAX: 0000000000000055 RAX: ffffffffffffffda RBX: 0000000000404148 RCX: 00007f7ef4bb753d RDX: 00007f7ef4bb753d RSI: 0000000000000000 RDI: 0000000020004fc0 RBP: 0000000000404140 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0030656c69662f2e R13: 00007ffd794db37f R14: 00007ffd794db470 R15: 00007f7ef52c2fc0 </TASK> Allocated by task 2953: kasan_save_stack+0x19/0x38 mm/kasan/common.c:38 kasan_set_track mm/kasan/common.c:46 [inline] set_alloc_info mm/kasan/common.c:434 [inline] __kasan_slab_alloc+0x68/0x7c mm/kasan/common.c:467 kasan_slab_alloc include/linux/kasan.h:254 [inline] slab_post_alloc_hook mm/slab.h:519 [inline] slab_alloc_node mm/slub.c:3213 [inline] slab_alloc mm/slub.c:3221 [inline] kmem_cache_alloc+0x11b/0x3eb mm/slub.c:3226 kmem_cache_zalloc include/linux/slab.h:711 [inline] xfs_ifork_alloc+0x25/0xa2 fs/xfs/libxfs/xfs_inode_fork.c:287 xfs_bmap_add_attrfork+0x3f2/0x9b1 fs/xfs/libxfs/xfs_bmap.c:1098 xfs_attr_set+0xe38/0x12a7 fs/xfs/libxfs/xfs_attr.c:746 xfs_xattr_set+0xeb/0x1a9 fs/xfs/xfs_xattr.c:59 __vfs_setxattr+0x11b/0x177 fs/xattr.c:180 __vfs_setxattr_noperm+0x128/0x5e0 fs/xattr.c:214 __vfs_setxattr_locked+0x1d4/0x258 fs/xattr.c:275 vfs_setxattr+0x154/0x33d fs/xattr.c:301 setxattr+0x216/0x29f fs/xattr.c:575 __do_sys_fsetxattr fs/xattr.c:632 [inline] __se_sys_fsetxattr fs/xattr.c:621 [inline] __x64_sys_fsetxattr+0x243/0x2fe fs/xattr.c:621 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0x0 Freed by task 2949: kasan_save_stack+0x19/0x38 mm/kasan/common.c:38 kasan_set_track+0x1c/0x21 mm/kasan/common.c:46 kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:360 ____kasan_slab_free mm/kasan/common.c:366 [inline] ____kasan_slab_free mm/kasan/common.c:328 [inline] __kasan_slab_free+0xe2/0x10e mm/kasan/common.c:374 kasan_slab_free include/linux/kasan.h:230 [inline] slab_free_hook mm/slub.c:1700 [inline] slab_free_freelist_hook mm/slub.c:1726 [inline] slab_free mm/slub.c:3492 [inline] kmem_cache_free+0xdc/0x3ce mm/slub.c:3508 xfs_attr_fork_remove+0x8d/0x132 fs/xfs/libxfs/xfs_attr_leaf.c:773 xfs_attr_sf_removename+0x5dd/0x6cb fs/xfs/libxfs/xfs_attr_leaf.c:822 xfs_attr_remove_iter+0x68c/0x805 fs/xfs/libxfs/xfs_attr.c:1413 xfs_attr_remove_args+0xb1/0x10d fs/xfs/libxfs/xfs_attr.c:684 xfs_attr_set+0xf1e/0x12a7 fs/xfs/libxfs/xfs_attr.c:802 xfs_xattr_set+0xeb/0x1a9 fs/xfs/xfs_xattr.c:59 __vfs_removexattr+0x106/0x16a fs/xattr.c:468 cap_inode_killpriv+0x24/0x47 security/commoncap.c:324 security_inode_killpriv+0x54/0xa1 security/security.c:1414 setattr_prepare+0x1a6/0x897 fs/attr.c:146 xfs_vn_change_ok+0x111/0x15e fs/xfs/xfs_iops.c:682 xfs_vn_setattr_size+0x5f/0x15a fs/xfs/xfs_iops.c:1065 xfs_vn_setattr+0x125/0x2ad fs/xfs/xfs_iops.c:1093 notify_change+0xae5/0x10a1 fs/attr.c:410 do_truncate+0x134/0x1e0 fs/open.c:64 handle_truncate fs/namei.c:3084 [inline] do_open fs/namei.c:3432 [inline] path_openat+0x30ab/0x396d fs/namei.c:3561 do_filp_open+0x1c4/0x290 fs/namei.c:3588 do_sys_openat2+0x60d/0x98c fs/open.c:1212 do_sys_open+0xcf/0x13c fs/open.c:1228 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0x0 The buggy address belongs to the object at ffff88802cec9188 which belongs to the cache xfs_ifork of size 40 The buggy address is located 20 bytes inside of 40-byte region [ffff88802cec9188, ffff88802cec91b0) The buggy address belongs to the page: page:00000000c3af36a1 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x2cec9 flags: 0xfffffc0000200(slab|node=0|zone=1|lastcpupid=0x1fffff) raw: 000fffffc0000200 ffffea00009d2580 0000000600000006 ffff88801a9ffc80 raw: 0000000000000000 0000000080490049 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff88802cec9080: fb fb fb fc fc fa fb fb fb fb fc fc fb fb fb fb ffff88802cec9100: fb fc fc fb fb fb fb fb fc fc fb fb fb fb fb fc >ffff88802cec9180: fc fa fb fb fb fb fc fc fa fb fb fb fb fc fc fb ^ ffff88802cec9200: fb fb fb fb fc fc fb fb fb fb fb fc fc fb fb fb ffff88802cec9280: fb fb fc fc fa fb fb fb fb fc fc fa fb fb fb fb ================================================================== The root cause of this bug is the unlocked access to xfs_inode.i_afp from the getxattr code paths while trying to determine which ILOCK mode to use to stabilize the xattr data. Unfortunately, the VFS does not acquire i_rwsem when vfs_getxattr (or listxattr) call into the filesystem, which means that getxattr can race with a removexattr that's tearing down the attr fork and crash: xfs_attr_set: xfs_attr_get: xfs_attr_fork_remove: xfs_ilock_attr_map_shared: xfs_idestroy_fork(ip->i_afp); kmem_cache_free(xfs_ifork_cache, ip->i_afp); if (ip->i_afp && ip->i_afp = NULL; xfs_need_iread_extents(ip->i_afp)) <KABOOM> ip->i_forkoff = 0; Regrettably, the VFS is much more lax about i_rwsem and getxattr than is immediately obvious -- not only does it not guarantee that we hold i_rwsem, it actually doesn't guarantee that we *don't* hold it either. The getxattr system call won't acquire the lock before calling XFS, but the file capabilities code calls getxattr with and without i_rwsem held to determine if the "security.capabilities" xattr is set on the file. Fixing the VFS locking requires a treewide investigation into every code path that could touch an xattr and what i_rwsem state it expects or sets up. That could take years or even prove impossible; fortunately, we can fix this UAF problem inside XFS. An earlier version of this patch used smp_wmb in xfs_attr_fork_remove to ensure that i_forkoff is always zeroed before i_afp is set to null and changed the read paths to use smp_rmb before accessing i_forkoff and i_afp, which avoided these UAF problems. However, the patch author was too busy dealing with other problems in the meantime, and by the time he came back to this issue, the situation had changed a bit. On a modern system with selinux, each inode will always have at least one xattr for the selinux label, so it doesn't make much sense to keep incurring the extra pointer dereference. Furthermore, Allison's upcoming parent pointer patchset will also cause nearly every inode in the filesystem to have extended attributes. Therefore, make the inode attribute fork structure part of struct xfs_inode, at a cost of 40 more bytes. This patch adds a clunky if_present field where necessary to maintain the existing logic of xattr fork null pointer testing in the existing codebase. The next patch switches the logic over to XFS_IFORK_Q and it all goes away. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-07-09xfs: convert XFS_IFORK_PTR to a static inline helperDarrick J. Wong1-2/+2
We're about to make this logic do a bit more, so convert the macro to a static inline function for better typechecking and fewer shouty macros. No functional changes here. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-04-13xfs: Conditionally upgrade existing inodes to use large extent countersChandan Babu R1-0/+5
This commit enables upgrading existing inodes to use large extent counters provided that underlying filesystem's superblock has large extent counter feature enabled. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
2022-04-11xfs: Define max extent length based on on-disk format definitionChandan Babu R1-14/+14
The maximum extent length depends on maximum block count that can be stored in a BMBT record. Hence this commit defines MAXEXTLEN based on BMBT_BLOCKCOUNT_BITLEN. While at it, the commit also renames MAXEXTLEN to XFS_MAX_BMBT_EXTLEN. Suggested-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
2021-12-04fsdax: shift partition offset handling into the file systemsChristoph Hellwig1-2/+8
Remove the last user of ->bdev in dax.c by requiring the file system to pass in an address that already includes the DAX offset. As part of the only set ->bdev or ->daxdev when actually required in the ->iomap_begin methods. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> [erofs] Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20211129102203.2243509-27-hch@lst.de Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2021-12-04iomap: add a IOMAP_DAX flagChristoph Hellwig1-3/+4
Add a flag so that the file system can easily detect DAX operations based just on the iomap operation requested instead of looking at inode state using IS_DAX. This will be needed to apply the to be added partition offset only for operations that actually use DAX, but not things like fiemap that are based on the block device. In the long run it should also allow turning the bdev, dax_dev and inline_data into a union. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20211129102203.2243509-25-hch@lst.de Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2021-12-04xfs: pass the mapping flags to xfs_bmbt_to_iomapChristoph Hellwig1-15/+20
To prepare for looking at the IOMAP_DAX flag in xfs_bmbt_to_iomap pass in the input mapping flags to xfs_bmbt_to_iomap. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20211129102203.2243509-24-hch@lst.de Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2021-12-04xfs: use xfs_direct_write_iomap_ops for DAX zeroingChristoph Hellwig1-2/+2
While the buffered write iomap ops do work due to the fact that zeroing never allocates blocks, the DAX zeroing should use the direct ops just like actual DAX I/O. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20211129102203.2243509-23-hch@lst.de Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2021-12-04fsdax: decouple zeroing from the iomap buffered I/O codeChristoph Hellwig1-1/+6
Unshare the DAX and iomap buffered I/O page zeroing code. This code previously did a IS_DAX check deep inside the iomap code, which in fact was the only DAX check in the code. Instead move these checks into the callers. Most callers already have DAX special casing anyway and XFS will need it for reflink support as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20211129102203.2243509-19-hch@lst.de Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2021-12-04xfs: add xfs_zero_range and xfs_truncate_page helpersShiyang Ruan1-0/+25
Add helpers to prepare for using different DAX operations. Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> [hch: split from a larger patch + slight cleanups] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20211129102203.2243509-16-hch@lst.de Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2021-08-23xfs: only set IOMAP_F_SHARED when providing a srcmap to a writeDarrick J. Wong1-4/+4
While prototyping a free space defragmentation tool, I observed an unexpected IO error while running a sequence of commands that can be recreated by the following sequence of commands: # xfs_io -f -c "pwrite -S 0x58 -b 10m 0 10m" file1 # cp --reflink=always file1 file2 # punch-alternating -o 1 file2 # xfs_io -c "funshare 0 10m" file2 fallocate: Input/output error I then scraped this (abbreviated) stack trace from dmesg: WARNING: CPU: 0 PID: 30788 at fs/iomap/buffered-io.c:577 iomap_write_begin+0x376/0x450 CPU: 0 PID: 30788 Comm: xfs_io Not tainted 5.14.0-rc6-xfsx #rc6 5ef57b62a900814b3e4d885c755e9014541c8732 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014 RIP: 0010:iomap_write_begin+0x376/0x450 RSP: 0018:ffffc90000c0fc20 EFLAGS: 00010297 RAX: 0000000000000001 RBX: ffffc90000c0fd10 RCX: 0000000000001000 RDX: ffffc90000c0fc54 RSI: 000000000000000c RDI: 000000000000000c RBP: ffff888005d5dbd8 R08: 0000000000102000 R09: ffffc90000c0fc50 R10: 0000000000b00000 R11: 0000000000101000 R12: ffffea0000336c40 R13: 0000000000001000 R14: ffffc90000c0fd10 R15: 0000000000101000 FS: 00007f4b8f62fe40(0000) GS:ffff88803ec00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000056361c554108 CR3: 000000000524e004 CR4: 00000000001706f0 Call Trace: iomap_unshare_actor+0x95/0x140 iomap_apply+0xfa/0x300 iomap_file_unshare+0x44/0x60 xfs_reflink_unshare+0x50/0x140 [xfs 61947ea9b3a73e79d747dbc1b90205e7987e4195] xfs_file_fallocate+0x27c/0x610 [xfs 61947ea9b3a73e79d747dbc1b90205e7987e4195] vfs_fallocate+0x133/0x330 __x64_sys_fallocate+0x3e/0x70 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f4b8f79140a Looking at the iomap tracepoints, I saw this: iomap_iter: dev 8:64 ino 0x100 pos 0 length 0 flags WRITE|0x80 (0x81) ops xfs_buffered_write_iomap_ops caller iomap_file_unshare iomap_iter_dstmap: dev 8:64 ino 0x100 bdev 8:64 addr -1 offset 0 length 131072 type DELALLOC flags SHARED iomap_iter_srcmap: dev 8:64 ino 0x100 bdev 8:64 addr 147456 offset 0 length 4096 type MAPPED flags iomap_iter: dev 8:64 ino 0x100 pos 0 length 4096 flags WRITE|0x80 (0x81) ops xfs_buffered_write_iomap_ops caller iomap_file_unshare iomap_iter_dstmap: dev 8:64 ino 0x100 bdev 8:64 addr -1 offset 4096 length 4096 type DELALLOC flags SHARED console: WARNING: CPU: 0 PID: 30788 at fs/iomap/buffered-io.c:577 iomap_write_begin+0x376/0x450 The first time funshare calls ->iomap_begin, xfs sees that the first block is shared and creates a 128k delalloc reservation in the COW fork. The delalloc reservation is returned as dstmap, and the shared block is returned as srcmap. So far so good. funshare calls ->iomap_begin to try the second block. This time there's no srcmap (punch-alternating punched it out!) but we still have the delalloc reservation in the COW fork. Therefore, we again return the reservation as dstmap and the hole as srcmap. iomap_unshare_iter incorrectly tries to unshare the hole, which __iomap_write_begin rejects because shared regions must be fully written and therefore cannot require zeroing. Therefore, change the buffered write iomap_begin function not to set IOMAP_F_SHARED when there isn't a source mapping to read from for the unsharing. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
2021-08-19xfs: replace XFS_FORCED_SHUTDOWN with xfs_is_shutdownDave Chinner1-6/+6
Remove the shouty macro and instead use the inline function that matches other state/feature check wrapper naming. This conversion was done with sed. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-08-19xfs: convert mount flags to featuresDave Chinner1-2/+2
Replace m_flags feature checks with xfs_has_<feature>() calls and rework the setup code to set flags in m_features. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-05-26xfs: Fix fall-through warnings for ClangGustavo A. R. Silva1-1/+1
In preparation to enable -Wimplicit-fallthrough for Clang, fix the following warnings by replacing /* fall through */ comments, and its variants, with the new pseudo-keyword macro fallthrough: fs/xfs/libxfs/xfs_alloc.c:3167:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/libxfs/xfs_da_btree.c:286:3: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/libxfs/xfs_ag_resv.c:346:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/libxfs/xfs_ag_resv.c:388:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/xfs_bmap_util.c:246:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/xfs_export.c:88:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/xfs_export.c:96:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/xfs_file.c:867:3: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/xfs_ioctl.c:562:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/xfs_ioctl.c:1548:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/xfs_iomap.c:1040:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/xfs_inode.c:852:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/xfs_log.c:2627:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/xfs_trans_buf.c:298:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/scrub/bmap.c:275:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/scrub/btree.c:48:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/scrub/common.c:85:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/scrub/common.c:138:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/scrub/common.c:698:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/scrub/dabtree.c:51:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/scrub/repair.c:951:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/scrub/agheader.c:89:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] Notice that Clang doesn't recognize /* fall through */ comments as implicit fall-through markings, so in order to globally enable -Wimplicit-fallthrough for Clang, these comments need to be replaced with fallthrough; in the whole codebase. Link: https://github.com/KSPP/linux/issues/115 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2021-04-15xfs: remove XFS_IFEXTENTSChristoph Hellwig1-2/+2
The in-memory XFS_IFEXTENTS is now only used to check if an inode with extents still needs the extents to be read into memory before doing operations that need the extent map. Add a new xfs_need_iread_extents helper that returns true for btree format forks that do not have any entries in the in-memory extent btree, and use that instead of checking the XFS_IFEXTENTS flag. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-15xfs: move the XFS_IFEXTENTS check into xfs_iread_extentsChristoph Hellwig1-10/+6
Move the XFS_IFEXTENTS check from the callers into xfs_iread_extents to simplify the code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07xfs: move the di_size field to struct xfs_inodeChristoph Hellwig1-1/+1
In preparation of removing the historic icinode struct, move the on-disk size field into the containing xfs_inode structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07xfs: Fix dax inode extent calculation when direct write is performed on an unwritten extentChandan Babu R1-2/+3
With dax enabled filesystems, a direct write operation into an existing unwritten extent results in xfs_iomap_write_direct() zero-ing and converting the extent into a normal extent before the actual data is copied from the userspace buffer. The inode extent count can increase by 2 if the extent range being written to maps to the middle of the existing unwritten extent range. Hence this commit uses XFS_IEXT_WRITE_UNWRITTEN_CNT as the extent count delta when such a write operation is being performed. Fixes: 727e1acd297c ("xfs: Check for extent overflow when trivally adding a new extent") Reported-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-02-21Merge tag 'xfs-5.12-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds1-30/+23
Pull xfs updates from Darrick Wong: "There's a lot going on this time, which seems about right for this drama-filled year. Community developers added some code to speed up freezing when read-only workloads are still running, refactored the logging code, added checks to prevent file extent counter overflow, reduced iolock cycling to speed up fsync and gc scans, and started the slow march towards supporting filesystem shrinking. There's a huge refactoring of the internal speculative preallocation garbage collection code which fixes a bunch of bugs, makes the gc scheduling per-AG and hence multithreaded, and standardizes the retry logic when we try to reserve space or quota, can't, and want to trigger a gc scan. We also enable multithreaded quotacheck to reduce mount times further. This is also preparation for background file gc, which may or may not land for 5.13. We also fixed some deadlocks in the rename code, fixed a quota accounting leak when FSSETXATTR fails, restored the behavior that write faults to an mmap'd region actually cause a SIGBUS, fixed a bug where sgid directory inheritance wasn't quite working properly, and fixed a bug where symlinks weren't working properly in ecryptfs. We also now advertise the inode btree counters feature that was introduced two cycles ago. Summary: - Fix an ABBA deadlock when renaming files on overlayfs. - Make sure that we can't overflow the inode extent counters when adding to or removing extents from a file. - Make directory sgid inheritance work the same way as all the other filesystems. - Don't drain the buffer cache on freeze and ro remount, which should reduce the amount of time if read-only workloads are continuing during the freeze. - Fix a bug where symlink size isn't reported to the vfs in ecryptfs. - Disentangle log cleaning from log covering. This refactoring sets us up for future changes to the log, though for now it simply means that we can use covering for freezes, and cleaning becomes something we only do at unmount. - Speed up file fsyncs by reducing iolock cycling. - Fix delalloc blocks leaking when changing the project id fails because of input validation errors in FSSETXATTR. - Fix oversized quota reservation when converting unwritten extents during a DAX write. - Create a transaction allocation helper function to standardize the idiom of allocating a transaction, reserving blocks, locking inodes, and reserving quota. Replace all the open-coded logic for file creation, file ownership changes, and file modifications to use them. - Actually shut down the fs if the incore quota reservations get corrupted. - Fix background block garbage collection scans to not block and to actually clean out CoW staging extents properly. - Run block gc scans when we run low on project quota. - Use the standardized transaction allocation helpers to make it so that ENOSPC and EDQUOT errors during reservation will back out, invoke the block gc scanner, and try again. This is preparation for introducing background inode garbage collection in the next cycle. - Combine speculative post-EOF block garbage collection with speculative copy on write block garbage collection. - Enable multithreaded quotacheck. - Allow sysadmins to tweak the CPU affinities and maximum concurrency levels of quotacheck and background blockgc worker pools. - Expose the inode btree counter feature in the fs geometry ioctl. - Cleanups of the growfs code in preparation for starting work on filesystem shrinking. - Fix all the bloody gcc warnings that the maintainer knows about. :P - Fix a RST syntax error. - Don't trigger bmbt corruption assertions after the fs shuts down. - Restore behavior of forcing SIGBUS on a shut down filesystem when someone triggers a mmap write fault (or really, any buffered write)" * tag 'xfs-5.12-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (85 commits) xfs: consider shutdown in bmapbt cursor delete assert xfs: fix boolreturn.cocci warnings xfs: restore shutdown check in mapped write fault path xfs: fix rst syntax error in admin guide xfs: fix incorrect root dquot corruption error when switching group/project quota types xfs: get rid of xfs_growfs_{data,log}_t xfs: rename `new' to `delta' in xfs_growfs_data_private() libxfs: expose inobtcount in xfs geometry xfs: don't bounce the iolock between free_{eof,cow}blocks xfs: expose the blockgc workqueue knobs publicly xfs: parallelize block preallocation garbage collection xfs: rename block gc start and stop functions xfs: only walk the incore inode tree once per blockgc scan xfs: consolidate the eofblocks and cowblocks workers xfs: consolidate incore inode radix tree posteof/cowblocks tags xfs: remove trivial eof/cowblocks functions xfs: hide xfs_icache_free_cowblocks xfs: hide xfs_icache_free_eofblocks xfs: relocate the eofb/cowb workqueue functions xfs: set WQ_SYSFS on all workqueues in debug mode ...
2021-02-10xfs: restore shutdown check in mapped write fault pathBrian Foster1-0/+3
XFS triggers an iomap warning in the write fault path due to a !PageUptodate() page if a write fault happens to occur on a page that recently failed writeback. The iomap writeback error handling code can clear the Uptodate flag if no portion of the page is submitted for I/O. This is reproduced by fstest generic/019, which combines various forms of I/O with simulated disk failures that inevitably lead to filesystem shutdown (which then unconditionally fails page writeback). This is a regression introduced by commit f150b4234397 ("xfs: split the iomap ops for buffered vs direct writes") due to the removal of a shutdown check and explicit error return in the ->iomap_begin() path used by the write fault path. The explicit error return historically translated to a SIGBUS, but now carries on with iomap processing where it complains about the unexpected state. Restore the shutdown check to xfs_buffered_write_iomap_begin() to restore historical behavior. Fixes: f150b4234397 ("xfs: split the iomap ops for buffered vs direct writes") Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-02-03xfs: refactor reflink functions to use xfs_trans_alloc_inodeDarrick J. Wong1-1/+2
The two remaining callers of xfs_trans_reserve_quota_nblks are in the reflink code. These conversions aren't as uniform as the previous conversions, so call that out in a separate patch. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-02-03xfs: allow reservation of rtblocks with xfs_trans_alloc_inodeDarrick J. Wong1-17/+5
Make it so that we can reserve rt blocks with the xfs_trans_alloc_inode wrapper function, then convert a few more callsites. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03xfs: refactor common transaction/inode/quota allocation idiomDarrick J. Wong1-9/+2
Create a new helper xfs_trans_alloc_inode that allocates a transaction, locks and joins an inode to it, and then reserves the appropriate amount of quota against that transction. Then replace all the open-coded idioms with a single call to this helper. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03xfs: reserve data and rt quota at the same timeDarrick J. Wong1-13/+13
Modify xfs_trans_reserve_quota_nblks so that we can reserve data and realtime blocks from the dquot at the same time. This change has the theoretical side effect that for allocations to realtime files we will reserve from the dquot both the number of rtblocks being allocated and the number of bmbt blocks that might be needed to add the mapping. However, since the mount code disables quota if it finds a realtime device, this should not result in any behavior changes. Now that we've moved the inode creation callers away from using the _nblks function, we can repurpose the (now unused) ninos argument for realtime blocks, so make that change. This also replaces the flags argument with a boolean parameter to force the reservation since we don't need to distinguish between data and rt quota reservations any more, and the only flag being passed in was FORCE_RES. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03xfs: remove xfs_trans_unreserve_quota_nblks completelyDarrick J. Wong1-4/+2
xfs_trans_cancel will release all the quota resources that were reserved on behalf of the transaction, so get rid of the explicit unreserve step. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03xfs: reduce quota reservation when doing a dax unwritten extent conversionDarrick J. Wong1-1/+1
In commit 3b0fe47805802, we reduced the free space requirement to perform a pre-write unwritten extent conversion on an S_DAX file. Since we're not actually allocating any space, the logic goes, we only need enough reservation to handle shape changes in the bmbt. The same logic should have been applied to quota -- we're not allocating any space, so we only need to reserve enough quota to handle the bmbt shape changes. Fixes: 3b0fe4780580 ("xfs: Don't use reserved blocks for data blocks with DAX") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-01xfs: reduce exclusive locking on unaligned dioDave Chinner1-8/+21
Attempt shared locking for unaligned DIO, but only if the the underlying extent is already allocated and in written state. On failure, retry with the existing exclusive locking. Test case is fio randrw of 512 byte IOs using AIO and an iodepth of 32 IOs. Vanilla: READ: bw=4560KiB/s (4670kB/s), 4560KiB/s-4560KiB/s (4670kB/s-4670kB/s), io=134MiB (140MB), run=30001-30001msec WRITE: bw=4567KiB/s (4676kB/s), 4567KiB/s-4567KiB/s (4676kB/s-4676kB/s), io=134MiB (140MB), run=30001-30001msec Patched: READ: bw=37.6MiB/s (39.4MB/s), 37.6MiB/s-37.6MiB/s (39.4MB/s-39.4MB/s), io=1127MiB (1182MB), run=30002-30002msec WRITE: bw=37.6MiB/s (39.4MB/s), 37.6MiB/s-37.6MiB/s (39.4MB/s-39.4MB/s), io=1128MiB (1183MB), run=30002-30002msec That's an improvement from ~18k IOPS to a ~150k IOPS, which is about the IOPS limit of the VM block device setup I'm testing on. 4kB block IO comparison: READ: bw=296MiB/s (310MB/s), 296MiB/s-296MiB/s (310MB/s-310MB/s), io=8868MiB (9299MB), run=30002-30002msec WRITE: bw=296MiB/s (310MB/s), 296MiB/s-296MiB/s (310MB/s-310MB/s), io=8878MiB (9309MB), run=30002-30002msec Which is ~150k IOPS, same as what the test gets for sub-block AIO+DIO writes with this patch. Signed-off-by: Dave Chinner <dchinner@redhat.com> [hch: rebased, split unaligned from nowait] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-01-22xfs: Check for extent overflow when writing to unwritten extentChandan Babu R1-0/+5
A write to a sub-interval of an existing unwritten extent causes the original extent to be split into 3 extents i.e. | Unwritten | Real | Unwritten | Hence extent count can increase by 2. Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2021-01-22xfs: Check for extent overflow when trivally adding a new extentChandan Babu R1-0/+5
When adding a new data extent (without modifying an inode's existing extents) the extent count increases only by 1. This commit checks for extent count overflow in such cases. Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-11-19xfs: don't allow NOWAIT DIO across extent boundariesDave Chinner1-0/+29
Jens has reported a situation where partial direct IOs can be issued and completed yet still return -EAGAIN. We don't want this to report a short IO as we want XFS to complete user DIO entirely or not at all. This partial IO situation can occur on a write IO that is split across an allocated extent and a hole, and the second mapping is returning EAGAIN because allocation would be required. The trivial reproducer: $ sudo xfs_io -fdt -c "pwrite 0 4k" -c "pwrite -V 1 -b 8k -N 0 8k" /mnt/scr/foo wrote 4096/4096 bytes at offset 0 4 KiB, 1 ops; 0.0001 sec (27.509 MiB/sec and 7042.2535 ops/sec) pwrite: Resource temporarily unavailable $ The pwritev2(0, 8kB, RWF_NOWAIT) call returns EAGAIN having done the first 4kB write: xfs_file_direct_write: dev 259:1 ino 0x83 size 0x1000 offset 0x0 count 0x2000 iomap_apply: dev 259:1 ino 0x83 pos 0 length 8192 flags WRITE|DIRECT|NOWAIT (0x31) ops xfs_direct_write_iomap_ops caller iomap_dio_rw actor iomap_dio_actor xfs_ilock_nowait: dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_ilock_for_iomap xfs_iunlock: dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_direct_write_iomap_begin xfs_iomap_found: dev 259:1 ino 0x83 size 0x1000 offset 0x0 count 8192 fork data startoff 0x0 startblock 24 blockcount 0x1 iomap_apply_dstmap: dev 259:1 ino 0x83 bdev 259:1 addr 102400 offset 0 length 4096 type MAPPED flags DIRTY Here the first iomap loop has mapped the first 4kB of the file and issued the IO, and we enter the second iomap_apply loop: iomap_apply: dev 259:1 ino 0x83 pos 4096 length 4096 flags WRITE|DIRECT|NOWAIT (0x31) ops xfs_direct_write_iomap_ops caller iomap_dio_rw actor iomap_dio_actor xfs_ilock_nowait: dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_ilock_for_iomap xfs_iunlock: dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_direct_write_iomap_begin And we exit with -EAGAIN out because we hit the allocate case trying to make the second 4kB block. Then IO completes on the first 4kB and the original IO context completes and unlocks the inode, returning -EAGAIN to userspace: xfs_end_io_direct_write: dev 259:1 ino 0x83 isize 0x1000 disize 0x1000 offset 0x0 count 4096 xfs_iunlock: dev 259:1 ino 0x83 flags IOLOCK_SHARED caller xfs_file_dio_aio_write There are other vectors to the same problem when we re-enter the mapping code if we have to make multiple mappinfs under NOWAIT conditions. e.g. failing trylocks, COW extents being found, allocation being required, and so on. Avoid all these potential problems by only allowing IOMAP_NOWAIT IO to go ahead if the mapping we retrieve for the IO spans an entire allocated extent. This avoids the possibility of subsequent mappings to complete the IO from triggering NOWAIT semantics by any means as NOWAIT IO will now only enter the mapping code once per NOWAIT IO. Reported-and-tested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-08-05xfs: delete duplicated words + other fixesRandy Dunlap1-1/+1
Delete repeated words in fs/xfs/. {we, that, the, a, to, fork} Change "it it" to "it is" in one location. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> To: linux-fsdevel@vger.kernel.org Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: linux-xfs@vger.kernel.org Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-07-28xfs: create xfs_dqtype_t to represent quota typesDarrick J. Wong1-12/+12
Create a new type (xfs_dqtype_t) to represent the type of an incore dquot (user, group, project, or none). Rename the incore dquot's dq_flags field to q_type. This allows us to replace all the "uint type" arguments to the quota functions with "xfs_dqtype_t type", to make it obvious when we're passing a quota type argument into a function. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-07-28xfs: rename XFS_DQ_{USER,GROUP,PROJ} to XFS_DQTYPE_*Darrick J. Wong1-6/+6
We're going to split up the incore dquot state flags from the ondisk dquot flags (eventually renaming this "type") so start by renaming the three flags and the bitmask that are going to participate in this. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-07-28xfs: use a per-resource struct for incore dquot dataDarrick J. Wong1-3/+3
Introduce a new struct xfs_dquot_res that we'll use to track all the incore data for a particular resource type (block, inode, rt block). This will help us (once we've eliminated q_core) to declutter quota functions that currently open-code field access or pass around fields around explicitly. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Allison Collins <allison.henderson@oracle.com>
2020-05-27xfs: refactor xfs_iomap_prealloc_sizeDarrick J. Wong1-48/+35
Refactor xfs_iomap_prealloc_size to be the function that dynamically computes the per-file preallocation size by moving the allocsize= case to the caller. Break up the huge comment preceding the function to annotate the relevant parts of the code, and remove the impossible check_writeio case. Suggested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-05-27xfs: measure all contiguous previous extents for prealloc sizeDarrick J. Wong1-13/+27
When we're estimating a new speculative preallocation length for an extending write, we should walk backwards through the extent list to determine the number of number of blocks that are physically and logically contiguous with the write offset, and use that as an input to the preallocation size computation. This way, preallocation length is truly measured by the effectiveness of the allocator in giving us contiguous allocations without being influenced by the state of a given extent. This fixes both the problem where ZERO_RANGE within an EOF can reduce preallocation, and prevents the unnecessary shrinkage of preallocation when delalloc extents are turned into unwritten extents. This was found as a regression in xfs/014 after changing delalloc writes to create unwritten extents during writeback. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-05-27xfs: don't fail unwritten extent conversion on writeback due to edquotDarrick J. Wong1-1/+1
During writeback, it's possible for the quota block reservation in xfs_iomap_write_unwritten to fail with EDQUOT because we hit the quota limit. This causes writeback errors for data that was already written to disk, when it's not even guaranteed that the bmbt will expand to exceed the quota limit. Irritatingly, this condition is reported to userspace as EIO by fsync, which is confusing. We wrote the data, so allow the reservation. That might put us slightly above the hard limit, but it's better than losing data after a write. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-05-19xfs: move the fork format fields into struct xfs_iforkChristoph Hellwig1-2/+2
Both the data and attr fork have a format that is stored in the legacy idinode. Move it into the xfs_ifork structure instead, where it uses up padding. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-19xfs: move the per-fork nextents fields into struct xfs_iforkChristoph Hellwig1-1/+1
There are there are three extents counters per inode, one for each of the forks. Two are in the legacy icdinode and one is directly in struct xfs_inode. Switch to a single counter in the xfs_ifork structure where it uses up padding at the end of the structure. This simplifies various bits of code that just wants the number of extents counter and can now directly dereference it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-01-20xfs: change return value of xfs_inode_need_cow to intzhengbin1-1/+1
Fixes coccicheck warning: fs/xfs/xfs_reflink.c:236:9-10: WARNING: return of 0/1 in function 'xfs_inode_need_cow' with return type bool Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: zhengbin <zhengbin13@huawei.com> [darrick: rename the function so it doesn't sound like a predicate] Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-13xfs: convert open coded corruption check to use XFS_IS_CORRUPTDarrick J. Wong1-4/+2
Convert the last of the open coded corruption check and report idioms to use the XFS_IS_CORRUPT macro. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-11-11xfs: attach dquots and reserve quota blocks during unwritten conversionDarrick J. Wong1-0/+10
In xfs_iomap_write_unwritten, we need to ensure that dquots are attached to the inode and quota blocks reserved so that we capture in the quota counters any blocks allocated to handle a bmbt split. This can happen on the first unwritten extent conversion to a preallocated sparse file on a fresh mount. This was found by running generic/311 with quotas enabled. The bug seems to have been introduced in "[XFS] rework iocore infrastructure, remove some code and make it more" from ~2002? Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-11-10xfs: refactor "does this fork map blocks" predicateDarrick J. Wong1-2/+1
Replace the open-coded checks for whether or not an inode fork maps blocks with a macro that will implant the code for us. This helps us declutter the bmap code a bit. Note that I had to use a macro instead of a static inline function because of C header dependency problems between xfs_inode.h and xfs_inode_fork.h. Conversion was performed with the following Coccinelle script: @@ expression ip, w; @@ - XFS_IFORK_FORMAT(ip, w) == XFS_DINODE_FMT_EXTENTS || XFS_IFORK_FORMAT(ip, w) == XFS_DINODE_FMT_BTREE + xfs_ifork_has_extents(ip, w) @@ expression ip, w; @@ - XFS_IFORK_FORMAT(ip, w) != XFS_DINODE_FMT_EXTENTS && XFS_IFORK_FORMAT(ip, w) != XFS_DINODE_FMT_BTREE + !xfs_ifork_has_extents(ip, w) @@ expression ip, w; @@ - XFS_IFORK_FORMAT(ip, w) == XFS_DINODE_FMT_BTREE || XFS_IFORK_FORMAT(ip, w) == XFS_DINODE_FMT_EXTENTS + xfs_ifork_has_extents(ip, w) @@ expression ip, w; @@ - XFS_IFORK_FORMAT(ip, w) != XFS_DINODE_FMT_BTREE && XFS_IFORK_FORMAT(ip, w) != XFS_DINODE_FMT_EXTENTS + !xfs_ifork_has_extents(ip, w) @@ expression ip, w; @@ - (xfs_ifork_has_extents(ip, w)) + xfs_ifork_has_extents(ip, w) @@ expression ip, w; @@ - (!xfs_ifork_has_extents(ip, w)) + !xfs_ifork_has_extents(ip, w) Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-11-03xfs: simplify the xfs_iomap_write_direct callingChristoph Hellwig1-53/+29
Move the EOF alignment and checking for the next allocated extent into the callers to avoid the need to pass the byte based offset and count as well as looking at the incoming imap. The added benefit is that the caller can unlock the incoming ilock and the function doesn't have funny unbalanced locking contexts. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>