aboutsummaryrefslogtreecommitdiffstats
path: root/fs/xfs (follow)
AgeCommit message (Collapse)AuthorFilesLines
2015-05-29xfs: fix broken i_nlink accounting for whiteout tmpfile inodeBrian Foster1-2/+8
XFS uses the internal tmpfile() infrastructure for the whiteout inode used for RENAME_WHITEOUT operations. For tmpfile inodes, XFS allocates the inode, drops di_nlink, adds the inode to the agi unlinked list, calls d_tmpfile() which correspondingly drops i_nlink of the vfs inode, and then finishes the common inode setup (e.g., clear I_NEW and unlock). The d_tmpfile() call was originally made inxfs_create_tmpfile(), but was pulled up out of that function as part of the following commit to resolve a deadlock issue: 330033d6 xfs: fix tmpfile/selinux deadlock and initialize security As a result, callers of xfs_create_tmpfile() are responsible for either calling d_tmpfile() or fixing up i_nlink appropriately. The whiteout tmpfile allocation helper does neither. As a result, the vfs ->i_nlink becomes inconsistent with the on-disk ->di_nlink once xfs_rename() links it back into the source dentry and calls xfs_bumplink(). Update the assert in xfs_rename() to help detect this problem in the future and update xfs_rename_alloc_whiteout() to decrement the link count as part of the manual tmpfile inode setup. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-05-29xfs: xfs_iozero can return positive errnoDave Chinner1-1/+1
It was missed when we converted everything in XFs to use negative error numbers, so fix it now. Bug introduced in 3.17 by commit 2451337 ("xfs: global error sign conversion"), and should go back to stable kernels. Thanks to Brian Foster for noticing it. cc: <stable@vger.kernel.org> # 3.17, 3.18, 3.19, 4.0 Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-05-29xfs: xfs_attr_inactive leaves inconsistent attr fork state behindDave Chinner4-47/+58
xfs_attr_inactive() is supposed to clean up the attribute fork when the inode is being freed. While it removes attribute fork extents, it completely ignores attributes in local format, which means that there can still be active attributes on the inode after xfs_attr_inactive() has run. This leads to problems with concurrent inode writeback - the in-core inode attribute fork is removed without locking on the assumption that nothing will be attempting to access the attribute fork after a call to xfs_attr_inactive() because it isn't supposed to exist on disk any more. To fix this, make xfs_attr_inactive() completely remove all traces of the attribute fork from the inode, regardless of it's state. Further, also remove the in-core attribute fork structure safely so that there is nothing further that needs to be done by callers to clean up the attribute fork. This means we can remove the in-core and on-disk attribute forks atomically. Also, on error simply remove the in-memory attribute fork. There's nothing that can be done with it once we have failed to remove the on-disk attribute fork, so we may as well just blow it away here anyway. cc: <stable@vger.kernel.org> # 3.12 to 4.0 Reported-by: Waiman Long <waiman.long@hp.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-05-29xfs: extent size hints can round up extents past MAXEXTLENDave Chinner1-12/+19
This results in BMBT corruption, as seen by this test: # mkfs.xfs -f -d size=40051712b,agcount=4 /dev/vdc .... # mount /dev/vdc /mnt/scratch # xfs_io -ft -c "extsize 16m" -c "falloc 0 30g" -c "bmap -vp" /mnt/scratch/foo which results in this failure on a debug kernel: XFS: Assertion failed: (blockcount & xfs_mask64hi(64-BMBT_BLOCKCOUNT_BITLEN)) == 0, file: fs/xfs/libxfs/xfs_bmap_btree.c, line: 211 .... Call Trace: [<ffffffff814cf0ff>] xfs_bmbt_set_allf+0x8f/0x100 [<ffffffff814cf18d>] xfs_bmbt_set_all+0x1d/0x20 [<ffffffff814f2efe>] xfs_iext_insert+0x9e/0x120 [<ffffffff814c7956>] ? xfs_bmap_add_extent_hole_real+0x1c6/0xc70 [<ffffffff814c7956>] xfs_bmap_add_extent_hole_real+0x1c6/0xc70 [<ffffffff814caaab>] xfs_bmapi_write+0x72b/0xed0 [<ffffffff811c72ac>] ? kmem_cache_alloc+0x15c/0x170 [<ffffffff814fe070>] xfs_alloc_file_space+0x160/0x400 [<ffffffff81ddcc29>] ? down_write+0x29/0x60 [<ffffffff815063eb>] xfs_file_fallocate+0x29b/0x310 [<ffffffff811d2bc8>] ? __sb_start_write+0x58/0x120 [<ffffffff811e3e18>] ? do_vfs_ioctl+0x318/0x570 [<ffffffff811cd680>] vfs_fallocate+0x140/0x260 [<ffffffff811ce6f8>] SyS_fallocate+0x48/0x80 [<ffffffff81ddec09>] system_call_fastpath+0x12/0x17 The tracepoint that indicates the extent that triggered the assert failure is: xfs_iext_insert: idx 0 offset 0 block 16777224 count 2097152 flag 1 Clearly indicating that the extent length is greater than MAXEXTLEN, which is 2097151. A prior trace point shows the allocation was an exact size match and that a length greater than MAXEXTLEN was asked for: xfs_alloc_size_done: agno 1 agbno 8 minlen 2097152 maxlen 2097152 ^^^^^^^ ^^^^^^^ We don't see this problem with extent size hints through the IO path because we can't do single IOs large enough to trigger MAXEXTLEN allocation. fallocate(), OTOH, is not limited in it's allocation sizes and so needs help here. The issue is that the extent size hint alignment is rounding up the extent size past MAXEXTLEN, because xfs_bmapi_write() is not taking into account extent size hints when calculating the maximum extent length to allocate. xfs_bmapi_reserve_delalloc() is already doing this, but direct extent allocation is not. Unfortunately, the calculation in xfs_bmapi_reserve_delalloc() is wrong, and it works only because delayed allocation extents are not limited in size to MAXEXTLEN in the in-core extent tree. hence this calculation does not work for direct allocation, and the delalloc code needs fixing. This may, in fact be the underlying bug that occassionally causes transaction overruns in delayed allocation extent conversion, so now we know it's wrong we should fix it, too. Many thanks to Brian Foster for finding this problem during review of this patch. Hence the fix, after much code reading, is to allow xfs_bmap_extsize_align() to align partial extents when full alignment would extend the alignment past MAXEXTLEN. We can safely do this because all callers have higher layer allocation loops that already handle short allocations, and so will simply run another allocation to cover the remainder of the requested allocation range that we ignored during alignment. The advantage of this approach is that it also removes the need for callers to do anything other than limit their requests to MAXEXTLEN - they don't really need to be aware of extent size hints at all. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-05-29xfs: inode and free block counters need to use __percpu_counter_compareDave Chinner1-14/+20
Because the counters use a custom batch size, the comparison functions need to be aware of that batch size otherwise the comparison does not work correctly. This leads to ASSERT failures on generic/027 like this: XFS: Assertion failed: 0, file: fs/xfs/xfs_mount.c, line: 1099 ------------[ cut here ]------------ .... Call Trace: [<ffffffff81522a39>] xfs_mod_icount+0x99/0xc0 [<ffffffff815285cb>] xfs_trans_unreserve_and_mod_sb+0x28b/0x5b0 [<ffffffff8152f941>] xfs_log_commit_cil+0x321/0x580 [<ffffffff81528e17>] xfs_trans_commit+0xb7/0x260 [<ffffffff81503d4d>] xfs_bmap_finish+0xcd/0x1b0 [<ffffffff8151da41>] xfs_inactive_ifree+0x1e1/0x250 [<ffffffff8151dbe0>] xfs_inactive+0x130/0x200 [<ffffffff81523a21>] xfs_fs_evict_inode+0x91/0xf0 [<ffffffff811f3958>] evict+0xb8/0x190 [<ffffffff811f433b>] iput+0x18b/0x1f0 [<ffffffff811e8853>] do_unlinkat+0x1f3/0x320 [<ffffffff811d548a>] ? filp_close+0x5a/0x80 [<ffffffff811e999b>] SyS_unlinkat+0x1b/0x40 [<ffffffff81e0892e>] system_call_fastpath+0x12/0x71 This is a regression introduced by commit 501ab32 ("xfs: use generic percpu counters for inode counter"). This patch fixes the same problem for both the inode counter and the free block counter in the superblocks. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-05-29xfs: use percpu_counter_read_positive for mp->m_icountGeorge Wang1-3/+6
Function percpu_counter_read just return the current counter, which can be negative. This will cause the checking of "allocated inode counts <= m_maxicount" false positive. Use percpu_counter_read_positive can solve this problem, and be consistent with the purpose to introduce percpu mechanism to xfs. Signed-off-by: George Wang <xuw2015@gmail.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-04-26Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds7-31/+31
Pull fourth vfs update from Al Viro: "d_inode() annotations from David Howells (sat in for-next since before the beginning of merge window) + four assorted fixes" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: RCU pathwalk breakage when running into a symlink overmounting something fix I_DIO_WAKEUP definition direct-io: only inc/dec inode->i_dio_count for file systems fs/9p: fix readdir() VFS: assorted d_backing_inode() annotations VFS: fs/inode.c helpers: d_inode() annotations VFS: fs/cachefiles: d_backing_inode() annotations VFS: fs library helpers: d_inode() annotations VFS: assorted weird filesystems: d_inode() annotations VFS: normal filesystems (and lustre): d_inode() annotations VFS: security/: d_inode() annotations VFS: security/: d_backing_inode() annotations VFS: net/: d_inode() annotations VFS: net/unix: d_backing_inode() annotations VFS: kernel/: d_inode() annotations VFS: audit: d_backing_inode() annotations VFS: Fix up some ->d_inode accesses in the chelsio driver VFS: Cachefiles should perform fs modifications on the top layer only VFS: AF_UNIX sockets should call mknod on the top layer only
2015-04-24Merge tag 'xfs-for-linus-4.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfsLinus Torvalds45-1927/+1986
Pull xfs update from Dave Chinner: "This update contains: - RENAME_WHITEOUT support - conversion of per-cpu superblock accounting to use generic counters - new inode mmap lock so that we can lock page faults out of truncate, hole punch and other direct extent manipulation functions to avoid racing mmap writes from causing data corruption - rework of direct IO submission and completion to solve data corruption issue when running concurrent extending DIO writes. Also solves problem of running IO completion transactions in interrupt context during size extending AIO writes. - FALLOC_FL_INSERT_RANGE support for inserting holes into a file via direct extent manipulation to avoid needing to copy data within the file - attribute block header field overflow fix for 64k block size filesystems - Lots of changes to log messaging to be more informative and concise when errors occur. Also prevent a lot of unnecessary log spamming due to cascading failures in error conditions. - lots of cleanups and bug fixes One thing of note is the direct IO fixes that we merged last week after the window opened. Even though a little late, they fix a user reported data corruption and have been pretty well tested. I figured there was not much point waiting another 2 weeks for -rc1 to be released just so I could send them to you..." * tag 'xfs-for-linus-4.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (49 commits) xfs: using generic_file_direct_write() is unnecessary xfs: direct IO EOF zeroing needs to drain AIO xfs: DIO write completion size updates race xfs: DIO writes within EOF don't need an ioend xfs: handle DIO overwrite EOF update completion correctly xfs: DIO needs an ioend for writes xfs: move DIO mapping size calculation xfs: factor DIO write mapping from get_blocks xfs: unlock i_mutex in xfs_break_layouts xfs: kill unnecessary firstused overflow check on attr3 leaf removal xfs: use larger in-core attr firstused field and detect overflow xfs: pass attr geometry to attr leaf header conversion functions xfs: disallow ro->rw remount on norecovery mount xfs: xfs_shift_file_space can be static xfs: Add support FALLOC_FL_INSERT_RANGE for fallocate fs: Add support FALLOC_FL_INSERT_RANGE for fallocate xfs: Fix incorrect positive ENOMEM return xfs: xfs_mru_cache_insert() should use GFP_NOFS xfs: %pF is only for function pointers xfs: fix shadow warning in xfs_da3_root_split() ...
2015-04-16Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds2-26/+25
Pull third hunk of vfs changes from Al Viro: "This contains the ->direct_IO() changes from Omar + saner generic_write_checks() + dealing with fcntl()/{read,write}() races (mirroring O_APPEND/O_DIRECT into iocb->ki_flags and instead of repeatedly looking at ->f_flags, which can be changed by fcntl(2), check ->ki_flags - which cannot) + infrastructure bits for dhowells' d_inode annotations + Christophs switch of /dev/loop to vfs_iter_write()" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (30 commits) block: loop: switch to VFS ITER_BVEC configfs: Fix inconsistent use of file_inode() vs file->f_path.dentry->d_inode VFS: Make pathwalk use d_is_reg() rather than S_ISREG() VFS: Fix up debugfs to use d_is_dir() in place of S_ISDIR() VFS: Combine inode checks with d_is_negative() and d_is_positive() in pathwalk NFS: Don't use d_inode as a variable name VFS: Impose ordering on accesses of d_inode and d_flags VFS: Add owner-filesystem positive/negative dentry checks nfs: generic_write_checks() shouldn't be done on swapout... ocfs2: use __generic_file_write_iter() mirror O_APPEND and O_DIRECT into iocb->ki_flags switch generic_write_checks() to iocb and iter ocfs2: move generic_write_checks() before the alignment checks ocfs2_file_write_iter: stop messing with ppos udf_file_write_iter: reorder and simplify fuse: ->direct_IO() doesn't need generic_write_checks() ext4_file_write_iter: move generic_write_checks() up xfs_file_aio_write_checks: switch to iocb/iov_iter generic_write_checks(): drop isblk argument blkdev_write_iter: expand generic_file_checks() call in there ...
2015-04-16Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fsLinus Torvalds3-197/+100
Pull quota and udf updates from Jan Kara: "The pull contains quota changes which complete unification of XFS and VFS quota interfaces (so tools can use either interface to manipulate any filesystem). There's also a patch to support project quotas in VFS quota subsystem from Li Xi. Finally there's a bunch of UDF fixes and cleanups and tiny cleanup in reiserfs & ext3" * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (21 commits) udf: Update ctime and mtime when directory is modified udf: return correct errno for udf_update_inode() ext3: Remove useless condition in if statement. vfs: Add general support to enforce project quota limits reiserfs: fix __RASSERT format string udf: use int for allocated blocks instead of sector_t udf: remove redundant buffer_head.h includes udf: remove else after return in __load_block_bitmap() udf: remove unused variable in udf_table_free_blocks() quota: Fix maximum quota limit settings quota: reorder flags in quota state quota: paranoia: check quota tree root quota: optimize i_dquot access quota: Hook up Q_XSETQLIM for id 0 to ->set_info xfs: Add support for Q_SETINFO quota: Make ->set_info use structure with neccesary info to VFS and XFS quota: Remove ->get_xstate and ->get_xstatev callbacks gfs2: Convert to using ->get_state callback xfs: Convert to using ->get_state callback quota: Wire up Q_GETXSTATE and Q_GETXSTATV calls to work with ->get_state ...
2015-04-16Merge branch 'xfs-dio-extend-fix' into for-nextDave Chinner3-82/+239
Conflicts: fs/xfs/xfs_file.c
2015-04-16xfs: using generic_file_direct_write() is unnecessaryDave Chinner1-3/+20
generic_file_direct_write() does all sorts of things to make DIO work "sorta ok" with mixed buffered IO workloads. We already do most of this work in xfs_file_aio_dio_write() because of the locking requirements, so there's only a couple of things it does for us. The first thing is that it does a page cache invalidation after the ->direct_IO callout. This can easily be added to the XFS code. The second thing it does is that if data was written, it updates the iov_iter structure to reflect the data written, and then does EOF size updates if necessary. For XFS, these EOF size updates are now not necessary, as we do them safely and race-free in IO completion context. That leaves just the iov_iter update, and that's also moved to the XFS code. Therefore we don't need to call generic_file_direct_write() and in doing so remove redundant buffered writeback and page cache invalidation calls from the DIO submission path. We also remove a racy EOF size update, and make the DIO submission code in XFS much easier to follow. Wins all round, really. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-04-16xfs: direct IO EOF zeroing needs to drain AIODave Chinner1-0/+10
When we are doing AIO DIO writes, the IOLOCK only provides an IO submission barrier. When we need to do EOF zeroing, we need to ensure that no other IO is in progress and all pending in-core EOF updates have been completed. This requires us to wait for all outstanding AIO DIO writes to the inode to complete and, if necessary, run their EOF updates. Once all the EOF updates are complete, we can then restart xfs_file_aio_write_checks() while holding the IOLOCK_EXCL, knowing that EOF is up to date and we have exclusive IO access to the file so we can run EOF block zeroing if we need to without interference. This gives EOF zeroing the same exclusivity against other IO as we provide truncate operations. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-04-16xfs: DIO write completion size updates raceDave Chinner2-1/+19
xfs_end_io_direct_write() can race with other IO completions when updating the in-core inode size. The IO completion processing is not serialised for direct IO - they are done either under the IOLOCK_SHARED for non-AIO DIO, and without any IOLOCK held at all during AIO DIO completion. Hence the non-atomic test-and-set update of the in-core inode size is racy and can result in the in-core inode size going backwards if the race if hit just right. If the inode size goes backwards, this can trigger the EOF zeroing code to run incorrectly on the next IO, which then will zero data that has successfully been written to disk by a previous DIO. To fix this bug, we need to serialise the test/set updates of the in-core inode size. This first patch introduces locking around the relevant updates and checks in the DIO path. Because we now have an ioend in xfs_end_io_direct_write(), we know exactly then we are doing an IO that requires an in-core EOF update, and we know that they are not running in interrupt context. As such, we do not need to use irqsave() spinlock variants to protect against interrupts while the lock is held. Hence we can use an existing spinlock in the inode to do this serialisation and so not need to grow the struct xfs_inode just to work around this problem. This patch does not address the test/set EOF update in generic_file_write_direct() for various reasons - that will be done as a followup with separate explanation. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-04-16xfs: DIO writes within EOF don't need an ioendDave Chinner2-30/+40
DIO writes that lie entirely within EOF have nothing to do in IO completion. In this case, we don't need no steekin' ioend, and so we can avoid allocating an ioend until we have a mapping that spans EOF. This means that IO completion has two contexts - deferred completion to the dio workqueue that uses an ioend, and interrupt completion that does nothing because there is nothing that can be done in this context. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-04-16xfs: handle DIO overwrite EOF update completion correctlyDave Chinner2-31/+31
Currently a DIO overwrite that extends the EOF (e.g sub-block IO or write into allocated blocks beyond EOF) requires a transaction for the EOF update. Thi is done in IO completion context, but we aren't explicitly handling this situation properly and so it can run in interrupt context. Ensure that we defer IO that spans EOF correctly to the DIO completion workqueue, and now that we have an ioend in IO completion we can use the common ioend completion path to do all the work. Note: we do not preallocate the append transaction as we can have multiple mapping and allocation calls per direct IO. hence preallocating can still leave us with nested transactions by attempting to map and allocate more blocks after we've preallocated an append transaction. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-04-16xfs: DIO needs an ioend for writesDave Chinner2-10/+85
Currently we can only tell DIO completion that an IO requires unwritten extent completion. This is done by a hacky non-null private pointer passed to Io completion, but the private pointer does not actually contain any information that is used. We also need to pass to IO completion the fact that the IO may be beyond EOF and so a size update transaction needs to be done. This is currently determined by checks in the io completion, but we need to determine if this is necessary at block mapping time as we need to defer the size update transactions to a completion workqueue, just like unwritten extent conversion. To do this, first we need to allocate and pass an ioend to to IO completion. Add this for unwritten extent conversion; we'll do the EOF updates in the next commit. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-04-16xfs: move DIO mapping size calculationDave Chinner1-33/+46
The mapping size calculation is done last in __xfs_get_blocks(), but we are going to need the actual mapping size we will use to map the direct IO correctly in xfs_map_direct(). Factor out the calculation for code clarity, and move the call to be the first operation in mapping the extent to the returned buffer. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-04-16xfs: factor DIO write mapping from get_blocksDave Chinner1-13/+27
Clarify and separate the buffer mapping logic so that the direct IO mapping is not tangled up in propagating the extent status to teh mapping buffer. This makes it easier to extend the direct IO mapping to use an ioend in future. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-04-15VFS: normal filesystems (and lustre): d_inode() annotationsDavid Howells7-31/+31
that's the bulk of filesystem drivers dealing with inodes of their own Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-13Merge branch 'xfs-misc-fixes-for-4.1-3' into for-nextDave Chinner12-51/+159
Conflicts: fs/xfs/xfs_iops.c
2015-04-13xfs: unlock i_mutex in xfs_break_layoutsChristoph Hellwig5-7/+13
We want to drop all I/O path locks when recalling layouts, and that includes i_mutex for the write path. Without this we get stuck processe when recalls take too long. [dchinner: fix build with !CONFIG_PNFS] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-04-13xfs: kill unnecessary firstused overflow check on attr3 leaf removalBrian Foster1-2/+1
xfs_attr3_leaf_remove() removes an attribute from an attr leaf block. If the attribute nameval data happens to be at the start of the nameval region, a new start offset (firstused) for the region is calculated (since the region grows from the tail of the block to the start). Once the new firstused is calculated, it is checked for zero in an apparent overflow check. Now that the in-core firstused is 32-bit, overflow is not possible and this check can be removed. Since the purpose for this check is not documented and appears to exist since the port to Linux, be conservative and replace it with an assert. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-04-13xfs: use larger in-core attr firstused field and detect overflowBrian Foster2-6/+92
The on-disk xfs_attr3_leaf_hdr structure firstused field is 16-bit and subject to overflow when fs block size is 64k. The field is typically initialized to block size when an attr leaf block is initialized. This problem is demonstrated by assert failures when running xfstests generic/117 on an fs with 64k blocks. To support the existing attr leaf block algorithms for insertion, rebalance and entry movement, increase the size of the in-core firstused field to 32-bit and handle the potential overflow on conversion to/from the on-disk structure. If the overflow condition occurs, set a special value in the firstused field that is translated back on header read. The special value is only required in the case of an empty 64k attr block. A value of zero is used because firstused is initialized to the block size and grows backwards from there. Furthermore, the attribute block header occupies the first bytes of the block. Thus, a value of zero has no other legitimate meaning for this structure. Two new conversion helpers are created to manage the conversion of firstused to and from disk. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-04-13xfs: pass attr geometry to attr leaf header conversion functionsBrian Foster4-35/+46
The firstused field of the xfs_attr3_leaf_hdr structure is subject to an overflow when fs blocksize is 64k. In preparation to handle this overflow in the header conversion functions, pass the attribute geometry to the functions that convert the in-core structure to and from the on-disk structure. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-04-13xfs: disallow ro->rw remount on norecovery mountEric Sandeen1-0/+6
There's a bit of a loophole in norecovery mount handling right now: an initial mount must be readonly, but nothing prevents a mount -o remount,rw from producing a writable, unrecovered xfs filesystem. It might be possible to try to perform a log recovery when this is requested, but I'm not sure it's worth the effort. For now, simply disallow this sort of transition. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-04-13xfs: xfs_shift_file_space can be statickbuild test robot1-1/+1
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-04-11mirror O_APPEND and O_DIRECT into iocb->ki_flagsAl Viro1-2/+2
... avoiding write_iter/fcntl races. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11switch generic_write_checks() to iocb and iterAl Viro1-4/+4
... returning -E... upon error and amount of data left in iter after (possible) truncation upon success. Note, that normal case gives a non-zero (positive) return value, so any tests for != 0 _must_ be updated. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Conflicts: fs/ext4/file.c
2015-04-11xfs_file_aio_write_checks: switch to iocb/iov_iterAl Viro1-15/+16
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11generic_write_checks(): drop isblk argumentAl Viro1-1/+1
all remaining callers are passing 0; some just obscure that fact. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11direct_IO: remove rw from a_ops->direct_IO()Omar Sandoval1-1/+0
Now that no one is using rw, remove it completely. Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11direct_IO: use iov_iter_rw() instead of rw everywhereOmar Sandoval1-1/+1
The rw parameter to direct_IO is redundant with iov_iter->type, and treated slightly differently just about everywhere it's used: some users do rw & WRITE, and others do rw == WRITE where they should be doing a bitwise check. Simplify this with the new iov_iter_rw() helper, which always returns either READ or WRITE. Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11Remove rw from {,__,do_}blockdev_direct_IO()Omar Sandoval1-5/+4
Most filesystems call through to these at some point, so we'll start here. Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11make new_sync_{read,write}() staticAl Viro1-2/+0
All places outside of core VFS that checked ->read and ->write for being NULL or called the methods directly are gone now, so NULL {read,write} with non-NULL {read,write}_iter will do the right thing in all cases. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11Merge branch 'iocb' into for-nextAl Viro2-2/+0
2015-03-25fs: move struct kiocb to fs.hChristoph Hellwig2-2/+0
struct kiocb now is a generic I/O container, so move it to fs.h. Also do a #include diet for aio.h while we're at it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-03-25Merge branch 'fallocate-insert-range' into for-nextDave Chinner6-83/+461
2015-03-25Merge branch 'xfs-misc-fixes-for-4.1-2' into for-nextDave Chinner8-52/+20
Conflicts: fs/xfs/libxfs/xfs_bmap.c fs/xfs/xfs_inode.c
2015-03-25xfs: Add support FALLOC_FL_INSERT_RANGE for fallocateNamjae Jeon6-83/+461
This patch implements fallocate's FALLOC_FL_INSERT_RANGE for XFS. 1) Make sure that both offset and len are block size aligned. 2) Update the i_size of inode by len bytes. 3) Compute the file's logical block number against offset. If the computed block number is not the starting block of the extent, split the extent such that the block number is the starting block of the extent. 4) Shift all the extents which are lying bewteen [offset, last allocated extent] towards right by len bytes. This step will make a hole of len bytes at offset. Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-03-25xfs: Fix incorrect positive ENOMEM returnJoe Perches1-1/+1
added a positive error return value. This value filters up through the return layers and should be negative as the other return values are in the same function. Signed-off-by: Joe Perches <joe@perches.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-03-25xfs: xfs_mru_cache_insert() should use GFP_NOFSByoungyoung Lee1-1/+1
xfs_mru_cache_insert() can be called from within transaction context during block allocation like so: write(2) .... xfs_get_blocks xfs_iomap_write_direct start transaction xfs_bmapi_write xfs_bmapi_allocate xfs_bmap_btalloc xfs_bmap_btalloc_filestreams xfs_filestream_new_ag xfs_filestream_pick_ag xfs_mru_cache_insert radix_tree_preload(GFP_KERNEL) In this case, GFP_KERNEL is incorrect and can potentially lead to deadlocks in memory reclaim. It should use GFP_NOFS allocations to avoid lock recursion problems. [dchinner: rewrote commit message] Signed-off-by: Byoungyoung Lee <blee@gatech.edu> Signed-off-by: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-03-25xfs: %pF is only for function pointersScott Wood2-11/+11
Use %pS for actual addresses, otherwise you'll get bad output on arches like ppc64 where %pF expects a function descriptor. Signed-off-by: Scott Wood <scottwood@freescale.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-03-25xfs: fix shadow warning in xfs_da3_root_split()Fabian Frederick1-4/+4
Use icnodehdr for struct xfs_da3_icnode_hdr instead of nodehdr (already declared above). Signed-off-by: Fabian Frederick <fabf@skynet.be> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-03-25xfs: use bool instead of int in xfs_rename()Fabian Frederick1-2/+2
new_parent and src_is_directory are only used in 0/1 context. Signed-off-by: Fabian Frederick <fabf@skynet.be> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-03-25xfs: fix NULL pointer dereference in xfs_filestream_lookup_ag()Eric Sandeen1-1/+1
If xfs_filestream_get_parent() fails, we have a null pip, goto out, and attempt to IRELE(NULL). This causes a null pointer dereference and BUG(). Fix this by directly returning NULLAGNUMBER in this case. Reported-by: Adrien Nader <adrien@notk.org> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-03-25xfs: remove xfs_bmap_sanity_check()Dave Chinner1-33/+0
This code is redundant now that we have verifiers that sanity check the buffers as they are read from disk. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-03-25Merge branch 'xfs-rename-whiteout' into for-nextDave Chinner2-171/+239
Conflicts: fs/xfs/xfs_inode.c
2015-03-25xfs: add RENAME_WHITEOUT supportDave Chinner2-24/+107
Whiteouts are used by overlayfs - it has a crazy convention that a whiteout is a character device inode with a major:minor of 0:0. Because it's not documented anywhere, here's an example of what RENAME_WHITEOUT does on ext4: # echo foo > /mnt/scratch/foo # echo bar > /mnt/scratch/bar # ls -l /mnt/scratch total 24 -rw-r--r-- 1 root root 4 Feb 11 20:22 bar -rw-r--r-- 1 root root 4 Feb 11 20:22 foo drwx------ 2 root root 16384 Feb 11 20:18 lost+found # src/renameat2 -w /mnt/scratch/foo /mnt/scratch/bar # ls -l /mnt/scratch total 20 -rw-r--r-- 1 root root 4 Feb 11 20:22 bar c--------- 1 root root 0, 0 Feb 11 20:23 foo drwx------ 2 root root 16384 Feb 11 20:18 lost+found # cat /mnt/scratch/bar foo # In XFS rename terms, the operation that has been done is that source (foo) has been moved to the target (bar), which is like a nomal rename operation, but rather than the source being removed, it have been replaced with a whiteout. We can't allocate whiteout inodes within the rename transaction due to allocation being a multi-commit transaction: rename needs to be a single, atomic commit. Hence we have several options here, form most efficient to least efficient: - use DT_WHT in the target dirent and do no whiteout inode allocation. The main issue with this approach is that we need hooks in lookup to create a virtual chardev inode to present to userspace and in places where we might need to modify the dirent e.g. unlink. Overlayfs also needs to be taught about DT_WHT. Most invasive change, lowest overhead. - create a special whiteout inode in the root directory (e.g. a ".wino" dirent) and then hardlink every new whiteout to it. This means we only need to create a single whiteout inode, and rename simply creates a hardlink to it. We can use DT_WHT for these, though using DT_CHR means we won't have to modify overlayfs, nor anything in userspace. Downside is we have to look up the whiteout inode on every operation and create it if it doesn't exist. - copy ext4: create a special whiteout chardev inode for every whiteout. This is more complex than the above options because of the lack of atomicity between inode creation and the rename operation, requiring us to create a tmpfile inode and then linking it into the directory structure during the rename. At least with a tmpfile inode crashes between the create and rename doesn't leave unreferenced inodes or directory pollution around. By far the simplest thing to do in the short term is to copy ext4. While it is the most inefficient way of supporting whiteouts, but as an initial implementation we can simply reuse existing functions and add a small amount of extra code the the rename operation. When we get full whiteout support in the VFS (via the dentry cache) we can then look to supporting DT_WHT method outlined as the first method of supporting whiteouts. But until then, we'll stick with what overlayfs expects us to be: dumb and stupid. Signed-off-by: Dave Chinner <dchinner@redhat.com>
2015-03-25xfs: make xfs_cross_rename() complete fullyDave Chinner1-20/+21
Now that xfs_finish_rename() exists, there is no reason for xfs_cross_rename() to return to xfs_rename() to finish off the rename transaction. Drive the completion code into xfs_cross_rename() and handle all errors there so as to simplify the xfs_rename() code. Further, push the rename exchange target_ip check to early in the rename code so as to make the error handling easy and obviously correct. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>