aboutsummaryrefslogtreecommitdiffstats
path: root/fs/xfs/xfs_iomap.c
diff options
context:
space:
mode:
authorDarrick J. Wong <djwong@kernel.org>2022-11-28 17:23:58 -0800
committerDarrick J. Wong <djwong@kernel.org>2022-11-28 17:23:58 -0800
commit7dd73802f97d2a1602b1cf5c1d6623fb08cb15c5 (patch)
treea328cd35cf62e6dcc565068e36708be2e3be50b7 /fs/xfs/xfs_iomap.c
parentxfs: fix incorrect i_nlink caused by inode racing (diff)
parentxfs: drop write error injection is unfixable, remove it (diff)
downloadlinux-rng-7dd73802f97d2a1602b1cf5c1d6623fb08cb15c5.tar.xz
linux-rng-7dd73802f97d2a1602b1cf5c1d6623fb08cb15c5.zip
Merge tag 'xfs-iomap-stale-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs into xfs-6.2-mergeB
xfs, iomap: fix data corruption due to stale cached iomaps This patch series fixes a data corruption that occurs in a specific multi-threaded write workload. The workload combined racing unaligned adjacent buffered writes with low memory conditions that caused both writeback and memory reclaim to race with the writes. The result of this was random partial blocks containing zeroes instead of the correct data. The underlying problem is that iomap caches the write iomap for the duration of the write() operation, but it fails to take into account that the extent underlying the iomap can change whilst the write is in progress. The short story is that an iomap can span mutliple folios, and so under low memory writeback can be cleaning folios the write() overlaps. Whilst the overlapping data is cached in memory, this isn't a problem, but because the folios are now clean they can be reclaimed. Once reclaimed, the write() does the wrong thing when re-instantiating partial folios because the iomap no longer reflects the underlying state of the extent. e.g. it thinks the extent is unwritten, so it zeroes the partial range, when in fact the underlying extent is now written and so it should have read the data from disk. This is how we get random zero ranges in the file instead of the correct data. The gory details of the race condition can be found here: https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/ Fixing the problem has two aspects. The first aspect of the problem is ensuring that iomap can detect a stale cached iomap during a write in a race-free manner. We already do this stale iomap detection in the writeback path, so we have a mechanism for detecting that the iomap backing the data range may have changed and needs to be remapped. In the case of the write() path, we have to ensure that the iomap is validated at a point in time when the page cache is stable and cannot be reclaimed from under us. We also need to validate the extent before we start performing any modifications to the folio state or contents. Combine these two requirements together, and the only "safe" place to validate the iomap is after we have looked up and locked the folio we are going to copy the data into, but before we've performed any initialisation operations on that folio. If the iomap fails validation, we then mark it stale, unlock the folio and end the write. This effectively means a stale iomap results in a short write. Filesystems should already be able to handle this, as write operations can end short for many reasons and need to iterate through another mapping cycle to be completed. Hence the iomap changes needed to detect and handle stale iomaps during write() operations is relatively simple... However, the assumption is that filesystems should already be able to handle write failures safely, and that's where the second (first?) part of the problem exists. That is, handling a partial write is harder than just "punching out the unused delayed allocation extent". This is because mmap() based faults can race with writes, and if they land in the delalloc region that the write allocated, then punching out the delalloc region can cause data corruption. This data corruption problem is exposed by generic/346 when iomap is converted to detect stale iomaps during write() operations. Hence write failure in the filesytems needs to handle the fact that the write() in progress doesn't necessarily own the data in the page cache over the range of the delalloc extent it just allocated. As a result, we can't just truncate the page cache over the range the write() didn't reach and punch all the delalloc extent. We have to walk the page cache over the untouched range and skip over any dirty data region in the cache in that range. Which is .... non-trivial. That is, iterating the page cache has to handle partially populated folios (i.e. block size < page size) that contain data. The data might be discontiguous within a folio. Indeed, there might be *multiple* discontiguous data regions within a single folio. And to make matters more complex, multi-page folios mean we just don't know how many sub-folio regions we might have to iterate to find all these regions. All the corner cases between the conversions and rounding between filesystem block size, folio size and multi-page folio size combined with unaligned write offsets kept breaking my brain. However, if we convert the code to track the processed write regions by byte ranges instead of fileystem block or page cache index, we could simply use mapping_seek_hole_data() to find the start and end of each discrete data region within the range we needed to scan. SEEK_DATA finds the start of the cached data region, SEEK_HOLE finds the end of the region. These are byte based interfaces that understand partially uptodate folio regions, and so can iterate discrete sub-folio data regions directly. This largely solved the problem of discovering the dirty regions we need to keep the delalloc extent over. However, to use mapping_seek_hole_data() without needing to export it, we have to move all the delalloc extent cleanup to the iomap core and so now the iomap core can clean up delayed allocation extents in a safe, sane and filesystem neutral manner. With all this done, the original data corruption never occurs anymore, and we now have a generic mechanism for ensuring that page cache writes do not do the wrong thing when writeback and reclaim change the state of the physical extent and/or page cache contents whilst the write is in progress. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org> * tag 'xfs-iomap-stale-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: xfs: drop write error injection is unfixable, remove it xfs: use iomap_valid method to detect stale cached iomaps iomap: write iomap validity checks xfs: xfs_bmap_punch_delalloc_range() should take a byte range iomap: buffered write failure should not truncate the page cache xfs,iomap: move delalloc punching to iomap xfs: use byte ranges for write cleanup ranges xfs: punching delalloc extents on write failure is racy xfs: write page faults in iomap are not buffered writes
Diffstat (limited to 'fs/xfs/xfs_iomap.c')
-rw-r--r--fs/xfs/xfs_iomap.c169
1 files changed, 102 insertions, 67 deletions
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 07da03976ec1..1bdd7afc1010 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -48,13 +48,45 @@ xfs_alert_fsblock_zero(
return -EFSCORRUPTED;
}
+u64
+xfs_iomap_inode_sequence(
+ struct xfs_inode *ip,
+ u16 iomap_flags)
+{
+ u64 cookie = 0;
+
+ if (iomap_flags & IOMAP_F_XATTR)
+ return READ_ONCE(ip->i_af.if_seq);
+ if ((iomap_flags & IOMAP_F_SHARED) && ip->i_cowfp)
+ cookie = (u64)READ_ONCE(ip->i_cowfp->if_seq) << 32;
+ return cookie | READ_ONCE(ip->i_df.if_seq);
+}
+
+/*
+ * Check that the iomap passed to us is still valid for the given offset and
+ * length.
+ */
+static bool
+xfs_iomap_valid(
+ struct inode *inode,
+ const struct iomap *iomap)
+{
+ return iomap->validity_cookie ==
+ xfs_iomap_inode_sequence(XFS_I(inode), iomap->flags);
+}
+
+const struct iomap_page_ops xfs_iomap_page_ops = {
+ .iomap_valid = xfs_iomap_valid,
+};
+
int
xfs_bmbt_to_iomap(
struct xfs_inode *ip,
struct iomap *iomap,
struct xfs_bmbt_irec *imap,
unsigned int mapping_flags,
- u16 iomap_flags)
+ u16 iomap_flags,
+ u64 sequence_cookie)
{
struct xfs_mount *mp = ip->i_mount;
struct xfs_buftarg *target = xfs_inode_buftarg(ip);
@@ -91,6 +123,9 @@ xfs_bmbt_to_iomap(
if (xfs_ipincount(ip) &&
(ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
iomap->flags |= IOMAP_F_DIRTY;
+
+ iomap->validity_cookie = sequence_cookie;
+ iomap->page_ops = &xfs_iomap_page_ops;
return 0;
}
@@ -195,7 +230,8 @@ xfs_iomap_write_direct(
xfs_fileoff_t offset_fsb,
xfs_fileoff_t count_fsb,
unsigned int flags,
- struct xfs_bmbt_irec *imap)
+ struct xfs_bmbt_irec *imap,
+ u64 *seq)
{
struct xfs_mount *mp = ip->i_mount;
struct xfs_trans *tp;
@@ -285,6 +321,7 @@ xfs_iomap_write_direct(
error = xfs_alert_fsblock_zero(ip, imap);
out_unlock:
+ *seq = xfs_iomap_inode_sequence(ip, 0);
xfs_iunlock(ip, XFS_ILOCK_EXCL);
return error;
@@ -743,6 +780,7 @@ xfs_direct_write_iomap_begin(
bool shared = false;
u16 iomap_flags = 0;
unsigned int lockmode = XFS_ILOCK_SHARED;
+ u64 seq;
ASSERT(flags & (IOMAP_WRITE | IOMAP_ZERO));
@@ -811,9 +849,10 @@ xfs_direct_write_iomap_begin(
goto out_unlock;
}
+ seq = xfs_iomap_inode_sequence(ip, iomap_flags);
xfs_iunlock(ip, lockmode);
trace_xfs_iomap_found(ip, offset, length, XFS_DATA_FORK, &imap);
- return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, iomap_flags);
+ return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, iomap_flags, seq);
allocate_blocks:
error = -EAGAIN;
@@ -839,24 +878,26 @@ allocate_blocks:
xfs_iunlock(ip, lockmode);
error = xfs_iomap_write_direct(ip, offset_fsb, end_fsb - offset_fsb,
- flags, &imap);
+ flags, &imap, &seq);
if (error)
return error;
trace_xfs_iomap_alloc(ip, offset, length, XFS_DATA_FORK, &imap);
return xfs_bmbt_to_iomap(ip, iomap, &imap, flags,
- iomap_flags | IOMAP_F_NEW);
+ iomap_flags | IOMAP_F_NEW, seq);
out_found_cow:
- xfs_iunlock(ip, lockmode);
length = XFS_FSB_TO_B(mp, cmap.br_startoff + cmap.br_blockcount);
trace_xfs_iomap_found(ip, offset, length - offset, XFS_COW_FORK, &cmap);
if (imap.br_startblock != HOLESTARTBLOCK) {
- error = xfs_bmbt_to_iomap(ip, srcmap, &imap, flags, 0);
+ seq = xfs_iomap_inode_sequence(ip, 0);
+ error = xfs_bmbt_to_iomap(ip, srcmap, &imap, flags, 0, seq);
if (error)
- return error;
+ goto out_unlock;
}
- return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags, IOMAP_F_SHARED);
+ seq = xfs_iomap_inode_sequence(ip, IOMAP_F_SHARED);
+ xfs_iunlock(ip, lockmode);
+ return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags, IOMAP_F_SHARED, seq);
out_unlock:
if (lockmode)
@@ -915,6 +956,7 @@ xfs_buffered_write_iomap_begin(
int allocfork = XFS_DATA_FORK;
int error = 0;
unsigned int lockmode = XFS_ILOCK_EXCL;
+ u64 seq;
if (xfs_is_shutdown(mp))
return -EIO;
@@ -1094,26 +1136,31 @@ retry:
* Flag newly allocated delalloc blocks with IOMAP_F_NEW so we punch
* them out if the write happens to fail.
*/
+ seq = xfs_iomap_inode_sequence(ip, IOMAP_F_NEW);
xfs_iunlock(ip, XFS_ILOCK_EXCL);
trace_xfs_iomap_alloc(ip, offset, count, allocfork, &imap);
- return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, IOMAP_F_NEW);
+ return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, IOMAP_F_NEW, seq);
found_imap:
+ seq = xfs_iomap_inode_sequence(ip, 0);
xfs_iunlock(ip, XFS_ILOCK_EXCL);
- return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0);
+ return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0, seq);
found_cow:
- xfs_iunlock(ip, XFS_ILOCK_EXCL);
+ seq = xfs_iomap_inode_sequence(ip, 0);
if (imap.br_startoff <= offset_fsb) {
- error = xfs_bmbt_to_iomap(ip, srcmap, &imap, flags, 0);
+ error = xfs_bmbt_to_iomap(ip, srcmap, &imap, flags, 0, seq);
if (error)
- return error;
+ goto out_unlock;
+ seq = xfs_iomap_inode_sequence(ip, IOMAP_F_SHARED);
+ xfs_iunlock(ip, XFS_ILOCK_EXCL);
return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags,
- IOMAP_F_SHARED);
+ IOMAP_F_SHARED, seq);
}
xfs_trim_extent(&cmap, offset_fsb, imap.br_startoff - offset_fsb);
- return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags, 0);
+ xfs_iunlock(ip, XFS_ILOCK_EXCL);
+ return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags, 0, seq);
out_unlock:
xfs_iunlock(ip, XFS_ILOCK_EXCL);
@@ -1121,6 +1168,16 @@ out_unlock:
}
static int
+xfs_buffered_write_delalloc_punch(
+ struct inode *inode,
+ loff_t offset,
+ loff_t length)
+{
+ return xfs_bmap_punch_delalloc_range(XFS_I(inode), offset,
+ offset + length);
+}
+
+static int
xfs_buffered_write_iomap_end(
struct inode *inode,
loff_t offset,
@@ -1129,56 +1186,17 @@ xfs_buffered_write_iomap_end(
unsigned flags,
struct iomap *iomap)
{
- struct xfs_inode *ip = XFS_I(inode);
- struct xfs_mount *mp = ip->i_mount;
- xfs_fileoff_t start_fsb;
- xfs_fileoff_t end_fsb;
- int error = 0;
-
- if (iomap->type != IOMAP_DELALLOC)
- return 0;
- /*
- * Behave as if the write failed if drop writes is enabled. Set the NEW
- * flag to force delalloc cleanup.
- */
- if (XFS_TEST_ERROR(false, mp, XFS_ERRTAG_DROP_WRITES)) {
- iomap->flags |= IOMAP_F_NEW;
- written = 0;
- }
-
- /*
- * start_fsb refers to the first unused block after a short write. If
- * nothing was written, round offset down to point at the first block in
- * the range.
- */
- if (unlikely(!written))
- start_fsb = XFS_B_TO_FSBT(mp, offset);
- else
- start_fsb = XFS_B_TO_FSB(mp, offset + written);
- end_fsb = XFS_B_TO_FSB(mp, offset + length);
+ struct xfs_mount *mp = XFS_M(inode->i_sb);
+ int error;
- /*
- * Trim delalloc blocks if they were allocated by this write and we
- * didn't manage to write the whole range.
- *
- * We don't need to care about racing delalloc as we hold i_mutex
- * across the reserve/allocate/unreserve calls. If there are delalloc
- * blocks in the range, they are ours.
- */
- if ((iomap->flags & IOMAP_F_NEW) && start_fsb < end_fsb) {
- truncate_pagecache_range(VFS_I(ip), XFS_FSB_TO_B(mp, start_fsb),
- XFS_FSB_TO_B(mp, end_fsb) - 1);
-
- error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
- end_fsb - start_fsb);
- if (error && !xfs_is_shutdown(mp)) {
- xfs_alert(mp, "%s: unable to clean up ino %lld",
- __func__, ip->i_ino);
- return error;
- }
+ error = iomap_file_buffered_write_punch_delalloc(inode, iomap, offset,
+ length, written, &xfs_buffered_write_delalloc_punch);
+ if (error && !xfs_is_shutdown(mp)) {
+ xfs_alert(mp, "%s: unable to clean up ino 0x%llx",
+ __func__, XFS_I(inode)->i_ino);
+ return error;
}
-
return 0;
}
@@ -1187,6 +1205,15 @@ const struct iomap_ops xfs_buffered_write_iomap_ops = {
.iomap_end = xfs_buffered_write_iomap_end,
};
+/*
+ * iomap_page_mkwrite() will never fail in a way that requires delalloc extents
+ * that it allocated to be revoked. Hence we do not need an .iomap_end method
+ * for this operation.
+ */
+const struct iomap_ops xfs_page_mkwrite_iomap_ops = {
+ .iomap_begin = xfs_buffered_write_iomap_begin,
+};
+
static int
xfs_read_iomap_begin(
struct inode *inode,
@@ -1204,6 +1231,7 @@ xfs_read_iomap_begin(
int nimaps = 1, error = 0;
bool shared = false;
unsigned int lockmode = XFS_ILOCK_SHARED;
+ u64 seq;
ASSERT(!(flags & (IOMAP_WRITE | IOMAP_ZERO)));
@@ -1217,13 +1245,14 @@ xfs_read_iomap_begin(
&nimaps, 0);
if (!error && (flags & IOMAP_REPORT))
error = xfs_reflink_trim_around_shared(ip, &imap, &shared);
+ seq = xfs_iomap_inode_sequence(ip, shared ? IOMAP_F_SHARED : 0);
xfs_iunlock(ip, lockmode);
if (error)
return error;
trace_xfs_iomap_found(ip, offset, length, XFS_DATA_FORK, &imap);
return xfs_bmbt_to_iomap(ip, iomap, &imap, flags,
- shared ? IOMAP_F_SHARED : 0);
+ shared ? IOMAP_F_SHARED : 0, seq);
}
const struct iomap_ops xfs_read_iomap_ops = {
@@ -1248,6 +1277,7 @@ xfs_seek_iomap_begin(
struct xfs_bmbt_irec imap, cmap;
int error = 0;
unsigned lockmode;
+ u64 seq;
if (xfs_is_shutdown(mp))
return -EIO;
@@ -1282,8 +1312,9 @@ xfs_seek_iomap_begin(
if (data_fsb < cow_fsb + cmap.br_blockcount)
end_fsb = min(end_fsb, data_fsb);
xfs_trim_extent(&cmap, offset_fsb, end_fsb);
+ seq = xfs_iomap_inode_sequence(ip, IOMAP_F_SHARED);
error = xfs_bmbt_to_iomap(ip, iomap, &cmap, flags,
- IOMAP_F_SHARED);
+ IOMAP_F_SHARED, seq);
/*
* This is a COW extent, so we must probe the page cache
* because there could be dirty page cache being backed
@@ -1304,8 +1335,9 @@ xfs_seek_iomap_begin(
imap.br_startblock = HOLESTARTBLOCK;
imap.br_state = XFS_EXT_NORM;
done:
+ seq = xfs_iomap_inode_sequence(ip, 0);
xfs_trim_extent(&imap, offset_fsb, end_fsb);
- error = xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0);
+ error = xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0, seq);
out_unlock:
xfs_iunlock(ip, lockmode);
return error;
@@ -1331,6 +1363,7 @@ xfs_xattr_iomap_begin(
struct xfs_bmbt_irec imap;
int nimaps = 1, error = 0;
unsigned lockmode;
+ int seq;
if (xfs_is_shutdown(mp))
return -EIO;
@@ -1347,12 +1380,14 @@ xfs_xattr_iomap_begin(
error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb, &imap,
&nimaps, XFS_BMAPI_ATTRFORK);
out_unlock:
+
+ seq = xfs_iomap_inode_sequence(ip, IOMAP_F_XATTR);
xfs_iunlock(ip, lockmode);
if (error)
return error;
ASSERT(nimaps);
- return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0);
+ return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, IOMAP_F_XATTR, seq);
}
const struct iomap_ops xfs_xattr_iomap_ops = {