From ef1e68236b9153c27cb7cf29ead0c532870d4215 Mon Sep 17 00:00:00 2001
From: Tavian Barnes
Date: Fri, 15 Mar 2024 21:14:29 -0400
Subject: btrfs: fix race in read_extent_buffer_pages()

There are reports from the tree-checker about corrupted nodes, without
any obvious pattern, so possibly caused by an overwrite in memory. After
some debugging it turns out there's a race when reading an extent
buffer: the uptodate status can be missed.

To prevent concurrent reads for the same extent buffer,
read_extent_buffer_pages() performs these checks:

    /* (1) */
    if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
            return 0;

    /* (2) */
    if (test_and_set_bit(EXTENT_BUFFER_READING, &eb->bflags))
            goto done;

At this point, it seems safe to start the actual read operation. Once
that completes, end_bbio_meta_read() does

    /* (3) */
    set_extent_buffer_uptodate(eb);

    /* (4) */
    clear_bit(EXTENT_BUFFER_READING, &eb->bflags);

Normally, this is enough to ensure only one read happens, and all other
callers wait for it to finish before returning. Unfortunately, there is
a racy interleaving:

    Thread A | Thread B | Thread C
    ---------+----------+---------
       (1)   |          |
             |    (1)   |
       (2)   |          |
       (3)   |          |
       (4)   |          |
             |    (2)   |
             |          |    (1)

When this happens, thread B kicks off an unnecessary read. Worse, thread
C will see UPTODATE set and return immediately, while the read from
thread B is still in progress. This race could result in tree-checker
errors like this as the extent buffer is concurrently modified:

    BTRFS critical (device dm-0): corrupted node, root=256
    block=8550954455682405139 owner mismatch, have 11858205567642294356
    expect [256, 18446744073709551360]

Fix it by testing UPTODATE again after setting the READING bit, and if
it's been set, skip the unnecessary read.

Fixes: d7172f52e993 ("btrfs: use per-buffer locking for extent_buffer reading")
Link: https://lore.kernel.org/linux-btrfs/CAHk-=whNdMaN9ntZ47XRKP6DBes2E5w7fi-0U3H2+PS18p+Pzw@mail.gmail.com/
Link: https://lore.kernel.org/linux-btrfs/f51a6d5d7432455a6a858d51b49ecac183e0bbc9.1706312914.git.wqu@suse.com/
Link: https://lore.kernel.org/linux-btrfs/c7241ea4-fcc6-48d2-98c8-b5ea790d6c89@gmx.com/
CC: stable@vger.kernel.org # 6.5+
Reviewed-by: Qu Wenruo
Reviewed-by: Christoph Hellwig
Signed-off-by: Tavian Barnes
Reviewed-by: David Sterba
[ minor update of changelog ]
Signed-off-by: David Sterba
---
 fs/btrfs/extent_io.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

(limited to 'fs/btrfs/extent_io.c')

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 7441245b1ceb..61594eaf1f89 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4333,6 +4333,19 @@ int read_extent_buffer_pages(struct extent_buffer *eb, int wait, int mirror_num,
 	if (test_and_set_bit(EXTENT_BUFFER_READING, &eb->bflags))
 		goto done;
 
+	/*
+	 * Between the initial test_bit(EXTENT_BUFFER_UPTODATE) and the above
+	 * test_and_set_bit(EXTENT_BUFFER_READING), someone else could have
+	 * started and finished reading the same eb. In this case, UPTODATE
+	 * will now be set, and we shouldn't read it in again.
+	 */
+	if (unlikely(test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))) {
+		clear_bit(EXTENT_BUFFER_READING, &eb->bflags);
+		smp_mb__after_atomic();
+		wake_up_bit(&eb->bflags, EXTENT_BUFFER_READING);
+		return 0;
+	}
+
 	clear_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
 	eb->read_mirror = 0;
 	check_buffer_tree_ref(eb);
-- 
cgit v1.2.3-59-g8ed1b


From 68879386180c0efd5a11e800b0525a01068c9457 Mon Sep 17 00:00:00 2001
From: Naohiro Aota
Date: Tue, 26 Mar 2024 14:39:20 +0900
Subject: btrfs: zoned: do not flag ZEROOUT on non-dirty extent buffer

Btrfs clears the content of an extent buffer marked as
EXTENT_BUFFER_ZONED_ZEROOUT before the bio submission. This mechanism was
introduced to prevent a write hole of an extent buffer that is allocated
and marked dirty, but then turns out to be unnecessary and is cleaned up
within one transaction.

Currently, btrfs_clear_buffer_dirty() marks the extent buffer as
EXTENT_BUFFER_ZONED_ZEROOUT and returns early, skipping the rest of the
function. If this call happens while the buffer is under IO (with the
WRITEBACK flag set, without the DIRTY flag), we can add the ZEROOUT flag
and clear the buffer's content just before a bio submission. As a result:

1) it can lead to adding a faulty delayed reference item, which leads to
   a filesystem corruption (EUCLEAN) error, and

2) it writes out a cleared tree node to disk

The former issue was previously discussed in [1]. The corruption happens
when it runs a delayed reference update, so the on-disk data is safe.

[1] https://lore.kernel.org/linux-btrfs/3f4f2a0ff1a6c818050434288925bdcf3cd719e5.1709124777.git.naohiro.aota@wdc.com/

The latter one can reach the on-disk data. But, as that node has already
been processed by btrfs_clear_buffer_dirty(), it will be invalidated in
the next transaction commit anyway, so the chance of hitting the
corruption is relatively small.

Either way, we should skip flagging ZEROOUT on a non-DIRTY extent buffer
to keep the content under IO intact.

Fixes: aa6313e6ff2b ("btrfs: zoned: don't clear dirty flag of extent buffer")
CC: stable@vger.kernel.org # 6.8
Link: https://lore.kernel.org/linux-btrfs/oadvdekkturysgfgi4qzuemd57zudeasynswurjxw3ocdfsef6@sjyufeugh63f/
Reviewed-by: Johannes Thumshirn
Signed-off-by: Naohiro Aota
Reviewed-by: David Sterba
Signed-off-by: David Sterba
---
 fs/btrfs/extent_io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'fs/btrfs/extent_io.c')

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 61594eaf1f89..df3fe36126f9 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4154,7 +4154,7 @@ void btrfs_clear_buffer_dirty(struct btrfs_trans_handle *trans,
 	 * The actual zeroout of the buffer will happen later in
 	 * btree_csum_one_bio.
 	 */
-	if (btrfs_is_zoned(fs_info)) {
+	if (btrfs_is_zoned(fs_info) && test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)) {
 		set_bit(EXTENT_BUFFER_ZONED_ZEROOUT, &eb->bflags);
 		return;
 	}
-- 
cgit v1.2.3-59-g8ed1b


From 073bda7a541731f41ed08f32d286394236c74005 Mon Sep 17 00:00:00 2001
From: Naohiro Aota
Date: Tue, 26 Mar 2024 14:39:21 +0900
Subject: btrfs: zoned: add ASSERT and WARN for EXTENT_BUFFER_ZONED_ZEROOUT handling

Add an ASSERT to catch a faulty delayed reference item resulting from a
prematurely cleared extent buffer. Also, add a WARN to detect if we try
to dirty a ZEROOUT buffer again, which is suspicious as its update will
be lost.
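To see why btrfs_header_bytenr() works as a canary here, consider a rough
userspace model (the struct layout and names below are illustrative, not
the kernel's): once the ZEROOUT handling has cleared a buffer, the bytenr
stored in its header reads back as 0, which is never a valid tree block
address.

  #include <assert.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  /* Illustrative layout only: a tree block header stores its own bytenr. */
  struct model_header {
          uint64_t bytenr;
  };

  static uint64_t model_header_bytenr(const unsigned char *buf)
  {
          struct model_header h;

          memcpy(&h, buf, sizeof(h));
          return h.bytenr;
  }

  int main(void)
  {
          unsigned char buf[4096];
          struct model_header h = { .bytenr = 30408704 };

          memcpy(buf, &h, sizeof(h));
          assert(model_header_bytenr(buf) != 0);  /* intact buffer: passes */

          memset(buf, 0, sizeof(buf));    /* premature zeroout wipes the header */
          if (model_header_bytenr(buf) == 0)
                  printf("cleared buffer detected; the new ASSERT would fire\n");
          return 0;
  }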
Reviewed-by: Johannes Thumshirn
Signed-off-by: Naohiro Aota
Reviewed-by: David Sterba
Signed-off-by: David Sterba
---
 fs/btrfs/extent-tree.c | 8 ++++++++
 fs/btrfs/extent_io.c   | 1 +
 2 files changed, 9 insertions(+)

(limited to 'fs/btrfs/extent_io.c')

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index beedd6ed64d3..257d044bca91 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3464,6 +3464,14 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
 	if (root_id != BTRFS_TREE_LOG_OBJECTID) {
 		struct btrfs_ref generic_ref = { 0 };
 
+		/*
+		 * Assert that the extent buffer is not cleared due to
+		 * EXTENT_BUFFER_ZONED_ZEROOUT. Please refer to
+		 * btrfs_clear_buffer_dirty() and btree_csum_one_bio() for
+		 * details.
+		 */
+		ASSERT(btrfs_header_bytenr(buf) != 0);
+
 		btrfs_init_generic_ref(&generic_ref, BTRFS_DROP_DELAYED_REF,
 				       buf->start, buf->len, parent,
 				       btrfs_header_owner(buf));
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index df3fe36126f9..b18034f2ab80 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4193,6 +4193,7 @@ void set_extent_buffer_dirty(struct extent_buffer *eb)
 	num_folios = num_extent_folios(eb);
 	WARN_ON(atomic_read(&eb->refs) == 0);
 	WARN_ON(!test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags));
+	WARN_ON(test_bit(EXTENT_BUFFER_ZONED_ZEROOUT, &eb->bflags));
 
 	if (!was_dirty) {
 		bool subpage = eb->fs_info->nodesize < PAGE_SIZE;
-- 
cgit v1.2.3-59-g8ed1b


From 1db7959aacd905e6487d0478ac01d89f86eb1e51 Mon Sep 17 00:00:00 2001
From: Qu Wenruo
Date: Tue, 26 Mar 2024 09:16:46 +1030
Subject: btrfs: do not wait for short bulk allocation

[BUG]
There is a recent report that when memory pressure is high (including
cached pages), btrfs can spend most of its time on memory allocation in
btrfs_alloc_page_array() for compressed read/write.

[CAUSE]
For btrfs_alloc_page_array() we always call alloc_pages_bulk_array(),
and even if the bulk allocation failed (fell back to single page
allocation) we still retry, but with an extra memalloc_retry_wait().

If the bulk alloc only returned one page at a time, we would spend a lot
of time on the retry wait.

The behavior was introduced in commit 395cb57e8560 ("btrfs: wait between
incomplete batch memory allocations").

[FIX]
Although the commit mentioned that other filesystems do the wait, it's
not the case, at least nowadays.

All the mainline filesystems only call memalloc_retry_wait() if they
failed to allocate any page (not only for bulk allocation).
If there is any progress, they won't call memalloc_retry_wait() at all.

For example, xfs_buf_alloc_pages() would only call memalloc_retry_wait()
if there is no allocation progress at all, and the call is not for
metadata readahead.

So I don't believe we should call memalloc_retry_wait() unconditionally
for a short (incomplete) allocation.

Drop the unconditional memalloc_retry_wait() call, and if a pass fails
to allocate any page at all, just return -ENOMEM (tree block allocation
goes with __GFP_NOFAIL and may not need the special handling anyway).
This reduces the latency of btrfs_alloc_page_array().
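The progress-or-fail pattern is easy to model in userspace; in this
sketch, malloc() merely stands in for the page allocator (alloc_array()
and the names in it are made up for illustration):

  #include <stdlib.h>

  /* Fill arr[0..n); fail only if a full pass allocates nothing. */
  static int alloc_array(void **arr, unsigned int n)
  {
          unsigned int allocated = 0;

          while (allocated < n) {
                  unsigned int last = allocated;

                  for (unsigned int i = allocated; i < n; i++) {
                          arr[i] = malloc(4096);  /* one "page" */
                          if (!arr[i])
                                  break;
                          allocated++;
                  }
                  if (allocated == last) {
                          /* No progress: clean up and fail. */
                          for (unsigned int i = 0; i < allocated; i++) {
                                  free(arr[i]);
                                  arr[i] = NULL;
                          }
                          return -1;
                  }
                  /* Partial progress: retry immediately, no forced wait. */
          }
          return 0;
  }

  int main(void)
  {
          void *pages[16] = { NULL };

          return alloc_array(pages, 16) ? 1 : 0;
  }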
Reported-by: Julian Taylor Tested-by: Julian Taylor Link: https://lore.kernel.org/all/8966c095-cbe7-4d22-9784-a647d1bf27c3@1und1.de/ Fixes: 395cb57e8560 ("btrfs: wait between incomplete batch memory allocations") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Sweet Tea Dorminy Reviewed-by: Filipe Manana Signed-off-by: Qu Wenruo Reviewed-by: David Sterba Signed-off-by: David Sterba --- fs/btrfs/extent_io.c | 18 ++++-------------- 1 file changed, 4 insertions(+), 14 deletions(-) (limited to 'fs/btrfs/extent_io.c') diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index b18034f2ab80..2776112dbdf8 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -681,31 +681,21 @@ static void end_bbio_data_read(struct btrfs_bio *bbio) int btrfs_alloc_page_array(unsigned int nr_pages, struct page **page_array, gfp_t extra_gfp) { + const gfp_t gfp = GFP_NOFS | extra_gfp; unsigned int allocated; for (allocated = 0; allocated < nr_pages;) { unsigned int last = allocated; - allocated = alloc_pages_bulk_array(GFP_NOFS | extra_gfp, - nr_pages, page_array); - - if (allocated == nr_pages) - return 0; - - /* - * During this iteration, no page could be allocated, even - * though alloc_pages_bulk_array() falls back to alloc_page() - * if it could not bulk-allocate. So we must be out of memory. - */ - if (allocated == last) { + allocated = alloc_pages_bulk_array(gfp, nr_pages, page_array); + if (unlikely(allocated == last)) { + /* No progress, fail and do cleanup. */ for (int i = 0; i < allocated; i++) { __free_page(page_array[i]); page_array[i] = NULL; } return -ENOMEM; } - - memalloc_retry_wait(GFP_NOFS); } return 0; } -- cgit v1.2.3-59-g8ed1b From 7938d38b94c98e7a48ddc0a43ddf54482b940b90 Mon Sep 17 00:00:00 2001 From: Filipe Manana Date: Mon, 18 Mar 2024 11:52:00 +0000 Subject: btrfs: remove pointless readahead callback wrapper There's no point in having a static readahead callback in inode.c that does nothing besides calling extent_readahead() from extent_io.c. So just remove the callback at inode.c and rename extent_readahead() to btrfs_readahead(). 
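After the rename, the address_space_operations entry keeps its name and
points straight at the extent_io.c implementation; roughly like this
(a sketch of the wiring, not the full table, with surrounding fields
elided):

  static const struct address_space_operations btrfs_aops = {
          .read_folio     = btrfs_read_folio,
          .writepages     = btrfs_writepages,
          .readahead      = btrfs_readahead,      /* formerly a one-line wrapper */
          /* ... */
  };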
Reviewed-by: Johannes Thumshirn Reviewed-by: Anand Jain Reviewed-by: Qu Wenruo Signed-off-by: Filipe Manana Reviewed-by: David Sterba Signed-off-by: David Sterba --- fs/btrfs/extent_io.c | 2 +- fs/btrfs/extent_io.h | 2 +- fs/btrfs/inode.c | 5 ----- 3 files changed, 2 insertions(+), 7 deletions(-) (limited to 'fs/btrfs/extent_io.c') diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 2776112dbdf8..f863eefe0f1c 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -2267,7 +2267,7 @@ int extent_writepages(struct address_space *mapping, return ret; } -void extent_readahead(struct readahead_control *rac) +void btrfs_readahead(struct readahead_control *rac) { struct btrfs_bio_ctrl bio_ctrl = { .opf = REQ_OP_READ | REQ_RAHEAD }; struct page *pagepool[16]; diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h index e3530d427e1f..eb123b0499e1 100644 --- a/fs/btrfs/extent_io.h +++ b/fs/btrfs/extent_io.h @@ -241,7 +241,7 @@ int extent_writepages(struct address_space *mapping, struct writeback_control *wbc); int btree_write_cache_pages(struct address_space *mapping, struct writeback_control *wbc); -void extent_readahead(struct readahead_control *rac); +void btrfs_readahead(struct readahead_control *rac); int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo, u64 start, u64 len); int set_folio_extent_mapped(struct folio *folio); diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 7fed887e700c..ce923f207e2d 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -7929,11 +7929,6 @@ static int btrfs_writepages(struct address_space *mapping, return extent_writepages(mapping, wbc); } -static void btrfs_readahead(struct readahead_control *rac) -{ - extent_readahead(rac); -} - /* * For release_folio() and invalidate_folio() we have a race window where * folio_end_writeback() is called but the subpage spinlock is not yet released. -- cgit v1.2.3-59-g8ed1b From c66f2afc714867cf7e685680d848748e0d636bef Mon Sep 17 00:00:00 2001 From: Filipe Manana Date: Mon, 18 Mar 2024 11:58:28 +0000 Subject: btrfs: remove pointless writepages callback wrapper There's no point in having a static writepages callback in inode.c that does nothing besides calling extent_writepages from extent_io.c. So just remove the callback at inode.c and rename extent_writepages() to btrfs_writepages(). Reviewed-by: Johannes Thumshirn Reviewed-by: Anand Jain Reviewed-by: Qu Wenruo Signed-off-by: Filipe Manana Reviewed-by: David Sterba Signed-off-by: David Sterba --- fs/btrfs/extent_io.c | 3 +-- fs/btrfs/extent_io.h | 3 +-- fs/btrfs/inode.c | 6 ------ 3 files changed, 2 insertions(+), 10 deletions(-) (limited to 'fs/btrfs/extent_io.c') diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index f863eefe0f1c..7bc23e26a530 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -2246,8 +2246,7 @@ next_page: submit_write_bio(&bio_ctrl, found_error ? 
ret : 0); } -int extent_writepages(struct address_space *mapping, - struct writeback_control *wbc) +int btrfs_writepages(struct address_space *mapping, struct writeback_control *wbc) { struct inode *inode = mapping->host; int ret = 0; diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h index eb123b0499e1..818431b37124 100644 --- a/fs/btrfs/extent_io.h +++ b/fs/btrfs/extent_io.h @@ -237,8 +237,7 @@ int btrfs_read_folio(struct file *file, struct folio *folio); void extent_write_locked_range(struct inode *inode, struct page *locked_page, u64 start, u64 end, struct writeback_control *wbc, bool pages_dirty); -int extent_writepages(struct address_space *mapping, - struct writeback_control *wbc); +int btrfs_writepages(struct address_space *mapping, struct writeback_control *wbc); int btree_write_cache_pages(struct address_space *mapping, struct writeback_control *wbc); void btrfs_readahead(struct readahead_control *rac); diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index ce923f207e2d..a6ebaa5438be 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -7923,12 +7923,6 @@ static int btrfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, return ret; } -static int btrfs_writepages(struct address_space *mapping, - struct writeback_control *wbc) -{ - return extent_writepages(mapping, wbc); -} - /* * For release_folio() and invalidate_folio() we have a race window where * folio_end_writeback() is called but the subpage spinlock is not yet released. -- cgit v1.2.3-59-g8ed1b From 1e2d1837091bf70f204802bcac48495358e75673 Mon Sep 17 00:00:00 2001 From: Tavian Barnes Date: Mon, 18 Mar 2024 09:56:53 -0400 Subject: btrfs: add helper to clear EXTENT_BUFFER_READING We are clearing the bit and waking up any waiters in two different places. Factor that code out into a static helper function. Reviewed-by: Qu Wenruo Signed-off-by: Tavian Barnes Reviewed-by: David Sterba Signed-off-by: David Sterba --- fs/btrfs/extent_io.c | 15 +++++++++------ 1 file changed, 9 insertions(+), 6 deletions(-) (limited to 'fs/btrfs/extent_io.c') diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 7bc23e26a530..23bdd05b5cec 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -4260,6 +4260,13 @@ void set_extent_buffer_uptodate(struct extent_buffer *eb) } } +static void clear_extent_buffer_reading(struct extent_buffer *eb) +{ + clear_bit(EXTENT_BUFFER_READING, &eb->bflags); + smp_mb__after_atomic(); + wake_up_bit(&eb->bflags, EXTENT_BUFFER_READING); +} + static void end_bbio_meta_read(struct btrfs_bio *bbio) { struct extent_buffer *eb = bbio->private; @@ -4294,9 +4301,7 @@ static void end_bbio_meta_read(struct btrfs_bio *bbio) bio_offset += len; } - clear_bit(EXTENT_BUFFER_READING, &eb->bflags); - smp_mb__after_atomic(); - wake_up_bit(&eb->bflags, EXTENT_BUFFER_READING); + clear_extent_buffer_reading(eb); free_extent_buffer(eb); bio_put(&bbio->bio); @@ -4330,9 +4335,7 @@ int read_extent_buffer_pages(struct extent_buffer *eb, int wait, int mirror_num, * will now be set, and we shouldn't read it in again. 
 	 */
 	if (unlikely(test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))) {
-		clear_bit(EXTENT_BUFFER_READING, &eb->bflags);
-		smp_mb__after_atomic();
-		wake_up_bit(&eb->bflags, EXTENT_BUFFER_READING);
+		clear_extent_buffer_reading(eb);
 		return 0;
 	}
 
-- 
cgit v1.2.3-59-g8ed1b


From f32f20e2bd1f3b83925f703704840eebb56faedb Mon Sep 17 00:00:00 2001
From: Tavian Barnes
Date: Mon, 18 Mar 2024 09:56:54 -0400
Subject: btrfs: warn if EXTENT_BUFFER_UPTODATE is set while reading

We recently tracked down a race condition that triggered a read for an
extent buffer with EXTENT_BUFFER_UPTODATE already set. While this read
was in progress, other concurrent readers would see the UPTODATE bit and
return early, as if the read was already complete, making accesses to
the extent buffer conflict with the read operation that was overwriting
it.

Add a WARN_ON() to end_bbio_meta_read() for this situation to make
similar races easier to spot in the future.

Reviewed-by: Qu Wenruo
Signed-off-by: Tavian Barnes
Reviewed-by: David Sterba
Signed-off-by: David Sterba
---
 fs/btrfs/extent_io.c | 7 +++++++
 1 file changed, 7 insertions(+)

(limited to 'fs/btrfs/extent_io.c')

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 23bdd05b5cec..ecb18a8db373 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4275,6 +4275,13 @@ static void end_bbio_meta_read(struct btrfs_bio *bbio)
 	struct folio_iter fi;
 	u32 bio_offset = 0;
 
+	/*
+	 * If the extent buffer is marked UPTODATE before the read operation
+	 * completes, other calls to read_extent_buffer_pages() will return
+	 * early without waiting for the read to finish, causing data races.
+	 */
+	WARN_ON(test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags));
+
 	eb->read_mirror = bbio->mirror_num;
 
 	if (uptodate &&
-- 
cgit v1.2.3-59-g8ed1b


From 11e03f2f4b79eac2176d8ae5120bc9857e7fbb29 Mon Sep 17 00:00:00 2001
From: Qu Wenruo
Date: Mon, 29 Jan 2024 20:16:10 +1030
Subject: btrfs: introduce btrfs_alloc_folio_array()

The new helper will do the same thing as btrfs_alloc_page_array(), but
with folios. One difference is that there is no helper for bulk
allocation, thus it may not be as efficient as the page version.

Signed-off-by: Qu Wenruo
Reviewed-by: David Sterba
Signed-off-by: David Sterba
---
 fs/btrfs/extent_io.c | 31 +++++++++++++++++++++++++++++++
 fs/btrfs/extent_io.h |  2 ++
 2 files changed, 33 insertions(+)

(limited to 'fs/btrfs/extent_io.c')

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index ecb18a8db373..d90330f26827 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -666,6 +666,37 @@ static void end_bbio_data_read(struct btrfs_bio *bbio)
 	bio_put(bio);
 }
 
+/*
+ * Populate every free slot in a provided array with folios.
+ *
+ * @nr_folios:   number of folios to allocate
+ * @folio_array: the array to fill with folios; any existing non-NULL entries in
+ *		 the array will be skipped
+ * @extra_gfp:	 the extra GFP flags for the allocation
+ *
+ * Return: 0   if all folios were able to be allocated;
+ *         -ENOMEM otherwise, the partially allocated folios would be freed and
+ *         the array slots zeroed
+ */
+int btrfs_alloc_folio_array(unsigned int nr_folios, struct folio **folio_array,
+			    gfp_t extra_gfp)
+{
+	for (int i = 0; i < nr_folios; i++) {
+		if (folio_array[i])
+			continue;
+		folio_array[i] = folio_alloc(GFP_NOFS | extra_gfp, 0);
+		if (!folio_array[i])
+			goto error;
+	}
+	return 0;
+error:
+	for (int i = 0; i < nr_folios; i++) {
+		if (folio_array[i])
+			folio_put(folio_array[i]);
+	}
+	return -ENOMEM;
+}
+
 /*
  * Populate every free slot in a provided array with pages.
  *
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 818431b37124..c81a9b546c9f 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -360,6 +360,8 @@ void btrfs_clear_buffer_dirty(struct btrfs_trans_handle *trans,
 
 int btrfs_alloc_page_array(unsigned int nr_pages, struct page **page_array,
 			   gfp_t extra_gfp);
+int btrfs_alloc_folio_array(unsigned int nr_folios, struct folio **folio_array,
+			    gfp_t extra_gfp);
 
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
 bool find_lock_delalloc_range(struct inode *inode,
-- 
cgit v1.2.3-59-g8ed1b


From 53e24158684b527d013b5b2204ccb34d1f94c248 Mon Sep 17 00:00:00 2001
From: Josef Bacik
Date: Sun, 14 Apr 2024 05:42:43 +0000
Subject: btrfs: set start on clone before calling copy_extent_buffer_full

Our subpage testing started hanging on generic/560 and I bisected it
down to 1cab1375ba6d ("btrfs: reuse cloned extent buffer during fiemap
to avoid re-allocations"). This is subtle because in the subpage case we
use eb->start to figure out where in the folio we're copying to, as our
->start may refer to an area inside of the folio.

For example, assume a 16K page size machine with a 4K node size, and
assume that we already have a cloned extent buffer when we cloned the
previous search.

copy_extent_buffer_full() will do the following when copying the extent
buffer path->nodes[0] (src) into cloned (dest):

  src->start = 8k;	// this is the new leaf we're cloning
  cloned->start = 4k;	// this is left over from the previous clone

  src_addr = folio_address(src->folios[0]);
  dest_addr = folio_address(dest->folios[0]);

  memcpy(dest_addr + get_eb_offset_in_folio(dst, 0),
	 src_addr + get_eb_offset_in_folio(src, 0),
	 src->len);

Now get_eb_offset_in_folio() is where the problems occur, because for
sub-pagesize blocksize we can have multiple eb's per folio; the code for
this is as follows:

  size_t get_eb_offset_in_folio(eb, offset) {
	  return (eb->start + offset & (folio_size(eb->folio[0]) - 1));
  }

So in the above example we are copying into offset 4K inside the folio.
However, once we update cloned->start to 8K to match the src, the math
for get_eb_offset_in_folio() changes, and any subsequent reads (i.e.
btrfs_item_key_to_cpu()) will start reading from the offset 8K instead
of 4K where we copied to, giving us garbage.

Fix this by setting start before we call copy_extent_buffer_full() to
make sure that we're copying into the same offset inside of the folio
that we will read from later.

All other sites of copy_extent_buffer_full() are correct because we
either set ->start beforehand or we simply don't change it in the case
of the tree-log usage.

With this fix we now pass generic/560 on our subpage tests.
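The offset math can be checked with a few lines of standalone C (16K
folio, 4K nodesize, matching the example above; eb_offset_in_folio() and
FOLIO_SIZE here are models of the kernel helper, not the helper itself):

  #include <stdio.h>
  #include <stdint.h>

  #define FOLIO_SIZE 16384u	/* 16K folio */

  static uint64_t eb_offset_in_folio(uint64_t start, uint64_t offset)
  {
          return (start + offset) & (FOLIO_SIZE - 1);
  }

  int main(void)
  {
          /* The copy lands at the offset computed with the stale ->start... */
          printf("copied to %llu\n",
                 (unsigned long long)eb_offset_in_folio(4096, 0));  /* 4096 */
          /* ...but after ->start is updated to 8K, reads look here: */
          printf("read from %llu\n",
                 (unsigned long long)eb_offset_in_folio(8192, 0));  /* 8192 */
          return 0;
  }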
Fixes: 1cab1375ba6d ("btrfs: reuse cloned extent buffer during fiemap to avoid re-allocations") Reviewed-by: Filipe Manana Reviewed-by: Qu Wenruo Signed-off-by: Josef Bacik Signed-off-by: David Sterba --- fs/btrfs/extent_io.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) (limited to 'fs/btrfs/extent_io.c') diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index d90330f26827..91122817f137 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -2803,13 +2803,19 @@ static int fiemap_next_leaf_item(struct btrfs_inode *inode, struct btrfs_path *p goto out; } - /* See the comment at fiemap_search_slot() about why we clone. */ - copy_extent_buffer_full(clone, path->nodes[0]); /* * Important to preserve the start field, for the optimizations when * checking if extents are shared (see extent_fiemap()). + * + * We must set ->start before calling copy_extent_buffer_full(). If we + * are on sub-pagesize blocksize, we use ->start to determine the offset + * into the folio where our eb exists, and if we update ->start after + * the fact then any subsequent reads of the eb may read from a + * different offset in the folio than where we originally copied into. */ clone->start = path->nodes[0]->start; + /* See the comment at fiemap_search_slot() about why we clone. */ + copy_extent_buffer_full(clone, path->nodes[0]); slot = path->slots[0]; btrfs_release_path(path); -- cgit v1.2.3-59-g8ed1b From c2fbd812d749757c5abc6f995a7741da0653a4f4 Mon Sep 17 00:00:00 2001 From: Filipe Manana Date: Thu, 21 Mar 2024 15:08:38 +0000 Subject: btrfs: pass the extent map tree's inode to remove_extent_mapping() Extent maps are always associated to an inode's extent map tree, so there's no need to pass the extent map tree explicitly to remove_extent_mapping(). In order to facilitate an upcoming change that adds a shrinker for extent maps, change remove_extent_mapping() to receive the inode instead of its extent map tree. Reviewed-by: Qu Wenruo Reviewed-by: Josef Bacik Signed-off-by: Filipe Manana Reviewed-by: David Sterba Signed-off-by: David Sterba --- fs/btrfs/extent_io.c | 2 +- fs/btrfs/extent_map.c | 22 +++++++++++++--------- fs/btrfs/extent_map.h | 2 +- fs/btrfs/tests/extent-map-tests.c | 19 ++++++++++--------- 4 files changed, 25 insertions(+), 20 deletions(-) (limited to 'fs/btrfs/extent_io.c') diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 91122817f137..7b10f47d8f83 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -2457,7 +2457,7 @@ remove_em: * hurts the fsync performance for workloads with a data * size that exceeds or is close to the system's memory). */ - remove_extent_mapping(map, em); + remove_extent_mapping(btrfs_inode, em); /* once for the rb tree */ free_extent_map(em); next: diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c index 40f5a99ab382..2b7e3666ebd3 100644 --- a/fs/btrfs/extent_map.c +++ b/fs/btrfs/extent_map.c @@ -449,16 +449,18 @@ struct extent_map *search_extent_mapping(struct extent_map_tree *tree, } /* - * Remove an extent_map from the extent tree. + * Remove an extent_map from its inode's extent tree. * - * @tree: extent tree to remove from + * @inode: the inode the extent map belongs to * @em: extent map being removed * - * Remove @em from @tree. No reference counts are dropped, and no checks - * are done to see if the range is in use. + * Remove @em from the extent tree of @inode. No reference counts are dropped, + * and no checks are done to see if the range is in use. 
*/ -void remove_extent_mapping(struct extent_map_tree *tree, struct extent_map *em) +void remove_extent_mapping(struct btrfs_inode *inode, struct extent_map *em) { + struct extent_map_tree *tree = &inode->extent_tree; + lockdep_assert_held_write(&tree->lock); WARN_ON(em->flags & EXTENT_FLAG_PINNED); @@ -633,8 +635,10 @@ int btrfs_add_extent_mapping(struct btrfs_inode *inode, * if needed. This avoids searching the tree, from the root down to the first * extent map, before each deletion. */ -static void drop_all_extent_maps_fast(struct extent_map_tree *tree) +static void drop_all_extent_maps_fast(struct btrfs_inode *inode) { + struct extent_map_tree *tree = &inode->extent_tree; + write_lock(&tree->lock); while (!RB_EMPTY_ROOT(&tree->map.rb_root)) { struct extent_map *em; @@ -643,7 +647,7 @@ static void drop_all_extent_maps_fast(struct extent_map_tree *tree) node = rb_first_cached(&tree->map); em = rb_entry(node, struct extent_map, rb_node); em->flags &= ~(EXTENT_FLAG_PINNED | EXTENT_FLAG_LOGGING); - remove_extent_mapping(tree, em); + remove_extent_mapping(inode, em); free_extent_map(em); cond_resched_rwlock_write(&tree->lock); } @@ -676,7 +680,7 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode, u64 start, u64 end, WARN_ON(end < start); if (end == (u64)-1) { if (start == 0 && !skip_pinned) { - drop_all_extent_maps_fast(em_tree); + drop_all_extent_maps_fast(inode); return; } len = (u64)-1; @@ -854,7 +858,7 @@ remove_em: ASSERT(!split); btrfs_set_inode_full_sync(inode); } - remove_extent_mapping(em_tree, em); + remove_extent_mapping(inode, em); } /* diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h index 732fc8d7e534..c3707461ff62 100644 --- a/fs/btrfs/extent_map.h +++ b/fs/btrfs/extent_map.h @@ -120,7 +120,7 @@ static inline u64 extent_map_end(const struct extent_map *em) void extent_map_tree_init(struct extent_map_tree *tree); struct extent_map *lookup_extent_mapping(struct extent_map_tree *tree, u64 start, u64 len); -void remove_extent_mapping(struct extent_map_tree *tree, struct extent_map *em); +void remove_extent_mapping(struct btrfs_inode *inode, struct extent_map *em); int split_extent_map(struct btrfs_inode *inode, u64 start, u64 len, u64 pre, u64 new_logical); diff --git a/fs/btrfs/tests/extent-map-tests.c b/fs/btrfs/tests/extent-map-tests.c index 0f5c9c9304d9..ba36794ba2d5 100644 --- a/fs/btrfs/tests/extent-map-tests.c +++ b/fs/btrfs/tests/extent-map-tests.c @@ -11,8 +11,9 @@ #include "../disk-io.h" #include "../block-group.h" -static int free_extent_map_tree(struct extent_map_tree *em_tree) +static int free_extent_map_tree(struct btrfs_inode *inode) { + struct extent_map_tree *em_tree = &inode->extent_tree; struct extent_map *em; struct rb_node *node; int ret = 0; @@ -21,7 +22,7 @@ static int free_extent_map_tree(struct extent_map_tree *em_tree) while (!RB_EMPTY_ROOT(&em_tree->map.rb_root)) { node = rb_first_cached(&em_tree->map); em = rb_entry(node, struct extent_map, rb_node); - remove_extent_mapping(em_tree, em); + remove_extent_mapping(inode, em); #ifdef CONFIG_BTRFS_DEBUG if (refcount_read(&em->refs) != 1) { @@ -142,7 +143,7 @@ static int test_case_1(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode) } free_extent_map(em); out: - ret2 = free_extent_map_tree(em_tree); + ret2 = free_extent_map_tree(inode); if (ret == 0) ret = ret2; @@ -237,7 +238,7 @@ static int test_case_2(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode) } free_extent_map(em); out: - ret2 = free_extent_map_tree(em_tree); + ret2 = free_extent_map_tree(inode); if (ret == 0) 
ret = ret2; @@ -313,7 +314,7 @@ static int __test_case_3(struct btrfs_fs_info *fs_info, } free_extent_map(em); out: - ret2 = free_extent_map_tree(em_tree); + ret2 = free_extent_map_tree(inode); if (ret == 0) ret = ret2; @@ -435,7 +436,7 @@ static int __test_case_4(struct btrfs_fs_info *fs_info, } free_extent_map(em); out: - ret2 = free_extent_map_tree(em_tree); + ret2 = free_extent_map_tree(inode); if (ret == 0) ret = ret2; @@ -679,7 +680,7 @@ static int test_case_5(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode) if (ret) goto out; out: - ret2 = free_extent_map_tree(&inode->extent_tree); + ret2 = free_extent_map_tree(inode); if (ret == 0) ret = ret2; @@ -738,7 +739,7 @@ static int test_case_6(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode) ret = 0; out: free_extent_map(em); - ret2 = free_extent_map_tree(em_tree); + ret2 = free_extent_map_tree(inode); if (ret == 0) ret = ret2; @@ -876,7 +877,7 @@ out: ret2 = unpin_extent_cache(inode, 0, SZ_16K, 0); if (ret == 0) ret = ret2; - ret2 = free_extent_map_tree(em_tree); + ret2 = free_extent_map_tree(inode); if (ret == 0) ret = ret2; -- cgit v1.2.3-59-g8ed1b From 078b981aaa565040348cd3ca75b0ec9e138464a9 Mon Sep 17 00:00:00 2001 From: Filipe Manana Date: Tue, 16 Apr 2024 15:07:13 +0100 Subject: btrfs: rename some variables at try_release_extent_mapping() Rename the following variables: 1) "btrfs_inode" to "inode", because it's shorter to type and clear, and we don't have a VFS inode here as well, so there's no confusion; 2) "tree" to "io_tree", to be clear which tree we are dealing with, since we use 2 different trees in the function; 3) "map" to "extent_tree" since "map" gives the idea we are dealing with an extent map for example, but we are dealing with the inode's extent tree (the tree which stores extent maps). These also make the next patches simpler. 
Reviewed-by: Johannes Thumshirn Signed-off-by: Filipe Manana Reviewed-by: David Sterba Signed-off-by: David Sterba --- fs/btrfs/extent_io.c | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) (limited to 'fs/btrfs/extent_io.c') diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 7b10f47d8f83..6438c3e74756 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -2398,9 +2398,9 @@ int try_release_extent_mapping(struct page *page, gfp_t mask) struct extent_map *em; u64 start = page_offset(page); u64 end = start + PAGE_SIZE - 1; - struct btrfs_inode *btrfs_inode = page_to_inode(page); - struct extent_io_tree *tree = &btrfs_inode->io_tree; - struct extent_map_tree *map = &btrfs_inode->extent_tree; + struct btrfs_inode *inode = page_to_inode(page); + struct extent_io_tree *io_tree = &inode->io_tree; + struct extent_map_tree *extent_tree = &inode->extent_tree; if (gfpflags_allow_blocking(mask) && page->mapping->host->i_size > SZ_16M) { @@ -2410,19 +2410,19 @@ int try_release_extent_mapping(struct page *page, gfp_t mask) u64 cur_gen; len = end - start + 1; - write_lock(&map->lock); - em = lookup_extent_mapping(map, start, len); + write_lock(&extent_tree->lock); + em = lookup_extent_mapping(extent_tree, start, len); if (!em) { - write_unlock(&map->lock); + write_unlock(&extent_tree->lock); break; } if ((em->flags & EXTENT_FLAG_PINNED) || em->start != start) { - write_unlock(&map->lock); + write_unlock(&extent_tree->lock); free_extent_map(em); break; } - if (test_range_bit_exists(tree, em->start, + if (test_range_bit_exists(io_tree, em->start, extent_map_end(em) - 1, EXTENT_LOCKED)) goto next; @@ -2442,7 +2442,7 @@ int try_release_extent_mapping(struct page *page, gfp_t mask) * Otherwise don't remove it, we could be racing with an * ongoing fast fsync that could miss the new extent. */ - fs_info = btrfs_inode->root->fs_info; + fs_info = inode->root->fs_info; spin_lock(&fs_info->trans_lock); cur_gen = fs_info->generation; spin_unlock(&fs_info->trans_lock); @@ -2457,12 +2457,12 @@ remove_em: * hurts the fsync performance for workloads with a data * size that exceeds or is close to the system's memory). */ - remove_extent_mapping(btrfs_inode, em); + remove_extent_mapping(inode, em); /* once for the rb tree */ free_extent_map(em); next: start = extent_map_end(em); - write_unlock(&map->lock); + write_unlock(&extent_tree->lock); /* once for us */ free_extent_map(em); @@ -2470,7 +2470,7 @@ next: cond_resched(); /* Allow large-extent preemption. */ } } - return try_release_extent_state(tree, page, mask); + return try_release_extent_state(io_tree, page, mask); } struct btrfs_fiemap_entry { -- cgit v1.2.3-59-g8ed1b From 85d288309ab5463140a2d00b3827262fb14e7db4 Mon Sep 17 00:00:00 2001 From: Filipe Manana Date: Tue, 16 Apr 2024 15:13:03 +0100 Subject: btrfs: use btrfs_get_fs_generation() at try_release_extent_mapping() Nowadays we have the btrfs_get_fs_generation() to get the current generation of the filesystem, so there's no need anymore to lock the transaction spinlock to read it. 
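The helper amounts to a lockless snapshot of the counter; a sketch of
the idea (the actual helper lives in btrfs's fs.h):

  static inline u64 btrfs_get_fs_generation(const struct btrfs_fs_info *fs_info)
  {
	  /*
	   * Writers update the generation under fs_info->trans_lock; a
	   * reader that only needs a point-in-time snapshot can use
	   * READ_ONCE() instead of taking the spinlock.
	   */
	  return READ_ONCE(fs_info->generation);
  }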
Reviewed-by: Johannes Thumshirn Signed-off-by: Filipe Manana Reviewed-by: David Sterba Signed-off-by: David Sterba --- fs/btrfs/extent_io.c | 7 +------ 1 file changed, 1 insertion(+), 6 deletions(-) (limited to 'fs/btrfs/extent_io.c') diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 6438c3e74756..f689c53553e3 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -2406,8 +2406,7 @@ int try_release_extent_mapping(struct page *page, gfp_t mask) page->mapping->host->i_size > SZ_16M) { u64 len; while (start <= end) { - struct btrfs_fs_info *fs_info; - u64 cur_gen; + const u64 cur_gen = btrfs_get_fs_generation(inode->root->fs_info); len = end - start + 1; write_lock(&extent_tree->lock); @@ -2442,10 +2441,6 @@ int try_release_extent_mapping(struct page *page, gfp_t mask) * Otherwise don't remove it, we could be racing with an * ongoing fast fsync that could miss the new extent. */ - fs_info = inode->root->fs_info; - spin_lock(&fs_info->trans_lock); - cur_gen = fs_info->generation; - spin_unlock(&fs_info->trans_lock); if (em->generation >= cur_gen) goto next; remove_em: -- cgit v1.2.3-59-g8ed1b From 433a3e01dda1d463159a9620b40ba027514f0ea5 Mon Sep 17 00:00:00 2001 From: Filipe Manana Date: Tue, 16 Apr 2024 15:19:03 +0100 Subject: btrfs: remove i_size restriction at try_release_extent_mapping() Currently we don't attempt to release extent maps if the inode has an i_size that is not greater than 16M. This condition was added way back in 2008 by commit 70dec8079d78 ("Btrfs: extent_io and extent_state optimizations"), without any explanation about it. A quick chat with Chris on slack revealed that the goal was probably to release the extent maps for small files only when closing the inode. This however can be harmful in case we have tons of such files being kept open for very long periods of time, since we will consume more and more pages for extent maps. So remove the condition. Reviewed-by: Johannes Thumshirn Signed-off-by: Filipe Manana Reviewed-by: David Sterba Signed-off-by: David Sterba --- fs/btrfs/extent_io.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) (limited to 'fs/btrfs/extent_io.c') diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index f689c53553e3..ff9132b897e3 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -2402,8 +2402,7 @@ int try_release_extent_mapping(struct page *page, gfp_t mask) struct extent_io_tree *io_tree = &inode->io_tree; struct extent_map_tree *extent_tree = &inode->extent_tree; - if (gfpflags_allow_blocking(mask) && - page->mapping->host->i_size > SZ_16M) { + if (gfpflags_allow_blocking(mask)) { u64 len; while (start <= end) { const u64 cur_gen = btrfs_get_fs_generation(inode->root->fs_info); -- cgit v1.2.3-59-g8ed1b From 2e504418e4645302c40982a64de6a6979ec5489d Mon Sep 17 00:00:00 2001 From: Filipe Manana Date: Tue, 16 Apr 2024 15:34:51 +0100 Subject: btrfs: be better releasing extent maps at try_release_extent_mapping() At try_release_extent_mapping(), called during the release folio callback (btrfs_release_folio() callchain), we don't release any extent maps in the range if the GFP flags don't allow blocking. This behaviour is exaggerated because: 1) Both searching for extent maps and removing them are not blocking operations. 
The only thing that can block is the cond_resched() call at the end of
the loop that searches for and removes extent maps;

2) We currently only operate on a single page, so for the case where
block size matches the page size, we can only have one extent map, and
for the case where the block size is smaller than the page size, we can
have at most 16 extent maps.

So it's very unlikely the cond_resched() call will ever block even in
the block size smaller than page size scenario.

So instead of not removing any extent maps at all in case the GFP flags
don't allow blocking, keep removing extent maps while we don't need to
reschedule. This makes it safe for the subpage case and for a future
where we can process folios with a size larger than a page.

Reviewed-by: Johannes Thumshirn
Signed-off-by: Filipe Manana
Reviewed-by: David Sterba
Signed-off-by: David Sterba
---
 fs/btrfs/extent_io.c | 120 ++++++++++++++++++++++++++-------------------------
 1 file changed, 61 insertions(+), 59 deletions(-)

(limited to 'fs/btrfs/extent_io.c')

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index ff9132b897e3..2230e6b6ba95 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2395,73 +2395,75 @@ static int try_release_extent_state(struct extent_io_tree *tree,
  */
 int try_release_extent_mapping(struct page *page, gfp_t mask)
 {
-	struct extent_map *em;
 	u64 start = page_offset(page);
 	u64 end = start + PAGE_SIZE - 1;
 	struct btrfs_inode *inode = page_to_inode(page);
 	struct extent_io_tree *io_tree = &inode->io_tree;
-	struct extent_map_tree *extent_tree = &inode->extent_tree;
-
-	if (gfpflags_allow_blocking(mask)) {
-		u64 len;
-		while (start <= end) {
-			const u64 cur_gen = btrfs_get_fs_generation(inode->root->fs_info);
-
-			len = end - start + 1;
-			write_lock(&extent_tree->lock);
-			em = lookup_extent_mapping(extent_tree, start, len);
-			if (!em) {
-				write_unlock(&extent_tree->lock);
-				break;
-			}
-			if ((em->flags & EXTENT_FLAG_PINNED) ||
-			    em->start != start) {
-				write_unlock(&extent_tree->lock);
-				free_extent_map(em);
-				break;
-			}
-			if (test_range_bit_exists(io_tree, em->start,
-						  extent_map_end(em) - 1,
-						  EXTENT_LOCKED))
-				goto next;
-			/*
-			 * If it's not in the list of modified extents, used
-			 * by a fast fsync, we can remove it. If it's being
-			 * logged we can safely remove it since fsync took an
-			 * extra reference on the em.
-			 */
-			if (list_empty(&em->list) ||
-			    (em->flags & EXTENT_FLAG_LOGGING))
-				goto remove_em;
-			/*
-			 * If it's in the list of modified extents, remove it
-			 * only if its generation is older then the current one,
-			 * in which case we don't need it for a fast fsync.
-			 * Otherwise don't remove it, we could be racing with an
-			 * ongoing fast fsync that could miss the new extent.
-			 */
-			if (em->generation >= cur_gen)
-				goto next;
-remove_em:
-			/*
-			 * We only remove extent maps that are not in the list of
-			 * modified extents or that are in the list but with a
-			 * generation lower then the current generation, so there
-			 * is no need to set the full fsync flag on the inode (it
-			 * hurts the fsync performance for workloads with a data
-			 * size that exceeds or is close to the system's memory).
-			 */
-			remove_extent_mapping(inode, em);
-			/* once for the rb tree */
-			free_extent_map(em);
-next:
-			start = extent_map_end(em);
-			write_unlock(&extent_tree->lock);
-
-			/* once for us */
-			free_extent_map(em);
-
-			cond_resched(); /* Allow large-extent preemption. */
-		}
-	}
+
+	while (start <= end) {
+		const u64 cur_gen = btrfs_get_fs_generation(inode->root->fs_info);
+		const u64 len = end - start + 1;
+		struct extent_map_tree *extent_tree = &inode->extent_tree;
+		struct extent_map *em;
+
+		write_lock(&extent_tree->lock);
+		em = lookup_extent_mapping(extent_tree, start, len);
+		if (!em) {
+			write_unlock(&extent_tree->lock);
+			break;
+		}
+		if ((em->flags & EXTENT_FLAG_PINNED) || em->start != start) {
+			write_unlock(&extent_tree->lock);
+			free_extent_map(em);
+			break;
+		}
+		if (test_range_bit_exists(io_tree, em->start,
+					  extent_map_end(em) - 1, EXTENT_LOCKED))
+			goto next;
+		/*
+		 * If it's not in the list of modified extents, used by a fast
+		 * fsync, we can remove it. If it's being logged we can safely
+		 * remove it since fsync took an extra reference on the em.
+		 */
+		if (list_empty(&em->list) || (em->flags & EXTENT_FLAG_LOGGING))
+			goto remove_em;
+		/*
+		 * If it's in the list of modified extents, remove it only if
+		 * its generation is older then the current one, in which case
+		 * we don't need it for a fast fsync. Otherwise don't remove it,
+		 * we could be racing with an ongoing fast fsync that could miss
+		 * the new extent.
+		 */
+		if (em->generation >= cur_gen)
+			goto next;
+remove_em:
+		/*
+		 * We only remove extent maps that are not in the list of
+		 * modified extents or that are in the list but with a
+		 * generation lower then the current generation, so there is no
+		 * need to set the full fsync flag on the inode (it hurts the
+		 * fsync performance for workloads with a data size that exceeds
+		 * or is close to the system's memory).
+		 */
+		remove_extent_mapping(inode, em);
+		/* Once for the inode's extent map tree. */
+		free_extent_map(em);
+next:
+		start = extent_map_end(em);
+		write_unlock(&extent_tree->lock);
+
+		/* Once for us, for the lookup_extent_mapping() reference. */
+		free_extent_map(em);
+
+		if (need_resched()) {
+			/*
+			 * If we need to resched but we can't block just exit
+			 * and leave any remaining extent maps.
+			 */
+			if (!gfpflags_allow_blocking(mask))
+				break;
+
+			cond_resched();
+		}
+	}
 	return try_release_extent_state(io_tree, page, mask);
 }
-- 
cgit v1.2.3-59-g8ed1b


From de6f14e83e6221e3ef7e949deabe041240bc1829 Mon Sep 17 00:00:00 2001
From: Filipe Manana
Date: Tue, 16 Apr 2024 20:52:30 +0100
Subject: btrfs: make try_release_extent_mapping() return a bool

Currently try_release_extent_mapping() has an int return type, but we
use it as a boolean. Its only caller, the release folio callback, also
returns a boolean, which corresponds to try_release_extent_mapping()'s
return value. So change its return type to bool, as well as that of its
helper try_release_extent_state().

Reviewed-by: Johannes Thumshirn
Signed-off-by: Filipe Manana
Reviewed-by: David Sterba
Signed-off-by: David Sterba
---
 fs/btrfs/extent_io.c | 17 +++++++++--------
 fs/btrfs/extent_io.h |  2 +-
 fs/btrfs/inode.c     |  7 +++----
 3 files changed, 13 insertions(+), 13 deletions(-)

(limited to 'fs/btrfs/extent_io.c')

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 2230e6b6ba95..a9f9f5abdf53 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2355,19 +2355,20 @@ int extent_invalidate_folio(struct extent_io_tree *tree,
  * are locked or under IO and drops the related state bits if it is safe
  * to drop the page.
  */
-static int try_release_extent_state(struct extent_io_tree *tree,
+static bool try_release_extent_state(struct extent_io_tree *tree,
 				    struct page *page, gfp_t mask)
 {
 	u64 start = page_offset(page);
 	u64 end = start + PAGE_SIZE - 1;
-	int ret = 1;
+	bool ret;
 
 	if (test_range_bit_exists(tree, start, end, EXTENT_LOCKED)) {
-		ret = 0;
+		ret = false;
 	} else {
 		u32 clear_bits = ~(EXTENT_LOCKED | EXTENT_NODATASUM |
 				   EXTENT_DELALLOC_NEW | EXTENT_CTLBITS |
 				   EXTENT_QGROUP_RESERVED);
+		int ret2;
 
 		/*
 		 * At this point we can safely clear everything except the
 		 * locked bit, the nodatasum bit and the delalloc new bit.
 		 * The delalloc new bit will be cleared by ordered extent
 		 * completion.
 		 */
-		ret = __clear_extent_bit(tree, start, end, clear_bits, NULL, NULL);
+		ret2 = __clear_extent_bit(tree, start, end, clear_bits, NULL, NULL);
 
 		/* if clear_extent_bit failed for enomem reasons,
 		 * we can't allow the release to continue.
 		 */
-		if (ret < 0)
-			ret = 0;
+		if (ret2 < 0)
+			ret = false;
 		else
-			ret = 1;
+			ret = true;
 	}
 	return ret;
 }
@@ -2393,7 +2394,7 @@ static int try_release_extent_state(struct extent_io_tree *tree,
  * in the range corresponding to the page, both state records and extent
  * map records are removed
  */
-int try_release_extent_mapping(struct page *page, gfp_t mask)
+bool try_release_extent_mapping(struct page *page, gfp_t mask)
 {
 	u64 start = page_offset(page);
 	u64 end = start + PAGE_SIZE - 1;
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index c81a9b546c9f..f38397765e90 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -230,7 +230,7 @@ static inline void extent_changeset_free(struct extent_changeset *changeset)
 	kfree(changeset);
 }
 
-int try_release_extent_mapping(struct page *page, gfp_t mask);
+bool try_release_extent_mapping(struct page *page, gfp_t mask);
 int try_release_extent_buffer(struct page *page);
 
 int btrfs_read_folio(struct file *file, struct folio *folio);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 1dde8085271e..eb0dc913c33b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7899,13 +7899,12 @@ static void wait_subpage_spinlock(struct page *page)
 
 static bool __btrfs_release_folio(struct folio *folio, gfp_t gfp_flags)
 {
-	int ret = try_release_extent_mapping(&folio->page, gfp_flags);
-
-	if (ret == 1) {
+	if (try_release_extent_mapping(&folio->page, gfp_flags)) {
 		wait_subpage_spinlock(&folio->page);
 		clear_page_extent_mapped(&folio->page);
+		return true;
 	}
-	return ret;
+	return false;
 }
 
 static bool btrfs_release_folio(struct folio *folio, gfp_t gfp_flags)
-- 
cgit v1.2.3-59-g8ed1b


From c0707c9e1e36d56cef7b3c8de5c5fdcb14f34aa5 Mon Sep 17 00:00:00 2001
From: Josef Bacik
Date: Mon, 12 Feb 2024 16:10:44 -0500
Subject: btrfs: push the extent lock into btrfs_run_delalloc_range

We want to limit the scope of the extent lock to be around operations
that can change in flight. Currently we hold the extent lock through the
entire writepage operation, which isn't really necessary. We only want
it to make sure nobody has updated DELALLOC underneath us.

In find_lock_delalloc_range we must lock the range in order to validate
the contents of our io_tree. However, once we've done that, we're safe
to unlock the range and continue, as we have the page lock already held
for the range. We are protected from all operations at this point.

* mmap() - we're holding the page lock, thus are protected.
* buffered writes - again, we're protected because we take the page lock
  for the first and last page in our range for buffered writes, so we
  won't create new delalloc ranges in this area.

* direct IO - we invalidate pagecache before attempting to write a new
  area, which requires the page lock, so again we are protected once
  we're holding the page lock on this range.

Additionally, this behavior already exists for compressed writes: we
unlock the range as soon as we start to process the async extents, and
re-lock it during compression. So this is completely safe, and makes the
locking more consistent.

Make this simple by just pushing the extent lock into
btrfs_run_delalloc_range. From there, follow-up patches will push the
lock further down, into its users.

Reviewed-by: Goldwyn Rodrigues
Signed-off-by: Josef Bacik
Signed-off-by: David Sterba
---
 fs/btrfs/extent_io.c | 5 ++---
 fs/btrfs/inode.c     | 5 +++++
 2 files changed, 7 insertions(+), 3 deletions(-)

(limited to 'fs/btrfs/extent_io.c')

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index a9f9f5abdf53..d76ba4099b79 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -396,15 +396,14 @@ again:
 	/* then test to make sure it is all still delalloc */
 	ret = test_range_bit(tree, delalloc_start, delalloc_end,
 			     EXTENT_DELALLOC, cached_state);
+
+	unlock_extent(tree, delalloc_start, delalloc_end, &cached_state);
 	if (!ret) {
-		unlock_extent(tree, delalloc_start, delalloc_end,
-			      &cached_state);
 		__unlock_for_delalloc(inode, locked_page,
 				      delalloc_start, delalloc_end);
 		cond_resched();
 		goto again;
 	}
-	free_extent_state(cached_state);
 	*start = delalloc_start;
 	*end = delalloc_end;
 out_failed:
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d82c453ae9d7..347de4e3a331 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2249,6 +2249,11 @@ int btrfs_run_delalloc_range(struct btrfs_inode *inode, struct page *locked_page
 	const bool zoned = btrfs_is_zoned(inode->root->fs_info);
 	int ret;
 
+	/*
+	 * We're unlocked by the different fill functions below.
+	 */
+	lock_extent(&inode->io_tree, start, end, NULL);
+
 	/*
 	 * The range must cover part of the @locked_page, or a return of 1
 	 * can confuse the caller.
-- 
cgit v1.2.3-59-g8ed1b


From 6b0a63a4fa3142d1cb0069b9c7bf02270412d96f Mon Sep 17 00:00:00 2001
From: Josef Bacik
Date: Wed, 3 Apr 2024 17:29:40 -0400
Subject: btrfs: add a cached state to extent_clear_unlock_delalloc

Now that lock_extent() is tightly coupled with
extent_clear_unlock_delalloc(), we can add a cached state to
extent_clear_unlock_delalloc() and benefit from skipping the extra
lookup when we're doing COW.
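The caller-side pattern after this change looks roughly like this (a
sketch based on the cow_file_range() hunks below, with allocation and
error handling elided):

	struct extent_state *cached = NULL;

	/* lock_extent() caches the extent_state covering [start, end]. */
	lock_extent(&inode->io_tree, start, end, &cached);

	/* ... allocate the extent and create the ordered extent ... */

	/*
	 * Passing &cached lets clear_extent_bit() start from the cached
	 * state instead of searching the io tree from the root again.
	 */
	extent_clear_unlock_delalloc(inode, start, end, locked_page, &cached,
				     EXTENT_LOCKED | EXTENT_DELALLOC,
				     PAGE_UNLOCK | PAGE_START_WRITEBACK);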
Reviewed-by: Goldwyn Rodrigues Signed-off-by: Josef Bacik Signed-off-by: David Sterba --- fs/btrfs/extent_io.c | 3 ++- fs/btrfs/extent_io.h | 2 ++ fs/btrfs/inode.c | 42 ++++++++++++++++++++++++------------------ 3 files changed, 28 insertions(+), 19 deletions(-) (limited to 'fs/btrfs/extent_io.c') diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index d76ba4099b79..47a5bb95a994 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -412,9 +412,10 @@ out_failed: void extent_clear_unlock_delalloc(struct btrfs_inode *inode, u64 start, u64 end, struct page *locked_page, + struct extent_state **cached, u32 clear_bits, unsigned long page_ops) { - clear_extent_bit(&inode->io_tree, start, end, clear_bits, NULL); + clear_extent_bit(&inode->io_tree, start, end, clear_bits, cached); __process_pages_contig(inode->vfs_inode.i_mapping, locked_page, start, end, page_ops); diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h index f38397765e90..dca6b12769ec 100644 --- a/fs/btrfs/extent_io.h +++ b/fs/btrfs/extent_io.h @@ -27,6 +27,7 @@ struct address_space; struct writeback_control; struct extent_io_tree; struct extent_map_tree; +struct extent_state; struct btrfs_block_group; struct btrfs_fs_info; struct btrfs_inode; @@ -352,6 +353,7 @@ void clear_extent_buffer_uptodate(struct extent_buffer *eb); void extent_range_clear_dirty_for_io(struct inode *inode, u64 start, u64 end); void extent_clear_unlock_delalloc(struct btrfs_inode *inode, u64 start, u64 end, struct page *locked_page, + struct extent_state **cached, u32 bits_to_clear, unsigned long page_ops); int extent_invalidate_folio(struct extent_io_tree *tree, struct folio *folio, size_t offset); diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 944877a363fa..d0274324c75a 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -762,8 +762,8 @@ static noinline int cow_file_range_inline(struct btrfs_inode *inode, u64 offset, return ret; } - free_extent_state(cached); - extent_clear_unlock_delalloc(inode, offset, end, NULL, clear_flags, + extent_clear_unlock_delalloc(inode, offset, end, NULL, &cached, + clear_flags, PAGE_UNLOCK | PAGE_START_WRITEBACK | PAGE_END_WRITEBACK); return ret; @@ -1154,6 +1154,7 @@ static void submit_one_async_extent(struct async_chunk *async_chunk, struct btrfs_ordered_extent *ordered; struct btrfs_key ins; struct page *locked_page = NULL; + struct extent_state *cached = NULL; struct extent_map *em; int ret = 0; u64 start = async_extent->start; @@ -1194,7 +1195,7 @@ static void submit_one_async_extent(struct async_chunk *async_chunk, goto done; } - lock_extent(io_tree, start, end, NULL); + lock_extent(io_tree, start, end, &cached); /* Here we're doing allocation and writeback of the compressed pages */ em = create_io_em(inode, start, @@ -1229,7 +1230,7 @@ static void submit_one_async_extent(struct async_chunk *async_chunk, /* Clear dirty, set writeback and unlock the pages. 
*/ extent_clear_unlock_delalloc(inode, start, end, - NULL, EXTENT_LOCKED | EXTENT_DELALLOC, + NULL, &cached, EXTENT_LOCKED | EXTENT_DELALLOC, PAGE_UNLOCK | PAGE_START_WRITEBACK); btrfs_submit_compressed_write(ordered, async_extent->folios, /* compressed_folios */ @@ -1247,7 +1248,8 @@ out_free_reserve: btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset, 1); mapping_set_error(inode->vfs_inode.i_mapping, -EIO); extent_clear_unlock_delalloc(inode, start, end, - NULL, EXTENT_LOCKED | EXTENT_DELALLOC | + NULL, &cached, + EXTENT_LOCKED | EXTENT_DELALLOC | EXTENT_DELALLOC_NEW | EXTENT_DEFRAG | EXTENT_DO_ACCOUNTING, PAGE_UNLOCK | PAGE_START_WRITEBACK | @@ -1329,6 +1331,7 @@ static noinline int cow_file_range(struct btrfs_inode *inode, { struct btrfs_root *root = inode->root; struct btrfs_fs_info *fs_info = root->fs_info; + struct extent_state *cached = NULL; u64 alloc_hint = 0; u64 orig_start = start; u64 num_bytes; @@ -1429,7 +1432,8 @@ static noinline int cow_file_range(struct btrfs_inode *inode, ram_size = ins.offset; - lock_extent(&inode->io_tree, start, start + ram_size - 1, NULL); + lock_extent(&inode->io_tree, start, start + ram_size - 1, + &cached); em = create_io_em(inode, start, ins.offset, /* len */ start, /* orig_start */ @@ -1441,7 +1445,7 @@ static noinline int cow_file_range(struct btrfs_inode *inode, BTRFS_ORDERED_REGULAR /* type */); if (IS_ERR(em)) { unlock_extent(&inode->io_tree, start, - start + ram_size - 1, NULL); + start + ram_size - 1, &cached); ret = PTR_ERR(em); goto out_reserve; } @@ -1453,7 +1457,7 @@ static noinline int cow_file_range(struct btrfs_inode *inode, BTRFS_COMPRESS_NONE); if (IS_ERR(ordered)) { unlock_extent(&inode->io_tree, start, - start + ram_size - 1, NULL); + start + ram_size - 1, &cached); ret = PTR_ERR(ordered); goto out_drop_extent_cache; } @@ -1493,7 +1497,7 @@ static noinline int cow_file_range(struct btrfs_inode *inode, page_ops |= PAGE_SET_ORDERED; extent_clear_unlock_delalloc(inode, start, start + ram_size - 1, - locked_page, + locked_page, &cached, EXTENT_LOCKED | EXTENT_DELALLOC, page_ops); if (num_bytes < cur_alloc_size) @@ -1552,7 +1556,7 @@ out_unlock: if (!locked_page) mapping_set_error(inode->vfs_inode.i_mapping, ret); extent_clear_unlock_delalloc(inode, orig_start, start - 1, - locked_page, 0, page_ops); + locked_page, NULL, 0, page_ops); } /* @@ -1575,7 +1579,7 @@ out_unlock: if (extent_reserved) { extent_clear_unlock_delalloc(inode, start, start + cur_alloc_size - 1, - locked_page, + locked_page, &cached, clear_bits, page_ops); start += cur_alloc_size; @@ -1590,7 +1594,7 @@ out_unlock: if (start < end) { clear_bits |= EXTENT_CLEAR_DATA_RESV; extent_clear_unlock_delalloc(inode, start, end, locked_page, - clear_bits, page_ops); + &cached, clear_bits, page_ops); } return ret; } @@ -2206,11 +2210,10 @@ must_cow: btrfs_put_ordered_extent(ordered); extent_clear_unlock_delalloc(inode, cur_offset, nocow_end, - locked_page, EXTENT_LOCKED | - EXTENT_DELALLOC | + locked_page, &cached_state, + EXTENT_LOCKED | EXTENT_DELALLOC | EXTENT_CLEAR_DATA_RESV, PAGE_UNLOCK | PAGE_SET_ORDERED); - free_extent_state(cached_state); cur_offset = extent_end; @@ -2252,10 +2255,13 @@ error: * we're not locked at this point. 
*/ if (cur_offset < end) { - lock_extent(&inode->io_tree, cur_offset, end, NULL); + struct extent_state *cached = NULL; + + lock_extent(&inode->io_tree, cur_offset, end, &cached); extent_clear_unlock_delalloc(inode, cur_offset, end, - locked_page, EXTENT_LOCKED | - EXTENT_DELALLOC | EXTENT_DEFRAG | + locked_page, &cached, + EXTENT_LOCKED | EXTENT_DELALLOC | + EXTENT_DEFRAG | EXTENT_DO_ACCOUNTING, PAGE_UNLOCK | PAGE_START_WRITEBACK | PAGE_END_WRITEBACK); -- cgit v1.2.3-59-g8ed1b From bc00965dbff7a8612c8ec0005b3bc943d7196629 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Sat, 20 Apr 2024 03:49:59 +0100 Subject: btrfs: count super block write errors in device instead of tracking folio error state Currently the error status of super block write is tracked in page/folio status bit Error. For that we need to keep the reference for the whole duration of write and wait. Count the number of superblock writeback errors in the btrfs_device. That means we don't need the folio to stay around until it's waited for, and can avoid the extra call to folio_get/put. Also remove a mention of PageError in a comment as it's the last mention of the page Error state. Signed-off-by: Matthew Wilcox (Oracle) Reviewed-by: David Sterba Signed-off-by: David Sterba --- fs/btrfs/disk-io.c | 46 +++++++++++++++++++--------------------------- fs/btrfs/extent_io.c | 2 +- fs/btrfs/volumes.h | 9 +++++++++ 3 files changed, 29 insertions(+), 28 deletions(-) (limited to 'fs/btrfs/extent_io.c') diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 90c54466ecc3..a91a8056758a 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -3634,11 +3634,15 @@ static void btrfs_end_super_write(struct bio *bio) "lost super block write due to IO error on %s (%d)", btrfs_dev_name(device), blk_status_to_errno(bio->bi_status)); - folio_set_error(fi.folio); btrfs_dev_stat_inc_and_print(device, BTRFS_DEV_STAT_WRITE_ERRS); + /* Ensure failure if the primary sb fails. */ + if (bio->bi_opf & REQ_FUA) + atomic_add(BTRFS_SUPER_PRIMARY_WRITE_ERROR, + &device->sb_write_errors); + else + atomic_inc(&device->sb_write_errors); } - folio_unlock(fi.folio); folio_put(fi.folio); } @@ -3742,10 +3746,11 @@ static int write_dev_supers(struct btrfs_device *device, struct address_space *mapping = device->bdev->bd_inode->i_mapping; SHASH_DESC_ON_STACK(shash, fs_info->csum_shash); int i; - int errors = 0; int ret; u64 bytenr, bytenr_orig; + atomic_set(&device->sb_write_errors, 0); + if (max_mirrors == 0) max_mirrors = BTRFS_SUPER_MIRROR_MAX; @@ -3765,7 +3770,7 @@ static int write_dev_supers(struct btrfs_device *device, btrfs_err(device->fs_info, "couldn't get super block location for mirror %d", i); - errors++; + atomic_inc(&device->sb_write_errors); continue; } if (bytenr + BTRFS_SUPER_INFO_SIZE >= @@ -3785,14 +3790,11 @@ static int write_dev_supers(struct btrfs_device *device, btrfs_err(device->fs_info, "couldn't get super block page for bytenr %llu", bytenr); - errors++; + atomic_inc(&device->sb_write_errors); continue; } ASSERT(folio_order(folio) == 0); - /* Bump the refcount for wait_dev_supers() */ - folio_get(folio); - offset = offset_in_folio(folio, bytenr); disk_super = folio_address(folio) + offset; memcpy(disk_super, sb, BTRFS_SUPER_INFO_SIZE); @@ -3820,16 +3822,17 @@ static int write_dev_supers(struct btrfs_device *device, submit_bio(bio); if (btrfs_advance_sb_log(device, i)) - errors++; + atomic_inc(&device->sb_write_errors); } - return errors < i ? 0 : -1; + return atomic_read(&device->sb_write_errors) < i ? 
0 : -1; } /* * Wait for write completion of superblocks done by write_dev_supers, * @max_mirrors same for write and wait phases. * - * Return number of errors when folio is not found or not marked up to date. + * Return -1 if primary super block write failed or when there were no super block + * copies written. Otherwise 0. */ static int wait_dev_supers(struct btrfs_device *device, int max_mirrors) { @@ -3860,30 +3863,19 @@ static int wait_dev_supers(struct btrfs_device *device, int max_mirrors) folio = filemap_get_folio(device->bdev->bd_inode->i_mapping, bytenr >> PAGE_SHIFT); - if (IS_ERR(folio)) { - errors++; - if (i == 0) - primary_failed = true; + /* If the folio has been removed, then we know it completed. */ + if (IS_ERR(folio)) continue; - } ASSERT(folio_order(folio) == 0); /* Folio will be unlocked once the write completes. */ folio_wait_locked(folio); - if (folio_test_error(folio)) { - errors++; - if (i == 0) - primary_failed = true; - } - - /* Drop our reference */ - folio_put(folio); - - /* Drop the reference from the writing run */ folio_put(folio); } - /* log error, force error return */ + errors += atomic_read(&device->sb_write_errors); + if (errors >= BTRFS_SUPER_PRIMARY_WRITE_ERROR) + primary_failed = true; if (primary_failed) { btrfs_err(device->fs_info, "error writing primary super block to device %llu", device->devid); diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 47a5bb95a994..597387e9f040 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -1602,7 +1602,7 @@ static void set_btree_ioerr(struct extent_buffer *eb) * can be no longer dirty nor marked anymore for writeback (if a * subsequent modification to the extent buffer didn't happen before the * transaction commit), which makes filemap_fdata[write|wait]_range not - * able to find the pages tagged with SetPageError at transaction + * able to find the pages which contain errors at transaction * commit time. So if this happens we must abort the transaction, * otherwise we commit a super block with btree roots that point to * btree nodes/leafs whose content on disk is invalid - either garbage diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h index cf555f5b47ce..66e6fc481ecd 100644 --- a/fs/btrfs/volumes.h +++ b/fs/btrfs/volumes.h @@ -92,6 +92,9 @@ enum btrfs_raid_types { #define BTRFS_DEV_STATE_FLUSH_SENT (4) #define BTRFS_DEV_STATE_NO_READA (5) +/* Special value encoding failure to write primary super block. */ +#define BTRFS_SUPER_PRIMARY_WRITE_ERROR (INT_MAX / 2) + struct btrfs_fs_devices; struct btrfs_device { @@ -142,6 +145,12 @@ struct btrfs_device { /* type and info about this device */ u64 type; + /* + * Counter of super block write errors, values larger than + * BTRFS_SUPER_PRIMARY_WRITE_ERROR encode primary super block write failure. + */ + atomic_t sb_write_errors; + /* minimal io size for this device */ u32 sector_size; -- cgit v1.2.3-59-g8ed1b
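The first patch above threads a cached extent_state from lock_extent() through to extent_clear_unlock_delalloc(), so the final clear_extent_bit() can skip the io_tree lookup and consume the caller's reference in one step; that is also why the explicit free_extent_state() calls in cow_file_range_inline() and run_delalloc_nocow() disappear. Below is a self-contained user-space sketch of that pattern, not kernel code: struct tree, lock_range(), clear_range() and put_state() are invented stand-ins, with a linked list and a plain refcount in place of the kernel's rbtree and kref machinery.

/*
 * Toy analogue of the cached extent_state pattern: lock_range() hands
 * back a referenced pointer to the state it created, and clear_range()
 * uses that pointer to skip the search and then consumes the reference,
 * so the caller needs no separate free.
 */
#include <stdio.h>
#include <stdlib.h>

struct state {
	unsigned long long start, end;
	int refs;
	struct state *next;
};

struct tree {
	struct state *head;
	unsigned long searches;	/* counts how often we had to walk the list */
};

static struct state *lock_range(struct tree *t, unsigned long long start,
				unsigned long long end, struct state **cached)
{
	struct state *st = calloc(1, sizeof(*st));

	st->start = start;
	st->end = end;
	st->refs = 1;		/* the tree's reference */
	st->next = t->head;
	t->head = st;
	if (cached) {
		st->refs++;	/* the caller's cached reference */
		*cached = st;
	}
	return st;
}

static void put_state(struct state *st)
{
	if (st && --st->refs == 0)
		free(st);
}

/* Like clear_extent_bit(): a matching @cached avoids walking the tree. */
static void clear_range(struct tree *t, unsigned long long start,
			unsigned long long end, struct state **cached)
{
	struct state *st, **pp;

	if (cached && *cached && (*cached)->start == start &&
	    (*cached)->end == end) {
		st = *cached;
	} else {
		t->searches++;	/* cache miss: pay for a lookup */
		for (st = t->head; st; st = st->next)
			if (st->start == start && st->end == end)
				break;
	}
	if (!st)
		return;
	/* Unlink from the tree and drop the tree's reference. */
	for (pp = &t->head; *pp; pp = &(*pp)->next) {
		if (*pp == st) {
			*pp = st->next;
			break;
		}
	}
	put_state(st);
	/* Consume the caller's cached reference as well. */
	if (cached && *cached == st) {
		put_state(st);
		*cached = NULL;
	}
}

int main(void)
{
	struct tree t = { 0 };
	struct state *cached = NULL;

	lock_range(&t, 0, 4095, &cached);
	clear_range(&t, 0, 4095, &cached);	/* cached hit, no search */
	printf("searches: %lu\n", t.searches);	/* prints 0 */
	return 0;
}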
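The second patch replaces the per-folio error bit with a per-device atomic counter and encodes "the primary super block copy failed" as a sentinel increment of INT_MAX / 2: since only a handful of ordinary failures (bounded by BTRFS_SUPER_MIRROR_MAX) can accumulate per writeback pass, a total at or above the sentinel can only mean the primary write failed, while write_dev_supers() still succeeds as long as the count stays below the number of copies submitted. A minimal user-space sketch of that encoding, using C11 atomics in place of the kernel's atomic_t; struct device, record_sb_write_error() and primary_sb_failed() are illustrative names, not kernel APIs:

#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Sentinel: far larger than any possible count of ordinary failures. */
#define SB_PRIMARY_WRITE_ERROR (INT_MAX / 2)

struct device {
	atomic_int sb_write_errors;
};

/* Completion path: called once per failed super block write. */
static void record_sb_write_error(struct device *dev, bool primary)
{
	/* A primary failure jumps the counter past the sentinel. */
	atomic_fetch_add(&dev->sb_write_errors,
			 primary ? SB_PRIMARY_WRITE_ERROR : 1);
}

/* Wait path: decide whether the primary copy is known good. */
static bool primary_sb_failed(struct device *dev)
{
	return atomic_load(&dev->sb_write_errors) >= SB_PRIMARY_WRITE_ERROR;
}

int main(void)
{
	struct device dev = { .sb_write_errors = 0 };

	record_sb_write_error(&dev, false);	/* a mirror copy failed */
	printf("primary failed: %d\n", primary_sb_failed(&dev));	/* 0 */

	record_sb_write_error(&dev, true);	/* the primary copy failed */
	printf("primary failed: %d\n", primary_sb_failed(&dev));	/* 1 */
	return 0;
}

Because the counter lives in btrfs_device rather than in folio state, the wait path no longer needs the extra folio reference: once the folio is gone from the page cache, the write is known to have completed, and the error tally survives independently.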