aboutsummaryrefslogtreecommitdiffstats
path: root/fs/btrfs/subpage.c (follow)
AgeCommit message (Collapse)AuthorFilesLines
2021-08-23btrfs: subpage: fix a potential use-after-free in writeback helperQu Wenruo1-1/+3
[BUG] There is a possible use-after-free bug when running generic/095. BUG: Unable to handle kernel data access on write at 0x6b6b6b6b6b6b725b Faulting instruction address: 0xc000000000283654 c000000000283078 do_raw_spin_unlock+0x88/0x230 c0000000012b1e14 _raw_spin_unlock_irqrestore+0x44/0x90 c000000000a918dc btrfs_subpage_clear_writeback+0xac/0xe0 c0000000009e0458 end_bio_extent_writepage+0x158/0x270 c000000000b6fd14 bio_endio+0x254/0x270 c0000000009fc0f0 btrfs_end_bio+0x1a0/0x200 c000000000b6fd14 bio_endio+0x254/0x270 c000000000b781fc blk_update_request+0x46c/0x670 c000000000b8b394 blk_mq_end_request+0x34/0x1d0 c000000000d82d1c lo_complete_rq+0x11c/0x140 c000000000b880a4 blk_complete_reqs+0x84/0xb0 c0000000012b2ca4 __do_softirq+0x334/0x680 c0000000001dd878 irq_exit+0x148/0x1d0 c000000000016f4c do_IRQ+0x20c/0x240 c000000000009240 hardware_interrupt_common_virt+0x1b0/0x1c0 [CAUSE] There is very small race window like the following in generic/095. Thread 1 | Thread 2 --------------------------------+------------------------------------ end_bio_extent_writepage() | btrfs_releasepage() |- spin_lock_irqsave() | | |- end_page_writeback() | | | | |- if (PageWriteback() ||...) | | |- clear_page_extent_mapped() | | |- kfree(subpage); |- spin_unlock_irqrestore(). The race can also happen between writeback and btrfs_invalidatepage(), although that would be much harder as btrfs_invalidatepage() has much more work to do before the clear_page_extent_mapped() call. [FIX] Here we "wait" for the subapge spinlock to be released before we detach subpage structure. So this patch will introduce a new function, wait_subpage_spinlock(), to do the "wait" by acquiring the spinlock and release it. Since the caller has ensured the page is not dirty nor writeback, and page is already locked, the only way to hold the subpage spinlock is from endio function. Thus we only need to acquire the spinlock to wait for any existing holder. Reported-by: Ritesh Harjani <riteshh@linux.ibm.com> Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: subpage: fix writeback which does not have ordered extentQu Wenruo1-0/+20
[BUG] When running fsstress with subpage RW support, there are random BUG_ON()s triggered with the following trace: kernel BUG at fs/btrfs/file-item.c:667! Internal error: Oops - BUG: 0 [#1] SMP CPU: 1 PID: 3486 Comm: kworker/u13:2 5.11.0-rc4-custom+ #43 Hardware name: Radxa ROCK Pi 4B (DT) Workqueue: btrfs-worker-high btrfs_work_helper [btrfs] pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--) pc : btrfs_csum_one_bio+0x420/0x4e0 [btrfs] lr : btrfs_csum_one_bio+0x400/0x4e0 [btrfs] Call trace: btrfs_csum_one_bio+0x420/0x4e0 [btrfs] btrfs_submit_bio_start+0x20/0x30 [btrfs] run_one_async_start+0x28/0x44 [btrfs] btrfs_work_helper+0x128/0x1b4 [btrfs] process_one_work+0x22c/0x430 worker_thread+0x70/0x3a0 kthread+0x13c/0x140 ret_from_fork+0x10/0x30 [CAUSE] Above BUG_ON() means there is some bio range which doesn't have ordered extent, which indeed is worth a BUG_ON(). Unlike regular sectorsize == PAGE_SIZE case, in subpage we have extra subpage dirty bitmap to record which range is dirty and should be written back. This means, if we submit bio for a subpage range, we do not only need to clear page dirty, but also need to clear subpage dirty bits. In __extent_writepage_io(), we will call btrfs_page_clear_dirty() for any range we submit a bio. But there is loophole, if we hit a range which is beyond i_size, we just call btrfs_writepage_endio_finish_ordered() to finish the ordered io, then break out, without clearing the subpage dirty. This means, if we hit above branch, the subpage dirty bits are still there, if other range of the page get dirtied and we need to writeback that page again, we will submit bio for the old range, leaving a wild bio range which doesn't have ordered extent. [FIX] Fix it by always calling btrfs_page_clear_dirty() in __extent_writepage_io(). Also to avoid such problem from happening again, add a new assert, btrfs_page_assert_not_dirty(), to make sure both page dirty and subpage dirty bits are cleared before exiting __extent_writepage_io(). Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21btrfs: subpage: fix a rare race between metadata endio and eb freeingQu Wenruo1-4/+15
[BUG] There is a very rare ASSERT() triggering during full fstests run for subpage rw support. No other reproducer so far. The ASSERT() gets triggered for metadata read in btrfs_page_set_uptodate() inside end_page_read(). [CAUSE] There is still a small race window for metadata only, the race could happen like this: T1 | T2 ------------------------------------+----------------------------- end_bio_extent_readpage() | |- btrfs_validate_metadata_buffer() | | |- free_extent_buffer() | | Still have 2 refs | |- end_page_read() | |- if (unlikely(PagePrivate()) | | The page still has Private | | | free_extent_buffer() | | | Only one ref 1, will be | | | released | | |- detach_extent_buffer_page() | | |- btrfs_detach_subpage() |- btrfs_set_page_uptodate() | The page no longer has Private| >>> ASSERT() triggered <<< | This race window is super small, thus pretty hard to hit, even with so many runs of fstests. But the race window is still there, we have to go another way to solve it other than relying on random PagePrivate() check. Data path is not affected, as it will lock the page before reading, while unlocking the page after the last read has finished, thus no race window. [FIX] This patch will fix the bug by repurposing btrfs_subpage::readers. Now btrfs_subpage::readers will be a member shared by both metadata and data. For metadata path, we don't do the page unlock as metadata only relies on extent locking. At the same time, teach page_range_has_eb() to take btrfs_subpage::readers into consideration. So that even if the last eb of a page gets freed, page::private won't be detached as long as there still are pending end_page_read() calls. By this we eliminate the race window, this will slight increase the metadata memory usage, as the page may not be released as frequently as usual. But it should not be a big deal. The code got introduced in ("btrfs: submit read time repair only for each corrupted sector"), but the fix is in a separate patch to keep the problem description and the crash is rare so it should not hurt bisectability. Signed-off-by: Qu Wegruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21btrfs: introduce helpers for subpage ordered statusQu Wenruo1-0/+29
This patch introduces the following functions to handle btrfs subpage ordered (Private2) status: - btrfs_subpage_set_ordered() - btrfs_subpage_clear_ordered() - btrfs_subpage_test_ordered() These helpers can only be called when the range is ensured to be inside the page. - btrfs_page_set_ordered() - btrfs_page_clear_ordered() - btrfs_page_test_ordered() These helpers can handle both regular sector size and subpage without problem. These functions are here to coordinate btrfs_invalidatepage() with btrfs_writepage_endio_finish_ordered(), to make sure only one of those functions can finish the ordered extent. Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64] Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64] Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21btrfs: make process_one_page() to handle subpage lockingQu Wenruo1-12/+77
Introduce a new data inodes specific subpage member, writers, to record how many sectors are under page lock for delalloc writing. This member acts pretty much the same as readers, except it's only for delalloc writes. This is important for delalloc code to trace which page can really be freed, as we have cases like run_delalloc_nocow() where we may exit processing nocow range inside a page, but need to exit to do cow half way. In that case, we need a way to determine if we can really unlock a full page. With the new btrfs_subpage::writers, there is a new requirement: - Page locked by process_one_page() must be unlocked by process_one_page() There are still tons of call sites manually lock and unlock a page, without updating btrfs_subpage::writers. So if we lock a page through process_one_page() then it must be unlocked by process_one_page() to keep btrfs_subpage::writers consistent. This will be handled in next patch. Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64] Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64] Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21btrfs: provide btrfs_page_clamp_*() helpersQu Wenruo1-0/+38
In the coming subpage RW supports, there are a lot of page status update calls which need to be converted to subpage compatible version, which needs @start and @len. Some call sites already have such @start/@len and are already in page range, like various endio functions. But there are also call sites which need to clamp the range for subpage case, like btrfs_dirty_pagse() and __process_contig_pages(). Here we introduce new helpers, btrfs_page_clamp_*(), to do and only do the clamp for subpage version. Although in theory all existing btrfs_page_*() calls can be converted to use btrfs_page_clamp_*() directly, but that would make us to do unnecessary clamp operations. Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64] Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64] Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19btrfs: subpage: add overview commentsQu Wenruo1-0/+58
This patch adds an overview how btrfs subpage support works: - limitations - behavior - basic implementation points Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19btrfs: subpage: introduce helpers for writeback statusQu Wenruo1-0/+30
Introduces the following functions to handle subpage writeback status: - btrfs_subpage_set_writeback() - btrfs_subpage_clear_writeback() - btrfs_subpage_test_writeback() These helpers can only be called when the range is ensured to be inside the page. - btrfs_page_set_writeback() - btrfs_page_clear_writeback() - btrfs_page_test_writeback() These helpers can handle both regular sector size and subpage without problem. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19btrfs: subpage: introduce helpers for dirty statusQu Wenruo1-0/+52
Introduce the following functions to handle subpage dirty status: - btrfs_subpage_set_dirty() - btrfs_subpage_clear_dirty() - btrfs_subpage_test_dirty() These helpers can only be called when the range is ensured to be inside the page. - btrfs_page_set_dirty() - btrfs_page_clear_dirty() - btrfs_page_test_dirty() These helpers can handle both regular sector size and subpage without problem. Thus they would be used to replace PageDirty() related calls in later patches. There is one special point to note here, just like set_page_dirty() and clear_page_dirty_for_io(), btrfs_*page_set_dirty() and btrfs_*page_clear_dirty() must be called with page locked. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08btrfs: integrate page status update for data read path into begin/end_page_readQu Wenruo1-10/+43
In btrfs data page read path, the page status update are handled in two different locations: btrfs_do_read_page() { while (cur <= end) { /* No need to read from disk */ if (HOLE/PREALLOC/INLINE){ memset(); set_extent_uptodate(); continue; } /* Read from disk */ ret = submit_extent_page(end_bio_extent_readpage); } end_bio_extent_readpage() { endio_readpage_uptodate_page_status(); } This is fine for sectorsize == PAGE_SIZE case, as for above loop we should only hit one branch and then exit. But for subpage, there is more work to be done in page status update: - Page Unlock condition Unlike regular page size == sectorsize case, we can no longer just unlock a page. Only the last reader of the page can unlock the page. This means, we can unlock the page either in the while() loop, or in the endio function. - Page uptodate condition Since we have multiple sectors to read for a page, we can only mark the full page uptodate if all sectors are uptodate. To handle both subpage and regular cases, introduce a pair of functions to help handling page status update: - begin_page_read() For regular case, it does nothing. For subpage case, it updates the reader counters so that later end_page_read() can know who is the last one to unlock the page. - end_page_read() This is just endio_readpage_uptodate_page_status() renamed. The original name is a little too long and too specific for endio. The new thing added is the condition for page unlock. Now for subpage data, we unlock the page if we're the last reader. This does not only provide the basis for subpage data read, but also hide the special handling of page read from the main read loop. Also, since we're changing how the page lock is handled, there are two existing error paths where we need to manually unlock the page before calling begin_page_read(). Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08btrfs: introduce helpers for subpage error statusQu Wenruo1-0/+29
Introduce the following functions to handle subpage error status: - btrfs_subpage_set_error() - btrfs_subpage_clear_error() - btrfs_subpage_test_error() These helpers can only be called when the page has subpage attached and the range is ensured to be inside the page. - btrfs_page_set_error() - btrfs_page_clear_error() - btrfs_page_test_error() These helpers can handle both regular sector size and subpage without problem. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08btrfs: introduce helpers for subpage uptodate statusQu Wenruo1-0/+113
Introduce the following functions to handle subpage uptodate status: - btrfs_subpage_set_uptodate() - btrfs_subpage_clear_uptodate() - btrfs_subpage_test_uptodate() These helpers can only be called when the page has subpage attached and the range is ensured to be inside the page. - btrfs_page_set_uptodate() - btrfs_page_clear_uptodate() - btrfs_page_test_uptodate() These helpers can handle both regular sector size and subpage. Although caller should still ensure that the range is inside the page. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08btrfs: support subpage for extent buffer page releaseQu Wenruo1-0/+42
In btrfs_release_extent_buffer_pages(), we need to add extra handling for subpage. Introduce a helper, detach_extent_buffer_page(), to do different handling for regular and subpage cases. For subpage case, handle detaching page private. For unmapped (dummy or cloned) ebs, we can detach the page private immediately as the page can only be attached to one unmapped eb. For mapped ebs, we have to ensure there are no eb in the page range before we delete it, as page->private is shared between all ebs in the same page. But there is a subpage specific race, where we can race with extent buffer allocation, and clear the page private while new eb is still being utilized, like this: Extent buffer A is the new extent buffer which will be allocated, while extent buffer B is the last existing extent buffer of the page. T1 (eb A) | T2 (eb B) -------------------------------+------------------------------ alloc_extent_buffer() | btrfs_release_extent_buffer_pages() |- p = find_or_create_page() | | |- attach_extent_buffer_page() | | | | |- detach_extent_buffer_page() | | |- if (!page_range_has_eb()) | | | No new eb in the page range yet | | | As new eb A hasn't yet been | | | inserted into radix tree. | | |- btrfs_detach_subpage() | | |- detach_page_private(); |- radix_tree_insert() | Then we have a metadata eb whose page has no private bit. To avoid such race, we introduce a subpage metadata-specific member, btrfs_subpage::eb_refs. In alloc_extent_buffer() we increase eb_refs in the critical section of private_lock. Then page_range_has_eb() will return true for detach_extent_buffer_page(), and will not detach page private. The section is marked by: - btrfs_page_inc_eb_refs() - btrfs_page_dec_eb_refs() Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08btrfs: make attach_extent_buffer_page() handle subpage caseQu Wenruo1-6/+24
For subpage case, we need to allocate additional memory for each metadata page. So we need to: - Allow attach_extent_buffer_page() to return int to indicate allocation failure - Allow manually pre-allocate subpage memory for alloc_extent_buffer() As we don't want to use GFP_ATOMIC under spinlock, we introduce btrfs_alloc_subpage() and btrfs_free_subpage() functions for this purpose. (The simple wrap for btrfs_free_subpage() is for later convert to kmem_cache. Already internally tested without problem) - Preallocate btrfs_subpage structure for alloc_extent_buffer() We don't want to call memory allocation with spinlock held, so do preallocation before we acquire mapping->private_lock. - Handle subpage and regular case differently in attach_extent_buffer_page() For regular case, no change, just do the usual thing. For subpage case, allocate new memory or use the preallocated memory. For future subpage metadata, we will make use of radix tree to grab extent buffer. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08btrfs: introduce the skeleton of btrfs_subpage structureQu Wenruo1-0/+43
For sectorsize < page size support, we need a structure to record extra status info for each sector of a page. Introduce the skeleton structure, all subpage related code would go to subpage.[ch]. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>