aboutsummaryrefslogtreecommitdiffstatshomepage
path: root/mm/internal.h (follow)
AgeCommit message (Collapse)AuthorFilesLines
2022-04-01mm/munlock: protect the per-CPU pagevec by a local_lock_tSebastian Andrzej Siewior1-2/+4
The access to mlock_pvec is protected by disabling preemption via get_cpu_var() or implicit by having preemption disabled by the caller (in mlock_page_drain() case). This breaks on PREEMPT_RT since folio_lruvec_lock_irq() acquires a sleeping lock in this section. Create struct mlock_pvec which consits of the local_lock_t and the pagevec. Acquire the local_lock() before accessing the per-CPU pagevec. Replace mlock_page_drain() with a _local() version which is invoked on the local CPU and acquires the local_lock_t and a _remote() version which uses the pagevec from a remote CPU which offline. Link: https://lkml.kernel.org/r/YjizWi9IY0mpvIfb@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Hugh Dickins <hughd@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-22Merge tag 'folio-5.18c' of git://git.infradead.org/users/willy/pagecacheLinus Torvalds1-38/+79
Pull folio updates from Matthew Wilcox: - Rewrite how munlock works to massively reduce the contention on i_mmap_rwsem (Hugh Dickins): https://lore.kernel.org/linux-mm/8e4356d-9622-a7f0-b2c-f116b5f2efea@google.com/ - Sort out the page refcount mess for ZONE_DEVICE pages (Christoph Hellwig): https://lore.kernel.org/linux-mm/20220210072828.2930359-1-hch@lst.de/ - Convert GUP to use folios and make pincount available for order-1 pages. (Matthew Wilcox) - Convert a few more truncation functions to use folios (Matthew Wilcox) - Convert page_vma_mapped_walk to use PFNs instead of pages (Matthew Wilcox) - Convert rmap_walk to use folios (Matthew Wilcox) - Convert most of shrink_page_list() to use a folio (Matthew Wilcox) - Add support for creating large folios in readahead (Matthew Wilcox) * tag 'folio-5.18c' of git://git.infradead.org/users/willy/pagecache: (114 commits) mm/damon: minor cleanup for damon_pa_young selftests/vm/transhuge-stress: Support file-backed PMD folios mm/filemap: Support VM_HUGEPAGE for file mappings mm/readahead: Switch to page_cache_ra_order mm/readahead: Align file mappings for non-DAX mm/readahead: Add large folio readahead mm: Support arbitrary THP sizes mm: Make large folios depend on THP mm: Fix READ_ONLY_THP warning mm/filemap: Allow large folios to be added to the page cache mm: Turn can_split_huge_page() into can_split_folio() mm/vmscan: Convert pageout() to take a folio mm/vmscan: Turn page_check_references() into folio_check_references() mm/vmscan: Account large folios correctly mm/vmscan: Optimise shrink_page_list for non-PMD-sized folios mm/vmscan: Free non-shmem folios without splitting them mm/rmap: Constify the rmap_walk_control argument mm/rmap: Convert rmap_walk() to take a folio mm: Turn page_anon_vma() into folio_anon_vma() mm/rmap: Turn page_lock_anon_vma_read() into folio_lock_anon_vma_read() ...
2022-03-22mm/early_ioremap: declare early_memremap_pgprot_adjust()Vlastimil Babka1-0/+6
The mm/ directory can almost fully be built with W=1, which would help in local development. One remaining issue is missing prototype for early_memremap_pgprot_adjust(). Thus add a declaration for this function. Use mm/internal.h instead of asm/early_ioremap.h to avoid missing type definitions and unnecessary exposure. Link: https://lkml.kernel.org/r/20220314165724.16071-2-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-22mm: handle uninitialized numa nodes gracefullyMichal Hocko1-0/+2
We have had several reports [1][2][3] that page allocator blows up when an allocation from a possible node is requested. The underlying reason is that NODE_DATA for the specific node is not allocated. NUMA specific initialization is arch specific and it can vary a lot. E.g. x86 tries to initialize all nodes that have some cpu affinity (see init_cpu_to_node) but this can be insufficient because the node might be cpuless for example. One way to address this problem would be to check for !node_online nodes when trying to get a zonelist and silently fall back to another node. That is unfortunately adding a branch into allocator hot path and it doesn't handle any other potential NODE_DATA users. This patch takes a different approach (following a lead of [3]) and it pre allocates pgdat for all possible nodes in an arch indipendent code - free_area_init. All uninitialized nodes are treated as memoryless nodes. node_state of the node is not changed because that would lead to other side effects - e.g. sysfs representation of such a node and from past discussions [4] it is known that some tools might have problems digesting that. Newly allocated pgdat only gets a minimal initialization and the rest of the work is expected to be done by the memory hotplug - hotadd_new_pgdat (renamed to hotadd_init_pgdat). generic_alloc_nodedata is changed to use the memblock allocator because neither page nor slab allocators are available at the stage when all pgdats are allocated. Hotplug doesn't allocate pgdat anymore so we can use the early boot allocator. The only arch specific implementation is ia64 and that is changed to use the early allocator as well. [1] http://lkml.kernel.org/r/20211101201312.11589-1-amakhalov@vmware.com [2] http://lkml.kernel.org/r/20211207224013.880775-1-npache@redhat.com [3] http://lkml.kernel.org/r/20190114082416.30939-1-mhocko@kernel.org [4] http://lkml.kernel.org/r/20200428093836.27190-1-srikar@linux.vnet.ibm.com [akpm@linux-foundation.org: replace comment, per Mike] Link: https://lkml.kernel.org/r/Yfe7RBeLCijnWBON@dhcp22.suse.cz Reported-by: Alexey Makhalov <amakhalov@vmware.com> Tested-by: Alexey Makhalov <amakhalov@vmware.com> Reported-by: Nico Pache <npache@redhat.com> Acked-by: Rafael Aquini <raquini@redhat.com> Tested-by: Rafael Aquini <raquini@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-22mm/sparse: make mminit_validate_memmodel_limits() staticMiaohe Lin1-11/+0
It's only used in the sparse.c now. So we can make it static and further clean up the relevant code. Link: https://lkml.kernel.org/r/20220127093221.63524-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-21mm/readahead: Switch to page_cache_ra_orderMatthew Wilcox (Oracle)1-2/+2
do_page_cache_ra() was being exposed for the benefit of do_sync_mmap_readahead(). Switch it over to page_cache_ra_order() partly because it's a better interface but mostly for the benefit of the next patch. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-03-21mm: Turn page_anon_vma() into folio_anon_vma()Matthew Wilcox (Oracle)1-0/+1
Move the prototype from mm.h to mm/internal.h and convert all callers to pass a folio. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-03-21mm/mlock: Add mlock_vma_folio()Matthew Wilcox (Oracle)1-4/+11
Convert mlock_page() into mlock_folio() and convert the callers. Keep mlock_vma_page() as a wrapper. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-03-21mm: Convert page_vma_mapped_walk to work on PFNsMatthew Wilcox (Oracle)1-6/+9
page_mapped_in_vma() really just wants to walk one page, but as the code stands, if passed the head page of a compound page, it will walk every page in the compound page. Extract pfn/nr_pages/pgoff from the struct page early, so they can be overridden by page_mapped_in_vma(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-03-21mm/truncate: Combine invalidate_mapping_pagevec() and __invalidate_mapping_pages()Matthew Wilcox (Oracle)1-0/+2
We can save a function call by combining these two functions, which are identical except for the return value. Also move the prototype to mm/internal.h. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
2022-03-21mm: Turn deactivate_file_page() into deactivate_file_folio()Matthew Wilcox (Oracle)1-0/+1
This function has one caller which already has a reference to the page, so we don't need to use get_page_unless_zero(). Also move the prototype to mm/internal.h. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
2022-03-21mm/truncate: Split invalidate_inode_page() into mapping_evict_folio()Matthew Wilcox (Oracle)1-0/+1
Some of the callers already have the address_space and can avoid calling folio_mapping() and checking if the folio was already truncated. Also add kernel-doc and fix the return type (in case we ever support folios larger than 4TB). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
2022-03-21mm: Turn putback_lru_page() into folio_putback_lru()Matthew Wilcox (Oracle)1-1/+2
Add a putback_lru_page() wrapper. Removes a couple of compound_head() calls. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-03-21mm: Turn isolate_lru_page() into folio_isolate_lru()Matthew Wilcox (Oracle)1-1/+2
Add isolate_lru_page() as a wrapper around isolate_lru_folio(). TestClearPageLRU() would have always failed on a tail page, so returning -EBUSY is the same behaviour. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-03-21mm/gup: Add try_get_folio() and try_grab_folio()Matthew Wilcox (Oracle)1-0/+5
Convert try_get_compound_head() into try_get_folio() and convert try_grab_compound_head() into try_grab_folio(). Add a temporary try_grab_compound_head() wrapper around try_grab_folio() to let us convert callers individually. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-03-03mm: remove the extra ZONE_DEVICE struct page refcountChristoph Hellwig1-0/+2
ZONE_DEVICE struct pages have an extra reference count that complicates the code for put_page() and several places in the kernel that need to check the reference count to see that a page is not being used (gup, compaction, migration, etc.). Clean up the code so the reference count doesn't need to be treated specially for ZONE_DEVICE pages. Note that this excludes the special idle page wakeup for fsdax pages, which still happens at refcount 1. This is a separate issue and will be sorted out later. Given that only fsdax pages require the notifiacation when the refcount hits 1 now, the PAGEMAP_OPS Kconfig symbol can go away and be replaced with a FS_DAX check for this hook in the put_page fastpath. Based on an earlier patch from Ralph Campbell <rcampbell@nvidia.com>. Link: https://lkml.kernel.org/r/20220210072828.2930359-8-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Chaitanya Kulkarni <kch@nvidia.com> Cc: Christian Knig <christian.koenig@amd.com> Cc: Karol Herbst <kherbst@redhat.com> Cc: Lyude Paul <lyude@redhat.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-03-03mm/munlock: mlock_vma_page() check against VM_SPECIALHugh Dickins1-2/+9
Although mmap_region() and mlock_fixup() take care that VM_LOCKED is never left set on a VM_SPECIAL vma, there is an interval while file->f_op->mmap() is using vm_insert_page(s), when VM_LOCKED may still be set while VM_SPECIAL bits are added: so mlock_vma_page() should ignore VM_LOCKED while any VM_SPECIAL bits are set. This showed up as a "Bad page" still mlocked, when vfree()ing pages which had been vm_inserted by remap_vmalloc_range_partial(): while release_pages() and __page_cache_release(), and so put_page(), catch pages still mlocked when freeing (and clear_page_mlock() caught them when unmapping), the vfree() path is unprepared for them: fix it? but these pages should not have been mlocked in the first place. I assume that an mlockall(MCL_FUTURE) had been done in the past; or maybe the user got to specify MAP_LOCKED on a vmalloc'ing driver mmap. Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-02-17mm/munlock: mlock_page() munlock_page() batch by pagevecHugh Dickins1-1/+8
A weakness of the page->mlock_count approach is the need for lruvec lock while holding page table lock. That is not an overhead we would allow on normal pages, but I think acceptable just for pages in an mlocked area. But let's try to amortize the extra cost by gathering on per-cpu pagevec before acquiring the lruvec lock. I have an unverified conjecture that the mlock pagevec might work out well for delaying the mlock processing of new file pages until they have got off lru_cache_add()'s pagevec and on to LRU. The initialization of page->mlock_count is subject to races and awkward: 0 or !!PageMlocked or 1? Was it wrong even in the implementation before this commit, which just widens the window? I haven't gone back to think it through. Maybe someone can point out a better way to initialize it. Bringing lru_cache_add_inactive_or_unevictable()'s mlock initialization into mm/mlock.c has helped: mlock_new_page(), using the mlock pagevec, rather than lru_cache_add()'s pagevec. Experimented with various orderings: the right thing seems to be for mlock_page() and mlock_new_page() to TestSetPageMlocked before adding to pagevec, but munlock_page() to leave TestClearPageMlocked to the later pagevec processing. Dropped the VM_BUG_ON_PAGE(PageTail)s this time around: they have made their point, and the thp_nr_page()s already contain a VM_BUG_ON_PGFLAGS() for that. This still leaves acquiring lruvec locks under page table lock each time the pagevec fills (or a THP is added): which I suppose is rather silly, since they sit on pagevec waiting to be processed long after page table lock has been dropped; but I'm disinclined to uglify the calling sequence until some load shows an actual problem with it (nothing wrong with taking lruvec lock under page table lock, just "nicer" to do it less). Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-02-17mm/munlock: mlock_pte_range() when mlocking or munlockingHugh Dickins1-1/+2
Fill in missing pieces: reimplementation of munlock_vma_pages_range(), required to lower the mlock_counts when munlocking without munmapping; and its complement, implementation of mlock_vma_pages_range(), required to raise the mlock_counts on pages already there when a range is mlocked. Combine them into just the one function mlock_vma_pages_range(), using walk_page_range() to run mlock_pte_range(). This approach fixes the "Very slow unlockall()" of unpopulated PROT_NONE areas, reported in https://lore.kernel.org/linux-mm/70885d37-62b7-748b-29df-9e94f3291736@gmail.com/ Munlock clears VM_LOCKED at the start, under exclusive mmap_lock; but if a racing truncate or holepunch (depending on i_mmap_rwsem) gets to the pte first, it will not try to munlock the page: leaving release_pages() to correct it when the last reference to the page is gone - that's okay, a page is not evictable anyway while it is held by an extra reference. Mlock sets VM_LOCKED at the start, under exclusive mmap_lock; but if a racing remove_migration_pte() or try_to_unmap_one() (depending on i_mmap_rwsem) gets to the pte first, it will try to mlock the page, then mlock_pte_range() mlock it a second time. This is harder to reproduce, but a more serious race because it could leave the page unevictable indefinitely though the area is munlocked afterwards. Guard against it by setting the (inappropriate) VM_IO flag, and modifying mlock_vma_page() to decline such vmas. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-02-17mm/munlock: replace clear_page_mlock() by final clearanceHugh Dickins1-12/+0
Placing munlock_vma_page() at the end of page_remove_rmap() shifts most of the munlocking to clear_page_mlock(), since PageMlocked is typically still set when mapcount has fallen to 0. That is not what we want: we want /proc/vmstat's unevictable_pgs_cleared to remain as a useful check on the integrity of of the mlock/munlock protocol - small numbers are not surprising, but big numbers mean the protocol is not working. That could be easily fixed by placing munlock_vma_page() at the start of page_remove_rmap(); but later in the series we shall want to batch the munlocking, and that too would tend to leave PageMlocked still set at the point when it is checked. So delete clear_page_mlock() now: leave it instead to release_pages() (and __page_cache_release()) to do this backstop clearing of Mlocked, when page refcount has fallen to 0. If a pinned page occasionally gets counted as Mlocked and Unevictable until it is unpinned, that's okay. A slightly regrettable side-effect of this change is that, since release_pages() and __page_cache_release() may be called at interrupt time, those places which update NR_MLOCK with interrupts enabled had better use mod_zone_page_state() than __mod_zone_page_state() (but holding the lruvec lock always has interrupts disabled). This change, forcing Mlocked off when refcount 0 instead of earlier when mapcount 0, is not fundamental: it can be reversed if performance or something else is found to suffer; but this is the easiest way to separate the stats - let's not complicate that without good reason. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-02-17mm/munlock: rmap call mlock_vma_page() munlock_vma_page()Hugh Dickins1-5/+31
Add vma argument to mlock_vma_page() and munlock_vma_page(), make them inline functions which check (vma->vm_flags & VM_LOCKED) before calling mlock_page() and munlock_page() in mm/mlock.c. Add bool compound to mlock_vma_page() and munlock_vma_page(): this is because we have understandable difficulty in accounting pte maps of THPs, and if passed a PageHead page, mlock_page() and munlock_page() cannot tell whether it's a pmd map to be counted or a pte map to be ignored. Add vma arg to page_add_file_rmap() and page_remove_rmap(), like the others, and use that to call mlock_vma_page() at the end of the page adds, and munlock_vma_page() at the end of page_remove_rmap() (end or beginning? unimportant, but end was easier for assertions in testing). No page lock is required (although almost all adds happen to hold it): delete the "Serialize with page migration" BUG_ON(!PageLocked(page))s. Certainly page lock did serialize with page migration, but I'm having difficulty explaining why that was ever important. Mlock accounting on THPs has been hard to define, differed between anon and file, involved PageDoubleMap in some places and not others, required clear_page_mlock() at some points. Keep it simple now: just count the pmds and ignore the ptes, there is no reason for ptes to undo pmd mlocks. page_add_new_anon_rmap() callers unchanged: they have long been calling lru_cache_add_inactive_or_unevictable(), which does its own VM_LOCKED handling (it also checks for not VM_SPECIAL: I think that's overcautious, and inconsistent with other checks, that mmap_region() already prevents VM_LOCKED on VM_SPECIAL; but haven't quite convinced myself to change it). Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-02-17mm/munlock: delete munlock_vma_pages_all(), allow oomreapHugh Dickins1-14/+2
munlock_vma_pages_range() will still be required, when munlocking but not munmapping a set of pages; but when unmapping a pte, the mlock count will be maintained in much the same way as it will be maintained when mapping in the pte. Which removes the need for munlock_vma_pages_all() on mlocked vmas when munmapping or exiting: eliminating the catastrophic contention on i_mmap_rwsem, and the need for page lock on the pages. There is still a need to update locked_vm accounting according to the munmapped vmas when munmapping: do that in detach_vmas_to_be_unmapped(). exit_mmap() does not need locked_vm updates, so delete unlock_range(). And wasn't I the one who forbade the OOM reaper to attack mlocked vmas, because of the uncertainty in blocking on all those page locks? No fear of that now, so permit the OOM reaper on mlocked vmas. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-02-17mm/munlock: delete page_mlock() and all its worksHugh Dickins1-1/+1
We have recommended some applications to mlock their userspace, but that turns out to be counter-productive: when many processes mlock the same file, contention on rmap's i_mmap_rwsem can become intolerable at exit: it is needed for write, to remove any vma mapping that file from rmap's tree; but hogged for read by those with mlocks calling page_mlock() (formerly known as try_to_munlock()) on *each* page mapped from the file (the purpose being to find out whether another process has the page mlocked, so therefore it should not be unmlocked yet). Several optimizations have been made in the past: one is to skip page_mlock() when mapcount tells that nothing else has this page mapped; but that doesn't help at all when others do have it mapped. This time around, I initially intended to add a preliminary search of the rmap tree for overlapping VM_LOCKED ranges; but that gets messy with locking order, when in doubt whether a page is actually present; and risks adding even more contention on the i_mmap_rwsem. A solution would be much easier, if only there were space in struct page for an mlock_count... but actually, most of the time, there is space for it - an mlocked page spends most of its life on an unevictable LRU, but since 3.18 removed the scan_unevictable_pages sysctl, that "LRU" has been redundant. Let's try to reuse its page->lru. But leave that until a later patch: in this patch, clear the ground by removing page_mlock(), and all the infrastructure that has gathered around it - which mostly hinders understanding, and will make reviewing new additions harder. Don't mind those old comments about THPs, they date from before 4.5's refcounting rework: splitting is not a risk here. Just keep a minimal version of munlock_vma_page(), as reminder of what it should attend to (in particular, the odd way PGSTRANDED is counted out of PGMUNLOCKED), and likewise a stub for munlock_vma_pages_range(). Move unchanged __mlock_posix_error_return() out of the way, down to above its caller: this series then makes no further change after mlock_fixup(). After this and each following commit, the kernel builds, boots and runs; but with deficiencies which may show up in testing of mlock and munlock. The system calls succeed or fail as before, and mlock remains effective in preventing page reclaim; but meminfo's Unevictable and Mlocked amounts may be shown too low after mlock, grow, then stay too high after munlock: with previously mlocked pages remaining unevictable for too long, until finally unmapped and freed and counts corrected. Normal service will be resumed in "mm/munlock: mlock_pte_range() when mlocking or munlocking". Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-01-15Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-6/+1
Merge misc updates from Andrew Morton: "146 patches. Subsystems affected by this patch series: kthread, ia64, scripts, ntfs, squashfs, ocfs2, vfs, and mm (slab-generic, slab, kmemleak, dax, kasan, debug, pagecache, gup, shmem, frontswap, memremap, memcg, selftests, pagemap, dma, vmalloc, memory-failure, hugetlb, userfaultfd, vmscan, mempolicy, oom-kill, hugetlbfs, migration, thp, ksm, page-poison, percpu, rmap, zswap, zram, cleanups, hmm, and damon)" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (146 commits) mm/damon: hide kernel pointer from tracepoint event mm/damon/vaddr: hide kernel pointer from damon_va_three_regions() failure log mm/damon/vaddr: use pr_debug() for damon_va_three_regions() failure logging mm/damon/dbgfs: remove an unnecessary variable mm/damon: move the implementation of damon_insert_region to damon.h mm/damon: add access checking for hugetlb pages Docs/admin-guide/mm/damon/usage: update for schemes statistics mm/damon/dbgfs: support all DAMOS stats Docs/admin-guide/mm/damon/reclaim: document statistics parameters mm/damon/reclaim: provide reclamation statistics mm/damon/schemes: account how many times quota limit has exceeded mm/damon/schemes: account scheme actions that successfully applied mm/damon: remove a mistakenly added comment for a future feature Docs/admin-guide/mm/damon/usage: update for kdamond_pid and (mk|rm)_contexts Docs/admin-guide/mm/damon/usage: mention tracepoint at the beginning Docs/admin-guide/mm/damon/usage: remove redundant information Docs/admin-guide/mm/damon/usage: update for scheme quotas and watermarks mm/damon: convert macro functions to static inline functions mm/damon: modify damon_rand() macro to static inline function mm/damon: move damon_rand() definition into damon.h ...
2022-01-15mm: make slab and vmalloc allocators __GFP_NOLOCKDEP awareMichal Hocko1-1/+1
sl?b and vmalloc allocators reduce the given gfp mask for their internal needs. For that they use GFP_RECLAIM_MASK to preserve the reclaim behavior and constrains. __GFP_NOLOCKDEP is not a part of that mask because it doesn't really control the reclaim behavior strictly speaking. On the other hand it tells the underlying page allocator to disable reclaim recursion detection so arguably it should be part of the mask. Having __GFP_NOLOCKDEP in the mask will not alter the behavior in any form so this change is safe pretty much by definition. It also adds a support for this flag to SL?B and vmalloc allocators which will in turn allow its use to kvmalloc as well. A lack of the support has been noticed recently in http://lkml.kernel.org/r/20211119225435.GZ449541@dread.disaster.area Link: https://lkml.kernel.org/r/YZ9XtLY4AEjVuiEI@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Dave Chinner <dchinner@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: Neil Brown <neilb@suse.de> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15mm: memcontrol: make cgroup_memory_nokmem staticMuchun Song1-5/+0
Commit 494c1dfe855e ("mm: memcg/slab: create a new set of kmalloc-cg-<n> caches") makes cgroup_memory_nokmem global, however, it is unnecessary because there is already a function mem_cgroup_kmem_disabled() which exports it. Just make it static and replace it with mem_cgroup_kmem_disabled() in mm/slab_common.c. Link: https://lkml.kernel.org/r/20211109065418.21693-1-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Chris Down <chris@chrisdown.name> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-08truncate,shmem: Handle truncates that split large foliosMatthew Wilcox (Oracle)1-0/+2
Handle folio splitting in the parts of the truncation functions which already handle partial pages. Factor all that code out into a new function called truncate_inode_partial_folio(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-01-08mm: Convert find_lock_entries() to use a folio_batchMatthew Wilcox (Oracle)1-1/+1
find_lock_entries() already only returned the head page of folios, so convert it to return a folio_batch instead of a pagevec. That cascades through converting truncate_inode_pages_range() to delete_from_page_cache_batch() and page_cache_delete_batch(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-01-08filemap: Return only folios from find_get_entries()Matthew Wilcox (Oracle)1-0/+4
The callers have all been converted to work on folios, so convert find_get_entries() to return a batch of folios instead of pages. We also now return multiple large folios in a single call. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-01-08truncate: Add invalidate_complete_folio2()Matthew Wilcox (Oracle)1-0/+1
Convert invalidate_complete_page2() to invalidate_complete_folio2(). Use filemap_free_folio() to free the page instead of calling ->freepage manually. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-01-08truncate,shmem: Add truncate_inode_folio()Matthew Wilcox (Oracle)1-0/+1
Convert all callers of truncate_inode_page() to call truncate_inode_folio() instead, and move the declaration to mm/internal.h. Move the assertion that the caller is not passing in a tail page to generic_error_remove_page(). We can't entirely remove the struct page from the callers yet because the page pointer in the pvec might be a shadow/dax/swap entry instead of actually a page. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-01-08mm: Add unmap_mapping_folio()Matthew Wilcox (Oracle)1-1/+3
Convert both callers of unmap_mapping_page() to call unmap_mapping_folio() instead. Also move zap_details from linux/mm.h to mm/memory.c Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2021-11-06Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-0/+22
Merge misc updates from Andrew Morton: "257 patches. Subsystems affected by this patch series: scripts, ocfs2, vfs, and mm (slab-generic, slab, slub, kconfig, dax, kasan, debug, pagecache, gup, swap, memcg, pagemap, mprotect, mremap, iomap, tracing, vmalloc, pagealloc, memory-failure, hugetlb, userfaultfd, vmscan, tools, memblock, oom-kill, hugetlbfs, migration, thp, readahead, nommu, ksm, vmstat, madvise, memory-hotplug, rmap, zsmalloc, highmem, zram, cleanups, kfence, and damon)" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (257 commits) mm/damon: remove return value from before_terminate callback mm/damon: fix a few spelling mistakes in comments and a pr_debug message mm/damon: simplify stop mechanism Docs/admin-guide/mm/pagemap: wordsmith page flags descriptions Docs/admin-guide/mm/damon/start: simplify the content Docs/admin-guide/mm/damon/start: fix a wrong link Docs/admin-guide/mm/damon/start: fix wrong example commands mm/damon/dbgfs: add adaptive_targets list check before enable monitor_on mm/damon: remove unnecessary variable initialization Documentation/admin-guide/mm/damon: add a document for DAMON_RECLAIM mm/damon: introduce DAMON-based Reclamation (DAMON_RECLAIM) selftests/damon: support watermarks mm/damon/dbgfs: support watermarks mm/damon/schemes: activate schemes based on a watermarks mechanism tools/selftests/damon: update for regions prioritization of schemes mm/damon/dbgfs: support prioritization weights mm/damon/vaddr,paddr: support pageout prioritization mm/damon/schemes: prioritize regions within the quotas mm/damon/selftests: support schemes quotas mm/damon/dbgfs: support quotas of schemes ...
2021-11-06mm/vmscan: centralise timeout values for reclaim_throttleMel Gorman1-2/+1
Neil Brown raised concerns about callers of reclaim_throttle specifying a timeout value. The original timeout values to congestion_wait() were probably pulled out of thin air or copy&pasted from somewhere else. This patch centralises the timeout values and selects a timeout based on the reason for reclaim throttling. These figures are also pulled out of the same thin air but better values may be derived Running a workload that is throttling for inappropriate periods and tracing mm_vmscan_throttled can be used to pick a more appropriate value. Excessive throttling would pick a lower timeout where as excessive CPU usage in reclaim context would select a larger timeout. Ideally a large value would always be used and the wakeups would occur before a timeout but that requires careful testing. Link: https://lkml.kernel.org/r/20211022144651.19914-7-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: "Darrick J . Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: NeilBrown <neilb@suse.de> Cc: Rik van Riel <riel@surriel.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06mm/vmscan: throttle reclaim and compaction when too may pages are isolatedMel Gorman1-0/+11
Page reclaim throttles on congestion if too many parallel reclaim instances have isolated too many pages. This makes no sense, excessive parallelisation has nothing to do with writeback or congestion. This patch creates an additional workqueue to sleep on when too many pages are isolated. The throttled tasks are woken when the number of isolated pages is reduced or a timeout occurs. There may be some false positive wakeups for GFP_NOIO/GFP_NOFS callers but the tasks will throttle again if necessary. [shy828301@gmail.com: Wake up from compaction context] [vbabka@suse.cz: Account number of throttled tasks only for writeback] Link: https://lkml.kernel.org/r/20211022144651.19914-3-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: "Darrick J . Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: NeilBrown <neilb@suse.de> Cc: Rik van Riel <riel@surriel.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06mm/vmscan: throttle reclaim until some writeback completes if congestedMel Gorman1-0/+11
Patch series "Remove dependency on congestion_wait in mm/", v5. This series that removes all calls to congestion_wait in mm/ and deletes wait_iff_congested. It's not a clever implementation but congestion_wait has been broken for a long time [1]. Even if congestion throttling worked, it was never a great idea. While excessive dirty/writeback pages at the tail of the LRU is one possibility that reclaim may be slow, there is also the problem of too many pages being isolated and reclaim failing for other reasons (elevated references, too many pages isolated, excessive LRU contention etc). This series replaces the "congestion" throttling with 3 different types. - If there are too many dirty/writeback pages, sleep until a timeout or enough pages get cleaned - If too many pages are isolated, sleep until enough isolated pages are either reclaimed or put back on the LRU - If no progress is being made, direct reclaim tasks sleep until another task makes progress with acceptable efficiency. This was initially tested with a mix of workloads that used to trigger corner cases that no longer work. A new test case was created called "stutterp" (pagereclaim-stutterp-noreaders in mmtests) using a freshly created XFS filesystem. Note that it may be necessary to increase the timeout of ssh if executing remotely as ssh itself can get throttled and the connection may timeout. stutterp varies the number of "worker" processes from 4 up to NR_CPUS*4 to check the impact as the number of direct reclaimers increase. It has four types of worker. - One "anon latency" worker creates small mappings with mmap() and times how long it takes to fault the mapping reading it 4K at a time - X file writers which is fio randomly writing X files where the total size of the files add up to the allowed dirty_ratio. fio is allowed to run for a warmup period to allow some file-backed pages to accumulate. The duration of the warmup is based on the best-case linear write speed of the storage. - Y file readers which is fio randomly reading small files - Z anon memory hogs which continually map (100-dirty_ratio)% of memory - Total estimated WSS = (100+dirty_ration) percentage of memory X+Y+Z+1 == NR_WORKERS varying from 4 up to NR_CPUS*4 The intent is to maximise the total WSS with a mix of file and anon memory where some anonymous memory must be swapped and there is a high likelihood of dirty/writeback pages reaching the end of the LRU. The test can be configured to have no background readers to stress dirty/writeback pages. The results below are based on having zero readers. The short summary of the results is that the series works and stalls until some event occurs but the timeouts may need adjustment. The test results are not broken down by patch as the series should be treated as one block that replaces a broken throttling mechanism with a working one. Finally, three machines were tested but I'm reporting the worst set of results. The other two machines had much better latencies for example. First the results of the "anon latency" latency stutterp 5.15.0-rc1 5.15.0-rc1 vanilla mm-reclaimcongest-v5r4 Amean mmap-4 31.4003 ( 0.00%) 2661.0198 (-8374.52%) Amean mmap-7 38.1641 ( 0.00%) 149.2891 (-291.18%) Amean mmap-12 60.0981 ( 0.00%) 187.8105 (-212.51%) Amean mmap-21 161.2699 ( 0.00%) 213.9107 ( -32.64%) Amean mmap-30 174.5589 ( 0.00%) 377.7548 (-116.41%) Amean mmap-48 8106.8160 ( 0.00%) 1070.5616 ( 86.79%) Stddev mmap-4 41.3455 ( 0.00%) 27573.9676 (-66591.66%) Stddev mmap-7 53.5556 ( 0.00%) 4608.5860 (-8505.23%) Stddev mmap-12 171.3897 ( 0.00%) 5559.4542 (-3143.75%) Stddev mmap-21 1506.6752 ( 0.00%) 5746.2507 (-281.39%) Stddev mmap-30 557.5806 ( 0.00%) 7678.1624 (-1277.05%) Stddev mmap-48 61681.5718 ( 0.00%) 14507.2830 ( 76.48%) Max-90 mmap-4 31.4243 ( 0.00%) 83.1457 (-164.59%) Max-90 mmap-7 41.0410 ( 0.00%) 41.0720 ( -0.08%) Max-90 mmap-12 66.5255 ( 0.00%) 53.9073 ( 18.97%) Max-90 mmap-21 146.7479 ( 0.00%) 105.9540 ( 27.80%) Max-90 mmap-30 193.9513 ( 0.00%) 64.3067 ( 66.84%) Max-90 mmap-48 277.9137 ( 0.00%) 591.0594 (-112.68%) Max mmap-4 1913.8009 ( 0.00%) 299623.9695 (-15555.96%) Max mmap-7 2423.9665 ( 0.00%) 204453.1708 (-8334.65%) Max mmap-12 6845.6573 ( 0.00%) 221090.3366 (-3129.64%) Max mmap-21 56278.6508 ( 0.00%) 213877.3496 (-280.03%) Max mmap-30 19716.2990 ( 0.00%) 216287.6229 (-997.00%) Max mmap-48 477923.9400 ( 0.00%) 245414.8238 ( 48.65%) For most thread counts, the time to mmap() is unfortunately increased. In earlier versions of the series, this was lower but a large number of throttling events were reaching their timeout increasing the amount of inefficient scanning of the LRU. There is no prioritisation of reclaim tasks making progress based on each tasks rate of page allocation versus progress of reclaim. The variance is also impacted for high worker counts but in all cases, the differences in latency are not statistically significant due to very large maximum outliers. Max-90 shows that 90% of the stalls are comparable but the Max results show the massive outliers which are increased to to stalling. It is expected that this will be very machine dependant. Due to the test design, reclaim is difficult so allocations stall and there are variances depending on whether THPs can be allocated or not. The amount of memory will affect exactly how bad the corner cases are and how often they trigger. The warmup period calculation is not ideal as it's based on linear writes where as fio is randomly writing multiple files from multiple tasks so the start state of the test is variable. For example, these are the latencies on a single-socket machine that had more memory Amean mmap-4 42.2287 ( 0.00%) 49.6838 * -17.65%* Amean mmap-7 216.4326 ( 0.00%) 47.4451 * 78.08%* Amean mmap-12 2412.0588 ( 0.00%) 51.7497 ( 97.85%) Amean mmap-21 5546.2548 ( 0.00%) 51.8862 ( 99.06%) Amean mmap-30 1085.3121 ( 0.00%) 72.1004 ( 93.36%) The overall system CPU usage and elapsed time is as follows 5.15.0-rc3 5.15.0-rc3 vanilla mm-reclaimcongest-v5r4 Duration User 6989.03 983.42 Duration System 7308.12 799.68 Duration Elapsed 2277.67 2092.98 The patches reduce system CPU usage by 89% as the vanilla kernel is rarely stalling. The high-level /proc/vmstats show 5.15.0-rc1 5.15.0-rc1 vanilla mm-reclaimcongest-v5r2 Ops Direct pages scanned 1056608451.00 503594991.00 Ops Kswapd pages scanned 109795048.00 147289810.00 Ops Kswapd pages reclaimed 63269243.00 31036005.00 Ops Direct pages reclaimed 10803973.00 6328887.00 Ops Kswapd efficiency % 57.62 21.07 Ops Kswapd velocity 48204.98 57572.86 Ops Direct efficiency % 1.02 1.26 Ops Direct velocity 463898.83 196845.97 Kswapd scanned less pages but the detailed pattern is different. The vanilla kernel scans slowly over time where as the patches exhibits burst patterns of scan activity. Direct reclaim scanning is reduced by 52% due to stalling. The pattern for stealing pages is also slightly different. Both kernels exhibit spikes but the vanilla kernel when reclaiming shows pages being reclaimed over a period of time where as the patches tend to reclaim in spikes. The difference is that vanilla is not throttling and instead scanning constantly finding some pages over time where as the patched kernel throttles and reclaims in spikes. Ops Percentage direct scans 90.59 77.37 For direct reclaim, vanilla scanned 90.59% of pages where as with the patches, 77.37% were direct reclaim due to throttling Ops Page writes by reclaim 2613590.00 1687131.00 Page writes from reclaim context are reduced. Ops Page writes anon 2932752.00 1917048.00 And there is less swapping. Ops Page reclaim immediate 996248528.00 107664764.00 The number of pages encountered at the tail of the LRU tagged for immediate reclaim but still dirty/writeback is reduced by 89%. Ops Slabs scanned 164284.00 153608.00 Slab scan activity is similar. ftrace was used to gather stall activity Vanilla ------- 1 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=16000 2 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=12000 8 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=8000 29 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=4000 82394 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=0 The fast majority of wait_iff_congested calls do not stall at all. What is likely happening is that cond_resched() reschedules the task for a short period when the BDI is not registering congestion (which it never will in this test setup). 1 writeback_congestion_wait: usec_timeout=100000 usec_delayed=120000 2 writeback_congestion_wait: usec_timeout=100000 usec_delayed=132000 4 writeback_congestion_wait: usec_timeout=100000 usec_delayed=112000 380 writeback_congestion_wait: usec_timeout=100000 usec_delayed=108000 778 writeback_congestion_wait: usec_timeout=100000 usec_delayed=104000 congestion_wait if called always exceeds the timeout as there is no trigger to wake it up. Bottom line: Vanilla will throttle but it's not effective. Patch series ------------ Kswapd throttle activity was always due to scanning pages tagged for immediate reclaim at the tail of the LRU 1 usec_timeout=100000 usect_delayed=72000 reason=VMSCAN_THROTTLE_WRITEBACK 4 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK 5 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK 6 usec_timeout=100000 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK 11 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK 11 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK 94 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK 112 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK The majority of events did not stall or stalled for a short period. Roughly 16% of stalls reached the timeout before expiry. For direct reclaim, the number of times stalled for each reason were 6624 reason=VMSCAN_THROTTLE_ISOLATED 93246 reason=VMSCAN_THROTTLE_NOPROGRESS 96934 reason=VMSCAN_THROTTLE_WRITEBACK The most common reason to stall was due to excessive pages tagged for immediate reclaim at the tail of the LRU followed by a failure to make forward. A relatively small number were due to too many pages isolated from the LRU by parallel threads For VMSCAN_THROTTLE_ISOLATED, the breakdown of delays was 9 usec_timeout=20000 usect_delayed=4000 reason=VMSCAN_THROTTLE_ISOLATED 12 usec_timeout=20000 usect_delayed=16000 reason=VMSCAN_THROTTLE_ISOLATED 83 usec_timeout=20000 usect_delayed=20000 reason=VMSCAN_THROTTLE_ISOLATED 6520 usec_timeout=20000 usect_delayed=0 reason=VMSCAN_THROTTLE_ISOLATED Most did not stall at all. A small number reached the timeout. For VMSCAN_THROTTLE_NOPROGRESS, the breakdown of stalls were all over the map 1 usec_timeout=500000 usect_delayed=324000 reason=VMSCAN_THROTTLE_NOPROGRESS 1 usec_timeout=500000 usect_delayed=332000 reason=VMSCAN_THROTTLE_NOPROGRESS 1 usec_timeout=500000 usect_delayed=348000 reason=VMSCAN_THROTTLE_NOPROGRESS 1 usec_timeout=500000 usect_delayed=360000 reason=VMSCAN_THROTTLE_NOPROGRESS 2 usec_timeout=500000 usect_delayed=228000 reason=VMSCAN_THROTTLE_NOPROGRESS 2 usec_timeout=500000 usect_delayed=260000 reason=VMSCAN_THROTTLE_NOPROGRESS 2 usec_timeout=500000 usect_delayed=340000 reason=VMSCAN_THROTTLE_NOPROGRESS 2 usec_timeout=500000 usect_delayed=364000 reason=VMSCAN_THROTTLE_NOPROGRESS 2 usec_timeout=500000 usect_delayed=372000 reason=VMSCAN_THROTTLE_NOPROGRESS 2 usec_timeout=500000 usect_delayed=428000 reason=VMSCAN_THROTTLE_NOPROGRESS 2 usec_timeout=500000 usect_delayed=460000 reason=VMSCAN_THROTTLE_NOPROGRESS 2 usec_timeout=500000 usect_delayed=464000 reason=VMSCAN_THROTTLE_NOPROGRESS 3 usec_timeout=500000 usect_delayed=244000 reason=VMSCAN_THROTTLE_NOPROGRESS 3 usec_timeout=500000 usect_delayed=252000 reason=VMSCAN_THROTTLE_NOPROGRESS 3 usec_timeout=500000 usect_delayed=272000 reason=VMSCAN_THROTTLE_NOPROGRESS 4 usec_timeout=500000 usect_delayed=188000 reason=VMSCAN_THROTTLE_NOPROGRESS 4 usec_timeout=500000 usect_delayed=268000 reason=VMSCAN_THROTTLE_NOPROGRESS 4 usec_timeout=500000 usect_delayed=328000 reason=VMSCAN_THROTTLE_NOPROGRESS 4 usec_timeout=500000 usect_delayed=380000 reason=VMSCAN_THROTTLE_NOPROGRESS 4 usec_timeout=500000 usect_delayed=392000 reason=VMSCAN_THROTTLE_NOPROGRESS 4 usec_timeout=500000 usect_delayed=432000 reason=VMSCAN_THROTTLE_NOPROGRESS 5 usec_timeout=500000 usect_delayed=204000 reason=VMSCAN_THROTTLE_NOPROGRESS 5 usec_timeout=500000 usect_delayed=220000 reason=VMSCAN_THROTTLE_NOPROGRESS 5 usec_timeout=500000 usect_delayed=412000 reason=VMSCAN_THROTTLE_NOPROGRESS 5 usec_timeout=500000 usect_delayed=436000 reason=VMSCAN_THROTTLE_NOPROGRESS 6 usec_timeout=500000 usect_delayed=488000 reason=VMSCAN_THROTTLE_NOPROGRESS 7 usec_timeout=500000 usect_delayed=212000 reason=VMSCAN_THROTTLE_NOPROGRESS 7 usec_timeout=500000 usect_delayed=300000 reason=VMSCAN_THROTTLE_NOPROGRESS 7 usec_timeout=500000 usect_delayed=316000 reason=VMSCAN_THROTTLE_NOPROGRESS 7 usec_timeout=500000 usect_delayed=472000 reason=VMSCAN_THROTTLE_NOPROGRESS 8 usec_timeout=500000 usect_delayed=248000 reason=VMSCAN_THROTTLE_NOPROGRESS 8 usec_timeout=500000 usect_delayed=356000 reason=VMSCAN_THROTTLE_NOPROGRESS 8 usec_timeout=500000 usect_delayed=456000 reason=VMSCAN_THROTTLE_NOPROGRESS 9 usec_timeout=500000 usect_delayed=124000 reason=VMSCAN_THROTTLE_NOPROGRESS 9 usec_timeout=500000 usect_delayed=376000 reason=VMSCAN_THROTTLE_NOPROGRESS 9 usec_timeout=500000 usect_delayed=484000 reason=VMSCAN_THROTTLE_NOPROGRESS 10 usec_timeout=500000 usect_delayed=172000 reason=VMSCAN_THROTTLE_NOPROGRESS 10 usec_timeout=500000 usect_delayed=420000 reason=VMSCAN_THROTTLE_NOPROGRESS 10 usec_timeout=500000 usect_delayed=452000 reason=VMSCAN_THROTTLE_NOPROGRESS 11 usec_timeout=500000 usect_delayed=256000 reason=VMSCAN_THROTTLE_NOPROGRESS 12 usec_timeout=500000 usect_delayed=112000 reason=VMSCAN_THROTTLE_NOPROGRESS 12 usec_timeout=500000 usect_delayed=116000 reason=VMSCAN_THROTTLE_NOPROGRESS 12 usec_timeout=500000 usect_delayed=144000 reason=VMSCAN_THROTTLE_NOPROGRESS 12 usec_timeout=500000 usect_delayed=152000 reason=VMSCAN_THROTTLE_NOPROGRESS 12 usec_timeout=500000 usect_delayed=264000 reason=VMSCAN_THROTTLE_NOPROGRESS 12 usec_timeout=500000 usect_delayed=384000 reason=VMSCAN_THROTTLE_NOPROGRESS 12 usec_timeout=500000 usect_delayed=424000 reason=VMSCAN_THROTTLE_NOPROGRESS 12 usec_timeout=500000 usect_delayed=492000 reason=VMSCAN_THROTTLE_NOPROGRESS 13 usec_timeout=500000 usect_delayed=184000 reason=VMSCAN_THROTTLE_NOPROGRESS 13 usec_timeout=500000 usect_delayed=444000 reason=VMSCAN_THROTTLE_NOPROGRESS 14 usec_timeout=500000 usect_delayed=308000 reason=VMSCAN_THROTTLE_NOPROGRESS 14 usec_timeout=500000 usect_delayed=440000 reason=VMSCAN_THROTTLE_NOPROGRESS 14 usec_timeout=500000 usect_delayed=476000 reason=VMSCAN_THROTTLE_NOPROGRESS 16 usec_timeout=500000 usect_delayed=140000 reason=VMSCAN_THROTTLE_NOPROGRESS 17 usec_timeout=500000 usect_delayed=232000 reason=VMSCAN_THROTTLE_NOPROGRESS 17 usec_timeout=500000 usect_delayed=240000 reason=VMSCAN_THROTTLE_NOPROGRESS 17 usec_timeout=500000 usect_delayed=280000 reason=VMSCAN_THROTTLE_NOPROGRESS 18 usec_timeout=500000 usect_delayed=404000 reason=VMSCAN_THROTTLE_NOPROGRESS 20 usec_timeout=500000 usect_delayed=148000 reason=VMSCAN_THROTTLE_NOPROGRESS 20 usec_timeout=500000 usect_delayed=216000 reason=VMSCAN_THROTTLE_NOPROGRESS 20 usec_timeout=500000 usect_delayed=468000 reason=VMSCAN_THROTTLE_NOPROGRESS 21 usec_timeout=500000 usect_delayed=448000 reason=VMSCAN_THROTTLE_NOPROGRESS 23 usec_timeout=500000 usect_delayed=168000 reason=VMSCAN_THROTTLE_NOPROGRESS 23 usec_timeout=500000 usect_delayed=296000 reason=VMSCAN_THROTTLE_NOPROGRESS 25 usec_timeout=500000 usect_delayed=132000 reason=VMSCAN_THROTTLE_NOPROGRESS 25 usec_timeout=500000 usect_delayed=352000 reason=VMSCAN_THROTTLE_NOPROGRESS 26 usec_timeout=500000 usect_delayed=180000 reason=VMSCAN_THROTTLE_NOPROGRESS 27 usec_timeout=500000 usect_delayed=284000 reason=VMSCAN_THROTTLE_NOPROGRESS 28 usec_timeout=500000 usect_delayed=164000 reason=VMSCAN_THROTTLE_NOPROGRESS 29 usec_timeout=500000 usect_delayed=136000 reason=VMSCAN_THROTTLE_NOPROGRESS 30 usec_timeout=500000 usect_delayed=200000 reason=VMSCAN_THROTTLE_NOPROGRESS 30 usec_timeout=500000 usect_delayed=400000 reason=VMSCAN_THROTTLE_NOPROGRESS 31 usec_timeout=500000 usect_delayed=196000 reason=VMSCAN_THROTTLE_NOPROGRESS 32 usec_timeout=500000 usect_delayed=156000 reason=VMSCAN_THROTTLE_NOPROGRESS 33 usec_timeout=500000 usect_delayed=224000 reason=VMSCAN_THROTTLE_NOPROGRESS 35 usec_timeout=500000 usect_delayed=128000 reason=VMSCAN_THROTTLE_NOPROGRESS 35 usec_timeout=500000 usect_delayed=176000 reason=VMSCAN_THROTTLE_NOPROGRESS 36 usec_timeout=500000 usect_delayed=368000 reason=VMSCAN_THROTTLE_NOPROGRESS 36 usec_timeout=500000 usect_delayed=496000 reason=VMSCAN_THROTTLE_NOPROGRESS 37 usec_timeout=500000 usect_delayed=312000 reason=VMSCAN_THROTTLE_NOPROGRESS 38 usec_timeout=500000 usect_delayed=304000 reason=VMSCAN_THROTTLE_NOPROGRESS 40 usec_timeout=500000 usect_delayed=288000 reason=VMSCAN_THROTTLE_NOPROGRESS 43 usec_timeout=500000 usect_delayed=408000 reason=VMSCAN_THROTTLE_NOPROGRESS 55 usec_timeout=500000 usect_delayed=416000 reason=VMSCAN_THROTTLE_NOPROGRESS 56 usec_timeout=500000 usect_delayed=76000 reason=VMSCAN_THROTTLE_NOPROGRESS 58 usec_timeout=500000 usect_delayed=120000 reason=VMSCAN_THROTTLE_NOPROGRESS 59 usec_timeout=500000 usect_delayed=208000 reason=VMSCAN_THROTTLE_NOPROGRESS 61 usec_timeout=500000 usect_delayed=68000 reason=VMSCAN_THROTTLE_NOPROGRESS 71 usec_timeout=500000 usect_delayed=192000 reason=VMSCAN_THROTTLE_NOPROGRESS 71 usec_timeout=500000 usect_delayed=480000 reason=VMSCAN_THROTTLE_NOPROGRESS 79 usec_timeout=500000 usect_delayed=60000 reason=VMSCAN_THROTTLE_NOPROGRESS 82 usec_timeout=500000 usect_delayed=320000 reason=VMSCAN_THROTTLE_NOPROGRESS 82 usec_timeout=500000 usect_delayed=92000 reason=VMSCAN_THROTTLE_NOPROGRESS 85 usec_timeout=500000 usect_delayed=64000 reason=VMSCAN_THROTTLE_NOPROGRESS 85 usec_timeout=500000 usect_delayed=80000 reason=VMSCAN_THROTTLE_NOPROGRESS 88 usec_timeout=500000 usect_delayed=84000 reason=VMSCAN_THROTTLE_NOPROGRESS 90 usec_timeout=500000 usect_delayed=160000 reason=VMSCAN_THROTTLE_NOPROGRESS 90 usec_timeout=500000 usect_delayed=292000 reason=VMSCAN_THROTTLE_NOPROGRESS 94 usec_timeout=500000 usect_delayed=56000 reason=VMSCAN_THROTTLE_NOPROGRESS 118 usec_timeout=500000 usect_delayed=88000 reason=VMSCAN_THROTTLE_NOPROGRESS 119 usec_timeout=500000 usect_delayed=72000 reason=VMSCAN_THROTTLE_NOPROGRESS 126 usec_timeout=500000 usect_delayed=108000 reason=VMSCAN_THROTTLE_NOPROGRESS 146 usec_timeout=500000 usect_delayed=52000 reason=VMSCAN_THROTTLE_NOPROGRESS 148 usec_timeout=500000 usect_delayed=36000 reason=VMSCAN_THROTTLE_NOPROGRESS 148 usec_timeout=500000 usect_delayed=48000 reason=VMSCAN_THROTTLE_NOPROGRESS 159 usec_timeout=500000 usect_delayed=28000 reason=VMSCAN_THROTTLE_NOPROGRESS 178 usec_timeout=500000 usect_delayed=44000 reason=VMSCAN_THROTTLE_NOPROGRESS 183 usec_timeout=500000 usect_delayed=40000 reason=VMSCAN_THROTTLE_NOPROGRESS 237 usec_timeout=500000 usect_delayed=100000 reason=VMSCAN_THROTTLE_NOPROGRESS 266 usec_timeout=500000 usect_delayed=32000 reason=VMSCAN_THROTTLE_NOPROGRESS 313 usec_timeout=500000 usect_delayed=24000 reason=VMSCAN_THROTTLE_NOPROGRESS 347 usec_timeout=500000 usect_delayed=96000 reason=VMSCAN_THROTTLE_NOPROGRESS 470 usec_timeout=500000 usect_delayed=20000 reason=VMSCAN_THROTTLE_NOPROGRESS 559 usec_timeout=500000 usect_delayed=16000 reason=VMSCAN_THROTTLE_NOPROGRESS 964 usec_timeout=500000 usect_delayed=12000 reason=VMSCAN_THROTTLE_NOPROGRESS 2001 usec_timeout=500000 usect_delayed=104000 reason=VMSCAN_THROTTLE_NOPROGRESS 2447 usec_timeout=500000 usect_delayed=8000 reason=VMSCAN_THROTTLE_NOPROGRESS 7888 usec_timeout=500000 usect_delayed=4000 reason=VMSCAN_THROTTLE_NOPROGRESS 22727 usec_timeout=500000 usect_delayed=0 reason=VMSCAN_THROTTLE_NOPROGRESS 51305 usec_timeout=500000 usect_delayed=500000 reason=VMSCAN_THROTTLE_NOPROGRESS The full timeout is often hit but a large number also do not stall at all. The remainder slept a little allowing other reclaim tasks to make progress. While this timeout could be further increased, it could also negatively impact worst-case behaviour when there is no prioritisation of what task should make progress. For VMSCAN_THROTTLE_WRITEBACK, the breakdown was 1 usec_timeout=100000 usect_delayed=44000 reason=VMSCAN_THROTTLE_WRITEBACK 2 usec_timeout=100000 usect_delayed=76000 reason=VMSCAN_THROTTLE_WRITEBACK 3 usec_timeout=100000 usect_delayed=80000 reason=VMSCAN_THROTTLE_WRITEBACK 5 usec_timeout=100000 usect_delayed=48000 reason=VMSCAN_THROTTLE_WRITEBACK 5 usec_timeout=100000 usect_delayed=84000 reason=VMSCAN_THROTTLE_WRITEBACK 6 usec_timeout=100000 usect_delayed=72000 reason=VMSCAN_THROTTLE_WRITEBACK 7 usec_timeout=100000 usect_delayed=88000 reason=VMSCAN_THROTTLE_WRITEBACK 11 usec_timeout=100000 usect_delayed=56000 reason=VMSCAN_THROTTLE_WRITEBACK 12 usec_timeout=100000 usect_delayed=64000 reason=VMSCAN_THROTTLE_WRITEBACK 16 usec_timeout=100000 usect_delayed=92000 reason=VMSCAN_THROTTLE_WRITEBACK 24 usec_timeout=100000 usect_delayed=68000 reason=VMSCAN_THROTTLE_WRITEBACK 28 usec_timeout=100000 usect_delayed=32000 reason=VMSCAN_THROTTLE_WRITEBACK 30 usec_timeout=100000 usect_delayed=60000 reason=VMSCAN_THROTTLE_WRITEBACK 30 usec_timeout=100000 usect_delayed=96000 reason=VMSCAN_THROTTLE_WRITEBACK 32 usec_timeout=100000 usect_delayed=52000 reason=VMSCAN_THROTTLE_WRITEBACK 42 usec_timeout=100000 usect_delayed=40000 reason=VMSCAN_THROTTLE_WRITEBACK 77 usec_timeout=100000 usect_delayed=28000 reason=VMSCAN_THROTTLE_WRITEBACK 99 usec_timeout=100000 usect_delayed=36000 reason=VMSCAN_THROTTLE_WRITEBACK 137 usec_timeout=100000 usect_delayed=24000 reason=VMSCAN_THROTTLE_WRITEBACK 190 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK 339 usec_timeout=100000 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK 518 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK 852 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK 3359 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK 7147 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK 83962 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK The majority hit the timeout in direct reclaim context although a sizable number did not stall at all. This is very different to kswapd where only a tiny percentage of stalls due to writeback reached the timeout. Bottom line, the throttling appears to work and the wakeup events may limit worst case stalls. There might be some grounds for adjusting timeouts but it's likely futile as the worst-case scenarios depend on the workload, memory size and the speed of the storage. A better approach to improve the series further would be to prioritise tasks based on their rate of allocation with the caveat that it may be very expensive to track. This patch (of 5): Page reclaim throttles on wait_iff_congested under the following conditions: - kswapd is encountering pages under writeback and marked for immediate reclaim implying that pages are cycling through the LRU faster than pages can be cleaned. - Direct reclaim will stall if all dirty pages are backed by congested inodes. wait_iff_congested is almost completely broken with few exceptions. This patch adds a new node-based workqueue and tracks the number of throttled tasks and pages written back since throttling started. If enough pages belonging to the node are written back then the throttled tasks will wake early. If not, the throttled tasks sleeps until the timeout expires. [neilb@suse.de: Uninterruptible sleep and simpler wakeups] [hdanton@sina.com: Avoid race when reclaim starts] [vbabka@suse.cz: vmstat irq-safe api, clarifications] Link: https://lore.kernel.org/linux-mm/45d8b7a6-8548-65f5-cccf-9f451d4ae3d4@kernel.dk/ [1] Link: https://lkml.kernel.org/r/20211022144651.19914-1-mgorman@techsingularity.net Link: https://lkml.kernel.org/r/20211022144651.19914-2-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: NeilBrown <neilb@suse.de> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: "Darrick J . Wong" <djwong@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Rik van Riel <riel@surriel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06mm: introduce pmd_install() helperQi Zheng1-0/+1
Patch series "Do some code cleanups related to mm", v3. This patch (of 2): Currently we have three times the same few lines repeated in the code. Deduplicate them by newly introduced pmd_install() helper. Link: https://lkml.kernel.org/r/20210901102722.47686-1-zhengqi.arch@bytedance.com Link: https://lkml.kernel.org/r/20210901102722.47686-2-zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Mika Penttila <mika.penttila@nextfour.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-10-18mm: Add folio_evictable()Matthew Wilcox (Oracle)1-8/+19
This is the folio equivalent of page_evictable(). Unfortunately, it's different from !folio_test_unevictable(), but I think it's used in places where you have to be a VM expert and can reasonably be expected to know the difference. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Vlastimil Babka <vbabka@suse.cz>
2021-10-18mm/writeback: Add __folio_end_writeback()Matthew Wilcox (Oracle)1-0/+1
test_clear_page_writeback() is actually an mm-internal function, although it's named as if it's a pagecache function. Move it to mm/internal.h, rename it to __folio_end_writeback() and change the return type to bool. The conversion from page to folio is mostly about accounting the number of pages being written back, although it does eliminate a couple of calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Howells <dhowells@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz>
2021-09-27mm: Add folio_raw_mapping()Matthew Wilcox (Oracle)1-0/+7
Convert __page_rmapping to folio_raw_mapping and move it to mm/internal.h. It's only a couple of instructions (load and mask), so it's definitely going to be cheaper to inline it than call it. Leave page_rmapping out of line. Change page_anon_vma() to not call folio_raw_mapping() -- it's more efficient to do the subtraction than the mask. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Howells <dhowells@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz>
2021-09-27mm/swap: Add folio_rotate_reclaimable()Matthew Wilcox (Oracle)1-0/+1
Convert rotate_reclaimable_page() to folio_rotate_reclaimable(). This eliminates all five of the calls to compound_head() in this function, saving 75 bytes at the cost of adding 15 bytes to its one caller, end_page_writeback(). We also save 36 bytes from pagevec_move_tail_fn() due to using folios there. Net 96 bytes savings. Also move its declaration to mm/internal.h as it's only used by filemap.c. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: David Howells <dhowells@redhat.com>
2021-09-03mm/numa: automatically generate node migration orderDave Hansen1-0/+5
Patch series "Migrate Pages in lieu of discard", v11. We're starting to see systems with more and more kinds of memory such as Intel's implementation of persistent memory. Let's say you have a system with some DRAM and some persistent memory. Today, once DRAM fills up, reclaim will start and some of the DRAM contents will be thrown out. Allocations will, at some point, start falling over to the slower persistent memory. That has two nasty properties. First, the newer allocations can end up in the slower persistent memory. Second, reclaimed data in DRAM are just discarded even if there are gobs of space in persistent memory that could be used. This patchset implements a solution to these problems. At the end of the reclaim process in shrink_page_list() just before the last page refcount is dropped, the page is migrated to persistent memory instead of being dropped. While I've talked about a DRAM/PMEM pairing, this approach would function in any environment where memory tiers exist. This is not perfect. It "strands" pages in slower memory and never brings them back to fast DRAM. Huang Ying has follow-on work which repurposes NUMA balancing to promote hot pages back to DRAM. This is also all based on an upstream mechanism that allows persistent memory to be onlined and used as if it were volatile: http://lkml.kernel.org/r/20190124231441.37A4A305@viggo.jf.intel.com With that, the DRAM and PMEM in each socket will be represented as 2 separate NUMA nodes, with the CPUs sit in the DRAM node. So the general inter-NUMA demotion mechanism introduced in the patchset can migrate the cold DRAM pages to the PMEM node. We have tested the patchset with the postgresql and pgbench. On a 2-socket server machine with DRAM and PMEM, the kernel with the patchset can improve the score of pgbench up to 22.1% compared with that of the DRAM only + disk case. This comes from the reduced disk read throughput (which reduces up to 70.8%). == Open Issues == * Memory policies and cpusets that, for instance, restrict allocations to DRAM can be demoted to PMEM whenever they opt in to this new mechanism. A cgroup-level API to opt-in or opt-out of these migrations will likely be required as a follow-on. * Could be more aggressive about where anon LRU scanning occurs since it no longer necessarily involves I/O. get_scan_count() for instance says: "If we have no swap space, do not bother scanning anon pages" This patch (of 9): Prepare for the kernel to auto-migrate pages to other memory nodes with a node migration table. This allows creating single migration target for each NUMA node to enable the kernel to do NUMA page migrations instead of simply discarding colder pages. A node with no target is a "terminal node", so reclaim acts normally there. The migration target does not fundamentally _need_ to be a single node, but this implementation starts there to limit complexity. When memory fills up on a node, memory contents can be automatically migrated to another node. The biggest problems are knowing when to migrate and to where the migration should be targeted. The most straightforward way to generate the "to where" list would be to follow the page allocator fallback lists. Those lists already tell us if memory is full where to look next. It would also be logical to move memory in that order. But, the allocator fallback lists have a fatal flaw: most nodes appear in all the lists. This would potentially lead to migration cycles (A->B, B->A, A->B, ...). Instead of using the allocator fallback lists directly, keep a separate node migration ordering. But, reuse the same data used to generate page allocator fallback in the first place: find_next_best_node(). This means that the firmware data used to populate node distances essentially dictates the ordering for now. It should also be architecture-neutral since all NUMA architectures have a working find_next_best_node(). RCU is used to allow lock-less read of node_demotion[] and prevent demotion cycles been observed. If multiple reads of node_demotion[] are performed, a single rcu_read_lock() must be held over all reads to ensure no cycles are observed. Details are as follows. === What does RCU provide? === Imagine a simple loop which walks down the demotion path looking for the last node: terminal_node = start_node; while (node_demotion[terminal_node] != NUMA_NO_NODE) { terminal_node = node_demotion[terminal_node]; } The initial values are: node_demotion[0] = 1; node_demotion[1] = NUMA_NO_NODE; and are updated to: node_demotion[0] = NUMA_NO_NODE; node_demotion[1] = 0; What guarantees that the cycle is not observed: node_demotion[0] = 1; node_demotion[1] = 0; and would loop forever? With RCU, a rcu_read_lock/unlock() can be placed around the loop. Since the write side does a synchronize_rcu(), the loop that observed the old contents is known to be complete before the synchronize_rcu() has completed. RCU, combined with disable_all_migrate_targets(), ensures that the old migration state is not visible by the time __set_migration_target_nodes() is called. === What does READ_ONCE() provide? === READ_ONCE() forbids the compiler from merging or reordering successive reads of node_demotion[]. This ensures that any updates are *eventually* observed. Consider the above loop again. The compiler could theoretically read the entirety of node_demotion[] into local storage (registers) and never go back to memory, and *permanently* observe bad values for node_demotion[]. Note: RCU does not provide any universal compiler-ordering guarantees: https://lore.kernel.org/lkml/20150921204327.GH4029@linux.vnet.ibm.com/ This code is unused for now. It will be called later in the series. Link: https://lkml.kernel.org/r/20210721063926.3024591-1-ying.huang@intel.com Link: https://lkml.kernel.org/r/20210715055145.195411-1-ying.huang@intel.com Link: https://lkml.kernel.org/r/20210715055145.195411-2-ying.huang@intel.com Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Wei Xu <weixugc@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Keith Busch <kbusch@kernel.org> Cc: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm: introduce memmap_alloc() to unify memory map allocationMike Rapoport1-0/+4
There are several places that allocate memory for the memory map: alloc_node_mem_map() for FLATMEM, sparse_buffer_init() and __populate_section_memmap() for SPARSEMEM. The memory allocated in the FLATMEM case is zeroed and it is never poisoned, regardless of CONFIG_PAGE_POISON setting. The memory allocated in the SPARSEMEM cases is not zeroed and it is implicitly poisoned inside memblock if CONFIG_PAGE_POISON is set. Introduce memmap_alloc() wrapper for memblock allocators that will be used for both FLATMEM and SPARSEMEM cases and will makei memory map zeroing and poisoning consistent for different memory models. Link: https://lkml.kernel.org/r/20210714123739.16493-4-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Michal Simek <monstr@monstr.eu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-08mmap: make mlock_future_check() globalMike Rapoport1-0/+3
Patch series "mm: introduce memfd_secret system call to create "secret" memory areas", v20. This is an implementation of "secret" mappings backed by a file descriptor. The file descriptor backing secret memory mappings is created using a dedicated memfd_secret system call The desired protection mode for the memory is configured using flags parameter of the system call. The mmap() of the file descriptor created with memfd_secret() will create a "secret" memory mapping. The pages in that mapping will be marked as not present in the direct map and will be present only in the page table of the owning mm. Although normally Linux userspace mappings are protected from other users, such secret mappings are useful for environments where a hostile tenant is trying to trick the kernel into giving them access to other tenants mappings. It's designed to provide the following protections: * Enhanced protection (in conjunction with all the other in-kernel attack prevention systems) against ROP attacks. Seceretmem makes "simple" ROP insufficient to perform exfiltration, which increases the required complexity of the attack. Along with other protections like the kernel stack size limit and address space layout randomization which make finding gadgets is really hard, absence of any in-kernel primitive for accessing secret memory means the one gadget ROP attack can't work. Since the only way to access secret memory is to reconstruct the missing mapping entry, the attacker has to recover the physical page and insert a PTE pointing to it in the kernel and then retrieve the contents. That takes at least three gadgets which is a level of difficulty beyond most standard attacks. * Prevent cross-process secret userspace memory exposures. Once the secret memory is allocated, the user can't accidentally pass it into the kernel to be transmitted somewhere. The secreremem pages cannot be accessed via the direct map and they are disallowed in GUP. * Harden against exploited kernel flaws. In order to access secretmem, a kernel-side attack would need to either walk the page tables and create new ones, or spawn a new privileged uiserspace process to perform secrets exfiltration using ptrace. In the future the secret mappings may be used as a mean to protect guest memory in a virtual machine host. For demonstration of secret memory usage we've created a userspace library https://git.kernel.org/pub/scm/linux/kernel/git/jejb/secret-memory-preloader.git that does two things: the first is act as a preloader for openssl to redirect all the OPENSSL_malloc calls to secret memory meaning any secret keys get automatically protected this way and the other thing it does is expose the API to the user who needs it. We anticipate that a lot of the use cases would be like the openssl one: many toolkits that deal with secret keys already have special handling for the memory to try to give them greater protection, so this would simply be pluggable into the toolkits without any need for user application modification. Hiding secret memory mappings behind an anonymous file allows usage of the page cache for tracking pages allocated for the "secret" mappings as well as using address_space_operations for e.g. page migration callbacks. The anonymous file may be also used implicitly, like hugetlb files, to implement mmap(MAP_SECRET) and use the secret memory areas with "native" mm ABIs in the future. Removing of the pages from the direct map may cause its fragmentation on architectures that use large pages to map the physical memory which affects the system performance. However, the original Kconfig text for CONFIG_DIRECT_GBPAGES said that gigabyte pages in the direct map "... can improve the kernel's performance a tiny bit ..." (commit 00d1c5e05736 ("x86: add gbpages switches")) and the recent report [1] showed that "... although 1G mappings are a good default choice, there is no compelling evidence that it must be the only choice". Hence, it is sufficient to have secretmem disabled by default with the ability of a system administrator to enable it at boot time. In addition, there is also a long term goal to improve management of the direct map. [1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/ This patch (of 7): It will be used by the upcoming secret memory implementation. Link: https://lkml.kernel.org/r/20210518072034.31572-1-rppt@kernel.org Link: https://lkml.kernel.org/r/20210518072034.31572-2-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christopher Lameter <cl@linux.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Elena Reshetova <elena.reshetova@intel.com> Cc: Hagen Paul Pfeifer <hagen@jauu.net> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Bottomley <jejb@linux.ibm.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Palmer Dabbelt <palmerdabbelt@google.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tycho Andersen <tycho@tycho.ws> Cc: Will Deacon <will@kernel.org> Cc: kernel test robot <lkp@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01mm/page_alloc: move prototype for find_suitable_fallbackMel Gorman1-2/+1
make W=1 generates the following warning in mmap_lock.c for allnoconfig mm/page_alloc.c:2670:5: warning: no previous prototype for `find_suitable_fallback' [-Wmissing-prototypes] int find_suitable_fallback(struct free_area *area, unsigned int order, ^~~~~~~~~~~~~~~~~~~~~~ find_suitable_fallback is only shared outside of page_alloc.c for CONFIG_COMPACTION but to suppress the warning, move the protype outside of CONFIG_COMPACTION. It is not worth the effort at this time to find a clever way of allowing compaction.c to share the code or avoid the use entirely as the function is called on relatively slow paths. Link: https://lkml.kernel.org/r/20210520084809.8576-14-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Yang Shi <shy828301@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Streetman <ddstreet@ieee.org> Cc: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tablesDavid Hildenbrand1-0/+3
I. Background: Sparse Memory Mappings When we manage sparse memory mappings dynamically in user space - also sometimes involving MAP_NORESERVE - we want to dynamically populate/ discard memory inside such a sparse memory region. Example users are hypervisors (especially implementing memory ballooning or similar technologies like virtio-mem) and memory allocators. In addition, we want to fail in a nice way (instead of generating SIGBUS) if populating does not succeed because we are out of backend memory (which can happen easily with file-based mappings, especially tmpfs and hugetlbfs). While MADV_DONTNEED, MADV_REMOVE and FALLOC_FL_PUNCH_HOLE allow for reliably discarding memory for most mapping types, there is no generic approach to populate page tables and preallocate memory. Although mmap() supports MAP_POPULATE, it is not applicable to the concept of sparse memory mappings, where we want to populate/discard dynamically and avoid expensive/problematic remappings. In addition, we never actually report errors during the final populate phase - it is best-effort only. fallocate() can be used to preallocate file-based memory and fail in a safe way. However, it cannot really be used for any private mappings on anonymous files via memfd due to COW semantics. In addition, fallocate() does not actually populate page tables, so we still always get pagefaults on first access - which is sometimes undesired (i.e., real-time workloads) and requires real prefaulting of page tables, not just a preallocation of backend storage. There might be interesting use cases for sparse memory regions along with mlockall(MCL_ONFAULT) which fallocate() cannot satisfy as it does not prefault page tables. II. On preallcoation/prefaulting from user space Because we don't have a proper interface, what applications (like QEMU and databases) end up doing is touching (i.e., reading+writing one byte to not overwrite existing data) all individual pages. However, that approach 1) Can result in wear on storage backing, because we end up reading/writing each page; this is especially a problem for dax/pmem. 2) Can result in mmap_sem contention when prefaulting via multiple threads. 3) Requires expensive signal handling, especially to catch SIGBUS in case of hugetlbfs/shmem/file-backed memory. For example, this is problematic in hypervisors like QEMU where SIGBUS handlers might already be used by other subsystems concurrently to e.g, handle hardware errors. "Simply" doing preallocation concurrently from other thread is not that easy. III. On MADV_WILLNEED Extending MADV_WILLNEED is not an option because 1. It would change the semantics: "Expect access in the near future." and "might be a good idea to read some pages" vs. "Definitely populate/ preallocate all memory and definitely fail on errors.". 2. Existing users (like virtio-balloon in QEMU when deflating the balloon) don't want populate/prealloc semantics. They treat this rather as a hint to give a little performance boost without too much overhead - and don't expect that a lot of memory might get consumed or a lot of time might be spent. IV. MADV_POPULATE_READ and MADV_POPULATE_WRITE Let's introduce MADV_POPULATE_READ and MADV_POPULATE_WRITE, inspired by MAP_POPULATE, with the following semantics: 1. MADV_POPULATE_READ can be used to prefault page tables just like manually reading each individual page. This will not break any COW mappings. The shared zero page might get mapped and no backend storage might get preallocated -- allocation might be deferred to write-fault time. Especially shared file mappings require an explicit fallocate() upfront to actually preallocate backend memory (blocks in the file system) in case the file might have holes. 2. If MADV_POPULATE_READ succeeds, all page tables have been populated (prefaulted) readable once. 3. MADV_POPULATE_WRITE can be used to preallocate backend memory and prefault page tables just like manually writing (or reading+writing) each individual page. This will break any COW mappings -- e.g., the shared zeropage is never populated. 4. If MADV_POPULATE_WRITE succeeds, all page tables have been populated (prefaulted) writable once. 5. MADV_POPULATE_READ and MADV_POPULATE_WRITE cannot be applied to special mappings marked with VM_PFNMAP and VM_IO. Also, proper access permissions (e.g., PROT_READ, PROT_WRITE) are required. If any such mapping is encountered, madvise() fails with -EINVAL. 6. If MADV_POPULATE_READ or MADV_POPULATE_WRITE fails, some page tables might have been populated. 7. MADV_POPULATE_READ and MADV_POPULATE_WRITE will return -EHWPOISON when encountering a HW poisoned page in the range. 8. Similar to MAP_POPULATE, MADV_POPULATE_READ and MADV_POPULATE_WRITE cannot protect from the OOM (Out Of Memory) handler killing the process. While the use case for MADV_POPULATE_WRITE is fairly obvious (i.e., preallocate memory and prefault page tables for VMs), one issue is that whenever we prefault pages writable, the pages have to be marked dirty, because the CPU could dirty them any time. while not a real problem for hugetlbfs or dax/pmem, it can be a problem for shared file mappings: each page will be marked dirty and has to be written back later when evicting. MADV_POPULATE_READ allows for optimizing this scenario: Pre-read a whole mapping from backend storage without marking it dirty, such that eviction won't have to write it back. As discussed above, shared file mappings might require an explciit fallocate() upfront to achieve preallcoation+prepopulation. Although sparse memory mappings are the primary use case, this will also be useful for other preallocate/prefault use cases where MAP_POPULATE is not desired or the semantics of MAP_POPULATE are not sufficient: as one example, QEMU users can trigger preallocation/prefaulting of guest RAM after the mapping was created -- and don't want errors to be silently suppressed. Looking at the history, MADV_POPULATE was already proposed in 2013 [1], however, the main motivation back than was performance improvements -- which should also still be the case. V. Single-threaded performance comparison I did a short experiment, prefaulting page tables on completely *empty mappings/files* and repeated the experiment 10 times. The results correspond to the shortest execution time. In general, the performance benefit for huge pages is negligible with small mappings. V.1: Private mappings POPULATE_READ and POPULATE_WRITE is fastest. Note that Reading/POPULATE_READ will populate the shared zeropage where applicable -- which result in short population times. The fastest way to allocate backend storage (here: swap or huge pages) and prefault page tables is POPULATE_WRITE. V.2: Shared mappings fallocate() is fastest, however, doesn't prefault page tables. POPULATE_WRITE is faster than simple writes and read/writes. POPULATE_READ is faster than simple reads. Without a fd, the fastest way to allocate backend storage and prefault page tables is POPULATE_WRITE. With an fd, the fastest way is usually FALLOCATE+POPULATE_READ or FALLOCATE+POPULATE_WRITE respectively; one exception are actual files: FALLOCATE+Read is slightly faster than FALLOCATE+POPULATE_READ. The fastest way to allocate backend storage prefault page tables is FALLOCATE+POPULATE_WRITE -- except when dealing with actual files; then, FALLOCATE+POPULATE_READ is fastest and won't directly mark all pages as dirty. v.3: Detailed results ================================================== 2 MiB MAP_PRIVATE: ************************************************** Anon 4 KiB : Read : 0.119 ms Anon 4 KiB : Write : 0.222 ms Anon 4 KiB : Read/Write : 0.380 ms Anon 4 KiB : POPULATE_READ : 0.060 ms Anon 4 KiB : POPULATE_WRITE : 0.158 ms Memfd 4 KiB : Read : 0.034 ms Memfd 4 KiB : Write : 0.310 ms Memfd 4 KiB : Read/Write : 0.362 ms Memfd 4 KiB : POPULATE_READ : 0.039 ms Memfd 4 KiB : POPULATE_WRITE : 0.229 ms Memfd 2 MiB : Read : 0.030 ms Memfd 2 MiB : Write : 0.030 ms Memfd 2 MiB : Read/Write : 0.030 ms Memfd 2 MiB : POPULATE_READ : 0.030 ms Memfd 2 MiB : POPULATE_WRITE : 0.030 ms tmpfs : Read : 0.033 ms tmpfs : Write : 0.313 ms tmpfs : Read/Write : 0.406 ms tmpfs : POPULATE_READ : 0.039 ms tmpfs : POPULATE_WRITE : 0.285 ms file : Read : 0.033 ms file : Write : 0.351 ms file : Read/Write : 0.408 ms file : POPULATE_READ : 0.039 ms file : POPULATE_WRITE : 0.290 ms hugetlbfs : Read : 0.030 ms hugetlbfs : Write : 0.030 ms hugetlbfs : Read/Write : 0.030 ms hugetlbfs : POPULATE_READ : 0.030 ms hugetlbfs : POPULATE_WRITE : 0.030 ms ************************************************** 4096 MiB MAP_PRIVATE: ************************************************** Anon 4 KiB : Read : 237.940 ms Anon 4 KiB : Write : 708.409 ms Anon 4 KiB : Read/Write : 1054.041 ms Anon 4 KiB : POPULATE_READ : 124.310 ms Anon 4 KiB : POPULATE_WRITE : 572.582 ms Memfd 4 KiB : Read : 136.928 ms Memfd 4 KiB : Write : 963.898 ms Memfd 4 KiB : Read/Write : 1106.561 ms Memfd 4 KiB : POPULATE_READ : 78.450 ms Memfd 4 KiB : POPULATE_WRITE : 805.881 ms Memfd 2 MiB : Read : 357.116 ms Memfd 2 MiB : Write : 357.210 ms Memfd 2 MiB : Read/Write : 357.606 ms Memfd 2 MiB : POPULATE_READ : 356.094 ms Memfd 2 MiB : POPULATE_WRITE : 356.937 ms tmpfs : Read : 137.536 ms tmpfs : Write : 954.362 ms tmpfs : Read/Write : 1105.954 ms tmpfs : POPULATE_READ : 80.289 ms tmpfs : POPULATE_WRITE : 822.826 ms file : Read : 137.874 ms file : Write : 987.025 ms file : Read/Write : 1107.439 ms file : POPULATE_READ : 80.413 ms file : POPULATE_WRITE : 857.622 ms hugetlbfs : Read : 355.607 ms hugetlbfs : Write : 355.729 ms hugetlbfs : Read/Write : 356.127 ms hugetlbfs : POPULATE_READ : 354.585 ms hugetlbfs : POPULATE_WRITE : 355.138 ms ************************************************** 2 MiB MAP_SHARED: ************************************************** Anon 4 KiB : Read : 0.394 ms Anon 4 KiB : Write : 0.348 ms Anon 4 KiB : Read/Write : 0.400 ms Anon 4 KiB : POPULATE_READ : 0.326 ms Anon 4 KiB : POPULATE_WRITE : 0.273 ms Anon 2 MiB : Read : 0.030 ms Anon 2 MiB : Write : 0.030 ms Anon 2 MiB : Read/Write : 0.030 ms Anon 2 MiB : POPULATE_READ : 0.030 ms Anon 2 MiB : POPULATE_WRITE : 0.030 ms Memfd 4 KiB : Read : 0.412 ms Memfd 4 KiB : Write : 0.372 ms Memfd 4 KiB : Read/Write : 0.419 ms Memfd 4 KiB : POPULATE_READ : 0.343 ms Memfd 4 KiB : POPULATE_WRITE : 0.288 ms Memfd 4 KiB : FALLOCATE : 0.137 ms Memfd 4 KiB : FALLOCATE+Read : 0.446 ms Memfd 4 KiB : FALLOCATE+Write : 0.330 ms Memfd 4 KiB : FALLOCATE+Read/Write : 0.454 ms Memfd 4 KiB : FALLOCATE+POPULATE_READ : 0.379 ms Memfd 4 KiB : FALLOCATE+POPULATE_WRITE : 0.268 ms Memfd 2 MiB : Read : 0.030 ms Memfd 2 MiB : Write : 0.030 ms Memfd 2 MiB : Read/Write : 0.030 ms Memfd 2 MiB : POPULATE_READ : 0.030 ms Memfd 2 MiB : POPULATE_WRITE : 0.030 ms Memfd 2 MiB : FALLOCATE : 0.030 ms Memfd 2 MiB : FALLOCATE+Read : 0.031 ms Memfd 2 MiB : FALLOCATE+Write : 0.031 ms Memfd 2 MiB : FALLOCATE+Read/Write : 0.031 ms Memfd 2 MiB : FALLOCATE+POPULATE_READ : 0.030 ms Memfd 2 MiB : FALLOCATE+POPULATE_WRITE : 0.030 ms tmpfs : Read : 0.416 ms tmpfs : Write : 0.369 ms tmpfs : Read/Write : 0.425 ms tmpfs : POPULATE_READ : 0.346 ms tmpfs : POPULATE_WRITE : 0.295 ms tmpfs : FALLOCATE : 0.139 ms tmpfs : FALLOCATE+Read : 0.447 ms tmpfs : FALLOCATE+Write : 0.333 ms tmpfs : FALLOCATE+Read/Write : 0.454 ms tmpfs : FALLOCATE+POPULATE_READ : 0.380 ms tmpfs : FALLOCATE+POPULATE_WRITE : 0.272 ms file : Read : 0.191 ms file : Write : 0.511 ms file : Read/Write : 0.524 ms file : POPULATE_READ : 0.196 ms file : POPULATE_WRITE : 0.434 ms file : FALLOCATE : 0.004 ms file : FALLOCATE+Read : 0.197 ms file : FALLOCATE+Write : 0.554 ms file : FALLOCATE+Read/Write : 0.480 ms file : FALLOCATE+POPULATE_READ : 0.201 ms file : FALLOCATE+POPULATE_WRITE : 0.381 ms hugetlbfs : Read : 0.030 ms hugetlbfs : Write : 0.030 ms hugetlbfs : Read/Write : 0.030 ms hugetlbfs : POPULATE_READ : 0.030 ms hugetlbfs : POPULATE_WRITE : 0.030 ms hugetlbfs : FALLOCATE : 0.030 ms hugetlbfs : FALLOCATE+Read : 0.031 ms hugetlbfs : FALLOCATE+Write : 0.031 ms hugetlbfs : FALLOCATE+Read/Write : 0.030 ms hugetlbfs : FALLOCATE+POPULATE_READ : 0.030 ms hugetlbfs : FALLOCATE+POPULATE_WRITE : 0.030 ms ************************************************** 4096 MiB MAP_SHARED: ************************************************** Anon 4 KiB : Read : 1053.090 ms Anon 4 KiB : Write : 913.642 ms Anon 4 KiB : Read/Write : 1060.350 ms Anon 4 KiB : POPULATE_READ : 893.691 ms Anon 4 KiB : POPULATE_WRITE : 782.885 ms Anon 2 MiB : Read : 358.553 ms Anon 2 MiB : Write : 358.419 ms Anon 2 MiB : Read/Write : 357.992 ms Anon 2 MiB : POPULATE_READ : 357.533 ms Anon 2 MiB : POPULATE_WRITE : 357.808 ms Memfd 4 KiB : Read : 1078.144 ms Memfd 4 KiB : Write : 942.036 ms Memfd 4 KiB : Read/Write : 1100.391 ms Memfd 4 KiB : POPULATE_READ : 925.829 ms Memfd 4 KiB : POPULATE_WRITE : 804.394 ms Memfd 4 KiB : FALLOCATE : 304.632 ms Memfd 4 KiB : FALLOCATE+Read : 1163.359 ms Memfd 4 KiB : FALLOCATE+Write : 933.186 ms Memfd 4 KiB : FALLOCATE+Read/Write : 1187.304 ms Memfd 4 KiB : FALLOCATE+POPULATE_READ : 1013.660 ms Memfd 4 KiB : FALLOCATE+POPULATE_WRITE : 794.560 ms Memfd 2 MiB : Read : 358.131 ms Memfd 2 MiB : Write : 358.099 ms Memfd 2 MiB : Read/Write : 358.250 ms Memfd 2 MiB : POPULATE_READ : 357.563 ms Memfd 2 MiB : POPULATE_WRITE : 357.334 ms Memfd 2 MiB : FALLOCATE : 356.735 ms Memfd 2 MiB : FALLOCATE+Read : 358.152 ms Memfd 2 MiB : FALLOCATE+Write : 358.331 ms Memfd 2 MiB : FALLOCATE+Read/Write : 358.018 ms Memfd 2 MiB : FALLOCATE+POPULATE_READ : 357.286 ms Memfd 2 MiB : FALLOCATE+POPULATE_WRITE : 357.523 ms tmpfs : Read : 1087.265 ms tmpfs : Write : 950.840 ms tmpfs : Read/Write : 1107.567 ms tmpfs : POPULATE_READ : 922.605 ms tmpfs : POPULATE_WRITE : 810.094 ms tmpfs : FALLOCATE : 306.320 ms tmpfs : FALLOCATE+Read : 1169.796 ms tmpfs : FALLOCATE+Write : 933.730 ms tmpfs : FALLOCATE+Read/Write : 1191.610 ms tmpfs : FALLOCATE+POPULATE_READ : 1020.474 ms tmpfs : FALLOCATE+POPULATE_WRITE : 798.945 ms file : Read : 654.101 ms file : Write : 1259.142 ms file : Read/Write : 1289.509 ms file : POPULATE_READ : 661.642 ms file : POPULATE_WRITE : 1106.816 ms file : FALLOCATE : 1.864 ms file : FALLOCATE+Read : 656.328 ms file : FALLOCATE+Write : 1153.300 ms file : FALLOCATE+Read/Write : 1180.613 ms file : FALLOCATE+POPULATE_READ : 668.347 ms file : FALLOCATE+POPULATE_WRITE : 996.143 ms hugetlbfs : Read : 357.245 ms hugetlbfs : Write : 357.413 ms hugetlbfs : Read/Write : 357.120 ms hugetlbfs : POPULATE_READ : 356.321 ms hugetlbfs : POPULATE_WRITE : 356.693 ms hugetlbfs : FALLOCATE : 355.927 ms hugetlbfs : FALLOCATE+Read : 357.074 ms hugetlbfs : FALLOCATE+Write : 357.120 ms hugetlbfs : FALLOCATE+Read/Write : 356.983 ms hugetlbfs : FALLOCATE+POPULATE_READ : 356.413 ms hugetlbfs : FALLOCATE+POPULATE_WRITE : 356.266 ms ************************************************** [1] https://lkml.org/lkml/2013/6/27/698 [akpm@linux-foundation.org: coding style fixes] Link: https://lkml.kernel.org/r/20210419135443.12822-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@surriel.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Matt Turner <mattst88@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Helge Deller <deller@gmx.de> Cc: Chris Zankel <chris@zankel.net> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rolf Eike Beer <eike-kernel@sf-tec.de> Cc: Ram Pai <linuxram@us.ibm.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30mm: make variable names for populate_vma_page_range() consistentDavid Hildenbrand1-1/+1
Patch series "mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables", v2. Excessive details on MADV_POPULATE_(READ|WRITE) can be found in patch #2. This patch (of 5): Let's make the variable names in the function declaration match the variable names used in the definition. Link: https://lkml.kernel.org/r/20210419135443.12822-1-david@redhat.com Link: https://lkml.kernel.org/r/20210419135443.12822-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Peter Xu <peterx@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Chris Zankel <chris@zankel.net> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Helge Deller <deller@gmx.de> Cc: Hugh Dickins <hughd@google.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Jann Horn <jannh@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Ram Pai <linuxram@us.ibm.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Rik van Riel <riel@surriel.com> Cc: Rolf Eike Beer <eike-kernel@sf-tec.de> Cc: Shuah Khan <shuah@kernel.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30mm: thp: refactor NUMA fault handlingYang Shi1-18/+0
When the THP NUMA fault support was added THP migration was not supported yet. So the ad hoc THP migration was implemented in NUMA fault handling. Since v4.14 THP migration has been supported so it doesn't make too much sense to still keep another THP migration implementation rather than using the generic migration code. This patch reworks the NUMA fault handling to use generic migration implementation to migrate misplaced page. There is no functional change. After the refactor the flow of NUMA fault handling looks just like its PTE counterpart: Acquire ptl Prepare for migration (elevate page refcount) Release ptl Isolate page from lru and elevate page refcount Migrate the misplaced THP If migration fails just restore the old normal PMD. In the old code anon_vma lock was needed to serialize THP migration against THP split, but since then the THP code has been reworked a lot, it seems anon_vma lock is not required anymore to avoid the race. The page refcount elevation when holding ptl should prevent from THP split. Use migrate_misplaced_page() for both base page and THP NUMA hinting fault and remove all the dead and duplicate code. [dan.carpenter@oracle.com: fix a double unlock bug] Link: https://lkml.kernel.org/r/YLX8uYN01JmfLnlK@mwanda Link: https://lkml.kernel.org/r/20210518200801.7413-4-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30mm: memory: make numa_migrate_prep() non-staticYang Shi1-0/+3
The numa_migrate_prep() will be used by huge NUMA fault as well in the following patch, make it non-static. Link: https://lkml.kernel.org/r/20210518200801.7413-3-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29mm/page_alloc: allow high-order pages to be stored on the per-cpu listsMel Gorman1-1/+1
The per-cpu page allocator (PCP) only stores order-0 pages. This means that all THP and "cheap" high-order allocations including SLUB contends on the zone->lock. This patch extends the PCP allocator to store THP and "cheap" high-order pages. Note that struct per_cpu_pages increases in size to 256 bytes (4 cache lines) on x86-64. Note that this is not necessarily a universal performance win because of how it is implemented. High-order pages can cause pcp->high to be exceeded prematurely for lower-orders so for example, a large number of THP pages being freed could release order-0 pages from the PCP lists. Hence, much depends on the allocation/free pattern as observed by a single CPU to determine if caching helps or hurts a particular workload. That said, basic performance testing passed. The following is a netperf UDP_STREAM test which hits the relevant patches as some of the network allocations are high-order. netperf-udp 5.13.0-rc2 5.13.0-rc2 mm-pcpburst-v3r4 mm-pcphighorder-v1r7 Hmean send-64 261.46 ( 0.00%) 266.30 * 1.85%* Hmean send-128 516.35 ( 0.00%) 536.78 * 3.96%* Hmean send-256 1014.13 ( 0.00%) 1034.63 * 2.02%* Hmean send-1024 3907.65 ( 0.00%) 4046.11 * 3.54%* Hmean send-2048 7492.93 ( 0.00%) 7754.85 * 3.50%* Hmean send-3312 11410.04 ( 0.00%) 11772.32 * 3.18%* Hmean send-4096 13521.95 ( 0.00%) 13912.34 * 2.89%* Hmean send-8192 21660.50 ( 0.00%) 22730.72 * 4.94%* Hmean send-16384 31902.32 ( 0.00%) 32637.50 * 2.30%* Functionally, a patch like this is necessary to make bulk allocation of high-order pages work with similar performance to order-0 bulk allocations. The bulk allocator is not updated in this series as it would have to be determined by bulk allocation users how they want to track the order of pages allocated with the bulk allocator. Link: https://lkml.kernel.org/r/20210611135753.GC30378@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>