From 55f77df7d715110299f12c27f4365bd6332d1adb Mon Sep 17 00:00:00 2001 From: Lucas Stach Date: Mon, 18 Mar 2024 21:07:36 +0100 Subject: mm: page_alloc: control latency caused by zone PCP draining Patch series "mm/treewide: Remove pXd_huge() API", v2. In previous work [1], we removed the pXd_large() API, which is arch specific. This patchset further removes the hugetlb pXd_huge() API. Hugetlb was never special on creating huge mappings when compared with other huge mappings. Having a standalone API just to detect such pgtable entries is more or less redundant, especially after the pXd_leaf() API set is introduced with/without CONFIG_HUGETLB_PAGE. When looking at this problem, a few issues are also exposed that we don't have a clear definition of the *_huge() variance API. This patchset started by cleaning these issues first, then replace all *_huge() users to use *_leaf(), then drop all *_huge() code. On x86/sparc, swap entries will be reported "true" in pXd_huge(), while for all the rest archs they're reported "false" instead. This part is done in patch 1-5, in which I suspect patch 1 can be seen as a bug fix, but I'll leave that to hmm experts to decide. Besides, there are three archs (arm, arm64, powerpc) that have slightly different definitions between the *_huge() v.s. *_leaf() variances. I tackled them separately so that it'll be easier for arch experts to chim in when necessary. This part is done in patch 6-9. The final patches 10-14 do the rest on the final removal, since *_leaf() will be the ultimate API in the future, and we seem to have quite some confusions on how *_huge() APIs can be defined, provide a rich comment for *_leaf() API set to define them properly to avoid future misuse, and hopefully that'll also help new archs to start support huge mappings and avoid traps (like either swap entries, or PROT_NONE entry checks). [1] https://lore.kernel.org/r/20240305043750.93762-1-peterx@redhat.com This patch (of 14): When the complete PCP is drained a much larger number of pages than the usual batch size might be freed at once, causing large IRQ and preemption latency spikes, as they are all freed while holding the pcp and zone spinlocks. To avoid those latency spikes, limit the number of pages freed in a single bulk operation to common batch limits. Link: https://lkml.kernel.org/r/20240318200404.448346-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20240318200736.2835502-1-l.stach@pengutronix.de Signed-off-by: Lucas Stach Signed-off-by: Peter Xu Cc: Christophe Leroy Cc: Jason Gunthorpe Cc: "Matthew Wilcox (Oracle)" Cc: Mike Rapoport (IBM) Cc: Muchun Song Cc: Alistair Popple Cc: Andreas Larsson Cc: "Aneesh Kumar K.V" Cc: Arnd Bergmann Cc: Bjorn Andersson Cc: Borislav Petkov Cc: Catalin Marinas Cc: Dave Hansen Cc: David S. Miller Cc: Fabio Estevam Cc: Ingo Molnar Cc: Konrad Dybcio Cc: Krzysztof Kozlowski Cc: Mark Salter Cc: Michael Ellerman Cc: Naoya Horiguchi Cc: "Naveen N. 
Rao" Cc: Nicholas Piggin Cc: Russell King Cc: Shawn Guo Cc: Thomas Gleixner Cc: Will Deacon Signed-off-by: Andrew Morton --- mm/page_alloc.c | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 14d39f34d336..5083ac034d26 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2216,12 +2216,15 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp) */ static void drain_pages_zone(unsigned int cpu, struct zone *zone) { - struct per_cpu_pages *pcp; + struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu); + int count = READ_ONCE(pcp->count); + + while (count) { + int to_drain = min(count, pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX); + count -= to_drain; - pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu); - if (pcp->count) { spin_lock(&pcp->lock); - free_pcppages_bulk(zone, pcp->count, pcp, 0); + free_pcppages_bulk(zone, to_drain, pcp, 0); spin_unlock(&pcp->lock); } } -- cgit v1.2.3-59-g8ed1b From dcfe378c81f72f146890ce1dcfdcc742d3b66924 Mon Sep 17 00:00:00 2001 From: Suren Baghdasaryan Date: Thu, 21 Mar 2024 09:36:36 -0700 Subject: lib: introduce support for page allocation tagging MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Introduce helper functions to easily instrument page allocators by storing a pointer to the allocation tag associated with the code that allocated the page in a page_ext field. Link: https://lkml.kernel.org/r/20240321163705.3067592-15-surenb@google.com Signed-off-by: Suren Baghdasaryan Co-developed-by: Kent Overstreet Signed-off-by: Kent Overstreet Reviewed-by: Vlastimil Babka Tested-by: Kees Cook Cc: Alexander Viro Cc: Alex Gaynor Cc: Alice Ryhl Cc: Andreas Hindborg Cc: Benno Lossin Cc: "Björn Roy Baron" Cc: Boqun Feng Cc: Christoph Lameter Cc: Dennis Zhou Cc: Gary Guo Cc: Miguel Ojeda Cc: Pasha Tatashin Cc: Peter Zijlstra Cc: Tejun Heo Cc: Wedson Almeida Filho Signed-off-by: Andrew Morton --- include/linux/page_ext.h | 1 - include/linux/pgalloc_tag.h | 78 +++++++++++++++++++++++++++++++++++++++++++++ lib/Kconfig.debug | 1 + lib/alloc_tag.c | 17 ++++++++++ mm/mm_init.c | 1 + mm/page_alloc.c | 4 +++ mm/page_ext.c | 4 +++ 7 files changed, 105 insertions(+), 1 deletion(-) create mode 100644 include/linux/pgalloc_tag.h (limited to 'mm/page_alloc.c') diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h index be98564191e6..07e0656898f9 100644 --- a/include/linux/page_ext.h +++ b/include/linux/page_ext.h @@ -4,7 +4,6 @@ #include #include -#include struct pglist_data; diff --git a/include/linux/pgalloc_tag.h b/include/linux/pgalloc_tag.h new file mode 100644 index 000000000000..66bd021eb46e --- /dev/null +++ b/include/linux/pgalloc_tag.h @@ -0,0 +1,78 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * page allocation tagging + */ +#ifndef _LINUX_PGALLOC_TAG_H +#define _LINUX_PGALLOC_TAG_H + +#include + +#ifdef CONFIG_MEM_ALLOC_PROFILING + +#include + +extern struct page_ext_operations page_alloc_tagging_ops; +extern struct page_ext *page_ext_get(struct page *page); +extern void page_ext_put(struct page_ext *page_ext); + +static inline union codetag_ref *codetag_ref_from_page_ext(struct page_ext *page_ext) +{ + return (void *)page_ext + page_alloc_tagging_ops.offset; +} + +static inline struct page_ext *page_ext_from_codetag_ref(union codetag_ref *ref) +{ + return (void *)ref - page_alloc_tagging_ops.offset; +} + +/* Should be called only if mem_alloc_profiling_enabled() */ +static inline union codetag_ref 
*get_page_tag_ref(struct page *page) +{ + if (page) { + struct page_ext *page_ext = page_ext_get(page); + + if (page_ext) + return codetag_ref_from_page_ext(page_ext); + } + return NULL; +} + +static inline void put_page_tag_ref(union codetag_ref *ref) +{ + page_ext_put(page_ext_from_codetag_ref(ref)); +} + +static inline void pgalloc_tag_add(struct page *page, struct task_struct *task, + unsigned int nr) +{ + if (mem_alloc_profiling_enabled()) { + union codetag_ref *ref = get_page_tag_ref(page); + + if (ref) { + alloc_tag_add(ref, task->alloc_tag, PAGE_SIZE * nr); + put_page_tag_ref(ref); + } + } +} + +static inline void pgalloc_tag_sub(struct page *page, unsigned int nr) +{ + if (mem_alloc_profiling_enabled()) { + union codetag_ref *ref = get_page_tag_ref(page); + + if (ref) { + alloc_tag_sub(ref, PAGE_SIZE * nr); + put_page_tag_ref(ref); + } + } +} + +#else /* CONFIG_MEM_ALLOC_PROFILING */ + +static inline void pgalloc_tag_add(struct page *page, struct task_struct *task, + unsigned int nr) {} +static inline void pgalloc_tag_sub(struct page *page, unsigned int nr) {} + +#endif /* CONFIG_MEM_ALLOC_PROFILING */ + +#endif /* _LINUX_PGALLOC_TAG_H */ diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index fa7aa32ba11a..a8f77de5e8ea 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -978,6 +978,7 @@ config MEM_ALLOC_PROFILING depends on PROC_FS depends on !DEBUG_FORCE_WEAK_PER_CPU select CODE_TAGGING + select PAGE_EXTENSION help Track allocation source code and record total allocation size initiated at that code location. The mechanism can be used to track diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c index 331dd17650f3..59bb71036837 100644 --- a/lib/alloc_tag.c +++ b/lib/alloc_tag.c @@ -3,6 +3,7 @@ #include #include #include +#include #include #include #include @@ -118,6 +119,22 @@ static bool alloc_tag_module_unload(struct codetag_type *cttype, return module_unused; } +static __init bool need_page_alloc_tagging(void) +{ + return true; +} + +static __init void init_page_alloc_tagging(void) +{ +} + +struct page_ext_operations page_alloc_tagging_ops = { + .size = sizeof(union codetag_ref), + .need = need_page_alloc_tagging, + .init = init_page_alloc_tagging, +}; +EXPORT_SYMBOL(page_alloc_tagging_ops); + static struct ctl_table memory_allocation_profiling_sysctls[] = { { .procname = "mem_profiling", diff --git a/mm/mm_init.c b/mm/mm_init.c index 549e76af8f82..2fd9bf044a79 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -24,6 +24,7 @@ #include #include #include +#include #include #include #include diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 5083ac034d26..3fd273c22749 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -54,6 +54,7 @@ #include #include #include +#include #include #include "internal.h" #include "shuffle.h" @@ -1101,6 +1102,7 @@ __always_inline bool free_pages_prepare(struct page *page, /* Do not let hwpoison pages hit pcplists/buddy */ reset_page_owner(page, order); page_table_check_free(page, order); + pgalloc_tag_sub(page, 1 << order); return false; } @@ -1140,6 +1142,7 @@ __always_inline bool free_pages_prepare(struct page *page, page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP; reset_page_owner(page, order); page_table_check_free(page, order); + pgalloc_tag_sub(page, 1 << order); if (!PageHighMem(page)) { debug_check_no_locks_freed(page_address(page), @@ -1533,6 +1536,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order, set_page_owner(page, order, gfp_flags); page_table_check_alloc(page, order); + pgalloc_tag_add(page, current, 1 << order); } 
static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags, diff --git a/mm/page_ext.c b/mm/page_ext.c index 4548fcc66d74..3c58fe8a24df 100644 --- a/mm/page_ext.c +++ b/mm/page_ext.c @@ -10,6 +10,7 @@ #include #include #include +#include /* * struct page extension @@ -82,6 +83,9 @@ static struct page_ext_operations *page_ext_ops[] __initdata = { #if defined(CONFIG_PAGE_IDLE_FLAG) && !defined(CONFIG_64BIT) &page_idle_ops, #endif +#ifdef CONFIG_MEM_ALLOC_PROFILING + &page_alloc_tagging_ops, +#endif #ifdef CONFIG_PAGE_TABLE_CHECK &page_table_check_ops, #endif -- cgit v1.2.3-59-g8ed1b From b951aaff503502a7fe066eeed2744ba8a6413c89 Mon Sep 17 00:00:00 2001 From: Suren Baghdasaryan Date: Thu, 21 Mar 2024 09:36:40 -0700 Subject: mm: enable page allocation tagging MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Redefine page allocators to record allocation tags upon their invocation. Instrument post_alloc_hook and free_pages_prepare to modify current allocation tag. [surenb@google.com: undo _noprof additions in the documentation] Link: https://lkml.kernel.org/r/20240326231453.1206227-3-surenb@google.com Link: https://lkml.kernel.org/r/20240321163705.3067592-19-surenb@google.com Signed-off-by: Suren Baghdasaryan Co-developed-by: Kent Overstreet Signed-off-by: Kent Overstreet Reviewed-by: Kees Cook Tested-by: Kees Cook Cc: Alexander Viro Cc: Alex Gaynor Cc: Alice Ryhl Cc: Andreas Hindborg Cc: Benno Lossin Cc: "Björn Roy Baron" Cc: Boqun Feng Cc: Christoph Lameter Cc: Dennis Zhou Cc: Gary Guo Cc: Miguel Ojeda Cc: Pasha Tatashin Cc: Peter Zijlstra Cc: Tejun Heo Cc: Vlastimil Babka Cc: Wedson Almeida Filho Signed-off-by: Andrew Morton --- include/linux/alloc_tag.h | 14 ++++++ include/linux/gfp.h | 126 +++++++++++++++++++++++++++++----------------- include/linux/pagemap.h | 9 ++-- mm/compaction.c | 7 ++- mm/filemap.c | 6 +-- mm/mempolicy.c | 46 ++++++++--------- mm/page_alloc.c | 52 +++++++++---------- 7 files changed, 157 insertions(+), 103 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h index 13561223fdab..8932187124da 100644 --- a/include/linux/alloc_tag.h +++ b/include/linux/alloc_tag.h @@ -153,4 +153,18 @@ static inline void alloc_tag_sub(union codetag_ref *ref, size_t bytes) {} #endif /* CONFIG_MEM_ALLOC_PROFILING */ +#define alloc_hooks_tag(_tag, _do_alloc) \ +({ \ + struct alloc_tag * __maybe_unused _old = alloc_tag_save(_tag); \ + typeof(_do_alloc) _res = _do_alloc; \ + alloc_tag_restore(_tag, _old); \ + _res; \ +}) + +#define alloc_hooks(_do_alloc) \ +({ \ + DEFINE_ALLOC_TAG(_alloc_tag); \ + alloc_hooks_tag(&_alloc_tag, _do_alloc); \ +}) + #endif /* _LINUX_ALLOC_TAG_H */ diff --git a/include/linux/gfp.h b/include/linux/gfp.h index c775ea3c6015..450c2cbcf04b 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -6,6 +6,8 @@ #include #include +#include +#include struct vm_area_struct; struct mempolicy; @@ -175,42 +177,46 @@ static inline void arch_free_page(struct page *page, int order) { } static inline void arch_alloc_page(struct page *page, int order) { } #endif -struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid, +struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order, int preferred_nid, nodemask_t *nodemask); -struct folio *__folio_alloc(gfp_t gfp, unsigned int order, int preferred_nid, +#define __alloc_pages(...) 
alloc_hooks(__alloc_pages_noprof(__VA_ARGS__)) + +struct folio *__folio_alloc_noprof(gfp_t gfp, unsigned int order, int preferred_nid, nodemask_t *nodemask); +#define __folio_alloc(...) alloc_hooks(__folio_alloc_noprof(__VA_ARGS__)) -unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid, +unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid, nodemask_t *nodemask, int nr_pages, struct list_head *page_list, struct page **page_array); +#define __alloc_pages_bulk(...) alloc_hooks(alloc_pages_bulk_noprof(__VA_ARGS__)) -unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp, +unsigned long alloc_pages_bulk_array_mempolicy_noprof(gfp_t gfp, unsigned long nr_pages, struct page **page_array); +#define alloc_pages_bulk_array_mempolicy(...) \ + alloc_hooks(alloc_pages_bulk_array_mempolicy_noprof(__VA_ARGS__)) /* Bulk allocate order-0 pages */ -static inline unsigned long -alloc_pages_bulk_list(gfp_t gfp, unsigned long nr_pages, struct list_head *list) -{ - return __alloc_pages_bulk(gfp, numa_mem_id(), NULL, nr_pages, list, NULL); -} +#define alloc_pages_bulk_list(_gfp, _nr_pages, _list) \ + __alloc_pages_bulk(_gfp, numa_mem_id(), NULL, _nr_pages, _list, NULL) -static inline unsigned long -alloc_pages_bulk_array(gfp_t gfp, unsigned long nr_pages, struct page **page_array) -{ - return __alloc_pages_bulk(gfp, numa_mem_id(), NULL, nr_pages, NULL, page_array); -} +#define alloc_pages_bulk_array(_gfp, _nr_pages, _page_array) \ + __alloc_pages_bulk(_gfp, numa_mem_id(), NULL, _nr_pages, NULL, _page_array) static inline unsigned long -alloc_pages_bulk_array_node(gfp_t gfp, int nid, unsigned long nr_pages, struct page **page_array) +alloc_pages_bulk_array_node_noprof(gfp_t gfp, int nid, unsigned long nr_pages, + struct page **page_array) { if (nid == NUMA_NO_NODE) nid = numa_mem_id(); - return __alloc_pages_bulk(gfp, nid, NULL, nr_pages, NULL, page_array); + return alloc_pages_bulk_noprof(gfp, nid, NULL, nr_pages, NULL, page_array); } +#define alloc_pages_bulk_array_node(...) \ + alloc_hooks(alloc_pages_bulk_array_node_noprof(__VA_ARGS__)) + static inline void warn_if_node_offline(int this_node, gfp_t gfp_mask) { gfp_t warn_gfp = gfp_mask & (__GFP_THISNODE|__GFP_NOWARN); @@ -230,82 +236,104 @@ static inline void warn_if_node_offline(int this_node, gfp_t gfp_mask) * online. For more general interface, see alloc_pages_node(). */ static inline struct page * -__alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order) +__alloc_pages_node_noprof(int nid, gfp_t gfp_mask, unsigned int order) { VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES); warn_if_node_offline(nid, gfp_mask); - return __alloc_pages(gfp_mask, order, nid, NULL); + return __alloc_pages_noprof(gfp_mask, order, nid, NULL); } +#define __alloc_pages_node(...) alloc_hooks(__alloc_pages_node_noprof(__VA_ARGS__)) + static inline -struct folio *__folio_alloc_node(gfp_t gfp, unsigned int order, int nid) +struct folio *__folio_alloc_node_noprof(gfp_t gfp, unsigned int order, int nid) { VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES); warn_if_node_offline(nid, gfp); - return __folio_alloc(gfp, order, nid, NULL); + return __folio_alloc_noprof(gfp, order, nid, NULL); } +#define __folio_alloc_node(...) alloc_hooks(__folio_alloc_node_noprof(__VA_ARGS__)) + /* * Allocate pages, preferring the node given as nid. When nid == NUMA_NO_NODE, * prefer the current CPU's closest node. Otherwise node must be valid and * online. 
*/ -static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask, - unsigned int order) +static inline struct page *alloc_pages_node_noprof(int nid, gfp_t gfp_mask, + unsigned int order) { if (nid == NUMA_NO_NODE) nid = numa_mem_id(); - return __alloc_pages_node(nid, gfp_mask, order); + return __alloc_pages_node_noprof(nid, gfp_mask, order); } +#define alloc_pages_node(...) alloc_hooks(alloc_pages_node_noprof(__VA_ARGS__)) + #ifdef CONFIG_NUMA -struct page *alloc_pages(gfp_t gfp, unsigned int order); -struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order, +struct page *alloc_pages_noprof(gfp_t gfp, unsigned int order); +struct page *alloc_pages_mpol_noprof(gfp_t gfp, unsigned int order, struct mempolicy *mpol, pgoff_t ilx, int nid); -struct folio *folio_alloc(gfp_t gfp, unsigned int order); -struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma, +struct folio *folio_alloc_noprof(gfp_t gfp, unsigned int order); +struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order, struct vm_area_struct *vma, unsigned long addr, bool hugepage); #else -static inline struct page *alloc_pages(gfp_t gfp_mask, unsigned int order) +static inline struct page *alloc_pages_noprof(gfp_t gfp_mask, unsigned int order) { - return alloc_pages_node(numa_node_id(), gfp_mask, order); + return alloc_pages_node_noprof(numa_node_id(), gfp_mask, order); } -static inline struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order, +static inline struct page *alloc_pages_mpol_noprof(gfp_t gfp, unsigned int order, struct mempolicy *mpol, pgoff_t ilx, int nid) { - return alloc_pages(gfp, order); + return alloc_pages_noprof(gfp, order); } -static inline struct folio *folio_alloc(gfp_t gfp, unsigned int order) +static inline struct folio *folio_alloc_noprof(gfp_t gfp, unsigned int order) { return __folio_alloc_node(gfp, order, numa_node_id()); } -#define vma_alloc_folio(gfp, order, vma, addr, hugepage) \ - folio_alloc(gfp, order) +#define vma_alloc_folio_noprof(gfp, order, vma, addr, hugepage) \ + folio_alloc_noprof(gfp, order) #endif + +#define alloc_pages(...) alloc_hooks(alloc_pages_noprof(__VA_ARGS__)) +#define alloc_pages_mpol(...) alloc_hooks(alloc_pages_mpol_noprof(__VA_ARGS__)) +#define folio_alloc(...) alloc_hooks(folio_alloc_noprof(__VA_ARGS__)) +#define vma_alloc_folio(...) alloc_hooks(vma_alloc_folio_noprof(__VA_ARGS__)) + #define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0) -static inline struct page *alloc_page_vma(gfp_t gfp, + +static inline struct page *alloc_page_vma_noprof(gfp_t gfp, struct vm_area_struct *vma, unsigned long addr) { - struct folio *folio = vma_alloc_folio(gfp, 0, vma, addr, false); + struct folio *folio = vma_alloc_folio_noprof(gfp, 0, vma, addr, false); return &folio->page; } +#define alloc_page_vma(...) alloc_hooks(alloc_page_vma_noprof(__VA_ARGS__)) + +extern unsigned long get_free_pages_noprof(gfp_t gfp_mask, unsigned int order); +#define __get_free_pages(...) alloc_hooks(get_free_pages_noprof(__VA_ARGS__)) -extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order); -extern unsigned long get_zeroed_page(gfp_t gfp_mask); +extern unsigned long get_zeroed_page_noprof(gfp_t gfp_mask); +#define get_zeroed_page(...) alloc_hooks(get_zeroed_page_noprof(__VA_ARGS__)) + +void *alloc_pages_exact_noprof(size_t size, gfp_t gfp_mask) __alloc_size(1); +#define alloc_pages_exact(...) 
alloc_hooks(alloc_pages_exact_noprof(__VA_ARGS__)) -void *alloc_pages_exact(size_t size, gfp_t gfp_mask) __alloc_size(1); void free_pages_exact(void *virt, size_t size); -__meminit void *alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask) __alloc_size(2); -#define __get_free_page(gfp_mask) \ - __get_free_pages((gfp_mask), 0) +__meminit void *alloc_pages_exact_nid_noprof(int nid, size_t size, gfp_t gfp_mask) __alloc_size(2); +#define alloc_pages_exact_nid(...) \ + alloc_hooks(alloc_pages_exact_nid_noprof(__VA_ARGS__)) + +#define __get_free_page(gfp_mask) \ + __get_free_pages((gfp_mask), 0) -#define __get_dma_pages(gfp_mask, order) \ - __get_free_pages((gfp_mask) | GFP_DMA, (order)) +#define __get_dma_pages(gfp_mask, order) \ + __get_free_pages((gfp_mask) | GFP_DMA, (order)) extern void __free_pages(struct page *page, unsigned int order); extern void free_pages(unsigned long addr, unsigned int order); @@ -374,10 +402,14 @@ extern gfp_t vma_thp_gfp_mask(struct vm_area_struct *vma); #ifdef CONFIG_CONTIG_ALLOC /* The below functions must be run on a range from a single zone. */ -extern int alloc_contig_range(unsigned long start, unsigned long end, +extern int alloc_contig_range_noprof(unsigned long start, unsigned long end, unsigned migratetype, gfp_t gfp_mask); -extern struct page *alloc_contig_pages(unsigned long nr_pages, gfp_t gfp_mask, - int nid, nodemask_t *nodemask); +#define alloc_contig_range(...) alloc_hooks(alloc_contig_range_noprof(__VA_ARGS__)) + +extern struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_mask, + int nid, nodemask_t *nodemask); +#define alloc_contig_pages(...) alloc_hooks(alloc_contig_pages_noprof(__VA_ARGS__)) + #endif void free_contig_range(unsigned long pfn, unsigned long nr_pages); diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index 2df35e65557d..35636e67e2e1 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -542,14 +542,17 @@ static inline void *detach_page_private(struct page *page) #endif #ifdef CONFIG_NUMA -struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order); +struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order); #else -static inline struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order) +static inline struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order) { - return folio_alloc(gfp, order); + return folio_alloc_noprof(gfp, order); } #endif +#define filemap_alloc_folio(...) \ + alloc_hooks(filemap_alloc_folio_noprof(__VA_ARGS__)) + static inline struct page *__page_cache_alloc(gfp_t gfp) { return &filemap_alloc_folio(gfp, 0)->page; diff --git a/mm/compaction.c b/mm/compaction.c index 807b58e6eb68..70b01190d2f3 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -1851,7 +1851,7 @@ static void isolate_freepages(struct compact_control *cc) * This is a migrate-callback that "allocates" freepages by taking pages * from the isolated freelists in the block we are migrating to. */ -static struct folio *compaction_alloc(struct folio *src, unsigned long data) +static struct folio *compaction_alloc_noprof(struct folio *src, unsigned long data) { struct compact_control *cc = (struct compact_control *)data; struct folio *dst; @@ -1898,6 +1898,11 @@ again: return page_rmappable_folio(&dst->page); } +static struct folio *compaction_alloc(struct folio *src, unsigned long data) +{ + return alloc_hooks(compaction_alloc_noprof(src, data)); +} + /* * This is a migrate-callback that "frees" freepages back to the isolated * freelist. 
All pages on the freelist are from the same zone, so there is no diff --git a/mm/filemap.c b/mm/filemap.c index 30de18c4fd28..60890758d660 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -966,7 +966,7 @@ int filemap_add_folio(struct address_space *mapping, struct folio *folio, EXPORT_SYMBOL_GPL(filemap_add_folio); #ifdef CONFIG_NUMA -struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order) +struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order) { int n; struct folio *folio; @@ -981,9 +981,9 @@ struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order) return folio; } - return folio_alloc(gfp, order); + return folio_alloc_noprof(gfp, order); } -EXPORT_SYMBOL(filemap_alloc_folio); +EXPORT_SYMBOL(filemap_alloc_folio_noprof); #endif /* diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 13100a290918..0b3def99174a 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -2201,9 +2201,9 @@ static struct page *alloc_pages_preferred_many(gfp_t gfp, unsigned int order, */ preferred_gfp = gfp | __GFP_NOWARN; preferred_gfp &= ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL); - page = __alloc_pages(preferred_gfp, order, nid, nodemask); + page = __alloc_pages_noprof(preferred_gfp, order, nid, nodemask); if (!page) - page = __alloc_pages(gfp, order, nid, NULL); + page = __alloc_pages_noprof(gfp, order, nid, NULL); return page; } @@ -2218,7 +2218,7 @@ static struct page *alloc_pages_preferred_many(gfp_t gfp, unsigned int order, * * Return: The page on success or NULL if allocation fails. */ -struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order, +struct page *alloc_pages_mpol_noprof(gfp_t gfp, unsigned int order, struct mempolicy *pol, pgoff_t ilx, int nid) { nodemask_t *nodemask; @@ -2249,7 +2249,7 @@ struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order, * First, try to allocate THP only on local node, but * don't reclaim unnecessarily, just compact. */ - page = __alloc_pages_node(nid, + page = __alloc_pages_node_noprof(nid, gfp | __GFP_THISNODE | __GFP_NORETRY, order); if (page || !(gfp & __GFP_DIRECT_RECLAIM)) return page; @@ -2262,7 +2262,7 @@ struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order, } } - page = __alloc_pages(gfp, order, nid, nodemask); + page = __alloc_pages_noprof(gfp, order, nid, nodemask); if (unlikely(pol->mode == MPOL_INTERLEAVE) && page) { /* skip NUMA_INTERLEAVE_HIT update if numa stats is disabled */ @@ -2293,7 +2293,7 @@ struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order, * * Return: The folio on success or NULL if allocation fails. */ -struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma, +struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order, struct vm_area_struct *vma, unsigned long addr, bool hugepage) { struct mempolicy *pol; @@ -2301,12 +2301,12 @@ struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma, struct page *page; pol = get_vma_policy(vma, addr, order, &ilx); - page = alloc_pages_mpol(gfp | __GFP_COMP, order, - pol, ilx, numa_node_id()); + page = alloc_pages_mpol_noprof(gfp | __GFP_COMP, order, + pol, ilx, numa_node_id()); mpol_cond_put(pol); return page_rmappable_folio(page); } -EXPORT_SYMBOL(vma_alloc_folio); +EXPORT_SYMBOL(vma_alloc_folio_noprof); /** * alloc_pages - Allocate pages. @@ -2322,7 +2322,7 @@ EXPORT_SYMBOL(vma_alloc_folio); * flags are used. * Return: The page on success or NULL if allocation fails. 
*/ -struct page *alloc_pages(gfp_t gfp, unsigned int order) +struct page *alloc_pages_noprof(gfp_t gfp, unsigned int order) { struct mempolicy *pol = &default_policy; @@ -2333,16 +2333,16 @@ struct page *alloc_pages(gfp_t gfp, unsigned int order) if (!in_interrupt() && !(gfp & __GFP_THISNODE)) pol = get_task_policy(current); - return alloc_pages_mpol(gfp, order, - pol, NO_INTERLEAVE_INDEX, numa_node_id()); + return alloc_pages_mpol_noprof(gfp, order, pol, NO_INTERLEAVE_INDEX, + numa_node_id()); } -EXPORT_SYMBOL(alloc_pages); +EXPORT_SYMBOL(alloc_pages_noprof); -struct folio *folio_alloc(gfp_t gfp, unsigned int order) +struct folio *folio_alloc_noprof(gfp_t gfp, unsigned int order) { - return page_rmappable_folio(alloc_pages(gfp | __GFP_COMP, order)); + return page_rmappable_folio(alloc_pages_noprof(gfp | __GFP_COMP, order)); } -EXPORT_SYMBOL(folio_alloc); +EXPORT_SYMBOL(folio_alloc_noprof); static unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp, struct mempolicy *pol, unsigned long nr_pages, @@ -2361,13 +2361,13 @@ static unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp, for (i = 0; i < nodes; i++) { if (delta) { - nr_allocated = __alloc_pages_bulk(gfp, + nr_allocated = alloc_pages_bulk_noprof(gfp, interleave_nodes(pol), NULL, nr_pages_per_node + 1, NULL, page_array); delta--; } else { - nr_allocated = __alloc_pages_bulk(gfp, + nr_allocated = alloc_pages_bulk_noprof(gfp, interleave_nodes(pol), NULL, nr_pages_per_node, NULL, page_array); } @@ -2504,11 +2504,11 @@ static unsigned long alloc_pages_bulk_array_preferred_many(gfp_t gfp, int nid, preferred_gfp = gfp | __GFP_NOWARN; preferred_gfp &= ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL); - nr_allocated = __alloc_pages_bulk(preferred_gfp, nid, &pol->nodes, + nr_allocated = alloc_pages_bulk_noprof(preferred_gfp, nid, &pol->nodes, nr_pages, NULL, page_array); if (nr_allocated < nr_pages) - nr_allocated += __alloc_pages_bulk(gfp, numa_node_id(), NULL, + nr_allocated += alloc_pages_bulk_noprof(gfp, numa_node_id(), NULL, nr_pages - nr_allocated, NULL, page_array + nr_allocated); return nr_allocated; @@ -2520,7 +2520,7 @@ static unsigned long alloc_pages_bulk_array_preferred_many(gfp_t gfp, int nid, * It can accelerate memory allocation especially interleaving * allocate memory. */ -unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp, +unsigned long alloc_pages_bulk_array_mempolicy_noprof(gfp_t gfp, unsigned long nr_pages, struct page **page_array) { struct mempolicy *pol = &default_policy; @@ -2544,8 +2544,8 @@ unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp, nid = numa_node_id(); nodemask = policy_nodemask(gfp, pol, NO_INTERLEAVE_INDEX, &nid); - return __alloc_pages_bulk(gfp, nid, nodemask, - nr_pages, NULL, page_array); + return alloc_pages_bulk_noprof(gfp, nid, nodemask, + nr_pages, NULL, page_array); } int vma_dup_policy(struct vm_area_struct *src, struct vm_area_struct *dst) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 3fd273c22749..d3afaf8f983a 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4391,7 +4391,7 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order, * * Returns the number of pages on the list or array. 
*/ -unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid, +unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid, nodemask_t *nodemask, int nr_pages, struct list_head *page_list, struct page **page_array) @@ -4527,7 +4527,7 @@ failed_irq: pcp_trylock_finish(UP_flags); failed: - page = __alloc_pages(gfp, 0, preferred_nid, nodemask); + page = __alloc_pages_noprof(gfp, 0, preferred_nid, nodemask); if (page) { if (page_list) list_add(&page->lru, page_list); @@ -4538,13 +4538,13 @@ failed: goto out; } -EXPORT_SYMBOL_GPL(__alloc_pages_bulk); +EXPORT_SYMBOL_GPL(alloc_pages_bulk_noprof); /* * This is the 'heart' of the zoned buddy allocator. */ -struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid, - nodemask_t *nodemask) +struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order, + int preferred_nid, nodemask_t *nodemask) { struct page *page; unsigned int alloc_flags = ALLOC_WMARK_LOW; @@ -4606,38 +4606,38 @@ out: return page; } -EXPORT_SYMBOL(__alloc_pages); +EXPORT_SYMBOL(__alloc_pages_noprof); -struct folio *__folio_alloc(gfp_t gfp, unsigned int order, int preferred_nid, +struct folio *__folio_alloc_noprof(gfp_t gfp, unsigned int order, int preferred_nid, nodemask_t *nodemask) { - struct page *page = __alloc_pages(gfp | __GFP_COMP, order, + struct page *page = __alloc_pages_noprof(gfp | __GFP_COMP, order, preferred_nid, nodemask); return page_rmappable_folio(page); } -EXPORT_SYMBOL(__folio_alloc); +EXPORT_SYMBOL(__folio_alloc_noprof); /* * Common helper functions. Never use with __GFP_HIGHMEM because the returned * address cannot represent highmem pages. Use alloc_pages and then kmap if * you need to access high mem. */ -unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order) +unsigned long get_free_pages_noprof(gfp_t gfp_mask, unsigned int order) { struct page *page; - page = alloc_pages(gfp_mask & ~__GFP_HIGHMEM, order); + page = alloc_pages_noprof(gfp_mask & ~__GFP_HIGHMEM, order); if (!page) return 0; return (unsigned long) page_address(page); } -EXPORT_SYMBOL(__get_free_pages); +EXPORT_SYMBOL(get_free_pages_noprof); -unsigned long get_zeroed_page(gfp_t gfp_mask) +unsigned long get_zeroed_page_noprof(gfp_t gfp_mask) { - return __get_free_page(gfp_mask | __GFP_ZERO); + return get_free_pages_noprof(gfp_mask | __GFP_ZERO, 0); } -EXPORT_SYMBOL(get_zeroed_page); +EXPORT_SYMBOL(get_zeroed_page_noprof); /** * __free_pages - Free pages allocated with alloc_pages(). @@ -4853,7 +4853,7 @@ static void *make_alloc_exact(unsigned long addr, unsigned int order, * * Return: pointer to the allocated area or %NULL in case of error. */ -void *alloc_pages_exact(size_t size, gfp_t gfp_mask) +void *alloc_pages_exact_noprof(size_t size, gfp_t gfp_mask) { unsigned int order = get_order(size); unsigned long addr; @@ -4861,10 +4861,10 @@ void *alloc_pages_exact(size_t size, gfp_t gfp_mask) if (WARN_ON_ONCE(gfp_mask & (__GFP_COMP | __GFP_HIGHMEM))) gfp_mask &= ~(__GFP_COMP | __GFP_HIGHMEM); - addr = __get_free_pages(gfp_mask, order); + addr = get_free_pages_noprof(gfp_mask, order); return make_alloc_exact(addr, order, size); } -EXPORT_SYMBOL(alloc_pages_exact); +EXPORT_SYMBOL(alloc_pages_exact_noprof); /** * alloc_pages_exact_nid - allocate an exact number of physically-contiguous @@ -4878,7 +4878,7 @@ EXPORT_SYMBOL(alloc_pages_exact); * * Return: pointer to the allocated area or %NULL in case of error. 
*/ -void * __meminit alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask) +void * __meminit alloc_pages_exact_nid_noprof(int nid, size_t size, gfp_t gfp_mask) { unsigned int order = get_order(size); struct page *p; @@ -4886,7 +4886,7 @@ void * __meminit alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask) if (WARN_ON_ONCE(gfp_mask & (__GFP_COMP | __GFP_HIGHMEM))) gfp_mask &= ~(__GFP_COMP | __GFP_HIGHMEM); - p = alloc_pages_node(nid, gfp_mask, order); + p = alloc_pages_node_noprof(nid, gfp_mask, order); if (!p) return NULL; return make_alloc_exact((unsigned long)page_address(p), order, size); @@ -6343,7 +6343,7 @@ int __alloc_contig_migrate_range(struct compact_control *cc, * pages which PFN is in [start, end) are allocated for the caller and * need to be freed with free_contig_range(). */ -int alloc_contig_range(unsigned long start, unsigned long end, +int alloc_contig_range_noprof(unsigned long start, unsigned long end, unsigned migratetype, gfp_t gfp_mask) { unsigned long outer_start, outer_end; @@ -6467,15 +6467,15 @@ done: undo_isolate_page_range(start, end, migratetype); return ret; } -EXPORT_SYMBOL(alloc_contig_range); +EXPORT_SYMBOL(alloc_contig_range_noprof); static int __alloc_contig_pages(unsigned long start_pfn, unsigned long nr_pages, gfp_t gfp_mask) { unsigned long end_pfn = start_pfn + nr_pages; - return alloc_contig_range(start_pfn, end_pfn, MIGRATE_MOVABLE, - gfp_mask); + return alloc_contig_range_noprof(start_pfn, end_pfn, MIGRATE_MOVABLE, + gfp_mask); } static bool pfn_range_valid_contig(struct zone *z, unsigned long start_pfn, @@ -6530,8 +6530,8 @@ static bool zone_spans_last_pfn(const struct zone *zone, * * Return: pointer to contiguous pages on success, or NULL if not successful. */ -struct page *alloc_contig_pages(unsigned long nr_pages, gfp_t gfp_mask, - int nid, nodemask_t *nodemask) +struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_mask, + int nid, nodemask_t *nodemask) { unsigned long ret, pfn, flags; struct zonelist *zonelist; -- cgit v1.2.3-59-g8ed1b From be25d1d4e822a703b19043baff916aec6ac4362d Mon Sep 17 00:00:00 2001 From: Suren Baghdasaryan Date: Thu, 21 Mar 2024 09:36:41 -0700 Subject: mm: create new codetag references during page splitting MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit When a high-order page is split into smaller ones, each newly split page should get its codetag. After the split each split page will be referencing the original codetag. The codetag's "bytes" counter remains the same because the amount of allocated memory has not changed, however the "calls" counter gets increased to keep the counter correct when these individual pages get freed. 
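As a worked illustration of that accounting (a sketch built on the helpers added earlier in this series, not part of this patch):

        struct page *p = alloc_pages(GFP_KERNEL, 2);  /* tag: bytes += 4 * PAGE_SIZE, calls += 1 */

        split_page(p, 2);    /* pgalloc_tag_split(p, 4): tail pages 1..3 now
                              * reference the same tag; bytes is unchanged,
                              * calls is bumped by 3 */

        __free_page(p + 1);  /* pgalloc_tag_sub(): bytes -= PAGE_SIZE, calls -= 1 */

Once all four order-0 pages have been freed this way, both counters return to their pre-allocation values, which is exactly why the extra "calls" increments are needed.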
Link: https://lkml.kernel.org/r/20240321163705.3067592-20-surenb@google.com Signed-off-by: Suren Baghdasaryan Reviewed-by: Vlastimil Babka Tested-by: Kees Cook Cc: Alexander Viro Cc: Alex Gaynor Cc: Alice Ryhl Cc: Andreas Hindborg Cc: Benno Lossin Cc: "Björn Roy Baron" Cc: Boqun Feng Cc: Christoph Lameter Cc: Dennis Zhou Cc: Gary Guo Cc: Kent Overstreet Cc: Miguel Ojeda Cc: Pasha Tatashin Cc: Peter Zijlstra Cc: Tejun Heo Cc: Wedson Almeida Filho Signed-off-by: Andrew Morton --- include/linux/alloc_tag.h | 9 +++++++++ include/linux/pgalloc_tag.h | 30 ++++++++++++++++++++++++++++++ mm/huge_memory.c | 2 ++ mm/page_alloc.c | 2 ++ 4 files changed, 43 insertions(+) (limited to 'mm/page_alloc.c') diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h index 8932187124da..2aea7afa1fab 100644 --- a/include/linux/alloc_tag.h +++ b/include/linux/alloc_tag.h @@ -117,6 +117,15 @@ static inline void __alloc_tag_ref_set(union codetag_ref *ref, struct alloc_tag this_cpu_inc(tag->counters->calls); } +static inline void alloc_tag_ref_set(union codetag_ref *ref, struct alloc_tag *tag) +{ + alloc_tag_add_check(ref, tag); + if (!ref || !tag) + return; + + __alloc_tag_ref_set(ref, tag); +} + static inline void alloc_tag_add(union codetag_ref *ref, struct alloc_tag *tag, size_t bytes) { alloc_tag_add_check(ref, tag); diff --git a/include/linux/pgalloc_tag.h b/include/linux/pgalloc_tag.h index 66bd021eb46e..093edf98c3d7 100644 --- a/include/linux/pgalloc_tag.h +++ b/include/linux/pgalloc_tag.h @@ -67,11 +67,41 @@ static inline void pgalloc_tag_sub(struct page *page, unsigned int nr) } } +static inline void pgalloc_tag_split(struct page *page, unsigned int nr) +{ + int i; + struct page_ext *page_ext; + union codetag_ref *ref; + struct alloc_tag *tag; + + if (!mem_alloc_profiling_enabled()) + return; + + page_ext = page_ext_get(page); + if (unlikely(!page_ext)) + return; + + ref = codetag_ref_from_page_ext(page_ext); + if (!ref->ct) + goto out; + + tag = ct_to_alloc_tag(ref->ct); + page_ext = page_ext_next(page_ext); + for (i = 1; i < nr; i++) { + /* Set new reference to point to the original tag */ + alloc_tag_ref_set(codetag_ref_from_page_ext(page_ext), tag); + page_ext = page_ext_next(page_ext); + } +out: + page_ext_put(page_ext); +} + #else /* CONFIG_MEM_ALLOC_PROFILING */ static inline void pgalloc_tag_add(struct page *page, struct task_struct *task, unsigned int nr) {} static inline void pgalloc_tag_sub(struct page *page, unsigned int nr) {} +static inline void pgalloc_tag_split(struct page *page, unsigned int nr) {} #endif /* CONFIG_MEM_ALLOC_PROFILING */ diff --git a/mm/huge_memory.c b/mm/huge_memory.c index dd88bb3ea196..25191ab70631 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -38,6 +38,7 @@ #include #include #include +#include #include #include @@ -2946,6 +2947,7 @@ static void __split_huge_page(struct page *page, struct list_head *list, /* Caller disabled irqs, so they are still disabled here */ split_page_owner(head, order, new_order); + pgalloc_tag_split(head, 1 << order); /* See comment in __split_huge_page_tail() */ if (folio_test_anon(folio)) { diff --git a/mm/page_alloc.c b/mm/page_alloc.c index d3afaf8f983a..e453ee22d489 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2630,6 +2630,7 @@ void split_page(struct page *page, unsigned int order) for (i = 1; i < (1 << order); i++) set_page_refcounted(page + i); split_page_owner(page, order, 0); + pgalloc_tag_split(page, 1 << order); split_page_memcg(page, order, 0); } EXPORT_SYMBOL_GPL(split_page); @@ -4827,6 +4828,7 @@ 
static void *make_alloc_exact(unsigned long addr, unsigned int order, struct page *last = page + nr; split_page_owner(page, order, 0); + pgalloc_tag_split(page, 1 << order); split_page_memcg(page, order, 0); while (page < --last) set_page_refcounted(last); -- cgit v1.2.3-59-g8ed1b From cc92eba1c88b1f74e0f044df2738f4e4b22f1e4e Mon Sep 17 00:00:00 2001 From: Suren Baghdasaryan Date: Thu, 21 Mar 2024 09:36:42 -0700 Subject: mm: fix non-compound multi-order memory accounting in __free_pages MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit When a non-compound multi-order page is freed, it is possible that a speculative reference keeps the page pinned. In this case we free all pages except for the first page, which will be freed later by the last put_page(). However the page passed to put_page() is indistinguishable from an order-0 page, so it cannot do the accounting, just as it cannot free the subsequent pages. Do the accounting here, where we free the pages. Link: https://lkml.kernel.org/r/20240321163705.3067592-21-surenb@google.com Reported-by: Vlastimil Babka Signed-off-by: Suren Baghdasaryan Tested-by: Kees Cook Cc: Alexander Viro Cc: Alex Gaynor Cc: Alice Ryhl Cc: Andreas Hindborg Cc: Benno Lossin Cc: "Björn Roy Baron" Cc: Boqun Feng Cc: Christoph Lameter Cc: Dennis Zhou Cc: Gary Guo Cc: Kent Overstreet Cc: Miguel Ojeda Cc: Pasha Tatashin Cc: Peter Zijlstra Cc: Tejun Heo Cc: Wedson Almeida Filho Signed-off-by: Andrew Morton --- include/linux/pgalloc_tag.h | 24 ++++++++++++++++++++++++ mm/page_alloc.c | 5 ++++- 2 files changed, 28 insertions(+), 1 deletion(-) (limited to 'mm/page_alloc.c') diff --git a/include/linux/pgalloc_tag.h b/include/linux/pgalloc_tag.h index 093edf98c3d7..50d212330bbb 100644 --- a/include/linux/pgalloc_tag.h +++ b/include/linux/pgalloc_tag.h @@ -96,12 +96,36 @@ out: page_ext_put(page_ext); } +static inline struct alloc_tag *pgalloc_tag_get(struct page *page) +{ + struct alloc_tag *tag = NULL; + + if (mem_alloc_profiling_enabled()) { + union codetag_ref *ref = get_page_tag_ref(page); + + alloc_tag_sub_check(ref); + if (ref && ref->ct) + tag = ct_to_alloc_tag(ref->ct); + put_page_tag_ref(ref); + } + + return tag; +} + +static inline void pgalloc_tag_sub_pages(struct alloc_tag *tag, unsigned int nr) +{ + if (mem_alloc_profiling_enabled() && tag) + this_cpu_sub(tag->counters->bytes, PAGE_SIZE * nr); +} + #else /* CONFIG_MEM_ALLOC_PROFILING */ static inline void pgalloc_tag_add(struct page *page, struct task_struct *task, unsigned int nr) {} static inline void pgalloc_tag_sub(struct page *page, unsigned int nr) {} static inline void pgalloc_tag_split(struct page *page, unsigned int nr) {} +static inline struct alloc_tag *pgalloc_tag_get(struct page *page) { return NULL; } +static inline void pgalloc_tag_sub_pages(struct alloc_tag *tag, unsigned int nr) {} #endif /* CONFIG_MEM_ALLOC_PROFILING */ diff --git a/mm/page_alloc.c b/mm/page_alloc.c index e453ee22d489..e1241ecef271 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4664,12 +4664,15 @@ void __free_pages(struct page *page, unsigned int order) { /* get PageHead before we drop reference */ int head = PageHead(page); + struct alloc_tag *tag = pgalloc_tag_get(page); if (put_page_testzero(page)) free_the_page(page, order); - else if (!head) + else if (!head) { + pgalloc_tag_sub_pages(tag, (1 << order) - 1); while (order-- > 0) free_the_page(page + (1 << order), order); + } } EXPORT_SYMBOL(__free_pages); -- cgit v1.2.3-59-g8ed1b From b7b098cf00a2b65d5654a86dc8edf82f125289c1 
Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Thu, 21 Mar 2024 14:24:39 +0000 Subject: mm: always initialise folio->_deferred_list Patch series "Various significant MM patches". These patches all interact in annoying ways which make it tricky to send them out in any way other than a big batch, even though there's not really an overarching theme to connect them. The big effects of this patch series are: - folio_test_hugetlb() becomes reliable, even when called without a page reference - We free up PG_slab, and we could always use more page flags - We no longer need to check PageSlab before calling page_mapcount() This patch (of 9): For compound pages which are at least order-2 (and hence have a deferred_list), initialise it and then we can check at free that the page is not part of a deferred list. We recently found this useful to rule out a source of corruption. [peterx@redhat.com: always initialise folio->_deferred_list] Link: https://lkml.kernel.org/r/20240417211836.2742593-2-peterx@redhat.com Link: https://lkml.kernel.org/r/20240321142448.1645400-1-willy@infradead.org Link: https://lkml.kernel.org/r/20240321142448.1645400-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Peter Xu Reviewed-by: David Hildenbrand Acked-by: Vlastimil Babka Cc: Miaohe Lin Cc: Muchun Song Cc: Oscar Salvador Signed-off-by: Andrew Morton --- mm/huge_memory.c | 2 -- mm/hugetlb.c | 3 ++- mm/internal.h | 2 ++ mm/memcontrol.c | 3 +++ mm/page_alloc.c | 9 +++++---- 5 files changed, 12 insertions(+), 7 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 25191ab70631..4cc7133aaa4b 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -793,8 +793,6 @@ void folio_prep_large_rmappable(struct folio *folio) { if (!folio || !folio_test_large(folio)) return; - if (folio_order(folio) > 1) - INIT_LIST_HEAD(&folio->_deferred_list); folio_set_large_rmappable(folio); } diff --git a/mm/hugetlb.c b/mm/hugetlb.c index ce7be5c24442..378181547b7b 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1796,7 +1796,8 @@ static void __update_and_free_hugetlb_folio(struct hstate *h, destroy_compound_gigantic_folio(folio, huge_page_order(h)); free_gigantic_folio(folio, huge_page_order(h)); } else { - __free_pages(&folio->page, huge_page_order(h)); + INIT_LIST_HEAD(&folio->_deferred_list); + folio_put(folio); } } diff --git a/mm/internal.h b/mm/internal.h index 85c3db43454d..5c0c57c9cd19 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -525,6 +525,8 @@ static inline void prep_compound_head(struct page *page, unsigned int order) atomic_set(&folio->_entire_mapcount, -1); atomic_set(&folio->_nr_pages_mapped, 0); atomic_set(&folio->_pincount, 0); + if (order > 1) + INIT_LIST_HEAD(&folio->_deferred_list); } static inline void prep_compound_tail(struct page *head, int tail_idx) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 896b4bf05b9c..45dd20901282 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -7400,6 +7400,9 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug) struct obj_cgroup *objcg; VM_BUG_ON_FOLIO(folio_test_lru(folio), folio); + VM_BUG_ON_FOLIO(folio_order(folio) > 1 && + !folio_test_hugetlb(folio) && + !list_empty(&folio->_deferred_list), folio); /* * Nobody should be changing or seriously looking at diff --git a/mm/page_alloc.c b/mm/page_alloc.c index e1241ecef271..7e8f4b751801 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1007,10 +1007,11 @@ static int free_tail_page_prepare(struct page *head_page, 
struct page *page) } break; case 2: - /* - * the second tail page: ->mapping is - * deferred_list.next -- ignore value. - */ + /* the second tail page: deferred_list overlaps ->mapping */ + if (unlikely(!list_empty(&folio->_deferred_list))) { + bad_page(page, "on deferred list"); + goto out; + } break; default: if (page->mapping != TAIL_MAPPING) { -- cgit v1.2.3-59-g8ed1b From 17edeb5d3f761c20fd28f6002f5a9faa53c0a0d8 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Wed, 20 Mar 2024 14:02:06 -0400 Subject: mm: page_alloc: remove pcppage migratetype caching Patch series "mm: page_alloc: freelist migratetype hygiene", v4. The page allocator's mobility grouping is intended to keep unmovable pages separate from reclaimable/compactable ones to allow on-demand defragmentation for higher-order allocations and huge pages. Currently, there are several places where accidental type mixing occurs: an allocation asks for a page of a certain migratetype and receives another. This ruins pageblocks for compaction, which in turn makes allocating huge pages more expensive and less reliable. The series addresses those causes. The last patch adds type checks on all freelist movements to prevent new violations being introduced. The benefits can be seen in a mixed workload that stresses the machine with a memcache-type workload and a kernel build job while periodically attempting to allocate batches of THP. The following data is aggregated over 50 consecutive defconfig builds: VANILLA PATCHED Hugealloc Time mean 165843.93 ( +0.00%) 113025.88 ( -31.85%) Hugealloc Time stddev 158957.35 ( +0.00%) 114716.07 ( -27.83%) Kbuild Real time 310.24 ( +0.00%) 300.73 ( -3.06%) Kbuild User time 1271.13 ( +0.00%) 1259.42 ( -0.92%) Kbuild System time 582.02 ( +0.00%) 559.79 ( -3.81%) THP fault alloc 30585.14 ( +0.00%) 40853.62 ( +33.57%) THP fault fallback 36626.46 ( +0.00%) 26357.62 ( -28.04%) THP fault fail rate % 54.49 ( +0.00%) 39.22 ( -27.53%) Pagealloc fallback 1328.00 ( +0.00%) 1.00 ( -99.85%) Pagealloc type mismatch 181009.50 ( +0.00%) 0.00 ( -100.00%) Direct compact stall 434.56 ( +0.00%) 257.66 ( -40.61%) Direct compact fail 421.70 ( +0.00%) 249.94 ( -40.63%) Direct compact success 12.86 ( +0.00%) 7.72 ( -37.09%) Direct compact success rate % 2.86 ( +0.00%) 2.82 ( -0.96%) Compact daemon scanned migrate 3370059.62 ( +0.00%) 3612054.76 ( +7.18%) Compact daemon scanned free 7718439.20 ( +0.00%) 5386385.02 ( -30.21%) Compact direct scanned migrate 309248.62 ( +0.00%) 176721.04 ( -42.85%) Compact direct scanned free 433582.84 ( +0.00%) 315727.66 ( -27.18%) Compact migrate scanned daemon % 91.20 ( +0.00%) 94.48 ( +3.56%) Compact free scanned daemon % 94.58 ( +0.00%) 94.42 ( -0.16%) Compact total migrate scanned 3679308.24 ( +0.00%) 3788775.80 ( +2.98%) Compact total free scanned 8152022.04 ( +0.00%) 5702112.68 ( -30.05%) Alloc stall 872.04 ( +0.00%) 5156.12 ( +490.71%) Pages kswapd scanned 510645.86 ( +0.00%) 3394.94 ( -99.33%) Pages kswapd reclaimed 134811.62 ( +0.00%) 2701.26 ( -98.00%) Pages direct scanned 99546.06 ( +0.00%) 376407.52 ( +278.12%) Pages direct reclaimed 62123.40 ( +0.00%) 289535.70 ( +366.06%) Pages total scanned 610191.92 ( +0.00%) 379802.46 ( -37.76%) Pages scanned kswapd % 76.36 ( +0.00%) 0.10 ( -98.58%) Swap out 12057.54 ( +0.00%) 15022.98 ( +24.59%) Swap in 209.16 ( +0.00%) 256.48 ( +22.52%) File refaults 17701.64 ( +0.00%) 11765.40 ( -33.53%) Huge page success rate is higher, allocation latencies are shorter and more predictable. Stealing (fallback) rate is drastically reduced. 
Notably, while the vanilla kernel keeps doing fallbacks on an ongoing basis, the patched kernel enters a steady state once the distribution of block types is adequate for the workload. Steals over 50 runs: VANILLA PATCHED 1504.0 227.0 1557.0 6.0 1391.0 13.0 1080.0 26.0 1057.0 40.0 1156.0 6.0 805.0 46.0 736.0 20.0 1747.0 2.0 1699.0 34.0 1269.0 13.0 1858.0 12.0 907.0 4.0 727.0 2.0 563.0 2.0 3094.0 2.0 10211.0 3.0 2621.0 1.0 5508.0 2.0 1060.0 2.0 538.0 3.0 5773.0 2.0 2199.0 0.0 3781.0 2.0 1387.0 1.0 4977.0 0.0 2865.0 1.0 1814.0 1.0 3739.0 1.0 6857.0 0.0 382.0 0.0 407.0 1.0 3784.0 0.0 297.0 0.0 298.0 0.0 6636.0 0.0 4188.0 0.0 242.0 0.0 9960.0 0.0 5816.0 0.0 354.0 0.0 287.0 0.0 261.0 0.0 140.0 1.0 2065.0 0.0 312.0 0.0 331.0 0.0 164.0 0.0 465.0 1.0 219.0 0.0 Type mismatches are down too. Those count every time an allocation request asks for one migratetype and gets another. This can still occur minimally in the patched kernel due to non-stealing fallbacks, but it's quite rare and follows the pattern of overall fallbacks - once the block type distribution settles, mismatches cease as well: VANILLA: PATCHED: 182602.0 268.0 135794.0 20.0 88619.0 19.0 95973.0 0.0 129590.0 0.0 129298.0 0.0 147134.0 0.0 230854.0 0.0 239709.0 0.0 137670.0 0.0 132430.0 0.0 65712.0 0.0 57901.0 0.0 67506.0 0.0 63565.0 4.0 34806.0 0.0 42962.0 0.0 32406.0 0.0 38668.0 0.0 61356.0 0.0 57800.0 0.0 41435.0 0.0 83456.0 0.0 65048.0 0.0 28955.0 0.0 47597.0 0.0 75117.0 0.0 55564.0 0.0 38280.0 0.0 52404.0 0.0 26264.0 0.0 37538.0 0.0 19671.0 0.0 30936.0 0.0 26933.0 0.0 16962.0 0.0 44554.0 0.0 46352.0 0.0 24995.0 0.0 35152.0 0.0 12823.0 0.0 21583.0 0.0 18129.0 0.0 31693.0 0.0 28745.0 0.0 33308.0 0.0 31114.0 0.0 35034.0 0.0 12111.0 0.0 24885.0 0.0 Compaction work is markedly reduced despite much better THP rates. In the vanilla kernel, reclaim seems to have been driven primarily by watermark boosting that happens as a result of fallbacks. With those all but eliminated, watermarks average lower and kswapd does less work. The uptick in direct reclaim is because THP requests have to fend for themselves more often - which is intended policy right now. Aggregate reclaim activity is lowered significantly, though. This patch (of 10): The idea behind the cache is to save get_pageblock_migratetype() lookups during bulk freeing. A microbenchmark suggests this isn't helping, though. The pcp migratetype can get stale, which means that bulk freeing has an extra branch to check if the pageblock was isolated while on the pcp. While the variance overlaps, the cache write and the branch seem to make this a net negative. 
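For reference, the two lookup strategies being weighed here look roughly like this in the bulk-free path (simplified sketch; the real hunks are in the diff further down):

        /* before: trust the migratetype cached in page->index when the page
         * was put on the pcp, re-reading it only if the zone has isolated
         * pageblocks */
        mt = get_pcppage_migratetype(page);
        if (unlikely(isolated_pageblocks))
                mt = get_pageblock_migratetype(page);

        /* after: one authoritative pageblock lookup at free time */
        mt = get_pfnblock_migratetype(page, page_to_pfn(page));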
The following test allocates and frees batches of 10,000 pages (~3x the pcp high marks to trigger flushing): Before: 8,668.48 msec task-clock # 99.735 CPUs utilized ( +- 2.90% ) 19 context-switches # 4.341 /sec ( +- 3.24% ) 0 cpu-migrations # 0.000 /sec 17,440 page-faults # 3.984 K/sec ( +- 2.90% ) 41,758,692,473 cycles # 9.541 GHz ( +- 2.90% ) 126,201,294,231 instructions # 5.98 insn per cycle ( +- 2.90% ) 25,348,098,335 branches # 5.791 G/sec ( +- 2.90% ) 33,436,921 branch-misses # 0.26% of all branches ( +- 2.90% ) 0.0869148 +- 0.0000302 seconds time elapsed ( +- 0.03% ) After: 8,444.81 msec task-clock # 99.726 CPUs utilized ( +- 2.90% ) 22 context-switches # 5.160 /sec ( +- 3.23% ) 0 cpu-migrations # 0.000 /sec 17,443 page-faults # 4.091 K/sec ( +- 2.90% ) 40,616,738,355 cycles # 9.527 GHz ( +- 2.90% ) 126,383,351,792 instructions # 6.16 insn per cycle ( +- 2.90% ) 25,224,985,153 branches # 5.917 G/sec ( +- 2.90% ) 32,236,793 branch-misses # 0.25% of all branches ( +- 2.90% ) 0.0846799 +- 0.0000412 seconds time elapsed ( +- 0.05% ) A side effect is that this also ensures that pages whose pageblock gets stolen while on the pcplist end up on the right freelist and we don't perform potentially type-incompatible buddy merges (or skip merges when we shouldn't), which is likely beneficial to long-term fragmentation management, although the effects would be harder to measure. Settle for simpler and faster code as justification here. Link: https://lkml.kernel.org/r/20240320180429.678181-1-hannes@cmpxchg.org Link: https://lkml.kernel.org/r/20240320180429.678181-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner Acked-by: Zi Yan Reviewed-by: Vlastimil Babka Acked-by: Mel Gorman Tested-by: "Huang, Ying" Tested-by: Baolin Wang Cc: David Hildenbrand Signed-off-by: Andrew Morton --- mm/page_alloc.c | 66 ++++++++++++--------------------------------------------- 1 file changed, 14 insertions(+), 52 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 7e8f4b751801..56a341c8b3ac 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -207,24 +207,6 @@ EXPORT_SYMBOL(node_states); gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK; -/* - * A cached value of the page's pageblock's migratetype, used when the page is - * put on a pcplist. Used to avoid the pageblock migratetype lookup when - * freeing from pcplists in most cases, at the cost of possibly becoming stale. - * Also the migratetype set in the page does not necessarily match the pcplist - * index, e.g. page might have MIGRATE_CMA set but be on a pcplist with any - * other index - this ensures that it will be put on the correct CMA freelist. 
- */ -static inline int get_pcppage_migratetype(struct page *page) -{ - return page->index; -} - -static inline void set_pcppage_migratetype(struct page *page, int migratetype) -{ - page->index = migratetype; -} - #ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE unsigned int pageblock_order __read_mostly; #endif @@ -1195,7 +1177,6 @@ static void free_pcppages_bulk(struct zone *zone, int count, { unsigned long flags; unsigned int order; - bool isolated_pageblocks; struct page *page; /* @@ -1208,7 +1189,6 @@ static void free_pcppages_bulk(struct zone *zone, int count, pindex = pindex - 1; spin_lock_irqsave(&zone->lock, flags); - isolated_pageblocks = has_isolate_pageblock(zone); while (count > 0) { struct list_head *list; @@ -1224,23 +1204,19 @@ static void free_pcppages_bulk(struct zone *zone, int count, order = pindex_to_order(pindex); nr_pages = 1 << order; do { + unsigned long pfn; int mt; page = list_last_entry(list, struct page, pcp_list); - mt = get_pcppage_migratetype(page); + pfn = page_to_pfn(page); + mt = get_pfnblock_migratetype(page, pfn); /* must delete to avoid corrupting pcp list */ list_del(&page->pcp_list); count -= nr_pages; pcp->count -= nr_pages; - /* MIGRATE_ISOLATE page should not go to pcplists */ - VM_BUG_ON_PAGE(is_migrate_isolate(mt), page); - /* Pageblock could have been isolated meanwhile */ - if (unlikely(isolated_pageblocks)) - mt = get_pageblock_migratetype(page); - - __free_one_page(page, page_to_pfn(page), zone, order, mt, FPI_NONE); + __free_one_page(page, pfn, zone, order, mt, FPI_NONE); trace_mm_page_pcpu_drain(page, order, mt); } while (count > 0 && !list_empty(list)); } @@ -1580,7 +1556,6 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order, continue; del_page_from_free_list(page, zone, current_order); expand(zone, page, order, current_order, migratetype); - set_pcppage_migratetype(page, migratetype); trace_mm_page_alloc_zone_locked(page, order, migratetype, pcp_allowed_order(order) && migratetype < MIGRATE_PCPTYPES); @@ -2151,7 +2126,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order, * pages are ordered properly. */ list_add_tail(&page->pcp_list, list); - if (is_migrate_cma(get_pcppage_migratetype(page))) + if (is_migrate_cma(get_pageblock_migratetype(page))) __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, -(1 << order)); } @@ -2347,19 +2322,6 @@ void drain_all_pages(struct zone *zone) __drain_all_pages(zone, false); } -static bool free_unref_page_prepare(struct page *page, unsigned long pfn, - unsigned int order) -{ - int migratetype; - - if (!free_pages_prepare(page, order)) - return false; - - migratetype = get_pfnblock_migratetype(page, pfn); - set_pcppage_migratetype(page, migratetype); - return true; -} - static int nr_pcp_free(struct per_cpu_pages *pcp, int batch, int high, bool free_high) { int min_nr_free, max_nr_free; @@ -2492,7 +2454,7 @@ void free_unref_page(struct page *page, unsigned int order) unsigned long pfn = page_to_pfn(page); int migratetype, pcpmigratetype; - if (!free_unref_page_prepare(page, pfn, order)) + if (!free_pages_prepare(page, order)) return; /* @@ -2502,7 +2464,7 @@ void free_unref_page(struct page *page, unsigned int order) * get those areas back if necessary. 
Otherwise, we may have to free * excessively into the page allocator */ - migratetype = pcpmigratetype = get_pcppage_migratetype(page); + migratetype = pcpmigratetype = get_pfnblock_migratetype(page, pfn); if (unlikely(migratetype >= MIGRATE_PCPTYPES)) { if (unlikely(is_migrate_isolate(migratetype))) { free_one_page(page_zone(page), page, pfn, order, migratetype, FPI_NONE); @@ -2541,14 +2503,14 @@ void free_unref_folios(struct folio_batch *folios) if (order > 0 && folio_test_large_rmappable(folio)) folio_undo_large_rmappable(folio); - if (!free_unref_page_prepare(&folio->page, pfn, order)) + if (!free_pages_prepare(&folio->page, order)) continue; /* * Free isolated folios and orders not handled on the PCP * directly to the allocator, see comment in free_unref_page. */ - migratetype = get_pcppage_migratetype(&folio->page); + migratetype = get_pfnblock_migratetype(&folio->page, pfn); if (!pcp_allowed_order(order) || is_migrate_isolate(migratetype)) { free_one_page(folio_zone(folio), &folio->page, pfn, @@ -2565,10 +2527,11 @@ void free_unref_folios(struct folio_batch *folios) for (i = 0; i < folios->nr; i++) { struct folio *folio = folios->folios[i]; struct zone *zone = folio_zone(folio); + unsigned long pfn = folio_pfn(folio); unsigned int order = (unsigned long)folio->private; folio->private = NULL; - migratetype = get_pcppage_migratetype(&folio->page); + migratetype = get_pfnblock_migratetype(&folio->page, pfn); /* Different zone requires a different pcp lock */ if (zone != locked_zone) { @@ -2585,9 +2548,8 @@ void free_unref_folios(struct folio_batch *folios) pcp = pcp_spin_trylock(zone->per_cpu_pageset); if (unlikely(!pcp)) { pcp_trylock_finish(UP_flags); - free_one_page(zone, &folio->page, - folio_pfn(folio), order, - migratetype, FPI_NONE); + free_one_page(zone, &folio->page, pfn, + order, migratetype, FPI_NONE); locked_zone = NULL; continue; } @@ -2757,7 +2719,7 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone, } } __mod_zone_freepage_state(zone, -(1 << order), - get_pcppage_migratetype(page)); + get_pageblock_migratetype(page)); spin_unlock_irqrestore(&zone->lock, flags); } while (check_new_pages(page, order)); -- cgit v1.2.3-59-g8ed1b From 9cbe97bad5cd75b5b493734bd2695febb8e95281 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Wed, 20 Mar 2024 14:02:07 -0400 Subject: mm: page_alloc: optimize free_unref_folios() Move direct freeing of isolated pages to the lock-breaking block in the second loop. This saves an unnecessary migratetype reassessment. Minor comment and local variable scoping cleanups. 
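As a rough illustration of the resulting control flow (a self-contained userspace toy, not kernel code; all types and helper names below are made up for the example), the second loop reads the migratetype once per folio, and an isolated folio is handled in the same branch that already breaks the pcp lock for a zone change, so it is freed directly without a second lookup:

/* Toy model of the restructured second loop in free_unref_folios(). */
#include <stdio.h>

enum mt { MT_MOVABLE, MT_ISOLATE };

struct folio { int zone; enum mt mt; };

static void free_one_page(struct folio *f) { printf("direct free to zone %d\n", f->zone); }
static void free_to_pcp(struct folio *f)   { printf("pcp free to zone %d\n", f->zone); }

static void free_batch(struct folio *folios, int nr)
{
    int locked_zone = -1;           /* -1: no pcp lock held */

    for (int i = 0; i < nr; i++) {
        struct folio *f = &folios[i];
        enum mt mt = f->mt;         /* single migratetype lookup */

        /* Zone changes and isolated folios share the lock-breaking branch. */
        if (f->zone != locked_zone || mt == MT_ISOLATE) {
            if (locked_zone != -1)
                locked_zone = -1;   /* pcp_spin_unlock() + pcp_trylock_finish() */
            if (mt == MT_ISOLATE) {
                free_one_page(f);   /* bypass the pcp entirely */
                continue;
            }
            locked_zone = f->zone;  /* pcp_spin_trylock() */
        }
        free_to_pcp(f);
    }
}

int main(void)
{
    struct folio batch[] = {
        { .zone = 0, .mt = MT_MOVABLE },
        { .zone = 0, .mt = MT_ISOLATE },
        { .zone = 1, .mt = MT_MOVABLE },
    };

    free_batch(batch, 3);
    return 0;
}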
Link: https://lkml.kernel.org/r/20240320180429.678181-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner Suggested-by: Vlastimil Babka Tested-by: "Huang, Ying" Reviewed-by: Vlastimil Babka Tested-by: Baolin Wang Cc: David Hildenbrand Cc: Mel Gorman Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/page_alloc.c | 32 +++++++++++++++++++++++--------- 1 file changed, 23 insertions(+), 9 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 56a341c8b3ac..5fa3d534df2f 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2493,7 +2493,7 @@ void free_unref_folios(struct folio_batch *folios) unsigned long __maybe_unused UP_flags; struct per_cpu_pages *pcp = NULL; struct zone *locked_zone = NULL; - int i, j, migratetype; + int i, j; /* Prepare folios for freeing */ for (i = 0, j = 0; i < folios->nr; i++) { @@ -2505,14 +2505,15 @@ void free_unref_folios(struct folio_batch *folios) folio_undo_large_rmappable(folio); if (!free_pages_prepare(&folio->page, order)) continue; - /* - * Free isolated folios and orders not handled on the PCP - * directly to the allocator, see comment in free_unref_page. + * Free orders not handled on the PCP directly to the + * allocator. */ - migratetype = get_pfnblock_migratetype(&folio->page, pfn); - if (!pcp_allowed_order(order) || - is_migrate_isolate(migratetype)) { + if (!pcp_allowed_order(order)) { + int migratetype; + + migratetype = get_pfnblock_migratetype(&folio->page, + pfn); free_one_page(folio_zone(folio), &folio->page, pfn, order, migratetype, FPI_NONE); continue; @@ -2529,15 +2530,29 @@ void free_unref_folios(struct folio_batch *folios) struct zone *zone = folio_zone(folio); unsigned long pfn = folio_pfn(folio); unsigned int order = (unsigned long)folio->private; + int migratetype; folio->private = NULL; migratetype = get_pfnblock_migratetype(&folio->page, pfn); /* Different zone requires a different pcp lock */ - if (zone != locked_zone) { + if (zone != locked_zone || + is_migrate_isolate(migratetype)) { if (pcp) { pcp_spin_unlock(pcp); pcp_trylock_finish(UP_flags); + locked_zone = NULL; + pcp = NULL; + } + + /* + * Free isolated pages directly to the + * allocator, see comment in free_unref_page. + */ + if (is_migrate_isolate(migratetype)) { + free_one_page(zone, &folio->page, pfn, + order, migratetype, FPI_NONE); + continue; } /* @@ -2550,7 +2565,6 @@ void free_unref_folios(struct folio_batch *folios) pcp_trylock_finish(UP_flags); free_one_page(zone, &folio->page, pfn, order, migratetype, FPI_NONE); - locked_zone = NULL; continue; } locked_zone = zone; -- cgit v1.2.3-59-g8ed1b From e6cf9e1c4cde8a53385423ecb8ca581097f42e02 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Wed, 20 Mar 2024 14:02:08 -0400 Subject: mm: page_alloc: fix up block types when merging compatible blocks The buddy allocator coalesces compatible blocks during freeing, but it doesn't update the types of the subblocks to match. When an allocation later breaks the chunk down again, its pieces will be put on freelists of the wrong type. This encourages incompatible page mixing (ask for one type, get another), and thus long-term fragmentation. Update the subblocks when merging a larger chunk, such that a later expand() will maintain freelist type hygiene. 
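As a rough sketch of the idea (a self-contained userspace toy, not the kernel implementation; block layout, types and helper names are illustrative), tracking a type per pageblock and rewriting the buddy block's type at merge time keeps a later split type-consistent:

/* Toy model: when two compatible blocks of different types merge into one
 * buddy, rewrite the buddy block's type to match, so a later split puts
 * both halves back on matching freelists.
 */
#include <stdio.h>

#define NR_BLOCKS 4

enum mt { MT_UNMOVABLE, MT_MOVABLE, MT_CMA };
static enum mt blocktype[NR_BLOCKS] = { MT_MOVABLE, MT_UNMOVABLE, MT_MOVABLE, MT_MOVABLE };

/* stand-in for migratetype_is_mergeable() */
static int mergeable(enum mt t) { return t != MT_CMA; }

/* Merge fully free block 'buddy' into block 'block'. */
static int merge_blocks(int block, int buddy)
{
    enum mt mt = blocktype[block], buddy_mt = blocktype[buddy];

    if (mt != buddy_mt) {
        if (!mergeable(mt) || !mergeable(buddy_mt))
            return 0;               /* stop merging, as before */
        blocktype[buddy] = mt;      /* the fix: make the buddy block match */
    }
    return 1;                       /* merged; a later split stays type-consistent */
}

int main(void)
{
    if (merge_blocks(0, 1))
        printf("merged; block 1 type is now %d (matches block 0: %d)\n",
               blocktype[1], blocktype[0]);
    return 0;
}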
Link: https://lkml.kernel.org/r/20240320180429.678181-4-hannes@cmpxchg.org Signed-off-by: Johannes Weiner Reviewed-by: Zi Yan Reviewed-by: Vlastimil Babka Acked-by: Mel Gorman Tested-by: "Huang, Ying" Tested-by: Baolin Wang Cc: David Hildenbrand Signed-off-by: Andrew Morton --- mm/page_alloc.c | 15 +++++++++++---- 1 file changed, 11 insertions(+), 4 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 5fa3d534df2f..2c03336def76 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -786,10 +786,17 @@ static inline void __free_one_page(struct page *page, */ int buddy_mt = get_pfnblock_migratetype(buddy, buddy_pfn); - if (migratetype != buddy_mt - && (!migratetype_is_mergeable(migratetype) || - !migratetype_is_mergeable(buddy_mt))) - goto done_merging; + if (migratetype != buddy_mt) { + if (!migratetype_is_mergeable(migratetype) || + !migratetype_is_mergeable(buddy_mt)) + goto done_merging; + /* + * Match buddy type. This ensures that + * an expand() down the line puts the + * sub-blocks on the right freelists. + */ + set_pageblock_migratetype(buddy, migratetype); + } } /* -- cgit v1.2.3-59-g8ed1b From b54ccd3c6bacbc571f7e61797fb5ff9fe3861413 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Wed, 20 Mar 2024 14:02:09 -0400 Subject: mm: page_alloc: move free pages when converting block during isolation When claiming a block during compaction isolation, move any remaining free pages to the correct freelists as well, instead of stranding them on the wrong list. Otherwise, this encourages incompatible page mixing down the line, and thus long-term fragmentation. Link: https://lkml.kernel.org/r/20240320180429.678181-5-hannes@cmpxchg.org Signed-off-by: Johannes Weiner Reviewed-by: Zi Yan Reviewed-by: Vlastimil Babka Acked-by: Mel Gorman Tested-by: "Huang, Ying" Tested-by: Baolin Wang Cc: David Hildenbrand Signed-off-by: Andrew Morton --- mm/page_alloc.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) (limited to 'mm/page_alloc.c') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 2c03336def76..dcacb86efd29 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2653,9 +2653,12 @@ int __isolate_free_page(struct page *page, unsigned int order) * Only change normal pageblocks (i.e., they can merge * with others) */ - if (migratetype_is_mergeable(mt)) + if (migratetype_is_mergeable(mt)) { set_pageblock_migratetype(page, MIGRATE_MOVABLE); + move_freepages_block(zone, page, + MIGRATE_MOVABLE, NULL); + } } } -- cgit v1.2.3-59-g8ed1b From 2dd482ba627de15d67f0c0ed445133c8ae9b201b Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Wed, 20 Mar 2024 14:02:10 -0400 Subject: mm: page_alloc: fix move_freepages_block() range error When a block is partially outside the zone of the cursor page, the function cuts the range to the pivot page instead of the zone start. This can leave large parts of the block behind, which encourages incompatible page mixing down the line (ask for one type, get another), and thus long-term fragmentation. This triggers reliably on the first block in the DMA zone, whose start_pfn is 1. The block is stolen, but everything before the pivot page (which was often hundreds of pages) is left on the old list. 
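A small standalone sketch of the pfn arithmetic involved (pageblock_order = 9 and the zone bounds are assumptions for the example): with a zone starting at pfn 1, the pivot page's block starts at pfn 0, which lies outside the zone, so the fixed code refuses the move entirely instead of moving only the range from the pivot page onward.

/* Toy pfn arithmetic for the straddling-block case described above. */
#include <stdio.h>

#define PAGEBLOCK_ORDER    9UL
#define PAGEBLOCK_NR_PAGES (1UL << PAGEBLOCK_ORDER)

static unsigned long pageblock_start_pfn(unsigned long pfn) { return pfn & ~(PAGEBLOCK_NR_PAGES - 1); }
static unsigned long pageblock_end_pfn(unsigned long pfn)   { return pageblock_start_pfn(pfn) + PAGEBLOCK_NR_PAGES; }

static int zone_spans_pfn(unsigned long zone_start, unsigned long zone_end, unsigned long pfn)
{
    return pfn >= zone_start && pfn < zone_end;
}

int main(void)
{
    unsigned long zone_start = 1, zone_end = 4096;      /* DMA zone starting at pfn 1 */
    unsigned long pfn   = 1;                            /* pivot page */
    unsigned long start = pageblock_start_pfn(pfn);     /* 0   */
    unsigned long end   = pageblock_end_pfn(pfn) - 1;   /* 511 */

    if (!zone_spans_pfn(zone_start, zone_end, start) ||
        !zone_spans_pfn(zone_start, zone_end, end))
        printf("block [%lu, %lu] straddles the zone: refuse the move (old code moved only [%lu, %lu])\n",
               start, end, pfn, end);
    return 0;
}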
Link: https://lkml.kernel.org/r/20240320180429.678181-6-hannes@cmpxchg.org Signed-off-by: Johannes Weiner Reviewed-by: Vlastimil Babka Tested-by: Baolin Wang Cc: David Hildenbrand Cc: "Huang, Ying" Cc: Mel Gorman Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/page_alloc.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index dcacb86efd29..0e223d5b94fa 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1650,9 +1650,15 @@ int move_freepages_block(struct zone *zone, struct page *page, start_pfn = pageblock_start_pfn(pfn); end_pfn = pageblock_end_pfn(pfn) - 1; - /* Do not cross zone boundaries */ + /* + * The caller only has the lock for @zone, don't touch ranges + * that straddle into other zones. While we could move part of + * the range that's inside the zone, this call is usually + * accompanied by other operations such as migratetype updates + * which also should be locked. + */ if (!zone_spans_pfn(zone, start_pfn)) - start_pfn = pfn; + return 0; if (!zone_spans_pfn(zone, end_pfn)) return 0; -- cgit v1.2.3-59-g8ed1b From c0cd6f557b9090525d288806cccbc73440ac235a Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Wed, 20 Mar 2024 14:02:11 -0400 Subject: mm: page_alloc: fix freelist movement during block conversion Currently, page block type conversion during fallbacks, atomic reservations and isolation can strand various amounts of free pages on incorrect freelists. For example, fallback stealing moves free pages in the block to the new type's freelists, but then may not actually claim the block for that type if there aren't enough compatible pages already allocated. In all cases, free page moving might fail if the block straddles more than one zone, in which case no free pages are moved at all, but the block type is changed anyway. This is detrimental to type hygiene on the freelists. It encourages incompatible page mixing down the line (ask for one type, get another) and thus contributes to long-term fragmentation. Split the process into a proper transaction: check first if conversion will happen, then try to move the free pages, and only if that was successful convert the block to the new type. 
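As a rough model of that transaction (a self-contained userspace sketch with made-up numbers and helper names, not the kernel code): probe the block first, and only when the probe succeeds and at least half of it is free or compatible do the move followed by the type change; otherwise fall back to taking a single buddy and leave the block's type alone.

/* Toy sketch of claim-as-a-transaction with the half-block heuristic. */
#include <stdbool.h>
#include <stdio.h>

#define PAGEBLOCK_NR_PAGES 512

struct block_stats { int free_pages; int alike_pages; bool straddles_zone; };

/* stand-in for a prep step that only probes and moves nothing */
static bool prep_block(const struct block_stats *b)
{
    return !b->straddles_zone;
}

static const char *claim_block(const struct block_stats *b)
{
    if (!prep_block(b))
        return "fallback: take one buddy, leave block type alone";

    if (b->free_pages + b->alike_pages >= PAGEBLOCK_NR_PAGES / 2)
        return "claim: move free pages, then set the block type";

    return "fallback: take one buddy, leave block type alone";
}

int main(void)
{
    struct block_stats mostly_ours = { .free_pages = 200, .alike_pages = 80, .straddles_zone = false };
    struct block_stats boundary    = { .free_pages = 400, .alike_pages = 0,  .straddles_zone = true  };

    printf("%s\n", claim_block(&mostly_ours));
    printf("%s\n", claim_block(&boundary));
    return 0;
}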
[baolin.wang@linux.alibaba.com: fix allocation failures with CONFIG_CMA] Link: https://lkml.kernel.org/r/a97697e0-45b0-4f71-b087-fdc7a1d43c0e@linux.alibaba.com Link: https://lkml.kernel.org/r/20240320180429.678181-7-hannes@cmpxchg.org Signed-off-by: Johannes Weiner Signed-off-by: Baolin Wang Tested-by: "Huang, Ying" Reviewed-by: Vlastimil Babka Tested-by: Baolin Wang Cc: David Hildenbrand Cc: Mel Gorman Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/page-isolation.h | 3 +- mm/page_alloc.c | 174 +++++++++++++++++++++++++---------------- mm/page_isolation.c | 22 ++++-- 3 files changed, 121 insertions(+), 78 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h index 4ac34392823a..8550b3c91480 100644 --- a/include/linux/page-isolation.h +++ b/include/linux/page-isolation.h @@ -34,8 +34,7 @@ static inline bool is_migrate_isolate(int migratetype) #define REPORT_FAILURE 0x2 void set_pageblock_migratetype(struct page *page, int migratetype); -int move_freepages_block(struct zone *zone, struct page *page, - int migratetype, int *num_movable); +int move_freepages_block(struct zone *zone, struct page *page, int migratetype); int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn, int migratetype, int flags, gfp_t gfp_flags); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 0e223d5b94fa..a7cfe65e45c1 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1601,9 +1601,8 @@ static inline struct page *__rmqueue_cma_fallback(struct zone *zone, * Note that start_page and end_pages are not aligned on a pageblock * boundary. If alignment is required, use move_freepages_block() */ -static int move_freepages(struct zone *zone, - unsigned long start_pfn, unsigned long end_pfn, - int migratetype, int *num_movable) +static int move_freepages(struct zone *zone, unsigned long start_pfn, + unsigned long end_pfn, int migratetype) { struct page *page; unsigned long pfn; @@ -1613,14 +1612,6 @@ static int move_freepages(struct zone *zone, for (pfn = start_pfn; pfn <= end_pfn;) { page = pfn_to_page(pfn); if (!PageBuddy(page)) { - /* - * We assume that pages that could be isolated for - * migration are movable. But we don't actually try - * isolating, as that would be expensive. - */ - if (num_movable && - (PageLRU(page) || __PageMovable(page))) - (*num_movable)++; pfn++; continue; } @@ -1638,17 +1629,16 @@ static int move_freepages(struct zone *zone, return pages_moved; } -int move_freepages_block(struct zone *zone, struct page *page, - int migratetype, int *num_movable) +static bool prep_move_freepages_block(struct zone *zone, struct page *page, + unsigned long *start_pfn, + unsigned long *end_pfn, + int *num_free, int *num_movable) { - unsigned long start_pfn, end_pfn, pfn; - - if (num_movable) - *num_movable = 0; + unsigned long pfn, start, end; pfn = page_to_pfn(page); - start_pfn = pageblock_start_pfn(pfn); - end_pfn = pageblock_end_pfn(pfn) - 1; + start = pageblock_start_pfn(pfn); + end = pageblock_end_pfn(pfn) - 1; /* * The caller only has the lock for @zone, don't touch ranges @@ -1657,13 +1647,50 @@ int move_freepages_block(struct zone *zone, struct page *page, * accompanied by other operations such as migratetype updates * which also should be locked. 
*/ - if (!zone_spans_pfn(zone, start_pfn)) - return 0; - if (!zone_spans_pfn(zone, end_pfn)) - return 0; + if (!zone_spans_pfn(zone, start)) + return false; + if (!zone_spans_pfn(zone, end)) + return false; + + *start_pfn = start; + *end_pfn = end; - return move_freepages(zone, start_pfn, end_pfn, migratetype, - num_movable); + if (num_free) { + *num_free = 0; + *num_movable = 0; + for (pfn = start; pfn <= end;) { + page = pfn_to_page(pfn); + if (PageBuddy(page)) { + int nr = 1 << buddy_order(page); + + *num_free += nr; + pfn += nr; + continue; + } + /* + * We assume that pages that could be isolated for + * migration are movable. But we don't actually try + * isolating, as that would be expensive. + */ + if (PageLRU(page) || __PageMovable(page)) + (*num_movable)++; + pfn++; + } + } + + return true; +} + +int move_freepages_block(struct zone *zone, struct page *page, + int migratetype) +{ + unsigned long start_pfn, end_pfn; + + if (!prep_move_freepages_block(zone, page, &start_pfn, &end_pfn, + NULL, NULL)) + return -1; + + return move_freepages(zone, start_pfn, end_pfn, migratetype); } static void change_pageblock_range(struct page *pageblock_page, @@ -1748,33 +1775,37 @@ static inline bool boost_watermark(struct zone *zone) } /* - * This function implements actual steal behaviour. If order is large enough, - * we can steal whole pageblock. If not, we first move freepages in this - * pageblock to our migratetype and determine how many already-allocated pages - * are there in the pageblock with a compatible migratetype. If at least half - * of pages are free or compatible, we can change migratetype of the pageblock - * itself, so pages freed in the future will be put on the correct free list. + * This function implements actual steal behaviour. If order is large enough, we + * can claim the whole pageblock for the requested migratetype. If not, we check + * the pageblock for constituent pages; if at least half of the pages are free + * or compatible, we can still claim the whole block, so pages freed in the + * future will be put on the correct free list. Otherwise, we isolate exactly + * the order we need from the fallback block and leave its migratetype alone. */ -static void steal_suitable_fallback(struct zone *zone, struct page *page, - unsigned int alloc_flags, int start_type, bool whole_block) +static struct page * +steal_suitable_fallback(struct zone *zone, struct page *page, + int current_order, int order, int start_type, + unsigned int alloc_flags, bool whole_block) { - unsigned int current_order = buddy_order(page); int free_pages, movable_pages, alike_pages; - int old_block_type; + unsigned long start_pfn, end_pfn; + int block_type; - old_block_type = get_pageblock_migratetype(page); + block_type = get_pageblock_migratetype(page); /* * This can happen due to races and we want to prevent broken * highatomic accounting. 
*/ - if (is_migrate_highatomic(old_block_type)) + if (is_migrate_highatomic(block_type)) goto single_page; /* Take ownership for orders >= pageblock_order */ if (current_order >= pageblock_order) { + del_page_from_free_list(page, zone, current_order); change_pageblock_range(page, current_order, start_type); - goto single_page; + expand(zone, page, order, current_order, start_type); + return page; } /* @@ -1789,10 +1820,9 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page, if (!whole_block) goto single_page; - free_pages = move_freepages_block(zone, page, start_type, - &movable_pages); /* moving whole block can fail due to zone boundary conditions */ - if (!free_pages) + if (!prep_move_freepages_block(zone, page, &start_pfn, &end_pfn, + &free_pages, &movable_pages)) goto single_page; /* @@ -1810,7 +1840,7 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page, * vice versa, be conservative since we can't distinguish the * exact migratetype of non-movable pages. */ - if (old_block_type == MIGRATE_MOVABLE) + if (block_type == MIGRATE_MOVABLE) alike_pages = pageblock_nr_pages - (free_pages + movable_pages); else @@ -1821,13 +1851,16 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page, * compatible migratability as our allocation, claim the whole block. */ if (free_pages + alike_pages >= (1 << (pageblock_order-1)) || - page_group_by_mobility_disabled) + page_group_by_mobility_disabled) { + move_freepages(zone, start_pfn, end_pfn, start_type); set_pageblock_migratetype(page, start_type); - - return; + return __rmqueue_smallest(zone, order, start_type); + } single_page: - move_to_free_list(page, zone, current_order, start_type); + del_page_from_free_list(page, zone, current_order); + expand(zone, page, order, current_order, block_type); + return page; } /* @@ -1895,9 +1928,10 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone) mt = get_pageblock_migratetype(page); /* Only reserve normal pageblocks (i.e., they can merge with others) */ if (migratetype_is_mergeable(mt)) { - zone->nr_reserved_highatomic += pageblock_nr_pages; - set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC); - move_freepages_block(zone, page, MIGRATE_HIGHATOMIC, NULL); + if (move_freepages_block(zone, page, MIGRATE_HIGHATOMIC) != -1) { + set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC); + zone->nr_reserved_highatomic += pageblock_nr_pages; + } } out_unlock: @@ -1922,7 +1956,7 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac, struct zone *zone; struct page *page; int order; - bool ret; + int ret; for_each_zone_zonelist_nodemask(zone, z, zonelist, ac->highest_zoneidx, ac->nodemask) { @@ -1971,10 +2005,14 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac, * of pageblocks that cannot be completely freed * may increase. */ + ret = move_freepages_block(zone, page, ac->migratetype); + /* + * Reserving this block already succeeded, so this should + * not fail on zone boundaries. + */ + WARN_ON_ONCE(ret == -1); set_pageblock_migratetype(page, ac->migratetype); - ret = move_freepages_block(zone, page, ac->migratetype, - NULL); - if (ret) { + if (ret > 0) { spin_unlock_irqrestore(&zone->lock, flags); return ret; } @@ -1995,7 +2033,7 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac, * deviation from the rest of this file, to make the for loop * condition simpler. 
*/ -static __always_inline bool +static __always_inline struct page * __rmqueue_fallback(struct zone *zone, int order, int start_migratetype, unsigned int alloc_flags) { @@ -2042,7 +2080,7 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype, goto do_steal; } - return false; + return NULL; find_smallest: for (current_order = order; current_order < NR_PAGE_ORDERS; current_order++) { @@ -2062,14 +2100,14 @@ find_smallest: do_steal: page = get_page_from_free_area(area, fallback_mt); - steal_suitable_fallback(zone, page, alloc_flags, start_migratetype, - can_steal); + /* take off list, maybe claim block, expand remainder */ + page = steal_suitable_fallback(zone, page, current_order, order, + start_migratetype, alloc_flags, can_steal); trace_mm_page_alloc_extfrag(page, order, current_order, start_migratetype, fallback_mt); - return true; - + return page; } /* @@ -2096,15 +2134,15 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype, return page; } } -retry: + page = __rmqueue_smallest(zone, order, migratetype); if (unlikely(!page)) { if (alloc_flags & ALLOC_CMA) page = __rmqueue_cma_fallback(zone, order); - if (!page && __rmqueue_fallback(zone, order, migratetype, - alloc_flags)) - goto retry; + if (!page) + page = __rmqueue_fallback(zone, order, migratetype, + alloc_flags); } return page; } @@ -2659,12 +2697,10 @@ int __isolate_free_page(struct page *page, unsigned int order) * Only change normal pageblocks (i.e., they can merge * with others) */ - if (migratetype_is_mergeable(mt)) { - set_pageblock_migratetype(page, - MIGRATE_MOVABLE); - move_freepages_block(zone, page, - MIGRATE_MOVABLE, NULL); - } + if (migratetype_is_mergeable(mt) && + move_freepages_block(zone, page, + MIGRATE_MOVABLE) != -1) + set_pageblock_migratetype(page, MIGRATE_MOVABLE); } } diff --git a/mm/page_isolation.c b/mm/page_isolation.c index a5c8fa4c2a75..71539d7b96cf 100644 --- a/mm/page_isolation.c +++ b/mm/page_isolation.c @@ -178,15 +178,18 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_ unmovable = has_unmovable_pages(check_unmovable_start, check_unmovable_end, migratetype, isol_flags); if (!unmovable) { - unsigned long nr_pages; + int nr_pages; int mt = get_pageblock_migratetype(page); + nr_pages = move_freepages_block(zone, page, MIGRATE_ISOLATE); + /* Block spans zone boundaries? */ + if (nr_pages == -1) { + spin_unlock_irqrestore(&zone->lock, flags); + return -EBUSY; + } + __mod_zone_freepage_state(zone, -nr_pages, mt); set_pageblock_migratetype(page, MIGRATE_ISOLATE); zone->nr_isolate_pageblock++; - nr_pages = move_freepages_block(zone, page, MIGRATE_ISOLATE, - NULL); - - __mod_zone_freepage_state(zone, -nr_pages, mt); spin_unlock_irqrestore(&zone->lock, flags); return 0; } @@ -206,7 +209,7 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_ static void unset_migratetype_isolate(struct page *page, int migratetype) { struct zone *zone; - unsigned long flags, nr_pages; + unsigned long flags; bool isolated_page = false; unsigned int order; struct page *buddy; @@ -252,7 +255,12 @@ static void unset_migratetype_isolate(struct page *page, int migratetype) * allocation. */ if (!isolated_page) { - nr_pages = move_freepages_block(zone, page, migratetype, NULL); + int nr_pages = move_freepages_block(zone, page, migratetype); + /* + * Isolating this block already succeeded, so this + * should not fail on zone boundaries. 
+ */ + WARN_ON_ONCE(nr_pages == -1); __mod_zone_freepage_state(zone, nr_pages, migratetype); } set_pageblock_migratetype(page, migratetype); -- cgit v1.2.3-59-g8ed1b From 55612e80e722ac554cc5e80df05555b4f8d40c37 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Wed, 20 Mar 2024 14:02:12 -0400 Subject: mm: page_alloc: close migratetype race between freeing and stealing There are three freeing paths that read the page's migratetype optimistically before grabbing the zone lock. When this races with block stealing, those pages go on the wrong freelist. The paths in question are: - when freeing >costly orders that aren't THP - when freeing pages to the buddy upon pcp lock contention - when freeing pages that are isolated - when freeing pages initially during boot - when freeing the remainder in alloc_pages_exact() - when "accepting" unaccepted VM host memory before first use - when freeing pages during unpoisoning None of these are so hot that they would need this optimization at the cost of hampering defrag efforts. Especially when contrasted with the fact that the most common buddy freeing path - free_pcppages_bulk - is checking the migratetype under the zone->lock just fine. In addition, isolated pages need to look up the migratetype under the lock anyway, which adds branches to the locked section, and results in a double lookup when the pages are in fact isolated. Move the lookups into the lock. Link: https://lkml.kernel.org/r/20240320180429.678181-8-hannes@cmpxchg.org Signed-off-by: Johannes Weiner Reported-by: Vlastimil Babka Reviewed-by: Vlastimil Babka Tested-by: Baolin Wang Cc: David Hildenbrand Cc: "Huang, Ying" Cc: Mel Gorman Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/page_alloc.c | 52 +++++++++++++++++++--------------------------------- 1 file changed, 19 insertions(+), 33 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index a7cfe65e45c1..289dcb434719 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1231,18 +1231,15 @@ static void free_pcppages_bulk(struct zone *zone, int count, spin_unlock_irqrestore(&zone->lock, flags); } -static void free_one_page(struct zone *zone, - struct page *page, unsigned long pfn, - unsigned int order, - int migratetype, fpi_t fpi_flags) +static void free_one_page(struct zone *zone, struct page *page, + unsigned long pfn, unsigned int order, + fpi_t fpi_flags) { unsigned long flags; + int migratetype; spin_lock_irqsave(&zone->lock, flags); - if (unlikely(has_isolate_pageblock(zone) || - is_migrate_isolate(migratetype))) { - migratetype = get_pfnblock_migratetype(page, pfn); - } + migratetype = get_pfnblock_migratetype(page, pfn); __free_one_page(page, pfn, zone, order, migratetype, fpi_flags); spin_unlock_irqrestore(&zone->lock, flags); } @@ -1250,21 +1247,13 @@ static void free_one_page(struct zone *zone, static void __free_pages_ok(struct page *page, unsigned int order, fpi_t fpi_flags) { - int migratetype; unsigned long pfn = page_to_pfn(page); struct zone *zone = page_zone(page); if (!free_pages_prepare(page, order)) return; - /* - * Calling get_pfnblock_migratetype() without spin_lock_irqsave() here - * is used to avoid calling get_pfnblock_migratetype() under the lock. - * This will reduce the lock holding time. 
- */ - migratetype = get_pfnblock_migratetype(page, pfn); - - free_one_page(zone, page, pfn, order, migratetype, fpi_flags); + free_one_page(zone, page, pfn, order, fpi_flags); __count_vm_events(PGFREE, 1 << order); } @@ -2503,7 +2492,7 @@ void free_unref_page(struct page *page, unsigned int order) struct per_cpu_pages *pcp; struct zone *zone; unsigned long pfn = page_to_pfn(page); - int migratetype, pcpmigratetype; + int migratetype; if (!free_pages_prepare(page, order)) return; @@ -2515,23 +2504,23 @@ void free_unref_page(struct page *page, unsigned int order) * get those areas back if necessary. Otherwise, we may have to free * excessively into the page allocator */ - migratetype = pcpmigratetype = get_pfnblock_migratetype(page, pfn); + migratetype = get_pfnblock_migratetype(page, pfn); if (unlikely(migratetype >= MIGRATE_PCPTYPES)) { if (unlikely(is_migrate_isolate(migratetype))) { - free_one_page(page_zone(page), page, pfn, order, migratetype, FPI_NONE); + free_one_page(page_zone(page), page, pfn, order, FPI_NONE); return; } - pcpmigratetype = MIGRATE_MOVABLE; + migratetype = MIGRATE_MOVABLE; } zone = page_zone(page); pcp_trylock_prepare(UP_flags); pcp = pcp_spin_trylock(zone->per_cpu_pageset); if (pcp) { - free_unref_page_commit(zone, pcp, page, pcpmigratetype, order); + free_unref_page_commit(zone, pcp, page, migratetype, order); pcp_spin_unlock(pcp); } else { - free_one_page(zone, page, pfn, order, migratetype, FPI_NONE); + free_one_page(zone, page, pfn, order, FPI_NONE); } pcp_trylock_finish(UP_flags); } @@ -2561,12 +2550,8 @@ void free_unref_folios(struct folio_batch *folios) * allocator. */ if (!pcp_allowed_order(order)) { - int migratetype; - - migratetype = get_pfnblock_migratetype(&folio->page, - pfn); - free_one_page(folio_zone(folio), &folio->page, pfn, - order, migratetype, FPI_NONE); + free_one_page(folio_zone(folio), &folio->page, + pfn, order, FPI_NONE); continue; } folio->private = (void *)(unsigned long)order; @@ -2602,7 +2587,7 @@ void free_unref_folios(struct folio_batch *folios) */ if (is_migrate_isolate(migratetype)) { free_one_page(zone, &folio->page, pfn, - order, migratetype, FPI_NONE); + order, FPI_NONE); continue; } @@ -2615,7 +2600,7 @@ void free_unref_folios(struct folio_batch *folios) if (unlikely(!pcp)) { pcp_trylock_finish(UP_flags); free_one_page(zone, &folio->page, pfn, - order, migratetype, FPI_NONE); + order, FPI_NONE); continue; } locked_zone = zone; @@ -6798,13 +6783,14 @@ bool take_page_off_buddy(struct page *page) bool put_page_back_buddy(struct page *page) { struct zone *zone = page_zone(page); - unsigned long pfn = page_to_pfn(page); unsigned long flags; - int migratetype = get_pfnblock_migratetype(page, pfn); bool ret = false; spin_lock_irqsave(&zone->lock, flags); if (put_page_testzero(page)) { + unsigned long pfn = page_to_pfn(page); + int migratetype = get_pfnblock_migratetype(page, pfn); + ClearPageHWPoisonTakenOff(page); __free_one_page(page, pfn, zone, 0, migratetype, FPI_NONE); if (TestClearPageHWPoison(page)) { -- cgit v1.2.3-59-g8ed1b From f37c0f6876a8eabe1477c87860460bc181f6cdbb Mon Sep 17 00:00:00 2001 From: Zi Yan Date: Wed, 20 Mar 2024 14:02:13 -0400 Subject: mm: page_alloc: set migratetype inside move_freepages() This avoids changing migratetype after move_freepages() or move_freepages_block(), which is error prone. It also prepares for upcoming changes to fix move_freepages() not moving free pages partially in the range. 
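A minimal sketch of the resulting calling convention (toy userspace code with stand-in types, not the kernel API): the move helper retags the block itself, so a caller can no longer move the free pages and then forget, or misorder, the type update.

/* Toy illustration of folding the type update into the move helper. */
#include <stdio.h>

enum mt { MT_MOVABLE, MT_HIGHATOMIC };

struct block { enum mt type; int free_pages; };

/* One call moves the block's free pages and retags the block; in the
 * kernel both steps happen together under zone->lock.
 */
static int move_freepages_block(struct block *b, enum mt new_type)
{
    int moved = b->free_pages;  /* relink every free page onto new_type's list */

    b->type = new_type;         /* the type update now lives inside the move */
    return moved;
}

int main(void)
{
    struct block b = { .type = MT_MOVABLE, .free_pages = 37 };
    int moved = move_freepages_block(&b, MT_HIGHATOMIC);

    printf("moved %d pages, block type now %d\n", moved, b.type);
    return 0;
}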
Link: https://lkml.kernel.org/r/20240320180429.678181-9-hannes@cmpxchg.org Signed-off-by: Zi Yan Signed-off-by: Johannes Weiner Reviewed-by: Vlastimil Babka Tested-by: Baolin Wang Cc: David Hildenbrand Cc: "Huang, Ying" Cc: Mel Gorman Signed-off-by: Andrew Morton --- mm/page_alloc.c | 27 +++++++++++++-------------- mm/page_isolation.c | 7 +++---- 2 files changed, 16 insertions(+), 18 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 289dcb434719..fee52ce8ab2d 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1586,9 +1586,8 @@ static inline struct page *__rmqueue_cma_fallback(struct zone *zone, #endif /* - * Move the free pages in a range to the freelist tail of the requested type. - * Note that start_page and end_pages are not aligned on a pageblock - * boundary. If alignment is required, use move_freepages_block() + * Change the type of a block and move all its free pages to that + * type's freelist. */ static int move_freepages(struct zone *zone, unsigned long start_pfn, unsigned long end_pfn, int migratetype) @@ -1598,6 +1597,9 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn, unsigned int order; int pages_moved = 0; + VM_WARN_ON(start_pfn & (pageblock_nr_pages - 1)); + VM_WARN_ON(start_pfn + pageblock_nr_pages - 1 != end_pfn); + for (pfn = start_pfn; pfn <= end_pfn;) { page = pfn_to_page(pfn); if (!PageBuddy(page)) { @@ -1615,6 +1617,8 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn, pages_moved += 1 << order; } + set_pageblock_migratetype(pfn_to_page(start_pfn), migratetype); + return pages_moved; } @@ -1842,7 +1846,6 @@ steal_suitable_fallback(struct zone *zone, struct page *page, if (free_pages + alike_pages >= (1 << (pageblock_order-1)) || page_group_by_mobility_disabled) { move_freepages(zone, start_pfn, end_pfn, start_type); - set_pageblock_migratetype(page, start_type); return __rmqueue_smallest(zone, order, start_type); } @@ -1916,12 +1919,10 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone) /* Yoink! */ mt = get_pageblock_migratetype(page); /* Only reserve normal pageblocks (i.e., they can merge with others) */ - if (migratetype_is_mergeable(mt)) { - if (move_freepages_block(zone, page, MIGRATE_HIGHATOMIC) != -1) { - set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC); + if (migratetype_is_mergeable(mt)) + if (move_freepages_block(zone, page, + MIGRATE_HIGHATOMIC) != -1) zone->nr_reserved_highatomic += pageblock_nr_pages; - } - } out_unlock: spin_unlock_irqrestore(&zone->lock, flags); @@ -2000,7 +2001,6 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac, * not fail on zone boundaries. 
*/ WARN_ON_ONCE(ret == -1); - set_pageblock_migratetype(page, ac->migratetype); if (ret > 0) { spin_unlock_irqrestore(&zone->lock, flags); return ret; } @@ -2682,10 +2682,9 @@ int __isolate_free_page(struct page *page, unsigned int order) * Only change normal pageblocks (i.e., they can merge * with others) */ - if (migratetype_is_mergeable(mt) && - move_freepages_block(zone, page, - MIGRATE_MOVABLE) != -1) - set_pageblock_migratetype(page, MIGRATE_MOVABLE); + if (migratetype_is_mergeable(mt)) + move_freepages_block(zone, page, + MIGRATE_MOVABLE); } } diff --git a/mm/page_isolation.c b/mm/page_isolation.c index 71539d7b96cf..f84f0981b2df 100644 --- a/mm/page_isolation.c +++ b/mm/page_isolation.c @@ -188,7 +188,6 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_ return -EBUSY; } __mod_zone_freepage_state(zone, -nr_pages, mt); - set_pageblock_migratetype(page, MIGRATE_ISOLATE); zone->nr_isolate_pageblock++; spin_unlock_irqrestore(&zone->lock, flags); return 0; @@ -262,10 +261,10 @@ static void unset_migratetype_isolate(struct page *page, int migratetype) */ WARN_ON_ONCE(nr_pages == -1); __mod_zone_freepage_state(zone, nr_pages, migratetype); - } - set_pageblock_migratetype(page, migratetype); - if (isolated_page) + } else { + set_pageblock_migratetype(page, migratetype); + __putback_isolated_page(page, order, migratetype); + } zone->nr_isolate_pageblock--; out: spin_unlock_irqrestore(&zone->lock, flags); -- cgit v1.2.3-59-g8ed1b From fd919a85cd55be5d00a6a7372071f44c8eafb825 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Wed, 20 Mar 2024 14:02:14 -0400 Subject: mm: page_isolation: prepare for hygienic freelists Page isolation currently sets MIGRATE_ISOLATE on a block, then drops zone->lock and scans the block for straddling buddies to split up. Because this happens non-atomically wrt the page allocator, it's possible for allocations to get a buddy whose first block is a regular pcp migratetype but whose tail is isolated. This means that in certain cases memory can still be allocated after isolation. It will also trigger the freelist type hygiene warnings in subsequent patches.

start_isolate_page_range()
  isolate_single_pageblock()
    set_migratetype_isolate(tail)
      lock zone->lock
      move_freepages_block(tail) // nop
      set_pageblock_migratetype(tail)
      unlock zone->lock
                                     __rmqueue_smallest()
                                       del_page_from_freelist(head)
                                       expand(head, head_mt)
                                         WARN(head_mt != tail_mt)
    start_pfn = ALIGN_DOWN(MAX_ORDER_NR_PAGES)
    for (pfn = start_pfn, pfn < end_pfn)
      if (PageBuddy())
        split_free_page(head)

Introduce a variant of move_freepages_block() provided by the allocator specifically for page isolation; it moves free pages, converts the block, and handles the splitting of straddling buddies while holding zone->lock. The allocator knows that pageblocks and buddies are always naturally aligned, which means that buddies can only straddle blocks if they're actually >pageblock_order. This means the search-and-split part can be simplified compared to what page isolation used to do. Also tighten up the page isolation code around the expectations of which pages can be large, and how they are freed. Based on extensive discussions with and invaluable input from Zi Yan.
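As a rough standalone model of the buddy search and split that such a helper performs (userspace toy; the orders, the fake buddy predicate and the example pfns are assumptions, not kernel code): walk back with power-of-two alignment to find a buddy head that straddles the target block, then free it again in pageblock-sized chunks, each according to its own block type.

/* Toy pfn arithmetic behind the straddling-buddy handling. */
#include <stdio.h>

#define PAGEBLOCK_ORDER 9
#define MAX_PAGE_ORDER  10

/* pretend a single order-10 buddy starts at pfn 0 */
static int buddy_head(unsigned long pfn, int *order)
{
    if (pfn == 0) { *order = 10; return 1; }
    return 0;
}

static unsigned long find_large_buddy(unsigned long start_pfn)
{
    unsigned long pfn = start_pfn;
    int order = 0, buddy_order;

    while (!buddy_head(pfn, &buddy_order)) {
        if (++order > MAX_PAGE_ORDER)
            return start_pfn;           /* nothing found */
        pfn &= ~0UL << order;           /* align down to the next possible head */
    }
    /* found a preceding buddy; report it only if it straddles start_pfn */
    return pfn + (1UL << buddy_order) > start_pfn ? pfn : start_pfn;
}

int main(void)
{
    unsigned long block_start = 512;    /* second pageblock */
    unsigned long head = find_large_buddy(block_start);

    if (head != block_start) {
        /* split the large buddy back into pageblock_order pieces */
        for (unsigned long pfn = head; pfn < head + (1UL << MAX_PAGE_ORDER); pfn += 1UL << PAGEBLOCK_ORDER)
            printf("free chunk at pfn %lu with its own block type\n", pfn);
    }
    return 0;
}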
[hannes@cmpxchg.org: work around older gcc warning] Link: https://lkml.kernel.org/r/20240321142426.GB777580@cmpxchg.org Link: https://lkml.kernel.org/r/20240320180429.678181-10-hannes@cmpxchg.org Signed-off-by: Johannes Weiner Reviewed-by: Vlastimil Babka Tested-by: Baolin Wang Cc: David Hildenbrand Cc: "Huang, Ying" Cc: Mel Gorman Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/page-isolation.h | 4 +- mm/internal.h | 4 - mm/page_alloc.c | 204 ++++++++++++++++++++++++----------------- mm/page_isolation.c | 106 +++++++-------------- 4 files changed, 155 insertions(+), 163 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h index 8550b3c91480..c16db0067090 100644 --- a/include/linux/page-isolation.h +++ b/include/linux/page-isolation.h @@ -34,7 +34,9 @@ static inline bool is_migrate_isolate(int migratetype) #define REPORT_FAILURE 0x2 void set_pageblock_migratetype(struct page *page, int migratetype); -int move_freepages_block(struct zone *zone, struct page *page, int migratetype); + +bool move_freepages_block_isolate(struct zone *zone, struct page *page, + int migratetype); int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn, int migratetype, int flags, gfp_t gfp_flags); diff --git a/mm/internal.h b/mm/internal.h index ab8250d8a591..cbc40f07537c 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -562,10 +562,6 @@ extern void *memmap_alloc(phys_addr_t size, phys_addr_t align, void memmap_init_range(unsigned long, int, unsigned long, unsigned long, unsigned long, enum meminit_context, struct vmem_altmap *, int); - -int split_free_page(struct page *free_page, - unsigned int order, unsigned long split_pfn_offset); - #if defined CONFIG_COMPACTION || defined CONFIG_CMA /* diff --git a/mm/page_alloc.c b/mm/page_alloc.c index fee52ce8ab2d..fe7853eb3ea0 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -833,64 +833,6 @@ done_merging: page_reporting_notify_free(order); } -/** - * split_free_page() -- split a free page at split_pfn_offset - * @free_page: the original free page - * @order: the order of the page - * @split_pfn_offset: split offset within the page - * - * Return -ENOENT if the free page is changed, otherwise 0 - * - * It is used when the free page crosses two pageblocks with different migratetypes - * at split_pfn_offset within the page. The split free page will be put into - * separate migratetype lists afterwards. Otherwise, the function achieves - * nothing. - */ -int split_free_page(struct page *free_page, - unsigned int order, unsigned long split_pfn_offset) -{ - struct zone *zone = page_zone(free_page); - unsigned long free_page_pfn = page_to_pfn(free_page); - unsigned long pfn; - unsigned long flags; - int free_page_order; - int mt; - int ret = 0; - - if (split_pfn_offset == 0) - return ret; - - spin_lock_irqsave(&zone->lock, flags); - - if (!PageBuddy(free_page) || buddy_order(free_page) != order) { - ret = -ENOENT; - goto out; - } - - mt = get_pfnblock_migratetype(free_page, free_page_pfn); - if (likely(!is_migrate_isolate(mt))) - __mod_zone_freepage_state(zone, -(1UL << order), mt); - - del_page_from_free_list(free_page, zone, order); - for (pfn = free_page_pfn; - pfn < free_page_pfn + (1UL << order);) { - int mt = get_pfnblock_migratetype(pfn_to_page(pfn), pfn); - - free_page_order = min_t(unsigned int, - pfn ? 
__ffs(pfn) : order, - __fls(split_pfn_offset)); - __free_one_page(pfn_to_page(pfn), pfn, zone, free_page_order, - mt, FPI_NONE); - pfn += 1UL << free_page_order; - split_pfn_offset -= (1UL << free_page_order); - /* we have done the first part, now switch to second part */ - if (split_pfn_offset == 0) - split_pfn_offset = (1UL << order) - (pfn - free_page_pfn); - } -out: - spin_unlock_irqrestore(&zone->lock, flags); - return ret; -} /* * A bad page could be due to a number of fields. Instead of multiple branches, * try and check multiple fields with one check. The caller must do a detailed @@ -1674,8 +1616,8 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page, return true; } -int move_freepages_block(struct zone *zone, struct page *page, - int migratetype) +static int move_freepages_block(struct zone *zone, struct page *page, + int migratetype) { unsigned long start_pfn, end_pfn; @@ -1686,6 +1628,123 @@ int move_freepages_block(struct zone *zone, struct page *page, return move_freepages(zone, start_pfn, end_pfn, migratetype); } +#ifdef CONFIG_MEMORY_ISOLATION +/* Look for a buddy that straddles start_pfn */ +static unsigned long find_large_buddy(unsigned long start_pfn) +{ + int order = 0; + struct page *page; + unsigned long pfn = start_pfn; + + while (!PageBuddy(page = pfn_to_page(pfn))) { + /* Nothing found */ + if (++order > MAX_PAGE_ORDER) + return start_pfn; + pfn &= ~0UL << order; + } + + /* + * Found a preceding buddy, but does it straddle? + */ + if (pfn + (1 << buddy_order(page)) > start_pfn) + return pfn; + + /* Nothing found */ + return start_pfn; +} + +/* Split a multi-block free page into its individual pageblocks */ +static void split_large_buddy(struct zone *zone, struct page *page, + unsigned long pfn, int order) +{ + unsigned long end_pfn = pfn + (1 << order); + + VM_WARN_ON_ONCE(order <= pageblock_order); + VM_WARN_ON_ONCE(pfn & (pageblock_nr_pages - 1)); + + /* Caller removed page from freelist, buddy info cleared! */ + VM_WARN_ON_ONCE(PageBuddy(page)); + + while (pfn != end_pfn) { + int mt = get_pfnblock_migratetype(page, pfn); + + __free_one_page(page, pfn, zone, pageblock_order, mt, FPI_NONE); + pfn += pageblock_nr_pages; + page = pfn_to_page(pfn); + } +} + +/** + * move_freepages_block_isolate - move free pages in block for page isolation + * @zone: the zone + * @page: the pageblock page + * @migratetype: migratetype to set on the pageblock + * + * This is similar to move_freepages_block(), but handles the special + * case encountered in page isolation, where the block of interest + * might be part of a larger buddy spanning multiple pageblocks. + * + * Unlike the regular page allocator path, which moves pages while + * stealing buddies off the freelist, page isolation is interested in + * arbitrary pfn ranges that may have overlapping buddies on both ends. + * + * This function handles that. Straddling buddies are split into + * individual pageblocks. Only the block of interest is moved. + * + * Returns %true if pages could be moved, %false otherwise. 
+ */ +bool move_freepages_block_isolate(struct zone *zone, struct page *page, + int migratetype) +{ + unsigned long start_pfn, end_pfn, pfn; + int nr_moved, mt; + + if (!prep_move_freepages_block(zone, page, &start_pfn, &end_pfn, + NULL, NULL)) + return false; + + /* No splits needed if buddies can't span multiple blocks */ + if (pageblock_order == MAX_PAGE_ORDER) + goto move; + + /* We're a tail block in a larger buddy */ + pfn = find_large_buddy(start_pfn); + if (pfn != start_pfn) { + struct page *buddy = pfn_to_page(pfn); + int order = buddy_order(buddy); + int mt = get_pfnblock_migratetype(buddy, pfn); + + if (!is_migrate_isolate(mt)) + __mod_zone_freepage_state(zone, -(1UL << order), mt); + del_page_from_free_list(buddy, zone, order); + set_pageblock_migratetype(page, migratetype); + split_large_buddy(zone, buddy, pfn, order); + return true; + } + + /* We're the starting block of a larger buddy */ + if (PageBuddy(page) && buddy_order(page) > pageblock_order) { + int mt = get_pfnblock_migratetype(page, pfn); + int order = buddy_order(page); + + if (!is_migrate_isolate(mt)) + __mod_zone_freepage_state(zone, -(1UL << order), mt); + del_page_from_free_list(page, zone, order); + set_pageblock_migratetype(page, migratetype); + split_large_buddy(zone, page, pfn, order); + return true; + } +move: + mt = get_pfnblock_migratetype(page, start_pfn); + nr_moved = move_freepages(zone, start_pfn, end_pfn, migratetype); + if (!is_migrate_isolate(mt)) + __mod_zone_freepage_state(zone, -nr_moved, mt); + else if (!is_migrate_isolate(migratetype)) + __mod_zone_freepage_state(zone, nr_moved, migratetype); + return true; +} +#endif /* CONFIG_MEMORY_ISOLATION */ + static void change_pageblock_range(struct page *pageblock_page, int start_order, int migratetype) { @@ -6365,7 +6424,6 @@ int alloc_contig_range_noprof(unsigned long start, unsigned long end, unsigned migratetype, gfp_t gfp_mask) { unsigned long outer_start, outer_end; - int order; int ret = 0; struct compact_control cc = { @@ -6438,29 +6496,7 @@ int alloc_contig_range_noprof(unsigned long start, unsigned long end, * We don't have to hold zone->lock here because the pages are * isolated thus they won't get removed from buddy. */ - - order = 0; - outer_start = start; - while (!PageBuddy(pfn_to_page(outer_start))) { - if (++order > MAX_PAGE_ORDER) { - outer_start = start; - break; - } - outer_start &= ~0UL << order; - } - - if (outer_start != start) { - order = buddy_order(pfn_to_page(outer_start)); - - /* - * outer_start page could be small order buddy page and - * it doesn't include start page. Adjust outer_start - * in this case to report failed page properly - * on tracepoint in test_pages_isolated() - */ - if (outer_start + (1UL << order) <= start) - outer_start = start; - } + outer_start = find_large_buddy(start); /* Make sure the range is really isolated. */ if (test_pages_isolated(outer_start, end, 0)) { diff --git a/mm/page_isolation.c b/mm/page_isolation.c index f84f0981b2df..042937d5abe4 100644 --- a/mm/page_isolation.c +++ b/mm/page_isolation.c @@ -178,16 +178,10 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_ unmovable = has_unmovable_pages(check_unmovable_start, check_unmovable_end, migratetype, isol_flags); if (!unmovable) { - int nr_pages; - int mt = get_pageblock_migratetype(page); - - nr_pages = move_freepages_block(zone, page, MIGRATE_ISOLATE); - /* Block spans zone boundaries? 
*/ - if (nr_pages == -1) { + if (!move_freepages_block_isolate(zone, page, MIGRATE_ISOLATE)) { spin_unlock_irqrestore(&zone->lock, flags); return -EBUSY; } - __mod_zone_freepage_state(zone, -nr_pages, mt); zone->nr_isolate_pageblock++; spin_unlock_irqrestore(&zone->lock, flags); return 0; @@ -254,13 +248,11 @@ static void unset_migratetype_isolate(struct page *page, int migratetype) * allocation. */ if (!isolated_page) { - int nr_pages = move_freepages_block(zone, page, migratetype); /* * Isolating this block already succeeded, so this * should not fail on zone boundaries. */ - WARN_ON_ONCE(nr_pages == -1); - __mod_zone_freepage_state(zone, nr_pages, migratetype); + WARN_ON_ONCE(!move_freepages_block_isolate(zone, page, migratetype)); } else { set_pageblock_migratetype(page, migratetype); __putback_isolated_page(page, order, migratetype); @@ -374,26 +366,29 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags, VM_BUG_ON(!page); pfn = page_to_pfn(page); - /* - * start_pfn is MAX_ORDER_NR_PAGES aligned, if there is any - * free pages in [start_pfn, boundary_pfn), its head page will - * always be in the range. - */ + if (PageBuddy(page)) { int order = buddy_order(page); - if (pfn + (1UL << order) > boundary_pfn) { - /* free page changed before split, check it again */ - if (split_free_page(page, order, boundary_pfn - pfn)) - continue; - } + /* move_freepages_block_isolate() handled this */ + VM_WARN_ON_ONCE(pfn + (1 << order) > boundary_pfn); pfn += 1UL << order; continue; } + /* - * migrate compound pages then let the free page handling code - * above do the rest. If migration is not possible, just fail. + * If a compound page is straddling our block, attempt + * to migrate it out of the way. + * + * We don't have to worry about this creating a large + * free page that straddles into our block: gigantic + * pages are freed as order-0 chunks, and LRU pages + * (currently) do not exceed pageblock_order. + * + * The block of interest has already been marked + * MIGRATE_ISOLATE above, so when migration is done it + * will free its pages onto the correct freelists. */ if (PageCompound(page)) { struct page *head = compound_head(page); @@ -404,16 +399,10 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags, pfn = head_pfn + nr_pages; continue; } + #if defined CONFIG_COMPACTION || defined CONFIG_CMA - /* - * hugetlb, lru compound (THP), and movable compound pages - * can be migrated. Otherwise, fail the isolation. - */ - if (PageHuge(page) || PageLRU(page) || __PageMovable(page)) { - int order; - unsigned long outer_pfn; + if (PageHuge(page)) { int page_mt = get_pageblock_migratetype(page); - bool isolate_page = !is_migrate_isolate_page(page); struct compact_control cc = { .nr_migratepages = 0, .order = -1, @@ -426,56 +415,25 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags, }; INIT_LIST_HEAD(&cc.migratepages); - /* - * XXX: mark the page as MIGRATE_ISOLATE so that - * no one else can grab the freed page after migration. - * Ideally, the page should be freed as two separate - * pages to be added into separate migratetype free - * lists. - */ - if (isolate_page) { - ret = set_migratetype_isolate(page, page_mt, - flags, head_pfn, head_pfn + nr_pages); - if (ret) - goto failed; - } - ret = __alloc_contig_migrate_range(&cc, head_pfn, head_pfn + nr_pages, page_mt); - - /* - * restore the page's migratetype so that it can - * be split into separate migratetype free lists - * later. 
- */ - if (isolate_page) - unset_migratetype_isolate(page, page_mt); - if (ret) goto failed; - /* - * reset pfn to the head of the free page, so - * that the free page handling code above can split - * the free page to the right migratetype list. - * - * head_pfn is not used here as a hugetlb page order - * can be bigger than MAX_PAGE_ORDER, but after it is - * freed, the free page order is not. Use pfn within - * the range to find the head of the free page. - */ - order = 0; - outer_pfn = pfn; - while (!PageBuddy(pfn_to_page(outer_pfn))) { - /* stop if we cannot find the free page */ - if (++order > MAX_PAGE_ORDER) - goto failed; - outer_pfn &= ~0UL << order; - } - pfn = outer_pfn; + pfn = head_pfn + nr_pages; continue; - } else + } + + /* + * These pages are movable too, but they're + * not expected to exceed pageblock_order. + * + * Let us know when they do, so we can add + * proper free and split handling for them. + */ + VM_WARN_ON_ONCE_PAGE(PageLRU(page), page); + VM_WARN_ON_ONCE_PAGE(__PageMovable(page), page); #endif - goto failed; + goto failed; } pfn++; -- cgit v1.2.3-59-g8ed1b From e0932b6c1f942fa747258e152cdce0d0b2b5be5c Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Wed, 20 Mar 2024 14:02:15 -0400 Subject: mm: page_alloc: consolidate free page accounting Free page accounting currently happens a bit too high up the call stack, where it has to deal with guard pages, compaction capturing, block stealing and even page isolation. This is subtle and fragile, and makes it difficult to hack on the code. Now that type violations on the freelists have been fixed, push the accounting down to where pages enter and leave the freelist. [hannes@cmpxchg.org: undo unrelated drive-by line wrap] Link: https://lkml.kernel.org/r/20240327185736.GA7597@cmpxchg.org [hannes@cmpxchg.org: remove unused page parameter from account_freepages()] Link: https://lkml.kernel.org/r/20240327185831.GB7597@cmpxchg.org [baolin.wang@linux.alibaba.com: fix free page accounting] Link: https://lkml.kernel.org/r/a2a48baca69f103aa431fd201f8a06e3b95e203d.1712648441.git.baolin.wang@linux.alibaba.com [andriy.shevchenko@linux.intel.com: avoid defining unused function] Link: https://lkml.kernel.org/r/20240423161506.2637177-1-andriy.shevchenko@linux.intel.com Link: https://lkml.kernel.org/r/20240320180429.678181-11-hannes@cmpxchg.org Signed-off-by: Johannes Weiner Signed-off-by: Andy Shevchenko Signed-off-by: Baolin Wang Reviewed-by: Vlastimil Babka Tested-by: Baolin Wang Cc: David Hildenbrand Cc: "Huang, Ying" Cc: Mel Gorman Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/mm.h | 18 +++-- include/linux/vmstat.h | 8 --- mm/debug_page_alloc.c | 12 +--- mm/internal.h | 5 -- mm/page_alloc.c | 192 +++++++++++++++++++++++++++---------------------- 5 files changed, 118 insertions(+), 117 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/include/linux/mm.h b/include/linux/mm.h index 60eabc6c8e00..1588fe15a38e 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3797,24 +3797,22 @@ static inline bool page_is_guard(struct page *page) return PageGuard(page); } -bool __set_page_guard(struct zone *zone, struct page *page, unsigned int order, - int migratetype); +bool __set_page_guard(struct zone *zone, struct page *page, unsigned int order); static inline bool set_page_guard(struct zone *zone, struct page *page, - unsigned int order, int migratetype) + unsigned int order) { if (!debug_guardpage_enabled()) return false; - return __set_page_guard(zone, page, order, migratetype); + return __set_page_guard(zone, 
page, order); } -void __clear_page_guard(struct zone *zone, struct page *page, unsigned int order, - int migratetype); +void __clear_page_guard(struct zone *zone, struct page *page, unsigned int order); static inline void clear_page_guard(struct zone *zone, struct page *page, - unsigned int order, int migratetype) + unsigned int order) { if (!debug_guardpage_enabled()) return; - __clear_page_guard(zone, page, order, migratetype); + __clear_page_guard(zone, page, order); } #else /* CONFIG_DEBUG_PAGEALLOC */ @@ -3824,9 +3822,9 @@ static inline unsigned int debug_guardpage_minorder(void) { return 0; } static inline bool debug_guardpage_enabled(void) { return false; } static inline bool page_is_guard(struct page *page) { return false; } static inline bool set_page_guard(struct zone *zone, struct page *page, - unsigned int order, int migratetype) { return false; } + unsigned int order) { return false; } static inline void clear_page_guard(struct zone *zone, struct page *page, - unsigned int order, int migratetype) {} + unsigned int order) {} #endif /* CONFIG_DEBUG_PAGEALLOC */ #ifdef __HAVE_ARCH_GATE_AREA diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h index 343906a98d6e..735eae6e272c 100644 --- a/include/linux/vmstat.h +++ b/include/linux/vmstat.h @@ -487,14 +487,6 @@ static inline void node_stat_sub_folio(struct folio *folio, mod_node_page_state(folio_pgdat(folio), item, -folio_nr_pages(folio)); } -static inline void __mod_zone_freepage_state(struct zone *zone, int nr_pages, - int migratetype) -{ - __mod_zone_page_state(zone, NR_FREE_PAGES, nr_pages); - if (is_migrate_cma(migratetype)) - __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, nr_pages); -} - extern const char * const vmstat_text[]; static inline const char *zone_stat_name(enum zone_stat_item item) diff --git a/mm/debug_page_alloc.c b/mm/debug_page_alloc.c index 6755f0c9d4a3..d46acf989dde 100644 --- a/mm/debug_page_alloc.c +++ b/mm/debug_page_alloc.c @@ -32,8 +32,7 @@ static int __init debug_guardpage_minorder_setup(char *buf) } early_param("debug_guardpage_minorder", debug_guardpage_minorder_setup); -bool __set_page_guard(struct zone *zone, struct page *page, unsigned int order, - int migratetype) +bool __set_page_guard(struct zone *zone, struct page *page, unsigned int order) { if (order >= debug_guardpage_minorder()) return false; @@ -41,19 +40,12 @@ bool __set_page_guard(struct zone *zone, struct page *page, unsigned int order, __SetPageGuard(page); INIT_LIST_HEAD(&page->buddy_list); set_page_private(page, order); - /* Guard pages are not available for any usage */ - if (!is_migrate_isolate(migratetype)) - __mod_zone_freepage_state(zone, -(1 << order), migratetype); return true; } -void __clear_page_guard(struct zone *zone, struct page *page, unsigned int order, - int migratetype) +void __clear_page_guard(struct zone *zone, struct page *page, unsigned int order) { __ClearPageGuard(page); - set_page_private(page, 0); - if (!is_migrate_isolate(migratetype)) - __mod_zone_freepage_state(zone, (1 << order), migratetype); } diff --git a/mm/internal.h b/mm/internal.h index cbc40f07537c..fb219e31f0f0 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -1039,11 +1039,6 @@ static inline bool is_migrate_highatomic(enum migratetype migratetype) return migratetype == MIGRATE_HIGHATOMIC; } -static inline bool is_migrate_highatomic_page(struct page *page) -{ - return get_pageblock_migratetype(page) == MIGRATE_HIGHATOMIC; -} - void setup_zone_pageset(struct zone *zone); struct migration_target_control { diff --git a/mm/page_alloc.c 
b/mm/page_alloc.c index fe7853eb3ea0..b3b98a91d6a7 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -643,23 +643,33 @@ compaction_capture(struct capture_control *capc, struct page *page, } #endif /* CONFIG_COMPACTION */ -/* Used for pages not on another list */ -static inline void add_to_free_list(struct page *page, struct zone *zone, - unsigned int order, int migratetype) +static inline void account_freepages(struct zone *zone, int nr_pages, + int migratetype) { - struct free_area *area = &zone->free_area[order]; + if (is_migrate_isolate(migratetype)) + return; - list_add(&page->buddy_list, &area->free_list[migratetype]); - area->nr_free++; + __mod_zone_page_state(zone, NR_FREE_PAGES, nr_pages); + + if (is_migrate_cma(migratetype)) + __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, nr_pages); } /* Used for pages not on another list */ -static inline void add_to_free_list_tail(struct page *page, struct zone *zone, - unsigned int order, int migratetype) +static inline void __add_to_free_list(struct page *page, struct zone *zone, + unsigned int order, int migratetype, + bool tail) { struct free_area *area = &zone->free_area[order]; - list_add_tail(&page->buddy_list, &area->free_list[migratetype]); + VM_WARN_ONCE(get_pageblock_migratetype(page) != migratetype, + "page type is %lu, passed migratetype is %d (nr=%d)\n", + get_pageblock_migratetype(page), migratetype, 1 << order); + + if (tail) + list_add_tail(&page->buddy_list, &area->free_list[migratetype]); + else + list_add(&page->buddy_list, &area->free_list[migratetype]); area->nr_free++; } @@ -669,16 +679,28 @@ static inline void add_to_free_list_tail(struct page *page, struct zone *zone, * allocation again (e.g., optimization for memory onlining). */ static inline void move_to_free_list(struct page *page, struct zone *zone, - unsigned int order, int migratetype) + unsigned int order, int old_mt, int new_mt) { struct free_area *area = &zone->free_area[order]; - list_move_tail(&page->buddy_list, &area->free_list[migratetype]); + /* Free page moving can fail, so it happens before the type update */ + VM_WARN_ONCE(get_pageblock_migratetype(page) != old_mt, + "page type is %lu, passed migratetype is %d (nr=%d)\n", + get_pageblock_migratetype(page), old_mt, 1 << order); + + list_move_tail(&page->buddy_list, &area->free_list[new_mt]); + + account_freepages(zone, -(1 << order), old_mt); + account_freepages(zone, 1 << order, new_mt); } -static inline void del_page_from_free_list(struct page *page, struct zone *zone, - unsigned int order) +static inline void __del_page_from_free_list(struct page *page, struct zone *zone, + unsigned int order, int migratetype) { + VM_WARN_ONCE(get_pageblock_migratetype(page) != migratetype, + "page type is %lu, passed migratetype is %d (nr=%d)\n", + get_pageblock_migratetype(page), migratetype, 1 << order); + /* clear reported state and update reported page count */ if (page_reported(page)) __ClearPageReported(page); @@ -689,6 +711,13 @@ static inline void del_page_from_free_list(struct page *page, struct zone *zone, zone->free_area[order].nr_free--; } +static inline void del_page_from_free_list(struct page *page, struct zone *zone, + unsigned int order, int migratetype) +{ + __del_page_from_free_list(page, zone, order, migratetype); + account_freepages(zone, -(1 << order), migratetype); +} + static inline struct page *get_page_from_free_area(struct free_area *area, int migratetype) { @@ -760,16 +789,16 @@ static inline void __free_one_page(struct page *page, VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, 
page); VM_BUG_ON(migratetype == -1); - if (likely(!is_migrate_isolate(migratetype))) - __mod_zone_freepage_state(zone, 1 << order, migratetype); - VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page); VM_BUG_ON_PAGE(bad_range(zone, page), page); + account_freepages(zone, 1 << order, migratetype); + while (order < MAX_PAGE_ORDER) { + int buddy_mt = migratetype; + if (compaction_capture(capc, page, order, migratetype)) { - __mod_zone_freepage_state(zone, -(1 << order), - migratetype); + account_freepages(zone, -(1 << order), migratetype); return; } @@ -784,19 +813,12 @@ static inline void __free_one_page(struct page *page, * pageblock isolation could cause incorrect freepage or CMA * accounting or HIGHATOMIC accounting. */ - int buddy_mt = get_pfnblock_migratetype(buddy, buddy_pfn); + buddy_mt = get_pfnblock_migratetype(buddy, buddy_pfn); - if (migratetype != buddy_mt) { - if (!migratetype_is_mergeable(migratetype) || - !migratetype_is_mergeable(buddy_mt)) - goto done_merging; - /* - * Match buddy type. This ensures that - * an expand() down the line puts the - * sub-blocks on the right freelists. - */ - set_pageblock_migratetype(buddy, migratetype); - } + if (migratetype != buddy_mt && + (!migratetype_is_mergeable(migratetype) || + !migratetype_is_mergeable(buddy_mt))) + goto done_merging; } /* @@ -804,9 +826,19 @@ static inline void __free_one_page(struct page *page, * merge with it and move up one order. */ if (page_is_guard(buddy)) - clear_page_guard(zone, buddy, order, migratetype); + clear_page_guard(zone, buddy, order); else - del_page_from_free_list(buddy, zone, order); + __del_page_from_free_list(buddy, zone, order, buddy_mt); + + if (unlikely(buddy_mt != migratetype)) { + /* + * Match buddy type. This ensures that an + * expand() down the line puts the sub-blocks + * on the right freelists. + */ + set_pageblock_migratetype(buddy, migratetype); + } + combined_pfn = buddy_pfn & pfn; page = page + (combined_pfn - pfn); pfn = combined_pfn; @@ -823,10 +855,7 @@ done_merging: else to_tail = buddy_merge_likely(pfn, buddy_pfn, page, order); - if (to_tail) - add_to_free_list_tail(page, zone, order, migratetype); - else - add_to_free_list(page, zone, order, migratetype); + __add_to_free_list(page, zone, order, migratetype, to_tail); /* Notify page reporting subsystem of freed page */ if (!(fpi_flags & FPI_SKIP_REPORT_NOTIFY)) @@ -1318,10 +1347,10 @@ static inline void expand(struct zone *zone, struct page *page, * Corresponding page table entries will not be touched, * pages will stay not present in virtual address space */ - if (set_page_guard(zone, &page[size], high, migratetype)) + if (set_page_guard(zone, &page[size], high)) continue; - add_to_free_list(&page[size], zone, high, migratetype); + add_to_free_list(&page[size], zone, high, migratetype, false); set_buddy_order(&page[size], high); } } @@ -1492,7 +1521,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order, page = get_page_from_free_area(area, migratetype); if (!page) continue; - del_page_from_free_list(page, zone, current_order); + del_page_from_free_list(page, zone, current_order, migratetype); expand(zone, page, order, current_order, migratetype); trace_mm_page_alloc_zone_locked(page, order, migratetype, pcp_allowed_order(order) && @@ -1532,7 +1561,7 @@ static inline struct page *__rmqueue_cma_fallback(struct zone *zone, * type's freelist. 
*/ static int move_freepages(struct zone *zone, unsigned long start_pfn, - unsigned long end_pfn, int migratetype) + unsigned long end_pfn, int old_mt, int new_mt) { struct page *page; unsigned long pfn; @@ -1554,12 +1583,14 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn, VM_BUG_ON_PAGE(page_zone(page) != zone, page); order = buddy_order(page); - move_to_free_list(page, zone, order, migratetype); + + move_to_free_list(page, zone, order, old_mt, new_mt); + pfn += 1 << order; pages_moved += 1 << order; } - set_pageblock_migratetype(pfn_to_page(start_pfn), migratetype); + set_pageblock_migratetype(pfn_to_page(start_pfn), new_mt); return pages_moved; } @@ -1617,7 +1648,7 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page, } static int move_freepages_block(struct zone *zone, struct page *page, - int migratetype) + int old_mt, int new_mt) { unsigned long start_pfn, end_pfn; @@ -1625,7 +1656,7 @@ static int move_freepages_block(struct zone *zone, struct page *page, NULL, NULL)) return -1; - return move_freepages(zone, start_pfn, end_pfn, migratetype); + return move_freepages(zone, start_pfn, end_pfn, old_mt, new_mt); } #ifdef CONFIG_MEMORY_ISOLATION @@ -1697,7 +1728,6 @@ bool move_freepages_block_isolate(struct zone *zone, struct page *page, int migratetype) { unsigned long start_pfn, end_pfn, pfn; - int nr_moved, mt; if (!prep_move_freepages_block(zone, page, &start_pfn, &end_pfn, NULL, NULL)) @@ -1712,11 +1742,9 @@ bool move_freepages_block_isolate(struct zone *zone, struct page *page, if (pfn != start_pfn) { struct page *buddy = pfn_to_page(pfn); int order = buddy_order(buddy); - int mt = get_pfnblock_migratetype(buddy, pfn); - if (!is_migrate_isolate(mt)) - __mod_zone_freepage_state(zone, -(1UL << order), mt); - del_page_from_free_list(buddy, zone, order); + del_page_from_free_list(buddy, zone, order, + get_pfnblock_migratetype(buddy, pfn)); set_pageblock_migratetype(page, migratetype); split_large_buddy(zone, buddy, pfn, order); return true; @@ -1724,23 +1752,17 @@ bool move_freepages_block_isolate(struct zone *zone, struct page *page, /* We're the starting block of a larger buddy */ if (PageBuddy(page) && buddy_order(page) > pageblock_order) { - int mt = get_pfnblock_migratetype(page, pfn); int order = buddy_order(page); - if (!is_migrate_isolate(mt)) - __mod_zone_freepage_state(zone, -(1UL << order), mt); - del_page_from_free_list(page, zone, order); + del_page_from_free_list(page, zone, order, + get_pfnblock_migratetype(page, pfn)); set_pageblock_migratetype(page, migratetype); split_large_buddy(zone, page, pfn, order); return true; } move: - mt = get_pfnblock_migratetype(page, start_pfn); - nr_moved = move_freepages(zone, start_pfn, end_pfn, migratetype); - if (!is_migrate_isolate(mt)) - __mod_zone_freepage_state(zone, -nr_moved, mt); - else if (!is_migrate_isolate(migratetype)) - __mod_zone_freepage_state(zone, nr_moved, migratetype); + move_freepages(zone, start_pfn, end_pfn, + get_pfnblock_migratetype(page, start_pfn), migratetype); return true; } #endif /* CONFIG_MEMORY_ISOLATION */ @@ -1854,7 +1876,7 @@ steal_suitable_fallback(struct zone *zone, struct page *page, /* Take ownership for orders >= pageblock_order */ if (current_order >= pageblock_order) { - del_page_from_free_list(page, zone, current_order); + del_page_from_free_list(page, zone, current_order, block_type); change_pageblock_range(page, current_order, start_type); expand(zone, page, order, current_order, start_type); return page; @@ -1904,12 +1926,12 @@ 
steal_suitable_fallback(struct zone *zone, struct page *page, */ if (free_pages + alike_pages >= (1 << (pageblock_order-1)) || page_group_by_mobility_disabled) { - move_freepages(zone, start_pfn, end_pfn, start_type); + move_freepages(zone, start_pfn, end_pfn, block_type, start_type); return __rmqueue_smallest(zone, order, start_type); } single_page: - del_page_from_free_list(page, zone, current_order); + del_page_from_free_list(page, zone, current_order, block_type); expand(zone, page, order, current_order, block_type); return page; } @@ -1979,7 +2001,7 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone) mt = get_pageblock_migratetype(page); /* Only reserve normal pageblocks (i.e., they can merge with others) */ if (migratetype_is_mergeable(mt)) - if (move_freepages_block(zone, page, + if (move_freepages_block(zone, page, mt, MIGRATE_HIGHATOMIC) != -1) zone->nr_reserved_highatomic += pageblock_nr_pages; @@ -2020,11 +2042,13 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac, spin_lock_irqsave(&zone->lock, flags); for (order = 0; order < NR_PAGE_ORDERS; order++) { struct free_area *area = &(zone->free_area[order]); + int mt; page = get_page_from_free_area(area, MIGRATE_HIGHATOMIC); if (!page) continue; + mt = get_pageblock_migratetype(page); /* * In page freeing path, migratetype change is racy so * we can counter several free pages in a pageblock @@ -2032,7 +2056,7 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac, * from highatomic to ac->migratetype. So we should * adjust the count once. */ - if (is_migrate_highatomic_page(page)) { + if (is_migrate_highatomic(mt)) { /* * It should never happen but changes to * locking could inadvertently allow a per-cpu @@ -2054,7 +2078,8 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac, * of pageblocks that cannot be completely freed * may increase. */ - ret = move_freepages_block(zone, page, ac->migratetype); + ret = move_freepages_block(zone, page, mt, + ac->migratetype); /* * Reserving this block already succeeded, so this should * not fail on zone boundaries. @@ -2225,12 +2250,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order, * pages are ordered properly. 
*/ list_add_tail(&page->pcp_list, list); - if (is_migrate_cma(get_pageblock_migratetype(page))) - __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, - -(1 << order)); } - - __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order)); spin_unlock_irqrestore(&zone->lock, flags); return i; @@ -2723,11 +2743,9 @@ int __isolate_free_page(struct page *page, unsigned int order) watermark = zone->_watermark[WMARK_MIN] + (1UL << order); if (!zone_watermark_ok(zone, 0, watermark, 0, ALLOC_CMA)) return 0; - - __mod_zone_freepage_state(zone, -(1UL << order), mt); } - del_page_from_free_list(page, zone, order); + del_page_from_free_list(page, zone, order, mt); /* * Set the pageblock if the isolated page is at least half of a @@ -2742,7 +2760,7 @@ int __isolate_free_page(struct page *page, unsigned int order) * with others) */ if (migratetype_is_mergeable(mt)) - move_freepages_block(zone, page, + move_freepages_block(zone, page, mt, MIGRATE_MOVABLE); } } @@ -2827,8 +2845,6 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone, return NULL; } } - __mod_zone_freepage_state(zone, -(1 << order), - get_pageblock_migratetype(page)); spin_unlock_irqrestore(&zone->lock, flags); } while (check_new_pages(page, order)); @@ -6716,8 +6732,9 @@ void __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn) BUG_ON(page_count(page)); BUG_ON(!PageBuddy(page)); + VM_WARN_ON(get_pageblock_migratetype(page) != MIGRATE_ISOLATE); order = buddy_order(page); - del_page_from_free_list(page, zone, order); + del_page_from_free_list(page, zone, order, MIGRATE_ISOLATE); pfn += (1 << order); } spin_unlock_irqrestore(&zone->lock, flags); @@ -6745,6 +6762,14 @@ bool is_free_buddy_page(struct page *page) EXPORT_SYMBOL(is_free_buddy_page); #ifdef CONFIG_MEMORY_FAILURE +static inline void add_to_free_list(struct page *page, struct zone *zone, + unsigned int order, int migratetype, + bool tail) +{ + __add_to_free_list(page, zone, order, migratetype, tail); + account_freepages(zone, 1 << order, migratetype); +} + /* * Break down a higher-order page in sub-pages, and keep our target out of * buddy allocator. 
@@ -6767,10 +6792,10 @@ static void break_down_buddy_pages(struct zone *zone, struct page *page, current_buddy = page + size; } - if (set_page_guard(zone, current_buddy, high, migratetype)) + if (set_page_guard(zone, current_buddy, high)) continue; - add_to_free_list(current_buddy, zone, high, migratetype); + add_to_free_list(current_buddy, zone, high, migratetype, false); set_buddy_order(current_buddy, high); } } @@ -6796,12 +6821,11 @@ bool take_page_off_buddy(struct page *page) int migratetype = get_pfnblock_migratetype(page_head, pfn_head); - del_page_from_free_list(page_head, zone, page_order); + del_page_from_free_list(page_head, zone, page_order, + migratetype); break_down_buddy_pages(zone, page_head, page, 0, page_order, migratetype); SetPageHWPoisonTakenOff(page); - if (!is_migrate_isolate(migratetype)) - __mod_zone_freepage_state(zone, -1, migratetype); ret = true; break; } @@ -6909,7 +6933,7 @@ static bool try_to_accept_memory_one(struct zone *zone) list_del(&page->lru); last = list_empty(&zone->unaccepted_pages); - __mod_zone_freepage_state(zone, -MAX_ORDER_NR_PAGES, MIGRATE_MOVABLE); + account_freepages(zone, -MAX_ORDER_NR_PAGES, MIGRATE_MOVABLE); __mod_zone_page_state(zone, NR_UNACCEPTED, -MAX_ORDER_NR_PAGES); spin_unlock_irqrestore(&zone->lock, flags); @@ -6961,7 +6985,7 @@ static bool __free_unaccepted(struct page *page) spin_lock_irqsave(&zone->lock, flags); first = list_empty(&zone->unaccepted_pages); list_add_tail(&page->lru, &zone->unaccepted_pages); - __mod_zone_freepage_state(zone, MAX_ORDER_NR_PAGES, MIGRATE_MOVABLE); + account_freepages(zone, MAX_ORDER_NR_PAGES, MIGRATE_MOVABLE); __mod_zone_page_state(zone, NR_UNACCEPTED, MAX_ORDER_NR_PAGES); spin_unlock_irqrestore(&zone->lock, flags); -- cgit v1.2.3-59-g8ed1b From e1f42a577f63647dadf1abe4583053c03d6be045 Mon Sep 17 00:00:00 2001 From: Vlastimil Babka Date: Thu, 25 Apr 2024 20:56:04 -0700 Subject: mm: page_alloc: change move_freepages() to __move_freepages_block() The function is now supposed to be called only on a single pageblock and checks start_pfn and end_pfn accordingly. Rename it to make this more obvious and drop the end_pfn parameter which can be determined trivially and none of the callers use it for anything else. Also make the (now internal) end_pfn exclusive, which is more common. Link: https://lkml.kernel.org/r/81b1d642-2ec0-49f5-89fc-19a3828419ff@suse.cz Signed-off-by: Vlastimil Babka Reviewed-by: Zi Yan Acked-by: Johannes Weiner Cc: David Hildenbrand Cc: "Huang, Ying" Cc: Mel Gorman Signed-off-by: Andrew Morton --- mm/page_alloc.c | 43 ++++++++++++++++++++----------------------- 1 file changed, 20 insertions(+), 23 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index b3b98a91d6a7..489dd74d2232 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1560,18 +1560,18 @@ static inline struct page *__rmqueue_cma_fallback(struct zone *zone, * Change the type of a block and move all its free pages to that * type's freelist. 
*/ -static int move_freepages(struct zone *zone, unsigned long start_pfn, - unsigned long end_pfn, int old_mt, int new_mt) +static int __move_freepages_block(struct zone *zone, unsigned long start_pfn, + int old_mt, int new_mt) { struct page *page; - unsigned long pfn; + unsigned long pfn, end_pfn; unsigned int order; int pages_moved = 0; VM_WARN_ON(start_pfn & (pageblock_nr_pages - 1)); - VM_WARN_ON(start_pfn + pageblock_nr_pages - 1 != end_pfn); + end_pfn = pageblock_end_pfn(start_pfn); - for (pfn = start_pfn; pfn <= end_pfn;) { + for (pfn = start_pfn; pfn < end_pfn;) { page = pfn_to_page(pfn); if (!PageBuddy(page)) { pfn++; @@ -1597,14 +1597,13 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn, static bool prep_move_freepages_block(struct zone *zone, struct page *page, unsigned long *start_pfn, - unsigned long *end_pfn, int *num_free, int *num_movable) { unsigned long pfn, start, end; pfn = page_to_pfn(page); start = pageblock_start_pfn(pfn); - end = pageblock_end_pfn(pfn) - 1; + end = pageblock_end_pfn(pfn); /* * The caller only has the lock for @zone, don't touch ranges @@ -1615,16 +1614,15 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page, */ if (!zone_spans_pfn(zone, start)) return false; - if (!zone_spans_pfn(zone, end)) + if (!zone_spans_pfn(zone, end - 1)) return false; *start_pfn = start; - *end_pfn = end; if (num_free) { *num_free = 0; *num_movable = 0; - for (pfn = start; pfn <= end;) { + for (pfn = start; pfn < end;) { page = pfn_to_page(pfn); if (PageBuddy(page)) { int nr = 1 << buddy_order(page); @@ -1650,13 +1648,12 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page, static int move_freepages_block(struct zone *zone, struct page *page, int old_mt, int new_mt) { - unsigned long start_pfn, end_pfn; + unsigned long start_pfn; - if (!prep_move_freepages_block(zone, page, &start_pfn, &end_pfn, - NULL, NULL)) + if (!prep_move_freepages_block(zone, page, &start_pfn, NULL, NULL)) return -1; - return move_freepages(zone, start_pfn, end_pfn, old_mt, new_mt); + return __move_freepages_block(zone, start_pfn, old_mt, new_mt); } #ifdef CONFIG_MEMORY_ISOLATION @@ -1727,10 +1724,9 @@ static void split_large_buddy(struct zone *zone, struct page *page, bool move_freepages_block_isolate(struct zone *zone, struct page *page, int migratetype) { - unsigned long start_pfn, end_pfn, pfn; + unsigned long start_pfn, pfn; - if (!prep_move_freepages_block(zone, page, &start_pfn, &end_pfn, - NULL, NULL)) + if (!prep_move_freepages_block(zone, page, &start_pfn, NULL, NULL)) return false; /* No splits needed if buddies can't span multiple blocks */ @@ -1761,8 +1757,9 @@ bool move_freepages_block_isolate(struct zone *zone, struct page *page, return true; } move: - move_freepages(zone, start_pfn, end_pfn, - get_pfnblock_migratetype(page, start_pfn), migratetype); + __move_freepages_block(zone, start_pfn, + get_pfnblock_migratetype(page, start_pfn), + migratetype); return true; } #endif /* CONFIG_MEMORY_ISOLATION */ @@ -1862,7 +1859,7 @@ steal_suitable_fallback(struct zone *zone, struct page *page, unsigned int alloc_flags, bool whole_block) { int free_pages, movable_pages, alike_pages; - unsigned long start_pfn, end_pfn; + unsigned long start_pfn; int block_type; block_type = get_pageblock_migratetype(page); @@ -1895,8 +1892,8 @@ steal_suitable_fallback(struct zone *zone, struct page *page, goto single_page; /* moving whole block can fail due to zone boundary conditions */ - if (!prep_move_freepages_block(zone, page, &start_pfn, 
&end_pfn, - + if (!prep_move_freepages_block(zone, page, &start_pfn, &free_pages, + &movable_pages)) goto single_page; /* @@ -1926,7 +1923,7 @@ steal_suitable_fallback(struct zone *zone, struct page *page, */ if (free_pages + alike_pages >= (1 << (pageblock_order-1)) || page_group_by_mobility_disabled) { - move_freepages(zone, start_pfn, end_pfn, block_type, start_type); + __move_freepages_block(zone, start_pfn, block_type, start_type); return __rmqueue_smallest(zone, order, start_type); } -- cgit v1.2.3-59-g8ed1b From 883dd161e9a83e188487debc562b1928917a4b39 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Wed, 27 Mar 2024 15:01:11 -0400 Subject: mm: page_alloc: batch vmstat updates in expand() expand() currently updates vmstat for every subpage. This is unnecessary, since they're all of the same zone and migratetype. Count added pages locally, then do a single vmstat update. Link: https://lkml.kernel.org/r/20240327190111.GC7597@cmpxchg.org Signed-off-by: Johannes Weiner Suggested-by: Vlastimil Babka Signed-off-by: Andrew Morton --- mm/page_alloc.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) (limited to 'mm/page_alloc.c') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 489dd74d2232..382d1c98f8e5 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1335,6 +1335,7 @@ static inline void expand(struct zone *zone, struct page *page, int low, int high, int migratetype) { unsigned long size = 1 << high; + unsigned long nr_added = 0; while (high > low) { high--; @@ -1350,9 +1351,11 @@ static inline void expand(struct zone *zone, struct page *page, if (set_page_guard(zone, &page[size], high)) continue; - add_to_free_list(&page[size], zone, high, migratetype, false); + __add_to_free_list(&page[size], zone, high, migratetype, false); set_buddy_order(&page[size], high); + nr_added += size; } + account_freepages(zone, nr_added, migratetype); } static void check_new_page_bad(struct page *page) -- cgit v1.2.3-59-g8ed1b From e42dfe4e0a51b476dcc6f1461c51fdb1b76573aa Mon Sep 17 00:00:00 2001 From: Baolin Wang Date: Wed, 6 Mar 2024 18:13:26 +0800 Subject: mm: record the migration reason for struct migration_target_control Patch series "make the hugetlb migration strategy consistent", v2. As discussed in previous thread [1], there is an inconsistency when handling hugetlb migration. When handling the migration of freed hugetlb, it prevents fallback to other NUMA nodes in alloc_and_dissolve_hugetlb_folio(). However, when dealing with in-use hugetlb, it allows fallback to other NUMA nodes in alloc_hugetlb_folio_nodemask(), which can break the per-node hugetlb pool and might result in unexpected failures when node-bound workloads don't get what is assumed available. This patchset tries to make the hugetlb migration strategy clearer and more consistent. Please find details in each patch. [1] https://lore.kernel.org/all/6f26ce22d2fcd523418a085f2c588fe0776d46e7.1706794035.git.baolin.wang@linux.alibaba.com/ This patch (of 2): To support different hugetlb allocation strategies during hugetlb migration based on various migration reasons, record the migration reason in the migration_target_control structure as a preparation.
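As a purely illustrative aside, not part of this series (only the migration_target_control fields and the MR_* reason constants come from the patch; the helper name and the policy shown are hypothetical), a target-allocation callback could key its behaviour off the recorded reason roughly like this:

static struct folio *alloc_by_reason(struct folio *src, unsigned long private)
{
	struct migration_target_control *mtc = (void *)private;
	gfp_t gfp = mtc->gfp_mask;

	/* hypothetical policy: keep long-term pins on the preferred node */
	if (mtc->reason == MR_LONGTERM_PIN)
		gfp |= __GFP_THISNODE;

	return __folio_alloc(gfp, folio_order(src), mtc->nid, mtc->nmask);
}

The actual hugetlb policy built on top of mtc->reason comes in the second patch of this series and is not shown here.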
Link: https://lkml.kernel.org/r/cover.1709719720.git.baolin.wang@linux.alibaba.com Link: https://lkml.kernel.org/r/7b95d4981e07211f57139fc5b1f7ce91b920cee4.1709719720.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang Reviewed-by: Oscar Salvador Cc: David Hildenbrand Cc: Miaohe Lin Cc: Michal Hocko Cc: Muchun Song Cc: Naoya Horiguchi Signed-off-by: Andrew Morton --- mm/gup.c | 1 + mm/internal.h | 1 + mm/memory-failure.c | 1 + mm/memory_hotplug.c | 1 + mm/mempolicy.c | 1 + mm/migrate.c | 1 + mm/page_alloc.c | 1 + mm/vmscan.c | 3 ++- 8 files changed, 9 insertions(+), 1 deletion(-) (limited to 'mm/page_alloc.c') diff --git a/mm/gup.c b/mm/gup.c index 8433d3dc31fc..6d8d15f8c7f9 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -2145,6 +2145,7 @@ static int migrate_longterm_unpinnable_pages( struct migration_target_control mtc = { .nid = NUMA_NO_NODE, .gfp_mask = GFP_USER | __GFP_NOWARN, + .reason = MR_LONGTERM_PIN, }; if (migrate_pages(movable_page_list, alloc_migration_target, diff --git a/mm/internal.h b/mm/internal.h index fb219e31f0f0..63bdac6d0413 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -1045,6 +1045,7 @@ struct migration_target_control { int nid; /* preferred node id */ nodemask_t *nmask; gfp_t gfp_mask; + enum migrate_reason reason; }; /* diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 0a7a8a4ba421..9e50586f2e37 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -2669,6 +2669,7 @@ static int soft_offline_in_use_page(struct page *page) struct migration_target_control mtc = { .nid = NUMA_NO_NODE, .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL, + .reason = MR_MEMORY_FAILURE, }; if (!huge && folio_test_large(folio)) { diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index a444e2d7dd2b..b79ba36e09e0 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1841,6 +1841,7 @@ static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn) struct migration_target_control mtc = { .nmask = &nmask, .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL, + .reason = MR_MEMORY_HOTPLUG, }; int ret; diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 0b3def99174a..e128e6b7bbcb 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1070,6 +1070,7 @@ static long migrate_to_node(struct mm_struct *mm, int source, int dest, struct migration_target_control mtc = { .nid = dest, .gfp_mask = GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, + .reason = MR_SYSCALL, }; nodes_clear(nmask); diff --git a/mm/migrate.c b/mm/migrate.c index 73a052a382f1..bde63010a3cf 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2060,6 +2060,7 @@ static int do_move_pages_to_node(struct list_head *pagelist, int node) struct migration_target_control mtc = { .nid = node, .gfp_mask = GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, + .reason = MR_SYSCALL, }; err = migrate_pages(pagelist, alloc_migration_target, NULL, diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 382d1c98f8e5..daab8cab91cc 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6351,6 +6351,7 @@ int __alloc_contig_migrate_range(struct compact_control *cc, struct migration_target_control mtc = { .nid = zone_to_nid(cc->zone), .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL, + .reason = MR_CONTIG_RANGE, }; struct page *page; unsigned long total_mapped = 0; diff --git a/mm/vmscan.c b/mm/vmscan.c index 3ef654addd44..289121e76753 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -967,7 +967,8 @@ static unsigned int demote_folio_list(struct list_head *demote_folios, .gfp_mask = (GFP_HIGHUSER_MOVABLE & 
~__GFP_RECLAIM) | __GFP_NOWARN | __GFP_NOMEMALLOC | GFP_NOWAIT, .nid = target_nid, - .nmask = &allowed_mask + .nmask = &allowed_mask, + .reason = MR_DEMOTION, }; if (list_empty(demote_folios)) -- cgit v1.2.3-59-g8ed1b From 2ace5a670e2a0f3a6029bc6cf91dc14202074e7a Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Tue, 26 Mar 2024 17:10:27 +0000 Subject: mm: make is_free_buddy_page() take a const argument This function does not modify its argument; let the callers know that so they can make better optimisation decisions. Link: https://lkml.kernel.org/r/20240326171045.410737-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton --- include/linux/page-flags.h | 2 +- mm/page_alloc.c | 8 ++++---- 2 files changed, 5 insertions(+), 5 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 0163d6da6283..eaecf544039f 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -1088,7 +1088,7 @@ static inline bool is_page_hwpoison(const struct page *page) return folio_test_hugetlb(folio) && PageHWPoison(&folio->page); } -extern bool is_free_buddy_page(struct page *page); +bool is_free_buddy_page(const struct page *page); PAGEFLAG(Isolated, isolated, PF_ANY); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index daab8cab91cc..6a8593aef297 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6745,16 +6745,16 @@ void __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn) /* * This function returns a stable result only if called under zone lock. */ -bool is_free_buddy_page(struct page *page) +bool is_free_buddy_page(const struct page *page) { unsigned long pfn = page_to_pfn(page); unsigned int order; for (order = 0; order < NR_PAGE_ORDERS; order++) { - struct page *page_head = page - (pfn & ((1 << order) - 1)); + const struct page *head = page - (pfn & ((1 << order) - 1)); - if (PageBuddy(page_head) && - buddy_order_unsafe(page_head) >= order) + if (PageBuddy(head) && + buddy_order_unsafe(head) >= order) break; } -- cgit v1.2.3-59-g8ed1b From bb8ea62daa7cfd634d1a43815666948311074d0c Mon Sep 17 00:00:00 2001 From: Baoquan He Date: Tue, 26 Mar 2024 14:11:30 +0800 Subject: mm/page_alloc.c: remove unneeded codes in !NUMA version of build_zonelists() When CONFIG_NUMA=n, MAX_NUMNODES is always 1 because Kconfig item NODES_SHIFT depends on NUMA. So in !NUMA version of build_zonelists(), no need to bother with the two for loop because code execution won't enter them ever. Here, remove those unneeded codes in !NUMA version of build_zonelists(). 
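For reference, and as the removals in the diff below make visible, the !NUMA variant then collapses to roughly the following (post-patch form, reconstructed from the hunk):

static void build_zonelists(pg_data_t *pgdat)
{
	struct zoneref *zonerefs;
	int nr_zones;

	zonerefs = pgdat->node_zonelists[ZONELIST_FALLBACK]._zonerefs;
	nr_zones = build_zonerefs_node(pgdat, zonerefs);
	zonerefs += nr_zones;

	zonerefs->zone = NULL;
	zonerefs->zone_idx = 0;
}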
[bhe@redhat.com: remove unused locals] Link: https://lkml.kernel.org/r/ZgQL1WOf9K88nLpQ@MiWiFi-R3L-srv Link: https://lkml.kernel.org/r/20240326061134.1055295-5-bhe@redhat.com Signed-off-by: Baoquan He Cc: Mel Gorman Cc: "Mike Rapoport (IBM)" Signed-off-by: Andrew Morton --- mm/page_alloc.c | 24 ------------------------ 1 file changed, 24 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 6a8593aef297..d932f406c0c1 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -5280,37 +5280,13 @@ static void setup_min_slab_ratio(void); static void build_zonelists(pg_data_t *pgdat) { - int node, local_node; struct zoneref *zonerefs; int nr_zones; - local_node = pgdat->node_id; - zonerefs = pgdat->node_zonelists[ZONELIST_FALLBACK]._zonerefs; nr_zones = build_zonerefs_node(pgdat, zonerefs); zonerefs += nr_zones; - /* - * Now we build the zonelist so that it contains the zones - * of all the other nodes. - * We don't want to pressure a particular node, so when - * building the zones for node N, we make sure that the - * zones coming right after the local ones are those from - * node N+1 (modulo N) - */ - for (node = local_node + 1; node < MAX_NUMNODES; node++) { - if (!node_online(node)) - continue; - nr_zones = build_zonerefs_node(NODE_DATA(node), zonerefs); - zonerefs += nr_zones; - } - for (node = 0; node < local_node; node++) { - if (!node_online(node)) - continue; - nr_zones = build_zonerefs_node(NODE_DATA(node), zonerefs); - zonerefs += nr_zones; - } - zonerefs->zone = NULL; zonerefs->zone_idx = 0; } -- cgit v1.2.3-59-g8ed1b From 96a5c186efff9906c58fa24ab05d0cc3ee000a00 Mon Sep 17 00:00:00 2001 From: Baoquan He Date: Tue, 26 Mar 2024 14:11:32 +0800 Subject: mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] for empty zone On one node, for lower zone's ->lowmem_reserve[], it will show how much memory is reserved in this lower zone to avoid excessive page allocation from the relevant higher zone's fallback allocation. However, currently lower zone's lowmem_reserve[] element will be filled even though the relevant higher zone is empty. That doesn't make sense and can cause confusion. E.g., on node 0 of one system as below, it has zone DMA/DMA32/NORMAL/MOVABLE/DEVICE, among them zone MOVABLE/DEVICE are the highest and both are empty. In zone DMA/DMA32's protection array, we can see that it has value for zone MOVABLE and DEVICE. Node 0, zone DMA ...... pages free 2816 boost 0 min 7 low 10 high 13 spanned 4095 present 3998 managed 3840 cma 0 protection: (0, 1582, 23716, 23716, 23716) ...... Node 0, zone DMA32 pages free 403269 boost 0 min 753 low 1158 high 1563 spanned 1044480 present 487039 managed 405070 cma 0 protection: (0, 0, 22134, 22134, 22134) ...... Node 0, zone Normal pages free 5423879 boost 0 min 10539 low 16205 high 21871 spanned 5767168 present 5767168 managed 5666438 cma 0 protection: (0, 0, 0, 0, 0) ...... Node 0, zone Movable pages free 0 boost 0 min 32 low 32 high 32 spanned 0 present 0 managed 0 cma 0 protection: (0, 0, 0, 0, 0) Node 0, zone Device pages free 0 boost 0 min 0 low 0 high 0 spanned 0 present 0 managed 0 cma 0 protection: (0, 0, 0, 0, 0) Here, clear out the element value in lower zone's ->lowmem_reserve[] if the relevant higher zone is empty.
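To make the effect concrete: with this change, the empty Movable and Device zones in the example above no longer claim a reservation in the lower zones, so zone DMA's protection array would read (0, 1582, 23716, 0, 0) and zone DMA32's (0, 0, 22134, 0, 0), while the entries for non-empty higher zones are unchanged.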
And also replace space with tab in _deferred_grow_zone() Link: https://lkml.kernel.org/r/20240326061134.1055295-7-bhe@redhat.com Signed-off-by: Baoquan He Cc: Mel Gorman Cc: "Mike Rapoport (IBM)" Signed-off-by: Andrew Morton --- mm/page_alloc.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index d932f406c0c1..875cec8c7708 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -315,7 +315,7 @@ static inline bool deferred_pages_enabled(void) static bool __ref _deferred_grow_zone(struct zone *zone, unsigned int order) { - return deferred_grow_zone(zone, order); + return deferred_grow_zone(zone, order); } #else static inline bool deferred_pages_enabled(void) @@ -5903,10 +5903,11 @@ static void setup_per_zone_lowmem_reserve(void) for (j = i + 1; j < MAX_NR_ZONES; j++) { struct zone *upper_zone = &pgdat->node_zones[j]; + bool empty = !zone_managed_pages(upper_zone); managed_pages += zone_managed_pages(upper_zone); - if (clear) + if (clear || empty) zone->lowmem_reserve[j] = 0; else zone->lowmem_reserve[j] = managed_pages / ratio; -- cgit v1.2.3-59-g8ed1b From 0aac45663a6d8803e8436dbaca884e0712970c7a Mon Sep 17 00:00:00 2001 From: Baoquan He Date: Tue, 26 Mar 2024 14:11:33 +0800 Subject: mm/page_alloc.c: change the array-length to MIGRATE_PCPTYPES Earlier, in commit 1dd214b8f21c ("mm: page_alloc: avoid merging non-fallbackable pageblocks with others"), migrate type MIGRATE_CMA and MIGRATE_ISOLATE are removed from fallbacks list since they are never used. Later on, in commit ("aa02d3c174ab mm/page_alloc: reduce fallbacks to (MIGRATE_PCPTYPES - 1)"), the array column size is reduced to 'MIGRATE_PCPTYPES - 1'. In fact, the array row size need be reduced to MIGRATE_PCPTYPES too since it's only covering rows of the number MIGRATE_PCPTYPES. Even though the current code has handled cases when the migratetype is CMA, HIGHATOMIC and MEMORY_ISOLATION, making the row size right is still good to avoid future error and confusion. Link: https://lkml.kernel.org/r/20240326061134.1055295-8-bhe@redhat.com Signed-off-by: Baoquan He Cc: Mel Gorman Cc: "Mike Rapoport (IBM)" Signed-off-by: Andrew Morton --- mm/page_alloc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'mm/page_alloc.c') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 875cec8c7708..47ab0297838a 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1542,7 +1542,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order, * * The other migratetypes do not have fallbacks. */ -static int fallbacks[MIGRATE_TYPES][MIGRATE_PCPTYPES - 1] = { +static int fallbacks[MIGRATE_PCPTYPES][MIGRATE_PCPTYPES - 1] = { [MIGRATE_UNMOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE }, [MIGRATE_MOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE }, [MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE, MIGRATE_MOVABLE }, -- cgit v1.2.3-59-g8ed1b From ba42b524a0408b5f92bd41edaee1ea84309ab9ae Mon Sep 17 00:00:00 2001 From: York Jasper Niebuhr Date: Fri, 29 Mar 2024 15:56:05 +0100 Subject: mm: init_mlocked_on_free_v3 Implements the "init_mlocked_on_free" boot option. When this boot option is enabled, any mlock'ed pages are zeroed on free. If the pages are munlock'ed beforehand, no initialization takes place. This boot option is meant to combat the performance hit of "init_on_free" as reported in commit 6471384af2a6 ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options"). 
With "init_mlocked_on_free=1" only relevant data is freed while everything else is left untouched by the kernel. Correspondingly, this patch introduces no performance hit for unmapping non-mlock'ed memory. The unmapping overhead for purely mlocked memory was measured to be approximately 13%. Realistically, most systems mlock only a fraction of the total memory so the real-world system overhead should be close to zero. Optimally, userspace programs clear any key material or other confidential memory before exit and munlock the according memory regions. If a program crashes, userspace key managers fail to do this job. Accordingly, no munlock operations are performed so the data is caught and zeroed by the kernel. Should the program not crash, all memory will ideally be munlocked so no overhead is caused. CONFIG_INIT_MLOCKED_ON_FREE_DEFAULT_ON can be set to enable "init_mlocked_on_free" by default. Link: https://lkml.kernel.org/r/20240329145605.149917-1-yjnworkstation@gmail.com Signed-off-by: York Jasper Niebuhr Cc: Matthew Wilcox (Oracle) Cc: York Jasper Niebuhr Cc: Kees Cook Signed-off-by: Andrew Morton --- Documentation/admin-guide/kernel-parameters.txt | 6 ++++ include/linux/mm.h | 9 +++++- mm/internal.h | 1 + mm/memory.c | 6 ++++ mm/mm_init.c | 43 +++++++++++++++++++++---- mm/page_alloc.c | 2 +- security/Kconfig.hardening | 15 +++++++++ 7 files changed, 73 insertions(+), 9 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 902ecd92a29f..3ff97de349da 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -2148,6 +2148,12 @@ Format: 0 | 1 Default set by CONFIG_INIT_ON_FREE_DEFAULT_ON. + init_mlocked_on_free= [MM] Fill freed userspace memory with zeroes if + it was mlock'ed and not explicitly munlock'ed + afterwards. + Format: 0 | 1 + Default set by CONFIG_INIT_MLOCKED_ON_FREE_DEFAULT_ON + init_pkru= [X86] Specify the default memory protection keys rights register contents for all processes. 0x55555554 by default (disallow access to all but pkey 0). 
Can diff --git a/include/linux/mm.h b/include/linux/mm.h index 2d5e492ef57f..4f4e460d7853 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3762,7 +3762,14 @@ DECLARE_STATIC_KEY_MAYBE(CONFIG_INIT_ON_FREE_DEFAULT_ON, init_on_free); static inline bool want_init_on_free(void) { return static_branch_maybe(CONFIG_INIT_ON_FREE_DEFAULT_ON, - &init_on_free); + &init_on_free); +} + +DECLARE_STATIC_KEY_MAYBE(CONFIG_INIT_MLOCKED_ON_FREE_DEFAULT_ON, init_mlocked_on_free); +static inline bool want_init_mlocked_on_free(void) +{ + return static_branch_maybe(CONFIG_INIT_MLOCKED_ON_FREE_DEFAULT_ON, + &init_mlocked_on_free); } extern bool _debug_pagealloc_enabled_early; diff --git a/mm/internal.h b/mm/internal.h index 6614ba4ca9de..cf7799e29391 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -506,6 +506,7 @@ extern void __putback_isolated_page(struct page *page, unsigned int order, extern void memblock_free_pages(struct page *page, unsigned long pfn, unsigned int order); extern void __free_pages_core(struct page *page, unsigned int order); +extern void kernel_init_pages(struct page *page, int numpages); /* * This will have no effect, other than possibly generating a warning, if the diff --git a/mm/memory.c b/mm/memory.c index 0b92336bcebd..80944acb5b4e 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1506,6 +1506,12 @@ static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb, if (unlikely(page_mapcount(page) < 0)) print_bad_pte(vma, addr, ptent, page); } + + if (want_init_mlocked_on_free() && folio_test_mlocked(folio) && + !delay_rmap && folio_test_anon(folio)) { + kernel_init_pages(page, folio_nr_pages(folio)); + } + if (unlikely(__tlb_remove_folio_pages(tlb, page, nr, delay_rmap))) { *force_flush = true; *force_break = true; diff --git a/mm/mm_init.c b/mm/mm_init.c index d01912b8a597..2c8f3af4430f 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -2522,6 +2522,9 @@ EXPORT_SYMBOL(init_on_alloc); DEFINE_STATIC_KEY_MAYBE(CONFIG_INIT_ON_FREE_DEFAULT_ON, init_on_free); EXPORT_SYMBOL(init_on_free); +DEFINE_STATIC_KEY_MAYBE(CONFIG_INIT_MLOCKED_ON_FREE_DEFAULT_ON, init_mlocked_on_free); +EXPORT_SYMBOL(init_mlocked_on_free); + static bool _init_on_alloc_enabled_early __read_mostly = IS_ENABLED(CONFIG_INIT_ON_ALLOC_DEFAULT_ON); static int __init early_init_on_alloc(char *buf) @@ -2539,6 +2542,14 @@ static int __init early_init_on_free(char *buf) } early_param("init_on_free", early_init_on_free); +static bool _init_mlocked_on_free_enabled_early __read_mostly + = IS_ENABLED(CONFIG_INIT_MLOCKED_ON_FREE_DEFAULT_ON); +static int __init early_init_mlocked_on_free(char *buf) +{ + return kstrtobool(buf, &_init_mlocked_on_free_enabled_early); +} +early_param("init_mlocked_on_free", early_init_mlocked_on_free); + DEFINE_STATIC_KEY_MAYBE(CONFIG_DEBUG_VM, check_pages_enabled); /* @@ -2566,12 +2577,21 @@ static void __init mem_debugging_and_hardening_init(void) } #endif - if ((_init_on_alloc_enabled_early || _init_on_free_enabled_early) && + if ((_init_on_alloc_enabled_early || _init_on_free_enabled_early || + _init_mlocked_on_free_enabled_early) && page_poisoning_requested) { pr_info("mem auto-init: CONFIG_PAGE_POISONING is on, " - "will take precedence over init_on_alloc and init_on_free\n"); + "will take precedence over init_on_alloc, init_on_free " + "and init_mlocked_on_free\n"); _init_on_alloc_enabled_early = false; _init_on_free_enabled_early = false; + _init_mlocked_on_free_enabled_early = false; + } + + if (_init_mlocked_on_free_enabled_early && _init_on_free_enabled_early) { + pr_info("mem 
auto-init: init_on_free is on, " + "will take precedence over init_mlocked_on_free\n"); + _init_mlocked_on_free_enabled_early = false; } if (_init_on_alloc_enabled_early) { @@ -2588,9 +2608,17 @@ static void __init mem_debugging_and_hardening_init(void) static_branch_disable(&init_on_free); } - if (IS_ENABLED(CONFIG_KMSAN) && - (_init_on_alloc_enabled_early || _init_on_free_enabled_early)) - pr_info("mem auto-init: please make sure init_on_alloc and init_on_free are disabled when running KMSAN\n"); + if (_init_mlocked_on_free_enabled_early) { + want_check_pages = true; + static_branch_enable(&init_mlocked_on_free); + } else { + static_branch_disable(&init_mlocked_on_free); + } + + if (IS_ENABLED(CONFIG_KMSAN) && (_init_on_alloc_enabled_early || + _init_on_free_enabled_early || _init_mlocked_on_free_enabled_early)) + pr_info("mem auto-init: please make sure init_on_alloc, init_on_free and " + "init_mlocked_on_free are disabled when running KMSAN\n"); #ifdef CONFIG_DEBUG_PAGEALLOC if (debug_pagealloc_enabled()) { @@ -2629,9 +2657,10 @@ static void __init report_meminit(void) else stack = "off"; - pr_info("mem auto-init: stack:%s, heap alloc:%s, heap free:%s\n", + pr_info("mem auto-init: stack:%s, heap alloc:%s, heap free:%s, mlocked free:%s\n", stack, want_init_on_alloc(GFP_KERNEL) ? "on" : "off", - want_init_on_free() ? "on" : "off"); + want_init_on_free() ? "on" : "off", + want_init_mlocked_on_free() ? "on" : "off"); if (want_init_on_free()) pr_info("mem auto-init: clearing system memory may take some time...\n"); } diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 47ab0297838a..e030ccf9d5bc 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1032,7 +1032,7 @@ static inline bool should_skip_kasan_poison(struct page *page) return page_kasan_tag(page) == KASAN_TAG_KERNEL; } -static void kernel_init_pages(struct page *page, int numpages) +void kernel_init_pages(struct page *page, int numpages) { int i; diff --git a/security/Kconfig.hardening b/security/Kconfig.hardening index 2cff851ebfd7..effbf5982be1 100644 --- a/security/Kconfig.hardening +++ b/security/Kconfig.hardening @@ -255,6 +255,21 @@ config INIT_ON_FREE_DEFAULT_ON touching "cold" memory areas. Most cases see 3-5% impact. Some synthetic workloads have measured as high as 8%. +config INIT_MLOCKED_ON_FREE_DEFAULT_ON + bool "Enable mlocked memory zeroing on free" + depends on !KMSAN + help + This config has the effect of setting "init_mlocked_on_free=1" + on the kernel command line. If it is enabled, all mlocked process + memory is zeroed when freed. This restriction to mlocked memory + improves performance over "init_on_free" but can still be used to + protect confidential data like key material from content exposures + to other processes, as well as live forensics and cold boot attacks. + Any non-mlocked memory is not cleared before it is reassigned. This + configuration can be overwritten by setting "init_mlocked_on_free=0" + on the command line. The "init_on_free" boot option takes + precedence over "init_mlocked_on_free". 
+ config CC_HAS_ZERO_CALL_USED_REGS def_bool $(cc-option,-fzero-call-used-regs=used-gpr) # https://github.com/ClangBuiltLinux/linux/issues/1766 -- cgit v1.2.3-59-g8ed1b From 7998df0b6407da65a0fe8325d6ff239c4e14ff7d Mon Sep 17 00:00:00 2001 From: Joel Granados Date: Thu, 28 Mar 2024 16:57:48 +0100 Subject: memory: remove the now superfluous sentinel element from ctl_table array This commit comes at the tail end of a greater effort to remove the empty elements at the end of the ctl_table arrays (sentinels) which will reduce the overall build time size of the kernel and run time memory bloat by ~64 bytes per sentinel (further information Link : https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/) Remove sentinel from all files under mm/ that register a sysctl table. Link: https://lkml.kernel.org/r/20240328-jag-sysctl_remset_misc-v1-1-47c1463b3af2@samsung.com Signed-off-by: Joel Granados Reviewed-by: Muchun Song Reviewed-by: Miaohe Lin Signed-off-by: Andrew Morton --- mm/compaction.c | 1 - mm/hugetlb.c | 1 - mm/hugetlb_vmemmap.c | 1 - mm/memory-failure.c | 1 - mm/oom_kill.c | 1 - mm/page-writeback.c | 1 - mm/page_alloc.c | 1 - 7 files changed, 7 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/mm/compaction.c b/mm/compaction.c index 70b01190d2f3..e731d45befc7 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -3350,7 +3350,6 @@ static struct ctl_table vm_compaction[] = { .extra1 = SYSCTL_ZERO, .extra2 = SYSCTL_ONE, }, - { } }; static int __init kcompactd_init(void) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 8af2fd48f785..10b4f8bde16f 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -5029,7 +5029,6 @@ static struct ctl_table hugetlb_table[] = { .mode = 0644, .proc_handler = hugetlb_overcommit_handler, }, - { } }; static void hugetlb_sysctl_init(void) diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c index da177e49d956..b9a55322e52c 100644 --- a/mm/hugetlb_vmemmap.c +++ b/mm/hugetlb_vmemmap.c @@ -679,7 +679,6 @@ static struct ctl_table hugetlb_vmemmap_sysctls[] = { .mode = 0644, .proc_handler = proc_dobool, }, - { } }; static int __init hugetlb_vmemmap_init(void) diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 0d863e9216af..4f22330600ff 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -141,7 +141,6 @@ static struct ctl_table memory_failure_table[] = { .extra1 = SYSCTL_ZERO, .extra2 = SYSCTL_ONE, }, - { } }; /* diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 8d6a207c3c59..4d7a0004df2c 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -724,7 +724,6 @@ static struct ctl_table vm_oom_kill_table[] = { .mode = 0644, .proc_handler = proc_dointvec, }, - {} }; #endif diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 3e19b87049db..fba324e1a010 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -2291,7 +2291,6 @@ static struct ctl_table vm_page_writeback_sysctls[] = { .mode = 0644, .proc_handler = proc_dointvec_jiffies, }, - {} }; #endif diff --git a/mm/page_alloc.c b/mm/page_alloc.c index e030ccf9d5bc..07c21774a722 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6288,7 +6288,6 @@ static struct ctl_table page_alloc_sysctl_table[] = { .extra2 = SYSCTL_ONE_HUNDRED, }, #endif - {} }; void __init page_alloc_sysctl_init(void) -- cgit v1.2.3-59-g8ed1b From 6303d1c553c8d758f068de70a41668622b7a917c Mon Sep 17 00:00:00 2001 From: Baolin Wang Date: Wed, 3 Apr 2024 21:47:21 +0800 Subject: mm: page_alloc: use the correct THP order for THP PCP Commit 44042b449872 ("mm/page_alloc: allow high-order pages to be stored on the 
per-cpu lists") extends the PCP allocator to store THP pages, and it determines whether to cache THP pages in PCP by comparing with pageblock_order. But the pageblock_order is not always equal to THP order. It might also be MAX_PAGE_ORDER, which could prevent PCP from caching THP pages. Therefore, using HPAGE_PMD_ORDER instead to determine the need for caching THP for PCP will fix this issue Link: https://lkml.kernel.org/r/a25c9e14cd03907d5978b60546a69e6aa3fc2a7d.1712151833.git.baolin.wang@linux.alibaba.com Fixes: 44042b449872 ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists") Signed-off-by: Baolin Wang Acked-by: Vlastimil Babka Cc: Mel Gorman Reviewed-by: Barry Song Signed-off-by: Andrew Morton --- mm/page_alloc.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 07c21774a722..cbe3f695eda9 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -506,7 +506,7 @@ static inline unsigned int order_to_pindex(int migratetype, int order) { #ifdef CONFIG_TRANSPARENT_HUGEPAGE if (order > PAGE_ALLOC_COSTLY_ORDER) { - VM_BUG_ON(order != pageblock_order); + VM_BUG_ON(order != HPAGE_PMD_ORDER); return NR_LOWORDER_PCP_LISTS; } #else @@ -522,7 +522,7 @@ static inline int pindex_to_order(unsigned int pindex) #ifdef CONFIG_TRANSPARENT_HUGEPAGE if (pindex == NR_LOWORDER_PCP_LISTS) - order = pageblock_order; + order = HPAGE_PMD_ORDER; #else VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER); #endif @@ -535,7 +535,7 @@ static inline bool pcp_allowed_order(unsigned int order) if (order <= PAGE_ALLOC_COSTLY_ORDER) return true; #ifdef CONFIG_TRANSPARENT_HUGEPAGE - if (order == pageblock_order) + if (order == HPAGE_PMD_ORDER) return true; #endif return false; -- cgit v1.2.3-59-g8ed1b From 5b8d75913a0ed9deb16140c0aa880c4d6db2dc62 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Fri, 5 Apr 2024 16:32:24 +0100 Subject: mm: combine free_the_page() and free_unref_page() The pcp_allowed_order() check in free_the_page() was only being skipped by __folio_put_small() which is about to be rearranged. Link: https://lkml.kernel.org/r/20240405153228.2563754-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Reviewed-by: Zi Yan Signed-off-by: Andrew Morton --- mm/page_alloc.c | 25 +++++++++++-------------- 1 file changed, 11 insertions(+), 14 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index cbe3f695eda9..c01757251de2 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -541,14 +541,6 @@ static inline bool pcp_allowed_order(unsigned int order) return false; } -static inline void free_the_page(struct page *page, unsigned int order) -{ - if (pcp_allowed_order(order)) /* Via pcp? */ - free_unref_page(page, order); - else - __free_pages_ok(page, order, FPI_NONE); -} - /* * Higher-order pages are called "compound pages". 
They are structured thusly: * @@ -584,7 +576,7 @@ void destroy_large_folio(struct folio *folio) folio_undo_large_rmappable(folio); mem_cgroup_uncharge(folio); - free_the_page(&folio->page, folio_order(folio)); + free_unref_page(&folio->page, folio_order(folio)); } static inline void set_buddy_order(struct page *page, unsigned int order) @@ -2573,6 +2565,11 @@ void free_unref_page(struct page *page, unsigned int order) unsigned long pfn = page_to_pfn(page); int migratetype; + if (!pcp_allowed_order(order)) { + __free_pages_ok(page, order, FPI_NONE); + return; + } + if (!free_pages_prepare(page, order)) return; @@ -4755,11 +4752,11 @@ void __free_pages(struct page *page, unsigned int order) struct alloc_tag *tag = pgalloc_tag_get(page); if (put_page_testzero(page)) - free_the_page(page, order); + free_unref_page(page, order); else if (!head) { pgalloc_tag_sub_pages(tag, (1 << order) - 1); while (order-- > 0) - free_the_page(page + (1 << order), order); + free_unref_page(page + (1 << order), order); } } EXPORT_SYMBOL(__free_pages); @@ -4821,7 +4818,7 @@ void __page_frag_cache_drain(struct page *page, unsigned int count) VM_BUG_ON_PAGE(page_ref_count(page) == 0, page); if (page_ref_sub_and_test(page, count)) - free_the_page(page, compound_order(page)); + free_unref_page(page, compound_order(page)); } EXPORT_SYMBOL(__page_frag_cache_drain); @@ -4862,7 +4859,7 @@ refill: goto refill; if (unlikely(nc->pfmemalloc)) { - free_the_page(page, compound_order(page)); + free_unref_page(page, compound_order(page)); goto refill; } @@ -4906,7 +4903,7 @@ void page_frag_free(void *addr) struct page *page = virt_to_head_page(addr); if (unlikely(put_page_testzero(page))) - free_the_page(page, compound_order(page)); + free_unref_page(page, compound_order(page)); } EXPORT_SYMBOL(page_frag_free); -- cgit v1.2.3-59-g8ed1b From 2542b1ac9a46ac58f9565de0048457956898d481 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Fri, 5 Apr 2024 16:32:25 +0100 Subject: mm: inline destroy_large_folio() into __folio_put_large() destroy_large_folio() has only one caller, move its contents there. Link: https://lkml.kernel.org/r/20240405153228.2563754-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Reviewed-by: Zi Yan Signed-off-by: Andrew Morton --- include/linux/mm.h | 2 -- mm/page_alloc.c | 14 -------------- mm/swap.c | 13 ++++++++++--- 3 files changed, 10 insertions(+), 19 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/include/linux/mm.h b/include/linux/mm.h index 4f4e460d7853..cc9410d772dd 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1318,8 +1318,6 @@ void folio_copy(struct folio *dst, struct folio *src); unsigned long nr_free_buffer_pages(void); -void destroy_large_folio(struct folio *folio); - /* Returns the number of bytes in this potentially compound page. 
*/ static inline unsigned long page_size(struct page *page) { diff --git a/mm/page_alloc.c b/mm/page_alloc.c index c01757251de2..22e8b9f1d710 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -565,20 +565,6 @@ void prep_compound_page(struct page *page, unsigned int order) prep_compound_head(page, order); } -void destroy_large_folio(struct folio *folio) -{ - if (folio_test_hugetlb(folio)) { - free_huge_folio(folio); - return; - } - - if (folio_test_large_rmappable(folio)) - folio_undo_large_rmappable(folio); - - mem_cgroup_uncharge(folio); - free_unref_page(&folio->page, folio_order(folio)); -} - static inline void set_buddy_order(struct page *page, unsigned int order) { set_page_private(page, order); diff --git a/mm/swap.c b/mm/swap.c index d7db3cd4e80a..6e3bd03673ea 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -127,9 +127,16 @@ static void __folio_put_large(struct folio *folio) * (it's never listed to any LRU lists) and no memcg routines should * be called for hugetlb (it has a separate hugetlb_cgroup.) */ - if (!folio_test_hugetlb(folio)) - page_cache_release(folio); - destroy_large_folio(folio); + if (folio_test_hugetlb(folio)) { + free_huge_folio(folio); + return; + } + + page_cache_release(folio); + if (folio_test_large_rmappable(folio)) + folio_undo_large_rmappable(folio); + mem_cgroup_uncharge(folio); + free_unref_page(&folio->page, folio_order(folio)); } void __folio_put(struct folio *folio) -- cgit v1.2.3-59-g8ed1b From 05c5323b2a344c19c51cd1b91a4ab9ae90853794 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Tue, 9 Apr 2024 21:22:47 +0200 Subject: mm: track mapcount of large folios in single value Let's track the mapcount of large folios in a single value. The mapcount of a large folio currently corresponds to the sum of the entire mapcount and all page mapcounts. This sum is what we actually want to know in folio_mapcount() and it is also sufficient for implementing folio_mapped(). With PTE-mapped THP becoming more important and more widely used, we want to avoid looping over all pages of a folio just to obtain the mapcount of large folios. The comment "In the common case, avoid the loop when no pages mapped by PTE" in folio_total_mapcount() does no longer hold for mTHP that are always mapped by PTE. Further, we are planning on using folio_mapcount() more frequently, and might even want to remove page mapcounts for large folios in some kernel configs. Therefore, allow for reading the mapcount of large folios efficiently and atomically without looping over any pages. Maintain the mapcount also for hugetlb pages for simplicity. Use the new mapcount to implement folio_mapcount() and folio_mapped(). Make page_mapped() simply call folio_mapped(). We can now get rid of folio_large_is_mapped(). _nr_pages_mapped is now only used in rmap code and for debugging purposes. Keep folio_nr_pages_mapped() around, but document that its use should be limited to rmap internals and debugging purposes. This change implies one additional atomic add/sub whenever mapping/unmapping (parts of) a large folio. As we now batch RMAP operations for PTE-mapped THP during fork(), during unmap/zap, and when PTE-remapping a PMD-mapped THP, and we adjust the large mapcount for a PTE batch only once, the added overhead in the common case is small. Only when unmapping individual pages of a large folio (e.g., during COW), the overhead might be bigger in comparison, but it's essentially one additional atomic operation. 
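As a small worked example (not part of the patch): an order-2 folio that a single process maps with four PTEs ends up with _large_mapcount at 3, since the counter starts at -1 and each PTE mapping increments it once, so folio_mapcount() now returns 4 by reading one counter. Previously the same answer required summing the entire mapcount and the four per-page mapcounts.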
Note that before the new mapcount would overflow, already our refcount would overflow: each mapping requires a folio reference. Extend the documentation of folio_mapcount(). Link: https://lkml.kernel.org/r/20240409192301.907377-5-david@redhat.com Signed-off-by: David Hildenbrand Reviewed-by: Yin Fengwei Cc: Chris Zankel Cc: Hugh Dickins Cc: John Paul Adrian Glaubitz Cc: Jonathan Corbet Cc: Matthew Wilcox (Oracle) Cc: Max Filippov Cc: Miaohe Lin Cc: Muchun Song Cc: Naoya Horiguchi Cc: Peter Xu Cc: Richard Chang Cc: Rich Felker Cc: Ryan Roberts Cc: Yang Shi Cc: Yoshinori Sato Cc: Zi Yan Signed-off-by: Andrew Morton --- Documentation/mm/transhuge.rst | 12 ++++++------ include/linux/mm.h | 44 +++++++++++++++++++----------------------- include/linux/mm_types.h | 5 +++-- include/linux/rmap.h | 10 ++++++++++ mm/debug.c | 3 ++- mm/hugetlb.c | 4 ++-- mm/internal.h | 3 +++ mm/khugepaged.c | 2 +- mm/page_alloc.c | 4 ++++ mm/rmap.c | 34 +++++++++++--------------------- 10 files changed, 62 insertions(+), 59 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/Documentation/mm/transhuge.rst b/Documentation/mm/transhuge.rst index 93c9239b9ebe..1ba0ad63246c 100644 --- a/Documentation/mm/transhuge.rst +++ b/Documentation/mm/transhuge.rst @@ -116,14 +116,14 @@ pages: succeeds on tail pages. - map/unmap of a PMD entry for the whole THP increment/decrement - folio->_entire_mapcount and also increment/decrement - folio->_nr_pages_mapped by ENTIRELY_MAPPED when _entire_mapcount - goes from -1 to 0 or 0 to -1. + folio->_entire_mapcount, increment/decrement folio->_large_mapcount + and also increment/decrement folio->_nr_pages_mapped by ENTIRELY_MAPPED + when _entire_mapcount goes from -1 to 0 or 0 to -1. - map/unmap of individual pages with PTE entry increment/decrement - page->_mapcount and also increment/decrement folio->_nr_pages_mapped - when page->_mapcount goes from -1 to 0 or 0 to -1 as this counts - the number of pages mapped by PTE. + page->_mapcount, increment/decrement folio->_large_mapcount and also + increment/decrement folio->_nr_pages_mapped when page->_mapcount goes + from -1 to 0 or 0 to -1 as this counts the number of pages mapped by PTE. split_huge_page internally has to distribute the refcounts in the head page to the tail pages before clearing all PG_head/tail bits from the page diff --git a/include/linux/mm.h b/include/linux/mm.h index 8fa4d78bcc09..059477821625 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1240,16 +1240,26 @@ static inline int page_mapcount(struct page *page) return mapcount; } -int folio_total_mapcount(const struct folio *folio); +static inline int folio_large_mapcount(const struct folio *folio) +{ + VM_WARN_ON_FOLIO(!folio_test_large(folio), folio); + return atomic_read(&folio->_large_mapcount) + 1; +} /** - * folio_mapcount() - Calculate the number of mappings of this folio. + * folio_mapcount() - Number of mappings of this folio. * @folio: The folio. * - * A large folio tracks both how many times the entire folio is mapped, - * and how many times each individual page in the folio is mapped. - * This function calculates the total number of times the folio is - * mapped. + * The folio mapcount corresponds to the number of present user page table + * entries that reference any part of a folio. Each such present user page + * table entry must be paired with exactly one folio reference. + * + * For ordinary folios, each user page table entry (PTE/PMD/PUD/...) counts + * exactly once.
+ * + * For hugetlb folios, each abstracted "hugetlb" user page table entry that + * references the entire folio counts exactly once, even when such special + * page table entries are comprised of multiple ordinary page table entries. * * Return: The number of times this folio is mapped. */ @@ -1257,17 +1267,7 @@ static inline int folio_mapcount(const struct folio *folio) { if (likely(!folio_test_large(folio))) return atomic_read(&folio->_mapcount) + 1; - return folio_total_mapcount(folio); -} - -static inline bool folio_large_is_mapped(const struct folio *folio) -{ - /* - * Reading _entire_mapcount below could be omitted if hugetlb - * participated in incrementing nr_pages_mapped when compound mapped. - */ - return atomic_read(&folio->_nr_pages_mapped) > 0 || - atomic_read(&folio->_entire_mapcount) >= 0; + return folio_large_mapcount(folio); } /** @@ -1276,11 +1276,9 @@ static inline bool folio_large_is_mapped(const struct folio *folio) * * Return: True if any page in this folio is referenced by user page tables. */ -static inline bool folio_mapped(struct folio *folio) +static inline bool folio_mapped(const struct folio *folio) { - if (likely(!folio_test_large(folio))) - return atomic_read(&folio->_mapcount) >= 0; - return folio_large_is_mapped(folio); + return folio_mapcount(folio) >= 1; } /* @@ -1290,9 +1288,7 @@ static inline bool folio_mapped(struct folio *folio) */ static inline bool page_mapped(const struct page *page) { - if (likely(!PageCompound(page))) - return atomic_read(&page->_mapcount) >= 0; - return folio_large_is_mapped(page_folio(page)); + return folio_mapped(page_folio(page)); } static inline struct page *virt_to_head_page(const void *x) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index fa0d6995706f..db0adf5721cc 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -289,7 +289,8 @@ typedef struct { * @virtual: Virtual address in the kernel direct map. * @_last_cpupid: IDs of last CPU and last process that accessed the folio. * @_entire_mapcount: Do not use directly, call folio_entire_mapcount(). - * @_nr_pages_mapped: Do not use directly, call folio_mapcount(). + * @_large_mapcount: Do not use directly, call folio_mapcount(). + * @_nr_pages_mapped: Do not use outside of rmap and debug code. * @_pincount: Do not use directly, call folio_maybe_dma_pinned(). * @_folio_nr_pages: Do not use directly, call folio_nr_pages(). * @_hugetlb_subpool: Do not use directly, use accessor in hugetlb.h. 
@@ -348,8 +349,8 @@ struct folio { struct { unsigned long _flags_1; unsigned long _head_1; - unsigned long _folio_avail; /* public: */ + atomic_t _large_mapcount; atomic_t _entire_mapcount; atomic_t _nr_pages_mapped; atomic_t _pincount; diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 327f1ca5a487..0f906dc6d280 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -273,6 +273,7 @@ static inline int hugetlb_try_dup_anon_rmap(struct folio *folio, ClearPageAnonExclusive(&folio->page); } atomic_inc(&folio->_entire_mapcount); + atomic_inc(&folio->_large_mapcount); return 0; } @@ -306,6 +307,7 @@ static inline void hugetlb_add_file_rmap(struct folio *folio) VM_WARN_ON_FOLIO(folio_test_anon(folio), folio); atomic_inc(&folio->_entire_mapcount); + atomic_inc(&folio->_large_mapcount); } static inline void hugetlb_remove_rmap(struct folio *folio) @@ -313,11 +315,14 @@ static inline void hugetlb_remove_rmap(struct folio *folio) VM_WARN_ON_FOLIO(!folio_test_hugetlb(folio), folio); atomic_dec(&folio->_entire_mapcount); + atomic_dec(&folio->_large_mapcount); } static __always_inline void __folio_dup_file_rmap(struct folio *folio, struct page *page, int nr_pages, enum rmap_level level) { + const int orig_nr_pages = nr_pages; + __folio_rmap_sanity_checks(folio, page, nr_pages, level); switch (level) { @@ -330,9 +335,11 @@ static __always_inline void __folio_dup_file_rmap(struct folio *folio, do { atomic_inc(&page->_mapcount); } while (page++, --nr_pages > 0); + atomic_add(orig_nr_pages, &folio->_large_mapcount); break; case RMAP_LEVEL_PMD: atomic_inc(&folio->_entire_mapcount); + atomic_inc(&folio->_large_mapcount); break; } } @@ -382,6 +389,7 @@ static __always_inline int __folio_try_dup_anon_rmap(struct folio *folio, struct page *page, int nr_pages, struct vm_area_struct *src_vma, enum rmap_level level) { + const int orig_nr_pages = nr_pages; bool maybe_pinned; int i; @@ -423,6 +431,7 @@ static __always_inline int __folio_try_dup_anon_rmap(struct folio *folio, ClearPageAnonExclusive(page); atomic_inc(&page->_mapcount); } while (page++, --nr_pages > 0); + atomic_add(orig_nr_pages, &folio->_large_mapcount); break; case RMAP_LEVEL_PMD: if (PageAnonExclusive(page)) { @@ -431,6 +440,7 @@ static __always_inline int __folio_try_dup_anon_rmap(struct folio *folio, ClearPageAnonExclusive(page); } atomic_inc(&folio->_entire_mapcount); + atomic_inc(&folio->_large_mapcount); break; } return 0; diff --git a/mm/debug.c b/mm/debug.c index b71186f1fb0b..d064db42af54 100644 --- a/mm/debug.c +++ b/mm/debug.c @@ -68,8 +68,9 @@ static void __dump_folio(struct folio *folio, struct page *page, folio_ref_count(folio), mapcount, mapping, folio->index + idx, pfn); if (folio_test_large(folio)) { - pr_warn("head: order:%u entire_mapcount:%d nr_pages_mapped:%d pincount:%d\n", + pr_warn("head: order:%u mapcount:%d entire_mapcount:%d nr_pages_mapped:%d pincount:%d\n", folio_order(folio), + folio_mapcount(folio), folio_entire_mapcount(folio), folio_nr_pages_mapped(folio), atomic_read(&folio->_pincount)); diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 5dc3f5ea3a2e..d74289d3f30d 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1517,7 +1517,7 @@ static void __destroy_compound_gigantic_folio(struct folio *folio, struct page *p; atomic_set(&folio->_entire_mapcount, 0); - atomic_set(&folio->_nr_pages_mapped, 0); + atomic_set(&folio->_large_mapcount, 0); atomic_set(&folio->_pincount, 0); for (i = 1; i < nr_pages; i++) { @@ -2120,7 +2120,7 @@ static bool __prep_compound_gigantic_folio(struct folio *folio, /* we rely on 
prep_new_hugetlb_folio to set the hugetlb flag */ folio_set_order(folio, order); atomic_set(&folio->_entire_mapcount, -1); - atomic_set(&folio->_nr_pages_mapped, 0); + atomic_set(&folio->_large_mapcount, -1); atomic_set(&folio->_pincount, 0); return true; diff --git a/mm/internal.h b/mm/internal.h index 22152e0c8494..2adc3f616b71 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -72,6 +72,8 @@ void page_writeback_init(void); /* * How many individual pages have an elevated _mapcount. Excludes * the folio's entire_mapcount. + * + * Don't use this function outside of debugging code. */ static inline int folio_nr_pages_mapped(const struct folio *folio) { @@ -611,6 +613,7 @@ static inline void prep_compound_head(struct page *page, unsigned int order) struct folio *folio = (struct folio *)page; folio_set_order(folio, order); + atomic_set(&folio->_large_mapcount, -1); atomic_set(&folio->_entire_mapcount, -1); atomic_set(&folio->_nr_pages_mapped, 0); atomic_set(&folio->_pincount, 0); diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 89e2624fb3ff..2f73d2aa9ae8 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1358,7 +1358,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm, * Check if the page has any GUP (or other external) pins. * * Here the check may be racy: - * it may see total_mapcount > refcount in some cases? + * it may see folio_mapcount() > folio_ref_count(). * But such case is ephemeral we could always retry collapse * later. However it may report false positive if the page * has excessive GUP pins (i.e. 512). Anyway the same check diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 22e8b9f1d710..dd4265c760ff 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -935,6 +935,10 @@ static int free_tail_page_prepare(struct page *head_page, struct page *page) bad_page(page, "nonzero entire_mapcount"); goto out; } + if (unlikely(folio_large_mapcount(folio))) { + bad_page(page, "nonzero large_mapcount"); + goto out; + } if (unlikely(atomic_read(&folio->_nr_pages_mapped))) { bad_page(page, "nonzero nr_pages_mapped"); goto out; diff --git a/mm/rmap.c b/mm/rmap.c index 4bde6d60db6c..2608c40dffad 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1138,34 +1138,12 @@ int pfn_mkclean_range(unsigned long pfn, unsigned long nr_pages, pgoff_t pgoff, return page_vma_mkclean_one(&pvmw); } -int folio_total_mapcount(const struct folio *folio) -{ - int mapcount = folio_entire_mapcount(folio); - int nr_pages; - int i; - - /* In the common case, avoid the loop when no pages mapped by PTE */ - if (folio_nr_pages_mapped(folio) == 0) - return mapcount; - /* - * Add all the PTE mappings of those pages mapped by PTE. - * Limit the loop to folio_nr_pages_mapped()? - * Perhaps: given all the raciness, that may be a good or a bad idea. 
- */ - nr_pages = folio_nr_pages(folio); - for (i = 0; i < nr_pages; i++) - mapcount += atomic_read(&folio_page(folio, i)->_mapcount); - - /* But each of those _mapcounts was based on -1 */ - mapcount += nr_pages; - return mapcount; -} - static __always_inline unsigned int __folio_add_rmap(struct folio *folio, struct page *page, int nr_pages, enum rmap_level level, int *nr_pmdmapped) { atomic_t *mapped = &folio->_nr_pages_mapped; + const int orig_nr_pages = nr_pages; int first, nr = 0; __folio_rmap_sanity_checks(folio, page, nr_pages, level); @@ -1185,6 +1163,7 @@ static __always_inline unsigned int __folio_add_rmap(struct folio *folio, nr++; } } while (page++, --nr_pages > 0); + atomic_add(orig_nr_pages, &folio->_large_mapcount); break; case RMAP_LEVEL_PMD: first = atomic_inc_and_test(&folio->_entire_mapcount); @@ -1201,6 +1180,7 @@ static __always_inline unsigned int __folio_add_rmap(struct folio *folio, nr = 0; } } + atomic_inc(&folio->_large_mapcount); break; } return nr; @@ -1436,10 +1416,14 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma, SetPageAnonExclusive(page); } + /* increment count (starts at -1) */ + atomic_set(&folio->_large_mapcount, nr - 1); atomic_set(&folio->_nr_pages_mapped, nr); } else { /* increment count (starts at -1) */ atomic_set(&folio->_entire_mapcount, 0); + /* increment count (starts at -1) */ + atomic_set(&folio->_large_mapcount, 0); atomic_set(&folio->_nr_pages_mapped, ENTIRELY_MAPPED); SetPageAnonExclusive(&folio->page); __lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr); @@ -1522,6 +1506,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, break; } + atomic_sub(nr_pages, &folio->_large_mapcount); do { last = atomic_add_negative(-1, &page->_mapcount); if (last) { @@ -1532,6 +1517,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, } while (page++, --nr_pages > 0); break; case RMAP_LEVEL_PMD: + atomic_dec(&folio->_large_mapcount); last = atomic_add_negative(-1, &folio->_entire_mapcount); if (last) { nr = atomic_sub_return_relaxed(ENTIRELY_MAPPED, mapped); @@ -2714,6 +2700,7 @@ void hugetlb_add_anon_rmap(struct folio *folio, struct vm_area_struct *vma, VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio); atomic_inc(&folio->_entire_mapcount); + atomic_inc(&folio->_large_mapcount); if (flags & RMAP_EXCLUSIVE) SetPageAnonExclusive(&folio->page); VM_WARN_ON_FOLIO(folio_entire_mapcount(folio) > 1 && @@ -2728,6 +2715,7 @@ void hugetlb_add_new_anon_rmap(struct folio *folio, BUG_ON(address < vma->vm_start || address >= vma->vm_end); /* increment count (starts at -1) */ atomic_set(&folio->_entire_mapcount, 0); + atomic_set(&folio->_large_mapcount, 0); folio_clear_hugetlb_restore_reserve(folio); __folio_set_anon(folio, vma, address, true); SetPageAnonExclusive(&folio->page); -- cgit v1.2.3-59-g8ed1b From 7115936ac10abd0997f2937e4305cb679d47b592 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Tue, 9 Apr 2024 21:22:53 +0200 Subject: mm/page_alloc: use folio_mapped() in __alloc_contig_migrate_range() We want to limit the use of page_mapcount() to the places where it is absolutely necessary. For tracing purposes, we use page_mapcount() in __alloc_contig_migrate_range(). Adding that mapcount to total_mapped sounds strange: total_migrated and total_reclaimed would count each page only once, not multiple times. But then, isolate_migratepages_range() adds each folio only once to the list. 
So for large folios, we would query the mapcount of the first page of the folio, which doesn't make too much sense for large folios. Let's simply use folio_mapped() * folio_nr_pages(), which makes more sense as nr_migratepages is also incremented by the number of pages in the folio in case of successful migration. Link: https://lkml.kernel.org/r/20240409192301.907377-11-david@redhat.com Signed-off-by: David Hildenbrand Cc: Chris Zankel Cc: Hugh Dickins Cc: John Paul Adrian Glaubitz Cc: Jonathan Corbet Cc: Matthew Wilcox (Oracle) Cc: Max Filippov Cc: Miaohe Lin Cc: Muchun Song Cc: Naoya Horiguchi Cc: Peter Xu Cc: Richard Chang Cc: Rich Felker Cc: Ryan Roberts Cc: Yang Shi Cc: Yin Fengwei Cc: Yoshinori Sato Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/page_alloc.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index dd4265c760ff..45fccbcda763 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6347,8 +6347,12 @@ int __alloc_contig_migrate_range(struct compact_control *cc, if (trace_mm_alloc_contig_migrate_range_info_enabled()) { total_reclaimed += nr_reclaimed; - list_for_each_entry(page, &cc->migratepages, lru) - total_mapped += page_mapcount(page); + list_for_each_entry(page, &cc->migratepages, lru) { + struct folio *folio = page_folio(page); + + total_mapped += folio_mapped(folio) * + folio_nr_pages(folio); + } } ret = migrate_pages(&cc->migratepages, alloc_migration_target, -- cgit v1.2.3-59-g8ed1b From 231f8c7127e37edcd4d9e3f87e0f9fcf0e90d902 Mon Sep 17 00:00:00 2001 From: Baolin Wang Date: Fri, 12 Apr 2024 11:27:04 +0800 Subject: mm: page_alloc: allowing mTHP compaction to capture the freed page directly Currently, compaction_capture() does not allow lower-order allocations to directly capture the movable free pages, even though lower-order allocations might also be requesting movable pages, which can lead to more compaction scanning. And, with the enablement of mTHP, such situations will become more common. Thus, allowing lower-order (mTHP) allocations of movable page types to directly capture the movable free pages avoids unnecessary compaction scanning without polluting the movable pageblock (see the sketch after this patch). Testing 1M mTHP compaction shows that compaction scanning is significantly reduced:

                                     mm-unstable         patched
Ops Compaction pages isolated      116598741.00    120946702.00
Ops Compaction migrate scanned    1764870054.00   1488621550.00
Ops Compaction free scanned       7707879039.00   4986299318.00
Ops Compact scan efficiency              22.90           29.85
Ops Compaction cost                   73797.69        72933.48

Link: https://lkml.kernel.org/r/8118a5d66a034736a48433beddaca60ed78577c4.1712892329.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang Reviewed-by: Zi Yan Acked-by: Johannes Weiner Cc: David Hildenbrand Cc: "Huang, Ying" Cc: Mel Gorman Cc: Ryan Roberts Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- mm/page_alloc.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 45fccbcda763..2e22ce5675ca 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -595,12 +595,14 @@ compaction_capture(struct capture_control *capc, struct page *page, return false; /* - * Do not let lower order allocations pollute a movable pageblock. + * Do not let lower order allocations pollute a movable pageblock + * unless compaction is also requesting movable pages.
* This might let an unmovable request use a reclaimable pageblock * and vice-versa but no more than normal fallback logic which can * have trouble finding a high-order free page. */ - if (order < pageblock_order && migratetype == MIGRATE_MOVABLE) + if (order < pageblock_order && migratetype == MIGRATE_MOVABLE && + capc->cc->migratetype != MIGRATE_MOVABLE) return false; capc->page = page; -- cgit v1.2.3-59-g8ed1b
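A hedged userspace sketch of the capture rule above (the enum, the PAGEBLOCK_ORDER value, and the may_capture() helper are stand-ins invented for illustration; only the shape of the condition mirrors the patch):

#include <stdbool.h>
#include <stdio.h>

enum migratetype { MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, MIGRATE_RECLAIMABLE };

#define PAGEBLOCK_ORDER 9       /* stand-in; typical with 4K pages */

/*
 * Models the updated check: a freed page in a movable pageblock is refused
 * to a lower-order request only when the compaction run asking for it is
 * not itself working on behalf of a movable allocation.
 */
static bool may_capture(unsigned int order, enum migratetype freed_type,
                        enum migratetype cc_type)
{
        if (order < PAGEBLOCK_ORDER && freed_type == MIGRATE_MOVABLE &&
            cc_type != MIGRATE_MOVABLE)
                return false;   /* would pollute a movable pageblock */
        return true;
}

int main(void)
{
        /* Order-8 (e.g. 1M mTHP) movable request: now allowed to capture. */
        printf("movable request:   %d\n",
               may_capture(8, MIGRATE_MOVABLE, MIGRATE_MOVABLE));
        /* Same order, unmovable request: still refused, as before. */
        printf("unmovable request: %d\n",
               may_capture(8, MIGRATE_MOVABLE, MIGRATE_UNMOVABLE));
        return 0;
}

Before this change the cc_type term was absent from the check, so even the movable order-8 request in the first call would have been refused.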