aboutsummaryrefslogtreecommitdiffstatshomepage
AgeCommit message (Collapse)AuthorFilesLines
2025-01-25mm/damon/sysfs: remove unused code for schemes tried regions updateSeongJae Park3-282/+2
DAMON sysfs interface was using damon_callback with its own complicated synchronization logics to update DAMOS scheme applied regions directories and files. But it is replaced to use damos_walk(), and the additional synchronization logics are no more being used. Remove those. Link: https://lkml.kernel.org/r/20250103174400.54890-11-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm/damon/sysfs: use damos_walk() for update_schemes_tried_{bytes,regions}SeongJae Park3-0/+96
DAMON sysfs interface uses damon_callback with its own complicated synchronization facility to handle update_schemes_tried_bytes and update_schemes_tried_regions commands. But damos_walk() can support the use case without the additional synchronizations. Convert the code to use damos_walk() instead. Link: https://lkml.kernel.org/r/20250103174400.54890-10-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25Docs/mm/damon/design: document DAMOS regions walkingSeongJae Park1-0/+11
DAMOS' regions walking is a feature for efficiently retrieving monitoring results or DAMOS-internal behavior. It can be useful for multiple purposes including investigations and tuning. Add a section for it on the design document. Link: https://lkml.kernel.org/r/20250103174400.54890-9-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm/damon/core: implement damos_walk()SeongJae Park2-2/+163
Introduce a new core layer interface, damos_walk(). It aims to replace some damon_callback usages that access DAMOS schemes applied regions of ongoing kdamond with additional synchronizations. It receives a function pointer and asks kdamond to invoke it for any region that it tried to apply any DAMOS action within one scheme apply interval for every scheme of it. The function further waits until the kdamond finishes the invocations for every scheme, or cancels the request, and returns. The kdamond invokes the function as requested within the main loop. If it is deactivated by DAMOS watermarks or going out of the main loop, it marks the request as canceled, so that damos_walk() can wakeup and return. Link: https://lkml.kernel.org/r/20250103174400.54890-8-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm/damon/sysfs: use damon_call() for update_schemes_effective_quotasSeongJae Park1-8/+7
DAMON sysfs interface uses damon_callback with its own synchronization facility to handle update_schemes_effective_quotas command. But damon_call() can support the use case without the additional synchronizations. Convert the code to use damon_call() instead. Link: https://lkml.kernel.org/r/20250103174400.54890-7-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm/damon/sysfs: use damon_call() for commit_schemes_quota_goalsSeongJae Park1-5/+6
DAMON sysfs interface uses damon_callback with its own synchronization facility to handle commit_schemes_quota_goals command. But damon_call() can support the use case without the additional synchronizations. Convert the code to use damon_call() instead. Link: https://lkml.kernel.org/r/20250103174400.54890-6-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm/damon/sysfs: use damon_call() for update_schemes_statsSeongJae Park1-10/+21
DAMON sysfs interface uses damon_callback with its own synchronization facility to handle update_schemes_stats kdamond command. But damon_call() can support the use case without the additional synchronizations. Convert the code to use damon_call() instead. Link: https://lkml.kernel.org/r/20250103174400.54890-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm/damon/core: introduce damon_call()SeongJae Park2-0/+112
Introduce a new DAMON core API function, damon_call(). It aims to replace some damon_callback usages that access damon_ctx of ongoing kdamond with additional synchronizations. It receives a function pointer, let the parallel kdamond invokes the function, and returns after the invocation is finished, or canceled due to some races. kdamond invokes the function inside the main loop after sampling is done. If it is deactivated by DAMOS watermarks or already out of the main loop, mark the request as canceled so that damon_call() can wakeup and return. Link: https://lkml.kernel.org/r/20250103174400.54890-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm/damon/sysfs: handle clear_schemes_tried_regions from DAMON sysfs contextSeongJae Park2-15/+4
DAMON sysfs interface handles clear_schemes_tried_regions request from the DAMON callback context (damon_sysfs_cmd_request_callback()), which is designed to be used for safe access to the related DAMON context internal data. But no DAMON context internal data is accessed for the work. Directly handle it from DAMON sysfs interface context, namely damon_sysfs_handle_cmd(). Link: https://lkml.kernel.org/r/20250103174400.54890-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm/damon/sysfs-schemes: remove unnecessary schemes existence check in damon_sysfs_schemes_clear_regions()SeongJae Park3-14/+7
Patch series "mm/damon: replace most damon_callback usages in sysfs with new core functions". DAMON provides damon_callback API that notifies monitoring events and allows safe access to damon_ctx internal data. The usage is simple. Users register and deregister callback functions for different monitoring events in damon_ctx. Then the DAMON worker thread (kdamond) of the damon_ctx calls back the registered functions on the events. It is designed in such simple way because it was sufficient for usages of DAMON at the early days. We also wanted to make it flexible so that API user code can implement any required additional features on top of damon_callback on their demands. As expected, more sophisticated usages have invented. Online updates of DAMON parameters and DAMOS auto-tuning inputs, and online retrieval of DAMOS statistics and tried regions information are such usages. Because damon_callback doesn't provide any explicit synchronization mechanism, the user ABIs for exposing such functionalities are implemented in asynchronous ways (DAMON_RECLAIM and DAMON_LRU_SORT}), or synchronous ways (DAMON_SYSFS) with additional synchronization mechanisms that built inside the ABI implementation, on top of damon_callback. So damon_callback is working as expected. However, the additional mechanisms built inside ABI on top of damon_callback is becoming somewhat too big and not easy to maintain. The additional mechanisms can be smaller and easier to maintain when implemented inside the core logic layer. Introduce two new DAMON core API, namely 'damon_call()' and 'damos_walk()'. The two functions support synchronous access to - damon_ctx internal data including DAMON parameters and monitoring results, and - DAMOS-specific data such as regions that each DAMOS action is applied, respectively. And replace most of damon_callback usages in DAMON sysfs interface with the new core API functions. damon_callback usage for online DAMON parameters tuning is not replaced in this series, since it has specific callback timing assumptions that require more works. Patch sequence ============== First two patches are fixups for simplifying the following changes. Those remove a unnecessary condition check and a synchronization, respectively. Third patch implements one of the new DAMON core APIs, namely damon_call(). Three patches replacing damon_callback usages in DAMON sysfs interface using damon_call() follow. Then, seventh and eighth patches introduces the other new DAMON API, damos_walk(), and document it on the design doc. Ninth patch replaces two damon_callback usages in DAMON sysfs interface using damos_walk(). The tenth patch finally cleans up code that no more being used. This patch (of 10): damon_sysfs_schemes_clear_regions() skips removing the scheme tried region directories only if the matching scheme is still ongoing. It is unnecessary check, since what users want is just removing the entire region directories. Remove the unnecessary check. Link: https://lkml.kernel.org/r/20250103174400.54890-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250103174400.54890-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm: introduce ctor/dtor at PGD levelKevin Brodsky5-5/+21
Following on from the introduction of P4D-level ctor/dtor, let's finish the job and introduce ctor/dtor at PGD level. The incurred improvement in page accounting is minimal - the main motivation is to create a single, generic place where construction/destruction hooks can be added for all page table pages. This patch should cover all architectures and all configurations where PGDs are one or more regular pages. This excludes any configuration where PGDs are allocated from a kmem_cache object. Link: https://lkml.kernel.org/r/20250103184415.2744423-7-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25asm-generic: pgalloc: provide generic __pgd_{alloc,free}Kevin Brodsky19-80/+65
We already have a generic implementation of alloc/free up to P4D level, as well as pgd_free(). Let's finish the work and add a generic PGD-level alloc helper as well. Unlike at lower levels, almost all architectures need some specific magic at PGD level (typically initialising PGD entries), so introducing a generic pgd_alloc() isn't worth it. Instead we introduce two new helpers, __pgd_alloc() and __pgd_free(), and make use of them in the arch-specific pgd_alloc() and pgd_free() wherever possible. To accommodate as many arch as possible, __pgd_alloc() takes a page allocation order. Because pagetable_alloc() allocates zeroed pages, explicit zeroing in pgd_alloc() becomes redundant and we can get rid of it. Some trivial implementations of pgd_free() also become unnecessary once __pgd_alloc() is used; remove them. Another small improvement is consistent accounting of PGD pages by using GFP_PGTABLE_{USER,KERNEL} as appropriate. Not all PGD allocations can be handled by the generic helpers. In particular, multiple architectures allocate PGDs from a kmem_cache, and those PGDs may not be page-sized. Link: https://lkml.kernel.org/r/20250103184415.2744423-6-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25ARM: mm: rename PGD helpersKevin Brodsky1-7/+7
Generic implementations of __pgd_alloc and __pgd_free are about to be introduced. Rename the macros in arch/arm/mm/pgd.c to avoid clashes. While we're at it, also pass down the mm as argument to those helpers, as it will be needed to call the generic __pgd_{alloc,free}. Link: https://lkml.kernel.org/r/20250103184415.2744423-5-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25m68k: mm: add calls to pagetable_pmd_[cd]torKevin Brodsky2-8/+15
get_pointer_table() and free_pointer_table() already special-case TABLE_PTE to call pagetable_pte_[cd]tor. Let's do the same at PMD level to improve accounting further. TABLE_PGD and TABLE_PMD are currently defined to the same value, so we first need to separate them. That also implies separating ptable_list for PMD/PGD levels. Link: https://lkml.kernel.org/r/20250103184415.2744423-4-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25parisc: mm: ensure pagetable_pmd_[cd]tor are calledKevin Brodsky1-11/+12
The implementation of pmd_{alloc_one,free} on parisc requires a non-zero allocation order, but is completely standard aside from that. Let's reuse the generic implementation of pmd_alloc_one(). Explicit zeroing is not needed as GFP_PGTABLE_KERNEL includes __GFP_ZERO. The generic pmd_free() can handle higher allocation orders so we don't need to define our own. These changes ensure that pagetable_pmd_[cd]tor are called, improving the accounting of page table pages. Link: https://lkml.kernel.org/r/20250103184415.2744423-3-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm: move common part of pagetable_*_ctor to helperKevin Brodsky1-16/+12
Patch series "Account page tables at all levels". This series should be considered in conjunction with Qi's series [1]. Together, they ensure that page table ctor/dtor are called at all levels (PTE to PGD) and all architectures, where page tables are regular pages. Besides the improvement in accounting and general cleanup, this also create a single place where construction/destruction hooks can be called for all page tables, namely the now-generic pagetable_dtor() introduced by Qi, and __pagetable_ctor() introduced in this series. [1] https://lore.kernel.org/linux-mm/cover.1735549103.git.zhengqi.arch@bytedance.com/ This patch (of 6): pagetable_*_ctor all have the same basic implementation. Move the common part to a helper to reduce duplication. Link: https://lkml.kernel.org/r/20250103184415.2744423-1-kevin.brodsky@arm.com Link: https://lkml.kernel.org/r/20250103184415.2744423-2-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm/debug: prefer VM_WARN_ON_VMG() to report VMG debug warningsLorenzo Stoakes1-16/+17
Now we have VM_WARN_ON_VMG() to provide us with considerably more debug output when a debug assert fails, utilise it everywhere we can. This allows us to have considerably more information to go on when things go wrong, especially when a non-repro issue occurs as reported by syzkaller or the like. Link: https://lkml.kernel.org/r/986e45e9549e71284ac7a7fa878688568a94d58b.1735932169.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm/debug: introduce VM_WARN_ON_VMG() to dump VMA merge stateLorenzo Stoakes2-1/+84
Patch series "mm/debug: introduce and use VM_WARN_ON_VMG()". We use a number of asserts, enabled only when CONFIG_DEBUG_VM is set, during VMA merge operations to ensure state is as expected. However, when syzkaller or the like encounters these asserts, often the information provided by the report is insufficient to narrow down what the problem is. We noticed this recently in [0], where a non-repro issue resisted debugging due to simply not having sufficient information to go on. This series improves the situation by providing VM_WARN_ON_VMG() which acts like VM_WARN_ON() (i.e. only actually being invoked if CONFIG_DEBUG_VM is set), while dumping significant information about the VMA merge state, the mm_struct describing the virtual address space, all associated VMAs and, if CONFIG_DEBUG_VM_MAPLE_TREE is set, the associated maple tree. [0]:https://lore.kernel.org/all/6774c98f.050a0220.25abdd.0991.GAE@google.com/ This patch (of 2): We use a number of asserts, enabled only when CONFIG_DEBUG_VM is set, during VMA merge operations to ensure state is as expected. However, when syzkaller or the like encounters these asserts, often the information provided by the report is insufficient to narrow down what the problem is. This might not be so much of an issue if the reported problem is reproducible, but if it is a rarely encountered race or some other case which precludes a repro, it is a very big problem (see [0] for the motivating case). It is therefore sensible to provide a means by which we can easily and conveniently dump a lot more information in these circumstances. The aggregation of merge state into a single struct threaded through the operation makes this trivial - we can simply introduce a variant on VM_WARN_ON() which takes the VMA merge state object (vmg) and use that to dump information. This patch therefore introduces VM_WARN_ON_VMG() which provides this functionality. It additionally dumps full mm state, VMA state for each of the three VMAs the vmg contains (prev, next, vma) and if CONFIG_DEBUG_VM_MAPLE_TREE is enabled, dumps the maple tree from the provided VMA iterator if non-NULL. This patch has no functional impact if CONFIG_DEBUG_VM is not set. [0]:https://lore.kernel.org/all/6774c98f.050a0220.25abdd.0991.GAE@google.com/ Link: https://lkml.kernel.org/r/cover.1735932169.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/13b09b52d4d103ee86acaf0ae612539648ae29e0.1735932169.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25lib/list_debug.c: add object information in case of invalid objectManinder Singh5-20/+22
As of now during link list corruption it prints about cluprit address and its wrong value, but sometime it is not enough to catch the actual issue point. If it prints allocation and free path of that corrupted node, it will be a lot easier to find and fix the issues. Adding the same information when data mismatch is found in link list debug data: [ 14.243055] slab kmalloc-32 start ffff0000cda19320 data offset 32 pointer offset 8 size 32 allocated at add_to_list+0x28/0xb0 [ 14.245259] __kmalloc_cache_noprof+0x1c4/0x358 [ 14.245572] add_to_list+0x28/0xb0 ... [ 14.248632] do_el0_svc_compat+0x1c/0x34 [ 14.249018] el0_svc_compat+0x2c/0x80 [ 14.249244] Free path: [ 14.249410] kfree+0x24c/0x2f0 [ 14.249724] do_force_corruption+0xbc/0x100 ... [ 14.252266] el0_svc_common.constprop.0+0x40/0xe0 [ 14.252540] do_el0_svc_compat+0x1c/0x34 [ 14.252763] el0_svc_compat+0x2c/0x80 [ 14.253071] ------------[ cut here ]------------ [ 14.253303] list_del corruption. next->prev should be ffff0000cda192a8, but was 6b6b6b6b6b6b6b6b. (next=ffff0000cda19348) [ 14.254255] WARNING: CPU: 3 PID: 84 at lib/list_debug.c:65 __list_del_entry_valid_or_report+0x158/0x164 Moved prototype of mem_dump_obj() to bug.h, as mm.h can not be included in bug.h. Link: https://lkml.kernel.org/r/20241230101043.53773-1-maninder1.s@samsung.com Signed-off-by: Maninder Singh <maninder1.s@samsung.com> Acked-by: Jan Kara <jack@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Marco Elver <elver@google.com> Cc: Rohit Thapliyal <r.thapliyal@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm: pgtable: introduce generic pagetable_dtor_free()Qi Zheng4-16/+11
The pte_free(), pmd_free(), __pud_free() and __p4d_free() in asm-generic/pgalloc.h and the generic __tlb_remove_table() are basically the same, so let's introduce pagetable_dtor_free() to deduplicate them. In addition, the pagetable_dtor_free() in s390 does the same thing, so let's s390 also calls generic pagetable_dtor_free(). Link: https://lkml.kernel.org/r/1663a0565aca881d1338ceb7d1db4aa9c333abd6.1736317725.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com> Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> [s390] Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm: pgtable: move __tlb_remove_table_one() in x86 to generic fileQi Zheng2-21/+18
The __tlb_remove_table_one() in x86 does not contain architecture-specific content, so move it to the generic file. Link: https://lkml.kernel.org/r/aab8a449bc67167943fd2cb5aab0a3a23b7b1cd7.1736317725.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm: pgtable: completely move pagetable_dtor() to generic tlb_remove_table()Qi Zheng2-6/+8
For the generic tlb_remove_table(), it is implemented in the following two forms: 1) CONFIG_MMU_GATHER_TABLE_FREE is enabled tlb_remove_table --> generic __tlb_remove_table() 2) CONFIG_MMU_GATHER_TABLE_FREE is disabled tlb_remove_table --> tlb_remove_page For case 1), the pagetable_dtor() has already been moved to generic __tlb_remove_table(). For case 2), now only arm will call tlb_remove_table()/tlb_remove_ptdesc() when CONFIG_MMU_GATHER_TABLE_FREE is disabled. Let's move pagetable_dtor() completely to generic tlb_remove_table(), so that the architectures can follow more easily. Link: https://lkml.kernel.org/r/0c733ac867b287ec08190676496d1decebf49da2.1736317725.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Suggested-by: Kevin Brodsky <kevin.brodsky@arm.com> Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm: pgtable: introduce generic __tlb_remove_table()Qi Zheng9-59/+19
Several architectures (arm, arm64, riscv and x86) define exactly the same __tlb_remove_table(), just introduce generic __tlb_remove_table() to eliminate these duplications. The s390 __tlb_remove_table() is nearly the same, so also make s390 __tlb_remove_table() version generic. Link: https://lkml.kernel.org/r/ea372633d94f4d3f9f56a7ec5994bf050bf77e39.1736317725.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com> Acked-by: Andreas Larsson <andreas@gaisler.com> [sparc] Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> [s390] Acked-by: Arnd Bergmann <arnd@arndb.de> [asm-generic] Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25s390: pgtable: consolidate PxD and PTE TLB free pathsQi Zheng2-13/+4
Call pagetable_dtor() for PMD|PUD|P4D tables just before ptdesc is freed - same as it is done for PTE tables. That allows consolidating TLB free paths for all table types. Link: https://lkml.kernel.org/r/ac69360a5f3350ebb2f63cd14b7b45316a130ee4.1736317725.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com> Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25x86: pgtable: move pagetable_dtor() to __tlb_remove_table()Qi Zheng3-12/+6
Move pagetable_dtor() to __tlb_remove_table(), so that ptlock and page table pages can be freed together (regardless of whether RCU is used). This prevents the use-after-free problem where the ptlock is freed immediately but the page table pages is freed later via RCU. Link: https://lkml.kernel.org/r/27b3cdc8786bebd4f748380bf82f796482718504.1736317725.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25x86: pgtable: convert __tlb_remove_table() to use struct ptdescQi Zheng3-13/+19
Convert __tlb_remove_table() to use struct ptdesc, which will help to move pagetable_dtor() to __tlb_remove_table(). And page tables shouldn't have swap cache, so use pagetable_free() instead of free_page_and_swap_cache() to free page table pages. Link: https://lkml.kernel.org/r/39f60f93143ff77cf5d6b3c3e75af0ffc1480adb.1736317725.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25riscv: pgtable: move pagetable_dtor() to __tlb_remove_table()Qi Zheng2-31/+21
Move pagetable_dtor() to __tlb_remove_table(), so that ptlock and page table pages can be freed together (regardless of whether RCU is used). This prevents the use-after-free problem where the ptlock is freed immediately but the page table pages is freed later via RCU. Page tables shouldn't have swap cache, so use pagetable_free() instead of free_page_and_swap_cache() to free page table pages. By the way, move the comment above __tlb_remove_table() to riscv_tlb_remove_ptdesc(), it will be more appropriate. Link: https://lkml.kernel.org/r/b89d77c965507b1b102cbabe988e69365cb288b6.1736317725.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25arm64: pgtable: move pagetable_dtor() to __tlb_remove_table()Qi Zheng1-6/+4
Move pagetable_dtor() to __tlb_remove_table(), so that ptlock and page table pages can be freed together (regardless of whether RCU is used). This prevents the use-after-free problem where the ptlock is freed immediately but the page table pages is freed later via RCU. Page tables shouldn't have swap cache, so use pagetable_free() instead of free_page_and_swap_cache() to free page table pages. Link: https://lkml.kernel.org/r/cf4b847caf390f96a3e3d534dacb2c174e16c154.1736317725.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com> Acked-by: Will Deacon <will@kernel.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25arm: pgtable: move pagetable_dtor() to __tlb_remove_table()Qi Zheng1-3/+6
Move pagetable_dtor() to __tlb_remove_table(), so that ptlock and page table pages can be freed together (regardless of whether RCU is used). This prevents the use-after-free problem where the ptlock is freed immediately but the page table pages is freed later via RCU. Page tables shouldn't have swap cache, so use pagetable_free() instead of free_page_and_swap_cache() to free page table pages. Link: https://lkml.kernel.org/r/327b4b8990729edd4ce97d9d5acbdaff2d9fa1d1.1736317725.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm: pgtable: introduce pagetable_dtor()Qi Zheng28-95/+62
The pagetable_p*_dtor() are exactly the same except for the handling of ptlock. If we make ptlock_free() handle the case where ptdesc->ptl is NULL and remove VM_BUG_ON_PAGE() from pmd_ptlock_free(), we can unify pagetable_p*_dtor() into one function. Let's introduce pagetable_dtor() to do this. Later, pagetable_dtor() will be moved to tlb_remove_ptdesc(), so that ptlock and page table pages can be freed together (regardless of whether RCU is used). This prevents the use-after-free problem where the ptlock is freed immediately but the page table pages is freed later via RCU. Link: https://lkml.kernel.org/r/47f44fff9dc68d9d9e9a0d6c036df275f820598a.1736317725.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Originally-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com> Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> [s390] Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25s390: pgtable: add statistics for PUD and P4D level page tableQi Zheng2-8/+23
Like PMD and PTE level page table, also add statistics for PUD and P4D page table. Link: https://lkml.kernel.org/r/4707dffce228ccec5c6662810566dd12b5741c4b.1736317725.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com> Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25arm64: pgtable: use mmu gather to free p4d level page tableQi Zheng2-1/+14
Like other levels of page tables, also use mmu gather mechanism to free p4d level page table. Link: https://lkml.kernel.org/r/3fd48525397b34a64f7c0eb76746da30814dc941.1736317725.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Originally-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm: pgtable: add statistics for P4D level page tableQi Zheng4-1/+26
Like other levels of page tables, add statistics for P4D level page table. Link: https://lkml.kernel.org/r/d55fe3c286305aae84457da9e1066df99b3de125.1736317725.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Originally-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25asm-generic: pgalloc: provide generic p4d_{alloc_one,free}Kevin Brodsky4-58/+45
Four architectures currently implement 5-level pgtables: arm64, riscv, x86 and s390. The first three have essentially the same implementation for p4d_alloc_one() and p4d_free(), so we've got an opportunity to reduce duplication like at the lower levels. Provide a generic version of p4d_alloc_one() and p4d_free(), and make use of it on those architectures. Their implementation is the same as at PUD level, except that p4d_free() performs a runtime check by calling mm_p4d_folded(). 5-level pgtables depend on a runtime-detected hardware feature on all supported architectures, so we might as well include this check in the generic implementation. No runtime check is required in p4d_alloc_one() as the top-level p4d_alloc() already does the required check. Link: https://lkml.kernel.org/r/26d69c74a29183ecc335b9b407040d8e4cd70c6a.1736317725.git.zhengqi.arch@bytedance.com Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Arnd Bergmann <arnd@arndb.de> [asm-generic] Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25riscv: mm: skip pgtable level check in {pud,p4d}_alloc_oneKevin Brodsky1-18/+4
Patch series "move pagetable_*_dtor() to __tlb_remove_table()", v5. As proposed [1] by Peter Zijlstra below, this patch series aims to move pagetable_*_dtor() into __tlb_remove_table(). This will cleanup pagetable_*_dtor() a bit and more gracefully fix the UAF issue [2] reported by syzbot. : Notably: : : - s390 pud isn't calling the existing pagetable_pud_[cd]tor() : - none of the p4d things have pagetable_p4d_[cd]tor() (x86,arm64,s390,riscv) : and they have inconsistent accounting : - while much of the _ctor calls are in generic code, many of the _dtor : calls are in arch code for hysterial raisins, this could easily be : fixed : - if we fix ptlock_free() to handle NULL, then all the _dtor() : functions can use it, and we can observe they're all identical : and can be folded : : after all that cleanup, you can move the _dtor from *_free_tlb() into : tlb_remove_table() -- which for the above case, would then have it called : from __tlb_remove_table_free(). This patch (of 16): {pmd,pud,p4d}_alloc_one() is never called if the corresponding page table level is folded, as {pmd,pud,p4d}_alloc() already does the required check. We can therefore remove the runtime page table level checks in {pud,p4d}_alloc_one. The PUD helper becomes equivalent to the generic version, so we remove it altogether. This is consistent with the way arm64 and x86 handle this situation (runtime check in p4d_free() only). Link: https://lkml.kernel.org/r/cover.1736317725.git.zhengqi.arch@bytedance.com Link: https://lkml.kernel.org/r/93a1c6bddc0ded9f1a9f15658c1e4af5c93d1194.1736317725.git.zhengqi.arch@bytedance.com Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm: remove unnecessary calls to lru_add_drainRik van Riel4-6/+0
There seem to be several categories of calls to lru_add_drain and lru_add_drain_all. The first are code paths that recently allocated, swapped in, or otherwise processed a batch of pages, and want them all on the LRU. These drain pages that were recently allocated, probably on the local CPU. A second category are code paths that are actively trying to reclaim, migrate, or offline memory. These often use lru_add_drain_all, to drain the caches on all CPUs. However, there also seem to be some other callers where we aren't really doing either. They are calling lru_add_drain(), despite operating on pages that may have been allocated long ago, and quite possibly on different CPUs. Those calls are not likely to be effective at anything but creating lock contention on the LRU locks. Remove the lru_add_drain calls in the latter category. For detailed reasoning, see [1] and [2]. Link: https://lkml.kernel.org/r/dca2824e8e88e826c6b260a831d79089b5b9c79d.camel@surriel.com [1] Link: https://lkml.kernel.org/r/xxfhcjaq2xxcl5adastz5omkytenq7izo2e5f4q7e3ns4z6lko@odigjjc7hqrg [2] Link: https://lkml.kernel.org/r/20241219153253.3da9e8aa@fangorn Signed-off-by: Rik van Riel <riel@surriel.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Chris Li <chrisl@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm: add build-time option for hotplug memory default online typeGregory Price7-23/+89
Memory hotplug presently auto-onlines memory into a zone the kernel deems appropriate if CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y. The memhp_default_state boot param enables runtime config, but it's not possible to do this at build-time. Remove CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE, and replace it with CONFIG_MHP_DEFAULT_ONLINE_TYPE_* choices that sync with the boot param. Selections: CONFIG_MHP_DEFAULT_ONLINE_TYPE_OFFLINE => mhp_default_online_type = "offline" Memory will not be onlined automatically. CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_AUTO => mhp_default_online_type = "online" Memory will be onlined automatically in a zone deemed. appropriate by the kernel. CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_KERNEL => mhp_default_online_type = "online_kernel" Memory will be onlined automatically. The zone may allow kernel data (e.g. ZONE_NORMAL). CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_MOVABLE => mhp_default_online_type = "online_movable" Memory will be onlined automatically. The zone will be ZONE_MOVABLE. Default to CONFIG_MHP_DEFAULT_ONLINE_TYPE_OFFLINE to match the existing default CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=n behavior. Existing users of CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y should use CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_AUTO. [gourry@gourry.net: update KConfig comments] Link: https://lkml.kernel.org/r/20241226182918.648799-1-gourry@gourry.net Link: https://lkml.kernel.org/r/20241220210709.300066-1-gourry@gourry.net Signed-off-by: Gregory Price <gourry@gourry.net> Acked-by: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Oscar Salvador <osalvador@suse.de> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: WANG Xuerui <kernel@xen0n.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25selftests/mm: add new test cases to the migration testDonet Tom1-0/+99
Added three new test cases to the migration tests: 1. Shared anon THP migration test This test will mmap shared anon memory, madvise it to MADV_HUGEPAGE, then do migration entry testing. One thread will move pages back and forth between nodes whilst other threads try and access them. 2. Private anon hugetlb migration test This test will mmap private anon hugetlb memory and then do the migration entry testing. 3. Shared anon hugetlb migration test This test will mmap shared anon hugetlb memory and then do the migration entry testing. Test results ============ # ./tools/testing/selftests/mm/migration TAP version 13 1..6 # Starting 6 tests from 1 test cases. # RUN migration.private_anon ... # OK migration.private_anon ok 1 migration.private_anon # RUN migration.shared_anon ... # OK migration.shared_anon ok 2 migration.shared_anon # RUN migration.private_anon_thp ... # OK migration.private_anon_thp ok 3 migration.private_anon_thp # RUN migration.shared_anon_thp ... # OK migration.shared_anon_thp ok 4 migration.shared_anon_thp # RUN migration.private_anon_htlb ... # OK migration.private_anon_htlb ok 5 migration.private_anon_htlb # RUN migration.shared_anon_htlb ... # OK migration.shared_anon_htlb ok 6 migration.shared_anon_htlb # PASSED: 6 / 6 tests passed. # Totals: pass:6 fail:0 xfail:0 xpass:0 skip:0 error:0 # Link: https://lkml.kernel.org/r/20241219102720.4487-1-donettom@linux.ibm.com Signed-off-by: Donet Tom <donettom@linux.ibm.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm: replace free hugepage folios after migrationyangge3-1/+60
My machine has 4 NUMA nodes, each equipped with 32GB of memory. I have configured each NUMA node with 16GB of CMA and 16GB of in-use hugetlb pages. The allocation of contiguous memory via cma_alloc() can fail probabilistically. When there are free hugetlb folios in the hugetlb pool, during the migration of in-use hugetlb folios, new folios are allocated from the free hugetlb pool. After the migration is completed, the old folios are released back to the free hugetlb pool instead of being returned to the buddy system. This can cause test_pages_isolated() check to fail, ultimately leading to the failure of cma_alloc(). Call trace: cma_alloc() __alloc_contig_migrate_range() // migrate in-use hugepage test_pages_isolated() __test_page_isolated_in_pageblock() PageBuddy(page) // check if the page is in buddy To address this issue, we introduce a function named replace_free_hugepage_folios(). This function will replace the hugepage in the free hugepage pool with a new one and release the old one to the buddy system. After the migration of in-use hugetlb pages is completed, we will invoke replace_free_hugepage_folios() to ensure that these hugepages are properly released to the buddy system. Following this step, when test_pages_isolated() is executed for inspection, it will successfully pass. Additionally, when alloc_contig_range() is used to migrate multiple in-use hugetlb pages, it can result in some in-use hugetlb pages being released back to the free hugetlb pool and subsequently being reallocated and used again. For example: [huge 0] [huge 1] To migrate huge 0, we obtain huge x from the pool. After the migration is completed, we return the now-freed huge 0 back to the pool. When it's time to migrate huge 1, we can simply reuse the now-freed huge 0 from the pool. As a result, when replace_free_hugepage_folios() is executed, it cannot release huge 0 back to the buddy system. To address this issue, we should prevent the reuse of isolated free hugepages during the migration process. Link: https://lkml.kernel.org/r/1734503588-16254-1-git-send-email-yangge1116@126.com Link: https://lkml.kernel.org/r/1736582300-11364-1-git-send-email-yangge1116@126.com Signed-off-by: yangge <yangge1116@126.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <21cnbao@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25zram: cond_resched() in writeback loopSergey Senozhatsky1-0/+2
zram writeback is a costly operation, because every target slot (unless ZRAM_HUGE) is decompressed before it gets written to a backing device. The writeback to a backing device uses submit_bio_wait() which may look like a rescheduling point. However, if the backing device has BD_HAS_SUBMIT_BIO bit set __submit_bio() calls directly disk->fops->submit_bio(bio) on the backing device and so when submit_bio_wait() calls blk_wait_io() the I/O is already done. On such systems we effective end up in a loop for_each (target slot) { decompress(slot) __submit_bio() disk->fops->submit_bio(bio) } Which on PREEMPT_NONE systems triggers watchdogs (since there are no explicit rescheduling points). Add cond_resched() to the zram writeback loop. Link: https://lkml.kernel.org/r/20241218063513.297475-8-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25zram: use zram_read_from_zspool() in writebackSergey Senozhatsky1-7/+4
We only can read pages from zspool in writeback, zram_read_page() is not really right in that context not only because it's a more generic function that handles ZRAM_WB pages, but also because it requires us to unlock slot between slot flag check and actual page read. Use zram_read_from_zspool() instead and do slot flags check and page read under the same slot lock. Link: https://lkml.kernel.org/r/20241218063513.297475-7-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25zram: factor out different page types readSergey Senozhatsky1-33/+52
Similarly to write, split the page read code into ZRAM_HUGE read, ZRAM_SAME read and compressed page read to simplify the code. Link: https://lkml.kernel.org/r/20241218063513.297475-6-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25zram: factor out ZRAM_HUGE writeSergey Senozhatsky1-53/+83
zram_write_page() handles: ZRAM_SAME pages (which was already factored out) stores, regular page stores and ZRAM_HUGE pages stores. ZRAM_HUGE handling adds a significant amount of complexity. Instead, we can handle ZRAM_HUGE in a separate function. This allows us to simplify zs_handle allocations slow-path, as it now does not handle ZRAM_HUGE case. ZRAM_HUGE zs_handle allocation, on the other hand, can now drop __GFP_KSWAPD_RECLAIM because we handle ZRAM_HUGE in preemptible context (outside of local-lock scope). Link: https://lkml.kernel.org/r/20241218063513.297475-5-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25zram: factor out ZRAM_SAME writeSergey Senozhatsky1-16/+21
Handling of ZRAM_SAME now uses a goto to the final stages of zram_write_page() plus it introduces a branch and flags variable, which is not making the code any simpler. In reality, we can handle ZRAM_SAME immediately when we detect such pages and remove a goto and a branch. Factor out ZRAM_SAME handling into a separate routine to simplify zram_write_page(). Link: https://lkml.kernel.org/r/20241218063513.297475-4-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25zram: remove entry element memberSergey Senozhatsky2-22/+6
Element is in the same anon union as handle and hence holds the same value, which makes code below sort of confusing handle = zram_get_handle() if (!handle) element = zram_get_element() Element doesn't really simplify the code, let's just remove it. We already re-purpose handle to store the block id a written back page. Link: https://lkml.kernel.org/r/20241218063513.297475-3-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25zram: free slot memory early during writeSergey Senozhatsky1-6/+5
Patch series "zram: split page type read/write handling", v2. This is a subset of [1] series which contains only fixes and improvements (no new features, as ZRAM_HUGE split is still under consideration). The motivation for factoring out is that zram_write_page() gets more and more complex all the time, because it tries to handle too many scenarios: ZRAM_SAME store, ZRAM_HUGE store, compress page store with zs_malloc allocation slowpath and conditional recompression, etc. Factor those out and make things easier to handle. Addition of cond_resched() is simply a fix, I can trigger watchdog from zram writeback(). And early slot free is just a reasonable thing to do. [1] https://lore.kernel.org/linux-kernel/20241119072057.3440039-1-senozhatsky@chromium.org This patch (of 7): In the current implementation entry's previously allocated memory is released in the very last moment, when we already have allocated a new memory for new data. This, basically, temporarily increases memory usage for no good reason. For example, consider the case when both old (stale) and new entry data are incompressible so such entry will temporarily use two physical pages - one for stale (old) data and one for new data. We can release old memory as soon as we get a write request for entry. Link: https://lkml.kernel.org/r/20241218063513.297475-1-senozhatsky@chromium.org Link: https://lkml.kernel.org/r/20241218063513.297475-2-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm/swap_cgroup: decouple swap cgroup recording and clearingKairui Song3-36/+55
The current implementation of swap cgroup tracking is a bit complex and fragile: On charging path, swap_cgroup_record always records an actual memcg id, and it depends on the caller to make sure all entries passed in must belong to one single folio. As folios are always charged or uncharged as a whole, and always charged and uncharged in order, swap_cgroup doesn't need an extra lock. On uncharging path, swap_cgroup_record always sets the record to zero. These entries won't be charged again until uncharging is done. So there is no extra lock needed either. Worth noting that swap cgroup clearing may happen without folio involved, eg. exiting processes will zap its page table without swapin. The xchg/cmpxchg provides atomic operations and barriers to ensure no tearing or synchronization issue of these swap cgroup records. It works but quite error-prone. Things can be much clear and robust by decoupling recording and clearing into two helpers. Recording takes the actual folio being charged as argument, and clearing always set the record to zero, and refine the debug sanity checks to better reflect their usage Benchmark even showed a very slight improvement as it saved some extra arguments and lookups: make -j96 with defconfig on tmpfs in 1.5G memory cgroup using 4k folios: Before: sys 9617.23 (stdev 37.764062) After : sys 9541.54 (stdev 42.973976) make -j96 with defconfig on tmpfs in 2G memory cgroup using 64k folios: Before: sys 7358.98 (stdev 54.927593) After : sys 7337.82 (stdev 39.398956) Link: https://lkml.kernel.org/r/20241218114633.85196-5-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Suggested-by: Chris Li <chrisl@kernel.org> Cc: Barry Song <baohua@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm/swap_cgroup: remove global swap cgroup lockKairui Song1-28/+49
commit e9e58a4ec3b1 ("memcg: avoid use cmpxchg in swap cgroup maintainance") replaced the cmpxchg/xchg with a global irq spinlock because some archs doesn't support 2 bytes cmpxchg/xchg. Clearly this won't scale well. And as commented in swap_cgroup.c, this lock is not needed for map synchronization. Emulation of 2 bytes xchg with atomic cmpxchg isn't hard, so implement it to get rid of this lock. Introduced two helpers for doing so and they can be easily dropped if a generic 2 byte xchg is support. Testing using 64G brd and build with build kernel with make -j96 in 1.5G memory cgroup using 4k folios showed below improvement (6 test run): Before this series: Sys time: 10782.29 (stdev 42.353886) Real time: 171.49 (stdev 0.595541) After this commit: Sys time: 9617.23 (stdev 37.764062), -10.81% Real time: 159.65 (stdev 0.587388), -6.90% With 64k folios and 2G memcg: Before this series: Sys time: 8176.94 (stdev 26.414712) Real time: 141.98 (stdev 0.797382) After this commit: Sys time: 7358.98 (stdev 54.927593), -10.00% Real time: 134.07 (stdev 0.757463), -5.57% Sequential swapout of 8G 64k zero folios with madvise (24 test run): Before this series: 5461409.12 us (stdev 183957.827084) After this commit: 5420447.26 us (stdev 196419.240317) Sequential swapin of 8G 4k zero folios (24 test run): Before this series: 19736958.916667 us (stdev 189027.246676) After this commit: 19662182.629630 us (stdev 172717.640614) Performance is better or at least not worse for all tests above. Link: https://lkml.kernel.org/r/20241218114633.85196-4-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm/swap_cgroup: remove swap_cgroup_cmpxchgKairui Song2-31/+0
This function is never used after commit 6b611388b626 ("memcg-v1: remove charge move code"). Link: https://lkml.kernel.org/r/20241218114633.85196-3-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Chris Li <chrisl@kernel.org> Cc: Barry Song <baohua@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm, memcontrol: avoid duplicated memcg enable checkKairui Song1-1/+1
Patch series "mm/swap_cgroup: remove global swap cgroup lock", v3. This series removes the global swap cgroup lock. The critical section of this lock is very short but it's still a bottle neck for mass parallel swap workloads. Up to 10% performance gain for tmpfs build kernel test on a 48c96t system under memory pressure, and no regression for other cases: This patch (of 3): mem_cgroup_uncharge_swap() includes a mem_cgroup_disabled() check, so the caller doesn't need to check that. Link: https://lkml.kernel.org/r/20241218114633.85196-1-ryncsn@gmail.com Link: https://lkml.kernel.org/r/20241218114633.85196-2-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Chris Li <chrisl@kernel.org> Cc: Barry Song <baohua@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>