aboutsummaryrefslogtreecommitdiffstats
path: root/tools/perf/scripts/python/export-to-postgresql.py (unfollow)
AgeCommit message (Collapse)AuthorFilesLines
2025-03-21x86/mm: restore early initialization of high_memory for 32-bitsMike Rapoport (Microsoft)2-0/+6
Kernel test robot reports the following crash on 32-bit system with HIGHMEM and DEBUG_VIRTUAL: [ 0.056128][ T0] kernel BUG at arch/x86/mm/physaddr.c:77! PANIC: early exception 0x06 IP 60:c116539d error 0 cr2 0x0 [ 0.056916][ T0] CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.14.0-rc4-00010-ga4dbe5c71817 #1 [ 0.057570][ T0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 [ 0.058299][ T0] EIP: __phys_addr (arch/x86/mm/physaddr.c:77) [ 0.058633][ T0] Code: 00 74 33 89 f0 e8 d3 8b 2e 00 89 c3 0f b6 d0 b8 58 bb 4b c5 31 c9 6a 00 e8 70 f5 15 00 83 c4 04 84 db 74 25 ff 05 78 de 5d c5 <0f> 0b b8 c8 91 ea c4 e8 e7 6e ea ff b8 58 bb 4b c5 31 d2 31 c9 6a All code [ 0.060017][ T0] EAX: 00000000 EBX: c61f7001 ECX: 00000000 EDX: 00000000 [ 0.060519][ T0] ESI: c61f7000 EDI: 061f7000 EBP: c4e31f04 ESP: c61f7000 [ 0.061016][ T0] DS: 007b ES: 007b FS: 0000 GS: 0000 SS: cff4 EFLAGS: 00210002 [ 0.061560][ T0] CR0: 80050033 CR2: 00000000 CR3: 059fc000 CR4: 00000090 [ 0.062060][ T0] Call Trace: [ 0.062288][ T0] ? show_regs (arch/x86/kernel/dumpstack.c:478) [ 0.062588][ T0] ? early_fixup_exception (arch/x86/include/asm/nospec-branch.h:595) [ 0.062968][ T0] ? early_idt_handler_common (arch/x86/kernel/head_32.S:352) [ 0.063360][ T0] ? __phys_addr (arch/x86/mm/physaddr.c:77) [ 0.063677][ T0] ? one_page_table_init (arch/x86/mm/init_32.c:100) [ 0.064037][ T0] ? page_table_range_init (arch/x86/mm/init_32.c:227) [ 0.064411][ T0] ? permanent_kmaps_init (include/linux/pgtable.h:191 include/linux/pgtable.h:196 arch/x86/mm/init_32.c:395) [ 0.064814][ T0] ? paging_init (arch/x86/mm/init_32.c:677) [ 0.065118][ T0] ? native_pagetable_init (arch/x86/mm/init_32.c:481) [ 0.065503][ T0] ? setup_arch (arch/x86/kernel/setup.c:1131) [ 0.065819][ T0] ? start_kernel (include/linux/jump_label.h:267 init/main.c:920) [ 0.066143][ T0] ? i386_start_kernel (arch/x86/kernel/head32.c:79) [ 0.066501][ T0] ? startup_32_smp (arch/x86/kernel/head_32.S:292) The crash happens because commit e120d1bc12da ("arch, mm: set high_memory in free_area_init()") moved initialization of high_memory after __vmalloc_start_set and with high_memory still set to 0 any address passes is_vmalloc_addr() check. Restore early initialization of high_memory on 32-bit systems in initmem_init(). Link: https://lkml.kernel.org/r/20250319122337.1538924-1-rppt@kernel.org Fixes: e120d1bc12da ("arch, mm: set high_memory in free_area_init()") Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202503191442.112e954f-lkp@intel.com Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-21mm/vmscan: don't try to reclaim hwpoison folioJinjiang Tu1-0/+7
Syzkaller reports a bug as follows: Injecting memory failure for pfn 0x18b00e at process virtual address 0x20ffd000 Memory failure: 0x18b00e: dirty swapcache page still referenced by 2 users Memory failure: 0x18b00e: recovery action for dirty swapcache page: Failed page: refcount:2 mapcount:0 mapping:0000000000000000 index:0x20ffd pfn:0x18b00e memcg:ffff0000dd6d9000 anon flags: 0x5ffffe00482011(locked|dirty|arch_1|swapbacked|hwpoison|node=0|zone=2|lastcpupid=0xfffff) raw: 005ffffe00482011 dead000000000100 dead000000000122 ffff0000e232a7c9 raw: 0000000000020ffd 0000000000000000 00000002ffffffff ffff0000dd6d9000 page dumped because: VM_BUG_ON_FOLIO(!folio_test_uptodate(folio)) ------------[ cut here ]------------ kernel BUG at mm/swap_state.c:184! Internal error: Oops - BUG: 00000000f2000800 [#1] SMP Modules linked in: CPU: 0 PID: 60 Comm: kswapd0 Not tainted 6.6.0-gcb097e7de84e #3 Hardware name: linux,dummy-virt (DT) pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : add_to_swap+0xbc/0x158 lr : add_to_swap+0xbc/0x158 sp : ffff800087f37340 x29: ffff800087f37340 x28: fffffc00052c0380 x27: ffff800087f37780 x26: ffff800087f37490 x25: ffff800087f37c78 x24: ffff800087f377a0 x23: ffff800087f37c50 x22: 0000000000000000 x21: fffffc00052c03b4 x20: 0000000000000000 x19: fffffc00052c0380 x18: 0000000000000000 x17: 296f696c6f662865 x16: 7461646f7470755f x15: 747365745f6f696c x14: 6f6621284f494c4f x13: 0000000000000001 x12: ffff600036d8b97b x11: 1fffe00036d8b97a x10: ffff600036d8b97a x9 : dfff800000000000 x8 : 00009fffc9274686 x7 : ffff0001b6c5cbd3 x6 : 0000000000000001 x5 : ffff0000c25896c0 x4 : 0000000000000000 x3 : 0000000000000000 x2 : 0000000000000000 x1 : ffff0000c25896c0 x0 : 0000000000000000 Call trace: add_to_swap+0xbc/0x158 shrink_folio_list+0x12ac/0x2648 shrink_inactive_list+0x318/0x948 shrink_lruvec+0x450/0x720 shrink_node_memcgs+0x280/0x4a8 shrink_node+0x128/0x978 balance_pgdat+0x4f0/0xb20 kswapd+0x228/0x438 kthread+0x214/0x230 ret_from_fork+0x10/0x20 I can reproduce this issue with the following steps: 1) When a dirty swapcache page is isolated by reclaim process and the page isn't locked, inject memory failure for the page. me_swapcache_dirty() clears uptodate flag and tries to delete from lru, but fails. Reclaim process will put the hwpoisoned page back to lru. 2) The process that maps the hwpoisoned page exits, the page is deleted the page will never be freed and will be in the lru forever. 3) If we trigger a reclaim again and tries to reclaim the page, add_to_swap() will trigger VM_BUG_ON_FOLIO due to the uptodate flag is cleared. To fix it, skip the hwpoisoned page in shrink_folio_list(). Besides, the hwpoison folio may not be unmapped by hwpoison_user_mappings() yet, unmap it in shrink_folio_list(), otherwise the folio will fail to be unmaped by hwpoison_user_mappings() since the folio isn't in lru list. Link: https://lkml.kernel.org/r/20250318083939.987651-3-tujinjiang@huawei.com Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: <stable@vger,kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-21mm/hwpoison: introduce folio_contain_hwpoisoned_page() helperJinjiang Tu3-4/+8
Patch series "mm/vmscan: don't try to reclaim hwpoison folio". Fix a bug during memory reclaim if folio is hwpoisoned. This patch (of 2): Introduce helper folio_contain_hwpoisoned_page() to check if the entire folio is hwpoisoned or it contains hwpoisoned pages. Link: https://lkml.kernel.org/r/20250318083939.987651-1-tujinjiang@huawei.com Link: https://lkml.kernel.org/r/20250318083939.987651-2-tujinjiang@huawei.com Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: <stable@vger,kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-21cgroup: docs: add pswpin and pswpout items in cgroup v2 docHao Jia1-0/+6
The commit 15ff4d409e1a ("mm/memcontrol: add per-memcg pgpgin/pswpin counter") introduced the pswpin and pswpout items in the memory.stat of cgroup v2. Therefore, update them accordingly in the cgroup-v2 documentation. Link: https://lkml.kernel.org/r/20250318075833.90615-3-jiahao.kernel@gmail.com Fixes: 15ff4d409e1a ("mm/memcontrol: add per-memcg pgpgin/pswpin counter") Signed-off-by: Hao Jia <jiahao1@lixiang.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-21mm: vmscan: split proactive reclaim statistics from direct reclaim statisticsHao Jia6-15/+42
Patch series "Adding Proactive Memory Reclaim Statistics". These two patches are related to proactive memory reclaim. Patch 1 Split proactive reclaim statistics from direct reclaim counters and introduces new counters: pgsteal_proactive, pgdemote_proactive, and pgscan_proactive. Patch 2 Adds pswpin and pswpout items to the cgroup-v2 documentation. This patch (of 2): In proactive memory reclaim scenarios, it is necessary to accurately track proactive reclaim statistics to dynamically adjust the frequency and amount of memory being reclaimed proactively. Currently, proactive reclaim is included in direct reclaim statistics, which can make these direct reclaim statistics misleading. Therefore, separate proactive reclaim memory from the direct reclaim counters by introducing new counters: pgsteal_proactive, pgdemote_proactive, and pgscan_proactive, to avoid confusion with direct reclaim. Link: https://lkml.kernel.org/r/20250318075833.90615-1-jiahao.kernel@gmail.com Link: https://lkml.kernel.org/r/20250318075833.90615-2-jiahao.kernel@gmail.com Signed-off-by: Hao Jia <jiahao1@lixiang.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-21selftests/mm: speed up split_huge_page_testRyan Roberts1-4/+7
create_pagecache_thp_and_fd() was previously writing a file sized at twice the PMD size by making a per-byte write syscall. This was quite slow when the PMD size is 4M, but completely intolerable for 32M (PMD size for arm64's 16K page size), and 512M (PMD size for arm64's 64K page size). The byte pattern has a 256 byte period, so let's create a 1K buffer and fill it with exactly 4 periods. Then we can write the buffer as many times as is required to fill the file. This makes things much more tolerable. The test now passes for 16K page size. It still fails for 64K page size because MAX_PAGECACHE_ORDER is too small for 512M folio size (I think). Link: https://lkml.kernel.org/r/20250318174343.243631-3-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: Peter Xu <peterx@redhat.com> Acked-by: Rafael Aquini <raquini@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-21selftests/mm: uffd-unit-tests support for hugepages > 2MRyan Roberts1-1/+4
uffd-unit-tests uses a memory area with a fixed 32M size. Then it calculates the number of pages by dividing by page_size, which itself is either the base page size or the PMD huge page size depending on the test config. For the latter, we end up with nr_pages=1 for arm64 16K base pages, and nr_pages=0 for 64K base pages. This doesn't end well. So let's make the 32M size a floor and also ensure that we have at least 2 pages given the PMD size. With this change, the tests pass on arm64 64K base page size configuration. Link: https://lkml.kernel.org/r/20250318174343.243631-2-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: Peter Xu <peterx@redhat.com> Acked-by: Rafael Aquini <raquini@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-21docs/mm/damon/design: document active DAMOS filter typeNhat Pham1-0/+2
Document availability and meaning of "active" DAMOS filter type on design document. Since introduction of the type requires no additional user ABI, usage and ABI document need no update. Link: https://lkml.kernel.org/r/20250318183029.2062917-3-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Suggested-by: SeongJae Park <sj@kernel.org> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-21mm/damon: implement a new DAMOS filter type for active pagesNhat Pham3-0/+6
Patch series "mm/damon: introduce DAMOS filter type for active pages". The memory reclaim algorithm categorizes pages into active and inactive lists, separately for file and anon pages. The system's performance relies heavily on the (relative and absolute) accuracy of this categorization. This patch series add a new DAMOS filter for pages' activeness, giving us visibility into the access frequency of the pages on each list. This insight can help us diagnose issues with the active-inactive balancing dynamics, and make decisions to optimize reclaim efficiency and memory utilization. For instance, we might decide to enable DAMON_LRU_SORT, if we find that there are pages on the active list that are infrequently accessed, or less frequently accessed than pages on the inactive list. This patch (of 2): Implement a DAMOS filter type for active pages on DAMON kernel API, and add support of it from the physical address space DAMON operations set (paddr). Link: https://lkml.kernel.org/r/20250318183029.2062917-1-nphamcs@gmail.com Link: https://lkml.kernel.org/r/20250318183029.2062917-2-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Suggested-by: SeongJae Park <sj@kernel.org> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-21fs/dax: don't disassociate zero page entriesAlistair Popple1-2/+7
Prior to commit 38607c62b34b ("fs/dax: properly refcount fs dax pages") dax_associate_entry() and dax_disassociate_entry() would implicitly skip zero and empty dax entries using the for_each_mapped_pfn() macro. The use of compound ZONE_DEVICE folios removed the need for this macro and so it was removed, leading dax_folio_put() to be called on zero pages. This lead to the below warning. To fix this explicitly skip zero and empty entries in dax_associate/disassociate_entry(). [ 27.536963] ------------[ cut here ]------------ [ 27.537674] WARNING: CPU: 11 PID: 874 at fs/dax.c:415 dax_folio_put.isra.0+0x10d/0x170 [ 27.538844] Modules linked in: nd_pmem nd_btt nd_e820 libnvdimm [ 27.539732] CPU: 11 UID: 0 PID: 874 Comm: ctl_prefault Tainted: G W 6.14.0-rc2+ #1104 [ 27.541093] Tainted: [W]=WARN [ 27.541549] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/204 [ 27.543197] RIP: 0010:dax_folio_put.isra.0+0x10d/0x170 [ 27.543970] Code: 20 48 85 c0 0f 84 29 ff ff ff 48 83 e8 01 48 89 47 20 0f 84 1b ff ff ff 48 83 c4 10 5b 5d 41 5c c3 cc cc4 [ 27.546723] RSP: 0000:ffff961e4102fae0 EFLAGS: 00010002 [ 27.547505] RAX: 0000000000000001 RBX: ffffc9cce4e18000 RCX: 0000000000000009 [ 27.548564] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8a2a7badca40 [ 27.549630] RBP: ffffc9cce4e18000 R08: 0000000000009ffb R09: 00000000ffffdfff [ 27.550691] R10: 00000000ffffdfff R11: ffffffffa4e823a0 R12: 0000000000000000 [ 27.551748] R13: 0000000000000000 R14: 0000000010f10005 R15: 0000000000000004 [ 27.552819] FS: 00007f5f539d74c0(0000) GS:ffff8a2a7bac0000(0000) knlGS:0000000000000000 [ 27.554015] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 27.554873] CR2: 00007f5f52e00000 CR3: 0000000909340000 CR4: 00000000000006f0 [ 27.555938] Call Trace: [ 27.556318] <TASK> [ 27.556650] ? __warn+0x91/0x190 [ 27.557146] ? dax_folio_put.isra.0+0x10d/0x170 [ 27.557824] ? report_bug+0x164/0x190 [ 27.558378] ? handle_bug+0x54/0x90 [ 27.558898] ? exc_invalid_op+0x17/0x70 [ 27.559489] ? asm_exc_invalid_op+0x1a/0x20 [ 27.560125] ? dax_folio_put.isra.0+0x10d/0x170 [ 27.560808] dax_insert_entry+0x1e1/0x420 [ 27.561419] dax_fault_iter+0x252/0x860 [ 27.561995] dax_iomap_pmd_fault+0x23c/0x4a0 [ 27.562651] ext4_dax_huge_fault+0x1e2/0x450 [ 27.563296] __handle_mm_fault+0x6c8/0x12b0 [ 27.563920] ? do_user_addr_fault+0x1ca/0x670 [ 27.564577] ? lock_vma_under_rcu+0x178/0x3b0 [ 27.565235] handle_mm_fault+0xe5/0x290 [ 27.565816] do_user_addr_fault+0x208/0x670 [ 27.566446] exc_page_fault+0x6d/0x230 [ 27.567008] asm_exc_page_fault+0x26/0x30 [ 27.567610] RIP: 0033:0x7f5f543bcb4f [ 27.568152] Code: 45 f0 48 8b 45 f0 48 8b 4d f8 48 03 41 18 48 89 45 e8 48 8b 45 f0 48 3b 45 e8 0f 83 97 00 00 00 48 8b 458 [ 27.570895] RSP: 002b:00007ffc2d774460 EFLAGS: 00010287 [ 27.571672] RAX: 00007f5f52e00000 RBX: 0000000000200000 RCX: 000055760153fc00 [ 27.572731] RDX: 0000000000000000 RSI: 0000557601542a20 RDI: 000055760153fc00 [ 27.573787] RBP: 00007ffc2d774460 R08: 0000000000000000 R09: 0000000000000073 [ 27.574840] R10: 0000000000000000 R11: 0000000000000202 R12: 00007ffc2d77534b [ 27.575897] R13: 00007ffc2d774aa0 R14: 0000000000800000 R15: 0000000000800000 [ 27.576961] </TASK> [ 27.577301] irq event stamp: 13394 [ 27.577810] hardirqs last enabled at (13393): [<ffffffffa3485780>] flush_tlb_mm_range+0x1c0/0x220 [ 27.579138] hardirqs last disabled at (13394): [<ffffffffa450d0c7>] _raw_spin_lock_irq+0x47/0x50 [ 27.580428] softirqs last enabled at (12530): [<ffffffffa433941a>] xs_tcp_send_request+0x22a/0x2e0 [ 27.581762] softirqs last disabled at (12528): [<ffffffffa40a60fd>] release_sock+0x1d/0xb0 [ 27.582986] ---[ end trace 0000000000000000 ]--- Link: https://lkml.kernel.org/r/20250319013301.369822-1-apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Fixes: 38607c62b34b ("fs/dax: properly refcount fs dax pages") Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202503102229.122fbd6c-lkp@intel.com Cc: Dan Williams <dan.j.williams@intel.com> Cc: Alison Schofield <alison.schofield@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-21MM documentation: add "Unaccepted" meminfo entryNico Pache1-0/+3
Commit dcdfdd40fa82 ("mm: Add support for unaccepted memory") added a entry to meminfo but did not document it in the proc.rst file. This counter tracks the amount of "Unaccepted" guest memory for some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP. Add the missing entry in the documentation. Link: https://lkml.kernel.org/r/20250317230403.79632-1-npache@redhat.com Signed-off-by: Nico Pache <npache@redhat.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Jeff Xu <jeffxu@chromium.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: xu xin <xu.xin16@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-21selftests/mm: add commentary about 9pfs bugsBrendan Jackman3-10/+22
As discussed here: https://lore.kernel.org/lkml/Z9RRkL1hom48z3Tt@google.com/ This code could benefit from some more commentary. To avoid needing to comment the same thing in multiple places (I guess more of these SKIPs will need to be added over time, for now I am only like 20% of the way through Project Run run_vmtests.sh Successfully), add a dummy "skip tests for this specific reason" function that basically just serves as a hook to hang comments on. Link: https://lkml.kernel.org/r/20250317-9pfs-comments-v1-1-9ac96043e146@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-21fork: use __vmalloc_node() for stack allocationUladzislau Rezki (Sony)1-4/+2
Replace __vmalloc_node_range() by __vmalloc_node(). The last variant requires less parameters and it uses exactly the same arguments which are partly now hidden inside __vmalloc_node(). This change does not change any functionality. It makes the code a bit simpler. Link: https://lkml.kernel.org/r/20250317163614.166502-1-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-21docs/mm: Physical Memory: Populate the "Zones" sectionJiwen Qi1-2/+264
Briefly describe what zones are and the fields of struct zone. Link: https://lkml.kernel.org/r/20250315211317.27612-1-jiwen7.qi@gmail.com Signed-off-by: Jiwen Qi <jiwen7.qi@gmail.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-21xen: balloon: update the NR_BALLOON_PAGES stateNico Pache1-0/+4
Update the NR_BALLOON_PAGES counter when pages are added to or removed from the Xen balloon. Link: https://lkml.kernel.org/r/20250314213757.244258-5-npache@redhat.com Signed-off-by: Nico Pache <npache@redhat.com> Reviewed-by: Juergen Gross <jgross@suse.com> Cc: Alexander Atanasov <alexander.atanasov@virtuozzo.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Dexuan Cui <decui@microsoft.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Juegren Gross <jgross@suse.com> Cc: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Wei Liu <wei.liu@kernel.org> Cc: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-21hv_balloon: update the NR_BALLOON_PAGES stateNico Pache1-0/+2
Update the NR_BALLOON_PAGES counter when pages are added to or removed from the Hyper-V balloon. Link: https://lkml.kernel.org/r/20250314213757.244258-4-npache@redhat.com Signed-off-by: Nico Pache <npache@redhat.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Cc: Alexander Atanasov <alexander.atanasov@virtuozzo.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Dexuan Cui <decui@microsoft.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Juegren Gross <jgross@suse.com> Cc: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Wei Liu <wei.liu@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-21balloon_compaction: update the NR_BALLOON_PAGES stateNico Pache1-0/+2
Update the NR_BALLOON_PAGES counter when pages are added or removed using the balloon compaction interface. The virtio, Vmware, and pseries-cmm balloon drivers utilize the balloon_compaction interface to allocate and free balloon pages. Other balloon drivers will have to maintain this counter manually. Link: https://lkml.kernel.org/r/20250314213757.244258-3-npache@redhat.com Signed-off-by: Nico Pache <npache@redhat.com> Cc: Alexander Atanasov <alexander.atanasov@virtuozzo.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Dexuan Cui <decui@microsoft.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Juegren Gross <jgross@suse.com> Cc: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Wei Liu <wei.liu@kernel.org> Cc: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-21meminfo: add a per node counter for balloon driversNico Pache5-1/+10
Patch series "track memory used by balloon drivers", v2. This series introduces a way to track memory used by balloon drivers. Add a NR_BALLOON_PAGES counter to track how many pages are reclaimed by the balloon drivers. First add the accounting, then updates the balloon drivers (virtio, Hyper-V, VMware, Pseries-cmm, and Xen) to maintain this counter. The virtio, Vmware, and pseries-cmm balloon drivers utilize the balloon_compaction interface to allocate and free balloon pages. Other balloon drivers will have to maintain this counter manually. This makes the information visible in memory reporting interfaces like /proc/meminfo, show_mem, and OOM reporting. This provides admins visibility into their VM balloon sizes without requiring different virtualization tooling. Furthermore, this information is helpful when debugging an OOM inside a VM. This patch (of 4): Add NR_BALLOON_PAGES counter to track memory used by balloon drivers and expose it through /proc/meminfo and other memory reporting interfaces. [npache@redhat.com: document Balloon Meminfo entry] Link: https://lkml.kernel.org/r/a0315ccf-f244-460e-8643-fd7388724fe5@redhat.com Link: https://lkml.kernel.org/r/20250314213757.244258-1-npache@redhat.com Link: https://lkml.kernel.org/r/20250314213757.244258-2-npache@redhat.com Signed-off-by: Nico Pache <npache@redhat.com> Cc: Alexander Atanasov <alexander.atanasov@virtuozzo.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Dexuan Cui <decui@microsoft.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Juegren Gross <jgross@suse.com> Cc: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Wei Liu <wei.liu@kernel.org> Cc: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-21mm: remove references to folio in __memcg_kmem_uncharge_page()Matthew Wilcox (Oracle)1-5/+3
This use of folios is misleading because these pages are not part of a folio. Remove an unnecessary call to page_folio(), saving 58 bytes of text in a Debian kernel build. Link: https://lkml.kernel.org/r/20250314133617.138071-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-21mm: simplify folio_memcg_charged()Matthew Wilcox (Oracle)1-3/+1
There's no need to check which kind of pointer is in the memcg_data field, all we actually care about is whether it's zero or not. Saves 70 bytes in workingset_activation() with the Debian config. Link: https://lkml.kernel.org/r/20250314133617.138071-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-21mm: remove references to folio in split_page_memcg()Matthew Wilcox (Oracle)1-7/+23
We know that the passed in page is not part of a folio (it's a plain page allocated with GFP_ACCOUNT), so we should get rid of the misleading references to folios. Introduce page_objcg() and page_set_objcg() helpers to make things more clear. Link: https://lkml.kernel.org/r/20250314133617.138071-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-21mm: simplify split_page_memcg()Matthew Wilcox (Oracle)3-12/+11
The last argument to split_page_memcg() is now always 0, so remove it, effectively reverting commit b8791381d7ed. Link: https://lkml.kernel.org/r/20250314133617.138071-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-21mm: separate folio_split_memcg_refs() from split_page_memcg()Matthew Wilcox (Oracle)3-16/+24
Patch series "Minor memcg cleanups & prep for memdescs", v2. Separate the handling of accounted folios and GFP_ACCOUNT pages for easier to understand code. For more detail, see https://lore.kernel.org/linux-mm/Z9LwTOudOlCGny3f@casper.infradead.org/ This patch (of 5): Folios always use memcg_data to refer to the mem_cgroup while pages allocated with GFP_ACCOUNT have a pointer to the obj_cgroup. Since the caller already knows what it has, split the function into two and then we don't need to check. Move the assignment of split folio memcg_data to the point where we set up the other parts of the new folio. That leaves folio_split_memcg_refs() just handling the memcg accounting. Link: https://lkml.kernel.org/r/20250314133617.138071-1-willy@infradead.org Link: https://lkml.kernel.org/r/20250314133617.138071-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-21memcg: move do_memsw_account() to CONFIG_MEMCG_V1Shakeel Butt1-6/+7
The do_memsw_account() is used to enable or disable legacy memory+swap accounting in memory cgroup. However with disabled CONFIG_MEMCG_V1, we don't need to keep checking it. So, let's always return false for !CONFIG_MEMCG_V1 configs. Before the patch: $ size mm/memcontrol.o text data bss dec hex filename 49928 10736 4172 64836 fd44 mm/memcontrol.o After the patch: $ size mm/memcontrol.o text data bss dec hex filename 49430 10480 4172 64082 fa52 mm/memcontrol.o Link: https://lkml.kernel.org/r/20250312222552.3284173-1-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-21memcg: avoid refill_stock for root memcgShakeel Butt1-1/+2
We never charge the page counters of root memcg, so there is no need to put root memcg in the memcg stock. At the moment, refill_stock() can be called from try_charge_memcg(), obj_cgroup_uncharge_pages() and mem_cgroup_uncharge_skmem(). The try_charge_memcg() and mem_cgroup_uncharge_skmem() are never called with root memcg, so those are fine. However obj_cgroup_uncharge_pages() can potentially call refill_stock() with root memcg if the objcg object has been reparented over to the root memcg. Let's just avoid refill_stock() from obj_cgroup_uncharge_pages() for root memcg. Link: https://lkml.kernel.org/r/20250313054812.2185900-1-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Michal Hocko <mhockoc@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-21mm/mm_init: rename init_reserved_page to init_deferred_pageMike Rapoport (Microsoft)1-3/+3
When CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, init_reserved_page() function performs initialization of a struct page that would have been deferred normally. Rename it to init_deferred_page() to better reflect what the function does. Link: https://lkml.kernel.org/r/20250225083017.567649-3-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Changyuan Lyu <changyuanl@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-21mm/mm_init: rename __init_reserved_page_zone to __init_page_from_nidMike Rapoport (Microsoft)3-4/+4
__init_reserved_page_zone() function finds the zone for pfn and nid and performs initialization of a struct page with that zone and nid. There is nothing in that function about reserved pages and it is misnamed. Rename it to __init_page_from_nid() to better reflect what the function does. Link: https://lkml.kernel.org/r/20250225083017.567649-2-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-21mm/cma: using per-CMA locks to improve concurrent allocation performanceGe Yang2-3/+5
For different CMAs, concurrent allocation of CMA memory ideally should not require synchronization using locks. Currently, a global cma_mutex lock is employed to synchronize all CMA allocations, which can impact the performance of concurrent allocations across different CMAs. To test the performance impact, follow these steps: 1. Boot the kernel with the command line argument hugetlb_cma=30G to allocate a 30GB CMA area specifically for huge page allocations. (note: on my machine, which has 3 nodes, each node is initialized with 10G of CMA) 2. Use the dd command with parameters if=/dev/zero of=/dev/shm/file bs=1G count=30 to fully utilize the CMA area by writing zeroes to a file in /dev/shm. 3. Open three terminals and execute the following commands simultaneously: (Note: Each of these commands attempts to allocate 10GB [2621440 * 4KB pages] of CMA memory.) On Terminal 1: time echo 2621440 > /sys/kernel/debug/cma/hugetlb1/alloc On Terminal 2: time echo 2621440 > /sys/kernel/debug/cma/hugetlb2/alloc On Terminal 3: time echo 2621440 > /sys/kernel/debug/cma/hugetlb3/alloc We attempt to allocate pages through the CMA debug interface and use the time command to measure the duration of each allocation. Performance comparison: Without this patch With this patch Terminal1 ~7s ~7s Terminal2 ~14s ~8s Terminal3 ~21s ~7s To solve problem above, we could use per-CMA locks to improve concurrent allocation performance. This would allow each CMA to be managed independently, reducing the need for a global lock and thus improving scalability and performance. Link: https://lkml.kernel.org/r/1739152566-744-1-git-send-email-yangge1116@126.com Signed-off-by: Ge Yang <yangge1116@126.com> Reviewed-by: Barry Song <baohua@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Aisheng Dong <aisheng.dong@nxp.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm: page_alloc: defrag_mode kswapd/kcompactd watermarksJohannes Weiner6-15/+73
The previous patch added pageblock_order reclaim to kswapd/kcompactd, which helps, but produces only one block at a time. Allocation stalls and THP failure rates are still higher than they could be. To adequately reflect ALLOC_NOFRAGMENT demand for pageblocks, change the watermarking for kswapd & kcompactd: instead of targeting the high watermark in order-0 pages and checking for one suitable block, simply require that the high watermark is entirely met in pageblocks. To this end, track the number of free pages within contiguous pageblocks, then change pgdat_balanced() and compact_finished() to check watermarks against this new value. This further reduces THP latencies and allocation stalls, and improves THP success rates against the previous patch: DEFRAGMODE-ASYNC DEFRAGMODE-ASYNC-WMARKS Hugealloc Time mean 34300.36 ( +0.00%) 28904.00 ( -15.73%) Hugealloc Time stddev 36390.42 ( +0.00%) 33464.37 ( -8.04%) Kbuild Real time 196.13 ( +0.00%) 196.59 ( +0.23%) Kbuild User time 1234.74 ( +0.00%) 1231.67 ( -0.25%) Kbuild System time 62.62 ( +0.00%) 59.10 ( -5.54%) THP fault alloc 57054.53 ( +0.00%) 63223.67 ( +10.81%) THP fault fallback 11581.40 ( +0.00%) 5412.47 ( -53.26%) Direct compact fail 107.80 ( +0.00%) 59.07 ( -44.79%) Direct compact success 4.53 ( +0.00%) 2.80 ( -31.33%) Direct compact success rate % 3.20 ( +0.00%) 3.99 ( +18.66%) Compact daemon scanned migrate 5461033.93 ( +0.00%) 2267500.33 ( -58.48%) Compact daemon scanned free 5824897.93 ( +0.00%) 2339773.00 ( -59.83%) Compact direct scanned migrate 58336.93 ( +0.00%) 47659.93 ( -18.30%) Compact direct scanned free 32791.87 ( +0.00%) 40729.67 ( +24.21%) Compact total migrate scanned 5519370.87 ( +0.00%) 2315160.27 ( -58.05%) Compact total free scanned 5857689.80 ( +0.00%) 2380502.67 ( -59.36%) Alloc stall 2424.60 ( +0.00%) 638.87 ( -73.62%) Pages kswapd scanned 2657018.33 ( +0.00%) 4002186.33 ( +50.63%) Pages kswapd reclaimed 559583.07 ( +0.00%) 718577.80 ( +28.41%) Pages direct scanned 722094.07 ( +0.00%) 355172.73 ( -50.81%) Pages direct reclaimed 107257.80 ( +0.00%) 31162.80 ( -70.95%) Pages total scanned 3379112.40 ( +0.00%) 4357359.07 ( +28.95%) Pages total reclaimed 666840.87 ( +0.00%) 749740.60 ( +12.43%) Swap out 77238.20 ( +0.00%) 110084.33 ( +42.53%) Swap in 11712.80 ( +0.00%) 24457.00 ( +108.80%) File refaults 143438.80 ( +0.00%) 188226.93 ( +31.22%) Also of note is that compaction work overall is reduced. The reason for this is that when free pageblocks are more readily available, allocations are also much more likely to get physically placed in LRU order, instead of being forced to scavenge free space here and there. This means that reclaim by itself has better chances of freeing up whole blocks, and the system relies less on compaction. Comparing all changes to the vanilla kernel: VANILLA DEFRAGMODE-ASYNC-WMARKS Hugealloc Time mean 52739.45 ( +0.00%) 28904.00 ( -45.19%) Hugealloc Time stddev 56541.26 ( +0.00%) 33464.37 ( -40.81%) Kbuild Real time 197.47 ( +0.00%) 196.59 ( -0.44%) Kbuild User time 1240.49 ( +0.00%) 1231.67 ( -0.71%) Kbuild System time 70.08 ( +0.00%) 59.10 ( -15.45%) THP fault alloc 46727.07 ( +0.00%) 63223.67 ( +35.30%) THP fault fallback 21910.60 ( +0.00%) 5412.47 ( -75.29%) Direct compact fail 195.80 ( +0.00%) 59.07 ( -69.48%) Direct compact success 7.93 ( +0.00%) 2.80 ( -57.46%) Direct compact success rate % 3.51 ( +0.00%) 3.99 ( +10.49%) Compact daemon scanned migrate 3369601.27 ( +0.00%) 2267500.33 ( -32.71%) Compact daemon scanned free 5075474.47 ( +0.00%) 2339773.00 ( -53.90%) Compact direct scanned migrate 161787.27 ( +0.00%) 47659.93 ( -70.54%) Compact direct scanned free 163467.53 ( +0.00%) 40729.67 ( -75.08%) Compact total migrate scanned 3531388.53 ( +0.00%) 2315160.27 ( -34.44%) Compact total free scanned 5238942.00 ( +0.00%) 2380502.67 ( -54.56%) Alloc stall 2371.07 ( +0.00%) 638.87 ( -73.02%) Pages kswapd scanned 2160926.73 ( +0.00%) 4002186.33 ( +85.21%) Pages kswapd reclaimed 533191.07 ( +0.00%) 718577.80 ( +34.77%) Pages direct scanned 400450.33 ( +0.00%) 355172.73 ( -11.31%) Pages direct reclaimed 94441.73 ( +0.00%) 31162.80 ( -67.00%) Pages total scanned 2561377.07 ( +0.00%) 4357359.07 ( +70.12%) Pages total reclaimed 627632.80 ( +0.00%) 749740.60 ( +19.46%) Swap out 47959.53 ( +0.00%) 110084.33 ( +129.53%) Swap in 7276.00 ( +0.00%) 24457.00 ( +236.10%) File refaults 138043.00 ( +0.00%) 188226.93 ( +36.35%) THP allocation latencies and %sys time are down dramatically. THP allocation failures are down from nearly 50% to 8.5%. And to recall previous data points, the success rates are steady and reliable without the cumulative deterioration of fragmentation events. Compaction work is down overall. Direct compaction work especially is drastically reduced. As an aside, its success rate of 4% indicates there is room for improvement. For now it's good to rely on it less. Reclaim work is up overall, however direct reclaim work is down. Part of the increase can be attributed to a higher use of THPs, which due to internal fragmentation increase the memory footprint. This is not necessarily an unexpected side-effect for users of THP. However, taken both points together, there may well be some opportunities for fine tuning in the reclaim/compaction coordination. [hannes@cmpxchg.org: fix squawks from rebasing] Link: https://lkml.kernel.org/r/20250314210558.GD1316033@cmpxchg.org Link: https://lkml.kernel.org/r/20250313210647.1314586-6-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm: page_alloc: defrag_mode kswapd/kcompactd assistanceJohannes Weiner1-4/+10
When defrag_mode is enabled, allocation fallbacks strongly prefer whole block conversions instead of polluting or stealing partially used blocks. This means there is a demand for pageblocks even from sub-block requests. Let kswapd/kcompactd help produce them. By the time kswapd gets woken up, normal rmqueue and block conversion fallbacks have been attempted and failed. So always wake kswapd with the block order; it will take care of producing a suitable compaction gap and then chain-wake kcompactd with the block order when its done. VANILLA DEFRAGMODE-ASYNC Hugealloc Time mean 52739.45 ( +0.00%) 34300.36 ( -34.96%) Hugealloc Time stddev 56541.26 ( +0.00%) 36390.42 ( -35.64%) Kbuild Real time 197.47 ( +0.00%) 196.13 ( -0.67%) Kbuild User time 1240.49 ( +0.00%) 1234.74 ( -0.46%) Kbuild System time 70.08 ( +0.00%) 62.62 ( -10.50%) THP fault alloc 46727.07 ( +0.00%) 57054.53 ( +22.10%) THP fault fallback 21910.60 ( +0.00%) 11581.40 ( -47.14%) Direct compact fail 195.80 ( +0.00%) 107.80 ( -44.72%) Direct compact success 7.93 ( +0.00%) 4.53 ( -38.06%) Direct compact success rate % 3.51 ( +0.00%) 3.20 ( -6.89%) Compact daemon scanned migrate 3369601.27 ( +0.00%) 5461033.93 ( +62.07%) Compact daemon scanned free 5075474.47 ( +0.00%) 5824897.93 ( +14.77%) Compact direct scanned migrate 161787.27 ( +0.00%) 58336.93 ( -63.94%) Compact direct scanned free 163467.53 ( +0.00%) 32791.87 ( -79.94%) Compact total migrate scanned 3531388.53 ( +0.00%) 5519370.87 ( +56.29%) Compact total free scanned 5238942.00 ( +0.00%) 5857689.80 ( +11.81%) Alloc stall 2371.07 ( +0.00%) 2424.60 ( +2.26%) Pages kswapd scanned 2160926.73 ( +0.00%) 2657018.33 ( +22.96%) Pages kswapd reclaimed 533191.07 ( +0.00%) 559583.07 ( +4.95%) Pages direct scanned 400450.33 ( +0.00%) 722094.07 ( +80.32%) Pages direct reclaimed 94441.73 ( +0.00%) 107257.80 ( +13.57%) Pages total scanned 2561377.07 ( +0.00%) 3379112.40 ( +31.93%) Pages total reclaimed 627632.80 ( +0.00%) 666840.87 ( +6.25%) Swap out 47959.53 ( +0.00%) 77238.20 ( +61.05%) Swap in 7276.00 ( +0.00%) 11712.80 ( +60.97%) File refaults 138043.00 ( +0.00%) 143438.80 ( +3.91%) With this patch, defrag_mode=1 beats the vanilla kernel in THP success rates and allocation latencies. The trend holds over time: thp_fault_alloc VANILLA DEFRAGMODE-ASYNC 61988 52066 56474 58844 57258 58233 50187 58476 52388 54516 55409 59938 52925 57204 47648 60238 43669 55733 40621 56211 36077 59861 41721 57771 36685 58579 34641 51868 33215 56280 DEFRAGMODE-ASYNC also wins on %sys as ~3/4 of the direct compaction work is shifted to kcompactd. Reclaim activity is higher. Part of that is simply due to the increased memory footprint from higher THP use. The other aspect is that *direct* reclaim/compaction are still going for requested orders rather than targeting the page blocks required for fallbacks, which is less efficient than it could be. However, this is already a useful tradeoff to make, as in many environments peak periods are short and retaining the ability to produce THP through them is more important. Link: https://lkml.kernel.org/r/20250313210647.1314586-5-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm: page_alloc: defrag_modeJohannes Weiner2-2/+34
The page allocator groups requests by migratetype to stave off fragmentation. However, in practice this is routinely defeated by the fact that it gives up *before* invoking reclaim and compaction - which may well produce suitable pages. As a result, fragmentation of physical memory is a common ongoing process in many load scenarios. Fragmentation deteriorates compaction's ability to produce huge pages. Depending on the lifetime of the fragmenting allocations, those effects can be long-lasting or even permanent, requiring drastic measures like forcible idle states or even reboots as the only reliable ways to recover the address space for THP production. In a kernel build test with supplemental THP pressure, the THP allocation rate steadily declines over 15 runs: thp_fault_alloc 61988 56474 57258 50187 52388 55409 52925 47648 43669 40621 36077 41721 36685 34641 33215 This is a hurdle in adopting THP in any environment where hosts are shared between multiple overlapping workloads (cloud environments), and rarely experience true idle periods. To make THP a reliable and predictable optimization, there needs to be a stronger guarantee to avoid such fragmentation. Introduce defrag_mode. When enabled, reclaim/compaction is invoked to its full extent *before* falling back. Specifically, ALLOC_NOFRAGMENT is enforced on the allocator fastpath and the reclaiming slowpath. For now, fallbacks are permitted to avert OOMs. There is a plan to add defrag_mode=2 to prefer OOMs over fragmentation, but this requires additional prep work in compaction and the reserve management to make it ready for all possible allocation contexts. The following test results are from a kernel build with periodic bursts of THP allocations, over 15 runs: vanilla defrag_mode=1 @claimer[unmovable]: 189 103 @claimer[movable]: 92 103 @claimer[reclaimable]: 207 61 @pollute[unmovable from movable]: 25 0 @pollute[unmovable from reclaimable]: 28 0 @pollute[movable from unmovable]: 38835 0 @pollute[movable from reclaimable]: 147136 0 @pollute[reclaimable from unmovable]: 178 0 @pollute[reclaimable from movable]: 33 0 @steal[unmovable from movable]: 11 0 @steal[unmovable from reclaimable]: 5 0 @steal[reclaimable from unmovable]: 107 0 @steal[reclaimable from movable]: 90 0 @steal[movable from reclaimable]: 354 0 @steal[movable from unmovable]: 130 0 Both types of polluting fallbacks are eliminated in this workload. Interestingly, whole block conversions are reduced as well. This is because once a block is claimed for a type, its empty space remains available for future allocations, instead of being padded with fallbacks; this allows the native type to group up instead of spreading out to new blocks. The assumption in the allocator has been that pollution from movable allocations is less harmful than from other types, since they can be reclaimed or migrated out should the space be needed. However, since fallbacks occur *before* reclaim/compaction is invoked, movable pollution will still cause non-movable allocations to spread out and claim more blocks. Without fragmentation, THP rates hold steady with defrag_mode=1: thp_fault_alloc 32478 20725 45045 32130 14018 21711 40791 29134 34458 45381 28305 17265 22584 28454 30850 While the downward trend is eliminated, the keen reader will of course notice that the baseline rate is much smaller than the vanilla kernel's to begin with. This is due to deficiencies in how reclaim and compaction are currently driven: ALLOC_NOFRAGMENT increases the extent to which smaller allocations are competing with THPs for pageblocks, while making no effort themselves to reclaim or compact beyond their own request size. This effect already exists with the current usage of ALLOC_NOFRAGMENT, but is amplified by defrag_mode insisting on whole block stealing much more strongly. Subsequent patches will address defrag_mode reclaim strategy to raise the THP success baseline above the vanilla kernel. Link: https://lkml.kernel.org/r/20250313210647.1314586-4-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm: page_alloc: trace type pollution from compaction capturingJohannes Weiner1-0/+4
When the page allocator places pages of a certain migratetype into blocks of another type, it has lasting effects on the ability to compact and defragment down the line. For improving placement and compaction, visibility into such events is crucial. The most common case, allocator fallbacks, is already annotated, but compaction capturing is also allowed to grab pages of a different type. Extend the tracepoint to cover this case. Link: https://lkml.kernel.org/r/20250313210647.1314586-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Zi Yan <ziy@nvidia.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm: compaction: push watermark into compaction_suitable() callersJohannes Weiner3-38/+45
Patch series "mm: reliable huge page allocator". This series makes changes to the allocator and reclaim/compaction code to try harder to avoid fragmentation. As a result, this makes huge page allocations cheaper, more reliable and more sustainable. It's a subset of the huge page allocator RFC initially proposed here: https://lore.kernel.org/lkml/20230418191313.268131-1-hannes@cmpxchg.org/ The following results are from a kernel build test, with additional concurrent bursts of THP allocations on a memory-constrained system. Comparing before and after the changes over 15 runs: before after Hugealloc Time mean 52739.45 ( +0.00%) 28904.00 ( -45.19%) Hugealloc Time stddev 56541.26 ( +0.00%) 33464.37 ( -40.81%) Kbuild Real time 197.47 ( +0.00%) 196.59 ( -0.44%) Kbuild User time 1240.49 ( +0.00%) 1231.67 ( -0.71%) Kbuild System time 70.08 ( +0.00%) 59.10 ( -15.45%) THP fault alloc 46727.07 ( +0.00%) 63223.67 ( +35.30%) THP fault fallback 21910.60 ( +0.00%) 5412.47 ( -75.29%) Direct compact fail 195.80 ( +0.00%) 59.07 ( -69.48%) Direct compact success 7.93 ( +0.00%) 2.80 ( -57.46%) Direct compact success rate % 3.51 ( +0.00%) 3.99 ( +10.49%) Compact daemon scanned migrate 3369601.27 ( +0.00%) 2267500.33 ( -32.71%) Compact daemon scanned free 5075474.47 ( +0.00%) 2339773.00 ( -53.90%) Compact direct scanned migrate 161787.27 ( +0.00%) 47659.93 ( -70.54%) Compact direct scanned free 163467.53 ( +0.00%) 40729.67 ( -75.08%) Compact total migrate scanned 3531388.53 ( +0.00%) 2315160.27 ( -34.44%) Compact total free scanned 5238942.00 ( +0.00%) 2380502.67 ( -54.56%) Alloc stall 2371.07 ( +0.00%) 638.87 ( -73.02%) Pages kswapd scanned 2160926.73 ( +0.00%) 4002186.33 ( +85.21%) Pages kswapd reclaimed 533191.07 ( +0.00%) 718577.80 ( +34.77%) Pages direct scanned 400450.33 ( +0.00%) 355172.73 ( -11.31%) Pages direct reclaimed 94441.73 ( +0.00%) 31162.80 ( -67.00%) Pages total scanned 2561377.07 ( +0.00%) 4357359.07 ( +70.12%) Pages total reclaimed 627632.80 ( +0.00%) 749740.60 ( +19.46%) Swap out 47959.53 ( +0.00%) 110084.33 ( +129.53%) Swap in 7276.00 ( +0.00%) 24457.00 ( +236.10%) File refaults 138043.00 ( +0.00%) 188226.93 ( +36.35%) THP latencies are cut in half, and failure rates are cut by 75%. These metrics also hold up over time, while the vanilla kernel sees a steady downward trend in success rates with each subsequent run, owed to the cumulative effects of fragmentation. A more detailed discussion of results is in the patch changelogs. The patches first introduce a vm.defrag_mode sysctl, which enforces the existing ALLOC_NOFRAGMENT alloc flag until after reclaim and compaction have run. They then change kswapd and kcompactd to target pageblocks, which boosts success in the ALLOC_NOFRAGMENT hotpaths. Patches #1 and #2 are somewhat unrelated cleanups, but touch the same code and so are included here to avoid conflicts from re-ordering. This patch (of 5): compaction_suitable() hardcodes the min watermark, with a boost to the low watermark for costly orders. However, compaction_ready() requires order-0 at the high watermark. It currently checks the marks twice. Make the watermark a parameter to compaction_suitable() and have the callers pass in what they require: - compaction_zonelist_suitable() is used by the direct reclaim path, so use the min watermark. - compact_suit_allocation_order() has a watermark in context derived from cc->alloc_flags. The only quirk is that kcompactd doesn't initialize cc->alloc_flags explicitly. There is a direct check in kcompactd_do_work() that passes ALLOC_WMARK_MIN, but there is another check downstack in compact_zone() that ends up passing the unset alloc_flags. Since they default to 0, and that coincides with ALLOC_WMARK_MIN, it is correct. But it's subtle. Set cc->alloc_flags explicitly. - should_continue_reclaim() is direct reclaim, use the min watermark. - Finally, consolidate the two checks in compaction_ready() to a single compaction_suitable() call passing the high watermark. There is a tiny change in behavior: before, compaction_suitable() would check order-0 against min or low, depending on costly order. Then there'd be another high watermark check. Now, the high watermark is passed to compaction_suitable(), and the costly order-boost (low - min) is added on top. This means compaction_ready() sets a marginally higher target for free pages. In a kernelbuild + THP pressure test, though, this didn't show any measurable negative effects on memory pressure or reclaim rates. As the comment above the check says, reclaim is usually stopped short on should_continue_reclaim(), and this just defines the worst-case reclaim cutoff in case compaction is not making any headway. [hughd@google.com: stop oops on out-of-range highest_zoneidx] Link: https://lkml.kernel.org/r/005ace8b-07fa-01d4-b54b-394a3e029c07@google.com Link: https://lkml.kernel.org/r/20250313210647.1314586-1-hannes@cmpxchg.org Link: https://lkml.kernel.org/r/20250313210647.1314586-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Zi Yan <ziy@nvidia.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm: convert lru_add_page_tail() to lru_add_split_folio()Matthew Wilcox (Oracle)1-9/+9
Remove three hidden calls to compound_head() and accesses to page->lru. Link: https://lkml.kernel.org/r/20250313151458.4145978-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17selftests/mm/cow: fix the incorrect error handlingCyan Yang1-1/+1
Error handling doesn't check the correct return value. This patch will fix it. Link: https://lkml.kernel.org/r/20250312043840.71799-1-cyan.yang@sifive.com Fixes: f4b5fd6946e2 ("selftests/vm: anon_cow: THP tests") Signed-off-by: Cyan Yang <cyan.yang@sifive.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: David Hildenbrand <david@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/debug: add line breaksLiu Ye1-1/+1
Missing a newline character at the end of the format string. Link: https://lkml.kernel.org/r/20250312093717.364031-1-liuye@kylinos.cn Signed-off-by: Liu Ye <liuye@kylinos.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17MAINTAINERS: adjust file entry in MAPLE TREELukas Bulwahn1-1/+1
Commit 0f3b602e1bad ("tools: separate out shared radix-tree components") moves files from radix-tree/linux to shared/linux in the ./tools/testing/ directory, but misses to adjust a file entry in MAPLE TREE. Hence, ./scripts/get_maintainer.pl --self-test=patterns complains about a broken reference. Adjust the file entry in MAPLE TREE. Link: https://lkml.kernel.org/r/20250312105245.216302-1-lukas.bulwahn@redhat.com Signed-off-by: Lukas Bulwahn <lukas.bulwahn@redhat.com> Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm: memory-failure: enhance comments for return value of memory_failure()Shuai Xue1-3/+7
The comments for the return value of memory_failure are not complete, supplement the comments. Link: https://lkml.kernel.org/r/20250312112852.82415-4-xueshuai@linux.alibaba.com Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Tested-by: Tony Luck <tony.luck@intel.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Borislav Betkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ruidong Tian <tianruidong@linux.alibaba.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/hwpoison: do not send SIGBUS to processes with recovered clean pagesShuai Xue1-3/+8
When an uncorrected memory error is consumed there is a race between the CMCI from the memory controller reporting an uncorrected error with a UCNA signature, and the core reporting and SRAR signature machine check when the data is about to be consumed. - Background: why *UN*corrected errors tied to *C*MCI in Intel platform [1] Prior to Icelake memory controllers reported patrol scrub events that detected a previously unseen uncorrected error in memory by signaling a broadcast machine check with an SRAO (Software Recoverable Action Optional) signature in the machine check bank. This was overkill because it's not an urgent problem that no core is on the verge of consuming that bad data. It's also found that multi SRAO UCE may cause nested MCE interrupts and finally become an IERR. Hence, Intel downgrades the machine check bank signature of patrol scrub from SRAO to UCNA (Uncorrected, No Action required), and signal changed to #CMCI. Just to add to the confusion, Linux does take an action (in uc_decode_notifier()) to try to offline the page despite the UC*NA* signature name. - Background: why #CMCI and #MCE race when poison is consuming in Intel platform [1] Having decided that CMCI/UCNA is the best action for patrol scrub errors, the memory controller uses it for reads too. But the memory controller is executing asynchronously from the core, and can't tell the difference between a "real" read and a speculative read. So it will do CMCI/UCNA if an error is found in any read. Thus: 1) Core is clever and thinks address A is needed soon, issues a speculative read. 2) Core finds it is going to use address A soon after sending the read request 3) The CMCI from the memory controller is in a race with MCE from the core that will soon try to retire the load from address A. Quite often (because speculation has got better) the CMCI from the memory controller is delivered before the core is committed to the instruction reading address A, so the interrupt is taken, and Linux offlines the page (marking it as poison). - Why user process is killed for instr case Commit 046545a661af ("mm/hwpoison: fix error page recovered but reported "not recovered"") tries to fix noise message "Memory error not recovered" and skips duplicate SIGBUSs due to the race. But it also introduced a bug that kill_accessing_process() return -EHWPOISON for instr case, as result, kill_me_maybe() send a SIGBUS to user process. If the CMCI wins that race, the page is marked poisoned when uc_decode_notifier() calls memory_failure(). For dirty pages, memory_failure() invokes try_to_unmap() with the TTU_HWPOISON flag, converting the PTE to a hwpoison entry. As a result, kill_accessing_process(): - call walk_page_range() and return 1 regardless of whether try_to_unmap() succeeds or fails, - call kill_proc() to make sure a SIGBUS is sent - return -EHWPOISON to indicate that SIGBUS is already sent to the process and kill_me_maybe() doesn't have to send it again. However, for clean pages, the TTU_HWPOISON flag is cleared, leaving the PTE unchanged and not converted to a hwpoison entry. Conversely, for clean pages where PTE entries are not marked as hwpoison, kill_accessing_process() returns -EFAULT, causing kill_me_maybe() to send a SIGBUS. Console log looks like this: Memory failure: 0x827ca68: corrupted page was clean: dropped without side effects Memory failure: 0x827ca68: recovery action for clean LRU page: Recovered Memory failure: 0x827ca68: already hardware poisoned mce: Memory error not recovered To fix it, return 0 for "corrupted page was clean", preventing an unnecessary SIGBUS to user process. [1] https://lore.kernel.org/lkml/20250217063335.22257-1-xueshuai@linux.alibaba.com/T/#mba94f1305b3009dd340ce4114d3221fe810d1871 Link: https://lkml.kernel.org/r/20250312112852.82415-3-xueshuai@linux.alibaba.com Fixes: 046545a661af ("mm/hwpoison: fix error page recovered but reported "not recovered"") Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com> Tested-by: Tony Luck <tony.luck@intel.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Borislav Betkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ruidong Tian <tianruidong@linux.alibaba.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Yazen Ghannam <yazen.ghannam@amd.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17x86/mce: use is_copy_from_user() to determine copy-from-user contextShuai Xue1-6/+5
Patch series "mm/hwpoison: Fix regressions in memory failure handling", v4. ## 1. What am I trying to do: This patchset resolves two critical regressions related to memory failure handling that have appeared in the upstream kernel since version 5.17, as compared to 5.10 LTS. - copyin case: poison found in user page while kernel copying from user space - instr case: poison found while instruction fetching in user space ## 2. What is the expected outcome and why - For copyin case: Kernel can recover from poison found where kernel is doing get_user() or copy_from_user() if those places get an error return and the kernel return -EFAULT to the process instead of crashing. More specifily, MCE handler checks the fixup handler type to decide whether an in kernel #MC can be recovered. When EX_TYPE_UACCESS is found, the PC jumps to recovery code specified in _ASM_EXTABLE_FAULT() and return a -EFAULT to user space. - For instr case: If a poison found while instruction fetching in user space, full recovery is possible. User process takes #PF, Linux allocates a new page and fills by reading from storage. ## 3. What actually happens and why - For copyin case: kernel panic since v5.17 Commit 4c132d1d844a ("x86/futex: Remove .fixup usage") introduced a new extable fixup type, EX_TYPE_EFAULT_REG, and later patches updated the extable fixup type for copy-from-user operations, changing it from EX_TYPE_UACCESS to EX_TYPE_EFAULT_REG. It breaks previous EX_TYPE_UACCESS handling when posion found in get_user() or copy_from_user(). - For instr case: user process is killed by a SIGBUS signal due to #CMCI and #MCE race When an uncorrected memory error is consumed there is a race between the CMCI from the memory controller reporting an uncorrected error with a UCNA signature, and the core reporting and SRAR signature machine check when the data is about to be consumed. ### Background: why *UN*corrected errors tied to *C*MCI in Intel platform [1] Prior to Icelake memory controllers reported patrol scrub events that detected a previously unseen uncorrected error in memory by signaling a broadcast machine check with an SRAO (Software Recoverable Action Optional) signature in the machine check bank. This was overkill because it's not an urgent problem that no core is on the verge of consuming that bad data. It's also found that multi SRAO UCE may cause nested MCE interrupts and finally become an IERR. Hence, Intel downgrades the machine check bank signature of patrol scrub from SRAO to UCNA (Uncorrected, No Action required), and signal changed to #CMCI. Just to add to the confusion, Linux does take an action (in uc_decode_notifier()) to try to offline the page despite the UC*NA* signature name. ### Background: why #CMCI and #MCE race when poison is consuming in Intel platform [1] Having decided that CMCI/UCNA is the best action for patrol scrub errors, the memory controller uses it for reads too. But the memory controller is executing asynchronously from the core, and can't tell the difference between a "real" read and a speculative read. So it will do CMCI/UCNA if an error is found in any read. Thus: 1) Core is clever and thinks address A is needed soon, issues a speculative read. 2) Core finds it is going to use address A soon after sending the read request 3) The CMCI from the memory controller is in a race with MCE from the core that will soon try to retire the load from address A. Quite often (because speculation has got better) the CMCI from the memory controller is delivered before the core is committed to the instruction reading address A, so the interrupt is taken, and Linux offlines the page (marking it as poison). ## Why user process is killed for instr case Commit 046545a661af ("mm/hwpoison: fix error page recovered but reported "not recovered"") tries to fix noise message "Memory error not recovered" and skips duplicate SIGBUSs due to the race. But it also introduced a bug that kill_accessing_process() return -EHWPOISON for instr case, as result, kill_me_maybe() send a SIGBUS to user process. # 4. The fix, in my opinion, should be: - For copyin case: The key point is whether the error context is in a read from user memory. We do not care about the ex-type if we know its a MOV reading from userspace. is_copy_from_user() return true when both of the following two checks are true: - the current instruction is copy - source address is user memory If copy_user is true, we set m->kflags |= MCE_IN_KERNEL_COPYIN | MCE_IN_KERNEL_RECOV; Then do_machine_check() will try fixup_exception() first. - For instr case: let kill_accessing_process() return 0 to prevent a SIGBUS. - For patch 3: The return value of memory_failure() is quite important while discussed instr case regression with Tony and Miaohe for patch 2, so add comment about the return value. This patch (of 3): Commit 4c132d1d844a ("x86/futex: Remove .fixup usage") introduced a new extable fixup type, EX_TYPE_EFAULT_REG, and commit 4c132d1d844a ("x86/futex: Remove .fixup usage") updated the extable fixup type for copy-from-user operations, changing it from EX_TYPE_UACCESS to EX_TYPE_EFAULT_REG. The error context for copy-from-user operations no longer functions as an in-kernel recovery context. Consequently, the error context for copy-from-user operations no longer functions as an in-kernel recovery context, resulting in kernel panics with the message: "Machine check: Data load in unrecoverable area of kernel." To address this, it is crucial to identify if an error context involves a read operation from user memory. The function is_copy_from_user() can be utilized to determine: - the current operation is copy - when reading user memory When these conditions are met, is_copy_from_user() will return true, confirming that it is indeed a direct copy from user memory. This check is essential for correctly handling the context of errors in these operations without relying on the extable fixup types that previously allowed for in-kernel recovery. So, use is_copy_from_user() to determine if a context is copy user directly. Link: https://lkml.kernel.org/r/20250312112852.82415-1-xueshuai@linux.alibaba.com Link: https://lkml.kernel.org/r/20250312112852.82415-2-xueshuai@linux.alibaba.com Fixes: 4c132d1d844a ("x86/futex: Remove .fixup usage") Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Tony Luck <tony.luck@intel.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Borislav Betkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Ruidong Tian <tianruidong@linux.alibaba.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Yazen Ghannam <yazen.ghannam@amd.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm: lock PGDAT_RECLAIM_LOCKED with acquire memory orderingMathieu Desnoyers1-1/+1
The PGDAT_RECLAIM_LOCKED bit is used to provide mutual exclusion of node reclaim for struct pglist_data using a single bit. Use test_and_set_bit_lock rather than test_and_set_bit to test-and-set PGDAT_RECLAIM_LOCKED with an acquire memory ordering semantic. This changes the "lock" acquisition from a full barrier to an acquire memory ordering, which is weaker. The acquire semi-permeable barrier paired with the release on unlock is sufficient for this mutual exclusion use-case. No behavior change intended other than to reduce overhead by using the appropriate barrier. Link: https://lkml.kernel.org/r/20250312141014.129725-2-mathieu.desnoyers@efficios.com Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Andrea Parri <parri.andrea@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: David Howells <dhowells@redhat.com> Cc: Jade Alglave <j.alglave@ucl.ac.uk> Cc: Luc Maranget <luc.maranget@inria.fr> Cc: "Paul E. McKenney" <paulmck@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm: add missing release barrier on PGDAT_RECLAIM_LOCKED unlockMathieu Desnoyers1-1/+1
The PGDAT_RECLAIM_LOCKED bit is used to provide mutual exclusion of node reclaim for struct pglist_data using a single bit. It is "locked" with a test_and_set_bit (similarly to a try lock) which provides full ordering with respect to loads and stores done within __node_reclaim(). It is "unlocked" with clear_bit(), which does not provide any ordering with respect to loads and stores done before clearing the bit. The lack of clear_bit() memory ordering with respect to stores within __node_reclaim() can cause a subsequent CPU to fail to observe stores from a prior node reclaim. This is not an issue in practice on TSO (e.g. x86), but it is an issue on weakly-ordered architectures (e.g. arm64). Fix this by using clear_bit_unlock rather than clear_bit to clear PGDAT_RECLAIM_LOCKED with a release memory ordering semantic. This provides stronger memory ordering (release rather than relaxed). Link: https://lkml.kernel.org/r/20250312141014.129725-1-mathieu.desnoyers@efficios.com Fixes: d773ed6b856a ("mm: test and set zone reclaim lock before starting reclaim") Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Andrea Parri <parri.andrea@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: David Howells <dhowells@redhat.com> Cc: Jade Alglave <j.alglave@ucl.ac.uk> Cc: Luc Maranget <luc.maranget@inria.fr> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/madvise: remove len parameter of madvise_do_behavior()SeongJae Park1-6/+4
Because madise_should_skip() logic is factored out, making madvise_do_behavior() calculates 'len' on its own rather then receiving it as a parameter makes code simpler. Remove the parameter. Link: https://lkml.kernel.org/r/20250312164750.59215-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Liam R. Howlett <howlett@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/madvise: deduplicate madvise_do_behavior() skip case handlingsSeongJae Park1-23/+34
The logic for checking if a given madvise() request for a single memory range can skip real work, namely madvise_do_behavior(), is duplicated in do_madvise() and vector_madvise(). Split out the logic to a function and reuse it. Link: https://lkml.kernel.org/r/20250312164750.59215-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Liam R. Howlett <howlett@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/madvise: split out populate behavior check logicSeongJae Park1-7/+13
madvise_do_behavior() has a long open-coded 'behavior' check for MADV_POPULATE_{READ,WRITE}. It adds multiple layers[1] and make the code arguably take longer time to read. Like is_memory_failure(), split out the check to a separate function. This is not technically removing the additional layer but discourage further extending the switch-case. Also it makes madvise_do_behavior() code shorter and therefore easier to read. [1] https://lore.kernel.org/bd6d0bf1-c79e-46bd-a810-9791efb9ad73@lucifer.local Link: https://lkml.kernel.org/r/20250312164750.59215-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Liam R. Howlett <howlett@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/madvise: use is_memory_failure() from madvise_do_behavior()SeongJae Park1-22/+27
Patch series "mm/madvise: cleanup requests validations and classifications". Cleanup madvise entry level code for cleaner request validations and classifications. This patch (of 4): To reduce redundant open-coded checks of CONFIG_MEMORY_FAILURE and MADV_{HWPOISON,SOFT_OFFLINE} in madvise_[un]lock(), is_memory_failure() is introduced. madvise_do_behavior() is still doing the same open-coded check, though. Use is_memory_failure() instead. To avoid build failure on !CONFIG_MEMORY_FAILURE case, implement an empty madvise_inject_error() under the config. Also move the definition of is_memory_failure() inside #ifdef CONFIG_MEMORY_FAILURE clause for madvise_inject_error() definition, to reduce duplicated ifdef clauses. Link: https://lkml.kernel.org/r/20250312164750.59215-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250312164750.59215-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Liam R. Howlett <howlett@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/page_alloc: add trace event for totalreserve_pages calculationMartin Liu2-0/+19
This commit introduces a new trace event, `mm_calculate_totalreserve_pages`, which reports the new reserve value at the exact time when it takes effect. The `totalreserve_pages` value represents the total amount of memory reserved across all zones and nodes in the system. This reserved memory is crucial for ensuring that critical kernel operations have access to sufficient memory, even under memory pressure. By tracing the `totalreserve_pages` value, developers can gain insights that how the total reserved memory changes over time. Link: https://lkml.kernel.org/r/20250308034606.2036033-4-liumartin@google.com Signed-off-by: Martin Liu <liumartin@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/page_alloc: add trace event for per-zone lowmem reserve setupMartin Liu2-0/+29
This commit introduces the `mm_setup_per_zone_lowmem_reserve` trace event,which provides detailed insights into the kernel's per-zone lowmem reserve configuration. The trace event provides precise timestamps, allowing developers to 1. Correlate lowmem reserve changes with specific kernel events and able to diagnose unexpected kswapd or direct reclaim behavior triggered by dynamic changes in lowmem reserve. 2. Know memory allocation failures that occur due to insufficient lowmem reserve, by precisely correlating allocation attempts with reserve adjustments. Link: https://lkml.kernel.org/r/20250308034606.2036033-3-liumartin@google.com Signed-off-by: Martin Liu <liumartin@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/page_alloc: add trace event for per-zone watermark setupMartin Liu2-0/+34
Patch series "Add tracepoints for lowmem reserves, watermarks and totalreserve_pages", v2. This patchset introduces tracepoints to track changes in the lowmem reserves, watermarks and totalreserve_pages. This helps to track the exact timing of such changes and understand their relation to reclaim activities. The tracepoints added are: mm_setup_per_zone_lowmem_reserve mm_setup_per_zone_wmarks mm_calculate_totalreserve_pagesi This patch (of 3): This commit introduces the `mm_setup_per_zone_wmarks` trace event, which provides detailed insights into the kernel's per-zone watermark configuration, offering precise timing and the ability to correlate watermark changes with specific kernel events. While `/proc/zoneinfo` provides some information about zone watermarks, this trace event offers: 1. The ability to link watermark changes to specific kernel events and logic. 2. The ability to capture rapid or short-lived changes in watermarks that may be missed by user-space polling 3. Diagnosing unexpected kswapd activity or excessive direct reclaim triggered by rapidly changing watermarks. Link: https://lkml.kernel.org/r/20250308034606.2036033-1-liumartin@google.com Link: https://lkml.kernel.org/r/20250308034606.2036033-2-liumartin@google.com Signed-off-by: Martin Liu <liumartin@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Martin Liu <liumartin@google.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/shmem: fix functions documentationEnrico Bravi1-3/+3
Add missing parenthesis in @name parameter description. Link: https://lkml.kernel.org/r/20250310112535.84754-1-enrico.bravi@polito.it Signed-off-by: Enrico Bravi <enrico.bravi@polito.it> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>