aboutsummaryrefslogtreecommitdiffstatshomepage
path: root/mm
AgeCommit message (Collapse)AuthorFilesLines
2026-05-26Merge tag 'mm-hotfixes-stable-2026-05-25-16-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mmLinus Torvalds8-63/+55
Pull misc fixes from Andrew Morton: "13 hotfixes. 9 are for MM. 9 are cc:stable and the remaining 4 address post-7.1 issues or aren't considered suitable for backporting. All patches are singletons - please see the individual changelogs for details" * tag 'mm-hotfixes-stable-2026-05-25-16-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: Revert "mm: introduce a new page type for page pool in page type" mm/vmalloc: do not trigger BUG() on BH disabled context MAINTAINERS, mailmap: change email for Eugen Hristev mm/migrate_device: fix pgtable leak in migrate_vma_insert_huge_pmd_page kernel/fork: validate exit_signal in kernel_clone() mm: memcontrol: propagate NMI slab stats to memcg vmstats mm/damon/sysfs-schemes: delete tried region in regions_rmdirs() mm/rmap: initialize nr_pages to 1 at loop start in try_to_unmap_one zram: fix use-after-free in zram_writeback_endio memfd: deny writeable mappings when implying SEAL_WRITE ipc: limit next_id allocation to the valid ID range Revert "mm/hugetlbfs: update hugetlbfs to use mmap_prepare" MAINTAINERS: .mailmap: update after GEHC spin-off
2026-05-22Merge tag 'cgroup-for-7.1-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroupLinus Torvalds1-3/+3
Pull cgroup fixes from Tejun Heo: "Two rstat fixes: - Out-of-bounds access in the css_rstat_updated() BPF kfunc when called with an unchecked user-supplied cpu - Over-strict NMI guard after the recent switch to try_cmpxchg left sparc and ppc64 unable to queue rstat updates from NMI" * tag 'cgroup-for-7.1-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: rstat: relax NMI guard after switch to try_cmpxchg cgroup/rstat: validate cpu before css_rstat_cpu() access
2026-05-22Merge tag 'slab-for-7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slabLinus Torvalds2-0/+3
Pull slab fix from Vlastimil Babka: - Stable fix for a missing cpus_read_lock in one of the cpu sheaves flushing paths (Qing Wang) * tag 'slab-for-7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: mm/slub: hold cpus_read_lock around flush_rcu_sheaves_on_cache()
2026-05-21Revert "mm: introduce a new page type for page pool in page type"Byungchul Park1-9/+4
This reverts commit db359fccf212 ("mm: introduce a new page type for page pool in page type") and a part of 735a309b4bfb9e ("net: add net_iov_init() and use it to initialize ->page_type"). Netpp page_type'ed pages might be used in mapping so as to use @_mapcount. However, since @page_type and @_mapcount are union'ed in struct page, these two can't be used at the same time. Revert the commit introducing page_type for Netpp for now. The patch will be retried once @page_type and @_mapcount get allowed to be used at the same time. The revert also includes removal of @page_type initialization part introduced by commit 735a309b4bfb9e ("net: add net_iov_init() and use it to initialize ->page_type"), which will be restored on the retry. Link: https://lore.kernel.org/20260515034701.17027-1-byungchul@sk.com Fixes: db359fccf212 ("mm: introduce a new page type for page pool in page type") Signed-off-by: Byungchul Park <byungchul@sk.com> Reported-by: Dragos Tatulea <dtatulea@nvidia.com> Closes: https://lore.kernel.org/all/982b9bc1-0a0a-4fc5-8e3a-3672db2b29a1@nvidia.com Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Harry Yoo (Oracle) <harry@kernel.org> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org> Cc: Jesper Dangaard Brouer <hawk@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Leon Romanovsky <leon@kernel.org> Cc: Liam R. Howlett <liam@infradead.org> Cc: Mark Bloch <mbloch@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: Saeed Mahameed <saeedm@nvidia.com> Cc: Simon Horman <horms@kernel.org> Cc: Stanislav Fomichev <sdf@fomichev.me> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tariq Toukan <tariqt@nvidia.com> Cc: Toke Hoiland-Jorgensen <toke@redhat.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-21mm/vmalloc: do not trigger BUG() on BH disabled contextUladzislau Rezki (Sony)1-1/+1
__get_vm_area_node() currently triggers a BUG() if in_interrupt() returns true. However, in_interrupt() also reports true when BH are disabled. The bridge code can call rhashtable_lookup_insert_fast() with bottom halves disabled: __vlan_add() -> br_fdb_add_local() spin_lock_bh(&br->hash_lock); <-- Disable BH -> fdb_add_local() -> fdb_create() -> rhashtable_lookup_insert_fast() -> kvmalloc() -> vmalloc() -> __get_vm_area_node() -> BUG_ON(in_interrupt()) spin_unlock_bh(&br->hash_lock) this triggers the BUG() despite the caller not being in NMI or hard IRQ context. Replace the in_interrupt() check with in_nmi() || in_hardirq(). Link: https://lore.kernel.org/20260515153009.2296191-1-urezki@gmail.com Fixes: c6307674ed82 ("mm: kvmalloc: add non-blocking support for vmalloc") Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Ido Schimmel <idosch@nvidia.com> Reported-by: syzbot+8b12fc6e0fb139765b58@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/69ff8c7c.050a0220.1036b8.000b.GAE@google.com/ Reviewed-by: Baoquan He <baoquan.he@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-21mm/migrate_device: fix pgtable leak in migrate_vma_insert_huge_pmd_pageSunny Patel1-1/+3
When migrate_vma_insert_huge_pmd_page() jumps to unlock_abort due to a PMD check failure, the pgtable allocated earlier via pte_alloc_one() is never freed, causing a memory leak. Added free_abort label to release the pgtable in error path. Link: https://lore.kernel.org/20260501115122.23288-1-nueralspacetech@gmail.com Fixes: a30b48bf1b24 ("mm/migrate_device: implement THP migration of zone device pages") Signed-off-by: Sunny Patel <nueralspacetech@gmail.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Huang Ying <ying.huang@linux.alibaba.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Gregory Price <gourry@gourry.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-21mm: memcontrol: propagate NMI slab stats to memcg vmstatsAlexandre Ghiti1-0/+6
flush_nmi_stats() drains per-node NMI slab atomics into the per-node lruvec_stats, but does not propagate them to the memcg-level vmstats. For non NMI case, account_slab_nmi_safe() calls mod_memcg_lruvec_state() which updates both per-node lruvec_stats and memcg-level vmstats, so flush_nmi_stats() needs to flush to per-node lruvec_stats as well as memcg-level vmstats. So fix this by flushing to the memcg-level vmstats for NMI too. Link: https://lore.kernel.org/20260518082830.599102-1-alex@ghiti.fr Fixes: 940b01fc8dc1 ("memcg: nmi safe memcg stats for specific archs") Signed-off-by: Alexandre Ghiti <alex@ghiti.fr> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-21mm/damon/sysfs-schemes: delete tried region in regions_rmdirs()SeongJae Park1-4/+4
DAMON sysfs maintains the DAMOS tried region directory objects via a linked list. When the user requests refresh of the directories, DAMON sysfs removes all the region directories first, and then generate updated regions directory on the empty space. The removal function (damon_sysfs_scheme_regions_rm_dirs()) only puts the kobj objects. Deletion of the container region object from the linked list is done inside the kobj release callback function. If somehow the callback invocation is delayed, the list will contain regions list that gonna be freed. If the updated region directories creation is started in this situation, the list can be corrupted and use-after-free can happen. Because the kobj objects are managed by only DAMON sysfs, the issue cannot happen in normal situation. But, such delays can be made on kernels that built with CONFIG_DEBUG_KOBJECT_RELEASE. On the kernel, the issue can indeed be reproduced like below. # damo start --damos_action stat # cd /sys/kernel/mm/damon/admin/kdamonds/0/ # for i in {1..10}; do echo update_schemes_tried_regions > state; done # dmesg | grep underflow [ 89.296152] refcount_t: underflow; use-after-free. Fix the issue by removing the region object from the list when decrementing the reference count. Also update damos_sysfs_populate_region_dir() to add the region object to the list only after the kobject_init_and_add() is success, so that fail of kobject_init_and_add() is not leaving the deallocated object on the list. The issue was discovered [1] by Sashiko. Link: https://lore.kernel.org/20260518152559.93038-1-sj@kernel.org Link: https://lore.kernel.org/20260513011920.119183-1-sj@kernel.org [1] Fixes: 9277d0367ba1 ("mm/damon/sysfs-schemes: implement scheme region directory") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> # 6.2.x Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-21mm/rmap: initialize nr_pages to 1 at loop start in try_to_unmap_oneDev Jain1-0/+2
Initialize nr_pages to 1 at the start of each loop iteration, like folio_referenced_one() does. Without this, nr_pages computed by a previous folio_unmap_pte_batch() call can be reused on a later iteration that does not run folio_unmap_pte_batch() again. mmap a 64K large folio with MAP_ANONYMOUS | MAP_DROPPABLE, then call madvise(MADV_FREE), then make the last page device-exclusive via HMM_DMIRROR_EXCLUSIVE. Trigger node reclaim through sysfs. Now, in try_to_unmap_one(), we will first clear the first 15 out of 16 entries mapping the lazyfree folio. This will set nr_pages to 15. In the next pvmw walk, this nr_pages gets reused on a device-exclusive pte, thus potentially corrupting folio refcount/mapcount. At the moment, I have a userspace program which can make the kernel spit out a trace, but the blow up is in folio_referenced_one(), because there are existing bugs in the interaction between device-private and rmap (which too I am investigating). I did a one liner kernel change to avoid going into folio_referenced_one(), and the kernel blows up at folio_remove_rmap_ptes in try_to_unmap_one which is what I wanted. Note that the bug is there not since file folio batching but lazyfree folio batching, since device-exclusive only works for anonymous folios. Userspace visible effect is simply kernel crashing somewhere due to refcount/mapcount corruption. Link: https://lore.kernel.org/20260518063656.3721056-1-dev.jain@arm.com Fixes: 354dffd29575 ("mm: support batched unmap for lazyfree large folios during reclamation") Signed-off-by: Dev Jain <dev.jain@arm.com> Acked-by: Barry Song <baohua@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Barry Song <baohua@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Harry Yoo <harry@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-21memfd: deny writeable mappings when implying SEAL_WRITEPratyush Yadav (Google)1-6/+6
When SEAL_EXEC is added, SEAL_WRITE is implied to make W^X. But the implied seal is set after the check that makes sure the memfd can not have any writable mappings. This means one can use SEAL_EXEC to apply SEAL_WRITE while having writeable mappings. This breaks the contract that SEAL_WRITE provides and can be used by an attacker to pass a memfd that appears to be write sealed but can still be modified arbitrarily. Fix this by adding the implied seals before the call for mapping_deny_writable() is done. Link: https://lore.kernel.org/20260505133922.797635-1-pratyush@kernel.org Fixes: c4f75bc8bd6b ("mm/memfd: add write seals when apply SEAL_EXEC to executable memfd") Signed-off-by: Pratyush Yadav (Google) <pratyush@kernel.org> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Acked-by: Jeff Xu <jeffxu@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kees Cook <kees@kernel.org> Cc: "David Hildenbrand (Arm)" <david@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-21Revert "mm/hugetlbfs: update hugetlbfs to use mmap_prepare"Lorenzo Stoakes1-42/+29
This reverts commit ea52cb24cd3f ("mm/hugetlbfs: update hugetlbfs to use mmap_prepare") with conflict resolution to account for changes in commit ea52cb24cd3f ("mm/hugetlbfs: update hugetlbfs to use mmap_prepare"). The patch incorrectly handled hugetlb VMA lock allocation at the mmap_prepare stage, where a failed allocation occurring after mmap_prepare is called might result in the lock leaking. There is no risk of a merge causing a similar issues, as VMA_DONTEXPAND_BIT is set for hugetlb mappings. As a first step in addressing this issue, simply revert the change so we can rework how we do this having corrected the underlying issues. We maintain the VMA flags changes as best we can, accounting for the fact that we were working with a VMA descriptor previously and propagating like-for-like changes for this. Note that we invoke vma_set_flags() and do not call vma_start_write() as vm_flags_set() does. This is OK as it's being done in an .mmap hook where the VMA is not yet linked into the tree so nobody else can be accessing it. Link: https://lore.kernel.org/20260512160643.266960-1-ljs@kernel.org Fixes: ea52cb24cd3f ("mm/hugetlbfs: update hugetlbfs to use mmap_prepare") Signed-off-by: Lorenzo Stoakes <ljs@kernel.org> Reported-by: Mingyu Wang <25181214217@stu.xidian.edu.cn> Closes: https://lore.kernel.org/linux-mm/20260425070700.562229-1-25181214217@stu.xidian.edu.cn/ Acked-by: Muchun Song <muchun.song@linux.dev> Acked-by: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@kernel.org> Cc: Liam R. Howlett <liam@infradead.org> Cc: Pedro Falcato <pfalcato@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-19Merge tag 'mm-hotfixes-stable-2026-05-18-21-07' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mmLinus Torvalds5-9/+28
Pull misc fixes from Andrew Morton: "14 hotfixes. 9 are for MM. 10 are cc:stable and the remainder are for post-7.1 issues or aren't deemed suitable for backporting. There's a two-patch MAINTAINERS series from Mike Rapoport which updates us for the new KEXEC/KDUMP/crash/LUO/etc arrangements. And another two-patch series from Muchun Song to fix a couple of memory-hotplug issues. Otherwise singletons, please see the changelogs for details" * tag 'mm-hotfixes-stable-2026-05-18-21-07' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm/memory: fix spurious warning when unmapping device-private/exclusive pages mm: fix __vm_normal_page() to handle missing support for pmd_special()/pud_special() drivers/base/memory: fix memory block reference leak in poison accounting mm/memory_hotplug: fix memory block reference leak on remove lib: kunit_iov_iter: fix test fail on powerpc mm/page_alloc: fix initialization of tags of the huge zero folio with init_on_free MAINTAINERS: add kexec@ list to LIVE UPDATE ENTRY MAINTAINERS: add tree for KDUMP and KEXEC selftests/mm: run_vmtests.sh: fix destructive tests invocation scripts/gdb: slab: update field names of struct kmem_cache scripts/gdb: mm: cast untyped symbols in x86_page_ops mm/damon: fix damos_stat tracepoint format for sz_applied mm/damon/sysfs-schemes: call missing mem_cgroup_iter_break() mm/migrate_device: fix spinlock leak in migrate_vma_insert_huge_pmd_page
2026-05-18cgroup/rstat: validate cpu before css_rstat_cpu() accessQing Ming1-3/+3
css_rstat_updated() is exposed as a BPF kfunc and accepts a caller-provided cpu argument. The function uses cpu for per-cpu rstat lookups without checking whether it refers to a valid possible CPU. A BPF iter/cgroup program with CAP_BPF and CAP_PERFMON can pass an invalid cpu value. On an unfixed UBSCAN_BOUNDS test kernel, cpu == 0x7fffffff triggers: UBSAN: array-index-out-of-bounds in kernel/cgroup/rstat.c:31:9 index 2147483647 is out of range for type 'long unsigned int [64]' Call Trace: css_rstat_updated bpf_iter_run_prog cgroup_iter_seq_show bpf_seq_read Add cpu validation to the BPF-facing css_rstat_updated() kfunc and move the common implementation to __css_rstat_updated() for in-kernel callers. Fixes: a319185be9f5 ("cgroup: bpf: enable bpf programs to integrate with rstat") Signed-off-by: Qing Ming <a0yami@mailbox.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-15Merge tag 'v7.1-p4' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6Linus Torvalds1-0/+16
Pull crypto fixes from Herbert Xu: - Fix potential dead-lock in rhashtable when used by xattr - Avoid calling kvfree on atomic path in rhashtable * tag 'v7.1-p4' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: rhashtable: Add bucket_table_free_atomic() helper mm/slab: Add kvfree_atomic() helper rhashtable: drop ht->mutex in rhashtable_free_and_destroy()
2026-05-14mm/slub: hold cpus_read_lock around flush_rcu_sheaves_on_cache()Qing Wang2-0/+3
flush_rcu_sheaves_on_cache() calls queue_work_on() in a for_each_online_cpu() loop, which requires the cpu to stay online. But cpus_read_lock() is not held in kvfree_rcu_barrier_on_cache() and the set of "online cpus" is subject to change. There are two paths that call flush_rcu_sheaves_on_cache(): // has cpus_read_lock() flush_all_rcu_sheaves() -> flush_rcu_sheaves_on_cache() // no cpus_read_lock() kvfree_rcu_barrier_on_cache() -> flush_rcu_sheaves_on_cache() Fix this by holding cpus_read_lock() in kvfree_rcu_barrier_on_cache(). Why not move cpus_read_lock() from flush_all_rcu_sheaves() into flush_rcu_sheaves_on_cache()? The reason is it would introduce a new lock order (slab_mutex -> cpu_hotplug_lock). The reverse order (cpu_hotplug_lock -> slab_mutex) is established by - cpuhp_setup_state_nocalls(..., slub_cpu_setup, ...) - kmem_cache_destroy() The two orders together would form an AB-BA deadlock. Finally, add lockdep_assert_cpus_held() in flush_rcu_sheaves_on_cache() to catch the same problem in the future. Fixes: 0f35040de593 ("mm/slab: introduce kvfree_rcu_barrier_on_cache() for cache destruction") Cc: <stable@vger.kernel.org> Signed-off-by: Qing Wang <wangqing7171@gmail.com> Link: https://patch.msgid.link/20260512035035.762317-1-wangqing7171@gmail.com Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
2026-05-13mm/memory: fix spurious warning when unmapping device-private/exclusive pagesAlistair Popple1-1/+1
Device private and exclusive entries are only supported for anonymous folios. This condition is tested in __migrate_device_pages() and make_device_exclusive() using folio_test_anon(). However the unmap path tests this assumption using vma_is_anonymous(). This is wrong because whilst anonymous VMAs can only contain folios where folio_test_anon() is true the opposite relation does not hold. A folio for which folio_test_anon() is true does not imply vma_is_anonymous() is true. Such a condition can occur if for example a folio is part of a private filebacked mapping. In this case vma_is_anonymous() is false as the mapping is filebacked, but folio_test_anon() may be true, thus permitting devices to migrate the folio to device private memory. This can lead to the following spurious warnings during process teardown: [ 772.737706] ------------[ cut here ]------------ [ 772.739201] WARNING: mm/memory.c:1754 at unmap_page_range.cold+0x26/0x18a, CPU#17: hmm-tests/2041 [ 772.742050] Modules linked in: test_hmm nvidia_uvm(O) nvidia(O) [ 772.743959] CPU: 17 UID: 0 PID: 2041 Comm: hmm-tests Tainted: G W O 7.0.0+ #387 PREEMPT(full) [ 772.747104] Tainted: [W]=WARN, [O]=OOT_MODULE [ 772.748509] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014 [ 772.752117] RIP: 0010:unmap_page_range.cold+0x26/0x18a [ 772.753780] Code: 7e fe ff ff 48 89 4c 24 78 4c 89 44 24 38 e8 f2 ff b1 00 48 8b 4c 24 78 4c 8b 44 24 38 48 8b 44 24 18 48 83 78 48 00 74 04 90 <0f> 0b 90 48 89 ca b8 ff ff 37 00 48 c1 ea 03 48 c1 e0 2a 80 3c 02 [ 772.759602] RSP: 0018:ffff888112607550 EFLAGS: 00010286 [ 772.761310] RAX: ffff88811bbf4dc0 RBX: dffffc0000000000 RCX: ffffea03e9bfffd8 [ 772.763583] RDX: 1ffff1102377e9c1 RSI: 0000000000000008 RDI: ffff88811bbf4e08 [ 772.765914] RBP: 0000000000000006 R08: ffff8881059f7448 R09: ffffed10224c0e68 [ 772.768184] R10: ffff888112607347 R11: 0000000000000001 R12: 0000000000000001 [ 772.770461] R13: ffffea03e9bfffc0 R14: ffff888112607908 R15: ffffea03e9bfffc0 [ 772.772782] FS: 00007f327caa2780(0000) GS:ffff888427b7d000(0000) knlGS:0000000000000000 [ 772.775328] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 772.777187] CR2: 00007f327ca89000 CR3: 00000001994d5000 CR4: 00000000000006f0 [ 772.779135] Call Trace: [ 772.779792] <TASK> [ 772.780317] ? dmirror_interval_invalidate+0x1a3/0x290 [test_hmm] [ 772.781873] ? vm_normal_page_pud+0x2b0/0x2b0 [ 772.782992] ? __rwlock_init+0x150/0x150 [ 772.784006] ? lock_release+0x216/0x2b0 [ 772.785008] ? __mmu_notifier_invalidate_range_start+0x505/0x6e0 [ 772.786522] ? lock_release+0x216/0x2b0 [ 772.787498] ? unmap_single_vma+0xb6/0x210 [ 772.788573] unmap_vmas+0x27d/0x520 [ 772.789506] ? unmap_single_vma+0x210/0x210 [ 772.790607] ? mas_update_gap.part.0+0x620/0x620 [ 772.791834] unmap_region+0x19e/0x350 [ 772.792769] ? remove_vma+0x130/0x130 [ 772.793684] ? mas_alloc_nodes+0x1f2/0x300 [ 772.794730] vms_complete_munmap_vmas+0x8c1/0xe20 [ 772.795926] ? unmap_region+0x350/0x350 [ 772.796917] do_vmi_align_munmap+0x36a/0x4e0 [ 772.798018] ? lock_release+0x216/0x2b0 [ 772.799024] ? vma_shrink+0x620/0x620 [ 772.799983] do_vmi_munmap+0x150/0x2c0 [ 772.800939] __vm_munmap+0x161/0x2c0 [ 772.801872] ? expand_downwards+0xd60/0xd60 [ 772.802948] ? clockevents_program_event+0x1ef/0x540 [ 772.804217] ? lock_release+0x216/0x2b0 [ 772.805158] __x64_sys_munmap+0x59/0x80 [ 772.805776] do_syscall_64+0xfc/0x670 [ 772.806336] ? irqentry_exit+0xda/0x580 [ 772.806976] entry_SYSCALL_64_after_hwframe+0x4b/0x53 [ 772.807772] RIP: 0033:0x7f327cbb2717 [ 772.808323] Code: 73 01 c3 48 8b 0d f9 76 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 0b 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c9 76 0d 00 f7 d8 64 89 01 48 [ 772.811337] RSP: 002b:00007ffde7f57d38 EFLAGS: 00000202 ORIG_RAX: 000000000000000b [ 772.812564] RAX: ffffffffffffffda RBX: 00007f327cc9c000 RCX: 00007f327cbb2717 [ 772.813733] RDX: 0000000000000000 RSI: 0000000000400000 RDI: 00007f327c289000 [ 772.814867] RBP: 0000000000421360 R08: 000000000000001a R09: 0000000000000000 [ 772.815991] R10: 0000000000000003 R11: 0000000000000202 R12: 00007ffde7f57d74 [ 772.817121] R13: 00007f327c689010 R14: 0000000000100000 R15: 00007f327c289000 [ 772.818272] </TASK> [ 772.818614] irq event stamp: 0 [ 772.819159] hardirqs last enabled at (0): [<0000000000000000>] 0x0 [ 772.820174] hardirqs last disabled at (0): [<ffffffff82a57ab3>] copy_process+0x19f3/0x6440 [ 772.821511] softirqs last enabled at (0): [<ffffffff82a57b00>] copy_process+0x1a40/0x6440 [ 772.822869] softirqs last disabled at (0): [<0000000000000000>] 0x0 [ 772.823871] ---[ end trace 0000000000000000 ]--- Fix this by using the same check for folio_test_anon() in zap_nonpresent_ptes(). Also add a hmm-test case for this. Link: https://lore.kernel.org/20260501065116.2057242-1-apopple@nvidia.com Fixes: 999dad824c39 ("mm/shmem: persist uffd-wp bit across zapping for file-backed") Signed-off-by: Alistair Popple <apopple@nvidia.com> Reported-by: Arsen Arsenović <aarsenovic@baylibre.com> Reviewed-by: Balbir Singh <balbirs@nvidia.com> Cc: David Hildenbrand <david@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Leon Romanovsky <leon@kernel.org> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-13mm: fix __vm_normal_page() to handle missing support for pmd_special()/pud_special()David Hildenbrand (Arm)1-3/+19
On x86 32-bit with THP enabled, zap_huge_pmd() is seen to generate a "WARNING: mm/memory.c:735 at __vm_normal_page+0x6a/0x7d", from the VM_WARN_ON_ONCE(is_zero_pfn(pfn) || is_huge_zero_pfn(pfn)); followed by "BUG: Bad rss-counter state"s, then later "BUG: Bad page state"s when reclaim gets to call shrink_huge_zero_folio_scan(). It's as if the _PAGE_SPECIAL bit never got set in the huge_zero pmd: and indeed, whereas pte_special() and pte_mkspecial() are subject to a dedicated CONFIG_ARCH_HAS_PTE_SPECIAL, pmd_special() and pmd_mkspecial() are subject to CONFIG_ARCH_SUPPORTS_PMD_PFNMAP, which is never enabled on any 32-bit architecture. While the problem was exposed through commit d80a9cb1a64a ("mm/huge_memory: add and use normal_or_softleaf_folio_pmd()"), it was an oversight in commit af38538801c6 ("mm/memory: factor out common code from vm_normal_page_*()") and would result in other problems: * huge zero folio accounted in smaps, pagemap (PAGE_IS_FILE) and numamaps as file-backed THP * folio_walk_start() returning the folio even without FW_ZEROPAGE set. Callers seem to tolerate that, though. ... and triggering the VM_WARN_ON_ONE(), although never reported so far. To fix it, teach vm_normal_page_pmd()/vm_normal_page_pud() to consider whether pmd_special/pud_special is actually implemented. Link: https://lore.kernel.org/20260430-pmd_special-v1-1-dbcbcfd72c20@kernel.org Fixes: af38538801c6 ("mm/memory: factor out common code from vm_normal_page_*()") Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Reported-by: Hugh Dickins <hughd@google.com> Closes: https://lore.kernel.org/r/74a75b59-2e13-3985-ee99-d5521f39df2a@google.com Reported-by: Bibo Mao <maobibo@loongson.cn> Closes: https://lore.kernel.org/r/20260430041121.2839350-1-maobibo@loongson.cn Debugged-by: Hugh Dickins <hughd@google.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Tested-by: Bibo Mao <maobibo@loongson.cn> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Cc: Liam R. Howlett <liam@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-13mm/memory_hotplug: fix memory block reference leak on removeMuchun Song1-0/+2
Patch series "mm: Fix memory block leaks and locking", v2. This series fixes two memory block device reference leaks and one locking issue around the per-memory_block hwpoison counter. This patch (of 2): remove_memory_blocks_and_altmaps() looks up each memory block with find_memory_block(), which acquires a reference to the memory block device. That reference is never dropped on this path, resulting in a leaked device reference when removing memory blocks and their altmaps. Drop the reference after retrieving mem->altmap and clearing mem->altmap, before removing the memory block device. Link: https://lore.kernel.org/20260428085219.1316047-1-songmuchun@bytedance.com Link: https://lore.kernel.org/20260428085219.1316047-2-songmuchun@bytedance.com Fixes: 6b8f0798b85a ("mm/memory_hotplug: split memmap_on_memory requests across memblocks") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Huang, Ying" <huang.ying.caritas@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-13mm/page_alloc: fix initialization of tags of the huge zero folio with init_on_freeDavid Hildenbrand (Arm)1-4/+4
__GFP_ZEROTAGS semantics are currently a bit weird, but effectively this flag is only ever set alongside __GFP_ZERO and __GFP_SKIP_KASAN. If we run with init_on_free, we will zero out pages during __free_pages_prepare(), to skip zeroing on the allocation path. However, when allocating with __GFP_ZEROTAG set, post_alloc_hook() will consequently not only skip clearing page content, but also skip clearing tag memory. Not clearing tags through __GFP_ZEROTAGS is irrelevant for most pages that will get mapped to user space through set_pte_at() later: set_pte_at() and friends will detect that the tags have not been initialized yet (PG_mte_tagged not set), and initialize them. However, for the huge zero folio, which will be mapped through a PMD marked as special, this initialization will not be performed, ending up exposing whatever tags were still set for the pages. The docs (Documentation/arch/arm64/memory-tagging-extension.rst) state that allocation tags are set to 0 when a page is first mapped to user space. That no longer holds with the huge zero folio when init_on_free is enabled. Fix it by decoupling __GFP_ZEROTAGS from __GFP_ZERO, passing to tag_clear_highpages() whether we want to also clear page content. Invert the meaning of the tag_clear_highpages() return value to have clearer semantics. Reproduced with the huge zero folio by modifying the check_buffer_fill arm64/mte selftest to use a 2 MiB area, after making sure that pages have a non-0 tag set when freeing (note that, during boot, we will not actually initialize tags, but only set KASAN_TAG_KERNEL in the page flags). $ ./check_buffer_fill 1..20 ... not ok 17 Check initial tags with private mapping, sync error mode and mmap memory not ok 18 Check initial tags with private mapping, sync error mode and mmap/mprotect memory ... This code needs more cleanups; we'll tackle that next, like decoupling __GFP_ZEROTAGS from __GFP_SKIP_KASAN. [akpm@linux-foundation.org: s/__GPF_ZERO/__GFP_ZERO/, per David] Link: https://lore.kernel.org/20260421-zerotags-v2-1-05cb1035482e@kernel.org Fixes: adfb6609c680 ("mm/huge_memory: initialise the tags of the huge zero folio") Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Tested-by: Lance Yang <lance.yang@linux.dev> Cc: Brendan Jackman <jackmanb@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Liam Howlett <liam@infradead.org> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Will Deacon <will@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-13mm/damon/sysfs-schemes: call missing mem_cgroup_iter_break()SeongJae Park1-0/+1
damon_sysfs_memcg_path_to_id() breaks mem_cgroup_iter() loop without calling mem_cgroup_iter_break(). This leaks the cgroup reference. Fix the issue by calling mem_cgroup_iter_break() before the break. The issue was discovered [1] by Sashiko. Link: https://lore.kernel.org/20260426173625.86521-1-sj@kernel.org Link: https://lore.kernel.org/20260423004148.74722-1-sj@kernel.org [1] Fixes: 29cbb9a13f05 ("mm/damon/sysfs-schemes: implement scheme filters") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> # 6.3.x Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-13mm/migrate_device: fix spinlock leak in migrate_vma_insert_huge_pmd_pageSunny Patel1-1/+1
When check_stable_address_space() fails after the PMD spinlock has been acquired via pmd_lock(), the code jumps directly to the abort label, bypassing the spin_unlock() call in unlock_abort. This causes the PMD spinlock to be permanently held, leading to a deadlock. Change the goto target from abort to unlock_abort to ensure the spinlock is always released on this error path. Link: https://lore.kernel.org/20260425133537.17463-1-nueralspacetech@gmail.com Fixes: a30b48bf1b24 ("mm/migrate_device: implement THP migration of zone device pages") Signed-off-by: Sunny Patel <nueralspacetech@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: Balbir Singh <balbirs@nvidia.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-13Merge tag 'fixes-2026-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/liveupdate/linuxLinus Torvalds1-6/+19
Pull liveupdate fixes from Mike Rapoport: "A few fixes for kexec handover and liveupdate: - make sure KHO is skipped for crash kernel - fix error reporting in memfd preservation if it fails mid-loop - don't allow preserving memfds whose page count exceeds UINT_MAX - fix documentation of memfd seals preservation to match the code" * tag 'fixes-2026-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/liveupdate/linux: mm/memfd_luo: document preservation of file seals mm/memfd_luo: reject memfds whose page count exceeds UINT_MAX mm/memfd_luo: report error when restoring a folio fails mid-loop kho: skip KHO for crash kernel
2026-05-05mm/slab: Add kvfree_atomic() helperUladzislau Rezki (Sony)1-0/+16
kvmalloc() now supports non-sleeping GFP flags, including the vmalloc fallback path. This means it may return vmalloc memory even for GFP_ATOMIC and GFP_NOWAIT allocations. Freeing such memory with kvfree() may then end up calling vfree(), which is not safe for non-sleeping contexts. Introduce kvfree_atomic() helper for such cases. It mirrors kvfree(), but uses vfree_atomic() for vmalloced memory. Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Acked-by: Harry Yoo (Oracle) <harry@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2026-05-04mm/memfd_luo: document preservation of file sealsDavid Carlier1-4/+5
Commit 8a552d68a86e ("mm: memfd_luo: preserve file seals") started preserving file seals across live update and restoring them via memfd_add_seals() on retrieve, but the DOC header was not updated and still listed seals under "Non-Preserved Properties" as being unsealed on restore. Move the Seals entry to the "Preserved Properties" section and describe the actual behavior, including the MEMFD_LUO_ALL_SEALS restriction that both preserve and retrieve enforce. Fixes: 8a552d68a86e ("mm: memfd_luo: preserve file seals") Signed-off-by: David Carlier <devnexen@gmail.com> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Link: https://patch.msgid.link/20260423125648.152113-2-devnexen@gmail.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
2026-05-04mm/memfd_luo: reject memfds whose page count exceeds UINT_MAXDavid Carlier1-2/+13
memfd_luo_preserve_folios() declares max_folios as unsigned int and computes it from the inode size, then passes it to memfd_pin_folios() which itself caps max_folios at unsigned int. For files whose base-page count exceeds UINT_MAX (larger than 16 TiB with 4 KiB pages), the assignment truncates silently: only a prefix of the file gets pinned and preserved, while memfd_luo_preserve() still records the full inode size in ser->size. On retrieve the inode is restored to the full size but only the preserved prefix repopulates the page cache, so the tail comes back as holes and user data is silently lost across the live update. Reject such files at preserve time with -EFBIG rather than chunk the pin loop, which would also require enlarging the preserved folios array well beyond what is practical. Fixes: b3749f174d68 ("mm: memfd_luo: allow preserving memfd") Signed-off-by: David Carlier <devnexen@gmail.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Link: https://patch.msgid.link/20260423125648.152113-1-devnexen@gmail.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
2026-05-03Merge tag 'slab-for-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slabLinus Torvalds2-0/+9
Pull slab fixes from Vlastimil Babka: - Stable fixes for CONFIG_SMP=n where _nolock() allocations in NMI both at kmalloc and page allocator levels are not properly protected by the spin_trylock() semantics on !SMP (Harry Yoo) * tag 'slab-for-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: mm/slab: return NULL early from kmalloc_nolock() in NMI on UP mm/page_alloc: return NULL early from alloc_frozen_pages_nolock() in NMI on UP
2026-04-30mm: memcontrol: fix rcu unbalance in get_non_dying_memcg_end()Qi Zheng1-10/+19
Currently, get_non_dying_memcg_start() and get_non_dying_memcg_end() both evaluate cgroup_subsys_on_dfl(memory_cgrp_subsys) independently to determine whether to acquire or release the RCU read lock. However, the result of cgroup_subsys_on_dfl() can change dynamically at runtime due to cgroup hierarchy rebinding (e.g., when the memory controller is moved between cgroup v1 and v2 hierarchies). This can cause the following warning: ===================================== WARNING: bad unlock balance detected! 7.0.0-next-20260420+ #83 Tainted: G W ------------------------------------- memcg-repro/270 is trying to release lock (rcu_read_lock) at: [<ffffffff815f57f7>] rcu_read_unlock+0x17/0x60 but there are no more locks to release! other info that might help us debug this: 1 lock held by memcg-repro/270: #0: ffff888102fa2088 (vm_lock){++++}-{0:0}, at: do_user_addr_fault+0x285/0x880 stack backtrace: CPU: 0 UID: 0 PID: 270 Comm: memcg-repro Tainted: G W 7.0.0-next-20260420+ # Tainted: [W]=WARN Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 Call Trace: <TASK> ? rcu_read_unlock+0x17/0x60 dump_stack_lvl+0x77/0xb0 print_unlock_imbalance_bug+0xe0/0xf0 ? rcu_read_unlock+0x17/0x60 lock_release+0x21d/0x2a0 rcu_read_unlock+0x1c/0x60 do_pte_missing+0x233/0xb40 __handle_mm_fault+0x80e/0xcd0 handle_mm_fault+0x146/0x310 do_user_addr_fault+0x303/0x880 exc_page_fault+0x9b/0x270 asm_exc_page_fault+0x26/0x30 RIP: 0033:0x5590e4eb41ea Code: 61 cc 66 0f 6f e0 66 0f 61 c2 66 0f db cd 66 0f 69 e2 66 0f 6f d0 66 0f 69 d4 66 0f 61 0 RSP: 002b:00007ffcad25f030 EFLAGS: 00010202 RAX: 00005590e4eb8010 RBX: 00007ffcad260f7d RCX: 00007f73c474d44d RDX: 00005590e4eb80a0 RSI: 00005590e4eb503c RDI: 000000000000000f RBP: 00005590e4eb70a0 R08: 0000000000000000 R09: 00007f73c483a680 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 00007ffcad25f180 R14: 00005590e4eb6dd8 R15: 00007f73c4869020 </TASK> ------------[ cut here ]------------ Fix this by explicitly tracking the RCU lock state, ensuring that rcu_read_unlock() in get_non_dying_memcg_end() is strictly paired with the lock acquisition, regardless of any runtime rebinding events. Link: https://lore.kernel.org/20260429073105.44472-1-qi.zheng@linux.dev Fixes: 8285917d6f38 ("mm: memcontrol: prepare for reparenting non-hierarchical stats") Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-28mm/memfd_luo: report error when restoring a folio fails mid-loopDavid Carlier1-0/+1
memfd_luo_retrieve_folios() initialises err to -EIO, but the per-iteration calls to mem_cgroup_charge(), shmem_add_to_page_cache() and shmem_inode_acct_blocks() reuse and overwrite err. Once any iteration completes successfully, err becomes zero. If a later iteration's kho_restore_folio() returns NULL, the failure path jumps to put_folios without resetting err, so the function returns 0. The caller memfd_luo_retrieve() then takes the success path, sets args->file and reports the restore as successful, leaving userspace with a partially populated memfd and no indication that anything went wrong. Set err to -EIO in the kho_restore_folio() failure branch so the error is propagated to the caller. Signed-off-by: David Carlier <devnexen@gmail.com> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Fixes: b3749f174d68 ("mm: memfd_luo: allow preserving memfd") Link: https://patch.msgid.link/20260415052300.362539-1-devnexen@gmail.com Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
2026-04-27mm/userfaultfd: detect VMA type change after copy retry in mfill_copy_folio_retry()David Carlier1-1/+11
mfill_copy_folio_retry() drops mmap_lock for the copy_from_user() call. During this window, the VMA can be replaced with a different type (e.g. hugetlb), making the caller's ops pointer stale. Subsequent use of the stale ops would dispatch into the wrong per-vma handlers. Capture the VMA's ops via vma_uffd_ops() before dropping the lock and compare against the current vma_uffd_ops() after re-acquiring it. Return -EAGAIN if they differ so the operation can be retried. This avoids comparing against the caller's ops which may have been overridden to anon_uffd_ops for MAP_PRIVATE file-backed mappings. Link: https://lore.kernel.org/20260424183638.196227-1-devnexen@gmail.com Fixes: 6ab703034f14 ("userfaultfd: mfill_atomic(): remove retry logic") Reported-by: Usama Arif <usama.arif@linux.dev> Closes: https://lore.kernel.org/all/20260410114809.3592720-1-usama.arif@linux.dev/ Signed-off-by: David Carlier <devnexen@gmail.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-27mm/damon/stat: detect and use fresh enabled valueSeongJae Park1-10/+20
DAMON_STAT updates 'enabled' parameter value, which represents the running status of its kdamond, when the user explicitly requests start/stop of the kdamond. The kdamond can, however, be stopped even if the user explicitly requested the stop, if ctx->regions_score_histogram allocation failure at beginning of the execution of the kdamond. Hence, if the kdamond is stopped by the allocation failure, the value of the parameter can be stale. Users could show the stale value and be confused. The problem will only rarely happen in real and common setups because the allocation is arguably too small to fail. Also, unlike the similar bugs that are now fixed in DAMON_RECLAIM and DAMON_LRU_SORT, kdamond can be restarted in this case, because DAMON_STAT force-updates the enabled parameter value for user inputs. The bug is a bug, though. The issue stems from the fact that there are multiple events that can change the status, and following all the events is challenging. Dynamically detect and use the fresh status for the parameters when those are requested. The issue was dicovered [1] by Sashiko. Link: https://lore.kernel.org/20260419161003.79176-4-sj@kernel.org Link: https://lore.kernel.org/20260416040602.88665-1-sj@kernel.org [1] Fixes: 369c415e6073 ("mm/damon: introduce DAMON_STAT module") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Liew Rui Yan <aethernet65535@gmail.com> Cc: <stable@vger.kernel.org> # 6.17.x Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-27mm/damon/lru_sort: detect and use fresh enabled and kdamond_pid valuesSeongJae Park1-30/+55
DAMON_LRU_SORT updates 'enabled' and 'kdamond_pid' parameter values, which represents the running status of its kdamond, when the user explicitly requests start/stop of the kdamond. The kdamond can, however, be stopped in events other than the explicit user request in the following three events. 1. ctx->regions_score_histogram allocation failure at beginning of the execution, 2. damon_commit_ctx() failure due to invalid user input, and 3. damon_commit_ctx() failure due to its internal allocation failures. Hence, if the kdamond is stopped by the above three events, the values of the status parameters can be stale. Users could show the stale values and be confused. This is already bad, but the real consequence is worse. DAMON_LRU_SORT avoids unnecessary damon_start() and damon_stop() calls based on the 'enabled' parameter value. And the update of 'enabled' parameter value depends on the damon_start() and damon_stop() call results. Hence, once the kdamond has stopped by the unintentional events, the user cannot restart the kdamond before the system reboot. For example, the issue can be reproduced via below steps. # cd /sys/module/damon_lru_sort/parameters # # # start DAMON_LRU_SORT # echo Y > enabled # ps -ef | grep kdamond root 806 2 0 17:53 ? 00:00:00 [kdamond.0] root 808 803 0 17:53 pts/4 00:00:00 grep kdamond # # # commit wrong input to stop kdamond withou explicit stop request # echo 3 > addr_unit # echo Y > commit_inputs bash: echo: write error: Invalid argument # # # confirm kdamond is stopped # ps -ef | grep kdamond root 811 803 0 17:53 pts/4 00:00:00 grep kdamond # # # users casn now show stable status # cat enabled Y # cat kdamond_pid 806 # # # even after fixing the wrong parameter, # # kdamond cannot be restarted. # echo 1 > addr_unit # echo Y > enabled # ps -ef | grep kdamond root 815 803 0 17:54 pts/4 00:00:00 grep kdamond The problem will only rarely happen in real and common setups for the following reasons. The allocation failures are unlikely in such setups since those allocations are arguably too small to fail. Also sane users on real production environments may not commit wrong input parameters. But once it happens, the consequence is quite bad. And the bug is a bug. The issue stems from the fact that there are multiple events that can change the status, and following all the events is challenging. Dynamically detect and use the fresh status for the parameters when those are requested. Link: https://lore.kernel.org/20260419161003.79176-3-sj@kernel.org Fixes: 40e983cca927 ("mm/damon: introduce DAMON-based LRU-lists Sorting") Co-developed-by: Liew Rui Yan <aethernet65535@gmail.com> Signed-off-by: Liew Rui Yan <aethernet65535@gmail.com> Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> # 6.0.x Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-27mm/damon/reclaim: detect and use fresh enabled and kdamond_pid valuesSeongJae Park1-30/+55
Patch series "mm/damon/modules: detect and use fresh status", v3. DAMON modules including DAMON_RECLAIM, DAMON_LRU_SORT and DAMON_STAT commonly expose the kdamond running status via their parameters. Under certain scenarios including wrong user inputs and memory allocation failures, those parameter values can be stale. It can confuse users. For DAMON_RECLAIM and DAMON_LRU_SORT, it even makes the kdamond unable to be restarted before the system reboot. The problem comes from the fact that there are multiple events for the status changes and it is difficult to follow up all the scenarios. Fix the issue by detecting and using the status on demand, instead of using a cached status that is difficult to be updated. Patches 1-3 fix the bugs in DAMON_RECLAIM, DAMON_LRU_SORT and DAMON_STAT in the order. This patch (of 3): DAMON_RECLAIM updates 'enabled' and 'kdamond_pid' parameter values, which represents the running status of its kdamond, when the user explicitly requests start/stop of the kdamond. The kdamond can, however, be stopped in events other than the explicit user request in the following three events. 1. ctx->regions_score_histogram allocation failure at beginning of the execution, 2. damon_commit_ctx() failure due to invalid user input, and 3. damon_commit_ctx() failure due to its internal allocation failures. Hence, if the kdamond is stopped by the above three events, the values of the status parameters can be stale. Users could show the stale values and be confused. This is already bad, but the real consequence is worse. DAMON_RECLAIM avoids unnecessary damon_start() and damon_stop() calls based on the 'enabled' parameter value. And the update of 'enabled' parameter value depends on the damon_start() and damon_stop() call results. Hence, once the kdamond has stopped by the unintentional events, the user cannot restart the kdamond before the system reboot. For example, the issue can be reproduced via below steps. # cd /sys/module/damon_reclaim/parameters # # # start DAMON_RECLAIM # echo Y > enabled # ps -ef | grep kdamond root 806 2 0 17:53 ? 00:00:00 [kdamond.0] root 808 803 0 17:53 pts/4 00:00:00 grep kdamond # # # commit wrong input to stop kdamond withou explicit stop request # echo 3 > addr_unit # echo Y > commit_inputs bash: echo: write error: Invalid argument # # # confirm kdamond is stopped # ps -ef | grep kdamond root 811 803 0 17:53 pts/4 00:00:00 grep kdamond # # # users casn now show stable status # cat enabled Y # cat kdamond_pid 806 # # # even after fixing the wrong parameter, # # kdamond cannot be restarted. # echo 1 > addr_unit # echo Y > enabled # ps -ef | grep kdamond root 815 803 0 17:54 pts/4 00:00:00 grep kdamond The problem will only rarely happen in real and common setups for the following reasons. The allocation failures are unlikely in such setups since those allocations are arguably too small to fail. Also sane users on real production environments may not commit wrong input parameters. But once it happens, the consequence is quite bad. And the bug is a bug. The issue stems from the fact that there are multiple events that can change the status, and following all the events is challenging. Dynamically detect and use the fresh status for the parameters when those are requested. Link: https://lore.kernel.org/20260419161003.79176-1-sj@kernel.org Link: https://lore.kernel.org/20260419161003.79176-2-sj@kernel.org Fixes: e035c280f6df ("mm/damon/reclaim: support online inputs update") Co-developed-by: Liew Rui Yan <aethernet65535@gmail.com> Signed-off-by: Liew Rui Yan <aethernet65535@gmail.com> Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> # 5.19.x Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-27mm/damon/sysfs-schemes: protect path kfree() with damon_sysfs_lockSeongJae Park1-1/+11
damon_sysfs_quot_goal->path can be read and written by users, via DAMON sysfs 'path' file. It can also be indirectly read, for the parameters {on,off}line committing to DAMON. The reads for parameters committing are protected by damon_sysfs_lock to avoid the sysfs files being destroyed while any of the parameters are being read. But the user-driven direct reads and writes are not protected by any lock, while the write is deallocating the path-pointing buffer. As a result, the readers could read the already freed buffer (user-after-free). Note that the user-reads don't race when the same open file is used by the writer, due to kernfs's open file locking. Nonetheless, doing the reads and writes with separate open files would be common. Fix it by protecting both the user-direct reads and writes with damon_sysfs_lock. Link: https://lore.kernel.org/20260423150253.111520-3-sj@kernel.org Fixes: c41e253a411e ("mm/damon/sysfs-schemes: implement path file under quota goal directory") Co-developed-by: Junxi Qian <qjx1298677004@gmail.com> Signed-off-by: Junxi Qian <qjx1298677004@gmail.com> Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> # 6.19.x Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-27mm/damon/sysfs-schemes: protect memcg_path kfree() with damon_sysfs_lockSeongJae Park1-1/+11
Patch series "mm/damon/sysfs-schemes: fix use-after-free for [memcg_]path". Reads of 'memcg_path' and 'path' files in DAMON sysfs interface could race with their writes, results in use-after-free. Fix those. This patch (of 2): damon_sysfs_scheme_filter->mmecg_path can be read and written by users, via DAMON sysfs memcg_path file. It can also be indirectly read, for the parameters {on,off}line committing to DAMON. The reads for parameters committing are protected by damon_sysfs_lock to avoid the sysfs files being destroyed while any of the parameters are being read. But the user-driven direct reads and writes are not protected by any lock, while the write is deallocating the memcg_path-pointing buffer. As a result, the readers could read the already freed buffer (user-after-free). Note that the user-reads don't race when the same open file is used by the writer, due to kernfs's open file locking. Nonetheless, doing the reads and writes with separate open files would be common. Fix it by protecting both the user-direct reads and writes with damon_sysfs_lock. Link: https://lore.kernel.org/20260423150253.111520-1-sj@kernel.org Link: https://lore.kernel.org/20260423150253.111520-2-sj@kernel.org Fixes: 4f489fe6afb3 ("mm/damon/sysfs-schemes: free old damon_sysfs_scheme_filter->memcg_path on write") Co-developed-by: Junxi Qian <qjx1298677004@gmail.com> Signed-off-by: Junxi Qian <qjx1298677004@gmail.com> Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> # 6.16.x Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-27mm/hugetlb_cma: round up per_node before logging itSang-Heon Jeon1-0/+1
When the user requests a total hugetlb CMA size without per-node specification, hugetlb_cma_reserve() computes per_node from hugetlb_cma_size and the number of nodes that have memory per_node = DIV_ROUND_UP(hugetlb_cma_size, nodes_weight(hugetlb_bootmem_nodes)); The reservation loop later computes size = round_up(min(per_node, hugetlb_cma_size - reserved), PAGE_SIZE << order); So the actually reserved per_node size is multiple of (PAGE_SIZE << order), but the logged per_node is not rounded up, so it may be smaller than the actual reserved size. For example, as the existing comment describes, if a 3 GB area is requested on a machine with 4 NUMA nodes that have memory, 1 GB is allocated on the first three nodes, but the printed log is hugetlb_cma: reserve 3072 MiB, up to 768 MiB per node Round per_node up to (PAGE_SIZE << order) before logging so that the printed log always matches the actual reserved size. No functional change to the actual reservation size, as the following case analysis shows 1. remaining (hugetlb_cma_size - reserved) >= rounded per_node - AS-IS: min() picks unrounded per_node; round_up() returns rounded per_node - TO-BE: min() picks rounded per_node; round_up() returns rounded per_node (no-op) 2. remaining < unrounded per_node - AS-IS: min() picks remaining; round_up() returns round_up(remaining) - TO-BE: min() picks remaining; round_up() returns round_up(remaining) 3. unrounded per_node <= remaining < rounded per_node - AS-IS: min() picks unrounded per_node; round_up() returns rounded per_node - TO-BE: min() picks remaining; round_up() returns round_up(remaining) equals rounded per_node Link: https://lore.kernel.org/20260422143353.852257-1-ekffu200098@gmail.com Fixes: cf11e85fc08c ("mm: hugetlb: optionally allocate gigantic hugepages using cma") # 5.7 Signed-off-by: Sang-Heon Jeon <ekffu200098@gmail.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-27mm/vma: do not try to unmap a VMA if mmap_prepare() invoked from mmap()Lorenzo Stoakes2-10/+19
The mmap_prepare hook functionality includes the ability to invoke mmap_prepare() from the mmap() hook of existing 'stacked' drivers, that is ones which are capable of calling the mmap hooks of other drivers/file systems (e.g. overlayfs, shm). As part of the mmap_prepare action functionality, we deal with errors by unmapping the VMA should one arise. This works in the usual mmap_prepare case, as we invoke this action at the last moment, when the VMA is established in the maple tree. However, the mmap() hook passes a not-fully-established VMA pointer to the caller (which is the motivation behind the mmap_prepare() work), which is detached. So attempting to unmap a VMA in this state will be problematic, with the most obvious symptom being a warning in vma_mark_detached(), because the VMA is already detached. It's also unncessary - the mmap() handler will clean up the VMA on error. So to fix this issue, this patch propagates whether or not an mmap action is being completed via the compatibility layer or directly. If the former, then we do not attempt VMA cleanup, if the latter, then we do. This patch also updates the userland VMA tests to reflect the change. Link: https://lore.kernel.org/20260421102150.189982-1-ljs@kernel.org Fixes: ac0a3fc9c07d ("mm: add ability to take further action in vm_area_desc") Signed-off-by: Lorenzo Stoakes <ljs@kernel.org> Reported-by: syzbot+db390288d141a1dccf96@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/69e69734.050a0220.24bfd3.0027.GAE@google.com/ Cc: David Hildenbrand <david@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Suren Baghdasaryan <surenb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-27mm: start background writeback based on per-wb threshold for strictlimit BDIsJoanne Koong1-10/+6
The proactive nr_dirty > gdtc->bg_thresh check in balance_dirty_pages() only checks the global dirty threshold to start background writeback while the writer is still free-running, but for strictlimit BDIs (eg fuse), the per-wb dirty count can exceed the per-wb background threshold while the global threshold is not yet exceeded, so background writeback for this case never gets proactively started. Add a per-wb threshold check for strictlimit BDIs so that background writeback is started when wb_dirty exceeds wb_bg_thresh, which drains dirty pages before the writer hits the throttle wall, matching the proactive behavior that the global check provides for non-strictlimit BDIs. fio runs on fuse show about a 3-4% improvement in perf for buffered writes: fio --name=writeback_test --ioengine=psync --rw=write --bs=128k \ --size=2G --numjobs=4 --ramp_time=10 --runtime=20 \ --time_based --group_reporting=1 --direct=0 Link: https://lore.kernel.org/20260326234629.840938-2-joannelkoong@gmail.com Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Christoph Hellwig <hch@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-27vmalloc: fix buffer overflow in vrealloc_node_align()Marco Elver1-1/+1
Commit 4c5d3365882d ("mm/vmalloc: allow to set node and align in vrealloc") added the ability to force a new allocation if the current pointer is on the wrong NUMA node, or if an alignment constraint is not met, even if the user is shrinking the allocation. On this path (need_realloc), the code allocates a new object of 'size' bytes and then memcpy()s 'old_size' bytes into it. If the request is to shrink the object (size < old_size), this results in an out-of-bounds write on the new buffer. Fix this by bounding the copy length by the new allocation size. Link: https://lore.kernel.org/20260420114805.3572606-2-elver@google.com Fixes: 4c5d3365882d ("mm/vmalloc: allow to set node and align in vrealloc") Signed-off-by: Marco Elver <elver@google.com> Reported-by: Harry Yoo (Oracle) <harry@kernel.org> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-27mm/slab: return NULL early from kmalloc_nolock() in NMI on UPHarry Yoo (Oracle)1-0/+4
On UP kernels (!CONFIG_SMP), spin_trylock() is a no-op that unconditionally succeeds even when the lock is already held. As a result, kmalloc_nolock() called from NMI context can re-enter the slab allocator and acquire n->list_lock that the interrupted context is already holding, corrupting slab state. With CONFIG_DEBUG_SPINLOCK on UP, the following BUG is triggered with the slub_kunit test module: BUG: spinlock trylock failure on UP on CPU#0, kunit_try_catch/243 [...] Call Trace: <NMI> dump_stack_lvl+0x3f/0x60 do_raw_spin_trylock+0x41/0x50 _raw_spin_trylock+0x24/0x50 get_from_partial_node+0x120/0x4d0 ___slab_alloc+0x8a/0x4c0 kmalloc_nolock_noprof+0x164/0x310 [...] </NMI> Fix this by returning NULL early when invoked from NMI on a UP kernel. Link: https://lore.kernel.org/linux-mm/ad_cqe51pvr1WaDg@hyeyoo Cc: stable@vger.kernel.org Fixes: af92793e52c3 ("slab: Introduce kmalloc_nolock() and kfree_nolock().") Signed-off-by: Harry Yoo (Oracle) <harry@kernel.org> Link: https://patch.msgid.link/20260427-nolock-api-fix-v2-2-a6b83a92d9a4@kernel.org Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
2026-04-27mm/page_alloc: return NULL early from alloc_frozen_pages_nolock() in NMI on UPHarry Yoo (Oracle)1-0/+5
On UP kernels (!CONFIG_SMP), spin_trylock() is a no-op that unconditionally succeeds even when the lock is already held. As a result, alloc_frozen_pages_nolock() called from NMI context can re-enter rmqueue() and acquire the zone lock that the interrupted context is already holding, corrupting the freelists. With CONFIG_DEBUG_SPINLOCK on UP, the following BUG is triggered with the slub_kunit test module: BUG: spinlock trylock failure on UP on CPU#0, kunit_try_catch/243 [...] Call Trace: <NMI> dump_stack_lvl+0x3f/0x60 do_raw_spin_trylock+0x41/0x50 _raw_spin_trylock+0x24/0x50 rmqueue.isra.0+0x2a9/0xa70 get_page_from_freelist+0xeb/0x450 alloc_frozen_pages_nolock_noprof+0x111/0x1e0 allocate_slab+0x42a/0x500 ___slab_alloc+0xa7/0x4c0 kmalloc_nolock_noprof+0x164/0x310 [...] </NMI> Fix this by returning NULL early when invoked from NMI on a UP kernel. Link: https://lore.kernel.org/linux-mm/ad_cqe51pvr1WaDg@hyeyoo Cc: stable@vger.kernel.org Fixes: d7242af86434 ("mm: Introduce alloc_frozen_pages_nolock()") Signed-off-by: Harry Yoo (Oracle) <harry@kernel.org> Link: https://patch.msgid.link/20260427-nolock-api-fix-v2-1-a6b83a92d9a4@kernel.org Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
2026-04-24Merge tag 'slab-for-7.1-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slabLinus Torvalds1-12/+12
Pull slab fix from Vlastimil Babka: - A stable fix for k(v)ealloc() where reallocating on a different node or shrinking the object can result in either losing the original data or a buffer overflow (Marco Elver) * tag 'slab-for-7.1-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: slub: fix data loss and overflow in krealloc()
2026-04-22Merge tag 's390-7.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linuxLinus Torvalds1-9/+6
Pull s390 updates from Vasily Gorbik: - Add support for CONFIG_PAGE_TABLE_CHECK and enable it in debug_defconfig. s390 can only tell user from kernel PTEs via the mm, so mm_struct is now passed into pxx_user_accessible_page() callbacks - Expose the PCI function UID as an arch-specific slot attribute in sysfs so a function can be identified by its user-defined id while still in standby. Introduces a generic ARCH_PCI_SLOT_GROUPS hook in drivers/pci/slot.c - Refresh s390 PCI documentation to reflect current behavior and cover previously undocumented sysfs attributes - zcrypt device driver cleanup series: consistent field types, clearer variable naming, a kernel-doc warning fix, and a comment explaining the intentional synchronize_rcu() in pkey_handler_register() - Provide an s390 arch_raw_cpu_ptr() that avoids the detour via get_lowcore() using alternatives, shrinking defconfig by ~27 kB - Guard identity-base randomization with kaslr_enabled() so nokaslr keeps the identity mapping at 0 even with RANDOMIZE_IDENTITY_BASE=y - Build S390_MODULES_SANITY_TEST as a module only by requiring KUNIT && m, since built-in would not exercise module loading - Remove the permanently commented-out HMCDRV_DEV_CLASS create_class() code in the hmcdrv driver - Drop stale ident_map_size extern conflicting with asm/page.h * tag 's390-7.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: s390/zcrypt: Fix warning about wrong kernel doc comment PCI: s390: Expose the UID as an arch specific PCI slot attribute docs: s390/pci: Improve and update PCI documentation s390/pkey: Add comment about synchronize_rcu() to pkey base s390/hmcdrv: Remove commented out code s390/zcrypt: Slight rework on the agent_id field s390/zcrypt: Explicitly use a card variable in _zcrypt_send_cprb s390/zcrypt: Rework MKVP fields and handling s390/zcrypt: Make apfs a real unsigned int field s390/zcrypt: Rework domain processing within zcrypt device driver s390/zcrypt: Move inline function rng_type6cprb_msgx from header to code s390/percpu: Provide arch_raw_cpu_ptr() s390: Enable page table check for debug_defconfig s390/pgtable: Add s390 support for page table check s390/pgtable: Use set_pmd_bit() to invalidate PMD entry mm/page_table_check: Pass mm_struct to pxx_user_accessible_page() s390/boot: Respect kaslr_enabled() for identity randomization s390/Kconfig: Make modules sanity test a module-only option s390/setup: Drop stale ident_map_size declaration
2026-04-20Merge tag 'uml-for-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linuxLinus Torvalds1-0/+1
Pull uml updates from Johannes Berg: "Mostly cleanups and small things, notably: - musl libc compatibility - vDSO installation fix - TLB sync race fix for recent SMP support - build fix for 32-bit with Clang 20/21" * tag 'uml-for-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux: um: Disable GCOV_PROFILE_ALL on 32-bit UML with Clang 20/21 um: drivers: call kernel_strrchr() explicitly in cow_user.c um: Replace strncpy() with strnlen()+memcpy_and_pad() in strncpy_chunk_from_user() x86/um: fix vDSO installation um: Remove CONFIG_FRAME_WARN from x86_64_defconfig um: Fix pte_read() and pte_exec() for kernel mappings um: Fix potential race condition in TLB sync um: time-travel: clean up kernel-doc warnings um: avoid struct sigcontext redefinition with musl um: fix address-of CMSG_DATA() rvalue in stub
2026-04-19Merge tag 'mm-hotfixes-stable-2026-04-19-00-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mmLinus Torvalds8-12/+27
Pull MM fixes from Andrew Morton: "7 hotfixes. 6 are cc:stable and all are for MM. Please see the individual changelogs for details" * tag 'mm-hotfixes-stable-2026-04-19-00-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm/damon/core: disallow non-power of two min_region_sz on damon_start() mm/vmalloc: take vmap_purge_lock in shrinker mm: call ->free_folio() directly in folio_unmap_invalidate() mm: blk-cgroup: fix use-after-free in cgwb_release_workfn() mm/zone_device: do not touch device folio after calling ->folio_free() mm/damon/core: disallow time-quota setting zero esz mm/mempolicy: fix weighted interleave auto sysfs name
2026-04-19Merge tag 'mm-stable-2026-04-18-02-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mmLinus Torvalds30-996/+1622
Pull more MM updates from Andrew Morton: - "Eliminate Dying Memory Cgroup" (Qi Zheng and Muchun Song) Address the longstanding "dying memcg problem". A situation wherein a no-longer-used memory control group will hang around for an extended period pointlessly consuming memory - "fix unexpected type conversions and potential overflows" (Qi Zheng) Fix a couple of potential 32-bit/64-bit issues which were identified during review of the "Eliminate Dying Memory Cgroup" series - "kho: history: track previous kernel version and kexec boot count" (Breno Leitao) Use Kexec Handover (KHO) to pass the previous kernel's version string and the number of kexec reboots since the last cold boot to the next kernel, and print it at boot time - "liveupdate: prevent double preservation" (Pasha Tatashin) Teach LUO to avoid managing the same file across different active sessions - "liveupdate: Fix module unloading and unregister API" (Pasha Tatashin) Address an issue with how LUO handles module reference counting and unregistration during module unloading - "zswap pool per-CPU acomp_ctx simplifications" (Kanchana Sridhar) Simplify and clean up the zswap crypto compression handling and improve the lifecycle management of zswap pool's per-CPU acomp_ctx resources - "mm/damon/core: fix damon_call()/damos_walk() vs kdmond exit race" (SeongJae Park) Address unlikely but possible leaks and deadlocks in damon_call() and damon_walk() - "mm/damon/core: validate damos_quota_goal->nid" (SeongJae Park) Fix a couple of root-only wild pointer dereferences - "Docs/admin-guide/mm/damon: warn commit_inputs vs other params race" (SeongJae Park) Update the DAMON documentation to warn operators about potential races which can occur if the commit_inputs parameter is altered at the wrong time - "Minor hmm_test fixes and cleanups" (Alistair Popple) Bugfixes and a cleanup for the HMM kernel selftests - "Modify memfd_luo code" (Chenghao Duan) Cleanups, simplifications and speedups to the memfd_lou code - "mm, kvm: allow uffd support in guest_memfd" (Mike Rapoport) Support for userfaultfd in guest_memfd - "selftests/mm: skip several tests when thp is not available" (Chunyu Hu) Fix several issues in the selftests code which were causing breakage when the tests were run on CONFIG_THP=n kernels - "mm/mprotect: micro-optimization work" (Pedro Falcato) A couple of nice speedups for mprotect() - "MAINTAINERS: update KHO and LIVE UPDATE entries" (Pratyush Yadav) Document upcoming changes in the maintenance of KHO, LUO, memfd_luo, kexec, crash, kdump and probably other kexec-based things - they are being moved out of mm.git and into a new git tree * tag 'mm-stable-2026-04-18-02-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (121 commits) MAINTAINERS: add page cache reviewer mm/vmscan: avoid false-positive -Wuninitialized warning MAINTAINERS: update Dave's kdump reviewer email address MAINTAINERS: drop include/linux/liveupdate from LIVE UPDATE MAINTAINERS: drop include/linux/kho/abi/ from KHO MAINTAINERS: update KHO and LIVE UPDATE maintainers MAINTAINERS: update kexec/kdump maintainers entries mm/migrate_device: remove dead migration entry check in migrate_vma_collect_huge_pmd() selftests: mm: skip charge_reserved_hugetlb without killall userfaultfd: allow registration of ranges below mmap_min_addr mm/vmstat: fix vmstat_shepherd double-scheduling vmstat_update mm/hugetlb: fix early boot crash on parameters without '=' separator zram: reject unrecognized type= values in recompress_store() docs: proc: document ProtectionKey in smaps mm/mprotect: special-case small folios when applying permissions mm/mprotect: move softleaf code out of the main function mm: remove '!root_reclaim' checking in should_abort_scan() mm/sparse: fix comment for section map alignment mm/page_io: use sio->len for PSWPIN accounting in sio_read_complete() selftests/mm: transhuge_stress: skip the test when thp not available ...
2026-04-18mm/damon/core: disallow non-power of two min_region_sz on damon_start()SeongJae Park1-0/+5
Commit d8f867fa0825 ("mm/damon: add damon_ctx->min_sz_region") introduced a bug that allows unaligned DAMON region address ranges. Commit c80f46ac228b ("mm/damon/core: disallow non-power of two min_region_sz") fixed it, but only for damon_commit_ctx() use case. Still, DAMON sysfs interface can emit non-power of two min_region_sz via damon_start(). Fix the path by adding the is_power_of_2() check on damon_start(). The issue was discovered by sashiko [1]. Link: https://lore.kernel.org/20260411213638.77768-1-sj@kernel.org Link: https://lore.kernel.org/20260403155530.64647-1-sj@kernel.org [1] Fixes: d8f867fa0825 ("mm/damon: add damon_ctx->min_sz_region") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> # 6.18.x Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18mm/vmalloc: take vmap_purge_lock in shrinkerUladzislau Rezki (Sony)1-0/+1
decay_va_pool_node() can be invoked concurrently from two paths: __purge_vmap_area_lazy() when pools are being purged, and the shrinker via vmap_node_shrink_scan(). However, decay_va_pool_node() is not safe to run concurrently, and the shrinker path currently lacks serialization, leading to races and possible leaks. Protect decay_va_pool_node() by taking vmap_purge_lock in the shrinker path to ensure serialization with purge users. Link: https://lore.kernel.org/20260413192646.14683-1-urezki@gmail.com Fixes: 7679ba6b36db ("mm: vmalloc: add a shrinker to drain vmap pools") Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Baoquan He <baoquan.he@linux.dev> Cc: chenyichong <chenyichong@uniontech.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18mm: call ->free_folio() directly in folio_unmap_invalidate()Matthew Wilcox (Oracle)3-3/+7
We can only call filemap_free_folio() if we have a reference to (or hold a lock on) the mapping. Otherwise, we've already removed the folio from the mapping so it no longer pins the mapping and the mapping can be removed, causing a use-after-free when accessing mapping->a_ops. Follow the same pattern as __remove_mapping() and load the free_folio function pointer before dropping the lock on the mapping. That lets us make filemap_free_folio() static as this was the only caller outside filemap.c. Link: https://lore.kernel.org/20260413184314.3419945-1-willy@infradead.org Fixes: fb7d3bc41493 ("mm/filemap: drop streaming/uncached pages when writeback completes") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reported-by: Google Big Sleep <big-sleep-vuln-reports+bigsleep-501448199@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18mm: blk-cgroup: fix use-after-free in cgwb_release_workfn()Breno Leitao1-2/+3
cgwb_release_workfn() calls css_put(wb->blkcg_css) and then later accesses wb->blkcg_css again via blkcg_unpin_online(). If css_put() drops the last reference, the blkcg can be freed asynchronously (css_free_rwork_fn -> blkcg_css_free -> kfree) before blkcg_unpin_online() dereferences the pointer to access blkcg->online_pin, resulting in a use-after-free: BUG: KASAN: slab-use-after-free in blkcg_unpin_online (./include/linux/instrumented.h:112 ./include/linux/atomic/atomic-instrumented.h:400 ./include/linux/refcount.h:389 ./include/linux/refcount.h:432 ./include/linux/refcount.h:450 block/blk-cgroup.c:1367) Write of size 4 at addr ff11000117aa6160 by task kworker/71:1/531 Workqueue: cgwb_release cgwb_release_workfn Call Trace: <TASK> blkcg_unpin_online (./include/linux/instrumented.h:112 ./include/linux/atomic/atomic-instrumented.h:400 ./include/linux/refcount.h:389 ./include/linux/refcount.h:432 ./include/linux/refcount.h:450 block/blk-cgroup.c:1367) cgwb_release_workfn (mm/backing-dev.c:629) process_scheduled_works (kernel/workqueue.c:3278 kernel/workqueue.c:3385) Freed by task 1016: kfree (./include/linux/kasan.h:235 mm/slub.c:2689 mm/slub.c:6246 mm/slub.c:6561) css_free_rwork_fn (kernel/cgroup/cgroup.c:5542) process_scheduled_works (kernel/workqueue.c:3302 kernel/workqueue.c:3385) ** Stack based on commit 66672af7a095 ("Add linux-next specific files for 20260410") I am seeing this crash sporadically in Meta fleet across multiple kernel versions. A full reproducer is available at: https://github.com/leitao/debug/blob/main/reproducers/repro_blkcg_uaf.sh (The race window is narrow. To make it easily reproducible, inject a msleep(100) between css_put() and blkcg_unpin_online() in cgwb_release_workfn(). With that delay and a KASAN-enabled kernel, the reproducer triggers the splat reliably in less than a second.) Fix this by moving blkcg_unpin_online() before css_put(), so the cgwb's CSS reference keeps the blkcg alive while blkcg_unpin_online() accesses it. Link: https://lore.kernel.org/20260413-blkcg-v1-1-35b72622d16c@debian.org Fixes: 59b57717fff8 ("blkcg: delay blkg destruction until after writeback has finished") Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: JP Kobryn <inwardvessel@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18mm/zone_device: do not touch device folio after calling ->folio_free()Matthew Brost1-1/+1
The contents of a device folio can immediately change after calling ->folio_free(), as the folio may be reallocated by a driver with a different order. Instead of touching the folio again to extract the pgmap, use the local stack variable when calling percpu_ref_put_many(). Link: https://lore.kernel.org/20260410230346.4009855-1-matthew.brost@intel.com Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios") Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Balbir Singh <balbirs@nvidia.com> Reviewed-by: Vishal Moola <vishal.moola@gmail.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Cc: David Hildenbrand <david@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>