path: root/mm
Age | Commit message | Author | Files | Lines
6 days | treewide, timers: Rename from_timer() to timer_container_of() | Ingo Molnar | 1 | -2/+2
Move this API to the canonical timer_*() namespace. [ tglx: Redone against pre rc1 ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/aB2X0jCKQO56WdMt@gmail.com
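For readers following the rename, a minimal sketch of a converted call site, assuming a driver that embeds a struct timer_list in its own state structure (the struct and callback names below are illustrative, not taken from the patch):

    #include <linux/timer.h>

    struct my_dev {
            int pending;
            struct timer_list timer;
    };

    static void my_dev_timeout(struct timer_list *t)
    {
            /* was: struct my_dev *dev = from_timer(dev, t, timer); */
            struct my_dev *dev = timer_container_of(dev, t, timer);

            dev->pending = 0;
    }

The semantics are unchanged: the macro is still a container_of() on the embedded timer member, only the name now lives in the timer_*() namespace.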
6 days | Merge tag 'loongarch-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson | Linus Torvalds | 1 | -0/+1
Pull LoongArch updates from Huacai Chen: - Adjust the 'make install' operation - Support SCHED_MC (Multi-core scheduler) - Enable ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS - Enable HAVE_ARCH_STACKLEAK - Increase max supported CPUs up to 2048 - Introduce the numa_memblks conversion - Add PWM controller nodes in dts - Some bug fixes and other small changes * tag 'loongarch-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson: platform/loongarch: laptop: Unregister generic_sub_drivers on exit platform/loongarch: laptop: Add backlight power control support platform/loongarch: laptop: Get brightness setting from EC on probe LoongArch: dts: Add PWM support to Loongson-2K2000 LoongArch: dts: Add PWM support to Loongson-2K1000 LoongArch: dts: Add PWM support to Loongson-2K0500 LoongArch: vDSO: Correctly use asm parameters in syscall wrappers LoongArch: Fix panic caused by NULL-PMD in huge_pte_offset() LoongArch: Preserve firmware configuration when desired LoongArch: Avoid using $r0/$r1 as "mask" for csrxchg LoongArch: Introduce the numa_memblks conversion LoongArch: Increase max supported CPUs up to 2048 LoongArch: Enable HAVE_ARCH_STACKLEAK LoongArch: Enable ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS LoongArch: Add SCHED_MC (Multi-core scheduler) support LoongArch: Add some annotations in archhelp LoongArch: Using generic scripts/install.sh in `make install` LoongArch: Add a default install.sh
7 days | Merge tag 'mm-stable-2025-06-06-16-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm | Linus Torvalds | 11 | -11/+34
Pull more MM updates from Andrew Morton: "The series 'Fix uprobe pte be overwritten when expanding vma' fixes a longstanding and quite obscure bug related to the vma merging of the uprobe mmap page" * tag 'mm-stable-2025-06-06-16-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: selftests/mm: add test about uprobe pte be orphan during vma merge selftests/mm: extract read_sysfs and write_sysfs into vm_util mm: expose abnormal new_pte during move_ptes mm: fix uprobe pte be overwritten when expanding vma mm/damon: s/primitives/code/ on comments
7 days | Merge tag 'mm-hotfixes-stable-2025-06-06-16-02' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm | Linus Torvalds | 6 | -21/+64
Pull misc fixes from Andrew Morton: "13 hotfixes. 6 are cc:stable and the remainder address post-6.15 issues or aren't considered necessary for -stable kernels. 11 are for MM" * tag 'mm-hotfixes-stable-2025-06-06-16-02' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count MAINTAINERS: add mm swap section kmsan: test: add module description MAINTAINERS: add tlb trace events to MMU GATHER AND TLB INVALIDATION mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race mm/hugetlb: unshare page tables during VMA split, not before MAINTAINERS: add Alistair as reviewer of mm memory policy iov_iter: use iov_offset for length calculation in iov_iter_aligned_bvec mm/mempolicy: fix incorrect freeing of wi_kobj alloc_tag: handle module codetag load errors as module load failures mm/madvise: handle madvise_lock() failure during race unwinding mm: fix vmstat after removing NR_BOUNCE KVM: s390: rename PROT_NONE to PROT_TYPE_DUMMY
8 days | kmsan: test: add module description | Arnd Bergmann | 1 | -0/+1
Every module should have a description, and kbuild now warns for those that don't. WARNING: modpost: missing MODULE_DESCRIPTION() in mm/kmsan/kmsan_test.o Link: https://lkml.kernel.org/r/20250603075323.1839608-1-arnd@kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Cc: Sabyrzhan Tasbolatov <snovitoll@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
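For context, the warning is silenced by adding the missing macro next to the module's other metadata; a minimal sketch (the description string here is illustrative, not necessarily the exact one used in the patch):

    #include <linux/module.h>

    /* one-line human-readable summary, reported by modinfo */
    MODULE_DESCRIPTION("Test cases for KMSAN");
    MODULE_LICENSE("GPL");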
8 days | mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race | Jann Horn | 1 | -0/+7
huge_pmd_unshare() drops a reference on a page table that may have previously been shared across processes, potentially turning it into a normal page table used in another process in which unrelated VMAs can afterwards be installed. If this happens in the middle of a concurrent gup_fast(), gup_fast() could end up walking the page tables of another process. While I don't see any way in which that immediately leads to kernel memory corruption, it is really weird and unexpected. Fix it with an explicit broadcast IPI through tlb_remove_table_sync_one(), just like we do in khugepaged when removing page tables for a THP collapse. Link: https://lkml.kernel.org/r/20250528-hugetlb-fixes-splitrace-v2-2-1329349bad1a@google.com Link: https://lkml.kernel.org/r/20250527-hugetlb-fixes-splitrace-v1-2-f4136f5ec58a@google.com Fixes: 39dde65c9940 ("[PATCH] shared page table for hugetlb page") Signed-off-by: Jann Horn <jannh@google.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 days | mm/hugetlb: unshare page tables during VMA split, not before | Jann Horn | 2 | -16/+51
Currently, __split_vma() triggers hugetlb page table unsharing through vm_ops->may_split(). This happens before the VMA lock and rmap locks are taken - which is too early, it allows racing VMA-locked page faults in our process and racing rmap walks from other processes to cause page tables to be shared again before we actually perform the split. Fix it by explicitly calling into the hugetlb unshare logic from __split_vma() in the same place where THP splitting also happens. At that point, both the VMA and the rmap(s) are write-locked. An annoying detail is that we can now call into the helper hugetlb_unshare_pmds() from two different locking contexts: 1. from hugetlb_split(), holding: - mmap lock (exclusively) - VMA lock - file rmap lock (exclusively) 2. hugetlb_unshare_all_pmds(), which I think is designed to be able to call us with only the mmap lock held (in shared mode), but currently only runs while holding mmap lock (exclusively) and VMA lock Backporting note: This commit fixes a racy protection that was introduced in commit b30c14cd6102 ("hugetlb: unshare some PMDs when splitting VMAs"); that commit claimed to fix an issue introduced in 5.13, but it should actually also go all the way back. [jannh@google.com: v2] Link: https://lkml.kernel.org/r/20250528-hugetlb-fixes-splitrace-v2-1-1329349bad1a@google.com Link: https://lkml.kernel.org/r/20250528-hugetlb-fixes-splitrace-v2-0-1329349bad1a@google.com Link: https://lkml.kernel.org/r/20250527-hugetlb-fixes-splitrace-v1-1-f4136f5ec58a@google.com Fixes: 39dde65c9940 ("[PATCH] shared page table for hugetlb page") Signed-off-by: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> [b30c14cd6102: hugetlb: unshare some PMDs when splitting VMAs] Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 days | mm/mempolicy: fix incorrect freeing of wi_kobj | Joshua Hahn | 1 | -3/+1
We should not free wi_group->wi_kobj here. In the error path of add_weighted_interleave_group() where this snippet is called from, kobj_{del, put} is immediately called right after this section. Thus, it is not only unnecessary but also incorrect to free it here. Link: https://lkml.kernel.org/r/20250602162345.2595696-1-joshua.hahnjy@gmail.com Fixes: e341f9c3c841 ("mm/mempolicy: Weighted Interleave Auto-tuning") Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202506011545.Fduxqxqj-lkp@intel.com/ Cc: Alistair Popple <apopple@nvidia.com> Cc: Byungchul Park <byungchul@sk.com> Cc: David Hildenbrand <david@redhat.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
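The underlying rule is the usual kobject lifetime contract: once the kobject is initialized, kobject_put() drops the final reference and the kobject's ->release() callback frees the embedding object, so an additional kfree() on the same object is a double free. A hedged sketch of that pattern with generic names (not the actual mempolicy code):

    #include <linux/kobject.h>
    #include <linux/slab.h>

    struct wi_group {
            struct kobject kobj;
    };

    /* set as .release in the kobject's kobj_type */
    static void wi_release(struct kobject *kobj)
    {
            /* ->release() owns the final free of the embedding object */
            kfree(container_of(kobj, struct wi_group, kobj));
    }

    static void wi_group_teardown(struct wi_group *group)
    {
            kobject_del(&group->kobj);
            kobject_put(&group->kobj);      /* wi_release() frees 'group' */
            /* kfree(group) here would free it a second time */
    }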
8 days | mm/madvise: handle madvise_lock() failure during race unwinding | SeongJae Park | 1 | -1/+4
When unwinding race on -ERESTARTNOINTR handling of process_madvise(), madvise_lock() failure is ignored. Check the failure and abort remaining works in the case. Link: https://lkml.kernel.org/r/20250602174926.1074-1-sj@kernel.org Fixes: 4000e3d0a367 ("mm/madvise: remove redundant mmap_lock operations from process_madvise()") Signed-off-by: SeongJae Park <sj@kernel.org> Reported-by: Barry Song <21cnbao@gmail.com> Closes: https://lore.kernel.org/CAGsJ_4xJXXO0G+4BizhohSZ4yDteziPw43_uF8nPXPWxUVChzw@mail.gmail.com Reviewed-by: Jann Horn <jannh@google.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Barry Song <baohua@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
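A minimal sketch of the shape of the fix described above, assuming madvise_lock() returns 0 on success and a negative error code otherwise, as the commit message implies (the surrounding unwind loop is illustrative, not the exact process_madvise() code):

    /* -ERESTARTNOINTR unwind: the lock must be re-taken before continuing */
    error = madvise_lock(mm, behavior);
    if (error) {
            ret = error;
            break;  /* abort the remaining ranges instead of ignoring the failure */
    }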
8 days | mm: fix vmstat after removing NR_BOUNCE | Kirill A. Shutemov | 1 | -1/+0
Hongyu noticed that the nr_unaccepted counter kept growing even in the absence of unaccepted memory on the machine. This happens due to a commit that removed NR_BOUNCE: it removed the counter from the enum zone_stat_item, but left it in the vmstat_text array. As a result, all counters below nr_bounce in /proc/vmstat are shifted by one line, causing the numa_hit counter to be labeled as nr_unaccepted. To fix this issue, remove nr_bounce from the vmstat_text array. Link: https://lkml.kernel.org/r/20250529103832.2937460-1-kirill.shutemov@linux.intel.com Fixes: 194df9f66db8 ("mm: remove NR_BOUNCE zone stat") Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Hongyu Ning <hongyu.ning@linux.intel.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
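The mechanism behind the mislabelling is easy to see in a reduced form: /proc/vmstat is produced by walking counters indexed by an enum next to a parallel array of names, so removing an entry from the enum but not from the array shifts every later label by one. A simplified illustration (not the real kernel definitions, which are much longer):

    enum zone_stat_item {
            NR_FREE_PAGES,
            /* NR_BOUNCE removed from the enum ... */
            NR_UNACCEPTED,
            NR_VM_ZONE_STAT_ITEMS,
    };

    static const char * const vmstat_text[] = {
            "nr_free_pages",
            "nr_bounce",            /* ... but left here: every following      */
            "nr_unaccepted",        /* counter is printed under the wrong name */
    };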
8 days | mm: expose abnormal new_pte during move_ptes | Pu Lehui | 1 | -0/+2
When executing move_ptes, the new_pte must be NULL, otherwise it will be overwritten by the old_pte, and cause the abnormal new_pte to be leaked. In order to make this problem to be more explicit, let's add WARN_ON_ONCE when new_pte is not NULL. [akpm@linux-foundation.org: s/WARN_ON_ONCE/VM_WARN_ON_ONCE/] Link: https://lkml.kernel.org/r/20250529155650.4017699-3-pulehui@huaweicloud.com Suggested-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Pu Lehui <pulehui@huawei.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 days | mm: fix uprobe pte be overwritten when expanding vma | Pu Lehui | 2 | -3/+24
Patch series "Fix uprobe pte be overwritten when expanding vma". This patch (of 4): We encountered a BUG alert triggered by Syzkaller as follows: BUG: Bad rss-counter state mm:00000000b4a60fca type:MM_ANONPAGES val:1 And we can reproduce it with the following steps: 1. register uprobe on file at zero offset 2. mmap the file at zero offset: addr1 = mmap(NULL, 2 * 4096, PROT_NONE, MAP_PRIVATE, fd, 0); 3. mremap part of vma1 to new vma2: addr2 = mremap(addr1, 4096, 2 * 4096, MREMAP_MAYMOVE); 4. mremap back to orig addr1: mremap(addr2, 4096, 4096, MREMAP_MAYMOVE | MREMAP_FIXED, addr1); In step 3, the vma1 range [addr1, addr1 + 4096] will be remap to new vma2 with range [addr2, addr2 + 8192], and remap uprobe anon page from the vma1 to vma2, then unmap the vma1 range [addr1, addr1 + 4096]. In step 4, the vma2 range [addr2, addr2 + 4096] will be remap back to the addr range [addr1, addr1 + 4096]. Since the addr range [addr1 + 4096, addr1 + 8192] still maps the file, it will take vma_merge_new_range to expand the range, and then do uprobe_mmap in vma_complete. Since the merged vma pgoff is also zero offset, it will install uprobe anon page to the merged vma. However, the upcomming move_page_tables step, which use set_pte_at to remap the vma2 uprobe pte to the merged vma, will overwrite the newly uprobe pte in the merged vma, and lead that pte to be orphan. Since the uprobe pte will be remapped to the merged vma, we can remove the unnecessary uprobe_mmap upon merged vma. This problem was first found in linux-6.6.y and also exists in the community syzkaller: https://lore.kernel.org/all/000000000000ada39605a5e71711@google.com/T/ Link: https://lkml.kernel.org/r/20250529155650.4017699-1-pulehui@huaweicloud.com Link: https://lkml.kernel.org/r/20250529155650.4017699-2-pulehui@huaweicloud.com Fixes: 2b1444983508 ("uprobes, mm, x86: Add the ability to install and remove uprobes breakpoints") Signed-off-by: Pu Lehui <pulehui@huawei.com> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 days | mm/damon: s/primitives/code/ on comments | Enze Li | 8 | -8/+8
The word 'primitive' is not explicit. To make the code more easily understood, this commit renames 'primitives' to 'code' in header comments of some source files. Link: https://lkml.kernel.org/r/20250530053115.153238-1-lienze@kylinos.cn Signed-off-by: Enze Li <lienze@kylinos.cn> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 days | Merge tag 'slab-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab | Linus Torvalds | 1 | -6/+9
Pull slab updates from Vlastimil Babka: - Make kvmalloc() more suitable for callers that need it to succeed, but without unnecessary overhead by reclaim and compaction to get a physically contiguous allocation. Instead fall back to vmalloc() more easily by default, unless instructed by __GFP_RETRY_MAYFAIL to prefer kmalloc() harder. This should allow the removal of a xfs-specific workaround (Michal Hocko) - Remove potentially excessive warnings due to memory pressure when allocating structures for per-object allocation profiling metadata (Usama Arif) * tag 'slab-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: mm: slub: only warn once when allocating slab obj extensions fails mm: kvmalloc: make kmalloc fast path real fast path
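In caller-visible terms, the kvmalloc() change can be sketched as follows (hedged: the exact fallback heuristics live inside the allocator, only the gfp-flag convention described above is shown; the function name is illustrative):

    #include <linux/slab.h>

    static void *alloc_big_table(size_t size, bool prefer_contiguous)
    {
            /* Default: fall back to vmalloc() readily; __GFP_RETRY_MAYFAIL
             * asks kvmalloc() to try the contiguous kmalloc() path harder. */
            gfp_t gfp = GFP_KERNEL |
                        (prefer_contiguous ? __GFP_RETRY_MAYFAIL : (gfp_t)0);

            return kvmalloc(size, gfp);     /* free with kvfree() */
    }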
11 days | Merge tag 'mm-stable-2025-06-01-14-06' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm | Linus Torvalds | 16 | -51/+211
Pull more MM updates from Andrew Morton: - "zram: support algorithm-specific parameters" from Sergey Senozhatsky adds infrastructure for passing algorithm-specific parameters into zram. A single parameter `winbits' is implemented at this time. - "memcg: nmi-safe kmem charging" from Shakeel Butt makes memcg charging nmi-safe, which is required by BFP, which can operate in NMI context. - "Some random fixes and cleanup to shmem" from Kemeng Shi implements small fixes and cleanups in the shmem code. - "Skip mm selftests instead when kernel features are not present" from Zi Yan fixes some issues in the MM selftest code. - "mm/damon: build-enable essential DAMON components by default" from SeongJae Park reworks DAMON Kconfig to make it easier to enable CONFIG_DAMON. - "sched/numa: add statistics of numa balance task migration" from Libo Chen adds more info into sysfs and procfs files to improve visibility into the NUMA balancer's task migration activity. - "selftests/mm: cow and gup_longterm cleanups" from Mark Brown provides various updates to some of the MM selftests to make them play better with the overall containing framework. * tag 'mm-stable-2025-06-01-14-06' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (43 commits) mm/khugepaged: clean up refcount check using folio_expected_ref_count() selftests/mm: fix test result reporting in gup_longterm selftests/mm: report unique test names for each cow test selftests/mm: add helper for logging test start and results selftests/mm: use standard ksft_finished() in cow and gup_longterm selftests/damon/_damon_sysfs: skip testcases if CONFIG_DAMON_SYSFS is disabled sched/numa: add statistics of numa balance task sched/numa: fix task swap by skipping kernel threads tools/testing: check correct variable in open_procmap() tools/testing/vma: add missing function stub mm/gup: update comment explaining why gup_fast() disables IRQs selftests/mm: two fixes for the pfnmap test mm/khugepaged: fix race with folio split/free using temporary reference mm: add CONFIG_PAGE_BLOCK_ORDER to select page block order mmu_notifiers: remove leftover stub macros selftests/mm: deduplicate test names in madv_populate kcov: rust: add flags for KCOV with Rust mm: rust: make CONFIG_MMU ifdefs more narrow mmu_gather: move tlb flush for VM_PFNMAP/VM_MIXEDMAP vmas into free_pgtables() mm/damon/Kconfig: enable CONFIG_DAMON by default ...
11 days | Merge tag 'fuse-update-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse | Linus Torvalds | 1 | -3/+9
Pull fuse updates from Miklos Szeredi: - Remove tmp page copying in writeback path (Joanne). This removes ~300 lines and with that a lot of complexity related to avoiding reclaim related deadlock. The old mechanism is replaced with a mapping flag that tells the MM not to block reclaim waiting for writeback to complete. The MM parts have been reviewed/acked by respective maintainers. - Convert more code to handle large folios (Joanne). This still just adds the code to deal with large folios and does not enable them yet. - Allow invalidating all cached lookups atomically (Luis Henriques). This feature is useful for CernVMFS, which currently does this iteratively. - Align write prefaulting in fuse with generic one (Dave Hansen) - Fix race causing invalid data to be cached when setting attributes on different nodes of a distributed fs (Guang Yuan Wu) - Update documentation for passthrough (Chen Linxuan) - Add fdinfo about the device number associated with an opened /dev/fuse instance (Chen Linxuan) - Increase readdir buffer size (Miklos). This depends on a patch to VFS readdir code that was already merged through Christians tree. - Optimize io-uring request expiration (Joanne) - Misc cleanups * tag 'fuse-update-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (25 commits) fuse: increase readdir buffer size readdir: supply dir_context.count as readdir buffer size hint fuse: don't allow signals to interrupt getdents copying fuse: support large folios for writeback fuse: support large folios for readahead fuse: support large folios for queued writes fuse: support large folios for stores fuse: support large folios for symlinks fuse: support large folios for folio reads fuse: support large folios for writethrough writes fuse: refactor fuse_fill_write_pages() fuse: support large folios for retrieves fuse: support copying large folios fs: fuse: add dev id to /dev/fuse fdinfo docs: filesystems: add fuse-passthrough.rst MAINTAINERS: update filter of FUSE documentation fuse: fix race between concurrent setattrs from multiple nodes fuse: remove tmp folio for writebacks and internal rb tree mm: skip folio reclaim in legacy memcg contexts for deadlockable mappings fuse: optimize over-io-uring request expiration check ...
11 days | Merge tag 'vfs-6.16-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs | Linus Torvalds | 1 | -15/+24
Pull vfs fixes from Christian Brauner: - Fix the AT_HANDLE_CONNECTABLE option so filesystems that don't know how to decode a connected non-dir dentry fail the request - Use repr(transparent) to ensure identical layout between the C and Rust implementation of struct file - Add a missing xas_pause() into the dax code employing wait_entry_unlocked_exclusive() - Fix FOP_DONTCACHE which we disabled for v6.15. A folio could get redirtied and/or scheduled for writeback after the initial dropbehind test. Change the test accordingly to handle these cases so we can re-enable FOP_DONTCACHE again * tag 'vfs-6.16-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: exportfs: require ->fh_to_parent() to encode connectable file handles rust: file: improve safety comments rust: file: mark `LocalFile` as `repr(transparent)` fs/dax: Fix "don't skip locked entries when scanning entries" iomap: don't lose folio dropbehind state for overwrites mm/filemap: unify dropbehind flag testing and clearing mm/filemap: unify read/write dropbehind naming Revert "Disable FOP_DONTCACHE for now due to bugs" mm/filemap: use filemap_end_dropbehind() for read invalidation mm/filemap: gate dropbehind invalidate on folio !dirty && !writeback
13 days | mm/khugepaged: clean up refcount check using folio_expected_ref_count() | Shivank Garg | 1 | -15/+2
Use folio_expected_ref_count() instead of open-coded logic in is_refcount_suitable(). This avoids code duplication and improves clarity. Drop is_refcount_suitable() as it is no longer needed. Link: https://lkml.kernel.org/r/20250526182818.37978-2-shivankg@amd.com Signed-off-by: Shivank Garg <shivankg@amd.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Bharata B Rao <bharata@amd.com> Cc: Fengwei Yin <fengwei.yin@intel.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
13 days | sched/numa: add statistics of numa balance task | Chen Yu | 2 | -0/+4
On systems with NUMA balancing enabled, it has been found that tracking task activities resulting from NUMA balancing is beneficial. NUMA balancing employs two mechanisms for task migration: one is to migrate a task to an idle CPU within its preferred node, and the other is to swap tasks located on different nodes when they are on each other's preferred nodes. The kernel already provides NUMA page migration statistics in /sys/fs/cgroup/mytest/memory.stat and /proc/{PID}/sched. However, it lacks statistics regarding task migration and swapping. Therefore, relevant counts for task migration and swapping should be added. The following two new fields: numa_task_migrated numa_task_swapped will be shown in /sys/fs/cgroup/{GROUP}/memory.stat, /proc/{PID}/sched and /proc/vmstat. Introducing both per-task and per-memory cgroup (memcg) NUMA balancing statistics facilitates a rapid evaluation of the performance and resource utilization of the target workload. For instance, users can first identify the container with high NUMA balancing activity and then further pinpoint a specific task within that group, and subsequently adjust the memory policy for that task. In short, although it is possible to iterate through /proc/$pid/sched to locate the problematic task, the introduction of aggregated NUMA balancing activity for tasks within each memcg can assist users in identifying the task more efficiently through a divide-and-conquer approach. As Libo Chen pointed out, the memcg event relies on the text names in vmstat_text, and /proc/vmstat generates corresponding items based on vmstat_text. Thus, the relevant task migration and swapping events introduced in vmstat_text also need to be populated by count_vm_numa_event(), otherwise these values are zero in /proc/vmstat. In theory, task migration and swap events are part of the scheduler's activities. The reason for exposing them through the memory.stat/vmstat interface is that we already have NUMA balancing statistics in memory.stat/vmstat, and these events are closely related to each other. Following Shakeel's suggestion, we describe the end-to-end flow/story of all these events occurring on a timeline for future reference: The goal of NUMA balancing is to co-locate a task and its memory pages on the same NUMA node. There are two strategies: migrate the pages to the task's node, or migrate the task to the node where its pages reside. Suppose a task p1 is running on Node 0, but its pages are located on Node 1. NUMA page fault statistics for p1 reveal its "page footprint" across nodes. If NUMA balancing detects that most of p1's pages are on Node 1: 1.Page Migration Attempt: The Numa balance first tries to migrate p1's pages to Node 0. The numa_page_migrate counter increments. 2.Task Migration Strategies: After the page migration finishes, Numa balance checks every 1 second to see if p1 can be migrated to Node 1. Case 2.1: Idle CPU Available If Node 1 has an idle CPU, p1 is directly scheduled there. This event is logged as numa_task_migrated. Case 2.2: No Idle CPU (Task Swap) If all CPUs on Node1 are busy, direct migration could cause CPU contention or load imbalance. Instead: The Numa balance selects a candidate task p2 on Node 1 that prefers Node 0 (e.g., due to its own page footprint). p1 and p2 are swapped. This cross-node swap is recorded as numa_task_swapped. 
Link: https://lkml.kernel.org/r/d00edb12ba0f0de3c5222f61487e65f2ac58f5b1.1748493462.git.yu.c.chen@intel.com Link: https://lkml.kernel.org/r/7ef90a88602ed536be46eba7152ed0d33bad5790.1748002400.git.yu.c.chen@intel.com Signed-off-by: Chen Yu <yu.c.chen@intel.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Cc: Aubrey Li <aubrey.li@intel.com> Cc: Ayush Jain <Ayush.jain3@amd.com> Cc: "Chen, Tim C" <tim.c.chen@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Libo Chen <libo.chen@oracle.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
13 days | mm/gup: update comment explaining why gup_fast() disables IRQs | Jann Horn | 1 | -1/+1
The current comment in gup_fast() talks about "IPIs that come from THPs splitting", which is outdated and refers to the old THP splitting implementation that was removed in commit ad0bed24e98b ("thp: drop all split_huge_page()-related code"), which landed in v4.5. Before then, THP splitting involved a pmdp_splitting_flush(), which sent an IPI to serialize against gup_fast(). Nowadays, we use tlb_remove_table_sync_one() to send IPIs that serialize against gup_fast(); this is used, for example, in THP *collapsing* to stop gup_fast() walks of a page table before depositing it. Link: https://lkml.kernel.org/r/20250528-gup-irq-comment-fix-v1-1-b9d83c345333@google.com Signed-off-by: Jann Horn <jannh@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
13 days | mm/khugepaged: fix race with folio split/free using temporary reference | Shivank Garg | 1 | -1/+17
hpage_collapse_scan_file() calls is_refcount_suitable(), which in turn calls folio_mapcount(). folio_mapcount() checks folio_test_large() before proceeding to folio_large_mapcount(), but there is a race window where the folio may get split/freed between these checks, triggering: VM_WARN_ON_FOLIO(!folio_test_large(folio), folio) Take a temporary reference to the folio in hpage_collapse_scan_file(). This stabilizes the folio during refcount check and prevents incorrect large folio detection due to concurrent split/free. Use helper folio_expected_ref_count() + 1 to compare with folio_ref_count() instead of using is_refcount_suitable(). Link: https://lkml.kernel.org/r/20250526182818.37978-1-shivankg@amd.com Fixes: 05c5323b2a34 ("mm: track mapcount of large folios in single value") Signed-off-by: Shivank Garg <shivankg@amd.com> Reported-by: syzbot+2b99589e33edbe9475ca@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/6828470d.a70a0220.38f255.000c.GAE@google.com Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Bharata B Rao <bharata@amd.com> Cc: Fengwei Yin <fengwei.yin@intel.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
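A hedged sketch of the stabilization pattern the fix describes: pin the folio with a temporary reference first, then compare the raw reference count against the expected count plus the one reference just taken (simplified; the real check sits inside hpage_collapse_scan_file() and iterates an XArray rather than this bare fragment):

    if (!folio_try_get(folio)) {
            /* folio is already being freed, treat it as unsuitable */
            result = SCAN_PAGE_COUNT;
            break;
    }

    /* the "+ 1" accounts for the temporary reference taken above */
    if (folio_ref_count(folio) != folio_expected_ref_count(folio) + 1) {
            result = SCAN_PAGE_COUNT;
            folio_put(folio);
            break;
    }

    folio_put(folio);       /* drop the temporary pin once checked */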
13 days | mm: add CONFIG_PAGE_BLOCK_ORDER to select page block order | Juan Yescas | 2 | -1/+35
Problem:

On large page size configurations (16KiB, 64KiB), the CMA alignment requirement (CMA_MIN_ALIGNMENT_BYTES) increases considerably, and this causes the CMA reservations to be larger than necessary. This means that the system will have fewer available MIGRATE_UNMOVABLE and MIGRATE_RECLAIMABLE page blocks, since MIGRATE_CMA can't fall back to them.

The CMA_MIN_ALIGNMENT_BYTES increases because it depends on MAX_PAGE_ORDER, which depends on ARCH_FORCE_MAX_ORDER. The value of ARCH_FORCE_MAX_ORDER increases on 16k and 64k kernels.

For example, in ARM, the CMA alignment requirement when:

- CONFIG_ARCH_FORCE_MAX_ORDER default value is used
- CONFIG_TRANSPARENT_HUGEPAGE is set:

    PAGE_SIZE | MAX_PAGE_ORDER | pageblock_order | CMA_MIN_ALIGNMENT_BYTES
    -----------------------------------------------------------------------
    4KiB      | 10             | 9               | 4KiB  * (2 ^ 9)  = 2MiB
    16KiB     | 11             | 11              | 16KiB * (2 ^ 11) = 32MiB
    64KiB     | 13             | 13              | 64KiB * (2 ^ 13) = 512MiB

There are some extreme cases for the CMA alignment requirement when:

- CONFIG_ARCH_FORCE_MAX_ORDER maximum value is set
- CONFIG_TRANSPARENT_HUGEPAGE is NOT set
- CONFIG_HUGETLB_PAGE is NOT set

    PAGE_SIZE | MAX_PAGE_ORDER | pageblock_order | CMA_MIN_ALIGNMENT_BYTES
    ------------------------------------------------------------------------
    4KiB      | 15             | 15              | 4KiB  * (2 ^ 15) = 128MiB
    16KiB     | 13             | 13              | 16KiB * (2 ^ 13) = 128MiB
    64KiB     | 13             | 13              | 64KiB * (2 ^ 13) = 512MiB

This affects the CMA reservations for the drivers. If a driver in a 4KiB kernel needs 4MiB of CMA memory, in a 16KiB kernel the minimal reservation has to be 32MiB due to the alignment requirements:

    reserved-memory {
        ...
        cma_test_reserve: cma_test_reserve {
            compatible = "shared-dma-pool";
            size = <0x0 0x400000>; /* 4 MiB */
            ...
        };
    };

    reserved-memory {
        ...
        cma_test_reserve: cma_test_reserve {
            compatible = "shared-dma-pool";
            size = <0x0 0x2000000>; /* 32 MiB */
            ...
        };
    };

Solution:

Add a new config CONFIG_PAGE_BLOCK_ORDER that allows setting the page block order in all the architectures. The maximum page block order will be given by ARCH_FORCE_MAX_ORDER.

By default, CONFIG_PAGE_BLOCK_ORDER will have the same value as ARCH_FORCE_MAX_ORDER. This will make sure that current kernel configurations won't be affected by this change. It is an opt-in change.

This patch will allow having the same CMA alignment requirements for large page sizes (16KiB, 64KiB) as in 4KiB kernels, by setting a lower pageblock_order.

Tests:

- Verified that HugeTLB pages work when pageblock_order is 1, 7, 10 on 4k and 16k kernels.
- Verified that Transparent Huge Pages work when pageblock_order is 1, 7, 10 on 4k and 16k kernels.
- Verified that dma-buf heaps allocations work when pageblock_order is 1, 7, 10 on 4k and 16k kernels.

Benchmarks:

The benchmarks compare 16kb kernels with pageblock_order 10 and 7. The reason for pageblock_order 7 is that this value makes the min CMA alignment requirement the same as in 4kb kernels (2MB).

- Perform 100K dma-buf heaps (/dev/dma_heap/system) allocations of SZ_8M, SZ_4M, SZ_2M, SZ_1M, SZ_64, SZ_8, SZ_4. Use simpleperf (https://developer.android.com/ndk/guides/simpleperf) to measure the # of instructions and page-faults on 16k kernels. The benchmark was executed 10 times.
The averages are below:

            # instructions           |     #page-faults
     order 10      |    order 7      | order 10 | order 7
    --------------------------------------------------------
    13,891,765,770 | 11,425,777,314  |   220    |   217
    14,456,293,487 | 12,660,819,302  |   224    |   219
    13,924,261,018 | 13,243,970,736  |   217    |   221
    13,910,886,504 | 13,845,519,630  |   217    |   221
    14,388,071,190 | 13,498,583,098  |   223    |   224
    13,656,442,167 | 12,915,831,681  |   216    |   218
    13,300,268,343 | 12,930,484,776  |   222    |   218
    13,625,470,223 | 14,234,092,777  |   219    |   218
    13,508,964,965 | 13,432,689,094  |   225    |   219
    13,368,950,667 | 13,683,587,37   |   219    |   225
    -------------------------------------------------------------------
    13,803,137,433 | 13,131,974,268  |   220    |   220     Averages

There were 4.85% fewer #instructions when order was 7, in comparison with order 10:

    13,803,137,433 - 13,131,974,268 = -671,163,166 (-4.86%)

The number of page faults in order 7 and 10 were the same.

These results didn't show any significant regression when the pageblock_order is set to 7 on 16kb kernels.

- Run speedometer 3.1 (https://browserbench.org/Speedometer3.1/) 5 times on the 16k kernels with pageblock_order 7 and 10.

    order 10 | order 7 | order 7 - order 10 | (order 7 - order 10) %
    -------------------------------------------------------------------
      15.8   |  16.4   |        0.6         |         3.80%
      16.4   |  16.2   |       -0.2         |        -1.22%
      16.6   |  16.3   |       -0.3         |        -1.81%
      16.8   |  16.3   |       -0.5         |        -2.98%
      16.6   |  16.8   |        0.2         |         1.20%
    -------------------------------------------------------------------
      16.44  |  16.4   |       -0.04        |        -0.24%    Averages

The results didn't show any significant regression when the pageblock_order is set to 7 on 16kb kernels.

Link: https://lkml.kernel.org/r/20250521215807.1860663-1-jyescas@google.com Signed-off-by: Juan Yescas <jyescas@google.com> Acked-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
13 days | mmu_gather: move tlb flush for VM_PFNMAP/VM_MIXEDMAP vmas into free_pgtables() | Roman Gushchin | 2 | -0/+3
Commit b67fbebd4cf9 ("mmu_gather: Force tlb-flush VM_PFNMAP vmas") added a forced tlbflush to tlb_end_vma(), which is required to avoid a race between munmap() and unmap_mapping_range(). However it added some overhead to other paths where tlb_end_vma() is used, but vmas are not removed, e.g. madvise(MADV_DONTNEED). Fix this by moving the tlb flush out of tlb_end_vma() into new tlb_flush_vmas() called from free_pgtables(), somewhat similar to the stable version of the original commit: commit 895428ee124a ("mm: Force TLB flush for PFNMAP mappings before unlink_file_vma()"). Note that if tlb->fullmm is set, no flush is required, as the whole mm is about to be destroyed. Link: https://lkml.kernel.org/r/20250522012838.163876-1-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Jann Horn <jannh@google.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Will Deacon <will@kernel.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Nick Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
13 days | mm/damon/Kconfig: enable CONFIG_DAMON by default | SeongJae Park | 1 | -0/+1
As of this writing, multiple major distros including Alma, Amazon, Android, CentOS, Debian, Fedora, and Oracle are build-enabling DAMON (set CONFIG_DAMON[1]). Enabling it by default will save configuration setup time for the current and future DAMON users. Build-enabling DAMON does not introduce a real risk since it makes no behavioral change by default. It requires explicit user requests to do anything. Only one potential risk is making the size of the kernel a little bit larger. On a production-purpose configuration, it increases the resulting kernel package size by about 0.1 % of the final package file. I believe that's too small to be a real problem in common setups. Hence, the benefit of enabling CONFIG_DAMON outweighs the potential risk. Set CONFIG_DAMON by default. Link: https://oracle.github.io/kconfigs/?config=UTS_RELEASE&config=DAMON [1] Link: https://lkml.kernel.org/r/20250521042755.39653-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Acked-by: Honggyu Kim <honggyu.kim@sk.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
13 days | mm/damon/Kconfig: set DAMON_{VADDR,PADDR,SYSFS} default to DAMON | SeongJae Park | 1 | -0/+3
Patch series "mm/damon: build-enable essential DAMON components by default". As of this writing, multiple major distros including Alma, Amazon, Android, CentOS, Debian, Fedora, and Oracle are build-enabling DAMON (set CONFIG_DAMON[1]). Configuring DAMON is not very easy, since it is disabled by default, and there are multiple essential options that need to be manually turned on, one by one. Make it easier, by grouping essential configurations to be enabled with one selection, and enabling build of the essential parts of DAMON by default. Note that build-enabling DAMON does not introduce any real risk, since it makes no behavioral change by default. It requires explicit user requests to do anything. Only one potential risk is making the size of the kernel a little bit larger. On a production-purpose configuration, it increases the resulting kernel package binary size by about 0.1 % of the final package file. I believe that's too small to be a real problem in common setups. DAMON_{VADDR,PADDR,SYSFS} are de-facto essential parts of DAMON for normal usages. Because those need to be enabled one by one, however, and there are other test-purpose or non-essential configurations, it is easy to be confused and make mistakes at setup. Make the essential configurations default to CONFIG_DAMON, so that those can be enabled by default with a single change. Link: https://oracle.github.io/kconfigs/?config=UTS_RELEASE&config=DAMON [1] Link: https://lkml.kernel.org/r/20250521042755.39653-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250521042755.39653-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Acked-by: Honggyu Kim <honggyu.kim@sk.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
13 days | hugetlb: show nr_huge_pages in report_hugepages() | Wenjie Xu | 1 | -1/+1
The number of pre-allocated huge pages should be nr_huge_pages, not free_huge_pages, although they are same during booting stage Link: https://lkml.kernel.org/r/20250515114231.65824-1-xuwenjie04@baidu.com Signed-off-by: Wenjie Xu <xuwenjie04@baidu.com> Signed-off-by: Li RongQing <lirongqing@baidu.com> Acked-by: Oscar Salvador <osalvador@suse.de> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
13 days | mm/shmem: remove unneeded xa_is_value() check in shmem_unuse_swap_entries() | Kemeng Shi | 1 | -2/+0
As only value entry will be added to fbatch in shmem_find_swap_entries(), there is no need to do xa_is_value() check in shmem_unuse_swap_entries(). Link: https://lkml.kernel.org/r/20250516170939.965736-6-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kairui Song <kasong@tencent.com> Cc: kernel test robot <oliver.sang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
13 days | mm: shmem: only remove inode from swaplist when it's swapped page count is 0 | Kemeng Shi | 1 | -2/+2
Even if we fail to allocate a swap entry, the inode might have previously allocated entry and we might take inode containing swap entry off swaplist. As a result, try_to_unuse() may enter a potential dead loop to repeatedly look for inode and clean it's swap entry. Only take inode off swaplist when it's swapped page count is 0 to fix the issue. Link: https://lkml.kernel.org/r/20250516170939.965736-5-shikemeng@huaweicloud.com Fixes: b487a2da3575 ("mm, swap: simplify folio swap allocation") Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Kairui Song <kasong@tencent.com> Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202505161438.9009cf47-lkp@intel.com Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
13 days | mm/shmem: fix potential dead loop in shmem_unuse() | Kemeng Shi | 1 | -3/+6
If multiple shmem_unuse() calls for different swap types run concurrently, a dead loop could occur as follows:

     shmem_unuse(typeA)               shmem_unuse(typeB)

      mutex_lock(&shmem_swaplist_mutex)
      list_for_each_entry_safe(info, next, ...)
       ...
      mutex_unlock(&shmem_swaplist_mutex)
      /* info->swapped may drop to 0 */
      shmem_unuse_inode(&info->vfs_inode, type)

                                       mutex_lock(&shmem_swaplist_mutex)
                                       list_for_each_entry(info, next, ...)
                                        if (!info->swapped)
                                         list_del_init(&info->swaplist)
                                       ...
                                       mutex_unlock(&shmem_swaplist_mutex)

      mutex_lock(&shmem_swaplist_mutex)
      /* iterate with offlist entry and encounter a dead loop */
      next = list_next_entry(info, swaplist);
      ...

Restart the iteration if the inode is already off the shmem_swaplist list to fix the issue.

Link: https://lkml.kernel.org/r/20250516170939.965736-4-shikemeng@huaweicloud.com Fixes: b56a2d8af914 ("mm: rid swapoff of quadratic complexity") Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kairui Song <kasong@tencent.com> Cc: kernel test robot <oliver.sang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
13 days | mm: shmem: add missing shmem_unacct_size() in __shmem_file_setup() | Kemeng Shi | 1 | -3/+3
We will miss shmem_unacct_size() when is_idmapped_mnt() returns a failure. Move is_idmapped_mnt() before shmem_acct_size() to fix the issue. Link: https://lkml.kernel.org/r/20250516170939.965736-3-shikemeng@huaweicloud.com Fixes: 7a80e5b8c6fa ("shmem: support idmapped mounts for tmpfs") Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kairui Song <kasong@tencent.com> Cc: kernel test robot <oliver.sang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
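A hedged sketch of the reordering described above: reject the unsupported case before charging, so the early-error path no longer needs a matching shmem_unacct_size() (simplified; the error codes and surrounding code are illustrative, not the exact __shmem_file_setup() hunk):

    if (is_idmapped_mnt(mnt))
            return ERR_PTR(-EINVAL);   /* checked before any accounting */

    if (shmem_acct_size(flags, size))
            return ERR_PTR(-ENOMEM);   /* nothing to unaccount for the check above */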
13 days | mm: shmem: avoid unpaired folio_unlock() in shmem_swapin_folio() | Kemeng Shi | 1 | -0/+2
Patch series "Some random fixes and cleanup to shmem", v3. This series contains some simple fixes and cleanup which are made during learning shmem. More details can be found in respective patches. This patch (of 5): If we get a folio from swap_cache_get_folio() successfully but encounter a failure before the folio is locked, we will unlock the folio which was not previously locked. Put the folio and set it to NULL when a failure occurs before the folio is locked to fix the issue. Link: https://lkml.kernel.org/r/20250516170939.965736-1-shikemeng@huaweicloud.com Link: https://lkml.kernel.org/r/20250516170939.965736-2-shikemeng@huaweicloud.com Fixes: 058313515d5a ("mm: shmem: fix potential data corruption during shmem swapin") Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Kairui Song <kasong@tencent.com> Cc: Hugh Dickins <hughd@google.com> Cc: kernel test robot <oliver.sang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
13 days | mm/damon/core: avoid destroyed target reference from DAMOS quota | Akinobu Mita | 1 | -0/+8
When the number of the monitoring targets in running contexts is reduced, there may be DAMOS quotas referencing the targets that will be destroyed. Applying the scheme action for such DAMOS scheme will be skipped forever looking for the starting part of the region for the destroyed monitoring target. To fix this issue, when the monitoring target is destroyed, reset the starting part for all DAMOS quotas that reference the target. Link: https://lkml.kernel.org/r/20250517141852.142802-1-akinobu.mita@gmail.com Fixes: da87878010e5 ("mm/damon/sysfs: support online inputs update") Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
13 days | memcg: make memcg_rstat_updated nmi safe | Shakeel Butt | 1 | -7/+9
Currently kernel maintains memory related stats updates per-cgroup to optimize stats flushing. The stats_updates is defined as atomic64_t which is not nmi-safe on some archs. Actually we don't really need 64bit atomic as the max value stats_updates can get should be less than nr_cpus * MEMCG_CHARGE_BATCH. A normal atomic_t should suffice. Also the function cgroup_rstat_updated() is still not nmi-safe but there is parallel effort to make it nmi-safe, so until then let's ignore it in the nmi context. Link: https://lkml.kernel.org/r/20250519063142.111219-6-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
13 days | memcg: nmi-safe slab stats updates | Shakeel Butt | 1 | -3/+33
The objcg based kmem [un]charging can be called in nmi context and it may need to update NR_SLAB_[UN]RECLAIMABLE_B stats. So, let's correctly handle the updates of these stats in the nmi context. Link: https://lkml.kernel.org/r/20250519063142.111219-5-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
13 days | memcg: add nmi-safe update for MEMCG_KMEM | Shakeel Butt | 1 | -2/+19
The objcg based kmem charging and uncharging code path needs to update MEMCG_KMEM appropriately. Let's add support to update MEMCG_KMEM in nmi-safe way for those code paths. Link: https://lkml.kernel.org/r/20250519063142.111219-4-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
13 days | memcg: nmi safe memcg stats for specific archs | Shakeel Butt | 1 | -0/+49
There are archs which have NMI but does not support this_cpu_* ops safely in the nmi context but they support safe atomic ops in nmi context. For such archs, let's add infra to use atomic ops for the memcg stats which can be updated in nmi. At the moment, the memcg stats which get updated in the objcg charging path are MEMCG_KMEM, NR_SLAB_RECLAIMABLE_B & NR_SLAB_UNRECLAIMABLE_B. Rather than adding support for all memcg stats to be nmi safe, let's just add infra to make these three stats nmi safe which this patch is doing. Link: https://lkml.kernel.org/r/20250519063142.111219-3-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
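A hedged sketch of the general shape of such infrastructure, for one counter: when the arch's this_cpu_*() ops are not NMI-safe, updates arriving in NMI context go through an atomic side counter that is added back in when the stat is read or flushed (the names are illustrative, not the actual memcg fields):

    #include <linux/atomic.h>
    #include <linux/percpu.h>
    #include <linux/preempt.h>

    static DEFINE_PER_CPU(long, kmem_pages);              /* normal-context updates */
    static atomic_t kmem_pages_nmi = ATOMIC_INIT(0);       /* NMI-context updates */

    static void account_kmem(int nr_pages)
    {
            if (likely(!in_nmi()))
                    this_cpu_add(kmem_pages, nr_pages);
            else
                    atomic_add(nr_pages, &kmem_pages_nmi);  /* NMI-safe atomic op */
    }

    static long read_kmem(void)
    {
            long sum = atomic_read(&kmem_pages_nmi);
            int cpu;

            for_each_possible_cpu(cpu)
                    sum += per_cpu(kmem_pages, cpu);
            return sum;
    }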
13 days | memcg: disable kmem charging in nmi for unsupported arch | Shakeel Butt | 1 | -0/+3
Patch series "memcg: nmi-safe kmem charging", v4. Users can attached their BPF programs at arbitrary execution points in the kernel and such BPF programs may run in nmi context. In addition, these programs can trigger memcg charged kernel allocations in the nmi context. However memcg charging infra for kernel memory is not equipped to handle nmi context for all architectures. This series removes the hurdles to enable kmem charging in the nmi context for most of the archs. For archs without CONFIG_HAVE_NMI, this series is a noop. For archs with NMI support and have CONFIG_ARCH_HAS_NMI_SAFE_THIS_CPU_OPS, the previous work to make memcg stats re-entrant is sufficient for allowing kmem charging in nmi context. For archs with NMI support but without CONFIG_ARCH_HAS_NMI_SAFE_THIS_CPU_OPS and with ARCH_HAVE_NMI_SAFE_CMPXCHG, this series added infra to support kmem charging in nmi context. Lastly those archs with NMI support but without CONFIG_ARCH_HAS_NMI_SAFE_THIS_CPU_OPS and ARCH_HAVE_NMI_SAFE_CMPXCHG, kmem charging in nmi context is not supported at all. Mostly used archs have support for CONFIG_ARCH_HAS_NMI_SAFE_THIS_CPU_OPS and this series should be almost a noop (other than making memcg_rstat_updated nmi safe) for such archs. This patch (of 5): The memcg accounting and stats uses this_cpu* and atomic* ops. There are archs which define CONFIG_HAVE_NMI but does not define CONFIG_ARCH_HAS_NMI_SAFE_THIS_CPU_OPS and ARCH_HAVE_NMI_SAFE_CMPXCHG, so memcg accounting for such archs in nmi context is not possible to support. Let's just disable memcg accounting in nmi context for such archs. Link: https://lkml.kernel.org/r/20250519063142.111219-1-shakeel.butt@linux.dev Link: https://lkml.kernel.org/r/20250519063142.111219-2-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
13 days | mm: rename page->index to page->__folio_index | Matthew Wilcox (Oracle) | 5 | -10/+10
All users of page->index have been converted to not refer to it any more. Update a few pieces of documentation that were missed and prevent new users from appearing (or at least make them easy to grep for). Link: https://lkml.kernel.org/r/20250514181508.3019795-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
13 days | Merge tag 'mm-nonmm-stable-2025-05-31-15-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm | Linus Torvalds | 1 | -1/+1
Pull non-MM updates from Andrew Morton: - "hung_task: extend blocking task stacktrace dump to semaphore" from Lance Yang enhances the hung task detector. The detector presently dumps the blocking tasks's stack when it is blocked on a mutex. Lance's series extends this to semaphores - "nilfs2: improve sanity checks in dirty state propagation" from Wentao Liang addresses a couple of minor flaws in nilfs2 - "scripts/gdb: Fixes related to lx_per_cpu()" from Illia Ostapyshyn fixes a couple of issues in the gdb scripts - "Support kdump with LUKS encryption by reusing LUKS volume keys" from Coiby Xu addresses a usability problem with kdump. When the dump device is LUKS-encrypted, the kdump kernel may not have the keys to the encrypted filesystem. A full writeup of this is in the series [0/N] cover letter - "sysfs: add counters for lockups and stalls" from Max Kellermann adds /sys/kernel/hardlockup_count and /sys/kernel/hardlockup_count and /sys/kernel/rcu_stall_count - "fork: Page operation cleanups in the fork code" from Pasha Tatashin implements a number of code cleanups in fork.c - "scripts/gdb/symbols: determine KASLR offset on s390 during early boot" from Ilya Leoshkevich fixes some s390 issues in the gdb scripts * tag 'mm-nonmm-stable-2025-05-31-15-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (67 commits) llist: make llist_add_batch() a static inline delayacct: remove redundant code and adjust indentation squashfs: add optional full compressed block caching crash_dump, nvme: select CONFIGFS_FS as built-in scripts/gdb/symbols: determine KASLR offset on s390 during early boot scripts/gdb/symbols: factor out pagination_off() scripts/gdb/symbols: factor out get_vmlinux() kernel/panic.c: format kernel-doc comments mailmap: update and consolidate Casey Connolly's name and email nilfs2: remove wbc->for_reclaim handling fork: define a local GFP_VMAP_STACK fork: check charging success before zeroing stack fork: clean-up naming of vm_stack/vm_struct variables in vmap stacks code fork: clean-up ifdef logic around stack allocation kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count kernel/watchdog: add /sys/kernel/{hard,soft}lockup_count x86/crash: make the page that stores the dm crypt keys inaccessible x86/crash: pass dm crypt keys to kdump kernel Revert "x86/mm: Remove unused __set_memory_prot()" crash_dump: retrieve dm crypt keys in kdump kernel ...
13 days | Merge tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm | Linus Torvalds | 74 | -1695/+3127
Pull MM updates from Andrew Morton:
- "Add folio_mk_pte()" from Matthew Wilcox simplifies the act of creating a pte which addresses the first page in a folio and reduces the amount of plumbing which architectures must implement to provide this. (A minimal call-site sketch follows after this pull summary.)
- "Misc folio patches for 6.16" from Matthew Wilcox is a shower of largely unrelated folio infrastructure changes which clean things up and better prepare us for future work.
- "memory,x86,acpi: hotplug memory alignment advisement" from Gregory Price adds early-init code to prevent x86 from leaving physical memory unused when physical address regions are not aligned to memory block size.
- "mm/compaction: allow more aggressive proactive compaction" from Michal Clapinski provides some tuning of the (sadly, hard-coded (more sadly, not auto-tuned)) thresholds for our invocation of proactive compaction. In a simple test case, the reduction of a guest VM's memory consumption was dramatic.
- "Minor cleanups and improvements to swap freeing code" from Kemeng Shi provides some code cleanups and a small efficiency improvement to this part of our swap handling code.
- "ptrace: introduce PTRACE_SET_SYSCALL_INFO API" from Dmitry Levin adds the ability for a ptracer to modify syscall arguments. At this time we can alter only "system call information that are used by strace system call tampering, namely, syscall number, syscall arguments, and syscall return value". This series should have been incorporated into mm.git's "non-MM" branch, but I goofed.
- "fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions" from Andrei Vagin extends the info returned by the PAGEMAP_SCAN ioctl against /proc/pid/pagemap. This permits CRIU to more efficiently get at the info about guard regions.
- "Fix parameter passed to page_mapcount_is_type()" from Gavin Shan implements that fix. No runtime effect is expected because validate_page_before_insert() happens to fix up this error.
- "kernel/events/uprobes: uprobe_write_opcode() rewrite" from David Hildenbrand basically brings uprobe text poking into the current decade. It removes a bunch of hand-rolled implementation in favor of using more current facilities.
- "mm/ptdump: Drop assumption that pxd_val() is u64" from Anshuman Khandual provides enhancements and generalizations to the pte dumping code. This might be needed when 128-bit Page Table Descriptors are enabled for ARM.
- "Always call constructor for kernel page tables" from Kevin Brodsky ensures that the ctor/dtor is always called for kernel pgtables, as it already is for user pgtables. This permits the addition of more functionality such as "insert hooks to protect page tables". This change does result in various architectures performing unnecessary work, but this is fixed up where it is anticipated to occur.
- "Rust support for mm_struct, vm_area_struct, and mmap" from Alice Ryhl adds plumbing to permit Rust access to core MM structures.
- "fix incorrectly disallowed anonymous VMA merges" from Lorenzo Stoakes takes advantage of some VMA merging opportunities which we've been missing for 15 years.
- "mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE" from SeongJae Park optimizes process_madvise()'s TLB flushing. Instead of flushing each address range in the provided iovec, we batch the flushing across all the iovec entries. The syscall's cost was approximately halved with a microbenchmark which was designed to load this particular operation.
- "Track node vacancy to reduce worst case allocation counts" from Sidhartha Kumar makes the maple tree smarter about its node preallocation. stress-ng mmap performance increased by single-digit percentages and the amount of unnecessarily preallocated memory was dramatically reduced.
- "mm/gup: Minor fix, cleanup and improvements" from Baoquan He removes a few unnecessary things which Baoquan noted when reading the code.
- "Enhance sysfs handling for memory hotplug in weighted interleave" from Rakie Kim "enhances the weighted interleave policy in the memory management subsystem by improving sysfs handling, fixing memory leaks, and introducing dynamic sysfs updates for memory hotplug support". It fixes things on error paths which we are unlikely to hit.
- "mm/damon: auto-tune DAMOS for NUMA setups including tiered memory" from SeongJae Park introduces new DAMOS quota goal metrics which eliminate the manual tuning which is required when utilizing DAMON for memory tiering.
- "mm/vmalloc.c: code cleanup and improvements" from Baoquan He provides cleanups and small efficiency improvements which Baoquan found via code inspection.
- "vmscan: enforce mems_effective during demotion" from Gregory Price changes reclaim to respect cpuset.mems_effective during demotion when possible, because presently reclaim explicitly ignores cpuset.mems_effective when demoting, which may cause the cpuset settings to be violated. This is useful for isolating workloads on a multi-tenant system from certain classes of memory more consistently.
- "Clean up split_huge_pmd_locked() and remove unnecessary folio pointers" from Gavin Guo provides minor cleanups and efficiency gains in the huge page splitting and migrating code.
- "Use kmem_cache for memcg alloc" from Huan Yang creates a slab cache for `struct mem_cgroup', yielding improved memory utilization.
- "add max arg to swappiness in memory.reclaim and lru_gen" from Zhongkun He adds a new "max" argument to the "swappiness=" argument for memory.reclaim and MGLRU's lru_gen. This directs proactive reclaim to reclaim from only anon folios rather than file-backed folios.
- "kexec: introduce Kexec HandOver (KHO)" from Mike Rapoport is the first step on the path to permitting the kernel to maintain existing VMs while replacing the host kernel via file-based kexec. At this time only memblock's reserve_mem is preserved.
- "mm: Introduce for_each_valid_pfn()" from David Woodhouse provides and uses a smarter way of looping over a pfn range, by skipping ranges of invalid pfns.
- "sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems" from Libo Chen removes a lot of pointless VMA scanning when a task is pinned to a single NUMA node. Dramatic performance benefits were seen in some real world cases.
- "JFS: Implement migrate_folio for jfs_metapage_aops" from Shivank Garg addresses a warning which occurs during memory compaction when using JFS.
- "move all VMA allocation, freeing and duplication logic to mm" from Lorenzo Stoakes moves some VMA code from kernel/fork.c into the more appropriate mm/vma.c.
- "mm, swap: clean up swap cache mapping helper" from Kairui Song provides code consolidation and cleanups related to the folio_index() function.
- "mm/gup: Cleanup memfd_pin_folios()" from Vishal Moola does that.
- "memcg: Fix test_memcg_min/low test failures" from Waiman Long addresses some bogus failures which are being reported by the test_memcontrol selftest.
- "eliminate mmap() retry merge, add .mmap_prepare hook" from Lorenzo Stoakes commences the deprecation of file_operations.mmap() in favor of the new file_operations.mmap_prepare(). The latter is more restrictive and prevents drivers from messing with things in ways which, amongst other problems, may defeat VMA merging.
- "memcg: decouple memcg and objcg stocks" from Shakeel Butt decouples the per-cpu memcg charge cache from the objcg's one. This is a step along the way to making memcg and objcg charging NMI-safe, which is a BPF requirement.
- "mm/damon: minor fixups and improvements for code, tests, and documents" from SeongJae Park is yet another batch of miscellaneous DAMON changes. It fixes and improves minor problems in code, tests and documents.
- "memcg: make memcg stats irq safe" from Shakeel Butt converts memcg stats to be irq safe. Another step along the way to making memcg charging and stats updates NMI-safe, a BPF requirement.
- "Let unmap_hugepage_range() and several related functions take folio instead of page" from Fan Ni provides folio conversions in the hugetlb code.

* tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (285 commits) mm: pcp: increase pcp->free_count threshold to trigger free_high mm/hugetlb: convert use of struct page to folio in __unmap_hugepage_range() mm/hugetlb: refactor __unmap_hugepage_range() to take folio instead of page mm/hugetlb: refactor unmap_hugepage_range() to take folio instead of page mm/hugetlb: pass folio instead of page to unmap_ref_private() memcg: objcg stock trylock without irq disabling memcg: no stock lock for cpu hot-unplug memcg: make __mod_memcg_lruvec_state re-entrant safe against irqs memcg: make count_memcg_events re-entrant safe against irqs memcg: make mod_memcg_state re-entrant safe against irqs memcg: move preempt disable to callers of memcg_rstat_updated memcg: memcg_rstat_updated re-entrant safe against irqs mm: khugepaged: decouple SHMEM and file folios' collapse selftests/eventfd: correct test name and improve messages alloc_tag: check mem_profiling_support in alloc_tag_init Docs/damon: update titles and brief introductions to explain DAMOS selftests/damon/_damon_sysfs: read tried regions directories in order mm/damon/tests/core-kunit: add a test for damos_set_filters_default_reject() mm/damon/paddr: remove unused variable, folio_list, in damon_pa_stat() mm/damon/sysfs-schemes: fix wrong comment on damons_sysfs_quota_goal_metric_strs ...
14 daysMerge tag 'gcc-minimum-version-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-genericLinus Torvalds1-6/+0
Pull compiler version requirement update from Arnd Bergmann:
"Require gcc-8 and binutils-2.30
x86 already uses gcc-8 as the minimum version; this changes all other architectures to the same version. gcc-8 is used in Debian 10 and Red Hat Enterprise Linux 8, both of which are still supported, and binutils 2.30 is the oldest corresponding version on those. Ubuntu Pro 18.04 and SUSE Linux Enterprise Server 15 both use gcc-7 as the system compiler but additionally include toolchains that remain supported.
With the new minimum toolchain versions, a number of workarounds for older versions can be dropped, in particular on x86_64 and arm64. Importantly, the updated compiler version allows removing two of the five remaining gcc plugins, as support for the sancov and structleak features is already included in modern compiler versions.
I tried collecting the known changes that are possible based on the new toolchain version, but expect that more cleanups will be possible. Since this touches multiple architectures, I merged the patches through the asm-generic tree."

* tag 'gcc-minimum-version-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic: Makefile.kcov: apply needed compiler option unconditionally in CFLAGS_KCOV Documentation: update binutils-2.30 version reference gcc-plugins: remove SANCOV gcc plugin Kbuild: remove structleak gcc plugin arm64: drop binutils version checks raid6: skip avx512 checks kbuild: require gcc-8 and binutils-2.30
2025-05-30Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds1-19/+243
Pull rdma updates from Jason Gunthorpe:
"Usual collection of driver fixes:
- Small bug fixes and cleanups in hfi, hns, rxe, mlx5, mana, siw
- Further ODP functionality in rxe
- Remote access MRs in mana, along with more page sizes
- Improve CM scalability with an rwlock around the agent
- More trace points for hns
- ODP hmm conversion to the new two-step DMA API
- Support the ethernet HW device in mana as well as the RNIC
- Cleanups:
  - Use secs_to_jiffies() when appropriate
  - Use ERR_CAST() instead of naked casts
  - Don't use %pK in printk
  - Unused functions removed
  - Allocation type matching"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (57 commits) RDMA/cma: Fix hang when cma_netevent_callback fails to queue_work RDMA/bnxt_re: Support extended stats for Thor2 VF RDMA/hns: Fix endian issue in trace events RDMA/mlx5: Avoid flexible array warning IB/cm: Remove dead code and adjust naming RDMA/core: Avoid hmm_dma_map_alloc() for virtual DMA devices RDMA/rxe: Break endless pagefault loop for RO pages RDMA/bnxt_re: Fix return code of bnxt_re_configure_cc RDMA/bnxt_re: Fix missing error handling for tx_queue RDMA/bnxt_re: Fix incorrect display of inactivity_cp in debugfs output RDMA/mlx5: Add support for 200Gbps per lane speeds RDMA/mlx5: Remove the redundant MLX5_IB_STAGE_UAR stage RDMA/iwcm: Fix use-after-free of work objects after cm_id destruction net: mana: Add support for auxiliary device servicing events RDMA/mana_ib: unify mana_ib functions to support any gdma device RDMA/mana_ib: Add support of mana_ib for RNIC and ETH nic net: mana: Probe rdma device in mana driver RDMA/siw: replace redundant ternary operator with just rv RDMA/umem: Separate implicit ODP initialization from explicit ODP RDMA/core: Convert UMEM ODP DMA mapping to caching IOVA and page linkage ...
2025-05-30LoongArch: Introduce the numa_memblks conversionHuacai Chen1-0/+1
Commit 87482708210ff3333a ("mm: introduce numa_memblks") has moved numa_memblks from x86 to the generic code, but LoongArch was left out of this conversion. This patch introduces the generic numa_memblks for LoongArch. In detail:
1. Enable NUMA_MEMBLKS (but disable NUMA_EMU) in Kconfig;
2. Use generic definition for numa_memblk and numa_meminfo;
3. Use generic implementation for numa_add_memblk() and its friends (a hypothetical registration sketch follows below);
4. Use generic implementation for numa_set_distance() and its friends;
5. Use generic implementation for memory_add_physaddr_to_nid() and its friends.
Note: Disable NUMA_EMU because it needs more effort and there is no obvious demand for it now.
Tested-by: Binbin Zhou <zhoubinbin@loongson.cn>
Signed-off-by: Yuquan Wang <wangyuquan1236@phytium.com.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
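To make step 3 concrete, here is a minimal sketch of the kind of arch-side loop that registers firmware-described memory ranges with the generic numa_memblks layer via numa_add_memblk(nid, start, end). The fw_range structure and its caller (which would walk ACPI SRAT or the device tree) are placeholders, not actual LoongArch code.

/*
 * Hypothetical arch-side registration: hand each firmware-described
 * memory range to the generic numa_memblks code instead of keeping an
 * arch-private table.
 */
struct fw_range {
	int nid;	/* NUMA node the range belongs to */
	u64 start;	/* physical start address */
	u64 end;	/* physical end address (exclusive) */
};

static int __init register_fw_numa_ranges(const struct fw_range *ranges, int nr)
{
	int i;

	for (i = 0; i < nr; i++) {
		if (numa_add_memblk(ranges[i].nid, ranges[i].start,
				    ranges[i].end) < 0)
			return -EINVAL;
	}

	return 0;
}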
2025-05-28Merge tag 'net-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-nextLinus Torvalds1-6/+2
Pull networking updates from Paolo Abeni:
"Core:
- Implement the Device Memory TCP transmit path, allowing zero-copy data transmission on top of TCP from e.g. GPU memory to the wire.
- Move all IPv6 routing table management outside the RTNL scope, under its own lock and RCU. The route control path is now 3x faster.
- Convert queue-related netlink ops to instance lock, reducing again the scope of the RTNL lock. This improves the control plane scalability.
- Refactor the software crc32c implementation, removing unneeded abstraction layers and significantly improving the related micro-benchmarks.
- Optimize the GRO engine for UDP-tunneled traffic, for a 10% performance improvement in related stream tests.
- Cover more per-CPU storage with local nested BH locking; this is prep work to remove the current per-CPU lock in local_bh_disable() on PREEMPT_RT.
- Introduce and use the nlmsg_payload helper, combining buffer bounds verification with accessing the payload carried by netlink messages.

Netfilter:
- Rewrite the procfs conntrack table implementation, improving considerably the dump performance. A lot of user-space tools still use this interface.
- Implement support for wildcard netdevice in netdev basechain and flowtables.
- Integrate conntrack information into nft trace infrastructure.
- Export set count and backend name to userspace, for better introspection.

BPF:
- BPF qdisc support: BPF-qdisc can be implemented with BPF struct_ops programs and can be controlled in similar way to traditional qdiscs using the "tc qdisc" command.
- Refactor the UDP socket iterator, addressing long-standing issues WRT duplicate hits or missed sockets.

Protocols:
- Improve TCP receive buffer auto-tuning and increase the default upper bound for the receive buffer; overall this improves the single-flow maximum throughput on a 200Gbps link by over 60%.
- Add AFS GSSAPI security class to AF_RXRPC; it provides transport security for connections to the AFS fileserver and VL server.
- Improve TCP multipath routing, so that the source address always matches the nexthop device.
- Introduce SO_PASSRIGHTS for AF_UNIX, to allow disabling SCM_RIGHTS, and thus preventing DoS caused by passing around problematic FDs.
- Retire the DCCP socket. DCCP only receives updates for bugs, and major distros disable it by default. Its removal allows for better organisation of TCP fields to reduce the number of cache lines hit in the fast path.
- Extend TCP drop-reason support to cover PAWS checks.

Driver API:
- Reorganize PTP ioctl flag support to require an explicit opt-in for the drivers, avoiding the problem of drivers not rejecting new unsupported flags.
- Convert several device drivers to timestamping APIs.
- Introduce per-PHY ethtool dump helpers, improving the support for dump operations targeting PHYs.

Tests and tooling:
- Add support for classic netlink in user space C codegen, so that ynl-c can now read, create and modify links, routes, addresses and qdisc layer configuration.
- Add ynl sub-types for binary attributes, allowing ynl-c to output known struct instead of raw binary data, clarifying the classic netlink output.
- Extend MPTCP selftests to improve the code-coverage.
- Add tests for XDP tail adjustment in AF_XDP.

New hardware / drivers:
- OpenVPN virtual driver: offload OpenVPN data channel processing to kernel space, increasing the data transfer throughput WRT the user-space implementation.
- Renesas glue driver for the gigabit ethernet RZ/V2H(P) SoC.
- Broadcom asp-v3.0 ethernet driver.
- AMD Renoir ethernet device.
- RealTek MT9888 2.5G ethernet PHY driver.
- Aeonsemi 10G C45 PHYs driver.

Drivers:
- Ethernet high-speed NICs:
  - nVidia/Mellanox (mlx5):
    - refactor the steering table handling to significantly reduce the amount of memory used
    - add support for complex matches in H/W flow steering
    - improve flow steering error handling
    - convert to netdev instance locking
  - Intel (100G, ice, igb, ixgbe, idpf):
    - ice: add switchdev support for LLDP traffic over VF
    - ixgbe: add firmware manipulation and regions devlink support
    - igb: introduce support for frame transmission preemption
    - igb: add persistent NAPI configuration
    - idpf: introduce RDMA support
    - idpf: add initial PTP support
  - Meta (fbnic):
    - extend hardware stats coverage
    - add devlink dev flash support
  - Broadcom (bnxt):
    - add support for RX-side device memory TCP
  - Wangxun (txgbe):
    - implement support for udp tunnel offload
    - complete PTP and SRIOV support for AML 25G/10G devices
- Ethernet NICs embedded and virtual:
  - Google (gve):
    - add device memory TCP TX support
  - Amazon (ena):
    - support persistent per-NAPI config
  - Airoha:
    - add H/W support for L2 traffic offload
    - add per-flow stats for flow offloading
  - RealTek (rtl8211): add support for WoL magic packet
  - Synopsys (stmmac):
    - dwmac-socfpga 1000BaseX support
    - add Loongson-2K3000 support
    - introduce support for hardware-accelerated VLAN stripping
  - Broadcom (bcmgenet):
    - expose more H/W stats
  - Freescale (enetc, dpaa2-eth):
    - enetc: add MAC filter, VLAN filter, RSS and loopback support
    - dpaa2-eth: convert to H/W timestamping APIs
  - vxlan: convert FDB table to rhashtable, for better scalability
  - veth: apply qdisc backpressure on full ring to reduce TX drops
- Ethernet switches:
  - Microchip (KSZ88x3): add ETS scheduler support
- Ethernet PHYs:
  - RealTek (rtl8211):
    - add support for WoL magic packet
    - add support for PHY LEDs
- CAN:
  - Add RZ/G3E CANFD support to the rcar_canfd driver.
  - Preparatory work for CAN-XL support.
  - Add self-tests framework with support for CAN physical interfaces.
- WiFi:
  - mac80211:
    - scan improvements with multi-link operation (MLO)
  - Qualcomm (ath12k):
    - enable AHB support for IPQ5332
    - add monitor interface support to QCN9274
    - add multi-link operation support to WCN7850
    - add 802.11d scan offload support to WCN7850
    - monitor mode for WCN7850, better 6 GHz regulatory
  - Qualcomm (ath11k):
    - restore hibernation support
  - MediaTek (mt76):
    - WiFi-7 improvements
    - implement support for mt7990
  - Intel (iwlwifi):
    - enhanced multi-link single-radio (EMLSR) support on 5 GHz links
    - rework device configuration
  - RealTek (rtw88):
    - improve throughput for RTL8814AU
  - RealTek (rtw89):
    - add multi-link operation support
    - STA/P2P concurrency improvements
    - support different SAR configs by antenna
- Bluetooth:
  - introduce HCI Driver protocol
  - btintel_pcie: do not generate coredump for diagnostic events
  - btusb: add HCI Drv commands for configuring altsetting
  - btusb: add RTL8851BE device 0x0bda:0xb850
  - btusb: add new VID/PID 13d3/3584 for MT7922
  - btusb: add new VID/PID 13d3/3630 and 13d3/3613 for MT7925
  - btnxpuart: implement host-wakeup feature"

* tag 'net-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1611 commits) selftests/bpf: Fix bpf selftest build warning selftests: netfilter: Fix skip of wildcard interface test net: phy: mscc: Stop clearing the the UDPv4 checksum for L2 frames net: openvswitch: Fix the dead loop of MPLS parse calipso: Don't call calipso functions for AF_INET sk.
selftests/tc-testing: Add a test for HFSC eltree double add with reentrant enqueue behaviour on netem net_sched: hfsc: Address reentrant enqueue adding class to eltree twice octeontx2-pf: QOS: Refactor TC_HTB_LEAF_DEL_LAST callback octeontx2-pf: QOS: Perform cache sync on send queue teardown net: mana: Add support for Multi Vports on Bare metal net: devmem: ncdevmem: remove unused variable net: devmem: ksft: upgrade rx test to send 1K data net: devmem: ksft: add 5 tuple FS support net: devmem: ksft: add exit_wait to make rx test pass net: devmem: ksft: add ipv4 support net: devmem: preserve sockc_err page_pool: fix ugly page_pool formatting net: devmem: move list_add to net_devmem_bind_dmabuf. selftests: netfilter: nft_queue.sh: include file transfer duration in log message net: phy: mscc: Fix memory leak when using one step timestamping ...
2025-05-28Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linuxLinus Torvalds2-18/+56
Pull arm64 updates from Will Deacon:
"The headline feature is the re-enablement of support for Arm's Scalable Matrix Extension (SME) thanks to a bumper crop of fixes from Mark Rutland. If matrices aren't your thing, then Ryan's page-table optimisation work is much more interesting.

Summary:

ACPI, EFI and PSCI:
- Decouple Arm's "Software Delegated Exception Interface" (SDEI) support from the ACPI GHES code so that it can be used by platforms booted with device-tree
- Remove unnecessary per-CPU tracking of the FPSIMD state across EFI runtime calls
- Fix a node refcount imbalance in the PSCI device-tree code

CPU Features:
- Ensure register sanitisation is applied to fields in ID_AA64MMFR4
- Expose AIDR_EL1 to userspace via sysfs, primarily so that KVM guests can reliably query the underlying CPU types from the VMM
- Re-enable SME support (CONFIG_ARM64_SME) as a result of fixes to our context-switching, signal handling and ptrace code

Entry code:
- Hook up TIF_NEED_RESCHED_LAZY so that CONFIG_PREEMPT_LAZY can be selected

Memory management:
- Prevent BSS exports from being used by the early PI code
- Propagate level and stride information to the low-level TLB invalidation routines when operating on hugetlb entries
- Use the page-table contiguous hint for vmap() mappings with VM_ALLOW_HUGE_VMAP where possible
- Optimise vmalloc()/vmap() page-table updates to use "lazy MMU mode" and hook this up on arm64 so that the trailing DSB (used to publish the updates to the hardware walker) can be deferred until the end of the mapping operation
- Extend mmap() randomisation for 52-bit virtual addresses (on par with 48-bit addressing) and remove limited support for randomisation of the linear map

Perf and PMUs:
- Add support for probing the CMN-S3 driver using ACPI
- Minor driver fixes to the CMN, Arm-NI and amlogic PMU drivers

Selftests:
- Fix FPSIMD and SME tests to align with the freshly re-enabled SME support
- Fix default setting of the OUTPUT variable so that tests are installed in the right location

vDSO:
- Replace raw counter access from inline assembly code with a call to the __arch_counter_get_cntvct() helper function

Miscellaneous:
- Add some missing header inclusions to the CCA headers
- Rework rendering of /proc/cpuinfo to follow the x86 approach and avoid repeated buffer expansion (the user-visible format remains identical)
- Remove redundant selection of CONFIG_CRC32
- Extend early error message when failing to map the device-tree blob"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (83 commits) arm64: cputype: Add cputype definition for HIP12 arm64: el2_setup.h: Make __init_el2_fgt labels consistent, again perf/arm-cmn: Add CMN S3 ACPI binding arm64/boot: Disallow BSS exports to startup code arm64/boot: Move global CPU override variables out of BSS arm64/boot: Move init_pgdir[] and init_idmap_pgdir[] into __pi_ namespace perf/arm-cmn: Initialise cmn->cpu earlier kselftest/arm64: Set default OUTPUT path when undefined arm64: Update comment regarding values in __boot_cpu_mode arm64: mm: Drop redundant check in pmd_trans_huge() arm64/mm: Re-organise setting up FEAT_S1PIE registers PIRE0_EL1 and PIR_EL1 arm64/mm: Permit lazy_mmu_mode to be nested arm64/mm: Disable barrier batching in interrupt contexts arm64/cpuinfo: only show one cpu's info in c_show() arm64/mm: Batch barriers when updating kernel mappings mm/vmalloc: Enter lazy mmu mode while manipulating vmalloc ptes arm64/mm: Support huge pte-mapped pages in vmap mm/vmalloc: Gracefully unmap
huge ptes mm/vmalloc: Warn on improper use of vunmap_range() arm64/mm: Hoist barriers out of set_ptes_anysz() loop ...
2025-05-28Merge tag 'hardening-v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linuxLinus Torvalds1-1/+2
Pull hardening updates from Kees Cook:
- Update overflow helpers to ease refactoring of on-stack flex array instances (Gustavo A. R. Silva, Kees Cook)
- lkdtm: Use SLAB_NO_MERGE instead of constructors (Harry Yoo)
- Simplify CONFIG_CC_HAS_COUNTED_BY (Jan Hendrik Farr)
- Disable u64 usercopy KUnit test on 32-bit SPARC (Thomas Weißschuh)
- Add missed designated initializers now exposed by fixed randstruct (Nathan Chancellor, Kees Cook)
- Document compiler versions for __builtin_dynamic_object_size
- Remove ARM_SSP_PER_TASK GCC plugin
- Fix GCC plugin randstruct, add selftests, and restore COMPILE_TEST builds
- Kbuild: induce full rebuilds when dependencies change with GCC plugins, the Clang sanitizer .scl file, or the randstruct seed
- Kbuild: Switch from -Wvla to -Wvla-larger-than=1
- Correct several __nonstring uses for -Wunterminated-string-initialization

* tag 'hardening-v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (23 commits) Revert "hardening: Disable GCC randstruct for COMPILE_TEST" lib/tests: randstruct: Add deep function pointer layout test lib/tests: Add randstruct KUnit test randstruct: gcc-plugin: Remove bogus void member net: qede: Initialize qede_ll_ops with designated initializer scsi: qedf: Use designated initializer for struct qed_fcoe_cb_ops md/bcache: Mark __nonstring look-up table integer-wrap: Force full rebuild when .scl file changes randstruct: Force full rebuild when seed changes gcc-plugins: Force full rebuild when plugins change kbuild: Switch from -Wvla to -Wvla-larger-than=1 hardening: simplify CONFIG_CC_HAS_COUNTED_BY overflow: Fix direct struct member initialization in _DEFINE_FLEX() kunit/overflow: Add tests for STACK_FLEX_ARRAY_SIZE() helper overflow: Add STACK_FLEX_ARRAY_SIZE() helper input/joystick: magellan: Mark __nonstring look-up table const watchdog: exar: Shorten identity name to fit correctly mod_devicetable: Enlarge the maximum platform_device_id name length overflow: Clarify expectations for getting DEFINE_FLEX variable sizes compiler_types: Identify compiler versions for __builtin_dynamic_object_size ...
2025-05-27Merge tag 'cgroup-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroupLinus Torvalds1-2/+2
Pull cgroup updates from Tejun Heo:
- cgroup rstat shared the tracking tree across all controllers, with the rationale being that a cgroup which is using one resource is likely to be using other resources at the same time (i.e. if something is allocating memory, it's probably consuming CPU cycles). However, this turned out to not scale very well, especially with memcg using rstat for internal operations, which made memcg stat read and flush patterns substantially different from other controllers. JP Kobryn split the rstat tree per controller.
- cgroup BPF support was hooking into cgroup init/exit paths directly. Convert them to use a notifier chain instead so that other usages can be added easily. Two of the patches which implement this are mislabeled as belonging to sched_ext instead of cgroup. Sorry.
- Relatively minor cpuset updates
- Documentation updates

* tag 'cgroup-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (23 commits) sched_ext: Convert cgroup BPF support to use cgroup_lifetime_notifier sched_ext: Introduce cgroup_lifetime_notifier cgroup: Minor reorganization of cgroup_create() cgroup, docs: cpu controller's interaction with various scheduling policies cgroup, docs: convert space indentation to tab indentation cgroup: avoid per-cpu allocation of size zero rstat cpu locks cgroup, docs: be specific about bandwidth control of rt processes cgroup: document the rstat per-cpu initialization cgroup: helper for checking rstat participation of css cgroup: use subsystem-specific rstat locks to avoid contention cgroup: use separate rstat trees for each subsystem cgroup: compare css to cgroup::self in helper for distingushing css cgroup: warn on rstat usage by early init subsystems cgroup/cpuset: drop useless cpumask_empty() in compute_effective_exclusive_cpumask() cgroup/rstat: Improve cgroup_rstat_push_children() documentation cgroup: fix goto ordering in cgroup_init() cgroup: fix pointer check in css_rstat_init() cgroup/cpuset: Add warnings to catch inconsistency in exclusive CPUs cgroup/cpuset: Fix obsolete comment in cpuset_css_offline() cgroup/cpuset: Always use cpu_active_mask ...
2025-05-27mm: pcp: increase pcp->free_count threshold to trigger free_highNikhil Dhama1-1/+1
In the old pcp design, pcp->free_factor gets incremented in nr_pcp_free(), which is invoked by free_pcppages_bulk(). So it used to increase free_factor by 1 only when we try to reduce the size of the pcp list, and free_high used to trigger only for order > 0 and order < costly_order and pcp->free_factor > 0.

For iperf3 I noticed that with the older design in kernel v6.6, the pcp list was drained mostly when pcp->count > high (more often when count goes above 530), and most of the time pcp->free_factor was 0, triggering very few high order flushes.

But this changed in the current design, introduced in commit 6ccdcb6d3a74 ("mm, pcp: reduce detecting time of consecutive high order page freeing"), where pcp->free_factor was changed to pcp->free_count to keep track of the number of pages freed contiguously. In this design, pcp->free_count is incremented on every deallocation, irrespective of whether the pcp list was reduced or not, and the logic to trigger free_high is that pcp->free_count goes above batch (which is 63) and there are two contiguous page frees without any allocation.

With this design, for iperf3, the pcp list is flushed more frequently because the free_high heuristic is triggered more often now. I observed that the high order pcp list is drained as soon as both count and free_count go above 63.

Due to this more aggressive high order flushing, applications doing contiguous high order allocations need to go to the global list more frequently. On a 2-node AMD machine with 384 vCPUs on each node, connected via Mellanox ConnectX-7, I am seeing a ~30% performance reduction if we scale the number of iperf3 client/server pairs from 32 to 64.

Though the new design reduced the time to detect high order flushes, for applications which allocate high order pages more frequently it may flush the high order list prematurely. This motivates tuning how late or early we should flush high order lists.

So, in this patch, we increased the pcp->free_count threshold to trigger free_high from "batch" to "batch + pcp->high_min / 2", as suggested by Ying [1]. In the original pcp->free_factor solution, free_high is triggered for contiguous freeing with size ranging from "batch" to "pcp->high + batch", so the average value is "batch + pcp->high / 2". In the pcp->free_count solution, free_high will be triggered for contiguous freeing with size "batch". So, to restore the original behavior, we can use the threshold "batch + pcp->high_min / 2" (a sketch of the resulting check follows after the benchmark numbers below).

This new threshold keeps high order pages in the pcp list for a longer duration, which can help applications doing high order allocations frequently.

With this patch, performance of iperf3 is restored and scores for other benchmarks on the same machine are as follows:

                        iperf3   lmbench3    netperf             kbuild
                                 (AF_UNIX)   (SCTP_STREAM_MANY)
                        -------  ---------   -----------------   ------
v6.6  vanilla (base)      100      100            100              100
v6.12 vanilla              69      113             98.5             98.8
v6.12 + this patch        100      110.3          100.2             99.3

netperf-tcp:

                        6.12                 6.12
                       vanilla            this_patch
Hmean     64         732.14 (  0.00%)     730.45 ( -0.23%)
Hmean    128        1417.46 (  0.00%)    1419.44 (  0.14%)
Hmean    256        2679.67 (  0.00%)    2676.45 ( -0.12%)
Hmean   1024        8328.52 (  0.00%)    8339.34 (  0.13%)
Hmean   2048       12716.98 (  0.00%)   12743.68 (  0.21%)
Hmean   3312       15787.79 (  0.00%)   15887.25 (  0.63%)
Hmean   4096       17311.91 (  0.00%)   17332.68 (  0.12%)
Hmean   8192       20310.73 (  0.00%)   20465.09 (  0.76%)

Link: https://lore.kernel.org/all/875xjmuiup.fsf@DESKTOP-5N7EMDA/ [1]
Link: https://lkml.kernel.org/r/20250407105219.55351-1-nikhil.dhama@amd.com
Fixes: 6ccdcb6d3a74 ("mm, pcp: reduce detecting time of consecutive high order page freeing")
Signed-off-by: Nikhil Dhama <nikhil.dhama@amd.com>
Suggested-by: Huang Ying <ying.huang@linux.alibaba.com>
Reviewed-by: Huang Ying <ying.huang@linux.alibaba.com>
Cc: Raghavendra K T <raghavendra.kt@amd.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Bharata B Rao <bharata@amd.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
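The threshold change above boils down to a single comparison. The sketch below restates it with a reduced stand-in structure; the field names (free_count, high_min) and the batch value follow the commit text, but this is illustrative code, not the kernel's actual implementation.

/*
 * Reduced stand-in for the per-CPU page list state; only the fields the
 * free_high heuristic needs are shown.
 */
struct pcp_sketch {
	int free_count;	/* pages freed contiguously without an allocation */
	int high_min;	/* lower bound of the auto-tuned pcp->high */
};

static int should_trigger_free_high(const struct pcp_sketch *pcp, int batch)
{
	/* Old behaviour: any contiguous run beyond one batch (63 pages)
	 * armed free_high:
	 *	return pcp->free_count > batch;
	 */

	/* New behaviour: wait for roughly batch + high_min / 2 pages,
	 * matching the average trigger point of the original
	 * pcp->free_factor scheme. */
	return pcp->free_count > batch + pcp->high_min / 2;
}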
2025-05-27mm/hugetlb: convert use of struct page to folio in __unmap_hugepage_range()Fan Ni1-11/+13
In __unmap_hugepage_range(), the "page" pointer always points to the first page of a huge page, which guarantees there is a folio associated with it. Convert the "page" pointer to use a folio.

Link: https://lkml.kernel.org/r/20250505182345.506888-6-nifan.cxl@gmail.com
Signed-off-by: Fan Ni <fan.ni@samsung.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: "Vishal Moola (Oracle)" <vishal.moola@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-27mm/hugetlb: refactor __unmap_hugepage_range() to take folio instead of pageFan Ni1-9/+9
The function __unmap_hugepage_range() has two kinds of users:
1) unmap_hugepage_range(), which passes in the head page of a folio. Since unmap_hugepage_range() already takes a folio and there are no other uses of the folio struct in the function, it is natural for __unmap_hugepage_range() to take a folio also.
2) All other uses, which pass in a NULL pointer.
In both cases, we can pass in a folio. Refactor __unmap_hugepage_range() to take a folio, as sketched below.

Link: https://lkml.kernel.org/r/20250505182345.506888-5-nifan.cxl@gmail.com
Signed-off-by: Fan Ni <fan.ni@samsung.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: "Vishal Moola (Oracle)" <vishal.moola@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
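For reference, a sketch of the interface change this refactor describes. The parameter list is reconstructed from the summary and is an assumption; only the reference page/folio argument changes type, and NULL remains a valid value for it.

/* Before (assumed): callers passed the head page of the huge folio, or NULL.
 *
 * void __unmap_hugepage_range(struct mmu_gather *tlb,
 *			       struct vm_area_struct *vma,
 *			       unsigned long start, unsigned long end,
 *			       struct page *ref_page, zap_flags_t zap_flags);
 */

/* After: the same information is expressed directly as a folio, or NULL. */
void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
			    unsigned long start, unsigned long end,
			    struct folio *folio, zap_flags_t zap_flags);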