path: root/tools/perf/scripts/python/export-to-postgresql.py
2025-06-05MAINTAINERS: add tlb trace events to MMU GATHER AND TLB INVALIDATIONTal Zussman1-0/+1
The MMU GATHER AND TLB INVALIDATION entry lists other TLB-related files. Add the tlb.h tracepoint file there as well. Link: https://lore.kernel.org/linux-mm/ce048e11-f79d-44a6-bacc-46e1ebc34b24@redhat.com/ Link: https://lkml.kernel.org/r/20250603-tlb-maintainers-v1-1-726d193c6693@columbia.edu Signed-off-by: Tal Zussman <tz2294@columbia.edu> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-05mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast raceJann Horn1-0/+7
huge_pmd_unshare() drops a reference on a page table that may have previously been shared across processes, potentially turning it into a normal page table used in another process in which unrelated VMAs can afterwards be installed. If this happens in the middle of a concurrent gup_fast(), gup_fast() could end up walking the page tables of another process. While I don't see any way in which that immediately leads to kernel memory corruption, it is really weird and unexpected. Fix it with an explicit broadcast IPI through tlb_remove_table_sync_one(), just like we do in khugepaged when removing page tables for a THP collapse. Link: https://lkml.kernel.org/r/20250528-hugetlb-fixes-splitrace-v2-2-1329349bad1a@google.com Link: https://lkml.kernel.org/r/20250527-hugetlb-fixes-splitrace-v1-2-f4136f5ec58a@google.com Fixes: 39dde65c9940 ("[PATCH] shared page table for hugetlb page") Signed-off-by: Jann Horn <jannh@google.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-05mm/hugetlb: unshare page tables during VMA split, not beforeJann Horn4-16/+56
Currently, __split_vma() triggers hugetlb page table unsharing through vm_ops->may_split(). This happens before the VMA lock and rmap locks are taken, which is too early: it allows racing VMA-locked page faults in our process and racing rmap walks from other processes to cause page tables to be shared again before we actually perform the split.

Fix it by explicitly calling into the hugetlb unshare logic from __split_vma() in the same place where THP splitting also happens. At that point, both the VMA and the rmap(s) are write-locked.

An annoying detail is that we can now call into the helper hugetlb_unshare_pmds() from two different locking contexts:

1. from hugetlb_split(), holding:
   - mmap lock (exclusively)
   - VMA lock
   - file rmap lock (exclusively)

2. from hugetlb_unshare_all_pmds(), which I think is designed to be able to call us with only the mmap lock held (in shared mode), but currently only runs while holding the mmap lock (exclusively) and the VMA lock

Backporting note: This commit fixes a racy protection that was introduced in commit b30c14cd6102 ("hugetlb: unshare some PMDs when splitting VMAs"); that commit claimed to fix an issue introduced in 5.13, but it should actually also go all the way back.

[jannh@google.com: v2]
Link: https://lkml.kernel.org/r/20250528-hugetlb-fixes-splitrace-v2-1-1329349bad1a@google.com
Link: https://lkml.kernel.org/r/20250528-hugetlb-fixes-splitrace-v2-0-1329349bad1a@google.com
Link: https://lkml.kernel.org/r/20250527-hugetlb-fixes-splitrace-v1-1-f4136f5ec58a@google.com
Fixes: 39dde65c9940 ("[PATCH] shared page table for hugetlb page")
Signed-off-by: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org> [b30c14cd6102: hugetlb: unshare some PMDs when splitting VMAs]
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-05MAINTAINERS: add Alistair as reviewer of mm memory policyAlistair Popple1-0/+1
I'm particularly familiar with mm/migrate.c and especially mm/migrate_device.c so add myself to MAINTAINERS. Link: https://lkml.kernel.org/r/20250530014917.2946940-1-apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Matthew Brost <matthew.brost@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-05iov_iter: use iov_offset for length calculation in iov_iter_aligned_bvecNitesh Shetty1-1/+1
If iov_offset is non-zero, we need to take iov_offset into account in the length calculation; otherwise we might let through smaller IOs, such as 512 bytes, in the scenario below [1]. The issue is reproducible using the liburing test/fixed-seg.c application with a fixed buffer on a 512-byte LBA formatted device.

[1] At present we pass the alignment check for 512-byte LBA formatted devices when the IO is smaller: len_mask = 511, i->count = 512 with an offset, i->iov_offset = 3584, and bvec values bvec->bv_offset = 256, bvec->bv_len = 3840. In short, the first 256 bytes are in the current page and the next 256 bytes are in another page. Ideally we expect the IO to fail.

I can think of two userspace scenarios where we experience this.

a: From userspace, we observe different behaviour when the device LBA size is 512 vs 4096 bytes. For a 4096-byte LBA formatted device, I see the same liburing test [2] failing, whereas with 512 the test passes without this patch. This is reproducible every time.

[2] https://github.com/axboe/liburing/

b: Although I was not able to reproduce the condition below, I suspect it should be possible from userspace on a 512-byte LBA formatted device. Let's say that, while allocating a virtually single chunk of memory from userspace, we get two physical chunks, and the IO happens to start at the boundary of the first physical chunk with a length crossing it. We would then allow the IO to proceed, and hence might map the wrong physical address length and continue with the IO rather than failing it.

--- a/test/fixed-seg.c
+++ b/test/fixed-seg.c
@@ -64,7 +64,7 @@ static int test(struct io_uring *ring, int fd, int vec_off)
 		return T_EXIT_FAIL;
 	}
 
-	ret = read_it(ring, fd, 4096, vec_off);
+	ret = read_it(ring, fd, 4096, 7*512 + 256);
 	if (ret) {
 		fprintf(stderr, "4096 0 failed\n");
 		return T_EXIT_FAIL;

Effectively this is a write crossing the page boundary.

Link: https://lkml.kernel.org/r/20250428095849.11709-1-nj.shetty@samsung.com
Fixes: 2263639f96f2 ("iov_iter: streamline iovec/bvec alignment iteration")
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Keith Busch <kbusch@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
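A minimal, self-contained sketch of the idea behind the one-line fix (a simplified model, not the verbatim kernel code; the helper name and local structure are illustrative). With the commit's numbers (bv_len = 3840, iov_offset = 3584, count = 512, len_mask = 511), the first segment's usable length becomes 256, which now correctly fails the mask check:

#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-in for the kernel type, to keep the sketch compilable. */
struct bio_vec {
    unsigned int bv_offset;
    unsigned int bv_len;
};

/*
 * Simplified model of the bvec alignment walk: reject the iterator if
 * any segment violates the length/address masks. The fix is subtracting
 * 'skip' (i->iov_offset) from the first segment's length before testing.
 */
static bool bvec_iter_aligned(const struct bio_vec *bvec, size_t count,
                              size_t iov_offset, unsigned int addr_mask,
                              unsigned int len_mask)
{
    size_t skip = iov_offset;

    do {
        size_t len = bvec->bv_len - skip;   /* the fix: honour the offset */

        if (len > count)
            len = count;
        if (len & len_mask)                 /* e.g. 256 & 511 != 0: reject */
            return false;
        if ((bvec->bv_offset + skip) & addr_mask)
            return false;
        count -= len;
        bvec++;
        skip = 0;   /* only the first segment carries the offset */
    } while (count);

    return true;
}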
2025-06-05mm/mempolicy: fix incorrect freeing of wi_kobjJoshua Hahn1-3/+1
We should not free wi_group->wi_kobj here. In the error path of add_weighted_interleave_group() where this snippet is called from, kobj_{del, put} is immediately called right after this section. Thus, it is not only unnecessary but also incorrect to free it here. Link: https://lkml.kernel.org/r/20250602162345.2595696-1-joshua.hahnjy@gmail.com Fixes: e341f9c3c841 ("mm/mempolicy: Weighted Interleave Auto-tuning") Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202506011545.Fduxqxqj-lkp@intel.com/ Cc: Alistair Popple <apopple@nvidia.com> Cc: Byungchul Park <byungchul@sk.com> Cc: David Hildenbrand <david@redhat.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-05alloc_tag: handle module codetag load errors as module load failuresSuren Baghdasaryan4-20/+39
Failures inside codetag_load_module() are currently ignored. As a result, an error there does not cause a module load failure or the freeing of the associated resources. Correct this behavior by propagating the error code to the caller and handling possible errors. With this change, an error allocating percpu counters, which happens at this stage, will no longer be ignored and will cause a module load failure and the freeing of resources. We also no longer need to disable memory allocation profiling when this error happens; instead we fail to load the module. Link: https://lkml.kernel.org/r/20250521160602.1940771-1-surenb@google.com Fixes: 10075262888b ("alloc_tag: allocate percpu counters for module tags dynamically") Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reported-by: Casey Chen <cachen@purestorage.com> Closes: https://lore.kernel.org/all/20250520231620.15259-1-cachen@purestorage.com/ Cc: Daniel Gomez <da.gomez@samsung.com> Cc: David Wang <00107082@163.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Petr Pavlu <petr.pavlu@suse.com> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-05mm/madvise: handle madvise_lock() failure during race unwindingSeongJae Park1-1/+4
When unwinding the race on -ERESTARTNOINTR handling in process_madvise(), a madvise_lock() failure is ignored. Check for the failure and abort the remaining work in that case. Link: https://lkml.kernel.org/r/20250602174926.1074-1-sj@kernel.org Fixes: 4000e3d0a367 ("mm/madvise: remove redundant mmap_lock operations from process_madvise()") Signed-off-by: SeongJae Park <sj@kernel.org> Reported-by: Barry Song <21cnbao@gmail.com> Closes: https://lore.kernel.org/CAGsJ_4xJXXO0G+4BizhohSZ4yDteziPw43_uF8nPXPWxUVChzw@mail.gmail.com Reviewed-by: Jann Horn <jannh@google.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Barry Song <baohua@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-05mm: fix vmstat after removing NR_BOUNCEKirill A. Shutemov1-1/+0
Hongyu noticed that the nr_unaccepted counter kept growing even in the absence of unaccepted memory on the machine. This happens due to a commit that removed NR_BOUNCE: it removed the counter from the enum zone_stat_item, but left it in the vmstat_text array. As a result, all counters below nr_bounce in /proc/vmstat are shifted by one line, causing the numa_hit counter to be labeled as nr_unaccepted. To fix this issue, remove nr_bounce from the vmstat_text array. Link: https://lkml.kernel.org/r/20250529103832.2937460-1-kirill.shutemov@linux.intel.com Fixes: 194df9f66db8 ("mm: remove NR_BOUNCE zone stat") Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Hongyu Ning <hongyu.ning@linux.intel.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
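The failure mode is easy to reproduce in miniature: delete an entry from the enum but leave its name string in the parallel table, and every counter after it is printed under the wrong label. A hypothetical, condensed illustration (the real enum and table are much larger):

#include <stdio.h>

/* NR_BOUNCE was deleted from the enum... */
enum zone_stat_item { NR_FREE_PAGES, NR_UNACCEPTED, NR_VM_ZONE_STAT_ITEMS };

/* ...but its name stayed in the parallel string table, so every name
 * from "nr_bounce" down is off by one relative to the enum. */
static const char *const vmstat_text[] = {
    "nr_free_pages",
    "nr_bounce",        /* stale entry: deleting this line is the fix */
    "nr_unaccepted",
};

int main(void)
{
    long stats[NR_VM_ZONE_STAT_ITEMS] = { 12345, 0 };

    /* NR_UNACCEPTED's value is printed under the label "nr_bounce",
     * and whatever follows would be mislabeled "nr_unaccepted". */
    for (int i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
        printf("%s %ld\n", vmstat_text[i], stats[i]);
    return 0;
}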
2025-06-05KVM: s390: rename PROT_NONE to PROT_TYPE_DUMMYLorenzo Stoakes1-4/+4
The enum type prot_type declared in arch/s390/kvm/gaccess.c declares an unfortunate identifier within it - PROT_NONE. This clashes with the protection bit define from the uapi for mmap() declared in include/uapi/asm-generic/mman-common.h, which is indeed what those casually reading this code would assume this to refer to. This means that any changes which subsequently alter headers in any way which results in the uapi header being imported here will cause build errors. Resolve the issue by renaming PROT_NONE to PROT_TYPE_DUMMY. Link: https://lkml.kernel.org/r/20250519145657.178365-1-lorenzo.stoakes@oracle.com Fixes: b3cefd6bf16e ("KVM: s390: Pass initialized arg even if unused") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Suggested-by: Ignacio Moreno Gonzalez <Ignacio.MorenoGonzalez@kuka.com> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202505140943.IgHDa9s7-lkp@intel.com/ Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com> Acked-by: Ignacio Moreno Gonzalez <Ignacio.MorenoGonzalez@kuka.com> Acked-by: Yang Shi <yang@os.amperecomputing.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: <stable@vger.kernel.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: James Houghton <jthoughton@google.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-02rtmutex_api: provide correct extern functionsPaolo Bonzini1-12/+21
Commit fb49f07ba1d9 ("locking/mutex: implement mutex_lock_killable_nest_lock") changed the set of functions that mutex.c defines when CONFIG_DEBUG_LOCK_ALLOC is set. - it removed the "extern" declaration of mutex_lock_killable_nested from include/linux/mutex.h, and replaced it with a macro since it could be treated as a special case of _mutex_lock_killable. It also removed a definition of the function in kernel/locking/mutex.c. - likewise, it introduced the more generic mutex_trylock_nest_lock() and turned mutex_trylock() into a macro. However, it left the old definitions in place in kernel/locking/rtmutex_api.c, which causes failures when building with CONFIG_RT_MUTEXES=y. Bring over the changes. Fixes: fb49f07ba1d9 ("locking/mutex: implement mutex_lock_killable_nest_lock") Reported-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-06-01randstruct: gcc-plugin: Fix attribute additionKees Cook2-11/+43
Based on changes in the 2021 public version of the randstruct out-of-tree GCC plugin[1], more carefully update the attributes on resulting decls, to avoid tripping checks in GCC 15's comptypes_check_enum_int() when it has been configured with "--enable-checking=misc":

arch/arm64/kernel/kexec_image.c:132:14: internal compiler error: in comptypes_check_enum_int, at c/c-typeck.cc:1519
  132 | const struct kexec_file_ops kexec_image_ops = {
      |              ^~~~~~~~~~~~~~
internal_error(char const*, ...), at gcc/gcc/diagnostic-global-context.cc:517
fancy_abort(char const*, int, char const*), at gcc/gcc/diagnostic.cc:1803
comptypes_check_enum_int(tree_node*, tree_node*, bool*), at gcc/gcc/c/c-typeck.cc:1519
...

Link: https://archive.org/download/grsecurity/grsecurity-3.1-5.10.41-202105280954.patch.gz [1]
Reported-by: Thiago Jung Bauermann <thiago.bauermann@linaro.org>
Closes: https://github.com/KSPP/linux/issues/367
Closes: https://lore.kernel.org/lkml/20250530000646.104457-1-thiago.bauermann@linaro.org/
Reported-by: Ingo Saitz <ingo@hannover.ccc.de>
Closes: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1104745
Fixes: 313dd1b62921 ("gcc-plugins: Add the randstruct plugin")
Tested-by: Thiago Jung Bauermann <thiago.bauermann@linaro.org>
Link: https://lore.kernel.org/r/20250530221824.work.623-kees@kernel.org
Signed-off-by: Kees Cook <kees@kernel.org>
2025-06-01overflow: Introduce __DEFINE_FLEX for having no initializerKees Cook1-6/+19
While not yet in the tree, there is a proposed patch[1] that depends on the prior behavior of _DEFINE_FLEX, which did not have an explicit initializer. Provide this via __DEFINE_FLEX now, which can also have attributes applied (e.g. __uninitialized). Examples of the resulting initializer behaviors can be seen here: https://godbolt.org/z/P7Go8Tr33 Link: https://lore.kernel.org/netdev/20250520205920.2134829-9-anthony.l.nguyen@intel.com [1] Fixes: 47e36ed78406 ("overflow: Fix direct struct member initialization in _DEFINE_FLEX()") Signed-off-by: Kees Cook <kees@kernel.org>
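For context, the DEFINE_FLEX() family builds an on-stack buffer sized for a struct with a flexible-array member. A rough, self-contained sketch of the pattern (not the actual overflow.h implementation; the real macros also handle attributes and counter initialization):

#include <stdio.h>
#include <stddef.h>

struct msg {
    size_t count;
    int data[];     /* flexible-array member */
};

/* Sketch of the no-initializer variant the commit describes: a union
 * reserves room for 'cnt' trailing elements behind a typed pointer. */
#define SKETCH_DEFINE_FLEX(type, name, member, cnt)                        \
    union {                                                                \
        char bytes[sizeof(type) + sizeof(((type *)0)->member[0]) * (cnt)]; \
        type obj;                                                          \
    } name##_u;                                                            \
    type *name = &name##_u.obj

int main(void)
{
    SKETCH_DEFINE_FLEX(struct msg, m, data, 8);

    m->count = 8;
    for (size_t i = 0; i < m->count; i++)
        m->data[i] = (int)i;
    printf("%d\n", m->data[7]);
    return 0;
}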
2025-06-01watchdog: iTCO_wdt: Update the heartbeat value after clamping timeoutZiyan Fu1-0/+1
When executing "modprobe iTCO_wdt heartbeat=700", the user-specified 'heartbeat' parameter exceeds the valid range, the driver clamps the timeout to default 30s but fails to update the logged 'heartbeat' value, resulting in misleading log output: iTCO_wdt iTCO_wdt: timeout value out of range, using 30 iTCO_wdt iTCO_wdt: initialized. heartbeat=700 sec (nowayout=0) After validating the range, update the 'heartbeat' value with the clamped timeout value to ensure that log messages accurately reflect the actual runtime parameters. Signed-off-by: Ziyan Fu <fuzy5@lenovo.com> Reviewed-by: Wim Van Sebroeck <wim@linux-watchdog.org> Link: https://lore.kernel.org/r/20250429102533.11886-1-13281011316@163.com Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Wim Van Sebroeck <wim@linux-watchdog.org>
2025-06-01watchdog: Add driver for Intel OC WDTDiogo Ivo3-0/+245
Add a driver for the Intel Over-Clocking Watchdog found in Intel Platform Controller Hub (PCH) chipsets. This watchdog is controlled via a simple single-register interface and would otherwise be standard except for the presence of a LOCK bit that can only be set once per power cycle, needing extra handling around it. Signed-off-by: Diogo Ivo <diogo.ivo@siemens.com> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20250317-ivo-intel_oc_wdt-v3-1-32c396f4eefd@siemens.com Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Wim Van Sebroeck <wim@linux-watchdog.org>
2025-06-01watchdog: arm_smc_wdt: get wdt status through SMCWD_GET_TIMELEFTAntonio Borneo1-3/+14
The optional SMCWD_GET_TIMELEFT command can be used to detect if the watchdog has already been started. See the implementation in OP-TEE secure OS [1]. At probe time, check if the watchdog is already started and then set WDOG_HW_RUNNING in the watchdog status. This will cause the watchdog framework to ping the watchdog until a userspace watchdog daemon takes over the control. Link: https://github.com/OP-TEE/optee_os/commit/a7f2d4bd8632 [1] Signed-off-by: Antonio Borneo <antonio.borneo@foss.st.com> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20250520085952.210723-1-antonio.borneo@foss.st.com Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Wim Van Sebroeck <wim@linux-watchdog.org>
2025-06-01watchdog: iTCO: Drop driver-internal lockingGuenter Roeck1-24/+0
The locking code in the iTCO watchdog driver has been carried along from before the watchdog core existed. The watchdog core protects calls into drivers since commit f4e9c82f64b5 ("watchdog: Add Locking support"), making driver-internal locking unnecessary. Drop it. Signed-off-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: Wim Van Sebroeck <wim@linux-watchdog.org> Link: https://lore.kernel.org/r/20250517160936.3231017-1-linux@roeck-us.net Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Wim Van Sebroeck <wim@linux-watchdog.org>
2025-06-01watchdog: apple: set max_hw_heartbeat_ms instead of max_timeoutFlorian Klink1-2/+5
The hardware only supports timeouts slightly below 3 minutes, but by using max_hw_heartbeat_ms we can let the watchdog core take care of supporting the larger timeouts requested from userspace. Switching to max_hw_heartbeat_ms also means our set_timeout function now needs to configure the hardware to the minimum of either the requested timeout (in seconds) or the maximum supported by the hardware (in seconds). Signed-off-by: Florian Klink <flokli@flokli.de> Reviewed-by: Wim Van Sebroeck <wim@linux-watchdog.org> Link: https://lore.kernel.org/r/20250506142621.11428-2-flokli@flokli.de Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Wim Van Sebroeck <wim@linux-watchdog.org>
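A sketch of the resulting timeout selection (the hardware limit value below is an assumption standing in for "slightly below 3 minutes"; the function name is illustrative):

#include <stdio.h>

#define APPLE_WDT_MAX_HW_S 170  /* hypothetical hardware ceiling, seconds */

/* Program the hardware to min(requested, hw max); the watchdog core
 * re-pings often enough to emulate the larger userspace timeout. */
static unsigned int apple_wdt_pick_hw_timeout(unsigned int requested_s)
{
    return requested_s < APPLE_WDT_MAX_HW_S ? requested_s : APPLE_WDT_MAX_HW_S;
}

int main(void)
{
    printf("%u\n", apple_wdt_pick_hw_timeout(600)); /* -> 170 */
    return 0;
}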
2025-06-01watchdog: qcom: introduce the device data for IPQ5424 watchdog deviceKathiravan Thirumoorthy1-0/+7
To retrieve the restart reason from IMEM, certain device-specific data, such as the IMEM compatible to look up and the location in IMEM to read, must be defined. To achieve that, introduce separate device data for IPQ5424 and add the required details subsequently. Signed-off-by: Kathiravan Thirumoorthy <kathiravan.thirumoorthy@oss.qualcomm.com> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20250502-wdt_reset_reason-v3-3-b2dc7ace38ca@oss.qualcomm.com Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Wim Van Sebroeck <wim@linux-watchdog.org>
2025-06-01dt-bindings: watchdog: renesas,wdt: Document RZ/V2N (R9A09G056) supportLad Prabhakar1-1/+3
Document support for the watchdog IP found on the Renesas RZ/V2N (R9A09G056) SoC. The watchdog IP is identical to that on RZ/V2H(P), so `renesas,r9a09g057-wdt` will be used as a fallback compatible, enabling reuse of the existing driver without changes. Signed-off-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com> Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be> Acked-by: Conor Dooley <conor.dooley@microchip.com> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20250502120054.47323-1-prabhakar.mahadev-lad.rj@bp.renesas.com Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Wim Van Sebroeck <wim@linux-watchdog.org>
2025-06-01watchdog: lenovo_se30_wdt: Fix possible devm_ioremap() NULL pointer dereference in lenovo_se30_wdt_probe()Henry Martin1-0/+2
devm_ioremap() returns NULL on error. Currently, lenovo_se30_wdt_probe() does not check for this case, which results in a NULL pointer dereference. Add NULL check after devm_ioremap() to prevent this issue. Fixes: c284153a2c55 ("watchdog: lenovo_se30_wdt: Watchdog driver for Lenovo SE30 platform") Signed-off-by: Henry Martin <bsdhenrymartin@gmail.com> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20250424071648.89016-1-bsdhenrymartin@gmail.com Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Wim Van Sebroeck <wim@linux-watchdog.org>
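The shape of the fix, sketched in kernel style (the variable names here are illustrative, not quoted from the driver); note that devm_ioremap() reports failure with NULL rather than an ERR_PTR, so the check is for NULL:

    priv->shm_base_addr = devm_ioremap(dev, base_phys, SHM_WINDOW_SIZE);
    if (!priv->shm_base_addr)   /* the added check */
        return -ENOMEM;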
2025-06-01watchdog: s3c2410_wdt: Add exynos990-wdt compatible dataIgor Belwon1-1/+38
The Exynos990 has two watchdog clusters - cl0 and cl2. Add new driver data for these two clusters, making it possible to use the watchdog timer on this SoC. Signed-off-by: Igor Belwon <igor.belwon@mentallysanemainliners.org> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20250420-wdt-resends-april-v1-2-f58639673959@mentallysanemainliners.org Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Wim Van Sebroeck <wim@linux-watchdog.org>
2025-06-01dt-bindings: watchdog: samsung-wdt: Add exynos990-wdt compatibleIgor Belwon1-4/+7
Add a dt-binding compatible for the Exynos990 Watchdog timer. This watchdog is compatible with the GS101/Exynos850 design, as such it requires the cluster-index and syscon-phandle properties to be present. It also contains a cl2 cluster, as such the cluster-index property has been expanded. Signed-off-by: Igor Belwon <igor.belwon@mentallysanemainliners.org> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20250420-wdt-resends-april-v1-1-f58639673959@mentallysanemainliners.org Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Wim Van Sebroeck <wim@linux-watchdog.org>
2025-05-31mm/khugepaged: clean up refcount check using folio_expected_ref_count()Shivank Garg1-15/+2
Use folio_expected_ref_count() instead of open-coded logic in is_refcount_suitable(). This avoids code duplication and improves clarity. Drop is_refcount_suitable() as it is no longer needed. Link: https://lkml.kernel.org/r/20250526182818.37978-2-shivankg@amd.com Signed-off-by: Shivank Garg <shivankg@amd.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Bharata B Rao <bharata@amd.com> Cc: Fengwei Yin <fengwei.yin@intel.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31selftests/mm: fix test result reporting in gup_longtermMark Brown1-56/+94
The kselftest framework uses the string logged when a test result is reported as the unique identifier for a test, using it to track test results between runs. The gup_longterm test fails to follow this pattern: it runs a single test function repeatedly with various parameters, but each result report is a string logging an error message which is fixed between runs. Since the code already logs each test uniquely before it starts, refactor to also print this to a buffer, then use that name as the test result. This isn't especially pretty but is relatively straightforward and is a great help to tooling. Link: https://lkml.kernel.org/r/20250527-selftests-mm-cow-dedupe-v2-4-ff198df8e38e@kernel.org Signed-off-by: Mark Brown <broonie@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31selftests/mm: report unique test names for each cow testMark Brown1-116/+217
The kselftest framework uses the string logged when a test result is reported as the unique identifier for a test, using it to track test results between runs. The cow test completely fails to follow this pattern: it runs test functions repeatedly with various parameters, with each result report from those functions being a string logging an error message which is fixed between runs. Since the code already logs each test uniquely before it starts, refactor to also print this to a buffer, then use that name as the test result. This isn't especially pretty but is relatively straightforward and is a great help to tooling. Link: https://lkml.kernel.org/r/20250527-selftests-mm-cow-dedupe-v2-3-ff198df8e38e@kernel.org Signed-off-by: Mark Brown <broonie@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31selftests/mm: add helper for logging test start and resultsMark Brown1-0/+20
Several of the MM tests have a pattern of printing a description of the test to be run then reporting the actual TAP result using a generic string not connected to the specific test, often in a shared function used by many tests. The name reported typically varies depending on the specific result rather than the test too. This causes problems for tooling that works with test results: the names reported with the results are used to deduplicate tests and track them between runs, so both duplicated names and changing names cause trouble for things like UIs and automated bisection. As a first step towards matching these tests better with the expectations of kselftest, provide helpers which record the test name as part of the initial print and then use that as part of reporting a result; see the sketch below. This is not added as a generic kselftest helper partly because the use of a variable to store the test name doesn't fit well with the header-only implementation of kselftest.h and partly because it's not really an intended pattern. Ideally at some point the mm tests that use it will be updated to not need it. Link: https://lkml.kernel.org/r/20250527-selftests-mm-cow-dedupe-v2-2-ff198df8e38e@kernel.org Signed-off-by: Mark Brown <broonie@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
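A self-contained sketch of the helper pair described: record the name at start time, reuse it verbatim at report time (the helper names here are illustrative, not the exact ones added by the patch):

#include <stdarg.h>
#include <stdbool.h>
#include <stdio.h>

static char test_name[1024];

/* Log the start of a test and remember its name for the result line. */
static void log_test_start(const char *fmt, ...)
{
    va_list ap;

    va_start(ap, fmt);
    vsnprintf(test_name, sizeof(test_name), fmt, ap);
    va_end(ap);
    printf("# [RUN] %s\n", test_name);
}

/* Report the result under exactly the same name, so tooling can match
 * results across runs regardless of pass/fail. */
static void log_test_result(bool ok)
{
    printf("%s %s\n", ok ? "ok" : "not ok", test_name);
}

int main(void)
{
    log_test_start("MAP_SHARED hugetlb (2048 kB)");
    log_test_result(true);
    return 0;
}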
2025-05-31selftests/mm: use standard ksft_finished() in cow and gup_longtermMark Brown2-12/+3
Patch series "selftests/mm: cow and gup_longterm cleanups", v2. The bulk of these changes modify the cow and gup_longterm tests to report unique and stable names for each test, bringing them into line with the expectations of tooling that works with kselftest. The string reported as a test result is used by tooling to both deduplicate tests and track tests between test runs, using the same string for multiple tests or changing the string depending on test result causes problems for user interfaces and automation such as bisection. It was suggested that converting to use kselftest_harness.h would be a good way of addressing this, however that really wants the set of tests to run to be known at compile time but both test programs dynamically enumarate the set of huge page sizes the system supports and test each. Refactoring to handle this would be even more invasive than these changes which are large but straightforward and repetitive. A version of the main gup_longterm cleanup was previously sent separately, this version factors out the helpers for logging the start of the test since the cow test looks very similar. This patch (of 4): The cow and gup_longterm test programs open code something that looks a lot like the standard ksft_finished() helper to summarise the test results and provide an exit code, convert to use ksft_finished(). Link: https://lkml.kernel.org/r/20250527-selftests-mm-cow-dedupe-v2-0-ff198df8e38e@kernel.org Link: https://lkml.kernel.org/r/20250527-selftests-mm-cow-dedupe-v2-1-ff198df8e38e@kernel.org Signed-off-by: Mark Brown <broonie@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31selftests/damon/_damon_sysfs: skip testcases if CONFIG_DAMON_SYSFS is disabledEnze Li1-0/+4
When CONFIG_DAMON_SYSFS is disabled, the selftests fail with the following output:

not ok 2 selftests: damon: sysfs_update_schemes_tried_regions_wss_estimation.py # exit=1
not ok 3 selftests: damon: damos_quota.py # exit=1
not ok 4 selftests: damon: damos_quota_goal.py # exit=1
not ok 5 selftests: damon: damos_apply_interval.py # exit=1
not ok 6 selftests: damon: damos_tried_regions.py # exit=1
not ok 7 selftests: damon: damon_nr_regions.py # exit=1
not ok 11 selftests: damon: sysfs_update_schemes_tried_regions_hang.py # exit=1

The root cause of this issue is that none of the testcases above check whether the DAMON sysfs interface exists. With this patch applied, all the testcases above pass successfully.

Link: https://lkml.kernel.org/r/20250531093937.1555159-1-lienze@kylinos.cn
Signed-off-by: Enze Li <lienze@kylinos.cn>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31sched/numa: add statistics of numa balance taskChen Yu7-2/+27
On systems with NUMA balancing enabled, it has been found that tracking task activities resulting from NUMA balancing is beneficial. NUMA balancing employs two mechanisms for task migration: one is to migrate a task to an idle CPU within its preferred node, and the other is to swap tasks located on different nodes when they are on each other's preferred nodes.

The kernel already provides NUMA page migration statistics in /sys/fs/cgroup/mytest/memory.stat and /proc/{PID}/sched. However, it lacks statistics regarding task migration and swapping, so relevant counts for task migration and swapping should be added. The following two new fields:

numa_task_migrated
numa_task_swapped

will be shown in /sys/fs/cgroup/{GROUP}/memory.stat, /proc/{PID}/sched and /proc/vmstat.

Introducing both per-task and per-memory cgroup (memcg) NUMA balancing statistics facilitates a rapid evaluation of the performance and resource utilization of the target workload. For instance, users can first identify the container with high NUMA balancing activity, then further pinpoint a specific task within that group, and subsequently adjust the memory policy for that task. In short, although it is possible to iterate through /proc/$pid/sched to locate the problematic task, the introduction of aggregated NUMA balancing activity for tasks within each memcg can assist users in identifying the task more efficiently through a divide-and-conquer approach.

As Libo Chen pointed out, the memcg event relies on the text names in vmstat_text, and /proc/vmstat generates corresponding items based on vmstat_text. Thus, the relevant task migration and swapping events introduced in vmstat_text also need to be populated by count_vm_numa_event(); otherwise these values are zero in /proc/vmstat.

In theory, task migration and swap events are part of the scheduler's activities. The reason for exposing them through the memory.stat/vmstat interface is that we already have NUMA balancing statistics in memory.stat/vmstat, and these events are closely related to each other. Following Shakeel's suggestion, we describe the end-to-end flow/story of all these events occurring on a timeline for future reference:

The goal of NUMA balancing is to co-locate a task and its memory pages on the same NUMA node. There are two strategies: migrate the pages to the task's node, or migrate the task to the node where its pages reside.

Suppose a task p1 is running on Node 0, but its pages are located on Node 1. NUMA page fault statistics for p1 reveal its "page footprint" across nodes. If NUMA balancing detects that most of p1's pages are on Node 1:

1. Page migration attempt: NUMA balancing first tries to migrate p1's pages to Node 0. The numa_page_migrate counter increments.

2. Task migration strategies: after the page migration finishes, NUMA balancing checks every 1 second to see if p1 can be migrated to Node 1.

   Case 2.1: idle CPU available. If Node 1 has an idle CPU, p1 is directly scheduled there. This event is logged as numa_task_migrated.

   Case 2.2: no idle CPU (task swap). If all CPUs on Node 1 are busy, direct migration could cause CPU contention or load imbalance. Instead, NUMA balancing selects a candidate task p2 on Node 1 that prefers Node 0 (e.g., due to its own page footprint). p1 and p2 are swapped. This cross-node swap is recorded as numa_task_swapped.
Link: https://lkml.kernel.org/r/d00edb12ba0f0de3c5222f61487e65f2ac58f5b1.1748493462.git.yu.c.chen@intel.com
Link: https://lkml.kernel.org/r/7ef90a88602ed536be46eba7152ed0d33bad5790.1748002400.git.yu.c.chen@intel.com
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Cc: Aubrey Li <aubrey.li@intel.com>
Cc: Ayush Jain <Ayush.jain3@amd.com>
Cc: "Chen, Tim C" <tim.c.chen@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Libo Chen <libo.chen@oracle.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
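Counting the two events at the decision points above might look roughly like this (count_vm_numa_event() and the event names come from the commit text; count_memcg_event_mm() and the exact call sites are assumptions):

    /* Case 2.1: an idle CPU was found on the preferred node. */
    count_vm_numa_event(NUMA_TASK_MIGRATED);
    count_memcg_event_mm(p->mm, NUMA_TASK_MIGRATED);

    /* Case 2.2: no idle CPU there; swap with a task that prefers our node. */
    count_vm_numa_event(NUMA_TASK_SWAPPED);
    count_memcg_event_mm(p->mm, NUMA_TASK_SWAPPED);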
2025-05-31sched/numa: fix task swap by skipping kernel threadsLibo Chen1-1/+2
Patch series "sched/numa: add statistics of numa balance task migration", v6. Introduce task migration and swap statistics in the following places: /sys/fs/cgroup/{GROUP}/memory.stat /proc/{PID}/sched /proc/vmstat These statistics facilitate a rapid evaluation of the performance and resource utilization of the target workload. This patch (of 2): Task swapping is triggered when there are no idle CPUs in task A's preferred node. In this case, the NUMA load balancer chooses a task B on A's preferred node and swaps B with A. This helps improve NUMA locality without introducing load imbalance between nodes. In the current implementation, B's NUMA node preference is not mandatory. That is to say, a kernel thread might be incorrectly chosen as B. However, kernel thread and user space thread that does not have mm are not supposed to be covered by NUMA balancing because NUMA balancing only considers user pages via VMAs. According to Peter's suggestion for fixing this issue, we use PF_KTHREAD to skip the kernel thread. curr->mm is also checked because it is possible that user_mode_thread() might create a user thread without an mm. As per Prateek's analysis, after adding the PF_KTHREAD check, there is no need to further check the PF_IDLE flag: : - play_idle_precise() already ensures PF_KTHREAD is set before adding : PF_IDLE : : - cpu_startup_entry() is only called from the startup thread which : should be marked with PF_KTHREAD (based on my understanding looking at : commit cff9b2332ab7 ("kernel/sched: Modify initial boot task idle : setup")) In summary, the check in task_numa_compare() now aligns with task_tick_numa(). Link: https://lkml.kernel.org/r/cover.1748493462.git.yu.c.chen@intel.com Link: https://lkml.kernel.org/r/43d68b356b25d124f0d222ebedf3859e86eefb9f.1748493462.git.yu.c.chen@intel.com Link: https://lkml.kernel.org/r/cover.1748002400.git.yu.c.chen@intel.com Link: https://lkml.kernel.org/r/eaacc9c9bd37bac92d43a671867d85b2fdad3b06.1748002400.git.yu.c.chen@intel.com Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Libo Chen <libo.chen@oracle.com> Suggested-by: Michal Koutný <mkoutny@suse.com> Tested-by: Ayush Jain <Ayush.jain3@amd.com> Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Aubrey Li <aubrey.li@intel.com> Cc: "Chen, Tim C" <tim.c.chen@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: K Prateek Nayak <kprateek.nayak@amd.com> Cc: Madadi Vineeth Reddy <vineethr@linux.ibm.com> Cc: Mel Gorman <mgorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31tools/testing: check correct variable in open_procmap()Dan Carpenter1-1/+1
Check if "procmap_out->fd" is negative instead of "procmap_out" (which is a pointer). Link: https://lkml.kernel.org/r/aDbFuUTlJTBqziVd@stanley.mountain Fixes: bd23f293a0d5 ("tools/testing: add PROCMAP_QUERY helper functions in mm self tests") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: levi.yun <yeoreum.yun@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31tools/testing/vma: add missing function stubLorenzo Stoakes1-0/+5
The hugetlb fix introduced in commit ee40c9920ac2 ("mm: fix copy_vma() error handling for hugetlb mappings") mistakenly did not provide a stub for the VMA userland testing, which results in a compile error when trying to build this. Provide this stub to resolve the issue. Link: https://lkml.kernel.org/r/20250528-fix-vma-test-v1-1-c8a5f533b38f@oracle.com Fixes: ee40c9920ac2 ("mm: fix copy_vma() error handling for hugetlb mappings") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Cc: Jann Horn <jannh@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31mm/gup: update comment explaining why gup_fast() disables IRQsJann Horn1-1/+1
The current comment in gup_fast() talks about "IPIs that come from THPs splitting", which is outdated and refers to the old THP splitting implementation that was removed in commit ad0bed24e98b ("thp: drop all split_huge_page()-related code"), which landed in v4.5. Before then, THP splitting involved a pmdp_splitting_flush(), which sent an IPI to serialize against gup_fast(). Nowadays, we use tlb_remove_table_sync_one() to send IPIs that serialize against gup_fast(); this is used, for example, in THP *collapsing* to stop gup_fast() walks of a page table before depositing it. Link: https://lkml.kernel.org/r/20250528-gup-irq-comment-fix-v1-1-b9d83c345333@google.com Signed-off-by: Jann Horn <jannh@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31selftests/mm: two fixes for the pfnmap testDavid Hildenbrand1-4/+57
When unregistering the signal handler, we have to pass SIG_DFL, and blindly reading from PFN 0 and PFN 1 seems to be problematic on !x86 systems. In particular, on arm64 tx2 machines where nothing resides at these physical memory locations, we can generate RAS errors. Let's fix it by scanning /proc/iomem for actual "System RAM". Link: https://lkml.kernel.org/r/20250528195244.1182810-1-david@redhat.com Fixes: 2616b370323a ("selftests/mm: add simple VM_PFNMAP tests based on mmap'ing /dev/mem") Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: Ryan Roberts <ryan.roberts@arm.com> Closes: https://lore.kernel.org/all/232960c2-81db-47ca-a337-38c4bce5f997@arm.com/T/#u Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Tested-by: Aishwarya TCV <aishwarya.tcv@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
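A self-contained sketch of scanning /proc/iomem for a usable "System RAM" range (the selftest's actual parsing may differ; reading real addresses from /proc/iomem requires root):

#include <stdio.h>
#include <string.h>

/* Return the start of the first "System RAM" range, or 0 if none found. */
static unsigned long long find_system_ram_start(void)
{
    unsigned long long start = 0, end;
    char line[256];
    FILE *f = fopen("/proc/iomem", "r");

    if (!f)
        return 0;
    while (fgets(line, sizeof(line), f)) {
        if (strstr(line, "System RAM") &&
            sscanf(line, "%llx-%llx", &start, &end) == 2)
            break;
        start = 0;
    }
    fclose(f);
    return start;
}

int main(void)
{
    printf("first System RAM range starts at 0x%llx\n",
           find_system_ram_start());
    return 0;
}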
2025-05-31mm/khugepaged: fix race with folio split/free using temporary referenceShivank Garg1-1/+17
hpage_collapse_scan_file() calls is_refcount_suitable(), which in turn calls folio_mapcount(). folio_mapcount() checks folio_test_large() before proceeding to folio_large_mapcount(), but there is a race window where the folio may get split/freed between these checks, triggering: VM_WARN_ON_FOLIO(!folio_test_large(folio), folio) Take a temporary reference to the folio in hpage_collapse_scan_file(). This stabilizes the folio during refcount check and prevents incorrect large folio detection due to concurrent split/free. Use helper folio_expected_ref_count() + 1 to compare with folio_ref_count() instead of using is_refcount_suitable(). Link: https://lkml.kernel.org/r/20250526182818.37978-1-shivankg@amd.com Fixes: 05c5323b2a34 ("mm: track mapcount of large folios in single value") Signed-off-by: Shivank Garg <shivankg@amd.com> Reported-by: syzbot+2b99589e33edbe9475ca@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/6828470d.a70a0220.38f255.000c.GAE@google.com Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Bharata B Rao <bharata@amd.com> Cc: Fengwei Yin <fengwei.yin@intel.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
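The stabilization pattern, sketched (folio_try_get()/folio_put() and folio_expected_ref_count() are existing helpers named by the commit; the surrounding control flow is condensed for illustration):

    if (!folio_try_get(folio)) {    /* folio was freed from under us */
        result = SCAN_PAGE_COUNT;
        break;
    }
    /* The +1 accounts for the reference we just took. */
    if (folio_ref_count(folio) != folio_expected_ref_count(folio) + 1) {
        result = SCAN_PAGE_COUNT;
        folio_put(folio);
        break;
    }
    folio_put(folio);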
2025-05-31mm: add CONFIG_PAGE_BLOCK_ORDER to select page block orderJuan Yescas4-5/+55
Problem: On large page size configurations (16KiB, 64KiB), the CMA alignment requirement (CMA_MIN_ALIGNMENT_BYTES) increases considerably, and this causes the CMA reservations to be larger than necessary. This means that the system will have fewer available MIGRATE_UNMOVABLE and MIGRATE_RECLAIMABLE page blocks, since MIGRATE_CMA can't fall back to them.

CMA_MIN_ALIGNMENT_BYTES increases because it depends on MAX_PAGE_ORDER, which depends on ARCH_FORCE_MAX_ORDER. The value of ARCH_FORCE_MAX_ORDER increases on 16k and 64k kernels.

For example, in ARM, the CMA alignment requirement when:
- CONFIG_ARCH_FORCE_MAX_ORDER default value is used
- CONFIG_TRANSPARENT_HUGEPAGE is set:

PAGE_SIZE | MAX_PAGE_ORDER | pageblock_order | CMA_MIN_ALIGNMENT_BYTES
-----------------------------------------------------------------------
4KiB      | 10             | 9               | 4KiB  * (2 ^ 9)  = 2MiB
16KiB     | 11             | 11              | 16KiB * (2 ^ 11) = 32MiB
64KiB     | 13             | 13              | 64KiB * (2 ^ 13) = 512MiB

There are some extreme cases for the CMA alignment requirement when:
- CONFIG_ARCH_FORCE_MAX_ORDER maximum value is set
- CONFIG_TRANSPARENT_HUGEPAGE is NOT set
- CONFIG_HUGETLB_PAGE is NOT set

PAGE_SIZE | MAX_PAGE_ORDER | pageblock_order | CMA_MIN_ALIGNMENT_BYTES
------------------------------------------------------------------------
4KiB      | 15             | 15              | 4KiB  * (2 ^ 15) = 128MiB
16KiB     | 13             | 13              | 16KiB * (2 ^ 13) = 128MiB
64KiB     | 13             | 13              | 64KiB * (2 ^ 13) = 512MiB

This affects the CMA reservations for the drivers. If a driver in a 4KiB kernel needs 4MiB of CMA memory, in a 16KiB kernel the minimal reservation has to be 32MiB due to the alignment requirements:

reserved-memory {
    ...
    cma_test_reserve: cma_test_reserve {
        compatible = "shared-dma-pool";
        size = <0x0 0x400000>; /* 4 MiB */
        ...
    };
};

reserved-memory {
    ...
    cma_test_reserve: cma_test_reserve {
        compatible = "shared-dma-pool";
        size = <0x0 0x2000000>; /* 32 MiB */
        ...
    };
};

Solution: Add a new config CONFIG_PAGE_BLOCK_ORDER that allows setting the page block order in all architectures. The maximum page block order is given by ARCH_FORCE_MAX_ORDER. By default, CONFIG_PAGE_BLOCK_ORDER has the same value as ARCH_FORCE_MAX_ORDER. This makes sure that current kernel configurations won't be affected by this change; it is an opt-in change.

This patch allows large page size kernels (16KiB, 64KiB) to have the same CMA alignment requirements as 4kb kernels by setting a lower pageblock_order.

Tests:
- Verified that HugeTLB pages work when pageblock_order is 1, 7, 10 on 4k and 16k kernels.
- Verified that Transparent Huge Pages work when pageblock_order is 1, 7, 10 on 4k and 16k kernels.
- Verified that dma-buf heaps allocations work when pageblock_order is 1, 7, 10 on 4k and 16k kernels.

Benchmarks:

The benchmarks compare 16kb kernels with pageblock_order 10 and 7. The reason for pageblock_order 7 is that this value makes the minimum CMA alignment requirement the same as in 4kb kernels (2MB).

- Perform 100K dma-buf heaps (/dev/dma_heap/system) allocations of SZ_8M, SZ_4M, SZ_2M, SZ_1M, SZ_64, SZ_8, SZ_4. Use simpleperf (https://developer.android.com/ndk/guides/simpleperf) to measure the number of instructions and page faults on 16k kernels. The benchmark was executed 10 times. The averages are below:

     # instructions                 |  # page-faults
   order 10      |    order 7       | order 10 | order 7
--------------------------------------------------------
13,891,765,770   | 11,425,777,314   |   220    |   217
14,456,293,487   | 12,660,819,302   |   224    |   219
13,924,261,018   | 13,243,970,736   |   217    |   221
13,910,886,504   | 13,845,519,630   |   217    |   221
14,388,071,190   | 13,498,583,098   |   223    |   224
13,656,442,167   | 12,915,831,681   |   216    |   218
13,300,268,343   | 12,930,484,776   |   222    |   218
13,625,470,223   | 14,234,092,777   |   219    |   218
13,508,964,965   | 13,432,689,094   |   225    |   219
13,368,950,667   | 13,683,587,37    |   219    |   225
-------------------------------------------------------------------
13,803,137,433   | 13,131,974,268   |   220    |   220     Averages

There were 4.86% fewer instructions when order was 7, in comparison with order 10:

13,131,974,268 - 13,803,137,433 = -671,163,166 (-4.86%)

The number of page faults with order 7 and order 10 was the same. These results didn't show any significant regression when pageblock_order is set to 7 on 16kb kernels.

- Run speedometer 3.1 (https://browserbench.org/Speedometer3.1/) 5 times on the 16k kernels with pageblock_order 7 and 10:

order 10 | order 7 | order 7 - order 10 | (order 7 - order 10) %
-------------------------------------------------------------------
  15.8   |  16.4   |  0.6               |  3.80%
  16.4   |  16.2   | -0.2               | -1.22%
  16.6   |  16.3   | -0.3               | -1.81%
  16.8   |  16.3   | -0.5               | -2.98%
  16.6   |  16.8   |  0.2               |  1.20%
-------------------------------------------------------------------
  16.44  |  16.4   | -0.04              | -0.24%    Averages

The results didn't show any significant regression when pageblock_order is set to 7 on 16kb kernels.

Link: https://lkml.kernel.org/r/20250521215807.1860663-1-jyescas@google.com
Signed-off-by: Juan Yescas <jyescas@google.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
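The alignment arithmetic behind the tables is simple: alignment = PAGE_SIZE << pageblock_order. A worked check of three rows:

#include <stdio.h>

int main(void)
{
    /* CMA_MIN_ALIGNMENT_BYTES = PAGE_SIZE * 2^pageblock_order */
    printf("%lu MiB\n", (4UL << 10)  * (1UL << 9)  >> 20); /* 4KiB,  order 9  -> 2 MiB */
    printf("%lu MiB\n", (16UL << 10) * (1UL << 11) >> 20); /* 16KiB, order 11 -> 32 MiB */
    printf("%lu MiB\n", (16UL << 10) * (1UL << 7)  >> 20); /* 16KiB, order 7  -> 2 MiB */
    return 0;
}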
2025-05-31mmu_notifiers: remove leftover stub macrosJann Horn1-3/+0
Commit ec8832d007cb ("mmu_notifiers: don't invalidate secondary TLBs as part of mmu_notifier_invalidate_range_end()") removed the main definitions of {ptep,pmdp_huge,pudp_huge}_clear_flush_notify; just their !CONFIG_MMU_NOTIFIER stubs are left behind, remove them. Link: https://lkml.kernel.org/r/20250523-mmu-notifier-cleanup-unused-v1-1-cc1f47ebec33@google.com Signed-off-by: Jann Horn <jannh@google.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31selftests/mm: deduplicate test names in madv_populateMark Brown1-9/+9
The madv_populate selftest has some repetitive code for several different cases that it covers, including repeated test names used in ksft_test_result() reports. This causes problems for automation: the test name is used to both track the test between runs and distinguish between multiple tests within the same run. Fix this by tweaking the duplicated messages to be more specific about the contexts they're in. Link: https://lkml.kernel.org/r/20250522-selftests-mm-madv-populate-dedupe-v1-1-fd1dedd79b4b@kernel.org Signed-off-by: Mark Brown <broonie@kernel.org> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31kcov: rust: add flags for KCOV with RustAlice Ryhl3-0/+10
Rust code is currently not instrumented properly when KCOV is enabled. Thus, add the relevant flags to perform instrumentation correctly. This is necessary for efficient fuzzing of Rust code. The sanitizer-coverage features of LLVM have existed for long enough that they are available on any LLVM version supported by rustc, so we do not need any Kconfig feature detection. The coverage level is set to 3, as that is the level needed by trace-pc. We do not instrument `core` since when we fuzz the kernel, we are looking for bugs in the kernel, not the Rust stdlib. Link: https://lkml.kernel.org/r/20250501-rust-kcov-v2-1-b71e83e9779f@google.com Signed-off-by: Alice Ryhl <aliceryhl@google.com> Co-developed-by: Matthew Maurer <mmaurer@google.com> Signed-off-by: Matthew Maurer <mmaurer@google.com> Reviewed-by: Alexander Potapenko <glider@google.com> Tested-by: Aleksandr Nogikh <nogikh@google.com> Acked-by: Miguel Ojeda <ojeda@kernel.org> Cc: Andreas Hindborg <a.hindborg@kernel.org> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: Bill Wendling <morbo@google.com> Cc: Björn Roy Baron <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Gary Guo <gary@garyguo.net> Cc: Justin Stitt <justinstitt@google.com> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Trevor Gross <tmgross@umich.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31mm: rust: make CONFIG_MMU ifdefs more narrowAlice Ryhl2-52/+72
Currently the entire kernel::mm module is ifdef'd out when CONFIG_MMU=n. However, there are some downstream users of the module in rust/kernel/task.rs and rust/kernel/miscdevice.rs. Thus, update the cfgs so that only MmWithUserAsync is removed with CONFIG_MMU=n. The code is moved into a new file, since the #[cfg()] annotation otherwise has to be duplicated several times. Link: https://lkml.kernel.org/r/20250516193219.2987032-1-aliceryhl@google.com Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202505071753.kldNHYVQ-lkp@intel.com/ Closes: https://lore.kernel.org/oe-kbuild-all/202505072116.eSYC8igT-lkp@intel.com/ Fixes: 5bb9ed6cdfeb ("mm: rust: add abstraction for struct mm_struct") Signed-off-by: Alice Ryhl <aliceryhl@google.com> Reviewed-by: Boqun Feng <boqun.feng@gmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31mmu_gather: move tlb flush for VM_PFNMAP/VM_MIXEDMAP vmas into free_pgtables()Roman Gushchin3-10/+39
Commit b67fbebd4cf9 ("mmu_gather: Force tlb-flush VM_PFNMAP vmas") added a forced TLB flush to tlb_end_vma(), which is required to avoid a race between munmap() and unmap_mapping_range(). However it added some overhead to other paths where tlb_end_vma() is used but vmas are not removed, e.g. madvise(MADV_DONTNEED). Fix this by moving the TLB flush out of tlb_end_vma() into the new tlb_flush_vmas() called from free_pgtables(), somewhat similar to the stable version of the original commit: commit 895428ee124a ("mm: Force TLB flush for PFNMAP mappings before unlink_file_vma()"). Note that if tlb->fullmm is set, no flush is required, as the whole mm is about to be destroyed. Link: https://lkml.kernel.org/r/20250522012838.163876-1-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Jann Horn <jannh@google.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Will Deacon <will@kernel.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Nick Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
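A sketch of the reorganized flush, following the description (tlb_flush_vmas() is named in the text and vma_pfn is an existing mmu_gather flag, but the body here is illustrative):

/* Called from free_pgtables(): flush once before unlinking vmas,
 * instead of at every tlb_end_vma(). */
void tlb_flush_vmas(struct mmu_gather *tlb)
{
    /* The whole mm is going away: no flush needed here. */
    if (tlb->fullmm)
        return;
    if (tlb->vma_pfn)   /* a VM_PFNMAP/VM_MIXEDMAP vma was covered */
        tlb_flush_mmu_tlbonly(tlb);
}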
2025-05-31mm/damon/Kconfig: enable CONFIG_DAMON by defaultSeongJae Park1-0/+1
As of this writing, multiple major distros including Alma, Amazon, Android, CentOS, Debian, Fedora, and Oracle are build-enabling DAMON (set CONFIG_DAMON[1]). Enabling it by default will save configuration setup time for current and future DAMON users. Build-enabling DAMON does not introduce a real risk since it makes no behavioral change by default. It requires explicit user requests to do anything. The only potential risk is making the size of the kernel a little bit larger. On a production-purpose configuration, it increases the resulting kernel package size by about 0.1 % of the final package file. I believe that's too small to be a real problem in common setups. Hence, the benefit of enabling CONFIG_DAMON outweighs the potential risk. Set CONFIG_DAMON by default. Link: https://oracle.github.io/kconfigs/?config=UTS_RELEASE&config=DAMON [1] Link: https://lkml.kernel.org/r/20250521042755.39653-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Acked-by: Honggyu Kim <honggyu.kim@sk.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31mm/damon/Kconfig: set DAMON_{VADDR,PADDR,SYSFS} default to DAMONSeongJae Park1-0/+3
Patch series "mm/damon: build-enable essential DAMON components by default". As of this writing, multiple major distros including Alma, Amazon, Android, CentOS, Debian, Fedora, and Oracle are build-enabling DAMON (set CONFIG_DAMON[1]). Configuring DAMON is not very easy, since it is disabled by default, and there are multiple essential options that need to be manually turned on, one by one. Make it easier, by grouping essential configurations to be enabled with one selection, and enabling build of the essential parts of DAMON by default. Note that build-enabling DAMON does not introduce any real risk, since it makes no behavioral change by default. It requires explicit user requests to do anything. Only one potential risk is making the size of the kernel a little bit larger. On a production-purpose configuration, it increases the resulting kernel package binary size by about 0.1 % of the final package file. I believe that's too small to be a real problem in common setups. DAMON_{VADDR,PADDR,SYSFS} are de-facto essential parts of DAMON for normal usages. Because those need to be enabled one by one, however, and there are other test-purpose or non-essential configurations, it is easy to be confused and make mistakes at setup. Make the essential configurations default to CONFIG_DAMON, so that those can be enabled by default with a single change. Link: https://oracle.github.io/kconfigs/?config=UTS_RELEASE&config=DAMON [1] Link: https://lkml.kernel.org/r/20250521042755.39653-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250521042755.39653-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Acked-by: Honggyu Kim <honggyu.kim@sk.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31hugetlb: show nr_huge_pages in report_hugepages()Wenjie Xu1-1/+1
The number of pre-allocated huge pages should be nr_huge_pages, not free_huge_pages, although they are the same during the boot stage. Link: https://lkml.kernel.org/r/20250515114231.65824-1-xuwenjie04@baidu.com Signed-off-by: Wenjie Xu <xuwenjie04@baidu.com> Signed-off-by: Li RongQing <lirongqing@baidu.com> Acked-by: Oscar Salvador <osalvador@suse.de> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31selftests/mm: skip hugevm test if kernel config file is not presentZi Yan1-17/+9
When running hugevm tests on a machine without a kernel config present, e.g., a VM running a kernel without CONFIG_IKCONFIG_PROC or /boot/config-*, skip the hugevm tests, which read the kernel config to get page table level information. Link: https://lkml.kernel.org/r/20250516132938.356627-3-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Adam Sindelar <adam@wowsignal.io> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Pedro Falcato <pfalcato@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31selftests/mm: skip guard_regions.uffd tests when uffd is not presentZi Yan1-2/+15
Patch series "Skip mm selftests instead when kernel features are not present", v2. Two guard_regions tests on userfaultfd fail when userfaultfd is not present. Skip them instead. hugevm test reads kernel config to get page table level information and fails when neither /proc/config.gz nor /boot/config-* is present. Skip it instead. This patch (of 2): When userfaultfd is not compiled into kernel, userfaultfd() returns -1, causing guard_regions.uffd tests to fail. Skip the tests instead. Link: https://lkml.kernel.org/r/20250516132938.356627-1-ziy@nvidia.com Link: https://lkml.kernel.org/r/20250516132938.356627-2-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Cc: Adam Sindelar <adam@wowsignal.io> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31mm/shmem: remove unneeded xa_is_value() check in shmem_unuse_swap_entries()Kemeng Shi1-2/+0
As only value entries are added to fbatch in shmem_find_swap_entries(), there is no need for the xa_is_value() check in shmem_unuse_swap_entries(). Link: https://lkml.kernel.org/r/20250516170939.965736-6-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kairui Song <kasong@tencent.com> Cc: kernel test robot <oliver.sang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31mm: shmem: only remove inode from swaplist when its swapped page count is 0Kemeng Shi1-2/+2
Even if we fail to allocate a swap entry, the inode might contain previously allocated swap entries, and we might take an inode containing swap entries off the swaplist. As a result, try_to_unuse() may enter a potential dead loop, repeatedly looking for the inode and cleaning its swap entries. Only take the inode off the swaplist when its swapped page count is 0 to fix the issue. Link: https://lkml.kernel.org/r/20250516170939.965736-5-shikemeng@huaweicloud.com Fixes: b487a2da3575 ("mm, swap: simplify folio swap allocation") Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Kairui Song <kasong@tencent.com> Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202505161438.9009cf47-lkp@intel.com Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
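Sketched, the condition change amounts to the following (context condensed; info->swapped and shmem_swaplist are the fields named in the text):

    mutex_lock(&shmem_swaplist_mutex);
    /*
     * Only drop the inode from the swaplist once nothing is swapped
     * out; a failed new swap allocation alone must not remove it.
     */
    if (!info->swapped)
        list_del_init(&info->swaplist);
    mutex_unlock(&shmem_swaplist_mutex);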
2025-05-31mm/shmem: fix potential dead loop in shmem_unuse()Kemeng Shi1-3/+6
If multiple shmem_unuse() calls for different swap types run concurrently, a dead loop can occur as follows:

shmem_unuse(typeA)               shmem_unuse(typeB)
 mutex_lock(&shmem_swaplist_mutex)
 list_for_each_entry_safe(info, next, ...)
  ...
 mutex_unlock(&shmem_swaplist_mutex)
 /* info->swapped may drop to 0 */
 shmem_unuse_inode(&info->vfs_inode, type)

                                 mutex_lock(&shmem_swaplist_mutex)
                                 list_for_each_entry(info, next, ...)
                                  if (!info->swapped)
                                   list_del_init(&info->swaplist)
                                  ...
                                 mutex_unlock(&shmem_swaplist_mutex)

 mutex_lock(&shmem_swaplist_mutex)
 /* iterate with offlist entry and encounter a dead loop */
 next = list_next_entry(info, swaplist);
 ...

Restart the iteration if the inode is already off the shmem_swaplist to fix the issue.

Link: https://lkml.kernel.org/r/20250516170939.965736-4-shikemeng@huaweicloud.com
Fixes: b56a2d8af914 ("mm: rid swapoff of quadratic complexity")
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kairui Song <kasong@tencent.com>
Cc: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
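A rough sketch of how such a restart can be expressed (the list_empty()-on-entry idiom detects an entry removed via list_del_init(); the actual patch may structure this differently):

    mutex_lock(&shmem_swaplist_mutex);
    /*
     * A racing shmem_unuse() may have taken this inode off the
     * swaplist while we ran without the mutex; 'info' is then no
     * longer a valid cursor, so restart from the list head.
     */
    if (list_empty(&info->swaplist))
        goto restart;
    next = list_next_entry(info, swaplist);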