2025-06-05  mm: expose abnormal new_pte during move_ptes  (Pu Lehui; 1 file, -0/+2)
When executing move_ptes, the new_pte must be NULL; otherwise it will be overwritten by the old_pte, causing the abnormal new_pte to be leaked. To make this problem more explicit, let's add a WARN_ON_ONCE when new_pte is not NULL. [akpm@linux-foundation.org: s/WARN_ON_ONCE/VM_WARN_ON_ONCE/] Link: https://lkml.kernel.org/r/20250529155650.4017699-3-pulehui@huaweicloud.com Suggested-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Pu Lehui <pulehui@huawei.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
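As a rough illustration of where the check lands, here is a simplified sketch of the PTE-moving loop in mm/mremap.c (not the exact diff; variable names abbreviated):

	/* sketch: inside move_ptes(), copying PTEs from the old to the new range */
	for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE,
				   new_ptep++, new_addr += PAGE_SIZE) {
		if (pte_none(ptep_get(old_ptep)))
			continue;
		/*
		 * The destination PTE must be none here; a populated PTE
		 * would be silently overwritten by set_pte_at() below.
		 */
		VM_WARN_ON_ONCE(!pte_none(ptep_get(new_ptep)));
		pte = ptep_get_and_clear(mm, old_addr, old_ptep);
		set_pte_at(mm, new_addr, new_ptep, pte);
	}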
2025-06-05  mm: fix uprobe pte be overwritten when expanding vma  (Pu Lehui; 2 files, -3/+24)
Patch series "Fix uprobe pte be overwritten when expanding vma". This patch (of 4): We encountered a BUG alert triggered by Syzkaller as follows: BUG: Bad rss-counter state mm:00000000b4a60fca type:MM_ANONPAGES val:1 And we can reproduce it with the following steps: 1. register uprobe on file at zero offset 2. mmap the file at zero offset: addr1 = mmap(NULL, 2 * 4096, PROT_NONE, MAP_PRIVATE, fd, 0); 3. mremap part of vma1 to new vma2: addr2 = mremap(addr1, 4096, 2 * 4096, MREMAP_MAYMOVE); 4. mremap back to orig addr1: mremap(addr2, 4096, 4096, MREMAP_MAYMOVE | MREMAP_FIXED, addr1); In step 3, the vma1 range [addr1, addr1 + 4096] will be remap to new vma2 with range [addr2, addr2 + 8192], and remap uprobe anon page from the vma1 to vma2, then unmap the vma1 range [addr1, addr1 + 4096]. In step 4, the vma2 range [addr2, addr2 + 4096] will be remap back to the addr range [addr1, addr1 + 4096]. Since the addr range [addr1 + 4096, addr1 + 8192] still maps the file, it will take vma_merge_new_range to expand the range, and then do uprobe_mmap in vma_complete. Since the merged vma pgoff is also zero offset, it will install uprobe anon page to the merged vma. However, the upcomming move_page_tables step, which use set_pte_at to remap the vma2 uprobe pte to the merged vma, will overwrite the newly uprobe pte in the merged vma, and lead that pte to be orphan. Since the uprobe pte will be remapped to the merged vma, we can remove the unnecessary uprobe_mmap upon merged vma. This problem was first found in linux-6.6.y and also exists in the community syzkaller: https://lore.kernel.org/all/000000000000ada39605a5e71711@google.com/T/ Link: https://lkml.kernel.org/r/20250529155650.4017699-1-pulehui@huaweicloud.com Link: https://lkml.kernel.org/r/20250529155650.4017699-2-pulehui@huaweicloud.com Fixes: 2b1444983508 ("uprobes, mm, x86: Add the ability to install and remove uprobes breakpoints") Signed-off-by: Pu Lehui <pulehui@huawei.com> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-05  mm/damon: s/primitives/code/ on comments  (Enze Li; 8 files, -8/+8)
The word 'primitive' is not explicit. To make the code more easily understood, this commit renames 'primitives' to 'code' in header comments of some source files. Link: https://lkml.kernel.org/r/20250530053115.153238-1-lienze@kylinos.cn Signed-off-by: Enze Li <lienze@kylinos.cn> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-02  rtmutex_api: provide correct extern functions  (Paolo Bonzini; 1 file, -12/+21)
Commit fb49f07ba1d9 ("locking/mutex: implement mutex_lock_killable_nest_lock") changed the set of functions that mutex.c defines when CONFIG_DEBUG_LOCK_ALLOC is set.

- it removed the "extern" declaration of mutex_lock_killable_nested from include/linux/mutex.h, and replaced it with a macro since it could be treated as a special case of _mutex_lock_killable. It also removed a definition of the function in kernel/locking/mutex.c.

- likewise, it replaced mutex_trylock() with the more generic mutex_trylock_nest_lock() and replaced mutex_trylock() with a macro.

However, it left the old definitions in place in kernel/locking/rtmutex_api.c, which causes failures when building with CONFIG_RT_MUTEXES=y. Bring over the changes. Fixes: fb49f07ba1d9 ("locking/mutex: implement mutex_lock_killable_nest_lock") Reported-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-06-01  randstruct: gcc-plugin: Fix attribute addition  (Kees Cook; 2 files, -11/+43)
Based on changes in the 2021 public version of the randstruct out-of-tree GCC plugin[1], more carefully update the attributes on resulting decls, to avoid tripping checks in GCC 15's comptypes_check_enum_int() when it has been configured with "--enable-checking=misc":

arch/arm64/kernel/kexec_image.c:132:14: internal compiler error: in comptypes_check_enum_int, at c/c-typeck.cc:1519
  132 | const struct kexec_file_ops kexec_image_ops = {
      |              ^~~~~~~~~~~~~~
internal_error(char const*, ...), at gcc/gcc/diagnostic-global-context.cc:517
fancy_abort(char const*, int, char const*), at gcc/gcc/diagnostic.cc:1803
comptypes_check_enum_int(tree_node*, tree_node*, bool*), at gcc/gcc/c/c-typeck.cc:1519
...

Link: https://archive.org/download/grsecurity/grsecurity-3.1-5.10.41-202105280954.patch.gz [1] Reported-by: Thiago Jung Bauermann <thiago.bauermann@linaro.org> Closes: https://github.com/KSPP/linux/issues/367 Closes: https://lore.kernel.org/lkml/20250530000646.104457-1-thiago.bauermann@linaro.org/ Reported-by: Ingo Saitz <ingo@hannover.ccc.de> Closes: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1104745 Fixes: 313dd1b62921 ("gcc-plugins: Add the randstruct plugin") Tested-by: Thiago Jung Bauermann <thiago.bauermann@linaro.org> Link: https://lore.kernel.org/r/20250530221824.work.623-kees@kernel.org Signed-off-by: Kees Cook <kees@kernel.org>
2025-06-01  overflow: Introduce __DEFINE_FLEX for having no initializer  (Kees Cook; 1 file, -6/+19)
While not yet in the tree, there is a proposed patch[1] that was depending on the prior behavior of _DEFINE_FLEX, which did not have an explicit initializer. Provide this via __DEFINE_FLEX now, which can also have attributes applied (e.g. __uninitialized). Examples of the resulting initializer behaviors can be seen here: https://godbolt.org/z/P7Go8Tr33 Link: https://lore.kernel.org/netdev/20250520205920.2134829-9-anthony.l.nguyen@intel.com [1] Fixes: 47e36ed78406 ("overflow: Fix direct struct member initialization in _DEFINE_FLEX()") Signed-off-by: Kees Cook <kees@kernel.org>
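A hedged sketch of the intended usage difference follows; the exact argument lists of the DEFINE_FLEX() family are taken from overflow.h as I understand it, and the trailing-attribute form of __DEFINE_FLEX is an assumption based on the commit text:

	struct pkt { int len; u8 data[]; };

	/* DEFINE_FLEX: on-stack flex-array storage with an explicit initializer */
	DEFINE_FLEX(struct pkt, p, data, len, 32);

	/* __DEFINE_FLEX: no initializer, so storage attributes such as
	 * __uninitialized can be applied via the trailing argument */
	__DEFINE_FLEX(struct pkt, q, data, 32, __uninitialized);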
2025-06-01  watchdog: iTCO_wdt: Update the heartbeat value after clamping timeout  (Ziyan Fu; 1 file, -0/+1)
When executing "modprobe iTCO_wdt heartbeat=700", the user-specified 'heartbeat' parameter exceeds the valid range, the driver clamps the timeout to default 30s but fails to update the logged 'heartbeat' value, resulting in misleading log output: iTCO_wdt iTCO_wdt: timeout value out of range, using 30 iTCO_wdt iTCO_wdt: initialized. heartbeat=700 sec (nowayout=0) After validating the range, update the 'heartbeat' value with the clamped timeout value to ensure that log messages accurately reflect the actual runtime parameters. Signed-off-by: Ziyan Fu <fuzy5@lenovo.com> Reviewed-by: Wim Van Sebroeck <wim@linux-watchdog.org> Link: https://lore.kernel.org/r/20250429102533.11886-1-13281011316@163.com Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Wim Van Sebroeck <wim@linux-watchdog.org>
2025-06-01  watchdog: Add driver for Intel OC WDT  (Diogo Ivo; 3 files, -0/+245)
Add a driver for the Intel Over-Clocking Watchdog found in Intel Platform Controller Hub (PCH) chipsets. This watchdog is controlled via a simple single-register interface and would otherwise be standard except for the presence of a LOCK bit that can only be set once per power cycle, needing extra handling around it. Signed-off-by: Diogo Ivo <diogo.ivo@siemens.com> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20250317-ivo-intel_oc_wdt-v3-1-32c396f4eefd@siemens.com Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Wim Van Sebroeck <wim@linux-watchdog.org>
2025-06-01  watchdog: arm_smc_wdt: get wdt status through SMCWD_GET_TIMELEFT  (Antonio Borneo; 1 file, -3/+14)
The optional SMCWD_GET_TIMELEFT command can be used to detect if the watchdog has already been started. See the implementation in OP-TEE secure OS [1]. At probe time, check if the watchdog is already started and then set WDOG_HW_RUNNING in the watchdog status. This will cause the watchdog framework to ping the watchdog until a userspace watchdog daemon takes over the control. Link: https://github.com/OP-TEE/optee_os/commit/a7f2d4bd8632 [1] Signed-off-by: Antonio Borneo <antonio.borneo@foss.st.com> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20250520085952.210723-1-antonio.borneo@foss.st.com Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Wim Van Sebroeck <wim@linux-watchdog.org>
2025-06-01  watchdog: iTCO: Drop driver-internal locking  (Guenter Roeck; 1 file, -24/+0)
The locking code in the iTCO watchdog driver has been carried along from before the watchdog core existed. The watchdog core protects calls into drivers since commit f4e9c82f64b5 ("watchdog: Add Locking support"), making driver-internal locking unnecessary. Drop it. Signed-off-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: Wim Van Sebroeck <wim@linux-watchdog.org> Link: https://lore.kernel.org/r/20250517160936.3231017-1-linux@roeck-us.net Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Wim Van Sebroeck <wim@linux-watchdog.org>
2025-06-01  watchdog: apple: set max_hw_heartbeat_ms instead of max_timeout  (Florian Klink; 1 file, -2/+5)
The hardware only supports timeouts slightly below 3 minutes, but by using max_hw_heartbeat_ms we can let the kernel take care of supporting timeouts requested from userspace that are larger than the hardware maximum. Switching to max_hw_heartbeat_ms also means our set_timeout function now needs to configure the hardware to the minimum of either the requested timeout (in seconds) or the maximum supported by the hardware (in seconds). Signed-off-by: Florian Klink <flokli@flokli.de> Reviewed-by: Wim Van Sebroeck <wim@linux-watchdog.org> Link: https://lore.kernel.org/r/20250506142621.11428-2-flokli@flokli.de Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Wim Van Sebroeck <wim@linux-watchdog.org>
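Conceptually the set_timeout path becomes something like this sketch (the apple_wdt_program() helper name is hypothetical):

	static int apple_wdt_set_timeout(struct watchdog_device *wdd, unsigned int s)
	{
		/*
		 * Program at most what the hardware can represent; the
		 * watchdog core re-pings to emulate longer timeouts.
		 */
		unsigned int hw_max = wdd->max_hw_heartbeat_ms / 1000;

		apple_wdt_program(wdd, min(s, hw_max));	/* hypothetical helper */
		wdd->timeout = s;
		return 0;
	}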
2025-06-01  watchdog: qcom: introduce the device data for IPQ5424 watchdog device  (Kathiravan Thirumoorthy; 1 file, -0/+7)
To retrieve the restart reason from IMEM, certain device-specific data, such as the IMEM compatible string to look up and the location in IMEM to read, should be defined. To achieve that, introduce separate device data for IPQ5424 and add the required details subsequently. Signed-off-by: Kathiravan Thirumoorthy <kathiravan.thirumoorthy@oss.qualcomm.com> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20250502-wdt_reset_reason-v3-3-b2dc7ace38ca@oss.qualcomm.com Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Wim Van Sebroeck <wim@linux-watchdog.org>
2025-06-01  dt-bindings: watchdog: renesas,wdt: Document RZ/V2N (R9A09G056) support  (Lad Prabhakar; 1 file, -1/+3)
Document support for the watchdog IP found on the Renesas RZ/V2N (R9A09G056) SoC. The watchdog IP is identical to that on RZ/V2H(P), so `renesas,r9a09g057-wdt` will be used as a fallback compatible, enabling reuse of the existing driver without changes. Signed-off-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com> Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be> Acked-by: Conor Dooley <conor.dooley@microchip.com> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20250502120054.47323-1-prabhakar.mahadev-lad.rj@bp.renesas.com Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Wim Van Sebroeck <wim@linux-watchdog.org>
2025-06-01  watchdog: lenovo_se30_wdt: Fix possible devm_ioremap() NULL pointer dereference in lenovo_se30_wdt_probe()  (Henry Martin; 1 file, -0/+2)
devm_ioremap() returns NULL on error. Currently, lenovo_se30_wdt_probe() does not check for this case, which results in a NULL pointer dereference. Add NULL check after devm_ioremap() to prevent this issue. Fixes: c284153a2c55 ("watchdog: lenovo_se30_wdt: Watchdog driver for Lenovo SE30 platform") Signed-off-by: Henry Martin <bsdhenrymartin@gmail.com> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20250424071648.89016-1-bsdhenrymartin@gmail.com Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Wim Van Sebroeck <wim@linux-watchdog.org>
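The shape of such a fix is a two-line check after the mapping call (a sketch; the field names here are illustrative, not the driver's actual ones):

	priv->base = devm_ioremap(dev, res->start, resource_size(res));
	if (!priv->base)
		return -ENOMEM;	/* devm_ioremap() returns NULL on error */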
2025-06-01  watchdog: s3c2410_wdt: Add exynos990-wdt compatible data  (Igor Belwon; 1 file, -1/+38)
The Exynos990 has two watchdog clusters - cl0 and cl2. Add new driver data for these two clusters, making it possible to use the watchdog timer on this SoC. Signed-off-by: Igor Belwon <igor.belwon@mentallysanemainliners.org> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20250420-wdt-resends-april-v1-2-f58639673959@mentallysanemainliners.org Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Wim Van Sebroeck <wim@linux-watchdog.org>
2025-06-01  dt-bindings: watchdog: samsung-wdt: Add exynos990-wdt compatible  (Igor Belwon; 1 file, -4/+7)
Add a dt-binding compatible for the Exynos990 Watchdog timer. This watchdog is compatible with the GS101/Exynos850 design, as such it requires the cluster-index and syscon-phandle properties to be present. It also contains a cl2 cluster, as such the cluster-index property has been expanded. Signed-off-by: Igor Belwon <igor.belwon@mentallysanemainliners.org> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20250420-wdt-resends-april-v1-1-f58639673959@mentallysanemainliners.org Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Wim Van Sebroeck <wim@linux-watchdog.org>
2025-05-31  mm/khugepaged: clean up refcount check using folio_expected_ref_count()  (Shivank Garg; 1 file, -15/+2)
Use folio_expected_ref_count() instead of open-coded logic in is_refcount_suitable(). This avoids code duplication and improves clarity. Drop is_refcount_suitable() as it is no longer needed. Link: https://lkml.kernel.org/r/20250526182818.37978-2-shivankg@amd.com Signed-off-by: Shivank Garg <shivankg@amd.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Bharata B Rao <bharata@amd.com> Cc: Fengwei Yin <fengwei.yin@intel.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31  selftests/mm: fix test result reporting in gup_longterm  (Mark Brown; 1 file, -56/+94)
The kselftest framework uses the string logged when a test result is reported as the unique identifier for a test, using it to track test results between runs. The gup_longterm test fails to follow this pattern: it runs a single test function repeatedly with various parameters, but each result report is a string logging an error message which is fixed between runs. Since the code already logs each test uniquely before it starts, refactor to also print this to a buffer, then use that name as the test result. This isn't especially pretty but is relatively straightforward and is a great help to tooling. Link: https://lkml.kernel.org/r/20250527-selftests-mm-cow-dedupe-v2-4-ff198df8e38e@kernel.org Signed-off-by: Mark Brown <broonie@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31  selftests/mm: report unique test names for each cow test  (Mark Brown; 1 file, -116/+217)
The kselftest framework uses the string logged when a test result is reported as the unique identifier for a test, using it to track test results between runs. The cow test completely fails to follow this pattern: it runs test functions repeatedly with various parameters, with each result report from those functions being a string logging an error message which is fixed between runs. Since the code already logs each test uniquely before it starts, refactor to also print this to a buffer, then use that name as the test result. This isn't especially pretty but is relatively straightforward and is a great help to tooling. Link: https://lkml.kernel.org/r/20250527-selftests-mm-cow-dedupe-v2-3-ff198df8e38e@kernel.org Signed-off-by: Mark Brown <broonie@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31  selftests/mm: add helper for logging test start and results  (Mark Brown; 1 file, -0/+20)
Several of the MM tests have a pattern of printing a description of the test to be run then reporting the actual TAP result using a generic string not connected to the specific test, often in a shared function used by many tests. The name reported typically varies depending on the specific result rather than the test too. This causes problems for tooling that works with test results, the names reported with the results are used to deduplicate tests and track them between runs so both duplicated names and changing names cause trouble for things like UIs and automated bisection. As a first step towards matching these tests better with the expectations of kselftest provide helpers which record the test name as part of the initial print and then use that as part of reporting a result. This is not added as a generic kselftest helper partly because the use of a variable to store the test name doesn't fit well with the header only implementation of kselftest.h and partly because it's not really an intended pattern. Ideally at some point the mm tests that use it will be updated to not need it. Link: https://lkml.kernel.org/r/20250527-selftests-mm-cow-dedupe-v2-2-ff198df8e38e@kernel.org Signed-off-by: Mark Brown <broonie@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
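A sketch of what such a helper pair can look like; the ksft_* printers are the real kselftest.h helpers, while the buffer-based pattern and function names follow the commit description rather than the exact patch:

	#include <stdarg.h>
	#include <stdbool.h>
	#include <stdio.h>
	#include "../kselftest.h"

	static char test_name[256];

	static void log_test_start(const char *fmt, ...)
	{
		va_list args;

		va_start(args, fmt);
		vsnprintf(test_name, sizeof(test_name), fmt, args);
		ksft_print_msg("[RUN] %s\n", test_name);
		va_end(args);
	}

	static void log_test_result(bool passed)
	{
		/* report with the exact string that was logged at start */
		if (passed)
			ksft_test_result_pass("%s\n", test_name);
		else
			ksft_test_result_fail("%s\n", test_name);
	}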
2025-05-31  selftests/mm: use standard ksft_finished() in cow and gup_longterm  (Mark Brown; 2 files, -12/+3)
Patch series "selftests/mm: cow and gup_longterm cleanups", v2. The bulk of these changes modify the cow and gup_longterm tests to report unique and stable names for each test, bringing them into line with the expectations of tooling that works with kselftest. The string reported as a test result is used by tooling to both deduplicate tests and track tests between test runs, using the same string for multiple tests or changing the string depending on test result causes problems for user interfaces and automation such as bisection. It was suggested that converting to use kselftest_harness.h would be a good way of addressing this, however that really wants the set of tests to run to be known at compile time but both test programs dynamically enumarate the set of huge page sizes the system supports and test each. Refactoring to handle this would be even more invasive than these changes which are large but straightforward and repetitive. A version of the main gup_longterm cleanup was previously sent separately, this version factors out the helpers for logging the start of the test since the cow test looks very similar. This patch (of 4): The cow and gup_longterm test programs open code something that looks a lot like the standard ksft_finished() helper to summarise the test results and provide an exit code, convert to use ksft_finished(). Link: https://lkml.kernel.org/r/20250527-selftests-mm-cow-dedupe-v2-0-ff198df8e38e@kernel.org Link: https://lkml.kernel.org/r/20250527-selftests-mm-cow-dedupe-v2-1-ff198df8e38e@kernel.org Signed-off-by: Mark Brown <broonie@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31  selftests/damon/_damon_sysfs: skip testcases if CONFIG_DAMON_SYSFS is disabled  (Enze Li; 1 file, -0/+4)
When CONFIG_DAMON_SYSFS is disabled, the selftests fail with the following outputs:

not ok 2 selftests: damon: sysfs_update_schemes_tried_regions_wss_estimation.py # exit=1
not ok 3 selftests: damon: damos_quota.py # exit=1
not ok 4 selftests: damon: damos_quota_goal.py # exit=1
not ok 5 selftests: damon: damos_apply_interval.py # exit=1
not ok 6 selftests: damon: damos_tried_regions.py # exit=1
not ok 7 selftests: damon: damon_nr_regions.py # exit=1
not ok 11 selftests: damon: sysfs_update_schemes_tried_regions_hang.py # exit=1

The root cause of this issue is that none of the testcases above check whether the sysfs interface of DAMON exists. With this patch applied, all the testcases above now pass successfully. Link: https://lkml.kernel.org/r/20250531093937.1555159-1-lienze@kylinos.cn Signed-off-by: Enze Li <lienze@kylinos.cn> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31  sched/numa: add statistics of numa balance task  (Chen Yu; 7 files, -2/+27)
On systems with NUMA balancing enabled, it has been found that tracking task activities resulting from NUMA balancing is beneficial. NUMA balancing employs two mechanisms for task migration: one is to migrate a task to an idle CPU within its preferred node, and the other is to swap tasks located on different nodes when they are on each other's preferred nodes.

The kernel already provides NUMA page migration statistics in /sys/fs/cgroup/mytest/memory.stat and /proc/{PID}/sched. However, it lacks statistics regarding task migration and swapping. Therefore, relevant counts for task migration and swapping should be added. The following two new fields:

numa_task_migrated
numa_task_swapped

will be shown in /sys/fs/cgroup/{GROUP}/memory.stat, /proc/{PID}/sched and /proc/vmstat.

Introducing both per-task and per-memory cgroup (memcg) NUMA balancing statistics facilitates a rapid evaluation of the performance and resource utilization of the target workload. For instance, users can first identify the container with high NUMA balancing activity, then further pinpoint a specific task within that group, and subsequently adjust the memory policy for that task. In short, although it is possible to iterate through /proc/$pid/sched to locate the problematic task, the introduction of aggregated NUMA balancing activity for tasks within each memcg can assist users in identifying the task more efficiently through a divide-and-conquer approach.

As Libo Chen pointed out, the memcg event relies on the text names in vmstat_text, and /proc/vmstat generates corresponding items based on vmstat_text. Thus, the relevant task migration and swapping events introduced in vmstat_text also need to be populated by count_vm_numa_event(), otherwise these values are zero in /proc/vmstat.

In theory, task migration and swap events are part of the scheduler's activities. The reason for exposing them through the memory.stat/vmstat interface is that we already have NUMA balancing statistics in memory.stat/vmstat, and these events are closely related to each other. Following Shakeel's suggestion, we describe the end-to-end flow/story of all these events occurring on a timeline for future reference:

The goal of NUMA balancing is to co-locate a task and its memory pages on the same NUMA node. There are two strategies: migrate the pages to the task's node, or migrate the task to the node where its pages reside. Suppose a task p1 is running on Node 0, but its pages are located on Node 1. NUMA page fault statistics for p1 reveal its "page footprint" across nodes. If NUMA balancing detects that most of p1's pages are on Node 1:

1. Page Migration Attempt: NUMA balancing first tries to migrate p1's pages to Node 0. The numa_page_migrate counter increments.

2. Task Migration Strategies: After the page migration finishes, NUMA balancing checks every 1 second to see if p1 can be migrated to Node 1.

   Case 2.1: Idle CPU Available. If Node 1 has an idle CPU, p1 is directly scheduled there. This event is logged as numa_task_migrated.

   Case 2.2: No Idle CPU (Task Swap). If all CPUs on Node 1 are busy, direct migration could cause CPU contention or load imbalance. Instead, NUMA balancing selects a candidate task p2 on Node 1 that prefers Node 0 (e.g., due to its own page footprint). p1 and p2 are swapped. This cross-node swap is recorded as numa_task_swapped.

Link: https://lkml.kernel.org/r/d00edb12ba0f0de3c5222f61487e65f2ac58f5b1.1748493462.git.yu.c.chen@intel.com Link: https://lkml.kernel.org/r/7ef90a88602ed536be46eba7152ed0d33bad5790.1748002400.git.yu.c.chen@intel.com Signed-off-by: Chen Yu <yu.c.chen@intel.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Cc: Aubrey Li <aubrey.li@intel.com> Cc: Ayush Jain <Ayush.jain3@amd.com> Cc: "Chen, Tim C" <tim.c.chen@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Libo Chen <libo.chen@oracle.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31  sched/numa: fix task swap by skipping kernel threads  (Libo Chen; 1 file, -1/+2)
Patch series "sched/numa: add statistics of numa balance task migration", v6. Introduce task migration and swap statistics in the following places: /sys/fs/cgroup/{GROUP}/memory.stat /proc/{PID}/sched /proc/vmstat These statistics facilitate a rapid evaluation of the performance and resource utilization of the target workload. This patch (of 2): Task swapping is triggered when there are no idle CPUs in task A's preferred node. In this case, the NUMA load balancer chooses a task B on A's preferred node and swaps B with A. This helps improve NUMA locality without introducing load imbalance between nodes. In the current implementation, B's NUMA node preference is not mandatory. That is to say, a kernel thread might be incorrectly chosen as B. However, kernel thread and user space thread that does not have mm are not supposed to be covered by NUMA balancing because NUMA balancing only considers user pages via VMAs. According to Peter's suggestion for fixing this issue, we use PF_KTHREAD to skip the kernel thread. curr->mm is also checked because it is possible that user_mode_thread() might create a user thread without an mm. As per Prateek's analysis, after adding the PF_KTHREAD check, there is no need to further check the PF_IDLE flag: : - play_idle_precise() already ensures PF_KTHREAD is set before adding : PF_IDLE : : - cpu_startup_entry() is only called from the startup thread which : should be marked with PF_KTHREAD (based on my understanding looking at : commit cff9b2332ab7 ("kernel/sched: Modify initial boot task idle : setup")) In summary, the check in task_numa_compare() now aligns with task_tick_numa(). Link: https://lkml.kernel.org/r/cover.1748493462.git.yu.c.chen@intel.com Link: https://lkml.kernel.org/r/43d68b356b25d124f0d222ebedf3859e86eefb9f.1748493462.git.yu.c.chen@intel.com Link: https://lkml.kernel.org/r/cover.1748002400.git.yu.c.chen@intel.com Link: https://lkml.kernel.org/r/eaacc9c9bd37bac92d43a671867d85b2fdad3b06.1748002400.git.yu.c.chen@intel.com Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Libo Chen <libo.chen@oracle.com> Suggested-by: Michal Koutný <mkoutny@suse.com> Tested-by: Ayush Jain <Ayush.jain3@amd.com> Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Aubrey Li <aubrey.li@intel.com> Cc: "Chen, Tim C" <tim.c.chen@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: K Prateek Nayak <kprateek.nayak@amd.com> Cc: Madadi Vineeth Reddy <vineethr@linux.ibm.com> Cc: Mel Gorman <mgorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31  tools/testing: check correct variable in open_procmap()  (Dan Carpenter; 1 file, -1/+1)
Check if "procmap_out->fd" is negative instead of "procmap_out" (which is a pointer). Link: https://lkml.kernel.org/r/aDbFuUTlJTBqziVd@stanley.mountain Fixes: bd23f293a0d5 ("tools/testing: add PROCMAP_QUERY helper functions in mm self tests") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: levi.yun <yeoreum.yun@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31  tools/testing/vma: add missing function stub  (Lorenzo Stoakes; 1 file, -0/+5)
The hugetlb fix introduced in commit ee40c9920ac2 ("mm: fix copy_vma() error handling for hugetlb mappings") mistakenly did not provide a stub for the VMA userland testing, which results in a compile error when trying to build this. Provide this stub to resolve the issue. Link: https://lkml.kernel.org/r/20250528-fix-vma-test-v1-1-c8a5f533b38f@oracle.com Fixes: ee40c9920ac2 ("mm: fix copy_vma() error handling for hugetlb mappings") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Cc: Jann Horn <jannh@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31  mm/gup: update comment explaining why gup_fast() disables IRQs  (Jann Horn; 1 file, -1/+1)
The current comment in gup_fast() talks about "IPIs that come from THPs splitting", which is outdated and refers to the old THP splitting implementation that was removed in commit ad0bed24e98b ("thp: drop all split_huge_page()-related code"), which landed in v4.5. Before then, THP splitting involved a pmdp_splitting_flush(), which sent an IPI to serialize against gup_fast(). Nowadays, we use tlb_remove_table_sync_one() to send IPIs that serialize against gup_fast(); this is used, for example, in THP *collapsing* to stop gup_fast() walks of a page table before depositing it. Link: https://lkml.kernel.org/r/20250528-gup-irq-comment-fix-v1-1-b9d83c345333@google.com Signed-off-by: Jann Horn <jannh@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31  selftests/mm: two fixes for the pfnmap test  (David Hildenbrand; 1 file, -4/+57)
When unregistering the signal handler, we have to pass SIG_DFL, and blindly reading from PFN 0 and PFN 1 seems to be problematic on !x86 systems. In particular, on arm64 tx2 machines where nothing resides at these physical memory locations, we can generate RAS errors. Let's fix it by scanning /proc/iomem for actual "System RAM". Link: https://lkml.kernel.org/r/20250528195244.1182810-1-david@redhat.com Fixes: 2616b370323a ("selftests/mm: add simple VM_PFNMAP tests based on mmap'ing /dev/mem") Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: Ryan Roberts <ryan.roberts@arm.com> Closes: https://lore.kernel.org/all/232960c2-81db-47ca-a337-38c4bce5f997@arm.com/T/#u Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Tested-by: Aishwarya TCV <aishwarya.tcv@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
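Scanning /proc/iomem for a usable range can look roughly like this userspace sketch (reading real addresses requires root; simplified relative to the actual test):

	#include <stdio.h>
	#include <string.h>

	/* find the first "System RAM" range listed in /proc/iomem */
	static int find_ram_range(unsigned long long *start, unsigned long long *end)
	{
		char line[256];
		FILE *f = fopen("/proc/iomem", "r");
		int found = 0;

		if (!f)
			return 0;
		while (fgets(line, sizeof(line), f)) {
			/* entries look like "00100000-3fffffff : System RAM" */
			if (strstr(line, ": System RAM") &&
			    sscanf(line, "%llx-%llx", start, end) == 2) {
				found = 1;
				break;
			}
		}
		fclose(f);
		return found;
	}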
2025-05-31  mm/khugepaged: fix race with folio split/free using temporary reference  (Shivank Garg; 1 file, -1/+17)
hpage_collapse_scan_file() calls is_refcount_suitable(), which in turn calls folio_mapcount(). folio_mapcount() checks folio_test_large() before proceeding to folio_large_mapcount(), but there is a race window where the folio may get split/freed between these checks, triggering: VM_WARN_ON_FOLIO(!folio_test_large(folio), folio) Take a temporary reference to the folio in hpage_collapse_scan_file(). This stabilizes the folio during refcount check and prevents incorrect large folio detection due to concurrent split/free. Use helper folio_expected_ref_count() + 1 to compare with folio_ref_count() instead of using is_refcount_suitable(). Link: https://lkml.kernel.org/r/20250526182818.37978-1-shivankg@amd.com Fixes: 05c5323b2a34 ("mm: track mapcount of large folios in single value") Signed-off-by: Shivank Garg <shivankg@amd.com> Reported-by: syzbot+2b99589e33edbe9475ca@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/6828470d.a70a0220.38f255.000c.GAE@google.com Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Bharata B Rao <bharata@amd.com> Cc: Fengwei Yin <fengwei.yin@intel.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
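A sketch of the stabilized check described above (folio_try_get()/folio_put() and SCAN_PAGE_COUNT are real; the surrounding flow is simplified):

	/* sketch: in hpage_collapse_scan_file(), per candidate folio */
	if (!folio_try_get(folio))	/* folio may be split/freed under us */
		continue;
	/* with our temporary reference the large-folio checks are stable */
	if (folio_ref_count(folio) != folio_expected_ref_count(folio) + 1)
		result = SCAN_PAGE_COUNT;	/* unexpected extra references */
	folio_put(folio);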
2025-05-31  mm: add CONFIG_PAGE_BLOCK_ORDER to select page block order  (Juan Yescas; 4 files, -5/+55)
Problem: On large page size configurations (16KiB, 64KiB), the CMA alignment requirement (CMA_MIN_ALIGNMENT_BYTES) increases considerably, and this causes the CMA reservations to be larger than necessary. This means that the system will have fewer available MIGRATE_UNMOVABLE and MIGRATE_RECLAIMABLE page blocks, since MIGRATE_CMA can't fall back to them. CMA_MIN_ALIGNMENT_BYTES increases because it depends on MAX_PAGE_ORDER, which depends on ARCH_FORCE_MAX_ORDER. The value of ARCH_FORCE_MAX_ORDER increases on 16k and 64k kernels.

For example, in ARM, the CMA alignment requirement when the CONFIG_ARCH_FORCE_MAX_ORDER default value is used and CONFIG_TRANSPARENT_HUGEPAGE is set:

PAGE_SIZE | MAX_PAGE_ORDER | pageblock_order | CMA_MIN_ALIGNMENT_BYTES
-----------------------------------------------------------------------
4KiB      | 10             | 9               | 4KiB  * (2 ^ 9)  = 2MiB
16KiB     | 11             | 11              | 16KiB * (2 ^ 11) = 32MiB
64KiB     | 13             | 13              | 64KiB * (2 ^ 13) = 512MiB

There are some extreme cases for the CMA alignment requirement when the CONFIG_ARCH_FORCE_MAX_ORDER maximum value is set and CONFIG_TRANSPARENT_HUGEPAGE and CONFIG_HUGETLB_PAGE are NOT set:

PAGE_SIZE | MAX_PAGE_ORDER | pageblock_order | CMA_MIN_ALIGNMENT_BYTES
------------------------------------------------------------------------
4KiB      | 15             | 15              | 4KiB  * (2 ^ 15) = 128MiB
16KiB     | 13             | 13              | 16KiB * (2 ^ 13) = 128MiB
64KiB     | 13             | 13              | 64KiB * (2 ^ 13) = 512MiB

This affects the CMA reservations for the drivers. If a driver in a 4KiB kernel needs 4MiB of CMA memory, in a 16KiB kernel the minimal reservation has to be 32MiB due to the alignment requirements:

reserved-memory {
    ...
    cma_test_reserve: cma_test_reserve {
        compatible = "shared-dma-pool";
        size = <0x0 0x400000>; /* 4 MiB */
        ...
    };
};

reserved-memory {
    ...
    cma_test_reserve: cma_test_reserve {
        compatible = "shared-dma-pool";
        size = <0x0 0x2000000>; /* 32 MiB */
        ...
    };
};

Solution: Add a new config CONFIG_PAGE_BLOCK_ORDER that allows setting the page block order in all the architectures. The maximum page block order will be given by ARCH_FORCE_MAX_ORDER. By default, CONFIG_PAGE_BLOCK_ORDER will have the same value as ARCH_FORCE_MAX_ORDER. This will make sure that current kernel configurations won't be affected by this change; it is an opt-in change. This patch allows having the same CMA alignment requirements for large page sizes (16KiB, 64KiB) as in 4kb kernels, by setting a lower pageblock_order.

Tests:
- Verified that HugeTLB pages work when pageblock_order is 1, 7, 10 on 4k and 16k kernels.
- Verified that Transparent Huge Pages work when pageblock_order is 1, 7, 10 on 4k and 16k kernels.
- Verified that dma-buf heaps allocations work when pageblock_order is 1, 7, 10 on 4k and 16k kernels.

Benchmarks: The benchmarks compare 16kb kernels with pageblock_order 10 and 7. The reason for pageblock_order 7 is that this value makes the min CMA alignment requirement the same as that in 4kb kernels (2MB).

- Perform 100K dma-buf heaps (/dev/dma_heap/system) allocations of SZ_8M, SZ_4M, SZ_2M, SZ_1M, SZ_64, SZ_8, SZ_4. Use simpleperf (https://developer.android.com/ndk/guides/simpleperf) to measure the # of instructions and page-faults on 16k kernels. The benchmark was executed 10 times. The averages are below:

       # instructions       |    #page-faults
  order 10    |   order 7   | order 10 | order 7
--------------------------------------------------------
13,891,765,770 | 11,425,777,314 | 220 | 217
14,456,293,487 | 12,660,819,302 | 224 | 219
13,924,261,018 | 13,243,970,736 | 217 | 221
13,910,886,504 | 13,845,519,630 | 217 | 221
14,388,071,190 | 13,498,583,098 | 223 | 224
13,656,442,167 | 12,915,831,681 | 216 | 218
13,300,268,343 | 12,930,484,776 | 222 | 218
13,625,470,223 | 14,234,092,777 | 219 | 218
13,508,964,965 | 13,432,689,094 | 225 | 219
13,368,950,667 | 13,683,587,37  | 219 | 225
-------------------------------------------------------------------
13,803,137,433 | 13,131,974,268 | 220 | 220    Averages

There were 4.86% fewer instructions when the order was 7, in comparison with order 10: 13,803,137,433 - 13,131,974,268 = -671,163,166 (-4.86%). The number of page faults with order 7 and order 10 were the same. These results didn't show any significant regression when the pageblock_order is set to 7 on 16kb kernels.

- Run speedometer 3.1 (https://browserbench.org/Speedometer3.1/) 5 times on the 16k kernels with pageblock_order 7 and 10:

order 10 | order 7 | order 7 - order 10 | (order 7 - order 10) %
-------------------------------------------------------------------
15.8     | 16.4    |  0.6               |  3.80%
16.4     | 16.2    | -0.2               | -1.22%
16.6     | 16.3    | -0.3               | -1.81%
16.8     | 16.3    | -0.5               | -2.98%
16.6     | 16.8    |  0.2               |  1.20%
-------------------------------------------------------------------
16.44    | 16.4    | -0.04              | -0.24%    Averages

The results didn't show any significant regression when the pageblock_order is set to 7 on 16kb kernels. Link: https://lkml.kernel.org/r/20250521215807.1860663-1-jyescas@google.com Signed-off-by: Juan Yescas <jyescas@google.com> Acked-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
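The alignment arithmetic behind the tables reduces to a single shift; a paraphrase of the definitions (modulo exact spelling in include/linux/cma.h), with the two 16 KiB cases worked through:

	#define CMA_MIN_ALIGNMENT_PAGES pageblock_nr_pages	/* 1UL << pageblock_order */
	#define CMA_MIN_ALIGNMENT_BYTES (PAGE_SIZE * CMA_MIN_ALIGNMENT_PAGES)

	/* 16 KiB pages, pageblock_order = 11:  16 KiB << 11 = 32 MiB            */
	/* 16 KiB pages, pageblock_order =  7:  16 KiB <<  7 =  2 MiB            */
	/* i.e. order 7 restores parity with 4 KiB kernels (4 KiB << 9 = 2 MiB)  */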
2025-05-31  mmu_notifiers: remove leftover stub macros  (Jann Horn; 1 file, -3/+0)
Commit ec8832d007cb ("mmu_notifiers: don't invalidate secondary TLBs as part of mmu_notifier_invalidate_range_end()") removed the main definitions of {ptep,pmdp_huge,pudp_huge}_clear_flush_notify; just their !CONFIG_MMU_NOTIFIER stubs are left behind, remove them. Link: https://lkml.kernel.org/r/20250523-mmu-notifier-cleanup-unused-v1-1-cc1f47ebec33@google.com Signed-off-by: Jann Horn <jannh@google.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31  selftests/mm: deduplicate test names in madv_populate  (Mark Brown; 1 file, -9/+9)
The madv_populate selftest has some repetitive code for several different cases that it covers, including repeated test names used in ksft_test_result() reports. This causes problems for automation: the test name is used both to track the test between runs and to distinguish between multiple tests within the same run. Fix this by tweaking the messages with duplication to be more specific about the contexts they're in. Link: https://lkml.kernel.org/r/20250522-selftests-mm-madv-populate-dedupe-v1-1-fd1dedd79b4b@kernel.org Signed-off-by: Mark Brown <broonie@kernel.org> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31  kcov: rust: add flags for KCOV with Rust  (Alice Ryhl; 3 files, -0/+10)
Rust code is currently not instrumented properly when KCOV is enabled. Thus, add the relevant flags to perform instrumentation correctly. This is necessary for efficient fuzzing of Rust code. The sanitizer-coverage features of LLVM have existed for long enough that they are available on any LLVM version supported by rustc, so we do not need any Kconfig feature detection. The coverage level is set to 3, as that is the level needed by trace-pc. We do not instrument `core` since when we fuzz the kernel, we are looking for bugs in the kernel, not the Rust stdlib. Link: https://lkml.kernel.org/r/20250501-rust-kcov-v2-1-b71e83e9779f@google.com Signed-off-by: Alice Ryhl <aliceryhl@google.com> Co-developed-by: Matthew Maurer <mmaurer@google.com> Signed-off-by: Matthew Maurer <mmaurer@google.com> Reviewed-by: Alexander Potapenko <glider@google.com> Tested-by: Aleksandr Nogikh <nogikh@google.com> Acked-by: Miguel Ojeda <ojeda@kernel.org> Cc: Andreas Hindborg <a.hindborg@kernel.org> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: Bill Wendling <morbo@google.com> Cc: Björn Roy Baron <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Gary Guo <gary@garyguo.net> Cc: Justin Stitt <justinstitt@google.com> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Trevor Gross <tmgross@umich.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31  mm: rust: make CONFIG_MMU ifdefs more narrow  (Alice Ryhl; 2 files, -52/+72)
Currently the entire kernel::mm module is ifdef'd out when CONFIG_MMU=n. However, there are some downstream users of the module in rust/kernel/task.rs and rust/kernel/miscdevice.rs. Thus, update the cfgs so that only MmWithUserAsync is removed with CONFIG_MMU=n. The code is moved into a new file, since the #[cfg()] annotation otherwise has to be duplicated several times. Link: https://lkml.kernel.org/r/20250516193219.2987032-1-aliceryhl@google.com Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202505071753.kldNHYVQ-lkp@intel.com/ Closes: https://lore.kernel.org/oe-kbuild-all/202505072116.eSYC8igT-lkp@intel.com/ Fixes: 5bb9ed6cdfeb ("mm: rust: add abstraction for struct mm_struct") Signed-off-by: Alice Ryhl <aliceryhl@google.com> Reviewed-by: Boqun Feng <boqun.feng@gmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31  mmu_gather: move tlb flush for VM_PFNMAP/VM_MIXEDMAP vmas into free_pgtables()  (Roman Gushchin; 3 files, -10/+39)
Commit b67fbebd4cf9 ("mmu_gather: Force tlb-flush VM_PFNMAP vmas") added a forced tlbflush to tlb_end_vma(), which is required to avoid a race between munmap() and unmap_mapping_range(). However, it added some overhead to other paths where tlb_end_vma() is used but vmas are not removed, e.g. madvise(MADV_DONTNEED). Fix this by moving the tlb flush out of tlb_end_vma() into a new tlb_flush_vmas() called from free_pgtables(), somewhat similar to the stable version of the original commit: commit 895428ee124a ("mm: Force TLB flush for PFNMAP mappings before unlink_file_vma()"). Note that if tlb->fullmm is set, no flush is required, as the whole mm is about to be destroyed. Link: https://lkml.kernel.org/r/20250522012838.163876-1-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Jann Horn <jannh@google.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Will Deacon <will@kernel.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Nick Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
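A sketch of where the moved flush could live, assuming a flag such as vma_pfn is recorded on the mmu_gather while unmapping (the flag name is an assumption based on the commit's description; tlb_flush_mmu_tlbonly() and tlb->fullmm are real):

	/* sketch: called from free_pgtables(), before unlinking the vmas */
	static inline void tlb_flush_vmas(struct mmu_gather *tlb)
	{
		/* whole-mm teardown flushes everything at the end anyway */
		if (tlb->fullmm)
			return;
		/* only VM_PFNMAP/VM_MIXEDMAP vmas need the forced flush */
		if (tlb->vma_pfn)	/* assumed flag */
			tlb_flush_mmu_tlbonly(tlb);
	}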
2025-05-31  mm/damon/Kconfig: enable CONFIG_DAMON by default  (SeongJae Park; 1 file, -0/+1)
As of this writing, multiple major distros including Alma, Amazon, Android, CentOS, Debian, Fedora, and Oracle build-enable DAMON (set CONFIG_DAMON[1]). Enabling it by default will save configuration setup time for current and future DAMON users. Build-enabling DAMON does not introduce a real risk, since it makes no behavioral change by default; it requires explicit user requests to do anything. The only potential risk is making the kernel a little bit larger. On a production-purpose configuration, it increases the resulting kernel package size by about 0.1 % of the final package file. I believe that's too small to be a real problem in common setups. Hence, the benefit of enabling CONFIG_DAMON outweighs the potential risk. Set CONFIG_DAMON by default. Link: https://oracle.github.io/kconfigs/?config=UTS_RELEASE&config=DAMON [1] Link: https://lkml.kernel.org/r/20250521042755.39653-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Acked-by: Honggyu Kim <honggyu.kim@sk.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31  mm/damon/Kconfig: set DAMON_{VADDR,PADDR,SYSFS} default to DAMON  (SeongJae Park; 1 file, -0/+3)
Patch series "mm/damon: build-enable essential DAMON components by default". As of this writing, multiple major distros including Alma, Amazon, Android, CentOS, Debian, Fedora, and Oracle are build-enabling DAMON (set CONFIG_DAMON[1]). Configuring DAMON is not very easy, since it is disabled by default, and there are multiple essential options that need to be manually turned on, one by one. Make it easier, by grouping essential configurations to be enabled with one selection, and enabling build of the essential parts of DAMON by default. Note that build-enabling DAMON does not introduce any real risk, since it makes no behavioral change by default. It requires explicit user requests to do anything. Only one potential risk is making the size of the kernel a little bit larger. On a production-purpose configuration, it increases the resulting kernel package binary size by about 0.1 % of the final package file. I believe that's too small to be a real problem in common setups. DAMON_{VADDR,PADDR,SYSFS} are de-facto essential parts of DAMON for normal usages. Because those need to be enabled one by one, however, and there are other test-purpose or non-essential configurations, it is easy to be confused and make mistakes at setup. Make the essential configurations default to CONFIG_DAMON, so that those can be enabled by default with a single change. Link: https://oracle.github.io/kconfigs/?config=UTS_RELEASE&config=DAMON [1] Link: https://lkml.kernel.org/r/20250521042755.39653-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250521042755.39653-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Acked-by: Honggyu Kim <honggyu.kim@sk.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31  hugetlb: show nr_huge_pages in report_hugepages()  (Wenjie Xu; 1 file, -1/+1)
The number of pre-allocated huge pages should be nr_huge_pages, not free_huge_pages, although they are the same during the boot stage. Link: https://lkml.kernel.org/r/20250515114231.65824-1-xuwenjie04@baidu.com Signed-off-by: Wenjie Xu <xuwenjie04@baidu.com> Signed-off-by: Li RongQing <lirongqing@baidu.com> Acked-by: Oscar Salvador <osalvador@suse.de> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31  selftests/mm: skip hugevm test if kernel config file is not present  (Zi Yan; 1 file, -17/+9)
When running hugevm tests in a machine without kernel config present, e.g., a VM running a kernel without CONFIG_IKCONFIG_PROC nor /boot/config-*, skip hugevm tests, which reads kernel config to get page table level information. Link: https://lkml.kernel.org/r/20250516132938.356627-3-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Adam Sindelar <adam@wowsignal.io> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Pedro Falcato <pfalcato@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31  selftests/mm: skip guard_regions.uffd tests when uffd is not present  (Zi Yan; 1 file, -2/+15)
Patch series "Skip mm selftests instead when kernel features are not present", v2. Two guard_regions tests on userfaultfd fail when userfaultfd is not present. Skip them instead. hugevm test reads kernel config to get page table level information and fails when neither /proc/config.gz nor /boot/config-* is present. Skip it instead. This patch (of 2): When userfaultfd is not compiled into kernel, userfaultfd() returns -1, causing guard_regions.uffd tests to fail. Skip the tests instead. Link: https://lkml.kernel.org/r/20250516132938.356627-1-ziy@nvidia.com Link: https://lkml.kernel.org/r/20250516132938.356627-2-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Cc: Adam Sindelar <adam@wowsignal.io> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31  mm/shmem: remove unneeded xa_is_value() check in shmem_unuse_swap_entries()  (Kemeng Shi; 1 file, -2/+0)
As only value entries will be added to fbatch in shmem_find_swap_entries(), there is no need to do the xa_is_value() check in shmem_unuse_swap_entries(). Link: https://lkml.kernel.org/r/20250516170939.965736-6-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kairui Song <kasong@tencent.com> Cc: kernel test robot <oliver.sang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31  mm: shmem: only remove inode from swaplist when it's swapped page count is 0  (Kemeng Shi; 1 file, -2/+2)
Even if we fail to allocate a swap entry, the inode might have a previously allocated entry, and we might take the inode containing that swap entry off the swaplist. As a result, try_to_unuse() may enter a potential dead loop, repeatedly looking for the inode and cleaning its swap entry. Only take the inode off the swaplist when its swapped page count is 0 to fix the issue. Link: https://lkml.kernel.org/r/20250516170939.965736-5-shikemeng@huaweicloud.com Fixes: b487a2da3575 ("mm, swap: simplify folio swap allocation") Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Kairui Song <kasong@tencent.com> Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202505161438.9009cf47-lkp@intel.com Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31  mm/shmem: fix potential dead loop in shmem_unuse()  (Kemeng Shi; 1 file, -3/+6)
If multiple shmem_unuse() calls for different swap types run concurrently, a dead loop could occur as follows:

shmem_unuse(typeA)               shmem_unuse(typeB)
 mutex_lock(&shmem_swaplist_mutex)
 list_for_each_entry_safe(info, next, ...)
  ...
 mutex_unlock(&shmem_swaplist_mutex)
                                  /* info->swapped may drop to 0 */
                                  shmem_unuse_inode(&info->vfs_inode, type)

                                  mutex_lock(&shmem_swaplist_mutex)
                                  list_for_each_entry(info, next, ...)
                                   if (!info->swapped)
                                    list_del_init(&info->swaplist)
                                  ...
                                  mutex_unlock(&shmem_swaplist_mutex)

 mutex_lock(&shmem_swaplist_mutex)
 /* iterate with offlist entry and encounter a dead loop */
 next = list_next_entry(info, swaplist);
 ...

Restart the iteration if the inode is already off the shmem_swaplist list to fix the issue. Link: https://lkml.kernel.org/r/20250516170939.965736-4-shikemeng@huaweicloud.com Fixes: b56a2d8af914 ("mm: rid swapoff of quadratic complexity") Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kairui Song <kasong@tencent.com> Cc: kernel test robot <oliver.sang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
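The restart logic can be sketched like this (not the exact diff; the restart label placement is an assumption):

	/* sketch: in shmem_unuse(), after re-acquiring the mutex */
	mutex_lock(&shmem_swaplist_mutex);
	if (list_empty(&info->swaplist))
		goto restart;	/* inode was taken off the list concurrently */
	next = list_next_entry(info, swaplist);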
2025-05-31  mm: shmem: add missing shmem_unacct_size() in __shmem_file_setup()  (Kemeng Shi; 1 file, -3/+3)
We will miss shmem_unacct_size() when is_idmapped_mnt() returns a failure. Move is_idmapped_mnt() before shmem_acct_size() to fix the issue. Link: https://lkml.kernel.org/r/20250516170939.965736-3-shikemeng@huaweicloud.com Fixes: 7a80e5b8c6fa ("shmem: support idmapped mounts for tmpfs") Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kairui Song <kasong@tencent.com> Cc: kernel test robot <oliver.sang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31  mm: shmem: avoid unpaired folio_unlock() in shmem_swapin_folio()  (Kemeng Shi; 1 file, -0/+2)
Patch series "Some random fixes and cleanup to shmem", v3. This series contains some simple fixes and cleanup which are made during learning shmem. More details can be found in respective patches. This patch (of 5): If we get a folio from swap_cache_get_folio() successfully but encounter a failure before the folio is locked, we will unlock the folio which was not previously locked. Put the folio and set it to NULL when a failure occurs before the folio is locked to fix the issue. Link: https://lkml.kernel.org/r/20250516170939.965736-1-shikemeng@huaweicloud.com Link: https://lkml.kernel.org/r/20250516170939.965736-2-shikemeng@huaweicloud.com Fixes: 058313515d5a ("mm: shmem: fix potential data corruption during shmem swapin") Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Kairui Song <kasong@tencent.com> Cc: Hugh Dickins <hughd@google.com> Cc: kernel test robot <oliver.sang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31  mm/damon/core: avoid destroyed target reference from DAMOS quota  (Akinobu Mita; 1 file, -0/+8)
When the number of monitoring targets in running contexts is reduced, there may be DAMOS quotas referencing targets that will be destroyed. Applying the scheme action for such a DAMOS scheme will be skipped forever, as it keeps looking for the starting part of a region of the destroyed monitoring target. To fix this issue, when a monitoring target is destroyed, reset the starting part for all DAMOS quotas that reference the target. Link: https://lkml.kernel.org/r/20250517141852.142802-1-akinobu.mita@gmail.com Fixes: da87878010e5 ("mm/damon/sysfs: support online inputs update") Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31  memcg: make memcg_rstat_updated nmi safe  (Shakeel Butt; 1 file, -7/+9)
Currently the kernel maintains memory related stats updates per-cgroup to optimize stats flushing. stats_updates is defined as atomic64_t, which is not nmi-safe on some archs. Actually we don't really need a 64-bit atomic, as the max value stats_updates can reach should be less than nr_cpus * MEMCG_CHARGE_BATCH; a normal atomic_t should suffice. Also the function cgroup_rstat_updated() is still not nmi-safe, but there is a parallel effort to make it nmi-safe, so until then let's ignore it in the nmi context. Link: https://lkml.kernel.org/r/20250519063142.111219-6-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31  memcg: nmi-safe slab stats updates  (Shakeel Butt; 1 file, -3/+33)
The objcg based kmem [un]charging can be called in nmi context and it may need to update NR_SLAB_[UN]RECLAIMABLE_B stats. So, let's correctly handle the updates of these stats in the nmi context. Link: https://lkml.kernel.org/r/20250519063142.111219-5-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31  memcg: add nmi-safe update for MEMCG_KMEM  (Shakeel Butt; 1 file, -2/+19)
The objcg based kmem charging and uncharging code path needs to update MEMCG_KMEM appropriately. Let's add support to update MEMCG_KMEM in nmi-safe way for those code paths. Link: https://lkml.kernel.org/r/20250519063142.111219-4-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31  memcg: nmi safe memcg stats for specific archs  (Shakeel Butt; 3 files, -0/+66)
There are archs which have NMI but do not support this_cpu_* ops safely in the nmi context, though they do support safe atomic ops in nmi context. For such archs, let's add infra to use atomic ops for the memcg stats which can be updated in nmi. At the moment, the memcg stats which get updated in the objcg charging path are MEMCG_KMEM, NR_SLAB_RECLAIMABLE_B & NR_SLAB_UNRECLAIMABLE_B. Rather than adding support for all memcg stats to be nmi safe, let's just add infra to make these three stats nmi safe, which this patch is doing. Link: https://lkml.kernel.org/r/20250519063142.111219-3-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>