aboutsummaryrefslogtreecommitdiffstatshomepage
path: root/tools/perf/scripts/python/export-to-postgresql.py (unfollow)
AgeCommit message (Collapse)AuthorFilesLines
2023-10-28kbuild: remove ARCH_POSTLINK from module buildsMasahiro Yamada5-16/+1
The '%.ko' rule in arch/*/Makefile.postlink does nothing but call the 'true' command. Remove the unneeded code. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Reviewed-by: Nicolas Schier <n.schier@avm.de>
2023-10-28kbuild: unify no-compiler-targets and no-sync-config-targetsMasahiro Yamada1-11/+2
Now that vdso_install does not depend on any in-tree build artifact, it no longer needs a compiler, making no-compiler-targets the same as no-sync-config-targets. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Reviewed-by: Nicolas Schier <nicolas@fjasle.eu>
2023-10-28kbuild: unify vdso_install rulesMasahiro Yamada20-185/+73
Currently, there is no standard implementation for vdso_install, leading to various issues: 1. Code duplication Many architectures duplicate similar code just for copying files to the install destination. Some architectures (arm, sparc, x86) create build-id symlinks, introducing more code duplication. 2. Unintended updates of in-tree build artifacts The vdso_install rule depends on the vdso files to install. It may update in-tree build artifacts. This can be problematic, as explained in commit 19514fc665ff ("arm, kbuild: make "make install" not depend on vmlinux"). 3. Broken code in some architectures Makefile code is often copied from one architecture to another without proper adaptation. 'make vdso_install' for parisc does not work. 'make vdso_install' for s390 installs vdso64, but not vdso32. To address these problems, this commit introduces a generic vdso_install rule. Architectures that support vdso_install need to define vdso-install-y in arch/*/Makefile. vdso-install-y lists the files to install. For example, arch/x86/Makefile looks like this: vdso-install-$(CONFIG_X86_64) += arch/x86/entry/vdso/vdso64.so.dbg vdso-install-$(CONFIG_X86_X32_ABI) += arch/x86/entry/vdso/vdsox32.so.dbg vdso-install-$(CONFIG_X86_32) += arch/x86/entry/vdso/vdso32.so.dbg vdso-install-$(CONFIG_IA32_EMULATION) += arch/x86/entry/vdso/vdso32.so.dbg These files will be installed to $(MODLIB)/vdso/ with the .dbg suffix, if exists, stripped away. vdso-install-y can optionally take the second field after the colon separator. This is needed because some architectures install a vdso file as a different base name. The following is a snippet from arch/arm64/Makefile. vdso-install-$(CONFIG_COMPAT_VDSO) += arch/arm64/kernel/vdso32/vdso.so.dbg:vdso32.so This will rename vdso.so.dbg to vdso32.so during installation. If such architectures change their implementation so that the base names match, this workaround will go away. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Acked-by: Sven Schnelle <svens@linux.ibm.com> # s390 Reviewed-by: Nicolas Schier <nicolas@fjasle.eu> Reviewed-by: Guo Ren <guoren@kernel.org> Acked-by: Helge Deller <deller@gmx.de> # parisc Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
2023-10-18docs: kbuild: add INSTALL_DTBS_PATHRicardo B. Marliere2-0/+13
The documentation for kbuild and makefiles is missing an explanation of a variable important for some architectures. Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2023-10-18UML: remove unused cmd_vdso_installMasahiro Yamada1-12/+0
You cannot run this code because arch/um/Makefile does not define the vdso_install target. It appears that this code was blindly copied from another architecture. Remove the dead code. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Acked-by: Richard Weinberger <richard@nod.at>
2023-10-18csky: remove unused cmd_vdso_installMasahiro Yamada1-10/+0
You cannot run this code because arch/csky/Makefile does not define the vdso_install target. It appears that this code was blindly copied from another architecture. Remove the dead code. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Acked-by: Guo Ren <guoren@kernel.org>
2023-10-18modpost: factor out the common boilerplate of section_rel(a)Masahiro Yamada1-24/+26
The first few lines of section_rel() and section_rela() are the same. They both retrieve the index of the section to which the relocaton applies, and skip known-good sections. This common code should be moved to check_sec_ref(). Avoid ugly casts when computing 'start' and 'stop', and also make the Elf_Rel and Elf_Rela pointers const. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
2023-10-18modpost: refactor check_sec_ref()Masahiro Yamada1-6/+7
We can replace &elf->sechdrs[i] with &sechdrs[i] to slightly shorten the code because we already have the local variable 'sechdrs'. However, defining 'sechdr' instead shortens the code further. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
2023-10-18modpost: define TO_NATIVE() using bswap_* functionsMasahiro Yamada2-22/+16
The current TO_NATIVE() has some limitations: 1) You cannot cast the argument. 2) You cannot pass a variable marked as 'const'. 3) Passing an array is a bug, but it is not detected. Impelement TO_NATIVE() using bswap_*() functions. These are GNU extensions. If we face portability issues, we can port the code from include/uapi/linux/swab.h. With this change, get_rel_type_and_sym() can be simplified by casting the arguments directly. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
2023-10-18modpost: fix ishtp MODULE_DEVICE_TABLE built on big-endian hostMasahiro Yamada1-2/+2
When MODULE_DEVICE_TABLE(ishtp, ) is built on a host with a different endianness from the target architecture, it results in an incorrect MODULE_ALIAS(). For example, see a case where drivers/platform/x86/intel/ishtp_eclite.c is built as a module for x86. If you build it on a little-endian host, you will get the correct MODULE_ALIAS: $ grep MODULE_ALIAS drivers/platform/x86/intel/ishtp_eclite.mod.c MODULE_ALIAS("ishtp:{6A19CC4B-D760-4DE3-B14D-F25EBD0FBCD9}"); However, if you build it on a big-endian host, you will get a wrong MODULE_ALIAS: $ grep MODULE_ALIAS drivers/platform/x86/intel/ishtp_eclite.mod.c MODULE_ALIAS("ishtp:{BD0FBCD9-F25E-B14D-4DE3-D7606A19CC4B}"); This issue has been unnoticed because the x86 kernel is most likely built natively on an x86 host. The guid field must not be reversed because guid_t is an array of __u8. Fixes: fa443bc3c1e4 ("HID: intel-ish-hid: add support for MODULE_DEVICE_TABLE()") Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Reviewed-by: Thomas Weißschuh <linux@weissschuh.net> Tested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
2023-10-18modpost: fix tee MODULE_DEVICE_TABLE built on big-endian hostMasahiro Yamada1-5/+5
When MODULE_DEVICE_TABLE(tee, ) is built on a host with a different endianness from the target architecture, it results in an incorrect MODULE_ALIAS(). For example, see a case where drivers/char/hw_random/optee-rng.c is built as a module for ARM little-endian. If you build it on a little-endian host, you will get the correct MODULE_ALIAS: $ grep MODULE_ALIAS drivers/char/hw_random/optee-rng.mod.c MODULE_ALIAS("tee:ab7a617c-b8e7-4d8f-8301-d09b61036b64*"); However, if you build it on a big-endian host, you will get a wrong MODULE_ALIAS: $ grep MODULE_ALIAS drivers/char/hw_random/optee-rng.mod.c MODULE_ALIAS("tee:646b0361-9bd0-0183-8f4d-e7b87c617aab*"); The same problem also occurs when you enable CONFIG_CPU_BIG_ENDIAN, and build it on a little-endian host. This issue has been unnoticed because the ARM kernel is configured for little-endian by default, and most likely built on a little-endian host (cross-build on x86 or native-build on ARM). The uuid field must not be reversed because uuid_t is an array of __u8. Fixes: 0fc1db9d1059 ("tee: add bus driver framework for TEE based devices") Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Reviewed-by: Sumit Garg <sumit.garg@linaro.org>
2023-10-18kbuild: make binrpm-pkg always produce kernel-devel packageMasahiro Yamada1-2/+0
The generation of the kernel-devel package is disabled for binrpm-pkg presumably because it was quite big (>= 200MB) and took a long time to package. Commit fe66b5d2ae72 ("kbuild: refactor kernel-devel RPM package and linux-headers Deb package") reduced the package size to 12MB, and now it is quick to build. It won't hurt to have binrpm-pkg generate it by default. If you want to skip the kernel-devel package generation, you can pass RPMOPTS='--without devel': $ make binrpm-pkg RPMOPTS='--without devel' Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Tested-by: Nathan Chancellor <nathan@kernel.org>
2023-10-14rust: Respect HOSTCC when linking for hostMatthew Maurer2-0/+4
Currently, rustc defaults to invoking `cc`, even if `HOSTCC` is defined, resulting in build failures in hermetic environments where `cc` does not exist. This includes both hostprogs and proc-macros. Since we are setting the linker to `HOSTCC`, we set the linker flavor to `gcc` explicitly. The linker-flavor selects both which linker to search for if the linker is unset, and which kind of linker flags to pass. Without this flag, `rustc` would attempt to determine which flags to pass based on the name of the binary passed as `HOSTCC`. `gcc` is the name of the linker-flavor used by `rustc` for all C compilers, including both `gcc` and `clang`. Signed-off-by: Matthew Maurer <mmaurer@google.com> Reviewed-by: Martin Rodriguez Reboredo <yakoyoku@gmail.com> Tested-by: Alice Ryhl <aliceryhl@google.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Acked-by: Miguel Ojeda <ojeda@kernel.org> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2023-10-03kbuild: rpm-pkg: generate kernel.spec in rpmbuild/SPECS/Masahiro Yamada5-7/+12
kernel.spec is the last piece that resides outside the rpmbuild/ directory. Move all the RPM-related files to rpmbuild/ consistently. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Tested-by: Nathan Chancellor <nathan@kernel.org>
2023-10-03modpost: Optimize symbol search from linear to binary searchJack Brennen4-66/+232
Modify modpost to use binary search for converting addresses back into symbol references. Previously it used linear search. This change saves a few seconds of wall time for defconfig builds, but can save several minutes on allyesconfigs. Before: $ make LLVM=1 -j128 allyesconfig vmlinux -s KCFLAGS="-Wno-error" $ time scripts/mod/modpost -M -m -a -N -o vmlinux.symvers vmlinux.o 198.38user 1.27system 3:19.71elapsed After: $ make LLVM=1 -j128 allyesconfig vmlinux -s KCFLAGS="-Wno-error" $ time scripts/mod/modpost -M -m -a -N -o vmlinux.symvers vmlinux.o 11.91user 0.85system 0:12.78elapsed Signed-off-by: Jack Brennen <jbrennen@google.com> Tested-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2023-10-01Linux 6.6-rc4Linus Torvalds1-1/+1
2023-10-01kbuild: remove stale code for 'source' symlink in packaging scriptsMasahiro Yamada2-4/+0
Since commit d8131c2965d5 ("kbuild: remove $(MODLIB)/source symlink"), modules_install does not create the 'source' symlink. Remove the stale code from builddeb and kernel.spec. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2023-10-01modpost: Don't let "driver"s reference .exit.*Uwe Kleine-König1-2/+13
Drivers must not reference functions marked with __exit as these likely are not available when the code is built-in. There are few creative offenders uncovered for example in ARCH=amd64 allmodconfig builds. So only trigger the section mismatch warning for W=1 builds. The dual rule that drivers must not reference .init.* is implemented since commit 0db252452378 ("modpost: don't allow *driver to reference .init.*") which however missed that .exit.* should be handled in the same way. Thanks to Masahiro Yamada and Arnd Bergmann who gave valuable hints to find this improvement. Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2023-10-01vmlinux.lds.h: remove unused CPU_KEEP and CPU_DISCARD macrosMasahiro Yamada1-7/+0
Remove the left-over of commit e24f6628811e ("modpost: remove all traces of cpuinit/cpuexit sections"). Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Acked-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2023-10-01modpost: add missing else to the "of" checkMauricio Faria de Oliveira1-1/+1
Without this 'else' statement, an "usb" name goes into two handlers: the first/previous 'if' statement _AND_ the for-loop over 'devtable', but the latter is useless as it has no 'usb' device_id entry anyway. Tested with allmodconfig before/after patch; no changes to *.mod.c: git checkout v6.6-rc3 make -j$(nproc) allmodconfig make -j$(nproc) olddefconfig make -j$(nproc) find . -name '*.mod.c' | cpio -pd /tmp/before # apply patch make -j$(nproc) find . -name '*.mod.c' | cpio -pd /tmp/after diff -r /tmp/before/ /tmp/after/ # no difference Fixes: acbef7b76629 ("modpost: fix module autoloading for OF devices with generic compatible property") Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2023-09-30eventfs: Test for dentries array allocated in eventfs_release()Steven Rostedt (Google)1-1/+1
The dcache_dir_open_wrapper() could be called when a dynamic event is being deleted leaving a dentry with no children. In this case the dlist->dentries array will never be allocated. This needs to be checked for in eventfs_release(), otherwise it will trigger a NULL pointer dereference. Link: https://lore.kernel.org/linux-trace-kernel/20230930090106.1c3164e9@rorschach.local.home Cc: Mark Rutland <mark.rutland@arm.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Fixes: ef36b4f92868 ("eventfs: Remember what dentries were created on dir open") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-30tracing/user_events: Align set_bit() address for all archsBeau Belgrave1-7/+51
All architectures should use a long aligned address passed to set_bit(). User processes can pass either a 32-bit or 64-bit sized value to be updated when tracing is enabled when on a 64-bit kernel. Both cases are ensured to be naturally aligned, however, that is not enough. The address must be long aligned without affecting checks on the value within the user process which require different adjustments for the bit for little and big endian CPUs. Add a compat flag to user_event_enabler that indicates when a 32-bit value is being used on a 64-bit kernel. Long align addresses and correct the bit to be used by set_bit() to account for this alignment. Ensure compat flags are copied during forks and used during deletion clears. Link: https://lore.kernel.org/linux-trace-kernel/20230925230829.341-2-beaub@linux.microsoft.com Link: https://lore.kernel.org/linux-trace-kernel/20230914131102.179100-1-cleger@rivosinc.com/ Cc: stable@vger.kernel.org Fixes: 7235759084a4 ("tracing/user_events: Use remote writes for event enablement") Reported-by: Clément Léger <cleger@rivosinc.com> Suggested-by: Clément Léger <cleger@rivosinc.com> Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-30tracing: relax trace_event_eval_update() execution with cond_resched()Clément Léger1-0/+1
When kernel is compiled without preemption, the eval_map_work_func() (which calls trace_event_eval_update()) will not be preempted up to its complete execution. This can actually cause a problem since if another CPU call stop_machine(), the call will have to wait for the eval_map_work_func() function to finish executing in the workqueue before being able to be scheduled. This problem was observe on a SMP system at boot time, when the CPU calling the initcalls executed clocksource_done_booting() which in the end calls stop_machine(). We observed a 1 second delay because one CPU was executing eval_map_work_func() and was not preempted by the stop_machine() task. Adding a call to cond_resched() in trace_event_eval_update() allows other tasks to be executed and thus continue working asynchronously like before without blocking any pending task at boot time. Link: https://lore.kernel.org/linux-trace-kernel/20230929191637.416931-1-cleger@rivosinc.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Clément Léger <cleger@rivosinc.com> Tested-by: Atish Patra <atishp@rivosinc.com> Reviewed-by: Atish Patra <atishp@rivosinc.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-30ring-buffer: Update "shortest_full" in pollingSteven Rostedt (Google)1-0/+3
It was discovered that the ring buffer polling was incorrectly stating that read would not block, but that's because polling did not take into account that reads will block if the "buffer-percent" was set. Instead, the ring buffer polling would say reads would not block if there was any data in the ring buffer. This was incorrect behavior from a user space point of view. This was fixed by commit 42fb0a1e84ff by having the polling code check if the ring buffer had more data than what the user specified "buffer percent" had. The problem now is that the polling code did not register itself to the writer that it wanted to wait for a specific "full" value of the ring buffer. The result was that the writer would wake the polling waiter whenever there was a new event. The polling waiter would then wake up, see that there's not enough data in the ring buffer to notify user space and then go back to sleep. The next event would wake it up again. Before the polling fix was added, the code would wake up around 100 times for a hackbench 30 benchmark. After the "fix", due to the constant waking of the writer, it would wake up over 11,0000 times! It would never leave the kernel, so the user space behavior was still "correct", but this definitely is not the desired effect. To fix this, have the polling code add what it's waiting for to the "shortest_full" variable, to tell the writer not to wake it up if the buffer is not as full as it expects to be. Note, after this fix, it appears that the waiter is now woken up around 2x the times it was before (~200). This is a tremendous improvement from the 11,000 times, but I will need to spend some time to see why polling is more aggressive in its wakeups than the read blocking code. Link: https://lore.kernel.org/linux-trace-kernel/20230929180113.01c2cae3@rorschach.local.home Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Fixes: 42fb0a1e84ff ("tracing/ring-buffer: Have polling block on watermark") Reported-by: Julia Lawall <julia.lawall@inria.fr> Tested-by: Julia Lawall <julia.lawall@inria.fr> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-29Crash: add lock to serialize crash hotplug handlingBaoquan He1-0/+17
Eric reported that handling corresponding crash hotplug event can be failed easily when many memory hotplug event are notified in a short period. They failed because failing to take __kexec_lock. ======= [ 78.714569] Fallback order for Node 0: 0 [ 78.714575] Built 1 zonelists, mobility grouping on. Total pages: 1817886 [ 78.717133] Policy zone: Normal [ 78.724423] crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate [ 78.727207] crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate [ 80.056643] PEFILE: Unsigned PE binary ======= The memory hotplug events are notified very quickly and very many, while the handling of crash hotplug is much slower relatively. So the atomic variable __kexec_lock and kexec_trylock() can't guarantee the serialization of crash hotplug handling. Here, add a new mutex lock __crash_hotplug_lock to serialize crash hotplug handling specifically. This doesn't impact the usage of __kexec_lock. Link: https://lkml.kernel.org/r/20230926120905.392903-1-bhe@redhat.com Fixes: 247262756121 ("crash: add generic infrastructure for crash hotplug support") Signed-off-by: Baoquan He <bhe@redhat.com> Tested-by: Eric DeVolder <eric.devolder@oracle.com> Reviewed-by: Eric DeVolder <eric.devolder@oracle.com> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Cc: Sourabh Jain <sourabhjain@linux.ibm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-09-29selftests/mm: fix awk usage in charge_reserved_hugetlb.sh and hugetlb_reparenting_test.sh that may cause errorJuntong Deng2-4/+4
According to the awk manual, the -e option does not need to be specified in front of 'program' (unless you need to mix program-file). The redundant -e option can cause error when users use awk tools other than gawk (for example, mawk does not support the -e option). Error Example: awk: not an option: -e Link: https://lkml.kernel.org/r/VI1P193MB075228810591AF2FDD7D42C599C3A@VI1P193MB0752.EURP193.PROD.OUTLOOK.COM Signed-off-by: Juntong Deng <juntong.deng@outlook.com> Cc: Shuah Khan <shuah@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-09-29mm: mempolicy: keep VMA walk if both MPOL_MF_STRICT and MPOL_MF_MOVE are specifiedYang Shi1-20/+19
When calling mbind() with MPOL_MF_{MOVE|MOVEALL} | MPOL_MF_STRICT, kernel should attempt to migrate all existing pages, and return -EIO if there is misplaced or unmovable page. Then commit 6f4576e3687b ("mempolicy: apply page table walker on queue_pages_range()") messed up the return value and didn't break VMA scan early ianymore when MPOL_MF_STRICT alone. The return value problem was fixed by commit a7f40cfe3b7a ("mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified"), but it broke the VMA walk early if unmovable page is met, it may cause some pages are not migrated as expected. The code should conceptually do: if (MPOL_MF_MOVE|MOVEALL) scan all vmas try to migrate the existing pages return success else if (MPOL_MF_MOVE* | MPOL_MF_STRICT) scan all vmas try to migrate the existing pages return -EIO if unmovable or migration failed else /* MPOL_MF_STRICT alone */ break early if meets unmovable and don't call mbind_range() at all else /* none of those flags */ check the ranges in test_walk, EFAULT without mbind_range() if discontig. Fixed the behavior. Link: https://lkml.kernel.org/r/20230920223242.3425775-1-yang@os.amperecomputing.com Fixes: a7f40cfe3b7a ("mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified") Signed-off-by: Yang Shi <yang@os.amperecomputing.com> Cc: Hugh Dickins <hughd@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Oscar Salvador <osalvador@suse.de> Cc: Rafael Aquini <aquini@redhat.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> [4.9+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-09-29mm/damon/vaddr-test: fix memory leak in damon_do_test_apply_three_regions()Jinjie Ruan1-0/+2
When CONFIG_DAMON_VADDR_KUNIT_TEST=y and making CONFIG_DEBUG_KMEMLEAK=y and CONFIG_DEBUG_KMEMLEAK_AUTO_SCAN=y, the below memory leak is detected. Since commit 9f86d624292c ("mm/damon/vaddr-test: remove unnecessary variables"), the damon_destroy_ctx() is removed, but still call damon_new_target() and damon_new_region(), the damon_region which is allocated by kmem_cache_alloc() in damon_new_region() and the damon_target which is allocated by kmalloc in damon_new_target() are not freed. And the damon_region which is allocated in damon_new_region() in damon_set_regions() is also not freed. So use damon_destroy_target to free all the damon_regions and damon_target. unreferenced object 0xffff888107c9a940 (size 64): comm "kunit_try_catch", pid 1069, jiffies 4294670592 (age 732.761s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 06 00 00 00 6b 6b 6b 6b ............kkkk 60 c7 9c 07 81 88 ff ff f8 cb 9c 07 81 88 ff ff `............... backtrace: [<ffffffff817e0167>] kmalloc_trace+0x27/0xa0 [<ffffffff819c11cf>] damon_new_target+0x3f/0x1b0 [<ffffffff819c7d55>] damon_do_test_apply_three_regions.constprop.0+0x95/0x3e0 [<ffffffff819c82be>] damon_test_apply_three_regions1+0x21e/0x260 [<ffffffff829fce6a>] kunit_generic_run_threadfn_adapter+0x4a/0x90 [<ffffffff81237cf6>] kthread+0x2b6/0x380 [<ffffffff81097add>] ret_from_fork+0x2d/0x70 [<ffffffff81003791>] ret_from_fork_asm+0x11/0x20 unreferenced object 0xffff8881079cc740 (size 56): comm "kunit_try_catch", pid 1069, jiffies 4294670592 (age 732.761s) hex dump (first 32 bytes): 05 00 00 00 00 00 00 00 14 00 00 00 00 00 00 00 ................ 6b 6b 6b 6b 6b 6b 6b 6b 00 00 00 00 6b 6b 6b 6b kkkkkkkk....kkkk backtrace: [<ffffffff819bc492>] damon_new_region+0x22/0x1c0 [<ffffffff819c7d91>] damon_do_test_apply_three_regions.constprop.0+0xd1/0x3e0 [<ffffffff819c82be>] damon_test_apply_three_regions1+0x21e/0x260 [<ffffffff829fce6a>] kunit_generic_run_threadfn_adapter+0x4a/0x90 [<ffffffff81237cf6>] kthread+0x2b6/0x380 [<ffffffff81097add>] ret_from_fork+0x2d/0x70 [<ffffffff81003791>] ret_from_fork_asm+0x11/0x20 unreferenced object 0xffff888107c9ac40 (size 64): comm "kunit_try_catch", pid 1071, jiffies 4294670595 (age 732.843s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 06 00 00 00 6b 6b 6b 6b ............kkkk a0 cc 9c 07 81 88 ff ff 78 a1 76 07 81 88 ff ff ........x.v..... backtrace: [<ffffffff817e0167>] kmalloc_trace+0x27/0xa0 [<ffffffff819c11cf>] damon_new_target+0x3f/0x1b0 [<ffffffff819c7d55>] damon_do_test_apply_three_regions.constprop.0+0x95/0x3e0 [<ffffffff819c851e>] damon_test_apply_three_regions2+0x21e/0x260 [<ffffffff829fce6a>] kunit_generic_run_threadfn_adapter+0x4a/0x90 [<ffffffff81237cf6>] kthread+0x2b6/0x380 [<ffffffff81097add>] ret_from_fork+0x2d/0x70 [<ffffffff81003791>] ret_from_fork_asm+0x11/0x20 unreferenced object 0xffff8881079ccc80 (size 56): comm "kunit_try_catch", pid 1071, jiffies 4294670595 (age 732.843s) hex dump (first 32 bytes): 05 00 00 00 00 00 00 00 14 00 00 00 00 00 00 00 ................ 6b 6b 6b 6b 6b 6b 6b 6b 00 00 00 00 6b 6b 6b 6b kkkkkkkk....kkkk backtrace: [<ffffffff819bc492>] damon_new_region+0x22/0x1c0 [<ffffffff819c7d91>] damon_do_test_apply_three_regions.constprop.0+0xd1/0x3e0 [<ffffffff819c851e>] damon_test_apply_three_regions2+0x21e/0x260 [<ffffffff829fce6a>] kunit_generic_run_threadfn_adapter+0x4a/0x90 [<ffffffff81237cf6>] kthread+0x2b6/0x380 [<ffffffff81097add>] ret_from_fork+0x2d/0x70 [<ffffffff81003791>] ret_from_fork_asm+0x11/0x20 unreferenced object 0xffff888107c9af40 (size 64): comm "kunit_try_catch", pid 1073, jiffies 4294670597 (age 733.011s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 06 00 00 00 6b 6b 6b 6b ............kkkk 20 a2 76 07 81 88 ff ff b8 a6 76 07 81 88 ff ff .v.......v..... backtrace: [<ffffffff817e0167>] kmalloc_trace+0x27/0xa0 [<ffffffff819c11cf>] damon_new_target+0x3f/0x1b0 [<ffffffff819c7d55>] damon_do_test_apply_three_regions.constprop.0+0x95/0x3e0 [<ffffffff819c877e>] damon_test_apply_three_regions3+0x21e/0x260 [<ffffffff829fce6a>] kunit_generic_run_threadfn_adapter+0x4a/0x90 [<ffffffff81237cf6>] kthread+0x2b6/0x380 [<ffffffff81097add>] ret_from_fork+0x2d/0x70 [<ffffffff81003791>] ret_from_fork_asm+0x11/0x20 unreferenced object 0xffff88810776a200 (size 56): comm "kunit_try_catch", pid 1073, jiffies 4294670597 (age 733.011s) hex dump (first 32 bytes): 05 00 00 00 00 00 00 00 14 00 00 00 00 00 00 00 ................ 6b 6b 6b 6b 6b 6b 6b 6b 00 00 00 00 6b 6b 6b 6b kkkkkkkk....kkkk backtrace: [<ffffffff819bc492>] damon_new_region+0x22/0x1c0 [<ffffffff819c7d91>] damon_do_test_apply_three_regions.constprop.0+0xd1/0x3e0 [<ffffffff819c877e>] damon_test_apply_three_regions3+0x21e/0x260 [<ffffffff829fce6a>] kunit_generic_run_threadfn_adapter+0x4a/0x90 [<ffffffff81237cf6>] kthread+0x2b6/0x380 [<ffffffff81097add>] ret_from_fork+0x2d/0x70 [<ffffffff81003791>] ret_from_fork_asm+0x11/0x20 unreferenced object 0xffff88810776a740 (size 56): comm "kunit_try_catch", pid 1073, jiffies 4294670597 (age 733.025s) hex dump (first 32 bytes): 3d 00 00 00 00 00 00 00 3f 00 00 00 00 00 00 00 =.......?....... 6b 6b 6b 6b 6b 6b 6b 6b 00 00 00 00 6b 6b 6b 6b kkkkkkkk....kkkk backtrace: [<ffffffff819bc492>] damon_new_region+0x22/0x1c0 [<ffffffff819bfcc2>] damon_set_regions+0x4c2/0x8e0 [<ffffffff819c7dbb>] damon_do_test_apply_three_regions.constprop.0+0xfb/0x3e0 [<ffffffff819c877e>] damon_test_apply_three_regions3+0x21e/0x260 [<ffffffff829fce6a>] kunit_generic_run_threadfn_adapter+0x4a/0x90 [<ffffffff81237cf6>] kthread+0x2b6/0x380 [<ffffffff81097add>] ret_from_fork+0x2d/0x70 [<ffffffff81003791>] ret_from_fork_asm+0x11/0x20 unreferenced object 0xffff888108038240 (size 64): comm "kunit_try_catch", pid 1075, jiffies 4294670600 (age 733.022s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 03 00 00 00 6b 6b 6b 6b ............kkkk 48 ad 76 07 81 88 ff ff 98 ae 76 07 81 88 ff ff H.v.......v..... backtrace: [<ffffffff817e0167>] kmalloc_trace+0x27/0xa0 [<ffffffff819c11cf>] damon_new_target+0x3f/0x1b0 [<ffffffff819c7d55>] damon_do_test_apply_three_regions.constprop.0+0x95/0x3e0 [<ffffffff819c898d>] damon_test_apply_three_regions4+0x1cd/0x210 [<ffffffff829fce6a>] kunit_generic_run_threadfn_adapter+0x4a/0x90 [<ffffffff81237cf6>] kthread+0x2b6/0x380 [<ffffffff81097add>] ret_from_fork+0x2d/0x70 [<ffffffff81003791>] ret_from_fork_asm+0x11/0x20 unreferenced object 0xffff88810776ad28 (size 56): comm "kunit_try_catch", pid 1075, jiffies 4294670600 (age 733.022s) hex dump (first 32 bytes): 05 00 00 00 00 00 00 00 07 00 00 00 00 00 00 00 ................ 6b 6b 6b 6b 6b 6b 6b 6b 00 00 00 00 6b 6b 6b 6b kkkkkkkk....kkkk backtrace: [<ffffffff819bc492>] damon_new_region+0x22/0x1c0 [<ffffffff819bfcc2>] damon_set_regions+0x4c2/0x8e0 [<ffffffff819c7dbb>] damon_do_test_apply_three_regions.constprop.0+0xfb/0x3e0 [<ffffffff819c898d>] damon_test_apply_three_regions4+0x1cd/0x210 [<ffffffff829fce6a>] kunit_generic_run_threadfn_adapter+0x4a/0x90 [<ffffffff81237cf6>] kthread+0x2b6/0x380 [<ffffffff81097add>] ret_from_fork+0x2d/0x70 [<ffffffff81003791>] ret_from_fork_asm+0x11/0x20 Link: https://lkml.kernel.org/r/20230925072100.3725620-1-ruanjinjie@huawei.com Fixes: 9f86d624292c ("mm/damon/vaddr-test: remove unnecessary variables") Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-09-29mm, memcg: reconsider kmem.limit_in_bytes deprecationMichal Hocko2-0/+20
This reverts commits 86327e8eb94c ("memcg: drop kmem.limit_in_bytes") and partially reverts 58056f77502f ("memcg, kmem: further deprecate kmem.limit_in_bytes") which have incrementally removed support for the kernel memory accounting hard limit. Unfortunately it has turned out that there is still userspace depending on the existence of memory.kmem.limit_in_bytes [1]. The underlying functionality is not really required but the non-existent file just confuses the userspace which fails in the result. The patch to fix this on the userspace side has been submitted but it is hard to predict how it will propagate through the maze of 3rd party consumers of the software. Now, reverting alone 86327e8eb94c is not an option because there is another set of userspace which cannot cope with ENOTSUPP returned when writing to the file. Therefore we have to go and revisit 58056f77502f as well. There are two ways to go ahead. Either we give up on the deprecation and fully revert 58056f77502f as well or we can keep kmem.limit_in_bytes but make the write a noop and warn about the fact. This should work for both known breaking workloads which depend on the existence but do not depend on the hard limit enforcement. Note to backporters to stable trees. a8c49af3be5f ("memcg: add per-memcg total kernel memory stat") introduced in 4.18 has added memcg_account_kmem so the accounting is not done by obj_cgroup_charge_pages directly for v1 anymore. Prior kernels need to add it explicitly (thanks to Johannes for pointing this out). [akpm@linux-foundation.org: fix build - remove unused local] Link: http://lkml.kernel.org/r/20230920081101.GA12096@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net [1] Link: https://lkml.kernel.org/r/ZRE5VJozPZt9bRPy@dhcp22.suse.cz Fixes: 86327e8eb94c ("memcg: drop kmem.limit_in_bytes") Fixes: 58056f77502f ("memcg, kmem: further deprecate kmem.limit_in_bytes") Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Tejun heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-09-29mm: zswap: fix potential memory corruption on duplicate storeDomenico Cerasuolo1-0/+20
While stress-testing zswap a memory corruption was happening when writing back pages. __frontswap_store used to check for duplicate entries before attempting to store a page in zswap, this was because if the store fails the old entry isn't removed from the tree. This change removes duplicate entries in zswap_store before the actual attempt. [cerasuolodomenico@gmail.com: add a warning and a comment, per Johannes] Link: https://lkml.kernel.org/r/20230925130002.1929369-1-cerasuolodomenico@gmail.com Link: https://lkml.kernel.org/r/20230922172211.1704917-1-cerasuolodomenico@gmail.com Fixes: 42c06a0e8ebe ("mm: kill frontswap") Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-09-29arm64: hugetlb: fix set_huge_pte_at() to work with all swap entriesRyan Roberts1-14/+3
When called with a swap entry that does not embed a PFN (e.g. PTE_MARKER_POISONED or PTE_MARKER_UFFD_WP), the previous implementation of set_huge_pte_at() would either cause a BUG() to fire (if CONFIG_DEBUG_VM is enabled) or cause a dereference of an invalid address and subsequent panic. arm64's huge pte implementation supports multiple huge page sizes, some of which are implemented in the page table with multiple contiguous entries. So set_huge_pte_at() needs to work out how big the logical pte is, so that it can also work out how many physical ptes (or pmds) need to be written. It previously did this by grabbing the folio out of the pte and querying its size. However, there are cases when the pte being set is actually a swap entry. But this also used to work fine, because for huge ptes, we only ever saw migration entries and hwpoison entries. And both of these types of swap entries have a PFN embedded, so the code would grab that and everything still worked out. But over time, more calls to set_huge_pte_at() have been added that set swap entry types that do not embed a PFN. And this causes the code to go bang. The triggering case is for the uffd poison test, commit 99aa77215ad0 ("selftests/mm: add uffd unit test for UFFDIO_POISON"), which causes a PTE_MARKER_POISONED swap entry to be set, coutesey of commit 8a13897fb0da ("mm: userfaultfd: support UFFDIO_POISON for hugetlbfs") - added in v6.5-rc7. Although review shows that there are other call sites that set PTE_MARKER_UFFD_WP (which also has no PFN), these don't trigger on arm64 because arm64 doesn't support UFFD WP. Arguably, the root cause is really due to commit 18f3962953e4 ("mm: hugetlb: kill set_huge_swap_pte_at()"), which aimed to simplify the interface to the core code by removing set_huge_swap_pte_at() (which took a page size parameter) and replacing it with calls to set_huge_pte_at() where the size was inferred from the folio, as descibed above. While that commit didn't break anything at the time, it did break the interface because it couldn't handle swap entries without PFNs. And since then new callers have come along which rely on this working. But given the brokeness is only observable after commit 8a13897fb0da ("mm: userfaultfd: support UFFDIO_POISON for hugetlbfs"), that one gets the Fixes tag. Now that we have modified the set_huge_pte_at() interface to pass the huge page size in the previous patch, we can trivially fix this issue. Link: https://lkml.kernel.org/r/20230922115804.2043771-3-ryan.roberts@arm.com Fixes: 8a13897fb0da ("mm: userfaultfd: support UFFDIO_POISON for hugetlbfs") Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Christoph Hellwig <hch@infradead.org> Cc: David S. Miller <davem@davemloft.net> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Xu <peterx@redhat.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: <stable@vger.kernel.org> [6.5+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-09-29mm: hugetlb: add huge page size param to set_huge_pte_at()Ryan Roberts22-49/+100
Patch series "Fix set_huge_pte_at() panic on arm64", v2. This series fixes a bug in arm64's implementation of set_huge_pte_at(), which can result in an unprivileged user causing a kernel panic. The problem was triggered when running the new uffd poison mm selftest for HUGETLB memory. This test (and the uffd poison feature) was merged for v6.5-rc7. Ideally, I'd like to get this fix in for v6.6 and I've cc'ed stable (correctly this time) to get it backported to v6.5, where the issue first showed up. Description of Bug ================== arm64's huge pte implementation supports multiple huge page sizes, some of which are implemented in the page table with multiple contiguous entries. So set_huge_pte_at() needs to work out how big the logical pte is, so that it can also work out how many physical ptes (or pmds) need to be written. It previously did this by grabbing the folio out of the pte and querying its size. However, there are cases when the pte being set is actually a swap entry. But this also used to work fine, because for huge ptes, we only ever saw migration entries and hwpoison entries. And both of these types of swap entries have a PFN embedded, so the code would grab that and everything still worked out. But over time, more calls to set_huge_pte_at() have been added that set swap entry types that do not embed a PFN. And this causes the code to go bang. The triggering case is for the uffd poison test, commit 99aa77215ad0 ("selftests/mm: add uffd unit test for UFFDIO_POISON"), which causes a PTE_MARKER_POISONED swap entry to be set, coutesey of commit 8a13897fb0da ("mm: userfaultfd: support UFFDIO_POISON for hugetlbfs") - added in v6.5-rc7. Although review shows that there are other call sites that set PTE_MARKER_UFFD_WP (which also has no PFN), these don't trigger on arm64 because arm64 doesn't support UFFD WP. If CONFIG_DEBUG_VM is enabled, we do at least get a BUG(), but otherwise, it will dereference a bad pointer in page_folio(): static inline struct folio *hugetlb_swap_entry_to_folio(swp_entry_t entry) { VM_BUG_ON(!is_migration_entry(entry) && !is_hwpoison_entry(entry)); return page_folio(pfn_to_page(swp_offset_pfn(entry))); } Fix === The simplest fix would have been to revert the dodgy cleanup commit 18f3962953e4 ("mm: hugetlb: kill set_huge_swap_pte_at()"), but since things have moved on, this would have required an audit of all the new set_huge_pte_at() call sites to see if they should be converted to set_huge_swap_pte_at(). As per the original intent of the change, it would also leave us open to future bugs when people invariably get it wrong and call the wrong helper. So instead, I've added a huge page size parameter to set_huge_pte_at(). This means that the arm64 code has the size in all cases. It's a bigger change, due to needing to touch the arches that implement the function, but it is entirely mechanical, so in my view, low risk. I've compile-tested all touched arches; arm64, parisc, powerpc, riscv, s390, sparc (and additionally x86_64). I've additionally booted and run mm selftests against arm64, where I observe the uffd poison test is fixed, and there are no other regressions. This patch (of 2): In order to fix a bug, arm64 needs to be told the size of the huge page for which the pte is being set in set_huge_pte_at(). Provide for this by adding an `unsigned long sz` parameter to the function. This follows the same pattern as huge_pte_clear(). This commit makes the required interface modifications to the core mm as well as all arches that implement this function (arm64, parisc, powerpc, riscv, s390, sparc). The actual arm64 bug will be fixed in a separate commit. No behavioral changes intended. Link: https://lkml.kernel.org/r/20230922115804.2043771-1-ryan.roberts@arm.com Link: https://lkml.kernel.org/r/20230922115804.2043771-2-ryan.roberts@arm.com Fixes: 8a13897fb0da ("mm: userfaultfd: support UFFDIO_POISON for hugetlbfs") Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> [powerpc 8xx] Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> [vmalloc change] Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: David S. Miller <davem@davemloft.net> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Xu <peterx@redhat.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: <stable@vger.kernel.org> [6.5+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-09-29maple_tree: add MAS_UNDERFLOW and MAS_OVERFLOW statesLiam R. Howlett3-73/+237
When updating the maple tree iterator to avoid rewalks, an issue was introduced when shifting beyond the limits. This can be seen by trying to go to the previous address of 0, which would set the maple node to MAS_NONE and keep the range as the last entry. Subsequent calls to mas_find() would then search upwards from mas->last and skip the value at mas->index/mas->last. This showed up as a bug in mprotect which skips the actual VMA at the current range after attempting to go to the previous VMA from 0. Since MAS_NONE may already be set when searching for a value that isn't contained within a node, changing the handling of MAS_NONE in mas_find() would make the code more complicated and error prone. Furthermore, there was no way to tell which limit was hit, and thus which action to take (next or the entry at the current range). This solution is to add two states to track what happened with the previous iterator action. This allows for the expected behaviour of the next command to return the correct item (either the item at the range requested, or the next/previous). Tests are also added and updated accordingly. Link: https://lkml.kernel.org/r/20230921181236.509072-3-Liam.Howlett@oracle.com Link: https://gist.github.com/heatd/85d2971fae1501b55b6ea401fbbe485b Link: https://lore.kernel.org/linux-mm/20230921181236.509072-1-Liam.Howlett@oracle.com/ Fixes: 39193685d585 ("maple_tree: try harder to keep active node with mas_prev()") Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reported-by: Pedro Falcato <pedro.falcato@gmail.com> Closes: https://gist.github.com/heatd/85d2971fae1501b55b6ea401fbbe485b Closes: https://bugs.archlinux.org/task/79656 Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-09-29maple_tree: add mas_is_active() to detect in-tree walksLiam R. Howlett1-0/+9
Patch series "maple_tree: Fix mas_prev() state regression". Pedro Falcato retported an mprotect regression [1] which was bisected back to the iterator changes for maple tree. Root cause analysis showed the mas_prev() running off the end of the VMA space (previous from 0) followed by mas_find(), would skip the first value. This patchset introduces maple state underflow/overflow so the sequence of calls on the maple state will return what the user expects. Users who encounter this bug may see mprotect(), userfaultfd_register(), and mlock() fail on VMAs mapped with address 0. This patch (of 2): Instead of constantly checking each possibility of the maple state, create a fast path that will skip over checking unlikely states. Link: https://lkml.kernel.org/r/20230921181236.509072-1-Liam.Howlett@oracle.com Link: https://lkml.kernel.org/r/20230921181236.509072-2-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Pedro Falcato <pedro.falcato@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-09-29nilfs2: fix potential use after free in nilfs_gccache_submit_read_data()Pan Bian1-3/+3
In nilfs_gccache_submit_read_data(), brelse(bh) is called to drop the reference count of bh when the call to nilfs_dat_translate() fails. If the reference count hits 0 and its owner page gets unlocked, bh may be freed. However, bh->b_page is dereferenced to put the page after that, which may result in a use-after-free bug. This patch moves the release operation after unlocking and putting the page. NOTE: The function in question is only called in GC, and in combination with current userland tools, address translation using DAT does not occur in that function, so the code path that causes this issue will not be executed. However, it is possible to run that code path by intentionally modifying the userland GC library or by calling the GC ioctl directly. [konishi.ryusuke@gmail.com: NOTE added to the commit log] Link: https://lkml.kernel.org/r/1543201709-53191-1-git-send-email-bianpan2016@163.com Link: https://lkml.kernel.org/r/20230921141731.10073-1-konishi.ryusuke@gmail.com Fixes: a3d93f709e89 ("nilfs2: block cache for garbage collection") Signed-off-by: Pan Bian <bianpan2016@163.com> Reported-by: Ferry Meng <mengferry@linux.alibaba.com> Closes: https://lkml.kernel.org/r/20230818092022.111054-1-mengferry@linux.alibaba.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-09-29mm: abstract moving to the next PFNMatthew Wilcox (Oracle)2-1/+17
In order to fix the L1TF vulnerability, x86 can invert the PTE bits for PROT_NONE VMAs, which means we cannot move from one PTE to the next by adding 1 to the PFN field of the PTE. This results in the BUG reported at [1]. Abstract advancing the PTE to the next PFN through a pte_next_pfn() function/macro. Link: https://lkml.kernel.org/r/20230920040958.866520-1-willy@infradead.org Fixes: bcc6cc832573 ("mm: add default definition of set_ptes()") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reported-by: syzbot+55cc72f8cc3a549119df@syzkaller.appspotmail.com Closes: https://lkml.kernel.org/r/000000000000d099fa0604f03351@google.com [1] Reviewed-by: Yin Fengwei <fengwei.yin@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-09-29mm: report success more often from filemap_map_folio_range()Matthew Wilcox (Oracle)1-2/+2
Even though we had successfully mapped the relevant page, we would rarely return success from filemap_map_folio_range(). That leads to falling back from the VMA lock path to the mmap_lock path, which is a speed & scalability issue. Found by inspection. Link: https://lkml.kernel.org/r/20230920035336.854212-1-willy@infradead.org Fixes: 617c28ecab22 ("filemap: batch PTE mappings") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Yin Fengwei <fengwei.yin@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-09-29fs: binfmt_elf_efpic: fix personality for ELF-FDPICGreg Ungerer1-3/+2
The elf-fdpic loader hard sets the process personality to either PER_LINUX_FDPIC for true elf-fdpic binaries or to PER_LINUX for normal ELF binaries (in this case they would be constant displacement compiled with -pie for example). The problem with that is that it will lose any other bits that may be in the ELF header personality (such as the "bug emulation" bits). On the ARM architecture the ADDR_LIMIT_32BIT flag is used to signify a normal 32bit binary - as opposed to a legacy 26bit address binary. This matters since start_thread() will set the ARM CPSR register as required based on this flag. If the elf-fdpic loader loses this bit the process will be mis-configured and crash out pretty quickly. Modify elf-fdpic loader personality setting so that it preserves the upper three bytes by using the SET_PERSONALITY macro to set it. This macro in the generic case sets PER_LINUX and preserves the upper bytes. Architectures can override this for their specific use case, and ARM does exactly this. The problem shows up quite easily running under qemu using the ARM architecture, but not necessarily on all types of real ARM hardware. If the underlying ARM processor does not support the legacy 26-bit addressing mode then everything will work as expected. Link: https://lkml.kernel.org/r/20230907011808.2985083-1-gerg@kernel.org Fixes: 1bde925d23547 ("fs/binfmt_elf_fdpic.c: provide NOMMU loader for regular ELF binaries") Signed-off-by: Greg Ungerer <gerg@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Greg Ungerer <gerg@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-09-29MAINTAINERS: Fix Florian Fainelli's email addressUwe Kleine-König1-1/+1
Commit 31345a0f5901 ("MAINTAINERS: Replace my email address") added 13 instances of ...@broadcom.com and one of only ...@broadcom. I didn't double check if Broadcom really owns that TLD, but git send-email doesn't accept it, so add ".com" to that one bogous(?) instance. Fixes: 31345a0f5901 ("MAINTAINERS: Replace my email address") Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Acked-by: Florian Fainelli <florian.fainelli@broadcom.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2023-09-29io_uring/fs: remove sqe->rw_flags checking from LINKATJens Axboe1-1/+1
This is unionized with the actual link flags, so they can of course be set and they will be evaluated further down. If not we fail any LINKAT that has to set option flags. Fixes: cf30da90bc3a ("io_uring: add support for IORING_OP_LINKAT") Cc: stable@vger.kernel.org Reported-by: Thomas Leonard <talex5@gmail.com> Link: https://github.com/axboe/liburing/issues/955 Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-09-28x86/sgx: Resolves SECS reclaim vs. page fault for EAUG raceHaitao Huang1-5/+25
The SGX EPC reclaimer (ksgxd) may reclaim the SECS EPC page for an enclave and set secs.epc_page to NULL. The SECS page is used for EAUG and ELDU in the SGX page fault handler. However, the NULL check for secs.epc_page is only done for ELDU, not EAUG before being used. Fix this by doing the same NULL check and reloading of the SECS page as needed for both EAUG and ELDU. The SECS page holds global enclave metadata. It can only be reclaimed when there are no other enclave pages remaining. At that point, virtually nothing can be done with the enclave until the SECS page is paged back in. An enclave can not run nor generate page faults without a resident SECS page. But it is still possible for a #PF for a non-SECS page to race with paging out the SECS page: when the last resident non-SECS page A triggers a #PF in a non-resident page B, and then page A and the SECS both are paged out before the #PF on B is handled. Hitting this bug requires that race triggered with a #PF for EAUG. Following is a trace when it happens. BUG: kernel NULL pointer dereference, address: 0000000000000000 RIP: 0010:sgx_encl_eaug_page+0xc7/0x210 Call Trace: ? __kmem_cache_alloc_node+0x16a/0x440 ? xa_load+0x6e/0xa0 sgx_vma_fault+0x119/0x230 __do_fault+0x36/0x140 do_fault+0x12f/0x400 __handle_mm_fault+0x728/0x1110 handle_mm_fault+0x105/0x310 do_user_addr_fault+0x1ee/0x750 ? __this_cpu_preempt_check+0x13/0x20 exc_page_fault+0x76/0x180 asm_exc_page_fault+0x27/0x30 Fixes: 5a90d2c3f5ef ("x86/sgx: Support adding of pages to an initialized enclave") Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Reviewed-by: Kai Huang <kai.huang@intel.com> Acked-by: Reinette Chatre <reinette.chatre@intel.com> Cc:stable@vger.kernel.org Link: https://lore.kernel.org/all/20230728051024.33063-1-haitao.huang%40linux.intel.com
2023-09-28sched/rt: Fix live lock between select_fallback_rq() and RT pushJoel Fernandes (Google)1-0/+1
During RCU-boost testing with the TREE03 rcutorture config, I found that after a few hours, the machine locks up. On tracing, I found that there is a live lock happening between 2 CPUs. One CPU has an RT task running, while another CPU is being offlined which also has an RT task running. During this offlining, all threads are migrated. The migration thread is repeatedly scheduled to migrate actively running tasks on the CPU being offlined. This results in a live lock because select_fallback_rq() keeps picking the CPU that an RT task is already running on only to get pushed back to the CPU being offlined. It is anyway pointless to pick CPUs for pushing tasks to if they are being offlined only to get migrated away to somewhere else. This could also add unwanted latency to this task. Fix these issues by not selecting CPUs in RT if they are not 'active' for scheduling, using the cpu_active_mask. Other parts in core.c already use cpu_active_mask to prevent tasks from being put on CPUs going offline. With this fix I ran the tests for days and could not reproduce the hang. Without the patch, I hit it in a few hours. Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Paul E. McKenney <paulmck@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20230923011409.3522762-1-joel@joelfernandes.org
2023-09-28fs/smb/client: Reset password pointer to NULLQuang Le1-0/+1
Forget to reset ctx->password to NULL will lead to bug like double free Cc: stable@vger.kernel.org Cc: Willy Tarreau <w@1wt.eu> Reviewed-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Quang Le <quanglex97@gmail.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2023-09-28iomap: Spelling s/preceeding/preceding/gGeert Uytterhoeven1-1/+1
Fix a misspelling of "preceding". Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-09-28NFSD: Fix zero NFSv4 READ results when RQ_SPLICE_OK is not setChuck Lever1-2/+2
nfsd4_encode_readv() uses xdr->buf->page_len as a starting point for the nfsd_iter_read() sink buffer -- page_len is going to be offset by the parts of the COMPOUND that have already been encoded into xdr->buf->pages. However, that value must be captured /before/ xdr_reserve_space_vec() advances page_len by the expected size of the read payload. Otherwise, the whole front part of the first page of the payload in the reply will be uninitialized. Mantas hit this because sec=krb5i forces RQ_SPLICE_OK off, which invokes the readv part of the nfsd4_encode_read() path. Also, older Linux NFS clients appear to send shorter READ requests for files smaller than a page, whereas newer clients just send page-sized requests and let the server send as many bytes as are in the file. Reported-by: Mantas Mikulėnas <grawity@gmail.com> Closes: https://lore.kernel.org/linux-nfs/f1d0b234-e650-0f6e-0f5d-126b3d51d1eb@gmail.com/ Fixes: 703d75215555 ("NFSD: Hoist rq_vec preparation into nfsd_read() [step two]") Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-09-28ata: libata-eh: Fix compilation warning in ata_eh_link_report()Damien Le Moal1-1/+1
The 6 bytes length of the tries_buf string in ata_eh_link_report() is too short and results in a gcc compilation warning with W-!: drivers/ata/libata-eh.c: In function ‘ata_eh_link_report’: drivers/ata/libata-eh.c:2371:59: warning: ‘%d’ directive output may be truncated writing between 1 and 11 bytes into a region of size 4 [-Wformat-truncation=] 2371 | snprintf(tries_buf, sizeof(tries_buf), " t%d", | ^~ drivers/ata/libata-eh.c:2371:56: note: directive argument in the range [-2147483648, 4] 2371 | snprintf(tries_buf, sizeof(tries_buf), " t%d", | ^~~~~~ drivers/ata/libata-eh.c:2371:17: note: ‘snprintf’ output between 4 and 14 bytes into a destination of size 6 2371 | snprintf(tries_buf, sizeof(tries_buf), " t%d", | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2372 | ap->eh_tries); | ~~~~~~~~~~~~~ Avoid this warning by increasing the string size to 16B. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
2023-09-28ata: libata-core: Fix compilation warning in ata_dev_config_ncq()Damien Le Moal1-1/+1
The 24 bytes length allocated to the ncq_desc string in ata_dev_config_lba() for ata_dev_config_ncq() to use is too short, causing the following gcc compilation warnings when compiling with W=1: drivers/ata/libata-core.c: In function ‘ata_dev_configure’: drivers/ata/libata-core.c:2378:56: warning: ‘%d’ directive output may be truncated writing between 1 and 2 bytes into a region of size between 1 and 11 [-Wformat-truncation=] 2378 | snprintf(desc, desc_sz, "NCQ (depth %d/%d)%s", hdepth, | ^~ In function ‘ata_dev_config_ncq’, inlined from ‘ata_dev_config_lba’ at drivers/ata/libata-core.c:2649:8, inlined from ‘ata_dev_configure’ at drivers/ata/libata-core.c:2952:9: drivers/ata/libata-core.c:2378:41: note: directive argument in the range [1, 32] 2378 | snprintf(desc, desc_sz, "NCQ (depth %d/%d)%s", hdepth, | ^~~~~~~~~~~~~~~~~~~~~ drivers/ata/libata-core.c:2378:17: note: ‘snprintf’ output between 16 and 31 bytes into a destination of size 24 2378 | snprintf(desc, desc_sz, "NCQ (depth %d/%d)%s", hdepth, | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2379 | ddepth, aa_desc); | ~~~~~~~~~~~~~~~~ Avoid these warnings and the potential truncation by changing the size of the ncq_desc string to 32 characters. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
2023-09-28scsi: sd: Do not issue commands to suspended disks on shutdownDamien Le Moal2-4/+14
If an error occurs when resuming a host adapter before the devices attached to the adapter are resumed, the adapter low level driver may remove the scsi host, resulting in a call to sd_remove() for the disks of the host. This in turn results in a call to sd_shutdown() which will issue a synchronize cache command and a start stop unit command to spindown the disk. sd_shutdown() issues the commands only if the device is not already runtime suspended but does not check the power state for system-wide suspend/resume. That is, the commands may be issued with the device in a suspended state, which causes PM resume to hang, forcing a reset of the machine to recover. Fix this by tracking the suspended state of a disk by introducing the suspended boolean field in the scsi_disk structure. This flag is set to true when the disk is suspended is sd_suspend_common() and resumed with sd_resume(). When suspended is true, sd_shutdown() is not executed from sd_remove(). Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
2023-09-28ata: libata-core: Do not register PM operations for SAS portsDamien Le Moal3-2/+11
libsas does its own domain based power management of ports. For such ports, libata should not use a device type defining power management operations as executing these operations for suspend/resume in addition to libsas calls to ata_sas_port_suspend() and ata_sas_port_resume() is not necessary (and likely dangerous to do, even though problems are not seen currently). Introduce the new ata_port_sas_type device_type for ports managed by libsas. This new device type is used in ata_tport_add() and is defined without power management operations. Fixes: 2fcbdcb4c802 ("[SCSI] libata: export ata_port suspend/resume infrastructure for sas") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Tested-by: Chia-Lin Kao (AceLan) <acelan.kao@canonical.com> Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
2023-09-28ata: libata-scsi: Fix delayed scsi_rescan_device() executionDamien Le Moal2-18/+31
Commit 6aa0365a3c85 ("ata: libata-scsi: Avoid deadlock on rescan after device resume") modified ata_scsi_dev_rescan() to check the scsi device "is_suspended" power field to ensure that the scsi device associated with an ATA device is fully resumed when scsi_rescan_device() is executed. However, this fix is problematic as: 1) It relies on a PM internal field that should not be used without PM device locking protection. 2) The check for is_suspended and the call to scsi_rescan_device() are not atomic and a suspend PM event may be triggered between them, casuing scsi_rescan_device() to be called on a suspended device and in that function blocking while holding the scsi device lock. This would deadlock a following resume operation. These problems can trigger PM deadlocks on resume, especially with resume operations triggered quickly after or during suspend operations. E.g., a simple bash script like: for (( i=0; i<10; i++ )); do echo "+2 > /sys/class/rtc/rtc0/wakealarm echo mem > /sys/power/state done that triggers a resume 2 seconds after starting suspending a system can quickly lead to a PM deadlock preventing the system from correctly resuming. Fix this by replacing the check on is_suspended with a check on the return value given by scsi_rescan_device() as that function will fail if called against a suspended device. Also make sure rescan tasks already scheduled are first cancelled before suspending an ata port. Fixes: 6aa0365a3c85 ("ata: libata-scsi: Avoid deadlock on rescan after device resume") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com> Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>