aboutsummaryrefslogtreecommitdiffstatshomepage
path: root/include/linux/mm.h (unfollow)
AgeCommit message (Collapse)AuthorFilesLines
2025-06-07wil6210: fix support for sparrow chipsetsSebastian Gottschall1-10/+16
the wil6210 driver irq handling code is unconditionally writing edma irq registers which are supposed to be only used on Talyn chipsets. This however leade to a chipset hang on the older sparrow chipset generation and firmware will not even boot. Fix that by simply checking for edma support before handling these registers. Tested on Netgear R9000 Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com> Link: https://patch.msgid.link/20250304012131.25970-2-s.gottschall@dd-wrt.com Signed-off-by: Jeff Johnson <jeff.johnson@oss.qualcomm.com>
2025-06-07wifi: ath10k: Avoid vdev delete timeout when firmware is already downLoic Poulain1-8/+25
In some scenarios, the firmware may be stopped before the interface is removed, either due to a crash or because the remoteproc (e.g., MPSS) is shut down early during system reboot or shutdown. This leads to a delay during interface teardown, as the driver waits for a vdev delete response that never arrives, eventually timing out. Example (SNOC): $ echo stop > /sys/class/remoteproc/remoteproc0/state [ 71.64] remoteproc remoteproc0: stopped remote processor modem $ reboot [ 74.84] ath10k_snoc c800000.wifi: failed to transmit packet, dropping: -108 [ 74.84] ath10k_snoc c800000.wifi: failed to submit frame: -108 [...] [ 82.39] ath10k_snoc c800000.wifi: Timeout in receiving vdev delete response To avoid this, skip waiting for the vdev delete response if the firmware is already marked as unreachable (`ATH10K_FLAG_CRASH_FLUSH`), similar to how `ath10k_mac_wait_tx_complete()` and `ath10k_vdev_setup_sync()` handle this case. Signed-off-by: Loic Poulain <loic.poulain@oss.qualcomm.com> Link: https://patch.msgid.link/20250522131704.612206-1-loic.poulain@oss.qualcomm.com Signed-off-by: Jeff Johnson <jeff.johnson@oss.qualcomm.com>
2025-06-07ath10k: snoc: fix unbalanced IRQ enable in crash recoveryCaleb Connolly1-1/+3
In ath10k_snoc_hif_stop() we skip disabling the IRQs in the crash recovery flow, but we still unconditionally call enable again in ath10k_snoc_hif_start(). We can't check the ATH10K_FLAG_CRASH_FLUSH bit since it is cleared before hif_start() is called, so instead check the ATH10K_SNOC_FLAG_RECOVERY flag and skip enabling the IRQs during crash recovery. This fixes unbalanced IRQ enable splats that happen after recovering from a crash. Fixes: 0e622f67e041 ("ath10k: add support for WCN3990 firmware crash recovery") Signed-off-by: Caleb Connolly <caleb.connolly@linaro.org> Tested-by: Loic Poulain <loic.poulain@oss.qualcomm.com> Link: https://patch.msgid.link/20250318205043.1043148-1-caleb.connolly@linaro.org Signed-off-by: Jeff Johnson <jeff.johnson@oss.qualcomm.com>
2025-06-07tracing: Add rcu annotation around file->filter accessesSteven Rostedt1-4/+6
Running sparse on trace_events_filter.c triggered several warnings about file->filter being accessed directly even though it's annotated with __rcu. Add rcu_dereference() around it and shuffle the logic slightly so that it's always referenced via accessor functions. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20250607102821.6c7effbf@gandalf.local.home Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-06-07sh: kprobes: Remove unused variables in kprobe_exceptions_notify()Mike Rapoport1-4/+0
kbuild reports the following warning: arch/sh/kernel/kprobes.c: In function 'kprobe_exceptions_notify': >> arch/sh/kernel/kprobes.c:412:24: warning: variable 'p' set but not used [-Wunused-but-set-variable] 412 | struct kprobe *p = NULL; | ^ The variable 'p' is indeed unused since the commit fa5a24b16f94 ("sh/kprobes: Don't call the ->break_handler() in SH kprobes code") Remove that variable along with 'kprobe_opcode_t *addr' which also becomes unused after 'p' is removed. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202505151341.EuRFR22l-lkp@intel.com/ Fixes: fa5a24b16f94 ("sh/kprobes: Don't call the ->break_handler() in SH kprobes code") Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
2025-06-07sh: ecovec24: Make SPI mode explicitGeert Uytterhoeven1-0/+1
Commit cf9e4784f3bde3e4 ("spi: sh-msiof: Add slave mode support") added a new mode member to the sh_msiof_spi_info structure, but did not update any board files. Hence all users in board files rely on the default being host mode. Make this unambiguous by configuring host mode explicitly. Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Signed-off-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
2025-06-07sh: Replace __ASSEMBLY__ with __ASSEMBLER__ in all headersThomas Huth17-46/+46
While the GCC and Clang compilers already define __ASSEMBLER__ automatically when compiling assembly code, __ASSEMBLY__ is a macro that only gets defined by the Makefiles in the kernel. This can be very confusing when switching between userspace and kernelspace coding, or when dealing with uapi headers that rather should use __ASSEMBLER__ instead. So let's standardize on the __ASSEMBLER__ macro that is provided by the compilers now. This is a completely mechanical patch (done with a simple "sed -i" statement). Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Rich Felker <dalias@libc.org> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: linux-sh@vger.kernel.org Signed-off-by: Thomas Huth <thuth@redhat.com> Reviewed-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Signed-off-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
2025-06-07Reapply "x86/smp: Eliminate mwait_play_dead_cpuid_hint()"Rafael J. Wysocki1-47/+7
Revert commit 70523f335734 ("Revert "x86/smp: Eliminate mwait_play_dead_cpuid_hint()"") to reapply the changes from commit 96040f7273e2 ("x86/smp: Eliminate mwait_play_dead_cpuid_hint()") reverted by it. Previously, these changes caused idle power to rise on systems booting with "nosmt" in the kernel command line because they effectively caused "dead" SMT siblings to remain in idle state C1 after executing the HLT instruction, which prevented the processor from reaching package idle states deeper than PC2 going forward. Now, the "dead" SMT siblings are rescanned after initializing a proper cpuidle driver for the processor (either intel_idle or ACPI idle), at which point they are able to enter a sufficiently deep idle state in native_play_dead() via cpuidle, so the code changes in question can be reapplied. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Link: https://patch.msgid.link/7813065.EvYhyI6sBW@rjwysocki.net
2025-06-07ACPI: processor: Rescan "dead" SMT siblings during initializationRafael J. Wysocki3-0/+17
Make acpi_processor_driver_init() call arch_cpu_rescan_dead_smt_siblings(), via a new wrapper function called acpi_idle_rescan_dead_smt_siblings(), after successfully initializing the driver, to allow the "dead" SMT siblings to go into deep idle states, which is necessary for the processor to be able to reach deep package C-states (like PC10) going forward, so that power can be reduced sufficiently in suspend-to-idle, among other things. However, do it only if the ACPI idle driver is the current cpuidle driver (otherwise it is assumed that another cpuidle driver will take care of this) and avoid doing it on architectures other than x86. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Link: https://patch.msgid.link/2005721.PYKUYFuaPT@rjwysocki.net
2025-06-07intel_idle: Rescan "dead" SMT siblings during initializationRafael J. Wysocki1-0/+2
Make intel_idle_init() call arch_cpu_rescan_dead_smt_siblings() after successfully registering intel_idle as the cpuidle driver so as to allow the "dead" SMT siblings (if any) to go into deep idle states. This is necessary for the processor to be able to reach deep package C-states (like PC10) going forward which is requisite for reducing power sufficiently in suspend-to-idle, among other things. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Link: https://patch.msgid.link/10669885.nUPlyArG6x@rjwysocki.net
2025-06-07x86/smp: PM/hibernate: Split arch_resume_nosmt()Rafael J. Wysocki3-13/+33
Move the inner part of the arch_resume_nosmt() code into a separate function called arch_cpu_rescan_dead_smt_siblings(), so it can be used in other places where "dead" SMT siblings may need to be taken online and offline again in order to get into deep idle states. No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Link: https://patch.msgid.link/3361688.44csPzL39Z@rjwysocki.net [ rjw: Prevent build issues with CONFIG_SMP unset ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-06-07genksyms: Fix enum consts from a reference affecting new valuesPetr Pavlu1-7/+20
Enumeration constants read from a symbol reference file can incorrectly affect new enumeration constants parsed from an actual input file. Example: $ cat test.c enum { E_A, E_B, E_MAX }; struct bar { int mem[E_MAX]; }; int foo(struct bar *a) {} __GENKSYMS_EXPORT_SYMBOL(foo); $ cat test.c | ./scripts/genksyms/genksyms -T test.0.symtypes #SYMVER foo 0x070d854d $ cat test.0.symtypes E#E_MAX 2 s#bar struct bar { int mem [ E#E_MAX ] ; } foo int foo ( s#bar * ) $ cat test.c | ./scripts/genksyms/genksyms -T test.1.symtypes -r test.0.symtypes <stdin>:4: warning: foo: modversion changed because of changes in enum constant E_MAX #SYMVER foo 0x9c9dfd81 $ cat test.1.symtypes E#E_MAX ( 2 ) + 3 s#bar struct bar { int mem [ E#E_MAX ] ; } foo int foo ( s#bar * ) The __add_symbol() function includes logic to handle the incrementation of enumeration values, but this code is also invoked when reading a reference file. As a result, the variables last_enum_expr and enum_counter might be incorrectly set after reading the reference file, which later affects parsing of the actual input. Fix the problem by splitting the logic for the incrementation of enumeration values into a separate function process_enum() and call it from __add_symbol() only when processing non-reference data. Fixes: e37ddb825003 ("genksyms: Track changes to enum constants") Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2025-06-07arch: use always-$(KBUILD_BUILTIN) for vmlinux.ldsMasahiro Yamada21-21/+21
The extra-y syntax is deprecated. Instead, use always-$(KBUILD_BUILTIN), which behaves equivalently. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Acked-by: Johannes Berg <johannes@sipsolutions.net> Reviewed-by: Nicolas Schier <n.schier@avm.de>
2025-06-07do_change_type(): refuse to operate on unmounted/not ours mountsAl Viro1-0/+4
Ensure that propagation settings can only be changed for mounts located in the caller's mount namespace. This change aligns permission checking with the rest of mount(2). Reviewed-by: Christian Brauner <brauner@kernel.org> Fixes: 07b20889e305 ("beginning of the shared-subtree proper") Reported-by: "Orlando, Noah" <Noah.Orlando@deshaw.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-07kbuild: set y instead of 1 to KBUILD_{BUILTIN,MODULES}Masahiro Yamada2-8/+12
KBUILD_BUILTIN is set to 1 unless you are building only modules. KBUILD_MODULES is set to 1 when you are building only modules (a typical use case is "make modules"). It is more useful to set them to 'y' instead, so we can do something like: always-$(KBUILD_BUILTIN) += vmlinux.lds This works equivalently to: extra-y += vmlinux.lds This allows us to deprecate extra-y. extra-y and always-y are quite similar, and we do not need both. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Nicolas Schier <n.schier@avm.de>
2025-06-07clone_private_mnt(): make sure that caller has CAP_SYS_ADMIN in the right usernsAl Viro1-0/+3
What we want is to verify there is that clone won't expose something hidden by a mount we wouldn't be able to undo. "Wouldn't be able to undo" may be a result of MNT_LOCKED on a child, but it may also come from lacking admin rights in the userns of the namespace mount belongs to. clone_private_mnt() checks the former, but not the latter. There's a number of rather confusing CAP_SYS_ADMIN checks in various userns during the mount, especially with the new mount API; they serve different purposes and in case of clone_private_mnt() they usually, but not always end up covering the missing check mentioned above. Reviewed-by: Christian Brauner <brauner@kernel.org> Reported-by: "Orlando, Noah" <Noah.Orlando@deshaw.com> Fixes: 427215d85e8d ("ovl: prevent private clone if bind mount is not allowed") Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-07selftests/mount_setattr: adapt detached mount propagation testChristian Brauner1-16/+1
Make sure that detached trees don't receive mount propagation. Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-07do_move_mount(): split the checks in subtree-of-our-ns and entire-anon casesAl Viro1-21/+25
... and fix the breakage in anon-to-anon case. There are two cases acceptable for do_move_mount() and mixing checks for those is making things hard to follow. One case is move of a subtree in caller's namespace. * source and destination must be in caller's namespace * source must be detachable from parent Another is moving the entire anon namespace elsewhere * source must be the root of anon namespace * target must either in caller's namespace or in a suitable anon namespace (see may_use_mount() for details). * target must not be in the same namespace as source. It's really easier to follow if tests are *not* mixed together... Reviewed-by: Christian Brauner <brauner@kernel.org> Fixes: 3b5260d12b1f ("Don't propagate mounts into detached trees") Reported-by: Allison Karlitskaya <lis@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-07fs: allow clone_private_mount() for a path on real rootfsKONDO KAZUMA(近藤 和真)1-10/+11
Mounting overlayfs with a directory on real rootfs (initramfs) as upperdir has failed with following message since commit db04662e2f4f ("fs: allow detached mounts in clone_private_mount()"). [ 4.080134] overlayfs: failed to clone upperpath Overlayfs mount uses clone_private_mount() to create internal mount for the underlying layers. The commit made clone_private_mount() reject real rootfs because it does not have a parent mount and is in the initial mount namespace, that is not an anonymous mount namespace. This issue can be fixed by modifying the permission check of clone_private_mount() following [1]. Reviewed-by: Christian Brauner <brauner@kernel.org> Fixes: db04662e2f4f ("fs: allow detached mounts in clone_private_mount()") Link: https://lore.kernel.org/all/20250514190252.GQ2023217@ZenIV/ [1] Link: https://lore.kernel.org/all/20250506194849.GT2023217@ZenIV/ Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Kazuma Kondo <kazuma-kondo@nec.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-07fix propagation graph breakage by MOVE_MOUNT_SET_GROUP move_mount(2)Al Viro1-1/+1
9ffb14ef61ba "move_mount: allow to add a mount into an existing group" breaks assertions on ->mnt_share/->mnt_slave. For once, the data structures in question are actually documented. Documentation/filesystem/sharedsubtree.rst: All vfsmounts in a peer group have the same ->mnt_master. If it is non-NULL, they form a contiguous (ordered) segment of slave list. do_set_group() puts a mount into the same place in propagation graph as the old one. As the result, if old mount gets events from somewhere and is not a pure event sink, new one needs to be placed next to the old one in the slave list the old one's on. If it is a pure event sink, we only need to make sure the new one doesn't end up in the middle of some peer group. "move_mount: allow to add a mount into an existing group" ends up putting the new one in the beginning of list; that's definitely not going to be in the middle of anything, so that's fine for case when old is not marked shared. In case when old one _is_ marked shared (i.e. is not a pure event sink), that breaks the assumptions of propagation graph iterators. Put the new mount next to the old one on the list - that does the right thing in "old is marked shared" case and is just as correct as the current behaviour if old is not marked shared (kudos to Pavel for pointing that out - my original suggested fix changed behaviour in the "nor marked" case, which complicated things for no good reason). Reviewed-by: Christian Brauner <brauner@kernel.org> Fixes: 9ffb14ef61ba ("move_mount: allow to add a mount into an existing group") Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-07finish_automount(): don't leak MNT_LOCKED from parent to childAl Viro1-1/+2
Intention for MNT_LOCKED had always been to protect the internal mountpoints within a subtree that got copied across the userns boundary, not the mountpoint that tree got attached to - after all, it _was_ exposed before the copying. For roots of secondary copies that is enforced in attach_recursive_mnt() - MNT_LOCKED is explicitly stripped for those. For the root of primary copy we are almost always guaranteed that MNT_LOCKED won't be there, so attach_recursive_mnt() doesn't bother. Unfortunately, one call chain got overlooked - triggering e.g. NFS referral will have the submount inherit the public flags from parent; that's fine for such things as read-only, nosuid, etc., but not for MNT_LOCKED. This is particularly pointless since the mount attached by finish_automount() is usually expirable, which makes any protection granted by MNT_LOCKED null and void; just wait for a while and that mount will go away on its own. Include MNT_LOCKED into the set of flags to be ignored by do_add_mount() - it really is an internal flag. Reviewed-by: Christian Brauner <brauner@kernel.org> Fixes: 5ff9d8a65ce8 ("vfs: Lock in place mounts from more privileged users") Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-07path_overmount(): avoid false negativesAl Viro1-6/+13
Holding namespace_sem is enough to make sure that result remains valid. It is *not* enough to avoid false negatives from __lookup_mnt(). Mounts can be unhashed outside of namespace_sem (stuck children getting detached on final mntput() of lazy-umounted mount) and having an unrelated mount removed from the hash chain while we traverse it may end up with false negative from __lookup_mnt(). We need to sample and recheck the seqlock component of mount_lock... Bug predates the introduction of path_overmount() - it had come from the code in finish_automount() that got abstracted into that helper. Reviewed-by: Christian Brauner <brauner@kernel.org> Fixes: 26df6034fdb2 ("fix automount/automount race properly") Fixes: 6ac392815628 ("fs: allow to mount beneath top mount") Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-07fs/fhandle.c: fix a race in call of has_locked_children()Al Viro1-4/+14
may_decode_fh() is calling has_locked_children() while holding no locks. That's an oopsable race... The rest of the callers are safe since they are holding namespace_sem and are guaranteed a positive refcount on the mount in question. Rename the current has_locked_children() to __has_locked_children(), make it static and switch the fs/namespace.c users to it. Make has_locked_children() a wrapper for __has_locked_children(), calling the latter under read_seqlock_excl(&mount_lock). Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Jeff Layton <jlayton@kernel.org> Fixes: 620c266f3949 ("fhandle: relax open_by_handle_at() permission checks") Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-07platform/loongarch: laptop: Unregister generic_sub_drivers on exitYao Zi1-3/+9
Without correct unregisteration, ACPI notify handlers and the platform drivers installed by generic_subdriver_init() will become dangling references after removing the loongson_laptop module, triggering various kernel faults when a hotkey is sent or at kernel shutdown. Cc: stable@vger.kernel.org Fixes: 6246ed09111f ("LoongArch: Add ACPI-based generic laptop driver") Signed-off-by: Yao Zi <ziyao@disroot.org> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2025-06-07platform/loongarch: laptop: Add backlight power control supportYao Zi1-36/+37
loongson_laptop_turn_{on,off}_backlight() are designed for controlling the power of the backlight, but they aren't really used in the driver previously. Unify these two functions since they only differ in arguments passed to ACPI method, and wire up loongson_laptop_backlight_update() to update the power state of the backlight as well. Tested on the TongFang L860-T2 Loongson-3A5000 laptop. Cc: stable@vger.kernel.org Fixes: 6246ed09111f ("LoongArch: Add ACPI-based generic laptop driver") Signed-off-by: Yao Zi <ziyao@disroot.org> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2025-06-06tracing: PM: Remove unused clock eventsSteven Rostedt1-47/+0
The events clock_enable, clock_disable, and clock_set_rate were added back in 2010. In 2011 they were used by the arm architecture but removed in 2013. These events add around 7K of memory which was wasted for the last 12 years. Remove them. Link: https://lore.kernel.org/all/20250529130138.544ffec4@gandalf.local.home/ Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Kajetan Puchalski <kajetan.puchalski@arm.com> Acked-by: Rafael J. Wysocki <rafael@kernel.org> Link: https://lore.kernel.org/20250605162106.1a459dad@gandalf.local.home Fixes: 74704ac6ea402 ("tracing, perf: Add more power related events") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-06-06ring-buffer: Fix buffer locking in ring_buffer_subbuf_order_set()Dmitry Antipov1-3/+1
Enlarge the critical section in ring_buffer_subbuf_order_set() to ensure that error handling takes place with per-buffer mutex held, thus preventing list corruption and other concurrency-related issues. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com> Link: https://lore.kernel.org/20250606112242.1510605-1-dmantipov@yandex.ru Reported-by: syzbot+05d673e83ec640f0ced9@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=05d673e83ec640f0ced9 Fixes: f9b94daa542a8 ("ring-buffer: Set new size of the ring buffer sub page") Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-06-06tracing: Fix regression of filter waiting a long time on RCU synchronizationSteven Rostedt1-48/+138
When faultable trace events were added, a trace event may no longer use normal RCU to synchronize but instead used synchronize_rcu_tasks_trace(). This synchronization takes a much longer time to synchronize. The filter logic would free the filters by calling tracepoint_synchronize_unregister() after it unhooked the filter strings and before freeing them. With this function now calling synchronize_rcu_tasks_trace() this increased the time to free a filter tremendously. On a PREEMPT_RT system, it was even more noticeable. # time trace-cmd record -p function sleep 1 [..] real 2m29.052s user 0m0.244s sys 0m20.136s As trace-cmd would clear out all the filters before recording, it could take up to 2 minutes to do a recording of "sleep 1". To find out where the issues was: ~# trace-cmd sqlhist -e -n sched_stack select start.prev_state as state, end.next_comm as comm, TIMESTAMP_DELTA_USECS as delta, start.STACKTRACE as stack from sched_switch as start join sched_switch as end on start.prev_pid = end.next_pid Which will produce the following commands (and -e will also execute them): echo 's:sched_stack s64 state; char comm[16]; u64 delta; unsigned long stack[];' >> /sys/kernel/tracing/dynamic_events echo 'hist:keys=prev_pid:__arg_18057_2=prev_state,__arg_18057_4=common_timestamp.usecs,__arg_18057_7=common_stacktrace' >> /sys/kernel/tracing/events/sched/sched_switch/trigger echo 'hist:keys=next_pid:__state_18057_1=$__arg_18057_2,__comm_18057_3=next_comm,__delta_18057_5=common_timestamp.usecs-$__arg_18057_4,__stack_18057_6=$__arg_18057_7:onmatch(sched.sched_switch).trace(sched_stack,$__state_18057_1,$__comm_18057_3,$__delta_18057_5,$__stack_18057_6)' >> /sys/kernel/tracing/events/sched/sched_switch/trigger The above creates a synthetic event that creates a stack trace when a task schedules out and records it with the time it scheduled back in. Basically the time a task is off the CPU. It also records the state of the task when it left the CPU (running, blocked, sleeping, etc). It also saves the comm of the task as "comm" (needed for the next command). ~# echo 'hist:keys=state,stack.stacktrace:vals=delta:sort=state,delta if comm == "trace-cmd" && state & 3' > /sys/kernel/tracing/events/synthetic/sched_stack/trigger The above creates a histogram with buckets per state, per stack, and the value of the total time it was off the CPU for that stack trace. It filters on tasks with "comm == trace-cmd" and only the sleeping and blocked states (1 - sleeping, 2 - blocked). ~# trace-cmd record -p function sleep 1 ~# cat /sys/kernel/tracing/events/synthetic/sched_stack/hist | tail -18 { state: 2, stack.stacktrace __schedule+0x1545/0x3700 schedule+0xe2/0x390 schedule_timeout+0x175/0x200 wait_for_completion_state+0x294/0x440 __wait_rcu_gp+0x247/0x4f0 synchronize_rcu_tasks_generic+0x151/0x230 apply_subsystem_event_filter+0xa2b/0x1300 subsystem_filter_write+0x67/0xc0 vfs_write+0x1e2/0xeb0 ksys_write+0xff/0x1d0 do_syscall_64+0x7b/0x420 entry_SYSCALL_64_after_hwframe+0x76/0x7e } hitcount: 237 delta: 99756288 <<--------------- Delta is 99 seconds! Totals: Hits: 525 Entries: 21 Dropped: 0 This shows that this particular trace waited for 99 seconds on synchronize_rcu_tasks() in apply_subsystem_event_filter(). In fact, there's a lot of places in the filter code that spends a lot of time waiting for synchronize_rcu_tasks_trace() in order to free the filters. Add helper functions that will use call_rcu*() variants to asynchronously free the filters. This brings the timings back to normal: # time trace-cmd record -p function sleep 1 [..] real 0m14.681s user 0m0.335s sys 0m28.616s And the histogram also shows this: ~# cat /sys/kernel/tracing/events/synthetic/sched_stack/hist | tail -21 { state: 2, stack.stacktrace __schedule+0x1545/0x3700 schedule+0xe2/0x390 schedule_timeout+0x175/0x200 wait_for_completion_state+0x294/0x440 __wait_rcu_gp+0x247/0x4f0 synchronize_rcu_normal+0x3db/0x5c0 tracing_reset_online_cpus+0x8f/0x1e0 tracing_open+0x335/0x440 do_dentry_open+0x4c6/0x17a0 vfs_open+0x82/0x360 path_openat+0x1a36/0x2990 do_filp_open+0x1c5/0x420 do_sys_openat2+0xed/0x180 __x64_sys_openat+0x108/0x1d0 do_syscall_64+0x7b/0x420 } hitcount: 2 delta: 77044 Totals: Hits: 55 Entries: 28 Dropped: 0 Where the total waiting time of synchronize_rcu_tasks_trace() is 77 milliseconds. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Jan Kiszka <jan.kiszka@siemens.com> Cc: Andreas Ziegler <ziegler.andreas@siemens.com> Cc: Felix MOESSBAUER <felix.moessbauer@siemens.com> Link: https://lore.kernel.org/20250606201936.1e3d09a9@batman.local.home Reported-by: "Flot, Julien" <julien.flot@siemens.com> Tested-by: Julien Flot <julien.flot@siemens.com> Fixes: a363d27cdbc2 ("tracing: Allow system call tracepoints to handle page faults") Closes: https://lore.kernel.org/all/240017f656631c7dd4017aa93d91f41f653788ea.camel@siemens.com/ Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-06-06libbpf: Handle unsupported mmap-based /sys/kernel/btf/vmlinux correctlyAndrii Nakryiko1-3/+3
libbpf_err_ptr() helpers are meant to return NULL and set errno, if there is an error. But btf_parse_raw_mmap() is meant to be used internally and is expected to return ERR_PTR() values. Because of this mismatch, when libbpf tries to mmap /sys/kernel/btf/vmlinux, we don't detect the error correctly with IS_ERR() check, and never fallback to old non-mmap-based way of loading vmlinux BTF. Fix this by using proper ERR_PTR() returns internally. Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com> Reviewed-by: Arnaldo Carvalho de Melo <acme@redhat.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Fixes: 3c0421c93ce4 ("libbpf: Use mmap to parse vmlinux BTF from sysfs") Cc: Lorenz Bauer <lmb@isovalent.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20250606202134.2738910-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-06-06intel_idle: Use subsys_initcall_sync() for initializationRafael J. Wysocki1-1/+1
It is not necessary to wait until the device_initcall() stage with intel_idle initialization. All of its dependencies are met after all subsys_initcall()s have run, so subsys_initcall_sync() can be used for initializing it. It is also better to ensure that intel_idle will always initialize before the ACPI processor driver that uses module_init() for its initialization. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Link: https://patch.msgid.link/2994397.e9J7NaK4W3@rjwysocki.net
2025-06-06uapi: bitops: use UAPI-safe variant of BITS_PER_LONG againThomas Weißschuh1-2/+2
Commit 1e7933a575ed ("uapi: Revert "bitops: avoid integer overflow in GENMASK(_ULL)"") did not take in account that the usage of BITS_PER_LONG in __GENMASK() was changed to __BITS_PER_LONG for UAPI-safety in commit 3c7a8e190bc5 ("uapi: introduce uapi-friendly macros for GENMASK"). BITS_PER_LONG can not be used in UAPI headers as it derives from the kernel configuration and not from the current compiler invocation. When building compat userspace code or a compat vDSO its value will be incorrect. Switch back to __BITS_PER_LONG. Fixes: 1e7933a575ed ("uapi: Revert "bitops: avoid integer overflow in GENMASK(_ULL)"") Cc: stable@vger.kernel.org Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
2025-06-06net: enetc: fix the netc-lib driver build dependencyWei Fang1-1/+5
The kernel robot reported the following errors when the netc-lib driver was compiled as a loadable module and the enetc-core driver was built-in. ld.lld: error: undefined symbol: ntmp_init_cbdr referenced by enetc_cbdr.c:88 (drivers/net/ethernet/freescale/enetc/enetc_cbdr.c:88) ld.lld: error: undefined symbol: ntmp_free_cbdr referenced by enetc_cbdr.c:96 (drivers/net/ethernet/freescale/enetc/enetc_cbdr.c:96) Simply changing "tristate" to "bool" can fix this issue, but considering that the netc-lib driver needs to support being compiled as a loadable module and LS1028 does not need the netc-lib driver. Therefore, we add a boolean symbol 'NXP_NTMP' to enable 'NXP_NETC_LIB' as needed. And when adding NETC switch driver support in the future, there is no need to modify the dependency, just select "NXP_NTMP" and "NXP_NETC_LIB" at the same time. Reported-by: Arnd Bergmann <arnd@kernel.org> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202505220734.x6TF6oHR-lkp@intel.com/ Fixes: 4701073c3deb ("net: enetc: add initial netc-lib driver to support NTMP") Suggested-by: Arnd Bergmann <arnd@kernel.org> Signed-off-by: Wei Fang <wei.fang@nxp.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-06-06platform/loongarch: laptop: Get brightness setting from EC on probeYao Zi1-1/+1
Previously during driver probe, 1 is unconditionally taken as current brightness value and set to props.brightness, which will be considered as the brightness before suspend and restored to EC on resume. Since a brightness value of 1 almost never matches EC's state on coldboot (my laptop's EC defaults to 80), this causes surprising changes of screen brightness on the first time of resume after coldboot. Let's get brightness from EC and take it as the current brightness on probe of the laptop driver to avoid the surprising behavior. Tested on TongFang L860-T2 Loongson-3A5000 laptop. Cc: stable@vger.kernel.org Fixes: 6246ed09111f ("LoongArch: Add ACPI-based generic laptop driver") Signed-off-by: Yao Zi <ziyao@disroot.org> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2025-06-06LoongArch: dts: Add PWM support to Loongson-2K2000Binbin Zhou1-0/+60
The module is supported, enable it. Reviewed-by: Yanteng Si <si.yanteng@linux.dev> Signed-off-by: Binbin Zhou <zhoubinbin@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2025-06-06LoongArch: dts: Add PWM support to Loongson-2K1000Binbin Zhou2-1/+65
The module is supported, enable it. Also, add the pwm-fan and cooling-maps associated with it. Reviewed-by: Yanteng Si <si.yanteng@linux.dev> Signed-off-by: Binbin Zhou <zhoubinbin@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2025-06-06LoongArch: dts: Add PWM support to Loongson-2K0500Binbin Zhou1-0/+160
The module is supported, enable it. Reviewed-by: Yanteng Si <si.yanteng@linux.dev> Signed-off-by: Binbin Zhou <zhoubinbin@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2025-06-06LoongArch: vDSO: Correctly use asm parameters in syscall wrappersThomas Weißschuh2-4/+4
The syscall wrappers use the "a0" register for two different register variables, both the first argument and the return value. Here the "ret" variable is used as both input and output while the argument register is only used as input. Clang treats the conflicting input parameters as an undefined behaviour and optimizes away the argument assignment. The code seems to work by chance for the most part today but that may change in the future. Specifically clock_gettime_fallback() fails with clockids from 16 to 23, as implemented by the upcoming auxiliary clocks. Switch the "ret" register variable to a pure output, similar to the other architectures' vDSO code. This works in both clang and GCC. Link: https://lore.kernel.org/lkml/20250602102825-42aa84f0-23f1-4d10-89fc-e8bbaffd291a@linutronix.de/ Link: https://lore.kernel.org/lkml/20250519082042.742926976@linutronix.de/ Fixes: c6b99bed6b8f ("LoongArch: Add VDSO and VSYSCALL support") Fixes: 18efd0b10e0f ("LoongArch: vDSO: Wire up getrandom() vDSO implementation") Cc: stable@vger.kernel.org Reviewed-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Yanteng Si <si.yanteng@linux.dev> Reviewed-by: WANG Xuerui <git@xen0n.name> Reviewed-by: Xi Ruoyao <xry111@xry111.site> Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2025-06-06ceph: fix variable dereferenced before check in ceph_umount_begin()Viacheslav Dubeyko1-2/+1
smatch warnings: fs/ceph/super.c:1042 ceph_umount_begin() warn: variable dereferenced before check 'fsc' (see line 1041) vim +/fsc +1042 fs/ceph/super.c void ceph_umount_begin(struct super_block *sb) { struct ceph_fs_client *fsc = ceph_sb_to_fs_client(sb); doutc(fsc->client, "starting forced umount\n"); ^^^^^^^^^^^ Dereferenced if (!fsc) ^^^^ Checked too late. return; fsc->mount_state = CEPH_MOUNT_SHUTDOWN; __ceph_umount_begin(fsc); } The VFS guarantees that the superblock is still alive when it calls into ceph via ->umount_begin(). Finally, we don't need to check the fsc and it should be valid. This patch simply removes the fsc check. Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/r/202503280852.YDB3pxUY-lkp@intel.com/ Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Reviewed by: Alex Markuze <amarkuze@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2025-06-06fs/xattr.c: fix simple_xattr_list()Stephen Smalley1-0/+1
commit 8b0ba61df5a1 ("fs/xattr.c: fix simple_xattr_list to always include security.* xattrs") failed to reset err after the call to security_inode_listsecurity(), which returns the length of the returned xattr name. This results in simple_xattr_list() incorrectly returning this length even if a POSIX acl is also set on the inode. Reported-by: Collin Funk <collin.funk1@gmail.com> Closes: https://lore.kernel.org/selinux/8734ceal7q.fsf@gmail.com/ Reported-by: Paul Eggert <eggert@cs.ucla.edu> Closes: https://bugzilla.redhat.com/show_bug.cgi?id=2369561 Fixes: 8b0ba61df5a1 ("fs/xattr.c: fix simple_xattr_list to always include security.* xattrs") Signed-off-by: Stephen Smalley <stephen.smalley.work@gmail.com> Link: https://lore.kernel.org/20250605165116.2063-1-stephen.smalley.work@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-05kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_countMax Kellermann1-0/+26
Expose a simple counter to userspace for monitoring tools. (akpm: 2536c5c7d6ae added the documentation but the code changes were lost) Link: https://lkml.kernel.org/r/20250504180831.4190860-3-max.kellermann@ionos.com Fixes: 2536c5c7d6ae ("kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count") Signed-off-by: Max Kellermann <max.kellermann@ionos.com> Cc: Core Minyard <cminyard@mvista.com> Cc: Doug Anderson <dianders@chromium.org> Cc: Joel Granados <joel.granados@kernel.org> Cc: Max Kellermann <max.kellermann@ionos.com> Cc: Song Liu <song@kernel.org> Cc: Sourabh Jain <sourabhjain@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-05MAINTAINERS: add mm swap sectionLorenzo Stoakes1-0/+19
In furtherance of ongoing efforts to ensure people are aware of who de-facto maintains/has an interest in specific parts of mm, as well trying to avoid get_maintainers.pl listing only Andrew and the mailing list for mm files - establish a swap memory management section and add relevant maintainers/reviewers. Link: https://lkml.kernel.org/r/20250604163139.126630-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Chris Li <chrisl@kernel.org> Acked-by: Kairui Song <kasong@tencent.com> Acked-by: Kemeng Shi <shikemeng@huaweicloud.com> Acked-by: Baoquan He <bhe@redhat.com> Acked-by: Barry Song <baohua@kernel.org> Acked-by: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-05kmsan: test: add module descriptionArnd Bergmann1-0/+1
Every module should have a description, and kbuild now warns for those that don't. WARNING: modpost: missing MODULE_DESCRIPTION() in mm/kmsan/kmsan_test.o Link: https://lkml.kernel.org/r/20250603075323.1839608-1-arnd@kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Macro Elver <elver@google.com> Cc: Sabyrzhan Tasbolatov <snovitoll@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-05MAINTAINERS: add tlb trace events to MMU GATHER AND TLB INVALIDATIONTal Zussman1-0/+1
The MMU GATHER AND TLB INVALIDATION entry lists other TLB-related files. Add the tlb.h tracepoint file there as well. Link: https://lore.kernel.org/linux-mm/ce048e11-f79d-44a6-bacc-46e1ebc34b24@redhat.com/ Link: https://lkml.kernel.org/r/20250603-tlb-maintainers-v1-1-726d193c6693@columbia.edu Signed-off-by: Tal Zussman <tz2294@columbia.edu> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-05mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast raceJann Horn1-0/+7
huge_pmd_unshare() drops a reference on a page table that may have previously been shared across processes, potentially turning it into a normal page table used in another process in which unrelated VMAs can afterwards be installed. If this happens in the middle of a concurrent gup_fast(), gup_fast() could end up walking the page tables of another process. While I don't see any way in which that immediately leads to kernel memory corruption, it is really weird and unexpected. Fix it with an explicit broadcast IPI through tlb_remove_table_sync_one(), just like we do in khugepaged when removing page tables for a THP collapse. Link: https://lkml.kernel.org/r/20250528-hugetlb-fixes-splitrace-v2-2-1329349bad1a@google.com Link: https://lkml.kernel.org/r/20250527-hugetlb-fixes-splitrace-v1-2-f4136f5ec58a@google.com Fixes: 39dde65c9940 ("[PATCH] shared page table for hugetlb page") Signed-off-by: Jann Horn <jannh@google.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-05mm/hugetlb: unshare page tables during VMA split, not beforeJann Horn4-16/+56
Currently, __split_vma() triggers hugetlb page table unsharing through vm_ops->may_split(). This happens before the VMA lock and rmap locks are taken - which is too early, it allows racing VMA-locked page faults in our process and racing rmap walks from other processes to cause page tables to be shared again before we actually perform the split. Fix it by explicitly calling into the hugetlb unshare logic from __split_vma() in the same place where THP splitting also happens. At that point, both the VMA and the rmap(s) are write-locked. An annoying detail is that we can now call into the helper hugetlb_unshare_pmds() from two different locking contexts: 1. from hugetlb_split(), holding: - mmap lock (exclusively) - VMA lock - file rmap lock (exclusively) 2. hugetlb_unshare_all_pmds(), which I think is designed to be able to call us with only the mmap lock held (in shared mode), but currently only runs while holding mmap lock (exclusively) and VMA lock Backporting note: This commit fixes a racy protection that was introduced in commit b30c14cd6102 ("hugetlb: unshare some PMDs when splitting VMAs"); that commit claimed to fix an issue introduced in 5.13, but it should actually also go all the way back. [jannh@google.com: v2] Link: https://lkml.kernel.org/r/20250528-hugetlb-fixes-splitrace-v2-1-1329349bad1a@google.com Link: https://lkml.kernel.org/r/20250528-hugetlb-fixes-splitrace-v2-0-1329349bad1a@google.com Link: https://lkml.kernel.org/r/20250527-hugetlb-fixes-splitrace-v1-1-f4136f5ec58a@google.com Fixes: 39dde65c9940 ("[PATCH] shared page table for hugetlb page") Signed-off-by: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> [b30c14cd6102: hugetlb: unshare some PMDs when splitting VMAs] Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-05MAINTAINERS: add Alistair as reviewer of mm memory policyAlistair Popple1-0/+1
I'm particularly familiar with mm/migrate.c and especially mm/migrate_device.c so add myself to MAINTAINERS. Link: https://lkml.kernel.org/r/20250530014917.2946940-1-apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Matthew Brost <matthew.brost@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-05iov_iter: use iov_offset for length calculation in iov_iter_aligned_bvecNitesh Shetty1-1/+1
If iov_offset is non-zero, then we need to consider iov_offset in length calculation, otherwise we might pass smaller IOs such as 512 bytes, in below scenario [1]. This issue is reproducible using lib-uring test/fixed-seg.c application with fixed buffer on a 512 LBA formatted device. [1] At present we pass the alignment check, for 512 LBA formatted devices, len_mask = 511 when IO is smaller, i->count = 512 has an offset, i->io_offset = 3584 with bvec values, bvec->bv_offset = 256, bvec->bv_len = 3840. In short, the first 256 bytes are in the current page, next 256 bytes are in the another page. Ideally we expect to fail the IO. I can think of 2 userspace scenarios where we experience this. a: From userspace, we observe a different behaviour when device LBA size is 512 vs 4096 bytes. For 4096 LBA formatted device, I see the same liburing test [2] failing, whereas 512 the test passes without this. This is reproducible everytime. [2] https://github.com/axboe/liburing/ b: Although I was not able to reproduce the below condition, but I suspect below case should be possible from user space for devices with 512 LBA formatted device. Lets say from userspace while allocating a virtually single chunk of memory, if we get 2 physical chunk of memory, and IO happens to be at the boundary of first physical chunk with length crossing first chunk, then we allow IOs to proceed and hence we might map wrong physical address length and proceed with IO rather than failing. : --- a/test/fixed-seg.c : +++ b/test/fixed-seg.c : @@ -64,7 +64,7 @@ static int test(struct io_uring *ring, int fd, int : vec_off) : return T_EXIT_FAIL; : } : : - ret = read_it(ring, fd, 4096, vec_off); : + ret = read_it(ring, fd, 4096, 7*512 + 256); : if (ret) { : fprintf(stderr, "4096 0 failed\n"); : return T_EXIT_FAIL; Effectively this is a write crossing the page boundary. Link: https://lkml.kernel.org/r/20250428095849.11709-1-nj.shetty@samsung.com Fixes: 2263639f96f2 ("iov_iter: streamline iovec/bvec alignment iteration") Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Keith Busch <kbusch@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-05mm/mempolicy: fix incorrect freeing of wi_kobjJoshua Hahn1-3/+1
We should not free wi_group->wi_kobj here. In the error path of add_weighted_interleave_group() where this snippet is called from, kobj_{del, put} is immediately called right after this section. Thus, it is not only unnecessary but also incorrect to free it here. Link: https://lkml.kernel.org/r/20250602162345.2595696-1-joshua.hahnjy@gmail.com Fixes: e341f9c3c841 ("mm/mempolicy: Weighted Interleave Auto-tuning") Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202506011545.Fduxqxqj-lkp@intel.com/ Cc: Alistair Popple <apopple@nvidia.com> Cc: Byungchul Park <byungchul@sk.com> Cc: David Hildenbrand <david@redhat.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-05alloc_tag: handle module codetag load errors as module load failuresSuren Baghdasaryan4-20/+39
Failures inside codetag_load_module() are currently ignored. As a result an error there would not cause a module load failure and freeing of the associated resources. Correct this behavior by propagating the error code to the caller and handling possible errors. With this change, error to allocate percpu counters, which happens at this stage, will not be ignored and will cause a module load failure and freeing of resources. With this change we also do not need to disable memory allocation profiling when this error happens, instead we fail to load the module. Link: https://lkml.kernel.org/r/20250521160602.1940771-1-surenb@google.com Fixes: 10075262888b ("alloc_tag: allocate percpu counters for module tags dynamically") Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reported-by: Casey Chen <cachen@purestorage.com> Closes: https://lore.kernel.org/all/20250520231620.15259-1-cachen@purestorage.com/ Cc: Daniel Gomez <da.gomez@samsung.com> Cc: David Wang <00107082@163.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Luis Chamberalin <mcgrof@kernel.org> Cc: Petr Pavlu <petr.pavlu@suse.com> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-05mm/madvise: handle madvise_lock() failure during race unwindingSeongJae Park1-1/+4
When unwinding race on -ERESTARTNOINTR handling of process_madvise(), madvise_lock() failure is ignored. Check the failure and abort remaining works in the case. Link: https://lkml.kernel.org/r/20250602174926.1074-1-sj@kernel.org Fixes: 4000e3d0a367 ("mm/madvise: remove redundant mmap_lock operations from process_madvise()") Signed-off-by: SeongJae Park <sj@kernel.org> Reported-by: Barry Song <21cnbao@gmail.com> Closes: https://lore.kernel.org/CAGsJ_4xJXXO0G+4BizhohSZ4yDteziPw43_uF8nPXPWxUVChzw@mail.gmail.com Reviewed-by: Jann Horn <jannh@google.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Barry Song <baohua@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>