wireguard-linux - WireGuard for the Linux kernel

Age	Commit message (Collapse)	Author	Files	Lines
2024-08-29	timers: Remove historical extra jiffie for timeout in msleep()	Anna-Maria Behnsen	1	-2/+2
	msleep() and msleep_interruptible() add a jiffie to the requested timeout. This extra jiffie was introduced to ensure that the timeout will not happen earlier than specified. Since the rework of the timer wheel, the enqueue path already takes care of this. So the extra jiffie added by msleep*() is pointless now. Remove this extra jiffie in msleep() and msleep_interruptible(). Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lore.kernel.org/all/20240829074133.4547-1-anna-maria@linutronix.de
2024-08-23	hrtimer: Use and report correct timerslack values for realtime tasks	Felix Moessbauer	5	-26/+22
	The timerslack_ns setting is used to specify how much the hardware timers should be delayed, to potentially dispatch multiple timers in a single interrupt. This is a performance optimization. Timers of realtime tasks (having a realtime scheduling policy) should not be delayed. This logic was inconsitently applied to the hrtimers, leading to delays of realtime tasks which used timed waits for events (e.g. condition variables). Due to the downstream override of the slack for rt tasks, the procfs reported incorrect (non-zero) timerslack_ns values. This is changed by setting the timer_slack_ns task attribute to 0 for all tasks with a rt policy. By that, downstream users do not need to specially handle rt tasks (w.r.t. the slack), and the procfs entry shows the correct value of "0". Setting non-zero slack values (either via procfs or PR_SET_TIMERSLACK) on tasks with a rt policy is ignored, as stated in "man 2 PR_SET_TIMERSLACK": Timer slack is not applied to threads that are scheduled under a real-time scheduling policy (see sched_setscheduler(2)). The special handling of timerslack on rt tasks in downstream users is removed as well. Signed-off-by: Felix Moessbauer <felix.moessbauer@siemens.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20240814121032.368444-2-felix.moessbauer@siemens.com
2024-08-14	hrtimer: Annotate hrtimer_cpu_base_.*_expiry() for sparse.	Sebastian Andrzej Siewior	1	-0/+2
	The two hrtimer_cpu_base_.*_expiry() functions are wrappers around the locking functions and sparse complains about the missing counterpart. Add sparse annotation to denote that this bevaviour is expected. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20240812105326.2240000-3-bigeasy@linutronix.de
2024-08-14	timers: Add sparse annotation for timer_sync_wait_running().	Sebastian Andrzej Siewior	1	-0/+2
	timer_sync_wait_running() first releases two locks and then acquires them again. This is unexpected and sparse complains about it. Add sparse annotation for timer_sync_wait_running() to note that the locking is expected. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20240812105326.2240000-2-bigeasy@linutronix.de
2024-07-29	signal: Replace BUG_ON()s	Thomas Gleixner	1	-4/+7
	These really can be handled gracefully without killing the machine. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-07-29	signal: Remove task argument from dequeue_signal()	Thomas Gleixner	3	-18/+14
	The task pointer which is handed to dequeue_signal() is always current. The argument along with the first comment about signalfd in that function is confusing at best. Remove it and use current internally. Update the stale comment for dequeue_signal() while at it. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-07-29	posix-timers: Consolidate signal queueing	Thomas Gleixner	4	-19/+15
	Rename posix_timer_event() to posix_timer_queue_signal() as this is what the function is about. Consolidate the requeue pending and deactivation updates into that function as there is no point in doing this in all incarnations of posix timers. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-07-29	posix-cpu-timers: Make k_itimer::it_active consistent	Thomas Gleixner	1	-0/+4
	Posix CPU timers are not updating k_itimer::it_active which makes it impossible to base decisions in the common posix timer code on it. Update it when queueing or dequeueing posix CPU timers. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-07-29	posix-timers: Consolidate timer setup	Thomas Gleixner	3	-20/+21
	hrtimer based and CPU timers have their own way to install the new interval and to reset overrun and signal handling related data. Create a helper function and do the same operation for all variants. This also makes the handling of the interval consistent. It's only stored when the timer is actually armed, i.e. timer->it_value != 0. Before that it was stored unconditionally for posix CPU timers and conditionally for the other posix timers. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-07-29	posix-timers: Convert timer list to hlist	Thomas Gleixner	6	-18/+15
	No requirement for a real list. Spare a few bytes. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-07-29	posix-timers: Clear overrun in common_timer_set()	Thomas Gleixner	1	-0/+1
	Keeping the overrun count of the previous setup around is just wrong. The new setting has nothing to do with the previous one and has to start from a clean slate. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-07-29	posix-timers: Retrieve interval in common timer_settime() code	Thomas Gleixner	2	-9/+6
	No point in doing this all over the place. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-07-29	posix-cpu-timers: Simplify posix_cpu_timer_set()	Thomas Gleixner	1	-27/+17
	Avoid the late sighand lock/unlock dance when a timer is not armed to enforce reevaluation of the timer base so that the process wide CPU timer sampling can be disabled. Do it right at the point where the arming decision is made which already has sighand locked. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-07-29	posix-cpu-timers: Remove incorrect comment in posix_cpu_timer_set()	Thomas Gleixner	1	-6/+1
	A leftover from historical code which describes fiction. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-07-29	posix-cpu-timers: Use @now instead of @val for clarity	Thomas Gleixner	1	-13/+9
	posix_cpu_timer_set() uses @val as variable for the current time. That's confusing at best. Use @now as anywhere else and rewrite the confusing comment about clock sampling. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-07-29	posix-cpu-timers: Do not arm SIGEV_NONE timers	Thomas Gleixner	1	-16/+13
	There is no point in arming SIGEV_NONE timers as they never deliver a signal. timer_gettime() is handling the expiry time correctly and that's all SIGEV_NONE timers care about. Prevent arming them and remove the expiry handler code which just disarms them. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-07-29	posix-cpu-timers: Replace old expiry retrieval in posix_cpu_timer_set()	Thomas Gleixner	1	-31/+6
	Reuse the split out __posix_cpu_timer_get() function which does already the right thing. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-07-29	posix-cpu-timers: Handle SIGEV_NONE timers correctly in timer_set()	Thomas Gleixner	1	-1/+11
	Expired SIGEV_NONE oneshot timers must return 0 nsec for the expiry time in timer_get(), but the posix CPU timer implementation returns 1 nsec. Add the missing conditional. This will be cleaned up in a follow up patch. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-07-29	posix-cpu-timers: Handle SIGEV_NONE timers correctly in timer_get()	Thomas Gleixner	1	-5/+9
	Expired SIGEV_NONE oneshot timers must return 0 nsec for the expiry time in timer_get(), but the posix CPU timer implementation returns 1 nsec. Add the missing conditional. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-07-29	posix-cpu-timers: Handle interval timers correctly in timer_get()	Thomas Gleixner	1	-1/+17
	timer_gettime() must return the remaining time to the next expiry of a timer or 0 if the timer is not armed and no signal pending, but posix CPU timers fail to forward a timer which is already expired. Add the required logic to address that. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-07-29	posix-cpu-timers: Save interval only for armed timers	Thomas Gleixner	1	-8/+6
	There is no point to return the interval for timers which have been disarmed. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-07-29	posix-cpu-timers: Split up posix_cpu_timer_get()	Thomas Gleixner	1	-27/+24
	In preparation for addressing issues in the timer_get() and timer_set() functions of posix CPU timers. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-07-29	selftests/timers/posix-timers: Validate overrun after unblock	Thomas Gleixner	1	-1/+60
	When a timer signal is blocked and later unblocked then one signal should be delivered with the correct number of overruns since the timer was queued. Validate that behaviour. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2024-07-29	selftests/timers/posix-timers: Validate timer_gettime()	Thomas Gleixner	1	-1/+57
	timer_gettime() must return the correct expiry time for interval timers even when the timer is not armed, which is the case when a signal is pending but blocked. Works correctly for regular posix timers, but not for posix CPU timers. Add a selftest to validate the fixes. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2024-07-29	selftests/timers/posix-timers: Validate SIGEV_NONE	Thomas Gleixner	1	-1/+52
	Posix timers with a delivery mode of SIGEV_NONE deliver no signals but the remaining expiry time must be readable via timer_gettime() for both one shot and interval timers. That's implemented correctly for regular posix timers but broken for posix CPU timers. Add a self test so the fixes can be verified. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2024-07-29	selftests/timers/posix_timers: Validate signal rules	Thomas Gleixner	1	-1/+107
	Add a test case to validate correct behaviour vs. timer reprogramming and deletion. The handling of queued signals in case of timer reprogramming or deletion is inconsistent at best. POSIX does not really specify the behaviour for that: - "The effect of disarming or resetting a timer with pending expiration notifications is unspecified." - "The disposition of pending signals for the deleted timer is unspecified." In both cases it is reasonable to expect that pending signals are discarded. Especially in the reprogramming case it does not make sense to account for previous overruns or to deliver a signal for a timer which has been disarmed. Add tests to validate that no unexpected signals are delivered. They fail for now until the signal and posix timer code is updated. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2024-07-29	selftests/timers/posix_timers: Add SIG_IGN test	Thomas Gleixner	1	-2/+125
	Add a test case to validate correct behaviour vs. SIG_IGN. The posix specification states: "Setting a signal action to SIG_IGN for a signal that is pending shall cause the pending signal to be discarded, whether or not it is blocked." The kernel implements this in the signal handling code, but due to the way how posix timers are handling SIG_IGN for periodic timers, the behaviour after installing a real handler again is inconsistent and suprising. The following sequence is expected to deliver a signal: install_handler(SIG); block_signal(SIG); timer_create(...); <- Should send SIG timer_settime(value=100ms, interval=100ms); sleep(1); <- Timer expires and queues signal, timer is not rearmed as that should happen in the signal delivery path ignore_signal(SIG); <- Discards queued signal install_handler(SIG); <- Restore handler, should rearm but does not sleep(1); unblock_signal(SIG); <- Should deliver one signal with overrun count set in siginfo This fails because nothing rearms the timer when the signal handler is restored. Add a test for this case which fails until the signal and posix timer code is fixed. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2024-07-29	selftests/timers/posix_timers: Simplify error handling	Thomas Gleixner	1	-99/+52
	No point in returning to main() on fatal errors. Just exit right away. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
2024-07-28	Linux 6.11-rc1	Linus Torvalds	1	-2/+2

2024-07-28	minmax: simplify and clarify min_t()/max_t() implementation	Linus Torvalds	1	-8/+11
	This simplifies the min_t() and max_t() macros by no longer making them work in the context of a C constant expression. That means that you can no longer use them for static initializers or for array sizes in type definitions, but there were only a couple of such uses, and all of them were converted (famous last words) to use MIN_T/MAX_T instead. Cc: David Laight <David.Laight@aculab.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-07-28	minmax: add a few more MIN_T/MAX_T users	Linus Torvalds	7	-10/+10
	Commit 3a7e02c040b1 ("minmax: avoid overly complicated constant expressions in VM code") added the simpler MIN_T/MAX_T macros in order to avoid some excessive expansion from the rather complicated regular min/max macros. The complexity of those macros stems from two issues: (a) trying to use them in situations that require a C constant expression (in static initializers and for array sizes) (b) the type sanity checking and MIN_T/MAX_T avoids both of these issues. Now, in the whole (long) discussion about all this, it was pointed out that the whole type sanity checking is entirely unnecessary for min_t/max_t which get a fixed type that the comparison is done in. But that still leaves min_t/max_t unnecessarily complicated due to worries about the C constant expression case. However, it turns out that there really aren't very many cases that use min_t/max_t for this, and we can just force-convert those. This does exactly that. Which in turn will then allow for much simpler implementations of min_t()/max_t(). All the usual "macros in all upper case will evaluate the arguments multiple times" rules apply. We should do all the same things for the regular min/max() vs MIN/MAX() cases, but that has the added complexity of various drivers defining their own local versions of MIN/MAX, so that needs another level of fixes first. Link: https://lore.kernel.org/all/b47fad1d0cf8449886ad148f8c013dae@AcuMS.aculab.com/ Cc: David Laight <David.Laight@aculab.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-07-29	kbuild: Fix '-S -c' in x86 stack protector scripts	Nathan Chancellor	2	-2/+2
	After a recent change in clang to stop consuming all instances of '-S' and '-c' [1], the stack protector scripts break due to the kernel's use of -Werror=unused-command-line-argument to catch cases where flags are not being properly consumed by the compiler driver: $ echo \| clang -o - -x c - -S -c -Werror=unused-command-line-argument clang: error: argument unused during compilation: '-c' [-Werror,-Wunused-command-line-argument] This results in CONFIG_STACKPROTECTOR getting disabled because CONFIG_CC_HAS_SANE_STACKPROTECTOR is no longer set. '-c' and '-S' both instruct the compiler to stop at different stages of the pipeline ('-S' after compiling, '-c' after assembling), so having them present together in the same command makes little sense. In this case, the test wants to stop before assembling because it is looking at the textual assembly output of the compiler for either '%fs' or '%gs', so remove '-c' from the list of arguments to resolve the error. All versions of GCC continue to work after this change, along with versions of clang that do or do not contain the change mentioned above. Cc: stable@vger.kernel.org Fixes: 4f7fd4d7a791 ("[PATCH] Add the -fstack-protector option to the CFLAGS") Fixes: 60a5317ff0f4 ("x86: implement x86_32 stack protector") Link: https://github.com/llvm/llvm-project/commit/6461e537815f7fa68cef06842505353cf5600e9c [1] Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2024-07-28	ubi: Fix ubi_init() ubiblock_exit() section mismatch	Richard Weinberger	1	-1/+1
	Since ubiblock_exit() is now called from an init function, the __exit section no longer makes sense. Cc: Ben Hutchings <bwh@kernel.org> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202407131403.wZJpd8n2-lkp@intel.com/ Signed-off-by: Richard Weinberger <richard@nod.at> Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com>
2024-07-28	kbuild: rpm-pkg: ghost modules.weakdep file	Jose Ignacio Tornos Martinez	1	-1/+1
	In the same way as for other similar files, mark as ghost the new file generated by depmod for configured weak dependencies for modules, modules.weakdep, so that although it is not included in the package, claim the ownership on it. Signed-off-by: Jose Ignacio Tornos Martinez <jtornosm@redhat.com> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2024-07-27	hostfs: fix the host directory parse when mounting.	Hongbo Li	1	-10/+55
	hostfs not keep the host directory when mounting. When the host directory is none (default), fc->source is used as the host root directory, and this is wrong. Here we use `parse_monolithic` to handle the old mount path for parsing the root directory. For new mount path, The `parse_param` is used for the host directory parse. Reported-and-tested-by: Maciej Żenczykowski <maze@google.com> Fixes: cd140ce9f611 ("hostfs: convert hostfs to use the new mount API") Link: https://lore.kernel.org/all/CANP3RGceNzwdb7w=vPf5=7BCid5HVQDmz1K5kC9JG42+HVAh_g@mail.gmail.com/ Cc: Christian Brauner <brauner@kernel.org> Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Link: https://lore.kernel.org/r/20240725065130.1821964-1-lihongbo22@huawei.com [brauner: minor fixes] Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-07-27	fs: don't allow non-init s_user_ns for filesystems without FS_USERNS_MOUNT	Seth Forshee (DigitalOcean)	1	-0/+11
	Christian noticed that it is possible for a privileged user to mount most filesystems with a non-initial user namespace in sb->s_user_ns. When fsopen() is called in a non-init namespace the caller's namespace is recorded in fs_context->user_ns. If the returned file descriptor is then passed to a process priviliged in init_user_ns, that process can call fsconfig(fd_fs, FSCONFIG_CMD_CREATE), creating a new superblock with sb->s_user_ns set to the namespace of the process which called fsopen(). This is problematic. We cannot assume that any filesystem which does not set FS_USERNS_MOUNT has been written with a non-initial s_user_ns in mind, increasing the risk for bugs and security issues. Prevent this by returning EPERM from sget_fc() when FS_USERNS_MOUNT is not set for the filesystem and a non-initial user namespace will be used. sget() does not need to be updated as it always uses the user namespace of the current context, or the initial user namespace if SB_SUBMOUNT is set. Fixes: cb50b348c71f ("convenience helpers: vfs_get_super() and sget_fc()") Reported-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org> Link: https://lore.kernel.org/r/20240724-s_user_ns-fix-v1-1-895d07c94701@kernel.org Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-07-27	ALSA: firewire-lib: fix wrong value as length of header for CIP_NO_HEADER case	Takashi Sakamoto	1	-2/+1
	In a commit 1d717123bb1a ("ALSA: firewire-lib: Avoid -Wflex-array-member-not-at-end warning"), DEFINE_FLEX() macro was used to handle variable length of array for header field in struct fw_iso_packet structure. The usage of macro has a side effect that the designated initializer assigns the count of array to the given field. Therefore CIP_HEADER_QUADLETS (=2) is assigned to struct fw_iso_packet.header, while the original designated initializer assigns zero to all fields. With CIP_NO_HEADER flag, the change causes invalid length of header in isochronous packet for 1394 OHCI IT context. This bug affects all of devices supported by ALSA fireface driver; RME Fireface 400, 800, UCX, UFX, and 802. This commit fixes the bug by replacing it with the alternative version of macro which corresponds no initializer. Cc: stable@vger.kernel.org Fixes: 1d717123bb1a ("ALSA: firewire-lib: Avoid -Wflex-array-member-not-at-end warning") Reported-by: Edmund Raile <edmund.raile@proton.me> Closes: https://lore.kernel.org/r/rrufondjeynlkx2lniot26ablsltnynfaq2gnqvbiso7ds32il@qk4r6xps7jh2/ Reviewed-by: Takashi Iwai <tiwai@suse.de> Link: https://lore.kernel.org/r/20240725155640.128442-1-o-takashi@sakamocchi.jp Signed-off-by: Takashi Sakamoto <o-takashi@sakamocchi.jp>
2024-07-27	Revert "firewire: Annotate struct fw_iso_packet with __counted_by()"	Takashi Sakamoto	1	-3/+2
	This reverts commit d3155742db89df3b3c96da383c400e6ff4d23c25. The header_length field is byte unit, thus it can not express the number of elements in header field. It seems that the argument for counted_by attribute can have no arithmetic expression, therefore this commit just reverts the issued commit. Suggested-by: Gustavo A. R. Silva <gustavoars@kernel.org> Link: https://lore.kernel.org/r/20240725161648.130404-1-o-takashi@sakamocchi.jp Signed-off-by: Takashi Sakamoto <o-takashi@sakamocchi.jp>
2024-07-26	minmax: avoid overly complicated constant expressions in VM code	Linus Torvalds	2	-2/+9
	The minmax infrastructure is overkill for simple constants, and can cause huge expansions because those simple constants are then used by other things. For example, 'pageblock_order' is a core VM constant, but because it was implemented using 'min_t()' and all the type-checking that involves, it actually expanded to something like 2.5kB of preprocessor noise. And when that simple constant was then used inside other expansions: #define pageblock_nr_pages (1UL << pageblock_order) #define pageblock_start_pfn(pfn) ALIGN_DOWN((pfn), pageblock_nr_pages) and we then use that inside a 'max()' macro: case ISOLATE_SUCCESS: update_cached = false; last_migrated_pfn = max(cc->zone->zone_start_pfn, pageblock_start_pfn(cc->migrate_pfn - 1)); the end result was that one statement expanding to 253kB in size. There are probably other cases of this, but this one case certainly stood out. I've added 'MIN_T()' and 'MAX_T()' macros for this kind of "core simple constant with specific type" use. These macros skip the type checking, and as such need to be very sparingly used only for obvious cases that have active issues like this. Reported-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://lore.kernel.org/all/36aa2cad-1db1-4abf-8dd2-fb20484aabc3@lucifer.local/ Cc: David Laight <David.Laight@aculab.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-07-26	minmax: avoid overly complex min()/max() macro arguments in xen	Linus Torvalds	1	-2/+3
	We have some very fancy min/max macros that have tons of sanity checking to warn about mixed signedness etc. This is all things that a sane compiler should warn about, but there are no sane compiler interfaces for this, and '-Wsign-compare' is broken [1] and not useful. So then we compensate (some would say over-compensate) by doing the checks manually with some truly horrid macro games. And no, we can't just use __builtin_types_compatible_p(), because the whole question of "does it make sense to compare these two values" is a lot more complicated than that. For example, it makes a ton of sense to compare unsigned values with simple constants like "5", even if that is indeed a signed type. So we have these very strange macros to try to make sensible type checking decisions on the arguments to 'min()' and 'max()'. But that can cause enormous code expansion if the min()/max() macros are used with complicated expressions, and particularly if you nest these things so that you get the first big expansion then expanded again. The xen setup.c file ended up ballooning to over 50MB of preprocessed noise that takes 15s to compile (obviously depending on the build host), largely due to one single line. So let's split that one single line to just be simpler. I think it ends up being more legible to humans too at the same time. Now that single file compiles in under a second. Reported-and-reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://lore.kernel.org/all/c83c17bb-be75-4c67-979d-54eee38774c6@lucifer.local/ Link: https://staticthinking.wordpress.com/2023/07/25/wsign-compare-is-garbage/ [1] Cc: David Laight <David.Laight@aculab.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-07-26	nilfs2: handle inconsistent state in nilfs_btnode_create_block()	Ryusuke Konishi	2	-7/+22
	Syzbot reported that a buffer state inconsistency was detected in nilfs_btnode_create_block(), triggering a kernel bug. It is not appropriate to treat this inconsistency as a bug; it can occur if the argument block address (the buffer index of the newly created block) is a virtual block number and has been reallocated due to corruption of the bitmap used to manage its allocation state. So, modify nilfs_btnode_create_block() and its callers to treat it as a possible filesystem error, rather than triggering a kernel bug. Link: https://lkml.kernel.org/r/20240725052007.4562-1-konishi.ryusuke@gmail.com Fixes: a60be987d45d ("nilfs2: B-tree node cache") Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+89cc4f2324ed37988b60@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=89cc4f2324ed37988b60 Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-26	selftests/mm: skip test for non-LPA2 and non-LVA systems	Dev Jain	1	-1/+15
	Post my improvement of the test in e4a4ba415419 ("selftests/mm: va_high_addr_switch: dynamically initialize testcases to enable LPA2 testing"): The test begins to fail on 4k and 16k pages, on non-LPA2 systems. To reduce noise in the CI systems, let us skip the test when higher address space is not implemented. Link: https://lkml.kernel.org/r/20240718052504.356517-1-dev.jain@arm.com Fixes: e4a4ba415419 ("selftests/mm: va_high_addr_switch: dynamically initialize testcases to enable LPA2 testing") Signed-off-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Mark Brown <broonie@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-26	mm/page_alloc: fix pcp->count race between drain_pages_zone() vs __rmqueue_pcplist()	Li Zhijian	1	-7/+11
	It's expected that no page should be left in pcp_list after calling zone_pcp_disable() in offline_pages(). Previously, it's observed that offline_pages() gets stuck [1] due to some pages remaining in pcp_list. Cause: There is a race condition between drain_pages_zone() and __rmqueue_pcplist() involving the pcp->count variable. See below scenario: CPU0 CPU1 ---------------- --------------- spin_lock(&pcp->lock); __rmqueue_pcplist() { zone_pcp_disable() { /* list is empty / if (list_empty(list)) { / add pages to pcp_list / alloced = rmqueue_bulk() mutex_lock(&pcp_batch_high_lock) ... __drain_all_pages() { drain_pages_zone() { / read pcp->count, it's 0 here / count = READ_ONCE(pcp->count) / 0 means nothing to drain / / update pcp->count */ pcp->count += alloced << order; ... ... spin_unlock(&pcp->lock); In this case, after calling zone_pcp_disable() though, there are still some pages in pcp_list. And these pages in pcp_list are neither movable nor isolated, offline_pages() gets stuck as a result. Solution: Expand the scope of the pcp->lock to also protect pcp->count in drain_pages_zone(), to ensure no pages are left in the pcp list after zone_pcp_disable() [1] https://lore.kernel.org/linux-mm/6a07125f-e720-404c-b2f9-e55f3f166e85@fujitsu.com/ Link: https://lkml.kernel.org/r/20240723064428.1179519-1-lizhijian@fujitsu.com Fixes: 4b23a68f9536 ("mm/page_alloc: protect PCP lists with a spinlock") Signed-off-by: Li Zhijian <lizhijian@fujitsu.com> Reported-by: Yao Xingtao <yaoxt.fnst@fujitsu.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-26	mm: memcg: add cacheline padding after lruvec in mem_cgroup_per_node	Roman Gushchin	1	-0/+1
	Oliver Sand reported a performance regression caused by commit 98c9daf5ae6b ("mm: memcg: guard memcg1-specific members of struct mem_cgroup_per_node"), which puts some fields of the mem_cgroup_per_node structure under the CONFIG_MEMCG_V1 config option. Apparently it causes a false cache sharing between lruvec and lru_zone_size members of the structure. Fix it by adding an explicit padding after the lruvec member. Even though the padding is not required with CONFIG_MEMCG_V1 set, it seems like the introduced memory overhead is not significant enough to warrant another divergence in the mem_cgroup_per_node layout, so the padding is added unconditionally. Link: https://lkml.kernel.org/r/20240723171244.747521-1-roman.gushchin@linux.dev Fixes: 98c9daf5ae6b ("mm: memcg: guard memcg1-specific members of struct mem_cgroup_per_node") Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202407121335.31a10cb6-oliver.sang@intel.com Tested-by: Oliver Sang <oliver.sang@intel.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-26	alloc_tag: outline and export free_reserved_page()	Suren Baghdasaryan	2	-15/+18
	Outline and export free_reserved_page() because modules use it and it in turn uses page_ext_{get\|put} which should not be exported. The same result could be obtained by outlining {get\|put}_page_tag_ref() but that would have higher performance impact as these functions are used in more performance critical paths. Link: https://lkml.kernel.org/r/20240717212844.2749975-1-surenb@google.com Fixes: dcfe378c81f7 ("lib: introduce support for page allocation tagging") Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202407080044.DWMC9N9I-lkp@intel.com/ Suggested-by: Christoph Hellwig <hch@infradead.org> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Kees Cook <keescook@chromium.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Sourav Panda <souravpanda@google.com> Cc: <stable@vger.kernel.org> [6.10] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-26	decompress_bunzip2: fix rare decompression failure	Ross Lagerwall	1	-1/+2
	The decompression code parses a huffman tree and counts the number of symbols for a given bit length. In rare cases, there may be >= 256 symbols with a given bit length, causing the unsigned char to overflow. This causes a decompression failure later when the code tries and fails to find the bit length for a given symbol. Since the maximum number of symbols is 258, use unsigned short instead. Link: https://lkml.kernel.org/r/20240717162016.1514077-1-ross.lagerwall@citrix.com Fixes: bc22c17e12c1 ("bzip2/lzma: library support for gzip, bzip2 and lzma decompression") Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Cc: Alain Knaff <alain@knaff.lu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-26	mm/huge_memory: avoid PMD-size page cache if needed	Gavin Shan	2	-5/+19
	xarray can't support arbitrary page cache size. the largest and supported page cache size is defined as MAX_PAGECACHE_ORDER by commit 099d90642a71 ("mm/filemap: make MAX_PAGECACHE_ORDER acceptable to xarray"). However, it's possible to have 512MB page cache in the huge memory's collapsing path on ARM64 system whose base page size is 64KB. 512MB page cache is breaking the limitation and a warning is raised when the xarray entry is split as shown in the following example. [root@dhcp-10-26-1-207 ~]# cat /proc/1/smaps \| grep KernelPageSize KernelPageSize: 64 kB [root@dhcp-10-26-1-207 ~]# cat /tmp/test.c : int main(int argc, char *argv) { const char filename = TEST_XFS_FILENAME; int fd = 0; void buf = (void )-1, p; int pgsize = getpagesize(); int ret = 0; if (pgsize != 0x10000) { fprintf(stdout, "System with 64KB base page size is required!\n"); return -EPERM; } system("echo 0 > /sys/devices/virtual/bdi/253:0/read_ahead_kb"); system("echo 1 > /proc/sys/vm/drop_caches"); / Open the xfs file / fd = open(filename, O_RDONLY); assert(fd > 0); / Create VMA / buf = mmap(NULL, TEST_MEM_SIZE, PROT_READ, MAP_SHARED, fd, 0); assert(buf != (void )-1); fprintf(stdout, "mapped buffer at 0x%p\n", buf); /* Populate VMA / ret = madvise(buf, TEST_MEM_SIZE, MADV_NOHUGEPAGE); assert(ret == 0); ret = madvise(buf, TEST_MEM_SIZE, MADV_POPULATE_READ); assert(ret == 0); / Collapse VMA / ret = madvise(buf, TEST_MEM_SIZE, MADV_HUGEPAGE); assert(ret == 0); ret = madvise(buf, TEST_MEM_SIZE, MADV_COLLAPSE); if (ret) { fprintf(stdout, "Error %d to madvise(MADV_COLLAPSE)\n", errno); goto out; } / Split xarray entry. Write permission is needed / munmap(buf, TEST_MEM_SIZE); buf = (void )-1; close(fd); fd = open(filename, O_RDWR); assert(fd > 0); fallocate(fd, FALLOC_FL_KEEP_SIZE \| FALLOC_FL_PUNCH_HOLE, TEST_MEM_SIZE - pgsize, pgsize); out: if (buf != (void *)-1) munmap(buf, TEST_MEM_SIZE); if (fd > 0) close(fd); return ret; } [root@dhcp-10-26-1-207 ~]# gcc /tmp/test.c -o /tmp/test [root@dhcp-10-26-1-207 ~]# /tmp/test ------------[ cut here ]------------ WARNING: CPU: 25 PID: 7560 at lib/xarray.c:1025 xas_split_alloc+0xf8/0x128 Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib \ nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct \ nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 \ ip_set rfkill nf_tables nfnetlink vfat fat virtio_balloon drm fuse \ xfs libcrc32c crct10dif_ce ghash_ce sha2_ce sha256_arm64 virtio_net \ sha1_ce net_failover virtio_blk virtio_console failover dimlib virtio_mmio CPU: 25 PID: 7560 Comm: test Kdump: loaded Not tainted 6.10.0-rc7-gavin+ #9 Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20240524-1.el9 05/24/2024 pstate: 83400005 (Nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--) pc : xas_split_alloc+0xf8/0x128 lr : split_huge_page_to_list_to_order+0x1c4/0x780 sp : ffff8000ac32f660 x29: ffff8000ac32f660 x28: ffff0000e0969eb0 x27: ffff8000ac32f6c0 x26: 0000000000000c40 x25: ffff0000e0969eb0 x24: 000000000000000d x23: ffff8000ac32f6c0 x22: ffffffdfc0700000 x21: 0000000000000000 x20: 0000000000000000 x19: ffffffdfc0700000 x18: 0000000000000000 x17: 0000000000000000 x16: ffffd5f3708ffc70 x15: 0000000000000000 x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000 x11: ffffffffffffffc0 x10: 0000000000000040 x9 : ffffd5f3708e692c x8 : 0000000000000003 x7 : 0000000000000000 x6 : ffff0000e0969eb8 x5 : ffffd5f37289e378 x4 : 0000000000000000 x3 : 0000000000000c40 x2 : 000000000000000d x1 : 000000000000000c x0 : 0000000000000000 Call trace: xas_split_alloc+0xf8/0x128 split_huge_page_to_list_to_order+0x1c4/0x780 truncate_inode_partial_folio+0xdc/0x160 truncate_inode_pages_range+0x1b4/0x4a8 truncate_pagecache_range+0x84/0xa0 xfs_flush_unmap_range+0x70/0x90 [xfs] xfs_file_fallocate+0xfc/0x4d8 [xfs] vfs_fallocate+0x124/0x2f0 ksys_fallocate+0x4c/0xa0 __arm64_sys_fallocate+0x24/0x38 invoke_syscall.constprop.0+0x7c/0xd8 do_el0_svc+0xb4/0xd0 el0_svc+0x44/0x1d8 el0t_64_sync_handler+0x134/0x150 el0t_64_sync+0x17c/0x180 Fix it by correcting the supported page cache orders, different sets for DAX and other files. With it corrected, 512MB page cache becomes disallowed on all non-DAX files on ARM64 system where the base page size is 64KB. After this patch is applied, the test program fails with error -EINVAL returned from __thp_vma_allowable_orders() and the madvise() system call to collapse the page caches. Link: https://lkml.kernel.org/r/20240715000423.316491-1-gshan@redhat.com Fixes: 6b24ca4a1a8d ("mm: Use multi-index entries in the page cache") Signed-off-by: Gavin Shan <gshan@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: Zi Yan <ziy@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Don Dutile <ddutile@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: <stable@vger.kernel.org> [5.17+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-26	mm: huge_memory: use !CONFIG_64BIT to relax huge page alignment on 32 bit machines	Yang Shi	1	-1/+1
	Yves-Alexis Perez reported commit 4ef9ad19e176 ("mm: huge_memory: don't force huge page alignment on 32 bit") didn't work for x86_32 [1]. It is because x86_32 uses CONFIG_X86_32 instead of CONFIG_32BIT. !CONFIG_64BIT should cover all 32 bit machines. [1] https://lore.kernel.org/linux-mm/CAHbLzkr1LwH3pcTgM+aGQ31ip2bKqiqEQ8=FQB+t2c3dhNKNHA@mail.gmail.com/ Link: https://lkml.kernel.org/r/20240712155855.1130330-1-yang@os.amperecomputing.com Fixes: 4ef9ad19e176 ("mm: huge_memory: don't force huge page alignment on 32 bit") Signed-off-by: Yang Shi <yang@os.amperecomputing.com> Reported-by: Yves-Alexis Perez <corsac@debian.org> Tested-by: Yves-Alexis Perez <corsac@debian.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Christoph Lameter <cl@linux.com> Cc: Jiri Slaby <jirislaby@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Salvatore Bonaccorso <carnil@debian.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: <stable@vger.kernel.org> [6.8+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-26	mm: fix old/young bit handling in the faulting path	Ram Tummala	1	-1/+1
	Commit 3bd786f76de2 ("mm: convert do_set_pte() to set_pte_range()") replaced do_set_pte() with set_pte_range() and that introduced a regression in the following faulting path of non-anonymous vmas which caused the PTE for the faulting address to be marked as old instead of young. handle_pte_fault() do_pte_missing() do_fault() do_read_fault() \|\| do_cow_fault() \|\| do_shared_fault() finish_fault() set_pte_range() The polarity of prefault calculation is incorrect. This leads to prefault being incorrectly set for the faulting address. The following check will incorrectly mark the PTE old rather than young. On some architectures this will cause a double fault to mark it young when the access is retried. if (prefault && arch_wants_old_prefaulted_pte()) entry = pte_mkold(entry); On a subsequent fault on the same address, the faulting path will see a non NULL vmf->pte and instead of reaching the do_pte_missing() path, PTE will then be correctly marked young in handle_pte_fault() itself. Due to this bug, performance degradation in the fault handling path will be observed due to unnecessary double faulting. Link: https://lkml.kernel.org/r/20240710014539.746200-1-rtummala@nvidia.com Fixes: 3bd786f76de2 ("mm: convert do_set_pte() to set_pte_range()") Signed-off-by: Ram Tummala <rtummala@nvidia.com> Reviewed-by: Yin Fengwei <fengwei.yin@intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Yin Fengwei <fengwei.yin@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-26	dt-bindings: arm: update James Clark's email address	James Clark	2	-2/+2
	My new address is james.clark@linaro.org Link: https://lkml.kernel.org/r/20240709102512.31212-3-james.clark@linaro.org Signed-off-by: James Clark <james.clark@linaro.org> Cc: Bjorn Andersson <quic_bjorande@quicinc.com> Cc: Conor Dooley <conor+dt@kernel.org> Cc: David S. Miller <davem@davemloft.net> Cc: Geliang Tang <geliang@kernel.org> Cc: Hao Zhang <quic_hazha@quicinc.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Jiri Kosina <jikos@kernel.org> Cc: Kees Cook <kees@kernel.org> Cc: Krzysztof Kozlowski <krzk+dt@kernel.org> Cc: Mao Jinlong <quic_jinlmao@quicinc.com> Cc: Matthieu Baerts <matttbe@kernel.org> Cc: Matt Ranostay <matt@ranostay.sg> Cc: Mike Leach <mike.leach@linaro.org> Cc: Oleksij Rempel <o.rempel@pengutronix.de> Cc: Rob Herring (Arm) <robh@kernel.org> Cc: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>