aboutsummaryrefslogtreecommitdiffstats
path: root/kernel (follow)
AgeCommit message (Collapse)AuthorFilesLines
2022-05-10bpf: Extend batch operations for map-in-map bpf-mapsTakshak Chahande2-2/+13
This patch extends batch operations support for map-in-map map-types: BPF_MAP_TYPE_HASH_OF_MAPS and BPF_MAP_TYPE_ARRAY_OF_MAPS A usecase where outer HASH map holds hundred of VIP entries and its associated reuse-ports per VIP stored in REUSEPORT_SOCKARRAY type inner map, needs to do batch operation for performance gain. This patch leverages the exiting generic functions for most of the batch operations. As map-in-map's value contains the actual reference of the inner map, for BPF_MAP_TYPE_HASH_OF_MAPS type, it needed an extra step to fetch the map_id from the reference value. selftests are added in next patch 2/2. Signed-off-by: Takshak Chahande <ctakshak@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20220510082221.2390540-1-ctakshak@fb.com
2022-05-09mm/rmap: drop "compound" parameter from page_add_new_anon_rmap()David Hildenbrand1-1/+1
New anonymous pages are always mapped natively: only THP/khugepaged code maps a new compound anonymous page and passes "true". Otherwise, we're just dealing with simple, non-compound pages. Let's give the interface clearer semantics and document these. Remove the PageTransCompound() sanity check from page_add_new_anon_rmap(). Link: https://lkml.kernel.org/r/20220428083441.37290-9-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: David Rientjes <rientjes@google.com> Cc: Don Dutile <ddutile@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Liang Zhang <zhangliang5@huawei.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Nadav Amit <namit@vmware.com> Cc: Oded Gabbay <oded.gabbay@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-09bpf: Remove unused parameter from find_kfunc_desc_btf()Yuntao Wang1-5/+4
The func_id parameter in find_kfunc_desc_btf() is not used, get rid of it. Fixes: 2357672c54c3 ("bpf: Introduce BPF support for kernel module function calls") Signed-off-by: Yuntao Wang <ytcoode@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/bpf/20220505070114.3522522-1-ytcoode@gmail.com
2022-05-09sched: Fix build warning without CONFIG_SYSCTLYueHaibing1-29/+36
IF CONFIG_SYSCTL is n, build warn: kernel/sched/core.c:1782:12: warning: ‘sysctl_sched_uclamp_handler’ defined but not used [-Wunused-function] static int sysctl_sched_uclamp_handler(struct ctl_table *table, int write, ^~~~~~~~~~~~~~~~~~~~~~~~~~~ sysctl_sched_uclamp_handler() is used while CONFIG_SYSCTL enabled, wrap all related code with CONFIG_SYSCTL to fix this. Fixes: 3267e0156c33 ("sched: Move uclamp_util sysctls to core.c") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2022-05-09reboot: Fix build warning without CONFIG_SYSCTLYueHaibing1-27/+27
If CONFIG_SYSCTL is n, build warn: kernel/reboot.c:443:20: error: ‘kernel_reboot_sysctls_init’ defined but not used [-Werror=unused-function] static void __init kernel_reboot_sysctls_init(void) ^~~~~~~~~~~~~~~~~~~~~~~~~~ Move kernel_reboot_sysctls_init() to #ifdef block to fix this. Fixes: 06d177662fb8 ("kernel/reboot: move reboot sysctls to its own file") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2022-05-09mm,fs: Remove aops->readpageMatthew Wilcox (Oracle)1-3/+2
With all implementations of aops->readpage converted to aops->read_folio, we can stop checking whether it's set and remove the member from aops. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-05-09fs: Introduce aops->read_folioMatthew Wilcox (Oracle)1-2/+4
Change all the callers of ->readpage to call ->read_folio in preference, if it exists. This is a transitional duplication, and will be removed by the end of the series. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-05-09entry: Rename arch_check_user_regs() to arch_enter_from_user_mode()Sven Schnelle1-1/+1
arch_check_user_regs() is used at the moment to verify that struct pt_regs contains valid values when entering the kernel from userspace. s390 needs a place in the generic entry code to modify a cpu data structure when switching from userspace to kernel mode. As arch_check_user_regs() is exactly this, rename it to arch_enter_from_user_mode(). When entering the kernel from userspace, arch_check_user_regs() is used to verify that struct pt_regs contains valid values. Note that the NMI codepath doesn't call this function. s390 needs a place in the generic entry code to modify a cpu data structure when switching from userspace to kernel mode. As arch_check_user_regs() is exactly this, rename it to arch_enter_from_user_mode(). Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Link: https://lore.kernel.org/r/20220504062351.2954280-2-tmricht@linux.ibm.com Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2022-05-08Merge tag 'timers-urgent-2022-05-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-2/+2
Pull timer fix from Thomas Gleixner: "A fix and an email address update: - Mark the NMI safe time accessors notrace to prevent tracer recursion when they are selected as trace clocks. - John Stultz has a new email address" * tag 'timers-urgent-2022-05-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timekeeping: Mark NMI safe time accessors as notrace MAINTAINERS: Update email address for John Stultz
2022-05-08Merge tag 'irq-urgent-2022-05-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds3-10/+33
Pull irq fix from Thomas Gleixner: "A fix for the threaded interrupt core. A quick sequence of request/free_irq() can result in a hang because the interrupt thread did not reach the thread function and got stopped in the kthread core already. That leaves a state active counter arround which makes a invocation of synchronized_irq() on that interrupt hang forever. Ensure that the thread reached the thread function in request_irq() to prevent that" * tag 'irq-urgent-2022-05-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq: Synchronize interrupt thread startup
2022-05-08stackleak: add on/off stack variantsMark Rutland1-3/+32
The stackleak_erase() code dynamically handles being on a task stack or another stack. In most cases, this is a fixed property of the caller, which the caller is aware of, as an architecture might always return using the task stack, or might always return using a trampoline stack. This patch adds stackleak_erase_on_task_stack() and stackleak_erase_off_task_stack() functions which callers can use to avoid on_thread_stack() check and associated redundant work when the calling stack is known. The existing stackleak_erase() is retained as a safe default. There should be no functional change as a result of this patch. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Alexander Popov <alex.popov@linux.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220427173128.2603085-13-mark.rutland@arm.com
2022-05-08stackleak: rework poison scanningMark Rutland1-14/+4
Currently we over-estimate the region of stack which must be erased. To determine the region to be erased, we scan downwards for a contiguous block of poison values (or the low bound of the stack). There are a few minor problems with this today: * When we find a block of poison values, we include this block within the region to erase. As this is included within the region to erase, this causes us to redundantly overwrite 'STACKLEAK_SEARCH_DEPTH' (128) bytes with poison. * As the loop condition checks 'poison_count <= depth', it will run an additional iteration after finding the contiguous block of poison, decrementing 'erase_low' once more than necessary. As this is included within the region to erase, this causes us to redundantly overwrite an additional unsigned long with poison. * As we always decrement 'erase_low' after checking an element on the stack, we always include the element below this within the region to erase. As this is included within the region to erase, this causes us to redundantly overwrite an additional unsigned long with poison. Note that this is not a functional problem. As the loop condition checks 'erase_low > task_stack_low', we'll never clobber the STACK_END_MAGIC. As we always decrement 'erase_low' after this, we'll never fail to erase the element immediately above the STACK_END_MAGIC. In total, this can cause us to erase `128 + 2 * sizeof(unsigned long)` bytes more than necessary, which is unfortunate. This patch reworks the logic to find the address immediately above the poisoned region, by finding the lowest non-poisoned address. This is factored into a stackleak_find_top_of_poison() helper both for clarity and so that this can be shared with the LKDTM test in subsequent patches. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Alexander Popov <alex.popov@linux.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220427173128.2603085-8-mark.rutland@arm.com
2022-05-08stackleak: rework stack high bound handlingMark Rutland1-5/+14
Prior to returning to userspace, we reset current->lowest_stack to a reasonable high bound. Currently we do this by subtracting the arbitrary value `THREAD_SIZE/64` from the top of the stack, for reasons lost to history. Looking at configurations today: * On i386 where THREAD_SIZE is 8K, the bound will be 128 bytes. The pt_regs at the top of the stack is 68 bytes (with 0 to 16 bytes of padding above), and so this covers an additional portion of 44 to 60 bytes. * On x86_64 where THREAD_SIZE is at least 16K (up to 32K with KASAN) the bound will be at least 256 bytes (up to 512 with KASAN). The pt_regs at the top of the stack is 168 bytes, and so this cover an additional 88 bytes of stack (up to 344 with KASAN). * On arm64 where THREAD_SIZE is at least 16K (up to 64K with 64K pages and VMAP_STACK), the bound will be at least 256 bytes (up to 1024 with KASAN). The pt_regs at the top of the stack is 336 bytes, so this can fall within the pt_regs, or can cover an additional 688 bytes of stack. Clearly the `THREAD_SIZE/64` value doesn't make much sense -- in the worst case, this will cause more than 600 bytes of stack to be erased for every syscall, even if actual stack usage were substantially smaller. This patches makes this slightly less nonsensical by consistently resetting current->lowest_stack to the base of the task pt_regs. For clarity and for consistency with the handling of the low bound, the generation of the high bound is split into a helper with commentary explaining why. Since the pt_regs at the top of the stack will be clobbered upon the next exception entry, we don't need to poison these at exception exit. By using task_pt_regs() as the high stack boundary instead of current_top_of_stack() we avoid some redundant poisoning, and the compiler can share the address generation between the poisoning and resetting of `current->lowest_stack`, making the generated code more optimal. It's not clear to me whether the existing `THREAD_SIZE/64` offset was a dodgy heuristic to skip the pt_regs, or whether it was attempting to minimize the number of times stackleak_check_stack() would have to update `current->lowest_stack` when stack usage was shallow at the cost of unconditionally poisoning a small portion of the stack for every exit to userspace. For now I've simply removed the offset, and if we need/want to minimize updates for shallow stack usage it should be easy to add a better heuristic atop, with appropriate commentary so we know what's going on. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Alexander Popov <alex.popov@linux.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220427173128.2603085-7-mark.rutland@arm.com
2022-05-08stackleak: clarify variable namesMark Rutland1-16/+14
The logic within __stackleak_erase() can be a little hard to follow, as `boundary` switches from being the low bound to the high bound mid way through the function, and `kstack_ptr` is used to represent the start of the region to erase while `boundary` represents the end of the region to erase. Make this a little clearer by consistently using clearer variable names. The `boundary` variable is removed, the bounds of the region to erase are described by `erase_low` and `erase_high`, and bounds of the task stack are described by `task_stack_low` and `task_stack_high`. As the same time, remove the comment above the variables, since it is unclear whether it's intended as rationale, a complaint, or a TODO, and is more confusing than helpful. There should be no functional change as a result of this patch. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Alexander Popov <alex.popov@linux.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220427173128.2603085-6-mark.rutland@arm.com
2022-05-08stackleak: rework stack low bound handlingMark Rutland1-10/+4
In stackleak_task_init(), stackleak_track_stack(), and __stackleak_erase(), we open-code skipping the STACK_END_MAGIC at the bottom of the stack. Each case is implemented slightly differently, and only the __stackleak_erase() case is commented. In stackleak_task_init() and stackleak_track_stack() we unconditionally add sizeof(unsigned long) to the lowest stack address. In stackleak_task_init() we use end_of_stack() for this, and in stackleak_track_stack() we use task_stack_page(). In __stackleak_erase() we handle this by detecting if `kstack_ptr` has hit the stack end boundary, and if so, conditionally moving it above the magic. This patch adds a new stackleak_task_low_bound() helper which is used in all three cases, which unconditionally adds sizeof(unsigned long) to the lowest address on the task stack, with commentary as to why. This uses end_of_stack() as stackleak_task_init() did prior to this patch, as this is consistent with the code in kernel/fork.c which initializes the STACK_END_MAGIC value. In __stackleak_erase() we no longer need to check whether we've spilled into the STACK_END_MAGIC value, as stackleak_track_stack() ensures that `current->lowest_stack` stops immediately above this, and similarly the poison scan will stop immediately above this. For stackleak_task_init() and stackleak_track_stack() this results in no change to code generation. For __stackleak_erase() the generated assembly is slightly simpler and shorter. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Alexander Popov <alex.popov@linux.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220427173128.2603085-5-mark.rutland@arm.com
2022-05-08stackleak: remove redundant checkMark Rutland1-4/+0
In __stackleak_erase() we check that the `erase_low` value derived from `current->lowest_stack` is above the lowest legitimate stack pointer value, but this is already enforced by stackleak_track_stack() when recording the lowest stack value. Remove the redundant check. There should be no functional change as a result of this patch. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Alexander Popov <alex.popov@linux.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220427173128.2603085-4-mark.rutland@arm.com
2022-05-08stackleak: move skip_erasing() check earlierMark Rutland1-4/+9
In stackleak_erase() we check skip_erasing() after accessing some fields from current. As generating the address of current uses asm which hazards with the static branch asm, this work is always performed, even when the static branch is patched to jump to the return at the end of the function. This patch avoids this redundant work by moving the skip_erasing() check earlier. To avoid complicating initialization within stackleak_erase(), the body of the function is split out into a __stackleak_erase() helper, with the check left in a wrapper function. The __stackleak_erase() helper is marked __always_inline to ensure that this is inlined into stackleak_erase() and not instrumented. Before this patch, on x86-64 w/ GCC 11.1.0 the start of the function is: <stackleak_erase>: 65 48 8b 04 25 00 00 mov %gs:0x0,%rax 00 00 48 8b 48 20 mov 0x20(%rax),%rcx 48 8b 80 98 0a 00 00 mov 0xa98(%rax),%rax 66 90 xchg %ax,%ax <------------ static branch 48 89 c2 mov %rax,%rdx 48 29 ca sub %rcx,%rdx 48 81 fa ff 3f 00 00 cmp $0x3fff,%rdx After this patch, on x86-64 w/ GCC 11.1.0 the start of the function is: <stackleak_erase>: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) <--- static branch 65 48 8b 04 25 00 00 mov %gs:0x0,%rax 00 00 48 8b 48 20 mov 0x20(%rax),%rcx 48 8b 80 98 0a 00 00 mov 0xa98(%rax),%rax 48 89 c2 mov %rax,%rdx 48 29 ca sub %rcx,%rdx 48 81 fa ff 3f 00 00 cmp $0x3fff,%rdx Before this patch, on arm64 w/ GCC 11.1.0 the start of the function is: <stackleak_erase>: d503245f bti c d5384100 mrs x0, sp_el0 f9401003 ldr x3, [x0, #32] f9451000 ldr x0, [x0, #2592] d503201f nop <------------------------------- static branch d503233f paciasp cb030002 sub x2, x0, x3 d287ffe1 mov x1, #0x3fff eb01005f cmp x2, x1 After this patch, on arm64 w/ GCC 11.1.0 the start of the function is: <stackleak_erase>: d503245f bti c d503201f nop <------------------------------- static branch d503233f paciasp d5384100 mrs x0, sp_el0 f9401003 ldr x3, [x0, #32] d287ffe1 mov x1, #0x3fff f9451000 ldr x0, [x0, #2592] cb030002 sub x2, x0, x3 eb01005f cmp x2, x1 While this may not be a huge win on its own, moving the static branch will permit further optimization of the body of the function in subsequent patches. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Alexander Popov <alex.popov@linux.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220427173128.2603085-3-mark.rutland@arm.com
2022-05-08randstruct: Reorganize Kconfigs and attribute macrosKees Cook1-1/+1
In preparation for Clang supporting randstruct, reorganize the Kconfigs, move the attribute macros, and generalize the feature to be named CONFIG_RANDSTRUCT for on/off, CONFIG_RANDSTRUCT_FULL for the full randomization mode, and CONFIG_RANDSTRUCT_PERFORMANCE for the cache-line sized mode. Cc: linux-hardening@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220503205503.3054173-4-keescook@chromium.org
2022-05-07kdump: return -ENOENT if required cmdline option does not existZhen Lei1-2/+1
According to the current crashkernel=Y,low support in other ARCHes, it's an optional command-line option. When it doesn't exist, kernel will try to allocate minimum required memory below 4G automatically. However, __parse_crashkernel() returns '-EINVAL' for all error cases. It can't distinguish the nonexistent option from invalid option. Change __parse_crashkernel() to return '-ENOENT' for the nonexistent option case. With this change, crashkernel,low memory will take the default value if crashkernel=,low is not specified; while crashkernel reservation will fail and bail out if an invalid option is specified. Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Acked-by: Baoquan He <bhe@redhat.com> Link: https://lore.kernel.org/r/20220506114402.365-2-thunder.leizhen@huawei.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2022-05-08kheaders: Have cpio unconditionally replace filesDaniel Mentz1-1/+1
For out-of-tree builds, this script invokes cpio twice to copy header files from the srctree and subsequently from the objtree. According to a comment in the script, there might be situations in which certain files already exist in the destination directory when header files are copied from the objtree: "The second CPIO can complain if files already exist which can happen with out of tree builds having stale headers in srctree. Just silence CPIO for now." GNU cpio might simply print a warning like "newer or same age version exists", but toybox cpio exits with a non-zero exit code unless the command line option "-u" is specified. To improve compatibility with toybox cpio, add the command line option "-u" to unconditionally replace existing files in the destination directory. Signed-off-by: Daniel Mentz <danielmentz@google.com> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2022-05-07fork: Explicitly set PF_KTHREADEric W. Biederman1-0/+3
Instead of implicitly inheriting PF_KTHREAD from the parent process examine arguments in kernel_clone_args to see if PF_KTHREAD should be set. This makes knowledge of which new threads are kernel threads explicit. This also makes it so that init and the user mode helper processes no longer have PF_KTHREAD set. Link: https://lkml.kernel.org/r/20220506141512.516114-6-ebiederm@xmission.com Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-05-07fork: Generalize PF_IO_WORKER handlingEric W. Biederman1-8/+8
Add fn and fn_arg members into struct kernel_clone_args and test for them in copy_thread (instead of testing for PF_KTHREAD | PF_IO_WORKER). This allows any task that wants to be a user space task that only runs in kernel mode to use this functionality. The code on x86 is an exception and still retains a PF_KTHREAD test because x86 unlikely everything else handles kthreads slightly differently than user space tasks that start with a function. The functions that created tasks that start with a function have been updated to set ".fn" and ".fn_arg" instead of ".stack" and ".stack_size". These functions are fork_idle(), create_io_thread(), kernel_thread(), and user_mode_thread(). Link: https://lkml.kernel.org/r/20220506141512.516114-4-ebiederm@xmission.com Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-05-07fork: Explicity test for idle tasks in copy_threadEric W. Biederman1-0/+9
The architectures ia64 and parisc have special handling for the idle thread in copy_process. Add a flag named idle to kernel_clone_args and use it to explicity test if an idle process is being created. Fullfill the expectations of the rest of the copy_thread implemetations and pass a function pointer in .stack from fork_idle(). This makes what is happening in copy_thread better defined, and is useful to make idle threads less special. Link: https://lkml.kernel.org/r/20220506141512.516114-3-ebiederm@xmission.com Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-05-07fork: Pass struct kernel_clone_args into copy_threadEric W. Biederman1-2/+2
With io_uring we have started supporting tasks that are for most purposes user space tasks that exclusively run code in kernel mode. The kernel task that exec's init and tasks that exec user mode helpers are also user mode tasks that just run kernel code until they call kernel execve. Pass kernel_clone_args into copy_thread so these oddball tasks can be supported more cleanly and easily. v2: Fix spelling of kenrel_clone_args on h8300 Link: https://lkml.kernel.org/r/20220506141512.516114-2-ebiederm@xmission.com Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-05-06kthread: Don't allocate kthread_struct for init and umhEric W. Biederman2-5/+23
If kthread_is_per_cpu runs concurrently with free_kthread_struct the kthread_struct that was just freed may be read from. This bug was introduced by commit 40966e316f86 ("kthread: Ensure struct kthread is present for all kthreads"). When kthread_struct started to be allocated for all tasks that have PF_KTHREAD set. This in turn required the kthread_struct to be freed in kernel_execve and violated the assumption that kthread_struct will have the same lifetime as the task. Looking a bit deeper this only applies to callers of kernel_execve which is just the init process and the user mode helper processes. These processes really don't want to be kernel threads but are for historical reasons. Mostly that copy_thread does not know how to take a kernel mode function to the process with for processes without PF_KTHREAD or PF_IO_WORKER set. Solve this by not allocating kthread_struct for the init process and the user mode helper processes. This is done by adding a kthread member to struct kernel_clone_args. Setting kthread in fork_idle and kernel_thread. Adding user_mode_thread that works like kernel_thread except it does not set kthread. In fork only allocating the kthread_struct if .kthread is set. I have looked at kernel/kthread.c and since commit 40966e316f86 ("kthread: Ensure struct kthread is present for all kthreads") there have been no assumptions added that to_kthread or __to_kthread will not return NULL. There are a few callers of to_kthread or __to_kthread that assume a non-NULL struct kthread pointer will be returned. These functions are kthread_data(), kthread_parmme(), kthread_exit(), kthread(), kthread_park(), kthread_unpark(), kthread_stop(). All of those functions can reasonably expected to be called when it is know that a task is a kthread so that assumption seems reasonable. Cc: stable@vger.kernel.org Fixes: 40966e316f86 ("kthread: Ensure struct kthread is present for all kthreads") Reported-by: Максим Кутявин <maximkabox13@gmail.com> Link: https://lkml.kernel.org/r/20220506141512.516114-1-ebiederm@xmission.com Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-05-06printk, tracing: fix console tracepointMarco Elver1-2/+2
The original intent of the 'console' tracepoint per the commit 95100358491a ("printk/tracing: Add console output tracing") had been to "[...] record any printk messages into the trace, regardless of the current console loglevel. This can help correlate (existing) printk debugging with other tracing." Petr points out [1] that calling trace_console_rcuidle() in call_console_driver() had been the wrong thing for a while, because "printk() always used console_trylock() and the message was flushed to the console only when the trylock succeeded. And it was always deferred in NMI or when printed via printk_deferred()." With the commit 09c5ba0aa2fc ("printk: add kthread console printers"), things only got worse, and calls to call_console_driver() no longer happen with typical printk() calls but always appear deferred [2]. As such, the tracepoint can no longer serve its purpose to clearly correlate printk() calls and other tracing, as well as breaks usecases that expect every printk() call to result in a callback of the console tracepoint. Notably, the KFENCE and KCSAN test suites, which want to capture console output and assume a printk() immediately gives us a callback to the console tracepoint. Fix the console tracepoint by moving it into printk_sprint() [3]. One notable difference is that by moving tracing into printk_sprint(), the 'text' will no longer include the "header" (loglevel and timestamp), but only the raw message. Arguably this is less of a problem now that the console tracepoint happens on the printk() call and isn't delayed. Link: https://lore.kernel.org/all/Ym+WqKStCg%2FEHfh3@alley/ [1] Link: https://lore.kernel.org/all/CA+G9fYu2kS0wR4WqMRsj2rePKV9XLgOU1PiXnMvpT+Z=c2ucHA@mail.gmail.com/ [2] Link: https://lore.kernel.org/all/87fslup9dx.fsf@jogness.linutronix.de/ [3] Reported-by: Linux Kernel Functional Testing <lkft@linaro.org> Signed-off-by: Marco Elver <elver@google.com> Cc: John Ogness <john.ogness@linutronix.de> Cc: Petr Mladek <pmladek@suse.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Acked-by: John Ogness <john.ogness@linutronix.de> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20220503073844.4148944-1-elver@google.com
2022-05-06Merge tag 'v5.18-rc5' into sched/core to pull in fixes & to resolve a conflictIngo Molnar22-759/+667
- sched/core is on a pretty old -rc1 base - refresh it to include recent fixes. - this also allows up to resolve a (trivial) .mailmap conflict Conflicts: .mailmap Signed-off-by: Ingo Molnar <mingo@kernel.org>
2022-05-05cgroup/cpuset: Remove cpus_allowed/mems_allowed setup in cpuset_init_smp()Waiman Long1-2/+5
There are 3 places where the cpu and node masks of the top cpuset can be initialized in the order they are executed: 1) start_kernel -> cpuset_init() 2) start_kernel -> cgroup_init() -> cpuset_bind() 3) kernel_init_freeable() -> do_basic_setup() -> cpuset_init_smp() The first cpuset_init() call just sets all the bits in the masks. The second cpuset_bind() call sets cpus_allowed and mems_allowed to the default v2 values. The third cpuset_init_smp() call sets them back to v1 values. For systems with cgroup v2 setup, cpuset_bind() is called once. As a result, cpu and memory node hot add may fail to update the cpu and node masks of the top cpuset to include the newly added cpu or node in a cgroup v2 environment. For systems with cgroup v1 setup, cpuset_bind() is called again by rebind_subsystem() when the v1 cpuset filesystem is mounted as shown in the dmesg log below with an instrumented kernel. [ 2.609781] cpuset_bind() called - v2 = 1 [ 3.079473] cpuset_init_smp() called [ 7.103710] cpuset_bind() called - v2 = 0 smp_init() is called after the first two init functions. So we don't have a complete list of active cpus and memory nodes until later in cpuset_init_smp() which is the right time to set up effective_cpus and effective_mems. To fix this cgroup v2 mask setup problem, the potentially incorrect cpus_allowed & mems_allowed setting in cpuset_init_smp() are removed. For cgroup v2 systems, the initial cpuset_bind() call will set the masks correctly. For cgroup v1 systems, the second call to cpuset_bind() will do the right setup. cc: stable@vger.kernel.org Signed-off-by: Waiman Long <longman@redhat.com> Tested-by: Feng Tang <feng.tang@intel.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-05-05genirq: Synchronize interrupt thread startupThomas Pfaff3-10/+33
A kernel hang can be observed when running setserial in a loop on a kernel with force threaded interrupts. The sequence of events is: setserial open("/dev/ttyXXX") request_irq() do_stuff() -> serial interrupt -> wake(irq_thread) desc->threads_active++; close() free_irq() kthread_stop(irq_thread) synchronize_irq() <- hangs because desc->threads_active != 0 The thread is created in request_irq() and woken up, but does not get on a CPU to reach the actual thread function, which would handle the pending wake-up. kthread_stop() sets the should stop condition which makes the thread immediately exit, which in turn leaves the stale threads_active count around. This problem was introduced with commit 519cc8652b3a, which addressed a interrupt sharing issue in the PCIe code. Before that commit free_irq() invoked synchronize_irq(), which waits for the hard interrupt handler and also for associated threads to complete. To address the PCIe issue synchronize_irq() was replaced with __synchronize_hardirq(), which only waits for the hard interrupt handler to complete, but not for threaded handlers. This was done under the assumption, that the interrupt thread already reached the thread function and waits for a wake-up, which is guaranteed to be handled before acting on the stop condition. The problematic case, that the thread would not reach the thread function, was obviously overlooked. Make sure that the interrupt thread is really started and reaches thread_fn() before returning from __setup_irq(). This utilizes the existing wait queue in the interrupt descriptor. The wait queue is unused for non-shared interrupts. For shared interrupts the usage might cause a spurious wake-up of a waiter in synchronize_irq() or the completion of a threaded handler might cause a spurious wake-up of the waiter for the ready flag. Both are harmless and have no functional impact. [ tglx: Amended changelog ] Fixes: 519cc8652b3a ("genirq: Synchronize only with single thread on free_irq()") Signed-off-by: Thomas Pfaff <tpfaff@pcs.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Marc Zyngier <maz@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/552fe7b4-9224-b183-bb87-a8f36d335690@pcs.com
2022-05-04Merge remote-tracking branch 'arm64/for-next/sme' into kvmarm-master/nextMarc Zyngier1-0/+12
Merge arm64's SME branch to resolve conflicts with the WFxT branch. Signed-off-by: Marc Zyngier <maz@kernel.org>
2022-05-03seccomp: Add wait_killable semantic to seccomp user notifierSargun Dhillon1-2/+40
This introduces a per-filter flag (SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV) that makes it so that when notifications are received by the supervisor the notifying process will transition to wait killable semantics. Although wait killable isn't a set of semantics formally exposed to userspace, the concept is searchable. If the notifying process is signaled prior to the notification being received by the userspace agent, it will be handled as normal. One quirk about how this is handled is that the notifying process only switches to TASK_KILLABLE if it receives a wakeup from either an addfd or a signal. This is to avoid an unnecessary wakeup of the notifying task. The reasons behind switching into wait_killable only after userspace receives the notification are: * Avoiding unncessary work - Often, workloads will perform work that they may abort (request racing comes to mind). This allows for syscalls to be aborted safely prior to the notification being received by the supervisor. In this, the supervisor doesn't end up doing work that the workload does not want to complete anyways. * Avoiding side effects - We don't want the syscall to be interruptible once the supervisor starts doing work because it may not be trivial to reverse the operation. For example, unmounting a file system may take a long time, and it's hard to rollback, or treat that as reentrant. * Avoid breaking runtimes - Various runtimes do not GC when they are during a syscall (or while running native code that subsequently calls a syscall). If many notifications are blocked, and not picked up by the supervisor, this can get the application into a bad state. Signed-off-by: Sargun Dhillon <sargun@sargun.me> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220503080958.20220-2-sargun@sargun.me
2022-05-03Merge branches 'docs.2022.04.20a', 'fixes.2022.04.20a', 'nocb.2022.04.11b', 'rcu-tasks.2022.04.11b', 'srcu.2022.05.03a', 'torture.2022.04.11b', 'torture-tasks.2022.04.20a' and 'torturescript.2022.04.20a' into HEADPaul E. McKenney20-339/+846
docs.2022.04.20a: Documentation updates. fixes.2022.04.20a: Miscellaneous fixes. nocb.2022.04.11b: Callback-offloading updates. rcu-tasks.2022.04.11b: RCU-tasks updates. srcu.2022.05.03a: Put SRCU on a memory diet. torture.2022.04.11b: Torture-test updates. torture-tasks.2022.04.20a: Avoid torture testing changing RCU configuration. torturescript.2022.04.20a: Torture-test scripting updates.
2022-05-03srcu: Drop needless initialization of sdp in srcu_gp_start()Lukas Bulwahn1-1/+1
Commit 9c7ef4c30f12 ("srcu: Make Tree SRCU able to operate without snp_node array") initializes the local variable sdp differently depending on the srcu's state in srcu_gp_start(). Either way, this initialization overwrites the value used when sdp is defined. This commit therefore drops this pointless definition-time initialization. Although there is no functional change, compiler code generation may be affected. Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-05-03srcu: Prevent expedited GPs and blocking readers from consuming CPUPaul E. McKenney1-6/+38
If an SRCU reader blocks while a synchronize_srcu_expedited() waits for that same reader, then that grace period will spawn an endless series of workqueue handlers, consuming a full CPU. This quickly gets pointless because consuming more CPU isn't going to make that reader get done faster, especially if it is blocked waiting for an external event. This commit therefore spawns at most one pair of back-to-back workqueue handlers per expedited grace period phase, instead inserting increasing delays as that grace period phase grows older, but capped at 10 jiffies. In any case, if there have been at least 100 back-to-back workqueue handlers within a single jiffy, regardless of grace period or grace-period phase, then a one-jiffy delay is inserted. [ paulmck: Apply feedback from kernel test robot. ] Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com> Reported-by: Song Liu <song@kernel.org> Tested-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-05-03srcu: Add contention check to call_srcu() srcu_data ->lock acquisitionPaul E. McKenney1-9/+36
This commit increases the sensitivity of contention detection by adding checks to the acquisition of the srcu_data structure's lock on the call_srcu() code path. Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-05-03srcu: Automatically determine size-transition strategy at bootPaul E. McKenney1-3/+20
This commit adds a srcutree.convert_to_big option of zero that causes SRCU to decide at boot whether to wait for contention (small systems) or immediately expand to large (large systems). A new srcutree.big_cpu_lim (defaulting to 128) defines how many CPUs constitute a large system. Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-05-03Backmerge tag 'v5.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux into drm-nextDave Airlie13-44/+52
Linux 5.18-rc5 There was a build fix for arm I wanted in drm-next, so backmerge rather then cherry-pick. Signed-off-by: Dave Airlie <airlied@redhat.com>
2022-05-02kthread: unexport kthread_blkcgChristoph Hellwig1-1/+0
kthread_blkcg is only used by the built-in blk-cgroup code. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220420042723.1010598-16-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-02blk-cgroup: replace bio_blkcg with bio_blkcg_cssChristoph Hellwig1-2/+4
All callers of bio_blkcg actually want the CSS, so replace it with an interface that does return the CSS. This now allows to move struct blkcg_gq to block/blk-cgroup.h instead of exposing it in a public header. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220420042723.1010598-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-02blktrace: cleanup the __trace_note_message interfaceChristoph Hellwig1-10/+10
Pass the cgroup_subsys_state instead of a the blkg so that blktrace doesn't need to poke into blk-cgroup internals, and give the name a blk prefix as the current name is way too generic for a public interface. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220420042723.1010598-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-02time/sched_clock: Fix formatting of frequency reporting codeMaciej W. Rozycki1-6/+4
Use flat rather than nested indentation for chained else/if clauses as per coding-style.rst: if (x == y) { .. } else if (x > y) { ... } else { .... } This also improves readability. Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/r/alpine.DEB.2.21.2204240148220.9383@angie.orcam.me.uk
2022-05-02time/sched_clock: Use Hz as the unit for clock rate reporting below 4kHzMaciej W. Rozycki1-1/+1
The kernel uses kHz as the unit for clock rates reported between 1MHz (inclusive) and 4MHz (exclusive), e.g.: sched_clock: 64 bits at 1000kHz, resolution 1000ns, wraps every 2199023255500ns This reduces the amount of data lost due to rounding, but hasn't been replicated for the kHz range when support was added for proper reporting of sub-kHz clock rates. Take the same approach for rates between 1kHz (inclusive) and 4kHz (exclusive), which makes it consistent. Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/alpine.DEB.2.21.2204240106380.9383@angie.orcam.me.uk
2022-05-02time/sched_clock: Round the frequency reported to nearest rather than downMaciej W. Rozycki1-2/+3
The frequency reported for clock sources are rounded down, which gives misleading figures, e.g.: I/O ASIC clock frequency 24999480Hz sched_clock: 32 bits at 24MHz, resolution 40ns, wraps every 85901132779ns MIPS counter frequency 59998512Hz sched_clock: 32 bits at 59MHz, resolution 16ns, wraps every 35792281591ns Rounding to nearest is more adequate: I/O ASIC clock frequency 24999664Hz sched_clock: 32 bits at 25MHz, resolution 40ns, wraps every 85900499947ns MIPS counter frequency 59999728Hz sched_clock: 32 bits at 60MHz, resolution 16ns, wraps every 35791556599ns Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/r/alpine.DEB.2.21.2204240055590.9383@angie.orcam.me.uk
2022-05-02genirq: Use pm_runtime_resume_and_get() instead of pm_runtime_get_sync()Minghao Chi1-9/+4
pm_runtime_resume_and_get() achieves the same and simplifies the code. [ tglx: Simplify it further by presetting retval ] Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: Minghao Chi <chi.minghao@zte.com.cn> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20220418110716.2559453-1-chi.minghao@zte.com.cn
2022-05-02timekeeping: Consolidate fast timekeeperThomas Gleixner1-10/+10
Provide a inline function which replaces the copy & pasta. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220415091921.072296632@linutronix.de
2022-05-02timekeeping: Annotate ktime_get_boot_fast_ns() with data_race()Thomas Gleixner1-1/+1
Accessing timekeeper::offset_boot in ktime_get_boot_fast_ns() is an intended data race as the reader side cannot synchronize with a writer and there is no space in struct tk_read_base of the NMI safe timekeeper. Mark it so. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20220415091920.956045162@linutronix.de
2022-05-01mm: Fix PASID use-after-free issueFenghua Yu1-1/+1
The PASID is being freed too early. It needs to stay around until after device drivers that might be using it have had a chance to clear it out of the hardware. The relevant refcounts are: mmget() /mmput() refcount the mm's address space mmgrab()/mmdrop() refcount the mm itself The PASID is currently tied to the life of the mm's address space and freed in __mmput(). This makes logical sense because the PASID can't be used once the address space is gone. But, this misses an important point: even after the address space is gone, the PASID will still be programmed into a device. Device drivers might, for instance, still need to flush operations that are outstanding and need to use that PASID. They do this at file->release() time. Device drivers call the IOMMU driver to hold a reference on the mm itself and drop it at file->release() time. But, the IOMMU driver holds a reference on the mm itself, not the address space. The address space (and the PASID) is long gone by the time the driver tries to clean up. This is effectively a use-after-free bug on the PASID. To fix this, move the PASID free operation from __mmput() to __mmdrop(). This ensures that the IOMMU driver's existing mmgrab() keeps the PASID allocated until it drops its mm reference. Fixes: 701fac40384f ("iommu/sva: Assign a PASID to mm on PASID allocation and free it on mm exit") Reported-by: Zhangfei Gao <zhangfei.gao@foxmail.com> Suggested-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Suggested-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Zhangfei Gao <zhangfei.gao@foxmail.com> Reviewed-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Link: https://lore.kernel.org/r/20220428180041.806809-1-fenghua.yu@intel.com
2022-05-01smp: Make softirq handling RT safe in flush_smp_call_function_queue()Sebastian Andrzej Siewior2-1/+17
flush_smp_call_function_queue() invokes do_softirq() which is not available on PREEMPT_RT. flush_smp_call_function_queue() is invoked from the idle task and the migration task with preemption or interrupts disabled. So RT kernels cannot process soft interrupts in that context as that has to acquire 'sleeping spinlocks' which is not possible with preemption or interrupts disabled and forbidden from the idle task anyway. The currently known SMP function call which raises a soft interrupt is in the block layer, but this functionality is not enabled on RT kernels due to latency and performance reasons. RT could wake up ksoftirqd unconditionally, but this wants to be avoided if there were soft interrupts pending already when this is invoked in the context of the migration task. The migration task might have preempted a threaded interrupt handler which raised a soft interrupt, but did not reach the local_bh_enable() to process it. The "running" ksoftirqd might prevent the handling in the interrupt thread context which is causing latency issues. Add a new function which handles this case explicitely for RT and falls back to do_softirq() on !RT kernels. In the RT case this warns when one of the flushed SMP function calls raised a soft interrupt so this can be investigated. [ tglx: Moved the RT part out of SMP code ] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/YgKgL6aPj8aBES6G@linutronix.de Link: https://lore.kernel.org/r/20220413133024.356509586@linutronix.de
2022-05-01smp: Rename flush_smp_call_function_from_idle()Thomas Gleixner4-11/+24
This is invoked from the stopper thread too, which is definitely not idle. Rename it to flush_smp_call_function_queue() and fixup the callers. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220413133024.305001096@linutronix.de
2022-05-01sched: Fix missing prototype warningsThomas Gleixner8-10/+15
A W=1 build emits more than a dozen missing prototype warnings related to scheduler and scheduler specific includes. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220413133024.249118058@linutronix.de