aboutsummaryrefslogtreecommitdiffstats
path: root/include/linux/sched (follow)
AgeCommit message (Collapse)AuthorFilesLines
2021-09-08Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-1/+2
Merge more updates from Andrew Morton: "147 patches, based on 7d2a07b769330c34b4deabeed939325c77a7ec2f. Subsystems affected by this patch series: mm (memory-hotplug, rmap, ioremap, highmem, cleanups, secretmem, kfence, damon, and vmscan), alpha, percpu, procfs, misc, core-kernel, MAINTAINERS, lib, checkpatch, epoll, init, nilfs2, coredump, fork, pids, criu, kconfig, selftests, ipc, and scripts" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (94 commits) scripts: check_extable: fix typo in user error message mm/workingset: correct kernel-doc notations ipc: replace costly bailout check in sysvipc_find_ipc() selftests/memfd: remove unused variable Kconfig.debug: drop selecting non-existing HARDLOCKUP_DETECTOR_ARCH configs: remove the obsolete CONFIG_INPUT_POLLDEV prctl: allow to setup brk for et_dyn executables pid: cleanup the stale comment mentioning pidmap_init(). kernel/fork.c: unexport get_{mm,task}_exe_file coredump: fix memleak in dump_vma_snapshot() fs/coredump.c: log if a core dump is aborted due to changed file permissions nilfs2: use refcount_dec_and_lock() to fix potential UAF nilfs2: fix memory leak in nilfs_sysfs_delete_snapshot_group nilfs2: fix memory leak in nilfs_sysfs_create_snapshot_group nilfs2: fix memory leak in nilfs_sysfs_delete_##name##_group nilfs2: fix memory leak in nilfs_sysfs_create_##name##_group nilfs2: fix NULL pointer in nilfs_##name##_attr_release nilfs2: fix memory leak in nilfs_sysfs_create_device_group trap: cleanup trap_init() init: move usermodehelper_enable() to populate_rootfs() ...
2021-09-08fs/epoll: use a per-cpu counter for user's watches countNicholas Piggin1-1/+2
This counter tracks the number of watches a user has, to compare against the 'max_user_watches' limit. This causes a scalability bottleneck on SPECjbb2015 on large systems as there is only one user. Changing to a per-cpu counter increases throughput of the benchmark by about 30% on a 16-socket, > 1000 thread system. [rdunlap@infradead.org: fix build errors in kernel/user.c when CONFIG_EPOLL=n] [npiggin@gmail.com: move ifdefs into wrapper functions, slightly improve panic message] Link: https://lkml.kernel.org/r/1628051945.fens3r99ox.astroid@bobo.none [akpm@linux-foundation.org: tweak user_epoll_alloc(), per Guenter] Link: https://lkml.kernel.org/r/20210804191421.GA1900577@roeck-us.net Link: https://lkml.kernel.org/r/20210802032013.2751916-1-npiggin@gmail.com Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reported-by: Anton Blanchard <anton@ozlabs.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-5/+5
Merge misc updates from Andrew Morton: "173 patches. Subsystems affected by this series: ia64, ocfs2, block, and mm (debug, pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap, bootmem, sparsemem, vmalloc, kasan, pagealloc, memory-failure, hugetlb, userfaultfd, vmscan, compaction, mempolicy, memblock, oom-kill, migration, ksm, percpu, vmstat, and madvise)" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (173 commits) mm/madvise: add MADV_WILLNEED to process_madvise() mm/vmstat: remove unneeded return value mm/vmstat: simplify the array size calculation mm/vmstat: correct some wrong comments mm/percpu,c: remove obsolete comments of pcpu_chunk_populated() selftests: vm: add COW time test for KSM pages selftests: vm: add KSM merging time test mm: KSM: fix data type selftests: vm: add KSM merging across nodes test selftests: vm: add KSM zero page merging test selftests: vm: add KSM unmerge test selftests: vm: add KSM merge test mm/migrate: correct kernel-doc notation mm: wire up syscall process_mrelease mm: introduce process_mrelease system call memblock: make memblock_find_in_range method private mm/mempolicy.c: use in_task() in mempolicy_slab_node() mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies mm/mempolicy: advertise new MPOL_PREFERRED_MANY mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY ...
2021-09-03memcg: replace in_interrupt() by !in_task() in active_memcg()Vasily Averin1-1/+1
set_active_memcg() uses in_interrupt() check to select proper storage for cgroup: pointer on task struct or per-cpu pointer. It isn't fully correct: obsoleted in_interrupt() includes tasks with disabled BH. It's better to use '!in_task()' instead. Link: https://lkml.org/lkml/2021/7/26/487 Link: https://lkml.kernel.org/r/ed4448b0-4970-616f-7368-ef9dd3cb628d@virtuozzo.com Fixes: 37d5985c003d ("mm: kmem: prepare remote memcg charging infra for interrupt contexts") Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm: report a more useful address for reclaim acquisitionMatthew Wilcox (Oracle)1-4/+4
A recent lockdep report included these lines: [ 96.177910] 3 locks held by containerd/770: [ 96.177934] #0: ffff88810815ea28 (&mm->mmap_lock#2){++++}-{3:3}, at: do_user_addr_fault+0x115/0x770 [ 96.177999] #1: ffffffff82915020 (rcu_read_lock){....}-{1:2}, at: get_swap_device+0x33/0x140 [ 96.178057] #2: ffffffff82955ba0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30 While it was not useful to that bug report to know where the reclaim lock had been acquired, it might be useful under other circumstances. Allow the caller of __fs_reclaim_acquire to specify the instruction pointer to use. Link: https://lkml.kernel.org/r/20210719185709.1755149-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Omar Sandoval <osandov@fb.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-01Merge branch 'exit-cleanups-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespaceLinus Torvalds1-0/+1
Pull exit cleanups from Eric Biederman: "In preparation of doing something about PTRACE_EVENT_EXIT I have started cleaning up various pieces of code related to do_exit. Most of that code I did not manage to get tested and reviewed before the merge window opened but a handful of very useful cleanups are ready to be merged. The first change is simply the removal of the bdflush system call. The code has now been disabled long enough that even the oldest userspace working userspace setups anyone can find to test are fine with the bdflush system call being removed. Changing m68k fsp040_die to use force_sigsegv(SIGSEGV) instead of calling do_exit directly is interesting only in that it is nearly the most difficult of the incorrect uses of do_exit to remove. The change to the seccomp code to simply send a signal instead of calling do_coredump directly is a very nice little cleanup made possible by realizing the existing signal sending helpers were missing a little bit of functionality that is easy to provide" * 'exit-cleanups-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: signal/seccomp: Dump core when there is only one live thread signal/seccomp: Refactor seccomp signal and coredump generation signal/m68k: Use force_sigsegv(SIGSEGV) in fpsp040_die exit/bdflush: Remove the deprecated bdflush system call
2021-09-01Merge branch 'siginfo-si_trapno-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespaceLinus Torvalds1-8/+3
Pull siginfo si_trapno updates from Eric Biederman: "The full set of si_trapno changes was not appropriate as a fix for the newly added SIGTRAP TRAP_PERF, and so I postponed the rest of the related cleanups. This is the rest of the cleanups for si_trapno that reduces it from being a really weird arch special case that is expect to be always present (but isn't) on the architectures that support it to being yet another field in the _sigfault union of struct siginfo. The changes have been reviewed and marinated in linux-next. With the removal of this awkward special case new code (like SIGTRAP TRAP_PERF) that works across architectures should be easier to write and maintain" * 'siginfo-si_trapno-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: signal: Rename SIL_PERF_EVENT SIL_FAULT_PERF_EVENT for consistency signal: Verify the alignment and size of siginfo_t signal: Remove the generic __ARCH_SI_TRAPNO support signal/alpha: si_trapno is only used with SIGFPE and SIGTRAP TRAP_UNK signal/sparc: si_trapno is only used with SIGILL ILL_ILLTRP arm64: Add compile-time asserts for siginfo_t offsets arm: Add compile-time asserts for siginfo_t offsets sparc64: Add compile-time asserts for siginfo_t offsets
2021-08-30Merge tag 'timers-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-0/+6
Pull timer updates from Thomas Gleixner: "Updates for timekeeping, timers and related drivers: Core code: - Cure a couple of correctness issues in the posix CPU timer code to prevent that the tick dependency for NOHZ full is kept alive for no reason. - Avoid expensive double reprogramming of the clockevent device in hrtimer_start_range_ns(). - Avoid pointless SMP function calls when the clock was set to avoid disturbing CPUs which do not have any affected timers queued. - Make the clocksource watchdog test work correctly when CONFIG_HZ is less than 100. Drivers: - Prefer the ARM architected timer over the Exynos timer which is way more expensive to access. - Add device tree bindings for new Ingenic SoCs - The usual improvements and cleanups all over the place" * tag 'timers-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (29 commits) clocksource: Make clocksource watchdog test safe for slow-HZ systems dt-bindings: timer: Add ABIs for new Ingenic SoCs clocksource/drivers/fttmr010: Pass around less pointers clocksource/drivers/mediatek: Optimize systimer irq clear flow on shutdown clocksource/drivers/ingenic: Use bitfield macro helpers clocksource/drivers/sh_cmt: Fix wrong setting if don't request IRQ for clock source channel dt-bindings: timer: convert rockchip,rk-timer.txt to YAML clocksource/drivers/exynos_mct: Mark MCT device as CLOCK_EVT_FEAT_PERCPU clocksource/drivers/exynos_mct: Prioritise Arm arch timer on arm64 hrtimer: Unbreak hrtimer_force_reprogram() hrtimer: Use raw_cpu_ptr() in clock_was_set() hrtimer: Avoid more SMP function calls in clock_was_set() hrtimer: Avoid unnecessary SMP function calls in clock_was_set() hrtimer: Add bases argument to clock_was_set() time/timekeeping: Avoid invoking clock_was_set() twice timekeeping: Distangle resume and clock-was-set events timerfd: Provide timerfd_resume() hrtimer: Force clock_was_set() handling for the HIGHRES=n, NOHZ=y case hrtimer: Ensure timerfd notification for HIGHRES=n hrtimer: Consolidate reprogramming code ...
2021-08-30Merge tag 'locking-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-2/+5
Pull locking and atomics updates from Thomas Gleixner: "The regular pile: - A few improvements to the mutex code - Documentation updates for atomics to clarify the difference between cmpxchg() and try_cmpxchg() and to explain the forward progress expectations. - Simplification of the atomics fallback generator - The addition of arch_atomic_long*() variants and generic arch_*() bitops based on them. - Add the missing might_sleep() invocations to the down*() operations of semaphores. The PREEMPT_RT locking core: - Scheduler updates to support the state preserving mechanism for 'sleeping' spin- and rwlocks on RT. This mechanism is carefully preserving the state of the task when blocking on a 'sleeping' spin- or rwlock and takes regular wake-ups targeted at the same task into account. The preserved or updated (via a regular wakeup) state is restored when the lock has been acquired. - Restructuring of the rtmutex code so it can be utilized and extended for the RT specific lock variants. - Restructuring of the ww_mutex code to allow sharing of the ww_mutex specific functionality for rtmutex based ww_mutexes. - Header file disentangling to allow substitution of the regular lock implementations with the PREEMPT_RT variants without creating an unmaintainable #ifdef mess. - Shared base code for the PREEMPT_RT specific rw_semaphore and rwlock implementations. Contrary to the regular rw_semaphores and rwlocks the PREEMPT_RT implementation is writer unfair because it is infeasible to do priority inheritance on multiple readers. Experience over the years has shown that real-time workloads are not the typical workloads which are sensitive to writer starvation. The alternative solution would be to allow only a single reader which has been tried and discarded as it is a major bottleneck especially for mmap_sem. Aside of that many of the writer starvation critical usage sites have been converted to a writer side mutex/spinlock and RCU read side protections in the past decade so that the issue is less prominent than it used to be. - The actual rtmutex based lock substitutions for PREEMPT_RT enabled kernels which affect mutex, ww_mutex, rw_semaphore, spinlock_t and rwlock_t. The spin/rw_lock*() functions disable migration across the critical section to preserve the existing semantics vs per-CPU variables. - Rework of the futex REQUEUE_PI mechanism to handle the case of early wake-ups which interleave with a re-queue operation to prevent the situation that a task would be blocked on both the rtmutex associated to the outer futex and the rtmutex based hash bucket spinlock. While this situation cannot happen on !RT enabled kernels the changes make the underlying concurrency problems easier to understand in general. As a result the difference between !RT and RT kernels is reduced to the handling of waiting for the critical section. !RT kernels simply spin-wait as before and RT kernels utilize rcu_wait(). - The substitution of local_lock for PREEMPT_RT with a spinlock which protects the critical section while staying preemptible. The CPU locality is established by disabling migration. The underlying concepts of this code have been in use in PREEMPT_RT for way more than a decade. The code has been refactored several times over the years and this final incarnation has been optimized once again to be as non-intrusive as possible, i.e. the RT specific parts are mostly isolated. It has been extensively tested in the 5.14-rt patch series and it has been verified that !RT kernels are not affected by these changes" * tag 'locking-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (92 commits) locking/rtmutex: Return success on deadlock for ww_mutex waiters locking/rtmutex: Prevent spurious EDEADLK return caused by ww_mutexes locking/rtmutex: Dequeue waiter on ww_mutex deadlock locking/rtmutex: Dont dereference waiter lockless locking/semaphore: Add might_sleep() to down_*() family locking/ww_mutex: Initialize waiter.ww_ctx properly static_call: Update API documentation locking/local_lock: Add PREEMPT_RT support locking/spinlock/rt: Prepare for RT local_lock locking/rtmutex: Add adaptive spinwait mechanism locking/rtmutex: Implement equal priority lock stealing preempt: Adjust PREEMPT_LOCK_OFFSET for RT locking/rtmutex: Prevent lockdep false positive with PI futexes futex: Prevent requeue_pi() lock nesting issue on RT futex: Simplify handle_early_requeue_pi_wakeup() futex: Reorder sanity checks in futex_requeue() futex: Clarify comment in futex_requeue() futex: Restructure futex_requeue() futex: Correct the number of requeued waiters for PI futex: Remove bogus condition for requeue PI ...
2021-08-30Merge tag 'sched-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-18/+0
Pull scheduler updates from Ingo Molnar: - The biggest change in this cycle is scheduler support for asymmetric scheduling affinity, to support the execution of legacy 32-bit tasks on AArch32 systems that also have 64-bit-only CPUs. Architectures can fill in this functionality by defining their own task_cpu_possible_mask(p). When this is done, the scheduler will make sure the task will only be scheduled on CPUs that support it. (The actual arm64 specific changes are not part of this tree.) For other architectures there will be no change in functionality. - Add cgroup SCHED_IDLE support - Increase node-distance flexibility & delay determining it until a CPU is brought online. (This enables platforms where node distance isn't final until the CPU is only.) - Deadline scheduler enhancements & fixes - Misc fixes & cleanups. * tag 'sched-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits) eventfd: Make signal recursion protection a task bit sched/fair: Mark tg_is_idle() an inline in the !CONFIG_FAIR_GROUP_SCHED case sched: Introduce dl_task_check_affinity() to check proposed affinity sched: Allow task CPU affinity to be restricted on asymmetric systems sched: Split the guts of sched_setaffinity() into a helper function sched: Introduce task_struct::user_cpus_ptr to track requested affinity sched: Reject CPU affinity changes based on task_cpu_possible_mask() cpuset: Cleanup cpuset_cpus_allowed_fallback() use in select_fallback_rq() cpuset: Honour task_cpu_possible_mask() in guarantee_online_cpus() cpuset: Don't use the cpu_possible_mask as a last resort for cgroup v1 sched: Introduce task_cpu_possible_mask() to limit fallback rq selection sched: Cgroup SCHED_IDLE support sched/topology: Skip updating masks for non-online nodes sched: Replace deprecated CPU-hotplug functions. sched: Skip priority checks with SCHED_FLAG_KEEP_PARAMS sched: Fix UCLAMP_FLAG_IDLE setting sched/deadline: Fix missing clock update in migrate_task_rq_dl() sched/fair: Avoid a second scan of target in select_idle_cpu sched/fair: Use prev instead of new target as recent_used_cpu sched: Don't report SCHED_FLAG_SUGOV in sched_getattr() ...
2021-08-26signal/seccomp: Refactor seccomp signal and coredump generationEric W. Biederman1-0/+1
Factor out force_sig_seccomp from the seccomp signal generation and place it in kernel/signal.c. The function force_sig_seccomp takes a parameter force_coredump to indicate that the sigaction field should be reset to SIGDFL so that a coredump will be generated when the signal is delivered. force_sig_seccomp is then used to replace both seccomp_send_sigsys and seccomp_init_siginfo. force_sig_info_to_task gains an extra parameter to force using the default signal action. With this change seccomp is no longer a special case and there becomes exactly one place do_coredump is called from. Further it no longer becomes necessary for __seccomp_filter to call do_group_exit. Acked-by: Kees Cook <keescook@chromium.org> Link: https://lkml.kernel.org/r/87r1gr6qc4.fsf_-_@disp2133 Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2021-08-17sched/wake_q: Provide WAKE_Q_HEAD_INITIALIZER()Thomas Gleixner1-2/+5
The RT specific spin/rwlock implementation requires special handling of the to be woken waiters. Provide a WAKE_Q_HEAD_INITIALIZER(), which can be used by the rtmutex code to implement an RT aware wake_q derivative. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20210815211302.429918071@linutronix.de
2021-08-10posix-cpu-timers: Assert task sighand is locked while starting cputime counterFrederic Weisbecker1-0/+6
Starting the process wide cputime counter needs to be done in the same sighand locking sequence than actually arming the related timer otherwise this races against concurrent timers setting/expiring in the same threadgroup. Detecting that the cputime counter is started without holding the sighand lock is a first step toward debugging such situations. Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210726125513.271824-2-frederic@kernel.org
2021-07-23signal: Remove the generic __ARCH_SI_TRAPNO supportEric W. Biederman1-8/+0
Now that __ARCH_SI_TRAPNO is no longer set by any architecture remove all of the code it enabled from the kernel. On alpha and sparc a more explict approach of using send_sig_fault_trapno or force_sig_fault_trapno in the very limited circumstances where si_trapno was set to a non-zero value. The generic support that is being removed always set si_trapno on all fault signals. With only SIGILL ILL_ILLTRAP on sparc and SIGFPE and SIGTRAP TRAP_UNK on alpla providing si_trapno values asking all senders of fault signals to provide an si_trapno value does not make sense. Making si_trapno an ordinary extension of the fault siginfo layout has enabled the architecture generic implementation of SIGTRAP TRAP_PERF, and enables other faulting signals to grow architecture generic senders as well. v1: https://lkml.kernel.org/r/m18s4zs7nu.fsf_-_@fess.ebiederm.org v2: https://lkml.kernel.org/r/20210505141101.11519-8-ebiederm@xmission.com Link: https://lkml.kernel.org/r/87bl73xx6x.fsf_-_@disp2133 Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2021-07-23signal/alpha: si_trapno is only used with SIGFPE and SIGTRAP TRAP_UNKEric W. Biederman1-0/+2
While reviewing the signal handlers on alpha it became clear that si_trapno is only set to a non-zero value when sending SIGFPE and when sending SITGRAP with si_code TRAP_UNK. Add send_sig_fault_trapno and send SIGTRAP TRAP_UNK, and SIGFPE with it. Remove the define of __ARCH_SI_TRAPNO and remove the always zero si_trapno parameter from send_sig_fault and force_sig_fault. v1: https://lkml.kernel.org/r/m1eeers7q7.fsf_-_@fess.ebiederm.org v2: https://lkml.kernel.org/r/20210505141101.11519-7-ebiederm@xmission.com Link: https://lkml.kernel.org/r/87h7gvxx7l.fsf_-_@disp2133 Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2021-07-23signal/sparc: si_trapno is only used with SIGILL ILL_ILLTRPEric W. Biederman1-0/+1
While reviewing the signal handlers on sparc it became clear that si_trapno is only set to a non-zero value when sending SIGILL with si_code ILL_ILLTRP. Add force_sig_fault_trapno and send SIGILL ILL_ILLTRP with it. Remove the define of __ARCH_SI_TRAPNO and remove the always zero si_trapno parameter from send_sig_fault and force_sig_fault. v1: https://lkml.kernel.org/r/m1eeers7q7.fsf_-_@fess.ebiederm.org v2: https://lkml.kernel.org/r/20210505141101.11519-7-ebiederm@xmission.com Link: https://lkml.kernel.org/r/87mtqnxx89.fsf_-_@disp2133 Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2021-07-07Merge tag 'x86-fpu-2021-07-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-7/+12
Pull x86 fpu updates from Thomas Gleixner: "Fixes and improvements for FPU handling on x86: - Prevent sigaltstack out of bounds writes. The kernel unconditionally writes the FPU state to the alternate stack without checking whether the stack is large enough to accomodate it. Check the alternate stack size before doing so and in case it's too small force a SIGSEGV instead of silently corrupting user space data. - MINSIGSTKZ and SIGSTKSZ are constants in signal.h and have never been updated despite the fact that the FPU state which is stored on the signal stack has grown over time which causes trouble in the field when AVX512 is available on a CPU. The kernel does not expose the minimum requirements for the alternate stack size depending on the available and enabled CPU features. ARM already added an aux vector AT_MINSIGSTKSZ for the same reason. Add it to x86 as well. - A major cleanup of the x86 FPU code. The recent discoveries of XSTATE related issues unearthed quite some inconsistencies, duplicated code and other issues. The fine granular overhaul addresses this, makes the code more robust and maintainable, which allows to integrate upcoming XSTATE related features in sane ways" * tag 'x86-fpu-2021-07-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (74 commits) x86/fpu/xstate: Clear xstate header in copy_xstate_to_uabi_buf() again x86/fpu/signal: Let xrstor handle the features to init x86/fpu/signal: Handle #PF in the direct restore path x86/fpu: Return proper error codes from user access functions x86/fpu/signal: Split out the direct restore code x86/fpu/signal: Sanitize copy_user_to_fpregs_zeroing() x86/fpu/signal: Sanitize the xstate check on sigframe x86/fpu/signal: Remove the legacy alignment check x86/fpu/signal: Move initial checks into fpu__restore_sig() x86/fpu: Mark init_fpstate __ro_after_init x86/pkru: Remove xstate fiddling from write_pkru() x86/fpu: Don't store PKRU in xstate in fpu_reset_fpstate() x86/fpu: Remove PKRU handling from switch_fpu_finish() x86/fpu: Mask PKRU from kernel XRSTOR[S] operations x86/fpu: Hook up PKRU into ptrace() x86/fpu: Add PKRU storage outside of task XSAVE buffer x86/fpu: Dont restore PKRU in fpregs_restore_userspace() x86/fpu: Rename xfeatures_mask_user() to xfeatures_mask_uabi() x86/fpu: Move FXSAVE_LEAK quirk info __copy_kernel_to_fpregs() x86/fpu: Rename __fpregs_load_activate() to fpregs_restore_userregs() ...
2021-06-29Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-0/+8
Merge misc updates from Andrew Morton: "191 patches. Subsystems affected by this patch series: kthread, ia64, scripts, ntfs, squashfs, ocfs2, kernel/watchdog, and mm (gup, pagealloc, slab, slub, kmemleak, dax, debug, pagecache, gup, swap, memcg, pagemap, mprotect, bootmem, dma, tracing, vmalloc, kasan, initialization, pagealloc, and memory-failure)" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (191 commits) mm,hwpoison: make get_hwpoison_page() call get_any_page() mm,hwpoison: send SIGBUS with error virutal address mm/page_alloc: split pcp->high across all online CPUs for cpuless nodes mm/page_alloc: allow high-order pages to be stored on the per-cpu lists mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA docs: remove description of DISCONTIGMEM arch, mm: remove stale mentions of DISCONIGMEM mm: remove CONFIG_DISCONTIGMEM m68k: remove support for DISCONTIGMEM arc: remove support for DISCONTIGMEM arc: update comment about HIGHMEM implementation alpha: remove DISCONTIGMEM and NUMA mm/page_alloc: move free_the_page mm/page_alloc: fix counting of managed_pages mm/page_alloc: improve memmap_pages dbg msg mm: drop SECTION_SHIFT in code comments mm/page_alloc: introduce vm.percpu_pagelist_high_fraction mm/page_alloc: limit the number of pages on PCP lists when reclaim is active mm/page_alloc: scale the number of pages that are batch freed ...
2021-06-29mm: gup: pack has_pinned in MMF_HAS_PINNEDAndrea Arcangeli1-0/+8
has_pinned 32bit can be packed in the MMF_HAS_PINNED bit as a noop cleanup. Any atomic_inc/dec to the mm cacheline shared by all threads in pin-fast would reintroduce a loss of SMP scalability to pin-fast, so there's no future potential usefulness to keep an atomic in the mm for this. set_bit(MMF_HAS_PINNED) will be theoretically a bit slower than WRITE_ONCE (atomic_set is equivalent to WRITE_ONCE), but the set_bit (just like atomic_set after this commit) has to be still issued only once per "mm", so the difference between the two will be lost in the noise. will-it-scale "mmap2" shows no change in performance with enterprise config as expected. will-it-scale "pin_fast" retains the > 4000% SMP scalability performance improvement against upstream as expected. This is a noop as far as overall performance and SMP scalability are concerned. [peterx@redhat.com: pack has_pinned in MMF_HAS_PINNED] Link: https://lkml.kernel.org/r/YJqWESqyxa8OZA+2@t490s [akpm@linux-foundation.org: coding style fixes] [peterx@redhat.com: fix build for task_mmu.c, introduce mm_set_has_pinned_flag, fix comments] Link: https://lkml.kernel.org/r/20210507150553.208763-4-peterx@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Kirill Shutemov <kirill@shutemov.name> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-28Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespaceLinus Torvalds1-7/+0
Pull user namespace rlimit handling update from Eric Biederman: "This is the work mainly by Alexey Gladkov to limit rlimits to the rlimits of the user that created a user namespace, and to allow users to have stricter limits on the resources created within a user namespace." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: cred: add missing return error code when set_cred_ucounts() failed ucounts: Silence warning in dec_rlimit_ucounts ucounts: Set ucount_max to the largest positive value the type can hold kselftests: Add test to check for rlimit changes in different user namespaces Reimplement RLIMIT_MEMLOCK on top of ucounts Reimplement RLIMIT_SIGPENDING on top of ucounts Reimplement RLIMIT_MSGQUEUE on top of ucounts Reimplement RLIMIT_NPROC on top of ucounts Use atomic_t for ucounts reference counting Add a reference to ucounts for each cred Increase size of ucounts to atomic_long_t
2021-06-28sched/sysctl: Move extern sysctl declarations to sched.hHailong Liu1-18/+0
Since commit '8a99b6833c88(sched: Move SCHED_DEBUG sysctl to debugfs)', SCHED_DEBUG sysctls are moved to debugfs, so these extern sysctls in include/linux/sched/sysctl.h are no longer needed for sysctl.c, even some are no longer needed. So move those extern sysctls that needed by kernel/sched/debug.c to kernel/sched/sched.h, and remove others that are no longer needed. Signed-off-by: Hailong Liu <liu.hailong6@zte.com.cn> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210606115451.26745-1-liuhailongg6@163.com
2021-06-24sched/core: Introduce SD_ASYM_CPUCAPACITY_FULL sched_domain flagBeata Michalska1-0/+10
Introducing new, complementary to SD_ASYM_CPUCAPACITY, sched_domain topology flag, to distinguish between shed_domains where any CPU capacity asymmetry is detected (SD_ASYM_CPUCAPACITY) and ones where a full set of CPU capacities is visible to all domain members (SD_ASYM_CPUCAPACITY_FULL). With the distinction between full and partial CPU capacity asymmetry, brought in by the newly introduced flag, the scope of the original SD_ASYM_CPUCAPACITY flag gets shifted, still maintaining the existing behaviour when one is detected on a given sched domain, allowing misfit migrations within sched domains that do not observe full range of CPU capacities but still do have members with different capacity values. It loses though it's meaning when it comes to the lowest CPU asymmetry sched_domain level per-cpu pointer, which is to be now denoted by SD_ASYM_CPUCAPACITY_FULL flag. Signed-off-by: Beata Michalska <beata.michalska@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20210603140627.8409-2-beata.michalska@arm.com
2021-06-23Merge x86/urgent into x86/fpuBorislav Petkov1-0/+1
Pick up dependent changes which either went mainline (x86/urgent is based on -rc7 and that contains them) as urgent fixes and the current x86/urgent branch which contains two more urgent fixes, so that the bigger FPU rework can base off ontop. Signed-off-by: Borislav Petkov <bp@suse.de>
2021-06-18sched: Change task_struct::statePeter Zijlstra2-2/+2
Change the type and name of task_struct::state. Drop the volatile and shrink it to an 'unsigned int'. Rename it in order to find all uses such that we can use READ_ONCE/WRITE_ONCE as appropriate. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Acked-by: Will Deacon <will@kernel.org> Acked-by: Daniel Thompson <daniel.thompson@linaro.org> Link: https://lore.kernel.org/r/20210611082838.550736351@infradead.org
2021-06-17sched/cpufreq: Consider reduced CPU capacity in energy calculationLukasz Luba1-1/+1
Energy Aware Scheduling (EAS) needs to predict the decisions made by SchedUtil. The map_util_freq() exists to do that. There are corner cases where the max allowed frequency might be reduced (due to thermal). SchedUtil as a CPUFreq governor, is aware of that but EAS is not. This patch aims to address it. SchedUtil stores the maximum allowed frequency in 'sugov_policy::next_freq' field. EAS has to predict that value, which is the real used frequency. That value is made after a call to cpufreq_driver_resolve_freq() which clamps to the CPUFreq policy limits. In the existing code EAS is not able to predict that real frequency. This leads to energy estimation errors. To avoid wrong energy estimation in EAS (due to frequency miss prediction) make sure that the step which calculates Performance Domain frequency, is also aware of the allowed CPU capacity. Furthermore, modify map_util_freq() to not extend the frequency value. Instead, use map_util_perf() to extend the util value in both places: SchedUtil and EAS, but for EAS clamp it to max allowed CPU capacity. In the end, we achieve the same desirable behavior for both subsystems and alignment in regards to the real CPU frequency. Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> (For the schedutil part) Link: https://lore.kernel.org/r/20210614191238.23224-1-lukasz.luba@arm.com
2021-06-03Merge branch 'sched/urgent' into sched/core, to pick up fixesIngo Molnar1-0/+1
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-05-21Merge branch 'for-v5.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespaceLinus Torvalds1-0/+1
Pull siginfo fix from Eric Biederman: "During the merge window an issue with si_perf and the siginfo ABI came up. The alpha and sparc siginfo structure layout had changed with the addition of SIGTRAP TRAP_PERF and the new field si_perf. The reason only alpha and sparc were affected is that they are the only architectures that use si_trapno. Looking deeper it was discovered that si_trapno is used for only a few select signals on alpha and sparc, and that none of the other _sigfault fields past si_addr are used at all. Which means technically no regression on alpha and sparc. While the alignment concerns might be dismissed the abuse of si_errno by SIGTRAP TRAP_PERF does have the potential to cause regressions in existing userspace. While we still have time before userspace starts using and depending on the new definition siginfo for SIGTRAP TRAP_PERF this set of changes cleans up siginfo_t. - The si_trapno field is demoted from magic alpha and sparc status and made an ordinary union member of the _sigfault member of siginfo_t. Without moving it of course. - si_perf is replaced with si_perf_data and si_perf_type ending the abuse of si_errno. - Unnecessary additions to signalfd_siginfo are removed" * 'for-v5.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: signalfd: Remove SIL_PERF_EVENT fields from signalfd_siginfo signal: Deliver all of the siginfo perf data in _perf signal: Factor force_sig_perf out of perf_sigtrap signal: Implement SIL_FAULT_TRAPNO siginfo: Move si_trapno inside the union inside _si_fault
2021-05-19x86/signal: Detect and prevent an alternate signal stack overflowChang S. Bae1-7/+12
The kernel pushes context on to the userspace stack to prepare for the user's signal handler. When the user has supplied an alternate signal stack, via sigaltstack(2), it is easy for the kernel to verify that the stack size is sufficient for the current hardware context. Check if writing the hardware context to the alternate stack will exceed it's size. If yes, then instead of corrupting user-data and proceeding with the original signal handler, an immediate SIGSEGV signal is delivered. Refactor the stack pointer check code from on_sig_stack() and use the new helper. While the kernel allows new source code to discover and use a sufficient alternate signal stack size, this check is still necessary to protect binaries with insufficient alternate signal stack size from data corruption. Fixes: c2bc11f10a39 ("x86, AVX-512: Enable AVX-512 States Context Switch") Reported-by: Florian Weimer <fweimer@redhat.com> Suggested-by: Jann Horn <jannh@google.com> Suggested-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Len Brown <len.brown@intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20210518200320.17239-6-chang.seok.bae@intel.com Link: https://bugzilla.kernel.org/show_bug.cgi?id=153531
2021-05-18signal: Factor force_sig_perf out of perf_sigtrapEric W. Biederman1-0/+1
Separate filling in siginfo for TRAP_PERF from deciding that siginal needs to be sent. There are enough little details that need to be correct when properly filling in siginfo_t that it is easy to make mistakes if filling in the siginfo_t is in the same function with other logic. So factor out force_sig_perf to reduce the cognative load of on reviewers, maintainers and implementors. v1: https://lkml.kernel.org/r/m17dkjqqxz.fsf_-_@fess.ebiederm.org v2: https://lkml.kernel.org/r/20210505141101.11519-10-ebiederm@xmission.com Link: https://lkml.kernel.org/r/20210517195748.8880-3-ebiederm@xmission.com Reviewed-by: Marco Elver <elver@google.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2021-05-12sched: Make nr_iowait_cpu() return 32-bit valueAlexey Dobriyan1-1/+1
Runqueue ->nr_iowait counters are 32-bit anyway. Propagate 32-bitness into other code, but don't try too hard. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20210422200228.1423391-3-adobriyan@gmail.com
2021-05-12sched: Make nr_iowait() return 32-bit valueAlexey Dobriyan1-1/+1
Creating 2**32 tasks to wait in D-state is impossible and wasteful. Return "unsigned int" and save on REX prefixes. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20210422200228.1423391-2-adobriyan@gmail.com
2021-05-12sched: Make nr_running() return 32-bit valueAlexey Dobriyan1-1/+1
Creating 2**32 tasks is impossible due to futex pid limits and wasteful anyway. Nobody has done it. Bring nr_running() into 32-bit world to save on REX prefixes. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20210422200228.1423391-1-adobriyan@gmail.com
2021-05-12sched: Simplify sched_info_on()Peter Zijlstra1-8/+2
The situation around sched_info is somewhat complicated, it is used by sched_stats and delayacct and, indirectly, kvm. If SCHEDSTATS=Y (but disabled by default) sched_info_on() is unconditionally true -- this is the case for all distro kernel configs I checked. If for some reason SCHEDSTATS=N, but TASK_DELAY_ACCT=Y, then sched_info_on() can return false when delayacct is disabled, presumably because there would be no other users left; except kvm is. Instead of complicating matters further by accurately accounting sched_stat and kvm state, simply unconditionally enable when SCHED_INFO=Y, matching the common distro case. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lkml.kernel.org/r/20210505111525.121458839@infradead.org
2021-05-05mm: honor PF_MEMALLOC_PIN for all movable pagesPavel Tatashin1-1/+5
PF_MEMALLOC_PIN is only honored for CMA pages, extend this flag to work for any allocations from ZONE_MOVABLE by removing __GFP_MOVABLE from gfp_mask when this flag is passed in the current context. Add is_pinnable_page() to return true if page is in a pinnable page. A pinnable page is not in ZONE_MOVABLE and not of MIGRATE_CMA type. Link: https://lkml.kernel.org/r/20210215161349.246722-8-pasha.tatashin@soleen.com Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: James Morris <jmorris@namei.org> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sasha Levin <sashal@kernel.org> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Tyler Hicks <tyhicks@linux.microsoft.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05mm cma: rename PF_MEMALLOC_NOCMA to PF_MEMALLOC_PINPavel Tatashin1-16/+5
PF_MEMALLOC_NOCMA is used ot guarantee that the allocator will not return pages that might belong to CMA region. This is currently used for long term gup to make sure that such pins are not going to be done on any CMA pages. When PF_MEMALLOC_NOCMA has been introduced we haven't realized that it is focusing on CMA pages too much and that there is larger class of pages that need the same treatment. MOVABLE zone cannot contain any long term pins as well so it makes sense to reuse and redefine this flag for that usecase as well. Rename the flag to PF_MEMALLOC_PIN which defines an allocation context which can only get pages suitable for long-term pins. Also rename: memalloc_nocma_save()/memalloc_nocma_restore to memalloc_pin_save()/memalloc_pin_restore() and make the new functions common. [rppt@linux.ibm.com: fix renaming of PF_MEMALLOC_NOCMA to PF_MEMALLOC_PIN] Link: https://lkml.kernel.org/r/20210331163816.11517-1-rppt@kernel.org Link: https://lkml.kernel.org/r/20210215161349.246722-6-pasha.tatashin@soleen.com Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: James Morris <jmorris@namei.org> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sasha Levin <sashal@kernel.org> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Tyler Hicks <tyhicks@linux.microsoft.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30Reimplement RLIMIT_MEMLOCK on top of ucountsAlexey Gladkov1-1/+0
The rlimit counter is tied to uid in the user_namespace. This allows rlimit values to be specified in userns even if they are already globally exceeded by the user. However, the value of the previous user_namespaces cannot be exceeded. Changelog v11: * Fix issue found by lkp robot. v8: * Fix issues found by lkp-tests project. v7: * Keep only ucounts for RLIMIT_MEMLOCK checks instead of struct cred. v6: * Fix bug in hugetlb_file_setup() detected by trinity. Reported-by: kernel test robot <oliver.sang@intel.com> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Alexey Gladkov <legion@kernel.org> Link: https://lkml.kernel.org/r/970d50c70c71bfd4496e0e8d2a0a32feebebb350.1619094428.git.legion@kernel.org Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2021-04-30Reimplement RLIMIT_SIGPENDING on top of ucountsAlexey Gladkov1-1/+0
The rlimit counter is tied to uid in the user_namespace. This allows rlimit values to be specified in userns even if they are already globally exceeded by the user. However, the value of the previous user_namespaces cannot be exceeded. Changelog v11: * Revert most of changes to fix performance issues. v10: * Fix memory leak on get_ucounts failure. Signed-off-by: Alexey Gladkov <legion@kernel.org> Link: https://lkml.kernel.org/r/df9d7764dddd50f28616b7840de74ec0f81711a8.1619094428.git.legion@kernel.org Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2021-04-30Reimplement RLIMIT_MSGQUEUE on top of ucountsAlexey Gladkov1-4/+0
The rlimit counter is tied to uid in the user_namespace. This allows rlimit values to be specified in userns even if they are already globally exceeded by the user. However, the value of the previous user_namespaces cannot be exceeded. Signed-off-by: Alexey Gladkov <legion@kernel.org> Link: https://lkml.kernel.org/r/2531f42f7884bbfee56a978040b3e0d25cdf6cde.1619094428.git.legion@kernel.org Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2021-04-30Reimplement RLIMIT_NPROC on top of ucountsAlexey Gladkov1-1/+0
The rlimit counter is tied to uid in the user_namespace. This allows rlimit values to be specified in userns even if they are already globally exceeded by the user. However, the value of the previous user_namespaces cannot be exceeded. To illustrate the impact of rlimits, let's say there is a program that does not fork. Some service-A wants to run this program as user X in multiple containers. Since the program never fork the service wants to set RLIMIT_NPROC=1. service-A \- program (uid=1000, container1, rlimit_nproc=1) \- program (uid=1000, container2, rlimit_nproc=1) The service-A sets RLIMIT_NPROC=1 and runs the program in container1. When the service-A tries to run a program with RLIMIT_NPROC=1 in container2 it fails since user X already has one running process. We cannot use existing inc_ucounts / dec_ucounts because they do not allow us to exceed the maximum for the counter. Some rlimits can be overlimited by root or if the user has the appropriate capability. Changelog v11: * Change inc_rlimit_ucounts() which now returns top value of ucounts. * Drop inc_rlimit_ucounts_and_test() because the return code of inc_rlimit_ucounts() can be checked. Signed-off-by: Alexey Gladkov <legion@kernel.org> Link: https://lkml.kernel.org/r/c5286a8aa16d2d698c222f7532f3d735c82bc6bc.1619094428.git.legion@kernel.org Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2021-04-29Merge tag 'fsnotify_for_v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fsLinus Torvalds1-3/+0
Pull fsnotify updates from Jan Kara: - support for limited fanotify functionality for unpriviledged users - faster merging of fanotify events - a few smaller fsnotify improvements * tag 'fsnotify_for_v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: shmem: allow reporting fanotify events with file handles on tmpfs fs: introduce a wrapper uuid_to_fsid() fanotify_user: use upper_32_bits() to verify mask fanotify: support limited functionality for unprivileged users fanotify: configurable limits via sysfs fanotify: limit number of event merge attempts fsnotify: use hash table for faster events merge fanotify: mix event info and pid into merge key hash fanotify: reduce event objectid to 29-bit hash fsnotify: allow fsnotify_{peek,remove}_first_event with empty queue
2021-04-21sched: Warn on long periods of pending need_reschedPaul Turner1-0/+3
CPU scheduler marks need_resched flag to signal a schedule() on a particular CPU. But, schedule() may not happen immediately in cases where the current task is executing in the kernel mode (no preemption state) for extended periods of time. This patch adds a warn_on if need_resched is pending for more than the time specified in sysctl resched_latency_warn_ms. If it goes off, it is likely that there is a missing cond_resched() somewhere. Monitoring is done via the tick and the accuracy is hence limited to jiffy scale. This also means that we won't trigger the warning if the tick is disabled. This feature (LATENCY_WARN) is default disabled. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Josh Don <joshdon@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210416212936.390566-1-joshdon@google.com
2021-04-20Merge tag 'v5.12-rc8' into sched/core, to pick up fixesIngo Molnar1-1/+2
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-04-16sched: Move SCHED_DEBUG sysctl to debugfsPeter Zijlstra1-5/+3
Stop polluting sysctl with undocumented knobs that really are debug only, move them all to /debug/sched/ along with the existing /debug/sched_* files that already exist. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Tested-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/20210412102001.287610138@infradead.org
2021-03-16fanotify: configurable limits via sysfsAmir Goldstein1-3/+0
fanotify has some hardcoded limits. The only APIs to escape those limits are FAN_UNLIMITED_QUEUE and FAN_UNLIMITED_MARKS. Allow finer grained tuning of the system limits via sysfs tunables under /proc/sys/fs/fanotify, similar to tunables under /proc/sys/fs/inotify, with some minor differences. - max_queued_events - global system tunable for group queue size limit. Like the inotify tunable with the same name, it defaults to 16384 and applies on initialization of a new group. - max_user_marks - user ns tunable for marks limit per user. Like the inotify tunable named max_user_watches, on a machine with sufficient RAM and it defaults to 1048576 in init userns and can be further limited per containing user ns. - max_user_groups - user ns tunable for number of groups per user. Like the inotify tunable named max_user_instances, it defaults to 128 in init userns and can be further limited per containing user ns. The slightly different tunable names used for fanotify are derived from the "group" and "mark" terminology used in the fanotify man pages and throughout the code. Considering the fact that the default value for max_user_instances was increased in kernel v5.10 from 8192 to 1048576, leaving the legacy fanotify limit of 8192 marks per group in addition to the max_user_marks limit makes little sense, so the per group marks limit has been removed. Note that when a group is initialized with FAN_UNLIMITED_MARKS, its own marks are not accounted in the per user marks account, so in effect the limit of max_user_marks is only for the collection of groups that are not initialized with FAN_UNLIMITED_MARKS. Link: https://lore.kernel.org/r/20210304112921.3996419-2-amir73il@gmail.com Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2021-03-13include/linux/sched/mm.h: use rcu_dereference in in_vfork()Matthew Wilcox (Oracle)1-1/+2
Fix a sparse warning by using rcu_dereference(). Technically this is a bug and a sufficiently aggressive compiler could reload the `real_parent' pointer outside the protection of the rcu lock (and access freed memory), but I think it's pretty unlikely to happen. Link: https://lkml.kernel.org/r/20210221194207.1351703-1-willy@infradead.org Fixes: b18dc5f291c0 ("mm, oom: skip vforked tasks from being selected") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-03-04kernel: provide create_io_thread() helperJens Axboe1-0/+2
Provide a generic helper for setting up an io_uring worker. Returns a task_struct so that the caller can do whatever setup is needed, then call wake_up_new_task() to kick it into gear. Add a kernel_clone_args member, io_thread, which tells copy_process() to mark the task with PF_IO_WORKER. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-17sched: Remove USER_PRIO, TASK_USER_PRIO and MAX_USER_PRIODietmar Eggemann1-9/+0
The only remaining use of MAX_USER_PRIO (and USER_PRIO) is the SCALE_PRIO() definition in the PowerPC Cell architecture's Synergistic Processor Unit (SPU) scheduler. TASK_USER_PRIO isn't used anymore. Commit fe443ef2ac42 ("[POWERPC] spusched: Dynamic timeslicing for SCHED_OTHER") copied SCALE_PRIO() from the task scheduler in v2.6.23. Commit a4ec24b48dde ("sched: tidy up SCHED_RR") removed it from the task scheduler in v2.6.24. Commit 3ee237dddcd8 ("sched/prio: Add 3 macros of MAX_NICE, MIN_NICE and NICE_WIDTH in prio.h") introduced NICE_WIDTH much later. With: MAX_USER_PRIO = USER_PRIO(MAX_PRIO) = MAX_PRIO - MAX_RT_PRIO MAX_PRIO = MAX_RT_PRIO + NICE_WIDTH MAX_USER_PRIO = MAX_RT_PRIO + NICE_WIDTH - MAX_RT_PRIO MAX_USER_PRIO = NICE_WIDTH MAX_USER_PRIO can be replaced by NICE_WIDTH to be able to remove all the {*_}USER_PRIO defines. Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20210128131040.296856-3-dietmar.eggemann@arm.com
2021-02-17sched: Remove MAX_USER_RT_PRIODietmar Eggemann1-8/+1
Commit d46523ea32a7 ("[PATCH] fix MAX_USER_RT_PRIO and MAX_RT_PRIO") was introduced due to a a small time period in which the realtime patch set was using different values for MAX_USER_RT_PRIO and MAX_RT_PRIO. This is no longer true, i.e. now MAX_RT_PRIO == MAX_USER_RT_PRIO. Get rid of MAX_USER_RT_PRIO and make everything use MAX_RT_PRIO instead. Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20210128131040.296856-2-dietmar.eggemann@arm.com
2020-12-22Merge tag 'pm-5.11-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pmLinus Torvalds1-0/+5
Pull more power management updates from Rafael Wysocki: "These update the CPPC cpufreq driver and intel_pstate (which involves updating the cpufreq core and the schedutil governor) and make janitorial changes in the ACPI code handling processor objects. Specifics: - Rework the passive-mode "fast switch" path in the intel_pstate driver to allow it receive the minimum (required) and target (desired) performance information from the schedutil governor so as to avoid running some workloads too fast (Rafael Wysocki). - Make the intel_pstate driver allow the policy max limit to be increased after the guaranteed performance value for the given CPU has increased (Rafael Wysocki). - Clean up the handling of CPU coordination types in the CPPC cpufreq driver and make it export frequency domains information to user space via sysfs (Ionela Voinescu). - Fix the ACPI code handling processor objects to use a correct coordination type when it fails to map frequency domains and drop a redundant CPU map initialization from it (Ionela Voinescu, Punit Agrawal)" * tag 'pm-5.11-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: cpufreq: intel_pstate: Use most recent guaranteed performance values cpufreq: intel_pstate: Implement the ->adjust_perf() callback cpufreq: Add special-purpose fast-switching callback for drivers cpufreq: schedutil: Add util to struct sg_cpu cppc_cpufreq: replace per-cpu data array with a list cppc_cpufreq: expose information on frequency domains cppc_cpufreq: clarify support for coordination types cppc_cpufreq: use policy->cpu as driver of frequency setting ACPI: processor: fix NONE coordination for domain mapping failure
2020-12-22Merge branch 'pm-cpufreq'Rafael J. Wysocki1-0/+5
* pm-cpufreq: cpufreq: intel_pstate: Use most recent guaranteed performance values cpufreq: intel_pstate: Implement the ->adjust_perf() callback cpufreq: Add special-purpose fast-switching callback for drivers cpufreq: schedutil: Add util to struct sg_cpu cppc_cpufreq: replace per-cpu data array with a list cppc_cpufreq: expose information on frequency domains cppc_cpufreq: clarify support for coordination types cppc_cpufreq: use policy->cpu as driver of frequency setting ACPI: processor: fix NONE coordination for domain mapping failure ACPI: processor: Drop duplicate setting of shared_cpu_map