aboutsummaryrefslogtreecommitdiffstats
path: root/include/linux/sched (follow)
AgeCommit message (Collapse)AuthorFilesLines
2019-01-04fork: fix some -Wmissing-prototypes warningsYi Wang1-0/+2
We get a warning when building kernel with W=1: kernel/fork.c:167:13: warning: no previous prototype for `arch_release_thread_stack' [-Wmissing-prototypes] kernel/fork.c:779:13: warning: no previous prototype for `fork_init' [-Wmissing-prototypes] Add the missing declaration in head file to fix this. Also, remove arch_release_thread_stack() completely because no arch seems to implement it since bb9d81264 (arch: remove tile port). Link: http://lkml.kernel.org/r/1542170087-23645-1-git-send-email-wang.yi59@zte.com.cn Signed-off-by: Yi Wang <wang.yi59@zte.com.cn> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-11sched/cpufreq: Prepare schedutil for Energy Aware SchedulingQuentin Perret1-0/+6
Schedutil requests frequency by aggregating utilization signals from the scheduler (CFS, RT, DL, IRQ) and applying a 25% margin on top of them. Since Energy Aware Scheduling (EAS) needs to be able to predict the frequency requests, it needs to forecast the decisions made by the governor. In order to prepare the introduction of EAS, introduce schedutil_freq_util() to centralize the aforementioned signal aggregation and make it available to both schedutil and EAS. Since frequency selection and energy estimation still need to deal with RT and DL signals slightly differently, schedutil_freq_util() is called with a different 'type' parameter in those two contexts, and returns an aggregated utilization signal accordingly. While at it, introduce the map_util_freq() function which is designed to make schedutil's 25% margin usable easily for both sugov and EAS. As EAS will be able to predict schedutil's frequency requests more accurately than any other governor by design, it'd be sensible to make sure EAS cannot be used without schedutil. This will be done later, once EAS has actually been introduced. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Quentin Perret <quentin.perret@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: adharmap@codeaurora.org Cc: chris.redpath@arm.com Cc: currojerez@riseup.net Cc: dietmar.eggemann@arm.com Cc: edubezval@gmail.com Cc: gregkh@linuxfoundation.org Cc: javi.merino@kernel.org Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: morten.rasmussen@arm.com Cc: patrick.bellasi@arm.com Cc: pkondeti@codeaurora.org Cc: rjw@rjwysocki.net Cc: skannan@codeaurora.org Cc: smuckle@google.com Cc: srinivas.pandruvada@linux.intel.com Cc: thara.gopinath@linaro.org Cc: tkjos@google.com Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Cc: viresh.kumar@linaro.org Link: https://lkml.kernel.org/r/20181203095628.11858-3-quentin.perret@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-12-11sched/topology: Relocate arch_scale_cpu_capacity() to the internal headerQuentin Perret1-0/+16
By default, arch_scale_cpu_capacity() is only visible from within the kernel/sched folder. Relocate it to include/linux/sched/topology.h to make it visible to other clients needing to know about the capacity of CPUs, such as the Energy Model framework. This also shrinks the <linux/sched/topology.h> public header. Signed-off-by: Quentin Perret <quentin.perret@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: adharmap@codeaurora.org Cc: chris.redpath@arm.com Cc: currojerez@riseup.net Cc: dietmar.eggemann@arm.com Cc: edubezval@gmail.com Cc: gregkh@linuxfoundation.org Cc: javi.merino@kernel.org Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: morten.rasmussen@arm.com Cc: patrick.bellasi@arm.com Cc: pkondeti@codeaurora.org Cc: rjw@rjwysocki.net Cc: skannan@codeaurora.org Cc: smuckle@google.com Cc: srinivas.pandruvada@linux.intel.com Cc: thara.gopinath@linaro.org Cc: tkjos@google.com Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Cc: viresh.kumar@linaro.org Link: https://lkml.kernel.org/r/20181203095628.11858-2-quentin.perret@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-12-11sched/topology: Remove the ::smt_gain field from 'struct sched_domain'Vincent Guittot1-1/+0
::smt_gain is used to compute the capacity of CPUs of a SMT core with the constraint 1 < ::smt_gain < 2 in order to be able to compute number of CPUs per core. The field has_free_capacity of struct numa_stat, which was the last user of this computation of number of CPUs per core, has been removed by: 2d4056fafa19 ("sched/numa: Remove numa_has_capacity()") We can now remove this constraint on core capacity and use the defautl value SCHED_CAPACITY_SCALE for SMT CPUs. With this remove, SCHED_CAPACITY_SCALE becomes the maximum compute capacity of CPUs on every systems. This should help to simplify some code and remove fields like rd->max_cpu_capacity Furthermore, arch_scale_cpu_capacity() is used with a NULL sd in several other places in the code when it wants the capacity of a CPUs to scale some metrics like in pelt, deadline or schedutil. In case on SMT, the value returned is not the capacity of SMT CPUs but default SCHED_CAPACITY_SCALE. So remove it. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1535548752-4434-4-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-12-03sched: Fix various typos in commentsIngo Molnar3-4/+4
Go over the scheduler source code and fix common typos in comments - and a typo in an actual variable name. No change in functionality intended. Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-11-28x86/speculation: Rework SMT state changeThomas Gleixner1-0/+2
arch_smt_update() is only called when the sysfs SMT control knob is changed. This means that when SMT is enabled in the sysfs control knob the system is considered to have SMT active even if all siblings are offline. To allow finegrained control of the speculation mitigations, the actual SMT state is more interesting than the fact that siblings could be enabled. Rework the code, so arch_smt_update() is invoked from each individual CPU hotplug function, and simplify the update function while at it. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Woodhouse <dwmw@amazon.co.uk> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Casey Schaufler <casey.schaufler@intel.com> Cc: Asit Mallick <asit.k.mallick@intel.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Jon Masters <jcm@redhat.com> Cc: Waiman Long <longman9394@gmail.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Dave Stewart <david.c.stewart@intel.com> Cc: Kees Cook <keescook@chromium.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20181125185004.521974984@linutronix.de
2018-11-28sched/smt: Expose sched_smt_present static keyThomas Gleixner1-0/+18
Make the scheduler's 'sched_smt_present' static key globaly available, so it can be used in the x86 speculation control code. Provide a query function and a stub for the CONFIG_SMP=n case. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Woodhouse <dwmw@amazon.co.uk> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Casey Schaufler <casey.schaufler@intel.com> Cc: Asit Mallick <asit.k.mallick@intel.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Jon Masters <jcm@redhat.com> Cc: Waiman Long <longman9394@gmail.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Dave Stewart <david.c.stewart@intel.com> Cc: Kees Cook <keescook@chromium.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20181125185004.430168326@linutronix.de
2018-10-30Merge tag 'pm-4.20-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pmLinus Torvalds1-1/+0
Pull more power management updates from Rafael Wysocki: "These remove a questionable heuristic from the menu cpuidle governor, fix a recent build regression in the intel_pstate driver, clean up ARM big-Little support in cpufreq and fix up hung task watchdog's interaction with system-wide power management transitions. Specifics: - Fix build regression in the intel_pstate driver that doesn't build without CONFIG_ACPI after recent changes (Dominik Brodowski). - One of the heuristics in the menu cpuidle governor is based on a function returning 0 most of the time, so drop it and clean up the scheduler code related to it (Daniel Lezcano). - Prevent the arm_big_little cpufreq driver from being used on ARM64 which is not suitable for it and drop the arm_big_little_dt driver that is not used any more (Sudeep Holla). - Prevent the hung task watchdog from triggering during resume from system-wide sleep states by disabling it before freezing tasks and enabling it again after they have been thawed (Vitaly Kuznetsov)" * tag 'pm-4.20-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: kernel: hung_task.c: disable on suspend cpufreq: remove unused arm_big_little_dt driver cpufreq: drop ARM_BIG_LITTLE_CPUFREQ support for ARM64 cpufreq: intel_pstate: Fix compilation for !CONFIG_ACPI cpuidle: menu: Remove get_loadavg() from the performance multiplier sched: Factor out nr_iowait and nr_iowait_cpu
2018-10-26sched: loadavg: make calc_load_n() publicJohannes Weiner1-0/+3
It's going to be used in a later patch. Keep the churn separate. Link: http://lkml.kernel.org/r/20180828172258.3185-6-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Suren Baghdasaryan <surenb@google.com> Tested-by: Daniel Drake <drake@endlessm.com> Cc: Christopher Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <jweiner@fb.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26sched: loadavg: consolidate LOAD_INT, LOAD_FRAC, CALC_LOADJohannes Weiner1-4/+17
There are several definitions of those functions/macros in places that mess with fixed-point load averages. Provide an official version. [akpm@linux-foundation.org: fix missed conversion in block/blk-iolatency.c] Link: http://lkml.kernel.org/r/20180828172258.3185-5-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Suren Baghdasaryan <surenb@google.com> Tested-by: Daniel Drake <drake@endlessm.com> Cc: Christopher Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <jweiner@fb.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-25cpuidle: menu: Remove get_loadavg() from the performance multiplierDaniel Lezcano1-1/+0
The function get_loadavg() returns almost always zero. To be more precise, statistically speaking for a total of 1023379 times passing in the function, the load is equal to zero 1020728 times, greater than 100, 610 times, the remaining is between 0 and 5. In 2011, the get_loadavg() was removed from the Android tree because of the above [1]. At this time, the load was: unsigned long this_cpu_load(void) { struct rq *this = this_rq(); return this->cpu_load[0]; } In 2014, the code was changed by commit 372ba8cb46b2 (cpuidle: menu: Lookup CPU runqueues less) and the load is: void get_iowait_load(unsigned long *nr_waiters, unsigned long *load) { struct rq *rq = this_rq(); *nr_waiters = atomic_read(&rq->nr_iowait); *load = rq->load.weight; } with the same result. Both measurements show using the load in this code path does no matter anymore. Removing it. [1] https://android.googlesource.com/kernel/common/+/4dedd9f124703207895777ac6e91dacde0f7cc17 Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-10-24Merge branch 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespaceLinus Torvalds1-12/+11
Pull siginfo updates from Eric Biederman: "I have been slowly sorting out siginfo and this is the culmination of that work. The primary result is in several ways the signal infrastructure has been made less error prone. The code has been updated so that manually specifying SEND_SIG_FORCED is never necessary. The conversion to the new siginfo sending functions is now complete, which makes it difficult to send a signal without filling in the proper siginfo fields. At the tail end of the patchset comes the optimization of decreasing the size of struct siginfo in the kernel from 128 bytes to about 48 bytes on 64bit. The fundamental observation that enables this is by definition none of the known ways to use struct siginfo uses the extra bytes. This comes at the cost of a small user space observable difference. For the rare case of siginfo being injected into the kernel only what can be copied into kernel_siginfo is delivered to the destination, the rest of the bytes are set to 0. For cases where the signal and the si_code are known this is safe, because we know those bytes are not used. For cases where the signal and si_code combination is unknown the bits that won't fit into struct kernel_siginfo are tested to verify they are zero, and the send fails if they are not. I made an extensive search through userspace code and I could not find anything that would break because of the above change. If it turns out I did break something it will take just the revert of a single change to restore kernel_siginfo to the same size as userspace siginfo. Testing did reveal dependencies on preferring the signo passed to sigqueueinfo over si->signo, so bit the bullet and added the complexity necessary to handle that case. Testing also revealed bad things can happen if a negative signal number is passed into the system calls. Something no sane application will do but something a malicious program or a fuzzer might do. So I have fixed the code that performs the bounds checks to ensure negative signal numbers are handled" * 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (80 commits) signal: Guard against negative signal numbers in copy_siginfo_from_user32 signal: Guard against negative signal numbers in copy_siginfo_from_user signal: In sigqueueinfo prefer sig not si_signo signal: Use a smaller struct siginfo in the kernel signal: Distinguish between kernel_siginfo and siginfo signal: Introduce copy_siginfo_from_user and use it's return value signal: Remove the need for __ARCH_SI_PREABLE_SIZE and SI_PAD_SIZE signal: Fail sigqueueinfo if si_signo != sig signal/sparc: Move EMT_TAGOVF into the generic siginfo.h signal/unicore32: Use force_sig_fault where appropriate signal/unicore32: Generate siginfo in ucs32_notify_die signal/unicore32: Use send_sig_fault where appropriate signal/arc: Use force_sig_fault where appropriate signal/arc: Push siginfo generation into unhandled_exception signal/ia64: Use force_sig_fault where appropriate signal/ia64: Use the force_sig(SIGSEGV,...) in ia64_rt_sigreturn signal/ia64: Use the generic force_sigsegv in setup_frame signal/arm/kvm: Use send_sig_mceerr signal/arm: Use send_sig_fault where appropriate signal/arm: Use force_sig_fault where appropriate ...
2018-10-03signal: Distinguish between kernel_siginfo and siginfoEric W. Biederman1-9/+9
Linus recently observed that if we did not worry about the padding member in struct siginfo it is only about 48 bytes, and 48 bytes is much nicer than 128 bytes for allocating on the stack and copying around in the kernel. The obvious thing of only adding the padding when userspace is including siginfo.h won't work as there are sigframe definitions in the kernel that embed struct siginfo. So split siginfo in two; kernel_siginfo and siginfo. Keeping the traditional name for the userspace definition. While the version that is used internally to the kernel and ultimately will not be padded to 128 bytes is called kernel_siginfo. The definition of struct kernel_siginfo I have put in include/signal_types.h A set of buildtime checks has been added to verify the two structures have the same field offsets. To make it easy to verify the change kernel_siginfo retains the same size as siginfo. The reduction in size comes in a following change. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-09-11signal: Remove SEND_SIG_FORCEDEric W. Biederman1-1/+0
There are no more users of SEND_SIG_FORCED so it may be safely removed. Remove the definition of SEND_SIG_FORCED, it's use in is_si_special, it's use in TP_STORE_SIGINFO, and it's use in __send_signal as without any users the uses of SEND_SIG_FORCED are now unncessary. This makes the code simpler, easier to understand and use. Users of signal sending functions now no longer need to ask themselves do I need to use SEND_SIG_FORCED. Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-09-11signal: Remove the siginfo paramater from kernel_dqueue_signalEric W. Biederman1-2/+2
None of the callers use the it so remove it. Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-09-10sched/topology: Add SD_ASYM_CPUCAPACITY flag detectionMorten Rasmussen1-3/+3
The SD_ASYM_CPUCAPACITY sched_domain flag is supposed to mark the sched_domain in the hierarchy where all CPU capacities are visible for any CPU's point of view on asymmetric CPU capacity systems. The scheduler can then take to take capacity asymmetry into account when balancing at this level. It also serves as an indicator for how wide task placement heuristics have to search to consider all available CPU capacities as asymmetric systems might often appear symmetric at smallest level(s) of the sched_domain hierarchy. The flag has been around for while but so far only been set by out-of-tree code in Android kernels. One solution is to let each architecture provide the flag through a custom sched_domain topology array and associated mask and flag functions. However, SD_ASYM_CPUCAPACITY is special in the sense that it depends on the capacity and presence of all CPUs in the system, i.e. when hotplugging all CPUs out except those with one particular CPU capacity the flag should disappear even if the sched_domains don't collapse. Similarly, the flag is affected by cpusets where load-balancing is turned off. Detecting when the flags should be set therefore depends not only on topology information but also the cpuset configuration and hotplug state. The arch code doesn't have easy access to the cpuset configuration. Instead, this patch implements the flag detection in generic code where cpusets and hotplug state is already taken care of. All the arch is responsible for is to implement arch_scale_cpu_capacity() and force a full rebuild of the sched_domain hierarchy if capacities are updated, e.g. later in the boot process when cpufreq has initialized. Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1532093554-30504-2-git-send-email-morten.rasmussen@arm.com [ Fixed 'CPU' capitalization. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-08-22Merge branch 'akpm' (patches from Andrew)Linus Torvalds3-3/+5
Merge more updates from Andrew Morton: - the rest of MM - procfs updates - various misc things - more y2038 fixes - get_maintainer updates - lib/ updates - checkpatch updates - various epoll updates - autofs updates - hfsplus - some reiserfs work - fatfs updates - signal.c cleanups - ipc/ updates * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (166 commits) ipc/util.c: update return value of ipc_getref from int to bool ipc/util.c: further variable name cleanups ipc: simplify ipc initialization ipc: get rid of ids->tables_initialized hack lib/rhashtable: guarantee initial hashtable allocation lib/rhashtable: simplify bucket_table_alloc() ipc: drop ipc_lock() ipc/util.c: correct comment in ipc_obtain_object_check ipc: rename ipcctl_pre_down_nolock() ipc/util.c: use ipc_rcu_putref() for failues in ipc_addid() ipc: reorganize initialization of kern_ipc_perm.seq ipc: compute kern_ipc_perm.id under the ipc lock init/Kconfig: remove EXPERT from CHECKPOINT_RESTORE fs/sysv/inode.c: use ktime_get_real_seconds() for superblock stamp adfs: use timespec64 for time conversion kernel/sysctl.c: fix typos in comments drivers/rapidio/devices/rio_mport_cdev.c: remove redundant pointer md fork: don't copy inconsistent signal handler state to child signal: make get_signal() return bool signal: make sigkill_pending() return bool ...
2018-08-22signal: make force_sigsegv() voidChristian Brauner1-1/+1
Patch series "signal: refactor some functions", v3. This series refactors a bunch of functions in signal.c to simplify parts of the code. The greatest single change is declaring the static do_sigpending() helper as void which makes it possible to remove a bunch of unnecessary checks in the syscalls later on. This patch (of 17): force_sigsegv() returned 0 unconditionally so it doesn't make sense to have it return at all. In addition, there are no callers that check force_sigsegv()'s return value. Link: http://lkml.kernel.org/r/20180602103653.18181-2-christian@brauner.io Signed-off-by: Christian Brauner <christian@brauner.io> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: James Morris <james.morris@microsoft.com> Cc: Kees Cook <keescook@chromium.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22kernel/hung_task.c: allow to set checking interval separately from timeoutDmitry Vyukov1-0/+1
Currently task hung checking interval is equal to timeout, as the result hung is detected anywhere between timeout and 2*timeout. This is fine for most interactive environments, but this hurts automated testing setups (syzbot). In an automated setup we need to strictly order CPU lockup < RCU stall < workqueue lockup < task hung < silent loss, so that RCU stall is not detected as task hung and task hung is not detected as silent machine loss. The large variance in task hung detection timeout requires setting silent machine loss timeout to a very large value (e.g. if task hung is 3 mins, then silent loss need to be set to ~7 mins). The additional 3 minutes significantly reduce testing efficiency because usually we crash kernel within a minute, and this can add hours to bug localization process as it needs to do dozens of tests. Allow setting checking interval separately from timeout. This allows to set timeout to, say, 3 minutes, but checking interval to 10 secs. The interval is controlled via a new hung_task_check_interval_secs sysctl, similar to the existing hung_task_timeout_secs sysctl. The default value of 0 results in the current behavior: checking interval is equal to timeout. [akpm@linux-foundation.org: update hung_task_timeout_max's comment] Link: http://lkml.kernel.org/r/20180611111004.203513-1-dvyukov@google.com Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22userns: use refcount_t for reference counting instead atomic_tSebastian Andrzej Siewior1-2/+3
refcount_t type and corresponding API should be used instead of atomic_t wh en the variable is used as a reference counter. This avoids accidental refcounter overflows that might lead to use-after-free situations. Link: http://lkml.kernel.org/r/20180703200141.28415-6-bigeasy@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Suggested-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-21Merge branch 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespaceLinus Torvalds1-3/+46
Pull core signal handling updates from Eric Biederman: "It was observed that a periodic timer in combination with a sufficiently expensive fork could prevent fork from every completing. This contains the changes to remove the need for that restart. This set of changes is split into several parts: - The first part makes PIDTYPE_TGID a proper pid type instead something only for very special cases. The part starts using PIDTYPE_TGID enough so that in __send_signal where signals are actually delivered we know if the signal is being sent to a a group of processes or just a single process. - With that prep work out of the way the logic in fork is modified so that fork logically makes signals received while it is running appear to be received after the fork completes" * 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (22 commits) signal: Don't send signals to tasks that don't exist signal: Don't restart fork when signals come in. fork: Have new threads join on-going signal group stops fork: Skip setting TIF_SIGPENDING in ptrace_init_task signal: Add calculate_sigpending() fork: Unconditionally exit if a fatal signal is pending fork: Move and describe why the code examines PIDNS_ADDING signal: Push pid type down into complete_signal. signal: Push pid type down into __send_signal signal: Push pid type down into send_signal signal: Pass pid type into do_send_sig_info signal: Pass pid type into send_sigio_to_task & send_sigurg_to_task signal: Pass pid type into group_send_sig_info signal: Pass pid and pid type into send_sigqueue posix-timers: Noralize good_sigevent signal: Use PIDTYPE_TGID to clearly store where file signals will be sent pid: Implement PIDTYPE_TGID pids: Move the pgrp and session pid pointers from task_struct to signal_struct kvm: Don't open code task_pid in kvm_vcpu_ioctl pids: Compute task_tgid using signal->leader_pid ...
2018-08-17fs: fsnotify: account fsnotify metadata to kmemcgShakeel Butt1-0/+37
Patch series "Directed kmem charging", v8. The Linux kernel's memory cgroup allows limiting the memory usage of the jobs running on the system to provide isolation between the jobs. All the kernel memory allocated in the context of the job and marked with __GFP_ACCOUNT will also be included in the memory usage and be limited by the job's limit. The kernel memory can only be charged to the memcg of the process in whose context kernel memory was allocated. However there are cases where the allocated kernel memory should be charged to the memcg different from the current processes's memcg. This patch series contains two such concrete use-cases i.e. fsnotify and buffer_head. The fsnotify event objects can consume a lot of system memory for large or unlimited queues if there is either no or slow listener. The events are allocated in the context of the event producer. However they should be charged to the event consumer. Similarly the buffer_head objects can be allocated in a memcg different from the memcg of the page for which buffer_head objects are being allocated. To solve this issue, this patch series introduces mechanism to charge kernel memory to a given memcg. In case of fsnotify events, the memcg of the consumer can be used for charging and for buffer_head, the memcg of the page can be charged. For directed charging, the caller can use the scope API memalloc_[un]use_memcg() to specify the memcg to charge for all the __GFP_ACCOUNT allocations within the scope. This patch (of 2): A lot of memory can be consumed by the events generated for the huge or unlimited queues if there is either no or slow listener. This can cause system level memory pressure or OOMs. So, it's better to account the fsnotify kmem caches to the memcg of the listener. However the listener can be in a different memcg than the memcg of the producer and these allocations happen in the context of the event producer. This patch introduces remote memcg charging API which the producer can use to charge the allocations to the memcg of the listener. There are seven fsnotify kmem caches and among them allocations from dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and inotify_inode_mark_cachep happens in the context of syscall from the listener. So, SLAB_ACCOUNT is enough for these caches. The objects from fsnotify_mark_connector_cachep are not accounted as they are small compared to the notification mark or events and it is unclear whom to account connector to since it is shared by all events attached to the inode. The allocations from the event caches happen in the context of the event producer. For such caches we will need to remote charge the allocations to the listener's memcg. Thus we save the memcg reference in the fsnotify_group structure of the listener. This patch has also moved the members of fsnotify_group to keep the size same, at least for 64 bit build, even with additional member by filling the holes. [shakeelb@google.com: use GFP_KERNEL_ACCOUNT rather than open-coding it] Link: http://lkml.kernel.org/r/20180702215439.211597-1-shakeelb@google.com Link: http://lkml.kernel.org/r/20180627191250.209150-2-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Amir Goldstein <amir73il@gmail.com> Cc: Greg Thelen <gthelen@google.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Roman Gushchin <guro@fb.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-09signal: Don't restart fork when signals come in.Eric W. Biederman1-0/+8
Wen Yang <wen.yang99@zte.com.cn> and majiang <ma.jiang@zte.com.cn> report that a periodic signal received during fork can cause fork to continually restart preventing an application from making progress. The code was being overly pessimistic. Fork needs to guarantee that a signal sent to multiple processes is logically delivered before the fork and just to the forking process or logically delivered after the fork to both the forking process and it's newly spawned child. For signals like periodic timers that are always delivered to a single process fork can safely complete and let them appear to logically delivered after the fork(). While examining this issue I also discovered that fork today will miss signals delivered to multiple processes during the fork and handled by another thread. Similarly the current code will also miss blocked signals that are delivered to multiple process, as those signals will not appear pending during fork. Add a list of each thread that is currently forking, and keep on that list a signal set that records all of the signals sent to multiple processes. When fork completes initialize the new processes shared_pending signal set with it. The calculate_sigpending function will see those signals and set TIF_SIGPENDING causing the new task to take the slow path to userspace to handle those signals. Making it appear as if those signals were received immediately after the fork. It is not possible to send real time signals to multiple processes and exceptions don't go to multiple processes, which means that that are no signals sent to multiple processes that require siginfo. This means it is safe to not bother collecting siginfo on signals sent during fork. The sigaction of a child of fork is initially the same as the sigaction of the parent process. So a signal the parent ignores the child will also initially ignore. Therefore it is safe to ignore signals sent to multiple processes and ignored by the forking process. Signals sent to only a single process or only a single thread and delivered during fork are treated as if they are received after the fork, and generally not dealt with. They won't cause any problems. V2: Added removal from the multiprocess list on failure. V3: Use -ERESTARTNOINTR directly V4: - Don't queue both SIGCONT and SIGSTOP - Initialize signal_struct.multiprocess in init_task - Move setting of shared_pending to before the new task is visible to signals. This prevents signals from comming in before shared_pending.signal is set to delayed.signal and being lost. V5: - rework list add and delete to account for idle threads v6: - Use sigdelsetmask when removing stop signals Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=200447 Reported-by: Wen Yang <wen.yang99@zte.com.cn> and Reported-by: majiang <ma.jiang@zte.com.cn> Fixes: 4a2c7a7837da ("[PATCH] make fork() atomic wrt pgrp/session signals") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-08-03fork: Have new threads join on-going signal group stopsEric W. Biederman1-0/+2
There are only two signals that are delivered to every member of a signal group: SIGSTOP and SIGKILL. Signal delivery requires every signal appear to be delivered either before or after a clone syscall. SIGKILL terminates the clone so does not need to be considered. Which leaves only SIGSTOP that needs to be considered when creating new threads. Today in the event of a group stop TIF_SIGPENDING will get set and the fork will restart ensuring the fork syscall participates in the group stop. A fork (especially of a process with a lot of memory) is one of the most expensive system so we really only want to restart a fork when necessary. It is easy so check to see if a SIGSTOP is ongoing and have the new thread join it immediate after the clone completes. Making it appear the clone completed happened just before the SIGSTOP. The calculate_sigpending function will see the bits set in jobctl and set TIF_SIGPENDING to ensure the new task takes the slow path to userspace. V2: The call to task_join_group_stop was moved before the new task is added to the thread group list. This should not matter as sighand->siglock is held over both the addition of the threads, the call to task_join_group_stop and do_signal_stop. But the change is trivial and it is one less thing to worry about when reading the code. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-08-03signal: Add calculate_sigpending()Eric W. Biederman1-0/+1
Add a function calculate_sigpending to test to see if any signals are pending for a new task immediately following fork. Signals have to happen either before or after fork. Today our practice is to push all of the signals to before the fork, but that has the downside that frequent or periodic signals can make fork take much much longer than normal or prevent fork from completing entirely. So we need move signals that we can after the fork to prevent that. This updates the code to set TIF_SIGPENDING on a new task if there are signals or other activities that have moved so that they appear to happen after the fork. As the code today restarts if it sees any such activity this won't immediately have an effect, as there will be no reason for it to set TIF_SIGPENDING immediately after the fork. Adding calculate_sigpending means the code in fork can safely be changed to not always restart if a signal is pending. The new calculate_sigpending function sets sigpending if there are pending bits in jobctl, pending signals, the freezer needs to freeze the new task or the live kernel patching framework need the new thread to take the slow path to userspace. I have verified that setting TIF_SIGPENDING does make a new process take the slow path to userspace before it executes it's first userspace instruction. I have looked at the callers of signal_wake_up and the code paths setting TIF_SIGPENDING and I don't see anything else that needs to be handled. The code probably doesn't need to set TIF_SIGPENDING for the kernel live patching as it uses a separate thread flag as well. But at this point it seems safer reuse the recalc_sigpending logic and get the kernel live patching folks to sort out their story later. V2: I have moved the test into schedule_tail where siglock can be grabbed and recalc_sigpending can be reused directly. Further as the last action of setting up a new task this guarantees that TIF_SIGPENDING will be properly set in the new process. The helper calculate_sigpending takes the siglock and uncontitionally sets TIF_SIGPENDING and let's recalc_sigpending clear TIF_SIGPENDING if it is unnecessary. This allows reusing the existing code and keeps maintenance of the conditions simple. Oleg Nesterov <oleg@redhat.com> suggested the movement and pointed out the need to take siglock if this code was going to be called while the new task is discoverable. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-07-25Merge branch 'sched/urgent' into sched/core, to pick up fixesIngo Molnar1-1/+1
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-22alpha: fix osf_wait4() breakageAl Viro1-1/+1
kernel_wait4() expects a userland address for status - it's only rusage that goes as a kernel one (and needs a copyout afterwards) [ Also, fix the prototype of kernel_wait4() to have that __user annotation - Linus ] Fixes: 92ebce5ac55d ("osf_wait4: switch to kernel_wait4()") Cc: stable@kernel.org # v4.13+ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-07-21signal: Pass pid and pid type into send_sigqueueEric W. Biederman1-1/+1
Make the code more maintainable by performing more of the signal related work in send_sigqueue. A quick inspection of do_timer_create will show that this code path does not lookup a thread group by a thread's pid. Making it safe to find the task pointed to by it_pid with "pid_task(it_pid, type)"; This supports the changes needed in fork to tell if a signal was sent to a single process or a group of processes. Having the pid to task transition in signal.c will also make it easier to sort out races with de_thread and and the thread group leader exiting when it comes time to address that. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-07-21pid: Implement PIDTYPE_TGIDEric W. Biederman1-3/+2
Everywhere except in the pid array we distinguish between a tasks pid and a tasks tgid (thread group id). Even in the enumeration we want that distinction sometimes so we have added __PIDTYPE_TGID. With leader_pid we almost have an implementation of PIDTYPE_TGID in struct signal_struct. Add PIDTYPE_TGID as a first class member of the pid_type enumeration and into the pids array. Then remove the __PIDTYPE_TGID special case and the leader_pid in signal_struct. The net size increase is just an extra pointer added to struct pid and an extra pair of pointers of an hlist_node added to task_struct. The effect on code maintenance is the removal of a number of special cases today and the potential to remove many more special cases as PIDTYPE_TGID gets used to it's fullest. The long term potential is allowing zombie thread group leaders to exit, which will remove a lot more special cases in the code. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-07-21pids: Move the pgrp and session pid pointers from task_struct to signal_structEric W. Biederman1-3/+23
To access these fields the code always has to go to group leader so going to signal struct is no loss and is actually a fundamental simplification. This saves a little bit of memory by only allocating the pid pointer array once instead of once for every thread, and even better this removes a few potential races caused by the fact that group_leader can be changed by de_thread, while signal_struct can not. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-07-21pids: Compute task_tgid using signal->leader_pidEric W. Biederman1-0/+5
The cost is the the same and this removes the need to worry about complications that come from de_thread and group_leader changing. __task_pid_nr_ns has been updated to take advantage of this change. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-07-21pids: Move task_pid_type into sched/signal.hEric W. Biederman1-0/+8
The function is general and inline so there is no need to hide it inside of exit.c Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-07-16sched/sysctl: Remove unused sched_time_avg_ms sysctlVincent Guittot1-1/+0
/proc/sys/kernel/sched_time_avg_ms entry is not used anywhere, remove it. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Luis R. Rodriguez <mcgrof@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten.Rasmussen@arm.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: claudio@evidence.eu.com Cc: daniel.lezcano@linaro.org Cc: dietmar.eggemann@arm.com Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: luca.abeni@santannapisa.it Cc: patrick.bellasi@arm.com Cc: quentin.perret@arm.com Cc: rjw@rjwysocki.net Cc: valentin.schneider@arm.com Cc: viresh.kumar@linaro.org Link: http://lkml.kernel.org/r/1530200714-4504-12-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-06-07lockdep: fix fs_reclaim annotationOmar Sandoval1-0/+4
While revisiting my Btrfs swapfile series [1], I introduced a situation in which reclaim would lock i_rwsem, and even though the swapon() path clearly made GFP_KERNEL allocations while holding i_rwsem, I got no complaints from lockdep. It turns out that the rework of the fs_reclaim annotation was broken: if the current task has PF_MEMALLOC set, we don't acquire the dummy fs_reclaim lock, but when reclaiming we always check this _after_ we've just set the PF_MEMALLOC flag. In most cases, we can fix this by moving the fs_reclaim_{acquire,release}() outside of the memalloc_noreclaim_{save,restore}(), althought kswapd is slightly different. After applying this, I got the expected lockdep splats. 1: https://lwn.net/Articles/625412/ Link: http://lkml.kernel.org/r/9f8aa70652a98e98d7c4de0fc96a4addcee13efe.1523778026.git.osandov@fb.com Fixes: d92a8cfcb37e ("locking/lockdep: Rework FS_RECLAIM annotation") Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-04Merge tag 'docs-4.18' of git://git.lwn.net/linuxLinus Torvalds1-2/+40
Pull documentation updates from Jonathan Corbet: "There's been a fair amount of work in the docs tree this time around, including: - Extensive RST conversions and organizational work in the memory-management docs thanks to Mike Rapoport. - An update of Documentation/features from Andrea Parri and a script to keep it updated. - Various LICENSES updates from Thomas, along with a script to check SPDX tags. - Work to fix dangling references to documentation files; this involved a fair number of one-liner comment changes outside of Documentation/ ... and the usual list of documentation improvements, typo fixes, etc" * tag 'docs-4.18' of git://git.lwn.net/linux: (103 commits) Documentation: document hung_task_panic kernel parameter docs/admin-guide/mm: add high level concepts overview docs/vm: move ksm and transhuge from "user" to "internals" section. docs: Use the kerneldoc comments for memalloc_no*() doc: document scope NOFS, NOIO APIs docs: update kernel versions and dates in tables docs/vm: transhuge: split userspace bits to admin-guide/mm/transhuge docs/vm: transhuge: minor updates docs/vm: transhuge: change sections order Documentation: arm: clean up Marvell Berlin family info Documentation: gpio: driver: Fix a typo and some odd grammar docs: ranoops.rst: fix location of ramoops.txt scripts/documentation-file-ref-check: rewrite it in perl with auto-fix mode docs: uio-howto.rst: use a code block to solve a warning mm, THP, doc: Add document for thp_swpout/thp_swpout_fallback w1: w1_io.c: fix a kernel-doc warning Documentation/process/posting: wrap text at 80 cols docs: admin-guide: add cgroup-v2 documentation Revert "Documentation/features/vm: Remove arch support status file for 'pte_special'" Documentation: refcount-vs-atomic: Update reference to LKMM doc. ...
2018-05-29doc: document scope NOFS, NOIO APIsMichal Hocko1-0/+38
Although the api is documented in the source code Ted has pointed out that there is no mention in the core-api Documentation and there are people looking there to find answers how to use a specific API. Requested-by: "Theodore Y. Ts'o" <tytso@mit.edu> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2018-05-04sched/core: Introduce set_special_state()Peter Zijlstra1-1/+1
Gaurav reported a perceived problem with TASK_PARKED, which turned out to be a broken wait-loop pattern in __kthread_parkme(), but the reported issue can (and does) in fact happen for states that do not do condition based sleeps. When the 'current->state = TASK_RUNNING' store of a previous (concurrent) try_to_wake_up() collides with the setting of a 'special' sleep state, we can loose the sleep state. Normal condition based wait-loops are immune to this problem, but for sleep states that are not condition based are subject to this problem. There already is a fix for TASK_DEAD. Abstract that and also apply it to TASK_STOPPED and TASK_TRACED, both of which are also without condition based wait-loop. Reported-by: Gaurav Kohli <gkohli@codeaurora.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-04-16Merge branch 'mm-rst' into docs-nextJonathan Corbet1-2/+2
Mike Rapoport says: These patches convert files in Documentation/vm to ReST format, add an initial index and link it to the top level documentation. There are no contents changes in the documentation, except few spelling fixes. The relatively large diffstat stems from the indentation and paragraph wrapping changes. I've tried to keep the formatting as consistent as possible, but I could miss some places that needed markup and add some markup where it was not necessary. [jc: significant conflicts in vm/hmm.rst]
2018-04-16docs/vm: rename documentation files to .rstMike Rapoport1-2/+2
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2018-04-11exec: pass stack rlimit into mm layout functionsKees Cook1-2/+4
Patch series "exec: Pin stack limit during exec". Attempts to solve problems with the stack limit changing during exec continue to be frustrated[1][2]. In addition to the specific issues around the Stack Clash family of flaws, Andy Lutomirski pointed out[3] other places during exec where the stack limit is used and is assumed to be unchanging. Given the many places it gets used and the fact that it can be manipulated/raced via setrlimit() and prlimit(), I think the only way to handle this is to move away from the "current" view of the stack limit and instead attach it to the bprm, and plumb this down into the functions that need to know the stack limits. This series implements the approach. [1] 04e35f4495dd ("exec: avoid RLIMIT_STACK races with prlimit()") [2] 779f4e1c6c7c ("Revert "exec: avoid RLIMIT_STACK races with prlimit()"") [3] to security@kernel.org, "Subject: existing rlimit races?" This patch (of 3): Since it is possible that the stack rlimit can change externally during exec (either via another thread calling setrlimit() or another process calling prlimit()), provide a way to pass the rlimit down into the per-architecture mm layout functions so that the rlimit can stay in the bprm structure instead of sitting in the signal structure until exec is finalized. Link: http://lkml.kernel.org/r/1518638796-20819-2-git-send-email-keescook@chromium.org Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Willy Tarreau <w@1wt.eu> Cc: Hugh Dickins <hughd@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Jason A. Donenfeld" <Jason@zx2c4.com> Cc: Rik van Riel <riel@redhat.com> Cc: Laura Abbott <labbott@redhat.com> Cc: Greg KH <greg@kroah.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ben Hutchings <ben.hutchings@codethink.co.uk> Cc: Brad Spengler <spender@grsecurity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-07Merge branch 'next-general' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-securityLinus Torvalds1-1/+1
Pull general security layer updates from James Morris: - Convert security hooks from list to hlist, a nice cleanup, saving about 50% of space, from Sargun Dhillon. - Only pass the cred, not the secid, to kill_pid_info_as_cred and security_task_kill (as the secid can be determined from the cred), from Stephen Smalley. - Close a potential race in kernel_read_file(), by making the file unwritable before calling the LSM check (vs after), from Kees Cook. * 'next-general' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: security: convert security hooks to use hlist exec: Set file unwritable before LSM check usb, signal, security: only pass the cred, not the secid, to kill_pid_info_as_cred and security_task_kill
2018-03-23Merge tag 'v4.16-rc6' into next-generalJames Morris2-1/+16
Merge to Linux 4.16-rc6 at the request of Jarkko, for his TPM updates.
2018-03-09sched/cpufreq: Provide migration hintPeter Zijlstra1-0/+1
It was suggested that a migration hint might be usefull for the CPU-freq governors. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09sched/nohz: Clean up nohz enter/exitPeter Zijlstra1-2/+0
The primary observation is that nohz enter/exit is always from the current CPU, therefore NOHZ_TICK_STOPPED does not in fact need to be an atomic. Secondary is that we appear to have 2 nearly identical hooks in the nohz enter code, set_cpu_sd_state_idle() and nohz_balance_enter_idle(). Fold the whole set_cpu_sd_state thing into nohz_balance_{enter,exit}_idle. Removes an atomic op from both enter and exit paths. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09cpufreq/schedutil: Rewrite CPUFREQ_RT supportPeter Zijlstra1-2/+1
Instead of trying to duplicate scheduler state to track if an RT task is running, directly use the scheduler runqueue state for it. This vastly simplifies things and fixes a number of bugs related to sugov and the scheduler getting out of sync wrt this state. As a consequence we not also update the remove cfs/dl state when iterating the shared mask. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09cpufreq/schedutil: Remove unused CPUFREQ_DLPeter Zijlstra1-2/+1
Bitrot... Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09Merge branch 'linus' into sched/core, to pick up fixesIngo Molnar2-1/+16
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-07usb, signal, security: only pass the cred, not the secid, to kill_pid_info_as_cred and security_task_killStephen Smalley1-1/+1
commit d178bc3a708f39cbfefc3fab37032d3f2511b4ec ("user namespace: usb: make usb urbs user namespace aware (v2)") changed kill_pid_info_as_uid to kill_pid_info_as_cred, saving and passing a cred structure instead of uids. Since the secid can be obtained from the cred, drop the secid fields from the usb_dev_state and async structures, and drop the secid argument to kill_pid_info_as_cred. Replace the secid argument to security_task_kill with the cred. Update SELinux, Smack, and AppArmor to use the cred, which avoids the need for Smack and AppArmor to use a secid at all in this hook. Further changes to Smack might still be required to take full advantage of this change, since it should now be possible to perform capability checking based on the supplied cred. The changes to Smack and AppArmor have only been compile-tested. Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov> Acked-by: Paul Moore <paul@paul-moore.com> Acked-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: John Johansen <john.johansen@canonical.com> Signed-off-by: James Morris <james.morris@microsoft.com>
2018-03-04sched/headers: Simplify and clean up header usage in the schedulerIngo Molnar1-6/+0
Do the following cleanups and simplifications: - sched/sched.h already includes <asm/paravirt.h>, so no need to include it in sched/core.c again. - order the <linux/sched/*.h> headers alphabetically - add all <linux/sched/*.h> headers to kernel/sched/sched.h - remove all unnecessary includes from the .c files that are already included in kernel/sched/sched.h. Finally, make all scheduler .c files use a single common header: #include "sched.h" ... which now contains a union of the relied upon headers. This makes the various .c files easier to read and easier to handle. Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-22Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-1/+12
Merge misc fixes from Andrew Morton: "16 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mm: don't defer struct page initialization for Xen pv guests lib/Kconfig.debug: enable RUNTIME_TESTING_MENU vmalloc: fix __GFP_HIGHMEM usage for vmalloc_32 on 32b systems selftests/memfd: add run_fuse_test.sh to TEST_FILES bug.h: work around GCC PR82365 in BUG() mm/swap.c: make functions and their kernel-doc agree (again) mm/zpool.c: zpool_evictable: fix mismatch in parameter name and kernel-doc ida: do zeroing in ida_pre_get() mm, swap, frontswap: fix THP swap if frontswap enabled certs/blacklist_nohashes.c: fix const confusion in certs blacklist kernel/relay.c: limit kmalloc size to KMALLOC_MAX_SIZE mm, mlock, vmscan: no more skipping pagevecs mm: memcontrol: fix NR_WRITEBACK leak in memcg and system stats Kbuild: always define endianess in kconfig.h include/linux/sched/mm.h: re-inline mmdrop() tools: fix cross-compile var clobbering