aboutsummaryrefslogtreecommitdiffstats
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2018-09-15Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds4-5/+3
Pull locking fixes from Ingo Molnar: "Misc fixes: liblockdep fixes and ww_mutex fixes" * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/ww_mutex: Fix spelling mistake "cylic" -> "cyclic" locking/lockdep: Delete unnecessary #include tools/lib/lockdep: Add dummy task_struct state member tools/lib/lockdep: Add empty nmi.h tools/lib/lockdep: Update Sasha Levin email to MSFT jump_label: Fix typo in warning message locking/mutex: Fix mutex debug call and ww_mutex documentation
2018-09-14flow_dissector: implements flow dissector BPF hookPetar Penkov2-0/+40
Adds a hook for programs of type BPF_PROG_TYPE_FLOW_DISSECTOR and attach type BPF_FLOW_DISSECTOR that is executed in the flow dissector path. The BPF program is per-network namespace. Signed-off-by: Petar Penkov <ppenkov@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-09-13Merge tag 'printk-for-4.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printkLinus Torvalds1-7/+5
Pull printk fix from Petr Mladek: "Revert a commit that caused "quiet", "debug", and "loglevel" early parameters to be ignored for early boot messages" * tag 'printk-for-4.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk: Revert "printk: make sure to print log on console."
2018-09-12Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller4-19/+39
2018-09-12bpf/verifier: disallow pointer subtractionAlexei Starovoitov1-1/+1
Subtraction of pointers was accidentally allowed for unpriv programs by commit 82abbf8d2fc4. Revert that part of commit. Fixes: 82abbf8d2fc4 ("bpf: do not allow root to mangle valid pointers") Reported-by: Jann Horn <jannh@google.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-09-12bpf: btf: Fix end boundary calculation for type sectionMartin KaFai Lau1-1/+1
The end boundary math for type section is incorrect in btf_check_all_metas(). It just happens that hdr->type_off is always 0 for now because there are only two sections (type and string) and string section must be at the end (ensured in btf_parse_str_sec). However, type_off may not be 0 if a new section would be added later. This patch fixes it. Fixes: f80442a4cd18 ("bpf: btf: Change how section is supported in btf_header") Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-09-12kprobes: Don't call BUG_ON() if there is a kprobe in use on free listMasami Hiramatsu1-1/+7
Instead of calling BUG_ON(), if we find a kprobe in use on free kprobe list, just remove it from the list and keep it on kprobe hash list as same as other in-use kprobes. Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: David S . Miller <davem@davemloft.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Naveen N . Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/153666126882.21306.10738207224288507996.stgit@devbox Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-12kprobes: Return error if we fail to reuse kprobe instead of BUG_ON()Masami Hiramatsu1-7/+20
Make reuse_unused_kprobe() to return error code if it fails to reuse unused kprobe for optprobe instead of calling BUG_ON(). Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: David S . Miller <davem@davemloft.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Naveen N . Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/153666124040.21306.14150398706331307654.stgit@devbox Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-12kprobes: Remove pointless BUG_ON() from reuse_unused_kprobe()Masami Hiramatsu1-1/+0
Since reuse_unused_kprobe() is called when the given kprobe is unused, checking it inside again with BUG_ON() is pointless. Remove it. Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: David S . Miller <davem@davemloft.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Naveen N . Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/153666121154.21306.17540752948574483565.stgit@devbox Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-12kprobes: Remove pointless BUG_ON() from add_new_kprobe()Masami Hiramatsu1-2/+0
Before calling add_new_kprobe(), aggr_probe's GONE flag and kprobe GONE flag are cleared. We don't need to worry about that flag at this point. Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: David S . Miller <davem@davemloft.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Naveen N . Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/153666118298.21306.4915366706875652652.stgit@devbox Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-12kprobes: Remove pointless BUG_ON() from disarming processMasami Hiramatsu1-1/+0
All aggr_probes at this line are already disarmed by disable_kprobe() or checked by kprobe_disarmed(). So this BUG_ON() is pointless, remove it. Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: David S . Miller <davem@davemloft.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Naveen N . Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/153666115463.21306.8799008438116029806.stgit@devbox Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-11bpf: add bpffs pretty print for program array mapYonghong Song1-1/+24
Added bpffs pretty print for program array map. For a particular array index, if the program array points to a valid program, the "<index>: <prog_id>" will be printed out like 0: 6 which means bpf program with id "6" is installed at index "0". Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-09-11signal: Remove SEND_SIG_FORCEDEric W. Biederman1-4/+3
There are no more users of SEND_SIG_FORCED so it may be safely removed. Remove the definition of SEND_SIG_FORCED, it's use in is_si_special, it's use in TP_STORE_SIGINFO, and it's use in __send_signal as without any users the uses of SEND_SIG_FORCED are now unncessary. This makes the code simpler, easier to understand and use. Users of signal sending functions now no longer need to ask themselves do I need to use SEND_SIG_FORCED. Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-09-11signal: Use SEND_SIG_PRIV not SEND_SIG_FORCED with SIGKILL and SIGSTOPEric W. Biederman2-3/+3
Now that siginfo is never allocated for SIGKILL and SIGSTOP there is no difference between SEND_SIG_PRIV and SEND_SIG_FORCED for SIGKILL and SIGSTOP. This makes SEND_SIG_FORCED unnecessary and redundant in the presence of SIGKILL and SIGSTOP. Therefore change users of SEND_SIG_FORCED that are sending SIGKILL or SIGSTOP to use SEND_SIG_PRIV instead. This removes the last users of SEND_SIG_FORCED. Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-09-11signal: Never allocate siginfo for SIGKILL or SIGSTOPEric W. Biederman1-3/+4
The SIGKILL and SIGSTOP signals are never delivered to userspace so queued siginfo for these signals can never be observed. Therefore remove the chance of failure by never even attempting to allocate siginfo in those cases. Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-09-11signal: Don't send siginfo to kthreads.Eric W. Biederman1-1/+1
Today kernel threads never dequeue siginfo so it is pointless to enqueue siginfo for them. The usb gadget mass storage driver goes one farther and uses SEND_SIG_FORCED to guarantee that no siginfo is even enqueued. Generalize the optimization of the usb mass storage driver and never perform an unnecessary allocation when delivering signals to kthreads. Switch the mass storage driver from sending signals with SEND_SIG_FORCED to SEND_SIG_PRIV. As using SEND_SIG_FORCED is now unnecessary. Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-09-11signal: Always deliver the kernel's SIGKILL and SIGSTOP to a pid namespace initEric W. Biederman1-1/+1
Instead of playing whack-a-mole and changing SEND_SIG_PRIV to SEND_SIG_FORCED throughout the kernel to ensure a pid namespace init gets signals sent by the kernel, stop allowing a pid namespace init to ignore SIGKILL or SIGSTOP sent by the kernel. A pid namespace init is only supposed to be able to ignore signals sent from itself and children with SIG_DFL. Fixes: 921cf9f63089 ("signals: protect cinit from unblocked SIG_DFL signals") Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-09-11signal: Properly deliver SIGILL from uprobesEric W. Biederman1-2/+2
For userspace to tell the difference between a random signal and an exception, the exception must include siginfo information. Using SEND_SIG_FORCED for SIGILL is thus wrong, and it will result in userspace seeing si_code == SI_USER (like a random signal) instead of si_code == SI_KERNEL or a more specific si_code as all exceptions deliver. Therefore replace force_sig_info(SIGILL, SEND_SIG_FORCE, current) with force_sig(SIG_ILL, current) which gets this right and is shorter and easier to type. Fixes: 014940bad8e4 ("uprobes/x86: Send SIGILL if arch_uprobe_post_xol() fails") Fixes: 0b5256c7f173 ("uprobes: Send SIGILL if handle_trampoline() fails") Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-09-11signal: Always ignore SIGKILL and SIGSTOP sent to the global initEric W. Biederman1-0/+4
If the first process started (aka /sbin/init) receives a SIGKILL it will panic the system if it is delivered. Making the system unusable and undebugable. It isn't much better if the first process started receives SIGSTOP. So always ignore SIGSTOP and SIGKILL sent to init. This is done in a separate clause in sig_task_ignored as force_sig_info can clear SIG_UNKILLABLE and this protection should work even then. Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-09-11locking/lockdep, cpu/hotplug: Annotate AP threadPeter Zijlstra1-0/+28
Anybody trying to assert the cpu_hotplug_lock is held (lockdep_assert_cpus_held()) from AP callbacks will fail, because the lock is held by the BP. Stick in an explicit annotation in cpuhp_thread_fun() to make this work. Reported-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-tip-commits@vger.kernel.org Fixes: cb538267ea1e ("jump_label/lockdep: Assert we hold the hotplug lock for _cpuslocked() operations") Link: http://lkml.kernel.org/r/20180911095127.GT24082@hirez.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-11kernel/reboot.c: export pm_power_off_prepareOleksij Rempel1-0/+1
Export pm_power_off_prepare. It is needed to implement power off on Freescale/NXP iMX6 based boards with external power management integrated circuit (PMIC). Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: Mark Brown <broonie@kernel.org>
2018-09-11Revert "printk: make sure to print log on console."Petr Mladek1-7/+5
This reverts commit 375899cddcbb26881b03cb3fbdcfd600e4e67f4a. The visibility of early messages did not longer take into account "quiet", "debug", and "loglevel" early parameters. It would be possible to invalidate and recompute LOG_NOCONS flag for the affected messages. But it would be hairy. Instead this patch just reverts the problematic commit. We could come up with a better solution for the original problem. For example, we could simplify the logic and just mark messages that should always be visible or always invisible on the console. Also this patch reverts the related build fix commit ffaa619af1b06 ("printk: Fix warning about unused suppress_message_printing"). Finally, this patch does not put back the unused LOG_NOCONS flag. Link: http://lkml.kernel.org/r/20180910145747.emvfzv4mzlk5dfqk@pathway.suse.cz Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: x86@kernel.org Cc: linux-kernel@vger.kernel.org Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Maninder Singh <maninder1.s@samsung.com> Reported-by: Hans de Goede <hdegoede@redhat.com> Acked-by: Hans de Goede <hdegoede@redhat.com> Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Petr Mladek <pmladek@suse.com>
2018-09-11locking/rtmutex: Fix the preprocessor logic with normal #ifdef #else #endifSteven Rostedt (VMware)1-2/+2
Merging v4.14.68 into v4.14-rt I tripped over a conflict in the rtmutex.c code. There I found that we had: #ifdef CONFIG_DEBUG_LOCK_ALLOC [..] #endif #ifndef CONFIG_DEBUG_LOCK_ALLOC [..] #endif Really this should be: #ifdef CONFIG_DEBUG_LOCK_ALLOC [..] #else [..] #endif This cleans up that logic. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Rosin <peda@axentia.se> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180910214638.55926030@vmware.local.home Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10kallsyms: reduce size a little on 64-bitJan Beulich1-2/+2
Both kallsyms_num_syms and kallsyms_markers[] don't really need to use unsigned long as their (base) types; unsigned int fully suffices. Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2018-09-10sched/topology: Make local variables staticzhong jiang1-2/+2
Fix the following warnings: kernel/sched/topology.c:10:15: warning: symbol 'sched_domains_tmpmask' was not declared. Should it be static? kernel/sched/topology.c:11:15: warning: symbol 'sched_domains_tmpmask2' was not declared. Should it be static? Signed-off-by: zhong jiang <zhongjiang@huawei.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1533299852-26941-1-git-send-email-zhongjiang@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10perf/core: Force USER_DS when recording user stack dataYabin Cui1-0/+4
Perf can record user stack data in response to a synchronous request, such as a tracepoint firing. If this happens under set_fs(KERNEL_DS), then we end up reading user stack data using __copy_from_user_inatomic() under set_fs(KERNEL_DS). I think this conflicts with the intention of using set_fs(KERNEL_DS). And it is explicitly forbidden by hardware on ARM64 when both CONFIG_ARM64_UAO and CONFIG_ARM64_PAN are used. So fix this by forcing USER_DS when recording user stack data. Signed-off-by: Yabin Cui <yabinc@google.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 88b0193d9418 ("perf/callchain: Force USER_DS when invoking perf_callchain_user()") Link: http://lkml.kernel.org/r/20180823225935.27035-1-yabinc@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10locking/ww_mutex: Fix spelling mistake "cylic" -> "cyclic"Colin Ian King1-1/+1
Trivial fix to spelling mistake in pr_err() error message Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Will Deacon <will.deacon@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: kernel-janitors@vger.kernel.org Link: http://lkml.kernel.org/r/20180824112235.8842-1-colin.king@canonical.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10locking/lockdep: Delete unnecessary #includeBen Hutchings1-1/+0
Commit: c3bc8fd637a9 ("tracing: Centralize preemptirq tracepoints and unify their usage") added the inclusion of <trace/events/preemptirq.h>. liblockdep doesn't have a stub version of that header so now fails to build. However, commit: bff1b208a5d1 ("tracing: Partial revert of "tracing: Centralize preemptirq tracepoints and unify their usage"") removed the use of functions declared in that header. So delete the #include. Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sasha Levin <alexander.levin@verizon.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Fixes: bff1b208a5d1 ("tracing: Partial revert of "tracing: Centralize ...") Fixes: c3bc8fd637a9 ("tracing: Centralize preemptirq tracepoints ...") Link: http://lkml.kernel.org/r/20180828203315.GD18030@decadent.org.uk Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10PM / sleep: Show freezing tasks that caused a suspend abortTodd Brandt1-1/+1
For debug purposes it would be nice to see which tasks caused a suspend abort, i.e. which tasks were still in the process of freezing when a wakeup event occurred. This patch adds the info to pm_debug_messages. Signed-off-by: Todd Brandt <todd.e.brandt@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-09-10locking/rwsem: Make owner store task pointer of last owning readerWaiman Long3-28/+76
Currently, when a reader acquires a lock, it only sets the RWSEM_READER_OWNED bit in the owner field. The other bits are simply not used. When debugging hanging cases involving rwsems and readers, the owner value does not provide much useful information at all. This patch modifies the current behavior to always store the task_struct pointer of the last rwsem-acquiring reader in a reader-owned rwsem. This may be useful in debugging rwsem hanging cases especially if only one reader is involved. However, the task in the owner field may not the real owner or one of the real owners at all when the owner value is examined, for example, in a crash dump. So it is just an additional hint about the past history. If CONFIG_DEBUG_RWSEMS=y is enabled, the owner field will be checked at unlock time too to make sure the task pointer value is valid. That does have a slight performance cost and so is only enabled as part of that debug option. From the performance point of view, it is expected that the changes shouldn't have any noticeable performance impact. A rwsem microbenchmark (with 48 worker threads and 1:1 reader/writer ratio) was ran on a 2-socket 24-core 48-thread Haswell system. The locking rates on a 4.19-rc1 based kernel were as follows: 1) Unpatched kernel: 543.3 kops/s 2) Patched kernel: 549.2 kops/s 3) Patched kernel (CONFIG_DEBUG_RWSEMS on): 546.6 kops/s There was actually a slight increase in performance (1.1%) in this particular case. Maybe it was caused by the elimination of a branch or just a testing noise. Turning on the CONFIG_DEBUG_RWSEMS option also had less than the expected impact on performance. The least significant 2 bits of the owner value are now used to designate the rwsem is readers owned and the owners are anonymous. Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Link: http://lkml.kernel.org/r/1536265114-10842-1-git-send-email-longman@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10sched/numa: Remove unused numa_stats::nr_running fieldVincent Guittot1-3/+0
nr_running in struct numa_stats is not used anywhere in the code. Remove it. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1535548752-4434-3-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10sched/numa: Remove unused code from update_numa_stats()Vincent Guittot1-20/+1
With: commit 2d4056fafa19 ("sched/numa: Remove numa_has_capacity()") the local variables 'smt', 'cpus' and 'capacity' and their results are not used anymore in numa_has_capacity() Remove this unused code. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1535548752-4434-2-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10sched/debug: Explicitly cast sched_feat() to boolPeter Zijlstra1-1/+1
LLVM has a warning that tags expressions like: if (foo && non-bool-const) This pattern triggers for CONFIG_SCHED_DEBUG=n where sched_feat() ends up being whatever bit we select. Avoid the warning with an explicit cast to bool. Reported-by: Philipp Klocke <philipp97kl@gmail.com> Tested-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10sched/core: Disable SD_PREFER_SIBLING on asymmetric CPU capacity domainsMorten Rasmussen1-4/+8
The 'prefer sibling' sched_domain flag is intended to encourage spreading tasks to sibling sched_domain to take advantage of more caches and core for SMT systems. It has recently been changed to be on all non-NUMA topology level. However, spreading across domains with CPU capacity asymmetry isn't desirable, e.g. spreading from high capacity to low capacity CPUs even if high capacity CPUs aren't overutilized might give access to more cache but the CPU will be slower and possibly lead to worse overall throughput. To prevent this, we need to remove SD_PREFER_SIBLING on the sched_domain level immediately below SD_ASYM_CPUCAPACITY. Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: gaku.inami.xh@renesas.com Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1530699470-29808-13-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10sched/fair: Don't move tasks to lower capacity CPUs unless necessaryChris Redpath1-0/+11
When lower capacity CPUs are load balancing and considering to pull something from a higher capacity group, we should not pull tasks from a CPU with only one task running as this is guaranteed to impede progress for that task. If there is more than one task running, load balance in the higher capacity group would have already made any possible moves to resolve imbalance and we should make better use of system compute capacity by moving a task if we still have more than one running. Signed-off-by: Chris Redpath <chris.redpath@arm.com> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: gaku.inami.xh@renesas.com Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1530699470-29808-11-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10sched/fair: Set rq->rd->overload when misfitValentin Schneider2-3/+9
Idle balance is a great opportunity to pull a misfit task. However, there are scenarios where misfit tasks are present but idle balance is prevented by the overload flag. A good example of this is a workload of n identical tasks. Let's suppose we have a 2+2 Arm big.LITTLE system. We then spawn 4 fairly CPU-intensive tasks - for the sake of simplicity let's say they are just CPU hogs, even when running on big CPUs. They are identical tasks, so on an SMP system they should all end at (roughly) the same time. However, in our case the LITTLE CPUs are less performing than the big CPUs, so tasks running on the LITTLEs will have a longer completion time. This means that the big CPUs will complete their work earlier, at which point they should pull the tasks from the LITTLEs. What we want to happen is summarized as follows: a,b,c,d are our CPU-hogging tasks _ signifies idling LITTLE_0 | a a a a _ _ LITTLE_1 | b b b b _ _ ---------|------------- big_0 | c c c c a a big_1 | d d d d b b ^ ^ Tasks end on the big CPUs, idle balance happens and the misfit tasks are pulled straight away This however won't happen, because currently the overload flag is only set when there is any CPU that has more than one runnable task - which may very well not be the case here if our CPU-hogging workload is all there is to run. As such, this commit sets the overload flag in update_sg_lb_stats when a group is flagged as having a misfit task. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: gaku.inami.xh@renesas.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1530699470-29808-10-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10sched/fair: Wrap rq->rd->overload accesses with READ/WRITE_ONCE()Valentin Schneider2-5/+5
This variable can be read and set locklessly within update_sd_lb_stats(). As such, READ/WRITE_ONCE() are added to make sure nothing terribly wrong can happen because of the compiler. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: gaku.inami.xh@renesas.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1530699470-29808-9-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10sched/core: Change root_domain->overload type to intValentin Schneider1-2/+2
sizeof(_Bool) is implementation defined, so let's just go with 'int' as is done for other structures e.g. sched_domain_shared->has_idle_cores. The local 'overload' variable used in update_sd_lb_stats can remain bool, as it won't impact any struct layout and can be assigned to the root_domain field. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: gaku.inami.xh@renesas.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1530699470-29808-8-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10sched/fair: Change 'prefer_sibling' type to boolValentin Schneider1-4/+2
This variable is entirely local to update_sd_lb_stats, so we can safely change its type and slightly clean up its initialisation. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: gaku.inami.xh@renesas.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1530699470-29808-7-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10sched/fair: Kick nohz balance if rq->misfit_task_loadValentin Schneider1-1/+1
There already are a few conditions in nohz_kick_needed() to ensure a nohz kick is triggered, but they are not enough for some misfit task scenarios. Excluding asym packing, those are: - rq->nr_running >=2: Not relevant here because we are running a misfit task, it needs to be migrated regardless and potentially through active balance. - sds->nr_busy_cpus > 1: If there is only the misfit task being run on a group of low capacity CPUs, this will be evaluated to False. - rq->cfs.h_nr_running >=1 && check_cpu_capacity(): Not relevant here, misfit task needs to be migrated regardless of rt/IRQ pressure As such, this commit adds an rq->misfit_task_load condition to trigger a nohz kick. The idea to kick a nohz balance for misfit tasks originally came from Leo Yan <leo.yan@linaro.org>, and a similar patch was submitted for the Android Common Kernel - see: https://lists.linaro.org/pipermail/eas-dev/2016-September/000551.html Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: gaku.inami.xh@renesas.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1530699470-29808-6-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10sched/fair: Consider misfit tasks when load-balancingMorten Rasmussen1-2/+49
On asymmetric CPU capacity systems load intensive tasks can end up on CPUs that don't suit their compute demand. In this scenarios 'misfit' tasks should be migrated to CPUs with higher compute capacity to ensure better throughput. group_misfit_task indicates this scenario, but tweaks to the load-balance code are needed to make the migrations happen. Misfit balancing only makes sense between a source group of lower per-CPU capacity and destination group of higher compute capacity. Otherwise, misfit balancing is ignored. group_misfit_task has lowest priority so any imbalance due to overload is dealt with first. The modifications are: 1. Only pick a group containing misfit tasks as the busiest group if the destination group has higher capacity and has spare capacity. 2. When the busiest group is a 'misfit' group, skip the usual average load and group capacity checks. 3. Set the imbalance for 'misfit' balancing sufficiently high for a task to be pulled ignoring average load. 4. Pick the CPU with the highest misfit load as the source CPU. 5. If the misfit task is alone on the source CPU, go for active balancing. Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: gaku.inami.xh@renesas.com Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1530699470-29808-5-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10sched/fair: Add sched_group per-CPU max capacityMorten Rasmussen3-4/+23
The current sg->min_capacity tracks the lowest per-CPU compute capacity available in the sched_group when rt/irq pressure is taken into account. Minimum capacity isn't the ideal metric for tracking if a sched_group needs offloading to another sched_group for some scenarios, e.g. a sched_group with multiple CPUs if only one is under heavy pressure. Tracking maximum capacity isn't perfect either but a better choice for some situations as it indicates that the sched_group definitely compute capacity constrained either due to rt/irq pressure on all CPUs or asymmetric CPU capacities (e.g. big.LITTLE). Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: gaku.inami.xh@renesas.com Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1530699470-29808-4-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10sched/fair: Add 'group_misfit_task' load-balance typeMorten Rasmussen2-8/+48
To maximize throughput in systems with asymmetric CPU capacities (e.g. ARM big.LITTLE) load-balancing has to consider task and CPU utilization as well as per-CPU compute capacity when load-balancing in addition to the current average load based load-balancing policy. Tasks with high utilization that are scheduled on a lower capacity CPU need to be identified and migrated to a higher capacity CPU if possible to maximize throughput. To implement this additional policy an additional group_type (load-balance scenario) is added: 'group_misfit_task'. This represents scenarios where a sched_group has one or more tasks that are not suitable for its per-CPU capacity. 'group_misfit_task' is only considered if the system is not overloaded or imbalanced ('group_imbalanced' or 'group_overloaded'). Identifying misfit tasks requires the rq lock to be held. To avoid taking remote rq locks to examine source sched_groups for misfit tasks, each CPU is responsible for tracking misfit tasks themselves and update the rq->misfit_task flag. This means checking task utilization when tasks are scheduled and on sched_tick. Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: gaku.inami.xh@renesas.com Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1530699470-29808-3-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10sched/topology: Add static_key for asymmetric CPU capacity optimizationsMorten Rasmussen3-1/+12
The existing asymmetric CPU capacity code should cause minimal overhead for others. Putting it behind a static_key, it has been done for SMT optimizations, would make it easier to extend and improve without causing harm to others moving forward. Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: gaku.inami.xh@renesas.com Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1530699470-29808-2-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10sched/topology: Add SD_ASYM_CPUCAPACITY flag detectionMorten Rasmussen1-6/+75
The SD_ASYM_CPUCAPACITY sched_domain flag is supposed to mark the sched_domain in the hierarchy where all CPU capacities are visible for any CPU's point of view on asymmetric CPU capacity systems. The scheduler can then take to take capacity asymmetry into account when balancing at this level. It also serves as an indicator for how wide task placement heuristics have to search to consider all available CPU capacities as asymmetric systems might often appear symmetric at smallest level(s) of the sched_domain hierarchy. The flag has been around for while but so far only been set by out-of-tree code in Android kernels. One solution is to let each architecture provide the flag through a custom sched_domain topology array and associated mask and flag functions. However, SD_ASYM_CPUCAPACITY is special in the sense that it depends on the capacity and presence of all CPUs in the system, i.e. when hotplugging all CPUs out except those with one particular CPU capacity the flag should disappear even if the sched_domains don't collapse. Similarly, the flag is affected by cpusets where load-balancing is turned off. Detecting when the flags should be set therefore depends not only on topology information but also the cpuset configuration and hotplug state. The arch code doesn't have easy access to the cpuset configuration. Instead, this patch implements the flag detection in generic code where cpusets and hotplug state is already taken care of. All the arch is responsible for is to implement arch_scale_cpu_capacity() and force a full rebuild of the sched_domain hierarchy if capacities are updated, e.g. later in the boot process when cpufreq has initialized. Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1532093554-30504-2-git-send-email-morten.rasmussen@arm.com [ Fixed 'CPU' capitalization. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10sched/fair: Fix kernel-doc notation warningRandy Dunlap1-0/+1
Fix kernel-doc warning for missing 'flags' parameter description: ../kernel/sched/fair.c:3371: warning: Function parameter or member 'flags' not described in 'attach_entity_load_avg' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: ea14b57e8a18 ("sched/cpufreq: Provide migration hint") Link: http://lkml.kernel.org/r/cdda0d42-880d-4229-a9f7-5899c977a063@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10locking/rwsem: Exit read lock slowpath if queue empty & no writerWaiman Long1-1/+12
It was discovered that a constant stream of readers with occassional writers pounding on a rwsem may cause many of the readers to enter the slowpath unnecessarily thus increasing latency and lowering performance. In the current code, a reader entering the slowpath critical section will unconditionally set the WAITING_BIAS, if not set yet, and clear its active count even if no one is in the wait queue and no writer is present. This causes some incoming readers to observe the presence of waiters in the wait queue and hence have to go into the slowpath themselves. With sufficient numbers of readers and a relatively short lock hold time, the WAITING_BIAS may be repeatedly turned on and off and a substantial portion of the readers will go into the slowpath sustaining a rather long queue in the wait queue spinlock and repeated WAITING_BIAS on/off cycle until the logjam is broken opportunistically. To avoid this situation from happening, an additional check is added to detect the special case that the reader in the critical section is the only one in the wait queue and no writer is present. When that happens, it can just exit the slowpath and return immediately as its active count has already been set in the lock. Other incoming readers won't observe the presence of waiters and so will not be forced into the slowpath. The issue was found in a customer site where they had an application that pounded on the pread64 syscalls heavily on an XFS filesystem. The application was run in a recent 4-socket boxes with a lot of CPUs. They saw significant spinlock contention in the rwsem_down_read_failed() call. With this patch applied, the system CPU usage went down from 85% to 57%, and the spinlock contention in the pread64 syscalls was gone. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Will Deacon <will.deacon@arm.com> Cc: Joe Mario <jmario@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1532459425-19204-1-git-send-email-longman@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10jump_label/lockdep: Assert we hold the hotplug lock for _cpuslocked() operationsPeter Zijlstra1-0/+5
Weirdly we seem to have forgotten this... Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10Merge branch 'locking/urgent' into locking/core, to pick up fixesIngo Molnar2-3/+2
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10jump_label: Fix typo in warning messageBorislav Petkov1-1/+1
There's no 'allocatote' - use the next best thing: 'allocate' :-) Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Jason Baron <jbaron@akamai.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180907103521.31344-1-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>