aboutsummaryrefslogtreecommitdiffstats
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2019-11-13perf/core: Fix missing static inline on perf_cgroup_switch()Ben Dooks (Codethink)1-1/+1
It looks like a "static inline" has been missed in front of the empty definition of perf_cgroup_switch() under certain configurations. Fixes the following sparse warning: kernel/events/core.c:1035:1: warning: symbol 'perf_cgroup_switch' was not declared. Should it be static? Signed-off-by: Ben Dooks (Codethink) <ben.dooks@codethink.co.uk> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: https://lkml.kernel.org/r/20191106132527.19977-1-ben.dooks@codethink.co.uk Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-13perf/core: Consistently fail fork on allocation failuresAlexander Shishkin1-1/+1
Commit: 313ccb9615948 ("perf: Allocate context task_ctx_data for child event") makes the inherit path skip over the current event in case of task_ctx_data allocation failure. This, however, is inconsistent with allocation failures in perf_event_alloc(), which would abort the fork. Correct this by returning an error code on task_ctx_data allocation failure and failing the fork in that case. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: https://lkml.kernel.org/r/20191105075702.60319-1-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-13perf/aux: Disallow aux_output for kernel eventsAlexander Shishkin1-0/+7
Commit ab43762ef0109 ("perf: Allow normal events to output AUX data") added 'aux_output' bit to the attribute structure, which relies on AUX events and grouping, neither of which is supported for the kernel events. This notwithstanding, attempts have been made to use it in the kernel code, suggesting the necessity of an explicit hard -EINVAL. Fix this by rejecting attributes with aux_output set for kernel events. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: https://lkml.kernel.org/r/20191030134731.5437-3-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-13perf/core: Reattach a misplaced commentAlexander Shishkin1-4/+3
A comment is in a wrong place in perf_event_create_kernel_counter(). Fix that. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: https://lkml.kernel.org/r/20191030134731.5437-2-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-13perf/aux: Fix the aux_output group inheritance fixAlexander Shishkin1-1/+1
Commit f733c6b508bc ("perf/core: Fix inheritance of aux_output groups") adds a NULL pointer dereference in case inherit_group() races with perf_release(), which causes the below crash: > BUG: kernel NULL pointer dereference, address: 000000000000010b > #PF: supervisor read access in kernel mode > #PF: error_code(0x0000) - not-present page > PGD 3b203b067 P4D 3b203b067 PUD 3b2040067 PMD 0 > Oops: 0000 [#1] SMP KASAN > CPU: 0 PID: 315 Comm: exclusive-group Tainted: G B 5.4.0-rc3-00181-g72e1839403cb-dirty #878 > RIP: 0010:perf_get_aux_event+0x86/0x270 > Call Trace: > ? __perf_read_group_add+0x3b0/0x3b0 > ? __kasan_check_write+0x14/0x20 > ? __perf_event_init_context+0x154/0x170 > inherit_task_group.isra.0.part.0+0x14b/0x170 > perf_event_init_task+0x296/0x4b0 Fix this by skipping over events that are getting closed, in the inheritance path. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Fixes: f733c6b508bc ("perf/core: Fix inheritance of aux_output groups") Link: https://lkml.kernel.org/r/20191101151248.47327-1-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-13perf/core: Disallow uncore-cgroup eventsPeter Zijlstra1-0/+9
While discussing uncore event scheduling, I noticed we do not in fact seem to dis-allow making uncore-cgroup events. Such events make no sense what so ever because the cgroup is a CPU local state where uncore counts across a number of CPUs. Disallow them. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-13sched/pelt: Fix update of blocked PELT orderingVincent Guittot1-9/+20
update_cfs_rq_load_avg() can call cpufreq_update_util() to trigger an update of the frequency. Make sure that RT, DL and IRQ PELT signals have been updated before calling cpufreq. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: dsmythies@telus.net Cc: juri.lelli@redhat.com Cc: mgorman@suse.de Cc: rostedt@goodmis.org Fixes: 371bf4273269 ("sched/rt: Add rt_rq utilization tracking") Fixes: 3727e0e16340 ("sched/dl: Add dl_rq utilization tracking") Fixes: 91c27493e78d ("sched/irq: Add IRQ utilization tracking") Link: https://lkml.kernel.org/r/1572434309-32512-1-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-13sched/core: Avoid spurious lock dependenciesPeter Zijlstra1-1/+2
While seemingly harmless, __sched_fork() does hrtimer_init(), which, when DEBUG_OBJETS, can end up doing allocations. This then results in the following lock order: rq->lock zone->lock.rlock batched_entropy_u64.lock Which in turn causes deadlocks when we do wakeups while holding that batched_entropy lock -- as the random code does. Solve this by moving __sched_fork() out from under rq->lock. This is safe because nothing there relies on rq->lock, as also evident from the other __sched_fork() callsite. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qian Cai <cai@lca.pw> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: akpm@linux-foundation.org Cc: bigeasy@linutronix.de Cc: cl@linux.com Cc: keescook@chromium.org Cc: penberg@kernel.org Cc: rientjes@google.com Cc: thgarnie@google.com Cc: tytso@mit.edu Cc: will@kernel.org Fixes: b7d5dc21072c ("random: add a spinlock_t to struct batched_entropy") Link: https://lkml.kernel.org/r/20191001091837.GK4536@hirez.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-12Merge branch 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-1/+26
Pull x86 TSX Async Abort and iTLB Multihit mitigations from Thomas Gleixner: "The performance deterioration departement is not proud at all of presenting the seventh installment of speculation mitigations and hardware misfeature workarounds: 1) TSX Async Abort (TAA) - 'The Annoying Affair' TAA is a hardware vulnerability that allows unprivileged speculative access to data which is available in various CPU internal buffers by using asynchronous aborts within an Intel TSX transactional region. The mitigation depends on a microcode update providing a new MSR which allows to disable TSX in the CPU. CPUs which have no microcode update can be mitigated by disabling TSX in the BIOS if the BIOS provides a tunable. Newer CPUs will have a bit set which indicates that the CPU is not vulnerable, but the MSR to disable TSX will be available nevertheless as it is an architected MSR. That means the kernel provides the ability to disable TSX on the kernel command line, which is useful as TSX is a truly useful mechanism to accelerate side channel attacks of all sorts. 2) iITLB Multihit (NX) - 'No eXcuses' iTLB Multihit is an erratum where some Intel processors may incur a machine check error, possibly resulting in an unrecoverable CPU lockup, when an instruction fetch hits multiple entries in the instruction TLB. This can occur when the page size is changed along with either the physical address or cache type. A malicious guest running on a virtualized system can exploit this erratum to perform a denial of service attack. The workaround is that KVM marks huge pages in the extended page tables as not executable (NX). If the guest attempts to execute in such a page, the page is broken down into 4k pages which are marked executable. The workaround comes with a mechanism to recover these shattered huge pages over time. Both issues come with full documentation in the hardware vulnerabilities section of the Linux kernel user's and administrator's guide. Thanks to all patch authors and reviewers who had the extraordinary priviledge to be exposed to this nuisance. Special thanks to Borislav Petkov for polishing the final TAA patch set and to Paolo Bonzini for shepherding the KVM iTLB workarounds and providing also the backports to stable kernels for those!" * 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/speculation/taa: Fix printing of TAA_MSG_SMT on IBRS_ALL CPUs Documentation: Add ITLB_MULTIHIT documentation kvm: x86: mmu: Recovery of shattered NX large pages kvm: Add helper function for creating VM worker threads kvm: mmu: ITLB_MULTIHIT mitigation cpu/speculation: Uninline and export CPU mitigations helpers x86/cpu: Add Tremont to the cpu vulnerability whitelist x86/bugs: Add ITLB_MULTIHIT bug infrastructure x86/tsx: Add config options to set tsx=on|off|auto x86/speculation/taa: Add documentation for TSX Async Abort x86/tsx: Add "auto" option to the tsx= cmdline parameter kvm/x86: Export MDS_NO=0 to guests when TSX is enabled x86/speculation/taa: Add sysfs reporting for TSX Async Abort x86/speculation/taa: Add mitigation for TSX Async Abort x86/cpu: Add a "tsx=" cmdline option with TSX disabled by default x86/cpu: Add a helper function x86_read_arch_cap_msr() x86/msr: Add the IA32_TSX_CTRL MSR
2019-11-12cgroup: use cgrp->kn->id as the cgroup IDTejun Heo4-55/+29
cgroup ID is currently allocated using a dedicated per-hierarchy idr and used internally and exposed through tracepoints and bpf. This is confusing because there are tracepoints and other interfaces which use the cgroupfs ino as IDs. The preceding changes made kn->id exposed as ino as 64bit ino on supported archs or ino+gen (low 32bits as ino, high gen). There's no reason for cgroup to use different IDs. The kernfs IDs are unique and userland can easily discover them and map them back to paths using standard file operations. This patch replaces cgroup IDs with kernfs IDs. * cgroup_id() is added and all cgroup ID users are converted to use it. * kernfs_node creation is moved to earlier during cgroup init so that cgroup_id() is available during init. * While at it, s/cgroup/cgrp/ in psi helpers for consistency. * Fallback ID value is changed to 1 to be consistent with root cgroup ID. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Namhyung Kim <namhyung@kernel.org>
2019-11-12kernfs: use 64bit inos if ino_t is 64bitTejun Heo1-4/+17
Each kernfs_node is identified with a 64bit ID. The low 32bit is exposed as ino and the high gen. While this already allows using inos as keys by looking up with wildcard generation number of 0, it's adding unnecessary complications for 64bit ino archs which can directly use kernfs_node IDs as inos to uniquely identify each cgroup instance. This patch exposes IDs directly as inos on 64bit ino archs. The conversion is mostly straight-forward. * 32bit ino archs behave the same as before. 64bit ino archs now use the whole 64bit ID as ino and the generation number is fixed at 1. * 64bit inos still use the same idr allocator which gurantees that the lower 32bits identify the current live instance uniquely and the high 32bits are incremented whenever the low bits wrap. As the upper 32bits are no longer used as gen and we don't wanna start ino allocation with 33rd bit set, the initial value for highbits allocation is changed to 0 on 64bit ino archs. * blktrace exposes two 32bit numbers - (INO,GEN) pair - to identify the issuing cgroup. Userland builds FILEID_INO32_GEN fids from these numbers to look up the cgroups. To remain compatible with the behavior, always output (LOW32,HIGH32) which will be constructed back to the original 64bit ID by __kernfs_fh_to_dentry(). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Namhyung Kim <namhyung@kernel.org>
2019-11-12kernfs: combine ino/id lookup functions into kernfs_find_and_get_node_by_id()Tejun Heo1-1/+1
kernfs_find_and_get_node_by_ino() looks the kernfs_node matching the specified ino. On top of that, kernfs_get_node_by_id() and kernfs_fh_get_inode() implement full ID matching by testing the rest of ID. On surface, confusingly, the two are slightly different in that the latter uses 0 gen as wildcard while the former doesn't - does it mean that the latter can't uniquely identify inodes w/ 0 gen? In practice, this is a distinction without a difference because generation number starts at 1. There are no actual IDs with 0 gen, so it can always safely used as wildcard. Let's simplify the code by renaming kernfs_find_and_get_node_by_ino() to kernfs_find_and_get_node_by_id(), moving all lookup logics into it, and removing now unnecessary kernfs_get_node_by_id(). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-12kernfs: convert kernfs_node->id from union kernfs_node_id to u64Tejun Heo4-40/+34
kernfs_node->id is currently a union kernfs_node_id which represents either a 32bit (ino, gen) pair or u64 value. I can't see much value in the usage of the union - all that's needed is a 64bit ID which the current code is already limited to. Using a union makes the code unnecessarily complicated and prevents using 64bit ino without adding practical benefits. This patch drops union kernfs_node_id and makes kernfs_node->id a u64. ino is stored in the lower 32bits and gen upper. Accessors - kernfs[_id]_ino() and kernfs[_id]_gen() - are added to retrieve the ino and gen. This simplifies ID handling less cumbersome and will allow using 64bit inos on supported archs. This patch doesn't make any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Alexei Starovoitov <ast@kernel.org>
2019-11-12sched/Kconfig: Fix spelling mistake in user-visible help textSrivatsa S. Bhat (VMware)1-1/+1
Fix a spelling mistake in the help text for PREEMPT_RT. Signed-off-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/157204450499.10518.4542293884417101528.stgit@srivatsa-ubuntu
2019-11-12time: Fix spelling mistake in commentMukesh Ojha1-1/+1
witin => within Signed-off-by: Mukesh Ojha <mojha@codeaurora.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/1571124819-9639-1-git-send-email-mojha@codeaurora.org
2019-11-12time: Optimize ns_to_timespec64()Arnd Bergmann1-9/+12
ns_to_timespec64() calls div_s64_rem(), which is a rather slow function on 32-bit architectures, as it cannot take advantage of the do_div() optimizations for constant arguments. Open-code the div_s64_rem() function in ns_to_timespec64(), so a constant divider can be passed into the optimized div_u64_rem() function. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20191108203435.112759-3-arnd@arndb.de
2019-11-12ntp/y2038: Remove incorrect time_t truncationArnd Bergmann1-1/+1
A cast to 'time_t' was accidentally left in place during the conversion of __do_adjtimex() to 64-bit timestamps, so the resulting value is incorrectly truncated. Remove the cast so the 64-bit time gets propagated correctly. Fixes: ead25417f82e ("timex: use __kernel_timex internally") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20191108203435.112759-2-arnd@arndb.de
2019-11-11cpuidle: Use nanoseconds as the unit of timeRafael J. Wysocki1-1/+1
Currently, the cpuidle subsystem uses microseconds as the unit of time which (among other things) causes the idle loop to incur some integer division overhead for no clear benefit. In order to allow cpuidle to measure time in nanoseconds, add two new fields, exit_latency_ns and target_residency_ns, to represent the exit latency and target residency of an idle state in nanoseconds, respectively, to struct cpuidle_state and initialize them with the help of the corresponding values in microseconds provided by drivers. Additionally, change cpuidle_governor_latency_req() to return the idle state exit latency constraint in nanoseconds. Also meeasure idle state residency (last_residency_ns in struct cpuidle_device and time_ns in struct cpuidle_driver) in nanoseconds and update the cpuidle core and governors accordingly. However, the menu governor still computes typical intervals in microseconds to avoid integer overflows. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Doug Smythies <dsmythies@telus.net> Tested-by: Doug Smythies <dsmythies@telus.net>
2019-11-11Merge branch 'for-5.4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroupLinus Torvalds1-1/+1
Pull cgroup fix from Tejun Heo: "There's an inadvertent preemption point in ptrace_stop() which was reliably triggering for a test scenario significantly slowing it down. This contains Oleg's fix to remove the unwanted preemption point" * 'for-5.4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: freezer: call cgroup_enter_frozen() with preemption disabled in ptrace_stop()
2019-11-11kheaders: explain why include/config/autoconf.h is excluded from md5sumMasahiro Yamada1-2/+9
This comment block explains why include/generated/compile.h is omitted, but nothing about include/generated/autoconf.h, which might be more difficult to understand. Add more comments. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2019-11-11kheaders: remove the last bashism to allow sh to run itMasahiro Yamada2-7/+8
'pushd' ... 'popd' is the last bash-specific code in this script. One way to avoid it is to run the code in a sub-shell. With that addressed, you can run this script with sh. I replaced $(BASH) with $(CONFIG_SHELL), and I changed the hashbang to #!/bin/sh. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2019-11-11kheaders: optimize header copy for in-tree buildsMasahiro Yamada1-7/+9
This script copies headers by the cpio command twice; first from srctree, and then from objtree. However, when we building in-tree, we know the srctree and the objtree are the same. That is, all the headers copied by the first cpio are overwritten by the second one. Skip the first cpio when we are building in-tree. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2019-11-11kheaders: optimize md5sum calculation for in-tree buildsMasahiro Yamada1-16/+16
This script computes md5sum of headers in srctree and in objtree. However, when we are building in-tree, we know the srctree and the objtree are the same. That is, we end up with the same computation twice. In fact, the first two lines of kernel/kheaders.md5 are always the same for in-tree builds. Unify the two md5sum calculations. For in-tree builds ($building_out_of_srctree is empty), we check only two directories, "include", and "arch/$SRCARCH/include". For out-of-tree builds ($building_out_of_srctree is 1), we check 4 directories, "$srctree/include", "$srctree/arch/$SRCARCH/include", "include", and "arch/$SRCARCH/include" since we know they are all different. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2019-11-11kheaders: remove unneeded 'cat' command piped to 'head' / 'tail'Masahiro Yamada1-4/+4
The 'head' and 'tail' commands can take a file path directly. So, you do not need to run 'cat'. cat kernel/kheaders.md5 | head -1 ... is equivalent to: head -1 kernel/kheaders.md5 and the latter saves forking one process. While I was here, I replaced 'head -1' with 'head -n 1'. I also replaced '==' with '=' since we do not have a good reason to use the bashism. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2019-11-11dma-debug: increase HASH_SIZEEric Dumazet1-2/+2
With modern NIC, it is not unusual having about ~256,000 active dma mappings and a hash size of 1024 buckets is too small. Forcing full cache line per bucket does not seem useful, especially now that we have contention on free_entries_lock for allocations and freeing of entries. Better use the space to fit more buckets. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-11-11dma-debug: reorder struct dma_debug_entry fieldsEric Dumazet1-7/+7
Move all fields used during exact match lookups to the first cache line. This makes debug_dma_mapping_error() and friends about 50% faster. Since it removes two 32bit holes, force a cacheline alignment on struct dma_debug_entry. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-11-11dma-mapping: merge the generic remapping helpers into dma-directChristoph Hellwig2-61/+48
Integrate the generic dma remapping implementation into the main flow. This prepares for architectures like xtensa that use an uncached segment for pages in the kernel mapping, but can also remap highmem from CMA. To simplify that implementation we now always deduct the page from the physical address via the DMA address instead of the virtual address. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Max Filippov <jcmvbkbc@gmail.com>
2019-11-11dma-direct: provide mmap and get_sgtable method overridesChristoph Hellwig4-47/+75
For dma-direct we know that the DMA address is an encoding of the physical address that we can trivially decode. Use that fact to provide implementations that do not need the arch_dma_coherent_to_pfn architecture hook. Note that we still can only support mmap of non-coherent memory only if the architecture provides a way to set an uncached bit in the page tables. This must be true for architectures that use the generic remap helpers, but other architectures can also manually select it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Max Filippov <jcmvbkbc@gmail.com>
2019-11-11stacktrace: Get rid of unneeded '!!' patternJiri Slaby1-2/+2
My commit b0c51f158455 ("stacktrace: Don't skip first entry on noncurrent tasks") adds one or zero to skipnr by "!!(current == tsk)". But the C99 standard says: The == (equal to) and != (not equal to) operators are ... Each of the operators yields 1 if the specified relation is true and 0 if it is false. So there is no need to prepend the above expression by "!!" -- remove it. Reported-by: Joe Perches <joe@perches.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20191111092647.27419-1-jslaby@suse.cz Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-11irq_work: Slightly simplify IRQ_WORK_PENDING clearingFrederic Weisbecker1-4/+3
Instead of fetching the value of flags and perform an xchg() to clear a bit, just use atomic_fetch_andnot() that is more suitable to do that job in one operation while keeping the full ordering. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20191108160858.31665-4-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-11irq_work: Fix irq_work_claim() memory orderingFrederic Weisbecker1-15/+7
When irq_work_claim() finds IRQ_WORK_PENDING flag already set, we just return and don't raise a new IPI. We expect the destination to see and handle our latest updades thanks to the pairing atomic_xchg() in irq_work_run_list(). But cmpxchg() doesn't guarantee a full memory barrier upon failure. So it's possible that the destination misses our latest updates. So use atomic_fetch_or() instead that is unconditionally fully ordered and also performs exactly what we want here and simplify the code. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20191108160858.31665-3-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-11irq_work: Convert flags to atomic_tFrederic Weisbecker4-12/+12
We need to convert flags to atomic_t in order to later fix an ordering issue on atomic_cmpxchg() failure. This will allow us to use atomic_fetch_or(). Also clarify the nature of those flags. [ mingo: Converted two more usage site the original patch missed. ] Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20191108160858.31665-2-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-11sched/core: Further clarify sched_class::set_next_task()Peter Zijlstra6-11/+17
It turns out there really is something special to the first set_next_task() invocation. In specific the 'change' pattern really should not cause balance callbacks. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bsegall@google.com Cc: dietmar.eggemann@arm.com Cc: juri.lelli@redhat.com Cc: ktkhai@virtuozzo.com Cc: mgorman@suse.de Cc: qais.yousef@arm.com Cc: qperret@google.com Cc: rostedt@goodmis.org Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Fixes: f95d4eaee6d0 ("sched/{rt,deadline}: Fix set_next_task vs pick_next_task") Link: https://lkml.kernel.org/r/20191108131909.775434698@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-11sched/fair: Use mul_u32_u32()Peter Zijlstra1-2/+1
While reading the code I encountered another site where we should be using mul_u32_u32() because GCC just won't take a hint. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bsegall@google.com Cc: dietmar.eggemann@arm.com Cc: juri.lelli@redhat.com Cc: ktkhai@virtuozzo.com Cc: mgorman@suse.de Cc: qais.yousef@arm.com Cc: qperret@google.com Cc: rostedt@goodmis.org Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Link: https://lkml.kernel.org/r/20191108131909.717931380@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-11sched/core: Simplify sched_class::pick_next_task()Peter Zijlstra7-35/+16
Now that the indirect class call never uses the last two arguments of pick_next_task(), remove them. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bsegall@google.com Cc: dietmar.eggemann@arm.com Cc: juri.lelli@redhat.com Cc: ktkhai@virtuozzo.com Cc: mgorman@suse.de Cc: qais.yousef@arm.com Cc: qperret@google.com Cc: rostedt@goodmis.org Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Link: https://lkml.kernel.org/r/20191108131909.660595546@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-11sched/core: Optimize pick_next_task()Peter Zijlstra4-5/+8
Ever since we moved the sched_class definitions into their own files, the constant expression {fair,idle}_sched_class.pick_next_task() is not in fact a compile time constant anymore and results in an indirect call (barring LTO). Fix that by exposing pick_next_task_{fair,idle}() directly, this gets rid of the indirect call (and RETPOLINE) on the fast path. Also remove the unlikely() from the idle case, it is in fact /the/ way we select idle -- and that is a very common thing to do. Performance for will-it-scale/sched_yield improves by 2% (as reported by 0-day). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bsegall@google.com Cc: dietmar.eggemann@arm.com Cc: juri.lelli@redhat.com Cc: ktkhai@virtuozzo.com Cc: mgorman@suse.de Cc: qais.yousef@arm.com Cc: qperret@google.com Cc: rostedt@goodmis.org Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Link: https://lkml.kernel.org/r/20191108131909.603037345@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-11sched/core: Make pick_next_task_idle() more consistentPeter Zijlstra2-4/+5
Only pick_next_task_fair() needs the @prev and @rf argument; these are required to implement the cpu-cgroup optimization. None of the other pick_next_task() methods need this. Make pick_next_task_idle() more consistent. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bsegall@google.com Cc: dietmar.eggemann@arm.com Cc: juri.lelli@redhat.com Cc: ktkhai@virtuozzo.com Cc: mgorman@suse.de Cc: qais.yousef@arm.com Cc: qperret@google.com Cc: rostedt@goodmis.org Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Link: https://lkml.kernel.org/r/20191108131909.545730862@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-11sched/fair: Better document newidle_balance()Peter Zijlstra1-0/+5
Whilst chasing the pick_next_task() race, there was some confusion about the newidle_balance() return values. Document them. [ mingo: Minor edits. ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bsegall@google.com Cc: dietmar.eggemann@arm.com Cc: juri.lelli@redhat.com Cc: ktkhai@virtuozzo.com Cc: mgorman@suse.de Cc: qais.yousef@arm.com Cc: qperret@google.com Cc: rostedt@goodmis.org Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Link: https://lkml.kernel.org/r/20191108131909.488364308@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-11Merge tag 'v5.4-rc7' into sched/core, to pick up fixesIngo Molnar46-350/+836
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-11Merge tag 'v5.4-rc7' into perf/core, to pick up fixesIngo Molnar17-89/+224
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-10Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-6/+3
Pull timer fixes from Thomas Gleixner: "A small set of fixes for timekeepoing and clocksource drivers: - VDSO data was updated conditional on the availability of a VDSO capable clocksource. This causes the VDSO functions which do not depend on a VDSO capable clocksource to operate on stale data. Always update unconditionally. - Prevent a double free in the mediatek driver - Use the proper helper in the sh_mtu2 driver so it won't attempt to initialize non-existing interrupts" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timekeeping/vsyscall: Update VDSO data unconditionally clocksource/drivers/sh_mtu2: Do not loop using platform_get_irq_by_name() clocksource/drivers/mediatek: Fix error handling
2019-11-10Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds7-59/+113
Pull scheduler fixes from Thomas Gleixner: "Two fixes for scheduler regressions: - Plug a subtle race condition which was introduced with the rework of the next task selection functionality. The change of task properties became unprotected which can be observed inconsistently causing state corruption. - A trivial compile fix for CONFIG_CGROUPS=n" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Fix pick_next_task() vs 'change' pattern race sched/core: Fix compilation error when cgroup not selected
2019-11-10Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-1/+1
Pull irq fixlet from Thomas Gleixner: "A trivial fix for a kernel doc regression where an argument change was not reflected in the documentation" * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irq/irqdomain: Update __irq_domain_alloc_fwnode() function documentation
2019-11-10Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-2/+4
Pull stacktrace fix from Thomas Gleixner: "A small fix for a stacktrace regression. Saving a stacktrace for a foreign task skipped an extra entry which makes e.g. the output of /proc/$PID/stack incomplete" * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: stacktrace: Don't skip first entry on noncurrent tasks
2019-11-10audit_get_nd(): don't unlock parent too earlyAl Viro1-1/+1
if the child has been negative and just went positive under us, we want coherent d_is_positive() and ->d_inode. Don't unlock the parent until we'd done that work... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-11-10cgroup: don't put ERR_PTR() into fc->rootAl Viro1-2/+3
the caller of ->get_tree() expects NULL left there on error... Reported-by: Thibaut Sautereau <thibaut@sautereau.fr> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-11-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller3-5/+39
One conflict in the BPF samples Makefile, some fixes in 'net' whilst we were converting over to Makefile.target rules in 'net-next'. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds2-4/+7
Pull networking fixes from David Miller: 1) BPF sample build fixes from Björn Töpel 2) Fix powerpc bpf tail call implementation, from Eric Dumazet. 3) DCCP leaks jiffies on the wire, fix also from Eric Dumazet. 4) Fix crash in ebtables when using dnat target, from Florian Westphal. 5) Fix port disable handling whne removing bcm_sf2 driver, from Florian Fainelli. 6) Fix kTLS sk_msg trim on fallback to copy mode, from Jakub Kicinski. 7) Various KCSAN fixes all over the networking, from Eric Dumazet. 8) Memory leaks in mlx5 driver, from Alex Vesker. 9) SMC interface refcounting fix, from Ursula Braun. 10) TSO descriptor handling fixes in stmmac driver, from Jose Abreu. 11) Add a TX lock to synchonize the kTLS TX path properly with crypto operations. From Jakub Kicinski. 12) Sock refcount during shutdown fix in vsock/virtio code, from Stefano Garzarella. 13) Infinite loop in Intel ice driver, from Colin Ian King. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (108 commits) ixgbe: need_wakeup flag might not be set for Tx i40e: need_wakeup flag might not be set for Tx igb/igc: use ktime accessors for skb->tstamp i40e: Fix for ethtool -m issue on X722 NIC iavf: initialize ITRN registers with correct values ice: fix potential infinite loop because loop counter being too small qede: fix NULL pointer deref in __qede_remove() net: fix data-race in neigh_event_send() vsock/virtio: fix sock refcnt holding during the shutdown net: ethernet: octeon_mgmt: Account for second possible VLAN header mac80211: fix station inactive_time shortly after boot net/fq_impl: Switch to kvmalloc() for memory allocation mac80211: fix ieee80211_txq_setup_flows() failure path ipv4: Fix table id reference in fib_sync_down_addr ipv6: fixes rt6_probe() and fib6_nh->last_probe init net: hns: Fix the stray netpoll locks causing deadlock in NAPI path net: usb: qmi_wwan: add support for DW5821e with eSIM support CDC-NCM: handle incomplete transfer of MTU nfc: netlink: fix double device reference drop NFC: st21nfca: fix double free ...
2019-11-08sched: Fix pick_next_task() vs 'change' pattern racePeter Zijlstra7-58/+112
Commit 67692435c411 ("sched: Rework pick_next_task() slow-path") inadvertly introduced a race because it changed a previously unexplored dependency between dropping the rq->lock and sched_class::put_prev_task(). The comments about dropping rq->lock, in for example newidle_balance(), only mentions the task being current and ->on_cpu being set. But when we look at the 'change' pattern (in for example sched_setnuma()): queued = task_on_rq_queued(p); /* p->on_rq == TASK_ON_RQ_QUEUED */ running = task_current(rq, p); /* rq->curr == p */ if (queued) dequeue_task(...); if (running) put_prev_task(...); /* change task properties */ if (queued) enqueue_task(...); if (running) set_next_task(...); It becomes obvious that if we do this after put_prev_task() has already been called on @p, things go sideways. This is exactly what the commit in question allows to happen when it does: prev->sched_class->put_prev_task(rq, prev, rf); if (!rq->nr_running) newidle_balance(rq, rf); The newidle_balance() call will drop rq->lock after we've called put_prev_task() and that allows the above 'change' pattern to interleave and mess up the state. Furthermore, it turns out we lost the RT-pull when we put the last DL task. Fix both problems by extracting the balancing from put_prev_task() and doing a multi-class balance() pass before put_prev_task(). Fixes: 67692435c411 ("sched: Rework pick_next_task() slow-path") Reported-by: Quentin Perret <qperret@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Quentin Perret <qperret@google.com> Tested-by: Valentin Schneider <valentin.schneider@arm.com>
2019-11-08sched/core: Fix compilation error when cgroup not selectedQais Yousef1-1/+1
When cgroup is disabled the following compilation error was hit kernel/sched/core.c: In function ‘uclamp_update_active_tasks’: kernel/sched/core.c:1081:23: error: storage size of ‘it’ isn’t known struct css_task_iter it; ^~ kernel/sched/core.c:1084:2: error: implicit declaration of function ‘css_task_iter_start’; did you mean ‘__sg_page_iter_start’? [-Werror=implicit-function-declaration] css_task_iter_start(css, 0, &it); ^~~~~~~~~~~~~~~~~~~ __sg_page_iter_start kernel/sched/core.c:1085:14: error: implicit declaration of function ‘css_task_iter_next’; did you mean ‘__sg_page_iter_next’? [-Werror=implicit-function-declaration] while ((p = css_task_iter_next(&it))) { ^~~~~~~~~~~~~~~~~~ __sg_page_iter_next kernel/sched/core.c:1091:2: error: implicit declaration of function ‘css_task_iter_end’; did you mean ‘get_task_cred’? [-Werror=implicit-function-declaration] css_task_iter_end(&it); ^~~~~~~~~~~~~~~~~ get_task_cred kernel/sched/core.c:1081:23: warning: unused variable ‘it’ [-Wunused-variable] struct css_task_iter it; ^~ cc1: some warnings being treated as errors make[2]: *** [kernel/sched/core.o] Error 1 Fix by protetion uclamp_update_active_tasks() with CONFIG_UCLAMP_TASK_GROUP Fixes: babbe170e053 ("sched/uclamp: Update CPU's refcount on TG's clamp changes") Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Patrick Bellasi <patrick.bellasi@matbug.net> Cc: Mel Gorman <mgorman@suse.de> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Ben Segall <bsegall@google.com> Link: https://lkml.kernel.org/r/20191105112212.596-1-qais.yousef@arm.com