aboutsummaryrefslogtreecommitdiffstats
path: root/kernel (follow)
AgeCommit message (Collapse)AuthorFilesLines
2019-02-06perf/aux: Make perf_event accessible to setup_aux()Mathieu Poirier1-1/+1
When pmu::setup_aux() is called the coresight PMU needs to know which sink to use for the session by looking up the information in the event's attr::config2 field. As such simply replace the cpu information by the complete perf_event structure and change all affected customers. Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org> Reviewed-by: Suzuki Poulouse <suzuki.poulose@arm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-s390@vger.kernel.org Link: http://lkml.kernel.org/r/20190131184714.20388-2-mathieu.poirier@linaro.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-02-04perf/ring_buffer: Convert ring_buffer.aux_refcount to refcount_tElena Reshetova3-5/+5
atomic_t variables are currently used to implement reference counters with the following properties: - counter is initialized to 1 using atomic_set() - a resource is freed upon counter reaching zero - once counter reaches zero, its further increments aren't allowed - counter schema uses basic atomic operations (set, inc, inc_not_zero, dec_and_test, etc.) Such atomic variables should be converted to a newly provided refcount_t type and API that prevents accidental counter overflows and underflows. This is important since overflows and underflows can lead to use-after-free situation and be exploitable. The variable ring_buffer.aux_refcount is used as pure reference counter. Convert it to refcount_t and fix up the operations. ** Important note for maintainers: Some functions from refcount_t API defined in lib/refcount.c have different memory ordering guarantees than their atomic counterparts. Please check Documentation/core-api/refcount-vs-atomic.rst for more information. Normally the differences should not matter since refcount_t provides enough guarantees to satisfy the refcounting use cases, but in some rare cases it might matter. Please double check that you don't have some undocumented memory guarantees for this variable usage. For the ring_buffer.aux_refcount it might make a difference in following places: - perf_aux_output_begin(): increment in refcount_inc_not_zero() only guarantees control dependency on success vs. fully ordered atomic counterpart - rb_free_aux(): decrement in refcount_dec_and_test() only provides RELEASE ordering and ACQUIRE ordering + control dependency on success vs. fully ordered atomic counterpart Suggested-by: Kees Cook <keescook@chromium.org> Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: David Windsor <dwindsor@gmail.com> Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@kernel.org Cc: namhyung@kernel.org Link: https://lkml.kernel.org/r/1548678448-24458-4-git-send-email-elena.reshetova@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-04perf/ring_buffer: Convert ring_buffer.refcount to refcount_tElena Reshetova3-4/+5
atomic_t variables are currently used to implement reference counters with the following properties: - counter is initialized to 1 using atomic_set() - a resource is freed upon counter reaching zero - once counter reaches zero, its further increments aren't allowed - counter schema uses basic atomic operations (set, inc, inc_not_zero, dec_and_test, etc.) Such atomic variables should be converted to a newly provided refcount_t type and API that prevents accidental counter overflows and underflows. This is important since overflows and underflows can lead to use-after-free situation and be exploitable. The variable ring_buffer.refcount is used as pure reference counter. Convert it to refcount_t and fix up the operations. ** Important note for maintainers: Some functions from refcount_t API defined in lib/refcount.c have different memory ordering guarantees than their atomic counterparts. Please check Documentation/core-api/refcount-vs-atomic.rst for more information. Normally the differences should not matter since refcount_t provides enough guarantees to satisfy the refcounting use cases, but in some rare cases it might matter. Please double check that you don't have some undocumented memory guarantees for this variable usage. For the ring_buffer.refcount it might make a difference in following places: - ring_buffer_get(): increment in refcount_inc_not_zero() only guarantees control dependency on success vs. fully ordered atomic counterpart - ring_buffer_put(): decrement in refcount_dec_and_test() only provides RELEASE ordering and ACQUIRE ordering + control dependency on success vs. fully ordered atomic counterpart Suggested-by: Kees Cook <keescook@chromium.org> Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: David Windsor <dwindsor@gmail.com> Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@kernel.org Cc: namhyung@kernel.org Link: https://lkml.kernel.org/r/1548678448-24458-3-git-send-email-elena.reshetova@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-04perf: Convert perf_event_context.refcount to refcount_tElena Reshetova1-6/+6
atomic_t variables are currently used to implement reference counters with the following properties: - counter is initialized to 1 using atomic_set() - a resource is freed upon counter reaching zero - once counter reaches zero, its further increments aren't allowed - counter schema uses basic atomic operations (set, inc, inc_not_zero, dec_and_test, etc.) Such atomic variables should be converted to a newly provided refcount_t type and API that prevents accidental counter overflows and underflows. This is important since overflows and underflows can lead to use-after-free situation and be exploitable. The variable perf_event_context.refcount is used as pure reference counter. Convert it to refcount_t and fix up the operations. ** Important note for maintainers: Some functions from refcount_t API defined in lib/refcount.c have different memory ordering guarantees than their atomic counterparts. Please check Documentation/core-api/refcount-vs-atomic.rst for more information. Normally the differences should not matter since refcount_t provides enough guarantees to satisfy the refcounting use cases, but in some rare cases it might matter. Please double check that you don't have some undocumented memory guarantees for this variable usage. For the perf_event_context.refcount it might make a difference in following places: - get_ctx(), perf_event_ctx_lock_nested(), perf_lock_task_context() and __perf_event_ctx_lock_double(): increment in refcount_inc_not_zero() only guarantees control dependency on success vs. fully ordered atomic counterpart - put_ctx(): decrement in refcount_dec_and_test() provides RELEASE ordering and ACQUIRE ordering + control dependency on success vs. fully ordered atomic counterpart Suggested-by: Kees Cook <keescook@chromium.org> Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: David Windsor <dwindsor@gmail.com> Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@kernel.org Cc: namhyung@kernel.org Link: https://lkml.kernel.org/r/1548678448-24458-2-git-send-email-elena.reshetova@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-04perf/uprobes: Convert to SPDX license identifierThomas Gleixner1-14/+1
Replace the license boiler plate with a SPDX license identifier. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Paul McKenney <paulmck@linux.ibm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kate Stewart <kstewart@linuxfoundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lkml.kernel.org/r/20190116111308.211981422@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-04perf/hw_breakpoints: Convert to SPDX license identifierThomas Gleixner1-14/+1
Replace the license boiler plate with a SPDX license identifier. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Paul McKenney <paulmck@linux.ibm.com> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kate Stewart <kstewart@linuxfoundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lkml.kernel.org/r/20190116111308.105855650@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-04perf/core: Convert to SPDX license identifiersThomas Gleixner3-6/+3
Use proper SPDX license identifiers instead of the bogus reference to kernel-base/COPYING. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kate Stewart <kstewart@linuxfoundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lkml.kernel.org/r/20190116111308.012666937@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-04Merge branch 'perf/urgent' into perf/core, to pick up fixesIngo Molnar19-66/+134
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-04perf/core: Don't WARN() for impossible ring-buffer sizesMark Rutland1-0/+3
The perf tool uses /proc/sys/kernel/perf_event_mlock_kb to determine how large its ringbuffer mmap should be. This can be configured to arbitrary values, which can be larger than the maximum possible allocation from kmalloc. When this is configured to a suitably large value (e.g. thanks to the perf fuzzer), attempting to use perf record triggers a WARN_ON_ONCE() in __alloc_pages_nodemask(): WARNING: CPU: 2 PID: 5666 at mm/page_alloc.c:4511 __alloc_pages_nodemask+0x3f8/0xbc8 Let's avoid this by checking that the requested allocation is possible before calling kzalloc. Reported-by: Julien Thierry <julien.thierry@arm.com> Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Julien Thierry <julien.thierry@arm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20190110142745.25495-1-mark.rutland@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-03Merge branch 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds3-35/+6
Pull cpu hotplug fixes from Thomas Gleixner: "Two fixes for the cpu hotplug machinery: - Replace the overly clever 'SMT disabled by BIOS' detection logic as it breaks KVM scenarios and prevents speculation control updates when the Hyperthreads are brought online late after boot. - Remove a redundant invocation of the speculation control update function" * 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: cpu/hotplug: Fix "SMT disabled by BIOS" detection for KVM x86/speculation: Remove redundant arch_smt_update() invocation
2019-02-03Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-7/+7
Pull perf fixes from Thomas Gleixner: "A pile of perf updates: - Fix broken sanity check in the /proc/sys/kernel/perf_cpu_time_max_percent write handler - Cure a perf script crash which caused by an unitinialized data structure - Highlight the hottest instruction in perf top and not a random one - Cure yet another clang issue when building perf python - Handle topology entries with no CPU correctly in the tools - Handle perf data which contains both tracepoints and performance counter entries correctly. - Add a missing NULL pointer check in perf ordered_events_free()" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf script: Fix crash when processing recorded stat data perf top: Fix wrong hottest instruction highlighted perf tools: Handle TOPOLOGY headers with no CPU perf python: Remove -fstack-clash-protection when building with some clang versions perf core: Fix perf_proc_update_handler() bug perf script: Fix crash with printing mixed trace point and other events perf ordered_events: Fix crash in ordered_events__free
2019-02-01psi: fix aggregation idle shut-offJohannes Weiner3-5/+45
psi has provisions to shut off the periodic aggregation worker when there is a period of no task activity - and thus no data that needs aggregating. However, while developing psi monitoring, Suren noticed that the aggregation clock currently won't stay shut off for good. Debugging this revealed a flaw in the idle design: an aggregation run will see no task activity and decide to go to sleep; shortly thereafter, the kworker thread that executed the aggregation will go idle and cause a scheduling change, during which the psi callback will kick the !pending worker again. This will ping-pong forever, and is equivalent to having no shut-off logic at all (but with more code!) Fix this by exempting aggregation workers from psi's clock waking logic when the state change is them going to sleep. To do this, tag workers with the last work function they executed, and if in psi we see a worker going to sleep after aggregating psi data, we will not reschedule the aggregation work item. What if the worker is also executing other items before or after? Any psi state times that were incurred by work items preceding the aggregation work will have been collected from the per-cpu buckets during the aggregation itself. If there are work items following the aggregation work, the worker's last_func tag will be overwritten and the aggregator will be kept alive to process this genuine new activity. If the aggregation work is the last thing the worker does, and we decide to go idle, the brief period of non-idle time incurred between the aggregation run and the kworker's dequeue will be stranded in the per-cpu buckets until the clock is woken by later activity. But that should not be a problem. The buckets can hold 4s worth of time, and future activity will wake the clock with a 2s delay, giving us 2s worth of data we can leave behind when disabling aggregation. If it takes a worker more than two seconds to go idle after it finishes its last work item, we likely have bigger problems in the system, and won't notice one sample that was averaged with a bogus per-CPU weight. Link: http://lkml.kernel.org/r/20190116193501.1910-1-hannes@cmpxchg.org Fixes: eb414681d5a0 ("psi: pressure stall information for CPU, memory, and IO") Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01kernel/exit.c: release ptraced tasks before zap_pid_ns_processesAndrei Vagin1-2/+10
Currently, exit_ptrace() adds all ptraced tasks in a dead list, then zap_pid_ns_processes() waits on all tasks in a current pidns, and only then are tasks from the dead list released. zap_pid_ns_processes() can get stuck on waiting tasks from the dead list. In this case, we will have one unkillable process with one or more dead children. Thanks to Oleg for the advice to release tasks in find_child_reaper(). Link: http://lkml.kernel.org/r/20190110175200.12442-1-avagin@gmail.com Fixes: 7c8bd2322c7f ("exit: ptrace: shift "reap dead" code from exit_ptrace() to forget_original_parent()") Signed-off-by: Andrei Vagin <avagin@gmail.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-30cpu/hotplug: Fix "SMT disabled by BIOS" detection for KVMJosh Poimboeuf3-31/+5
With the following commit: 73d5e2b47264 ("cpu/hotplug: detect SMT disabled by BIOS") ... the hotplug code attempted to detect when SMT was disabled by BIOS, in which case it reported SMT as permanently disabled. However, that code broke a virt hotplug scenario, where the guest is booted with only primary CPU threads, and a sibling is brought online later. The problem is that there doesn't seem to be a way to reliably distinguish between the HW "SMT disabled by BIOS" case and the virt "sibling not yet brought online" case. So the above-mentioned commit was a bit misguided, as it permanently disabled SMT for both cases, preventing future virt sibling hotplugs. Going back and reviewing the original problems which were attempted to be solved by that commit, when SMT was disabled in BIOS: 1) /sys/devices/system/cpu/smt/control showed "on" instead of "notsupported"; and 2) vmx_vm_init() was incorrectly showing the L1TF_MSG_SMT warning. I'd propose that we instead consider #1 above to not actually be a problem. Because, at least in the virt case, it's possible that SMT wasn't disabled by BIOS and a sibling thread could be brought online later. So it makes sense to just always default the smt control to "on" to allow for that possibility (assuming cpuid indicates that the CPU supports SMT). The real problem is #2, which has a simple fix: change vmx_vm_init() to query the actual current SMT state -- i.e., whether any siblings are currently online -- instead of looking at the SMT "control" sysfs value. So fix it by: a) reverting the original "fix" and its followup fix: 73d5e2b47264 ("cpu/hotplug: detect SMT disabled by BIOS") bc2d8d262cba ("cpu/hotplug: Fix SMT supported evaluation") and b) changing vmx_vm_init() to query the actual current SMT state -- instead of the sysfs control value -- to determine whether the L1TF warning is needed. This also requires the 'sched_smt_present' variable to exported, instead of 'cpu_smt_control'. Fixes: 73d5e2b47264 ("cpu/hotplug: detect SMT disabled by BIOS") Reported-by: Igor Mammedov <imammedo@redhat.com> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Joe Mario <jmario@redhat.com> Cc: Jiri Kosina <jikos@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: kvm@vger.kernel.org Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/e3a85d585da28cc333ecbc1e78ee9216e6da9396.1548794349.git.jpoimboe@redhat.com
2019-01-29x86/speculation: Remove redundant arch_smt_update() invocationZhenzhong Duan1-4/+1
With commit a74cfffb03b7 ("x86/speculation: Rework SMT state change"), arch_smt_update() is invoked from each individual CPU hotplug function. Therefore the extra arch_smt_update() call in the sysfs SMT control is redundant. Fixes: a74cfffb03b7 ("x86/speculation: Rework SMT state change") Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: <konrad.wilk@oracle.com> Cc: <dwmw@amazon.co.uk> Cc: <bp@suse.de> Cc: <srinivas.eeda@oracle.com> Cc: <peterz@infradead.org> Cc: <hpa@zytor.com> Link: https://lkml.kernel.org/r/e2e064f2-e8ef-42ca-bf4f-76b612964752@default
2019-01-27Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-0/+1
Pull timer fix from Thomas Glexiner: "A single regression fix to address the unintended breakage of posix cpu timers. This is caused by a new sanity check in the common code, which fails for posix cpu timers under certain conditions because the posix cpu timer code never updates the variable which is checked" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: posix-cpu-timers: Unbreak timer rearming
2019-01-27Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds4-11/+34
Pull locking fixes from Thomas Gleixner: "A small series of fixes which all address possible missed wakeups: - Document and fix the wakeup ordering of wake_q - Add the missing barrier in rcuwait_wake_up(), which was documented in the comment but missing in the code - Fix the possible missed wakeups in the rwsem and futex code" * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/rwsem: Fix (possible) missed wakeup futex: Fix (possible) missed wakeup sched/wake_q: Fix wakeup ordering for wake_q sched/wake_q: Document wake_q_add() sched/wait: Fix rcuwait_wake_up() ordering
2019-01-27Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds2-1/+4
Pull irq fixes from Thomas Gleixner: "A small set of fixes for the interrupt subsystem: - Fix a double increment in the irq descriptor allocator which resulted in a sanity check only being done for every second affinity mask - Add a missing device tree translation in the stm32-exti driver. Without that the interrupt association is completely wrong. - Initialize the mutex in the GIC-V3 MBI driver - Fix the alignment for aliasing devices in the GIC-V3-ITS driver so multi MSI allocations work correctly - Ensure that the initial affinity of a interrupt is not empty at startup time. - Drop bogus include in the madera irq chip driver - Fix KernelDoc regression" * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqchip/gic-v3-its: Align PCI Multi-MSI allocation on their size genirq/irqdesc: Fix double increment in alloc_descs() genirq: Fix the kerneldoc comment for struct irq_affinity_desc irqchip/madera: Drop GPIO includes irqchip/gic-v3-mbi: Fix uninitialized mbi_lock irqchip/stm32-exti: Add domain translate function genirq: Make sure the initial affinity is not empty
2019-01-22Merge tag 'perf-urgent-for-mingo-5.0-20190121' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgentIngo Molnar1-7/+7
Pull perf/urgent fixes from Arnaldo Carvalho de Melo: Kernel: Stephane Eranian: - Fix perf_proc_update_handler() bug. perf script: Andi Kleen: - Fix crash with printing mixed trace point and other events. Tony Jones: - Fix crash when processing recorded stat data. perf top: He Kuang: - Fix wrong hottest instruction highlighted. perf python: Arnaldo Carvalho de Melo: - Remove -fstack-clash-protection when building with some clang versions. perf ordered_events: Jiri Olsa: - Fix out of buffers crash in ordered_events__free(). perf cpu_map: Stephane Eranian: - Handle TOPOLOGY headers with no CPU. Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-01-21bpf: Add module name [bpf] to ksymbols for bpf programsSong Liu1-1/+1
With this patch, /proc/kallsyms will show BPF programs as <addr> t bpf_prog_<tag>_<name> [bpf] Signed-off-by: Song Liu <songliubraving@fb.com> Reviewed-by: Arnaldo Carvalho de Melo <acme@redhat.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: kernel-team@fb.com Cc: netdev@vger.kernel.org Link: http://lkml.kernel.org/r/20190117161521.1341602-10-songliubraving@fb.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-01-21perf, bpf: Introduce PERF_RECORD_BPF_EVENTSong Liu3-1/+118
For better performance analysis of BPF programs, this patch introduces PERF_RECORD_BPF_EVENT, a new perf_event_type that exposes BPF program load/unload information to user space. Each BPF program may contain up to BPF_MAX_SUBPROGS (256) sub programs. The following example shows kernel symbols for a BPF program with 7 sub programs: ffffffffa0257cf9 t bpf_prog_b07ccb89267cf242_F ffffffffa02592e1 t bpf_prog_2dcecc18072623fc_F ffffffffa025b0e9 t bpf_prog_bb7a405ebaec5d5c_F ffffffffa025dd2c t bpf_prog_a7540d4a39ec1fc7_F ffffffffa025fcca t bpf_prog_05762d4ade0e3737_F ffffffffa026108f t bpf_prog_db4bd11e35df90d4_F ffffffffa0263f00 t bpf_prog_89d64e4abf0f0126_F ffffffffa0257cf9 t bpf_prog_ae31629322c4b018__dummy_tracepoi When a bpf program is loaded, PERF_RECORD_KSYMBOL is generated for each of these sub programs. Therefore, PERF_RECORD_BPF_EVENT is not needed for simple profiling. For annotation, user space need to listen to PERF_RECORD_BPF_EVENT and gather more information about these (sub) programs via sys_bpf. Signed-off-by: Song Liu <songliubraving@fb.com> Reviewed-by: Arnaldo Carvalho de Melo <acme@redhat.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradeaed.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: kernel-team@fb.com Cc: netdev@vger.kernel.org Link: http://lkml.kernel.org/r/20190117161521.1341602-4-songliubraving@fb.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-01-21perf, bpf: Introduce PERF_RECORD_KSYMBOLSong Liu1-1/+97
For better performance analysis of dynamically JITed and loaded kernel functions, such as BPF programs, this patch introduces PERF_RECORD_KSYMBOL, a new perf_event_type that exposes kernel symbol register/unregister information to user space. The following data structure is used for PERF_RECORD_KSYMBOL. /* * struct { * struct perf_event_header header; * u64 addr; * u32 len; * u16 ksym_type; * u16 flags; * char name[]; * struct sample_id sample_id; * }; */ Signed-off-by: Song Liu <songliubraving@fb.com> Reviewed-by: Arnaldo Carvalho de Melo <acme@redhat.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: kernel-team@fb.com Cc: netdev@vger.kernel.org Link: http://lkml.kernel.org/r/20190117161521.1341602-2-songliubraving@fb.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-01-21perf: Make perf_event_output() propagate the output() returnArnaldo Carvalho de Melo2-6/+8
For the original mode of operation it isn't needed, since we report back errors via PERF_RECORD_LOST records in the ring buffer, but for use in bpf_perf_event_output() it is convenient to return the errors, basically -ENOSPC. Currently bpf_perf_event_output() returns an error indication, the last thing it does, which is to push it to the ring buffer is that can fail and if so, this failure won't be reported back to its users, fix it. Reported-by: Jamal Hadi Salim <jhs@mojatatu.com> Tested-by: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Link: https://lkml.kernel.org/r/20190118150938.GN5823@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-01-21locking/rwsem: Fix (possible) missed wakeupXie Yongji1-2/+9
Because wake_q_add() can imply an immediate wakeup (cmpxchg failure case), we must not rely on the wakeup being delayed. However, commit: e38513905eea ("locking/rwsem: Rework zeroing reader waiter->task") relies on exactly that behaviour in that the wakeup must not happen until after we clear waiter->task. [ peterz: Added changelog. ] Signed-off-by: Xie Yongji <xieyongji@baidu.com> Signed-off-by: Zhang Yu <zhangyu31@baidu.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: e38513905eea ("locking/rwsem: Rework zeroing reader waiter->task") Link: https://lkml.kernel.org/r/1543495830-2644-1-git-send-email-xieyongji@baidu.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-01-21futex: Fix (possible) missed wakeupPeter Zijlstra1-5/+8
We must not rely on wake_q_add() to delay the wakeup; in particular commit: 1d0dcb3ad9d3 ("futex: Implement lockless wakeups") moved wake_q_add() before smp_store_release(&q->lock_ptr, NULL), which could result in futex_wait() waking before observing ->lock_ptr == NULL and going back to sleep again. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 1d0dcb3ad9d3 ("futex: Implement lockless wakeups") Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-01-21sched/wake_q: Fix wakeup ordering for wake_qPeter Zijlstra1-3/+4
Notable cmpxchg() does not provide ordering when it fails, however wake_q_add() requires ordering in this specific case too. Without this it would be possible for the concurrent wakeup to not observe our prior state. Andrea Parri provided: C wake_up_q-wake_q_add { int next = 0; int y = 0; } P0(int *next, int *y) { int r0; /* in wake_up_q() */ WRITE_ONCE(*next, 1); /* node->next = NULL */ smp_mb(); /* implied by wake_up_process() */ r0 = READ_ONCE(*y); } P1(int *next, int *y) { int r1; /* in wake_q_add() */ WRITE_ONCE(*y, 1); /* wake_cond = true */ smp_mb__before_atomic(); r1 = cmpxchg_relaxed(next, 1, 2); } exists (0:r0=0 /\ 1:r1=0) This "exists" clause cannot be satisfied according to the LKMM: Test wake_up_q-wake_q_add Allowed States 3 0:r0=0; 1:r1=1; 0:r0=1; 1:r1=0; 0:r0=1; 1:r1=1; No Witnesses Positive: 0 Negative: 3 Condition exists (0:r0=0 /\ 1:r1=0) Observation wake_up_q-wake_q_add Never 0 3 Reported-by: Yongji Xie <elohimes@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-01-21sched/wake_q: Document wake_q_add()Peter Zijlstra1-0/+12
The only guarantee provided by wake_q_add() is that a wakeup will happen after it, it does _NOT_ guarantee the wakeup will be delayed until the matching wake_up_q(). If wake_q_add() fails the cmpxchg() a concurrent wakeup is pending and that can happen at any time after the cmpxchg(). This means we should not rely on the wakeup happening at wake_q_up(), but should be ready for wake_q_add() to issue the wakeup. The delay; if provided (most likely); should only result in more efficient behaviour. Reported-by: Yongji Xie <elohimes@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-01-21sched/wait: Fix rcuwait_wake_up() orderingPrateek Sood1-1/+1
For some peculiar reason rcuwait_wake_up() has the right barrier in the comment, but not in the code. This mistake has been observed to cause a deadlock in the following situation: P1 P2 percpu_up_read() percpu_down_write() rcu_sync_is_idle() // false rcu_sync_enter() ... __percpu_up_read() [S] ,- __this_cpu_dec(*sem->read_count) | smp_rmb(); [L] | task = rcu_dereference(w->task) // NULL | | [S] w->task = current | smp_mb(); | [L] readers_active_check() // fail `-> <store happens here> Where the smp_rmb() (obviously) fails to constrain the store. [ peterz: Added changelog. ] Signed-off-by: Prateek Sood <prsood@codeaurora.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com> Acked-by: Davidlohr Bueso <dbueso@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 8f95c90ceb54 ("sched/wait, RCU: Introduce rcuwait machinery") Link: https://lkml.kernel.org/r/1543590656-7157-1-git-send-email-prsood@codeaurora.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-01-21perf/core: Add PERF_PMU_CAP_NO_EXCLUDE for exclusion incapable PMUsAndrew Murray1-0/+9
Many PMU drivers do not have the capability to exclude counting events that occur in specific contexts such as idle, kernel, guest, etc. These drivers indicate this by returning an error in their event_init upon testing the events attribute flags. This approach is error prone and often inconsistent. Let's instead allow PMU drivers to advertise their inability to exclude based on context via a new capability: PERF_PMU_CAP_NO_EXCLUDE. This allows the perf core to reject requests for exclusion events where there is no support in the PMU. Signed-off-by: Andrew Murray <andrew.murray@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Russell King <linux@armlinux.org.uk> Cc: Sascha Hauer <s.hauer@pengutronix.de> Cc: Shawn Guo <shawnguo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linuxppc-dev@lists.ozlabs.org Cc: robin.murphy@arm.com Cc: suzuki.poulose@arm.com Link: https://lkml.kernel.org/r/1547128414-50693-4-git-send-email-andrew.murray@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-01-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds4-5/+24
Pull networking fixes from David Miller: 1) Fix endless loop in nf_tables, from Phil Sutter. 2) Fix cross namespace ip6_gre tunnel hash list corruption, from Olivier Matz. 3) Don't be too strict in phy_start_aneg() otherwise we might not allow restarting auto negotiation. From Heiner Kallweit. 4) Fix various KMSAN uninitialized value cases in tipc, from Ying Xue. 5) Memory leak in act_tunnel_key, from Davide Caratti. 6) Handle chip errata of mv88e6390 PHY, from Andrew Lunn. 7) Remove linear SKB assumption in fou/fou6, from Eric Dumazet. 8) Missing udplite rehash callbacks, from Alexey Kodanev. 9) Log dirty pages properly in vhost, from Jason Wang. 10) Use consume_skb() in neigh_probe() as this is a normal free not a drop, from Yang Wei. Likewise in macvlan_process_broadcast(). 11) Missing device_del() in mdiobus_register() error paths, from Thomas Petazzoni. 12) Fix checksum handling of short packets in mlx5, from Cong Wang. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (96 commits) bpf: in __bpf_redirect_no_mac pull mac only if present virtio_net: bulk free tx skbs net: phy: phy driver features are mandatory isdn: avm: Fix string plus integer warning from Clang net/mlx5e: Fix cb_ident duplicate in indirect block register net/mlx5e: Fix wrong (zero) TX drop counter indication for representor net/mlx5e: Fix wrong error code return on FEC query failure net/mlx5e: Force CHECKSUM_UNNECESSARY for short ethernet frames tools: bpftool: Cleanup license mess bpf: fix inner map masking to prevent oob under speculation bpf: pull in pkt_sched.h header for tooling to fix bpftool build selftests: forwarding: Add a test case for externally learned FDB entries selftests: mlxsw: Test FDB offload indication mlxsw: spectrum_switchdev: Do not treat static FDB entries as sticky net: bridge: Mark FDB entries that were added by user as such mlxsw: spectrum_fid: Update dummy FID index mlxsw: pci: Return error on PCI reset timeout mlxsw: pci: Increase PCI SW reset timeout mlxsw: pci: Ring CQ's doorbell before RDQ's MAINTAINERS: update email addresses of liquidio driver maintainers ...
2019-01-18bpf: fix inner map masking to prevent oob under speculationDaniel Borkmann1-2/+15
During review I noticed that inner meta map setup for map in map is buggy in that it does not propagate all needed data from the reference map which the verifier is later accessing. In particular one such case is index masking to prevent out of bounds access under speculative execution due to missing the map's unpriv_array/index_mask field propagation. Fix this such that the verifier is generating the correct code for inlined lookups in case of unpriviledged use. Before patch (test_verifier's 'map in map access' dump): # bpftool prog dump xla id 3 0: (62) *(u32 *)(r10 -4) = 0 1: (bf) r2 = r10 2: (07) r2 += -4 3: (18) r1 = map[id:4] 5: (07) r1 += 272 | 6: (61) r0 = *(u32 *)(r2 +0) | 7: (35) if r0 >= 0x1 goto pc+6 | Inlined map in map lookup 8: (54) (u32) r0 &= (u32) 0 | with index masking for 9: (67) r0 <<= 3 | map->unpriv_array. 10: (0f) r0 += r1 | 11: (79) r0 = *(u64 *)(r0 +0) | 12: (15) if r0 == 0x0 goto pc+1 | 13: (05) goto pc+1 | 14: (b7) r0 = 0 | 15: (15) if r0 == 0x0 goto pc+11 16: (62) *(u32 *)(r10 -4) = 0 17: (bf) r2 = r10 18: (07) r2 += -4 19: (bf) r1 = r0 20: (07) r1 += 272 | 21: (61) r0 = *(u32 *)(r2 +0) | Index masking missing (!) 22: (35) if r0 >= 0x1 goto pc+3 | for inner map despite 23: (67) r0 <<= 3 | map->unpriv_array set. 24: (0f) r0 += r1 | 25: (05) goto pc+1 | 26: (b7) r0 = 0 | 27: (b7) r0 = 0 28: (95) exit After patch: # bpftool prog dump xla id 1 0: (62) *(u32 *)(r10 -4) = 0 1: (bf) r2 = r10 2: (07) r2 += -4 3: (18) r1 = map[id:2] 5: (07) r1 += 272 | 6: (61) r0 = *(u32 *)(r2 +0) | 7: (35) if r0 >= 0x1 goto pc+6 | Same inlined map in map lookup 8: (54) (u32) r0 &= (u32) 0 | with index masking due to 9: (67) r0 <<= 3 | map->unpriv_array. 10: (0f) r0 += r1 | 11: (79) r0 = *(u64 *)(r0 +0) | 12: (15) if r0 == 0x0 goto pc+1 | 13: (05) goto pc+1 | 14: (b7) r0 = 0 | 15: (15) if r0 == 0x0 goto pc+12 16: (62) *(u32 *)(r10 -4) = 0 17: (bf) r2 = r10 18: (07) r2 += -4 19: (bf) r1 = r0 20: (07) r1 += 272 | 21: (61) r0 = *(u32 *)(r2 +0) | 22: (35) if r0 >= 0x1 goto pc+4 | Now fixed inlined inner map 23: (54) (u32) r0 &= (u32) 0 | lookup with proper index masking 24: (67) r0 <<= 3 | for map->unpriv_array. 25: (0f) r0 += r1 | 26: (05) goto pc+1 | 27: (b7) r0 = 0 | 28: (b7) r0 = 0 29: (95) exit Fixes: b2157399cc98 ("bpf: prevent out-of-bounds speculation") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-01-18perf core: Fix perf_proc_update_handler() bugStephane Eranian1-7/+7
The perf_proc_update_handler() handles /proc/sys/kernel/perf_event_max_sample_rate syctl variable. When the PMU IRQ handler timing monitoring is disabled, i.e, when /proc/sys/kernel/perf_cpu_time_max_percent is equal to 0 or 100, then no modification to sysctl_perf_event_sample_rate is allowed to prevent possible hang from wrong values. The problem is that the test to prevent modification is made after the sysctl variable is modified in perf_proc_update_handler(). You get an error: $ echo 10001 >/proc/sys/kernel/perf_event_max_sample_rate echo: write error: invalid argument But the value is still modified causing all sorts of inconsistencies: $ cat /proc/sys/kernel/perf_event_max_sample_rate 10001 This patch fixes the problem by moving the parsing of the value after the test. Committer testing: # echo 100 > /proc/sys/kernel/perf_cpu_time_max_percent # echo 10001 > /proc/sys/kernel/perf_event_max_sample_rate -bash: echo: write error: Invalid argument # cat /proc/sys/kernel/perf_event_max_sample_rate 10001 # Signed-off-by: Stephane Eranian <eranian@google.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Jiri Olsa <jolsa@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1547169436-6266-1-git-send-email-eranian@google.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-01-18genirq/irqdesc: Fix double increment in alloc_descs()Huacai Chen1-1/+1
The recent rework of alloc_descs() introduced a double increment of the loop counter. As a consequence only every second affinity mask is validated. Remove it. [ tglx: Massaged changelog ] Fixes: c410abbbacb9 ("genirq/affinity: Add is_managed to struct irq_affinity_desc") Signed-off-by: Huacai Chen <chenhc@lemote.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Fuxin Zhang <zhangfx@lemote.com> Cc: Zhangjin Wu <wuzhangjin@gmail.com> Cc: Huacai Chen <chenhuacai@gmail.com> Cc: Dou Liyang <douliyangs@gmail.com> Link: https://lkml.kernel.org/r/1547694009-16261-1-git-send-email-chenhc@lemote.com
2019-01-18Merge branch 'stable/for-linus-5.0' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlbLinus Torvalds1-0/+2
Pull swiotlb fix from Konrad Rzeszutek Wilk: "A tiny fix for v5.0-rc2: This fixes an issue with GPU cards not working anymore with the DMA mapping work Christopher did - as the SWIOTLB is initialized first and then free'd (as IOMMU is available) but we forgot to clear our start and end entries which are used and BOOM" * 'stable/for-linus-5.0' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb: swiotlb: clear io_tlb_start and io_tlb_end in swiotlb_exit
2019-01-17bpf: Annotate implicit fall through in cgroup_dev_func_protoMathieu Malaterre1-0/+1
There is a plan to build the kernel with -Wimplicit-fallthrough and this place in the code produced a warnings (W=1). This commit removes the following warning: kernel/bpf/cgroup.c:719:6: warning: this statement may fall through [-Wimplicit-fallthrough=] Signed-off-by: Mathieu Malaterre <malat@debian.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-01-17bpf: Make function btf_name_offset_valid staticMathieu Malaterre1-1/+1
Initially in commit 69b693f0aefa ("bpf: btf: Introduce BPF Type Format (BTF)") the function 'btf_name_offset_valid' was introduced as static function it was later on changed to a non-static one, and then finally in commit 23127b33ec80 ("bpf: Create a new btf_name_by_offset() for non type name use case") the function prototype was removed. Revert back to original implementation and make the function static. Remove warning triggered with W=1: kernel/bpf/btf.c:470:6: warning: no previous prototype for 'btf_name_offset_valid' [-Wmissing-prototypes] Fixes: 23127b33ec80 ("bpf: Create a new btf_name_by_offset() for non type name use case") Signed-off-by: Mathieu Malaterre <malat@debian.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-01-17bpf: zero out build_id for BPF_STACK_BUILD_ID_IPStanislav Fomichev1-0/+2
When returning BPF_STACK_BUILD_ID_IP from stack_map_get_build_id_offset, make sure that build_id field is empty. Since we are using percpu free list, there is a possibility that we might reuse some previous bpf_stack_build_id with non-zero build_id. Fixes: 615755a77b24 ("bpf: extend stackmap to save binary_build_id+offset instead of address") Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-01-17bpf: don't assume build-id length is always 20 bytesStanislav Fomichev1-2/+5
Build-id length is not fixed to 20, it can be (`man ld` /--build-id): * 128-bit (uuid) * 160-bit (sha1) * any length specified in ld --build-id=0xhexstring To fix the issue of missing BPF_STACK_BUILD_ID_VALID for shorter build-ids, assume that build-id is somewhere in the range of 1 .. 20. Set the remaining bytes to zero. v2: * don't introduce new "len = min(BPF_BUILD_ID_SIZE, nhdr->n_descsz)", we already know that nhdr->n_descsz <= BPF_BUILD_ID_SIZE if we enter this 'if' condition Fixes: 615755a77b24 ("bpf: extend stackmap to save binary_build_id+offset instead of address") Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-01-16swiotlb: clear io_tlb_start and io_tlb_end in swiotlb_exitChristoph Hellwig1-0/+2
Otherwise is_swiotlb_buffer will return false positives when we first initialize a swiotlb buffer, but then free it because we have an IOMMU available. Fixes: 55897af63091 ("dma-direct: merge swiotlb_dma_ops into the dma_direct code") Reported-by: Sibren Vasse <sibren@sibrenvasse.nl> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Sibren Vasse <sibren@sibrenvasse.nl> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2019-01-15seccomp: fix UAF in user-trap codeTycho Andersen1-0/+4
On the failure path, we do an fput() of the listener fd if the filter fails to install (e.g. because of a TSYNC race that's lost, or if the thread is killed, etc.). fput() doesn't actually release the fd, it just ads it to a work queue. Then the thread proceeds to free the filter, even though the listener struct file has a reference to it. To fix this, on the failure path let's set the private data to null, so we know in ->release() to ignore the filter. Reported-by: syzbot+981c26489b2d1c6316ba@syzkaller.appspotmail.com Fixes: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace") Signed-off-by: Tycho Andersen <tycho@tycho.ws> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: James Morris <james.morris@microsoft.com>
2019-01-16Merge tag 'trace-v5.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-traceLinus Torvalds1-3/+9
Pull tracing fix from Steven Rostedt: "Andrea Righi fixed a NULL pointer dereference in trace_kprobe_create() It is possible to trigger a NULL pointer dereference by writing an incorrectly formatted string to the krpobe_events file" * tag 'trace-v5.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing/kprobes: Fix NULL pointer dereference in trace_kprobe_create()
2019-01-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds5-22/+88
Pull networking fixes from David Miller: 1) Fix regression in multi-SKB responses to RTM_GETADDR, from Arthur Gautier. 2) Fix ipv6 frag parsing in openvswitch, from Yi-Hung Wei. 3) Unbounded recursion in ipv4 and ipv6 GUE tunnels, from Stefano Brivio. 4) Use after free in hns driver, from Yonglong Liu. 5) icmp6_send() needs to handle the case of NULL skb, from Eric Dumazet. 6) Missing rcu read lock in __inet6_bind() when operating on mapped addresses, from David Ahern. 7) Memory leak in tipc-nl_compat_publ_dump(), from Gustavo A. R. Silva. 8) Fix PHY vs r8169 module loading ordering issues, from Heiner Kallweit. 9) Fix bridge vlan memory leak, from Ido Schimmel. 10) Dev refcount leak in AF_PACKET, from Jason Gunthorpe. 11) Infoleak in ipv6_local_error(), flow label isn't completely initialized. From Eric Dumazet. 12) Handle mv88e6390 errata, from Andrew Lunn. 13) Making vhost/vsock CID hashing consistent, from Zha Bin. 14) Fix lack of UMH cleanup when it unexpectedly exits, from Taehee Yoo. 15) Bridge forwarding must clear skb->tstamp, from Paolo Abeni. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (87 commits) bnxt_en: Fix context memory allocation. bnxt_en: Fix ring checking logic on 57500 chips. mISDN: hfcsusb: Use struct_size() in kzalloc() net: clear skb->tstamp in bridge forwarding path net: bpfilter: disallow to remove bpfilter module while being used net: bpfilter: restart bpfilter_umh when error occurred net: bpfilter: use cleanup callback to release umh_info umh: add exit routine for UMH process isdn: i4l: isdn_tty: Fix some concurrency double-free bugs vhost/vsock: fix vhost vsock cid hashing inconsistent net: stmmac: Prevent RX starvation in stmmac_napi_poll() net: stmmac: Fix the logic of checking if RX Watchdog must be enabled net: stmmac: Check if CBS is supported before configuring net: stmmac: dwxgmac2: Only clear interrupts that are active net: stmmac: Fix PCI module removal leak tools/bpf: fix bpftool map dump with bitfields tools/bpf: test btf bitfield with >=256 struct member offset bpf: fix bpffs bitfield pretty print net: ethernet: mediatek: fix warning in phy_start_aneg tcp: change txhash on SYN-data timeout ...
2019-01-15tracing/kprobes: Fix NULL pointer dereference in trace_kprobe_create()Andrea Righi1-3/+9
It is possible to trigger a NULL pointer dereference by writing an incorrectly formatted string to krpobe_events (trying to create a kretprobe omitting the symbol). Example: echo "r:event_1 " >> /sys/kernel/debug/tracing/kprobe_events That triggers this: BUG: unable to handle kernel NULL pointer dereference at 0000000000000000 #PF error: [normal kernel read fault] PGD 0 P4D 0 Oops: 0000 [#1] SMP PTI CPU: 6 PID: 1757 Comm: bash Not tainted 5.0.0-rc1+ #125 Hardware name: Dell Inc. XPS 13 9370/0F6P3V, BIOS 1.5.1 08/09/2018 RIP: 0010:kstrtoull+0x2/0x20 Code: 28 00 00 00 75 17 48 83 c4 18 5b 41 5c 5d c3 b8 ea ff ff ff eb e1 b8 de ff ff ff eb da e8 d6 36 bb ff 66 0f 1f 44 00 00 31 c0 <80> 3f 2b 55 48 89 e5 0f 94 c0 48 01 c7 e8 5c ff ff ff 5d c3 66 2e RSP: 0018:ffffb5d482e57cb8 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffffffff82b12720 RDX: ffffb5d482e57cf8 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffffb5d482e57d70 R08: ffffa0c05e5a7080 R09: ffffa0c05e003980 R10: 0000000000000000 R11: 0000000040000000 R12: ffffa0c04fe87b08 R13: 0000000000000001 R14: 000000000000000b R15: ffffa0c058d749e1 FS: 00007f137c7f7740(0000) GS:ffffa0c05e580000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 0000000497d46004 CR4: 00000000003606e0 Call Trace: ? trace_kprobe_create+0xb6/0x840 ? _cond_resched+0x19/0x40 ? _cond_resched+0x19/0x40 ? __kmalloc+0x62/0x210 ? argv_split+0x8f/0x140 ? trace_kprobe_create+0x840/0x840 ? trace_kprobe_create+0x840/0x840 create_or_delete_trace_kprobe+0x11/0x30 trace_run_command+0x50/0x90 trace_parse_run_command+0xc1/0x160 probes_write+0x10/0x20 __vfs_write+0x3a/0x1b0 ? apparmor_file_permission+0x1a/0x20 ? security_file_permission+0x31/0xf0 ? _cond_resched+0x19/0x40 vfs_write+0xb1/0x1a0 ksys_write+0x55/0xc0 __x64_sys_write+0x1a/0x20 do_syscall_64+0x5a/0x120 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fix by doing the proper argument checks in trace_kprobe_create(). Cc: Ingo Molnar <mingo@redhat.com> Link: https://lore.kernel.org/lkml/20190111095108.b79a2ee026185cbd62365977@kernel.org Link: http://lkml.kernel.org/r/20190111060113.GA22841@xps-13 Fixes: 6212dd29683e ("tracing/kprobes: Use dyn_event framework for kprobe events") Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Andrea Righi <righi.andrea@gmail.com> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-01-15posix-cpu-timers: Unbreak timer rearmingThomas Gleixner1-0/+1
The recent commit which prevented a division by 0 issue in the alarm timer code broke posix CPU timers as an unwanted side effect. The reason is that the common rearm code checks for timer->it_interval being 0 now. What went unnoticed is that the posix cpu timer setup does not initialize timer->it_interval as it stores the interval in CPU timer specific storage. The reason for the separate storage is historical as the posix CPU timers always had a 64bit nanoseconds representation internally while timer->it_interval is type ktime_t which used to be a modified timespec representation on 32bit machines. Instead of reverting the offending commit and fixing the alarmtimer issue in the alarmtimer code, store the interval in timer->it_interval at CPU timer setup time so the common code check works. This also repairs the existing inconistency of the posix CPU timer code which kept a single shot timer armed despite of the interval being 0. The separate storage can be removed in mainline, but that needs to be a separate commit as the current one has to be backported to stable kernels. Fixes: 0e334db6bb4b ("posix-timers: Fix division by zero bug") Reported-by: H.J. Lu <hjl.tools@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: John Stultz <john.stultz@linaro.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20190111133500.840117406@linutronix.de
2019-01-15genirq: Make sure the initial affinity is not emptySrinivas Ramana1-0/+3
If all CPUs in the irq_default_affinity mask are offline when an interrupt is initialized then irq_setup_affinity() can set an empty affinity mask for a newly allocated interrupt. Fix this by falling back to cpu_online_mask in case the resulting affinity mask is zero. Signed-off-by: Srinivas Ramana <sramana@codeaurora.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: linux-arm-msm@vger.kernel.org Link: https://lkml.kernel.org/r/1545312957-8504-1-git-send-email-sramana@codeaurora.org
2019-01-14kernel/sys.c: Clarify that UNAME26 does not generate unique versions anymoreJonathan Neuschäfer1-1/+2
UNAME26 is a mechanism to report Linux's version as 2.6.x, for compatibility with old/broken software. Due to the way it is implemented, it would have to be updated after 5.0, to keep the resulting versions unique. Linus Torvalds argued: "Do we actually need this? I'd rather let it bitrot, and just let it return random versions. It will just start again at 2.4.60, won't it? Anybody who uses UNAME26 for a 5.x kernel might as well think it's still 4.x. The user space is so old that it can't possibly care about differences between 4.x and 5.x, can it? The only thing that matters is that it shows "2.4.<largeenough>", which it will do regardless" Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-11umh: add exit routine for UMH processTaehee Yoo2-2/+32
A UMH process which is created by the fork_usermode_blob() such as bpfilter needs to release members of the umh_info when process is terminated. But the do_exit() does not release members of the umh_info. hence module which uses UMH needs own code to detect whether UMH process is terminated or not. But this implementation needs extra code for checking the status of UMH process. it eventually makes the code more complex. The new PF_UMH flag is added and it is used to identify UMH processes. The exit_umh() does not release members of the umh_info. Hence umh_info->cleanup callback should release both members of the umh_info and the private data. Suggested-by: David S. Miller <davem@davemloft.net> Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-11bpf: fix bpffs bitfield pretty printYonghong Song1-6/+6
Commit 9d5f9f701b18 ("bpf: btf: fix struct/union/fwd types with kind_flag") introduced kind_flag and used bitfield_size in the btf_member to directly pretty print member values. The commit contained a bug where the incorrect parameters could be passed to function btf_bitfield_seq_show(). The bits_offset parameter in the function expects a value less than 8. Instead, the member offset in the structure is passed. The below is btf_bitfield_seq_show() func signature: void btf_bitfield_seq_show(void *data, u8 bits_offset, u8 nr_bits, struct seq_file *m) both bits_offset and nr_bits are u8 type. If the bitfield member offset is greater than 256, incorrect value will be printed. This patch fixed the issue by calculating correct proper data offset and bits_offset similar to non kind_flag case. Fixes: 9d5f9f701b18 ("bpf: btf: fix struct/union/fwd types with kind_flag") Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-01-10bpf: fix panic in stack_map_get_build_id() on i386 and arm32Song Liu1-1/+2
As Naresh reported, test_stacktrace_build_id() causes panic on i386 and arm32 systems. This is caused by page_address() returns NULL in certain cases. This patch fixes this error by using kmap_atomic/kunmap_atomic instead of page_address. Fixes: 615755a77b24 (" bpf: extend stackmap to save binary_build_id+offset instead of address") Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-01-08Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-0/+1
Merge misc fixes from Andrew Morton: "14 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mm, page_alloc: do not wake kswapd with zone lock held hugetlbfs: revert "use i_mmap_rwsem for more pmd sharing synchronization" hugetlbfs: revert "Use i_mmap_rwsem to fix page fault/truncate race" mm: page_mapped: don't assume compound page is huge or THP mm/memory.c: initialise mmu_notifier_range correctly tools/vm/page_owner: use page_owner_sort in the use example kasan: fix krealloc handling for tag-based mode kasan: make tag based mode work with CONFIG_HARDENED_USERCOPY kasan, arm64: use ARCH_SLAB_MINALIGN instead of manual aligning mm, memcg: fix reclaim deadlock with writeback mm/usercopy.c: no check page span for stack objects slab: alien caches must not be initialized if the allocation of the alien cache failed fork, memcg: fix cached_stacks case zram: idle writeback fixes and cleanup