aboutsummaryrefslogtreecommitdiffstats
path: root/kernel (follow)
AgeCommit message (Collapse)AuthorFilesLines
2018-06-06genirq/generic_pending: Do not lose pending affinity updateThomas Gleixner1-7/+19
The generic pending interrupt mechanism moves interrupts from the interrupt handler on the original target CPU to the new destination CPU. This is required for x86 and ia64 due to the way the interrupt delivery and acknowledge works if the interrupts are not remapped. However that update can fail for various reasons. Some of them are valid reasons to discard the pending update, but the case, when the previous move has not been fully cleaned up is not a legit reason to fail. Check the return value of irq_do_set_affinity() for -EBUSY, which indicates a pending cleanup, and rearm the pending move in the irq dexcriptor so it's tried again when the next interrupt arrives. Fixes: 996c591227d9 ("x86/irq: Plug vector cleanup race") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Song Liu <songliubraving@fb.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <liu.song.a23@gmail.com> Cc: Dmitry Safonov <0x7f454c46@gmail.com> Cc: stable@vger.kernel.org Cc: Mike Travis <mike.travis@hpe.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Tariq Toukan <tariqt@mellanox.com> Link: https://lkml.kernel.org/r/20180604162224.386544292@linutronix.de
2018-06-06rseq: Introduce restartable sequences system callMathieu Desnoyers5-0/+365
Expose a new system call allowing each thread to register one userspace memory area to be used as an ABI between kernel and user-space for two purposes: user-space restartable sequences and quick access to read the current CPU number value from user-space. * Restartable sequences (per-cpu atomics) Restartables sequences allow user-space to perform update operations on per-cpu data without requiring heavy-weight atomic operations. The restartable critical sections (percpu atomics) work has been started by Paul Turner and Andrew Hunter. It lets the kernel handle restart of critical sections. [1] [2] The re-implementation proposed here brings a few simplifications to the ABI which facilitates porting to other architectures and speeds up the user-space fast path. Here are benchmarks of various rseq use-cases. Test hardware: arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading The following benchmarks were all performed on a single thread. * Per-CPU statistic counter increment getcpu+atomic (ns/op) rseq (ns/op) speedup arm32: 344.0 31.4 11.0 x86-64: 15.3 2.0 7.7 * LTTng-UST: write event 32-bit header, 32-bit payload into tracer per-cpu buffer getcpu+atomic (ns/op) rseq (ns/op) speedup arm32: 2502.0 2250.0 1.1 x86-64: 117.4 98.0 1.2 * liburcu percpu: lock-unlock pair, dereference, read/compare word getcpu+atomic (ns/op) rseq (ns/op) speedup arm32: 751.0 128.5 5.8 x86-64: 53.4 28.6 1.9 * jemalloc memory allocator adapted to use rseq Using rseq with per-cpu memory pools in jemalloc at Facebook (based on rseq 2016 implementation): The production workload response-time has 1-2% gain avg. latency, and the P99 overall latency drops by 2-3%. * Reading the current CPU number Speeding up reading the current CPU number on which the caller thread is running is done by keeping the current CPU number up do date within the cpu_id field of the memory area registered by the thread. This is done by making scheduler preemption set the TIF_NOTIFY_RESUME flag on the current thread. Upon return to user-space, a notify-resume handler updates the current CPU value within the registered user-space memory area. User-space can then read the current CPU number directly from memory. Keeping the current cpu id in a memory area shared between kernel and user-space is an improvement over current mechanisms available to read the current CPU number, which has the following benefits over alternative approaches: - 35x speedup on ARM vs system call through glibc - 20x speedup on x86 compared to calling glibc, which calls vdso executing a "lsl" instruction, - 14x speedup on x86 compared to inlined "lsl" instruction, - Unlike vdso approaches, this cpu_id value can be read from an inline assembly, which makes it a useful building block for restartable sequences. - The approach of reading the cpu id through memory mapping shared between kernel and user-space is portable (e.g. ARM), which is not the case for the lsl-based x86 vdso. On x86, yet another possible approach would be to use the gs segment selector to point to user-space per-cpu data. This approach performs similarly to the cpu id cache, but it has two disadvantages: it is not portable, and it is incompatible with existing applications already using the gs segment selector for other purposes. Benchmarking various approaches for reading the current CPU number: ARMv7 Processor rev 4 (v7l) Machine model: Cubietruck - Baseline (empty loop): 8.4 ns - Read CPU from rseq cpu_id: 16.7 ns - Read CPU from rseq cpu_id (lazy register): 19.8 ns - glibc 2.19-0ubuntu6.6 getcpu: 301.8 ns - getcpu system call: 234.9 ns x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz: - Baseline (empty loop): 0.8 ns - Read CPU from rseq cpu_id: 0.8 ns - Read CPU from rseq cpu_id (lazy register): 0.8 ns - Read using gs segment selector: 0.8 ns - "lsl" inline assembly: 13.0 ns - glibc 2.19-0ubuntu6 getcpu: 16.6 ns - getcpu system call: 53.9 ns - Speed (benchmark taken on v8 of patchset) Running 10 runs of hackbench -l 100000 seems to indicate, contrary to expectations, that enabling CONFIG_RSEQ slightly accelerates the scheduler: Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1 kernel parameter), with a Linux v4.6 defconfig+localyesconfig, restartable sequences series applied. * CONFIG_RSEQ=n avg.: 41.37 s std.dev.: 0.36 s * CONFIG_RSEQ=y avg.: 40.46 s std.dev.: 0.33 s - Size On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is 567 bytes, and the data size increase of vmlinux is 5696 bytes. [1] https://lwn.net/Articles/650333/ [2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Joel Fernandes <joelaf@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Watson <davejwatson@fb.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Chris Lameter <cl@linux.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: Andrew Hunter <ahh@google.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com> Cc: Paul Turner <pjt@google.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Maurer <bmaurer@fb.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: linux-api@vger.kernel.org Cc: Andy Lutomirski <luto@amacapital.net> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com Link: https://lkml.kernel.org/r/20180602124408.8430-3-mathieu.desnoyers@efficios.com
2018-06-05Merge branch 'for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds2-48/+94
Pull workqueue updates from Tejun Heo: - make kworkers report the workqueue it is executing or has executed most recently in /proc/PID/comm (so they show up in ps/top) - CONFIG_SMP shuffle to move stuff which isn't necessary for UP builds inside CONFIG_SMP. * 'for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: move function definitions within CONFIG_SMP block workqueue: Make sure struct worker is accessible for wq_worker_comm() workqueue: Show the latest workqueue name in /proc/PID/{comm,stat,status} proc: Consolidate task->comm formatting into proc_task_name() workqueue: Set worker->desc to workqueue name by default workqueue: Make worker_attach/detach_pool() update worker->pool workqueue: Replace pool->attach_mutex with global wq_pool_attach_mutex
2018-06-05Merge branch 'for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroupLinus Torvalds6-395/+512
Pull cgroup updates from Tejun Heo: - For cpustat, cgroup has a percpu hierarchical stat mechanism which propagates up the hierarchy lazily. This contains commits to factor out and generalize the mechanism so that it can be used for other cgroup stats too. The original intention was to update memcg stats to use it but memcg went for a different approach, so still the only user is cpustat. The factoring out and generalization still make sense and it's likely that this can be used for other purposes in the future. - cgroup uses kernfs_notify() (which uses fsnotify()) to inform user space of certain events. A rate limiting mechanism is added. - Other misc changes. * 'for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: css_set_lock should nest inside tasklist_lock rdmacg: Convert to use match_string() helper cgroup: Make cgroup_rstat_updated() ready for root cgroup usage cgroup: Add memory barriers to plug cgroup_rstat_updated() race window cgroup: Add cgroup_subsys->css_rstat_flush() cgroup: Replace cgroup_rstat_mutex with a spinlock cgroup: Factor out and expose cgroup_rstat_*() interface functions cgroup: Reorganize kernel/cgroup/rstat.c cgroup: Distinguish base resource stat implementation from rstat cgroup: Rename stat to rstat cgroup: Rename kernel/cgroup/stat.c to kernel/cgroup/rstat.c cgroup: Limit event generation frequency cgroup: Explicitly remove core interface files
2018-06-05ide: don't enable/disable interrupts in force threaded-IRQ modeSebastian Andrzej Siewior1-0/+1
The interrupts are enabled/disabled so the interrupt handler can run with enabled interrupts while serving the interrupt and not lose other interrupts especially the timer tick. If the system runs with force-threaded interrupts then there is no need to enable the interrupts. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-05tracing: Use match_string() instead of open coding it in trace_set_options()Yisheng Xie1-10/+5
match_string() returns the index of an array for a matching string, which can be used to simplify the code. Link: http://lkml.kernel.org/r/1526546163-4609-1-git-send-email-xieyisheng1@huawei.com Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-06-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller10-53/+169
Daniel Borkmann says: ==================== pull-request: bpf-next 2018-06-05 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) Add a new BPF hook for sendmsg similar to existing hooks for bind and connect: "This allows to override source IP (including the case when it's set via cmsg(3)) and destination IP:port for unconnected UDP (slow path). TCP and connected UDP (fast path) are not affected. This makes UDP support complete, that is, connected UDP is handled by connect hooks, unconnected by sendmsg ones.", from Andrey. 2) Rework of the AF_XDP API to allow extending it in future for type writer model if necessary. In this mode a memory window is passed to hardware and multiple frames might be filled into that window instead of just one that is the case in the current fixed frame-size model. With the new changes made this can be supported without having to add a new descriptor format. Also, core bits for the zero-copy support for AF_XDP have been merged as agreed upon, where i40e bits will be routed via Jeff later on. Various improvements to documentation and sample programs included as well, all from Björn and Magnus. 3) Given BPF's flexibility, a new program type has been added to implement infrared decoders. Quote: "The kernel IR decoders support the most widely used IR protocols, but there are many protocols which are not supported. [...] There is a 'long tail' of unsupported IR protocols, for which lircd is need to decode the IR. IR encoding is done in such a way that some simple circuit can decode it; therefore, BPF is ideal. [...] user-space can define a decoder in BPF, attach it to the rc device through the lirc chardev.", from Sean. 4) Several improvements and fixes to BPF core, among others, dumping map and prog IDs into fdinfo which is a straight forward way to correlate BPF objects used by applications, removing an indirect call and therefore retpoline in all map lookup/update/delete calls by invoking the callback directly for 64 bit archs, adding a new bpf_skb_cgroup_id() BPF helper for tc BPF programs to have an efficient way of looking up cgroup v2 id for policy or other use cases. Fixes to make sure we zero tunnel/xfrm state that hasn't been filled, to allow context access wrt pt_regs in 32 bit archs for tracing, and last but not least various test cases for fixes that landed in bpf earlier, from Daniel. 5) Get rid of the ndo_xdp_flush API and extend the ndo_xdp_xmit with a XDP_XMIT_FLUSH flag instead which allows to avoid one indirect call as flushing is now merged directly into ndo_xdp_xmit(), from Jesper. 6) Add a new bpf_get_current_cgroup_id() helper that can be used in tracing to retrieve the cgroup id from the current process in order to allow for e.g. aggregation of container-level events, from Yonghong. 7) Two follow-up fixes for BTF to reject invalid input values and related to that also two test cases for BPF kselftests, from Martin. 8) Various API improvements to the bpf_fib_lookup() helper, that is, dropping MPLS bits which are not fully hashed out yet, rejecting invalid helper flags, returning error for unsupported address families as well as renaming flowlabel to flowinfo, from David. 9) Various fixes and improvements to sockmap BPF kselftests in particular in proper error detection and data verification, from Prashant. 10) Two arm32 BPF JIT improvements. One is to fix imm range check with regards to whether immediate fits into 24 bits, and a naming cleanup to get functions related to rsh handling consistent to those handling lsh, from Wang. 11) Two compile warning fixes in BPF, one for BTF and a false positive to silent gcc in stack_map_get_build_id_offset(), from Arnd. 12) Add missing seg6.h header into tools include infrastructure in order to fix compilation of BPF kselftests, from Mathieu. 13) Several formatting cleanups in the BPF UAPI helper description that also fix an error during rst2man compilation, from Quentin. 14) Hide an unused variable in sk_msg_convert_ctx_access() when IPv6 is not built into the kernel, from Yue. 15) Remove a useless double assignment in dev_map_enqueue(), from Colin. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-05Merge tag 'pm-4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pmLinus Torvalds7-95/+210
Pull power management updates from Rafael Wysocki: "These include a significant update of the generic power domains (genpd) and Operating Performance Points (OPP) frameworks, mostly related to the introduction of power domain performance levels, cpufreq updates (new driver for Qualcomm Kryo processors, updates of the existing drivers, some core fixes, schedutil governor improvements), PCI power management fixes, ACPI workaround for EC-based wakeup events handling on resume from suspend-to-idle, and major updates of the turbostat and pm-graph utilities. Specifics: - Introduce power domain performance levels into the the generic power domains (genpd) and Operating Performance Points (OPP) frameworks (Viresh Kumar, Rajendra Nayak, Dan Carpenter). - Fix two issues in the runtime PM framework related to the initialization and removal of devices using device links (Ulf Hansson). - Clean up the initialization of drivers for devices in PM domains (Ulf Hansson, Geert Uytterhoeven). - Fix a cpufreq core issue related to the policy sysfs interface causing CPU online to fail for CPUs sharing one cpufreq policy in some situations (Tao Wang). - Make it possible to use platform-specific suspend/resume hooks in the cpufreq-dt driver and make the Armada 37xx DVFS use that feature (Viresh Kumar, Miquel Raynal). - Optimize policy transition notifications in cpufreq (Viresh Kumar). - Improve the iowait boost mechanism in the schedutil cpufreq governor (Patrick Bellasi). - Improve the handling of deferred frequency updates in the schedutil cpufreq governor (Joel Fernandes, Dietmar Eggemann, Rafael Wysocki, Viresh Kumar). - Add a new cpufreq driver for Qualcomm Kryo (Ilia Lin). - Fix and clean up some cpufreq drivers (Colin Ian King, Dmitry Osipenko, Doug Smythies, Luc Van Oostenryck, Simon Horman, Viresh Kumar). - Fix the handling of PCI devices with the DPM_SMART_SUSPEND flag set and update stale comments in the PCI core PM code (Rafael Wysocki). - Work around an issue related to the handling of EC-based wakeup events in the ACPI PM core during resume from suspend-to-idle if the EC has been put into the low-power mode (Rafael Wysocki). - Improve the handling of wakeup source objects in the PM core (Doug Berger, Mahendran Ganesh, Rafael Wysocki). - Update the driver core to prevent deferred probe from breaking suspend/resume ordering (Feng Kan). - Clean up the PM core somewhat (Bjorn Helgaas, Ulf Hansson, Rafael Wysocki). - Make the core suspend/resume code and cpufreq support the RT patch (Sebastian Andrzej Siewior, Thomas Gleixner). - Consolidate the PM QoS handling in cpuidle governors (Rafael Wysocki). - Fix a possible crash in the hibernation core (Tetsuo Handa). - Update the rockchip-io Adaptive Voltage Scaling (AVS) driver (David Wu). - Update the turbostat utility (fixes, cleanups, new CPU IDs, new command line options, built-in "Low Power Idle" counters support, new POLL and POLL% columns) and add an entry for it to MAINTAINERS (Len Brown, Artem Bityutskiy, Chen Yu, Laura Abbott, Matt Turner, Prarit Bhargava, Srinivas Pandruvada). - Update the pm-graph to version 5.1 (Todd Brandt). - Update the intel_pstate_tracer utility (Doug Smythies)" * tag 'pm-4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (128 commits) tools/power turbostat: update version number tools/power turbostat: Add Node in output tools/power turbostat: add node information into turbostat calculations tools/power turbostat: remove num_ from cpu_topology struct tools/power turbostat: rename num_cores_per_pkg to num_cores_per_node tools/power turbostat: track thread ID in cpu_topology tools/power turbostat: Calculate additional node information for a package tools/power turbostat: Fix node and siblings lookup data tools/power turbostat: set max_num_cpus equal to the cpumask length tools/power turbostat: if --num_iterations, print for specific number of iterations tools/power turbostat: Add Cannon Lake support tools/power turbostat: delete duplicate #defines x86: msr-index.h: Correct SNB_C1/C3_AUTO_UNDEMOTE defines tools/power turbostat: Correct SNB_C1/C3_AUTO_UNDEMOTE defines tools/power turbostat: add POLL and POLL% column tools/power turbostat: Fix --hide Pk%pc10 tools/power turbostat: Build-in "Low Power Idle" counters support tools/power turbostat: Don't make man pages executable tools/power turbostat: remove blank lines tools/power turbostat: a small C-states dump readability immprovement ...
2018-06-05printk: drop in_nmi check from printk_safe_flush_on_panic()Sergey Senozhatsky1-1/+1
Drop the in_nmi() check from printk_safe_flush_on_panic() and attempt to re-init (IOW unlock) locked logbuf spinlock from panic CPU regardless of its context. Otherwise, theoretically, we can deadlock on logbuf trying to flush per-CPU buffers: a) Panic CPU is running in non-NMI context b) Panic CPU sends out shutdown IPI via reboot vector c) Panic CPU fails to stop all remote CPUs d) Panic CPU sends out shutdown IPI via NMI vector One of the CPUs that we bring down via NMI vector can hold logbuf spin lock (theoretically). Link: http://lkml.kernel.org/r/20180530070350.10131-1-sergey.senozhatsky@gmail.com To: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Petr Mladek <pmladek@suse.com>
2018-06-04Merge branch 'timers-2038-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds2-41/+34
Pull time/Y2038 updates from Thomas Gleixner: - Consolidate SySV IPC UAPI headers - Convert SySV IPC to the new COMPAT_32BIT_TIME mechanism - Cleanup the core interfaces and standardize on the ktime_get_* naming convention. - Convert the X86 platform ops to timespec64 - Remove the ugly temporary timespec64 hack * 'timers-2038-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits) x86: Convert x86_platform_ops to timespec64 timekeeping: Add more coarse clocktai/boottime interfaces timekeeping: Add ktime_get_coarse_with_offset timekeeping: Standardize on ktime_get_*() naming timekeeping: Clean up ktime_get_real_ts64 timekeeping: Remove timespec64 hack y2038: ipc: Redirect ipc(SEMTIMEDOP, ...) to compat_ksys_semtimedop y2038: ipc: Enable COMPAT_32BIT_TIME y2038: ipc: Use __kernel_timespec y2038: ipc: Report long times to user space y2038: ipc: Use ktime_get_real_seconds consistently y2038: xtensa: Extend sysvipc data structures y2038: powerpc: Extend sysvipc data structures y2038: sparc: Extend sysvipc data structures y2038: parisc: Extend sysvipc data structures y2038: mips: Extend sysvipc data structures y2038: arm64: Extend sysvipc compat data structures y2038: s390: Remove unneeded ipc uapi header files y2038: ia64: Remove unneeded ipc uapi header files y2038: alpha: Remove unneeded ipc uapi header files ...
2018-06-04Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds9-96/+114
Pull timers and timekeeping updates from Thomas Gleixner: - Core infrastucture work for Y2038 to address the COMPAT interfaces: + Add a new Y2038 safe __kernel_timespec and use it in the core code + Introduce config switches which allow to control the various compat mechanisms + Use the new config switch in the posix timer code to control the 32bit compat syscall implementation. - Prevent bogus selection of CPU local clocksources which causes an endless reselection loop - Remove the extra kthread in the clocksource code which has no value and just adds another level of indirection - The usual bunch of trivial updates, cleanups and fixlets all over the place - More SPDX conversions * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits) clocksource/drivers/mxs_timer: Switch to SPDX identifier clocksource/drivers/timer-imx-tpm: Switch to SPDX identifier clocksource/drivers/timer-imx-gpt: Switch to SPDX identifier clocksource/drivers/timer-imx-gpt: Remove outdated file path clocksource/drivers/arc_timer: Add comments about locking while read GFRC clocksource/drivers/mips-gic-timer: Add pr_fmt and reword pr_* messages clocksource/drivers/sprd: Fix Kconfig dependency clocksource: Move inline keyword to the beginning of function declarations timer_list: Remove unused function pointer typedef timers: Adjust a kernel-doc comment tick: Prefer a lower rating device only if it's CPU local device clocksource: Remove kthread time: Change nanosleep to safe __kernel_* types time: Change types to new y2038 safe __kernel_* types time: Fix get_timespec64() for y2038 safe compat interfaces time: Add new y2038 safe __kernel_timespec posix-timers: Make compat syscalls depend on CONFIG_COMPAT_32BIT_TIME time: Introduce CONFIG_COMPAT_32BIT_TIME time: Introduce CONFIG_64BIT_TIME in architectures compat: Enable compat_get/put_timespec64 always ...
2018-06-04Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds3-17/+27
Pull irq updates from Thomas Gleixner: - Consolidation of softirq pending: The softirq mask and its accessors/mutators have many implementations scattered around many architectures. Most do the same things consisting in a field in a per-cpu struct (often irq_cpustat_t) accessed through per-cpu ops. We can provide instead a generic efficient version that most of them can use. In fact s390 is the only exception because the field is stored in lowcore. - Support for level!?! triggered MSI (ARM) Over the past couple of years, we've seen some SoCs coming up with ways of signalling level interrupts using a new flavor of MSIs, where the MSI controller uses two distinct messages: one that raises a virtual line, and one that lowers it. The target MSI controller is in charge of maintaining the state of the line. This allows for a much simplified HW signal routing (no need to have hundreds of discrete lines to signal level interrupts if you already have a memory bus), but results in a departure from the current idea the kernel has of MSIs. - Support for Meson-AXG GPIO irqchip - Large stm32 irqchip rework (suspend/resume, hierarchical domains) - More SPDX conversions * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits) ARM: dts: stm32: Add exti support to stm32mp157 pinctrl ARM: dts: stm32: Add exti support for stm32mp157c pinctrl/stm32: Add irq_eoi for stm32gpio irqchip irqchip/stm32: Add suspend/resume support for hierarchy domain irqchip/stm32: Add stm32mp1 support with hierarchy domain irqchip/stm32: Prepare common functions irqchip/stm32: Add host and driver data structures irqchip/stm32: Add suspend support irqchip/stm32: Add falling pending register support irqchip/stm32: Checkpatch fix irqchip/stm32: Optimizes and cleans up stm32-exti irq_domain irqchip/meson-gpio: Add support for Meson-AXG SoCs dt-bindings: interrupt-controller: New binding for Meson-AXG SoC dt-bindings: interrupt-controller: Fix the double quotes softirq/s390: Move default mutators of overwritten softirq mask to s390 softirq/x86: Switch to generic local_softirq_pending() implementation softirq/sparc: Switch to generic local_softirq_pending() implementation softirq/powerpc: Switch to generic local_softirq_pending() implementation softirq/parisc: Switch to generic local_softirq_pending() implementation softirq/ia64: Switch to generic local_softirq_pending() implementation ...
2018-06-04Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds4-68/+111
Pull scheduler updates from Ingo Molnar: - power-aware scheduling improvements (Patrick Bellasi) - NUMA balancing improvements (Mel Gorman) - vCPU scheduling fixes (Rohit Jain) * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Update util_est before updating schedutil sched/cpufreq: Modify aggregate utilization to always include blocked FAIR utilization sched/deadline/Documentation: Add overrun signal and GRUB-PA documentation sched/core: Distinguish between idle_cpu() calls based on desired effect, introduce available_idle_cpu() sched/wait: Include <linux/wait.h> in <linux/swait.h> sched/numa: Stagger NUMA balancing scan periods for new threads sched/core: Don't schedule threads on pre-empted vCPUs sched/fair: Avoid calling sync_entity_load_avg() unnecessarily sched/fair: Rearrange select_task_rq_fair() to optimize it
2018-06-04Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-22/+22
Pull perf updates from Ingo Molnar: "Kernel side changes: - x86 Intel uncore driver cleanups and enhancements (Kan Liang) - group scheduling and other fixes (Song Liu - store frame pointer in the sample traces for better profiling (Alexey Budankov) - compat fixes/enhancements (Eugene Syromiatnikov) Tooling side changes, which you can build and install in a single step via: make -C tools/perf clean install perf annotate: - Support 'perf annotate --group' for non-explicit recorded event "groups", showing multiple columns, one for each event, just like when dealing with explicit event groups (those enclosed with {}) (Jin Yao) - Record min/max LBR cycles (>= Skylake) and add 'perf annotate' TUI hotkey to show it (c) (Jin Yao) perf bpf: - Add infrastructure to help in writing eBPF C programs to be used with '-e name.c' type events in tools such as 'record' and 'trace', with headers for common constructs and an examples directory that will get populated as we add more such helpers and the 'perf bpf' (Arnaldo Carvalho de Melo) perf stat: - Display time in precision based on std deviation (Jiri Olsa) - Add --table option to display time of each run (Jiri Olsa) - Display length strings of each run for --table option (Jiri Olsa) perf buildid-cache: - Add --list and --purge-all options (Ravi Bangoria) perf test: - Let 'perf test list' display subtests (Hendrik Brueckner) perf pti: - Create extra kernel maps to help in decoding samples in x86 PTI entry trampolines (Adrian Hunter) - Copy x86 PTI entry trampoline sections in the kcore copy used for annotation and intel_pt CPU traces decoding (Adrian Hunter) ... and a lot of other fixes, enhancements and cleanups I did not list, see the shortlog and git log for details" * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (111 commits) perf/x86/intel/uncore: Clean up client IMC uncore perf/x86/intel/uncore: Expose uncore_pmu_event*() functions perf/x86/intel/uncore: Support IIO free-running counters on SKX perf/x86/intel/uncore: Add infrastructure for free running counters perf/x86/intel/uncore: Add new data structures for free running counters perf/x86/intel/uncore: Correct fixed counter index check in generic code perf/x86/intel/uncore: Correct fixed counter index check for NHM perf/x86/intel/uncore: Introduce customized event_read() for client IMC uncore perf/x86: Store user space frame-pointer value on a sample perf/core: Wire up compat PERF_EVENT_IOC_QUERY_BPF, PERF_EVENT_IOC_MODIFY_ATTRIBUTES perf/core: Fix bad use of igrab() perf/core: Fix group scheduling with mixed hw and sw events perf kcore_copy: Amend the offset of sections that remap kernel text perf kcore_copy: Copy x86 PTI entry trampoline sections perf kcore_copy: Get rid of kernel_map perf kcore_copy: Iterate phdrs perf kcore_copy: Layout sections perf kcore_copy: Calculate offset from phnum perf kcore_copy: Keep a count of phdrs perf kcore_copy: Keep phdr data in a list ...
2018-06-04Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds9-256/+198
Pull locking updates from Ingo Molnar: - Lots of tidying up changes all across the map for Linux's formal memory/locking-model tooling, by Alan Stern, Akira Yokosawa, Andrea Parri, Paul E. McKenney and SeongJae Park. Notable changes beyond an overall update in the tooling itself is the tidying up of spin_is_locked() semantics, which spills over into the kernel proper as well. - qspinlock improvements: the locking algorithm now guarantees forward progress whereas the previous implementation in mainline could starve threads indefinitely in cmpxchg() loops. Also other related cleanups to the qspinlock code (Will Deacon) - misc smaller improvements, cleanups and fixes all across the locking subsystem * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits) locking/rwsem: Simplify the is-owner-spinnable checks tools/memory-model: Add reference for 'Simplifying ARM concurrency' tools/memory-model: Update ASPLOS information MAINTAINERS, tools/memory-model: Update e-mail address for Andrea Parri tools/memory-model: Fix coding style in 'lock.cat' tools/memory-model: Remove out-of-date comments and code from lock.cat tools/memory-model: Improve mixed-access checking in lock.cat tools/memory-model: Improve comments in lock.cat tools/memory-model: Remove duplicated code from lock.cat tools/memory-model: Flag "cumulativity" and "propagation" tests tools/memory-model: Add model support for spin_is_locked() tools/memory-model: Add scripts to test memory model tools/memory-model: Fix coding style in 'linux-kernel.def' tools/memory-model: Model 'smp_store_mb()' tools/memory-order: Update the cheat-sheet to show that smp_mb__after_atomic() orders later RMW operations tools/memory-order: Improve key for SELF and SV tools/memory-model: Fix cheat sheet typo tools/memory-model: Update required version of herdtools7 tools/memory-model: Redefine rb in terms of rcu-fence tools/memory-model: Rename link and rcu-path to rcu-link and rb ...
2018-06-04Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds16-483/+411
Pull RCU updates from Ingo Molnar: - updates to the handling of expedited grace periods - updates to reduce lock contention in the rcu_node combining tree [ These are in preparation for the consolidation of RCU-bh, RCU-preempt, and RCU-sched into a single flavor, which was requested by Linus in response to a security flaw whose root cause included confusion between the multiple flavors of RCU ] - torture-test updates that save their users some time and effort - miscellaneous fixes * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits) rcu/x86: Provide early rcu_cpu_starting() callback torture: Make kvm-find-errors.sh find build warnings rcutorture: Abbreviate kvm.sh summary lines rcutorture: Print end-of-test state in kvm.sh summary rcutorture: Print end-of-test state torture: Fold parse-torture.sh into parse-console.sh torture: Add a script to edit output from failed runs rcu: Update list of rcu_future_grace_period() trace events rcu: Drop early GP request check from rcu_gp_kthread() rcu: Simplify and inline cpu_needs_another_gp() rcu: The rcu_gp_cleanup() function does not need cpu_needs_another_gp() rcu: Make rcu_start_this_gp() check for out-of-range requests rcu: Add funnel locking to rcu_start_this_gp() rcu: Make rcu_start_future_gp() caller select grace period rcu: Inline rcu_start_gp_advanced() into rcu_start_future_gp() rcu: Clear request other than RCU_GP_FLAG_INIT at GP end rcu: Cleanup, don't put ->completed into an int rcu: Switch __rcu_process_callbacks() to rcu_accelerate_cbs() rcu: Avoid __call_rcu_core() root rcu_node ->lock acquisition rcu: Make rcu_migrate_callbacks wake GP kthread when needed ...
2018-06-04Merge branch 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespaceLinus Torvalds1-129/+52
Pull siginfo updates from Eric Biederman: "This set of changes close the known issues with setting si_code to an invalid value, and with not fully initializing struct siginfo. There remains work to do on nds32, arc, unicore32, powerpc, arm, arm64, ia64 and x86 to get the code that generates siginfo into a simpler and more maintainable state. Most of that work involves refactoring the signal handling code and thus careful code review. Also not included is the work to shrink the in kernel version of struct siginfo. That depends on getting the number of places that directly manipulate struct siginfo under control, as it requires the introduction of struct kernel_siginfo for the in kernel things. Overall this set of changes looks like it is making good progress, and with a little luck I will be wrapping up the siginfo work next development cycle" * 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (46 commits) signal/sh: Stop gcc warning about an impossible case in do_divide_error signal/mips: Report FPE_FLTUNK for undiagnosed floating point exceptions signal/um: More carefully relay signals in relay_signal. signal: Extend siginfo_layout with SIL_FAULT_{MCEERR|BNDERR|PKUERR} signal: Remove unncessary #ifdef SEGV_PKUERR in 32bit compat code signal/signalfd: Add support for SIGSYS signal/signalfd: Remove __put_user from signalfd_copyinfo signal/xtensa: Use force_sig_fault where appropriate signal/xtensa: Consistenly use SIGBUS in do_unaligned_user signal/um: Use force_sig_fault where appropriate signal/sparc: Use force_sig_fault where appropriate signal/sparc: Use send_sig_fault where appropriate signal/sh: Use force_sig_fault where appropriate signal/s390: Use force_sig_fault where appropriate signal/riscv: Replace do_trap_siginfo with force_sig_fault signal/riscv: Use force_sig_fault where appropriate signal/parisc: Use force_sig_fault where appropriate signal/parisc: Use force_sig_mceerr where appropriate signal/openrisc: Use force_sig_fault where appropriate signal/nios2: Use force_sig_fault where appropriate ...
2018-06-04ring-buffer: Fix a bunch of typos in commentsSteven Rostedt (VMware)1-10/+10
An anonymous source sent me a bunch of typo fixes in the comments of ring_buffer.c file. That source did not want to be associated to this patch because they don't want to be known as "one of those" commiters (you know who you are!). They gave me permission to sign this off in my own name. Suggested-by: One-of-those-commiters@YouKnowWhoYouAre.org Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-06-04Merge branch 'work.aio-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds1-0/+2
Pull aio updates from Al Viro: "Majority of AIO stuff this cycle. aio-fsync and aio-poll, mostly. The only thing I'm holding back for a day or so is Adam's aio ioprio - his last-minute fixup is trivial (missing stub in !CONFIG_BLOCK case), but let it sit in -next for decency sake..." * 'work.aio-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits) aio: sanitize the limit checking in io_submit(2) aio: fold do_io_submit() into callers aio: shift copyin of iocb into io_submit_one() aio_read_events_ring(): make a bit more readable aio: all callers of aio_{read,write,fsync,poll} treat 0 and -EIOCBQUEUED the same way aio: take list removal to (some) callers of aio_complete() aio: add missing break for the IOCB_CMD_FDSYNC case random: convert to ->poll_mask timerfd: convert to ->poll_mask eventfd: switch to ->poll_mask pipe: convert to ->poll_mask crypto: af_alg: convert to ->poll_mask net/rxrpc: convert to ->poll_mask net/iucv: convert to ->poll_mask net/phonet: convert to ->poll_mask net/nfc: convert to ->poll_mask net/caif: convert to ->poll_mask net/bluetooth: convert to ->poll_mask net/sctp: convert to ->poll_mask net/tipc: convert to ->poll_mask ...
2018-06-04bpf: guard bpf_get_current_cgroup_id() with CONFIG_CGROUPSYonghong Song1-0/+2
Commit bf6fa2c893c5 ("bpf: implement bpf_get_current_cgroup_id() helper") introduced a new helper bpf_get_current_cgroup_id(). The helper has a dependency on CONFIG_CGROUPS. When CONFIG_CGROUPS is not defined, using the helper will result the following verifier error: kernel subsystem misconfigured func bpf_get_current_cgroup_id#80 which is hard for users to interpret. Guarding the reference to bpf_get_current_cgroup_id_proto with CONFIG_CGROUPS will result in below better message: unknown func bpf_get_current_cgroup_id#80 Fixes: bf6fa2c893c5 ("bpf: implement bpf_get_current_cgroup_id() helper") Suggested-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-06-04Merge branch 'hch.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds11-248/+27
Pull procfs updates from Al Viro: "Christoph's proc_create_... cleanups series" * 'hch.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (44 commits) xfs, proc: hide unused xfs procfs helpers isdn/gigaset: add back gigaset_procinfo assignment proc: update SIZEOF_PDE_INLINE_NAME for the new pde fields tty: replace ->proc_fops with ->proc_show ide: replace ->proc_fops with ->proc_show ide: remove ide_driver_proc_write isdn: replace ->proc_fops with ->proc_show atm: switch to proc_create_seq_private atm: simplify procfs code bluetooth: switch to proc_create_seq_data netfilter/x_tables: switch to proc_create_seq_private netfilter/xt_hashlimit: switch to proc_create_{seq,single}_data neigh: switch to proc_create_seq_data hostap: switch to proc_create_{seq,single}_data bonding: switch to proc_create_seq_data rtc/proc: switch to proc_create_single_data drbd: switch to proc_create_single resource: switch to proc_create_seq_data staging/rtl8192u: simplify procfs code jfs: simplify procfs code ...
2018-06-04Merge tag 'for-4.18/block-20180603' of git://git.kernel.dk/linux-blockLinus Torvalds1-7/+7
Pull block updates from Jens Axboe: - clean up how we pass around gfp_t and blk_mq_req_flags_t (Christoph) - prepare us to defer scheduler attach (Christoph) - clean up drivers handling of bounce buffers (Christoph) - fix timeout handling corner cases (Christoph/Bart/Keith) - bcache fixes (Coly) - prep work for bcachefs and some block layer optimizations (Kent). - convert users of bio_sets to using embedded structs (Kent). - fixes for the BFQ io scheduler (Paolo/Davide/Filippo) - lightnvm fixes and improvements (Matias, with contributions from Hans and Javier) - adding discard throttling to blk-wbt (me) - sbitmap blk-mq-tag handling (me/Omar/Ming). - remove the sparc jsflash block driver, acked by DaveM. - Kyber scheduler improvement from Jianchao, making it more friendly wrt merging. - conversion of symbolic proc permissions to octal, from Joe Perches. Previously the block parts were a mix of both. - nbd fixes (Josef and Kevin Vigor) - unify how we handle the various kinds of timestamps that the block core and utility code uses (Omar) - three NVMe pull requests from Keith and Christoph, bringing AEN to feature completeness, file backed namespaces, cq/sq lock split, and various fixes - various little fixes and improvements all over the map * tag 'for-4.18/block-20180603' of git://git.kernel.dk/linux-block: (196 commits) blk-mq: update nr_requests when switching to 'none' scheduler block: don't use blocking queue entered for recursive bio submits dm-crypt: fix warning in shutdown path lightnvm: pblk: take bitmap alloc. out of critical section lightnvm: pblk: kick writer on new flush points lightnvm: pblk: only try to recover lines with written smeta lightnvm: pblk: remove unnecessary bio_get/put lightnvm: pblk: add possibility to set write buffer size manually lightnvm: fix partial read error path lightnvm: proper error handling for pblk_bio_add_pages lightnvm: pblk: fix smeta write error path lightnvm: pblk: garbage collect lines with failed writes lightnvm: pblk: rework write error recovery path lightnvm: pblk: remove dead function lightnvm: pass flag on graceful teardown to targets lightnvm: pblk: check for chunk size before allocating it lightnvm: pblk: remove unnecessary argument lightnvm: pblk: remove unnecessary indirection lightnvm: pblk: return NVM_ error on failed submission lightnvm: pblk: warn in case of corrupted write buffer ...
2018-06-04Merge branches 'pm-pci', 'acpi-pm', 'pm-sleep' and 'pm-avs'Rafael J. Wysocki4-11/+30
* pm-pci: PCI / PM: Clean up outdated comments in pci_target_state() PCI / PM: Do not clear state_saved for devices that remain suspended * acpi-pm: ACPI: EC: Dispatch the EC GPE directly on s2idle wake ACPICA: Introduce acpi_dispatch_gpe() * pm-sleep: PM / hibernate: Fix oops at snapshot_write() PM / wakeup: Make s2idle_lock a RAW_SPINLOCK PM / s2idle: Make s2idle_wait_head swait based PM / wakeup: Make events_lock a RAW_SPINLOCK PM / suspend: Prevent might sleep splats * pm-avs: PM / AVS: rockchip-io: add io selectors and supplies for PX30
2018-06-04Merge branches 'pm-cpufreq-sched' and 'pm-cpuidle'Rafael J. Wysocki1-83/+179
* pm-cpufreq-sched: cpufreq: schedutil: Avoid missing updates for one-CPU policies schedutil: Allow cpufreq requests to be made even when kthread kicked cpufreq: Rename cpufreq_can_do_remote_dvfs() cpufreq: schedutil: Cleanup and document iowait boost cpufreq: schedutil: Fix iowait boost reset cpufreq: schedutil: Don't set next_freq to UINT_MAX Revert "cpufreq: schedutil: Don't restrict kthread to related_cpus unnecessarily" * pm-cpuidle: cpuidle: governors: Consolidate PM QoS handling cpuidle: governors: Drop redundant checks related to PM QoS
2018-06-04Merge branches 'pm-qos' and 'pm-core'Rafael J. Wysocki2-1/+1
* pm-qos: PM / QoS: Drop redundant declaration of pm_qos_get_value() * pm-core: PM / runtime: Drop usage count for suppliers at device link removal PM / runtime: Fixup reference counting of device link suppliers at probe PM: wakeup: Use pr_debug() for the "aborting suspend" message PM / core: Drop unused internal inline functions for sysfs PM / core: Drop unused internal functions for pm_qos sysfs PM / core: Drop unused internal inline functions for wakeirqs PM / core: Drop internal unused inline functions for wakeups PM / wakeup: Only update last time for active wakeup sources PM / wakeup: Use seq_open() to show wakeup stats PM / core: Use dev_printk() and symbols in suspend/resume diagnostics PM / core: Simplify initcall_debug_report() timing PM / core: Remove unused initcall_debug_report() arguments PM / core: fix deferred probe breaking suspend resume order
2018-06-03bpf: implement bpf_get_current_cgroup_id() helperYonghong Song3-0/+18
bpf has been used extensively for tracing. For example, bcc contains an almost full set of bpf-based tools to trace kernel and user functions/events. Most tracing tools are currently either filtered based on pid or system-wide. Containers have been used quite extensively in industry and cgroup is often used together to provide resource isolation and protection. Several processes may run inside the same container. It is often desirable to get container-level tracing results as well, e.g. syscall count, function count, I/O activity, etc. This patch implements a new helper, bpf_get_current_cgroup_id(), which will return cgroup id based on the cgroup within which the current task is running. The later patch will provide an example to show that userspace can get the same cgroup id so it could configure a filter or policy in the bpf program based on task cgroup id. The helper is currently implemented for tracing. It can be added to other program types as well when needed. Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-06-03Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds3-18/+35
Pull scheduler fixes from Thomas Gleixner: - two patches addressing the problem that the scheduler allows under certain conditions user space tasks to be scheduled on CPUs which are not yet fully booted which causes a few subtle and hard to debug issue - add a missing runqueue clock update in the deadline scheduler which triggers a warning under certain circumstances - fix a silly typo in the scheduler header file * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/headers: Fix typo sched/deadline: Fix missing clock update sched/core: Require cpu_active() in select_task_rq(), for user tasks sched/core: Fix rules for running on online && !active CPUs
2018-06-03bpf/xdp: devmap can avoid calling ndo_xdp_flushJesper Dangaard Brouer1-13/+6
The XDP_REDIRECT map devmap can avoid using ndo_xdp_flush, by instead instructing ndo_xdp_xmit to flush via XDP_XMIT_FLUSH flag in appropriate places. Notice after this patch it is possible to remove ndo_xdp_flush completely, as this is the last user of ndo_xdp_flush. This is left for later patches, to keep driver changes separate. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-06-03xdp: add flags argument to ndo_xdp_xmit APIJesper Dangaard Brouer1-1/+1
This patch only change the API and reject any use of flags. This is an intermediate step that allows us to implement the flush flag operation later, for each individual driver in a separate patch. The plan is to implement flush operation via XDP_XMIT_FLUSH flag and then remove XDP_XMIT_FLAGS_NONE when done. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-06-03bpf: fix context access in tracing progs on 32 bit archsDaniel Borkmann2-3/+10
Wang reported that all the testcases for BPF_PROG_TYPE_PERF_EVENT program type in test_verifier report the following errors on x86_32: 172/p unpriv: spill/fill of different pointers ldx FAIL Unexpected error message! 0: (bf) r6 = r10 1: (07) r6 += -8 2: (15) if r1 == 0x0 goto pc+3 R1=ctx(id=0,off=0,imm=0) R6=fp-8,call_-1 R10=fp0,call_-1 3: (bf) r2 = r10 4: (07) r2 += -76 5: (7b) *(u64 *)(r6 +0) = r2 6: (55) if r1 != 0x0 goto pc+1 R1=ctx(id=0,off=0,imm=0) R2=fp-76,call_-1 R6=fp-8,call_-1 R10=fp0,call_-1 fp-8=fp 7: (7b) *(u64 *)(r6 +0) = r1 8: (79) r1 = *(u64 *)(r6 +0) 9: (79) r1 = *(u64 *)(r1 +68) invalid bpf_context access off=68 size=8 378/p check bpf_perf_event_data->sample_period byte load permitted FAIL Failed to load prog 'Permission denied'! 0: (b7) r0 = 0 1: (71) r0 = *(u8 *)(r1 +68) invalid bpf_context access off=68 size=1 379/p check bpf_perf_event_data->sample_period half load permitted FAIL Failed to load prog 'Permission denied'! 0: (b7) r0 = 0 1: (69) r0 = *(u16 *)(r1 +68) invalid bpf_context access off=68 size=2 380/p check bpf_perf_event_data->sample_period word load permitted FAIL Failed to load prog 'Permission denied'! 0: (b7) r0 = 0 1: (61) r0 = *(u32 *)(r1 +68) invalid bpf_context access off=68 size=4 381/p check bpf_perf_event_data->sample_period dword load permitted FAIL Failed to load prog 'Permission denied'! 0: (b7) r0 = 0 1: (79) r0 = *(u64 *)(r1 +68) invalid bpf_context access off=68 size=8 Reason is that struct pt_regs on x86_32 doesn't fully align to 8 byte boundary due to its size of 68 bytes. Therefore, bpf_ctx_narrow_access_ok() will then bail out saying that off & (size_default - 1) which is 68 & 7 doesn't cleanly align in the case of sample_period access from struct bpf_perf_event_data, hence verifier wrongly thinks we might be doing an unaligned access here though underlying arch can handle it just fine. Therefore adjust this down to machine size and check and rewrite the offset for narrow access on that basis. We also need to fix corresponding pe_prog_is_valid_access(), since we hit the check for off % size != 0 (e.g. 68 % 8 -> 4) in the first and last test. With that in place, progs for tracing work on x86_32. Reported-by: Wang YanQing <udknight@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Tested-by: Wang YanQing <udknight@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-06-03bpf: avoid retpoline for lookup/update/delete calls on mapsDaniel Borkmann2-22/+58
While some of the BPF map lookup helpers provide a ->map_gen_lookup() callback for inlining the map lookup altogether it is not available for every map, so the remaining ones have to call bpf_map_lookup_elem() helper which does a dispatch to map->ops->map_lookup_elem(). In times of retpolines, this will control and trap speculative execution rather than letting it do its work for the indirect call and will therefore cause a slowdown. Likewise, bpf_map_update_elem() and bpf_map_delete_elem() do not have an inlined version and need to call into their map->ops->map_update_elem() resp. map->ops->map_delete_elem() handlers. Before: # bpftool prog dump xlated id 1 0: (bf) r2 = r10 1: (07) r2 += -8 2: (7a) *(u64 *)(r2 +0) = 0 3: (18) r1 = map[id:1] 5: (85) call __htab_map_lookup_elem#232656 6: (15) if r0 == 0x0 goto pc+4 7: (71) r1 = *(u8 *)(r0 +35) 8: (55) if r1 != 0x0 goto pc+1 9: (72) *(u8 *)(r0 +35) = 1 10: (07) r0 += 56 11: (15) if r0 == 0x0 goto pc+4 12: (bf) r2 = r0 13: (18) r1 = map[id:1] 15: (85) call bpf_map_delete_elem#215008 <-- indirect call via 16: (95) exit helper After: # bpftool prog dump xlated id 1 0: (bf) r2 = r10 1: (07) r2 += -8 2: (7a) *(u64 *)(r2 +0) = 0 3: (18) r1 = map[id:1] 5: (85) call __htab_map_lookup_elem#233328 6: (15) if r0 == 0x0 goto pc+4 7: (71) r1 = *(u8 *)(r0 +35) 8: (55) if r1 != 0x0 goto pc+1 9: (72) *(u8 *)(r0 +35) = 1 10: (07) r0 += 56 11: (15) if r0 == 0x0 goto pc+4 12: (bf) r2 = r0 13: (18) r1 = map[id:1] 15: (85) call htab_lru_map_delete_elem#238240 <-- direct call 16: (95) exit In all three lookup/update/delete cases however we can use the actual address of the map callback directly if we find that there's only a single path with a map pointer leading to the helper call, meaning when the map pointer has not been poisoned from verifier side. Example code can be seen above for the delete case. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-06-03bpf: show prog and map id in fdinfoDaniel Borkmann1-4/+8
Its trivial and straight forward to expose it for scripts that can then use it along with bpftool in order to inspect an individual application's used maps and progs. Right now we dump some basic information in the fdinfo file but with the help of the map/prog id full introspection becomes possible now. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-06-03bpf: fixup error message from gpl helpers on license mismatchDaniel Borkmann1-1/+1
Stating 'proprietary program' in the error is just silly since it can also be a different open source license than that which is just not compatible. Reference: https://twitter.com/majek04/status/998531268039102465 Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-06-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller5-15/+31
Filling in the padding slot in the bpf structure as a bug fix in 'ne' overlapped with actually using that padding area for something in 'net-next'. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-02libnvdimm, e820: Register all pmem resourcesDan Williams1-0/+1
There is currently a mismatch between the resources that will trigger the e820_pmem driver to register/load and the resources that will actually be surfaced as pmem ranges. register_e820_pmem() uses walk_iomem_res_desc() which includes children and siblings. In contrast, e820_pmem_probe() only considers top level resources. For example the following resource tree results in the driver being loaded, but no resources being registered: 398000000000-39bfffffffff : PCI Bus 0000:ae 39be00000000-39bf07ffffff : PCI Bus 0000:af 39be00000000-39beffffffff : 0000:af:00.0 39be10000000-39beffffffff : Persistent Memory (legacy) Fix this up to allow definitions of "legacy" pmem ranges anywhere in system-physical address space. Not that it is a recommended or safe to define a pmem range in PCI space, but it is useful for debug / experimentation, and the restriction on being a top-level resource was arbitrary. Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-06-02bpf: btf: Ensure t->type == 0 for BTF_KIND_FWDMartin KaFai Lau1-1/+20
The t->type in BTF_KIND_FWD is not used. It must be 0. This patch ensures that and also adds a test case in test_btf.c Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-06-02bpf: btf: Check array t->sizeMartin KaFai Lau1-0/+5
This patch ensures array's t->size is 0. The array size is decided by its individual elem's size and the number of elements. Hence, t->size is not used and it must be 0. A test case is added to test_btf.c Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-31Merge branch 'linus' into perf/core, to pick up fixesIngo Molnar4-27/+72
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-31sched/headers: Fix typoDavidlohr Bueso1-1/+1
I cannot spell 'throttling'. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180530224940.17839-1-dave@stgolabs.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-31sched/deadline: Fix missing clock updateJuri Lelli1-3/+3
A missing clock update is causing the following warning: rq->clock_update_flags < RQCF_ACT_SKIP WARNING: CPU: 10 PID: 0 at kernel/sched/sched.h:963 inactive_task_timer+0x5d6/0x720 Call Trace: <IRQ> __hrtimer_run_queues+0x10f/0x530 hrtimer_interrupt+0xe5/0x240 smp_apic_timer_interrupt+0x79/0x2b0 apic_timer_interrupt+0xf/0x20 </IRQ> do_idle+0x203/0x280 cpu_startup_entry+0x6f/0x80 start_secondary+0x1b0/0x200 secondary_startup_64+0xa5/0xb0 hardirqs last enabled at (793919): [<ffffffffa27c5f6e>] cpuidle_enter_state+0x9e/0x360 hardirqs last disabled at (793920): [<ffffffffa2a0096e>] interrupt_entry+0xce/0xe0 softirqs last enabled at (793922): [<ffffffffa20bef78>] irq_enter+0x68/0x70 softirqs last disabled at (793921): [<ffffffffa20bef5d>] irq_enter+0x4d/0x70 This happens because inactive_task_timer() calls sub_running_bw() (if TASK_DEAD and non_contending) that might trigger a schedutil update, which might access the clock. Clock is however currently updated only later in inactive_task_timer() function. Fix the problem by updating the clock right after task_rq_lock(). Reported-by: kernel test robot <xiaolong.ye@intel.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Claudio Scordino <claudio@evidence.eu.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luca Abeni <luca.abeni@santannapisa.it> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180530160809.9074-1-juri.lelli@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-31sched/core: Require cpu_active() in select_task_rq(), for user tasksPaul Burton1-2/+1
select_task_rq() is used in a few paths to select the CPU upon which a thread should be run - for example it is used by try_to_wake_up() & by fork or exec balancing. As-is it allows use of any online CPU that is present in the task's cpus_allowed mask. This presents a problem because there is a period whilst CPUs are brought online where a CPU is marked online, but is not yet fully initialized - ie. the period where CPUHP_AP_ONLINE_IDLE <= state < CPUHP_ONLINE. Usually we don't run any user tasks during this window, but there are corner cases where this can happen. An example observed is: - Some user task A, running on CPU X, forks to create task B. - sched_fork() calls __set_task_cpu() with cpu=X, setting task B's task_struct::cpu field to X. - CPU X is offlined. - Task A, currently somewhere between the __set_task_cpu() in copy_process() and the call to wake_up_new_task(), is migrated to CPU Y by migrate_tasks() when CPU X is offlined. - CPU X is onlined, but still in the CPUHP_AP_ONLINE_IDLE state. The scheduler is now active on CPU X, but there are no user tasks on the runqueue. - Task A runs on CPU Y & reaches wake_up_new_task(). This calls select_task_rq() with cpu=X, taken from task B's task_struct, and select_task_rq() allows CPU X to be returned. - Task A enqueues task B on CPU X's runqueue, via activate_task() & enqueue_task(). - CPU X now has a user task on its runqueue before it has reached the CPUHP_ONLINE state. In most cases, the user tasks that schedule on the newly onlined CPU have no idea that anything went wrong, but one case observed to be problematic is if the task goes on to invoke the sched_setaffinity syscall. The newly onlined CPU reaches the CPUHP_AP_ONLINE_IDLE state before the CPU that brought it online calls stop_machine_unpark(). This means that for a portion of the window of time between CPUHP_AP_ONLINE_IDLE & CPUHP_ONLINE the newly onlined CPU's struct cpu_stopper has its enabled field set to false. If a user thread is executed on the CPU during this window and it invokes sched_setaffinity with a CPU mask that does not include the CPU it's running on, then when __set_cpus_allowed_ptr() calls stop_one_cpu() intending to invoke migration_cpu_stop() and perform the actual migration away from the CPU it will simply return -ENOENT rather than calling migration_cpu_stop(). We then return from the sched_setaffinity syscall back to the user task that is now running on a CPU which it just asked not to run on, and which is not present in its cpus_allowed mask. This patch resolves the problem by having select_task_rq() enforce that user tasks run on CPUs that are active - the same requirement that select_fallback_rq() already enforces. This should ensure that newly onlined CPUs reach the CPUHP_AP_ACTIVE state before being able to schedule user tasks, and also implies that bringup_wait_for_ap() will have called stop_machine_unpark() which resolves the sched_setaffinity issue above. I haven't yet investigated them, but it may be of interest to review whether any of the actions performed by hotplug states between CPUHP_AP_ONLINE_IDLE & CPUHP_AP_ACTIVE could have similar unintended effects on user tasks that might schedule before they are reached, which might widen the scope of the problem from just affecting the behaviour of sched_setaffinity. Signed-off-by: Paul Burton <paul.burton@mips.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180526154648.11635-2-paul.burton@mips.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-31sched/core: Fix rules for running on online && !active CPUsPeter Zijlstra1-12/+30
As already enforced by the WARN() in __set_cpus_allowed_ptr(), the rules for running on an online && !active CPU are stricter than just being a kthread, you need to be a per-cpu kthread. If you're not strictly per-CPU, you have better CPUs to run on and don't need the partially booted one to get your work done. The exception is to allow smpboot threads to bootstrap the CPU itself and get kernel 'services' initialized before we allow userspace on it. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 955dbdf4ce87 ("sched: Allow migrating kthreads into online but inactive CPUs") Link: http://lkml.kernel.org/r/20170725165821.cejhb7v2s3kecems@hirez.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-30bpf: devmap: remove redundant assignment of dev = devColin Ian King1-1/+1
The assignment dev = dev is redundant and should be removed. Detected by CoverityScan, CID#1469486 ("Evaluation order violation") Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-30media: rc: introduce BPF_PROG_LIRC_MODE2Sean Young1-0/+7
Add support for BPF_PROG_LIRC_MODE2. This type of BPF program can call rc_keydown() to reported decoded IR scancodes, or rc_repeat() to report that the last key should be repeated. The bpf program can be attached to using the bpf(BPF_PROG_ATTACH) syscall; the target_fd must be the /dev/lircN device. Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Sean Young <sean@mess.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-30bpf: bpf_prog_array_copy() should return -ENOENT if exclude_prog not foundSean Young2-2/+11
This makes is it possible for bpf prog detach to return -ENOENT. Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Sean Young <sean@mess.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-30Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcuIngo Molnar1-0/+9
Pull RCU fix from Paul E. McKenney: "This additional v4.18 pull request contains a single commit that fell through the cracks: Provide early rcu_cpu_starting() callback for the benefit of the x86/mtrr code, which needs RCU to be available on incoming CPUs earlier than has been the case in the past." Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-29tracing: Allow histogram triggers to access ftrace internal eventsSteven Rostedt (VMware)1-1/+1
Now that trace_marker can have triggers, including a histogram triggers, the onmatch() and onmax() access the trace event. To do so, the search routine to find the event file needs to use the raw __find_event_file() that does not filter out ftrace events. Reviewed-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-05-29tracing: Have zero size length in filter logic be full stringSteven Rostedt (VMware)1-11/+12
As strings in trace events may not have a nul terminating character, the filter string compares use the defined string length for the field for the compares. The trace_marker records data slightly different than do normal events. It's size is zero, meaning that the string is the rest of the array, and that the string also ends with '\0'. If the size is zero, assume that the string is nul terminated and read the string in question as is. Reviewed-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-05-29tracing: Add trigger file for trace_markers tracefs/ftrace/printSteven Rostedt (VMware)4-2/+29
Allow writing to the trace_markers file initiate triggers defined in tracefs/ftrace/print/trigger file. This will allow of user space to trigger the same type of triggers (including histograms) that the trace events use. Had to create a ftrace_event_register() function that will become the trace_marker print event's reg() function. This is required because of how triggers are enabled: event_trigger_write() { event_trigger_regex_write() { trigger_process_regex() { for p in trigger_commands { p->func(); /* trigger_snapshot_cmd->func */ event_trigger_callback() { cmd_ops->reg() /* register_trigger() */ { trace_event_trigger_enable_disable() { trace_event_enable_disable() { call->class->reg(); Without the reg() function, the trigger code will call a NULL pointer and crash the system. Cc: Tom Zanussi <tom.zanussi@linux.intel.com> Cc: Clark Williams <williams@redhat.com> Cc: Karim Yaghmour <karim.yaghmour@opersys.com> Cc: Brendan Gregg <bgregg@netflix.com> Suggested-by: Joel Fernandes <joelaf@google.com> Reviewed-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-05-29Merge tag 'trace-v4.17-rc4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-traceLinus Torvalds3-10/+28
Pull tracing fixes from Steven Rostedt: "While writing selftests for a new feature, I triggered two existing bugs that deal with triggers and instances. - a generic trigger bug where the triggers are not removed from a linked list properly when deleting an instance. - a bug specific to snapshots, where the snapshot is done in the top level buffer, when it is supposed to snapshot the buffer associated to the instance the snapshot trigger exists in" * tag 'trace-v4.17-rc4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Make the snapshot trigger work with instances tracing: Fix crash when freeing instances with event triggers