aboutsummaryrefslogtreecommitdiffstatshomepage
path: root/kernel (follow)
AgeCommit message (Collapse)AuthorFilesLines
2019-11-23mm/hmm: define the pre-processor related parts of hmm.h even if disabledJason Gunthorpe1-1/+0
Only the function calls are stubbed out with static inlines that always fail. This is the standard way to write a header for an optional component and makes it easier for drivers that only optionally need HMM_MIRROR. Link: https://lore.kernel.org/r/20191112202231.3856-5-jgg@ziepe.ca Reviewed-by: Jérôme Glisse <jglisse@redhat.com> Tested-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-11-23Revert "bpf: Emit audit messages upon successful prog load and unload"Jakub Kicinski2-32/+1
This commit reverts commit 91e6015b082b ("bpf: Emit audit messages upon successful prog load and unload") and its follow up commit 7599a896f2e4 ("audit: Move audit_log_task declaration under CONFIG_AUDITSYSCALL") as requested by Paul Moore. The change needs close review on linux-audit, tests etc. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-11-22tracing: Use xarray for syscall trace eventsHassan Naveed1-7/+25
Currently, a lot of memory is wasted for architectures like MIPS when init_ftrace_syscalls() allocates the array for syscalls using kcalloc. This is because syscalls numbers start from 4000, 5000 or 6000 and array elements up to that point are unused. Fix this by using a data structure more suited to storing sparsely populated arrays. The XARRAY data structure, implemented using radix trees, is much more memory efficient for storing the syscalls in question. Link: http://lkml.kernel.org/r/20191115234314.21599-1-hnaveed@wavecomp.com Signed-off-by: Hassan Naveed <hnaveed@wavecomp.com> Reviewed-by: Paul Burton <paulburton@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-22tracing: Adding new functions for kernel access to Ftrace instancesDivya Indi3-22/+102
Adding 2 new functions - 1) struct trace_array *trace_array_get_by_name(const char *name); Return pointer to a trace array with given name. If it does not exist, create and return pointer to the new trace array. 2) int trace_array_set_clr_event(struct trace_array *tr, const char *system ,const char *event, bool enable); Enable/Disable events to this trace array. Additionally, - To handle reference counters, export trace_array_put() - Due to introduction of the above 2 new functions, we no longer need to export - ftrace_set_clr_event & trace_array_create APIs. Link: http://lkml.kernel.org/r/1574276919-11119-2-git-send-email-divya.indi@oracle.com Signed-off-by: Divya Indi <divya.indi@oracle.com> Reviewed-by: Aruna Ramakrishna <aruna.ramakrishna@oracle.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-22tracing: Fix Kconfig indentationKrzysztof Kozlowski1-4/+4
Adjust indentation from spaces to tab (+optional two spaces) as in coding style with command like: $ sed -e 's/^ /\t/' -i */Kconfig Link: http://lkml.kernel.org/r/20191120133807.12741-1-krzk@kernel.org Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-22ring-buffer: Fix typos in function ring_buffer_producerXianting Tian1-2/+2
Fix spelling and other typos Link: http://lkml.kernel.org/r/1573916755-32478-1-git-send-email-xianting_tian@126.com Signed-off-by: Xianting Tian <xianting_tian@126.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski5-16/+36
Minor conflict in drivers/s390/net/qeth_l2_main.c, kept the lock from commit c8183f548902 ("s390/qeth: fix potential deadlock on workqueue flush"), removed the code which was removed by commit 9897d583b015 ("s390/qeth: consolidate some duplicated HW cmd code"). Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-11-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds1-1/+3
Pull networking fixes from David Miller: 1) Validate tunnel options length in act_tunnel_key, from Xin Long. 2) Fix DMA sync bug in gve driver, from Adi Suresh. 3) TSO kills performance on some r8169 chips due to HW issues, disable by default in that case, from Corinna Vinschen. 4) Fix clock disable mismatch in fec driver, from Chubong Yuan. 5) Fix interrupt status bits define in hns3 driver, from Huazhong Tan. 6) Fix workqueue deadlocks in qeth driver, from Julian Wiedmann. 7) Don't napi_disable() twice in r8152 driver, from Hayes Wang. 8) Fix SKB extension memory leak, from Florian Westphal. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (54 commits) r8152: avoid to call napi_disable twice MAINTAINERS: Add myself as maintainer of virtio-vsock udp: drop skb extensions before marking skb stateless net: rtnetlink: prevent underflows in do_setvfinfo() can: m_can_platform: remove unnecessary m_can_class_resume() call can: m_can_platform: set net_device structure as driver data hv_netvsc: Fix send_table offset in case of a host bug hv_netvsc: Fix offset usage in netvsc_send_table() net-ipv6: IPV6_TRANSPARENT - check NET_RAW prior to NET_ADMIN sfc: Only cancel the PPS workqueue if it exists nfc: port100: handle command failure cleanly net-sysfs: fix netdev_queue_add_kobject() breakage r8152: Re-order napi_disable in rtl8152_close net: qca_spi: Move reset_count to struct qcaspi net: qca_spi: fix receive buffer size check net/ibmvnic: Ignore H_FUNCTION return from H_EOI to tolerate XIVE mode Revert "net/ibmvnic: Fix EOI when running in XIVE mode" net/mlxfw: Verify FSM error code translation doesn't exceed array size net/mlx5: Update the list of the PCI supported devices net/mlx5: Fix auto group size calculation ...
2019-11-22Merge tag 'pm-5.4-final' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pmLinus Torvalds1-1/+7
Pull power management regression fix from Rafael Wysocki: "Fix problems with switching cpufreq drivers on some x86 systems with ACPI (and with changing the operation modes of the intel_pstate driver on those systems) introduced by recent changes related to the management of frequency limits in cpufreq" * tag 'pm-5.4-final' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: PM: QoS: Invalidate frequency QoS requests after removal
2019-11-22tracing: Remove unnecessary DEBUG_FS dependencyKusanagi Kouichi1-1/+0
Tracing replaced debugfs with tracefs. Signed-off-by: Kusanagi Kouichi <slash@ac.auone-net.jp> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Link: https://lore.kernel.org/r/20191120104350753.EWCT.12796.ppp.dion.ne.jp@dmta0009.auone-net.jp Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-21dma-mapping: treat dev->bus_dma_mask as a DMA limitNicolas Saenz Julienne1-14/+13
Using a mask to represent bus DMA constraints has a set of limitations. The biggest one being it can only hold a power of two (minus one). The DMA mapping code is already aware of this and treats dev->bus_dma_mask as a limit. This quirk is already used by some architectures although still rare. With the introduction of the Raspberry Pi 4 we've found a new contender for the use of bus DMA limits, as its PCIe bus can only address the lower 3GB of memory (of a total of 4GB). This is impossible to represent with a mask. To make things worse the device-tree code rounds non power of two bus DMA limits to the next power of two, which is unacceptable in this case. In the light of this, rename dev->bus_dma_mask to dev->bus_dma_limit all over the tree and treat it as such. Note that dev->bus_dma_limit should contain the higher accessible DMA address. Signed-off-by: Nicolas Saenz Julienne <nsaenzjulienne@suse.de> Reviewed-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-11-21Merge branch 'for-next/zone-dma' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux into dma-mapping-for-nextChristoph Hellwig1-7/+6
Pull in a stable branch from the arm64 tree that adds the zone_dma_bits variable to avoid creating hard to resolve conflicts with that addition.
2019-11-21Merge branch 'kvm-tsx-ctrl' into HEADPaolo Bonzini25-98/+508
Conflicts: arch/x86/kvm/vmx/vmx.c
2019-11-21perf/core: Make the mlock accounting simple againAlexander Shishkin1-7/+1
Commit: d44248a41337 ("perf/core: Rework memory accounting in perf_mmap()") does a lot of things to the mlock accounting arithmetics, while the only thing that actually needed to happen is subtracting the part that is charged to the mm from the part that is charged to the user, so that the former isn't charged twice. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Acked-by: Song Liu <songliubraving@fb.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Wanpeng Li <wanpengli@tencent.com> Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com> Cc: songliubraving@fb.com Link: https://lkml.kernel.org/r/20191120170640.54123-1-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-21sched/vtime: Bring up complete kcpustat accessorFrederic Weisbecker1-20/+116
Many callsites want to fetch the values of system, user, user_nice, guest or guest_nice kcpustat fields altogether or at least a pair of these. In that case calling kcpustat_field() for each requested field brings unecessary overhead when we could fetch all of them in a row. So provide kcpustat_cpu_fetch() that fetches the whole kcpustat array in a vtime safe way under the same RCU and seqcount block. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Wanpeng Li <wanpengli@tencent.com> Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com> Link: https://lkml.kernel.org/r/20191121024430.19938-3-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-21sched/cputime: Support other fields on kcpustat_field()Frederic Weisbecker1-11/+43
Provide support for user, nice, guest and guest_nice fields through kcpustat_field(). Whether we account the delta to a nice or not nice field is decided on top of the nice value snapshot taken at the time we call kcpustat_field(). If the nice value of the task has been changed since the last vtime update, we may have inacurrate distribution of the nice VS unnice cputime. However this is considered as a minor issue compared to the proper fix that would involve interrupting the target on nice updates, which is undesired on nohz_full CPUs. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Wanpeng Li <wanpengli@tencent.com> Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com> Link: https://lkml.kernel.org/r/20191121024430.19938-2-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller12-141/+1187
Daniel Borkmann says: ==================== pull-request: bpf-next 2019-11-20 The following pull-request contains BPF updates for your *net-next* tree. We've added 81 non-merge commits during the last 17 day(s) which contain a total of 120 files changed, 4958 insertions(+), 1081 deletions(-). There are 3 trivial conflicts, resolve it by always taking the chunk from 196e8ca74886c433: <<<<<<< HEAD ======= void *bpf_map_area_mmapable_alloc(u64 size, int numa_node); >>>>>>> 196e8ca74886c433dcfc64a809707074b936aaf5 <<<<<<< HEAD void *bpf_map_area_alloc(u64 size, int numa_node) ======= static void *__bpf_map_area_alloc(u64 size, int numa_node, bool mmapable) >>>>>>> 196e8ca74886c433dcfc64a809707074b936aaf5 <<<<<<< HEAD if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) { ======= /* kmalloc()'ed memory can't be mmap()'ed */ if (!mmapable && size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) { >>>>>>> 196e8ca74886c433dcfc64a809707074b936aaf5 The main changes are: 1) Addition of BPF trampoline which works as a bridge between kernel functions, BPF programs and other BPF programs along with two new use cases: i) fentry/fexit BPF programs for tracing with practically zero overhead to call into BPF (as opposed to k[ret]probes) and ii) attachment of the former to networking related programs to see input/output of networking programs (covering xdpdump use case), from Alexei Starovoitov. 2) BPF array map mmap support and use in libbpf for global data maps; also a big batch of libbpf improvements, among others, support for reading bitfields in a relocatable manner (via libbpf's CO-RE helper API), from Andrii Nakryiko. 3) Extend s390x JIT with usage of relative long jumps and loads in order to lift the current 64/512k size limits on JITed BPF programs there, from Ilya Leoshkevich. 4) Add BPF audit support and emit messages upon successful prog load and unload in order to have a timeline of events, from Daniel Borkmann and Jiri Olsa. 5) Extension to libbpf and xdpsock sample programs to demo the shared umem mode (XDP_SHARED_UMEM) as well as RX-only and TX-only sockets, from Magnus Karlsson. 6) Several follow-up bug fixes for libbpf's auto-pinning code and a new API call named bpf_get_link_xdp_info() for retrieving the full set of prog IDs attached to XDP, from Toke Høiland-Jørgensen. 7) Add BTF support for array of int, array of struct and multidimensional arrays and enable it for skb->cb[] access in kfree_skb test, from Martin KaFai Lau. 8) Fix AF_XDP by using the correct number of channels from ethtool, from Luigi Rizzo. 9) Two fixes for BPF selftest to get rid of a hang in test_tc_tunnel and to avoid xdping to be run as standalone, from Jiri Benc. 10) Various BPF selftest fixes when run with latest LLVM trunk, from Yonghong Song. 11) Fix a memory leak in BPF fentry test run data, from Colin Ian King. 12) Various smaller misc cleanups and improvements mostly all over BPF selftests and samples, from Daniel T. Lee, Andre Guedes, Anders Roxell, Mao Wenan, Yue Haibing. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-21time: Zero the upper 32-bits in __kernel_timespec on 32-bitDmitry Safonov1-1/+2
On compat interfaces, the high order bits of nanoseconds should be zeroed out. This is because the application code or the libc do not guarantee zeroing of these. If used without zeroing, kernel might be at risk of using timespec values incorrectly. Originally it was handled correctly, but lost during is_compat_syscall() cleanup. Revert the condition back to check CONFIG_64BIT. Fixes: 98f76206b335 ("compat: Cleanup in_compat_syscall() callers") Reported-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Dmitry Safonov <dima@arista.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20191121000303.126523-1-dima@arista.com
2019-11-20bpf: Switch bpf_map_{area_alloc,area_mmapable_alloc}() to u64 sizeDaniel Borkmann1-4/+7
Given we recently extended the original bpf_map_area_alloc() helper in commit fc9702273e2e ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY"), we need to apply the same logic as in ff1c08e1f74b ("bpf: Change size to u64 for bpf_map_{area_alloc, charge_init}()"). To avoid conflicts, extend it for bpf-next. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-11-20bpf: Emit audit messages upon successful prog load and unloadDaniel Borkmann2-1/+32
Allow for audit messages to be emitted upon BPF program load and unload for having a timeline of events. The load itself is in syscall context, so additional info about the process initiating the BPF prog creation can be logged and later directly correlated to the unload event. The only info really needed from BPF side is the globally unique prog ID where then audit user space tooling can query / dump all info needed about the specific BPF program right upon load event and enrich the record, thus these changes needed here can be kept small and non-intrusive to the core. Raw example output: # auditctl -D # auditctl -a always,exit -F arch=x86_64 -S bpf # ausearch --start recent -m 1334 [...] ---- time->Wed Nov 20 12:45:51 2019 type=PROCTITLE msg=audit(1574271951.590:8974): proctitle="./test_verifier" type=SYSCALL msg=audit(1574271951.590:8974): arch=c000003e syscall=321 success=yes exit=14 a0=5 a1=7ffe2d923e80 a2=78 a3=0 items=0 ppid=742 pid=949 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=2 comm="test_verifier" exe="/root/bpf-next/tools/testing/selftests/bpf/test_verifier" subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key=(null) type=UNKNOWN[1334] msg=audit(1574271951.590:8974): auid=0 uid=0 gid=0 ses=2 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 pid=949 comm="test_verifier" exe="/root/bpf-next/tools/testing/selftests/bpf/test_verifier" prog-id=3260 event=LOAD ---- time->Wed Nov 20 12:45:51 2019 type=UNKNOWN[1334] msg=audit(1574271951.590:8975): prog-id=3260 event=UNLOAD ---- [...] Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20191120213816.8186-1-jolsa@kernel.org
2019-11-20dma-direct: exclude dma_direct_map_resource from the min_low_pfn checkChristoph Hellwig2-3/+3
The valid memory address check in dma_capable only makes sense when mapping normal memory, not when using dma_map_resource to map a device resource. Add a new boolean argument to dma_capable to exclude that check for the dma_map_resource case. Fixes: b12d66278dd6 ("dma-direct: check for overflows on 32 bit DMA addresses") Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Marek Szyprowski <m.szyprowski@samsung.com> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
2019-11-20dma-direct: don't check swiotlb=force in dma_direct_map_resourceChristoph Hellwig1-1/+1
When mapping resources we can't just use swiotlb ram for bounce buffering. Switch to a direct dma_capable check instead. Fixes: cfced786969c ("dma-mapping: remove the default map_resource implementation") Reported-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Marek Szyprowski <m.szyprowski@samsung.com> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
2019-11-20dma-debug: clean up put_hash_bucket()Dan Carpenter1-11/+9
The put_hash_bucket() is a bit cleaner if it takes an unsigned long directly instead of a pointer to unsigned long. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-11-20dma-mapping: drop the dev argument to arch_sync_dma_for_*Christoph Hellwig1-7/+7
These are pure cache maintainance routines, so drop the unused struct device argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Suggested-by: Daniel Vetter <daniel.vetter@ffwll.ch>
2019-11-20Merge tag 'irqchip-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/coreThomas Gleixner1-0/+44
Pull irqchip updates from Marc Zyngier: - Qualcomm PDC wakeup interrupt support - Layerscape external IRQ support - Broadcom bcm7038 PM and wakeup support - Ingenic driver cleanup and modernization - GICv3 ITS preparation for GICv4.1 updates - GICv4 fixes - Various cleanups
2019-11-20fork: fix pidfd_poll()'s return typeLuc Van Oostenryck1-3/+3
pidfd_poll() is defined as returning 'unsigned int' but the .poll method is declared as returning '__poll_t', a bitwise type. Fix this by using the proper return type and using the EPOLL constants instead of the POLL ones, as required for __poll_t. Fixes: b53b0b9d9a61 ("pidfd: add polling support") Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: stable@vger.kernel.org # 5.3 Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com> Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20191120003320.31138-1-luc.vanoostenryck@gmail.com Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2019-11-20cpuidle: Pass exit latency limit to cpuidle_use_deepest_state()Daniel Lezcano1-1/+7
Modify cpuidle_use_deepest_state() to take an additional exit latency limit argument to be passed to find_deepest_idle_state() and make cpuidle_idle_call() pass dev->forced_idle_latency_limit_ns to it for forced idle. Suggested-by: Rafael J. Wysocki <rafael@kernel.org> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> [ rjw: Rebase and rearrange code, subject & changelog ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-20cpuidle: Allow idle injection to apply exit latency limitDaniel Lezcano1-7/+7
In some cases it may be useful to specify an exit latency limit for the idle state to be used during CPU idle time injection. Instead of duplicating the information in struct cpuidle_device or propagating the latency limit in the call stack, replace the use_deepest_state field with forced_latency_limit_ns to represent that limit, so that the deepest idle state with exit latency within that limit is forced (i.e. no governors) when it is set. A zero exit latency limit for forced idle means to use governors in the usual way (analogous to use_deepest_state equal to "false" before this change). Additionally, add play_idle_precise() taking two arguments, the duration of forced idle and the idle state exit latency limit, both in nanoseconds, and redefine play_idle() as a wrapper around that new function. This change is preparatory, no functional impact is expected. Suggested-by: Rafael J. Wysocki <rafael@kernel.org> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> [ rjw: Subject, changelog, cpuidle_use_deepest_state() kerneldoc, whitespace ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-20PM: QoS: Invalidate frequency QoS requests after removalRafael J. Wysocki1-1/+7
Switching cpufreq drivers (or switching operation modes of the intel_pstate driver from "active" to "passive" and vice versa) does not work on some x86 systems with ACPI after commit 3000ce3c52f8 ("cpufreq: Use per-policy frequency QoS"), because the ACPI _PPC and thermal code uses the same frequency QoS request object for a given CPU every time a cpufreq driver is registered and freq_qos_remove_request() does not invalidate the request after removing it from its QoS list, so freq_qos_add_request() complains and fails when that request is passed to it again. Fix the issue by modifying freq_qos_remove_request() to clear the qos and type fields of the frequency request pointed to by its argument after removing it from its QoS list so as to invalidate it. Fixes: 3000ce3c52f8 ("cpufreq: Use per-policy frequency QoS") Reported-and-tested-by: Doug Smythies <dsmythies@telus.net> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2019-11-20futex: Prevent exit livelockThomas Gleixner1-15/+91
Oleg provided the following test case: int main(void) { struct sched_param sp = {}; sp.sched_priority = 2; assert(sched_setscheduler(0, SCHED_FIFO, &sp) == 0); int lock = vfork(); if (!lock) { sp.sched_priority = 1; assert(sched_setscheduler(0, SCHED_FIFO, &sp) == 0); _exit(0); } syscall(__NR_futex, &lock, FUTEX_LOCK_PI, 0,0,0); return 0; } This creates an unkillable RT process spinning in futex_lock_pi() on a UP machine or if the process is affine to a single CPU. The reason is: parent child set FIFO prio 2 vfork() -> set FIFO prio 1 implies wait_for_child() sched_setscheduler(...) exit() do_exit() .... mm_release() tsk->futex_state = FUTEX_STATE_EXITING; exit_futex(); (NOOP in this case) complete() --> wakes parent sys_futex() loop infinite because tsk->futex_state == FUTEX_STATE_EXITING The same problem can happen just by regular preemption as well: task holds futex ... do_exit() tsk->futex_state = FUTEX_STATE_EXITING; --> preemption (unrelated wakeup of some other higher prio task, e.g. timer) switch_to(other_task) return to user sys_futex() loop infinite as above Just for the fun of it the futex exit cleanup could trigger the wakeup itself before the task sets its futex state to DEAD. To cure this, the handling of the exiting owner is changed so: - A refcount is held on the task - The task pointer is stored in a caller visible location - The caller drops all locks (hash bucket, mmap_sem) and blocks on task::futex_exit_mutex. When the mutex is acquired then the exiting task has completed the cleanup and the state is consistent and can be reevaluated. This is not a pretty solution, but there is no choice other than returning an error code to user space, which would break the state consistency guarantee and open another can of problems including regressions. For stable backports the preparatory commits ac31c7ff8624 .. ba31c1a48538 are required as well, but for anything older than 5.3.y the backports are going to be provided when this hits mainline as the other dependencies for those kernels are definitely not stable material. Fixes: 778e9a9c3e71 ("pi-futex: fix exit races and locking problems") Reported-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Stable Team <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20191106224557.041676471@linutronix.de
2019-11-20futex: Provide distinct return value when owner is exitingThomas Gleixner1-7/+9
attach_to_pi_owner() returns -EAGAIN for various cases: - Owner task is exiting - Futex value has changed The caller drops the held locks (hash bucket, mmap_sem) and retries the operation. In case of the owner task exiting this can result in a live lock. As a preparatory step for seperating those cases, provide a distinct return value (EBUSY) for the owner exiting case. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20191106224556.935606117@linutronix.de
2019-11-20futex: Add mutex around futex exitThomas Gleixner1-0/+16
The mutex will be used in subsequent changes to replace the busy looping of a waiter when the futex owner is currently executing the exit cleanup to prevent a potential live lock. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20191106224556.845798895@linutronix.de
2019-11-20futex: Provide state handling for exec() as wellThomas Gleixner1-4/+34
exec() attempts to handle potentially held futexes gracefully by running the futex exit handling code like exit() does. The current implementation has no protection against concurrent incoming waiters. The reason is that the futex state cannot be set to FUTEX_STATE_DEAD after the cleanup because the task struct is still active and just about to execute the new binary. While its arguably buggy when a task holds a futex over exec(), for consistency sake the state handling can at least cover the actual futex exit cleanup section. This provides state consistency protection accross the cleanup. As the futex state of the task becomes FUTEX_STATE_OK after the cleanup has been finished, this cannot prevent subsequent attempts to attach to the task in case that the cleanup was not successfull in mopping up all leftovers. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20191106224556.753355618@linutronix.de
2019-11-20futex: Sanitize exit state handlingThomas Gleixner1-7/+10
Instead of having a smp_mb() and an empty lock/unlock of task::pi_lock move the state setting into to the lock section. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20191106224556.645603214@linutronix.de
2019-11-20futex: Mark the begin of futex exit explicitlyThomas Gleixner2-13/+37
Instead of relying on PF_EXITING use an explicit state for the futex exit and set it in the futex exit function. This moves the smp barrier and the lock/unlock serialization into the futex code. As with the DEAD state this is restricted to the exit path as exec continues to use the same task struct. This allows to simplify that logic in a next step. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20191106224556.539409004@linutronix.de
2019-11-20futex: Set task::futex_state to DEAD right after handling futex exitThomas Gleixner2-1/+1
Setting task::futex_state in do_exit() is rather arbitrarily placed for no reason. Move it into the futex code. Note, this is only done for the exit cleanup as the exec cleanup cannot set the state to FUTEX_STATE_DEAD because the task struct is still in active use. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20191106224556.439511191@linutronix.de
2019-11-20futex: Split futex_mm_release() for exit/execThomas Gleixner2-4/+8
To allow separate handling of the futex exit state in the futex exit code for exit and exec, split futex_mm_release() into two functions and invoke them from the corresponding exit/exec_mm_release() callsites. Preparatory only, no functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20191106224556.332094221@linutronix.de
2019-11-20exit/exec: Seperate mm_release()Thomas Gleixner2-2/+12
mm_release() contains the futex exit handling. mm_release() is called from do_exit()->exit_mm() and from exec()->exec_mm(). In the exit_mm() case PF_EXITING and the futex state is updated. In the exec_mm() case these states are not touched. As the futex exit code needs further protections against exit races, this needs to be split into two functions. Preparatory only, no functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20191106224556.240518241@linutronix.de
2019-11-20futex: Replace PF_EXITPIDONE with a stateThomas Gleixner2-28/+15
The futex exit handling relies on PF_ flags. That's suboptimal as it requires a smp_mb() and an ugly lock/unlock of the exiting tasks pi_lock in the middle of do_exit() to enforce the observability of PF_EXITING in the futex code. Add a futex_state member to task_struct and convert the PF_EXITPIDONE logic over to the new state. The PF_EXITING dependency will be cleaned up in a later step. This prepares for handling various futex exit issues later. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20191106224556.149449274@linutronix.de
2019-11-20futex: Move futex exit handling into futex codeThomas Gleixner2-26/+32
The futex exit handling is #ifdeffed into mm_release() which is not pretty to begin with. But upcoming changes to address futex exit races need to add more functionality to this exit code. Split it out into a function, move it into futex code and make the various futex exit functions static. Preparatory only and no functional change. Folded build fix from Borislav. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20191106224556.049705556@linutronix.de
2019-11-19bpf: Make array_map_mmap staticYueHaibing1-1/+1
Fix sparse warning: kernel/bpf/arraymap.c:481:5: warning: symbol 'array_map_mmap' was not declared. Should it be static? Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20191119142113.15388-1-yuehaibing@huawei.com
2019-11-18ftrace: Rename ftrace_graph_stub to ftrace_stub_graphSteven Rostedt (VMware)1-3/+3
The ftrace_graph_stub was created and points to ftrace_stub as a way to assign the functon graph tracer function pointer to a stub function with a different prototype than what ftrace_stub has and not trigger the C verifier. The ftrace_graph_stub was created via the linker script vmlinux.lds.h. Unfortunately, powerpc already uses the name ftrace_graph_stub for its internal implementation of the function graph tracer, and even though powerpc would still build, the change via the linker script broke function tracer on powerpc from working. By using the name ftrace_stub_graph, which does not exist anywhere else in the kernel, this should not be a problem. Link: https://lore.kernel.org/r/1573849732.5937.136.camel@lca.pw Fixes: b83b43ffc6e4 ("fgraph: Fix function type mismatches of ftrace_graph_return using ftrace_stub") Reorted-by: Qian Cai <cai@lca.pw> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-18ftrace: Add a helper function to modify_ftrace_direct() to allow arch optimizationSteven Rostedt (VMware)1-32/+88
If a direct ftrace callback is at a location that does not have any other ftrace helpers attached to it, it is possible to simply just change the text to call the new caller (if the architecture supports it). But this requires special architecture code. Currently, modify_ftrace_direct() uses a trick to add a stub ftrace callback to the location forcing it to call the ftrace iterator. Then it can change the direct helper to call the new function in C, and then remove the stub. Removing the stub will have the location now call the new location that the direct helper is using. The new helper function does the registering the stub trick, but is a weak function, allowing an architecture to override it to do something a bit more direct. Link: https://lore.kernel.org/r/20191115215125.mbqv7taqnx376yed@ast-mbp.dhcp.thefacebook.com Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-18perf/core: Fix the mlock accounting, againAlexander Shishkin1-4/+2
Commit: 5e6c3c7b1ec2 ("perf/aux: Fix tracking of auxiliary trace buffer allocation") tried to guess the correct combination of arithmetic operations that would undo the AUX buffer's mlock accounting, and failed, leaking the bottom part when an allocation needs to be charged partially to both user->locked_vm and mm->pinned_vm, eventually leaving the user with no locked bonus: $ perf record -e intel_pt//u -m1,128 uname [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.061 MB perf.data ] $ perf record -e intel_pt//u -m1,128 uname Permission error mapping pages. Consider increasing /proc/sys/kernel/perf_event_mlock_kb, or try again with a smaller value of -m/--mmap_pages. (current value: 1,128) Fix this by subtracting both locked and pinned counts when AUX buffer is unmapped. Reported-by: Thomas Richter <tmricht@linux.ibm.com> Tested-by: Thomas Richter <tmricht@linux.ibm.com> Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-18sched/cpufreq: Move the cfs_rq_util_change() call to cpufreq_update_util()Vincent Guittot1-49/+62
update_cfs_rq_load_avg() calls cfs_rq_util_change() every time PELT decays, which might be inefficient when the cpufreq driver has rate limitation. When a task is attached on a CPU, we have this call path: update_load_avg() update_cfs_rq_load_avg() cfs_rq_util_change -- > trig frequency update attach_entity_load_avg() cfs_rq_util_change -- > trig frequency update The 1st frequency update will not take into account the utilization of the newly attached task and the 2nd one might be discarded because of rate limitation of the cpufreq driver. update_cfs_rq_load_avg() is only called by update_blocked_averages() and update_load_avg() so we can move the call to cfs_rq_util_change/cpufreq_update_util() into these two functions. It's also interesting to note that update_load_avg() already calls cfs_rq_util_change() directly for the !SMP case. This change will also ensure that cpufreq_update_util() is called even when there is no more CFS rq in the leaf_cfs_rq_list to update, but only IRQ, RT or DL PELT signals. [ mingo: Minor updates. ] Reported-by: Doug Smythies <dsmythies@telus.net> Tested-by: Doug Smythies <dsmythies@telus.net> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: juri.lelli@redhat.com Cc: linux-pm@vger.kernel.org Cc: mgorman@suse.de Cc: rostedt@goodmis.org Cc: sargun@sargun.me Cc: srinivas.pandruvada@linux.intel.com Cc: tj@kernel.org Cc: xiexiuqi@huawei.com Cc: xiezhipeng1@huawei.com Fixes: 039ae8bcf7a5 ("sched/fair: Fix O(nr_cgroups) in the load balancing path") Link: https://lkml.kernel.org/r/1574083279-799-1-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-18Merge tag 'v5.4-rc8' into sched/core, to pick up fixes and dependenciesIngo Molnar8-21/+74
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-18sched/fair: Add comments for group_type and balancing at SD_NUMA levelVincent Guittot1-4/+31
Add comments to describe each state of goup_type and to add some details about the load balance at NUMA level. [ Valentin Schneider: Updates to the comments. ] [ mingo: Other updates to the comments. ] Reported-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Acked-by: Valentin Schneider <valentin.schneider@arm.com> Cc: Ben Segall <bsegall@google.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/1573570243-1903-1-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-18sched/fair: Fix rework of find_idlest_group()Vincent Guittot1-7/+84
The task, for which the scheduler looks for the idlest group of CPUs, must be discounted from all statistics in order to get a fair comparison between groups. This includes utilization, load, nr_running and idle_cpus. Such unfairness can be easily highlighted with the unixbench execl 1 task. This test continuously call execve() and the scheduler looks for the idlest group/CPU on which it should place the task. Because the task runs on the local group/CPU, the latter seems already busy even if there is nothing else running on it. As a result, the scheduler will always select another group/CPU than the local one. This recovers most of the performance regression on my system from the recent load-balancer rewrite. [ mingo: Minor cleanups. ] Reported-by: kernel test robot <rong.a.chen@intel.com> Tested-by: kernel test robot <rong.a.chen@intel.com> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten.Rasmussen@arm.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: hdanton@sina.com Cc: parth@linux.ibm.com Cc: pauld@redhat.com Cc: quentin.perret@arm.com Cc: riel@surriel.com Cc: srikar@linux.vnet.ibm.com Cc: valentin.schneider@arm.com Fixes: 57abff067a08 ("sched/fair: Rework find_idlest_group()") Link: https://lkml.kernel.org/r/1571762798-25900-1-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-18bpf: Add mmap() support for BPF_MAP_TYPE_ARRAYAndrii Nakryiko2-9/+148
Add ability to memory-map contents of BPF array map. This is extremely useful for working with BPF global data from userspace programs. It allows to avoid typical bpf_map_{lookup,update}_elem operations, improving both performance and usability. There had to be special considerations for map freezing, to avoid having writable memory view into a frozen map. To solve this issue, map freezing and mmap-ing is happening under mutex now: - if map is already frozen, no writable mapping is allowed; - if map has writable memory mappings active (accounted in map->writecnt), map freezing will keep failing with -EBUSY; - once number of writable memory mappings drops to zero, map freezing can be performed again. Only non-per-CPU plain arrays are supported right now. Maps with spinlocks can't be memory mapped either. For BPF_F_MMAPABLE array, memory allocation has to be done through vmalloc() to be mmap()'able. We also need to make sure that array data memory is page-sized and page-aligned, so we over-allocate memory in such a way that struct bpf_array is at the end of a single page of memory with array->value being aligned with the start of the second page. On deallocation we need to accomodate this memory arrangement to free vmalloc()'ed memory correctly. One important consideration regarding how memory-mapping subsystem functions. Memory-mapping subsystem provides few optional callbacks, among them open() and close(). close() is called for each memory region that is unmapped, so that users can decrease their reference counters and free up resources, if necessary. open() is *almost* symmetrical: it's called for each memory region that is being mapped, **except** the very first one. So bpf_map_mmap does initial refcnt bump, while open() will do any extra ones after that. Thus number of close() calls is equal to number of open() calls plus one more. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/bpf/20191117172806.2195367-4-andriin@fb.com
2019-11-18bpf: Convert bpf_prog refcnt to atomic64_tAndrii Nakryiko3-28/+14
Similarly to bpf_map's refcnt/usercnt, convert bpf_prog's refcnt to atomic64 and remove artificial 32k limit. This allows to make bpf_prog's refcounting non-failing, simplifying logic of users of bpf_prog_add/bpf_prog_inc. Validated compilation by running allyesconfig kernel build. Suggested-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20191117172806.2195367-3-andriin@fb.com