aboutsummaryrefslogtreecommitdiffstats
path: root/kernel (follow)
AgeCommit message (Collapse)AuthorFilesLines
2019-08-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller8-40/+39
Merge conflict of mlx5 resolved using instructions in merge commit 9566e650bf7fdf58384bb06df634f7531ca3a97e. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-18Merge tag 'spdx-5.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdxLinus Torvalds1-15/+1
Pull SPDX fixes from Greg KH: "Here are four small SPDX fixes for 5.3-rc5. A few style fixes for some SPDX comments, added an SPDX tag for one file, and fix up some GPL boilerplate for another file. All of these have been in linux-next for a few weeks with no reported issues (they are comment changes only, so that's to be expected...)" * tag 'spdx-5.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx: i2c: stm32: Use the correct style for SPDX License Identifier intel_th: Use the correct style for SPDX License Identifier coccinelle: api/atomic_as_refcounter: add SPDX License Identifier kernel/configs: Replace GPL boilerplate code with SPDX identifier
2019-08-16Merge tag 'pm-5.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pmLinus Torvalds1-4/+10
Pull power management fixes from Rafael Wysocki: "These add a check to avoid recent suspend-to-idle power regression on systems with NVMe drives where the PCIe ASPM policy is "performance" (or when the kernel is built without ASPM support), fix an issue related to frequency limits in the schedutil cpufreq governor and fix a mistake related to the PM QoS usage in the cpufreq core introduced recently. Specifics: - Disable NVMe power optimization related to suspend-to-idle added recently on systems where PCIe ASPM is not able to put PCIe links into low-power states to prevent excess power from being drawn by the system while suspended (Rafael Wysocki). - Make the schedutil governor handle frequency limits changes properly in all cases (Viresh Kumar). - Prevent the cpufreq core from treating positive values returned by dev_pm_qos_update_request() as errors (Viresh Kumar)" * tag 'pm-5.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: nvme-pci: Allow PCI bus-level PM to be used if ASPM is disabled PCI/ASPM: Add pcie_aspm_enabled() cpufreq: schedutil: Don't skip freq update when limits change cpufreq: dev_pm_qos_update_request() can return 1 on success
2019-08-16Merge branch 'pm-cpufreq'Rafael J. Wysocki1-4/+10
* pm-cpufreq: cpufreq: schedutil: Don't skip freq update when limits change cpufreq: dev_pm_qos_update_request() can return 1 on success
2019-08-14Merge tag 'dma-mapping-5.3-4' of git://git.infradead.org/users/hch/dma-mappingLinus Torvalds3-7/+24
Pull dma-mapping fixes from Christoph Hellwig: - fix the handling of the bus_dma_mask in dma_get_required_mask, which caused a regression in this merge window (Lucas Stach) - fix a regression in the handling of DMA_ATTR_NO_KERNEL_MAPPING (me) - fix dma_mmap_coherent to not cause page attribute mismatches on coherent architectures like x86 (me) * tag 'dma-mapping-5.3-4' of git://git.infradead.org/users/hch/dma-mapping: dma-mapping: fix page attributes for dma_mmap_* dma-direct: don't truncate dma_required_mask to bus addressing capabilities dma-direct: fix DMA_ATTR_NO_KERNEL_MAPPING
2019-08-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski5-55/+350
Daniel Borkmann says: ==================== The following pull-request contains BPF updates for your *net-next* tree. There is a small merge conflict in libbpf (Cc Andrii so he's in the loop as well): for (i = 1; i <= btf__get_nr_types(btf); i++) { t = (struct btf_type *)btf__type_by_id(btf, i); if (!has_datasec && btf_is_var(t)) { /* replace VAR with INT */ t->info = BTF_INFO_ENC(BTF_KIND_INT, 0, 0); <<<<<<< HEAD /* * using size = 1 is the safest choice, 4 will be too * big and cause kernel BTF validation failure if * original variable took less than 4 bytes */ t->size = 1; *(int *)(t+1) = BTF_INT_ENC(0, 0, 8); } else if (!has_datasec && kind == BTF_KIND_DATASEC) { ======= t->size = sizeof(int); *(int *)(t + 1) = BTF_INT_ENC(0, 0, 32); } else if (!has_datasec && btf_is_datasec(t)) { >>>>>>> 72ef80b5ee131e96172f19e74b4f98fa3404efe8 /* replace DATASEC with STRUCT */ Conflict is between the two commits 1d4126c4e119 ("libbpf: sanitize VAR to conservative 1-byte INT") and b03bc6853c0e ("libbpf: convert libbpf code to use new btf helpers"), so we need to pick the sanitation fixup as well as use the new btf_is_datasec() helper and the whitespace cleanup. Looks like the following: [...] if (!has_datasec && btf_is_var(t)) { /* replace VAR with INT */ t->info = BTF_INFO_ENC(BTF_KIND_INT, 0, 0); /* * using size = 1 is the safest choice, 4 will be too * big and cause kernel BTF validation failure if * original variable took less than 4 bytes */ t->size = 1; *(int *)(t + 1) = BTF_INT_ENC(0, 0, 8); } else if (!has_datasec && btf_is_datasec(t)) { /* replace DATASEC with STRUCT */ [...] The main changes are: 1) Addition of core parts of compile once - run everywhere (co-re) effort, that is, relocation of fields offsets in libbpf as well as exposure of kernel's own BTF via sysfs and loading through libbpf, from Andrii. More info on co-re: http://vger.kernel.org/bpfconf2019.html#session-2 and http://vger.kernel.org/lpc-bpf2018.html#session-2 2) Enable passing input flags to the BPF flow dissector to customize parsing and allowing it to stop early similar to the C based one, from Stanislav. 3) Add a BPF helper function that allows generating SYN cookies from XDP and tc BPF, from Petar. 4) Add devmap hash-based map type for more flexibility in device lookup for redirects, from Toke. 5) Improvements to XDP forwarding sample code now utilizing recently enabled devmap lookups, from Jesper. 6) Add support for reporting the effective cgroup progs in bpftool, from Jakub and Takshak. 7) Fix reading kernel config from bpftool via /proc/config.gz, from Peter. 8) Fix AF_XDP umem pages mapping for 32 bit architectures, from Ivan. 9) Follow-up to add two more BPF loop tests for the selftest suite, from Alexei. 10) Add perf event output helper also for other skb-based program types, from Allan. 11) Fix a co-re related compilation error in selftests, from Yonghong. ==================== Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-08-13btf: rename /sys/kernel/btf/kernel into /sys/kernel/btf/vmlinuxAndrii Nakryiko1-15/+15
Expose kernel's BTF under the name vmlinux to be more uniform with using kernel module names as file names in the future. Fixes: 341dfcf8d78e ("btf: expose BTF info through sysfs") Suggested-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-13btf: expose BTF info through sysfsAndrii Nakryiko2-0/+54
Make .BTF section allocated and expose its contents through sysfs. /sys/kernel/btf directory is created to contain all the BTFs present inside kernel. Currently there is only kernel's main BTF, represented as /sys/kernel/btf/kernel file. Once kernel modules' BTFs are supported, each module will expose its BTF as /sys/kernel/btf/<module-name> file. Current approach relies on a few pieces coming together: 1. pahole is used to take almost final vmlinux image (modulo .BTF and kallsyms) and generate .BTF section by converting DWARF info into BTF. This section is not allocated and not mapped to any segment, though, so is not yet accessible from inside kernel at runtime. 2. objcopy dumps .BTF contents into binary file and subsequently convert binary file into linkable object file with automatically generated symbols _binary__btf_kernel_bin_start and _binary__btf_kernel_bin_end, pointing to start and end, respectively, of BTF raw data. 3. final vmlinux image is generated by linking this object file (and kallsyms, if necessary). sysfs_btf.c then creates /sys/kernel/btf/kernel file and exposes embedded BTF contents through it. This allows, e.g., libbpf and bpftool access BTF info at well-known location, without resorting to searching for vmlinux image on disk (location of which is not standardized and vmlinux image might not be even available in some scenarios, e.g., inside qemu during testing). Alternative approach using .incbin assembler directive to embed BTF contents directly was attempted but didn't work, because sysfs_proc.o is not re-compiled during link-vmlinux.sh stage. This is required, though, to update embedded BTF data (initially empty data is embedded, then pahole generates BTF info and we need to regenerate sysfs_btf.o with updated contents, but it's too late at that point). If BTF couldn't be generated due to missing or too old pahole, sysfs_btf.c handles that gracefully by detecting that _binary__btf_kernel_bin_start (weak symbol) is 0 and not creating /sys/kernel/btf at all. v2->v3: - added Documentation/ABI/testing/sysfs-kernel-btf (Greg K-H); - created proper kobject (btf_kobj) for btf directory (Greg K-H); - undo v2 change of reusing vmlinux, as it causes extra kallsyms pass due to initially missing __binary__btf_kernel_bin_{start/end} symbols; v1->v2: - allow kallsyms stage to re-use vmlinux generated by gen_btf(); Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-10Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds2-10/+2
Pull scheduler fixes from Thomas Gleixner: "Three fixlets for the scheduler: - Avoid double bandwidth accounting in the push & pull code - Use a sane FIFO priority for the Pressure Stall Information (PSI) thread. - Avoid permission checks when setting the scheduler params for the PSI thread" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/psi: Do not require setsched permission from the trigger creator sched/psi: Reduce psimon FIFO priority sched/deadline: Fix double accounting of rq/running bw in push & pull
2019-08-10dma-mapping: fix page attributes for dma_mmap_*Christoph Hellwig2-2/+19
All the way back to introducing dma_common_mmap we've defaulted to mark the pages as uncached. But this is wrong for DMA coherent devices. Later on DMA_ATTR_WRITE_COMBINE also got incorrect treatment as that flag is only treated special on the alloc side for non-coherent devices. Introduce a new dma_pgprot helper that deals with the check for coherent devices so that only the remapping cases ever reach arch_dma_mmap_pgprot and we thus ensure no aliasing of page attributes happens, which makes the powerpc version of arch_dma_mmap_pgprot obsolete and simplifies the remaining ones. Note that this means arch_dma_mmap_pgprot is a bit misnamed now, but we'll phase it out soon. Fixes: 64ccc9c033c6 ("common: dma-mapping: add support for generic dma_mmap_* calls") Reported-by: Shawn Anastasio <shawn@anastas.io> Reported-by: Gavin Li <git@thegavinli.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Catalin Marinas <catalin.marinas@arm.com> # arm64
2019-08-10dma-direct: don't truncate dma_required_mask to bus addressing capabilitiesLucas Stach1-3/+0
The dma required_mask needs to reflect the actual addressing capabilities needed to handle the whole system RAM. When truncated down to the bus addressing capabilities dma_addressing_limited() will incorrectly signal no limitations for devices which are restricted by the bus_dma_mask. Fixes: b4ebe6063204 (dma-direct: implement complete bus_dma_mask handling) Signed-off-by: Lucas Stach <l.stach@pengutronix.de> Tested-by: Atish Patra <atish.patra@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-08-10dma-direct: fix DMA_ATTR_NO_KERNEL_MAPPINGChristoph Hellwig1-2/+5
The new DMA_ATTR_NO_KERNEL_MAPPING needs to actually assign a dma_addr to work. Also skip it if the architecture needs forced decryption handling, as that needs a kernel virtual address. Fixes: d98849aff879 (dma-direct: handle DMA_ATTR_NO_KERNEL_MAPPING in common code) Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Lucas Stach <l.stach@pengutronix.de>
2019-08-10cpufreq: schedutil: Don't skip freq update when limits changeViresh Kumar1-4/+10
To avoid reducing the frequency of a CPU prematurely, we skip reducing the frequency if the CPU had been busy recently. This should not be done when the limits of the policy are changed, for example due to thermal throttling. We should always get the frequency within the new limits as soon as possible. Trying to fix this by using only one flag, i.e. need_freq_update, can lead to a race condition where the flag gets cleared without forcing us to change the frequency at least once. And so this patch introduces another flag to avoid that race condition. Fixes: ecd288429126 ("cpufreq: schedutil: Don't set next_freq to UINT_MAX") Cc: v4.18+ <stable@vger.kernel.org> # v4.18+ Reported-by: Doug Smythies <dsmythies@telus.net> Tested-by: Doug Smythies <dsmythies@telus.net> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-08-08genirq/affinity: Create affinity mask for single vectorMing Lei1-4/+2
Since commit c66d4bd110a1f8 ("genirq/affinity: Add new callback for (re)calculating interrupt sets"), irq_create_affinity_masks() returns NULL in case of single vector. This change has caused regression on some drivers, such as lpfc. The problem is that single vector requests can happen in some generic cases: 1) kdump kernel 2) irq vectors resource is close to exhaustion. If in that situation the affinity mask for a single vector is not created, every caller has to handle the special case. There is no reason why the mask cannot be created, so remove the check for a single vector and create the mask. Fixes: c66d4bd110a1f8 ("genirq/affinity: Add new callback for (re)calculating interrupt sets") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20190805011906.5020-1-ming.lei@redhat.com
2019-08-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds1-2/+2
Pull networking fixes from David Miller: "Yeah I should have sent a pull request last week, so there is a lot more here than usual: 1) Fix memory leak in ebtables compat code, from Wenwen Wang. 2) Several kTLS bug fixes from Jakub Kicinski (circular close on disconnect etc.) 3) Force slave speed check on link state recovery in bonding 802.3ad mode, from Thomas Falcon. 4) Clear RX descriptor bits before assigning buffers to them in stmmac, from Jose Abreu. 5) Several missing of_node_put() calls, mostly wrt. for_each_*() OF loops, from Nishka Dasgupta. 6) Double kfree_skb() in peak_usb can driver, from Stephane Grosjean. 7) Need to hold sock across skb->destructor invocation, from Cong Wang. 8) IP header length needs to be validated in ipip tunnel xmit, from Haishuang Yan. 9) Use after free in ip6 tunnel driver, also from Haishuang Yan. 10) Do not use MSI interrupts on r8169 chips before RTL8168d, from Heiner Kallweit. 11) Upon bridge device init failure, we need to delete the local fdb. From Nikolay Aleksandrov. 12) Handle erros from of_get_mac_address() properly in stmmac, from Martin Blumenstingl. 13) Handle concurrent rename vs. dump in netfilter ipset, from Jozsef Kadlecsik. 14) Setting NETIF_F_LLTX on mac80211 causes complete breakage with some devices, so revert. From Johannes Berg. 15) Fix deadlock in rxrpc, from David Howells. 16) Fix Kconfig deps of enetc driver, we must have PHYLIB. From Yue Haibing. 17) Fix mvpp2 crash on module removal, from Matteo Croce. 18) Fix race in genphy_update_link, from Heiner Kallweit. 19) bpf_xdp_adjust_head() stopped working with generic XDP when we fixes generic XDP to support stacked devices properly, fix from Jesper Dangaard Brouer. 20) Unbalanced RCU locking in rt6_update_exception_stamp_rt(), from David Ahern. 21) Several memory leaks in new sja1105 driver, from Vladimir Oltean" * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (214 commits) net: dsa: sja1105: Fix memory leak on meta state machine error path net: dsa: sja1105: Fix memory leak on meta state machine normal path net: dsa: sja1105: Really fix panic on unregistering PTP clock net: dsa: sja1105: Use the LOCKEDS bit for SJA1105 E/T as well net: dsa: sja1105: Fix broken learning with vlan_filtering disabled net: dsa: qca8k: Add of_node_put() in qca8k_setup_mdio_bus() net: sched: sample: allow accessing psample_group with rtnl net: sched: police: allow accessing police->params with rtnl net: hisilicon: Fix dma_map_single failed on arm64 net: hisilicon: fix hip04-xmit never return TX_BUSY net: hisilicon: make hip04_tx_reclaim non-reentrant tc-testing: updated vlan action tests with batch create/delete net sched: update vlan action for batched events operations net: stmmac: tc: Do not return a fragment entry net: stmmac: Fix issues when number of Queues >= 4 net: stmmac: xgmac: Fix XGMAC selftests be2net: disable bh with spin_lock in be_process_mcc net: cxgb3_main: Fix a resource leak in a error path in 'init_one()' net: ethernet: sun4i-emac: Support phy-handle property for finding PHYs net: bridge: move default pvid init/deinit to NETDEV_REGISTER/UNREGISTER ...
2019-08-06sched/psi: Do not require setsched permission from the trigger creatorSuren Baghdasaryan1-1/+1
When a process creates a new trigger by writing into /proc/pressure/* files, permissions to write such a file should be used to determine whether the process is allowed to do so or not. Current implementation would also require such a process to have setsched capability. Setting of psi trigger thread's scheduling policy is an implementation detail and should not be exposed to the user level. Remove the permission check by using _nocheck version of the function. Suggested-by: Nick Kralevich <nnk@google.com> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: lizefan@huawei.com Cc: mingo@redhat.com Cc: akpm@linux-foundation.org Cc: kernel-team@android.com Cc: dennisszhou@gmail.com Cc: dennis@kernel.org Cc: hannes@cmpxchg.org Cc: axboe@kernel.dk Link: https://lkml.kernel.org/r/20190730013310.162367-1-surenb@google.com
2019-08-06sched/psi: Reduce psimon FIFO priorityPeter Zijlstra1-1/+1
PSI defaults to a FIFO-99 thread, reduce this to FIFO-1. FIFO-99 is the very highest priority available to SCHED_FIFO and it not a suitable default; it would indicate the psi work is the most important work on the machine. Since Real-Time tasks will have pre-allocated memory and locked it in place, Real-Time tasks do not care about PSI. All it needs is to be above OTHER. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Tested-by: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Gleixner <tglx@linutronix.de>
2019-08-06sched/deadline: Fix double accounting of rq/running bw in push & pullDietmar Eggemann1-8/+0
{push,pull}_dl_task() always calls {de,}activate_task() with .flags=0 which sets p->on_rq=TASK_ON_RQ_MIGRATING. {push,pull}_dl_task()->{de,}activate_task()->{de,en}queue_task()-> {de,en}queue_task_dl() calls {sub,add}_{running,rq}_bw() since p->on_rq==TASK_ON_RQ_MIGRATING. So {sub,add}_{running,rq}_bw() in {push,pull}_dl_task() is double-accounting for that task. Fix it by removing rq/running bw accounting in [push/pull]_dl_task(). Fixes: 7dd778841164 ("sched/core: Unify p->on_rq updates") Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Luca Abeni <luca.abeni@santannapisa.it> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Qais Yousef <qais.yousef@arm.com> Link: https://lkml.kernel.org/r/20190802145945.18702-2-dietmar.eggemann@arm.com
2019-08-03memremap: move from kernel/ to mm/Christoph Hellwig2-406/+0
memremap.c implements MM functionality for ZONE_DEVICE, so it really should be in the mm/ directory, not the kernel/ one. Link: http://lkml.kernel.org/r/20190722094143.18387-1-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-03kernel/signal.c: fix a kernel-doc markupMauro Carvalho Chehab1-1/+1
The kernel-doc parser doesn't handle expressions with %foo*. Instead, when an asterisk should be part of a constant, it uses an alternative notation: `foo*`. Link: http://lkml.kernel.org/r/7f18c2e0b5e39e6b7eb55ddeb043b8b260b49f2d.1563361575.git.mchehab+samsung@kernel.org Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Cc: Deepa Dinamani <deepa.kernel@gmail.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-02Merge tag 'arm-swiotlb-5.3' of git://git.infradead.org/users/hch/dma-mappingLinus Torvalds1-2/+11
Pull arm swiotlb support from Christoph Hellwig: "This fixes a cascade of regressions that originally started with the addition of the ia64 port, but only got fatal once we removed most uses of block layer bounce buffering in Linux 4.18. The reason is that while the original i386/PAE code that was the first architecture that supported > 4GB of memory without an iommu decided to leave bounce buffering to the subsystems, which in those days just mean block and networking as no one else consumed arbitrary userspace memory. Later with ia64, x86_64 and other ports we assumed that either an iommu or something that fakes it up ("software IOTLB" in beautiful Intel speak) is present and that subsystems can rely on that for dealing with addressing limitations in devices. Except that the ARM LPAE scheme that added larger physical address to 32-bit ARM did not follow that scheme and thus only worked by chance and only for block and networking I/O directly to highmem. Long story, short fix - add swiotlb support to arm when build for LPAE platforms, which actuallys turns out to be pretty trivial with the modern dma-direct / swiotlb code to fix the Linux 4.18-ish regression" * tag 'arm-swiotlb-5.3' of git://git.infradead.org/users/hch/dma-mapping: arm: use swiotlb for bounce buffering on LPAE configs dma-mapping: check pfn validity in dma_common_{mmap,get_sgtable}
2019-08-02Merge tag 'dma-mapping-5.3-3' of git://git.infradead.org/users/hch/dma-mappingLinus Torvalds1-3/+5
Pull dma-mapping regression fixes from Christoph Hellwig: "Two related regression fixes for changes from this merge window to fix alignment issues introduced in the CMA allocation rework (Nicolin Chen)" * tag 'dma-mapping-5.3-3' of git://git.infradead.org/users/hch/dma-mapping: dma-contiguous: page-align the size in dma_free_contiguous() dma-contiguous: do not overwrite align in dma_alloc_contiguous()
2019-08-01bpf: always allocate at least 16 bytes for setsockopt hookStanislav Fomichev1-4/+13
Since we always allocate memory, allocate just a little bit more for the BPF program in case it need to override user input with bigger value. The canonical example is TCP_CONGESTION where input string might be too small to override (nv -> bbr or cubic). 16 bytes are chosen to match the size of TCP_CA_NAME_MAX and can be extended in the future if needed. Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-07-31Merge tag 'trace-v5.3-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-traceLinus Torvalds1-10/+7
Pull tracing fixes from Steven Rostedt: "Two minor fixes: - Fix trace event header include guards, as several did not match the #define to the #ifdef - Remove a redundant test to ftrace_graph_notrace_addr() that was accidentally added" * tag 'trace-v5.3-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: fgraph: Remove redundant ftrace_graph_notrace_addr() test tracing: Fix header include guards in trace event headers
2019-07-30fgraph: Remove redundant ftrace_graph_notrace_addr() testChangbin Du1-10/+7
We already have tested it before. The second one should be removed. With this change, the performance should have little improvement. Link: http://lkml.kernel.org/r/20190730140850.7927-1-changbin.du@gmail.com Cc: stable@vger.kernel.org Fixes: 9cd2992f2d6c ("fgraph: Have set_graph_notrace only affect function_graph tracer") Signed-off-by: Changbin Du <changbin.du@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-07-30exit: make setting exit_state consistentChristian Brauner1-2/+3
Since commit b191d6491be6 ("pidfd: fix a poll race when setting exit_state") we unconditionally set exit_state to EXIT_ZOMBIE before calling into do_notify_parent(). This was done to eliminate a race when querying exit_state in do_notify_pidfd(). Back then we decided to do the absolute minimal thing to fix this and not touch the rest of the exit_notify() function where exit_state is set. Since this fix has not caused any issues change the setting of exit_state to EXIT_DEAD in the autoreap case to account for the fact hat exit_state is set to EXIT_ZOMBIE unconditionally. This fix was planned but also explicitly requested in [1] and makes the whole code more consistent. /* References */ [1]: https://lore.kernel.org/lkml/CAHk-=wigcxGFR2szue4wavJtH5cYTTeNES=toUBVGsmX0rzX+g@mail.gmail.com Signed-off-by: Christian Brauner <christian@brauner.io> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-30kernel/configs: Replace GPL boilerplate code with SPDX identifierThomas Huth1-15/+1
The FSF does not reside in "675 Mass Ave, Cambridge" anymore... let's replace the old GPL boilerplate code with a proper SPDX identifier instead. Signed-off-by: Thomas Huth <thuth@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-29xdp: Add devmap_hash map type for looking up devices by hashed indexToke Høiland-Jørgensen2-0/+202
A common pattern when using xdp_redirect_map() is to create a device map where the lookup key is simply ifindex. Because device maps are arrays, this leaves holes in the map, and the map has to be sized to fit the largest ifindex, regardless of how many devices actually are actually needed in the map. This patch adds a second type of device map where the key is looked up using a hashmap, instead of being used as an array index. This allows maps to be densely packed, so they can be smaller. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-07-29xdp: Refactor devmap allocation code for reuseToke Høiland-Jørgensen1-53/+83
The subsequent patch to add a new devmap sub-type can re-use much of the initialisation and allocation code, so refactor it into separate functions. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-07-29pidfd: Add warning if exit_state is 0 during notificationJoel Fernandes (Google)1-0/+1
Previously a condition got missed where the pidfd waiters are awakened before the exit_state gets set. This can result in a missed notification [1] and the polling thread waiting forever. It is fixed now, however it would be nice to avoid this kind of issue going unnoticed in the future. So just add a warning to catch it in the future. /* References */ [1]: https://lore.kernel.org/lkml/20190717172100.261204-1-joel@joelfernandes.org/ Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Link: https://lore.kernel.org/r/20190724164816.201099-1-joel@joelfernandes.org Signed-off-by: Christian Brauner <christian@brauner.io>
2019-07-29dma-contiguous: page-align the size in dma_free_contiguous()Nicolin Chen1-1/+2
According to the original dma_direct_alloc_pages() code: { unsigned int count = PAGE_ALIGN(size) >> PAGE_SHIFT; if (!dma_release_from_contiguous(dev, page, count)) __free_pages(page, get_order(size)); } The count parameter for dma_release_from_contiguous() was page aligned before the right-shifting operation, while the new API dma_free_contiguous() forgets to have PAGE_ALIGN() at the size. So this patch simply adds it to prevent any corner case. Fixes: fdaeec198ada ("dma-contiguous: add dma_{alloc,free}_contiguous() helpers") Signed-off-by: Nicolin Chen <nicoleotsuka@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-07-29dma-contiguous: do not overwrite align in dma_alloc_contiguous()Nicolin Chen1-2/+3
The dma_alloc_contiguous() limits align at CONFIG_CMA_ALIGNMENT for cma_alloc() however it does not restore it for the fallback routine. This will result in a size mismatch between the allocation and free when running into the fallback routines after cma_alloc() fails, if the align is larger than CONFIG_CMA_ALIGNMENT. This patch adds a cma_align to take care of cma_alloc() and prevent the align from being overwritten. Fixes: fdaeec198ada ("dma-contiguous: add dma_{alloc,free}_contiguous() helpers") Reported-by: Dafna Hirschfeld <dafna.hirschfeld@collabora.com> Signed-off-by: Nicolin Chen <nicoleotsuka@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-07-27Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds2-44/+102
Pull scheduler fixes from Thomas Gleixner: "Two fixes for the fair scheduling class: - Prevent freeing memory which is accessible by concurrent readers - Make the RCU annotations for numa groups consistent" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Use RCU accessors consistently for ->numa_group sched/fair: Don't free p->numa_faults with concurrent readers
2019-07-27Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-1/+1
Pull perf fixes from Thomas Gleixner: "A pile of perf related fixes: Kernel: - Fix SLOTS PEBS event constraints for Icelake CPUs - Add the missing mask bit to allow counting hardware generated prefetches on L3 for Icelake CPUs - Make the test for hypervisor platforms more accurate (as far as possible) - Handle PMUs correctly which override event->cpu - Yet another missing fallthrough annotation Tools: perf.data: - Fix loading of compressed data split across adjacent records - Fix buffer size setting for processing CPU topology perf.data header. perf stat: - Fix segfault for event group in repeat mode - Always separate "stalled cycles per insn" line, it was being appended to the "instructions" line. perf script: - Fix --max-blocks man page description. - Improve man page description of metrics. - Fix off by one in brstackinsn IPC computation. perf probe: - Avoid calling freeing routine multiple times for same pointer. perf build: - Do not use -Wshadow on gcc < 4.8, avoiding too strict warnings treated as errors, breaking the build" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/intel: Mark expected switch fall-throughs perf/core: Fix creating kernel counters for PMUs that override event->cpu perf/x86: Apply more accurate check on hypervisor platform perf/x86/intel: Fix invalid Bit 13 for Icelake MSR_OFFCORE_RSP_x register perf/x86/intel: Fix SLOTS PEBS event constraint perf build: Do not use -Wshadow on gcc < 4.8 perf probe: Avoid calling freeing routine multiple times for same pointer perf probe: Set pev->nargs to zero after freeing pev->args entries perf session: Fix loading of compressed data split across adjacent records perf stat: Always separate stalled cycles per insn perf stat: Fix segfault for event group in repeat mode perf tools: Fix proper buffer size for feature processing perf script: Fix off by one in brstackinsn IPC computation perf script: Improve man page description of metrics perf script: Fix --max-blocks man page description
2019-07-27Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds4-15/+40
Pull locking fixes from Thomas Gleixner: "A set of locking fixes: - Address the fallout of the rwsem rework. Missing ACQUIREs and a sanity check to prevent a use-after-free - Add missing checks for unitialized mutexes when mutex debugging is enabled. - Remove the bogus code in the generic SMP variant of arch_futex_atomic_op_inuser() - Fixup the #ifdeffery in lockdep to prevent compile warnings" * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/mutex: Test for initialized mutex locking/lockdep: Clean up #ifdef checks locking/lockdep: Hide unused 'class' variable locking/rwsem: Add ACQUIRE comments tty/ldsem, locking/rwsem: Add missing ACQUIRE to read_failed sleep loop lcoking/rwsem: Add missing ACQUIRE to read_slowpath sleep loop locking/rwsem: Add missing ACQUIRE to read_slowpath exit when queue is empty locking/rwsem: Don't call owner_on_cpu() on read-owner futex: Cleanup generic SMP variant of arch_futex_atomic_op_inuser()
2019-07-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller1-2/+2
Alexei Starovoitov says: ==================== pull-request: bpf 2019-07-25 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) fix segfault in libbpf, from Andrii. 2) fix gso_segs access, from Eric. 3) tls/sockmap fixes, from Jakub and John. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-25Merge branch 'access-creds'Linus Torvalds1-2/+19
The access() (and faccessat()) credentials change can cause an unnecessary load on the RCU machinery because every access() call ends up freeing the temporary access credential using RCU. This isn't really noticeable on small machines, but if you have hundreds of cores you can cause huge slowdowns due to RCU storms. It's easy to avoid: the temporary access crededntials aren't actually normally accessed using RCU at all, so we can avoid the whole issue by just marking them as such. * access-creds: access: avoid the RCU grace period for the temporary subjective credentials
2019-07-25perf/core: Fix creating kernel counters for PMUs that override event->cpuLeonard Crestez1-1/+1
Some hardware PMU drivers will override perf_event.cpu inside their event_init callback. This causes a lockdep splat when initialized through the kernel API: WARNING: CPU: 0 PID: 250 at kernel/events/core.c:2917 ctx_sched_out+0x78/0x208 pc : ctx_sched_out+0x78/0x208 Call trace: ctx_sched_out+0x78/0x208 __perf_install_in_context+0x160/0x248 remote_function+0x58/0x68 generic_exec_single+0x100/0x180 smp_call_function_single+0x174/0x1b8 perf_install_in_context+0x178/0x188 perf_event_create_kernel_counter+0x118/0x160 Fix this by calling perf_install_in_context with event->cpu, just like perf_event_open Signed-off-by: Leonard Crestez <leonard.crestez@nxp.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mark Rutland <mark.rutland@arm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Frank Li <Frank.li@nxp.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Link: https://lkml.kernel.org/r/c4ebe0503623066896d7046def4d6b1e06e0eb2e.1563972056.git.leonard.crestez@nxp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25locking/mutex: Test for initialized mutexSebastian Andrzej Siewior1-1/+10
An uninitialized/ zeroed mutex will go unnoticed because there is no check for it. There is a magic check in the unlock's slowpath path which might go unnoticed if the unlock happens in the fastpath. Add a ->magic check early in the mutex_lock() and mutex_trylock() path. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Will Deacon <will@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190703092125.lsdf4gpsh2plhavb@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25locking/lockdep: Clean up #ifdef checksArnd Bergmann1-7/+6
As Will Deacon points out, CONFIG_PROVE_LOCKING implies TRACE_IRQFLAGS, so the conditions I added in the previous patch, and some others in the same file can be simplified by only checking for the former. No functional change. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Will Deacon <will@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Bart Van Assche <bvanassche@acm.org> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <longman@redhat.com> Cc: Yuyang Du <duyuyang@gmail.com> Fixes: 886532aee3cd ("locking/lockdep: Move mark_lock() inside CONFIG_TRACE_IRQFLAGS && CONFIG_PROVE_LOCKING") Link: https://lkml.kernel.org/r/20190628102919.2345242-1-arnd@arndb.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25locking/lockdep: Hide unused 'class' variableArnd Bergmann1-1/+2
The usage is now hidden in an #ifdef, so we need to move the variable itself in there as well to avoid this warning: kernel/locking/lockdep_proc.c:203:21: error: unused variable 'class' [-Werror,-Wunused-variable] Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Bart Van Assche <bvanassche@acm.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qian Cai <cai@lca.pw> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Yuyang Du <duyuyang@gmail.com> Cc: frederic@kernel.org Fixes: 68d41d8c94a3 ("locking/lockdep: Fix lock used or unused stats error") Link: https://lkml.kernel.org/r/20190715092809.736834-1-arnd@arndb.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25locking/rwsem: Add ACQUIRE commentsPeter Zijlstra1-5/+13
Since we just reviewed read_slowpath for ACQUIRE correctness, add a few coments to retain our findings. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Will Deacon <will@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25lcoking/rwsem: Add missing ACQUIRE to read_slowpath sleep loopPeter Zijlstra1-1/+3
While reviewing another read_slowpath patch, both Will and I noticed another missing ACQUIRE, namely: X = 0; CPU0 CPU1 rwsem_down_read() for (;;) { set_current_state(TASK_UNINTERRUPTIBLE); X = 1; rwsem_up_write(); rwsem_mark_wake() atomic_long_add(adjustment, &sem->count); smp_store_release(&waiter->task, NULL); if (!waiter.task) break; ... } r = X; Allows 'r == 0'. Reported-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reported-by: Will Deacon <will@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Will Deacon <will@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25locking/rwsem: Add missing ACQUIRE to read_slowpath exit when queue is emptyJan Stancek1-0/+2
LTP mtest06 has been observed to occasionally hit "still mapped when deleted" and following BUG_ON on arm64. The extra mapcount originated from pagefault handler, which handled pagefault for vma that has already been detached. vma is detached under mmap_sem write lock by detach_vmas_to_be_unmapped(), which also invalidates vmacache. When the pagefault handler (under mmap_sem read lock) calls find_vma(), vmacache_valid() wrongly reports vmacache as valid. After rwsem down_read() returns via 'queue empty' path (as of v5.2), it does so without an ACQUIRE on sem->count: down_read() __down_read() rwsem_down_read_failed() __rwsem_down_read_failed_common() raw_spin_lock_irq(&sem->wait_lock); if (list_empty(&sem->wait_list)) { if (atomic_long_read(&sem->count) >= 0) { raw_spin_unlock_irq(&sem->wait_lock); return sem; The problem can be reproduced by running LTP mtest06 in a loop and building the kernel (-j $NCPUS) in parallel. It does reproduces since v4.20 on arm64 HPE Apollo 70 (224 CPUs, 256GB RAM, 2 nodes). It triggers reliably in about an hour. The patched kernel ran fine for 10+ hours. Signed-off-by: Jan Stancek <jstancek@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Will Deacon <will@kernel.org> Acked-by: Waiman Long <longman@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dbueso@suse.de Fixes: 4b486b535c33 ("locking/rwsem: Exit read lock slowpath if queue empty & no writer") Link: https://lkml.kernel.org/r/50b8914e20d1d62bb2dee42d342836c2c16ebee7.1563438048.git.jstancek@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25locking/rwsem: Don't call owner_on_cpu() on read-ownerWaiman Long1-1/+5
For writer, the owner value is cleared on unlock. For reader, it is left intact on unlock for providing better debugging aid on crash dump and the unlock of one reader may not mean the lock is free. As a result, the owner_on_cpu() shouldn't be used on read-owner as the task pointer value may not be valid and it might have been freed. That is the case in rwsem_spin_on_owner(), but not in rwsem_can_spin_on_owner(). This can lead to use-after-free error from KASAN. For example, BUG: KASAN: use-after-free in rwsem_down_write_slowpath (/home/miguel/kernel/linux/kernel/locking/rwsem.c:669 /home/miguel/kernel/linux/kernel/locking/rwsem.c:1125) Fix this by checking for RWSEM_READER_OWNED flag before calling owner_on_cpu(). Reported-by: Luis Henriques <lhenriques@suse.com> Tested-by: Luis Henriques <lhenriques@suse.com> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Will Deacon <will.deacon@arm.com> Cc: huang ying <huang.ying.caritas@gmail.com> Fixes: 94a9717b3c40e ("locking/rwsem: Make rwsem->owner an atomic_long_t") Link: https://lkml.kernel.org/r/81e82d5b-5074-77e8-7204-28479bbe0df0@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25sched/fair: Use RCU accessors consistently for ->numa_groupJann Horn1-39/+81
The old code used RCU annotations and accessors inconsistently for ->numa_group, which can lead to use-after-frees and NULL dereferences. Let all accesses to ->numa_group use proper RCU helpers to prevent such issues. Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Fixes: 8c8a743c5087 ("sched/numa: Use {cpu, pid} to create task groups for shared faults") Link: https://lkml.kernel.org/r/20190716152047.14424-3-jannh@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25sched/fair: Don't free p->numa_faults with concurrent readersJann Horn2-5/+21
When going through execve(), zero out the NUMA fault statistics instead of freeing them. During execve, the task is reachable through procfs and the scheduler. A concurrent /proc/*/sched reader can read data from a freed ->numa_faults allocation (confirmed by KASAN) and write it back to userspace. I believe that it would also be possible for a use-after-free read to occur through a race between a NUMA fault and execve(): task_numa_fault() can lead to task_numa_compare(), which invokes task_weight() on the currently running task of a different CPU. Another way to fix this would be to make ->numa_faults RCU-managed or add extra locking, but it seems easier to wipe the NUMA fault statistics on execve. Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Fixes: 82727018b0d3 ("sched/numa: Call task_numa_free() from do_execve()") Link: https://lkml.kernel.org/r/20190716152047.14424-1-jannh@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-24access: avoid the RCU grace period for the temporary subjective credentialsLinus Torvalds1-2/+19
It turns out that 'access()' (and 'faccessat()') can cause a lot of RCU work because it installs a temporary credential that gets allocated and freed for each system call. The allocation and freeing overhead is mostly benign, but because credentials can be accessed under the RCU read lock, the freeing involves a RCU grace period. Which is not a huge deal normally, but if you have a lot of access() calls, this causes a fair amount of seconday damage: instead of having a nice alloc/free patterns that hits in hot per-CPU slab caches, you have all those delayed free's, and on big machines with hundreds of cores, the RCU overhead can end up being enormous. But it turns out that all of this is entirely unnecessary. Exactly because access() only installs the credential as the thread-local subjective credential, the temporary cred pointer doesn't actually need to be RCU free'd at all. Once we're done using it, we can just free it synchronously and avoid all the RCU overhead. So add a 'non_rcu' flag to 'struct cred', which can be set by users that know they only use it in non-RCU context (there are other potential users for this). We can make it a union with the rcu freeing list head that we need for the RCU case, so this doesn't need any extra storage. Note that this also makes 'get_current_cred()' clear the new non_rcu flag, in case we have filesystems that take a long-term reference to the cred and then expect the RCU delayed freeing afterwards. It's not entirely clear that this is required, but it makes for clear semantics: the subjective cred remains non-RCU as long as you only access it synchronously using the thread-local accessors, but you _can_ use it as a generic cred if you want to. It is possible that we should just remove the whole RCU markings for ->cred entirely. Only ->real_cred is really supposed to be accessed through RCU, and the long-term cred copies that nfs uses might want to explicitly re-enable RCU freeing if required, rather than have get_current_cred() do it implicitly. But this is a "minimal semantic changes" change for the immediate problem. Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Paul E. McKenney <paulmck@linux.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Jan Glauber <jglauber@marvell.com> Cc: Jiri Kosina <jikos@kernel.org> Cc: Jayachandran Chandrasekharan Nair <jnair@marvell.com> Cc: Greg KH <greg@kroah.com> Cc: Kees Cook <keescook@chromium.org> Cc: David Howells <dhowells@redhat.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-24dma-mapping: check pfn validity in dma_common_{mmap,get_sgtable}Christoph Hellwig1-2/+11
Check that the pfn returned from arch_dma_coherent_to_pfn refers to a valid page and reject the mmap / get_sgtable requests otherwise. Based on the arm implementation of the mmap and get_sgtable methods. Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Vignesh Raghavendra <vigneshr@ti.com>
2019-07-23bpf: fix narrower loads on s390Ilya Leoshkevich1-2/+2
The very first check in test_pkt_md_access is failing on s390, which happens because loading a part of a struct __sk_buff field produces an incorrect result. The preprocessed code of the check is: { __u8 tmp = *((volatile __u8 *)&skb->len + ((sizeof(skb->len) - sizeof(__u8)) / sizeof(__u8))); if (tmp != ((*(volatile __u32 *)&skb->len) & 0xFF)) return 2; }; clang generates the following code for it: 0: 71 21 00 03 00 00 00 00 r2 = *(u8 *)(r1 + 3) 1: 61 31 00 00 00 00 00 00 r3 = *(u32 *)(r1 + 0) 2: 57 30 00 00 00 00 00 ff r3 &= 255 3: 5d 23 00 1d 00 00 00 00 if r2 != r3 goto +29 <LBB0_10> Finally, verifier transforms it to: 0: (61) r2 = *(u32 *)(r1 +104) 1: (bc) w2 = w2 2: (74) w2 >>= 24 3: (bc) w2 = w2 4: (54) w2 &= 255 5: (bc) w2 = w2 The problem is that when verifier emits the code to replace a partial load of a struct __sk_buff field (*(u8 *)(r1 + 3)) with a full load of struct sk_buff field (*(u32 *)(r1 + 104)), an optional shift and a bitwise AND, it assumes that the machine is little endian and incorrectly decides to use a shift. Adjust shift count calculation to account for endianness. Fixes: 31fd85816dbe ("bpf: permits narrower load from bpf program context fields") Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>