path: root/net/tls/tls_main.c
Age | Commit message | Author | Files | Lines
2018-08-22 | bpf, sockmap: fix sock hash count in alloc_sock_hash_elem | Daniel Borkmann | 1 | -1/+3
When we try to allocate a new sock hash entry and the allocation fails, then sock hash map fails to reduce the map element counter, meaning we keep accounting this element although it was never used. Fix it by dropping the element counter on error. Fixes: 81110384441a ("bpf: sockmap, add hash map support") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com>
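The accounting fix described in this message can be illustrated with a minimal user-space sketch. It is not the kernel code: `struct demo_map`, `demo_alloc_elem()` and the allocator hook are hypothetical names, and C11 atomics stand in for the kernel's atomic counter. The point is only that the counter reserved before the allocation is dropped again when the allocation fails.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the map; not the kernel's own structure. */
struct demo_map {
    atomic_int count;   /* number of accounted elements */
    int max_entries;    /* capacity limit */
};

/* Hypothetical allocator hook so that a failure can be simulated. */
typedef void *(*demo_alloc_fn)(size_t size);

/* Reserve a slot, then allocate; undo the reservation on failure. */
static void *demo_alloc_elem(struct demo_map *map, size_t size,
                             demo_alloc_fn alloc)
{
    assert(map != NULL && alloc != NULL);

    if (atomic_fetch_add(&map->count, 1) >= map->max_entries) {
        atomic_fetch_sub(&map->count, 1);
        return NULL;    /* map is full */
    }

    void *elem = alloc(size);
    if (elem == NULL) {
        /* The fix: the element was never used, so stop accounting it. */
        atomic_fetch_sub(&map->count, 1);
        return NULL;
    }
    return elem;
}

/* Allocator that always fails, used to exercise the error path. */
static void *demo_failing_alloc(size_t size)
{
    (void)size;
    return NULL;
}
```

Calling `demo_alloc_elem()` with `demo_failing_alloc` leaves the counter at its previous value, so later insertions are not rejected because of an element that never existed.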
2018-08-22 | bpf, sockmap: fix sock_hash_alloc and reject zero-sized keys | Daniel Borkmann | 1 | -1/+3
Currently, it is possible to create a sock hash map with key size of 0 and have the kernel return a fd back to user space. This is invalid for hash maps (and kernel also hasn't been tested for zero key size support in general at this point). Thus, reject such configuration. Fixes: 81110384441a ("bpf: sockmap, add hash map support") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Song Liu <songliubraving@fb.com>
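As a rough illustration of the kind of check this message describes, the sketch below validates creation attributes before anything is allocated. `struct demo_map_attr` and `demo_hash_check_attr()` are hypothetical names, and the exact set of conditions in the kernel may differ; only the zero key size rejection is taken from the text.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of the attributes user space passes at map creation. */
struct demo_map_attr {
    uint32_t key_size;
    uint32_t value_size;
    uint32_t max_entries;
};

/* Validate the attributes before any allocation happens. */
static int demo_hash_check_attr(const struct demo_map_attr *attr)
{
    assert(attr != NULL);

    /* A hash map needs a non-empty key to hash and compare. */
    if (attr->key_size == 0)
        return -EINVAL;
    /* Assumed additional sanity checks, not taken from the commit. */
    if (attr->value_size == 0 || attr->max_entries == 0)
        return -EINVAL;
    return 0;
}
```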
2018-08-21 | xsk: fix return value of xdp_umem_assign_dev() | Prashant Bhole | 1 | -2/+2
s/ENOTSUPP/EOPNOTSUPP/ in function xdp_umem_assign_dev(). This function's return value is returned directly by xsk_bind(), and EOPNOTSUPP is one of bind()'s possible return values. Fixes: f734607e819b ("xsk: refactor xdp_umem_assign_dev()") Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp> Acked-by: Song Liu <songliubraving@fb.com> Acked-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
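The distinction matters because ENOTSUPP is a kernel-internal code (524 in include/linux/errno.h) with no user-space name, while EOPNOTSUPP is a standard errno value. A minimal sketch, with `demo_assign_dev()` and its parameter as hypothetical stand-ins for the bind path:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Kernel-internal code from include/linux/errno.h; user space has no name for it. */
#define DEMO_KERNEL_ENOTSUPP 524

/* Hypothetical bind-path helper: report an unsupported device to user space. */
static int demo_assign_dev(bool dev_has_required_ops)
{
    /* The two codes are distinct values, so the choice is visible to callers. */
    assert(EOPNOTSUPP != DEMO_KERNEL_ENOTSUPP);

    if (!dev_has_required_ops)
        return -EOPNOTSUPP;   /* rather than -DEMO_KERNEL_ENOTSUPP */
    return 0;
}
```

A caller that checks `errno == EOPNOTSUPP` after a failed bind() can then recognise the condition, which it could not do portably with the internal value.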
2018-08-21 | act_ife: fix a potential deadlock | Cong Wang | 1 | -13/+21
use_all_metadata() acquires read_lock(&ife_mod_lock), then calls add_metainfo(), which calls find_ife_oplist(), which acquires the same lock again. Deadlock! Introduce __add_metainfo(), which accepts struct tcf_meta_ops *ops as an additional parameter and lets its callers decide how to find it. use_all_metadata() already has ops, so there is no need to find it again; it just calls __add_metainfo() directly. And, as ife_mod_lock is only needed for find_ife_oplist(), this means we can now make a non-atomic allocation for populate_metalist(). Fixes: 817e9f2c5c26 ("act_ife: acquire ife_mod_lock before reading ifeoplist") Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
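The refactoring pattern described here (a core helper that takes the already-found object, plus a thin wrapper that does the locked lookup) can be sketched in user space as follows. All `demo_*` names are hypothetical, and the toy lock asserts on re-acquisition instead of hanging, which makes the original bug observable without a real deadlock.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical ops table standing in for the registered metadata ops. */
struct demo_ops { int metaid; };
static struct demo_ops demo_oplist[] = { {1}, {2}, {3} };
#define DEMO_NOPS (sizeof(demo_oplist) / sizeof(demo_oplist[0]))

/* Toy non-recursive lock: taking it twice trips the assert. */
static int demo_lock_held;
static void demo_lock(void)   { assert(!demo_lock_held); demo_lock_held = 1; }
static void demo_unlock(void) { assert(demo_lock_held);  demo_lock_held = 0; }

static int demo_added[8];
static size_t demo_nadded;

/* Core helper: the caller supplies ops, so no lookup and no locking here. */
static int demo_add_metainfo_ops(const struct demo_ops *ops)
{
    if (demo_nadded >= sizeof(demo_added) / sizeof(demo_added[0]))
        return -1;
    demo_added[demo_nadded++] = ops->metaid;
    return 0;
}

/* Lookup path: take the lock only to find ops, then call the helper. */
static int demo_add_metainfo(int metaid)
{
    const struct demo_ops *ops = NULL;

    demo_lock();
    for (size_t i = 0; i < DEMO_NOPS; i++)
        if (demo_oplist[i].metaid == metaid)
            ops = &demo_oplist[i];
    demo_unlock();

    return ops ? demo_add_metainfo_ops(ops) : -1;
}

/* Iteration path: already holds the lock and already has ops in hand. */
static int demo_use_all_metadata(void)
{
    int rc = 0;

    demo_lock();
    for (size_t i = 0; i < DEMO_NOPS && rc == 0; i++)
        rc = demo_add_metainfo_ops(&demo_oplist[i]);
    demo_unlock();
    return rc;
}
```

Had `demo_use_all_metadata()` called `demo_add_metainfo()` instead, the second `demo_lock()` would trip the assert, which is the analogue of the reported deadlock.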
2018-08-21 | act_ife: move tcfa_lock down to where necessary | Cong Wang | 1 | -25/+13
The only time we need to take tcfa_lock is when adding a new metainfo to an existing ife->metalist. We don't need to take tcfa_lock so early and so broadly in tcf_ife_init(). This means we can always take ife_mod_lock first and avoid the reverse lock ordering warning reported by Vlad. Reported-by: Vlad Buslov <vladbu@mellanox.com> Tested-by: Vlad Buslov <vladbu@mellanox.com> Cc: Vlad Buslov <vladbu@mellanox.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-21 | Revert "net: sched: act_ife: disable bh when taking ife_mod_lock" | Cong Wang | 1 | -10/+10
This reverts commit 42c625a486f3 ("net: sched: act_ife: disable bh when taking ife_mod_lock"), because what ife_mod_lock protects is never touched in the rate est timer BH context, so there is no race. A better fix follows. Cc: Vlad Buslov <vladbu@mellanox.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-21 | net_sched: remove unused tcfa_capab | Cong Wang | 1 | -2/+0
Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-21 | net_sched: remove list_head from tc_action | Cong Wang | 14 | -60/+43
After commit 90b73b77d08e, list_head is no longer needed. Now we just need to convert the list iteration to array iteration for drivers. Fixes: 90b73b77d08e ("net: sched: change action API to use array of pointers to actions") Cc: Jiri Pirko <jiri@mellanox.com> Cc: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-21 | net_sched: remove unused tcf_idr_check() | Cong Wang | 2 | -21/+3
tcf_idr_check() is replaced by tcf_idr_check_alloc(), and __tcf_idr_check() now can be folded into tcf_idr_search(). Fixes: 0190c1d452a9 ("net: sched: atomically check-allocate action") Cc: Jiri Pirko <jiri@mellanox.com> Cc: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-21 | net_sched: remove unused parameter for tcf_action_delete() | Cong Wang | 1 | -3/+2
Fixes: 16af6067392c ("net: sched: implement reference counted action release") Cc: Jiri Pirko <jiri@mellanox.com> Cc: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-21 | net_sched: remove unnecessary ops->delete() | Cong Wang | 18 | -146/+7
All ops->delete() wants is getting the tn->idrinfo, but we already have tc_action before calling ops->delete(), and tc_action has a pointer ->idrinfo. More importantly, each type of action does the same thing, that is, just calling tcf_idr_delete_index(). So it can be just removed. Fixes: b409074e6693 ("net: sched: add 'delete' function to action ops") Cc: Jiri Pirko <jiri@mellanox.com> Cc: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-21 | net_sched: improve and refactor tcf_action_put_many() | Cong Wang | 1 | -16/+15
tcf_action_put_many() is mostly called to clean up actions on the failure path, but tcf_action_put_many(&actions[acts_deleted]) is used in the ugliest way: it passes a slice of the array and uses an additional NULL at the end to avoid out-of-bounds access. acts_deleted is completely unnecessary, since we can teach tcf_action_put_many() to scan the whole array and check for NULL pointers. This also means tcf_action_delete() should set deleted action pointers to NULL to avoid a double free. Fixes: 90b73b77d08e ("net: sched: change action API to use array of pointers to actions") Cc: Jiri Pirko <jiri@mellanox.com> Cc: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
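The cleanup scheme from this message can be sketched as below: the delete loop clears each slot it has released, and the put loop scans the whole fixed-size array and skips NULL slots. This is an illustration under assumptions, not the kernel functions: `DEMO_MAX_ACTIONS`, `struct demo_action` and the `fail_at` parameter are hypothetical, and a plain integer replaces the real reference counting.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_ACTIONS 4   /* hypothetical fixed array size */

struct demo_action { int refcnt; };

/* Drop one reference on every non-NULL entry; NULL slots are skipped. */
static void demo_put_many(struct demo_action *actions[DEMO_MAX_ACTIONS])
{
    for (size_t i = 0; i < DEMO_MAX_ACTIONS; i++) {
        if (actions[i] == NULL)
            continue;
        assert(actions[i]->refcnt > 0);   /* would catch a double put */
        actions[i]->refcnt--;
    }
}

/* Delete entries in order; fail_at simulates an error at that index (-1: none). */
static int demo_delete(struct demo_action *actions[DEMO_MAX_ACTIONS], int fail_at)
{
    for (int i = 0; i < DEMO_MAX_ACTIONS && actions[i] != NULL; i++) {
        if (i == fail_at)
            return -1;
        actions[i]->refcnt--;
        actions[i] = NULL;   /* a later demo_put_many() cannot touch it again */
    }
    return 0;
}
```

On a failure partway through, the caller can pass the same array to `demo_put_many()` without tracking how many entries were already deleted.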
2018-08-21 | hv_netvsc: ignore devices that are not PCI | Stephen Hemminger | 1 | -0/+5
Registering another device with same MAC address (such as TAP, VPN or DPDK KNI) will confuse the VF autobinding logic. Restrict the search to only run if the device is known to be a PCI attached VF. Fixes: e8ff40d4bff1 ("hv_netvsc: improve VF device matching") Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-21 | rds: tcp: remove duplicated include from tcp.c | Yue Haibing | 1 | -1/+0
Remove duplicated include. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-21 | sch_cake: Remove unused including <linux/version.h> | Yue Haibing | 1 | -1/+0
Remove the include of <linux/version.h>, which is not needed. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-21 | Bluetooth: mediatek: Fix memory leak | Gustavo A. R. Silva | 1 | -3/+5
In case memory resources for *fw* were allocated, release them before return. Addresses-Coverity-ID: 1472611 ("Resource leak") Fixes: 7237c4c9ec92 ("Bluetooth: mediatek: Add protocol support for MediaTek serial devices") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Acked-by: Sean Wang <sean.wang@mediatek.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2018-08-21 | Bluetooth: Make BT_HCIUART_RTL configuration option depend on ACPI | Hans de Goede | 1 | -0/+1
At the moment we only support ACPI enumeration for serial port attached RTL bluetooth controllers. This commit adds a dependency on ACPI to the BT_HCIUART_RTL configuration option, fixing the following warning when ACPI is not enabled: drivers/bluetooth/hci_h5.c:920:22: warning: 'rtl_vnd' defined but not used Cc: Arnd Bergmann <arnd@arndb.de> Reported-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Hans de Goede <hdegoede@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2018-08-20 | rhashtable: remove duplicated include from rhashtable.c | Yue Haibing | 1 | -1/+0
Remove duplicated include. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-20 | net/ipv6: Put lwtstate when destroying fib6_info | David Ahern | 1 | -0/+2
Prior to the introduction of fib6_info lwtstate was managed by the dst code. With fib6_info releasing lwtstate needs to be done when the struct is freed. Fixes: 93531c674315 ("net/ipv6: separate handling of FIB entries from dst based routes") Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-20 | r8152: disable RX aggregation on new Dell TB16 dock | Kai-Heng Feng | 1 | -2/+2
There's a new Dell TB16 dock with a different iSerialNumber. Apply the same fix from commit 0b1655143df0 ("r8152: disable RX aggregation on Dell TB16 dock") to this model. BugLink: https://bugs.launchpad.net/bugs/1785780 Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-20 | qed: Avoid sending mailbox commands when MFW is not responsive | Tomer Tayar | 3 | -3/+56
Continuing to send mailbox commands to the MFW when it is not responsive results in a redundant series of timeout expiries. This patch prints the MCP status on the first command that gets no response, and blocks the following commands. Since the (un)load request commands might go unanswered due to other PFs, the patch also adds the option to skip the blocking upon a failure. Signed-off-by: Tomer Tayar <Tomer.Tayar@cavium.com> Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-20 | qed: Prevent a possible deadlock during driver load and unload | Tomer Tayar | 2 | -20/+44
The MFW manages an internal lock to prevent concurrent hardware (de)initialization of different PFs. This, together with the busy-waiting for the MFW's responses for commands, might lead to a deadlock during concurrent load or unload of PFs. This patch adds the option to sleep within the busy-waiting, and uses it for the (un)load requests (which are not sent from an interrupt context) to prevent the possible deadlock. Signed-off-by: Tomer Tayar <Tomer.Tayar@cavium.com> Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-20 | qed: Wait for MCP halt and resume commands to take place | Tomer Tayar | 2 | -8/+39
Successive iterations of halting and resuming the management chip (MCP) might fail, since currently the driver doesn't wait for these operations to actually take place. This patch prevents the driver from moving forward before the operations are reflected in the state register. Signed-off-by: Tomer Tayar <Tomer.Tayar@cavium.com> Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
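Waiting for an operation to be reflected in a state register is commonly done with a bounded polling loop. The sketch below shows that general technique only; it is not the driver's code. The register accessor callback, the `DEMO_STATE_HALTED` bit, the poll limit and the choice of -EBUSY as the timeout result are all assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register accessor; a driver would read device memory here. */
typedef uint32_t (*demo_read_reg_fn)(void *ctx);

/* Poll until (reg & mask) == expected, or give up after max_polls reads. */
static int demo_wait_for_state(demo_read_reg_fn read_reg, void *ctx,
                               uint32_t mask, uint32_t expected,
                               unsigned int max_polls)
{
    assert(read_reg != NULL);

    for (unsigned int i = 0; i < max_polls; i++) {
        if ((read_reg(ctx) & mask) == expected)
            return 0;
        /* A real driver would delay or sleep between polls here. */
    }
    return -EBUSY;
}

/* Fake device for the example: the halt bit shows up after a few reads. */
#define DEMO_STATE_HALTED 0x1u
struct demo_fake_mcp { unsigned int reads_until_halted; };

static uint32_t demo_fake_read_state(void *ctx)
{
    struct demo_fake_mcp *mcp = ctx;

    if (mcp->reads_until_halted > 0) {
        mcp->reads_until_halted--;
        return 0;
    }
    return DEMO_STATE_HALTED;
}
```

The caller issues the halt or resume command first and only proceeds once `demo_wait_for_state()` returns 0.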
2018-08-20 | qed: Wait for ready indication before rereading the shmem | Tomer Tayar | 1 | -9/+41
The MFW might be reset and re-update its shared memory. Upon the detection of such a reset the driver rereads this memory, but it has to wait till the data is valid. This patch adds the missing wait for a data ready indication. Signed-off-by: Tomer Tayar <Tomer.Tayar@cavium.com> Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-19 | ip6_vti: fix a null pointer deference when destroy vti6 tunnel | Haishuang Yan | 1 | -1/+2
If load ip6_vti module and create a network namespace when set fb_tunnels_only_for_init_net to 1, then exit the namespace will cause following crash: [ 6601.677036] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 [ 6601.679057] PGD 8000000425eca067 P4D 8000000425eca067 PUD 424292067 PMD 0 [ 6601.680483] Oops: 0000 [#1] SMP PTI [ 6601.681223] CPU: 7 PID: 93 Comm: kworker/u16:1 Kdump: loaded Tainted: G E 4.18.0+ #3 [ 6601.683153] Hardware name: Fedora Project OpenStack Nova, BIOS seabios-1.7.5-11.el7 04/01/2014 [ 6601.684919] Workqueue: netns cleanup_net [ 6601.685742] RIP: 0010:vti6_exit_batch_net+0x87/0xd0 [ip6_vti] [ 6601.686932] Code: 7b 08 48 89 e6 e8 b9 ea d3 dd 48 8b 1b 48 85 db 75 ec 48 83 c5 08 48 81 fd 00 01 00 00 75 d5 49 8b 84 24 08 01 00 00 48 89 e6 <48> 8b 78 08 e8 90 ea d3 dd 49 8b 45 28 49 39 c6 4c 8d 68 d8 75 a1 [ 6601.690735] RSP: 0018:ffffa897c2737de0 EFLAGS: 00010246 [ 6601.691846] RAX: 0000000000000000 RBX: 0000000000000000 RCX: dead000000000200 [ 6601.693324] RDX: 0000000000000015 RSI: ffffa897c2737de0 RDI: ffffffff9f2ea9e0 [ 6601.694824] RBP: 0000000000000100 R08: 0000000000000000 R09: 0000000000000000 [ 6601.696314] R10: 0000000000000001 R11: 0000000000000000 R12: ffff8dc323c07e00 [ 6601.697812] R13: ffff8dc324a63100 R14: ffffa897c2737e30 R15: ffffa897c2737e30 [ 6601.699345] FS: 0000000000000000(0000) GS:ffff8dc33fdc0000(0000) knlGS:0000000000000000 [ 6601.701068] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 6601.702282] CR2: 0000000000000008 CR3: 0000000424966002 CR4: 00000000001606e0 [ 6601.703791] Call Trace: [ 6601.704329] cleanup_net+0x1b4/0x2c0 [ 6601.705268] process_one_work+0x16c/0x370 [ 6601.706145] worker_thread+0x49/0x3e0 [ 6601.706942] kthread+0xf8/0x130 [ 6601.707626] ? rescuer_thread+0x340/0x340 [ 6601.708476] ? 
kthread_bind+0x10/0x10 [ 6601.709266] ret_from_fork+0x35/0x40 Reproduce: modprobe ip6_vti echo 1 > /proc/sys/net/core/fb_tunnels_only_for_init_net unshare -n exit This because ip6n->tnls_wc[0] point to fallback device in default, but in non-default namespace, ip6n->tnls_wc[0] will be NULL, so add the NULL check comparatively. Fixes: e2948e5af8ee ("ip6_vti: fix creating fallback tunnel device for vti6") Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-19 | ip6_vti: fix creating fallback tunnel device for vti6 | Haishuang Yan | 1 | -0/+2
When fb_tunnels_only_for_init_net is set to 1, don't create the fallback tunnel device for vti6 when a new namespace is created. Tested: [root@builder2 ~]# modprobe ip6_tunnel [root@builder2 ~]# modprobe ip6_vti [root@builder2 ~]# echo 1 > /proc/sys/net/core/fb_tunnels_only_for_init_net [root@builder2 ~]# unshare -n [root@builder2 ~]# ip link 1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-19 | ip_vti: fix a null pointer deferrence when create vti fallback tunnel | Haishuang Yan | 1 | -1/+2
After set fb_tunnels_only_for_init_net to 1, the itn->fb_tunnel_dev will be NULL and will cause following crash: [ 2742.849298] BUG: unable to handle kernel NULL pointer dereference at 0000000000000941 [ 2742.851380] PGD 800000042c21a067 P4D 800000042c21a067 PUD 42aaed067 PMD 0 [ 2742.852818] Oops: 0002 [#1] SMP PTI [ 2742.853570] CPU: 7 PID: 2484 Comm: unshare Kdump: loaded Not tainted 4.18.0-rc8+ #2 [ 2742.855163] Hardware name: Fedora Project OpenStack Nova, BIOS seabios-1.7.5-11.el7 04/01/2014 [ 2742.856970] RIP: 0010:vti_init_net+0x3a/0x50 [ip_vti] [ 2742.858034] Code: 90 83 c0 48 c7 c2 20 a1 83 c0 48 89 fb e8 6e 3b f6 ff 85 c0 75 22 8b 0d f4 19 00 00 48 8b 93 00 14 00 00 48 8b 14 ca 48 8b 12 <c6> 82 41 09 00 00 04 c6 82 38 09 00 00 45 5b c3 66 0f 1f 44 00 00 [ 2742.861940] RSP: 0018:ffff9be28207fde0 EFLAGS: 00010246 [ 2742.863044] RAX: 0000000000000000 RBX: ffff8a71ebed4980 RCX: 0000000000000013 [ 2742.864540] RDX: 0000000000000000 RSI: 0000000000000013 RDI: ffff8a71ebed4980 [ 2742.866020] RBP: ffff8a71ea717000 R08: ffffffffc083903c R09: ffff8a71ea717000 [ 2742.867505] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8a71ebed4980 [ 2742.868987] R13: 0000000000000013 R14: ffff8a71ea5b49c0 R15: 0000000000000000 [ 2742.870473] FS: 00007f02266c9740(0000) GS:ffff8a71ffdc0000(0000) knlGS:0000000000000000 [ 2742.872143] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 2742.873340] CR2: 0000000000000941 CR3: 000000042bc20006 CR4: 00000000001606e0 [ 2742.874821] Call Trace: [ 2742.875358] ops_init+0x38/0xf0 [ 2742.876078] setup_net+0xd9/0x1f0 [ 2742.876789] copy_net_ns+0xb7/0x130 [ 2742.877538] create_new_namespaces+0x11a/0x1d0 [ 2742.878525] unshare_nsproxy_namespaces+0x55/0xa0 [ 2742.879526] ksys_unshare+0x1a7/0x330 [ 2742.880313] __x64_sys_unshare+0xe/0x20 [ 2742.881131] do_syscall_64+0x5b/0x180 [ 2742.881933] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Reproduce: echo 1 > /proc/sys/net/core/fb_tunnels_only_for_init_net modprobe ip_vti unshare -n Fixes: 
79134e6ce2c9 ("net: do not create fallback tunnels for non-default namespaces") Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-19 | r8169: don't use MSI-X on RTL8106e | Jian-Hong Pan | 1 | -3/+6
Found the ethernet network on ASUS X441UAR doesn't come back on resume from suspend when using MSI-X. The chip is RTL8106e - version 39. [ 21.848357] libphy: r8169: probed [ 21.848473] r8169 0000:02:00.0 eth0: RTL8106e, 0c:9d:92:32:67:b4, XID 44900000, IRQ 127 [ 22.518860] r8169 0000:02:00.0 enp2s0: renamed from eth0 [ 29.458041] Generic PHY r8169-200:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-200:00, irq=IGNORE) [ 63.227398] r8169 0000:02:00.0 enp2s0: Link is Up - 100Mbps/Full - flow control off [ 124.514648] Generic PHY r8169-200:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-200:00, irq=IGNORE) Here is the ethernet controller in detail: 02:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8101/2/6E PCI Express Fast/Gigabit Ethernet controller [10ec:8136] (rev 07) Subsystem: ASUSTeK Computer Inc. RTL810xE PCI Express Fast Ethernet controller [1043:200f] Flags: bus master, fast devsel, latency 0, IRQ 16 I/O ports at e000 [size=256] Memory at ef100000 (64-bit, non-prefetchable) [size=4K] Memory at e0000000 (64-bit, prefetchable) [size=16K] Capabilities: <access denied> Kernel driver in use: r8169 Kernel modules: r8169 Falling back to MSI fixes the issue. Fixes: 6c6aa15fdea5 ("r8169: improve interrupt handling") Signed-off-by: Jian-Hong Pan <jian-hong@endlessm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-19 | net: lan743x_ptp: convert to ktime_get_clocktai_ts64 | Arnd Bergmann | 1 | -2/+1
timekeeping_clocktai64() has been renamed to ktime_get_clocktai_ts64() for consistency with the other ktime_get_* access functions. Rename the new caller that has come up as well. Question: this is the only ptp driver that sets the hardware time to the current system time in TAI. Why does it do that? Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-19 | net: sched: always disable bh when taking tcf_lock | Vlad Buslov | 7 | -44/+47
Recently, ops->init() and ops->dump() of all actions were modified to always obtain tcf_lock when accessing private action state. Actions that don't depend on tcf_lock for synchronization with their data path use non-bh locking API. However, tcf_lock is also used to protect rate estimator stats in softirq context by timer callback. Change ops->init() and ops->dump() of all actions to disable bh when using tcf_lock to prevent deadlock reported by following lockdep warning: [ 105.470398] ================================ [ 105.475014] WARNING: inconsistent lock state [ 105.479628] 4.18.0-rc8+ #664 Not tainted [ 105.483897] -------------------------------- [ 105.488511] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage. [ 105.494871] swapper/16/0 [HC0[0]:SC1[1]:HE1:SE0] takes: [ 105.500449] 00000000f86c012e (&(&p->tcfa_lock)->rlock){+.?.}, at: est_fetch_counters+0x3c/0xa0 [ 105.509696] {SOFTIRQ-ON-W} state was registered at: [ 105.514925] _raw_spin_lock+0x2c/0x40 [ 105.519022] tcf_bpf_init+0x579/0x820 [act_bpf] [ 105.523990] tcf_action_init_1+0x4e4/0x660 [ 105.528518] tcf_action_init+0x1ce/0x2d0 [ 105.532880] tcf_exts_validate+0x1d8/0x200 [ 105.537416] fl_change+0x55a/0x268b [cls_flower] [ 105.542469] tc_new_tfilter+0x748/0xa20 [ 105.546738] rtnetlink_rcv_msg+0x56a/0x6d0 [ 105.551268] netlink_rcv_skb+0x18d/0x200 [ 105.555628] netlink_unicast+0x2d0/0x370 [ 105.559990] netlink_sendmsg+0x3b9/0x6a0 [ 105.564349] sock_sendmsg+0x6b/0x80 [ 105.568271] ___sys_sendmsg+0x4a1/0x520 [ 105.572547] __sys_sendmsg+0xd7/0x150 [ 105.576655] do_syscall_64+0x72/0x2c0 [ 105.580757] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 105.586243] irq event stamp: 489296 [ 105.590084] hardirqs last enabled at (489296): [<ffffffffb507e639>] _raw_spin_unlock_irq+0x29/0x40 [ 105.599765] hardirqs last disabled at (489295): [<ffffffffb507e745>] _raw_spin_lock_irq+0x15/0x50 [ 105.609277] softirqs last enabled at (489292): [<ffffffffb413a6a3>] irq_enter+0x83/0xa0 [ 105.618001] softirqs last disabled at 
(489293): [<ffffffffb413a800>] irq_exit+0x140/0x190 [ 105.626813] other info that might help us debug this: [ 105.633976] Possible unsafe locking scenario: [ 105.640526] CPU0 [ 105.643325] ---- [ 105.646125] lock(&(&p->tcfa_lock)->rlock); [ 105.650747] <Interrupt> [ 105.653717] lock(&(&p->tcfa_lock)->rlock); [ 105.658514] *** DEADLOCK *** [ 105.665349] 1 lock held by swapper/16/0: [ 105.669629] #0: 00000000a640ad99 ((&est->timer)){+.-.}, at: call_timer_fn+0x10b/0x550 [ 105.678200] stack backtrace: [ 105.683194] CPU: 16 PID: 0 Comm: swapper/16 Not tainted 4.18.0-rc8+ #664 [ 105.690249] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017 [ 105.698626] Call Trace: [ 105.701421] <IRQ> [ 105.703791] dump_stack+0x92/0xeb [ 105.707461] print_usage_bug+0x336/0x34c [ 105.711744] mark_lock+0x7c9/0x980 [ 105.715500] ? print_shortest_lock_dependencies+0x2e0/0x2e0 [ 105.721424] ? check_usage_forwards+0x230/0x230 [ 105.726315] __lock_acquire+0x923/0x26f0 [ 105.730597] ? debug_show_all_locks+0x240/0x240 [ 105.735478] ? mark_lock+0x493/0x980 [ 105.739412] ? check_chain_key+0x140/0x1f0 [ 105.743861] ? __lock_acquire+0x836/0x26f0 [ 105.748323] ? lock_acquire+0x12e/0x290 [ 105.752516] lock_acquire+0x12e/0x290 [ 105.756539] ? est_fetch_counters+0x3c/0xa0 [ 105.761084] _raw_spin_lock+0x2c/0x40 [ 105.765099] ? est_fetch_counters+0x3c/0xa0 [ 105.769633] est_fetch_counters+0x3c/0xa0 [ 105.773995] est_timer+0x87/0x390 [ 105.777670] ? est_fetch_counters+0xa0/0xa0 [ 105.782210] ? lock_acquire+0x12e/0x290 [ 105.786410] call_timer_fn+0x161/0x550 [ 105.790512] ? est_fetch_counters+0xa0/0xa0 [ 105.795055] ? del_timer_sync+0xd0/0xd0 [ 105.799249] ? __lock_is_held+0x93/0x110 [ 105.803531] ? mark_held_locks+0x20/0xe0 [ 105.807813] ? _raw_spin_unlock_irq+0x29/0x40 [ 105.812525] ? est_fetch_counters+0xa0/0xa0 [ 105.817069] ? est_fetch_counters+0xa0/0xa0 [ 105.821610] run_timer_softirq+0x3c4/0x9f0 [ 105.826064] ? lock_acquire+0x12e/0x290 [ 105.830257] ? 
__bpf_trace_timer_class+0x10/0x10 [ 105.835237] ? __lock_is_held+0x25/0x110 [ 105.839517] __do_softirq+0x11d/0x7bf [ 105.843542] irq_exit+0x140/0x190 [ 105.847208] smp_apic_timer_interrupt+0xac/0x3b0 [ 105.852182] apic_timer_interrupt+0xf/0x20 [ 105.856628] </IRQ> [ 105.859081] RIP: 0010:cpuidle_enter_state+0xd8/0x4d0 [ 105.864395] Code: 46 ff 48 89 44 24 08 0f 1f 44 00 00 31 ff e8 cf ec 46 ff 80 7c 24 07 00 0f 85 1d 02 00 00 e8 9f 90 4b ff fb 66 0f 1f 44 00 00 <4c> 8b 6c 24 08 4d 29 fd 0f 80 36 03 00 00 4c 89 e8 48 ba cf f7 53 [ 105.884288] RSP: 0018:ffff8803ad94fd20 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13 [ 105.892494] RAX: 0000000000000000 RBX: ffffe8fb300829c0 RCX: ffffffffb41e19e1 [ 105.899988] RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffff8803ad9358ac [ 105.907503] RBP: ffffffffb6636300 R08: 0000000000000004 R09: 0000000000000000 [ 105.914997] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000004 [ 105.922487] R13: ffffffffb6636140 R14: ffffffffb66362d8 R15: 000000188d36091b [ 105.929988] ? trace_hardirqs_on_caller+0x141/0x2d0 [ 105.935232] do_idle+0x28e/0x320 [ 105.938817] ? arch_cpu_idle_exit+0x40/0x40 [ 105.943361] ? mark_lock+0x8c1/0x980 [ 105.947295] ? _raw_spin_unlock_irqrestore+0x32/0x60 [ 105.952619] cpu_startup_entry+0xc2/0xd0 [ 105.956900] ? cpu_in_idle+0x20/0x20 [ 105.960830] ? _raw_spin_unlock_irqrestore+0x32/0x60 [ 105.966146] ? trace_hardirqs_on_caller+0x141/0x2d0 [ 105.971391] start_secondary+0x2b5/0x360 [ 105.975669] ? 
set_cpu_sibling_map+0x1330/0x1330 [ 105.980654] secondary_startup_64+0xa5/0xb0 Taking tcf_lock in sample action with bh disabled causes lockdep to issue a warning regarding possible irq lock inversion dependency between tcf_lock, and psample_groups_lock that is taken when holding tcf_lock in sample init: [ 162.108959] Possible interrupt unsafe locking scenario: [ 162.116386] CPU0 CPU1 [ 162.121277] ---- ---- [ 162.126162] lock(psample_groups_lock); [ 162.130447] local_irq_disable(); [ 162.136772] lock(&(&p->tcfa_lock)->rlock); [ 162.143957] lock(psample_groups_lock); [ 162.150813] <Interrupt> [ 162.153808] lock(&(&p->tcfa_lock)->rlock); [ 162.158608] *** DEADLOCK *** In order to prevent potential lock inversion dependency between tcf_lock and psample_groups_lock, extract call to psample_group_get() from tcf_lock protected section in sample action init function. Fixes: 4e232818bd32 ("net: sched: act_mirred: remove dependency on rtnl lock") Fixes: 764e9a24480f ("net: sched: act_vlan: remove dependency on rtnl lock") Fixes: 729e01260989 ("net: sched: act_tunnel_key: remove dependency on rtnl lock") Fixes: d77284956656 ("net: sched: act_sample: remove dependency on rtnl lock") Fixes: e8917f437006 ("net: sched: act_gact: remove dependency on rtnl lock") Fixes: b6a2b971c0b0 ("net: sched: act_csum: remove dependency on rtnl lock") Fixes: 2142236b4584 ("net: sched: act_bpf: remove dependency on rtnl lock") Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-18 | ip6_vti: simplify stats handling in vti6_xmit | Haishuang Yan | 1 | -11/+3
Same as ip_vti, use iptunnel_xmit_stats to update stats in the tunnel xmit code path. Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-18 | pcmcia: remove long deprecated pcmcia_request_exclusive_irq() function | Linus Torvalds | 3 | -49/+0
This function was created as a deprecated fallback case back in 2010 by commit eb14120f743d ("pcmcia: re-work pcmcia_request_irq()") for legacy cases. Actual in-kernel users haven't been around for a long while. The last in-kernel user was apparently removed four years ago by commit 5f5316fcd08e ("am2150: Update nmclan_cs.c to use update PCMCIA API"). Just remove it entirely. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-18 | deprecate the '__deprecated' attribute warnings entirely and for good | Linus Torvalds | 3 | -28/+2
We haven't had lots of deprecation warnings lately, but the rdma use of it made them flare up again. They are not useful. They annoy everybody, and nobody ever does anything about them, because it's always "somebody elses problem". And when people start thinking that warnings are normal, they stop looking at them, and the real warnings that mean something go unnoticed. If you want to get rid of a function, just get rid of it. Convert every user to the new world order. And if you can't do that, then don't annoy everybody else with your marking that says "I couldn't be bothered to fix this, so I'll just spam everybody elses build logs with warnings about my laziness". Make a kernelnewbies wiki page about things that could be cleaned up, write a blog post about it, or talk to people on the mailing lists. But don't add warnings to the kernel build about cleanup that you think should happen but you aren't doing yourself. Don't. Just don't. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 | mm/hmm.c: remove unused variables align_start and align_end | Colin Ian King | 1 | -4/+1
Variables align_start and align_end are being assigned but are never used hence they are redundant and can be removed. Cleans up clang warnings: warning: variable 'align_start' set but not used [-Wunused-but-set-variable] warning: variable 'align_size' set but not used [-Wunused-but-set-variable] Link: http://lkml.kernel.org/r/20180714161124.3923-1-colin.king@canonical.com Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 | fs/userfaultfd.c: remove redundant pointer uwq | Colin Ian King | 1 | -3/+0
Pointer uwq is being assigned but is never used hence it is redundant and can be removed. Cleans up clang warning: warning: variable 'uwq' set but not used [-Wunused-but-set-variable] Link: http://lkml.kernel.org/r/20180717090802.18357-1-colin.king@canonical.com Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 | mm, vmacache: hash addresses based on pmd | David Rientjes | 2 | -15/+29
When perf profiling a wide variety of different workloads, it was found that vmacache_find() had higher than expected cost: up to 0.08% of cpu utilization in some cases. This was found to rival other core VM functions such as alloc_pages_vma() with thp enabled and default mempolicy, and the conditionals in __get_vma_policy(). VMACACHE_HASH() determines which of the four per-task_struct slots a vma is cached for a particular address. This currently depends on the pfn, so pfn 5212 occupies a different vmacache slot than its neighboring pfn 5213. vmacache_find() iterates through all four of current's vmacache slots when looking up an address. Hashing based on pfn, an address has ~1/VMACACHE_SIZE chance of being cached in the first vmacache slot, or about 25%, *if* the vma is cached. This patch hashes an address by its pmd instead of pte to optimize for workloads with good spatial locality. This results in a higher probability of vmas being cached in the first slot that is checked: normally ~70% on the same workloads instead of 25%. [rientjes@google.com: various updates] Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1807231532290.109445@chino.kir.corp.google.com Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1807091749150.114630@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
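The effect of changing the hash granularity can be checked with a small user-space sketch. The shift values are assumptions (4 KiB pages and 2 MiB per pmd entry, as on x86-64), and the `demo_*` names are hypothetical; only the four-slot cache and the pfn 5212/5213 example come from the message.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT     12   /* assumed: 4 KiB pages */
#define DEMO_PMD_SHIFT      21   /* assumed: 2 MiB covered per pmd entry */
#define DEMO_VMACACHE_SIZE  4u   /* four per-task slots, as in the message */
#define DEMO_VMACACHE_MASK  (DEMO_VMACACHE_SIZE - 1)

/* Convert a page frame number to the address of its first byte. */
static uint64_t demo_addr_of_pfn(uint64_t pfn)
{
    assert(pfn <= (UINT64_MAX >> DEMO_PAGE_SHIFT));
    return pfn << DEMO_PAGE_SHIFT;
}

/* Old scheme: neighbouring pages land in different slots. */
static unsigned int demo_hash_by_pfn(uint64_t addr)
{
    return (unsigned int)((addr >> DEMO_PAGE_SHIFT) & DEMO_VMACACHE_MASK);
}

/* New scheme: all pages under the same pmd entry share one slot. */
static unsigned int demo_hash_by_pmd(uint64_t addr)
{
    return (unsigned int)((addr >> DEMO_PMD_SHIFT) & DEMO_VMACACHE_MASK);
}
```

With these assumed shifts, pfn 5212 and pfn 5213 hash to slots 0 and 1 by pfn, but both hash to slot 2 by pmd, so a lookup for the neighbouring page hits the same slot.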
2018-08-17 | mm/list_lru: introduce list_lru_shrink_walk_irq() | Sebastian Andrzej Siewior | 3 | -6/+42
Provide list_lru_shrink_walk_irq() and let it behave like list_lru_walk_one() except that it locks the spinlock with spin_lock_irq(). This is used by scan_shadow_nodes() because its lock nests within the i_pages lock, which is acquired with IRQs disabled. This change allows the use of proper locking primitives instead of a hand-crafted local_irq_disable() plus spin_lock(). There is no EXPORT_SYMBOL provided because the current user is in-kernel only. Add list_lru_shrink_walk_irq() which acquires the spinlock with the proper locking primitives. Link: http://lkml.kernel.org/r/20180716111921.5365-5-bigeasy@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/list_lru.c: pass struct list_lru_node* as an argument to __list_lru_walk_one()Sebastian Andrzej Siewior1-6/+6
__list_lru_walk_one() is invoked with struct list_lru *lru, int nid as the first two arguments. Those two are only used to retrieve struct list_lru_node. Since this is already done by the caller of the function for the locking, we can pass struct list_lru_node* directly and avoid the dance around it. Link: http://lkml.kernel.org/r/20180716111921.5365-4-bigeasy@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/list_lru.c: move locking from __list_lru_walk_one() to its callerSebastian Andrzej Siewior1-5/+13
Move the locking inside __list_lru_walk_one() to its caller. This is a preparation step in order to introduce list_lru_walk_one_irq() which does spin_lock_irq() instead of spin_lock() for the locking. Link: http://lkml.kernel.org/r/20180716111921.5365-3-bigeasy@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
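The refactoring pattern described here (a core walk that does no locking, with thin wrappers choosing the lock variant) can be sketched in userspace as follows. All names (toy_lock, toy_node, toy_walk_locked(), toy_walk(), toy_walk_irq()) are hypothetical stand-ins, and the "lock" only records state so the flow can be checked; it is not a real spinlock and not the list_lru code.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a spinlock: it only records state. */
struct toy_lock {
	bool held;
	bool irqs_off;
};

static void toy_spin_lock(struct toy_lock *l)
{
	assert(!l->held);
	l->held = true;
}

static void toy_spin_unlock(struct toy_lock *l)
{
	assert(l->held);
	l->held = false;
}

static void toy_spin_lock_irq(struct toy_lock *l)
{
	l->irqs_off = true;	/* models disabling interrupts first */
	toy_spin_lock(l);
}

static void toy_spin_unlock_irq(struct toy_lock *l)
{
	toy_spin_unlock(l);
	l->irqs_off = false;
}

struct toy_node {
	struct toy_lock lock;
	int nr_items;
	bool walked_with_irqs_off;	/* recorded by the core walk */
};

/* Core walk: takes no lock itself and relies on the caller holding it. */
static int toy_walk_locked(struct toy_node *n)
{
	assert(n->lock.held);
	n->walked_with_irqs_off = n->lock.irqs_off;
	return n->nr_items;
}

/* Wrapper using the plain lock. */
static int toy_walk(struct toy_node *n)
{
	int ret;

	toy_spin_lock(&n->lock);
	ret = toy_walk_locked(n);
	toy_spin_unlock(&n->lock);
	return ret;
}

/* Wrapper using the irq-disabling lock; the core walk is unchanged. */
static int toy_walk_irq(struct toy_node *n)
{
	int ret;

	toy_spin_lock_irq(&n->lock);
	ret = toy_walk_locked(n);
	toy_spin_unlock_irq(&n->lock);
	return ret;
}
```

Because the core function no longer decides how the lock is taken, adding an _irq variant only requires a second wrapper, which is the motivation given for this preparation step.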
2018-08-17mm/list_lru.c: use list_lru_walk_one() in list_lru_walk_node()Sebastian Andrzej Siewior1-2/+2
Patch series "mm/list_lru: Add list_lru_shrink_walk_irq() and a user". This series removes the local_irq_disable() around list_lru_shrink_walk() (as used by mm/workingset) by adding list_lru_shrink_walk_irq(). Vladimir Davydov preferred this over the `irq' argument which I added to struct list_lru. The initial post (of this series) received a Reviewed-by tag from Vladimir Davydov, which I added to each patch of the series. The series applies on top of akpm's tree, which has Kirill's shrink_slab series, and does not clash with it (akpm asked me to wait a week or so and repost it then). I tested the code paths by triggering the OOM-killer via memory overcommit and lockdep did not complain (nor did I see any warnings). This patch (of 4): list_lru_walk_node() invokes __list_lru_walk_one() with -1 as the memcg_idx parameter. The same can be achieved by calling list_lru_walk_one() and passing NULL as the memcg argument, which then gets converted into -1. This is a preparation step for lifting the spin_lock() call to the caller of __list_lru_walk_one(). Invoke list_lru_walk_one() instead of __list_lru_walk_one() when possible. Link: http://lkml.kernel.org/r/20180716111921.5365-2-bigeasy@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm, swap: make CONFIG_THP_SWAP depend on CONFIG_SWAPHuang Ying1-2/+3
CONFIG_THP_SWAP should depend on CONFIG_SWAP, because it's unreasonable to optimize swapping for THP (Transparent Huge Page) without basic swapping support. In the original code, when CONFIG_SWAP=n and CONFIG_THP_SWAP=y, split_swap_cluster() will not be built because it is in swapfile.c, but it will be called in huge_memory.c. This doesn't trigger a build error in practice because the call site is enclosed by PageSwapCache(), which is defined to be constant 0 when CONFIG_SWAP=n. But this is fragile and should be fixed. The comments are fixed too to reflect the latest progress. Link: http://lkml.kernel.org/r/20180713021228.439-1-ying.huang@intel.com Fixes: 38d8b4e6bdc8 ("mm, THP, swap: delay splitting THP during swap out") Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shaohua Li <shli@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/sparse: delete old sparse_init and enable new onePavel Tatashin4-267/+1
Rename new_sparse_init() to sparse_init(), which enables it. Delete the old sparse_init() and all the code that became obsolete with it. [pasha.tatashin@oracle.com: remove unused sparse_mem_maps_populate_node()] Link: http://lkml.kernel.org/r/20180716174447.14529-6-pasha.tatashin@oracle.com Link: http://lkml.kernel.org/r/20180712203730.8703-6-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Tested-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Tested-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/sparse: add new sparse_init_nid() and sparse_init()Pavel Tatashin1-0/+85
sparse_init() needs to temporarily allocate two large buffers: usemap_map and map_map. Baoquan He has identified that these buffers are so large that Linux is not bootable on small memory machines, such as a kdump boot. The buffers are especially large when CONFIG_X86_5LEVEL is set, as they are scaled to the maximum physical memory size. Baoquan provided a fix, which reduces the sizes of these buffers, but it is much better to get rid of them entirely. Add a new way to initialize sparse memory: sparse_init_nid(), which only operates within one memory node, and thus allocates memory either in one large contiguous block or section by section. This eliminates the need for temporary buffers. To simplify bisecting and review, temporarily name the new sparse_init() new_sparse_init(); the new interface is going to be enabled, and the old code removed, in the next patch. Link: http://lkml.kernel.org/r/20180712203730.8703-5-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Tested-by: Oscar Salvador <osalvador@suse.de> Tested-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/sparse: move buffer init/fini to the common placePavel Tatashin3-12/+7
Now that both variants of sparse memory use the same buffers to populate memory map, we can move sparse_buffer_init()/sparse_buffer_fini() to the common place. Link: http://lkml.kernel.org/r/20180712203730.8703-4-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Tested-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Tested-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/sparse: use the new sparse buffer functions in non-vmemmapPavel Tatashin1-27/+14
non-vmemmap sparse also allocates a large contiguous chunk of memory and, if that fails, falls back to smaller allocations. Use the same functions to allocate the buffer as the vmemmap-sparse variant. Link: http://lkml.kernel.org/r/20180712203730.8703-3-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Tested-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Reviewed-by: Oscar Salvador <osalvador@suse.de> Tested-by: Oscar Salvador <osalvador@suse.de> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/sparse: abstract sparse buffer allocationsPavel Tatashin3-35/+54
Patch series "sparse_init rewrite", v6. In sparse_init() we allocate two large buffers to temporarily hold usemap and memmap for the whole machine. However, we can avoid doing that if we change sparse_init() to operate on a per-node basis instead of doing it on the whole machine beforehand. As shown by Baoquan (http://lkml.kernel.org/r/20180628062857.29658-1-bhe@redhat.com), the buffers are large enough to prevent the machine from booting on small memory systems. Another benefit of these changes is that they also obsolete CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER. This patch (of 5): When struct pages are allocated for the sparse-vmemmap VA layout, we first try to allocate one large buffer, and then, if that fails, allocate struct pages for each section as we go. The code that allocates the buffer uses global variables and is spread across several call sites. Clean up the code by introducing three functions to handle the global buffer: sparse_buffer_init() initializes the buffer; sparse_buffer_fini() frees the remaining part of the buffer; sparse_buffer_alloc() allocates from the buffer and returns NULL if the buffer is empty. Define these functions in sparse.c instead of sparse-vmemmap.c because later we will use them for non-vmemmap sparse allocations as well. [akpm@linux-foundation.org: use PTR_ALIGN()] [akpm@linux-foundation.org: s/BUG_ON/WARN_ON/] Link: http://lkml.kernel.org/r/20180712203730.8703-2-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Tested-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Reviewed-by: Oscar Salvador <osalvador@suse.de> Tested-by: Oscar Salvador <osalvador@suse.de> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
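The init/alloc/fini trio described in this commit message follows a common bump-allocator pattern, which can be sketched in userspace as below. This is only an illustration under stated assumptions: the toy_buffer_*() and toy_alloc_section() names are hypothetical, malloc() stands in for the boot-time allocator, and, unlike the kernel version (which frees only the unused remainder), toy_buffer_fini() here releases the whole region, so allocations carved from the buffer must not be used after it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

static char *buf_base;	/* start of the malloc()ed region, kept for free() */
static char *buf_cur;	/* next unused byte */
static char *buf_end;	/* one past the last byte */

/* Try to grab one large buffer up front; failure just leaves it empty. */
static void toy_buffer_init(size_t size)
{
	assert(buf_base == NULL);	/* no nested init */
	buf_base = malloc(size);
	buf_cur = buf_base;
	buf_end = buf_base ? buf_base + size : NULL;
}

/* Carve an aligned chunk from the buffer; return NULL when it does not fit. */
static void *toy_buffer_alloc(size_t size, size_t align)
{
	uintptr_t p;

	assert(align && (align & (align - 1)) == 0);	/* power of two */
	if (!buf_cur)
		return NULL;
	p = ((uintptr_t)buf_cur + align - 1) & ~(uintptr_t)(align - 1);
	if (p > (uintptr_t)buf_end || size > (uintptr_t)buf_end - p)
		return NULL;
	buf_cur = (char *)p + size;
	return (void *)p;
}

/* Simplification: frees the whole region, not just the unused tail. */
static void toy_buffer_fini(void)
{
	free(buf_base);
	buf_base = buf_cur = buf_end = NULL;
}

/* Caller pattern: try the shared buffer first, else allocate separately. */
static void *toy_alloc_section(size_t size, int *from_buffer)
{
	void *p = toy_buffer_alloc(size, sizeof(void *));

	*from_buffer = (p != NULL);
	return p ? p : malloc(size);
}
```

Keeping the three helpers behind one small interface is what lets both callers share the "one large buffer, then per-section fallback" logic without touching the globals directly.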
2018-08-17mm/hugetlb.c: don't zero 1GiB bootmem pagesCannon Matthews1-1/+2
When using 1GiB pages during early boot, use the new memblock_virt_alloc_try_nid_raw() to allocate memory without zeroing it. Zeroing out hundreds or thousands of GiB in a single core memset() call is very slow, and can make early boot last upwards of 20-30 minutes on multi TiB machines. The memory does not need to be zero'd as the hugetlb pages are always zero'd on page fault. Tested: Booted with ~3800 1G pages, and it booted successfully in roughly the same amount of time as with 0, as opposed to the 25+ minutes it would take before. Link: http://lkml.kernel.org/r/20180711213313.92481-1-cannonmatthews@google.com Signed-off-by: Cannon Matthews <cannonmatthews@google.com> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: David Matlack <dmatlack@google.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm, page_alloc: double zone's batchsizeAaron Lu1-5/+4
To improve the page allocator's performance for order-0 pages, each CPU has a Per-CPU-Pageset (PCP) per zone. Whenever an order-0 page is needed, the PCP is checked first before asking Buddy for pages. When the PCP is used up, a batch of pages is fetched from Buddy to improve performance, and the size of the batch can affect performance. The zone's batch size was last doubled by commit ba56e91c9401 ("mm: page_alloc: increase size of per-cpu-pages") over ten years ago. Since then, CPUs have evolved a lot and CPU cache sizes have also increased. Dave Hansen is concerned that the current batch size doesn't fit well with modern hardware and suggested that I do two things: first, use a page allocator intensive benchmark, e.g. will-it-scale/page_fault1, to find out how performance changes with different batch sizes on various machines and then choose a new default batch size; second, see how this new batch size works with other workloads. In the first test, we saw performance gains on high-core-count systems and little to no effect on older systems with more modest core counts. From this phase's test data, two candidates are chosen: 63 and 127. In the second step, ebizzy, oltp, kbuild, pigz, netperf, vm-scalability and more will-it-scale sub-tests are run to see how these two candidates work with these workloads, and a new default is decided according to their results. Most test results are flat. will-it-scale/page_fault2 process mode has a 10%-18% performance increase on 4-socket Skylake and Broadwell. vm-scalability/lru-file-mmap-read has a 17%-47% performance increase for 4-socket servers, while for 2-socket servers it caused a 3%-8% performance drop. Further analysis showed that, with a larger pcp->batch and thus a larger pcp->high (the relationship pcp->high = 6 * pcp->batch is maintained in this patch), zone lock contention shifted to LRU add side lock contention, and that caused the performance drop. This performance drop might be mitigated by others' work on optimizing the LRU lock.
Another downside of increasing pcp->batch is that, when the PCP is used up and needs to fetch a batch of pages from Buddy, that fetch can take longer than before because the batch is larger. My understanding is that this doesn't affect the slowpath, where direct reclaim and compaction dominate. For the fastpath, throughput is a win (according to will-it-scale/page_fault1) but worst-case latency can be larger now. Overall, I think doubling the batch size from 31 to 63 is relatively safe and provides a good performance boost for high-core-count systems. The two phases' test results are listed below (all tests are done with THP disabled).

Phase one (will-it-scale/page_fault1) test results:

Skylake-EX: increased batch size has a good effect on zone->lock contention, though LRU contention rises at the same time and limits the final performance increase.

batch  score     change   zone_contention  lru_contention  total_contention
31     15345900  +0.00%   64%              8%              72%
53     17903847  +16.67%  32%              38%             70%
63     17992886  +17.25%  24%              45%             69%
73     18022825  +17.44%  10%              61%             71%
119    18023401  +17.45%  4%               66%             70%
127    18029012  +17.48%  3%               66%             69%
137    18036075  +17.53%  4%               66%             70%
165    18035964  +17.53%  2%               67%             69%
188    18101105  +17.95%  2%               67%             69%
223    18130951  +18.15%  2%               67%             69%
255    18118898  +18.07%  2%               67%             69%
267    18101559  +17.96%  2%               67%             69%
299    18160468  +18.34%  2%               68%             70%
320    18139845  +18.21%  2%               67%             69%
393    18160869  +18.34%  2%               68%             70%
424    18170999  +18.41%  2%               68%             70%
458    18144868  +18.24%  2%               68%             70%
467    18142366  +18.22%  2%               68%             70%
498    18154549  +18.30%  1%               68%             69%
511    18134525  +18.17%  1%               69%             70%

Broadwell-EX: similar pattern to Skylake-EX.
batch  score     change   zone_contention  lru_contention  total_contention
31     16703983  +0.00%   67%              7%              74%
53     18195393  +8.93%   43%              28%             71%
63     18288885  +9.49%   38%              33%             71%
73     18344329  +9.82%   35%              37%             72%
119    18535529  +10.96%  24%              46%             70%
127    18513596  +10.83%  23%              48%             71%
137    18514327  +10.84%  23%              48%             71%
165    18511840  +10.82%  22%              49%             71%
188    18593478  +11.31%  17%              53%             70%
223    18601667  +11.36%  17%              52%             69%
255    18774825  +12.40%  12%              58%             70%
267    18754781  +12.28%  9%               60%             69%
299    18892265  +13.10%  7%               63%             70%
320    18873812  +12.99%  8%               62%             70%
393    18891174  +13.09%  6%               64%             70%
424    18975108  +13.60%  6%               64%             70%
458    18932364  +13.34%  8%               62%             70%
467    18960891  +13.51%  5%               65%             70%
498    18944526  +13.41%  5%               64%             69%
511    18960839  +13.51%  5%               64%             69%

Skylake-EP: although an increased batch reduced zone->lock contention, the effect is not as good as on EX: zone->lock contention is still as high as 20% with a very high batch value, instead of 1% on Skylake-EX or 5% on Broadwell-EX. Also, total_contention actually decreased with a higher batch, but that doesn't translate to a performance increase.

batch  score     change   zone_contention  lru_contention  total_contention
31     9554867   +0.00%   66%              3%              69%
53     9855486   +3.15%   63%              3%              66%
63     9980145   +4.45%   62%              4%              66%
73     10092774  +5.63%   62%              5%              67%
119    10310061  +7.90%   45%              19%             64%
127    10342019  +8.24%   42%              19%             61%
137    10358182  +8.41%   42%              21%             63%
165    10397060  +8.81%   37%              24%             61%
188    10341808  +8.24%   34%              26%             60%
223    10349135  +8.31%   31%              27%             58%
255    10327189  +8.08%   28%              29%             57%
267    10344204  +8.26%   27%              29%             56%
299    10325043  +8.06%   25%              30%             55%
320    10310325  +7.91%   25%              31%             56%
393    10293274  +7.73%   21%              31%             52%
424    10311099  +7.91%   21%              32%             53%
458    10321375  +8.02%   21%              32%             53%
467    10303881  +7.84%   21%              32%             53%
498    10332462  +8.14%   20%              33%             53%
511    10325016  +8.06%   20%              32%             52%

Broadwell-EP: zone->lock and lru lock had an agreement to make sure performance doesn't increase and they successfully managed to keep total contention at 70%.
batch  score     change   zone_contention  lru_contention  total_contention
31     10121178  +0.00%   19%              50%             69%
53     10142366  +0.21%   6%               63%             69%
63     10117984  -0.03%   11%              58%             69%
73     10123330  +0.02%   7%               63%             70%
119    10108791  -0.12%   2%               67%             69%
127    10166074  +0.44%   3%               66%             69%
137    10141574  +0.20%   3%               66%             69%
165    10154499  +0.33%   2%               68%             70%
188    10124921  +0.04%   2%               67%             69%
223    10137399  +0.16%   2%               67%             69%
255    10143289  +0.22%   0%               68%             68%
267    10123535  +0.02%   1%               68%             69%
299    10140952  +0.20%   0%               68%             68%
320    10163170  +0.41%   0%               68%             68%
393    10000633  -1.19%   0%               69%             69%
424    10087998  -0.33%   0%               69%             69%
458    10187116  +0.65%   0%               69%             69%
467    10146790  +0.25%   0%               69%             69%
498    10197958  +0.76%   0%               69%             69%
511    10152326  +0.31%   0%               69%             69%

Haswell-EP: similar to Broadwell-EP.

batch  score     change   zone_contention  lru_contention  total_contention
31     10442205  +0.00%   14%              48%             62%
53     10442255  +0.00%   5%               57%             62%
63     10452059  +0.09%   6%               57%             63%
73     10482349  +0.38%   5%               59%             64%
119    10454644  +0.12%   3%               60%             63%
127    10431514  -0.10%   3%               59%             62%
137    10423785  -0.18%   3%               60%             63%
165    10481216  +0.37%   2%               61%             63%
188    10448755  +0.06%   2%               61%             63%
223    10467144  +0.24%   2%               61%             63%
255    10480215  +0.36%   2%               61%             63%
267    10484279  +0.40%   2%               61%             63%
299    10466450  +0.23%   2%               61%             63%
320    10452578  +0.10%   2%               61%             63%
393    10499678  +0.55%   1%               62%             63%
424    10481454  +0.38%   1%               62%             63%
458    10473562  +0.30%   1%               62%             63%
467    10484269  +0.40%   0%               62%             62%
498    10505599  +0.61%   0%               62%             62%
511    10483395  +0.39%   0%               62%             62%

Westmere-EP: contention is pretty small so not interesting. Note too high a batch value could hurt performance.
batch  score     change   zone_contention  lru_contention  total_contention
31     4831523   +0.00%   2%               3%              5%
53     4834086   +0.05%   2%               4%              6%
63     4834262   +0.06%   2%               3%              5%
73     4832851   +0.03%   2%               4%              6%
119    4830534   -0.02%   1%               3%              4%
127    4827461   -0.08%   1%               4%              5%
137    4827459   -0.08%   1%               3%              4%
165    4820534   -0.23%   0%               4%              4%
188    4817947   -0.28%   0%               3%              3%
223    4809671   -0.45%   0%               3%              3%
255    4802463   -0.60%   0%               4%              4%
267    4801634   -0.62%   0%               3%              3%
299    4798047   -0.69%   0%               3%              3%
320    4793084   -0.80%   0%               3%              3%
393    4785877   -0.94%   0%               3%              3%
424    4782911   -1.01%   0%               3%              3%
458    4779346   -1.08%   0%               3%              3%
467    4780306   -1.06%   0%               3%              3%
498    4780589   -1.05%   0%               3%              3%
511    4773724   -1.20%   0%               3%              3%

Skylake-Desktop: similar to Westmere-EP, nothing interesting.

batch  score     change   zone_contention  lru_contention  total_contention
31     3906608   +0.00%   2%               3%              5%
53     3940164   +0.86%   2%               3%              5%
63     3937289   +0.79%   2%               3%              5%
73     3940201   +0.86%   2%               3%              5%
119    3933240   +0.68%   2%               3%              5%
127    3930514   +0.61%   2%               4%              6%
137    3938639   +0.82%   0%               3%              3%
165    3908755   +0.05%   0%               3%              3%
188    3905621   -0.03%   0%               3%              3%
223    3903015   -0.09%   0%               4%              4%
255    3889480   -0.44%   0%               3%              3%
267    3891669   -0.38%   0%               4%              4%
299    3898728   -0.20%   0%               4%              4%
320    3894547   -0.31%   0%               4%              4%
393    3875137   -0.81%   0%               4%              4%
424    3874521   -0.82%   0%               3%              3%
458    3880432   -0.67%   0%               4%              4%
467    3888715   -0.46%   0%               3%              3%
498    3888633   -0.46%   0%               4%              4%
511    3875305   -0.80%   0%               5%              5%

Haswell-Desktop: zone->lock contention is pretty low, as on other desktops, though lru contention is higher than on other desktops.
batch  score     change   zone_contention  lru_contention  total_contention
31     3511158   +0.00%   2%               5%              7%
53     3555445   +1.26%   2%               6%              8%
63     3561082   +1.42%   2%               6%              8%
73     3547218   +1.03%   2%               6%              8%
119    3571319   +1.71%   1%               7%              8%
127    3549375   +1.09%   0%               6%              6%
137    3560233   +1.40%   0%               6%              6%
165    3555176   +1.25%   2%               6%              8%
188    3551501   +1.15%   0%               8%              8%
223    3531462   +0.58%   0%               7%              7%
255    3570400   +1.69%   0%               7%              7%
267    3532235   +0.60%   1%               8%              9%
299    3562326   +1.46%   0%               6%              6%
320    3553569   +1.21%   0%               8%              8%
393    3539519   +0.81%   0%               7%              7%
424    3549271   +1.09%   0%               8%              8%
458    3528885   +0.50%   0%               8%              8%
467    3526554   +0.44%   0%               7%              7%
498    3525302   +0.40%   0%               9%              9%
511    3527556   +0.47%   0%               8%              8%

Sandybridge-Desktop: the 0% contention isn't accurate; it is caused by the dropped fractional part. Since the contention of each of the multiple contention paths is under 1% here, arithmetic operations such as addition can make the final deviation as large as 3%.

batch  score     change   zone_contention  lru_contention  total_contention
31     1744495   +0.00%   0%               0%              0%
53     1755341   +0.62%   0%               0%              0%
63     1758469   +0.80%   0%               0%              0%
73     1759626   +0.87%   0%               0%              0%
119    1770417   +1.49%   0%               0%              0%
127    1768252   +1.36%   0%               0%              0%
137    1767848   +1.34%   0%               0%              0%
165    1765088   +1.18%   0%               0%              0%
188    1766918   +1.29%   0%               0%              0%
223    1767866   +1.34%   0%               0%              0%
255    1768074   +1.35%   0%               0%              0%
267    1763187   +1.07%   0%               0%              0%
299    1765620   +1.21%   0%               0%              0%
320    1767603   +1.32%   0%               0%              0%
393    1764612   +1.15%   0%               0%              0%
424    1758476   +0.80%   0%               0%              0%
458    1758593   +0.81%   0%               0%              0%
467    1757915   +0.77%   0%               0%              0%
498    1753363   +0.51%   0%               0%              0%
511    1755548   +0.63%   0%               0%              0%

Phase two test results (note: all percent changes are against the base, batch=31):
ebizzy.throughput (higher is better)

machine       batch=31       batch=63       change  batch=127      change
lkp-skl-4sp1  2410037±7%     2600451±2%     +7.9%   2602878        +8.0%
lkp-bdw-ex1   1493328        1489243        -0.3%   1492145        -0.1%
lkp-skl-2sp2  1329674        1345891        +1.2%   1351056        +1.6%
lkp-bdw-ep2   711511         711511         0.0%    710708         -0.1%
lkp-wsm-ep2   75750          75528          -0.3%   75441          -0.4%
lkp-skl-d01   264126         262791         -0.5%   264113         +0.0%
lkp-hsw-d01   176601         176328         -0.2%   176368         -0.1%
lkp-sb02      98937          98937          +0.0%   99030          +0.1%

kbuild.buildtime (less is better)

machine       batch=31       batch=63       change  batch=127      change
lkp-skl-4sp1  107.00         107.67         +0.6%   107.11         +0.1%
lkp-bdw-ex1   97.33          97.33          +0.0%   97.42          +0.1%
lkp-skl-2sp2  180.00         179.83         -0.1%   179.83         -0.1%
lkp-bdw-ep2   178.17         179.17         +0.6%   177.50         -0.4%
lkp-wsm-ep2   737.00         738.00         +0.1%   738.00         +0.1%
lkp-skl-d01   642.00         653.00         +1.7%   653.00         +1.7%
lkp-hsw-d01   1310.00        1316.00        +0.5%   1311.00        +0.1%

netperf/TCP_STREAM.Throughput_total_Mbps (higher is better)

machine       batch=31       batch=63       change  batch=127      change
lkp-skl-4sp1  948790         947144         -0.2%   948333         -0.0%
lkp-bdw-ex1   904224         904366         +0.0%   904926         +0.1%
lkp-skl-2sp2  239731         239607         -0.1%   239565         -0.1%
lkp-bdw-ep2   365764         365933         +0.0%   365951         +0.1%
lkp-wsm-ep2   93736          93803          +0.1%   93808          +0.1%
lkp-skl-d01   77314          77303          -0.0%   77375          +0.1%
lkp-hsw-d01   58617          60387          +3.0%   60208          +2.7%
lkp-sb02      29990          30137          +0.5%   30103          +0.4%

oltp.transactions (higher is better)

machine       batch=31       batch=63       change  batch=127      change
lkp-bdw-ex1   9073276        9100377        +0.3%   9036344        -0.4%
lkp-skl-2sp2  8898717        8852054        -0.5%   8894459        -0.0%
lkp-bdw-ep2   13426155       13384654       -0.3%   13333637       -0.7%
lkp-hsw-ep2   13146314       13232784       +0.7%   13193163       +0.4%
lkp-wsm-ep2   5035355        5019348        -0.3%   5033418        -0.0%
lkp-skl-d01   418485         4413339        -0.1%   4419039        +0.0%
lkp-hsw-d01   3517817±5%     3396120±3%     -3.5%   3455138±3%     -1.8%

pigz.throughput (higher is better)

machine       batch=31       batch=63       change  batch=127      change
lkp-skl-4sp1  1.513e+08      1.507e+08      -0.4%   1.511e+08      -0.2%
lkp-bdw-ex1   2.060e+08      2.052e+08      -0.4%   2.044e+08      -0.8%
lkp-skl-2sp2  8.836e+08      8.845e+08      +0.1%   8.836e+08      -0.0%
lkp-bdw-ep2   8.275e+08      8.464e+08      +2.3%   8.330e+08      +0.7%
lkp-wsm-ep2   2.224e+08      2.221e+08      -0.2%   2.218e+08      -0.3%
lkp-skl-d01   1.177e+08      1.177e+08      -0.0%   1.176e+08      -0.1%
lkp-hsw-d01   1.154e+08      1.154e+08      +0.1%   1.154e+08      -0.0%
lkp-sb02      0.633e+08      0.633e+08      +0.1%   0.633e+08      +0.0%

will-it-scale.malloc1.processes (higher is better)

machine       batch=31       batch=63       change  batch=127      change
lkp-skl-4sp1  620181         620484         +0.0%   620240         +0.0%
lkp-bdw-ex1   1403610        1401201        -0.2%   1417900        +1.0%
lkp-skl-2sp2  1288097        1284145        -0.3%   1283907        -0.3%
lkp-bdw-ep2   1427879        1427675        -0.0%   1428266        +0.0%
lkp-hsw-ep2   1362546        1353965        -0.6%   1354759        -0.6%
lkp-wsm-ep2   2099657        2107576        +0.4%   2100226        +0.0%
lkp-skl-d01   1476835        1476358        -0.0%   1474487        -0.2%
lkp-hsw-d01   1308810        1303429        -0.4%   1301299        -0.6%
lkp-sb02      589286         589284         -0.0%   588101         -0.2%

will-it-scale.malloc1.threads (higher is better)

machine       batch=31       batch=63       change  batch=127      change
lkp-skl-4sp1  21289          21125          -0.8%   21241          -0.2%
lkp-bdw-ex1   28114          28089          -0.1%   28007          -0.4%
lkp-skl-2sp2  91866          91946          +0.1%   92723          +0.9%
lkp-bdw-ep2   37637          37501          -0.4%   37317          -0.9%
lkp-hsw-ep2   43673          43590          -0.2%   43754          +0.2%
lkp-wsm-ep2   28577          28298          -1.0%   28545          -0.1%
lkp-skl-d01   175277         173343         -1.1%   173082         -1.3%
lkp-hsw-d01   130303         129566         -0.6%   129250         -0.8%
lkp-sb02      113742±3%      116911         +2.8%   116417±3%      +2.4%

will-it-scale.malloc2.processes (higher is better)

machine       batch=31       batch=63       change  batch=127      change
lkp-skl-4sp1  1.206e+09      1.206e+09      -0.0%   1.206e+09      +0.0%
lkp-bdw-ex1   1.319e+09      1.319e+09      -0.0%   1.319e+09      +0.0%
lkp-skl-2sp2  8.000e+08      8.021e+08      +0.3%   7.995e+08      -0.1%
lkp-bdw-ep2   6.582e+08      6.634e+08      +0.8%   6.513e+08      -1.1%
lkp-hsw-ep2   6.671e+08      6.669e+08      -0.0%   6.665e+08      -0.1%
lkp-wsm-ep2   1.805e+08      1.806e+08      +0.0%   1.804e+08      -0.1%
lkp-skl-d01   1.611e+08      1.611e+08      -0.0%   1.610e+08      -0.0%
lkp-hsw-d01   1.333e+08      1.332e+08      -0.0%   1.332e+08      -0.0%
lkp-sb02      82485104       82478206       -0.0%   82473546       -0.0%

will-it-scale.malloc2.threads (higher is better)

machine       batch=31       batch=63       change  batch=127      change
lkp-skl-4sp1  1.574e+09      1.574e+09      -0.0%   1.574e+09      -0.0%
lkp-bdw-ex1   1.737e+09      1.737e+09      +0.0%   1.737e+09      -0.0%
lkp-skl-2sp2  9.161e+08      9.162e+08      +0.0%   9.181e+08      +0.2%
lkp-bdw-ep2   7.856e+08      8.015e+08      +2.0%   8.113e+08      +3.3%
lkp-hsw-ep2   6.908e+08      6.904e+08      -0.1%   6.907e+08      -0.0%
lkp-wsm-ep2   2.409e+08      2.409e+08      +0.0%   2.409e+08      -0.0%
lkp-skl-d01   1.199e+08      1.199e+08      -0.0%   1.199e+08      -0.0%
lkp-hsw-d01   1.029e+08      1.029e+08      -0.0%   1.029e+08      +0.0%
lkp-sb02      68081213       68061423       -0.0%   68076037       -0.0%

will-it-scale.page_fault2.processes (higher is better)

machine       batch=31       batch=63       change  batch=127      change
lkp-skl-4sp1  14509125±4%    16472364       +13.5%  17123117       +18.0%
lkp-bdw-ex1   14736381       16196588       +9.9%   16364011       +11.0%
lkp-skl-2sp2  6354925        6435444        +1.3%   6436644        +1.3%
lkp-bdw-ep2   8749584        8834422        +1.0%   8827179        +0.9%
lkp-hsw-ep2   8762591        8845920        +1.0%   8825697        +0.7%
lkp-wsm-ep2   3036083        3030428        -0.2%   3021741        -0.5%
lkp-skl-d01   2307834        2304731        -0.1%   2286142        -0.9%
lkp-hsw-d01   1806237        1800786        -0.3%   1795943        -0.6%
lkp-sb02      842616         837844         -0.6%   833921         -1.0%

will-it-scale.page_fault2.threads

machine       batch=31       batch=63       change  batch=127      change
lkp-skl-4sp1  1623294        1615132±2%     -0.5%   1656777        +2.1%
lkp-bdw-ex1   1995714        2025948        +1.5%   2113753±3%     +5.9%
lkp-skl-2sp2  2346708        2415591        +2.9%   2416919        +3.0%
lkp-bdw-ep2   2342564        2344882        +0.1%   2300206        -1.8%
lkp-hsw-ep2   1820658        1831681        +0.6%   1844057        +1.3%
lkp-wsm-ep2   1725482        1733774        +0.5%   1740517        +0.9%
lkp-skl-d01   1832833        1823628        -0.5%   1806489        -1.4%
lkp-hsw-d01   1427913        1427287        -0.0%   1420226        -0.5%
lkp-sb02      750626         748615         -0.3%   746621         -0.5%

will-it-scale.page_fault3.processes (higher is better)

machine       batch=31       batch=63       change  batch=127      change
lkp-skl-4sp1  24382726       24400317       +0.1%   24668774       +1.2%
lkp-bdw-ex1   35399750       35683124       +0.8%   35829492       +1.2%
lkp-skl-2sp2  28136820       28068248       -0.2%   28147989       +0.0%
lkp-bdw-ep2   37269077       37459490       +0.5%   37373073       +0.3%
lkp-hsw-ep2   36224967       36114085       -0.3%   36104908       -0.3%
lkp-wsm-ep2   16820457       16911005       +0.5%   16968596       +0.9%
lkp-skl-d01   7721138        7725904        +0.1%   7756740        +0.5%
lkp-hsw-d01   7611979        7650928        +0.5%   7651323        +0.5%
lkp-sb02      3781546        3796502        +0.4%   3796827        +0.4%

will-it-scale.page_fault3.threads (higher is better)

machine       batch=31       batch=63       change  batch=127      change
lkp-skl-4sp1  1865820±3%     1900917±2%     +1.9%   1826245±4%     -2.1%
lkp-bdw-ex1   3094060        3148326        +1.8%   3150036        +1.8%
lkp-skl-2sp2  3952940        3953898        +0.0%   3989360        +0.9%
lkp-bdw-ep2   3420373±3%     3643964        +6.5%   3644910±5%     +6.6%
lkp-hsw-ep2   2609635±2%     2582310±3%     -1.0%   2780459        +6.5%
lkp-wsm-ep2   4395001        4417196        +0.5%   4432499        +0.9%
lkp-skl-d01   5363977        5400003        +0.7%   5411370        +0.9%
lkp-hsw-d01   5274131        5311294        +0.7%   5319359        +0.9%
lkp-sb02      2917314        2913004        -0.1%   2935286        +0.6%

will-it-scale.read1.processes (higher is better)

machine       batch=31       batch=63       change  batch=127      change
lkp-skl-4sp1  73762279±14%   69322519±10%   -6.0%   69349855±13%   -6.0%  (result unstable)
lkp-bdw-ex1   1.701e+08      1.704e+08      +0.1%   1.705e+08      +0.2%
lkp-skl-2sp2  63111570       63113953       +0.0%   63836573       +1.1%
lkp-bdw-ep2   79247409       79424610       +0.2%   78012656       -1.6%
lkp-hsw-ep2   67677026       68308800       +0.9%   67539106       -0.2%
lkp-wsm-ep2   13339630       13939817       +4.5%   13766865       +3.2%
lkp-skl-d01   10969487       10972650       +0.0%   no data
lkp-hsw-d01   9857342±2%     10080592±2%    +2.3%   10131560       +2.8%
lkp-sb02      5189076        5197473        +0.2%   5163253        -0.5%

will-it-scale.read1.threads (higher is better)

machine       batch=31       batch=63       change  batch=127      change
lkp-skl-4sp1  62468045±12%   73666726±7%    +17.9%  79553123±12%   +27.4%  (result unstable)
lkp-bdw-ex1   1.62e+08       1.624e+08      +0.3%   1.614e+08      -0.3%
lkp-skl-2sp2  58319780       59181032       +1.5%   59821353       +2.6%
lkp-bdw-ep2   74057992       75698171       +2.2%   74990869       +1.3%
lkp-hsw-ep2   63672959       63639652       -0.1%   64387051       +1.1%
lkp-wsm-ep2   13489943       13526058       +0.3%   13259032       -1.7%
lkp-skl-d01   10297906       10338796       +0.4%   10407328       +1.1%
lkp-hsw-d01   9636721        9667376        +0.3%   9341147        -3.1%
lkp-sb02      4801938        4804496        +0.1%   4802290        +0.0%

will-it-scale.write1.processes (higher is better)

machine       batch=31       batch=63       change  batch=127      change
lkp-skl-4sp1  1.111e+08      1.104e+08±2%   -0.7%   1.122e+08±2%   +1.0%
lkp-bdw-ex1   1.392e+08      1.399e+08      +0.5%   1.397e+08      +0.4%
lkp-skl-2sp2  59369233       58994841       -0.6%   58715168       -1.1%
lkp-bdw-ep2   61820979       CPU throttle           63593123       +2.9%
lkp-hsw-ep2   57897587       57435605       -0.8%   56347450       -2.7%
lkp-wsm-ep2   7814203        7918017±2%     +1.3%   7669068        -1.9%
lkp-skl-d01   8886557        8971422        +1.0%   8818366        -0.8%
lkp-hsw-d01   9171001±5%     9189915        +0.2%   9483909        +3.4%
lkp-sb02      4475406        4475294        -0.0%   4501756        +0.6%

will-it-scale.write1.threads (higher is better)

machine       batch=31       batch=63       change  batch=127      change
lkp-skl-4sp1  1.058e+08      1.055e+08±2%   -0.2%   1.065e+08      +0.7%
lkp-bdw-ex1   1.316e+08      1.300e+08      -1.2%   1.308e+08      -0.6%
lkp-skl-2sp2  54492421       56086678       +2.9%   55975657       +2.7%
lkp-bdw-ep2   59360449       59003957       -0.6%   58101262       -2.1%
lkp-hsw-ep2   53346346±2%    52530876       -1.5%   52902487       -0.8%
lkp-wsm-ep2   7774006        7800092±2%     +0.3%   7558833        -2.8%
lkp-skl-d01   8346174        8235695        -1.3%   no data
lkp-hsw-d01   8636244        8655731        +0.2%   8658868        +0.3%
lkp-sb02      4181820        4204107        +0.5%   4182992        +0.0%

vm-scalability.anon-r-rand.throughput (higher is better)

machine       batch=31       batch=63       change  batch=127      change
lkp-skl-4sp1  11933873±3%    12356544±2%    +3.5%   12188624       +2.1%
lkp-bdw-ex1   7114424±2%     7330949±2%     +3.0%   7392419        +3.9%
lkp-skl-2sp2  6773277±5%     6492332±8%     -4.1%   6543962        -3.4%
lkp-bdw-ep2   7133846±4%     7233508        +1.4%   7013518±3%     -1.7%
lkp-hsw-ep2   4576626        4527098        -1.1%   4551679        -0.5%
lkp-wsm-ep2   2583599        2592492        +0.3%   2588039        +0.2%
lkp-hsw-d01   998199±2%      1028311        +3.0%   1006460±2%     +0.8%
lkp-sb02      570572         567854         -0.5%   568449         -0.4%

vm-scalability.anon-r-rand-mt.throughput (higher is better)

machine       batch=31       batch=63       change  batch=127      change
lkp-skl-4sp1  1789419        1787830        -0.1%   1788208        -0.1%
lkp-bdw-ex1   3492595±2%     3554966±2%     +1.8%   3558835±3%     +1.9%
lkp-skl-2sp2  3856238±2%     3975403±4%     +3.1%   3994600        +3.6%
lkp-bdw-ep2   3726963±11%    3809292±6%     +2.2%   3871924±4%     +3.9%
lkp-hsw-ep2   2131760±3%     2033578±4%     -4.6%   2130727±6%     -0.0%
lkp-wsm-ep2   2369731        2368384        -0.1%   2370252        +0.0%
lkp-skl-d01   1207128        1206220        -0.1%   1205801        -0.1%
lkp-hsw-d01   964317         992329±2%      +2.9%   992099±2%      +2.9%
lkp-sb02      567137         567346         +0.0%   566144         -0.2%

vm-scalability.lru-file-mmap-read.throughput (higher is better)

machine       batch=31       batch=63       change  batch=127      change
lkp-skl-4sp1  19560469±6%    23018999       +17.7%  23418800       +19.7%
lkp-bdw-ex1   17769135±14%   26141676±3%    +47.1%  26284723±5%    +47.9%
lkp-skl-2sp2  14056512       13578884       -3.4%   13146214       -6.5%
lkp-bdw-ep2 15336542 14737654 -3.9% 14088159 -8.1% lkp-hsw-ep2 16275498 15756296 -3.2% 15018090 -7.7% lkp-wsm-ep2 11272160 11237231 -0.3% 11310047 +0.3% lkp-skl-d01 7322119 7324569 +0.0% 7184148 -1.9% lkp-hsw-d01 6449234 6404542 -0.7% 6356141 -1.4% lkp-sb02 3517943 3520668 +0.1% 3527309 +0.3% vm-scalability.lru-file-mmap-read-rand.throughput (higher is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 1689052 1697553 +0.5% 1698726 +0.6% lkp-bdw-ex1 1675246 1699764 +1.5% 1712226 +2.2% lkp-skl-2sp2 1800533 1799749 -0.0% 1800581 +0.0% lkp-bdw-ep2 1807422 1807758 +0.0% 1804932 -0.1% lkp-hsw-ep2 1809807 1808781 -0.1% 1807811 -0.1% lkp-wsm-ep2 1800198 1802434 +0.1% 1801236 +0.1% lkp-skl-d01 696689 695537 -0.2% 694106 -0.4% lkp-hsw-d01 698364 698666 +0.0% 696686 -0.2% lkp-sb02 258939 258787 -0.1% 258199 -0.3% Link: http://lkml.kernel.org/r/20180711055855.29072-1-aaron.lu@intel.com Signed-off-by: Aaron Lu <aaron.lu@intel.com> Suggested-by: Dave Hansen <dave.hansen@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Kemi Wang <kemi.wang@intel.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/oom_kill.c: document oom_lockMichal Hocko1-0/+8
Add comments describing oom_lock's scope. Requested-by: David Rientjes <rientjes@google.com> Link: http://lkml.kernel.org/r/20180711120121.25635-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/hugetlb: remove gigantic page support for HIGHMEMMike Kravetz2-11/+1
This reverts ee8f248d266e ("hugetlb: add phys addr to struct huge_bootmem_page"). At one time powerpc used this field and supporting code. However, that was removed with commit 79cc38ded1e1 ("powerpc/mm/hugetlb: Add support for reserving gigantic huge pages via kernel command line"). There are no users of this field and supporting code, so remove it. Link: http://lkml.kernel.org/r/20180711195913.1294-1-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Cannon Matthews <cannonmatthews@google.com> Cc: Becky Bruce <beckyb@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>