aboutsummaryrefslogtreecommitdiffstats
path: root/kernel (follow)
AgeCommit message (Collapse)AuthorFilesLines
2020-03-15Merge tag 'locking-urgent-2020-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-38/+55
Pull futex fix from Thomas Gleixner: "Fix for yet another subtle futex issue. The futex code used ihold() to prevent inodes from vanishing, but ihold() does not guarantee inode persistence. Replace the inode pointer with a per boot, machine wide, unique inode identifier. The second commit fixes the breakage of the hash mechanism which causes a 100% performance regression" * tag 'locking-urgent-2020-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: futex: Unbreak futex hashing futex: Fix inode life-time issue
2020-03-15Merge tag 'timers-urgent-2020-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-0/+2
Pull timer fix from Thomas Gleixner: "A single fix adding the missing time namespace adjustment in sys/sysinfo which caused sys/sysinfo to be inconsistent with /proc/uptime when read from a task inside a time namespace" * tag 'timers-urgent-2020-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sys/sysinfo: Respect boottime inside time namespace
2020-03-12Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds1-0/+4
Pull networking fixes from David Miller: "It looks like a decent sized set of fixes, but a lot of these are one liner off-by-one and similar type changes: 1) Fix netlink header pointer to calcular bad attribute offset reported to user. From Pablo Neira Ayuso. 2) Don't double clear PHY interrupts when ->did_interrupt is set, from Heiner Kallweit. 3) Add missing validation of various (devlink, nl802154, fib, etc.) attributes, from Jakub Kicinski. 4) Missing *pos increments in various netfilter seq_next ops, from Vasily Averin. 5) Missing break in of_mdiobus_register() loop, from Dajun Jin. 6) Don't double bump tx_dropped in veth driver, from Jiang Lidong. 7) Work around FMAN erratum A050385, from Madalin Bucur. 8) Make sure ARP header is pulled early enough in bonding driver, from Eric Dumazet. 9) Do a cond_resched() during multicast processing of ipvlan and macvlan, from Mahesh Bandewar. 10) Don't attach cgroups to unrelated sockets when in interrupt context, from Shakeel Butt. 11) Fix tpacket ring state management when encountering unknown GSO types. From Willem de Bruijn. 12) Fix MDIO bus PHY resume by checking mdio_bus_phy_may_suspend() only in the suspend context. From Heiner Kallweit" * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (112 commits) net: systemport: fix index check to avoid an array out of bounds access tc-testing: add ETS scheduler to tdc build configuration net: phy: fix MDIO bus PM PHY resuming net: hns3: clear port base VLAN when unload PF net: hns3: fix RMW issue for VLAN filter switch net: hns3: fix VF VLAN table entries inconsistent issue net: hns3: fix "tc qdisc del" failed issue taprio: Fix sending packets without dequeueing them net: mvmdio: avoid error message for optional IRQ net: dsa: mv88e6xxx: Add missing mask of ATU occupancy register net: memcg: fix lockdep splat in inet_csk_accept() s390/qeth: implement smarter resizing of the RX buffer pool s390/qeth: refactor buffer pool code s390/qeth: use page pointers to manage RX buffer pool seg6: fix SRv6 L2 tunnels to use IANA-assigned protocol number net: dsa: Don't instantiate phylink for CPU/DSA ports unless needed net/packet: tpacket_rcv: do not increment ring index on drop sxgbe: Fix off by one in samsung driver strncpy size arg net: caif: Add lockdep expression to RCU traversal primitive MAINTAINERS: remove Sathya Perla as Emulex NIC maintainer ...
2020-03-11Merge tag 'for-linus-2020-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linuxLinus Torvalds1-0/+10
Pull thread fix from Christian Brauner: "This contains a single fix for a regression which was introduced when we introduced the ability to select a specific pid at process creation time. When this feature is requested, the error value will be set to -EPERM after exiting the pid allocation loop. This caused EPERM to be returned when e.g. the init process/child subreaper of the pid namespace has already died where we used to return ENOMEM before. The first patch here simply fixes the regression by unconditionally setting the return value back to ENOMEM again once we've successfully allocated the requested pid number. This should be easy to backport to v5.5. The second patch adds a comment explaining that we must keep returning ENOMEM since we've been doing it for a long time and have explicitly documented this behavior for userspace. This seemed worthwhile because we now have at least two separate example where people tried to change the return value to something other than ENOMEM (The first version of the regression fix did that too and the commit message links to an earlier patch that tried to do the same.). I have a simple regression test to make sure we catch this regression in the future but since that introduces a whole new selftest subdir and test files I'll keep this for v5.7" * tag 'for-linus-2020-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: pid: make ENOMEM return value more obvious pid: Fix error return value in some cases
2020-03-11Merge tag 'trace-v5.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-traceLinus Torvalds1-0/+2
Pull ftrace fix from Steven Rostedt: "Have ftrace lookup_rec() return a consistent record otherwise it can break live patching" * tag 'trace-v5.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: ftrace: Return the first found result in lookup_rec()
2020-03-11ftrace: Return the first found result in lookup_rec()Artem Savkov1-0/+2
It appears that ip ranges can overlap so. In that case lookup_rec() returns whatever results it got last even if it found nothing in last searched page. This breaks an obscure livepatch late module patching usecase: - load livepatch - load the patched module - unload livepatch - try to load livepatch again To fix this return from lookup_rec() as soon as it found the record containing searched-for ip. This used to be this way prior lookup_rec() introduction. Link: http://lkml.kernel.org/r/20200306174317.21699-1-asavkov@redhat.com Cc: stable@vger.kernel.org Fixes: 7e16f581a817 ("ftrace: Separate out functionality from ftrace_location_range()") Signed-off-by: Artem Savkov <asavkov@redhat.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-03-10cgroup: memcg: net: do not associate sock with unrelated cgroupShakeel Butt1-0/+4
We are testing network memory accounting in our setup and noticed inconsistent network memory usage and often unrelated cgroups network usage correlates with testing workload. On further inspection, it seems like mem_cgroup_sk_alloc() and cgroup_sk_alloc() are broken in irq context specially for cgroup v1. mem_cgroup_sk_alloc() and cgroup_sk_alloc() can be called in irq context and kind of assumes that this can only happen from sk_clone_lock() and the source sock object has already associated cgroup. However in cgroup v1, where network memory accounting is opt-in, the source sock can be unassociated with any cgroup and the new cloned sock can get associated with unrelated interrupted cgroup. Cgroup v2 can also suffer if the source sock object was created by process in the root cgroup or if sk_alloc() is called in irq context. The fix is to just do nothing in interrupt. WARNING: Please note that about half of the TCP sockets are allocated from the IRQ context, so, memory used by such sockets will not be accouted by the memcg. The stack trace of mem_cgroup_sk_alloc() from IRQ-context: CPU: 70 PID: 12720 Comm: ssh Tainted: 5.6.0-smp-DEV #1 Hardware name: ... Call Trace: <IRQ> dump_stack+0x57/0x75 mem_cgroup_sk_alloc+0xe9/0xf0 sk_clone_lock+0x2a7/0x420 inet_csk_clone_lock+0x1b/0x110 tcp_create_openreq_child+0x23/0x3b0 tcp_v6_syn_recv_sock+0x88/0x730 tcp_check_req+0x429/0x560 tcp_v6_rcv+0x72d/0xa40 ip6_protocol_deliver_rcu+0xc9/0x400 ip6_input+0x44/0xd0 ? ip6_protocol_deliver_rcu+0x400/0x400 ip6_rcv_finish+0x71/0x80 ipv6_rcv+0x5b/0xe0 ? ip6_sublist_rcv+0x2e0/0x2e0 process_backlog+0x108/0x1e0 net_rx_action+0x26b/0x460 __do_softirq+0x104/0x2a6 do_softirq_own_stack+0x2a/0x40 </IRQ> do_softirq.part.19+0x40/0x50 __local_bh_enable_ip+0x51/0x60 ip6_finish_output2+0x23d/0x520 ? ip6table_mangle_hook+0x55/0x160 __ip6_finish_output+0xa1/0x100 ip6_finish_output+0x30/0xd0 ip6_output+0x73/0x120 ? __ip6_finish_output+0x100/0x100 ip6_xmit+0x2e3/0x600 ? ipv6_anycast_cleanup+0x50/0x50 ? inet6_csk_route_socket+0x136/0x1e0 ? skb_free_head+0x1e/0x30 inet6_csk_xmit+0x95/0xf0 __tcp_transmit_skb+0x5b4/0xb20 __tcp_send_ack.part.60+0xa3/0x110 tcp_send_ack+0x1d/0x20 tcp_rcv_state_process+0xe64/0xe80 ? tcp_v6_connect+0x5d1/0x5f0 tcp_v6_do_rcv+0x1b1/0x3f0 ? tcp_v6_do_rcv+0x1b1/0x3f0 __release_sock+0x7f/0xd0 release_sock+0x30/0xa0 __inet_stream_connect+0x1c3/0x3b0 ? prepare_to_wait+0xb0/0xb0 inet_stream_connect+0x3b/0x60 __sys_connect+0x101/0x120 ? __sys_getsockopt+0x11b/0x140 __x64_sys_connect+0x1a/0x20 do_syscall_64+0x51/0x200 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The stack trace of mem_cgroup_sk_alloc() from IRQ-context: Fixes: 2d7580738345 ("mm: memcontrol: consolidate cgroup socket tracking") Fixes: d979a39d7242 ("cgroup: duplicate cgroup reference when cloning sockets") Signed-off-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Roman Gushchin <guro@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-10Merge branch 'for-5.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroupLinus Torvalds2-14/+28
Pull cgroup fixes from Tejun Heo: - cgroup.procs listing related fixes. It didn't interlock properly with exiting tasks leaving a short window where a cgroup has empty cgroup.procs but still can't be removed and misbehaved on short reads. - psi_show() crash fix on 32bit ino archs - Empty release_agent handling fix * 'for-5.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup1: don't call release_agent when it is "" cgroup: fix psi_show() crash on 32bit ino archs cgroup: Iterate tasks that did not finish do_exit() cgroup: cgroup_procs_next should increase position index cgroup-v1: cgroup_pidlist_next should update position index
2020-03-10Merge branch 'for-5.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds1-6/+8
Pull workqueue fixes from Tejun Heo: "Workqueue has been incorrectly round-robining per-cpu work items. Hillf's patch fixes that. The other patch documents memory-ordering properties of workqueue operations" * 'for-5.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: don't use wq_select_unbound_cpu() for bound works workqueue: Document (some) memory-ordering properties of {queue,schedule}_work()
2020-03-10workqueue: don't use wq_select_unbound_cpu() for bound worksHillf Danton1-6/+8
wq_select_unbound_cpu() is designed for unbound workqueues only, but it's wrongly called when using a bound workqueue too. Fixing this ensures work queued to a bound workqueue with cpu=WORK_CPU_UNBOUND always runs on the local CPU. Before, that would happen only if wq_unbound_cpumask happened to include it (likely almost always the case), or was empty, or we got lucky with forced round-robin placement. So restricting /sys/devices/virtual/workqueue/cpumask to a small subset of a machine's CPUs would cause some bound work items to run unexpectedly there. Fixes: ef557180447f ("workqueue: schedule WORK_CPU_UNBOUND work on wq_unbound_cpumask CPUs") Cc: stable@vger.kernel.org # v4.5+ Signed-off-by: Hillf Danton <hdanton@sina.com> [dj: massage changelog] Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Tejun Heo <tj@kernel.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: linux-kernel@vger.kernel.org Signed-off-by: Tejun Heo <tj@kernel.org>
2020-03-09pid: make ENOMEM return value more obviousChristian Brauner1-0/+8
The alloc_pid() codepath used to be simpler. With the introducation of the ability to choose specific pids in 49cb2fc42ce4 ("fork: extend clone3() to support setting a PID") it got more complex. It hasn't been super obvious that ENOMEM is returned when the pid namespace init process/child subreaper of the pid namespace has died. As can be seen from multiple attempts to improve this see e.g. [1] and most recently [2]. We regressed returning ENOMEM in [3] and [2] restored it. Let's add a comment on top explaining that this is historic and documented behavior and cannot easily be changed. [1]: 35f71bc0a09a ("fork: report pid reservation failure properly") [2]: b26ebfe12f34 ("pid: Fix error return value in some cases") [3]: 49cb2fc42ce4 ("fork: extend clone3() to support setting a PID") Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-03-09futex: Unbreak futex hashingThomas Gleixner1-2/+2
The recent futex inode life time fix changed the ordering of the futex key union struct members, but forgot to adjust the hash function accordingly, As a result the hashing omits the leading 64bit and even hashes beyond the futex key causing a bad hash distribution which led to a ~100% performance regression. Hand in the futex key pointer instead of a random struct member and make the size calculation based of the struct offset. Fixes: 8019ad13ef7f ("futex: Fix inode life-time issue") Reported-by: Rong Chen <rong.a.chen@intel.com> Decoded-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Rong Chen <rong.a.chen@intel.com> Link: https://lkml.kernel.org/r/87h7yy90ve.fsf@nanos.tec.linutronix.de
2020-03-08pid: Fix error return value in some casesCorey Minyard1-0/+2
Recent changes to alloc_pid() allow the pid number to be specified on the command line. If set_tid_size is set, then the code scanning the levels will hard-set retval to -EPERM, overriding it's previous -ENOMEM value. After the code scanning the levels, there are error returns that do not set retval, assuming it is still set to -ENOMEM. So set retval back to -ENOMEM after scanning the levels. Fixes: 49cb2fc42ce4 ("fork: extend clone3() to support setting a PID") Signed-off-by: Corey Minyard <cminyard@mvista.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: Andrei Vagin <avagin@gmail.com> Cc: Dmitry Safonov <0x7f454c46@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Adrian Reber <areber@redhat.com> Cc: <stable@vger.kernel.org> # 5.5 Link: https://lore.kernel.org/r/20200306172314.12232-1-minyard@acm.org [christian.brauner@ubuntu.com: fixup commit message] Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-03-07Merge tag 'block-5.6-2020-03-07' of git://git.kernel.dk/linux-blockLinus Torvalds1-1/+4
Pull block fixes from Jens Axboe: "Here are a few fixes that should go into this release. This contains: - Revert of a bad bcache patch from this merge window - Removed unused function (Daniel) - Fixup for the blktrace fix from Jan from this release (Cengiz) - Fix of deeper level bfqq overwrite in BFQ (Carlo)" * tag 'block-5.6-2020-03-07' of git://git.kernel.dk/linux-block: block, bfq: fix overwrite of bfq_group pointer in bfq_find_set_group() blktrace: fix dereference after null check Revert "bcache: ignore pending signals when creating gc and allocator thread" block: Remove used kblockd_schedule_work_on()
2020-03-07Merge tag 'for-linus-2020-03-07' of gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linuxLinus Torvalds2-3/+3
Pull thread fixes from Christian Brauner: "Here are a few hopefully uncontroversial fixes: - Use RCU_INIT_POINTER() when initializing rcu protected members in task_struct to fix sparse warnings. - Add pidfd_fdinfo_test binary to .gitignore file" * tag 'for-linus-2020-03-07' of gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linux: selftests: pidfd: Add pidfd_fdinfo_test in .gitignore exit: Fix Sparse errors and warnings fork: Use RCU_INIT_POINTER() instead of rcu_access_pointer()
2020-03-06futex: Fix inode life-time issuePeter Zijlstra1-36/+53
As reported by Jann, ihold() does not in fact guarantee inode persistence. And instead of making it so, replace the usage of inode pointers with a per boot, machine wide, unique inode identifier. This sequence number is global, but shared (file backed) futexes are rare enough that this should not become a performance issue. Reported-by: Jann Horn <jannh@google.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2020-03-05blktrace: fix dereference after null checkCengiz Can1-1/+4
There was a recent change in blktrace.c that added a RCU protection to `q->blk_trace` in order to fix a use-after-free issue during access. However the change missed an edge case that can lead to dereferencing of `bt` pointer even when it's NULL: Coverity static analyzer marked this as a FORWARD_NULL issue with CID 1460458. ``` /kernel/trace/blktrace.c: 1904 in sysfs_blk_trace_attr_store() 1898 ret = 0; 1899 if (bt == NULL) 1900 ret = blk_trace_setup_queue(q, bdev); 1901 1902 if (ret == 0) { 1903 if (attr == &dev_attr_act_mask) >>> CID 1460458: Null pointer dereferences (FORWARD_NULL) >>> Dereferencing null pointer "bt". 1904 bt->act_mask = value; 1905 else if (attr == &dev_attr_pid) 1906 bt->pid = value; 1907 else if (attr == &dev_attr_start_lba) 1908 bt->start_lba = value; 1909 else if (attr == &dev_attr_end_lba) ``` Added a reassignment with RCU annotation to fix the issue. Fixes: c780e86dd48 ("blktrace: Protect q->blk_trace with RCU") Cc: stable@vger.kernel.org Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Bob Liu <bob.liu@oracle.com> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Cengiz Can <cengiz@kernel.wtf> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-04cgroup1: don't call release_agent when it is ""Tycho Andersen1-1/+1
Older (and maybe current) versions of systemd set release_agent to "" when shutting down, but do not set notify_on_release to 0. Since 64e90a8acb85 ("Introduce STATIC_USERMODEHELPER to mediate call_usermodehelper()"), we filter out such calls when the user mode helper path is "". However, when used in conjunction with an actual (i.e. non "") STATIC_USERMODEHELPER, the path is never "", so the real usermode helper will be called with argv[0] == "". Let's avoid this by not invoking the release_agent when it is "". Signed-off-by: Tycho Andersen <tycho@tycho.ws> Signed-off-by: Tejun Heo <tj@kernel.org>
2020-03-04cgroup: fix psi_show() crash on 32bit ino archsQian Cai1-3/+3
Similar to the commit d7495343228f ("cgroup: fix incorrect WARN_ON_ONCE() in cgroup_setup_root()"), cgroup_id(root_cgrp) does not equal to 1 on 32bit ino archs which triggers all sorts of issues with psi_show() on s390x. For example, BUG: KASAN: slab-out-of-bounds in collect_percpu_times+0x2d0/ Read of size 4 at addr 000000001e0ce000 by task read_all/3667 collect_percpu_times+0x2d0/0x798 psi_show+0x7c/0x2a8 seq_read+0x2ac/0x830 vfs_read+0x92/0x150 ksys_read+0xe2/0x188 system_call+0xd8/0x2b4 Fix it by using cgroup_ino(). Fixes: 743210386c03 ("cgroup: use cgrp->kn->id as the cgroup ID") Signed-off-by: Qian Cai <cai@lca.pw> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org # v5.5
2020-03-03sys/sysinfo: Respect boottime inside time namespaceCyril Hrubis1-0/+2
The sysinfo() syscall includes uptime in seconds but has no correction for time namespaces which makes it inconsistent with the /proc/uptime inside of a time namespace. Add the missing time namespace adjustment call. Signed-off-by: Cyril Hrubis <chrubis@suse.cz> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Dmitry Safonov <dima@arista.com> Link: https://lkml.kernel.org/r/20200303150638.7329-1-chrubis@suse.cz
2020-03-02Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-0/+2
Pull scheduler fix from Ingo Molnar: "Fix a scheduler statistics bug" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Fix statistics for find_idlest_group()
2020-02-28Merge tag 'block-5.6-2020-02-28' of git://git.kernel.dk/linux-blockLinus Torvalds1-31/+83
Pull block fixes from Jens Axboe: - Passthrough insertion fix (Ming) - Kill off some unused arguments (John) - blktrace RCU fix (Jan) - Dead fields removal for null_blk (Dongli) - NVMe polled IO fix (Bijan) * tag 'block-5.6-2020-02-28' of git://git.kernel.dk/linux-block: nvme-pci: Hold cq_poll_lock while completing CQEs blk-mq: Remove some unused function arguments null_blk: remove unused fields in 'nullb_cmd' blktrace: Protect q->blk_trace with RCU blk-mq: insert passthrough request into hctx->dispatch directly
2020-02-28Merge tag 'pm-5.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pmLinus Torvalds1-1/+1
Pull power management fixes from Rafael Wysocki: "Fix a recent cpufreq initialization regression (Rafael Wysocki), revert a devfreq commit that made incompatible changes and broke user land on some systems (Orson Zhai), drop a stale reference to a document that has gone away recently (Jonathan Neuschäfer), and fix a typo in a hibernation code comment (Alexandre Belloni)" * tag 'pm-5.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: cpufreq: Fix policy initialization for internal governor drivers Revert "PM / devfreq: Modify the device name as devfreq(X) for sysfs" PM / hibernate: fix typo "reserverd_size" -> "reserved_size" Documentation: power: Drop reference to interface.rst
2020-02-28exit: Fix Sparse errors and warningsMadhuparna Bhowmik1-2/+2
This patch fixes the following sparse error: kernel/exit.c:627:25: error: incompatible types in comparison expression And the following warning: kernel/exit.c:626:40: warning: incorrect type in assignment Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> [christian.brauner@ubuntu.com: edit commit message] Link: https://lore.kernel.org/r/20200130062028.4870-1-madhuparnabhowmik10@gmail.com Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-02-28fork: Use RCU_INIT_POINTER() instead of rcu_access_pointer()Madhuparna Bhowmik1-1/+1
Use RCU_INIT_POINTER() instead of rcu_access_pointer() in copy_sighand(). Suggested-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> [christian.brauner@ubuntu.com: edit commit message] Link: https://lore.kernel.org/r/20200127175821.10833-1-madhuparnabhowmik10@gmail.com Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-02-28Merge branches 'pm-sleep' and 'pm-devfreq'Rafael J. Wysocki1-1/+1
* pm-sleep: PM / hibernate: fix typo "reserverd_size" -> "reserved_size" Documentation: power: Drop reference to interface.rst * pm-devfreq: Revert "PM / devfreq: Modify the device name as devfreq(X) for sysfs"
2020-02-27Merge tag 'audit-pr-20200226' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/auditLinus Torvalds2-51/+60
Pull audit fixes from Paul Moore: "Two fixes for problems found by syzbot: - Moving audit filter structure fields into a union caused some problems in the code which populates that filter structure. We keep the union (that idea is a good one), but we are fixing the code so that it doesn't needlessly set fields in the union and mess up the error handling. - The audit_receive_msg() function wasn't validating user input as well as it should in all cases, we add the necessary checks" * tag 'audit-pr-20200226' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit: audit: always check the netlink payload length in audit_receive_msg() audit: fix error handling in audit_data_to_entry()
2020-02-27sched/fair: Fix statistics for find_idlest_group()Vincent Guittot1-0/+2
sgs->group_weight is not set while gathering statistics in update_sg_wakeup_stats(). This means that a group can be classified as fully busy with 0 running tasks if utilization is high enough. This path is mainly used for fork and exec. Fixes: 57abff067a08 ("sched/fair: Rework find_idlest_group()") Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Mel Gorman <mgorman@techsingularity.net> Link: https://lore.kernel.org/r/20200218144534.4564-1-vincent.guittot@linaro.org
2020-02-26Merge tag 'trace-v5.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-traceLinus Torvalds4-35/+127
Pull tracing and bootconfig updates: "Fixes and changes to bootconfig before it goes live in a release. Change in API of bootconfig (before it comes live in a release): - Have a magic value "BOOTCONFIG" in initrd to know a bootconfig exists - Set CONFIG_BOOT_CONFIG to 'n' by default - Show error if "bootconfig" on cmdline but not compiled in - Prevent redefining the same value - Have a way to append values - Added a SELECT BLK_DEV_INITRD to fix a build failure Synthetic event fixes: - Switch to raw_smp_processor_id() for recording CPU value in preempt section. (No care for what the value actually is) - Fix samples always recording u64 values - Fix endianess - Check number of values matches number of fields - Fix a printing bug Fix of trace_printk() breaking postponed start up tests Make a function static that is only used in a single file" * tag 'trace-v5.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: bootconfig: Fix CONFIG_BOOTTIME_TRACING dependency issue bootconfig: Add append value operator support bootconfig: Prohibit re-defining value on same key bootconfig: Print array as multiple commands for legacy command line bootconfig: Reject subkey and value on same parent key tools/bootconfig: Remove unneeded error message silencer bootconfig: Add bootconfig magic word for indicating bootconfig explicitly bootconfig: Set CONFIG_BOOT_CONFIG=n by default tracing: Clear trace_state when starting trace bootconfig: Mark boot_config_checksum() static tracing: Disable trace_printk() on post poned tests tracing: Have synthetic event test use raw_smp_processor_id() tracing: Fix number printing bug in print_synth_event() tracing: Check that number of vals matches number of synth event fields tracing: Make synth_event trace functions endian-correct tracing: Make sure synth_event_trace() example always uses u64
2020-02-26signal: avoid double atomic counter increments for user accountingLinus Torvalds1-9/+14
When queueing a signal, we increment both the users count of pending signals (for RLIMIT_SIGPENDING tracking) and we increment the refcount of the user struct itself (because we keep a reference to the user in the signal structure in order to correctly account for it when freeing). That turns out to be fairly expensive, because both of them are atomic updates, and particularly under extreme signal handling pressure on big machines, you can get a lot of cache contention on the user struct. That can then cause horrid cacheline ping-pong when you do these multiple accesses. So change the reference counting to only pin the user for the _first_ pending signal, and to unpin it when the last pending signal is dequeued. That means that when a user sees a lot of concurrent signal queuing - which is the only situation when this matters - the only atomic access needed is generally the 'sigpending' count update. This was noticed because of a particularly odd timing artifact on a dual-socket 96C/192T Cascade Lake platform: when you get into bad contention, on that machine for some reason seems to be much worse when the contention happens in the upper 32-byte half of the cacheline. As a result, the kernel test robot will-it-scale 'signal1' benchmark had an odd performance regression simply due to random alignment of the 'struct user_struct' (and pointed to a completely unrelated and apparently nonsensical commit for the regression). Avoiding the double increments (and decrements on the dequeueing side, of course) makes for much less contention and hugely improved performance on that will-it-scale microbenchmark. Quoting Feng Tang: "It makes a big difference, that the performance score is tripled! bump from original 17000 to 54000. Also the gap between 5.0-rc6 and 5.0-rc6+Jiri's patch is reduced to around 2%" [ The "2% gap" is the odd cacheline placement difference on that platform: under the extreme contention case, the effect of which half of the cacheline was hot was 5%, so with the reduced contention the odd timing artifact is reduced too ] It does help in the non-contended case too, but is not nearly as noticeable. Reported-and-tested-by: Feng Tang <feng.tang@intel.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Huang, Ying <ying.huang@intel.com> Cc: Philip Li <philip.li@intel.com> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-25bootconfig: Fix CONFIG_BOOTTIME_TRACING dependency issueMasami Hiramatsu1-1/+0
Since commit d8a953ddde5e ("bootconfig: Set CONFIG_BOOT_CONFIG=n by default") also changed the CONFIG_BOOTTIME_TRACING to select CONFIG_BOOT_CONFIG to show the boot-time tracing on the menu, it introduced wrong dependencies with BLK_DEV_INITRD as below. WARNING: unmet direct dependencies detected for BOOT_CONFIG Depends on [n]: BLK_DEV_INITRD [=n] Selected by [y]: - BOOTTIME_TRACING [=y] && TRACING_SUPPORT [=y] && FTRACE [=y] && TRACING [=y] This makes the CONFIG_BOOT_CONFIG selects CONFIG_BLK_DEV_INITRD to fix this error and make CONFIG_BOOTTIME_TRACING=n by default, so that both boot-time tracing and boot configuration off but those appear on the menu list. Link: http://lkml.kernel.org/r/158264140162.23842.11237423518607465535.stgit@devnote2 Fixes: d8a953ddde5e ("bootconfig: Set CONFIG_BOOT_CONFIG=n by default") Reported-by: Randy Dunlap <rdunlap@infradead.org> Compiled-tested-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-02-25blktrace: Protect q->blk_trace with RCUJan Kara1-31/+83
KASAN is reporting that __blk_add_trace() has a use-after-free issue when accessing q->blk_trace. Indeed the switching of block tracing (and thus eventual freeing of q->blk_trace) is completely unsynchronized with the currently running tracing and thus it can happen that the blk_trace structure is being freed just while __blk_add_trace() works on it. Protect accesses to q->blk_trace by RCU during tracing and make sure we wait for the end of RCU grace period when shutting down tracing. Luckily that is rare enough event that we can afford that. Note that postponing the freeing of blk_trace to an RCU callback should better be avoided as it could have unexpected user visible side-effects as debugfs files would be still existing for a short while block tracing has been shut down. Link: https://bugzilla.kernel.org/show_bug.cgi?id=205711 CC: stable@vger.kernel.org Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Tested-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reported-by: Tristan Madani <tristmd@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-24audit: always check the netlink payload length in audit_receive_msg()Paul Moore1-19/+21
This patch ensures that we always check the netlink payload length in audit_receive_msg() before we take any action on the payload itself. Cc: stable@vger.kernel.org Reported-by: syzbot+399c44bf1f43b8747403@syzkaller.appspotmail.com Reported-by: syzbot+e4b12d8d202701f08b6d@syzkaller.appspotmail.com Signed-off-by: Paul Moore <paul@paul-moore.com>
2020-02-22audit: fix error handling in audit_data_to_entry()Paul Moore1-32/+39
Commit 219ca39427bf ("audit: use union for audit_field values since they are mutually exclusive") combined a number of separate fields in the audit_field struct into a single union. Generally this worked just fine because they are generally mutually exclusive. Unfortunately in audit_data_to_entry() the overlap can be a problem when a specific error case is triggered that causes the error path code to attempt to cleanup an audit_field struct and the cleanup involves attempting to free a stored LSM string (the lsm_str field). Currently the code always has a non-NULL value in the audit_field.lsm_str field as the top of the for-loop transfers a value into audit_field.val (both .lsm_str and .val are part of the same union); if audit_data_to_entry() fails and the audit_field struct is specified to contain a LSM string, but the audit_field.lsm_str has not yet been properly set, the error handling code will attempt to free the bogus audit_field.lsm_str value that was set with audit_field.val at the top of the for-loop. This patch corrects this by ensuring that the audit_field.val is only set when needed (it is cleared when the audit_field struct is allocated with kcalloc()). It also corrects a few other issues to ensure that in case of error the proper error code is returned. Cc: stable@vger.kernel.org Fixes: 219ca39427bf ("audit: use union for audit_field values since they are mutually exclusive") Reported-by: syzbot+1f4d90ead370d72e450b@syzkaller.appspotmail.com Signed-off-by: Paul Moore <paul@paul-moore.com>
2020-02-22Merge tag 'irq-urgent-2020-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds3-18/+24
Pull irq fixes from Thomas Gleixner: "Two fixes for the irq core code which are follow ups to the recent MSI fixes: - The WARN_ON which was put into the MSI setaffinity callback for paranoia reasons actually triggered via a callchain which escaped when all the possible ways to reach that code were analyzed. The proc/irq/$N/*affinity interfaces have a quirk which came in when ALPHA moved to the generic interface: In case that the written affinity mask does not contain any online CPU it calls into ALPHAs magic auto affinity setting code. A few years later this mechanism was also made available to x86 for no good reasons and in a way which circumvents all sanity checks for interrupts which cannot have their affinity set from process context on X86 due to the way the X86 interrupt delivery works. It would be possible to make this work properly, but there is no point in doing so. If the interrupt is not yet started then the affinity setting has no effect and if it is started already then it is already assigned to an online CPU so there is no point to randomly move it to some other CPU. Just return EINVAL as the code has done before that change forever. - The new MSI quirk bit in the irq domain flags turned out to be already occupied, which escaped the author and the reviewers because the already in use bits were 0,6,2,3,4,5 listed in that order. That bit 6 was simply overlooked because the ordering was straight forward linear otherwise. So the new bit ended up being a duplicate. Fix it up by switching the oddball 6 to the obvious 1" * tag 'irq-urgent-2020-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq/irqdomain: Make sure all irq domain flags are distinct genirq/proc: Reject invalid affinity masks (again)
2020-02-22Merge tag 's390-5.6-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linuxLinus Torvalds1-9/+0
Pull s390 fixes from Vasily Gorbik: - Remove ieee_emulation_warnings sysctl which is a dead code. - Avoid triggering rebuild of the kernel during make install. - Enable protected virtualization guest support in default configs. - Fix cio_ignore seq_file .next function to increase position index. And use kobj_to_dev instead of container_of in cio code. - Fix storage block address lists to contain absolute addresses in qdio code. - Few clang warnings and spelling fixes. * tag 's390-5.6-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: s390/qdio: fill SBALEs with absolute addresses s390/qdio: fill SL with absolute addresses s390: remove obsolete ieee_emulation_warnings s390: make 'install' not depend on vmlinux s390/kaslr: Fix casts in get_random s390/mm: Explicitly compare PAGE_DEFAULT_KEY against zero in storage_key_init_range s390/pkey/zcrypt: spelling s/crytp/crypt/ s390/cio: use kobj_to_dev() API s390/defconfig: enable CONFIG_PROTECTED_VIRTUALIZATION_GUEST s390/cio: cio_ignore_proc_seq_next should increase position index
2020-02-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds3-9/+57
Pull networking fixes from David Miller: 1) Limit xt_hashlimit hash table size to avoid OOM or hung tasks, from Cong Wang. 2) Fix deadlock in xsk by publishing global consumer pointers when NAPI is finished, from Magnus Karlsson. 3) Set table field properly to RT_TABLE_COMPAT when necessary, from Jethro Beekman. 4) NLA_STRING attributes are not necessary NULL terminated, deal wiht that in IFLA_ALT_IFNAME. From Eric Dumazet. 5) Fix checksum handling in atlantic driver, from Dmitry Bezrukov. 6) Handle mtu==0 devices properly in wireguard, from Jason A. Donenfeld. 7) Fix several lockdep warnings in bonding, from Taehee Yoo. 8) Fix cls_flower port blocking, from Jason Baron. 9) Sanitize internal map names in libbpf, from Toke Høiland-Jørgensen. 10) Fix RDMA race in qede driver, from Michal Kalderon. 11) Fix several false lockdep warnings by adding conditions to list_for_each_entry_rcu(), from Madhuparna Bhowmik. 12) Fix sleep in atomic in mlx5 driver, from Huy Nguyen. 13) Fix potential deadlock in bpf_map_do_batch(), from Yonghong Song. 14) Hey, variables declared in switch statement before any case statements are not initialized. I learn something every day. Get rids of this stuff in several parts of the networking, from Kees Cook. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (99 commits) bnxt_en: Issue PCIe FLR in kdump kernel to cleanup pending DMAs. bnxt_en: Improve device shutdown method. net: netlink: cap max groups which will be considered in netlink_bind() net: thunderx: workaround BGX TX Underflow issue ionic: fix fw_status read net: disable BRIDGE_NETFILTER by default net: macb: Properly handle phylink on at91rm9200 s390/qeth: fix off-by-one in RX copybreak check s390/qeth: don't warn for napi with 0 budget s390/qeth: vnicc Fix EOPNOTSUPP precedence openvswitch: Distribute switch variables for initialization net: ip6_gre: Distribute switch variables for initialization net: core: Distribute switch variables for initialization udp: rehash on disconnect net/tls: Fix to avoid gettig invalid tls record bpf: Fix a potential deadlock with bpf_map_do_batch bpf: Do not grab the bucket spinlock by default on htab batch ops ice: Wait for VF to be reset/ready before configuration ice: Don't tell the OS that link is going down ice: Don't reject odd values of usecs set by user ...
2020-02-21y2038: remove unused time32 interfacesArnd Bergmann2-107/+0
No users remain, so kill these off before we grow new ones. Link: http://lkml.kernel.org/r/20200110154232.4104492-3-arnd@arndb.de Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Deepa Dinamani <deepa.kernel@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-20bootconfig: Set CONFIG_BOOT_CONFIG=n by defaultMasami Hiramatsu1-1/+2
Set CONFIG_BOOT_CONFIG=n by default. This also warns user if CONFIG_BOOT_CONFIG=n but "bootconfig" is given in the kernel command line. Link: http://lkml.kernel.org/r/158220111291.26565.9036889083940367969.stgit@devnote2 Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-02-20tracing: Clear trace_state when starting traceMasami Hiramatsu1-2/+2
Clear trace_state data structure when starting trace in __synth_event_trace_start() internal function. Currently trace_state is initialized only in the synth_event_trace_start() API, but the trace_state in synth_event_trace() and synth_event_trace_array() are on the stack without initialization. This means those APIs will see wrong parameters and wil skip closing process in __synth_event_trace_end() because trace_state->disabled may be !0. Link: http://lkml.kernel.org/r/158193315899.8868.1781259176894639952.stgit@devnote2 Reviewed-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-02-20tracing: Disable trace_printk() on post poned testsSteven Rostedt (VMware)1-0/+2
The tracing seftests checks various aspects of the tracing infrastructure, and one is filtering. If trace_printk() is active during a self test, it can cause the filtering to fail, which will disable that part of the trace. To keep the selftests from failing because of trace_printk() calls, trace_printk() checks the variable tracing_selftest_running, and if set, it does not write to the tracing buffer. As some tracers were registered earlier in boot, the selftest they triggered would fail because not all the infrastructure was set up for the full selftest. Thus, some of the tests were post poned to when their infrastructure was ready (namely file system code). The postpone code did not set the tracing_seftest_running variable, and could fail if a trace_printk() was added and executed during their run. Cc: stable@vger.kernel.org Fixes: 9afecfbb95198 ("tracing: Postpone tracer start-up tests till the system is more robust") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-02-20tracing: Have synthetic event test use raw_smp_processor_id()Steven Rostedt (VMware)1-6/+6
The test code that tests synthetic event creation pushes in as one of its test fields the current CPU using "smp_processor_id()". As this is just something to see if the value is correctly passed in, and the actual CPU used does not matter, use raw_smp_processor_id(), otherwise with debug preemption enabled, a warning happens as the smp_processor_id() is called without preemption enabled. Link: http://lkml.kernel.org/r/20200220162950.35162579@gandalf.local.home Reviewed-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-02-20tracing: Fix number printing bug in print_synth_event()Tom Zanussi1-3/+29
Fix a varargs-related bug in print_synth_event() which resulted in strange output and oopses on 32-bit x86 systems. The problem is that trace_seq_printf() expects the varargs to match the format string, but print_synth_event() was always passing u64 values regardless. This results in unspecified behavior when unpacking with va_arg() in trace_seq_printf(). Add a function that takes the size into account when calling trace_seq_printf(). Before: modprobe-1731 [003] .... 919.039758: gen_synth_test: next_pid_field=777(null)next_comm_field=hula hoops ts_ns=1000000 ts_ms=1000 cpu=3(null)my_string_field=thneed my_int_field=598(null) After: insmod-1136 [001] .... 36.634590: gen_synth_test: next_pid_field=777 next_comm_field=hula hoops ts_ns=1000000 ts_ms=1000 cpu=1 my_string_field=thneed my_int_field=598 Link: http://lkml.kernel.org/r/a9b59eb515dbbd7d4abe53b347dccf7a8e285657.1581720155.git.zanussi@kernel.org Reported-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-02-20tracing: Check that number of vals matches number of synth event fieldsTom Zanussi1-2/+12
Commit 7276531d4036('tracing: Consolidate trace() functions') inadvertently dropped the synth_event_trace() and synth_event_trace_array() checks that verify the number of values passed in matches the number of fields in the synthetic event being traced, so add them back. Link: http://lkml.kernel.org/r/32819cac708714693669e0dfe10fe9d935e94a16.1581720155.git.zanussi@kernel.org Signed-off-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-02-20tracing: Make synth_event trace functions endian-correctTom Zanussi1-4/+58
synth_event_trace(), synth_event_trace_array() and __synth_event_add_val() write directly into the trace buffer and need to take endianness into account, like trace_event_raw_event_synth() does. Link: http://lkml.kernel.org/r/2011354355e405af9c9d28abba430d1f5ff7771a.1581720155.git.zanussi@kernel.org Signed-off-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-02-20tracing: Make sure synth_event_trace() example always uses u64Tom Zanussi1-17/+17
synth_event_trace() is the varargs version of synth_event_trace_array(), which takes an array of u64, as do synth_event_add_val() et al. To not only be consistent with those, but also to address the fact that synth_event_trace() expects every arg to be of the same type since it doesn't also pass in e.g. a format string, the caller needs to make sure all args are of the same type, u64. u64 is used because it needs to accomodate the largest type available in synthetic events, which is u64. This fixes the bug reported by the kernel test robot/Rong Chen. Link: https://lore.kernel.org/lkml/20200212113444.GS12867@shao2-debian/ Link: http://lkml.kernel.org/r/894c4e955558b521210ee0642ba194a9e603354c.1581720155.git.zanussi@kernel.org Fixes: 9fe41efaca084 ("tracing: Add synth event generation test module") Reported-by: kernel test robot <rong.a.chen@intel.com> Signed-off-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-02-20PM / hibernate: fix typo "reserverd_size" -> "reserved_size"Alexandre Belloni1-1/+1
Fix a mistake in a variable name in a comment. Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-02-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller3-9/+57
Alexei Starovoitov says: ==================== pull-request: bpf 2020-02-19 The following pull-request contains BPF updates for your *net* tree. We've added 10 non-merge commits during the last 10 day(s) which contain a total of 10 files changed, 93 insertions(+), 31 deletions(-). The main changes are: 1) batched bpf hashtab fixes from Brian and Yonghong. 2) various selftests and libbpf fixes. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-19bpf: Fix a potential deadlock with bpf_map_do_batchYonghong Song1-3/+31
Commit 057996380a42 ("bpf: Add batch ops to all htab bpf map") added lookup_and_delete batch operation for hash table. The current implementation has bpf_lru_push_free() inside the bucket lock, which may cause a deadlock. syzbot reports: -> #2 (&htab->buckets[i].lock#2){....}: __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline] _raw_spin_lock_irqsave+0x95/0xcd kernel/locking/spinlock.c:159 htab_lru_map_delete_node+0xce/0x2f0 kernel/bpf/hashtab.c:593 __bpf_lru_list_shrink_inactive kernel/bpf/bpf_lru_list.c:220 [inline] __bpf_lru_list_shrink+0xf9/0x470 kernel/bpf/bpf_lru_list.c:266 bpf_lru_list_pop_free_to_local kernel/bpf/bpf_lru_list.c:340 [inline] bpf_common_lru_pop_free kernel/bpf/bpf_lru_list.c:447 [inline] bpf_lru_pop_free+0x87c/0x1670 kernel/bpf/bpf_lru_list.c:499 prealloc_lru_pop+0x2c/0xa0 kernel/bpf/hashtab.c:132 __htab_lru_percpu_map_update_elem+0x67e/0xa90 kernel/bpf/hashtab.c:1069 bpf_percpu_hash_update+0x16e/0x210 kernel/bpf/hashtab.c:1585 bpf_map_update_value.isra.0+0x2d7/0x8e0 kernel/bpf/syscall.c:181 generic_map_update_batch+0x41f/0x610 kernel/bpf/syscall.c:1319 bpf_map_do_batch+0x3f5/0x510 kernel/bpf/syscall.c:3348 __do_sys_bpf+0x9b7/0x41e0 kernel/bpf/syscall.c:3460 __se_sys_bpf kernel/bpf/syscall.c:3355 [inline] __x64_sys_bpf+0x73/0xb0 kernel/bpf/syscall.c:3355 do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294 entry_SYSCALL_64_after_hwframe+0x49/0xbe -> #0 (&loc_l->lock){....}: check_prev_add kernel/locking/lockdep.c:2475 [inline] check_prevs_add kernel/locking/lockdep.c:2580 [inline] validate_chain kernel/locking/lockdep.c:2970 [inline] __lock_acquire+0x2596/0x4a00 kernel/locking/lockdep.c:3954 lock_acquire+0x190/0x410 kernel/locking/lockdep.c:4484 __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline] _raw_spin_lock_irqsave+0x95/0xcd kernel/locking/spinlock.c:159 bpf_common_lru_push_free kernel/bpf/bpf_lru_list.c:516 [inline] bpf_lru_push_free+0x250/0x5b0 kernel/bpf/bpf_lru_list.c:555 __htab_map_lookup_and_delete_batch+0x8d4/0x1540 kernel/bpf/hashtab.c:1374 htab_lru_map_lookup_and_delete_batch+0x34/0x40 kernel/bpf/hashtab.c:1491 bpf_map_do_batch+0x3f5/0x510 kernel/bpf/syscall.c:3348 __do_sys_bpf+0x1f7d/0x41e0 kernel/bpf/syscall.c:3456 __se_sys_bpf kernel/bpf/syscall.c:3355 [inline] __x64_sys_bpf+0x73/0xb0 kernel/bpf/syscall.c:3355 do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294 entry_SYSCALL_64_after_hwframe+0x49/0xbe Possible unsafe locking scenario: CPU0 CPU2 ---- ---- lock(&htab->buckets[i].lock#2); lock(&l->lock); lock(&htab->buckets[i].lock#2); lock(&loc_l->lock); *** DEADLOCK *** To fix the issue, for htab_lru_map_lookup_and_delete_batch() in CPU0, let us do bpf_lru_push_free() out of the htab bucket lock. This can avoid the above deadlock scenario. Fixes: 057996380a42 ("bpf: Add batch ops to all htab bpf map") Reported-by: syzbot+a38ff3d9356388f2fb83@syzkaller.appspotmail.com Reported-by: syzbot+122b5421d14e68f29cd1@syzkaller.appspotmail.com Suggested-by: Hillf Danton <hdanton@sina.com> Suggested-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Acked-by: Brian Vazquez <brianvv@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200219234757.3544014-1-yhs@fb.com
2020-02-19bpf: Do not grab the bucket spinlock by default on htab batch opsBrian Vazquez1-2/+22
Grabbing the spinlock for every bucket even if it's empty, was causing significant perfomance cost when traversing htab maps that have only a few entries. This patch addresses the issue by checking first the bucket_cnt, if the bucket has some entries then we go and grab the spinlock and proceed with the batching. Tested with a htab of size 50K and different value of populated entries. Before: Benchmark Time(ns) CPU(ns) --------------------------------------------- BM_DumpHashMap/1 2759655 2752033 BM_DumpHashMap/10 2933722 2930825 BM_DumpHashMap/200 3171680 3170265 BM_DumpHashMap/500 3639607 3635511 BM_DumpHashMap/1000 4369008 4364981 BM_DumpHashMap/5k 11171919 11134028 BM_DumpHashMap/20k 69150080 69033496 BM_DumpHashMap/39k 190501036 190226162 After: Benchmark Time(ns) CPU(ns) --------------------------------------------- BM_DumpHashMap/1 202707 200109 BM_DumpHashMap/10 213441 210569 BM_DumpHashMap/200 478641 472350 BM_DumpHashMap/500 980061 967102 BM_DumpHashMap/1000 1863835 1839575 BM_DumpHashMap/5k 8961836 8902540 BM_DumpHashMap/20k 69761497 69322756 BM_DumpHashMap/39k 187437830 186551111 Fixes: 057996380a42 ("bpf: Add batch ops to all htab bpf map") Signed-off-by: Brian Vazquez <brianvv@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20200218172552.215077-1-brianvv@google.com