path: root/kernel
Age | Commit message | Author | Files | Lines
2018-07-24Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/netDavid S. Miller4-13/+55
2018-07-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds1-6/+10
Pull networking fixes from David Miller: 1) Handle stations tied to AP_VLANs properly during mac80211 hw reconfig. From Manikanta Pubbisetty. 2) Fix jump stack depth validation in nf_tables, from Taehee Yoo. 3) Fix quota handling in aRFS flow expiration of mlx5 driver, from Eran Ben Elisha. 4) Exit path handling fix in powerpc64 BPF JIT, from Daniel Borkmann. 5) Use ptr_ring_consume_bh() in page pool code, from Tariq Toukan. 6) Fix cached netdev name leak in nf_tables, from Florian Westphal. 7) Fix memory leaks on chain rename, also from Florian Westphal. 8) Several fixes to DCTCP congestion control ACK handling, from Yuchunk Cheng. 9) Missing rcu_read_unlock() in CAIF protocol code, from Yue Haibing. 10) Fix link local address handling with VRF, from David Ahern. 11) Don't clobber 'err' on a successful call to __skb_linearize() in skb_segment(). From Eric Dumazet. 12) Fix vxlan fdb notification races, from Roopa Prabhu. 13) Hash UDP fragments consistently, from Paolo Abeni. 14) If TCP receives lots of out of order tiny packets, we do really silly stuff. Make the out-of-order queue ending more robust to this kind of behavior, from Eric Dumazet. 15) Don't leak netlink dump state in nf_tables, from Florian Westphal. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (76 commits) net: axienet: Fix double deregister of mdio qmi_wwan: fix interface number for DW5821e production firmware ip: in cmsg IP(V6)_ORIGDSTADDR call pskb_may_pull bnx2x: Fix invalid memory access in rss hash config path. net/mlx4_core: Save the qpn from the input modifier in RST2INIT wrapper r8169: restore previous behavior to accept BIOS WoL settings cfg80211: never ignore user regulatory hint sock: fix sg page frag coalescing in sk_alloc_sg netfilter: nf_tables: move dumper state allocation into ->start tcp: add tcp_ooo_try_coalesce() helper tcp: call tcp_drop() from tcp_data_queue_ofo() tcp: detect malicious patterns in tcp_collapse_ofo_queue() tcp: avoid collapses in tcp_prune_queue() if possible tcp: free batches of packets in tcp_prune_ofo_queue() ip: hash fragments consistently ipv6: use fib6_info_hold_safe() when necessary can: xilinx_can: fix power management handling can: xilinx_can: fix incorrect clear of non-processed interrupts can: xilinx_can: fix RX overflow interrupt not being enabled can: xilinx_can: keep only 1-2 frames in TX FIFO to fix TX accounting ...
2018-07-21Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds2-2/+15
Pull scheduler fixes from Ingo Molnar: "Two fixes: a stop-machine preemption fix and a SCHED_DEADLINE fix" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/deadline: Fix switched_from_dl() warning stop_machine: Disable preemption when waking two stopper threads
2018-07-21mm: make vm_area_alloc() initialize core fieldsLinus Torvalds1-2/+8
Like vm_area_dup(), it initializes the anon_vma_chain head, and the basic mm pointer. The rest of the fields end up being different for different users, although the plan is to also initialize the 'vm_ops' field to a dummy entry. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
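A minimal sketch of what the described initialization amounts to, based on the commit text above (the real kernel function may differ in detail):

    struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
    {
            struct vm_area_struct *vma;

            vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
            if (vma) {
                    vma->vm_mm = mm;                        /* the basic mm pointer */
                    INIT_LIST_HEAD(&vma->anon_vma_chain);   /* the anon_vma_chain head */
            }
            return vma;
    }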
2018-07-21mm: make vm_area_dup() actually copy the old vma dataLinus Torvalds1-3/+7
.. and re-initialize the anon_vma_chain head. This removes some boiler-plate from the users, and also makes it clear why it didn't need to use the 'zalloc()' version. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-07-21mm: use helper functions for allocating and freeing vm_area structsLinus Torvalds1-3/+18
The vm_area_struct is one of the most fundamental memory management objects, but the management of it is entirely open-coded everywhere, ranging from allocation and freeing (using kmem_cache_[z]alloc and kmem_cache_free) to initializing all the fields. We want to unify this in order to end up having some unified initialization of the vmas, and the first step to this is to at least have basic allocation functions. Right now those functions are literally just wrappers around the kmem_cache_*() calls. This is a purely mechanical conversion:

    # new vma:
    kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL) -> vm_area_alloc()
    # copy old vma
    kmem_cache_alloc(vm_area_cachep, GFP_KERNEL) -> vm_area_dup(old)
    # free vma
    kmem_cache_free(vm_area_cachep, vma) -> vm_area_free(vma)

to the point where the old vma passed in to the vm_area_dup() function isn't even used yet (because I've left all the old manual initialization alone). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
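As described above, the helpers start out as literal wrappers; a sketch with simplified, illustrative signatures (the two entries above then teach them to initialize the core fields):

    struct vm_area_struct *vm_area_alloc(void)
    {
            return kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
    }

    struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
    {
            /* 'orig' is not used yet; callers still do the manual copy/init */
            return kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
    }

    void vm_area_free(struct vm_area_struct *vma)
    {
            kmem_cache_free(vm_area_cachep, vma);
    }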
2018-07-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller5-58/+179
Daniel Borkmann says: ==================== pull-request: bpf-next 2018-07-20 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) Add sharing of BPF objects within one ASIC: this allows for reuse of the same program on multiple ports of a device, and therefore gains better code store utilization. On top of that, this now also enables sharing of maps between programs attached to different ports of a device, from Jakub. 2) Cleanup in libbpf and bpftool's Makefile to reduce unneeded feature detections and unused variable exports, also from Jakub. 3) First batch of RCU annotation fixes in prog array handling, i.e. there are several __rcu markers which are not correct as well as some of the RCU handling, from Roman. 4) Two fixes in BPF sample files related to checking of the prog_cnt upper limit from sample loader, from Dan. 5) Minor cleanup in sockmap to remove a set but not used variable, from Colin. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-20kernfs: allow creating kernfs objects with arbitrary uid/gidDmitry Torokhov1-1/+3
This change allows creating kernfs files and directories with arbitrary uid/gid instead of always using GLOBAL_ROOT_UID/GID by extending kernfs_create_dir_ns() and kernfs_create_file_ns() with uid/gid arguments. The "simple" kernfs_create_file() and kernfs_create_dir() are left alone and always create objects belonging to the global root. When creating symlinks, ownership (uid/gid) is taken from the target kernfs object. Co-Developed-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
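Roughly, the extended API and the unchanged "simple" helper look like this (argument order is illustrative, not checked against the header):

    struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent,
                                             const char *name, umode_t mode,
                                             kuid_t uid, kgid_t gid,
                                             void *priv, const void *ns);

    /* the "simple" variant keeps creating objects owned by the global root */
    static inline struct kernfs_node *
    kernfs_create_dir(struct kernfs_node *parent, const char *name,
                      umode_t mode, void *priv)
    {
            return kernfs_create_dir_ns(parent, name, mode,
                                        GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
                                        priv, NULL);
    }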
2018-07-20bpf: btf: Clean up BTF_INT_BITS() in uapi btf.hMartin KaFai Lau1-6/+10
This patch shrinks the BTF_INT_BITS() mask. The current btf_int_check_meta() ensures the nr_bits of an integer cannot exceed 64. Hence, it is mostly an uapi cleanup. The actual btf usage (i.e. seq_show()) is also modified to use u8 instead of u16. The verification (e.g. btf_int_check_meta()) path stays as is to deal with invalid BTF situation. Fixes: 69b693f0aefa ("bpf: btf: Introduce BPF Type Format (BTF)") Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
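The shrink amounts to narrowing the mask; the before/after values below are an assumption based on the description (nr_bits can never exceed 64, so 8 bits are enough):

    -#define BTF_INT_BITS(VAL)      ((VAL) & 0x0000ffff)
    +#define BTF_INT_BITS(VAL)      ((VAL) & 0x000000ff)

    -       u16 nr_bits = BTF_INT_BITS(int_data);
    +       u8 nr_bits = BTF_INT_BITS(int_data);    /* seq_show() side switches to u8 */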
2018-07-18Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds6-42/+69
Pull networking fixes from David Miller: "Lots of fixes, here goes: 1) NULL deref in qtnfmac, from Gustavo A. R. Silva. 2) Kernel oops when fw download fails in rtlwifi, from Ping-Ke Shih. 3) Lost completion messages in AF_XDP, from Magnus Karlsson. 4) Correct bogus self-assignment in rhashtable, from Rishabh Bhatnagar. 5) Fix regression in ipv6 route append handling, from David Ahern. 6) Fix masking in __set_phy_supported(), from Heiner Kallweit. 7) Missing module owner set in x_tables icmp, from Florian Westphal. 8) liquidio's timeouts are HZ dependent, fix from Nicholas Mc Guire. 9) Link setting fixes for sh_eth and ravb, from Vladimir Zapolskiy. 10) Fix NULL deref when using chains in act_csum, from Davide Caratti. 11) XDP_REDIRECT needs to check if the interface is up and whether the MTU is sufficient. From Toshiaki Makita. 12) Net diag can do a double free when killing TCP_NEW_SYN_RECV connections, from Lorenzo Colitti. 13) nf_defrag in ipv6 can unnecessarily hold onto dst entries for a full minute, delaying device unregister. From Eric Dumazet. 14) Update MAC entries in the correct order in ixgbe, from Alexander Duyck. 15) Don't leave partial mangles bpf program in jit_subprogs, from Daniel Borkmann. 16) Fix pfmemalloc SKB state propagation, from Stefano Brivio. 17) Fix ACK handling in DCTCP congestion control, from Yuchung Cheng. 18) Use after free in tun XDP_TX, from Toshiaki Makita. 19) Stale ipv6 header pointer in ipv6 gre code, from Prashant Bhole. 20) Don't reuse remainder of RX page when XDP is set in mlx4, from Saeed Mahameed. 21) Fix window probe handling of TCP rapair sockets, from Stefan Baranoff. 22) Missing socket locking in smc_ioctl(), from Ursula Braun. 23) IPV6_ILA needs DST_CACHE, from Arnd Bergmann. 24) Spectre v1 fix in cxgb3, from Gustavo A. R. Silva. 25) Two spots in ipv6 do a rol32() on a hash value but ignore the result. Fixes from Colin Ian King" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (176 commits) tcp: identify cryptic messages as TCP seq # bugs ptp: fix missing break in switch hv_netvsc: Fix napi reschedule while receive completion is busy MAINTAINERS: Drop inactive Vitaly Bordug's email net: cavium: Add fine-granular dependencies on PCI net: qca_spi: Fix log level if probe fails net: qca_spi: Make sure the QCA7000 reset is triggered net: qca_spi: Avoid packet drop during initial sync ipv6: fix useless rol32 call on hash ipv6: sr: fix useless rol32 call on hash net: sched: Using NULL instead of plain integer net: usb: asix: replace mii_nway_restart in resume path net: cxgb3_main: fix potential Spectre v1 lib/rhashtable: consider param->min_size when setting initial table size net/smc: reset recv timeout after clc handshake net/smc: add error handling for get_user() net/smc: optimize consumer cursor updates net/nfc: Avoid stalls when nfc_alloc_send_skb() returned NULL. ipv6: ila: select CONFIG_DST_CACHE net: usb: rtl8150: demote allmulti message to dev_dbg() ...
2018-07-18bpf: offload: allow program and map sharing per-ASICJakub Kicinski1-7/+35
Allow programs and maps to be re-used across different netdevs, as long as they belong to the same struct bpf_offload_dev. Update the bpf_offload_prog_map_match() helper for the verifier and export a new helper for the drivers to use when checking programs at attachment time. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-18bpf: offload: keep the offload state per-ASICJakub Kicinski1-17/+64
Create a higher-level entity to represent a device/ASIC to allow programs and maps to be shared between device ports. The extra work is required to make sure we don't destroy BPF objects as soon as the netdev for which they were loaded gets destroyed, as other ports may still be using them. When netdev goes away all of its BPF objects will be moved to other netdevs of the device, and only destroyed when last netdev is unregistered. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
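Conceptually, the new hierarchy could be pictured like this; only the relationships are taken from the text above, the struct layout and field names are assumed for illustration:

    /* one per device/ASIC */
    struct bpf_offload_dev {
            struct list_head netdevs;        /* ports currently registered for offload */
    };

    /* one per port (netdev) of the device */
    struct bpf_offload_netdev {
            struct net_device *netdev;
            struct bpf_offload_dev *offdev;
            struct list_head progs;          /* BPF programs loaded for this port */
            struct list_head maps;           /* BPF maps loaded for this port */
            struct list_head offdev_netdevs; /* linkage on offdev->netdevs */
    };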
2018-07-18bpf: offload: aggregate offloads per-deviceJakub Kicinski1-46/+96
Currently we have two lists of offloaded objects - programs and maps. Netdevice unregister notifier scans those lists to orphan objects associated with the device being unregistered. This puts unnecessary (even if negligible) burden on all netdev unregister calls in a BPF-enabled kernel. The lists of objects may potentially get long, making the linear scan even more problematic. There haven't been complaints about this mechanism so far, but it is suboptimal. Instead of relying on notifiers, make the few BPF-capable drivers register explicitly for BPF offloads. The programs and maps will now be collected per-device, not on a global list, and only scanned for removal when the driver unregisters from BPF offloads. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-18bpf: offload: rename bpf_offload_dev_match() to bpf_offload_prog_map_match()Jakub Kicinski2-2/+2
A set of new API functions exported for the drivers will soon use 'bpf_offload_dev_' as a prefix. Rename the bpf_offload_dev_match() which is internal to the core (used by the verifier) to avoid any confusion. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-18bpf: sockmap: remove redundant pointer sgColin Ian King1-3/+0
Pointer sg is being assigned but is never used hence it is redundant and can be removed. Cleans up clang warning: warning: variable 'sg' set but not used [-Wunused-but-set-variable] Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-18bpf: fix rcu annotations in compute_effective_progs()Roman Gushchin1-4/+3
The progs local variable in compute_effective_progs() is marked as __rcu, which is not correct. This is a local pointer, which is initialized by bpf_prog_array_alloc(), which also now returns a generic non-rcu pointer. The real rcu-protected pointer is *array (array is a pointer to an RCU-protected pointer), so the assignment should be performed using rcu_assign_pointer(). Fixes: 324bda9e6c5a ("bpf: multi program support for cgroup+bpf") Signed-off-by: Roman Gushchin <guro@fb.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
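The resulting pattern, sketched from the description above (function and variable names chosen for illustration):

    static int update_progs_sketch(struct bpf_prog_array __rcu **array, u32 cnt)
    {
            struct bpf_prog_array *progs;   /* plain local pointer, not __rcu */

            progs = bpf_prog_array_alloc(cnt, GFP_KERNEL);
            if (!progs)
                    return -ENOMEM;

            /* fill in the array entries here, then publish it: */
            rcu_assign_pointer(*array, progs);
            return 0;
    }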
2018-07-18bpf: bpf_prog_array_alloc() should return a generic non-rcu pointerRoman Gushchin1-1/+1
Currently the return type of the bpf_prog_array_alloc() is struct bpf_prog_array __rcu *, which is not quite correct. Obviously, the returned pointer is a generic pointer, which is valid for an indefinite amount of time and it's not shared with anyone else, so there is no sense in marking it as __rcu. This change eliminates the following sparse warnings:

    kernel/bpf/core.c:1544:31: warning: incorrect type in return expression (different address spaces)
    kernel/bpf/core.c:1544:31: expected struct bpf_prog_array [noderef] <asn:4>*
    kernel/bpf/core.c:1544:31: got void *
    kernel/bpf/core.c:1548:17: warning: incorrect type in return expression (different address spaces)
    kernel/bpf/core.c:1548:17: expected struct bpf_prog_array [noderef] <asn:4>*
    kernel/bpf/core.c:1548:17: got struct bpf_prog_array *<noident>
    kernel/bpf/core.c:1681:15: warning: incorrect type in assignment (different address spaces)
    kernel/bpf/core.c:1681:15: expected struct bpf_prog_array *array
    kernel/bpf/core.c:1681:15: got struct bpf_prog_array [noderef] <asn:4>*

Fixes: 324bda9e6c5a ("bpf: multi program support for cgroup+bpf") Signed-off-by: Roman Gushchin <guro@fb.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-17Mark HI and TASKLET softirq synchronousLinus Torvalds1-4/+8
Way back in 4.9, we committed 4cd13c21b207 ("softirq: Let ksoftirqd do its job"), and ever since we've had small nagging issues with it. For example, we've had: 1ff688209e2e ("watchdog: core: make sure the watchdog_worker is not deferred") 8d5755b3f77b ("watchdog: softdog: fire watchdog even if softirqs do not get to run") 217f69743681 ("net: busy-poll: allow preemption in sk_busy_loop()") all of which worked around some of the effects of that commit. The DVB people have also complained that the commit causes excessive USB URB latencies, which seems to be due to the USB code using tasklets to schedule USB traffic. This seems to be an issue mainly when already living on the edge, but waiting for ksoftirqd to handle it really does seem to cause excessive latencies. Now Hanna Hawa reports that this issue isn't just limited to USB URB and DVB, but also causes timeout problems for the Marvell SoC team: "I'm facing kernel panic issue while running raid 5 on sata disks connected to Macchiatobin (Marvell community board with Armada-8040 SoC with 4 ARMv8 cores of CA72) Raid 5 built with Marvell DMA engine and async_tx mechanism (ASYNC_TX_DMA [=y]); the DMA driver (mv_xor_v2) uses a tasklet to clean the done descriptors from the queue" The latency problem causes a panic: mv_xor_v2 f0400000.xor: dma_sync_wait: timeout! Kernel panic - not syncing: async_tx_quiesce: DMA error waiting for transaction We've discussed simply just reverting the original commit entirely, and also much more involved solutions (with per-softirq threads etc). This patch is intentionally stupid and fairly limited, because the issue still remains, and the other solutions either got sidetracked or had other issues. We should probably also consider the timer softirqs to be synchronous and not be delayed to ksoftirqd (since they were the issue with the earlier watchdog problems), but that should be done as a separate patch. This does only the tasklet cases. Reported-and-tested-by: Hanna Hawa <hannah@marvell.com> Reported-and-tested-by: Josef Griebichler <griebichler.josef@gmx.at> Reported-by: Mauro Carvalho Chehab <mchehab@s-opensource.com> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-07-15sched/deadline: Fix switched_from_dl() warningJuri Lelli1-1/+10
Mark noticed that syzkaller is able to reliably trigger the following warning: dl_rq->running_bw > dl_rq->this_bw WARNING: CPU: 1 PID: 153 at kernel/sched/deadline.c:124 switched_from_dl+0x454/0x608 Kernel panic - not syncing: panic_on_warn set ... CPU: 1 PID: 153 Comm: syz-executor253 Not tainted 4.18.0-rc3+ #29 Hardware name: linux,dummy-virt (DT) Call trace: dump_backtrace+0x0/0x458 show_stack+0x20/0x30 dump_stack+0x180/0x250 panic+0x2dc/0x4ec __warn_printk+0x0/0x150 report_bug+0x228/0x2d8 bug_handler+0xa0/0x1a0 brk_handler+0x2f0/0x568 do_debug_exception+0x1bc/0x5d0 el1_dbg+0x18/0x78 switched_from_dl+0x454/0x608 __sched_setscheduler+0x8cc/0x2018 sys_sched_setattr+0x340/0x758 el0_svc_naked+0x30/0x34 syzkaller reproducer runs a bunch of threads that constantly switch between DEADLINE and NORMAL classes while interacting through futexes. The splat above is caused by the fact that if a DEADLINE task is setattr back to NORMAL while in non_contending state (blocked on a futex - inactive timer armed), its contribution to running_bw is not removed before sub_rq_bw() gets called (!task_on_rq_queued() branch) and the latter sees running_bw > this_bw. Fix it by removing a task contribution from running_bw if the task is not queued and in non_contending state while switched to a different class. Reported-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Reviewed-by: Luca Abeni <luca.abeni@santannapisa.it> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: claudio@evidence.eu.com Cc: rostedt@goodmis.org Link: http://lkml.kernel.org/r/20180711072948.27061-1-juri.lelli@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-15stop_machine: Disable preemption when waking two stopper threadsIsaac J. Manjarres1-1/+5
When cpu_stop_queue_two_works() begins to wake the stopper threads, it does so without preemption disabled, which leads to the following race condition: The source CPU calls cpu_stop_queue_two_works(), with cpu1 as the source CPU, and cpu2 as the destination CPU. When adding the stopper threads to the wake queue used in this function, the source CPU stopper thread is added first, and the destination CPU stopper thread is added last. When wake_up_q() is invoked to wake the stopper threads, the threads are woken up in the order that they are queued in, so the source CPU's stopper thread is woken up first, and it preempts the thread running on the source CPU. The stopper thread will then execute on the source CPU, disable preemption, and begin executing multi_cpu_stop(), and wait for an ack from the destination CPU's stopper thread, with preemption still disabled. Since the worker thread that woke up the stopper thread on the source CPU is affine to the source CPU, and preemption is disabled on the source CPU, that thread will never run to dequeue the destination CPU's stopper thread from the wake queue, and thus, the destination CPU's stopper thread will never run, causing the source CPU's stopper thread to wait forever, and stall. Disable preemption when waking the stopper threads in cpu_stop_queue_two_works(). Fixes: 0b26351b910f ("stop_machine, sched: Fix migrate_swap() vs. active_balance() deadlock") Co-Developed-by: Prasad Sodagudi <psodagud@codeaurora.org> Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org> Co-Developed-by: Pavankumar Kondeti <pkondeti@codeaurora.org> Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org> Signed-off-by: Isaac J. Manjarres <isaacm@codeaurora.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: peterz@infradead.org Cc: matt@codeblueprint.co.uk Cc: bigeasy@linutronix.de Cc: gregkh@linuxfoundation.org Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/1530655334-4601-1-git-send-email-isaacm@codeaurora.org
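The essence of the fix, as described, is to hold preemption off until both stopper threads have been woken; a simplified sketch, not the literal diff:

    /* in cpu_stop_queue_two_works(), after queueing both works on a wake_q: */
    preempt_disable();      /* the first woken stopper cannot preempt us now */
    wake_up_q(&wakeq);      /* wakes the source and the destination stopper threads */
    preempt_enable();       /* both are runnable, preemption is safe again */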
2018-07-13Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-2/+1
Pull timer fixes from Ingo Molnar: "A clocksource driver fix and a revert" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: clocksource: arm_arch_timer: Set arch_mem_timer cpumask to cpu_possible_mask Revert "tick: Prefer a lower rating device only if it's CPU local device"
2018-07-13Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-16/+25
Pull rseq fixes from Ingo Molnar: "Various rseq ABI fixes and cleanups: use get_user()/put_user(), validate parameters and use proper uapi types, etc" * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: rseq/selftests: cleanup: Update comment above rseq_prepare_unload rseq: Remove unused types_32_64.h uapi header rseq: uapi: Declare rseq_cs field as union, update includes rseq: uapi: Update uapi comments rseq: Use get_user/put_user rather than __get_user/__put_user rseq: Use __u64 for rseq_cs fields, validate user inputs
2018-07-13Merge tag 'trace-v4.18-rc3-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-traceLinus Torvalds2-6/+7
Pull tracing fixlet from Steven Rostedt: "Joel Fernandes asked to add a feature in tracing that Android had its own patch internally for. I took it back in 4.13. Now he realizes that he had a mistake, and swapped the values from what Android had. This means that the old Android tools will break when using a new kernel that has the new feature on it. The options are: 1. To swap it back to what Android wants. 2. Add a command line option or something to do the swap 3. Just let Android carry a patch that swaps it back Since it requires setting a tracing option to enable this anyway, I doubt there are other users of this than Android. Thus, I've decided to take option 1. If someone else is actually depending on the order that is in the kernel, then we will have to revert this change and go to option 2 or 3" * tag 'trace-v4.18-rc3-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Reorder display of TGID to be after PID
2018-07-12tracing: Reorder display of TGID to be after PIDJoel Fernandes (Google)2-6/+7
Currently ftrace displays data in trace output like so:

                          _-----=> irqs-off
                         / _----=> need-resched
                        | / _---=> hardirq/softirq
                        || / _--=> preempt-depth
                        ||| /     delay
         TASK-PID   CPU    TGID   ||||    TIMESTAMP  FUNCTION
            | |       |      |    ||||       |         |
        bash-1091  [000] (  1091) d..2   28.313544: sched_switch:

However Android's trace visualization tools expect a slightly different format due to an out-of-tree patch that has been carried for a decade; notice that the TGID and CPU fields are reversed:

                          _-----=> irqs-off
                         / _----=> need-resched
                        | / _---=> hardirq/softirq
                        || / _--=> preempt-depth
                        ||| /     delay
         TASK-PID    TGID   CPU   ||||    TIMESTAMP  FUNCTION
            | |        |      |   ||||       |         |
        bash-1091  (  1091) [002] d..2   64.965177: sched_switch:

From kernel v4.13 onwards, during which TGID was introduced, tracing with systrace on all Android kernels will break (most Android kernels have been on 4.9 with Android patches, so this issue hasn't been seen yet). From v4.13 onwards things will break. The chrome browser's tracing tools also embed the systrace viewer which uses the legacy TGID format and updates to that are known to be difficult to make. Considering this, I suggest we make this change to the upstream kernel and backport it to all Android kernels. I believe this feature is merged recently enough into the upstream kernel that it shouldn't be a problem. Also logically, IMO it makes more sense to group the TGID with the TASK-PID and the CPU after these. Link: http://lkml.kernel.org/r/20180626000822.113931-1-joel@joelfernandes.org Cc: jreck@google.com Cc: tkjos@google.com Cc: stable@vger.kernel.org Fixes: 441dae8f2f29 ("tracing: Add support for display of tgid in trace output") Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-07-12bpf: don't leave partial mangled prog in jit_subprogs error pathDaniel Borkmann1-2/+9
syzkaller managed to trigger the following bug through fault injection: [...] [ 141.043668] verifier bug. No program starts at insn 3 [ 141.044648] WARNING: CPU: 3 PID: 4072 at kernel/bpf/verifier.c:1613 get_callee_stack_depth kernel/bpf/verifier.c:1612 [inline] [ 141.044648] WARNING: CPU: 3 PID: 4072 at kernel/bpf/verifier.c:1613 fixup_call_args kernel/bpf/verifier.c:5587 [inline] [ 141.044648] WARNING: CPU: 3 PID: 4072 at kernel/bpf/verifier.c:1613 bpf_check+0x525e/0x5e60 kernel/bpf/verifier.c:5952 [ 141.047355] CPU: 3 PID: 4072 Comm: a.out Not tainted 4.18.0-rc4+ #51 [ 141.048446] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),BIOS 1.10.2-1 04/01/2014 [ 141.049877] Call Trace: [ 141.050324] __dump_stack lib/dump_stack.c:77 [inline] [ 141.050324] dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113 [ 141.050950] ? dump_stack_print_info.cold.2+0x52/0x52 lib/dump_stack.c:60 [ 141.051837] panic+0x238/0x4e7 kernel/panic.c:184 [ 141.052386] ? add_taint.cold.5+0x16/0x16 kernel/panic.c:385 [ 141.053101] ? __warn.cold.8+0x148/0x1ba kernel/panic.c:537 [ 141.053814] ? __warn.cold.8+0x117/0x1ba kernel/panic.c:530 [ 141.054506] ? get_callee_stack_depth kernel/bpf/verifier.c:1612 [inline] [ 141.054506] ? fixup_call_args kernel/bpf/verifier.c:5587 [inline] [ 141.054506] ? bpf_check+0x525e/0x5e60 kernel/bpf/verifier.c:5952 [ 141.055163] __warn.cold.8+0x163/0x1ba kernel/panic.c:538 [ 141.055820] ? get_callee_stack_depth kernel/bpf/verifier.c:1612 [inline] [ 141.055820] ? fixup_call_args kernel/bpf/verifier.c:5587 [inline] [ 141.055820] ? bpf_check+0x525e/0x5e60 kernel/bpf/verifier.c:5952 [...] What happens in jit_subprogs() is that kcalloc() for the subprog func buffer is failing with NULL where we then bail out. Latter is a plain return -ENOMEM, and this is definitely not okay since earlier in the loop we are walking all subprogs and temporarily rewrite insn->off to remember the subprog id as well as insn->imm to temporarily point the call to __bpf_call_base + 1 for the initial JIT pass. Thus, bailing out in such state and handing this over to the interpreter is troublesome since later/subsequent e.g. find_subprog() lookups are based on wrong insn->imm. Therefore, once we hit this point, we need to jump to out_free path where we undo all changes from earlier loop, so that interpreter can work on unmodified insn->{off,imm}. Another point is that should find_subprog() fail in jit_subprogs() due to a verifier bug, then we also should not simply defer the program to the interpreter since also here we did partial modifications. Instead we should just bail out entirely and return an error to the user who is trying to load the program. Fixes: 1c2a088a6626 ("bpf: x64: add JIT support for multi-function programs") Reported-by: syzbot+7d427828b2ea6e592804@syzkaller.appspotmail.com Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-07-11bpf: btf: Fix bitfield extraction for big endianOkash Khawaja1-17/+13
When extracting a bitfield from a number, btf_int_bits_seq_show() builds a mask and accesses the least significant byte of the number in a way specific to little-endian. This patch fixes that by checking the endianness of the machine and then shifting left and right the unneeded bits. Thanks to Martin Lau for the help in navigating potential pitfalls when dealing with endianness and for the final solution. Fixes: b00b8daec828 ("bpf: btf: Add pretty print capability for data with BTF type info") Signed-off-by: Okash Khawaja <osk@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
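The general idea of extracting a bitfield with shifts instead of a little-endian-specific mask is illustrated below; this is only a sketch, and the real code additionally has to pick the shift amounts based on how the bytes were loaded on big- vs little-endian machines:

    /* extract nr_bits starting at bit_offset (LSB numbering) from a 64-bit value */
    static u64 extract_bits(u64 val, u8 bit_offset, u8 nr_bits)
    {
            val <<= 64 - nr_bits - bit_offset;      /* shift away the bits above the field */
            return val >> (64 - nr_bits);           /* then the bits below it */
    }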
2018-07-11Merge tag 'trace-v4.18-rc3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-traceLinus Torvalds1-1/+5
Pull kprobe fix from Steven Rostedt: "This fixes a memory leak in the kprobe code" * tag 'trace-v4.18-rc3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing/kprobe: Release kprobe print_fmt properly
2018-07-11tracing/kprobe: Release kprobe print_fmt properlyJiri Olsa1-1/+5
We don't release tk->tp.call.print_fmt when destroying local uprobe. Also there's missing print_fmt kfree in create_local_trace_kprobe error path. Link: http://lkml.kernel.org/r/20180709141906.2390-1-jolsa@kernel.org Cc: stable@vger.kernel.org Fixes: e12f03d7031a ("perf/core: Implement the 'perf_kprobe' PMU") Acked-by: Song Liu <songliubraving@fb.com> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-07-10rseq: uapi: Declare rseq_cs field as union, update includesMathieu Desnoyers1-6/+9
Declaring the rseq_cs field as a union between __u64 and two __u32 allows both 32-bit and 64-bit kernels to read the full __u64, and therefore validate that a 32-bit user-space cleared the upper 32 bits, thus ensuring a consistent behavior between native 32-bit kernels and 32-bit compat tasks on 64-bit kernels. Check that the rseq_cs value read is < TASK_SIZE. The asm/byteorder.h header needs to be included by rseq.h, now that it is not using linux/types_32_64.h anymore. Considering that only __32 and __u64 types are declared in linux/rseq.h, the linux/types.h header should always be included for both kernel and user-space code: including stdint.h is just for u64 and u32, which are not used in this header at all. Use copy_from_user()/clear_user() to interact with a 64-bit field, because arm32 does not implement 64-bit __get_user, and ppc32 does not 64-bit get_user. Considering that the rseq_cs pointer does not need to be loaded/stored with single-copy atomicity from the kernel anymore, we can simply use copy_from_user()/clear_user(). Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: linux-api@vger.kernel.org Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dave Watson <davejwatson@fb.com> Cc: Paul Turner <pjt@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Russell King <linux@arm.linux.org.uk> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Chris Lameter <cl@linux.com> Cc: Ben Maurer <bmaurer@fb.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Joel Fernandes <joelaf@google.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lkml.kernel.org/r/20180709195155.7654-5-mathieu.desnoyers@efficios.com
2018-07-10rseq: uapi: Update uapi commentsMathieu Desnoyers1-1/+1
Update rseq uapi header comments to reflect that user-space need to do thread-local loads/stores from/to the struct rseq fields. As a consequence of this added requirement, the kernel does not need to perform loads/stores with single-copy atomicity. Update the comment associated to the "flags" fields to describe more accurately that it's only useful to facilitate single-stepping through rseq critical sections with debuggers. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: linux-api@vger.kernel.org Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dave Watson <davejwatson@fb.com> Cc: Paul Turner <pjt@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Russell King <linux@arm.linux.org.uk> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Chris Lameter <cl@linux.com> Cc: Ben Maurer <bmaurer@fb.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Joel Fernandes <joelaf@google.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lkml.kernel.org/r/20180709195155.7654-4-mathieu.desnoyers@efficios.com
2018-07-10rseq: Use get_user/put_user rather than __get_user/__put_userMathieu Desnoyers1-7/+7
__get_user()/__put_user() is used to read values for address ranges that were already checked with access_ok() on rseq registration. It has been recognized that __get_user/__put_user are optimizing the wrong thing. Replace them by get_user/put_user across rseq instead. If those end up showing up in benchmarks, the proper approach would be to use user_access_begin() / unsafe_{get,put}_user() / user_access_end() anyway. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: linux-api@vger.kernel.org Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dave Watson <davejwatson@fb.com> Cc: Paul Turner <pjt@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Russell King <linux@arm.linux.org.uk> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Chris Lameter <cl@linux.com> Cc: Ben Maurer <bmaurer@fb.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Joel Fernandes <joelaf@google.com> Cc: linux-arm-kernel@lists.infradead.org Link: https://lkml.kernel.org/r/20180709195155.7654-3-mathieu.desnoyers@efficios.com
2018-07-10rseq: Use __u64 for rseq_cs fields, validate user inputsMathieu Desnoyers1-4/+10
Change the rseq ABI so rseq_cs start_ip, post_commit_offset and abort_ip fields are seen as 64-bit fields by both 32-bit and 64-bit kernels rather that ignoring the 32 upper bits on 32-bit kernels. This ensures we have a consistent behavior for a 32-bit binary executed on 32-bit kernels and in compat mode on 64-bit kernels. Validating the value of abort_ip field to be below TASK_SIZE ensures the kernel don't return to an invalid address when returning to userspace after an abort. I don't fully trust each architecture code to consistently deal with invalid return addresses. Validating the value of the start_ip and post_commit_offset fields prevents overflow on arithmetic performed on those values, used to check whether abort_ip is within the rseq critical section. If validation fails, the process is killed with a segmentation fault. When the signature encountered before abort_ip does not match the expected signature, return -EINVAL rather than -EPERM to be consistent with other input validation return codes from rseq_get_rseq_cs(). Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: linux-api@vger.kernel.org Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dave Watson <davejwatson@fb.com> Cc: Paul Turner <pjt@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Russell King <linux@arm.linux.org.uk> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Chris Lameter <cl@linux.com> Cc: Ben Maurer <bmaurer@fb.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Joel Fernandes <joelaf@google.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lkml.kernel.org/r/20180709195155.7654-2-mathieu.desnoyers@efficios.com
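The validation described boils down to something like the following sketch (exact comparisons and error propagation may differ from the real rseq_get_rseq_cs()):

    /* all code pointers must lie in the user address range, and the
     * start_ip + post_commit_offset arithmetic must stay inside it */
    if (cs->start_ip >= TASK_SIZE ||
        cs->start_ip + cs->post_commit_offset >= TASK_SIZE ||
        cs->abort_ip >= TASK_SIZE)
            return -EINVAL;         /* the caller then kills the task with SIGSEGV */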
2018-07-10Revert "tick: Prefer a lower rating device only if it's CPU local device"Sudeep Holla1-2/+1
This reverts commit 1332a90558013ae4242e3dd7934bdcdeafb06c0d. The original issue was not because of incorrect checking of the cpumask for both the new and old tick device. It was incorrectly analysed as being due to a misunderstanding of the comment and a misinterpretation of the return value from tick_check_preferred. The main issue is with the clockevent driver that sets the cpumask to cpu_all_mask instead of cpu_possible_mask. Signed-off-by: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Kevin Hilman <khilman@baylibre.com> Tested-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com> Cc: linux-arm-kernel@lists.infradead.org Cc: Marc Zyngier <marc.zyngier@arm.com> Link: https://lkml.kernel.org/r/1531151136-18297-1-git-send-email-sudeep.holla@arm.com
2018-07-08Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds6-73/+98
Pull scheduler fixes from Thomas Gleixner: - The hopefully final fix for the reported race problems in kthread_parkme(). The previous attempt still left a hole and was partially wrong. - Plug a race in the remote tick mechanism which triggers a warning about updates not being done correctly. That's a false positive if the race condition is hit as the remote CPU is idle. Plug it by checking the condition again when holding run queue lock. - Fix a bug in the utilization estimation of a run queue which causes the estimation to be 0 when a run queue is throttled. - Advance the global expiration of the period timer when the timer is restarted after an idle period. Otherwise the expiry time is stale and the timer fires prematurely. - Cure the drift between the bandwidth timer and the runqueue accounting, which leads to bogus throttling of runqueues - Place the call to cpufreq_update_util() correctly so the function will observe the correct number of running RT tasks and not a stale one. * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: kthread, sched/core: Fix kthread_parkme() (again...) sched/util_est: Fix util_est_dequeue() for throttled cfs_rq sched/fair: Advance global expiration when period timer is restarted sched/fair: Fix bandwidth timer clock drift condition sched/rt: Fix call to cpufreq_update_util() sched/nohz: Skip remote tick on idle task entirely
2018-07-07xdp: XDP_REDIRECT should check IFF_UP and MTUToshiaki Makita1-1/+6
Otherwise we end up with attempting to send packets from down devices or to send oversized packets, which may cause unexpected driver/device behaviour. Generic XDP has already done this check, so reuse the logic in native XDP. Fixes: 814abfabef3c ("xdp: add bpf_redirect helper function") Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
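The check that generic XDP already performed, and that native XDP now reuses, amounts to roughly the following; the helper name and the exact length limit are assumptions for illustration:

    static int xdp_ok_fwd_dev(const struct net_device *fwd, unsigned int pktlen)
    {
            if (unlikely(!(fwd->flags & IFF_UP)))
                    return -ENETDOWN;       /* don't redirect to a downed device */

            if (unlikely(pktlen > fwd->mtu + fwd->hard_header_len + VLAN_HLEN))
                    return -EMSGSIZE;       /* don't send oversized packets */

            return 0;
    }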
2018-07-07bpf: sockmap, convert bpf_compute_data_pointers to bpf_*_sk_skbJohn Fastabend1-2/+2
In commit 'bpf: bpf_compute_data uses incorrect cb structure' (8108a7751512) we added the routine bpf_compute_data_end_sk_skb() to compute the correct data_end values, but this has since been lost. In kernel v4.14 this was correct and the above patch was applied in its entirety. Then when v4.14 was merged into v4.15-rc1 net-next tree we lost the piece that renamed bpf_compute_data_pointers to the new function bpf_compute_data_end_sk_skb. This was done here, e1ea2f9856b7 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net") When it conflicted with the following rename patch, 6aaae2b6c433 ("bpf: rename bpf_compute_data_end into bpf_compute_data_pointers") Finally, after a refactor I thought even the function bpf_compute_data_end_sk_skb() was no longer needed and it was erroneously removed. However, we never reverted the sk_skb_convert_ctx_access() usage of tcp_skb_cb which had been committed and survived the merge conflict. Here we fix this by adding back the helper and *_data_end_sk_skb() usage. Using the bpf_skc_data_end mapping is not correct because it expects a qdisc_skb_cb object but at the sock layer this is not the case. Even though it happens to work here because we don't overwrite any data in-use at the socket layer and the cb structure is cleared later, this has potential to create some subtle issues. But, even more concretely, the filter.c access check uses tcp_skb_cb. And by some act of chance though,

    struct bpf_skb_data_end {
            struct qdisc_skb_cb     qdisc_cb;       /*     0    28 */
            /* XXX 4 bytes hole, try to pack */
            void *                  data_meta;      /*    32     8 */
            void *                  data_end;       /*    40     8 */

            /* size: 48, cachelines: 1, members: 3 */
            /* sum members: 44, holes: 1, sum holes: 4 */
            /* last cacheline: 48 bytes */
    };

and then tcp_skb_cb,

    struct tcp_skb_cb {
            [...]
            struct {
                    __u32           flags;          /*    24     4 */
                    struct sock *   sk_redir;       /*    32     8 */
                    void *          data_end;       /*    40     8 */
            } bpf;                                  /*    24 */
    };

So when we use offset_of() to track down the byte offset we get 40 in either case and everything continues to work. Fix this mess and use the correct structures; it's unclear how long this might actually keep working until someone moves the structs around. Reported-by: Martin KaFai Lau <kafai@fb.com> Fixes: e1ea2f9856b7 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net") Fixes: 6aaae2b6c433 ("bpf: rename bpf_compute_data_end into bpf_compute_data_pointers") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-07-07bpf: sockmap, consume_skb in close pathJohn Fastabend1-1/+4
Currently, when a sock is closed and the bpf_tcp_close() callback is used we remove memory but do not free the skb. Call consume_skb() if the skb is attached to the buffer. Reported-by: syzbot+d464d2c20c717ef5a6a8@syzkaller.appspotmail.com Fixes: 1aa12bdf1bfb ("bpf: sockmap, add sock close() hook to remove socks") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-07-07bpf: sockhash, disallow bpf_tcp_close and update in parallelJohn Fastabend2-1/+18
After the latest lock updates there is no longer anything preventing a close and recvmsg call running in parallel. Additionally, we can race update with close if we close a socket and simultaneously update it via the BPF userspace API (note the cgroup ops are already run with sock_lock held). To resolve this, take sock_lock in close and update paths. Reported-by: syzbot+b680e42077a0d7c9a0c4@syzkaller.appspotmail.com Fixes: e9db4ef6bf4c ("bpf: sockhash fix omitted bucket lock in sock_close") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-07-07bpf: sockmap, hash table is RCU so readers do not need locksJohn Fastabend1-2/+0
This removes locking from readers of the RCU hash table. It's not necessary. Fixes: 81110384441a ("bpf: sockmap, add hash map support") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-07-07bpf: sockmap, error path can not release psock in multi-map caseJohn Fastabend1-11/+6
The current code, in the error path of sock_hash_ctx_update_elem, checks if the sock has a psock in the user data and if so decrements the reference count of the psock. However, if the error happens early in the error path we may have never incremented the psock reference count and if the psock exists because the sock is in another map then we may inadvertently decrement the reference count. Fix this by making the error path only call smap_release_sock if the error happens after the increment. Reported-by: syzbot+d464d2c20c717ef5a6a8@syzkaller.appspotmail.com Fixes: 81110384441a ("bpf: sockmap, add hash map support") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-07-05Merge tag 'trace-v4.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-traceLinus Torvalds6-17/+17
Pull tracing fixes and cleanups from Steven Rostedt: "While cleaning out my INBOX, I found a few patches that were lost in the noise. These are minor bug fixes and clean ups. Those include: - avoid a string overflow - code that didn't match the comment (but should) - a small code optimization (use of a conditional) - quiet printf warnings - nuke unused code - fix function graph interrupt annotation" * tag 'trace-v4.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Fix missing return symbol in function_graph output ftrace: Nuke clear_ftrace_function tracing: Use __printf markup to silence compiler tracing: Optimize trace_buffer_iter() logic tracing: Make create_filter() code match the comments tracing: Avoid string overflow
2018-07-03tracing: Fix missing return symbol in function_graph outputChangbin Du1-1/+4
The function_graph tracer does not show the interrupt return marker for the leaf entry. On leaf entries, we see an unbalanced interrupt marker (the interrupt was entered, but never left).

Before:
    1)               |  SyS_write() {
    1)               |    __fdget_pos() {
    1)   0.061 us    |      __fget_light();
    1)   0.289 us    |    }
    1)               |    vfs_write() {
    1)   0.049 us    |      rw_verify_area();
    1) + 15.424 us   |      __vfs_write();
    1)   ==========> |
    1)   6.003 us    |      smp_apic_timer_interrupt();
    1)   0.055 us    |      __fsnotify_parent();
    1)   0.073 us    |      fsnotify();
    1) + 23.665 us   |    }
    1) + 24.501 us   |  }

After:
    0)               |  SyS_write() {
    0)               |    __fdget_pos() {
    0)   0.052 us    |      __fget_light();
    0)   0.328 us    |    }
    0)               |    vfs_write() {
    0)   0.057 us    |      rw_verify_area();
    0)               |      __vfs_write() {
    0)   ==========> |
    0)   8.548 us    |        smp_apic_timer_interrupt();
    0)   <========== |
    0) + 36.507 us   |      } /* __vfs_write */
    0)   0.049 us    |      __fsnotify_parent();
    0)   0.066 us    |      fsnotify();
    0) + 50.064 us   |    }
    0) + 50.952 us   |  }

Link: http://lkml.kernel.org/r/1517413729-20411-1-git-send-email-changbin.du@intel.com Cc: stable@vger.kernel.org Fixes: f8b755ac8e0cc ("tracing/function-graph-tracer: Output arrows signal on hardirq call/return") Signed-off-by: Changbin Du <changbin.du@intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-07-03ftrace: Nuke clear_ftrace_functionYisheng Xie1-12/+1
clear_ftrace_function is not used outside of ftrace.c, and there is no need for it to be a separate function, so nuke it per Steve's suggestion. Link: http://lkml.kernel.org/r/1517537689-34947-1-git-send-email-xieyisheng1@huawei.com Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-07-03tracing: Use __printf markup to silence compilerMathieu Malaterre1-0/+5
Silence warnings (triggered at W=1) by adding relevant __printf attributes. CC kernel/trace/trace.o kernel/trace/trace.c: In function ‘__trace_array_vprintk’: kernel/trace/trace.c:2979:2: warning: function might be possible candidate for ‘gnu_printf’ format attribute [-Wsuggest-attribute=format] len = vscnprintf(tbuffer, TRACE_BUF_SIZE, fmt, args); ^~~ AR kernel/trace/built-in.o Link: http://lkml.kernel.org/r/20180308205843.27447-1-malat@debian.org Signed-off-by: Mathieu Malaterre <malat@debian.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
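The annotation in question is along these lines; the exact function signature and argument indices are assumed from the warning above, so treat this as a sketch:

    /* fmt is parameter 3; the second index is 0 because the arguments
     * arrive as a va_list rather than as variadic arguments */
    static int __printf(3, 0)
    __trace_array_vprintk(struct ring_buffer *buffer, unsigned long ip,
                          const char *fmt, va_list args);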
2018-07-03tracing: Optimize trace_buffer_iter() logicyuan linyu1-3/+1
Simplify and optimize the logic in trace_buffer_iter() to use a conditional operation instead of an if conditional. Link: http://lkml.kernel.org/r/20180408113631.3947-1-cugyly@163.com Signed-off-by: yuan linyu <Linyu.Yuan@alcatel-sbell.com.cn> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
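The simplified helper ends up as a single conditional expression, roughly:

    static inline struct ring_buffer_iter *
    trace_buffer_iter(struct trace_iterator *iter, int cpu)
    {
            return iter->buffer_iter ? iter->buffer_iter[cpu] : NULL;
    }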
2018-07-03tracing: Make create_filter() code match the commentsSteven Rostedt (VMware)1-0/+5
The comment in create_filter() states that the passed in filter pointer (filterp) will either be NULL or contain an error message stating why the filter failed. But it also expects the filter pointer to point to NULL when passed in. If it is not, the function create_filter_start() will warn and return an error message without updating the filter pointer. This is not what the comment states. As we always expect the pointer to point to NULL, if it is not, trigger a WARN_ON(), set it to NULL, and then continue the path as the rest will work as the comment states. Also update the comment to state it must point to NULL. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-07-03tracing: Avoid string overflowArnd Bergmann1-1/+1
'err' is used as a NUL-terminated string, but using strncpy() with the length equal to the buffer size may result in lack of the termination: kernel/trace/trace_events_hist.c: In function 'hist_err_event': kernel/trace/trace_events_hist.c:396:3: error: 'strncpy' specified bound 256 equals destination size [-Werror=stringop-truncation] strncpy(err, var, MAX_FILTER_STR_VAL); This changes it to use the safer strscpy() instead. Link: http://lkml.kernel.org/r/20180328140920.2842153-1-arnd@arndb.de Cc: stable@vger.kernel.org Fixes: f404da6e1d46 ("tracing: Add 'last error' error facility for hist triggers") Acked-by: Tom Zanussi <tom.zanussi@linux.intel.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
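The fix is a one-line substitution, shown here as a diff; strscpy() truncates and always NUL-terminates the destination:

    -	strncpy(err, var, MAX_FILTER_STR_VAL);
    +	strscpy(err, var, MAX_FILTER_STR_VAL);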
2018-07-03bpf: hash map: decrement counter on errorMauricio Vasquez B1-5/+11
Decrement the number of elements in the map in case the allocation of a new node fails. Fixes: 6c9059817432 ("bpf: pre-allocate hash map elements") Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
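A sketch of the error-path fix as described, with the surrounding allocation simplified:

    l_new = kmalloc(htab->elem_size, GFP_ATOMIC | __GFP_NOWARN);
    if (!l_new) {
            atomic_dec(&htab->count);       /* undo the increment done before the allocation */
            return ERR_PTR(-ENOMEM);
    }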
2018-07-03kthread, sched/core: Fix kthread_parkme() (again...)Peter Zijlstra2-26/+35
Gaurav reports that commit: 85f1abe0019f ("kthread, sched/wait: Fix kthread_parkme() completion issue") isn't working for him. Because of the following race: > controller Thread CPUHP Thread > takedown_cpu > kthread_park > kthread_parkme > Set KTHREAD_SHOULD_PARK > smpboot_thread_fn > set Task interruptible > > > wake_up_process > if (!(p->state & state)) > goto out; > > Kthread_parkme > SET TASK_PARKED > schedule > raw_spin_lock(&rq->lock) > ttwu_remote > waiting for __task_rq_lock > context_switch > > finish_lock_switch > > > > Case TASK_PARKED > kthread_park_complete > > > SET Running Furthermore, Oleg noticed that the whole scheduler TASK_PARKED handling is buggered because the TASK_DEAD thing is done with preemption disabled, the current code can still complete early on preemption :/ So basically revert that earlier fix and go with a variant of the alternative mentioned in the commit. Promote TASK_PARKED to special state to avoid the store-store issue on task->state leading to the WARN in kthread_unpark() -> __kthread_bind(). But in addition, add wait_task_inactive() to kthread_park() to ensure the task really is PARKED when we return from kthread_park(). This avoids the whole kthread still gets migrated nonsense -- although it would be really good to get this done differently. Reported-by: Gaurav Kohli <gkohli@codeaurora.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 85f1abe0019f ("kthread, sched/wait: Fix kthread_parkme() completion issue") Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-03sched/util_est: Fix util_est_dequeue() for throttled cfs_rqVincent Guittot1-12/+4
When a cfs_rq is throttled, parent cfs_rq->nr_running is decreased and everything happens at cfs_rq level. Currently util_est stays unchanged in such case and it keeps accounting the utilization of throttled tasks. This can somewhat make sense as we don't dequeue tasks but only throttled cfs_rq. If a task of another group is enqueued/dequeued and root cfs_rq becomes idle during the dequeue, util_est will be cleared whereas it was accounting util_est of throttled tasks before. So the behavior of util_est is not always the same regarding throttled tasks and depends of side activity. Furthermore, util_est will not be updated when the cfs_rq is unthrottled as everything happens at cfs_rq level. Main results is that util_est will stay null whereas we now have running tasks. We have to wait for the next dequeue/enqueue of the previously throttled tasks to get an up to date util_est. Remove the assumption that cfs_rq's estimated utilization of a CPU is 0 if there is no running task so the util_est of a task remains until the latter is dequeued even if its cfs_rq has been throttled. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Patrick Bellasi <patrick.bellasi@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 7f65ea42eb00 ("sched/fair: Add util_est on top of PELT") Link: http://lkml.kernel.org/r/1528972380-16268-1-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>