aboutsummaryrefslogtreecommitdiffstats
path: root/kernel (follow)
AgeCommit message (Collapse)AuthorFilesLines
2019-05-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller5-49/+85
Three trivial overlapping conflicts. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds1-30/+46
Pull networking fixes from David Miller: 1) Out of bounds access in xfrm IPSEC policy unlink, from Yue Haibing. 2) Missing length check for esp4 UDP encap, from Sabrina Dubroca. 3) Fix byte order of RX STBC access in mac80211, from Johannes Berg. 4) Inifnite loop in bpftool map create, from Alban Crequy. 5) Register mark fix in ebpf verifier after pkt/null checks, from Paul Chaignon. 6) Properly use rcu_dereference_sk_user_data in L2TP code, from Eric Dumazet. 7) Buffer overrun in marvell phy driver, from Andrew Lunn. 8) Several crash and statistics handling fixes to bnxt_en driver, from Michael Chan and Vasundhara Volam. 9) Several fixes to the TLS layer from Jakub Kicinski (copying negative amounts of data in reencrypt, reencrypt frag copying, blind nskb->sk NULL deref, etc). 10) Several UDP GRO fixes, from Paolo Abeni and Eric Dumazet. 11) PID/UID checks on ipv6 flow labels are inverted, from Willem de Bruijn. 12) Use after free in l2tp, from Eric Dumazet. 13) IPV6 route destroy races, also from Eric Dumazet. 14) SCTP state machine can erroneously run recursively, fix from Xin Long. 15) Adjust AF_PACKET msg_name length checks, add padding bytes if necessary. From Willem de Bruijn. 16) Preserve skb_iif, so that forwarded packets have consistent values even if fragmentation is involved. From Shmulik Ladkani. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (69 commits) udp: fix GRO packet of death ipv6: A few fixes on dereferencing rt->from rds: ib: force endiannes annotation selftests: fib_rule_tests: print the result and return 1 if any tests failed ipv4: ip_do_fragment: Preserve skb_iif during fragmentation net/tls: avoid NULL pointer deref on nskb->sk in fallback selftests: fib_rule_tests: Fix icmp proto with ipv6 packet: validate msg_namelen in send directly packet: in recvmsg msg_name return at least sizeof sockaddr_ll sctp: avoid running the sctp state machine recursively stmmac: pci: Fix typo in IOT2000 comment Documentation: fix netdev-FAQ.rst markup warning ipv6: fix races in ip6_dst_destroy() l2ip: fix possible use-after-free appletalk: Set error code if register_snap_client failed net: dsa: bcm_sf2: fix buffer overflow doing set_rxnfc rxrpc: Fix net namespace cleanup ipv6/flowlabel: wait rcu grace period before put_pid() vrf: Use orig netdev to count Ip6InNoRoutes and a fresh route lookup when sending dest unreach tcp: add sanity tests in tcp_add_backlog() ...
2019-04-29Merge tag 'seccomp-v5.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linuxLinus Torvalds1-2/+15
Pull seccomp fixes from Kees Cook: "Syzbot found a use-after-free bug in seccomp due to flags that should not be allowed to be used together. Tycho fixed this, I updated the self-tests, and the syzkaller PoC has been running for several days without triggering KASan (before this fix, it would reproduce). These patches have also been in -next for almost a week, just to be sure. - Add logic for making some seccomp flags exclusive (Tycho) - Update selftests for exclusivity testing (Kees)" * tag 'seccomp-v5.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: seccomp: Make NEW_LISTENER and TSYNC flags exclusive selftests/seccomp: Prepare for exclusive seccomp flags
2019-04-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller3-6/+89
Daniel Borkmann says: ==================== pull-request: bpf-next 2019-04-28 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) Introduce BPF socket local storage map so that BPF programs can store private data they associate with a socket (instead of e.g. separate hash table), from Martin. 2) Add support for bpftool to dump BTF types. This is done through a new `bpftool btf dump` sub-command, from Andrii. 3) Enable BPF-based flow dissector for skb-less eth_get_headlen() calls which was currently not supported since skb was used to lookup netns, from Stanislav. 4) Add an opt-in interface for tracepoints to expose a writable context for attached BPF programs, used here for NBD sockets, from Matt. 5) BPF xadd related arm64 JIT fixes and scalability improvements, from Daniel. 6) Change the skb->protocol for bpf_skb_adjust_room() helper in order to support tunnels such as sit. Add selftests as well, from Willem. 7) Various smaller misc fixes. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27genetlink: optionally validate strictly/dumpsJohannes Berg1-0/+2
Add options to strictly validate messages and dump messages, sometimes perhaps validating dump messages non-strictly may be required, so add an option for that as well. Since none of this can really be applied to existing commands, set the options everwhere using the following spatch: @@ identifier ops; expression X; @@ struct genl_ops ops[] = { ..., { .cmd = X, + .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP, ... }, ... }; For new commands one should just not copy the .validate 'opt-out' flags and thus get strict validation. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27netlink: make validation more configurable for future strictnessJohannes Berg1-2/+3
We currently have two levels of strict validation: 1) liberal (default) - undefined (type >= max) & NLA_UNSPEC attributes accepted - attribute length >= expected accepted - garbage at end of message accepted 2) strict (opt-in) - NLA_UNSPEC attributes accepted - attribute length >= expected accepted Split out parsing strictness into four different options: * TRAILING - check that there's no trailing data after parsing attributes (in message or nested) * MAXTYPE - reject attrs > max known type * UNSPEC - reject attributes with NLA_UNSPEC policy entries * STRICT_ATTRS - strictly validate attribute size The default for future things should be *everything*. The current *_strict() is a combination of TRAILING and MAXTYPE, and is renamed to _deprecated_strict(). The current regular parsing has none of this, and is renamed to *_parse_deprecated(). Additionally it allows us to selectively set one of the new flags even on old policies. Notably, the UNSPEC flag could be useful in this case, since it can be arranged (by filling in the policy) to not be an incompatible userspace ABI change, but would then going forward prevent forgetting attribute entries. Similar can apply to the POLICY flag. We end up with the following renames: * nla_parse -> nla_parse_deprecated * nla_parse_strict -> nla_parse_deprecated_strict * nlmsg_parse -> nlmsg_parse_deprecated * nlmsg_parse_strict -> nlmsg_parse_deprecated_strict * nla_parse_nested -> nla_parse_nested_deprecated * nla_validate_nested -> nla_validate_nested_deprecated Using spatch, of course: @@ expression TB, MAX, HEAD, LEN, POL, EXT; @@ -nla_parse(TB, MAX, HEAD, LEN, POL, EXT) +nla_parse_deprecated(TB, MAX, HEAD, LEN, POL, EXT) @@ expression NLH, HDRLEN, TB, MAX, POL, EXT; @@ -nlmsg_parse(NLH, HDRLEN, TB, MAX, POL, EXT) +nlmsg_parse_deprecated(NLH, HDRLEN, TB, MAX, POL, EXT) @@ expression NLH, HDRLEN, TB, MAX, POL, EXT; @@ -nlmsg_parse_strict(NLH, HDRLEN, TB, MAX, POL, EXT) +nlmsg_parse_deprecated_strict(NLH, HDRLEN, TB, MAX, POL, EXT) @@ expression TB, MAX, NLA, POL, EXT; @@ -nla_parse_nested(TB, MAX, NLA, POL, EXT) +nla_parse_nested_deprecated(TB, MAX, NLA, POL, EXT) @@ expression START, MAX, POL, EXT; @@ -nla_validate_nested(START, MAX, POL, EXT) +nla_validate_nested_deprecated(START, MAX, POL, EXT) @@ expression NLH, HDRLEN, MAX, POL, EXT; @@ -nlmsg_validate(NLH, HDRLEN, MAX, POL, EXT) +nlmsg_validate_deprecated(NLH, HDRLEN, MAX, POL, EXT) For this patch, don't actually add the strict, non-renamed versions yet so that it breaks compile if I get it wrong. Also, while at it, make nla_validate and nla_parse go down to a common __nla_validate_parse() function to avoid code duplication. Ultimately, this allows us to have very strict validation for every new caller of nla_parse()/nlmsg_parse() etc as re-introduced in the next patch, while existing things will continue to work as is. In effect then, this adds fully strict validation for any new command. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27netlink: make nla_nest_start() add NLA_F_NESTED flagMichal Kubecek1-1/+1
Even if the NLA_F_NESTED flag was introduced more than 11 years ago, most netlink based interfaces (including recently added ones) are still not setting it in kernel generated messages. Without the flag, message parsers not aware of attribute semantics (e.g. wireshark dissector or libmnl's mnl_nlmsg_fprintf()) cannot recognize nested attributes and won't display the structure of their contents. Unfortunately we cannot just add the flag everywhere as there may be userspace applications which check nlattr::nla_type directly rather than through a helper masking out the flags. Therefore the patch renames nla_nest_start() to nla_nest_start_noflag() and introduces nla_nest_start() as a wrapper adding NLA_F_NESTED. The calls which add NLA_F_NESTED manually are rewritten to use nla_nest_start(). Except for changes in include/net/netlink.h, the patch was generated using this semantic patch: @@ expression E1, E2; @@ -nla_nest_start(E1, E2) +nla_nest_start_noflag(E1, E2) @@ expression E1, E2; @@ -nla_nest_start_noflag(E1, E2 | NLA_F_NESTED) +nla_nest_start(E1, E2) Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Acked-by: Jiri Pirko <jiri@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-0/+4
Pull scheduler fix from Ingo Molnar: "Fix a division by zero bug that can trigger in the NUMA placement code" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/numa: Fix a possible divide-by-zero
2019-04-27bpf: Introduce bpf sk local storageMartin KaFai Lau2-4/+26
After allowing a bpf prog to - directly read the skb->sk ptr - get the fullsock bpf_sock by "bpf_sk_fullsock()" - get the bpf_tcp_sock by "bpf_tcp_sock()" - get the listener sock by "bpf_get_listener_sock()" - avoid duplicating the fields of "(bpf_)sock" and "(bpf_)tcp_sock" into different bpf running context. this patch is another effort to make bpf's network programming more intuitive to do (together with memory and performance benefit). When bpf prog needs to store data for a sk, the current practice is to define a map with the usual 4-tuples (src/dst ip/port) as the key. If multiple bpf progs require to store different sk data, multiple maps have to be defined. Hence, wasting memory to store the duplicated keys (i.e. 4 tuples here) in each of the bpf map. [ The smallest key could be the sk pointer itself which requires some enhancement in the verifier and it is a separate topic. ] Also, the bpf prog needs to clean up the elem when sk is freed. Otherwise, the bpf map will become full and un-usable quickly. The sk-free tracking currently could be done during sk state transition (e.g. BPF_SOCK_OPS_STATE_CB). The size of the map needs to be predefined which then usually ended-up with an over-provisioned map in production. Even the map was re-sizable, while the sk naturally come and go away already, this potential re-size operation is arguably redundant if the data can be directly connected to the sk itself instead of proxy-ing through a bpf map. This patch introduces sk->sk_bpf_storage to provide local storage space at sk for bpf prog to use. The space will be allocated when the first bpf prog has created data for this particular sk. The design optimizes the bpf prog's lookup (and then optionally followed by an inline update). bpf_spin_lock should be used if the inline update needs to be protected. BPF_MAP_TYPE_SK_STORAGE: ----------------------- To define a bpf "sk-local-storage", a BPF_MAP_TYPE_SK_STORAGE map (new in this patch) needs to be created. Multiple BPF_MAP_TYPE_SK_STORAGE maps can be created to fit different bpf progs' needs. The map enforces BTF to allow printing the sk-local-storage during a system-wise sk dump (e.g. "ss -ta") in the future. The purpose of a BPF_MAP_TYPE_SK_STORAGE map is not for lookup/update/delete a "sk-local-storage" data from a particular sk. Think of the map as a meta-data (or "type") of a "sk-local-storage". This particular "type" of "sk-local-storage" data can then be stored in any sk. The main purposes of this map are mostly: 1. Define the size of a "sk-local-storage" type. 2. Provide a similar syscall userspace API as the map (e.g. lookup/update, map-id, map-btf...etc.) 3. Keep track of all sk's storages of this "type" and clean them up when the map is freed. sk->sk_bpf_storage: ------------------ The main lookup/update/delete is done on sk->sk_bpf_storage (which is a "struct bpf_sk_storage"). When doing a lookup, the "map" pointer is now used as the "key" to search on the sk_storage->list. The "map" pointer is actually serving as the "type" of the "sk-local-storage" that is being requested. To allow very fast lookup, it should be as fast as looking up an array at a stable-offset. At the same time, it is not ideal to set a hard limit on the number of sk-local-storage "type" that the system can have. Hence, this patch takes a cache approach. The last search result from sk_storage->list is cached in sk_storage->cache[] which is a stable sized array. Each "sk-local-storage" type has a stable offset to the cache[] array. In the future, a map's flag could be introduced to do cache opt-out/enforcement if it became necessary. The cache size is 16 (i.e. 16 types of "sk-local-storage"). Programs can share map. On the program side, having a few bpf_progs running in the networking hotpath is already a lot. The bpf_prog should have already consolidated the existing sock-key-ed map usage to minimize the map lookup penalty. 16 has enough runway to grow. All sk-local-storage data will be removed from sk->sk_bpf_storage during sk destruction. bpf_sk_storage_get() and bpf_sk_storage_delete(): ------------------------------------------------ Instead of using bpf_map_(lookup|update|delete)_elem(), the bpf prog needs to use the new helper bpf_sk_storage_get() and bpf_sk_storage_delete(). The verifier can then enforce the ARG_PTR_TO_SOCKET argument. The bpf_sk_storage_get() also allows to "create" new elem if one does not exist in the sk. It is done by the new BPF_SK_STORAGE_GET_F_CREATE flag. An optional value can also be provided as the initial value during BPF_SK_STORAGE_GET_F_CREATE. The BPF_MAP_TYPE_SK_STORAGE also supports bpf_spin_lock. Together, it has eliminated the potential use cases for an equivalent bpf_map_update_elem() API (for bpf_prog) in this patch. Misc notes: ---------- 1. map_get_next_key is not supported. From the userspace syscall perspective, the map has the socket fd as the key while the map can be shared by pinned-file or map-id. Since btf is enforced, the existing "ss" could be enhanced to pretty print the local-storage. Supporting a kernel defined btf with 4 tuples as the return key could be explored later also. 2. The sk->sk_lock cannot be acquired. Atomic operations is used instead. e.g. cmpxchg is done on the sk->sk_bpf_storage ptr. Please refer to the source code comments for the details in synchronization cases and considerations. 3. The mem is charged to the sk->sk_omem_alloc as the sk filter does. Benchmark: --------- Here is the benchmark data collected by turning on the "kernel.bpf_stats_enabled" sysctl. Two bpf progs are tested: One bpf prog with the usual bpf hashmap (max_entries = 8192) with the sk ptr as the key. (verifier is modified to support sk ptr as the key That should have shortened the key lookup time.) Another bpf prog is with the new BPF_MAP_TYPE_SK_STORAGE. Both are storing a "u32 cnt", do a lookup on "egress_skb/cgroup" for each egress skb and then bump the cnt. netperf is used to drive data with 4096 connected UDP sockets. BPF_MAP_TYPE_HASH with a modifier verifier (152ns per bpf run) 27: cgroup_skb name egress_sk_map tag 74f56e832918070b run_time_ns 58280107540 run_cnt 381347633 loaded_at 2019-04-15T13:46:39-0700 uid 0 xlated 344B jited 258B memlock 4096B map_ids 16 btf_id 5 BPF_MAP_TYPE_SK_STORAGE in this patch (66ns per bpf run) 30: cgroup_skb name egress_sk_stora tag d4aa70984cc7bbf6 run_time_ns 25617093319 run_cnt 390989739 loaded_at 2019-04-15T13:47:54-0700 uid 0 xlated 168B jited 156B memlock 4096B map_ids 17 btf_id 6 Here is a high-level picture on how are the objects organized: sk ┌──────┐ │ │ │ │ │ │ │*sk_bpf_storage─────▶ bpf_sk_storage └──────┘ ┌───────┐ ┌───────────┤ list │ │ │ │ │ │ │ │ │ │ │ └───────┘ │ │ elem │ ┌────────┐ ├─▶│ snode │ │ ├────────┤ │ │ data │ bpf_map │ ├────────┤ ┌─────────┐ │ │map_node│◀─┬─────┤ list │ │ └────────┘ │ │ │ │ │ │ │ │ elem │ │ │ │ ┌────────┐ │ └─────────┘ └─▶│ snode │ │ ├────────┤ │ bpf_map │ data │ │ ┌─────────┐ ├────────┤ │ │ list ├───────▶│map_node│ │ │ │ └────────┘ │ │ │ │ │ │ elem │ └─────────┘ ┌────────┐ │ ┌─▶│ snode │ │ │ ├────────┤ │ │ │ data │ │ │ ├────────┤ │ │ │map_node│◀─┘ │ └────────┘ │ │ │ ┌───────┐ sk └──────────│ list │ ┌──────┐ │ │ │ │ │ │ │ │ │ │ │ │ └───────┘ │*sk_bpf_storage───────▶bpf_sk_storage └──────┘ Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-26bpf: add writable context for raw tracepointsMatt Mullins3-2/+61
This is an opt-in interface that allows a tracepoint to provide a safe buffer that can be written from a BPF_PROG_TYPE_RAW_TRACEPOINT program. The size of the buffer must be a compile-time constant, and is checked before allowing a BPF program to attach to a tracepoint that uses this feature. The pointer to this buffer will be the first argument of tracepoints that opt in; the pointer is valid and can be bpf_probe_read() by both BPF_PROG_TYPE_RAW_TRACEPOINT and BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE programs that attach to such a tracepoint, but the buffer to which it points may only be written by the latter. Signed-off-by: Matt Mullins <mmullins@fb.com> Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-26Merge tag 'trace-v5.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-traceLinus Torvalds2-17/+20
Pull tracing fixes from Steven Rostedt: "Three tracing fixes: - Use "nosteal" for ring buffer splice pages - Memory leak fix in error path of trace_pid_write() - Fix preempt_enable_no_resched() (use preempt_enable()) in ring buffer code" * tag 'trace-v5.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: trace: Fix preempt_enable_no_resched() abuse tracing: Fix a memory leak by early error exit in trace_pid_write() tracing: Fix buffer_ref pipe ops
2019-04-26trace: Fix preempt_enable_no_resched() abusePeter Zijlstra1-1/+1
Unless the very next line is schedule(), or implies it, one must not use preempt_enable_no_resched(). It can cause a preemption to go missing and thereby cause arbitrary delays, breaking the PREEMPT=y invariant. Link: http://lkml.kernel.org/r/20190423200318.GY14281@hirez.programming.kicks-ass.net Cc: Waiman Long <longman@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: the arch/x86 maintainers <x86@kernel.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: huang ying <huang.ying.caritas@gmail.com> Cc: Roman Gushchin <guro@fb.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: stable@vger.kernel.org Fixes: 2c2d7329d8af ("tracing/ftrace: use preempt_enable_no_resched_notrace in ring_buffer_time_stamp()") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-04-26tracing: Fix a memory leak by early error exit in trace_pid_write()Wenwen Wang1-1/+4
In trace_pid_write(), the buffer for trace parser is allocated through kmalloc() in trace_parser_get_init(). Later on, after the buffer is used, it is then freed through kfree() in trace_parser_put(). However, it is possible that trace_pid_write() is terminated due to unexpected errors, e.g., ENOMEM. In that case, the allocated buffer will not be freed, which is a memory leak bug. To fix this issue, free the allocated buffer when an error is encountered. Link: http://lkml.kernel.org/r/1555726979-15633-1-git-send-email-wang6495@umn.edu Fixes: f4d34a87e9c10 ("tracing: Use pid bitmap instead of a pid array for set_event_pid") Cc: stable@vger.kernel.org Signed-off-by: Wenwen Wang <wang6495@umn.edu> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-04-26tracing: Fix buffer_ref pipe opsJann Horn1-14/+14
This fixes multiple issues in buffer_pipe_buf_ops: - The ->steal() handler must not return zero unless the pipe buffer has the only reference to the page. But generic_pipe_buf_steal() assumes that every reference to the pipe is tracked by the page's refcount, which isn't true for these buffers - buffer_pipe_buf_get(), which duplicates a buffer, doesn't touch the page's refcount. Fix it by using generic_pipe_buf_nosteal(), which refuses every attempted theft. It should be easy to actually support ->steal, but the only current users of pipe_buf_steal() are the virtio console and FUSE, and they also only use it as an optimization. So it's probably not worth the effort. - The ->get() and ->release() handlers can be invoked concurrently on pipe buffers backed by the same struct buffer_ref. Make them safe against concurrency by using refcount_t. - The pointers stored in ->private were only zeroed out when the last reference to the buffer_ref was dropped. As far as I know, this shouldn't be necessary anyway, but if we do it, let's always do it. Link: http://lkml.kernel.org/r/20190404215925.253531-1-jannh@google.com Cc: Ingo Molnar <mingo@redhat.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: stable@vger.kernel.org Fixes: 73a757e63114d ("ring-buffer: Return reader page back into existing ring buffer") Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-04-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller1-30/+46
Alexei Starovoitov says: ==================== pull-request: bpf 2019-04-25 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) the bpf verifier fix to properly mark registers in all stack frames, from Paul. 2) preempt_enable_no_resched->preempt_enable fix, from Peter. 3) other misc fixes. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller12-49/+88
Two easy cases of overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-25bpf: mark registers in all frames after pkt/null checksPaul Chaignon1-30/+46
In case of a null check on a pointer inside a subprog, we should mark all registers with this pointer as either safe or unknown, in both the current and previous frames. Currently, only spilled registers and registers in the current frame are marked. Packet bound checks in subprogs have the same issue. This patch fixes it to mark registers in previous frames as well. A good reproducer for null checks looks as follow: 1: ptr = bpf_map_lookup_elem(map, &key); 2: ret = subprog(ptr) { 3: return ptr != NULL; 4: } 5: if (ret) 6: value = *ptr; With the above, the verifier will complain on line 6 because it sees ptr as map_value_or_null despite the null check in subprog 1. Note that this patch fixes another resulting bug when using bpf_sk_release(): 1: sk = bpf_sk_lookup_tcp(...); 2: subprog(sk) { 3: if (sk) 4: bpf_sk_release(sk); 5: } 6: if (!sk) 7: return 0; 8: return 1; In the above, mark_ptr_or_null_regs will warn on line 6 because it will try to free the reference state, even though it was already freed on line 3. Fixes: f4d7e40a5b71 ("bpf: introduce function calls (verification)") Signed-off-by: Paul Chaignon <paul.chaignon@orange.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-25seccomp: Make NEW_LISTENER and TSYNC flags exclusiveTycho Andersen1-2/+15
As the comment notes, the return codes for TSYNC and NEW_LISTENER conflict, because they both return positive values, one in the case of success and one in the case of error. So, let's disallow both of these flags together. While this is technically a userspace break, all the users I know of are still waiting on me to land this feature in libseccomp, so I think it'll be safe. Also, at present my use case doesn't require TSYNC at all, so this isn't a big deal to disallow. If someone wanted to support this, a path forward would be to add a new flag like TSYNC_AND_LISTENER_YES_I_UNDERSTAND_THAT_TSYNC_WILL_JUST_RETURN_EAGAIN, but the use cases are so different I don't see it really happening. Finally, it's worth noting that this does actually fix a UAF issue: at the end of seccomp_set_mode_filter(), we have: if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) { if (ret < 0) { listener_f->private_data = NULL; fput(listener_f); put_unused_fd(listener); } else { fd_install(listener, listener_f); ret = listener; } } out_free: seccomp_filter_free(prepared); But if ret > 0 because TSYNC raced, we'll install the listener fd and then free the filter out from underneath it, causing a UAF when the task closes it or dies. This patch also switches the condition to be simply if (ret), so that if someone does add the flag mentioned above, they won't have to remember to fix this too. Reported-by: syzbot+b562969adb2e04af3442@syzkaller.appspotmail.com Fixes: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace") CC: stable@vger.kernel.org # v5.0+ Signed-off-by: Tycho Andersen <tycho@tycho.ws> Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: James Morris <jamorris@linux.microsoft.com>
2019-04-25bpf: support BPF_PROG_QUERY for BPF_FLOW_DISSECTOR attach_typeStanislav Fomichev1-0/+2
target_fd is target namespace. If there is a flow dissector BPF program attached to that namespace, its (single) id is returned. v5: * drop net ref right after rcu unlock (Daniel Borkmann) v4: * add missing put_net (Jann Horn) v3: * add missing inline to skb_flow_dissector_prog_query static def (kbuild test robot <lkp@intel.com>) v2: * don't sleep in rcu critical section (Jakub Kicinski) * check input prog_cnt (exit early) Cc: Jann Horn <jannh@google.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-25sched/numa: Fix a possible divide-by-zeroXie XiuQi1-0/+4
sched_clock_cpu() may not be consistent between CPUs. If a task migrates to another CPU, then se.exec_start is set to that CPU's rq_clock_task() by update_stats_curr_start(). Specifically, the new value might be before the old value due to clock skew. So then if in numa_get_avg_runtime() the expression: 'now - p->last_task_numa_placement' ends up as -1, then the divider '*period + 1' in task_numa_placement() is 0 and things go bang. Similar to update_curr(), check if time goes backwards to avoid this. [ peterz: Wrote new changelog. ] [ mingo: Tweaked the code comment. ] Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: cj.chengjian@huawei.com Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20190425080016.GX11158@hirez.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller6-60/+641
Alexei Starovoitov says: ==================== pull-request: bpf-next 2019-04-22 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) allow stack/queue helpers from more bpf program types, from Alban. 2) allow parallel verification of root bpf programs, from Alexei. 3) introduce bpf sysctl hook for trusted root cases, from Andrey. 4) recognize var/datasec in btf deduplication, from Andrii. 5) cpumap performance optimizations, from Jesper. 6) verifier prep for alu32 optimization, from Jiong. 7) libbpf xsk cleanup, from Magnus. 8) other various fixes and cleanups. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-23bpf: drop bpf_verifier_lockAlexei Starovoitov1-3/+5
Drop bpf_verifier_lock for root to avoid being DoS-ed by unprivileged. The BPF verifier is now fully parallel. All unpriv users are still serialized by bpf_verifier_lock to avoid exhausting kernel memory by running N parallel verifications. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-23bpf: remove global variablesAlexei Starovoitov1-12/+13
Move three global variables protected by bpf_verifier_lock into 'struct bpf_verifier_env' to allow parallel verification. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-20Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds3-2/+11
Pull timer fixes from Ingo Molnar: "Misc clocksource driver fixes, and a sched-clock wrapping fix" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timers/sched_clock: Prevent generic sched_clock wrap caused by tick_freeze() clocksource/drivers/timer-ti-dm: Remove omap_dm_timer_set_load_start clocksource/drivers/oxnas: Fix OX820 compatible clocksource/drivers/arm_arch_timer: Remove unneeded pr_fmt macro clocksource/drivers/npcm: select TIMER_OF
2019-04-20Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds4-39/+43
Pull perf fixes from Ingo Molnar: "Misc fixes: - various tooling fixes - kretprobe fixes - kprobes annotation fixes - kprobes error checking fix - fix the default events for AMD Family 17h CPUs - PEBS fix - AUX record fix - address filtering fix" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/kprobes: Avoid kretprobe recursion bug kprobes: Mark ftrace mcount handler functions nokprobe x86/kprobes: Verify stack frame on kretprobe perf/x86/amd: Add event map for AMD Family 17h perf bpf: Return NULL when RB tree lookup fails in perf_env__find_btf() perf tools: Fix map reference counting perf evlist: Fix side band thread draining perf tools: Check maps for bpf programs perf bpf: Return NULL when RB tree lookup fails in perf_env__find_bpf_prog_info() tools include uapi: Sync sound/asound.h copy perf top: Always sample time to satisfy needs of use of ordered queuing perf evsel: Use hweight64() instead of hweight_long(attr.sample_regs_user) tools lib traceevent: Fix missing equality check for strcmp perf stat: Disable DIR_FORMAT feature for 'perf stat record' perf scripts python: export-to-sqlite.py: Fix use of parent_id in calls_view perf header: Fix lock/unlock imbalances when processing BPF/BTF info perf/x86: Fix incorrect PEBS_REGS perf/ring_buffer: Fix AUX record suppression perf/core: Fix the address filtering fix kprobes: Fix error check when reusing optimized probes
2019-04-20Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds2-2/+26
Pull scheduler fixes from Ingo Molnar: "A deadline scheduler warning/race fix, and a cfs_period_us quota calculation workaround where the real fix looks too involved to merge immediately" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/deadline: Correctly handle active 0-lag timers sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup
2019-04-20Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-4/+5
Pull locking fixes from Ingo Molnar: "A lockdep warning fix and a script execution fix when atomics are generated" * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/atomics: Don't assume that scripts are executable locking/lockdep: Make lockdep_unregister_key() honor 'debug_locks' again
2019-04-19kernel/watchdog_hld.c: hard lockup message should end with a newlineSergey Senozhatsky1-1/+2
Separate print_modules() and hard lockup error message. Before the patch: NMI watchdog: Watchdog detected hard LOCKUP on cpu 1Modules linked in: nls_cp437 Link: http://lkml.kernel.org/r/20190412062557.2700-1-sergey.senozhatsky@gmail.com Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-04-19kprobes: Mark ftrace mcount handler functions nokprobeMasami Hiramatsu1-1/+5
Mark ftrace mcount handler functions nokprobe since probing on these functions with kretprobe pushes return address incorrectly on kretprobe shadow stack. Reported-by: Francis Deslauriers <francis.deslauriers@efficios.com> Tested-by: Andrea Righi <righi.andrea@gmail.com> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/155094062044.6137.6419622920568680640.stgit@devbox Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-18ipv6: Add rate limit mask for ICMPv6 messagesStephen Suryaputra1-0/+6
To make ICMPv6 closer to ICMPv4, add ratemask parameter. Since the ICMP message types use larger numeric values, a simple bitmask doesn't fit. I use large bitmap. The input and output are the in form of list of ranges. Set the default to rate limit all error messages but Packet Too Big. For Packet Too Big, use ratemask instead of hard-coded. There are functions where icmpv6_xrlim_allow() and icmpv6_global_allow() aren't called. This patch only adds them to icmpv6_echo_reply(). Rate limiting error messages is mandated by RFC 4443 but RFC 4890 says that it is also acceptable to rate limit informational messages. Thus, I removed the current hard-coded behavior of icmpv6_mask_allow() that doesn't rate limit informational messages. v2: Add dummy function proc_do_large_bitmap() if CONFIG_PROC_SYSCTL isn't defined, expand the description in ip-sysctl.txt and remove unnecessary conditional before kfree(). v3: Inline the bitmap instead of dynamically allocated. Still is a pointer to it is needed because of the way proc_do_large_bitmap work. Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-18signal: use fdget() since we don't allow O_PATHChristian Brauner1-1/+1
As stated in the original commit for pidfd_send_signal() we don't allow to signal processes through O_PATH file descriptors since it is semantically equivalent to a write on the pidfd. We already correctly error out right now and return EBADF if an O_PATH fd is passed. This is because we use file->f_op to detect whether a pidfd is passed and O_PATH fds have their file->f_op set to empty_fops in do_dentry_open() and thus fail the test. Thus, there is no regression. It's just semantically correct to use fdget() and return an error right from there instead of taking a reference and returning an error later. Signed-off-by: Christian Brauner <christian@brauner.io> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Kees Cook <keescook@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jann Horn <jann@thejh.net> Cc: David Howells <dhowells@redhat.com> Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> Cc: Andy Lutomirsky <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-04-18timers/sched_clock: Prevent generic sched_clock wrap caused by tick_freeze()Chang-An Chen3-2/+11
tick_freeze() introduced by suspend-to-idle in commit 124cf9117c5f ("PM / sleep: Make it possible to quiesce timers during suspend-to-idle") uses timekeeping_suspend() instead of syscore_suspend() during suspend-to-idle. As a consequence generic sched_clock will keep going because sched_clock_suspend() and sched_clock_resume() are not invoked during suspend-to-idle which can result in a generic sched_clock wrap. On a ARM system with suspend-to-idle enabled, sched_clock is registered as "56 bits at 13MHz, resolution 76ns, wraps every 4398046511101ns", which means the real wrapping duration is 8796093022202ns. [ 134.551779] suspend-to-idle suspend (timekeeping_suspend()) [ 1204.912239] suspend-to-idle resume (timekeeping_resume()) ...... [ 1206.912239] suspend-to-idle suspend (timekeeping_suspend()) [ 5880.502807] suspend-to-idle resume (timekeeping_resume()) ...... [ 6000.403724] suspend-to-idle suspend (timekeeping_suspend()) [ 8035.753167] suspend-to-idle resume (timekeeping_resume()) ...... [ 8795.786684] (2)[321:charger_thread]...... [ 8795.788387] (2)[321:charger_thread]...... [ 0.057226] (0)[0:swapper/0]...... [ 0.061447] (2)[0:swapper/2]...... sched_clock was not stopped during suspend-to-idle, and sched_clock_poll hrtimer was not expired because timekeeping_suspend() was invoked during suspend-to-idle. It makes sched_clock wrap at kernel time 8796s. To prevent this, invoke sched_clock_suspend() and sched_clock_resume() in tick_freeze() together with timekeeping_suspend() and timekeeping_resume(). Fixes: 124cf9117c5f (PM / sleep: Make it possible to quiesce timers during suspend-to-idle) Signed-off-by: Chang-An Chen <chang-an.chen@mediatek.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Matthias Brugger <matthias.bgg@gmail.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Kees Cook <keescook@chromium.org> Cc: Corey Minyard <cminyard@mvista.com> Cc: <linux-mediatek@lists.infradead.org> Cc: <linux-arm-kernel@lists.infradead.org> Cc: Stanley Chu <stanley.chu@mediatek.com> Cc: <kuohong.wang@mediatek.com> Cc: <freddy.hsin@mediatek.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/1553828349-8914-1-git-send-email-chang-an.chen@mediatek.com
2019-04-17bpf: cpumap memory prefetchw optimizations for struct pageJesper Dangaard Brouer1-0/+12
A lot of the performance gain comes from this patch. While analysing performance overhead it was found that the largest CPU stalls were caused when touching the struct page area. It is first read with a READ_ONCE from build_skb_around via page_is_pfmemalloc(), and when freed written by page_frag_free() call. Measurements show that the prefetchw (W) variant operation is needed to achieve the performance gain. We believe this optimization it two fold, first the W-variant saves one step in the cache-coherency protocol, and second it helps us to avoid the non-temporal prefetch HW optimizations and bring this into all cache-levels. It might be worth investigating if prefetch into L2 will have the same benefit. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-17bpf: cpumap do bulk allocation of SKBsJesper Dangaard Brouer1-7/+15
As cpumap now batch consume xdp_frame's from the ptr_ring, it knows how many SKBs it need to allocate. Thus, lets bulk allocate these SKBs via kmem_cache_alloc_bulk() API, and use the previously introduced function build_skb_around(). Notice that the flag __GFP_ZERO asks the slab/slub allocator to clear the memory for us. This does clear a larger area than needed, but my micro benchmarks on Intel CPUs show that this is slightly faster due to being a cacheline aligned area is cleared for the SKBs. (For SLUB allocator, there is a future optimization potential, because SKBs will with high probability originate from same page. If we can find/identify continuous memory areas then the Intel CPU memset rep stos will have a real performance gain.) Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-17bpf: cpumap use ptr_ring_consume_batchedJesper Dangaard Brouer1-10/+11
Move ptr_ring dequeue outside loop, that allocate SKBs and calls network stack, as these operations that can take some time. The ptr_ring is a communication channel between CPUs, where we want to reduce/limit any cacheline bouncing. Do a concentrated bulk dequeue via ptr_ring_consume_batched, to shorten the period and times the remote cacheline in ptr_ring is read Batch size 8 is both to (1) limit BH-disable period, and (2) consume one cacheline on 64-bit archs. After reducing the BH-disable section further then we can consider changing this, while still thinking about L1 cacheline size being active. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller9-34/+72
Conflict resolution of af_smc.c from Stephen Rothwell. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-16bpf: use BPF_CAST_CALL for casting bpf callPrashant Bhole1-3/+2
verifier.c uses BPF_CAST_CALL for casting bpf call except at one place in jit_subprogs(). Let's use the macro for consistency. Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-16sched/deadline: Correctly handle active 0-lag timersluca abeni1-2/+1
syzbot reported the following warning: [ ] WARNING: CPU: 4 PID: 17089 at kernel/sched/deadline.c:255 task_non_contending+0xae0/0x1950 line 255 of deadline.c is: WARN_ON(hrtimer_active(&dl_se->inactive_timer)); in task_non_contending(). Unfortunately, in some cases (for example, a deadline task continuosly blocking and waking immediately) it can happen that a task blocks (and task_non_contending() is called) while the 0-lag timer is still active. In this case, the safest thing to do is to immediately decrease the running bandwidth of the task, without trying to re-arm the 0-lag timer. Signed-off-by: luca abeni <luca.abeni@santannapisa.it> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: chengjian (D) <cj.chengjian@huawei.com> Link: https://lkml.kernel.org/r/20190325131530.34706-1-luca.abeni@santannapisa.it Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-16sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockupPhil Auld1-0/+25
With extremely short cfs_period_us setting on a parent task group with a large number of children the for loop in sched_cfs_period_timer() can run until the watchdog fires. There is no guarantee that the call to hrtimer_forward_now() will ever return 0. The large number of children can make do_sched_cfs_period_timer() take longer than the period. NMI watchdog: Watchdog detected hard LOCKUP on cpu 24 RIP: 0010:tg_nop+0x0/0x10 <IRQ> walk_tg_tree_from+0x29/0xb0 unthrottle_cfs_rq+0xe0/0x1a0 distribute_cfs_runtime+0xd3/0xf0 sched_cfs_period_timer+0xcb/0x160 ? sched_cfs_slack_timer+0xd0/0xd0 __hrtimer_run_queues+0xfb/0x270 hrtimer_interrupt+0x122/0x270 smp_apic_timer_interrupt+0x6a/0x140 apic_timer_interrupt+0xf/0x20 </IRQ> To prevent this we add protection to the loop that detects when the loop has run too many times and scales the period and quota up, proportionally, so that the timer can complete before then next period expires. This preserves the relative runtime quota while preventing the hard lockup. A warning is issued reporting this state and the new values. Signed-off-by: Phil Auld <pauld@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> Cc: Anton Blanchard <anton@ozlabs.org> Cc: Ben Segall <bsegall@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190319130005.25492-1-pauld@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-16perf/ring_buffer: Fix AUX record suppressionAlexander Shishkin1-18/+15
The following commit: 1627314fb54a33e ("perf: Suppress AUX/OVERWRITE records") has an unintended side-effect of also suppressing all AUX records with no flags and non-zero size, so all the regular records in the full trace mode. This breaks some use cases for people. Fix this by restoring "regular" AUX records. Reported-by: Ben Gainey <Ben.Gainey@arm.com> Tested-by: Ben Gainey <Ben.Gainey@arm.com> Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Fixes: 1627314fb54a33e ("perf: Suppress AUX/OVERWRITE records") Link: https://lkml.kernel.org/r/20190329091338.29999-1-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-16perf/core: Fix the address filtering fixAlexander Shishkin1-16/+21
The following recent commit: c60f83b813e5 ("perf, pt, coresight: Fix address filters for vmas with non-zero offset") changes the address filtering logic to communicate filter ranges to the PMU driver via a single address range object, instead of having the driver do the final bit of math. That change forgets to take into account kernel filters, which are not calculated the same way as DSO based filters. Fix that by passing the kernel filters the same way as file-based filters. This doesn't require any additional changes in the drivers. Reported-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Fixes: c60f83b813e5 ("perf, pt, coresight: Fix address filters for vmas with non-zero offset") Link: https://lkml.kernel.org/r/20190329091212.29870-1-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-16bpf: add map helper functions push, pop, peek in more BPF programsAlban Crequy2-0/+12
commit f1a2e44a3aec ("bpf: add queue and stack maps") introduced new BPF helper functions: - BPF_FUNC_map_push_elem - BPF_FUNC_map_pop_elem - BPF_FUNC_map_peek_elem but they were made available only for network BPF programs. This patch makes them available for tracepoint, cgroup and lirc programs. Signed-off-by: Alban Crequy <alban@kinvolk.io> Cc: Mauricio Vasquez B <mauricio.vasquez@polito.it> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-16kprobes: Fix error check when reusing optimized probesMasami Hiramatsu1-4/+2
The following commit introduced a bug in one of our error paths: 819319fc9346 ("kprobes: Return error if we fail to reuse kprobe instead of BUG_ON()") it missed to handle the return value of kprobe_optready() as error-value. In reality, the kprobe_optready() returns a bool result, so "true" case must be passed instead of 0. This causes some errors on kprobe boot-time selftests on ARM: [ ] Beginning kprobe tests... [ ] Probe ARM code [ ] kprobe [ ] kretprobe [ ] ARM instruction simulation [ ] Check decoding tables [ ] Run test cases [ ] FAIL: test_case_handler not run [ ] FAIL: Test andge r10, r11, r14, asr r7 [ ] FAIL: Scenario 11 ... [ ] FAIL: Scenario 7 [ ] Total instruction simulation tests=1631, pass=1433 fail=198 [ ] kprobe tests failed This can happen if an optimized probe is unregistered and next kprobe is registered on same address until the previous probe is not reclaimed. If this happens, a hidden aggregated probe may be kept in memory, and no new kprobe can probe same address. Also, in that case register_kprobe() will return "1" instead of minus error value, which can mislead caller logic. Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: David S . Miller <davem@davemloft.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Naveen N . Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org # v5.0+ Fixes: 819319fc9346 ("kprobes: Return error if we fail to reuse kprobe instead of BUG_ON()") Link: http://lkml.kernel.org/r/155530808559.32517.539898325433642204.stgit@devnote2 Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-16locking/lockdep: Make lockdep_unregister_key() honor 'debug_locks' againBart Van Assche1-4/+5
If lockdep_register_key() and lockdep_unregister_key() are called with debug_locks == false then the following warning is reported: WARNING: CPU: 2 PID: 15145 at kernel/locking/lockdep.c:4920 lockdep_unregister_key+0x1ad/0x240 That warning is reported because lockdep_unregister_key() ignores the value of 'debug_locks' and because the behavior of lockdep_register_key() depends on whether or not 'debug_locks' is set. Fix this inconsistency by making lockdep_unregister_key() take 'debug_locks' again into account. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will.deacon@arm.com> Cc: shenghui <shhuiw@foxmail.com> Fixes: 90c1cba2b3b3 ("locking/lockdep: Zap lock classes even with lock debugging disabled") Link: http://lkml.kernel.org/r/20190415170538.23491-1-bvanassche@acm.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller1-0/+10
Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains Netfilter updates for net-next: 1) Remove the broute pseudo hook, implement this from the bridge prerouting hook instead. Now broute becomes real table in ebtables, from Florian Westphal. This also includes a size reduction patch for the bridge control buffer area via squashing boolean into bitfields and a selftest. 2) Add OS passive fingerprint version matching, from Fernando Fernandez. 3) Support for gue encapsulation for IPVS, from Jacky Hu. 4) Add support for NAT to the inet family, from Florian Westphal. This includes support for masquerade, redirect and nat extensions. 5) Skip interface lookup in flowtable, use device in the dst object. 6) Add jiffies64_to_msecs() and use it, from Li RongQing. 7) Remove unused parameter in nf_tables_set_desc_parse(), from Colin Ian King. 8) Statify several functions, patches from YueHaibing and Florian Westphal. 9) Add an optimized version of nf_inet_addr_cmp(), from Li RongQing. 10) Merge route extension to core, also from Florian. 11) Use IS_ENABLED(CONFIG_NF_NAT) instead of NF_NAT_NEEDED, from Florian. 12) Merge ip/ip6 masquerade extensions, from Florian. This includes netdevice notifier unification. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-14Merge branch 'page-refs' (page ref overflow)Linus Torvalds1-1/+5
Merge page ref overflow branch. Jann Horn reported that he can overflow the page ref count with sufficient memory (and a filesystem that is intentionally extremely slow). Admittedly it's not exactly easy. To have more than four billion references to a page requires a minimum of 32GB of kernel memory just for the pointers to the pages, much less any metadata to keep track of those pointers. Jann needed a total of 140GB of memory and a specially crafted filesystem that leaves all reads pending (in order to not ever free the page references and just keep adding more). Still, we have a fairly straightforward way to limit the two obvious user-controllable sources of page references: direct-IO like page references gotten through get_user_pages(), and the splice pipe page duplication. So let's just do that. * branch page-refs: fs: prevent page refcount overflow in pipe_buf_get mm: prevent get_user_pages() from overflowing page refcount mm: add 'try_get_page()' helper function mm: make page ref count overflow check tighter and more explicit
2019-04-14fs: prevent page refcount overflow in pipe_buf_getMatthew Wilcox1-1/+5
Change pipe_buf_get() to return a bool indicating whether it succeeded in raising the refcount of the page (if the thing in the pipe is a page). This removes another mechanism for overflowing the page refcount. All callers converted to handle a failure. Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Matthew Wilcox <willy@infradead.org> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-04-12Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-1/+1
Pull timer fix from Ingo Molnar: "Fix the alarm_timer_remaining() return value" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: alarmtimer: Return correct remaining time
2019-04-12Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-3/+3
Pull scheduler fix from Ingo Molnar: "Fix a NULL pointer dereference crash in certain environments" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Do not re-read ->h_load_next during hierarchical load calculation
2019-04-12Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds2-11/+45
Pull perf fixes from Ingo Molnar: "Six kernel side fixes: three related to NMI handling on AMD systems, a race fix, a kexec initialization fix and a PEBS sampling fix" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/core: Fix perf_event_disable_inatomic() race x86/perf/amd: Remove need to check "running" bit in NMI handler x86/perf/amd: Resolve NMI latency issues for active PMCs x86/perf/amd: Resolve race condition when disabling PMC perf/x86/intel: Initialize TFA MSR perf/x86/intel: Fix handling of wakeup_events for multi-entry PEBS