aboutsummaryrefslogtreecommitdiffstats
path: root/kernel (follow)
AgeCommit message (Collapse)AuthorFilesLines
2019-04-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller1-0/+10
Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains Netfilter updates for net-next: 1) Remove the broute pseudo hook, implement this from the bridge prerouting hook instead. Now broute becomes real table in ebtables, from Florian Westphal. This also includes a size reduction patch for the bridge control buffer area via squashing boolean into bitfields and a selftest. 2) Add OS passive fingerprint version matching, from Fernando Fernandez. 3) Support for gue encapsulation for IPVS, from Jacky Hu. 4) Add support for NAT to the inet family, from Florian Westphal. This includes support for masquerade, redirect and nat extensions. 5) Skip interface lookup in flowtable, use device in the dst object. 6) Add jiffies64_to_msecs() and use it, from Li RongQing. 7) Remove unused parameter in nf_tables_set_desc_parse(), from Colin Ian King. 8) Statify several functions, patches from YueHaibing and Florian Westphal. 9) Add an optimized version of nf_inet_addr_cmp(), from Li RongQing. 10) Merge route extension to core, also from Florian. 11) Use IS_ENABLED(CONFIG_NF_NAT) instead of NF_NAT_NEEDED, from Florian. 12) Merge ip/ip6 masquerade extensions, from Florian. This includes netdevice notifier unification. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller10-150/+903
Daniel Borkmann says: ==================== pull-request: bpf-next 2019-04-12 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) Improve BPF verifier scalability for large programs through two optimizations: i) remove verifier states that are not useful in pruning, ii) stop walking parentage chain once first LIVE_READ is seen. Combined gives approx 20x speedup. Increase limits for accepting large programs under root, and add various stress tests, from Alexei. 2) Implement global data support in BPF. This enables static global variables for .data, .rodata and .bss sections to be properly handled which allows for more natural program development. This also opens up the possibility to optimize program workflow by compiling ELFs only once and later only rewriting section data before reload, from Daniel and with test cases and libbpf refactoring from Joe. 3) Add config option to generate BTF type info for vmlinux as part of the kernel build process. DWARF debug info is converted via pahole to BTF. Latter relies on libbpf and makes use of BTF deduplication algorithm which results in 100x savings compared to DWARF data. Resulting .BTF section is typically about 2MB in size, from Andrii. 4) Add BPF verifier support for stack access with variable offset from helpers and add various test cases along with it, from Andrey. 5) Extend bpf_skb_adjust_room() growth BPF helper to mark inner MAC header so that L2 encapsulation can be used for tc tunnels, from Alan. 6) Add support for input __sk_buff context in BPF_PROG_TEST_RUN so that users can define a subset of allowed __sk_buff fields that get fed into the test program, from Stanislav. 7) Add bpf fs multi-dimensional array tests for BTF test suite and fix up various UBSAN warnings in bpftool, from Yonghong. 8) Generate a pkg-config file for libbpf, from Luca. 9) Dump program's BTF id in bpftool, from Prashant. 10) libbpf fix to use smaller BPF log buffer size for AF_XDP's XDP program, from Magnus. 11) kallsyms related fixes for the case when symbols are not present in BPF selftests and samples, from Daniel ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-11bpf: support input __sk_buff context in BPF_PROG_TEST_RUNStanislav Fomichev1-1/+9
Add new set of arguments to bpf_attr for BPF_PROG_TEST_RUN: * ctx_in/ctx_size_in - input context * ctx_out/ctx_size_out - output context The intended use case is to pass some meta data to the test runs that operate on skb (this has being brought up on recent LPC). For programs that use bpf_prog_test_run_skb, support __sk_buff input and output. Initially, from input __sk_buff, copy _only_ cb and priority into skb, all other non-zero fields are prohibited (with EINVAL). If the user has set ctx_out/ctx_size_out, copy the potentially modified __sk_buff back to the userspace. We require all fields of input __sk_buff except the ones we explicitly support to be set to zero. The expectation is that in the future we might add support for more fields and we want to fail explicitly if the user runs the program on the kernel where we don't yet support them. The API is intentionally vague (i.e. we don't explicitly add __sk_buff to bpf_attr, but ctx_in) to potentially let other test_run types use this interface in the future (this can be xdp_md for xdp types for example). v4: * don't copy more than allowed in bpf_ctx_init [Martin] v3: * handle case where ctx_in is NULL, but ctx_out is not [Martin] * convert size==0 checks to ptr==NULL checks and add some extra ptr checks [Martin] v2: * Addressed comments from Martin Lau Signed-off-by: Stanislav Fomichev <sdf@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-09bpf: allow for key-less BTF in array mapDaniel Borkmann3-6/+26
Given we'll be reusing BPF array maps for global data/bss/rodata sections, we need a way to associate BTF DataSec type as its map value type. In usual cases we have this ugly BPF_ANNOTATE_KV_PAIR() macro hack e.g. via 38d5d3b3d5db ("bpf: Introduce BPF_ANNOTATE_KV_PAIR") to get initial map to type association going. While more use cases for it are discouraged, this also won't work for global data since the use of array map is a BPF loader detail and therefore unknown at compilation time. For array maps with just a single entry we make an exception in terms of BTF in that key type is declared optional if value type is of DataSec type. The latter LLVM is guaranteed to emit and it also aligns with how we regard global data maps as just a plain buffer area reusing existing map facilities for allowing things like introspection with existing tools. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-09bpf: kernel side support for BTF Var and DataSecDaniel Borkmann1-20/+397
This work adds kernel-side verification, logging and seq_show dumping of BTF Var and DataSec kinds which are emitted with latest LLVM. The following constraints apply: BTF Var must have: - Its kind_flag is 0 - Its vlen is 0 - Must point to a valid type - Type must not resolve to a forward type - Size of underlying type must be > 0 - Must have a valid name - Can only be a source type, not sink or intermediate one - Name may include dots (e.g. in case of static variables inside functions) - Cannot be a member of a struct/union - Linkage so far can either only be static or global/allocated BTF DataSec must have: - Its kind_flag is 0 - Its vlen cannot be 0 - Its size cannot be 0 - Must have a valid name - Can only be a source type, not sink or intermediate one - Name may include dots (e.g. to represent .bss, .data, .rodata etc) - Cannot be a member of a struct/union - Inner btf_var_secinfo array with {type,offset,size} triple must be sorted by offset in ascending order - Type must always point to BTF Var - BTF resolved size of Var must be <= size provided by triple - DataSec size must be >= sum of triple sizes (thus holes are allowed) btf_var_resolve(), btf_ptr_resolve() and btf_modifier_resolve() are on a high level quite similar but each come with slight, subtle differences. They could potentially be a bit refactored in future which hasn't been done here to ease review. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-09bpf: allow . char as part of the object nameDaniel Borkmann1-3/+3
Trivial addition to allow '.' aside from '_' as "special" characters in the object name. Used to allow for substrings in maps from loader side such as ".bss", ".data", ".rodata", but could also be useful for other purposes. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-09bpf: add syscall side map freeze supportDaniel Borkmann1-12/+54
This patch adds a new BPF_MAP_FREEZE command which allows to "freeze" the map globally as read-only / immutable from syscall side. Map permission handling has been refactored into map_get_sys_perms() and drops FMODE_CAN_WRITE in case of locked map. Main use case is to allow for setting up .rodata sections from the BPF ELF which are loaded into the kernel, meaning BPF loader first allocates map, sets up map value by copying .rodata section into it and once complete, it calls BPF_MAP_FREEZE on the map fd to prevent further modifications. Right now BPF_MAP_FREEZE only takes map fd as argument while remaining bpf_attr members are required to be zero. I didn't add write-only locking here as counterpart since I don't have a concrete use-case for it on my side, and I think it makes probably more sense to wait once there is actually one. In that case bpf_attr can be extended as usual with a flag field and/or others where flag 0 means that we lock the map read-only hence this doesn't prevent to add further extensions to BPF_MAP_FREEZE upon need. A map creation flag like BPF_F_WRONCE was not considered for couple of reasons: i) in case of a generic implementation, a map can consist of more than just one element, thus there could be multiple map updates needed to set the map into a state where it can then be made immutable, ii) WRONCE indicates exact one-time write before it is then set immutable. A generic implementation would set a bit atomically on map update entry (if unset), indicating that every subsequent update from then onwards will need to bail out there. However, map updates can fail, so upon failure that flag would need to be unset again and the update attempt would need to be repeated for it to be eventually made immutable. While this can be made race-free, this approach feels less clean and in combination with reason i), it's not generic enough. A dedicated BPF_MAP_FREEZE command directly sets the flag and caller has the guarantee that map is immutable from syscall side upon successful return for any future syscall invocations that would alter the map state, which is also more intuitive from an API point of view. A command name such as BPF_MAP_LOCK has been avoided as it's too close with BPF map spin locks (which already has BPF_F_LOCK flag). BPF_MAP_FREEZE is so far only enabled for privileged users. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-09bpf: add program side {rd, wr}only support for mapsDaniel Borkmann7-13/+62
This work adds two new map creation flags BPF_F_RDONLY_PROG and BPF_F_WRONLY_PROG in order to allow for read-only or write-only BPF maps from a BPF program side. Today we have BPF_F_RDONLY and BPF_F_WRONLY, but this only applies to system call side, meaning the BPF program has full read/write access to the map as usual while bpf(2) calls with map fd can either only read or write into the map depending on the flags. BPF_F_RDONLY_PROG and BPF_F_WRONLY_PROG allows for the exact opposite such that verifier is going to reject program loads if write into a read-only map or a read into a write-only map is detected. For read-only map case also some helpers are forbidden for programs that would alter the map state such as map deletion, update, etc. As opposed to the two BPF_F_RDONLY / BPF_F_WRONLY flags, BPF_F_RDONLY_PROG as well as BPF_F_WRONLY_PROG really do correspond to the map lifetime. We've enabled this generic map extension to various non-special maps holding normal user data: array, hash, lru, lpm, local storage, queue and stack. Further generic map types could be followed up in future depending on use-case. Main use case here is to forbid writes into .rodata map values from verifier side. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-09bpf: do not retain flags that are not tied to map lifetimeDaniel Borkmann1-1/+13
Both BPF_F_WRONLY / BPF_F_RDONLY flags are tied to the map file descriptor, but not to the map object itself! Meaning, at map creation time BPF_F_RDONLY can be set to make the map read-only from syscall side, but this holds only for the returned fd, so any other fd either retrieved via bpf file system or via map id for the very same underlying map object can have read-write access instead. Given that, keeping the two flags around in the map_flags attribute and exposing them to user space upon map dump is misleading and may lead to false conclusions. Since these two flags are not tied to the map object lets also not store them as map property. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-09bpf: implement lookup-free direct value access for mapsDaniel Borkmann5-30/+124
This generic extension to BPF maps allows for directly loading an address residing inside a BPF map value as a single BPF ldimm64 instruction! The idea is similar to what BPF_PSEUDO_MAP_FD does today, which is a special src_reg flag for ldimm64 instruction that indicates that inside the first part of the double insns's imm field is a file descriptor which the verifier then replaces as a full 64bit address of the map into both imm parts. For the newly added BPF_PSEUDO_MAP_VALUE src_reg flag, the idea is the following: the first part of the double insns's imm field is again a file descriptor corresponding to the map, and the second part of the imm field is an offset into the value. The verifier will then replace both imm parts with an address that points into the BPF map value at the given value offset for maps that support this operation. Currently supported is array map with single entry. It is possible to support more than just single map element by reusing both 16bit off fields of the insns as a map index, so full array map lookup could be expressed that way. It hasn't been implemented here due to lack of concrete use case, but could easily be done so in future in a compatible way, since both off fields right now have to be 0 and would correctly denote a map index 0. The BPF_PSEUDO_MAP_VALUE is a distinct flag as otherwise with BPF_PSEUDO_MAP_FD we could not differ offset 0 between load of map pointer versus load of map's value at offset 0, and changing BPF_PSEUDO_MAP_FD's encoding into off by one to differ between regular map pointer and map value pointer would add unnecessary complexity and increases barrier for debugability thus less suitable. Using the second part of the imm field as an offset into the value does /not/ come with limitations since maximum possible value size is in u32 universe anyway. This optimization allows for efficiently retrieving an address to a map value memory area without having to issue a helper call which needs to prepare registers according to calling convention, etc, without needing the extra NULL test, and without having to add the offset in an additional instruction to the value base pointer. The verifier then treats the destination register as PTR_TO_MAP_VALUE with constant reg->off from the user passed offset from the second imm field, and guarantees that this is within bounds of the map value. Any subsequent operations are normally treated as typical map value handling without anything extra needed from verification side. The two map operations for direct value access have been added to array map for now. In future other types could be supported as well depending on the use case. The main use case for this commit is to allow for BPF loader support for global variables that reside in .data/.rodata/.bss sections such that we can directly load the address of them with minimal additional infrastructure required. Loader support has been added in subsequent commits for libbpf library. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller3-5/+9
2019-04-08time: Introduce jiffies64_to_msecs()Li RongQing1-0/+10
there is a similar helper in net/netfilter/nf_tables_api.c, this maybe become a common request someday, so move it to time.c Signed-off-by: Zhang Yu <zhangyu31@baidu.com> Signed-off-by: Li RongQing <lirongqing@baidu.com> Acked-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-05Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-1/+2
Merge misc fixes from Andrew Morton: "14 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: kernel/sysctl.c: fix out-of-bounds access when setting file-max mm/util.c: fix strndup_user() comment sh: fix multiple function definition build errors MAINTAINERS: add maintainer and replacing reviewer ARM/NUVOTON NPCM MAINTAINERS: fix bad pattern in ARM/NUVOTON NPCM mm: writeback: use exact memcg dirty counts psi: clarify the units used in pressure files mm/huge_memory.c: fix modifying of page protection by insert_pfn_pmd() hugetlbfs: fix memory leak for resv_map mm: fix vm_fault_t cast in VM_FAULT_GET_HINDEX() lib/lzo: fix bugs for very short or empty input include/linux/bitrev.h: fix constant bitrev kmemleak: powerpc: skip scanning holes in the .bss section lib/string.c: implement a basic bcmp
2019-04-05kernel/sysctl.c: fix out-of-bounds access when setting file-maxWill Deacon1-1/+2
Commit 32a5ad9c2285 ("sysctl: handle overflow for file-max") hooked up min/max values for the file-max sysctl parameter via the .extra1 and .extra2 fields in the corresponding struct ctl_table entry. Unfortunately, the minimum value points at the global 'zero' variable, which is an int. This results in a KASAN splat when accessed as a long by proc_doulongvec_minmax on 64-bit architectures: | BUG: KASAN: global-out-of-bounds in __do_proc_doulongvec_minmax+0x5d8/0x6a0 | Read of size 8 at addr ffff2000133d1c20 by task systemd/1 | | CPU: 0 PID: 1 Comm: systemd Not tainted 5.1.0-rc3-00012-g40b114779944 #2 | Hardware name: linux,dummy-virt (DT) | Call trace: | dump_backtrace+0x0/0x228 | show_stack+0x14/0x20 | dump_stack+0xe8/0x124 | print_address_description+0x60/0x258 | kasan_report+0x140/0x1a0 | __asan_report_load8_noabort+0x18/0x20 | __do_proc_doulongvec_minmax+0x5d8/0x6a0 | proc_doulongvec_minmax+0x4c/0x78 | proc_sys_call_handler.isra.19+0x144/0x1d8 | proc_sys_write+0x34/0x58 | __vfs_write+0x54/0xe8 | vfs_write+0x124/0x3c0 | ksys_write+0xbc/0x168 | __arm64_sys_write+0x68/0x98 | el0_svc_common+0x100/0x258 | el0_svc_handler+0x48/0xc0 | el0_svc+0x8/0xc | | The buggy address belongs to the variable: | zero+0x0/0x40 | | Memory state around the buggy address: | ffff2000133d1b00: 00 00 00 00 00 00 00 00 fa fa fa fa 04 fa fa fa | ffff2000133d1b80: fa fa fa fa 04 fa fa fa fa fa fa fa 04 fa fa fa | >ffff2000133d1c00: fa fa fa fa 04 fa fa fa fa fa fa fa 00 00 00 00 | ^ | ffff2000133d1c80: fa fa fa fa 00 fa fa fa fa fa fa fa 00 00 00 00 | ffff2000133d1d00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Fix the splat by introducing a unsigned long 'zero_ul' and using that instead. Link: http://lkml.kernel.org/r/20190403153409.17307-1-will.deacon@arm.com Fixes: 32a5ad9c2285 ("sysctl: handle overflow for file-max") Signed-off-by: Will Deacon <will.deacon@arm.com> Acked-by: Christian Brauner <christian@brauner.io> Cc: Kees Cook <keescook@chromium.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Matteo Croce <mcroce@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-04-05Merge tag 'trace-5.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-traceLinus Torvalds2-4/+7
Pull syscall-get-arguments cleanup and fixes from Steven Rostedt: "Andy Lutomirski approached me to tell me that the syscall_get_arguments() implementation in x86 was horrible and gcc certainly gets it wrong. He said that since the tracepoints only pass in 0 and 6 for i and n repectively, it should be optimized for that case. Inspecting the kernel, I discovered that all users pass in 0 for i and only one file passing in something other than 6 for the number of arguments. That code happens to be my own code used for the special syscall tracing. That can easily be converted to just using 0 and 6 as well, and only copying what is needed. Which is probably the faster path anyway for that case. Along the way, a couple of real fixes came from this as the syscall_get_arguments() function was incorrect for csky and riscv. x86 has been optimized to for the new interface that removes the variable number of arguments, but the other architectures could still use some loving and take more advantage of the simpler interface" * tag 'trace-5.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: syscalls: Remove start and number from syscall_set_arguments() args syscalls: Remove start and number from syscall_get_arguments() args csky: Fix syscall_get_arguments() and syscall_set_arguments() riscv: Fix syscall_get_arguments() and syscall_set_arguments() tracing/syscalls: Pass in hardcoded 6 into syscall_get_arguments() ptrace: Remove maxargs from task_current_syscall()
2019-04-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller7-34/+70
Minor comment merge conflict in mlx5. Staging driver has a fixup due to the skb->xmit_more changes in 'net-next', but was removed in 'net'. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-05bpf: Add missed newline in verifier verbose logAndrey Ignatov1-1/+1
check_stack_access() that prints verbose log is used in adjust_ptr_min_max_vals() that prints its own verbose log and now they stick together, e.g.: variable stack access var_off=(0xfffffffffffffff0; 0x4) off=-16 size=1R2 stack pointer arithmetic goes out of range, prohibited for !root Add missing newline so that log is more readable: variable stack access var_off=(0xfffffffffffffff0; 0x4) off=-16 size=1 R2 stack pointer arithmetic goes out of range, prohibited for !root Fixes: f1174f77b50c ("bpf/verifier: rework value tracking") Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-05bpf: Sanity check max value for var_off stack accessAndrey Ignatov1-3/+15
As discussed in [1] max value of variable offset has to be checked for overflow on stack access otherwise verifier would accept code like this: 0: (b7) r2 = 6 1: (b7) r3 = 28 2: (7a) *(u64 *)(r10 -16) = 0 3: (7a) *(u64 *)(r10 -8) = 0 4: (79) r4 = *(u64 *)(r1 +168) 5: (c5) if r4 s< 0x0 goto pc+4 R1=ctx(id=0,off=0,imm=0) R2=inv6 R3=inv28 R4=inv(id=0,umax_value=9223372036854775807,var_off=(0x0; 0x7fffffffffffffff)) R10=fp0,call_-1 fp-8=mmmmmmmm fp-16=mmmmmmmm 6: (17) r4 -= 16 7: (0f) r4 += r10 8: (b7) r5 = 8 9: (85) call bpf_getsockopt#57 10: (b7) r0 = 0 11: (95) exit , where R4 obviosly has unbounded max value. Fix it by checking that reg->smax_value is inside (-BPF_MAX_VAR_OFF; BPF_MAX_VAR_OFF) range. reg->smax_value is used instead of reg->umax_value because stack pointers are calculated using negative offset from fp. This is opposite to e.g. map access where offset must be non-negative and where umax_value is used. Also dedicated verbose logs are added for both min and max bound check failures to have diagnostics consistent with variable offset handling in check_map_access(). [1] https://marc.info/?l=linux-netdev&m=155433357510597&w=2 Fixes: 2011fccfb61b ("bpf: Support variable offset stack access from helpers") Reported-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-05bpf: Reject indirect var_off stack access in unpriv modeAndrey Ignatov1-0/+16
Proper support of indirect stack access with variable offset in unprivileged mode (!root) requires corresponding support in Spectre masking for stack ALU in retrieve_ptr_limit(). There are no use-case for variable offset in unprivileged mode though so make verifier reject such accesses for simplicity. Pointer arithmetics is one (and only?) way to cause variable offset and it's already rejected in unpriv mode so that verifier won't even get to helper function whose argument contains variable offset, e.g.: 0: (7a) *(u64 *)(r10 -16) = 0 1: (7a) *(u64 *)(r10 -8) = 0 2: (61) r2 = *(u32 *)(r1 +0) 3: (57) r2 &= 4 4: (17) r2 -= 16 5: (0f) r2 += r10 variable stack access var_off=(0xfffffffffffffff0; 0x4) off=-16 size=1R2 stack pointer arithmetic goes out of range, prohibited for !root Still it looks like a good idea to reject variable offset indirect stack access for unprivileged mode in check_stack_boundary() explicitly. Fixes: 2011fccfb61b ("bpf: Support variable offset stack access from helpers") Reported-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-05bpf: Reject indirect var_off stack access in raw modeAndrey Ignatov1-0/+9
It's hard to guarantee that whole memory is marked as initialized on helper return if uninitialized stack is accessed with variable offset since specific bounds are unknown to verifier. This may cause uninitialized stack leaking. Reject such an access in check_stack_boundary to prevent possible leaking. There are no known use-cases for indirect uninitialized stack access with variable offset so it shouldn't break anything. Fixes: 2011fccfb61b ("bpf: Support variable offset stack access from helpers") Reported-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-05syscalls: Remove start and number from syscall_get_arguments() argsSteven Rostedt (Red Hat)2-3/+3
At Linux Plumbers, Andy Lutomirski approached me and pointed out that the function call syscall_get_arguments() implemented in x86 was horribly written and not optimized for the standard case of passing in 0 and 6 for the starting index and the number of system calls to get. When looking at all the users of this function, I discovered that all instances pass in only 0 and 6 for these arguments. Instead of having this function handle different cases that are never used, simply rewrite it to return the first 6 arguments of a system call. This should help out the performance of tracing system calls by ptrace, ftrace and perf. Link: http://lkml.kernel.org/r/20161107213233.754809394@goodmis.org Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dominik Brodowski <linux@dominikbrodowski.net> Cc: Dave Martin <dave.martin@arm.com> Cc: "Dmitry V. Levin" <ldv@altlinux.org> Cc: x86@kernel.org Cc: linux-snps-arc@lists.infradead.org Cc: linux-kernel@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-c6x-dev@linux-c6x.org Cc: uclinux-h8-devel@lists.sourceforge.jp Cc: linux-hexagon@vger.kernel.org Cc: linux-ia64@vger.kernel.org Cc: linux-mips@vger.kernel.org Cc: nios2-dev@lists.rocketboards.org Cc: openrisc@lists.librecores.org Cc: linux-parisc@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-riscv@lists.infradead.org Cc: linux-s390@vger.kernel.org Cc: linux-sh@vger.kernel.org Cc: sparclinux@vger.kernel.org Cc: linux-um@lists.infradead.org Cc: linux-xtensa@linux-xtensa.org Cc: linux-arch@vger.kernel.org Acked-by: Paul Burton <paul.burton@mips.com> # MIPS parts Acked-by: Max Filippov <jcmvbkbc@gmail.com> # For xtensa changes Acked-by: Will Deacon <will.deacon@arm.com> # For the arm64 bits Reviewed-by: Thomas Gleixner <tglx@linutronix.de> # for x86 Reviewed-by: Dmitry V. Levin <ldv@altlinux.org> Reported-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-04-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds3-19/+31
Pull networking fixes from David Miller: 1) Several hash table refcount fixes in batman-adv, from Sven Eckelmann. 2) Use after free in bpf_evict_inode(), from Daniel Borkmann. 3) Fix mdio bus registration in ixgbe, from Ivan Vecera. 4) Unbounded loop in __skb_try_recv_datagram(), from Paolo Abeni. 5) ila rhashtable corruption fix from Herbert Xu. 6) Don't allow upper-devices to be added to vrf devices, from Sabrina Dubroca. 7) Add qmi_wwan device ID for Olicard 600, from Bjørn Mork. 8) Don't leave skb->next poisoned in __netif_receive_skb_list_ptype, from Alexander Lobakin. 9) Missing IDR checks in mlx5 driver, from Aditya Pakki. 10) Fix false connection termination in ktls, from Jakub Kicinski. 11) Work around some ASPM issues with r8169 by disabling rx interrupt coalescing on certain chips. From Heiner Kallweit. 12) Properly use per-cpu qstat values on NOLOCK qdiscs, from Paolo Abeni. 13) Fully initialize sockaddr_in structures in SCTP, from Xin Long. 14) Various BPF flow dissector fixes from Stanislav Fomichev. 15) Divide by zero in act_sample, from Davide Caratti. 16) Fix bridging multicast regression introduced by rhashtable conversion, from Nikolay Aleksandrov. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (106 commits) ibmvnic: Fix completion structure initialization ipv6: sit: reset ip header pointer in ipip6_rcv net: bridge: always clear mcast matching struct on reports and leaves libcxgb: fix incorrect ppmax calculation vlan: conditional inclusion of FCoE hooks to match netdevice.h and bnx2x sch_cake: Make sure we can write the IP header before changing DSCP bits sch_cake: Use tc_skb_protocol() helper for getting packet protocol tcp: Ensure DCTCP reacts to losses net/sched: act_sample: fix divide by zero in the traffic path net: thunderx: fix NULL pointer dereference in nicvf_open/nicvf_stop net: hns: Fix sparse: some warnings in HNS drivers net: hns: Fix WARNING when remove HNS driver with SMMU enabled net: hns: fix ICMP6 neighbor solicitation messages discard problem net: hns: Fix probabilistic memory overwrite when HNS driver initialized net: hns: Use NAPI_POLL_WEIGHT for hns driver net: hns: fix KASAN: use-after-free in hns_nic_net_xmit_hw() flow_dissector: rst'ify documentation ipv6: Fix dangling pointer when ipv6 fragment net-gro: Fix GRO flush when receiving a GSO packet. flow_dissector: document BPF flow dissector environment ...
2019-04-04tracing/syscalls: Pass in hardcoded 6 into syscall_get_arguments()Steven Rostedt (Red Hat)1-3/+6
The only users that calls syscall_get_arguments() with a variable and not a hard coded '6' is ftrace_syscall_enter(). syscall_get_arguments() can be optimized by removing a variable input, and always grabbing 6 arguments regardless of what the system call actually uses. Change ftrace_syscall_enter() to pass the 6 args into a local stack array and copy the necessary arguments into the trace event as needed. This is needed to remove two parameters from syscall_get_arguments(). Link: http://lkml.kernel.org/r/20161107213233.627583542@goodmis.org Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-04-04bpf: increase verifier log limitAlexei Starovoitov1-1/+1
The existing 16Mbyte verifier log limit is not enough for log_level=2 even for small programs. Increase it to 1Gbyte. Note it's not a kernel memory limit. It's an amount of memory user space provides to store the verifier log. The kernel populates it 1k at a time. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-04bpf: increase complexity limit and maximum program sizeAlexei Starovoitov2-2/+2
Large verifier speed improvements allow to increase verifier complexity limit. Now regardless of the program composition and its size it takes little time for the verifier to hit insn_processed limit. On typical x86 machine non-debug kernel processes 1M instructions in 1/10 of a second. (before these speed improvements specially crafted programs could be hitting multi-second verification times) Full kasan kernel with debug takes ~1 second for the same 1M insns. Hence bump the BPF_COMPLEXITY_LIMIT_INSNS limit to 1M. Also increase the number of instructions per program from 4k to internal BPF_COMPLEXITY_LIMIT_INSNS limit. 4k limit was confusing to users, since small programs with hundreds of insns could be hitting BPF_COMPLEXITY_LIMIT_INSNS limit. Sometimes adding more insns and bpf_trace_printk debug statements would make the verifier accept the program while removing code would make the verifier reject it. Some user space application started to add #define MAX_FOO to their programs and do: MAX_FOO=100; again: compile with MAX_FOO; try to load; if (fails_to_load) { reduce MAX_FOO; goto again; } to be able to fit maximum amount of processing into single program. Other users artificially split their single program into a set of programs and use all 32 iterations of tail_calls to increase compute limits. And the most advanced folks used unlimited tc-bpf filter list to execute many bpf programs. Essentially the users managed to workaround 4k insn limit. This patch removes the limit for root programs from uapi. BPF_COMPLEXITY_LIMIT_INSNS is the kernel internal limit and success to load the program no longer depends on program size, but on 'smartness' of the verifier only. The verifier will continue to get smarter with every kernel release. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-04bpf: verbose jump offset overflow checkAlexei Starovoitov2-6/+12
Larger programs may trigger 16-bit jump offset overflow check during instruction patching. Make this error verbose otherwise users cannot decipher error code without printks in the verifier. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-04bpf: convert temp arrays to kvcallocAlexei Starovoitov1-7/+7
Temporary arrays used during program verification need to be vmalloc-ed to support large bpf programs. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-04bpf: improve verification speed by not remarking live_readAlexei Starovoitov1-0/+9
With large verifier speed improvement brought by the previous patch mark_reg_read() becomes the hottest function during verification. On a typical program it consumes 40% of cpu. mark_reg_read() walks parentage chain of registers to mark parents as LIVE_READ. Once the register is marked there is no need to remark it again in the future. Hence stop walking the chain once first LIVE_READ is seen. This optimization drops mark_reg_read() time from 40% of cpu to <1% and overall 2x improvement of verification speed. For some programs the longest_mark_read_walk counter improves from ~500 to ~5 Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Edward Cree <ecree@solarflare.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-04bpf: improve verification speed by droping statesAlexei Starovoitov1-3/+41
Branch instructions, branch targets and calls in a bpf program are the places where the verifier remembers states that led to successful verification of the program. These states are used to prune brute force program analysis. For unprivileged programs there is a limit of 64 states per such 'branching' instructions (maximum length is tracked by max_states_per_insn counter introduced in the previous patch). Simply reducing this threshold to 32 or lower increases insn_processed metric to the point that small valid programs get rejected. For root programs there is no limit and cilium programs can have max_states_per_insn to be 100 or higher. Walking 100+ states multiplied by number of 'branching' insns during verification consumes significant amount of cpu time. Turned out simple LRU-like mechanism can be used to remove states that unlikely will be helpful in future search pruning. This patch introduces hit_cnt and miss_cnt counters: hit_cnt - this many times this state successfully pruned the search miss_cnt - this many times this state was not equivalent to other states (and that other states were added to state list) The heuristic introduced in this patch is: if (sl->miss_cnt > sl->hit_cnt * 3 + 3) /* drop this state from future considerations */ Higher numbers increase max_states_per_insn (allow more states to be considered for pruning) and slow verification speed, but do not meaningfully reduce insn_processed metric. Lower numbers drop too many states and insn_processed increases too much. Many different formulas were considered. This one is simple and works well enough in practice. (the analysis was done on selftests/progs/* and on cilium programs) The end result is this heuristic improves verification speed by 10 times. Large synthetic programs that used to take a second more now take 1/10 of a second. In cases where max_states_per_insn used to be 100 or more, now it's ~10. There is a slight increase in insn_processed for cilium progs: before after bpf_lb-DLB_L3.o 1831 1838 bpf_lb-DLB_L4.o 3029 3218 bpf_lb-DUNKNOWN.o 1064 1064 bpf_lxc-DDROP_ALL.o 26309 26935 bpf_lxc-DUNKNOWN.o 33517 34439 bpf_netdev.o 9713 9721 bpf_overlay.o 6184 6184 bpf_lcx_jit.o 37335 39389 And 2-3 times improvement in the verification speed. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-04bpf: add verifier stats and log_level bit 2Alexei Starovoitov1-24/+52
In order to understand the verifier bottlenecks add various stats and extend log_level: log_level 1 and 2 are kept as-is: bit 0 - level=1 - print every insn and verifier state at branch points bit 1 - level=2 - print every insn and verifier state at every insn bit 2 - level=4 - print verifier error and stats at the end of verification When verifier rejects the program the libbpf is trying to load the program twice. Once with log_level=0 (no messages, only error code is reported to user space) and second time with log_level=1 to tell the user why the verifier rejected it. With introduction of bit 2 - level=4 the libbpf can choose to always use that level and load programs once, since the verification speed is not affected and in case of error the verbose message will be available. Note that the verifier stats are not part of uapi just like all other verbose messages. They're expected to change in the future. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-01signal: don't silently convert SI_USER signals to non-current pidfdJann Horn1-9/+4
The current sys_pidfd_send_signal() silently turns signals with explicit SI_USER context that are sent to non-current tasks into signals with kernel-generated siginfo. This is unlike do_rt_sigqueueinfo(), which returns -EPERM in this case. If a user actually wants to send a signal with kernel-provided siginfo, they can do that with pidfd_send_signal(pidfd, sig, NULL, 0); so allowing this case is unnecessary. Instead of silently replacing the siginfo, just bail out with an error; this is consistent with other interfaces and avoids special-casing behavior based on security checks. Fixes: 3eb39f47934f ("signal: add pidfd_send_signal() syscall") Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Christian Brauner <christian@brauner.io>
2019-03-31Merge branch 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-2/+18
Pull CPU hotplug fixes from Thomas Gleixner: "Two SMT/hotplug related fixes: - Prevent crash when HOTPLUG_CPU is disabled and the CPU bringup aborts. This is triggered with the 'nosmt' command line option, but can happen by any abort condition. As the real unplug code is not compiled in, prevent the fail by keeping the CPU in zombie state. - Enforce HOTPLUG_CPU for SMP on x86 to avoid the above situation completely. With 'nosmt' being a popular option it's required to unplug the half brought up sibling CPUs (due to the MCE wreckage) completely" * 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/smp: Enforce CONFIG_HOTPLUG_CPU when SMP=y cpu/hotplug: Prevent crash when CPU bringup fails on CONFIG_HOTPLUG_CPU=n
2019-03-31Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-2/+4
Pull core fixes from Thomas Gleixner: "A small set of core updates: - Make the watchdog respect the selected CPU mask again. That was broken by the rework of the watchdog thread management and caused inconsistent state and NMI watchdog being unstoppable. - Ensure that the objtool build can find the libelf location. - Remove dead kcore stub code" * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: watchdog: Respect watchdog cpumask on CPU hotplug objtool: Query pkg-config for libelf location proc/kcore: Remove unused kclist_add_remap()
2019-03-29Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller3-19/+31
Daniel Borkmann says: ==================== pull-request: bpf 2019-03-29 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) Bug fix in BTF deduplication that was mishandling an equivalence comparison, from Andrii. 2) libbpf Makefile fixes to properly link against libelf for the shared object and to actually export AF_XDP's xsk.h header, from Björn. 3) Fix use after free in bpf inode eviction, from Daniel. 4) Fix a bug in skb creation out of cpumap redirect, from Jesper. 5) Remove an unnecessary and triggerable WARN_ONCE() in max number of call stack frames checking in verifier, from Paul. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29xdp: fix cpumap redirect SKB creation bugJesper Dangaard Brouer1-3/+10
We want to avoid leaking pointer info from xdp_frame (that is placed in top of frame) like commit 6dfb970d3dbd ("xdp: avoid leaking info stored in frame data on page reuse"), and followup commit 97e19cce05e5 ("bpf: reserve xdp_frame size in xdp headroom") that reserve this headroom. These changes also affected how cpumap constructed SKBs, as xdpf->headroom size changed, the skb data starting point were in-effect shifted with 32 bytes (sizeof xdp_frame). This was still okay, as the cpumap frame_size calculation also included xdpf->headroom which were reduced by same amount. A bug was introduced in commit 77ea5f4cbe20 ("bpf/cpumap: make sure frame_size for build_skb is aligned if headroom isn't"), where the xdpf->headroom became part of the SKB_DATA_ALIGN rounding up. This round-up to find the frame_size is in principle still correct as it does not exceed the 2048 bytes frame_size (which is max for ixgbe and i40e), but the 32 bytes offset of pkt_data_start puts this over the 2048 bytes limit. This cause skb_shared_info to spill into next frame. It is a little hard to trigger, as the SKB need to use above 15 skb_shinfo->frags[] as far as I calculate. This does happen in practise for TCP streams when skb_try_coalesce() kicks in. KASAN can be used to detect these wrong memory accesses, I've seen: BUG: KASAN: use-after-free in skb_try_coalesce+0x3cb/0x760 BUG: KASAN: wild-memory-access in skb_release_data+0xe2/0x250 Driver veth also construct a SKB from xdp_frame in this way, but is not affected, as it doesn't reserve/deduct the room (used by xdp_frame) from the SKB headroom. Instead is clears the pointers via xdp_scrub_frame(), and allows SKB to use this area. The fix in this patch is to do like veth and instead allow SKB to (re)use the area occupied by xdp_frame, by clearing via xdp_scrub_frame(). (This does kill the idea of the SKB being able to access (mem) info from this area, but I guess it was a bad idea anyhow, and it was already killed by the veth changes.) Fixes: 77ea5f4cbe20 ("bpf/cpumap: make sure frame_size for build_skb is aligned if headroom isn't") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-03-29bpf: Support variable offset stack access from helpersAndrey Ignatov1-21/+54
Currently there is a difference in how verifier checks memory access for helper arguments for PTR_TO_MAP_VALUE and PTR_TO_STACK with regard to variable part of offset. check_map_access, that is used for PTR_TO_MAP_VALUE, can handle variable offsets just fine, so that BPF program can call a helper like this: some_helper(map_value_ptr + off, size); , where offset is unknown at load time, but is checked by program to be in a safe rage (off >= 0 && off + size < map_value_size). But it's not the case for check_stack_boundary, that is used for PTR_TO_STACK, and same code with pointer to stack is rejected by verifier: some_helper(stack_value_ptr + off, size); For example: 0: (7a) *(u64 *)(r10 -16) = 0 1: (7a) *(u64 *)(r10 -8) = 0 2: (61) r2 = *(u32 *)(r1 +0) 3: (57) r2 &= 4 4: (17) r2 -= 16 5: (0f) r2 += r10 6: (18) r1 = 0xffff888111343a80 8: (85) call bpf_map_lookup_elem#1 invalid variable stack read R2 var_off=(0xfffffffffffffff0; 0x4) Add support for variable offset access to check_stack_boundary so that if offset is checked by program to be in a safe range it's accepted by verifier. Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-03-29ptrace: take into account saved_sigmask in PTRACE{GET,SET}SIGMASKAndrei Vagin1-2/+13
There are a few system calls (pselect, ppoll, etc) which replace a task sigmask while they are running in a kernel-space When a task calls one of these syscalls, the kernel saves a current sigmask in task->saved_sigmask and sets a syscall sigmask. On syscall-exit-stop, ptrace traps a task before restoring the saved_sigmask, so PTRACE_GETSIGMASK returns the syscall sigmask and PTRACE_SETSIGMASK does nothing, because its sigmask is replaced by saved_sigmask, when the task returns to user-space. This patch fixes this problem. PTRACE_GETSIGMASK returns saved_sigmask if it's set. PTRACE_SETSIGMASK drops the TIF_RESTORE_SIGMASK flag. Link: http://lkml.kernel.org/r/20181120060616.6043-1-avagin@gmail.com Fixes: 29000caecbe8 ("ptrace: add ability to get/set signal-blocked mask") Signed-off-by: Andrei Vagin <avagin@gmail.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-28cpu/hotplug: Prevent crash when CPU bringup fails on CONFIG_HOTPLUG_CPU=nThomas Gleixner1-2/+18
Tianyu reported a crash in a CPU hotplug teardown callback when booting a kernel which has CONFIG_HOTPLUG_CPU disabled with the 'nosmt' boot parameter. It turns out that the SMP=y CONFIG_HOTPLUG_CPU=n case has been broken forever in case that a bringup callback fails. Unfortunately this issue was not recognized when the CPU hotplug code was reworked, so the shortcoming just stayed in place. When a bringup callback fails, the CPU hotplug code rolls back the operation and takes the CPU offline. The 'nosmt' command line argument uses a bringup failure to abort the bringup of SMT sibling CPUs. This partial bringup is required due to the MCE misdesign on Intel CPUs. With CONFIG_HOTPLUG_CPU=y the rollback works perfectly fine, but CONFIG_HOTPLUG_CPU=n lacks essential mechanisms to exercise the low level teardown of a CPU including the synchronizations in various facilities like RCU, NOHZ and others. As a consequence the teardown callbacks which must be executed on the outgoing CPU within stop machine with interrupts disabled are executed on the control CPU in interrupt enabled and preemptible context causing the kernel to crash and burn. The pre state machine code has a different failure mode which is more subtle and resulting in a less obvious use after free crash because the control side frees resources which are still in use by the undead CPU. But this is not a x86 only problem. Any architecture which supports the SMP=y HOTPLUG_CPU=n combination suffers from the same issue. It's just less likely to be triggered because in 99.99999% of the cases all bringup callbacks succeed. The easy solution of making HOTPLUG_CPU mandatory for SMP is not working on all architectures as the following architectures have either no hotplug support at all or not all subarchitectures support it: alpha, arc, hexagon, openrisc, riscv, sparc (32bit), mips (partial). Crashing the kernel in such a situation is not an acceptable state either. Implement a minimal rollback variant by limiting the teardown to the point where all regular teardown callbacks have been invoked and leave the CPU in the 'dead' idle state. This has the following consequences: - the CPU is brought down to the point where the stop_machine takedown would happen. - the CPU stays there forever and is idle - The CPU is cleared in the CPU active mask, but not in the CPU online mask which is a legit state. - Interrupts are not forced away from the CPU - All facilities which only look at online mask would still see it, but that is the case during normal hotplug/unplug operations as well. It's just a (way) longer time frame. This will expose issues, which haven't been exposed before or only seldom, because now the normally transient state of being non active but online is a permanent state. In testing this exposed already an issue vs. work queues where the vmstat code schedules work on the almost dead CPU which ends up in an unbound workqueue and triggers 'preemtible context' warnings. This is not a problem of this change, it merily exposes an already existing issue. Still this is better than crashing fully without a chance to debug it. This is mainly thought as workaround for those architectures which do not support HOTPLUG_CPU. All others should enforce HOTPLUG_CPU for SMP. Fixes: 2e1a3483ce74 ("cpu/hotplug: Split out the state walk into functions") Reported-by: Tianyu Lan <Tianyu.Lan@microsoft.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Tianyu Lan <Tianyu.Lan@microsoft.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Konrad Wilk <konrad.wilk@oracle.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Mukesh Ojha <mojha@codeaurora.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Rik van Riel <riel@surriel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Micheal Kelley <michael.h.kelley@microsoft.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20190326163811.503390616@linutronix.de
2019-03-28watchdog: Respect watchdog cpumask on CPU hotplugThomas Gleixner1-2/+4
The rework of the watchdog core to use cpu_stop_work broke the watchdog cpumask on CPU hotplug. The watchdog_enable/disable() functions are now called unconditionally from the hotplug callback, i.e. even on CPUs which are not in the watchdog cpumask. As a consequence the watchdog can become unstoppable. Only invoke them when the plugged CPU is in the watchdog cpumask. Fixes: 9cf57731b63e ("watchdog/softlockup: Replace "watchdog/%u" threads with cpu_stop_work") Reported-by: Maxime Coquelin <maxime.coquelin@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Maxime Coquelin <maxime.coquelin@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Don Zickus <dzickus@redhat.com> Cc: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1903262245490.1789@nanos.tec.linutronix.de
2019-03-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller27-124/+336
2019-03-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds2-70/+106
Pull networking fixes from David Miller: "Fixes here and there, a couple new device IDs, as usual: 1) Fix BQL race in dpaa2-eth driver, from Ioana Ciornei. 2) Fix 64-bit division in iwlwifi, from Arnd Bergmann. 3) Fix documentation for some eBPF helpers, from Quentin Monnet. 4) Some UAPI bpf header sync with tools, also from Quentin Monnet. 5) Set descriptor ownership bit at the right time for jumbo frames in stmmac driver, from Aaro Koskinen. 6) Set IFF_UP properly in tun driver, from Eric Dumazet. 7) Fix load/store doubleword instruction generation in powerpc eBPF JIT, from Naveen N. Rao. 8) nla_nest_start() return value checks all over, from Kangjie Lu. 9) Fix asoc_id handling in SCTP after the SCTP_*_ASSOC changes this merge window. From Marcelo Ricardo Leitner and Xin Long. 10) Fix memory corruption with large MTUs in stmmac, from Aaro Koskinen. 11) Do not use ipv4 header for ipv6 flows in TCP and DCCP, from Eric Dumazet. 12) Fix topology subscription cancellation in tipc, from Erik Hugne. 13) Memory leak in genetlink error path, from Yue Haibing. 14) Valid control actions properly in packet scheduler, from Davide Caratti. 15) Even if we get EEXIST, we still need to rehash if a shrink was delayed. From Herbert Xu. 16) Fix interrupt mask handling in interrupt handler of r8169, from Heiner Kallweit. 17) Fix leak in ehea driver, from Wen Yang" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (168 commits) dpaa2-eth: fix race condition with bql frame accounting chelsio: use BUG() instead of BUG_ON(1) net: devlink: skip info_get op call if it is not defined in dumpit net: phy: bcm54xx: Encode link speed and activity into LEDs tipc: change to check tipc_own_id to return in tipc_net_stop net: usb: aqc111: Extend HWID table by QNAP device net: sched: Kconfig: update reference link for PIE net: dsa: qca8k: extend slave-bus implementations net: dsa: qca8k: remove leftover phy accessors dt-bindings: net: dsa: qca8k: support internal mdio-bus dt-bindings: net: dsa: qca8k: fix example net: phy: don't clear BMCR in genphy_soft_reset bpf, libbpf: clarify bump in libbpf version info bpf, libbpf: fix version info and add it to shared object rxrpc: avoid clang -Wuninitialized warning tipc: tipc clang warning net: sched: fix cleanup NULL pointer exception in act_mirr r8169: fix cable re-plugging issue net: ethernet: ti: fix possible object reference leak net: ibm: fix possible object reference leak ...
2019-03-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller1-15/+18
Alexei Starovoitov says: ==================== pull-request: bpf-next 2019-03-26 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) introduce bpf_tcp_check_syncookie() helper for XDP and tc, from Lorenz. 2) allow bpf_skb_ecn_set_ce() in tc, from Peter. 3) numerous bpf tc tunneling improvements, from Willem. 4) and other miscellaneous improvements from Adrian, Alan, Daniel, Ivan, Stanislav. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-26bpf: remove incorrect 'verifier bug' warningPaul Chaignon1-2/+3
The BPF verifier checks the maximum number of call stack frames twice, first in the main CFG traversal (do_check) and then in a subsequent traversal (check_max_stack_depth). If the second check fails, it logs a 'verifier bug' warning and errors out, as the number of call stack frames should have been verified already. However, the second check may fail without indicating a verifier bug: if the excessive function calls reside in dead code, the main CFG traversal may not visit them; the subsequent traversal visits all instructions, including dead code. This case raises the question of how invalid dead code should be treated. This patch implements the conservative option and rejects such code. Signed-off-by: Paul Chaignon <paul.chaignon@orange.com> Tested-by: Xiao Han <xiao.han@orange.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-03-26ftrace: Fix warning using plain integer as NULL & spelling correctionsHariprasad Kelam1-6/+6
Changed 0 --> NULL to avoid sparse warning Corrected spelling mistakes reported by checkpatch.pl Sparse warning below: sudo make C=2 CF=-D__CHECK_ENDIAN__ M=kernel/trace CHECK kernel/trace/ftrace.c kernel/trace/ftrace.c:3007:24: warning: Using plain integer as NULL pointer kernel/trace/ftrace.c:4758:37: warning: Using plain integer as NULL pointer Link: http://lkml.kernel.org/r/20190323183523.GA2244@hari-Inspiron-1545 Signed-off-by: Hariprasad Kelam <hariprasad.kelam@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-03-26tracing: initialize variable in create_dyn_event()Frank Rowand1-1/+1
Fix compile warning in create_dyn_event(): 'ret' may be used uninitialized in this function [-Wuninitialized]. Link: http://lkml.kernel.org/r/1553237900-8555-1-git-send-email-frowand.list@gmail.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Tom Zanussi <tom.zanussi@linux.intel.com> Cc: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com> Cc: stable@vger.kernel.org Fixes: 5448d44c3855 ("tracing: Add unified dynamic event framework") Signed-off-by: Frank Rowand <frank.rowand@sony.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-03-26tracing: Remove unnecessary var_ref destroy in track_data_destroy()Tom Zanussi1-1/+0
Commit 656fe2ba85e8 (tracing: Use hist trigger's var_ref array to destroy var_refs) centralized the destruction of all the var_refs in one place so that other code didn't have to do it. The track_data_destroy() added later ignored that and also destroyed the track_data var_ref, causing a double-free error flagged by KASAN. ================================================================== BUG: KASAN: use-after-free in destroy_hist_field+0x30/0x70 Read of size 8 at addr ffff888086df2210 by task bash/1694 CPU: 6 PID: 1694 Comm: bash Not tainted 5.1.0-rc1-test+ #15 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016 Call Trace: dump_stack+0x71/0xa0 ? destroy_hist_field+0x30/0x70 print_address_description.cold.3+0x9/0x1fb ? destroy_hist_field+0x30/0x70 ? destroy_hist_field+0x30/0x70 kasan_report.cold.4+0x1a/0x33 ? __kasan_slab_free+0x100/0x150 ? destroy_hist_field+0x30/0x70 destroy_hist_field+0x30/0x70 track_data_destroy+0x55/0xe0 destroy_hist_data+0x1f0/0x350 hist_unreg_all+0x203/0x220 event_trigger_open+0xbb/0x130 do_dentry_open+0x296/0x700 ? stacktrace_count_trigger+0x30/0x30 ? generic_permission+0x56/0x200 ? __x64_sys_fchdir+0xd0/0xd0 ? inode_permission+0x55/0x200 ? security_inode_permission+0x18/0x60 path_openat+0x633/0x22b0 ? path_lookupat.isra.50+0x420/0x420 ? __kasan_kmalloc.constprop.12+0xc1/0xd0 ? kmem_cache_alloc+0xe5/0x260 ? getname_flags+0x6c/0x2a0 ? do_sys_open+0x149/0x2b0 ? do_syscall_64+0x73/0x1b0 ? entry_SYSCALL_64_after_hwframe+0x44/0xa9 ? _raw_write_lock_bh+0xe0/0xe0 ? __kernel_text_address+0xe/0x30 ? unwind_get_return_address+0x2f/0x50 ? __list_add_valid+0x2d/0x70 ? deactivate_slab.isra.62+0x1f4/0x5a0 ? getname_flags+0x6c/0x2a0 ? set_track+0x76/0x120 do_filp_open+0x11a/0x1a0 ? may_open_dev+0x50/0x50 ? _raw_spin_lock+0x7a/0xd0 ? _raw_write_lock_bh+0xe0/0xe0 ? __alloc_fd+0x10f/0x200 do_sys_open+0x1db/0x2b0 ? filp_open+0x50/0x50 do_syscall_64+0x73/0x1b0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7fa7b24a4ca2 Code: 25 00 00 41 00 3d 00 00 41 00 74 4c 48 8d 05 85 7a 0d 00 8b 00 85 c0 75 6d 89 f2 b8 01 01 00 00 48 89 fe bf 9c ff ff ff 0f 05 <48> 3d 00 f0 ff ff 0f 87 a2 00 00 00 48 8b 4c 24 28 64 48 33 0c 25 RSP: 002b:00007fffbafb3af0 EFLAGS: 00000246 ORIG_RAX: 0000000000000101 RAX: ffffffffffffffda RBX: 000055d3648ade30 RCX: 00007fa7b24a4ca2 RDX: 0000000000000241 RSI: 000055d364a55240 RDI: 00000000ffffff9c RBP: 00007fffbafb3bf0 R08: 0000000000000020 R09: 0000000000000002 R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000000 R13: 0000000000000003 R14: 0000000000000001 R15: 000055d364a55240 ================================================================== So remove the track_data_destroy() destroy_hist_field() call for that var_ref. Link: http://lkml.kernel.org/r/1deffec420f6a16d11dd8647318d34a66d1989a9.camel@linux.intel.com Fixes: 466f4528fbc69 ("tracing: Generalize hist trigger onmax and save action") Reported-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-03-26bpf: fix use after free in bpf_evict_inodeDaniel Borkmann1-14/+18
syzkaller was able to generate the following UAF in bpf: BUG: KASAN: use-after-free in lookup_last fs/namei.c:2269 [inline] BUG: KASAN: use-after-free in path_lookupat.isra.43+0x9f8/0xc00 fs/namei.c:2318 Read of size 1 at addr ffff8801c4865c47 by task syz-executor2/9423 CPU: 0 PID: 9423 Comm: syz-executor2 Not tainted 4.20.0-rc1-next-20181109+ #110 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x244/0x39d lib/dump_stack.c:113 print_address_description.cold.7+0x9/0x1ff mm/kasan/report.c:256 kasan_report_error mm/kasan/report.c:354 [inline] kasan_report.cold.8+0x242/0x309 mm/kasan/report.c:412 __asan_report_load1_noabort+0x14/0x20 mm/kasan/report.c:430 lookup_last fs/namei.c:2269 [inline] path_lookupat.isra.43+0x9f8/0xc00 fs/namei.c:2318 filename_lookup+0x26a/0x520 fs/namei.c:2348 user_path_at_empty+0x40/0x50 fs/namei.c:2608 user_path include/linux/namei.h:62 [inline] do_mount+0x180/0x1ff0 fs/namespace.c:2980 ksys_mount+0x12d/0x140 fs/namespace.c:3258 __do_sys_mount fs/namespace.c:3272 [inline] __se_sys_mount fs/namespace.c:3269 [inline] __x64_sys_mount+0xbe/0x150 fs/namespace.c:3269 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x457569 Code: fd b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 cb b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007fde6ed96c78 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5 RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 0000000000457569 RDX: 0000000020000040 RSI: 0000000020000000 RDI: 0000000000000000 RBP: 000000000072bf00 R08: 0000000020000340 R09: 0000000000000000 R10: 0000000000200000 R11: 0000000000000246 R12: 00007fde6ed976d4 R13: 00000000004c2c24 R14: 00000000004d4990 R15: 00000000ffffffff Allocated by task 9424: save_stack+0x43/0xd0 mm/kasan/kasan.c:448 set_track mm/kasan/kasan.c:460 [inline] kasan_kmalloc+0xc7/0xe0 mm/kasan/kasan.c:553 __do_kmalloc mm/slab.c:3722 [inline] __kmalloc_track_caller+0x157/0x760 mm/slab.c:3737 kstrdup+0x39/0x70 mm/util.c:49 bpf_symlink+0x26/0x140 kernel/bpf/inode.c:356 vfs_symlink+0x37a/0x5d0 fs/namei.c:4127 do_symlinkat+0x242/0x2d0 fs/namei.c:4154 __do_sys_symlink fs/namei.c:4173 [inline] __se_sys_symlink fs/namei.c:4171 [inline] __x64_sys_symlink+0x59/0x80 fs/namei.c:4171 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe Freed by task 9425: save_stack+0x43/0xd0 mm/kasan/kasan.c:448 set_track mm/kasan/kasan.c:460 [inline] __kasan_slab_free+0x102/0x150 mm/kasan/kasan.c:521 kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528 __cache_free mm/slab.c:3498 [inline] kfree+0xcf/0x230 mm/slab.c:3817 bpf_evict_inode+0x11f/0x150 kernel/bpf/inode.c:565 evict+0x4b9/0x980 fs/inode.c:558 iput_final fs/inode.c:1550 [inline] iput+0x674/0xa90 fs/inode.c:1576 do_unlinkat+0x733/0xa30 fs/namei.c:4069 __do_sys_unlink fs/namei.c:4110 [inline] __se_sys_unlink fs/namei.c:4108 [inline] __x64_sys_unlink+0x42/0x50 fs/namei.c:4108 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe In this scenario path lookup under RCU is racing with the final unlink in case of symlinks. As Linus puts it in his analysis: [...] We actually RCU-delay the inode freeing itself, but when we do the final iput(), the "evict()" function is called synchronously. Now, the simple fix would seem to just RCU-delay the kfree() of the symlink data in bpf_evict_inode(). Maybe that's the right thing to do. [...] Al suggested to piggy-back on the ->destroy_inode() callback in order to implement RCU deferral there which can then kfree() the inode->i_link eventually right before putting inode back into inode cache. By reusing free_inode_nonrcu() from there we can avoid the need for our own inode cache and just reuse generic one as we currently do. And in-fact on top of all this we should just get rid of the bpf_evict_inode() entirely. This means truncate_inode_pages_final() and clear_inode() will then simply be called by the fs core via evict(). Dropping the reference should really only be done when inode is unhashed and nothing reachable anymore, so it's better also moved into the final ->destroy_inode() callback. Fixes: 0f98621bef5d ("bpf, inode: add support for symlinks and fix mtime/ctime") Reported-by: syzbot+fb731ca573367b7f6564@syzkaller.appspotmail.com Reported-by: syzbot+a13e5ead792d6df37818@syzkaller.appspotmail.com Reported-by: syzbot+7a8ba368b47fdefca61e@syzkaller.appspotmail.com Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Analyzed-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/lkml/0000000000006946d2057bbd0eef@google.com/T/
2019-03-24Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds3-56/+89
Pull scheduler updates from Thomas Gleixner: "Third more careful attempt for this set of fixes: - Prevent a 32bit math overflow in the cpufreq code - Fix a buffer overflow when scanning the cgroup2 cpu.max property - A set of fixes for the NOHZ scheduler logic to prevent waking up CPUs even if the capacity of the busy CPUs is sufficient along with other tweaks optimizing the behaviour for asymmetric systems (big/little)" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Skip LLC NOHZ logic for asymmetric systems sched/fair: Tune down misfit NOHZ kicks sched/fair: Comment some nohz_balancer_kick() kick conditions sched/core: Fix buffer overflow in cgroup2 property cpu.max sched/cpufreq: Fix 32-bit math overflow
2019-03-24Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-0/+2
Pull perf updates from Thomas Gleixner: "A larger set of perf updates. Not all of them are strictly fixes, but that's solely the tip maintainers fault as they let the timely -rc1 pull request fall through the cracks for various reasons including travel. So I'm sending this nevertheless because rebasing and distangling fixes and updates would be a mess and risky as well. As of tomorrow, a strict fixes separation is happening again. Sorry for the slip-up. Kernel: - Handle RECORD_MMAP vs. RECORD_MMAP2 correctly so different consumers of the mmap event get what they requested. Tools: - A larger set of updates to perf record/report/scripts vs. time stamp handling - More Python3 fixups - A pile of memory leak plumbing - perf BPF improvements and fixes - Finalize the perf.data directory storage" [ Note: the kernel part is strictly a fix, the updates are purely to tooling - Linus ] * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (75 commits) perf bpf: Show more BPF program info in print_bpf_prog_info() perf bpf: Extract logic to create program names from perf_event__synthesize_one_bpf_prog() perf tools: Save bpf_prog_info and BTF of new BPF programs perf evlist: Introduce side band thread perf annotate: Enable annotation of BPF programs perf build: Check what binutils's 'disassembler()' signature to use perf bpf: Process PERF_BPF_EVENT_PROG_LOAD for annotation perf symbols: Introduce DSO_BINARY_TYPE__BPF_PROG_INFO perf feature detection: Add -lopcodes to feature-libbfd perf top: Add option --no-bpf-event perf bpf: Save BTF information as headers to perf.data perf bpf: Save BTF in a rbtree in perf_env perf bpf: Save bpf_prog_info information as headers to perf.data perf bpf: Save bpf_prog_info in a rbtree in perf_env perf bpf: Make synthesize_bpf_events() receive perf_session pointer instead of perf_tool perf bpf: Synthesize bpf events with bpf_program__get_prog_info_linear() bpftool: use bpf_program__get_prog_info_linear() in prog.c:do_dump() tools lib bpf: Introduce bpf_program__get_prog_info_linear() perf record: Replace option --bpf-event with --no-bpf-event perf tests: Fix a memory leak in test__perf_evsel__tp_sched_test() ...
2019-03-24Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-1/+1
Pull timer fixes from Thomas Gleixner: "A set of small fixes plus the removal of stale board support code: - Remove the board support code from the clpx711x clocksource driver. This change had fallen through the cracks and I'm sending it now rather than dealing with people who want to improve that stale code for 3 month. - Use the proper clocksource mask on RICSV - Make local scope functions and variables static" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: clocksource/drivers/clps711x: Remove board support clocksource/drivers/riscv: Fix clocksource mask clocksource/drivers/mips-gic-timer: Make gic_compare_irqaction static clocksource/drivers/timer-ti-dm: Make omap_dm_timer_set_load_start() static clocksource/drivers/tcb_clksrc: Make tc_clksrc_suspend/resume() static clocksource/drivers/clps711x: Make clps711x_clksrc_init() static time/jiffies: Make refined_jiffies static