aboutsummaryrefslogtreecommitdiffstats
path: root/kernel (follow)
AgeCommit message (Collapse)AuthorFilesLines
2016-11-25bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commandsDaniel Mack1-0/+81
Extend the bpf(2) syscall by two new commands, BPF_PROG_ATTACH and BPF_PROG_DETACH which allow attaching and detaching eBPF programs to a target. On the API level, the target could be anything that has an fd in userspace, hence the name of the field in union bpf_attr is called 'target_fd'. When called with BPF_ATTACH_TYPE_CGROUP_INET_{E,IN}GRESS, the target is expected to be a valid file descriptor of a cgroup v2 directory which has the bpf controller enabled. These are the only use-cases implemented by this patch at this point, but more can be added. If a program of the given type already exists in the given cgroup, the program is swapped automically, so userspace does not have to drop an existing program first before installing a new one, which would otherwise leave a gap in which no program is attached. For more information on the propagation logic to subcgroups, please refer to the bpf cgroup controller implementation. The API is guarded by CAP_NET_ADMIN. Signed-off-by: Daniel Mack <daniel@zonque.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-25cgroup: add support for eBPF programsDaniel Mack3-0/+186
This patch adds two sets of eBPF program pointers to struct cgroup. One for such that are directly pinned to a cgroup, and one for such that are effective for it. To illustrate the logic behind that, assume the following example cgroup hierarchy. A - B - C \ D - E If only B has a program attached, it will be effective for B, C, D and E. If D then attaches a program itself, that will be effective for both D and E, and the program in B will only affect B and C. Only one program of a given type is effective for a cgroup. Attaching and detaching programs will be done through the bpf(2) syscall. For now, ingress and egress inet socket filtering are the only supported use-cases. Signed-off-by: Daniel Mack <daniel@zonque.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller3-27/+87
All conflicts were simple overlapping changes except perhaps for the Thunder driver. That driver has a change_mtu method explicitly for sending a message to the hardware. If that fails it returns an error. Normally a driver doesn't need an ndo_change_mtu method becuase those are usually just range changes, which are now handled generically. But since this extra operation is needed in the Thunder driver, it has to stay. However, if the message send fails we have to restore the original MTU before the change because the entire call chain expects that if an error is thrown by ndo_change_mtu then the MTU did not change. Therefore code is added to nicvf_change_mtu to remember the original MTU, and to restore it upon nicvf_update_hw_max_frs() failue. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparcLinus Torvalds1-3/+17
Pull sparc fixes from David Miller: 1) With modern networking cards we can run out of 32-bit DMA space, so support 64-bit DMA addressing when possible on sparc64. From Dave Tushar. 2) Some signal frame validation checks are inverted on sparc32, fix from Andreas Larsson. 3) Lockdep tables can get too large in some circumstances on sparc64, add a way to adjust the size a bit. From Babu Moger. 4) Fix NUMA node probing on some sun4v systems, from Thomas Tai. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc: sparc: drop duplicate header scatterlist.h lockdep: Limit static allocations if PROVE_LOCKING_SMALL is defined config: Adding the new config parameter CONFIG_PROVE_LOCKING_SMALL for sparc sunbmac: Fix compiler warning sunqe: Fix compiler warnings sparc64: Enable 64-bit DMA sparc64: Enable sun4v dma ops to use IOMMU v2 APIs sparc64: Bind PCIe devices to use IOMMU v2 service sparc64: Initialize iommu_map_table and iommu_pool sparc64: Add ATU (new IOMMU) support sparc64: Add FORCE_MAX_ZONEORDER and default to 13 sparc64: fix compile warning section mismatch in find_node() sparc32: Fix inverted invalid_frame_pointer checks on sigreturns sparc64: Fix find_node warning if numa node cannot be found
2016-11-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds1-23/+47
Pull networking fixes from David Miller: 1) Clear congestion control state when changing algorithms on an existing socket, from Florian Westphal. 2) Fix register bit values in altr_tse_pcs portion of stmmac driver, from Jia Jie Ho. 3) Fix PTP handling in stammc driver for GMAC4, from Giuseppe CAVALLARO. 4) Fix udplite multicast delivery handling, it ignores the udp_table parameter passed into the lookups, from Pablo Neira Ayuso. 5) Synchronize the space estimated by rtnl_vfinfo_size and the space actually used by rtnl_fill_vfinfo. From Sabrina Dubroca. 6) Fix memory leak in fib_info when splitting nodes, from Alexander Duyck. 7) If a driver does a napi_hash_del() explicitily and not via netif_napi_del(), it must perform RCU synchronization as needed. Fix this in virtio-net and bnxt drivers, from Eric Dumazet. 8) Likewise, it is not necessary to invoke napi_hash_del() is we are also doing neif_napi_del() in the same code path. Remove such calls from be2net and cxgb4 drivers, also from Eric Dumazet. 9) Don't allocate an ID in peernet2id_alloc() if the netns is dead, from WANG Cong. 10) Fix OF node and device struct leaks in of_mdio, from Johan Hovold. 11) We cannot cache routes in ip6_tunnel when using inherited traffic classes, from Paolo Abeni. 12) Fix several crashes and leaks in cpsw driver, from Johan Hovold. 13) Splice operations cannot use freezable blocking calls in AF_UNIX, from WANG Cong. 14) Link dump filtering by master device and kind support added an error in loop index updates during the dump if we actually do filter, fix from Zhang Shengju. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (59 commits) tcp: zero ca_priv area when switching cc algorithms net: l2tp: Treat NET_XMIT_CN as success in l2tp_eth_dev_xmit ethernet: stmmac: make DWMAC_STM32 depend on it's associated SoC tipc: eliminate obsolete socket locking policy description rtnl: fix the loop index update error in rtnl_dump_ifinfo() l2tp: fix racy SOCK_ZAPPED flag check in l2tp_ip{,6}_bind() net: macb: add check for dma mapping error in start_xmit() rtnetlink: fix FDB size computation netns: fix get_net_ns_by_fd(int pid) typo af_unix: conditionally use freezable blocking calls in read net: ethernet: ti: cpsw: fix fixed-link phy probe deferral net: ethernet: ti: cpsw: add missing sanity check net: ethernet: ti: cpsw: fix secondary-emac probe error path net: ethernet: ti: cpsw: fix of_node and phydev leaks net: ethernet: ti: cpsw: fix deferred probe net: ethernet: ti: cpsw: fix mdio device reference leak net: ethernet: ti: cpsw: fix bad register access in probe error path net: sky2: Fix shutdown crash cfg80211: limit scan results cache size net sched filters: pass netlink message flags in event notification ...
2016-11-21bpf, mlx5: fix mlx5e_create_rq taking reference on progDaniel Borkmann1-0/+1
In mlx5e_create_rq(), when creating a new queue, we call bpf_prog_add() but without checking the return value. bpf_prog_add() can fail since 92117d8443bc ("bpf: fix refcnt overflow"), so we really must check it. Take the reference right when we assign it to the rq from priv->xdp_prog, and just drop the reference on error path. Destruction in mlx5e_destroy_rq() looks good, though. Fixes: 86994156c736 ("net/mlx5e: XDP fast RX drop bpf programs support") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-18lockdep: Limit static allocations if PROVE_LOCKING_SMALL is definedBabu Moger1-3/+17
Reduce the size of data structure for lockdep entries by half if PROVE_LOCKING_SMALL if defined. This is used only for sparc. Signed-off-by: Babu Moger <babu.moger@oracle.com> Acked-by: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-18netns: make struct pernet_operations::id unsigned intAlexey Dobriyan1-1/+1
Make struct pernet_operations::id unsigned. There are 2 reasons to do so: 1) This field is really an index into an zero based array and thus is unsigned entity. Using negative value is out-of-bound access by definition. 2) On x86_64 unsigned 32-bit data which are mixed with pointers via array indexing or offsets added or subtracted to pointers are preffered to signed 32-bit data. "int" being used as an array index needs to be sign-extended to 64-bit before being used. void f(long *p, int i) { g(p[i]); } roughly translates to movsx rsi, esi mov rdi, [rsi+...] call g MOVSX is 3 byte instruction which isn't necessary if the variable is unsigned because x86_64 is zero extending by default. Now, there is net_generic() function which, you guessed it right, uses "int" as an array index: static inline void *net_generic(const struct net *net, int id) { ... ptr = ng->ptr[id - 1]; ... } And this function is used a lot, so those sign extensions add up. Patch snipes ~1730 bytes on allyesconfig kernel (without all junk messing with code generation): add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730) Unfortunately some functions actually grow bigger. This is a semmingly random artefact of code generation with register allocator being used differently. gcc decides that some variable needs to live in new r8+ registers and every access now requires REX prefix. Or it is shifted into r12, so [r12+0] addressing mode has to be used which is longer than [r8] However, overall balance is in negative direction: add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730) function old new delta nfsd4_lock 3886 3959 +73 tipc_link_build_proto_msg 1096 1140 +44 mac80211_hwsim_new_radio 2776 2808 +32 tipc_mon_rcv 1032 1058 +26 svcauth_gss_legacy_init 1413 1429 +16 tipc_bcbase_select_primary 379 392 +13 nfsd4_exchange_id 1247 1260 +13 nfsd4_setclientid_confirm 782 793 +11 ... put_client_renew_locked 494 480 -14 ip_set_sockfn_get 730 716 -14 geneve_sock_add 829 813 -16 nfsd4_sequence_done 721 703 -18 nlmclnt_lookup_host 708 686 -22 nfsd4_lockt 1085 1063 -22 nfs_get_client 1077 1050 -27 tcf_bpf_init 1106 1076 -30 nfsd4_encode_fattr 5997 5930 -67 Total: Before=154856051, After=154854321, chg -0.00% Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-16bpf: fix range arithmetic for bpf map accessJosef Bacik1-23/+47
I made some invalid assumptions with BPF_AND and BPF_MOD that could result in invalid accesses to bpf map entries. Fix this up by doing a few things 1) Kill BPF_MOD support. This doesn't actually get used by the compiler in real life and just adds extra complexity. 2) Fix the logic for BPF_AND, don't allow AND of negative numbers and set the minimum value to 0 for positive AND's. 3) Don't do operations on the ranges if they are set to the limits, as they are by definition undefined, and allowing arithmetic operations on those values could make them appear valid when they really aren't. This fixes the testcase provided by Jann as well as a few other theoretical problems. Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-16bpf: Fix compilation warning in __bpf_lru_list_rotate_inactiveMartin KaFai Lau1-1/+1
gcc-6.2.1 gives the following warning: kernel/bpf/bpf_lru_list.c: In function ‘__bpf_lru_list_rotate_inactive.isra.3’: kernel/bpf/bpf_lru_list.c:201:28: warning: ‘next’ may be used uninitialized in this function [-Wmaybe-uninitialized] The "next" is currently initialized in the while() loop which must have >=1 iterations. This patch initializes next to get rid of the compiler warning. Fixes: 3a08c2fd7634 ("bpf: LRU List") Reported-by: David Miller <davem@davemloft.net> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-15bpf: Add BPF_MAP_TYPE_LRU_PERCPU_HASHMartin KaFai Lau2-8/+129
Provide a LRU version of the existing BPF_MAP_TYPE_PERCPU_HASH Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-15bpf: Add BPF_MAP_TYPE_LRU_HASHMartin KaFai Lau1-14/+252
Provide a LRU version of the existing BPF_MAP_TYPE_HASH. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-15bpf: Refactor codes handling percpu mapMartin KaFai Lau1-26/+21
Refactor the codes that populate the value of a htab_elem in a BPF_MAP_TYPE_PERCPU_HASH typed bpf_map. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-15bpf: Add percpu LRU listMartin KaFai Lau2-19/+151
Instead of having a common LRU list, this patch allows a percpu LRU list which can be selected by specifying a map attribute. The map attribute will be added in the later patch. While the common use case for LRU is #reads >> #updates, percpu LRU list allows bpf prog to absorb unusual #updates under pathological case (e.g. external traffic facing machine which could be under attack). Each percpu LRU is isolated from each other. The LRU nodes (including free nodes) cannot be moved across different LRU Lists. Here are the update performance comparison between common LRU list and percpu LRU list (the test code is at the last patch): [root@kerneltest003.31.prn1 ~]# for i in 1 4 8; do echo -n "$i cpus: "; \ ./map_perf_test 16 $i | awk '{r += $3}END{print r " updates"}'; done 1 cpus: 2934082 updates 4 cpus: 7391434 updates 8 cpus: 6500576 updates [root@kerneltest003.31.prn1 ~]# for i in 1 4 8; do echo -n "$i cpus: "; \ ./map_perf_test 32 $i | awk '{r += $3}END{printr " updates"}'; done 1 cpus: 2896553 updates 4 cpus: 9766395 updates 8 cpus: 17460553 updates Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-15bpf: LRU ListMartin KaFai Lau3-1/+648
Introduce bpf_lru_list which will provide LRU capability to the bpf_htab in the later patch. * General Thoughts: 1. Target use case. Read is more often than update. (i.e. bpf_lookup_elem() is more often than bpf_update_elem()). If bpf_prog does a bpf_lookup_elem() first and then an in-place update, it still counts as a read operation to the LRU list concern. 2. It may be useful to think of it as a LRU cache 3. Optimize the read case 3.1 No lock in read case 3.2 The LRU maintenance is only done during bpf_update_elem() 4. If there is a percpu LRU list, it will lose the system-wise LRU property. A completely isolated percpu LRU list has the best performance but the memory utilization is not ideal considering the work load may be imbalance. 5. Hence, this patch starts the LRU implementation with a global LRU list with batched operations before accessing the global LRU list. As a LRU cache, #read >> #update/#insert operations, it will work well. 6. There is a local list (for each cpu) which is named 'struct bpf_lru_locallist'. This local list is not used to sort the LRU property. Instead, the local list is to batch enough operations before acquiring the lock of the global LRU list. More details on this later. 7. In the later patch, it allows a percpu LRU list by specifying a map-attribute for scalability reason and for use cases that need to prepare for the worst (and pathological) case like DoS attack. The percpu LRU list is completely isolated from each other and the LRU nodes (including free nodes) cannot be moved across the list. The following description is for the global LRU list but mostly applicable to the percpu LRU list also. * Global LRU List: 1. It has three sub-lists: active-list, inactive-list and free-list. 2. The two list idea, active and inactive, is borrowed from the page cache. 3. All nodes are pre-allocated and all sit at the free-list (of the global LRU list) at the beginning. The pre-allocation reasoning is similar to the existing BPF_MAP_TYPE_HASH. However, opting-out prealloc (BPF_F_NO_PREALLOC) is not supported in the LRU map. * Active/Inactive List (of the global LRU list): 1. The active list, as its name says it, maintains the active set of the nodes. We can think of it as the working set or more frequently accessed nodes. The access frequency is approximated by a ref-bit. The ref-bit is set during the bpf_lookup_elem(). 2. The inactive list, as its name also says it, maintains a less active set of nodes. They are the candidates to be removed from the bpf_htab when we are running out of free nodes. 3. The ordering of these two lists is acting as a rough clock. The tail of the inactive list is the older nodes and should be released first if the bpf_htab needs free element. * Rotating the Active/Inactive List (of the global LRU list): 1. It is the basic operation to maintain the LRU property of the global list. 2. The active list is only rotated when the inactive list is running low. This idea is similar to the current page cache. Inactive running low is currently defined as "# of inactive < # of active". 3. The active list rotation always starts from the tail. It moves node without ref-bit set to the head of the inactive list. It moves node with ref-bit set back to the head of the active list and then clears its ref-bit. 4. The inactive rotation is pretty simply. It walks the inactive list and moves the nodes back to the head of active list if its ref-bit is set. The ref-bit is cleared after moving to the active list. If the node does not have ref-bit set, it just leave it as it is because it is already in the inactive list. * Shrinking the Inactive List (of the global LRU list): 1. Shrinking is the operation to get free nodes when the bpf_htab is full. 2. It usually only shrinks the inactive list to get free nodes. 3. During shrinking, it will walk the inactive list from the tail, delete the nodes without ref-bit set from bpf_htab. 4. If no free node found after step (3), it will forcefully get one node from the tail of inactive or active list. Forcefully is in the sense that it ignores the ref-bit. * Local List: 1. Each CPU has a 'struct bpf_lru_locallist'. The purpose is to batch enough operations before acquiring the lock of the global LRU. 2. A local list has two sub-lists, free-list and pending-list. 3. During bpf_update_elem(), it will try to get from the free-list of (the current CPU local list). 4. If the local free-list is empty, it will acquire from the global LRU list. The global LRU list can either satisfy it by its global free-list or by shrinking the global inactive list. Since we have acquired the global LRU list lock, it will try to get at most LOCAL_FREE_TARGET elements to the local free list. 5. When a new element is added to the bpf_htab, it will first sit at the pending-list (of the local list) first. The pending-list will be flushed to the global LRU list when it needs to acquire free nodes from the global list next time. * Lock Consideration: The LRU list has a lock (lru_lock). Each bucket of htab has a lock (buck_lock). If both locks need to be acquired together, the lock order is always lru_lock -> buck_lock and this only happens in the bpf_lru_list.c logic. In hashtab.c, both locks are not acquired together (i.e. one lock is always released first before acquiring another lock). Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-15Merge tag 'trace-v4.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-traceLinus Torvalds1-1/+23
Pull tracing fixes from Steven Rostedt: "Alexei discovered a race condition in modules failing to load that can cause a ftrace check to trigger and disable ftrace. This is because of the way modules are registered to ftrace. Their functions are loaded in the ftrace function tables but set to "disabled" since they are still in the process of being loaded by the module. After the module is finished, it calls back into the ftrace infrastructure to enable it. Looking deeper into the locations that access all the functions in the table, I found more locations that should ignore the disabled ones" * tag 'trace-v4.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: ftrace: Add more checks for FTRACE_FL_DISABLED in processing ip records ftrace: Ignore FTRACE_FL_DISABLED while walking dyn_ftrace records
2016-11-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller8-38/+23
Several cases of bug fixes in 'net' overlapping other changes in 'net-next-. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds3-3/+10
Pull networking fixes from David Miller: 1) Fix off by one wrt. indexing when dumping /proc/net/route entries, from Alexander Duyck. 2) Fix lockdep splats in iwlwifi, from Johannes Berg. 3) Cure panic when inserting certain netfilter rules when NFT_SET_HASH is disabled, from Liping Zhang. 4) Memory leak when nft_expr_clone() fails, also from Liping Zhang. 5) Disable UFO when path will apply IPSEC tranformations, from Jakub Sitnicki. 6) Don't bogusly double cwnd in dctcp module, from Florian Westphal. 7) skb_checksum_help() should never actually use the value "0" for the resulting checksum, that has a special meaning, use CSUM_MANGLED_0 instead. From Eric Dumazet. 8) Per-tx/rx queue statistic strings are wrong in qed driver, fix from Yuval MIntz. 9) Fix SCTP reference counting of associations and transports in sctp_diag. From Xin Long. 10) When we hit ip6tunnel_xmit() we could have come from an ipv4 path in a previous layer or similar, so explicitly clear the ipv6 control block in the skb. From Eli Cooper. 11) Fix bogus sleeping inside of inet_wait_for_connect(), from WANG Cong. 12) Correct deivce ID of T6 adapter in cxgb4 driver, from Hariprasad Shenai. 13) Fix potential access past the end of the skb page frag array in tcp_sendmsg(). From Eric Dumazet. 14) 'skb' can legitimately be NULL in inet{,6}_exact_dif_match(). Fix from David Ahern. 15) Don't return an error in tcp_sendmsg() if we wronte any bytes successfully, from Eric Dumazet. 16) Extraneous unlocks in netlink_diag_dump(), we removed the locking but forgot to purge these unlock calls. From Eric Dumazet. 17) Fix memory leak in error path of __genl_register_family(). We leak the attrbuf, from WANG Cong. 18) cgroupstats netlink policy table is mis-sized, from WANG Cong. 19) Several XDP bug fixes in mlx5, from Saeed Mahameed. 20) Fix several device refcount leaks in network drivers, from Johan Hovold. 21) icmp6_send() should use skb dst device not skb->dev to determine L3 routing domain. From David Ahern. 22) ip_vs_genl_family sets maxattr incorrectly, from WANG Cong. 23) We leak new macvlan port in some cases of maclan_common_netlink() errors. Fix from Gao Feng. 24) Similar to the icmp6_send() fix, icmp_route_lookup() should determine L3 routing domain using skb_dst(skb)->dev not skb->dev. Also from David Ahern. 25) Several fixes for route offloading and FIB notification handling in mlxsw driver, from Jiri Pirko. 26) Properly cap __skb_flow_dissect()'s return value, from Eric Dumazet. 27) Fix long standing regression in ipv4 redirect handling, wrt. validating the new neighbour's reachability. From Stephen Suryaputra Lin. 28) If sk_filter() trims the packet excessively, handle it reasonably in tcp input instead of exploding. From Eric Dumazet. 29) Fix handling of napi hash state when copying channels in sfc driver, from Bert Kenward. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (121 commits) mlxsw: spectrum_router: Flush FIB tables during fini net: stmmac: Fix lack of link transition for fixed PHYs sctp: change sk state only when it has assocs in sctp_shutdown bnx2: Wait for in-flight DMA to complete at probe stage Revert "bnx2: Reset device during driver initialization" ps3_gelic: fix spelling mistake in debug message net: ethernet: ixp4xx_eth: fix spelling mistake in debug message ibmvnic: Fix size of debugfs name buffer ibmvnic: Unmap ibmvnic_statistics structure sfc: clear napi_hash state when copying channels mlxsw: spectrum_router: Correctly dump neighbour activity mlxsw: spectrum: Fix refcount bug on span entries bnxt_en: Fix VF virtual link state. bnxt_en: Fix ring arithmetic in bnxt_setup_tc(). Revert "include/uapi/linux/atm_zatm.h: include linux/time.h" tcp: take care of truncations done by sk_filter() ipv4: use new_gw for redirect neigh lookup r8152: Fix error path in open function net: bpqether.h: remove if_ether.h guard net: __skb_flow_dissect() must cap its return value ...
2016-11-14ftrace: Add more checks for FTRACE_FL_DISABLED in processing ip recordsSteven Rostedt (Red Hat)1-0/+22
When a module is first loaded and its function ip records are added to the ftrace list of functions to modify, they are set to DISABLED, as their text is still in a read only state. When the module is fully loaded, and can be updated, the flag is cleared, and if their's any functions that should be tracing them, it is updated at that moment. But there's several locations that do record accounting and should ignore records that are marked as disabled, or they can cause issues. Alexei already fixed one location, but others need to be addressed. Cc: stable@vger.kernel.org Fixes: b7ffffbb46f2 "ftrace: Add infrastructure for delayed enabling of module functions" Reported-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-11-14ftrace: Ignore FTRACE_FL_DISABLED while walking dyn_ftrace recordsAlexei Starovoitov1-1/+1
ftrace_shutdown() checks for sanity of ftrace records and if dyn_ftrace->flags is not zero, it will warn. It can happen that 'flags' are set to FTRACE_FL_DISABLED at this point, since some module was loaded, but before ftrace_module_enable() cleared the flags for this module. In other words the module.c is doing: ftrace_module_init(mod); // calls ftrace_update_code() that sets flags=FTRACE_FL_DISABLED ... // here ftrace_shutdown() is called that warns, since err = prepare_coming_module(mod); // didn't have a chance to clear FTRACE_FL_DISABLED Fix it by ignoring disabled records. It's similar to what __ftrace_hash_rec_update() is already doing. Link: http://lkml.kernel.org/r/1478560460-3818619-1-git-send-email-ast@fb.com Cc: stable@vger.kernel.org Fixes: b7ffffbb46f2 "ftrace: Add infrastructure for delayed enabling of module functions" Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-11-14bpf: Use u64_to_user_ptr()Mickaël Salaün1-17/+12
Replace the custom u64_to_ptr() function with the u64_to_user_ptr() macro. Signed-off-by: Mickaël Salaün <mic@digikod.net> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-14Revert "printk: make reading the kernel log flush pending lines"Linus Torvalds1-11/+0
This reverts commit bfd8d3f23b51018388be0411ccbc2d56277fe294. It turns out that this flushes things much too aggressiverly, and causes lines to break up when the system logger races with new continuation lines being printed. There's a pending patch to make printk() flushing much more straightforward, but it's too invasive for 4.9, so in the meantime let's just not make the system message logging flush continuation lines. They'll be flushed by the final newline anyway. Suggested-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-11-14Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-2/+2
Pull irq fix from Ingo Molnar: "This fixes a genirq regression that resulted in the Intel/Broxton pinctrl/GPIO driver (and possibly others) spewing warnings" * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq: Use irq type from irqdata instead of irqdesc
2016-11-12bpf, mlx4: fix prog refcount in mlx4_en_try_alloc_resources error pathDaniel Borkmann1-0/+11
Commit 67f8b1dcb9ee ("net/mlx4_en: Refactor the XDP forwarding rings scheme") added a bug in that the prog's reference count is not dropped in the error path when mlx4_en_try_alloc_resources() is failing from mlx4_xdp_set(). We previously took bpf_prog_add(prog, priv->rx_ring_num - 1), that we need to release again. Earlier in the call path, dev_change_xdp_fd() itself holds a reference to the prog as well (hence the '- 1' in the bpf_prog_add()), so a simple atomic_sub() is safe to use here. When an error is propagated, then bpf_prog_put() is called eventually from dev_change_xdp_fd() Fixes: 67f8b1dcb9ee ("net/mlx4_en: Refactor the XDP forwarding rings scheme") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-11Merge tag 'pm-4.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pmLinus Torvalds1-1/+3
Pull power management fixes from Rafael Wysocki: "These fix two bugs in error code paths in the PM core (system-wide suspend of devices), a device reference leak in the boot-time suspend test code and a cpupower utility regression from the 4.7 cycle. Specifics: - Prevent the PM core from attempting to suspend parent devices if any of their children, whose suspend callbacks were invoked asynchronously, have failed to suspend during the "late" and "noirq" phases of system-wide suspend of devices (Brian Norris). - Prevent the boot-time system suspend test code from leaking a reference to the RTC device used by it (Johan Hovold). - Fix cpupower to use the return value of one of its library functions correctly and restore the correct behavior of it when used for setting cpufreq tunables broken during the 4.7 development cycle (Laura Abbott)" * tag 'pm-4.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: PM / sleep: don't suspend parent when async child suspend_{noirq, late} fails PM / sleep: fix device reference leak in test_suspend cpupower: Correct return type of cpu_power_is_cpu_online() in cpufreq-set
2016-11-11Merge branches 'pm-tools-fixes' and 'pm-sleep-fixes'Rafael J. Wysocki1-1/+3
* pm-tools-fixes: cpupower: Correct return type of cpu_power_is_cpu_online() in cpufreq-set * pm-sleep-fixes: PM / sleep: don't suspend parent when async child suspend_{noirq, late} fails PM / sleep: fix device reference leak in test_suspend
2016-11-11Revert "console: don't prefer first registered if DT specifies stdout-path"Hans de Goede1-12/+1
This reverts commit 05fd007e4629 ("console: don't prefer first registered if DT specifies stdout-path"). The reverted commit changes existing behavior on which many ARM boards rely. Many ARM small-board-computers, like e.g. the Raspberry Pi have both a video output and a serial console. Depending on whether the user is using the device as a more regular computer; or as a headless device we need to have the console on either one or the other. Many users rely on the kernel behavior of the console being present on both outputs, before the reverted commit the console setup with no console= kernel arguments on an ARM board which sets stdout-path in dt would look like this: [root@localhost ~]# cat /proc/consoles ttyS0 -W- (EC p a) 4:64 tty0 -WU (E p ) 4:1 Where as after the reverted commit, it looks like this: [root@localhost ~]# cat /proc/consoles ttyS0 -W- (EC p a) 4:64 This commit reverts commit 05fd007e4629 ("console: don't prefer first registered if DT specifies stdout-path") restoring the original behavior. Fixes: 05fd007e4629 ("console: don't prefer first registered if DT specifies stdout-path") Link: http://lkml.kernel.org/r/20161104121135.4780-2-hdegoede@redhat.com Signed-off-by: Hans de Goede <hdegoede@redhat.com> Cc: Paul Burton <paul.burton@imgtec.com> Cc: Rob Herring <robh+dt@kernel.org> Cc: Frank Rowand <frowand.list@gmail.com> Cc: Thorsten Leemhuis <regressions@leemhuis.info> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-11-09bpf: Remove unused but set variablesTobias Klauser1-2/+0
Remove the unused but set variables min_set and max_set in adjust_reg_min_max_vals to fix the following warning when building with 'W=1': kernel/bpf/verifier.c:1483:7: warning: variable ‘min_set’ set but not used [-Wunused-but-set-variable] There is no warning about max_set being unused, but since it is only used in the assignment of min_set it can be removed as well. They were introduced in commit 484611357c19 ("bpf: allow access into map value arrays") but seem to have never been used. Cc: Josef Bacik <jbacik@fb.com> Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-08genirq: Use irq type from irqdata instead of irqdescThomas Gleixner1-2/+2
The type flags in the irq descriptor are there for historical reasons and only updated via irq_modify_status() or irq_set_type(). Both functions also update the type flags in irqdata. __setup_irq() is the only left over user of the type flags in the irq descriptor. If __setup_irq() is called with empty irq type flags, then the type flags are retrieved from irqdata. If an interrupt is shared, then the type flags are compared with the type flags stored in the irq descriptor. On x86 the ioapic does not have a irq_set_type() callback because the type is defined in the BIOS tables and cannot be changed. The type is stored in irqdata at setup time without updating the type data in the irq descriptor. As a result the comparison described above fails. There is no point in updating the irq descriptor flags because the only relevant storage is irqdata. Use the type flags from irqdata for both retrieval and comparison in __setup_irq() instead. Aside of that the print out in case of non matching type flags has the old and new type flags arguments flipped. Fix that as well. For correctness sake the flags stored in the irq descriptor should be removed, but this is beyond the scope of this bugfix and will be done in a later patch. Fixes: 4b357daed698 ("genirq: Look-up trigger type if not specified by caller") Reported-and-tested-by: Mika Westerberg <mika.westerberg@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Jon Hunter <jonathanh@nvidia.com> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1611072020360.3501@nanos Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-07bpf: fix map not being uncharged during map creation failureDaniel Borkmann1-1/+3
In map_create(), we first find and create the map, then once that suceeded, we charge it to the user's RLIMIT_MEMLOCK, and then fetch a new anon fd through anon_inode_getfd(). The problem is, once the latter fails f.e. due to RLIMIT_NOFILE limit, then we only destruct the map via map->ops->map_free(), but without uncharging the previously locked memory first. That means that the user_struct allocation is leaked as well as the accounted RLIMIT_MEMLOCK memory not released. Make the label names in the fix consistent with bpf_prog_load(). Fixes: aaac3ba95e4c ("bpf: charge user for creation of BPF maps and programs") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07bpf: fix htab map destruction when extra reserve is in useDaniel Borkmann1-1/+2
Commit a6ed3ea65d98 ("bpf: restore behavior of bpf_map_update_elem") added an extra per-cpu reserve to the hash table map to restore old behaviour from pre prealloc times. When non-prealloc is in use for a map, then problem is that once a hash table extra element has been linked into the hash-table, and the hash table is destroyed due to refcount dropping to zero, then htab_map_free() -> delete_all_elements() will walk the whole hash table and drop all elements via htab_elem_free(). The problem is that the element from the extra reserve is first fed to the wrong backend allocator and eventually freed twice. Fixes: a6ed3ea65d98 ("bpf: restore behavior of bpf_map_update_elem") Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-05Merge branches 'sched-urgent-for-linus' and 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds2-9/+7
Pull stack vmap fixups from Thomas Gleixner: "Two small patches related to sched_show_task(): - make sure to hold a reference on the task stack while accessing it - remove the thread_saved_pc printout .. and add a sanity check into release_task_stack() to catch problems with task stack references" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/core: Remove pointless printout in sched_show_task() sched/core: Fix oops in sched_show_task() * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: fork: Add task stack refcounting sanity check and prevent premature task stack freeing
2016-11-03taskstats: fix the length of cgroupstats_cmd_get_policyWANG Cong1-1/+5
cgroupstats_cmd_get_policy is [CGROUPSTATS_CMD_ATTR_MAX+1], taskstats_cmd_get_policy[TASKSTATS_CMD_ATTR_MAX+1], but their family.maxattr is TASKSTATS_CMD_ATTR_MAX. CGROUPSTATS_CMD_ATTR_MAX is less than TASKSTATS_CMD_ATTR_MAX, so we could end up accessing out-of-bound. Change cgroupstats_cmd_get_policy to TASKSTATS_CMD_ATTR_MAX+1, this is safe because the rest are initialized to 0's. Reported-by: Andrey Konovalov <andreyknvl@google.com> Tested-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03sched/core: Remove pointless printout in sched_show_task()Linus Torvalds1-9/+0
In sched_show_task() we print out a useless hex number, not even a symbol, and there's a big question mark whether this even makes sense anyway, I suspect we should just remove it all. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bp@alien8.de Cc: brgerst@gmail.com Cc: jann@thejh.net Cc: keescook@chromium.org Cc: linux-api@vger.kernel.org Cc: tycho.andersen@canonical.com Link: http://lkml.kernel.org/r/CA+55aFzphURPFzAvU4z6Moy7ZmimcwPuUdYU8bj9z0J+S8X1rw@mail.gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-03sched/core: Fix oops in sched_show_task()Tetsuo Handa1-0/+3
When CONFIG_THREAD_INFO_IN_TASK=y, it is possible that an exited thread remains in the task list after its stack pointer was already set to NULL. Therefore, thread_saved_pc() and stack_not_used() in sched_show_task() will trigger NULL pointer dereference if an attempt to dump such thread's traces (e.g. SysRq-t, khungtaskd) is made. Since show_stack() in sched_show_task() calls try_get_task_stack() and sched_show_task() is called from interrupt context, calling try_get_task_stack() from sched_show_task() will be safe as well. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Andy Lutomirski <luto@kernel.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bp@alien8.de Cc: brgerst@gmail.com Cc: jann@thejh.net Cc: keescook@chromium.org Cc: linux-api@vger.kernel.org Cc: tycho.andersen@canonical.com Link: http://lkml.kernel.org/r/201611021950.FEJ34368.HFFJOOMLtQOVSF@I-love.SAKURA.ne.jp Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-02PM / sleep: fix device reference leak in test_suspendJohan Hovold1-1/+3
Make sure to drop the reference taken by class_find_device() after opening the RTC device. Fixes: 77437fd4e61f (pm: boot time suspend selftest) Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-11-01fork: Add task stack refcounting sanity check and prevent premature task stack freeingAndy Lutomirski1-0/+4
If something goes wrong with task stack refcounting and a stack refcount hits zero too early, warn and leak it rather than potentially freeing it early (and silently). Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/f29119c783a9680a4b4656e751b6123917ace94b.1477926663.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-31bpf, inode: add support for symlinks and fix mtime/ctimeDaniel Borkmann1-6/+39
While commit bb35a6ef7da4 ("bpf, inode: allow for rename and link ops") added support for hard links that can be used for prog and map nodes, this work adds simple symlink support, which can be used f.e. for directories also when unpriviledged and works with cmdline tooling that understands S_IFLNK anyway. Since the switch in e27f4a942a0e ("bpf: Use mount_nodev not mount_ns to mount the bpf filesystem"), there can be various mount instances with mount_nodev() and thus hierarchy can be flattened to facilitate object sharing. Thus, we can keep bpf tooling also working by repointing paths. Most of the functionality can be used from vfs library operations. The symlink is stored in the inode itself, that is in i_link, which is sufficient in our case as opposed to storing it in the page cache. While at it, I noticed that bpf_mkdir() and bpf_mkobj() don't update the directories mtime and ctime, so add a common helper for it called bpf_dentry_finalize() that takes care of it for all cases now. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller20-110/+192
Mostly simple overlapping changes. For example, David Ahern's adjacency list revamp in 'net-next' conflicted with an adjacency list traversal bug fix in 'net'. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29bpf: Print function name in addition to function idThomas Graf1-8/+27
The verifier currently prints raw function ids when printing CALL instructions or when complaining: 5: (85) call 23 unknown func 23 print a meaningful function name instead: 5: (85) call bpf_redirect#23 unknown func bpf_redirect#23 Moves the function documentation to a single comment and renames all helpers names in the list to conform to the bpf_ prefix notation so they can be greped in the kernel source. Signed-off-by: Thomas Graf <tgraf@suug.ch> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-28Merge tag 'pm-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pmLinus Torvalds1-2/+2
Pull power management fixes from Rafael Wysocki: "These fix two intel_pstate issues related to the way it works when the scaling_governor sysfs attribute is set to "performance" and fix up messages in the system suspend core code. Specifics: - Fix a missing KERN_CONT in a system suspend message by converting the affected code to using pr_info() and pr_cont() instead of the "raw" printk() (Jon Hunter). - Make intel_pstate set the CPU P-state from its .set_policy() callback when the scaling_governor sysfs attribute is set to "performance" so that it interacts with NOHZ_FULL more predictably which was the case before 4.7 (Rafael Wysocki). - Make intel_pstate always request the maximum allowed P-state when the scaling_governor sysfs attribute is set to "performance" to prevent it from effectively ingoring that setting is some situations (Rafael Wysocki)" * tag 'pm-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: cpufreq: intel_pstate: Always set max P-state in performance mode PM / suspend: Fix missing KERN_CONT for suspend message cpufreq: intel_pstate: Set P-state upfront in performance mode
2016-10-28Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-6/+17
Pull perf fixes from Ingo Molnar: "Misc kernel fixes: a virtualization environment related fix, an uncore PMU driver removal handling fix, a PowerPC fix and new events for Knights Landing" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/intel: Honour the CPUID for number of fixed counters in hypervisors perf/powerpc: Don't call perf_event_disable() from atomic context perf/core: Protect PMU device removal with a 'pmu_bus_running' check, to fix CONFIG_DEBUG_TEST_DRIVER_REMOVE=y kernel panic perf/x86/intel/cstate: Add C-state residency events for Knights Landing
2016-10-28Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-30/+44
Pull timer fixes from Ingo Molnar: "Fix four timer locking races: two were noticed by Linus while reviewing the code while chasing for a corruption bug, and two from fixing spurious USB timeouts" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timers: Prevent base clock corruption when forwarding timers: Prevent base clock rewind when forwarding clock timers: Lock base for same bucket optimization timers: Plug locking race vs. timer migration
2016-10-28Merge branches 'core-urgent-for-linus', 'irq-urgent-for-linus' and 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-3/+0
Pull objtool, irq and scheduler fixes from Ingo Molnar: "One more objtool fixlet for GCC6 code generation patterns, an irq DocBook fix and an unused variable warning fix in the scheduler" * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: objtool: Fix rare switch jump table pattern detection * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: doc: Add missing parameter for msi_setup * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Remove unused but set variable 'rq'
2016-10-28perf/powerpc: Don't call perf_event_disable() from atomic contextJiri Olsa1-2/+8
The trinity syscall fuzzer triggered following WARN() on powerpc: WARNING: CPU: 9 PID: 2998 at arch/powerpc/kernel/hw_breakpoint.c:278 ... NIP [c00000000093aedc] .hw_breakpoint_handler+0x28c/0x2b0 LR [c00000000093aed8] .hw_breakpoint_handler+0x288/0x2b0 Call Trace: [c0000002f7933580] [c00000000093aed8] .hw_breakpoint_handler+0x288/0x2b0 (unreliable) [c0000002f7933630] [c0000000000f671c] .notifier_call_chain+0x7c/0xf0 [c0000002f79336d0] [c0000000000f6abc] .__atomic_notifier_call_chain+0xbc/0x1c0 [c0000002f7933780] [c0000000000f6c40] .notify_die+0x70/0xd0 [c0000002f7933820] [c00000000001a74c] .do_break+0x4c/0x100 [c0000002f7933920] [c0000000000089fc] handle_dabr_fault+0x14/0x48 Followed by a lockdep warning: =============================== [ INFO: suspicious RCU usage. ] 4.8.0-rc5+ #7 Tainted: G W ------------------------------- ./include/linux/rcupdate.h:556 Illegal context switch in RCU read-side critical section! other info that might help us debug this: rcu_scheduler_active = 1, debug_locks = 0 2 locks held by ls/2998: #0: (rcu_read_lock){......}, at: [<c0000000000f6a00>] .__atomic_notifier_call_chain+0x0/0x1c0 #1: (rcu_read_lock){......}, at: [<c00000000093ac50>] .hw_breakpoint_handler+0x0/0x2b0 stack backtrace: CPU: 9 PID: 2998 Comm: ls Tainted: G W 4.8.0-rc5+ #7 Call Trace: [c0000002f7933150] [c00000000094b1f8] .dump_stack+0xe0/0x14c (unreliable) [c0000002f79331e0] [c00000000013c468] .lockdep_rcu_suspicious+0x138/0x180 [c0000002f7933270] [c0000000001005d8] .___might_sleep+0x278/0x2e0 [c0000002f7933300] [c000000000935584] .mutex_lock_nested+0x64/0x5a0 [c0000002f7933410] [c00000000023084c] .perf_event_ctx_lock_nested+0x16c/0x380 [c0000002f7933500] [c000000000230a80] .perf_event_disable+0x20/0x60 [c0000002f7933580] [c00000000093aeec] .hw_breakpoint_handler+0x29c/0x2b0 [c0000002f7933630] [c0000000000f671c] .notifier_call_chain+0x7c/0xf0 [c0000002f79336d0] [c0000000000f6abc] .__atomic_notifier_call_chain+0xbc/0x1c0 [c0000002f7933780] [c0000000000f6c40] .notify_die+0x70/0xd0 [c0000002f7933820] [c00000000001a74c] .do_break+0x4c/0x100 [c0000002f7933920] [c0000000000089fc] handle_dabr_fault+0x14/0x48 While it looks like the first WARN() is probably valid, the other one is triggered by disabling event via perf_event_disable() from atomic context. The event is disabled here in case we were not able to emulate the instruction that hit the breakpoint. By disabling the event we unschedule the event and make sure it's not scheduled back. But we can't call perf_event_disable() from atomic context, instead we need to use the event's pending_disable irq_work method to disable it. Reported-by: Jan Stancek <jstancek@redhat.com> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michael Neuling <mikey@neuling.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20161026094824.GA21397@krava Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-28perf/core: Protect PMU device removal with a 'pmu_bus_running' check, to fix CONFIG_DEBUG_TEST_DRIVER_REMOVE=y kernel panicJiri Olsa1-4/+9
CAI Qian reported a crash in the PMU uncore device removal code, enabled by the CONFIG_DEBUG_TEST_DRIVER_REMOVE=y option: https://marc.info/?l=linux-kernel&m=147688837328451 The reason for the crash is that perf_pmu_unregister() tries to remove a PMU device which is not added at this point. We add PMU devices only after pmu_bus is registered, which happens in the perf_event_sysfs_init() call and sets the 'pmu_bus_running' flag. The fix is to get the 'pmu_bus_running' flag state at the point the PMU is taken out of the PMU list and remove the device later only if it's set. Reported-by: CAI Qian <caiqian@redhat.com> Tested-by: CAI Qian <caiqian@redhat.com> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rob Herring <robh@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20161020111011.GA13361@krava Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-27kcov: properly check if we are in an interruptAndrey Konovalov1-1/+8
in_interrupt() returns a nonzero value when we are either in an interrupt or have bh disabled via local_bh_disable(). Since we are interested in only ignoring coverage from actual interrupts, do a proper check instead of just calling in_interrupt(). As a result of this change, kcov will start to collect coverage from within local_bh_disable()/local_bh_enable() sections. Link: http://lkml.kernel.org/r/1476115803-20712-1-git-send-email-andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Acked-by: Dmitry Vyukov <dvyukov@google.com> Cc: Nicolai Stange <nicstange@gmail.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Kees Cook <keescook@chromium.org> Cc: James Morse <james.morse@arm.com> Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-27genetlink: mark families as __ro_after_initJohannes Berg1-1/+1
Now genl_register_family() is the only thing (other than the users themselves, perhaps, but I didn't find any doing that) writing to the family struct. In all families that I found, genl_register_family() is only called from __init functions (some indirectly, in which case I've add __init annotations to clarifly things), so all can actually be marked __ro_after_init. This protects the data structure from accidental corruption. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27genetlink: statically initialize familiesJohannes Berg1-6/+11
Instead of providing macros/inline functions to initialize the families, make all users initialize them statically and get rid of the macros. This reduces the kernel code size by about 1.6k on x86-64 (with allyesconfig). Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27genetlink: no longer support using static family IDsJohannes Berg1-1/+0
Static family IDs have never really been used, the only use case was the workaround I introduced for those users that assumed their family ID was also their multicast group ID. Additionally, because static family IDs would never be reserved by the generic netlink code, using a relatively low ID would only work for built-in families that can be registered immediately after generic netlink is started, which is basically only the control family (apart from the workaround code, which I also had to add code for so it would reserve those IDs) Thus, anything other than GENL_ID_GENERATE is flawed and luckily not used except in the cases I mentioned. Move those workarounds into a few lines of code, and then get rid of GENL_ID_GENERATE entirely, making it more robust. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>