aboutsummaryrefslogtreecommitdiffstats
path: root/net (follow)
AgeCommit message (Collapse)AuthorFilesLines
2019-02-17net: Do not allocate page fragments that are not skb alignedAlexander Duyck1-0/+4
This patch addresses the fact that there are drivers, specifically tun, that will call into the network page fragment allocators with buffer sizes that are not cache aligned. Doing this could result in data alignment and DMA performance issues as these fragment pools are also shared with the skb allocator and any other devices that will use napi_alloc_frags or netdev_alloc_frags. Fixes: ffde7328a36d ("net: Split netdev_alloc_frag into __alloc_page_frag and add __napi_alloc_frag") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-17tcp: tcp_v4_err() should be more carefulEric Dumazet1-1/+4
ICMP handlers are not very often stressed, we should make them more resilient to bugs that might surface in the future. If there is no packet in retransmit queue, we should avoid a NULL deref. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: soukjin bae <soukjin.bae@samsung.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-17tcp: clear icsk_backoff in tcp_write_queue_purge()Eric Dumazet1-1/+1
soukjin bae reported a crash in tcp_v4_err() handling ICMP_DEST_UNREACH after tcp_write_queue_head(sk) returned a NULL pointer. Current logic should have prevented this : if (seq != tp->snd_una || !icsk->icsk_retransmits || !icsk->icsk_backoff || fastopen) break; Problem is the write queue might have been purged and icsk_backoff has not been cleared. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: soukjin bae <soukjin.bae@samsung.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-17net: sched: sch_api: set an error msg when qdisc_alloc_handle() failsIvan Vecera1-2/+4
This patch sets an error message in extack when the number of qdisc handles exceeds the maximum. Also the error-code ENOSPC is more appropriate than ENOMEM in this situation. Signed-off-by: Ivan Vecera <ivecera@redhat.com> Reported-by: Li Shuang <shuali@redhat.com> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-17ethtool: add compat for flash updateJakub Kicinski2-3/+39
If driver does not support ethtool flash update operation call into devlink. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-17devlink: add flash update commandJakub Kicinski1-0/+30
Add devlink flash update command. Advanced NICs have firmware stored in flash and often cryptographically secured. Updating that flash is handled by management firmware. Ethtool has a flash update command which served us well, however, it has two shortcomings: - it takes rtnl_lock unnecessarily - really flash update has nothing to do with networking, so using a networking device as a handle is suboptimal, which leads us to the second one: - it requires a functioning netdev - in case device enters an error state and can't spawn a netdev (e.g. communication with the device fails) there is no netdev to use as a handle for flashing. Devlink already has the ability to report the firmware versions, now with the ability to update the firmware/flash we will be able to recover devices in bad state. To enable updates of sub-components of the FW allow passing component name. This name should correspond to one of the versions reported in devlink info. v1: - replace target id with component name (Jiri). Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-17net: sched: cgroup: verify that filter is not NULL during walkVlad Buslov1-0/+2
Check that filter is not NULL before passing it to tcf_walker->fn() callback in cls_cgroup_walk(). This can happen when cls_cgroup_change() failed to set first filter. Fixes: ed76f5edccc9 ("net: sched: protect filter_chain list with filter_chain_lock mutex") Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-17net: sched: matchall: verify that filter is not NULL in mall_walk()Vlad Buslov1-0/+3
Check that filter is not NULL before passing it to tcf_walker->fn() callback. This can happen when mall_change() failed to offload filter to hardware. Fixes: ed76f5edccc9 ("net: sched: protect filter_chain list with filter_chain_lock mutex") Reported-by: Ido Schimmel <idosch@mellanox.com> Tested-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-17net: sched: route: don't set arg->stop in route4_walk() when emptyVlad Buslov1-4/+1
Some classifiers set arg->stop in their implementation of tp->walk() API when empty. Most of classifiers do not adhere to that convention. Do not set arg->stop in route4_walk() to unify tp->walk() behavior among classifier implementations. Fixes: ed76f5edccc9 ("net: sched: protect filter_chain list with filter_chain_lock mutex") Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-17net: sched: fw: don't set arg->stop in fw_walk() when emptyVlad Buslov1-4/+1
Some classifiers set arg->stop in their implementation of tp->walk() API when empty. Most of classifiers do not adhere to that convention. Do not set arg->stop in fw_walk() to unify tp->walk() behavior among classifier implementations. Fixes: ed76f5edccc9 ("net: sched: protect filter_chain list with filter_chain_lock mutex") Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-17net: caif: use skb helpers instead of open-coding themJann Horn1-11/+5
Use existing skb_put_data() and skb_trim() instead of open-coding them, with the skb_put_data() first so that logically, `skb` still contains the data to be copied in its data..tail area when skb_put_data() reads it. This change on its own is a cleanup, and it is also necessary for potential future integration of skbuffs with things like KASAN. Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-17neigh: hook tracepoints in neigh update codeRoopa Prabhu1-0/+11
hook tracepoints at the end of functions that update a neigh entry. neigh_update gets an additional tracepoint to trace the update flags and old and new neigh states. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-17trace: events: add a few neigh tracepointsRoopa Prabhu1-0/+8
The goal here is to trace neigh state changes covering all possible neigh update paths. Plus have a specific trace point in neigh_update to cover flags sent to neigh_update. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller5-184/+640
Alexei Starovoitov says: ==================== pull-request: bpf-next 2019-02-16 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) numerous libbpf API improvements, from Andrii, Andrey, Yonghong. 2) test all bpf progs in alu32 mode, from Jiong. 3) skb->sk access and bpf_sk_fullsock(), bpf_tcp_sock() helpers, from Martin. 4) support for IP encap in lwt bpf progs, from Peter. 5) remove XDP_QUERY_XSK_UMEM dead code, from Jan. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller3-13/+14
Alexei Starovoitov says: ==================== pull-request: bpf 2019-02-16 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) fix lockdep false positive in bpf_get_stackid(), from Alexei. 2) several AF_XDP fixes, from Bjorn, Magnus, Davidlohr. 3) fix narrow load from struct bpf_sock, from Martin. 4) mips JIT fixes, from Paul. 5) gso handling fix in bpf helpers, from Willem. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-16atm: clean up vcc_seq_next()Dan Carpenter1-1/+2
It's confusing to call PTR_ERR(v). The PTR_ERR() function is basically a fancy cast to long so it makes you wonder, was IS_ERR() intended? But that doesn't make sense because vcc_walk() doesn't return error pointers. This patch doesn't affect runtime, it's just a cleanup. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-16sock: consistent handling of extreme SO_SNDBUF/SO_RCVBUF valuesGuillaume Nault1-0/+20
SO_SNDBUF and SO_RCVBUF (and their *BUFFORCE version) may overflow or underflow their input value. This patch aims at providing explicit handling of these extreme cases, to get a clear behaviour even with values bigger than INT_MAX / 2 or lower than INT_MIN / 2. For simplicity, only SO_SNDBUF and SO_SNDBUFFORCE are described here, but the same explanation and fix apply to SO_RCVBUF and SO_RCVBUFFORCE (with 'SNDBUF' replaced by 'RCVBUF' and 'wmem_max' by 'rmem_max'). Overflow of positive values =========================== When handling SO_SNDBUF or SO_SNDBUFFORCE, if 'val' exceeds INT_MAX / 2, the buffer size is set to its minimum value because 'val * 2' overflows, and max_t() considers that it's smaller than SOCK_MIN_SNDBUF. For SO_SNDBUF, this can only happen with net.core.wmem_max > INT_MAX / 2. SO_SNDBUF and SO_SNDBUFFORCE are actually designed to let users probe for the maximum buffer size by setting an arbitrary large number that gets capped to the maximum allowed/possible size. Having the upper half of the positive integer space to potentially reduce the buffer size to its minimum value defeats this purpose. This patch caps the base value to INT_MAX / 2, so that bigger values don't overflow and keep setting the buffer size to its maximum. Underflow of negative values ============================ For negative numbers, SO_SNDBUF always considers them bigger than net.core.wmem_max, which is bounded by [SOCK_MIN_SNDBUF, INT_MAX]. Therefore such values are set to net.core.wmem_max and we're back to the behaviour of positive integers described above (return maximum buffer size if wmem_max <= INT_MAX / 2, return SOCK_MIN_SNDBUF otherwise). However, SO_SNDBUFFORCE behaves differently. The user value is directly multiplied by two and compared with SOCK_MIN_SNDBUF. If 'val * 2' doesn't underflow or if it underflows to a value smaller than SOCK_MIN_SNDBUF then buffer size is set to its minimum value. Otherwise the buffer size is set to the underflowed value. This patch treats negative values passed to SO_SNDBUFFORCE as null, to prevent underflows. Therefore negative values now always set the buffer size to its minimum value. Even though SO_SNDBUF behaves inconsistently by setting buffer size to the maximum value when passed a negative number, no attempt is made to modify this behaviour. There may exist some programs that rely on using negative numbers to set the maximum buffer size. Avoiding overflows because of extreme net.core.wmem_max values is the most we can do here. Summary of altered behaviours ============================= val : user-space value passed to setsockopt() val_uf : the underflowed value resulting from doubling val when val < INT_MIN / 2 wmem_max : short for net.core.wmem_max val_cap : min(val, wmem_max) min_len : minimal buffer length (that is, SOCK_MIN_SNDBUF) max_len : maximal possible buffer length, regardless of wmem_max (that is, INT_MAX - 1) ^^^^ : altered behaviour SO_SNDBUF: +-------------------------+-------------+------------+----------------+ | CONDITION | OLD RESULT | NEW RESULT | COMMENT | +-------------------------+-------------+------------+----------------+ | val < 0 && | | | No overflow, | | wmem_max <= INT_MAX/2 | wmem_max*2 | wmem_max*2 | keep original | | | | | behaviour | +-------------------------+-------------+------------+----------------+ | val < 0 && | | | Cap wmem_max | | INT_MAX/2 < wmem_max | min_len | max_len | to prevent | | | | ^^^^^^^ | overflow | +-------------------------+-------------+------------+----------------+ | 0 <= val <= min_len/2 | min_len | min_len | Ordinary case | +-------------------------+-------------+------------+----------------+ | min_len/2 < val && | val_cap*2 | val_cap*2 | Ordinary case | | val_cap <= INT_MAX/2 | | | | +-------------------------+-------------+------------+----------------+ | min_len < val && | | | Cap val_cap | | INT_MAX/2 < val_cap | min_len | max_len | again to | | (implies that | | ^^^^^^^ | prevent | | INT_MAX/2 < wmem_max) | | | overflow | +-------------------------+-------------+------------+----------------+ SO_SNDBUFFORCE: +------------------------------+---------+---------+------------------+ | CONDITION | BEFORE | AFTER | COMMENT | | | PATCH | PATCH | | +------------------------------+---------+---------+------------------+ | val < INT_MIN/2 && | min_len | min_len | Underflow with | | val_uf <= min_len | | | no consequence | +------------------------------+---------+---------+------------------+ | val < INT_MIN/2 && | val_uf | min_len | Set val to 0 to | | val_uf > min_len | | ^^^^^^^ | avoid underflow | +------------------------------+---------+---------+------------------+ | INT_MIN/2 <= val < 0 | min_len | min_len | No underflow | +------------------------------+---------+---------+------------------+ | 0 <= val <= min_len/2 | min_len | min_len | Ordinary case | +------------------------------+---------+---------+------------------+ | min_len/2 < val <= INT_MAX/2 | val*2 | val*2 | Ordinary case | +------------------------------+---------+---------+------------------+ | INT_MAX/2 < val | min_len | max_len | Cap val to | | | | ^^^^^^^ | prevent overflow | +------------------------------+---------+---------+------------------+ Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-16Merge tag 'nfsd-5.0-2' of git://linux-nfs.org/~bfields/linuxLinus Torvalds1-11/+38
Pull more nfsd fixes from Bruce Fields: "Two small fixes, one for crashes using nfs/krb5 with older enctypes, one that could prevent clients from reclaiming state after a kernel upgrade" * tag 'nfsd-5.0-2' of git://linux-nfs.org/~bfields/linux: sunrpc: fix 4 more call sites that were using stack memory with a scatterlist Revert "nfsd4: return default lease period"
2019-02-16Merge tag 'nfs-for-5.0-4' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds2-2/+3
Pull more NFS client fixes from Anna Schumaker: "Three fixes this time. Nicolas's is for xprtrdma completion vector allocation on single-core systems. Greg's adds an error check when allocating a debugfs dentry. And Ben's is an additional fix for nfs_page_async_flush() to prevent pages from accidentally getting truncated. Summary: - Make sure Send CQ is allocated on an existing compvec - Properly check debugfs dentry before using it - Don't use page_file_mapping() after removing a page" * tag 'nfs-for-5.0-4' of git://git.linux-nfs.org/projects/anna/linux-nfs: NFS: Don't use page_file_mapping after removing the page rpc: properly check debugfs dentry before using it xprtrdma: Make sure Send CQ is allocated on an existing compvec
2019-02-16netfilter: nf_conntrack_sip: add sip_external_media logicAlin Nastac1-0/+42
When enabled, the sip_external_media logic will leave SDP payload untouched when it detects that interface towards INVITEd party is the same with the one towards media endpoint. The typical scenario for this logic is when a LAN SIP agent has more than one IP address (uses a different address for media streams than the one used on signalling stream) and it also forwards calls to a voice mailbox located on the WAN side. In such case sip_direct_media must be disabled (so normal calls could be handled by the SIP helper), but media streams that are not traversing this router must also be excluded from address translation (e.g. call forwards). Signed-off-by: Alin Nastac <alin.nastac@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-16netfilter: ipt_CLUSTERIP: make symbol 'cip_netdev_notifier' staticWei Yongjun1-1/+1
Fixes the following sparse warnings: net/ipv4/netfilter/ipt_CLUSTERIP.c:867:23: warning: symbol 'cip_netdev_notifier' was not declared. Should it be static? Fixes: 5a86d68bcf02 ("netfilter: ipt_CLUSTERIP: fix deadlock in netns exit routine") Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-16ipvs: fix warning on unused variableAndrea Claudi1-1/+2
When CONFIG_IP_VS_IPV6 is not defined, build produced this warning: net/netfilter/ipvs/ip_vs_ctl.c:899:6: warning: unused variable ‘ret’ [-Wunused-variable] int ret = 0; ^~~ Fix this by moving the declaration of 'ret' in the CONFIG_IP_VS_IPV6 section in the same function. While at it, drop its unneeded initialisation. Fixes: 098e13f5b21d ("ipvs: fix dependency on nf_defrag_ipv6") Reported-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Andrea Claudi <aclaudi@redhat.com> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-15net/ipv6: prefer rcu_access_pointer() over rcu_dereference()Paolo Abeni1-7/+1
rt6_cache_allowed_for_pmtu() checks for rt->from presence, but it does not access the RCU protected pointer. We can use rcu_access_pointer() and clean-up the code a bit. No functional changes intended. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-15net: Fix for_each_netdev_feature on Big endianHauke Mehrtens1-2/+2
The features attribute is of type u64 and stored in the native endianes on the system. The for_each_set_bit() macro takes a pointer to a 32 bit array and goes over the bits in this area. On little Endian systems this also works with an u64 as the most significant bit is on the highest address, but on big endian the words are swapped. When we expect bit 15 here we get bit 47 (15 + 32). This patch converts it more or less to its own for_each_set_bit() implementation which works on 64 bit integers directly. This is then completely in host endianness and should work like expected. Fixes: fd867d51f ("net/core: generic support for disabling netdev features down stack") Signed-off-by: Hauke Mehrtens <hauke.mehrtens@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-15net: ip6_gre: initialize erspan_ver just for erspan tunnelsLorenzo Bianconi1-14/+20
After commit c706863bc890 ("net: ip6_gre: always reports o_key to userspace"), ip6gre and ip6gretap tunnels started reporting TUNNEL_KEY output flag even if it is not configured. ip6gre_fill_info checks erspan_ver value to add TUNNEL_KEY for erspan tunnels, however in commit 84581bdae9587 ("erspan: set erspan_ver to 1 by default when adding an erspan dev") erspan_ver is initialized to 1 even for ip6gre or ip6gretap Fix the issue moving erspan_ver initialization in a dedicated routine Fixes: c706863bc890 ("net: ip6_gre: always reports o_key to userspace") Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com> Reviewed-by: Greg Rose <gvrose8192@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-16bpf: make LWTUNNEL_BPF dependent on INETPeter Oskolkov1-1/+1
Lightweight tunnels are L3 constructs that are used with IP/IP6. For example, lwtunnel_xmit is called from ip_output.c and ip6_output.c only. Make the dependency explicit at least for LWT-BPF, as now they call into IP routing. V2: added "Reported-by" below. Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Peter Oskolkov <posk@google.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-02-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller34-77/+180
The netfilter conflicts were rather simple overlapping changes. However, the cls_tcindex.c stuff was a bit more complex. On the 'net' side, Cong is fixing several races and memory leaks. Whilst on the 'net-next' side we have Vlad adding the rtnl-ness support. What I've decided to do, in order to resolve this, is revert the conversion over to using a workqueue that Cong did, bringing us back to pure RCU. I did it this way because I believe that either Cong's races don't apply with have Vlad did things, or Cong will have to implement the race fix slightly differently. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-15sunrpc: fix 4 more call sites that were using stack memory with a scatterlistScott Mayhew1-11/+38
While trying to reproduce a reported kernel panic on arm64, I discovered that AUTH_GSS basically doesn't work at all with older enctypes on arm64 systems with CONFIG_VMAP_STACK enabled. It turns out there still a few places using stack memory with scatterlists, causing krb5_encrypt() and krb5_decrypt() to produce incorrect results (or a BUG if CONFIG_DEBUG_SG is enabled). Tested with cthon on v4.0/v4.1/v4.2 with krb5/krb5i/krb5p using des3-cbc-sha1 and arcfour-hmac-md5. Signed-off-by: Scott Mayhew <smayhew@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-02-15netfilter: nf_tables: fix flush after rule deletion in the same batchPablo Neira Ayuso1-0/+3
Flush after rule deletion bogusly hits -ENOENT. Skip rules that have been already from nft_delrule_by_chain() which is always called from the flush path. Fixes: cf9dc09d0949 ("netfilter: nf_tables: fix missing rules flushing per table") Reported-by: Phil Sutter <phil@nwl.cc> Acked-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-15mac80211: Restore vif beacon interval if start ap failsRakesh Pillai1-1/+5
The starting of AP interface can fail due to invalid beacon interval, which does not match the minimum gcd requirement set by the wifi driver. In such case, the beacon interval of that interface gets updated with that invalid beacon interval. The next time that interface is brought up in AP mode, an interface combination check is performed and the beacon interval is taken from the previously set value. In a case where an invalid beacon interval, i.e. a beacon interval value which does not satisfy the minimum gcd criteria set by the driver, is set, all the subsequent trials to bring that interface in AP mode will fail, even if the subsequent trials have a valid beacon interval. To avoid this, in case of a failure in bringing up an interface in AP mode due to interface combination error, the interface beacon interval which is stored in bss conf, needs to be restored with the last working value of beacon interval. Tested on ath10k using WCN3990. Cc: stable@vger.kernel.org Fixes: 0c317a02ca98 ("cfg80211: support virtual interfaces with different beacon intervals") Signed-off-by: Rakesh Pillai <pillair@codeaurora.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-02-15mac80211: Free mpath object when rhashtable insertion failsHerbert Xu1-8/+9
When rhashtable insertion fails the mesh table code doesn't free the now-orphan mesh path object. This patch fixes that. Cc: stable@vger.kernel.org Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-02-15mac80211: Use linked list instead of rhashtable walk for mesh tablesHerbert Xu2-101/+43
The mesh table code walks over hash tables for two purposes. First of all it's used as part of a netlink dump process, but it is also used for looking up entries to delete using criteria other than the hash key. The second purpose is directly contrary to the design specification of rhashtable walks. It is only meant for use by netlink dumps. This is because rhashtable is resizable and you cannot obtain a stable walk over it during a resize process. In fact mesh's use of rhashtable for dumping is bogus too. Rather than using rhashtable walk's iterator to keep track of the current position, it always converts the current position to an integer which defeats the purpose of the iterator. Therefore this patch converts all uses of rhashtable walk into a simple linked list. This patch also adds a new spin lock to protect the hash table insertion/removal as well as the walk list modifications. In fact the previous code was buggy as the removals can race with each other, potentially resulting in a double-free. Cc: stable@vger.kernel.org Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-02-14bpf: fix memory leak in bpf_lwt_xmit_reroutePeter Oskolkov1-9/+20
On error the skb should be freed. Tested with diff/steps provided by David Ahern. v2: surface routing errors to the user instead of a generic EINVAL, as suggested by David Ahern. Reported-by: David Ahern <dsahern@gmail.com> Fixes: 3bd0b15281af ("bpf: add handling of BPF_LWT_REROUTE to lwt_bpf.c") Signed-off-by: Peter Oskolkov <posk@google.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-02-14devlink: Fix list access without lock while reading regionParav Pandit1-2/+5
While finding the devlink device during region reading, devlink device list is accessed and devlink device is returned without holding a lock. This could lead to use-after-free accesses. While at it, add lockdep assert to ensure that all future callers hold the lock when calling devlink_get_from_attrs(). Fixes: 4e54795a27f5 ("devlink: Add support for region snapshot read command") Signed-off-by: Parav Pandit <parav@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-14devlink: Return right error code in case of errors for region readParav Pandit1-7/+19
devlink_nl_cmd_region_read_dumpit() misses to return right error code on most error conditions. Return the right error code on such errors. Fixes: 4e54795a27f5 ("devlink: Add support for region snapshot read command") Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-13Merge tag 'batadv-next-for-davem-20190213' of git://git.open-mesh.org/linux-mergeDavid S. Miller8-142/+1020
Simon Wunderlich says: ==================== This feature/cleanup patchset includes the following patches: - fix memory leak in in batadv_dat_put_dhcp, by Martin Weinelt - fix typo, by Sven Eckelmann - netlink restructuring patch series (part 2), by Sven Eckelmann (19 patches) ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-13page_pool: use DMA_ATTR_SKIP_CPU_SYNC for DMA mappingsJesper Dangaard Brouer1-5/+6
As pointed out by Alexander Duyck, the DMA mapping done in page_pool needs to use the DMA attribute DMA_ATTR_SKIP_CPU_SYNC. As the principle behind page_pool keeping the pages mapped is that the driver takes over the DMA-sync steps. Reported-by: Alexander Duyck <alexander.duyck@gmail.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-13net: page_pool: don't use page->private to store dma_addr_tIlias Apalodimas1-4/+9
As pointed out by David Miller the current page_pool implementation stores dma_addr_t in page->private. This won't work on 32-bit platforms with 64-bit DMA addresses since the page->private is an unsigned long and the dma_addr_t a u64. A previous patch is adding dma_addr_t on struct page to accommodate this. This patch adapts the page_pool related functions to use the newly added struct for storing and retrieving DMA addresses from network drivers. Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-13net: sched: remove duplicated include from cls_api.cYueHaibing1-1/+0
Remove duplicated include. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-13net: sched: flower: only return error from hw offload if skip_swVlad Buslov1-2/+10
Recently introduced tc_setup_flow_action() can fail when parsing tcf_exts on some unsupported action commands. However, this should not affect the case when user did not explicitly request hw offload by setting skip_sw flag. Modify tc_setup_flow_action() callers to only propagate the error if skip_sw flag is set for filter that is being offloaded, and set extack error message in that case. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Fixes: 3a7b68617de7 ("cls_api: add translator to flow_action representation") Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-13net: fix possible overflow in __sk_mem_raise_allocated()Eric Dumazet1-1/+1
With many active TCP sockets, fat TCP sockets could fool __sk_mem_raise_allocated() thanks to an overflow. They would increase their share of the memory, instead of decreasing it. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-13bpf: add handling of BPF_LWT_REROUTE to lwt_bpf.cPeter Oskolkov1-2/+124
This patch builds on top of the previous patch in the patchset, which added BPF_LWT_ENCAP_IP mode to bpf_lwt_push_encap. As the encapping can result in the skb needing to go via a different interface/route/dst, bpf programs can indicate this by returning BPF_LWT_REROUTE, which triggers a new route lookup for the skb. v8 changes: fix kbuild errors when LWTUNNEL_BPF is builtin, but IPV6 is a module: as LWTUNNEL_BPF can only be either Y or N, call IPV6 routing functions only if they are built-in. v9 changes: - fixed a kbuild test robot compiler warning; - call IPV6 routing functions via ipv6_stub. v10 changes: removed unnecessary IS_ENABLED and pr_warn_once. v11 changes: fixed a potential dst leak. Signed-off-by: Peter Oskolkov <posk@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-02-13ipv6_stub: add ipv6_route_input stub/proxy.Peter Oskolkov2-0/+13
Proxy ip6_route_input via ipv6_stub, for later use by lwt bpf ip encap (see the next patch in the patchset). Signed-off-by: Peter Oskolkov <posk@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-02-13bpf: handle GSO in bpf_lwt_push_encapPeter Oskolkov1-2/+65
This patch adds handling of GSO packets in bpf_lwt_push_ip_encap() (called from bpf_lwt_push_encap): * IPIP, GRE, and UDP encapsulation types are deduced by looking into iphdr->protocol or ipv6hdr->next_header; * SCTP GSO packets are not supported (as bpf_skb_proto_4_to_6 and similar do); * UDP_L4 GSO packets are also not supported (although they are not blocked in bpf_skb_proto_4_to_6 and similar), as skb_decrease_gso_size() will break it; * SKB_GSO_DODGY bit is set. Note: it may be possible to support SCTP and UDP_L4 gso packets; but as these cases seem to be not well handled by other tunneling/encapping code paths, the solution should be generic enough to apply to all tunneling/encapping code. v8 changes: - make sure that if GRE or UDP encap is detected, there is enough of pushed bytes to cover both IP[v6] + GRE|UDP headers; - do not reject double-encapped packets; - whitelist TCP GSO packets rather than block SCTP GSO and UDP GSO. Signed-off-by: Peter Oskolkov <posk@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-02-13bpf: implement BPF_LWT_ENCAP_IP mode in bpf_lwt_push_encapPeter Oskolkov2-1/+67
Implement BPF_LWT_ENCAP_IP mode in bpf_lwt_push_encap BPF helper. It enables BPF programs (specifically, BPF_PROG_TYPE_LWT_IN and BPF_PROG_TYPE_LWT_XMIT prog types) to add IP encapsulation headers to packets (e.g. IP/GRE, GUE, IPIP). This is useful when thousands of different short-lived flows should be encapped, each with different and dynamically determined destination. Although lwtunnels can be used in some of these scenarios, the ability to dynamically generate encap headers adds more flexibility, e.g. when routing depends on the state of the host (reflected in global bpf maps). v7 changes: - added a call skb_clear_hash(); - removed calls to skb_set_transport_header(); - refuse to encap GSO-enabled packets. v8 changes: - fix build errors when LWT is not enabled. Note: the next patch in the patchset with deal with GSO-enabled packets, which are currently rejected at encapping attempt. Signed-off-by: Peter Oskolkov <posk@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-02-13bpf: add plumbing for BPF_LWT_ENCAP_IP in bpf_lwt_push_encapPeter Oskolkov1-5/+43
This patch adds all needed plumbing in preparation to allowing bpf programs to do IP encapping via bpf_lwt_push_encap. Actual implementation is added in the next patch in the patchset. Of note: - bpf_lwt_push_encap can now be called from BPF_PROG_TYPE_LWT_XMIT prog types in addition to BPF_PROG_TYPE_LWT_IN; - if the skb being encapped has GSO set, encapsulation is limited to IPIP/IP+GRE/IP+GUE (both IPv4 and IPv6); - as route lookups are different for ingress vs egress, the single external bpf_lwt_push_encap BPF helper is routed internally to either bpf_lwt_in_push_encap or bpf_lwt_xmit_push_encap BPF_CALLs, depending on prog type. v8 changes: fixed a typo. Signed-off-by: Peter Oskolkov <posk@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-02-13sctp: set stream ext to NULL after freeing it in sctp_stream_outq_migrateXin Long1-1/+3
In sctp_stream_init(), after sctp_stream_outq_migrate() freed the surplus streams' ext, but sctp_stream_alloc_out() returns -ENOMEM, stream->outcnt will not be set to 'outcnt'. With the bigger value on stream->outcnt, when closing the assoc and freeing its streams, the ext of those surplus streams will be freed again since those stream exts were not set to NULL after freeing in sctp_stream_outq_migrate(). Then the invalid-free issue reported by syzbot would be triggered. We fix it by simply setting them to NULL after freeing. Fixes: 5bbbbe32a431 ("sctp: introduce stream scheduler foundations") Reported-by: syzbot+58e480e7b28f2d890bfd@syzkaller.appspotmail.com Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-13sctp: call gso_reset_checksum when computing checksum in sctp_gso_segmentXin Long1-0/+1
Jianlin reported a panic when running sctp gso over gre over vlan device: [ 84.772930] RIP: 0010:do_csum+0x6d/0x170 [ 84.790605] Call Trace: [ 84.791054] csum_partial+0xd/0x20 [ 84.791657] gre_gso_segment+0x2c3/0x390 [ 84.792364] inet_gso_segment+0x161/0x3e0 [ 84.793071] skb_mac_gso_segment+0xb8/0x120 [ 84.793846] __skb_gso_segment+0x7e/0x180 [ 84.794581] validate_xmit_skb+0x141/0x2e0 [ 84.795297] __dev_queue_xmit+0x258/0x8f0 [ 84.795949] ? eth_header+0x26/0xc0 [ 84.796581] ip_finish_output2+0x196/0x430 [ 84.797295] ? skb_gso_validate_network_len+0x11/0x80 [ 84.798183] ? ip_finish_output+0x169/0x270 [ 84.798875] ip_output+0x6c/0xe0 [ 84.799413] ? ip_append_data.part.50+0xc0/0xc0 [ 84.800145] iptunnel_xmit+0x144/0x1c0 [ 84.800814] ip_tunnel_xmit+0x62d/0x930 [ip_tunnel] [ 84.801699] gre_tap_xmit+0xac/0xf0 [ip_gre] [ 84.802395] dev_hard_start_xmit+0xa5/0x210 [ 84.803086] sch_direct_xmit+0x14f/0x340 [ 84.803733] __dev_queue_xmit+0x799/0x8f0 [ 84.804472] ip_finish_output2+0x2e0/0x430 [ 84.805255] ? skb_gso_validate_network_len+0x11/0x80 [ 84.806154] ip_output+0x6c/0xe0 [ 84.806721] ? ip_append_data.part.50+0xc0/0xc0 [ 84.807516] sctp_packet_transmit+0x716/0xa10 [sctp] [ 84.808337] sctp_outq_flush+0xd7/0x880 [sctp] It was caused by SKB_GSO_CB(skb)->csum_start not set in sctp_gso_segment. sctp_gso_segment() calls skb_segment() with 'feature | NETIF_F_HW_CSUM', which causes SKB_GSO_CB(skb)->csum_start not to be set in skb_segment(). For TCP/UDP, when feature supports HW_CSUM, CHECKSUM_PARTIAL will be set and gso_reset_checksum will be called to set SKB_GSO_CB(skb)->csum_start. So SCTP should do the same as TCP/UDP, to call gso_reset_checksum() when computing checksum in sctp_gso_segment. Reported-by: Jianlin Shi <jishi@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller5-8/+18
Pablo Neira Ayuso says: ==================== Netfilter/IPVS fixes for net The following patchset contains Netfilter/IPVS fixes for net: 1) Missing structure initialization in ebtables causes splat with 32-bit user level on a 64-bit kernel, from Francesco Ruggeri. 2) Missing dependency on nf_defrag in IPVS IPv6 codebase, from Andrea Claudi. 3) Fix possible use-after-free from release path of target extensions. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-13netfilter: nft_compat: use-after-free when deleting targetsPablo Neira Ayuso1-1/+2
Fetch pointer to module before target object is released. Fixes: 29e3880109e3 ("netfilter: nf_tables: fix use-after-free when deleting compat expressions") Fixes: 0ca743a55991 ("netfilter: nf_tables: add compatibility layer for x_tables") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>