aboutsummaryrefslogtreecommitdiffstats
path: root/net/core
AgeCommit message (Collapse)AuthorFilesLines
2019-07-09devlink: Return physical port fields only for applicable port flavoursParav Pandit1-0/+4
Physical port number and split group fields are applicable only to physical port flavours such as PHYSICAL, CPU and DSA. Hence limit returning those values in netlink response to such port flavours. Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-09devlink: Refactor physical port attributesParav Pandit1-20/+38
To support additional devlink port flavours and to support few common and few different port attributes, move physical port attributes to a different structure. Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-08net: sched: add mpls manipulation actions to TCJohn Hurley1-0/+30
Currently, TC offers the ability to match on the MPLS fields of a packet through the use of the flow_dissector_key_mpls struct. However, as yet, TC actions do not allow the modification or manipulation of such fields. Add a new module that registers TC action ops to allow manipulation of MPLS. This includes the ability to push and pop headers as well as modify the contents of new or existing headers. A further action to decrement the TTL field of an MPLS header is also provided with a new helper added to support this. Examples of the usage of the new action with flower rules to push and pop MPLS labels are: tc filter add dev eth0 protocol ip parent ffff: flower \ action mpls push protocol mpls_uc label 123 \ action mirred egress redirect dev eth1 tc filter add dev eth0 protocol mpls_uc parent ffff: flower \ action mpls pop protocol ipv4 \ action mirred egress redirect dev eth1 Signed-off-by: John Hurley <john.hurley@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-08net: core: add MPLS update core helper and use in OvSJohn Hurley1-0/+33
Open vSwitch allows the updating of an existing MPLS header on a packet. In preparation for supporting similar functionality in TC, move this to a common skb helper function. Signed-off-by: John Hurley <john.hurley@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-08net: core: move pop MPLS functionality from OvS to core helperJohn Hurley1-0/+42
Open vSwitch provides code to pop an MPLS header to a packet. In preparation for supporting this in TC, move the pop code to an skb helper that can be reused. Remove the, now unused, update_ethertype static function from OvS. Signed-off-by: John Hurley <john.hurley@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-08net: core: move push MPLS functionality from OvS to core helperJohn Hurley1-0/+64
Open vSwitch provides code to push an MPLS header to a packet. In preparation for supporting this in TC, move the push code to an skb helper that can be reused. Signed-off-by: John Hurley <john.hurley@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-1/+1
Two cases of overlapping changes, nothing fancy. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-08skbuff: increase verbosity when dumping skb dataWillem de Bruijn2-12/+103
skb_warn_bad_offload and netdev_rx_csum_fault trigger on hard to debug issues. Dump more state and the header. Optionally dump the entire packet and linear segment. This is required to debug checksum bugs that may include bytes past skb_tail_pointer(). Both call sites call this function inside a net_ratelimit() block. Limit full packet log further to a hard limit of can_dump_full (5). Based on an earlier patch by Cong Wang, see link below. Changes v1 -> v2 - dump frag_list only on full_pkt Link: https://patchwork.ozlabs.org/patch/1000841/ Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-08Merge tag 'keys-namespace-20190627' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fsLinus Torvalds1-0/+20
Pull keyring namespacing from David Howells: "These patches help make keys and keyrings more namespace aware. Firstly some miscellaneous patches to make the process easier: - Simplify key index_key handling so that the word-sized chunks assoc_array requires don't have to be shifted about, making it easier to add more bits into the key. - Cache the hash value in the key so that we don't have to calculate on every key we examine during a search (it involves a bunch of multiplications). - Allow keying_search() to search non-recursively. Then the main patches: - Make it so that keyring names are per-user_namespace from the point of view of KEYCTL_JOIN_SESSION_KEYRING so that they're not accessible cross-user_namespace. keyctl_capabilities() shows KEYCTL_CAPS1_NS_KEYRING_NAME for this. - Move the user and user-session keyrings to the user_namespace rather than the user_struct. This prevents them propagating directly across user_namespaces boundaries (ie. the KEY_SPEC_* flags will only pick from the current user_namespace). - Make it possible to include the target namespace in which the key shall operate in the index_key. This will allow the possibility of multiple keys with the same description, but different target domains to be held in the same keyring. keyctl_capabilities() shows KEYCTL_CAPS1_NS_KEY_TAG for this. - Make it so that keys are implicitly invalidated by removal of a domain tag, causing them to be garbage collected. - Institute a network namespace domain tag that allows keys to be differentiated by the network namespace in which they operate. New keys that are of a type marked 'KEY_TYPE_NET_DOMAIN' are assigned the network domain in force when they are created. - Make it so that the desired network namespace can be handed down into the request_key() mechanism. This allows AFS, NFS, etc. to request keys specific to the network namespace of the superblock. This also means that the keys in the DNS record cache are thenceforth namespaced, provided network filesystems pass the appropriate network namespace down into dns_query(). For DNS, AFS and NFS are good, whilst CIFS and Ceph are not. Other cache keyrings, such as idmapper keyrings, also need to set the domain tag - for which they need access to the network namespace of the superblock" * tag 'keys-namespace-20190627' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: keys: Pass the network namespace into request_key mechanism keys: Network namespace domain tag keys: Garbage collect keys for which the domain has been removed keys: Include target namespace in match criteria keys: Move the user and user-session keyrings to the user_namespace keys: Namespace keyring names keys: Add a 'recurse' flag for keyring searches keys: Cache the hash value to avoid lots of recalculation keys: Simplify key description management
2019-07-08coallocate socket_wq with socket itselfAl Viro1-1/+1
socket->wq is assign-once, set when we are initializing both struct socket it's in and struct socket_wq it points to. As the matter of fact, the only reason for separate allocation was the ability to RCU-delay freeing of socket_wq. RCU-delaying the freeing of socket itself gets rid of that need, so we can just fold struct socket_wq into the end of struct socket and simplify the life both for sock_alloc_inode() (one allocation instead of two) and for tun/tap oddballs, where we used to embed struct socket and struct socket_wq into the same structure (now - embedding just the struct socket). Note that reference to struct socket_wq in struct sock does remain a reference - that's unchanged. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller1-8/+14
Daniel Borkmann says: ==================== pull-request: bpf-next 2019-07-09 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) Lots of libbpf improvements: i) addition of new APIs to attach BPF programs to tracing entities such as {k,u}probes or tracepoints, ii) improve specification of BTF-defined maps by eliminating the need for data initialization for some of the members, iii) addition of a high-level API for setting up and polling perf buffers for BPF event output helpers, all from Andrii. 2) Add "prog run" subcommand to bpftool in order to test-run programs through the kernel testing infrastructure of BPF, from Quentin. 3) Improve verifier for BPF sockaddr programs to support 8-byte stores for user_ip6 and msg_src_ip6 members given clang tends to generate such stores, from Stanislav. 4) Enable the new BPF JIT zero-extension optimization for further riscv64 ALU ops, from Luke. 5) Fix a bpftool json JIT dump crash on powerpc, from Jiri. 6) Fix an AF_XDP race in generic XDP's receive path, from Ilya. 7) Various smaller fixes from Ilya, Yue and Arnd. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-08net: core: page_pool: add user refcnt and reintroduce page_pool_destroyIvan Khoronzhuk2-0/+11
Jesper recently removed page_pool_destroy() (from driver invocation) and moved shutdown and free of page_pool into xdp_rxq_info_unreg(), in-order to handle in-flight packets/pages. This created an asymmetry in drivers create/destroy pairs. This patch reintroduce page_pool_destroy and add page_pool user refcnt. This serves the purpose to simplify drivers error handling as driver now drivers always calls page_pool_destroy() and don't need to track if xdp_rxq_info_reg_mem_model() was unsuccessful. This could be used for a special cases where a single RX-queue (with a single page_pool) provides packets for two net_device'es, and thus needs to register the same page_pool twice with two xdp_rxq_info structures. This patch is primarily to ease API usage for drivers. The recently merged netsec driver, actually have a bug in this area, which is solved by this API change. This patch is a modified version of Ivan Khoronzhuk's original patch. Link: https://lore.kernel.org/netdev/20190625175948.24771-2-ivan.khoronzhuk@linaro.org/ Fixes: 5c67bf0ec4d0 ("net: netsec: Use page_pool API") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Reviewed-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-08bpf: allow wide (u64) aligned stores for some fields of bpf_sock_addrStanislav Fomichev1-8/+14
Since commit cd17d7770578 ("bpf/tools: sync bpf.h") clang decided that it can do a single u64 store into user_ip6[2] instead of two separate u32 ones: # 17: (18) r2 = 0x100000000000000 # ; ctx->user_ip6[2] = bpf_htonl(DST_REWRITE_IP6_2); # 19: (7b) *(u64 *)(r1 +16) = r2 # invalid bpf_context access off=16 size=8 >From the compiler point of view it does look like a correct thing to do, so let's support it on the kernel side. Credit to Andrii Nakryiko for a proper implementation of bpf_ctx_wide_store_ok. Cc: Andrii Nakryiko <andriin@fb.com> Cc: Yonghong Song <yhs@fb.com> Fixes: cd17d7770578 ("bpf/tools: sync bpf.h") Reported-by: kernel test robot <rong.a.chen@intel.com> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller2-85/+186
Daniel Borkmann says: ==================== pull-request: bpf-next 2019-07-03 The following pull-request contains BPF updates for your *net-next* tree. There is a minor merge conflict in mlx5 due to 8960b38932be ("linux/dim: Rename externally used net_dim members") which has been pulled into your tree in the meantime, but resolution seems not that bad ... getting current bpf-next out now before there's coming more on mlx5. ;) I'm Cc'ing Saeed just so he's aware of the resolution below: ** First conflict in drivers/net/ethernet/mellanox/mlx5/core/en_main.c: <<<<<<< HEAD static int mlx5e_open_cq(struct mlx5e_channel *c, struct dim_cq_moder moder, struct mlx5e_cq_param *param, struct mlx5e_cq *cq) ======= int mlx5e_open_cq(struct mlx5e_channel *c, struct net_dim_cq_moder moder, struct mlx5e_cq_param *param, struct mlx5e_cq *cq) >>>>>>> e5a3e259ef239f443951d401db10db7d426c9497 Resolution is to take the second chunk and rename net_dim_cq_moder into dim_cq_moder. Also the signature for mlx5e_open_cq() in ... drivers/net/ethernet/mellanox/mlx5/core/en.h +977 ... and in mlx5e_open_xsk() ... drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.c +64 ... needs the same rename from net_dim_cq_moder into dim_cq_moder. ** Second conflict in drivers/net/ethernet/mellanox/mlx5/core/en_main.c: <<<<<<< HEAD int cpu = cpumask_first(mlx5_comp_irq_get_affinity_mask(priv->mdev, ix)); struct dim_cq_moder icocq_moder = {0, 0}; struct net_device *netdev = priv->netdev; struct mlx5e_channel *c; unsigned int irq; ======= struct net_dim_cq_moder icocq_moder = {0, 0}; >>>>>>> e5a3e259ef239f443951d401db10db7d426c9497 Take the second chunk and rename net_dim_cq_moder into dim_cq_moder as well. Let me know if you run into any issues. Anyway, the main changes are: 1) Long-awaited AF_XDP support for mlx5e driver, from Maxim. 2) Addition of two new per-cgroup BPF hooks for getsockopt and setsockopt along with a new sockopt program type which allows more fine-grained pass/reject settings for containers. Also add a sock_ops callback that can be selectively enabled on a per-socket basis and is executed for every RTT to help tracking TCP statistics, both features from Stanislav. 3) Follow-up fix from loops in precision tracking which was not propagating precision marks and as a result verifier assumed that some branches were not taken and therefore wrongly removed as dead code, from Alexei. 4) Fix BPF cgroup release synchronization race which could lead to a double-free if a leaf's cgroup_bpf object is released and a new BPF program is attached to the one of ancestor cgroups in parallel, from Roman. 5) Support for bulking XDP_TX on veth devices which improves performance in some cases by around 9%, from Toshiaki. 6) Allow for lookups into BPF devmap and improve feedback when calling into bpf_redirect_map() as lookup is now performed right away in the helper itself, from Toke. 7) Add support for fq's Earliest Departure Time to the Host Bandwidth Manager (HBM) sample BPF program, from Lawrence. 8) Various cleanups and minor fixes all over the place from many others. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller1-1/+1
Daniel Borkmann says: ==================== pull-request: bpf 2019-07-03 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) Fix the interpreter to properly handle BPF_ALU32 | BPF_ARSH on BE architectures, from Jiong. 2) Fix several bugs in the x32 BPF JIT for handling shifts by 0, from Luke and Xi. 3) Fix NULL pointer deref in btf_type_is_resolve_source_only(), from Stanislav. 4) Properly handle the check that forwarding is enabled on the device in bpf_ipv6_fib_lookup() helper code, from Anton. 5) Fix UAPI bpf_prog_info fields alignment for archs that have 16 bit alignment such as m68k, from Baruch. 6) Fix kernel hanging in unregister_netdevice loop while unregistering device bound to XDP socket, from Ilya. 7) Properly terminate tail update in xskq_produce_flush_desc(), from Nathan. 8) Fix broken always_inline handling in test_lwt_seg6local, from Jiri. 9) Fix bpftool to use correct argument in cgroup errors, from Jakub. 10) Fix detaching dummy prog in XDP redirect sample code, from Prashant. 11) Add Jonathan to AF_XDP reviewers, from Björn. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-03bpf: add icsk_retransmits to bpf_tcp_sockStanislav Fomichev1-1/+19
Add some inet_connection_sock fields to bpf_tcp_sock that might be useful for debugging congestion control issues. Cc: Eric Dumazet <edumazet@google.com> Cc: Priyaranjan Jha <priyarjha@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-03bpf: add dsack_dups/delivered{, _ce} to bpf_tcp_sockStanislav Fomichev1-1/+10
Add more fields to bpf_tcp_sock that might be useful for debugging congestion control issues. Cc: Eric Dumazet <edumazet@google.com> Cc: Priyaranjan Jha <priyarjha@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-03bpf: split shared bpf_tcp_sock and bpf_sock_ops implementationStanislav Fomichev1-54/+126
We've added bpf_tcp_sock member to bpf_sock_ops and don't expect any new tcp_sock fields in bpf_sock_ops. Let's remove CONVERT_COMMON_TCP_SOCK_FIELDS so bpf_tcp_sock can be independently extended. Cc: Eric Dumazet <edumazet@google.com> Cc: Priyaranjan Jha <priyarjha@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-01blackhole_netdev: use blackhole_netdev to invalidate dst entriesMahesh Bandewar1-1/+1
Use blackhole_netdev instead of 'lo' device with lower MTU when marking dst "dead". Signed-off-by: Mahesh Bandewar <maheshb@google.com> Tested-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-01net: link_watch: prevent starvation when processing linkwatch wqYunsheng Lin1-1/+12
When user has configured a large number of virtual netdev, such as 4K vlans, the carrier on/off operation of the real netdev will also cause it's virtual netdev's link state to be processed in linkwatch. Currently, the processing is done in a work queue, which may cause rtnl locking starvation problem and worker starvation problem for other work queue, such as irqfd_inject wq. This patch releases the cpu when link watch worker has processed a fixed number of netdev' link watch event, and schedule the work queue again when there is still link watch event remaining. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-30Merge tag 'mlx5e-updates-2019-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linuxDavid S. Miller1-2/+4
Saeed Mahameed says: ==================== mlx5e-updates-2019-06-28 This series adds some misc updates for mlx5e driver 1) Allow adding the same mac more than once in MPFS table 2) Move to HW checksumming advertising 3) Report netdevice MPLS features 4) Correct physical port name of the PF representor 5) Reduce stack usage in mlx5_eswitch_termtbl_create 6) Refresh TIR improvement for representors 7) Expose same physical switch_id for all representors ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-29bpf_xdp_redirect_map: Perform map lookup in eBPF helperToke Høiland-Jørgensen1-14/+18
The bpf_redirect_map() helper used by XDP programs doesn't return any indication of whether it can successfully redirect to the map index it was given. Instead, BPF programs have to track this themselves, leading to programs using duplicate maps to track which entries are populated in the devmap. This patch fixes this by moving the map lookup into the bpf_redirect_map() helper, which makes it possible to return failure to the eBPF program. The lower bits of the flags argument is used as the return code, which means that existing users who pass a '0' flag argument will get XDP_ABORTED. With this, a BPF program can check the return code from the helper call and react by, for instance, substituting a different redirect. This works for any type of map used for redirect. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-29devmap: Rename ifindex member in bpf_redirect_infoToke Høiland-Jørgensen1-13/+13
The bpf_redirect_info struct has an 'ifindex' member which was named back when the redirects could only target egress interfaces. Now that we can also redirect to sockets and CPUs, this is a bit misleading, so rename the member to tgt_index. Reorder the struct members so we can have 'tgt_index' and 'tgt_value' next to each other in a subsequent patch. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-29devmap/cpumap: Use flush list instead of bitmapToke Høiland-Jørgensen1-2/+0
The socket map uses a linked list instead of a bitmap to keep track of which entries to flush. Do the same for devmap and cpumap, as this means we don't have to care about the map index when enqueueing things into the map (and so we can cache the map lookup). Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-28Merge branch 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linuxSaeed Mahameed1-2/+4
Misc updates from mlx5-next branch: 1) E-Switch vport metadata support for source vport matching 2) Convert mkey_table to XArray 3) Shared IRQs and to use single IRQ for all async EQs Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-06-28net: sched: refactor reinsert actionJohn Hurley1-3/+1
The TC_ACT_REINSERT return type was added as an in-kernel only option to allow a packet ingress or egress redirect. This is used to avoid unnecessary skb clones in situations where they are not required. If a TC hook returns this code then the packet is 'reinserted' and no skb consume is carried out as no clone took place. This return type is only used in act_mirred. Rather than have the reinsert called from the main datapath, call it directly in act_mirred. Instead of returning TC_ACT_REINSERT, change the type to the new TC_ACT_CONSUMED which tells the caller that the packet has been stolen by another process and that no consume call is required. Moving all redirect calls to the act_mirred code is in preparation for tracking recursion created by act_mirred. Signed-off-by: John Hurley <john.hurley@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-27bpf: implement getsockopt and setsockopt hooksStanislav Fomichev1-1/+1
Implement new BPF_PROG_TYPE_CGROUP_SOCKOPT program type and BPF_CGROUP_{G,S}ETSOCKOPT cgroup hooks. BPF_CGROUP_SETSOCKOPT can modify user setsockopt arguments before passing them down to the kernel or bypass kernel completely. BPF_CGROUP_GETSOCKOPT can can inspect/modify getsockopt arguments that kernel returns. Both hooks reuse existing PTR_TO_PACKET{,_END} infrastructure. The buffer memory is pre-allocated (because I don't think there is a precedent for working with __user memory from bpf). This might be slow to do for each {s,g}etsockopt call, that's why I've added __cgroup_bpf_prog_array_is_empty that exits early if there is nothing attached to a cgroup. Note, however, that there is a race between __cgroup_bpf_prog_array_is_empty and BPF_PROG_RUN_ARRAY where cgroup program layout might have changed; this should not be a problem because in general there is a race between multiple calls to {s,g}etsocktop and user adding/removing bpf progs from a cgroup. The return code of the BPF program is handled as follows: * 0: EPERM * 1: success, continue with next BPF program in the cgroup chain v9: * allow overwriting setsockopt arguments (Alexei Starovoitov): * use set_fs (same as kernel_setsockopt) * buffer is always kzalloc'd (no small on-stack buffer) v8: * use s32 for optlen (Andrii Nakryiko) v7: * return only 0 or 1 (Alexei Starovoitov) * always run all progs (Alexei Starovoitov) * use optval=0 as kernel bypass in setsockopt (Alexei Starovoitov) (decided to use optval=-1 instead, optval=0 might be a valid input) * call getsockopt hook after kernel handlers (Alexei Starovoitov) v6: * rework cgroup chaining; stop as soon as bpf program returns 0 or 2; see patch with the documentation for the details * drop Andrii's and Martin's Acked-by (not sure they are comfortable with the new state of things) v5: * skip copy_to_user() and put_user() when ret == 0 (Martin Lau) v4: * don't export bpf_sk_fullsock helper (Martin Lau) * size != sizeof(__u64) for uapi pointers (Martin Lau) * offsetof instead of bpf_ctx_range when checking ctx access (Martin Lau) v3: * typos in BPF_PROG_CGROUP_SOCKOPT_RUN_ARRAY comments (Andrii Nakryiko) * reverse christmas tree in BPF_PROG_CGROUP_SOCKOPT_RUN_ARRAY (Andrii Nakryiko) * use __bpf_md_ptr instead of __u32 for optval{,_end} (Martin Lau) * use BPF_FIELD_SIZEOF() for consistency (Martin Lau) * new CG_SOCKOPT_ACCESS macro to wrap repeated parts v2: * moved bpf_sockopt_kern fields around to remove a hole (Martin Lau) * aligned bpf_sockopt_kern->buf to 8 bytes (Martin Lau) * bpf_prog_array_is_empty instead of bpf_prog_array_length (Martin Lau) * added [0,2] return code check to verifier (Martin Lau) * dropped unused buf[64] from the stack (Martin Lau) * use PTR_TO_SOCKET for bpf_sockopt->sk (Martin Lau) * dropped bpf_target_off from ctx rewrites (Martin Lau) * use return code for kernel bypass (Martin Lau & Andrii Nakryiko) Cc: Andrii Nakryiko <andriin@fb.com> Cc: Martin Lau <kafai@fb.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-06-27net: ethtool: Allow parsing ETHER_FLOW types when using flow_ruleMaxime Chevallier1-0/+24
When parsing an ethtool_rx_flow_spec, users can specify an ethernet flow which could contain matches based on the ethernet header, such as the MAC address, the VLAN tag or the ethertype. ETHER_FLOW uses the src and dst ethernet addresses, along with the ethertype as keys. Matches based on the vlan tag are also possible, but they are specified using the special FLOW_EXT flag. Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Acked-by: Pablo Neira Ayuso <pablo@gnumonks.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-26rtnetlink: skip metrics loop for dst_default_metricsDavid Ahern1-0/+4
dst_default_metrics has all of the metrics initialized to 0, so nothing will be added to the skb in rtnetlink_put_metrics. Avoid the loop if metrics is from dst_default_metrics. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-26keys: Network namespace domain tagDavid Howells1-0/+20
Create key domain tags for network namespaces and make it possible to automatically tag keys that are used by networked services (e.g. AF_RXRPC, AFS, DNS) with the default network namespace if not set by the caller. This allows keys with the same description but in different namespaces to coexist within a keyring. Signed-off-by: David Howells <dhowells@redhat.com> cc: netdev@vger.kernel.org cc: linux-nfs@vger.kernel.org cc: linux-cifs@vger.kernel.org cc: linux-afs@lists.infradead.org
2019-06-26xdp: Make __mem_id_disconnect staticYueHaibing1-1/+1
Fix sparse warning: net/core/xdp.c:88:6: warning: symbol '__mem_id_disconnect' was not declared. Should it be static? Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-22netns: restore ops before calling ops_exit_listLi RongQing1-1/+1
ops has been iterated to first element when call pre_exit, and it needs to restore from save_ops, not save ops to save_ops Fixes: d7d99872c144 ("netns: add pre_exit method to struct pernet_operations") Signed-off-by: Li RongQing <lirongqing@baidu.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-3/+0
Minor SPDX change conflict. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller3-0/+114
Alexei Starovoitov says: ==================== pull-request: bpf-next 2019-06-19 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) new SO_REUSEPORT_DETACH_BPF setsocktopt, from Martin. 2) BTF based map definition, from Andrii. 3) support bpf_map_lookup_elem for xskmap, from Jonathan. 4) bounded loops and scalar precision logic in the verifier, from Alexei. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19netns: add pre_exit method to struct pernet_operationsEric Dumazet1-0/+28
Current struct pernet_operations exit() handlers are highly discouraged to call synchronize_rcu(). There are cases where we need them, and exit_batch() does not help the common case where a single netns is dismantled. This patch leverages the existing synchronize_rcu() call in cleanup_net() Calling optional ->pre_exit() method before ->exit() or ->exit_batch() allows to benefit from a single synchronize_rcu() call. Note that the synchronize_rcu() calls added in this patch are only in error paths or slow paths. Tested: $ time for i in {1..1000}; do unshare -n /bin/false;done real 0m2.612s user 0m0.171s sys 0m2.216s Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19page_pool: make sure struct device is stableJesper Dangaard Brouer1-0/+8
For DMA mapping use-case the page_pool keeps a pointer to the struct device, which is used in DMA map/unmap calls. For our in-flight handling, we also need to make sure that the struct device have not disappeared. This is assured via using get_device/put_device API. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Reported-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19page_pool: add tracepoints for page_pool with details need by XDPJesper Dangaard Brouer2-1/+12
The xdp tracepoints for mem id disconnect don't carry information about, why it was not safe_to_remove. The tracepoint page_pool:page_pool_inflight in this patch can be used for extract this info for further debugging. This patchset also adds tracepoint for the pages_state_* release/hold transitions, including a pointer to the page. This can be used for stats about in-flight pages, or used to debug page leakage via keeping track of page pointer and combining this with kprobe for __put_page(). Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19xdp: add tracepoints for XDP memJesper Dangaard Brouer1-16/+5
These tracepoints make it easier to troubleshoot XDP mem id disconnect. The xdp:mem_disconnect tracepoint cannot be replaced via kprobe. It is placed at the last stable place for the pointer to struct xdp_mem_allocator, just before it's scheduled for RCU removal. It also extract info on 'safe_to_remove' and 'force'. Detailed info about in-flight pages is not available at this layer. The next patch will added tracepoints needed at the page_pool layer for this. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19xdp: force mem allocator removal and periodic warningJesper Dangaard Brouer2-7/+48
If bugs exists or are introduced later e.g. by drivers misusing the API, then we want to warn about the issue, such that developer notice. This patch will generate a bit of noise in form of periodic pr_warn every 30 seconds. It is not nice to have this stall warning running forever. Thus, this patch will (after 120 attempts) force disconnect the mem id (from the rhashtable) and free the page_pool object. This will cause fallback to the put_page() as before, which only potentially leak DMA-mappings, if objects are really stuck for this long. In that unlikely case, a WARN_ONCE should show us the call stack. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19xdp: tracking page_pool resources and safe removalJesper Dangaard Brouer2-22/+105
This patch is needed before we can allow drivers to use page_pool for DMA-mappings. Today with page_pool and XDP return API, it is possible to remove the page_pool object (from rhashtable), while there are still in-flight packet-pages. This is safely handled via RCU and failed lookups in __xdp_return() fallback to call put_page(), when page_pool object is gone. In-case page is still DMA mapped, this will result in page note getting correctly DMA unmapped. To solve this, the page_pool is extended with tracking in-flight pages. And XDP disconnect system queries page_pool and waits, via workqueue, for all in-flight pages to be returned. To avoid killing performance when tracking in-flight pages, the implement use two (unsigned) counters, that in placed on different cache-lines, and can be used to deduct in-flight packets. This is done by mapping the unsigned "sequence" counters onto signed Two's complement arithmetic operations. This is e.g. used by kernel's time_after macros, described in kernel commit 1ba3aab3033b and 5a581b367b5, and also explained in RFC1982. The trick is these two incrementing counters only need to be read and compared, when checking if it's safe to free the page_pool structure. Which will only happen when driver have disconnected RX/alloc side. Thus, on a non-fast-path. It is chosen that page_pool tracking is also enabled for the non-DMA use-case, as this can be used for statistics later. After this patch, using page_pool requires more strict resource "release", e.g. via page_pool_release_page() that was introduced in this patchset, and previous patches implement/fix this more strict requirement. Drivers no-longer call page_pool_destroy(). Drivers already call xdp_rxq_info_unreg() which call xdp_rxq_info_unreg_mem_model(), which will attempt to disconnect the mem id, and if attempt fails schedule the disconnect for later via delayed workqueue. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19page_pool: introduce page_pool_free and use in mlx5Jesper Dangaard Brouer1-4/+11
In case driver fails to register the page_pool with XDP return API (via xdp_rxq_info_reg_mem_model()), then the driver can free the page_pool resources more directly than calling page_pool_destroy(), which does a unnecessarily RCU free procedure. This patch is preparing for removing page_pool_destroy(), from driver invocation. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19xdp: page_pool related fix to cpumapJesper Dangaard Brouer1-0/+15
When converting an xdp_frame into an SKB, and sending this into the network stack, then the underlying XDP memory model need to release associated resources, because the network stack don't have callbacks for XDP memory models. The only memory model that needs this is page_pool, when a driver use the DMA-mapping feature. Introduce page_pool_release_page(), which basically does the same as page_pool_unmap_page(). Add xdp_release_frame() as the XDP memory model interface for calling it, if the memory model match MEM_TYPE_PAGE_POOL, to save the function call overhead for others. Have cpumap call xdp_release_frame() before xdp_scrub_frame(). Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19xdp: fix leak of IDA cyclic id if rhashtable_insert_slow failsJesper Dangaard Brouer1-0/+2
Fix error handling case, where inserting ID with rhashtable_insert_slow fails in xdp_rxq_info_reg_mem_model, which leads to never releasing the IDA ID, as the lookup in xdp_rxq_info_unreg_mem_model fails and thus ida_simple_remove() is never called. Fix by releasing ID via ida_simple_remove(), and mark xdp_rxq->mem.id with zero, which is already checked in xdp_rxq_info_unreg_mem_model(). Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19net: page_pool: add helper function to unmap dma addressesIlias Apalodimas1-0/+7
On a previous patch dma addr was stored in 'struct page'. Use that to unmap DMA addresses used by network drivers Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19net: flow_offload: implement support for meta keyJiri Pirko1-0/+7
Implement support for previously added flow dissector meta key. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19flow_dissector: add support for ingress ifindex dissectionJiri Pirko1-0/+16
Add new key meta that contains ingress ifindex value and add a function to dissect this from skb. The key and function is prepared to cover other potential skb metadata values dissection. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-18ipoib: show VF broadcast addressDenis Kirjanov1-0/+5
in IPoIB case we can't see a VF broadcast address for but can see for PF Before: 11: ib1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast state UP mode DEFAULT group default qlen 256 link/infiniband 80:00:00:66:fe:80:00:00:00:00:00:00:24:8a:07:03:00:a4:3e:7c brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff vf 0 MAC 14:80:00:00:66:fe, spoof checking off, link-state disable, trust off, query_rss off ... After: 11: ib1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast state UP mode DEFAULT group default qlen 256 link/infiniband 80:00:00:66:fe:80:00:00:00:00:00:00:24:8a:07:03:00:a4:3e:7c brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff vf 0 link/infiniband 80:00:00:66:fe:80:00:00:00:00:00:00:24:8a:07:03:00:a4:3e:7c brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff, spoof checking off, link-state disable, trust off, query_rss off v1->v2: add the IFLA_VF_BROADCAST constant v2->v3: put IFLA_VF_BROADCAST at the end to avoid KABI breakage and set NLA_REJECT dev_setlink Signed-off-by: Denis Kirjanov <kda@linux-powerpc.org> Acked-by: Doug Ledford <dledford@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-18net: remove duplicate fetch in sock_getsockoptJingYi Hou1-3/+0
In sock_getsockopt(), 'optlen' is fetched the first time from userspace. 'len < 0' is then checked. Then in condition 'SO_MEMINFO', 'optlen' is fetched the second time from userspace. If change it between two fetches may cause security problems or unexpected behaivor, and there is no reason to fetch it a second time. To fix this, we need to remove the second fetch. Signed-off-by: JingYi Hou <houjingyi647@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller12-29/+84
Honestly all the conflicts were simple overlapping changes, nothing really interesting to report. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds8-6/+80
Pull networking fixes from David Miller: "Lots of bug fixes here: 1) Out of bounds access in __bpf_skc_lookup, from Lorenz Bauer. 2) Fix rate reporting in cfg80211_calculate_bitrate_he(), from John Crispin. 3) Use after free in psock backlog workqueue, from John Fastabend. 4) Fix source port matching in fdb peer flow rule of mlx5, from Raed Salem. 5) Use atomic_inc_not_zero() in fl6_sock_lookup(), from Eric Dumazet. 6) Network header needs to be set for packet redirect in nfp, from John Hurley. 7) Fix udp zerocopy refcnt, from Willem de Bruijn. 8) Don't assume linear buffers in vxlan and geneve error handlers, from Stefano Brivio. 9) Fix TOS matching in mlxsw, from Jiri Pirko. 10) More SCTP cookie memory leak fixes, from Neil Horman. 11) Fix VLAN filtering in rtl8366, from Linus Walluij. 12) Various TCP SACK payload size and fragmentation memory limit fixes from Eric Dumazet. 13) Use after free in pneigh_get_next(), also from Eric Dumazet. 14) LAPB control block leak fix from Jeremy Sowden" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (145 commits) lapb: fixed leak of control-blocks. tipc: purge deferredq list for each grp member in tipc_group_delete ax25: fix inconsistent lock state in ax25_destroy_timer neigh: fix use-after-free read in pneigh_get_next tcp: fix compile error if !CONFIG_SYSCTL hv_sock: Suppress bogus "may be used uninitialized" warnings be2net: Fix number of Rx queues used for flow hashing net: handle 802.1P vlan 0 packets properly tcp: enforce tcp_min_snd_mss in tcp_mtu_probing() tcp: add tcp_min_snd_mss sysctl tcp: tcp_fragment() should apply sane memory limits tcp: limit payload size of sacked skbs Revert "net: phylink: set the autoneg state in phylink_phy_change" bpf: fix nested bpf tracepoints with per-cpu data bpf: Fix out of bounds memory access in bpf_sk_storage vsock/virtio: set SOCK_DONE on peer shutdown net: dsa: rtl8366: Fix up VLAN filtering net: phylink: set the autoneg state in phylink_phy_change net: add high_order_alloc_disable sysctl/static key tcp: add tcp_tx_skb_cache sysctl ...