aboutsummaryrefslogtreecommitdiffstats
path: root/net/core (follow)
AgeCommit message (Collapse)AuthorFilesLines
2018-08-03bpf: introduce the bpf_get_local_storage() helper functionRoman Gushchin1-1/+22
The bpf_get_local_storage() helper function is used to get a pointer to the bpf local storage from a bpf program. It takes a pointer to a storage map and flags as arguments. Right now it accepts only cgroup storage maps, and flags argument has to be 0. Further it can be extended to support other types of local storage: e.g. thread local storage etc. Signed-off-by: Roman Gushchin <guro@fb.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-02net/socket: remove duplicated init codeMatthieu Baerts1-48/+3
This refactoring work has been started by David Howells in cdfbabfb2f0c (net: Work around lockdep limitation in sockets that use sockets) but the exact same day in 581319c58600 (net/socket: use per af lockdep classes for sk queues), Paolo Abeni added new classes. This reduces the amount of (nearly) duplicated code and eases the addition of new socket types. Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-02Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/netDavid S. Miller4-14/+20
The BTF conflicts were simple overlapping changes. The virtio_net conflict was an overlap of a fix of statistics counter, happening alongisde a move over to a bonafide statistics structure rather than counting value on the stack. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-02net: Fix coding style in skb_push()Ganesh Goudar1-1/+1
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-31bpf: Support bpf_get_socket_cookie in more prog typesAndrey Ignatov1-0/+28
bpf_get_socket_cookie() helper can be used to identify skb that correspond to the same socket. Though socket cookie can be useful in many other use-cases where socket is available in program context. Specifically BPF_PROG_TYPE_CGROUP_SOCK_ADDR and BPF_PROG_TYPE_SOCK_OPS programs can benefit from it so that one of them can augment a value in a map prepared earlier by other program for the same socket. The patch adds support to call bpf_get_socket_cookie() from BPF_PROG_TYPE_CGROUP_SOCK_ADDR and BPF_PROG_TYPE_SOCK_OPS. It doesn't introduce new helpers. Instead it reuses same helper name bpf_get_socket_cookie() but adds support to this helper to accept `struct bpf_sock_addr` and `struct bpf_sock_ops`. Documentation in bpf.h is changed in a way that should not break automatic generation of markdown. Signed-off-by: Andrey Ignatov <rdna@fb.com> Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-31lwt_bpf: remove unnecessary rcu_read_lock in run_lwt_bpfTaehee Yoo1-2/+0
run_lwt_bpf is called by bpf_{input/output/xmit}. These functions are already protected by rcu_read_lock. because lwtunnel_{input/output/xmit} holds rcu_read_lock and then calls bpf_{input/output/xmit}. So that rcu_read_lock in the run_lwt_bpf is unnecessary. Signed-off-by: Taehee Yoo <ap420073@gmail.com> Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-31bpf: add End.DT6 action to bpf_lwt_seg6_action helperMathieu Xhonneux1-29/+59
The seg6local LWT provides the End.DT6 action, which allows to decapsulate an outer IPv6 header containing a Segment Routing Header (SRH), full specification is available here: https://tools.ietf.org/html/draft-filsfils-spring-srv6-network-programming-05 This patch adds this action now to the seg6local BPF interface. Since it is not mandatory that the inner IPv6 header also contains a SRH, seg6_bpf_srh_state has been extended with a pointer to a possible SRH of the outermost IPv6 header. This helps assessing if the validation must be triggered or not, and avoids some calls to ipv6_find_hdr. v3: s/1/true, s/0/false for boolean values v2: - changed true/false -> 1/0 - preempt_enable no longer called in first conditional block Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-30fib_rules: NULL check before kfree is not neededYueHaibing1-2/+1
kfree(NULL) is safe,so this removes NULL check before freeing the mem Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-30net/tc: introduce TC_ACT_REINSERT.Paolo Abeni1-1/+5
This is similar TC_ACT_REDIRECT, but with a slightly different semantic: - on ingress the mirred skbs are passed to the target device network stack without any additional check not scrubbing. - the rcu-protected stats provided via the tcf_result struct are updated on error conditions. This new tcfa_action value is not exposed to the user-space and can be used only internally by clsact. v1 -> v2: do not touch TC_ACT_REDIRECT code path, introduce a new action type instead v2 -> v3: - rename the new action value TC_ACT_REINJECT, update the helper accordingly - take care of uncloned reinjected packets in XDP generic hook v3 -> v4: - renamed again the new action value (JiriP) v4 -> v5: - fix build error with !NET_CLS_ACT (kbuild bot) Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-30net: simplify sock_poll_waitChristoph Hellwig1-1/+1
The wait_address argument is always directly derived from the filp argument, so remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-29net: report invalid mtu value via netlink extackStephen Hemminger2-7/+18
If an invalid MTU value is set through rtnetlink return extra error information instead of putting message in kernel log. For other cases where there is no visible API, keep the error report in the log. Example: # ip li set dev enp12s0 mtu 10000 Error: mtu greater than device maximum. # ifconfig enp12s0 mtu 10000 SIOCSIFMTU: Invalid argument # dmesg | tail -1 [ 2047.795467] enp12s0: mtu greater than device maximum Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-29net: report min and max mtu network device settingsStephen Hemminger1-0/+6
Report the minimum and maximum MTU allowed on a device via netlink so that it can be displayed by tools like ip link. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller3-7/+10
Daniel Borkmann says: ==================== pull-request: bpf 2018-07-28 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) API fixes for libbpf's BTF mapping of map key/value types in order to make them compatible with iproute2's BPF_ANNOTATE_KV_PAIR() markings, from Martin. 2) Fix AF_XDP to not report POLLIN prematurely by using the non-cached consumer pointer of the RX queue, from Björn. 3) Fix __xdp_return() to check for NULL pointer after the rhashtable lookup that retrieves the allocator object, from Taehee. 4) Fix x86-32 JIT to adjust ebp register in prologue and epilogue by 4 bytes which got removed from overall stack usage, from Wang. 5) Fix bpf_skb_load_bytes_relative() length check to use actual packet length, from Daniel. 6) Fix uninitialized return code in libbpf bpf_perf_event_read_simple() handler, from Thomas. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-28bpf: use GFP_ATOMIC instead of GFP_KERNEL in bpf_parse_prog()Taehee Yoo1-1/+1
bpf_parse_prog() is protected by rcu_read_lock(). so that GFP_KERNEL is not allowed in the bpf_parse_prog(). [51015.579396] ============================= [51015.579418] WARNING: suspicious RCU usage [51015.579444] 4.18.0-rc6+ #208 Not tainted [51015.579464] ----------------------------- [51015.579488] ./include/linux/rcupdate.h:303 Illegal context switch in RCU read-side critical section! [51015.579510] other info that might help us debug this: [51015.579532] rcu_scheduler_active = 2, debug_locks = 1 [51015.579556] 2 locks held by ip/1861: [51015.579577] #0: 00000000a8c12fd1 (rtnl_mutex){+.+.}, at: rtnetlink_rcv_msg+0x2e0/0x910 [51015.579711] #1: 00000000bf815f8e (rcu_read_lock){....}, at: lwtunnel_build_state+0x96/0x390 [51015.579842] stack backtrace: [51015.579869] CPU: 0 PID: 1861 Comm: ip Not tainted 4.18.0-rc6+ #208 [51015.579891] Hardware name: To be filled by O.E.M. To be filled by O.E.M./Aptio CRB, BIOS 5.6.5 07/08/2015 [51015.579911] Call Trace: [51015.579950] dump_stack+0x74/0xbb [51015.580000] ___might_sleep+0x16b/0x3a0 [51015.580047] __kmalloc_track_caller+0x220/0x380 [51015.580077] kmemdup+0x1c/0x40 [51015.580077] bpf_parse_prog+0x10e/0x230 [51015.580164] ? kasan_kmalloc+0xa0/0xd0 [51015.580164] ? bpf_destroy_state+0x30/0x30 [51015.580164] ? bpf_build_state+0xe2/0x3e0 [51015.580164] bpf_build_state+0x1bb/0x3e0 [51015.580164] ? bpf_parse_prog+0x230/0x230 [51015.580164] ? lock_is_held_type+0x123/0x1a0 [51015.580164] lwtunnel_build_state+0x1aa/0x390 [51015.580164] fib_create_info+0x1579/0x33d0 [51015.580164] ? sched_clock_local+0xe2/0x150 [51015.580164] ? fib_info_update_nh_saddr+0x1f0/0x1f0 [51015.580164] ? sched_clock_local+0xe2/0x150 [51015.580164] fib_table_insert+0x201/0x1990 [51015.580164] ? lock_downgrade+0x610/0x610 [51015.580164] ? fib_table_lookup+0x1920/0x1920 [51015.580164] ? lwtunnel_valid_encap_type.part.6+0xcb/0x3a0 [51015.580164] ? rtm_to_fib_config+0x637/0xbd0 [51015.580164] inet_rtm_newroute+0xed/0x1b0 [51015.580164] ? rtm_to_fib_config+0xbd0/0xbd0 [51015.580164] rtnetlink_rcv_msg+0x331/0x910 [ ... ] Fixes: 3a0af8fd61f9 ("bpf: BPF for lightweight tunnel infrastructure") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-28bpf: fix bpf_skb_load_bytes_relative pkt length checkDaniel Borkmann1-5/+7
The len > skb_headlen(skb) cannot be used as a maximum upper bound for the packet length since it does not have any relation to the full linear packet length when filtering is used from upper layers (e.g. in case of reuseport BPF programs) as by then skb->data, skb->len already got mangled through __skb_pull() and others. Fixes: 4e1ec56cdc59 ("bpf: add skb_load_bytes_relative helper") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com>
2018-07-27Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-nextDavid S. Miller1-1/+1
Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2018-07-27 1) Extend the output_mark to also support the input direction and masking the mark values before applying to the skb. 2) Add a new lookup key for the upcomming xfrm interfaces. 3) Extend the xfrm lookups to match xfrm interface IDs. 4) Add virtual xfrm interfaces. The purpose of these interfaces is to overcome the design limitations that the existing VTI devices have. The main limitations that we see with the current VTI are the following: VTI interfaces are L3 tunnels with configurable endpoints. For xfrm, the tunnel endpoint are already determined by the SA. So the VTI tunnel endpoints must be either the same as on the SA or wildcards. In case VTI tunnel endpoints are same as on the SA, we get a one to one correlation between the SA and the tunnel. So each SA needs its own tunnel interface. On the other hand, we can have only one VTI tunnel with wildcard src/dst tunnel endpoints in the system because the lookup is based on the tunnel endpoints. The existing tunnel lookup won't work with multiple tunnels with wildcard tunnel endpoints. Some usecases require more than on VTI tunnel of this type, for example if somebody has multiple namespaces and every namespace requires such a VTI. VTI needs separate interfaces for IPv4 and IPv6 tunnels. So when routing to a VTI, we have to know to which address family this traffic class is going to be encapsulated. This is a lmitation because it makes routing more complex and it is not always possible to know what happens behind the VTI, e.g. when the VTI is move to some namespace. VTI works just with tunnel mode SAs. We need generic interfaces that ensures transfomation, regardless of the xfrm mode and the encapsulated address family. VTI is configured with a combination GRE keys and xfrm marks. With this we have to deal with some extra cases in the generic tunnel lookup because the GRE keys on the VTI are actually not GRE keys, the GRE keys were just reused for something else. All extensions to the VTI interfaces would require to add even more complexity to the generic tunnel lookup. So to overcome this, we developed xfrm interfaces with the following design goal: It should be possible to tunnel IPv4 and IPv6 through the same interface. No limitation on xfrm mode (tunnel, transport and beet). Should be a generic virtual interface that ensures IPsec transformation, no need to know what happens behind the interface. Interfaces should be configured with a new key that must match a new policy/SA lookup key. The lookup logic should stay in the xfrm codebase, no need to change or extend generic routing and tunnel lookups. Should be possible to use IPsec hardware offloads of the underlying interface. 5) Remove xfrm pcpu policy cache. This was added after the flowcache removal, but it turned out to make things even worse. From Florian Westphal. 6) Allow to update the set mark on SA updates. From Nathan Harold. 7) Convert some timestamps to time64_t. From Arnd Bergmann. 8) Don't check the offload_handle in xfrm code, it is an opaque data cookie for the driver. From Shannon Nelson. 9) Remove xfrmi interface ID from flowi. After this pach no generic code is touched anymore to do xfrm interface lookups. From Benedict Wong. 10) Allow to update the xfrm interface ID on SA updates. From Nathan Harold. 11) Don't pass zero to ERR_PTR() in xfrm_resolve_and_create_bundle. From YueHaibing. 12) Return more detailed errors on xfrm interface creation. From Benedict Wong. 13) Use PTR_ERR_OR_ZERO instead of IS_ERR + PTR_ERR. From the kbuild test robot. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-27xdp: add NULL pointer check in __xdp_return()Taehee Yoo1-1/+2
rhashtable_lookup() can return NULL. so that NULL pointer check routine should be added. Fixes: 02b55e5657c3 ("xdp: add MEM_TYPE_ZERO_COPY") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-26net: rollback orig value on failure of dev_qdisc_change_tx_queue_lenTariq Toukan1-7/+10
Fix dev_change_tx_queue_len so it rolls back original value upon a failure in dev_qdisc_change_tx_queue_len. This is already done for notifirers' failures, share the code. In case of failure in dev_qdisc_change_tx_queue_len, some tx queues would still be of the new length, while they should be reverted. Currently, the revert is not done, and is marked with a TODO label in dev_qdisc_change_tx_queue_len, and should find some nice solution to do it. Yet it is still better to not apply the newly requested value. Fixes: 48bfd55e7e41 ("net_sched: plug in qdisc ops change_tx_queue_len") Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com> Reported-by: Ran Rozenstein <ranro@mellanox.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-24Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/netDavid S. Miller3-9/+9
2018-07-24net: remove blank lines at end of fileStephen Hemminger1-1/+0
Several files have extra line at end of file. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-24net: remove redundant input checks in SIOCSIFTXQLEN case of dev_ifsiocTariq Toukan1-6/+1
The cited patch added a call to dev_change_tx_queue_len in SIOCSIFTXQLEN case. This obsoletes the new len comparison check done before the function call. Remove it here. For the desicion of keep/remove the negative value check, we examine the range check in dev_change_tx_queue_len. On 64-bit we will fail with -ERANGE. The 32-bit int ifr_qlen will be sign extended to 64-bits when it is passed into dev_change_tx_queue_len(). And then for negative values this test triggers: if (new_len != (unsigned int)new_len) return -ERANGE; because: if (0xffffffffWHATEVER != 0x00000000WHATEVER) On 32-bit the signed value will be accepted, changing behavior. Therefore, the negative value check is kept. Fixes: 3f76df198288 ("net: use dev_change_tx_queue_len() for SIOCSIFTXQLEN") Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-23sock: fix sg page frag coalescing in sk_alloc_sgDaniel Borkmann1-3/+3
Current sg coalescing logic in sk_alloc_sg() (latter is used by tls and sockmap) is not quite correct in that we do fetch the previous sg entry, however the subsequent check whether the refilled page frag from the socket is still the same as from the last entry with prior offset and length matching the start of the current buffer is comparing always the first sg list entry instead of the prior one. Fixes: 3c4d7559159b ("tls: kernel TLS support") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Dave Watson <davejwatson@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-22rtnetlink: add rtnl_link_state check in rtnl_configure_linkRoopa Prabhu1-3/+6
rtnl_configure_link sets dev->rtnl_link_state to RTNL_LINK_INITIALIZED and unconditionally calls __dev_notify_flags to notify user-space of dev flags. current call sequence for rtnl_configure_link rtnetlink_newlink rtnl_link_ops->newlink rtnl_configure_link (unconditionally notifies userspace of default and new dev flags) If a newlink handler wants to call rtnl_configure_link early, we will end up with duplicate notifications to user-space. This patch fixes rtnl_configure_link to check rtnl_link_state and call __dev_notify_flags with gchanges = 0 if already RTNL_LINK_INITIALIZED. Later in the series, this patch will help the following sequence where a driver implementing newlink can call rtnl_configure_link to initialize the link early. makes the following call sequence work: rtnetlink_newlink rtnl_link_ops->newlink (vxlan) -> rtnl_configure_link (initializes link and notifies user-space of default dev flags) rtnl_configure_link (updates dev flags if requested by user ifm and notifies user-space of new dev flags) Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-21net: skb_segment() should not return NULLEric Dumazet1-5/+5
syzbot caught a NULL deref [1], caused by skb_segment() skb_segment() has many "goto err;" that assume the @err variable contains -ENOMEM. A successful call to __skb_linearize() should not clear @err, otherwise a subsequent memory allocation error could return NULL. While we are at it, we might use -EINVAL instead of -ENOMEM when MAX_SKB_FRAGS limit is reached. [1] kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] SMP KASAN CPU: 0 PID: 13285 Comm: syz-executor3 Not tainted 4.18.0-rc4+ #146 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:tcp_gso_segment+0x3dc/0x1780 net/ipv4/tcp_offload.c:106 Code: f0 ff ff 0f 87 1c fd ff ff e8 00 88 0b fb 48 8b 75 d0 48 b9 00 00 00 00 00 fc ff df 48 8d be 90 00 00 00 48 89 f8 48 c1 e8 03 <0f> b6 14 08 48 8d 86 94 00 00 00 48 89 c6 83 e0 07 48 c1 ee 03 0f RSP: 0018:ffff88019b7fd060 EFLAGS: 00010206 RAX: 0000000000000012 RBX: 0000000000000020 RCX: dffffc0000000000 RDX: 0000000000040000 RSI: 0000000000000000 RDI: 0000000000000090 RBP: ffff88019b7fd0f0 R08: ffff88019510e0c0 R09: ffffed003b5c46d6 R10: ffffed003b5c46d6 R11: ffff8801dae236b3 R12: 0000000000000001 R13: ffff8801d6c581f4 R14: 0000000000000000 R15: ffff8801d6c58128 FS: 00007fcae64d6700(0000) GS:ffff8801dae00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000004e8664 CR3: 00000001b669b000 CR4: 00000000001406f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: tcp4_gso_segment+0x1c3/0x440 net/ipv4/tcp_offload.c:54 inet_gso_segment+0x64e/0x12d0 net/ipv4/af_inet.c:1342 inet_gso_segment+0x64e/0x12d0 net/ipv4/af_inet.c:1342 skb_mac_gso_segment+0x3b5/0x740 net/core/dev.c:2792 __skb_gso_segment+0x3c3/0x880 net/core/dev.c:2865 skb_gso_segment include/linux/netdevice.h:4099 [inline] validate_xmit_skb+0x640/0xf30 net/core/dev.c:3104 __dev_queue_xmit+0xc14/0x3910 net/core/dev.c:3561 dev_queue_xmit+0x17/0x20 net/core/dev.c:3602 neigh_hh_output include/net/neighbour.h:473 [inline] neigh_output include/net/neighbour.h:481 [inline] ip_finish_output2+0x1063/0x1860 net/ipv4/ip_output.c:229 ip_finish_output+0x841/0xfa0 net/ipv4/ip_output.c:317 NF_HOOK_COND include/linux/netfilter.h:276 [inline] ip_output+0x223/0x880 net/ipv4/ip_output.c:405 dst_output include/net/dst.h:444 [inline] ip_local_out+0xc5/0x1b0 net/ipv4/ip_output.c:124 iptunnel_xmit+0x567/0x850 net/ipv4/ip_tunnel_core.c:91 ip_tunnel_xmit+0x1598/0x3af1 net/ipv4/ip_tunnel.c:778 ipip_tunnel_xmit+0x264/0x2c0 net/ipv4/ipip.c:308 __netdev_start_xmit include/linux/netdevice.h:4148 [inline] netdev_start_xmit include/linux/netdevice.h:4157 [inline] xmit_one net/core/dev.c:3034 [inline] dev_hard_start_xmit+0x26c/0xc30 net/core/dev.c:3050 __dev_queue_xmit+0x29ef/0x3910 net/core/dev.c:3569 dev_queue_xmit+0x17/0x20 net/core/dev.c:3602 neigh_direct_output+0x15/0x20 net/core/neighbour.c:1403 neigh_output include/net/neighbour.h:483 [inline] ip_finish_output2+0xa67/0x1860 net/ipv4/ip_output.c:229 ip_finish_output+0x841/0xfa0 net/ipv4/ip_output.c:317 NF_HOOK_COND include/linux/netfilter.h:276 [inline] ip_output+0x223/0x880 net/ipv4/ip_output.c:405 dst_output include/net/dst.h:444 [inline] ip_local_out+0xc5/0x1b0 net/ipv4/ip_output.c:124 ip_queue_xmit+0x9df/0x1f80 net/ipv4/ip_output.c:504 tcp_transmit_skb+0x1bf9/0x3f10 net/ipv4/tcp_output.c:1168 tcp_write_xmit+0x1641/0x5c20 net/ipv4/tcp_output.c:2363 __tcp_push_pending_frames+0xb2/0x290 net/ipv4/tcp_output.c:2536 tcp_push+0x638/0x8c0 net/ipv4/tcp.c:735 tcp_sendmsg_locked+0x2ec5/0x3f00 net/ipv4/tcp.c:1410 tcp_sendmsg+0x2f/0x50 net/ipv4/tcp.c:1447 inet_sendmsg+0x1a1/0x690 net/ipv4/af_inet.c:798 sock_sendmsg_nosec net/socket.c:641 [inline] sock_sendmsg+0xd5/0x120 net/socket.c:651 __sys_sendto+0x3d7/0x670 net/socket.c:1797 __do_sys_sendto net/socket.c:1809 [inline] __se_sys_sendto net/socket.c:1805 [inline] __x64_sys_sendto+0xe1/0x1a0 net/socket.c:1805 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x455ab9 Code: 1d ba fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb b9 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007fcae64d5c68 EFLAGS: 00000246 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 00007fcae64d66d4 RCX: 0000000000455ab9 RDX: 0000000000000001 RSI: 0000000020000200 RDI: 0000000000000013 RBP: 000000000072bea0 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000014 R13: 00000000004c1145 R14: 00000000004d1818 R15: 0000000000000006 Modules linked in: Dumping ftrace buffer: (ftrace buffer empty) Fixes: ddff00d42043 ("net: Move skb_has_shared_frag check out of GRE code and into segmentation") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Alexander Duyck <alexander.h.duyck@intel.com> Reported-by: syzbot <syzkaller@googlegroups.com> Acked-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-20net: create reusable function for getting ownership info of sysfs inodesTyler Hicks2-18/+28
Make net_ns_get_ownership() reusable by networking code outside of core. This is useful, for example, to allow bridge related sysfs files to be owned by container root. Add a function comment since this is a potentially dangerous function to use given the way that kobject_get_ownership() works by initializing uid and gid before calling .get_ownership(). Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-20net-sysfs: make sure objects belong to container's ownerDmitry Torokhov1-1/+46
When creating various objects in /sys/class/net/... make sure that they belong to container's owner instead of global root (if they belong to a container/namespace). Co-Developed-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-20net-sysfs: require net admin in the init ns for setting tx_maxrateTyler Hicks1-0/+3
An upcoming change will allow container root to open some /sys/class/net files for writing. The tx_maxrate attribute can result in changes to actual hardware devices so err on the side of caution by requiring CAP_NET_ADMIN in the init namespace in the corresponding attribute store operation. Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-20net: Init backlog NAPI's gro_hash.David S. Miller1-6/+12
Based upon a patch by Sean Tranchetti. Fixes: d4546c2509b1 ("net: Convert GRO SKB handling to list_head.") Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-20Merge ra.kernel.org:/pub/scm/linux/kernel/git/torvalds/linuxDavid S. Miller3-30/+136
All conflicts were trivial overlapping changes, so reasonably easy to resolve. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-19flow_dissector: Dissect tos and ttl from the tunnel infoOr Gerlitz1-1/+13
Add dissection of the tos and ttl from the ip tunnel headers fields in case a match is needed on them. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-19net/page_pool: Fix inconsistent lock state warningTariq Toukan1-1/+1
Fix the warning below by calling the ptr_ring_consume_bh, which uses spin_[un]lock_bh. [ 179.064300] ================================ [ 179.069073] WARNING: inconsistent lock state [ 179.073846] 4.18.0-rc2+ #18 Not tainted [ 179.078133] -------------------------------- [ 179.082907] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage. [ 179.089637] swapper/21/0 [HC0[0]:SC1[1]:HE1:SE0] takes: [ 179.095478] 00000000963d1995 (&(&r->consumer_lock)->rlock){+.?.}, at: __page_pool_empty_ring+0x61/0x100 [ 179.105988] {SOFTIRQ-ON-W} state was registered at: [ 179.111443] _raw_spin_lock+0x35/0x50 [ 179.115634] __page_pool_empty_ring+0x61/0x100 [ 179.120699] page_pool_destroy+0x32/0x50 [ 179.125204] mlx5e_free_rq+0x38/0xc0 [mlx5_core] [ 179.130471] mlx5e_close_channel+0x20/0x120 [mlx5_core] [ 179.136418] mlx5e_close_channels+0x26/0x40 [mlx5_core] [ 179.142364] mlx5e_close_locked+0x44/0x50 [mlx5_core] [ 179.148509] mlx5e_close+0x42/0x60 [mlx5_core] [ 179.153936] __dev_close_many+0xb1/0x120 [ 179.158749] dev_close_many+0xa2/0x170 [ 179.163364] rollback_registered_many+0x148/0x460 [ 179.169047] rollback_registered+0x56/0x90 [ 179.174043] unregister_netdevice_queue+0x7e/0x100 [ 179.179816] unregister_netdev+0x18/0x20 [ 179.184623] mlx5e_remove+0x2a/0x50 [mlx5_core] [ 179.190107] mlx5_remove_device+0xe5/0x110 [mlx5_core] [ 179.196274] mlx5_unregister_interface+0x39/0x90 [mlx5_core] [ 179.203028] cleanup+0x5/0xbfc [mlx5_core] [ 179.208031] __x64_sys_delete_module+0x16b/0x240 [ 179.213640] do_syscall_64+0x5a/0x210 [ 179.218151] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 179.224218] irq event stamp: 334398 [ 179.228438] hardirqs last enabled at (334398): [<ffffffffa511d8b7>] rcu_process_callbacks+0x1c7/0x790 [ 179.239178] hardirqs last disabled at (334397): [<ffffffffa511d872>] rcu_process_callbacks+0x182/0x790 [ 179.249931] softirqs last enabled at (334386): [<ffffffffa509732e>] irq_enter+0x5e/0x70 [ 179.259306] softirqs last disabled at (334387): [<ffffffffa509741c>] irq_exit+0xdc/0xf0 [ 179.268584] [ 179.268584] other info that might help us debug this: [ 179.276572] Possible unsafe locking scenario: [ 179.276572] [ 179.283877] CPU0 [ 179.286954] ---- [ 179.290033] lock(&(&r->consumer_lock)->rlock); [ 179.295546] <Interrupt> [ 179.298830] lock(&(&r->consumer_lock)->rlock); [ 179.304550] [ 179.304550] *** DEADLOCK *** Fixes: ff7d6b27f894 ("page_pool: refurbish version of page_pool code") Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-18pktgen: convert safe uses of strncpy() to strcpy() to avoid string truncation warningJakub Kicinski1-6/+4
GCC 8 complains: net/core/pktgen.c: In function ‘pktgen_if_write’: net/core/pktgen.c:1419:4: warning: ‘strncpy’ output may be truncated copying between 0 and 31 bytes from a string of length 127 [-Wstringop-truncation] strncpy(pkt_dev->src_max, buf, len); ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ net/core/pktgen.c:1399:4: warning: ‘strncpy’ output may be truncated copying between 0 and 31 bytes from a string of length 127 [-Wstringop-truncation] strncpy(pkt_dev->src_min, buf, len); ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ net/core/pktgen.c:1290:4: warning: ‘strncpy’ output may be truncated copying between 0 and 31 bytes from a string of length 127 [-Wstringop-truncation] strncpy(pkt_dev->dst_max, buf, len); ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ net/core/pktgen.c:1268:4: warning: ‘strncpy’ output may be truncated copying between 0 and 31 bytes from a string of length 127 [-Wstringop-truncation] strncpy(pkt_dev->dst_min, buf, len); ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ There is no bug here, but the code is not perfect either. It copies sizeof(pkt_dev->/member/) - 1 from user space into buf, and then does a strcmp(pkt_dev->/member/, buf) hence assuming buf will be null-terminated and shorter than pkt_dev->/member/ (pkt_dev->/member/ is never explicitly null-terminated, and strncpy() doesn't have to null-terminate so the assumption must be on buf). The use of strncpy() without explicit null-termination looks suspicious. Convert to use straight strcpy(). strncpy() would also null-pad the output, but that's clearly unnecessary since the author calls memset(pkt_dev->/member/, 0, sizeof(..)); prior to strncpy(), anyway. While at it format the code for "dst_min", "dst_max", "src_min" and "src_max" in the same way by removing extra new lines in one case. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Jiong Wang <jiong.wang@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-18net: Move skb decrypted field, avoid explicity copyStefano Brivio1-6/+0
Commit 784abe24c903 ("net: Add decrypted field to skb") introduced a 'decrypted' field that is explicitly copied on skb copy and clone. Move it between headers_start[0] and headers_end[0], so that we don't need to copy it explicitly as it's copied by the memcpy() in __copy_skb_header(). While at it, drop the assignment in __skb_clone(), it was already redundant. This doesn't change the size of sk_buff or cacheline boundaries. The 15-bits hole before tc_index becomes a 14-bits hole, and will be again a 15-bits hole when this change is merged with commit 8b7008620b84 ("net: Don't copy pfmemalloc flag in __copy_skb_header()"). v2: as reported by kbuild test robot (oops, I forgot to build with CONFIG_TLS_DEVICE it seems), we can't use CHECK_SKB_FIELD() on a bit-field member. Just drop the check for the moment being, perhaps we could think of some magic to also check bit-field members one day. Fixes: 784abe24c903 ("net: Add decrypted field to skb") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-18xdp: fix uninitialized 'err' variableJakub Kicinski1-6/+9
Smatch caught an uninitialized variable error which GCC seems to miss. Fixes: a25717d2b604 ("xdp: support simultaneous driver and hw XDP attachment") Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16net: Fix GRO_HASH_BUCKETS assertion.David S. Miller1-1/+1
FIELD_SIZEOF() is in bytes, but we want bits. Fixes: d9f37d01e294 ("net: convert gro_count to bitmask") Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16net: convert gro_count to bitmaskLi RongQing1-12/+24
gro_hash size is 192 bytes, and uses 3 cache lines, if there is few flows, gro_hash may be not fully used, so it is unnecessary to iterate all gro_hash in napi_gro_flush(), to occupy unnecessary cacheline. convert gro_count to a bitmask, and rename it as gro_bitmask, each bit represents a element of gro_hash, only flush a gro_hash element if the related bit is set, to speed up napi_gro_flush(). and update gro_bitmask only if it will be changed, to reduce cache update Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Li RongQing <lirongqing@baidu.com> Cc: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16net: Add TLS RX offload featureIlya Lesokhin1-0/+1
This patch adds a netdev feature to configure TLS RX inline crypto offload. Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com> Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16net: Add decrypted field to skbBoris Pismenny1-0/+6
The decrypted bit is propogated to cloned/copied skbs. This will be used later by the inline crypto receive side offload of tls. Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller3-37/+116
Daniel Borkmann says: ==================== pull-request: bpf-next 2018-07-15 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) Various different arm32 JIT improvements in order to optimize code emission and make the JIT code itself more robust, from Russell. 2) Support simultaneous driver and offloaded XDP in order to allow for advanced use-cases where some work is offloaded to the NIC and some to the host. Also add ability for bpftool to load programs and maps beyond just the cgroup case, from Jakub. 3) Add BPF JIT support in nfp for multiplication as well as division. For the latter in particular, it uses the reciprocal algorithm to emulate it, from Jiong. 4) Add BTF pretty print functionality to bpftool in plain and JSON output format, from Okash. 5) Add build and installation to the BPF helper man page into bpftool, from Quentin. 6) Add a TCP BPF callback for listening sockets which is triggered right after the socket transitions to TCP_LISTEN state, from Andrey. 7) Add a new cgroup tree command to bpftool which iterates over the whole cgroup tree and prints all attached programs, from Roman. 8) Improve xdp_redirect_cpu sample to support parsing of double VLAN tagged packets, from Jesper. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller1-18/+21
Daniel Borkmann says: ==================== pull-request: bpf 2018-07-13 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) Fix AF_XDP TX error reporting before final kernel release such that it becomes consistent between copy mode and zero-copy, from Magnus. 2) Fix three different syzkaller reported issues: oob due to ld_abs rewrite with too large offset, another oob in l3 based skb test run and a bug leaving mangled prog in subprog JITing error path, from Daniel. 3) Fix BTF handling for bitfield extraction on big endian, from Okash. 4) Fix a missing linux/errno.h include in cgroup/BPF found by kbuild bot, from Roman. 5) Fix xdp2skb_meta.sh sample by using just command names instead of absolute paths for tc and ip and allow them to be redefined, from Taeung. 6) Fix availability probing for BPF seg6 helpers before final kernel ships so they can be detected at prog load time, from Mathieu. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-13skbuff: Unconditionally copy pfmemalloc in __skb_clone()Stefano Brivio1-2/+1
Commit 8b7008620b84 ("net: Don't copy pfmemalloc flag in __copy_skb_header()") introduced a different handling for the pfmemalloc flag in copy and clone paths. In __skb_clone(), now, the flag is set only if it was set in the original skb, but not cleared if it wasn't. This is wrong and might lead to socket buffers being flagged with pfmemalloc even if the skb data wasn't allocated from pfmemalloc reserves. Copy the flag instead of ORing it. Reported-by: Sabrina Dubroca <sd@queasysnail.net> Fixes: 8b7008620b84 ("net: Don't copy pfmemalloc flag in __copy_skb_header()") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Tested-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-13xdp: support simultaneous driver and hw XDP attachmentJakub Kicinski2-59/+79
Split the query of HW-attached program from the software one. Introduce new .ndo_bpf command to query HW-attached program. This will allow drivers to install different programs in HW and SW at the same time. Netlink can now also carry multiple programs on dump (in which case mode will be set to XDP_ATTACHED_MULTI and user has to check per-attachment point attributes, IFLA_XDP_PROG_ID will not be present). We reuse IFLA_XDP_PROG_ID skb space for second mode, so rtnl_xdp_size() doesn't need to be updated. Note that the installation side is still not there, since all drivers currently reject installing more than one program at the time. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-13xdp: factor out common program/flags handling from driversJakub Kicinski1-0/+34
Basic operations drivers perform during xdp setup and query can be moved to helpers in the core. Encapsulate program and flags into a structure and add helpers. Note that the structure is intended as the "main" program information source in the driver. Most drivers will additionally place the program pointer in their fast path or ring structures. The helpers don't have a huge impact now, but they will decrease the code duplication when programs can be installed in HW and driver at the same time. Encapsulating the basic operations in helpers will hopefully also reduce the number of changes to drivers which adopt them. Helpers could really be static inline, but they depend on definition of struct netdev_bpf which means they'd have to be placed in netdevice.h, an already 4500 line header. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-13xdp: don't make drivers report attachment modeJakub Kicinski2-6/+9
prog_attached of struct netdev_bpf should have been superseded by simply setting prog_id long time ago, but we kept it around to allow offloading drivers to communicate attachment mode (drv vs hw). Subsequently drivers were also allowed to report back attachment flags (prog_flags), and since nowadays only programs attached will XDP_FLAGS_HW_MODE can get offloaded, we can tell the attachment mode from the flags driver reports. Remove prog_attached member. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-13xdp: add per mode attributes for attached programsJakub Kicinski1-4/+26
In preparation for support of simultaneous driver and hardware XDP support add per-mode attributes. The catch-all IFLA_XDP_PROG_ID will still be reported, but user space can now also access the program ID in a new IFLA_XDP_<mode>_PROG_ID attribute. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-12devlink: Add generic parameters region_snapshotAlex Vesker1-0/+5
region_snapshot - When set enables capturing region snapshots Signed-off-by: Alex Vesker <valex@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12devlink: Add support for region snapshot read commandAlex Vesker1-0/+182
Add support for DEVLINK_CMD_REGION_READ_GET used for both reading and dumping region data. Read allows reading from a region specific address for given length. Dump allows reading the full region. If only snapshot ID is provided a snapshot dump will be done. If snapshot ID, Address and Length are provided a snapshot read will done. This is used for both snapshot access and will be used in the same way to access current data on the region. Signed-off-by: Alex Vesker <valex@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12devlink: Add support for region snapshot delete commandAlex Vesker1-0/+93
Add support for DEVLINK_CMD_REGION_DEL used for deleting a snapshot from a region. The snapshot ID is required. Also added notification support for NEW and DEL of snapshots. Signed-off-by: Alex Vesker <valex@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12devlink: Extend the support querying for region snapshot IDsAlex Vesker1-0/+53
Extend the support for DEVLINK_CMD_REGION_GET command to also return the IDs of the snapshot currently present on the region. Each reply will include a nested snapshots attribute that can contain multiple snapshot attributes each with an ID. Signed-off-by: Alex Vesker <valex@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12devlink: Add support for region get commandAlex Vesker1-0/+114
Add support for DEVLINK_CMD_REGION_GET command which is used for querying for the supported DEV/REGION values of devlink devices. The support is both for doit and dumpit. Reply includes: BUS_NAME, DEVICE_NAME, REGION_NAME, REGION_SIZE Signed-off-by: Alex Vesker <valex@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>