aboutsummaryrefslogtreecommitdiffstats
path: root/drivers/net/tun.c (follow)
AgeCommit message (Collapse)AuthorFilesLines
2018-10-15tun: Consistently configure generic netdev params via rtnetlinkSerhey Popovych1-0/+2
Configuring generic network device parameters on tun will fail in presence of IFLA_INFO_KIND attribute in IFLA_LINKINFO nested attribute since tun_validate() always return failure. This can be visualized with following ip-link(8) command sequences: # ip link set dev tun0 group 100 # ip link set dev tun0 group 100 type tun RTNETLINK answers: Invalid argument with contrast to dummy and veth drivers: # ip link set dev dummy0 group 100 # ip link set dev dummy0 type dummy # ip link set dev veth0 group 100 # ip link set dev veth0 group 100 type veth Fix by returning zero in tun_validate() when @data is NULL that is always in case since rtnl_link_ops->maxtype is zero in tun driver. Fixes: f019a7a594d9 ("tun: Implement ip link del tunXXX") Signed-off-by: Serhey Popovych <serhe.popovych@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-10net: tun: remove useless codes of tun_automq_select_queueWang Li1-22/+13
Because the function __skb_get_hash_symmetric always returns non-zero. Signed-off-by: Zhang Yu <zhangyu31@baidu.com> Signed-off-by: Wang Li <wangli39@baidu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-16/+21
Minor conflict in net/core/rtnetlink.c, David Ahern's bug fix in 'net' overlapped the renaming of a netlink attribute in net-next. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-01tun: napi flags belong to tfileEric Dumazet1-9/+14
Since tun->flags might be shared by multiple tfile structures, it is better to make sure tun_get_user() is using the flags for the current tfile. Presence of the READ_ONCE() in tun_napi_frags_enabled() gave a hint of what could happen, but we need something stronger to please syzbot. kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 13647 Comm: syz-executor5 Not tainted 4.19.0-rc5+ #59 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:dev_gro_receive+0x132/0x2720 net/core/dev.c:5427 Code: 48 c1 ea 03 80 3c 02 00 0f 85 6e 20 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8b 6e 10 49 8d bd d0 00 00 00 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 59 20 00 00 4d 8b a5 d0 00 00 00 31 ff 41 81 e4 RSP: 0018:ffff8801c400f410 EFLAGS: 00010202 RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff8618d325 RDX: 000000000000001a RSI: ffffffff86189f97 RDI: 00000000000000d0 RBP: ffff8801c400f608 R08: ffff8801c8fb4300 R09: 0000000000000000 R10: ffffed0038801ed7 R11: 0000000000000003 R12: ffff8801d327d358 R13: 0000000000000000 R14: ffff8801c16dd8c0 R15: 0000000000000004 FS: 00007fe003615700(0000) GS:ffff8801dac00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fe1f3c43db8 CR3: 00000001bebb2000 CR4: 00000000001406f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: napi_gro_frags+0x3f4/0xc90 net/core/dev.c:5715 tun_get_user+0x31d5/0x42a0 drivers/net/tun.c:1922 tun_chr_write_iter+0xb9/0x154 drivers/net/tun.c:1967 call_write_iter include/linux/fs.h:1808 [inline] new_sync_write fs/read_write.c:474 [inline] __vfs_write+0x6b8/0x9f0 fs/read_write.c:487 vfs_write+0x1fc/0x560 fs/read_write.c:549 ksys_write+0x101/0x260 fs/read_write.c:598 __do_sys_write fs/read_write.c:610 [inline] __se_sys_write fs/read_write.c:607 [inline] __x64_sys_write+0x73/0xb0 fs/read_write.c:607 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x457579 Code: 1d b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007fe003614c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000457579 RDX: 0000000000000012 RSI: 0000000020000000 RDI: 000000000000000a RBP: 000000000072c040 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007fe0036156d4 R13: 00000000004c5574 R14: 00000000004d8e98 R15: 00000000ffffffff Modules linked in: RIP: 0010:dev_gro_receive+0x132/0x2720 net/core/dev.c:5427 Code: 48 c1 ea 03 80 3c 02 00 0f 85 6e 20 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8b 6e 10 49 8d bd d0 00 00 00 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 59 20 00 00 4d 8b a5 d0 00 00 00 31 ff 41 81 e4 RSP: 0018:ffff8801c400f410 EFLAGS: 00010202 RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff8618d325 RDX: 000000000000001a RSI: ffffffff86189f97 RDI: 00000000000000d0 RBP: ffff8801c400f608 R08: ffff8801c8fb4300 R09: 0000000000000000 R10: ffffed0038801ed7 R11: 0000000000000003 R12: ffff8801d327d358 R13: 0000000000000000 R14: ffff8801c16dd8c0 R15: 0000000000000004 FS: 00007fe003615700(0000) GS:ffff8801dac00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fe1f3c43db8 CR3: 00000001bebb2000 CR4: 00000000001406f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Fixes: 90e33d459407 ("tun: enable napi_gro_frags() for TUN/TAP driver") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-01tun: initialize napi_mutex unconditionallyEric Dumazet1-1/+1
This is the first part to fix following syzbot report : console output: https://syzkaller.appspot.com/x/log.txt?x=145378e6400000 kernel config: https://syzkaller.appspot.com/x/.config?x=443816db871edd66 dashboard link: https://syzkaller.appspot.com/bug?extid=e662df0ac1d753b57e80 Following patch is fixing the race condition, but it seems safer to initialize this mutex at tfile creation anyway. Fixes: 90e33d459407 ("tun: enable napi_gro_frags() for TUN/TAP driver") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot+e662df0ac1d753b57e80@syzkaller.appspotmail.com Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-01tun: remove unused parametersEric Dumazet1-6/+6
tun_napi_disable() and tun_napi_del() do not need a pointer to the tun_struct Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-25Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-43/+0
Version bump conflict in batman-adv, take what's in net-next. iavf conflict, adjustment of netdev_ops in net-next conflicting with poll controller method removal in net. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-23tun: remove ndo_poll_controllerEric Dumazet1-43/+0
As diagnosed by Song Liu, ndo_poll_controller() can be very dangerous on loaded hosts, since the cpu calling ndo_poll_controller() might steal all NAPI contexts (for all RX/TX queues of the NIC). This capture can last for unlimited amount of time, since one cpu is generally not able to drain all the queues under load. tun uses NAPI for TX completions, so we better let core networking stack call the napi->poll() to avoid the capture. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13tuntap: accept an array of XDP buffs through sendmsg()Jason Wang1-3/+114
This patch implement TUN_MSG_PTR msg_control type. This type allows the caller to pass an array of XDP buffs to tuntap through ptr field of the tun_msg_control. If an XDP program is attached, tuntap can run XDP program directly. If not, tuntap will build skb and do a fast receiving since part of the work has been done by vhost_net. This will avoid lots of indirect calls thus improves the icache utilization and allows to do XDP batched flushing when doing XDP redirection. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13tun: switch to new type of msg_controlJason Wang1-1/+5
This patch introduces to a new tun/tap specific msg_control: #define TUN_MSG_UBUF 1 #define TUN_MSG_PTR 2 struct tun_msg_ctl { int type; void *ptr; }; This allows us to pass different kinds of msg_control through sendmsg(). The first supported type is ubuf (TUN_MSG_UBUF) which will be used by the existed vhost_net zerocopy code. The second is XDP buff, which allows vhost_net to pass XDP buff to TUN. This could be used to implement accepting an array of XDP buffs from vhost_net in the following patches. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13tuntap: move XDP flushing out of tun_do_xdp()Jason Wang1-1/+2
This will allow adding batch flushing on top. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13tuntap: split out XDP logicJason Wang1-37/+51
This patch split out XDP logic into a single function. This make it to be reused by XDP batching path in the following patch. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13tuntap: tweak on the path of skb XDP case in tun_build_skb()Jason Wang1-15/+24
If we're sure not to go native XDP, there's no need for several things like bh and rcu stuffs. So this patch introduces a helper to build skb and hold page refcnt. When we found we will go through skb path, build skb directly. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13tuntap: simplify error handling in tun_build_skb()Jason Wang1-7/+3
There's no need to duplicate page get logic in each action. So this patch tries to get page and calculate the offset before processing XDP actions (except for XDP_DROP), and undo them when meet errors (we don't care the performance on errors). This will be used for factoring out XDP logic. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13tuntap: enable bh early during processing XDPJason Wang1-7/+3
This patch move the bh enabling a little bit earlier, this will be used for factoring out the core XDP logic of tuntap. Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13tuntap: switch to use XDP_PACKET_HEADROOMJason Wang1-2/+1
Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13net: sock: introduce SOCK_XDPJason Wang1-0/+19
This patch introduces a new sock flag - SOCK_XDP. This will be used for notifying the upper layer that XDP program is attached on the lower socket, and requires for extra headroom. TUN will be the first user. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-21Merge branch 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespaceLinus Torvalds1-1/+1
Pull core signal handling updates from Eric Biederman: "It was observed that a periodic timer in combination with a sufficiently expensive fork could prevent fork from every completing. This contains the changes to remove the need for that restart. This set of changes is split into several parts: - The first part makes PIDTYPE_TGID a proper pid type instead something only for very special cases. The part starts using PIDTYPE_TGID enough so that in __send_signal where signals are actually delivered we know if the signal is being sent to a a group of processes or just a single process. - With that prep work out of the way the logic in fork is modified so that fork logically makes signals received while it is running appear to be received after the fork completes" * 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (22 commits) signal: Don't send signals to tasks that don't exist signal: Don't restart fork when signals come in. fork: Have new threads join on-going signal group stops fork: Skip setting TIF_SIGPENDING in ptrace_init_task signal: Add calculate_sigpending() fork: Unconditionally exit if a fatal signal is pending fork: Move and describe why the code examines PIDNS_ADDING signal: Push pid type down into complete_signal. signal: Push pid type down into __send_signal signal: Push pid type down into send_signal signal: Pass pid type into do_send_sig_info signal: Pass pid type into send_sigio_to_task & send_sigurg_to_task signal: Pass pid type into group_send_sig_info signal: Pass pid and pid type into send_sigqueue posix-timers: Noralize good_sigevent signal: Use PIDTYPE_TGID to clearly store where file signals will be sent pid: Implement PIDTYPE_TGID pids: Move the pgrp and session pid pointers from task_struct to signal_struct kvm: Don't open code task_pid in kvm_vcpu_ioctl pids: Compute task_tgid using signal->leader_pid ...
2018-08-04tun: not use hardcoded mask valueLi RongQing1-1/+2
0x3ff in tun_hashfn is mask of TUN_NUM_FLOW_ENTRIES, instead of hardcode, define a macro to setup the relationship with TUN_NUM_FLOW_ENTRIES Signed-off-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-21signal: Use PIDTYPE_TGID to clearly store where file signals will be sentEric W. Biederman1-1/+1
When f_setown is called a pid and a pid type are stored. Replace the use of PIDTYPE_PID with PIDTYPE_TGID as PIDTYPE_TGID goes to the entire thread group. Replace the use of PIDTYPE_MAX with PIDTYPE_PID as PIDTYPE_PID now is only for a thread. Update the users of __f_setown to use PIDTYPE_TGID instead of PIDTYPE_PID. For now the code continues to capture task_pid (when task_tgid would really be appropriate), and iterate on PIDTYPE_PID (even when type == PIDTYPE_TGID) out of an abundance of caution to preserve existing behavior. Oleg Nesterov suggested using the test to ensure we use PIDTYPE_PID for tgid lookup also be used to avoid taking the tasklist lock. Suggested-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-07-20Merge ra.kernel.org:/pub/scm/linux/kernel/git/torvalds/linuxDavid S. Miller1-1/+1
All conflicts were trivial overlapping changes, so reasonably easy to resolve. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16tun: Fix use-after-free on XDP_TXToshiaki Makita1-1/+1
On XDP_TX we need to free up the frame only when tun_xdp_tx() returns a negative value. A positive value indicates that the packet is successfully enqueued to the ptr_ring, so freeing the page causes use-after-free. Fixes: 735fc4054b3a ("xdp: change ndo_xdp_xmit API to support bulking") Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Acked-by: Jason Wang <jasowang@redhat.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller1-1/+0
Daniel Borkmann says: ==================== pull-request: bpf-next 2018-07-15 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) Various different arm32 JIT improvements in order to optimize code emission and make the JIT code itself more robust, from Russell. 2) Support simultaneous driver and offloaded XDP in order to allow for advanced use-cases where some work is offloaded to the NIC and some to the host. Also add ability for bpftool to load programs and maps beyond just the cgroup case, from Jakub. 3) Add BPF JIT support in nfp for multiplication as well as division. For the latter in particular, it uses the reciprocal algorithm to emulate it, from Jiong. 4) Add BTF pretty print functionality to bpftool in plain and JSON output format, from Okash. 5) Add build and installation to the BPF helper man page into bpftool, from Quentin. 6) Add a TCP BPF callback for listening sockets which is triggered right after the socket transitions to TCP_LISTEN state, from Andrey. 7) Add a new cgroup tree command to bpftool which iterates over the whole cgroup tree and prints all attached programs, from Roman. 8) Improve xdp_redirect_cpu sample to support parsing of double VLAN tagged packets, from Jesper. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-13xdp: don't make drivers report attachment modeJakub Kicinski1-1/+0
prog_attached of struct netdev_bpf should have been superseded by simply setting prog_id long time ago, but we kept it around to allow offloading drivers to communicate attachment mode (drv vs hw). Subsequently drivers were also allowed to report back attachment flags (prog_flags), and since nowadays only programs attached will XDP_FLAGS_HW_MODE can get offloaded, we can tell the attachment mode from the flags driver reports. Remove prog_attached member. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-09net: allow ndo_select_queue to pass netdevAlexander Duyck1-1/+2
This patch makes it so that instead of passing a void pointer as the accel_priv we instead pass a net_device pointer as sb_dev. Making this change allows us to pass the subordinate device through to the fallback function eventually so that we can keep the actual code in the ndo_select_queue call as focused on possible on the exception cases. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-06-07net: in virtio_net_hdr only add VLAN_HLEN to csum_start if payload holds vlanWillem de Bruijn1-1/+2
Tun, tap, virtio, packet and uml vector all use struct virtio_net_hdr to communicate packet metadata to userspace. For skbuffs with vlan, the first two return the packet as it may have existed on the wire, inserting the VLAN tag in the user buffer. Then virtio_net_hdr.csum_start needs to be adjusted by VLAN_HLEN bytes. Commit f09e2249c4f5 ("macvtap: restore vlan header on user read") added this feature to macvtap. Commit 3ce9b20f1971 ("macvtap: Fix csum_start when VLAN tags are present") then fixed up csum_start. Virtio, packet and uml do not insert the vlan header in the user buffer. When introducing virtio_net_hdr_from_skb to deduplicate filling in the virtio_net_hdr, the variant from macvtap which adds VLAN_HLEN was applied uniformly, breaking csum offset for packets with vlan on virtio and packet. Make insertion of VLAN_HLEN optional. Convert the callers to pass it when needed. Fixes: e858fae2b0b8f4 ("virtio_net: use common code for virtio_net_hdr and skb GSO conversion") Fixes: 1276f24eeef2 ("packet: use common code for virtio_net_hdr and skb GSO conversion") Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller1-27/+17
Daniel Borkmann says: ==================== pull-request: bpf-next 2018-06-05 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) Add a new BPF hook for sendmsg similar to existing hooks for bind and connect: "This allows to override source IP (including the case when it's set via cmsg(3)) and destination IP:port for unconnected UDP (slow path). TCP and connected UDP (fast path) are not affected. This makes UDP support complete, that is, connected UDP is handled by connect hooks, unconnected by sendmsg ones.", from Andrey. 2) Rework of the AF_XDP API to allow extending it in future for type writer model if necessary. In this mode a memory window is passed to hardware and multiple frames might be filled into that window instead of just one that is the case in the current fixed frame-size model. With the new changes made this can be supported without having to add a new descriptor format. Also, core bits for the zero-copy support for AF_XDP have been merged as agreed upon, where i40e bits will be routed via Jeff later on. Various improvements to documentation and sample programs included as well, all from Björn and Magnus. 3) Given BPF's flexibility, a new program type has been added to implement infrared decoders. Quote: "The kernel IR decoders support the most widely used IR protocols, but there are many protocols which are not supported. [...] There is a 'long tail' of unsupported IR protocols, for which lircd is need to decode the IR. IR encoding is done in such a way that some simple circuit can decode it; therefore, BPF is ideal. [...] user-space can define a decoder in BPF, attach it to the rc device through the lirc chardev.", from Sean. 4) Several improvements and fixes to BPF core, among others, dumping map and prog IDs into fdinfo which is a straight forward way to correlate BPF objects used by applications, removing an indirect call and therefore retpoline in all map lookup/update/delete calls by invoking the callback directly for 64 bit archs, adding a new bpf_skb_cgroup_id() BPF helper for tc BPF programs to have an efficient way of looking up cgroup v2 id for policy or other use cases. Fixes to make sure we zero tunnel/xfrm state that hasn't been filled, to allow context access wrt pt_regs in 32 bit archs for tracing, and last but not least various test cases for fixes that landed in bpf earlier, from Daniel. 5) Get rid of the ndo_xdp_flush API and extend the ndo_xdp_xmit with a XDP_XMIT_FLUSH flag instead which allows to avoid one indirect call as flushing is now merged directly into ndo_xdp_xmit(), from Jesper. 6) Add a new bpf_get_current_cgroup_id() helper that can be used in tracing to retrieve the cgroup id from the current process in order to allow for e.g. aggregation of container-level events, from Yonghong. 7) Two follow-up fixes for BTF to reject invalid input values and related to that also two test cases for BPF kselftests, from Martin. 8) Various API improvements to the bpf_fib_lookup() helper, that is, dropping MPLS bits which are not fully hashed out yet, rejecting invalid helper flags, returning error for unsupported address families as well as renaming flowlabel to flowinfo, from David. 9) Various fixes and improvements to sockmap BPF kselftests in particular in proper error detection and data verification, from Prashant. 10) Two arm32 BPF JIT improvements. One is to fix imm range check with regards to whether immediate fits into 24 bits, and a naming cleanup to get functions related to rsh handling consistent to those handling lsh, from Wang. 11) Two compile warning fixes in BPF, one for BTF and a false positive to silent gcc in stack_map_get_build_id_offset(), from Arnd. 12) Add missing seg6.h header into tools include infrastructure in order to fix compilation of BPF kselftests, from Mathieu. 13) Several formatting cleanups in the BPF UAPI helper description that also fix an error during rst2man compilation, from Quentin. 14) Hide an unused variable in sk_msg_convert_ctx_access() when IPv6 is not built into the kernel, from Yue. 15) Remove a useless double assignment in dev_map_enqueue(), from Colin. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-05tun: remove ndo_xdp_flush call tun_xdp_flushJesper Dangaard Brouer1-22/+1
Remove the ndo_xdp_flush call implementation tun_xdp_flush as no callers of ndo_xdp_flush are left. The tun drivers XDP_TX implementation also used tun_xdp_flush (and tun_xdp_xmit). This is easily solved by passing the XDP_XMIT_FLUSH flag to tun_xdp_xmit in tun_xdp_tx. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-06-04Allow ethtool to change tun link settingsChas Williams1-2/+25
Let user space set whatever it would like to advertise for the tun interface. Preserve the existing defaults. Signed-off-by: Chas Williams <3chas3@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-03tun: implement flush flag for ndo_xdp_xmitJesper Dangaard Brouer1-6/+13
When passed the XDP_XMIT_FLUSH flag tun_xdp_xmit now performs the same kind of socket wake up as in tun_xdp_flush(). The wake up code from tun_xdp_flush is generalized and shared with tun_xdp_xmit. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-06-03xdp: add flags argument to ndo_xdp_xmit APIJesper Dangaard Brouer1-2/+6
This patch only change the API and reject any use of flags. This is an intermediate step that allows us to implement the flush flag operation later, for each individual driver in a separate patch. The plan is to implement flush operation via XDP_XMIT_FLUSH flag and then remove XDP_XMIT_FLAGS_NONE when done. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-06-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-6/+9
Filling in the padding slot in the bpf structure as a bug fix in 'ne' overlapped with actually using that padding area for something in 'net-next'. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-29tun: Fix NULL pointer dereference in XDP redirectToshiaki Makita1-6/+9
Calling XDP redirection requires bh disabled. Softirq can call another XDP function and redirection functions, then the percpu static variable ri->map can be overwritten to NULL. This is a generic XDP case called from tun. [ 3535.736058] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018 [ 3535.743974] PGD 0 P4D 0 [ 3535.746530] Oops: 0000 [#1] SMP PTI [ 3535.750049] Modules linked in: vhost_net vhost tap tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter sunrpc vfat fat ext4 mbcache jbd2 intel_rapl skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm ipmi_ssif irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc ses aesni_intel crypto_simd cryptd enclosure hpwdt hpilo glue_helper ipmi_si pcspkr wmi mei_me ioatdma mei ipmi_devintf shpchp dca ipmi_msghandler lpc_ich acpi_power_meter sch_fq_codel ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm smartpqi i40e crc32c_intel scsi_transport_sas tg3 i2c_core ptp pps_core [ 3535.813456] CPU: 5 PID: 1630 Comm: vhost-1614 Not tainted 4.17.0-rc4 #2 [ 3535.820127] Hardware name: HPE ProLiant DL360 Gen10/ProLiant DL360 Gen10, BIOS U32 11/14/2017 [ 3535.828732] RIP: 0010:__xdp_map_lookup_elem+0x5/0x30 [ 3535.833740] RSP: 0018:ffffb4bc47bf7c58 EFLAGS: 00010246 [ 3535.839009] RAX: ffff9fdfcfea1c40 RBX: 0000000000000000 RCX: ffff9fdf27fe3100 [ 3535.846205] RDX: ffff9fdfca769200 RSI: 0000000000000000 RDI: 0000000000000000 [ 3535.853402] RBP: ffffb4bc491d9000 R08: 00000000000045ad R09: 0000000000000ec0 [ 3535.860597] R10: 0000000000000001 R11: ffff9fdf26c3ce4e R12: ffff9fdf9e72c000 [ 3535.867794] R13: 0000000000000000 R14: fffffffffffffff2 R15: ffff9fdfc82cdd00 [ 3535.874990] FS: 0000000000000000(0000) GS:ffff9fdfcfe80000(0000) knlGS:0000000000000000 [ 3535.883152] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 3535.888948] CR2: 0000000000000018 CR3: 0000000bde724004 CR4: 00000000007626e0 [ 3535.896145] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 3535.903342] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 3535.910538] PKRU: 55555554 [ 3535.913267] Call Trace: [ 3535.915736] xdp_do_generic_redirect+0x7a/0x310 [ 3535.920310] do_xdp_generic.part.117+0x285/0x370 [ 3535.924970] tun_get_user+0x5b9/0x1260 [tun] [ 3535.929279] tun_sendmsg+0x52/0x70 [tun] [ 3535.933237] handle_tx+0x2ad/0x5f0 [vhost_net] [ 3535.937721] vhost_worker+0xa5/0x100 [vhost] [ 3535.942030] kthread+0xf5/0x130 [ 3535.945198] ? vhost_dev_ioctl+0x3b0/0x3b0 [vhost] [ 3535.950031] ? kthread_bind+0x10/0x10 [ 3535.953727] ret_from_fork+0x35/0x40 [ 3535.957334] Code: 0e 74 15 83 f8 10 75 05 e9 49 aa b3 ff f3 c3 0f 1f 80 00 00 00 00 f3 c3 e9 29 9d b3 ff 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 <8b> 47 18 83 f8 0e 74 0d 83 f8 10 75 05 e9 49 a9 b3 ff 31 c0 c3 [ 3535.976387] RIP: __xdp_map_lookup_elem+0x5/0x30 RSP: ffffb4bc47bf7c58 [ 3535.982883] CR2: 0000000000000018 [ 3535.987096] ---[ end trace 383b299dd1430240 ]--- [ 3536.131325] Kernel panic - not syncing: Fatal exception [ 3536.137484] Kernel Offset: 0x26a00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) [ 3536.281406] ---[ end Kernel panic - not syncing: Fatal exception ]--- And a kernel with generic case fixed still panics in tun driver XDP redirect, because it disabled only preemption, but not bh. [ 2055.128746] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018 [ 2055.136662] PGD 0 P4D 0 [ 2055.139219] Oops: 0000 [#1] SMP PTI [ 2055.142736] Modules linked in: vhost_net vhost tap tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter sunrpc vfat fat ext4 mbcache jbd2 intel_rapl skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc ses aesni_intel ipmi_ssif crypto_simd enclosure cryptd hpwdt glue_helper ioatdma hpilo wmi dca pcspkr ipmi_si acpi_power_meter ipmi_devintf shpchp mei_me ipmi_msghandler mei lpc_ich sch_fq_codel ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm i40e smartpqi tg3 scsi_transport_sas crc32c_intel i2c_core ptp pps_core [ 2055.206142] CPU: 6 PID: 1693 Comm: vhost-1683 Tainted: G W 4.17.0-rc5-fix-tun+ #1 [ 2055.215011] Hardware name: HPE ProLiant DL360 Gen10/ProLiant DL360 Gen10, BIOS U32 11/14/2017 [ 2055.223617] RIP: 0010:__xdp_map_lookup_elem+0x5/0x30 [ 2055.228624] RSP: 0018:ffff998b07607cc0 EFLAGS: 00010246 [ 2055.233892] RAX: ffff8dbd8e235700 RBX: ffff8dbd8ff21c40 RCX: 0000000000000004 [ 2055.241089] RDX: ffff998b097a9000 RSI: 0000000000000000 RDI: 0000000000000000 [ 2055.248286] RBP: 0000000000000000 R08: 00000000000065a8 R09: 0000000000005d80 [ 2055.255483] R10: 0000000000000040 R11: ffff8dbcf0100000 R12: ffff998b097a9000 [ 2055.262681] R13: ffff8dbd8c98c000 R14: 0000000000000000 R15: ffff998b07607d78 [ 2055.269879] FS: 0000000000000000(0000) GS:ffff8dbd8ff00000(0000) knlGS:0000000000000000 [ 2055.278039] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 2055.283834] CR2: 0000000000000018 CR3: 0000000c0c8cc005 CR4: 00000000007626e0 [ 2055.291030] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 2055.298227] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 2055.305424] PKRU: 55555554 [ 2055.308153] Call Trace: [ 2055.310624] xdp_do_redirect+0x7b/0x380 [ 2055.314499] tun_get_user+0x10fe/0x12a0 [tun] [ 2055.318895] tun_sendmsg+0x52/0x70 [tun] [ 2055.322852] handle_tx+0x2ad/0x5f0 [vhost_net] [ 2055.327337] vhost_worker+0xa5/0x100 [vhost] [ 2055.331646] kthread+0xf5/0x130 [ 2055.334813] ? vhost_dev_ioctl+0x3b0/0x3b0 [vhost] [ 2055.339646] ? kthread_bind+0x10/0x10 [ 2055.343343] ret_from_fork+0x35/0x40 [ 2055.346950] Code: 0e 74 15 83 f8 10 75 05 e9 e9 aa b3 ff f3 c3 0f 1f 80 00 00 00 00 f3 c3 e9 c9 9d b3 ff 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 <8b> 47 18 83 f8 0e 74 0d 83 f8 10 75 05 e9 e9 a9 b3 ff 31 c0 c3 [ 2055.366004] RIP: __xdp_map_lookup_elem+0x5/0x30 RSP: ffff998b07607cc0 [ 2055.372500] CR2: 0000000000000018 [ 2055.375856] ---[ end trace 2a2dcc5e9e174268 ]--- [ 2055.523626] Kernel panic - not syncing: Fatal exception [ 2055.529796] Kernel Offset: 0x2e000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) [ 2055.677539] ---[ end Kernel panic - not syncing: Fatal exception ]--- v2: - Removed preempt_disable/enable since local_bh_disable will prevent preemption as well, feedback from Jason Wang. Fixes: 761876c857cb ("tap: XDP support") Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-4/+15
Lots of easy overlapping changes in the confict resolutions here. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller1-13/+24
Alexei Starovoitov says: ==================== pull-request: bpf-next 2018-05-24 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) Björn Töpel cleans up AF_XDP (removes rebind, explicit cache alignment from uapi, etc). 2) David Ahern adds mtu checks to bpf_ipv{4,6}_fib_lookup() helpers. 3) Jesper Dangaard Brouer adds bulking support to ndo_xdp_xmit. 4) Jiong Wang adds support for indirect and arithmetic shifts to NFP 5) Martin KaFai Lau cleans up BTF uapi and makes the btf_header extensible. 6) Mathieu Xhonneux adds an End.BPF action to seg6local with BPF helpers allowing to edit/grow/shrink a SRH and apply on a packet generic SRv6 actions. 7) Sandipan Das adds support for bpf2bpf function calls in ppc64 JIT. 8) Yonghong Song adds BPF_TASK_FD_QUERY command for introspection of tracing events. 9) other misc fixes from Gustavo A. R. Silva, Sirio Balmelli, John Fastabend, and Magnus Karlsson ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-24xdp: change ndo_xdp_xmit API to support bulkingJesper Dangaard Brouer1-13/+24
This patch change the API for ndo_xdp_xmit to support bulking xdp_frames. When kernel is compiled with CONFIG_RETPOLINE, XDP sees a huge slowdown. Most of the slowdown is caused by DMA API indirect function calls, but also the net_device->ndo_xdp_xmit() call. Benchmarked patch with CONFIG_RETPOLINE, using xdp_redirect_map with single flow/core test (CPU E5-1650 v4 @ 3.60GHz), showed performance improved: for driver ixgbe: 6,042,682 pps -> 6,853,768 pps = +811,086 pps for driver i40e : 6,187,169 pps -> 6,724,519 pps = +537,350 pps With frames avail as a bulk inside the driver ndo_xdp_xmit call, further optimizations are possible, like bulk DMA-mapping for TX. Testing without CONFIG_RETPOLINE show the same performance for physical NIC drivers. The virtual NIC driver tun sees a huge performance boost, as it can avoid doing per frame producer locking, but instead amortize the locking cost over the bulk. V2: Fix compile errors reported by kbuild test robot <lkp@intel.com> V4: Isolated ndo, driver changes and callers. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-23tuntap: correctly set SOCKWQ_ASYNC_NOSPACEJason Wang1-4/+15
When link is down, writes to the device might fail with -EIO. Userspace needs an indication when the status is resolved. As a fix, tun_net_open() attempts to wake up writers - but that is only effective if SOCKWQ_ASYNC_NOSPACE has been set in the past. This is not the case of vhost_net which only poll for EPOLLOUT after it meets errors during sendmsg(). This patch fixes this by making sure SOCKWQ_ASYNC_NOSPACE is set when socket is not writable or device is down to guarantee EPOLLOUT will be raised in either tun_chr_poll() or tun_sock_write_space() after device is up. Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Eric Dumazet <edumazet@google.com> Fixes: 1bd4978a88ac2 ("tun: honor IFF_UP in tun_get_user()") Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-15/+12
S390 bpf_jit.S is removed in net-next and had changes in 'net', since that code isn't used any more take the removal. TLS data structures split the TX and RX components in 'net-next', put the new struct members from the bug fix in 'net' into the RX part. The 'net-next' tree had some reworking of how the ERSPAN code works in the GRE tunneling code, overlapping with a one-line headroom calculation fix in 'net'. Overlapping changes in __sock_map_ctx_update_elem(), keep the bits that read the prog members via READ_ONCE() into local variables before using them. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16tuntap: fix use after free during releaseJason Wang1-1/+1
After commit b196d88aba8a ("tun: fix use after free for ptr_ring") we need clean up tx ring during release(). But unfortunately, it tries to do the cleanup blindly after socket were destroyed which will lead another use-after-free. Fix this by doing the cleanup before dropping the last reference of the socket in __tun_detach(). Reported-by: Andrei Vagin <avagin@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Fixes: b196d88aba8a ("tun: fix use after free for ptr_ring") Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-14tun: fix use after free for ptr_ringJason Wang1-15/+12
We used to initialize ptr_ring during TUNSETIFF, this is because its size depends on the tx_queue_len of netdevice. And we try to clean it up when socket were detached from netdevice. A race were spotted when trying to do uninit during a read which will lead a use after free for pointer ring. Solving this by always initialize a zero size ptr_ring in open() and do resizing during TUNSETIFF, and then we can safely do cleanup during close(). With this, there's no need for the workaround that was introduced by commit 4df0bfc79904 ("tun: fix a memory leak for tfile->tx_array"). Reported-by: syzbot+e8b902c3c3fadf0a9dba@syzkaller.appspotmail.com Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Michael S. Tsirkin <mst@redhat.com> Fixes: 1576d9860599 ("tun: switch to use skb array for tx") Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-10tun: Do SIOCGSKNS out of rtnl_lock()Kirill Tkhai1-11/+7
Since net ns of tun device is assigned on the device creation, and it never changes, we do not need to use any lock to get it from alive tun. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-6/+1
Conflicts were simple overlapping changes in microchip driver. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller1-1/+2
Daniel Borkmann says: ==================== pull-request: bpf-next 2018-04-21 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) Initial work on BPF Type Format (BTF) is added, which is a meta data format which describes the data types of BPF programs / maps. BTF has its roots from CTF (Compact C-Type format) with a number of changes to it. First use case is to provide a generic pretty print capability for BPF maps inspection, later work will also add BTF to bpftool. pahole support to convert dwarf to BTF will be upstreamed as well (https://github.com/iamkafai/pahole/tree/btf), from Martin. 2) Add a new xdp_bpf_adjust_tail() BPF helper for XDP that allows for changing the data_end pointer. Only shrinking is currently supported which helps for crafting ICMP control messages. Minor changes in drivers have been added where needed so they recalc the packet's length also when data_end was adjusted, from Nikita. 3) Improve bpftool to make it easier to feed hex bytes via cmdline for map operations, from Quentin. 4) Add support for various missing BPF prog types and attach types that have been added to kernel recently but neither to bpftool nor libbpf yet. Doc and bash completion updates have been added as well for bpftool, from Andrey. 5) Proper fix for avoiding to leak info stored in frame data on page reuse for the two bpf_xdp_adjust_{head,meta} helpers by disallowing to move the pointers into struct xdp_frame area, from Jesper. 6) Follow-up compile fix from BTF in order to include stdbool.h in libbpf, from Björn. 7) Few fixes in BPF sample code, that is, a typo on the netdevice in a comment and fixup proper dump of XDP action code in the tracepoint exception, from Wang and Jesper. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-20tun: do not compute the rxhash, if not neededPaolo Abeni1-9/+6
Currently, the tun driver, in absence of an eBPF steering program, always compute the rxhash in its rx path, even when such value is later unused due to additional checks ( This changeset moves the all the related checks just before the __skb_get_hash_symmetric(), so that the latter is no more computed when unneeded. Also replace an unneeded RCU section with rcu_access_pointer(). Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-18bpf: make tun compatible w/ bpf_xdp_adjust_tailNikita V. Shirokov1-1/+2
w/ bpf_xdp_adjust_tail helper xdp's data_end pointer could be changed as well (only "decrease" of pointer's location is going to be supported). changing of this pointer will change packet's size. for tun driver we need to adjust XDP_PASS handling by recalculating length of the packet if it was passed to the TCP/IP stack (in case if after xdp's prog run data_end pointer was adjusted) Reviewed-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-04-18tun: fix vlan packet truncationBjørn Mork1-6/+1
Bogus trimming in tun_net_xmit() causes truncated vlan packets. skb->len is correct whether or not skb_vlan_tag_present() is true. There is no more reason to adjust the skb length on xmit in this driver than any other driver. tun_put_user() adds 4 bytes to the total for tagged packets because it transmits the tag inline to userspace. This is similar to a nic transmitting the tag inline on the wire. Reproducing the bug by sending any tagged packet through back-to-back connected tap interfaces: socat TUN,tun-type=tap,iff-up,tun-name=in TUN,tun-type=tap,iff-up,tun-name=out & ip link add link in name in.20 type vlan id 20 ip addr add 10.9.9.9/24 dev in.20 ip link set in.20 up tshark -nxxi in -f arp -c1 2>/dev/null & tshark -nxxi out -f arp -c1 2>/dev/null & ping -c 1 10.9.9.5 >/dev/null 2>&1 The output from the 'in' and 'out' interfaces are different when the bug is present: Capturing on 'in' 0000 ff ff ff ff ff ff 76 cf 76 37 d5 0a 81 00 00 14 ......v.v7...... 0010 08 06 00 01 08 00 06 04 00 01 76 cf 76 37 d5 0a ..........v.v7.. 0020 0a 09 09 09 00 00 00 00 00 00 0a 09 09 05 .............. Capturing on 'out' 0000 ff ff ff ff ff ff 76 cf 76 37 d5 0a 81 00 00 14 ......v.v7...... 0010 08 06 00 01 08 00 06 04 00 01 76 cf 76 37 d5 0a ..........v.v7.. 0020 0a 09 09 09 00 00 00 00 00 00 .......... Fixes: aff3d70a07ff ("tun: allow to attach ebpf socket filter") Cc: Jason Wang <jasowang@redhat.com> Signed-off-by: Bjørn Mork <bjorn@mork.no> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17xdp: transition into using xdp_frame for ndo_xdp_xmitJesper Dangaard Brouer1-7/+12
Changing API ndo_xdp_xmit to take a struct xdp_frame instead of struct xdp_buff. This brings xdp_return_frame and ndp_xdp_xmit in sync. This builds towards changing the API further to become a bulk API, because xdp_buff is not a queue-able object while xdp_frame is. V4: Adjust for commit 59655a5b6c83 ("tuntap: XDP_TX can use native XDP") V7: Adjust for commit d9314c474d4f ("i40e: add support for XDP_REDIRECT") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17xdp: transition into using xdp_frame for return APIJesper Dangaard Brouer1-2/+2
Changing API xdp_return_frame() to take struct xdp_frame as argument, seems like a natural choice. But there are some subtle performance details here that needs extra care, which is a deliberate choice. When de-referencing xdp_frame on a remote CPU during DMA-TX completion, result in the cache-line is change to "Shared" state. Later when the page is reused for RX, then this xdp_frame cache-line is written, which change the state to "Modified". This situation already happens (naturally) for, virtio_net, tun and cpumap as the xdp_frame pointer is the queued object. In tun and cpumap, the ptr_ring is used for efficiently transferring cache-lines (with pointers) between CPUs. Thus, the only option is to de-referencing xdp_frame. It is only the ixgbe driver that had an optimization, in which it can avoid doing the de-reference of xdp_frame. The driver already have TX-ring queue, which (in case of remote DMA-TX completion) have to be transferred between CPUs anyhow. In this data area, we stored a struct xdp_mem_info and a data pointer, which allowed us to avoid de-referencing xdp_frame. To compensate for this, a prefetchw is used for telling the cache coherency protocol about our access pattern. My benchmarks show that this prefetchw is enough to compensate the ixgbe driver. V7: Adjust for commit d9314c474d4f ("i40e: add support for XDP_REDIRECT") V8: Adjust for commit bd658dda4237 ("net/mlx5e: Separate dma base address and offset in dma_sync call") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17xdp: rhashtable with allocator ID to pointer mappingJesper Dangaard Brouer1-0/+6
Use the IDA infrastructure for getting a cyclic increasing ID number, that is used for keeping track of each registered allocator per RX-queue xdp_rxq_info. Instead of using the IDR infrastructure, which uses a radix tree, use a dynamic rhashtable, for creating ID to pointer lookup table, because this is faster. The problem that is being solved here is that, the xdp_rxq_info pointer (stored in xdp_buff) cannot be used directly, as the guaranteed lifetime is too short. The info is needed on a (potentially) remote CPU during DMA-TX completion time . In an xdp_frame the xdp_mem_info is stored, when it got converted from an xdp_buff, which is sufficient for the simple page refcnt based recycle schemes. For more advanced allocators there is a need to store a pointer to the registered allocator. Thus, there is a need to guard the lifetime or validity of the allocator pointer, which is done through this rhashtable ID map to pointer. The removal and validity of of the allocator and helper struct xdp_mem_allocator is guarded by RCU. The allocator will be created by the driver, and registered with xdp_rxq_info_reg_mem_model(). It is up-to debate who is responsible for freeing the allocator pointer or invoking the allocator destructor function. In any case, this must happen via RCU freeing. Use the IDA infrastructure for getting a cyclic increasing ID number, that is used for keeping track of each registered allocator per RX-queue xdp_rxq_info. V4: Per req of Jason Wang - Use xdp_rxq_info_reg_mem_model() in all drivers implementing XDP_REDIRECT, even-though it's not strictly necessary when allocator==NULL for type MEM_TYPE_PAGE_SHARED (given it's zero). V6: Per req of Alex Duyck - Introduce rhashtable_lookup() call in later patch V8: Address sparse should be static warnings (from kbuild test robot) Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17tun: convert to use generic xdp_frame and xdp_return_frame APIJesper Dangaard Brouer1-23/+20
The tuntap driver invented it's own driver specific way of queuing XDP packets, by storing the xdp_buff information in the top of the XDP frame data. Convert it over to use the more generic xdp_frame structure. The main problem with the in-driver method is that the xdp_rxq_info pointer cannot be trused/used when dequeueing the frame. V3: Remove check based on feedback from Jason Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>