aboutsummaryrefslogtreecommitdiffstatshomepage
path: root/include/linux/skbuff.h (follow)
AgeCommit message (Collapse)AuthorFilesLines
2020-06-02bpf: Fix up bpf_skb_adjust_room helper's skb csum settingDaniel Borkmann1-0/+8
Lorenz recently reported: In our TC classifier cls_redirect [0], we use the following sequence of helper calls to decapsulate a GUE (basically IP + UDP + custom header) encapsulated packet: bpf_skb_adjust_room(skb, -encap_len, BPF_ADJ_ROOM_MAC, BPF_F_ADJ_ROOM_FIXED_GSO) bpf_redirect(skb->ifindex, BPF_F_INGRESS) It seems like some checksums of the inner headers are not validated in this case. For example, a TCP SYN packet with invalid TCP checksum is still accepted by the network stack and elicits a SYN ACK. [...] That is, we receive the following packet from the driver: | ETH | IP | UDP | GUE | IP | TCP | skb->ip_summed == CHECKSUM_UNNECESSARY ip_summed is CHECKSUM_UNNECESSARY because our NICs do rx checksum offloading. On this packet we run skb_adjust_room_mac(-encap_len), and get the following: | ETH | IP | TCP | skb->ip_summed == CHECKSUM_UNNECESSARY Note that ip_summed is still CHECKSUM_UNNECESSARY. After bpf_redirect()'ing into the ingress, we end up in tcp_v4_rcv(). There, skb_checksum_init() is turned into a no-op due to CHECKSUM_UNNECESSARY. The bpf_skb_adjust_room() helper is not aware of protocol specifics. Internally, it handles the CHECKSUM_COMPLETE case via skb_postpull_rcsum(), but that does not cover CHECKSUM_UNNECESSARY. In this case skb->csum_level of the original skb prior to bpf_skb_adjust_room() call was 0, that is, covering UDP. Right now there is no way to adjust the skb->csum_level. NICs that have checksum offload disabled (CHECKSUM_NONE) or that support CHECKSUM_COMPLETE are not affected. Use a safe default for CHECKSUM_UNNECESSARY by resetting to CHECKSUM_NONE and add a flag to the helper called BPF_F_ADJ_ROOM_NO_CSUM_RESET that allows users from opting out. Opting out is useful for the case where we don't remove/add full protocol headers, or for the case where a user wants to adjust the csum level manually e.g. through bpf_csum_level() helper that is added in subsequent patch. The bpf_skb_proto_{4_to_6,6_to_4}() for NAT64/46 translation from the BPF bpf_skb_change_proto() helper uses bpf_skb_net_hdr_{push,pop}() pair internally as well but doesn't change layers, only transitions between v4 to v6 and vice versa, therefore no adoption is required there. [0] https://lore.kernel.org/bpf/20200424185556.7358-1-lmb@cloudflare.com/ Fixes: 2be7e212d541 ("bpf: add bpf_skb_adjust_room helper") Reported-by: Lorenz Bauer <lmb@cloudflare.com> Reported-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Alan Maguire <alan.maguire@oracle.com> Link: https://lore.kernel.org/bpf/CACAyw9-uU_52esMd1JjuA80fRPHJv5vsSg8GnfW3t_qDU4aVKQ@mail.gmail.com/ Link: https://lore.kernel.org/bpf/11a90472e7cce83e76ddbfce81fdfce7bfc68808.1591108731.git.daniel@iogearbox.net
2020-06-01net: Introduce netns_bpf for BPF programs attached to netnsJakub Sitnicki1-26/+0
In order to: (1) attach more than one BPF program type to netns, or (2) support attaching BPF programs to netns with bpf_link, or (3) support multi-prog attach points for netns we will need to keep more state per netns than a single pointer like we have now for BPF flow dissector program. Prepare for the above by extracting netns_bpf that is part of struct net, for storing all state related to BPF programs attached to netns. Turn flow dissector callbacks for querying/attaching/detaching a program into generic ones that operate on netns_bpf. Next patch will move the generic callbacks into their own module. This is similar to how it is organized for cgroup with cgroup_bpf. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Cc: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/bpf/20200531082846.2117903-3-jakub@cloudflare.com
2020-05-17net: allow __skb_ext_alloc to sleepFlorian Westphal1-1/+1
mptcp calls this from the transmit side, from process context. Allow a sleeping allocation instead of unconditional GFP_ATOMIC. Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-18skbuff.h: Replace zero-length array with flexible-array memberGustavo A. R. Silva1-1/+1
The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
2020-04-06skbuff.h: Improve the checksum related commentsDexuan Cui1-19/+19
Fixed the punctuation and some typos. Improved some sentences with minor changes. No change of semantics or code. Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Dexuan Cui <decui@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29net: Fix typo of SKB_SGO_CB_OFFSETCambda Zhu1-2/+2
The SKB_SGO_CB_OFFSET should be SKB_GSO_CB_OFFSET which means the offset of the GSO in skb cb. This patch fixes the typo. Fixes: 9207f9d45b0a ("net: preserve IP control block during GSO segmentation") Signed-off-by: Cambda Zhu <cambda@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller1-4/+32
Overlapping header include additions in macsec.c A bug fix in 'net' overlapping with the removal of 'version' string in ena_netdev.c Overlapping test additions in selftests Makefile Overlapping PCI ID table adjustments in iwlwifi driver. Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-25net: Fix CONFIG_NET_CLS_ACT=n and CONFIG_NFT_FWD_NETDEV={y, m} buildPablo Neira Ayuso1-4/+32
net/netfilter/nft_fwd_netdev.c: In function ‘nft_fwd_netdev_eval’: net/netfilter/nft_fwd_netdev.c:32:10: error: ‘struct sk_buff’ has no member named ‘tc_redirected’ pkt->skb->tc_redirected = 1; ^~ net/netfilter/nft_fwd_netdev.c:33:10: error: ‘struct sk_buff’ has no member named ‘tc_from_ingress’ pkt->skb->tc_from_ingress = 1; ^~ To avoid a direct dependency with tc actions from netfilter, wrap the redirect bits around CONFIG_NET_REDIRECT and move helpers to include/linux/skbuff.h. Turn on this toggle from the ifb driver, the only existing client of these bits in the tree. This patch adds skb_set_redirected() that sets on the redirected bit on the skbuff, it specifies if the packet was redirect from ingress and resets the timestamp (timestamp reset was originally missing in the netfilter bugfix). Fixes: bcfabee1afd99484 ("netfilter: nft_fwd_netdev: allow to redirect to ifb via ingress") Reported-by: noreply@ellerman.id.au Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-28net: datagram: drop 'destructor' argument from several helpersPaolo Abeni1-10/+2
The only users for such argument are the UDP protocol and the UNIX socket family. We can safely reclaim the accounted memory directly from the UDP code and, after the previous patch, we can do scm stats accounting outside the datagram helpers. Overall this cleans up a bit some datagram-related helpers, and avoids an indirect call per packet in the UDP receive path. v1 -> v2: - call scm_stat_del() only when not peeking - Kirill - fix build issue with CONFIG_INET_ESPINTCP Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-16skbuff.h: fix all kernel-doc warningsRandy Dunlap1-0/+30
Fix all kernel-doc warnings in <linux/skbuff.h>. Fixes these warnings: ../include/linux/skbuff.h:890: warning: Function parameter or member 'list' not described in 'sk_buff' ../include/linux/skbuff.h:890: warning: Function parameter or member 'dev_scratch' not described in 'sk_buff' ../include/linux/skbuff.h:890: warning: Function parameter or member 'ip_defrag_offset' not described in 'sk_buff' ../include/linux/skbuff.h:890: warning: Function parameter or member 'skb_mstamp_ns' not described in 'sk_buff' ../include/linux/skbuff.h:890: warning: Function parameter or member '__cloned_offset' not described in 'sk_buff' ../include/linux/skbuff.h:890: warning: Function parameter or member 'head_frag' not described in 'sk_buff' ../include/linux/skbuff.h:890: warning: Function parameter or member '__pkt_type_offset' not described in 'sk_buff' ../include/linux/skbuff.h:890: warning: Function parameter or member 'encapsulation' not described in 'sk_buff' ../include/linux/skbuff.h:890: warning: Function parameter or member 'encap_hdr_csum' not described in 'sk_buff' ../include/linux/skbuff.h:890: warning: Function parameter or member 'csum_valid' not described in 'sk_buff' ../include/linux/skbuff.h:890: warning: Function parameter or member '__pkt_vlan_present_offset' not described in 'sk_buff' ../include/linux/skbuff.h:890: warning: Function parameter or member 'vlan_present' not described in 'sk_buff' ../include/linux/skbuff.h:890: warning: Function parameter or member 'csum_complete_sw' not described in 'sk_buff' ../include/linux/skbuff.h:890: warning: Function parameter or member 'csum_level' not described in 'sk_buff' ../include/linux/skbuff.h:890: warning: Function parameter or member 'inner_protocol_type' not described in 'sk_buff' ../include/linux/skbuff.h:890: warning: Function parameter or member 'remcsum_offload' not described in 'sk_buff' ../include/linux/skbuff.h:890: warning: Function parameter or member 'sender_cpu' not described in 'sk_buff' ../include/linux/skbuff.h:890: warning: Function parameter or member 'reserved_tailroom' not described in 'sk_buff' ../include/linux/skbuff.h:890: warning: Function parameter or member 'inner_ipproto' not described in 'sk_buff' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-06skbuff: fix a data race in skb_queue_len()Qian Cai1-1/+13
sk_buff.qlen can be accessed concurrently as noticed by KCSAN, BUG: KCSAN: data-race in __skb_try_recv_from_queue / unix_dgram_sendmsg read to 0xffff8a1b1d8a81c0 of 4 bytes by task 5371 on cpu 96: unix_dgram_sendmsg+0x9a9/0xb70 include/linux/skbuff.h:1821 net/unix/af_unix.c:1761 ____sys_sendmsg+0x33e/0x370 ___sys_sendmsg+0xa6/0xf0 __sys_sendmsg+0x69/0xf0 __x64_sys_sendmsg+0x51/0x70 do_syscall_64+0x91/0xb47 entry_SYSCALL_64_after_hwframe+0x49/0xbe write to 0xffff8a1b1d8a81c0 of 4 bytes by task 1 on cpu 99: __skb_try_recv_from_queue+0x327/0x410 include/linux/skbuff.h:2029 __skb_try_recv_datagram+0xbe/0x220 unix_dgram_recvmsg+0xee/0x850 ____sys_recvmsg+0x1fb/0x210 ___sys_recvmsg+0xa2/0xf0 __sys_recvmsg+0x66/0xf0 __x64_sys_recvmsg+0x51/0x70 do_syscall_64+0x91/0xb47 entry_SYSCALL_64_after_hwframe+0x49/0xbe Since only the read is operating as lockless, it could introduce a logic bug in unix_recvq_full() due to the load tearing. Fix it by adding a lockless variant of skb_queue_len() and unix_recvq_full() where READ_ONCE() is on the read while WRITE_ONCE() is on the write similar to the commit d7d16a89350a ("net: add skb_queue_empty_lockless()"). Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-27net: Support GRO/GSO fraglist chaining.Steffen Klassert1-0/+2
This patch adds the core functions to chain/unchain GSO skbs at the frag_list pointer. This also adds a new GSO type SKB_GSO_FRAGLIST and a is_flist flag to napi_gro_cb which indicates that this flow will be GROed by fraglist chaining. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-27net: Add fraglist GRO/GSO feature flagsSteffen Klassert1-0/+2
This adds new Fraglist GRO/GSO feature flags. They will be used to configure fraglist GRO/GSO what will be implemented with some followup paches. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-21Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-nextDavid S. Miller1-3/+8
Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2020-01-21 1) Add support for TCP encapsulation of IKE and ESP messages, as defined by RFC 8229. Patchset from Sabrina Dubroca. Please note that there is a merge conflict in: net/unix/af_unix.c between commit: 3c32da19a858 ("unix: Show number of pending scm files of receive queue in fdinfo") from the net-next tree and commit: b50b0580d27b ("net: add queue argument to __skb_wait_for_more_packets and __skb_{,try_}recv_datagram") from the ipsec-next tree. The conflict can be solved as done in linux-next. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-14net: skbuff: disambiguate argument and member for skb_list_walk_safe helperJason A. Donenfeld1-3/+3
This worked before, because we made all callers name their next pointer "next". But in trying to be more "drop-in" ready, the silliness here is revealed. This commit fixes the problem by making the macro argument and the member use different names. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-09skb: add helpers to allocate ext independently from sk_buffPaolo Abeni1-0/+3
Currently we can allocate the extension only after the skb, this change allows the user to do the opposite, will simplify allocation failure handling from MPTCP. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-09mptcp: Add MPTCP to skb extensionsMat Martineau1-0/+3
Add enum value for MPTCP and update config dependencies v5 -> v6: - fixed '__unused' field size Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-08net: introduce skb_list_walk_safe for skb segment walkingJason A. Donenfeld1-0/+5
As part of the continual effort to remove direct usage of skb->next and skb->prev, this patch adds a helper for iterating through the singly-linked variant of skb lists, which are used for lists of GSO packet. The name "skb_list_..." has been chosen to match the existing function, "kfree_skb_list, which also operates on these singly-linked lists, and the "..._walk_safe" part is the same idiom as elsewhere in the kernel. This patch removes the helper from wireguard and puts it into linux/skbuff.h, while making it a bit more robust for general usage. In particular, parenthesis are added around the macro argument usage, and it now accounts for trying to iterate through an already-null skb pointer, which will simply run the iteration zero times. This latter enhancement means it can be used to replace both do { ... } while and while (...) open-coded idioms. This should take care of these three possible usages, which match all current methods of iterations. skb_list_walk_safe(segs, skb, next) { ... } skb_list_walk_safe(skb, skb, next) { ... } skb_list_walk_safe(segs, skb, segs) { ... } Gcc appears to generate efficient code for each of these. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-09net: add queue argument to __skb_wait_for_more_packets and __skb_{,try_}recv_datagramSabrina Dubroca1-3/+8
This will be used by ESP over TCP to handle the queue of IKE messages. Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-12-04net: Fixed updating of ethertype in skb_mpls_push()Martin Varghese1-1/+1
The skb_mpls_push was not updating ethertype of an ethernet packet if the packet was originally received from a non ARPHRD_ETHER device. In the below OVS data path flow, since the device corresponding to port 7 is an l3 device (ARPHRD_NONE) the skb_mpls_push function does not update the ethertype of the packet even though the previous push_eth action had added an ethernet header to the packet. recirc_id(0),in_port(7),eth_type(0x0800),ipv4(tos=0/0xfc,ttl=64,frag=no), actions:push_eth(src=00:00:00:00:00:00,dst=00:00:00:00:00:00), push_mpls(label=13,tc=0,ttl=64,bos=1,eth_type=0x8847),4 Fixes: 8822e270d697 ("net: core: move push MPLS functionality from OvS to core helper") Signed-off-by: Martin Varghese <martin.varghese@nokia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-02Fixed updating of ethertype in function skb_mpls_popMartin Varghese1-1/+2
The skb_mpls_pop was not updating ethertype of an ethernet packet if the packet was originally received from a non ARPHRD_ETHER device. In the below OVS data path flow, since the device corresponding to port 7 is an l3 device (ARPHRD_NONE) the skb_mpls_pop function does not update the ethertype of the packet even though the previous push_eth action had added an ethernet header to the packet. recirc_id(0),in_port(7),eth_type(0x8847), mpls(label=12/0xfffff,tc=0/0,ttl=0/0x0,bos=1/1), actions:push_eth(src=00:00:00:00:00:00,dst=00:00:00:00:00:00), pop_mpls(eth_type=0x800),4 Fixes: ed246cee09b9 ("net: core: move pop MPLS functionality from OvS to core helper") Signed-off-by: Martin Varghese <martin.varghese@nokia.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-01Merge tag 'y2038-cleanups-5.5' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playgroundLinus Torvalds1-2/+5
Pull y2038 cleanups from Arnd Bergmann: "y2038 syscall implementation cleanups This is a series of cleanups for the y2038 work, mostly intended for namespace cleaning: the kernel defines the traditional time_t, timeval and timespec types that often lead to y2038-unsafe code. Even though the unsafe usage is mostly gone from the kernel, having the types and associated functions around means that we can still grow new users, and that we may be missing conversions to safe types that actually matter. There are still a number of driver specific patches needed to get the last users of these types removed, those have been submitted to the respective maintainers" Link: https://lore.kernel.org/lkml/20191108210236.1296047-1-arnd@arndb.de/ * tag 'y2038-cleanups-5.5' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground: (26 commits) y2038: alarm: fix half-second cut-off y2038: ipc: fix x32 ABI breakage y2038: fix typo in powerpc vdso "LOPART" y2038: allow disabling time32 system calls y2038: itimer: change implementation to timespec64 y2038: move itimer reset into itimer.c y2038: use compat_{get,set}_itimer on alpha y2038: itimer: compat handling to itimer.c y2038: time: avoid timespec usage in settimeofday() y2038: timerfd: Use timespec64 internally y2038: elfcore: Use __kernel_old_timeval for process times y2038: make ns_to_compat_timeval use __kernel_old_timeval y2038: socket: use __kernel_old_timespec instead of timespec y2038: socket: remove timespec reference in timestamping y2038: syscalls: change remaining timeval to __kernel_old_timeval y2038: rusage: use __kernel_old_timeval y2038: uapi: change __kernel_time_t to __kernel_old_time_t y2038: stat: avoid 'time_t' in 'struct stat' y2038: ipc: remove __kernel_time_t reference from headers y2038: vdso: powerpc: avoid timespec references ...
2019-11-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-0/+6
Minor conflict in drivers/s390/net/qeth_l2_main.c, kept the lock from commit c8183f548902 ("s390/qeth: fix potential deadlock on workqueue flush"), removed the code which was removed by commit 9897d583b015 ("s390/qeth: consolidate some duplicated HW cmd code"). Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-11-22udp: drop skb extensions before marking skb statelessFlorian Westphal1-0/+6
Once udp stack has set the UDP_SKB_IS_STATELESS flag, later skb free assumes all skb head state has been dropped already. This will leak the extension memory in case the skb has extensions other than the ipsec secpath, e.g. bridge nf data. To fix this, set the UDP_SKB_IS_STATELESS flag only if we don't have extensions or if the extension space can be free'd. Fixes: 895b5c9f206eb7d25dc1360a ("netfilter: drop bridge nf reset from nf_reset") Cc: Paolo Abeni <pabeni@redhat.com> Reported-by: Byron Stanoszek <gandalf@winds.org> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-15y2038: socket: use __kernel_old_timespec instead of timespecArnd Bergmann1-2/+5
The 'timespec' type definition and helpers like ktime_to_timespec() or timespec64_to_timespec() should no longer be used in the kernel so we can remove them and avoid introducing y2038 issues in new code. Change the socket code that needs to pass a timespec to user space for backward compatibility to use __kernel_old_timespec instead. This type has the same layout but with a clearer defined name. Slightly reformat tcp_recv_timestamp() for consistency after the removal of timespec64_to_timespec(). Acked-by: Deepa Dinamani <deepa.kernel@gmail.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-11-07net: add a READ_ONCE() in skb_peek_tail()Eric Dumazet1-2/+4
skb_peek_tail() can be used without protection of a lock, as spotted by KCSAN [1] In order to avoid load-stearing, add a READ_ONCE() Note that the corresponding WRITE_ONCE() are already there. [1] BUG: KCSAN: data-race in sk_wait_data / skb_queue_tail read to 0xffff8880b36a4118 of 8 bytes by task 20426 on cpu 1: skb_peek_tail include/linux/skbuff.h:1784 [inline] sk_wait_data+0x15b/0x250 net/core/sock.c:2477 kcm_wait_data+0x112/0x1f0 net/kcm/kcmsock.c:1103 kcm_recvmsg+0xac/0x320 net/kcm/kcmsock.c:1130 sock_recvmsg_nosec net/socket.c:871 [inline] sock_recvmsg net/socket.c:889 [inline] sock_recvmsg+0x92/0xb0 net/socket.c:885 ___sys_recvmsg+0x1a0/0x3e0 net/socket.c:2480 do_recvmmsg+0x19a/0x5c0 net/socket.c:2601 __sys_recvmmsg+0x1ef/0x200 net/socket.c:2680 __do_sys_recvmmsg net/socket.c:2703 [inline] __se_sys_recvmmsg net/socket.c:2696 [inline] __x64_sys_recvmmsg+0x89/0xb0 net/socket.c:2696 do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x44/0xa9 write to 0xffff8880b36a4118 of 8 bytes by task 451 on cpu 0: __skb_insert include/linux/skbuff.h:1852 [inline] __skb_queue_before include/linux/skbuff.h:1958 [inline] __skb_queue_tail include/linux/skbuff.h:1991 [inline] skb_queue_tail+0x7e/0xc0 net/core/skbuff.c:3145 kcm_queue_rcv_skb+0x202/0x310 net/kcm/kcmsock.c:206 kcm_rcv_strparser+0x74/0x4b0 net/kcm/kcmsock.c:370 __strp_recv+0x348/0xf50 net/strparser/strparser.c:309 strp_recv+0x84/0xa0 net/strparser/strparser.c:343 tcp_read_sock+0x174/0x5c0 net/ipv4/tcp.c:1639 strp_read_sock+0xd4/0x140 net/strparser/strparser.c:366 do_strp_work net/strparser/strparser.c:414 [inline] strp_work+0x9a/0xe0 net/strparser/strparser.c:423 process_one_work+0x3d4/0x890 kernel/workqueue.c:2269 worker_thread+0xa0/0x800 kernel/workqueue.c:2415 kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 451 Comm: kworker/u4:3 Not tainted 5.4.0-rc3+ #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: kstrp strp_work Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller1-10/+26
The only slightly tricky merge conflict was the netdevsim because the mutex locking fix overlapped a lot of driver reload reorganization. The rest were (relatively) trivial in nature. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-28net: add skb_queue_empty_lockless()Eric Dumazet1-9/+24
Some paths call skb_queue_empty() without holding the queue lock. We must use a barrier in order to not let the compiler do strange things, and avoid KCSAN splats. Adding a barrier in skb_queue_empty() might be overkill, I prefer adding a new helper to clearly identify points where the callers might be lockless. This might help us finding real bugs. The corresponding WRITE_ONCE() should add zero cost for current compilers. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-23net/flow_dissector: switch to siphashEric Dumazet1-1/+2
UDP IPv6 packets auto flowlabels are using a 32bit secret (static u32 hashrnd in net/core/flow_dissector.c) and apply jhash() over fields known by the receivers. Attackers can easily infer the 32bit secret and use this information to identify a device and/or user, since this 32bit secret is only set at boot time. Really, using jhash() to generate cookies sent on the wire is a serious security concern. Trying to change the rol32(hash, 16) in ip6_make_flowlabel() would be a dead end. Trying to periodically change the secret (like in sch_sfq.c) could change paths taken in the network for long lived flows. Let's switch to siphash, as we did in commit df453700e8d8 ("inet: switch IP ID generator to siphash") Using a cryptographically strong pseudo random function will solve this privacy issue and more generally remove other weak points in the stack. Packet schedulers using skb_get_hash_perturb() benefit from this change. Fixes: b56774163f99 ("ipv6: Enable auto flow labels by default") Fixes: 42240901f7c4 ("ipv6: Implement different admin modes for automatic flow labels") Fixes: 67800f9b1f4e ("ipv6: Call skb_get_hash_flowi6 to get skb->hash in ip6_make_flowlabel") Fixes: cb1ce2ef387b ("ipv6: Implement automatic flow label generation on transmit") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Jonathan Berger <jonathann1@walla.com> Reported-by: Amit Klein <aksecurity@gmail.com> Reported-by: Benny Pinkas <benny@pinkas.net> Cc: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller1-2/+3
Several cases of overlapping changes which were for the most part trivially resolvable. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-15net/sched: fix corrupted L2 header with MPLS 'push' and 'pop' actionsDavide Caratti1-2/+3
the following script: # tc qdisc add dev eth0 clsact # tc filter add dev eth0 egress protocol ip matchall \ > action mpls push protocol mpls_uc label 0x355aa bos 1 causes corruption of all IP packets transmitted by eth0. On TC egress, we can't rely on the value of skb->mac_len, because it's 0 and a MPLS 'push' operation will result in an overwrite of the first 4 octets in the packet L2 header (e.g. the Destination Address if eth0 is an Ethernet); the same error pattern is present also in the MPLS 'pop' operation. Fix this error in act_mpls data plane, computing 'mac_len' as the difference between the network header and the mac header (when not at TC ingress), and use it in MPLS 'push'/'pop' core functions. v2: unbreak 'make htmldocs' because of missing documentation of 'mac_len' in skb_mpls_pop(), reported by kbuild test robot CC: Lorenzo Bianconi <lorenzo@kernel.org> Fixes: 2a2ea50870ba ("net: sched: add mpls manipulation actions to TC") Reviewed-by: Simon Horman <simon.horman@netronome.com> Acked-by: John Hurley <john.hurley@netronome.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-07net: core: change return type of pskb_may_pull to boolHeiner Kallweit1-3/+3
This function de-facto returns a bool, so let's change the return type accordingly. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-01netfilter: drop bridge nf reset from nf_resetFlorian Westphal1-4/+1
commit 174e23810cd31 ("sk_buff: drop all skb extensions on free and skb scrubbing") made napi recycle always drop skb extensions. The additional skb_ext_del() that is performed via nf_reset on napi skb recycle is not needed anymore. Most nf_reset() calls in the stack are there so queued skb won't block 'rmmod nf_conntrack' indefinitely. This removes the skb_ext_del from nf_reset, and renames it to a more fitting nf_reset_ct(). In a few selected places, add a call to skb_ext_reset to make sure that no active extensions remain. I am submitting this for "net", because we're still early in the release cycle. The patch applies to net-next too, but I think the rename causes needless divergence between those trees. Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-09-27sk_buff: drop all skb extensions on free and skb scrubbingFlorian Westphal1-0/+9
Now that we have a 3rd extension, add a new helper that drops the extension space and use it when we need to scrub an sk_buff. At this time, scrubbing clears secpath and bridge netfilter data, but retains the tc skb extension, after this patch all three get cleared. NAPI reuse/free assumes we can only have a secpath attached to skb, but it seems better to clear all extensions there as well. v2: add unlikely hint (Eric Dumazet) Fixes: 95a7233c452a ("net: openvswitch: Set OvS recirc_id from tc chain index") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-13netfilter: conntrack: move code to linux/nf_conntrack_common.h.Jeremy Sowden1-17/+15
Move some `struct nf_conntrack` code from linux/skbuff.h to linux/nf_conntrack_common.h. Together with a couple of helpers for getting and setting skb->_nfct, it allows us to remove CONFIG_NF_CONNTRACK checks from net/netfilter/nf_conntrack.h. Signed-off-by: Jeremy Sowden <jeremy@azazel.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-09-06net: openvswitch: Set OvS recirc_id from tc chain indexPaul Blakey1-0/+13
Offloaded OvS datapath rules are translated one to one to tc rules, for example the following simplified OvS rule: recirc_id(0),in_port(dev1),eth_type(0x0800),ct_state(-trk) actions:ct(),recirc(2) Will be translated to the following tc rule: $ tc filter add dev dev1 ingress \ prio 1 chain 0 proto ip \ flower tcp ct_state -trk \ action ct pipe \ action goto chain 2 Received packets will first travel though tc, and if they aren't stolen by it, like in the above rule, they will continue to OvS datapath. Since we already did some actions (action ct in this case) which might modify the packets, and updated action stats, we would like to continue the proccessing with the correct recirc_id in OvS (here recirc_id(2)) where we left off. To support this, introduce a new skb extension for tc, which will be used for translating tc chain to ovs recirc_id to handle these miss cases. Last tc chain index will be set by tc goto chain action and read by OvS datapath. Signed-off-by: Paul Blakey <paulb@mellanox.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller1-0/+8
Merge conflict of mlx5 resolved using instructions in merge commit 9566e650bf7fdf58384bb06df634f7531ca3a97e. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski1-1/+1
Daniel Borkmann says: ==================== The following pull-request contains BPF updates for your *net-next* tree. There is a small merge conflict in libbpf (Cc Andrii so he's in the loop as well): for (i = 1; i <= btf__get_nr_types(btf); i++) { t = (struct btf_type *)btf__type_by_id(btf, i); if (!has_datasec && btf_is_var(t)) { /* replace VAR with INT */ t->info = BTF_INFO_ENC(BTF_KIND_INT, 0, 0); <<<<<<< HEAD /* * using size = 1 is the safest choice, 4 will be too * big and cause kernel BTF validation failure if * original variable took less than 4 bytes */ t->size = 1; *(int *)(t+1) = BTF_INT_ENC(0, 0, 8); } else if (!has_datasec && kind == BTF_KIND_DATASEC) { ======= t->size = sizeof(int); *(int *)(t + 1) = BTF_INT_ENC(0, 0, 32); } else if (!has_datasec && btf_is_datasec(t)) { >>>>>>> 72ef80b5ee131e96172f19e74b4f98fa3404efe8 /* replace DATASEC with STRUCT */ Conflict is between the two commits 1d4126c4e119 ("libbpf: sanitize VAR to conservative 1-byte INT") and b03bc6853c0e ("libbpf: convert libbpf code to use new btf helpers"), so we need to pick the sanitation fixup as well as use the new btf_is_datasec() helper and the whitespace cleanup. Looks like the following: [...] if (!has_datasec && btf_is_var(t)) { /* replace VAR with INT */ t->info = BTF_INFO_ENC(BTF_KIND_INT, 0, 0); /* * using size = 1 is the safest choice, 4 will be too * big and cause kernel BTF validation failure if * original variable took less than 4 bytes */ t->size = 1; *(int *)(t + 1) = BTF_INT_ENC(0, 0, 8); } else if (!has_datasec && btf_is_datasec(t)) { /* replace DATASEC with STRUCT */ [...] The main changes are: 1) Addition of core parts of compile once - run everywhere (co-re) effort, that is, relocation of fields offsets in libbpf as well as exposure of kernel's own BTF via sysfs and loading through libbpf, from Andrii. More info on co-re: http://vger.kernel.org/bpfconf2019.html#session-2 and http://vger.kernel.org/lpc-bpf2018.html#session-2 2) Enable passing input flags to the BPF flow dissector to customize parsing and allowing it to stop early similar to the C based one, from Stanislav. 3) Add a BPF helper function that allows generating SYN cookies from XDP and tc BPF, from Petar. 4) Add devmap hash-based map type for more flexibility in device lookup for redirects, from Toke. 5) Improvements to XDP forwarding sample code now utilizing recently enabled devmap lookups, from Jesper. 6) Add support for reporting the effective cgroup progs in bpftool, from Jakub and Takshak. 7) Fix reading kernel config from bpftool via /proc/config.gz, from Peter. 8) Fix AF_XDP umem pages mapping for 32 bit architectures, from Ivan. 9) Follow-up to add two more BPF loop tests for the selftest suite, from Alexei. 10) Add perf event output helper also for other skb-based program types, from Allan. 11) Fix a co-re related compilation error in selftests, from Yonghong. ==================== Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-08-08net/tls: prevent skb_orphan() from leaking TLS plain text with offloadJakub Kicinski1-0/+8
sk_validate_xmit_skb() and drivers depend on the sk member of struct sk_buff to identify segments requiring encryption. Any operation which removes or does not preserve the original TLS socket such as skb_orphan() or skb_clone() will cause clear text leaks. Make the TCP socket underlying an offloaded TLS connection mark all skbs as decrypted, if TLS TX is in offload mode. Then in sk_validate_xmit_skb() catch skbs which have no socket (or a socket with no validation) and decrypted flag set. Note that CONFIG_SOCK_VALIDATE_XMIT, CONFIG_TLS_DEVICE and sk->sk_validate_xmit_skb are slightly interchangeable right now, they all imply TLS offload. The new checks are guarded by CONFIG_TLS_DEVICE because that's the option guarding the sk_buff->decrypted member. Second, smaller issue with orphaning is that it breaks the guarantee that packets will be delivered to device queues in-order. All TLS offload drivers depend on that scheduling property. This means skb_orphan_partial()'s trick of preserving partial socket references will cause issues in the drivers. We need a full orphan, and as a result netem delay/throttling will cause all TLS offload skbs to be dropped. Reusing the sk_buff->decrypted flag also protects from leaking clear text when incoming, decrypted skb is redirected (e.g. by TC). See commit 0608c69c9a80 ("bpf: sk_msg, sock{map|hash} redirect through ULP") for justification why the internal flag is safe. The only location which could leak the flag in is tcp_bpf_sendmsg(), which is taken care of by clearing the previously unused bit. v2: - remove superfluous decrypted mark copy (Willem); - remove the stale doc entry (Boris); - rely entirely on EOR marking to prevent coalescing (Boris); - use an internal sendpages flag instead of marking the socket (Boris). v3 (Willem): - reorganize the can_skb_orphan_partial() condition; - fix the flag leak-in through tcp_bpf_sendmsg. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-30linux: Remove bvec page_offset, use bv_offsetJonathan Lemon1-5/+5
Now that page_offset is referenced through accessors, remove the union, and use bv_offset. Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-30linux: Add skb_frag_t page_offset accessorsJonathan Lemon1-8/+59
Add skb_frag_off(), skb_frag_off_add(), skb_frag_off_set(), and skb_frag_off_copy() accessors for page_offset. Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-25bpf/flow_dissector: pass input flags to BPF flow dissector programStanislav Fomichev1-1/+1
C flow dissector supports input flags that tell it to customize parsing by either stopping early or trying to parse as deep as possible. Pass those flags to the BPF flow dissector so it can make the same decisions. In the next commits I'll add support for those flags to our reference bpf_flow.c v3: * Export copy of flow dissector flags instead of moving (Alexei Starovoitov) Acked-by: Petar Penkov <ppenkov@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Acked-by: Song Liu <songliubraving@fb.com> Cc: Song Liu <songliubraving@fb.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Petar Penkov <ppenkov@google.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-07-22net: Convert skb_frag_t to bio_vecMatthew Wilcox (Oracle)1-7/+2
There are a lot of users of frag->page_offset, so use a union to avoid converting those users today. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-22net: Rename skb_frag_t size to bv_lenMatthew Wilcox (Oracle)1-5/+5
Improved compatibility with bvec Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-22net: Rename skb_frag page to bv_pageMatthew Wilcox (Oracle)1-7/+5
One step closer to turning the skb_frag_t into a bio_vec. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-22net: Reorder the contents of skb_frag_tMatthew Wilcox (Oracle)1-1/+1
Match the layout of bio_vec. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-22net: Increase the size of skb_frag_tMatthew Wilcox (Oracle)1-5/+0
To increase commonality between block and net, we are going to replace the skb_frag_t with the bio_vec. This patch increases the size of skb_frag_t on 32-bit machines from 8 bytes to 12 bytes. The size is unchanged on 64-bit machines. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-22net: Use skb accessors in network coreMatthew Wilcox (Oracle)1-1/+1
In preparation for unifying the skb_frag and bio_vec, use the fine accessors which already exist and use skb_frag_t instead of struct skb_frag_struct. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-09net/flow_dissector: add connection tracking dissectionPaul Blakey1-0/+10
Retreives connection tracking zone, mark, label, and state from a SKB. Signed-off-by: Paul Blakey <paulb@mellanox.com> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-08net: sched: add mpls manipulation actions to TCJohn Hurley1-0/+1
Currently, TC offers the ability to match on the MPLS fields of a packet through the use of the flow_dissector_key_mpls struct. However, as yet, TC actions do not allow the modification or manipulation of such fields. Add a new module that registers TC action ops to allow manipulation of MPLS. This includes the ability to push and pop headers as well as modify the contents of new or existing headers. A further action to decrement the TTL field of an MPLS header is also provided with a new helper added to support this. Examples of the usage of the new action with flower rules to push and pop MPLS labels are: tc filter add dev eth0 protocol ip parent ffff: flower \ action mpls push protocol mpls_uc label 123 \ action mirred egress redirect dev eth1 tc filter add dev eth0 protocol mpls_uc parent ffff: flower \ action mpls pop protocol ipv4 \ action mirred egress redirect dev eth1 Signed-off-by: John Hurley <john.hurley@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>