aboutsummaryrefslogtreecommitdiffstatshomepage
AgeCommit message (Collapse)AuthorFilesLines
2020-05-15Merge branch 'bpf-cap'Daniel Borkmann25-84/+223
Alexei Starovoitov says: ==================== v6->v7: - permit SK_REUSEPORT program type under CAP_BPF as suggested by Marek Majkowski. It's equivalent to SOCKET_FILTER which is unpriv. v5->v6: - split allow_ptr_leaks into four flags. - retain bpf_jit_limit under cap_sys_admin. - fixed few other issues spotted by Daniel. v4->v5: Split BPF operations that are allowed under CAP_SYS_ADMIN into combination of CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN and keep some of them under CAP_SYS_ADMIN. The user process has to have - CAP_BPF to create maps, do other sys_bpf() commands and load SK_REUSEPORT progs. Note: dev_map, sock_hash, sock_map map types still require CAP_NET_ADMIN. That could be relaxed in the future. - CAP_BPF and CAP_PERFMON to load tracing programs. - CAP_BPF and CAP_NET_ADMIN to load networking programs. (or CAP_SYS_ADMIN for backward compatibility). CAP_BPF solves three main goals: 1. provides isolation to user space processes that drop CAP_SYS_ADMIN and switch to CAP_BPF. More on this below. This is the major difference vs v4 set back from Sep 2019. 2. makes networking BPF progs more secure, since CAP_BPF + CAP_NET_ADMIN prevents pointer leaks and arbitrary kernel memory access. 3. enables fuzzers to exercise all of the verifier logic. Eventually finding bugs and making BPF infra more secure. Currently fuzzers run in unpriv. They will be able to run with CAP_BPF. The patchset is long overdue follow-up from the last plumbers conference. Comparing to what was discussed at LPC the CAP* checks at attach time are gone. For tracing progs the CAP_SYS_ADMIN check was done at load time only. There was no check at attach time. For networking and cgroup progs CAP_SYS_ADMIN was required at load time and CAP_NET_ADMIN at attach time, but there are several ways to bypass CAP_NET_ADMIN: - if networking prog is using tail_call writing FD into prog_array will effectively attach it, but bpf_map_update_elem is an unprivileged operation. - freplace prog with CAP_SYS_ADMIN can replace networking prog Consolidating all CAP checks at load time makes security model similar to open() syscall. Once the user got an FD it can do everything with it. read/write/poll don't check permissions. The same way when bpf_prog_load command returns an FD the user can do everything (including attaching, detaching, and bpf_test_run). The important design decision is to allow ID->FD transition for CAP_SYS_ADMIN only. What it means that user processes can run with CAP_BPF and CAP_NET_ADMIN and they will not be able to affect each other unless they pass FDs via scm_rights or via pinning in bpffs. ID->FD is a mechanism for human override and introspection. An admin can do 'sudo bpftool prog ...'. It's possible to enforce via LSM that only bpftool binary does bpf syscall with CAP_SYS_ADMIN and the rest of user space processes do bpf syscall with CAP_BPF isolating bpf objects (progs, maps, links) that are owned by such processes from each other. Another significant change from LPC is that the verifier checks are split into four flags. The allow_ptr_leaks flag allows pointer manipulations. The bpf_capable flag enables all modern verifier features like bpf-to-bpf calls, BTF, bounded loops, dead code elimination, etc. All the goodness. The bypass_spec_v1 flag enables indirect stack access from bpf programs and disables speculative analysis and bpf array mitigations. The bypass_spec_v4 flag disables store sanitation. That allows networking progs with CAP_BPF + CAP_NET_ADMIN enjoy modern verifier features while being more secure. Some networking progs may need CAP_BPF + CAP_NET_ADMIN + CAP_PERFMON, since subtracting pointers (like skb->data_end - skb->data) is a pointer leak, but the verifier may get smarter in the future. ==================== Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2020-05-15selftests/bpf: Use CAP_BPF and CAP_PERFMON in testsAlexei Starovoitov3-21/+49
Make all test_verifier test exercise CAP_BPF and CAP_PERFMON Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200513230355.7858-4-alexei.starovoitov@gmail.com
2020-05-15bpf: Implement CAP_BPFAlexei Starovoitov19-60/+134
Implement permissions as stated in uapi/linux/capability.h In order to do that the verifier allow_ptr_leaks flag is split into four flags and they are set as: env->allow_ptr_leaks = bpf_allow_ptr_leaks(); env->bypass_spec_v1 = bpf_bypass_spec_v1(); env->bypass_spec_v4 = bpf_bypass_spec_v4(); env->bpf_capable = bpf_capable(); The first three currently equivalent to perfmon_capable(), since leaking kernel pointers and reading kernel memory via side channel attacks is roughly equivalent to reading kernel memory with cap_perfmon. 'bpf_capable' enables bounded loops, precision tracking, bpf to bpf calls and other verifier features. 'allow_ptr_leaks' enable ptr leaks, ptr conversions, subtraction of pointers. 'bypass_spec_v1' disables speculative analysis in the verifier, run time mitigations in bpf array, and enables indirect variable access in bpf programs. 'bypass_spec_v4' disables emission of sanitation code by the verifier. That means that the networking BPF program loaded with CAP_BPF + CAP_NET_ADMIN will have speculative checks done by the verifier and other spectre mitigation applied. Such networking BPF program will not be able to leak kernel pointers and will not be able to access arbitrary kernel memory. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200513230355.7858-3-alexei.starovoitov@gmail.com
2020-05-15bpf, capability: Introduce CAP_BPFAlexei Starovoitov3-3/+40
Split BPF operations that are allowed under CAP_SYS_ADMIN into combination of CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN. For backward compatibility include them in CAP_SYS_ADMIN as well. The end result provides simple safety model for applications that use BPF: - to load tracing program types BPF_PROG_TYPE_{KPROBE, TRACEPOINT, PERF_EVENT, RAW_TRACEPOINT, etc} use CAP_BPF and CAP_PERFMON - to load networking program types BPF_PROG_TYPE_{SCHED_CLS, XDP, SK_SKB, etc} use CAP_BPF and CAP_NET_ADMIN There are few exceptions from this rule: - bpf_trace_printk() is allowed in networking programs, but it's using tracing mechanism, hence this helper needs additional CAP_PERFMON if networking program is using this helper. - BPF_F_ZERO_SEED flag for hash/lru map is allowed under CAP_SYS_ADMIN only to discourage production use. - BPF HW offload is allowed under CAP_SYS_ADMIN. - bpf_probe_write_user() is allowed under CAP_SYS_ADMIN only. CAPs are not checked at attach/detach time with two exceptions: - loading BPF_PROG_TYPE_CGROUP_SKB is allowed for unprivileged users, hence CAP_NET_ADMIN is required at attach time. - flow_dissector detach doesn't check prog FD at detach, hence CAP_NET_ADMIN is required at detach time. CAP_SYS_ADMIN is required to iterate BPF objects (progs, maps, links) via get_next_id command and convert them to file descriptor via GET_FD_BY_ID command. This restriction guarantees that mutliple tasks with CAP_BPF are not able to affect each other. That leads to clean isolation of tasks. For example: task A with CAP_BPF and CAP_NET_ADMIN loads and attaches a firewall via bpf_link. task B with the same capabilities cannot detach that firewall unless task A explicitly passed link FD to task B via scm_rights or bpffs. CAP_SYS_ADMIN can still detach/unload everything. Two networking user apps with CAP_SYS_ADMIN and CAP_NET_ADMIN can accidentely mess with each other programs and maps. Two networking user apps with CAP_NET_ADMIN and CAP_BPF cannot affect each other. CAP_NET_ADMIN + CAP_BPF allows networking programs access only packet data. Such networking progs cannot access arbitrary kernel memory or leak pointers. bpftool, bpftrace, bcc tools binaries should NOT be installed with CAP_BPF and CAP_PERFMON, since unpriv users will be able to read kernel secrets. But users with these two permissions will be able to use these tracing tools. CAP_PERFMON is least secure, since it allows kprobes and kernel memory access. CAP_NET_ADMIN can stop network traffic via iproute2. CAP_BPF is the safest from security point of view and harmless on its own. Having CAP_BPF and/or CAP_NET_ADMIN is not enough to write into arbitrary map and if that map is used by firewall-like bpf prog. CAP_BPF allows many bpf prog_load commands in parallel. The verifier may consume large amount of memory and significantly slow down the system. Existing unprivileged BPF operations are not affected. In particular unprivileged users are allowed to load socket_filter and cg_skb program types and to create array, hash, prog_array, map-in-map map types. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200513230355.7858-2-alexei.starovoitov@gmail.com
2020-05-15bpf, bpftool: Allow probing for CONFIG_HZ from kernel configDaniel Borkmann1-53/+67
In Cilium we've recently switched to make use of bpf_jiffies64() for parts of our tc and XDP datapath since bpf_ktime_get_ns() is more expensive and high-precision is not needed for our timeouts we have anyway. Our agent has a probe manager which picks up the json of bpftool's feature probe and we also use the macro output in our C programs e.g. to have workarounds when helpers are not available on older kernels. Extend the kernel config info dump to also include the kernel's CONFIG_HZ, and rework the probe_kernel_image_config() for allowing a macro dump such that CONFIG_HZ can be propagated to BPF C code as a simple define if available via config. Latter allows to have _compile- time_ resolution of jiffies <-> sec conversion in our code since all are propagated as known constants. Given we cannot generally assume availability of kconfig everywhere, we also have a kernel hz probe [0] as a fallback. Potentially, bpftool could have an integrated probe fallback as well, although to derive it, we might need to place it under 'bpftool feature probe full' or similar given it would slow down the probing process overall. Yet 'full' doesn't fit either for us since we don't want to pollute the kernel log with warning messages from bpf_probe_write_user() and bpf_trace_printk() on agent startup; I've left it out for the time being. [0] https://github.com/cilium/cilium/blob/master/bpf/cilium-probe-kernel-hz.c Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Quentin Monnet <quentin@isovalent.com> Cc: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200513075849.20868-1-daniel@iogearbox.net
2020-05-15Merge branch 'restrict-bpf_probe_read'Alexei Starovoitov8-35/+101
Daniel Borkmann says: ==================== Small set of fixes in order to restrict BPF helpers for tracing which are broken on archs with overlapping address ranges as per discussion in [0]. I've targetted this for -bpf tree so they can be routed as fixes. Thanks! v1 -> v2: - switch to reusable %pks, %pus format specifiers (Yonghong) - fixate %s on kernel_ds probing for archs with overlapping addr space [0] https://lore.kernel.org/bpf/CAHk-=wjJKo0GVixYLmqPn-Q22WFu0xHaBSjKEo7e7Yw72y5SPQ@mail.gmail.com/T/ ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-05-15bpf: Restrict bpf_trace_printk()'s %s usage and add %pks, %pus specifierDaniel Borkmann3-32/+88
Usage of plain %s conversion specifier in bpf_trace_printk() suffers from the very same issue as bpf_probe_read{,str}() helpers, that is, it is broken on archs with overlapping address ranges. While the helpers have been addressed through work in 6ae08ae3dea2 ("bpf: Add probe_read_{user, kernel} and probe_read_{user, kernel}_str helpers"), we need an option for bpf_trace_printk() as well to fix it. Similarly as with the helpers, force users to make an explicit choice by adding %pks and %pus specifier to bpf_trace_printk() which will then pick the corresponding strncpy_from_unsafe*() variant to perform the access under KERNEL_DS or USER_DS. The %pk* (kernel specifier) and %pu* (user specifier) can later also be extended for other objects aside strings that are probed and printed under tracing, and reused out of other facilities like bpf_seq_printf() or BTF based type printing. Existing behavior of %s for current users is still kept working for archs where it is not broken and therefore gated through CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE. For archs not having this property we fall-back to pick probing under KERNEL_DS as a sensible default. Fixes: 8d3b7dce8622 ("bpf: add support for %s specifier to bpf_trace_printk()") Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Reported-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Brendan Gregg <brendan.d.gregg@gmail.com> Link: https://lore.kernel.org/bpf/20200515101118.6508-4-daniel@iogearbox.net
2020-05-15bpf: Add bpf_probe_read_{user, kernel}_str() to do_refine_retval_rangeDaniel Borkmann1-1/+3
Given bpf_probe_read{,str}() BPF helpers are now only available under CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE, we need to add the drop-in replacements of bpf_probe_read_{kernel,user}_str() to do_refine_retval_range() as well to avoid hitting the same issue as in 849fa50662fbc ("bpf/verifier: refine retval R0 state for bpf_get_stack helper"). Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20200515101118.6508-3-daniel@iogearbox.net
2020-05-15bpf: Restrict bpf_probe_read{, str}() only to archs where they workDaniel Borkmann5-2/+10
Given the legacy bpf_probe_read{,str}() BPF helpers are broken on archs with overlapping address ranges, we should really take the next step to disable them from BPF use there. To generally fix the situation, we've recently added new helper variants bpf_probe_read_{user,kernel}() and bpf_probe_read_{user,kernel}_str(). For details on them, see 6ae08ae3dea2 ("bpf: Add probe_read_{user, kernel} and probe_read_{user,kernel}_str helpers"). Given bpf_probe_read{,str}() have been around for ~5 years by now, there are plenty of users at least on x86 still relying on them today, so we cannot remove them entirely w/o breaking the BPF tracing ecosystem. However, their use should be restricted to archs with non-overlapping address ranges where they are working in their current form. Therefore, move this behind a CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE and have x86, arm64, arm select it (other archs supporting it can follow-up on it as well). For the remaining archs, they can workaround easily by relying on the feature probe from bpftool which spills out defines that can be used out of BPF C code to implement the drop-in replacement for old/new kernels via: bpftool feature probe macro Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Brendan Gregg <brendan.d.gregg@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/bpf/20200515101118.6508-2-daniel@iogearbox.net
2020-05-15Merge tag 'drm-misc-fixes-2020-05-14' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixesDave Airlie1-3/+1
Just one meson patch this time to propagate an error code Signed-off-by: Dave Airlie <airlied@redhat.com> From: Maxime Ripard <maxime@cerno.tech> Link: https://patchwork.freedesktop.org/patch/msgid/20200514073538.wvdtv5s2mt4wdrdj@gilmour.lan
2020-05-14Merge branch 'xdp-grow-tail'Alexei Starovoitov43-115/+451
Jesper Dangaard Brouer says: ==================== V4: - Fixup checkpatch.pl issues - Collected more ACKs V3: - Fix issue on virtio_net patch spotted by Jason Wang - Adjust name for variable in mlx5 patch - Collected more ACKs V2: - Fix bug in mlx5 for XDP_PASS case - Collected nitpicks and ACKs from mailing list V1: - Fix bug in dpaa2 XDP have evolved to support several frame sizes, but xdp_buff was not updated with this information. This have caused the side-effect that XDP frame data hard end is unknown. This have limited the BPF-helper bpf_xdp_adjust_tail to only shrink the packet. This patchset address this and add packet tail extend/grow. The purpose of the patchset is ALSO to reserve a memory area that can be used for storing extra information, specifically for extending XDP with multi-buffer support. One proposal is to use same layout as skb_shared_info, which is why this area is currently 320 bytes. When converting xdp_frame to SKB (veth and cpumap), the full tailroom area can now be used and SKB truesize is now correct. For most drivers this result in a much larger tailroom in SKB "head" data area. The network stack can now take advantage of this when doing SKB coalescing. Thus, a good driver test is to use xdp_redirect_cpu from samples/bpf/ and do some TCP stream testing. Use-cases for tail grow/extend: (1) IPsec / XFRM needs a tail extend[1][2]. (2) DNS-cache responses in XDP. (3) HAProxy ALOHA would need it to convert to XDP. (4) Add tail info e.g. timestamp and collect via tcpdump [1] http://vger.kernel.org/netconf2019_files/xfrm_xdp.pdf [2] http://vger.kernel.org/netconf2019.html Examples on howto access the tail area of an XDP packet is shown in the XDP-tutorial example[3]. [3] https://github.com/xdp-project/xdp-tutorial/blob/master/experiment01-tailgrow/ ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-05-14selftests/bpf: Xdp_adjust_tail add grow tail testsJesper Dangaard Brouer2-5/+144
Extend BPF selftest xdp_adjust_tail with grow tail tests, which is added as subtest's. The first grow test stays in same form as original shrink test. The second grow test use the newer bpf_prog_test_run_xattr() calls, and does extra checking of data contents. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/158945350567.97035.9632611946765811876.stgit@firesoul
2020-05-14selftests/bpf: Adjust BPF selftest for xdp_adjust_tailJesper Dangaard Brouer2-8/+13
Current selftest for BPF-helper xdp_adjust_tail only shrink tail. Make it more clear that this is a shrink test case. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/158945350058.97035.17280775016196207372.stgit@firesoul
2020-05-14bpf: Add xdp.frame_sz in bpf_prog_test_run_xdp().Jesper Dangaard Brouer1-4/+12
Update the memory requirements, when adding xdp.frame_sz in BPF test_run function bpf_prog_test_run_xdp() which e.g. is used by XDP selftests. Specifically add the expected reserved tailroom, but also allocated a larger memory area to reflect that XDP frames usually comes in this format. Limit the provided packet data size to 4096 minus headroom + tailroom, as this also reflect a common 3520 bytes MTU limit with XDP. Note that bpf_test_init already use a memory allocation method that clears memory. Thus, this already guards against leaking uninit kernel memory. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/158945349549.97035.15316291762482444006.stgit@firesoul
2020-05-14xdp: Clear grow memory in bpf_xdp_adjust_tail()Jesper Dangaard Brouer1-0/+4
Clearing memory of tail when grow happens, because it is too easy to write a XDP_PASS program that extend the tail, which expose this memory to users that can run tcpdump. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/158945349039.97035.5262100484553494.stgit@firesoul
2020-05-14xdp: Allow bpf_xdp_adjust_tail() to grow packet sizeJesper Dangaard Brouer2-4/+11
Finally, after all drivers have a frame size, allow BPF-helper bpf_xdp_adjust_tail() to grow or extend packet size at frame tail. Remember that helper/macro xdp_data_hard_end have reserved some tailroom. Thus, this helper makes sure that the BPF-prog don't have access to this tailroom area. V2: Remove one chicken check and use WARN_ONCE for other Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/158945348530.97035.12577148209134239291.stgit@firesoul
2020-05-14mlx5: Rx queue setup time determine frame_sz for XDPJesper Dangaard Brouer4-0/+10
The mlx5 driver have multiple memory models, which are also changed according to whether a XDP bpf_prog is attached. The 'rx_striding_rq' setting is adjusted via ethtool priv-flags e.g.: # ethtool --set-priv-flags mlx5p2 rx_striding_rq off On the general case with 4K page_size and regular MTU packet, then the frame_sz is 2048 and 4096 when XDP is enabled, in both modes. The info on the given frame size is stored differently depending on the RQ-mode and encoded in a union in struct mlx5e_rq union wqe/mpwqe. In rx striding mode rq->mpwqe.log_stride_sz is either 11 or 12, which corresponds to 2048 or 4096 (MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ). In non-striding mode (MLX5_WQ_TYPE_CYCLIC) the frag_stride is stored in rq->wqe.info.arr[0].frag_stride, for the first fragment, which is what the XDP case cares about. To reduce effect on fast-path, this patch determine the frame_sz at setup time, to avoid determining the memory model runtime. Variable is named frame0_sz to make it clear that this is only the frame size of the first fragment. This mlx5 driver does a DMA-sync on XDP_TX action, but grow is safe as it have done a DMA-map on the entire PAGE_SIZE. The driver also already does a XDP length check against sq->hw_mtu on the possible XDP xmit paths mlx5e_xmit_xdp_frame() + mlx5e_xmit_xdp_frame_mpwqe(). V3+4: Change variable name first_frame_sz to frame0_sz V2: Fix that frag_size need to be recalc before creating SKB. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Tariq Toukan <tariqt@mellanox.com> Cc: Saeed Mahameed <saeedm@mellanox.com> Link: https://lore.kernel.org/bpf/158945348021.97035.12295039384250022883.stgit@firesoul
2020-05-14xdp: For Intel AF_XDP drivers add XDP frame_szJesper Dangaard Brouer4-0/+17
Intel drivers implement native AF_XDP zerocopy in separate C-files, that have its own invocation of bpf_prog_run_xdp(). The setup of xdp_buff is also handled in separately from normal code path. This patch update XDP frame_sz for AF_XDP zerocopy drivers i40e, ice and ixgbe, as the code changes needed are very similar. Introduce a helper function xsk_umem_xdp_frame_sz() for calculating frame size. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Björn Töpel <bjorn.topel@intel.com> Cc: intel-wired-lan@lists.osuosl.org Cc: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/158945347511.97035.8536753731329475655.stgit@firesoul
2020-05-14ice: Add XDP frame size to driverJesper Dangaard Brouer1-9/+25
This driver uses different memory models depending on PAGE_SIZE at compile time. For PAGE_SIZE 4K it uses page splitting, meaning for normal MTU frame size is 2048 bytes (and headroom 192 bytes). For larger MTUs the driver still use page splitting, by allocating order-1 pages (8192 bytes) for RX frames. For PAGE_SIZE larger than 4K, driver instead advance its rx_buffer->page_offset with the frame size "truesize". For XDP frame size calculations, this mean that in PAGE_SIZE larger than 4K mode the frame_sz change on a per packet basis. For the page split 4K PAGE_SIZE mode, xdp.frame_sz is more constant and can be updated once outside the main NAPI loop. The default setting in the driver uses build_skb(), which provides the necessary headroom and tailroom for XDP-redirect in RX-frame (in both modes). There is one complication, which is legacy-rx mode (configurable via ethtool priv-flags). There are zero headroom in this mode, which is a requirement for XDP-redirect to work. The conversion to xdp_frame (convert_to_xdp_frame) will detect this insufficient space, and xdp_do_redirect() call will fail. This is deemed acceptable, as it allows other XDP actions to still work in legacy-mode. In legacy-mode + larger PAGE_SIZE due to lacking tailroom, we also accept that xdp_adjust_tail shrink doesn't work. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Cc: intel-wired-lan@lists.osuosl.org Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Cc: Alexander Duyck <alexander.duyck@gmail.com> Link: https://lore.kernel.org/bpf/158945347002.97035.328088795813704587.stgit@firesoul
2020-05-14i40e: Add XDP frame size to driverJesper Dangaard Brouer1-5/+25
This driver uses different memory models depending on PAGE_SIZE at compile time. For PAGE_SIZE 4K it uses page splitting, meaning for normal MTU frame size is 2048 bytes (and headroom 192 bytes). For larger MTUs the driver still use page splitting, by allocating order-1 pages (8192 bytes) for RX frames. For PAGE_SIZE larger than 4K, driver instead advance its rx_buffer->page_offset with the frame size "truesize". For XDP frame size calculations, this mean that in PAGE_SIZE larger than 4K mode the frame_sz change on a per packet basis. For the page split 4K PAGE_SIZE mode, xdp.frame_sz is more constant and can be updated once outside the main NAPI loop. The default setting in the driver uses build_skb(), which provides the necessary headroom and tailroom for XDP-redirect in RX-frame (in both modes). There is one complication, which is legacy-rx mode (configurable via ethtool priv-flags). There are zero headroom in this mode, which is a requirement for XDP-redirect to work. The conversion to xdp_frame (convert_to_xdp_frame) will detect this insufficient space, and xdp_do_redirect() call will fail. This is deemed acceptable, as it allows other XDP actions to still work in legacy-mode. In legacy-mode + larger PAGE_SIZE due to lacking tailroom, we also accept that xdp_adjust_tail shrink doesn't work. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Cc: intel-wired-lan@lists.osuosl.org Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Cc: Alexander Duyck <alexander.duyck@gmail.com> Link: https://lore.kernel.org/bpf/158945346494.97035.12809400414566061815.stgit@firesoul
2020-05-14ixgbevf: Add XDP frame size to VF driverJesper Dangaard Brouer1-7/+27
This patch mirrors the changes to ixgbe in previous patch. This VF driver doesn't support XDP_REDIRECT, but correct tailroom is still necessary for BPF-helper xdp_adjust_tail. In legacy-mode + larger PAGE_SIZE, due to lacking tailroom, we accept that xdp_adjust_tail shrink doesn't work. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Cc: intel-wired-lan@lists.osuosl.org Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Cc: Alexander Duyck <alexander.duyck@gmail.com> Link: https://lore.kernel.org/bpf/158945345984.97035.13518286183248025173.stgit@firesoul
2020-05-14ixgbe: Add XDP frame size to driverJesper Dangaard Brouer1-8/+26
This driver uses different memory models depending on PAGE_SIZE at compile time. For PAGE_SIZE 4K it uses page splitting, meaning for normal MTU frame size is 2048 bytes (and headroom 192 bytes). For larger MTUs the driver still use page splitting, by allocating order-1 pages (8192 bytes) for RX frames. For PAGE_SIZE larger than 4K, driver instead advance its rx_buffer->page_offset with the frame size "truesize". For XDP frame size calculations, this mean that in PAGE_SIZE larger than 4K mode the frame_sz change on a per packet basis. For the page split 4K PAGE_SIZE mode, xdp.frame_sz is more constant and can be updated once outside the main NAPI loop. The default setting in the driver uses build_skb(), which provides the necessary headroom and tailroom for XDP-redirect in RX-frame (in both modes). There is one complication, which is legacy-rx mode (configurable via ethtool priv-flags). There are zero headroom in this mode, which is a requirement for XDP-redirect to work. The conversion to xdp_frame (convert_to_xdp_frame) will detect this insufficient space, and xdp_do_redirect() call will fail. This is deemed acceptable, as it allows other XDP actions to still work in legacy-mode. In legacy-mode + larger PAGE_SIZE due to lacking tailroom, we also accept that xdp_adjust_tail shrink doesn't work. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Cc: intel-wired-lan@lists.osuosl.org Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Cc: Alexander Duyck <alexander.duyck@gmail.com> Link: https://lore.kernel.org/bpf/158945345455.97035.14334355929030628741.stgit@firesoul
2020-05-14ixgbe: Fix XDP redirect on archs with PAGE_SIZE above 4KJesper Dangaard Brouer1-1/+2
The ixgbe driver have another memory model when compiled on archs with PAGE_SIZE above 4096 bytes. In this mode it doesn't split the page in two halves, but instead increment rx_buffer->page_offset by truesize of packet (which include headroom and tailroom for skb_shared_info). This is done correctly in ixgbe_build_skb(), but in ixgbe_rx_buffer_flip which is currently only called on XDP_TX and XDP_REDIRECT, it forgets to add the tailroom for skb_shared_info. This breaks XDP_REDIRECT, for veth and cpumap. Fix by adding size of skb_shared_info tailroom. Maintainers notice: This fix have been queued to Jeff. Fixes: 6453073987ba ("ixgbe: add initial support for xdp redirect") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Link: https://lore.kernel.org/bpf/158945344946.97035.17031588499266605743.stgit@firesoul
2020-05-14virtio_net: Add XDP frame size in two code pathsJesper Dangaard Brouer1-3/+12
The virtio_net driver is running inside the guest-OS. There are two XDP receive code-paths in virtio_net, namely receive_small() and receive_mergeable(). The receive_big() function does not support XDP. In receive_small() the frame size is available in buflen. The buffer backing these frames are allocated in add_recvbuf_small() with same size, except for the headroom, but tailroom have reserved room for skb_shared_info. The headroom is encoded in ctx pointer as a value. In receive_mergeable() the frame size is more dynamic. There are two basic cases: (1) buffer size is based on a exponentially weighted moving average (see DECLARE_EWMA) of packet length. Or (2) in case virtnet_get_headroom() have any headroom then buffer size is PAGE_SIZE. The ctx pointer is this time used for encoding two values; the buffer len "truesize" and headroom. In case (1) if the rx buffer size is underestimated, the packet will have been split over more buffers (num_buf info in virtio_net_hdr_mrg_rxbuf placed in top of buffer area). If that happens the XDP path does a xdp_linearize_page operation. V3: Adjust frame_sz in receive_mergeable() case, spotted by Jason Wang. The code is really hard to follow, so some hints to reviewers. The receive_mergeable() case gets frames that were allocated in add_recvbuf_mergeable() which uses headroom=virtnet_get_headroom(), and 'buf' ptr is advanced this headroom. The headroom can only be 0 or VIRTIO_XDP_HEADROOM, as virtnet_get_headroom is really simple: static unsigned int virtnet_get_headroom(struct virtnet_info *vi) { return vi->xdp_queue_pairs ? VIRTIO_XDP_HEADROOM : 0; } As frame_sz is an offset size from xdp.data_hard_start, reviewers should notice how this is calculated in receive_mergeable(): int offset = buf - page_address(page); [...] data = page_address(xdp_page) + offset; xdp.data_hard_start = data - VIRTIO_XDP_HEADROOM + vi->hdr_len; The calculated offset will always be VIRTIO_XDP_HEADROOM when reaching this code. Thus, xdp.data_hard_start will be page-start address plus vi->hdr_len. Given this xdp.frame_sz need to be reduced with vi->hdr_len size. IMHO a followup patch should cleanup this code to make it easier to maintain and understand, but it is outside the scope of this patchset. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Link: https://lore.kernel.org/bpf/158945344436.97035.9445115070189151680.stgit@firesoul
2020-05-14vhost_net: Also populate XDP frame sizeJesper Dangaard Brouer1-0/+1
In vhost_net_build_xdp() the 'buf' that gets queued via an xdp_buff have embedded a struct tun_xdp_hdr (located at xdp->data_hard_start) which contains the buffer length 'buflen' (with tailroom for skb_shared_info). Also storing this buflen in xdp->frame_sz, does not obsolete struct tun_xdp_hdr, as it also contains a struct virtio_net_hdr with other information. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Link: https://lore.kernel.org/bpf/158945343928.97035.4620233649151726289.stgit@firesoul
2020-05-14tun: Add XDP frame sizeJesper Dangaard Brouer1-0/+2
The tun driver have two code paths for running XDP (bpf_prog_run_xdp). In both cases 'buflen' contains enough tailroom for skb_shared_info. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Link: https://lore.kernel.org/bpf/158945343419.97035.9594485183958037621.stgit@firesoul
2020-05-14nfp: Add XDP frame size to netronome driverJesper Dangaard Brouer1-0/+6
The netronome nfp driver use PAGE_SIZE when xdp_prog is set, but xdp.data_hard_start begins at offset NFP_NET_RX_BUF_HEADROOM. Thus, adjust for this when setting xdp.frame_sz, as it counts from data_hard_start. When doing XDP_TX this driver is smart and instead of a full DMA-map does a DMA-sync on with packet length. As xdp_adjust_tail can now grow packet length, add checks to make sure that grow size is within the DMA-mapped size. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/bpf/158945342911.97035.11214251236208648808.stgit@firesoul
2020-05-14net: thunderx: Add XDP frame sizeJesper Dangaard Brouer1-0/+1
To help reviewers these are the defines related to RCV_FRAG_LEN #define DMA_BUFFER_LEN 1536 /* In multiples of 128bytes */ #define RCV_FRAG_LEN (SKB_DATA_ALIGN(DMA_BUFFER_LEN + NET_SKB_PAD) + \ SKB_DATA_ALIGN(sizeof(struct skb_shared_info))) Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Cc: Sunil Goutham <sgoutham@marvell.com> Cc: Robert Richter <rrichter@marvell.com> Link: https://lore.kernel.org/bpf/158945342402.97035.12649844447148990032.stgit@firesoul
2020-05-14mlx4: Add XDP frame size and adjust max XDP MTUJesper Dangaard Brouer2-1/+3
The mlx4 drivers size of memory backing the RX packet is stored in frag_stride. For XDP mode this will be PAGE_SIZE (normally 4096). For normal mode frag_stride is 2048. Also adjust MLX4_EN_MAX_XDP_MTU to take tailroom into account. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Cc: Saeed Mahameed <saeedm@mellanox.com> Link: https://lore.kernel.org/bpf/158945341893.97035.2688142527052329942.stgit@firesoul
2020-05-14ena: Add XDP frame size to amazon NIC driverJesper Dangaard Brouer2-2/+4
Frame size ENA_PAGE_SIZE is limited to 16K on systems with larger PAGE_SIZE than 16K. Change ENA_XDP_MAX_MTU to also take into account the reserved tailroom. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Sameeh Jubran <sameehj@amazon.com> Cc: Arthur Kiyanovski <akiyano@amazon.com> Link: https://lore.kernel.org/bpf/158945341384.97035.907403694833419456.stgit@firesoul
2020-05-14net: ethernet: ti: Add XDP frame size to driver cpswJesper Dangaard Brouer2-0/+2
The driver code cpsw.c and cpsw_new.c both use page_pool with default order-0 pages or their RX-pages. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com> Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org> Link: https://lore.kernel.org/bpf/158945340875.97035.752144756428532878.stgit@firesoul
2020-05-14qlogic/qede: Add XDP frame size to driverJesper Dangaard Brouer2-1/+2
The driver qede uses a full page, when XDP is enabled. The drivers value in rx_buf_seg_size (struct qede_rx_queue) will be PAGE_SIZE when an XDP bpf_prog is attached. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Cc: Ariel Elior <aelior@marvell.com> Cc: GR-everest-linux-l2@marvell.com Link: https://lore.kernel.org/bpf/158945340366.97035.7764939691580349618.stgit@firesoul
2020-05-14hv_netvsc: Add XDP frame size to driverJesper Dangaard Brouer2-1/+2
The hyperv NIC driver does memory allocation and copy even without XDP. In XDP mode it will allocate a new page for each packet and copy over the payload, before invoking the XDP BPF-prog. The positive thing it that its easy to determine the xdp.frame_sz. The XDP implementation for hv_netvsc transparently passes xdp_prog to the associated VF NIC. Many of the Azure VMs are using SRIOV, so majority of the data are actually processed directly on the VF driver's XDP path. So the overhead of the synthetic data path (hv_netvsc) is minimal. Then XDP is enabled on this driver, XDP_PASS and XDP_TX will create the SKB via build_skb (based on the newly allocated page). Now using XDP frame_sz this will provide more skb_tailroom, which netstack can use for SKB coalescing (e.g tcp_try_coalesce -> skb_try_coalesce). V3: Adjust patch desc to be more positive. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Cc: Wei Liu <wei.liu@kernel.org> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Stephen Hemminger <sthemmin@microsoft.com> Link: https://lore.kernel.org/bpf/158945339857.97035.10212138582505736163.stgit@firesoul
2020-05-14dpaa2-eth: Add XDP frame sizeJesper Dangaard Brouer1-0/+7
The dpaa2-eth driver reserve some headroom used for hardware and software annotation area in RX/TX buffers. Thus, xdp.data_hard_start doesn't start at page boundary. When XDP is configured the area reserved via dpaa2_fd_get_offset(fd) is 448 bytes of which XDP have reserved 256 bytes. As frame_sz is calculated as an offset from xdp_buff.data_hard_start, an adjust from the full PAGE_SIZE == DPAA2_ETH_RX_BUF_RAW_SIZE. When doing XDP_REDIRECT, the driver doesn't need this reserved headroom any-longer and allows xdp_do_redirect() to use it. This is an advantage for the drivers own ndo-xdp_xmit, as it uses part of this headroom for itself. Patch also adjust frame_sz in this case. The driver cannot support XDP data_meta, because it uses the headroom just before xdp.data for struct dpaa2_eth_swa (DPAA2_ETH_SWA_SIZE=64), when transmitting the packet. When transmitting a xdp_frame in dpaa2_eth_xdp_xmit_frame (call via ndo_xdp_xmit) is uses this area to store a pointer to xdp_frame and dma_size, which is used in TX completion (free_tx_fd) to return frame via xdp_return_frame(). Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Cc: Ioana Radulescu <ruxandra.radulescu@nxp.com> Link: https://lore.kernel.org/bpf/158945339348.97035.8562488847066908856.stgit@firesoul
2020-05-14veth: Xdp using frame_sz in veth driverJesper Dangaard Brouer1-9/+13
The veth driver can run XDP in "native" mode in it's own NAPI handler, and since commit 9fc8d518d9d5 ("veth: Handle xdp_frames in xdp napi ring") packets can come in two forms either xdp_frame or skb, calling respectively veth_xdp_rcv_one() or veth_xdp_rcv_skb(). For packets to arrive in xdp_frame format, they will have been redirected from an XDP native driver. In case of XDP_PASS or no XDP-prog attached, the veth driver will allocate and create an SKB. The current code in veth_xdp_rcv_one() xdp_frame case, had to guess the frame truesize of the incoming xdp_frame, when using veth_build_skb(). With xdp_frame->frame_sz this is not longer necessary. Calculating the frame_sz in veth_xdp_rcv_skb() skb case, is done similar to the XDP-generic handling code in net/core/dev.c. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Lorenzo Bianconi <lorenzo@kernel.org> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Toshiaki Makita <toshiaki.makita1@gmail.com> Link: https://lore.kernel.org/bpf/158945338840.97035.935897116345700902.stgit@firesoul
2020-05-14veth: Adjust hard_start offset on redirect XDP framesJesper Dangaard Brouer1-4/+4
When native XDP redirect into a veth device, the frame arrives in the xdp_frame structure. It is then processed in veth_xdp_rcv_one(), which can run a new XDP bpf_prog on the packet. Doing so requires converting xdp_frame to xdp_buff, but the tricky part is that xdp_frame memory area is located in the top (data_hard_start) memory area that xdp_buff will point into. The current code tried to protect the xdp_frame area, by assigning xdp_buff.data_hard_start past this memory. This results in 32 bytes less headroom to expand into via BPF-helper bpf_xdp_adjust_head(). This protect step is actually not needed, because BPF-helper bpf_xdp_adjust_head() already reserve this area, and don't allow BPF-prog to expand into it. Thus, it is safe to point data_hard_start directly at xdp_frame memory area. Fixes: 9fc8d518d9d5 ("veth: Handle xdp_frames in xdp napi ring") Reported-by: Mao Wenan <maowenan@huawei.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Toshiaki Makita <toshiaki.makita1@gmail.com> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/158945338331.97035.5923525383710752178.stgit@firesoul
2020-05-14xdp: Cpumap redirect use frame_sz and increase skb_tailroomJesper Dangaard Brouer1-18/+3
Knowing the memory size backing the packet/xdp_frame data area, and knowing it already have reserved room for skb_shared_info, simplifies using build_skb significantly. With this change we no-longer lie about the SKB truesize, but more importantly a significant larger skb_tailroom is now provided, e.g. when drivers uses a full PAGE_SIZE. This extra tailroom (in linear area) can be used by the network stack when coalescing SKBs (e.g. in skb_try_coalesce, see TCP cases where tcp_queue_rcv() can 'eat' skb). Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/158945337822.97035.13557959180460986059.stgit@firesoul
2020-05-14xdp: Xdp_frame add member frame_sz and handle in convert_to_xdp_frameJesper Dangaard Brouer2-1/+21
Use hole in struct xdp_frame, when adding member frame_sz, which keeps same sizeof struct (32 bytes) Drivers ixgbe and sfc had bug cases where the necessary/expected tailroom was not reserved. This can lead to some hard to catch memory corruption issues. Having the drivers frame_sz this can be detected when packet length/end via xdp->data_end exceed the xdp_data_hard_end pointer, which accounts for the reserved the tailroom. When detecting this driver issue, simply fail the conversion with NULL, which results in feedback to driver (failing xdp_do_redirect()) causing driver to drop packet. Given the lack of consistent XDP stats, this can be hard to troubleshoot. And given this is a driver bug, we want to generate some more noise in form of a WARN stack dump (to ID the driver code that inlined convert_to_xdp_frame). Inlining the WARN macro is problematic, because it adds an asm instruction (on Intel CPUs ud2) what influence instruction cache prefetching. Thus, introduce xdp_warn and macro XDP_WARN, to avoid this and at the same time make identifying the function and line of this inlined function easier. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/158945337313.97035.10015729316710496600.stgit@firesoul
2020-05-14net: XDP-generic determining XDP frame sizeJesper Dangaard Brouer1-6/+8
The SKB "head" pointer points to the data area that contains skb_shared_info, that can be found via skb_end_pointer(). Given xdp->data_hard_start have been established (basically pointing to skb->head), frame size is between skb_end_pointer() and data_hard_start, plus the size reserved to skb_shared_info. Change the bpf_xdp_adjust_tail offset adjust of skb->len, to be a positive offset number on grow, and negative number on shrink. As this seems more natural when reading the code. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/158945336804.97035.7164852191163722056.stgit@firesoul
2020-05-14net: netsec: Add support for XDP frame sizeIlias Apalodimas1-12/+18
This driver takes advantage of page_pool PP_FLAG_DMA_SYNC_DEV that can help reduce the number of cache-lines that need to be flushed when doing DMA sync for_device. Due to xdp_adjust_tail can grow the area accessible to the by the CPU (can possibly write into), then max sync length *after* bpf_prog_run_xdp() needs to be taken into account. For XDP_TX action the driver is smart and does DMA-sync. When growing tail this is still safe, because page_pool have DMA-mapped the entire page size. Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://lore.kernel.org/bpf/158945336295.97035.15034759661036971024.stgit@firesoul
2020-05-14mvneta: Add XDP frame size to driverJesper Dangaard Brouer1-10/+15
This marvell driver mvneta uses PAGE_SIZE frames, which makes it really easy to convert. Driver updates rxq and now frame_sz once per NAPI call. This driver takes advantage of page_pool PP_FLAG_DMA_SYNC_DEV that can help reduce the number of cache-lines that need to be flushed when doing DMA sync for_device. Due to xdp_adjust_tail can grow the area accessible to the by the CPU (can possibly write into), then max sync length *after* bpf_prog_run_xdp() needs to be taken into account. For XDP_TX action the driver is smart and does DMA-sync. When growing tail this is still safe, because page_pool have DMA-mapped the entire page size. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Lorenzo Bianconi <lorenzo@kernel.org> Cc: thomas.petazzoni@bootlin.com Link: https://lore.kernel.org/bpf/158945335786.97035.12714388304493736747.stgit@firesoul
2020-05-14sfc: Add XDP frame sizeJesper Dangaard Brouer1-0/+1
This driver uses RX page-split when possible. It was recently fixed in commit 86e85bf6981c ("sfc: fix XDP-redirect in this driver") to add needed tailroom for XDP-redirect. After the fix efx->rx_page_buf_step is the frame size, with enough head and tail-room for XDP-redirect. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/158945335278.97035.14611425333184621652.stgit@firesoul
2020-05-14bnxt: Add XDP frame size to driverJesper Dangaard Brouer1-0/+1
This driver uses full PAGE_SIZE pages when XDP is enabled. In case of XDP uses driver uses __bnxt_alloc_rx_page which does full page DMA-map. Thus, xdp_adjust_tail grow is DMA compliant for XDP_TX action that does DMA-sync. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Andy Gospodarek <gospo@broadcom.com> Cc: Michael Chan <michael.chan@broadcom.com> Cc: Andy Gospodarek <andrew.gospodarek@broadcom.com> Link: https://lore.kernel.org/bpf/158945334769.97035.13437970179897613984.stgit@firesoul
2020-05-14xdp: Add frame size to xdp_buffJesper Dangaard Brouer1-0/+13
XDP have evolved to support several frame sizes, but xdp_buff was not updated with this information. The frame size (frame_sz) member of xdp_buff is introduced to know the real size of the memory the frame is delivered in. When introducing this also make it clear that some tailroom is reserved/required when creating SKBs using build_skb(). It would also have been an option to introduce a pointer to data_hard_end (with reserved offset). The advantage with frame_sz is that (like rxq) drivers only need to setup/assign this value once per NAPI cycle. Due to XDP-generic (and some drivers) it's not possible to store frame_sz inside xdp_rxq_info, because it's varies per packet as it can be based/depend on packet length. V2: nitpick: deduct -> deduce Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/158945334261.97035.555255657490688547.stgit@firesoul
2020-05-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller139-544/+5253
Alexei Starovoitov says: ==================== pull-request: bpf-next 2020-05-14 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) Merged tag 'perf-for-bpf-2020-05-06' from tip tree that includes CAP_PERFMON. 2) support for narrow loads in bpf_sock_addr progs and additional helpers in cg-skb progs, from Andrey. 3) bpf benchmark runner, from Andrii. 4) arm and riscv JIT optimizations, from Luke. 5) bpf iterator infrastructure, from Yonghong. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-15Merge tag 'drm-intel-fixes-2020-05-13-1' of git://anongit.freedesktop.org/drm/drm-intel into drm-fixesDave Airlie8-27/+79
- Handle idling during i915_gem_evict_something busy loops (Chris) - Mark current submissions with a weak-dependency (Chris) - Propagate errror from completed fences (Chris) - Fixes on execlist to avoid GPU hang situation (Chris) - Fixes couple deadlocks (Chris) - Timeslice preemption fixes (Chris) - Fix Display Port interrupt handling on Tiger Lake (Imre) - Reduce debug noise around Frame Buffer Compression +(Peter) - Fix logic around IPC W/a for Coffee Lake and Kaby Lake +(Sultan) - Avoid dereferencing a dead context (Chris) Signed-off-by: Dave Airlie <airlied@redhat.com> From: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20200514040235.GA2164266@intel.com
2020-05-14Merge branch 'expand-cg_skb-helpers'Alexei Starovoitov7-24/+386
Andrey Ignatov says: ==================== v2->v3: - better documentation for bpf_sk_cgroup_id in uapi (Yonghong Song) - save/restore errno in network helpers (Yonghong Song) - cleanup leftover after switching selftest to skeleton (Yonghong Song) - switch from map to skel->bss in selftest (Yonghong Song) v1->v2: - switch selftests to skeleton. This patch set allows a bunch of existing sk lookup and skb cgroup id helpers, and adds two new bpf_sk_{,ancestor_}cgroup_id helpers to be used in cgroup skb programs. It fills the gap to cover a use-case to apply intra-host cgroup-bpf network policy based on a source cgroup a packet comes from. For example, there can be multiple containers A, B, C running on a host. Every such container runs in its own cgroup that can have multiple sub-cgroups. But all these containers can share some IP addresses. At the same time container A wants to have a policy for a server S running in it so that only clients from this same container can connect to S, but not from other containers (such as B, C). Source IP address can't be used to decide whether to allow or deny a packet, but it looks reasonable to filter by cgroup id. The patch set allows to implement the following policy: * when an ingress packet comes to container's cgroup, lookup peer (client) socket this packet comes from; * having peer socket, get its cgroup id; * compare peer cgroup id with self cgroup id and allow packet only if they match, i.e. it comes from same cgroup; * the "sub-cgroup" part of the story can be addressed by getting not direct cgroup id of the peer socket, but ancestor cgroup id on specified level, similar to existing "ancestor" flavors of cgroup id helpers. A newly introduced selftest implements such a policy in its basic form to provide a better idea on the use-case. Patch 1 allows existing sk lookup helpers in cgroup skb. Patch 2 allows skb_ancestor_cgroup_id in cgrou skb. Patch 3 introduces two new helpers to get cgroup id of socket. Patch 4 extends network helpers to use them in the next patch. Patch 5 adds selftest / example of use-case. ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-05-14selftests/bpf: Test for sk helpers in cgroup skbAndrey Ignatov2-0/+192
Test bpf_sk_lookup_tcp, bpf_sk_release, bpf_sk_cgroup_id and bpf_sk_ancestor_cgroup_id helpers from cgroup skb program. The test creates a testing cgroup, starts a TCPv6 server inside the cgroup and creates two client sockets: one inside testing cgroup and one outside. Then it attaches cgroup skb program to the cgroup that checks all TCP segments coming to the server and allows only those coming from the cgroup of the server. If a segment comes from a peer outside of the cgroup, it'll be dropped. Finally the test checks that client from inside testing cgroup can successfully connect to the server, but client outside the cgroup fails to connect by timeout. The main goal of the test is to check newly introduced bpf_sk_{,ancestor_}cgroup_id helpers. It also checks a couple of socket lookup helpers (tcp & release), but lookup helpers were introduced much earlier and covered by other tests. Here it's mostly checked that they can be called from cgroup skb. Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/171f4c5d75e8ff4fe1c4e8c1c12288b5240a4549.1589486450.git.rdna@fb.com
2020-05-14selftests/bpf: Add connect_fd_to_fd, connect_wait net helpersAndrey Ignatov2-13/+63
Add two new network helpers. connect_fd_to_fd connects an already created client socket fd to address of server fd. Sometimes it's useful to separate client socket creation and connecting this socket to a server, e.g. if client socket has to be created in a cgroup different from that of server cgroup. Additionally connect_to_fd is now implemented using connect_fd_to_fd, both helpers don't treat EINPROGRESS as an error and let caller decide how to proceed with it. connect_wait is a helper to work with non-blocking client sockets so that if connect_to_fd or connect_fd_to_fd returned -1 with errno == EINPROGRESS, caller can wait for connect to finish or for connection timeout. The helper returns -1 on error, 0 on timeout (1sec, hard-coded), and positive number on success. Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/1403fab72300f379ca97ead4820ae43eac4414ef.1589486450.git.rdna@fb.com
2020-05-14bpf: Introduce bpf_sk_{, ancestor_}cgroup_id helpersAndrey Ignatov3-11/+121
With having ability to lookup sockets in cgroup skb programs it becomes useful to access cgroup id of retrieved sockets so that policies can be implemented based on origin cgroup of such socket. For example, a container running in a cgroup can have cgroup skb ingress program that can lookup peer socket that is sending packets to a process inside the container and decide whether those packets should be allowed or denied based on cgroup id of the peer. More specifically such ingress program can implement intra-host policy "allow incoming packets only from this same container and not from any other container on same host" w/o relying on source IP addresses since quite often it can be the case that containers share same IP address on the host. Introduce two new helpers for this use-case: bpf_sk_cgroup_id() and bpf_sk_ancestor_cgroup_id(). These helpers are similar to existing bpf_skb_{,ancestor_}cgroup_id helpers with the only difference that sk is used to get cgroup id instead of skb, and share code with them. See documentation in UAPI for more details. Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/f5884981249ce911f63e9b57ecd5d7d19154ff39.1589486450.git.rdna@fb.com