linux-dev/tools/testing/selftests/bpf/Makefile, branch master

selftests/bpf: Add liburandom_read.so to TEST_GEN_FILES

2022-09-22T20:54:39Z

Added urandom_read shared lib is missing from the list of installed files what makes urandom_read test after `make install` or `make gen_tar` broken. Add the library to TEST_GEN_FILES. The names in the list do not contain $(OUTPUT) since it's added by lib.mk code. Fixes: 00a0fa2d7d49 ("selftests/bpf: Add urandom_read shared lib and USDTs") Signed-off-by: Yauheni Kaliuta Signed-off-by: Andrii Nakryiko Link: https://lore.kernel.org/bpf/20220920161409.129953-1-ykaliuta@redhat.com

selftests/bpf: Add test for bpf_verify_pkcs7_signature() kfunc

2022-09-22T00:33:42Z

Perform several tests to ensure the correct implementation of the bpf_verify_pkcs7_signature() kfunc. Do the tests with data signed with a generated testing key (by using sign-file from scripts/) and with the tcp_bic.ko kernel module if it is found in the system. The test does not fail if tcp_bic.ko is not found. First, perform an unsuccessful signature verification without data. Second, perform a successful signature verification with the session keyring and a new one created for testing. Then, ensure that permission and validation checks are done properly on the keyring provided to bpf_verify_pkcs7_signature(), despite those checks were deferred at the time the keyring was retrieved with bpf_lookup_user_key(). The tests expect to encounter an error if the Search permission is removed from the keyring, or the keyring is expired. Finally, perform a successful and unsuccessful signature verification with the keyrings with pre-determined IDs (the last test fails because the key is not in the platform keyring). The test is currently in the deny list for s390x (JIT does not support calling kernel function). Signed-off-by: Roberto Sassu Link: https://lore.kernel.org/r/20220920075951.929132-13-roberto.sassu@huaweicloud.com Signed-off-by: Alexei Starovoitov

selftests/bpf: Add veristat tool for mass-verifying BPF object files

2022-09-16T20:41:41Z

Add a small tool, veristat, that allows mass-verification of a set of *libbpf-compatible* BPF ELF object files. For each such object file, veristat will attempt to verify each BPF program *individually*. Regardless of success or failure, it parses BPF verifier stats and outputs them in human-readable table format. In the future we can also add CSV and JSON output for more scriptable post-processing, if necessary. veristat allows to specify a set of stats that should be output and ordering between multiple objects and files (e.g., so that one can easily order by total instructions processed, instead of default file name, prog name, verdict, total instructions order). This tool should be useful for validating various BPF verifier changes or even validating different kernel versions for regressions. Here's an example for some of the heaviest selftests/bpf BPF object files: $ sudo ./veristat -s insns,file,prog {pyperf,loop,test_verif_scale,strobemeta,test_cls_redirect,profiler}*.linked3.o File Program Verdict Duration, us Total insns Total states Peak states ------------------------------------ ------------------------------------ ------- ------------ ----------- ------------ ----------- loop3.linked3.o while_true failure 350990 1000001 9663 9663 test_verif_scale3.linked3.o balancer_ingress success 115244 845499 8636 2141 test_verif_scale2.linked3.o balancer_ingress success 77688 773445 3048 788 pyperf600.linked3.o on_event success 2079872 624585 30335 30241 pyperf600_nounroll.linked3.o on_event success 353972 568128 37101 2115 strobemeta.linked3.o on_event success 455230 557149 15915 13537 test_verif_scale1.linked3.o balancer_ingress success 89880 554754 8636 2141 strobemeta_nounroll2.linked3.o on_event success 433906 501725 17087 1912 loop6.linked3.o trace_virtqueue_add_sgs success 282205 398057 8717 919 loop1.linked3.o nested_loops success 125630 361349 5504 5504 pyperf180.linked3.o on_event success 2511740 160398 11470 11446 pyperf100.linked3.o on_event success 744329 87681 6213 6191 test_cls_redirect.linked3.o cls_redirect success 54087 78925 4782 903 strobemeta_subprogs.linked3.o on_event success 57898 65420 1954 403 test_cls_redirect_subprogs.linked3.o cls_redirect success 54522 64965 4619 958 strobemeta_nounroll1.linked3.o on_event success 43313 57240 1757 382 pyperf50.linked3.o on_event success 194355 46378 3263 3241 profiler2.linked3.o tracepoint__syscalls__sys_enter_kill success 23869 43372 1423 542 pyperf_subprogs.linked3.o on_event success 29179 36358 2499 2499 profiler1.linked3.o tracepoint__syscalls__sys_enter_kill success 13052 27036 1946 936 profiler3.linked3.o tracepoint__syscalls__sys_enter_kill success 21023 26016 2186 915 profiler2.linked3.o kprobe__vfs_link success 5255 13896 303 271 profiler1.linked3.o kprobe__vfs_link success 7792 12687 1042 1041 profiler3.linked3.o kprobe__vfs_link success 7332 10601 865 865 profiler2.linked3.o kprobe_ret__do_filp_open success 3417 8900 216 199 profiler2.linked3.o kprobe__vfs_symlink success 3548 8775 203 186 pyperf_global.linked3.o on_event success 10007 7563 520 520 profiler3.linked3.o kprobe_ret__do_filp_open success 4708 6464 532 532 profiler1.linked3.o kprobe_ret__do_filp_open success 3090 6445 508 508 profiler3.linked3.o kprobe__vfs_symlink success 4477 6358 521 521 profiler1.linked3.o kprobe__vfs_symlink success 3381 6347 507 507 profiler2.linked3.o raw_tracepoint__sched_process_exec success 2464 5874 292 189 profiler3.linked3.o raw_tracepoint__sched_process_exec success 2677 4363 397 283 profiler2.linked3.o kprobe__proc_sys_write success 1800 4355 143 138 profiler1.linked3.o raw_tracepoint__sched_process_exec success 1649 4019 333 240 pyperf600_bpf_loop.linked3.o on_event success 2711 3966 306 306 profiler2.linked3.o raw_tracepoint__sched_process_exit success 1234 3138 83 66 profiler3.linked3.o kprobe__proc_sys_write success 1755 2623 223 223 profiler1.linked3.o kprobe__proc_sys_write success 1222 2456 193 193 loop2.linked3.o while_true success 608 1783 57 30 profiler3.linked3.o raw_tracepoint__sched_process_exit success 789 1680 146 146 profiler1.linked3.o raw_tracepoint__sched_process_exit success 592 1526 133 133 strobemeta_bpf_loop.linked3.o on_event success 1015 1512 106 106 loop4.linked3.o combinations success 165 524 18 17 profiler3.linked3.o raw_tracepoint__sched_process_fork success 196 299 25 25 profiler1.linked3.o raw_tracepoint__sched_process_fork success 109 265 19 19 profiler2.linked3.o raw_tracepoint__sched_process_fork success 111 265 19 19 loop5.linked3.o while_true success 47 84 9 9 ------------------------------------ ------------------------------------ ------- ------------ ----------- ------------ ----------- Signed-off-by: Andrii Nakryiko Signed-off-by: Daniel Borkmann Link: https://lore.kernel.org/bpf/20220909193053.577111-4-andrii@kernel.org

selftests/bpf: regroup and declare similar kfuncs selftests in an array

2022-09-07T17:57:28Z

Similar to tools/testing/selftests/bpf/prog_tests/dynptr.c: we declare an array of tests that we run one by one in a for loop. Followup patches will add more similar-ish tests, so avoid a lot of copy paste by grouping the declaration in an array. For light skeletons, we have to rely on the offsetof() macro so we can statically declare which program we are using. In the libbpf case, we can rely on bpf_object__find_program_by_name(). So also change the Makefile to generate both light skeletons and normal ones. Signed-off-by: Benjamin Tissoires Acked-by: Kumar Kartikeya Dwivedi Link: https://lore.kernel.org/r/20220906151303.2780789-2-benjamin.tissoires@redhat.com Signed-off-by: Alexei Starovoitov

selftests/bpf: Store BPF object files with .bpf.o extension

2022-09-02T13:55:37Z

BPF object files are, in a way, the final artifact produced as part of the ahead-of-time compilation process. That makes them somewhat special compared to "regular" object files, which are a intermediate build artifacts that can typically be removed safely. As such, it can make sense to name them differently to make it easier to spot this difference at a glance. Among others, libbpf-bootstrap [0] has established the extension .bpf.o for BPF object files. It seems reasonable to follow this example and establish the same denomination for selftest build artifacts. To that end, this change adjusts the corresponding part of the build system and the test programs loading BPF object files to work with .bpf.o files. [0] https://github.com/libbpf/libbpf-bootstrap Suggested-by: Andrii Nakryiko Signed-off-by: Daniel Müller Signed-off-by: Daniel Borkmann Link: https://lore.kernel.org/bpf/20220901222253.1199242-1-deso@posteo.net

selftests/bpf: Make sure bpf_{g,s}et_retval is exposed everywhere

2022-08-23T23:08:22Z

For each hook, have a simple bpf_set_retval(bpf_get_retval) program and make sure it loads for the hooks we want. The exceptions are the hooks which don't propagate the error to the callers: - sockops - recvmsg - getpeername - getsockname - cg_skb ingress and egress Acked-by: Martin KaFai Lau Signed-off-by: Stanislav Fomichev Link: https://lore.kernel.org/r/20220823222555.523590-6-sdf@google.com Signed-off-by: Alexei Starovoitov

selftests, xsk: Rename AF_XDP testing app

2022-07-08T12:22:15Z

Recently, xsk part of libbpf was moved to selftests/bpf directory and lives on its own because there is an AF_XDP testing application that needs it called xdpxceiver. That name makes it a bit hard to indicate who maintains it as there are other XDP samples in there, whereas this one is strictly about AF_XDP. Do s/xdpxceiver/xskxceiver so that it will be easier to figure out who maintains it. A follow-up patch will correct MAINTAINERS file. Signed-off-by: Maciej Fijalkowski Signed-off-by: Daniel Borkmann Link: https://lore.kernel.org/bpf/20220707111613.49031-2-maciej.fijalkowski@intel.com

selftests/bpf: Add benchmark for local_storage RCU Tasks Trace usage

2022-07-07T14:35:21Z

This benchmark measures grace period latency and kthread cpu usage of RCU Tasks Trace when many processes are creating/deleting BPF local_storage. Intent here is to quantify improvement on these metrics after Paul's recent RCU Tasks patches [0]. Specifically, fork 15k tasks which call a bpf prog that creates/destroys task local_storage and sleep in a loop, resulting in many call_rcu_tasks_trace calls. To determine grace period latency, trace time elapsed between rcu_tasks_trace_pregp_step and rcu_tasks_trace_postgp; for cpu usage look at rcu_task_trace_kthread's stime in /proc/PID/stat. On my virtualized test environment (Skylake, 8 cpus) benchmark results demonstrate significant improvement: BEFORE Paul's patches: SUMMARY tasks_trace grace period latency avg 22298.551 us stddev 1302.165 us SUMMARY ticks per tasks_trace grace period avg 2.291 stddev 0.324 AFTER Paul's patches: SUMMARY tasks_trace grace period latency avg 16969.197 us stddev 2525.053 us SUMMARY ticks per tasks_trace grace period avg 1.146 stddev 0.178 Note that since these patches are not in bpf-next benchmarking was done by cherry-picking this patch onto rcu tree. [0] https://lore.kernel.org/rcu/20220620225402.GA3842369@paulmck-ThinkPad-P17-Gen-1/ Signed-off-by: Dave Marchevsky Signed-off-by: Daniel Borkmann Acked-by: Paul E. McKenney Acked-by: Martin KaFai Lau Link: https://lore.kernel.org/bpf/20220705190018.3239050-1-davemarchevsky@fb.com

libbpf: move xsk.{c,h} into selftests/bpf

2022-06-28T20:13:32Z

Remove deprecated xsk APIs from libbpf. But given we have selftests relying on this, move those files (with minimal adjustments to make them compilable) under selftests/bpf. We also remove all the removed APIs from libbpf.map, while overall keeping version inheritance chain, as most APIs are backwards compatible so there is no need to reassign them as LIBBPF_1.0.0 versions. Cc: Magnus Karlsson Signed-off-by: Andrii Nakryiko Link: https://lore.kernel.org/r/20220627211527.2245459-2-andrii@kernel.org Signed-off-by: Alexei Starovoitov

selftests/bpf: Add benchmark for local_storage get

2022-06-23T02:14:33Z

Add a benchmarks to demonstrate the performance cliff for local_storage get as the number of local_storage maps increases beyond current local_storage implementation's cache size. "sequential get" and "interleaved get" benchmarks are added, both of which do many bpf_task_storage_get calls on sets of task local_storage maps of various counts, while considering a single specific map to be 'important' and counting task_storage_gets to the important map separately in addition to normal 'hits' count of all gets. Goal here is to mimic scenario where a particular program using one map - the important one - is running on a system where many other local_storage maps exist and are accessed often. While "sequential get" benchmark does bpf_task_storage_get for map 0, 1, ..., {9, 99, 999} in order, "interleaved" benchmark interleaves 4 bpf_task_storage_gets for the important map for every 10 map gets. This is meant to highlight performance differences when important map is accessed far more frequently than non-important maps. A "hashmap control" benchmark is also included for easy comparison of standard bpf hashmap lookup vs local_storage get. The benchmark is similar to "sequential get", but creates and uses BPF_MAP_TYPE_HASH instead of local storage. Only one inner map is created - a hashmap meant to hold tid -> data mapping for all tasks. Size of the hashmap is hardcoded to my system's PID_MAX_LIMIT (4,194,304). The number of these keys which are actually fetched as part of the benchmark is configurable. Addition of this benchmark is inspired by conversation with Alexei in a previous patchset's thread [0], which highlighted the need for such a benchmark to motivate and validate improvements to local_storage implementation. My approach in that series focused on improving performance for explicitly-marked 'important' maps and was rejected with feedback to make more generally-applicable improvements while avoiding explicitly marking maps as important. Thus the benchmark reports both general and important-map-focused metrics, so effect of future work on both is clear. Regarding the benchmark results. On a powerful system (Skylake, 20 cores, 256gb ram): Hashmap Control =============== num keys: 10 hashmap (control) sequential get: hits throughput: 20.900 ± 0.334 M ops/s, hits latency: 47.847 ns/op, important_hits throughput: 20.900 ± 0.334 M ops/s num keys: 1000 hashmap (control) sequential get: hits throughput: 13.758 ± 0.219 M ops/s, hits latency: 72.683 ns/op, important_hits throughput: 13.758 ± 0.219 M ops/s num keys: 10000 hashmap (control) sequential get: hits throughput: 6.995 ± 0.034 M ops/s, hits latency: 142.959 ns/op, important_hits throughput: 6.995 ± 0.034 M ops/s num keys: 100000 hashmap (control) sequential get: hits throughput: 4.452 ± 0.371 M ops/s, hits latency: 224.635 ns/op, important_hits throughput: 4.452 ± 0.371 M ops/s num keys: 4194304 hashmap (control) sequential get: hits throughput: 3.043 ± 0.033 M ops/s, hits latency: 328.587 ns/op, important_hits throughput: 3.043 ± 0.033 M ops/s Local Storage ============= num_maps: 1 local_storage cache sequential get: hits throughput: 47.298 ± 0.180 M ops/s, hits latency: 21.142 ns/op, important_hits throughput: 47.298 ± 0.180 M ops/s local_storage cache interleaved get: hits throughput: 55.277 ± 0.888 M ops/s, hits latency: 18.091 ns/op, important_hits throughput: 55.277 ± 0.888 M ops/s num_maps: 10 local_storage cache sequential get: hits throughput: 40.240 ± 0.802 M ops/s, hits latency: 24.851 ns/op, important_hits throughput: 4.024 ± 0.080 M ops/s local_storage cache interleaved get: hits throughput: 48.701 ± 0.722 M ops/s, hits latency: 20.533 ns/op, important_hits throughput: 17.393 ± 0.258 M ops/s num_maps: 16 local_storage cache sequential get: hits throughput: 44.515 ± 0.708 M ops/s, hits latency: 22.464 ns/op, important_hits throughput: 2.782 ± 0.044 M ops/s local_storage cache interleaved get: hits throughput: 49.553 ± 2.260 M ops/s, hits latency: 20.181 ns/op, important_hits throughput: 15.767 ± 0.719 M ops/s num_maps: 17 local_storage cache sequential get: hits throughput: 38.778 ± 0.302 M ops/s, hits latency: 25.788 ns/op, important_hits throughput: 2.284 ± 0.018 M ops/s local_storage cache interleaved get: hits throughput: 43.848 ± 1.023 M ops/s, hits latency: 22.806 ns/op, important_hits throughput: 13.349 ± 0.311 M ops/s num_maps: 24 local_storage cache sequential get: hits throughput: 19.317 ± 0.568 M ops/s, hits latency: 51.769 ns/op, important_hits throughput: 0.806 ± 0.024 M ops/s local_storage cache interleaved get: hits throughput: 24.397 ± 0.272 M ops/s, hits latency: 40.989 ns/op, important_hits throughput: 6.863 ± 0.077 M ops/s num_maps: 32 local_storage cache sequential get: hits throughput: 13.333 ± 0.135 M ops/s, hits latency: 75.000 ns/op, important_hits throughput: 0.417 ± 0.004 M ops/s local_storage cache interleaved get: hits throughput: 16.898 ± 0.383 M ops/s, hits latency: 59.178 ns/op, important_hits throughput: 4.717 ± 0.107 M ops/s num_maps: 100 local_storage cache sequential get: hits throughput: 6.360 ± 0.107 M ops/s, hits latency: 157.233 ns/op, important_hits throughput: 0.064 ± 0.001 M ops/s local_storage cache interleaved get: hits throughput: 7.303 ± 0.362 M ops/s, hits latency: 136.930 ns/op, important_hits throughput: 1.907 ± 0.094 M ops/s num_maps: 1000 local_storage cache sequential get: hits throughput: 0.452 ± 0.010 M ops/s, hits latency: 2214.022 ns/op, important_hits throughput: 0.000 ± 0.000 M ops/s local_storage cache interleaved get: hits throughput: 0.542 ± 0.007 M ops/s, hits latency: 1843.341 ns/op, important_hits throughput: 0.136 ± 0.002 M ops/s Looking at the "sequential get" results, it's clear that as the number of task local_storage maps grows beyond the current cache size (16), there's a significant reduction in hits throughput. Note that current local_storage implementation assigns a cache_idx to maps as they are created. Since "sequential get" is creating maps 0..n in order and then doing bpf_task_storage_get calls in the same order, the benchmark is effectively ensuring that a map will not be in cache when the program tries to access it. For "interleaved get" results, important-map hits throughput is greatly increased as the important map is more likely to be in cache by virtue of being accessed far more frequently. Throughput still reduces as # maps increases, though. To get a sense of the overhead of the benchmark program, I commented out bpf_task_storage_get/bpf_map_lookup_elem in local_storage_bench.c and ran the benchmark on the same host as the 'real' run. Results: Hashmap Control =============== num keys: 10 hashmap (control) sequential get: hits throughput: 54.288 ± 0.655 M ops/s, hits latency: 18.420 ns/op, important_hits throughput: 54.288 ± 0.655 M ops/s num keys: 1000 hashmap (control) sequential get: hits throughput: 52.913 ± 0.519 M ops/s, hits latency: 18.899 ns/op, important_hits throughput: 52.913 ± 0.519 M ops/s num keys: 10000 hashmap (control) sequential get: hits throughput: 53.480 ± 1.235 M ops/s, hits latency: 18.699 ns/op, important_hits throughput: 53.480 ± 1.235 M ops/s num keys: 100000 hashmap (control) sequential get: hits throughput: 54.982 ± 1.902 M ops/s, hits latency: 18.188 ns/op, important_hits throughput: 54.982 ± 1.902 M ops/s num keys: 4194304 hashmap (control) sequential get: hits throughput: 50.858 ± 0.707 M ops/s, hits latency: 19.662 ns/op, important_hits throughput: 50.858 ± 0.707 M ops/s Local Storage ============= num_maps: 1 local_storage cache sequential get: hits throughput: 110.990 ± 4.828 M ops/s, hits latency: 9.010 ns/op, important_hits throughput: 110.990 ± 4.828 M ops/s local_storage cache interleaved get: hits throughput: 161.057 ± 4.090 M ops/s, hits latency: 6.209 ns/op, important_hits throughput: 161.057 ± 4.090 M ops/s num_maps: 10 local_storage cache sequential get: hits throughput: 112.930 ± 1.079 M ops/s, hits latency: 8.855 ns/op, important_hits throughput: 11.293 ± 0.108 M ops/s local_storage cache interleaved get: hits throughput: 115.841 ± 2.088 M ops/s, hits latency: 8.633 ns/op, important_hits throughput: 41.372 ± 0.746 M ops/s num_maps: 16 local_storage cache sequential get: hits throughput: 115.653 ± 0.416 M ops/s, hits latency: 8.647 ns/op, important_hits throughput: 7.228 ± 0.026 M ops/s local_storage cache interleaved get: hits throughput: 138.717 ± 1.649 M ops/s, hits latency: 7.209 ns/op, important_hits throughput: 44.137 ± 0.525 M ops/s num_maps: 17 local_storage cache sequential get: hits throughput: 112.020 ± 1.649 M ops/s, hits latency: 8.927 ns/op, important_hits throughput: 6.598 ± 0.097 M ops/s local_storage cache interleaved get: hits throughput: 128.089 ± 1.960 M ops/s, hits latency: 7.807 ns/op, important_hits throughput: 38.995 ± 0.597 M ops/s num_maps: 24 local_storage cache sequential get: hits throughput: 92.447 ± 5.170 M ops/s, hits latency: 10.817 ns/op, important_hits throughput: 3.855 ± 0.216 M ops/s local_storage cache interleaved get: hits throughput: 128.844 ± 2.808 M ops/s, hits latency: 7.761 ns/op, important_hits throughput: 36.245 ± 0.790 M ops/s num_maps: 32 local_storage cache sequential get: hits throughput: 102.042 ± 1.462 M ops/s, hits latency: 9.800 ns/op, important_hits throughput: 3.194 ± 0.046 M ops/s local_storage cache interleaved get: hits throughput: 126.577 ± 1.818 M ops/s, hits latency: 7.900 ns/op, important_hits throughput: 35.332 ± 0.507 M ops/s num_maps: 100 local_storage cache sequential get: hits throughput: 111.327 ± 1.401 M ops/s, hits latency: 8.983 ns/op, important_hits throughput: 1.113 ± 0.014 M ops/s local_storage cache interleaved get: hits throughput: 131.327 ± 1.339 M ops/s, hits latency: 7.615 ns/op, important_hits throughput: 34.302 ± 0.350 M ops/s num_maps: 1000 local_storage cache sequential get: hits throughput: 101.978 ± 0.563 M ops/s, hits latency: 9.806 ns/op, important_hits throughput: 0.102 ± 0.001 M ops/s local_storage cache interleaved get: hits throughput: 141.084 ± 1.098 M ops/s, hits latency: 7.088 ns/op, important_hits throughput: 35.430 ± 0.276 M ops/s Adjusting for overhead, latency numbers for "hashmap control" and "sequential get" are: hashmap_control_1k: ~53.8ns hashmap_control_10k: ~124.2ns hashmap_control_100k: ~206.5ns sequential_get_1: ~12.1ns sequential_get_10: ~16.0ns sequential_get_16: ~13.8ns sequential_get_17: ~16.8ns sequential_get_24: ~40.9ns sequential_get_32: ~65.2ns sequential_get_100: ~148.2ns sequential_get_1000: ~2204ns Clearly demonstrating a cliff. In the discussion for v1 of this patch, Alexei noted that local_storage was 2.5x faster than a large hashmap when initially implemented [1]. The benchmark results show that local_storage is 5-10x faster: a long-running BPF application putting some pid-specific info into a hashmap for each pid it sees will probably see on the order of 10-100k pids. Bench numbers for hashmaps of this size are ~10x slower than sequential_get_16, but as the number of local_storage maps grows far past local_storage cache size the performance advantage shrinks and eventually reverses. When running the benchmarks it may be necessary to bump 'open files' ulimit for a successful run. [0]: https://lore.kernel.org/all/20220420002143.1096548-1-davemarchevsky@fb.com [1]: https://lore.kernel.org/bpf/20220511173305.ftldpn23m4ski3d3@MBP-98dd607d3435.dhcp.thefacebook.com/ Signed-off-by: Dave Marchevsky Link: https://lore.kernel.org/r/20220620222554.270578-1-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov