wireguard-linux - WireGuard for the Linux kernel

Age	Commit message (Collapse)	Author	Files	Lines
2023-04-21	selftests/bpf: add missing netfilter return value and ctx access tests	Florian Westphal	3	-0/+174
	Extend prog_tests with two test cases: # ./test_progs --allow=verifier_netfilter_retcode #278/1 verifier_netfilter_retcode/bpf_exit with invalid return code. test1:OK #278/2 verifier_netfilter_retcode/bpf_exit with valid return code. test2:OK #278/3 verifier_netfilter_retcode/bpf_exit with valid return code. test3:OK #278/4 verifier_netfilter_retcode/bpf_exit with invalid return code. test4:OK #278 verifier_netfilter_retcode:OK This checks that only accept and drop (0,1) are permitted. NF_QUEUE could be implemented later if we can guarantee that attachment of such programs can be rejected if they get attached to a pf/hook that doesn't support async reinjection. NF_STOLEN could be implemented via trusted helpers that can guarantee that the skb will eventually be free'd. v4: test case for bpf_nf_ctx access checks, requested by Alexei Starovoitov. v5: also check ctx->{state,skb} can be dereferenced (Alexei). # ./test_progs --allow=verifier_netfilter_ctx #281/1 verifier_netfilter_ctx/netfilter invalid context access, size too short:OK #281/2 verifier_netfilter_ctx/netfilter invalid context access, size too short:OK #281/3 verifier_netfilter_ctx/netfilter invalid context access, past end of ctx:OK #281/4 verifier_netfilter_ctx/netfilter invalid context, write:OK #281/5 verifier_netfilter_ctx/netfilter valid context read and invalid write:OK #281/6 verifier_netfilter_ctx/netfilter test prog with skb and state read access:OK #281/7 verifier_netfilter_ctx/netfilter test prog with skb and state read access @unpriv:OK #281 verifier_netfilter_ctx:OK Summary: 1/7 PASSED, 0 SKIPPED, 0 FAILED This checks: 1/2: partial reads of ctx->{skb,state} are rejected 3. read access past sizeof(ctx) is rejected 4. write to ctx content, e.g. 'ctx->skb = NULL;' is rejected 5. ctx->state content cannot be altered 6. ctx->state and ctx->skb can be dereferenced 7. ... same program fails for unpriv (CAP_NET_ADMIN needed). Link: https://lore.kernel.org/bpf/20230419021152.sjq4gttphzzy6b5f@dhcp-172-26-102-232.dhcp.thefacebook.com/ Link: https://lore.kernel.org/bpf/20230420201655.77kkgi3dh7fesoll@MacBook-Pro-6.local/ Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20230421170300.24115-8-fw@strlen.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-21	bpf: add test_run support for netfilter program type	Florian Westphal	3	-0/+162
	add glue code so a bpf program can be run using userspace-provided netfilter state and packet/skb. Default is to use ipv4:output hook point, but this can be overridden by userspace. Userspace provided netfilter state is restricted, only hook and protocol families can be overridden and only to ipv4/ipv6. Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20230421170300.24115-7-fw@strlen.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-21	tools: bpftool: print netfilter link info	Florian Westphal	6	-0/+210
	Dump protocol family, hook and priority value: $ bpftool link 2: netfilter prog 14 ip input prio -128 pids install(3264) 5: netfilter prog 14 ip6 forward prio 21 pids a.out(3387) 9: netfilter prog 14 ip prerouting prio 123 pids a.out(5700) 10: netfilter prog 14 ip input prio 21 pids test2(5701) v2: Quentin Monnet suggested to also add 'bpftool net' support: $ bpftool net xdp: tc: flow_dissector: netfilter: ip prerouting prio 21 prog_id 14 ip input prio -128 prog_id 14 ip input prio 21 prog_id 14 ip forward prio 21 prog_id 14 ip output prio 21 prog_id 14 ip postrouting prio 21 prog_id 14 'bpftool net' only dumps netfilter link type, links are sorted by protocol family, hook and priority. v5: fix bpf ci failure: libbpf needs small update to prog_type_name[] and probe_prog_load helper. v4: don't fail with -EOPNOTSUPP in libbpf probe_prog_load, update prog_type_name[] with "netfilter" entry (bpf ci) v3: fix bpf.h copy, 'reserved' member was removed (Alexei) use p_err, not fprintf (Quentin) Suggested-by: Quentin Monnet <quentin@isovalent.com> Link: https://lore.kernel.org/bpf/eeeaac99-9053-90c2-aa33-cc1ecb1ae9ca@isovalent.com/ Reviewed-by: Quentin Monnet <quentin@isovalent.com> Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20230421170300.24115-6-fw@strlen.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-21	netfilter: disallow bpf hook attachment at same priority	Florian Westphal	1	-0/+12
	This is just to avoid ordering issues between multiple bpf programs, this could be removed later in case it turns out to be too cautious. bpf prog could still be shared with non-bpf hook, otherwise we'd have to make conntrack hook registration fail just because a bpf program has same priority. Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20230421170300.24115-5-fw@strlen.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-21	netfilter: nfnetlink hook: dump bpf prog id	Florian Westphal	2	-16/+89
	This allows userspace ("nft list hooks") to show which bpf program is attached to which hook. Without this, user only knows bpf prog is attached at prio x, y, z at INPUT and FORWARD, but can't tell which program is where. v4: kdoc fixups (Simon Horman) Link: https://lore.kernel.org/bpf/ZEELzpNCnYJuZyod@corigine.com/ Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20230421170300.24115-4-fw@strlen.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-21	bpf: minimal support for programs hooked into netfilter framework	Florian Westphal	6	-1/+88
	This adds minimal support for BPF_PROG_TYPE_NETFILTER bpf programs that will be invoked via the NF_HOOK() points in the ip stack. Invocation incurs an indirect call. This is not a necessity: Its possible to add 'DEFINE_BPF_DISPATCHER(nf_progs)' and handle the program invocation with the same method already done for xdp progs. This isn't done here to keep the size of this chunk down. Verifier restricts verdicts to either DROP or ACCEPT. Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20230421170300.24115-3-fw@strlen.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-21	bpf: add bpf_link support for BPF_NETFILTER programs	Florian Westphal	7	-0/+194
	Add bpf_link support skeleton. To keep this reviewable, no bpf program can be invoked yet, if a program is attached only a c-stub is called and not the actual bpf program. Defaults to 'y' if both netfilter and bpf syscall are enabled in kconfig. Uapi example usage: union bpf_attr attr = { }; attr.link_create.prog_fd = progfd; attr.link_create.attach_type = 0; /* unused */ attr.link_create.netfilter.pf = PF_INET; attr.link_create.netfilter.hooknum = NF_INET_LOCAL_IN; attr.link_create.netfilter.priority = -128; err = bpf(BPF_LINK_CREATE, &attr, sizeof(attr)); ... this would attach progfd to ipv4:input hook. Such hook gets removed automatically if the calling program exits. BPF_NETFILTER program invocation is added in followup change. NF_HOOK_OP_BPF enum will eventually be read from nfnetlink_hook, it allows to tell userspace which program is attached at the given hook when user runs 'nft hook list' command rather than just the priority and not-very-helpful 'this hook runs a bpf prog but I can't tell which one'. Will also be used to disallow registration of two bpf programs with same priority in a followup patch. v4: arm32 cmpxchg only supports 32bit operand s/prio/priority/ v3: restrict prog attachment to ip/ip6 for now, lets lift restrictions if more use cases pop up (arptables, ebtables, netdev ingress/egress etc). Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20230421170300.24115-2-fw@strlen.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-21	bpftool: Update doc to explain struct_ops register subcommand.	Kui-Feng Lee	1	-4/+8
	The "struct_ops register" subcommand now allows for an optional LINK_DIR to be included. This specifies the directory path where bpftool will pin struct_ops links with the same name as their corresponding map names. Signed-off-by: Kui-Feng Lee <kuifeng@meta.com> Reviewed-by: Quentin Monnet <quentin@isovalent.com> Link: https://lore.kernel.org/r/20230420002822.345222-2-kuifeng@meta.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-21	bpftool: Register struct_ops with a link.	Kui-Feng Lee	4	-25/+75
	You can include an optional path after specifying the object name for the 'struct_ops register' subcommand. Since the commit 226bc6ae6405 ("Merge branch 'Transit between BPF TCP congestion controls.'") has been accepted, it is now possible to create a link for a struct_ops. This can be done by defining a struct_ops in SEC(".struct_ops.link") to make libbpf returns a real link. If we don't pin the links before leaving bpftool, they will disappear. To instruct bpftool to pin the links in a directory with the names of the maps, we need to provide the path of that directory. Signed-off-by: Kui-Feng Lee <kuifeng@meta.com> Reviewed-by: Quentin Monnet <quentin@isovalent.com> Link: https://lore.kernel.org/r/20230420002822.345222-1-kuifeng@meta.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-21	selftests/bpf: Verify optval=NULL case	Stanislav Fomichev	2	-0/+40
	Make sure we get optlen exported instead of getting EFAULT. Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20230418225343.553806-3-sdf@google.com
2023-04-21	bpf: Don't EFAULT for getsockopt with optval=NULL	Stanislav Fomichev	1	-3/+6
	Some socket options do getsockopt with optval=NULL to estimate the size of the final buffer (which is returned via optlen). This breaks BPF getsockopt assumptions about permitted optval buffer size. Let's enforce these assumptions only when non-NULL optval is provided. Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks") Reported-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/ZD7Js4fj5YyI2oLd@google.com/T/#mb68daf700f87a9244a15d01d00c3f0e5b08f49f7 Link: https://lore.kernel.org/bpf/20230418225343.553806-2-sdf@google.com
2023-04-21	selftests/xsk: Put MAP_HUGE_2MB in correct argument	Magnus Karlsson	1	-7/+4
	Put the flag MAP_HUGE_2MB in the correct flags argument instead of the wrong offset argument. Fixes: 2ddade322925 ("selftests/xsk: Fix munmap for hugepage allocated umem") Reported-by: Kal Cutter Conley <kal.conley@dectris.com> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20230421062208.3772-1-magnus.karlsson@gmail.com
2023-04-21	bpf: Fix bpf_refcount_acquire's refcount_t address calculation	Dave Marchevsky	2	-5/+5
	When calculating the address of the refcount_t struct within a local kptr, bpf_refcount_acquire_impl should add refcount_off bytes to the address of the local kptr. Due to some missing parens, the function is incorrectly adding sizeof(refcount_t) * refcount_off bytes. This patch fixes the calculation. Due to the incorrect calculation, bpf_refcount_acquire_impl was trying to refcount_inc some memory well past the end of local kptrs, resulting in kasan and refcount complaints, as reported in [0]. In that thread, Florian and Eduard discovered that bpf selftests written in the new style - with __success and an expected __retval, specifically - were not actually being run. As a result, selftests added in bpf_refcount series weren't really exercising this behavior, and thus didn't unearth the bug. With this fixed behavior it's safe to revert commit 7c4b96c00043 ("selftests/bpf: disable program test run for progs/refcounted_kptr.c"), this patch does so. [0] https://lore.kernel.org/bpf/ZEEp+j22imoN6rn9@strlen.de/ Fixes: 7c50b1cb76ac ("bpf: Add bpf_refcount_acquire kfunc") Reported-by: Florian Westphal <fw@strlen.de> Reported-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20230421074431.3548349-1-davemarchevsky@fb.com
2023-04-21	bpf: Fix race between btf_put and btf_idr walk.	Alexei Starovoitov	1	-5/+3
	Florian and Eduard reported hard dead lock: [ 58.433327] _raw_spin_lock_irqsave+0x40/0x50 [ 58.433334] btf_put+0x43/0x90 [ 58.433338] bpf_find_btf_id+0x157/0x240 [ 58.433353] btf_parse_fields+0x921/0x11c0 This happens since btf->refcount can be 1 at the time of btf_put() and btf_put() will call btf_free_id() which will try to grab btf_idr_lock and will dead lock. Avoid the issue by doing btf_put() without locking. Fixes: 3d78417b60fb ("bpf: Add bpf_btf_find_by_name_kind() helper.") Fixes: 1e89106da253 ("bpf: Add bpf_core_add_cands() and wire it into bpf_core_apply_relo_insn().") Reported-by: Florian Westphal <fw@strlen.de> Reported-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20230421014901.70908-1-alexei.starovoitov@gmail.com
2023-04-20	selftests/bpf: populate map_array_ro map for verifier_array_access test	Eduard Zingerman	2	-5/+41
	Two test cases: - "valid read map access into a read-only array 1" and - "valid read map access into a read-only array 2" Expect that map_array_ro map is filled with mock data. This logic was not taken into acount during initial test conversion. This commit modifies prog_tests/verifier.c entry point for this test to fill the map. Fixes: a3c830ae0209 ("selftests/bpf: verifier/array_access.c converted to inline assembly") Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20230420232317.2181776-5-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-20	selftests/bpf: add pre bpf_prog_test_run_opts() callback for test_loader	Eduard Zingerman	2	-0/+17
	When a test case is annotated with __retval tag the test_loader engine would use libbpf's bpf_prog_test_run_opts() to do a test run of the program and compare retvals. This commit allows to perform arbitrary actions on bpf object right before test loader invokes bpf_prog_test_run_opts(). This could be used to setup some state for program execution, e.g. fill some maps. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20230420232317.2181776-4-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-20	selftests/bpf: fix __retval() being always ignored	Eduard Zingerman	2	-3/+3
	Florian Westphal found a bug in and suggested a fix for test_loader.c processing of __retval tag. Because of this bug the function test_loader.c:do_prog_test_run() never executed and all __retval test tags were ignored. If this bug is fixed a number of test cases from progs/verifier_array_access.c fail with retval not matching the expected value. This test was recently converted to use test_loader.c and inline assembly in [1]. When doing the conversion I missed the important detail of test_verifier.c operation: when it creates fixup_map_array_ro, fixup_map_array_wo and fixup_map_array_small it populates these maps with a dummy record. Disabling the __retval checks for the affected verifier_array_access in this commit to avoid false-postivies in any potential bisects. The issue is addressed in the next patch. I verified that the __retval tags are now respected by changing expected return values for all tests annotated with __retval, and checking that these tests started to fail. [1] https://lore.kernel.org/bpf/20230325025524.144043-1-eddyz87@gmail.com/ Fixes: 19a8e06f5f91 ("selftests/bpf: Tests execution support for test_loader.c") Reported-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/bpf/f4c4aee644425842ee6aa8edf1da68f0a8260e7c.camel@gmail.com/T/ Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20230420232317.2181776-3-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-20	selftests/bpf: disable program test run for progs/refcounted_kptr.c	Eduard Zingerman	1	-4/+4
	Florian Westphal found a bug in test_loader.c processing of __retval tag. Because of this bug the function test_loader.c:do_prog_test_run() never executed and all __retval test tags were ignored. This hid an issue with progs/refcounted_kptr.c tests. When __retval tag bug is fixed and refcounted_kptr.c tests are run kernel reports various issues and eventually hangs. Shortest reproducer is the following command run a few times: $ for i in $(seq 1 4); do (./test_progs --allow=refcounted_kptr &); done Commenting out __retval tags for these tests until this issue is resolved. Reported-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/bpf/f4c4aee644425842ee6aa8edf1da68f0a8260e7c.camel@gmail.com/T/ Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20230420232317.2181776-2-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-20	bpftool: Replace "__fallthrough" by a comment to address merge conflict	Quentin Monnet	1	-1/+1
	The recent support for inline annotations in control flow graphs generated by bpftool introduced the usage of the "__fallthrough" macro in a switch/case block in btf_dumper.c. This change went through the bpf-next tree, but resulted in a merge conflict in linux-next, because this macro has been renamed "fallthrough" (no underscores) in the meantime. To address the conflict, we temporarily switch to a simple comment instead of a macro. Related: commit f7a858bffcdd ("tools: Rename __fallthrough to fallthrough") Fixes: 9fd496848b1c ("bpftool: Support inline annotations when dumping the CFG of a program") Reported-by: Sven Schnelle <svens@linux.ibm.com> Reported-by: Thomas Richter <tmricht@linux.ibm.com> Suggested-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Quentin Monnet <quentin@isovalent.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/all/yt9dttxlwal7.fsf@linux.ibm.com/ Link: https://lore.kernel.org/bpf/20230412123636.2358949-1-tmricht@linux.ibm.com/ Link: https://lore.kernel.org/bpf/20230420003333.90901-1-quentin@isovalent.com
2023-04-19	selftests/bpf: Add test to access integer type of variable array	Feng Zhou	5	-0/+70
	Add prog test for accessing integer type of variable array in tracing program. In addition, hook load_balance function to access sd->span[0], only to confirm whether the load is successful. Because there is no direct way to trigger load_balance call. Co-developed-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Feng Zhou <zhoufeng.zf@bytedance.com> Link: https://lore.kernel.org/r/20230420032735.27760-3-zhoufeng.zf@bytedance.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-19	bpf: support access variable length array of integer type	Feng Zhou	1	-3/+5
	After this commit: bpf: Support variable length array in tracing programs (9c5f8a1008a1) Trace programs can access variable length array, but for structure type. This patch adds support for integer type. Example: Hook load_balance struct sched_domain { ... unsigned long span[]; } The access: sd->span[0]. Co-developed-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Feng Zhou <zhoufeng.zf@bytedance.com> Link: https://lore.kernel.org/r/20230420032735.27760-2-zhoufeng.zf@bytedance.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-19	selftests/xsk: Fix munmap for hugepage allocated umem	Magnus Karlsson	2	-4/+16
	Fix the unmapping of hugepage allocated umems so that they are properly unmapped. The new test referred to in the fixes label, introduced a test that allocated a umem that is not a multiple of a 2M hugepage size. This is fine for mmap() that rounds the size up the nearest multiple of 2M. But munmap() requires the size to be a multiple of the hugepage size in order for it to unmap the region. The current behaviour of not properly unmapping the umem, was discovered when further additions of tests that require hugepages (unaligned mode tests only) started failing as the system was running out of hugepages. Fixes: c0801598e543 ("selftests: xsk: Add test UNALIGNED_INV_DESC_4K1_FRAME_SIZE") Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20230418143617.27762-1-magnus.karlsson@gmail.com
2023-04-18	libbpf: mark bpf_iter_num_{new,next,destroy} as __weak	Andrii Nakryiko	1	-3/+3
	Mark bpf_iter_num_{new,next,destroy}() kfuncs declared for bpf_for()/bpf_repeat() macros as __weak to allow users to feature-detect their presence and guard bpf_for()/bpf_repeat() loops accordingly for backwards compatibility with old kernels. Now that libbpf supports kfunc calls poisoning and better reporting of unresolved (but called) kfuncs, declaring number iterator kfuncs in bpf_helpers.h won't degrade user experience and won't cause unnecessary kernel feature dependencies. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20230418002148.3255690-7-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-18	libbpf: move bpf_for(), bpf_for_each(), and bpf_repeat() into bpf_helpers.h	Andrii Nakryiko	2	-103/+103
	To make it easier for bleeding-edge BPF applications, such as sched_ext, to utilize open-coded iterators, move bpf_for(), bpf_for_each(), and bpf_repeat() macros from selftests/bpf-internal bpf_misc.h helper, to libbpf-provided bpf_helpers.h header. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20230418002148.3255690-6-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-18	selftests/bpf: add missing __weak kfunc log fixup test	Andrii Nakryiko	2	-0/+41
	Add test validating that libbpf correctly poisons and reports __weak unresolved kfuncs in post-processed verifier log. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20230418002148.3255690-5-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-18	libbpf: improve handling of unresolved kfuncs	Andrii Nakryiko	1	-3/+69
	Currently, libbpf leaves `call #0` instruction for __weak unresolved kfuncs, which might lead to a confusing verifier log situations, where invalid `call #0` will be treated as successfully validated. We can do better. Libbpf already has an established mechanism of poisoning instructions that failed some form of resolution (e.g., CO-RE relocation and BPF map set to not be auto-created). Libbpf doesn't fail them outright to allow users to guard them through other means, and as long as BPF verifier can prove that such poisoned instructions cannot be ever reached, this doesn't consistute an invalid BPF program. If user didn't guard such code, libbpf will extract few pieces of information to tie such poisoned instructions back to additional information about what entitity wasn't resolved (e.g., BPF map name, or CO-RE relocation information). __weak unresolved kfuncs fit this model well, so this patch extends libbpf with poisioning and log fixup logic for kfunc calls. Note, this poisoning is done only for kfunc calls, not kfunc address resolution (ldimm64 instructions). The former cannot be ever valid, if reached, so it's safe to poison them. The latter is a valid mechanism to check if __weak kfunc ksym was resolved, and do necessary guarding and work arounds based on this result, supported in most recent kernels. As such, libbpf keeps such ldimm64 instructions as loading zero, never poisoning them. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20230418002148.3255690-4-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-18	libbpf: report vmlinux vs module name when dealing with ksyms	Andrii Nakryiko	1	-4/+5
	Currently libbpf always reports "kernel" as a source of ksym BTF type, which is ambiguous given ksym's BTF can come from either vmlinux or kernel module BTFs. Make this explicit and log module name, if used BTF is from kernel module. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20230418002148.3255690-3-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-18	libbpf: misc internal libbpf clean ups around log fixup	Andrii Nakryiko	1	-12/+14
	Normalize internal constants, field names, and comments related to log fixup. Also add explicit `ext_idx` alias for relocation where relocation is pointing to extern description for additional information. No functional changes, just a clean up before subsequent additions. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20230418002148.3255690-2-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-17	selftests/bpf: Add a selftest for checking subreg equality	Yonghong Song	2	-0/+60
	Add a selftest to ensure subreg equality if source register upper 32bit is 0. Without previous patch, the test will fail verification. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20230417222139.360607-1-yhs@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-17	bpf: Improve verifier u32 scalar equality checking	Yonghong Song	1	-2/+7
	In [1], I tried to remove bpf-specific codes to prevent certain llvm optimizations, and add llvm TTI (target transform info) hooks to prevent those optimizations. During this process, I found if I enable llvm SimplifyCFG:shouldFoldTwoEntryPHINode transformation, I will hit the following verification failure with selftests: ... 8: (18) r1 = 0xffffc900001b2230 ; R1_w=map_value(off=560,ks=4,vs=564,imm=0) 10: (61) r1 = (u32 )(r1 +0) ; R1_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) ; if (skb->tstamp == EGRESS_ENDHOST_MAGIC) 11: (79) r2 = (u64 )(r6 +152) ; R2_w=scalar() R6=ctx(off=0,imm=0) ; if (skb->tstamp == EGRESS_ENDHOST_MAGIC) 12: (55) if r2 != 0xb9fbeef goto pc+10 ; R2_w=195018479 13: (bc) w2 = w1 ; R1_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) R2_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) ; if (test < __NR_TESTS) 14: (a6) if w1 < 0x9 goto pc+1 16: R0=2 R1_w=scalar(umax=8,var_off=(0x0; 0xf)) R2_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) R6=ctx(off=0,imm=0) R10=fp0 ; 16: (27) r2 = 28 ; R2_w=scalar(umax=120259084260,var_off=(0x0; 0x1ffffffffc),s32_max=2147483644,u32_max=-4) 17: (18) r3 = 0xffffc900001b2118 ; R3_w=map_value(off=280,ks=4,vs=564,imm=0) 19: (0f) r3 += r2 ; R2_w=scalar(umax=120259084260,var_off=(0x0; 0x1ffffffffc),s32_max=2147483644,u32_max=-4) R3_w=map_value(off=280,ks=4,vs=564,umax=120259084260,var_off=(0x0; 0x1ffffffffc),s32_max=2147483644,u32_max=-4) 20: (61) r2 = (u32 )(r3 +0) R3 unbounded memory access, make sure to bounds check any such access processed 97 insns (limit 1000000) max_states_per_insn 1 total_states 10 peak_states 10 mark_read 6 -- END PROG LOAD LOG -- libbpf: prog 'ingress_fwdns_prio100': failed to load: -13 libbpf: failed to load object 'test_tc_dtime' libbpf: failed to load BPF skeleton 'test_tc_dtime': -13 ... At insn 14, with condition 'w1 < 9', register r1 is changed from an arbitrary u32 value to `scalar(umax=8,var_off=(0x0; 0xf))`. Register r2, however, remains as an arbitrary u32 value. Current verifier won't claim r1/r2 equality if the previous mov is alu32 ('w2 = w1'). If r1 upper 32bit value is not 0, we indeed cannot clamin r1/r2 equality after 'w2 = w1'. But in this particular case, we know r1 upper 32bit value is 0, so it is safe to claim r1/r2 equality. This patch exactly did this. For a 32bit subreg mov, if the src register upper 32bit is 0, it is okay to claim equality between src and dst registers. With this patch, the above verification sequence becomes ... 8: (18) r1 = 0xffffc9000048e230 ; R1_w=map_value(off=560,ks=4,vs=564,imm=0) 10: (61) r1 = (u32 )(r1 +0) ; R1_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) ; if (skb->tstamp == EGRESS_ENDHOST_MAGIC) 11: (79) r2 = (u64 )(r6 +152) ; R2_w=scalar() R6=ctx(off=0,imm=0) ; if (skb->tstamp == EGRESS_ENDHOST_MAGIC) 12: (55) if r2 != 0xb9fbeef goto pc+10 ; R2_w=195018479 13: (bc) w2 = w1 ; R1_w=scalar(id=6,umax=4294967295,var_off=(0x0; 0xffffffff)) R2_w=scalar(id=6,umax=4294967295,var_off=(0x0; 0xffffffff)) ; if (test < __NR_TESTS) 14: (a6) if w1 < 0x9 goto pc+1 ; R1_w=scalar(id=6,umin=9,umax=4294967295,var_off=(0x0; 0xffffffff)) ... from 14 to 16: R0=2 R1_w=scalar(id=6,umax=8,var_off=(0x0; 0xf)) R2_w=scalar(id=6,umax=8,var_off=(0x0; 0xf)) R6=ctx(off=0,imm=0) R10=fp0 16: (27) r2 = 28 ; R2_w=scalar(umax=224,var_off=(0x0; 0xfc)) 17: (18) r3 = 0xffffc9000048e118 ; R3_w=map_value(off=280,ks=4,vs=564,imm=0) 19: (0f) r3 += r2 20: (61) r2 = (u32 )(r3 +0) ; R2_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) R3_w=map_value(off=280,ks=4,vs=564,umax=224,var_off=(0x0; 0xfc),s32_max=252,u32_max=252) ... and eventually the bpf program can be verified successfully. [1] https://reviews.llvm.org/D147968 Signed-off-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20230417222134.359714-1-yhs@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-17	bpf: lirc program type should not require SYS_CAP_ADMIN	Sean Young	1	-1/+0
	Make it possible to load lirc program type with just CAP_BPF. There is nothing exceptional about lirc programs that means they require SYS_CAP_ADMIN. In order to attach or detach a lirc program type you need permission to open /dev/lirc0; if you have permission to do that, you can alter all sorts of lirc receiving options. Changing the IR protocol decoder is no different. Right now on a typical distribution /dev/lirc devices are only read/write by root. Ideally we would make them group read/write like other devices so that local users can use them without becoming root. Signed-off-by: Sean Young <sean@mess.org> Link: https://lore.kernel.org/r/ZD0ArKpwnDBJZsrE@gofer.mess.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-17	bpf: Set skb redirect and from_ingress info in __bpf_tx_skb	Daniel Borkmann	2	-0/+10
	There are some use-cases where it is desirable to use bpf_redirect() in combination with ifb device, which currently is not supported, for example, around filtering inbound traffic with BPF to then push it to ifb which holds the qdisc for shaping in contrast to doing that on the egress device. Toke mentions the following case related to OpenWrt: Because there's not always a single egress on the other side. These are mainly home routers, which tend to have one or more WiFi devices bridged to one or more ethernet ports on the LAN side, and a single upstream WAN port. And the objective is to control the total amount of traffic going over the WAN link (in both directions), to deal with bufferbloat in the ISP network (which is sadly still all too prevalent). In this setup, the traffic can be split arbitrarily between the links on the LAN side, and the only "single bottleneck" is the WAN link. So we install both egress and ingress shapers on this, configured to something like 95-98% of the true link bandwidth, thus moving the queues into the qdisc layer in the router. It's usually necessary to set the ingress bandwidth shaper a bit lower than the egress due to being "downstream" of the bottleneck link, but it does work surprisingly well. We usually use something like a matchall filter to put all ingress traffic on the ifb, so doing the redirect from BPF has not been an immediate requirement thus far. However, it does seem a bit odd that this is not possible, and we do have a BPF-based filter that layers on top of this kind of setup, which currently uses u32 as the ingress filter and so it could presumably be improved to use BPF instead if that was available. Reported-by: Toke Høiland-Jørgensen <toke@redhat.com> Reported-by: Yafang Shao <laoar.shao@gmail.com> Reported-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://git.openwrt.org/?p=project/qosify.git;a=blob;f=README Link: https://lore.kernel.org/bpf/875y9yzbuy.fsf@toke.dk Link: https://lore.kernel.org/r/8cebc8b2b6e967e10cbafe2ffd6795050e74accd.1681739137.git.daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-16	bpf,docs: Remove KF_KPTR_GET from documentation	David Vernet	1	-15/+6
	A prior patch removed KF_KPTR_GET from the kernel. Now that it's no longer accessible to kfunc authors, this patch removes it from the BPF kfunc documentation. Signed-off-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20230416084928.326135-4-void@manifault.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-16	bpf: Remove KF_KPTR_GET kfunc flag	David Vernet	2	-66/+0
	We've managed to improve the UX for kptrs significantly over the last 9 months. All of the existing use cases which previously had KF_KPTR_GET kfuncs (struct bpf_cpumask , struct task_struct , and struct cgroup ) have all been updated to be synchronized using RCU. In other words, their KF_KPTR_GET kfuncs have been removed in favor of KF_RCU \| KF_ACQUIRE kfuncs, with the pointers themselves also being readable from maps in an RCU read region thanks to the types being RCU safe. While KF_KPTR_GET was a logical starting point for kptrs, it's become clear that they're not the correct abstraction. KF_KPTR_GET is a flag that essentially does nothing other than enforcing that the argument to a function is a pointer to a referenced kptr map value. At first glance, that's a useful thing to guarantee to a kfunc. It gives kfuncs the ability to try and acquire a reference on that kptr without requiring the BPF prog to do something like this: struct kptr_type in_map, new = NULL; in_map = bpf_kptr_xchg(&map->value, NULL); if (in_map) { new = bpf_kptr_type_acquire(in_map); in_map = bpf_kptr_xchg(&map->value, in_map); if (in_map) bpf_kptr_type_release(in_map); } That's clearly a pretty ugly (and racy) UX, and if using KF_KPTR_GET is the only alternative, it's better than nothing. However, the problem with any KF_KPTR_GET kfunc lies in the fact that it always requires some kind of synchronization in order to safely do an opportunistic acquire of the kptr in the map. This is because a BPF program running on another CPU could do a bpf_kptr_xchg() on that map value, and free the kptr after it's been read by the KF_KPTR_GET kfunc. For example, the now-removed bpf_task_kptr_get() kfunc did the following: struct task_struct bpf_task_kptr_get(struct task_struct *pp) { struct task_struct p; rcu_read_lock(); p = READ_ONCE(pp); / If p is non-NULL, it could still be freed by another CPU, * so we have to do an opportunistic refcount_inc_not_zero() * and return NULL if the task will be freed after the * current RCU read region. */ \|f (p && !refcount_inc_not_zero(&p->rcu_users)) p = NULL; rcu_read_unlock(); return p; } In other words, the kfunc uses RCU to ensure that the task remains valid after it's been peeked from the map. However, this is completely redundant with just defining a KF_RCU kfunc that itself does a refcount_inc_not_zero(), which is exactly what bpf_task_acquire() now does. So, the question of whether KF_KPTR_GET is useful is actually, "Are there any synchronization mechanisms / safety flags that are required by certain kptrs, but which are not provided by the verifier to kfuncs?" The answer to that question today is "No", because every kptr we currently care about is RCU protected. Even if the answer ever became "yes", the proper way to support that referenced kptr type would be to add support for whatever synchronization mechanism it requires in the verifier, rather than giving kfuncs a flag that says, "Here's a pointer to a referenced kptr in a map, do whatever you need to do." With all that said -- so as to allow us to consolidate the kfunc API, and simplify the verifier a bit, this patch removes KF_KPTR_GET, and all relevant logic from the verifier. Signed-off-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20230416084928.326135-3-void@manifault.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-16	bpf: Remove bpf_kfunc_call_test_kptr_get() test kfunc	David Vernet	4	-152/+5
	We've managed to improve the UX for kptrs significantly over the last 9 months. All of the prior main use cases, struct bpf_cpumask , struct task_struct , and struct cgroup *, have all been updated to be synchronized mainly using RCU. In other words, their KF_ACQUIRE kfunc calls are all KF_RCU, and the pointers themselves are MEM_RCU and can be accessed in an RCU read region in BPF. In a follow-on change, we'll be removing the KF_KPTR_GET kfunc flag. This patch prepares for that by removing the bpf_kfunc_call_test_kptr_get() kfunc, and all associated selftests. Signed-off-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20230416084928.326135-2-void@manifault.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-15	selftests/bpf: Add refcounted_kptr tests	Dave Marchevsky	3	-0/+496
	Test refcounted local kptr functionality added in previous patches in the series. Usecases which pass verification: * Add refcounted local kptr to both tree and list. Then, read and - possibly, depending on test variant - delete from tree, then list. * Also test doing read-and-maybe-delete in opposite order * Stash a refcounted local kptr in a map_value, then add it to a rbtree. Read from both, possibly deleting after tree read. * Add refcounted local kptr to both tree and list. Then, try reading and deleting twice from one of the collections. * bpf_refcount_acquire of just-added non-owning ref should work, as should bpf_refcount_acquire of owning ref just out of bpf_obj_new Usecases which fail verification: * The simple successful bpf_refcount_acquire cases from above should both fail to verify if the newly-acquired owning ref is not dropped Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230415201811.343116-10-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-15	bpf: Centralize btf_field-specific initialization logic	Dave Marchevsky	2	-12/+35
	All btf_fields in an object are 0-initialized by memset in bpf_obj_init. This might not be a valid initial state for some field types, in which case kfuncs that use the type will properly initialize their input if it's been 0-initialized. Some BPF graph collection types and kfuncs do this: bpf_list_{head,node} and bpf_rb_node. An earlier patch in this series added the bpf_refcount field, for which the 0 state indicates that the refcounted object should be free'd. bpf_obj_init treats this field specially, setting refcount to 1 instead of relying on scattered "refcount is 0? Must have just been initialized, let's set to 1" logic in kfuncs. This patch extends this treatment to list and rbtree field types, allowing most scattered initialization logic in kfuncs to be removed. Note that bpf_{list_head,rb_root} may be inside a BPF map, in which case they'll be 0-initialized without passing through the newly-added logic, so scattered initialization logic must remain for these collection root types. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230415201811.343116-9-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-15	bpf: Migrate bpf_rbtree_remove to possibly fail	Dave Marchevsky	7	-107/+191
	This patch modifies bpf_rbtree_remove to account for possible failure due to the input rb_node already not being in any collection. The function can now return NULL, and does when the aforementioned scenario occurs. As before, on successful removal an owning reference to the removed node is returned. Adding KF_RET_NULL to bpf_rbtree_remove's kfunc flags - now KF_RET_NULL \| KF_ACQUIRE - provides the desired verifier semantics: * retval must be checked for NULL before use * if NULL, retval's ref_obj_id is released * retval is a "maybe acquired" owning ref, not a non-owning ref, so it will live past end of critical section (bpf_spin_unlock), and thus can be checked for NULL after the end of the CS BPF programs must add checks ============================ This does change bpf_rbtree_remove's verifier behavior. BPF program writers will need to add NULL checks to their programs, but the resulting UX looks natural: bpf_spin_lock(&glock); n = bpf_rbtree_first(&ghead); if (!n) { /* ... /} res = bpf_rbtree_remove(&ghead, &n->node); bpf_spin_unlock(&glock); if (!res) / Newly-added check after this patch / return 1; n = container_of(res, / ... /); / Do something else with n / bpf_obj_drop(n); return 0; The "if (!res)" check above is the only addition necessary for the above program to pass verification after this patch. bpf_rbtree_remove no longer clobbers non-owning refs ==================================================== An issue arises when bpf_rbtree_remove fails, though. Consider this example: struct node_data { long key; struct bpf_list_node l; struct bpf_rb_node r; struct bpf_refcount ref; }; long failed_sum; void bpf_prog() { struct node_data n = bpf_obj_new(/* ... /); struct bpf_rb_node res; n->key = 10; bpf_spin_lock(&glock); bpf_list_push_back(&some_list, &n->l); /* n is now a non-owning ref / res = bpf_rbtree_remove(&some_tree, &n->r, / ... /); if (!res) failed_sum += n->key; / not possible / bpf_spin_unlock(&glock); / if (res) { do something useful and drop } ... */ } The bpf_rbtree_remove in this example will always fail. Similarly to bpf_spin_unlock, bpf_rbtree_remove is a non-owning reference invalidation point. The verifier clobbers all non-owning refs after a bpf_rbtree_remove call, so the "failed_sum += n->key" line will fail verification, and in fact there's no good way to get information about the node which failed to add after the invalidation. This patch removes non-owning reference invalidation from bpf_rbtree_remove to allow the above usecase to pass verification. The logic for why this is now possible is as follows: Before this series, bpf_rbtree_add couldn't fail and thus assumed that its input, a non-owning reference, was in the tree. But it's easy to construct an example where two non-owning references pointing to the same underlying memory are acquired and passed to rbtree_remove one after another (see rbtree_api_release_aliasing in selftests/bpf/progs/rbtree_fail.c). So it was necessary to clobber non-owning refs to prevent this case and, more generally, to enforce "non-owning ref is definitely in some collection" invariant. This series removes that invariant and the failure / runtime checking added in this patch provide a clean way to deal with the aliasing issue - just fail to remove. Because the aliasing issue prevented by clobbering non-owning refs is no longer an issue, this patch removes the invalidate_non_owning_refs call from verifier handling of bpf_rbtree_remove. Note that bpf_spin_unlock - the other caller of invalidate_non_owning_refs - clobbers non-owning refs for a different reason, so its clobbering behavior remains unchanged. No BPF program changes are necessary for programs to remain valid as a result of this clobbering change. A valid program before this patch passed verification with its non-owning refs having shorter (or equal) lifetimes due to more aggressive clobbering. Also, update existing tests to check bpf_rbtree_remove retval for NULL where necessary, and move rbtree_api_release_aliasing from progs/rbtree_fail.c to progs/rbtree.c since it's now expected to pass verification. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230415201811.343116-8-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-15	selftests/bpf: Modify linked_list tests to work with macro-ified inserts	Dave Marchevsky	4	-67/+73
	The linked_list tests use macros and function pointers to reduce code duplication. Earlier in the series, bpf_list_push_{front,back} were modified to be macros, expanding to invoke actual kfuncs bpf_list_push_{front,back}_impl. Due to this change, a code snippet like: void (p)(void , void ) = (void )&bpf_list_##op; p(hexpr, nexpr); meant to do bpf_list_push_{front,back}(hexpr, nexpr), will no longer work as it's no longer valid to do &bpf_list_push_{front,back} since they're no longer functions. This patch fixes issues of this type, along with two other minor changes - one improvement and one fix - both related to the node argument to list_push_{front,back}. * The fix: migration of list_push tests away from (void , void ) func ptr uncovered that some tests were incorrectly passing pointer to node, not pointer to struct bpf_list_node within the node. This patch fixes such issues (CHECK(..., f) -> CHECK(..., &f->node)) * The improvement: In linked_list tests, the struct foo type has two list_node fields: node and node2, at byte offsets 0 and 40 within the struct, respectively. Currently node is used in ~all tests involving struct foo and lists. The verifier needs to do some work to account for the offset of bpf_list_node within the node type, so using node2 instead of node exercises that logic more in the tests. This patch migrates linked_list tests to use node2 instead of node. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230415201811.343116-7-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-15	bpf: Migrate bpf_rbtree_add and bpf_list_push_{front,back} to possibly fail	Dave Marchevsky	4	-51/+148
	Consider this code snippet: struct node { long key; bpf_list_node l; bpf_rb_node r; bpf_refcount ref; } int some_bpf_prog(void ctx) { struct node n = bpf_obj_new(/.../), m; bpf_spin_lock(&glock); bpf_rbtree_add(&some_tree, &n->r, / ... /); m = bpf_refcount_acquire(n); bpf_rbtree_add(&other_tree, &m->r, / ... /); bpf_spin_unlock(&glock); / ... / } After bpf_refcount_acquire, n and m point to the same underlying memory, and that node's bpf_rb_node field is being used by the some_tree insert, so overwriting it as a result of the second insert is an error. In order to properly support refcounted nodes, the rbtree and list insert functions must be allowed to fail. This patch adds such support. The kfuncs bpf_rbtree_add, bpf_list_push_{front,back} are modified to return an int indicating success/failure, with 0 -> success, nonzero -> failure. bpf_obj_drop on failure ======================= Currently the only reason an insert can fail is the example above: the bpf_{list,rb}_node is already in use. When such a failure occurs, the insert kfuncs will bpf_obj_drop the input node. This allows the insert operations to logically fail without changing their verifier owning ref behavior, namely the unconditional release_reference of the input owning ref. With insert that always succeeds, ownership of the node is always passed to the collection, since the node always ends up in the collection. With a possibly-failed insert w/ bpf_obj_drop, ownership of the node is always passed either to the collection (success), or to bpf_obj_drop (failure). Regardless, it's correct to continue unconditionally releasing the input owning ref, as something is always taking ownership from the calling program on insert. Keeping owning ref behavior unchanged results in a nice default UX for insert functions that can fail. If the program's reaction to a failed insert is "fine, just get rid of this owning ref for me and let me go on with my business", then there's no reason to check for failure since that's default behavior. e.g.: long important_failures = 0; int some_bpf_prog(void ctx) { struct node n, m, o; / all bpf_obj_new'd / bpf_spin_lock(&glock); bpf_rbtree_add(&some_tree, &n->node, / ... /); bpf_rbtree_add(&some_tree, &m->node, / ... /); if (bpf_rbtree_add(&some_tree, &o->node, / ... /)) { important_failures++; } bpf_spin_unlock(&glock); } If we instead chose to pass ownership back to the program on failed insert - by returning NULL on success or an owning ref on failure - programs would always have to do something with the returned ref on failure. The most likely action is probably "I'll just get rid of this owning ref and go about my business", which ideally would look like: if (n = bpf_rbtree_add(&some_tree, &n->node, / ... /)) bpf_obj_drop(n); But bpf_obj_drop isn't allowed in a critical section and inserts must occur within one, so in reality error handling would become a hard-to-parse mess. For refcounted nodes, we can replicate the "pass ownership back to program on failure" logic with this patch's semantics, albeit in an ugly way: struct node n = bpf_obj_new(/* ... /), m; bpf_spin_lock(&glock); m = bpf_refcount_acquire(n); if (bpf_rbtree_add(&some_tree, &n->node, /* ... /)) { / Do something with m / } bpf_spin_unlock(&glock); bpf_obj_drop(m); bpf_refcount_acquire is used to simulate "return owning ref on failure". This should be an uncommon occurrence, though. Addition of two verifier-fixup'd args to collection inserts =========================================================== The actual bpf_obj_drop kfunc is bpf_obj_drop_impl(void , struct btf_struct_meta ), with bpf_obj_drop macro populating the second arg with 0 and the verifier later filling in the arg during insn fixup. Because bpf_rbtree_add and bpf_list_push_{front,back} now might do bpf_obj_drop, these kfuncs need a btf_struct_meta parameter that can be passed to bpf_obj_drop_impl. Similarly, because the 'node' param to those insert functions is the bpf_{list,rb}_node within the node type, and bpf_obj_drop expects a pointer to the beginning of the node, the insert functions need to be able to find the beginning of the node struct. A second verifier-populated param is necessary: the offset of {list,rb}_node within the node type. These two new params allow the insert kfuncs to correctly call __bpf_obj_drop_impl: beginning_of_node = bpf_rb_node_ptr - offset if (already_inserted) __bpf_obj_drop_impl(beginning_of_node, btf_struct_meta->record); Similarly to other kfuncs with "hidden" verifier-populated params, the insert functions are renamed with _impl prefix and a macro is provided for common usage. For example, bpf_rbtree_add kfunc is now bpf_rbtree_add_impl and bpf_rbtree_add is now a macro which sets "hidden" args to 0. Due to the two new args BPF progs will need to be recompiled to work with the new _impl kfuncs. This patch also rewrites the "hidden argument" explanation to more directly say why the BPF program writer doesn't need to populate the arguments with anything meaningful. How does this new logic affect non-owning references? ===================================================== Currently, non-owning refs are valid until the end of the critical section in which they're created. We can make this guarantee because, if a non-owning ref exists, the referent was added to some collection. The collection will drop() its nodes when it goes away, but it can't go away while our program is accessing it, so that's not a problem. If the referent is removed from the collection in the same CS that it was added in, it can't be bpf_obj_drop'd until after CS end. Those are the only two ways to free the referent's memory and neither can happen until after the non-owning ref's lifetime ends. On first glance, having these collection insert functions potentially bpf_obj_drop their input seems like it breaks the "can't be bpf_obj_drop'd until after CS end" line of reasoning. But we care about the memory not being _freed_ until end of CS end, and a previous patch in the series modified bpf_obj_drop such that it doesn't free refcounted nodes until refcount == 0. So the statement can be more accurately rewritten as "can't be free'd until after CS end". We can prove that this rewritten statement holds for any non-owning reference produced by collection insert functions: If the input to the insert function is _not_ refcounted * We have an owning reference to the input, and can conclude it isn't in any collection * Inserting a node in a collection turns owning refs into non-owning, and since our input type isn't refcounted, there's no way to obtain additional owning refs to the same underlying memory * Because our node isn't in any collection, the insert operation cannot fail, so bpf_obj_drop will not execute * If bpf_obj_drop is guaranteed not to execute, there's no risk of memory being free'd * Otherwise, the input to the insert function is refcounted * If the insert operation fails due to the node's list_head or rb_root already being in some collection, there was some previous successful insert which passed refcount to the collection * We have an owning reference to the input, it must have been acquired via bpf_refcount_acquire, which bumped the refcount * refcount must be >= 2 since there's a valid owning reference and the node is already in a collection * Insert triggering bpf_obj_drop will decr refcount to >= 1, never resulting in a free So although we may do bpf_obj_drop during the critical section, this will never result in memory being free'd, and no changes to non-owning ref logic are needed in this patch. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230415201811.343116-6-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-15	bpf: Add bpf_refcount_acquire kfunc	Dave Marchevsky	3	-11/+91
	Currently, BPF programs can interact with the lifetime of refcounted local kptrs in the following ways: bpf_obj_new - Initialize refcount to 1 as part of new object creation bpf_obj_drop - Decrement refcount and free object if it's 0 collection add - Pass ownership to the collection. No change to refcount but collection is responsible for bpf_obj_dropping it In order to be able to add a refcounted local kptr to multiple collections we need to be able to increment the refcount and acquire a new owning reference. This patch adds a kfunc, bpf_refcount_acquire, implementing such an operation. bpf_refcount_acquire takes a refcounted local kptr and returns a new owning reference to the same underlying memory as the input. The input can be either owning or non-owning. To reinforce why this is safe, consider the following code snippets: struct node n = bpf_obj_new(typeof(n)); // A struct node m = bpf_refcount_acquire(n); // B In the above snippet, n will be alive with refcount=1 after (A), and since nothing changes that state before (B), it's obviously safe. If n is instead added to some rbtree, we can still safely refcount_acquire it: struct node n = bpf_obj_new(typeof(n)); struct node m; bpf_spin_lock(&glock); bpf_rbtree_add(&groot, &n->node, less); // A m = bpf_refcount_acquire(n); // B bpf_spin_unlock(&glock); In the above snippet, after (A) n is a non-owning reference, and after (B) m is an owning reference pointing to the same memory as n. Although n has no ownership of that memory's lifetime, it's guaranteed to be alive until the end of the critical section, and n would be clobbered if we were past the end of the critical section, so it's safe to bump refcount. Implementation details: * From verifier's perspective, bpf_refcount_acquire handling is similar to bpf_obj_new and bpf_obj_drop. Like the former, it returns a new owning reference matching input type, although like the latter, type can be inferred from concrete kptr input. Verifier changes in {check,fixup}_kfunc_call and check_kfunc_args are largely copied from aforementioned functions' verifier changes. * An exception to the above is the new KF_ARG_PTR_TO_REFCOUNTED_KPTR arg, indicated by new "__refcounted_kptr" kfunc arg suffix. This is necessary in order to handle both owning and non-owning input without adding special-casing to "__alloc" arg handling. Also a convenient place to confirm that input type has bpf_refcount field. * The implemented kfunc is actually bpf_refcount_acquire_impl, with 'hidden' second arg that the verifier sets to the type's struct_meta in fixup_kfunc_call. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230415201811.343116-5-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-15	bpf: Support refcounted local kptrs in existing semantics	Dave Marchevsky	2	-8/+16
	A local kptr is considered 'refcounted' when it is of a type that has a bpf_refcount field. When such a kptr is created, its refcount should be initialized to 1; when destroyed, the object should be free'd only if a refcount decr results in 0 refcount. Existing logic always frees the underlying memory when destroying a local kptr, and 0-initializes all btf_record fields. This patch adds checks for "is local kptr refcounted?" and new logic for that case in the appropriate places. This patch focuses on changing existing semantics and thus conspicuously does _not_ provide a way for BPF programs in increment refcount. That follows later in the series. __bpf_obj_drop_impl is modified to do the right thing when it sees a refcounted type. Container types for graph nodes (list, tree, stashed in map) are migrated to use __bpf_obj_drop_impl as a destructor for their nodes instead of each having custom destruction code in their _free paths. Now that "drop" isn't a synonym for "free" when the type is refcounted it makes sense to centralize this logic. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230415201811.343116-4-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-15	bpf: Introduce opaque bpf_refcount struct and add btf_record plumbing	Dave Marchevsky	5	-2/+32
	A 'struct bpf_refcount' is added to the set of opaque uapi/bpf.h types meant for use in BPF programs. Similarly to other opaque types like bpf_spin_lock and bpf_rbtree_node, the verifier needs to know where in user-defined struct types a bpf_refcount can be located, so necessary btf_record plumbing is added to enable this. bpf_refcount is sized to hold a refcount_t. Similarly to bpf_spin_lock, the offset of a bpf_refcount is cached in btf_record as refcount_off in addition to being in the field array. Caching refcount_off makes sense for this field because further patches in the series will modify functions that take local kptrs (e.g. bpf_obj_drop) to change their behavior if the type they're operating on is refcounted. So enabling fast "is this type refcounted?" checks is desirable. No such verifier behavior changes are introduced in this patch, just logic to recognize 'struct bpf_refcount' in btf_record. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230415201811.343116-3-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-15	bpf: Remove btf_field_offs, use btf_record's fields instead	Dave Marchevsky	6	-130/+43
	The btf_field_offs struct contains (offset, size) for btf_record fields, sorted by offset. btf_field_offs is always used in conjunction with btf_record, which has btf_field 'fields' array with (offset, type), the latter of which btf_field_offs' size is derived from via btf_field_type_size. This patch adds a size field to struct btf_field and sorts btf_record's fields by offset, making it possible to get rid of btf_field_offs. Less data duplication and less code complexity results. Since btf_field_offs' lifetime closely followed the btf_record used to populate it, most complexity wins are from removal of initialization code like: if (btf_record_successfully_initialized) { foffs = btf_parse_field_offs(rec); if (IS_ERR_OR_NULL(foffs)) // free the btf_record and return err } Other changes in this patch are pretty mechanical: * foffs->field_off[i] -> rec->fields[i].offset * foffs->field_sz[i] -> rec->fields[i].size * Sort rec->fields in btf_parse_fields before returning * It's possible that this is necessary independently of other changes in this patch. btf_record_find in syscall.c expects btf_record's fields to be sorted by offset, yet there's no explicit sorting of them before this patch, record's fields are populated in the order they're read from BTF struct definition. BTF docs don't say anything about the sortedness of struct fields. * All functions taking struct btf_field_offs * input now instead take struct btf_record *. All callsites of these functions already have access to the correct btf_record. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230415201811.343116-2-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-13	samples/bpf: sampleip: Replace PAGE_OFFSET with _text address	Rong Tao	1	-2/+9
	Macro PAGE_OFFSET(0xffff880000000000) in sampleip_user.c is inaccurate, for example, in aarch64 architecture, this value depends on the CONFIG_ARM64_VA_BITS compilation configuration, this value defaults to 48, the corresponding PAGE_OFFSET is 0xffff800000000000, if we use the value defined in sampleip_user.c, then all KSYMs obtained by sampleip are (user) Symbol error due to PAGE_OFFSET error: $ sudo ./sampleip 1 Sampling at 99 Hertz for 1 seconds. Ctrl-C also ends. ADDR KSYM COUNT 0xffff80000810ceb8 (user) 1 0xffffb28ec880 (user) 1 0xffff8000080c82b8 (user) 1 0xffffb23fed24 (user) 1 0xffffb28944fc (user) 1 0xffff8000084628bc (user) 1 0xffffb2a935c0 (user) 1 0xffff80000844677c (user) 1 0xffff80000857a3a4 (user) 1 ... A few examples of addresses in the CONFIG_ARM64_VA_BITS=48 environment in the aarch64 environment: $ sudo head /proc/kallsyms ffff8000080a0000 T _text ffff8000080b0000 t gic_handle_irq ffff8000080b0000 T _stext ffff8000080b0000 T __irqentry_text_start ffff8000080b00b0 t gic_handle_irq ffff8000080b0230 t gic_handle_irq ffff8000080b03b4 T __irqentry_text_end ffff8000080b03b8 T __softirqentry_text_start ffff8000080b03c0 T __do_softirq ffff8000080b0718 T __entry_text_start We just need to replace the PAGE_OFFSET with the address _text in /proc/kallsyms to solve this problem: $ sudo ./sampleip 1 Sampling at 99 Hertz for 1 seconds. Ctrl-C also ends. ADDR KSYM COUNT 0xffffb2892ab0 (user) 1 0xffffb2b1edfc (user) 1 0xffff800008462834 __arm64_sys_ppoll 1 0xffff8000084b87f4 eventfd_read 1 0xffffb28e6788 (user) 1 0xffff8000081e96d8 rcu_all_qs 1 0xffffb2ada878 (user) 1 ... Signed-off-by: Rong Tao <rongtao@cestc.cn> Link: https://lore.kernel.org/r/tencent_A0E82E0BEE925285F8156D540731DF805F05@qq.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-13	bpf: Support 64-bit pointers to kfuncs	Ilya Leoshkevich	5	-40/+110
	test_ksyms_module fails to emit a kfunc call targeting a module on s390x, because the verifier stores the difference between kfunc address and __bpf_call_base in bpf_insn.imm, which is s32, and modules are roughly (1 << 42) bytes away from the kernel on s390x. Fix by keeping BTF id in bpf_insn.imm for BPF_PSEUDO_KFUNC_CALLs, and storing the absolute address in bpf_kfunc_desc. Introduce bpf_jit_supports_far_kfunc_call() in order to limit this new behavior to the s390x JIT. Otherwise other JITs need to be modified, which is not desired. Introduce bpf_get_kfunc_addr() instead of exposing both find_kfunc_desc() and struct bpf_kfunc_desc. In addition to sorting kfuncs by imm, also sort them by offset, in order to handle conflicting imms from different modules. Do this on all architectures in order to simplify code. Factor out resolving specialized kfuncs (XPD and dynptr) from fixup_kfunc_call(). This was required in the first place, because fixup_kfunc_call() uses find_kfunc_desc(), which returns a const pointer, so it's not possible to modify kfunc addr without stripping const, which is not nice. It also removes repetition of code like: if (bpf_jit_supports_far_kfunc_call()) desc->addr = func; else insn->imm = BPF_CALL_IMM(func); and separates kfunc_desc_tab fixups from kfunc_call fixups. Suggested-by: Jiri Olsa <olsajiri@gmail.com> Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20230412230632.885985-1-iii@linux.ibm.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-13	bpf: Add preempt_count_{sub,add} into btf id deny list	Yafang	1	-0/+4
	The recursion check in __bpf_prog_enter* and __bpf_prog_exit* leave preempt_count_{sub,add} unprotected. When attaching trampoline to them we get panic as follows, [ 867.843050] BUG: TASK stack guard page was hit at 0000000009d325cf (stack is 0000000046a46a15..00000000537e7b28) [ 867.843064] stack guard page: 0000 [#1] PREEMPT SMP NOPTI [ 867.843067] CPU: 8 PID: 11009 Comm: trace Kdump: loaded Not tainted 6.2.0+ #4 [ 867.843100] Call Trace: [ 867.843101] <TASK> [ 867.843104] asm_exc_int3+0x3a/0x40 [ 867.843108] RIP: 0010:preempt_count_sub+0x1/0xa0 [ 867.843135] __bpf_prog_enter_recur+0x17/0x90 [ 867.843148] bpf_trampoline_6442468108_0+0x2e/0x1000 [ 867.843154] ? preempt_count_sub+0x1/0xa0 [ 867.843157] preempt_count_sub+0x5/0xa0 [ 867.843159] ? migrate_enable+0xac/0xf0 [ 867.843164] __bpf_prog_exit_recur+0x2d/0x40 [ 867.843168] bpf_trampoline_6442468108_0+0x55/0x1000 ... [ 867.843788] preempt_count_sub+0x5/0xa0 [ 867.843793] ? migrate_enable+0xac/0xf0 [ 867.843829] __bpf_prog_exit_recur+0x2d/0x40 [ 867.843837] BUG: IRQ stack guard page was hit at 0000000099bd8228 (stack is 00000000b23e2bc4..000000006d95af35) [ 867.843841] BUG: IRQ stack guard page was hit at 000000005ae07924 (stack is 00000000ffd69623..0000000014eb594c) [ 867.843843] BUG: IRQ stack guard page was hit at 00000000028320f0 (stack is 00000000034b6438..0000000078d1bcec) [ 867.843842] bpf_trampoline_6442468108_0+0x55/0x1000 ... That is because in __bpf_prog_exit_recur, the preempt_count_{sub,add} are called after prog->active is decreased. Fixing this by adding these two functions into btf ids deny list. Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Yafang <laoar.shao@gmail.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Jiri Olsa <olsajiri@gmail.com> Acked-by: Hao Luo <haoluo@google.com> Link: https://lore.kernel.org/r/20230413025248.79764-1-laoar.shao@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-13	selftests/bpf: Workaround for older vm_sockets.h.	Alexei Starovoitov	1	-0/+5
	Some distros ship with older vm_sockets.h that doesn't have VMADDR_CID_LOCAL which causes selftests build to fail: /tmp/work/bpf/bpf/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c:261:18: error: ‘VMADDR_CID_LOCAL’ undeclared (first use in this function); did you mean ‘VMADDR_CID_HOST’? 261 \| addr->svm_cid = VMADDR_CID_LOCAL; \| ^~~~~~~~~~~~~~~~ \| VMADDR_CID_HOST Workaround this issue by defining it on demand. Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-13	selftests/bpf: Fix merge conflict due to SYS() macro change.	Alexei Starovoitov	1	-2/+2
	Fix merge conflict between bpf/bpf-next trees due to change of arguments in SYS() macro. Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-13	bpf, sockmap: Revert buggy deadlock fix in the sockhash and sockmap	Daniel Borkmann	1	-6/+4
	syzbot reported a splat and bisected it to recent commit ed17aa92dc56 ("bpf, sockmap: fix deadlocks in the sockhash and sockmap"): [...] WARNING: CPU: 1 PID: 9280 at kernel/softirq.c:376 __local_bh_enable_ip+0xbe/0x130 kernel/softirq.c:376 Modules linked in: CPU: 1 PID: 9280 Comm: syz-executor.1 Not tainted 6.2.0-syzkaller-13249-gd319f344561d #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/30/2023 RIP: 0010:__local_bh_enable_ip+0xbe/0x130 kernel/softirq.c:376 [...] Call Trace: <TASK> spin_unlock_bh include/linux/spinlock.h:395 [inline] sock_map_del_link+0x2ea/0x510 net/core/sock_map.c:165 sock_map_unref+0xb0/0x1d0 net/core/sock_map.c:184 sock_hash_delete_elem+0x1ec/0x2a0 net/core/sock_map.c:945 map_delete_elem kernel/bpf/syscall.c:1536 [inline] __sys_bpf+0x2edc/0x53e0 kernel/bpf/syscall.c:5053 __do_sys_bpf kernel/bpf/syscall.c:5166 [inline] __se_sys_bpf kernel/bpf/syscall.c:5164 [inline] __x64_sys_bpf+0x79/0xc0 kernel/bpf/syscall.c:5164 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7fe8f7c8c169 </TASK> [...] Revert for now until we have a proper solution. Fixes: ed17aa92dc56 ("bpf, sockmap: fix deadlocks in the sockhash and sockmap") Reported-by: syzbot+49f6cef45247ff249498@syzkaller.appspotmail.com Cc: Hsin-Wei Hung <hsinweih@uci.edu> Cc: Xin Liu <liuxin350@huawei.com> Cc: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/000000000000f1db9605f939720e@google.com/