aboutsummaryrefslogtreecommitdiffstatshomepage
AgeCommit message (Collapse)AuthorFilesLines
2020-06-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller35-232/+486
Daniel Borkmann says: ==================== pull-request: bpf 2020-06-30 The following pull-request contains BPF updates for your *net* tree. We've added 28 non-merge commits during the last 9 day(s) which contain a total of 35 files changed, 486 insertions(+), 232 deletions(-). The main changes are: 1) Fix an incorrect verifier branch elimination for PTR_TO_BTF_ID pointer types, from Yonghong Song. 2) Fix UAPI for sockmap and flow_dissector progs that were ignoring various arguments passed to BPF_PROG_{ATTACH,DETACH}, from Lorenz Bauer & Jakub Sitnicki. 3) Fix broken AF_XDP DMA hacks that are poking into dma-direct and swiotlb internals and integrate it properly into DMA core, from Christoph Hellwig. 4) Fix RCU splat from recent changes to avoid skipping ingress policy when kTLS is enabled, from John Fastabend. 5) Fix BPF ringbuf map to enforce size to be the power of 2 in order for its position masking to work, from Andrii Nakryiko. 6) Fix regression from CAP_BPF work to re-allow CAP_SYS_ADMIN for loading of network programs, from Maciej Żenczykowski. 7) Fix libbpf section name prefix for devmap progs, from Jesper Dangaard Brouer. 8) Fix formatting in UAPI documentation for BPF helpers, from Quentin Monnet. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-30bpf: Add tests for PTR_TO_BTF_ID vs. null comparisonYonghong Song4-2/+63
Add two tests for PTR_TO_BTF_ID vs. null ptr comparison, one for PTR_TO_BTF_ID in the ctx structure and the other for PTR_TO_BTF_ID after one level pointer chasing. In both cases, the test ensures condition is not removed. For example, for this test struct bpf_fentry_test_t { struct bpf_fentry_test_t *a; }; int BPF_PROG(test7, struct bpf_fentry_test_t *arg) { if (arg == 0) test7_result = 1; return 0; } Before the previous verifier change, we have xlated codes: int test7(long long unsigned int * ctx): ; int BPF_PROG(test7, struct bpf_fentry_test_t *arg) 0: (79) r1 = *(u64 *)(r1 +0) ; int BPF_PROG(test7, struct bpf_fentry_test_t *arg) 1: (b4) w0 = 0 2: (95) exit After the previous verifier change, we have: int test7(long long unsigned int * ctx): ; int BPF_PROG(test7, struct bpf_fentry_test_t *arg) 0: (79) r1 = *(u64 *)(r1 +0) ; if (arg == 0) 1: (55) if r1 != 0x0 goto pc+4 ; test7_result = 1; 2: (18) r1 = map[id:6][0]+48 4: (b7) r2 = 1 5: (7b) *(u64 *)(r1 +0) = r2 ; int BPF_PROG(test7, struct bpf_fentry_test_t *arg) 6: (b4) w0 = 0 7: (95) exit Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200630171241.2523875-1-yhs@fb.com
2020-06-30bpf: Fix an incorrect branch elimination by verifierYonghong Song1-2/+1
Wenbo reported an issue in [1] where a checking of null pointer is evaluated as always false. In this particular case, the program type is tp_btf and the pointer to compare is a PTR_TO_BTF_ID. The current verifier considers PTR_TO_BTF_ID always reprents a non-null pointer, hence all PTR_TO_BTF_ID compares to 0 will be evaluated as always not-equal, which resulted in the branch elimination. For example, struct bpf_fentry_test_t { struct bpf_fentry_test_t *a; }; int BPF_PROG(test7, struct bpf_fentry_test_t *arg) { if (arg == 0) test7_result = 1; return 0; } int BPF_PROG(test8, struct bpf_fentry_test_t *arg) { if (arg->a == 0) test8_result = 1; return 0; } In above bpf programs, both branch arg == 0 and arg->a == 0 are removed. This may not be what developer expected. The bug is introduced by Commit cac616db39c2 ("bpf: Verifier track null pointer branch_taken with JNE and JEQ"), where PTR_TO_BTF_ID is considered to be non-null when evaluting pointer vs. scalar comparison. This may be added considering we have PTR_TO_BTF_ID_OR_NULL in the verifier as well. PTR_TO_BTF_ID_OR_NULL is added to explicitly requires a non-NULL testing in selective cases. The current generic pointer tracing framework in verifier always assigns PTR_TO_BTF_ID so users does not need to check NULL pointer at every pointer level like a->b->c->d. We may not want to assign every PTR_TO_BTF_ID as PTR_TO_BTF_ID_OR_NULL as this will require a null test before pointer dereference which may cause inconvenience for developers. But we could avoid branch elimination to preserve original code intention. This patch simply removed PTR_TO_BTD_ID from reg_type_not_null() in verifier, which prevented the above branches from being eliminated. [1]: https://lore.kernel.org/bpf/79dbb7c0-449d-83eb-5f4f-7af0cc269168@fb.com/T/ Fixes: cac616db39c2 ("bpf: Verifier track null pointer branch_taken with JNE and JEQ") Reported-by: Wenbo Zhang <ethercflow@gmail.com> Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200630171240.2523722-1-yhs@fb.com
2020-06-30Merge branch 'net-ipa-three-bug-fixes'David S. Miller5-10/+32
Alex Elder says: ==================== net: ipa: three bug fixes This series contains three bug fixes for the Qualcomm IPA driver. In practice these bugs are unlikke.y to be harmful, but they do represent incorrect code. Version 2 adds "Fixes" tags to two of the patches and fixes a typo in one (found by checkpatch.pl). ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-30net: ipa: introduce ipa_cmd_tag_process()Alex Elder3-0/+25
Create a new function ipa_cmd_tag_process() that simply allocates a transaction, adds a tag process command to it to clear the hardware pipeline, and commits the transaction. Call it in from ipa_endpoint_suspend(), after suspending the modem endpoints but before suspending the AP command TX and AP LAN RX endpoints (which are used by the tag sequence). Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-30net: ipa: no checksum offload for SDM845 LAN RXAlex Elder1-1/+0
The AP LAN RX endpoint should not have download checksum offload enabled. The receive handler does properly accommodate the trailer that's added by the hardware, but we ignore it. Fixes: 1ed7d0c0fdba ("soc: qcom: ipa: configuration data") Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-30net: ipa: always check for stopped channelAlex Elder1-9/+7
In gsi_channel_stop(), there's a check to see if the channel might have entered STOPPED state since a previous call, which might have timed out before stopping completed. That check actually belongs in gsi_channel_stop_command(), which is called repeatedly by gsi_channel_stop() for RX channels. Fixes: 650d1603825d ("soc: qcom: ipa: the generic software interface") Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-30net: mvneta: fix use of state->speedRussell King1-1/+1
When support for short preambles was added, it incorrectly keyed its decision off state->speed instead of state->interface. state->speed is not guaranteed to be correct for in-band modes, which can lead to short preambles being unexpectedly disabled. Fix this by keying off the interface mode, which is the only way that mvneta can operate at 2.5Gbps. Fixes: da58a931f248 ("net: mvneta: Add support for 2500Mbps SGMII") Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-30Merge branch 'support-AF_PACKET-for-layer-3-devices'David S. Miller12-18/+34
Jason A. Donenfeld says: ==================== support AF_PACKET for layer 3 devices Hans reported that packets injected by a correct-looking and trivial libpcap-based program were not being accepted by wireguard. In investigating that, I noticed that a few devices weren't properly handling AF_PACKET-injected packets, and so this series introduces a bit of shared infrastructure to support that. The basic problem begins with socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL)) sockets. When sendto is called, AF_PACKET examines the headers of the packet with this logic: static void packet_parse_headers(struct sk_buff *skb, struct socket *sock) { if ((!skb->protocol || skb->protocol == htons(ETH_P_ALL)) && sock->type == SOCK_RAW) { skb_reset_mac_header(skb); skb->protocol = dev_parse_header_protocol(skb); } skb_probe_transport_header(skb); } The middle condition there triggers, and we jump to dev_parse_header_protocol. Note that this is the only caller of dev_parse_header_protocol in the kernel, and I assume it was designed for this purpose: static inline __be16 dev_parse_header_protocol(const struct sk_buff *skb) { const struct net_device *dev = skb->dev; if (!dev->header_ops || !dev->header_ops->parse_protocol) return 0; return dev->header_ops->parse_protocol(skb); } Since AF_PACKET already knows which netdev the packet is going to, the dev_parse_header_protocol function can see if that netdev has a way it prefers to figure out the protocol from the header. This, again, is the only use of parse_protocol in the kernel. At the moment, it's only used with ethernet devices, via eth_header_parse_protocol. This makes sense, as mostly people are used to AF_PACKET-injecting ethernet frames rather than layer 3 frames. But with nothing in place for layer 3 netdevs, this function winds up returning 0, and skb->protocol then is set to 0, and then by the time it hits the netdev's ndo_start_xmit, the driver doesn't know what to do with it. This is a problem because drivers very much rely on skb->protocol being correct, and routinely reject packets where it's incorrect. That's why having this parsing happen for injected packets is quite important. In wireguard, ipip, and ipip6, for example, packets from AF_PACKET are just dropped entirely. For tun devices, it's sort of uglier, with the tun "packet information" header being passed to userspace containing a bogus protocol value. Some userspace programs are ill-equipped to deal with that. (But of course, that doesn't happen with tap devices, which benefit from the similar shared infrastructure for layer 2 netdevs, further motiviating this patchset for layer 3 netdevs.) This patchset addresses the issue by first adding a layer 3 header parse function, much akin to the existing one for layer 2 packets, and then adds a shared header_ops structure that, also much akin to the existing one for layer 2 packets. Then it wires it up to a few immediate places that stuck out as requiring it, and does a bit of cleanup. This patchset seems like it's fixing real bugs, so it might be appropriate for stable. But they're also very old bugs, so if you'd rather not backport to stable, that'd make sense to me too. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-30net: xfrmi: implement header_ops->parse_protocol for AF_PACKETJason A. Donenfeld1-0/+2
The xfrm interface uses skb->protocol to determine packet type, and bails out if it's not set. For AF_PACKET injection, we need to support its call chain of: packet_sendmsg -> packet_snd -> packet_parse_headers -> dev_parse_header_protocol -> parse_protocol Without a valid parse_protocol, this returns zero, and xfrmi rejects the skb. So, this wires up the ip_tunnel handler for layer 3 packets for that case. Reported-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-30net: sit: implement header_ops->parse_protocol for AF_PACKETJason A. Donenfeld1-0/+1
Sit uses skb->protocol to determine packet type, and bails out if it's not set. For AF_PACKET injection, we need to support its call chain of: packet_sendmsg -> packet_snd -> packet_parse_headers -> dev_parse_header_protocol -> parse_protocol Without a valid parse_protocol, this returns zero, and sit rejects the skb. So, this wires up the ip_tunnel handler for layer 3 packets for that case. Reported-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-30net: vti: implement header_ops->parse_protocol for AF_PACKETJason A. Donenfeld2-0/+2
Vti uses skb->protocol to determine packet type, and bails out if it's not set. For AF_PACKET injection, we need to support its call chain of: packet_sendmsg -> packet_snd -> packet_parse_headers -> dev_parse_header_protocol -> parse_protocol Without a valid parse_protocol, this returns zero, and vti rejects the skb. So, this wires up the ip_tunnel handler for layer 3 packets for that case. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-30tun: implement header_ops->parse_protocol for AF_PACKETJason A. Donenfeld1-0/+2
The tun driver passes up skb->protocol to userspace in the form of PI headers. For AF_PACKET injection, we need to support its call chain of: packet_sendmsg -> packet_snd -> packet_parse_headers -> dev_parse_header_protocol -> parse_protocol Without a valid parse_protocol, this returns zero, and the tun driver then gives userspace bogus values that it can't deal with. Note that this isn't the case with tap, because tap already benefits from the shared infrastructure for ethernet headers. But with tun, there's nothing. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-30wireguard: queueing: make use of ip_tunnel_parse_protocolJason A. Donenfeld2-18/+3
Now that wg_examine_packet_protocol has been added for general consumption as ip_tunnel_parse_protocol, it's possible to remove wg_examine_packet_protocol and simply use the new ip_tunnel_parse_protocol function directly. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-30wireguard: implement header_ops->parse_protocol for AF_PACKETJason A. Donenfeld1-0/+1
WireGuard uses skb->protocol to determine packet type, and bails out if it's not set or set to something it's not expecting. For AF_PACKET injection, we need to support its call chain of: packet_sendmsg -> packet_snd -> packet_parse_headers -> dev_parse_header_protocol -> parse_protocol Without a valid parse_protocol, this returns zero, and wireguard then rejects the skb. So, this wires up the ip_tunnel handler for layer 3 packets for that case. Reported-by: Hans Wippel <ndev@hwipl.net> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-30net: ipip: implement header_ops->parse_protocol for AF_PACKETJason A. Donenfeld2-0/+2
Ipip uses skb->protocol to determine packet type, and bails out if it's not set. For AF_PACKET injection, we need to support its call chain of: packet_sendmsg -> packet_snd -> packet_parse_headers -> dev_parse_header_protocol -> parse_protocol Without a valid parse_protocol, this returns zero, and ipip rejects the skb. So, this wires up the ip_tunnel handler for layer 3 packets for that case. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-30net: ip_tunnel: add header_ops for layer 3 devicesJason A. Donenfeld2-0/+21
Some devices that take straight up layer 3 packets benefit from having a shared header_ops so that AF_PACKET sockets can inject packets that are recognized. This shared infrastructure will be used by other drivers that currently can't inject packets using AF_PACKET. It also exposes the parser function, as it is useful in standalone form too. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-30bpf, netns: Fix use-after-free in pernet pre_exit callbackJakub Sitnicki1-7/+3
Iterating over BPF links attached to network namespace in pre_exit hook is not safe, even if there is just one. Once link gets auto-detached, that is its back-pointer to net object is set to NULL, the link can be released and freed without waiting on netns_bpf_mutex, effectively causing the list element we are operating on to be freed. This leads to use-after-free when trying to access the next element on the list, as reported by KASAN. Bug can be triggered by destroying a network namespace, while also releasing a link attached to this network namespace. | ================================================================== | BUG: KASAN: use-after-free in netns_bpf_pernet_pre_exit+0xd9/0x130 | Read of size 8 at addr ffff888119e0d778 by task kworker/u8:2/177 | | CPU: 3 PID: 177 Comm: kworker/u8:2 Not tainted 5.8.0-rc1-00197-ga0c04c9d1008-dirty #776 | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014 | Workqueue: netns cleanup_net | Call Trace: | dump_stack+0x9e/0xe0 | print_address_description.constprop.0+0x3a/0x60 | ? netns_bpf_pernet_pre_exit+0xd9/0x130 | kasan_report.cold+0x1f/0x40 | ? netns_bpf_pernet_pre_exit+0xd9/0x130 | netns_bpf_pernet_pre_exit+0xd9/0x130 | cleanup_net+0x30b/0x5b0 | ? unregister_pernet_device+0x50/0x50 | ? rcu_read_lock_bh_held+0xb0/0xb0 | ? _raw_spin_unlock_irq+0x24/0x50 | process_one_work+0x4d1/0xa10 | ? lock_release+0x3e0/0x3e0 | ? pwq_dec_nr_in_flight+0x110/0x110 | ? rwlock_bug.part.0+0x60/0x60 | worker_thread+0x7a/0x5c0 | ? process_one_work+0xa10/0xa10 | kthread+0x1e3/0x240 | ? kthread_create_on_node+0xd0/0xd0 | ret_from_fork+0x1f/0x30 | | Allocated by task 280: | save_stack+0x1b/0x40 | __kasan_kmalloc.constprop.0+0xc2/0xd0 | netns_bpf_link_create+0xfe/0x650 | __do_sys_bpf+0x153a/0x2a50 | do_syscall_64+0x59/0x300 | entry_SYSCALL_64_after_hwframe+0x44/0xa9 | | Freed by task 198: | save_stack+0x1b/0x40 | __kasan_slab_free+0x12f/0x180 | kfree+0xed/0x350 | process_one_work+0x4d1/0xa10 | worker_thread+0x7a/0x5c0 | kthread+0x1e3/0x240 | ret_from_fork+0x1f/0x30 | | The buggy address belongs to the object at ffff888119e0d700 | which belongs to the cache kmalloc-192 of size 192 | The buggy address is located 120 bytes inside of | 192-byte region [ffff888119e0d700, ffff888119e0d7c0) | The buggy address belongs to the page: | page:ffffea0004678340 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 | flags: 0x2fffe0000000200(slab) | raw: 02fffe0000000200 ffffea00045ba8c0 0000000600000006 ffff88811a80ea80 | raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000 | page dumped because: kasan: bad access detected | | Memory state around the buggy address: | ffff888119e0d600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb | ffff888119e0d680: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc | >ffff888119e0d700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb | ^ | ffff888119e0d780: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc | ffff888119e0d800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb | ================================================================== Remove the "fast-path" for releasing a link that got auto-detached by a dying network namespace to fix it. This way as long as link is on the list and netns_bpf mutex is held, we have a guarantee that link memory can be accessed. An alternative way to fix this issue would be to safely iterate over the list of links and ensure there is no access to link object after detaching it. But, at the moment, optimizing synchronization overhead on link release without a workload in mind seems like an overkill. Fixes: ab53cad90eb1 ("bpf, netns: Keep a list of attached bpf_link's") Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20200630164541.1329993-1-jakub@cloudflare.com
2020-06-30Merge branch 'fix-sockmap-flow_dissector-uapi'Alexei Starovoitov8-25/+103
Lorenz Bauer says: ==================== Both sockmap and flow_dissector ingnore various arguments passed to BPF_PROG_ATTACH and BPF_PROG_DETACH. We can fix the attach case by checking that the unused arguments are zero. I considered requiring target_fd to be -1 instead of 0, but this leads to a lot of churn in selftests. There is also precedent in that bpf_iter already expects 0 for a similar field. I think that we can come up with a work around for fd 0 should we need to in the future. The detach case is more problematic: both cgroups and lirc2 verify that attach_bpf_fd matches the currently attached program. This way you need access to the program fd to be able to remove it. Neither sockmap nor flow_dissector do this. flow_dissector even has a check for CAP_NET_ADMIN because of this. The patch set addresses this by implementing the desired behaviour. There is a possibility for user space breakage: any callers that don't provide the correct fd will fail with ENOENT. For sockmap the risk is low: even the selftests assume that sockmap works the way I described. For flow_dissector the story is less straightforward, and the selftests use a variety of arguments. I've includes fixes tags for the oldest commits that allow an easy backport, however the behaviour dates back to when sockmap and flow_dissector were introduced. What is the best way to handle these? This set is based on top of Jakub's work "bpf, netns: Prepare for multi-prog attachment" available at https://lore.kernel.org/bpf/87k0zwmhtb.fsf@cloudflare.com/T/ Since v1: - Adjust selftests - Implement detach behaviour ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-06-30selftests: bpf: Pass program to bpf_prog_detach in flow_dissectorLorenz Bauer1-2/+2
Calling bpf_prog_detach is incorrect, since it takes target_fd as its argument. The intention here is to pass it as attach_bpf_fd, so use bpf_prog_detach2 and pass zero for target_fd. Fixes: 06716e04a043 ("selftests/bpf: Extend test_flow_dissector to cover link creation") Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200629095630.7933-7-lmb@cloudflare.com
2020-06-30selftests: bpf: Pass program and target_fd in flow_dissector_reattachLorenz Bauer1-6/+6
Pass 0 as target_fd when attaching and detaching flow dissector. Additionally, pass the expected program when detaching. Fixes: 1f043f87bb59 ("selftests/bpf: Add tests for attaching bpf_link to netns") Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200629095630.7933-6-lmb@cloudflare.com
2020-06-30bpf: sockmap: Require attach_bpf_fd when detaching a programLorenz Bauer4-8/+70
The sockmap code currently ignores the value of attach_bpf_fd when detaching a program. This is contrary to the usual behaviour of checking that attach_bpf_fd represents the currently attached program. Ensure that attach_bpf_fd is indeed the currently attached program. It turns out that all sockmap selftests already do this, which indicates that this is unlikely to cause breakage. Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200629095630.7933-5-lmb@cloudflare.com
2020-06-30bpf: sockmap: Check value of unused args to BPF_PROG_ATTACHLorenz Bauer1-0/+3
Using BPF_PROG_ATTACH on a sockmap program currently understands no flags or replace_bpf_fd, but accepts any value. Return EINVAL instead. Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200629095630.7933-4-lmb@cloudflare.com
2020-06-30bpf: flow_dissector: Check value of unused flags to BPF_PROG_DETACHLorenz Bauer3-9/+19
Using BPF_PROG_DETACH on a flow dissector program supports neither attach_flags nor attach_bpf_fd. Yet no value is enforced for them. Enforce that attach_flags are zero, and require the current program to be passed via attach_bpf_fd. This allows us to remove the check for CAP_SYS_ADMIN, since userspace can now no longer remove arbitrary flow dissector programs. Fixes: b27f7bb590ba ("flow_dissector: Move out netns_bpf prog callbacks") Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200629095630.7933-3-lmb@cloudflare.com
2020-06-30bpf: flow_dissector: Check value of unused flags to BPF_PROG_ATTACHLorenz Bauer1-0/+3
Using BPF_PROG_ATTACH on a flow dissector program supports neither target_fd, attach_flags or replace_bpf_fd but accepts any value. Enforce that all of them are zero. This is fine for replace_bpf_fd since its presence is indicated by BPF_F_REPLACE. It's more problematic for target_fd, since zero is a valid fd. Should we want to use the flag later on we'd have to add an exception for fd 0. The alternative is to force a value like -1. This requires more changes to tests. There is also precedent for using 0, since bpf_iter uses this for target_fd as well. Fixes: b27f7bb590ba ("flow_dissector: Move out netns_bpf prog callbacks") Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200629095630.7933-2-lmb@cloudflare.com
2020-06-30Merge branch 'bpf-multi-prog-prep'Alexei Starovoitov5-76/+160
Jakub Sitnicki says: ==================== This patch set prepares ground for link-based multi-prog attachment for future netns attach types, with BPF_SK_LOOKUP attach type in mind [0]. Two changes are needed in order to attach and run a series of BPF programs: 1) an bpf_prog_array of programs to run (patch #2), and 2) a list of attached links to keep track of attachments (patch #3). Nothing changes for BPF flow_dissector. Just as before only one program can be attached to netns. In v3 I've simplified patch #2 that introduces bpf_prog_array to take advantage of the fact that it will hold at most one program for now. In particular, I'm no longer using bpf_prog_array_copy. It turned out to be less suitable for link operations than I thought as it fails to append the same BPF program. bpf_prog_array_replace_item is also gone, because we know we always want to replace the first element in prog_array. Naturally the code that handles bpf_prog_array will need change once more when there is a program type that allows multi-prog attachment. But I feel it will be better to do it gradually and present it together with tests that actually exercise multi-prog code paths. [0] https://lore.kernel.org/bpf/20200511185218.1422406-1-jakub@cloudflare.com/ v2 -> v3: - Don't check if run_array is null in link update callback. (Martin) - Allow updating the link with the same BPF program. (Andrii) - Add patch #4 with a test for the above case. - Kill bpf_prog_array_replace_item. Access the run_array directly. - Switch from bpf_prog_array_copy() to bpf_prog_array_alloc(1, ...). - Replace rcu_deref_protected & RCU_INIT_POINTER with rcu_replace_pointer. - Drop Andrii's Ack from patch #2. Code changed. v1 -> v2: - Show with a (void) cast that bpf_prog_array_replace_item() return value is ignored on purpose. (Andrii) - Explain why bpf-cgroup cannot replace programs in bpf_prog_array based on bpf_prog pointer comparison in patch #2 description. (Andrii) ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-06-30selftests/bpf: Test updating flow_dissector link with same programJakub Sitnicki1-4/+28
This case, while not particularly useful, is worth covering because we expect the operation to succeed as opposed when re-attaching the same program directly with PROG_ATTACH. While at it, update the tests summary that fell out of sync when tests extended to cover links. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200625141357.910330-5-jakub@cloudflare.com
2020-06-30bpf, netns: Keep a list of attached bpf_link'sJakub Sitnicki2-20/+24
To support multi-prog link-based attachments for new netns attach types, we need to keep track of more than one bpf_link per attach type. Hence, convert net->bpf.links into a list, that currently can be either empty or have just one item. Instead of reusing bpf_prog_list from bpf-cgroup, we link together bpf_netns_link's themselves. This makes list management simpler as we don't have to allocate, initialize, and later release list elements. We can do this because multi-prog attachment will be available only for bpf_link, and we don't need to build a list of programs attached directly and indirectly via links. No functional changes intended. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200625141357.910330-4-jakub@cloudflare.com
2020-06-30bpf, netns: Keep attached programs in bpf_prog_arrayJakub Sitnicki3-48/+96
Prepare for having multi-prog attachments for new netns attach types by storing programs to run in a bpf_prog_array, which is well suited for iterating over programs and running them in sequence. After this change bpf(PROG_QUERY) may block to allocate memory in bpf_prog_array_copy_to_user() for collected program IDs. This forces a change in how we protect access to the attached program in the query callback. Because bpf_prog_array_copy_to_user() can sleep, we switch from an RCU read lock to holding a mutex that serializes updaters. Because we allow only one BPF flow_dissector program to be attached to netns at all times, the bpf_prog_array pointed by net->bpf.run_array is always either detached (null) or one element long. No functional changes intended. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200625141357.910330-3-jakub@cloudflare.com
2020-06-30flow_dissector: Pull BPF program assignment up to bpf-netnsJakub Sitnicki3-14/+22
Prepare for using bpf_prog_array to store attached programs by moving out code that updates the attached program out of flow dissector. Managing bpf_prog_array is more involved than updating a single bpf_prog pointer. This will let us do it all from one place, bpf/net_namespace.c, in the subsequent patch. No functional change intended. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200625141357.910330-2-jakub@cloudflare.com
2020-06-30bpf: Enforce BPF ringbuf size to be the power of 2Andrii Nakryiko1-10/+8
BPF ringbuf assumes the size to be a multiple of page size and the power of 2 value. The latter is important to avoid division while calculating position inside the ring buffer and using (N-1) mask instead. This patch fixes omission to enforce power-of-2 size rule. Fixes: 457f44363a88 ("bpf: Implement BPF ring buffer and verifier support for it") Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200630061500.1804799-1-andriin@fb.com
2020-06-30xsk: Use dma_need_sync instead of reimplenting itChristoph Hellwig1-47/+3
Use the dma_need_sync helper instead of (not always entirely correctly) poking into the dma-mapping internals. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200629130359.2690853-5-hch@lst.de
2020-06-30xsk: Remove a double pool->dev assignment in xp_dma_mapChristoph Hellwig1-1/+0
->dev is already assigned at the top of the function, remove the duplicate one at the end. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200629130359.2690853-4-hch@lst.de
2020-06-30xsk: Replace the cheap_dma flag with a dma_need_sync flagChristoph Hellwig2-6/+5
Invert the polarity and better name the flag so that the use case is properly documented. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200629130359.2690853-3-hch@lst.de
2020-06-30dma-mapping: Add a new dma_need_sync APIChristoph Hellwig5-0/+30
Add a new API to check if calls to dma_sync_single_for_{device,cpu} are required for a given DMA streaming mapping. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200629130359.2690853-2-hch@lst.de
2020-06-29genetlink: get rid of family->attrbufCong Wang2-37/+13
genl_family_rcv_msg_attrs_parse() reuses the global family->attrbuf when family->parallel_ops is false. However, family->attrbuf is not protected by any lock on the genl_family_rcv_msg_doit() code path. This leads to several different consequences, one of them is UAF, like the following: genl_family_rcv_msg_doit(): genl_start(): genl_family_rcv_msg_attrs_parse() attrbuf = family->attrbuf __nlmsg_parse(attrbuf); genl_family_rcv_msg_attrs_parse() attrbuf = family->attrbuf __nlmsg_parse(attrbuf); info->attrs = attrs; cb->data = info; netlink_unicast_kernel(): consume_skb() genl_lock_dumpit(): genl_dumpit_info(cb)->attrs Note family->attrbuf is an array of pointers to the skb data, once the skb is freed, any dereference of family->attrbuf will be a UAF. Maybe we could serialize the family->attrbuf with genl_mutex too, but that would make the locking more complicated. Instead, we can just get rid of family->attrbuf and always allocate attrbuf from heap like the family->parallel_ops==true code path. This may add some performance overhead but comparing with taking the global genl_mutex, it still looks better. Fixes: 75cdbdd08900 ("net: ieee802154: have genetlink code to parse the attrs during dumpit") Fixes: 057af7071344 ("net: tipc: have genetlink code to parse the attrs during dumpit") Reported-and-tested-by: syzbot+3039ddf6d7b13daf3787@syzkaller.appspotmail.com Reported-and-tested-by: syzbot+80cad1e3cb4c41cde6ff@syzkaller.appspotmail.com Reported-and-tested-by: syzbot+736bcbcb11b60d0c0792@syzkaller.appspotmail.com Reported-and-tested-by: syzbot+520f8704db2b68091d44@syzkaller.appspotmail.com Reported-and-tested-by: syzbot+c96e4dfb32f8987fdeed@syzkaller.appspotmail.com Cc: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29Merge tag 'mac80211-for-net-2020-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211David S. Miller6-16/+56
Johannes Berg says: ==================== Couple of fixes/small things: * TX control port status check fixed to not assume frame format * mesh control port fixes * error handling/leak fixes when starting AP, with HE attributes * fix broadcast packet handling with encapsulation offload * add new AKM suites * and a small code cleanup ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-28llc: make sure applications use ARPHRD_ETHEREric Dumazet1-3/+7
syzbot was to trigger a bug by tricking AF_LLC with non sensible addr->sllc_arphrd It seems clear LLC requires an Ethernet device. Back in commit abf9d537fea2 ("llc: add support for SO_BINDTODEVICE") Octavian Purdila added possibility for application to use a zero value for sllc_arphrd, convert it to ARPHRD_ETHER to not cause regressions on existing applications. BUG: KASAN: use-after-free in __read_once_size include/linux/compiler.h:199 [inline] BUG: KASAN: use-after-free in list_empty include/linux/list.h:268 [inline] BUG: KASAN: use-after-free in waitqueue_active include/linux/wait.h:126 [inline] BUG: KASAN: use-after-free in wq_has_sleeper include/linux/wait.h:160 [inline] BUG: KASAN: use-after-free in skwq_has_sleeper include/net/sock.h:2092 [inline] BUG: KASAN: use-after-free in sock_def_write_space+0x642/0x670 net/core/sock.c:2813 Read of size 8 at addr ffff88801e0b4078 by task ksoftirqd/3/27 CPU: 3 PID: 27 Comm: ksoftirqd/3 Not tainted 5.5.0-rc1-syzkaller #0 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x197/0x210 lib/dump_stack.c:118 print_address_description.constprop.0.cold+0xd4/0x30b mm/kasan/report.c:374 __kasan_report.cold+0x1b/0x41 mm/kasan/report.c:506 kasan_report+0x12/0x20 mm/kasan/common.c:639 __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:135 __read_once_size include/linux/compiler.h:199 [inline] list_empty include/linux/list.h:268 [inline] waitqueue_active include/linux/wait.h:126 [inline] wq_has_sleeper include/linux/wait.h:160 [inline] skwq_has_sleeper include/net/sock.h:2092 [inline] sock_def_write_space+0x642/0x670 net/core/sock.c:2813 sock_wfree+0x1e1/0x260 net/core/sock.c:1958 skb_release_head_state+0xeb/0x260 net/core/skbuff.c:652 skb_release_all+0x16/0x60 net/core/skbuff.c:663 __kfree_skb net/core/skbuff.c:679 [inline] consume_skb net/core/skbuff.c:838 [inline] consume_skb+0xfb/0x410 net/core/skbuff.c:832 __dev_kfree_skb_any+0xa4/0xd0 net/core/dev.c:2967 dev_kfree_skb_any include/linux/netdevice.h:3650 [inline] e1000_unmap_and_free_tx_resource.isra.0+0x21b/0x3a0 drivers/net/ethernet/intel/e1000/e1000_main.c:1963 e1000_clean_tx_irq drivers/net/ethernet/intel/e1000/e1000_main.c:3854 [inline] e1000_clean+0x4cc/0x1d10 drivers/net/ethernet/intel/e1000/e1000_main.c:3796 napi_poll net/core/dev.c:6532 [inline] net_rx_action+0x508/0x1120 net/core/dev.c:6600 __do_softirq+0x262/0x98c kernel/softirq.c:292 run_ksoftirqd kernel/softirq.c:603 [inline] run_ksoftirqd+0x8e/0x110 kernel/softirq.c:595 smpboot_thread_fn+0x6a3/0xa40 kernel/smpboot.c:165 kthread+0x361/0x430 kernel/kthread.c:255 ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352 Allocated by task 8247: save_stack+0x23/0x90 mm/kasan/common.c:72 set_track mm/kasan/common.c:80 [inline] __kasan_kmalloc mm/kasan/common.c:513 [inline] __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:486 kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:521 slab_post_alloc_hook mm/slab.h:584 [inline] slab_alloc mm/slab.c:3320 [inline] kmem_cache_alloc+0x121/0x710 mm/slab.c:3484 sock_alloc_inode+0x1c/0x1d0 net/socket.c:240 alloc_inode+0x68/0x1e0 fs/inode.c:230 new_inode_pseudo+0x19/0xf0 fs/inode.c:919 sock_alloc+0x41/0x270 net/socket.c:560 __sock_create+0xc2/0x730 net/socket.c:1384 sock_create net/socket.c:1471 [inline] __sys_socket+0x103/0x220 net/socket.c:1513 __do_sys_socket net/socket.c:1522 [inline] __se_sys_socket net/socket.c:1520 [inline] __ia32_sys_socket+0x73/0xb0 net/socket.c:1520 do_syscall_32_irqs_on arch/x86/entry/common.c:337 [inline] do_fast_syscall_32+0x27b/0xe16 arch/x86/entry/common.c:408 entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139 Freed by task 17: save_stack+0x23/0x90 mm/kasan/common.c:72 set_track mm/kasan/common.c:80 [inline] kasan_set_free_info mm/kasan/common.c:335 [inline] __kasan_slab_free+0x102/0x150 mm/kasan/common.c:474 kasan_slab_free+0xe/0x10 mm/kasan/common.c:483 __cache_free mm/slab.c:3426 [inline] kmem_cache_free+0x86/0x320 mm/slab.c:3694 sock_free_inode+0x20/0x30 net/socket.c:261 i_callback+0x44/0x80 fs/inode.c:219 __rcu_reclaim kernel/rcu/rcu.h:222 [inline] rcu_do_batch kernel/rcu/tree.c:2183 [inline] rcu_core+0x570/0x1540 kernel/rcu/tree.c:2408 rcu_core_si+0x9/0x10 kernel/rcu/tree.c:2417 __do_softirq+0x262/0x98c kernel/softirq.c:292 The buggy address belongs to the object at ffff88801e0b4000 which belongs to the cache sock_inode_cache of size 1152 The buggy address is located 120 bytes inside of 1152-byte region [ffff88801e0b4000, ffff88801e0b4480) The buggy address belongs to the page: page:ffffea0000782d00 refcount:1 mapcount:0 mapping:ffff88807aa59c40 index:0xffff88801e0b4ffd raw: 00fffe0000000200 ffffea00008e6c88 ffffea0000782d48 ffff88807aa59c40 raw: ffff88801e0b4ffd ffff88801e0b4000 0000000100000003 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff88801e0b3f00: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc ffff88801e0b3f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc >ffff88801e0b4000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff88801e0b4080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff88801e0b4100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb Fixes: abf9d537fea2 ("llc: add support for SO_BINDTODEVICE") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-28net: explain the lockdep annotations for dev_uc_unsync()Cong Wang1-0/+10
The lockdep annotations for dev_uc_unsync() and dev_mc_unsync() are not easy to understand, so add some comments to explain why they are correct. Similar for the rest netif_addr_lock_bh() cases, they don't need nested version. Cc: Taehee Yoo <ap420073@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-28net: get rid of lockdep_set_class_and_subclass()Cong Wang3-11/+8
lockdep_set_class_and_subclass() is meant to reduce the _nested() annotations by assigning a default subclass. For addr_list_lock, we have to compute the subclass at run-time as the netdevice topology changes after creation. So, we should just get rid of these lockdep_set_class_and_subclass() and stick with our _nested() annotations. Fixes: 845e0ebb4408 ("net: change addr_list_lock back to static key") Suggested-by: Taehee Yoo <ap420073@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-28lib: packing: add documentation for pbuflen argumentVladimir Oltean1-0/+1
Fixes sparse warning: Function parameter or member 'pbuflen' not described in 'packing' Fixes: 554aae35007e ("lib: Add support for generic packing operations") Signed-off-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-28bridge: mrp: Fix endian conversion and some other warningsHoratiu Vultur3-3/+3
The following sparse warnings are fixed: net/bridge/br_mrp.c:106:18: warning: incorrect type in assignment (different base types) net/bridge/br_mrp.c:106:18: expected unsigned short [usertype] net/bridge/br_mrp.c:106:18: got restricted __be16 [usertype] net/bridge/br_mrp.c:281:23: warning: incorrect type in argument 1 (different modifiers) net/bridge/br_mrp.c:281:23: expected struct list_head *entry net/bridge/br_mrp.c:281:23: got struct list_head [noderef] * net/bridge/br_mrp.c:332:28: warning: incorrect type in argument 1 (different modifiers) net/bridge/br_mrp.c:332:28: expected struct list_head *new net/bridge/br_mrp.c:332:28: got struct list_head [noderef] * net/bridge/br_mrp.c:332:40: warning: incorrect type in argument 2 (different modifiers) net/bridge/br_mrp.c:332:40: expected struct list_head *head net/bridge/br_mrp.c:332:40: got struct list_head [noderef] * net/bridge/br_mrp.c:682:29: warning: incorrect type in argument 1 (different modifiers) net/bridge/br_mrp.c:682:29: expected struct list_head const *head net/bridge/br_mrp.c:682:29: got struct list_head [noderef] * Reported-by: kernel test robot <lkp@intel.com> Fixes: 2f1a11ae11d222 ("bridge: mrp: Add MRP interface.") Fixes: 4b8d7d4c599182 ("bridge: mrp: Extend bridge interface") Fixes: 9a9f26e8f7ea30 ("bridge: mrp: Connect MRP API with the switchdev API") Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-28Merge branch 'fix-sockmap'Alexei Starovoitov3-9/+40
John Fastabend says: ==================== Fix a splat introduced by recent changes to avoid skipping ingress policy when kTLS is enabled. The RCU splat was introduced because in the non-TLS case the caller is wrapped in an rcu_read_lock/unlock. But, in the TLS case we have a reference to the psock and the caller did not wrap its call in rcu_read_lock/unlock. To fix extend the RCU section to include the redirect case which was missed. From v1->v2 I changed the location a bit to simplify the code some. See patch 1. But, then Martin asked why it was not needed in the non-TLS case. The answer for patch 1 was, as stated above, because the caller has the rcu read lock. However, there was still a missing case where a BPF user could in-theory line up a set of parameters to hit a case where the code was entered from strparser side from a different context then the initial caller. To hit this user would need a parser program to return value greater than skb->len then an ENOMEM error could happen in the strparser codepath triggering strparser to retry from a workqueue and without rcu_read_lock original caller used. See patch 2 for details. Finally, we don't actually have any selftests for parser returning a value geater than skb->len so add one in patch 3. This is especially needed because at least I don't have any code that uses the parser to return value greater than skb->len. So I wouldn't have caught any errors here in my own testing. Thanks, John v1->v2: simplify code in patch 1 some and add patches 2 and 3. ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-06-28bpf, sockmap: Add ingres skb tests that utilize merge skbsJohn Fastabend2-1/+25
Add a test to check strparser merging skbs is working. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/159312681884.18340.4922800172600252370.stgit@john-XPS-13-9370
2020-06-28bpf, sockmap: RCU dereferenced psock may be used outside RCU blockJohn Fastabend1-1/+9
If an ingress verdict program specifies message sizes greater than skb->len and there is an ENOMEM error due to memory pressure we may call the rcv_msg handler outside the strp_data_ready() caller context. This is because on an ENOMEM error the strparser will retry from a workqueue. The caller currently protects the use of psock by calling the strp_data_ready() inside a rcu_read_lock/unlock block. But, in above workqueue error case the psock is accessed outside the read_lock/unlock block of the caller. So instead of using psock directly we must do a look up against the sk again to ensure the psock is available. There is an an ugly piece here where we must handle the case where we paused the strp and removed the psock. On psock removal we first pause the strparser and then remove the psock. If the strparser is paused while an skb is scheduled on the workqueue the skb will be dropped on the flow and kfree_skb() is called. If the workqueue manages to get called before we pause the strparser but runs the rcvmsg callback after the psock is removed we will hit the unlikely case where we run the sockmap rcvmsg handler but do not have a psock. For now we will follow strparser logic and drop the skb on the floor with skb_kfree(). This is ugly because the data is dropped. To date this has not caused problems in practice because either the application controlling the sockmap is coordinating with the datapath so that skbs are "flushed" before removal or we simply wait for the sock to be closed before removing it. This patch fixes the describe RCU bug and dropping the skb doesn't make things worse. Future patches will improve this by allowing the normal case where skbs are not merged to skip the strparser altogether. In practice many (most?) use cases have no need to merge skbs so its both a code complexity hit as seen above and a performance issue. For example, in the Cilium case we always set the strparser up to return sbks 1:1 without any merging and have avoided above issues. Fixes: e91de6afa81c1 ("bpf: Fix running sk_skb program types with ktls") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/159312679888.18340.15248924071966273998.stgit@john-XPS-13-9370
2020-06-28bpf, sockmap: RCU splat with redirect and strparser error or TLSJohn Fastabend1-7/+6
There are two paths to generate the below RCU splat the first and most obvious is the result of the BPF verdict program issuing a redirect on a TLS socket (This is the splat shown below). Unlike the non-TLS case the caller of the *strp_read() hooks does not wrap the call in a rcu_read_lock/unlock. Then if the BPF program issues a redirect action we hit the RCU splat. However, in the non-TLS socket case the splat appears to be relatively rare, because the skmsg caller into the strp_data_ready() is wrapped in a rcu_read_lock/unlock. Shown here, static void sk_psock_strp_data_ready(struct sock *sk) { struct sk_psock *psock; rcu_read_lock(); psock = sk_psock(sk); if (likely(psock)) { if (tls_sw_has_ctx_rx(sk)) { psock->parser.saved_data_ready(sk); } else { write_lock_bh(&sk->sk_callback_lock); strp_data_ready(&psock->parser.strp); write_unlock_bh(&sk->sk_callback_lock); } } rcu_read_unlock(); } If the above was the only way to run the verdict program we would be safe. But, there is a case where the strparser may throw an ENOMEM error while parsing the skb. This is a result of a failed skb_clone, or alloc_skb_for_msg while building a new merged skb when the msg length needed spans multiple skbs. This will in turn put the skb on the strp_wrk workqueue in the strparser code. The skb will later be dequeued and verdict programs run, but now from a different context without the rcu_read_lock()/unlock() critical section in sk_psock_strp_data_ready() shown above. In practice I have not seen this yet, because as far as I know most users of the verdict programs are also only working on single skbs. In this case no merge happens which could trigger the above ENOMEM errors. In addition the system would need to be under memory pressure. For example, we can't hit the above case in selftests because we missed having tests to merge skbs. (Added in later patch) To fix the below splat extend the rcu_read_lock/unnlock block to include the call to sk_psock_tls_verdict_apply(). This will fix both TLS redirect case and non-TLS redirect+error case. Also remove psock from the sk_psock_tls_verdict_apply() function signature its not used there. [ 1095.937597] WARNING: suspicious RCU usage [ 1095.940964] 5.7.0-rc7-02911-g463bac5f1ca79 #1 Tainted: G W [ 1095.944363] ----------------------------- [ 1095.947384] include/linux/skmsg.h:284 suspicious rcu_dereference_check() usage! [ 1095.950866] [ 1095.950866] other info that might help us debug this: [ 1095.950866] [ 1095.957146] [ 1095.957146] rcu_scheduler_active = 2, debug_locks = 1 [ 1095.961482] 1 lock held by test_sockmap/15970: [ 1095.964501] #0: ffff9ea6b25de660 (sk_lock-AF_INET){+.+.}-{0:0}, at: tls_sw_recvmsg+0x13a/0x840 [tls] [ 1095.968568] [ 1095.968568] stack backtrace: [ 1095.975001] CPU: 1 PID: 15970 Comm: test_sockmap Tainted: G W 5.7.0-rc7-02911-g463bac5f1ca79 #1 [ 1095.977883] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 [ 1095.980519] Call Trace: [ 1095.982191] dump_stack+0x8f/0xd0 [ 1095.984040] sk_psock_skb_redirect+0xa6/0xf0 [ 1095.986073] sk_psock_tls_strp_read+0x1d8/0x250 [ 1095.988095] tls_sw_recvmsg+0x714/0x840 [tls] v2: Improve commit message to identify non-TLS redirect plus error case condition as well as more common TLS case. In the process I decided doing the rcu_read_unlock followed by the lock/unlock inside branches was unnecessarily complex. We can just extend the current rcu block and get the same effeective without the shuffling and branching. Thanks Martin! Fixes: e91de6afa81c1 ("bpf: Fix running sk_skb program types with ktls") Reported-by: Jakub Sitnicki <jakub@cloudflare.com> Reported-by: kernel test robot <rong.a.chen@intel.com> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/159312677907.18340.11064813152758406626.stgit@john-XPS-13-9370
2020-06-27net: ipv4: Fix wrong type conversion from hint to rt in ip_route_use_hint()Miaohe Lin1-1/+1
We can't cast sk_buff to rtable by (struct rtable *)hint. Use skb_rtable(). Fixes: 02b24941619f ("ipv4: use dst hint for ipv4 list receive") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-26enetc: Fix tx rings bitmap iteration range, irq handlingClaudiu Manoil1-2/+2
The rings bitmap of an interrupt vector encodes which of the device's rings were assigned to that interrupt vector. Hence the iteration range of the tx rings bitmap (for_each_set_bit()) should be the total number of Tx rings of that netdevice instead of the number of rings assigned to the interrupt vector. Since there are 2 cores, and one interrupt vector for each core, the number of rings asigned to an interrupt vector is half the number of available rings. The impact of this error is that the upper half of the tx rings could still generate interrupts during napi polling. Fixes: d4fd0404c1c9 ("enetc: Introduce basic PF and VF ENETC ethernet drivers") Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-26ionic: update the queue count on openShannon Nelson1-0/+8
Let the network stack know the real number of queues that we are using. v2: added error checking Fixes: 49d3b493673a ("ionic: disable the queues on link down") Signed-off-by: Shannon Nelson <snelson@pensando.io> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-26nl80211: fix memory leak when parsing NL80211_ATTR_HE_BSS_COLORLuca Coelho1-1/+1
If there is an error when parsing the NL80211_ATTR_HE_BSS_COLOR attribute, we return immediately without freeing param.acl. Fit it by using goto out instead of returning immediately. Fixes: 5c5e52d1bb96 ("nl80211: add handling for BSS color") Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20200626124931.7ad2a3eb894f.I60905fb70bd20389a3b170db515a07275e31845e@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>