Age | Commit message (Collapse) | Author | Files | Lines |
|
$ make -C tools/perf build-test
does, ends up with these two problems:
make[3]: *** No rule to make target '/tmp/tmp.zq13cHILGB/perf-5.3.0/include/uapi/linux/bpf.h', needed by 'bpf_helper_defs.h'. Stop.
make[3]: *** Waiting for unfinished jobs....
make[2]: *** [Makefile.perf:757: /tmp/tmp.zq13cHILGB/perf-5.3.0/tools/lib/bpf/libbpf.a] Error 2
make[2]: *** Waiting for unfinished jobs....
Because $(srcdir) points to the /tmp/tmp.zq13cHILGB/perf-5.3.0 directory
and we need '/tools/ after that variable, and after fixing this then we
get to another problem:
/bin/sh: /home/acme/git/perf/tools/scripts/bpf_helpers_doc.py: No such file or directory
make[3]: *** [Makefile:184: bpf_helper_defs.h] Error 127
make[3]: *** Deleting file 'bpf_helper_defs.h'
LD /tmp/build/perf/libapi-in.o
make[2]: *** [Makefile.perf:778: /tmp/build/perf/libbpf.a] Error 2
make[2]: *** Waiting for unfinished jobs....
Because this requires something outside the tools/ directories that gets
collected into perf's detached tarballs, to fix it just add it to
tools/perf/MANIFEST, which this patch does, now it works for that case
and also for all these other cases.
Fixes: e01a75c15969 ("libbpf: Move bpf_{helpers, helper_defs, endian, tracing}.h into libbpf")
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andriin@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: https://lkml.kernel.org/n/tip-4pnkg2vmdvq5u6eivc887wen@git.kernel.org
Link: https://lore.kernel.org/bpf/20191126151045.GB19483@kernel.org
|
|
Similarly to a0d7da26ce86 ("libbpf: Fix call relocation offset calculation
bug"), relocations against global variables need to take into account
referenced symbol's st_value, which holds offset into a corresponding data
section (and, subsequently, offset into internal backing map). For static
variables this offset is always zero and data offset is completely described
by respective instruction's imm field.
Convert a bunch of selftests to global variables. Previously they were relying
on `static volatile` trick to ensure Clang doesn't inline static variables,
which with global variables is not necessary anymore.
Fixes: 393cdfbee809 ("libbpf: Support initialized global variables")
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20191127200651.1381348-1-andriin@fb.com
|
|
Fix Makefile's diagnostic diff output when there is LIBBPF_API-versioned
symbols mismatch.
Fixes: 1bd63524593b ("libbpf: handle symbol versioning properly for libbpf.a")
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20191127200134.1360660-1-andriin@fb.com
|
|
If vmlinux BTF generation fails, but CONFIG_DEBUG_INFO_BTF is set,
.BTF section of vmlinux is empty and kernel will prohibit
BPF loading and return "in-kernel BTF is malformed".
--dump-section argument to binutils' objcopy was added in version 2.25.
When using pre-2.25 binutils, BTF generation silently fails. Convert
to --only-section which is present on pre-2.25 binutils.
Documentation/process/changes.rst states that binutils 2.21+
is supported, not sure those standards apply to BPF subsystem.
v2:
* exit and print an error if gen_btf fails (John Fastabend)
v3:
* resend with Andrii's Acked-by/Tested-by tags
Fixes: 341dfcf8d78ea ("btf: expose BTF info through sysfs")
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20191127161410.57327-1-sdf@google.com
|
|
kernel/bpf/btf.c:4023 btf_distill_func_proto()
error: potentially dereferencing uninitialized 't'.
kernel/bpf/btf.c
4012 nargs = btf_type_vlen(func);
4013 if (nargs >= MAX_BPF_FUNC_ARGS) {
4014 bpf_log(log,
4015 "The function %s has %d arguments. Too many.\n",
4016 tname, nargs);
4017 return -EINVAL;
4018 }
4019 ret = __get_type_size(btf, func->type, &t);
^^
t isn't initialized for the first -EINVAL return
This is unlikely path, since BTF should have been validated at this point.
Fix it by returning 'void' BTF.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20191126230106.237179-1-ast@kernel.org
|
|
In gve_alloc_queue_page_list(), when a page allocation fails,
qpl->num_entries will be wrong. In this case priv->num_registered_pages
can underflow in gve_free_queue_page_list(), causing subsequent calls
to gve_alloc_queue_page_list() to fail.
Fixes: f5cedc84a30d ("gve: Add transmit and receive support")
Signed-off-by: Jeroen de Borst <jeroendb@google.com>
Reviewed-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Any argument outside of that range would result in an out of bound
memory access, since the accessed array is 65536 bits long.
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When user-space sets the OVS_UFID_F_OMIT_* flags, and the relevant
flow has no UFID, we can exceed the computed size, as
ovs_nla_put_identifier() will always dump an OVS_FLOW_ATTR_KEY
attribute.
Take the above in account when computing the flow command message
size.
Fixes: 74ed7ab9264c ("openvswitch: Add support for unique flow IDs.")
Reported-by: Qi Jun Ding <qding@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Fix the return paths for all I/O operations to ensure
that the I/O completed successfully. Then pass the return
to the caller for further processing
Fixes: 01db923e8377 ("net: phy: dp83869: Add TI dp83869 phy")
Reported-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Dan Murphy <dmurphy@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
We need to calculate the skb size correctly otherwise we risk triggering
skb_over_panic[1]. The issue is that data_len is added to the skb in a
nl attribute, but we don't account for its header size (nlattr 4 bytes)
and alignment. We account for it when calculating the total size in
the > PSAMPLE_MAX_PACKET_SIZE comparison correctly, but not when
allocating after that. The fix is simple - use nla_total_size() for
data_len when allocating.
To reproduce:
$ tc qdisc add dev eth1 clsact
$ tc filter add dev eth1 egress matchall action sample rate 1 group 1 trunc 129
$ mausezahn eth1 -b bcast -a rand -c 1 -p 129
< skb_over_panic BUG(), tail is 4 bytes past skb->end >
[1] Trace:
[ 50.459526][ T3480] skbuff: skb_over_panic: text:(____ptrval____) len:196 put:136 head:(____ptrval____) data:(____ptrval____) tail:0xc4 end:0xc0 dev:<NULL>
[ 50.474339][ T3480] ------------[ cut here ]------------
[ 50.481132][ T3480] kernel BUG at net/core/skbuff.c:108!
[ 50.486059][ T3480] invalid opcode: 0000 [#1] PREEMPT SMP
[ 50.489463][ T3480] CPU: 3 PID: 3480 Comm: mausezahn Not tainted 5.4.0-rc7 #108
[ 50.492844][ T3480] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014
[ 50.496551][ T3480] RIP: 0010:skb_panic+0x79/0x7b
[ 50.498261][ T3480] Code: bc 00 00 00 41 57 4c 89 e6 48 c7 c7 90 29 9a 83 4c 8b 8b c0 00 00 00 50 8b 83 b8 00 00 00 50 ff b3 c8 00 00 00 e8 ae ef c0 fe <0f> 0b e8 2f df c8 fe 48 8b 55 08 44 89 f6 4c 89 e7 48 c7 c1 a0 22
[ 50.504111][ T3480] RSP: 0018:ffffc90000447a10 EFLAGS: 00010282
[ 50.505835][ T3480] RAX: 0000000000000087 RBX: ffff888039317d00 RCX: 0000000000000000
[ 50.507900][ T3480] RDX: 0000000000000000 RSI: ffffffff812716e1 RDI: 00000000ffffffff
[ 50.509820][ T3480] RBP: ffffc90000447a60 R08: 0000000000000001 R09: 0000000000000000
[ 50.511735][ T3480] R10: ffffffff81d4f940 R11: 0000000000000000 R12: ffffffff834a22b0
[ 50.513494][ T3480] R13: ffffffff82c10433 R14: 0000000000000088 R15: ffffffff838a8084
[ 50.515222][ T3480] FS: 00007f3536462700(0000) GS:ffff88803eac0000(0000) knlGS:0000000000000000
[ 50.517135][ T3480] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 50.518583][ T3480] CR2: 0000000000442008 CR3: 000000003b222000 CR4: 00000000000006e0
[ 50.520723][ T3480] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 50.522709][ T3480] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 50.524450][ T3480] Call Trace:
[ 50.525214][ T3480] skb_put.cold+0x1b/0x1b
[ 50.526171][ T3480] psample_sample_packet+0x1d3/0x340
[ 50.527307][ T3480] tcf_sample_act+0x178/0x250
[ 50.528339][ T3480] tcf_action_exec+0xb1/0x190
[ 50.529354][ T3480] mall_classify+0x67/0x90
[ 50.530332][ T3480] tcf_classify+0x72/0x160
[ 50.531286][ T3480] __dev_queue_xmit+0x3db/0xd50
[ 50.532327][ T3480] dev_queue_xmit+0x18/0x20
[ 50.533299][ T3480] packet_sendmsg+0xee7/0x2090
[ 50.534331][ T3480] sock_sendmsg+0x54/0x70
[ 50.535271][ T3480] __sys_sendto+0x148/0x1f0
[ 50.536252][ T3480] ? tomoyo_file_ioctl+0x23/0x30
[ 50.537334][ T3480] ? ksys_ioctl+0x5e/0xb0
[ 50.540068][ T3480] __x64_sys_sendto+0x2a/0x30
[ 50.542810][ T3480] do_syscall_64+0x73/0x1f0
[ 50.545383][ T3480] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 50.548477][ T3480] RIP: 0033:0x7f35357d6fb3
[ 50.551020][ T3480] Code: 48 8b 0d 18 90 20 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d f9 d3 20 00 00 75 13 49 89 ca b8 2c 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 eb f6 ff ff 48 89 04 24
[ 50.558547][ T3480] RSP: 002b:00007ffe0c7212c8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 50.561870][ T3480] RAX: ffffffffffffffda RBX: 0000000001dac010 RCX: 00007f35357d6fb3
[ 50.565142][ T3480] RDX: 0000000000000082 RSI: 0000000001dac2a2 RDI: 0000000000000003
[ 50.568469][ T3480] RBP: 00007ffe0c7212f0 R08: 00007ffe0c7212d0 R09: 0000000000000014
[ 50.571731][ T3480] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000082
[ 50.574961][ T3480] R13: 0000000001dac2a2 R14: 0000000000000001 R15: 0000000000000003
[ 50.578170][ T3480] Modules linked in: sch_ingress virtio_net
[ 50.580976][ T3480] ---[ end trace 61a515626a595af6 ]---
CC: Yotam Gigi <yotamg@mellanox.com>
CC: Jiri Pirko <jiri@mellanox.com>
CC: Jamal Hadi Salim <jhs@mojatatu.com>
CC: Simon Horman <simon.horman@netronome.com>
CC: Roopa Prabhu <roopa@cumulusnetworks.com>
Fixes: 6ae0a6286171 ("net: Introduce psample, a new genetlink channel for packet sampling")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
correct usage prototype of callback in tasklet_init().
Report by https://github.com/KSPP/linux/issues/20
Signed-off-by: Phong Tran <tranmanphong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
correct usage prototype of callback in tasklet_init().
Report by https://github.com/KSPP/linux/issues/20
Signed-off-by: Phong Tran <tranmanphong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Note that the sysctl write accessor functions guarantee that:
net->ipv4.sysctl_ip_prot_sock <= net->ipv4.ip_local_ports.range[0]
invariant is maintained, and as such the max() in selinux hooks is actually spurious.
ie. even though
if (snum < max(inet_prot_sock(sock_net(sk)), low) || snum > high) {
per logic is the same as
if ((snum < inet_prot_sock(sock_net(sk)) && snum < low) || snum > high) {
it is actually functionally equivalent to:
if (snum < low || snum > high) {
which is equivalent to:
if (snum < inet_prot_sock(sock_net(sk)) || snum < low || snum > high) {
even though the first clause is spurious.
But we want to hold on to it in case we ever want to change what what
inet_port_requires_bind_service() means (for example by changing
it from a, by default, [0..1024) range to some sort of set).
Test: builds, git 'grep inet_prot_sock' finds no other references
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Provide some serialization for device CRQ commands
and queries to ensure that the shared variable used for
storing return codes is properly synchronized.
Signed-off-by: Thomas Falcon <tlfalcon@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Create a wrapper for wait_for_completion calls with additional
driver checks to ensure that the driver does not wait on a
disabled device. In those cases or if the device does not respond
in an extended amount of time, this will allow the driver an
opportunity to recover.
Signed-off-by: Thomas Falcon <tlfalcon@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
If we receive a notification that the device has been deactivated
or removed, force a completion of all waiting threads.
Signed-off-by: Thomas Falcon <tlfalcon@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Fix multiple calls to init_completion for device completion
structures. Instead, initialize them during device probe and
reinitialize them later as needed.
Signed-off-by: Thomas Falcon <tlfalcon@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
It already existed in part of the function, but move it
to a higher level and use it consistently throughout.
Safe since sk is never written to.
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
It cannot overlap with the local port range - ie. with autobind selectable
ports - and not with reserved ports.
Indeed 'ip_local_reserved_ports' isn't even a range, it's a (by default
empty) set.
Fixes: 4548b683b781 ("Introduce a sysctl that modifies the value of PROT_SOCK.")
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In commit 4f07b80c9733 ("tipc: check msg->req data len in
tipc_nl_compat_bearer_disable") the same patch code was copied into
routines: tipc_nl_compat_bearer_disable(),
tipc_nl_compat_link_stat_dump() and tipc_nl_compat_link_reset_stats().
The two link routine occurrences should have been modified to check
the maximum link name length and not bearer name length.
Fixes: 4f07b80c9733 ("tipc: check msg->reg data len in tipc_nl_compat_bearer_disable")
Signed-off-by: John Rutherford <john.rutherford@dektech.com.au>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
u32 is not defined for libbpf when compiled outside of kernel sources (e.g.,
in Github projection). Use __u32 instead.
Fixes: b8c54ea455dc ("libbpf: Add support to attach to fentry/fexit tracing progs")
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20191125212948.1163343-1-andriin@fb.com
|
|
To fix build with !CONFIG_MMU, implement it for no-MMU configurations as well.
Fixes: fc9702273e2e ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20191123220835.1237773-1-andriin@fb.com
|
|
Slip_open doesn't clean-up device which registration failed from the
slip_devs device list. On next open after failure this list is iterated
and freed device is accessed. Fix this by calling sl_free_netdev in error
path.
Here is the trace from the Syzbot:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x197/0x210 lib/dump_stack.c:118
print_address_description.constprop.0.cold+0xd4/0x30b mm/kasan/report.c:374
__kasan_report.cold+0x1b/0x41 mm/kasan/report.c:506
kasan_report+0x12/0x20 mm/kasan/common.c:634
__asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132
sl_sync drivers/net/slip/slip.c:725 [inline]
slip_open+0xecd/0x11b7 drivers/net/slip/slip.c:801
tty_ldisc_open.isra.0+0xa3/0x110 drivers/tty/tty_ldisc.c:469
tty_set_ldisc+0x30e/0x6b0 drivers/tty/tty_ldisc.c:596
tiocsetd drivers/tty/tty_io.c:2334 [inline]
tty_ioctl+0xe8d/0x14f0 drivers/tty/tty_io.c:2594
vfs_ioctl fs/ioctl.c:46 [inline]
file_ioctl fs/ioctl.c:509 [inline]
do_vfs_ioctl+0xdb6/0x13e0 fs/ioctl.c:696
ksys_ioctl+0xab/0xd0 fs/ioctl.c:713
__do_sys_ioctl fs/ioctl.c:720 [inline]
__se_sys_ioctl fs/ioctl.c:718 [inline]
__x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718
do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Fixes: 3b5a39979daf ("slip: Fix memory leak in slip_open error path")
Reported-by: syzbot+4d5170758f3762109542@syzkaller.appspotmail.com
Cc: David Miller <davem@davemloft.net>
Cc: Oliver Hartkopp <socketcan@hartkopp.net>
Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Jouni Hogander <jouni.hogander@unikie.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This function was using configuration of port 0 in devicetree for all ports.
In case CPU port was not 0, the delay settings was ignored. This resulted not
working communication between CPU and the switch.
Fixes: f5b8631c293b ("net: dsa: sja1105: Error out if RGMII delays are requested in DT")
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
While enqueueing a broadcast skb to port->bc_queue, schedule_work()
is called to add port->bc_work, which processes the skbs in
bc_queue, to "events" work queue. If port->bc_queue is full, the
skb will be discarded and schedule_work(&port->bc_work) won't be
called. However, if port->bc_queue is full and port->bc_work is not
running or pending, port->bc_queue will keep full and schedule_work()
won't be called any more, and all broadcast skbs to macvlan will be
discarded. This case can happen:
macvlan_process_broadcast() is the pending function of port->bc_work,
it moves all the skbs in port->bc_queue to the queue "list", and
processes the skbs in "list". During this, new skbs will keep being
added to port->bc_queue in macvlan_broadcast_enqueue(), and
port->bc_queue may already full when macvlan_process_broadcast()
return. This may happen, especially when there are a lot of real-time
threads and the process is preempted.
Fix this by calling schedule_work(&port->bc_work) even if
port->bc_work is full in macvlan_broadcast_enqueue().
Fixes: 412ca1550cbe ("macvlan: Move broadcasts into a work queue")
Signed-off-by: Menglong Dong <dong.menglong@zte.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The ENETC hardware support the Credit Based Shaper(CBS) which part
of the IEEE-802.1Qav. The CBS driver was loaded by the sch_cbs
interface when set in the QOS in the kernel.
Here is an example command to set 20Mbits bandwidth in 1Gbits port
for taffic class 7:
tc qdisc add dev eth0 root handle 1: mqprio \
num_tc 8 map 0 1 2 3 4 5 6 7 hw 1
tc qdisc replace dev eth0 parent 1:8 cbs \
locredit -1470 hicredit 30 \
sendslope -980000 idleslope 20000 offload 1
Signed-off-by: Po Liu <Po.Liu@nxp.com>
Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add helpers to make locking/unlocking the MDIO bus easier.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Geert Uytterhoeven reported that using devm_reset_controller_get leads
to a WARNING when probing a reset-controlled PHY. This is because the
device devm_reset_controller_get gets supplied is not actually the
one being probed.
Acquire an unmanaged reset-control as well as free the reset_control on
unregister to fix this.
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
CC: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David Bauer <mail@david-bauer.net>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
fdget_pos() is used by file operations that will read and update f_pos:
things like "read()", "write()" and "lseek()" (but not, for example,
"pread()/pwrite" that get their file positions elsewhere).
However, it had two separate escape clauses for this, because not
everybody wants or needs serialization of the file position.
The first and most obvious case is the "file descriptor doesn't have a
position at all", ie a stream-like file. Except we didn't actually use
FMODE_STREAM, but instead used FMODE_ATOMIC_POS. The reason for that
was that FMODE_STREAM didn't exist back in the days, but also that we
didn't want to mark all the special cases, so we only marked the ones
that _required_ position atomicity according to POSIX - regular files
and directories.
The case one was intentionally lazy, but now that we _do_ have
FMODE_STREAM we could and should just use it. With the change to use
FMODE_STREAM, there are no remaining uses for FMODE_ATOMIC_POS, and all
the code to set it is deleted.
Any cases where we don't want the serialization because the driver (or
subsystem) doesn't use the file position should just be updated to do
"stream_open()". We've done that for all the obvious and common
situations, we may need a few more. Quoting Kirill Smelkov in the
original FMODE_STREAM thread (see link below for full email):
"And I appreciate if people could help at least somehow with "getting
rid of mixed case entirely" (i.e. always lock f_pos_lock on
!FMODE_STREAM), because this transition starts to diverge from my
particular use-case too far. To me it makes sense to do that
transition as follows:
- convert nonseekable_open -> stream_open via stream_open.cocci;
- audit other nonseekable_open calls and convert left users that
truly don't depend on position to stream_open;
- extend stream_open.cocci to analyze alloc_file_pseudo as well (this
will cover pipes and sockets), or maybe convert pipes and sockets
to FMODE_STREAM manually;
- extend stream_open.cocci to analyze file_operations that use
no_llseek or noop_llseek, but do not use nonseekable_open or
alloc_file_pseudo. This might find files that have stream semantic
but are opened differently;
- extend stream_open.cocci to analyze file_operations whose
.read/.write do not use ppos at all (independently of how file was
opened);
- ...
- after that remove FMODE_ATOMIC_POS and always take f_pos_lock if
!FMODE_STREAM;
- gather bug reports for deadlocked read/write and convert missed
cases to FMODE_STREAM, probably extending stream_open.cocci along
the road to catch similar cases
i.e. always take f_pos_lock unless a file is explicitly marked as
being stream, and try to find and cover all files that are streams"
We have not done the "extend stream_open.cocci to analyze
alloc_file_pseudo" as well, but the previous commit did manually handle
the case of pipes and sockets.
The other case where we can avoid locking f_pos is the "this file
descriptor only has a single user and it is us, and thus there is no
need to lock it".
The second test was correct, although a bit subtle and worth just
re-iterating here. There are two kinds of other sources of references
to the same file descriptor: file descriptors that have been explicitly
shared across fork() or with dup(), and file tables having elevated
reference counts due to threading (or explicit file sharing with
clone()).
The first case would have incremented the file count explicitly, and in
the second case the previous __fdget() would have incremented it for us
and set the FDPUT_FPUT flag.
But in both cases the file count would be greater than one, so the
"file_count(file) > 1" test catches both situations. Also note that if
file_count is 1, that also means that no other thread can have access to
the file table, so there also cannot be races with concurrent calls to
dup()/fork()/clone() that would increment the file count any other way.
Link: https://lore.kernel.org/linux-fsdevel/20190413184404.GA13490@deco.navytux.spb.ru
Cc: Kirill Smelkov <kirr@nexedi.com>
Cc: Eic Dumazet <edumazet@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Marco Elver <elver@google.com>
Cc: Andrea Parri <parri.andrea@gmail.com>
Cc: Paul McKenney <paulmck@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
In commit 3975b097e577 ("convert stream-like files -> stream_open, even
if they use noop_llseek") Kirill used a coccinelle script to change
"nonseekable_open()" to "stream_open()", which changed the trivial cases
of stream-like file descriptors to the new model with FMODE_STREAM.
However, the two big cases - sockets and pipes - don't actually have
that trivial pattern at all, and were thus never converted to
FMODE_STREAM even though it makes lots of sense to do so.
That's particularly true when looking forward to the next change:
getting rid of FMODE_ATOMIC_POS entirely, and just using FMODE_STREAM to
decide whether f_pos updates are needed or not. And if they are, we'll
always do them atomically.
This came up because KCSAN (correctly) noted that the non-locked f_pos
updates are data races: they are clearly benign for the case where we
don't care, but it would be good to just not have that issue exist at
all.
Note that the reason we used FMODE_ATOMIC_POS originally is that only
doing it for the minimal required case is "safer" in that it's possible
that the f_pos locking can cause unnecessary serialization across the
whole write() call. And in the worst case, that kind of serialization
can cause deadlock issues: think writers that need readers to empty the
state using the same file descriptor.
[ Note that the locking is per-file descriptor - because it protects
"f_pos", which is obviously per-file descriptor - so it only affects
cases where you literally use the same file descriptor to both read
and write.
So a regular pipe that has separate reading and writing file
descriptors doesn't really have this situation even though it's the
obvious case of "reader empties what a bit writer concurrently fills"
But we want to make pipes as being stream-line anyway, because we
don't want the unnecessary overhead of locking, and because a named
pipe can be (ab-)used by reading and writing to the same file
descriptor. ]
There are likely a lot of other cases that might want FMODE_STREAM, and
looking for ".llseek = no_llseek" users and other cases that don't have
an lseek file operation at all and making them use "stream_open()" might
be a good idea. But pipes and sockets are likely to be the two main
cases.
Cc: Kirill Smelkov <kirr@nexedi.com>
Cc: Eic Dumazet <edumazet@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Marco Elver <elver@google.com>
Cc: Andrea Parri <parri.andrea@gmail.com>
Cc: Paul McKenney <paulmck@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The commit f05499a06fb4 ("writeback: use ino_t for inodes in
tracepoints") introduced a lot of GCC compilation warnings on s390,
In file included from ./include/trace/define_trace.h:102,
from ./include/trace/events/writeback.h:904,
from fs/fs-writeback.c:82:
./include/trace/events/writeback.h: In function
'trace_raw_output_writeback_page_template':
./include/trace/events/writeback.h:76:12: warning: format '%lu' expects
argument of type 'long unsigned int', but argument 4 has type 'ino_t'
{aka 'unsigned int'} [-Wformat=]
TP_printk("bdi %s: ino=%lu index=%lu",
^~~~~~~~~~~~~~~~~~~~~~~~~~~
./include/trace/trace_events.h:360:22: note: in definition of macro
'DECLARE_EVENT_CLASS'
trace_seq_printf(s, print); \
^~~~~
./include/trace/events/writeback.h:76:2: note: in expansion of macro
'TP_printk'
TP_printk("bdi %s: ino=%lu index=%lu",
^~~~~~~~~
Fix them by adding necessary casts where ino_t could be either "unsigned
int" or "unsigned long".
Fixes: f05499a06fb4 ("writeback: use ino_t for inodes in tracepoints")
Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
This enables the use of SW timestamping.
ax88179_178a uses the usbnet transmit function usbnet_start_xmit() which
implements software timestamping. ax88179_178a overrides ethtool_ops but
missed to set .get_ts_info. This caused SOF_TIMESTAMPING_TX_SOFTWARE
capability to be not available.
Signed-off-by: Andreas K. Besslein <besslein.andreas@gmail.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
|
|
When mlxsw_sp_adj_discard_write() is called for the first time, the
value stored in 'mlxsw_sp->router->adj_discard_index' is invalid, as
indicated by 'mlxsw_sp->router->adj_discard_index_valid' being set to
'false'.
In this case, we should not use the value initially stored in
'mlxsw_sp->router->adj_discard_index' (0) and instead use the value
allocated later in the function.
Fixes: 983db6198f0d ("mlxsw: spectrum_router: Allocate discard adjacency entry when needed")
Signed-off-by: Amit Cohen <amitc@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
|
|
When a GRE tunnel is bound to an underlay netdevice and that netdevice is
moved to a different VRF, that could cause two tunnels to have the same
underlay local address in the same VRF. Linux in this situation dispatches
the traffic according to the tunnel key (or lack thereof), but that cannot
be offloaded to Spectrum devices.
Detect this situation and unoffload the two impacted tunnels when it
happens.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
|
|
Given that we have BPF_MOD_NOP_TO_{CALL,JUMP}, BPF_MOD_{CALL,JUMP}_TO_NOP
and BPF_MOD_{CALL,JUMP}_TO_{CALL,JUMP} poke types and that we also pass in
old_addr as well as new_addr, it's a bit redundant and unnecessarily
complicates __bpf_arch_text_poke() itself since we can derive the same from
the *_addr that were passed in. Hence simplify and use BPF_MOD_{CALL,JUMP}
as types which also allows to clean up call-sites.
In addition to that, __bpf_arch_text_poke() currently verifies that text
matches expected old_insn before we invoke text_poke_bp(). Also add a check
on new_insn and skip rewrite if it already matches. Reason why this is rather
useful is that it avoids making any special casing in prog_array_map_poke_run()
when old and new prog were NULL and has the benefit that also for this case
we perform a check on text whether it really matches our expectations.
Suggested-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/fcb00a2b0b288d6c73de4ef58116a821c8fe8f2f.1574555798.git.daniel@iogearbox.net
|
|
For BPF_PROG_TYPE_TRACING, the bpf_prog's ctx is an array of u64.
This patch borrows the idea from BPF_CALL_x in filter.h to
convert a u64 to the arg type of the traced function.
The new BPF_TRACE_x has an arg to specify the return type of a bpf_prog.
It will be used in the future TCP-ops bpf_prog that may return "void".
The new macros are defined in the new header file "bpf_trace_helpers.h".
It is under selftests/bpf/ for now. It could be moved to libbpf later
after seeing more upcoming non-tracing use cases.
The tests are changed to use these new macros also. Hence,
the k[s]u8/16/32/64 are no longer needed and they are removed
from the bpf_helpers.h.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20191123202504.1502696-1-kafai@fb.com
|
|
Add a definition of bpf_jit_blinding_enabled() when CONFIG_BPF_JIT is not set
in order to fix a recent build regression:
[...]
CC kernel/bpf/verifier.o
CC kernel/bpf/inode.o
kernel/bpf/verifier.c: In function ‘fixup_bpf_calls’:
kernel/bpf/verifier.c:9132:25: error: implicit declaration of function ‘bpf_jit_blinding_enabled’; did you mean ‘bpf_jit_kallsyms_enabled’? [-Werror=implicit-function-declaration]
9132 | bool expect_blinding = bpf_jit_blinding_enabled(prog);
| ^~~~~~~~~~~~~~~~~~~~~~~~
| bpf_jit_kallsyms_enabled
CC kernel/bpf/helpers.o
CC kernel/bpf/hashtab.o
[...]
Fixes: d2e4c1e6c294 ("bpf: Constant map key tracking for prog array pokes")
Reported-by: Jakub Sitnicki <jakub@cloudflare.com>
Reported-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/40baf8f3507cac4851a310578edfb98ce73b5605.1574541375.git.daniel@iogearbox.net
|
|
Add several BPF kselftest cases for tail calls which test the various
patch directions, and that multiple locations are patched in same and
different programs.
# ./test_progs -n 45
#45/1 tailcall_1:OK
#45/2 tailcall_2:OK
#45/3 tailcall_3:OK
#45/4 tailcall_4:OK
#45/5 tailcall_5:OK
#45 tailcalls:OK
Summary: 1/5 PASSED, 0 SKIPPED, 0 FAILED
I've also verified the JITed dump after each of the rewrite cases that
it matches expectations.
Also regular test_verifier suite passes fine which contains further tail
call tests:
# ./test_verifier
[...]
Summary: 1563 PASSED, 0 SKIPPED, 0 FAILED
Checked under JIT, interpreter and JIT + hardening.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/3d6cbecbeb171117dccfe153306e479798fb608d.1574452833.git.daniel@iogearbox.net
|
|
Add initial code emission for *direct* jumps for tail call maps in
order to avoid the retpoline overhead from a493a87f38cf ("bpf, x64:
implement retpoline for tail call") for situations that allow for
it, meaning, for known constant keys at verification time which are
used as index into the tail call map. In case of Cilium which makes
heavy use of tail calls, constant keys are used in the vast majority,
only for a single occurrence we use a dynamic key.
High level outline is that if the target prog is NULL in the map, we
emit a 5-byte nop for the fall-through case and if not, we emit a
5-byte direct relative jmp to the target bpf_func + skipped prologue
offset. Later during runtime, we patch these 5-byte nop/jmps upon
tail call map update or deletions dynamically. Note that on x86-64
the direct jmp works as we reuse the same stack frame and skip
prologue (as opposed to some other JIT implementations).
One of the issues is that the tail call map slots can change at any
given time even during JITing. Therefore, we have two passes: i) emit
nops for all patchable locations during main JITing phase until we
declare prog->jited = 1 eventually. At this point the image is stable,
not public yet and with all jmps disabled. While JITing, we collect
additional info like poke->ip in order to remember the patch location
for later modifications. In ii) bpf_tail_call_direct_fixup() walks
over the progs poke_tab, locks the tail call maps poke_mutex to
prevent from parallel updates and patches in the right locations via
__bpf_arch_text_poke(). Note, the main bpf_arch_text_poke() cannot
be used at this point since we're not yet exposed to kallsyms. For
the update we use plain memcpy() since the image is not public and
still in read-write mode. After patching, we activate that poke entry
through poke->ip_stable. Meaning, at this point any tail call map
updates/deletions are not going to ignore that poke entry anymore.
Then, bpf_arch_text_poke() might still occur on the read-write image
until we finally locked it as read-only. Both modifications on the
given image are under text_mutex to avoid interference with each
other when update requests come in in parallel for different tail
call maps (current one we have locked in JIT and different one where
poke->ip_stable was already set).
Example prog:
# ./bpftool p d x i 1655
0: (b7) r3 = 0
1: (18) r2 = map[id:526]
3: (85) call bpf_tail_call#12
4: (b7) r0 = 1
5: (95) exit
Before:
# ./bpftool p d j i 1655
0xffffffffc076e55c:
0: nopl 0x0(%rax,%rax,1)
5: push %rbp
6: mov %rsp,%rbp
9: sub $0x200,%rsp
10: push %rbx
11: push %r13
13: push %r14
15: push %r15
17: pushq $0x0 _
19: xor %edx,%edx |_ index (arg 3)
1b: movabs $0xffff88d95cc82600,%rsi |_ map (arg 2)
25: mov %edx,%edx | index >= array->map.max_entries
27: cmp %edx,0x24(%rsi) |
2a: jbe 0x0000000000000066 |_
2c: mov -0x224(%rbp),%eax | tail call limit check
32: cmp $0x20,%eax |
35: ja 0x0000000000000066 |
37: add $0x1,%eax |
3a: mov %eax,-0x224(%rbp) |_
40: mov 0xd0(%rsi,%rdx,8),%rax |_ prog = array->ptrs[index]
48: test %rax,%rax | prog == NULL check
4b: je 0x0000000000000066 |_
4d: mov 0x30(%rax),%rax | goto *(prog->bpf_func + prologue_size)
51: add $0x19,%rax |
55: callq 0x0000000000000061 | retpoline for indirect jump
5a: pause |
5c: lfence |
5f: jmp 0x000000000000005a |
61: mov %rax,(%rsp) |
65: retq |_
66: mov $0x1,%eax
6b: pop %rbx
6c: pop %r15
6e: pop %r14
70: pop %r13
72: pop %rbx
73: leaveq
74: retq
After; state after JIT:
# ./bpftool p d j i 1655
0xffffffffc08e8930:
0: nopl 0x0(%rax,%rax,1)
5: push %rbp
6: mov %rsp,%rbp
9: sub $0x200,%rsp
10: push %rbx
11: push %r13
13: push %r14
15: push %r15
17: pushq $0x0 _
19: xor %edx,%edx |_ index (arg 3)
1b: movabs $0xffff9d8afd74c000,%rsi |_ map (arg 2)
25: mov -0x224(%rbp),%eax | tail call limit check
2b: cmp $0x20,%eax |
2e: ja 0x000000000000003e |
30: add $0x1,%eax |
33: mov %eax,-0x224(%rbp) |_
39: jmpq 0xfffffffffffd1785 |_ [direct] goto *(prog->bpf_func + prologue_size)
3e: mov $0x1,%eax
43: pop %rbx
44: pop %r15
46: pop %r14
48: pop %r13
4a: pop %rbx
4b: leaveq
4c: retq
After; state after map update (target prog):
# ./bpftool p d j i 1655
0xffffffffc08e8930:
0: nopl 0x0(%rax,%rax,1)
5: push %rbp
6: mov %rsp,%rbp
9: sub $0x200,%rsp
10: push %rbx
11: push %r13
13: push %r14
15: push %r15
17: pushq $0x0
19: xor %edx,%edx
1b: movabs $0xffff9d8afd74c000,%rsi
25: mov -0x224(%rbp),%eax
2b: cmp $0x20,%eax .
2e: ja 0x000000000000003e .
30: add $0x1,%eax .
33: mov %eax,-0x224(%rbp) |_
39: jmpq 0xffffffffffb09f55 |_ goto *(prog->bpf_func + prologue_size)
3e: mov $0x1,%eax
43: pop %rbx
44: pop %r15
46: pop %r14
48: pop %r13
4a: pop %rbx
4b: leaveq
4c: retq
After; state after map update (no prog):
# ./bpftool p d j i 1655
0xffffffffc08e8930:
0: nopl 0x0(%rax,%rax,1)
5: push %rbp
6: mov %rsp,%rbp
9: sub $0x200,%rsp
10: push %rbx
11: push %r13
13: push %r14
15: push %r15
17: pushq $0x0
19: xor %edx,%edx
1b: movabs $0xffff9d8afd74c000,%rsi
25: mov -0x224(%rbp),%eax
2b: cmp $0x20,%eax .
2e: ja 0x000000000000003e .
30: add $0x1,%eax .
33: mov %eax,-0x224(%rbp) |_
39: nopl 0x0(%rax,%rax,1) |_ fall-through nop
3e: mov $0x1,%eax
43: pop %rbx
44: pop %r15
46: pop %r14
48: pop %r13
4a: pop %rbx
4b: leaveq
4c: retq
Nice bonus is that this also shrinks the code emission quite a bit
for every tail call invocation.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/6ada4c1c9d35eeb5f4ecfab94593dafa6b5c4b09.1574452833.git.daniel@iogearbox.net
|
|
Add tracking of constant keys into tail call maps. The signature of
bpf_tail_call_proto is that arg1 is ctx, arg2 map pointer and arg3
is a index key. The direct call approach for tail calls can be enabled
if the verifier asserted that for all branches leading to the tail call
helper invocation, the map pointer and index key were both constant
and the same.
Tracking of map pointers we already do from prior work via c93552c443eb
("bpf: properly enforce index mask to prevent out-of-bounds speculation")
and 09772d92cd5a ("bpf: avoid retpoline for lookup/update/ delete calls
on maps").
Given the tail call map index key is not on stack but directly in the
register, we can add similar tracking approach and later in fixup_bpf_calls()
add a poke descriptor to the progs poke_tab with the relevant information
for the JITing phase.
We internally reuse insn->imm for the rewritten BPF_JMP | BPF_TAIL_CALL
instruction in order to point into the prog's poke_tab, and keep insn->imm
as 0 as indicator that current indirect tail call emission must be used.
Note that publishing to the tracker must happen at the end of fixup_bpf_calls()
since adding elements to the poke_tab reallocates its memory, so we need
to wait until its in final state.
Future work can generalize and add similar approach to optimize plain
array map lookups. Difference there is that we need to look into the key
value that sits on stack. For clarity in bpf_insn_aux_data, map_state
has been renamed into map_ptr_state, so we get map_{ptr,key}_state as
trackers.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/e8db37f6b2ae60402fa40216c96738ee9b316c32.1574452833.git.daniel@iogearbox.net
|
|
This work adds program tracking to prog array maps. This is needed such
that upon prog array updates/deletions we can fix up all programs which
make use of this tail call map. We add ops->map_poke_{un,}track()
helpers to maps to maintain the list of programs and ops->map_poke_run()
for triggering the actual update.
bpf_array_aux is extended to contain the list head and poke_mutex in
order to serialize program patching during updates/deletions.
bpf_free_used_maps() will untrack the program shortly before dropping
the reference to the map. For clearing out the prog array once all urefs
are dropped we need to use schedule_work() to have a sleepable context.
The prog_array_map_poke_run() is triggered during updates/deletions and
walks the maintained prog list. It checks in their poke_tabs whether the
map and key is matching and runs the actual bpf_arch_text_poke() for
patching in the nop or new jmp location. Depending on the type of update,
we use one of BPF_MOD_{NOP_TO_JUMP,JUMP_TO_NOP,JUMP_TO_JUMP}.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/1fb364bb3c565b3e415d5ea348f036ff379e779d.1574452833.git.daniel@iogearbox.net
|
|
Add initial poke table data structures and management to the BPF
prog that can later be used by JITs. Also add an instance of poke
specific data for tail call maps; plan for later work is to extend
this also for BPF static keys.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/1db285ec2ea4207ee0455b3f8e191a4fc58b9ade.1574452833.git.daniel@iogearbox.net
|
|
We're going to extend this with further information which is only
relevant for prog array at this point. Given this info is not used
in critical path, move it into its own structure such that the main
array map structure can be kept on diet.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/b9ddccdb0f6f7026489ee955f16c96381e1e7238.1574452833.git.daniel@iogearbox.net
|
|
We later on are going to need a sleepable context as opposed to plain
RCU callback in order to untrack programs we need to poke at runtime
and tracking as well as image update is performed under mutex.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/09823b1d5262876e9b83a8e75df04cf0467357a4.1574452833.git.daniel@iogearbox.net
|
|
Add BPF_MOD_{NOP_TO_JUMP,JUMP_TO_JUMP,JUMP_TO_NOP} patching for x86
JIT in order to be able to patch direct jumps or nop them out. We need
this facility in order to patch tail call jumps and in later work also
BPF static keys.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/aa4784196a8e5e985af4b30a4fe5336bce6e9643.1574452833.git.daniel@iogearbox.net
|
|
Add a test that benchmarks different ways of attaching BPF program to a kernel function.
Here are the results for 2.4Ghz x86 cpu on a kernel without mitigations:
$ ./test_progs -n 49 -v|grep events
task_rename base 2743K events per sec
task_rename kprobe 2419K events per sec
task_rename kretprobe 1876K events per sec
task_rename raw_tp 2578K events per sec
task_rename fentry 2710K events per sec
task_rename fexit 2685K events per sec
On a kernel with retpoline:
$ ./test_progs -n 49 -v|grep events
task_rename base 2401K events per sec
task_rename kprobe 1930K events per sec
task_rename kretprobe 1485K events per sec
task_rename raw_tp 2053K events per sec
task_rename fentry 2351K events per sec
task_rename fexit 2185K events per sec
All 5 approaches:
- kprobe/kretprobe in __set_task_comm()
- raw tracepoint in trace_task_rename()
- fentry/fexit in __set_task_comm()
are roughly equivalent.
__set_task_comm() by itself is quite fast, so any extra instructions add up.
Until BPF trampoline was introduced the fastest mechanism was raw tracepoint.
kprobe via ftrace was second best. kretprobe is slow due to trap. New
fentry/fexit methods via BPF trampoline are clearly the fastest and the
difference is more pronounced with retpoline on, since BPF trampoline doesn't
use indirect jumps.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20191122011515.255371-1-ast@kernel.org
|
|
test_core_reloc_kernel.c selftest is the only CO-RE test that reads and
returns for validation calling thread's information (pid, tgid, comm). Thus it
has to make sure that only test_prog's invocations are honored.
Fixes: df36e621418b ("selftests/bpf: add CO-RE relocs testing setup")
Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20191121175900.3486133-1-andriin@fb.com
|
|
Three test cases are added.
Test 1: jmp32 'reg op imm'.
Test 2: jmp32 'reg op reg' where dst 'reg' has unknown constant
and src 'reg' has known constant
Test 3: jmp32 'reg op reg' where dst 'reg' has known constant
and src 'reg' has unknown constant
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20191121170651.449096-1-yhs@fb.com
|
|
If bpf_object__open_file() gets path like "some/dir/obj.o", it should derive
BPF object's name as "obj" (unless overriden through opts->object_name).
Instead, due to using `path` as a fallback value for opts->obj_name, path is
used as is for object name, so for above example BPF object's name will be
verbatim "some/dir/obj", which leads to all sorts of troubles, especially when
internal maps are concern (they are using up to 8 characters of object name).
Fix that by ensuring object_name stays NULL, unless overriden.
Fixes: 291ee02b5e40 ("libbpf: Refactor bpf_object__open APIs to use common opts")
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20191122003527.551556-1-andriin@fb.com
|
|
With latest llvm (trunk https://github.com/llvm/llvm-project),
test_progs, which has +alu32 enabled, failed for strobemeta.o.
The verifier output looks like below with edit to replace large
decimal numbers with hex ones.
193: (85) call bpf_probe_read_user_str#114
R0=inv(id=0)
194: (26) if w0 > 0x1 goto pc+4
R0_w=inv(id=0,umax_value=0xffffffff00000001)
195: (6b) *(u16 *)(r7 +80) = r0
196: (bc) w6 = w0
R6_w=inv(id=0,umax_value=0xffffffff,var_off=(0x0; 0xffffffff))
197: (67) r6 <<= 32
R6_w=inv(id=0,smax_value=0x7fffffff00000000,umax_value=0xffffffff00000000,
var_off=(0x0; 0xffffffff00000000))
198: (77) r6 >>= 32
R6=inv(id=0,umax_value=0xffffffff,var_off=(0x0; 0xffffffff))
...
201: (79) r8 = *(u64 *)(r10 -416)
R8_w=map_value(id=0,off=40,ks=4,vs=13872,imm=0)
202: (0f) r8 += r6
R8_w=map_value(id=0,off=40,ks=4,vs=13872,umax_value=0xffffffff,var_off=(0x0; 0xffffffff))
203: (07) r8 += 9696
R8_w=map_value(id=0,off=9736,ks=4,vs=13872,umax_value=0xffffffff,var_off=(0x0; 0xffffffff))
...
255: (bf) r1 = r8
R1_w=map_value(id=0,off=9736,ks=4,vs=13872,umax_value=0xffffffff,var_off=(0x0; 0xffffffff))
...
257: (85) call bpf_probe_read_user_str#114
R1 unbounded memory access, make sure to bounds check any array access into a map
The value range for register r6 at insn 198 should be really just 0/1.
The umax_value=0xffffffff caused later verification failure.
After jmp instructions, the current verifier already tried to use just
obtained information to get better register range. The current mechanism is
for 64bit register only. This patch implemented to tighten the range
for 32bit sub-registers after jmp32 instructions.
With the patch, we have the below range ranges for the
above code sequence:
193: (85) call bpf_probe_read_user_str#114
R0=inv(id=0)
194: (26) if w0 > 0x1 goto pc+4
R0_w=inv(id=0,smax_value=0x7fffffff00000001,umax_value=0xffffffff00000001,
var_off=(0x0; 0xffffffff00000001))
195: (6b) *(u16 *)(r7 +80) = r0
196: (bc) w6 = w0
R6_w=inv(id=0,umax_value=0xffffffff,var_off=(0x0; 0x1))
197: (67) r6 <<= 32
R6_w=inv(id=0,umax_value=0x100000000,var_off=(0x0; 0x100000000))
198: (77) r6 >>= 32
R6=inv(id=0,umax_value=1,var_off=(0x0; 0x1))
...
201: (79) r8 = *(u64 *)(r10 -416)
R8_w=map_value(id=0,off=40,ks=4,vs=13872,imm=0)
202: (0f) r8 += r6
R8_w=map_value(id=0,off=40,ks=4,vs=13872,umax_value=1,var_off=(0x0; 0x1))
203: (07) r8 += 9696
R8_w=map_value(id=0,off=9736,ks=4,vs=13872,umax_value=1,var_off=(0x0; 0x1))
...
255: (bf) r1 = r8
R1_w=map_value(id=0,off=9736,ks=4,vs=13872,umax_value=1,var_off=(0x0; 0x1))
...
257: (85) call bpf_probe_read_user_str#114
...
At insn 194, the register R0 has better var_off.mask and smax_value.
Especially, the var_off.mask ensures later lshift and rshift
maintains proper value range.
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20191121170650.449030-1-yhs@fb.com
|