Diffstat (limited to 'Documentation/bpf')
-rw-r--r--  Documentation/bpf/bpf_design_QA.rst                    |  25
-rw-r--r--  Documentation/bpf/bpf_devel_QA.rst                     |   2
-rw-r--r--  Documentation/bpf/bpf_prog_run.rst                     | 117
-rw-r--r--  Documentation/bpf/btf.rst                              |  94
-rw-r--r--  Documentation/bpf/clang-notes.rst                      |  30
-rw-r--r--  Documentation/bpf/index.rst                            |   4
-rw-r--r--  Documentation/bpf/instruction-set.rst                  | 359
-rw-r--r--  Documentation/bpf/kfuncs.rst                           | 193
-rw-r--r--  Documentation/bpf/libbpf/index.rst                     |   3
-rw-r--r--  Documentation/bpf/libbpf/libbpf_naming_convention.rst  |  13
-rw-r--r--  Documentation/bpf/linux-notes.rst                      |  53
-rw-r--r--  Documentation/bpf/map_cgroup_storage.rst               |   4
-rw-r--r--  Documentation/bpf/map_hash.rst                         | 185
-rw-r--r--  Documentation/bpf/verifier.rst                         |   2
14 files changed, 887 insertions, 197 deletions
diff --git a/Documentation/bpf/bpf_design_QA.rst b/Documentation/bpf/bpf_design_QA.rst
index 437de2a7a5de..a210b8a4df00 100644
--- a/Documentation/bpf/bpf_design_QA.rst
+++ b/Documentation/bpf/bpf_design_QA.rst
@@ -214,6 +214,12 @@ A: NO. Tracepoints are tied to internal implementation details hence they are
subject to change and can break with newer kernels. BPF programs need to change
accordingly when this happens.
+Q: Are places where kprobes can attach part of the stable ABI?
+--------------------------------------------------------------
+A: NO. The places to which kprobes can attach are internal implementation
+details, which means that they are subject to change and can break with
+newer kernels. BPF programs need to change accordingly when this happens.
+
Q: How much stack space a BPF program uses?
-------------------------------------------
A: Currently all program types are limited to 512 bytes of stack
@@ -273,3 +279,22 @@ cc (congestion-control) implementations. If any of these kernel
functions has changed, both the in-tree and out-of-tree kernel tcp cc
implementations have to be changed. The same goes for the bpf
programs and they have to be adjusted accordingly.
+
+Q: Attaching to arbitrary kernel functions is an ABI?
+-----------------------------------------------------
+Q: BPF programs can be attached to many kernel functions. Do these
+kernel functions become part of the ABI?
+
+A: NO.
+
+The kernel function prototypes will change, and BPF programs attaching to
+them will need to change. The BPF compile-once-run-everywhere (CO-RE)
+approach should be used to make it easier to adapt your BPF programs to
+different versions of the kernel.
+
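+A minimal sketch of what using CO-RE looks like, assuming libbpf and a kernel
+built with BTF (the function below is illustrative, not an existing helper)::
+
+ #include <vmlinux.h>
+ #include <bpf/bpf_core_read.h>
+
+ /* The offsets of real_parent and pid are resolved against the running
+  * kernel's BTF at program load time, so the access keeps working when
+  * the struct layout changes between kernel versions.
+  */
+ static pid_t get_parent_pid(struct task_struct *task)
+ {
+     return BPF_CORE_READ(task, real_parent, pid);
+ }
+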
+Q: Marking a function with BTF_ID makes that function an ABI?
+-------------------------------------------------------------
+A: NO.
+
+The BTF_ID macro does not cause a function to become part of the ABI
+any more than does the EXPORT_SYMBOL_GPL macro.
diff --git a/Documentation/bpf/bpf_devel_QA.rst b/Documentation/bpf/bpf_devel_QA.rst
index 253496af8fef..761474bd7fe6 100644
--- a/Documentation/bpf/bpf_devel_QA.rst
+++ b/Documentation/bpf/bpf_devel_QA.rst
@@ -658,7 +658,7 @@ when:
.. Links
.. _Documentation/process/: https://www.kernel.org/doc/html/latest/process/
-.. _netdev-FAQ: ../networking/netdev-FAQ.rst
+.. _netdev-FAQ: Documentation/process/maintainer-netdev.rst
.. _selftests:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/
.. _Documentation/dev-tools/kselftest.rst:
diff --git a/Documentation/bpf/bpf_prog_run.rst b/Documentation/bpf/bpf_prog_run.rst
new file mode 100644
index 000000000000..4868c909df5c
--- /dev/null
+++ b/Documentation/bpf/bpf_prog_run.rst
@@ -0,0 +1,117 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================================
+Running BPF programs from userspace
+===================================
+
+This document describes the ``BPF_PROG_RUN`` facility for running BPF programs
+from userspace.
+
+.. contents::
+ :local:
+ :depth: 2
+
+
+Overview
+--------
+
+The ``BPF_PROG_RUN`` command can be used through the ``bpf()`` syscall to
+execute a BPF program in the kernel and return the results to userspace. This
+can be used to unit test BPF programs against user-supplied context objects, and
+as a way to explicitly execute programs in the kernel for their side effects. The
+command was previously named ``BPF_PROG_TEST_RUN``, and both constants continue
+to be defined in the UAPI header, aliased to the same value.
+
+The ``BPF_PROG_RUN`` command can be used to execute BPF programs of the
+following types:
+
+- ``BPF_PROG_TYPE_SOCKET_FILTER``
+- ``BPF_PROG_TYPE_SCHED_CLS``
+- ``BPF_PROG_TYPE_SCHED_ACT``
+- ``BPF_PROG_TYPE_XDP``
+- ``BPF_PROG_TYPE_SK_LOOKUP``
+- ``BPF_PROG_TYPE_CGROUP_SKB``
+- ``BPF_PROG_TYPE_LWT_IN``
+- ``BPF_PROG_TYPE_LWT_OUT``
+- ``BPF_PROG_TYPE_LWT_XMIT``
+- ``BPF_PROG_TYPE_LWT_SEG6LOCAL``
+- ``BPF_PROG_TYPE_FLOW_DISSECTOR``
+- ``BPF_PROG_TYPE_STRUCT_OPS``
+- ``BPF_PROG_TYPE_RAW_TRACEPOINT``
+- ``BPF_PROG_TYPE_SYSCALL``
+
+When using the ``BPF_PROG_RUN`` command, userspace supplies an input context
+object and (for program types operating on network packets) a buffer containing
+the packet data that the BPF program will operate on. The kernel will then
+execute the program and return the results to userspace. Note that programs will
+not have any side effects while being run in this mode; in particular, packets
+will not actually be redirected or dropped; the program return code will simply be
+returned to userspace. A separate mode for live execution of XDP programs is
+provided, documented separately below.
+
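+As an illustration, a minimal userspace sketch using libbpf's
+``bpf_prog_test_run_opts()`` wrapper might look as follows (``prog_fd``,
+``pkt``, ``pkt_len`` and ``out_buf`` are assumed to be set up elsewhere)::
+
+ LIBBPF_OPTS(bpf_test_run_opts, opts,
+     .data_in = pkt,                   /* input packet buffer */
+     .data_size_in = pkt_len,
+     .data_out = out_buf,              /* optional: receives the modified packet */
+     .data_size_out = sizeof(out_buf),
+     .repeat = 1,
+ );
+ int err = bpf_prog_test_run_opts(prog_fd, &opts);
+
+ if (!err)
+     printf("retval=%u duration=%uns\n", opts.retval, opts.duration);
+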
+Running XDP programs in "live frame mode"
+-----------------------------------------
+
+The ``BPF_PROG_RUN`` command has a separate mode for running live XDP programs,
+which can be used to execute XDP programs in a way where packets will actually
+be processed by the kernel after the execution of the XDP program as if they
+arrived on a physical interface. This mode is activated by setting the
+``BPF_F_TEST_XDP_LIVE_FRAMES`` flag when supplying an XDP program to
+``BPF_PROG_RUN``.
+
+The live packet mode is optimised for high-performance execution of the supplied
+XDP program many times (suitable for, e.g., running as a traffic generator),
+which means the semantics are not quite as straightforward as the regular test
+run mode. Specifically:
+
+- When executing an XDP program in live frame mode, the result of the execution
+ will not be returned to userspace; instead, the kernel will perform the
+ operation indicated by the program's return code (drop the packet, redirect
+ it, etc). For this reason, setting the ``data_out`` or ``ctx_out`` attributes
+ in the syscall parameters when running in this mode will be rejected. In
+ addition, not all failures will be reported back to userspace directly;
+ specifically, only fatal errors in setup or during execution (like memory
+ allocation errors) will halt execution and return an error. If an error occurs
+ in packet processing, like a failure to redirect to a given interface,
+ execution will continue with the next repetition; these errors can be detected
+ via the same trace points as for regular XDP programs.
+
+- Userspace can supply an ifindex as part of the context object, just like in
+ the regular (non-live) mode. The XDP program will be executed as though the
+ packet arrived on this interface; i.e., the ``ingress_ifindex`` of the context
+ object will point to that interface. Furthermore, if the XDP program returns
+ ``XDP_PASS``, the packet will be injected into the kernel networking stack as
+ though it arrived on that ifindex, and if it returns ``XDP_TX``, the packet
+ will be transmitted *out* of that same interface. Do note, though, that
+ because the program execution is not happening in driver context, an
+ ``XDP_TX`` is actually turned into the same action as an ``XDP_REDIRECT`` to
+ that same interface (i.e., it will only work if the driver has support for the
+ ``ndo_xdp_xmit`` driver op).
+
+- When running the program with multiple repetitions, the execution will happen
+ in batches. The batch size defaults to 64 packets (which is the same as the
+ maximum NAPI receive batch size), but can be specified by userspace through
+ the ``batch_size`` parameter, up to a maximum of 256 packets. For each batch,
+ the kernel executes the XDP program repeatedly, each invocation getting a
+ separate copy of the packet data. For each repetition, if the program drops
+ the packet, the data page is immediately recycled (see below). Otherwise, the
+ packet is buffered until the end of the batch, at which point all packets
+ buffered this way during the batch are transmitted at once.
+
+- When setting up the test run, the kernel will initialise a pool of memory
+ pages, one page per packet in the batch. Each memory page will be initialised
+ with the initial packet data supplied by userspace at ``BPF_PROG_RUN``
+ invocation. When possible, the pages will be recycled on future program
+ invocations, to improve performance. Pages will generally be recycled a full
+ batch at a time, except when a packet is dropped (by return code or because
+ of, say, a redirection error), in which case that page will be recycled
+ immediately. If a packet ends up being passed to the regular networking stack
+ (because the XDP program returns ``XDP_PASS``, or because it ends up being
+ redirected to an interface that injects it into the stack), the page will be
+ released and a new one will be allocated when the pool is empty.
+
+ When recycling, the page content is not rewritten; only the packet boundary
+ pointers (``data``, ``data_end`` and ``data_meta``) in the context object will
+ be reset to the original values. This means that if a program rewrites the
+ packet contents, it has to be prepared to see either the original content or
+ the modified version on subsequent invocations.
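+
+As a sketch (not a complete traffic generator), live frame mode could be
+driven through libbpf as follows; ``prog_fd``, ``pkt``, ``pkt_len`` and
+``ifindex`` are assumed to be prepared elsewhere::
+
+ struct xdp_md ctx_in = {
+     .ingress_ifindex = ifindex,     /* "receive" the packets on this interface */
+ };
+ LIBBPF_OPTS(bpf_test_run_opts, opts,
+     .data_in = pkt,
+     .data_size_in = pkt_len,
+     .ctx_in = &ctx_in,
+     .ctx_size_in = sizeof(ctx_in),
+     .flags = BPF_F_TEST_XDP_LIVE_FRAMES,
+     .repeat = 1 << 20,              /* execute the program ~1M times */
+     .batch_size = 64,
+ );
+ int err = bpf_prog_test_run_opts(prog_fd, &opts);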
diff --git a/Documentation/bpf/btf.rst b/Documentation/bpf/btf.rst
index 1ebf4c5c7ddc..cf8722f96090 100644
--- a/Documentation/bpf/btf.rst
+++ b/Documentation/bpf/btf.rst
@@ -74,7 +74,7 @@ sequentially and type id is assigned to each recognized type starting from id
#define BTF_KIND_ARRAY 3 /* Array */
#define BTF_KIND_STRUCT 4 /* Struct */
#define BTF_KIND_UNION 5 /* Union */
- #define BTF_KIND_ENUM 6 /* Enumeration */
+ #define BTF_KIND_ENUM 6 /* Enumeration up to 32-bit values */
#define BTF_KIND_FWD 7 /* Forward */
#define BTF_KIND_TYPEDEF 8 /* Typedef */
#define BTF_KIND_VOLATILE 9 /* Volatile */
@@ -87,6 +87,7 @@ sequentially and type id is assigned to each recognized type starting from id
#define BTF_KIND_FLOAT 16 /* Floating point */
#define BTF_KIND_DECL_TAG 17 /* Decl Tag */
#define BTF_KIND_TYPE_TAG 18 /* Type Tag */
+ #define BTF_KIND_ENUM64 19 /* Enumeration up to 64-bit values */
Note that the type section encodes debug info, not just pure types.
``BTF_KIND_FUNC`` is not a type, and it represents a defined subprogram.
@@ -101,10 +102,10 @@ Each type contains the following common data::
* bits 24-28: kind (e.g. int, ptr, array...etc)
* bits 29-30: unused
* bit 31: kind_flag, currently used by
- * struct, union and fwd
+ * struct, union, fwd, enum and enum64.
*/
__u32 info;
- /* "size" is used by INT, ENUM, STRUCT and UNION.
+ /* "size" is used by INT, ENUM, STRUCT, UNION and ENUM64.
* "size" tells the size of the type it is describing.
*
* "type" is used by PTR, TYPEDEF, VOLATILE, CONST, RESTRICT,
@@ -281,10 +282,10 @@ modes exist:
``struct btf_type`` encoding requirement:
* ``name_off``: 0 or offset to a valid C identifier
- * ``info.kind_flag``: 0
+ * ``info.kind_flag``: 0 for unsigned, 1 for signed
* ``info.kind``: BTF_KIND_ENUM
* ``info.vlen``: number of enum values
- * ``size``: 4
+ * ``size``: 1/2/4/8
``btf_type`` is followed by ``info.vlen`` number of ``struct btf_enum``.::
@@ -297,6 +298,10 @@ The ``btf_enum`` encoding:
* ``name_off``: offset to a valid C identifier
* ``val``: any value
+If the original enum value is signed and the size is less than 4,
+that value will be sign extended into 4 bytes. If the size is 8,
+the value will be truncated into 4 bytes.
+
2.2.7 BTF_KIND_FWD
~~~~~~~~~~~~~~~~~~
@@ -364,7 +369,8 @@ No additional type data follow ``btf_type``.
* ``name_off``: offset to a valid C identifier
* ``info.kind_flag``: 0
* ``info.kind``: BTF_KIND_FUNC
- * ``info.vlen``: 0
+ * ``info.vlen``: linkage information (BTF_FUNC_STATIC, BTF_FUNC_GLOBAL
+ or BTF_FUNC_EXTERN)
* ``type``: a BTF_KIND_FUNC_PROTO type
No additional type data follow ``btf_type``.
@@ -375,6 +381,9 @@ type. The BTF_KIND_FUNC may in turn be referenced by a func_info in the
:ref:`BTF_Ext_Section` (ELF) or in the arguments to :ref:`BPF_Prog_Load`
(ABI).
+Currently, only linkage values of BTF_FUNC_STATIC and BTF_FUNC_GLOBAL are
+supported in the kernel.
+
2.2.13 BTF_KIND_FUNC_PROTO
~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -493,7 +502,7 @@ the attribute is applied to a ``struct``/``union`` member or
a ``func`` argument, and ``btf_decl_tag.component_idx`` should be a
valid index (starting from 0) pointing to a member or an argument.
-2.2.17 BTF_KIND_TYPE_TAG
+2.2.18 BTF_KIND_TYPE_TAG
~~~~~~~~~~~~~~~~~~~~~~~~
``struct btf_type`` encoding requirement:
@@ -503,6 +512,45 @@ valid index (starting from 0) pointing to a member or an argument.
* ``info.vlen``: 0
* ``type``: the type with ``btf_type_tag`` attribute
+Currently, ``BTF_KIND_TYPE_TAG`` is only emitted for pointer types.
+It has the following btf type chain:
+::
+
+ ptr -> [type_tag]*
+ -> [const | volatile | restrict | typedef]*
+ -> base_type
+
+Basically, a pointer type points to zero or more
+type_tag, then zero or more const/volatile/restrict/typedef
+and finally the base type. The base type is one of
+int, ptr, array, struct, union, enum, func_proto and float types.
+
+2.2.19 BTF_KIND_ENUM64
+~~~~~~~~~~~~~~~~~~~~~~
+
+``struct btf_type`` encoding requirement:
+ * ``name_off``: 0 or offset to a valid C identifier
+ * ``info.kind_flag``: 0 for unsigned, 1 for signed
+ * ``info.kind``: BTF_KIND_ENUM64
+ * ``info.vlen``: number of enum values
+ * ``size``: 1/2/4/8
+
+``btf_type`` is followed by ``info.vlen`` number of ``struct btf_enum64``.::
+
+ struct btf_enum64 {
+ __u32 name_off;
+ __u32 val_lo32;
+ __u32 val_hi32;
+ };
+
+The ``btf_enum64`` encoding:
+ * ``name_off``: offset to a valid C identifier
+ * ``val_lo32``: lower 32 bits of the 64-bit value
+ * ``val_hi32``: upper 32 bits of the 64-bit value
+
+If the original enum value is signed and the size is less than 8,
+that value will be sign extended into 8 bytes.
+
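+As a sketch, consumers can reconstruct the full 64-bit value from the two
+halves described above::
+
+ /* e points to a struct btf_enum64 entry */
+ __u64 value = ((__u64)e->val_hi32 << 32) | e->val_lo32;
+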
3. BTF Kernel API
=================
@@ -565,18 +613,15 @@ A map can be created with ``btf_fd`` and specified key/value type id.::
In libbpf, the map can be defined with extra annotation like below:
::
- struct bpf_map_def SEC("maps") btf_map = {
- .type = BPF_MAP_TYPE_ARRAY,
- .key_size = sizeof(int),
- .value_size = sizeof(struct ipv_counts),
- .max_entries = 4,
- };
- BPF_ANNOTATE_KV_PAIR(btf_map, int, struct ipv_counts);
+ struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __type(key, int);
+ __type(value, struct ipv_counts);
+ __uint(max_entries, 4);
+ } btf_map SEC(".maps");
-Here, the parameters for macro BPF_ANNOTATE_KV_PAIR are map name, key and
-value types for the map. During ELF parsing, libbpf is able to extract
-key/value type_id's and assign them to BPF_MAP_CREATE attributes
-automatically.
+During ELF parsing, libbpf is able to extract key/value type_id's and assign
+them to BPF_MAP_CREATE attributes automatically.
.. _BPF_Prog_Load:
@@ -824,13 +869,12 @@ structure has bitfields. For example, for the following map,::
___A b1:4;
enum A b2:4;
};
- struct bpf_map_def SEC("maps") tmpmap = {
- .type = BPF_MAP_TYPE_ARRAY,
- .key_size = sizeof(__u32),
- .value_size = sizeof(struct tmp_t),
- .max_entries = 1,
- };
- BPF_ANNOTATE_KV_PAIR(tmpmap, int, struct tmp_t);
+ struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __type(key, int);
+ __type(value, struct tmp_t);
+ __uint(max_entries, 1);
+ } tmpmap SEC(".maps");
bpftool is able to pretty print like below:
::
diff --git a/Documentation/bpf/clang-notes.rst b/Documentation/bpf/clang-notes.rst
new file mode 100644
index 000000000000..528feddf2db9
--- /dev/null
+++ b/Documentation/bpf/clang-notes.rst
@@ -0,0 +1,30 @@
+.. contents::
+.. sectnum::
+
+==========================
+Clang implementation notes
+==========================
+
+This document provides more details specific to the Clang/LLVM implementation of the eBPF instruction set.
+
+Versions
+========
+
+Clang defines "CPU" versions, where a CPU version of 3 corresponds to the current eBPF ISA.
+
+Clang can select the eBPF ISA version using ``-mcpu``; for example, ``-mcpu=v3`` selects version 3.
+
+Arithmetic instructions
+=======================
+
+For CPU versions prior to 3, Clang v7.0 and later can enable ``BPF_ALU`` support with
+``-Xclang -target-feature -Xclang +alu32``. In CPU version 3, support is automatically included.
+
+Atomic operations
+=================
+
+Clang can generate atomic instructions by default when ``-mcpu=v3`` is
+enabled. If a lower version for ``-mcpu`` is set, the only atomic instruction
+Clang can generate is ``BPF_ADD`` *without* ``BPF_FETCH``. If you need to enable
+the atomics features, while keeping a lower ``-mcpu`` version, you can use
+``-Xclang -target-feature -Xclang +alu32``.
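+
+For example, with ``-mcpu=v3``, a use of the ``__sync_fetch_and_add()`` builtin
+such as the one below is typically compiled to an atomic add with ``BPF_FETCH``
+(a sketch; the exact instruction selection is up to the compiler)::
+
+ /* Built for the BPF target, e.g. clang -O2 -target bpf -mcpu=v3 -c ... */
+ long bump(long *counter)
+ {
+     return __sync_fetch_and_add(counter, 1);
+ }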
diff --git a/Documentation/bpf/index.rst b/Documentation/bpf/index.rst
index ef5c996547ec..1b50de1983ee 100644
--- a/Documentation/bpf/index.rst
+++ b/Documentation/bpf/index.rst
@@ -19,11 +19,15 @@ that goes into great technical depth about the BPF Architecture.
faq
syscall_api
helpers
+ kfuncs
programs
maps
+ bpf_prog_run
classic_vs_extended.rst
bpf_licensing
test_debug
+ clang-notes
+ linux-notes
other
.. only:: subproject and html
diff --git a/Documentation/bpf/instruction-set.rst b/Documentation/bpf/instruction-set.rst
index 3704836fe6df..5d798437dad4 100644
--- a/Documentation/bpf/instruction-set.rst
+++ b/Documentation/bpf/instruction-set.rst
@@ -1,7 +1,12 @@
+.. contents::
+.. sectnum::
+
+========================================
+eBPF Instruction Set Specification, v1.0
+========================================
+
+This document specifies version 1.0 of the eBPF instruction set.
-====================
-eBPF Instruction Set
-====================
Registers and calling convention
================================
@@ -11,10 +16,10 @@ all of which are 64-bits wide.
The eBPF calling convention is defined as:
- * R0: return value from function calls, and exit value for eBPF programs
- * R1 - R5: arguments for function calls
- * R6 - R9: callee saved registers that function calls will preserve
- * R10: read-only frame pointer to access stack
+* R0: return value from function calls, and exit value for eBPF programs
+* R1 - R5: arguments for function calls
+* R6 - R9: callee saved registers that function calls will preserve
+* R10: read-only frame pointer to access stack
R0 - R5 are scratch registers and eBPF programs needs to spill/fill them if
necessary across calls.
@@ -22,13 +27,19 @@ necessary across calls.
Instruction encoding
====================
-eBPF uses 64-bit instructions with the following encoding:
+eBPF has two instruction encodings:
+
+* the basic instruction encoding, which uses 64 bits to encode an instruction
+* the wide instruction encoding, which appends a second 64-bit immediate value
+ (imm64) after the basic instruction for a total of 128 bits.
+
+The basic instruction encoding looks as follows:
- ============= ======= =============== ==================== ============
- 32 bits (MSB) 16 bits 4 bits 4 bits 8 bits (LSB)
- ============= ======= =============== ==================== ============
- immediate offset source register destination register opcode
- ============= ======= =============== ==================== ============
+============= ======= =============== ==================== ============
+32 bits (MSB) 16 bits 4 bits 4 bits 8 bits (LSB)
+============= ======= =============== ==================== ============
+immediate offset source register destination register opcode
+============= ======= =============== ==================== ============
Note that most instructions do not use all of the fields.
Unused fields shall be cleared to zero.
@@ -38,30 +49,30 @@ Instruction classes
The three LSB bits of the 'opcode' field store the instruction class:
- ========= ===== ===============================
- class value description
- ========= ===== ===============================
- BPF_LD 0x00 non-standard load operations
- BPF_LDX 0x01 load into register operations
- BPF_ST 0x02 store from immediate operations
- BPF_STX 0x03 store from register operations
- BPF_ALU 0x04 32-bit arithmetic operations
- BPF_JMP 0x05 64-bit jump operations
- BPF_JMP32 0x06 32-bit jump operations
- BPF_ALU64 0x07 64-bit arithmetic operations
- ========= ===== ===============================
+========= ===== =============================== ===================================
+class value description reference
+========= ===== =============================== ===================================
+BPF_LD 0x00 non-standard load operations `Load and store instructions`_
+BPF_LDX 0x01 load into register operations `Load and store instructions`_
+BPF_ST 0x02 store from immediate operations `Load and store instructions`_
+BPF_STX 0x03 store from register operations `Load and store instructions`_
+BPF_ALU 0x04 32-bit arithmetic operations `Arithmetic and jump instructions`_
+BPF_JMP 0x05 64-bit jump operations `Arithmetic and jump instructions`_
+BPF_JMP32 0x06 32-bit jump operations `Arithmetic and jump instructions`_
+BPF_ALU64 0x07 64-bit arithmetic operations `Arithmetic and jump instructions`_
+========= ===== =============================== ===================================
Arithmetic and jump instructions
================================
-For arithmetic and jump instructions (BPF_ALU, BPF_ALU64, BPF_JMP and
-BPF_JMP32), the 8-bit 'opcode' field is divided into three parts:
+For arithmetic and jump instructions (``BPF_ALU``, ``BPF_ALU64``, ``BPF_JMP`` and
+``BPF_JMP32``), the 8-bit 'opcode' field is divided into three parts:
- ============== ====== =================
- 4 bits (MSB) 1 bit 3 bits (LSB)
- ============== ====== =================
- operation code source instruction class
- ============== ====== =================
+============== ====== =================
+4 bits (MSB) 1 bit 3 bits (LSB)
+============== ====== =================
+operation code source instruction class
+============== ====== =================
The 4th bit encodes the source operand:
@@ -78,71 +89,103 @@ The four MSB bits store the operation code.
Arithmetic instructions
-----------------------
-BPF_ALU uses 32-bit wide operands while BPF_ALU64 uses 64-bit wide operands for
+``BPF_ALU`` uses 32-bit wide operands while ``BPF_ALU64`` uses 64-bit wide operands for
otherwise identical operations.
-The code field encodes the operation as below:
-
- ======== ===== ==========================
- code value description
- ======== ===== ==========================
- BPF_ADD 0x00 dst += src
- BPF_SUB 0x10 dst -= src
- BPF_MUL 0x20 dst \*= src
- BPF_DIV 0x30 dst /= src
- BPF_OR 0x40 dst \|= src
- BPF_AND 0x50 dst &= src
- BPF_LSH 0x60 dst <<= src
- BPF_RSH 0x70 dst >>= src
- BPF_NEG 0x80 dst = ~src
- BPF_MOD 0x90 dst %= src
- BPF_XOR 0xa0 dst ^= src
- BPF_MOV 0xb0 dst = src
- BPF_ARSH 0xc0 sign extending shift right
- BPF_END 0xd0 endianness conversion
- ======== ===== ==========================
-
-BPF_ADD | BPF_X | BPF_ALU means::
+The 'code' field encodes the operation as below:
+
+======== ===== ==========================================================
+code value description
+======== ===== ==========================================================
+BPF_ADD 0x00 dst += src
+BPF_SUB 0x10 dst -= src
+BPF_MUL 0x20 dst \*= src
+BPF_DIV 0x30 dst /= src
+BPF_OR 0x40 dst \|= src
+BPF_AND 0x50 dst &= src
+BPF_LSH 0x60 dst <<= src
+BPF_RSH 0x70 dst >>= src
+BPF_NEG 0x80 dst = ~src
+BPF_MOD 0x90 dst %= src
+BPF_XOR 0xa0 dst ^= src
+BPF_MOV 0xb0 dst = src
+BPF_ARSH 0xc0 sign extending shift right
+BPF_END 0xd0 byte swap operations (see `Byte swap instructions`_ below)
+======== ===== ==========================================================
+
+``BPF_ADD | BPF_X | BPF_ALU`` means::
dst_reg = (u32) dst_reg + (u32) src_reg;
-BPF_ADD | BPF_X | BPF_ALU64 means::
+``BPF_ADD | BPF_X | BPF_ALU64`` means::
dst_reg = dst_reg + src_reg
-BPF_XOR | BPF_K | BPF_ALU means::
+``BPF_XOR | BPF_K | BPF_ALU`` means::
src_reg = (u32) src_reg ^ (u32) imm32
-BPF_XOR | BPF_K | BPF_ALU64 means::
+``BPF_XOR | BPF_K | BPF_ALU64`` means::
src_reg = src_reg ^ imm32
+Byte swap instructions
+~~~~~~~~~~~~~~~~~~~~~~
+
+The byte swap instructions use an instruction class of ``BPF_ALU`` and a 4-bit
+'code' field of ``BPF_END``.
+
+The byte swap instructions operate on the destination register
+only and do not use a separate source register or immediate value.
+
+The 1-bit source operand field in the opcode is used to select what byte
+order the operation converts from or to:
+
+========= ===== =================================================
+source value description
+========= ===== =================================================
+BPF_TO_LE 0x00 convert between host byte order and little endian
+BPF_TO_BE 0x08 convert between host byte order and big endian
+========= ===== =================================================
+
+The 'imm' field encodes the width of the swap operations. The following widths
+are supported: 16, 32 and 64.
+
+Examples:
+
+``BPF_ALU | BPF_TO_LE | BPF_END`` with imm = 16 means::
+
+ dst_reg = htole16(dst_reg)
+
+``BPF_ALU | BPF_TO_BE | BPF_END`` with imm = 64 means::
+
+ dst_reg = htobe64(dst_reg)
+
Jump instructions
-----------------
-BPF_JMP32 uses 32-bit wide operands while BPF_JMP uses 64-bit wide operands for
+``BPF_JMP32`` uses 32-bit wide operands while ``BPF_JMP`` uses 64-bit wide operands for
otherwise identical operations.
-The code field encodes the operation as below:
-
- ======== ===== ========================= ============
- code value description notes
- ======== ===== ========================= ============
- BPF_JA 0x00 PC += off BPF_JMP only
- BPF_JEQ 0x10 PC += off if dst == src
- BPF_JGT 0x20 PC += off if dst > src unsigned
- BPF_JGE 0x30 PC += off if dst >= src unsigned
- BPF_JSET 0x40 PC += off if dst & src
- BPF_JNE 0x50 PC += off if dst != src
- BPF_JSGT 0x60 PC += off if dst > src signed
- BPF_JSGE 0x70 PC += off if dst >= src signed
- BPF_CALL 0x80 function call
- BPF_EXIT 0x90 function / program return BPF_JMP only
- BPF_JLT 0xa0 PC += off if dst < src unsigned
- BPF_JLE 0xb0 PC += off if dst <= src unsigned
- BPF_JSLT 0xc0 PC += off if dst < src signed
- BPF_JSLE 0xd0 PC += off if dst <= src signed
- ======== ===== ========================= ============
+The 'code' field encodes the operation as below:
+
+======== ===== ========================= ============
+code value description notes
+======== ===== ========================= ============
+BPF_JA 0x00 PC += off BPF_JMP only
+BPF_JEQ 0x10 PC += off if dst == src
+BPF_JGT 0x20 PC += off if dst > src unsigned
+BPF_JGE 0x30 PC += off if dst >= src unsigned
+BPF_JSET 0x40 PC += off if dst & src
+BPF_JNE 0x50 PC += off if dst != src
+BPF_JSGT 0x60 PC += off if dst > src signed
+BPF_JSGE 0x70 PC += off if dst >= src signed
+BPF_CALL 0x80 function call
+BPF_EXIT 0x90 function / program return BPF_JMP only
+BPF_JLT 0xa0 PC += off if dst < src unsigned
+BPF_JLE 0xb0 PC += off if dst <= src unsigned
+BPF_JSLT 0xc0 PC += off if dst < src signed
+BPF_JSLE 0xd0 PC += off if dst <= src signed
+======== ===== ========================= ============
The eBPF program needs to store the return value into register R0 before doing a
BPF_EXIT.
@@ -151,14 +194,26 @@ BPF_EXIT.
Load and store instructions
===========================
-For load and store instructions (BPF_LD, BPF_LDX, BPF_ST and BPF_STX), the
+For load and store instructions (``BPF_LD``, ``BPF_LDX``, ``BPF_ST``, and ``BPF_STX``), the
8-bit 'opcode' field is divided as:
- ============ ====== =================
- 3 bits (MSB) 2 bits 3 bits (LSB)
- ============ ====== =================
- mode size instruction class
- ============ ====== =================
+============ ====== =================
+3 bits (MSB) 2 bits 3 bits (LSB)
+============ ====== =================
+mode size instruction class
+============ ====== =================
+
+The mode modifier is one of:
+
+ ============= ===== ==================================== =============
+ mode modifier value description reference
+ ============= ===== ==================================== =============
+ BPF_IMM 0x00 64-bit immediate instructions `64-bit immediate instructions`_
+ BPF_ABS 0x20 legacy BPF packet access (absolute) `Legacy BPF Packet access instructions`_
+ BPF_IND 0x40 legacy BPF packet access (indirect) `Legacy BPF Packet access instructions`_
+ BPF_MEM 0x60 regular load and store operations `Regular load and store operations`_
+ BPF_ATOMIC 0xc0 atomic operations `Atomic operations`_
+ ============= ===== ==================================== =============
The size modifier is one of:
@@ -171,109 +226,103 @@ The size modifier is one of:
BPF_DW 0x18 double word (8 bytes)
============= ===== =====================
-The mode modifier is one of:
+Regular load and store operations
+---------------------------------
- ============= ===== ====================================
- mode modifier value description
- ============= ===== ====================================
- BPF_IMM 0x00 used for 64-bit mov
- BPF_ABS 0x20 legacy BPF packet access
- BPF_IND 0x40 legacy BPF packet access
- BPF_MEM 0x60 all normal load and store operations
- BPF_ATOMIC 0xc0 atomic operations
- ============= ===== ====================================
+The ``BPF_MEM`` mode modifier is used to encode regular load and store
+instructions that transfer data between a register and memory.
-BPF_MEM | <size> | BPF_STX means::
+``BPF_MEM | <size> | BPF_STX`` means::
*(size *) (dst_reg + off) = src_reg
-BPF_MEM | <size> | BPF_ST means::
+``BPF_MEM | <size> | BPF_ST`` means::
*(size *) (dst_reg + off) = imm32
-BPF_MEM | <size> | BPF_LDX means::
+``BPF_MEM | <size> | BPF_LDX`` means::
dst_reg = *(size *) (src_reg + off)
-Where size is one of: BPF_B or BPF_H or BPF_W or BPF_DW.
+Where size is one of: ``BPF_B``, ``BPF_H``, ``BPF_W``, or ``BPF_DW``.
Atomic operations
-----------------
-eBPF includes atomic operations, which use the immediate field for extra
-encoding::
+Atomic operations are operations that operate on memory and cannot be
+interrupted or corrupted by other accesses to the same memory region
+from other eBPF programs or from means outside of this specification.
+
+All atomic operations supported by eBPF are encoded as store operations
+that use the ``BPF_ATOMIC`` mode modifier as follows:
- .imm = BPF_ADD, .code = BPF_ATOMIC | BPF_W | BPF_STX: lock xadd *(u32 *)(dst_reg + off16) += src_reg
- .imm = BPF_ADD, .code = BPF_ATOMIC | BPF_DW | BPF_STX: lock xadd *(u64 *)(dst_reg + off16) += src_reg
+* ``BPF_ATOMIC | BPF_W | BPF_STX`` for 32-bit operations
+* ``BPF_ATOMIC | BPF_DW | BPF_STX`` for 64-bit operations
+* 8-bit and 16-bit wide atomic operations are not supported.
-The basic atomic operations supported are::
+The 'imm' field is used to encode the actual atomic operation.
+Simple atomic operations reuse a subset of the values defined to encode
+arithmetic operations:
- BPF_ADD
- BPF_AND
- BPF_OR
- BPF_XOR
+======== ===== ===========
+imm value description
+======== ===== ===========
+BPF_ADD 0x00 atomic add
+BPF_OR 0x40 atomic or
+BPF_AND 0x50 atomic and
+BPF_XOR 0xa0 atomic xor
+======== ===== ===========
-Each having equivalent semantics with the ``BPF_ADD`` example, that is: the
-memory location addresed by ``dst_reg + off`` is atomically modified, with
-``src_reg`` as the other operand. If the ``BPF_FETCH`` flag is set in the
-immediate, then these operations also overwrite ``src_reg`` with the
-value that was in memory before it was modified.
-The more special operations are::
+``BPF_ATOMIC | BPF_W | BPF_STX`` with 'imm' = BPF_ADD means::
- BPF_XCHG
+ *(u32 *)(dst_reg + off16) += src_reg
-This atomically exchanges ``src_reg`` with the value addressed by ``dst_reg +
-off``. ::
+``BPF_ATOMIC | BPF_DW | BPF_STX`` with 'imm' = BPF_ADD means::
- BPF_CMPXCHG
+ *(u64 *)(dst_reg + off16) += src_reg
-This atomically compares the value addressed by ``dst_reg + off`` with
-``R0``. If they match it is replaced with ``src_reg``. In either case, the
-value that was there before is zero-extended and loaded back to ``R0``.
+In addition to the simple atomic operations, there is also a modifier and
+two complex atomic operations:
-Note that 1 and 2 byte atomic operations are not supported.
+=========== ================ ===========================
+imm value description
+=========== ================ ===========================
+BPF_FETCH 0x01 modifier: return old value
+BPF_XCHG 0xe0 | BPF_FETCH atomic exchange
+BPF_CMPXCHG 0xf0 | BPF_FETCH atomic compare and exchange
+=========== ================ ===========================
-Clang can generate atomic instructions by default when ``-mcpu=v3`` is
-enabled. If a lower version for ``-mcpu`` is set, the only atomic instruction
-Clang can generate is ``BPF_ADD`` *without* ``BPF_FETCH``. If you need to enable
-the atomics features, while keeping a lower ``-mcpu`` version, you can use
-``-Xclang -target-feature -Xclang +alu32``.
+The ``BPF_FETCH`` modifier is optional for simple atomic operations, and
+always set for the complex atomic operations. If the ``BPF_FETCH`` flag
+is set, then the operation also overwrites ``src_reg`` with the value that
+was in memory before it was modified.
-You may encounter ``BPF_XADD`` - this is a legacy name for ``BPF_ATOMIC``,
-referring to the exclusive-add operation encoded when the immediate field is
-zero.
+The ``BPF_XCHG`` operation atomically exchanges ``src_reg`` with the value
+addressed by ``dst_reg + off``.
-16-byte instructions
---------------------
+The ``BPF_CMPXCHG`` operation atomically compares the value addressed by
+``dst_reg + off`` with ``R0``. If they match, the value addressed by
+``dst_reg + off`` is replaced with ``src_reg``. In either case, the
+value that was at ``dst_reg + off`` before the operation is zero-extended
+and loaded back to ``R0``.
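+
+Expressed in the pseudo-C used earlier in this document, and with the whole
+sequence executing atomically, ``BPF_ATOMIC | BPF_DW | BPF_STX`` with 'imm' =
+BPF_CMPXCHG means::
+
+  old = *(u64 *)(dst_reg + off)
+  if (old == R0)
+    *(u64 *)(dst_reg + off) = src_reg
+  R0 = old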
-eBPF has one 16-byte instruction: ``BPF_LD | BPF_DW | BPF_IMM`` which consists
-of two consecutive ``struct bpf_insn`` 8-byte blocks and interpreted as single
-instruction that loads 64-bit immediate value into a dst_reg.
+64-bit immediate instructions
+-----------------------------
-Packet access instructions
---------------------------
+Instructions with the ``BPF_IMM`` 'mode' modifier use the wide instruction
+encoding for an extra imm64 value.
-eBPF has two non-generic instructions: (BPF_ABS | <size> | BPF_LD) and
-(BPF_IND | <size> | BPF_LD) which are used to access packet data.
+There is currently only one such instruction.
-They had to be carried over from classic BPF to have strong performance of
-socket filters running in eBPF interpreter. These instructions can only
-be used when interpreter context is a pointer to ``struct sk_buff`` and
-have seven implicit operands. Register R6 is an implicit input that must
-contain pointer to sk_buff. Register R0 is an implicit output which contains
-the data fetched from the packet. Registers R1-R5 are scratch registers
-and must not be used to store the data across BPF_ABS | BPF_LD or
-BPF_IND | BPF_LD instructions.
+``BPF_LD | BPF_DW | BPF_IMM`` means::
-These instructions have implicit program exit condition as well. When
-eBPF program is trying to access the data beyond the packet boundary,
-the interpreter will abort the execution of the program. JIT compilers
-therefore must preserve this property. src_reg and imm32 fields are
-explicit inputs to these instructions.
+ dst_reg = imm64
-For example, BPF_IND | BPF_W | BPF_LD means::
- R0 = ntohl(*(u32 *) (((struct sk_buff *) R6)->data + src_reg + imm32))
+Legacy BPF Packet access instructions
+-------------------------------------
-and R1 - R5 are clobbered.
+eBPF previously introduced special instructions for access to packet data that were
+carried over from classic BPF. However, these instructions are
+deprecated and should no longer be used.
diff --git a/Documentation/bpf/kfuncs.rst b/Documentation/bpf/kfuncs.rst
new file mode 100644
index 000000000000..0f858156371d
--- /dev/null
+++ b/Documentation/bpf/kfuncs.rst
@@ -0,0 +1,193 @@
+=============================
+BPF Kernel Functions (kfuncs)
+=============================
+
+1. Introduction
+===============
+
+BPF Kernel Functions, more commonly known as kfuncs, are functions in the Linux
+kernel which are exposed for use by BPF programs. Unlike normal BPF helpers,
+kfuncs do not have a stable interface and can change from one kernel release to
+another. Hence, BPF programs need to be updated in response to changes in the
+kernel.
+
+2. Defining a kfunc
+===================
+
+There are two ways to expose a kernel function to BPF programs: either make an
+existing function in the kernel visible, or add a new wrapper for BPF. In both
+cases, care must be taken that BPF programs can only call such functions in a
+valid context. To enforce this, visibility of a kfunc can be per program type.
+
+If you are not creating a BPF wrapper for an existing kernel function, skip ahead
+to :ref:`BPF_kfunc_nodef`.
+
+2.1 Creating a wrapper kfunc
+----------------------------
+
+When defining a wrapper kfunc, the wrapper function should have extern linkage.
+This prevents the compiler from optimizing away dead code, as this wrapper kfunc
+is not invoked anywhere in the kernel itself. It is not necessary to provide a
+prototype in a header for the wrapper kfunc.
+
+An example is given below::
+
+ /* Disables missing prototype warnings */
+ __diag_push();
+ __diag_ignore_all("-Wmissing-prototypes",
+ "Global kfuncs as their definitions will be in BTF");
+
+ struct task_struct *bpf_find_get_task_by_vpid(pid_t nr)
+ {
+ return find_get_task_by_vpid(nr);
+ }
+
+ __diag_pop();
+
+A wrapper kfunc is often needed when we need to annotate parameters of the
+kfunc. Otherwise one may directly make the kfunc visible to the BPF program by
+registering it with the BPF subsystem. See :ref:`BPF_kfunc_nodef`.
+
+2.2 Annotating kfunc parameters
+-------------------------------
+
+Similar to BPF helpers, the verifier sometimes needs additional context to make
+the usage of kernel functions safer and more useful.
+Hence, we can annotate a parameter by suffixing the name of the argument of the
+kfunc with a __tag, where tag may be one of the supported annotations.
+
+2.2.1 __sz Annotation
+---------------------
+
+This annotation is used to indicate a memory and size pair in the argument list.
+An example is given below::
+
+ void bpf_memzero(void *mem, int mem__sz)
+ {
+ ...
+ }
+
+Here, the verifier will treat the first argument as a PTR_TO_MEM, and the second
+argument as its size. By default, without the __sz annotation, the size of the type
+of the pointer is used. Without __sz annotation, a kfunc cannot accept a void
+pointer.
+
+.. _BPF_kfunc_nodef:
+
+2.3 Using an existing kernel function
+-------------------------------------
+
+When an existing function in the kernel is fit for consumption by BPF programs,
+it can be directly registered with the BPF subsystem. However, care must still
+be taken to review the context in which it will be invoked by the BPF program
+and whether it is safe to do so.
+
+2.4 Annotating kfuncs
+---------------------
+
+In addition to kfuncs' arguments, the verifier may need more information about the
+type of kfunc(s) being registered with the BPF subsystem. To do so, we define
+flags on a set of kfuncs as follows::
+
+ BTF_SET8_START(bpf_task_set)
+ BTF_ID_FLAGS(func, bpf_get_task_pid, KF_ACQUIRE | KF_RET_NULL)
+ BTF_ID_FLAGS(func, bpf_put_pid, KF_RELEASE)
+ BTF_SET8_END(bpf_task_set)
+
+This set encodes the BTF ID of each kfunc listed above, and encodes the flags
+along with it. Of course, it is also allowed to specify no flags.
+
+2.4.1 KF_ACQUIRE flag
+---------------------
+
+The KF_ACQUIRE flag is used to indicate that the kfunc returns a pointer to a
+refcounted object. The verifier will then ensure that the pointer to the object
+is eventually released using a release kfunc, or transferred to a map using a
+referenced kptr (by invoking bpf_kptr_xchg). If not, the verifier fails the
+loading of the BPF program until no lingering references remain in all possible
+explored states of the program.
+
+2.4.2 KF_RET_NULL flag
+----------------------
+
+The KF_RET_NULL flag is used to indicate that the pointer returned by the kfunc
+may be NULL. Hence, it forces the user to do a NULL check on the pointer
+returned from the kfunc before making use of it (dereferencing or passing to
+another helper). This flag is often used in pairing with KF_ACQUIRE flag, but
+both are orthogonal to each other.
+
+2.4.3 KF_RELEASE flag
+---------------------
+
+The KF_RELEASE flag is used to indicate that the kfunc releases the pointer
+passed in to it. There can be only one referenced pointer that can be passed in.
+All copies of the pointer being released are invalidated as a result of invoking
+kfunc with this flag.
+
+2.4.4 KF_KPTR_GET flag
+----------------------
+
+The KF_KPTR_GET flag is used to indicate that the kfunc takes the first argument
+as a pointer to kptr, safely increments the refcount of the object it points to,
+and returns a reference to the user. The rest of the arguments may be normal
+arguments of a kfunc. The KF_KPTR_GET flag should be used in conjunction with
+KF_ACQUIRE and KF_RET_NULL flags.
+
+2.4.5 KF_TRUSTED_ARGS flag
+--------------------------
+
+The KF_TRUSTED_ARGS flag is used for kfuncs taking pointer arguments. It
+indicates that all pointer arguments will always have a guaranteed lifetime,
+and pointers to kernel objects are always passed to helpers in their unmodified
+form (as obtained from acquire kfuncs).
+
+It can be used to enforce that a pointer to a refcounted object acquired from a
+kfunc or BPF helper is passed as an argument to this kfunc without any
+modifications (e.g. pointer arithmetic) such that it is trusted and points to
+the original object.
+
+Meanwhile, it is also allowed to pass pointers to normal memory to such kfuncs,
+but those can have a non-zero offset.
+
+This flag is often used for kfuncs that operate (change some property, perform
+some operation) on an object that was obtained using an acquire kfunc. Such
+kfuncs need an unchanged pointer to ensure the integrity of the operation being
+performed on the expected object.
+
+2.4.6 KF_SLEEPABLE flag
+-----------------------
+
+The KF_SLEEPABLE flag is used for kfuncs that may sleep. Such kfuncs can only
+be called by sleepable BPF programs (BPF_F_SLEEPABLE).
+
+2.4.7 KF_DESTRUCTIVE flag
+--------------------------
+
+The KF_DESTRUCTIVE flag is used to indicate kfuncs whose invocation is
+destructive to the system. For example, such a call can result in the system
+rebooting or panicking. Due to this, additional restrictions apply to these
+calls. At the moment they only require the CAP_SYS_BOOT capability, but more
+can be added later.
+
+2.5 Registering the kfuncs
+--------------------------
+
+Once the kfunc is prepared for use, the final step in making it visible is
+registering it with the BPF subsystem. Registration is done per BPF program
+type. An example is shown below::
+
+ BTF_SET8_START(bpf_task_set)
+ BTF_ID_FLAGS(func, bpf_get_task_pid, KF_ACQUIRE | KF_RET_NULL)
+ BTF_ID_FLAGS(func, bpf_put_pid, KF_RELEASE)
+ BTF_SET8_END(bpf_task_set)
+
+ static const struct btf_kfunc_id_set bpf_task_kfunc_set = {
+ .owner = THIS_MODULE,
+ .set = &bpf_task_set,
+ };
+
+ static int init_subsystem(void)
+ {
+ return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &bpf_task_kfunc_set);
+ }
+ late_initcall(init_subsystem);
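+
+On the BPF program side, a registered kfunc is declared with the ``__ksym``
+attribute and called like a normal function. A minimal sketch, reusing the
+wrapper kfunc from section 2.1 and assuming it has been registered for the
+tracing program type as above::
+
+ #include <vmlinux.h>
+ #include <bpf/bpf_helpers.h>
+ #include <bpf/bpf_tracing.h>
+
+ extern struct task_struct *bpf_find_get_task_by_vpid(pid_t nr) __ksym;
+
+ SEC("tp_btf/task_newtask")
+ int BPF_PROG(handle_new_task, struct task_struct *task, u64 clone_flags)
+ {
+     struct task_struct *t = bpf_find_get_task_by_vpid(1);
+
+     if (!t)
+         return 0;
+     /* ... use t; a kfunc registered with KF_ACQUIRE returns a reference
+      * that must be released with the matching release kfunc ...
+      */
+     return 0;
+ }
+
+ char _license[] SEC("license") = "GPL";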
diff --git a/Documentation/bpf/libbpf/index.rst b/Documentation/bpf/libbpf/index.rst
index 4e8c656b539a..3722537d1384 100644
--- a/Documentation/bpf/libbpf/index.rst
+++ b/Documentation/bpf/libbpf/index.rst
@@ -6,14 +6,13 @@ libbpf
.. toctree::
:maxdepth: 1
+ API Documentation <https://libbpf.readthedocs.io/en/latest/api.html>
libbpf_naming_convention
libbpf_build
This is documentation for libbpf, a userspace library for loading and
interacting with bpf programs.
-For API documentation see the `versioned API documentation site <https://libbpf.readthedocs.io/en/latest/api.html>`_.
-
All general BPF questions, including kernel functionality, libbpf APIs and
their application, should be sent to bpf@vger.kernel.org mailing list.
You can `subscribe <http://vger.kernel.org/vger-lists.html#bpf>`_ to the
diff --git a/Documentation/bpf/libbpf/libbpf_naming_convention.rst b/Documentation/bpf/libbpf/libbpf_naming_convention.rst
index f86360f734a8..c5ac97f3d4c4 100644
--- a/Documentation/bpf/libbpf/libbpf_naming_convention.rst
+++ b/Documentation/bpf/libbpf/libbpf_naming_convention.rst
@@ -9,8 +9,8 @@ described here. It's recommended to follow these conventions whenever a
new function or type is added to keep libbpf API clean and consistent.
All types and functions provided by libbpf API should have one of the
-following prefixes: ``bpf_``, ``btf_``, ``libbpf_``, ``xsk_``,
-``btf_dump_``, ``ring_buffer_``, ``perf_buffer_``.
+following prefixes: ``bpf_``, ``btf_``, ``libbpf_``, ``btf_dump_``,
+``ring_buffer_``, ``perf_buffer_``.
System call wrappers
--------------------
@@ -59,15 +59,6 @@ Auxiliary functions and types that don't fit well in any of categories
described above should have ``libbpf_`` prefix, e.g.
``libbpf_get_error`` or ``libbpf_prog_type_by_name``.
-AF_XDP functions
--------------------
-
-AF_XDP functions should have an ``xsk_`` prefix, e.g.
-``xsk_umem__get_data`` or ``xsk_umem__create``. The interface consists
-of both low-level ring access functions and high-level configuration
-functions. These can be mixed and matched. Note that these functions
-are not reentrant for performance reasons.
-
ABI
---
diff --git a/Documentation/bpf/linux-notes.rst b/Documentation/bpf/linux-notes.rst
new file mode 100644
index 000000000000..956b0c86699d
--- /dev/null
+++ b/Documentation/bpf/linux-notes.rst
@@ -0,0 +1,53 @@
+.. contents::
+.. sectnum::
+
+==========================
+Linux implementation notes
+==========================
+
+This document provides more details specific to the Linux kernel implementation of the eBPF instruction set.
+
+Byte swap instructions
+======================
+
+``BPF_FROM_LE`` and ``BPF_FROM_BE`` exist as aliases for ``BPF_TO_LE`` and ``BPF_TO_BE`` respectively.
+
+Legacy BPF Packet access instructions
+=====================================
+
+As mentioned in the `ISA standard documentation <instruction-set.rst#legacy-bpf-packet-access-instructions>`_,
+Linux has special eBPF instructions for access to packet data that have been
+carried over from classic BPF to retain the performance of legacy socket
+filters running in the eBPF interpreter.
+
+The instructions come in two forms: ``BPF_ABS | <size> | BPF_LD`` and
+``BPF_IND | <size> | BPF_LD``.
+
+These instructions are used to access packet data and can only be used when
+the program context is a pointer to a networking packet. ``BPF_ABS``
+accesses packet data at an absolute offset specified by the immediate data
+and ``BPF_IND`` accesses packet data at an offset that includes the value of
+a register in addition to the immediate data.
+
+These instructions have seven implicit operands:
+
+* Register R6 is an implicit input that must contain a pointer to a
+ struct sk_buff.
+* Register R0 is an implicit output which contains the data fetched from
+ the packet.
+* Registers R1-R5 are scratch registers that are clobbered by the
+ instruction.
+
+These instructions have an implicit program exit condition as well. If an
+eBPF program attempts to access data beyond the packet boundary, the
+program execution will be aborted.
+
+``BPF_ABS | BPF_W | BPF_LD`` (0x20) means::
+
+ R0 = ntohl(*(u32 *) ((struct sk_buff *) R6->data + imm))
+
+where ``ntohl()`` converts a 32-bit value from network byte order to host byte order.
+
+``BPF_IND | BPF_W | BPF_LD`` (0x40) means::
+
+ R0 = ntohl(*(u32 *) ((struct sk_buff *) R6->data + src + imm))
diff --git a/Documentation/bpf/map_cgroup_storage.rst b/Documentation/bpf/map_cgroup_storage.rst
index cab9543017bf..8e5fe532c07e 100644
--- a/Documentation/bpf/map_cgroup_storage.rst
+++ b/Documentation/bpf/map_cgroup_storage.rst
@@ -31,7 +31,7 @@ The map uses key of type of either ``__u64 cgroup_inode_id`` or
};
``cgroup_inode_id`` is the inode id of the cgroup directory.
-``attach_type`` is the the program's attach type.
+``attach_type`` is the program's attach type.
Linux 5.9 added support for type ``__u64 cgroup_inode_id`` as the key type.
When this key type is used, then all attach types of the particular cgroup and
@@ -155,7 +155,7 @@ However, the BPF program can still only associate with one map of each type
``BPF_MAP_TYPE_CGROUP_STORAGE`` or more than one
``BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE``.
-In all versions, userspace may use the the attach parameters of cgroup and
+In all versions, userspace may use the attach parameters of cgroup and
attach type pair in ``struct bpf_cgroup_storage_key`` as the key to the BPF map
APIs to read or update the storage for a given attachment. For Linux 5.9
attach type shared storages, only the first value in the struct, cgroup inode
diff --git a/Documentation/bpf/map_hash.rst b/Documentation/bpf/map_hash.rst
new file mode 100644
index 000000000000..e85120878b27
--- /dev/null
+++ b/Documentation/bpf/map_hash.rst
@@ -0,0 +1,185 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+.. Copyright (C) 2022 Red Hat, Inc.
+
+===============================================
+BPF_MAP_TYPE_HASH, with PERCPU and LRU Variants
+===============================================
+
+.. note::
+ - ``BPF_MAP_TYPE_HASH`` was introduced in kernel version 3.19
+ - ``BPF_MAP_TYPE_PERCPU_HASH`` was introduced in version 4.6
+ - Both ``BPF_MAP_TYPE_LRU_HASH`` and ``BPF_MAP_TYPE_LRU_PERCPU_HASH``
+ were introduced in version 4.10
+
+``BPF_MAP_TYPE_HASH`` and ``BPF_MAP_TYPE_PERCPU_HASH`` provide general
+purpose hash map storage. Both the key and the value can be structs,
+allowing for composite keys and values.
+
+The kernel is responsible for allocating and freeing key/value pairs, up
+to the max_entries limit that you specify. Hash maps use pre-allocation
+of hash table elements by default. The ``BPF_F_NO_PREALLOC`` flag can be
+used to disable pre-allocation when it is too memory expensive.
+
+``BPF_MAP_TYPE_PERCPU_HASH`` provides a separate value slot per
+CPU. The per-cpu values are stored internally in an array.
+
+The ``BPF_MAP_TYPE_LRU_HASH`` and ``BPF_MAP_TYPE_LRU_PERCPU_HASH``
+variants add LRU semantics to their respective hash tables. An LRU hash
+will automatically evict the least recently used entries when the hash
+table reaches capacity. An LRU hash maintains an internal LRU list that
+is used to select elements for eviction. This internal LRU list is
+shared across CPUs but it is possible to request a per CPU LRU list with
+the ``BPF_F_NO_COMMON_LRU`` flag when calling ``bpf_map_create``.
+
+Usage
+=====
+
+.. c:function::
+ long bpf_map_update_elem(struct bpf_map *map, const void *key, const void *value, u64 flags)
+
+Hash entries can be added or updated using the ``bpf_map_update_elem()``
+helper. This helper replaces existing elements atomically. The ``flags``
+parameter can be used to control the update behaviour:
+
+- ``BPF_ANY`` will create a new element or update an existing element
+- ``BPF_NOEXIST`` will create a new element only if one did not already
+ exist
+- ``BPF_EXIST`` will update an existing element
+
+``bpf_map_update_elem()`` returns 0 on success, or negative error in
+case of failure.
+
+.. c:function::
+ void *bpf_map_lookup_elem(struct bpf_map *map, const void *key)
+
+Hash entries can be retrieved using the ``bpf_map_lookup_elem()``
+helper. This helper returns a pointer to the value associated with
+``key``, or ``NULL`` if no entry was found.
+
+.. c:function::
+ long bpf_map_delete_elem(struct bpf_map *map, const void *key)
+
+Hash entries can be deleted using the ``bpf_map_delete_elem()``
+helper. This helper will return 0 on success, or negative error in case
+of failure.
+
+Per CPU Hashes
+--------------
+
+For ``BPF_MAP_TYPE_PERCPU_HASH`` and ``BPF_MAP_TYPE_LRU_PERCPU_HASH``
+the ``bpf_map_update_elem()`` and ``bpf_map_lookup_elem()`` helpers
+automatically access the hash slot for the current CPU.
+
+.. c:function::
+ void *bpf_map_lookup_percpu_elem(struct bpf_map *map, const void *key, u32 cpu)
+
+The ``bpf_map_lookup_percpu_elem()`` helper can be used to look up the
+value in the hash slot for a specific CPU. It returns the value associated with
+``key`` on ``cpu``, or ``NULL`` if no entry was found or ``cpu`` is
+invalid.
+
+Concurrency
+-----------
+
+Values stored in ``BPF_MAP_TYPE_HASH`` can be accessed concurrently by
+programs running on different CPUs. Since kernel version 5.1, the BPF
+infrastructure provides ``struct bpf_spin_lock`` to synchronise access.
+See ``tools/testing/selftests/bpf/progs/test_spin_lock.c``.
+
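+A minimal sketch of such synchronisation, where the lock is embedded in the
+map value (the map and element layout here are illustrative):
+
+.. code-block:: c
+
+ struct concurrent_element {
+     struct bpf_spin_lock lock;
+     __u64 count;
+ };
+
+ struct {
+     __uint(type, BPF_MAP_TYPE_HASH);
+     __uint(max_entries, 64);
+     __type(key, __u32);
+     __type(value, struct concurrent_element);
+ } concurrent_map SEC(".maps");
+
+ static void increment(__u32 key)
+ {
+     struct concurrent_element *elem;
+
+     elem = bpf_map_lookup_elem(&concurrent_map, &key);
+     if (elem) {
+         bpf_spin_lock(&elem->lock);
+         elem->count += 1;
+         bpf_spin_unlock(&elem->lock);
+     }
+ }
+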
+Userspace
+---------
+
+.. c:function::
+ int bpf_map_get_next_key(int fd, const void *cur_key, void *next_key)
+
+In userspace, it is possible to iterate through the keys of a hash using
+libbpf's ``bpf_map_get_next_key()`` function. The first key can be fetched by
+calling ``bpf_map_get_next_key()`` with ``cur_key`` set to
+``NULL``. Subsequent calls will fetch the next key that follows the
+current key. ``bpf_map_get_next_key()`` returns 0 on success, -ENOENT if
+cur_key is the last key in the hash, or negative error in case of
+failure.
+
+Note that if ``cur_key`` gets deleted then ``bpf_map_get_next_key()``
+will instead return the *first* key in the hash table, which is
+undesirable. It is recommended to use batched lookup if there is going
+to be key deletion intermixed with ``bpf_map_get_next_key()``.
+
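+A hedged sketch of batched lookup using libbpf's ``bpf_map_lookup_batch()``,
+reusing the ``struct key`` and ``struct value`` types declared in the examples
+below:
+
+.. code-block:: c
+
+ __u32 out_batch, count;
+ struct key keys[64];
+ struct value values[64];
+ void *in_batch = NULL;          /* NULL starts iteration at the beginning */
+ int err;
+
+ do {
+     count = 64;                 /* in: array capacity, out: entries returned */
+     err = bpf_map_lookup_batch(map_fd, in_batch, &out_batch,
+                                keys, values, &count, NULL);
+     if (err && err != -ENOENT)
+         break;                  /* real error */
+     /* process 'count' key/value pairs here */
+     in_batch = &out_batch;      /* resume after the last returned position */
+ } while (!err);                 /* -ENOENT signals the end of the map */
+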
+Examples
+========
+
+Please see the ``tools/testing/selftests/bpf`` directory for functional
+examples. The code snippets below demonstrate API usage.
+
+This example shows how to declare an LRU Hash with a struct key and a
+struct value.
+
+.. code-block:: c
+
+ #include <linux/bpf.h>
+ #include <bpf/bpf_helpers.h>
+
+ struct key {
+ __u32 srcip;
+ };
+
+ struct value {
+ __u64 packets;
+ __u64 bytes;
+ };
+
+ struct {
+ __uint(type, BPF_MAP_TYPE_LRU_HASH);
+ __uint(max_entries, 32);
+ __type(key, struct key);
+ __type(value, struct value);
+ } packet_stats SEC(".maps");
+
+This example shows how to create or update hash values using atomic
+instructions:
+
+.. code-block:: c
+
+ static void update_stats(__u32 srcip, int bytes)
+ {
+ struct key key = {
+ .srcip = srcip,
+ };
+ struct value *value = bpf_map_lookup_elem(&packet_stats, &key);
+
+ if (value) {
+ __sync_fetch_and_add(&value->packets, 1);
+ __sync_fetch_and_add(&value->bytes, bytes);
+ } else {
+ struct value newval = { 1, bytes };
+
+ bpf_map_update_elem(&packet_stats, &key, &newval, BPF_NOEXIST);
+ }
+ }
+
+This example shows how userspace can walk the elements of the map declared above:
+
+.. code-block:: c
+
+ #include <bpf/libbpf.h>
+ #include <bpf/bpf.h>
+
+ static void walk_hash_elements(int map_fd)
+ {
+ struct key *cur_key = NULL;
+ struct key next_key;
+ struct value value;
+ int err;
+
+ for (;;) {
+ err = bpf_map_get_next_key(map_fd, cur_key, &next_key);
+ if (err)
+ break;
+
+ bpf_map_lookup_elem(map_fd, &next_key, &value);
+
+ // Use key and value here
+
+ cur_key = &next_key;
+ }
+ }
diff --git a/Documentation/bpf/verifier.rst b/Documentation/bpf/verifier.rst
index fae5f6273bac..d4326caf01f9 100644
--- a/Documentation/bpf/verifier.rst
+++ b/Documentation/bpf/verifier.rst
@@ -329,7 +329,7 @@ Program with unreachable instructions::
BPF_EXIT_INSN(),
};
-Error:
+Error::
unreachable insn 1