aboutsummaryrefslogtreecommitdiffstats
path: root/Documentation/networking
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/networking')
-rw-r--r--Documentation/networking/00-INDEX2
-rw-r--r--Documentation/networking/batman-adv.rst220
-rw-r--r--Documentation/networking/batman-adv.txt215
-rw-r--r--Documentation/networking/filter.txt122
-rw-r--r--Documentation/networking/index.rst1
-rw-r--r--Documentation/networking/ip-sysctl.txt13
-rw-r--r--Documentation/networking/netvsc.txt63
-rw-r--r--Documentation/networking/strparser.txt207
8 files changed, 530 insertions, 313 deletions
diff --git a/Documentation/networking/00-INDEX b/Documentation/networking/00-INDEX
index c6beb5f1637f..7a79b3587dd3 100644
--- a/Documentation/networking/00-INDEX
+++ b/Documentation/networking/00-INDEX
@@ -30,8 +30,6 @@ atm.txt
- info on where to get ATM programs and support for Linux.
ax25.txt
- info on using AX.25 and NET/ROM code for Linux
-batman-adv.txt
- - B.A.T.M.A.N routing protocol on top of layer 2 Ethernet Frames.
baycom.txt
- info on the driver for Baycom style amateur radio modems
bonding.txt
diff --git a/Documentation/networking/batman-adv.rst b/Documentation/networking/batman-adv.rst
new file mode 100644
index 000000000000..a342b2cc3dc6
--- /dev/null
+++ b/Documentation/networking/batman-adv.rst
@@ -0,0 +1,220 @@
+==========
+batman-adv
+==========
+
+Batman advanced is a new approach to wireless networking which does no longer
+operate on the IP basis. Unlike the batman daemon, which exchanges information
+using UDP packets and sets routing tables, batman-advanced operates on ISO/OSI
+Layer 2 only and uses and routes (or better: bridges) Ethernet Frames. It
+emulates a virtual network switch of all nodes participating. Therefore all
+nodes appear to be link local, thus all higher operating protocols won't be
+affected by any changes within the network. You can run almost any protocol
+above batman advanced, prominent examples are: IPv4, IPv6, DHCP, IPX.
+
+Batman advanced was implemented as a Linux kernel driver to reduce the overhead
+to a minimum. It does not depend on any (other) network driver, and can be used
+on wifi as well as ethernet lan, vpn, etc ... (anything with ethernet-style
+layer 2).
+
+
+Configuration
+=============
+
+Load the batman-adv module into your kernel::
+
+ $ insmod batman-adv.ko
+
+The module is now waiting for activation. You must add some interfaces on which
+batman can operate. After loading the module batman advanced will scan your
+systems interfaces to search for compatible interfaces. Once found, it will
+create subfolders in the ``/sys`` directories of each supported interface,
+e.g.::
+
+ $ ls /sys/class/net/eth0/batman_adv/
+ elp_interval iface_status mesh_iface throughput_override
+
+If an interface does not have the ``batman_adv`` subfolder, it probably is not
+supported. Not supported interfaces are: loopback, non-ethernet and batman's
+own interfaces.
+
+Note: After the module was loaded it will continuously watch for new
+interfaces to verify the compatibility. There is no need to reload the module
+if you plug your USB wifi adapter into your machine after batman advanced was
+initially loaded.
+
+The batman-adv soft-interface can be created using the iproute2 tool ``ip``::
+
+ $ ip link add name bat0 type batadv
+
+To activate a given interface simply attach it to the ``bat0`` interface::
+
+ $ ip link set dev eth0 master bat0
+
+Repeat this step for all interfaces you wish to add. Now batman starts
+using/broadcasting on this/these interface(s).
+
+By reading the "iface_status" file you can check its status::
+
+ $ cat /sys/class/net/eth0/batman_adv/iface_status
+ active
+
+To deactivate an interface you have to detach it from the "bat0" interface::
+
+ $ ip link set dev eth0 nomaster
+
+
+All mesh wide settings can be found in batman's own interface folder::
+
+ $ ls /sys/class/net/bat0/mesh/
+ aggregated_ogms fragmentation isolation_mark routing_algo
+ ap_isolation gw_bandwidth log_level vlan0
+ bonding gw_mode multicast_mode
+ bridge_loop_avoidance gw_sel_class network_coding
+ distributed_arp_table hop_penalty orig_interval
+
+There is a special folder for debugging information::
+
+ $ ls /sys/kernel/debug/batman_adv/bat0/
+ bla_backbone_table log neighbors transtable_local
+ bla_claim_table mcast_flags originators
+ dat_cache nc socket
+ gateways nc_nodes transtable_global
+
+Some of the files contain all sort of status information regarding the mesh
+network. For example, you can view the table of originators (mesh
+participants) with::
+
+ $ cat /sys/kernel/debug/batman_adv/bat0/originators
+
+Other files allow to change batman's behaviour to better fit your requirements.
+For instance, you can check the current originator interval (value in
+milliseconds which determines how often batman sends its broadcast packets)::
+
+ $ cat /sys/class/net/bat0/mesh/orig_interval
+ 1000
+
+and also change its value::
+
+ $ echo 3000 > /sys/class/net/bat0/mesh/orig_interval
+
+In very mobile scenarios, you might want to adjust the originator interval to a
+lower value. This will make the mesh more responsive to topology changes, but
+will also increase the overhead.
+
+
+Usage
+=====
+
+To make use of your newly created mesh, batman advanced provides a new
+interface "bat0" which you should use from this point on. All interfaces added
+to batman advanced are not relevant any longer because batman handles them for
+you. Basically, one "hands over" the data by using the batman interface and
+batman will make sure it reaches its destination.
+
+The "bat0" interface can be used like any other regular interface. It needs an
+IP address which can be either statically configured or dynamically (by using
+DHCP or similar services)::
+
+ NodeA: ip link set up dev bat0
+ NodeA: ip addr add 192.168.0.1/24 dev bat0
+
+ NodeB: ip link set up dev bat0
+ NodeB: ip addr add 192.168.0.2/24 dev bat0
+ NodeB: ping 192.168.0.1
+
+Note: In order to avoid problems remove all IP addresses previously assigned to
+interfaces now used by batman advanced, e.g.::
+
+ $ ip addr flush dev eth0
+
+
+Logging/Debugging
+=================
+
+All error messages, warnings and information messages are sent to the kernel
+log. Depending on your operating system distribution this can be read in one of
+a number of ways. Try using the commands: ``dmesg``, ``logread``, or looking in
+the files ``/var/log/kern.log`` or ``/var/log/syslog``. All batman-adv messages
+are prefixed with "batman-adv:" So to see just these messages try::
+
+ $ dmesg | grep batman-adv
+
+When investigating problems with your mesh network, it is sometimes necessary to
+see more detail debug messages. This must be enabled when compiling the
+batman-adv module. When building batman-adv as part of kernel, use "make
+menuconfig" and enable the option ``B.A.T.M.A.N. debugging``
+(``CONFIG_BATMAN_ADV_DEBUG=y``).
+
+Those additional debug messages can be accessed using a special file in
+debugfs::
+
+ $ cat /sys/kernel/debug/batman_adv/bat0/log
+
+The additional debug output is by default disabled. It can be enabled during
+run time. Following log_levels are defined:
+
+.. flat-table::
+
+ * - 0
+ - All debug output disabled
+ * - 1
+ - Enable messages related to routing / flooding / broadcasting
+ * - 2
+ - Enable messages related to route added / changed / deleted
+ * - 4
+ - Enable messages related to translation table operations
+ * - 8
+ - Enable messages related to bridge loop avoidance
+ * - 16
+ - Enable messages related to DAT, ARP snooping and parsing
+ * - 32
+ - Enable messages related to network coding
+ * - 64
+ - Enable messages related to multicast
+ * - 128
+ - Enable messages related to throughput meter
+ * - 255
+ - Enable all messages
+
+The debug output can be changed at runtime using the file
+``/sys/class/net/bat0/mesh/log_level``. e.g.::
+
+ $ echo 6 > /sys/class/net/bat0/mesh/log_level
+
+will enable debug messages for when routes change.
+
+Counters for different types of packets entering and leaving the batman-adv
+module are available through ethtool::
+
+ $ ethtool --statistics bat0
+
+
+batctl
+======
+
+As batman advanced operates on layer 2, all hosts participating in the virtual
+switch are completely transparent for all protocols above layer 2. Therefore
+the common diagnosis tools do not work as expected. To overcome these problems,
+batctl was created. At the moment the batctl contains ping, traceroute, tcpdump
+and interfaces to the kernel module settings.
+
+For more information, please see the manpage (``man batctl``).
+
+batctl is available on https://www.open-mesh.org/
+
+
+Contact
+=======
+
+Please send us comments, experiences, questions, anything :)
+
+IRC:
+ #batman on irc.freenode.org
+Mailing-list:
+ b.a.t.m.a.n@open-mesh.org (optional subscription at
+ https://lists.open-mesh.org/mm/listinfo/b.a.t.m.a.n)
+
+You can also contact the Authors:
+
+* Marek Lindner <mareklindner@neomailbox.ch>
+* Simon Wunderlich <sw@simonwunderlich.de>
diff --git a/Documentation/networking/batman-adv.txt b/Documentation/networking/batman-adv.txt
deleted file mode 100644
index ccf94677b240..000000000000
--- a/Documentation/networking/batman-adv.txt
+++ /dev/null
@@ -1,215 +0,0 @@
-BATMAN-ADV
-----------
-
-Batman advanced is a new approach to wireless networking which
-does no longer operate on the IP basis. Unlike the batman daemon,
-which exchanges information using UDP packets and sets routing
-tables, batman-advanced operates on ISO/OSI Layer 2 only and uses
-and routes (or better: bridges) Ethernet Frames. It emulates a
-virtual network switch of all nodes participating. Therefore all
-nodes appear to be link local, thus all higher operating proto-
-cols won't be affected by any changes within the network. You can
-run almost any protocol above batman advanced, prominent examples
-are: IPv4, IPv6, DHCP, IPX.
-
-Batman advanced was implemented as a Linux kernel driver to re-
-duce the overhead to a minimum. It does not depend on any (other)
-network driver, and can be used on wifi as well as ethernet lan,
-vpn, etc ... (anything with ethernet-style layer 2).
-
-
-CONFIGURATION
--------------
-
-Load the batman-adv module into your kernel:
-
-# insmod batman-adv.ko
-
-The module is now waiting for activation. You must add some in-
-terfaces on which batman can operate. After loading the module
-batman advanced will scan your systems interfaces to search for
-compatible interfaces. Once found, it will create subfolders in
-the /sys directories of each supported interface, e.g.
-
-# ls /sys/class/net/eth0/batman_adv/
-# elp_interval iface_status mesh_iface throughput_override
-
-If an interface does not have the "batman_adv" subfolder it prob-
-ably is not supported. Not supported interfaces are: loopback,
-non-ethernet and batman's own interfaces.
-
-Note: After the module was loaded it will continuously watch for
-new interfaces to verify the compatibility. There is no need to
-reload the module if you plug your USB wifi adapter into your ma-
-chine after batman advanced was initially loaded.
-
-The batman-adv soft-interface can be created using the iproute2
-tool "ip"
-
-# ip link add name bat0 type batadv
-
-To activate a given interface simply attach it to the "bat0"
-interface
-
-# ip link set dev eth0 master bat0
-
-Repeat this step for all interfaces you wish to add. Now batman
-starts using/broadcasting on this/these interface(s).
-
-By reading the "iface_status" file you can check its status:
-
-# cat /sys/class/net/eth0/batman_adv/iface_status
-# active
-
-To deactivate an interface you have to detach it from the
-"bat0" interface:
-
-# ip link set dev eth0 nomaster
-
-
-All mesh wide settings can be found in batman's own interface
-folder:
-
-# ls /sys/class/net/bat0/mesh/
-# aggregated_ogms fragmentation isolation_mark routing_algo
-# ap_isolation gw_bandwidth log_level vlan0
-# bonding gw_mode multicast_mode
-# bridge_loop_avoidance gw_sel_class network_coding
-# distributed_arp_table hop_penalty orig_interval
-
-There is a special folder for debugging information:
-
-# ls /sys/kernel/debug/batman_adv/bat0/
-# bla_backbone_table log neighbors transtable_local
-# bla_claim_table mcast_flags originators
-# dat_cache nc socket
-# gateways nc_nodes transtable_global
-
-Some of the files contain all sort of status information regard-
-ing the mesh network. For example, you can view the table of
-originators (mesh participants) with:
-
-# cat /sys/kernel/debug/batman_adv/bat0/originators
-
-Other files allow to change batman's behaviour to better fit your
-requirements. For instance, you can check the current originator
-interval (value in milliseconds which determines how often batman
-sends its broadcast packets):
-
-# cat /sys/class/net/bat0/mesh/orig_interval
-# 1000
-
-and also change its value:
-
-# echo 3000 > /sys/class/net/bat0/mesh/orig_interval
-
-In very mobile scenarios, you might want to adjust the originator
-interval to a lower value. This will make the mesh more respon-
-sive to topology changes, but will also increase the overhead.
-
-
-USAGE
------
-
-To make use of your newly created mesh, batman advanced provides
-a new interface "bat0" which you should use from this point on.
-All interfaces added to batman advanced are not relevant any
-longer because batman handles them for you. Basically, one "hands
-over" the data by using the batman interface and batman will make
-sure it reaches its destination.
-
-The "bat0" interface can be used like any other regular inter-
-face. It needs an IP address which can be either statically con-
-figured or dynamically (by using DHCP or similar services):
-
-# NodeA: ip link set up dev bat0
-# NodeA: ip addr add 192.168.0.1/24 dev bat0
-
-# NodeB: ip link set up dev bat0
-# NodeB: ip addr add 192.168.0.2/24 dev bat0
-# NodeB: ping 192.168.0.1
-
-Note: In order to avoid problems remove all IP addresses previ-
-ously assigned to interfaces now used by batman advanced, e.g.
-
-# ip addr flush dev eth0
-
-
-LOGGING/DEBUGGING
------------------
-
-All error messages, warnings and information messages are sent to
-the kernel log. Depending on your operating system distribution
-this can be read in one of a number of ways. Try using the com-
-mands: dmesg, logread, or looking in the files /var/log/kern.log
-or /var/log/syslog. All batman-adv messages are prefixed with
-"batman-adv:" So to see just these messages try
-
-# dmesg | grep batman-adv
-
-When investigating problems with your mesh network it is some-
-times necessary to see more detail debug messages. This must be
-enabled when compiling the batman-adv module. When building bat-
-man-adv as part of kernel, use "make menuconfig" and enable the
-option "B.A.T.M.A.N. debugging".
-
-Those additional debug messages can be accessed using a special
-file in debugfs
-
-# cat /sys/kernel/debug/batman_adv/bat0/log
-
-The additional debug output is by default disabled. It can be en-
-abled during run time. Following log_levels are defined:
-
- 0 - All debug output disabled
- 1 - Enable messages related to routing / flooding / broadcasting
- 2 - Enable messages related to route added / changed / deleted
- 4 - Enable messages related to translation table operations
- 8 - Enable messages related to bridge loop avoidance
- 16 - Enable messages related to DAT, ARP snooping and parsing
- 32 - Enable messages related to network coding
- 64 - Enable messages related to multicast
-128 - Enable messages related to throughput meter
-255 - Enable all messages
-
-The debug output can be changed at runtime using the file
-/sys/class/net/bat0/mesh/log_level. e.g.
-
-# echo 6 > /sys/class/net/bat0/mesh/log_level
-
-will enable debug messages for when routes change.
-
-Counters for different types of packets entering and leaving the
-batman-adv module are available through ethtool:
-
-# ethtool --statistics bat0
-
-
-BATCTL
-------
-
-As batman advanced operates on layer 2 all hosts participating in
-the virtual switch are completely transparent for all protocols
-above layer 2. Therefore the common diagnosis tools do not work
-as expected. To overcome these problems batctl was created. At
-the moment the batctl contains ping, traceroute, tcpdump and
-interfaces to the kernel module settings.
-
-For more information, please see the manpage (man batctl).
-
-batctl is available on https://www.open-mesh.org/
-
-
-CONTACT
--------
-
-Please send us comments, experiences, questions, anything :)
-
-IRC: #batman on irc.freenode.org
-Mailing-list: b.a.t.m.a.n@open-mesh.org (optional subscription
- at https://lists.open-mesh.org/mm/listinfo/b.a.t.m.a.n)
-
-You can also contact the Authors:
-
-Marek Lindner <mareklindner@neomailbox.ch>
-Simon Wunderlich <sw@simonwunderlich.de>
diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt
index b69b205501de..d0fdba7d66e2 100644
--- a/Documentation/networking/filter.txt
+++ b/Documentation/networking/filter.txt
@@ -793,7 +793,7 @@ Some core changes of the new internal format:
bpf_exit
After the call the registers R1-R5 contain junk values and cannot be read.
- In the future an eBPF verifier can be used to validate internal BPF programs.
+ An in-kernel eBPF verifier is used to validate internal BPF programs.
Also in the new design, eBPF is limited to 4096 insns, which means that any
program will terminate quickly and will only call a fixed number of kernel
@@ -1017,7 +1017,7 @@ At the start of the program the register R1 contains a pointer to context
and has type PTR_TO_CTX.
If verifier sees an insn that does R2=R1, then R2 has now type
PTR_TO_CTX as well and can be used on the right hand side of expression.
-If R1=PTR_TO_CTX and insn is R2=R1+R1, then R2=UNKNOWN_VALUE,
+If R1=PTR_TO_CTX and insn is R2=R1+R1, then R2=SCALAR_VALUE,
since addition of two valid pointers makes invalid pointer.
(In 'secure' mode verifier will reject any type of pointer arithmetic to make
sure that kernel addresses don't leak to unprivileged users)
@@ -1039,7 +1039,7 @@ is a correct program. If there was R1 instead of R6, it would have
been rejected.
load/store instructions are allowed only with registers of valid types, which
-are PTR_TO_CTX, PTR_TO_MAP, FRAME_PTR. They are bounds and alignment checked.
+are PTR_TO_CTX, PTR_TO_MAP, PTR_TO_STACK. They are bounds and alignment checked.
For example:
bpf_mov R1 = 1
bpf_mov R2 = 2
@@ -1058,7 +1058,7 @@ intends to load a word from address R6 + 8 and store it into R0
If R6=PTR_TO_CTX, via is_valid_access() callback the verifier will know
that offset 8 of size 4 bytes can be accessed for reading, otherwise
the verifier will reject the program.
-If R6=FRAME_PTR, then access should be aligned and be within
+If R6=PTR_TO_STACK, then access should be aligned and be within
stack bounds, which are [-MAX_BPF_STACK, 0). In this example offset is 8,
so it will fail verification, since it's out of bounds.
@@ -1069,7 +1069,7 @@ For example:
bpf_ld R0 = *(u32 *)(R10 - 4)
bpf_exit
is invalid program.
-Though R10 is correct read-only register and has type FRAME_PTR
+Though R10 is correct read-only register and has type PTR_TO_STACK
and R10 - 4 is within stack bounds, there were no stores into that location.
Pointer register spill/fill is tracked as well, since four (R6-R9)
@@ -1094,6 +1094,71 @@ all use cases.
See details of eBPF verifier in kernel/bpf/verifier.c
+Register value tracking
+-----------------------
+In order to determine the safety of an eBPF program, the verifier must track
+the range of possible values in each register and also in each stack slot.
+This is done with 'struct bpf_reg_state', defined in include/linux/
+bpf_verifier.h, which unifies tracking of scalar and pointer values. Each
+register state has a type, which is either NOT_INIT (the register has not been
+written to), SCALAR_VALUE (some value which is not usable as a pointer), or a
+pointer type. The types of pointers describe their base, as follows:
+ PTR_TO_CTX Pointer to bpf_context.
+ CONST_PTR_TO_MAP Pointer to struct bpf_map. "Const" because arithmetic
+ on these pointers is forbidden.
+ PTR_TO_MAP_VALUE Pointer to the value stored in a map element.
+ PTR_TO_MAP_VALUE_OR_NULL
+ Either a pointer to a map value, or NULL; map accesses
+ (see section 'eBPF maps', below) return this type,
+ which becomes a PTR_TO_MAP_VALUE when checked != NULL.
+ Arithmetic on these pointers is forbidden.
+ PTR_TO_STACK Frame pointer.
+ PTR_TO_PACKET skb->data.
+ PTR_TO_PACKET_END skb->data + headlen; arithmetic forbidden.
+However, a pointer may be offset from this base (as a result of pointer
+arithmetic), and this is tracked in two parts: the 'fixed offset' and 'variable
+offset'. The former is used when an exactly-known value (e.g. an immediate
+operand) is added to a pointer, while the latter is used for values which are
+not exactly known. The variable offset is also used in SCALAR_VALUEs, to track
+the range of possible values in the register.
+The verifier's knowledge about the variable offset consists of:
+* minimum and maximum values as unsigned
+* minimum and maximum values as signed
+* knowledge of the values of individual bits, in the form of a 'tnum': a u64
+'mask' and a u64 'value'. 1s in the mask represent bits whose value is unknown;
+1s in the value represent bits known to be 1. Bits known to be 0 have 0 in both
+mask and value; no bit should ever be 1 in both. For example, if a byte is read
+into a register from memory, the register's top 56 bits are known zero, while
+the low 8 are unknown - which is represented as the tnum (0x0; 0xff). If we
+then OR this with 0x40, we get (0x40; 0xcf), then if we add 1 we get (0x0;
+0x1ff), because of potential carries.
+Besides arithmetic, the register state can also be updated by conditional
+branches. For instance, if a SCALAR_VALUE is compared > 8, in the 'true' branch
+it will have a umin_value (unsigned minimum value) of 9, whereas in the 'false'
+branch it will have a umax_value of 8. A signed compare (with BPF_JSGT or
+BPF_JSGE) would instead update the signed minimum/maximum values. Information
+from the signed and unsigned bounds can be combined; for instance if a value is
+first tested < 8 and then tested s> 4, the verifier will conclude that the value
+is also > 4 and s< 8, since the bounds prevent crossing the sign boundary.
+PTR_TO_PACKETs with a variable offset part have an 'id', which is common to all
+pointers sharing that same variable offset. This is important for packet range
+checks: after adding some variable to a packet pointer, if you then copy it to
+another register and (say) add a constant 4, both registers will share the same
+'id' but one will have a fixed offset of +4. Then if it is bounds-checked and
+found to be less than a PTR_TO_PACKET_END, the other register is now known to
+have a safe range of at least 4 bytes. See 'Direct packet access', below, for
+more on PTR_TO_PACKET ranges.
+The 'id' field is also used on PTR_TO_MAP_VALUE_OR_NULL, common to all copies of
+the pointer returned from a map lookup. This means that when one copy is
+checked and found to be non-NULL, all copies can become PTR_TO_MAP_VALUEs.
+As well as range-checking, the tracked information is also used for enforcing
+alignment of pointer accesses. For instance, on most systems the packet pointer
+is 2 bytes after a 4-byte alignment. If a program adds 14 bytes to that to jump
+over the Ethernet header, then reads IHL and addes (IHL * 4), the resulting
+pointer will have a variable offset known to be 4n+2 for some n, so adding the 2
+bytes (NET_IP_ALIGN) gives a 4-byte alignment and so word-sized accesses through
+that pointer are safe.
+
Direct packet access
--------------------
In cls_bpf and act_bpf programs the verifier allows direct access to the packet
@@ -1121,7 +1186,7 @@ it now points to 'skb->data + 14' and accessible range is [R5, R5 + 14 - 14)
which is zero bytes.
More complex packet access may look like:
- R0=imm1 R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp
+ R0=inv1 R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp
6: r0 = *(u8 *)(r3 +7) /* load 7th byte from the packet */
7: r4 = *(u8 *)(r3 +12)
8: r4 *= 14
@@ -1135,26 +1200,31 @@ More complex packet access may look like:
16: r2 += 8
17: r1 = *(u32 *)(r1 +80) /* load skb->data_end */
18: if r2 > r1 goto pc+2
- R0=inv56 R1=pkt_end R2=pkt(id=2,off=8,r=8) R3=pkt(id=2,off=0,r=8) R4=inv52 R5=pkt(id=0,off=14,r=14) R10=fp
+ R0=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) R1=pkt_end R2=pkt(id=2,off=8,r=8) R3=pkt(id=2,off=0,r=8) R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)) R5=pkt(id=0,off=14,r=14) R10=fp
19: r1 = *(u8 *)(r3 +4)
The state of the register R3 is R3=pkt(id=2,off=0,r=8)
id=2 means that two 'r3 += rX' instructions were seen, so r3 points to some
offset within a packet and since the program author did
'if (r3 + 8 > r1) goto err' at insn #18, the safe range is [R3, R3 + 8).
-The verifier only allows 'add' operation on packet registers. Any other
-operation will set the register state to 'unknown_value' and it won't be
+The verifier only allows 'add'/'sub' operations on packet registers. Any other
+operation will set the register state to 'SCALAR_VALUE' and it won't be
available for direct packet access.
Operation 'r3 += rX' may overflow and become less than original skb->data,
-therefore the verifier has to prevent that. So it tracks the number of
-upper zero bits in all 'uknown_value' registers, so when it sees
-'r3 += rX' instruction and rX is more than 16-bit value, it will error as:
-"cannot add integer value with N upper zero bits to ptr_to_packet"
+therefore the verifier has to prevent that. So when it sees 'r3 += rX'
+instruction and rX is more than 16-bit value, any subsequent bounds-check of r3
+against skb->data_end will not give us 'range' information, so attempts to read
+through the pointer will give "invalid access to packet" error.
Ex. after insn 'r4 = *(u8 *)(r3 +12)' (insn #7 above) the state of r4 is
-R4=inv56 which means that upper 56 bits on the register are guaranteed
-to be zero. After insn 'r4 *= 14' the state becomes R4=inv52, since
-multiplying 8-bit value by constant 14 will keep upper 52 bits as zero.
-Similarly 'r2 >>= 48' will make R2=inv48, since the shift is not sign
-extending. This logic is implemented in evaluate_reg_alu() function.
+R4=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) which means that upper 56 bits
+of the register are guaranteed to be zero, and nothing is known about the lower
+8 bits. After insn 'r4 *= 14' the state becomes
+R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)), since multiplying an 8-bit
+value by constant 14 will keep upper 52 bits as zero, also the least significant
+bit will be zero as 14 is even. Similarly 'r2 >>= 48' will make
+R2=inv(id=0,umax_value=65535,var_off=(0x0; 0xffff)), since the shift is not sign
+extending. This logic is implemented in adjust_reg_min_max_vals() function,
+which calls adjust_ptr_min_max_vals() for adding pointer to scalar (or vice
+versa) and adjust_scalar_min_max_vals() for operations on two scalars.
The end result is that bpf program author can access packet directly
using normal C code as:
@@ -1214,6 +1284,22 @@ The map is defined by:
. key size in bytes
. value size in bytes
+Pruning
+-------
+The verifier does not actually walk all possible paths through the program. For
+each new branch to analyse, the verifier looks at all the states it's previously
+been in when at this instruction. If any of them contain the current state as a
+subset, the branch is 'pruned' - that is, the fact that the previous state was
+accepted implies the current state would be as well. For instance, if in the
+previous state, r1 held a packet-pointer, and in the current state, r1 holds a
+packet-pointer with a range as long or longer and at least as strict an
+alignment, then r1 is safe. Similarly, if r2 was NOT_INIT before then it can't
+have been used by any path from that point, so any value in r2 (including
+another NOT_INIT) is safe. The implementation is in the function regsafe().
+Pruning considers not only the registers but also the stack (and any spilled
+registers it may hold). They must all be safe for the branch to be pruned.
+This is implemented in states_equal().
+
Understanding eBPF verifier messages
------------------------------------
diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
index b5bd87e01f52..66e620866245 100644
--- a/Documentation/networking/index.rst
+++ b/Documentation/networking/index.rst
@@ -6,6 +6,7 @@ Contents:
.. toctree::
:maxdepth: 2
+ batman-adv
kapi
z8530book
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 974ab47ae53a..84c9b8cee780 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -353,12 +353,7 @@ tcp_l3mdev_accept - BOOLEAN
compiled with CONFIG_NET_L3_MASTER_DEV.
tcp_low_latency - BOOLEAN
- If set, the TCP stack makes decisions that prefer lower
- latency as opposed to higher throughput. By default, this
- option is not set meaning that higher throughput is preferred.
- An example of an application where this default should be
- changed would be a Beowulf compute cluster.
- Default: 0
+ This is a legacy option, it has no effect anymore.
tcp_max_orphans - INTEGER
Maximal number of TCP sockets not attached to any user file handle,
@@ -1291,8 +1286,7 @@ tag - INTEGER
xfrm4_gc_thresh - INTEGER
The threshold at which we will start garbage collecting for IPv4
destination cache entries. At twice this value the system will
- refuse new allocations. The value must be set below the flowcache
- limit (4096 * number of online cpus) to take effect.
+ refuse new allocations.
igmp_link_local_mcast_reports - BOOLEAN
Enable IGMP reports for link local multicast groups in the
@@ -1778,8 +1772,7 @@ ratelimit - INTEGER
xfrm6_gc_thresh - INTEGER
The threshold at which we will start garbage collecting for IPv6
destination cache entries. At twice this value the system will
- refuse new allocations. The value must be set below the flowcache
- limit (4096 * number of online cpus) to take effect.
+ refuse new allocations.
IPv6 Update by:
diff --git a/Documentation/networking/netvsc.txt b/Documentation/networking/netvsc.txt
new file mode 100644
index 000000000000..4ddb4e4b0426
--- /dev/null
+++ b/Documentation/networking/netvsc.txt
@@ -0,0 +1,63 @@
+Hyper-V network driver
+======================
+
+Compatibility
+=============
+
+This driver is compatible with Windows Server 2012 R2, 2016 and
+Windows 10.
+
+Features
+========
+
+ Checksum offload
+ ----------------
+ The netvsc driver supports checksum offload as long as the
+ Hyper-V host version does. Windows Server 2016 and Azure
+ support checksum offload for TCP and UDP for both IPv4 and
+ IPv6. Windows Server 2012 only supports checksum offload for TCP.
+
+ Receive Side Scaling
+ --------------------
+ Hyper-V supports receive side scaling. For TCP, packets are
+ distributed among available queues based on IP address and port
+ number. Current versions of Hyper-V host, only distribute UDP
+ packets based on the IP source and destination address.
+ The port number is not used as part of the hash value for UDP.
+ Fragmented IP packets are not distributed between queues;
+ all fragmented packets arrive on the first channel.
+
+ Generic Receive Offload, aka GRO
+ --------------------------------
+ The driver supports GRO and it is enabled by default. GRO coalesces
+ like packets and significantly reduces CPU usage under heavy Rx
+ load.
+
+ SR-IOV support
+ --------------
+ Hyper-V supports SR-IOV as a hardware acceleration option. If SR-IOV
+ is enabled in both the vSwitch and the guest configuration, then the
+ Virtual Function (VF) device is passed to the guest as a PCI
+ device. In this case, both a synthetic (netvsc) and VF device are
+ visible in the guest OS and both NIC's have the same MAC address.
+
+ The VF is enslaved by netvsc device. The netvsc driver will transparently
+ switch the data path to the VF when it is available and up.
+ Network state (addresses, firewall, etc) should be applied only to the
+ netvsc device; the slave device should not be accessed directly in
+ most cases. The exceptions are if some special queue discipline or
+ flow direction is desired, these should be applied directly to the
+ VF slave device.
+
+ Receive Buffer
+ --------------
+ Packets are received into a receive area which is created when device
+ is probed. The receive area is broken into MTU sized chunks and each may
+ contain one or more packets. The number of receive sections may be changed
+ via ethtool Rx ring parameters.
+
+ There is a similar send buffer which is used to aggregate packets for sending.
+ The send area is broken into chunks of 6144 bytes, each of section may
+ contain one or more packets. The send buffer is an optimization, the driver
+ will use slower method to handle very large packets or if the send buffer
+ area is exhausted.
diff --git a/Documentation/networking/strparser.txt b/Documentation/networking/strparser.txt
index a0bf573dfa61..fe01302471ae 100644
--- a/Documentation/networking/strparser.txt
+++ b/Documentation/networking/strparser.txt
@@ -1,45 +1,107 @@
-Stream Parser
--------------
+Stream Parser (strparser)
+
+Introduction
+============
The stream parser (strparser) is a utility that parses messages of an
-application layer protocol running over a TCP connection. The stream
+application layer protocol running over a data stream. The stream
parser works in conjunction with an upper layer in the kernel to provide
kernel support for application layer messages. For instance, Kernel
Connection Multiplexor (KCM) uses the Stream Parser to parse messages
using a BPF program.
+The strparser works in one of two modes: receive callback or general
+mode.
+
+In receive callback mode, the strparser is called from the data_ready
+callback of a TCP socket. Messages are parsed and delivered as they are
+received on the socket.
+
+In general mode, a sequence of skbs are fed to strparser from an
+outside source. Message are parsed and delivered as the sequence is
+processed. This modes allows strparser to be applied to arbitrary
+streams of data.
+
Interface
----------
+=========
The API includes a context structure, a set of callbacks, utility
-functions, and a data_ready function. The callbacks include
-a parse_msg function that is called to perform parsing (e.g.
-BPF parsing in case of KCM), and a rcv_msg function that is called
-when a full message has been completed.
+functions, and a data_ready function for receive callback mode. The
+callbacks include a parse_msg function that is called to perform
+parsing (e.g. BPF parsing in case of KCM), and a rcv_msg function
+that is called when a full message has been completed.
-A stream parser can be instantiated for a TCP connection. This is done
-by:
+Functions
+=========
-strp_init(struct strparser *strp, struct sock *csk,
+strp_init(struct strparser *strp, struct sock *sk,
struct strp_callbacks *cb)
-strp is a struct of type strparser that is allocated by the upper layer.
-csk is the TCP socket associated with the stream parser. Callbacks are
-called by the stream parser.
+ Called to initialize a stream parser. strp is a struct of type
+ strparser that is allocated by the upper layer. sk is the TCP
+ socket associated with the stream parser for use with receive
+ callback mode; in general mode this is set to NULL. Callbacks
+ are called by the stream parser (the callbacks are listed below).
+
+void strp_pause(struct strparser *strp)
+
+ Temporarily pause a stream parser. Message parsing is suspended
+ and no new messages are delivered to the upper layer.
+
+void strp_pause(struct strparser *strp)
+
+ Unpause a paused stream parser.
+
+void strp_stop(struct strparser *strp);
+
+ strp_stop is called to completely stop stream parser operations.
+ This is called internally when the stream parser encounters an
+ error, and it is called from the upper layer to stop parsing
+ operations.
+
+void strp_done(struct strparser *strp);
+
+ strp_done is called to release any resources held by the stream
+ parser instance. This must be called after the stream processor
+ has been stopped.
+
+int strp_process(struct strparser *strp, struct sk_buff *orig_skb,
+ unsigned int orig_offset, size_t orig_len,
+ size_t max_msg_size, long timeo)
+
+ strp_process is called in general mode for a stream parser to
+ parse an sk_buff. The number of bytes processed or a negative
+ error number is returned. Note that strp_process does not
+ consume the sk_buff. max_msg_size is maximum size the stream
+ parser will parse. timeo is timeout for completing a message.
+
+void strp_data_ready(struct strparser *strp);
+
+ The upper layer calls strp_tcp_data_ready when data is ready on
+ the lower socket for strparser to process. This should be called
+ from a data_ready callback that is set on the socket. Note that
+ maximum messages size is the limit of the receive socket
+ buffer and message timeout is the receive timeout for the socket.
+
+void strp_check_rcv(struct strparser *strp);
+
+ strp_check_rcv is called to check for new messages on the socket.
+ This is normally called at initialization of a stream parser
+ instance or after strp_unpause.
Callbacks
----------
+=========
-There are four callbacks:
+There are six callbacks:
int (*parse_msg)(struct strparser *strp, struct sk_buff *skb);
parse_msg is called to determine the length of the next message
in the stream. The upper layer must implement this function. It
should parse the sk_buff as containing the headers for the
- next application layer messages in the stream.
+ next application layer message in the stream.
- The skb->cb in the input skb is a struct strp_rx_msg. Only
+ The skb->cb in the input skb is a struct strp_msg. Only
the offset field is relevant in parse_msg and gives the offset
where the message starts in the skb.
@@ -50,26 +112,41 @@ int (*parse_msg)(struct strparser *strp, struct sk_buff *skb);
-ESTRPIPE : current message should not be processed by the
kernel, return control of the socket to userspace which
can proceed to read the messages itself
- other < 0 : Error is parsing, give control back to userspace
+ other < 0 : Error in parsing, give control back to userspace
assuming that synchronization is lost and the stream
is unrecoverable (application expected to close TCP socket)
In the case that an error is returned (return value is less than
- zero) the stream parser will set the error on TCP socket and wake
- it up. If parse_msg returned -ESTRPIPE and the stream parser had
- previously read some bytes for the current message, then the error
- set on the attached socket is ENODATA since the stream is
- unrecoverable in that case.
+ zero) and the parser is in receive callback mode, then it will set
+ the error on TCP socket and wake it up. If parse_msg returned
+ -ESTRPIPE and the stream parser had previously read some bytes for
+ the current message, then the error set on the attached socket is
+ ENODATA since the stream is unrecoverable in that case.
+
+void (*lock)(struct strparser *strp)
+
+ The lock callback is called to lock the strp structure when
+ the strparser is performing an asynchronous operation (such as
+ processing a timeout). In receive callback mode the default
+ function is to lock_sock for the associated socket. In general
+ mode the callback must be set appropriately.
+
+void (*unlock)(struct strparser *strp)
+
+ The unlock callback is called to release the lock obtained
+ by the lock callback. In receive callback mode the default
+ function is release_sock for the associated socket. In general
+ mode the callback must be set appropriately.
void (*rcv_msg)(struct strparser *strp, struct sk_buff *skb);
rcv_msg is called when a full message has been received and
is queued. The callee must consume the sk_buff; it can
call strp_pause to prevent any further messages from being
- received in rcv_msg (see strp_pause below). This callback
+ received in rcv_msg (see strp_pause above). This callback
must be set.
- The skb->cb in the input skb is a struct strp_rx_msg. This
+ The skb->cb in the input skb is a struct strp_msg. This
struct contains two fields: offset and full_len. Offset is
where the message starts in the skb, and full_len is the
the length of the message. skb->len - offset may be greater
@@ -78,59 +155,53 @@ void (*rcv_msg)(struct strparser *strp, struct sk_buff *skb);
int (*read_sock_done)(struct strparser *strp, int err);
read_sock_done is called when the stream parser is done reading
- the TCP socket. The stream parser may read multiple messages
- in a loop and this function allows cleanup to occur when existing
- the loop. If the callback is not set (NULL in strp_init) a
- default function is used.
+ the TCP socket in receive callback mode. The stream parser may
+ read multiple messages in a loop and this function allows cleanup
+ to occur when exiting the loop. If the callback is not set (NULL
+ in strp_init) a default function is used.
void (*abort_parser)(struct strparser *strp, int err);
This function is called when stream parser encounters an error
- in parsing. The default function stops the stream parser for the
- TCP socket and sets the error in the socket. The default function
- can be changed by setting the callback to non-NULL in strp_init.
+ in parsing. The default function stops the stream parser and
+ sets the error in the socket if the parser is in receive callback
+ mode. The default function can be changed by setting the callback
+ to non-NULL in strp_init.
-Functions
----------
+Statistics
+==========
-The upper layer calls strp_tcp_data_ready when data is ready on the lower
-socket for strparser to process. This should be called from a data_ready
-callback that is set on the socket.
+Various counters are kept for each stream parser instance. These are in
+the strp_stats structure. strp_aggr_stats is a convenience structure for
+accumulating statistics for multiple stream parser instances.
+save_strp_stats and aggregate_strp_stats are helper functions to save
+and aggregate statistics.
-strp_stop is called to completely stop stream parser operations. This
-is called internally when the stream parser encounters an error, and
-it is called from the upper layer when unattaching a TCP socket.
+Message assembly limits
+=======================
-strp_done is called to unattach the stream parser from the TCP socket.
-This must be called after the stream processor has be stopped.
+The stream parser provide mechanisms to limit the resources consumed by
+message assembly.
-strp_check_rcv is called to check for new messages on the socket. This
-is normally called at initialization of the a stream parser instance
-of after strp_unpause.
+A timer is set when assembly starts for a new message. In receive
+callback mode the message timeout is taken from rcvtime for the
+associated TCP socket. In general mode, the timeout is passed as an
+argument in strp_process. If the timer fires before assembly completes
+the stream parser is aborted and the ETIMEDOUT error is set on the TCP
+socket if in receive callback mode.
-Statistics
-----------
+In receive callback mode, message length is limited to the receive
+buffer size of the associated TCP socket. If the length returned by
+parse_msg is greater than the socket buffer size then the stream parser
+is aborted with EMSGSIZE error set on the TCP socket. Note that this
+makes the maximum size of receive skbuffs for a socket with a stream
+parser to be 2*sk_rcvbuf of the TCP socket.
-Various counters are kept for each stream parser for a TCP socket.
-These are in the strp_stats structure. strp_aggr_stats is a convenience
-structure for accumulating statistics for multiple stream parser
-instances. save_strp_stats and aggregate_strp_stats are helper functions
-to save and aggregate statistics.
+In general mode the message length limit is passed in as an argument
+to strp_process.
-Message assembly limits
------------------------
+Author
+======
-The stream parser provide mechanisms to limit the resources consumed by
-message assembly.
+Tom Herbert (tom@quantonium.net)
-A timer is set when assembly starts for a new message. The message
-timeout is taken from rcvtime for the associated TCP socket. If the
-timer fires before assembly completes the stream parser is aborted
-and the ETIMEDOUT error is set on the TCP socket.
-
-Message length is limited to the receive buffer size of the associated
-TCP socket. If the length returned by parse_msg is greater than
-the socket buffer size then the stream parser is aborted with
-EMSGSIZE error set on the TCP socket. Note that this makes the
-maximum size of receive skbuffs for a socket with a stream parser
-to be 2*sk_rcvbuf of the TCP socket.