aboutsummaryrefslogtreecommitdiffstats
path: root/Documentation/networking
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/networking')
-rw-r--r--Documentation/networking/af_xdp.rst36
-rw-r--r--Documentation/networking/checksum-offloads.rst143
-rw-r--r--Documentation/networking/checksum-offloads.txt122
-rw-r--r--Documentation/networking/device_drivers/freescale/dpaa2/dpio-driver.rst14
-rw-r--r--Documentation/networking/device_drivers/intel/e100.rst1
-rw-r--r--Documentation/networking/device_drivers/intel/e1000.rst1
-rw-r--r--Documentation/networking/device_drivers/intel/e1000e.rst1
-rw-r--r--Documentation/networking/device_drivers/intel/fm10k.rst1
-rw-r--r--Documentation/networking/device_drivers/intel/i40e.rst1
-rw-r--r--Documentation/networking/device_drivers/intel/iavf.rst1
-rw-r--r--Documentation/networking/device_drivers/intel/ice.rst1
-rw-r--r--Documentation/networking/device_drivers/intel/igb.rst1
-rw-r--r--Documentation/networking/device_drivers/intel/igbvf.rst1
-rw-r--r--Documentation/networking/device_drivers/intel/ixgb.rst1
-rw-r--r--Documentation/networking/device_drivers/intel/ixgbe.rst1
-rw-r--r--Documentation/networking/device_drivers/intel/ixgbevf.rst1
-rw-r--r--Documentation/networking/device_drivers/stmicro/stmmac.txt2
-rw-r--r--Documentation/networking/devlink-health.txt86
-rw-r--r--Documentation/networking/devlink-info-versions.rst43
-rw-r--r--Documentation/networking/devlink-params-mlxsw.txt10
-rw-r--r--Documentation/networking/dsa/dsa.txt23
-rw-r--r--Documentation/networking/filter.txt33
-rw-r--r--Documentation/networking/ieee802154.rst (renamed from Documentation/networking/ieee802154.txt)193
-rw-r--r--Documentation/networking/index.rst7
-rw-r--r--Documentation/networking/msg_zerocopy.rst2
-rw-r--r--Documentation/networking/operstates.txt14
-rw-r--r--Documentation/networking/phy.rst447
-rw-r--r--Documentation/networking/phy.txt427
-rw-r--r--Documentation/networking/scaling.rst (renamed from Documentation/networking/scaling.txt)131
-rw-r--r--Documentation/networking/segmentation-offloads.rst (renamed from Documentation/networking/segmentation-offloads.txt)48
-rw-r--r--Documentation/networking/sfp-phylink.rst268
-rw-r--r--Documentation/networking/snmp_counter.rst295
-rw-r--r--Documentation/networking/switchdev.txt37
-rw-r--r--Documentation/networking/timestamping.txt43
34 files changed, 1616 insertions, 820 deletions
diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
index 4ae4f9d8f8fe..e14d7d40fc75 100644
--- a/Documentation/networking/af_xdp.rst
+++ b/Documentation/networking/af_xdp.rst
@@ -295,6 +295,41 @@ using::
For XDP_SKB mode, use the switch "-S" instead of "-N" and all options
can be displayed with "-h", as usual.
+FAQ
+=======
+
+Q: I am not seeing any traffic on the socket. What am I doing wrong?
+
+A: When a netdev of a physical NIC is initialized, Linux usually
+ allocates one Rx and Tx queue pair per core. So on a 8 core system,
+ queue ids 0 to 7 will be allocated, one per core. In the AF_XDP
+ bind call or the xsk_socket__create libbpf function call, you
+ specify a specific queue id to bind to and it is only the traffic
+ towards that queue you are going to get on you socket. So in the
+ example above, if you bind to queue 0, you are NOT going to get any
+ traffic that is distributed to queues 1 through 7. If you are
+ lucky, you will see the traffic, but usually it will end up on one
+ of the queues you have not bound to.
+
+ There are a number of ways to solve the problem of getting the
+ traffic you want to the queue id you bound to. If you want to see
+ all the traffic, you can force the netdev to only have 1 queue, queue
+ id 0, and then bind to queue 0. You can use ethtool to do this::
+
+ sudo ethtool -L <interface> combined 1
+
+ If you want to only see part of the traffic, you can program the
+ NIC through ethtool to filter out your traffic to a single queue id
+ that you can bind your XDP socket to. Here is one example in which
+ UDP traffic to and from port 4242 are sent to queue 2::
+
+ sudo ethtool -N <interface> rx-flow-hash udp4 fn
+ sudo ethtool -N <interface> flow-type udp4 src-port 4242 dst-port \
+ 4242 action 2
+
+ A number of other ways are possible all up to the capabilitites of
+ the NIC you have.
+
Credits
=======
@@ -309,4 +344,3 @@ Credits
- Michael S. Tsirkin
- Qi Z Zhang
- Willem de Bruijn
-
diff --git a/Documentation/networking/checksum-offloads.rst b/Documentation/networking/checksum-offloads.rst
new file mode 100644
index 000000000000..905c8a84b103
--- /dev/null
+++ b/Documentation/networking/checksum-offloads.rst
@@ -0,0 +1,143 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================
+Checksum Offloads
+=================
+
+
+Introduction
+============
+
+This document describes a set of techniques in the Linux networking stack to
+take advantage of checksum offload capabilities of various NICs.
+
+The following technologies are described:
+
+* TX Checksum Offload
+* LCO: Local Checksum Offload
+* RCO: Remote Checksum Offload
+
+Things that should be documented here but aren't yet:
+
+* RX Checksum Offload
+* CHECKSUM_UNNECESSARY conversion
+
+
+TX Checksum Offload
+===================
+
+The interface for offloading a transmit checksum to a device is explained in
+detail in comments near the top of include/linux/skbuff.h.
+
+In brief, it allows to request the device fill in a single ones-complement
+checksum defined by the sk_buff fields skb->csum_start and skb->csum_offset.
+The device should compute the 16-bit ones-complement checksum (i.e. the
+'IP-style' checksum) from csum_start to the end of the packet, and fill in the
+result at (csum_start + csum_offset).
+
+Because csum_offset cannot be negative, this ensures that the previous value of
+the checksum field is included in the checksum computation, thus it can be used
+to supply any needed corrections to the checksum (such as the sum of the
+pseudo-header for UDP or TCP).
+
+This interface only allows a single checksum to be offloaded. Where
+encapsulation is used, the packet may have multiple checksum fields in
+different header layers, and the rest will have to be handled by another
+mechanism such as LCO or RCO.
+
+CRC32c can also be offloaded using this interface, by means of filling
+skb->csum_start and skb->csum_offset as described above, and setting
+skb->csum_not_inet: see skbuff.h comment (section 'D') for more details.
+
+No offloading of the IP header checksum is performed; it is always done in
+software. This is OK because when we build the IP header, we obviously have it
+in cache, so summing it isn't expensive. It's also rather short.
+
+The requirements for GSO are more complicated, because when segmenting an
+encapsulated packet both the inner and outer checksums may need to be edited or
+recomputed for each resulting segment. See the skbuff.h comment (section 'E')
+for more details.
+
+A driver declares its offload capabilities in netdev->hw_features; see
+Documentation/networking/netdev-features.txt for more. Note that a device
+which only advertises NETIF_F_IP[V6]_CSUM must still obey the csum_start and
+csum_offset given in the SKB; if it tries to deduce these itself in hardware
+(as some NICs do) the driver should check that the values in the SKB match
+those which the hardware will deduce, and if not, fall back to checksumming in
+software instead (with skb_csum_hwoffload_help() or one of the
+skb_checksum_help() / skb_crc32c_csum_help functions, as mentioned in
+include/linux/skbuff.h).
+
+The stack should, for the most part, assume that checksum offload is supported
+by the underlying device. The only place that should check is
+validate_xmit_skb(), and the functions it calls directly or indirectly. That
+function compares the offload features requested by the SKB (which may include
+other offloads besides TX Checksum Offload) and, if they are not supported or
+enabled on the device (determined by netdev->features), performs the
+corresponding offload in software. In the case of TX Checksum Offload, that
+means calling skb_csum_hwoffload_help(skb, features).
+
+
+LCO: Local Checksum Offload
+===========================
+
+LCO is a technique for efficiently computing the outer checksum of an
+encapsulated datagram when the inner checksum is due to be offloaded.
+
+The ones-complement sum of a correctly checksummed TCP or UDP packet is equal
+to the complement of the sum of the pseudo header, because everything else gets
+'cancelled out' by the checksum field. This is because the sum was
+complemented before being written to the checksum field.
+
+More generally, this holds in any case where the 'IP-style' ones complement
+checksum is used, and thus any checksum that TX Checksum Offload supports.
+
+That is, if we have set up TX Checksum Offload with a start/offset pair, we
+know that after the device has filled in that checksum, the ones complement sum
+from csum_start to the end of the packet will be equal to the complement of
+whatever value we put in the checksum field beforehand. This allows us to
+compute the outer checksum without looking at the payload: we simply stop
+summing when we get to csum_start, then add the complement of the 16-bit word
+at (csum_start + csum_offset).
+
+Then, when the true inner checksum is filled in (either by hardware or by
+skb_checksum_help()), the outer checksum will become correct by virtue of the
+arithmetic.
+
+LCO is performed by the stack when constructing an outer UDP header for an
+encapsulation such as VXLAN or GENEVE, in udp_set_csum(). Similarly for the
+IPv6 equivalents, in udp6_set_csum().
+
+It is also performed when constructing an IPv4 GRE header, in
+net/ipv4/ip_gre.c:build_header(). It is *not* currently performed when
+constructing an IPv6 GRE header; the GRE checksum is computed over the whole
+packet in net/ipv6/ip6_gre.c:ip6gre_xmit2(), but it should be possible to use
+LCO here as IPv6 GRE still uses an IP-style checksum.
+
+All of the LCO implementations use a helper function lco_csum(), in
+include/linux/skbuff.h.
+
+LCO can safely be used for nested encapsulations; in this case, the outer
+encapsulation layer will sum over both its own header and the 'middle' header.
+This does mean that the 'middle' header will get summed multiple times, but
+there doesn't seem to be a way to avoid that without incurring bigger costs
+(e.g. in SKB bloat).
+
+
+RCO: Remote Checksum Offload
+============================
+
+RCO is a technique for eliding the inner checksum of an encapsulated datagram,
+allowing the outer checksum to be offloaded. It does, however, involve a
+change to the encapsulation protocols, which the receiver must also support.
+For this reason, it is disabled by default.
+
+RCO is detailed in the following Internet-Drafts:
+
+* https://tools.ietf.org/html/draft-herbert-remotecsumoffload-00
+* https://tools.ietf.org/html/draft-herbert-vxlan-rco-00
+
+In Linux, RCO is implemented individually in each encapsulation protocol, and
+most tunnel types have flags controlling its use. For instance, VXLAN has the
+flag VXLAN_F_REMCSUM_TX (per struct vxlan_rdst) to indicate that RCO should be
+used when transmitting to a given remote destination.
diff --git a/Documentation/networking/checksum-offloads.txt b/Documentation/networking/checksum-offloads.txt
deleted file mode 100644
index 27bc09cfcf6d..000000000000
--- a/Documentation/networking/checksum-offloads.txt
+++ /dev/null
@@ -1,122 +0,0 @@
-Checksum Offloads in the Linux Networking Stack
-
-
-Introduction
-============
-
-This document describes a set of techniques in the Linux networking stack
- to take advantage of checksum offload capabilities of various NICs.
-
-The following technologies are described:
- * TX Checksum Offload
- * LCO: Local Checksum Offload
- * RCO: Remote Checksum Offload
-
-Things that should be documented here but aren't yet:
- * RX Checksum Offload
- * CHECKSUM_UNNECESSARY conversion
-
-
-TX Checksum Offload
-===================
-
-The interface for offloading a transmit checksum to a device is explained
- in detail in comments near the top of include/linux/skbuff.h.
-In brief, it allows to request the device fill in a single ones-complement
- checksum defined by the sk_buff fields skb->csum_start and
- skb->csum_offset. The device should compute the 16-bit ones-complement
- checksum (i.e. the 'IP-style' checksum) from csum_start to the end of the
- packet, and fill in the result at (csum_start + csum_offset).
-Because csum_offset cannot be negative, this ensures that the previous
- value of the checksum field is included in the checksum computation, thus
- it can be used to supply any needed corrections to the checksum (such as
- the sum of the pseudo-header for UDP or TCP).
-This interface only allows a single checksum to be offloaded. Where
- encapsulation is used, the packet may have multiple checksum fields in
- different header layers, and the rest will have to be handled by another
- mechanism such as LCO or RCO.
-CRC32c can also be offloaded using this interface, by means of filling
- skb->csum_start and skb->csum_offset as described above, and setting
- skb->csum_not_inet: see skbuff.h comment (section 'D') for more details.
-No offloading of the IP header checksum is performed; it is always done in
- software. This is OK because when we build the IP header, we obviously
- have it in cache, so summing it isn't expensive. It's also rather short.
-The requirements for GSO are more complicated, because when segmenting an
- encapsulated packet both the inner and outer checksums may need to be
- edited or recomputed for each resulting segment. See the skbuff.h comment
- (section 'E') for more details.
-
-A driver declares its offload capabilities in netdev->hw_features; see
- Documentation/networking/netdev-features.txt for more. Note that a device
- which only advertises NETIF_F_IP[V6]_CSUM must still obey the csum_start
- and csum_offset given in the SKB; if it tries to deduce these itself in
- hardware (as some NICs do) the driver should check that the values in the
- SKB match those which the hardware will deduce, and if not, fall back to
- checksumming in software instead (with skb_csum_hwoffload_help() or one of
- the skb_checksum_help() / skb_crc32c_csum_help functions, as mentioned in
- include/linux/skbuff.h).
-
-The stack should, for the most part, assume that checksum offload is
- supported by the underlying device. The only place that should check is
- validate_xmit_skb(), and the functions it calls directly or indirectly.
- That function compares the offload features requested by the SKB (which
- may include other offloads besides TX Checksum Offload) and, if they are
- not supported or enabled on the device (determined by netdev->features),
- performs the corresponding offload in software. In the case of TX
- Checksum Offload, that means calling skb_csum_hwoffload_help(skb, features).
-
-
-LCO: Local Checksum Offload
-===========================
-
-LCO is a technique for efficiently computing the outer checksum of an
- encapsulated datagram when the inner checksum is due to be offloaded.
-The ones-complement sum of a correctly checksummed TCP or UDP packet is
- equal to the complement of the sum of the pseudo header, because everything
- else gets 'cancelled out' by the checksum field. This is because the sum was
- complemented before being written to the checksum field.
-More generally, this holds in any case where the 'IP-style' ones complement
- checksum is used, and thus any checksum that TX Checksum Offload supports.
-That is, if we have set up TX Checksum Offload with a start/offset pair, we
- know that after the device has filled in that checksum, the ones
- complement sum from csum_start to the end of the packet will be equal to
- the complement of whatever value we put in the checksum field beforehand.
- This allows us to compute the outer checksum without looking at the payload:
- we simply stop summing when we get to csum_start, then add the complement of
- the 16-bit word at (csum_start + csum_offset).
-Then, when the true inner checksum is filled in (either by hardware or by
- skb_checksum_help()), the outer checksum will become correct by virtue of
- the arithmetic.
-
-LCO is performed by the stack when constructing an outer UDP header for an
- encapsulation such as VXLAN or GENEVE, in udp_set_csum(). Similarly for
- the IPv6 equivalents, in udp6_set_csum().
-It is also performed when constructing an IPv4 GRE header, in
- net/ipv4/ip_gre.c:build_header(). It is *not* currently performed when
- constructing an IPv6 GRE header; the GRE checksum is computed over the
- whole packet in net/ipv6/ip6_gre.c:ip6gre_xmit2(), but it should be
- possible to use LCO here as IPv6 GRE still uses an IP-style checksum.
-All of the LCO implementations use a helper function lco_csum(), in
- include/linux/skbuff.h.
-
-LCO can safely be used for nested encapsulations; in this case, the outer
- encapsulation layer will sum over both its own header and the 'middle'
- header. This does mean that the 'middle' header will get summed multiple
- times, but there doesn't seem to be a way to avoid that without incurring
- bigger costs (e.g. in SKB bloat).
-
-
-RCO: Remote Checksum Offload
-============================
-
-RCO is a technique for eliding the inner checksum of an encapsulated
- datagram, allowing the outer checksum to be offloaded. It does, however,
- involve a change to the encapsulation protocols, which the receiver must
- also support. For this reason, it is disabled by default.
-RCO is detailed in the following Internet-Drafts:
-https://tools.ietf.org/html/draft-herbert-remotecsumoffload-00
-https://tools.ietf.org/html/draft-herbert-vxlan-rco-00
-In Linux, RCO is implemented individually in each encapsulation protocol,
- and most tunnel types have flags controlling its use. For instance, VXLAN
- has the flag VXLAN_F_REMCSUM_TX (per struct vxlan_rdst) to indicate that
- RCO should be used when transmitting to a given remote destination.
diff --git a/Documentation/networking/device_drivers/freescale/dpaa2/dpio-driver.rst b/Documentation/networking/device_drivers/freescale/dpaa2/dpio-driver.rst
index a188466b6698..5045df990a4c 100644
--- a/Documentation/networking/device_drivers/freescale/dpaa2/dpio-driver.rst
+++ b/Documentation/networking/device_drivers/freescale/dpaa2/dpio-driver.rst
@@ -27,11 +27,12 @@ Driver Overview
The DPIO driver is bound to DPIO objects discovered on the fsl-mc bus and
provides services that:
- A) allow other drivers, such as the Ethernet driver, to enqueue and dequeue
+
+ A. allow other drivers, such as the Ethernet driver, to enqueue and dequeue
frames for their respective objects
- B) allow drivers to register callbacks for data availability notifications
+ B. allow drivers to register callbacks for data availability notifications
when data becomes available on a queue or channel
- C) allow drivers to manage hardware buffer pools
+ C. allow drivers to manage hardware buffer pools
The Linux DPIO driver consists of 3 primary components--
DPIO object driver-- fsl-mc driver that manages the DPIO object
@@ -140,11 +141,10 @@ QBman portal interface (qbman-portal.c)
The qbman-portal component provides APIs to do the low level hardware
bit twiddling for operations such as:
- -initializing Qman software portals
-
- -building and sending portal commands
- -portal interrupt configuration and processing
+ - initializing Qman software portals
+ - building and sending portal commands
+ - portal interrupt configuration and processing
The qbman-portal APIs are not public to other drivers, and are
only used by dpio-service.
diff --git a/Documentation/networking/device_drivers/intel/e100.rst b/Documentation/networking/device_drivers/intel/e100.rst
index 5e2839b4ec92..2b9f4887beda 100644
--- a/Documentation/networking/device_drivers/intel/e100.rst
+++ b/Documentation/networking/device_drivers/intel/e100.rst
@@ -1,5 +1,6 @@
.. SPDX-License-Identifier: GPL-2.0+
+==============================================================
Linux* Base Driver for the Intel(R) PRO/100 Family of Adapters
==============================================================
diff --git a/Documentation/networking/device_drivers/intel/e1000.rst b/Documentation/networking/device_drivers/intel/e1000.rst
index 6379d4d20771..956560b6e745 100644
--- a/Documentation/networking/device_drivers/intel/e1000.rst
+++ b/Documentation/networking/device_drivers/intel/e1000.rst
@@ -1,5 +1,6 @@
.. SPDX-License-Identifier: GPL-2.0+
+===========================================================
Linux* Base Driver for Intel(R) Ethernet Network Connection
===========================================================
diff --git a/Documentation/networking/device_drivers/intel/e1000e.rst b/Documentation/networking/device_drivers/intel/e1000e.rst
index 33554e5416c5..01999f05509c 100644
--- a/Documentation/networking/device_drivers/intel/e1000e.rst
+++ b/Documentation/networking/device_drivers/intel/e1000e.rst
@@ -1,5 +1,6 @@
.. SPDX-License-Identifier: GPL-2.0+
+======================================================
Linux* Driver for Intel(R) Ethernet Network Connection
======================================================
diff --git a/Documentation/networking/device_drivers/intel/fm10k.rst b/Documentation/networking/device_drivers/intel/fm10k.rst
index bf5e5942f28d..ac3269e34f55 100644
--- a/Documentation/networking/device_drivers/intel/fm10k.rst
+++ b/Documentation/networking/device_drivers/intel/fm10k.rst
@@ -1,5 +1,6 @@
.. SPDX-License-Identifier: GPL-2.0+
+==============================================================
Linux* Base Driver for Intel(R) Ethernet Multi-host Controller
==============================================================
diff --git a/Documentation/networking/device_drivers/intel/i40e.rst b/Documentation/networking/device_drivers/intel/i40e.rst
index 0cc16c525d10..848fd388fa6e 100644
--- a/Documentation/networking/device_drivers/intel/i40e.rst
+++ b/Documentation/networking/device_drivers/intel/i40e.rst
@@ -1,5 +1,6 @@
.. SPDX-License-Identifier: GPL-2.0+
+==================================================================
Linux* Base Driver for the Intel(R) Ethernet Controller 700 Series
==================================================================
diff --git a/Documentation/networking/device_drivers/intel/iavf.rst b/Documentation/networking/device_drivers/intel/iavf.rst
index f8b42b64eb28..2d0c3baa1752 100644
--- a/Documentation/networking/device_drivers/intel/iavf.rst
+++ b/Documentation/networking/device_drivers/intel/iavf.rst
@@ -1,5 +1,6 @@
.. SPDX-License-Identifier: GPL-2.0+
+==================================================================
Linux* Base Driver for Intel(R) Ethernet Adaptive Virtual Function
==================================================================
diff --git a/Documentation/networking/device_drivers/intel/ice.rst b/Documentation/networking/device_drivers/intel/ice.rst
index 4d118b827bbb..c220aa2711c6 100644
--- a/Documentation/networking/device_drivers/intel/ice.rst
+++ b/Documentation/networking/device_drivers/intel/ice.rst
@@ -1,5 +1,6 @@
.. SPDX-License-Identifier: GPL-2.0+
+===================================================================
Linux* Base Driver for the Intel(R) Ethernet Connection E800 Series
===================================================================
diff --git a/Documentation/networking/device_drivers/intel/igb.rst b/Documentation/networking/device_drivers/intel/igb.rst
index e87a4a72ea2d..fc8cfaa5dcfa 100644
--- a/Documentation/networking/device_drivers/intel/igb.rst
+++ b/Documentation/networking/device_drivers/intel/igb.rst
@@ -1,5 +1,6 @@
.. SPDX-License-Identifier: GPL-2.0+
+===========================================================
Linux* Base Driver for Intel(R) Ethernet Network Connection
===========================================================
diff --git a/Documentation/networking/device_drivers/intel/igbvf.rst b/Documentation/networking/device_drivers/intel/igbvf.rst
index a8a9ffa4f8d3..9cddabe8108e 100644
--- a/Documentation/networking/device_drivers/intel/igbvf.rst
+++ b/Documentation/networking/device_drivers/intel/igbvf.rst
@@ -1,5 +1,6 @@
.. SPDX-License-Identifier: GPL-2.0+
+============================================================
Linux* Base Virtual Function Driver for Intel(R) 1G Ethernet
============================================================
diff --git a/Documentation/networking/device_drivers/intel/ixgb.rst b/Documentation/networking/device_drivers/intel/ixgb.rst
index 8bd80e27843d..945018207a92 100644
--- a/Documentation/networking/device_drivers/intel/ixgb.rst
+++ b/Documentation/networking/device_drivers/intel/ixgb.rst
@@ -1,5 +1,6 @@
.. SPDX-License-Identifier: GPL-2.0+
+=====================================================================
Linux Base Driver for 10 Gigabit Intel(R) Ethernet Network Connection
=====================================================================
diff --git a/Documentation/networking/device_drivers/intel/ixgbe.rst b/Documentation/networking/device_drivers/intel/ixgbe.rst
index 86d887a63606..c7d25483fedb 100644
--- a/Documentation/networking/device_drivers/intel/ixgbe.rst
+++ b/Documentation/networking/device_drivers/intel/ixgbe.rst
@@ -1,5 +1,6 @@
.. SPDX-License-Identifier: GPL-2.0+
+=============================================================================
Linux* Base Driver for the Intel(R) Ethernet 10 Gigabit PCI Express Adapters
=============================================================================
diff --git a/Documentation/networking/device_drivers/intel/ixgbevf.rst b/Documentation/networking/device_drivers/intel/ixgbevf.rst
index 56cde6366c2f..5d4977360157 100644
--- a/Documentation/networking/device_drivers/intel/ixgbevf.rst
+++ b/Documentation/networking/device_drivers/intel/ixgbevf.rst
@@ -1,5 +1,6 @@
.. SPDX-License-Identifier: GPL-2.0+
+=============================================================
Linux* Base Virtual Function Driver for Intel(R) 10G Ethernet
=============================================================
diff --git a/Documentation/networking/device_drivers/stmicro/stmmac.txt b/Documentation/networking/device_drivers/stmicro/stmmac.txt
index 2bb07078f535..1ae979fd90d2 100644
--- a/Documentation/networking/device_drivers/stmicro/stmmac.txt
+++ b/Documentation/networking/device_drivers/stmicro/stmmac.txt
@@ -267,7 +267,7 @@ static struct fixed_phy_status stmmac0_fixed_phy_status = {
During the board's device_init we can configure the first
MAC for fixed_link by calling:
- fixed_phy_add(PHY_POLL, 1, &stmmac0_fixed_phy_status, -1);
+ fixed_phy_add(PHY_POLL, 1, &stmmac0_fixed_phy_status);
and the second one, with a real PHY device attached to the bus,
by using the stmmac_mdio_bus_data structure (to provide the id, the
reset procedure etc).
diff --git a/Documentation/networking/devlink-health.txt b/Documentation/networking/devlink-health.txt
new file mode 100644
index 000000000000..1db3fbea0831
--- /dev/null
+++ b/Documentation/networking/devlink-health.txt
@@ -0,0 +1,86 @@
+The health mechanism is targeted for Real Time Alerting, in order to know when
+something bad had happened to a PCI device
+- Provide alert debug information
+- Self healing
+- If problem needs vendor support, provide a way to gather all needed debugging
+ information.
+
+The main idea is to unify and centralize driver health reports in the
+generic devlink instance and allow the user to set different
+attributes of the health reporting and recovery procedures.
+
+The devlink health reporter:
+Device driver creates a "health reporter" per each error/health type.
+Error/Health type can be a known/generic (eg pci error, fw error, rx/tx error)
+or unknown (driver specific).
+For each registered health reporter a driver can issue error/health reports
+asynchronously. All health reports handling is done by devlink.
+Device driver can provide specific callbacks for each "health reporter", e.g.
+ - Recovery procedures
+ - Diagnostics and object dump procedures
+ - OOB initial parameters
+Different parts of the driver can register different types of health reporters
+with different handlers.
+
+Once an error is reported, devlink health will do the following actions:
+ * A log is being send to the kernel trace events buffer
+ * Health status and statistics are being updated for the reporter instance
+ * Object dump is being taken and saved at the reporter instance (as long as
+ there is no other dump which is already stored)
+ * Auto recovery attempt is being done. Depends on:
+ - Auto-recovery configuration
+ - Grace period vs. time passed since last recover
+
+The user interface:
+User can access/change each reporter's parameters and driver specific callbacks
+via devlink, e.g per error type (per health reporter)
+ - Configure reporter's generic parameters (like: disable/enable auto recovery)
+ - Invoke recovery procedure
+ - Run diagnostics
+ - Object dump
+
+The devlink health interface (via netlink):
+DEVLINK_CMD_HEALTH_REPORTER_GET
+ Retrieves status and configuration info per DEV and reporter.
+DEVLINK_CMD_HEALTH_REPORTER_SET
+ Allows reporter-related configuration setting.
+DEVLINK_CMD_HEALTH_REPORTER_RECOVER
+ Triggers a reporter's recovery procedure.
+DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE
+ Retrieves diagnostics data from a reporter on a device.
+DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET
+ Retrieves the last stored dump. Devlink health
+ saves a single dump. If an dump is not already stored by the devlink
+ for this reporter, devlink generates a new dump.
+ dump output is defined by the reporter.
+DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR
+ Clears the last saved dump file for the specified reporter.
+
+
+ netlink
+ +--------------------------+
+ | |
+ | + |
+ | | |
+ +--------------------------+
+ |request for ops
+ |(diagnose,
+ mlx5_core devlink |recover,
+ |dump)
++--------+ +--------------------------+
+| | | reporter| |
+| | | +---------v----------+ |
+| | ops execution | | | |
+| <----------------------------------+ | |
+| | | | | |
+| | | + ^------------------+ |
+| | | | request for ops |
+| | | | (recover, dump) |
+| | | | |
+| | | +-+------------------+ |
+| | health report | | health handler | |
+| +-------------------------------> | |
+| | | +--------------------+ |
+| | health reporter create | |
+| +----------------------------> |
++--------+ +--------------------------+
diff --git a/Documentation/networking/devlink-info-versions.rst b/Documentation/networking/devlink-info-versions.rst
new file mode 100644
index 000000000000..c79ad8593383
--- /dev/null
+++ b/Documentation/networking/devlink-info-versions.rst
@@ -0,0 +1,43 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+=====================
+Devlink info versions
+=====================
+
+board.id
+========
+
+Unique identifier of the board design.
+
+board.rev
+=========
+
+Board design revision.
+
+board.manufacture
+=================
+
+An identifier of the company or the facility which produced the part.
+
+fw.mgmt
+=======
+
+Control unit firmware version. This firmware is responsible for house
+keeping tasks, PHY control etc. but not the packet-by-packet data path
+operation.
+
+fw.app
+======
+
+Data path microcode controlling high-speed packet processing.
+
+fw.undi
+=======
+
+UNDI software, may include the UEFI driver, firmware or both.
+
+fw.ncsi
+=======
+
+Version of the software responsible for supporting/handling the
+Network Controller Sideband Interface.
diff --git a/Documentation/networking/devlink-params-mlxsw.txt b/Documentation/networking/devlink-params-mlxsw.txt
new file mode 100644
index 000000000000..c63ea9fc7009
--- /dev/null
+++ b/Documentation/networking/devlink-params-mlxsw.txt
@@ -0,0 +1,10 @@
+fw_load_policy [DEVICE, GENERIC]
+ Configuration mode: driverinit
+
+acl_region_rehash_interval [DEVICE, DRIVER-SPECIFIC]
+ Sets an interval for periodic ACL region rehashes.
+ The value is in milliseconds, minimal value is "3000".
+ Value "0" disables the periodic work.
+ The first rehash will be run right after value is set.
+ Type: u32
+ Configuration mode: runtime
diff --git a/Documentation/networking/dsa/dsa.txt b/Documentation/networking/dsa/dsa.txt
index 25170ad7d25b..43ef767bc440 100644
--- a/Documentation/networking/dsa/dsa.txt
+++ b/Documentation/networking/dsa/dsa.txt
@@ -236,19 +236,6 @@ description.
Design limitations
==================
-DSA is a platform device driver
--------------------------------
-
-DSA is implemented as a DSA platform device driver which is convenient because
-it will register the entire DSA switch tree attached to a master network device
-in one-shot, facilitating the device creation and simplifying the device driver
-model a bit, this comes however with a number of limitations:
-
-- building DSA and its switch drivers as modules is currently not working
-- the device driver parenting does not necessarily reflect the original
- bus/device the switch can be created from
-- supporting non-MDIO and non-MMIO (platform) switches is not possible
-
Limits on the number of devices and ports
-----------------------------------------
@@ -533,16 +520,12 @@ Bridge VLAN filtering
function that the driver has to call for each VLAN the given port is a member
of. A switchdev object is used to carry the VID and bridge flags.
-- port_fdb_prepare: bridge layer function invoked when the bridge prepares the
- installation of a Forwarding Database entry. If the operation is not
- supported, this function should return -EOPNOTSUPP to inform the bridge code
- to fallback to a software implementation. No hardware setup must be done in
- this function. See port_fdb_add for this and details.
-
- port_fdb_add: bridge layer function invoked when the bridge wants to install a
Forwarding Database entry, the switch hardware should be programmed with the
specified address in the specified VLAN Id in the forwarding database
- associated with this VLAN ID
+ associated with this VLAN ID. If the operation is not supported, this
+ function should return -EOPNOTSUPP to inform the bridge code to fallback to
+ a software implementation.
Note: VLAN ID 0 corresponds to the port private database, which, in the context
of DSA, would be the its port-based VLAN, used by the associated bridge device.
diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt
index 2196b824e96c..319e5e041f38 100644
--- a/Documentation/networking/filter.txt
+++ b/Documentation/networking/filter.txt
@@ -464,10 +464,11 @@ breakpoints: 0 1
JIT compiler
------------
-The Linux kernel has a built-in BPF JIT compiler for x86_64, SPARC, PowerPC,
-ARM, ARM64, MIPS and s390 and can be enabled through CONFIG_BPF_JIT. The JIT
-compiler is transparently invoked for each attached filter from user space
-or for internal kernel users if it has been previously enabled by root:
+The Linux kernel has a built-in BPF JIT compiler for x86_64, SPARC,
+PowerPC, ARM, ARM64, MIPS, RISC-V and s390 and can be enabled through
+CONFIG_BPF_JIT. The JIT compiler is transparently invoked for each
+attached filter from user space or for internal kernel users if it has
+been previously enabled by root:
echo 1 > /proc/sys/net/core/bpf_jit_enable
@@ -603,9 +604,10 @@ got from bpf_prog_create(), and 'ctx' the given context (e.g.
skb pointer). All constraints and restrictions from bpf_check_classic() apply
before a conversion to the new layout is being done behind the scenes!
-Currently, the classic BPF format is being used for JITing on most 32-bit
-architectures, whereas x86-64, aarch64, s390x, powerpc64, sparc64, arm32 perform
-JIT compilation from eBPF instruction set.
+Currently, the classic BPF format is being used for JITing on most
+32-bit architectures, whereas x86-64, aarch64, s390x, powerpc64,
+sparc64, arm32, riscv (RV64G) perform JIT compilation from eBPF
+instruction set.
Some core changes of the new internal format:
@@ -827,7 +829,7 @@ tracing filters may do to maintain counters of events, for example. Register R9
is not used by socket filters either, but more complex filters may be running
out of registers and would have to resort to spill/fill to stack.
-Internal BPF can used as generic assembler for last step performance
+Internal BPF can be used as a generic assembler for last step performance
optimizations, socket filters and seccomp are using it as assembler. Tracing
filters may use it as assembler to generate code from kernel. In kernel usage
may not be bounded by security considerations, since generated internal BPF code
@@ -865,7 +867,7 @@ Three LSB bits store instruction class which is one of:
BPF_STX 0x03 BPF_STX 0x03
BPF_ALU 0x04 BPF_ALU 0x04
BPF_JMP 0x05 BPF_JMP 0x05
- BPF_RET 0x06 [ class 6 unused, for future if needed ]
+ BPF_RET 0x06 BPF_JMP32 0x06
BPF_MISC 0x07 BPF_ALU64 0x07
When BPF_CLASS(code) == BPF_ALU or BPF_JMP, 4th bit encodes source operand ...
@@ -902,9 +904,9 @@ If BPF_CLASS(code) == BPF_ALU or BPF_ALU64 [ in eBPF ], BPF_OP(code) is one of:
BPF_ARSH 0xc0 /* eBPF only: sign extending shift right */
BPF_END 0xd0 /* eBPF only: endianness conversion */
-If BPF_CLASS(code) == BPF_JMP, BPF_OP(code) is one of:
+If BPF_CLASS(code) == BPF_JMP or BPF_JMP32 [ in eBPF ], BPF_OP(code) is one of:
- BPF_JA 0x00
+ BPF_JA 0x00 /* BPF_JMP only */
BPF_JEQ 0x10
BPF_JGT 0x20
BPF_JGE 0x30
@@ -912,8 +914,8 @@ If BPF_CLASS(code) == BPF_JMP, BPF_OP(code) is one of:
BPF_JNE 0x50 /* eBPF only: jump != */
BPF_JSGT 0x60 /* eBPF only: signed '>' */
BPF_JSGE 0x70 /* eBPF only: signed '>=' */
- BPF_CALL 0x80 /* eBPF only: function call */
- BPF_EXIT 0x90 /* eBPF only: function return */
+ BPF_CALL 0x80 /* eBPF BPF_JMP only: function call */
+ BPF_EXIT 0x90 /* eBPF BPF_JMP only: function return */
BPF_JLT 0xa0 /* eBPF only: unsigned '<' */
BPF_JLE 0xb0 /* eBPF only: unsigned '<=' */
BPF_JSLT 0xc0 /* eBPF only: signed '<' */
@@ -936,8 +938,9 @@ Classic BPF wastes the whole BPF_RET class to represent a single 'ret'
operation. Classic BPF_RET | BPF_K means copy imm32 into return register
and perform function exit. eBPF is modeled to match CPU, so BPF_JMP | BPF_EXIT
in eBPF means function exit only. The eBPF program needs to store return
-value into register R0 before doing a BPF_EXIT. Class 6 in eBPF is currently
-unused and reserved for future use.
+value into register R0 before doing a BPF_EXIT. Class 6 in eBPF is used as
+BPF_JMP32 to mean exactly the same operations as BPF_JMP, but with 32-bit wide
+operands for the comparisons instead.
For load and store instructions the 8-bit 'code' field is divided as:
diff --git a/Documentation/networking/ieee802154.txt b/Documentation/networking/ieee802154.rst
index e74d8e1da0e2..36ca823a1122 100644
--- a/Documentation/networking/ieee802154.txt
+++ b/Documentation/networking/ieee802154.rst
@@ -1,54 +1,79 @@
-
- Linux IEEE 802.15.4 implementation
-
+===============================
+IEEE 802.15.4 Developer's Guide
+===============================
Introduction
============
The IEEE 802.15.4 working group focuses on standardization of the bottom
two layers: Medium Access Control (MAC) and Physical access (PHY). And there
are mainly two options available for upper layers:
- - ZigBee - proprietary protocol from the ZigBee Alliance
- - 6LoWPAN - IPv6 networking over low rate personal area networks
+
+- ZigBee - proprietary protocol from the ZigBee Alliance
+- 6LoWPAN - IPv6 networking over low rate personal area networks
The goal of the Linux-wpan is to provide a complete implementation
of the IEEE 802.15.4 and 6LoWPAN protocols. IEEE 802.15.4 is a stack
of protocols for organizing Low-Rate Wireless Personal Area Networks.
The stack is composed of three main parts:
- - IEEE 802.15.4 layer; We have chosen to use plain Berkeley socket API,
- the generic Linux networking stack to transfer IEEE 802.15.4 data
- messages and a special protocol over netlink for configuration/management
- - MAC - provides access to shared channel and reliable data delivery
- - PHY - represents device drivers
+- IEEE 802.15.4 layer; We have chosen to use plain Berkeley socket API,
+ the generic Linux networking stack to transfer IEEE 802.15.4 data
+ messages and a special protocol over netlink for configuration/management
+- MAC - provides access to shared channel and reliable data delivery
+- PHY - represents device drivers
Socket API
==========
-int sd = socket(PF_IEEE802154, SOCK_DGRAM, 0);
-.....
+.. c:function:: int sd = socket(PF_IEEE802154, SOCK_DGRAM, 0);
The address family, socket addresses etc. are defined in the
include/net/af_ieee802154.h header or in the special header
in the userspace package (see either http://wpan.cakelab.org/ or the
git tree at https://github.com/linux-wpan/wpan-tools).
+6LoWPAN Linux implementation
+============================
+
+The IEEE 802.15.4 standard specifies an MTU of 127 bytes, yielding about 80
+octets of actual MAC payload once security is turned on, on a wireless link
+with a link throughput of 250 kbps or less. The 6LoWPAN adaptation format
+[RFC4944] was specified to carry IPv6 datagrams over such constrained links,
+taking into account limited bandwidth, memory, or energy resources that are
+expected in applications such as wireless Sensor Networks. [RFC4944] defines
+a Mesh Addressing header to support sub-IP forwarding, a Fragmentation header
+to support the IPv6 minimum MTU requirement [RFC2460], and stateless header
+compression for IPv6 datagrams (LOWPAN_HC1 and LOWPAN_HC2) to reduce the
+relatively large IPv6 and UDP headers down to (in the best case) several bytes.
+
+In September 2011 the standard update was published - [RFC6282].
+It deprecates HC1 and HC2 compression and defines IPHC encoding format which is
+used in this Linux implementation.
+
+All the code related to 6lowpan you may find in files: net/6lowpan/*
+and net/ieee802154/6lowpan/*
+
+To setup a 6LoWPAN interface you need:
+1. Add IEEE802.15.4 interface and set channel and PAN ID;
+2. Add 6lowpan interface by command like:
+# ip link add link wpan0 name lowpan0 type lowpan
+3. Bring up 'lowpan0' interface
-Kernel side
-=============
+Drivers
+=======
Like with WiFi, there are several types of devices implementing IEEE 802.15.4.
1) 'HardMAC'. The MAC layer is implemented in the device itself, the device
- exports a management (e.g. MLME) and data API.
+exports a management (e.g. MLME) and data API.
2) 'SoftMAC' or just radio. These types of devices are just radio transceivers
- possibly with some kinds of acceleration like automatic CRC computation and
- comparation, automagic ACK handling, address matching, etc.
+possibly with some kinds of acceleration like automatic CRC computation and
+comparation, automagic ACK handling, address matching, etc.
Those types of devices require different approach to be hooked into Linux kernel.
-
HardMAC
-=======
+-------
See the header include/net/ieee802154_netdev.h. You have to implement Linux
net_device, with .type = ARPHRD_IEEE802154. Data is exchanged with socket family
@@ -64,9 +89,8 @@ net_device with a pointer to struct ieee802154_mlme_ops instance. The fields
assoc_req, assoc_resp, disassoc_req, start_req, and scan_req are optional.
All other fields are required.
-
SoftMAC
-=======
+-------
The MAC is the middle layer in the IEEE 802.15.4 Linux stack. This moment it
provides interface for drivers registration and management of slave interfaces.
@@ -79,99 +103,78 @@ This layer is going to be extended soon.
See header include/net/mac802154.h and several drivers in
drivers/net/ieee802154/.
+Fake drivers
+------------
+
+In addition there is a driver available which simulates a real device with
+SoftMAC (fakelb - IEEE 802.15.4 loopback driver) interface. This option
+provides a possibility to test and debug the stack without usage of real hardware.
Device drivers API
==================
The include/net/mac802154.h defines following functions:
- - struct ieee802154_hw *
- ieee802154_alloc_hw(size_t priv_data_len, const struct ieee802154_ops *ops):
- allocation of IEEE 802.15.4 compatible hardware device
- - void ieee802154_free_hw(struct ieee802154_hw *hw):
- freeing allocated hardware device
+.. c:function:: struct ieee802154_dev *ieee802154_alloc_device (size_t priv_size, struct ieee802154_ops *ops)
- - int ieee802154_register_hw(struct ieee802154_hw *hw):
- register PHY which is the allocated hardware device, in the system
+Allocation of IEEE 802.15.4 compatible device.
- - void ieee802154_unregister_hw(struct ieee802154_hw *hw):
- freeing registered PHY
+.. c:function:: void ieee802154_free_device(struct ieee802154_dev *dev)
- - void ieee802154_rx_irqsafe(struct ieee802154_hw *hw, struct sk_buff *skb,
- u8 lqi):
- telling 802.15.4 module there is a new received frame in the skb with
- the RF Link Quality Indicator (LQI) from the hardware device
+Freeing allocated device.
- - void ieee802154_xmit_complete(struct ieee802154_hw *hw, struct sk_buff *skb,
- bool ifs_handling):
- telling 802.15.4 module the frame in the skb is or going to be
- transmitted through the hardware device
+.. c:function:: int ieee802154_register_device(struct ieee802154_dev *dev)
+
+Register PHY in the system.
+
+.. c:function:: void ieee802154_unregister_device(struct ieee802154_dev *dev)
+
+Freeing registered PHY.
+
+.. c:function:: void ieee802154_rx_irqsafe(struct ieee802154_hw *hw, struct sk_buff *skb, u8 lqi):
+
+Telling 802.15.4 module there is a new received frame in the skb with
+the RF Link Quality Indicator (LQI) from the hardware device.
+
+.. c:function:: void ieee802154_xmit_complete(struct ieee802154_hw *hw, struct sk_buff *skb, bool ifs_handling):
+
+Telling 802.15.4 module the frame in the skb is or going to be
+transmitted through the hardware device
The device driver must implement the following callbacks in the IEEE 802.15.4
-operations structure at least:
-struct ieee802154_ops {
- ...
- int (*start)(struct ieee802154_hw *hw);
- void (*stop)(struct ieee802154_hw *hw);
- ...
- int (*xmit_async)(struct ieee802154_hw *hw, struct sk_buff *skb);
- int (*ed)(struct ieee802154_hw *hw, u8 *level);
- int (*set_channel)(struct ieee802154_hw *hw, u8 page, u8 channel);
- ...
-};
-
- - int start(struct ieee802154_hw *hw):
- handler that 802.15.4 module calls for the hardware device initialization.
-
- - void stop(struct ieee802154_hw *hw):
- handler that 802.15.4 module calls for the hardware device cleanup.
-
- - int xmit_async(struct ieee802154_hw *hw, struct sk_buff *skb):
- handler that 802.15.4 module calls for each frame in the skb going to be
- transmitted through the hardware device.
-
- - int ed(struct ieee802154_hw *hw, u8 *level):
- handler that 802.15.4 module calls for Energy Detection from the hardware
- device.
-
- - int set_channel(struct ieee802154_hw *hw, u8 page, u8 channel):
- set radio for listening on specific channel of the hardware device.
+operations structure at least::
-Moreover IEEE 802.15.4 device operations structure should be filled.
+ struct ieee802154_ops {
+ ...
+ int (*start)(struct ieee802154_hw *hw);
+ void (*stop)(struct ieee802154_hw *hw);
+ ...
+ int (*xmit_async)(struct ieee802154_hw *hw, struct sk_buff *skb);
+ int (*ed)(struct ieee802154_hw *hw, u8 *level);
+ int (*set_channel)(struct ieee802154_hw *hw, u8 page, u8 channel);
+ ...
+ };
-Fake drivers
-============
+.. c:function:: int start(struct ieee802154_hw *hw):
-In addition there is a driver available which simulates a real device with
-SoftMAC (fakelb - IEEE 802.15.4 loopback driver) interface. This option
-provides a possibility to test and debug the stack without usage of real hardware.
+Handler that 802.15.4 module calls for the hardware device initialization.
-See sources in drivers/net/ieee802154 folder for more details.
+.. c:function:: void stop(struct ieee802154_hw *hw):
+Handler that 802.15.4 module calls for the hardware device cleanup.
-6LoWPAN Linux implementation
-============================
+.. c:function:: int xmit_async(struct ieee802154_hw *hw, struct sk_buff *skb):
-The IEEE 802.15.4 standard specifies an MTU of 127 bytes, yielding about 80
-octets of actual MAC payload once security is turned on, on a wireless link
-with a link throughput of 250 kbps or less. The 6LoWPAN adaptation format
-[RFC4944] was specified to carry IPv6 datagrams over such constrained links,
-taking into account limited bandwidth, memory, or energy resources that are
-expected in applications such as wireless Sensor Networks. [RFC4944] defines
-a Mesh Addressing header to support sub-IP forwarding, a Fragmentation header
-to support the IPv6 minimum MTU requirement [RFC2460], and stateless header
-compression for IPv6 datagrams (LOWPAN_HC1 and LOWPAN_HC2) to reduce the
-relatively large IPv6 and UDP headers down to (in the best case) several bytes.
+Handler that 802.15.4 module calls for each frame in the skb going to be
+transmitted through the hardware device.
-In September 2011 the standard update was published - [RFC6282].
-It deprecates HC1 and HC2 compression and defines IPHC encoding format which is
-used in this Linux implementation.
+.. c:function:: int ed(struct ieee802154_hw *hw, u8 *level):
-All the code related to 6lowpan you may find in files: net/6lowpan/*
-and net/ieee802154/6lowpan/*
+Handler that 802.15.4 module calls for Energy Detection from the hardware
+device.
-To setup a 6LoWPAN interface you need:
-1. Add IEEE802.15.4 interface and set channel and PAN ID;
-2. Add 6lowpan interface by command like:
- # ip link add link wpan0 name lowpan0 type lowpan
-3. Bring up 'lowpan0' interface
+.. c:function:: int set_channel(struct ieee802154_hw *hw, u8 page, u8 channel):
+
+Set radio for listening on specific channel of the hardware device.
+
+Moreover IEEE 802.15.4 device operations structure should be filled.
diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
index 59e86de662cd..5449149be496 100644
--- a/Documentation/networking/index.rst
+++ b/Documentation/networking/index.rst
@@ -24,14 +24,21 @@ Contents:
device_drivers/intel/i40e
device_drivers/intel/iavf
device_drivers/intel/ice
+ devlink-info-versions
+ ieee802154
kapi
z8530book
msg_zerocopy
failover
net_failover
+ phy
+ sfp-phylink
alias
bridge
snmp_counter
+ checksum-offloads
+ segmentation-offloads
+ scaling
.. only:: subproject
diff --git a/Documentation/networking/msg_zerocopy.rst b/Documentation/networking/msg_zerocopy.rst
index fe46d4867e2d..18c1415e7bfa 100644
--- a/Documentation/networking/msg_zerocopy.rst
+++ b/Documentation/networking/msg_zerocopy.rst
@@ -7,7 +7,7 @@ Intro
=====
The MSG_ZEROCOPY flag enables copy avoidance for socket send calls.
-The feature is currently implemented for TCP sockets.
+The feature is currently implemented for TCP and UDP sockets.
Opportunity and Caveats
diff --git a/Documentation/networking/operstates.txt b/Documentation/networking/operstates.txt
index 355c6d8ef8ad..b203d1334822 100644
--- a/Documentation/networking/operstates.txt
+++ b/Documentation/networking/operstates.txt
@@ -22,8 +22,9 @@ and changeable from userspace under certain rules.
2. Querying from userspace
Both admin and operational state can be queried via the netlink
-operation RTM_GETLINK. It is also possible to subscribe to RTMGRP_LINK
-to be notified of updates. This is important for setting from userspace.
+operation RTM_GETLINK. It is also possible to subscribe to RTNLGRP_LINK
+to be notified of updates while the interface is admin up. This is
+important for setting from userspace.
These values contain interface state:
@@ -101,8 +102,9 @@ because some driver controlled protocol establishment has to
complete. Corresponding functions are netif_dormant_on() to set the
flag, netif_dormant_off() to clear it and netif_dormant() to query.
-On device allocation, networking core sets the flags equivalent to
-netif_carrier_ok() and !netif_dormant().
+On device allocation, both flags __LINK_STATE_NOCARRIER and
+__LINK_STATE_DORMANT are cleared, so the effective state is equivalent
+to netif_carrier_ok() and !netif_dormant().
Whenever the driver CHANGES one of these flags, a workqueue event is
@@ -133,11 +135,11 @@ netif_carrier_ok() && !netif_dormant() is set by the
driver. Afterwards, the userspace application can set IFLA_OPERSTATE
to IF_OPER_DORMANT or IF_OPER_UP as long as the driver does not set
netif_carrier_off() or netif_dormant_on(). Changes made by userspace
-are multicasted on the netlink group RTMGRP_LINK.
+are multicasted on the netlink group RTNLGRP_LINK.
So basically a 802.1X supplicant interacts with the kernel like this:
--subscribe to RTMGRP_LINK
+-subscribe to RTNLGRP_LINK
-set IFLA_LINKMODE to 1 via RTM_SETLINK
-query RTM_GETLINK once to get initial state
-if initial flags are not (IFF_LOWER_UP && !IFF_DORMANT), wait until
diff --git a/Documentation/networking/phy.rst b/Documentation/networking/phy.rst
new file mode 100644
index 000000000000..0dd90d7df5ec
--- /dev/null
+++ b/Documentation/networking/phy.rst
@@ -0,0 +1,447 @@
+=====================
+PHY Abstraction Layer
+=====================
+
+Purpose
+=======
+
+Most network devices consist of set of registers which provide an interface
+to a MAC layer, which communicates with the physical connection through a
+PHY. The PHY concerns itself with negotiating link parameters with the link
+partner on the other side of the network connection (typically, an ethernet
+cable), and provides a register interface to allow drivers to determine what
+settings were chosen, and to configure what settings are allowed.
+
+While these devices are distinct from the network devices, and conform to a
+standard layout for the registers, it has been common practice to integrate
+the PHY management code with the network driver. This has resulted in large
+amounts of redundant code. Also, on embedded systems with multiple (and
+sometimes quite different) ethernet controllers connected to the same
+management bus, it is difficult to ensure safe use of the bus.
+
+Since the PHYs are devices, and the management busses through which they are
+accessed are, in fact, busses, the PHY Abstraction Layer treats them as such.
+In doing so, it has these goals:
+
+#. Increase code-reuse
+#. Increase overall code-maintainability
+#. Speed development time for new network drivers, and for new systems
+
+Basically, this layer is meant to provide an interface to PHY devices which
+allows network driver writers to write as little code as possible, while
+still providing a full feature set.
+
+The MDIO bus
+============
+
+Most network devices are connected to a PHY by means of a management bus.
+Different devices use different busses (though some share common interfaces).
+In order to take advantage of the PAL, each bus interface needs to be
+registered as a distinct device.
+
+#. read and write functions must be implemented. Their prototypes are::
+
+ int write(struct mii_bus *bus, int mii_id, int regnum, u16 value);
+ int read(struct mii_bus *bus, int mii_id, int regnum);
+
+ mii_id is the address on the bus for the PHY, and regnum is the register
+ number. These functions are guaranteed not to be called from interrupt
+ time, so it is safe for them to block, waiting for an interrupt to signal
+ the operation is complete
+
+#. A reset function is optional. This is used to return the bus to an
+ initialized state.
+
+#. A probe function is needed. This function should set up anything the bus
+ driver needs, setup the mii_bus structure, and register with the PAL using
+ mdiobus_register. Similarly, there's a remove function to undo all of
+ that (use mdiobus_unregister).
+
+#. Like any driver, the device_driver structure must be configured, and init
+ exit functions are used to register the driver.
+
+#. The bus must also be declared somewhere as a device, and registered.
+
+As an example for how one driver implemented an mdio bus driver, see
+drivers/net/ethernet/freescale/fsl_pq_mdio.c and an associated DTS file
+for one of the users. (e.g. "git grep fsl,.*-mdio arch/powerpc/boot/dts/")
+
+(RG)MII/electrical interface considerations
+===========================================
+
+The Reduced Gigabit Medium Independent Interface (RGMII) is a 12-pin
+electrical signal interface using a synchronous 125Mhz clock signal and several
+data lines. Due to this design decision, a 1.5ns to 2ns delay must be added
+between the clock line (RXC or TXC) and the data lines to let the PHY (clock
+sink) have enough setup and hold times to sample the data lines correctly. The
+PHY library offers different types of PHY_INTERFACE_MODE_RGMII* values to let
+the PHY driver and optionally the MAC driver, implement the required delay. The
+values of phy_interface_t must be understood from the perspective of the PHY
+device itself, leading to the following:
+
+* PHY_INTERFACE_MODE_RGMII: the PHY is not responsible for inserting any
+ internal delay by itself, it assumes that either the Ethernet MAC (if capable
+ or the PCB traces) insert the correct 1.5-2ns delay
+
+* PHY_INTERFACE_MODE_RGMII_TXID: the PHY should insert an internal delay
+ for the transmit data lines (TXD[3:0]) processed by the PHY device
+
+* PHY_INTERFACE_MODE_RGMII_RXID: the PHY should insert an internal delay
+ for the receive data lines (RXD[3:0]) processed by the PHY device
+
+* PHY_INTERFACE_MODE_RGMII_ID: the PHY should insert internal delays for
+ both transmit AND receive data lines from/to the PHY device
+
+Whenever possible, use the PHY side RGMII delay for these reasons:
+
+* PHY devices may offer sub-nanosecond granularity in how they allow a
+ receiver/transmitter side delay (e.g: 0.5, 1.0, 1.5ns) to be specified. Such
+ precision may be required to account for differences in PCB trace lengths
+
+* PHY devices are typically qualified for a large range of applications
+ (industrial, medical, automotive...), and they provide a constant and
+ reliable delay across temperature/pressure/voltage ranges
+
+* PHY device drivers in PHYLIB being reusable by nature, being able to
+ configure correctly a specified delay enables more designs with similar delay
+ requirements to be operate correctly
+
+For cases where the PHY is not capable of providing this delay, but the
+Ethernet MAC driver is capable of doing so, the correct phy_interface_t value
+should be PHY_INTERFACE_MODE_RGMII, and the Ethernet MAC driver should be
+configured correctly in order to provide the required transmit and/or receive
+side delay from the perspective of the PHY device. Conversely, if the Ethernet
+MAC driver looks at the phy_interface_t value, for any other mode but
+PHY_INTERFACE_MODE_RGMII, it should make sure that the MAC-level delays are
+disabled.
+
+In case neither the Ethernet MAC, nor the PHY are capable of providing the
+required delays, as defined per the RGMII standard, several options may be
+available:
+
+* Some SoCs may offer a pin pad/mux/controller capable of configuring a given
+ set of pins'strength, delays, and voltage; and it may be a suitable
+ option to insert the expected 2ns RGMII delay.
+
+* Modifying the PCB design to include a fixed delay (e.g: using a specifically
+ designed serpentine), which may not require software configuration at all.
+
+Common problems with RGMII delay mismatch
+-----------------------------------------
+
+When there is a RGMII delay mismatch between the Ethernet MAC and the PHY, this
+will most likely result in the clock and data line signals to be unstable when
+the PHY or MAC take a snapshot of these signals to translate them into logical
+1 or 0 states and reconstruct the data being transmitted/received. Typical
+symptoms include:
+
+* Transmission/reception partially works, and there is frequent or occasional
+ packet loss observed
+
+* Ethernet MAC may report some or all packets ingressing with a FCS/CRC error,
+ or just discard them all
+
+* Switching to lower speeds such as 10/100Mbits/sec makes the problem go away
+ (since there is enough setup/hold time in that case)
+
+Connecting to a PHY
+===================
+
+Sometime during startup, the network driver needs to establish a connection
+between the PHY device, and the network device. At this time, the PHY's bus
+and drivers need to all have been loaded, so it is ready for the connection.
+At this point, there are several ways to connect to the PHY:
+
+#. The PAL handles everything, and only calls the network driver when
+ the link state changes, so it can react.
+
+#. The PAL handles everything except interrupts (usually because the
+ controller has the interrupt registers).
+
+#. The PAL handles everything, but checks in with the driver every second,
+ allowing the network driver to react first to any changes before the PAL
+ does.
+
+#. The PAL serves only as a library of functions, with the network device
+ manually calling functions to update status, and configure the PHY
+
+
+Letting the PHY Abstraction Layer do Everything
+===============================================
+
+If you choose option 1 (The hope is that every driver can, but to still be
+useful to drivers that can't), connecting to the PHY is simple:
+
+First, you need a function to react to changes in the link state. This
+function follows this protocol::
+
+ static void adjust_link(struct net_device *dev);
+
+Next, you need to know the device name of the PHY connected to this device.
+The name will look something like, "0:00", where the first number is the
+bus id, and the second is the PHY's address on that bus. Typically,
+the bus is responsible for making its ID unique.
+
+Now, to connect, just call this function::
+
+ phydev = phy_connect(dev, phy_name, &adjust_link, interface);
+
+*phydev* is a pointer to the phy_device structure which represents the PHY.
+If phy_connect is successful, it will return the pointer. dev, here, is the
+pointer to your net_device. Once done, this function will have started the
+PHY's software state machine, and registered for the PHY's interrupt, if it
+has one. The phydev structure will be populated with information about the
+current state, though the PHY will not yet be truly operational at this
+point.
+
+PHY-specific flags should be set in phydev->dev_flags prior to the call
+to phy_connect() such that the underlying PHY driver can check for flags
+and perform specific operations based on them.
+This is useful if the system has put hardware restrictions on
+the PHY/controller, of which the PHY needs to be aware.
+
+*interface* is a u32 which specifies the connection type used
+between the controller and the PHY. Examples are GMII, MII,
+RGMII, and SGMII. For a full list, see include/linux/phy.h
+
+Now just make sure that phydev->supported and phydev->advertising have any
+values pruned from them which don't make sense for your controller (a 10/100
+controller may be connected to a gigabit capable PHY, so you would need to
+mask off SUPPORTED_1000baseT*). See include/linux/ethtool.h for definitions
+for these bitfields. Note that you should not SET any bits, except the
+SUPPORTED_Pause and SUPPORTED_AsymPause bits (see below), or the PHY may get
+put into an unsupported state.
+
+Lastly, once the controller is ready to handle network traffic, you call
+phy_start(phydev). This tells the PAL that you are ready, and configures the
+PHY to connect to the network. If the MAC interrupt of your network driver
+also handles PHY status changes, just set phydev->irq to PHY_IGNORE_INTERRUPT
+before you call phy_start and use phy_mac_interrupt() from the network
+driver. If you don't want to use interrupts, set phydev->irq to PHY_POLL.
+phy_start() enables the PHY interrupts (if applicable) and starts the
+phylib state machine.
+
+When you want to disconnect from the network (even if just briefly), you call
+phy_stop(phydev). This function also stops the phylib state machine and
+disables PHY interrupts.
+
+Pause frames / flow control
+===========================
+
+The PHY does not participate directly in flow control/pause frames except by
+making sure that the SUPPORTED_Pause and SUPPORTED_AsymPause bits are set in
+MII_ADVERTISE to indicate towards the link partner that the Ethernet MAC
+controller supports such a thing. Since flow control/pause frames generation
+involves the Ethernet MAC driver, it is recommended that this driver takes care
+of properly indicating advertisement and support for such features by setting
+the SUPPORTED_Pause and SUPPORTED_AsymPause bits accordingly. This can be done
+either before or after phy_connect() and/or as a result of implementing the
+ethtool::set_pauseparam feature.
+
+
+Keeping Close Tabs on the PAL
+=============================
+
+It is possible that the PAL's built-in state machine needs a little help to
+keep your network device and the PHY properly in sync. If so, you can
+register a helper function when connecting to the PHY, which will be called
+every second before the state machine reacts to any changes. To do this, you
+need to manually call phy_attach() and phy_prepare_link(), and then call
+phy_start_machine() with the second argument set to point to your special
+handler.
+
+Currently there are no examples of how to use this functionality, and testing
+on it has been limited because the author does not have any drivers which use
+it (they all use option 1). So Caveat Emptor.
+
+Doing it all yourself
+=====================
+
+There's a remote chance that the PAL's built-in state machine cannot track
+the complex interactions between the PHY and your network device. If this is
+so, you can simply call phy_attach(), and not call phy_start_machine or
+phy_prepare_link(). This will mean that phydev->state is entirely yours to
+handle (phy_start and phy_stop toggle between some of the states, so you
+might need to avoid them).
+
+An effort has been made to make sure that useful functionality can be
+accessed without the state-machine running, and most of these functions are
+descended from functions which did not interact with a complex state-machine.
+However, again, no effort has been made so far to test running without the
+state machine, so tryer beware.
+
+Here is a brief rundown of the functions::
+
+ int phy_read(struct phy_device *phydev, u16 regnum);
+ int phy_write(struct phy_device *phydev, u16 regnum, u16 val);
+
+Simple read/write primitives. They invoke the bus's read/write function
+pointers.
+::
+
+ void phy_print_status(struct phy_device *phydev);
+
+A convenience function to print out the PHY status neatly.
+::
+
+ void phy_request_interrupt(struct phy_device *phydev);
+
+Requests the IRQ for the PHY interrupts.
+::
+
+ struct phy_device * phy_attach(struct net_device *dev, const char *phy_id,
+ phy_interface_t interface);
+
+Attaches a network device to a particular PHY, binding the PHY to a generic
+driver if none was found during bus initialization.
+::
+
+ int phy_start_aneg(struct phy_device *phydev);
+
+Using variables inside the phydev structure, either configures advertising
+and resets autonegotiation, or disables autonegotiation, and configures
+forced settings.
+::
+
+ static inline int phy_read_status(struct phy_device *phydev);
+
+Fills the phydev structure with up-to-date information about the current
+settings in the PHY.
+::
+
+ int phy_ethtool_sset(struct phy_device *phydev, struct ethtool_cmd *cmd);
+
+Ethtool convenience functions.
+::
+
+ int phy_mii_ioctl(struct phy_device *phydev,
+ struct mii_ioctl_data *mii_data, int cmd);
+
+The MII ioctl. Note that this function will completely screw up the state
+machine if you write registers like BMCR, BMSR, ADVERTISE, etc. Best to
+use this only to write registers which are not standard, and don't set off
+a renegotiation.
+
+PHY Device Drivers
+==================
+
+With the PHY Abstraction Layer, adding support for new PHYs is
+quite easy. In some cases, no work is required at all! However,
+many PHYs require a little hand-holding to get up-and-running.
+
+Generic PHY driver
+------------------
+
+If the desired PHY doesn't have any errata, quirks, or special
+features you want to support, then it may be best to not add
+support, and let the PHY Abstraction Layer's Generic PHY Driver
+do all of the work.
+
+Writing a PHY driver
+--------------------
+
+If you do need to write a PHY driver, the first thing to do is
+make sure it can be matched with an appropriate PHY device.
+This is done during bus initialization by reading the device's
+UID (stored in registers 2 and 3), then comparing it to each
+driver's phy_id field by ANDing it with each driver's
+phy_id_mask field. Also, it needs a name. Here's an example::
+
+ static struct phy_driver dm9161_driver = {
+ .phy_id = 0x0181b880,
+ .name = "Davicom DM9161E",
+ .phy_id_mask = 0x0ffffff0,
+ ...
+ }
+
+Next, you need to specify what features (speed, duplex, autoneg,
+etc) your PHY device and driver support. Most PHYs support
+PHY_BASIC_FEATURES, but you can look in include/mii.h for other
+features.
+
+Each driver consists of a number of function pointers, documented
+in include/linux/phy.h under the phy_driver structure.
+
+Of these, only config_aneg and read_status are required to be
+assigned by the driver code. The rest are optional. Also, it is
+preferred to use the generic phy driver's versions of these two
+functions if at all possible: genphy_read_status and
+genphy_config_aneg. If this is not possible, it is likely that
+you only need to perform some actions before and after invoking
+these functions, and so your functions will wrap the generic
+ones.
+
+Feel free to look at the Marvell, Cicada, and Davicom drivers in
+drivers/net/phy/ for examples (the lxt and qsemi drivers have
+not been tested as of this writing).
+
+The PHY's MMD register accesses are handled by the PAL framework
+by default, but can be overridden by a specific PHY driver if
+required. This could be the case if a PHY was released for
+manufacturing before the MMD PHY register definitions were
+standardized by the IEEE. Most modern PHYs will be able to use
+the generic PAL framework for accessing the PHY's MMD registers.
+An example of such usage is for Energy Efficient Ethernet support,
+implemented in the PAL. This support uses the PAL to access MMD
+registers for EEE query and configuration if the PHY supports
+the IEEE standard access mechanisms, or can use the PHY's specific
+access interfaces if overridden by the specific PHY driver. See
+the Micrel driver in drivers/net/phy/ for an example of how this
+can be implemented.
+
+Board Fixups
+============
+
+Sometimes the specific interaction between the platform and the PHY requires
+special handling. For instance, to change where the PHY's clock input is,
+or to add a delay to account for latency issues in the data path. In order
+to support such contingencies, the PHY Layer allows platform code to register
+fixups to be run when the PHY is brought up (or subsequently reset).
+
+When the PHY Layer brings up a PHY it checks to see if there are any fixups
+registered for it, matching based on UID (contained in the PHY device's phy_id
+field) and the bus identifier (contained in phydev->dev.bus_id). Both must
+match, however two constants, PHY_ANY_ID and PHY_ANY_UID, are provided as
+wildcards for the bus ID and UID, respectively.
+
+When a match is found, the PHY layer will invoke the run function associated
+with the fixup. This function is passed a pointer to the phy_device of
+interest. It should therefore only operate on that PHY.
+
+The platform code can either register the fixup using phy_register_fixup()::
+
+ int phy_register_fixup(const char *phy_id,
+ u32 phy_uid, u32 phy_uid_mask,
+ int (*run)(struct phy_device *));
+
+Or using one of the two stubs, phy_register_fixup_for_uid() and
+phy_register_fixup_for_id()::
+
+ int phy_register_fixup_for_uid(u32 phy_uid, u32 phy_uid_mask,
+ int (*run)(struct phy_device *));
+ int phy_register_fixup_for_id(const char *phy_id,
+ int (*run)(struct phy_device *));
+
+The stubs set one of the two matching criteria, and set the other one to
+match anything.
+
+When phy_register_fixup() or \*_for_uid()/\*_for_id() is called at module,
+unregister fixup and free allocate memory are required.
+
+Call one of following function before unloading module::
+
+ int phy_unregister_fixup(const char *phy_id, u32 phy_uid, u32 phy_uid_mask);
+ int phy_unregister_fixup_for_uid(u32 phy_uid, u32 phy_uid_mask);
+ int phy_register_fixup_for_id(const char *phy_id);
+
+Standards
+=========
+
+IEEE Standard 802.3: CSMA/CD Access Method and Physical Layer Specifications, Section Two:
+http://standards.ieee.org/getieee802/download/802.3-2008_section2.pdf
+
+RGMII v1.3:
+http://web.archive.org/web/20160303212629/http://www.hp.com/rnd/pdfs/RGMIIv1_3.pdf
+
+RGMII v2.0:
+http://web.archive.org/web/20160303171328/http://www.hp.com/rnd/pdfs/RGMIIv2_0_final_hp.pdf
diff --git a/Documentation/networking/phy.txt b/Documentation/networking/phy.txt
deleted file mode 100644
index bdec0f700bc1..000000000000
--- a/Documentation/networking/phy.txt
+++ /dev/null
@@ -1,427 +0,0 @@
-
--------
-PHY Abstraction Layer
-(Updated 2008-04-08)
-
-Purpose
-
- Most network devices consist of set of registers which provide an interface
- to a MAC layer, which communicates with the physical connection through a
- PHY. The PHY concerns itself with negotiating link parameters with the link
- partner on the other side of the network connection (typically, an ethernet
- cable), and provides a register interface to allow drivers to determine what
- settings were chosen, and to configure what settings are allowed.
-
- While these devices are distinct from the network devices, and conform to a
- standard layout for the registers, it has been common practice to integrate
- the PHY management code with the network driver. This has resulted in large
- amounts of redundant code. Also, on embedded systems with multiple (and
- sometimes quite different) ethernet controllers connected to the same
- management bus, it is difficult to ensure safe use of the bus.
-
- Since the PHYs are devices, and the management busses through which they are
- accessed are, in fact, busses, the PHY Abstraction Layer treats them as such.
- In doing so, it has these goals:
-
- 1) Increase code-reuse
- 2) Increase overall code-maintainability
- 3) Speed development time for new network drivers, and for new systems
-
- Basically, this layer is meant to provide an interface to PHY devices which
- allows network driver writers to write as little code as possible, while
- still providing a full feature set.
-
-The MDIO bus
-
- Most network devices are connected to a PHY by means of a management bus.
- Different devices use different busses (though some share common interfaces).
- In order to take advantage of the PAL, each bus interface needs to be
- registered as a distinct device.
-
- 1) read and write functions must be implemented. Their prototypes are:
-
- int write(struct mii_bus *bus, int mii_id, int regnum, u16 value);
- int read(struct mii_bus *bus, int mii_id, int regnum);
-
- mii_id is the address on the bus for the PHY, and regnum is the register
- number. These functions are guaranteed not to be called from interrupt
- time, so it is safe for them to block, waiting for an interrupt to signal
- the operation is complete
-
- 2) A reset function is optional. This is used to return the bus to an
- initialized state.
-
- 3) A probe function is needed. This function should set up anything the bus
- driver needs, setup the mii_bus structure, and register with the PAL using
- mdiobus_register. Similarly, there's a remove function to undo all of
- that (use mdiobus_unregister).
-
- 4) Like any driver, the device_driver structure must be configured, and init
- exit functions are used to register the driver.
-
- 5) The bus must also be declared somewhere as a device, and registered.
-
- As an example for how one driver implemented an mdio bus driver, see
- drivers/net/ethernet/freescale/fsl_pq_mdio.c and an associated DTS file
- for one of the users. (e.g. "git grep fsl,.*-mdio arch/powerpc/boot/dts/")
-
-(RG)MII/electrical interface considerations
-
- The Reduced Gigabit Medium Independent Interface (RGMII) is a 12-pin
- electrical signal interface using a synchronous 125Mhz clock signal and several
- data lines. Due to this design decision, a 1.5ns to 2ns delay must be added
- between the clock line (RXC or TXC) and the data lines to let the PHY (clock
- sink) have enough setup and hold times to sample the data lines correctly. The
- PHY library offers different types of PHY_INTERFACE_MODE_RGMII* values to let
- the PHY driver and optionally the MAC driver, implement the required delay. The
- values of phy_interface_t must be understood from the perspective of the PHY
- device itself, leading to the following:
-
- * PHY_INTERFACE_MODE_RGMII: the PHY is not responsible for inserting any
- internal delay by itself, it assumes that either the Ethernet MAC (if capable
- or the PCB traces) insert the correct 1.5-2ns delay
-
- * PHY_INTERFACE_MODE_RGMII_TXID: the PHY should insert an internal delay
- for the transmit data lines (TXD[3:0]) processed by the PHY device
-
- * PHY_INTERFACE_MODE_RGMII_RXID: the PHY should insert an internal delay
- for the receive data lines (RXD[3:0]) processed by the PHY device
-
- * PHY_INTERFACE_MODE_RGMII_ID: the PHY should insert internal delays for
- both transmit AND receive data lines from/to the PHY device
-
- Whenever possible, use the PHY side RGMII delay for these reasons:
-
- * PHY devices may offer sub-nanosecond granularity in how they allow a
- receiver/transmitter side delay (e.g: 0.5, 1.0, 1.5ns) to be specified. Such
- precision may be required to account for differences in PCB trace lengths
-
- * PHY devices are typically qualified for a large range of applications
- (industrial, medical, automotive...), and they provide a constant and
- reliable delay across temperature/pressure/voltage ranges
-
- * PHY device drivers in PHYLIB being reusable by nature, being able to
- configure correctly a specified delay enables more designs with similar delay
- requirements to be operate correctly
-
- For cases where the PHY is not capable of providing this delay, but the
- Ethernet MAC driver is capable of doing so, the correct phy_interface_t value
- should be PHY_INTERFACE_MODE_RGMII, and the Ethernet MAC driver should be
- configured correctly in order to provide the required transmit and/or receive
- side delay from the perspective of the PHY device. Conversely, if the Ethernet
- MAC driver looks at the phy_interface_t value, for any other mode but
- PHY_INTERFACE_MODE_RGMII, it should make sure that the MAC-level delays are
- disabled.
-
- In case neither the Ethernet MAC, nor the PHY are capable of providing the
- required delays, as defined per the RGMII standard, several options may be
- available:
-
- * Some SoCs may offer a pin pad/mux/controller capable of configuring a given
- set of pins'strength, delays, and voltage; and it may be a suitable
- option to insert the expected 2ns RGMII delay.
-
- * Modifying the PCB design to include a fixed delay (e.g: using a specifically
- designed serpentine), which may not require software configuration at all.
-
-Common problems with RGMII delay mismatch
-
- When there is a RGMII delay mismatch between the Ethernet MAC and the PHY, this
- will most likely result in the clock and data line signals to be unstable when
- the PHY or MAC take a snapshot of these signals to translate them into logical
- 1 or 0 states and reconstruct the data being transmitted/received. Typical
- symptoms include:
-
- * Transmission/reception partially works, and there is frequent or occasional
- packet loss observed
-
- * Ethernet MAC may report some or all packets ingressing with a FCS/CRC error,
- or just discard them all
-
- * Switching to lower speeds such as 10/100Mbits/sec makes the problem go away
- (since there is enough setup/hold time in that case)
-
-
-Connecting to a PHY
-
- Sometime during startup, the network driver needs to establish a connection
- between the PHY device, and the network device. At this time, the PHY's bus
- and drivers need to all have been loaded, so it is ready for the connection.
- At this point, there are several ways to connect to the PHY:
-
- 1) The PAL handles everything, and only calls the network driver when
- the link state changes, so it can react.
-
- 2) The PAL handles everything except interrupts (usually because the
- controller has the interrupt registers).
-
- 3) The PAL handles everything, but checks in with the driver every second,
- allowing the network driver to react first to any changes before the PAL
- does.
-
- 4) The PAL serves only as a library of functions, with the network device
- manually calling functions to update status, and configure the PHY
-
-
-Letting the PHY Abstraction Layer do Everything
-
- If you choose option 1 (The hope is that every driver can, but to still be
- useful to drivers that can't), connecting to the PHY is simple:
-
- First, you need a function to react to changes in the link state. This
- function follows this protocol:
-
- static void adjust_link(struct net_device *dev);
-
- Next, you need to know the device name of the PHY connected to this device.
- The name will look something like, "0:00", where the first number is the
- bus id, and the second is the PHY's address on that bus. Typically,
- the bus is responsible for making its ID unique.
-
- Now, to connect, just call this function:
-
- phydev = phy_connect(dev, phy_name, &adjust_link, interface);
-
- phydev is a pointer to the phy_device structure which represents the PHY. If
- phy_connect is successful, it will return the pointer. dev, here, is the
- pointer to your net_device. Once done, this function will have started the
- PHY's software state machine, and registered for the PHY's interrupt, if it
- has one. The phydev structure will be populated with information about the
- current state, though the PHY will not yet be truly operational at this
- point.
-
- PHY-specific flags should be set in phydev->dev_flags prior to the call
- to phy_connect() such that the underlying PHY driver can check for flags
- and perform specific operations based on them.
- This is useful if the system has put hardware restrictions on
- the PHY/controller, of which the PHY needs to be aware.
-
- interface is a u32 which specifies the connection type used
- between the controller and the PHY. Examples are GMII, MII,
- RGMII, and SGMII. For a full list, see include/linux/phy.h
-
- Now just make sure that phydev->supported and phydev->advertising have any
- values pruned from them which don't make sense for your controller (a 10/100
- controller may be connected to a gigabit capable PHY, so you would need to
- mask off SUPPORTED_1000baseT*). See include/linux/ethtool.h for definitions
- for these bitfields. Note that you should not SET any bits, except the
- SUPPORTED_Pause and SUPPORTED_AsymPause bits (see below), or the PHY may get
- put into an unsupported state.
-
- Lastly, once the controller is ready to handle network traffic, you call
- phy_start(phydev). This tells the PAL that you are ready, and configures the
- PHY to connect to the network. If you want to handle your own interrupts,
- just set phydev->irq to PHY_IGNORE_INTERRUPT before you call phy_start.
- Similarly, if you don't want to use interrupts, set phydev->irq to PHY_POLL.
-
- When you want to disconnect from the network (even if just briefly), you call
- phy_stop(phydev).
-
-Pause frames / flow control
-
- The PHY does not participate directly in flow control/pause frames except by
- making sure that the SUPPORTED_Pause and SUPPORTED_AsymPause bits are set in
- MII_ADVERTISE to indicate towards the link partner that the Ethernet MAC
- controller supports such a thing. Since flow control/pause frames generation
- involves the Ethernet MAC driver, it is recommended that this driver takes care
- of properly indicating advertisement and support for such features by setting
- the SUPPORTED_Pause and SUPPORTED_AsymPause bits accordingly. This can be done
- either before or after phy_connect() and/or as a result of implementing the
- ethtool::set_pauseparam feature.
-
-
-Keeping Close Tabs on the PAL
-
- It is possible that the PAL's built-in state machine needs a little help to
- keep your network device and the PHY properly in sync. If so, you can
- register a helper function when connecting to the PHY, which will be called
- every second before the state machine reacts to any changes. To do this, you
- need to manually call phy_attach() and phy_prepare_link(), and then call
- phy_start_machine() with the second argument set to point to your special
- handler.
-
- Currently there are no examples of how to use this functionality, and testing
- on it has been limited because the author does not have any drivers which use
- it (they all use option 1). So Caveat Emptor.
-
-Doing it all yourself
-
- There's a remote chance that the PAL's built-in state machine cannot track
- the complex interactions between the PHY and your network device. If this is
- so, you can simply call phy_attach(), and not call phy_start_machine or
- phy_prepare_link(). This will mean that phydev->state is entirely yours to
- handle (phy_start and phy_stop toggle between some of the states, so you
- might need to avoid them).
-
- An effort has been made to make sure that useful functionality can be
- accessed without the state-machine running, and most of these functions are
- descended from functions which did not interact with a complex state-machine.
- However, again, no effort has been made so far to test running without the
- state machine, so tryer beware.
-
- Here is a brief rundown of the functions:
-
- int phy_read(struct phy_device *phydev, u16 regnum);
- int phy_write(struct phy_device *phydev, u16 regnum, u16 val);
-
- Simple read/write primitives. They invoke the bus's read/write function
- pointers.
-
- void phy_print_status(struct phy_device *phydev);
-
- A convenience function to print out the PHY status neatly.
-
- int phy_start_interrupts(struct phy_device *phydev);
- int phy_stop_interrupts(struct phy_device *phydev);
-
- Requests the IRQ for the PHY interrupts, then enables them for
- start, or disables then frees them for stop.
-
- struct phy_device * phy_attach(struct net_device *dev, const char *phy_id,
- phy_interface_t interface);
-
- Attaches a network device to a particular PHY, binding the PHY to a generic
- driver if none was found during bus initialization.
-
- int phy_start_aneg(struct phy_device *phydev);
-
- Using variables inside the phydev structure, either configures advertising
- and resets autonegotiation, or disables autonegotiation, and configures
- forced settings.
-
- static inline int phy_read_status(struct phy_device *phydev);
-
- Fills the phydev structure with up-to-date information about the current
- settings in the PHY.
-
- int phy_ethtool_sset(struct phy_device *phydev, struct ethtool_cmd *cmd);
-
- Ethtool convenience functions.
-
- int phy_mii_ioctl(struct phy_device *phydev,
- struct mii_ioctl_data *mii_data, int cmd);
-
- The MII ioctl. Note that this function will completely screw up the state
- machine if you write registers like BMCR, BMSR, ADVERTISE, etc. Best to
- use this only to write registers which are not standard, and don't set off
- a renegotiation.
-
-
-PHY Device Drivers
-
- With the PHY Abstraction Layer, adding support for new PHYs is
- quite easy. In some cases, no work is required at all! However,
- many PHYs require a little hand-holding to get up-and-running.
-
-Generic PHY driver
-
- If the desired PHY doesn't have any errata, quirks, or special
- features you want to support, then it may be best to not add
- support, and let the PHY Abstraction Layer's Generic PHY Driver
- do all of the work.
-
-Writing a PHY driver
-
- If you do need to write a PHY driver, the first thing to do is
- make sure it can be matched with an appropriate PHY device.
- This is done during bus initialization by reading the device's
- UID (stored in registers 2 and 3), then comparing it to each
- driver's phy_id field by ANDing it with each driver's
- phy_id_mask field. Also, it needs a name. Here's an example:
-
- static struct phy_driver dm9161_driver = {
- .phy_id = 0x0181b880,
- .name = "Davicom DM9161E",
- .phy_id_mask = 0x0ffffff0,
- ...
- }
-
- Next, you need to specify what features (speed, duplex, autoneg,
- etc) your PHY device and driver support. Most PHYs support
- PHY_BASIC_FEATURES, but you can look in include/mii.h for other
- features.
-
- Each driver consists of a number of function pointers, documented
- in include/linux/phy.h under the phy_driver structure.
-
- Of these, only config_aneg and read_status are required to be
- assigned by the driver code. The rest are optional. Also, it is
- preferred to use the generic phy driver's versions of these two
- functions if at all possible: genphy_read_status and
- genphy_config_aneg. If this is not possible, it is likely that
- you only need to perform some actions before and after invoking
- these functions, and so your functions will wrap the generic
- ones.
-
- Feel free to look at the Marvell, Cicada, and Davicom drivers in
- drivers/net/phy/ for examples (the lxt and qsemi drivers have
- not been tested as of this writing).
-
- The PHY's MMD register accesses are handled by the PAL framework
- by default, but can be overridden by a specific PHY driver if
- required. This could be the case if a PHY was released for
- manufacturing before the MMD PHY register definitions were
- standardized by the IEEE. Most modern PHYs will be able to use
- the generic PAL framework for accessing the PHY's MMD registers.
- An example of such usage is for Energy Efficient Ethernet support,
- implemented in the PAL. This support uses the PAL to access MMD
- registers for EEE query and configuration if the PHY supports
- the IEEE standard access mechanisms, or can use the PHY's specific
- access interfaces if overridden by the specific PHY driver. See
- the Micrel driver in drivers/net/phy/ for an example of how this
- can be implemented.
-
-Board Fixups
-
- Sometimes the specific interaction between the platform and the PHY requires
- special handling. For instance, to change where the PHY's clock input is,
- or to add a delay to account for latency issues in the data path. In order
- to support such contingencies, the PHY Layer allows platform code to register
- fixups to be run when the PHY is brought up (or subsequently reset).
-
- When the PHY Layer brings up a PHY it checks to see if there are any fixups
- registered for it, matching based on UID (contained in the PHY device's phy_id
- field) and the bus identifier (contained in phydev->dev.bus_id). Both must
- match, however two constants, PHY_ANY_ID and PHY_ANY_UID, are provided as
- wildcards for the bus ID and UID, respectively.
-
- When a match is found, the PHY layer will invoke the run function associated
- with the fixup. This function is passed a pointer to the phy_device of
- interest. It should therefore only operate on that PHY.
-
- The platform code can either register the fixup using phy_register_fixup():
-
- int phy_register_fixup(const char *phy_id,
- u32 phy_uid, u32 phy_uid_mask,
- int (*run)(struct phy_device *));
-
- Or using one of the two stubs, phy_register_fixup_for_uid() and
- phy_register_fixup_for_id():
-
- int phy_register_fixup_for_uid(u32 phy_uid, u32 phy_uid_mask,
- int (*run)(struct phy_device *));
- int phy_register_fixup_for_id(const char *phy_id,
- int (*run)(struct phy_device *));
-
- The stubs set one of the two matching criteria, and set the other one to
- match anything.
-
- When phy_register_fixup() or *_for_uid()/*_for_id() is called at module,
- unregister fixup and free allocate memory are required.
-
- Call one of following function before unloading module.
-
- int phy_unregister_fixup(const char *phy_id, u32 phy_uid, u32 phy_uid_mask);
- int phy_unregister_fixup_for_uid(u32 phy_uid, u32 phy_uid_mask);
- int phy_register_fixup_for_id(const char *phy_id);
-
-Standards
-
- IEEE Standard 802.3: CSMA/CD Access Method and Physical Layer Specifications, Section Two:
- http://standards.ieee.org/getieee802/download/802.3-2008_section2.pdf
-
- RGMII v1.3:
- http://web.archive.org/web/20160303212629/http://www.hp.com/rnd/pdfs/RGMIIv1_3.pdf
-
- RGMII v2.0:
- http://web.archive.org/web/20160303171328/http://www.hp.com/rnd/pdfs/RGMIIv2_0_final_hp.pdf
diff --git a/Documentation/networking/scaling.txt b/Documentation/networking/scaling.rst
index b7056a8a0540..f78d7bf27ff5 100644
--- a/Documentation/networking/scaling.txt
+++ b/Documentation/networking/scaling.rst
@@ -1,4 +1,8 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================================
Scaling in the Linux Networking Stack
+=====================================
Introduction
@@ -10,11 +14,11 @@ multi-processor systems.
The following technologies are described:
- RSS: Receive Side Scaling
- RPS: Receive Packet Steering
- RFS: Receive Flow Steering
- Accelerated Receive Flow Steering
- XPS: Transmit Packet Steering
+- RSS: Receive Side Scaling
+- RPS: Receive Packet Steering
+- RFS: Receive Flow Steering
+- Accelerated Receive Flow Steering
+- XPS: Transmit Packet Steering
RSS: Receive Side Scaling
@@ -45,7 +49,9 @@ programmable filters. For example, webserver bound TCP port 80 packets
can be directed to their own receive queue. Such “n-tuple” filters can
be configured from ethtool (--config-ntuple).
-==== RSS Configuration
+
+RSS Configuration
+-----------------
The driver for a multi-queue capable NIC typically provides a kernel
module parameter for specifying the number of hardware queues to
@@ -63,7 +69,9 @@ commands (--show-rxfh-indir and --set-rxfh-indir). Modifying the
indirection table could be done to give different queues different
relative weights.
-== RSS IRQ Configuration
+
+RSS IRQ Configuration
+~~~~~~~~~~~~~~~~~~~~~
Each receive queue has a separate IRQ associated with it. The NIC triggers
this to notify a CPU when new packets arrive on the given queue. The
@@ -77,7 +85,9 @@ affinity of each interrupt see Documentation/IRQ-affinity.txt. Some systems
will be running irqbalance, a daemon that dynamically optimizes IRQ
assignments and as a result may override any manual settings.
-== Suggested Configuration
+
+Suggested Configuration
+~~~~~~~~~~~~~~~~~~~~~~~
RSS should be enabled when latency is a concern or whenever receive
interrupt processing forms a bottleneck. Spreading load between CPUs
@@ -105,10 +115,12 @@ Whereas RSS selects the queue and hence CPU that will run the hardware
interrupt handler, RPS selects the CPU to perform protocol processing
above the interrupt handler. This is accomplished by placing the packet
on the desired CPU’s backlog queue and waking up the CPU for processing.
-RPS has some advantages over RSS: 1) it can be used with any NIC,
-2) software filters can easily be added to hash over new protocols,
+RPS has some advantages over RSS:
+
+1) it can be used with any NIC
+2) software filters can easily be added to hash over new protocols
3) it does not increase hardware device interrupt rate (although it does
-introduce inter-processor interrupts (IPIs)).
+ introduce inter-processor interrupts (IPIs))
RPS is called during bottom half of the receive interrupt handler, when
a driver sends a packet up the network stack with netif_rx() or
@@ -135,21 +147,25 @@ packets have been queued to their backlog queue. The IPI wakes backlog
processing on the remote CPU, and any queued packets are then processed
up the networking stack.
-==== RPS Configuration
+
+RPS Configuration
+-----------------
RPS requires a kernel compiled with the CONFIG_RPS kconfig symbol (on
by default for SMP). Even when compiled in, RPS remains disabled until
explicitly configured. The list of CPUs to which RPS may forward traffic
-can be configured for each receive queue using a sysfs file entry:
+can be configured for each receive queue using a sysfs file entry::
- /sys/class/net/<dev>/queues/rx-<n>/rps_cpus
+ /sys/class/net/<dev>/queues/rx-<n>/rps_cpus
This file implements a bitmap of CPUs. RPS is disabled when it is zero
(the default), in which case packets are processed on the interrupting
CPU. Documentation/IRQ-affinity.txt explains how CPUs are assigned to
the bitmap.
-== Suggested Configuration
+
+Suggested Configuration
+~~~~~~~~~~~~~~~~~~~~~~~
For a single queue device, a typical RPS configuration would be to set
the rps_cpus to the CPUs in the same memory domain of the interrupting
@@ -163,7 +179,9 @@ and unnecessary. If there are fewer hardware queues than CPUs, then
RPS might be beneficial if the rps_cpus for each queue are the ones that
share the same memory domain as the interrupting CPU for that queue.
-==== RPS Flow Limit
+
+RPS Flow Limit
+--------------
RPS scales kernel receive processing across CPUs without introducing
reordering. The trade-off to sending all packets from the same flow
@@ -187,29 +205,33 @@ No packets are dropped when the input packet queue length is below
the threshold, so flow limit does not sever connections outright:
even large flows maintain connectivity.
-== Interface
+
+Interface
+~~~~~~~~~
Flow limit is compiled in by default (CONFIG_NET_FLOW_LIMIT), but not
turned on. It is implemented for each CPU independently (to avoid lock
and cache contention) and toggled per CPU by setting the relevant bit
in sysctl net.core.flow_limit_cpu_bitmap. It exposes the same CPU
-bitmap interface as rps_cpus (see above) when called from procfs:
+bitmap interface as rps_cpus (see above) when called from procfs::
- /proc/sys/net/core/flow_limit_cpu_bitmap
+ /proc/sys/net/core/flow_limit_cpu_bitmap
Per-flow rate is calculated by hashing each packet into a hashtable
bucket and incrementing a per-bucket counter. The hash function is
the same that selects a CPU in RPS, but as the number of buckets can
be much larger than the number of CPUs, flow limit has finer-grained
identification of large flows and fewer false positives. The default
-table has 4096 buckets. This value can be modified through sysctl
+table has 4096 buckets. This value can be modified through sysctl::
- net.core.flow_limit_table_len
+ net.core.flow_limit_table_len
The value is only consulted when a new table is allocated. Modifying
it does not update active tables.
-== Suggested Configuration
+
+Suggested Configuration
+~~~~~~~~~~~~~~~~~~~~~~~
Flow limit is useful on systems with many concurrent connections,
where a single connection taking up 50% of a CPU indicates a problem.
@@ -280,10 +302,10 @@ table), the packet is enqueued onto that CPU’s backlog. If they differ,
the current CPU is updated to match the desired CPU if one of the
following is true:
-- The current CPU's queue head counter >= the recorded tail counter
- value in rps_dev_flow[i]
-- The current CPU is unset (>= nr_cpu_ids)
-- The current CPU is offline
+ - The current CPU's queue head counter >= the recorded tail counter
+ value in rps_dev_flow[i]
+ - The current CPU is unset (>= nr_cpu_ids)
+ - The current CPU is offline
After this check, the packet is sent to the (possibly updated) current
CPU. These rules aim to ensure that a flow only moves to a new CPU when
@@ -291,19 +313,23 @@ there are no packets outstanding on the old CPU, as the outstanding
packets could arrive later than those about to be processed on the new
CPU.
-==== RFS Configuration
+
+RFS Configuration
+-----------------
RFS is only available if the kconfig symbol CONFIG_RPS is enabled (on
by default for SMP). The functionality remains disabled until explicitly
-configured. The number of entries in the global flow table is set through:
+configured. The number of entries in the global flow table is set through::
+
+ /proc/sys/net/core/rps_sock_flow_entries
- /proc/sys/net/core/rps_sock_flow_entries
+The number of entries in the per-queue flow table are set through::
-The number of entries in the per-queue flow table are set through:
+ /sys/class/net/<dev>/queues/rx-<n>/rps_flow_cnt
- /sys/class/net/<dev>/queues/rx-<n>/rps_flow_cnt
-== Suggested Configuration
+Suggested Configuration
+~~~~~~~~~~~~~~~~~~~~~~~
Both of these need to be set before RFS is enabled for a receive queue.
Values for both are rounded up to the nearest power of two. The
@@ -347,7 +373,9 @@ functions in the cpu_rmap (“CPU affinity reverse map”) kernel library
to populate the map. For each CPU, the corresponding queue in the map is
set to be one whose processing CPU is closest in cache locality.
-==== Accelerated RFS Configuration
+
+Accelerated RFS Configuration
+-----------------------------
Accelerated RFS is only available if the kernel is compiled with
CONFIG_RFS_ACCEL and support is provided by the NIC device and driver.
@@ -356,11 +384,14 @@ of CPU to queues is automatically deduced from the IRQ affinities
configured for each receive queue by the driver, so no additional
configuration should be necessary.
-== Suggested Configuration
+
+Suggested Configuration
+~~~~~~~~~~~~~~~~~~~~~~~
This technique should be enabled whenever one wants to use RFS and the
NIC supports hardware acceleration.
+
XPS: Transmit Packet Steering
=============================
@@ -430,20 +461,25 @@ transport layer is responsible for setting ooo_okay appropriately. TCP,
for instance, sets the flag when all data for a connection has been
acknowledged.
-==== XPS Configuration
+XPS Configuration
+-----------------
XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by
default for SMP). The functionality remains disabled until explicitly
configured. To enable XPS, the bitmap of CPUs/receive-queues that may
use a transmit queue is configured using the sysfs file entry:
-For selection based on CPUs map:
-/sys/class/net/<dev>/queues/tx-<n>/xps_cpus
+For selection based on CPUs map::
+
+ /sys/class/net/<dev>/queues/tx-<n>/xps_cpus
+
+For selection based on receive-queues map::
+
+ /sys/class/net/<dev>/queues/tx-<n>/xps_rxqs
-For selection based on receive-queues map:
-/sys/class/net/<dev>/queues/tx-<n>/xps_rxqs
-== Suggested Configuration
+Suggested Configuration
+~~~~~~~~~~~~~~~~~~~~~~~
For a network device with a single transmission queue, XPS configuration
has no effect, since there is no choice in this case. In a multi-queue
@@ -460,16 +496,18 @@ explicitly configured mapping receive-queue(s) to transmit queue(s). If the
user configuration for receive-queue map does not apply, then the transmit
queue is selected based on the CPUs map.
-Per TX Queue rate limitation:
-=============================
+
+Per TX Queue rate limitation
+============================
These are rate-limitation mechanisms implemented by HW, where currently
-a max-rate attribute is supported, by setting a Mbps value to
+a max-rate attribute is supported, by setting a Mbps value to::
-/sys/class/net/<dev>/queues/tx-<n>/tx_maxrate
+ /sys/class/net/<dev>/queues/tx-<n>/tx_maxrate
A value of zero means disabled, and this is the default.
+
Further Information
===================
RPS and RFS were introduced in kernel 2.6.35. XPS was incorporated into
@@ -480,5 +518,6 @@ Accelerated RFS was introduced in 2.6.35. Original patches were
submitted by Ben Hutchings (bwh@kernel.org)
Authors:
-Tom Herbert (therbert@google.com)
-Willem de Bruijn (willemb@google.com)
+
+- Tom Herbert (therbert@google.com)
+- Willem de Bruijn (willemb@google.com)
diff --git a/Documentation/networking/segmentation-offloads.txt b/Documentation/networking/segmentation-offloads.rst
index aca542ec125c..89d1ee933e9f 100644
--- a/Documentation/networking/segmentation-offloads.txt
+++ b/Documentation/networking/segmentation-offloads.rst
@@ -1,4 +1,9 @@
-Segmentation Offloads in the Linux Networking Stack
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Segmentation Offloads
+=====================
+
Introduction
============
@@ -15,6 +20,7 @@ The following technologies are described:
* Partial Generic Segmentation Offload - GSO_PARTIAL
* SCTP accelleration with GSO - GSO_BY_FRAGS
+
TCP Segmentation Offload
========================
@@ -42,6 +48,7 @@ NETIF_F_TSO_MANGLEID set then the IP ID can be ignored when performing TSO
and we will either increment the IP ID for all frames, or leave it at a
static value based on driver preference.
+
UDP Fragmentation Offload
=========================
@@ -54,6 +61,7 @@ UFO is deprecated: modern kernels will no longer generate UFO skbs, but can
still receive them from tuntap and similar devices. Offload of UDP-based
tunnel protocols is still supported.
+
IPIP, SIT, GRE, UDP Tunnel, and Remote Checksum Offloads
========================================================
@@ -71,17 +79,19 @@ refer to the tunnel headers as the outer headers, while the encapsulated
data is normally referred to as the inner headers. Below is the list of
calls to access the given headers:
-IPIP/SIT Tunnel:
- Outer Inner
-MAC skb_mac_header
-Network skb_network_header skb_inner_network_header
-Transport skb_transport_header
+IPIP/SIT Tunnel::
+
+ Outer Inner
+ MAC skb_mac_header
+ Network skb_network_header skb_inner_network_header
+ Transport skb_transport_header
-UDP/GRE Tunnel:
- Outer Inner
-MAC skb_mac_header skb_inner_mac_header
-Network skb_network_header skb_inner_network_header
-Transport skb_transport_header skb_inner_transport_header
+UDP/GRE Tunnel::
+
+ Outer Inner
+ MAC skb_mac_header skb_inner_mac_header
+ Network skb_network_header skb_inner_network_header
+ Transport skb_transport_header skb_inner_transport_header
In addition to the above tunnel types there are also SKB_GSO_GRE_CSUM and
SKB_GSO_UDP_TUNNEL_CSUM. These two additional tunnel types reflect the
@@ -93,6 +103,7 @@ header has requested a remote checksum offload. In this case the inner
headers will be left with a partial checksum and only the outer header
checksum will be computed.
+
Generic Segmentation Offload
============================
@@ -106,6 +117,7 @@ Before enabling any hardware segmentation offload a corresponding software
offload is required in GSO. Otherwise it becomes possible for a frame to
be re-routed between devices and end up being unable to be transmitted.
+
Generic Receive Offload
=======================
@@ -117,6 +129,7 @@ this is IPv4 ID in the case that the DF bit is set for a given IP header.
If the value of the IPv4 ID is not sequentially incrementing it will be
altered so that it is when a frame assembled via GRO is segmented via GSO.
+
Partial Generic Segmentation Offload
====================================
@@ -134,6 +147,7 @@ is the outer IPv4 ID field. It is up to the device drivers to guarantee
that the IPv4 ID field is incremented in the case that a given header does
not have the DF bit set.
+
SCTP accelleration with GSO
===========================
@@ -157,14 +171,14 @@ appropriately.
There are some helpers to make this easier:
- - skb_is_gso(skb) && skb_is_gso_sctp(skb) is the best way to see if
- an skb is an SCTP GSO skb.
+- skb_is_gso(skb) && skb_is_gso_sctp(skb) is the best way to see if
+ an skb is an SCTP GSO skb.
- - For size checks, the skb_gso_validate_*_len family of helpers correctly
- considers GSO_BY_FRAGS.
+- For size checks, the skb_gso_validate_*_len family of helpers correctly
+ considers GSO_BY_FRAGS.
- - For manipulating packets, skb_increase_gso_size and skb_decrease_gso_size
- will check for GSO_BY_FRAGS and WARN if asked to manipulate these skbs.
+- For manipulating packets, skb_increase_gso_size and skb_decrease_gso_size
+ will check for GSO_BY_FRAGS and WARN if asked to manipulate these skbs.
This also affects drivers with the NETIF_F_FRAGLIST & NETIF_F_GSO_SCTP bits
set. Note also that NETIF_F_GSO_SCTP is included in NETIF_F_GSO_SOFTWARE.
diff --git a/Documentation/networking/sfp-phylink.rst b/Documentation/networking/sfp-phylink.rst
new file mode 100644
index 000000000000..5bd26cb07244
--- /dev/null
+++ b/Documentation/networking/sfp-phylink.rst
@@ -0,0 +1,268 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======
+phylink
+=======
+
+Overview
+========
+
+phylink is a mechanism to support hot-pluggable networking modules
+without needing to re-initialise the adapter on hot-plug events.
+
+phylink supports conventional phylib-based setups, fixed link setups
+and SFP (Small Formfactor Pluggable) modules at present.
+
+Modes of operation
+==================
+
+phylink has several modes of operation, which depend on the firmware
+settings.
+
+1. PHY mode
+
+ In PHY mode, we use phylib to read the current link settings from
+ the PHY, and pass them to the MAC driver. We expect the MAC driver
+ to configure exactly the modes that are specified without any
+ negotiation being enabled on the link.
+
+2. Fixed mode
+
+ Fixed mode is the same as PHY mode as far as the MAC driver is
+ concerned.
+
+3. In-band mode
+
+ In-band mode is used with 802.3z, SGMII and similar interface modes,
+ and we are expecting to use and honor the in-band negotiation or
+ control word sent across the serdes channel.
+
+By example, what this means is that:
+
+.. code-block:: none
+
+ &eth {
+ phy = <&phy>;
+ phy-mode = "sgmii";
+ };
+
+does not use in-band SGMII signalling. The PHY is expected to follow
+exactly the settings given to it in its :c:func:`mac_config` function.
+The link should be forced up or down appropriately in the
+:c:func:`mac_link_up` and :c:func:`mac_link_down` functions.
+
+.. code-block:: none
+
+ &eth {
+ managed = "in-band-status";
+ phy = <&phy>;
+ phy-mode = "sgmii";
+ };
+
+uses in-band mode, where results from the PHY's negotiation are passed
+to the MAC through the SGMII control word, and the MAC is expected to
+acknowledge the control word. The :c:func:`mac_link_up` and
+:c:func:`mac_link_down` functions must not force the MAC side link
+up and down.
+
+Rough guide to converting a network driver to sfp/phylink
+=========================================================
+
+This guide briefly describes how to convert a network driver from
+phylib to the sfp/phylink support. Please send patches to improve
+this documentation.
+
+1. Optionally split the network driver's phylib update function into
+ three parts dealing with link-down, link-up and reconfiguring the
+ MAC settings. This can be done as a separate preparation commit.
+
+ An example of this preparation can be found in git commit fc548b991fb0.
+
+2. Replace::
+
+ select FIXED_PHY
+ select PHYLIB
+
+ with::
+
+ select PHYLINK
+
+ in the driver's Kconfig stanza.
+
+3. Add::
+
+ #include <linux/phylink.h>
+
+ to the driver's list of header files.
+
+4. Add::
+
+ struct phylink *phylink;
+
+ to the driver's private data structure. We shall refer to the
+ driver's private data pointer as ``priv`` below, and the driver's
+ private data structure as ``struct foo_priv``.
+
+5. Replace the following functions:
+
+ .. flat-table::
+ :header-rows: 1
+ :widths: 1 1
+ :stub-columns: 0
+
+ * - Original function
+ - Replacement function
+ * - phy_start(phydev)
+ - phylink_start(priv->phylink)
+ * - phy_stop(phydev)
+ - phylink_stop(priv->phylink)
+ * - phy_mii_ioctl(phydev, ifr, cmd)
+ - phylink_mii_ioctl(priv->phylink, ifr, cmd)
+ * - phy_ethtool_get_wol(phydev, wol)
+ - phylink_ethtool_get_wol(priv->phylink, wol)
+ * - phy_ethtool_set_wol(phydev, wol)
+ - phylink_ethtool_set_wol(priv->phylink, wol)
+ * - phy_disconnect(phydev)
+ - phylink_disconnect_phy(priv->phylink)
+
+ Please note that some of these functions must be called under the
+ rtnl lock, and will warn if not. This will normally be the case,
+ except if these are called from the driver suspend/resume paths.
+
+6. Add/replace ksettings get/set methods with:
+
+ .. code-block:: c
+
+ static int foo_ethtool_set_link_ksettings(struct net_device *dev,
+ const struct ethtool_link_ksettings *cmd)
+ {
+ struct foo_priv *priv = netdev_priv(dev);
+
+ return phylink_ethtool_ksettings_set(priv->phylink, cmd);
+ }
+
+ static int foo_ethtool_get_link_ksettings(struct net_device *dev,
+ struct ethtool_link_ksettings *cmd)
+ {
+ struct foo_priv *priv = netdev_priv(dev);
+
+ return phylink_ethtool_ksettings_get(priv->phylink, cmd);
+ }
+
+7. Replace the call to:
+
+ phy_dev = of_phy_connect(dev, node, link_func, flags, phy_interface);
+
+ and associated code with a call to:
+
+ err = phylink_of_phy_connect(priv->phylink, node, flags);
+
+ For the most part, ``flags`` can be zero; these flags are passed to
+ the of_phy_attach() inside this function call if a PHY is specified
+ in the DT node ``node``.
+
+ ``node`` should be the DT node which contains the network phy property,
+ fixed link properties, and will also contain the sfp property.
+
+ The setup of fixed links should also be removed; these are handled
+ internally by phylink.
+
+ of_phy_connect() was also passed a function pointer for link updates.
+ This function is replaced by a different form of MAC updates
+ described below in (8).
+
+ Manipulation of the PHY's supported/advertised happens within phylink
+ based on the validate callback, see below in (8).
+
+ Note that the driver no longer needs to store the ``phy_interface``,
+ and also note that ``phy_interface`` becomes a dynamic property,
+ just like the speed, duplex etc. settings.
+
+ Finally, note that the MAC driver has no direct access to the PHY
+ anymore; that is because in the phylink model, the PHY can be
+ dynamic.
+
+8. Add a :c:type:`struct phylink_mac_ops <phylink_mac_ops>` instance to
+ the driver, which is a table of function pointers, and implement
+ these functions. The old link update function for
+ :c:func:`of_phy_connect` becomes three methods: :c:func:`mac_link_up`,
+ :c:func:`mac_link_down`, and :c:func:`mac_config`. If step 1 was
+ performed, then the functionality will have been split there.
+
+ It is important that if in-band negotiation is used,
+ :c:func:`mac_link_up` and :c:func:`mac_link_down` do not prevent the
+ in-band negotiation from completing, since these functions are called
+ when the in-band link state changes - otherwise the link will never
+ come up.
+
+ The :c:func:`validate` method should mask the supplied supported mask,
+ and ``state->advertising`` with the supported ethtool link modes.
+ These are the new ethtool link modes, so bitmask operations must be
+ used. For an example, see drivers/net/ethernet/marvell/mvneta.c.
+
+ The :c:func:`mac_link_state` method is used to read the link state
+ from the MAC, and report back the settings that the MAC is currently
+ using. This is particularly important for in-band negotiation
+ methods such as 1000base-X and SGMII.
+
+ The :c:func:`mac_config` method is used to update the MAC with the
+ requested state, and must avoid unnecessarily taking the link down
+ when making changes to the MAC configuration. This means the
+ function should modify the state and only take the link down when
+ absolutely necessary to change the MAC configuration. An example
+ of how to do this can be found in :c:func:`mvneta_mac_config` in
+ drivers/net/ethernet/marvell/mvneta.c.
+
+ For further information on these methods, please see the inline
+ documentation in :c:type:`struct phylink_mac_ops <phylink_mac_ops>`.
+
+9. Remove calls to of_parse_phandle() for the PHY,
+ of_phy_register_fixed_link() for fixed links etc. from the probe
+ function, and replace with:
+
+ .. code-block:: c
+
+ struct phylink *phylink;
+
+ phylink = phylink_create(dev, node, phy_mode, &phylink_ops);
+ if (IS_ERR(phylink)) {
+ err = PTR_ERR(phylink);
+ fail probe;
+ }
+
+ priv->phylink = phylink;
+
+ and arrange to destroy the phylink in the probe failure path as
+ appropriate and the removal path too by calling:
+
+ .. code-block:: c
+
+ phylink_destroy(priv->phylink);
+
+10. Arrange for MAC link state interrupts to be forwarded into
+ phylink, via:
+
+ .. code-block:: c
+
+ phylink_mac_change(priv->phylink, link_is_up);
+
+ where ``link_is_up`` is true if the link is currently up or false
+ otherwise.
+
+11. Verify that the driver does not call::
+
+ netif_carrier_on()
+ netif_carrier_off()
+
+ as these will interfere with phylink's tracking of the link state,
+ and cause phylink to omit calls via the :c:func:`mac_link_up` and
+ :c:func:`mac_link_down` methods.
+
+Network drivers should call phylink_stop() and phylink_start() via their
+suspend/resume paths, which ensures that the appropriate
+:c:type:`struct phylink_mac_ops <phylink_mac_ops>` methods are called
+as necessary.
+
+For information describing the SFP cage in DT, please see the binding
+documentation in the kernel source tree
+``Documentation/devicetree/bindings/net/sff,sfp.txt``
diff --git a/Documentation/networking/snmp_counter.rst b/Documentation/networking/snmp_counter.rst
index fe8f741193be..52b026be028f 100644
--- a/Documentation/networking/snmp_counter.rst
+++ b/Documentation/networking/snmp_counter.rst
@@ -1,16 +1,17 @@
-===========
+============
SNMP counter
-===========
+============
This document explains the meaning of SNMP counters.
General IPv4 counters
-====================
+=====================
All layer 4 packets and ICMP packets will change these counters, but
these counters won't be changed by layer 2 packets (such as STP) or
ARP packets.
* IpInReceives
+
Defined in `RFC1213 ipInReceives`_
.. _RFC1213 ipInReceives: https://tools.ietf.org/html/rfc1213#page-26
@@ -23,6 +24,7 @@ and so on). It indicates the number of aggregated segments after
GRO/LRO.
* IpInDelivers
+
Defined in `RFC1213 ipInDelivers`_
.. _RFC1213 ipInDelivers: https://tools.ietf.org/html/rfc1213#page-28
@@ -33,6 +35,7 @@ supported protocols will be delivered, if someone listens on the raw
socket, all valid IP packets will be delivered.
* IpOutRequests
+
Defined in `RFC1213 ipOutRequests`_
.. _RFC1213 ipOutRequests: https://tools.ietf.org/html/rfc1213#page-28
@@ -42,6 +45,7 @@ multicast packets, and would always be updated together with
IpExtOutOctets.
* IpExtInOctets and IpExtOutOctets
+
They are Linux kernel extensions, no RFC definitions. Please note,
RFC1213 indeed defines ifInOctets and ifOutOctets, but they
are different things. The ifInOctets and ifOutOctets include the MAC
@@ -49,6 +53,7 @@ layer header size but IpExtInOctets and IpExtOutOctets don't, they
only include the IP layer header and the IP layer data.
* IpExtInNoECTPkts, IpExtInECT1Pkts, IpExtInECT0Pkts, IpExtInCEPkts
+
They indicate the number of four kinds of ECN IP packets, please refer
`Explicit Congestion Notification`_ for more details.
@@ -60,6 +65,7 @@ for the same packet, you might find that IpInReceives count 1, but
IpExtInNoECTPkts counts 2 or more.
* IpInHdrErrors
+
Defined in `RFC1213 ipInHdrErrors`_. It indicates the packet is
dropped due to the IP header error. It might happen in both IP input
and IP forward paths.
@@ -67,6 +73,7 @@ and IP forward paths.
.. _RFC1213 ipInHdrErrors: https://tools.ietf.org/html/rfc1213#page-27
* IpInAddrErrors
+
Defined in `RFC1213 ipInAddrErrors`_. It will be increased in two
scenarios: (1) The IP address is invalid. (2) The destination IP
address is not a local address and IP forwarding is not enabled
@@ -74,6 +81,7 @@ address is not a local address and IP forwarding is not enabled
.. _RFC1213 ipInAddrErrors: https://tools.ietf.org/html/rfc1213#page-27
* IpExtInNoRoutes
+
This counter means the packet is dropped when the IP stack receives a
packet and can't find a route for it from the route table. It might
happen when IP forwarding is enabled and the destination IP address is
@@ -81,6 +89,7 @@ not a local address and there is no route for the destination IP
address.
* IpInUnknownProtos
+
Defined in `RFC1213 ipInUnknownProtos`_. It will be increased if the
layer 4 protocol is unsupported by kernel. If an application is using
raw socket, kernel will always deliver the packet to the raw socket
@@ -89,10 +98,12 @@ and this counter won't be increased.
.. _RFC1213 ipInUnknownProtos: https://tools.ietf.org/html/rfc1213#page-27
* IpExtInTruncatedPkts
+
For IPv4 packet, it means the actual data size is smaller than the
"Total Length" field in the IPv4 header.
* IpInDiscards
+
Defined in `RFC1213 ipInDiscards`_. It indicates the packet is dropped
in the IP receiving path and due to kernel internal reasons (e.g. no
enough memory).
@@ -100,20 +111,23 @@ enough memory).
.. _RFC1213 ipInDiscards: https://tools.ietf.org/html/rfc1213#page-28
* IpOutDiscards
+
Defined in `RFC1213 ipOutDiscards`_. It indicates the packet is
dropped in the IP sending path and due to kernel internal reasons.
.. _RFC1213 ipOutDiscards: https://tools.ietf.org/html/rfc1213#page-28
* IpOutNoRoutes
+
Defined in `RFC1213 ipOutNoRoutes`_. It indicates the packet is
dropped in the IP sending path and no route is found for it.
.. _RFC1213 ipOutNoRoutes: https://tools.ietf.org/html/rfc1213#page-29
ICMP counters
-============
+=============
* IcmpInMsgs and IcmpOutMsgs
+
Defined by `RFC1213 icmpInMsgs`_ and `RFC1213 icmpOutMsgs`_
.. _RFC1213 icmpInMsgs: https://tools.ietf.org/html/rfc1213#page-41
@@ -126,6 +140,7 @@ IcmpOutMsgs would still be updated if the IP header is constructed by
a userspace program.
* ICMP named types
+
| These counters include most of common ICMP types, they are:
| IcmpInDestUnreachs: `RFC1213 icmpInDestUnreachs`_
| IcmpInTimeExcds: `RFC1213 icmpInTimeExcds`_
@@ -180,6 +195,7 @@ straightforward. The 'In' counter means kernel receives such a packet
and the 'Out' counter means kernel sends such a packet.
* ICMP numeric types
+
They are IcmpMsgInType[N] and IcmpMsgOutType[N], the [N] indicates the
ICMP type number. These counters track all kinds of ICMP packets. The
ICMP type number definition could be found in the `ICMP parameters`_
@@ -192,12 +208,14 @@ IcmpMsgOutType8 would increase 1. And if kernel gets an ICMP Echo Reply
packet, IcmpMsgInType0 would increase 1.
* IcmpInCsumErrors
+
This counter indicates the checksum of the ICMP packet is
wrong. Kernel verifies the checksum after updating the IcmpInMsgs and
before updating IcmpMsgInType[N]. If a packet has bad checksum, the
IcmpInMsgs would be updated but none of IcmpMsgInType[N] would be updated.
* IcmpInErrors and IcmpOutErrors
+
Defined by `RFC1213 icmpInErrors`_ and `RFC1213 icmpOutErrors`_
.. _RFC1213 icmpInErrors: https://tools.ietf.org/html/rfc1213#page-41
@@ -209,7 +227,7 @@ and the sending packet path use IcmpOutErrors. When IcmpInCsumErrors
is increased, IcmpInErrors would always be increased too.
relationship of the ICMP counters
--------------------------------
+---------------------------------
The sum of IcmpMsgOutType[N] is always equal to IcmpOutMsgs, as they
are updated at the same time. The sum of IcmpMsgInType[N] plus
IcmpInErrors should be equal or larger than IcmpInMsgs. When kernel
@@ -229,8 +247,9 @@ IcmpInMsgs should be less than the sum of IcmpMsgOutType[N] plus
IcmpInErrors.
General TCP counters
-==================
+====================
* TcpInSegs
+
Defined in `RFC1213 tcpInSegs`_
.. _RFC1213 tcpInSegs: https://tools.ietf.org/html/rfc1213#page-48
@@ -247,6 +266,7 @@ isn't aware of GRO. So if two packets are merged by GRO, the TcpInSegs
counter would only increase 1.
* TcpOutSegs
+
Defined in `RFC1213 tcpOutSegs`_
.. _RFC1213 tcpOutSegs: https://tools.ietf.org/html/rfc1213#page-48
@@ -258,6 +278,7 @@ GSO, so if a packet would be split to 2 by GSO, TcpOutSegs will
increase 2.
* TcpActiveOpens
+
Defined in `RFC1213 tcpActiveOpens`_
.. _RFC1213 tcpActiveOpens: https://tools.ietf.org/html/rfc1213#page-47
@@ -267,6 +288,7 @@ state. Every time TcpActiveOpens increases 1, TcpOutSegs should always
increase 1.
* TcpPassiveOpens
+
Defined in `RFC1213 tcpPassiveOpens`_
.. _RFC1213 tcpPassiveOpens: https://tools.ietf.org/html/rfc1213#page-47
@@ -275,6 +297,7 @@ It means the TCP layer receives a SYN, replies a SYN+ACK, come into
the SYN-RCVD state.
* TcpExtTCPRcvCoalesce
+
When packets are received by the TCP layer and are not be read by the
application, the TCP layer will try to merge them. This counter
indicate how many packets are merged in such situation. If GRO is
@@ -282,12 +305,14 @@ enabled, lots of packets would be merged by GRO, these packets
wouldn't be counted to TcpExtTCPRcvCoalesce.
* TcpExtTCPAutoCorking
+
When sending packets, the TCP layer will try to merge small packets to
a bigger one. This counter increase 1 for every packet merged in such
situation. Please refer to the LWN article for more details:
https://lwn.net/Articles/576263/
* TcpExtTCPOrigDataSent
+
This counter is explained by `kernel commit f19c29e3e391`_, I pasted the
explaination below::
@@ -297,6 +322,7 @@ explaination below::
more useful to track the TCP retransmission rate.
* TCPSynRetrans
+
This counter is explained by `kernel commit f19c29e3e391`_, I pasted the
explaination below::
@@ -304,6 +330,7 @@ explaination below::
retransmissions into SYN, fast-retransmits, timeout retransmits, etc.
* TCPFastOpenActiveFail
+
This counter is explained by `kernel commit f19c29e3e391`_, I pasted the
explaination below::
@@ -313,6 +340,7 @@ explaination below::
.. _kernel commit f19c29e3e391: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f19c29e3e391a66a273e9afebaf01917245148cd
* TcpExtListenOverflows and TcpExtListenDrops
+
When kernel receives a SYN from a client, and if the TCP accept queue
is full, kernel will drop the SYN and add 1 to TcpExtListenOverflows.
At the same time kernel will also add 1 to TcpExtListenDrops. When a
@@ -336,17 +364,22 @@ time client replies ACK, this socket will get another chance to move
to the accept queue.
+TCP Fast Open
+=============
* TcpEstabResets
+
Defined in `RFC1213 tcpEstabResets`_.
.. _RFC1213 tcpEstabResets: https://tools.ietf.org/html/rfc1213#page-48
* TcpAttemptFails
+
Defined in `RFC1213 tcpAttemptFails`_.
.. _RFC1213 tcpAttemptFails: https://tools.ietf.org/html/rfc1213#page-48
* TcpOutRsts
+
Defined in `RFC1213 tcpOutRsts`_. The RFC says this counter indicates
the 'segments sent containing the RST flag', but in linux kernel, this
couner indicates the segments kerenl tried to send. The sending
@@ -354,6 +387,30 @@ process might be failed due to some errors (e.g. memory alloc failed).
.. _RFC1213 tcpOutRsts: https://tools.ietf.org/html/rfc1213#page-52
+* TcpExtTCPSpuriousRtxHostQueues
+
+When the TCP stack wants to retransmit a packet, and finds that packet
+is not lost in the network, but the packet is not sent yet, the TCP
+stack would give up the retransmission and update this counter. It
+might happen if a packet stays too long time in a qdisc or driver
+queue.
+
+* TcpEstabResets
+
+The socket receives a RST packet in Establish or CloseWait state.
+
+* TcpExtTCPKeepAlive
+
+This counter indicates many keepalive packets were sent. The keepalive
+won't be enabled by default. A userspace program could enable it by
+setting the SO_KEEPALIVE socket option.
+
+* TcpExtTCPSpuriousRTOs
+
+The spurious retransmission timeout detected by the `F-RTO`_
+algorithm.
+
+.. _F-RTO: https://tools.ietf.org/html/rfc5682
TCP Fast Path
============
@@ -389,20 +446,23 @@ will disable the fast path at first, and try to enable it after kernel
receives packets.
* TcpExtTCPPureAcks and TcpExtTCPHPAcks
+
If a packet set ACK flag and has no data, it is a pure ACK packet, if
kernel handles it in the fast path, TcpExtTCPHPAcks will increase 1,
if kernel handles it in the slow path, TcpExtTCPPureAcks will
increase 1.
* TcpExtTCPHPHits
+
If a TCP packet has data (which means it is not a pure ACK packet),
and this packet is handled in the fast path, TcpExtTCPHPHits will
increase 1.
TCP abort
-========
+=========
* TcpExtTCPAbortOnData
+
It means TCP layer has data in flight, but need to close the
connection. So TCP layer sends a RST to the other side, indicate the
connection is not closed very graceful. An easy way to increase this
@@ -421,11 +481,13 @@ when the application closes a connection, kernel will send a RST
immediately and increase the TcpExtTCPAbortOnData counter.
* TcpExtTCPAbortOnClose
+
This counter means the application has unread data in the TCP layer when
the application wants to close the TCP connection. In such a situation,
kernel will send a RST to the other side of the TCP connection.
* TcpExtTCPAbortOnMemory
+
When an application closes a TCP connection, kernel still need to track
the connection, let it complete the TCP disconnect process. E.g. an
app calls the close method of a socket, kernel sends fin to the other
@@ -447,10 +509,12 @@ the tcp_mem. Please refer the tcp_mem section in the `TCP man page`_:
* TcpExtTCPAbortOnTimeout
+
This counter will increase when any of the TCP timers expire. In such
situation, kernel won't send RST, just give up the connection.
* TcpExtTCPAbortOnLinger
+
When a TCP connection comes into FIN_WAIT_2 state, instead of waiting
for the fin packet from the other side, kernel could send a RST and
delete the socket immediately. This is not the default behavior of
@@ -458,6 +522,7 @@ Linux kernel TCP stack. By configuring the TCP_LINGER2 socket option,
you could let kernel follow this behavior.
* TcpExtTCPAbortFailed
+
The kernel TCP layer will send RST if the `RFC2525 2.17 section`_ is
satisfied. If an internal error occurs during this process,
TcpExtTCPAbortFailed will be increased.
@@ -465,7 +530,7 @@ TcpExtTCPAbortFailed will be increased.
.. _RFC2525 2.17 section: https://tools.ietf.org/html/rfc2525#page-50
TCP Hybrid Slow Start
-====================
+=====================
The Hybrid Slow Start algorithm is an enhancement of the traditional
TCP congestion window Slow Start algorithm. It uses two pieces of
information to detect whether the max bandwidth of the TCP path is
@@ -481,23 +546,27 @@ relate with the Hybrid Slow Start algorithm.
.. _Hybrid Slow Start paper: https://pdfs.semanticscholar.org/25e9/ef3f03315782c7f1cbcd31b587857adae7d1.pdf
* TcpExtTCPHystartTrainDetect
+
How many times the ACK train length threshold is detected
* TcpExtTCPHystartTrainCwnd
+
The sum of CWND detected by ACK train length. Dividing this value by
TcpExtTCPHystartTrainDetect is the average CWND which detected by the
ACK train length.
* TcpExtTCPHystartDelayDetect
+
How many times the packet delay threshold is detected.
* TcpExtTCPHystartDelayCwnd
+
The sum of CWND detected by packet delay. Dividing this value by
TcpExtTCPHystartDelayDetect is the average CWND which detected by the
packet delay.
TCP retransmission and congestion control
-======================================
+=========================================
The TCP protocol has two retransmission mechanisms: SACK and fast
recovery. They are exclusive with each other. When SACK is enabled,
the kernel TCP stack would use SACK, or kernel would use fast
@@ -516,12 +585,14 @@ https://pdfs.semanticscholar.org/0e9c/968d09ab2e53e24c4dca5b2d67c7f7140f8e.pdf
.. _RFC6582: https://tools.ietf.org/html/rfc6582
* TcpExtTCPRenoRecovery and TcpExtTCPSackRecovery
+
When the congestion control comes into Recovery state, if sack is
used, TcpExtTCPSackRecovery increases 1, if sack is not used,
TcpExtTCPRenoRecovery increases 1. These two counters mean the TCP
stack begins to retransmit the lost packets.
* TcpExtTCPSACKReneging
+
A packet was acknowledged by SACK, but the receiver has dropped this
packet, so the sender needs to retransmit this packet. In this
situation, the sender adds 1 to TcpExtTCPSACKReneging. A receiver
@@ -532,6 +603,7 @@ the RTO expires for this packet, then the sender assumes this packet
has been dropped by the receiver.
* TcpExtTCPRenoReorder
+
The reorder packet is detected by fast recovery. It would only be used
if SACK is disabled. The fast recovery algorithm detects recorder by
the duplicate ACK number. E.g., if retransmission is triggered, and
@@ -542,6 +614,7 @@ order packet. Thus the sender would find more ACks than its
expectation, and the sender knows out of order occurs.
* TcpExtTCPTSReorder
+
The reorder packet is detected when a hole is filled. E.g., assume the
sender sends packet 1,2,3,4,5, and the receiving order is
1,2,4,5,3. When the sender receives the ACK of packet 3 (which will
@@ -551,6 +624,7 @@ fill the hole), two conditions will let TcpExtTCPTSReorder increase
than the retransmission timestamp.
* TcpExtTCPSACKReorder
+
The reorder packet detected by SACK. The SACK has two methods to
detect reorder: (1) DSACK is received by the sender. It means the
sender sends the same packet more than one times. And the only reason
@@ -562,6 +636,29 @@ packet yet, the sender would know packet 4 is out of order. The TCP
stack of kernel will increase TcpExtTCPSACKReorder for both of the
above scenarios.
+* TcpExtTCPSlowStartRetrans
+
+The TCP stack wants to retransmit a packet and the congestion control
+state is 'Loss'.
+
+* TcpExtTCPFastRetrans
+
+The TCP stack wants to retransmit a packet and the congestion control
+state is not 'Loss'.
+
+* TcpExtTCPLostRetransmit
+
+A SACK points out that a retransmission packet is lost again.
+
+* TcpExtTCPRetransFail
+
+The TCP stack tries to deliver a retransmission packet to lower layers
+but the lower layers return an error.
+
+* TcpExtTCPSynRetrans
+
+The TCP stack retransmits a SYN packet.
+
DSACK
=====
The DSACK is defined in `RFC2883`_. The receiver uses DSACK to report
@@ -574,10 +671,12 @@ sender side.
.. _RFC2883 : https://tools.ietf.org/html/rfc2883
* TcpExtTCPDSACKOldSent
+
The TCP stack receives a duplicate packet which has been acked, so it
sends a DSACK to the sender.
* TcpExtTCPDSACKOfoSent
+
The TCP stack receives an out of order duplicate packet, so it sends a
DSACK to the sender.
@@ -586,6 +685,7 @@ The TCP stack receives a DSACK, which indicates an acknowledged
duplicate packet is received.
* TcpExtTCPDSACKOfoRecv
+
The TCP stack receives a DSACK, which indicate an out of order
duplicate packet is received.
@@ -640,23 +740,26 @@ A skb should be shifted or merged, but the TCP stack doesn't do it for
some reasons.
TCP out of order
-===============
+================
* TcpExtTCPOFOQueue
+
The TCP layer receives an out of order packet and has enough memory
to queue it.
* TcpExtTCPOFODrop
+
The TCP layer receives an out of order packet but doesn't have enough
memory, so drops it. Such packets won't be counted into
TcpExtTCPOFOQueue.
* TcpExtTCPOFOMerge
+
The received out of order packet has an overlay with the previous
packet. the overlay part will be dropped. All of TcpExtTCPOFOMerge
packets will also be counted into TcpExtTCPOFOQueue.
TCP PAWS
-=======
+========
PAWS (Protection Against Wrapped Sequence numbers) is an algorithm
which is used to drop old packets. It depends on the TCP
timestamps. For detail information, please refer the `timestamp wiki`_
@@ -666,13 +769,15 @@ and the `RFC of PAWS`_.
.. _timestamp wiki: https://en.wikipedia.org/wiki/Transmission_Control_Protocol#TCP_timestamps
* TcpExtPAWSActive
+
Packets are dropped by PAWS in Syn-Sent status.
* TcpExtPAWSEstab
+
Packets are dropped by PAWS in any status other than Syn-Sent.
TCP ACK skip
-===========
+============
In some scenarios, kernel would avoid sending duplicate ACKs too
frequently. Please find more details in the tcp_invalid_ratelimit
section of the `sysctl document`_. When kernel decides to skip an ACK
@@ -684,6 +789,7 @@ it has no data.
.. _sysctl document: https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
* TcpExtTCPACKSkippedSynRecv
+
The ACK is skipped in Syn-Recv status. The Syn-Recv status means the
TCP stack receives a SYN and replies SYN+ACK. Now the TCP stack is
waiting for an ACK. Generally, the TCP stack doesn't need to send ACK
@@ -697,6 +803,7 @@ increase TcpExtTCPACKSkippedSynRecv.
* TcpExtTCPACKSkippedPAWS
+
The ACK is skipped due to PAWS (Protect Against Wrapped Sequence
numbers) check fails. If the PAWS check fails in Syn-Recv, Fin-Wait-2
or Time-Wait statuses, the skipped ACK would be counted to
@@ -705,18 +812,22 @@ TcpExtTCPACKSkippedTimeWait. In all other statuses, the skipped ACK
would be counted to TcpExtTCPACKSkippedPAWS.
* TcpExtTCPACKSkippedSeq
+
The sequence number is out of window and the timestamp passes the PAWS
check and the TCP status is not Syn-Recv, Fin-Wait-2, and Time-Wait.
* TcpExtTCPACKSkippedFinWait2
+
The ACK is skipped in Fin-Wait-2 status, the reason would be either
PAWS check fails or the received sequence number is out of window.
* TcpExtTCPACKSkippedTimeWait
+
Tha ACK is skipped in Time-Wait status, the reason would be either
PAWS check failed or the received sequence number is out of window.
* TcpExtTCPACKSkippedChallenge
+
The ACK is skipped if the ACK is a challenge ACK. The RFC 5961 defines
3 kind of challenge ACK, please refer `RFC 5961 section 3.2`_,
`RFC 5961 section 4.2`_ and `RFC 5961 section 5.2`_. Besides these
@@ -729,8 +840,9 @@ unacknowledged number (more strict than `RFC 5961 section 5.2`_).
.. _RFC 5961 section 5.2: https://tools.ietf.org/html/rfc5961#page-11
TCP receive window
-=================
+==================
* TcpExtTCPWantZeroWindowAdv
+
Depending on current memory usage, the TCP stack tries to set receive
window to zero. But the receive window might still be a no-zero
value. For example, if the previous window size is 10, and the TCP
@@ -738,14 +850,16 @@ stack receives 3 bytes, the current window size would be 7 even if the
window size calculated by the memory usage is zero.
* TcpExtTCPToZeroWindowAdv
+
The TCP receive window is set to zero from a no-zero value.
* TcpExtTCPFromZeroWindowAdv
+
The TCP receive window is set to no-zero value from zero.
Delayed ACK
-==========
+===========
The TCP Delayed ACK is a technique which is used for reducing the
packet count in the network. For more details, please refer the
`Delayed ACK wiki`_
@@ -753,10 +867,12 @@ packet count in the network. For more details, please refer the
.. _Delayed ACK wiki: https://en.wikipedia.org/wiki/TCP_delayed_acknowledgment
* TcpExtDelayedACKs
+
A delayed ACK timer expires. The TCP stack will send a pure ACK packet
and exit the delayed ACK mode.
* TcpExtDelayedACKLocked
+
A delayed ACK timer expires, but the TCP stack can't send an ACK
immediately due to the socket is locked by a userspace program. The
TCP stack will send a pure ACK later (after the userspace program
@@ -765,29 +881,152 @@ TCP stack will also update TcpExtDelayedACKs and exit the delayed ACK
mode.
* TcpExtDelayedACKLost
+
It will be updated when the TCP stack receives a packet which has been
ACKed. A Delayed ACK loss might cause this issue, but it would also be
triggered by other reasons, such as a packet is duplicated in the
network.
Tail Loss Probe (TLP)
-===================
+=====================
TLP is an algorithm which is used to detect TCP packet loss. For more
details, please refer the `TLP paper`_.
.. _TLP paper: https://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01
* TcpExtTCPLossProbes
+
A TLP probe packet is sent.
* TcpExtTCPLossProbeRecovery
+
A packet loss is detected and recovered by TLP.
+TCP Fast Open
+=============
+TCP Fast Open is a technology which allows data transfer before the
+3-way handshake complete. Please refer the `TCP Fast Open wiki`_ for a
+general description.
+
+.. _TCP Fast Open wiki: https://en.wikipedia.org/wiki/TCP_Fast_Open
+
+* TcpExtTCPFastOpenActive
+
+When the TCP stack receives an ACK packet in the SYN-SENT status, and
+the ACK packet acknowledges the data in the SYN packet, the TCP stack
+understand the TFO cookie is accepted by the other side, then it
+updates this counter.
+
+* TcpExtTCPFastOpenActiveFail
+
+This counter indicates that the TCP stack initiated a TCP Fast Open,
+but it failed. This counter would be updated in three scenarios: (1)
+the other side doesn't acknowledge the data in the SYN packet. (2) The
+SYN packet which has the TFO cookie is timeout at least once. (3)
+after the 3-way handshake, the retransmission timeout happens
+net.ipv4.tcp_retries1 times, because some middle-boxes may black-hole
+fast open after the handshake.
+
+* TcpExtTCPFastOpenPassive
+
+This counter indicates how many times the TCP stack accepts the fast
+open request.
+
+* TcpExtTCPFastOpenPassiveFail
+
+This counter indicates how many times the TCP stack rejects the fast
+open request. It is caused by either the TFO cookie is invalid or the
+TCP stack finds an error during the socket creating process.
+
+* TcpExtTCPFastOpenListenOverflow
+
+When the pending fast open request number is larger than
+fastopenq->max_qlen, the TCP stack will reject the fast open request
+and update this counter. When this counter is updated, the TCP stack
+won't update TcpExtTCPFastOpenPassive or
+TcpExtTCPFastOpenPassiveFail. The fastopenq->max_qlen is set by the
+TCP_FASTOPEN socket operation and it could not be larger than
+net.core.somaxconn. For example:
+
+setsockopt(sfd, SOL_TCP, TCP_FASTOPEN, &qlen, sizeof(qlen));
+
+* TcpExtTCPFastOpenCookieReqd
+
+This counter indicates how many times a client wants to request a TFO
+cookie.
+
+SYN cookies
+===========
+SYN cookies are used to mitigate SYN flood, for details, please refer
+the `SYN cookies wiki`_.
+
+.. _SYN cookies wiki: https://en.wikipedia.org/wiki/SYN_cookies
+
+* TcpExtSyncookiesSent
+
+It indicates how many SYN cookies are sent.
+
+* TcpExtSyncookiesRecv
+
+How many reply packets of the SYN cookies the TCP stack receives.
+
+* TcpExtSyncookiesFailed
+
+The MSS decoded from the SYN cookie is invalid. When this counter is
+updated, the received packet won't be treated as a SYN cookie and the
+TcpExtSyncookiesRecv counter wont be updated.
+
+Challenge ACK
+=============
+For details of challenge ACK, please refer the explaination of
+TcpExtTCPACKSkippedChallenge.
+
+* TcpExtTCPChallengeACK
+
+The number of challenge acks sent.
+
+* TcpExtTCPSYNChallenge
+
+The number of challenge acks sent in response to SYN packets. After
+updates this counter, the TCP stack might send a challenge ACK and
+update the TcpExtTCPChallengeACK counter, or it might also skip to
+send the challenge and update the TcpExtTCPACKSkippedChallenge.
+
+prune
+=====
+When a socket is under memory pressure, the TCP stack will try to
+reclaim memory from the receiving queue and out of order queue. One of
+the reclaiming method is 'collapse', which means allocate a big sbk,
+copy the contiguous skbs to the single big skb, and free these
+contiguous skbs.
+
+* TcpExtPruneCalled
+
+The TCP stack tries to reclaim memory for a socket. After updates this
+counter, the TCP stack will try to collapse the out of order queue and
+the receiving queue. If the memory is still not enough, the TCP stack
+will try to discard packets from the out of order queue (and update the
+TcpExtOfoPruned counter)
+
+* TcpExtOfoPruned
+
+The TCP stack tries to discard packet on the out of order queue.
+
+* TcpExtRcvPruned
+
+After 'collapse' and discard packets from the out of order queue, if
+the actually used memory is still larger than the max allowed memory,
+this counter will be updated. It means the 'prune' fails.
+
+* TcpExtTCPRcvCollapsed
+
+This counter indicates how many skbs are freed during 'collapse'.
+
examples
-=======
+========
ping test
---------
+---------
Run the ping command against the public dns server 8.8.8.8::
nstatuser@nstat-a:~$ ping 8.8.8.8 -c 1
@@ -831,7 +1070,7 @@ and its corresponding Echo Reply packet are constructed by:
So the IpExtInOctets and IpExtOutOctets are 20+16+48=84.
tcp 3-way handshake
-------------------
+-------------------
On server side, we run::
nstatuser@nstat-b:~$ nc -lknv 0.0.0.0 9000
@@ -873,7 +1112,7 @@ ACK, so client sent 2 packets, received 1 packet, TcpInSegs increased
1, TcpOutSegs increased 2.
TCP normal traffic
------------------
+------------------
Run nc on server::
nstatuser@nstat-b:~$ nc -lkv 0.0.0.0 9000
@@ -996,7 +1235,7 @@ and the packet received from client qualified for fast path, so it
was counted into 'TcpExtTCPHPHits'.
TcpExtTCPAbortOnClose
---------------------
+---------------------
On the server side, we run below python script::
import socket
@@ -1030,7 +1269,7 @@ If we run tcpdump on the server side, we could find the server sent a
RST after we type Ctrl-C.
TcpExtTCPAbortOnMemory and TcpExtTCPAbortOnTimeout
------------------------------------------------
+---------------------------------------------------
Below is an example which let the orphan socket count be higher than
net.ipv4.tcp_max_orphans.
Change tcp_max_orphans to a smaller value on client::
@@ -1152,7 +1391,7 @@ FIN_WAIT_1 state finally. So we wait for a few minutes, we could find
TcpExtTCPAbortOnTimeout 10 0.0
TcpExtTCPAbortOnLinger
----------------------
+----------------------
The server side code::
nstatuser@nstat-b:~$ cat server_linger.py
@@ -1197,7 +1436,7 @@ After run client_linger.py, check the output of nstat::
TcpExtTCPAbortOnLinger 1 0.0
TcpExtTCPRcvCoalesce
--------------------
+--------------------
On the server, we run a program which listen on TCP port 9000, but
doesn't read any data::
@@ -1257,7 +1496,7 @@ the receiving queue. So the TCP layer merged the two packets, and we
could find the TcpExtTCPRcvCoalesce increased 1.
TcpExtListenOverflows and TcpExtListenDrops
-----------------------------------------
+-------------------------------------------
On server, run the nc command, listen on port 9000::
nstatuser@nstat-b:~$ nc -lkv 0.0.0.0 9000
@@ -1305,7 +1544,7 @@ TcpExtListenOverflows and TcpExtListenDrops would be larger, because
the SYN of the 4th nc was dropped, the client was retrying.
IpInAddrErrors, IpExtInNoRoutes and IpOutNoRoutes
-----------------------------------------------
+-------------------------------------------------
server A IP address: 192.168.122.250
server B IP address: 192.168.122.251
Prepare on server A, add a route to server B::
@@ -1400,7 +1639,7 @@ a route for the 8.8.8.8 IP address, so server B increased
IpOutNoRoutes.
TcpExtTCPACKSkippedSynRecv
-------------------------
+--------------------------
In this test, we send 3 same SYN packets from client to server. The
first SYN will let server create a socket, set it to Syn-Recv status,
and reply a SYN/ACK. The second SYN will let server reply the SYN/ACK
@@ -1448,7 +1687,7 @@ Check snmp cunter on nstat-b::
As we expected, TcpExtTCPACKSkippedSynRecv is 1.
TcpExtTCPACKSkippedPAWS
-----------------------
+-----------------------
To trigger PAWS, we could send an old SYN.
On nstat-b, let nc listen on port 9000::
@@ -1485,7 +1724,7 @@ failed, the nstat-b replied an ACK for the first SYN, skipped the ACK
for the second SYN, and updated TcpExtTCPACKSkippedPAWS.
TcpExtTCPACKSkippedSeq
---------------------
+----------------------
To trigger TcpExtTCPACKSkippedSeq, we send packets which have valid
timestamp (to pass PAWS check) but the sequence number is out of
window. The linux TCP stack would avoid to skip if the packet has
diff --git a/Documentation/networking/switchdev.txt b/Documentation/networking/switchdev.txt
index 82236a17b5e6..86174ce8cd13 100644
--- a/Documentation/networking/switchdev.txt
+++ b/Documentation/networking/switchdev.txt
@@ -92,11 +92,11 @@ device.
Switch ID
^^^^^^^^^
-The switchdev driver must implement the switchdev op switchdev_port_attr_get
-for SWITCHDEV_ATTR_ID_PORT_PARENT_ID for each port netdev, returning the same
-physical ID for each port of a switch. The ID must be unique between switches
-on the same system. The ID does not need to be unique between switches on
-different systems.
+The switchdev driver must implement the net_device operation
+ndo_get_port_parent_id for each port netdev, returning the same physical ID for
+each port of a switch. The ID must be unique between switches on the same
+system. The ID does not need to be unique between switches on different
+systems.
The switch ID is used to locate ports on a switch and to know if aggregated
ports belong to the same switch.
@@ -196,7 +196,7 @@ The switch device will learn/forget source MAC address/VLAN on ingress packets
and notify the switch driver of the mac/vlan/port tuples. The switch driver,
in turn, will notify the bridge driver using the switchdev notifier call:
- err = call_switchdev_notifiers(val, dev, info);
+ err = call_switchdev_notifiers(val, dev, info, extack);
Where val is SWITCHDEV_FDB_ADD when learning and SWITCHDEV_FDB_DEL when
forgetting, and info points to a struct switchdev_notifier_fdb_info. On
@@ -232,10 +232,8 @@ Learning_sync attribute enables syncing of the learned/forgotten FDB entry to
the bridge's FDB. It's possible, but not optimal, to enable learning on the
device port and on the bridge port, and disable learning_sync.
-To support learning and learning_sync port attributes, the driver implements
-switchdev op switchdev_port_attr_get/set for
-SWITCHDEV_ATTR_PORT_ID_BRIDGE_FLAGS. The driver should initialize the attributes
-to the hardware defaults.
+To support learning, the driver implements switchdev op
+switchdev_port_attr_set for SWITCHDEV_ATTR_PORT_ID_{PRE}_BRIDGE_FLAGS.
FDB Ageing
^^^^^^^^^^
@@ -373,22 +371,3 @@ The driver can monitor for updates to arp_tbl using the netevent notifier
NETEVENT_NEIGH_UPDATE. The device can be programmed with resolved nexthops
for the routes as arp_tbl updates. The driver implements ndo_neigh_destroy
to know when arp_tbl neighbor entries are purged from the port.
-
-Transaction item queue
-^^^^^^^^^^^^^^^^^^^^^^
-
-For switchdev ops attr_set and obj_add, there is a 2 phase transaction model
-used. First phase is to "prepare" anything needed, including various checks,
-memory allocation, etc. The goal is to handle the stuff that is not unlikely
-to fail here. The second phase is to "commit" the actual changes.
-
-Switchdev provides an infrastructure for sharing items (for example memory
-allocations) between the two phases.
-
-The object created by a driver in "prepare" phase and it is queued up by:
-switchdev_trans_item_enqueue()
-During the "commit" phase, the driver gets the object by:
-switchdev_trans_item_dequeue()
-
-If a transaction is aborted during "prepare" phase, switchdev code will handle
-cleanup of the queued-up objects.
diff --git a/Documentation/networking/timestamping.txt b/Documentation/networking/timestamping.txt
index 9d1432e0aaa8..bbdaf8990031 100644
--- a/Documentation/networking/timestamping.txt
+++ b/Documentation/networking/timestamping.txt
@@ -6,11 +6,21 @@ The interfaces for receiving network packages timestamps are:
* SO_TIMESTAMP
Generates a timestamp for each incoming packet in (not necessarily
monotonic) system time. Reports the timestamp via recvmsg() in a
- control message as struct timeval (usec resolution).
+ control message in usec resolution.
+ SO_TIMESTAMP is defined as SO_TIMESTAMP_NEW or SO_TIMESTAMP_OLD
+ based on the architecture type and time_t representation of libc.
+ Control message format is in struct __kernel_old_timeval for
+ SO_TIMESTAMP_OLD and in struct __kernel_sock_timeval for
+ SO_TIMESTAMP_NEW options respectively.
* SO_TIMESTAMPNS
Same timestamping mechanism as SO_TIMESTAMP, but reports the
- timestamp as struct timespec (nsec resolution).
+ timestamp as struct timespec in nsec resolution.
+ SO_TIMESTAMPNS is defined as SO_TIMESTAMPNS_NEW or SO_TIMESTAMPNS_OLD
+ based on the architecture type and time_t representation of libc.
+ Control message format is in struct timespec for SO_TIMESTAMPNS_OLD
+ and in struct __kernel_timespec for SO_TIMESTAMPNS_NEW options
+ respectively.
* IP_MULTICAST_LOOP + SO_TIMESTAMP[NS]
Only for multicast:approximate transmit timestamp obtained by
@@ -22,7 +32,7 @@ The interfaces for receiving network packages timestamps are:
timestamps for stream sockets.
-1.1 SO_TIMESTAMP:
+1.1 SO_TIMESTAMP (also SO_TIMESTAMP_OLD and SO_TIMESTAMP_NEW):
This socket option enables timestamping of datagrams on the reception
path. Because the destination socket, if any, is not known early in
@@ -31,15 +41,25 @@ same is true for all early receive timestamp options.
For interface details, see `man 7 socket`.
+Always use SO_TIMESTAMP_NEW timestamp to always get timestamp in
+struct __kernel_sock_timeval format.
-1.2 SO_TIMESTAMPNS:
+SO_TIMESTAMP_OLD returns incorrect timestamps after the year 2038
+on 32 bit machines.
+
+1.2 SO_TIMESTAMPNS (also SO_TIMESTAMPNS_OLD and SO_TIMESTAMPNS_NEW):
This option is identical to SO_TIMESTAMP except for the returned data type.
Its struct timespec allows for higher resolution (ns) timestamps than the
timeval of SO_TIMESTAMP (ms).
+Always use SO_TIMESTAMPNS_NEW timestamp to always get timestamp in
+struct __kernel_timespec format.
+
+SO_TIMESTAMPNS_OLD returns incorrect timestamps after the year 2038
+on 32 bit machines.
-1.3 SO_TIMESTAMPING:
+1.3 SO_TIMESTAMPING (also SO_TIMESTAMPING_OLD and SO_TIMESTAMPING_NEW):
Supports multiple types of timestamp requests. As a result, this
socket option takes a bitmap of flags, not a boolean. In
@@ -323,10 +343,23 @@ SO_TIMESTAMP and SO_TIMESTAMPNS records can be retrieved.
These timestamps are returned in a control message with cmsg_level
SOL_SOCKET, cmsg_type SCM_TIMESTAMPING, and payload of type
+For SO_TIMESTAMPING_OLD:
+
struct scm_timestamping {
struct timespec ts[3];
};
+For SO_TIMESTAMPING_NEW:
+
+struct scm_timestamping64 {
+ struct __kernel_timespec ts[3];
+
+Always use SO_TIMESTAMPING_NEW timestamp to always get timestamp in
+struct scm_timestamping64 format.
+
+SO_TIMESTAMPING_OLD returns incorrect timestamps after the year 2038
+on 32 bit machines.
+
The structure can return up to three timestamps. This is a legacy
feature. At least one field is non-zero at any time. Most timestamps
are passed in ts[0]. Hardware timestamps are passed in ts[2].